LuaJIT Tests Faster than Julia for Stock Prediction Engine
I started testing Julia as a possible alternative because Julia advocates claimed its execution speed was nearly as fast as C, and the language is similar in concept to Python, which I love but which was too slow for our application. I recently ran across a blog entry mentioning a new LuaJIT release and found it intriguing, because Lua did quite well during our last round of tests.
Performance comparison: Julia versus LuaJIT

Relative execution time, LuaJIT as baseline (lower is better):

| Operation | Lua 5.2 | LuaJIT | Julia | Julia Typed Array |
|---|---|---|---|---|
| Parse file into data frame | 2.42 | 1.0 | 5.64 | 5.64 |
| Compute SMA(14) | 2.81 | 1.0 | 6.87 | 0.70 |
| Compute SMA(600) | 33.32 | 1.0 | 80.00 | 1.30 |
| Compute SMA_slice(14) | 2.42 | 1.0 | 11.87 | 1.83 |
| Compute SMA_slice(600) | 33.32 | 1.0 | 15.52 | 5.90 |

The slice variant was not implemented in Lua, so the nested-loop timings were re-used for those rows. The underlying run times, in seconds, appear in a table further down.
Only one tested Julia operation was faster than LuaJIT
The only case where Julia outperformed LuaJIT was SMA(14) with a typed array; every other operation tested was slower. I think the reason it did better in that instance is that the SMA function must allocate a new array of roughly 71K rows to store the results. In Julia you can allocate this as a typed array of floats in one contiguous block. In Lua the result is built by appending to a table, so memory is allocated in little pieces. In SMA(600) LuaJIT was faster again, because the tight compute loop does far more work relative to the memory-allocation overhead.
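The two allocation patterns described above can be sketched in Python (this is illustrative only, not code from the benchmark): one version preallocates the whole output buffer up front, as Julia does with a typed array, while the other grows the output one element at a time, as the Lua table-append version does.

```python
from array import array

def sma_prealloc(values, num_per):
    """One up-front contiguous allocation, analogous to a Julia typed array."""
    out = array("d", [0.0]) * len(values)  # single allocation of the full buffer
    for ndx in range(len(values)):
        beg = max(0, ndx - num_per)
        out[ndx] = sum(values[beg:ndx + 1]) / (ndx - beg + 1)
    return out

def sma_append(values, num_per):
    """Grow the output piecewise, analogous to the Lua table append."""
    out = []
    for ndx in range(len(values)):
        beg = max(0, ndx - num_per)
        out.append(sum(values[beg:ndx + 1]) / (ndx - beg + 1))
    return out

vals = [1.0, 2.0, 3.0, 4.0]
```

Both produce identical results; only the allocation behavior differs, which is the effect the table above is measuring.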
Is this a mistake? It is not what I expected.
The results of these tests are not what I expected. I thought that, since Julia's type system and its array allocation let it allocate typed arrays as contiguous blocks, allocation should be faster and the statistical loops should have been able to run very fast. Also, Julia has real numeric indexes, while Lua's array indexes smell like hash lookups; I expected Lua to be slower, since a hash lookup should be more complex than simply iterating and advancing a pointer, and I didn't see how LuaJIT could overcome that extra complexity.
Since the answers are not what I expected, I am posting this in the hope that somebody else will spot something I missed that would materially affect the results.
At some point you have to stop thinking "this seems fast" and actually measure. I am a big believer in testing a language in a way that models the kinds of problems we will actually solve, using production-scale data. I borrowed some code from a prior article, "Why Language Performance Matters", for the stock prediction domain. Two things we do thousands of times per day are parsing CSV files and manipulating the columns: we compute a wide variety of sliding-window indicators, then either use those new columns to drive our stock prediction engine or write the results back to disk for a future stage. This code attempts to mimic that process in miniature.
For this test I used a 1-minute bar file with 71,879 rows. It can be obtained for free using the historical download function built into MetaTrader. The CSV file is of the following form, except that I had to manually add the CSV header string on line 1.
```
date,time,open,close,high,low,volume
2014.09.23,03:51,1.10430,1.10438,1.10426,1.10435,55
2014.09.23,03:52,1.10434,1.10438,1.10431,1.10438,28
2014.09.23,03:53,1.10439,1.10450,1.10436,1.10450,31
2014.09.23,03:54,1.10450,1.10450,1.10428,1.10439,34
2014.09.23,03:55,1.10437,1.10450,1.10427,1.10432,38
2014.09.23,03:56,1.10433,1.10451,1.10433,1.10450,28
2014.09.23,03:57,1.10449,1.10450,1.10441,1.10441,16
2014.09.23,03:58,1.10441,1.10445,1.10439,1.10440,28
2014.09.23,03:59,1.10439,1.10441,1.10438,1.10441,10
2014.09.23,04:00,1.10440,1.10451,1.10440,1.10451,24
2014.09.23,04:01,1.10451,1.10451,1.10451,1.10451,1
```
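What both parsers below produce from a file like this can be sketched in a few lines of Python (an illustration, not the benchmark code): each column becomes a vector addressable by name, with numeric fields parsed to numbers and everything else kept as strings.

```python
import csv
import io

# First few rows of the sample file above, inlined for the sketch.
SAMPLE = """date,time,open,close,high,low,volume
2014.09.23,03:51,1.10430,1.10438,1.10426,1.10435,55
2014.09.23,03:52,1.10434,1.10438,1.10431,1.10438,28
"""

def read_columns(text):
    """Parse CSV text into a dict of column-name -> vector of values."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    cols = {name: [] for name in header}
    for row in body:
        for name, field in zip(header, row):
            try:
                cols[name].append(float(field))  # numeric where possible
            except ValueError:
                cols[name].append(field)         # otherwise keep the string
    return cols

cols = read_columns(SAMPLE)
```

The column vectors (e.g. `cols["close"]`) can then be passed straight into statistical functions such as the SMA routines below.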
For the Julia experts out there: I expected some benefit from forcing an explicit conversion out of the DataArray we get from the data table, but the improvement was larger than expected. I think we should either fix the performance problems in DataArray or provide convenience methods to convert to typed arrays. I hate converting to typed arrays because I want to be able to easily append new values to the columns, which becomes more cumbersome once you convert out of the DataArray format. Also note how much slower the array-slice version runs than the version with the work explicitly unrolled into the inner loop: about 4.5 times slower on typed arrays. The functional-style slices reduced Julia's performance enough to make it slower than LuaJIT.
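The two styles being compared can be sketched in Python (illustrative only; the measured code is the Lua and Julia listed below): the slice version materializes a temporary sub-array at every step, while the unrolled version keeps a scalar accumulator and allocates nothing inside the loop.

```python
def sma_slice(avect, num_per):
    """Slice-style SMA: copies a window sub-array at each step."""
    out = []
    for ndx in range(len(avect)):
        beg = max(0, ndx - num_per)
        window = avect[beg:ndx + 1]          # temporary copy every iteration
        out.append(sum(window) / len(window))
    return out

def sma_unrolled(avect, num_per):
    """Unrolled SMA: scalar accumulator, no per-step allocation."""
    out = []
    for ndx in range(len(avect)):
        beg = max(0, ndx - num_per)
        tsum = 0.0
        for i in range(beg, ndx + 1):
            tsum += avect[i]
        out.append(tsum / (ndx - beg + 1))
    return out
```

Both compute the same values; the per-step temporary in the slice version is the kind of overhead that made the Julia `sma_slice` measurably slower than the unrolled loop.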
Run times measured in seconds (lower is better):

| Operation | Lua 5.2 | LuaJIT | Julia | Julia Typed Array |
|---|---|---|---|---|
| Parse file into data frame | 0.756 | 0.312 | 1.786 | 1.786 |
| Compute SMA(14) | 0.045 | 0.016 | 0.11 | 0.0113 |
| Compute SMA(600) | 1.033 | 0.031 | 2.48 | 0.0403 |
| Compute SMA_slice(14) | 0.045 | 0.016 | 0.19 | 0.0292 |
| Compute SMA_slice(600) | 1.033 | 0.031 | 0.45 | 0.1829 |
| Version used for the test | Lua 5.2.1 | LuaJIT 2.0.3 | Julia 0.3.2+2 | Julia 0.3.2+2 |
Relative execution time, Lua 5.2 as baseline (lower is better):

| Operation | Lua 5.2 | LuaJIT | Julia | Julia Typed Array |
|---|---|---|---|---|
| Parse file into data frame | 100% | 41% | 233% | 233% |
| Compute SMA(14) | 100% | 36% | 244% | 25% |
| Compute SMA(600) | 100% | 3% | 240% | 4% |
| Compute SMA_slice(14) | 100% | 36% | 422% | 65% |
| Compute SMA_slice(600) | 100% | 3% | 43% | 18% |

The slice variant was not implemented in Lua, so the nested-loop timings were re-used for those rows.
- All timings are the average of the best 8 out of 10 runs.
- Timings do not include startup time, only execution of the main algorithm steps. Julia starts many times slower than Lua or LuaJIT, which alone is almost enough to make me want to switch, just to save debugging time.
- Tests ran on Windows on a Lenovo T420s: 2.33 GHz Core i7, 8 GB RAM, Samsung SSD.
Lua Code
```lua
-- Provides a CSV parse similar to the generic table parse
-- in R and Julia, where each column winds up as a vector
-- so we can easily apply statistical functions to the columns.
require "os"

function split(aStr, sub)
  local out = {}
  local i = 0
  local pat = "[^" .. sub .. "]+"
  for val in string.gmatch(aStr, pat) do
    out[i] = val
    i = i + 1
  end
  return out
end

function trim(s)
  return (string.gsub(s, "^%s*(.-)%s*$", "%1"))
end

function curr_sec()
  return os.clock()
end

function print_arr(arr)
  for k, v in pairs(arr) do
    print(" " .. k .. " = " .. v)
  end
end

-- NOTE: To be a valid comparison to the DataFrame parser, this must
-- create an array of columns that can be accessed by name and
-- parse values into a numeric type when they are numeric.
-- Columns must be accessible as single vectors to pass into
-- stats functions.
--
-- Reads a CSV file as a set of columns. If a column can be treated
-- as numeric, all its values are parsed as numeric. Columns can be
-- indexed as out["names"]["close"], where close is the column name,
-- or accessed as out["columns"][3] by position in the file.
function read_csv(fiName)
  local file = assert(io.open(fiName, "r"))
  local tout = {}
  local headline = file:read("*l")
  local headarr = split(headline, ",")
  -- Build the structure to contain our output columns.
  local cols = {}
  local names = {}
  tout["names"] = names
  tout["columns"] = cols
  for fldNum, fldName in pairs(headarr) do
    fldName = trim(fldName)
    local avector = {}
    -- Since tables are by reference, cols and names both
    -- point at the same column array.
    cols[fldNum] = avector
    names[fldName] = avector
  end
  local i = 0
  for line in file:lines() do
    local arr = split(line, ",")
    for fldNum, fldName in pairs(headarr) do
      local acol = cols[fldNum]
      local fldval = arr[fldNum]
      local nval = tonumber(fldval)
      if nval == nil then
        acol[i] = fldval
      else
        acol[i] = nval
      end
    end -- for each field
    i = i + 1
  end -- for each line
  file:close()
  --print("num rec=" .. i)
  return tout
end -- func()

-- Note: There are faster ways to compute an SMA, but this
-- is closest to the method shown in basic tutorials
-- and is a valid test of a tight loop that spends
-- a lot of time indexing into an array.
function sma(arr, numPer)
  local tout = {}
  for ndx, tval in pairs(arr) do
    local begndx = math.max(0, ndx - numPer)
    local tsum = 0.0
    for slicendx = begndx, ndx, 1 do
      tsum = tsum + arr[slicendx]
    end
    -- Divide by the actual number of bars summed so the partial
    -- windows at the front of the file are handled correctly.
    tout[ndx] = tsum / (ndx - begndx + 1)
  end
  return tout
end

beg_time = curr_sec()
freemem = collectgarbage('count')
print("megabytes in use=" .. freemem / 1024)
dtbl = read_csv("2014.M1.csv")
end_time = curr_sec()
elap = end_time - beg_time
print("elap sec time=" .. elap)
freemem = collectgarbage('count')
print("megabytes in use=" .. freemem / 1024)

print("Column Names")
names = dtbl["names"]
for k, v in pairs(names) do
  print(" " .. k)
end

beg_time = curr_sec()
closeVect = names["close"]
sma14 = sma(closeVect, 14)
end_time = curr_sec()
elap = end_time - beg_time
freemem = collectgarbage('count')
print("sma(14) elap=" .. elap .. " num sma=" .. #sma14 ..
      " megabytes in use=" .. freemem / 1024)

beg_time = curr_sec()
sma600 = sma(closeVect, 600)
end_time = curr_sec()
elap = end_time - beg_time
freemem = collectgarbage('count')
print("sma(600) elap=" .. elap .. " num sma=" .. #sma600 ..
      " megabytes in use=" .. freemem / 1024)
```
Julia Code
```julia
# Julia functions to load a CSV and compute a couple of SMAs.
# Written to be identical to the Lua version.
# (Julia 0.3 APIs: Array(T, n), float32, tic/toc, readtable.)
using DataFrames

# Note: There are faster ways to compute an SMA, but this
# is closest to the method shown in basic tutorials
# and is a valid test of a tight loop that spends
# a lot of time indexing into an array. We show it
# as a nested loop instead of a slice.
function sma(avect, numPer)
    print(length(avect), " numper=", numPer)
    numEle = length(avect)
    tout = Array(Float32, numEle)
    numdiv = numPer + 1
    for ndx = 1:numEle
        tsum = 0.0
        begndx = max(1, ndx - numPer)
        for slicendx = begndx:ndx
            tsum += avect[slicendx]
        end
        if ndx < numdiv
            tout[ndx] = tsum / float32(ndx)
        else
            tout[ndx] = tsum / float32(numdiv)
        end
    end
    return tout
end

## The more common Julia style: slice out what we need and apply
## the built-in mean instead of a manual sub loop.
function sma_slice(avect, numPer)
    #print(length(avect), " numper=", numPer)
    numEle = length(avect)
    tout = Array(Float32, numEle)
    for ndx = 1:numEle
        begndx = max(1, ndx - numPer)
        tout[ndx] = mean(avect[begndx:ndx])
    end
    return tout
end

function runTest()
    print("read table")
    tic()
    dta = readtable("2014.M1.csv")
    toc()
    println("finished read table")
    closeVect = dta[:close]

    println("compute sma(14)")
    tic()
    sma14 = sma(closeVect, 14)
    toc()
    println("finished sma(14)")

    println("compute sma(600)")
    tic()
    sma600 = sma(closeVect, 600)
    toc()
    println("finished sma(600)")

    println("compute sma_slice(14)")
    tic()
    sma14 = sma_slice(closeVect, 14)
    toc()
    println("finished sma_slice(14)")

    println("compute sma_slice(600)")
    tic()
    sma600 = sma_slice(closeVect, 600)
    toc()
    println("finished sma_slice(600)")

    println("\n\n Convert to Typed Array\n\n")
    tic()
    tlen = length(closeVect)
    tvect = Array(Float32, tlen)
    for ndx = 1:tlen
        tvect[ndx] = closeVect[ndx]
    end
    toc()
    println("vector conversion complete")

    println("compute sma(14)")
    tic()
    sma14 = sma(tvect, 14)
    toc()
    println("finished sma(14)")

    println("compute sma(600)")
    tic()
    sma600 = sma(tvect, 600)
    toc()
    println("finished sma(600)")

    println("compute sma_slice(14)")
    tic()
    sma14 = sma_slice(tvect, 14)
    toc()
    println("finished sma_slice(14)")

    println("compute sma_slice(600)")
    tic()
    sma600 = sma_slice(tvect, 600)
    toc()
    println("finished sma_slice(600)")
end

runTest()
```
Lua with FFI Array of Doubles
Coming soon. We actually tested part of this with a sum over an array of doubles. The FFI version was faster, but not by as much as I had expected. This was on a LuaJIT user group thread; now I just need to track down that email.
Great comparison.
You should be able to shave some milliseconds off the CSV parsing by using LPeg [1] with something like this:
```lua
local field = lpeg.C((1 - lpeg.S',\n"')^0)
local record = lpeg.Ct(field * (',' * field)^0) * (lpeg.P'\n' + -1)
```
A quick test on my computer takes the LuaJIT CSV parsing from 0.26 s to 0.21 s.
[1] http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
Great catch. I will have to give this a try when I get a chance.
Since you asked: all the timings were in seconds. I adjusted the workload so each run was long enough to get consistent readings from the timer, then ran the tests several times and averaged the results. I feel this is superior to worrying about more precise timing, because there is enough variability in preemptive multitasking systems that very fine-grained timers don't deliver predictably accurate results anyway. Looking at the results, I think it would be a better investment to use a larger source file to increase the workload even more. I will switch over to the finer-resolution timers some of you have recommended, but I do not expect it to change the fundamental conclusion, and it is unlikely to change the results enough to matter.
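The timing protocol described above ("average of best 8 out of 10") can be sketched as a small harness; this is an illustration in Python, not the scripts used for the benchmark, and the names `bench` and `avg_best` are my own.

```python
import time

def avg_best(times, keep=8):
    """Average the `keep` fastest recorded times, dropping the slow outliers."""
    best = sorted(times)[:keep]
    return sum(best) / len(best)

def bench(fn, runs=10, keep=8):
    """Time fn() `runs` times and return the average of the best `keep` runs."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return avg_best(times, keep)
```

Dropping the slowest runs before averaging is what makes the protocol robust against the scheduler noise mentioned above, without needing a high-resolution timer.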
Modified the Lua code to properly handle the divisor when working on fewer than numPer bars at the front of the file.