Data
Test data were of the form
Data files of one million lines (totalling about 50MB for CSV and 60MB for DKVP) were used. In experiments not shown here, I also varied the file sizes; the size-dependent results were the expected, completely unsurprising linearities and so I produced no file-size-dependent plots for your viewing pleasure.
Comparands
The cat, cut, awk, sed, and sort tools were compared to mlr on an 8-core Darwin laptop; RAM capacity was nowhere near challenged. The catc program is a simple line-oriented line-printer (source here) which is intermediate between Miller (which is record-aware as well as line-aware) and cat (which is only byte-aware).
Raw results
Note that for CSV data, the command is mlr --csvlite ... rather than mlr ....
Mac timings in seconds
  DKVP     CSV  Comparand
 0.016   0.013  cat
 0.189   0.189  catc
 3.657   4.388  awk -F, '{print}'
 2.027   1.795  mlr cat
 2.292   1.940  cut -d , -f 1,4
 3.540   4.516  awk -F, '{print $1,$4}'
 1.600   1.390  mlr cut -f a,x
 1.694   1.648  mlr cut -x -f a,x
 0.845   0.643  sed -e 's/x=/EKS=/' -e 's/b=/BEE=/'
 2.076   1.842  mlr rename x,EKS,b,BEE
 5.643   5.031  awk -F, '{gsub("x=","",$4);gsub("y=","",$5);print $4+$5}'
 4.019   3.679  mlr put '$z=$x+$y'
 2.481   2.628  mlr stats1 -a mean -f x,y -g a,b
 2.587   2.389  mlr stats2 -a corr -f x,y -g a,b
23.247  14.466  sort -t, -k 1,2
 3.023   5.658  mlr sort -f a,b
17.224  15.523  sort -t, -k 4,5
 5.807   5.194  mlr sort -n x,y
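To make the comparands concrete: the toolkit commands in the table address fields by position and have to strip the key= prefix themselves, while the mlr commands address fields by name. The following is a minimal sketch on a single DKVP line. The sample values and the third field i are assumptions for illustration only, chosen so that a, b, x, y sit at positions 1, 2, 4, 5 as the commands above imply.

```shell
# Hypothetical sample record; not taken from the benchmark files.
line='a=pan,b=wye,i=1,x=0.5,y=0.25'

# Positional projection, as in the "cut -d , -f 1,4" row.
cut_out=$(printf '%s\n' "$line" | cut -d , -f 1,4)
echo "$cut_out"    # a=pan,x=0.5

# Positional sum, as in the awk row: strip the keys from
# fields 4 and 5, then add the remaining numbers.
sum_out=$(printf '%s\n' "$line" |
  awk -F, '{gsub("x=","",$4);gsub("y=","",$5);print $4+$5}')
echo "$sum_out"    # 0.75

# By-name Miller counterparts from the table (not run here):
#   mlr cut -f a,x
#   mlr put '$z=$x+$y'
```

The positional commands depend on every line having the same field order, which is the assumption the by-name forms do not need.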
Analysis
Conclusion
For record-oriented data transformations, Miller meets or beats the Unix toolkit in many contexts. Field renames are the exception: in the timings above, sed took 0.845 seconds on DKVP against 2.076 for mlr rename, so renames are worth doing as a pre-pipe or post-pipe using sed.
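A sed rename step of this kind might look like the sketch below. The sample line is hypothetical, and the comma-anchored patterns are an added precaution rather than part of the benchmarked command: the unanchored s/x=/EKS=/ rewrites the first "x=" on the line, so it would also hit a key that merely ends in x, such as max=.

```shell
# Hypothetical sample record with a key ("max") that ends in "x".
line='a=pan,b=wye,max=9,x=0.5,y=0.25'

# The benchmarked form: it rewrites the first "x=" and "b=" it finds.
naive=$(printf '%s\n' "$line" | sed -e 's/x=/EKS=/' -e 's/b=/BEE=/')
echo "$naive"      # a=pan,BEE=wye,maEKS=9,x=0.5,y=0.25

# Anchoring each key at line start or after a comma avoids the false match.
anchored=$(printf '%s\n' "$line" |
  sed -E -e 's/(^|,)x=/\1EKS=/' -e 's/(^|,)b=/\1BEE=/')
echo "$anchored"   # a=pan,BEE=wye,max=9,EKS=0.5,y=0.25

# As a pre-pipe (input.dkvp is a hypothetical file name; not run here):
#   sed -E -e 's/(^|,)x=/\1EKS=/' -e 's/(^|,)b=/\1BEE=/' input.dkvp \
#     | mlr stats1 -a mean -f EKS,y -g a,BEE
```

Whether the anchored patterns keep the speed advantage measured for the plain form was not tested here.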