Miller: Command Line CSV File Processing
Often times the lowest common denominator for integrating two systems is a CSV file. Usually I would resort to scripts for processing these CSV files. However, for certain simple operations scripting could be overkill. A battle tested command line would be much more appropriate than writing a possibly bug ridden one off script.
Miller is a command line tool for CSV file processing. It can replace Python scripts for common data transformation tasks such as:
- Change file delimiters
- Converting between CSV variants
- Adding/Removing column
- Calculating Statistics
- Sorting
- Remove records based on some criteria
Verbs
Miller’s functionality is exposed as a set of “verbs” e.g:
cat
- similar to the UNIXcat
command in that it simply outputs whatever was inputed.put
- for updating and adding fields to recordsfilter
- for filtering recordscut
- to remove fieldsstats1
- to calculating various statistics for the data e.g. count, sum, mean.sort
- for sorting the records
Each verb has its own options.
DSL
The put
and filter
verbs are unique in that they support a DSL that gives you the power of a scripting language to transform/filter your data.
Variables in the DSL
-
Variables in the current record’s context have names like
foo
orbar
-
Variables that refer to field values start with a
$
e.g .$Outflow
,$Account
-
Variables whose context is global to the entire list of records start with a
@
e.g@total
Chaining Verbs
Verbs can be chained together by placing the word then
between the verbs
Examples
Miller is a feature rich tool and the documentation is a little daunting at first. The best way to learn it is through examples.
Sample CSV file
(Note: There might be some line wrapping issues on Firefox, still debugging the CSS).
We’ll be playing with a CSV file that looks like this throughout the examples (this is just a snippet of that file):
Account, Flag, Date, Payee, Category Group/Category,Category Group, Category, Memo, Outflow, Inflow
Cash,,11/26/20,Food Court @ Apartment,Everyday Expenses: Restaurants,Everyday Expenses,Restaurants,,$14.00,$0.00
Cash,,11/26/20,Food Court @ Apartment,Everyday Expenses: Restaurants,Everyday Expenses,Restaurants,,$14.15,$0.00
Cash,,11/24/20,Food Court @ Neighborhood,Everyday Expenses: Restaurants,Everyday Expenses,Restaurants,,$6.25,$0.00
e-Wallet B,,11/24/20,Mobile Phone Plan,Monthly Bills: Phone,Monthly Bills,Phone,,$118.00,$0.00
Debit Card A,,11/22/20,Supermarket A,Everyday Expenses: Groceries,Everyday Expenses,Groceries,,$9.80,$0.00
Cash,,11/21/20,Carpark,Everyday Expenses: Parking and Toll,Everyday Expenses,Parking and Toll,,$7.30,$0.00
Cash,,11/21/20,Supermarket A,Everyday Expenses: Groceries,Everyday Expenses,Groceries,,$37.10,$0.00
Debit Card A,,11/21/20,Supermarket B,Everyday Expenses: Groceries,Everyday Expenses,Groceries,,$110.05,$0.00
Cash,,11/20/20,Pharmacy A,Everyday Expenses: Family,Everyday Expenses,Family,,$14.80,$0.00
Input Output Formats: Pretty Printing
mlr --icsv --opprint cat data.csv
Output:
Account Flag Date Payee Category Group/Category Category Group Category Memo Outflow Inflow
Cash - 11/26/20 Food Court @ Apartment Everyday Expenses: Restaurants Everyday Expenses Restaurants - $14.00 $0.00
Cash - 11/26/20 Food Court @ Apartment Everyday Expenses: Restaurants Everyday Expenses Restaurants - $14.15 $0.00
Cash - 11/24/20 Food Court @ Neighborhood Everyday Expenses: Restaurants Everyday Expenses Restaurants - $6.25 $0.00
e-Wallet B - 11/24/20 Mobile Phone Plan Monthly Bills: Phone Monthly Bills Phone - $118.00 $0.00
Debit Card A - 11/22/20 Supermarket A Everyday Expenses: Groceries Everyday Expenses Groceries - $9.80 $0.00
Cash - 11/21/20 Carpark Everyday Expenses: Parking and Toll Everyday Expenses Parking and Toll - $7.30 $0.00
Cash - 11/21/20 Supermarket A Everyday Expenses: Groceries Everyday Expenses Groceries - $37.10 $0.00
Debit Card A - 11/21/20 Supermarket B Everyday Expenses: Groceries Everyday Expenses Groceries - $110.05 $0.00
Cash - 11/20/20 Pharmacy A Everyday Expenses: Family Everyday Expenses Family - $14.80 $0.00
Credit Card A - 11/20/20 Wordpress Hosting Provider Everyday Expenses: Software Everyday Expenses Software - $14.80 $0.00
Credit Card A - 11/20/20 E-hailing company Everyday Expenses: e-Hailing and Taxi Everyday Expenses e-Hailing and Taxi - $14.70 $0.00
Credit Card A - 11/20/20 E-hailing company Everyday Expenses: e-Hailing and Taxi Everyday Expenses e-Hailing and Taxi - $22.50 $0.00
Cash - 11/20/20 Chinese Restaurant A Everyday Expenses: Family Everyday Expenses Family - $29.30 $0.00
Input Output Formats: Converting CSV to JSON
mlr --icsv --ojson --jlistwrap cat data.csv
Output:
[
{ "Account": "Cash", "Flag": "", "Date": "11/26/20", "Payee": "Food Court @ Apartment", "Category Group/Category": "Everyday Expenses: Restaurants", "Category Group": "Everyday Expenses", "Category": "Restaurants", "Memo": "", "Outflow": "$14.00", "Inflow": "$0.00" }
,{ "Account": "Cash", "Flag": "", "Date": "11/26/20", "Payee": "Food Court @ Apartment", "Category Group/Category": "Everyday Expenses: Restaurants", "Category Group": "Everyday Expenses", "Category": "Restaurants", "Memo": "", "Outflow": "$14.15", "Inflow": "$0.00" }
,{ "Account": "Cash", "Flag": "", "Date": "11/24/20", "Payee": "Food Court @ Neighborhood", "Category Group/Category": "Everyday Expenses: Restaurants", "Category Group": "Everyday Expenses", "Category": "Restaurants", "Memo": "", "Outflow": "$6.25", "Inflow": "$0.00" }
,{ "Account": "e-Wallet B", "Flag": "", "Date": "11/24/20", "Payee": "Mobile Phone Plan", "Category Group/Category": "Monthly Bills: Phone", "Category Group": "Monthly Bills", "Category": "Phone", "Memo": "", "Outflow": "$118.00", "Inflow": "$0.00" }
,{ "Account": "Debit Card A", "Flag": "", "Date": "11/22/20", "Payee": "Supermarket A", "Category Group/Category": "Everyday Expenses: Groceries", "Category Group": "Everyday Expenses", "Category": "Groceries", "Memo": "", "Outflow": "$9.80", "Inflow": "$0.00" }
,{ "Account": "Cash", "Flag": "", "Date": "11/21/20", "Payee": "Carpark", "Category Group/Category": "Everyday Expenses: Parking and Toll", "Category Group": "Everyday Expenses", "Category": "Parking and Toll", "Memo": "", "Outflow": "$7.30", "Inflow": "$0.00" }
,{ "Account": "Cash", "Flag": "", "Date": "11/21/20", "Payee": "Supermarket A", "Category Group/Category": "Everyday Expenses: Groceries", "Category Group": "Everyday Expenses", "Category": "Groceries", "Memo": "", "Outflow": "$37.10", "Inflow": "$0.00" }
,{ "Account": "Debit Card A", "Flag": "", "Date": "11/21/20", "Payee": "Supermarket B", "Category Group/Category": "Everyday Expenses: Groceries", "Category Group": "Everyday Expenses", "Category": "Groceries", "Memo": "", "Outflow": "$110.05", "Inflow": "$0.00" }
,{ "Account": "Cash", "Flag": "", "Date": "11/20/20", "Payee": "Pharmacy A", "Category Group/Category": "Everyday Expenses: Family", "Category Group": "Everyday Expenses", "Category": "Family", "Memo": "", "Outflow": "$14.80", "Inflow": "$0.00" }
Input Output Formats: Changing the Field Separator
mlr --ofs '|' --csv cat data.csv
Output:
Account|Flag|Date|Payee|Category Group/Category|Category Group|Category|Memo|Outflow|Inflow
Cash||11/26/20|Food Court @ Apartment|Everyday Expenses: Restaurants|Everyday Expenses|Restaurants||$14.00|$0.00
Cash||11/26/20|Food Court @ Apartment|Everyday Expenses: Restaurants|Everyday Expenses|Restaurants||$14.15|$0.00
Cash||11/24/20|Food Court @ Neighborhood|Everyday Expenses: Restaurants|Everyday Expenses|Restaurants||$6.25|$0.00
e-Wallet B||11/24/20|Mobile Phone Plan|Monthly Bills: Phone|Monthly Bills|Phone||$118.00|$0.00
Debit Card A||11/22/20|Supermarket A|Everyday Expenses: Groceries|Everyday Expenses|Groceries||$9.80|$0.00
Cash||11/21/20|Carpark|Everyday Expenses: Parking and Toll|Everyday Expenses|Parking and Toll||$7.30|$0.00
Cash||11/21/20|Supermarket A|Everyday Expenses: Groceries|Everyday Expenses|Groceries||$37.10|$0.00
Debit Card A||11/21/20|Supermarket B|Everyday Expenses: Groceries|Everyday Expenses|Groceries||$110.05|$0.00
Cash||11/20/20|Pharmacy A|Everyday Expenses: Family|Everyday Expenses|Family||$14.80|$0.00
Removing Columns
mlr --csv cut -x -f "Inflow,Flag,Memo,Category Group/Category,Category Group" data.csv
Output:
Account,Date,Payee,Category,Outflow
Cash,11/26/20,Food Court @ Apartment,Restaurants,$14.00
Cash,11/26/20,Food Court @ Apartment,Restaurants,$14.15
Cash,11/24/20,Food Court @ Neighborhood,Restaurants,$6.25
e-Wallet B,11/24/20,Mobile Phone Plan,Phone,$118.00
Debit Card A,11/22/20,Supermarket A,Groceries,$9.80
Cash,11/21/20,Carpark,Parking and Toll,$7.30
Cash,11/21/20,Supermarket A,Groceries,$37.10
Debit Card A,11/21/20,Supermarket B,Groceries,$110.05
Cash,11/20/20,Pharmacy A,Family,$14.80
Modifying Files in Place
By default, the verbs will send their output to standard output. We can edit the file in place using the -I
option.
mlr -I --csv cut -x -f "Inflow,Flag,Memo,Category Group/Category,Category Group" data.csv
Removing ‘$’ from the Outflow columns
In order to run numerical operations on the data we need to remove the ‘$’ from the Outflow column. We can do this using the ssub
function provided in the DSL. Lets do this in-place as well:
mlr -I --csv put '$Outflow=ssub($Outflow, "$", "");' data.csv
Output:
Account,Date,Payee,Category,Outflow
Cash,11/26/20,Food Court @ Apartment,Restaurants,14.00
Cash,11/26/20,Food Court @ Apartment,Restaurants,14.15
Cash,11/24/20,Food Court @ Neighborhood,Restaurants,6.25
e-Wallet B,11/24/20,Mobile Phone Plan,Phone,118.00
Debit Card A,11/22/20,Supermarket A,Groceries,9.80
Cash,11/21/20,Carpark,Parking and Toll,7.30
Cash,11/21/20,Supermarket A,Groceries,37.10
Debit Card A,11/21/20,Supermarket B,Groceries,110.05
Cash,11/20/20,Pharmacy A,Family,14.80
Filtering Records
mlr --csv filter '$Category=="Electricity"' data.csv
Output:
Account,Date,Payee,Category,Outflow
Bank Account A,11/13/20,Electricity Company,Electricity,241.50
Bank Account A,10/7/20,Electricity Company,Electricity,406.05
Bank Account A,9/2/20,Electricity Company,Electricity,185.30
Bank Account A,8/1/20,Electricity Company,Electricity,234.70
Bank Account A,7/1/20,Electricity Company,Electricity,355.95
Bank Account A,6/3/20,Electricity Company,Electricity,215.30
Bank Account A,5/1/20,Electricity Company,Electricity,215.60
Bank Account A,4/1/20,Electricity Company,Electricity,206.30
Bank Account A,3/1/20,Electricity Company,Electricity,175.05
Chaining Verbs
The previous output was a bit noisy as it contains field that we’re not interested in, lets chain filter
and cut
verbs so that we only see the fields that we want to see.
mlr --csv filter '$Category=="Electricity"' then cut -f "Category,Outflow" data.csv
Output:
Category,Outflow
Electricity,241.50
Electricity,406.05
Electricity,185.30
Electricity,234.70
Electricity,355.95
Electricity,215.30
Electricity,215.60
Electricity,206.30
Electricity,175.05
Electricity,201.50
Electricity,224.00
Electricity,219.95
Electricity,262.15
Electricity,193.85
Adding a “Month” Field
What I would really like to do is analyze the data by month. To make this easier I’m going to introduce a “Month” field that is based on the “Date” field. The verb used for this is put
.
mlr -I --csv put 't = splitnv($Date, "/"); $Month=t[1] . "/20" . t[3];' data.csv
Output:
Account,Date,Payee,Category,Outflow,Month
Cash,11/26/20,Food Court @ Apartment,Restaurants,14.00,11/2020
Cash,11/26/20,Food Court @ Apartment,Restaurants,14.15,11/2020
Cash,11/24/20,Food Court @ Neighborhood,Restaurants,6.25,11/2020
e-Wallet B,11/24/20,Mobile Phone Plan,Phone,118.00,11/2020
Debit Card A,11/22/20,Supermarket A,Groceries,9.80,11/2020
Cash,11/21/20,Carpark,Parking and Toll,7.30,11/2020
Cash,11/21/20,Supermarket A,Groceries,37.10,11/2020
Debit Card A,11/21/20,Supermarket B,Groceries,110.05,11/2020
Cash,11/20/20,Pharmacy A,Family,14.80,11/2020
Statistics: Which payee did I transact with the most frequently?
mlr --icsv --opprint stats1 -a count -f "Payee" -g "Payee" data.csv
Output:
Payee Payee_count
Food Court @ Apartment 31
Food Court @ Neighborhood 1
Mobile Phone Plan 14
Supermarket A 127
Carpark 5
Supermarket B 82
Pharmacy A 12
Wordpress Hosting Provider 5
E-hailing company 12
This isn’t very useful because the second field isn’t sorted. To sort:
mlr --icsv --opprint stats1 -a count -f "Payee" -g "Payee" then sort -nr "Payee_count" data.csv
Output:
Payee Payee_count
Supermarket A 127
Supermarket B 82
e-Hailing 76
Food Court @ Apartment 31
Phamarcy B 18
Mobile Phone Plan 14
Electricity Company 14
Malaysian Restaurant H 14
Pharmacy A 12
E-hailing company 12
Water Filter Rental 12
Toll Prepaid Card 12
Cafe G 10
Fuel Company A 9
If we are only interested in the top 5 we can chain the head
verb:
mlr --icsv --opprint stats1 -a count -f "Payee" -g "Payee" then sort -nr "Payee_count" then head -n 5 data.csv
Output:
Payee Payee_count
Supermarket A 127
Supermarket B 82
e-Hailing 76
Food Court @ Apartment 31
Phamarcy B 18
Statistics: Which grocery stores did I spend the most money at?
mlr --icsv --opprint filter '$Category == "Groceries"' then stats1 -a sum -f Outflow -g Payee data.csv
Output:
Payee Outflow_sum
Supermarket A 4998.700000
Supermarket B 3969.350000
Phamarcy B 246.350000
Water Filter Rental 916.890000
Pharmacy A 76.850000
Pharmacy E 23.300000
Pharmacy D 17.200000
Online Marketplace A 20.050000
We can also group by Month and Payee:
mlr --icsv --opprint filter '$Category == "Groceries"' then stats1 -a sum -f Outflow -g Month,Payee data.csv
Output:
Month Payee Outflow_sum
11/2020 Supermarket A 251.200000
11/2020 Supermarket B 281.400000
11/2020 Phamarcy B 8.500000
11/2020 Water Filter Rental 76.220000
10/2020 Supermarket B 556.600000
10/2020 Supermarket A 227.900000
10/2020 Pharmacy A 12.900000
10/2020 Water Filter Rental 69.470000
9/2020 Supermarket A 231.600000
9/2020 Supermarket B 381.350000
9/2020 Pharmacy E 8.500000
9/2020 Phamarcy B 23.500000
9/2020 Pharmacy A 22.450000
9/2020 Water Filter Rental 81.470000
8/2020 Supermarket A 327.000000
8/2020 Supermarket B 293.800000
8/2020 Water Filter Rental 73.470000