Miller: Command Line CSV File Processing

Often times the lowest common denominator for integrating two systems is a CSV file. Usually I would resort to scripts for processing these CSV files. However, for certain simple operations scripting could be overkill. A battle tested command line would be much more appropriate than writing a possibly bug ridden one off script.

Miller is a command line tool for CSV file processing. It can replace Python scripts for common data transformation tasks such as:

Verbs

Miller’s functionality is exposed as a set of “verbs” e.g:

Each verb has its own options.

DSL

The put and filter verbs are unique in that they support a DSL that gives you the power of a scripting language to transform/filter your data.

Variables in the DSL

Chaining Verbs

Verbs can be chained together by placing the word then between the verbs

Examples

Miller is a feature rich tool and the documentation is a little daunting at first. The best way to learn it is through examples.

Sample CSV file

(Note: There might be some line wrapping issues on Firefox, still debugging the CSS).

We’ll be playing with a CSV file that looks like this throughout the examples (this is just a snippet of that file):

Account, Flag, Date, Payee, Category Group/Category,Category Group, Category, Memo, Outflow, Inflow
Cash,,11/26/20,Food Court @ Apartment,Everyday Expenses: Restaurants,Everyday Expenses,Restaurants,,$14.00,$0.00
Cash,,11/26/20,Food Court @ Apartment,Everyday Expenses: Restaurants,Everyday Expenses,Restaurants,,$14.15,$0.00
Cash,,11/24/20,Food Court @ Neighborhood,Everyday Expenses: Restaurants,Everyday Expenses,Restaurants,,$6.25,$0.00
e-Wallet B,,11/24/20,Mobile Phone Plan,Monthly Bills: Phone,Monthly Bills,Phone,,$118.00,$0.00
Debit Card A,,11/22/20,Supermarket A,Everyday Expenses: Groceries,Everyday Expenses,Groceries,,$9.80,$0.00
Cash,,11/21/20,Carpark,Everyday Expenses: Parking and Toll,Everyday Expenses,Parking and Toll,,$7.30,$0.00
Cash,,11/21/20,Supermarket A,Everyday Expenses: Groceries,Everyday Expenses,Groceries,,$37.10,$0.00
Debit Card A,,11/21/20,Supermarket B,Everyday Expenses: Groceries,Everyday Expenses,Groceries,,$110.05,$0.00
Cash,,11/20/20,Pharmacy A,Everyday Expenses: Family,Everyday Expenses,Family,,$14.80,$0.00

Input Output Formats: Pretty Printing

mlr --icsv --opprint cat data.csv

Output:

Input Output Formats: Converting CSV to JSON

 mlr --icsv --ojson --jlistwrap cat data.csv

Output:

[
{ "Account": "Cash", "Flag": "", "Date": "11/26/20", "Payee": "Food Court @ Apartment", "Category Group/Category": "Everyday Expenses: Restaurants", "Category Group": "Everyday Expenses", "Category": "Restaurants", "Memo": "", "Outflow": "$14.00", "Inflow": "$0.00" }
,{ "Account": "Cash", "Flag": "", "Date": "11/26/20", "Payee": "Food Court @ Apartment", "Category Group/Category": "Everyday Expenses: Restaurants", "Category Group": "Everyday Expenses", "Category": "Restaurants", "Memo": "", "Outflow": "$14.15", "Inflow": "$0.00" }
,{ "Account": "Cash", "Flag": "", "Date": "11/24/20", "Payee": "Food Court @ Neighborhood", "Category Group/Category": "Everyday Expenses: Restaurants", "Category Group": "Everyday Expenses", "Category": "Restaurants", "Memo": "", "Outflow": "$6.25", "Inflow": "$0.00" }
,{ "Account": "e-Wallet B", "Flag": "", "Date": "11/24/20", "Payee": "Mobile Phone Plan", "Category Group/Category": "Monthly Bills: Phone", "Category Group": "Monthly Bills", "Category": "Phone", "Memo": "", "Outflow": "$118.00", "Inflow": "$0.00" }
,{ "Account": "Debit Card A", "Flag": "", "Date": "11/22/20", "Payee": "Supermarket A", "Category Group/Category": "Everyday Expenses: Groceries", "Category Group": "Everyday Expenses", "Category": "Groceries", "Memo": "", "Outflow": "$9.80", "Inflow": "$0.00" }
,{ "Account": "Cash", "Flag": "", "Date": "11/21/20", "Payee": "Carpark", "Category Group/Category": "Everyday Expenses: Parking and Toll", "Category Group": "Everyday Expenses", "Category": "Parking and Toll", "Memo": "", "Outflow": "$7.30", "Inflow": "$0.00" }
,{ "Account": "Cash", "Flag": "", "Date": "11/21/20", "Payee": "Supermarket A", "Category Group/Category": "Everyday Expenses: Groceries", "Category Group": "Everyday Expenses", "Category": "Groceries", "Memo": "", "Outflow": "$37.10", "Inflow": "$0.00" }
,{ "Account": "Debit Card A", "Flag": "", "Date": "11/21/20", "Payee": "Supermarket B", "Category Group/Category": "Everyday Expenses: Groceries", "Category Group": "Everyday Expenses", "Category": "Groceries", "Memo": "", "Outflow": "$110.05", "Inflow": "$0.00" }
,{ "Account": "Cash", "Flag": "", "Date": "11/20/20", "Payee": "Pharmacy A", "Category Group/Category": "Everyday Expenses: Family", "Category Group": "Everyday Expenses", "Category": "Family", "Memo": "", "Outflow": "$14.80", "Inflow": "$0.00" }

Input Output Formats: Changing the Field Separator

mlr --ofs '|' --csv cat data.csv

Output:

Account|Flag|Date|Payee|Category Group/Category|Category Group|Category|Memo|Outflow|Inflow
Cash||11/26/20|Food Court @ Apartment|Everyday Expenses: Restaurants|Everyday Expenses|Restaurants||$14.00|$0.00
Cash||11/26/20|Food Court @ Apartment|Everyday Expenses: Restaurants|Everyday Expenses|Restaurants||$14.15|$0.00
Cash||11/24/20|Food Court @ Neighborhood|Everyday Expenses: Restaurants|Everyday Expenses|Restaurants||$6.25|$0.00
e-Wallet B||11/24/20|Mobile Phone Plan|Monthly Bills: Phone|Monthly Bills|Phone||$118.00|$0.00
Debit Card A||11/22/20|Supermarket A|Everyday Expenses: Groceries|Everyday Expenses|Groceries||$9.80|$0.00
Cash||11/21/20|Carpark|Everyday Expenses: Parking and Toll|Everyday Expenses|Parking and Toll||$7.30|$0.00
Cash||11/21/20|Supermarket A|Everyday Expenses: Groceries|Everyday Expenses|Groceries||$37.10|$0.00
Debit Card A||11/21/20|Supermarket B|Everyday Expenses: Groceries|Everyday Expenses|Groceries||$110.05|$0.00
Cash||11/20/20|Pharmacy A|Everyday Expenses: Family|Everyday Expenses|Family||$14.80|$0.00

Removing Columns

mlr --csv cut -x -f "Inflow,Flag,Memo,Category Group/Category,Category Group" data.csv

Output:

Account,Date,Payee,Category,Outflow
Cash,11/26/20,Food Court @ Apartment,Restaurants,$14.00
Cash,11/26/20,Food Court @ Apartment,Restaurants,$14.15
Cash,11/24/20,Food Court @ Neighborhood,Restaurants,$6.25
e-Wallet B,11/24/20,Mobile Phone Plan,Phone,$118.00
Debit Card A,11/22/20,Supermarket A,Groceries,$9.80
Cash,11/21/20,Carpark,Parking and Toll,$7.30
Cash,11/21/20,Supermarket A,Groceries,$37.10
Debit Card A,11/21/20,Supermarket B,Groceries,$110.05
Cash,11/20/20,Pharmacy A,Family,$14.80

Modifying Files in Place

By default, the verbs will send their output to standard output. We can edit the file in place using the -I option.

mlr -I --csv cut -x -f "Inflow,Flag,Memo,Category Group/Category,Category Group" data.csv

Removing ‘$’ from the Outflow columns

In order to run numerical operations on the data we need to remove the ‘$’ from the Outflow column. We can do this using the ssub function provided in the DSL. Lets do this in-place as well:

 mlr -I --csv put '$Outflow=ssub($Outflow, "$", "");' data.csv

Output:

Account,Date,Payee,Category,Outflow
Cash,11/26/20,Food Court @ Apartment,Restaurants,14.00
Cash,11/26/20,Food Court @ Apartment,Restaurants,14.15
Cash,11/24/20,Food Court @ Neighborhood,Restaurants,6.25
e-Wallet B,11/24/20,Mobile Phone Plan,Phone,118.00
Debit Card A,11/22/20,Supermarket A,Groceries,9.80
Cash,11/21/20,Carpark,Parking and Toll,7.30
Cash,11/21/20,Supermarket A,Groceries,37.10
Debit Card A,11/21/20,Supermarket B,Groceries,110.05
Cash,11/20/20,Pharmacy A,Family,14.80

Filtering Records

mlr --csv filter '$Category=="Electricity"' data.csv

Output:

Account,Date,Payee,Category,Outflow
Bank Account A,11/13/20,Electricity Company,Electricity,241.50
Bank Account A,10/7/20,Electricity Company,Electricity,406.05
Bank Account A,9/2/20,Electricity Company,Electricity,185.30
Bank Account A,8/1/20,Electricity Company,Electricity,234.70
Bank Account A,7/1/20,Electricity Company,Electricity,355.95
Bank Account A,6/3/20,Electricity Company,Electricity,215.30
Bank Account A,5/1/20,Electricity Company,Electricity,215.60
Bank Account A,4/1/20,Electricity Company,Electricity,206.30
Bank Account A,3/1/20,Electricity Company,Electricity,175.05

Chaining Verbs

The previous output was a bit noisy as it contains field that we’re not interested in, lets chain filter and cut verbs so that we only see the fields that we want to see.

mlr --csv filter '$Category=="Electricity"' then cut -f "Category,Outflow" data.csv

Output:

Category,Outflow
Electricity,241.50
Electricity,406.05
Electricity,185.30
Electricity,234.70
Electricity,355.95
Electricity,215.30
Electricity,215.60
Electricity,206.30
Electricity,175.05
Electricity,201.50
Electricity,224.00
Electricity,219.95
Electricity,262.15
Electricity,193.85

Adding a “Month” Field

What I would really like to do is analyze the data by month. To make this easier I’m going to introduce a “Month” field that is based on the “Date” field. The verb used for this is put.

mlr -I --csv put 't = splitnv($Date, "/"); $Month=t[1] . "/20" . t[3];'  data.csv

Output:

Account,Date,Payee,Category,Outflow,Month
Cash,11/26/20,Food Court @ Apartment,Restaurants,14.00,11/2020
Cash,11/26/20,Food Court @ Apartment,Restaurants,14.15,11/2020
Cash,11/24/20,Food Court @ Neighborhood,Restaurants,6.25,11/2020
e-Wallet B,11/24/20,Mobile Phone Plan,Phone,118.00,11/2020
Debit Card A,11/22/20,Supermarket A,Groceries,9.80,11/2020
Cash,11/21/20,Carpark,Parking and Toll,7.30,11/2020
Cash,11/21/20,Supermarket A,Groceries,37.10,11/2020
Debit Card A,11/21/20,Supermarket B,Groceries,110.05,11/2020
Cash,11/20/20,Pharmacy A,Family,14.80,11/2020

Statistics: Which payee did I transact with the most frequently?

mlr --icsv --opprint stats1 -a count -f "Payee" -g "Payee" data.csv

Output:

Payee                         Payee_count
Food Court @ Apartment        31
Food Court @ Neighborhood     1
Mobile Phone Plan             14
Supermarket A                 127
Carpark                       5
Supermarket B                 82
Pharmacy A                    12
Wordpress Hosting Provider    5
E-hailing company             12

This isn’t very useful because the second field isn’t sorted. To sort:

mlr --icsv --opprint  stats1 -a count -f "Payee" -g "Payee" then sort -nr "Payee_count"  data.csv

Output:

Payee                         Payee_count
Supermarket A                 127
Supermarket B                 82
e-Hailing                     76
Food Court @ Apartment        31
Phamarcy B                    18
Mobile Phone Plan             14
Electricity Company           14
Malaysian Restaurant H        14
Pharmacy A                    12
E-hailing company             12
Water Filter Rental           12
Toll Prepaid Card             12
Cafe G                        10
Fuel Company A                9

If we are only interested in the top 5 we can chain the head verb:

mlr --icsv --opprint  stats1 -a count -f "Payee" -g "Payee" then sort  -nr "Payee_count" then head -n 5 data.csv

Output:

Payee                  Payee_count
Supermarket A          127
Supermarket B          82
e-Hailing              76
Food Court @ Apartment 31
Phamarcy B             18

Statistics: Which grocery stores did I spend the most money at?

mlr --icsv --opprint filter '$Category == "Groceries"' then stats1 -a sum -f Outflow -g Payee data.csv

Output:

Payee                Outflow_sum
Supermarket A        4998.700000
Supermarket B        3969.350000
Phamarcy B           246.350000
Water Filter Rental  916.890000
Pharmacy A           76.850000
Pharmacy E           23.300000
Pharmacy D           17.200000
Online Marketplace A 20.050000

We can also group by Month and Payee:

mlr --icsv --opprint filter '$Category == "Groceries"' then stats1 -a sum -f Outflow -g Month,Payee data.csv

Output:

Month   Payee                Outflow_sum
11/2020 Supermarket A        251.200000
11/2020 Supermarket B        281.400000
11/2020 Phamarcy B           8.500000
11/2020 Water Filter Rental  76.220000
10/2020 Supermarket B        556.600000
10/2020 Supermarket A        227.900000
10/2020 Pharmacy A           12.900000
10/2020 Water Filter Rental  69.470000
9/2020  Supermarket A        231.600000
9/2020  Supermarket B        381.350000
9/2020  Pharmacy E           8.500000
9/2020  Phamarcy B           23.500000
9/2020  Pharmacy A           22.450000
9/2020  Water Filter Rental  81.470000
8/2020  Supermarket A        327.000000
8/2020  Supermarket B        293.800000
8/2020  Water Filter Rental  73.470000

Comments

comments powered by Disqus