Data science at the command line

2014-12-25

The UNIX command line is a powerful tool for diving into large data files and piping specialized utilities to create summary statistics. This is a compilation of some useful commands.

Parsing CSV

Read the first line of a CSV file, which contains the column names, and list each column name in a new line after splitting by comma and stripping surrounding quotes. Notice awk can create an array and has operations like length that returns the number of elements in an array.

head -n 1 WebExtract.txt | awk '{ split($0, a, ","); max = length(a) } 
  END { for (x=1; x<=max;x++) {gsub(/"/, "", a[x]); print a[x]}  }'

Take the average of a column of integers in a CSV file after filtering for a category.

cat iris.csv | grep 'Iris-setosa' | awk -F "," '{ sum += $1; n += 1; } 
  END { printf "%0.5f\n", sum/n }'

To select a single line in a CSV file and output one column.

sed -n '2405p;2405q' WebExtract.txt | awk -F "\",\"" '{ print $3; }'

We can also select a subset of lines in the middle of the document and return a single column.

sed -n '1000,1010p' WebExtract.txt | awk -F "\",\"" { 'print $3; }'

Count the number of lines with wc.

cat trip_data_1.csv | wc -l

JSON files can be parsed using jq. The following command parses a 35MB file for the attribute city, sorts the cities, and returns the number of occurrences.

cat ./yelp_train_academic_dataset_business.json | jq '.city' | sort | uniq -c

Parsing JSON

To test a command on a large file we can select just the first few lines.

head -n 100 ./yelp_train_academic_dataset_business.json | jq '.city' | sort | uniq -c

To further filter the output for occurrences greater than 5 we can use awk.

head -n 100 ./yelp_train_academic_dataset_business.json | jq '.city' | 
  sort | uniq -c | awk '{if ($1 > 5) {print $0}}'

The above filter also works reasonably well on the entire dataset.

cat ./yelp_train_academic_dataset_business.json | jq '.city' | 
  sort | uniq -c | awk '{if ($1 > 100) {print $0}}'

To get the total number of entires with over 100 occurrences we can take the sum.

cat ./yelp_train_academic_dataset_business.json | jq '.city' | 
  sort | uniq -c | awk '{ if ($1 > 100) {sum += $1;} } END { print sum }'

To filter by a key within jq we can pipe commands and return only the city names.

head -n 3 ./yelp_train_academic_dataset_business.json | jq 'select(.city | 
  contains("De Forest")) | .city'