home | lab | find me | science | publications | software | toolbox | site map


Parsing CSV with python or AWK

A simple task: take a CSV file and compute the average of the values in the fourth column.

In python, a general purpose language, the code is reasonably straightforward. There are a couple of lines of noise (the two imports at the top). And to highlight two idiosyncrasies: (1) one has to know that "sys.argv" stands for the list of arguments passed to python, and (2) that list indexing starts at zero, so that the fourth column is number 3. There's also a bit of low-level work: opening the file, iterating its lines, splitting each line by commas and casting the read strings into numbers (a floating point number in this case).

# Run like: python parse.py data.csv
from __future__ import with_statement
import sys

with open(sys.argv[1]) as f:
  # Prepare variables
  total = 0
  n_lines = 0
	# Iterate over each line
  for line in f:
	  # Execute for each line
	  fields = line.split(',')
	  total += float(fields[3])
		n_lines += 1
  # Conclude
  print "Average:", total / n_lines

But there are languages far more suited for such a task, for example AWK, which has the built-in ability to process line-by-line the content of a file given as argument or from piped into it.

AWK has a number of built-in variables such as FS (the column separator character), and NR (the number of lines so far). The middle part--executed for every line--consists of one or more statements (in this case just one) that execute when matching a specified condition (in this case there's no condition and therefore all lines are considered). Implicitly casts strings to numbers.

# Run like: cat data.csv | awk -f parse.awk

# Prepare variables
BEGIN {
  FS = ",";
  total = 0;
}

# Execute for each line
{
  total += $4;
}

# Conclude
END {
  print "Average:", total / NR;
}


Last updated: 2015-01-17 21:43 New York time. Copyright Albert Cardona.