Monday, June 14, 2010

Using perl and awk for data extraction - Part 1

You often come across a delimited text file containing important data, and a common question is how to manipulate it and extract the interesting pieces. There are three common ways to work with such a file:

1. Put the data into a spreadsheet program and use formulas or macros to extract it.
2. Import the data into a database and use a query language to extract it.
3. Use unix/linux tools such as awk and perl to get at the data.

Method 1 is not very flexible, as you are restricted to simple formulas unless you get into programming with macros.

Method 2 is a bit too cumbersome for simple tasks.

Linux/unix offers very powerful tools for data manipulation. Here we will use awk as the primary data extraction program, with a small perl program as a front-end to enable the use of column names.

Introduction to awk
Awk is a powerful pattern-matching program that reads delimited text files and outputs lines matching a given search pattern. Originally written for unix in the '70s and named after the initials of its authors (Aho, Weinberger, and Kernighan), awk has various incarnations; any modern awk will do for the task at hand. I am using GNU awk (gawk) version 3.1.6.
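
In a nutshell, an awk program is a list of rules of the form:

pattern { action }

Awk reads the input one line (a "record") at a time, splits it into fields on the field separator, and runs the action of every rule whose pattern matches the line. Omit the pattern and the action runs on every line; omit the action and matching lines are printed as-is.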

Listing 1.1 shows the file data.csv, a comma-separated text file whose first row gives the column names. The data is a listing of elements in a building, such as foundation footings, columns, beams, and walls. Not all elements have values in all columns.

Listing 1.1 - data.csv
 element, tag, count,  type,  len,  width,  height,  area,
 footing,  f1,     2,  rect,  1.1,    0.9,     0.3,      
 footing,  f2,     4,  rect,  1.0,    1.0,     0.3,      
 footing,  f3,     1,  quad,     ,       ,     0.3,  1.85
 column,   c1,     4,  rect,  0.2,    0.2,     3.3,      
 column,   c2,     2,  rect,  0.2,    0.3,     3.3,      
 column,   c3,     1,  circ,     ,   0.35,     3.3,      
 beam,     b1,     4,  rect,  4.1,    0.2,     0.4,      
 beam,     b2,     3,  rect,  3.0,    0.2,     0.4,      
 wall,       ,     4,   ext,   16,   0.15,     3.3,      
 wall,       ,     2,   int,   24,   0.15,     3.3,      
 wall,       ,     2,   int,   24,   0.15,     3.3,      
 door,     d1,     1,      ,     ,    0.9,     2.6,      
 door,     d2,     4,      ,     ,    0.8,     2.6,      
 rfmt,     f1,     5,   t10,  1.4,       ,        ,      
 rfmt,     f1,     6,   t10,  0.9,       ,        ,      
 rfmt,     f2,     7,   t12,  0.8,       ,        ,      
 rfmt,     f2,     4,   t10,  0.9,       ,        ,      


Now let's try some awk commands. Commands to be entered in the shell start with a $ (the prompt) and are split across two lines with a \ to fit the screen; you can type them with the \ or as one continuous line without it. First cat the file and pipe it to awk.

Display the first column. $1 refers to the first column, $2 to the second, and so on. The -F, option tells awk to use a comma as the field separator in the input file. The commands to execute are given inside {}, enclosed in single quotes:
$ cat data.csv |awk -F, '{print $1}'
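
On the sample file this prints the first field of every line, header row included; note that awk keeps any whitespace padding as part of the field, so the output looks something like:

 element
 footing
 footing
 footing
 column
 ...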

Display columns 1 and 2 separated by a comma. Awk's internal variable OFS specifies the delimiter used between output fields:
$ cat data.csv |awk -F, -v OFS=',' \
'{print $1, $2}'

Display the sum of columns 5 and 6 (len and width):
$ cat data.csv |awk -F, -v OFS=',' \
'{print $5 + $6}'
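
Note that awk applies the action to the header row as well; in arithmetic, non-numeric fields such as "len" and "width" simply evaluate to 0. To skip the header you can test awk's built-in record counter NR, which holds the current line number:

$ cat data.csv |awk -F, \
'NR > 1 {print $5 + $6}'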

Display column 2 and the product of columns 5 and 6:
$ cat data.csv |awk -F, -v OFS=',' \
'{print $2, $5 * $6}'

You can specify a condition to be met before the curly braces. Display columns 1 and 2 and the sum of columns 5 and 6, where column 1 is 'footing':
$ cat data.csv |awk -F, -v OFS=',' \
'$1 == "footing" {print $1, $2, $5 + $6}'

Or you can use more sophisticated regex matching as well. Display columns 1 and 2 and the sum of columns 5 and 6, where column 2 (the tag) is 'f'-something:
$ cat data.csv |awk -F, -v OFS=',' \
'$2 ~ /f./ {print $1, $2, $5 + $6}'
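
The pattern /f./ matches an 'f' followed by any single character anywhere in the field, so it would also match a tag like 'xf1'. To anchor the match at the start of the tag (allowing for leading padding), something like this works:

$ cat data.csv |awk -F, -v OFS=',' \
'$2 ~ /^ *f[0-9]/ {print $1, $2, $5 + $6}'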

Instead of using $1, $2, and so on, we could use column names derived from the first row. We will look into this in the next part.