Friday, June 25, 2010

Using perl and awk for data extraction - Part 2

In Part 1, we saw how to extract specific columns and combinations of them using the awk variables $1, $2 and so on. In this part, you will see how to use column names instead of $1, $2, which makes the commands much more readable.

In awk, you can use the -v option to pass variables into the program. For example:
cat data.csv |awk -F, -v OFS=',' -v tag=2 -v width=6 \
'$tag ~ /f1/ {print $tag, $width}'
Here, in addition to the built-in variable OFS, two new variables are passed with the -v option: 'tag' and 'width', with values 2 and 6 respectively. Within the {}, $tag therefore evaluates to $2 (since tag=2) and $width to $6 (since width=6). The result is a much more readable command, since we are using the column names.
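As a quick sanity check of this indirection, here is a minimal sketch on a throwaway two-column input (not the article's data.csv): key=1 and val=2, so $key and $val evaluate to $1 and $2.

```shell
# key=1, val=2: $key is field 1, $val is field 2
printf 'a,1\nb,2\na,3\n' | awk -F, -v OFS=',' -v key=1 -v val=2 \
'$key == "a" {print $key, $val}'
# prints:
# a,1
# a,3
```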

When you have many columns, passing the column names to awk via -v by hand becomes tedious: you end up with a long list like '-v element=1 -v tag=2 -v len=3' and so on. Instead, a few lines of perl can grab the column names from the header and build the actual awk command with the -v options. Perl could of course do everything awk does here, but that would amount to rewriting the awk program in perl.

Thus, the perl program will do just the following:
1. Accept two arguments, namely, the condition part ($tag ~ /f1/ part), and what to extract (the part within {}).
2. Read standard input.
3. Grab the column names from the first line from the input.
4. Build the -v list of column names.
5. Run awk with the conditions, what to extract and -v list of column names.

The complete perl program pfilter0 is shown below:
#!/usr/bin/perl
#USAGE: pfilter0 'cond' 'ext'
# where  cond is the test condition to be applied by awk
#        ext  is what to extract (the part within {})

use strict;
use warnings;

my @userinput;
my $cond = '';   #conditions or regex to match
my $ext  = '';   #cols or expressions to extract
my $colist;      #column names in -v format for awk
my $colheader;   #first row containing the column names

$cond = defined ($ARGV[0]) ? $ARGV[0] : '';
$ext  = defined ($ARGV[1]) ? $ARGV[1] : '$0';

@userinput = <>;

#grab the first line, which contains the column names
$colheader = shift(@userinput);
chomp($colheader);
my @cols = split(",", $colheader);

my $i = 0;
#build awk's -v format list of column names
$colist = join('', map {$i++; "-v $_=$i "} @cols);

#build the awk command line with ',' as input and output delimiter
my $cmd = "|awk -F, -v OFS=',' $colist '$cond { print $ext }'";

#pass the @userinput to awk
open (AWK, $cmd) or die "cannot start awk: $!";
print AWK $_ for @userinput;
close AWK;

Now you can use the column names automatically:
-- Extract tag, len and width from all where tag matches f1
cat data.csv | ./pfilter0 '$tag == "f1"' '$tag, $len, $width'

-- Extract len * width from all footing elements
cat data.csv | ./pfilter0 '$element == "footing"' '$len * $width'

-- Using regular expressions. Extract everything where tag is f1 or f2 only,
-- not any other fx
cat data.csv |./pfilter0 '$tag ~ /f[12]/'

-- Extract tag from all columns of rect type
cat data.csv |./pfilter0 '$element == "column" && $type ~ /rect/' '$tag'
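One subtlety with the f1-or-f2 regex: inside a bracket expression [ ], the '|' is an ordinary character, not alternation, so /f[1|2]/ would also match a literal 'f|'. Writing the class as /f[12]/ avoids that. A quick demonstration on throwaway input:

```shell
# '|' inside [ ] is literal, so [1|2] matches '1', '|' or '2'
printf 'f1\nf2\nf3\nf|\n' | awk '/f[1|2]/'
# prints:
# f1
# f2
# f|

# [12] matches only '1' or '2'
printf 'f1\nf2\nf3\nf|\n' | awk '/f[12]/'
# prints:
# f1
# f2
```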

Part 3 will explain how the code works.

Monday, June 14, 2010

Using perl and awk for data extraction - Part 1

Quite often you come across a delimited text file containing important data, and the question is how to manipulate it and extract the information you need. There are three common ways to work with such a file:

1. Put into a spreadsheet program and use formulas or macros to extract data.
2. Import the data to a database and use query language to extract the data.
3. Use unix/linux tools such as awk and perl to get the data.

Method 1 is not very flexible, as you are restricted to simple formulas unless you get down to programming with macros.

Method 2 is a bit too cumbersome for simple tasks.

Linux/unix offers very powerful tools for data manipulation. Here we will use awk as the primary data extraction program, with a small perl program as a front-end to enable the use of column names.

Introduction to awk
Awk is a powerful pattern-matching program that reads delimited text files and outputs lines matching a given search pattern. Originally written for unix in the '70s and named after the initials of its authors (Aho, Weinberger, and Kernighan), awk has various incarnations. Any modern awk will suffice for the task at hand; I am using GNU awk (gawk) version 3.1.6.

Listing 1.1 shows the file data.csv. It is a comma-separated text file with the first row giving the column names. The data is a listing of elements in a building, such as foundation footings, columns, beams, and walls. Not all the elements have values for all columns.

Listing 1.1 - data.csv
 element, tag, count,  type,  len,  width,  height,  area,
 footing,  f1,     2,  rect,  1.1,    0.9,     0.3,      
 footing,  f2,     4,  rect,  1.0,    1.0,     0.3,      
 footing,  f3,     1,  quad,     ,       ,     0.3,  1.85
 column,   c1,     4,  rect,  0.2,    0.2,     3.3,      
 column,   c2,     2,  rect,  0.2,    0.3,     3.3,      
 column,   c3,     1,  circ,     ,   0.35,     3.3,      
 beam,     b1,     4,  rect,  4.1,    0.2,     0.4,      
 beam,     b2,     3,  rect,  3.0,    0.2,     0.4,      
 wall,       ,     4,   ext,   16,   0.15,     3.3,      
 wall,       ,     2,   int,   24,   0.15,     3.3,      
 wall,       ,     2,   int,   24,   0.15,     3.3,      
 door,     d1,     1,      ,     ,    0.9,     2.6,      
 door,     d2,     4,      ,     ,    0.8,     2.6,      
 rfmt,     f1,     5,   t10,  1.4,       ,        ,      
 rfmt,     f1,     6,   t10,  0.9,       ,        ,      
 rfmt,     f2,     7,   t12,  0.8,       ,        ,      
 rfmt,     f2,     4,   t10,  0.9,       ,        ,      


Now let's try some awk commands. Commands to be entered in the shell start with a $ (the prompt). Long commands are split across two lines with a \ to fit the screen; you can type them with the \ as shown, or without it on one continuous line. First cat the file and pipe it to awk:

Display the first column. $1 refers to the first column, $2 to the second, and so on. The -F, option specifies that a comma is the field separator in the input file. The commands to execute are given inside {}, enclosed in single quotes:
$ cat data.csv |awk -F, '{print $1}'

Display columns 1 and 2 separated by a comma:
$ cat data.csv |awk -F, -v OFS=',' \
'{print $1, $2}'

Display the sum of columns 5 and 6 (len and width):
$ cat data.csv |awk -F, -v OFS=',' \
'{print $5 + $6}'
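Note that awk coerces empty or non-numeric fields to 0 in arithmetic, so the blank cells in data.csv simply contribute nothing to a sum. A quick check on throwaway input:

```shell
# $2 is blank in the second row, so it is treated as 0
printf 'rect,1.1,0.9\nquad,,0.3\n' | awk -F, '{print $2 + $3}'
# prints:
# 2
# 0.3
```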

Display column 2 and the product of columns 5 and 6. Awk's internal variable OFS specifies the delimiter to be used in the output:
$ cat data.csv |awk -F, -v OFS=',' \
'{print $2, $5 * $6}'

You can specify a condition to be met before the curly braces. Display columns 1 and 2, and the sum of columns 5 and 6, where column 1 is 'footing':
$ cat data.csv |awk -F, -v OFS=',' \
'$1 == "footing" {print $1, $2, $5 + $6}'

Or you could use more sophisticated regex matching as well. Display columns 1 and 2, and the sum of columns 5 and 6, where column 2 is 'f'-something:
$ cat data.csv |awk -F, -v OFS=',' \
'$2 ~ /f./ {print $1, $2, $5 + $6}'

Instead of using $1, $2, we could use column names derived from the first row. We will look into this in the next part.