Friday, June 25, 2010

Using perl and awk for data extraction - Part 2

In Part 1, we have seen how to extract specific columns and their combinations using awk variables $1, $2 etc. In this part, you will see how to use the column names instead of $1, $2. This will make things very much clearer.

In awk, you can use the -v option to pass variables to awk. So for example:
cat data.csv |awk -F, -v OFS=',' -v tag=2 -v width=6 \
'$tag ~ /f1/ {print $tag, $width}'
Here, in addition to the built-in variable OFS, two new variables are passed with the -v option. They are 'tag' and 'width' with values 2 and 6 respectively. Hence within the {}, we can use $tag which gets evaluated to $2 (since tag=2) and $width evaluated to $6 (since width=6). The result is that the line is much more readable since we are using the column names.

When you have many columns, it is very difficult to pass column names to awk via the -v option manually. You will have a long list of '-v element=1 -v tag=2 -v len=3' etc. Instead, few lines of perl code can grab the column names and build actual awk command with -v options. Perl is well and good enough to actually do what awk itself does, but then would be re-writing awk program in perl.

Thus, the perl program will do just the following:
1. Accept two arguments, namely, the condition part ($tag ~ /f1/ part), and what to extract (the part within {}).
2. Read standard input.
3. Grab the column names from the first line from the input.
4. Build the -v list of column names.
5. Run awk with the conditions, what to extract and -v list of column names.

The complete perl program pfilter0 is shown below:
1 #!/usr/bin/perl
2 #USAGE: pfilter0 'cond' 'ext'
3 # where  cond is test condition to be applied by awk
4 #  ext is what to extract (the part within {}
6 use strict;
7 use warnings;
9 my @userinput;
10 my $cond = '';  #conditions or regex to match
11 my $ext = '';  #cols or expressions to extract
12 my $colist;   #column names as -v format for awk
13 my $colheader;  #first row containing the column names
15 $cond   = defined ($ARGV[0]) ? $ARGV[0] : '';
16 $ext    = defined ($ARGV[1]) ? $ARGV[1] : '$0';
18 @userinput = <>;
20 #grab first line as containing column names
21 $colheader = shift(@userinput);
22 chomp($colheader);
23 my @cols = split(",", $colheader);
25 my $i=0;
26 #build awk's -v format list of column names
27 $colist = join('', map {$i++ ; "-v $_=$i "} @cols);
29 #build the awk command line pars with ',' as input \
    and output delim
30 my $cmd = "|awk -F, -v OFS=',' $colist '$cond 
                    { print $ext }'";
32 #pass the @userinput to awk
33 open (AWK, $cmd);
34 map {print AWK} @userinput;
35 close AWK;

Now you can use the column names automatically:
-- Extract tag, len and width from all where tag matches f1
cat data.csv | ./pfilter0 '$tag == "f1"' '$tag, \ 
                                     $len, $width'

-- Extract len * width from all footing elements
cat data.csv | ./pfilter0 '$element == "footing"' \
                                     '$len * $width'

-- Using regular expressions. Extract everything where tag 
   is f1 or f2 only,
-- not other fx
cat data.csv |./pfilter0 '$tag ~ /f[1|2]/'

-- Extract tag from all columns of rect type
cat data.csv |./pfilter0 '$element == "column" && \ 
                              $type ~ /rect/' '$tag'

Part 3 will explain how the code works.