Friday, September 10, 2010

Using perl and awk for data extraction - Part 3

In part 2, you saw the small perl program written to extract data. The code uses a fair amount of one-liners and perl syntactic sugar. In addition, it uses a functional style of programming where appropriate.

Lines 1-13:
These are just the startup lines and variable declarations. Perl lets you use variables without declaring them first, as long as use strict is not in effect.

Line 18:
@userinput = <> 

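This grabs the contents of standard input (or of any files named on the command line) and stores them in the array @userinput, one line per element. In list context, <> reads the whole input into memory at once, whether it comes from a file or from standard input. This is not advisable in production code, where the input may be large.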
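
For large inputs, reading one line at a time avoids holding everything in memory. The following is a minimal sketch, not part of the original program: the $sample string and the in-memory filehandle are assumptions made so the example runs on its own, and a real script would loop over <> or STDIN instead.

```perl
use strict;
use warnings;

# Hypothetical sample input, standing in for a file or standard input.
my $sample = "length,width,height\n1,2,3\n4,5,6\n";

# Open an in-memory filehandle on the string (for illustration only).
open(my $in, '<', \$sample) or die "Cannot open sample data: $!";

my $count = 0;
# In scalar context, <$in> returns one line per call, so only the
# current line is held in memory at any time.
while (my $line = <$in>) {
    chomp($line);
    $count++;
    print "line $count: $line\n";
}
close($in);
```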

Line 21:
$colheader = shift(@userinput)

This grabs the first item from the array @userinput using the shift function. shift also removes that item, so @userinput now has one element fewer.

Line 22:
chomp($colheader)

The first line stored in $colheader contains the newline character as well. This newline character is removed with the chomp function. Strictly speaking, chomp removes the input record separator as specified in the perl special variable $/, which by default is the newline character.
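
A small sketch of that behaviour, using made-up strings rather than the program's real input: chomp strips a trailing newline by default, and strips whatever $/ holds when $/ is changed.

```perl
use strict;
use warnings;

my $line = "length,width,height\n";
my $removed = chomp($line);    # chomp returns the number of characters removed
print "[$line] removed=$removed\n";

# With a different record separator, chomp strips that string instead.
my $record = "a,b,c;";
{
    local $/ = ";";            # change the input record separator in this block only
    chomp($record);
}
print "[$record]\n";
```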

Line 23:
my @cols = split(",", $colheader)

This splits the contents of $colheader on the delimiter , (comma). The split function returns a list of strings, which gets stored in @cols. As a result, @cols contains the column names.
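
Lines 21 to 23 can be tried together on a few sample lines. This is only a sketch: the contents of @userinput below are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical input lines, as <> would return them in list context.
my @userinput = ("length,width,height\n", "1,2,3\n", "4,5,6\n");

my $colheader = shift(@userinput);    # remove and return the first line
chomp($colheader);                    # drop the trailing newline
my @cols = split(",", $colheader);    # one element per column name

print "columns: @cols\n";
print "data lines left: ", scalar(@userinput), "\n";
```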

Line 27:
$colist = join('', map {$i++ ; "-v $_=$i "} @cols)

This line consists of two parts: first the use of the map function, and second the use of the join function.
The map function "maps", or applies, a given function to each member of a list. Suppose you have a line that says (where @numbers = (1, 2, 3, 4)):
@results = map(square($_), @numbers)

This will apply the function 'square' to each member of the array @numbers; inside map, $_ stands for the current member. The result is the list (square(1), square(2), square(3), square(4)). If square is a function that returns the mathematical square of its argument, then @results will contain (1, 4, 9, 16).
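
A runnable version of this example might look as follows. The square function is a hypothetical helper defined here for illustration. Note that the current element has to be handed to it explicitly as $_.

```perl
use strict;
use warnings;

# Hypothetical helper that returns the square of its argument.
sub square {
    my ($n) = @_;
    return $n * $n;
}

my @numbers = (1, 2, 3, 4);

# map sets $_ to each element in turn; pass it to square explicitly.
my @results = map(square($_), @numbers);

print "@results\n";    # prints "1 4 9 16"
```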
Similarly, instead of a function call, you can give map a code block within curly braces {}, to be applied to each member of the given array. In this case the code block is
{$i++; "-v $_=$i "}

Note the trailing space just after $i. This means: for each element in @cols, run the piece of code in the curly braces. The code in the curly braces does two things:
1. increments the variable $i
2. returns a string of the form "-v $_=$i "
In perl, $i occurring within a double-quoted string is interpolated, ie its value is substituted. $_ is the perl special variable that holds the current argument, ie the element from @cols that is under consideration.
Suppose @cols is (length, width, height) and $i starts out at 0 (or undefined), then the result of the map function will be
("-v length=1 ", "-v width=2 ", "-v height=3 ")

The list resulting from the map function is passed directly to the join function. The join function simply joins the elements of a list into one string, separated by a given string. Here we use the empty string '' as the separator, ie the elements are joined directly, and the trailing space in each element keeps the options apart. As a final result, $colist will contain something like
"-v length=1 -v width=2 -v height=3 "

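The map and join steps can be checked in isolation. This is a sketch with sample column names; the intermediate @parts array is introduced here only to make the two steps visible and is not in the original program.

```perl
use strict;
use warnings;

my @cols = ("length", "width", "height");    # sample column names
my $i = 0;

# For each column name, bump the counter and emit "-v name=position ".
my @parts = map { $i++; "-v $_=$i " } @cols;

# Join with the empty string; each part already ends in a space.
my $colist = join('', @parts);

print "[$colist]\n";    # prints "[-v length=1 -v width=2 -v height=3 ]"
```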

For those interested, perl's map function is very similar to Lisp's mapcar or Haskell's map function.

Line 30:
my $cmd = "|awk -F, -v OFS=',' $colist '$cond { print $ext }'" 

This builds the command line for calling awk from perl. Note the unix pipe | at the beginning: when the string is later passed to open, it tells perl to start the command and pipe whatever we print into it. -F, sets the input field separator to , (comma), and -v OFS=',' sets the output field separator to comma as well. Since this is a double-quoted string, perl substitutes the value of $colist rather than the literal text $colist. The same goes for $cond and $ext.
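
To see what the finished string looks like, it can simply be printed. The values of $colist, $cond and $ext below are assumptions for illustration; in the real program they come from earlier lines. The column names here deliberately avoid awk built-ins such as length, which awk would not accept as a variable name.

```perl
use strict;
use warnings;

# Hypothetical values, chosen only for this sketch.
my $colist = "-v width=1 -v height=2 -v depth=3 ";
my $cond   = '$width > 2';         # single quotes: perl leaves $width alone
my $ext    = '$width, $depth';     # awk fields to print

# Double quotes: perl substitutes $colist, $cond and $ext here.
my $cmd = "|awk -F, -v OFS=',' $colist '$cond { print $ext }'";

print "$cmd\n";
```

The printed command shows that awk receives $width and $depth literally and, because of the -v assignments, treats them as field 1 and field 3.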

Line 33:
open (AWK, $cmd)

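This is a standard perl way of running a shell command and writing to its input. Because $cmd starts with |, open starts the command and connects the filehandle AWK to its standard input. The name AWK is arbitrary: you could use ABC, BLAH, FOOBAR or anything else instead.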

Line 34:
map {print AWK} @userinput

Here again I use perl's ability to do functional programming. It runs the code {print AWK} for each element in the array @userinput. With no list given, print sends $_ to the named filehandle, ie each line in @userinput goes to the awk process opened on line 33 with the command built on line 30.
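
For comparison, here is a sketch of lines 33 and 34 written in a more defensive style: a lexical filehandle, the three-argument form of open, error checks, and a plain for loop instead of map (which is usually reserved for building lists). The awk command and sample data are assumptions for illustration, and the example assumes awk is available on the PATH.

```perl
use strict;
use warnings;

# Hypothetical data lines and a fixed awk command for illustration.
my @userinput = ("1,2,3\n", "4,5,6\n");
my $cmd = q{awk -F, -v OFS=',' -v width=1 -v depth=3 '$width > 2 { print $width, $depth }'};

# '|-' means: start the command and write to its standard input.
open(my $awk, '|-', $cmd) or die "Cannot start awk: $!";

# Send each line to awk's standard input.
print {$awk} $_ for @userinput;

# close waits for awk to finish and returns false if it failed.
close($awk) or die "awk failed: status $?";
my $status = $?;
```

With this sample data, awk should print 4,6, since only the second line has a first field greater than 2.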
