UNIX tip of the day —
duplicate and replace lines with awk

Today I got some data I wanted to add to my machine learning training datasets for named entity recognition. My system is designed to work with output from automatic speech recognition (ASR). It is often hard to know in advance whether ASR output will contain hyphens (e.g. email vs. e-mail), so I frequently include both variants to be robust. I was able to add these variants automatically with a quick awk one-liner:

awk '/-/ {print; gsub("-", " ")} {print}'

Recall that awk operates with pattern-action blocks. Here the pattern is /-/, which matches any line containing a hyphen. The action first prints the matching line, then uses gsub to replace every hyphen with a space. Then I have a second action block without a pattern, which simply prints every line. By that point, any line with a hyphen has already been modified, so the second copy is printed without hyphens. Oh, how I love to be lazy!
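As a quick sketch, here is the one-liner run on a couple of made-up lines (the input text is my own example, not real ASR output):

```shell
printf 'send me an e-mail\nno hyphens here\n' |
  awk '/-/ {print; gsub("-", " ")} {print}'
```

This prints the hyphenated line twice, once as-is and once with the hyphen replaced, and passes the hyphen-free line through once.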


Git tip – restoring “lost” commits

I ran into a git issue today where I thought I was ready to push a recent commit, but the push failed, saying that I was in the middle of a rebase. I don’t remember starting a rebase, but maybe I did. I tried git rebase --continue, but that didn’t work, so then I tried git rebase --abort. That fixed the issue about being in the middle of a rebase, but it also threw out my commit. It was a pretty big commit, and I thought it might just be lost, but it turns out it wasn’t! Git reflog to the rescue. I found some handy instructions here: git ready » restoring lost commits

That almost worked. I got my lost commit back, but when I tried to push, I still got an error. So finally I took the old-fashioned approach: I backed up my directory with the commit I wanted, cloned the repository from scratch, manually copied my changed files, and then committed and pushed. Ah, git. Apparently I am not the only one who does this; see xkcd: Git


UNIX tip of the day: two file processing with AWK

I recently came across some AWK code from a work colleague that I did not understand at all:

awk -F'\t' -v OFS='\t' 'FNR==NR{a[$1]=$1;next};$1 in a{print $1,$2,$3}' file1 file2

I usually like to understand code instead of blindly copying and pasting, so I did a little research into what this was doing. Searching for “awk FNR NR” got me to this stackoverflow page: linux – What is “NR==FNR” in awk?

And that led me in turn to this excellent article about Idiomatic awk. I’ll summarize some of the points from there.

NR = record number, starting with 1. By default the record separator (RS) is a newline, so this amounts to a line number. When processing two files, AWK first processes the first file one record at a time, and then the second file. NR continues to increment across both files.

FNR = file record number. This counter starts back at 1 for each file.
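A quick way to see the difference between the two counters (the file names here are throwaway examples):

```shell
cd "$(mktemp -d)"
printf 'a\nb\n' > f1
printf 'c\nd\n' > f2
awk '{print FILENAME, NR, FNR}' f1 f2
```

This prints "f1 1 1", "f1 2 2", "f2 3 1", "f2 4 2": NR keeps climbing across both files, while FNR resets to 1 at the start of f2.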

To explain further, I am going to use two example files – foo.txt and bar.txt – with contents like so:

$ tail foo.txt bar.txt
==> foo.txt <==
==> bar.txt <==
a       12      blah
c       42      yada

One thing that I learned from the Idiomatic awk article is that I have not been writing idiomatic AWK. I have been writing AWK scripts like:

awk '{ if ($1=="a") { print }}' foo.txt

This works, but is not idiomatic. AWK scripts follow the general pattern of CONDITION { ACTIONS }. Thus I don’t need an if statement to create a conditional. Using idiomatic AWK, this is simply:

awk '$1=="a" { print }' foo.txt

Now I can start to understand the first script better.

FNR==NR is a condition

a[$1]=$1;next is the action

Now what is that doing?

a is a variable – in this case an associative array (hash, dict, i.e. a key-value store). AWK doesn’t require you to initialize variables at all. So this is building up an array, with the key and value both set to the value of the first column. Since we have the condition FNR==NR, this block will only execute while reading the first file. Finally, we have next, which says to skip the remaining condition-action pairs and move on to the next record.

Now on to this part:

$1 in a{print $1,$2,$3}

This is another condition-action pair. Note that because of the first condition-action pair, FNR==NR {...next}, this second pair will only be applied when processing the second file. $1 in a is the condition. It is saying “if the first column of this record in file2 matches a key in the array a, then print the first, second, and third columns of file2.”
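To watch the two condition-action pairs cooperate, here is a stripped-down version of the same join on two invented files. (A bare reference like a[$1] is enough to create the key, and a condition with no action defaults to printing the record.)

```shell
cd "$(mktemp -d)"
printf 'a\nc\n' > keys.txt
printf 'a 1\nb 2\nc 3\n' > data.txt
# keep records from data.txt whose first column appeared in keys.txt
awk 'FNR==NR {a[$1]; next} $1 in a' keys.txt data.txt
```

This prints "a 1" and "c 3", skipping the "b 2" record because b never appeared in keys.txt.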

Okay – only a few more things left to explain – let’s look at the options given to awk

-F '\t' – this defines the input field separator to be a tab (the default is whitespace).

-v OFS='\t' – the -v option lets you set a variable for the script. In this case, we are setting the special variable OFS (output field separator) to a tab (the default is a single space).

Note that instead of using -F and -v, you could also specify these values in a BEGIN block, which is executed before any data is processed. (Inside awk the input field separator variable is called FS; IFS is the shell’s variable.) Like so:

$ awk 'BEGIN {FS="\t"; OFS="\t"} FNR==NR {a[$1]=$1;next}; $1 in a { print $1,$2,$3}' foo.txt bar.txt
a       12      blah
c       42      yada

But that requires a little more typing

One other thing to keep in mind with AWK – whitespace rarely matters. When I first saw

FNR==NR{a[$1]=$1;next}

I thought that the curly brace after NR was specifying an array index, like you might do in Perl. Nope – that curly brace just opens the action block, and you don’t need any whitespace between the condition and the action.

Let’s go even one step further. What if we have more than 2 files? Let’s say we have 3 salespeople x, y, and z selling products a, b, and c. Each salesperson gives you a report of their sales. We want to find the total sales for each product. Here are their files:

$ tail [x-z].txt
==> x.txt <==
a       10
c       22
==> y.txt <==
b       12
c       42
==> z.txt <==
a       16
b       32

In SQL, we would use a GROUP BY with a SUM. In AWK, we will again just use an associative array, and add the values to it.

$ awk -F '\t' -v OFS='\t' '{a[$1] += $2} END { for (k in a) { print k,a[k] } }' [x-z].txt
a       26
b       44
c       64

We use a similar approach here, except that we don’t need to do any special processing for the first versus subsequent files. We simply build up the array a, and after we have processed all the lines, we print out the totals by looping over a in an END block. Note that in this case, it doesn’t matter if the data is in separate files or in one file. We could have also just concatenated all the data together first, and then piped to AWK like so:

$ cat [x-z].txt | awk -F '\t' -v OFS='\t' '{a[$1] += $2} END { for (k in a) { print k,a[k] } }'
a       26
b       44
c       64

UNIX tip of the day – trap EXIT

I was reading a shell script today and came across the trap command, which I was not aware of. Some googling led me to this article: How “Exit Traps” Can Make Your Bash Scripts Way More Robust And Reliable, which has a really nice explanation. Basically, trap acts sort of like a finally block in a try/catch pattern. Very useful for shutting down services, cleaning up temp files, and such. I had assumed trap might be specific to Bash, but trap (including the EXIT condition) is actually part of the POSIX shell specification, so it works in plain old Bourne-style shells too.
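As a minimal sketch (the temp file and messages here are invented), the trap body runs when the shell exits, no matter how it gets there:

```shell
# Run in a subshell so the EXIT trap fires at the closing parenthesis.
(
  tmpfile=$(mktemp)
  trap 'rm -f "$tmpfile"; echo "cleaned up"' EXIT
  echo "doing work"
  # even if a command failed and the shell exited early,
  # the trap would still remove the temp file
)
```

This prints "doing work" followed by "cleaned up", and the temp file is gone afterwards – exactly the finally-block behavior the article describes.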


UNIX tip of the day – grep -P is slow

Unless you really need some advanced regular expression features only supported by PCRE, POSIX regular expressions with grep are usually an order of magnitude faster. That’s because grep’s default engine uses finite automata, as opposed to the backtracking algorithm PCRE uses (the main features you gain from backtracking are lookahead/lookbehind and backreferences).

Here’s a small example

$ time  grep -E 'post:content.*facebook' a_bunch_of_files* | wc -l
real    0m2.643s
user    0m1.304s
sys     0m1.306s
$ time  grep -P 'post:content.*facebook' a_bunch_of_files*  | wc -l
real    0m29.542s
user    0m28.365s
sys     0m1.04s

Note that the -E flag enables “extended” regular expressions. All this does is change the default meaning of special characters. With the -E flag a pipe “|” means OR. Without the -E flag, a pipe “|” is just an ordinary character, and to get the “special” OR meaning you have to escape it as \| (a GNU extension).
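A quick sketch of the difference, on some throwaway input:

```shell
printf 'cat\ndog\ncow\n' | grep -E 'cat|dog'   # extended: | is alternation (matches cat, dog)
printf 'a|b\n' | grep 'a|b'                    # basic: | is a literal character (matches a|b)
printf 'cat\ndog\ncow\n' | grep 'cat\|dog'     # basic: \| is alternation (GNU extension)
```

Same pattern text, three different meanings depending on the flavor.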
