UNIX Tip of the day: Strange behavior editing PHP files with vim

I have been struggling with this weird behavior in vim for the last month or so. I just re-joined Automattic in September. I have been using vim on many different computers and environments (Linux, Mac, Cygwin) for over 15 years, and I have never experienced this before. I only seem to have this issue when using vim on my development server. I tried searching the interwebs several times over the last month, to no avail. Today I finally decided to ask some colleagues if any of them had experienced the issue. As I was writing up the issue, I discovered the root cause, and a solution!

It is difficult to describe. Basically, when editing php files, if I try to type a method call of an object, the formatting gets messed up. For example, if I try to type $this->foo, it ends up displaying on the screen as $thi->foo – after typing the > character, the s disappears. However, if I write the buffer and reopen the file, I can see that it is actually there. As you can imagine, this is very annoying.

In order to fully document the issue, I wanted to also share my .vimrc file to help others debug. It also occurred to me that the issue could be due to GNU screen. I have experienced other issues like that in the past. So I decided to see if I could replicate the behavior running outside of screen. It turns out that the behavior was also broken there, but in a slightly different way. Instead of deleting the s character as above, a visual bell was triggered! I tried searching the interwebs again about this weird visual bell behavior, and ran across a Google Groups posting with the answer. The issue is that the > character was trying to match an opening < character, and probably not finding one, since I was deep inside a <?php block. This is controlled by the showmatch feature in vim, and the pairs it looks for are listed in the matchpairs (mps) option. I was able to exclude matching of angle brackets <> by adding the following to my .vimrc file:

" Disable matching of <> in PHP files because it causes strange behavior
" when trying to type method names of objects
autocmd BufRead *.php set mps-=<:>

I hope that this can help others who may have had the same issue.

Posted in wordpress | Comments Off on UNIX Tip of the day: Strange behavior editing PHP files with vim

UNIX tip of the day: duplicate and replace lines with awk

Today I got some data I wanted to add to my machine learning training datasets for named entity recognition. My system is designed to be used with output from automatic speech recognition (ASR). It is frequently difficult to be certain whether ASR output will contain hyphens or not (e.g. email vs. e-mail), so I often include both variants to be robust. I was able to add these variants automatically with a quick awk one-liner:

awk '/-/ {print; gsub("-", " ")} {print}'

Recall that awk operates on pattern-action blocks. Here the first pattern is /-/, which matches any line containing a hyphen. For these matching lines, I first print the line as-is, then use gsub to replace every hyphen with a space. The second action block has no pattern, so it simply prints every line. By that point, any line with a hyphen has already been modified, so its second occurrence is printed without hyphens. Oh, how I love to be lazy!
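A quick way to check the one-liner is to feed it a couple of lines by hand. The sample input below is made up for illustration and is not from my actual training data.

```shell
# Hypothetical sample input: one line with a hyphen, one without.
variants=$(printf 'send an e-mail\nhello world\n' |
  awk '/-/ {print; gsub("-", " ")} {print}')
printf '%s\n' "$variants"
# send an e-mail
# send an e mail
# hello world
```

If you would rather get the joined form (email) than the spaced form (e mail), replacing " " with "" in the gsub call should do it.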

Posted in linguistics, UNIX | Comments Off on UNIX tip of the day —
duplicate and replace lines with awk

Git tip – restoring “lost” commits

I ran into a git issue today where I thought I was ready to push a recent commit, and the push failed, saying that I was in the middle of a rebase. I don’t remember starting a rebase, but maybe I did. I tried git rebase --continue, but that didn’t work, so then I tried git rebase --abort. That fixed the issue about being in the middle of a rebase, but it also threw out my commit. It was a pretty big commit, and I thought it might just be lost, but it turns out it wasn’t! Git reflog to the rescue. I found some handy instructions here: git ready » restoring lost commits

That almost worked. I got my lost commit back, but when I tried to push, I still got an error. So finally I took the old-fashioned approach. I backed up my directory with the commit I wanted, cloned the repository from scratch, manually copied my changed files, and then committed and pushed. Ah, git. Apparently I am not the only one who does this; see xkcd: Git
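For reference, the reflog approach looks roughly like the sketch below. It runs in a throwaway repository and simulates the loss with git reset --hard rather than an aborted rebase, so the file name, commit messages, and the "recovered" branch name are all placeholders of mine, not what I actually typed that day.

```shell
repo=$(mktemp -d)                  # throwaway repository for the demo
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo User"

echo one > notes.txt
git add notes.txt
git commit -q -m "first commit"
echo two >> notes.txt
git commit -q -am "big commit"

git reset -q --hard HEAD~1         # simulate losing the newest commit

git reflog -n 3                    # the "lost" commit is still listed here
git branch recovered 'HEAD@{1}'    # HEAD@{1}: where HEAD pointed before the reset
git log --oneline recovered        # "big commit" is reachable again
```

In a real repository you would read the hash of the lost commit off the git reflog output and use that instead of HEAD@{1}.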


UNIX tip of the day: two file processing with AWK

I recently came across some AWK code from a work colleague that I did not understand at all

awk -F'\t' -v OFS='\t' 'FNR==NR{a[$1]=$1;next};$1 in a{print $1,$2,$3}' file1 file2

I usually like to understand code instead of blindly copying and pasting, so I did a little research into what this was doing. Searching for “awk FNR NR” got me to this stackoverflow page: linux – What is “NR==FNR” in awk?

And that led me in turn to this excellent article about Idiomatic awk. I’ll summarize some of the points from there.

NR = record number, starting with 1. By default the record separator (RS) is a newline, so this amounts to a line number. When processing 2 files, AWK first processes the first file one record at a time, and then the second file. The NR continues to increment for both files.

FNR = file record number. This counter starts back at 1 for each file.

To explain further, I am going to use 2 example files – foo.txt and bar.txt, with contents like so

$ tail foo.txt bar.txt
== foo.txt ==
a
b
c
== bar.txt ==
a       12      blah
c       42      yada
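To see the two counters side by side, you can print them for every record. This sketch recreates the two example files in a temporary directory; I am assuming the columns in bar.txt are tab-separated.

```shell
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/foo.txt"
printf 'a\t12\tblah\nc\t42\tyada\n' > "$workdir/bar.txt"

# Print the file name, the overall record number, and the per-file record number.
counters=$(cd "$workdir" && awk '{ print FILENAME, NR, FNR }' foo.txt bar.txt)
printf '%s\n' "$counters"
# foo.txt 1 1
# foo.txt 2 2
# foo.txt 3 3
# bar.txt 4 1
# bar.txt 5 2
```

NR and FNR agree only while the first file is being read, which is what makes the FNR==NR test useful.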

One thing that I learned from the Idiomatic awk article is that I have not been writing idiomatic AWK. I have been writing awk scripts like this:

awk '{ if ($1=="a") { print }}' foo.txt

This works, but is not idiomatic. AWK scripts follow the general pattern of CONDITION { ACTIONS }. Thus I don’t need an if statement to create a conditional. Using idiomatic AWK, this is simply:

awk '$1=="a" { print }' foo.txt
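It can get shorter still: when a condition has no action block, awk's default action is to print the record. A small sketch, with the input piped in rather than read from foo.txt:

```shell
# No action block at all: matching records are printed by default.
matches=$(printf 'a\nb\nc\n' | awk '$1=="a"')
printf '%s\n' "$matches"
# a
```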

Now I can start to understand the first script better.

FNR==NR is a condition

a[$1]=$1;next is the action

Now what is that doing?

a is a variable, in this case an associative array (a hash or dict, i.e. a key-value store). AWK doesn’t require you to declare or initialize variables at all. So this is building up an array, with the key and value both set to the value of the first column. Since we have the condition FNR==NR, this block will only execute while reading the first file, because that is the only time the two counters are equal. Finally, we have next, which says to skip the remaining condition-action pairs for the current record and move on to the next one.

Now on to this part:

$1 in a{print $1,$2,$3}

This is another condition-action pair. Note that because of the first condition-action pair, FNR==NR {next}, this second pair will only be applied when processing the second file. $1 in a is the condition. It is saying “if the first column of this record in file2 matches a key in the array a, then print the first, second, and third columns of file2”.
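Putting the two pairs together, here is the same script spread over several lines with comments. The files are recreated in a temporary directory, and I added a z row to bar.txt (not in the example files) to show a record being filtered out.

```shell
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/foo.txt"
printf 'a\t12\tblah\nc\t42\tyada\nz\t7\tnope\n' > "$workdir/bar.txt"

joined=$(awk -F'\t' -v OFS='\t' '
  # First file only: remember each key, then skip to the next record.
  FNR == NR { a[$1] = $1; next }
  # Second file only: print rows whose first column was seen in the first file.
  $1 in a   { print $1, $2, $3 }
' "$workdir/foo.txt" "$workdir/bar.txt")
printf '%s\n' "$joined"
```

The z row is dropped because z never appears in foo.txt. One caveat I am fairly sure of: if the first file is empty, FNR==NR is also true for the second file, so the idiom quietly does the wrong thing.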

Okay – only a few more things left to explain – let’s look at the options given to awk

-F '\t' defines the input field separator to be a tab. (By default, fields are separated by runs of whitespace, i.e. spaces and tabs.)

-v OFS='\t' The -v option lets you set a variable in the script. In this case, we are setting the special variable OFS (output field separator) to a tab (the default is a single space).

Note that instead of using -F and -v, you could also specify these values in a BEGIN block, which is executed before any data is processed, like so

$ awk 'BEGIN {FS="\t"; OFS="\t"} FNR==NR {a[$1]=$1;next}; $1 in a { print $1,$2,$3}' foo.txt bar.txt
a       12      blah
c       42      yada

But that requires a little more typing

One other thing to keep in mind with AWK – whitespace usually doesn’t count much for anything. When I first saw

FNR==NR{a[$1]=$1;next} I thought that the curly brace after NR was specifying an array index, like you might do in Perl. Nope – that curly brace just starts the action block, and you don’t need any whitespace between the condition and the action.

Let’s go even one step further. What if we have more than 2 files? Let’s say we have 3 salespeople x, y, and z selling products a, b, and c. Each salesperson gives you a report of their sales. We want to find the total sales for each product. Here are their files:

$ tail [x-z].txt
== x.txt ==
a       10
c       22
== y.txt ==
b       12
c       42
== z.txt ==
a       16
b       32

In SQL, we would use a GROUP BY with a SUM. In AWK, we will again just use an associative array, and add the values to it.

$ awk -F '\t' -v OFS='\t' '{a[$1] += $2} END { for (k in a) { print k,a[k] } }' [x-z].txt
a       26
b       44
c       64

We use a similar approach here, except that we don’t need to do any special processing for the first versus subsequent files. We simply build up the array a, and after we have processed all the lines, we print out the totals by looping over a in an END block. Note that in this case, it doesn’t matter if the data is in separate files or in one file. We could have also just concatenated all the data together first, and then piped to AWK like so:

$ cat [x-z].txt | awk -F '\t' -v OFS='\t' '{a[$1] += $2} END { for (k in a) { print k,a[k] } }'
a       26
b       44
c       64
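One caveat about the output above: awk does not guarantee the order in which for (k in a) visits the keys, so the totals may not always come out alphabetically. A sketch that pipes the result through sort to make the order predictable (the files are recreated in a temporary directory, assuming tab-separated columns):

```shell
workdir=$(mktemp -d)
printf 'a\t10\nc\t22\n' > "$workdir/x.txt"
printf 'b\t12\nc\t42\n' > "$workdir/y.txt"
printf 'a\t16\nb\t32\n' > "$workdir/z.txt"

# Sum column 2 per key, then sort so the order does not depend on awk internals.
totals=$(awk -F'\t' -v OFS='\t' '{ a[$1] += $2 } END { for (k in a) print k, a[k] }' \
  "$workdir"/[x-z].txt | sort)
printf '%s\n' "$totals"
```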

UNIX tip of the day – trap EXIT

I was reading a shell script today and came across the trap command, which I was not aware of. Some googling led me to this article: How “Exit Traps” Can Make Your Bash Scripts Way More Robust And Reliable, which has a really nice explanation. Basically, trap acts sort of like a finally block in a try/catch pattern. Very useful for shutting down services, cleaning up temp files and such. I had assumed that trap was specific to Bash, but it is part of the POSIX shell specification, so it also works in plain sh. Some conditions, such as ERR, are shell-specific extensions, so stick to EXIT if portability matters.
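Here is a minimal sketch of the temp-file cleanup case, assuming a POSIX-style sh. The script name job.sh and the variable names are placeholders of mine and are not taken from the article.

```shell
demo_dir=$(mktemp -d)

# A tiny script that creates a scratch file and fails before cleaning up by hand.
cat > "$demo_dir/job.sh" <<'EOF'
#!/bin/sh
scratch=$(mktemp)
trap 'rm -f "$scratch"' EXIT   # runs whenever the script exits, success or failure
echo "$scratch"                # report the path so the caller can check on it
exit 1                         # simulate a failure partway through
EOF

# Run the script; it exits non-zero, but the trap should still have fired.
scratch_path=$(sh "$demo_dir/job.sh") || true

if [ -e "$scratch_path" ]; then
  echo "scratch file leaked"
else
  echo "scratch file cleaned up"
fi
rm -rf "$demo_dir"
```

The trap will not fire if the process is killed with SIGKILL, so it is a convenience rather than a guarantee.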
