linguistics, UNIX

UNIX tip of the day —
duplicate and replace lines with awk

Today I got some data I wanted to add to my machine learning training datasets for named entity recognition. My system is designed to work on output from automatic speech recognition (ASR). It is often hard to know in advance whether ASR output will contain hyphens (e.g. "email" vs. "e-mail"), so I usually include both variants to be robust. I was able to add these variants automatically with a quick awk one-liner:

awk '/-/ {print; gsub("-", " ")} {print}'

Recall that awk programs are made of pattern-action blocks. Here the pattern is /-/, which matches any line containing a hyphen. For those lines, the first action prints the line as is and then uses gsub to replace every hyphen with a space. The second action block has no pattern, so it runs on every line and simply prints it. By that point any line with a hyphen has already been modified, so its second copy comes out without hyphens, while lines that never had a hyphen are printed once, untouched. Oh, how I love to be lazy!
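
To see the behavior on a concrete case, here is a minimal sketch that wraps the one-liner in a shell function and feeds it two made-up lines. The function name add_hyphen_variants and the sample phrases are my own placeholders for illustration, not part of any real dataset or tool.

```shell
#!/bin/sh
# Hypothetical wrapper around the one-liner; reads lines from stdin.
add_hyphen_variants() {
  # Lines with a hyphen: print the original, then swap hyphens for spaces.
  # The pattern-less block then prints every line, modified or not.
  awk '/-/ {print; gsub("-", " ")} {print}'
}

# Made-up sample input: one hyphenated line, one plain line.
printf '%s\n' 'send an e-mail' 'call me' | add_hyphen_variants
# prints:
#   send an e-mail
#   send an e mail
#   call me
```

Note that the variant is written with a space ("e mail"). If the joined form ("email") is what the data needs, changing the replacement string in gsub from " " to "" should produce that instead.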