Linguistics 5200 Fall 2009

Introduction to computational corpus linguistics

Homework 2 – More UNIX basics and regular expressions

September 4, 2009 · homework

This assignment covers more of the UNIX basics we have gone over in class (including material up to September 8th). Some of the questions ask you to use files from practiceFiles, specifically celex.txt and devilsDictionary.txt.

1. Print out the entries (orthography only) from the celex.txt file which were taken from GOOGLE. Hint: You will need to use a pipe. (6 points)
2. Print out the 50 most frequent words from the celex.txt file which were taken from GOOGLE. Hint: You will need to combine the answers from the last 2 questions. (9 points)
3. Use UNIX commands to count the number of entries (not definitions) in the Devil’s Dictionary that begin with a vowel. Your output should be a single number. (7 points)
4. Use UNIX commands to calculate the average number of letters per word for the entries (not the definitions) in the Devil’s Dictionary. The output should simply be a number. Hint: You will need to use subshells and bc. (10 points)
5. Count the number of adjectives, nouns, and verbs in the Devil’s Dictionary. (10 points)
6. Print out all the entries (not the definitions) which are not adjectives, nouns, or verbs. Hint: Use grep more than once. (10 points)
7. Write a UNIX pipeline which will print the number of words in the celex.txt file that contain a q not followed by a u (look only at the orthography of each entry). (8 points)
Extra credit

Write a UNIX pipeline which will print the total number of points in this assignment. Don’t include the points for the extra credit. Hint: Use dc. (3 extra points)

Written by Robert Felty

6 Comments to “Homework 2 – More UNIX basics and regular expressions”

sikos says:

September 9, 2009 at 10:38 am

Hi Rob,
Not sure what you mean by “entries (not definitions)”. Is there a readme that goes along with this file?
Les

- robfelty says:
  
  September 9, 2009 at 1:15 pm
  
  Les,
  
  Here is an example.
  cat n., a furry domesticated animal of the feline family
  
  “cat” is the entry. Everything after the comma is the definition
  
  Look at the last example in the GREP chapter of the notes for an example. I will also go over this in class tomorrow.
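
  As a rough sketch of the entry/definition split on the example line above: this assumes every line has the shape "entry part-of-speech, definition" and that the part-of-speech tag is a lowercase abbreviation ending in a period. The real devilsDictionary.txt may differ, so treat the variable names and the sed pattern as illustrative only.

  ```shell
  # The example line from this comment, used as stand-in data.
  line='cat n., a furry domesticated animal of the feline family'

  # Keep everything before the first comma: the entry plus its part-of-speech tag.
  before_comma=$(printf '%s\n' "$line" | cut -d, -f1)

  # Strip a trailing lowercase tag such as " n." (an assumed tag format).
  entry=$(printf '%s\n' "$before_comma" | sed 's/ [a-z]*\.$//')

  echo "$entry"
  ```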
  
keith.mertz says:

September 11, 2009 at 9:29 am

Does the backslash properly function as an escape character for the single quote mark, or must I (if even possible) use some other sorcery to grep for, say, a possessive? When I use a regular expression such as ‘^[A-Z\']*’ in the hopes of capturing something like “HUGH’S”, the terminal just stares back at me with eyes glazed and spittle dribbling down its chin.

I feel like this had to have been mentioned, so I apologize (if so) for asking again. Thanks in advance for enlightenment!

- robfelty says:
  
  September 11, 2009 at 10:07 am
  
  Inside single quotes, a backslash will not escape a single quote (the shell treats the backslash literally). You have two options:
  1. Use double quotes, e.g.
  grep "don't" file
  2. Don’t use any quotes, but then you have to escape every character that is special to the shell:
  grep d\[ao\]n\'t celex.txt
  Then, to match special characters literally, you have to double-escape them, e.g.
  grep CVVCC\\]\\[CVVC celex.txt
  
  All that having been said, it might be easier to use a negative character class.
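
  The options above can be tried on a small throwaway file. This is only a sketch: the sample words and the temporary file are made up for illustration and are not taken from celex.txt, and LC_ALL=C is set so that the range [A-Z] covers uppercase ASCII letters only.

  ```shell
  export LC_ALL=C

  # Hypothetical sample data standing in for the course files.
  tmp=$(mktemp)
  printf '%s\n' "HUGH'S" "don't" "dan't" "HUGH" "hugh" > "$tmp"

  # Double quotes: the apostrophe inside the character class needs no escaping.
  # Matches lines made up only of capitals and apostrophes (HUGH'S and HUGH).
  dq=$(grep -c "^[A-Z']*$" "$tmp")

  # No quotes: every shell-special character is backslash-escaped,
  # so grep receives the pattern d[ao]n't (matches don't and dan't).
  nq=$(grep -c d\[ao\]n\'t "$tmp")

  # Negative character class in single quotes: lines with no lowercase letters.
  neg=$(grep -c '^[^a-z]*$' "$tmp")

  echo "$dq $nq $neg"
  rm -f "$tmp"
  ```

  The negative class avoids the quoting problem because the apostrophe never has to appear in the pattern.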
  
kelleya says:

September 11, 2009 at 1:03 pm

Hi Rob,

For question 7 I’m able to limit my search to words that contain ‘q’ not followed by ‘u’, but when printing just the orthography I’m also getting words that have a q not followed by u anywhere else in the line, not just in the spelling.
For example, “gendarme” is one of the words in the printed list; its full entry is:
36631\gendarme\13\18700\6\'Zqn-d#m\[CVVC][CVVC]\[ZA~:n][dA:m]\Zandarm\3.6667.0000\HML\Z'an<d`arm

since "Zqn" fits the criteria of my search. Is this ok or do I need to get my pipeline to only print words from the entries?

Thanks,
Arrick

- robfelty says:
  
  September 11, 2009 at 1:47 pm
  
  You should limit your search to only the orthography of the words.
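
  One way to see the difference, as a sketch: it assumes the orthography is the second backslash-separated field, as in the gendarme entry quoted above. The sample lines are shortened or invented for illustration (the queen and Iraq lines are hypothetical, not real celex.txt entries).

  ```shell
  # Hypothetical, abbreviated sample lines in a celex-like layout.
  tmp=$(mktemp)
  printf '%s\n' \
    '36631\gendarme\13\18700\6\Zqn-d#m' \
    '1\queen\0\0\1\kwin' \
    '2\Iraq\0\0\2\I-r#k' > "$tmp"

  # Searching whole lines: gendarme is counted because of "Zqn" in a later field.
  whole=$(grep -c 'q[^u]' "$tmp")

  # Cutting out field 2 first restricts the search to the orthography.
  # The "|$" alternative also catches a q at the very end of a word.
  ortho=$(cut -d'\' -f2 "$tmp" | grep -c -E 'q([^u]|$)')

  echo "whole lines: $whole, orthography only: $ortho"
  rm -f "$tmp"
  ```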
  
