Linguistics 5200 Fall 2009

Introduction to computational corpus linguistics

Homework 3 solution

September 21, 2009 · homework

Overall, students did quite well with this assignment. I have added comments to the files you submitted, and the commented versions are now in the Subversion repository. To see the changes, open your terminal, change to the directory <myling5200>/students/<yourname>, and type svn update. Replace <myling5200> and <yourname> with the path to your working copy and your username, respectively.

Class statistics for Homework 3
mean	53.1
standard deviation	8.0

UNIX

Using the celex.txt file, calculate the ratio of heterosyllabic to tautosyllabic st clusters. That is, how frequently do words contain an st cluster that spans two syllables, compared with how frequently they contain an st cluster that falls within a single syllable? Note that each word contains a syllabified transcription in which each syllable is surrounded by brackets []. For example, abacus has three syllables, [&][b@][k@s]. You should use grep and bc to calculate the ratio (compare the question from homework 2 on computing the average number of letters per word for each entry in the Devil's Dictionary). (10 points)
echo "`grep -Ec 's\]\[t' celex.txt` / `grep -Ec '(\[st|st\])' celex.txt`" | bc -l
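As a cross-check on the grep counts, the same classification can be sketched in Python. This is only an illustration under stated assumptions, not part of the assigned solution: the function name st_cluster_ratio and all sample transcriptions except [&][b@][k@s] are hypothetical. Like grep -c, it counts matching lines rather than individual clusters, and like the patterns above it only detects a tautosyllabic st at the start or end of a syllable.

```python
import re

# Same patterns as the grep command above.
HETERO = re.compile(r"s\]\[t")    # s closes one syllable, t opens the next
TAUTO = re.compile(r"\[st|st\]")  # st at the left or right edge of a syllable


def st_cluster_ratio(lines):
    """Return (hetero_count, tauto_count, ratio), counting lines like grep -c."""
    hetero = sum(1 for line in lines if HETERO.search(line))
    tauto = sum(1 for line in lines if TAUTO.search(line))
    ratio = hetero / tauto if tauto else float("nan")
    return hetero, tauto, ratio


# Hypothetical transcriptions; only the first comes from the assignment text.
sample = [
    "[&][b@][k@s]",  # no st cluster
    "[mIs][t@r]",    # heterosyllabic: s][t
    "[stOp]",        # tautosyllabic: [st
    "[fAst]",        # tautosyllabic: st]
    "[sIs][t@m]",    # heterosyllabic: s][t
]
hetero, tauto, ratio = st_cluster_ratio(sample)
print(hetero, tauto, ratio)
```

To run it on the real data, pass the lines of celex.txt (for example, open("celex.txt").read().splitlines()) instead of the sample list.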
How many entries in the Devil's Dictionary have more than 6 letters? Use grep to find out. (5 points)
grep -Ec '^[A-Z-]{7,}.*, [a-z]{1,3}\.' devilsDictionary.txt
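To see what the pattern accepts and rejects, here is a small Python sketch using the same regular expression. The function name count_long_entries and the sample lines are hypothetical, and the sketch assumes, as the grep command does, that an entry starts with an upper-case headword and is followed by a comma and a short part-of-speech abbreviation. Note that the character class also counts hyphens as letters.

```python
import re

# Same pattern as the grep command: a headword of 7 or more capitals or
# hyphens at the start of the line, later followed by ", " plus a 1-3
# letter abbreviation and a period.
ENTRY = re.compile(r"^[A-Z-]{7,}.*, [a-z]{1,3}\.")


def count_long_entries(lines):
    """Count lines that look like entries with more than 6 letters."""
    return sum(1 for line in lines if ENTRY.search(line))


# Hypothetical lines in the assumed entry format (definitions elided).
sample = [
    "ABASEMENT, n. (definition text)",   # 9 letters: counted
    "ABSURD, adj. (definition text)",    # 6 letters: not counted
    "ABDICATION, n. (definition text)",  # 10 letters: counted
    "a continuation line, n. of text",   # lower-case start: not counted
]
long_entries = count_long_entries(sample)
print(long_entries)
```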

Subversion

Submit this homework via Subversion by adding it to your own directory. Make at least 2 separate commits. When you are finished, say so in your log message.

Create a new file called hmwk3_<yourname>.txt, and add it to the svn repository. Show the commands you used (5 points)
pwd #myling5200/students/<myname>
touch hmwk3_<myname>.txt
svn add hmwk3_<myname>.txt
svn commit -m 'adding homework 3 file'
Show the log of all changes you made to your homework 3 file. Show the commands you used (5 points)
pwd #myling5200/students/<myname>
svn log hmwk3_<myname>.txt
Find all log messages pertaining to the slides which contain grep. You will need to use a pipe. Your command should print out not only the line which contains grep, but also the 2 preceding lines. Search the grep manual for “context” to find the appropriate option. (7 points)
pwd #myling5200
svn log slides | grep -Ei -B 2 'grep'
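The "before context" option (-B followed by a number) prints each matching line together with that many preceding lines. A minimal Python sketch of that behaviour follows; the function name grep_before and the simplified log lines are hypothetical, and unlike grep the sketch does not print "--" separators between non-adjacent groups.

```python
import re


def grep_before(lines, pattern, before=2):
    """Return each matching line plus up to `before` preceding lines.

    Matching is case-insensitive, like grep -i. Lines already emitted as
    context for an earlier match are not repeated.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    out = []
    last_emitted = -1
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(i - before, last_emitted + 1)
            out.extend(lines[start:i + 1])
            last_emitted = i
    return out


# Hypothetical, simplified log output: in each entry the revision line
# sits two lines above the log message.
log = [
    "--------",
    "r10 | someuser | 2009-09-01 | 1 line",
    "",
    "added grep slides",
    "--------",
    "r11 | someuser | 2009-09-02 | 1 line",
    "",
    "fixed typo",
]
result = grep_before(log, "grep", before=2)
print("\n".join(result))
```

With two lines of context, each matching message is shown along with the revision line that identifies it, which is why the question asks for the 2 preceding lines.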
Show the changes to your homework 3 file between the final version and the version before that. Show the commands you used (5 points)
pwd #myling5200/students/<myname>
svn diff -r <number>:<number> hmwk3_<myname>.txt

Python

Calculate the percentage of indefinite articles in Moby Dick using the NLTK. You can use the percentage function defined in chapter 1.1 (8 points)

percentage(text1.count('a') + text1.count('an'), len(text1))
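The percentage helper is not reproduced in this solution. Assuming it is defined the way the NLTK book defines it, the calculation can be sketched on a toy token list without loading NLTK. The token list below is made up for illustration and merely stands in for text1. Counting is case-sensitive, so tokens such as 'A' or 'An' would need to be counted separately.

```python
def percentage(count, total):
    """Return count as a percentage of total (assumed NLTK-book definition)."""
    return 100 * count / total


# Toy token list standing in for text1 (hypothetical data).
tokens = "the whale saw a ship and an old man there".split()

indefinite = tokens.count("a") + tokens.count("an")
share = percentage(indefinite, len(tokens))
print(share)
```

Under Python 2, the division in percentage would need from __future__ import division (or a float operand) to avoid integer division.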
Using the dispersion_plot function in NLTK, find one word which has been used with increasing frequency in the inaugural addresses, and one which has been used with decreasing frequency. You can base your decision of increasing vs. decreasing simply on visual inspection of the graphs. (5 points)
Use the random module to generate 2 random integers between 10 and 100, and then calculate the quotient of the first number divided by the second. Make sure to use normal division, not integer division. Look at the help for the random module to find the appropriate function (10 points)

from __future__ import division
import random
print(random.randint(10,100) / random.randint(10,100))
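For comparison, here is a sketch of the same task under Python 3, where / is already true division and the __future__ import is unnecessary. The seed value and the variable names are arbitrary choices for illustration; the seed is there only to make a run repeatable.

```python
import random

random.seed(5200)  # arbitrary seed, only so that repeated runs agree

first = random.randint(10, 100)   # randint includes both endpoints
second = random.randint(10, 100)

quotient = first / second         # true division: always a float
floor_quotient = first // second  # integer (floor) division, for contrast

print(first, second, quotient, floor_quotient)
```

Since second is at least 10, there is no risk of dividing by zero.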

Extra credit:

Use perl and regular expressions to strip the answers from the solutions to homework one. First, download the solution. You might also want to look at a blog entry on perl slurping for hints. (3 extra points)

perl -e '$string = do { local ( $/ ); <> }; $string =~ s/<(code|pre)>.*?<\/pre>//gs; print $string;' < hmwk2.solution > hmwk2.question

Written by Robert Felty