Linguistics 5200 Fall 2009

Introduction to computational corpus linguistics

Homework 3 questions

September 11, 2009 · homework

This homework assignment continues to expand your UNIX skills and starts to use Python and the NLTK. It covers material up to Sep. 15th and is due Sep. 18th by 5:00 p.m. Submit the homework via svn.

UNIX

Using the celex.txt file, calculate the ratio of heterosyllabic to tautosyllabic st clusters. That is, how frequently do words contain an st cluster that spans two syllables, vs. how frequently do they contain an st cluster that is within a single syllable? Note that each word contains a syllabified transcription in which each syllable is surrounded by brackets []. For example, abacus has three syllables, [&][b@][k@s]. You should use grep and bc to calculate the ratio (compare the question from hmwk 2 on computing the average number of letters per word for each entry in the Devil's Dictionary). (10 points)
How many entries in the Devil's Dictionary have more than 6 letters? Use grep to find out. (5 points)
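
The two cluster patterns can be sketched as regular expressions. The sketch below uses Python rather than grep only so that the patterns are easy to check. The transcriptions are made-up stand-ins in the bracketed style shown above, not real celex.txt entries, and the names `TAUTO` and `HETERO` are illustrative. For the assignment itself, the same ideas would need to be expressed with grep, with bc doing the final division.

```python
import re

# Hypothetical syllabified transcriptions in the [..][..] style.
# These are illustrative strings, NOT entries copied from celex.txt.
samples = ["[st&nd]", "[mIs][teIk]", "[lIst]", "[&][b@][k@s]"]

# Tautosyllabic: "st" occurs between one [ and its closing ].
TAUTO = re.compile(r"\[[^\]]*st[^\]]*\]")
# Heterosyllabic: s closes one syllable and t opens the next.
HETERO = re.compile(r"s\]\[t")

# Count the words (not the individual clusters) matching each pattern.
tauto = sum(1 for s in samples if TAUTO.search(s))
hetero = sum(1 for s in samples if HETERO.search(s))

# Ratio of heterosyllabic to tautosyllabic words (true division in Python 3).
ratio = hetero / tauto
print(hetero, tauto, ratio)
```

Anchoring the tautosyllabic pattern inside a pair of brackets keeps it from matching the letters st in the spelling of the word, which matters when a whole line of the file is searched.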

Subversion

Submit this homework via subversion by adding it to your own directory. Make at least 2 separate commits. When you are finished, make sure to say so in your log message.

Create a new file called hmwk3_<yourname>.txt, and add it to the svn repository. Show the commands you used (5 points)
Show the log of all changes you made to your homework 3 file. Show the commands you used (5 points)
Find all log messages pertaining to the slides which contain grep. You will need to use a pipe. Your command should print out not only the line which contains grep, but also the 2 preceding lines. Search the grep manual for “context” to find the appropriate option. (7 points)
Show the changes to your homework 3 file between the final version and the version before that. Show the commands you used (5 points)

Python

Calculate the percentage of indefinite articles in Moby Dick using the NLTK. You can use the percentage function defined in chapter 1.1 (8 points)
Using the dispersion_plot function in the NLTK, find one word which has been used with increasing frequency in the inaugural addresses, and one which has been used with decreasing frequency. You can base your decision of increasing vs. decreasing simply on visual inspection of the graphs. (5 points)
Use the random module to generate 2 random integers between 10 and 100, and then calculate the quotient of the first number divided by the second. Make sure to use normal division, not integer division. Look at the help for the random module to find the appropriate function (10 points)
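
The arithmetic behind the first and third Python questions can be sketched as follows, assuming Python 3. The `percentage` helper is written here as a plain one-line function in the spirit of the one the question refers to. The token list is a toy stand-in rather than Moby Dick, and lowercasing tokens before comparing them to a and an is an assumption of this sketch, not a requirement of the question.

```python
import random

def percentage(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total

# Toy token list standing in for a real text such as Moby Dick.
tokens = ["a", "whale", "is", "an", "animal", "of", "the", "sea", "and", "ocean"]

# Count the indefinite articles, ignoring case.
indefinite = sum(1 for t in tokens if t.lower() in ("a", "an"))
share = percentage(indefinite, len(tokens))

# randint includes both endpoints, so each value lies in 10..100.
first = random.randint(10, 100)
second = random.randint(10, 100)

# In Python 3, / is normal (true) division even for two integers.
quotient = first / second

print(share, first, second, quotient)
```

In Python 2, `/` applied to two integers performs integer division, so one operand would need to be converted with `float()`, or `from __future__ import division` placed at the top of the file.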

Extra credit:

Use Perl and regular expressions to strip the answers from the solutions to homework one. First, download the solution. You might also want to look at a blog entry on Perl slurping for hints. (3 extra points)

Written by Robert Felty

4 Comments to “Homework 3 questions”

anwenq says:

September 16, 2009 at 11:23 am

Hi Rob,

For the first problem, do you want us to avoid counting instances such as “trusts to\[trVsts][tu:]” in the tautosyllabic category?

- robfelty says:
  
  September 17, 2009 at 3:34 pm
  
  Anwen,
  
  Good question. Don’t worry about avoiding compound words. In fact, it could be argued that they are one phonological word.
  
keith.mertz says:

September 17, 2009 at 4:16 pm

Rob,

1. For #6, should we come up with some clever pipe combination to automatically pull version numbers from the log, or should we just note in the answer that “Here I would put ‘version’”?

2. Can we use regular expressions in the basic NLTK commands? Especially for commands like concordance and count and so on. Later in the book it talks about importing regular expressions, so I assume not.

- robfelty says:
  
  September 17, 2009 at 4:28 pm
  
  Keith,
  
  No. You do not need to use a clever pipe. You should just look at the log to find the revision number for your final version and the one before it, and then use those numbers when you show the difference between them.
  
  You can use regular expressions with some of them. One example is the findall method. It will be explicitly mentioned in the documentation though. If you have loaded the stuff from the book, you can type:
  
  help(text1.findall)
  
  to find out how to use findall
  