September 25, 2009 · homework

In this homework you will practice loading various corpora, extracting information from them, and calculating word frequency and conditional word frequency. There will also be questions about conditionals, loops, and list comprehensions. Please put your answers in an executable Python script named hmwk5.py, and commit it to the Subversion repository.
It is due Oct. 2nd and covers material through Sep. 24th.

  1. Create a list called my_ints containing the following values: 10, 15, 24, 67, 1098, 500, 700. (2 points)

  2. Print the maximum value in my_ints (3 points)
  3. Use a for loop and a conditional to print whether each value in my_ints is odd or even. For example, for 10, your program should print “10 is even”. (5 points)
  4. Now create a new list called new_ints and fill it with the values from my_ints that are divisible by 3, doubling each value as you add it. For example, since 15 is divisible by 3, the new list should contain 30 (15*2). Use a for loop and a conditional to accomplish this task. (5 points)
  5. Now do the same thing as in the last question, but use a list comprehension to accomplish the task. (5 points)
  6. Import the Reuters corpus from the NLTK. How many documents contain stories about coffee? (4 points)
  7. Print the number of words in the Reuters corpus which belong to the barley category. (5 points)
  8. Create a conditional frequency distribution of word lengths from the Reuters corpus for the categories barley, corn, and rye. (8 points)
  9. Using the cfd you just created, print out a table which lists cumulative counts of word lengths (up to nine letters long) for each category. (5 points)
  10. Load the devilsDictionary.txt file from the ling5200 svn repository in resources/texts into the NLTK as a plaintext corpus (3 points)
  11. Store a list of all the words from the Devil’s Dictionary into a variable called devil_words (4 points)
  12. Now create a list of words which does not include punctuation, and store it in devil_words_nopunc. Import the string module to get a handy list of punctuation marks, stored in string.punctuation. (5 points)
  13. Create a frequency distribution for each of the two lists of words from the Devil’s Dictionary, one which includes punctuation, and one which doesn’t. Find the most frequently occurring word in each list. (6 points)
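One possible approach to questions 1–5 looks like the following. This is a sketch, not the only acceptable answer:

```python
# Questions 1-2: create the list and print its maximum value.
my_ints = [10, 15, 24, 67, 1098, 500, 700]
print(max(my_ints))  # 1098

# Question 3: report odd or even with a for loop and a conditional.
for n in my_ints:
    if n % 2 == 0:
        print('%d is even' % n)
    else:
        print('%d is odd' % n)

# Question 4: keep the values divisible by 3, doubled, using a loop.
new_ints = []
for n in my_ints:
    if n % 3 == 0:
        new_ints.append(n * 2)
print(new_ints)  # [30, 48, 2196]

# Question 5: the same thing as a list comprehension.
new_ints2 = [n * 2 for n in my_ints if n % 3 == 0]
print(new_ints2)  # [30, 48, 2196]
```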
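For the Reuters questions (6–9), something along these lines should work, assuming the corpus has been downloaded to your machine (the download call below fetches it if it hasn’t):

```python
import nltk
nltk.download('reuters', quiet=True)  # no-op if already present
from nltk.corpus import reuters

# Question 6: number of documents filed under the coffee category.
print(len(reuters.fileids('coffee')))

# Question 7: number of words in the barley category.
print(len(reuters.words(categories='barley')))

# Question 8: conditional frequency distribution of word lengths
# for the barley, corn, and rye categories.
cfd = nltk.ConditionalFreqDist(
    (category, len(word))
    for category in ['barley', 'corn', 'rye']
    for word in reuters.words(categories=category))

# Question 9: cumulative counts of word lengths up to nine letters.
cfd.tabulate(samples=range(1, 10), cumulative=True)
```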
Written by Robert Felty


1 Comment to “Homework 5 – Using the NLTK to investigate corpora and word frequency”

  1. robfelty says:

    Sam asks:

    I was wondering on Question 10, are we supposed to perform that operation in Python, and if so, how do we access svn with Python?

    Yes, you should do it in Python. However, you don’t have to interact with svn at all. All you have to do is update your working copy of the class repository; then the devilsDictionary.txt file is simply a regular file on your computer.
