Linguistics 5200 Fall 2009

Introduction to computational corpus linguistics

Homework 4 solutions – More regular expressions, Python lists and word frequency

September 28, 2009 · homework

Overall, students did quite well on this assignment. Comments are in your files in the Subversion repository. Run svn update from your personal directory to get the latest version. My Python solution file is also in the repository under resources/hmwk

Class statistics for Homework 4
mean	54.3
standard deviation	4.8

Regular expressions

Use grep to print out all words from the celex.txt file which have a
frequency of 100 or more, and either start with a-l, or end with m-z. Note that
the orthography column comes right before the frequency column, so this is
possible to do using a single regular expression. You should use one grouping
and several character classes. Hint 1: use negative character classes to avoid
getting stuff that is in more than one column. Hint 2: Consider what is
different between the numbers 87 and 99, vs. the numbers 100 and 154.

Here are some example words which should be included:
- at (starts with a-l)
- yellow (ends with m-z)
Here are some example words which should be excluded:
- omega (does not start with a-l, does not end with m-z)
- abacus (starts with a-l, but has a frequency less than 100)
(10 points)

grep -E '^[^\\]*\\([a-lA-L][^\\]*|[^\\]*[m-zM-Z])\\[0-9]{3,}' celex.txt
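
One way to sanity-check the pattern without the full lexicon is to run it over a few made-up lines in Python. The sketch below assumes each line has the form id\word\frequency, which is inferred from the exercise description; the sample lines and their frequencies are invented for illustration and are not taken from celex.txt. Python's re module reads this particular pattern the same way grep -E does.

```python
import re

# Same pattern as the grep -E command, written as a raw string.
PATTERN = re.compile(r'^[^\\]*\\([a-lA-L][^\\]*|[^\\]*[m-zM-Z])\\[0-9]{3,}')

# Hypothetical lines in an assumed id\word\frequency layout (not real CELEX data).
sample_lines = [
    r'1\at\500',
    r'2\yellow\150',
    r'3\omega\200',
    r'4\abacus\10',
]

# Keep the word column of every line the pattern accepts.
matched_words = [line.split('\\')[1] for line in sample_lines if PATTERN.search(line)]
print(matched_words)
```

In this toy data, omega is rejected because it neither starts with a-l nor ends with m-z, and abacus is rejected because [0-9]{3,} needs at least three consecutive digits. That digit count is what separates 87 and 99 from 100 and 154.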

Python

Create a Python script called hmwk4.py, and put all your answers in this file.
You can use comments in the file if you like. Before each answer, print out the
question number and a brief description. For example:

print('#2 - first 10 words in the holy grail')

Print the first 10 words from Monty Python and the Holy Grail (text6). (3 points)

from nltk.book import *  # loads text1, text6 and FreqDist used in the answers below
print(text6[:10])

Print the last 20 words from Moby Dick (text1). (4 points)

print(text1[-20:])
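
The same slice notation works on ordinary Python lists, so it can be tried out without loading the corpora. This small sketch uses a made-up word list (words is a hypothetical name) in place of text6 and text1.

```python
# A toy word list standing in for an NLTK text (hypothetical data).
words = ['We', 'are', 'the', 'knights', 'who', 'say', 'Ni', '!']

first_three = words[:3]   # from the start up to, but not including, index 3
last_two = words[-2:]     # from two before the end through the end

print(first_three)
print(last_two)
```

A negative start index counts from the end of the list, which is why the last 20 words can be requested without knowing the length of the text.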

Create a frequency distribution of The Holy Grail (text6). Store it in the
variable called moby_dist. (4 points)

moby_dist = FreqDist(text6)  # the name comes from the assignment, although the text is The Holy Grail

Print the number of times the word “Grail” occurs in this text. (4 points)

print(moby_dist['Grail'])

Print the most frequent word in The Holy Grail. (Hint: note that punctuation is counted as words by the NLTK. That is, the answer might be a punctuation mark.) (4 points)

print(moby_dist.max())
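
For readers without the NLTK data installed, the counting idea can be sketched with collections.Counter from the standard library. This is a stand-in for FreqDist, not the course solution; the token list is invented, and most_common(1) plays the role that max() plays above.

```python
from collections import Counter

# Hypothetical tokens; punctuation counts as a token, as in NLTK texts.
tokens = ['the', 'Grail', ',', 'the', 'Holy', 'Grail', ',', 'and', ',']

dist = Counter(tokens)
grail_count = dist['Grail']                     # lookups work like a dictionary
top_token, top_count = dist.most_common(1)[0]   # most frequent token and its count

print(grail_count)
print(top_token, top_count)
```

In this toy list the comma is the most frequent token, which mirrors the hint that the answer for the real text may be a punctuation mark.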

Create a list which contains the length of each word in The Holy
Grail (text6). Store it in a variable called holy_lengths. Do the same for Moby
Dick (text1), and store it in a variable called moby_lengths. (6 points)

moby_lengths = [len(w) for w in text1]
holy_lengths = [len(w) for w in text6]

Create a frequency distribution of word lengths for Moby Dick and The Holy Grail. Store the distributions in variables called moby_len_dist and holy_len_dist respectively. (6 points)

moby_len_dist = FreqDist(moby_lengths)
holy_len_dist = FreqDist(holy_lengths)

Print the most commonly occurring word length for Moby Dick and for The Holy Grail. (Use one command for each.) (5 points)

print(moby_len_dist.max())
print(holy_len_dist.max())
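
A frequency distribution can be built over numbers as well as words: here the keys are word lengths rather than tokens. The sketch below illustrates this on an invented token list, again using collections.Counter as a stand-in for FreqDist.

```python
from collections import Counter

# Hypothetical tokens standing in for a full text.
tokens = ['A', 'moose', 'once', 'bit', 'my', 'sister', '.']

lengths = [len(w) for w in tokens]   # one integer per token
len_dist = Counter(lengths)          # keys are lengths, values are how often each occurs

most_common_length, occurrences = len_dist.most_common(1)[0]
print(lengths)
print(most_common_length, occurrences)
```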

Calculate the mean word length for Moby Dick and The Holy Grail. You can use the sum() function to calculate the total number of characters in the text. For example, sum([22, 24, 3]) returns 49. Store the results in variables holy_mean_len and moby_mean_len respectively. (6 points)

holy_mean_len = sum(holy_lengths) / len(holy_lengths)
moby_mean_len = sum(moby_lengths) / len(moby_lengths)
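
The division behaves differently across Python versions: under Python 3 the / operator returns a float, while under Python 2 dividing two integers truncates the result. The sketch below reuses the example numbers from the question and shows one form that gives a fractional mean in either version.

```python
lengths = [22, 24, 3]  # the example numbers from the question

total_chars = sum(lengths)                     # 49
mean_len = total_chars / float(len(lengths))   # float() avoids truncation on Python 2

print(round(mean_len, 2))
```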

Create a list of words from Moby Dick which have more than 3 letters and
fewer than 7 letters. Store it in a variable called four_to_six. (8 points)

four_to_six = [w for w in text1 if len(w) > 3 and len(w) < 7]
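
Python also allows the two comparisons to be chained, which reads closer to the wording of the question. The sketch below uses a made-up word list in place of text1 and checks that both forms select the same words.

```python
# Hypothetical stand-in for text1.
words = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']

with_and = [w for w in words if len(w) > 3 and len(w) < 7]   # form used in the solution
chained = [w for w in words if 3 < len(w) < 7]               # equivalent chained comparison

print(chained)
```

Both bounds are strict, so words of exactly 3 or exactly 7 characters are left out.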

Written by Robert Felty
