Linguistics 5200 Fall 2009

Introduction to computational corpus linguistics

Homework 10 Solution

November 16, 2009 · homework

Most students did well on this assignment. Please look closely at my solution in resources/hmwk/hmwk10.py.

Class statistics for Homework 10
mean	51.67
standard deviation	7.28

Use svn to copy my solution to homework 8 from resources/py into your personal directory as hmwk10.py (5 points)

svn cp resources/py/hmwk8.py students/robfelty/hmwk10.py
Modify the mean_word_len and mean_sent_len functions to accept two optional arguments, ignore_stop and use_set. The default for each of these should be True. If use_set is True, convert the stopword list to a set. If ignore_stop is True, exclude stopwords from the calculation (which is what the code in hmwk8.py does). (15 points)

# requires: import string; from nltk.corpus import stopwords

def mean_sent_len(sents, ignore_stop=True, use_set=True):
    ''' returns the average number of words per sentence

    Input should be a list of lists, with each item in the list being a
    sentence, composed of a list of words. Punctuation is always ignored;
    stopwords are ignored when ignore_stop is True.
    '''
    if use_set:
        eng_stopwords = set(stopwords.words('english'))
    else:
        eng_stopwords = stopwords.words('english')
    if ignore_stop:
        words_no_punc = [w for s in sents for w in s
                         if w not in string.punctuation
                         and w.lower() not in eng_stopwords]
    else:
        words_no_punc = [w for s in sents for w in s
                         if w not in string.punctuation]
    num_words = len(words_no_punc)
    num_sents = len(sents)
    # float() guards against integer division in Python 2
    return float(num_words) / num_sents

def mean_word_len(words, ignore_stop=True, use_set=True):
    ''' returns the average number of letters per word

    Input should be a list of words. Punctuation is always ignored;
    stopwords are ignored when ignore_stop is True.
    '''
    if use_set:
        eng_stopwords = set(stopwords.words('english'))
    else:
        eng_stopwords = stopwords.words('english')
    if ignore_stop:
        words_no_punc = [w for w in words
                         if w not in string.punctuation
                         and w.lower() not in eng_stopwords]
    else:
        words_no_punc = [w for w in words
                         if w not in string.punctuation]
    num_words = len(words_no_punc)
    num_chars = sum([len(w) for w in words_no_punc])
    # float() guards against integer division in Python 2
    return float(num_chars) / num_words
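
To see how the two optional arguments interact without installing NLTK, here is a minimal sketch. The short STOPWORDS list and the name mean_word_len_demo are stand-ins invented for this illustration and are not part of hmwk10.py; the actual solution uses stopwords.words('english').

```python
import string

# Hypothetical stand-in for stopwords.words('english')
STOPWORDS = ['the', 'a', 'of', 'and', 'is']

def mean_word_len_demo(words, ignore_stop=True, use_set=True):
    """Mean letters per word, skipping punctuation and (optionally) stopwords."""
    # A set gives constant-time membership tests; a list is scanned linearly.
    stop = set(STOPWORDS) if use_set else STOPWORDS
    kept = [w for w in words
            if w not in string.punctuation
            and not (ignore_stop and w.lower() in stop)]
    return float(sum(len(w) for w in kept)) / len(kept)

words = ['The', 'cat', 'is', 'asleep', '.']
print(mean_word_len_demo(words))                     # stopwords ignored
print(mean_word_len_demo(words, ignore_stop=False))  # stopwords counted
```

Switching use_set changes only the speed of the lookup, never the result, which is why it is worth timing separately.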
Now create a new file called means_timing.py. In this file, import your hmwk10.py module, and use the timeit module to test how long it takes to calculate the mean sentence length 100 times, trying all 4 combinations of the parameters of use_set and ignore_stop. (10 points)

from timeit import Timer
import nltk
import hmwk10

# the setup string runs once per Timer and is not included in the timing
setup = '''import nltk
import hmwk10
f = open('../texts/Candide.txt')
raw = f.read()
f.close()
sents = hmwk10.sent_tokenize(raw)
words = nltk.word_tokenize(raw)
'''
# defaults: use_set=True, ignore_stop=True
test1 = 'hmwk10.mean_word_len(words)'
print(Timer(test1, setup).timeit(100))

test2 = 'hmwk10.mean_word_len(words, use_set=False)'
print(Timer(test2, setup).timeit(100))

test3 = 'hmwk10.mean_word_len(words, use_set=False, ignore_stop=False)'
print(Timer(test3, setup).timeit(100))

test4 = 'hmwk10.mean_word_len(words, use_set=True, ignore_stop=False)'
print(Timer(test4, setup).timeit(100))
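
The effect of use_set can be reproduced without the corpus files. The sketch below times membership tests against a list and against a set built from the same items. The 150 fake stopwords and the token list are made up for the demonstration, so the absolute numbers will differ from those for Candide.

```python
from timeit import Timer

# Hypothetical data: 150 fake stopwords and 1,000 tokens, none of which match,
# so every lookup against the list has to scan all 150 entries.
setup = '''
stop_list = ['stop%d' % i for i in range(150)]
stop_set = set(stop_list)
tokens = ['word%d' % i for i in range(1000)]
'''

tests = {
    'list': '[w for w in tokens if w not in stop_list]',
    'set': '[w for w in tokens if w not in stop_set]',
}

timings = {}
for name in sorted(tests):
    # timeit(100) runs the statement 100 times and returns the total seconds
    timings[name] = Timer(tests[name], setup).timeit(100)
    print('%-4s %.4f s' % (name, timings[name]))
```

On most machines the set version should come out clearly faster, because each lookup hashes the token instead of comparing it with every stopword in turn.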
Now add another global option called include-stop (i for short) to hmwk10.py specifying whether or not to ignore stopwords when calculating mean word length and sentence length. The default should be False. (10 points)
opts, args = getopt.gnu_getopt(sys.argv[1:], "hwsajni",
    ["help", "word", "sent", 'ari', 'adj', 'noheader', 'include-stop'])
include_stop = False
for o, a in opts:
    if o in ("-i", "--include-stop"):
        include_stop = True

# in calc_text_stats: the functions take ignore_stop, the inverse of include_stop
mean_sent_length = mean_sent_len(sents, ignore_stop=not include_stop)
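
The option handling can be tried in isolation. This is a small sketch that wraps the same getopt.gnu_getopt call in a helper; the function name parse_options and the sample argument lists are assumptions made for the illustration, not code from hmwk10.py.

```python
import getopt

def parse_options(argv):
    """Return (include_stop, remaining_args) for a list of command-line arguments."""
    opts, args = getopt.gnu_getopt(argv, "hwsajni",
        ["help", "word", "sent", "ari", "adj", "noheader", "include-stop"])
    include_stop = False  # default: stopwords are ignored
    for o, a in opts:
        if o in ("-i", "--include-stop"):
            include_stop = True
    return include_stop, args

print(parse_options(["-ws", "Candide.txt"]))
print(parse_options(["-wsi", "Candide.txt"]))
# gnu_getopt also accepts options that come after the file names
print(parse_options(["Candide.txt", "--include-stop"]))
```

Because mean_word_len and mean_sent_len take ignore_stop rather than include_stop, the parsed flag has to be inverted when it is passed on, for example ignore_stop=not include_stop.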
Modify the calc_text_stats function so that it also computes the percentage of words that are stop words. (10 points)
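
One possible way to compute that percentage is sketched below. The helper name percent_stop_words and the small STOPWORDS set are hypothetical, and the choice to leave punctuation out of the denominator is an assumption; the solution file may count tokens differently.

```python
import string

# Hypothetical stand-in for set(stopwords.words('english'))
STOPWORDS = set(['the', 'a', 'of', 'and', 'is', 'in'])

def percent_stop_words(words, stop=STOPWORDS):
    """Percentage of non-punctuation tokens that are stopwords."""
    tokens = [w for w in words if w not in string.punctuation]
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    num_stop = sum(1 for w in tokens if w.lower() in stop)
    return 100.0 * num_stop / len(tokens)

words = ['The', 'cat', 'is', 'in', 'the', 'garden', '.']
print('%.1f' % percent_stop_words(words))  # 4 of 6 tokens are stopwords
```

Inside calc_text_stats the same calculation would run on the full token list, with the result formatted to one decimal place to match the per_stop_words column.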
Now create a bash script which prints out the mean word and sentence length for Huck Finn, Tom Sawyer, Candide, and The Devil’s Dictionary. Pipe the output to sort to sort by mean sentence length. Try it both including and ignoring stop words. Your output (when ignoring stop words) should look like the table below. (10 points)
```
filename          mean_word_len mean_sent_len per_stop_words
tomSawyer                  5.51          7.46            42.2
Candide                    6.07          9.04            43.5
huckFinn                   4.93          9.32            45.0
devilsDictionary           6.30         10.08            40.2
```
./text_means.py -wsi ../texts/{tomSawyer,huckFinn,Candide,devilsDictionary}.txt |sort -nk 3
./text_means.py -ws ../texts/{tomSawyer,huckFinn,Candide,devilsDictionary}.txt |sort -nk 3

Written by Robert Felty
