October 27, 2009 · homework

Most students did fairly well on this assignment. Please take a detailed look at my solution in resources/hmwk/hmwk8.py

Class statistics for Homework 8
mean 49.71
standard deviation 9.2
  1. Readability measures are used to score the reading difficulty of a text, for the purpose of selecting texts of appropriate difficulty for language learners. Let us define μw to be the average number of letters per word, and μs to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: 4.71 μw + 0.5 μs − 21.43. Define a function which computes the ARI score. It should accept two arguments: the mean word length and the mean sentence length. (5 points)

    def calc_ari(mean_sent, mean_word):
        'compute the Automated Readability Index from mean sentence and word lengths'
        ari = 4.71 * mean_word + 0.5 * mean_sent - 21.43
        return ari
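
    A quick sanity check with made-up numbers (4.5 letters per word and 20 words per sentence, purely illustrative values); note that the function takes the mean sentence length as its first argument:

    print '%.3f' % calc_ari(20, 4.5)   # 4.71 * 4.5 + 0.5 * 20 - 21.43 = 9.765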
  2. One feature of English is that it is easy to turn verbs into nouns and adjectives by using participles. For example, in the phrase the burning bush, the verb burn is used as an adjective, via its present participle form. Create a function called verb_adjectives which uses the findall method from the NLTK to find present participles used as adjectives. For simplicity, find only adjectives that are preceded by an article (a, an, the). Make sure that they have a word following them (not punctuation). The function should accept a list of tokens, as returned by the words() function in the NLTK. Note that all present participles in English end in ing. Unfortunately, the nltk findall function which we used in class prints out the words instead of returning them, which means we cannot use it inside another function. (Go ahead and try it anyway, and convince yourself why it is generally bad to print from functions instead of returning results, unless the function's only purpose is to print something out, e.g. pprint.) So, I will get you started on the functions you need to use:

    import nltk
    from nltk.corpus import gutenberg

    regexp = r'<a><.*><man>'
    moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
    bracketed = nltk.text.TokenSearcher(moby)
    hits = bracketed.findall(regexp)
    

    This returns a list of lists, where each inner list contains the three-word phrase that matched. So your main task is to come up with the correct regular expression. (7 points)

    def verb_adjectives(tokens):
        'return a list of 3-word phrases where a verb is used as an adjective'
        # an article, then a present participle (-ing), then a following word
        regexp = r'<a|an|the><.*ing><\w+>'
        bracketed = nltk.text.TokenSearcher(tokens)
        hits = bracketed.findall(regexp)
        return hits
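
    Assuming the Moby Dick tokens from the starter code above, you can eyeball the results like this (the exact hits depend on your corpus version, so treat this as a sanity check rather than expected output):

    moby = gutenberg.words('melville-moby_dick.txt')
    for phrase in verb_adjectives(moby)[:5]:
        print ' '.join(phrase)   # each phrase is a 3-token list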
  3. As we have seen in class, most computational linguistics involves a combination of automation and hand-checking. Let's refine our verb_adjectives function by ensuring that none of the words following the adjective are in the stopwords corpus. Without this check, we get results like ['an', 'understanding', 'of'], where understanding is being used as a noun, not an adjective. Use a list comprehension to remove all hits where the third word in the list is a stopword. (7 points)
    def verb_adjectives(tokens):
        'return a list of 3-word phrases where a verb is used as an adjective'
        # an article, then a present participle (-ing), then a following word
        regexp = r'<a|an|the><.*ing><\w+>'
        bracketed = nltk.text.TokenSearcher(tokens)
        hits = bracketed.findall(regexp)
        # drop hits whose third word is a stopword, e.g. 'an understanding of'
        eng_stop = nltk.corpus.stopwords.words('english')
        hits = [h for h in hits if h[2].lower() not in eng_stop]
        return hits
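
    To see what the list comprehension does in isolation, here it is applied to a hand-made hit list (the first entry is the example from the problem; the second reuses the burning bush phrase):

    eng_stop = nltk.corpus.stopwords.words('english')
    hits = [['an', 'understanding', 'of'], ['the', 'burning', 'bush']]
    print [h for h in hits if h[2].lower() not in eng_stop]
    # prints [['the', 'burning', 'bush']] -- 'of' is a stopword, so the first hit is dropped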
  4. Add three more options to your script, -j (--adj), -a (--ari), and -n (--noheader). Note that if the --ari option is specified, then you should also print out the mean word length and mean sentence length. Your options should now look like:
    -w --word print only mean word length
    -s --sent print only mean sentence length
    -h --help print this help information and exit
    -a --ari  print ari statistics
    -j --adj  print mean number of adjectival verbs per sentence
    -n --noheader do not print a header line
    

    (10 points)
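
    Here is a minimal sketch of the option handling, assuming getopt-style parsing (an assumption on my part; adapt it to whatever your homework 7 script used). The show* flag names match the ones used by calc_text_stats in the next problem. Note how --ari switches on the word and sentence statistics as well, as the problem requires:

    import getopt
    import sys

    opts, args = getopt.getopt(sys.argv[1:], 'wshajn',
                               ['word', 'sent', 'help', 'ari', 'adj', 'noheader'])
    showword = showsent = showari = showadj = False
    showheader = True
    for opt, val in opts:
        if opt in ('-w', '--word'):
            showword = True
        elif opt in ('-s', '--sent'):
            showsent = True
        elif opt in ('-a', '--ari'):
            # --ari implies also printing the word and sentence means
            showari = showword = showsent = True
        elif opt in ('-j', '--adj'):
            showadj = True
        elif opt in ('-n', '--noheader'):
            showheader = False
        elif opt in ('-h', '--help'):
            print __doc__
            sys.exit(0)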

  5. Now modify your script so that it can accept either stdin or one or more files as input. Use the stdin_or_file() function in args.py as an example. Your script will no longer print out usage information when no arguments are given, as was the case for homework 7. Create a function called calc_text_stats to handle all the calculations. That way you can call this function either multiple times (once per file, if files are specified), or just once, if reading from stdin. This will make your code more readable. You should also make sure to handle the two new options, for ari and adj. (20 points)
    def calc_text_stats(text, showsent, showword, showari, showadj):
        'return a formatted row of statistics for a raw text'
        # mean_sent_len() and mean_word_len() are defined elsewhere in hmwk8.py
        words = nltk.word_tokenize(text)
        sents = nltk.sent_tokenize(text)
        if showsent:
            mean_sent_length = mean_sent_len(sents)
            mean_sent_print = '%13.2f' % mean_sent_length
        else:
            mean_sent_print = ''
        if showword:
            mean_word_length = mean_word_len(words)
            mean_word_print = '%13.2f' % mean_word_length
        else:
            mean_word_print = ''
        if showari:
            # the option handling guarantees that showword and showsent are
            # also set when showari is, so both means are already computed
            ari = '%13.2f' % calc_ari(mean_sent_length, mean_word_length)
        else:
            ari = ''
        if showadj:
            adjs = verb_adjectives(words)
            # float() avoids silent integer division under Python 2
            mean_adjs = '%13.3f' % (float(len(adjs)) / len(sents))
        else:
            mean_adjs = ''
        return '%s %s %s %s' % (mean_word_print, mean_sent_print,
                                ari, mean_adjs)
    if showheader:
        headers = ['filename']
        if showword:
            headers.append('mean_word_len')
        if showsent:
            headers.append('mean_sent_len')
        if showari:
            headers.append('ari')
        if showadj:
            headers.append('adjectiv_verbs')
        format_string = '%-17s ' + '%13s ' * (len(headers) - 1)
        print format_string % tuple(headers)
    if len(args) > 0:
        for infile in args:
            # os.path.splitext removes the extension reliably;
            # rstrip('.txt') would strip any trailing t, x, or . characters
            filename = os.path.splitext(os.path.basename(infile))[0]
            f = open(infile)
            raw = f.read()
            f.close()
            # print the filename first so the columns line up with the header
            print '%-17s %s' % (filename,
                    calc_text_stats(raw, showsent, showword, showari, showadj))
    else:
        raw = sys.stdin.read()
        print '%-17s %s' % ('stdin',
                calc_text_stats(raw, showsent, showword, showari, showadj))
  6. Now print out the mean word length, mean sentence length, ari, and the mean number of present participles used as adjectives per sentence for huckFinn, tomSawyer, Candide, and devilsDictionary. Pipe the output to sort, and sort by ari. Your output should be similar to homework 7. Show the BASH command you used. (11 points)
    #!/bin/bash
    # sort numerically on column 4 (filename, word, sent, ari, adj), i.e. by ari
    students/robfelty/hmwk8.py --noheader \
      resources/texts/{huckFinn,tomSawyer,Candide,devilsDictionary}.txt | sort -nk 4
Written by Robert Felty

