October 18, 2009 · homework

In this homework you will expand upon the code you wrote in homework 7, reusing the functions you wrote to calculate mean word and sentence length. You will now add the ability to read from stdin. Note that for this assignment you should not print out any information for the individual questions; print only the output requested in the last question. Make sure to read all of the questions before starting the assignment. It is due Oct. 23rd and covers material up to Oct. 15th.

  1. Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define μw to be the average number of letters per word, and μs to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: 4.71 μw + 0.5 μs – 21.43. Define a function which computes the ARI score. It should accept two arguments – the mean word length, and the mean sentence length. (5 points)
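As a sketch, such a function is essentially a one-liner translating the formula above; the name ari is just a suggestion:

```python
def ari(mean_word_len, mean_sent_len):
    """Compute the Automated Readability Index from the mean number of
    letters per word and the mean number of words per sentence."""
    return 4.71 * mean_word_len + 0.5 * mean_sent_len - 21.43
```

For example, a text averaging 4 letters per word and 20 words per sentence gets an ARI of 4.71 * 4 + 0.5 * 20 - 21.43 = 7.41.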

  2. One feature of English is that it is easy to turn verbs into nouns and adjectives by using participles. For example, in the phrase the burning bush, the verb burn is used as an adjective, via its present participle form. Create a function called verb_adjectives which uses NLTK's findall functionality to find present participles used as adjectives. For simplicity, find only adjectives that are preceded by an article (a, an, the), and make sure that a word follows them (not punctuation). The function should accept a list of tokens, as returned by NLTK's words() function. Note that all present participles in English end in ing. Unfortunately, the nltk findall function which we used in class prints out the words instead of returning them, which means we cannot use it in a function. (Go ahead and try to use it, and convince yourself why it is generally bad to print stuff from functions instead of just returning results, unless the function’s only purpose is to print something out, e.g. pprint.) So, I will get you started on the functions you need to use:

    import nltk
    from nltk.corpus import gutenberg

    regexp = r'<a><.*><man>'
    moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
    bracketed = nltk.text.TokenSearcher(moby)
    hits = bracketed.findall(regexp)

    This returns a list of lists, where each inner list contains the three-word phrase that matched. So your main task is to come up with the correct regular expression. (7 points)
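One possible shape for verb_adjectives, assuming the pattern described above (an article, then an ing-word, then a following word); the exact regular expression is the part you should work out yourself:

```python
import nltk

def verb_adjectives(tokens):
    """Return three-word hits where an -ing word follows an article
    and is itself followed by a word (not punctuation)."""
    searcher = nltk.text.TokenSearcher(tokens)
    # <a|an|the> matches an article, <\w+ing> a word ending in -ing,
    # and <\w+> requires the next token to be a word, not punctuation.
    return searcher.findall(r'<a|an|the><\w+ing><\w+>')
```

Unlike the findall we used in class, TokenSearcher.findall returns the hits rather than printing them, which is what makes it usable inside a function.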

  3. As we have seen in class, most computational linguistics involves a combination of automation and hand-checking. Let’s refine our verb_adjectives function by ensuring that none of the words following the adjective are in the stopwords corpus. Without doing this, we get results like ['an', 'understanding', 'of'], where understanding is being used as a noun, not an adjective. Use a list comprehension to remove all hits where the third word in the list is a stopword. (7 points)
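A sketch of the filtering step; the stops parameter is an addition for testability and defaults to NLTK's English stopword list:

```python
def filter_stopword_hits(hits, stops=None):
    """Drop hits whose third word is a stopword, removing noun uses
    such as ['an', 'understanding', 'of']."""
    if stops is None:
        # Default to NLTK's English stopword list (requires the corpus).
        from nltk.corpus import stopwords
        stops = set(stopwords.words('english'))
    return [hit for hit in hits if hit[2] not in stops]
```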
  4. Add three more options to your script, -j (--adj), -a (--ari), and -n (--noheader). Note that if the --ari option is specified, then you should also print out the mean word length and mean sentence length. If no options are given, the script should behave as if the options -wsaj were given. Your options should now look like:
    -w --word print only mean word length
    -s --sent print only mean sentence length
    -h --help print this help information and exit
    -a --ari  print ari statistics
    -j --adj  print mean number of adjectival verbs per sentence
    -n --noheader do not print a header line

    (10 points)
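One way to sketch the option handling with the standard getopt module; the function name parse_options and the set of flag names it returns are assumptions, not part of the assignment:

```python
import getopt

def parse_options(argv):
    """Parse the flags from the option table above and return a set of
    enabled option names plus the remaining (file) arguments."""
    opts, args = getopt.getopt(argv, 'wshajn',
        ['word', 'sent', 'help', 'ari', 'adj', 'noheader'])
    enabled = set()
    for opt, _ in opts:
        if opt in ('-w', '--word'): enabled.add('word')
        elif opt in ('-s', '--sent'): enabled.add('sent')
        elif opt in ('-h', '--help'): enabled.add('help')
        elif opt in ('-a', '--ari'): enabled.add('ari')
        elif opt in ('-j', '--adj'): enabled.add('adj')
        elif opt in ('-n', '--noheader'): enabled.add('noheader')
    # No display options given: behave as if -wsaj had been specified.
    if not enabled & {'word', 'sent', 'ari', 'adj'}:
        enabled |= {'word', 'sent', 'ari', 'adj'}
    # --ari implies also printing mean word and sentence length.
    if 'ari' in enabled:
        enabled |= {'word', 'sent'}
    return enabled, args
```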

  5. Now modify your script so that it can accept either stdin or one or more files as input. Use the stdin_or_file() function in args.py as an example. Your script will no longer print out usage information when no arguments are given, as was the case for homework 7. Create a function called calc_text_stats to handle all the calculations. That way you can call this function either multiple times (once per file, if files are specified), or just once, if reading from stdin. This will make your code more readable. You should also make sure to handle the two new options, for ari and adj. (20 points)
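A sketch of the dispatch logic, assuming calc_text_stats takes an open file object and a display name; the placeholder body here just counts tokens, where your real version would compute and print the requested statistics:

```python
import sys

def calc_text_stats(fileobj, name):
    """Placeholder for the real worker; here it just returns the
    name and token count so the dispatch logic is demonstrable."""
    return name, len(fileobj.read().split())

def process(files):
    """Call calc_text_stats once per named file, or once on stdin
    when no files are given."""
    results = []
    if files:
        for fname in files:
            with open(fname) as f:
                results.append(calc_text_stats(f, fname))
    else:
        results.append(calc_text_stats(sys.stdin, '<stdin>'))
    return results
```

Factoring the calculations out this way is what lets the same code serve both the per-file and the stdin cases.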
  6. Now print out the mean word length, mean sentence length, ARI, and the mean number of present participles used as adjectives per sentence for huckFinn, tomSawyer, Candide, and devilsDictionary. Pipe the output to sort, and sort by ARI. Your output should be similar to homework 7. Save the bash command you used in a script called ari. Make sure that it is executable. (11 points)
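The ari script might look something like the following; the script name stats.py, the file names, and the column position of the ARI value are all assumptions about your particular setup:

```shell
#!/bin/bash
# Run the stats script on all four texts, using -n to suppress the
# header so that sort sees only data lines, then sort numerically on
# the ARI column (assumed here to be tab-separated column 3).
./stats.py -n huckFinn.txt tomSawyer.txt candide.txt devilsDictionary.txt \
    | sort -t $'\t' -k3 -n
```

Remember chmod +x ari so that the script is executable.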
Written by Robert Felty
