It seems that several people are still a bit confused about what I would like you to do for homework 7. Your program should function much like any other UNIX program. For example, consider the wc program: I can use wc to count the lines, words, and characters in several files (here, from the resources/py directory):
wc args.py opts.py auto_histling.py
I specified three files on the command line, separated by spaces.
The output should be something like:
 37 116  895 args.py
 34 105  847 opts.py
 77 335 3167 auto_histling.py
148 556 4909 total
Note that the columns are nicely aligned. Your program should work in a similar way, except that it will be printing out mean word and sentence length.
The getopt function returns two lists: a list of (option, value) pairs and a list of the remaining arguments. The arguments should be the filenames you specified on the command line. You will want to loop over the arguments and process each file one at a time.
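Here is a minimal sketch of that pattern (the short and long option names follow the homework 7 spec; the per-file work is just a placeholder):
import sys
import getopt

# getopt returns two lists: (option, value) pairs, and the leftover arguments
opts, args = getopt.getopt(sys.argv[1:], 'wsh', ['word', 'sent', 'help'])
for filename in args:
    print 'would process', filename  # placeholder for the real per-file work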
Rob
Here are the slides from today’s class covering normalization and tokenization using regular expressions and the NLTK. We did not get to the last section on tokenization, so we will postpone that until Tuesday. Have a nice weekend.
ling5200-nltk3-3-slides.pdf
Here are the notes for today covering the use of regular expressions for text normalization and tokenization.
ling5200-nltk3-3-notes.pdf
Here are the slides from today’s class covering Unicode and regular expressions in Python. I corrected the problem with the codecs.open example.
ling5200-nltk3-2-slides.pdf
Here are the notes for today’s class covering Unicode and regular expressions in Python.
ling5200-nltk3-2-notes.pdf
Most students did well on this homework.
I made an error in the description for question 3: I had the wrong values for mean word and sentence length for Moby Dick, as I had not converted words to lowercase before comparing with the stopword corpus. I have corrected that in the solution here, and I did not take off any points if you did not convert to lowercase. My solution Python file is in the repository under resources/homework.
I would also like to remind you of a couple of things:
- make sure your Python files are executable
- have a proper shebang line as the very first line of your file
- use 4 spaces for indenting, not tabs, and please do not mix tabs and spaces
Starting with homework 7, I will begin taking off 5 points each for any of the above mistakes.
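For reference, the very first line of your file should be a shebang line such as:
#!/usr/bin/env python
and you can make the file executable from BASH with:
chmod +x hmwk7.py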
Class statistics for Homework 5
mean               49.71
standard deviation 14.14
- Create a function called mean_word_len, which accepts a list of words (e.g. text1, Moby Dick), and returns the mean characters per word. You should remove punctuation and stopwords from the calculation. (10 points)
from __future__ import division  # so the means come out as floats in Python 2
from pprint import pprint
import nltk
from nltk.corpus import stopwords
import string

def mean_word_len(words):
    eng_stopwords = stopwords.words('english')
    # drop punctuation tokens and (case-insensitively) English stopwords
    words_no_punc = [w for w in words
                     if w not in string.punctuation
                     and w.lower() not in eng_stopwords]
    num_words = len(words_no_punc)
    num_chars = sum(len(w) for w in words_no_punc)
    return num_chars / num_words
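As a quick sanity check, you can call the function on one of the NLTK book texts (importing nltk.book prints some loading messages the first time):
from nltk.book import text1  # text1 is Moby Dick
print mean_word_len(text1)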
- Create a function called mean_sent_len, which accepts a list of sentences, and returns the mean words per sentence. You should remove punctuation and stopwords from the calculation. Note that the NLTK .sents() method returns a list of lists. That is, each item in the list represents a sentence, which is itself a list of words. (15 points)
import string

def mean_sent_len(sents):
    eng_stopwords = stopwords.words('english')
    # flatten the list of sentences, dropping punctuation and stopwords
    words_no_punc = [w for s in sents for w in s
                     if w not in string.punctuation
                     and w.lower() not in eng_stopwords]
    num_words = len(words_no_punc)
    num_sents = len(sents)
    return num_words / num_sents  # true division via the __future__ import above
- Now use your two new functions to print out the mean sentence length and the mean word length for all of the Project Gutenberg texts included in the NLTK. You should print these statistics with one file per line, with the fileid first, and then the mean word length and sentence length. One example would be:
melville-moby_dick.txt 5.94330208809 8.86877613075
(10 points)
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    words = gutenberg.words(fileid)
    sents = gutenberg.sents(fileid)
    print fileid, mean_word_len(words), mean_sent_len(sents)
- Using the CMU Pronouncing Dictionary, create a list of all words which have 3 letters and 2 syllables. Your final list should include just the spelling of the words. To calculate the number of syllables, count the vowels in the word (every vowel phoneme includes the digit 1, 2, or 0, marking primary, secondary, or no stress). (15 points)
entries = nltk.corpus.cmudict.entries()
stress_markers = ['0', '1', '2']
three_letter_two_syl_words = []
for word, pron in entries:
    if len(word) == 3:
        # count vowels: each vowel phoneme carries a stress digit (0, 1, or 2)
        syllables = 0
        for phoneme in pron:
            for marker in stress_markers:
                if marker in phoneme:
                    syllables += 1
        if syllables == 2:
            three_letter_two_syl_words.append(word)  # keep just the spelling
pprint(three_letter_two_syl_words)
- Imagine you are writing a play, and you are thinking of interesting places to stage a scene. You would like it to be somewhere like a house, but not exactly. Use the WordNet corpus to help you brainstorm possible locations. First, find the hypernyms of the first definition of the word house. Then find all the hyponyms of those hypernyms, and print out the names of the words. Your output should contain one synset per line, with the synset name first, and then all of the lemma_names for that synset, e.g.:
lodge.n.05 - lodge, indian_lodge
(10 points)
from nltk.corpus import wordnet

house = wordnet.synsets('house')[0]
house_hypernyms = house.hypernyms()
for hypernym in house_hypernyms:
    print "-------", hypernym.name, "---------"
    for hyponym in hypernym.hyponyms():
        print hyponym.name, " - ", ", ".join(hyponym.lemma_names)
In this homework you will expand upon some of the code you wrote for homework 6, using the functions you wrote to calculate mean word and sentence length. However, now you will accept command line arguments and options to drive these functions, and print the output in a nice-looking format. Make sure to read all questions before starting the assignment. It is due Oct. 16th and covers material up to Oct. 8th.
- From BASH, use svn to copy your hmwk6.py file to hmwk7.py. This will preserve all of the history from hmwk6, so you can see how you have improved your code from homework 6 to homework 7. (3 points)
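For example, from inside your personal directory:
svn copy hmwk6.py hmwk7.py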
- Create a function called usage, which prints out information about how the script should be used, including what arguments should be specified, and what options are possible. It should take one argument, which is the name of the script file. (7 points)
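A minimal sketch of such a function (the exact wording of the message is up to you):
def usage(script_name):
    print 'Usage: %s [-w] [-s] [-h] file ...' % script_name
    print '  -w, --word  print only mean word length'
    print '  -s, --sent  print only mean sentence length'
    print '  -h, --help  print this help information and exit'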
- Write your script to process the following options; look at opts.py under resources/py for an example, and see the sketch after this list for one possible shape. If both -s and -w are specified, it should print out both statistics. (14 points)
-w --word print only mean word length
-s --sent print only mean sentence length
-h --help print this help information and exit
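Here is one possible shape for the option handling (the show_word and show_sent flag names are just illustrative, and the no-flags default is my assumption):
import sys
import getopt

opts, args = getopt.getopt(sys.argv[1:], 'wsh', ['word', 'sent', 'help'])
show_word = show_sent = False
for opt, value in opts:
    if opt in ('-w', '--word'):
        show_word = True
    elif opt in ('-s', '--sent'):
        show_sent = True
    elif opt in ('-h', '--help'):
        usage(sys.argv[0])  # the usage function from the previous question
        sys.exit()
if not (show_word or show_sent):
    show_word = show_sent = True  # assumption: with no flags, print both statistics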
- Instead of specifying which texts to process in your code, change your code so
that it accepts filenames from the command line. Look at the args.py file
under resources/py for an example of how to do this. Your code should print out
the name of each file (you can use the os.path.basename function to print out only the name of the file) specified on the command line, and the mean word length
and sentence length, with a width of 13 and a precision of 2. Note that it
should only print word length or sentence length if that option has been
specified. If no files are specified, it should print the usage information
and exit. Also note that after reading in a text you will have to first convert
it to a list of words or sentences using the tokenize functions in the nltk,
before calculating the mean word length and sentence length with the functions
you defined in homework 6. See chapter 13 in the notes for examples of how to
tokenize text. The first line of output should be a line of headers describing
the columns. (28 points) Here is some example output, followed by a sketch of
one way to produce it:
     filename mean_word_len mean_sent_len
        fooey          3.45         13.47
          bar          3.15          9.29
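A sketch of the per-file work behind that output, assuming your homework 6 functions are defined earlier in the script (the print_stats name is my own):
import os
import nltk

def print_stats(filename):
    raw = open(filename).read()
    words = nltk.word_tokenize(raw)
    # sent_tokenize returns sentence strings; mean_sent_len wants lists of words
    sents = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(raw)]
    name = os.path.basename(filename)
    # width 13, precision 2
    print '%13s %13.2f %13.2f' % (name, mean_word_len(words), mean_sent_len(sents))

print '%13s %13s %13s' % ('filename', 'mean_word_len', 'mean_sent_len')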
- Use your script to print out mean word length and sentence length for huckFinn, tomSawyeer, Candide, and devilsDictionary (in resources/texts). Save the output to a file called hmwk7_stats.txt in your personal directory, and commit it to the svn repository. Show the command you use in BASH, making your paths relative to the root of your working copy of the repository. Then run the same command with the -s and -w options, printing to the screen. (8 points)
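For example, something like the following, run from the root of your working copy (the .txt extensions and the personal/rob path are only a guess at the repository layout):
personal/rob/hmwk7.py resources/texts/huckFinn.txt resources/texts/tomSawyeer.txt resources/texts/Candide.txt resources/texts/devilsDictionary.txt > personal/rob/hmwk7_stats.txt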
Here are the slides from today’s class covering string basics and methods in Python.
ling5200-nltk-3-1-slides.pdf
Here are the notes for today’s class, including a lengthy discussion of string manipulation in python.
ling5200-nltk3-1-notes.pdf
Here are the slides from today’s class covering file input and output, reading from stdin, and command line arguments and options. Please also look at the args.py and opts.py files under resources/py, which have some examples. Two other things to note:
- The combined course notes file now has an appendix with solutions to practice problems
- I updated the celex.txt file in resources/texts. Please run svn update to get the latest version
ling5200-nltk3-slides.pdf