October 12, 2009 · homework

Most students did well on this homework.
I made an error in the description for question 3: I had the wrong values for the mean word and sentence lengths for Moby Dick, because I had not converted words to lowercase before comparing with the stopword corpus. I have corrected that in the solution here, and I did not take off any points if you did not convert to lowercase. My solution Python file is in the repository under resources/homework.

I would also like to remind you of a couple of things:

  1. Make sure your Python files are executable.
  2. Include a proper shebang line as the very first line of your file.
  3. Use 4 spaces for indenting, not tabs, and please do not mix tabs and spaces.

Starting with homework 7, I will begin taking off 5 points each for any of the above mistakes.
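For reference, a minimal file that follows all three rules might look like this (the file contents and function name are just illustrative):

```python
#!/usr/bin/env python
# The shebang above must be the very first line of the file, so the shell
# knows to run it with python once the file has been made executable
# (chmod +x yourfile.py).


def greeting():
    # indented with 4 spaces -- no tabs
    return "hello"


if __name__ == "__main__":
    print(greeting())
```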

Class statistics for Homework 5:
  mean: 49.71
  standard deviation: 14.14

  1. Create a function called mean_word_len, which accepts a list of words (e.g. text1 — Moby Dick), and returns the mean characters per word. You should remove punctuation and stopwords from the calculation. (10 points)

    import string
    from nltk.corpus import stopwords

    def mean_word_len(words):
        eng_stopwords = stopwords.words('english')
        # drop punctuation tokens and (case-insensitively) stopwords
        words_no_punc = [w for w in words
                         if w not in string.punctuation
                         and w.lower() not in eng_stopwords]
        num_words = len(words_no_punc)
        num_chars = sum(len(w) for w in words_no_punc)
        # float() avoids integer division under Python 2
        return num_chars / float(num_words)
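As a quick sanity check, the same calculation can be run on a tiny hand-made example; the three-word stopword list here is just an illustrative stand-in for the NLTK corpus:

```python
import string

# stand-in stopword list; the real solution uses stopwords.words('english')
eng_stopwords = ['the', 'a', 'of']
words = ['The', 'whale', ',', 'a', 'leviathan', 'of', 'the', 'deep', '.']

words_no_punc = [w for w in words
                 if w not in string.punctuation
                 and w.lower() not in eng_stopwords]
# remaining words: whale (5), leviathan (9), deep (4)
mean = sum(len(w) for w in words_no_punc) / float(len(words_no_punc))
print(mean)  # 6.0
```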
  2. Create a function called mean_sent_len, which accepts a list of sentences, and returns the mean words per sentence. You should remove punctuation and stopwords from the calculation. Note that the NLTK .sents() method returns a list of lists. That is, each item in the list represents a sentence, which itself is composed of a list of words. (15 points)

    import string
    from nltk.corpus import stopwords

    def mean_sent_len(sents):
        eng_stopwords = stopwords.words('english')
        # flatten the list of sentences into one list of words,
        # dropping punctuation and stopwords
        words_no_punc = [w for s in sents for w in s
                         if w not in string.punctuation
                         and w.lower() not in eng_stopwords]
        num_words = len(words_no_punc)
        num_sents = len(sents)
        # float() avoids integer division under Python 2
        return num_words / float(num_sents)
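The double `for` in the comprehension is what flattens the list of lists; on a pair of toy sentences (again with a stand-in stopword list) the count works out like this:

```python
import string

eng_stopwords = ['the', 'a']  # stand-in for the NLTK stopword corpus
sents = [['The', 'whale', 'dove', '.'],
         ['Call', 'me', 'Ishmael', '.']]

# "for s in sents for w in s" walks sentence by sentence, word by word
words_no_punc = [w for s in sents for w in s
                 if w not in string.punctuation
                 and w.lower() not in eng_stopwords]
# 5 content words / 2 sentences
print(len(words_no_punc) / float(len(sents)))  # 2.5
```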
  3. Now use your two new functions to print out the mean sentence length and the mean word length for all of the texts from Project Gutenberg included in NLTK. You should print these statistics with one file per line, with the fileid first, and then the mean word length and sentence length. One example would be:
    melville-moby_dick.txt 5.94330208809 8.86877613075
    (10 points)

    from nltk.corpus import gutenberg
    for fileid in gutenberg.fileids():
        words = gutenberg.words(fileid)
        sents = gutenberg.sents(fileid)
        print fileid, mean_word_len(words), mean_sent_len(sents)
  4. Using the CMU pronouncing dictionary, create a list of all words which have 3 letters, and 2 syllables. Your final list should include just the spelling of the words. To calculate the number of syllables, use the number of vowels in the word (every vowel includes the digit 1, 2, or 0, marking primary, secondary, or no stress). (15 points)

    import nltk

    entries = nltk.corpus.cmudict.entries()
    stress_markers = ['0', '1', '2']
    three_letter_two_syl_words = []
    for word, pron in entries:
        if len(word) == 3:
            # count vowel phonemes: each one carries a stress digit
            syllables = 0
            for phoneme in pron:
                for marker in stress_markers:
                    if marker in phoneme:
                        syllables += 1
            if syllables == 2:
                three_letter_two_syl_words.append(word)
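The stress-digit idea can be verified on a couple of hand-written pronunciations (given here in CMU format; treat the exact transcriptions as illustrative):

```python
def count_syllables(pron):
    # every vowel phoneme in the CMU dictionary carries a stress digit
    # (0, 1, or 2), so counting phonemes containing a digit counts vowels
    return sum(1 for phoneme in pron
               if any(marker in phoneme for marker in '012'))

print(count_syllables(['EY1', 'S']))         # e.g. 'ace' -> 1 syllable
print(count_syllables(['AH0', 'G', 'OW1']))  # e.g. 'ago' -> 2 syllables
```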
  5. Imagine you are writing a play, and you are thinking of interesting places to stage a scene. You would like it to be somewhere like a house, but not exactly. Use the wordnet corpus to help you brainstorm possible locations. First, find the hypernyms of the first definition of the word house. Then find all the hyponyms of those hypernyms, and print out the names of the words. Your output should contain one synset per line, with first the synset name, and then all of the lemma_names for that synset, e.g.:
    lodge.n.05 - lodge, indian_lodge
    (10 points)

    from nltk.corpus import wordnet

    house = wordnet.synsets('house')[0]
    # the hypernyms of the first sense of 'house'
    for hypernym in house.hypernyms():
        print "-------", hypernym.name, "---------"
        for hyponym in hypernym.hyponyms():
            print hyponym.name, "-", ", ".join(hyponym.lemma_names)
Written by Robert Felty
