November 12, 2009 · News · 2 comments

Sam found a nice program to automatically identify the language of a text using trigrams. You might find it of interest.

November 12, 2009 · notes · (No comments)

Here are today’s notes covering the details of part-of-speech tagging.

ling5200-nltk-5-1-notes.pdf

November 12, 2009 · homework · (No comments)

Several people have asked questions about homework 10, which I would like to address here.

On the named parameters to the mean_sent_len and mean_word_len functions: we had previously defined these functions to ignore stop words. That is, when computing the mean number of words per sentence, we throw out stop words before calculating the mean. We might not want to do this all the time, though, so now we make it an option to the function. Like all other named arguments to functions, it has a default value; in this case, we want the default to be True.
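
As a sketch of what this looks like (assuming the function takes a list of words; the signatures in my actual solution may differ), a named argument with a default value works like this:

    def mean_word_len(words, ignore_stop=True):
        'Return the mean word length, optionally ignoring stop words'
        from nltk.corpus import stopwords
        stop = set(stopwords.words('english'))
        if ignore_stop:
            words = [w for w in words if w.lower() not in stop]
        return float(sum(len(w) for w in words)) / len(words)

Calling mean_word_len(words) uses the default and ignores stop words; calling mean_word_len(words, ignore_stop=False) keeps them.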

For question 3, remember that when using the timeit module, you have to import all necessary modules in your setup statement. If you like, this can be a multiline string (it’s easier to read that way). Also note that question 3 has nothing to do with question 4.
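
For example (only a sketch; adjust the timed statement to match the arguments your mean_sent_len actually takes), the setup string and timing call might look like this:

    import timeit

    # the setup string must import everything the timed statement needs;
    # a multiline string keeps it readable
    setup = '''
    import hmwk10
    '''
    t = timeit.Timer('hmwk10.mean_sent_len(ignore_stop=True, use_set=True)',
                     setup=setup)
    print t.timeit(number=100)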

Note that for question 4, I am asking you to add a global option, i.e. one that you could specify when calling your script from the command line. This has nothing to do with question 3 at all.

Note that my sample output had an error. I accidentally output the percentage of non-stopwords, as opposed to the percentage of stopwords. Sorry about that, and thanks to Steve for pointing it out.

Finally, as to the seemingly strange naming of include_stopwords, consider trying it the other way around, using ignore_stopwords. If this is true by default (which is what we want), then how do you make it false from the command line? You could make the option take an argument, so you would say

./hmwk10.py --ignore_stopwords=false 

but I don’t like that. I would rather specify

./hmwk10.py --include_stopwords

and have the default for --include_stopwords be false.
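
One way to wire that up is with optparse (just a sketch; it may not match how the rest of hmwk10.py handles its options):

    from optparse import OptionParser

    parser = OptionParser()
    # a boolean flag: absent means False, present means True
    parser.add_option('-i', '--include_stopwords', action='store_true',
                      default=False,
                      help='include stop words when computing the means')
    (options, args) = parser.parse_args()
    # options.include_stopwords is False unless -i/--include_stopwords is given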

November 10, 2009 · notes · (No comments)

Here are today’s notes on part-of-speech tagging.

ling5200-nltk-5-notes.pdf

November 5, 2009 · homework · 2 comments

In this homework you will apply some of the more advanced aspects of functions that we have discussed, including named arguments and default values. It covers material up to November 5th and is due November 13th.

  1. Use svn to copy my solution to homework 8 from resources/py into your personal directory as hmwk10.py (5 points)
  2. Modify the mean_word_len and mean_sent_len functions to accept two optional arguments, ignore_stop and use_set. The default for each of these should be True. If use_set is True, you should convert the stopword corpus to a set. If ignore_stop is True, you should exclude stopwords from the calculation (which is what the code in hmwk8.py does). (15 points)
  3. Now create a new file called means_timing.py. In this file, import your hmwk10.py module, and use the timeit module to test how long it takes to calculate the mean sentence length 100 times, trying all four combinations of the use_set and ignore_stop parameters. (10 points)
  4. Now add another global option called include-stop (i for short) to hmwk10.py, specifying whether to include stopwords when calculating the mean word length and sentence length. The default should be False. (10 points)
  5. Modify the calc_text_stats function so that it also computes the percentage of words that are stop words. (10 points)
  6. Now create a bash script which prints out the mean word and sentence length for Huck Finn, Tom Sawyer, Candide, and the Devil’s Dictionary. Pipe the output to sort to sort by mean sentence length. Try it both including and ignoring stop words. Your output (when ignoring stop words) should look like that shown below. (10 points)
    filename          mean_word_len mean_sent_len  per_stop_words
    tomSawyer                  5.51          7.46            42.2
    Candide                    6.07          9.04            43.5
    huckFinn                   4.93          9.32            45.0
    devilsDictionary           6.30         10.08            40.2
    
November 5, 2009 · notes · (No comments)

Here are today’s notes covering a sample of Python modules.

ling5200-nltk-4-3-notes.pdf

November 3, 2009 · homework · (No comments)

There is no regular homework assignment this week. Instead, I would like you to send me a 1-2 page description of your plans for the final project. It doesn’t have to be anything too formal, but I want to make sure that you have selected a project of reasonable scope. You can e-mail it to me or add it to the repository. Please give me either plain text or PDF; no Word documents, please.

November 3, 2009 · notes · (No comments)

Here are today’s notes covering Python modules and algorithm design, including recursion.

ling5200-nltk-4-2-notes.pdf

November 2, 2009 · homework · (No comments)

Most students did very well on this assignment. Please take a detailed look at my solution in resources/hmwk/hmwk9.py.

Class statistics for Homework 9:
mean: 54.17
standard deviation: 6.97

  1. Use the get_wiki function defined below to download the wikipedia page about Ben Franklin. (4 points)

    def get_wiki(url):
        'Download text from a wikipedia page and return raw text'
        from urllib2 import urlopen, Request
        headers = {'User-Agent': '''Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB;
                    rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'''}
        req=Request(url=url, headers=headers)
        f = urlopen(req)
        raw = unicode(f.read(),encoding='utf8')
        return(raw)
    
    raw = get_wiki('http://en.wikipedia.org/wiki/Ben_franklin')
  2. Wikipedia pages generally have a bunch of references and external links. These almost always occur at the section beginning “See also”. Strip off all text after “See also”. Hint: “See also” actually occurs twice in the document – once in the table of contents, and once as a section heading. You only want to ignore stuff after the section heading. (6 points)

    see_index = raw.rfind('See also')
    text = raw[:see_index]
  3. Next, define a function called unknown, which takes a list of tokenized words as input and returns a list of novel words, i.e. those that do not occur in the Words Corpus (nltk.corpus.words). Hint 1: your code will be much faster if you convert the list of words in the Words Corpus into a set, and then check whether or not a word is in that set. Hint 2: ignore case and punctuation when checking against the words corpus (but preserve case in your output). Hint 3: sometimes the nltk word tokenizer does not strip off periods ending sentences. Make sure that none of the words end in a period. (15 points)

    def unknown(tokens):
        'Return novel tokens (those not in the Words Corpus)'
        import nltk
        from string import punctuation
        words = nltk.corpus.words.words()
        # convert to lower case and make into a set for fast membership tests
        words = set([w.lower() for w in words])
        # drop punctuation tokens and strip sentence-final periods
        nopunc = [w.rstrip('.') for w in tokens if w not in punctuation]
        novel = [w for w in nopunc if w.lower() not in words]
        return(novel)
  4. Use your unknown function to find novel words in the wikipedia page on Ben Franklin (5 points)
    cleaned = nltk.clean_html(text)
    tokens = nltk.word_tokenize(cleaned)
    novel = unknown(tokens)
  5. As with most computational linguistics processes, it is nearly impossible to achieve perfect results. It is clear from browsing through the results that there are a number of “novel” words, which in fact are not novel. Let’s further refine our process. Some of the “novel” words we have found may be numbers, proper names (named entities), acronyms, or words with affixes (including both inflectional and derivational affixes). Let’s try to divide up our novel words into these categories.
    1. Use regular expressions to remove any numbers from the novel words. Remember that a number may have commas or a decimal point in it, and may begin with a dollar sign or end with a percent sign. Save the result as novel_nonum. Hint: when testing your regular expression, it is probably easier to check the result of finding items which are numbers, as opposed to checking the result of finding items which are not numbers. (8 points)

      import re
      number_re = r'\$?[0-9,.]+%?$'
      number_match = re.compile(number_re)
      novel_nonum = [w for w in novel if not number_match.match(w)]
    2. Now use the Porter stemmer to stem all the items in novel_nonum, then re-run them through the unknown function, saving the result as novel_stems. (7 points)
      porter = nltk.PorterStemmer()
      stemmed = [porter.stem(w) for w in novel_nonum]
      novel_stems = unknown(stemmed)
    3. Next, find as many proper names from novel_stems as possible, saving the result as proper_names. Note that finding named entities is actually a very difficult problem, and usually involves syntax and semantics. For our purposes however, let’s just use the fact that proper names in English start with a capital letter. Also create a new variable novel_no_proper, which has the proper names removed. (5 points)
      proper_names = [w for w in novel_stems if w[0].isupper()]
      novel_no_proper = [w for w in novel_stems if not w[0].isupper()]
    4. Calculate the percentage of novel tokens in the Ben Franklin wikipedia page, after having excluded numbers, affixed words, and proper names. (4 points)
      # 100.0 makes this a percentage and avoids truncated integer division
      novel_token = 100.0 * len(novel_no_proper) / len(tokens)
    5. Calculate the percentage of novel types in the Ben Franklin wikipedia page, after having excluded numbers, affixed words, and proper names. (6 points)
      novel_type = 100.0 * len(set(novel_no_proper)) / len(set(tokens))
  6. Extra Credit: Find additional ways to remove false positives in our “novel” word list. (3 extra points for each additional way, up to 12 extra points)
    # remove smart quotes, dashes and other such characters
    novel_no_quotes = [w for w in novel_no_proper if w.isalpha()]
    # rebuild the lowercased word set used inside unknown()
    words = set(w.lower() for w in nltk.corpus.words.words())
    # Try to repair the stemming process
    # Sometimes the stemmer removes final e's which should be there
    novel_fixed_e = [w for w in novel_no_quotes if (w + 'e').lower() not in words]
    # Likewise with 'ate'
    novel_fixed_ate = [w for w in novel_fixed_e if (w + 'ate').lower() not in words]
    # Sometimes the stemmer converts y to i when it shouldn't
    novel_fixed_i = [w for w in novel_fixed_ate if re.sub('i$', 'y', w).lower() not in words]

October 29, 2009 · News · (No comments)

I added some more tips to homework 9 in the comments section. I also corrected the get_wiki() function. Sorry about the inconvenience there.