October 23, 2009 · homework

In this homework, you will attempt to find novel words in webpages. Make
sure to read all questions before starting the assignment. It is due Oct. 30th
and covers material up to Oct. 22nd.

  1. Use the get_wiki function defined below to download the Wikipedia page about Ben Franklin. (4 points)

    def get_wiki(url):
        'Download a Wikipedia page and return its raw text as unicode'
        from urllib2 import urlopen, Request
        # Wikipedia rejects urllib2's default User-Agent, so send a
        # browser-like one. Adjacent string literals concatenate, which
        # keeps stray newlines and spaces out of the header value.
        headers = {'User-Agent':
                   'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; '
                   'rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'}
        req = Request(url=url, headers=headers)
        f = urlopen(req)
        raw = unicode(f.read(), encoding='utf8')
        return raw
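
    For example (the exact article URL is an assumption; adjust as needed):

        raw = get_wiki('http://en.wikipedia.org/wiki/Benjamin_Franklin')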
    
  2. Wikipedia pages generally have a bunch of references and external links, which almost always come after the section heading “See also”. Strip off all text after “See also”. Hint: “See also” actually occurs twice in the document: once in the table of contents, and once as a section heading. You only want to discard the text after the section heading; one approach is sketched below. (6 points)
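
    For example (a sketch; rfind works here because the section heading is
    the second and last occurrence of “See also”):

        # keep everything before the final occurrence, i.e. the heading
        idx = raw.rfind(u'See also')
        if idx != -1:
            raw = raw[:idx]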
  3. Next, define a function called unknown, which takes a list of tokenized
    words as input and returns a list of novel words, i.e. those which do not
    occur in the Words Corpus (nltk.corpus.words). Hint 1: your code will be much faster if you convert the list of words in the Words Corpus into a
    set, and then check whether or not a word is in that set. Hint 2: ignore case and punctuation when checking against the words corpus (but preserve case in your output). Hint 3: sometimes the nltk word tokenizer does not strip off periods ending sentences; make sure that none of the words end in a period. Hint 4: make sure to strip out all html tags before tokenizing (see chapter 2 of the NLTK book for an example). A skeleton is sketched below. (15 points)
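
    A minimal sketch of one way to structure unknown (the details here are
    one possibility, not the required solution):

        import string
        import nltk

        def unknown(tokens):
            'Return the input tokens which do not occur in the Words Corpus'
            # Hint 1: a set makes the membership test fast
            wordset = set(w.lower() for w in nltk.corpus.words.words())
            novel = []
            for tok in tokens:
                # Hint 3: drop any sentence-final period left by the tokenizer
                tok = tok.rstrip('.')
                # Hint 2: check case-insensitively and without punctuation,
                # but append the original, case-preserved token
                check = tok.lower().strip(string.punctuation)
                if check and check not in wordset:
                    novel.append(tok)
            return novel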
  4. Use your unknown function to find novel words in the Wikipedia page on Ben Franklin. (5 points)
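
    Putting the pieces together might look like this (nltk.clean_html is the
    html-stripping function shown in chapter 2 of the NLTK book):

        raw = get_wiki('http://en.wikipedia.org/wiki/Benjamin_Franklin')
        raw = raw[:raw.rfind(u'See also')]   # question 2
        text = nltk.clean_html(raw)          # strip html tags (hint 4)
        tokens = nltk.word_tokenize(text)
        novel = unknown(tokens)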
  5. As with most computational linguistics processes, it is nearly impossible to achieve perfect results. It is clear from browsing through the results that there are a number of “novel” words that are in fact not novel. Let’s refine our process further. Some of the “novel” words we have found may be numbers, proper names (named entities), acronyms, or words with affixes (both inflectional and derivational). Let’s try to divide up our novel words into these categories.
    1. Use regular expressions to remove any numbers from the novel words.
      Remember that a number may have commas or a decimal point in it, and may begin
      with a dollar sign or end with a percent sign. Save the result as
      novel_nonum. Hint: when testing your regular expression, it is probably
      easier to check the result of finding items which are numbers, as opposed to
      checking the result of finding items which are not numbers. (8 points)
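
      For example, one regular expression that covers these cases (yours may
      differ):

          import re

          # optional leading $, digits with optional comma groups,
          # an optional decimal part, and an optional trailing %
          num_re = re.compile(r'^\$?\d+(,\d{3})*(\.\d+)?%?$')
          novel_nonum = [w for w in novel if not num_re.match(w)]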

    2. Now use the Porter stemmer to stem all the items in novel_nonum, then re-run them through the unknown function, saving the result as novel_stems. (7 points)
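
      A sketch of this step:

          porter = nltk.PorterStemmer()
          stems = [porter.stem(w) for w in novel_nonum]
          novel_stems = unknown(stems)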
    3. Next, find as many proper names from novel_stems as possible, saving the result as proper_names. Note that finding named entities is actually a very difficult problem, and usually involves syntax and semantics. For our purposes, however, let’s just use the fact that proper names in English start with a capital letter. Also create a new variable, novel_no_proper, which has the proper names removed. (5 points)
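
      Using capitalization as the (admittedly rough) test:

          proper_names = [w for w in novel_stems if w[:1].isupper()]
          novel_no_proper = [w for w in novel_stems if not w[:1].isupper()]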
    4. Calculate the percentage of novel tokens in the Ben Franklin Wikipedia page, after having excluded numbers, affixed words, and proper names. (4 points)
    5. Calculate the percentage of novel types in the Ben Franklin Wikipedia page, after having excluded numbers, affixed words, and proper names; a sketch of both calculations follows. (6 points)
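
      A sketch of both calculations (“type” here is meant as in the
      type-token ratio; see the comments below):

          # tokens: every occurrence in the page counts
          pct_novel_tokens = 100.0 * len(novel_no_proper) / len(tokens)
          # types: each distinct word counts once
          types = set(w.lower() for w in tokens)
          novel_types = set(w.lower() for w in novel_no_proper)
          pct_novel_types = 100.0 * len(novel_types) / len(types)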
  6. Extra Credit: Find additional ways to remove false positives in our “novel” word list. (3 extra points for each additional way, up to 12 extra points)
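
    For instance, acronyms are listed among the false-positive categories
    above but never filtered out; one rough test, applied before the
    proper-name step (the variable names here are illustrative):

        # treat tokens of two or more capital letters as acronyms
        acro_re = re.compile(r'^[A-Z]{2,}$')
        no_acronyms = [w for w in novel_nonum if not acro_re.match(w)]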
Written by Robert Felty


3 Comments to “Homework 9 – Finding novel words”

  1. ash_v says:

    In Q2, what is meant by removing all text after “See also”? Does it mean just the text, or the text plus the html tags?

  2. Robert Felty says:

    Some people had a problem with the get_wiki function. I have updated the function above so that it should work for everyone now. Sorry about that.

    A couple other questions people had:

    What do I do with all these html tags?
    Strip them out. See chapter 2 of the NLTK book for an example (or look in the notes).

    In question 5, everything is of unicode type. What do you want here?
    Type here means lemma, as in the type-token ratio, not the Python data type.

    Won’t removing proper names based on capitalization also remove the first word of sentences?
    True, but we are trying to find novel words. Any words in our dictionary that are the first word of a sentence will have already been taken out. If there are novel words which happen to start a sentence, they will be grouped in with the proper names. This is another example of how we have to use a combination of automated techniques and manual checking.
