Linguistics 5200 Fall 2009 » homework
Introduction to computational corpus linguistics
http://robfelty.com/teaching/ling5200Fall2009

Homework 11 solution
Tue, 24 Nov 2009 03:55:31 +0000, Robert Felty
http://robfelty.com/teaching/ling5200Fall2009/2009/11/homework-11-solution/

Most students did very well on this assignment. The only consistent shortcoming was having unnecessary loops in the tag_errors function. These unnecessary loops led to an increase in execution time of about 10%.

Class statistics for Homework 11
mean 56.71
standard deviation 8.58

    In this homework you will practice part of speech tagging, and evaluating part of speech taggers. The homework covers material up to Nov. 12, and is due Nov. 19th.

    1. Use the unigram tagger to evaluate the accuracy of tagging of the romance and the adventure genres of the Brown corpus. Use a default tagger of NN as a backoff tagger. You should train the tagger on the first 90% of each genre, and test on the remaining 10%. (10 points)

      import nltk
      from nltk.corpus import brown

      t0 = nltk.DefaultTagger('NN')

      adv_tagged_sents = brown.tagged_sents(categories='adventure')
      adv_size = int(len(adv_tagged_sents) * 0.9)
      adv_train_sents = adv_tagged_sents[:adv_size]
      adv_test_sents = adv_tagged_sents[adv_size:]
      adv_tagger = nltk.UnigramTagger(adv_train_sents, backoff=t0)
      adv_tagger.evaluate(adv_test_sents)

      rom_tagged_sents = brown.tagged_sents(categories='romance')
      rom_size = int(len(rom_tagged_sents) * 0.9)
      rom_train_sents = rom_tagged_sents[:rom_size]
      rom_test_sents = rom_tagged_sents[rom_size:]
      rom_tagger = nltk.UnigramTagger(rom_train_sents, backoff=t0)
      rom_tagger.evaluate(rom_test_sents)
    2. Now let’s investigate the most common types of errors that our tagger makes. Write a function called tag_errors which will return all errors that our tagger made. It should accept two arguments, test and gold, which should be lists of tagged sentences. The test sentences should be ones that have been automatically tagged, and the gold should be ones that have been manually corrected. The function should output a list of (incorrect, correct) tuples, e.g. [('VB', 'NN'), ('VBN', 'VBD'), ('NN', 'VB'), ('NN', 'VBD'), ('TO', 'IN')]. (15 points)

      def tag_errors(test, gold):
          '''returns a list of (wrong, correct) tuples, given automatically
          tagged data and the gold standard for that data'''
          errors = []
          for testsent, goldsent in zip(test, gold):
              for testpair, goldpair in zip(testsent, goldsent):
                  if testpair[1] != goldpair[1]:
                      errors.append((testpair[1], goldpair[1]))
          return errors
    3. Use the Unigram taggers you trained to tag the test data from the adventure and romance genres of the Brown corpus. HINT: Look at the batch_tag method of the UnigramTagger. (10 points)

      adv_sents = brown.sents(categories='adventure')
      adv_unknown = adv_sents[adv_size:]
      adv_test = adv_tagger.batch_tag(adv_unknown)

      rom_sents = brown.sents(categories='romance')
      rom_unknown = rom_sents[rom_size:]
      rom_test = rom_tagger.batch_tag(rom_unknown)
    4. Use your tag_errors function to find all the tagging errors for the romance and adventure genres of the Brown corpus. (10 points)

      adv_errors = tag_errors(adv_test, adv_test_sents)
      rom_errors = tag_errors(rom_test, rom_test_sents)
    5. Now create frequency distributions of the tagging errors for the romance and adventure genres. (5 points)
      adv_error_fd = nltk.FreqDist(adv_errors)
      rom_error_fd = nltk.FreqDist(rom_errors)
    6. What differences do you notice between the frequency distributions of the two genres? (No code required for this question) (5 points)
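
      No code was required for this question, but a quick way to compare the
      two distributions is to print the most frequent error pairs from each.
      This is only a sketch; it assumes the NLTK 2.x FreqDist API, in which
      items() returns (sample, count) pairs sorted by decreasing frequency.

      # show the ten most common (incorrect, correct) pairs per genre
      for (wrong, right), count in adv_error_fd.items()[:10]:
          print 'adventure: %s mistagged as %s (%d times)' % (right, wrong, count)
      for (wrong, right), count in rom_error_fd.items()[:10]:
          print 'romance: %s mistagged as %s (%d times)' % (right, wrong, count)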
       
    7. How might we improve our tagging performance? (No code required for this question) (5 points)
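
      No code was required here either. One common improvement (not part of
      the original solution) is to back off through a chain of taggers, so a
      bigram tagger uses context where it has seen the bigram and falls back
      to the unigram tagger, and ultimately to the NN default tagger. A
      minimal sketch, reusing the adventure variables from question 1:

      t2 = nltk.BigramTagger(adv_train_sents, backoff=adv_tagger)
      t2.evaluate(adv_test_sents)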
    Homework 10 Solution
    Tue, 17 Nov 2009 04:18:36 +0000, Robert Felty
    http://robfelty.com/teaching/ling5200Fall2009/2009/11/homework-10-solution/

    Most students did well on this assignment. Please take a detailed look at my solution in resources/hmwk/hmwk10.py.

    Class statistics for Homework 10
    mean 51.67
    standard deviation 7.28
    1. Use svn to copy my solution to homework 8 from resources/py into your personal directory as hmwk10.py (5 points)

      svn cp resources/py/hmwk8.py students/robfelty/hmwk10.py
    2. Modify the mean_word_len and mean_sent_len functions to accept two optional
      arguments, ignore_stop and use_set. The default for each of
      these should be True. If use_set is True, you should convert the
      stopword corpus to a set. If ignore_stop is True, you should ignore stopwords from the calculation (which is what the code in hmwk8.py does). (15 points)

      # (assumes hmwk10.py keeps hmwk8.py's module-level imports, including
      #  string and nltk.corpus.stopwords, and true division)
      def mean_sent_len(sents, ignore_stop=True, use_set=True):
          ''' returns the average number of words per sentence

          Input should be a list of lists, with each item in the list being a
          sentence, composed of a list of words. We ignore any punctuation,
          and ignore stopwords if ignore_stop is True.
          '''
          if use_set:
              eng_stopwords = set(stopwords.words('english'))
          else:
              eng_stopwords = stopwords.words('english')
          if ignore_stop:
              words_no_punc = [w for s in sents for w in s
                              if w not in string.punctuation
                              and w.lower() not in eng_stopwords]
          else:
              words_no_punc = [w for s in sents for w in s
                              if w not in string.punctuation]
          num_words = len(words_no_punc)
          num_sents = len(sents)
          return (num_words / num_sents)

      def mean_word_len(words, ignore_stop=True, use_set=True):
          ''' returns the average number of letters per word

          Input should be a list of words. We ignore any punctuation,
          and ignore stopwords if ignore_stop is True.
          '''
          if use_set:
              eng_stopwords = set(stopwords.words('english'))
          else:
              eng_stopwords = stopwords.words('english')
          if ignore_stop:
              words_no_punc = [w for w in words
                    if w not in string.punctuation and w.lower() not in eng_stopwords]
          else:
              words_no_punc = [w for w in words
                    if w not in string.punctuation]
          num_words = len(words_no_punc)
          num_chars = sum([len(w) for w in words_no_punc])
          return (num_chars / num_words)
    3. Now create a new file called means_timing.py. In this file, import your hmwk10.py module, and use the timeit module to test how long it takes to calculate the mean sentence length 100 times, trying all 4 combinations of the parameters of use_set and ignore_stop. (10 points)

      # means_timing.py
      import nltk
      import hmwk10
      from timeit import Timer

      # note: the lines inside the setup string are flush left so that timeit
      # can exec them without an indentation error
      setup = '''import nltk
import hmwk10
f = open('../texts/Candide.txt')
raw = f.read()
sents = nltk.sent_tokenize(raw)
words = nltk.word_tokenize(raw)
'''
      test1 = 'hmwk10.mean_sent_len(sents)'
      print Timer(test1, setup).timeit(100)

      test2 = 'hmwk10.mean_sent_len(sents, use_set=False)'
      print Timer(test2, setup).timeit(100)

      test3 = 'hmwk10.mean_sent_len(sents, use_set=False, ignore_stop=False)'
      print Timer(test3, setup).timeit(100)

      test4 = 'hmwk10.mean_sent_len(sents, use_set=True, ignore_stop=False)'
      print Timer(test4, setup).timeit(100)
    4. Now add another global option called include-stop (i for short) to hmwk10.py specifying whether or not to ignore stopwords when calculating mean word length and sentence length. The default should be False. (10 points)
      opts, args = getopt.gnu_getopt(sys.argv[1:], "hwsajni",
                   ["help", "word", "sent", 'ari', 'adj', 'noheader', 'include-stop'])
      include_stop = False
      for o, a in opts:
          if o in ("-i", "--include-stop"):
              include_stop = True

      # in calc_text_stats: the function parameter is ignore_stop,
      # which is the opposite of include_stop
      mean_sent_length = mean_sent_len(sents, ignore_stop=not include_stop)
    5. Modify the calc_text_stats function so that it also computes the percentage of words that are stop words. (10 points)
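
      The original post does not include code for this step. Here is a
      minimal sketch of the extra calculation, assuming it is added inside
      calc_text_stats after the text has been tokenized into words, and that
      the module-level stopwords import and true division from the earlier
      snippets are in effect:

      # inside calc_text_stats (sketch only):
      eng_stopwords = set(stopwords.words('english'))
      num_stop = len([w for w in words if w.lower() in eng_stopwords])
      per_stop_words = '%13.1f' % (100 * num_stop / len(words))
      # ... then append per_stop_words to the string that the function returns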
    6. Now create a bash script which prints out the mean word and sentence length for Huck Finn, Tom Sawyer, Candide, and the Devil’s Dictionary. Pipe the output to sort to order the lines by mean sentence length. Try it both including and ignoring stop words. Your output (when ignoring stop words) should look like that shown below. (10 points)
      filename          mean_word_len mean_sent_len per_stop_words
      tomSawyer                  5.51          7.46            42.2
      Candide                    6.07          9.04            43.5
      huckFinn                   4.93          9.32            45.0
      devilsDictionary           6.30         10.08            40.2
      
       # including stop words (-i)
       ./hmwk10.py -wsi ../texts/{tomSawyer,huckFinn,Candide,devilsDictionary}.txt | sort -nk 3
       # ignoring stop words (the default), which matches the output above
       ./hmwk10.py -ws ../texts/{tomSawyer,huckFinn,Candide,devilsDictionary}.txt | sort -nk 3
    Homework 11 – Part of speech tagging
    Sat, 14 Nov 2009 17:49:20 +0000, Robert Felty
    http://robfelty.com/teaching/ling5200Fall2009/2009/11/homework-11-part-of-speech-tagging/

    In this homework you will practice part of speech tagging, and evaluating part of speech taggers. The homework covers material up to Nov. 12, and is due Nov. 20th.

    1. Use the unigram tagger to evaluate the accuracy of tagging of the romance and the adventure genres of the Brown corpus. Use a default tagger of NN as a backoff tagger. You should train the tagger on the first 90% of each genre, and test on the remaining 10%. (10 points)
    2. Now let’s investigate the most common types of errors that our tagger makes. Write a function called tag_errors which will return all errors that our tagger made. It should accept two arguments, test and gold, which should be lists of tagged sentences. The test sentences should be ones that have been automatically tagged, and the gold should be ones that have been manually corrected. The function should output a list of (incorrect, correct) tuples, e.g. [('VB', 'NN'), ('VBN', 'VBD'), ('NN', 'VB'), ('NN', 'VBD'), ('TO', 'IN')]. (15 points)
    3. Use the Unigram taggers you trained to tag the test data from the adventure and romance genres of the Brown corpus. HINT: Look at the batch_tag method of the UnigramTagger. (10 points)
    4. Use your tag_errors function to find all the tagging errors for the romance and adventure genres of the Brown corpus. (10 points)
    5. Now create frequency distributions of the tagging errors for the romance and adventure genres. (5 points)
    6. What differences do you notice between the frequency distributions of the two genres? (No code required for this question) (5 points)
    7. How might we improve our tagging performance? (No code required for this question) (5 points)
    Homework 10 clarifications
    Thu, 12 Nov 2009 06:34:34 +0000, Robert Felty
    http://robfelty.com/teaching/ling5200Fall2009/2009/11/homework-10-clarifications/

    Several people have asked some questions about homework 10 which I would like to address.

    On named parameters to the mean_sent_len and mean_word_len functions: we had previously defined these functions to ignore stop words. That is, when computing the mean number of words per sentence, we throw out stop words before calculating the mean. We might not want to do this all the time, though, so now we make this an option to the function. Like all other named arguments to functions, it has a default value. In this case, we want the default to be True.
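
    As a toy illustration of the idea (this is not the assignment code), the named argument gets a default value, so callers only spell it out when they want the non-default behaviour:

    def count_words(sent, ignore_stop=True):
        stop = set(['the', 'a', 'of'])  # tiny stand-in for the stopword corpus
        return len([w for w in sent if not (ignore_stop and w in stop)])

    print count_words(['the', 'cat', 'sat'])                     # prints 2
    print count_words(['the', 'cat', 'sat'], ignore_stop=False)  # prints 3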

    For question 3, remember that when using the timeit module, you have to import all necessary modules in your setup statement. If you like, this can be a multiline string (it’s easier to read that way). Also note that question 3 has nothing to do with question 4.

    Note that for question 4, I am asking you to add a global option, i.e. one that you could specify when calling your script from the command line. This has nothing to do with question 3 at all.

    Note that my sample output had an error. I accidentally output the percentage of non-stopwords, as opposed to the percentage of stopwords. Sorry about that, and thanks to Steve for pointing it out.

    Finally, as to the strange naming of include-stopwords: consider trying it the other way around, using ignore_stopwords. If this is true by default (which is what we want), then how do you make it false from the command line? You could make it take an argument, so you would say

    ./hmwk10.py --ignore_stopwords=false 

    but I don’t like that. I would rather specify

    ./hmwk10.py --include_stopwords

    and have the default for --include_stopwords be false.

    Homework 10 – Advanced function usage
    Fri, 06 Nov 2009 04:07:29 +0000, Robert Felty
    http://robfelty.com/teaching/ling5200Fall2009/2009/11/homework-10-advanced-function-usage/

    In this homework you will apply some of the advanced function programming we have discussed, including using named arguments and default values. It covers material up to November 5th, and is due November 13th.

    1. Use svn to copy my solution to homework 8 from resources/py into your personal directory as hmwk10.py (5 points)
    2. Modify the mean_word_len and mean_sent_len functions to accept two optional
      arguments, ignore_stop and use_set. The default for each of
      these should be True. If use_set is True, you should convert the
      stopword corpus to a set. If ignore_stop is True, you should ignore stopwords from the calculation (which is what the code in hmwk8.py does). (15 points)
    3. Now create a new file called means_timing.py. In this file, import your hmwk10.py module, and use the timeit module to test how long it takes to calculate the mean sentence length 100 times, trying all 4 combinations of the parameters of use_set and ignore_stop. (10 points)
    4. Now add another global option called include-stop (i for short) to hmwk10.py specifying whether or not to ignore stopwords when calculating mean word length and sentence length. The default should be False. (10 points)
    5. Modify the calc_text_stats function so that it also computes the percentage of words that are stop words. (10 points)
    6. Now create a bash script which prints out the mean word and sentence length for Huck Finn, Tom Sawyer, Candide, and the Devil’s Dictionary. Pipe the output to sort to order the lines by mean sentence length. Try it both including and ignoring stop words. Your output (when ignoring stop words) should look like that shown below. (10 points)
      filename          mean_word_len mean_sent_len  per_stop_words
      tomSawyer                  5.51          7.46            42.2
      Candide                    6.07          9.04            43.5
      huckFinn                   4.93          9.32            45.0
      devilsDictionary           6.30         10.08            40.2
      
    Final project topics due this Friday
    Wed, 04 Nov 2009 03:40:12 +0000, Robert Felty
    http://robfelty.com/teaching/ling5200Fall2009/2009/11/final-project-topics-due-this-friday/

    There is no regular homework assignment this week. Instead, I would like you to send me a 1-2 page description of your plans for the final project. It doesn’t have to be anything too formal, but I want to make sure that you have selected a project of reasonable scope. You can e-mail them to me, or add them to the repository. Please give me either plain text or PDF; no Word documents, please.

    Homework 9 Solution
    Mon, 02 Nov 2009 15:49:02 +0000, Robert Felty
    http://robfelty.com/teaching/ling5200Fall2009/2009/11/homework-9-solution/

    Most students did very well on this assignment. Please take a detailed look at my solution in resources/hmwk/hmwk9.py.

    Class statistics for Homework 9
    mean 54.17
    standard deviation 6.97
    1. Use the get_wiki function defined below to download the wikipedia page about Ben Franklin. (4 points)

      def get_wiki(url):
          'Download text from a wikipedia page and return raw text'
          from urllib2 import urlopen, Request
          headers = {'User-Agent': '''Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB;
                      rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'''}
          req=Request(url=url, headers=headers)
          f = urlopen(req)
          raw = unicode(f.read(),encoding='utf8')
          return(raw)
      
      raw = get_wiki('http://en.wikipedia.org/wiki/Ben_franklin')
    2. Wikipedia pages generally have a bunch of references and external links. These almost always occur at the section beginning “See also”. Strip off all text after “See also”. Hint: “See also” actually occurs twice in the document – once in the table of contents, and once as a section heading. You only want to ignore stuff after the section heading. (6 points)

      see_index = raw.rfind('See also')
      text = raw[:see_index]
    3. Next, define a function called unknown, which removes any items from
      this set that occur in the Words Corpus (nltk.corpus.words). The function should take a list of tokenized words as input, and return a list of novel words. Hint 1: your code
      will be much faster if you convert the list of words in the Words Corpus into a
      set, and then check whether or not a word is in that set. Hint 2: ignore case and punctuation when checking against the words corpus (but preserve case in your output). Hint 3: Sometimes the nltk word tokenizer does not strip off periods ending sentences. Make sure that none of the words end in a period. (15 points)

      def unknown(tokens):
          'returns novel tokens (those not in the words corpus)'
          from string import punctuation
          words = nltk.corpus.words.words()
          # convert to lower case and make into a set
          words = set([w.lower() for w in words])
          # drop punctuation tokens and strip trailing periods
          nopunc = [w.rstrip('.') for w in tokens if w not in punctuation]
          novel = [w for w in nopunc if w.lower() not in words]
          return(novel)
    4. Use your unknown function to find novel words in the wikipedia page on Ben Franklin (5 points)
      cleaned = nltk.clean_html(text)
      tokens = nltk.word_tokenize(cleaned)
      novel = unknown(tokens)
    5. As with most computational linguistics processes, it is nearly impossible to achieve perfect results. It is clear from browsing through the results that there are a number of “novel” words, which in fact are not novel. Let’s further refine our process. Some of the “novel” words we have found may be numbers, proper names (named entities), acronyms, or words with affixes (including both inflectional and derivational affixes). Let’s try to divide up our novel words into these categories.
      1. Use regular expressions to remove any numbers from the novel words.
        Remember that a number may have commas or a decimal point in it, and may begin
        with a dollar sign or end with a percent sign. Save the result as
        novel_nonum. Hint: when testing your regular expression, it is probably
        easier to check the result of finding items which are numbers, as opposed to
        checking the result of finding items which are not numbers. (8 points)

        import re
        number_re = r'\$?[0-9,.]+%?$'
        number_match = re.compile(number_re)
        novel_nonum = [w for w in novel if not number_match.match(w)]
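
        As a quick sanity check (the example tokens below are made up for
        illustration), the pattern accepts typical number tokens and rejects
        ordinary words:

        # '1790', '1,000,000', '$4.50' and '45%' match; 'Franklin' and '3rd' do not
        for tok in ['1790', '1,000,000', '$4.50', '45%', 'Franklin', '3rd']:
            print tok, bool(number_match.match(tok))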
      2. Now use the porter stemmer to stem all the items in novel_nonum, then re-run them through the unknown function, saving the result as novel_stems. (7 points)
        porter = nltk.PorterStemmer()
        stemmed = [porter.stem(w) for w in novel_nonum]
        novel_stems = unknown(stemmed)
      3. Next, find as many proper names from novel_stems as possible, saving the result as proper_names. Note that finding named entities is actually a very difficult problem, and usually involves syntax and semantics. For our purposes however, let’s just use the fact that proper names in English start with a capital letter. Also create a new variable novel_no_proper, which has the proper names removed. (5 points)
        proper_names = [w for w in novel_stems if w[0].isupper()]
        novel_no_proper = [w for w in novel_stems if not w[0].isupper()]
      4. Calculate the percentage of novel tokens in the Ben Franklin wikipedia page, after having excluded number, affixed words, and proper names. (4 points)
        # multiply by 100.0 to get a percentage (and force float division)
        novel_token = 100.0 * len(novel_no_proper) / len(tokens)
      5. Calculate the percentage of novel types in the Ben Franklin wikipedia page, after having excluded number, affixed words, and proper names. (6 points)
        # again as a percentage, this time over types
        novel_type = 100.0 * len(set(novel_no_proper)) / len(set(tokens))
    6. Extra Credit: Find additional ways to remove false positives in our “novel” word list. (3 extra points for each additional way, up to 12 extra points)
      # remove smart quotes, dashes and other such characters
      novel_no_quotes = [w for w in novel_no_proper if w.isalpha()]
      # Try to repair the stemming process. The checks below use the same
      # lower-cased word set as inside unknown(), rebuilt here at module level.
      words = set([w.lower() for w in nltk.corpus.words.words()])
      # Sometimes the stemmer removes final e's which should be there
      novel_fixed_e = [w for w in novel_no_quotes if (w + 'e').lower() not in words]
      # Likewise with 'ate'
      novel_fixed_ate = [w for w in novel_fixed_e if (w + 'ate').lower() not in words]
      # Sometimes the stemmer converts y to i when it shouldn't
      novel_fixed_i = [w for w in novel_fixed_ate if re.sub('i$', 'y', w).lower() not in words]
    Homework 8 solution
    Tue, 27 Oct 2009 23:50:43 +0000, Robert Felty
    http://robfelty.com/teaching/ling5200Fall2009/2009/10/homework-8-solution/

    Most students did fairly well on this assignment. Please take a detailed look at my solution in resources/hmwk/hmwk8.py.

    Class statistics for Homework 8
    mean 49.71
    standard deviation 9.2
    1. Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define μw to be the average number of letters per word, and μs to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: 4.71 μw + 0.5 μs – 21.43. Define a function which computes the ARI score. It should accept two arguments – the mean word length, and the mean sentence length. (5 points)

      def calc_ari(mean_sent, mean_word):
          ari = 4.71 * mean_word + 0.5 * mean_sent - 21.43
          return(ari)
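
      As a quick sanity check with made-up values (a mean sentence length of
      10 words and a mean word length of 5 letters):

      # 4.71 * 5 + 0.5 * 10 - 21.43 = 23.55 + 5.0 - 21.43 = 7.12
      print calc_ari(10, 5)   # prints 7.12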
    2. One feature of English is that it is easy to turn verbs into nouns and adjectives, by using participles. For example, in the phrase the burning bush, the verb burn is used as an adjective, by using the present participle form. Create a function called verb_adjectives which uses the findall method from the NLTK to find present participles used as adjectives. For simplicity, find only adjectives that are preceded by an article (a, an, the). Make sure that they have a word following them (not punctuation). The function should accept a list of tokens, as returned by the words() function in the NLTK. Note that all present participles in English end in ing. Unfortunately, the nltk findall function which we used in class prints out the words, instead of just returning them. This means that we cannot use it in a function. (Go ahead and try to use it, and convince yourself why it is generally bad to print stuff from functions, instead of just returning results (unless the function’s only purpose is to print something out, e.g. pprint)). So, I will get you started on the functions you need to use:

      regexp = r'<a><.*><man>'
      moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
      bracketed = nltk.text.TokenSearcher(moby)
      hits = bracketed.findall(regexp)
      

      This returns a list of lists, where each list is composed of the 3 word phrase which matches. So your main task is to come up with the correct regular expression. (7 points)

      def verb_adjectives(tokens):
          'returns a list of 3-word phrases where a verb is used as an adjective'
          regexp = r'<a|an|the><.*ing><\w+>'
          bracketed = nltk.text.TokenSearcher(tokens)
          hits = bracketed.findall(regexp)
          return(hits)
    3. As we have seen in class, most computational linguistics involves a combination of automation and hand-checking. Let’s refine our verb_adjectives function by ensuring that none of the words following the adjective are in the stopwords corpus. Without doing this, we get results like ['an', 'understanding', 'of'], where understanding is being used as a noun, not an adjective. Use a list expression to remove all hits where the third word in the list is a stopword. (7 points)
      def verb_adjectives(tokens):
          'returns a list of 3-word phrases where a verb is used as an adjective'
          regexp = r'<a|an|the><.*ing><\w+>'
          bracketed = nltk.text.TokenSearcher(tokens)
          hits = bracketed.findall(regexp)
          eng_stop = nltk.corpus.stopwords.words('english')
          hits = [h for h in hits if h[2].lower() not in eng_stop]
          return(hits)
    4. Add three more options to your script, -j (–adj), -a (–ari), and -n (–noheader). Note that if the –ari option is specified, then you should also print out the mean word length and mean sentence length. Your options should now look like:
      -w --word print only mean word length
      -s --sent print only mean sentence length
      -h --help print this help information and exit
      -a --ari  print ari statistics
      -j --adj  print mean number of adjectival verbs per sentence
      -n --noheader do not print a header line
      

      (10 points)
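
      The original post does not show code for this step. A minimal sketch of
      the option handling (the show* flag names follow the calc_text_stats
      snippet in question 5, and the -w/-s handling from the homework 7
      solution is assumed to set showword and showsent):

      try:
          opts, args = getopt.gnu_getopt(sys.argv[1:], "hwsjan",
                       ["help", "word", "sent", "adj", "ari", "noheader"])
      except getopt.GetoptError, err:
          print str(err)
          usage(sys.argv[0])
          sys.exit(2)
      showari = False
      showadj = False
      showheader = True
      for o, a in opts:
          if o in ("-a", "--ari"):
              showari = True
              # --ari implies printing both means as well
              showword = True
              showsent = True
          if o in ("-j", "--adj"):
              showadj = True
          if o in ("-n", "--noheader"):
              showheader = False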

    5. Now modify your script so that it can accept either stdin or one or more files as input. Use the stdin_or_file() function in args.py as an example. Your script will no longer print out usage information when no arguments are given, as was the case for homework 7. Create a function called calc_text_stats to handle all the calculations. That way you can call this function either multiple times (once per file, if files are specified), or just once, if reading from stdin. This will make your code more readable. You should also make sure to handle the two new options, for ari and adj. (20 points)
      def calc_text_stats(text, showsent, showword, showari, showadj):
          'return a formatted line of statistics for a raw text'
          words = nltk.word_tokenize(text)
          sents = nltk.sent_tokenize(text)
          if showsent:
              mean_sent_length = mean_sent_len(sents)
              mean_sent_print = '%13.2f' % mean_sent_length
          else:
              mean_sent_print = ''
          if showword:
              mean_word_length = mean_word_len(words)
              mean_word_print = '%13.2f' % mean_word_length
          else:
              mean_word_print = ''
          if showari:
              # --ari implies that both means have been calculated
              ari = '%13.2f' % calc_ari(mean_sent_length, mean_word_length)
          else:
              ari = ''
          if showadj:
              adjs = verb_adjectives(words)
              mean_adjs = '%13.3f' % (len(adjs) / len(sents))
          else:
              mean_adjs = ''
          return '%s %s %s %s' % (mean_word_print,
                                       mean_sent_print, ari, mean_adjs)

      if showheader:
          headers = ['filename']
          if showword:
              headers.append('mean_word_len')
          if showsent:
              headers.append('mean_sent_len')
          if showari:
              headers.append('ari')
          if showadj:
              headers.append('adjectiv_verbs')
          format_string = '%-17s ' + '%13s ' * (len(headers) - 1)
          print format_string % tuple(headers)
      if len(args) > 0:
          for file in args:
              # splitext is safer than rstrip('.txt') for removing the extension
              filename = os.path.splitext(os.path.basename(file))[0]
              f = open(file)
              raw = f.read()
              print '%-17s %s' % (filename,
                  calc_text_stats(raw, showsent, showword, showari, showadj))
      else:
          raw = sys.stdin.read()
          print '%-17s %s' % ('<stdin>',
              calc_text_stats(raw, showsent, showword, showari, showadj))
    6. Now print out the mean word length, mean sentence length, ari, and the mean number of present participles used as adjectives per sentence for huckFinn, tomSawyer, Candide, and devilsDictionary. Pipe the output to sort, and sort by ari. Your output should be similar to homework 7. Show the BASH command you used. (11 points)
      #!/bin/bash
      students/robfelty/hmwk8.py --noheader \
        resources/texts/{huckFinn,tomSawyer,Candide,devilsDictionary}.txt | sort -nk 4
    Homework 9 – Finding novel words
    Sat, 24 Oct 2009 00:10:02 +0000, Robert Felty
    http://robfelty.com/teaching/ling5200Fall2009/2009/10/homework-9-finding-novel-words/

    In this homework, you will attempt to find novel words in webpages. Make sure to read all questions before starting the assignment. It is due Oct. 30th and covers material up to Oct. 22nd.

    1. Use the get_wiki function defined below to download the wikipedia page about Ben Franklin. (4 points)

      def get_wiki(url):
          'Download text from a wikipedia page and return raw text'
          from urllib2 import urlopen, Request
          headers = {'User-Agent':
                    '''Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB;
                     rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4'''}
          req=Request(url=url, headers=headers)
          f = urlopen(req)
          raw = unicode(f.read(),encoding='utf8')
          return(raw)
      
    2. Wikipedia pages generally have a bunch of references and external links. These almost always occur at the section beginning “See also”. Strip off all text after “See also”. Hint: “See also” actually occurs twice in the document – once in the table of contents, and once as a section heading. You only want to ignore stuff after the section heading. (6 points)
    3. Next, define a function called unknown, which removes any items from
      this set that occur in the Words Corpus (nltk.corpus.words). The function should take a list of tokenized words as input, and return a list of novel words. Hint 1: your code
      will be much faster if you convert the list of words in the Words Corpus into a
      set, and then check whether or not a word is in that set. Hint 2: ignore case and punctuation when checking against the words corpus (but preserve case in your output). Hint 3: Sometimes the nltk word tokenizer does not strip off periods ending sentences. Make sure that none of the words end in a period. Hint 4: Make sure to strip out all html tags before tokenizing (see chapter 2 of the NLTK book for an example). (15 points)
    4. Use your unknown function to find novel words in the wikipedia page on Ben Franklin (5 points)
    5. As with most computational linguistics processes, it is nearly impossible to achieve perfect results. It is clear from browsing through the results that there are a number of “novel” words, which in fact are not novel. Let’s further refine our process. Some of the “novel” words we have found may be numbers, proper names (named entities), acronyms, or words with affixes (including both inflectional and derivational affixes). Let’s try to divide up our novel words into these categories.
      1. Use regular expressions to remove any numbers from the novel words.
        Remember that a number may have commas or a decimal point in it, and may begin
        with a dollar sign or end with a percent sign. Save the result as
        novel_nonum. Hint: when testing your regular expression, it is probably
        easier to check the result of finding items which are numbers, as opposed to
        checking the result of finding items which are not numbers. (8 points)

      2. Now use the porter stemmer to stem all the items in novel_nonum, then re-run them through the unknown function, saving the result as novel_stems. (7 points)
      3. Next, find as many proper names from novel_stems as possible, saving the result as proper_names. Note that finding named entities is actually a very difficult problem, and usually involves syntax and semantics. For our purposes however, let’s just use the fact that proper names in English start with a capital letter. Also create a new variable novel_no_proper, which has the proper names removed. (5 points)
      4. Calculate the percentage of novel tokens in the Ben Franklin wikipedia page, after having excluded number, affixed words, and proper names. (4 points)
      5. Calculate the percentage of novel types in the Ben Franklin wikipedia page, after having excluded number, affixed words, and proper names. (6 points)
    6. Extra Credit: Find additional ways to remove false positives in our “novel” word list. (3 extra points for each additional way, up to 12 extra points)
    Homework 7 solution
    Mon, 19 Oct 2009 16:52:11 +0000, Robert Felty
    http://robfelty.com/teaching/ling5200Fall2009/2009/10/homework-7-solution/

    This homework proved to be challenging for students. We will go over some of the common problems in class on Tuesday. Please take a detailed look at my solution in resources/hmwk.

    Class statistics for Homework 7
    mean 46
    standard deviation 8.98
    1. From BASH, use svn to copy your hmwk6.py file to hmwk7.py. This will preserve all of the history from hmwk6, so you can see how you have improved your code from homework 6 to homework 7. (3 points)

      # in bash:
      svn cp hmwk6.py hmwk7.py
      # Now in python
      # we keep our functions from hmwk6
      from __future__ import division  # so the means come out as floats
      import sys
      import os
      import getopt
      import string
      from pprint import pprint
      import nltk
      from nltk.corpus import stopwords
      def mean_sent_len(sents):
          eng_stopwords = stopwords.words('english')
          words_no_punc = [w for s in sents for w in s
                      if w not in string.punctuation and w.lower() not in eng_stopwords]
          num_words = len(words_no_punc)
          num_sents = len(sents)
          return (num_words / num_sents)

      def mean_word_len(words):
          eng_stopwords = stopwords.words('english')
          words_no_punc = [w for w in words
                    if w not in string.punctuation and w.lower() not in eng_stopwords]
          num_words = len(words_no_punc)
          num_chars = sum([len(w) for w in words_no_punc])
          return (num_chars / num_words)
    2. Create a function called usage, which prints out information about how the script should be used, including what arguments should be specified, and what options are possible. It should take one argument, which is the name of the script file. (7 points)
      def usage(script):
          print 'Usage: ' + script + ' <options> file(s)'
          print '''
          Possible options:
              -w --word print only mean word length
              -s --sent print only mean sentence length
              -h --help print this help information and exit
          '''
    3. Write your script to process the following options. Look at opts.py under resources/py for an example. If both -s and -w are specified, it should print out both options. (14 points)
      -w --word print only mean word length
      -s --sent print only mean sentence length
      -h --help print this help information and exit
      
      try:
          opts, args = getopt.gnu_getopt(sys.argv[1:], "hws",
                       ["help", "word", "sent"])
      except getopt.GetoptError, err:
          # print help information and exit:
          print str(err) # will print something like "option -a not recognized"
          usage(sys.argv[0])
          sys.exit(2)
      sent = False
      word = False
      if len(opts) == 0:
          sent = True
          word = True
      for o, a in opts:
          if o in ("-h", "--help"):
              usage(sys.argv[0])
              sys.exit()
          if o in ("-s", "--sent"):
              sent = True
          if o in ("-w", "--word"):
              word = True
    4. Instead of specifying which texts to process in your code, change your code so
      that it accepts filenames from the command line. Look at the args.py file
      under resources/py for an example of how to do this. Your code should print out
      the name of each file (you can use the os.path.basename function to print out only the name of the file) specified on the command line, and the mean word length
      and sentence length, with a width of 13 and a precision of 2. Note that it
      should only print word length or sentence length if that option has been
      specified. If no files are specified, it should print the usage information
      and exit. Also note that after reading in a text you will have to first convert
      it to a list of words or sentences using the tokenize functions in the nltk,
      before calculating the mean word length and sentence length with the functions
      you defined in homework 6. See chapter 13 in the notes and homework 5 for
      examples of how to tokenize text. The first line of output should
      be a line of headers describing the columns (28 points) Here is some example
      output:

      filename        mean_word_len mean_sent_len
      fooey                    3.45         13.47
      bar                      3.15          9.29
      
      if len(args) > 0:
          if word and sent:
              print '%-17s %s %s' % ('filename', 'mean_word_len', 'mean_sent_len')
          elif word:
              print '%-17s %s' % ('filename', 'mean_word_len')
          elif sent:
              print '%-17s %s' % ('filename', 'mean_sent_len')
          for file in args:
              f = open(file)
              raw = f.read()
              words = nltk.word_tokenize(raw)
              sents = nltk.sent_tokenize(raw)
              # splitext is safer than rstrip('.txt') for removing the extension
              filename = os.path.splitext(os.path.basename(file))[0]
              if sent:
                  mean_sent_length = mean_sent_len(sents)
              if word:
                  mean_word_length = mean_word_len(words)
              if word and sent:
                  print '%-17s %13.2f %13.2f' % (filename, mean_word_length, mean_sent_length)
              elif word:
                  print '%-17s %13.2f' % (filename, mean_word_length)
              elif sent:
                  print '%-17s %13.2f' % (filename, mean_sent_length)
      else:
          usage(sys.argv[0])
          sys.exit(2)
    5. Use your script to print out mean word length and sentence length for huckFinn, tomSawyer, Candide, and devilsDictionary (in resources/texts). Save the output to a file called hmwk7_stats.txt in your personal directory, and commit it to the svn repository. Show the command you use in BASH. Make your paths relative to the root of your working copy of the repository. Run the same command again, trying the -s and -w options, and print to the screen. (8 points)

      # In bash:
      students/robfelty/hmwk7.py resources/texts/{huckFinn,tomSawyer,Candide,devilsDictionary}.txt > students/robfelty/hmwk7_stats.txt
      students/robfelty/hmwk7.py -w resources/texts/{huckFinn,tomSawyer,Candide,devilsDictionary}.txt
      students/robfelty/hmwk7.py -s resources/texts/{huckFinn,tomSawyer,Candide,devilsDictionary}.txt