Linguistics 5200, Fall 2009: Introduction to computational corpus linguistics
http://robfelty.com/teaching/ling5200Fall2009

Final grades / happy holidays
Posted Fri, 18 Dec 2009 by Robert Felty

I have made comments on your final papers and committed them to the Subversion repository. I made the comments directly in the PDF (if you submitted something other than a PDF, I converted it to PDF with the same basename). You should be able to view the comments in your favorite PDF viewer (Adobe Reader, Skim, Okular, Apple's Preview, etc.).

I also added a file to each of your directories called course_grades.txt, which lists all your grades for the course. It is a tab-delimited file; for pretty viewing, you might want to open it with a spreadsheet program.

I also wanted to say thank you for an exciting course. As always, I ended up learning quite a bit from teaching. I hope that you learned quite a bit as well, and that the topics we covered will be useful to you in future endeavors.

Enjoy your well-deserved break.

Final papers
Posted Fri, 11 Dec 2009 by Robert Felty

This is just a reminder that final papers are due next Wednesday, Dec. 16th, by 5 p.m. Papers should be 5-15 pages long. Please commit them to the Subversion repository, along with all the code you wrote and any additional resources I need to run your code, such as texts, databases, etc.

Notes for Dec. 3rd
Posted Thu, 3 Dec 2009 by Robert Felty

Here are today's notes on advanced regular expressions in Python and Perl.

ling5200-grep2-notes.pdf

Final presentation schedule
Posted Tue, 1 Dec 2009 by Robert Felty

Here is the final presentation schedule for next week:

Tuesday
  • 12:40 Matt Cecil
  • 12:55 Keith Mertz
  • 13:10 Ashwini Vaidya

Thursday

  • 12:40 Anwen Fredriksen
  • 12:55 Steve Vihel
  • 13:10 Calvin Pohawpotchoko
  • 13:25 Sam Perdue

Please be sure to follow these guidelines:

  • 5-10 minutes long
  • Prepare handouts or slides
    • Slides in PDF or PPT format, please
    • E-mail me your slides by 10 a.m.
    • You can use my computer (preferred) or your own
  • Use your classmates as resources for ideas
  • PRACTICE BEFOREHAND
Notes for Dec. 1st
Posted Tue, 1 Dec 2009 by Robert Felty

Here are today's notes on Bayesian and maximum entropy classifiers.

ling5200-nltk6-2-notes.pdf

Homework 11 solution
Posted Tue, 24 Nov 2009 by Robert Felty

Most students did very well on this assignment. The only consistent shortcoming was unnecessary loops in the tag_errors function, which led to an increase in execution time of about 10%.
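To illustrate the difference (a hypothetical reconstruction, not any particular student's code): the variant below makes three passes over the data, where the zip-based solution in question 2 below compares tags in a single pass.

    # Hypothetical slower variant: two flattening passes plus a comparison
    # pass, instead of comparing while walking the zipped lists once
    def tag_errors_extra_loops(test, gold):
        test_tags = [pair[1] for sent in test for pair in sent]
        gold_tags = [pair[1] for sent in gold for pair in sent]
        errors = []
        for t, g in zip(test_tags, gold_tags):
            if t != g:
                errors.append((t, g))
        return errors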

Class statistics for Homework 11
mean 56.71
standard deviation 8.58

    In this homework you will practice part of speech tagging, and evaluating part of speech taggers. The homework covers material up to Nov. 12, and is due Nov. 19th.

    1. Use the unigram tagger to evaluate the accuracy of tagging of the romance and the adventure genres of the Brown corpus. Use a default tagger of NN as a backoff tagger. You should train the tagger on the first 90% of each genre, and test on the remaining 10%. (10 points)

      # assumes: import nltk; from nltk.corpus import brown
      t0 = nltk.DefaultTagger('NN')  # backoff tagger: tag everything as NN

      adv_tagged_sents = brown.tagged_sents(categories='adventure')
      adv_size = int(len(adv_tagged_sents) * 0.9)
      adv_train_sents = adv_tagged_sents[:adv_size]
      adv_test_sents = adv_tagged_sents[adv_size:]
      adv_tagger = nltk.UnigramTagger(adv_train_sents, backoff=t0)
      adv_tagger.evaluate(adv_test_sents)

      rom_tagged_sents = brown.tagged_sents(categories='romance')
      rom_size = int(len(rom_tagged_sents) * 0.9)
      rom_train_sents = rom_tagged_sents[:rom_size]
      rom_test_sents = rom_tagged_sents[rom_size:]
      rom_tagger = nltk.UnigramTagger(rom_train_sents, backoff=t0)
      rom_tagger.evaluate(rom_test_sents)
    2. Now let’s investigate the most common types of errors that our tagger makes. Write a function called tag_errors which will return all errors that our tagger made. It should accept two arguments, test, and gold, which should be lists of tagged sentences. The test sentences should be ones that have been automatically tagged, and the gold should be ones that have been manually corrected. The function should output a list of incorrect, correct tuples, e.g. [('VB', 'NN'), ('VBN', 'VBD'), ('NN', 'VB'), ('NN', 'VBD'), ('TO', 'IN')]. (15 points)

      def tag_errors(test, gold):
          '''returns a list of (wrong, correct) tuples, given automatically
          tagged data and the gold standard for that data'''
          errors = []
          for testsent, goldsent in zip(test, gold):
              for testpair, goldpair in zip(testsent, goldsent):
                  if testpair[1] != goldpair[1]:
                      errors.append((testpair[1], goldpair[1]))
          return errors
    3. Use the Unigram taggers you trained to tag the test data from the adventure and romance genres of the Brown corpus. HINT: Look at the batch_tag method of the UnigramTagger. (10 points)

      adv_sents = brown.sents(categories='adventure')
      adv_unknown = adv_sents[adv_size:]
      adv_test = adv_tagger.batch_tag(adv_unknown)

      rom_sents = brown.sents(categories='romance')
      rom_unknown = rom_sents[rom_size:]
      rom_test = rom_tagger.batch_tag(rom_unknown)
    4. Use your tag_errors function to find all the tagging errors for the romance and adventure genres of the Brown corpus. (10 points)

      adv_errors = tag_errors(adv_test, adv_test_sents)
      rom_errors = tag_errors(rom_test, rom_test_sents)
    5. Now create frequency distributions of the tagging errors for the romance and adventure genres. (5 points)
      adv_error_fd = nltk.FreqDist(adv_errors)
      rom_error_fd = nltk.FreqDist(rom_errors)
    6. What differences do you notice between the frequency distributions of the two genres? (No code required for this question, but see the sketch just below for one way to inspect the most frequent errors.) (5 points)
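      To compare the two distributions, it helps to look at the most frequent error types. A sketch, assuming the NLTK 2.x FreqDist API used throughout this course, where items() returns (sample, count) pairs sorted by decreasing frequency:

      # ten most frequent (wrong, correct) tag pairs in each genre
      adv_error_fd.items()[:10]
      rom_error_fd.items()[:10]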
       
    7. How might we improve our tagging performance? (No code required for this question; one standard improvement is sketched after this list.) (5 points)
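      One standard improvement, from the NLTK book's tagging chapter, is to combine taggers: back off from a bigram tagger to our unigram tagger, and from there to the NN default. A sketch reusing the names from question 1 (I have not measured the gain on these genres):

      # bigram context where available, else unigram, else NN
      t2 = nltk.BigramTagger(adv_train_sents, backoff=adv_tagger)
      t2.evaluate(adv_test_sents)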
Notes for Nov. 19th
Posted Thu, 19 Nov 2009 by Robert Felty

Here are today's notes on classifier evaluation and decision trees.

ling5200-nltk6-notes.pdf

Notes for Nov. 17th
Posted Tue, 17 Nov 2009 by Robert Felty

Here are today's notes on supervised classification.

ling5200-nltk6-notes.pdf

Homework 10 Solution
Posted Tue, 17 Nov 2009 by Robert Felty

Most students did well on this assignment. Please take a detailed look at my solution in resources/hmwk/hmwk10.py.

Class statistics for Homework 10
mean 51.67
standard deviation 7.28
    1. Use svn to copy my solution to homework 8 from resources/py into your personal directory as hmwk10.py (5 points)

      svn cp resources/py/hmwk8.py students/robfelty/hmwk10.py
    2. Modify the mean_word_len and mean_sent_len functions to accept two optional arguments, ignore_stop and use_set. The default for each of these should be True. If use_set is True, you should convert the stopword corpus to a set. If ignore_stop is True, you should exclude stopwords from the calculation (which is what the code in hmwk8.py does). (15 points)

      # assumes, as in hmwk8.py: import string; from nltk.corpus import stopwords
      def mean_sent_len(sents, ignore_stop=True, use_set=True):
          '''returns the average number of words per sentence

          Input should be a list of lists, with each item in the list being a
          sentence, composed of a list of words. We ignore any punctuation,
          and ignore stopwords if ignore_stop is True.
          '''
          if use_set:
              eng_stopwords = set(stopwords.words('english'))
          else:
              eng_stopwords = stopwords.words('english')
          if ignore_stop:
              words_no_punc = [w for s in sents for w in s
                              if w not in string.punctuation
                              and w.lower() not in eng_stopwords]
          else:
              words_no_punc = [w for s in sents for w in s
                              if w not in string.punctuation]
          num_words = len(words_no_punc)
          num_sents = len(sents)
          # float() guards against Python 2 integer division
          return num_words / float(num_sents)

      def mean_word_len(words, ignore_stop=True, use_set=True):
          '''returns the average number of letters per word

          Input should be a list of words. We ignore any punctuation,
          and ignore stopwords if ignore_stop is True.
          '''
          if use_set:
              eng_stopwords = set(stopwords.words('english'))
          else:
              eng_stopwords = stopwords.words('english')
          if ignore_stop:
              words_no_punc = [w for w in words
                    if w not in string.punctuation and w.lower() not in eng_stopwords]
          else:
              words_no_punc = [w for w in words
                    if w not in string.punctuation]
          num_words = len(words_no_punc)
          num_chars = sum([len(w) for w in words_no_punc])
          # float() guards against Python 2 integer division
          return num_chars / float(num_words)
    3. Now create a new file called means_timing.py. In this file, import your hmwk10.py module, and use the timeit module to test how long it takes to calculate the mean sentence length 100 times, trying all 4 combinations of the parameters of use_set and ignore_stop. (10 points)

      # means_timing.py
      from timeit import Timer

      setup = '''
      import nltk
      import hmwk10
      f = open('../texts/Candide.txt')
      raw = f.read()
      sents = hmwk10.sent_tokenize(raw)
      words = nltk.word_tokenize(raw)
      '''

      test1 = 'hmwk10.mean_word_len(words)'
      print Timer(test1, setup).timeit(100)

      test2 = 'hmwk10.mean_word_len(words, use_set=False)'
      print Timer(test2, setup).timeit(100)

      test3 = 'hmwk10.mean_word_len(words, use_set=False, ignore_stop=False)'
      print Timer(test3, setup).timeit(100)

      test4 = 'hmwk10.mean_word_len(words, use_set=True, ignore_stop=False)'
      print Timer(test4, setup).timeit(100)
    4. Now add another global option called include-stop (i for short) to hmwk10.py specifying whether or not to ignore stopwords when calculating mean word length and sentence length. The default should be False. (10 points)
      # assumes: import sys; import getopt
      opts, args = getopt.gnu_getopt(sys.argv[1:], "hwsajni",
                   ["help", "word", "sent", 'ari', 'adj', 'noheader', 'include-stop'])
      include_stop = False
      for o, a in opts:
          if o in ("-i", "--include-stop"):
              include_stop = True

      # in calc_text_stats: note the negation -- including stopwords
      # means NOT ignoring them
      mean_sent_length = mean_sent_len(sents, ignore_stop=not include_stop)
    5. Modify the calc_text_stats function so that it also computes the percentage of words that are stop words; a sketch follows. (10 points)
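      A minimal sketch of the stop-word percentage (my full solution is in resources/hmwk/hmwk10.py; the variable names here are illustrative, and eng_stopwords is the stopword set built as in question 2):

      # percentage of non-punctuation word tokens that are stopwords
      words_only = [w for w in words if w not in string.punctuation]
      num_stop = len([w for w in words_only if w.lower() in eng_stopwords])
      per_stop_words = 100.0 * num_stop / len(words_only)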
    6. Now create a bash script which prints out the mean word and sentence length for Huck Finn, Tom Sawyer, Candide, and the Devil's dictionary. Pipe the output to sort, sorting by mean sentence length. Try it both including and ignoring stop words. Your output (when ignoring stop words) should look like that below. (10 points)
      filename          mean_word_len mean_sent_len per_stop_words
      tomSawyer                  5.51          7.46            42.2
      Candide                    6.07          9.04            43.5
      huckFinn                   4.93          9.32            45.0
      devilsDictionary           6.30         10.08            40.2
      
       # including stop words
       ./text_means.py -wsi ../texts/{tomSawyer,huckFinn,Candide,devilsDictionary}.txt | sort -nk 3
       # ignoring stop words
       ./text_means.py -ws ../texts/{tomSawyer,huckFinn,Candide,devilsDictionary}.txt | sort -nk 3
Homework 11 – Part of speech tagging
Posted Sat, 14 Nov 2009 by Robert Felty

In this homework you will practice part of speech tagging, and evaluating part of speech taggers. The homework covers material up to Nov. 12, and is due Nov. 20th.

    1. Use the unigram tagger to evaluate the accuracy of tagging of the romance and the adventure genres of the Brown corpus. Use a default tagger of NN as a backoff tagger. You should train the tagger on the first 90% of each genre, and test on the remaining 10%. (10 points)
    2. Now let’s investigate the most common types of errors that our tagger makes. Write a function called tag_errors which will return all errors that our tagger made. It should accept two arguments, test, and gold, which should be lists of tagged sentences. The test sentences should be ones that have been automatically tagged, and the gold should be ones that have been manually corrected. The function should output a list of incorrect, correct tuples, e.g. [('VB', 'NN'), ('VBN', 'VBD'), ('NN', 'VB'), ('NN', 'VBD'), ('TO', 'IN')]. (15 points)
    3. Use the Unigram taggers you trained to tag the test data from the adventure and romance genres of the Brown corpus. HINT: Look at the batch_tag method of the UnigramTagger. (10 points)
    4. Use your tag_errors function to find all the tagging errors for the romance and adventure genres of the Brown corpus. (10 points)
    5. Now create frequency distributions of the tagging errors for the romance and adventure genres. (5 points)
    6. What differences do you notice between the frequency distributions of the two genres? (No code required for this question) (5 points)
    7. How might we improve our tagging performance? (No code required for this question) (5 points)