Comments on: Homework 9 – Finding novel words
http://robfelty.com/teaching/ling5200Fall2009/2009/10/homework-9-finding-novel-words/
Introduction to computational corpus linguistics
Mon, 16 Nov 2009 18:31:36 -0500

By: Robert Felty, Thu, 29 Oct 2009 15:22:12 +0000
http://robfelty.com/teaching/ling5200Fall2009/2009/10/homework-9-finding-novel-words/comment-page-1/#comment-29

Some people had a problem with the get_wiki function. I have updated the function above so that it should work for everyone now. Sorry about that.

A couple of other questions people had:

What do I do with all these HTML tags?
Strip them out. See chapter 2 of the NLTK book for an example (or look in the notes).
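
As a rough illustration of stripping tags, here is a minimal sketch that uses only Python 3's standard `html.parser` module rather than the NLTK approach referred to above. The names `TagStripper` and `strip_tags` are made up for this example and are not part of the homework or of NLTK.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags; the tags are ignored.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of `html` with all tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.chunks)


print(strip_tags("<p>A <b>novel</b> word</p>"))
```

Note that this sketch also keeps the contents of `<script>` and `<style>` elements, so a real page may need extra filtering.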
In question 5, everything is of unicode type. What do you want here?
Type here means lemma, as in type-token ratio, not the Python data type.
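
For readers unfamiliar with the term, a type-token ratio could be computed roughly as follows. This is a simplified sketch: it treats each distinct lowercased word form as a type and does no lemmatization, which would be an extra step. The function name `type_token_ratio` is an assumption for illustration.

```python
def type_token_ratio(tokens):
    """Return the number of distinct words (types) divided by the total (tokens)."""
    if not tokens:
        return 0.0
    # Lowercase so that "The" and "the" count as the same type.
    types = {token.lower() for token in tokens}
    return len(types) / len(tokens)


tokens = "the cat saw the other cat".split()
# 4 types (the, cat, saw, other) over 6 tokens
print(type_token_ratio(tokens))
```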
Won’t removing proper names based on capitalization also remove the first word of sentences?
True, but we are trying to find novel words. Any words in our dictionary that are the first word of a sentence will already have been taken out. If any novel words happen to start a sentence, they will be grouped in with the proper names. This is another example of how we have to use a combination of automated techniques and manual checking.
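
The order of operations described above might be sketched like this: first discard anything found in the dictionary (ignoring case), then set the remaining capitalized words aside for manual checking. The function name `split_candidates` and the tiny word list are hypothetical, not taken from the assignment.

```python
def split_candidates(tokens, dictionary):
    """Split unknown words into lowercase candidates and capitalized ones."""
    novel, capitalized = [], []
    for token in tokens:
        if not token.isalpha():
            continue  # skip punctuation and numbers
        if token.lower() in dictionary:
            continue  # known word, including known sentence-initial words
        if token[0].isupper():
            # Proper name or sentence-initial novel word: needs a manual look.
            capitalized.append(token)
        else:
            novel.append(token)
    return novel, capitalized


dictionary = {"the", "cat", "saw", "a"}  # hypothetical stand-in for a real word list
tokens = ["The", "cat", "saw", "Alice", "Blorfing", "a", "snarfle"]
novel, capitalized = split_candidates(tokens, dictionary)
print(novel)
print(capitalized)
```

Here "The" is dropped because it is in the dictionary, while a capitalized unknown word such as "Blorfing" ends up in the same pile as "Alice", which is the case that needs manual checking.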
By: Robert Felty, Tue, 27 Oct 2009 02:42:43 +0000
http://robfelty.com/teaching/ling5200Fall2009/2009/10/homework-9-finding-novel-words/comment-page-1/#comment-28

Well, it doesn't really matter, since you want to strip out all HTML tags anyway.
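
To see why the choice does not matter, here is a minimal sketch of the "see also" cutoff from Q2. It assumes the page is held in a single string and that the first case-insensitive match of "see also" marks the section to drop. The function name `drop_after_see_also` is made up for this example.

```python
import re


def drop_after_see_also(text, marker="see also"):
    """Return `text` up to the first case-insensitive occurrence of `marker`."""
    match = re.search(re.escape(marker), text, flags=re.IGNORECASE)
    return text if match is None else text[:match.start()]


page = "<p>Main article text.</p><h2>See also</h2><ul><li>Other page</li></ul>"
print(drop_after_see_also(page))
```

The cut leaves a dangling `<h2>` behind, but that disappears once the tags are stripped. If the phrase can also occur in the body of the article, matching on the section heading instead would be safer.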

By: ash_v, Mon, 26 Oct 2009 21:45:44 +0000
http://robfelty.com/teaching/ling5200Fall2009/2009/10/homework-9-finding-novel-words/comment-page-1/#comment-27

In Q2, what is meant by remove all text after “see also”? Does it mean just the text, or the text plus the HTML tags?
