By: Robert Felty

Robert Felty — Thu, 29 Oct 2009 15:22:12 +0000

Some people had a problem with the get_wiki function. I have updated the function above, so it should work for everyone now. Sorry about that. A couple of other questions people had:

What do I do with all these HTML tags?: Strip them out. See chapter 2 of the NLTK book for an example (or look in the notes).
In question 5, everything is of unicode type. What do you want here?: "Type" here means lemma, as in type-token ratio, not the Python data type.
Won't removing proper names based on capitalization also remove the first word of sentences?: True, but we are trying to find novel words. Any dictionary word that starts a sentence will already have been taken out by that point. Novel words that happen to start a sentence will be grouped in with the proper names. This is another example of how we have to combine automated techniques with manual checking.
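
For anyone who wants to see the three answers above in one place, here is a rough sketch. It is not the homework solution: the names strip_tags, tokenize, type_token_ratio and novel_candidates are made up for illustration, the tag stripping uses Python's standard html.parser rather than anything from NLTK, the tokenizer only keeps runs of ASCII letters, and lowercased word forms stand in for lemmas (no real lemmatization is attempted).

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of `html` with every tag removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def tokenize(text):
    """Split text into alphabetic tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)


def type_token_ratio(tokens):
    """Distinct lowercased word forms (types) divided by total tokens."""
    if not tokens:
        return 0.0
    return len({t.lower() for t in tokens}) / len(tokens)


def novel_candidates(tokens, dictionary):
    """Drop dictionary words first, then drop anything capitalized."""
    unknown = [t for t in tokens if t.lower() not in dictionary]
    return sorted({t for t in unknown if not t[0].isupper()})


# Toy data, invented for this example only.
html = "<p>Students like blogging. Felty likes <b>vlogging</b>. Blogging is fun.</p>"
dictionary = {"students", "like", "likes", "is", "fun"}

tokens = tokenize(strip_tags(html))
print(type_token_ratio(tokens))
print(novel_candidates(tokens, dictionary))  # ['blogging', 'vlogging']
```

In the toy data, "Students" starts a sentence but is in the dictionary, so it is gone before the capitalization check runs. "Felty" and the sentence-initial "Blogging" are both capitalized and unknown, so they land in the same discarded pile, which is where the manual checking comes in.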

By: Robert Felty

Robert Felty — Tue, 27 Oct 2009 02:42:43 +0000

Well, it doesn’t really matter, since you want to strip out all HTML tags anyway.
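
Here is a small sketch of why the order should not matter for the "see also" cut in Q2. The function names and the sample page are invented for illustration, and the tag stripping is a crude regex rather than a real HTML parser. It also assumes "see also" appears as plain text on the page and not inside a tag or attribute.

```python
import re


def truncate_at_see_also(text):
    """Keep only what precedes the first case-insensitive 'see also'."""
    match = re.search(r"see also", text, flags=re.IGNORECASE)
    return text[:match.start()] if match else text


def strip_tags_crude(html):
    """Remove anything that looks like <...>; rough, not a real HTML parser."""
    return re.sub(r"<[^>]+>", "", html)


# Invented sample page, not real Wikipedia markup.
page = (
    "<p>Some article text.</p>"
    "<h2>See also</h2>"
    "<ul><li><a href='/x'>Other page</a></li></ul>"
)

cut_then_strip = strip_tags_crude(truncate_at_see_also(page))
strip_then_cut = truncate_at_see_also(strip_tags_crude(page))

print(cut_then_strip == strip_then_cut)  # True
print(cut_then_strip)                    # Some article text.
```

Cutting first leaves a dangling opening tag at the end of the string, but the tag stripping removes it, so both orders give the same text.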

By: ash_v

ash_v — Mon, 26 Oct 2009 21:45:44 +0000

In Q2, what is meant by "remove all text after 'see also'"? Does it mean just the text, or the text plus the HTML tags?

Comments on: Homework 9 – Finding novel words
