September 17, 2009 · notes · (No comments)

Here are the notes for today’s class on Python lists and calculating word frequency with the NLTK.

ling5200-nltk1-1-notes.pdf

September 15, 2009 · slides · (No comments)

Here are the slides from today’s class, covering an intro to Python and the NLTK.

ling5200-nltk1-slides.pdf

September 15, 2009 · News · 3 comments

I figured out the adj question. We were indeed being led astray by the similarity between “adj” and “adv”. We should have been looking at the similarity between “adv” and “v”. When we use a regular expression like (adj|v). on the string “quickly, adv.”, we see that the string does contain “v.”, so the pattern matches. However, the string does not contain “ v.” (with a leading space), so requiring that space in the pattern avoids the false match.
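
The behavior described above can be reproduced with a short Python sketch. It uses Python's re module rather than grep, and it assumes the pattern was meant to be written with an escaped dot, (adj|v)\. ; the variable names are illustrative only.

```python
import re

entry = "quickly, adv."

# Pattern from the post (dot escaped, by assumption): the "v" alternative
# also matches the "v." at the end of "adv."
loose = re.compile(r"(adj|v)\.")

# Assumed fix: require the space that follows the comma, so only a
# standalone part-of-speech tag can match
strict = re.compile(r" (adj|v)\.")

loose_match = loose.search(entry)
strict_match = strict.search(entry)

print(loose_match.group(0))  # v.
print(strict_match)          # None
```

With the leading space in the pattern, an entry such as "RUN, v." still matches, while "quickly, adv." no longer does.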

September 15, 2009 · News · 1 comment

I would like to know a bit about how you think the course is going. Please answer the following questions to let me know what you think. You must be logged in to fill out the questionnaire. Note that you have to click vote for each question (but it won’t take you to a new page, so it shouldn’t be too annoying).

The homework questions reflect the material covered in class.

  • 4 (strongly agree) (60%, 3 Votes)
  • 3 (somewhat agree) (40%, 2 Votes)
  • 2 (somewhat disagree) (0%, 0 Votes)
  • 1 (strongly disagree) (0%, 0 Votes)

Total Voters: 5

Lectures cover

  • 3 (a little too much material) (100%, 5 Votes)
  • 4 (much too much material) (0%, 0 Votes)
  • 2 (not quite enough material) (0%, 0 Votes)
  • 1 (much too little material) (0%, 0 Votes)

Total Voters: 5

The instructor is enthusiastic

  • 3 (somewhat agree) (75%, 3 Votes)
  • 4 (strongly agree) (25%, 1 Votes)
  • 2 (somewhat disagree) (0%, 0 Votes)
  • 1 (strongly disagree) (0%, 0 Votes)

Total Voters: 4

I am learning a lot in this course

  • 4 (strongly agree) (75%, 3 Votes)
  • 3 (somewhat agree) (25%, 1 Votes)
  • 2 (somewhat disagree) (0%, 0 Votes)
  • 1 (strongly disagree) (0%, 0 Votes)

Total Voters: 4

I would describe the difficulty of this course as:

  • 3 (More difficult than average) (100%, 3 Votes)
  • 4 (One of the most difficult courses I have ever taken) (0%, 0 Votes)
  • 2 (Less difficult than average) (0%, 0 Votes)
  • 1 (One of the easiest courses I have taken) (0%, 0 Votes)

Total Voters: 3

The instructor is easy to talk to

  • 4 (strongly agree) (100%, 5 Votes)
  • 3 (somewhat agree) (0%, 0 Votes)
  • 2 (somewhat disagree) (0%, 0 Votes)
  • 1 (strongly disagree) (0%, 0 Votes)

Total Voters: 5

Finally, are there any topics not currently in the syllabus which you would like to cover, or do you have any other suggestions? Please leave a comment (your name will appear with your comment).

September 15, 2009 · News · (No comments)

Several people seemed interested in seeing some of the handy shortcuts I use in my .bashrc file. I have created an rc directory in the repository with several of my configuration files. I hope you find them useful.

September 14, 2009 · homework · (No comments)

Overall, most students did quite well. Comments are in your svn directories.

mean 54.4
standard deviation 6.24
  1. Print out the entries (orthography only) from the celex.txt file which were taken from GOOGLE. Hint: You will need to use a pipe. (6 points)

    grep GOOGLE celex.txt | cut -f2 -d '\'
  2. Print out the 50 most frequent words from the celex.txt file which were taken from GOOGLE. Hint: You will need to combine the answers from the last 2 questions. (9 points)

    grep GOOGLE celex.txt | cut -f2,3 -d '\' | sort -t '\' -k 2,2rn | head -n 50
    OR
    grep GOOGLE celex.txt | sort -t '\' -k 3,3rn | head -n 50 | cut -f 2,3 -d '\'
  3. Use unix commands to count the number of entries (not definitions) in the Devil’s Dictionary that begin with a vowel. Your output should be a single number. (7 points)
    grep -Ec '^[AEIOU][A-Z-]*,' devilsDictionary.txt
    227
  4. Use unix commands to calculate the average number of letters per word for each entry (not the definitions) in the Devil’s Dictionary. The output should simply be a number. HINT: You will need to use subshells, and bc (10 points)
    entries=`grep -E '^[A-Z]+,' devilsDictionary.txt |cut -f1 -d ','|wc -l`
    # wc -c also counts the newline ending each entry, so strip newlines first
    letters=`grep -E '^[A-Z]+,' devilsDictionary.txt |cut -f1 -d ','|tr -d '\n'|wc -c`
    echo "$letters/$entries"|bc -l

    OR, in one fell swoop

    echo "`grep -E '^[A-Z]+,' devilsDictionary.txt |cut -f1 -d ','|tr -d '\n'|wc -c`/`grep -E '^[A-Z]+,' devilsDictionary.txt |cut -f1 -d ','|wc -l`"|bc -l
  5. Count the number of adjectives, nouns, and verbs in the Devil’s Dictionary. (10 points)
    noun=`grep -cE '^[A-Z]+, n\.' devilsDictionary.txt`
    verb=`grep -cE '^[A-Z]+, v\.' devilsDictionary.txt`
    adj=`grep -cE '^[A-Z]+, adj\.' devilsDictionary.txt`
  6. Print out all the entries (not the definitions), which are not adjectives, nouns, or verbs. HINT: use grep more than once. (10 points)
    grep -E '^[A-Z]+, ' devilsDictionary.txt |grep -vE '^[A-Z]+, (v|n|adj)\.' | cut -f1 -d '.'
  7. Write a unix pipeline which will print the number of words in the celex.txt file that contain a q not followed by a u (look only at the orthography of each entry). (8 points)
    cut -f2 -d '\' celex.txt |grep -Eic 'q[^u]'
    EVEN BETTER
    cut -f2 -d '\' celex.txt |grep -Eic 'q([^u]|$)'
  8. Extra credit

    Write a unix pipeline which will print the total number of points in this assignment. Don’t include the points for the extra credit (3 extra points) (Hint: use dc)

    echo "`grep -oE '[0-9]+ points' hmwk2.solution |cut -d ' ' -f1` ++++++ p"|dc
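
The difference between the two patterns in solution 7 can be illustrated with a small Python sketch. The word list is hypothetical and not taken from celex.txt, and Python's re module stands in for grep -Ei here.

```python
import re

# Hypothetical orthography column; these words are made up for illustration
words = ["queen", "Iraq", "qat", "quick", "Qatar", "aqua"]

# First solution: q followed by some character other than u
first = re.compile(r"q[^u]", re.IGNORECASE)

# Second solution: also accept a q at the very end of the word
better = re.compile(r"q([^u]|$)", re.IGNORECASE)

first_hits = [w for w in words if first.search(w)]
better_hits = [w for w in words if better.search(w)]

print(first_hits)   # ['qat', 'Qatar']
print(better_hits)  # ['Iraq', 'qat', 'Qatar']
```

The first pattern needs a character after the q, so a word-final q as in "Iraq" is missed; the alternation with $ picks it up.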

September 11, 2009 · homework · 4 comments

This homework assignment continues to expand your UNIX skills, and starts to use Python and the NLTK. It covers material up to Sep. 15th. It is due Sep. 18th by 5:00 p.m. You should submit the homework via svn.

UNIX

  1. Using the celex.txt file, calculate the ratio of heterosyllabic vs. tautosyllabic st clusters. That is, how frequently do words contain an st cluster that is within a syllable, vs. how frequently do they contain an st cluster that spans two syllables? Note that each word contains a syllabified transcription where syllables are surrounded by brackets []. For example, abacus has three syllables, [&][b@][k@s]. You should use grep and bc to calculate the ratio (also compare to the question from hmwk 2 to compute the average number of letters per word for each entry in the Devil’s Dictionary). (10 points)
  2. How many entries in the Devil’s Dictionary have more than 6 letters? Use grep to find out. (5 points)

Subversion

For this homework, submit it via subversion by adding it into your own directory. Make at least 2 separate commits. When you are finished, make sure to say so in your log message.

  1. Create a new file called hmwk3_<yourname>.txt, and add it to the svn repository. Show the commands you used (5 points)
  2. Show the log of all changes you made to your homework 3 file. Show the commands you used (5 points)
  3. Find all log messages pertaining to the slides which contain grep. You will need to use a pipe. Your command should print out not only the line which contains grep, but also the 2 preceding lines. Search the grep manual for “context” to find the appropriate option. (7 points)
  4. Show the changes to your homework 3 file between the final version and the version before that. Show the commands you used (5 points)

Python

  1. Calculate the percentage of indefinite articles in Moby Dick using the NLTK. You can use the percentage function defined in chapter 1.1 (8 points)

  2. Using the dispersion_plot function in the NLTK, find one word which has been used with increasing frequency in the inaugural addresses, and one which has been used with decreasing frequency. You can decide between increasing and decreasing simply by visual inspection of the graphs. (5 points)
  3. Use the random module to generate 2 random integers between 10 and 100, and then calculate the quotient of the first number divided by the second. Make sure to use normal division, not integer division. Look at the help for the random module to find the appropriate function. (10 points)

Extra credit:

Use perl and regular expressions to strip the answers from the solutions to homework one. First, download the solution. You might also want to look at a blog entry on perl slurping for hints. (3 extra points)

September 10, 2009 · News · (No comments)

Since we didn’t get around to talking about Python and the NLTK at all today, I am postponing that material until next week. There is no additional reading for Tuesday. I will update the syllabus to reflect this change shortly.

Also, here is another note on using subversion with the class repository:

cd
svn co http://robfelty.com/subversion/ling5200/ myling5200
cd myling5200/students
svn co http://robfelty.com/subversion/ling5200/students/<yourname>

September 10, 2009 · notes · (No comments)

Here are the notes for Sep. 10th, including an intro to Python and the NLTK.

ling5200-nltk1-notes.pdf

September 8, 2009 · slides · (No comments)

Here are the slides from today covering more unix basics. We will cover the stuff about subversion on Thursday.

ling5200-unix3-slides.pdf