September 14, 2009 · homework

Overall, most students did quite well. Comments are in your svn directories.

Mean: 54.4
Standard deviation: 6.24
  1. Print out the entries (orthography only) from the celex.txt file which were taken from GOOGLE. Hint: You will need to use a pipe. (6 points)

    grep GOOGLE celex.txt | cut -f2 -d '\'
  2. Print out the 50 most frequent words from the celex.txt file which were taken from GOOGLE. Hint: You will need to combine the answers from the last 2 questions. (9 points)

    grep GOOGLE celex.txt | cut -f2,3 -d '\' | sort -t '\' -k 2,2rn | head -n 50
    OR
    grep GOOGLE celex.txt | sort -t '\' -k 3,3rn | head -n 50 | cut -f 2,3 -d '\'
  3. Use unix commands to count the number of entries (not definitions) in the Devil’s Dictionary that begin with a vowel. Your output should be a single number. (7 points)
    grep -Ec '^[AEIOU][A-Z-]*,' devilsDictionary.txt
    227
  4. Use unix commands to calculate the average number of letters per word for each entry (not the definitions) in the Devil’s Dictionary. The output should simply be a number. HINT: You will need to use subshells, and bc (10 points)
    entries=`grep -E '^[A-Z]+,' devilsDictionary.txt | cut -f1 -d ',' | wc -l`
    letters=`grep -E '^[A-Z]+,' devilsDictionary.txt | cut -f1 -d ',' | tr -d '\n' | wc -c`
    echo "$letters/$entries" | bc -l
    (The tr -d '\n' matters: otherwise wc -c also counts each entry's trailing newline as a letter, inflating the average by one.)

    OR, in one fell swoop

    echo "`grep -E '^[A-Z]+,' devilsDictionary.txt | cut -f1 -d ',' | tr -d '\n' | wc -c`/`grep -E '^[A-Z]+,' devilsDictionary.txt | cut -f1 -d ',' | wc -l`" | bc -l
  5. Count the number of adjectives, nouns, and verbs in the Devil’s Dictionary. (10 points)
    noun=`grep -cE '^[A-Z]+, n\.' devilsDictionary.txt`
    verb=`grep -cE '^[A-Z]+, v\.' devilsDictionary.txt`
    adj=`grep -cE '^[A-Z]+, adj\.' devilsDictionary.txt`
    echo "nouns: $noun, verbs: $verb, adjectives: $adj"
  6. Print out all the entries (not the definitions) that are not adjectives, nouns, or verbs. HINT: use grep more than once. (10 points)
    grep -E '^[A-Z]+, ' devilsDictionary.txt | grep -vE '^[A-Z]+, (v|n|adj)\.' | cut -f1 -d '.'
  7. Write a unix pipeline which will print the number of words in the celex.txt file that contain a q not followed by a u (look only at the orthography of each entry). (8 points)
    cut -f2 -d '\' celex.txt | grep -Eic 'q[^u]'
    EVEN BETTER (this also catches a word-final q):
    cut -f2 -d '\' celex.txt | grep -Eic 'q([^u]|$)'
  8. Extra credit

    Write a unix pipeline which will print the total number of points in this assignment. Don’t include the points for the extra credit (3 extra points) (Hint: use dc)

    echo "`grep -oE '[0-9]+ points' hmwk2.solution | cut -d ' ' -f1` ++++++ p" | dc
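
    For the curious: dc is a reverse-Polish (stack-based) calculator. The subshell pushes the seven point values onto the stack, each + pops two numbers and pushes their sum, and p prints the result. A quick sanity check of the mechanics, with the point values from the questions above written out by hand:

    echo "6 9 7 10 10 10 8 ++++++ p" | dc
    60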
September 11, 2009 · homework

This homework assignment continues to expand your UNIX skills and introduces Python and the NLTK. It covers material up to Sep. 15th. It is due Sep. 18th by 5:00 p.m. You should submit the homework via svn.

UNIX

  1. Using the celex.txt file, calculate the ratio of heterosyllabic to tautosyllabic st clusters. That is, how frequently do words contain an st cluster that spans two syllables, vs. one that falls within a single syllable? Note that each entry contains a syllabified transcription in which syllables are surrounded by brackets []. For example, abacus has three syllables, [&][b@][k@s]. You should use grep and bc to calculate the ratio (compare the question from hmwk 2 about computing the average number of letters per word for each entry in the Devil’s Dictionary); a sketch of one possible approach follows this list. (10 points)
  2. How many entries in the Devil’s Dictionary have more than 6 letters? Use grep to find out. (5 points)
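
For question 1, a minimal sketch of one possible shape, assuming the syllabified transcription is the fourth backslash-delimited field of celex.txt (the field number is an assumption; check the file) and counting words rather than individual clusters:

    # heterosyllabic: s closes one syllable and t opens the next, so "s][t" appears
    hetero=`cut -f4 -d '\' celex.txt | grep -c 's\]\[t'`
    # tautosyllabic: s and t are adjacent inside a single pair of brackets
    tauto=`cut -f4 -d '\' celex.txt | grep -c 'st'`
    echo "$hetero/$tauto" | bc -l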

Subversion

For this homework, submit your work via Subversion by adding it to your own directory. Make at least 2 separate commits. When you are finished, make sure to say so in your log message. A sketch of the general shape of the relevant svn commands follows the list below.

  1. Create a new file called hmwk3_<yourname>.txt, and add it to the svn repository. Show the commands you used (5 points)
  2. Show the log of all changes you made to your homework 3 file. Show the commands you used (5 points)
  3. Find all log messages pertaining to the slides which contain grep. You will need to use a pipe. Your command should print out not only the line which contains grep, but also the 2 preceding lines. Search the grep manual for “context” to find the appropriate option. (7 points)
  4. Show the changes to your homework 3 file between the final version and the version before that. Show the commands you used (5 points)
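
A sketch of the general shape of these commands (the file name is the placeholder from question 1, and the slides path is an assumption; adjust both to your checkout):

    svn add hmwk3_<yourname>.txt                      # question 1: put the new file under version control
    svn commit -m "add homework 3 file"               # ...and record it in the repository
    svn log hmwk3_<yourname>.txt                      # question 2: show all changes to the file
    svn log slides/ | grep -B 2 grep                  # question 3: -B 2 prints the 2 preceding context lines
    svn diff -r PREV:COMMITTED hmwk3_<yourname>.txt   # question 4: diff the last two committed versions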

Python

  1. Calculate the percentage of indefinite articles in Moby Dick using the NLTK. You can use the percentage function defined in chapter 1.1 (8 points)

  2. Using the dispersion_plot function in the NLTK, find one word that has been used with increasing frequency in the inaugural addresses, and one that has been used with decreasing frequency. You can base your decision about increasing vs. decreasing simply on visual inspection of the graphs. (5 points)
  3. Use the random module to generate 2 random integers between 10 and 100, and then calculate the quotient of the first number divided by the second. Make sure to use normal division, not integer division. Look at the help for the random module to find the appropriate function; a sketch of one possibility follows this list. (10 points)
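
For question 3, a minimal sketch of one possibility, assuming Python 2 (where dividing two integers with / performs integer division, hence the float conversion):

    import random

    a = random.randint(10, 100)  # random integer, both endpoints included
    b = random.randint(10, 100)
    print float(a) / b           # float() forces normal (true) division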

Extra credit:

Use perl and regular expressions to strip the answers from the solutions to homework one. First, download the solution. You might also want to look at a blog entry on perl slurping for hints. (3 extra points)
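
A hedged sketch of the idea, assuming the answers are the indented lines of the downloaded solution file (the file name hmwk1.solution is hypothetical):

    # -0777 slurps the whole file at once; the substitution deletes runs of indented answer lines
    perl -0777 -pe 's/(\n {4}[^\n]*)+//g' hmwk1.solution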

September 8, 2009 · homework

Here are my solutions to the first homework assignment.

  1. Create a directory named ling5200 in your home directory to store all of the files for this course (3 points)
    cd
    mkdir ling5200
  2. Download the practiceFiles.zip file and unzip it into the newly created ling5200 folder. Change directories to the practiceFiles directory using the absolute path, and show that you got there using the pwd command. (4 points)
    cd /home/<username>/ling5200/practiceFiles
    pwd
  3. Copy the devilsDictionary.txt file to devilsDictionary.copy (4 points)
    cp devilsDictionary.txt devilsDictionary.copy
  4. Print out the first 20 files in the practiceFiles directory, sorted in reverse alphabetical order (5 points)
    ls | head -n 20 | sort -r
  5. Print out the number of entries in the celex.txt file (5 points)
    wc -l celex.txt
  6. Print out 20 random entries in the celex.txt file. Hint: look at the documentation for sort. (8 points)
    sort -k 3,3R celex.txt | head -n 20
  7. Print out the last 20 lines of the celex.txt file (5 points)
    tail -n 20 celex.txt
  8. Print the 50 most frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-most-frequent.txt. Hint: You will need sort, plus 2 pipes (9 points)
    sort -t '\' -k 3,3rn celex.txt | head -n 50 | cut -f 2,3 -d '\' > celex-most-frequent.txt
    OR
    sort -t '\' -k 3,3n celex.txt | tail -n 50 | cut -f 2,3 -d '\' > celex-most-frequent.txt
  9. Now print the 50 least frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-least-frequent.txt. Hint: You will need sort, plus 2 pipes (5 points)
    sort -t '\' -k 3,3n celex.txt | head -n 50 | cut -f 2,3 -d '\' > celex-least-frequent.txt
    OR
    sort -t '\' -k 3,3rn celex.txt | tail -n 50 | cut -f 2,3 -d '\' > celex-least-frequent.txt
  10. Now combine the 2 files you just created and store the combination in a new file called celex-most-least-frequent.txt, with the most frequent words first, and the least frequent words last (6 points)
    cat celex-most-frequent.txt celex-least-frequent.txt > celex-most-least-frequent.txt
  11. Print out the 20 most common word frequencies. Hint: The most common word frequency is 0. There are 34037 words in celex.txt with frequencies of 0 (6 points)
    cut -f 3 -d '\' celex.txt | sort -n | uniq -c | sort -rn | head -n 20
    (uniq -c prefixes each distinct frequency with its count, and the second sort, -rn, orders by that count so the most common frequencies come out on top.)
September 4, 2009 · News, homework

A few people have noted several problems with the sort command. I apologize for the trouble. There are several differences between the sort included with Mac OS X and the one on Linux (and, I think, on cygwin, but I am not sure).

  1. Linux sort has a random option, -R, which is necessary for question 6. If you are working on a Mac, you can go ahead and pretend that the option is there in your answer. I will not take off any points for not figuring this out.
  2. The textbook uses an older way of specifying fields (columns) for sort, using +. The preferred way is to use the -k option, as I have done in the examples in class; see the example below. I am 99% certain that the -k option works on both Mac and Linux (and cygwin).
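
For example, both of these sort a file on its third field (a sketch; words.txt is a hypothetical whitespace-delimited file):

    sort +2 -3 words.txt    # older textbook style: field positions count from 0
    sort -k 3,3 words.txt   # preferred style: -k counts fields from 1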

Again, I apologize for the confusion.

September 4, 2009 · homework

This assignment covers more of the UNIX basics we have discussed (including material up to September 8th). Some of the questions will ask you to use some of the files from practiceFiles, specifically celex.txt and devilsDictionary.txt.

  1. Print out the entries (orthography only) from the celex.txt file which were taken from GOOGLE. Hint: You will need to use a pipe. (6 points)

  2. Print out the 50 most frequent words from the celex.txt file which were taken from GOOGLE. Hint: You will need to combine the answers from the last 2 questions. (9 points)

  3. Use unix commands to count the number of entries (not definitions) in the Devil’s Dictionary that begin with a vowel. Your output should be a single number. (7 points)
  4. Use unix commands to calculate the average number of letters per word for each entry (not the definitions) in the Devil’s Dictionary. The output should simply be a number. HINT: You will need to use subshells, and bc (10 points)
  5. Count the number of adjectives, nouns, and verbs in the Devil’s Dictionary. (10 points)
  6. Print out all the entries (not the definitions) that are not adjectives, nouns, or verbs. HINT: use grep more than once. (10 points)
  7. Write a unix pipeline which will print the number of words in the celex.txt file that contain a q not followed by a u (look only at the orthography of each entry). (8 points)
  8. Extra credit

    Write a unix pipeline which will print the total number of points in this assignment. Don’t include the points for the extra credit (3 extra points) (Hint: use dc)

August 26, 2009 · homework

This assignment will ask you to practice some of the UNIX basics we have covered. Some of the questions will ask you to use the practiceFiles; others will ask you to use the CELEX database (celex.txt). The assignment is due Sep. 4, at 5 p.m.

TIP: For dealing with tab-delimited text files, specify the delimiter as: $’\t’
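
For example, to sort a tab-delimited file numerically on its second column (a sketch; data.txt is a hypothetical file):

    sort -t $'\t' -k 2,2n data.txt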

  1. Create a directory named ling5200 in your home directory to store all of the files for this course (3 points)
  2. Download the practiceFiles.zip file and unzip it into the newly created ling5200 folder. Change directories to the practiceFiles directory using the absolute path, and show that you got there using the pwd command. (4 points)
  3. Copy the devilsDictionary.txt file to devilsDictionary.copy (4 points)
  4. Print out the first 20 files in the practiceFiles directory, sorted in reverse alphabetical order (5 points)
  5. Print out the number of entries in the celex.txt file (5 points)
  6. Print out 20 random entries in the celex.txt file. Hint: look at the documentation for sort. (8 points)
  7. Print out the last 20 lines of the celex.txt file (5 points)
  8. Print the 50 most frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-most-frequent.txt. Hint: You will need sort, plus 2 pipes (9 points)
  9. Now print the 50 least frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-least-frequent.txt. Hint: You will need sort, plus 2 pipes (5 points)
  10. Now combine the 2 files you just created and store the combination in a new file called celex-most-least-frequent.txt, with the most frequent words first, and the least frequent words last (6 points)
  11. Print out the 20 most common word frequencies. Hint: The most common word frequency is 0. There are 34037 words in celex.txt with frequencies of 0 (6 points)