September 8, 2009 · homework

Here are my solutions to the first homework assignment.

  1. Create a directory named ling5200 in your home directory to store all of the files for this course (3 points)
    cd
    mkdir ling5200
  2. Download the practiceFiles.zip file and unzip it into the newly created ling5200 folder. Change directories to the practiceFiles directory using the absolute path, and show that you got there using the pwd command. (4 points)
    cd /home/<username>/ling5200/practiceFiles
    pwd
  3. Copy the devilsDictionary.txt file to devilsDictionary.copy (4 points)
    cp devilsDictionary.txt devilsDictionary.copy
  4. Print out the first 20 files in the practiceFiles directory, sorted in reverse alphabetical order (5 points)
    ls | head -n 20 | sort -r
  5. Print out the number of entries in the celex.txt file (5 points)
    wc -l celex.txt
  6. Print out 20 random entries in the celex.txt file. Hint: look at the documentation for sort. (8 points)
    sort -R celex.txt | head -n 20
  7. Print out the last 20 lines of the celex.txt file (5 points)
    tail -n 20 celex.txt
  8. Print the 50 most frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-most-frequent.txt. Hint: You will need sort, plus 2 pipes (9 points)
    sort -t '\' -k 3,3rn celex.txt | head -n 50 | cut -f 2,3 -d '\' > celex-most-frequent.txt
    OR
    sort -t '\' -k 3,3n celex.txt | tail -n 50 | cut -f 2,3 -d '\' > celex-most-frequent.txt
  9. Now print the 50 least frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-least-frequent.txt. Hint: You will need sort, plus 2 pipes (5 points)
    sort -t '\' -k 3,3n celex.txt | head -n 50 | cut -f 2,3 -d '\' > celex-least-frequent.txt
    OR
    sort -t '\' -k 3,3rn celex.txt | tail -n 50 | cut -f 2,3 -d '\' > celex-least-frequent.txt
  10. Now combine the 2 files you just created and store the combination in a new file called celex-most-least-frequent.txt, with the most frequent words first, and the least frequent words last (6 points)
    cat celex-most-frequent.txt celex-least-frequent.txt > celex-most-least-frequent.txt
  11. Print out the 20 most common word frequencies. Hint: The most common word frequency is 0. There are 34037 words in celex.txt with frequencies of 0 (6 points)
    cut -f 3 -d '\' celex.txt | sort -n | uniq -c | sort -rn | head -n 20
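
A quick way to check pipelines like these is to run them on a tiny file where the right answer is obvious. The sketch below is not part of the assignment. The sample data is made up, and it assumes the same layout the solutions rely on: backslash-separated fields with the orthography in field 2 and the frequency in field 3. Putting an ID in field 1 is only a guess for illustration.

```shell
#!/bin/sh
# Build a tiny stand-in for celex.txt (made-up data, hypothetical layout:
# id\orthography\frequency).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
printf '%s\n' \
  '1\the\900' \
  '2\cat\12' \
  '3\zyzzyva\0' \
  '4\of\500' \
  '5\aardvark\0' \
  '6\dog\12' \
  '7\quokka\0' > "$sample"

# Two most frequent words: reverse numeric sort on field 3,
# then keep only orthography (field 2) and frequency (field 3).
most_frequent=$(sort -t '\' -k 3,3rn "$sample" | head -n 2 | cut -f 2,3 -d '\')
printf '%s\n' "$most_frequent"

# Two most common frequency values: count how often each value occurs,
# then rank those counts from largest to smallest.
common_freqs=$(cut -f 3 -d '\' "$sample" | sort -n | uniq -c | sort -rn | head -n 2)
printf '%s\n' "$common_freqs"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

On this sample the first pipeline prints the\900 and of\500, and the second reports a count of 3 for frequency 0 and a count of 2 for frequency 12 (the padding before each count depends on the uniq implementation).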
Written by Robert Felty

