Here are the notes for today’s class, covering more Unix basics and an introduction to Subversion and version control systems.
Here are my solutions to the first homework assignment.
- Create a directory named ling5200 in your home directory to store all of the files for this course (3 points)
cd
mkdir ling5200
- Download the practiceFiles.zip file and unzip it into the newly created ling5200 folder. Change directories to the practiceFiles directory using the absolute path, and show that you got there using the pwd command. (4 points)
cd /home/<username>/ling5200/practiceFiles
pwd
- Copy the devilsDictionary.txt file to devilsDictionary.copy (4 points)
cp devilsDictionary.txt devilsDictionary.copy
- Print out the first 20 files in the practiceFiles directory, sorted in reverse alphabetical order (5 points)
ls | head -n 20 | sort -r
- Print out the number of entries in the celex.txt file (5 points)
wc -l celex.txt
- Print out 20 random entries in the celex.txt file. Hint: look at the documentation for sort. (8 points)
sort -R celex.txt | head -n 20
- Print out the last 20 lines of the celex.txt file (5 points)
tail -n 20 celex.txt
- Print the 50 most frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-most-frequent.txt. Hint: You will need sort, plus 2 pipes (9 points)
sort -t '\' -k 3,3rn celex.txt | head -n 50 | cut -f 2,3 -d '\' > celex-most-frequent.txt
OR
sort -t '\' -k 3,3n celex.txt | tail -n 50 | cut -f 2,3 -d '\' > celex-most-frequent.txt
- Now print the 50 least frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-least-frequent.txt. Hint: You will need sort, plus 2 pipes (5 points)
sort -t '\' -k 3,3n celex.txt | head -n 50 | cut -f 2,3 -d '\' > celex-least-frequent.txt
OR
sort -t '\' -k 3,3rn celex.txt | tail -n 50 | cut -f 2,3 -d '\' > celex-least-frequent.txt
- Now combine the 2 files you just created and store the combination in a new file called celex-most-least-frequent.txt, with the most frequent words first, and the least frequent words last (6 points)
cat celex-most-frequent.txt celex-least-frequent.txt > celex-most-least-frequent.txt
- Print out the 20 most common word frequencies. Hint: The most common word frequency is 0. There are 34037 words in celex.txt with frequencies of 0 (6 points)
cut -f 3 -d '\' celex.txt | sort -n | uniq -c | sort -rn | head -n 20
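To see how the sort, head, cut, and uniq pieces fit together, it can help to run them on a tiny file first. The sketch below assumes, based on the commands above, that the data is backslash-delimited with the orthography in field 2 and the frequency in field 3. The file toy.txt and its contents are invented for illustration and are not taken from celex.txt.

```shell
# Build a tiny backslash-delimited file: id\orthography\frequency (invented data).
tmpdir=$(mktemp -d)
printf '%s\n' '1\the\500' '2\cat\20' '3\of\300' '4\zebra\0' '5\dog\20' > "$tmpdir/toy.txt"

# Most frequent first: reverse numeric sort on field 3, keep 2 lines,
# then print only fields 2 and 3.
most=$(sort -t '\' -k 3,3rn "$tmpdir/toy.txt" | head -n 2 | cut -f 2,3 -d '\')
printf '%s\n' "$most"

# Count how often each frequency value occurs. uniq -c needs sorted input,
# and the final sort -rn puts the most common value first.
counts=$(cut -f 3 -d '\' "$tmpdir/toy.txt" | sort -n | uniq -c | sort -rn | head -n 1)
printf '%s\n' "$counts"
```

The first pipeline prints the two highest-frequency rows; the second prints a count followed by the frequency value that occurs most often in the toy file.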
A few people have run into problems with the sort command, and I apologize for that. There are several differences between the sort included with Mac and the one on Linux (and I think on Cygwin, but I am not sure).
- Linux sort has a random option, -R, which is necessary for question 6. If you are working on a Mac, you can go ahead and pretend that the option is there in your answer. I will not take off any points for not figuring this out.
- The textbook uses an older way of specifying fields (columns) for sort, using +. The preferred way is to use the -k option, as I have done in the examples in class. I am 99% certain that the -k option works in both Mac and Linux (and cygwin).
Again, I apologize for the confusion.
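For anyone on a system whose sort lacks -R, one possible workaround is to prefix each line with a random number, sort on that number, and then strip it off again. This is only a sketch, not the expected homework answer. It relies on awk, sort, and cut, and the file sample.txt is a made-up stand-in for celex.txt.

```shell
# Stand-in input: ten numbered lines (invented data).
tmpdir=$(mktemp -d)
seq 1 10 > "$tmpdir/sample.txt"

# Prefix each line with a random number and a tab, sort numerically on that
# prefix, cut the prefix away again, and keep the first 3 lines.
picked=$(awk 'BEGIN { srand() } { printf "%f\t%s\n", rand(), $0 }' "$tmpdir/sample.txt" |
  sort -n | cut -f 2- | head -n 3)
printf '%s\n' "$picked"
```

Note that awk's srand() seeds from the clock in whole seconds, so two runs within the same second will typically produce the same order.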
This assignment covers more of the UNIX basics from class (including material up to September 8th). Some of the questions will ask you to use files from practiceFiles, specifically celex.txt and devilsDictionary.txt.
- Print out the entries (orthography only) from the celex.txt file which were taken from GOOGLE. Hint: You will need to use a pipe. (6 points)
- Print out the 50 most frequent words from the celex.txt file which were taken from GOOGLE. Hint: You will need to combine the answers from the last 2 questions. (9 points)
- Use unix commands to count the number of entries (not definitions) in the Devil’s Dictionary that begin with a vowel. Your output should be a single number. (7 points)
- Use unix commands to calculate the average number of letters per word for each entry (not the definitions) in the Devil’s Dictionary. The output should simply be a number. HINT: You will need to use subshells, and bc (10 points)
- Count the number of adjectives, nouns, and verbs in the Devil’s Dictionary. (10 points)
- Print out all the entries (not the definitions), which are not adjectives, nouns, or verbs. HINT: use grep more than once. (10 points)
- Write a unix pipeline which will print the number of words in the celex.txt file that contain a q not followed by a u (look only at the orthography of each entry). (8 points)
- Extra credit: Write a unix pipeline which will print the total number of points in this assignment. Don’t include the points for the extra credit. Hint: use dc (3 extra points)
Here are notes for Sep. 3rd, covering globs and regular expressions