This assignment will ask you to practice some of the UNIX basics we have covered. Some of the questions will ask you to use the practiceFiles; others will ask you to use the CELEX database (celex.txt). The assignment is due Sep. 4, at 5 p.m.
TIP: For dealing with tab-delimited text files, specify the delimiter as: $'\t'
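As a small illustration of this tip, here is a sketch run on a made-up file rather than on celex.txt. The file name sample.tsv and its two-column layout (word, then count) are assumptions for the example only. Note that the $'\t' quoting is a bash feature and may not work in a plain sh.

```shell
# Work in a scratch directory so nothing in ling5200 is touched.
tmp=$(mktemp -d)

# Build a tiny tab-delimited sample file (hypothetical data, not real CELEX entries).
printf 'the\t100\nabacus\t3\nzebra\t7\n' > "$tmp/sample.tsv"

# cut already splits on tabs by default; -d $'\t' just makes the delimiter explicit.
first_column=$(cut -d $'\t' -f 1 "$tmp/sample.tsv")
echo "$first_column"

# sort splits on blank-to-non-blank transitions unless told otherwise, so pass -t.
# -k2,2n sorts numerically on the second field only.
sorted_by_count=$(sort -t $'\t' -k2,2n "$tmp/sample.tsv")
echo "$sorted_by_count"
```

The straight single quotes matter: curly quotes copied from a web page or word processor will not be understood by the shell.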
- Create a directory named ling5200 in your home directory to store all of the files for this course (3 points)
- Download the practiceFiles.zip file and unzip it into the newly created ling5200 folder. Change directories to the practiceFiles directory using the absolute path, and show that you got there using the pwd command. (4 points)
- Copy the devilsDictionary.txt file to devilsDictionary.copy (4 points)
- Print out the first 20 files in the practiceFiles directory, sorted in reverse alphabetical order (5 points)
- Print out the number of entries in the celex.txt file (5 points)
- Print out 20 random entries in the celex.txt file. Hint: look at the documentation for sort. (8 points)
- Print out the last 20 lines of the celex.txt file (5 points)
- Print the 50 most frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-most-frequent.txt. Hint: You will need sort, plus 2 pipes (9 points)
- Now print the 50 least frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-least-frequent.txt. Hint: You will need sort, plus 2 pipes (5 points)
- Now combine the 2 files you just created and store the combination in a new file called celex-most-least-frequent.txt, with the most frequent words first, and the least frequent words last (6 points)
- Print out the 20 most common word frequencies. Hint: The most common word frequency is 0. There are 34037 words in celex.txt with frequencies of 0 (6 points)
In Q. 11, what is meant by a word frequency of 0? Does it mean there is no word in the second field? If so, do we need to use regex to solve this?
A word frequency of 0 means that the word exists, but that it was not found in the corpus. For example, foo is not a valid English word, so it is not in CELEX. In contrast, abacuses is a valid English word, but it never occurred in the corpus from which CELEX gets its frequencies. Words with 0 frequency will have a 0 in the frequency column (field).
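To make the idea concrete, here is a minimal sketch on a toy file. The file toy.tsv, its entries' counts, and its layout (orthography in field 1, frequency in field 2) are all invented for illustration; check the real column positions in celex.txt before adapting anything. It suggests that a plain comparison on the frequency field is enough to pick out zero-frequency words, without a regular expression.

```shell
# Scratch directory for the toy data.
tmp=$(mktemp -d)

# Hypothetical entries: every word is present, but some have a frequency of 0.
printf 'abacus\t3\nabacuses\t0\nthe\t100\nzebras\t0\n' > "$tmp/toy.tsv"

# List the words whose frequency field (assumed to be field 2) equals 0.
zero_words=$(awk -F '\t' '$2 == 0 { print $1 }' "$tmp/toy.tsv")
echo "$zero_words"

# Count those words; n+0 forces a numeric 0 if nothing matched.
zero_count=$(awk -F '\t' '$2 == 0 { n++ } END { print n+0 }' "$tmp/toy.tsv")
echo "$zero_count"
```

Here every line still has a word in the first field; only the number in the frequency field is 0.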