Sep 8th notes on unix

September 8, 2009 · notes

Here are the notes from today’s class, covering more Unix basics and an introduction to Subversion and version control systems.

ling5200-unix3-notes.pdf

Homework assignment 1 – Unix Basics – Solution

September 8, 2009 · homework

Here are my solutions to the first homework assignment.

1. Create a directory named ling5200 in your home directory to store all of the files for this course (3 points)
    cd
    mkdir ling5200
2. Download the practiceFiles.zip file and unzip it into the newly created ling5200 folder. Change directories to the practiceFiles directory using the absolute path, and show that you got there using the pwd command. (4 points)
    cd /home/<username>/ling5200/practiceFiles
    pwd
3. Copy the devilsDictionary.txt file to devilsDictionary.copy (4 points)
    cp devilsDictionary.txt devilsDictionary.copy
4. Print out the first 20 files in the practiceFiles directory, sorted in reverse alphabetical order (5 points)
    ls | head -n 20 | sort -r
5. Print out the number of entries in the celex.txt file (5 points)
    wc -l celex.txt
6. Print out 20 random entries in the celex.txt file. Hint: look at the documentation for sort. (8 points)
    sort -R celex.txt | head -n 20
7. Print out the last 20 lines of the celex.txt file (5 points)
    tail -n 20 celex.txt
8. Print the 50 most frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-most-frequent.txt. Hint: You will need sort, plus 2 pipes (9 points)
    sort -t '\' -k 3,3rn celex.txt | head -n 50 | cut -f 2,3 -d '\' > celex-most-frequent.txt
    OR
    sort -t '\' -k 3,3n celex.txt | tail -n 50 | cut -f 2,3 -d '\' > celex-most-frequent.txt
9. Now print the 50 least frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-least-frequent.txt. Hint: You will need sort, plus 2 pipes (5 points)
    sort -t '\' -k 3,3n celex.txt | head -n 50 | cut -f 2,3 -d '\' > celex-least-frequent.txt
    OR
    sort -t '\' -k 3,3rn celex.txt | tail -n 50 | cut -f 2,3 -d '\' > celex-least-frequent.txt
10. Now combine the 2 files you just created and store the combination in a new file called celex-most-least-frequent.txt, with the most frequent words first, and the least frequent words last (6 points)
    cat celex-most-frequent.txt celex-least-frequent.txt > celex-most-least-frequent.txt
11. Print out the 20 most common word frequencies. Hint: The most common word frequency is 0. There are 34037 words in celex.txt with frequencies of 0 (6 points)
    cut -f 3 -d '\' celex.txt | sort -n | uniq -c | sort -rn | head -n 20
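
To see what each stage of these pipelines does, it can help to run them on a tiny file first. The following is only a sketch: sample.txt and its five made-up entries are hypothetical, and it assumes the same layout the solutions rely on (fields separated by a backslash, orthography in field 2, frequency in field 3). The real celex.txt may contain more fields.

```shell
# Build a tiny, made-up sample in the layout the solutions assume:
# id\orthography\frequency
printf '%s\n' '1\the\900' '2\cat\40' '3\zyzzyva\0' '4\of\700' '5\qat\0' > sample.txt

# The 2 most frequent words: reverse numeric sort on field 3,
# keep the top 2 lines, then keep only fields 2 and 3.
sort -t '\' -k 3,3rn sample.txt | head -n 2 | cut -f 2,3 -d '\' > sample-most-frequent.txt

# The most common frequency value: count how often each value occurs,
# then sort the counts so the largest count comes first.
cut -f 3 -d '\' sample.txt | sort -n | uniq -c | sort -rn | head -n 1 > sample-common-freq.txt

cat sample-most-frequent.txt sample-common-freq.txt
```

On this sample the last command prints the\900 and of\700, followed by a line showing a count of 2 for the frequency 0. Replacing head with tail, or dropping the r from the sort key, is an easy way to check the alternative answers.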

Sort on Linux vs. BSD (Mac)

September 4, 2009 · News, homework

A few people have reported problems with the sort command, and I apologize for them. The sort included with Mac (BSD) differs in several ways from the one on Linux (and, I think, from the one on Cygwin, but I am not sure).

Linux sort has a random option, -R, which is necessary for question 6. If you are working on a Mac, you can go ahead and pretend that the option is there in your answer. I will not take off any points for not figuring this out.
The textbook uses an older way of specifying fields (columns) for sort, using +. The preferred way is to use the -k option, as I have done in the examples in class. I am 99% certain that the -k option works on both Mac and Linux (and Cygwin).
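
For anyone whose sort lacks -R, one common workaround is to tag each line with a random number, sort on that tag, and then strip it off again. The following is a minimal sketch, assuming awk is installed (it normally is on both Mac and Linux). The file lines.txt is a hypothetical stand-in for celex.txt, generated here just so the example runs on its own.

```shell
# Hypothetical stand-in for celex.txt: 100 lines named line1 ... line100.
awk 'BEGIN { for (i = 1; i <= 100; i++) print "line" i }' > lines.txt

# Prefix every line with a random number and a tab, sort numerically
# on that first field, drop the prefix, and keep the first 20 lines.
# srand() seeds from the clock, so two runs within the same second
# may produce the same order.
awk 'BEGIN { srand() } { print rand() "\t" $0 }' lines.txt |
  sort -n -k 1,1 |
  cut -f 2- |
  head -n 20 > random20.txt

wc -l < random20.txt
```

This avoids -R entirely and uses only the -k form of field selection, so it should behave the same way on both systems.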

Again, I apologize for the confusion.

Homework 2 – More UNIX basics and regular expressions

September 4, 2009 · homework

This assignment covers more of the UNIX basics we have discussed in class (including material up to September 8th). Some of the questions ask you to use files from practiceFiles, specifically celex.txt and devilsDictionary.txt.

1. Print out the entries (orthography only) from the celex.txt file which were taken from GOOGLE. Hint: You will need to use a pipe. (6 points)
2. Print out the 50 most frequent words from the celex.txt file which were taken from GOOGLE. Hint: You will need to combine the answers from the last 2 questions. (9 points)
3. Use Unix commands to count the number of entries (not definitions) in the Devil’s Dictionary that begin with a vowel. Your output should be a single number. (7 points)
4. Use Unix commands to calculate the average number of letters per word for each entry (not the definitions) in the Devil’s Dictionary. The output should simply be a number. Hint: You will need to use subshells, and bc (10 points)
5. Count the number of adjectives, nouns, and verbs in the Devil’s Dictionary. (10 points)
6. Print out all the entries (not the definitions), which are not adjectives, nouns, or verbs. Hint: use grep more than once. (10 points)
7. Write a Unix pipeline which will print the number of words in the celex.txt file that contain a q not followed by a u (look only at the orthography of each entry). (8 points)
Extra credit

Write a unix pipeline which will print the total number of points in this assignment. Don’t include the points for the extra credit (3 extra points) (Hint: use dc)

Slides from Sep 3rd.

September 3, 2009 · slides

Here are the slides from today. I have corrected the one error that I mentioned in class.
ling5200-grep-slides

Linguistics 5200 Fall 2009

Introduction to computational corpus linguistics
