Note that I have swapped the schedules for Thursday, Sep. 3rd and Tuesday, Sep. 8th. We will now be covering regular expressions on Thursday, Sep. 3rd. I have also added a reading for Thursday, which is available on the course website (under resources).
Here are the notes for Sep. 1st, in which we cover more unix basics, including streams and pipes.
Please download the notes for today’s class on UNIX basics, including filesystem navigation and basic reading of files.
Matt Cecil just alerted me to the fact that the UNIX textbook is also online for free (it is not the most current edition, but I don’t think there should be any substantial differences).
Check it out at:
http://www.netlibrary.com/Details.aspx?ProductId=2203
This assignment will ask you to practice some of the UNIX basics we have covered. Some of the questions will ask you to use the practiceFiles; others will ask you to use the CELEX database (celex.txt). The assignment is due Sep. 4, at 5 p.m.
TIP: For dealing with tab-delimited text files, specify the delimiter as: $'\t' (type straight single quotes; curly quotes will not work in the shell)
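As a rough illustration of the tip above, here is a minimal sketch of passing a tab as the delimiter to cut and sort in bash. The sample file and its three columns (id, word, frequency) are made up for this example; the actual column layout of celex.txt may differ, so check the file before picking field numbers.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: three tab-separated columns (id, word, frequency).
# This is NOT the real celex.txt layout; it only illustrates the delimiter syntax.
sample=$(mktemp)
printf '1\tthe\t500\n2\tof\t300\n3\tzebra\t2\n' > "$sample"

# cut: keep only the second and third tab-separated fields.
# (Tab is already cut's default delimiter; it is spelled out here for illustration.)
cut -d $'\t' -f 2,3 "$sample"

# sort: treat tab as the field separator and sort numerically on field 3,
# largest first; then keep the top line and extract its word (field 2).
top=$(sort -t $'\t' -k 3,3nr "$sample" | head -n 1 | cut -d $'\t' -f 2)
echo "$top"

# Clean up the temporary file.
rm -f "$sample"
```

The $'\t' form is bash's ANSI-C quoting, so it may not work in every shell. The -t option matters most for sort, which otherwise splits fields at the transition from non-blank to blank characters.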
- Create a directory named ling5200 in your home directory to store all of the files for this course (3 points)
- Download the practiceFiles.zip file and unzip it into the newly created ling5200 folder. Change directories to the practiceFiles directory using the absolute path, and show that you got there using the pwd command. (4 points)
- Copy the devilsDictionary.txt file to devilsDictionary.copy (4 points)
- Print out the first 20 files in the practiceFiles directory, sorted in reverse alphabetical order (5 points)
- Print out the number of entries in the celex.txt file (5 points)
- Print out 20 random entries in the celex.txt file. Hint: look at the documentation for sort. (8 points)
- Print out the last 20 lines of the celex.txt file (5 points)
- Print the 50 most frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-most-frequent.txt. Hint: You will need sort, plus 2 pipes (9 points)
- Now print the 50 least frequent words in celex.txt. Print only the orthography and the frequency. Save the results to a file called celex-least-frequent.txt. Hint: You will need sort, plus 2 pipes (5 points)
- Now combine the 2 files you just created and store the combination in a new file called celex-most-least-frequent.txt, with the most frequent words first, and the least frequent words last (6 points)
- Print out the 20 most common word frequencies. Hint: The most common word frequency is 0. There are 34037 words in celex.txt with frequencies of 0 (6 points)
Please download the notes for Tuesday, Aug. 25th, and follow along in class.
ling5200-intro-notes
I would like to welcome you to Linguistics 5200 — Introduction to Computational Corpus Linguistics. I will use this website to post information about the course, such as class notes and slides, as well as news items, such as this message. You will receive e-mail notifications when new material is posted to the website.
I look forward to meeting you next Tuesday. I have two requests for you for Tuesday:
- Glance over the syllabus before class
- Please bring your laptop to class. If you don’t have a laptop, you can share with someone who does