CS520
Fall 2010
Programming Assignment 5
Due Sunday November 7


The goal of this assignment is to use Posix threads to more efficiently process a set of text files in the English language.

The program will accept a sequence of names of text files as its only command-line arguments. The goal of the program is to find the twenty longest words that appear in every file. (That is, to be reported a word must appear in each file and it must be one of the twenty longest of such words.) If there are ties for position twenty on the list of longest words, then report all the words that tie. That is, you may report more than twenty words for your result, because of ties for the last position on the list. Note that you might report fewer than twenty words if there are fewer than twenty words that are common to all files. Also note that words less than eight characters in length should be ignored. Also, since the input files are in English, you may assume that no word will be longer that fifty characters, since I believe the longest word in the English language is only 45 letters long. Words longer than fifty characters can be ignored, since they are probably not real words. Therefore, the analysis should actually report the twenty longest words that appear in every file and that are at least eight characters long and no more than fifty characters long.

If the user does not specify at least one file to be processed, then terminate the program with an appropriate message.

A word starts with a letter (either uppercase or lowercase) and continues until a non-letter (or EOF) is encountered. Non-words in the file should simply be ignored. Once a word is identified, convert all uppercase letters to lowercase before you process it.

Therefore, "elephant's" will be two words, "elephant" and "s", and since "s" is less than eight characters long, it will be ignored. Likewise, "double-precision" will be two words, "double" and "precision", and since "double" is less than eight characters long, it will be ignored.

The output can be unsorted, but it should be given to stdout and should consist of one word per line. You should not print anything else to stdout. Please be sure to turnoff debugging output before submitting your program for grading.

Step one is to define a data structure that can keep a set of words easily accessed by their length, longest to shortest.

Step two is to write a function that, given a block of text (defined by a starting address and a length), will find all words in the block that are at least eight characters long and will insert them into the data structure defined during step one.

Step three is to write a function that, given a set of instances of the data structure, will find the twenty longest words that are common to all instances of the data structure.

Step four is to write a function that given two instances of the data structure will merge them so that the new instance has the union of all words in the two instances.

Step five is to write a serial program that will process the text files in sequence using the functions written in steps two and three to find the twenty longest words that appear in all the files. This program should read the files using the following standard functions:

Use of these functions is illustrated by a simple example program, count.c, available in ~cs520/public/prog5. For more details about these functions, consult their man pages (man open, man 2 read, man close).

You may also want to consider using this function:

lseek allows you, for instance, to "backup" the next location to be read from in an open file, in order to re-read a section of the file. See man lseek for more information.

Step six is to write a multi-threaded program to solve the problem in which threads are assigned the task of reading and processing one particular file. And one thread uses the function of step three to analyze the data structures produced by the threads and to report the list of longest words appearing in all files.

The solution produced by step six is likely to suffer from load imbalance. Step seven is to alter the approach of step six to have multiple threads working on the same file. This requires that the threads coordinate to share the processing of a single file. Use the function of step four to combine instances of the data structure produced from different sections of the same file.

Step eight is to implement a solution that uses condition variables to coordinate threads, if you are not already using condition variables in your step seven program.

After finishing the eight steps then implement the fastest version of the program that you can determine. The program must use only open, read, lseek and close to do I/O. The program must also be multi-threaded and use Posix threads. Of course, the program must produce the correct results.

We will have a contest to determine the fastest student submission. (I reserve the right to enter the contest too, but I may not.) I will buy the student with the fastest solution a beverage of their choice at Libby's. In addition, all other submissions that are within 10% of the running time of the fastest submission will earn ten bonus points on the assignment.

I will determine the fastest submission by averaging the wall clock times for a series of benchmark runs. What I will use for the benchmark runs will not be released in advance. But you can expect runs using a varying number of files (including only one file) and files of varying lengths.

The benchmark runs will be performed on agate.cs.unh.edu. I will try to do all the runs when the machine is lightly loaded. I will compile the programs using gcc with -O (i.e. modest optimization).

Sample files that might be used to benchmark programs are available in ~cs520/public/prog5/files. You can assume that all input files will be less than 2GB in length. Also, you can assume at most 100 files will be given to one particular run of your program. If an invalid filename is given as an argument, terminate the program with an appropriate error message.

If other run-time errors occur, you may also terminate your program with an error. But, note, if malloc fails, this will be considered to be your problem. You need to be able to process the files given the system's limits on memory usage.

Your code for this assignment should be placed in two files: prog5.c and fast5.c. prog5.c should contain your code for the highest step you completed in steps one to eight given above. fast5.c should contain your code for your fastest solution to the program. Note that these two programs might be the same, or they might be different because, for instance, you find that a solution using condition variables (i.e. step eight) is not as fast as some other approach.

You should attack this assignment in the order of the steps given above. Points will be awarded for this assignment in the following manner:

In addition, as discussed above, 10 bonus points will be awarded to all students whose fast.c is within 10% of the fastest submission.

Your program will be graded primarily by testing it for correct functionality. In addition, however, you may lose points if your program is not properly structured and documented. Decompose sub-problems appropriately into functions and do incremental testing. Leave your debugging output in your code, but disabled, when you do your final assignment submission.

By the end of the lab on Friday October 29, you should have completed steps one, two, three, five and six. This will require that you attempt these steps prior to coming to lab. In lab we will work on any issues that you are having trouble with.

By the end of the lab on Friday November 5, you should have completed step eight. In lab we will work on any issues that you are having trouble with.

Your programs will be graded using agate.cs.unh.edu so be sure to test in that environment.

You should only submit prog5.c and fast5.c. Be sure you disable any debugging output before you submit your assignment.

Your programs should be submitted for grading from agate.cs.unh.edu. To turn in this assignment, type:
~cs520/bin/submit prog5 prog5.c fast5.c

Please submit only these two files. Do not turn in any other files!

Submissions can be checked by typing:
~cs520/bin/scheck prog5

To receive full credit for the assignment, you must turn in your files prior to 8am on Monday November 8. Programming assignments may be handed in late at a penalty of 2 points for one day late, 5 points for two days late, 10 points for three days late, 20 points for four days late, and 40 points for five days late. No program may be turned in more than 5 days late.

Remember: as always you are expected to do your own work on this assignment.


Last modified on October 28, 2010.

Comments and questions should be directed to hatcher@unh.edu