CS520
Spring 2015
Programming Assignment 4
Due Sunday April 5


The goal of this assignment is to use POSIX threads to efficiently process a set of English text files.

The program will accept a sequence of names of text files as its only command-line arguments. The goal of the program is to find the twenty most frequently used words across all files, where each word must appear at least once in each file. If there are ties for position twenty on the list of most frequently used words, then report all the words that tie. That is, you may report more than twenty words because of ties for the last position on the list. Note that you might report fewer than twenty words if fewer than twenty words are common to all files. Also note that words fewer than six characters in length should be ignored. Also, since the input files are in English, you may assume that no real word will be longer than fifty characters, since I believe the longest word in the English language is only 45 letters long. Words longer than fifty characters can be ignored, since they are probably not real words. Therefore, the analysis should actually report the twenty most frequently used words, where each word must appear at least once in each file, and each word must be at least six characters long and no more than fifty characters long.
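The tie rule for position twenty can be sketched as follows. This is only an illustration, assuming the final words and counts have already been copied into an array; the names WordCount and select_top are hypothetical and are not required by the assignment.

```c
#include <assert.h>
#include <stdlib.h>

#define TOP_N 20   /* number of positions on the list, before ties */

/* Hypothetical flat record for one word that is common to all files. */
typedef struct {
    const char *word;
    unsigned long count;
} WordCount;

/* qsort comparator: larger counts sort first. */
static int by_count_desc(const void *a, const void *b)
{
    unsigned long x = ((const WordCount *)a)->count;
    unsigned long y = ((const WordCount *)b)->count;
    return (x < y) - (x > y);
}

/* Sorts items by descending count and returns how many leading items
 * should be reported: at most top_n, plus any items that tie with the
 * item in position top_n. */
size_t select_top(WordCount *items, size_t n, size_t top_n)
{
    assert(items != NULL || n == 0);
    if (n == 0 || top_n == 0)
        return 0;
    qsort(items, n, sizeof *items, by_count_desc);
    if (n <= top_n)
        return n;                       /* fewer than top_n words exist */
    unsigned long cutoff = items[top_n - 1].count;
    size_t k = top_n;
    while (k < n && items[k].count == cutoff)
        k++;                            /* include ties for the last position */
    return k;
}
```

Calling select_top(items, n, TOP_N) and printing the first k words, one per line, would then produce the required output. Sorting is used here only to find the cutoff; the output itself does not need to be sorted.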

If the user does not specify at least one file to be processed, then terminate the program with an appropriate message.

A word starts with a letter (either uppercase or lowercase) and continues until a non-letter (or EOF) is encountered. Non-words in the file should simply be ignored. Once a word is identified, convert all uppercase letters to lowercase before you process it.

Therefore, "elephant's" will be two words, "elephant" and "s", and since "s" is less than six characters long, it will be ignored. Likewise, "double-precision" will be two words, "double" and "precision".
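The word rules above can be sketched as a small scanner over a block of text. This is a minimal sketch, not a required design; the function name next_word and its interface are assumptions made for illustration, and it relies on isalpha in the default "C" locale accepting only the letters A-Z and a-z.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MIN_WORD_LEN 6
#define MAX_WORD_LEN 50

/* Hypothetical helper: scans text[*pos .. len) for the next run of letters
 * that is MIN_WORD_LEN to MAX_WORD_LEN characters long, copies it in
 * lowercase into out (which must hold MAX_WORD_LEN + 1 bytes), advances
 * *pos past it, and returns its length. Returns 0 when no further
 * acceptable word exists. */
size_t next_word(const char *text, size_t len, size_t *pos, char *out)
{
    assert(text != NULL && pos != NULL && out != NULL);
    size_t i = *pos;
    while (i < len) {
        /* Skip non-letters. */
        while (i < len && !isalpha((unsigned char)text[i]))
            i++;
        /* Collect the run of letters that forms one word. */
        size_t start = i;
        while (i < len && isalpha((unsigned char)text[i]))
            i++;
        size_t n = i - start;
        if (n >= MIN_WORD_LEN && n <= MAX_WORD_LEN) {
            for (size_t k = 0; k < n; k++)
                out[k] = (char)tolower((unsigned char)text[start + k]);
            out[n] = '\0';
            *pos = i;
            return n;
        }
        /* Too short or too long: ignore it and keep scanning. */
    }
    *pos = len;
    return 0;
}
```

Applied to the text "elephant's double-precision", repeated calls would yield "elephant", "double" and "precision", skipping the one-letter word "s".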

The output can be unsorted, but it should be written to stdout and should consist of one word per line. You should not print anything else to stdout. Please be sure to turn off debugging output before submitting your program for grading.

Step one is to implement a data structure that can efficiently store a set of words, with each word being associated with a count of how many times it has been seen in a block of text. A hash table is recommended.
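One possible shape for such a hash table is sketched below, using separate chaining. The names (WordTable, table_add and so on), the bucket count, and the hash function are all assumptions chosen for illustration; any efficient design is acceptable. Note that strdup is avoided because it is not declared under -std=c99.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096   /* assumed size; the assignment does not specify one */

typedef struct Entry {
    char *word;            /* heap-allocated copy of the word */
    unsigned long count;   /* times the word has been seen */
    struct Entry *next;    /* next entry in the same bucket */
} Entry;

typedef struct {
    Entry *buckets[NBUCKETS];
} WordTable;

/* djb2-style string hash, reduced to a bucket index. */
static unsigned long hash_word(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Returns an empty table, or NULL if memory is exhausted. */
WordTable *table_create(void)
{
    return calloc(1, sizeof(WordTable));
}

/* Adds n to the count of word, inserting the word on first sight.
 * Returns 0 on success, -1 if memory is exhausted. */
int table_add(WordTable *t, const char *word, unsigned long n)
{
    assert(t != NULL && word != NULL);
    unsigned long h = hash_word(word);
    for (Entry *e = t->buckets[h]; e != NULL; e = e->next) {
        if (strcmp(e->word, word) == 0) {
            e->count += n;
            return 0;
        }
    }
    Entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->word, word);
    e->count = n;
    e->next = t->buckets[h];
    t->buckets[h] = e;
    return 0;
}

/* Returns the count stored for word, or 0 if the word is absent. */
unsigned long table_count(const WordTable *t, const char *word)
{
    for (const Entry *e = t->buckets[hash_word(word)]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e->count;
    return 0;
}

/* Frees every entry and the table itself. */
void table_destroy(WordTable *t)
{
    if (t == NULL)
        return;
    for (size_t i = 0; i < NBUCKETS; i++) {
        Entry *e = t->buckets[i];
        while (e != NULL) {
            Entry *next = e->next;
            free(e->word);
            free(e);
            e = next;
        }
    }
    free(t);
}
```

A fixed bucket count keeps the sketch short; a table that resizes as it fills would behave better on very large inputs.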

Step two is to implement the producer-consumer pattern connecting one thread that reads a file and hands chunks of the file to one or more threads that process the chunks to find words and keep their counts in an instance of the data structure of step one. The buffer between the producer and the consumer(s) should hold a maximum of five chunks, and the chunks should be roughly 1000 characters long. They can be slightly longer than 1000 characters in order to avoid breaking a word across a chunk boundary. (Since words are at most 50 characters, this means a chunk should not be more than 1050 characters.) If you have more than one consumer thread, then you need to make the data structure that they are updating thread-safe. This can be done simply by using a single mutex to lock the whole data structure.
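The five-slot buffer can be sketched as a circular queue protected by one mutex and two condition variables. This is one common way to build a bounded buffer, not the only acceptable one; the type ChunkQueue, the function names, and the "done" flag used to signal end of file are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 5   /* the buffer holds at most five chunks */

typedef struct {
    char *slots[QUEUE_CAP];   /* chunk pointers; the consumer that removes one owns it */
    int head, tail, count;
    int done;                 /* set once the producer has reached end of file */
    pthread_mutex_t lock;
    pthread_cond_t not_full;
    pthread_cond_t not_empty;
} ChunkQueue;

void queue_init(ChunkQueue *q)
{
    q->head = q->tail = q->count = q->done = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* Producer side: blocks while the buffer is full. */
void queue_put(ChunkQueue *q, char *chunk)
{
    assert(chunk != NULL);   /* NULL is reserved to mean "no more chunks" */
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->slots[q->tail] = chunk;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Producer side: announces that no more chunks will arrive and wakes
 * every consumer that is waiting. */
void queue_close(ChunkQueue *q)
{
    pthread_mutex_lock(&q->lock);
    q->done = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: blocks while the buffer is empty. Returns NULL once the
 * queue has been closed and drained. */
char *queue_get(ChunkQueue *q)
{
    char *chunk = NULL;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->done)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count > 0) {
        chunk = q->slots[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
    }
    pthread_mutex_unlock(&q->lock);
    return chunk;
}

void queue_destroy(ChunkQueue *q)
{
    pthread_mutex_destroy(&q->lock);
    pthread_cond_destroy(&q->not_full);
    pthread_cond_destroy(&q->not_empty);
}
```

The waits sit inside while loops rather than if statements because a thread can wake up without the condition holding, either spuriously or because another consumer took the chunk first.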

Step three is to have the main thread create a set of threads for each file to be processed. Each set should use the producer-consumer implementation of step two to work together to count the words in their file.
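Within each set of threads, the producer has to cut its file into chunks of roughly 1000 characters without splitting a word. A minimal sketch of that one task follows, assuming the file is read through a stdio FILE pointer; the function name read_chunk and the choice to return a freshly allocated, NUL-terminated buffer are assumptions for illustration only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_BASE 1000   /* nominal chunk size */
#define CHUNK_MAX  1050   /* nominal size plus the longest allowed word */

/* Hypothetical helper: reads the next chunk from fp into a freshly
 * allocated, NUL-terminated buffer that the caller must free.
 * Returns NULL at end of file or if memory is exhausted. */
char *read_chunk(FILE *fp)
{
    assert(fp != NULL);
    char *buf = malloc(CHUNK_MAX + 1);
    if (buf == NULL)
        return NULL;
    size_t n = fread(buf, 1, CHUNK_BASE, fp);
    if (n == 0) {
        free(buf);
        return NULL;
    }
    /* If the chunk ends in a letter, a word may continue past the
     * boundary: keep reading letters until the word ends. */
    if (n == CHUNK_BASE && isalpha((unsigned char)buf[n - 1])) {
        int c;
        while (n < CHUNK_MAX && (c = fgetc(fp)) != EOF) {
            if (!isalpha(c)) {
                ungetc(c, fp);   /* leave the separator for the next chunk */
                break;
            }
            buf[n++] = (char)c;
        }
    }
    buf[n] = '\0';
    return buf;
}
```

The sketch relies on the stated assumption that no word is longer than fifty characters; a longer run of letters would simply be cut at the 1050-character limit.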

Step four is to write a function that, given two instances of the data structure of step one, will compute the "intersection" of the two instances. That is, produce an instance of the data structure that will contain only the words that occur at least once in both files. For the words in the intersection, sum each word's count from each instance into a global count. This intersection can be done by updating one of the instances. That is, you do not need to create a new instance of the data structure.
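An in-place intersection might look like the sketch below. It assumes a chained hash table; the types and the helpers bucket_of, table_find and table_put are hypothetical and are included only so that the example is self-contained.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024   /* assumed table size */

typedef struct Entry {
    char *word;
    unsigned long count;
    struct Entry *next;
} Entry;

typedef struct {
    Entry *buckets[NBUCKETS];
} WordTable;

size_t bucket_of(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Returns the entry for word, or NULL if the word is absent. */
Entry *table_find(const WordTable *t, const char *word)
{
    for (Entry *e = t->buckets[bucket_of(word)]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e;
    return NULL;
}

/* Inserts a word that is not yet present. Returns 0, or -1 if out of memory. */
int table_put(WordTable *t, const char *word, unsigned long count)
{
    Entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->word, word);
    e->count = count;
    size_t b = bucket_of(word);
    e->next = t->buckets[b];
    t->buckets[b] = e;
    return 0;
}

/* Updates dst in place so that it holds only the words present in both
 * tables, each with the sum of its two counts. src is left unchanged. */
void table_intersect(WordTable *dst, const WordTable *src)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < NBUCKETS; i++) {
        Entry **link = &dst->buckets[i];
        while (*link != NULL) {
            Entry *e = *link;
            const Entry *other = table_find(src, e->word);
            if (other != NULL) {
                e->count += other->count;   /* word is in both: keep and sum */
                link = &e->next;
            } else {
                *link = e->next;            /* word is missing from src: unlink */
                free(e->word);
                free(e);
            }
        }
    }
}
```

Walking the chain through a pointer to the link, rather than a pointer to the entry, lets an entry be removed without tracking its predecessor separately.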

Complete the program by having the main thread use the function of step four to combine the data structures produced by the sets of threads of step three. This, of course, requires the main thread to wait for the sets of threads to finish their work.
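The create-then-wait structure of the main thread can be sketched as follows. To stay short, each file gets a single stand-in thread that merely counts the bytes in its file; in the real program that thread would be replaced by the set of producer and consumer threads, and the byte count by the per-file data structure. The names FileJob, file_worker and process_files are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define MAX_FILES 10   /* at most 10 files are given to one run */

typedef struct {
    const char *filename;   /* input: the file to process */
    unsigned long bytes;    /* output: stand-in for the per-file result */
    int failed;             /* output: nonzero if the file could not be opened */
} FileJob;

/* Thread body: a stand-in for the per-file producer-consumer pipeline. */
void *file_worker(void *arg)
{
    FileJob *job = arg;
    FILE *fp = fopen(job->filename, "r");
    if (fp == NULL) {
        job->failed = 1;
        return NULL;
    }
    while (fgetc(fp) != EOF)
        job->bytes++;
    fclose(fp);
    return NULL;
}

/* Starts one worker per file, waits for all of them, and returns 0 if
 * every file was processed or -1 if any worker failed. */
int process_files(FileJob *jobs, int n)
{
    assert(n >= 0 && n <= MAX_FILES);
    pthread_t tids[MAX_FILES];
    int started = 0;
    int status = 0;
    for (; started < n; started++) {
        jobs[started].bytes = 0;
        jobs[started].failed = 0;
        if (pthread_create(&tids[started], NULL, file_worker,
                           &jobs[started]) != 0) {
            status = -1;
            break;
        }
    }
    /* A job's results are safe to read only after its thread is joined. */
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        if (jobs[i].failed)
            status = -1;
    }
    return status;
}
```

Starting every thread before joining any of them is what lets the files be processed concurrently; joining each thread immediately after creating it would make the program serial again.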

Sample files to test your program are available in ~cs520/public/prog4/files. You can assume at most 10 files will be given to one particular run of your program. If an invalid filename is given as an argument, terminate the program with an appropriate error message.

If other run-time errors occur, you may also terminate your program with an error message. Note, however, that if malloc fails, this will be considered to be your problem. You need to be able to process the files given the system's limits on memory usage. Therefore, you need to free dynamically allocated memory blocks when they are no longer needed. In fact, I will be using valgrind to be sure you deallocate all dynamically allocated memory prior to the program terminating. (This is only necessary when the program terminates normally.)

You should also use the helgrind component of valgrind to check for race conditions in your program.

Put all of your code for this assignment in one file: prog4.c.

If you provide a correct serial solution to the problem, you will receive up to 50 points. That is, you build an appropriate (i.e. efficient) data structure for step one and then use it serially, by processing each file one at a time, utilizing the function of step four to combine the data structures produced from the files.

If you provide a correct multithreaded solution to the problem that implements the producer-consumer pattern with a single consumer for each input file, you will receive up to 75 points.

If you provide a correct multithreaded solution to the problem that implements the producer-consumer pattern with at least four consumers for each input file, you will receive up to 100 points.

Your program will be graded primarily by testing it for correct functionality. In addition, however, you may lose points if your program is not properly structured and documented. Coding guidelines are given on the course overview webpage. Decompose sub-problems appropriately into functions and do incremental testing. Please turn off any debugging code before you submit your program.

In particular, you must have comments in your code that clearly state whether you obtained the 50-point level, the 75-point level or the 100-point level. Also your code must be properly documented and structured so that we can easily confirm which level you implemented.

Your program will be graded using agate.cs.unh.edu so be sure to test in that environment. Your program will be compiled using these gcc flags: -g -Wall -std=c99 -pthread.

Your program should be submitted for grading from agate.cs.unh.edu. To turn in this assignment, type:
~cs520/bin/submit prog4 prog4.c

Submissions can be checked by typing:
~cs520/bin/scheck prog4

This assignment is due Sunday April 5. The standard policy concerning late submissions will be in effect. See the course overview webpage.

Remember: as always you are expected to do your own work on this assignment. Copying code from another student or from sites on the internet is explicitly forbidden!


Last modified on March 18, 2015.

Comments and questions should be directed to hatcher@unh.edu