The goal of this assignment is to use POSIX threads to efficiently process a set of English text files.
The program will accept a sequence of names of text files as its only command-line arguments. The goal of the program is to find the twenty most frequently used words across all files, where each word must appear at least once in each file. If there are ties for position twenty on the list of most frequently used words, then report all the words that tie. That is, you may report more than twenty words for your result, because of ties for the last position on the list. Note that you might report fewer than twenty words if there are fewer than twenty words that are common to all files.
Also note that words less than six characters in length should be ignored. Since the input files are in English, you may assume that no real word will be longer than fifty characters, since I believe the longest word in the English language is only 45 letters long. Words longer than fifty characters can be ignored, since they are probably not real words. Therefore, the analysis should actually report the twenty most frequently used words, where each word must appear at least once in each file, and each word must be at least six characters long and no more than fifty characters long.
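The tie rule above can be expressed as a cutoff count: sort the counts of the candidate words in descending order, take the count at position twenty, and report every word whose count is at least that value. The following is only a sketch of that idea; the helper name cutoff_count and its interface are assumptions made for illustration and are not part of the assignment.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: larger counts first. */
static int cmp_desc(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x < y) - (x > y);
}

/*
 * Hypothetical helper (not part of the assignment): given the counts of
 * the n candidate words, return the smallest count that still makes the
 * list when `limit` positions are available. Every word whose count is
 * >= the returned value should be reported, which includes all ties for
 * the last position. Returns 0 when there are no more than `limit`
 * candidates, meaning every candidate is reported.
 * The counts array is sorted in place, so pass a copy if order matters.
 */
unsigned long cutoff_count(unsigned long *counts, size_t n, size_t limit)
{
    assert(counts != NULL || n == 0);
    assert(limit > 0);
    if (n <= limit)
        return 0;
    qsort(counts, n, sizeof *counts, cmp_desc);
    return counts[limit - 1];
}
```

With limit set to 20, one pass over the intersection can then print each word whose count is at least the cutoff. Because the output may be unsorted, the words themselves never need to be sorted.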
If the user does not specify at least one file to be processed, then terminate the program with an appropriate message.
A word starts with a letter (either uppercase or lowercase) and continues until a non-letter (or EOF) is encountered. Non-words in the file should simply be ignored. Once a word is identified, convert all uppercase letters to lowercase before you process it.
Therefore, "elephant's" will be two words, "elephant" and "s", and since "s" is less than six characters long, it will be ignored. Likewise, "double-precision" will be two words, "double" and "precision".
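The word rules above can be sketched as a small scanner. This is one possible shape, not a required design: the function name next_word, its parameters, and the MIN_WORD and MAX_WORD constants are assumptions introduced here. It also assumes the default "C" locale, in which isalpha accepts only the ASCII letters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MIN_WORD 6   /* shortest word that counts */
#define MAX_WORD 50  /* longest word that counts */

/*
 * Hypothetical helper: scan buf[*pos .. len) for the next run of letters
 * whose length is between MIN_WORD and MAX_WORD, copy it in lowercase to
 * out (which must hold MAX_WORD + 1 bytes), and advance *pos past it.
 * Runs that are too short or too long are skipped.
 * Returns the word length, or 0 when the buffer is exhausted.
 */
size_t next_word(const char *buf, size_t len, size_t *pos, char *out)
{
    assert(buf != NULL && pos != NULL && out != NULL);
    size_t i = *pos;
    while (i < len) {
        /* skip non-letters */
        while (i < len && !isalpha((unsigned char)buf[i]))
            i++;
        /* consume one run of letters */
        size_t start = i;
        while (i < len && isalpha((unsigned char)buf[i]))
            i++;
        size_t n = i - start;
        if (n >= MIN_WORD && n <= MAX_WORD) {
            for (size_t k = 0; k < n; k++)
                out[k] = (char)tolower((unsigned char)buf[start + k]);
            out[n] = '\0';
            *pos = i;
            return n;
        }
    }
    *pos = len;
    return 0;
}
```

Calling next_word repeatedly on "Elephant's double-precision" yields "elephant", "double", and "precision"; the "s" is skipped because it is too short.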
The output can be unsorted, but it should be written to stdout and should consist of one word per line. You should not print anything else to stdout. Please be sure to turn off debugging output before submitting your program for grading.
Step one is to implement a data structure that can efficiently store a set of words, with each word being associated with a count of how many times it has been seen in a block of text. A hash table is recommended.
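As a starting point for step one, here is a minimal sketch of a chained hash table that maps words to counts. Everything in it is an assumption made for illustration: the names table, table_add, table_get, and table_free, the fixed bucket count, and the choice of a djb2-style hash. A competitive solution would likely tune the table size or resize it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096  /* assumed fixed size; a real table might resize */

typedef struct entry {
    char *word;            /* owned copy of the word */
    unsigned long count;   /* times the word has been seen */
    struct entry *next;    /* next entry in the same bucket */
} entry;

typedef struct {
    entry *bucket[NBUCKETS];
} table;

/* djb2-style string hash. */
static unsigned long hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Returns an empty table, or NULL if allocation fails. */
table *table_new(void)
{
    return calloc(1, sizeof(table));
}

/* Add n to the word's count, inserting it if absent.
 * Returns 0 on success, -1 if allocation fails. */
int table_add(table *t, const char *word, unsigned long n)
{
    assert(t != NULL && word != NULL);
    unsigned long b = hash(word) % NBUCKETS;
    for (entry *e = t->bucket[b]; e; e = e->next) {
        if (strcmp(e->word, word) == 0) {
            e->count += n;
            return 0;
        }
    }
    entry *e = malloc(sizeof *e);
    if (!e)
        return -1;
    size_t len = strlen(word) + 1;
    e->word = malloc(len);
    if (!e->word) {
        free(e);
        return -1;
    }
    memcpy(e->word, word, len);
    e->count = n;
    e->next = t->bucket[b];
    t->bucket[b] = e;
    return 0;
}

/* Returns the word's count, or 0 if the word is not in the table. */
unsigned long table_get(const table *t, const char *word)
{
    assert(t != NULL && word != NULL);
    unsigned long b = hash(word) % NBUCKETS;
    for (const entry *e = t->bucket[b]; e; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e->count;
    return 0;
}

/* Release every entry and the table itself. */
void table_free(table *t)
{
    if (!t)
        return;
    for (size_t b = 0; b < NBUCKETS; b++) {
        entry *e = t->bucket[b];
        while (e) {
            entry *next = e->next;
            free(e->word);
            free(e);
            e = next;
        }
    }
    free(t);
}
```

Because a stored word always has a count of at least one, a return value of 0 from table_get can double as "not present", which is convenient for the intersection in step three.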
Step two is to write a function that, given a block of text (defined by a starting address and a length), will use the data structure from step one to track how many times each word occurs in the block.
Step three is to write a function that, given a set of instances of the data structure, will compute the "intersection" of all the instances. That is, produce an instance of the data structure that will contain only the words that occur at least once in each file. For the words in the intersection, sum each word's count from each instance into a global count.
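The intersection in step three can be sketched as follows. To keep the example short and self-contained, a flat array of (word, count) pairs with a linear lookup stands in for the step-one data structure; that stand-in, and the names pair, wordset, lookup, and intersect, are assumptions for illustration only. With a real hash table the lookup would be the table's own query function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the step-one structure: a flat array of pairs. */
typedef struct {
    const char *word;
    unsigned long count;
} pair;

typedef struct {
    const pair *items;
    size_t n;
} wordset;

/* Linear lookup: returns the word's count, or 0 if it is absent. */
static unsigned long lookup(const wordset *s, const char *word)
{
    for (size_t i = 0; i < s->n; i++)
        if (strcmp(s->items[i].word, word) == 0)
            return s->items[i].count;
    return 0;
}

/*
 * Hypothetical helper: write to out every word of sets[0] that also
 * occurs in all the other sets, with its counts summed across all sets.
 * out must have room for sets[0].n pairs. Returns the number written.
 */
size_t intersect(const wordset *sets, size_t nsets, pair *out)
{
    assert(sets != NULL && nsets > 0 && out != NULL);
    size_t nout = 0;
    for (size_t i = 0; i < sets[0].n; i++) {
        const char *w = sets[0].items[i].word;
        unsigned long total = sets[0].items[i].count;
        size_t k;
        for (k = 1; k < nsets; k++) {
            unsigned long c = lookup(&sets[k], w);
            if (c == 0)
                break;          /* missing from one file: drop the word */
            total += c;
        }
        if (k == nsets) {       /* present in every set */
            out[nout].word = w;
            out[nout].count = total;
            nout++;
        }
    }
    return nout;
}
```

Only the words of one set need to be walked, since a word missing from that set cannot be in the intersection. Walking the smallest set first is one way to reduce the number of lookups.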
Step four is to write a serial program that will process the text files in sequence using the functions written in steps two and three to find the twenty most frequently used words, where each word occurs at least once in each file. (You will need to develop a strategy for handling words that get split across the block boundaries of step two.)
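One possible strategy for the block-boundary problem, sketched below, is to move each proposed split point forward until it no longer falls inside a word. This assumes the text being split is addressable as a single buffer; if you instead read a file in fixed-size pieces, you would need to carry the partial word over to the next piece. The helper name adjust_split is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: given a proposed split point pos in buf[0 .. len),
 * return the nearest position at or after pos that does not fall inside
 * a run of letters. The block before the split is buf[0 .. result) and
 * the block after it starts at buf[result].
 */
size_t adjust_split(const char *buf, size_t len, size_t pos)
{
    assert(buf != NULL && pos <= len);
    if (pos == 0)
        return 0;
    /* A split is inside a word only if the characters on both
     * sides of it are letters. */
    while (pos < len &&
           isalpha((unsigned char)buf[pos - 1]) &&
           isalpha((unsigned char)buf[pos]))
        pos++;
    return pos;
}
```

For the text "alpha elephant gamma", a proposed split at offset 8 (inside "elephant") moves to offset 14, the space that follows the word, so "elephant" is counted once by the earlier block.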
This program should read the files using the following standard functions:
You may also want to consider using this function:
Step five is to write a multi-threaded program to solve the problem. One simple, but acceptable, approach would be to assign each thread the task of reading and processing one particular file. One thread then uses the function of step three to analyze the data structures produced by the other threads and to report the list of the most frequently used words appearing in all files.
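The one-thread-per-file approach might be structured like the sketch below. To keep it short, each job is an in-memory block of text rather than a file, and the worker only counts qualifying words instead of filling the step-one data structure; the job struct and the run_jobs function are assumptions made for illustration. Compile with gcc and -pthread.

```c
#include <assert.h>
#include <ctype.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_JOBS 100  /* the assignment promises at most 100 files */

/* One unit of work: a block of text in, a word tally out. */
typedef struct {
    const char *text;
    size_t len;
    unsigned long words;  /* result: words of 6 to 50 letters */
} job;

/* Thread body: a stand-in for "read and process one file". */
static void *worker(void *arg)
{
    job *j = arg;
    size_t i = 0;
    j->words = 0;
    while (i < j->len) {
        while (i < j->len && !isalpha((unsigned char)j->text[i]))
            i++;
        size_t start = i;
        while (i < j->len && isalpha((unsigned char)j->text[i]))
            i++;
        size_t n = i - start;
        if (n >= 6 && n <= 50)
            j->words++;
    }
    return NULL;
}

/*
 * Start one thread per job, then wait for all of them.
 * Returns 0 on success, or the pthread_create error code.
 */
int run_jobs(job *jobs, size_t njobs)
{
    pthread_t tid[MAX_JOBS];
    size_t started = 0;
    int rc = 0;
    assert(jobs != NULL && njobs <= MAX_JOBS);
    for (; started < njobs; started++) {
        rc = pthread_create(&tid[started], NULL, worker, &jobs[started]);
        if (rc != 0)
            break;
    }
    /* Join only the threads that were actually created. */
    for (size_t k = 0; k < started; k++)
        pthread_join(tid[k], NULL);
    return rc;
}
```

Each thread writes only to its own job, so no locking is needed here. Once run_jobs returns, the calling thread can safely read every result and perform the intersection.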
We will have a contest to determine the fastest student submission. I will buy the two students with the fastest solutions lunch or dinner at Libby's (maximum value $20). In addition, all other submissions that are within 10% of the running time of the fastest submission will earn ten bonus points on the assignment.
To be competitive in the contest, you will most likely need a more sophisticated approach than the one outlined in step five above, because assigning one thread per file is likely to suffer from load imbalance if the file sizes vary greatly.
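One way to reduce load imbalance, offered here only as an illustration and not as the expected solution, is to cut the input into many chunks and let a fixed pool of threads claim chunks from a shared counter until none remain. In the sketch below the real work on a chunk is replaced by a placeholder that records the chunk as done; the work_queue struct, run_pool, and MAX_THREADS are hypothetical names. Compile with gcc and -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64  /* assumed upper bound for this sketch */

typedef struct {
    pthread_mutex_t lock;  /* protects next */
    size_t next;           /* index of the next unclaimed chunk */
    size_t nchunks;        /* total number of chunks */
    int *done;             /* done[i]: times chunk i was processed */
} work_queue;

/* Thread body: claim chunks one at a time until none are left. */
static void *worker(void *arg)
{
    work_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        size_t mine = q->next;
        if (mine < q->nchunks)
            q->next++;
        pthread_mutex_unlock(&q->lock);
        if (mine >= q->nchunks)
            return NULL;
        /* Placeholder for the real work on chunk `mine`. Only the
         * thread that claimed the chunk touches done[mine]. */
        q->done[mine]++;
    }
}

/*
 * Process nchunks chunks with nthreads threads.
 * Returns 0 on success, or a nonzero error code.
 */
int run_pool(size_t nchunks, size_t nthreads, int *done)
{
    pthread_t tid[MAX_THREADS];
    work_queue q = { .next = 0, .nchunks = nchunks, .done = done };
    size_t started = 0;
    int rc;
    assert(done != NULL && nthreads > 0 && nthreads <= MAX_THREADS);
    rc = pthread_mutex_init(&q.lock, NULL);
    if (rc != 0)
        return rc;
    for (; started < nthreads; started++) {
        rc = pthread_create(&tid[started], NULL, worker, &q);
        if (rc != 0)
            break;
    }
    for (size_t k = 0; k < started; k++)
        pthread_join(tid[k], NULL);
    pthread_mutex_destroy(&q.lock);
    return rc;
}
```

A thread that finishes a small chunk immediately claims another, so one large file no longer leaves the remaining threads idle. The chunk size is a tuning choice: smaller chunks balance better but take the lock more often.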
I will determine the fastest submission by averaging the wall clock times for a series of benchmark runs. What I will use for the benchmark runs will not be released in advance. But you can expect runs using a varying number of files (including only one file) and files of varying lengths.
The benchmark runs will be performed on agate.cs.unh.edu. I will try to do all the runs when the machine is lightly loaded. I will compile the programs using gcc with -O (i.e. modest optimization).
Sample files that might be used to benchmark programs are available in ~cs520/public/prog5/files. You can assume that all input files will be less than 2GB in length. Also, you can assume at most 100 files will be given to one particular run of your program. If an invalid filename is given as an argument, terminate the program with an appropriate error message.
If other run-time errors occur, you may also terminate your program with an error. But, note, if malloc fails, this will be considered to be your problem. You need to be able to process the files given the system's limits on memory usage.
Put all of your code for this assignment in one file: prog5.c.
If you provide a correct serial solution to the problem, you will receive 70 points. If you provide a correct multithreaded solution to the problem, you will receive 100 points. If your submission does not correctly solve the problem, you will receive at most 50 points, no matter whether you used threads or not. Therefore, start by providing a serial solution, then make it multithreaded, then finally work to make it as efficient as you can in order to compete in the contest. Be sure you position yourself to submit a working program when the assignment due date arrives! In particular, be sure you understand that submitting a multithreaded program that does not work correctly will be worth fewer points than a serial program that works correctly.
In the comment block at the top of the file you submit, be sure to clearly state whether your program uses threads and, if so, what your strategy is for utilizing the threads to solve the problem.
Your program will be graded primarily by testing it for correct functionality. In addition, however, you may lose points if your program is not properly structured and documented. Decompose sub-problems appropriately into functions and do incremental testing. Leave your debugging output in your code, but disabled, when you do your final assignment submission.
Please turn off any debugging code before you submit your program.
Your programs will be graded using agate.cs.unh.edu so be sure to test in that environment.
Your programs should be submitted for grading from agate.cs.unh.edu.
To turn in this assignment, type:
~cs520/bin/submit prog5 prog5.c
Submissions can be checked by typing:
~cs520/bin/scheck prog5
To receive full credit for the assignment, you must turn in your files prior to 8 am on Thursday, April 10. The standard policy concerning late submissions will be in effect. See the course overview webpage.
Remember: as always you are expected to do your own work on this assignment.
Comments and questions should be directed to hatcher@unh.edu