Implement a concurrent buffer facility to support the producer-consumer multithreading pattern. Use your concurrent buffer facility to implement a multithreaded program to analyze a set of English text files.
The concurrent buffer facility consists of these four functions:
void *createConcurrentBuffer(unsigned int size);
void putConcurrentBuffer(void *handle, void *p);
void *getConcurrentBuffer(void *handle);
void deleteConcurrentBuffer(void *handle);
A concurrent buffer allows safe multithreaded access to the buffer. It is a FIFO (first in, first out) queue. If the queue is full, an attempt to "put" to the queue will block until space becomes available in the queue. If the queue is empty, an attempt to "get" from the queue will block until a value becomes available in the queue.
The buffer elements are simply void* pointers. The buffer does not make a copy of the data. It simply stores a pointer to the data.
A call to createConcurrentBuffer creates an instance of a concurrent buffer with the number of buffer elements given by the argument "size". The buffer is initialized to be empty. If successful, the function returns a non-NULL void* "handle" which can be passed to the other three functions in order to operate on the buffer. If the "size" argument is zero, then the function will return NULL, indicating failure. If sufficient memory cannot be allocated for the buffer, then the function will also return NULL.
A call to putConcurrentBuffer puts a pointer to a value into the buffer indicated by the "handle" argument. If the buffer is full, the calling thread will block until space becomes available. If a NULL "handle" is passed in, the function will print an appropriate error message to stderr and the program is terminated via exit(-1).
A call to getConcurrentBuffer retrieves a pointer to a value from the buffer indicated by the "handle" argument. If the buffer is empty, the calling thread will block until a pointer to a value becomes available. The retrieved pointer is returned by the function. If a NULL "handle" is passed in, the function will print an appropriate error message to stderr and the program is terminated via exit(-1).
A call to deleteConcurrentBuffer frees all memory allocated for the buffer indicated by the "handle" argument. It is not an error to delete a non-empty buffer. If a NULL "handle" is passed in, the function will print an appropriate error message to stderr and the program is terminated via exit(-1).
If putConcurrentBuffer, getConcurrentBuffer or deleteConcurrentBuffer is called with a non-NULL but invalid "handle" (including a "handle" to a deleted buffer), the behavior is undefined.
If a pthread function fails during the execution of one of the four functions, an appropriate error message should be printed and the program should be terminated via exit(-1).
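The behavior described above can be sketched as a circular array guarded by one mutex and two condition variables. This is only one possible structure, not the required implementation: the Buffer struct, its field names, and the die and check helpers are assumptions made for illustration. Only the four function prototypes come from this handout.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical internal layout; the handout fixes only the four prototypes. */
typedef struct {
    void **slots;              /* circular array of stored pointers */
    unsigned int size;         /* capacity of the buffer */
    unsigned int count;        /* number of occupied slots */
    unsigned int head;         /* index of the next slot to read */
    unsigned int tail;         /* index of the next slot to write */
    pthread_mutex_t lock;      /* protects every field above */
    pthread_cond_t notFull;    /* signaled when a slot is freed */
    pthread_cond_t notEmpty;   /* signaled when a slot is filled */
} Buffer;

/* Print a message to stderr and terminate the program. */
static void die(const char *msg)
{
    fprintf(stderr, "concurrentBuffer: %s\n", msg);
    exit(-1);
}

/* Terminate if a pthread call returned a nonzero error code. */
static void check(int rc, const char *what)
{
    if (rc != 0) die(what);
}

void *createConcurrentBuffer(unsigned int size)
{
    if (size == 0) return NULL;
    Buffer *b = malloc(sizeof *b);
    if (b == NULL) return NULL;
    b->slots = malloc(size * sizeof *b->slots);
    if (b->slots == NULL) { free(b); return NULL; }
    b->size = size;
    b->count = b->head = b->tail = 0;
    check(pthread_mutex_init(&b->lock, NULL), "pthread_mutex_init failed");
    check(pthread_cond_init(&b->notFull, NULL), "pthread_cond_init failed");
    check(pthread_cond_init(&b->notEmpty, NULL), "pthread_cond_init failed");
    return b;
}

void putConcurrentBuffer(void *handle, void *p)
{
    if (handle == NULL) die("NULL handle passed to putConcurrentBuffer");
    Buffer *b = handle;
    check(pthread_mutex_lock(&b->lock), "pthread_mutex_lock failed");
    while (b->count == b->size)    /* a loop, to tolerate spurious wakeups */
        check(pthread_cond_wait(&b->notFull, &b->lock), "pthread_cond_wait failed");
    b->slots[b->tail] = p;
    b->tail = (b->tail + 1) % b->size;
    b->count++;
    check(pthread_cond_signal(&b->notEmpty), "pthread_cond_signal failed");
    check(pthread_mutex_unlock(&b->lock), "pthread_mutex_unlock failed");
}

void *getConcurrentBuffer(void *handle)
{
    if (handle == NULL) die("NULL handle passed to getConcurrentBuffer");
    Buffer *b = handle;
    check(pthread_mutex_lock(&b->lock), "pthread_mutex_lock failed");
    while (b->count == 0)          /* a loop, to tolerate spurious wakeups */
        check(pthread_cond_wait(&b->notEmpty, &b->lock), "pthread_cond_wait failed");
    void *p = b->slots[b->head];
    b->head = (b->head + 1) % b->size;
    b->count--;
    check(pthread_cond_signal(&b->notFull), "pthread_cond_signal failed");
    check(pthread_mutex_unlock(&b->lock), "pthread_mutex_unlock failed");
    return p;
}

void deleteConcurrentBuffer(void *handle)
{
    if (handle == NULL) die("NULL handle passed to deleteConcurrentBuffer");
    Buffer *b = handle;
    check(pthread_mutex_destroy(&b->lock), "pthread_mutex_destroy failed");
    check(pthread_cond_destroy(&b->notFull), "pthread_cond_destroy failed");
    check(pthread_cond_destroy(&b->notEmpty), "pthread_cond_destroy failed");
    free(b->slots);   /* the stored pointers themselves belong to the caller */
    free(b);
}
```

Note that the sketch frees only the buffer's own storage. Any pointers still in the queue at deletion time remain the caller's responsibility, and the caller must also ensure that no thread is still blocked on the buffer when it is deleted.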
The text analyzer program should accept a sequence of names of English text files as its only command-line arguments. The goal of the program is to print a (non-graphical) histogram of word length for all the words in the text files.
Words less than six characters in length should be ignored. Also, since the input files are in English, you may assume that no word will be longer than fifty characters, since I believe the longest word in the English language is only 45 letters long. Words longer than fifty characters can be ignored, since they are probably not real words. Therefore, the text analyzer program should actually print a histogram of all the words in the files where a word is at least six characters long and not more than fifty characters long.
If the user does not specify at least one file to be processed, then terminate the program with an appropriate message.
A word starts with a letter (either uppercase or lowercase) and continues until a non-letter (or EOF) is encountered. Non-words in the file should simply be ignored.
Therefore, "elephant's" will be two words, "elephant" and "s", and since "s" is less than six characters long, it will be ignored. Likewise, "double-precision" will be two words, "double" and "precision".
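The word rule above can be illustrated with a short scanning loop. This is a sketch under assumptions: the function name countWords, the MIN_LEN and MAX_LEN macros, and the choice to index the histogram array directly by word length are all made up for illustration. It also assumes the default "C" locale, in which isalpha accepts only A-Z and a-z. The provided serial solution remains the authority on exact behavior.

```c
#include <ctype.h>
#include <stddef.h>

#define MIN_LEN 6    /* shortest word that is counted */
#define MAX_LEN 50   /* longest word that is counted */

/* Scan a NUL-terminated text and add its word lengths to hist.
   hist must have at least MAX_LEN + 1 entries; hist[n] counts words of length n. */
void countWords(const char *text, unsigned long hist[])
{
    size_t len = 0;   /* length of the current run of letters */
    for (const char *c = text; ; c++) {
        if (*c != '\0' && isalpha((unsigned char)*c)) {
            len++;
        } else {
            /* A non-letter (or the end of the text) ends the current word. */
            if (len >= MIN_LEN && len <= MAX_LEN) hist[len]++;
            len = 0;
            if (*c == '\0') break;
        }
    }
}
```

With this sketch, the text "elephant's double-precision" adds one word each at lengths 8, 6 and 9, and the lone "s" is dropped.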
The output should be printed to stdout, one line per word length, starting with 6 and ending with 50. Each line should contain two decimal numbers separated by a single space. The first number is the word length. The second number is the number of words found with that length. Nothing else should appear on a line: just a single space separating two decimal numbers. There should not be any leading or trailing spaces, or anything else. Also, no leading zeroes on the numbers. You should not print anything else to stdout. Please be sure to turn off debugging output before submitting your program for grading. We will be using "cmp" or "diff" when grading, so be sure to follow these formatting directions exactly.
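As an illustration of the output format, the following sketch prints one line per length with a single space between the two numbers. The function name printHistogram, the macros, and the use of unsigned long counters are assumptions for the example, not requirements.

```c
#include <stdio.h>

#define MIN_LEN 6    /* first length printed */
#define MAX_LEN 50   /* last length printed */

/* Print "length count" lines for lengths MIN_LEN through MAX_LEN.
   hist[n] is assumed to hold the number of words of length n. */
void printHistogram(FILE *out, const unsigned long hist[])
{
    for (int len = MIN_LEN; len <= MAX_LEN; len++)
        fprintf(out, "%d %lu\n", len, hist[len]);
}
```

Calling printHistogram(stdout, hist) always produces 45 lines, including lines whose count is 0.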
The text analyzer program should create two threads for each file. The first thread should open the file for reading. If the open fails, an appropriate error message should be printed to stderr, and the thread should treat the file as if it were an empty file. If the open is successful, the thread should read the file and send a sequence of blocks of lines to a concurrent buffer, created to allow a set of "producer" threads (reading the files) to cooperate with a set of "consumer" threads, which actually process the text to identify and count words. Each block of lines should contain roughly 1000 characters. The producer thread should add lines to a block as long as the length of the block is less than 1000. As soon as adding a line to the block causes the block to equal or exceed 1000 characters, terminate the block (including the line that pushed the block length to equal or exceed 1000) and send it to the concurrent buffer.
You may assume that the longest line that you will see in an input file is 100 characters, including the newline character.
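The block-building rule can be sketched as follows. The function name readBlock, the two macros, and the decision to exit on allocation failure are assumptions for illustration. The allocation size follows from the stated limits: a block can hold at most 999 characters before its last line is added, that line is at most 100 characters, and one more byte is needed for the terminating NUL.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_TARGET 1000   /* a block is closed once it reaches this length */
#define MAX_LINE 100        /* longest input line, including the newline */

/* Read the next block of whole lines from fp.
   Returns a malloc'd, NUL-terminated string that the caller must free,
   or NULL when no more input is available. */
char *readBlock(FILE *fp)
{
    /* Worst case: 999 characters, one more full line, and the NUL. */
    char *block = malloc(BLOCK_TARGET + MAX_LINE);
    if (block == NULL) {
        fprintf(stderr, "readBlock: out of memory\n");
        exit(-1);
    }
    size_t len = 0;
    /* Keep appending lines while the block is still shorter than the target. */
    while (len < BLOCK_TARGET && fgets(block + len, MAX_LINE + 1, fp) != NULL)
        len += strlen(block + len);
    if (len == 0) {          /* nothing was read: end of file */
        free(block);
        return NULL;
    }
    return block;
}
```

Because blocks always end at a line boundary, no word is split across two blocks. A producer thread could call readBlock in a loop and pass each non-NULL result to putConcurrentBuffer, leaving the consumer that retrieves the block responsible for freeing it.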
Set the size of the concurrent buffer to be 10.
The second thread for each file is a "consumer" thread. A "consumer" thread retrieves blocks of lines from the concurrent buffer, identifies and counts words, and builds a histogram for the words that it sees. Once all the files have been processed, the "consumer" threads need to coordinate to combine all their partial histograms into a single histogram that is a histogram for all the words in all the files. Once the single histogram is complete, the main thread should print the histogram to stdout.
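One way the consumer threads might combine their partial histograms is to add each one into a shared total while holding a mutex. This is a sketch of that idea only; the names total, totalLock and mergeHistogram are hypothetical, and other designs are equally valid.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_LEN 50   /* longest word length that is counted */

/* Shared result, indexed by word length; zero-initialized as a static. */
static unsigned long total[MAX_LEN + 1];
static pthread_mutex_t totalLock = PTHREAD_MUTEX_INITIALIZER;

/* Add one consumer's partial histogram into the shared total.
   Each consumer would call this once, after it has seen its last block. */
void mergeHistogram(const unsigned long partial[])
{
    if (pthread_mutex_lock(&totalLock) != 0) {
        fprintf(stderr, "mergeHistogram: pthread_mutex_lock failed\n");
        exit(-1);
    }
    for (int i = 0; i <= MAX_LEN; i++)
        total[i] += partial[i];
    if (pthread_mutex_unlock(&totalLock) != 0) {
        fprintf(stderr, "mergeHistogram: pthread_mutex_unlock failed\n");
        exit(-1);
    }
}
```

Keeping a private histogram per consumer and merging once at the end means the lock is taken only once per thread, rather than once per word. An alternative that needs no shared state is for each consumer to return its histogram through pthread_join and let the main thread do the summing.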
Note that a consumer thread will retrieve and process blocks from any file, not just the file for which it was created.
You need to devise a strategy for the "producer" threads to communicate to the "consumer" threads that all files have been read to EOF. Perhaps simply have the "producer" threads place a NULL in the concurrent buffer once they reach the end of their file. And the "consumer" threads can stop trying to retrieve from the buffer once they receive a NULL pointer.
Write the program so that all the threads are executing concurrently. Do not process one file after another. Be sure the threads are processing all the files at the same time.
Be sure to use the valgrind tool to validate that you have freed all allocated memory, both in your buffer facility and in your text file analyzer.
And also use the helgrind component of valgrind to check for race conditions in your program.
Place all your code for the concurrent buffer facility in a file called concurrentBuffer.c and place all your code for the text file analyzer in a file called histogram.c. A header file for the concurrent buffer facility, concurrentBuffer.h, and a sample concurrentBuffer.c with stubs for the four functions, are provided in ~cs520/public/prog5.
A serial (single-threaded) solution to the text file analysis problem is provided in ~cs520/public/prog5/serialHistogram.c. The output of your text analyzer should match the output of the serial solution exactly.
The directory ~cs520/public/books contains some English novels that you can use for testing.
Points will be awarded for this assignment in the following way:
Your program will be graded primarily by testing it for correct functionality. In addition, remember, you may lose points if your program is not properly structured or adequately documented. Coding guidelines are given on the course overview webpage.
Your programs will be graded using agate.cs.unh.edu so be sure to test in that environment. Your programs will be compiled using these gcc flags: -g -Wall -std=c99 -pthread.
Your program should be submitted for grading from agate.cs.unh.edu.
To turn in this assignment, type:
~cs520/bin/submit prog5 concurrentBuffer.c histogram.c
Submissions can be checked by typing:
~cs520/bin/scheck prog5
This assignment is due Wednesday April 18. The standard policy concerning late submissions will be in effect. See the course overview webpage.
Remember: as always you are expected to do your own work on this assignment. Copying code from another student or from sites on the internet is explicitly forbidden!
Comments and questions should be directed to pjh@cs.unh.edu