You are to write two programs to convert between Unicode and the UTF-8 encoding.
For this assignment, treat Unicode as a 16-bit character encoding that is "designed to support the interchange, processing, and display of the written texts of the diverse languages of the modern world." UTF-8 is a "transformation format" for efficiently storing or transmitting Unicode text.
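UTF-8 encodes the Unicode 16-bit characters in the following manner:

    Unicode range (hex)   Unicode bits (binary)   UTF-8 bytes (binary)
    0000-007F             00000000 0xxxxxxx       0xxxxxxx
    0080-07FF             00000yyy yyxxxxxx       110yyyyy 10xxxxxx
    0800-FFFF             zzzzyyyy yyxxxxxx       1110zzzz 10yyyyyy 10xxxxxx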
Some examples in binary (Unicode : UTF-8):
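The mapping from a 16-bit value to its UTF-8 bytes can be made concrete with a short C sketch. The function name utf8_encode and its signature are illustrative assumptions, not part of the assignment, and this is only one possible way to express the mapping. It covers 16-bit values only, as this assignment does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one 16-bit Unicode value as UTF-8.
 * Writes 1 to 3 bytes into out, high byte first, and returns the
 * number of bytes written. */
size_t utf8_encode(unsigned int unit, unsigned char out[3])
{
    assert(out != NULL && unit <= 0xFFFFu);  /* 16-bit values only */

    if (unit <= 0x007Fu) {                   /* 0xxxxxxx */
        out[0] = (unsigned char)unit;
        return 1;
    }
    if (unit <= 0x07FFu) {                   /* 110yyyyy 10xxxxxx */
        out[0] = (unsigned char)(0xC0u | (unit >> 6));
        out[1] = (unsigned char)(0x80u | (unit & 0x3Fu));
        return 2;
    }
    /* 1110zzzz 10yyyyyy 10xxxxxx */
    out[0] = (unsigned char)(0xE0u | (unit >> 12));
    out[1] = (unsigned char)(0x80u | ((unit >> 6) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | (unit & 0x3Fu));
    return 3;
}
```

Under this sketch, three sample values map as follows (Unicode : UTF-8):

    00000000 01000001 : 01000001
    00000000 11101001 : 11000011 10101001
    00100000 10101100 : 11100010 10000010 10101100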
Write a program, uni2utf, that will read a stream of Unicode characters from stdin and will write the corresponding stream of UTF-8 bytes to stdout. Be sure to handle the error cases where the input is empty or has an odd number of bytes by writing an appropriate error message to stderr.
A reference for the UTF-8 encoding is pages 100-101 of The Java Virtual Machine Specification by Lindholm and Yellin. Be careful, however: that reference describes a variant that encodes the null character (\u0000) in two bytes, whereas your programs should use the standard UTF-8 encoding, which is a single zero byte.
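The two input error cases for uni2utf (empty input, odd number of bytes) can be detected while reading the input two bytes at a time. The following is a minimal sketch under stated assumptions: the names read_unit and scan_units, the status codes, and the message wording are all hypothetical, and a real uni2utf would encode and write each value where the comment indicates.

```c
#include <assert.h>
#include <stdio.h>

enum read_status { READ_OK, READ_EOF, READ_ODD };
enum scan_result { SCAN_OK = 0, SCAN_EMPTY = 1, SCAN_ODD = 2 };

/* Hypothetical helper: read one 16-bit value, high byte first. */
enum read_status read_unit(FILE *in, unsigned int *unit)
{
    int hi, lo;

    assert(in != NULL && unit != NULL);
    hi = getc(in);
    if (hi == EOF)
        return READ_EOF;          /* clean end of input */
    lo = getc(in);
    if (lo == EOF)
        return READ_ODD;          /* a high byte with no low byte */
    *unit = ((unsigned int)hi << 8) | (unsigned int)lo;
    return READ_OK;
}

/* Hypothetical driver: consume the whole input, count the complete
 * 16-bit values, and report the two error cases on err. */
enum scan_result scan_units(FILE *in, FILE *err, unsigned long *count)
{
    unsigned int unit;
    enum read_status st;

    assert(in != NULL && err != NULL && count != NULL);
    *count = 0;
    while ((st = read_unit(in, &unit)) == READ_OK) {
        ++*count;                 /* a real uni2utf would encode unit here */
    }
    if (st == READ_ODD) {
        fprintf(err, "uni2utf: input has an odd number of bytes\n");
        return SCAN_ODD;
    }
    if (*count == 0) {
        fprintf(err, "uni2utf: input is empty\n");
        return SCAN_EMPTY;
    }
    return SCAN_OK;
}
```

In a complete program, in would be stdin and err would be stderr. Whether the values read before an odd trailing byte should still be converted is not specified here, so treat that as a design decision to document in your own code.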
Write a program, utf2uni, that will read a stream of UTF-8 bytes from stdin and will write the corresponding Unicode characters to stdout. Be sure to check that the UTF-8 byte sequences are valid. If you encounter an invalid UTF-8 byte sequence, emit an appropriate error message to stderr, skip to the next valid character, and continue processing.
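Validation and resynchronization for utf2uni can be sketched as a function that decodes one character from an in-memory buffer. Several things here are assumptions rather than requirements of the assignment: the name utf8_decode and its signature, the decision to reject overlong forms (such as the two-byte encoding of \u0000), and the reading of "skip to the next valid character" as "skip the offending bytes plus any continuation bytes that follow them". A streaming program would also need to decide how to buffer a sequence that is cut off by the end of input.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if b has the continuation form 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0u) == 0x80u;
}

/* Hypothetical helper: try to decode one character from buf[0..len-1].
 * Success: stores the 16-bit value in *unit and the number of bytes
 *          consumed in *used, then returns 1.
 * Failure: stores in *used the number of bytes to skip so decoding can
 *          resume at a byte that could start a character, returns 0. */
int utf8_decode(const unsigned char *buf, size_t len,
                size_t *used, unsigned int *unit)
{
    size_t need = 0, i = 1;
    unsigned int value = 0, min = 0;
    int ok = 1;

    assert(buf != NULL && len > 0 && used != NULL && unit != NULL);

    if (buf[0] < 0x80u) {                       /* 0xxxxxxx */
        *unit = buf[0];
        *used = 1;
        return 1;
    }
    if ((buf[0] & 0xE0u) == 0xC0u) {            /* 110yyyyy */
        need = 1; value = buf[0] & 0x1Fu; min = 0x0080u;
    } else if ((buf[0] & 0xF0u) == 0xE0u) {     /* 1110zzzz */
        need = 2; value = buf[0] & 0x0Fu; min = 0x0800u;
    } else {
        ok = 0;   /* stray continuation byte, or a lead byte for > 16 bits */
    }

    /* Collect the required continuation bytes, six bits each. */
    while (ok && i <= need) {
        if (i < len && is_continuation(buf[i])) {
            value = (value << 6) | (buf[i] & 0x3Fu);
            i++;
        } else {
            ok = 0;                             /* missing or truncated */
        }
    }

    /* Assumption: a value below min is an overlong form and is rejected. */
    if (ok && value >= min) {
        *unit = value;
        *used = i;
        return 1;
    }

    /* Resynchronize: also skip any continuation bytes that follow. */
    while (i < len && is_continuation(buf[i]))
        i++;
    *used = i;
    return 0;
}
```

A caller would loop over the input, advancing by *used after every call and writing an error message to stderr whenever the function returns 0. Because *used is always at least 1, the loop makes progress even on invalid input.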
When reading a stream of Unicode characters, assume that the high byte of a character will be read first and the low byte will be next. When writing a stream of Unicode characters, write the high byte first, followed by the low byte.
Similarly, write multi-byte UTF-8 encoded characters with the high byte first and read multi-byte UTF-8 encoded characters assuming the high byte will be first.
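The high-byte-first rule for 16-bit Unicode characters can be implemented with shifts and masks, so that the result does not depend on the byte order of the machine running the program. The helper names unpack_be16 and pack_be16 below are hypothetical; this is a sketch of one approach, not a required interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: combine two bytes, read high byte first,
 * into one 16-bit value. */
unsigned int unpack_be16(const unsigned char bytes[2])
{
    assert(bytes != NULL);
    return ((unsigned int)bytes[0] << 8) | (unsigned int)bytes[1];
}

/* Hypothetical helper: split a 16-bit value into two bytes,
 * high byte first, ready to be written in that order. */
void pack_be16(unsigned int unit, unsigned char bytes[2])
{
    assert(bytes != NULL && unit <= 0xFFFFu);
    bytes[0] = (unsigned char)(unit >> 8);     /* high byte */
    bytes[1] = (unsigned char)(unit & 0xFFu);  /* low byte  */
}
```

Reading or writing a 16-bit integer directly (for example with fread into an unsigned short) would instead use whatever byte order the host machine happens to have, which is why the sketch works one byte at a time.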
The two programs will be worth equal credit: each is worth half of the points for the assignment.
Your programs will be graded primarily by testing them for correct functionality. However, you may lose points if your programs are not properly structured or adequately documented.
All test files will be publicly available for this assignment. When ready, they will be in ~cs611/public/prog1.
You can write your programs in either C or C++. You must submit a Makefile (called "Makefile") so that we can conveniently build your programs. Your programs will be graded on an Alpha machine (e.g., hopper or christa), so be sure to test in that environment.
Your programs should be submitted for grading from either hopper or christa.
To turn in this assignment, type:
~cs611/bin/submit prog1 <list of files to submit>
Do not turn in any non-ASCII files (e.g., object files or executable files).
Submissions can be checked by typing:
~cs611/bin/scheck prog1
To receive full credit for the assignment, you must turn in your files prior to 8am on Monday, February 10. Late submissions will be accepted with a penalty of 5% per day, up to one week late.
Remember: as always, you are expected to do your own work on this assignment.
Comments and questions should be directed to pjh@cs.unh.edu