CS611
Fall 2004
Programming Assignment 1
Due Sunday September 12


You are to write two programs to convert between Unicode and the TUF-8 encoding.

Unicode is a 16-bit character encoding that is "designed to support the interchange, processing, and display of the written texts of the diverse languages of the modern world." UTF-8 is a "transformation format" for efficiently storing or transmitting Unicode text. TUF-8 is a variation of UTF-8 popular among hackers who live in the marshes along the Wabash River in central Indiana.

UTF-8 encodes the Unicode 16-bit characters in the following manner:

(The values to the right of the colon are 8-bit bytes, where the X's are data bits.)

Some examples in binary (Unicode : UTF-8):

A reference for the UTF-8 encoding is section 4.4.7 of The Java Virtual Machine Specification by Lindholm and Yellin.
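
As a rough illustration of the bit packing involved, here is a minimal sketch of standard UTF-8 encoding for a single 16-bit value. It is not TUF-8, and it treats every 16-bit value as a character, as this assignment does. The function name utf8_encode is hypothetical. Note also that the variant described in the Java Virtual Machine Specification encodes the null character as two bytes, which this sketch does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one 16-bit Unicode value as standard UTF-8.
 * Writes 1 to 3 bytes into out and returns the number of bytes written.
 * Assumption: plain UTF-8, not the JVM variant and not TUF-8. */
size_t utf8_encode(uint16_t c, unsigned char out[3])
{
    assert(out != NULL);

    if (c <= 0x007F) {
        /* 7 data bits: 0XXXXXXX */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c <= 0x07FF) {
        /* 11 data bits: 110XXXXX 10XXXXXX */
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    /* 16 data bits: 1110XXXX 10XXXXXX 10XXXXXX */
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}
```

For example, calling utf8_encode(0x20AC, buf) fills buf with the three bytes E2 82 AC and returns 3.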

TUF-8 encodes the Unicode 16-bit characters in the following manner:

(The values to the right of the colon are 8-bit bytes, where the X's are data bits.)

Some examples in binary (Unicode : TUF-8):

Write a program, uni2tuf, that will read a stream of Unicode characters from stdin and will write the corresponding stream of TUF-8 bytes to stdout. Be sure to handle two error cases: the input is empty, or the input has an odd number of bytes. In either case, emit an appropriate error message to stderr that includes the byte offset in the file where the error was detected.
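
The two input error cases for uni2tuf can be detected while tracking a running byte offset. The following is only a sketch under stated assumptions: the function name check_unicode_stream, the wording of the messages, and the choice to report a 0-based offset (0 for empty input, the position of the unpaired byte for odd-length input) are all illustrative and are not prescribed by the assignment. The TUF-8 encoding step itself is left out.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: consume `in` two bytes at a time and report the
 * empty-input and odd-length cases to `err`.
 * Returns 0 if the input is non-empty and has an even number of bytes,
 * -1 otherwise.
 * Assumption: offsets are 0-based; the assignment does not say. */
int check_unicode_stream(FILE *in, FILE *err)
{
    long offset = 0;   /* number of bytes consumed so far */
    int hi, lo;

    assert(in != NULL && err != NULL);

    while ((hi = getc(in)) != EOF) {
        offset++;
        lo = getc(in);
        if (lo == EOF) {
            /* offset - 1 is the position of the unpaired high byte */
            fprintf(err,
                    "uni2tuf: odd number of bytes, unpaired byte at offset %ld\n",
                    offset - 1);
            return -1;
        }
        offset++;
        /* A real converter would encode ((hi << 8) | lo) here. */
    }

    if (offset == 0) {
        fprintf(err, "uni2tuf: empty input at offset 0\n");
        return -1;
    }
    return 0;
}
```

In a complete program this check would be folded into the conversion loop rather than run as a separate pass, since stdin cannot generally be rewound.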

Write a program, tuf2uni, that will read a stream of TUF-8 bytes from stdin and will write the corresponding Unicode characters to stdout. Be sure to check that the TUF-8 byte sequences are valid. If you encounter an invalid TUF-8 byte sequence, then emit an appropriate error message (that includes the byte offset in the file at which the error was detected) to stderr, then skip to the next valid character, and continue processing:

When reading a stream of Unicode characters, assume that the high byte of a character will be read first and the low byte will be next. When writing a stream of Unicode characters, write the high byte first, followed by the low byte.
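
The high-byte-first rule above can be sketched with a pair of small helpers. The names read_u16_be and write_u16_be and their return conventions are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read one 16-bit character, high byte first.
 * Returns 1 on success, 0 on end of input before any byte was read,
 * and -1 if only the high byte was available. */
int read_u16_be(FILE *in, uint16_t *c)
{
    int hi, lo;

    assert(in != NULL && c != NULL);

    hi = getc(in);
    if (hi == EOF)
        return 0;
    lo = getc(in);
    if (lo == EOF)
        return -1;

    *c = (uint16_t)((hi << 8) | lo);
    return 1;
}

/* Hypothetical helper: write one 16-bit character, high byte first.
 * Returns 0 on success, -1 on a write error. */
int write_u16_be(FILE *out, uint16_t c)
{
    assert(out != NULL);

    if (putc(c >> 8, out) == EOF)
        return -1;
    if (putc(c & 0xFF, out) == EOF)
        return -1;
    return 0;
}
```

Reading and writing one byte at a time keeps the result independent of the byte order of the machine running the program, which would not be true of calling fread or fwrite directly on a 16-bit variable.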

Similarly, write multi-byte TUF-8 encoded characters with the high byte first and read multi-byte TUF-8 encoded characters assuming the high byte will be first.

The two programs will be worth equal credit: each is worth 50% of the points for the assignment.

Your programs will be graded primarily by testing them for correct functionality. However, you may lose points if your programs are not properly structured or adequately documented. See the mandatory guidelines given on the course overview webpage.

You may find using the od command helpful for analyzing the test files. In particular, using the -tx1 flag will display the bytes of a file, one byte at a time, in hexadecimal.

You must write your programs in C. You must submit a Makefile (called "Makefile") so that we can conveniently build your programs. The Makefile goal of "tuf2uni" should build an executable called "tuf2uni" and the goal of "uni2tuf" should build an executable called "uni2tuf".

Your programs will be graded using a CIS Linux machine (e.g. turing.unh.edu) so be sure to test in that environment.

Your assignment should be submitted for grading from a CIS Linux machine (e.g. turing.unh.edu). To turn in this assignment, type:
~cs611/bin/submit prog1 <list of files to submit>

Please submit only your C source files and your Makefile. Do not turn in any other files!

Submissions can be checked from a CIS Linux machine by typing:
~cs611/bin/scheck prog1

To receive full credit for the assignment, you must turn in your files prior to 8am on Monday September 13. Programming assignments may be handed in late at a penalty of 2 points for one day late, 5 points for two days late, 10 points for three days late, 20 points for four days late, and 40 points for five days late. No program may be turned in more than 5 days late.

Remember: as always, you are expected to do your own work on this assignment. Copying code from another student or from sites on the internet is explicitly forbidden!

If you developed your code on a DOS/Windows system, be sure to transfer your files to a CIS Linux system before submitting them. You need to convert the files from DOS text format (lines ending in carriage return plus line feed) to UNIX text format (lines ending in line feed only). If you need help with this, please see me.


Last modified on August 28, 2004.

Comments and questions should be directed to pjh@cs.unh.edu