CS520
Spring 2013
Program 1
Due Sunday February 10


Write a C program, decodeUTF8, that will read a file containing UTF-8 encodings of a sequence of Unicode characters, decode the UTF-8 to determine the Unicode characters, and write those decoded characters in UTF-32 format to an output file. The UTF-32 characters should be written to the file in Little Endian format.
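As an illustration of the little-endian output format, here is a minimal sketch of writing one decoded character. The function name writeUtf32LE is hypothetical and not part of the assignment; the sketch assumes the output stream was opened in binary mode.

```c
#include <stdio.h>

/* Hypothetical helper: writes one Unicode code point as four bytes,
   least significant byte first (UTF-32 little-endian).
   Returns 0 on success, -1 if a write fails. */
int writeUtf32LE(FILE *out, unsigned int codePoint)
{
  int i;

  for (i = 0; i < 4; i++)
  {
    /* shift byte i down to the low eight bits, then mask it off */
    if (fputc((codePoint >> (8 * i)) & 0xFF, out) == EOF)
    {
      return -1;
    }
  }
  return 0;
}
```

For example, writeUtf32LE(out, 0x20AC) writes the bytes AC 20 00 00. Writing the bytes one at a time, rather than passing the whole integer to fwrite, keeps the output independent of the byte order of the machine running the program.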

The program will take two arguments: the name of the input file as the first argument and the name of the output file as the second argument. The program will return a status of -1 if the file is not a valid UTF-8 file; otherwise it will return a status of 0. Put all your source code in the file decodeUTF8.c.
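One possible way to handle the two arguments is sketched below. The helper name openFiles, the usage message, and the choice to return -1 when a file cannot be opened are all assumptions; the assignment only specifies the status for invalid UTF-8.

```c
#include <stdio.h>

/* Hypothetical helper: opens argv[1] for reading and argv[2] for writing,
   both in binary mode so that no bytes are translated.
   Returns 0 on success, -1 on failure (an assumed convention). */
int openFiles(int argc, char *argv[], FILE **in, FILE **out)
{
  if (argc != 3)
  {
    fprintf(stderr, "usage: decodeUTF8 inputFile outputFile\n");
    return -1;
  }

  *in = fopen(argv[1], "rb");
  if (*in == NULL)
  {
    perror(argv[1]);
    return -1;
  }

  *out = fopen(argv[2], "wb");
  if (*out == NULL)
  {
    perror(argv[2]);
    fclose(*in);
    return -1;
  }
  return 0;
}
```

A main function could call openFiles(argc, argv, &in, &out) first and only start decoding if it returns 0.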

An empty input file is okay. You should simply produce an empty output file.

A file is a valid UTF-8 file if it does not contain any of these errors:

If an error is detected, print an appropriate error message to stderr that includes the offset in the file for the start byte for the sequence that is in error. In the case of an unexpected continuation byte, print the offset of that byte.

fprintf is a generalization of printf that prints to a particular stream (rather than always to stdout). For example:

fprintf(stderr, "unexpected continuation byte at offset %d\n", offset);

You may exit the program after reporting the first error.
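To have an offset available when an error is found, one approach is to count bytes as they are read. The sketch below is deliberately simplified: the hypothetical function firstNonAsciiOffset only looks for the first byte with its high bit set, which is not one of the required error checks. A full decoder could use the same counting and also remember the offset at which each sequence started.

```c
#include <stdio.h>

/* Hypothetical helper: returns the offset (counting from 0) of the first
   byte whose high bit is set, or -1 if every byte is plain ASCII. */
long firstNonAsciiOffset(FILE *in)
{
  long offset = 0;
  int c;  /* int, not char, so that EOF can be told apart from a byte */

  while ((c = fgetc(in)) != EOF)
  {
    if (c & 0x80)
    {
      return offset;
    }
    offset++;  /* one more byte consumed */
  }
  return -1;
}
```

Because this sketch stores the offset in a long, it would be printed with %ld rather than %d.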

You should write other programs to create interesting test cases.
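A test-case generator can be as small as a function that writes a chosen array of bytes to a file. The sketch below is one possibility; the names writeTestFile and writeSampleCase are hypothetical, and the sample bytes are just one illustrative case.

```c
#include <stdio.h>

/* Hypothetical helper: writes count bytes to the named file in binary mode.
   Returns 0 on success, -1 on failure. */
int writeTestFile(const char *name, const unsigned char *bytes, size_t count)
{
  FILE *f = fopen(name, "wb");
  size_t written;

  if (f == NULL)
  {
    return -1;
  }
  written = fwrite(bytes, 1, count, f);
  if (fclose(f) != 0 || written != count)
  {
    return -1;
  }
  return 0;
}

/* Hypothetical sample case: one encoding of each length, followed by a
   continuation byte that does not belong to any sequence. */
int writeSampleCase(const char *name)
{
  static const unsigned char sample[] =
  {
    0x41,                    /* U+0041, one byte */
    0xC3, 0xA9,              /* U+00E9, two bytes */
    0xE2, 0x82, 0xAC,        /* U+20AC, three bytes */
    0xF0, 0x9F, 0x98, 0x80,  /* U+1F600, four bytes */
    0x80                     /* stray continuation byte at offset 10 */
  };

  return writeTestFile(name, sample, sizeof sample);
}
```

Building files from explicit byte values makes it easy to produce invalid input that a text editor would refuse to save.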

To complete this program, you need to understand the following C features:

These operators can be used to accomplish a task such as reading a byte, determining whether it might be the start byte for a three-byte UTF-8 sequence, and if so pulling out its data bits:

/* getchar returns an int so that EOF can be told apart from a byte value */
int c = getchar();
if (c == EOF)
{
  fprintf(stderr, "unexpected EOF!\n");
}
else if ((c >> 4) == 0x0E)  /* top four bits are 1110 */
{
  printf("c contains the start byte of a three-byte UTF-8 sequence with");
  printf(" %02x for the data bits\n", c & 0x0F);
}

Or, to take the two bytes from a two-byte UTF-8 sequence, extract the data bits, and then combine the data bits to form the UTF-16 character:

unsigned int convertTwoByteUtf8ToUtf16(unsigned int byte1,
                                       unsigned int byte2)
{
  if ((byte1 >> 5) != 0x06)
  {
    fprintf(stderr, "byte1 does not contain UTF-8 two-byte start byte!\n");
    return 0;
  }

  if ((byte2 >> 6) != 0x02)
  {
    fprintf(stderr, "byte2 does not contain UTF-8 continuation byte!\n");
    return 0;
  }

  return (((byte1 & 0x1F) << 6) | (byte2 & 0x3F));
}
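The same shifting and masking extends to longer sequences. Below is a sketch for the three-byte case, following the conventions of the two-byte function above. The name convertThreeByteUtf8 is hypothetical, and the sketch checks only the form of each byte, not any other error condition.

```c
#include <stdio.h>

/* Hypothetical companion to the two-byte function: combines the data bits
   of a three-byte UTF-8 sequence into one code point.
   Prints a message and returns 0 if a byte has the wrong form. */
unsigned int convertThreeByteUtf8(unsigned int byte1,
                                  unsigned int byte2,
                                  unsigned int byte3)
{
  if ((byte1 >> 4) != 0x0E)  /* start byte must look like 1110xxxx */
  {
    fprintf(stderr, "byte1 does not contain UTF-8 three-byte start byte!\n");
    return 0;
  }

  if ((byte2 >> 6) != 0x02 || (byte3 >> 6) != 0x02)  /* 10xxxxxx */
  {
    fprintf(stderr, "byte2 or byte3 is not a UTF-8 continuation byte!\n");
    return 0;
  }

  /* 4 data bits from byte1, then 6 each from byte2 and byte3 */
  return ((byte1 & 0x0F) << 12) | ((byte2 & 0x3F) << 6) | (byte3 & 0x3F);
}
```

For example, the bytes E2 82 AC combine to 0x20AC: the start byte contributes 0x2 shifted left 12 bits, the second byte contributes 0x02 shifted left 6 bits, and the third byte contributes 0x2C.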

Your program will be graded primarily by testing it for correct functionality.
  1. 60 points will be awarded for properly handling one-byte UTF-8 encodings.
  2. 10 additional points will be awarded for also handling two-byte encodings.
  3. 10 additional points will be awarded for also handling three-byte encodings.
  4. 10 additional points will be awarded for also handling four-byte encodings.
  5. 10 additional points will be awarded for properly detecting errors in the input file.

In addition, remember, you may lose points if your program is not properly structured or adequately documented. Coding guidelines are given on the course overview webpage.

Your programs will be graded using agate.cs.unh.edu, so be sure to test in that environment.

Your programs should be submitted for grading from agate.cs.unh.edu. To turn in this program, type:

% ~cs520/bin/submit prog1 decodeUTF8.c

Submissions can be checked by typing:

% ~cs520/bin/scheck prog1

This assignment is due Sunday February 10. The standard policy concerning late submissions will be in effect. See the course overview webpage.

Remember: as always you are expected to do your own work on this assignment. Copying code from another student or from sites on the internet is explicitly forbidden!


Last modified on January 10, 2013.

Comments and questions should be directed to hatcher@unh.edu