CS520
Fall 2018
Program 1
Due Wednesday, September 12


Write two C programs:

You are not allowed to use any iconv library routines. The goal is for you to solve this problem yourself at a low level, in order to get experience with bit manipulation in C.

fprintf is a generalization of printf that prints to a particular stream (rather than always to stdout). For example:

fprintf(stderr, "unexpected continuation byte at offset %d\n", offset);

You should write other programs to create interesting test cases.

To complete this assignment, you need to understand the following C features:

The shift and bitwise operators (>>, <<, & and |) can be used to accomplish a task such as reading a byte, determining whether it might be the start byte of a three-byte UTF-8 sequence, and if so extracting its data bits:
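int c = getchar();  /* getchar returns an int so that EOF can be distinguished */
if (c == EOF)
{
  fprintf(stderr, "unexpected EOF!\n");
}
else if ((c >> 4) == 0x0E)  /* start byte has the form 1110xxxx */
{
  printf("c contains the start byte of a three-byte UTF-8 sequence with");
  printf(" %02x for the data bits\n", (unsigned int) (c & 0x0F));
}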

Or, to take the two bytes of a two-byte UTF-8 sequence, extract their data bits, and then combine the data bits to form the corresponding UTF-16 character:

unsigned int convertTwoByteUtf8ToUtf16(unsigned int byte1,
                                       unsigned int byte2)
{
  if ((byte1 >> 5) != 0x06)
  {
    fprintf(stderr, "byte1 does not contain UTF-8 two-byte start byte!\n");
    return 0;
  }

  if ((byte2 >> 6) != 0x02)
  {
    fprintf(stderr, "byte2 does not contain UTF-8 continuation byte!\n");
    return 0;
  }

  return (((byte1 & 0x1F) << 6) | (byte2 & 0x3F));
}
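
The same pattern extends to longer sequences. The following is only a sketch of a three-byte analogue, not part of the assignment handout: the function name convertThreeByteUtf8ToUtf16 is hypothetical, it reuses the convention above of returning 0 on error, and it assumes that only the start-byte and continuation-byte bit patterns need to be checked (it does not reject overlong encodings or values in the surrogate range).

```c
#include <stdio.h>

/* Hypothetical three-byte analogue of convertTwoByteUtf8ToUtf16.
 * Assumption: returns 0 on error, like the two-byte version. */
unsigned int convertThreeByteUtf8ToUtf16(unsigned int byte1,
                                         unsigned int byte2,
                                         unsigned int byte3)
{
  /* A three-byte start byte has the form 1110xxxx. */
  if ((byte1 >> 4) != 0x0E)
  {
    fprintf(stderr, "byte1 does not contain UTF-8 three-byte start byte!\n");
    return 0;
  }

  /* Each continuation byte has the form 10xxxxxx. */
  if ((byte2 >> 6) != 0x02 || (byte3 >> 6) != 0x02)
  {
    fprintf(stderr, "expected two UTF-8 continuation bytes!\n");
    return 0;
  }

  /* 4 data bits from byte1, then 6 bits each from byte2 and byte3. */
  return (((byte1 & 0x0F) << 12) | ((byte2 & 0x3F) << 6) | (byte3 & 0x3F));
}
```

For instance, the bytes E2 82 AC (the UTF-8 encoding of the euro sign) combine to 0x20AC.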

Your two programs will be graded primarily by testing them for correct functionality.
  1. 60 points will be awarded for properly processing encodings of characters from the BMP.
  2. 30 additional points will be awarded for properly processing encodings of characters from the supplementary planes.
  3. 10 additional points will be awarded for properly detecting errors in the input files.
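
Characters from the supplementary planes (U+10000 through U+10FFFF) do not fit in a single 16-bit UTF-16 code unit and are represented as a surrogate pair. The sketch below illustrates the bit manipulation involved; the helper name splitIntoSurrogates and its return convention (1 on success, 0 if the value is outside the supplementary planes) are assumptions made for this example, not something the assignment specifies.

```c
#include <stdint.h>

/* Hypothetical helper: splits a supplementary-plane code point into a
 * UTF-16 high/low surrogate pair.
 * Assumption: returns 1 on success, 0 if codePoint is out of range. */
int splitIntoSurrogates(uint32_t codePoint, uint16_t *high, uint16_t *low)
{
  if (codePoint < 0x10000 || codePoint > 0x10FFFF)
  {
    return 0;
  }

  /* Subtracting 0x10000 leaves a 20-bit value. */
  uint32_t v = codePoint - 0x10000;

  /* Top 10 bits go in the high surrogate (0xD800..0xDBFF). */
  *high = (uint16_t) (0xD800 | (v >> 10));

  /* Bottom 10 bits go in the low surrogate (0xDC00..0xDFFF). */
  *low = (uint16_t) (0xDC00 | (v & 0x3FF));

  return 1;
}
```

For instance, U+1F600 splits into the pair D83D DE00.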

There are a few test files available on agate in ~cs520/public/lab2 and ~cs520/public/prog1. However, you should perform exhaustive testing since the range of Unicode values is relatively small.
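
To give a sense of how small that range is, the sketch below enumerates every value a test generator would need to cover. It assumes that "Unicode values" means the code points 0 through 0x10FFFF with the surrogate range 0xD800 through 0xDFFF excluded; the function names are hypothetical and chosen only for this example.

```c
/* Hypothetical helper: returns 1 if value is a code point that can be
 * encoded on its own (not a surrogate, not beyond U+10FFFF), else 0. */
int isUnicodeScalarValue(unsigned long value)
{
  if (value > 0x10FFFFUL)
  {
    return 0;
  }
  if (value >= 0xD800UL && value <= 0xDFFFUL)
  {
    return 0;
  }
  return 1;
}

/* Counts the values an exhaustive test would have to cover. */
unsigned long countUnicodeScalarValues(void)
{
  unsigned long count = 0;
  unsigned long value;

  for (value = 0; value <= 0x10FFFFUL; value++)
  {
    if (isUnicodeScalarValue(value))
    {
      count++;
    }
  }
  return count;
}
```

The loop visits just over a million values (0x110000 code points minus 2048 surrogates), so a test program that generates one encoding per value is feasible.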

Remember that you may also lose points if your program is not properly structured or adequately documented. Coding guidelines are given on the course overview webpage.

Your programs will be graded on agate.cs.unh.edu, so be sure to test in that environment. Your programs will be compiled using these gcc flags: -g -Wall -std=c99.

Your programs should be submitted for grading from agate.cs.unh.edu. To turn in this program, type:

% ~cs520/bin/submit prog1 decodeUTF16.c encodeUTF16.c

Submissions can be checked by typing:

% ~cs520/bin/scheck prog1

This assignment is due Wednesday, September 12. The standard late policy will be in effect; see the course overview webpage.

Remember: as always you are expected to do your own work on this assignment. Copying code from another student or from sites on the internet is explicitly forbidden!


Last modified on August 10, 2018.

Comments and questions should be directed to pjh@cs.unh.edu