CS520
Fall 2018
Program 1
Due Wednesday, September 12

Write two C programs:

decodeUTF16.
The program reads a file containing UTF-16 encodings of a sequence of Unicode characters, decodes the UTF-16 to determine the Unicode characters, and writes those decoded characters in UTF-32 format to an output file.
The program will take two arguments: the name of the input file as the first argument and the name of the output file as the second argument. Print an appropriate error message if the input file cannot be opened for reading or the output file cannot be opened for writing. (If the user specifies the same file for both input and output, then the behaviour is undefined, meaning that there is no particular requirement for how you handle this case.)
Assume the input file will start with a BOM to indicate whether the file is stored in little endian or big endian format. If there is no BOM, report the error and terminate the program.
The output file should be written with the same endian-ness as the input file and should contain the appropriate BOM as the first character.
A file that contains only a BOM is okay. Simply write an output file that contains only a BOM.
A file is a valid UTF-16 file if it does not contain:
- A value that denotes a noncharacter.
- Unpaired surrogates: A value in the range 0xD800 to 0xDBFF not followed by a value in the range 0xDC00 to 0xDFFF, or any value in the range 0xDC00 to 0xDFFF not preceded by a value in the range 0xD800 to 0xDBFF.
- An incomplete value, meaning that EOF is encountered while reading a UTF-16 value (or the BOM).
If an error is detected, print an appropriate error message to stderr that includes the offset in the file for the start byte for the UTF-16 value that is in error. Print the offset in decimal.
You may exit the program after reporting the first error.
The program will return a status of -1 if an error is encountered; otherwise it will return a status of 0.
Put all your source code in the file decodeUTF16.c.
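As a rough illustration of the bit manipulation involved, the following sketch assembles a 16-bit value from two bytes and combines a surrogate pair into a single Unicode value. The function names makeUnit16 and combineSurrogates, and the bigEndian flag, are hypothetical and not part of the assignment. The sketch assumes the caller has already checked that both surrogates are in range.

```c
#include <stdint.h>

/* Hypothetical helper: assemble one 16-bit value from two bytes given in
   file order. bigEndian is nonzero when the BOM bytes were FE FF. */
uint16_t makeUnit16(uint8_t first, uint8_t second, int bigEndian)
{
  if (bigEndian)
  {
    return (uint16_t)((first << 8) | second);
  }
  return (uint16_t)((second << 8) | first);
}

/* Hypothetical helper: combine a leading surrogate (0xD800 to 0xDBFF) and
   a trailing surrogate (0xDC00 to 0xDFFF) into one Unicode value.
   Assumes both arguments were range-checked by the caller. */
uint32_t combineSurrogates(uint16_t lead, uint16_t trail)
{
  uint32_t high10 = (uint32_t)(lead & 0x03FF);  /* low 10 bits of lead */
  uint32_t low10 = (uint32_t)(trail & 0x03FF);  /* low 10 bits of trail */
  return 0x10000u + ((high10 << 10) | low10);
}
```

For instance, the pair 0xD83D 0xDE00 combines to 0x1F600.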
encodeUTF16.
The program reads a file containing UTF-32 encodings of a sequence of Unicode characters, decodes the UTF-32 to determine the Unicode characters, and writes those decoded characters in UTF-16 format to an output file.
The program will take two arguments: the name of the input file as the first argument and the name of the output file as the second argument.
Assume the input file will start with a BOM to indicate whether the file is stored in little endian or big endian format. If there is no BOM, report the error and terminate the program.
The output file should be written with the same endian-ness as the input file and should contain the appropriate BOM as the first character.
A file that contains only a BOM is okay. Simply write an output file that contains only a BOM.
A file is a valid UTF-32 file if it does not contain:
- A value outside the valid range for Unicode characters (0x00000000 to 0x0010FFFF).
- A value that denotes a noncharacter.
- A value in the range 0xD800-0xDFFF (i.e. a leading or trailing surrogate).
- An incomplete value, meaning that EOF is encountered while reading a UTF-32 value (or the BOM).
If an error is detected, print an appropriate error message to stderr that includes the offset in the file for the start byte for the UTF-32 value that is in error. Print the offset in decimal.
You may exit the program after reporting the first error.
The program will return a status of -1 if an error is encountered; otherwise it will return a status of 0.
Put all your source code in the file encodeUTF16.c.
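The validity checks and the encoding step might be sketched as follows. The names isNoncharacter, isEncodable, and encodeUtf16 are hypothetical. The noncharacter test assumes the Unicode Standard's definition (U+FDD0 to U+FDEF, plus every value whose low 16 bits are 0xFFFE or 0xFFFF); check the course notes in case a different definition is intended.

```c
#include <stdint.h>

/* Assumes the Unicode Standard's list of noncharacters. */
int isNoncharacter(uint32_t cp)
{
  if (cp >= 0xFDD0 && cp <= 0xFDEF)
  {
    return 1;
  }
  return (cp & 0xFFFE) == 0xFFFE;  /* low 16 bits are FFFE or FFFF */
}

/* Hypothetical helper: 1 if cp may be written as UTF-16, 0 otherwise. */
int isEncodable(uint32_t cp)
{
  if (cp > 0x10FFFF)
  {
    return 0;  /* outside the Unicode range */
  }
  if (cp >= 0xD800 && cp <= 0xDFFF)
  {
    return 0;  /* a leading or trailing surrogate */
  }
  return !isNoncharacter(cp);
}

/* Hypothetical helper: store the UTF-16 encoding of cp in units and
   return how many 16-bit values were produced (1 or 2).
   Assumes isEncodable(cp) is true. */
int encodeUtf16(uint32_t cp, uint16_t units[2])
{
  if (cp < 0x10000)
  {
    units[0] = (uint16_t)cp;
    return 1;
  }
  cp -= 0x10000;  /* leaves a 20-bit value */
  units[0] = (uint16_t)(0xD800 | (cp >> 10));    /* upper 10 bits */
  units[1] = (uint16_t)(0xDC00 | (cp & 0x03FF)); /* lower 10 bits */
  return 2;
}
```

For instance, 0x1F600 encodes as the pair 0xD83D 0xDE00.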

You are not allowed to use any iconv library routines. The goal is for you to solve this problem yourself at a low level, in order to get experience with bit manipulation in C.

fprintf is a generalization of printf that prints to a particular stream (rather than always to stdout). For example:

fprintf(stderr, "unexpected continuation byte at offset %d\n", offset);
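If the offset is kept in a long (for example, because it came from ftell), the matching conversion is %ld. A minimal sketch, where the helper name reportError and the message wording are assumptions rather than requirements:

```c
#include <stdio.h>

/* Hypothetical helper: print an error message that includes a decimal
   file offset. Returns what fprintf returns: the number of characters
   written, or a negative value if the write failed. */
int reportError(FILE *stream, const char *what, long offset)
{
  return fprintf(stream, "error: %s at offset %ld\n", what, offset);
}
```

A call might look like reportError(stderr, "unpaired surrogate", offset);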

You should write other programs to create interesting test cases.
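One possible starting point for such a test generator is sketched below. It writes a single UTF-32 value in either byte order. The names packUtf32 and writeUtf32 are hypothetical, and the caller is assumed to have opened the file with fopen in binary write mode.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: split a 32-bit value into four bytes in the
   requested byte order. */
void packUtf32(uint32_t value, int bigEndian, unsigned char out[4])
{
  for (int i = 0; i < 4; i++)
  {
    /* big endian: most significant byte first; little endian: last */
    int shift = bigEndian ? (24 - 8 * i) : (8 * i);
    out[i] = (unsigned char)((value >> shift) & 0xFF);
  }
}

/* Hypothetical helper: write one UTF-32 value to fp.
   Returns 0 on success and -1 if the write failed. */
int writeUtf32(FILE *fp, uint32_t value, int bigEndian)
{
  unsigned char bytes[4];
  packUtf32(value, bigEndian, bytes);
  return (fwrite(bytes, 1, 4, fp) == 4) ? 0 : -1;
}
```

Writing 0x0000FEFF first produces the BOM in the chosen byte order. Because the generator accepts any 32-bit value, it can also produce deliberately invalid files, such as one containing 0x0000D800 or 0x00110000.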

To complete this assignment, you need to understand the following C features:

hexadecimal constants, such as 0x7F
left shift, such as x << 6, which will produce a value by left shifting the value of x six bit positions, filling with zeros on the right and discarding the bits that go off the end on the left. (Note that x is not modified; rather, an intermediate value is produced, similar to what happens when computing x + 1.) Consider this example:
```
unsigned char x = 0xA3;
unsigned char y = x << 3;
```
y will be assigned 0x18.
right shift, such as x >> 6, which will produce a value by right shifting the value of x six bit positions. If the value being shifted is of unsigned type, then the result is filled with zeros on the left. If the value being shifted is of signed type and is negative, then the result is implementation-defined: the C compiler decides whether to fill with zeros or to fill by replicating the sign bit. (In either case, the bits that go off the end on the right are discarded.) To write portable code, you should avoid right shifting a signed type.
Consider this example:
```
unsigned char x = 0xA3;
unsigned char y = x >> 3;
```
y will be assigned 0x14.
bitwise OR, which produces a 1 at a bit position if at least one of the input operands has a 1 at that position; otherwise 0 is produced at that bit position. Consider this example:
```
unsigned char x = 0xA3;
unsigned char y = 0x32;
unsigned char z = x | y;
```
z will be assigned 0xB3.
bitwise AND, which produces a 1 at a bit position if both input operands have a 1 at that position; otherwise 0 is produced at that bit position. Consider this example:
```
unsigned char x = 0xA3;
unsigned char y = 0x32;
unsigned char z = x & y;
```
z will be assigned 0x22.
bitwise XOR (exclusive OR), which produces a 1 at a bit position if one of the input operands has a 1 at that position and the other input operand has a 0 at that position; otherwise 0 is produced at that bit position. Consider this example:
```
unsigned char x = 0xA3;
unsigned char y = 0x32;
unsigned char z = x ^ y;
```
z will be assigned 0x91.
bitwise complement, such as ~x, which produces 1 at a bit position if the input operand has a 0 at that position and produces 0 at a bit position if the input operand has a 1 at that position. Consider this example:
```
unsigned char x = 0xA3;
unsigned char y = ~x;
```
y will be assigned 0x5C.

These operators can be used to accomplish a task such as reading a byte, determining whether it might be the start byte for a three-byte UTF-8 sequence, and, if so, pulling out its data bits:

unsigned int c = getchar();
if (c == EOF)
{
  fprintf(stderr, "unexpected EOF!\n");
}
else if ((c >> 4) == 0x0E)
{
  printf("c contains the start byte of a three-byte UTF-8 sequence with");
  printf(" %02x for the data bits\n", c & 0x0F);
}

Or, to take the two bytes from a two-byte UTF-8 sequence, extract the data bits, and then combine the data bits to form the UTF-16 character:

unsigned int convertTwoByteUtf8ToUtf16(unsigned int byte1,
                                       unsigned int byte2)
{
  if ((byte1 >> 5) != 0x06)
  {
    fprintf(stderr, "byte1 does not contain UTF-8 two-byte start byte!\n");
    return 0;
  }

  if ((byte2 >> 6) != 0x02)
  {
    fprintf(stderr, "byte2 does not contain UTF-8 continuation byte!\n");
    return 0;
  }

  return (((byte1 & 0x1F) << 6) | (byte2 & 0x3F));
}
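The same shift-and-compare pattern carries over to UTF-16: the top six bits of a 16-bit value are 110110 (0x36) for a leading surrogate and 110111 (0x37) for a trailing surrogate. A minimal sketch, with hypothetical function names:

```c
#include <stdint.h>

/* 1 if unit is in the range 0xD800 to 0xDBFF, 0 otherwise. */
int isLeadingSurrogate(uint16_t unit)
{
  return (unit >> 10) == 0x36;
}

/* 1 if unit is in the range 0xDC00 to 0xDFFF, 0 otherwise. */
int isTrailingSurrogate(uint16_t unit)
{
  return (unit >> 10) == 0x37;
}
```

Comparing against the range endpoints directly is equivalent; the shifted form simply mirrors the UTF-8 examples above.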

Your two programs will be graded primarily by testing them for correct functionality.

60 points will be awarded for properly processing encodings of characters from the BMP.
30 additional points will be awarded for properly processing encodings of characters from the supplementary planes.
10 additional points will be awarded for properly detecting errors in the input files.

There are a few test files available on agate in ~cs520/public/lab2 and ~cs520/public/prog1. However, you should perform exhaustive testing since the range of Unicode values is relatively small.

In addition, remember, you may lose points if your program is not properly structured or adequately documented. Coding guidelines are given on the course overview webpage.

Your programs will be graded using agate.cs.unh.edu so be sure to test in that environment. Your programs will be compiled using these gcc flags: -g -Wall -std=c99.

Your programs should be submitted for grading from agate.cs.unh.edu. To turn in this program, type:

% ~cs520/bin/submit prog1 decodeUTF16.c encodeUTF16.c

Submissions can be checked by typing:

% ~cs520/bin/scheck prog1

This assignment is due Wednesday, September 12. The standard policy concerning late submissions will be in effect. See the course overview webpage.

Remember: as always you are expected to do your own work on this assignment. Copying code from another student or from sites on the internet is explicitly forbidden!

Last modified on August 10, 2018.

Comments and questions should be directed to pjh@cs.unh.edu
