CS611
Spring 2002
Programming Assignment 2
Due Sunday February 24


Replace the assemble module of as611.

The source code for as611 is in ~cs611/public/prog2/as611. There are three entry points for the assemble module. Stubs for these three functions are in assemble.c. (This directory also contains Linux and Alpha executables for my solution to this assignment.)

The init_assemble function is called once at program start-up to allow the assemble module to initialize any internal data structures.

The assemble function is called once for each line of the input assembly language source program. The four parameters to this function are strings and correspond to the four possible (non-comment) components of an input line. (The structure of an input line is described further below.) The function should encode its input in sim611 machine format in an internal data structure for later dumping to an output object file.

The write_obj_file function produces an object file from the internal data structure constructed by the series of calls to the assemble function. (The structure of the object file is described below.)

An input assembly source statement has the following basic format:

label: opcode operands ;comments

All four fields are optional, although certain opcodes require certain operands and an operand cannot appear without an opcode. Statements cannot be continued on other lines.

The strings that are passed into assemble for labels and opcodes will already have been verified to be legal "identifiers" (start with a letter and made up of letters and digits). For operands, the strings will already be verified to be either identifiers or integer constants.

The input is "caseless" -- case of characters does not matter: "xy" is the same as "XY" is the same as "xY" is the same as "Xy". The symbols will be converted to uppercase elsewhere in as611 prior to the call to the assemble function. Labels are also truncated to a maximum of six characters prior to the call to the assemble function.

As well as register names such as "R0", "R1", etc., you should support "SP" and "PC" as pseudonyms for "R14" and "R15" respectively.
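A minimal sketch of how a register operand might be decoded. The function name parse_register is hypothetical, and the register range R0 through R15 is an assumption inferred from SP and PC being R14 and R15; check the comments in exec.c for the actual register set. The sketch relies on operands already being uppercase when they reach the assemble function.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns the register number for an operand such
 * as "R7", "SP", or "PC", or -1 if the operand is not a register name.
 * ASSUMPTION: registers are numbered 0..15 and the operand is uppercase. */
int parse_register(const char *operand)
{
    const char *p;
    int value = 0;

    assert(operand != NULL);

    if (strcmp(operand, "SP") == 0)
        return 14;
    if (strcmp(operand, "PC") == 0)
        return 15;

    /* A register name is 'R' followed by at least one digit. */
    if (operand[0] != 'R' || operand[1] == '\0')
        return -1;

    for (p = operand + 1; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return -1;          /* e.g. "R1X" is an identifier, not a register */
        value = value * 10 + (*p - '0');
        if (value > 15)
            return -1;          /* outside the assumed register range */
    }
    return value;
}
```

A return value of -1 only means the operand is not a register; it may still be a legal address specification or constant.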

A comment is begun by a semi-colon and ends with the end of the line. Comments are discarded prior to the call to the assemble function.

The only legal address specification is a string which would also be a legal label name (the label without the terminating colon). The string doesn't have to be a label -- it can be undefined and therefore an outsymbol -- but it has to have the same form as a label name.

NOTE: The LDIMM instruction accepts either a decimal constant (possibly signed) or an address specification as its nonregister operand.

The assembler should support the following pseudo-ops (i.e., they can appear in the opcode column of the input source program):

BYTE value
WORD value
ALLOC length

BYTE stores a one-byte value at the current location. WORD stores a 16-bit value starting at the current location. ALLOC allocates `length' bytes starting at the current location.

The value and length operands can only be decimal constants (the value can be possibly signed).
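One possible way to recognize a decimal constant is sketched here. The name parse_decimal is hypothetical, and accepting a leading `+' as well as `-' is an assumption, since the handout only says the value can be signed. Range checks (for example, whether a BYTE value fits in eight bits) are left to the caller.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parses an optionally signed decimal constant.
 * Returns 1 and stores the value in *out on success, 0 otherwise.
 * ASSUMPTION: both '+' and '-' are accepted as a sign. */
int parse_decimal(const char *text, long *out)
{
    const char *p = text;
    long value;

    assert(text != NULL && out != NULL);

    if (*p == '+' || *p == '-')
        p++;
    if (!isdigit((unsigned char)*p))
        return 0;               /* no digit after the optional sign */
    while (isdigit((unsigned char)*p))
        p++;
    if (*p != '\0')
        return 0;               /* trailing non-digit characters */

    errno = 0;
    value = strtol(text, NULL, 10);
    if (errno == ERANGE)
        return 0;               /* value does not fit in a long */

    *out = value;
    return 1;
}
```

When this returns 0 for a BYTE, WORD, or ALLOC operand, the operand is not a decimal constant and the line is in error.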

If a line of the input is in error, call the error routine to format an error message. See the message.c file for details. You are only responsible for detecting one error per line, but you must be capable of detecting multiple errors in a file. You may, if you wish, simply ignore the rest of a line once an error is detected on that line. If an input file has errors, no object code need be output.

If one of the four input parameters to the assemble function does not have a corresponding component on the input line, the NULL pointer will be supplied for that parameter.

See the comments in the ~cs611/public/prog2/sim611/exec.c file for other details of the sim611 machine and the sim611 assembly language. (This directory contains the source code for the sim611 machine simulator, as well as Linux and Alpha executables for the simulator.)

An object file is divided into four sections -- insymbol table, outsymbol table, relocation data, and the object code itself. The four sections appear in the order just listed and each section is preceded by a two-byte integer which describes the length in bytes of the section which follows.
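The length-prefixed layout can be sketched as follows. The function name write_section is hypothetical, and the byte order of the two-byte length (high-order byte first here) is an assumption that is not stated in this handout; confirm it against the sim611 sources before relying on it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes one section as a two-byte length followed
 * by `len` bytes of data. Returns 0 on success, -1 on a write error.
 * ASSUMPTION: the length is written high-order byte first. */
int write_section(FILE *out, const unsigned char *data, unsigned len)
{
    assert(out != NULL && len <= 0xFFFFu);

    if (fputc((int)((len >> 8) & 0xFF), out) == EOF)
        return -1;
    if (fputc((int)(len & 0xFF), out) == EOF)
        return -1;

    /* An empty section is just its zero length; no body follows. */
    if (len > 0 && fwrite(data, 1, len, out) != len)
        return -1;
    return 0;
}
```

Calling a helper like this four times, in the order insymbol table, outsymbol table, relocation data, object code, would produce the overall layout. The file should be opened in binary mode (for example "wb") so that no byte is translated on output.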

The relocation data is simply a series of bits. The relocation bits consist of one bit for each byte of the object code. A bit is set if the corresponding byte of the object code contains the low-order byte of a relocatable address, and the bit is clear otherwise. If bit zero (low-order, right-most, bit) of a byte of relocation data refers to the object code byte `n', then bit one of the same byte of relocation data refers to object code byte `n+1', bit two refers to byte `n+2', and so on.
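The bit numbering can be illustrated with a short sketch. The function names are hypothetical. Two points are assumptions: that bit zero of the first relocation byte covers object code byte 0 (so relocation byte k covers object code bytes 8k through 8k+7), and that the relocation data is rounded up to a whole number of bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Number of relocation-data bytes for code_len object-code bytes:
 * one bit per object-code byte, rounded up to a whole byte (ASSUMPTION). */
size_t reloc_size(size_t code_len)
{
    return (code_len + 7) / 8;
}

/* Marks object-code byte `offset` as holding the low-order byte of a
 * relocatable address. ASSUMPTION: bit 0 of reloc[0] covers byte 0. */
void mark_relocatable(unsigned char *reloc, size_t code_len, size_t offset)
{
    assert(reloc != NULL && offset < code_len);
    reloc[offset / 8] |= (unsigned char)(1u << (offset % 8));
}

/* Returns 1 if the bit for object-code byte `offset` is set, else 0. */
int is_relocatable(const unsigned char *reloc, size_t offset)
{
    assert(reloc != NULL);
    return (reloc[offset / 8] >> (offset % 8)) & 1;
}
```

For example, marking object code bytes 0 and 3 sets the first relocation byte to binary 00001001.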

Insymbols are the label names defined in the source program. Outsymbols are the address references which do not appear as label names somewhere in the source program (i.e., undefined symbols).

An entry in the insymbol table consists of a 6-byte string which is the symbol itself (in all uppercase) and a 2-byte offset of where in the object code that symbol refers (i.e., the address of the symbol).

An entry in the outsymbol table consists of a 6-byte string which is the symbol itself (in all uppercase) and a 2-byte offset into the object code of where that symbol is used (not its address -- it's undefined -- but where its address would go if it were known). This offset is in fact a pointer to the beginning of a linked list of all the uses of this symbol in the object code. The address fields of the references to the outsymbol are used to store the links. An address field containing all ones (in binary) terminates the chain.
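The chain of uses can be sketched as a linked list threaded through the object code. All names below are hypothetical, and the sketch makes several assumptions that this handout does not settle: that an address field is two bytes stored high-order byte first, that the offset in the table entry points at the address field itself, and that each new use is pushed on the front of the chain. Verify each of these against the sim611 sources and the sample solution's output.

```c
#include <assert.h>
#include <stddef.h>

#define CHAIN_END 0xFFFFu   /* an address field of all ones ends the chain */

/* ASSUMPTION: two-byte fields are stored high-order byte first. */
unsigned get16(const unsigned char *code, size_t at)
{
    return ((unsigned)code[at] << 8) | (unsigned)code[at + 1];
}

void put16(unsigned char *code, size_t at, unsigned value)
{
    code[at] = (unsigned char)((value >> 8) & 0xFF);
    code[at + 1] = (unsigned char)(value & 0xFF);
}

/* Records a use of an outsymbol whose address field starts at `field`.
 * *head is the offset kept in the symbol's table entry (CHAIN_END while
 * the symbol has no recorded uses). The new use becomes the chain head
 * and its address field stores the link to the previous head. */
void add_outsymbol_use(unsigned char *code, unsigned *head, size_t field)
{
    assert(code != NULL && head != NULL && field < CHAIN_END);
    put16(code, field, *head);
    *head = (unsigned)field;
}

/* Walks the chain and returns the number of uses on it. */
int count_outsymbol_uses(const unsigned char *code, unsigned head)
{
    int n = 0;

    while (head != CHAIN_END) {
        n++;
        head = get16(code, head);
    }
    return n;
}
```

With this arrangement the first recorded use always carries the terminating all-ones value, and the table entry always names the most recent use.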

Symbols in both the insymbol table and the outsymbol table are stored in all uppercase, left justified and blank filled.
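A small sketch of filling the 6-byte symbol field. The names format_symbol and SYMBOL_LEN are hypothetical. Truncating names longer than six characters is a defensive assumption: the handout states that labels are truncated before the call to assemble, but does not say the same about identifiers used as operands.

```c
#include <assert.h>
#include <string.h>

#define SYMBOL_LEN 6

/* Hypothetical helper: copies `name` into a 6-byte field, left justified
 * and padded on the right with blanks. No terminating NUL is stored.
 * ASSUMPTION: names longer than six characters are truncated. */
void format_symbol(const char *name, char field[SYMBOL_LEN])
{
    size_t len;

    assert(name != NULL && field != NULL);

    len = strlen(name);
    if (len > SYMBOL_LEN)
        len = SYMBOL_LEN;

    memset(field, ' ', SYMBOL_LEN);  /* blank fill first */
    memcpy(field, name, len);        /* then overlay the name */
}
```

For example, the symbol LOOP would occupy the field as the four letters followed by two blanks.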

Nine public test files (assembly language source files) are available in ~cs611/public/prog2/test. Each file tests a different aspect of the assignment and each will be worth 10 points. Most of these files are simply meant to be used to test the assembler and are not intended to be executed by sim611. One hidden test file will be used to test error handling and other items not covered in the public files. This file will also be worth 10 points.

NOTE: Before starting, be sure you understand what a two-byte integer is and be sure you realize that the object code file is not human readable.

Your implementation must be written in C.

Your program will be graded primarily by testing it for correct functionality. However, you may lose points if your program is not properly structured or adequately documented.

Your code should be submitted for grading from a CIS Alpha machine (e.g. cisunix.unh.edu). To turn in this assignment, type:
~cs611/bin/submit prog2 assemble.c

Do not turn in any other files!

Submissions can be checked by typing:
~cs611/bin/scheck prog2

To receive full credit for the assignment, you must turn in your files prior to 8am on Monday February 25. Late submissions will be accepted at a penalty of 5 points per day up to one week late.

Your programs will be graded using a CIS Alpha machine (e.g. cisunix.unh.edu) so be sure to test in that environment.

Remember: as always you are expected to do your own work on this assignment.


Last modified on January 14, 2002.

Comments and questions should be directed to hatcher@unh.edu