I am currently drawing up milestone plans for the rest of this year for my long-term projects (Thelema, Assiah, and Alfheim). I was hoping that someone would be able to review these plans and help me determine which are feasible in the next four months.

The overall goal, which I project to take at least until the end of next year, is to develop a rudimentary compiler and REPL for my language project, Thelema. I am currently focusing on the lower-level toolchain that will support the language: an assembler, Assiah, and a library for manipulating and linking ELF-format object files, Alfheim. All the preliminary code is to be in R6RS Scheme; later iterations will be self-hosting in a combination of Assiah and Thelema, but that is probably at least two years away.

At this time, my milestone plans are as follows; the items in italics are partially started already:

Development Milestones for Thelema Project for Sept-Dec 2014
Assiah - develop a simple data table format for representing x86 instructions
Assiah - Begin data entry of x86 instructions
Assiah - design basic data structures needed to represent x86 instructions as a stream of opcodes
Assiah - design pattern-matching macros and functions for transforming a line of code into an opcode stream representation
Assiah - review instruction data format for completeness and match it to the opcode format
Assiah - complete data entry of x86 instructions
Assiah - write first pass to count the size of the resulting opcode streams for determining relative positions of labels
Assiah - write second pass for integrating generated opcodes with computed labels
Assiah - produce simple flat binaries and compare to equivalent generated by NASM and/or GAS
Alfheim - basic read and display of x86 ELF files
Alfheim - write a new x86 ELF executable with a known program
Alfheim - write an object file that can be linked by ld(1)
Alfheim - write ld(1) compatible linker and compare results of linking two known object files
Assiah - use Alfheim library to generate x86 ELF object files

What I want to know is, given the state of the code as it is, and the general progress made to date, is this a realistic timetable? I know that no one can give any definitive answers to such a question, but any advice would be appreciated.

There are a lot of ways to write assemblers, I'm sure. However, I've only done it once, so I'm comparing the timeline to how I wrote mine. I could have the wrong idea entirely.

Assiah - develop a simple data table format for representing x86 instructions

This probably shouldn't be more complex than a list of tokens, each with its type (I also included the line number for more useful error messages), since it's only ever read in a few passes.
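
As a rough illustration of such a token list, here is a minimal sketch. It is written in Python purely for readability (the project itself is planned in R6RS Scheme), and the names `Token`, `TOKEN_RE`, and `tokenize`, as well as the handful of token kinds, are hypothetical choices rather than anything from either assembler.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    kind: str  # e.g. "LABEL", "IDENT", "INT", "COMMA"
    text: str  # the original word, kept for error messages
    line: int  # 1-based source line number


# Hypothetical token classes; a real assembler would need more.
# Alternatives are tried in order, so LABEL must come before IDENT.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>;.*)
  | (?P<LABEL>[A-Za-z_.][\w.]*:)
  | (?P<INT>0[xX][0-9A-Fa-f]+|\d+)
  | (?P<IDENT>[A-Za-z_.][\w.]*)
  | (?P<COMMA>,)
  | (?P<SPACE>\s+)
  | (?P<ERROR>.)
""", re.VERBOSE)


def tokenize(source):
    """Return (tokens, errors); comments and whitespace are dropped."""
    tokens, errors = [], []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for m in TOKEN_RE.finditer(line):
            kind = m.lastgroup
            if kind in ("COMMENT", "SPACE"):
                continue
            if kind == "ERROR":
                errors.append(
                    f"line {lineno}: unexpected character {m.group()!r}")
                continue
            tokens.append(Token(kind, m.group(), lineno))
    return tokens, errors
```

Collecting the errors in a list instead of stopping at the first one lets a single run report every unrecognised character, each with its line number.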

Assiah - design basic data structures needed to represent x86 instruction as a stream of opcodes

I'm not sure why you would need to. I just wrote each instruction to the output as soon as I had it. If a second buffer helps with something though, then feel free to.

---

After you've read in the input and tokenized it, and before generating instructions, you should scan your code for the locations of labels and store those in a hashtable. This is so you'll be able to generate an instruction that refers to a label you haven't seen yet (a forward reference). A good portion of this part and the next part was actually dealing with syntax errors and giving meaningful error messages.
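
A minimal sketch of that label scan, again in Python for illustration only. It assumes a hypothetical pre-parsed statement form, either `("label", name, line)` or `("instr", size_in_bytes, line)`, and the function name `collect_labels` and the `origin` parameter are made up for this example.

```python
def collect_labels(statements, origin=0):
    """Map each label name to its address and report duplicate labels.

    `statements` is an assumed intermediate form:
    ("label", name, line) or ("instr", size_in_bytes, line).
    """
    labels, errors = {}, []
    address = origin
    for stmt in statements:
        if stmt[0] == "label":
            _, name, line = stmt
            if name in labels:
                errors.append(f"line {line}: label {name!r} already defined")
            else:
                labels[name] = address
        else:
            _, size, _line = stmt
            address += size  # labels take no space; instructions do
    return labels, errors
```

Because the table is complete before any code is generated, a later pass can look up a label regardless of whether it is defined before or after the instruction that uses it.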

Assiah - design pattern-matching macros and functions for transforming a line of code into an opcode stream representation

This second pass is surprisingly easy: get the type, get the appropriate number of arguments, and pass them all to the appropriate function, which prints the final instruction out. Again, a good portion of this was dedicated to making meaningful error messages.

Assiah - review instruction data format for completeness and match it to the opcode format
Assiah - complete data entry of x86 instructions

I guess I combined these two with the last part. After reading an instruction like mov ax, bx, I recognised that mov takes two arguments and "looked ahead" to learn that both of the arguments were registers. Now I know what I'm dealing with, so I pass the two arguments to a function that handles mov with two register arguments, and it prints out the instruction.
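
To make that dispatch concrete, here is a small Python sketch (illustrative only, not the code from either assembler). It assumes 16-bit mode, so no operand-size prefix is needed, and it uses the `89 /r` form of mov; the `8B /r` form is an equally valid encoding of the same instruction. The names `REG16`, `ENCODERS`, `operand_kind`, and `assemble_line` are hypothetical.

```python
# 16-bit general-purpose registers and their 3-bit encodings.
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3,
         "sp": 4, "bp": 5, "si": 6, "di": 7}


def encode_mov_r16_r16(dst, src):
    """Encode `mov dst, src` for two 16-bit registers (opcode 0x89 /r)."""
    # ModR/M byte: mod=11 (register-direct), reg=source, r/m=destination.
    modrm = 0xC0 | (REG16[src] << 3) | REG16[dst]
    return bytes([0x89, modrm])


# Hypothetical dispatch table keyed by (mnemonic, operand kinds...).
ENCODERS = {("mov", "reg16", "reg16"): encode_mov_r16_r16}


def operand_kind(text):
    """The 'look ahead' step: classify an operand by its spelling."""
    return "reg16" if text.lower() in REG16 else "unknown"


def assemble_line(mnemonic, operands, line):
    key = (mnemonic.lower(), *map(operand_kind, operands))
    try:
        encoder = ENCODERS[key]
    except KeyError:
        raise SyntaxError(
            f"line {line}: no encoding for {mnemonic} {', '.join(operands)}")
    return encoder(*(op.lower() for op in operands))
```

For example, `assemble_line("mov", ["ax", "bx"], 1)` yields the two bytes `89 D8`. Supporting another operand combination then means adding one encoder function and one table entry, which is where most of the x86 data entry would end up.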

I think what you're saying is that you'll read the instruction into some other format that records the opcode. After that, you'll get more information on the format in a later pass and combine it to generate the final opcode. Then on another pass you'll print this out.

Assiah - write first pass to count the size of the resulting opcode streams for determining relative positions of labels

I think it would be much easier to find the locations of the labels before even attempting to process the instructions themselves. That way, when you see an instruction, you can just get its binary representation in one go. Going back to edit binary opcodes seems like it might be a bit complicated.

I guess a template of how I think most assemblers work is this:

Pass 1: Tokenize the input (ignore comments/whitespace, record the original word and the line number, and identify the type, as in INT, COMMA, LABEL, INSTRUCTION-2-ARGS, etc.). This pass prints an error for anything it cannot identify.

Pass 2: Record all of the addresses of all of the labels (yes, you'll need to figure out the length of everything you pass over as well - but it pretty much mirrors pass 3 so you can copy and paste it!)

Pass 3: Go through the list one more time, and print out the final instructions as you go through (using the hash table when the labels are referenced).
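
The three passes above can be sketched end to end for a toy instruction set. This is an illustrative Python sketch under strong assumptions: only `nop` and a short `jmp` are supported, every instruction has a fixed size, and all function names are hypothetical. In real x86 the size of a jump can depend on how far away its target is, which makes pass 2 harder than shown here.

```python
def parse(source):
    """Pass 1: turn each line into (line_no, label, mnemonic, operands)."""
    program = []
    for line_no, raw in enumerate(source.splitlines(), start=1):
        text = raw.split(";", 1)[0].strip()  # drop comments
        label = None
        if ":" in text:
            label, text = (part.strip() for part in text.split(":", 1))
        parts = text.split(None, 1)
        mnemonic = parts[0].lower() if parts else ""
        rest = parts[1] if len(parts) > 1 else ""
        operands = [op.strip() for op in rest.split(",") if op.strip()]
        if label or mnemonic:
            program.append((line_no, label, mnemonic, operands))
    return program


# Toy instruction set: fixed size in bytes of each supported mnemonic.
SIZES = {"nop": 1, "jmp": 2}  # jmp is always a short jump in this sketch


def size_of(line_no, mnemonic):
    if mnemonic not in SIZES:
        raise SyntaxError(f"line {line_no}: unknown instruction {mnemonic!r}")
    return SIZES[mnemonic]


def collect_labels(program):
    """Pass 2: track the current address and record where each label is."""
    labels, address = {}, 0
    for line_no, label, mnemonic, _ in program:
        if label:
            labels[label] = address
        if mnemonic:
            address += size_of(line_no, mnemonic)
    return labels


def emit(program, labels):
    """Pass 3: generate the bytes, looking labels up in the table."""
    out = bytearray()
    for line_no, _, mnemonic, operands in program:
        if not mnemonic:
            continue  # a label on a line of its own
        if mnemonic == "nop":
            out.append(0x90)
        elif mnemonic == "jmp":
            if len(operands) != 1 or operands[0] not in labels:
                raise SyntaxError(f"line {line_no}: jmp needs one known label")
            # rel8 is measured from the end of the 2-byte instruction.
            rel = labels[operands[0]] - (len(out) + 2)
            if not -128 <= rel <= 127:
                raise SyntaxError(f"line {line_no}: jump target out of range")
            out += bytes([0xEB, rel & 0xFF])
        else:
            raise SyntaxError(f"line {line_no}: unknown instruction {mnemonic!r}")
    return bytes(out)


def assemble(source):
    program = parse(source)
    return emit(program, collect_labels(program))
```

Assembling a `nop`, a forward `jmp end`, another `nop`, and a backward `jmp start` at `end:` gives `90 EB 01 90 EB FA`. The forward jump is resolved without any patching of already-emitted bytes, because the label table was filled in before pass 3 started.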
