Hi,

I'm trying to construct a compiler for a really simple stack based language to learn how my computer really works, and to learn how to create a compiler. So far, I have never done anything in assembly (only higher level languages). Could you give me advice about the following things:

- If I create this compiler, is it easier/better to compile to assembly, and then let an assembler translate that to machine code, or compile directly to machine code?

- In the assembly tutorials I skimmed so far, there seems to be only one stack. There probably need to be two stacks: one callstack and one stack for data. How can this be done?

- Where do I find information about machine code? I'm looking for the mapping between instructions and binary code. So information that says: 10101001 is the code for push on stack, for
example. (but this isn't necessary if the compiler will compile to assembly)

- This language will only deal with numbers, not strings or other data structures for now. But I do need a way to output (print) the number on top of the stack, this is os-specific, I assume?

- How does an assembler create a windows .exe? Does it have to add extra information and instructions besides the assembled code? Can my compiler create such an exe file, or is it better to use a separate program for that?

- Are there any good online resources on the subject? If there are, where do I find them?

Thanks for your help & time,

Jules

- If I create this compiler, is it easier/better to compile to assembly, and then let an assembler translate that to machine code, or compile directly to machine code?

It's easier to delegate the nitty gritty machine code stuff to an assembler. If you're writing a simple compiler then that's the way you want to go. If you're writing a commercial quality compiler then it's probably better to build the machine code yourself. However, if this is a learning project for how the computer works, you can learn a great deal more by building the output files manually.

- In the assembly tutorials I skimmed so far, there seems to be only one stack. There probably need to be two stacks: one callstack and one stack for data. How can this be done?

There's no reason to use two stacks when one will do. The "callstack" and "stack for data" are one in the same, and while it's easy to corrupt the stack in assembly by not respecting this duality, a compiler abstracts that problem away for high level programmers.

For a stack based language, a call stack and a data stack is a good idea, but that's a layer above the actual assembly level. You might want to look at the Forth language for ideas.

- Where do I find information about machine code? I'm looking for the mapping between instructions and binary code. So information that says: 10101001 is the code for push on stack, for
example. (but this isn't necessary if the compiler will compile to assembly)

Look for manuals about the processor you're writing the compiler for. For example, you can find opcode information for the x86 architecture at www.intel.com.

- This language will only deal with numbers, not strings or other data structures for now. But I do need a way to output (print) the number on top of the stack, this is os-specific, I assume?

It's language specific. You can do it however you want, and the implementation depends on how you write your compiler.

- Are there any good online resources on the subject? If there are, where do I find them?

None that I know of. Though you can search these forums for "Tama" and find a very early draft of a stack based scripting language. How much that will help, I have no idea, but it might be useful.

Thanks for the quick response!

It's easier to delegate the nitty gritty machine code stuff to an assembler. If you're writing a simple compiler then that's the way you want to go. If you're writing a commercial quality compiler then it's probably better to build the machine code yourself. However, if this is a learning project for how the computer works, you can learn a great deal more by building the output files manually.

Is there a 1-to-1 relationship between assembly and machine code? (so it is pretty easy to translate once you've parsed it?) I think I'm going to compile to assembly first, and after that directly to machine code. Does an assembler add extra things to the machine code it generates, or is the generated machine code executable without modifications? Is it possible to save it in a .exe, and execute it under windows?

There's no reason to use two stacks when one will do. The "callstack" and "stack for data" are one in the same, and while it's easy to corrupt the stack in assembly by not respecting this duality, a compiler abstracts that problem away for high level programmers.

Do you mean that there is only one "real" stack, but it looks like two? How can mix the calls stack and data stack? If someone evaluates a procedure that pops values off the stack, how can I be sure it doesn't pop return addresses?

For a stack based language, a call stack and a data stack is a good idea, but that's a layer above the actual assembly level. You might want to look at the Forth language for ideas.

Thanks, I've heard of Forth, but I don't know the language. I'll certainly look for interesting things in Forth.

Look for manuals about the processor you're writing the compiler for. For example, you can find opcode information for the x86 architecture at www.intel.com.

Thanks!

None that I know of. Though you can search these forums for "Tama" and find a very early draft of a stack based scripting language. How much that will help, I have no idea, but it might be useful.

Tama is an interpreted language? Looks good, but it's very minimal as you say (no control structures so far?). It would be turing complete if you add first class named blocks, though.

I already have an interpreter (written in Ruby, +- 100 loc):

Simple Stack Based Programming Language
> swap-pop: swap pop
ok
> choose: 2 top 'pop 'swap-pop ifel
ok
> false 1 2 choose .
2
> true 1 2 choose .
1
> double: 2 *
ok
> square: dup *
ok
> f: 'double 'square ifel
ok
> 3 true f .
6
> 3 false f .
9
> 2 3 4
> stack
4, 3, 2
> clear
> stack

>

The 'proc syntax to quote procedures isn't nice, better would be {pop} {swap pop} ifel, (so inline first-class procs) but this requires more parsing, and I'll do that later.

It's turing complete, you can use recursive procedures to create loops, but this doesn't work well because the language doesn't have proper tail calls yet. This is hard to implement in Ruby, but it's trivial with a compiler (just pop the callstack *before* calling the last procedure in a procedure).

>Is there a 1-to-1 relationship between assembly and machine code?
It depends on whom you ask. Traditional assembly was indeed 1-to-1. Assembly was defined as human readable machine code. However, with the advent of macro facilities, the 1-to-1 source code correspondance with machine code went away. So "technically", when an assembler encodes the machine code, 1-to-1 is there because the macro parser has already run, but to an assembly programmer, 1-to-1 is not there unless you stay strictly low level and don't take full advantage of the assembler's feature set.

>Does an assembler add extra things to the machine code it generates,
>or is the generated machine code executable without modifications?
It depends on how you assemble the machine code. If you assemble to an object file, it's not executable without a linking step to resolve library calls. If you compile to an executable file, then yes, it is executable immediately after the assembly process.

>Is it possible to save it in a .exe, and execute it under windows?
Yes, and if you have an assembler that writes directly to a PE, or if you do it yourself with your compiler, the output file can be immediately executed. There's no magic here; an executable file is just a file format that Windows recognizes as executable. If you know how the format is structured (and the specification is available online), you can create it and produce a .exe file that actually works.

>Do you mean that there is only one "real" stack, but it looks like two?
At the assembly level, there's only one stack, and it only looks like one. :) You can create as many stacks as you want, but only one is directly supported by the architecture (referring to x86 assembly).

>If someone evaluates a procedure that pops values off the stack, how
>can I be sure it doesn't pop return addresses?
Don't give them absolute control over the stack. Since you're not writing an assembly language, you can abstract away their ability to directly control the stack. This gives you plently of wiggle room in forcing them to never encounter that situation.

Alternatively, rather than letting them use your low level stack, you provide a separate stack data structure that they can't easily corrupt.

Don't give them absolute control over the stack. Since you're not writing an assembly language, you can abstract away their ability to directly control the stack. This gives you plently of wiggle room in forcing them to never encounter that situation.

Alternatively, rather than letting them use your low level stack, you provide a separate stack data structure that they can't easily corrupt.

OK, I'd like to start with the easy way.

If this code will be compiled:

b: 3 +

(define a function b, which pushes 3 on the stack, and calls +)

It could be something like this:

label b
  push(3)
  push(:b1) ; push a return location
  goto(:+)
  label b1
  pop ; pop :b1
  goto(pop)

This wil not work, of course. pop will not pop :b1, but the result of +. Is there a simple way to avoid this? The compiler cannot know how many values a procedure will pop, because you can define things like:

c: if pop = 1 then pop else pop pop end

I have no idea how to solve this :o

Mark the address of the return value. If the stack pointer ever drops to that address, any further user pops are an error. But this is what I meant with a separate user stack. That way you can easily check whether there would be underflow and act accordingly without risking your own call stack.

Narue did a good job with the answers. On thing I wish to mention is that assemblers are processor specific and generally similar for all the products of one processor manufacturer. Thus you would find many similarities among intel processor assembly languages and the same would go for the assemblers of all motorola products. Each assembler has it's own traits because each processor is different. Among intel assemblers you may find some commands available on only a few processors because these features are not available on most of the processors. In general, however, you would find a large selection of commands that were available on all of their products. For assembly language information you want the processor manufacturers web site. There are a number of intel in disquise processors. Many smaller processor manufacturers use intel assemblers, which leads me to believe they are actually intel.

This article has been dead for over six months. Start a new discussion instead.