Turing Complete offers you a chance to dive into the lowest levels of how computers compute. Assembly is the purest, most precise description of an algorithm you can make - exactly what to do, exactly when to do it, and in exactly what order. That’s too descriptive for most humans, but it’s great for computers. Let’s talk about it!
I tend to use italics for emphasis, and bold to define new terms. I try to mention all common abbreviations.
Use the table of contents to skip around. The page is written to read in-order for newcomers, but relevant information is always in the correct section.
What Is Assembly?
Assembly is a human-palatable interface to a specific computer processor. Every processor has its own distinct assembly language. Knowing “how to program in assembly” means you can probably pick up a different assembly language pretty easily, but it doesn’t mean that you are intimately familiar with the details and specifics of any one in particular. For example, I personally am very familiar with the MIPS IV 32-bit processor, but my knowledge about writing efficient programs on that processor would not get me the whole distance to writing efficient programs on the 64-bit version. Each processor is different.
The reason is because every processor understands a different instruction set. We call this the ISA, or instruction set architecture. A processor looks at instructions, “decodes” them into control signals, and then executes them. The processor sees the instructions in binary form, just ones and zeroes on wires. As humans, it’s very difficult to program this way. It’s error prone, and very difficult to bugfix (see discussion on labels). So we make mnemonics, short names that stand in for the binary representation of an instruction. For example, in the OVERTURE architecture, the instruction 0b00xx xxxx
might be given a mnemonic such as load_immediate
, load_imm
, loadi
, ldi
, or li
(note that assemblers are typically case-insensitive, but currently Turing Complete’s assembler is case-sensitive). This instruction loads the value of the bottom 6 bits, called the immediate value, directly into register 0. In a typical assembler, we would write loadi 55
to represent the bit-pattern 0b0011 0111
. We call loadi
the opcode field, and 55
the operand field. Different instructions may have different numbers of operands; for example the MIPS add
instruction takes 3: the register to store the result, and 2 registers whose values should be added.
Turing Complete’s assembler currently assumes any spaces separate instruction bytes, so we have to combine the different fields using constant operators like |
(bitwise-or). For example, if we have set the loadi
mnemonic in the left pane of the assembly editor, we could write loadi|55
.
Types of ISAs
There are two major classifications of ISAs. a RISC architecture is a Reduced Instruction Set Computer. It tries to offer a consistent binary instruction format (which is easier to build hardware for), at the cost of expressiveness. A RISC architecture, for example, usually has the same number of bytes in every instruction. OVERTURE is a RISC architecture with 1-byte instructions, and LEG is a RISC architecture with 4-byte instructions. OVERTURE pays the price for this by making it impossible to encode immediates values over 63. LEG pays the price by having large numbers of instructions which require 4 bytes but do not use all 4 fields, making it hard to fit programs into the 256 byte space.
The other type of architecture is a CISC architecture, or a Complex Instruction Set Computer. CISCs try to offer expressiveness and convenience at the cost of hardware complexity. Most modern processors are CISC architectures, including the x86 processor that you are most likely using to view this page. A CISC architecture usually has VLIs, or variable length instructions. This can enable simple operations, like addition, to fit in a single byte, while something like loadi
can have a second byte for the immediate, allowing values up to 255. Turing Complete does not currently have levels which build a CISC architecture, but several people have done it in the sandbox. Feel free to ask around in the server.
Different ISAs can have wildly varying sets of instructions. Some RISC architectures have huge numbers of simple instructions, and some CISC architectures have small numbers of very complex instructions. There may be little or no overlap. But that doesn’t mean one architecture can do something that another cannot. Any Turing Complete architecture has exactly the same computational power as any other. If a processor does not have an instruction to do byte-NOT on the value in a register, then the processor cannot perform that operation in a single step. The key to assembly programming is figuring out how to decompose such operations into smaller steps which the processor can perform.
Getting Started
Let’s focus on the OVERTURE architecture, for now. Like almost every assembly language, OVERTURE executes instructions in a straight line, one by one, unless a jump instruction changes where we are in the program. OVERTURE doesn’t have a not
instruction. Let’s examine how we could compute the bitwise complement of the value in register 0.
Recall that all OVERTURE logic operations take their arguments from registers 1 and 2, and put the result in register 3. We don’t have a not
instruction, but we do have and
, or
, nand
, and nor
, as well as add
and sub
.
We could invert a byte by nand
ing it with itself, but for the purposes of demonstration, let’s do something a bit more exotic.
XOR As a Bit Toggle
Remember the level Bit Inverter? We can use a bitwise XOR (written ^
) to selectively invert bits. The operation m ^ w
“toggles” all the bits in w
that match up with the ones in m
, and leave the rest unchanged. We call w
the “word” or “argument” and m
the “mask.” Note that these names are specific to this use case and do not apply to all uses of XOR.
So if we want to completely invert a byte, we want to XOR it with a mask which is all ones. To do that, we’ll need to construct the mask. We can’t simply loadi
, because our mask has the value 255
and immediates must be less than 64.
Multi-Purpose Subtraction
To do this, we must take advantage of the fact that computers do not care where a value came from. They only care what the value is. What we want is a completely logical object, a mask that refers to every bit. But we can get it in a completely arithmetic way - the value is 255
, which in 8-bit arithmetic is the same as -1
. What’s the easiest way to get -1
? 0 - 1
!
We can loadi
those values. But there’s one catch.
Moving Data Around
We said we want to complement the value in register 0. loadi
would overwrite that value, so we need to save it in a different register first. Registers 1, 2, and 3 will be used for logical operations, so we’ll save it in register 4. Turing Complete should have automatically given you the mnemonic 0to4
for “moving” the value in register 0 to register 4. Note that this overwrites register 4, but does not change register 0. The operation is a copy-and-paste, not a cut-and-paste. Despite this, the typical assembly term for this operation is a data move, register move, or simply move. I recommend using the bitwise-or trick discussed above to make mnemonics for each source register and each destination register and then write move|s0|d4
.
Some architectures use the mnemonic mov
instead of move
, and others use load
.
Putting The First Pieces Together
So let’s get our value from register 0 to register 4, and set up for the XOR:
move|s0|d4 loadi|0 move|s0|d1 loadi|1 move|s0|d2 sub
This gets us our mask in register 3. But hold on… we don’t have an XOR instruction, either!
Complex Operations From Simple Ones (XOR)
We can break the XOR operation into smaller steps. Recall that the XOR of a
and b
is a ^ b = (a | b) & ~(a & b)
. In words, the XOR of two bits is 1 if their OR is 1, and their AND is 0. We have OR and NAND operations, so we can do these things. We need to save some intermediate values while we’re working. We’ll use register 0 since it’s free. Notice how we shuffle data around into the correct registers as needed.
move|s4|d1 # the value we saved earlier move|s3|d2 # the mask we just computed or # their bitwise OR move|s3|d0 # save that so it isn't overwritten by... nand # their bitwise NAND move|s0|d1 # prepare to operate on their OR... move|s3|d2 # and their NAND... and # the result of the XOR.
Now we have the bitwise complement of the value that was in register 0. It’s in register 3, but from here we could move it wherever we want.
Control Flow
Most useful programs aren’t a “straight line” of instructions to do and then stop. We usually want to change what we’re doing depending on some input, or repeat some code multiple times. For this, we have jump instructions. Jump instructions let us break out of the normal control flow of “do this instruction, then do the next one in order.” Instead, we get to say “do this instruction, then jump to the 10th instruction and keep going from there” (for example).
OVERTURE jump instructions take the instruction number to jump to from register 0 (so it is easy to loadi
) and a “value” against which they can check a “condition” from register 3 (so that it can be the result of the last arithmetic/logic operation). The Turing Complete assembler lets you write label some_label_name_here
, which makes some_label_name_here
a constant that refers to the index of the following instruction.
OVERTURE has several different jump instructions. In the level Conditionals, you implemented a unit that can check if a number is less than 0, equal to 0, or greater than 0. The jump instructions can act conditionally, jumping only if a specified condition is true of the “value.” The condition always
can be used to make the jump unconditional - it will always jump. A typical unconditional jump mnemonic is jump
, jmp
, or simply j
. You may also see branch
or b
, which is just another word for exactly the same thing. A conditional “Jump if zero” mnemonic could be jump_zero
, jmpzero
, jtz
(jump-true-zero), or simply jz
, where “jump if not zero” might be jfz
or jnz
. Other typical abbreviations are gt
for greater-than, ge
for greater-or-equal, and lt
and le
the same for less than. Some other assembly languages may have even more conditions.
b
instruction which is just j
in disguise.
A Simple Loop (partial Storage Cracker spoilers)
Let’s write a program that starts with 0, and then repeatedly adds 1 and keeps repeating that until the result is 0 again. We store the value 1
, which we need repeatedly, in register 2 outside of the loop, because we don’t need to re-compute it every time. Then we perform the addition, check the value, and, if it’s not zero, move the result to register 1 so that we can keep going.
Notice how we have to go out of our way to move the result of the addition back into register 1 with move|s3|d1
. The loop body, or the part of the loop between the label and the jump back to the label, is written to expect that the values to add are already in registers 1 and 2. This is called a loop invariant. It is fact about the state of the machine that must be true on both entry to and exit from the loop body. It’s very important to think about your loop invariants when writing loops. As the loop body becomes more complicated, failing to maintain the invariants (called violating the invariant) is a major source of bugs. If you document the invariants well, these bugs are easy to catch!
loadi|1 # set up register 2 move|s0|d2 # with the value 1 label loop # the next instruction is the start of the loop add # register 3 = register 1 + register 2 loadi|done # register 0 = address of the instruction after the loop jz # jump out of the loop if the result of addition is 0 move|s3|d1 # register 1 = the result of addition loadi|loop # load the address of the first loop instruction j # and jump to it. label done
That’s not so bad! But we can do a bit better.
A Better Loop Layout
Because our loop has an unconditional jump at the bottom, and a conditional jump in the middle, every iteration of the loop will execute two jump instructions. For OVERTURE, this isn’t a huge deal, but for more complicated architectures this can be very expensive (jumps are typically among the slowest operations a processor can do, because processors are very optimized for “straight-line” code, as conditional jumps might or might not disrupt the flow of the program,and things like speculative execution would not be happy :/ ). What if we could make the conditional branch dual-purpose?
If a conditional jump fails, we always move on to the next instruction in sequence. If that instruction were the instruction after the loop, we wouldn’t need the second branch. Our conditional branch could jump to the top of the loop, and “falling through” to the next instruction could end the loop! The trick is that we have inverted the condition. Instead of jumping if we should stop, we jump if we should continue. This is called the “do-while” loop layout, and it is how almost all loops are written in assembly. It looks like this:
loadi|1 move|s0|d2 label loop add move|s3|d1 loadi|loop jnz label done
And now we have saved one instruction in the loop.
In our case, we didn’t want to check the condition before doing the first addition. But what if we did? We need the condition to be at the bottom of the loop for this to work, right? Indeed, we do. But if we use a second jump instruction to enter the loop, we don’t have to execute the second jump instruction on every iteration of the loop, and that’s still better. Such a loop might look like this:
loadi|check j label loop add move|s3|d1 label check loadi|loop jnz label done
If-Then and If-Then-Else
Sometimes we want to execute a little piece of code only if some condition holds. And maybe we want to execute a different piece of code only if it fails. In “structured” programming languages, these constructs are called if
and if ... else
. Just like for loops, our lives get a little bit simpler if we invert the condition. This is the standard in assembly language programming, so really make sure you get used to that. Demorgan’s Laws can also be quite helpful.
Suppose we want to read an input and output 1 if the number is 10, and zero otherwise. In a structure language, that might look like this:
if (input == 10) {
output 1
}
else {
output 0
}
Let’s translate that into assembly. Firstly, we can’t do == 10
. What we can do is input - 10
. If the result is 0, the input was 10.
move|inp|d1 # read from input to register 1 loadi|10 move|s0|d2 # get 10 into register 2 sub # compare them loadi|else jnz # jump to the else case if input was not 10 label then # fall through to here if the input was 10 loadi|1 move|s0|out loadi|done j label else loadi|0 move|s0|out label done
A deeper if-elseif-else chain is implemented by simply repeating this inside the else
block.