A few notes on x86-64 assembly language

Links: x86 instruction list. Linux x86-64 syscall list, and a nicer one.

The GNU Assembler (gas, but its executable name is simply as) isn't part of GCC; it's a part of the GNU binutils. Therefore its documentation is also part of the binutils docs.

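The main differences between the AT&T syntax used by the GNU assembler and the Intel syntax that's used by (among others) NASM, and by most books and documentation on x86 assembly, are: operand order (AT&T puts the source first and the destination last), the % prefix on registers and the $ prefix on immediate values, the operand-size suffix on mnemonics (movq, movl), and the displacement(base,index,scale) form for memory operands.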
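As a rough illustration of these differences, here is a small hand-written program in AT&T syntax, with the Intel-syntax (NASM-style) equivalent of each instruction in a comment. The nums label and its values are made up for the example, and it assumes the same Linux build steps used later in these notes (as, then ld).

```asm
	.data
nums:	.quad 10, 20, 30           /* hypothetical example data */

	.text
	.global _start
_start:
	movq $2, %rcx              /* Intel: mov rcx, 2 */
	movq $nums, %rbx           /* Intel: mov rbx, nums */
	movq (%rbx,%rcx,8), %rdi   /* Intel: mov rdi, [rbx + rcx*8] */
	addq 8(%rbx), %rdi         /* Intel: add rdi, [rbx + 8] */
	movl $60, %eax             /* Intel: mov eax, 60 (syscall 60 = exit) */
	syscall                    /* exit status is 30 + 20 = 50 */
```

Once assembled and linked, it should exit with status 50, which you can check with echo $? in the shell.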

One way to learn AT&T-style, GNU Assembler-compatible assembly is to write C programs and almost compile them with GCC: if you give GCC the -S option (for example, gcc -S foo.c), it will compile your code to assembly language, but it won't assemble and link it for you. Each C file is compiled into a file with an .s filename extension.

Example programs

Statically allocating memory and predefined variables

Input C program, "foo.c":

char buf[70];
int buflen = sizeof(buf);
int main(int argc, char** argv) {
	return buflen;
}

This was compiled into "foo.s":

	.file	"foo.c"
	.text
	.globl	buf
	.bss
	.align 32
	.type	buf, @object
	.size	buf, 70
buf:
	.zero	70
	.globl	buflen
	.data
	.align 4
	.type	buflen, @object
	.size	buflen, 4
buflen:
	.long	70
	.text
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	movl	%edi, -4(%rbp)
	movq	%rsi, -16(%rbp)
	movl	buflen(%rip), %eax
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.ident	"GCC: (GNU) 13.2.1 20230801"
	.section	.note.GNU-stack,"",@progbits

The statically allocated but uninitialized memory is put in the .bss section. Statically allocated, initialized memory, which here is just one 32-bit .long, is put in the .data section. Actual program code goes in the .text section. (Each section is started by the directive of the same name: .bss, .data, .text.) As you can see, the code weaves in and out of the sections; unlike in some NASM examples I've seen, it is not necessary for each section's contents to be contiguous in the source.
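To see that weaving in a hand-written file, here is a minimal sketch that switches from .data to .text, back to .data, and back to .text again. The labels first and second are invented for the example. The assembler appends each piece to its section in order of appearance, so within this one file the two strings should end up adjacent in .data and the two pieces of code adjacent in .text.

```asm
	.data
first:	.ascii "one\n"

	.text
	.global _start
_start:
	movq $1, %rax      /* syscall 1 = write */
	movq $1, %rdi      /* file descriptor 1 = stdout */
	movq $first, %rsi  /* start of the first string */
	movq $8, %rdx      /* 4 + 4 bytes: both strings, adjacent in .data */
	syscall

	.data              /* back to .data: appended right after "one\n" */
second:	.ascii "two\n"

	.text              /* back to .text: execution continues here */
	movq $60, %rax     /* syscall 60 = exit */
	movq $0, %rdi      /* return value */
	syscall
```

A single write of 8 bytes starting at first prints both lines, which only works because the second .data piece was placed directly after the first.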

We also see here a use of the dot on its own, as a label referring to the current position (equivalent to NASM's $). On the third line from the bottom, .size main, .-main, the size of the object referred to by the symbol main (corresponding to the int main(int, char**) C function) is calculated by subtracting its start position from the current position.
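The same trick works in hand-written code. The sketch below mirrors foo.c by hand, but lets the assembler compute the buffer size from the dot instead of hard-coding 70. The bufsize name is an assumption of this example, not something GCC emits.

```asm
	.bss
buf:	.zero 70
	bufsize = . - buf        /* 70: current position minus start of buf */

	.data
buflen:	.long bufsize            /* same value GCC emitted as .long 70 */

	.text
	.global _start
_start:
	movl buflen(%rip), %edi  /* exit status = buflen, like main's return value */
	movl $60, %eax           /* syscall 60 = exit */
	syscall
	.size _start, . - _start /* same idiom as .size main, .-main */
```

Assembled with as and linked with ld, this should exit with status 70. Changing the .zero count changes the exit status without touching anything else.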

64-bit AT&T syntax Linux hello, world


	.data
	.align 4
msg:	.ascii "hello "
	.byte 0xf0, 0x9f, 0xa4, 0x96 /* u+01f916 robot face */
	.byte 10 /* newline */
	/* msglen isn't part of the assembled .data;
	   it's known only to the assembler */
	msglen = . - msg

	.text
	/* .global = make a symbol visible to the linker */
	.global _start
_start:
	/* preparing a write syscall */
	movq $1, %rax     /* syscall 1 = write */
	movq $1, %rdi     /* file descriptor 1 = stdout */
	movq $msg, %rsi    /* pointer to data */
	movq $msglen, %rdx /* length of data */
	syscall

	/* preparing an exit syscall */
	movq $60, %rax /* syscall 60 = exit */
	movq $0, %rdi  /* return value */
	syscall

This was hand-written, not compiled. The message contains a robot emoji from a Unicode astral plane; it's manually encoded as a series of UTF-8 bytes, but it could also have been written inside the string using octal escapes. I don't think the GNU assembler has any directives for directly including UTF-8 strings in the source code.
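For comparison, here is a sketch of the octal variant. It's meant as a replacement for the msg definition in the program above, and relies on the \ddd (three octal digits) and \n escapes that the GNU assembler accepts in strings.

```asm
	.data
msg:	.ascii "hello \360\237\244\226\n"   /* octal for 0xf0 0x9f 0xa4 0x96, then newline */
	msglen = . - msg                    /* 6 + 4 + 1 = 11 bytes */
```

The assembled bytes should be identical to the .ascii plus .byte version; only the source notation differs.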

Assemble this with as hello.s; it'll produce a.out, which despite its name is an unlinked object file at this point. You also need to manually link it: ld -o hello a.out. Then run it like normal, ./hello.

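The dollar sign really is critical in all those movs: it marks the operand as an immediate value. It's also required when passing the address of msg (a pointer) to the write syscall. Without it, the operand is treated as a memory location to read from: movq 1, %rax tries to load from address 1, and movq msg, %rsi loads the first eight bytes of the message rather than its address. Either way you'll move some garbage data and probably cause a segfault.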
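A minimal sketch of the difference, using a made-up val label holding the number 42. It uses the exit status to show which of the two forms actually reads memory.

```asm
	.data
val:	.quad 42          /* hypothetical example value */

	.text
	.global _start
_start:
	movq $val, %rsi   /* with $: %rsi = the address of val (a pointer) */
	movq val, %rdi    /* without $: %rdi = the 8 bytes stored at val = 42 */
	movq $60, %rax    /* with $: %rax = the number 60 (exit) */
	/* movq 60, %rax would instead try to read memory at address 60 and segfault */
	syscall           /* exit status 42 */
```

Built with as and ld as described above, this should exit with status 42, which comes from the contents of val, not from its address.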

created 2023-12-03, last edited 2023-12-03.