Beginner-Level Assembly Language for Hackers, Programmers, and Security Researchers

As promised, I am back with Assembly Language for you (BONUS: a basic GDB tutorial).

NOTE before reading: if anything confuses you along the way, I recommend reading to the end, because some concepts only become clear there.

 

Before we begin, thanks to you all and especially all my supporters.

THANKS:

  • VansOP
  • Thouna Kh(youtube)
  • Jamir
  • Shameer
You can also buy me a coffee -- HERE

 

* Who is recommended to read this post?

    This blog is for anyone who wants to start on security research or hacking.

 

Difficulty level: Beginner.

  If you are a complete beginner in hacking and security research then I recommend you to first read this_blog_post.

 

 

JUST A RECAP :

* What is Assembly Language?

I will steal some definitions from Wikipedia :)

 

"In computer programming, assembly language (or assembler language), often abbreviated asm, is any low-level programming language in which there is a very strong correspondence between the instructions in the language and the architecture's machine code instructions. Because assembly depends on the machine code instructions, every assembly language is designed for exactly one specific computer architecture. Assembly language may also be called symbolic machine code"

 

* Why do we need to learn asm as a Security researcher or a hacker?

    Hacking has always been about understanding the internals of a thing and manipulating how it works. So, understanding the deep workings of what you are trying to hack is critical.

Further on in this blog, we will try attacking a piece of software. We will focus on attacking Linux executables, i.e. ELF files. ELF is to Linux what exe is to Windows. Learning asm will give you the power to better understand a program's flow and its internal workings, which will enable you to exploit the program by finding its flaws.

 

One more important use of asm that I want to mention:

 

Suppose you find a flaw in a program. NO......Let's make this more interesting (criminal style). [Just a rough example scenario]

Suppose you find a flaw in a BIG Bank server and now you want to run a piece of code on that server's process that will transfer $1000 per day to your account (I don't know why you want to do this :-)

 

In what language do you have to write the code to inject on the bank server's process?


Well, injecting code into a process may sound like injecting your C source code into the server. Hell NO! That's not how you do it. Ok, I will explain.

Since the running bank server program (like an exe or ELF file) is in binary format, you will also have to inject binary code.

Wait...wait! You don't need to write binary code by hand (the confusing 0's and 1's).

ASM solves this code injection problem, because asm sits just above the binary layer: each asm instruction corresponds almost one-to-one to a machine code instruction of a specific CPU architecture.

So,

Now you write your code in asm, convert it to binary with an assembler, and FINALLY inject it to get $1000 per day in your account ;)

Hey Mr. Rich, the code you just injected is called shellcode in more technical terms. Let's discuss shellcode in more detail some other day.

 

Enough of talking. Now, let's go to the real technical ASM(Assembly language).

 

 

* What will you learn here?

    I am teaching you the assembly language of the 32-bit x86 processor. Don't worry, learning the 64-bit version will be easy once you know the 32-bit version.


To begin, let's learn about a very important concept of asm: registers. The x86 processor has several built-in variables called registers. Just think of registers as internal variables for the processor. There are several of them and I will explain each briefly. Hey! If you want to learn seriously, then follow along with me. You always learn more from practice than from theory.

 

You need to have a Linux system like Ubuntu or Kali Linux (GDB and GCC are preinstalled on many Linux systems; if not, they are available from the package manager). The GNU development tools include a debugger called GDB. Debuggers allow a programmer to view a compiled program's memory, the processor's registers, the program's asm instructions, etc.

Above all, a debugger allows us to view the execution flow of a program from different angles, pause it, and change anything along the way.

 

 


assembly language registers

The above figure is the listing of all registers in the x86 processor. We will also learn GDB while learning asm. 

 

EAX -- Accumulator 

ECX -- Counter

EDX -- Data

EBX -- Base

The above four registers are general-purpose registers. They are used for various purposes, but they mainly act as temporary variables. Don't be afraid of the names; they are just variables the CPU uses while executing machine instructions.

 

ESP -- Stack pointer

EBP -- Base pointer

ESI -- Source index

EDI -- Destination index

The first two registers are called pointers because they point somewhere in memory (or simply, they store a 32-bit address). Unlike the other registers, EBP and ESP have a fairly important role in program execution and memory management. As the names suggest, ESI and EDI are traditionally used to point to the source and the destination when data is copied from one location to another. For now, you can just think of them as general-purpose registers.

 

EIP -- Instruction pointer register

The EIP points to the current instruction the processor is reading. Naturally, this register is quite important and we will use it a lot while debugging. Like a child pointing a finger at each word while reading, the processor reads each instruction using the EIP register as its finger.

 

The EFLAGS register consists of bit flags that record the results of comparisons and arithmetic operations. You will understand this more later. Below EFLAGS in the figure are the segment registers (such as CS, DS and SS). The actual memory space is split into several different segments, which will be discussed later, and these registers keep track of the segments. For the most part, these registers can be ignored since they rarely need to be accessed directly.


 

* Assembly Instructions :

    Let's talk a bit about its syntax before we actually try to understand the instructions. There are two popular syntaxes for writing assembly instructions. Those are:

1. INTEL syntax

2. AT&T syntax 

 

Assembly syntax

 

You can identify AT&T syntax easily by the "%" and "$" signs in the instructions. Note that AT&T syntax also writes the source operand first and the destination second, the opposite of Intel syntax. As you can see, Intel syntax is much more readable, but there are pros and cons to everything, so find them out and choose the syntax you prefer (or learn both - the best option). We will mostly use Intel syntax on this blog because I found this on the Internet


 

 

 

 

INTEL syntax:

operation <destination> , <source>

 

The "operation" is usually an easily understandable mnemonic: "mov" moves a value from the source to the destination, "sub" subtracts, "inc" increments, and so forth. I will explain all the necessary operations later when we use them. The "destination" and "source" operands are each a register, a memory address, or (for the source only) an immediate value.

 

For example:

The two instructions "add esp, 0x10" and "mov eax, 0x0" add the value 0x10 (16 in decimal) to esp and move 0x0 into the eax register.



No programming language is complete without control flow statements. In asm, the "cmp" operation compares two values. That's it? Just compare? Wait... the operation that follows the compare decides what to do with the result. Any operation that starts with "j" is used to jump to a different part of the program, and the conditional jumps are the ones that usually follow a "cmp". An example will best illustrate this.

 

 

The instructions above ("cmp DWORD PTR [ebp-0xc],0xa" followed by "jle 0x565561bf") compare a DWORD (4-byte) value stored at the address ebp-0xc with 0xa (10 in decimal). After the comparison, the "jle" operation, short for "jump if less than or equal", means: if the value at ebp-0xc is less than or equal to 10, jump to the address 0x565561bf; otherwise, continue with the next instruction.


 

Confused!! by all these registers and operations?? Don't worry everything will make sense after a short walkthrough of gdb and an example program.


Let's write a simple C program that just prints "life?" seven times.

----------

#include <stdio.h>

int main(void)
{
    int i;

    /* Print "life?" seven times: i runs from 0 to 6. */
    for (i = 0; i < 7; i++)
    {
        printf("life?");
    }

    return 0;
}

---------- 

 

Create a file called "asm_tut.c" and enter the above code. Now let's compile it using the command "gcc -g -m32 asm_tut.c -o asm_tut". The "-g" flag tells the GCC compiler to include extra debugging information, which gives GDB access to the source code. The "-m32" flag compiles our program in 32-bit mode, since I have a 64-bit system (you don't need this flag if you are running a 32-bit environment; on a 64-bit Debian-based system you may first need to install the gcc-multilib package). After compilation, it's time to fire up GDB to debug our program. Most of your confusion about asm will clear up after we debug our simple program.

 

 

 

starting GDB

 

The "-q" (quiet mode) flag in the GDB command launches GDB without its welcome banner. Don't include the "--nx" flag in your command (it just tells GDB not to read any of its .gdbinit files; ignore it for now, you will understand it in time). Ohh, I almost forgot: GDB's default asm syntax is AT&T, but we like to use Intel's. To change GDB's default asm syntax to Intel, type the following command in your terminal (not inside GDB): echo "set disassembly-flavor intel" > ~/.gdbinit . Be aware that ">" overwrites an existing ~/.gdbinit; use ">>" instead if you already have one and want to append to it.

 

 

 


The "list" command displays the source code of our program. To see the assembly code of our main function, just type in "disassemble main".

 

 



The above is the assembly code of our main(). Notice the white-colored area at the beginning: that is the function prologue. The compiler includes it to set up memory for our code to run (like setting up space for our variables; we will talk about this more later). Now let's run our code and inspect the registers and operations while it's running.

 

 

 


First, I set a breakpoint at the beginning of our main function and run the program. After the "run" command, we land at our breakpoint (execution is paused there). Now we do "info registers" to display all our registers and their current values. Since EIP points to the next instruction the CPU will execute, let's examine what's in our EIP register.




 

That's how you examine a register. Ok..ok I will explain...

In GDB, to examine a register or a memory address we use the command "x". But there is more we can do than just doing "x $eip". Now let's discuss the above figure. Let's start with 

"x/i $eip"

The "i" after "x/" tells the examine command to display what it finds as instructions, that is, to disassemble the machine code at that location into human-readable assembly (it can also display memory in many other formats, discussed below). Since EIP is a register, and GDB treats registers like variables, we have to prefix it with the "$" symbol to examine it. "x/5i $eip" displays five instructions instead of one.


Some more common formatting letters:

Format letter -- Definition -- Example

o -- display in octal -- x/o 0x56435445

x -- display in hexadecimal -- x/x 0x56435445

u -- display in unsigned, standard base 10 decimal -- x/u 0x56435445

t -- display in binary -- x/t 0x56435445

 

 

 



In the above figure "x/3x $eip" prints out three units in hex. The default size of a single unit is four bytes, which GDB calls a word (this is the same size that Intel syntax calls a DWORD). We can change the size of the unit by adding a size letter after the format letter.

Size letters -- definition

b -- A single byte

h -- A halfword, which is two bytes in size

w -- A word, which is four bytes in size

g -- A giant, which is eight bytes in size

 

 

 

 

If you pay a little more attention to the above figure, I am sure a question will arise about the order of the bytes up there. Still not getting it?? It's ok...

When we first run the examine command with single-byte units, we can see that the first and second bytes are 0xeb and 0x16. But when we print the same memory in halfword format, it displays 0x16eb and not 0xeb16. The same byte-reversed effect is seen when we print a full 4-byte word: it shows 0xec8316eb, while the single bytes are 0xeb, 0x16, 0x83, 0xec.

All of this is because, on the x86 processor, multi-byte values are stored in little-endian byte order, which means that the least significant byte is stored first (at the lowest address). So, if four bytes are to be interpreted as a single value, the bytes must be read in reverse order. Take this as an example: the bytes 0xef, 0xbe, 0xad, 0xde in memory form the value 0xdeadbeef. Still confused??

To clear up your confusion let's convert some hex into decimal.

 

 



The four bytes 0xc7, 0x45, 0xf4, 0x00 are displayed as a full four-byte word in both hexadecimal and standard unsigned decimal notation (0x00f445c7 and 16008647 respectively). When we use a command-line calculator called bc to convert the hexadecimal bytes into decimal without reversing them (C745F400), we get a horribly wrong, much larger number (3343250432). But when we reverse the order of the bytes (00F445C7), we get 16008647, which matches the value GDB displayed. So, now I hope you understand that multi-byte values are stored in little-endian order (reversed) on the x86 processor. You don't need to worry about this most of the time, as most debugging tools and compilers take care of it, but there will be times when you need to play around with those little-endian bytes manually.

 

 

 

* Explanation of our asm_tut.c main() function in Assembly language.

    Now that you have learned the core concepts required to understand asm code, we can move on to the asm code of our asm_tut.c main() function.

 

 

 

The shaded portion of the above code is our actual main() function's code. Well, then what is the rest? The beginning portion of the code is known as the function prologue and the ending portion as the function epilogue. They are added by the compiler to arrange memory for the variables in our program. They have other vital uses, which we will discuss later.





In the above code, I set a breakpoint at our main() function and run the program. After hitting the breakpoint, I print the instructions at EIP (where our execution currently is). Those are our main() function's instructions (without the function prologue and epilogue).

Check out the first instruction 

* mov DWORD PTR [ebp-0xc], 0x0 , this moves zero (0x0) into the four bytes at the address ebp-0xc (EBP is used to reference local variables in our program). Can you guess the variable? Yes, it's the i variable of our for loop. The loop starts by setting i to zero.

* jmp 0x565561d5 , this unconditionally jumps our execution flow to 0x565561d5, where the loop condition is checked.

* cmp DWORD PTR [ebp-0xc],0x6 (we are now at 0x565561d5), here we compare our i variable with 0x6. This is the for loop's condition, compiled as i <= 6 (for an integer, this is the same as i < 7).

* jle 0x565561bf , if our i variable is less than or equal to 6 then jump to 0x565561bf, the start of the loop body. Right now i is 0, so we jump to 0x565561bf.

 

* sub   esp,0xc
  lea   eax,[ebx-0x1ff8]
  push  eax
  call   0x56556030 <printf@plt>
  add   esp,0x10

The above instructions collectively perform our printf("life?") function call. I will explain them briefly, and it's totally fine if you don't fully understand them yet (your doubts will be cleared up after we learn about the stack and calling conventions, which I will explain in the next post).

First, sub esp,0xc moves the stack pointer down by 12 bytes. This is padding rather than storage for the argument: together with the 4-byte push that follows, it keeps the stack 16-byte aligned at the call, which GCC normally maintains. Then lea (load effective address) loads the address of our string ("life?") into the eax register. Now, eax holds a pointer to our string "life?". How do I know? One way to check is to examine eax as a string with "x/s $eax" right after the lea instruction has executed.


push eax pushes the pointer to our string onto the stack (in 32-bit code like this, arguments are pushed onto the stack before the function is called). Then printf is called, and add esp,0x10 removes those 16 bytes (12 bytes of padding plus the 4-byte argument) so the stack is back where it was before the call.

After printf call :

 

* add DWORD PTR [ebp-0xc],0x1 , this basically adds 1 to our i variable and after this, continues the loop of comparing - printing - adding until the i variable is greater than 6.


I think it's a good time to end this post. I am tired.

I hope you learned something about asm. At least the abbreviation of assembly :)

If you have any questions about this topic or anything else on this blog, feel free to leave a comment. I will try to answer every question I can.

I am human and I make mistakes, so please let me know if you find any.
