Introduction And Writing Your First Shellcode

    First of all, I am sorry for the annoying ads. Now, reboot your mind and grab some popcorn; it's not going to be a short article.

Note: [Explicit Content] This article is not for infants.

 

* Introduction:

It's been a while since the last dedicated article for hackers or security researchers (whatever you call them).

As some of you guys requested, today we will be covering "Shellcode", which, as time will prove, is an extremely handy tool in your toolbox.

Let's start with some prerequisites and recommendations.


* Difficulty: Intermediate (Don't worry, just follow me; it will be super interesting and easy)

 

* Who is recommended to read this?

  • Anyone who is trying hard to remove their "Script Kiddie" title.
  • This article is dedicated to all hackers (white hat, black hat, pink hat, yellow hat, green hat).


For some of you, the below prerequisite will be a 'GOOD BYE'. But hey, don't worry, I will help you with the prerequisite too.

* Prerequisite:

  • Basic knowledge of Assembly Language. (click_here_to_learn)
  • Basic C programming language. ;-')
  • Basic English. (Hi, Can you read this? If yes: you can skip this)
  • Basic memory management concepts like Stack and Stack Frames. (learn_more)
  • Strong History and Chemistry knowledge :-')



* The WHAT, WHY, and HOW... 

Shellcode - in a nutshell:

Shellcode, also sometimes referred to as an exploit payload, is a (usually) small, self-contained program that does the real work once a program has been hacked.

"Gaining control of your target is just the beginning. The real fun begins after that" -- MeAndMyQwerty ;-')


Shellcode usually spawns a shell, as that is an elegant way to hand off control; but it can do anything a normal program can do.

Unfortunately, for many hackers, the shellcode story stops at copying and pasting bytes. These hackers are just scratching the surface of what's possible. Custom shellcode gives you absolute control over the exploited program. Perhaps you want your shellcode to add an admin account to /etc/passwd or to automatically remove lines from log files. Once you know how to write your own shellcode, your exploits are limited only by your imagination. In addition, writing shellcode develops assembly language skills and employs a number of hacking techniques worth knowing.



* Choosing the right language (Assembly vs C)

We are talking about writing shellcode, right? But shellcode is actually architecture-specific machine code (the very easy and poor language made up of only 0s and 1s).

As machine language is too easy for us, we will write our shellcode in assembly language and convert it into raw native code (machine code) using an assembler.


Why Assembly Language and not C?

It is because it starts with the letter "A" and "A" reminds me of "Apple" and I like eating Apple.

But if we write our shellcode in C, for example, we need to compile the C code to produce machine code (which is architecture dependent). In that process, the compiler will do some magic to our code and add many bytes that are unnecessary from a shellcoding perspective to the final machine code. You will understand what those unnecessary bytes are later in this article.

Since our goal is to produce optimized, small, self-contained machine code, it would be kind of stupid to write our shellcode in C.


On the other hand, assembly language is already machine specific, and it lets us work directly with the kernel using system calls, which is really convenient for our use (explained below).

Up to here, we know what shellcode is and the language we will use to write our shellcode.



Now,

How does assembly language work directly with the kernel using system calls?

Before explaining this we need to go back to some basic concepts.

The operating system manages things like input, output, process control, file access, and network communication in the kernel.

Compiled C programs ultimately perform these tasks by making system calls to the kernel. 

System Calls? 
A system call is the programmatic way in which a program requests a service from the kernel.

 

Different operating systems have a different set of system calls.

In C, standard libraries are used for convenience and portability. A C program that uses printf() to output a string can be compiled for many different systems, since the standard library knows the appropriate system calls for various architectures. A C program compiled on an x86 processor will produce x86 assembly language.

By definition, assembly language is already specific to certain processor architecture, so portability is impossible. There are no standard libraries; instead, kernel system calls have to be made directly.

 

C vs Assembly (Final Round)

In this Final Round, we are writing a simple C program and then rewriting it in x86 assembly.


hiii.c (our simple example program)


 

When the compiled program is run, execution flows through the standard I/O library, eventually making a system call to write the string "Y am I here? Help!!!" to the screen. The strace program is used to trace a program's system calls. Used on the compiled hiii program, it shows every system call that program makes.
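To make the library-versus-syscall distinction concrete, here is a minimal C sketch. The function names `hello_stdio` and `hello_write` are made up for illustration, and the trailing newline in the message is an assumption. The first function goes through the standard I/O library (the route hiii.c is described as taking), the second calls write() on a file descriptor directly.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* The message from the text; the trailing newline is an assumption. */
static const char MSG[] = "Y am I here? Help!!!\n";

/* Path 1: buffered standard I/O. The library ends up calling write() for us. */
int hello_stdio(FILE *out)
{
    if (fputs(MSG, out) == EOF)
        return -1;
    return fflush(out);   /* push the buffered bytes out now */
}

/* Path 2: skip the library and call write() on a file descriptor directly.
 * Returns the number of bytes written, or -1 on error. */
ssize_t hello_write(int fd)
{
    return write(fd, MSG, strlen(MSG));
}
```

Calling either function from a small `main` and running the result under strace should show a write() call on descriptor 1 carrying the same bytes.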



As you can see from the above image, the compiled program does more than just print a string. The system calls at the start set up the environment and memory for the program (don't focus on all the system calls for now), but the important part is the write() syscall pointed to by the red arrow. This is what actually outputs the string.


[ The Unix manual pages (accessed with the man command) are separated into sections. Section 2 contains the manual pages for system calls, so man 2 write command will describe the use of the write() system call ]


The strace output also shows the arguments for the syscall. The buf and count arguments are a pointer to our string and its length. The fd argument of 1 is a special standard file descriptor. File descriptors are used for almost everything in Unix: input, output, file access, network sockets, and so on. A file descriptor is similar to a number given out at a coat check. Opening a file descriptor is like checking in your coat, since you are given a number that can later be used to reference your coat. The first three file descriptor numbers (0, 1, and 2) are automatically used for standard input, output, and error. These values are standard and have been defined in several places, such as the /usr/include/unistd.h file.

 

  

Writing bytes to standard output’s file descriptor of 1 will print the bytes; reading from standard input’s file descriptor of 0 will input bytes. The standard error file descriptor of 2 is used to display the error or debugging messages that can be filtered from the standard output.
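As a small sketch of the three standard descriptors in use: `STDIN_FILENO`, `STDOUT_FILENO`, and `STDERR_FILENO` are the names unistd.h gives to 0, 1, and 2, while `put_str` and `report` are hypothetical helpers invented for this example.

```c
#include <string.h>
#include <unistd.h>

/* Write a string to whichever descriptor the caller picks.
 * Returns the number of bytes written, or -1 on error. */
ssize_t put_str(int fd, const char *s)
{
    return write(fd, s, strlen(s));
}

/* Hypothetical helper: normal output goes to fd 1, diagnostics to fd 2. */
ssize_t report(int is_error, const char *s)
{
    return put_str(is_error ? STDERR_FILENO : STDOUT_FILENO, s);
}
```

Because the two streams are separate descriptors, running a program that uses `report` with `2>/dev/null` appended to the command line would hide the diagnostic lines and keep the normal output.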

 

 

* Linux System Calls in Assembly (x86)

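Every Linux system call is assigned a number, so it can be referenced by that number when making the call in assembly. These syscalls are listed in /usr/include/asm-i386/unistd.h (on newer distributions the 32-bit list usually lives in a file named unistd_32.h under the asm include directory).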

 



For our rewrite of hiii.c in assembly, we will make a system call to the write() function for the output and then a second system call to exit() so the process quits cleanly. This can be done in x86 assembly using just two assembly instructions: mov and int.

Assembly instructions for the x86 processor have one, two, three, or no operands. The operands to an instruction can be numerical values, memory addresses, or processor registers. The x86 processor has several 32-bit registers that can be viewed as hardware variables. The registers EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP can all be used as operands, while the EIP register (the instruction pointer) cannot.

The mov instruction copies a value between its two operands. Using Intel assembly syntax, the first operand is the destination and the second is the source. The int instruction sends an interrupt signal to the kernel, defined by its single operand. With the Linux kernel, interrupt 0x80 is used to tell the kernel to make a system call. When the int 0x80 instruction is executed, the kernel will make a system call based on the first four registers. The EAX register is used to specify which system call to make, while the EBX, ECX, and EDX registers are used to hold the first, second, and third arguments to the system call. All of these registers can be set using the mov instruction.  

In the following assembly code listing, the memory segments are simply declared. The string "Hello, world!" with a newline character (0x0a) is in the data segment, and the actual assembly instructions are in the text segment. This follows proper memory segmentation practices.



The instructions of this program are straightforward. For the write() syscall to standard output, the value of 4 is put in EAX since the write() function is system call number 4. Then, the value of 1 is put into EBX, since the first argument of write() should be the file descriptor for standard output. Next, the address of the string in the data segment is put into ECX, and the length of the string (in this case, 14 bytes) is put into EDX. After these registers are loaded, the system call interrupt is triggered, which will call the write() function. To exit cleanly, the exit() function needs to be called with a single argument of 0. So the value of 1 is put into EAX, since exit() is system call number 1, and the value of 0 is put into EBX, since the first and only argument should be 0. Then the system call interrupt is triggered again.
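The register setup described above can be mirrored from C with the C library's generic syscall() wrapper. This is only a sketch: `raw_write` is a made-up name, and the wrapper decides for itself which trap instruction to use, so it illustrates the number-plus-three-arguments shape rather than int 0x80 itself.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* syscall(number, arg1, arg2, arg3) lines up with the registers in the text:
 *   number -> EAX, arg1 -> EBX, arg2 -> ECX, arg3 -> EDX.
 * SYS_write is 4 on 32-bit x86, as the article says. Other architectures
 * (x86-64, for example) number their system calls differently, which is
 * why the symbolic name is used here instead of a literal 4. */
long raw_write(int fd, const void *buf, size_t count)
{
    return syscall(SYS_write, fd, buf, count);
}
```

A call such as `raw_write(1, "Hello, world!\n", 14)` corresponds to loading 4, 1, the string address, and 14 into the four registers; the exit step would be `syscall(SYS_exit, 0)` in the same style.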


To create an executable binary, this assembly code must first be assembled and then linked into an executable format. When compiling C code, the GCC compiler takes care of all of this automatically. We are going to create an executable and linking format (ELF) binary, so the global _start line shows the linker where the assembly instructions begin. 

The nasm assembler with the -f elf argument will assemble the helloworld.asm into an object file ready to be linked as an ELF binary. By default, this object file will be called helloworld.o. The linker program ld will produce an executable a.out binary from the assembled object.


This tiny program works, but it’s not shellcode, since it isn’t self-contained and must be linked.

 

 

* Path to Shellcode:

Shellcode is literally injected into a running program, where it takes over like a biological virus inside a cell. Since shellcode isn’t really an executable program, we don’t have the luxury of declaring the layout of data in memory or even using other memory segments. Our instructions must be self-contained and ready to take over control of the processor regardless of its current state. This is commonly referred to as position-independent code.

In shellcode, the bytes for the string "Hello, world!" must be mixed together with the bytes for the assembly instructions, since there aren’t definable or predictable memory segments. This is fine as long as EIP doesn’t try to interpret the string as instructions. However, to access the string as data we need a pointer to it. When the shellcode gets executed, it could be anywhere in memory. The string’s absolute memory address needs to be calculated relative to EIP. Since EIP cannot be accessed from assembly instructions, however, we need to use some sort of trick.



* Assembly Instruction using the Stack:

The stack is so integral to the x86 architecture that there are special instructions for its operations.


 
Stack-based exploits are made possible by the call and ret instructions. When a function is called, the return address of the next instruction is pushed to the stack, beginning the stack frame. After the function is finished, the ret instruction pops the return address from the stack and jumps EIP back there. By overwriting the stored return address on the stack before the ret instruction, we can take control of a program’s execution. 

This architecture can be misused in another way to solve the problem of addressing the inline string data. If the string is placed directly after a call instruction, the address of the string will get pushed to the stack as the return address. Instead of calling a function, we can jump past the string to a pop instruction that will take the address off the stack and into a register. The following assembly instructions demonstrate this technique.




The call instruction jumps execution down below the string. This also pushes the address of the next instruction to the stack, the next instruction in our case being the beginning of the string. The return address can immediately be popped from the stack into the appropriate register. Without using any memory segments, these raw instructions, injected into an existing process, will execute in a completely position-independent way. This means that, when these instructions are assembled, they cannot be linked into an executable.


The nasm assembler converts assembly language into machine code and a corresponding tool called ndisasm converts machine code into assembly. These tools are used above to show the relationship between the machine code bytes and the assembly instructions. In the above disassembly instructions the bytes of the "Hello, world!" string are interpreted as instructions (You guess the bytes...). Now, if we can inject this shellcode into a program and redirect EIP, the program will print out Hello, world!

Hey, wait...it's not this simple.

Yes, the above shellcode will work on some program but will fail in most cases. Why?

Look at the above image. When we run the hexdump command, the shellcode bytes are displayed clearly. Notice the "00" (null) bytes in our code?

I know the byte "00" looks so innocent but...


Often, shellcode will be injected into a process as a string, using functions like strcpy(). Such functions will simply terminate at the first null byte, producing incomplete and unusable shellcode in memory. In order for the shellcode to survive transit, it must be redesigned so it doesn’t contain any null bytes.
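The truncation is easy to reproduce in isolation. The sketch below uses a hypothetical helper, `bytes_surviving_strcpy`, that copies a byte sequence with strcpy() into a buffer that is deliberately large enough, then reports how much of it arrived.

```c
#include <stddef.h>
#include <string.h>

/* Copy `code` with strcpy() and report how many bytes actually landed
 * in `dst`. `dst` must be large enough for the whole input; nothing is
 * being overflowed here, only truncated. */
size_t bytes_surviving_strcpy(char *dst, const char *code)
{
    strcpy(dst, code);      /* stops at the first 0x00 byte */
    return strlen(dst);
}
```

Fed the bytes E8 13 00 00 00 followed by anything else, the function returns 2: only the first two bytes survive, and everything after the first null is lost.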

 

 

* Removing Null bytes

Looking at the disassembly, it is obvious that the first null bytes come from the call instruction.


 This instruction jumps execution forward by 19 (0x13) bytes, based on the first operand. The call instruction allows for much longer jump distances, which means that a small value like 19 will have to be padded with leading zeros resulting in null bytes. 

One way around this problem takes advantage of two’s complement. A small negative number will have its leading bits turned on, resulting in 0xff bytes. This means that, if we call using a negative value to move backward in execution, the machine code for that instruction won’t have any null bytes. The following revision of the helloworld shellcode uses a standard implementation of this trick: Jump to the end of the shellcode to a call instruction which, in turn, will jump back to a pop instruction at the beginning of the shellcode.
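The effect of the sign on the encoded bytes can be checked with a few lines of C. This sketch encodes a near call (opcode 0xE8 followed by a 32-bit little-endian displacement) and counts the null bytes; `encode_call` and `count_nulls` are names invented here, and the displacement -24 is just an example value.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a near `call rel32`: opcode 0xE8, then the 32-bit displacement
 * in little-endian byte order. */
void encode_call(int32_t rel, uint8_t out[5])
{
    uint32_t u = (uint32_t)rel;   /* two's complement bit pattern */
    out[0] = 0xE8;
    out[1] = (uint8_t)(u & 0xFF);
    out[2] = (uint8_t)((u >> 8) & 0xFF);
    out[3] = (uint8_t)((u >> 16) & 0xFF);
    out[4] = (uint8_t)((u >> 24) & 0xFF);
}

/* Count the 0x00 bytes in a buffer. */
size_t count_nulls(const uint8_t *p, size_t n)
{
    size_t nulls = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] == 0)
            nulls++;
    return nulls;
}
```

A displacement of 19 encodes as E8 13 00 00 00 (three nulls), while -24 encodes as E8 E8 FF FF FF (none). Not every negative value is safe, though: a multiple of 256 such as -256 still ends up with a 0x00 in the low byte.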


After assembling this new shellcode, disassembly shows that the call instruction (shown in italics below) is now free of null bytes. This solves the first and most difficult null-byte problem for this shellcode, but there are still many other null bytes (shown in bold).


These remaining null bytes can be eliminated with an understanding of register widths and addressing. Notice that the first jmp instruction is actually jmp short. This means execution can only jump a maximum of approximately 128 bytes in either direction. The normal jmp instruction, as well as the call instruction (which has no short version), allows for much longer jumps. The difference between assembled machine code for the two jump varieties is shown below:


 The EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP registers are 32 bits in width. The E stands for extended, because these were originally 16-bit registers called AX, BX, CX, DX, SI, DI, BP, and SP. These original 16-bit versions of the registers can still be used for accessing the first 16 bits of each corresponding 32-bit register. Furthermore, the individual bytes of the AX, BX, CX, and DX registers can be accessed as 8-bit registers called AL, AH, BL, BH, CL, CH, DL, and DH, where L stands for low byte and H for high byte. Naturally, assembly instructions using the smaller registers only need to specify operands up to the register’s bit width. The three variations of a mov instruction are shown below.
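For readers who think better in C than in register diagrams, here is a rough model of the register views plus the machine code for loading the value 4 at each width. The helper names are made up for this sketch, and the byte arrays assume 32-bit mode.

```c
#include <stdint.h>

/* Views of one 32-bit register value, mirroring EAX / AX / AH / AL. */
uint16_t reg_ax(uint32_t eax) { return (uint16_t)(eax & 0xFFFF); }
uint8_t  reg_al(uint32_t eax) { return (uint8_t)(eax & 0xFF); }
uint8_t  reg_ah(uint32_t eax) { return (uint8_t)((eax >> 8) & 0xFF); }

/* Machine code for loading the value 4 at each operand width (32-bit mode). */
const uint8_t MOV_EAX_4[] = { 0xB8, 0x04, 0x00, 0x00, 0x00 }; /* mov eax, 0x4 */
const uint8_t MOV_AX_4[]  = { 0x66, 0xB8, 0x04, 0x00 };       /* mov ax, 0x4  */
const uint8_t MOV_AL_4[]  = { 0xB0, 0x04 };                   /* mov al, 0x4  */
```

The 32-bit form carries three padding null bytes, the 16-bit form still carries one, and only the 8-bit form is null-free for a value this small.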


Using the AL, BL, CL, or DL register will put the correct least significant byte into the corresponding extended register without creating any null bytes in the machine code. However, the top three bytes of the register could still contain anything. This is especially true for shellcode, since it will be taking over another process. If we want the 32-bit register values to be correct, we need to zero out the entire register before the mov instructions—but this, again, must be done without using null bytes. Here are some more simple assembly instructions for your arsenal. These first two are small instructions that increment and decrement their operand by one.

 

The next few instructions, like the mov instruction, have two operands. They all do simple arithmetic and bitwise logical operations between the two operands, storing the result in the first operand.


One method is to move an arbitrary 32-bit number into the register and then subtract that value from the register using the mov and sub instructions:


While this technique works, it takes 10 bytes to zero out a single register, making the assembled shellcode larger than necessary. Can you think of a way to optimize this technique? The DWORD value specified in each instruction comprises 80 percent of the code. Subtracting any value from itself also produces 0 and doesn’t require any static data. This can be done with a single two-byte instruction:


Using the sub instruction works fine for zeroing registers. In practice, though, most shellcode uses a different two-byte instruction for the job. The xor instruction performs an exclusive or operation on the bits in a register. Since 1 xored with 1 results in 0, and 0 xored with 0 results in 0, any value xored with itself will result in 0. This is the same result as with any value subtracted from itself. Note that both sub and xor modify the processor flags, which are used for branching, so neither should be placed between a comparison and the conditional jump that depends on it; xor of a register with itself is simply the conventional zeroing idiom.
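The zero-then-load pattern can be simulated in C to see why the clearing step matters. Everything here is a model of register contents, not real register access, and the function names are invented for the sketch.

```c
#include <stdint.h>

/* mov al, v : replaces only the lowest byte; the upper three bytes stay. */
uint32_t mov_al(uint32_t eax, uint8_t v)
{
    return (eax & 0xFFFFFF00u) | v;
}

/* xor eax, eax : any value XORed with itself is 0. */
uint32_t xor_self(uint32_t eax) { return eax ^ eax; }

/* sub eax, eax : any value minus itself is 0. */
uint32_t sub_self(uint32_t eax) { return eax - eax; }

/* The pattern from the text: clear the register, then load the low byte. */
uint32_t load_small(uint32_t leftover, uint8_t v)
{
    return mov_al(xor_self(leftover), v);
}
```

With leftover contents of 0xDEADBEEF, `mov_al` alone gives 0xDEADBE04, which is not the syscall number 4 at all, while `load_small` gives exactly 4.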


 You can safely use the sub instruction to zero registers (if done at the beginning of the shellcode), but the xor instruction is most commonly used in shellcode in the wild. This next revision of the shellcode makes use of the smaller registers and the xor instruction to avoid null bytes. The inc and dec instructions have also been used when possible to make for even smaller shellcode.

 


After assembling this shellcode, hexdump and grep are used to quickly check it for null bytes.
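If hexdump and grep are not at hand, the same check is a few lines of C. This is a sketch with a hypothetical function name, `first_null_offset`, that scans an already opened file.

```c
#include <stdio.h>

/* Return the offset of the first 0x00 byte in the stream, or -1 if there
 * is none. Reads from the stream's current position to end of file. */
long first_null_offset(FILE *f)
{
    long offset = 0;
    int c;

    while ((c = fgetc(f)) != EOF) {
        if (c == 0)
            return offset;
        offset++;
    }
    return -1;
}
```

Open the assembled file with `fopen(path, "rb")` and pass the handle in; a result of -1 means the bytes would survive a strcpy()-style copy intact.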


Now this shellcode is usable, as it doesn’t contain any null bytes. When used with an exploit, our shellcode will make a vulnerable program greet the world like a newbie.

 

* Shell-Spawning Shellcode

Now that you’ve learned how to make system calls and avoid null bytes, all sorts of shellcodes can be constructed. To spawn a shell, we just need to make a system call to execute the /bin/sh shell program. System call number 11, execve(), is similar to the C library's exec family of functions, such as execl().



The first argument, the filename, should be a pointer to the string "/bin/sh", since this is what we want to execute. The environment array (the third argument) can be empty, but it still needs to be terminated with a 32-bit null pointer. The argument array (the second argument) must be null-terminated, too; it must also contain the string pointer (since the zeroth argument is the name of the running program). Done in C, a program making this call would look like this:
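A minimal C sketch of that call follows. It is split into two made-up functions, `build_shell_args` and `exec_shell`, so the argument arrays can be inspected separately from the call itself; it is a plain, non-exploit program that just runs /bin/sh.

```c
#include <stddef.h>
#include <unistd.h>

/* Fill in the two arrays execve() expects.
 *   argv: { "/bin/sh", NULL }  (argv[0] is the program name)
 *   envp: { NULL }             (empty environment, still null-terminated) */
void build_shell_args(char *argv[2], char *envp[1])
{
    argv[0] = "/bin/sh";
    argv[1] = NULL;
    envp[0] = NULL;
}

/* Replace the current process with /bin/sh.
 * Only returns (with -1) if execve() fails. */
int exec_shell(void)
{
    char *argv[2];
    char *envp[1];

    build_shell_args(argv, envp);
    return execve("/bin/sh", argv, envp);
}
```

Calling `exec_shell()` from `main` turns the process into a shell. The three values handed to execve() are exactly what the assembly version has to place in EBX, ECX, and EDX.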


To do this in assembly, the argument and environment arrays need to be built in memory. In addition, the "/bin/sh" string needs to be terminated with a null byte. This must be built in memory as well. Dealing with memory in assembly is similar to using pointers in C. The lea instruction, whose name stands for load effective address, works like the address-of operator in C.


With Intel assembly syntax, operands can be dereferenced as pointers if they are surrounded by square brackets. For example, the following instruction in assembly will treat EBX+12 as a pointer and write eax to where it’s pointing. 


The following shellcode uses these new instructions to build the execve()arguments in memory. The environment array is collapsed into the end of the argument array, so they share the same 32-bit null terminator.


After terminating the string and building the arrays, the shellcode uses the lea instruction (shown in bold above) to put a pointer to the argument array into the ECX register. Loading the effective address of a bracketed register added to a value is an efficient way to add the value to the register and store the result in another register. In the example above, the brackets dereference EBX+8 as the argument to lea, which loads that address into ECX. Loading the address of a dereferenced pointer produces the original pointer, so this instruction puts EBX+8 into ECX. Normally, this would require both a mov and an add instruction. When assembled, this shellcode is devoid of null bytes. It will spawn a shell when used in an exploit.
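The in-memory patching can be modelled in C without touching real registers. This sketch assumes the 16-byte template "/bin/shXAAAABBBB", with X standing in for the string terminator, AAAA for the argument pointer, and BBBB for the shared null terminator. That mapping is inferred from the description rather than taken from a listing, and the address passed in is a pretend 32-bit value.

```c
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value little-endian at buf[off], like `mov [ebx+off], eax`. */
static void store32(uint8_t *buf, uint32_t off, uint32_t val)
{
    buf[off]     = (uint8_t)(val & 0xFF);
    buf[off + 1] = (uint8_t)((val >> 8) & 0xFF);
    buf[off + 2] = (uint8_t)((val >> 16) & 0xFF);
    buf[off + 3] = (uint8_t)((val >> 24) & 0xFF);
}

/* `lea reg, [ebx+off]` : only the address arithmetic, no memory access. */
uint32_t lea(uint32_t ebx, uint32_t off) { return ebx + off; }

/* Patch the template in place. `ebx` is the pretend 32-bit address at
 * which the template sits in memory. */
void patch_template(uint8_t buf[16], uint32_t ebx)
{
    memcpy(buf, "/bin/shXAAAABBBB", 16);
    buf[7] = 0;               /* X    -> string terminator             */
    store32(buf, 8, ebx);     /* AAAA -> argv[0] = address of string   */
    store32(buf, 12, 0);      /* BBBB -> argv[1] = NULL, also envp[0]  */
}
```

After patching, `lea(ebx, 8)` is the address of the argument array and `lea(ebx, 12)` is the address of the empty environment array, which is just the tail of the same structure.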


This shellcode, however, can be shortened to less than the current 45 bytes. Since shellcode needs to be injected into program memory somewhere, smaller shellcode can be used in tighter exploit situations with smaller usable buffers. The smaller the shellcode, the more situations it can be used in. Obviously, the XAAAABBBB visual aid can be trimmed from the end of the string, which brings the shellcode down to 36 bytes.


This shellcode can be shrunk down further by redesigning it and using registers more efficiently. The ESP register is the stack pointer, pointing to the top of the stack. When a value is pushed to the stack, ESP moves toward lower memory addresses (4 is subtracted from it) and the value is placed at the new top of the stack. When a value is popped from the stack, ESP moves back toward higher addresses (4 is added to it).

The following shellcode uses push instructions to build the necessary structures in memory for the execve() system call.


This shellcode builds the null-terminated string "/bin//sh" on the stack, and then copies ESP for the pointer. The extra forward slash doesn’t matter and is effectively ignored. The same method is used to build the arrays for the remaining arguments. The resulting shellcode still spawns a shell but is only 25 bytes, compared to 36 bytes using the jmp call method.
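To see why pushing a few 32-bit values leaves a readable string at ESP, here is a toy stack in C. The struct and function names are invented for the sketch, there are no overflow checks, and the two constants are simply "//sh" and "/bin" read as little-endian 32-bit numbers.

```c
#include <stdint.h>
#include <string.h>

#define STACK_SIZE 64

/* A toy downward-growing stack: `esp` is an index into `mem`.
 * Illustration only, with no bounds checking. */
struct toy_stack {
    uint8_t  mem[STACK_SIZE];
    uint32_t esp;
};

void stack_init(struct toy_stack *s)
{
    memset(s->mem, 0xCC, sizeof s->mem);  /* filler, so pushed zeros stand out */
    s->esp = STACK_SIZE;                  /* empty stack: esp at the high end  */
}

/* push: subtract 4 from esp, then store the value little-endian. */
void push32(struct toy_stack *s, uint32_t v)
{
    s->esp -= 4;
    for (int i = 0; i < 4; i++)
        s->mem[s->esp + i] = (uint8_t)(v >> (8 * i));
}

/* pop: read the value at esp, then add 4 to esp. */
uint32_t pop32(struct toy_stack *s)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v |= (uint32_t)s->mem[s->esp + i] << (8 * i);
    s->esp += 4;
    return v;
}
```

Pushing 0, then 0x68732f2f, then 0x6e69622f leaves the bytes of "/bin//sh" followed by a zero terminator starting at `esp`, which is why copying ESP at that moment yields a usable string pointer.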



That's all the basics you need. Now you can write your own custom shellcode; it's just a matter of calling functions and using that big unused brain.

It almost seems too easy, doesn’t it?

 

Our shellcode above only scratches the surface of what's possible. Shall I upload more on remote shellcode and more real-world use cases? Get ready with networking and socket concepts. You can learn them from here.


* Advanced Materials:

Common Shellcode types or Traditional Shellcode Archive with Operating-System specific variants:

1. Unix:

  • execve /bin/sh
  • port-binding /bin/sh
  • passive connect ("reverse shell") /bin/sh
  • setuid
  • breaking chroot

2. Windows:

  • WinExec
  • Reverse shell using CreateProcess cmd.exe


 

* THE END

Since I can solve a "CAPTCHA", I am human and I make mistakes, so please let me know if you find any.

OK bye.


 
