
System calls

So far, all the programs we've made have had to use well-defined kernel mechanisms to register /proc files and device drivers. This is great if you want to do something already provided by the kernel programmers, like writing a device driver. But what if you want to do something fancy, change the behavior of the system in some way?

This is exactly where kernel programming becomes dangerous. While writing the example below, I killed the open system call. This meant that I could not open any files, could not run any programs, and could not shut down the system with the shutdown command. I had to turn off the power to stop it. Fortunately, no files were destroyed. To ensure that you don't lose any files either, please run sync right before you issue the insmod and rmmod commands.

Forget about /proc files and device files. They are just minor details. The real mechanism that all processes use to communicate with the kernel is the system call. When a process requests a service from the kernel (such as opening a file, starting a new process, or requesting more memory), this is the mechanism used. If you want to change the behavior of the kernel in interesting ways, this is the place to do it. By the way, if you want to see which system calls a program uses, run it under strace, giving the program and its arguments on the strace command line.

In general, a process is not supposed to be able to access the kernel. It cannot access kernel memory and it cannot call kernel functions. The CPU hardware enforces this (it's called `protected mode' for a reason). System calls are the exception to this general rule. The process fills the registers with the appropriate values and then executes a special instruction that jumps to a predefined location in the kernel (of course, that location is readable by user processes, but not writable by them). On Intel CPUs, this is done by means of interrupt 0x80. The hardware knows that once you jump to this location, you are no longer running in restricted user mode. Instead, you are running as the operating system kernel, and therefore you are allowed to do whatever you want.
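To make the mechanism concrete, here is a minimal sketch of a user program issuing a system call by number through the C library's generic syscall() function instead of the dedicated getpid() wrapper. The helper name raw_getpid is made up for this illustration, and getpid was picked only because it takes no arguments; on current systems the library may use a different trap instruction than interrupt 0x80, but the idea of "number plus arguments in, result out" is the same.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid: the system call number */
#include <unistd.h>        /* syscall(), getpid() */

/* Ask the kernel for our process ID without the getpid() wrapper.
 * syscall() places the call number and arguments where the kernel
 * expects them and executes the trap instruction for this CPU. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Calling raw_getpid() from main() and comparing the result with getpid() should give the same process ID, and running the program under strace should show the getpid call.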

The location in the kernel that a process can jump to is called system_call. The procedure at that location checks the system call number, which tells the kernel what service the process requested. Then it looks at the table of system calls (sys_call_table) to find the address of the kernel function to call. Then it calls that function, and after it returns, does a few system checks. The result is then returned to the process (or to a different process, if the process's time slice ran out). If you want to read this code, it's in the source file arch/<architecture>/kernel/entry.S, after the line ENTRY(system_call).

So, if we want to change the way a certain system call works, what we need to do is write our own function to implement it (usually by adding a bit of our own code and then calling the original function), and then change the pointer in sys_call_table to point to our function. Because we might be removed later and we don't want to leave the system in an unstable state, it's important for cleanup_module to restore the table to its original state.

The source code provided here is an example of such a module. We want to "spy" on some user, and send a message via printk whenever that user opens a file. We replace the file open system call with our own function called our_sys_open . This function checks the uid (user id) of the current process, and if it is equal to the uid we are spying on, calls printk to display the name of the file to be opened. It then calls the original open function with the same parameters, actually opening the file.

The init_module function replaces the appropriate location in sys_call_table and keeps the original pointer in a variable. The cleanup_module function uses that variable to restore everything back to normal. This approach is dangerous, because of the possibility of two kernel modules changing the same system call. Imagine we have two kernel modules, A and B. A's open system call will be A_open and B's will be B_open. When A is inserted into the kernel, the system call is replaced with A_open, which calls the original sys_open when it is done. Next, B is inserted into the kernel, which replaces the system call with B_open, which calls what it thinks is the original system call, but is actually A_open.

Now, if B is removed first, everything will be fine: it will simply restore the system call to A_open, which calls the original. However, if A is removed and then B is removed, the system will crash. A's removal will restore the system call to the original, sys_open, cutting B out of the loop. Then, when B is removed, it will restore the system call to what it thinks is the original, A_open, which is no longer in memory. At first glance, it appears we could solve this particular problem by checking whether the system call is equal to our open function and, if not, leaving the table alone (so that B won't change the system call when it's removed), but that would cause an even worse problem. When A is removed, it sees that the system call was changed to B_open, so that it no longer points to A_open, and it won't restore the pointer to sys_open before it is removed from memory. Unfortunately, B_open will still try to call A_open, which is no longer there, so that even without removing B the system would crash.
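The unload-order problem can be reproduced harmlessly in user space. The following is a small simulation, not kernel code: a single function pointer stands in for the open slot of sys_call_table, and all names (table_open, load_A, simulate, and so on) are invented for the sketch. It reports whether the slot ends up pointing at a handler whose "module" is no longer loaded.

```c
#include <stddef.h>

typedef int (*open_fn)(const char *);

/* One slot of a pretend system call table. */
static open_fn table_open;

/* Stand-in for the real sys_open. */
static int sys_open_sim(const char *name) { (void)name; return 3; }

/* What each module saved at load time, and whether it is loaded. */
static open_fn a_saved, b_saved;
static int a_loaded, b_loaded;

static int A_open(const char *name) { return a_saved(name); }
static int B_open(const char *name) { return b_saved(name); }

static void load_A(void)   { a_saved = table_open; table_open = A_open; a_loaded = 1; }
static void load_B(void)   { b_saved = table_open; table_open = B_open; b_loaded = 1; }
static void unload_A(void) { table_open = a_saved; a_loaded = 0; }
static void unload_B(void) { table_open = b_saved; b_loaded = 0; }

/* Returns 1 if the slot points at the handler of an unloaded module. */
static int slot_is_dangling(void)
{
    if (table_open == A_open && !a_loaded) return 1;
    if (table_open == B_open && !b_loaded) return 1;
    return 0;
}

/* Load A, then B, then unload in the requested order.
 * a_first == 0: unload B, then A (safe).
 * a_first != 0: unload A, then B (leaves a stale pointer). */
int simulate(int a_first)
{
    table_open = sys_open_sim;
    a_loaded = b_loaded = 0;
    load_A();
    load_B();
    if (a_first) { unload_A(); unload_B(); }
    else         { unload_B(); unload_A(); }
    return slot_is_dangling();
}
```

Here simulate(0) leaves the slot pointing at sys_open_sim, while simulate(1) leaves it pointing at A_open after A is gone, which in a real kernel would be a jump into freed memory.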

I can think of two ways to prevent this problem. The first is to restore the call to the original value, sys_open. Unfortunately, sys_open is not part of the kernel symbol table in /proc/ksyms, so we can't access it. The other solution is to use the reference count to prevent the module from being unloaded once it is loaded. This is good for production modules, but bad for an educational sample.

    /* syscall.c
     *
     * System call "stealing" sample
     */

    /* Copyright (C) 1998-99 by Ori Pomerantz */

    /* The necessary header files */

    /* Standard in kernel modules */
    #include <linux/kernel.h>   /* We're doing kernel work */
    #include <linux/module.h>   /* Specifically, a module */

    /* Deal with CONFIG_MODVERSIONS */
    #if CONFIG_MODVERSIONS==1
    #define MODVERSIONS
    #include <linux/modversions.h>
    #endif

    #include <sys/syscall.h>    /* The list of system calls */

    /* For the current (process) structure, we need
     * this to know who the current user is. */
    #include <linux/sched.h>

    /* In 2.2.3 /usr/include/linux/version.h includes a
     * macro for this, but 2.0.35 doesn't - so I add it
     * here if necessary. */
    #ifndef KERNEL_VERSION
    #define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
    #endif

    #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
    #include <asm/uaccess.h>
    #endif

    /* The system call table (a table of functions). We
     * just define this as external, and the kernel will
     * fill it up for us when we are insmod'ed */
    extern void *sys_call_table[];

    /* UID we want to spy on - will be filled from the
     * command line */
    int uid;

    #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
    MODULE_PARM(uid, "i");
    #endif

    /* A pointer to the original system call. The reason
     * we keep this, rather than call the original function
     * (sys_open), is because somebody else might have
     * replaced the system call before us. In that case we
     * end up calling the function in that module - and it
     * might be removed before we are.
     *
     * Besides, sys_open is a static symbol, so it is not
     * exported. */
    asmlinkage int (*original_call)(const char *, int, int);

    /* For some reason, in 2.2.3 current->uid gave me
     * zero, not the real user ID. I tried to find what went
     * wrong, but I couldn't do it in a short time, and
     * I'm lazy - so I'll just use the system call to get the
     * uid, the way a process would.
     *
     * For some reason, after I recompiled the kernel this
     * problem went away. */
    asmlinkage int (*getuid_call)();

    /* The function we'll replace sys_open (the function
     * called when you call the open system call) with. To
     * find the exact prototype, with the number and type
     * of arguments, we find the original function first
     * (it's at fs/open.c).
     *
     * In theory, this means that we're tied to the
     * current version of the kernel. In practice, the
     * system calls almost never change (it would wreak havoc
     * and require programs to be recompiled, since the system
     * calls are the interface between the kernel and the
     * processes). */
    asmlinkage int our_sys_open(const char *filename,
                                int flags,
                                int mode)
    {
        int i = 0;
        char ch;

        /* Check if this is the user we're spying on */
        if (uid == getuid_call()) {
            /* getuid_call is the getuid system call,
             * which gives the uid of the user who
             * ran the process which called the system
             * call we got */

            /* Report the file, if relevant */
            printk("Opened file by %d: ", uid);
            do {
    #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
                get_user(ch, filename+i);
    #else
                ch = get_user(filename+i);
    #endif
                i++;
                printk("%c", ch);
            } while (ch != 0);
            printk("\n");
        }

        /* Call the original sys_open - otherwise, we lose
         * the ability to open files */
        return original_call(filename, flags, mode);
    }

    /* Initialize the module - replace the system call */
    int init_module()
    {
        /* Warning - too late for it now, but maybe for
         * next time... */
        printk("I'm dangerous. I hope you did a ");
        printk("sync before you insmod'ed me.\n");
        printk("My counterpart, cleanup_module(), is even ");
        printk("more dangerous. If\n");
        printk("you value your file system, it will ");
        printk("be \"sync; rmmod\" \n");
        printk("when you remove this module.\n");

        /* Keep a pointer to the original function in
         * original_call, and then replace the system call
         * in the system call table with our_sys_open */
        original_call = sys_call_table[__NR_open];
        sys_call_table[__NR_open] = our_sys_open;

        /* To get the address of the function for system
         * call foo, go to sys_call_table[__NR_foo]. */

        printk("Spying on UID:%d\n", uid);

        /* Get the system call for getuid */
        getuid_call = sys_call_table[__NR_getuid];

        return 0;
    }

    /* Cleanup - restore the original system call */
    void cleanup_module()
    {
        /* Return the system call back to normal */
        if (sys_call_table[__NR_open] != our_sys_open) {
            printk("Somebody else also played with the ");
            printk("open system call\n");
            printk("The system may be left in ");
            printk("an unstable state.\n");
        }

        sys_call_table[__NR_open] = original_call;
    }

Most often, the code for the system call with number __NR_xxx, defined in /usr/include/asm/unistd.h, can be found in the Linux kernel source in the function sys_xxx(). (The call table for i386 can be found in /usr/src/linux/arch/i386/kernel/entry.S.) There are many exceptions to this rule, mostly because older system calls were superseded by newer ones, and this was not done in any systematic way. On platforms with proprietary OS emulation, such as parisc, sparc, sparc64, and alpha, there are many additional system calls; mips64 also contains a full set of 32-bit system calls.

Over time, the interfaces of some system calls have had to change. One reason for such changes was the need to increase the size of structures or scalar values passed to a system call. Because of these changes, some architectures (notably the old 32-bit i386) now have various groups of similar system calls (for example, truncate(2) and truncate64(2)) that perform the same task but differ in the size of their arguments. (Applications are mostly unaware of this: the glibc wrapper functions do some work to ensure that the right system call is invoked and that ABI compatibility is preserved for old binaries.) Examples of system calls that exist in multiple versions:

* There are currently three different versions of stat(2): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64), the last being the one currently in use. The situation is similar for lstat(2) and fstat(2).
* Similarly, __NR_oldolduname, __NR_olduname, and __NR_uname are defined for the routines sys_olduname(), sys_uname(), and sys_newuname().
* Linux 2.0 introduced a new version of vm86(2); the old and new versions of the kernel routine are called sys_vm86old() and sys_vm86().
* Linux 2.4 introduced a new version of getrlimit(2); the old and new versions of the kernel routine are called sys_old_getrlimit() (slot __NR_getrlimit) and sys_getrlimit() (slot __NR_ugetrlimit).
* In Linux 2.4, the size of the user and group ID fields was increased from 16 to 32 bits. Several system calls were added to support this change (for example, chown32(2), getuid32(2), getgroups32(2), setresuid32(2)), superseding the earlier calls with the same names but without the "32" suffix.
* Linux 2.4 added support for accessing large files (whose sizes and offsets do not fit in 32 bits) in applications on 32-bit architectures. This required changes to the system calls that work with file sizes and offsets. The following system calls were added: fcntl64(2), getdents64(2), stat64(2), statfs64(2), truncate64(2), and their counterparts that operate on file descriptors or symbolic links. These system calls supersede the old ones, which, with the exception of the "stat" calls, have the same names without the "64" suffix.

On newer platforms that only have 64-bit file access and 32-bit UIDs/GIDs (e.g., alpha, ia64, s390x, x86-64), there is just a single version of the system calls for UIDs/GIDs and file access. On platforms (typically 32-bit platforms) where the *64 and *32 calls exist, the other versions are obsolete.

* The rt_sig* calls were added in kernel 2.2 to support additional real-time signals (see signal(7)). These system calls supersede the old system calls with the same names but without the "rt_" prefix.
* The select(2) and mmap(2) system calls use five or more arguments, which caused problems with the way arguments were passed on i386. As a result, while on other architectures sys_select() and sys_mmap() correspond to __NR_select and __NR_mmap, on i386 these slots correspond to old_select() and old_mmap() (routines that take a pointer to a block of arguments). Today passing more than five arguments is no longer a problem, and there is a __NR__newselect that corresponds directly to sys_select(), with the same situation for __NR_mmap2.
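To illustrate why the 16-bit to 32-bit ID change listed above needed separate system calls, here is a sketch of what happens when a 32-bit user ID has to be reported through a 16-bit interface. The function name uid_for_16bit_call and the OVERFLOW_UID macro are invented for the example; the value 65534 is assumed here as the usual default "overflow" ID, and the logic only approximates what the kernel does for the legacy calls.

```c
#include <stdint.h>

/* Assumed default value reported when an ID does not fit in 16 bits. */
#define OVERFLOW_UID 65534u

/* Convert a 32-bit UID to what a legacy 16-bit call could report:
 * IDs that fit are passed through, larger ones collapse to the
 * overflow value instead of being silently truncated. */
uint16_t uid_for_16bit_call(uint32_t uid)
{
    if (uid & ~(uint32_t)0xFFFF)
        return (uint16_t)OVERFLOW_UID;
    return (uint16_t)uid;
}
```

With this sketch, a UID such as 1000 survives unchanged, while a UID such as 70000 is reported as 65534, which is why programs that need the real value must use the "32" variants.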

This material is a modification of the article of the same name by Vladimir Meshkov, published in the journal "System Administrator"

The original articles by Vladimir Meshkov from the magazine "System Administrator" can be found at the links below. Some of the example program sources have been changed and improved. (Example 4.2 has been heavily modified, since a slightly different system call had to be intercepted.) URLs: http://www.samag.ru/img/uploaded/p.pdf http://www.samag.ru/img/uploaded/a3.pdf



1. General view of Linux architecture

The most general view shows a two-level model of the system: the kernel on one side and programs on the other (kernel <=> programs). At the center of the system is the kernel. The kernel interacts directly with the computer hardware, isolating application programs from the details of the architecture. The kernel offers a set of services to application programs. Kernel services include I/O operations (opening, reading, writing, and managing files), creating and managing processes, synchronizing them, and interprocess communication. All applications request kernel services through system calls.

The second level is made up of applications or tasks: both system tasks, which determine the functionality of the system, and application tasks, which provide the Linux user interface. Despite the outward variety of applications, they all interact with the kernel in the same way.

Interaction with the kernel occurs through the standard system call interface. The system call interface is a set of kernel services and defines the format of requests for services. A process requests a service by making a system call to a specific kernel procedure, which looks like a regular library function call. The kernel executes the request on behalf of the process and returns the required data to the process.

In the following example, the program opens a file, reads data from it, and closes the file. The operations of opening (open), reading (read), and closing (close) the file are performed by the kernel at the request of the task, and the functions open(2), read(2), and close(2) are system calls.

    /* Source 1.0 */
    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        int fd;
        char buf[80];

        /* Open the file - get a file descriptor fd */
        fd = open("file1", O_RDONLY);

        /* Read 80 characters into the buffer buf */
        read(fd, buf, sizeof(buf));

        /* Close the file */
        close(fd);

        return 0;
    }
    /* EOF */

A complete list of Linux system calls can be found in /usr/include/asm/unistd.h. Let's now look at how the system calls in this example are made. When the program calls open(), the library code behind that function loads the system call number and the call's parameters into processor registers and then invokes interrupt 0x80. The following values are loaded into the processor registers:

to the EAX register - the number of the system call. In our case, the system call number is 5 (see __NR_open).
to the EBX register - the first parameter of the function (for open() it is a pointer to the string containing the name of the file being opened).
to the ECX register - the second parameter (the file access flags, here O_RDONLY)

The third parameter is loaded into the EDX register; in this case we do not have one. To perform a system call, Linux uses the system_call function, which is defined per architecture (in our case, i386) in the file /usr/src/linux/arch/i386/kernel/entry.S. This function is the entry point for all system calls. The kernel responds to interrupt 0x80 by calling system_call, which is in effect the handler for interrupt 0x80.

To make sure we're on the right track, let's look at the code for the open() function in the libc system library:

    # gdb -q /lib/libc.so.6
    (gdb) disas open
    Dump of assembler code for function open:
    0x000c8080:  call   0x1082be <__i686.get_pc_thunk.cx>
    0x000c8085:  add    $0x6423b,%ecx
    0x000c808b:  cmpl   $0x0,0x1a84(%ecx)
    0x000c8092:  jne    0xc80b1
    0x000c8094:  push   %ebx
    0x000c8095:  mov    0x10(%esp,1),%edx
    0x000c8099:  mov    0xc(%esp,1),%ecx
    0x000c809d:  mov    0x8(%esp,1),%ebx
    0x000c80a1:  mov    $0x5,%eax
    0x000c80a6:  int    $0x80
    ...

As you can see in the last lines, the parameters are loaded into the EDX, ECX, and EBX registers, and finally the EAX register is loaded with the system call number, which is 5, as we already know.

Now let's get back to the system call mechanism. The kernel calls the handler for interrupt 0x80, the system_call function. system_call pushes copies of the registers containing the call parameters onto the stack using the SAVE_ALL macro and then invokes the required kernel function with a call instruction. The table of pointers to the kernel functions that implement system calls is the sys_call_table array (see the file arch/i386/kernel/entry.S). The system call number in the EAX register is the index into this array. Thus, if EAX contains the value 5, the kernel function sys_open() is called. Why is the SAVE_ALL macro needed? The explanation is simple. Since almost all kernel system functions are written in C, they look for their parameters on the stack, and it is SAVE_ALL that puts the parameters there. The return value of the system call is stored in the EAX register.
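The dispatch step can be modeled in ordinary C. The following is a user-space sketch under simplifying assumptions, not the kernel's code: the struct regs layout, the two toy "system calls", and every sim_ name are invented for the illustration. It shows the three ingredients described above: saved registers, a bounds check on the call number, and an indexed call through a table of function pointers, with the result returned in eax.

```c
#include <errno.h>
#include <stddef.h>

/* Saved user registers, playing the role of what SAVE_ALL pushes. */
struct regs { long eax, ebx, ecx, edx; };

typedef long (*syscall_fn)(long, long, long);

/* Two toy "system calls" for the demonstration. */
static long sim_echo(long a, long b, long c) { (void)b; (void)c; return a; }
static long sim_add(long a, long b, long c)  { (void)c; return a + b; }

/* The pretend call table; slot 0 is deliberately left empty. */
static const syscall_fn sim_call_table[] = { NULL, sim_echo, sim_add };
#define SIM_NR_CALLS (sizeof sim_call_table / sizeof sim_call_table[0])

/* Model of the entry point: eax selects the handler, ebx/ecx/edx
 * carry the arguments, and the result is written back to eax. */
void sim_system_call(struct regs *r)
{
    unsigned long nr = (unsigned long)r->eax;

    if (nr >= SIM_NR_CALLS || sim_call_table[nr] == NULL) {
        r->eax = -ENOSYS;               /* no such system call */
        return;
    }
    r->eax = sim_call_table[nr](r->ebx, r->ecx, r->edx);
}

/* Convenience wrapper: fill the "registers", trap, read the result. */
long sim_invoke(long nr, long a, long b, long c)
{
    struct regs r = { nr, a, b, c };
    sim_system_call(&r);
    return r.eax;
}
```

For example, sim_invoke(2, 40, 2, 0) goes through slot 2 and adds its first two arguments, while an out-of-range number comes back as -ENOSYS, mirroring the range check the real entry code performs before the indexed call.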

Now let's figure out how to intercept the system call. The mechanism of loadable kernel modules will help us with this.

2. Loadable kernel module

A Loadable Kernel Module (LKM) is code that runs in kernel space. The main feature of an LKM is that it can be loaded and unloaded dynamically, without rebooting the system or recompiling the kernel.

Each LKM contains at least two functions:

the module initialization function, called when the LKM is loaded into memory: int init_module(void) { ... }
the module unload function: void cleanup_module(void) { ... }

Here is an example of the simplest module:

    /* Source 2.0 */
    #include <linux/module.h>

    int init_module(void)
    {
        printk("Hello World\n");
        return 0;
    }

    void cleanup_module(void)
    {
        printk("Bye\n");
    }
    /* EOF */

Compile and load the module. A module is loaded into memory with the insmod command, and the loaded modules are listed with the lsmod command:

    # gcc -c -DMODULE -I /usr/src/linux/include/ src-2.0.c
    # insmod src-2.0.o
    Warning: loading src-2.0.o will taint the kernel: no license
    Module src-2.0 loaded, with warnings
    # dmesg | tail -n 1
    Hello World
    # lsmod | grep src
    src-2.0                  336   0  (unused)
    # rmmod src-2.0
    # dmesg | tail -n 1
    Bye

3. Algorithm for intercepting a system call based on LKM

To implement a module that intercepts a system call, it is necessary to define an interception algorithm. The algorithm is the following:

save a pointer to the original call so that it can be restored
create a function that implements the new system call
replace calls in the sys_call_table system call table, i.e. set the corresponding pointer to a new system call
at the end of work (when the module is unloaded), restore the original system call using the previously saved pointer
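The four steps above can be tried out safely in user space before touching a real kernel. The sketch below is a simulation only: sim_table stands in for sys_call_table, real_mkdir merely counts calls instead of creating directories, and all names are made up for the example.

```c
#include <stddef.h>

enum { SIM_NR_MKDIR = 0, SIM_NR_MAX };

typedef int (*mkdir_fn)(const char *path);

static int dirs_created;   /* how many times the "real" call ran */

/* Stand-in for the original kernel function. */
static int real_mkdir(const char *path) { (void)path; dirs_created++; return 0; }

/* Stand-in for sys_call_table. */
static mkdir_fn sim_table[SIM_NR_MAX] = { real_mkdir };

/* Step 1 storage: the saved pointer to the original call. */
static mkdir_fn orig_mkdir;

/* Step 2: the replacement, which reports success but does nothing. */
static int own_mkdir(const char *path) { (void)path; return 0; }

/* Steps 1 and 3: save the original pointer, then install ours. */
void hook_install(void)
{
    orig_mkdir = sim_table[SIM_NR_MKDIR];
    sim_table[SIM_NR_MKDIR] = own_mkdir;
}

/* Step 4: put the saved pointer back. */
void hook_remove(void)
{
    sim_table[SIM_NR_MKDIR] = orig_mkdir;
}

/* A "process" making the call through the table. */
int call_mkdir(const char *path) { return sim_table[SIM_NR_MKDIR](path); }

/* Run one call before, during, and after the hook; return how many
 * calls actually reached real_mkdir. */
int hook_demo(void)
{
    dirs_created = 0;
    call_mkdir("before");    /* reaches real_mkdir */
    hook_install();
    call_mkdir("hooked");    /* swallowed by own_mkdir */
    hook_remove();
    call_mkdir("after");     /* reaches real_mkdir again */
    return dirs_created;
}
```

In this simulation only two of the three calls reach real_mkdir, which is the same effect the kernel modules in the next section aim for.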

Tracing lets you find out which system calls a user application makes. By tracing, you can determine which system call has to be intercepted in order to take control of the application.

    # ltrace -S ./src-1.0
    ...
    open("file1", 0, 01
    SYS_open("file1", 0, 01) = 3
    <... open resumed>) = 3
    read(3,
    SYS_read(3, "123\n", 80) = 4
    <... read resumed>"123\n", 80) = 4
    close(3
    SYS_close(3) = 0
    <... close resumed>) = 0
    ...

Now we have enough information to start studying example modules that intercept system calls.

4. Examples of intercepting system calls based on LKM

4.1 Disable directory creation

When a directory is created, the sys_mkdir kernel function is called. Its parameter is a string containing the name of the directory to be created. Consider the code that intercepts the corresponding system call.

    /* Source 4.1 */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <sys/syscall.h>

    /* Import the system call table */
    extern void *sys_call_table[];

    /* Define a pointer to store the original call */
    int (*orig_mkdir)(const char *path);

    /* Create our own system call. Our call does nothing,
       it just returns zero */
    int own_mkdir(const char *path)
    {
        return 0;
    }

    /* During module initialization, save the pointer to the
       original call and replace the system call */
    int init_module(void)
    {
        orig_mkdir = sys_call_table[SYS_mkdir];
        sys_call_table[SYS_mkdir] = own_mkdir;
        printk("sys_mkdir replaced\n");
        return 0;
    }

    /* On unload, restore the original call */
    void cleanup_module(void)
    {
        sys_call_table[SYS_mkdir] = orig_mkdir;
        printk("sys_mkdir moved back\n");
    }
    /* EOF */

To get the object module, run the following command, then run some experiments on the system:

    # gcc -c -DMODULE -I/usr/src/linux/include/ src-3.1.c
    # dmesg | tail -n 1
    sys_mkdir replaced
    # mkdir test
    # ls -ald test
    ls: test: No such file or directory
    # rmmod src-3.1
    # dmesg | tail -n 1
    sys_mkdir moved back
    # mkdir test
    # ls -ald test
    drwxr-xr-x 2 root root 4096 2003-12-23 03:46 test

As you can see, while the module is loaded the "mkdir" command does not work, or rather, nothing happens. Unloading the module is enough to restore normal behavior, which is what was done above.

4.2 Hiding a file entry in a directory

Let's determine which system call is responsible for reading the contents of a directory. To do this, we will write another test fragment that reads the current directory:

    /* Source 4.2.1 */
    #include <sys/types.h>
    #include <dirent.h>

    int main()
    {
        DIR *d;
        struct dirent *dp;

        d = opendir(".");
        dp = readdir(d);

        return 0;
    }
    /* EOF */

Build the executable and trace it:

    # gcc -o src-3.2.1 src-3.2.1.c
    # ltrace -S ./src-3.2.1
    ...
    opendir("."
    SYS_open(".", 100352, 010005141300) = 3
    SYS_fstat64(3, 0xbffff79c, 0x4014c2c0, 3, 0xbffff874) = 0
    SYS_fcntl64(3, 2, 1, 1, 0x4014c2c0) = 0
    SYS_brk(NULL) = 0x080495f4
    SYS_brk(0x0806a5f4) = 0x0806a5f4
    SYS_brk(NULL) = 0x0806a5f4
    SYS_brk(0x0806b000) = 0x0806b000
    <... opendir resumed>) = 0x08049648
    readdir(0x08049648
    SYS_getdents64(3, 0x08049678, 4096, 0x40014400, 0x4014c2c0) = 528
    <... readdir resumed>) = 0x08049678
    ...

Pay attention to the next-to-last line. The contents of the directory are read by the getdents64 function (on other kernels it may be getdents). The result is stored as a list of structures of type struct dirent, and the function itself returns the total length in bytes of the entries it read. We are interested in two fields of this structure:

d_reclen - record size
d_name - file name
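To see how d_reclen is used to step from one record to the next, here is a small user-space sketch that calls getdents64 directly through syscall() and counts the entries of a directory. The record layout is declared locally under the name linux_dirent64, following the layout documented for getdents64(2); the function name count_dir_entries and the 4096-byte buffer size are choices made for this example, and the code is Linux-specific.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Layout of one record returned by getdents64. */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;   /* size of this whole record, in bytes */
    unsigned char  d_type;
    char           d_name[];   /* NUL-terminated file name */
};

/* Count the entries of a directory; returns -1 on error. */
long count_dir_entries(const char *path)
{
    char buf[4096];
    long total = 0;
    int fd = open(path, O_RDONLY | O_DIRECTORY);

    if (fd < 0)
        return -1;

    for (;;) {
        /* The return value is the number of bytes filled in. */
        long nread = syscall(SYS_getdents64, fd, buf, sizeof buf);
        long pos = 0;

        if (nread < 0) { close(fd); return -1; }
        if (nread == 0) break;                 /* end of directory */

        while (pos < nread) {
            unsigned short reclen;

            /* Copy the field out to avoid unaligned access. */
            memcpy(&reclen,
                   buf + pos + offsetof(struct linux_dirent64, d_reclen),
                   sizeof reclen);
            if (reclen == 0) { close(fd); return -1; }
            total++;
            pos += reclen;                     /* step to the next record */
        }
    }
    close(fd);
    return total;
}
```

The name of each record starts at offsetof(struct linux_dirent64, d_name) bytes into the record, so printing or comparing names only needs one more pointer addition inside the inner loop.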

To hide a file's directory entry (in other words, to make the file invisible), you need to intercept the sys_getdents64 system call, find the corresponding entry in the list of returned structures, and delete it. Consider the code that performs this operation (the author of the original code is Michal Zalewski):

    /* Source 4.2.2 */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/types.h>
    #include <linux/dirent.h>
    #include <linux/slab.h>
    #include <linux/string.h>
    #include <sys/syscall.h>
    #include <asm/uaccess.h>

    extern void *sys_call_table[];

    int (*orig_getdents)(u_int fd, struct dirent *dirp, u_int count);

    /* Define our system call */
    int own_getdents(u_int fd, struct dirent *dirp, u_int count)
    {
        unsigned int n;
        int tmp, t;

        struct dirent64 {
            int d_ino1, d_ino2;
            int d_off1, d_off2;
            unsigned short d_reclen;
            unsigned char d_type;
            char d_name[0];
        } *dirp2, *dirp3;

        /* The name of the file we want to hide */
        char hide[] = "file1";

        /* Determine the length of the entries in the directory */
        tmp = (*orig_getdents)(fd, dirp, count);

        if (tmp > 0) {
            /* Allocate memory for the structure in kernel space
               and copy the contents of the directory into it */
            dirp2 = (struct dirent64 *)kmalloc(tmp, GFP_KERNEL);
            if (dirp2 == NULL)
                return tmp;     /* out of memory: leave the listing as is */
            copy_from_user(dirp2, dirp, tmp);

            /* Use a second pointer to walk the entries and save the
               length of the entries in the directory */
            dirp3 = dirp2;
            t = tmp;

            /* Start looking for our file */
            while (t > 0) {
                /* Read the length of the current entry and determine
                   the remaining length of the entries in the directory */
                n = dirp3->d_reclen;
                if (n == 0)
                    break;      /* malformed entry: stop scanning */
                t -= n;

                /* Check if the file name in the current entry matches
                   the one we are looking for */
                if (strstr(dirp3->d_name, hide) != NULL) {
                    /* If so, overwrite the entry with the ones that
                       follow it and calculate the new total length.
                       The regions overlap, hence memmove. The pointer
                       stays where it is, because the next entry now
                       starts here. */
                    memmove(dirp3, (char *)dirp3 + n, t);
                    tmp -= n;
                } else {
                    /* Otherwise move the pointer to the next entry
                       and continue searching */
                    dirp3 = (struct dirent64 *)((char *)dirp3 + n);
                }
            }

            /* Return the result and free the memory */
            copy_to_user(dirp, dirp2, tmp);
            kfree(dirp2);
        }

        /* Return the length of the entries in the directory */
        return tmp;
    }

    /* Module initialization and unloading functions have the
       standard form */
    int init_module(void)
    {
        orig_getdents = sys_call_table[SYS_getdents64];
        sys_call_table[SYS_getdents64] = own_getdents;
        return 0;
    }

    void cleanup_module()
    {
        sys_call_table[SYS_getdents64] = orig_getdents;
    }
    /* EOF */

After compiling and loading this module, "file1" disappears from directory listings, which is what we set out to show.
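The record-removal logic can be tested in user space without a kernel module. The sketch below uses a deliberately simplified record format (a 2-byte length followed by the name) rather than the real dirent64 layout, and all function names are invented for the example. It shows the two details that matter: the regions being moved overlap, so memmove is the safe choice, and after deleting a record the scan position must not advance, because the next record has just been moved into the current position.

```c
#include <stddef.h>
#include <string.h>

/* Read the 2-byte length stored at the start of a record. */
static unsigned short rec_len(const char *rec)
{
    unsigned short n;
    memcpy(&n, rec, sizeof n);
    return n;
}

/* The name follows the length field. */
static const char *rec_name(const char *rec)
{
    return rec + sizeof(unsigned short);
}

/* Append a record to buf; returns the new used length.
 * Helper for building test data only. */
size_t rec_append(char *buf, size_t used, const char *name)
{
    unsigned short n = (unsigned short)(sizeof n + strlen(name) + 1);
    memcpy(buf + used, &n, sizeof n);
    strcpy(buf + used + sizeof n, name);
    return used + n;
}

/* Remove every record whose name contains `hide`.
 * Returns the new total length of the records in buf. */
size_t hide_records(char *buf, size_t len, const char *hide)
{
    size_t pos = 0;

    while (pos < len) {
        unsigned short n = rec_len(buf + pos);

        if (n == 0)
            break;                      /* malformed data: stop */
        if (strstr(rec_name(buf + pos), hide) != NULL) {
            /* Slide the following records over this one. */
            memmove(buf + pos, buf + pos + n, len - pos - n);
            len -= n;                   /* do not advance pos */
        } else {
            pos += n;
        }
    }
    return len;
}

/* Build four records, hide two of them, count the survivors. */
int hide_demo(void)
{
    char buf[64];
    size_t len = 0, pos;
    int count = 0;

    len = rec_append(buf, len, "a.txt");
    len = rec_append(buf, len, "file1");
    len = rec_append(buf, len, "file1.bak");
    len = rec_append(buf, len, "z");

    len = hide_records(buf, len, "file1");

    for (pos = 0; pos < len; pos += rec_len(buf + pos))
        count++;
    return count;
}
```

The demo places two matching records next to each other on purpose: a loop that advanced the position after every deletion would skip the second one.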

5. Direct access to the kernel address space through /dev/kmem

Let us first consider theoretically how interception is carried out by the method of direct access to the address space of the kernel, and then proceed to practical implementation.

Direct access to the kernel address space is provided by the /dev/kmem device file. This file maps the entire available virtual address space, including the swap area. The standard system functions open(), read(), and write() are used to work with it. Having opened /dev/kmem in the usual way, we can refer to any address in the system by using it as an offset into this file. This method was developed by Silvio Cesare.

System functions are accessed by loading the function parameters into processor registers and then calling software interrupt 0x80. The handler for this interrupt, the system_call function, pushes the call parameters onto the stack, retrieves the address of the called system function from the sys_call_table table, and transfers control to that address.

With full access to the kernel address space, we can get the entire contents of the system call table, i.e. addresses of all system functions. By changing the address of any system call, we thereby intercept it. But for this you need to know the address of the table, or, in other words, the offset in the /dev/kmem file at which this table is located.

To determine the address of the sys_call_table table, you must first calculate the address of the system_call function. Since this function is an interrupt handler, let's look at how interrupts are handled in protected mode.

In real mode, when an interrupt occurs, the processor consults the interrupt vector table, which always sits at the very beginning of memory and contains two-word addresses of the interrupt handlers. In protected mode, the counterpart of the interrupt vector table is the Interrupt Descriptor Table (IDT), which is maintained by the protected-mode operating system. For the processor to be able to access this table, its address must be loaded into the Interrupt Descriptor Table Register (IDTR). The IDT contains interrupt handler descriptors, which include, among other things, the handlers' addresses. These descriptors are called gates. When an interrupt occurs, the processor fetches the gate from the IDT by the interrupt number, determines the address of the handler, and transfers control to it.

To find the address of the system_call function from the IDT, we need to extract the gate for interrupt 0x80 and, from it, the address of the corresponding handler, i.e. the address of system_call. Inside system_call, the sys_call_table table is accessed by the instruction call *<table_address>(,%eax,4). Having found the opcode (signature) of this instruction in the /dev/kmem file, we will also have found the address of the system call table.
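The arithmetic involved in going from the IDT base to the handler address is short enough to sketch. The code below assumes the 32-bit x86 gate layout (8 bytes per gate, with the handler offset split into a low and a high 16-bit half); the struct and function names are invented for the example. In a real program the IDT base would come from the sidt instruction and the gate bytes would be read from /dev/kmem; here both are simply passed in.

```c
#include <stdint.h>

/* One 8-byte IDT gate on 32-bit x86. */
struct idt_gate {
    uint16_t off_lo;     /* handler address, bits 0..15  */
    uint16_t selector;   /* code segment selector        */
    uint8_t  reserved;
    uint8_t  flags;      /* gate type, DPL, present bit  */
    uint16_t off_hi;     /* handler address, bits 16..31 */
};

_Static_assert(sizeof(struct idt_gate) == 8, "a gate must be 8 bytes");

/* Address of the gate for a given interrupt vector. */
uint32_t gate_address(uint32_t idt_base, unsigned vector)
{
    return idt_base + (uint32_t)sizeof(struct idt_gate) * vector;
}

/* Handler address stored in a gate: join the two halves. */
uint32_t gate_handler(const struct idt_gate *g)
{
    return ((uint32_t)g->off_hi << 16) | g->off_lo;
}
```

For interrupt 0x80 the gate therefore sits 0x400 bytes past the IDT base, and a gate whose halves are 0xc019 and 0x4cbc describes a handler at 0xc0194cbc.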

To determine the opcode, let's use the debugger and disassemble the system_call function:

    # gdb -q /usr/src/linux/vmlinux
    (gdb) disas system_call
    Dump of assembler code for function system_call:
    0xc0194cbc:  push   %eax
    0xc0194cbd:  cld
    0xc0194cbe:  push   %es
    0xc0194cbf:  push   %ds
    0xc0194cc0:  push   %eax
    0xc0194cc1:  push   %ebp
    0xc0194cc2:  push   %edi
    0xc0194cc3:  push   %esi
    0xc0194cc4:  push   %edx
    0xc0194cc5:  push   %ecx
    0xc0194cc6:  push   %ebx
    0xc0194cc7:  mov    $0x18,%edx
    0xc0194ccc:  mov    %edx,%ds
    0xc0194cce:  mov    %edx,%es
    0xc0194cd0:  mov    $0xffffe000,%ebx
    0xc0194cd5:  and    %esp,%ebx
    0xc0194cd7:  testb  $0x2,0x18(%ebx)
    0xc0194cdb:  jne    0xc0194d3c
    0xc0194cdd:  cmp    $0x10e,%eax
    0xc0194ce2:  jae    0xc0194d69
    0xc0194ce8:  call   *0xc02cbb0c(,%eax,4)
    0xc0194cef:  mov    %eax,0x18(%esp,1)
    0xc0194cf3:  nop
    End of assembler dump.

The line "call *0xc02cbb0c(,%eax,4)" is the call through the sys_call_table table. The value 0xc02cbb0c is the address of the table (your numbers will most likely be different). Get the opcode of this instruction:

    (gdb) x/xw system_call+44
    0xc0194ce8:  0x0c8514ff

We have found the opcode of the instruction that references sys_call_table. It is \xff\x14\x85. The 4 bytes that follow it are the address of the table. You can verify this by entering the command:

    (gdb) x/xw system_call+44+3
    0xc0194ceb:  0xc02cbb0c

Thus, by finding the sequence \xff\x14\x85 in the /dev/kmem file and reading the 4 bytes that follow it, we get the address of the sys_call_table system call table. Knowing its address, we can read the contents of this table (the addresses of all system functions) and change the address of any system call, thereby intercepting it.
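The signature search itself is an ordinary byte scan. Below is a sketch that looks for \xff\x14\x85 in a buffer and decodes the following 4 bytes as a little-endian 32-bit address. The function name find_call_table is made up, and the buffer is assumed to already hold bytes read from the start of system_call (in the method described here they would come from /dev/kmem).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scan `code` for the bytes ff 14 85 (an indirect call of the form
 * call *disp32(,%eax,4)) and return the 32-bit displacement that
 * follows, i.e. the table address. Returns 0 if not found. */
uint32_t find_call_table(const unsigned char *code, size_t len)
{
    static const unsigned char sig[3] = { 0xff, 0x14, 0x85 };
    size_t i;

    for (i = 0; i + sizeof sig + 4 <= len; i++) {
        if (memcmp(code + i, sig, sizeof sig) == 0) {
            const unsigned char *p = code + i + sizeof sig;

            /* i386 is little-endian: lowest byte comes first. */
            return (uint32_t)p[0]
                 | (uint32_t)p[1] << 8
                 | (uint32_t)p[2] << 16
                 | (uint32_t)p[3] << 24;
        }
    }
    return 0;
}
```

Given the bytes ff 14 85 0c bb 2c c0 from the dump above, the function yields 0xc02cbb0c. Limiting the scan to a few hundred bytes after the start of system_call keeps the chance of a false match low.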

Consider the pseudocode that performs the interception operation:

    readaddr(old_syscall, sct + SYS_CALL*4, 4);
    writeaddr(new_syscall, sct + SYS_CALL*4, 4);

The readaddr function reads the address of the system call from the system call table and stores it in the old_syscall variable. Each entry in the sys_call_table table takes 4 bytes. The required address is located at offset sct + SYS_CALL*4 in the /dev/kmem file (here sct is the address of the sys_call_table table and SYS_CALL is the number of the system call). The writeaddr function overwrites the address of the SYS_CALL system call with the address of the new_syscall function, after which all invocations of the SYS_CALL system call are serviced by that function.
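One possible way to turn this pseudocode into C is with pread() and pwrite(), which read and write at a given file offset. The sketch below is an assumption about how such helpers could look, not the article's implementation: the names read_slot, write_slot, and swap_entry are invented, entries are taken to be 4 bytes as on i386, and the demo runs against a temporary file standing in for /dev/kmem so that it is safe to execute.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Read one 4-byte table entry at `offset`; 0 on success, -1 on error. */
int read_slot(int fd, uint32_t *value, off_t offset)
{
    return pread(fd, value, sizeof *value, offset)
               == (ssize_t)sizeof *value ? 0 : -1;
}

/* Write one 4-byte table entry at `offset`; 0 on success, -1 on error. */
int write_slot(int fd, uint32_t value, off_t offset)
{
    return pwrite(fd, &value, sizeof value, offset)
               == (ssize_t)sizeof value ? 0 : -1;
}

/* Replace entry `nr` of the table at offset `sct`, saving the old value. */
int swap_entry(int fd, off_t sct, unsigned nr,
               uint32_t new_call, uint32_t *old_call)
{
    off_t slot = sct + (off_t)nr * 4;

    if (read_slot(fd, old_call, slot) != 0)
        return -1;
    return write_slot(fd, new_call, slot);
}

/* Demo on a temporary file that plays the role of /dev/kmem.
 * Returns 1 if the swap behaved as expected, 0 or -1 otherwise. */
int swap_demo(void)
{
    const uint32_t table[4] = { 0x1000, 0x2000, 0x3000, 0x4000 };
    uint32_t old = 0, now = 0;
    int fd, ok;
    FILE *f = tmpfile();

    if (f == NULL)
        return -1;
    fd = fileno(f);

    /* Place the pretend table at offset 64. */
    ok = pwrite(fd, table, sizeof table, 64) == (ssize_t)sizeof table
      && swap_entry(fd, 64, 2, 0xdeadbeefu, &old) == 0
      && read_slot(fd, &now, 64 + 2 * 4) == 0
      && old == 0x3000 && now == 0xdeadbeefu;

    fclose(f);
    return ok;
}
```

Keeping the old value returned by swap_entry is what later makes it possible to restore the original entry with a single write_slot call.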

It seems that everything is simple and the goal is achieved. However, remember that we are working in the user's address space. If we place the new system function in this address space, then calling it will give us nothing but an error message. Hence the conclusion: the new system call must be placed in the address space of the kernel. This takes two steps: obtain a block of memory in kernel space, and place the new system call in this block.

You can allocate memory in kernel space using the kmalloc function. But you cannot directly call a kernel function from the user's address space, so we use the following algorithm:

knowing the address of the sys_call_table table, get the address of some system call (for example, sys_mkdir)
define a function that calls kmalloc and returns a pointer to a block of memory in the kernel's address space; let's call this function get_kmalloc
save the first N bytes of the sys_mkdir system call, where N is the size of the get_kmalloc function
overwrite the first N bytes of sys_mkdir with the get_kmalloc function
invoke the sys_mkdir system call, thereby running the get_kmalloc function
restore the first N bytes of the sys_mkdir system call

As a result, we will have a block of memory located in kernel space.
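The save, overwrite, invoke, restore sequence from the list above can be shown in miniature on an ordinary byte array. The sketch below is a user-space simulation only: it does not touch /dev/kmem, the names patch_call_restore, probe and probe_trigger are hypothetical, and the callback merely stands in for the real mkdir() call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PATCH 256   /* assumed upper bound on N for this sketch */

/* Temporarily replace the first n bytes of `target` with `payload`,
 * invoke `trigger`, then put the original bytes back.
 * Returns 0 on success, -1 if n exceeds MAX_PATCH. */
int patch_call_restore(unsigned char *target, const unsigned char *payload,
                       size_t n, void (*trigger)(void *), void *arg)
{
    unsigned char saved[MAX_PATCH];

    assert(target != NULL && payload != NULL && trigger != NULL);
    if (n > MAX_PATCH)
        return -1;
    memcpy(saved, target, n);       /* save the first N bytes */
    memcpy(target, payload, n);     /* overwrite them with the payload */
    trigger(arg);                   /* run while the patch is in place */
    memcpy(target, saved, n);       /* restore the original bytes */
    return 0;
}

/* Demo trigger: records the first byte it sees at the target. */
struct probe {
    const unsigned char *target;
    unsigned char first;
};

void probe_trigger(void *arg)
{
    struct probe *p = arg;
    p->first = p->target[0];
}
```

The point of the pattern is that the patched bytes exist only for the duration of one call, so the borrowed routine is back to normal as soon as the helper has run.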

But to implement this algorithm, we need the address of the kmalloc function. It can be found in several ways. The simplest is to read the address from the System.map file or to determine it with the gdb debugger (print &kmalloc). If the kernel has module support enabled, the kmalloc address can be obtained with the get_kernel_syms() function; this option is used in the example in section 6. If there is no support for kernel modules, the address of kmalloc has to be found by searching for the opcode of an instruction that calls kmalloc, similar to what was done for the sys_call_table table.
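For the System.map route, a lookup could look like the following sketch. It relies on the usual layout of that file, one symbol per line as a hexadecimal address, a one-letter type and a name; the function name lookup_symbol is hypothetical, and the caller is assumed to have opened the right map file for the running kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up `name` in a System.map-style stream whose lines have the
 * form "<hex address> <type> <symbol>". On success stores the address
 * in *addr and returns 0; returns -1 if the symbol is not found. */
int lookup_symbol(FILE *map, const char *name, unsigned long *addr)
{
    char line[256], sym[128], type;
    unsigned long value;

    assert(map != NULL && name != NULL && addr != NULL);
    while (fgets(line, sizeof(line), map) != NULL) {
        if (sscanf(line, "%lx %c %127s", &value, &type, sym) == 3 &&
            strcmp(sym, name) == 0) {
            *addr = value;
            return 0;
        }
    }
    return -1;
}
```

The comparison is an exact strcmp rather than a prefix match, so a symbol whose name merely begins with "kmalloc" is not returned by mistake.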

The kmalloc function takes two parameters: the size of the requested memory and the GFP specifier. To find the opcode, we will use the debugger and disassemble any kernel function that contains a call to the kmalloc function.

# gdb -q /usr/src/linux/vmlinux
(gdb) disas inter_module_register
Dump of assembler code for function inter_module_register:
0xc01a57b4: push %ebp
0xc01a57b5: push %edi
0xc01a57b6: push %esi
0xc01a57b7: push %ebx
0xc01a57b8: sub $0x10,%esp
0xc01a57bb: mov 0x24(%esp,1),%ebx
0xc01a57bf: mov 0x28(%esp,1),%esi
0xc01a57c3: mov 0x2c(%esp,1),%ebp
0xc01a57c7: movl $0x1f0,0x4(%esp,1)
0xc01a57cf: movl $0x14,(%esp,1)
0xc01a57d6: call 0xc01bea2a
...

Whatever the function does, it contains what we need: a call to the kmalloc function. Pay attention to the last lines. First the parameters are placed on the stack (the esp register points to the top of the stack), and then the function call follows. The GFP specifier is placed on the stack first (movl $0x1f0,0x4(%esp,1)). For kernel versions 2.4.9 and higher, this value is 0x1f0. Find the opcode for this instruction:

(gdb) x/xw inter_module_register+19
0xc01a57c7: 0x042444c7

If we find this opcode, we can calculate the address of the kmalloc function. At first glance, the address of this function is the argument of the call instruction, but this is not entirely true. Unlike in the system_call function, here the operand of the instruction is not the kmalloc address itself but the offset to it relative to the current address. We can verify this by examining the opcode of the instruction call 0xc01bea2a:

(gdb) x/xw inter_module_register+34
0xc01a57d6: 0x01924fe8

The first byte is e8, which is the opcode of the call instruction. Find the value of this instruction's argument:

(gdb) x/xw inter_module_register+35
0xc01a57d7: 0x0001924f

Now if we add the current address 0xc01a57d6, the offset 0x0001924f and the 5 bytes of the instruction, we get the required address of the kmalloc function: 0xc01bea2a.
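The address arithmetic for a relative call can be written as a small C function. This is a sketch assuming 32-bit x86 and a five-byte "call rel32" instruction; the name call_target is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Given the address of a 5-byte "call rel32" instruction and its bytes,
 * return the absolute target: address + 5 + 32-bit displacement.
 * Returns 0 if the first byte is not the e8 opcode. */
uint32_t call_target(uint32_t insn_addr, const unsigned char *insn)
{
    uint32_t rel;

    assert(insn != NULL);
    if (insn[0] != 0xe8)
        return 0;
    /* The displacement is stored little-endian after the opcode. */
    rel = (uint32_t)insn[1] | ((uint32_t)insn[2] << 8) |
          ((uint32_t)insn[3] << 16) | ((uint32_t)insn[4] << 24);
    /* Unsigned wrap-around also handles negative displacements. */
    return insn_addr + 5u + rel;
}
```

With the values from the gdb session, the instruction at 0xc01a57d6 with bytes e8 4f 92 01 00 resolves to 0xc01bea2a, matching the address shown in the disassembly.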

This concludes the theoretical calculations and, using the above technique, we will intercept the sys_mkdir system call.

6. An example of interception using /dev/kmem

/* source 6.0 */
/* Assumed header list: the headers this code needs on a 2.4 system */
#define _GNU_SOURCE             /* for memmem() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <linux/module.h>       /* struct kernel_sym, get_kernel_syms() */

/* System call number to intercept */
#define _SYS_MKDIR_ 39
#define KMEM_FILE "/dev/kmem"
#define MAX_SYMS 4096

/* IDTR register format */
struct {
    unsigned short limit;
    unsigned int base;
} __attribute__ ((packed)) idtr;

/* IDT interrupt gate format */
struct {
    unsigned short off1;
    unsigned short sel;
    unsigned char none, flags;
    unsigned short off2;
} __attribute__ ((packed)) idt;

/* Argument block for the get_kmalloc function */
struct kma_struc {
    ulong (*kmalloc) (uint, int);   /* address of the kmalloc function */
    int size;                       /* size of the memory to allocate */
    int flags;                      /* GFP flag, 0x1f0 for kernels > 2.4.9 */
    ulong mem;                      /* receives the allocated address */
} __attribute__ ((packed)) kmalloc;

/* A function that only allocates a block of memory in the kernel address space */
int get_kmalloc(struct kma_struc *k)
{
    k->mem = k->kmalloc(k->size, k->flags);
    return 0;
}

/* Returns the address of a kernel symbol (needed for the kmalloc lookup) */
ulong get_sym(char *n)
{
    struct kernel_sym tab[MAX_SYMS];
    int numsyms;
    int i;

    numsyms = get_kernel_syms(NULL);
    if (numsyms > MAX_SYMS || numsyms < 0)
        return 0;
    get_kernel_syms(tab);
    for (i = 0; i < numsyms; i++) {
        if (!strncmp(n, tab[i].name, strlen(n)))
            return tab[i].value;
    }
    return 0;
}

/* Our new system function; it does nothing ;) */
int new_mkdir(const char *path)
{
    return 0;
}

/* Reads size bytes from /dev/kmem at offset into buf */
static inline int rkm(int fd, uint offset, void *buf, uint size)
{
    if (lseek(fd, offset, 0) != offset) {
        printf("lseek err\n");
        return 0;
    }
    if (read(fd, buf, size) != size)
        return 0;
    return size;
}

/* The same, but writes to /dev/kmem */
static inline int wkm(int fd, uint offset, void *buf, uint size)
{
    if (lseek(fd, offset, 0) != offset)
        return 0;
    if (write(fd, buf, size) != size)
        return 0;
    return size;
}

/* Reads 4 bytes of data from /dev/kmem */
static inline int rkml(int fd, uint offset, ulong *buf)
{
    return rkm(fd, offset, buf, sizeof(ulong));
}

/* The same, but writes */
static inline int wkml(int fd, uint offset, ulong buf)
{
    return wkm(fd, offset, &buf, sizeof(ulong));
}

/* Gets the address of sys_call_table */
ulong get_sct(int kmem)
{
    ulong sys_call_off;     /* address of the int $0x80 handler (system_call) */
    char *p;
    char sc_asm[128];

    asm("sidt %0" : "=m" (idtr));
    if (!rkm(kmem, idtr.base + (8 * 0x80), &idt, sizeof(idt)))
        return 0;
    sys_call_off = (idt.off2 << 16) | idt.off1;
    if (!rkm(kmem, sys_call_off, sc_asm, sizeof(sc_asm)))
        return 0;
    /* leave room for the 4 address bytes that follow the opcode */
    p = (char *)memmem(sc_asm, sizeof(sc_asm) - 4, "\xff\x14\x85", 3);
    if (!p)                 /* check before stepping past the opcode */
        return 0;
    p += 3;
    printf("call for sys_call_table at %p\n", (void *)p);
    return *(ulong *)p;
}

/* Determines the address of the kmalloc function */
ulong get_kma(ulong pgoff)
{
    uint i;
    unsigned char buf[0x10000], *p, *p1;
    size_t rest;
    int kmemz;
    ulong ret;

    ret = get_sym("kmalloc");
    if (ret) {
        printf("\nZer gut!\n");
        return ret;
    }

    kmemz = open("/dev/kmem", O_RDONLY);
    if (kmemz < 0)
        return 0;

    for (i = pgoff + 0x100000; i < (pgoff + 0x1000000); i += 0x10000) {
        if (!rkm(kmemz, i, buf, sizeof(buf))) {
            close(kmemz);
            return 0;
        }
        /* 68 f0 01 00 is the start of "push $0x1f0"; a kernel compiled like
           the one disassembled above stores the flag with
           movl $0x1f0,0x4(%esp) instead (c7 44 24 04 f0 01 00 00) */
        p1 = memmem(buf, sizeof(buf), "\x68\xf0\x01\x00", 4);
        if (p1) {
            /* search only what is left of the buffer, keeping 4 bytes
               in reserve for the call displacement */
            rest = sizeof(buf) - (p1 + 4 - buf);
            p = rest > 4 ? memmem(p1 + 4, rest - 4, "\xe8", 1) : NULL;
            if (p) {
                p++;        /* skip the e8 opcode */
                close(kmemz);
                return *(unsigned long *)p + i + (p - buf) + 4;
            }
        }
    }
    close(kmemz);
    return 0;
}

int main()
{
    int kmem;
    /* !! - empty, the values must be filled in as described below */
    ulong get_kmalloc_size;     /* size of the get_kmalloc function !! */
    ulong get_kmalloc_addr;     /* address of the get_kmalloc function !! */
    ulong new_mkdir_size;       /* size of the interceptor function !! */
    ulong new_mkdir_addr;       /* address of the interceptor function !! */
    ulong sys_mkdir_addr;       /* address of the sys_mkdir system call */
    ulong page_offset;          /* lower bound of the kernel address space */
    ulong sct;                  /* address of the sys_call_table table */
    ulong kma;                  /* address of the kmalloc function */
    unsigned char tmp[1024];

    kmem = open(KMEM_FILE, O_RDWR, 0);
    if (kmem < 0)
        return 1;

    sct = get_sct(kmem);
    page_offset = sct & 0xF0000000;
    kma = get_kma(page_offset);
    if (!sct || !kma)           /* never patch the kernel with unknown addresses */
        return 1;

    printf("OK\n"
           "page_offset\t\t:\t0x%08lx\n"
           "sys_call_table\t:\t0x%08lx\n"
           "kmalloc()\t\t:\t0x%08lx\n",
           page_offset, sct, kma);

    /* Find the address of sys_mkdir */
    if (!rkml(kmem, sct + (_SYS_MKDIR_ * 4), &sys_mkdir_addr)) {
        printf("Cannot get addr of %d syscall\n", _SYS_MKDIR_);
        perror("er: ");
        return 1;
    }

    /* Save the first N bytes of sys_mkdir */
    if (!rkm(kmem, sys_mkdir_addr, tmp, get_kmalloc_size)) {
        printf("Cannot save old %d syscall!\n", _SYS_MKDIR_);
        return 1;
    }

    /* Overwrite the first N bytes with the get_kmalloc function */
    if (!wkm(kmem, sys_mkdir_addr, (void *)get_kmalloc_addr, get_kmalloc_size)) {
        printf("Can't overwrite our syscall %d!\n", _SYS_MKDIR_);
        return 1;
    }

    kmalloc.kmalloc = (void *) kma;     /* address of the kmalloc function */
    kmalloc.size = new_mkdir_size;      /* size of the requested memory
                                           (the size of the new_mkdir interceptor) */
    kmalloc.flags = 0x1f0;              /* GFP specifier */

    /* Invoke the sys_mkdir system call, thereby running our get_kmalloc function */
    mkdir((char *)&kmalloc, 0);

    /* Restore the original sys_mkdir */
    if (!wkm(kmem, sys_mkdir_addr, tmp, get_kmalloc_size)) {
        printf("Can't restore syscall %d !\n", _SYS_MKDIR_);
        return 1;
    }

    if (kmalloc.mem < page_offset) {
        printf("Allocated memory is too low (%08lx < %08lx)\n",
               kmalloc.mem, page_offset);
        return 1;
    }

    /* Display the results */
    printf("sys_mkdir_addr\t\t:\t0x%08lx\n"
           "get_kmalloc_size\t:\t0x%08lx (%lu bytes)\n\n"
           "our kmem region\t\t:\t0x%08lx\n"
           "size of our kmem\t:\t0x%08x (%d bytes)\n\n",
           sys_mkdir_addr, get_kmalloc_size, get_kmalloc_size,
           kmalloc.mem, kmalloc.size, kmalloc.size);

    /* Place our new system call in kernel space */
    if (!wkm(kmem, kmalloc.mem, (void *)new_mkdir_addr, new_mkdir_size)) {
        printf("Unable to locate new system call !\n");
        return 1;
    }

    /* Point the sys_call_table entry at our new call */
    if (!wkml(kmem, sct + (_SYS_MKDIR_ * 4), kmalloc.mem)) {
        printf("Eh ...");
        return 1;
    }

    close(kmem);
    return 0;
}
/* EOF */

Let's compile the resulting code and determine the addresses and sizes of the get_kmalloc and new_mkdir functions. It is too early to run the result! To find the addresses and sizes, we use the objdump utility:

# gcc -o src-6.0 src-6.0.c
# objdump -x ./src-6.0 > dump

Let's open the dump file and find the data we are interested in:

080485a4 g F .text 00000032 get_kmalloc
080486b1 g F .text 0000000a new_mkdir

Now let's add these values to our program:

ulong get_kmalloc_size=0x32;
ulong get_kmalloc_addr=0x080485a4;
ulong new_mkdir_size=0x0a;
ulong new_mkdir_addr=0x080486b1;

Now let's recompile the program. Since this edit changes main, it is worth running objdump once more to confirm that the addresses and sizes still match. Having launched the program, we will intercept the sys_mkdir system call. All calls to sys_mkdir will now be handled by the new_mkdir function.
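Reading the two values out of the dump by hand is error-prone, so the step can be sketched in C as well. The function below assumes a symbol line laid out exactly like the ones quoted from the dump (address, the flags g and F, the .text section, size, name); the name parse_objdump_symbol is hypothetical, and lines with other flags or sections are simply rejected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one objdump symbol line of the assumed form
 *   "<addr> g F .text <size> <name>"
 * and, if the name matches, store the address and size (both hex).
 * Returns 0 on a match, -1 otherwise. */
int parse_objdump_symbol(const char *line, const char *name,
                         unsigned long *addr, unsigned long *size)
{
    char sym[128];
    unsigned long a, s;

    assert(line != NULL && name != NULL && addr != NULL && size != NULL);
    if (sscanf(line, "%lx g F .text %lx %127s", &a, &s, sym) != 3)
        return -1;
    if (strcmp(sym, name) != 0)
        return -1;
    *addr = a;
    *size = s;
    return 0;
}
```

Feeding each line of the dump file to this function with the names get_kmalloc and new_mkdir would yield the four values that have to be written into main.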

End Of Paper/EOP

The performance of the code from all sections was tested on the 2.4.22 kernel. When preparing the report, materials from the site were used