System calls
So far, all the programs we've made have had to use well-defined kernel mechanisms to register /proc files and device drivers. This is great if you want to do something already provided by the kernel programmers, like writing a device driver. But what if you want to do something fancy, change the behavior of the system in some way?
This is exactly where kernel programming becomes dangerous. While writing the example below, I destroyed the open system call. That meant I could not open any files, run any programs, or shut the system down with the shutdown command. I had to turn off the power to stop it. Fortunately, no files were destroyed. To make sure you don't lose any files either, please run sync right before you issue the insmod and rmmod commands.
Forget about /proc files and device files. They are just minor details. The real mechanism that all processes use to communicate with the kernel is the system call. When a process requests a service from the kernel (such as opening a file, starting a new process, or requesting more memory), this is the mechanism used. If you want to change the behavior of the kernel in interesting ways, this is the place to do it. By the way, if you want to see which system calls a program uses, run it under strace.
In general, a process is not able to access the kernel. It cannot access kernel memory and it cannot call kernel functions. The CPU hardware enforces this (that is why it is called `protected mode'). System calls are the exception to this general rule. The process fills the registers with the appropriate values and then executes a special instruction that jumps to a previously defined location in the kernel (that location is readable by user processes, of course, but not writable by them). On Intel CPUs, this is done by means of interrupt 0x80. The hardware knows that once you jump to this location, you are no longer running in restricted user mode. Instead, you are running as the operating system kernel, and therefore you are allowed to do whatever you want.
The location in the kernel that a process can jump to is called system_call. The procedure at that location checks the system call number, which tells the kernel what service the process requested. Then it looks in the table of system calls (sys_call_table) to find the address of the kernel function to call. It calls that function, and after the function returns, it performs a few system checks. The result is then returned to the process (or control passes to a different process, if the process's time slice has run out). If you want to read the code that does all this, it is in the source file arch/<architecture>/kernel/entry.S, after the ENTRY(system_call) line.
So, if we want to change the way a certain system call works, we need to write our own function to implement it (usually by adding a bit of our own code and then calling the original function), and then change the pointer in sys_call_table to point to our function. Because we might be removed later and we don't want to leave the system in an unstable state, it is important for cleanup_module to restore the table to its original state.
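The lookup step can be illustrated with a small user-space model. This is only a sketch of the idea, not the kernel's entry code: the handler type, the two toy handlers, and the names demo_call_table and demo_dispatch are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type: three generic arguments, one result. */
typedef long (*syscall_fn)(long a, long b, long c);

/* Two toy functions standing in for real kernel handlers. */
long demo_getanswer(long a, long b, long c) { (void)a; (void)b; (void)c; return 42; }
long demo_add(long a, long b, long c)       { (void)c; return a + b; }

/* Toy system call table: the call number is the index into the array. */
syscall_fn demo_call_table[] = { demo_getanswer, demo_add };
#define DEMO_NR_CALLS (sizeof demo_call_table / sizeof demo_call_table[0])

/* Outline of the dispatch: validate the number, look up the handler, call it. */
long demo_dispatch(unsigned long nr, long a, long b, long c)
{
    if (nr >= DEMO_NR_CALLS)
        return -ENOSYS;                  /* unknown system call number */
    return demo_call_table[nr](a, b, c); /* indirect call through the table */
}
```

Calling demo_dispatch(1, 2, 3, 0) goes through slot 1 and returns the sum of the first two arguments, while an out-of-range number is rejected before the table is touched.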
The source code provided here is an example of such a module. We want to "spy" on some user, and send a message via printk whenever that user opens a file. We replace the file open system call with our own function called our_sys_open . This function checks the uid (user id) of the current process, and if it is equal to the uid we are spying on, calls printk to display the name of the file to be opened. It then calls the original open function with the same parameters, actually opening the file.
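The wrapper pattern described here can be tried out safely in user space before touching a real kernel. The sketch below models the table entry as a single function pointer; demo_open_slot, demo_real_open, spy_install, and spy_remove are hypothetical names, the uid is passed as an argument instead of being read from the current process, and a string buffer stands in for printk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef int (*open_fn)(const char *path, int flags, unsigned uid);

/* Stand-in for the original handler: returns a fake descriptor. */
int demo_real_open(const char *path, int flags, unsigned uid)
{
    (void)flags; (void)uid;
    return path[0] ? 3 : -1;
}

open_fn demo_open_slot = demo_real_open; /* simulated sys_call_table entry */
open_fn saved_open;                      /* original pointer, kept for restore */
unsigned spied_uid;                      /* the user we are watching */
char last_logged[256];                   /* stands in for the printk output */

/* Replacement handler: log if the uid matches, then always forward. */
int our_open(const char *path, int flags, unsigned uid)
{
    if (uid == spied_uid)
        snprintf(last_logged, sizeof last_logged,
                 "Opened file by %u: %s", uid, path);
    return saved_open(path, flags, uid);
}

/* What init_module would do: remember the old pointer, install ours. */
void spy_install(unsigned uid)
{
    spied_uid = uid;
    saved_open = demo_open_slot;
    demo_open_slot = our_open;
}

/* What cleanup_module would do: put the old pointer back. */
void spy_remove(void)
{
    demo_open_slot = saved_open;
}
```

After spy_install(1000), a call through demo_open_slot with uid 1000 fills last_logged and still returns the original result; calls by other uids pass through silently.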
The init_module function replaces the appropriate entry in sys_call_table and keeps the original pointer in a variable. The cleanup_module function uses that variable to restore everything back to normal. This approach is dangerous, because two modules may change the same system call. Imagine we have two modules, A and B. Module A's open system call will be called A_open, and module B's will be called B_open. When A is inserted into the kernel, the system call is replaced with A_open, which calls the original sys_open when it has done what it needs to do. Next, B is inserted into the kernel and replaces the system call with B_open, which calls what it thinks is the original system call but is actually A_open.
Now, if B is removed first, everything will be fine: this simply restores the system call to A_open, which calls the original. However, if A is removed and then B is removed, the system will crash. Removing A restores the system call to the original, sys_open, cutting B out of the loop. Then, when B is removed, it restores the system call to what it thinks is the original. In fact, the call will be directed to A_open, which is no longer in memory. At first glance it looks as if we could solve this particular problem by checking whether the system call still equals our open function and, if not, leaving it unchanged (so that B does not change the system call when it is removed), but that would cause an even worse problem. When A is removed, it sees that the system call has been changed to B_open and no longer points to A_open, so it does not restore the pointer to sys_open before it is removed from memory. Unfortunately, B_open will still try to call A_open, which is no longer in memory, so the system will crash even without removing B.
I see two ways to prevent this problem. The first is to restore the call to its original value, sys_open. Unfortunately, sys_open is not part of the kernel symbol table in /proc/ksyms, so we cannot access it. The other solution is to use the reference counter to prevent the module from being unloaded once it is loaded. This is fine for regular modules, but bad for "educational" modules.
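The unload-order problem is easy to reproduce in a harmless user-space model. In this sketch a single function pointer plays the role of the table entry; load_A, load_B, unload_A, and unload_B are hypothetical stand-ins for the modules' init_module and cleanup_module functions.

```c
#include <assert.h>

typedef int (*open_fn)(void);

int sys_open_demo(void) { return 0; }  /* stands in for the original sys_open */
open_fn table_slot = sys_open_demo;    /* simulated sys_call_table entry */

open_fn a_saved, b_saved;              /* what each module believes is "the original" */

int A_open(void) { return a_saved(); }
int B_open(void) { return b_saved(); }

void load_A(void)   { a_saved = table_slot; table_slot = A_open; }
void load_B(void)   { b_saved = table_slot; table_slot = B_open; }
void unload_A(void) { table_slot = a_saved; }
void unload_B(void) { table_slot = b_saved; }
```

Loading A then B and unloading in reverse order (B, then A) leaves table_slot equal to sys_open_demo. Unloading A first and then B leaves table_slot pointing at A_open. In this model A_open is still a valid function, but in a real kernel its code would already have been freed, which is the crash described above.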
/* syscall.c
 *
 * System call "stealing" sample
 */
/* Copyright (C) 1998-99 by Ori Pomerantz */
/* The necessary header files */
/* Standard in kernel modules */
#include
Most often, the code for the system call numbered __NR_xxx, defined in /usr/include/asm/unistd.h, can be found in the Linux kernel source in the function sys_xxx(). (The call table for i386 can be found in /usr/src/linux/arch/i386/kernel/entry.S.) There are many exceptions to this rule, mostly because older system calls were superseded by newer ones, and not in any systematic way. On platforms with proprietary OS emulation, such as parisc, sparc, sparc64, and alpha, there are many additional system calls; mips64 also has a full set of 32-bit system calls.
Over time, there have been changes to the interface of some system calls as needed. One of the reasons for these changes was the need to increase the size of structures or scalar values passed to a system call. Due to these changes, on some architectures (namely, on the old 32-bit i386), various groups of similar system calls appeared (for example, truncate(2) and truncate64(2)), which perform the same tasks but differ in the size of their arguments. (As noted, applications are unaffected: the glibc wrappers do some work to trigger the correct system call, and this ensures ABI compatibility for older binaries.) Examples of system calls that have multiple versions:
* There are currently three different versions of stat(2): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64), the last being the one currently in use. The situation is similar for lstat(2) and fstat(2).
* Similarly, the defines __NR_oldolduname, __NR_olduname, and __NR_uname refer to the routines sys_olduname(), sys_uname(), and sys_newuname().
* In Linux 2.0, a new version of vm86(2) appeared; the old and new versions of the kernel routine are named sys_vm86old() and sys_vm86().
* In Linux 2.4, a new version of getrlimit(2) appeared; the old and new versions of the kernel routine are named sys_old_getrlimit() (slot __NR_getrlimit) and sys_getrlimit() (slot __NR_ugetrlimit).
* In Linux 2.4, the size of the user and group ID fields was increased from 16 to 32 bits. Several system calls were added to support this change (for example, chown32(2), getuid32(2), getgroups32(2), setresuid32(2)), superseding the earlier calls with the same names but without the "32" suffix.
* Linux 2.4 added support for accessing large files (whose sizes and offsets do not fit in 32 bits) in applications on 32-bit architectures. This required changes to the system calls that work with file sizes and offsets. The following system calls were added: fcntl64(2), getdents64(2), stat64(2), statfs64(2), truncate64(2), and their counterparts that operate on file descriptors or symbolic links. These system calls supersede the older system calls which, except in the case of the "stat" calls, have the same names without the "64" suffix.
On newer platforms that only have 64-bit file access and 32-bit UIDs/GIDs (e.g. alpha, ia64, s390x, x86-64), there is just a single version of the UID/GID and file access system calls. On platforms (typically 32-bit platforms) where the *64 and *32 calls exist, the other versions are obsolete.
* The rt_sig* system calls were added in kernel 2.2 to support additional real-time signals (see signal(7)). These system calls supersede the older system calls of the same name without the "rt_" prefix.
* The select(2) and mmap(2) system calls use five or more arguments, which caused problems in the way arguments were passed on i386. As a result, while on other architectures sys_select() and sys_mmap() correspond to __NR_select and __NR_mmap, on i386 those numbers correspond to old_select() and old_mmap() (routines that take a pointer to a block of arguments). Nowadays passing five or more arguments is no longer a problem, and there is a __NR__newselect that corresponds directly to sys_select(), and similarly __NR_mmap2.
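From the application side, the large-file variants listed above are normally selected by the C library rather than called by name. The following sketch assumes a glibc-style library that honors the _FILE_OFFSET_BITS macro; file_size and offsets_are_64bit are hypothetical helper names.

```c
#define _FILE_OFFSET_BITS 64   /* must appear before any #include to take effect */
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Size of a file in bytes, or -1 on error. The program calls plain stat();
 * which kernel entry point is used underneath is the library's decision. */
long long file_size(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}

/* With the macro above, off_t is expected to be 8 bytes wide even on
 * 32-bit platforms; on 64-bit platforms it is 8 bytes regardless. */
int offsets_are_64bit(void)
{
    return sizeof(off_t) == 8;
}
```

The point of the design is that the source code keeps using the unsuffixed names, and the choice between the old and the "64" system call is made at build time.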
This material is an adaptation of articles by Vladimir Meshkov published in the magazine "System Administrator". The original articles can be found at the links below. Some of the example source code has been changed: improved and finalized. (Example 4.2 has been heavily modified, since a slightly different system call had to be intercepted.) URLs: http://www.samag.ru/img/uploaded/p.pdf http://www.samag.ru/img/uploaded/a3.pdf
- 2. Loadable kernel module
- 4. Examples of intercepting system calls based on LKM
- 4.1 Disable directory creation
1. General view of Linux architecture
The most general view shows a two-level model of the system: kernel <=> programs. At the center (on the left) is the system kernel. The kernel interacts directly with the computer hardware, isolating application programs from the details of the architecture. The kernel provides a set of services to application programs. These services include I/O operations (opening, reading, writing, and managing files), creating and managing processes, synchronizing them, and interprocess communication. All applications request kernel services through system calls. The second level is made up of applications or tasks, both system ones, which determine the functionality of the system, and application ones, which provide the Linux user interface. However, despite the outward diversity of applications, they all interact with the kernel in the same way.
Interaction with the kernel occurs through the standard system call interface. The system call interface is a set of kernel services and defines the format of requests for services. A process requests a service by making a system call to a specific kernel procedure, which looks like a regular library function call. The kernel executes the request on behalf of the process and returns the required data to the process.
In the example (Source 1.0), the program opens a file, reads data from it, and closes the file. The operations of opening (open), reading (read), and closing (close) the file are performed by the kernel at the request of the task, and the functions open(2), read(2), and close(2) are system calls.
/* Source 1.0 */ #include
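A minimal sketch of such a program fragment is given here for orientation. It is an illustration only, not the article's Source 1.0; the function name read_prefix is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Read up to len bytes from the start of the file at path.
 * Returns the number of bytes read, or -1 if the file cannot be opened
 * or read. Each of the three library calls ends in a system call. */
ssize_t read_prefix(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);   /* open(2) */
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len);  /* read(2) */
    close(fd);                       /* close(2) */
    return n;
}
```

Running a program that calls read_prefix under strace shows the three calls in order, which is a convenient way to confirm that the library functions really do end up in the kernel.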
- the EAX register receives the number of the system call. In our case, the system call number is 5 (see __NR_open).
- the EBX register receives the first parameter of the function (for open() it is a pointer to a string containing the name of the file being opened).
- the ECX register receives the second parameter (the file access mode).
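The register loading listed above is what the libc wrapper does on the program's behalf. The same request can be made by number through the generic syscall(2) wrapper, as in the sketch below. Note the assumptions: raw_open is a hypothetical name, and the sketch uses SYS_openat with AT_FDCWD in place of the i386 call number 5 from the text, because the plain open number is not defined on every Linux architecture.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a file by passing the system call number and arguments directly.
 * syscall() places the number and the arguments where the kernel expects
 * them for the current architecture, then enters the kernel.
 * Returns a file descriptor, or -1 with errno set. */
long raw_open(const char *path, int flags)
{
    /* openat() relative to AT_FDCWD behaves like open() for this purpose. */
    return syscall(SYS_openat, AT_FDCWD, path, flags, 0);
}
```

For example, raw_open("/", O_RDONLY) yields a descriptor for the root directory, and a nonexistent path yields -1, exactly as the ordinary open() wrapper would.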
To make sure we're on the right track, let's look at the code for the open() function in the libc system library:
# gdb -q /lib/libc.so.6
(gdb) disas open
Dump of assembler code for function open:
0x000c8080
Now let's get back to the system call mechanism. So, the kernel calls the 0x80 interrupt handler - the system_call function. System_call pushes copies of the registers containing the call parameters onto the stack using the SAVE_ALL macro and calls the desired system function with the call command. The table of pointers to kernel functions that implement system calls is located in the sys_call_table array (see file arch/i386/kernel/entry.S). The system call number that resides in the EAX register is the index into this array. Thus, if EAX contains the value 5, the sys_open() kernel function will be called. Why is the SAVE_ALL macro needed? The explanation here is very simple. Since almost all kernel system functions are written in C, they look for their parameters on the stack. And the parameters are pushed onto the stack with SAVE_ALL! The return value of the system call is stored in the EAX register.
Now let's figure out how to intercept the system call. The mechanism of loadable kernel modules will help us with this.
2. Loadable kernel module
A loadable kernel module (LKM) is code that runs in kernel space. The main feature of an LKM is that it can be loaded and unloaded dynamically, without rebooting the system or recompiling the kernel. Each LKM consists of at least two functions:
- the module initialization function, called when the LKM is loaded into memory: int init_module(void) { ... }
- the module unload function: void cleanup_module(void) { ... }
3. Algorithm for intercepting a system call based on LKM
To implement a module that intercepts a system call, we first need to define the interception algorithm. The algorithm is as follows:
- save a pointer to the original call so that it can be restored
- create a function that implements the new system call
- replace calls in the sys_call_table system call table, i.e. set the corresponding pointer to a new system call
- at the end of work (when the module is unloaded), restore the original system call using the previously saved pointer
4. Examples of intercepting system calls based on LKM
4.1 Disable directory creation
When a directory is created, the sys_mkdir kernel function is called. Its parameter is a string containing the name of the directory to be created. Consider the code that intercepts the corresponding system call.
/* Source 4.1 */ #include
4.2 Hiding a file entry in a directory
Let's determine which system call is responsible for reading the contents of a directory. To do this, we will write another test fragment that reads the current directory:
/* Source 4.2.1 */ #include
- d_reclen - record size
- d_name - file name
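The two fields above are all that is needed to drop one entry from a buffer of directory records: d_reclen says where the next record starts, and d_name says whether the current one should be hidden. The sketch below works on an ordinary user-space buffer. The struct demo_dirent is a stand-in whose layout is assumed to mirror the kernel's 64-bit directory record, and append_record, record_len, record_name, and hide_entry are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a directory record (assumed layout). */
struct demo_dirent {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;  /* size of this whole record in bytes */
    unsigned char  d_type;
    char           d_name[];  /* NUL-terminated file name */
};

/* Append a record for name at offset len; returns the new buffer length.
 * Record sizes are rounded up to a multiple of 8 bytes. */
size_t append_record(char *buf, size_t len, const char *name)
{
    size_t reclen = (offsetof(struct demo_dirent, d_name) + strlen(name) + 1 + 7)
                    & ~(size_t)7;
    unsigned short r = (unsigned short)reclen;
    memset(buf + len, 0, reclen);
    memcpy(buf + len + offsetof(struct demo_dirent, d_reclen), &r, sizeof r);
    strcpy(buf + len + offsetof(struct demo_dirent, d_name), name);
    return len + reclen;
}

unsigned short record_len(const char *rec)
{
    unsigned short r;
    memcpy(&r, rec + offsetof(struct demo_dirent, d_reclen), sizeof r);
    return r;
}

const char *record_name(const char *rec)
{
    return rec + offsetof(struct demo_dirent, d_name);
}

/* Remove every record whose name equals hide; returns the new length. */
size_t hide_entry(char *buf, size_t len, const char *hide)
{
    size_t off = 0;
    while (off < len) {
        unsigned short reclen = record_len(buf + off);
        if (reclen == 0 || off + reclen > len)
            break;                          /* malformed record: stop */
        if (strcmp(record_name(buf + off), hide) == 0) {
            /* Slide the remaining records over the hidden one. */
            memmove(buf + off, buf + off + reclen, len - off - reclen);
            len -= reclen;
        } else {
            off += reclen;
        }
    }
    return len;
}
```

An intercepting function would call the original handler first, run a filter like hide_entry over the returned buffer, and report the shortened length to the caller.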
5. Direct access to the kernel address space via /dev/kmem
Let us first consider in theory how interception works when the kernel address space is accessed directly, and then proceed to a practical implementation. Direct access to the kernel address space is provided by the /dev/kmem device file. This file maps all of the available virtual address space, including the swap area. The standard system functions open(), read(), and write() are used to work with the kmem file. Having opened /dev/kmem in the usual way, we can refer to any address in the system by using it as an offset in this file. This method was developed by Silvio Cesare.
System functions are accessed by loading the function parameters into processor registers and then calling software interrupt 0x80. The handler for this interrupt, the system_call function, pushes the call parameters onto the stack, retrieves the address of the called system function from the sys_call_table table, and transfers control to that address.
With full access to the kernel address space, we can get the entire contents of the system call table, i.e. addresses of all system functions. By changing the address of any system call, we thereby intercept it. But for this you need to know the address of the table, or, in other words, the offset in the /dev/kmem file at which this table is located.
To determine the address of the sys_call_table table, you must first calculate the address of the system_call function. Since this function is an interrupt handler, let's look at how interrupts are handled in protected mode.
In real mode, when an interrupt occurs, the processor consults the interrupt vector table, which is always located at the very beginning of memory and contains two-word addresses of interrupt handlers. In protected mode, the counterpart of the interrupt vector table is the Interrupt Descriptor Table (IDT), which is maintained by the protected-mode operating system. For the processor to be able to access this table, its address must be loaded into the Interrupt Descriptor Table Register (IDTR). The IDT contains interrupt handler descriptors, which include, among other things, the handler addresses. These descriptors are called gates. When an interrupt occurs, the processor fetches the gate from the IDT by the interrupt number, determines the address of the handler, and transfers control to it.
To calculate the address of the system_call function from the IDT, we extract the interrupt gate for int $0x80 and, from it, the address of the corresponding handler, i.e. the address of system_call. Inside system_call, the sys_call_table table is accessed by the instruction call <table_address>(,%eax,4). Having found the opcode (signature) of this instruction in the /dev/kmem file, we also find the address of the system call table.
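As a rough illustration of how a handler address is stored in a gate, the sketch below decodes one 8-byte descriptor. It assumes the standard 32-bit x86 gate layout, in which the low 16 bits of the handler offset occupy bytes 0-1 and the high 16 bits occupy bytes 6-7; the function names and all sample values used with them are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the 32-bit handler address from an 8-byte gate descriptor
 * (assumed layout: offset 15..0, selector, reserved byte, type byte,
 * offset 31..16, all little-endian). */
uint32_t gate_handler_address(const uint8_t gate[8])
{
    uint32_t low  = (uint32_t)gate[0] | ((uint32_t)gate[1] << 8);
    uint32_t high = (uint32_t)gate[6] | ((uint32_t)gate[7] << 8);
    return (high << 16) | low;
}

/* Address of the gate for a given interrupt number: each gate is 8 bytes. */
uint32_t gate_offset(uint32_t idt_base, unsigned vector)
{
    return idt_base + 8u * vector;
}
```

For interrupt 0x80, the gate sits 0x400 bytes past the IDT base, so reading 8 bytes at that address and passing them to gate_handler_address yields the handler's entry point.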
To determine the opcode, let's use the debugger and disassemble the system_call function:
# gdb -q /usr/src/linux/vmlinux
(gdb) disas system_call
Dump of assembler code for function system_call:
0xc0194cbc
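The signature search can be sketched as a scan over a buffer of code bytes. The sketch assumes the i386 encoding of an indirect call of the form call *addr(,%eax,4), which is the byte sequence FF 14 85 followed by the 32-bit address in little-endian order; find_call_table is a hypothetical name, and in practice the buffer would hold bytes read starting at the address of system_call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan buf for the bytes FF 14 85 and return the 32-bit little-endian
 * operand that follows them. Returns 0 if the signature is not found. */
uint32_t find_call_table(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 7 <= len; i++) {
        if (buf[i] == 0xff && buf[i + 1] == 0x14 && buf[i + 2] == 0x85) {
            return (uint32_t)buf[i + 3]
                 | ((uint32_t)buf[i + 4] << 8)
                 | ((uint32_t)buf[i + 5] << 16)
                 | ((uint32_t)buf[i + 6] << 24);
        }
    }
    return 0;
}
```

Only the first match is used, so the scan should be limited to a short window after the handler's entry point to reduce the chance of a false hit.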
Consider the pseudocode that performs the interception operation:
readaddr(old_syscall, sct + SYS_CALL*4, 4);
writeaddr(new_syscall, sct + SYS_CALL*4, 4);
The readaddr function reads the address of the system call from the system call table and stores it in the old_syscall variable. Each entry in sys_call_table takes 4 bytes. The required address is located at offset sct + SYS_CALL*4 in the file /dev/kmem (here sct is the address of the sys_call_table table and SYS_CALL is the number of the system call). The writeaddr function overwrites the address of the SYS_CALL system call with the address of the new_syscall function, so that all calls to the SYS_CALL system call will be serviced by this function.
It seems that everything is simple and the goal has been achieved. However, remember that we are working in the user address space. If we place the new system function in this address space, then calling it will produce nothing but an error. Hence the conclusion: the new system call must be placed in the kernel's address space. To do this, we need to obtain a block of memory in kernel space and place the new system call in that block.
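One possible shape for these two helpers is sketched below using pread() and pwrite() on an already opened file descriptor. The signatures differ from the pseudocode and are assumptions made for this sketch, as is the intercept helper. The code is exercised against an ordinary file standing in for /dev/kmem, and it does not address how large kernel addresses map to file offsets on a real system.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* open() flags for callers of these helpers */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read one 4-byte table entry at the given file offset. 0 on success. */
int readaddr(int fd, uint32_t *value, off_t offset)
{
    return pread(fd, value, sizeof *value, offset) == (ssize_t)sizeof *value ? 0 : -1;
}

/* Write one 4-byte table entry at the given file offset. 0 on success. */
int writeaddr(int fd, uint32_t value, off_t offset)
{
    return pwrite(fd, &value, sizeof value, offset) == (ssize_t)sizeof value ? 0 : -1;
}

/* Replace entry nr of the table located at offset sct with new_addr,
 * storing the previous entry in *old_addr so it can be restored later. */
int intercept(int fd, off_t sct, unsigned nr, uint32_t new_addr, uint32_t *old_addr)
{
    off_t slot = sct + (off_t)nr * 4;   /* each entry takes 4 bytes */
    if (readaddr(fd, old_addr, slot) != 0)
        return -1;
    return writeaddr(fd, new_addr, slot);
}
```

Restoring the original handler is the same write with the saved value: writeaddr(fd, old_addr, sct + nr * 4).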
You can allocate memory in kernel space using the kmalloc function. But you cannot directly call a kernel function from the user's address space, so we use the following algorithm:
- knowing the address of the sys_call_table table, we get the address of some system call (for example, sys_mkdir)
- we define a function that performs a call to the kmalloc function. This function returns a pointer to a block of memory in the kernel's address space. Let's call this function get_kmalloc
- store the first N bytes of the sys_mkdir system call, where N is the size of the get_kmalloc function
- overwrite the first N bytes of the sys_mkdir call with the get_kmalloc function
- we execute the call to the sys_mkdir system call, thereby launching the get_kmalloc function for execution
- restore the first N bytes of the sys_mkdir system call
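The save, overwrite, call, restore sequence in the steps above can be modeled on an ordinary byte array. This sketch does not touch kernel memory or execute patched code; with_patched_bytes and demo_first_byte are hypothetical names, and the callback stands in for the moment when the patched system call would be invoked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Temporarily overwrite the first n bytes of target with patch, run
 * action(arg) while the patch is in place, then restore the original
 * bytes. Returns the action's result, or -1 if n is too large. */
int with_patched_bytes(uint8_t *target, const uint8_t *patch, size_t n,
                       int (*action)(void *), void *arg)
{
    uint8_t saved[64];
    if (n > sizeof saved)
        return -1;
    memcpy(saved, target, n);   /* store the first N bytes */
    memcpy(target, patch, n);   /* overwrite them with the new code */
    int rc = action(arg);       /* trigger the patched routine */
    memcpy(target, saved, n);   /* restore the original bytes */
    return rc;
}

/* Demo action: report the first byte it sees at the patched location. */
int demo_first_byte(void *target)
{
    return ((uint8_t *)target)[0];
}
```

The important property is that the original bytes are back in place by the time the function returns, regardless of what the action reports.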
But to implement this algorithm, we need the address of the kmalloc function. You can find it in several ways. The simplest is to read this address from the System.map file or determine it using the gdb debugger (print &kmalloc). If the kernel has module support enabled, the kmalloc address can be determined using the get_kernel_syms() function. This option will be discussed further. If there is no support for kernel modules, then the address of the kmalloc function will have to be searched for by the opcode of the kmalloc call command - similar to what was done for the sys_call_table table.
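Reading an address from System.map amounts to scanning for a line of the form "address type name". A minimal sketch of that lookup follows; lookup_symbol is a hypothetical helper, and whether a symbol named kmalloc is present depends on the kernel version and build.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Search a System.map-style stream for the symbol called name.
 * On success, stores its address in *addr and returns 0; otherwise -1. */
int lookup_symbol(FILE *map, const char *name, unsigned long *addr)
{
    char line[512], sym[256], type;
    unsigned long value;

    while (fgets(line, sizeof line, map)) {
        /* Each line: hexadecimal address, one-letter type, symbol name. */
        if (sscanf(line, "%lx %c %255s", &value, &type, sym) == 3 &&
            strcmp(sym, name) == 0) {
            *addr = value;
            return 0;
        }
    }
    return -1;
}
```

A caller would open the System.map file that matches the running kernel with fopen() and pass the stream to lookup_symbol; a map from a different kernel build gives addresses that are simply wrong.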
The kmalloc function takes two parameters: the size of the requested memory and the GFP specifier. To find the opcode, we will use the debugger and disassemble any kernel function that contains a call to the kmalloc function.
# gdb -q /usr/src/linux/vmlinux
(gdb) disas inter_module_register
Dump of assembler code for function inter_module_register:
0xc01a57b4
This concludes the theoretical calculations and, using the above technique, we will intercept the sys_mkdir system call.