man2

1: Linux cli command ptrace
2: Linux cli command fcntl
3: Linux cli command prlimit
4: Linux cli command utime
5: Linux cli command getdents64
6: Linux cli command sync_file_range2
7: Linux cli command arm_fadvise
8: Linux cli command mlock2
9: Linux cli command tuxcall
10: Linux cli command sysinfo
11: Linux cli command ioprio_set
12: Linux cli command setfsuid32
13: Linux cli command fchown
14: Linux cli command break
15: Linux cli command bdflush
16: Linux cli command sysfs
17: Linux cli command getgid
18: Linux cli command shutdown
19: Linux cli command vm86old
20: Linux cli command preadv2
21: Linux cli command setdomainname
22: Linux cli command inotify_init1
23: Linux cli command ioctl
24: Linux cli command setrlimit
25: Linux cli command sendfile
26: Linux cli command mq_unlink
27: Linux cli command fanotify_mark
28: Linux cli command removexattr
29: Linux cli command munlock
30: Linux cli command alloc_hugepages
31: Linux cli command flock
32: Linux cli command lchown32
33: Linux cli command close_range
34: Linux cli command sigreturn
35: Linux cli command epoll_wait
36: Linux cli command getpid
37: Linux cli command finit_module
38: Linux cli command getrandom
39: Linux cli command creat
40: Linux cli command perf_event_open
41: Linux cli command stat
42: Linux cli command semop
43: Linux cli command pread
44: Linux cli command syscalls
45: Linux cli command umount2
46: Linux cli command timer_gettime
47: Linux cli command shmctl
48: Linux cli command llseek
49: Linux cli command security
50: Linux cli command eventfd
51: Linux cli command s390_pci_mmio_read
52: Linux cli command personality
53: Linux cli command getresuid
54: Linux cli command pidfd_open
55: Linux cli command exit
56: Linux cli command chdir
57: Linux cli command migrate_pages
58: Linux cli command mq_getsetattr
59: Linux cli command pwritev
60: Linux cli command select
61: Linux cli command membarrier
62: Linux cli command pciconfig_write
63: Linux cli command sendmmsg
64: Linux cli command capget
65: Linux cli command getppid
66: Linux cli command setpgid
67: Linux cli command munmap
68: Linux cli command fdetach
69: Linux cli command listen
70: Linux cli command lsetxattr
71: Linux cli command msgctl
72: Linux cli command setfsuid
73: Linux cli command delete_module
74: Linux cli command getpmsg
75: Linux cli command renameat2
76: Linux cli command oldlstat
77: Linux cli command rt_sigaction
78: Linux cli command open
79: Linux cli command readlink
80: Linux cli command fattach
81: Linux cli command stat64
82: Linux cli command vserver
83: Linux cli command request_key
84: Linux cli command linkat
85: Linux cli command sigpending
86: Linux cli command pwrite64
87: Linux cli command rmdir
88: Linux cli command getsid
89: Linux cli command sysctl
90: Linux cli command sched_get_priority_max
91: Linux cli command getrlimit
92: Linux cli command kill
93: Linux cli command pciconfig_read
94: Linux cli command pciconfig_iobase
95: Linux cli command pipe2
96: Linux cli command fstatat64
97: Linux cli command time
98: Linux cli command putpmsg
99: Linux cli command faccessat
100: Linux cli command unimplemented
101: Linux cli command recvmsg
102: Linux cli command adjtimex
103: Linux cli command process_vm_readv
104: Linux cli command sigaction
105: Linux cli command execveat
106: Linux cli command clock_settime
107: Linux cli command settimeofday
108: Linux cli command write
109: Linux cli command fstatfs64
110: Linux cli command idle
111: Linux cli command s390_sthyi
112: Linux cli command setuid
113: Linux cli command process_vm_writev
114: Linux cli command munlockall
115: Linux cli command io_getevents
116: Linux cli command futex
117: Linux cli command semtimedop
118: Linux cli command ssetmask
119: Linux cli command truncate
120: Linux cli command mbind
121: Linux cli command setreuid32
122: Linux cli command semctl
123: Linux cli command gettimeofday
124: Linux cli command inl
125: Linux cli command gettid
126: Linux cli command openat2
127: Linux cli command sigprocmask
128: Linux cli command keyctl
129: Linux cli command brk
130: Linux cli command mq_timedreceive
131: Linux cli command sendfile64
132: Linux cli command timer_create
133: Linux cli command getresuid32
134: Linux cli command truncate64
135: Linux cli command getxattr
136: Linux cli command recv
137: Linux cli command faccessat2
138: Linux cli command fchownat
139: Linux cli command get_kernel_syms
140: Linux cli command mq_notify
141: Linux cli command io_destroy
142: Linux cli command sched_rr_get_interval
143: Linux cli command msync
144: Linux cli command inw_p
145: Linux cli command sync_file_range
146: Linux cli command setregid32
147: Linux cli command sethostname
148: Linux cli command newfstatat
149: Linux cli command pkey_alloc
150: Linux cli command restart_syscall
151: Linux cli command inw
152: Linux cli command statfs64
153: Linux cli command modify_ldt
154: Linux cli command outl
155: Linux cli command outsb
156: Linux cli command outl_p
157: Linux cli command readdir
158: Linux cli command arm_fadvise64_64
159: Linux cli command clone3
160: Linux cli command inb_p
161: Linux cli command rt_sigqueueinfo
162: Linux cli command landlock_create_ruleset
163: Linux cli command fchdir
164: Linux cli command ioctl_iflags
165: Linux cli command fork
166: Linux cli command sigsuspend
167: Linux cli command lstat
168: Linux cli command create_module
169: Linux cli command ppoll
170: Linux cli command inl_p
171: Linux cli command ioctl_ficlonerange
172: Linux cli command sched_getscheduler
173: Linux cli command intro
174: Linux cli command shmdt
175: Linux cli command arm_sync_file_range
176: Linux cli command sync
177: Linux cli command tee
178: Linux cli command arch_prctl
179: Linux cli command set_mempolicy
180: Linux cli command mlock
181: Linux cli command stime
182: Linux cli command socket
183: Linux cli command shmop
184: Linux cli command vfork
185: Linux cli command lremovexattr
186: Linux cli command bpf
187: Linux cli command fchmodat
188: Linux cli command s390_pci_mmio_write
189: Linux cli command mq_timedsend
190: Linux cli command oldolduname
191: Linux cli command getuid32
192: Linux cli command prlimit64
193: Linux cli command timer_getoverrun
194: Linux cli command landlock_restrict_self
195: Linux cli command setgroups32
196: Linux cli command setitimer
197: Linux cli command sigaltstack
198: Linux cli command fremovexattr
199: Linux cli command pselect6
200: Linux cli command getpgid
201: Linux cli command ioctl_fslabel
202: Linux cli command swapoff
203: Linux cli command writev
204: Linux cli command gethostname
205: Linux cli command sched_yield
206: Linux cli command uname
207: Linux cli command outw_p
208: Linux cli command fstat64
209: Linux cli command getgroups
210: Linux cli command fsync
211: Linux cli command subpage_prot
212: Linux cli command preadv
213: Linux cli command getrusage
214: Linux cli command openat
215: Linux cli command vm86
216: Linux cli command llistxattr
217: Linux cli command ioprio_get
218: Linux cli command lgetxattr
219: Linux cli command chroot
220: Linux cli command mount
221: Linux cli command get_mempolicy
222: Linux cli command renameat
223: Linux cli command remap_file_pages
224: Linux cli command landlock_add_rule
225: Linux cli command outw
226: Linux cli command setgid32
227: Linux cli command getdents
228: Linux cli command setup
229: Linux cli command seteuid
230: Linux cli command read
231: Linux cli command sigwaitinfo
232: Linux cli command connect
233: Linux cli command set_tid_address
234: Linux cli command copy_file_range
235: Linux cli command dup3
236: Linux cli command memfd_secret
237: Linux cli command timerfd_gettime
238: Linux cli command reboot
239: Linux cli command mknodat
240: Linux cli command pidfd_getfd
241: Linux cli command fadvise64_64
242: Linux cli command getresgid32
243: Linux cli command unshare
244: Linux cli command fadvise64
245: Linux cli command symlinkat
246: Linux cli command name_to_handle_at
247: Linux cli command timer_delete
248: Linux cli command pipe
249: Linux cli command s390_guarded_storage
250: Linux cli command listxattr
251: Linux cli command accept
252: Linux cli command setegid
253: Linux cli command ftruncate64
254: Linux cli command get_thread_area
255: Linux cli command msgrcv
256: Linux cli command geteuid
257: Linux cli command mq_open
258: Linux cli command mkdirat
259: Linux cli command pwritev2
260: Linux cli command mlockall
261: Linux cli command setgid
262: Linux cli command execve
263: Linux cli command fstat
264: Linux cli command getsockopt
265: Linux cli command geteuid32
266: Linux cli command quotactl
267: Linux cli command pselect
268: Linux cli command fstatat
269: Linux cli command cacheflush
270: Linux cli command mmap2
271: Linux cli command unlinkat
272: Linux cli command spu_run
273: Linux cli command sbrk
274: Linux cli command getpagesize
275: Linux cli command fgetxattr
276: Linux cli command kexec_file_load
277: Linux cli command posix_fadvise
278: Linux cli command ioctl_userfaultfd
279: Linux cli command fdatasync
280: Linux cli command kexec_load
281: Linux cli command lookup_dcookie
282: Linux cli command epoll_pwait2
283: Linux cli command syncfs
284: Linux cli command prof
285: Linux cli command sched_getaffinity
286: Linux cli command clock_getres
287: Linux cli command inb
288: Linux cli command umount
289: Linux cli command sched_setscheduler
290: Linux cli command fallocate
291: Linux cli command add_key
292: Linux cli command fsetxattr
293: Linux cli command utimes
294: Linux cli command getcpu
295: Linux cli command stty
296: Linux cli command pread64
297: Linux cli command vmsplice
298: Linux cli command socketcall
299: Linux cli command socketpair
300: Linux cli command clock_adjtime
301: Linux cli command getegid
302: Linux cli command timer_settime
303: Linux cli command chown
304: Linux cli command ioctl_pipe
305: Linux cli command fchown32
306: Linux cli command epoll_create
307: Linux cli command ioctl_getfsmap
308: Linux cli command sched_setparam
309: Linux cli command pivot_root
310: Linux cli command set_robust_list
311: Linux cli command nice
312: Linux cli command clone2
313: Linux cli command waitpid
314: Linux cli command kcmp
315: Linux cli command setsockopt
316: Linux cli command ioctl_fat
317: Linux cli command open_by_handle_at
318: Linux cli command epoll_ctl
319: Linux cli command eventfd2
320: Linux cli command lchown
321: Linux cli command getpeername
322: Linux cli command mmap
323: Linux cli command ioctl_tty
324: Linux cli command accept4
325: Linux cli command inotify_add_watch
326: Linux cli command capset
327: Linux cli command sched_setaffinity
328: Linux cli command memfd_create
329: Linux cli command io_cancel
330: Linux cli command fstatfs
331: Linux cli command fanotify_init
332: Linux cli command statfs
333: Linux cli command epoll_pwait
334: Linux cli command ioperm
335: Linux cli command clock_nanosleep
336: Linux cli command setgroups
337: Linux cli command lseek
338: Linux cli command rt_sigprocmask
339: Linux cli command getunwind
340: Linux cli command fcntl64
341: Linux cli command olduname
342: Linux cli command select_tut
343: Linux cli command mincore
344: Linux cli command wait
345: Linux cli command getgid32
346: Linux cli command ioctl_pagemap_scan
347: Linux cli command msgget
348: Linux cli command rt_tgsigqueueinfo
349: Linux cli command get_robust_list
350: Linux cli command dup
351: Linux cli command syslog
352: Linux cli command phys
353: Linux cli command io_setup
354: Linux cli command setpriority
355: Linux cli command recvmmsg
356: Linux cli command signalfd4
357: Linux cli command lstat64
358: Linux cli command sched_getattr
359: Linux cli command readahead
360: Linux cli command setpgrp
361: Linux cli command poll
362: Linux cli command pkey_mprotect
363: Linux cli command times
364: Linux cli command setresuid32
365: Linux cli command getresgid
366: Linux cli command getsockname
367: Linux cli command umask
368: Linux cli command epoll_create1
369: Linux cli command ustat
370: Linux cli command dup2
371: Linux cli command rt_sigreturn
372: Linux cli command setfsgid
373: Linux cli command shmget
374: Linux cli command link
375: Linux cli command mkdir
376: Linux cli command getegid32
377: Linux cli command setresgid32
378: Linux cli command sched_getparam
379: Linux cli command unlink
380: Linux cli command free_hugepages
381: Linux cli command iopl
382: Linux cli command waitid
383: Linux cli command getpriority
384: Linux cli command statx
385: Linux cli command exit_group
386: Linux cli command readv
387: Linux cli command getpgrp
388: Linux cli command rt_sigtimedwait
389: Linux cli command mount_setattr
390: Linux cli command pwrite
391: Linux cli command mprotect
392: Linux cli command getuid
393: Linux cli command recvfrom
394: Linux cli command setns
395: Linux cli command mpx
396: Linux cli command nfsservctl
397: Linux cli command getmsg
398: Linux cli command pkey_free
399: Linux cli command pause
400: Linux cli command setregid
401: Linux cli command pidfd_send_signal
402: Linux cli command ioctl_ficlone
403: Linux cli command sched_setattr
404: Linux cli command process_madvise
405: Linux cli command tkill
406: Linux cli command clock_gettime
407: Linux cli command madvise1
408: Linux cli command sched_get_priority_min
409: Linux cli command tgkill
410: Linux cli command userfaultfd
411: Linux cli command bind
412: Linux cli command lock
413: Linux cli command uselib
414: Linux cli command afs_syscall
415: Linux cli command splice
416: Linux cli command move_pages
417: Linux cli command seccomp_unotify
418: Linux cli command flistxattr
419: Linux cli command insw
420: Linux cli command close
421: Linux cli command mknod
422: Linux cli command outsw
423: Linux cli command sendmsg
424: Linux cli command setfsgid32
425: Linux cli command init_module
426: Linux cli command fchmod
427: Linux cli command seccomp
428: Linux cli command setxattr
429: Linux cli command setuid32
430: Linux cli command clone
431: Linux cli command setresuid
432: Linux cli command chown32
433: Linux cli command send
434: Linux cli command access
435: Linux cli command symlink
436: Linux cli command nanosleep
437: Linux cli command swapon
438: Linux cli command ioctl_console
439: Linux cli command prctl
440: Linux cli command msgsnd
441: Linux cli command set_thread_area
442: Linux cli command spu_create
443: Linux cli command setreuid
444: Linux cli command ioctl_ns
445: Linux cli command syscall
446: Linux cli command getgroups32
447: Linux cli command ftruncate
448: Linux cli command insl
449: Linux cli command shmat
450: Linux cli command semget
451: Linux cli command insb
452: Linux cli command sendto
453: Linux cli command rename
454: Linux cli command inotify_rm_watch
455: Linux cli command outsl
456: Linux cli command acct
457: Linux cli command isastream
458: Linux cli command wait3
459: Linux cli command sgetmask
460: Linux cli command signal
461: Linux cli command rt_sigsuspend
462: Linux cli command s390_runtime_instr
463: Linux cli command setresgid
464: Linux cli command getitimer
465: Linux cli command alarm
466: Linux cli command perfmonctl
467: Linux cli command rt_sigpending
468: Linux cli command wait4
469: Linux cli command sigtimedwait
470: Linux cli command madvise
471: Linux cli command putmsg
472: Linux cli command open_howtype
473: Linux cli command oldfstat
474: Linux cli command setsid
475: Linux cli command timerfd_create
476: Linux cli command getcwd
477: Linux cli command inotify_init
478: Linux cli command readlinkat
479: Linux cli command getdomainname
480: Linux cli command gtty
481: Linux cli command ipc
482: Linux cli command outb_p
483: Linux cli command msgop
484: Linux cli command ioctl_fideduperange
485: Linux cli command utimensat
486: Linux cli command query_module
487: Linux cli command ugetrlimit
488: Linux cli command vhangup
489: Linux cli command futimesat
490: Linux cli command chmod
491: Linux cli command signalfd
492: Linux cli command io_submit
493: Linux cli command mremap
494: Linux cli command oldstat
495: Linux cli command timerfd_settime
496: Linux cli command outb

1 - Linux cli command ptrace

➡ A Linux man page (short for manual page) is a form of software documentation found on Linux and Unix-like operating systems. This man page documents the ptrace system call, including its usage and options. You can access this man page by typing man followed by ptrace.

NAME 🖥️ ptrace 🖥️

process trace

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/ptrace.h>
long ptrace(enum __ptrace_request op, pid_t pid,
 void *addr, void *data);

DESCRIPTION

The ptrace() system call provides a means by which one process (the “tracer”) may observe and control the execution of another process (the “tracee”), and examine and change the tracee’s memory and registers. It is primarily used to implement breakpoint debugging and system call tracing.

A tracee first needs to be attached to the tracer. Attachment and subsequent commands are per thread: in a multithreaded process, every thread can be individually attached to a (potentially different) tracer, or left not attached and thus not debugged. Therefore, “tracee” always means “(one) thread”, never “a (possibly multithreaded) process”. Ptrace commands are always sent to a specific tracee using a call of the form

ptrace(PTRACE_foo, pid, ...)

where pid is the thread ID of the corresponding Linux thread.

(Note that in this page, a “multithreaded process” means a thread group consisting of threads created using the clone(2) CLONE_THREAD flag.)

A process can initiate a trace by calling fork(2) and having the resulting child do a PTRACE_TRACEME, followed (typically) by an execve(2). Alternatively, one process may commence tracing another process using PTRACE_ATTACH or PTRACE_SEIZE.

While being traced, the tracee will stop each time a signal is delivered, even if the signal is being ignored. (An exception is SIGKILL, which has its usual effect.) The tracer will be notified at its next call to waitpid(2) (or one of the related “wait” system calls); that call will return a status value containing information that indicates the cause of the stop in the tracee. While the tracee is stopped, the tracer can use various ptrace operations to inspect and modify the tracee. The tracer then causes the tracee to continue, optionally ignoring the delivered signal (or even delivering a different signal instead).
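The sequence described above (fork(2), PTRACE_TRACEME in the child, a stop reported through waitpid(2), then resumption by the tracer) can be sketched as follows. This is a minimal illustration rather than a complete tracer. The helper name run_traced_child() is hypothetical, and the child stops itself with raise(SIGSTOP) instead of calling execve(2) to keep the example short.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: trace a forked child through one signal stop and
 * return the child's exit code, or -1 on error. */
int run_traced_child(int exit_code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Tracee: ask to be traced by the parent, then stop. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit(exit_code);
    }

    int status;

    /* Tracer: the first wait reports the tracee's stop. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGSTOP)
        return -1;

    /* Resume the tracee; data == 0 means the SIGSTOP is not delivered. */
    if (ptrace(PTRACE_CONT, pid, NULL, NULL) == -1)
        return -1;

    /* The second wait reports the tracee's termination. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Passing a nonzero signal number as the last argument of PTRACE_CONT would deliver that signal to the tracee instead of suppressing it.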

If the PTRACE_O_TRACEEXEC option is not in effect, all successful calls to execve(2) by the traced process will cause it to be sent a SIGTRAP signal, giving the parent a chance to gain control before the new program begins execution.

When the tracer is finished tracing, it can cause the tracee to continue executing in a normal, untraced mode via PTRACE_DETACH.

The value of op determines the operation to be performed:

PTRACE_TRACEME
Indicate that this process is to be traced by its parent. A process probably shouldn’t perform this operation if its parent isn’t expecting to trace it. (pid, addr, and data are ignored.)

The PTRACE_TRACEME operation is used only by the tracee; the remaining operations are used only by the tracer. In the following operations, pid specifies the thread ID of the tracee to be acted on. For operations other than PTRACE_ATTACH, PTRACE_SEIZE, PTRACE_INTERRUPT, and PTRACE_KILL, the tracee must be stopped.

PTRACE_PEEKTEXT
PTRACE_PEEKDATA
Read a word at the address addr in the tracee’s memory, returning the word as the result of the ptrace() call. Linux does not have separate text and data address spaces, so these two operations are currently equivalent. (data is ignored; but see NOTES.)
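A minimal sketch of reading one word from a stopped tracee with PTRACE_PEEKDATA is shown here. The variable peek_marker and the helper peek_child_marker() are hypothetical names chosen for illustration. Because a successful peek may legitimately return -1, the sketch clears errno before the call and inspects it afterwards.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical variable to read; after fork(2) it lives at the same
 * address in the child as in the parent. */
static long peek_marker = 0x1234;

/* Store the tracee's copy of peek_marker in *out.
 * Returns 0 on success, -1 on failure. */
int peek_child_marker(long *out)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);          /* stop so the tracer can inspect us */
        _exit(0);
    }

    int status;
    int rc = -1;

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;               /* child already reaped; nothing to clean up */

    /* -1 is a valid word, so errno is the only reliable error indicator. */
    errno = 0;
    long word = ptrace(PTRACE_PEEKDATA, pid, &peek_marker, NULL);
    if (word != -1 || errno == 0) {
        *out = word;
        rc = 0;
    }

    /* Terminate and reap the tracee. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return rc;
}
```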

PTRACE_PEEKUSER
Read a word at offset addr in the tracee’s USER area, which holds the registers and other information about the process (see <sys/user.h>). The word is returned as the result of the ptrace() call. Typically, the offset must be word-aligned, though this might vary by architecture. See NOTES. (data is ignored; but see NOTES.)

PTRACE_POKETEXT
PTRACE_POKEDATA
Copy the word data to the address addr in the tracee’s memory. As for PTRACE_PEEKTEXT and PTRACE_PEEKDATA, these two operations are currently equivalent.
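The following sketch, under the same assumptions as any fork-based test harness, writes one word into a stopped tracee with PTRACE_POKEDATA and lets the tracee report the value through its exit code. The names poke_target and poke_child_value() are hypothetical. Note that the word itself, not a pointer to it, is passed in the data argument.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical variable to overwrite in the tracee. It is volatile so
 * that the child re-reads it from memory after being resumed. */
static volatile long poke_target = 0;

/* Write 'value' into the tracee's poke_target and return the low byte
 * the tracee observed, or -1 on error. */
int poke_child_value(long value)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit((int)(poke_target & 0xff));   /* report what we see */
    }

    int status;

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;

    /* addr is the target address; data carries the word to store. */
    if (ptrace(PTRACE_POKEDATA, pid, (void *)&poke_target,
               (void *)value) == -1 ||
        ptrace(PTRACE_CONT, pid, NULL, NULL) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The tracer's own copy of poke_target is unaffected, since the write goes to the tracee's address space.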

PTRACE_POKEUSER
Copy the word data to offset addr in the tracee’s USER area. As for PTRACE_PEEKUSER, the offset must typically be word-aligned. In order to maintain the integrity of the kernel, some modifications to the USER area are disallowed.

PTRACE_GETREGS
PTRACE_GETFPREGS
Copy the tracee’s general-purpose or floating-point registers, respectively, to the address data in the tracer. See <sys/user.h> for information on the format of this data. (addr is ignored.) Note that SPARC systems have the meaning of data and addr reversed; that is, data is ignored and the registers are copied to the address addr. PTRACE_GETREGS and PTRACE_GETFPREGS are not present on all architectures.

PTRACE_GETREGSET (since Linux 2.6.34)
Read the tracee’s registers. addr specifies, in an architecture-dependent way, the type of registers to be read. NT_PRSTATUS (with numerical value 1) usually results in reading of general-purpose registers. If the CPU has, for example, floating-point and/or vector registers, they can be retrieved by setting addr to the corresponding NT_foo constant. data points to a struct iovec, which describes the destination buffer’s location and length. On return, the kernel modifies the iov_len field of that structure to indicate the actual number of bytes returned.
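As an illustration of the struct iovec convention, the sketch below requests the NT_PRSTATUS register set from a stopped tracee and returns the length the kernel reports. The helper name gp_regset_size() is hypothetical, and the 512-word buffer is an assumption that is expected to be large enough on common architectures; a real tracer would normally use the architecture's register structure from <sys/user.h>.

```c
#include <elf.h>            /* NT_PRSTATUS */
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>        /* struct iovec */
#include <sys/wait.h>
#include <unistd.h>

/* Return the number of bytes the kernel stored for the NT_PRSTATUS
 * register set of a freshly stopped tracee, or -1 on error. */
long gp_regset_size(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit(0);
    }

    int status;
    long size = -1;

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;

    /* Assumed to be big enough for the general-purpose register set. */
    unsigned long buf[512];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    /* addr selects the register set; data points to the iovec. */
    if (ptrace(PTRACE_GETREGSET, pid,
               (void *)(unsigned long)NT_PRSTATUS, &iov) == 0)
        size = (long)iov.iov_len;   /* updated by the kernel */

    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return size;
}
```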

PTRACE_SETREGS
PTRACE_SETFPREGS
Modify the tracee’s general-purpose or floating-point registers, respectively, from the address data in the tracer. As for PTRACE_POKEUSER, some general-purpose register modifications may be disallowed. (addr is ignored.) Note that SPARC systems have the meaning of data and addr reversed; that is, data is ignored and the registers are copied from the address addr. PTRACE_SETREGS and PTRACE_SETFPREGS are not present on all architectures.

PTRACE_SETREGSET (since Linux 2.6.34)
Modify the tracee’s registers. The meaning of addr and data is analogous to PTRACE_GETREGSET.

PTRACE_GETSIGINFO (since Linux 2.3.99-pre6)
Retrieve information about the signal that caused the stop. Copy a siginfo_t structure (see sigaction(2)) from the tracee to the address data in the tracer. (addr is ignored.)

PTRACE_SETSIGINFO (since Linux 2.3.99-pre6)
Set signal information: copy a siginfo_t structure from the address data in the tracer to the tracee. This will affect only signals that would normally be delivered to the tracee and were caught by the tracer. It may be difficult to tell these normal signals from synthetic signals generated by ptrace() itself. (addr is ignored.)

PTRACE_PEEKSIGINFO (since Linux 3.10)
Retrieve siginfo_t structures without removing signals from a queue. addr points to a ptrace_peeksiginfo_args structure that specifies the ordinal position from which copying of signals should start, and the number of signals to copy. siginfo_t structures are copied into the buffer pointed to by data. The return value contains the number of copied signals (zero indicates that there is no signal corresponding to the specified ordinal position). Within the returned siginfo_t structures, the si_code field includes information (__SI_CHLD, __SI_FAULT, etc.) that is not otherwise exposed to user space.

struct ptrace_peeksiginfo_args {
    u64 off;    /* Ordinal position in queue at which
                   to start copying signals */
    u32 flags;  /* PTRACE_PEEKSIGINFO_SHARED or 0 */
    s32 nr;     /* Number of signals to copy */
};

Currently, there is only one flag, PTRACE_PEEKSIGINFO_SHARED, for dumping signals from the process-wide signal queue. If this flag is not set, signals are read from the per-thread queue of the specified thread.

PTRACE_GETSIGMASK (since Linux 3.11)
Place a copy of the mask of blocked signals (see sigprocmask(2)) in the buffer pointed to by data, which should be a pointer to a buffer of type sigset_t. The addr argument contains the size of the buffer pointed to by data (i.e., sizeof(sigset_t)).

PTRACE_SETSIGMASK (since Linux 3.11)
Change the mask of blocked signals (see sigprocmask(2)) to the value specified in the buffer pointed to by data, which should be a pointer to a buffer of type sigset_t. The addr argument contains the size of the buffer pointed to by data (i.e., sizeof(sigset_t)).

PTRACE_SETOPTIONS (since Linux 2.4.6; see BUGS for caveats)
Set ptrace options from data. (addr is ignored.) data is interpreted as a bit mask of options, which are specified by the following flags:

PTRACE_O_EXITKILL (since Linux 3.8)
Send a SIGKILL signal to the tracee if the tracer exits. This option is useful for ptrace jailers that want to ensure that tracees can never escape the tracer’s control.

PTRACE_O_TRACECLONE (since Linux 2.5.46)
Stop the tracee at the next clone(2) and automatically start tracing the newly cloned process, which will start with a SIGSTOP, or PTRACE_EVENT_STOP if PTRACE_SEIZE was used. A waitpid(2) by the tracer will return a status value such that

  status>>8 == (SIGTRAP | (PTRACE_EVENT_CLONE<<8))

The PID of the new process can be retrieved with PTRACE_GETEVENTMSG.

This option may not catch clone(2) calls in all cases. If the tracee calls clone(2) with the CLONE_VFORK flag, PTRACE_EVENT_VFORK will be delivered instead if PTRACE_O_TRACEVFORK is set; otherwise if the tracee calls clone(2) with the exit signal set to SIGCHLD, PTRACE_EVENT_FORK will be delivered if PTRACE_O_TRACEFORK is set.

PTRACE_O_TRACEEXEC (since Linux 2.5.46)
Stop the tracee at the next execve(2). A waitpid(2) by the tracer will return a status value such that

  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))

If the execing thread is not a thread group leader, the thread ID is reset to the thread group leader’s ID before this stop. Since Linux 3.0, the former thread ID can be retrieved with PTRACE_GETEVENTMSG.

PTRACE_O_TRACEEXIT (since Linux 2.5.60)
Stop the tracee at exit. A waitpid(2) by the tracer will return a status value such that

  status>>8 == (SIGTRAP | (PTRACE_EVENT_EXIT<<8))

The tracee’s exit status can be retrieved with PTRACE_GETEVENTMSG.

The tracee is stopped early during process exit, when registers are still available, allowing the tracer to see where the exit occurred, whereas the normal exit notification is done after the process is finished exiting. Even though context is available, the tracer cannot prevent the exit from happening at this point.

PTRACE_O_TRACEFORK (since Linux 2.5.46)
Stop the tracee at the next fork(2) and automatically start tracing the newly forked process, which will start with a SIGSTOP, or PTRACE_EVENT_STOP if PTRACE_SEIZE was used. A waitpid(2) by the tracer will return a status value such that

  status>>8 == (SIGTRAP | (PTRACE_EVENT_FORK<<8))

The PID of the new process can be retrieved with PTRACE_GETEVENTMSG.

PTRACE_O_TRACESYSGOOD (since Linux 2.4.6)
When delivering system call traps, set bit 7 in the signal number (i.e., deliver SIGTRAP|0x80). This makes it easy for the tracer to distinguish normal traps from those caused by a system call.

PTRACE_O_TRACEVFORK (since Linux 2.5.46)
Stop the tracee at the next vfork(2) and automatically start tracing the newly vforked process, which will start with a SIGSTOP, or PTRACE_EVENT_STOP if PTRACE_SEIZE was used. A waitpid(2) by the tracer will return a status value such that

  status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK<<8))

The PID of the new process can be retrieved with PTRACE_GETEVENTMSG.

PTRACE_O_TRACEVFORKDONE (since Linux 2.5.60)
Stop the tracee at the completion of the next vfork(2). A waitpid(2) by the tracer will return a status value such that

  status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK_DONE<<8))

The PID of the new process can (since Linux 2.6.18) be retrieved with PTRACE_GETEVENTMSG.

PTRACE_O_TRACESECCOMP (since Linux 3.5)
Stop the tracee when a seccomp(2) SECCOMP_RET_TRACE rule is triggered. A waitpid(2) by the tracer will return a status value such that

  status>>8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP<<8))

While this triggers a PTRACE_EVENT stop, it is similar to a syscall-enter-stop. For details, see the note on PTRACE_EVENT_SECCOMP below. The seccomp event message data (from the SECCOMP_RET_DATA portion of the seccomp filter rule) can be retrieved with PTRACE_GETEVENTMSG.

PTRACE_O_SUSPEND_SECCOMP (since Linux 4.3)
Suspend the tracee’s seccomp protections. This applies regardless of mode, and can be used when the tracee has not yet installed seccomp filters. That is, a valid use case is to suspend a tracee’s seccomp protections before they are installed by the tracee, let the tracee install the filters, and then clear this flag when the filters should be resumed. Setting this option requires that the tracer have the CAP_SYS_ADMIN capability, not have any seccomp protections installed, and not have PTRACE_O_SUSPEND_SECCOMP set on itself.
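The status comparisons given for the PTRACE_O_TRACE* options above all follow one pattern: the stop signal is SIGTRAP and the event code sits in the bits above it. A small decoding helper might look like the sketch below; the function name ptrace_event_from_status() is hypothetical, and it deliberately handles only the SIGTRAP-based events listed above.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Return the PTRACE_EVENT_* code carried by a waitpid(2) status,
 * or 0 if the status is not a SIGTRAP-based ptrace event stop.
 *
 * Equivalent to testing  status>>8 == (SIGTRAP | (event<<8)). */
int ptrace_event_from_status(int status)
{
    if (!WIFSTOPPED(status))
        return 0;
    if (WSTOPSIG(status) != SIGTRAP)
        return 0;
    return (status >> 16) & 0xffff;
}
```

A tracer would typically call such a helper on every status returned by waitpid(2) and compare the result with PTRACE_EVENT_FORK, PTRACE_EVENT_EXEC, and so on.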

PTRACE_GETEVENTMSG (since Linux 2.5.46)
Retrieve a message (as an unsigned long) about the ptrace event that just happened, placing it at the address data in the tracer. For PTRACE_EVENT_EXIT, this is the tracee’s exit status. For PTRACE_EVENT_FORK, PTRACE_EVENT_VFORK, PTRACE_EVENT_VFORK_DONE, and PTRACE_EVENT_CLONE, this is the PID of the new process. For PTRACE_EVENT_SECCOMP, this is the seccomp(2) filter’s SECCOMP_RET_DATA associated with the triggered rule. (addr is ignored.)
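The sketch below combines PTRACE_SETOPTIONS, PTRACE_O_TRACEEXIT, and PTRACE_GETEVENTMSG to read a tracee's exit status while it is still stopped at exit. The helper name exit_code_from_event() is hypothetical. The example assumes the event message uses the same encoding as a wait(2) status, so it is decoded with WEXITSTATUS().

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return the exit code observed at the PTRACE_EVENT_EXIT stop,
 * or -1 on error. */
int exit_code_from_event(int exit_code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit(exit_code);
    }

    int status;
    int result = -1;

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;

    /* addr is ignored; data is the option bit mask. */
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACEEXIT) == -1 ||
        ptrace(PTRACE_CONT, pid, NULL, NULL) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }

    if (waitpid(pid, &status, 0) == -1)
        return -1;

    if (WIFSTOPPED(status) &&
        status >> 8 == (SIGTRAP | (PTRACE_EVENT_EXIT << 8))) {
        unsigned long msg = 0;
        if (ptrace(PTRACE_GETEVENTMSG, pid, NULL, &msg) == 0) {
            /* Assumption: msg is encoded like a wait(2) status. */
            int exit_status = (int)msg;
            result = WEXITSTATUS(exit_status);
        }
        /* The exit cannot be prevented; let it complete and reap. */
        ptrace(PTRACE_CONT, pid, NULL, NULL);
        waitpid(pid, &status, 0);
    }
    return result;
}
```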

PTRACE_CONT
Restart the stopped tracee process. If data is nonzero, it is interpreted as the number of a signal to be delivered to the tracee; otherwise, no signal is delivered. Thus, for example, the tracer can control whether a signal sent to the tracee is delivered or not. (addr is ignored.)

PTRACE_SYSCALL
PTRACE_SINGLESTEP
Restart the stopped tracee as for PTRACE_CONT, but arrange for the tracee to be stopped at the next entry to or exit from a system call, or after execution of a single instruction, respectively. (The tracee will also, as usual, be stopped upon receipt of a signal.) From the tracer’s perspective, the tracee will appear to have been stopped by receipt of a SIGTRAP. So, for PTRACE_SYSCALL, for example, the idea is to inspect the arguments to the system call at the first stop, then do another PTRACE_SYSCALL and inspect the return value of the system call at the second stop. The data argument is treated as for PTRACE_CONT. (addr is ignored.)
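A minimal sketch of the PTRACE_SYSCALL loop described above is given here. It enables PTRACE_O_TRACESYSGOOD so that syscall stops can be told apart from other SIGTRAP stops, then counts them until the tracee exits. The helper name count_syscall_stops() is hypothetical, and for brevity the sketch suppresses any other signal it happens to observe rather than forwarding it.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return the number of syscall-enter and syscall-exit stops seen while
 * the tracee runs to completion, or -1 on error. */
int count_syscall_stops(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        syscall(SYS_getpid);     /* at least one traced system call */
        _exit(0);
    }

    int status;
    int stops = 0;

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;

    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }

    for (;;) {
        /* Run until the next system call entry or exit. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) == -1)
            break;
        if (waitpid(pid, &status, 0) == -1)
            break;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        /* With TRACESYSGOOD, syscall stops report SIGTRAP|0x80. */
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;
    }
    return stops;
}
```

A real tracer would read the registers at each stop (for example with PTRACE_GETREGSET) to obtain the system call arguments on entry and the return value on exit.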

PTRACE_SET_SYSCALL (since Linux 2.6.16)
When in syscall-enter-stop, change the number of the system call that is about to be executed to the number specified in the data argument. The addr argument is ignored. This operation is currently supported only on arm (and arm64, though only for backwards compatibility), but most other architectures have other means of accomplishing this (usually by changing the register that the userland code passed the system call number in).

PTRACE_SYSEMU
PTRACE_SYSEMU_SINGLESTEP (since Linux 2.6.14)
For PTRACE_SYSEMU, continue and stop on entry to the next system call, which will not be executed. See the documentation on syscall-stops below. For PTRACE_SYSEMU_SINGLESTEP, do the same but also singlestep if not a system call. This call is used by programs like User Mode Linux that want to emulate all the tracee’s system calls. The data argument is treated as for PTRACE_CONT. The addr argument is ignored. These operations are currently supported only on x86.

PTRACE_LISTEN (since Linux 3.4)
Restart the stopped tracee, but prevent it from executing. The resulting state of the tracee is similar to a process which has been stopped by a SIGSTOP (or other stopping signal). See the “group-stop” subsection for additional information. PTRACE_LISTEN works only on tracees attached by PTRACE_SEIZE.

PTRACE_KILL
Send the tracee a SIGKILL to terminate it. (addr and data are ignored.)

This operation is deprecated; do not use it! Instead, send a SIGKILL directly using kill(2) or tgkill(2). The problem with PTRACE_KILL is that it requires the tracee to be in signal-delivery-stop, otherwise it may not work (i.e., may complete successfully but won’t kill the tracee). By contrast, sending a SIGKILL directly has no such limitation.

PTRACE_INTERRUPT (since Linux 3.4)
Stop a tracee. If the tracee is running or sleeping in kernel space and PTRACE_SYSCALL is in effect, the system call is interrupted and syscall-exit-stop is reported. (The interrupted system call is restarted when the tracee is restarted.) If the tracee was already stopped by a signal and PTRACE_LISTEN was sent to it, the tracee stops with PTRACE_EVENT_STOP and WSTOPSIG(status) returns the stop signal. If any other ptrace-stop is generated at the same time (for example, if a signal is sent to the tracee), this ptrace-stop happens. If none of the above applies (for example, if the tracee is running in user space), it stops with PTRACE_EVENT_STOP with WSTOPSIG(status) == SIGTRAP. PTRACE_INTERRUPT only works on tracees attached by PTRACE_SEIZE.

PTRACE_ATTACH
Attach to the process specified in pid, making it a tracee of the calling process. The tracee is sent a SIGSTOP, but will not necessarily have stopped by the completion of this call; use waitpid(2) to wait for the tracee to stop. See the “Attaching and detaching” subsection for additional information. (addr and data are ignored.)

Permission to perform a PTRACE_ATTACH is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see below.

PTRACE_SEIZE (since Linux 3.4)
Attach to the process specified in pid, making it a tracee of the calling process. Unlike PTRACE_ATTACH, PTRACE_SEIZE does not stop the process. Group-stops are reported as PTRACE_EVENT_STOP and WSTOPSIG(status) returns the stop signal. Automatically attached children stop with PTRACE_EVENT_STOP and WSTOPSIG(status) returns SIGTRAP instead of having SIGSTOP signal delivered to them. execve(2) does not deliver an extra SIGTRAP. Only a PTRACE_SEIZEd process can accept PTRACE_INTERRUPT and PTRACE_LISTEN commands. The “seized” behavior just described is inherited by children that are automatically attached using PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, and PTRACE_O_TRACECLONE. addr must be zero. data contains a bit mask of ptrace options to activate immediately.

Permission to perform a PTRACE_SEIZE is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see below.

PTRACE_SECCOMP_GET_FILTER (since Linux 4.4)
This operation allows the tracer to dump the tracee’s classic BPF filters.

addr is an integer specifying the index of the filter to be dumped. The most recently installed filter has the index 0. If addr is greater than the number of installed filters, the operation fails with the error ENOENT.

data is either a pointer to a struct sock_filter array that is large enough to store the BPF program, or NULL if the program is not to be stored.

Upon success, the return value is the number of instructions in the BPF program. If data was NULL, then this return value can be used to correctly size the struct sock_filter array passed in a subsequent call.

This operation fails with the error EACCES if the caller does not have the CAP_SYS_ADMIN capability or if the caller is in strict or filter seccomp mode. If the filter referred to by addr is not a classic BPF filter, the operation fails with the error EMEDIUMTYPE.

This operation is available if the kernel was configured with both the CONFIG_SECCOMP_FILTER and the CONFIG_CHECKPOINT_RESTORE options.
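
The two-call sizing pattern described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name dump_seccomp_filter is hypothetical, and the sketch assumes the tracee is already in a ptrace-stop and that the caller holds CAP_SYS_ADMIN.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <linux/filter.h>   /* struct sock_filter */

/* Hypothetical helper: return a malloc()ed copy of the classic BPF
   filter at the given index, or NULL on failure.  On success, *count
   holds the number of instructions.  The caller frees the result. */
struct sock_filter *
dump_seccomp_filter(pid_t pid, unsigned long index, size_t *count)
{
    /* First call: data is NULL, so only the instruction count is returned. */
    long n = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, (void *) index, NULL);
    if (n <= 0)
        return NULL;

    struct sock_filter *prog = calloc((size_t) n, sizeof(*prog));
    if (prog == NULL)
        return NULL;

    /* Second call: the array is now large enough to hold the program. */
    if (ptrace(PTRACE_SECCOMP_GET_FILTER, pid, (void *) index, prog) != n) {
        free(prog);
        return NULL;
    }

    *count = (size_t) n;
    return prog;
}
```

Calling the helper with index 0 would fetch the most recently installed filter; a NULL return covers the ENOENT, EACCES, and EMEDIUMTYPE failures alike, so a real tracer would probably want to inspect errno as well.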

PTRACE_DETACH
Restart the stopped tracee as for PTRACE_CONT, but first detach from it. Under Linux, a tracee can be detached in this way regardless of which method was used to initiate tracing. (addr is ignored.)

PTRACE_GET_THREAD_AREA (since Linux 2.6.0)
This operation performs a similar task to get_thread_area(2). It reads the TLS entry in the GDT whose index is given in addr, placing a copy of the entry into the struct user_desc pointed to by data. (By contrast with get_thread_area(2), the entry_number of the struct user_desc is ignored.)

PTRACE_SET_THREAD_AREA (since Linux 2.6.0)
This operation performs a similar task to set_thread_area(2). It sets the TLS entry in the GDT whose index is given in addr, assigning it the data supplied in the struct user_desc pointed to by data. (By contrast with set_thread_area(2), the entry_number of the struct user_desc is ignored; in other words, this ptrace operation can’t be used to allocate a free TLS entry.)

PTRACE_GET_SYSCALL_INFO (since Linux 5.3)
Retrieve information about the system call that caused the stop. The information is placed into the buffer pointed to by the data argument, which should be a pointer to a buffer of type struct ptrace_syscall_info. The addr argument contains the size of the buffer pointed to by the data argument (i.e., sizeof(struct ptrace_syscall_info)). The return value contains the number of bytes available to be written by the kernel. If the size of the data to be written by the kernel exceeds the size specified by the addr argument, the output data is truncated.

The ptrace_syscall_info structure contains the following fields:

struct ptrace_syscall_info {
    __u8 op;        /* Type of system call stop */
    __u32 arch;     /* AUDIT_ARCH_* value; see seccomp(2) */
    __u64 instruction_pointer; /* CPU instruction pointer */
    __u64 stack_pointer;    /* CPU stack pointer */
    union {
        struct {    /* op == PTRACE_SYSCALL_INFO_ENTRY */
            __u64 nr;       /* System call number */
            __u64 args[6];  /* System call arguments */
        } entry;
        struct {    /* op == PTRACE_SYSCALL_INFO_EXIT */
            __s64 rval;     /* System call return value */
            __u8 is_error;  /* System call error flag;
                               Boolean: does rval contain
                               an error value (-ERRCODE) or
                               a nonerror return value? */
        } exit;
        struct {    /* op == PTRACE_SYSCALL_INFO_SECCOMP */
            __u64 nr;       /* System call number */
            __u64 args[6];  /* System call arguments */
            __u32 ret_data; /* SECCOMP_RET_DATA portion
                               of SECCOMP_RET_TRACE
                               return value */
        } seccomp;
    };
};

The op, arch, instruction_pointer, and stack_pointer fields are defined for all kinds of ptrace system call stops. The rest of the structure is a union; one should read only those fields that are meaningful for the kind of system call stop specified by the op field.

The op field has one of the following values (defined in <linux/ptrace.h>) indicating what type of stop occurred and which part of the union is filled:

PTRACE_SYSCALL_INFO_ENTRY
The entry component of the union contains information relating to a system call entry stop.

PTRACE_SYSCALL_INFO_EXIT
The exit component of the union contains information relating to a system call exit stop.

PTRACE_SYSCALL_INFO_SECCOMP
The seccomp component of the union contains information relating to a PTRACE_EVENT_SECCOMP stop.

PTRACE_SYSCALL_INFO_NONE
No component of the union contains relevant information.

In case of system call entry or exit stops, the data returned by PTRACE_GET_SYSCALL_INFO is limited to type PTRACE_SYSCALL_INFO_NONE unless the PTRACE_O_TRACESYSGOOD option is set before the corresponding system call stop has occurred.
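
As an illustration of reading only the union member selected by op, the following sketch formats a one-line summary of a filled-in structure. The helper name format_syscall_info and the output format are hypothetical; the sketch assumes kernel headers that provide struct ptrace_syscall_info (Linux 5.3 or later).

```c
#include <stdio.h>
#include <linux/ptrace.h>   /* struct ptrace_syscall_info, PTRACE_SYSCALL_INFO_* */

/* Hypothetical helper: write a short description of the stop into buf.
   Only the union member that matches info->op is read.
   Returns whatever snprintf() returns. */
int
format_syscall_info(const struct ptrace_syscall_info *info,
                    char *buf, size_t len)
{
    switch (info->op) {
    case PTRACE_SYSCALL_INFO_ENTRY:
        return snprintf(buf, len, "enter nr=%llu arg0=0x%llx",
                        (unsigned long long) info->entry.nr,
                        (unsigned long long) info->entry.args[0]);
    case PTRACE_SYSCALL_INFO_EXIT:
        return snprintf(buf, len, "exit %s rval=%lld",
                        info->exit.is_error ? "error" : "ok",
                        (long long) info->exit.rval);
    case PTRACE_SYSCALL_INFO_SECCOMP:
        return snprintf(buf, len, "seccomp nr=%llu ret_data=%u",
                        (unsigned long long) info->seccomp.nr,
                        (unsigned int) info->seccomp.ret_data);
    default:    /* PTRACE_SYSCALL_INFO_NONE: no union member is valid */
        return snprintf(buf, len, "none");
    }
}
```

A tracer would typically fill the structure with ptrace(PTRACE_GET_SYSCALL_INFO, pid, sizeof(info), &info) at a syscall-stop and then pass it to a helper of this kind.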

Death under ptrace

When a (possibly multithreaded) process receives a killing signal (one whose disposition is set to SIG_DFL and whose default action is to kill the process), all threads exit. Tracees report their death to their tracer(s). Notification of this event is delivered via waitpid(2).

Note that the killing signal will first cause signal-delivery-stop (on one tracee only), and only after it is injected by the tracer (or after it was dispatched to a thread which isn’t traced), will death from the signal happen on all tracees within a multithreaded process. (The term “signal-delivery-stop” is explained below.)

SIGKILL does not generate signal-delivery-stop and therefore the tracer can’t suppress it. SIGKILL kills even within system calls (syscall-exit-stop is not generated prior to death by SIGKILL). The net effect is that SIGKILL always kills the process (all its threads), even if some threads of the process are ptraced.

When the tracee calls _exit(2), it reports its death to its tracer. Other threads are not affected.

When any thread executes exit_group(2), every tracee in its thread group reports its death to its tracer.

If the PTRACE_O_TRACEEXIT option is on, PTRACE_EVENT_EXIT will happen before actual death. This applies to exits via exit(2), exit_group(2), and signal deaths (except SIGKILL, depending on the kernel version; see BUGS below), and when threads are torn down on execve(2) in a multithreaded process.

The tracer cannot assume that the ptrace-stopped tracee exists. There are many scenarios when the tracee may die while stopped (such as SIGKILL). Therefore, the tracer must be prepared to handle an ESRCH error on any ptrace operation. Unfortunately, the same error is returned if the tracee exists but is not ptrace-stopped (for commands which require a stopped tracee), or if it is not traced by the process which issued the ptrace call. The tracer needs to keep track of the stopped/running state of the tracee, and interpret ESRCH as “tracee died unexpectedly” only if it knows that the tracee has been observed to enter ptrace-stop. Note that there is no guarantee that waitpid(WNOHANG) will reliably report the tracee’s death status if a ptrace operation returned ESRCH. waitpid(WNOHANG) may return 0 instead. In other words, the tracee may be “not yet fully dead”, but already refusing ptrace operations.

The tracer can’t assume that the tracee always ends its life by reporting WIFEXITED(status) or WIFSIGNALED(status); there are cases where this does not occur. For example, if a thread other than thread group leader does an execve(2), it disappears; its PID will never be seen again, and any subsequent ptrace stops will be reported under the thread group leader’s PID.

Stopped states

A tracee can be in two states: running or stopped. For the purposes of ptrace, a tracee which is blocked in a system call (such as read(2), pause(2), etc.) is nevertheless considered to be running, even if the tracee is blocked for a long time. The state of the tracee after PTRACE_LISTEN is somewhat of a gray area: it is not in any ptrace-stop (ptrace commands won’t work on it, and it will deliver waitpid(2) notifications), but it also may be considered “stopped” because it is not executing instructions (is not scheduled), and if it was in group-stop before PTRACE_LISTEN, it will not respond to signals until SIGCONT is received.

There are many kinds of states when the tracee is stopped, and in ptrace discussions they are often conflated. Therefore, it is important to use precise terms.

In this manual page, any stopped state in which the tracee is ready to accept ptrace commands from the tracer is called ptrace-stop. Ptrace-stops can be further subdivided into signal-delivery-stop, group-stop, syscall-stop, PTRACE_EVENT stops, and so on. These stopped states are described in detail below.

When the running tracee enters ptrace-stop, it notifies its tracer using waitpid(2) (or one of the other “wait” system calls). Most of this manual page assumes that the tracer waits with:

pid = waitpid(pid_or_minus_1, &status, __WALL);

Ptrace-stopped tracees are reported as returns with pid greater than 0 and WIFSTOPPED(status) true.

The __WALL flag does not include the WSTOPPED and WEXITED flags, but implies their functionality.

Setting the WCONTINUED flag when calling waitpid(2) is not recommended: the “continued” state is per-process and consuming it can confuse the real parent of the tracee.

Use of the WNOHANG flag may cause waitpid(2) to return 0 (“no wait results available yet”) even if the tracer knows there should be a notification. Example:

errno = 0;
ptrace(PTRACE_CONT, pid, 0L, 0L);
if (errno == ESRCH) {
    /* tracee is dead */
    r = waitpid(pid, &status, __WALL | WNOHANG);
    /* r can still be 0 here! */
}

The following kinds of ptrace-stops exist: signal-delivery-stops, group-stops, PTRACE_EVENT stops, syscall-stops. They all are reported by waitpid(2) with WIFSTOPPED(status) true. They may be differentiated by examining the value status>>8, and if there is ambiguity in that value, by querying PTRACE_GETSIGINFO. (Note: the WSTOPSIG(status) macro can’t be used to perform this examination, because it returns the value (status>>8) & 0xff.)
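
A first-pass classification based on status>>8 might look like the sketch below. The enum and function names are hypothetical, and the sketch assumes the tracer has set PTRACE_O_TRACESYSGOOD; without that option, syscall-stops would fall into the last category and need a PTRACE_GETSIGINFO query.

```c
#include <signal.h>
#include <sys/wait.h>

enum stop_kind {
    STOP_NONE,            /* not a ptrace-stop (exited, killed, ...) */
    STOP_SYSCALL,         /* syscall-stop marked by PTRACE_O_TRACESYSGOOD */
    STOP_EVENT,           /* PTRACE_EVENT stop */
    STOP_SIGNAL_OR_GROUP  /* signal-delivery-stop or group-stop;
                             PTRACE_GETSIGINFO may be needed to decide */
};

/* Hypothetical helper: classify a wait status by examining status>>8. */
enum stop_kind
classify_stop(int status)
{
    if (!WIFSTOPPED(status))
        return STOP_NONE;

    int hi = status >> 8;   /* signal in the low byte, event number above it */

    if (hi == (SIGTRAP | 0x80))
        return STOP_SYSCALL;
    if ((hi >> 8) != 0)
        return STOP_EVENT;
    return STOP_SIGNAL_OR_GROUP;
}
```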

Signal-delivery-stop

When a (possibly multithreaded) process receives any signal except SIGKILL, the kernel selects an arbitrary thread which handles the signal. (If the signal is generated with tgkill(2), the target thread can be explicitly selected by the caller.) If the selected thread is traced, it enters signal-delivery-stop. At this point, the signal is not yet delivered to the process, and can be suppressed by the tracer. If the tracer doesn’t suppress the signal, it passes the signal to the tracee in the next ptrace restart operation. This second step of signal delivery is called signal injection in this manual page. Note that if the signal is blocked, signal-delivery-stop doesn’t happen until the signal is unblocked, with the usual exception that SIGSTOP can’t be blocked.

Signal-delivery-stop is observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, with the signal returned by WSTOPSIG(status). If the signal is SIGTRAP, this may be a different kind of ptrace-stop; see the “Syscall-stops” and “execve” sections below for details. If WSTOPSIG(status) returns a stopping signal, this may be a group-stop; see below.

Signal injection and suppression

After signal-delivery-stop is observed by the tracer, the tracer should restart the tracee with the call

ptrace(PTRACE_restart, pid, 0, sig)

where PTRACE_restart is one of the restarting ptrace operations. If sig is 0, then a signal is not delivered. Otherwise, the signal sig is delivered. This operation is called signal injection in this manual page, to distinguish it from signal-delivery-stop.

The sig value may be different from the WSTOPSIG(status) value: the tracer can cause a different signal to be injected.

Note that a suppressed signal still causes system calls to return prematurely. In this case, system calls will be restarted: the tracer will observe the tracee reexecuting the interrupted system call (or the restart_syscall(2) system call, for a few system calls which use a different mechanism for restarting) if the tracer uses PTRACE_SYSCALL. Even system calls (such as poll(2)) which are not restartable after a signal are restarted after the signal is suppressed; however, kernel bugs exist which cause some system calls to fail with EINTR even though no observable signal is injected into the tracee.

Restarting ptrace commands issued in ptrace-stops other than signal-delivery-stop are not guaranteed to inject a signal, even if sig is nonzero. No error is reported; a nonzero sig may simply be ignored. Ptrace users should not try to “create a new signal” this way: use tgkill(2) instead.

The fact that signal injection operations may be ignored when restarting the tracee after ptrace stops that are not signal-delivery-stops is a cause of confusion among ptrace users. One typical scenario is that the tracer observes group-stop, mistakes it for signal-delivery-stop, restarts the tracee with

ptrace(PTRACE_restart, pid, 0, stopsig)

with the intention of injecting stopsig, but stopsig gets ignored and the tracee continues to run.

The SIGCONT signal has a side effect of waking up (all threads of) a group-stopped process. This side effect happens before signal-delivery-stop. The tracer can’t suppress this side effect (it can only suppress signal injection, which only causes the SIGCONT handler to not be executed in the tracee, if such a handler is installed). In fact, waking up from group-stop may be followed by signal-delivery-stop for signal(s) other than SIGCONT, if they were pending when SIGCONT was delivered. In other words, SIGCONT may not be the first signal observed by the tracee after it was sent.

Stopping signals cause (all threads of) a process to enter group-stop. This side effect happens after signal injection, and therefore can be suppressed by the tracer.

In Linux 2.4 and earlier, the SIGSTOP signal can’t be injected.

PTRACE_GETSIGINFO can be used to retrieve a siginfo_t structure which corresponds to the delivered signal. PTRACE_SETSIGINFO may be used to modify it. If PTRACE_SETSIGINFO has been used to alter siginfo_t, the si_signo field and the sig parameter in the restarting command must match, otherwise the result is undefined.

Group-stop

When a (possibly multithreaded) process receives a stopping signal, all threads stop. If some threads are traced, they enter a group-stop. Note that the stopping signal will first cause signal-delivery-stop (on one tracee only), and only after it is injected by the tracer (or after it was dispatched to a thread which isn’t traced), will group-stop be initiated on all tracees within the multithreaded process. As usual, every tracee reports its group-stop separately to the corresponding tracer.

Group-stop is observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, with the stopping signal available via WSTOPSIG(status). The same result is returned by some other classes of ptrace-stops, therefore the recommended practice is to perform the call

ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)

The call can be avoided if the signal is not SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU; only these four signals are stopping signals. If the tracer sees something else, it can’t be a group-stop. Otherwise, the tracer needs to call PTRACE_GETSIGINFO. If PTRACE_GETSIGINFO fails with EINVAL, then it is definitely a group-stop. (Other failure codes are possible, such as ESRCH (“no such process”) if a SIGKILL killed the tracee.)

If the tracee was attached using PTRACE_SEIZE, group-stop is indicated by PTRACE_EVENT_STOP: status>>16 == PTRACE_EVENT_STOP. This allows detection of group-stops without requiring an extra PTRACE_GETSIGINFO call.
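
For a seized tracee, the check can be sketched as below. The function name is hypothetical. Because PTRACE_EVENT_STOP is also reported for PTRACE_INTERRUPT (with SIGTRAP as the stop signal), the sketch additionally requires one of the four stopping signals.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Hypothetical helper: nonzero if status reports a group-stop of a
   tracee that was attached with PTRACE_SEIZE. */
int
is_seized_group_stop(int status)
{
    if (!WIFSTOPPED(status) || (status >> 16) != PTRACE_EVENT_STOP)
        return 0;

    switch (WSTOPSIG(status)) {
    case SIGSTOP:
    case SIGTSTP:
    case SIGTTIN:
    case SIGTTOU:
        return 1;
    default:
        return 0;   /* for example, SIGTRAP after PTRACE_INTERRUPT */
    }
}
```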

As of Linux 2.6.38, after the tracer sees the tracee ptrace-stop and until it restarts or kills it, the tracee will not run, and will not send notifications (except SIGKILL death) to the tracer, even if the tracer enters into another waitpid(2) call.

The kernel behavior described in the previous paragraph causes a problem with transparent handling of stopping signals. If the tracer restarts the tracee after group-stop, the stopping signal is effectively ignored—the tracee doesn’t remain stopped, it runs. If the tracer doesn’t restart the tracee before entering into the next waitpid(2), future SIGCONT signals will not be reported to the tracer; this would cause the SIGCONT signals to have no effect on the tracee.

Since Linux 3.4, there is a method to overcome this problem: instead of PTRACE_CONT, a PTRACE_LISTEN command can be used to restart a tracee in a way where it does not execute, but waits for a new event which it can report via waitpid(2) (such as when it is restarted by a SIGCONT).

PTRACE_EVENT stops

If the tracer sets PTRACE_O_TRACE_* options, the tracee will enter ptrace-stops called PTRACE_EVENT stops.

PTRACE_EVENT stops are observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, and WSTOPSIG(status) returns SIGTRAP (or for PTRACE_EVENT_STOP, returns the stopping signal if the tracee is in a group-stop). An additional bit is set in the higher byte of the status word: the value status>>8 will be

((PTRACE_EVENT_foo<<8) | SIGTRAP).

The following events exist:

PTRACE_EVENT_VFORK
Stop before return from vfork(2) or clone(2) with the CLONE_VFORK flag. When the tracee is continued after this stop, it will wait for the child to exit/exec before continuing its execution (in other words, the usual behavior on vfork(2)).

PTRACE_EVENT_FORK
Stop before return from fork(2) or clone(2) with the exit signal set to SIGCHLD.

PTRACE_EVENT_CLONE
Stop before return from clone(2).

PTRACE_EVENT_VFORK_DONE
Stop before return from vfork(2) or clone(2) with the CLONE_VFORK flag, but after the child unblocked this tracee by exiting or execing.

For all four stops described above, the stop occurs in the parent (i.e., the tracee), not in the newly created thread. PTRACE_GETEVENTMSG can be used to retrieve the new thread’s ID.

PTRACE_EVENT_EXEC
Stop before return from execve(2). Since Linux 3.0, PTRACE_GETEVENTMSG returns the former thread ID.

PTRACE_EVENT_EXIT
Stop before exit (including death from exit_group(2)), signal death, or exit caused by execve(2) in a multithreaded process. PTRACE_GETEVENTMSG returns the exit status. Registers can be examined (unlike when “real” exit happens). The tracee is still alive; it needs to be PTRACE_CONTed or PTRACE_DETACHed to finish exiting.

PTRACE_EVENT_STOP
Stop induced by PTRACE_INTERRUPT command, or group-stop, or initial ptrace-stop when a new child is attached (only if attached using PTRACE_SEIZE).

PTRACE_EVENT_SECCOMP
Stop triggered by a seccomp(2) rule on tracee syscall entry when PTRACE_O_TRACESECCOMP has been set by the tracer. The seccomp event message data (from the SECCOMP_RET_DATA portion of the seccomp filter rule) can be retrieved with PTRACE_GETEVENTMSG. The semantics of this stop are described in detail in a separate section below.

PTRACE_GETSIGINFO on PTRACE_EVENT stops returns SIGTRAP in si_signo, with si_code set to (event<<8) | SIGTRAP.
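
Extracting the event number from a wait status and mapping it to a name can be sketched as follows. Both function names are hypothetical; the event constants are those listed above, as provided by <sys/ptrace.h>.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Hypothetical helper: return the PTRACE_EVENT_* number carried in a
   wait status, or 0 if the status does not describe a PTRACE_EVENT stop. */
int
ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status))
        return 0;
    return (status >> 16) & 0xffff;
}

/* Hypothetical helper: map an event number to its symbolic name. */
const char *
ptrace_event_name(int event)
{
    switch (event) {
    case PTRACE_EVENT_FORK:       return "PTRACE_EVENT_FORK";
    case PTRACE_EVENT_VFORK:      return "PTRACE_EVENT_VFORK";
    case PTRACE_EVENT_CLONE:      return "PTRACE_EVENT_CLONE";
    case PTRACE_EVENT_EXEC:       return "PTRACE_EVENT_EXEC";
    case PTRACE_EVENT_VFORK_DONE: return "PTRACE_EVENT_VFORK_DONE";
    case PTRACE_EVENT_EXIT:       return "PTRACE_EVENT_EXIT";
    case PTRACE_EVENT_SECCOMP:    return "PTRACE_EVENT_SECCOMP";
    case PTRACE_EVENT_STOP:       return "PTRACE_EVENT_STOP";
    default:                      return "(no event)";
    }
}
```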

Syscall-stops

If the tracee was restarted by PTRACE_SYSCALL or PTRACE_SYSEMU, the tracee enters syscall-enter-stop just prior to entering any system call (which will not be executed if the restart was using PTRACE_SYSEMU, regardless of any change made to registers at this point or how the tracee is restarted after this stop). No matter which method caused the syscall-enter-stop, if the tracer restarts the tracee with PTRACE_SYSCALL, the tracee enters syscall-exit-stop when the system call is finished, or if it is interrupted by a signal. (That is, signal-delivery-stop never happens between syscall-enter-stop and syscall-exit-stop; it happens after syscall-exit-stop.) If the tracee is continued using any other method (including PTRACE_SYSEMU), no syscall-exit-stop occurs. Note that all mentions of PTRACE_SYSEMU apply equally to PTRACE_SYSEMU_SINGLESTEP.

However, even if the tracee was continued using PTRACE_SYSCALL, it is not guaranteed that the next stop will be a syscall-exit-stop. Other possibilities are that the tracee may stop in a PTRACE_EVENT stop (including seccomp stops), exit (if it entered _exit(2) or exit_group(2)), be killed by SIGKILL, or die silently (if it is a thread group leader, the execve(2) happened in another thread, and that thread is not traced by the same tracer; this situation is discussed later).

Syscall-enter-stop and syscall-exit-stop are observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, and WSTOPSIG(status) giving SIGTRAP. If the PTRACE_O_TRACESYSGOOD option was set by the tracer, then WSTOPSIG(status) will give the value (SIGTRAP | 0x80).

Syscall-stops can be distinguished from signal-delivery-stop with SIGTRAP by querying PTRACE_GETSIGINFO for the following cases:

si_code <= 0
SIGTRAP was delivered as a result of a user-space action, for example, a system call (tgkill(2), kill(2), sigqueue(3), etc.), expiration of a POSIX timer, change of state on a POSIX message queue, or completion of an asynchronous I/O operation.

si_code == SI_KERNEL (0x80)
SIGTRAP was sent by the kernel.

si_code == SIGTRAP or si_code == (SIGTRAP|0x80)
This is a syscall-stop.

However, syscall-stops happen very often (twice per system call), and performing PTRACE_GETSIGINFO for every syscall-stop may be somewhat expensive.

Some architectures allow the cases to be distinguished by examining registers. For example, on x86, rax == -ENOSYS in syscall-enter-stop. Since SIGTRAP (like any other signal) always happens after syscall-exit-stop, and at this point rax almost never contains -ENOSYS, the SIGTRAP looks like “syscall-stop which is not syscall-enter-stop”; in other words, it looks like a “stray syscall-exit-stop” and can be detected this way. But such detection is fragile and is best avoided.

Using the PTRACE_O_TRACESYSGOOD option is the recommended method to distinguish syscall-stops from other kinds of ptrace-stops, since it is reliable and does not incur a performance penalty.

Syscall-enter-stop and syscall-exit-stop are indistinguishable from each other by the tracer. The tracer needs to keep track of the sequence of ptrace-stops in order to not misinterpret syscall-enter-stop as syscall-exit-stop or vice versa. In general, a syscall-enter-stop is always followed by syscall-exit-stop, PTRACE_EVENT stop, or the tracee’s death; no other kinds of ptrace-stop can occur in between. However, note that seccomp stops (see below) can cause syscall-exit-stops, without preceding syscall-entry-stops. If seccomp is in use, care needs to be taken not to misinterpret such stops as syscall-entry-stops.

If after syscall-enter-stop, the tracer uses a restarting command other than PTRACE_SYSCALL, syscall-exit-stop is not generated.

PTRACE_GETSIGINFO on syscall-stops returns SIGTRAP in si_signo, with si_code set to SIGTRAP or (SIGTRAP|0x80).
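
Because enter and exit stops look identical, a tracer usually keeps a small per-tracee flag. The sketch below shows one way to do that bookkeeping; the names are hypothetical, and it assumes PTRACE_O_TRACESYSGOOD is set, the tracee is always restarted with PTRACE_SYSCALL, and seccomp stops are not in use (they can produce an exit stop without a preceding enter stop).

```c
#include <signal.h>
#include <sys/wait.h>

enum syscall_phase { OUTSIDE_SYSCALL, INSIDE_SYSCALL };

/* Hypothetical helper: given the phase recorded for a tracee and the
   wait status just reported for it, return the new phase.  Each
   syscall-stop toggles the phase; any other stop leaves it unchanged. */
enum syscall_phase
next_phase(enum syscall_phase phase, int status)
{
    if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
        return phase == OUTSIDE_SYSCALL ? INSIDE_SYSCALL : OUTSIDE_SYSCALL;
    return phase;
}
```

With this scheme, a syscall-stop seen while the recorded phase is OUTSIDE_SYSCALL is treated as a syscall-enter-stop, and the next one as the matching syscall-exit-stop.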

PTRACE_EVENT_SECCOMP stops (Linux 3.5 to Linux 4.7)

The behavior of PTRACE_EVENT_SECCOMP stops and their interaction with other kinds of ptrace stops has changed between kernel versions. This documents the behavior from their introduction until Linux 4.7 (inclusive). The behavior in later kernel versions is documented in the next section.

A PTRACE_EVENT_SECCOMP stop occurs whenever a SECCOMP_RET_TRACE rule is triggered. This is independent of which method was used to restart the system call. Notably, seccomp still runs even if the tracee was restarted using PTRACE_SYSEMU and this system call is unconditionally skipped.

Restarts from this stop will behave as if the stop had occurred right before the system call in question. In particular, both PTRACE_SYSCALL and PTRACE_SYSEMU will normally cause a subsequent syscall-entry-stop. However, if after the PTRACE_EVENT_SECCOMP the system call number is negative, both the syscall-entry-stop and the system call itself will be skipped. This means that if the system call number is negative after a PTRACE_EVENT_SECCOMP and the tracee is restarted using PTRACE_SYSCALL, the next observed stop will be a syscall-exit-stop, rather than the syscall-entry-stop that might have been expected.

PTRACE_EVENT_SECCOMP stops (since Linux 4.8)

Starting with Linux 4.8, the PTRACE_EVENT_SECCOMP stop was reordered to occur between syscall-entry-stop and syscall-exit-stop. Note that seccomp no longer runs (and no PTRACE_EVENT_SECCOMP will be reported) if the system call is skipped due to PTRACE_SYSEMU.

A PTRACE_EVENT_SECCOMP stop behaves comparably to a syscall-entry-stop (i.e., continuations using PTRACE_SYSCALL will cause syscall-exit-stops, the system call number may be changed, and any other modified registers are visible to the to-be-executed system call as well). Note that there may have been, but need not have been, a preceding syscall-entry-stop.

After a PTRACE_EVENT_SECCOMP stop, seccomp will be rerun, with a SECCOMP_RET_TRACE rule now functioning the same as a SECCOMP_RET_ALLOW. Specifically, this means that if registers are not modified during the PTRACE_EVENT_SECCOMP stop, the system call will then be allowed.

PTRACE_SINGLESTEP stops

[Details of these kinds of stops are yet to be documented.]

Informational and restarting ptrace commands

Most ptrace commands (all except PTRACE_ATTACH, PTRACE_SEIZE, PTRACE_TRACEME, PTRACE_INTERRUPT, and PTRACE_KILL) require the tracee to be in a ptrace-stop, otherwise they fail with ESRCH.

When the tracee is in ptrace-stop, the tracer can read and write data to the tracee using informational commands. These commands leave the tracee in ptrace-stopped state:

ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
ptrace(PTRACE_GETREGSET, pid, NT_foo, &iov);
ptrace(PTRACE_SETREGSET, pid, NT_foo, &iov);
ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);

Note that some errors are not reported. For example, setting signal information (siginfo) may have no effect in some ptrace-stops, yet the call may succeed (return 0 and not set errno); querying PTRACE_GETEVENTMSG may succeed and return some random value if the current ptrace-stop is not documented as returning a meaningful event message.

The call

ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);

affects one tracee. The tracee’s current flags are replaced. Flags are inherited by new tracees created and “auto-attached” via active PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, or PTRACE_O_TRACECLONE options.

Another group of commands makes the ptrace-stopped tracee run. They have the form:

ptrace(cmd, pid, 0, sig);

where cmd is PTRACE_CONT, PTRACE_LISTEN, PTRACE_DETACH, PTRACE_SYSCALL, PTRACE_SINGLESTEP, PTRACE_SYSEMU, or PTRACE_SYSEMU_SINGLESTEP. If the tracee is in signal-delivery-stop, sig is the signal to be injected (if it is nonzero). Otherwise, sig may be ignored. (When restarting a tracee from a ptrace-stop other than signal-delivery-stop, recommended practice is to always pass 0 in sig.)

Attaching and detaching

A thread can be attached to the tracer using the call

ptrace(PTRACE_ATTACH, pid, 0, 0);

or

ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_flags);

PTRACE_ATTACH sends SIGSTOP to this thread. If the tracer wants this SIGSTOP to have no effect, it needs to suppress it. Note that if other signals are concurrently sent to this thread during attach, the tracer may see the tracee enter signal-delivery-stop with other signal(s) first! The usual practice is to reinject these signals until SIGSTOP is seen, then suppress SIGSTOP injection. The design bug here is that a ptrace attach and a concurrently delivered SIGSTOP may race and the concurrent SIGSTOP may be lost.

Since attaching sends SIGSTOP and the tracer usually suppresses it, this may cause a stray EINTR return from the currently executing system call in the tracee, as described in the “Signal injection and suppression” section.

Since Linux 3.4, PTRACE_SEIZE can be used instead of PTRACE_ATTACH. PTRACE_SEIZE does not stop the attached process. If you need to stop it after attach (or at any other time) without sending it any signals, use the PTRACE_INTERRUPT command.

The operation

ptrace(PTRACE_TRACEME, 0, 0, 0);

turns the calling thread into a tracee. The thread continues to run (doesn’t enter ptrace-stop). A common practice is to follow the PTRACE_TRACEME with

raise(SIGSTOP);

and allow the parent (which is our tracer now) to observe our signal-delivery-stop.
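
The PTRACE_TRACEME handshake just described can be sketched end to end as follows. The function name is hypothetical, and the sketch assumes an environment where ptrace is permitted; it returns -1 rather than a signal number if the child could not become a tracee.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: fork a child that turns itself into a tracee and
   stops.  Returns the stop signal observed by the tracer, or -1 on
   any failure. */
int
observe_traceme_stop(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {                     /* child: the future tracee */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);                 /* enters signal-delivery-stop */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, __WALL) != pid)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;                      /* child exited instead of stopping */

    int sig = WSTOPSIG(status);

    /* Restart with sig == 0 so that the SIGSTOP is suppressed,
       then reap the child once it has exited. */
    ptrace(PTRACE_CONT, pid, NULL, NULL);
    waitpid(pid, &status, __WALL);
    return sig;
}
```

In the normal case the tracer sees SIGSTOP as the stop signal; a real tracer would usually set its ptrace options at this first stop before restarting the tracee.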

If the PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, or PTRACE_O_TRACECLONE options are in effect, then children created by, respectively, fork(2) or clone(2) with the exit signal set to SIGCHLD, vfork(2) or clone(2) with the CLONE_VFORK flag, and other kinds of clone(2), are automatically attached to the same tracer which traced their parent. SIGSTOP is delivered to the children, causing them to enter signal-delivery-stop after they exit the system call which created them.

Detaching of the tracee is performed by:

ptrace(PTRACE_DETACH, pid, 0, sig);

PTRACE_DETACH is a restarting operation; therefore it requires the tracee to be in ptrace-stop. If the tracee is in signal-delivery-stop, a signal can be injected. Otherwise, the sig parameter may be silently ignored.

If the tracee is running when the tracer wants to detach it, the usual solution is to send SIGSTOP (using tgkill(2), to make sure it goes to the correct thread), wait for the tracee to stop in signal-delivery-stop for SIGSTOP and then detach it (suppressing SIGSTOP injection). A design bug is that this can race with concurrent SIGSTOPs. Another complication is that the tracee may enter other ptrace-stops and needs to be restarted and waited for again, until SIGSTOP is seen. Yet another complication is to be sure that the tracee is not already ptrace-stopped, because no signal delivery happens while it is—not even SIGSTOP.
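The usual detach procedure for a running tracee might look like the sketch below. The function names are hypothetical. The sketch assumes that the tracee is not already in ptrace-stop and that the tracer has not requested syscall-stops; it does not solve the race with concurrent SIGSTOPs described above.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stop the running tracee `tid` (in thread group `tgid`) with SIGSTOP,
 * wait for the matching signal-delivery-stop, then detach without
 * injecting the SIGSTOP. Returns 0 on success. */
int stop_and_detach(pid_t tgid, pid_t tid)
{
    /* tgkill(2) makes sure the signal goes to the intended thread. */
    if (syscall(SYS_tgkill, tgid, tid, SIGSTOP) == -1)
        return -1;

    for (;;) {
        int status;

        if (waitpid(tid, &status, __WALL) == -1 || !WIFSTOPPED(status))
            return -1;

        int event = status >> 16;
        int sig = WSTOPSIG(status);

        if (event == 0 && sig == SIGSTOP)
            return ptrace(PTRACE_DETACH, tid, 0L, 0L) == -1 ? -1 : 0;

        /* Some other ptrace-stop: restart the tracee and wait again.
         * Only plain signal-delivery-stops get their signal reinjected. */
        if (ptrace(PTRACE_CONT, tid, 0L, event == 0 ? (long) sig : 0L) == -1)
            return -1;
    }
}

/* Hypothetical demo: seize a forked child, then detach it while it runs. */
int demo_stop_and_detach(void)
{
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        for (;;)
            pause();
    }

    int rc = -1;
    if (ptrace(PTRACE_SEIZE, child, 0L, 0L) != -1)
        rc = stop_and_detach(child, child);

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return rc;
}
```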

If the tracer dies, all tracees are automatically detached and restarted, unless they were in group-stop. Handling of restart from group-stop is currently buggy, but the “as planned” behavior is to leave the tracee stopped and waiting for SIGCONT. If the tracee is restarted from signal-delivery-stop, the pending signal is injected.

execve(2) under ptrace

When one thread in a multithreaded process calls execve(2), the kernel destroys all other threads in the process, and resets the thread ID of the execing thread to the thread group ID (process ID). (Or, to put things another way, when a multithreaded process does an execve(2), at completion of the call, it appears as though the execve(2) occurred in the thread group leader, regardless of which thread did the execve(2).) This resetting of the thread ID looks very confusing to tracers:

- All other threads stop in PTRACE_EVENT_EXIT stop, if the PTRACE_O_TRACEEXIT option was turned on. Then all other threads except the thread group leader report death as if they exited via _exit(2) with exit code 0.
- The execing tracee changes its thread ID while it is in the execve(2). (Remember, under ptrace, the “pid” returned from waitpid(2), or fed into ptrace calls, is the tracee’s thread ID.) That is, the tracee’s thread ID is reset to be the same as its process ID, which is the same as the thread group leader’s thread ID.
- Then a PTRACE_EVENT_EXEC stop happens, if the PTRACE_O_TRACEEXEC option was turned on.
- If the thread group leader has reported its PTRACE_EVENT_EXIT stop by this time, it appears to the tracer that the dead thread leader “reappears from nowhere”. (Note: the thread group leader does not report death via WIFEXITED(status) as long as at least one other thread is still alive. This eliminates the possibility that the tracer will see it dying and then reappearing.) If the thread group leader was still alive, for the tracer this may look as if the thread group leader returns from a different system call than it entered, or even “returned from a system call even though it was not in any system call”. If the thread group leader was not traced (or was traced by a different tracer), then during execve(2) it will appear as if it has become a tracee of the tracer of the execing tracee.

All of the above effects are the artifacts of the thread ID change in the tracee.

The PTRACE_O_TRACEEXEC option is the recommended tool for dealing with this situation. First, it enables PTRACE_EVENT_EXEC stop, which occurs before execve(2) returns. In this stop, the tracer can use PTRACE_GETEVENTMSG to retrieve the tracee’s former thread ID. (This feature was introduced in Linux 3.0.) Second, the PTRACE_O_TRACEEXEC option disables legacy SIGTRAP generation on execve(2).

When the tracer receives the PTRACE_EVENT_EXEC stop notification, it is guaranteed that, except for this tracee and the thread group leader, no other threads from the process are alive.

On receiving the PTRACE_EVENT_EXEC stop notification, the tracer should clean up all its internal data structures describing the threads of this process, and retain only one data structure—one which describes the single still running tracee, with

thread ID == thread group ID == process ID.
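The following sketch shows one way to catch the PTRACE_EVENT_EXEC stop and retrieve the former thread ID with PTRACE_GETEVENTMSG. The function names are hypothetical, and the demo assumes that /bin/sh exists. Since the demo tracee is single-threaded, its former thread ID equals its process ID.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* `child` must be a tracee in ptrace-stop. Enables PTRACE_O_TRACEEXEC,
 * resumes the tracee, and waits for the PTRACE_EVENT_EXEC stop.
 * Returns the tracee's former thread ID, or -1 on error. */
long former_tid_at_exec(pid_t child)
{
    int status;
    unsigned long msg = 0;

    if (ptrace(PTRACE_SETOPTIONS, child, 0L, (long) PTRACE_O_TRACEEXEC) == -1)
        return -1;
    if (ptrace(PTRACE_CONT, child, 0L, 0L) == -1)
        return -1;
    if (waitpid(child, &status, __WALL) == -1 || !WIFSTOPPED(status))
        return -1;
    if ((status >> 8) != (SIGTRAP | (PTRACE_EVENT_EXEC << 8)))
        return -1;
    if (ptrace(PTRACE_GETEVENTMSG, child, 0L, &msg) == -1)
        return -1;

    return (long) msg;
}

/* Hypothetical demo: returns 0 if the reported former thread ID matches
 * the PID of the (single-threaded) child. */
int demo_exec_event(void)
{
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0L, 0L, 0L) == -1)
            _exit(127);
        raise(SIGSTOP);
        execl("/bin/sh", "sh", "-c", ":", (char *) NULL);
        _exit(127);
    }

    int status;
    long former = -1;

    if (waitpid(child, &status, 0) != -1 && WIFSTOPPED(status))
        former = former_tid_at_exec(child);

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return former == (long) child ? 0 : -1;
}
```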

Example: two threads call execve(2) at the same time:

*** we get syscall-enter-stop in thread 1: **
PID1 execve("/bin/foo", "foo" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 1 **
*** we get syscall-enter-stop in thread 2: **
PID2 execve("/bin/bar", "bar" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 2 **
*** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
*** we get syscall-exit-stop for PID0: **
PID0 <... execve resumed> )             = 0

If the PTRACE_O_TRACEEXEC option is not in effect for the execing tracee, and if the tracee was PTRACE_ATTACHed rather than PTRACE_SEIZEd, the kernel delivers an extra SIGTRAP to the tracee after execve(2) returns. This is an ordinary signal (similar to one which can be generated by kill -TRAP), not a special kind of ptrace-stop. Employing PTRACE_GETSIGINFO for this signal returns si_code set to 0 (SI_USER). This signal may be blocked by signal mask, and thus may be delivered (much) later.

Usually, the tracer (for example, strace(1)) would not want to show this extra post-execve SIGTRAP signal to the user, and would suppress its delivery to the tracee (if SIGTRAP is set to SIG_DFL, it is a killing signal). However, determining which SIGTRAP to suppress is not easy. Setting the PTRACE_O_TRACEEXEC option or using PTRACE_SEIZE and thus suppressing this extra SIGTRAP is the recommended approach.

Real parent

The ptrace API (ab)uses the standard UNIX parent/child signaling over waitpid(2). This used to cause the real parent of the process to stop receiving several kinds of waitpid(2) notifications when the child process is traced by some other process.

Many of these bugs have been fixed, but as of Linux 2.6.38 several still exist; see BUGS below.

As of Linux 2.6.38, the following is believed to work correctly:

exit/death by signal is reported first to the tracer, then, when the tracer consumes the waitpid(2) result, to the real parent (to the real parent only when the whole multithreaded process exits). If the tracer and the real parent are the same process, the report is sent only once.

RETURN VALUE

On success, the PTRACE_PEEK* operations return the requested data (but see NOTES), the PTRACE_SECCOMP_GET_FILTER operation returns the number of instructions in the BPF program, the PTRACE_GET_SYSCALL_INFO operation returns the number of bytes available to be written by the kernel, and other operations return zero.

On error, all operations return -1, and errno is set to indicate the error. Since the value returned by a successful PTRACE_PEEK* operation may be -1, the caller must clear errno before the call, and then check it afterward to determine whether or not an error occurred.
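The errno convention for PTRACE_PEEK* can be sketched as follows. The names peek_word(), probe_word, and demo_peek() are hypothetical. The demo deliberately reads a word whose value is -1, which is indistinguishable from an error unless errno is cleared beforehand.

```c
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A word whose value is -1; after fork(2) it lives at the same address
 * in the child. */
static long probe_word = -1;

/* Read one word from the tracee's memory. Returns 0 and stores the word
 * in *out on success; returns -1 with errno set on error. */
int peek_word(pid_t tid, void *addr, long *out)
{
    errno = 0;                      /* must be cleared before the call */
    long word = ptrace(PTRACE_PEEKDATA, tid, addr, 0L);

    if (word == -1 && errno != 0)
        return -1;                  /* a real error */
    *out = word;                    /* may legitimately be -1 */
    return 0;
}

/* Hypothetical demo: returns 0 if the value -1 is read without error. */
int demo_peek(void)
{
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0L, 0L, 0L) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit(0);
    }

    int status;
    long value = 0;
    int rc = -1;

    if (waitpid(child, &status, 0) != -1 && WIFSTOPPED(status))
        rc = peek_word(child, &probe_word, &value);

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return (rc == 0 && value == -1) ? 0 : -1;
}
```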

ERRORS

EBUSY
(i386 only) There was an error with allocating or freeing a debug register.

EFAULT
There was an attempt to read from or write to an invalid area in the tracer’s or the tracee’s memory, probably because the area wasn’t mapped or accessible. Unfortunately, under Linux, different variations of this fault will return EIO or EFAULT more or less arbitrarily.

EINVAL
An attempt was made to set an invalid option.

EIO
op is invalid, or an attempt was made to read from or write to an invalid area in the tracer’s or the tracee’s memory, or there was a word-alignment violation, or an invalid signal was specified during a restart operation.

EPERM
The specified process cannot be traced. This could be because the tracer has insufficient privileges (the required capability is CAP_SYS_PTRACE); unprivileged processes cannot trace processes that they cannot send signals to or those running set-user-ID/set-group-ID programs, for obvious reasons. Alternatively, the process may already be being traced, or (before Linux 2.6.26) be init(1) (PID 1).

ESRCH
The specified process does not exist, or is not currently being traced by the caller, or is not stopped (for operations that require a stopped tracee).

STANDARDS

None.

HISTORY

SVr4, 4.3BSD.

Before Linux 2.6.26, init(1), the process with PID 1, could not be traced.

NOTES

Although arguments to ptrace() are interpreted according to the prototype given, glibc currently declares ptrace() as a variadic function with only the op argument fixed. It is recommended to always supply four arguments, even if the requested operation does not use them, setting unused/ignored arguments to 0L or (void *) 0.

A tracee’s parent continues to be the tracer even if that tracer calls execve(2).

The layout of the contents of memory and the USER area are quite operating-system- and architecture-specific. The offset supplied, and the data returned, might not entirely match with the definition of struct user.

The size of a “word” is determined by the operating-system variant (e.g., for 32-bit Linux it is 32 bits).

This page documents the way the ptrace() call works currently in Linux. Its behavior differs significantly on other flavors of UNIX. In any case, use of ptrace() is highly specific to the operating system and architecture.

Ptrace access mode checking

Various parts of the kernel-user-space API (not just ptrace() operations), require so-called “ptrace access mode” checks, whose outcome determines whether an operation is permitted (or, in a few cases, causes a “read” operation to return sanitized data). These checks are performed in cases where one process can inspect sensitive information about, or in some cases modify the state of, another process. The checks are based on factors such as the credentials and capabilities of the two processes, whether or not the “target” process is dumpable, and the results of checks performed by any enabled Linux Security Module (LSM)—for example, SELinux, Yama, or Smack—and by the commoncap LSM (which is always invoked).

Prior to Linux 2.6.27, all access checks were of a single type. Since Linux 2.6.27, two access mode levels are distinguished:

PTRACE_MODE_READ
For “read” operations or other operations that are less dangerous, such as: get_robust_list(2); kcmp(2); reading /proc/pid/auxv, /proc/pid/environ, or /proc/pid/stat; or readlink(2) of a /proc/pid/ns/* file.

PTRACE_MODE_ATTACH
For “write” operations, or other operations that are more dangerous, such as: ptrace attaching (PTRACE_ATTACH) to another process or calling process_vm_writev(2). (PTRACE_MODE_ATTACH was effectively the default before Linux 2.6.27.)

Since Linux 4.5, the above access mode checks are combined (ORed) with one of the following modifiers:

PTRACE_MODE_FSCREDS
Use the caller’s filesystem UID and GID (see credentials(7)) or effective capabilities for LSM checks.

PTRACE_MODE_REALCREDS
Use the caller’s real UID and GID or permitted capabilities for LSM checks. This was effectively the default before Linux 4.5.

Because combining one of the credential modifiers with one of the aforementioned access modes is typical, some macros are defined in the kernel sources for the combinations:

PTRACE_MODE_READ_FSCREDS
Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.

PTRACE_MODE_READ_REALCREDS
Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.

PTRACE_MODE_ATTACH_FSCREDS
Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.

PTRACE_MODE_ATTACH_REALCREDS
Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.

One further modifier can be ORed with the access mode:

PTRACE_MODE_NOAUDIT (since Linux 3.3)
Don’t audit this access mode check. This modifier is employed for ptrace access mode checks (such as checks when reading /proc/pid/stat) that merely cause the output to be filtered or sanitized, rather than causing an error to be returned to the caller. In these cases, accessing the file is not a security violation and there is no reason to generate a security audit record. This modifier suppresses the generation of such an audit record for the particular access check.

Note that all of the PTRACE_MODE_* constants described in this subsection are kernel-internal, and not visible to user space. The constant names are mentioned here in order to label the various kinds of ptrace access mode checks that are performed for various system calls and accesses to various pseudofiles (e.g., under /proc). These names are used in other manual pages to provide a simple shorthand for labeling the different kernel checks.

The algorithm employed for ptrace access mode checking determines whether the calling process is allowed to perform the corresponding action on the target process. (In the case of opening /proc/pid files, the “calling process” is the one opening the file, and the process with the corresponding PID is the “target process”.) The algorithm is as follows:

1. If the calling thread and the target thread are in the same thread group, access is always allowed.
2. If the access mode specifies PTRACE_MODE_FSCREDS, then, for the check in the next step, employ the caller’s filesystem UID and GID. (As noted in credentials(7), the filesystem UID and GID almost always have the same values as the corresponding effective IDs.)
Otherwise, the access mode specifies PTRACE_MODE_REALCREDS, so use the caller’s real UID and GID for the checks in the next step. (Most APIs that check the caller’s UID and GID use the effective IDs. For historical reasons, the PTRACE_MODE_REALCREDS check uses the real IDs instead.)
3. Deny access if neither of the following is true:
- The real, effective, and saved-set user IDs of the target match the caller’s user ID, and the real, effective, and saved-set group IDs of the target match the caller’s group ID.
- The caller has the CAP_SYS_PTRACE capability in the user namespace of the target.
4. Deny access if the target process’s “dumpable” attribute has a value other than 1 (SUID_DUMP_USER; see the discussion of PR_SET_DUMPABLE in prctl(2)), and the caller does not have the CAP_SYS_PTRACE capability in the user namespace of the target process.
5. The kernel LSM security_ptrace_access_check() interface is invoked to see if ptrace access is permitted. The results depend on the LSM(s). The implementation of this interface in the commoncap LSM performs the following steps:
(5.1) If the access mode includes PTRACE_MODE_FSCREDS, then use the caller’s effective capability set in the following check; otherwise (the access mode specifies PTRACE_MODE_REALCREDS, so) use the caller’s permitted capability set.
(5.2) Deny access if neither of the following is true:
- The caller and the target process are in the same user namespace, and the caller’s capabilities are a superset of the target process’s permitted capabilities.
- The caller has the CAP_SYS_PTRACE capability in the target process’s user namespace.
Note that the commoncap LSM does not distinguish between PTRACE_MODE_READ and PTRACE_MODE_ATTACH.
6. If access has not been denied by any of the preceding steps, then access is allowed.

/proc/sys/kernel/yama/ptrace_scope

On systems with the Yama Linux Security Module (LSM) installed (i.e., the kernel was configured with CONFIG_SECURITY_YAMA), the /proc/sys/kernel/yama/ptrace_scope file (available since Linux 3.4) can be used to restrict the ability to trace a process with ptrace() (and thus also the ability to use tools such as strace(1) and gdb(1)). The goal of such restrictions is to prevent attack escalation whereby a compromised process can ptrace-attach to other sensitive processes (e.g., a GPG agent or an SSH session) owned by the user in order to gain additional credentials that may exist in memory and thus expand the scope of the attack.

More precisely, the Yama LSM limits two types of operations:

- Any operation that performs a ptrace access mode PTRACE_MODE_ATTACH check, for example, ptrace() PTRACE_ATTACH. (See the “Ptrace access mode checking” discussion above.)
- ptrace() PTRACE_TRACEME.

A process that has the CAP_SYS_PTRACE capability can update the /proc/sys/kernel/yama/ptrace_scope file with one of the following values:

0 (“classic ptrace permissions”)
No additional restrictions on operations that perform PTRACE_MODE_ATTACH checks (beyond those imposed by the commoncap and other LSMs).

The use of PTRACE_TRACEME is unchanged.

1 (“restricted ptrace”) [default value]
When performing an operation that requires a PTRACE_MODE_ATTACH check, the calling process must either have the CAP_SYS_PTRACE capability in the user namespace of the target process or it must have a predefined relationship with the target process. By default, the predefined relationship is that the target process must be a descendant of the caller.

A target process can employ the prctl(2) PR_SET_PTRACER operation to declare an additional PID that is allowed to perform PTRACE_MODE_ATTACH operations on the target. See the kernel source file Documentation/admin-guide/LSM/Yama.rst (or Documentation/security/Yama.txt before Linux 4.13) for further details.

The use of PTRACE_TRACEME is unchanged.

2 (“admin-only attach”)
Only processes with the CAP_SYS_PTRACE capability in the user namespace of the target process may perform PTRACE_MODE_ATTACH operations or trace children that employ PTRACE_TRACEME.

3 (“no attach”)
No process may perform PTRACE_MODE_ATTACH operations or trace children that employ PTRACE_TRACEME.

Once this value has been written to the file, it cannot be changed.

With respect to values 1 and 2, note that creating a new user namespace effectively removes the protection offered by Yama. This is because a process in the parent user namespace whose effective UID matches the UID of the creator of a child namespace has all capabilities (including CAP_SYS_PTRACE) when performing operations within the child user namespace (and further-removed descendants of that namespace). Consequently, when a process tries to use user namespaces to sandbox itself, it inadvertently weakens the protections offered by the Yama LSM.
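A program that wants to explain an EPERM from ptrace() to the user can inspect the current setting. The sketch below simply reads the file; the function names are hypothetical, and a return value of -1 is used here to mean that the file could not be read (for example, because Yama is not enabled).

```c
#include <stdio.h>

/* Returns the current Yama ptrace_scope value (0 to 3), or -1 if the
 * file cannot be read or parsed. */
int read_ptrace_scope(void)
{
    FILE *f = fopen("/proc/sys/kernel/yama/ptrace_scope", "r");
    int scope = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &scope) != 1)
        scope = -1;
    fclose(f);
    return scope;
}

/* Maps a value to the label used in the list above. */
const char *ptrace_scope_name(int scope)
{
    switch (scope) {
    case 0:  return "classic ptrace permissions";
    case 1:  return "restricted ptrace";
    case 2:  return "admin-only attach";
    case 3:  return "no attach";
    default: return "unknown or Yama not available";
    }
}
```

A caller might print ptrace_scope_name(read_ptrace_scope()) when an attach fails with EPERM.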

C library/kernel differences

At the system call level, the PTRACE_PEEKTEXT, PTRACE_PEEKDATA, and PTRACE_PEEKUSER operations have a different API: they store the result at the address specified by the data parameter, and the return value is the error flag. The glibc wrapper function provides the API given in DESCRIPTION above, with the result being returned via the function return value.

BUGS

On hosts with Linux 2.6 kernel headers, PTRACE_SETOPTIONS is declared with a different value than the one for Linux 2.4. This leads to applications compiled with Linux 2.6 kernel headers failing when run on Linux 2.4. This can be worked around by redefining PTRACE_SETOPTIONS to PTRACE_OLDSETOPTIONS, if that is defined.

Group-stop notifications are sent to the tracer, but not to real parent. Last confirmed on 2.6.38.6.

If a thread group leader is traced and exits by calling _exit(2), a PTRACE_EVENT_EXIT stop will happen for it (if requested), but the subsequent WIFEXITED notification will not be delivered until all other threads exit. As explained above, if one of the other threads calls execve(2), the death of the thread group leader will never be reported. If the execed thread is not traced by this tracer, the tracer will never know that execve(2) happened. One possible workaround is to PTRACE_DETACH the thread group leader instead of restarting it in this case. Last confirmed on 2.6.38.6.

A SIGKILL signal may still cause a PTRACE_EVENT_EXIT stop before actual signal death. This may be changed in the future; SIGKILL is meant to always immediately kill tasks even under ptrace. Last confirmed on Linux 3.13.

Some system calls return with EINTR if a signal was sent to a tracee, but delivery was suppressed by the tracer. (This is a very typical operation: it is usually done by debuggers on every attach, in order to not introduce a bogus SIGSTOP.) As of Linux 3.2.9, the following system calls are affected (this list is likely incomplete): epoll_wait(2), and read(2) from an inotify(7) file descriptor. The usual symptom of this bug is that when you attach to a quiescent process with the command

strace -p <process-ID>

then, instead of the usual and expected one-line output such as

restart_syscall(<... resuming interrupted call ...>_

or

select(6, [5], NULL, [5], NULL_

(’_’ denotes the cursor position), you observe more than one line. For example:

    clock_gettime(CLOCK_MONOTONIC, {15370, 690928118}) = 0
    epoll_wait(4,_

What is not visible here is that the process was blocked in epoll_wait(2) before strace(1) attached to it. Attaching caused epoll_wait(2) to return to user space with the error EINTR. In this particular case, the program reacted to EINTR by checking the current time, and then executing epoll_wait(2) again. (Programs which do not expect such “stray” EINTR errors may behave in an unintended way upon an strace(1) attach.)

Contrary to the normal rules, the glibc wrapper for ptrace() can set errno to zero.

2 - Linux cli command fcntl


NAME

fcntl - manipulate file descriptor

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <fcntl.h>
int fcntl(int fd, int op, ... /* arg */ );

DESCRIPTION

fcntl() performs one of the operations described below on the open file descriptor fd. The operation is determined by op.

fcntl() can take an optional third argument. Whether or not this argument is required is determined by op. The required argument type is indicated in parentheses after each op name (in most cases, the required type is int, and we identify the argument using the name arg), or void is specified if the argument is not required.

Certain of the operations below are supported only since a particular Linux kernel version. The preferred method of checking whether the host kernel supports a particular operation is to invoke fcntl() with the desired op value and then test whether the call failed with EINVAL, indicating that the kernel does not recognize this value.

Duplicating a file descriptor

F_DUPFD (int)
Duplicate the file descriptor fd using the lowest-numbered available file descriptor greater than or equal to arg. This is different from dup2(2), which uses exactly the file descriptor specified.

On success, the new file descriptor is returned.

See dup(2) for further details.

F_DUPFD_CLOEXEC (int; since Linux 2.6.24)
As for F_DUPFD, but additionally set the close-on-exec flag for the duplicate file descriptor. Specifying this flag permits a program to avoid an additional fcntl() F_SETFD operation to set the FD_CLOEXEC flag. For an explanation of why this flag is useful, see the description of O_CLOEXEC in open(2).

File descriptor flags

The following operations manipulate the flags associated with a file descriptor. Currently, only one such flag is defined: FD_CLOEXEC, the close-on-exec flag. If the FD_CLOEXEC bit is set, the file descriptor will automatically be closed during a successful execve(2). (If the execve(2) fails, the file descriptor is left open.) If the FD_CLOEXEC bit is not set, the file descriptor will remain open across an execve(2).

F_GETFD (void)
Return (as the function result) the file descriptor flags; arg is ignored.

F_SETFD (int)
Set the file descriptor flags to the value specified by arg.

In multithreaded programs, using fcntl() F_SETFD to set the close-on-exec flag at the same time as another thread performs a fork(2) plus execve(2) is vulnerable to a race condition that may unintentionally leak the file descriptor to the program executed in the child process. See the discussion of the O_CLOEXEC flag in open(2) for details and a remedy to the problem.

File status flags

Each open file description has certain associated status flags, initialized by open(2) and possibly modified by fcntl(). Duplicated file descriptors (made with dup(2), fcntl(F_DUPFD), fork(2), etc.) refer to the same open file description, and thus share the same file status flags.

The file status flags and their semantics are described in open(2).

F_GETFL (void)
Return (as the function result) the file access mode and the file status flags; arg is ignored.

F_SETFL (int)
Set the file status flags to the value specified by arg. File access mode (O_RDONLY, O_WRONLY, O_RDWR) and file creation flags (i.e., O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC) in arg are ignored. On Linux, this operation can change only the O_APPEND, O_ASYNC, O_DIRECT, O_NOATIME, and O_NONBLOCK flags. It is not possible to change the O_DSYNC and O_SYNC flags; see BUGS, below.
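Because F_SETFL replaces the whole set of changeable status flags, the usual pattern is to fetch the current flags with F_GETFL and then set the modified value. A minimal sketch, with hypothetical function names, that enables O_NONBLOCK on the read end of a pipe:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Add O_NONBLOCK to the file status flags of `fd`, keeping the others. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical demo: returns 0 if reading from an empty pipe fails with
 * EAGAIN instead of blocking. */
int demo_nonblocking(void)
{
    int p[2];
    char c;
    int rc = -1;

    if (pipe(p) == -1)
        return -1;

    if (set_nonblocking(p[0]) == 0) {
        ssize_t n = read(p[0], &c, 1);

        if (n == -1 && errno == EAGAIN)
            rc = 0;
    }

    close(p[0]);
    close(p[1]);
    return rc;
}
```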

Advisory record locking

Linux implements traditional (“process-associated”) UNIX record locks, as standardized by POSIX. For a Linux-specific alternative with better semantics, see the discussion of open file description locks below.

F_SETLK, F_SETLKW, and F_GETLK are used to acquire, release, and test for the existence of record locks (also known as byte-range, file-segment, or file-region locks). The third argument, lock, is a pointer to a structure that has at least the following fields (in unspecified order).

struct flock {
    ...
    short l_type;    /* Type of lock: F_RDLCK,
                        F_WRLCK, F_UNLCK */
    short l_whence;  /* How to interpret l_start:
                        SEEK_SET, SEEK_CUR, SEEK_END */
    off_t l_start;   /* Starting offset for lock */
    off_t l_len;     /* Number of bytes to lock */
    pid_t l_pid;     /* PID of process blocking our lock
                        (set by F_GETLK and F_OFD_GETLK) */
    ...
};

The l_whence, l_start, and l_len fields of this structure specify the range of bytes we wish to lock. Bytes past the end of the file may be locked, but not bytes before the start of the file.

l_start is the starting offset for the lock, and is interpreted relative to either: the start of the file (if l_whence is SEEK_SET); the current file offset (if l_whence is SEEK_CUR); or the end of the file (if l_whence is SEEK_END). In the final two cases, l_start can be a negative number provided the offset does not lie before the start of the file.

l_len specifies the number of bytes to be locked. If l_len is positive, then the range to be locked covers bytes l_start up to and including l_start+l_len-1. Specifying 0 for l_len has the special meaning: lock all bytes starting at the location specified by l_whence and l_start through to the end of file, no matter how large the file grows.

POSIX.1-2001 allows (but does not require) an implementation to support a negative l_len value; if l_len is negative, the interval described by lock covers bytes l_start+l_len up to and including l_start-1. This is supported since Linux 2.4.21 and Linux 2.5.49.

The l_type field can be used to place a read (F_RDLCK) or a write (F_WRLCK) lock on a file. Any number of processes may hold a read lock (shared lock) on a file region, but only one process may hold a write lock (exclusive lock). An exclusive lock excludes all other locks, both shared and exclusive. A single process can hold only one type of lock on a file region; if a new lock is applied to an already-locked region, then the existing lock is converted to the new lock type. (Such conversions may involve splitting, shrinking, or coalescing with an existing lock if the byte range specified by the new lock does not precisely coincide with the range of the existing lock.)

F_SETLK (struct flock *)
Acquire a lock (when l_type is F_RDLCK or F_WRLCK) or release a lock (when l_type is F_UNLCK) on the bytes specified by the l_whence, l_start, and l_len fields of lock. If a conflicting lock is held by another process, this call returns -1 and sets errno to EACCES or EAGAIN. (The error returned in this case differs across implementations, so POSIX requires a portable application to check for both errors.)

F_SETLKW (struct flock *)
As for F_SETLK, but if a conflicting lock is held on the file, then wait for that lock to be released. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)).

F_GETLK (struct flock *)
On input to this call, lock describes a lock we would like to place on the file. If the lock could be placed, fcntl() does not actually place it, but returns F_UNLCK in the l_type field of lock and leaves the other fields of the structure unchanged.

If one or more incompatible locks would prevent this lock being placed, then fcntl() returns details about one of those locks in the l_type, l_whence, l_start, and l_len fields of lock. If the conflicting lock is a traditional (process-associated) record lock, then the l_pid field is set to the PID of the process holding that lock. If the conflicting lock is an open file description lock, then l_pid is set to -1. Note that the returned information may already be out of date by the time the caller inspects it.
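The interplay of F_SETLK and F_GETLK can be sketched with two processes. In this hypothetical demo the parent write-locks bytes 0 to 9 of a temporary file (created under /tmp, an assumption); the child then uses F_GETLK to find the lock holder and checks that a conflicting F_SETLK fails with EACCES or EAGAIN.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Try to place a lock of the given type on [start, start+len). */
static int lock_range(int fd, short type, off_t start, off_t len)
{
    struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
                        .l_start = start, .l_len = len };

    return fcntl(fd, F_SETLK, &fl);
}

/* Returns 0 if the child sees the parent's lock and cannot override it. */
int demo_record_lock_conflict(void)
{
    char path[] = "/tmp/fcntl-lock-XXXXXX";
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);

    if (lock_range(fd, F_WRLCK, 0, 10) == -1) {
        close(fd);
        return -1;
    }

    pid_t child = fork();
    if (child == -1) {
        close(fd);
        return -1;
    }

    if (child == 0) {
        struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                               .l_start = 5, .l_len = 1 };

        /* F_GETLK places no lock; it describes a conflicting lock. */
        if (fcntl(fd, F_GETLK, &probe) == -1 ||
            probe.l_type != F_WRLCK || probe.l_pid != getppid())
            _exit(1);

        /* A portable program checks for both EACCES and EAGAIN. */
        if (lock_range(fd, F_WRLCK, 5, 1) != -1 ||
            (errno != EACCES && errno != EAGAIN))
            _exit(2);
        _exit(0);
    }

    int status = 0;
    waitpid(child, &status, 0);
    close(fd);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```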

In order to place a read lock, fd must be open for reading. In order to place a write lock, fd must be open for writing. To place both types of lock, open a file read-write.

When placing locks with F_SETLKW, the kernel detects deadlocks, whereby two or more processes have their lock requests mutually blocked by locks held by the other processes. For example, suppose process A holds a write lock on byte 100 of a file, and process B holds a write lock on byte 200. If each process then attempts to lock the byte already locked by the other process using F_SETLKW, then, without deadlock detection, both processes would remain blocked indefinitely. When the kernel detects such deadlocks, it causes one of the blocking lock requests to immediately fail with the error EDEADLK; an application that encounters such an error should release some of its locks to allow other applications to proceed before attempting to regain the locks that it requires. Circular deadlocks involving more than two processes are also detected. Note, however, that there are limitations to the kernel’s deadlock-detection algorithm; see BUGS.

As well as being removed by an explicit F_UNLCK, record locks are automatically released when the process terminates.

Record locks are not inherited by a child created via fork(2), but are preserved across an execve(2).

Because of the buffering performed by the stdio(3) library, the use of record locking with routines in that package should be avoided; use read(2) and write(2) instead.

The record locks described above are associated with the process (unlike the open file description locks described below). This has some unfortunate consequences:

- If a process closes any file descriptor referring to a file, then all of the process’s locks on that file are released, regardless of the file descriptor(s) on which the locks were obtained. This is bad: it means that a process can lose its locks on a file such as /etc/passwd or /etc/mtab when for some reason a library function decides to open, read, and close the same file.
- The threads in a process share locks. In other words, a multithreaded program can’t use record locking to ensure that threads don’t simultaneously access the same region of a file.

Open file description locks solve both of these problems.
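The first of these consequences is easy to reproduce. In the hypothetical sketch below, a process locks a temporary file (created under /tmp, an assumption), then opens and closes the same file through an unrelated descriptor. A child process then uses F_GETLK to check whether the original lock still exists.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if the record lock was lost after an unrelated close(2) of
 * the same file, -1 otherwise (or on error). */
int demo_close_drops_record_locks(void)
{
    char path[] = "/tmp/fcntl-close-XXXXXX";
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;

    /* Write-lock the whole file via fd. */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };
    if (fcntl(fd, F_SETLK, &fl) == -1) {
        unlink(path);
        close(fd);
        return -1;
    }

    /* What a library function might do behind the caller's back. */
    int other = open(path, O_RDONLY);
    unlink(path);
    if (other == -1) {
        close(fd);
        return -1;
    }
    close(other);       /* releases ALL of this process's locks on the file */

    pid_t child = fork();
    if (child == -1) {
        close(fd);
        return -1;
    }
    if (child == 0) {
        struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                               .l_start = 0, .l_len = 0 };

        if (fcntl(fd, F_GETLK, &probe) == -1)
            _exit(2);
        /* F_UNLCK means that no conflicting lock exists any more. */
        _exit(probe.l_type == F_UNLCK ? 0 : 1);
    }

    int status = 0;
    waitpid(child, &status, 0);
    close(fd);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```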

Open file description locks (non-POSIX)

Open file description locks are advisory byte-range locks whose operation is in most respects identical to the traditional record locks described above. This lock type is Linux-specific, and available since Linux 3.15. (There is a proposal with the Austin Group to include this lock type in the next revision of POSIX.1.) For an explanation of open file descriptions, see open(2).

The principal difference between the two lock types is that whereas traditional record locks are associated with a process, open file description locks are associated with the open file description on which they are acquired, much like locks acquired with flock(2). Consequently (and unlike traditional advisory record locks), open file description locks are inherited across fork(2) (and clone(2) with CLONE_FILES), and are only automatically released on the last close of the open file description, instead of being released on any close of the file.

Conflicting lock combinations (i.e., a read lock and a write lock or two write locks) where one lock is an open file description lock and the other is a traditional record lock conflict even when they are acquired by the same process on the same file descriptor.

Open file description locks placed via the same open file description (i.e., via the same file descriptor, or via a duplicate of the file descriptor created by fork(2), dup(2), fcntl() F_DUPFD, and so on) are always compatible: if a new lock is placed on an already locked region, then the existing lock is converted to the new lock type. (Such conversions may result in splitting, shrinking, or coalescing with an existing lock as discussed above.)

On the other hand, open file description locks may conflict with each other when they are acquired via different open file descriptions. Thus, the threads in a multithreaded program can use open file description locks to synchronize access to a file region by having each thread perform its own open(2) on the file and applying locks via the resulting file descriptor.

As with traditional advisory locks, the third argument to fcntl(), lock, is a pointer to an flock structure. By contrast with traditional record locks, the l_pid field of that structure must be set to zero when using the operations described below.

The operations for working with open file description locks are analogous to those used with traditional locks:

F_OFD_SETLK (struct flock *)
Acquire an open file description lock (when l_type is F_RDLCK or F_WRLCK) or release an open file description lock (when l_type is F_UNLCK) on the bytes specified by the l_whence, l_start, and l_len fields of lock. If a conflicting lock is held on the file, this call returns -1 and sets errno to EAGAIN.

F_OFD_SETLKW (struct flock *)
As for F_OFD_SETLK, but if a conflicting lock is held on the file, then wait for that lock to be released. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)).

F_OFD_GETLK (struct flock *)
On input to this call, lock describes an open file description lock we would like to place on the file. If the lock could be placed, fcntl() does not actually place it, but returns F_UNLCK in the l_type field of lock and leaves the other fields of the structure unchanged. If one or more incompatible locks would prevent this lock being placed, then details about one of these locks are returned via lock, as described above for F_GETLK.

In the current implementation, no deadlock detection is performed for open file description locks. (This contrasts with process-associated record locks, for which the kernel does perform deadlock detection.)
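
The compatibility rules above can be illustrated with a short sketch. It assumes Linux 3.15 or later, a C library that defines F_OFD_SETLK under _GNU_SOURCE, and a writable /tmp; the helper names ofd_setlk() and ofd_lock_demo() and the scratch file template are hypothetical. A lock request made through a duplicate of the locking descriptor succeeds, while the same request made through a second open(2) of the file fails with EAGAIN.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Place (or, with F_UNLCK, release) an OFD lock covering the whole file. */
static int ofd_setlk(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));   /* also sets l_pid to 0, as required */
    fl.l_type = type;
    fl.l_whence = SEEK_SET;       /* l_start = 0, l_len = 0: whole file */
    return fcntl(fd, F_OFD_SETLK, &fl);
}

/* Returns 1 if the duplicate descriptor was compatible and the second open
 * file description was refused with EAGAIN, 0 otherwise, -1 on setup error. */
int ofd_lock_demo(void)
{
    char path[] = "/tmp/ofd_demo_XXXXXX";   /* hypothetical scratch file */
    int result = -1;
    int fd1 = mkstemp(path);
    if (fd1 == -1)
        return -1;

    int fd1dup = dup(fd1);            /* same open file description as fd1 */
    int fd2 = open(path, O_RDWR);     /* a different open file description */

    if (fd1dup != -1 && fd2 != -1 && ofd_setlk(fd1, F_WRLCK) == 0) {
        int same = ofd_setlk(fd1dup, F_WRLCK);
        int other = ofd_setlk(fd2, F_WRLCK);
        int other_errno = errno;
        result = (same == 0 && other == -1 && other_errno == EAGAIN);
    }

    if (fd2 != -1)
        close(fd2);
    if (fd1dup != -1)
        close(fd1dup);
    close(fd1);
    unlink(path);
    return result;
}
```

In a multithreaded program, each thread would play the role of fd2 here by performing its own open(2) on the file.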

Mandatory locking

Warning: the Linux implementation of mandatory locking is unreliable. See BUGS below. Because of these bugs, and the fact that the feature is believed to be little used, since Linux 4.5, mandatory locking has been made an optional feature, governed by a configuration option (CONFIG_MANDATORY_FILE_LOCKING). This feature is no longer supported at all in Linux 5.15 and above.

By default, both traditional (process-associated) and open file description record locks are advisory. Advisory locks are not enforced and are useful only between cooperating processes.

Both lock types can also be mandatory. Mandatory locks are enforced for all processes. If a process tries to perform an incompatible access (e.g., read(2) or write(2)) on a file region that has an incompatible mandatory lock, then the result depends upon whether the O_NONBLOCK flag is enabled for its open file description. If the O_NONBLOCK flag is not enabled, then the system call is blocked until the lock is removed or converted to a mode that is compatible with the access. If the O_NONBLOCK flag is enabled, then the system call fails with the error EAGAIN.

To make use of mandatory locks, mandatory locking must be enabled both on the filesystem that contains the file to be locked, and on the file itself. Mandatory locking is enabled on a filesystem using the “-o mand” option to mount(8), or the MS_MANDLOCK flag for mount(2). Mandatory locking is enabled on a file by disabling group execute permission on the file and enabling the set-group-ID permission bit (see chmod(1) and chmod(2)).

Mandatory locking is not specified by POSIX. Some other systems also support mandatory locking, although the details of how to enable it vary across systems.

Lost locks

When an advisory lock is obtained on a networked filesystem such as NFS, it is possible that the lock might get lost. This may happen due to administrative action on the server, or due to a network partition (i.e., loss of network connectivity with the server) which lasts long enough for the server to assume that the client is no longer functioning.

When the filesystem determines that a lock has been lost, future read(2) or write(2) requests may fail with the error EIO. This error will persist until the lock is removed or the file descriptor is closed. Since Linux 3.12, this happens at least for NFSv4 (including all minor versions).

Some versions of UNIX send a signal (SIGLOST) in this circumstance. Linux does not define this signal, and does not provide any asynchronous notification of lost locks.

Managing signals

F_GETOWN, F_SETOWN, F_GETOWN_EX, F_SETOWN_EX, F_GETSIG, and F_SETSIG are used to manage I/O availability signals:

F_GETOWN (void)
Return (as the function result) the process ID or process group ID currently receiving SIGIO and SIGURG signals for events on file descriptor fd. Process IDs are returned as positive values; process group IDs are returned as negative values (but see BUGS below). arg is ignored.

F_SETOWN (int)
Set the process ID or process group ID that will receive SIGIO and SIGURG signals for events on the file descriptor fd. The target process or process group ID is specified in arg. A process ID is specified as a positive value; a process group ID is specified as a negative value. Most commonly, the calling process specifies itself as the owner (that is, arg is specified as getpid(2)).

As well as setting the file descriptor owner, one must also enable generation of signals on the file descriptor. This is done by using the fcntl() F_SETFL operation to set the O_ASYNC file status flag on the file descriptor. Subsequently, a SIGIO signal is sent whenever input or output becomes possible on the file descriptor. The fcntl() F_SETSIG operation can be used to obtain delivery of a signal other than SIGIO.

Sending a signal to the owner process (group) specified by F_SETOWN is subject to the same permissions checks as are described for kill(2), where the sending process is the one that employs F_SETOWN (but see BUGS below). If this permission check fails, then the signal is silently discarded. Note: The F_SETOWN operation records the caller’s credentials at the time of the fcntl() call, and it is these saved credentials that are used for the permission checks.

If the file descriptor fd refers to a socket, F_SETOWN also selects the recipient of SIGURG signals that are delivered when out-of-band data arrives on that socket. (SIGURG is sent in any situation where select(2) would report the socket as having an “exceptional condition”.)
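
The two steps described above (setting the owner, then enabling O_ASYNC) can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it uses a pipe as the event source, and the names enable_sigio(), on_sigio(), and sigio_demo() are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t sigio_seen;

static void on_sigio(int sig)
{
    (void)sig;
    sigio_seen = 1;
}

/* Make the calling process the owner of fd, then enable O_ASYNC. */
static int enable_sigio(int fd)
{
    int flags;

    if (fcntl(fd, F_SETOWN, getpid()) == -1)
        return -1;
    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC);
}

/* Returns 1 if SIGIO arrived after data was written to the pipe,
 * 0 if it did not, and -1 if the setup failed. */
int sigio_demo(void)
{
    struct sigaction sa, old;
    struct timespec tick = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    int p[2], i, result = -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, &old) == -1)
        return -1;
    if (pipe(p) == -1) {
        sigaction(SIGIO, &old, NULL);
        return -1;
    }

    sigio_seen = 0;
    if (enable_sigio(p[0]) == 0 && write(p[1], "x", 1) == 1) {
        /* Input is now possible on p[0]; wait briefly for the handler. */
        for (i = 0; i < 100 && !sigio_seen; i++)
            nanosleep(&tick, NULL);
        result = sigio_seen ? 1 : 0;
    }

    close(p[0]);
    close(p[1]);
    sigaction(SIGIO, &old, NULL);
    return result;
}
```

The handler is installed before O_ASYNC is enabled because the default action for SIGIO is to terminate the process.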

The following was true in Linux 2.6.x up to and including Linux 2.6.11:

If a nonzero value is given to F_SETSIG in a multithreaded process running with a threading library that supports thread groups (e.g., NPTL), then a positive value given to F_SETOWN has a different meaning: instead of being a process ID identifying a whole process, it is a thread ID identifying a specific thread within a process. Consequently, it may be necessary to pass F_SETOWN the result of gettid(2) instead of getpid(2) to get sensible results when F_SETSIG is used. (In current Linux threading implementations, a main thread’s thread ID is the same as its process ID. This means that a single-threaded program can equally use gettid(2) or getpid(2) in this scenario.) Note, however, that the statements in this paragraph do not apply to the SIGURG signal generated for out-of-band data on a socket: this signal is always sent to either a process or a process group, depending on the value given to F_SETOWN.

The above behavior was accidentally dropped in Linux 2.6.12, and won’t be restored. From Linux 2.6.32 onward, use F_SETOWN_EX to target SIGIO and SIGURG signals at a particular thread.

F_GETOWN_EX (struct f_owner_ex *) (since Linux 2.6.32)
Return the current file descriptor owner settings as defined by a previous F_SETOWN_EX operation. The information is returned in the structure pointed to by arg, which has the following form:

struct f_owner_ex {
    int   type;
    pid_t pid;
};

The type field will have one of the values F_OWNER_TID, F_OWNER_PID, or F_OWNER_PGRP. The pid field is a positive integer representing a thread ID, process ID, or process group ID. See F_SETOWN_EX for more details.

F_SETOWN_EX (struct f_owner_ex *) (since Linux 2.6.32)
This operation performs a similar task to F_SETOWN. It allows the caller to direct I/O availability signals to a specific thread, process, or process group. The caller specifies the target of signals via arg, which is a pointer to an f_owner_ex structure. The type field has one of the following values, which define how pid is interpreted:

F_OWNER_TID
Send the signal to the thread whose thread ID (the value returned by a call to clone(2) or gettid(2)) is specified in pid.

F_OWNER_PID
Send the signal to the process whose ID is specified in pid.

F_OWNER_PGRP
Send the signal to the process group whose ID is specified in pid. (Note that, unlike with F_SETOWN, a process group ID is specified as a positive value here.)
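
As a brief sketch of these operations, the following directs signals for a descriptor at the calling thread and reads the setting back. It assumes Linux 2.6.32 or later; the names own_by_this_thread() and owner_ex_demo() are hypothetical, and the thread ID is obtained with syscall(SYS_gettid) because older C libraries do not provide a gettid() wrapper.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Direct I/O availability signals for fd at the calling thread only. */
static int own_by_this_thread(int fd)
{
    struct f_owner_ex owner;

    owner.type = F_OWNER_TID;
    owner.pid = (pid_t)syscall(SYS_gettid);
    return fcntl(fd, F_SETOWN_EX, &owner);
}

/* Returns 1 if F_GETOWN_EX reports the owner that was just set,
 * 0 if it reports something else, and -1 on error. */
int owner_ex_demo(void)
{
    struct f_owner_ex seen;
    int p[2], result = -1;

    if (pipe(p) == -1)
        return -1;
    if (own_by_this_thread(p[0]) == 0 &&
        fcntl(p[0], F_GETOWN_EX, &seen) == 0)
        result = (seen.type == F_OWNER_TID &&
                  seen.pid == (pid_t)syscall(SYS_gettid));
    close(p[0]);
    close(p[1]);
    return result;
}
```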

F_GETSIG (void)
Return (as the function result) the signal sent when input or output becomes possible. A value of zero means SIGIO is sent. Any other value (including SIGIO) is the signal sent instead, and in this case additional info is available to the signal handler if installed with SA_SIGINFO. arg is ignored.

F_SETSIG (int)
Set the signal sent when input or output becomes possible to the value given in arg. A value of zero means to send the default SIGIO signal. Any other value (including SIGIO) is the signal to send instead, and in this case additional info is available to the signal handler if installed with SA_SIGINFO.

By using F_SETSIG with a nonzero value, and setting SA_SIGINFO for the signal handler (see sigaction(2)), extra information about I/O events is passed to the handler in a siginfo_t structure. If the si_code field indicates the source is SI_SIGIO, the si_fd field gives the file descriptor associated with the event. Otherwise, there is no indication which file descriptors are pending, and you should use the usual mechanisms (select(2), poll(2), read(2) with O_NONBLOCK set, etc.) to determine which file descriptors are available for I/O.

Note that the file descriptor provided in si_fd is the one that was specified during the F_SETSIG operation. This can lead to an unusual corner case. If the file descriptor is duplicated (dup(2) or similar), and the original file descriptor is closed, then I/O events will continue to be generated, but the si_fd field will contain the number of the now closed file descriptor.

By selecting a real-time signal (value >= SIGRTMIN), multiple I/O events may be queued using the same signal number. (Queuing is dependent on available memory.) Extra information is available if SA_SIGINFO is set for the signal handler, as above.

Note that Linux imposes a limit on the number of real-time signals that may be queued to a process (see getrlimit(2) and signal(7)) and if this limit is reached, then the kernel reverts to delivering SIGIO, and this signal is delivered to the entire process rather than to a specific thread.

Using these mechanisms, a program can implement fully asynchronous I/O without using select(2) or poll(2) most of the time.
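
The following sketch shows F_SETSIG with a real-time signal and the si_fd field. To keep the example synchronous, it blocks the signal and collects it with sigtimedwait(2) rather than installing an SA_SIGINFO handler; that substitution is a choice made for this illustration, and the same siginfo_t contents would be passed to a handler. The name rt_sigio_demo() is hypothetical, and a pipe is used as the event source.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Returns 1 if a queued SIGRTMIN carried the watched descriptor in si_fd,
 * 0 if it did not, and -1 if the setup failed. */
int rt_sigio_demo(void)
{
    struct timespec timeout = { 1, 0 };   /* give up after one second */
    sigset_t set, old;
    siginfo_t info;
    int p[2], flags, result = -1;

    /* Block the signal so that it stays queued until we fetch it. */
    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    if (sigprocmask(SIG_BLOCK, &set, &old) == -1)
        return -1;
    if (pipe(p) == -1) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    flags = fcntl(p[0], F_GETFL);
    if (flags != -1 &&
        fcntl(p[0], F_SETSIG, SIGRTMIN) == 0 &&      /* replace SIGIO */
        fcntl(p[0], F_SETOWN, getpid()) == 0 &&      /* choose the recipient */
        fcntl(p[0], F_SETFL, flags | O_ASYNC) == 0 &&
        write(p[1], "x", 1) == 1) {
        int sig = sigtimedwait(&set, &info, &timeout);
        result = (sig == SIGRTMIN && info.si_fd == p[0]);
    }

    close(p[0]);
    close(p[1]);
    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```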

The use of O_ASYNC is specific to BSD and Linux. The only use of F_GETOWN and F_SETOWN specified in POSIX.1 is in conjunction with the use of the SIGURG signal on sockets. (POSIX does not specify the SIGIO signal.) F_GETOWN_EX, F_SETOWN_EX, F_GETSIG, and F_SETSIG are Linux-specific. POSIX has asynchronous I/O and the aio_sigevent structure to achieve similar things; these are also available in Linux as part of the GNU C Library (glibc).

Leases

F_SETLEASE and F_GETLEASE (Linux 2.4 onward) are used to establish a new lease, and retrieve the current lease, on the open file description referred to by the file descriptor fd. A file lease provides a mechanism whereby the process holding the lease (the “lease holder”) is notified (via delivery of a signal) when a process (the “lease breaker”) tries to open(2) or truncate(2) the file referred to by that file descriptor.

F_SETLEASE (int)
Set or remove a file lease according to which of the following values is specified in the integer arg:

F_RDLCK
Take out a read lease. This will cause the calling process to be notified when the file is opened for writing or is truncated. A read lease can be placed only on a file descriptor that is opened read-only.

F_WRLCK
Take out a write lease. This will cause the caller to be notified when the file is opened for reading or writing or is truncated. A write lease may be placed on a file only if there are no other open file descriptors for the file.

F_UNLCK
Remove our lease from the file.

Leases are associated with an open file description (see open(2)). This means that duplicate file descriptors (created by, for example, fork(2) or dup(2)) refer to the same lease, and this lease may be modified or released using any of these descriptors. Furthermore, the lease is released by either an explicit F_UNLCK operation on any of these duplicate file descriptors, or when all such file descriptors have been closed.

Leases may be taken out only on regular files. An unprivileged process may take out a lease only on a file whose UID (owner) matches the filesystem UID of the process. A process with the CAP_LEASE capability may take out leases on arbitrary files.

F_GETLEASE (void)
Indicates what type of lease is associated with the file descriptor fd by returning either F_RDLCK, F_WRLCK, or F_UNLCK, indicating, respectively, a read lease, a write lease, or no lease. arg is ignored.

When a process (the “lease breaker”) performs an open(2) or truncate(2) that conflicts with a lease established via F_SETLEASE, the system call is blocked by the kernel and the kernel notifies the lease holder by sending it a signal (SIGIO by default). The lease holder should respond to receipt of this signal by doing whatever cleanup is required in preparation for the file to be accessed by another process (e.g., flushing cached buffers) and then either remove or downgrade its lease. A lease is removed by performing an F_SETLEASE operation specifying arg as F_UNLCK. If the lease holder currently holds a write lease on the file, and the lease breaker is opening the file for reading, then it is sufficient for the lease holder to downgrade the lease to a read lease. This is done by performing an F_SETLEASE operation specifying arg as F_RDLCK.

If the lease holder fails to downgrade or remove the lease within the number of seconds specified in /proc/sys/fs/lease-break-time, then the kernel forcibly removes or downgrades the lease holder’s lease.

Once a lease break has been initiated, F_GETLEASE returns the target lease type (either F_RDLCK or F_UNLCK, depending on what would be compatible with the lease breaker) until the lease holder voluntarily downgrades or removes the lease or the kernel forcibly does so after the lease break timer expires.

Once the lease has been voluntarily or forcibly removed or downgraded, and assuming the lease breaker has not unblocked its system call, the kernel permits the lease breaker’s system call to proceed.

If the lease breaker’s blocked open(2) or truncate(2) is interrupted by a signal handler, then the system call fails with the error EINTR, but the other steps still occur as described above. If the lease breaker is killed by a signal while blocked in open(2) or truncate(2), then the other steps still occur as described above. If the lease breaker specifies the O_NONBLOCK flag when calling open(2), then the call immediately fails with the error EWOULDBLOCK, but the other steps still occur as described above.

The default signal used to notify the lease holder is SIGIO, but this can be changed using the F_SETSIG operation to fcntl(). If an F_SETSIG operation is performed (even one specifying SIGIO), and the signal handler is established using SA_SIGINFO, then the handler will receive a siginfo_t structure as its second argument, and the si_fd field of this argument will hold the file descriptor of the leased file that has been accessed by another process. (This is useful if the caller holds leases against multiple files.)
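
The basic lease life cycle (take out, query, remove) might look like the sketch below. It assumes a filesystem that supports leases and a writable /tmp; the name read_lease_demo() and the scratch file template are hypothetical. The example does not exercise a lease break, which would require a second process opening the file.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 if a read lease was taken out, observed with F_GETLEASE, and
 * removed again; 0 if F_SETLEASE was refused (for example, by a filesystem
 * without lease support); -1 on any other failure. */
int read_lease_demo(void)
{
    char path[] = "/tmp/lease_demo_XXXXXX";   /* hypothetical scratch file */
    int result = -1;
    int tmp = mkstemp(path);

    if (tmp == -1)
        return -1;
    close(tmp);   /* mkstemp() opens read-write; reopen the file read-only */

    int fd = open(path, O_RDONLY);
    if (fd != -1) {
        if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1) {
            result = 0;
        } else {
            int held = fcntl(fd, F_GETLEASE);
            int removed = fcntl(fd, F_SETLEASE, F_UNLCK);
            int after = fcntl(fd, F_GETLEASE);
            result = (held == F_RDLCK && removed == 0 &&
                      after == F_UNLCK) ? 1 : -1;
        }
        close(fd);
    }
    unlink(path);
    return result;
}
```

A real lease holder would also install a handler for the notification signal and remove or downgrade the lease from there, within the time allowed by /proc/sys/fs/lease-break-time.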

File and directory change notification (dnotify)

F_NOTIFY (int)
(Linux 2.4 onward) Provide notification when the directory referred to by fd or any of the files that it contains is changed. The events to be notified are specified in arg, which is a bit mask specified by ORing together zero or more of the following bits:

DN_ACCESS
A file was accessed (read(2), pread(2), readv(2), and similar).
DN_MODIFY
A file was modified (write(2), pwrite(2), writev(2), truncate(2), ftruncate(2), and similar).
DN_CREATE
A file was created (open(2), creat(2), mknod(2), mkdir(2), link(2), symlink(2), rename(2) into this directory).
DN_DELETE
A file was unlinked (unlink(2), rename(2) to another directory, rmdir(2)).
DN_RENAME
A file was renamed within this directory (rename(2)).
DN_ATTRIB
The attributes of a file were changed (chown(2), chmod(2), utime(2), utimensat(2), and similar).

(In order to obtain these definitions, the _GNU_SOURCE feature test macro must be defined before including any header files.)

Directory notifications are normally “one-shot”, and the application must reregister to receive further notifications. Alternatively, if DN_MULTISHOT is included in arg, then notification will remain in effect until explicitly removed.

A series of F_NOTIFY requests is cumulative, with the events in arg being added to the set already monitored. To disable notification of all events, make an F_NOTIFY call specifying arg as 0.

Notification occurs via delivery of a signal. The default signal is SIGIO, but this can be changed using the F_SETSIG operation to fcntl(). (Note that SIGIO is one of the nonqueuing standard signals; switching to the use of a real-time signal means that multiple notifications can be queued to the process.) In the latter case, the signal handler receives a siginfo_t structure as its second argument (if the handler was established using SA_SIGINFO) and the si_fd field of this structure contains the file descriptor which generated the notification (useful when establishing notification on multiple directories).

Especially when using DN_MULTISHOT, a real-time signal should be used for notification, so that multiple notifications can be queued.

NOTE: New applications should use the inotify interface (available since Linux 2.6.13), which provides a much superior interface for obtaining notifications of filesystem events. See inotify(7).
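
For completeness, here is a minimal sketch of the dnotify interface. It assumes a kernel built with dnotify support and a writable /tmp; the name dnotify_demo() and the scratch directory template are hypothetical. The real-time signal is blocked and collected with sigtimedwait(2) instead of being handled asynchronously, which is a simplification made for this example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Returns 1 if creating a file produced a notification whose si_fd is the
 * directory descriptor, 0 if the kernel refused the request, -1 otherwise. */
int dnotify_demo(void)
{
    char dir[] = "/tmp/dnotify_demo_XXXXXX";   /* hypothetical scratch dir */
    char file[64];
    struct timespec timeout = { 1, 0 };
    sigset_t set, old;
    siginfo_t info;
    int result = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(file, sizeof(file), "%s/new", dir);

    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    sigprocmask(SIG_BLOCK, &set, &old);

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd != -1) {
        /* Choose a queued real-time signal, then register for DN_CREATE. */
        if (fcntl(dfd, F_SETSIG, SIGRTMIN) == -1 ||
            fcntl(dfd, F_NOTIFY, DN_CREATE) == -1) {
            result = 0;
        } else {
            int fd = open(file, O_CREAT | O_WRONLY, 0600);
            if (fd != -1) {
                close(fd);
                int sig = sigtimedwait(&set, &info, &timeout);
                result = (sig == SIGRTMIN && info.si_fd == dfd) ? 1 : -1;
            }
        }
        close(dfd);
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    unlink(file);
    rmdir(dir);
    return result;
}
```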

Changing the capacity of a pipe

F_SETPIPE_SZ (int; since Linux 2.6.35)
Change the capacity of the pipe referred to by fd to be at least arg bytes. An unprivileged process can adjust the pipe capacity to any value between the system page size and the limit defined in /proc/sys/fs/pipe-max-size (see proc(5)). Attempts to set the pipe capacity below the page size are silently rounded up to the page size. Attempts by an unprivileged process to set the pipe capacity above the limit in /proc/sys/fs/pipe-max-size yield the error EPERM; a privileged process (CAP_SYS_RESOURCE) can override the limit.

When allocating the buffer for the pipe, the kernel may use a capacity larger than arg, if that is convenient for the implementation. (In the current implementation, the allocation is the next higher power-of-two page-size multiple of the requested size.) The actual capacity (in bytes) that is set is returned as the function result.

Attempting to set the pipe capacity smaller than the amount of buffer space currently used to store data produces the error EBUSY.

Note that because of the way the pages of the pipe buffer are employed when data is written to the pipe, the number of bytes that can be written may be less than the nominal size, depending on the size of the writes.

F_GETPIPE_SZ (void; since Linux 2.6.35)
Return (as the function result) the capacity of the pipe referred to by fd.
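
A short sketch of both operations follows. The names resize_pipe() and pipe_size_demo() are hypothetical, and the example assumes that the requested size is below the limit in /proc/sys/fs/pipe-max-size. Because the kernel may round the request up, the code compares against the returned capacity rather than the requested one.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Ask for a capacity of at least `want` bytes. Returns the capacity that
 * the kernel actually set, or -1 with errno set. */
static int resize_pipe(int fd, int want)
{
    return fcntl(fd, F_SETPIPE_SZ, want);
}

/* Returns the new capacity if it is at least `want` and agrees with
 * F_GETPIPE_SZ, or -1 otherwise. */
int pipe_size_demo(int want)
{
    int p[2], set, seen;

    if (pipe(p) == -1)
        return -1;
    set = resize_pipe(p[1], want);
    seen = fcntl(p[1], F_GETPIPE_SZ);
    close(p[0]);
    close(p[1]);

    if (set == -1 || seen == -1)
        return -1;
    return (set >= want && seen == set) ? set : -1;
}
```

For example, pipe_size_demo(100000) would be expected to return a value of at least 100000; the exact figure depends on the page size and the kernel's rounding.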

File sealing

File seals limit the set of allowed operations on a given file. For each seal that is set on a file, a specific set of operations will fail with EPERM on this file from now on. The file is said to be sealed. The default set of seals depends on the type of the underlying file and filesystem. For an overview of file sealing, a discussion of its purpose, and some code examples, see memfd_create(2).

Currently, file seals can be applied only to a file descriptor returned by memfd_create(2) (if the MFD_ALLOW_SEALING flag was employed). On other filesystems, all fcntl() operations that operate on seals will return EINVAL.

Seals are a property of an inode. Thus, all open file descriptors referring to the same inode share the same set of seals. Furthermore, seals can never be removed, only added.

F_ADD_SEALS (int; since Linux 3.17)
Add the seals given in the bit-mask argument arg to the set of seals of the inode referred to by the file descriptor fd. Seals cannot be removed again. Once this call succeeds, the seals are enforced by the kernel immediately. If the current set of seals includes F_SEAL_SEAL (see below), then this call will be rejected with EPERM. Adding a seal that is already set is a no-op, provided that F_SEAL_SEAL is not already set. In order to place a seal, the file descriptor fd must be writable.

F_GET_SEALS (void; since Linux 3.17)
Return (as the function result) the current set of seals of the inode referred to by fd. If no seals are set, 0 is returned. If the file does not support sealing, -1 is returned and errno is set to EINVAL.

The following seals are available:

F_SEAL_SEAL
If this seal is set, any further call to fcntl() with F_ADD_SEALS fails with the error EPERM. Therefore, this seal prevents any modifications to the set of seals itself. If the initial set of seals of a file includes F_SEAL_SEAL, then this effectively causes the set of seals to be constant and locked.

F_SEAL_SHRINK
If this seal is set, the file in question cannot be reduced in size. This affects open(2) with the O_TRUNC flag as well as truncate(2) and ftruncate(2). Those calls fail with EPERM if you try to shrink the file in question. Increasing the file size is still possible.

F_SEAL_GROW
If this seal is set, the size of the file in question cannot be increased. This affects write(2) beyond the end of the file, truncate(2), ftruncate(2), and fallocate(2). These calls fail with EPERM if you use them to increase the file size. If you keep the size or shrink it, those calls still work as expected.

F_SEAL_WRITE
If this seal is set, you cannot modify the contents of the file. Note that shrinking or growing the size of the file is still possible and allowed. Thus, this seal is normally used in combination with one of the other seals. This seal affects write(2) and fallocate(2) (only in combination with the FALLOC_FL_PUNCH_HOLE flag). Those calls fail with EPERM if this seal is set. Furthermore, trying to create new shared, writable memory-mappings via mmap(2) will also fail with EPERM.

Using the F_ADD_SEALS operation to set the F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping exists. Such mappings must be unmapped before you can add this seal. Furthermore, if there are any asynchronous I/O operations (io_submit(2)) pending on the file, all outstanding writes will be discarded.

F_SEAL_FUTURE_WRITE (since Linux 5.1)
The effect of this seal is similar to F_SEAL_WRITE, but the contents of the file can still be modified via shared writable mappings that were created prior to the seal being set. Any attempt to create a new writable mapping on the file via mmap(2) will fail with EPERM. Likewise, an attempt to write to the file via write(2) will fail with EPERM.

Using this seal, one process can create a memory buffer that it can continue to modify while sharing that buffer on a “read-only” basis with other processes.
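
The effect of the size and write seals can be sketched as follows. The example assumes Linux 3.17 or later and a C library that provides the memfd_create() wrapper; the name seal_demo() and the memfd name string are hypothetical. The check on F_GET_SEALS uses a mask because the kernel may report additional seals besides the ones requested.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if, after sealing, both write(2) and a shrinking ftruncate(2)
 * fail with EPERM; 0 if not; -1 if the setup failed. */
int seal_demo(void)
{
    const int seals = F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE;
    int result = -1;
    int fd = memfd_create("seal_demo", MFD_ALLOW_SEALING);

    if (fd == -1)
        return -1;

    /* Fill the file first; once sealed, its contents and size are fixed. */
    if (write(fd, "hello", 5) == 5 &&
        fcntl(fd, F_ADD_SEALS, seals) == 0) {
        int have = fcntl(fd, F_GET_SEALS);

        ssize_t w = write(fd, "x", 1);     /* refused: F_SEAL_WRITE */
        int write_errno = errno;
        int t = ftruncate(fd, 0);          /* refused: F_SEAL_SHRINK */
        int trunc_errno = errno;

        result = (have != -1 && (have & seals) == seals &&
                  w == -1 && write_errno == EPERM &&
                  t == -1 && trunc_errno == EPERM);
    }
    close(fd);
    return result;
}
```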

File read/write hints

Write lifetime hints can be used to inform the kernel about the relative expected lifetime of writes on a given inode or via a particular open file description. (See open(2) for an explanation of open file descriptions.) In this context, the term “write lifetime” means the expected time the data will live on media, before being overwritten or erased.

An application may use the different hint values specified below to separate writes into different write classes, so that multiple users or applications running on a single storage back-end can aggregate their I/O patterns in a consistent manner. However, there are no functional semantics implied by these flags, and different I/O classes can use the write lifetime hints in arbitrary ways, so long as the hints are used consistently.

The following operations can be applied to the file descriptor, fd:

F_GET_RW_HINT (uint64_t *; since Linux 4.13)
Returns the value of the read/write hint associated with the underlying inode referred to by fd.

F_SET_RW_HINT (uint64_t *; since Linux 4.13)
Sets the read/write hint value associated with the underlying inode referred to by fd. This hint persists until either it is explicitly modified or the underlying filesystem is unmounted.

F_GET_FILE_RW_HINT (uint64_t *; since Linux 4.13)
Returns the value of the read/write hint associated with the open file description referred to by fd.

F_SET_FILE_RW_HINT (uint64_t *; since Linux 4.13)
Sets the read/write hint value associated with the open file description referred to by fd.

If an open file description has not been assigned a read/write hint, then it shall use the value assigned to the inode, if any.

The following read/write hints are valid since Linux 4.13:

RWH_WRITE_LIFE_NOT_SET
No specific hint has been set. This is the default value.

RWH_WRITE_LIFE_NONE
No specific write lifetime is associated with this file or inode.

RWH_WRITE_LIFE_SHORT
Data written to this inode or via this open file description is expected to have a short lifetime.

RWH_WRITE_LIFE_MEDIUM
Data written to this inode or via this open file description is expected to have a lifetime longer than data written with RWH_WRITE_LIFE_SHORT.

RWH_WRITE_LIFE_LONG
Data written to this inode or via this open file description is expected to have a lifetime longer than data written with RWH_WRITE_LIFE_MEDIUM.

RWH_WRITE_LIFE_EXTREME
Data written to this inode or via this open file description is expected to have a lifetime longer than data written with RWH_WRITE_LIFE_LONG.

All the write-specific hints are relative to each other, and no individual absolute meaning should be attributed to them.
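
Setting and reading back an inode hint might look like the sketch below. It assumes Linux 4.13 or later and a writable /tmp; the name rw_hint_demo() and the scratch file template are hypothetical. The fallback definitions are only for C libraries whose headers lack these constants, and the numeric values are those of the kernel's linux/fcntl.h. Only the inode-level operations are shown.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Fallbacks for older C library headers (values from linux/fcntl.h). */
#ifndef F_GET_RW_HINT
#define F_GET_RW_HINT 1035
#define F_SET_RW_HINT 1036
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT 2
#endif

/* Marks a scratch file as holding short-lived data and returns the hint
 * that the kernel reports afterwards, or -1 on error. */
int rw_hint_demo(void)
{
    char path[] = "/tmp/rwhint_demo_XXXXXX";   /* hypothetical scratch file */
    uint64_t hint = RWH_WRITE_LIFE_SHORT;
    uint64_t seen = 0;
    int result = -1;
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;

    /* Both operations take a pointer to a 64-bit value, not the value. */
    if (fcntl(fd, F_SET_RW_HINT, &hint) == 0 &&
        fcntl(fd, F_GET_RW_HINT, &seen) == 0)
        result = (int)seen;

    close(fd);
    unlink(path);
    return result;
}
```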

RETURN VALUE

For a successful call, the return value depends on the operation:

F_DUPFD
The new file descriptor.

F_GETFD
Value of file descriptor flags.

F_GETFL
Value of file status flags.

F_GETLEASE
Type of lease held on file descriptor.

F_GETOWN
Value of file descriptor owner.

F_GETSIG
Value of signal sent when read or write becomes possible, or zero for traditional SIGIO behavior.

F_GETPIPE_SZ
F_SETPIPE_SZ
The pipe capacity.

F_GET_SEALS
A bit mask identifying the seals that have been set for the inode referred to by fd.

All other operations
Zero.

On error, -1 is returned, and errno is set to indicate the error.
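
The three kinds of result listed above (a value, zero, and -1 with errno) can be seen in a small sketch. The names get_cloexec() and return_value_demo() are hypothetical, and the example assumes that /dev/null can be opened.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 or 0 for the FD_CLOEXEC state of fd, or -1 with errno set. */
static int get_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);   /* on success: the flags themselves */

    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}

/* Returns 1 if every call behaved as described, 0 if not, -1 on setup error. */
int return_value_demo(void)
{
    int ok;
    int fd = open("/dev/null", O_RDONLY);

    if (fd == -1)
        return -1;

    ok = get_cloexec(fd) == 0 &&
         fcntl(fd, F_SETFD, FD_CLOEXEC) == 0 &&   /* "all other operations" */
         get_cloexec(fd) == 1;
    close(fd);

    /* The descriptor is closed now, so the call fails with EBADF. */
    errno = 0;
    ok = ok && get_cloexec(fd) == -1 && errno == EBADF;
    return ok;
}
```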

ERRORS

EACCES or EAGAIN
Operation is prohibited by locks held by other processes.

EAGAIN
The operation is prohibited because the file has been memory-mapped by another process.

EBADF
fd is not an open file descriptor.

EBADF
op is F_SETLK or F_SETLKW and the file descriptor open mode doesn’t match with the type of lock requested.

EBUSY
op is F_SETPIPE_SZ and the new pipe capacity specified in arg is smaller than the amount of buffer space currently used to store data in the pipe.

EBUSY
op is F_ADD_SEALS, arg includes F_SEAL_WRITE, and there exists a writable, shared mapping on the file referred to by fd.

EDEADLK
It was detected that the specified F_SETLKW operation would cause a deadlock.

EFAULT
lock is outside your accessible address space.

EINTR
op is F_SETLKW or F_OFD_SETLKW and the operation was interrupted by a signal; see signal(7).

EINTR
op is F_GETLK, F_SETLK, F_OFD_GETLK, or F_OFD_SETLK, and the operation was interrupted by a signal before the lock was checked or acquired. This is most likely when locking a remote file (e.g., locking over NFS), but it can sometimes happen locally.

EINVAL
The value specified in op is not recognized by this kernel.

EINVAL
op is F_ADD_SEALS and arg includes an unrecognized sealing bit.

EINVAL
op is F_ADD_SEALS or F_GET_SEALS and the filesystem containing the inode referred to by fd does not support sealing.

EINVAL
op is F_DUPFD and arg is negative or is greater than the maximum allowable value (see the discussion of RLIMIT_NOFILE in getrlimit(2)).

EINVAL
op is F_SETSIG and arg is not an allowable signal number.

EINVAL
op is F_OFD_SETLK, F_OFD_SETLKW, or F_OFD_GETLK, and l_pid was not specified as zero.

EMFILE
op is F_DUPFD and the per-process limit on the number of open file descriptors has been reached.

ENOLCK
Too many segment locks open, lock table is full, or a remote locking protocol failed (e.g., locking over NFS).

ENOTDIR
F_NOTIFY was specified in op, but fd does not refer to a directory.

EPERM
op is F_SETPIPE_SZ and the soft or hard user pipe limit has been reached; see pipe(7).

EPERM
Attempted to clear the O_APPEND flag on a file that has the append-only attribute set.

EPERM
op was F_ADD_SEALS, but fd was not open for writing or the current set of seals on the file already includes F_SEAL_SEAL.

STANDARDS

POSIX.1-2008.

F_GETOWN_EX, F_SETOWN_EX, F_SETPIPE_SZ, F_GETPIPE_SZ, F_GETSIG, F_SETSIG, F_NOTIFY, F_GETLEASE, and F_SETLEASE are Linux-specific. (Define the _GNU_SOURCE macro to obtain these definitions.)

F_OFD_SETLK, F_OFD_SETLKW, and F_OFD_GETLK are Linux-specific (and one must define _GNU_SOURCE to obtain their definitions), but work is being done to have them included in the next version of POSIX.1.

F_ADD_SEALS and F_GET_SEALS are Linux-specific.

HISTORY

SVr4, 4.3BSD, POSIX.1-2001.

Only the operations F_DUPFD, F_GETFD, F_SETFD, F_GETFL, F_SETFL, F_GETLK, F_SETLK, and F_SETLKW are specified in POSIX.1-2001.

F_GETOWN and F_SETOWN are specified in POSIX.1-2001. (To get their definitions, define either _XOPEN_SOURCE with the value 500 or greater, or _POSIX_C_SOURCE with the value 200809L or greater.)

F_DUPFD_CLOEXEC is specified in POSIX.1-2008. (To get this definition, define _POSIX_C_SOURCE with the value 200809L or greater, or _XOPEN_SOURCE with the value 700 or greater.)

NOTES

The errors returned by dup2(2) are different from those returned by F_DUPFD.

File locking

The original Linux fcntl() system call was not designed to handle large file offsets (in the flock structure). Consequently, an fcntl64() system call was added in Linux 2.4. The newer system call employs a different structure for file locking, flock64, and corresponding operations, F_GETLK64, F_SETLK64, and F_SETLKW64. However, these details can be ignored by applications using glibc, whose fcntl() wrapper function transparently employs the more recent system call where it is available.

Record locks

Since Linux 2.0, there is no interaction between the types of lock placed by flock(2) and fcntl().

Several systems have more fields in struct flock such as, for example, l_sysid (to identify the machine where the lock is held). Clearly, l_pid alone is not going to be very useful if the process holding the lock may live on a different machine; on Linux, while present on some architectures (such as MIPS32), this field is not used.

Record locking and NFS

Before Linux 3.12, if an NFSv4 client loses contact with the server for a period of time (defined as more than 90 seconds with no communication), it might lose and regain a lock without ever being aware of the fact. (The period of time after which contact is assumed lost is known as the NFSv4 leasetime. On a Linux NFS server, this can be determined by looking at /proc/fs/nfsd/nfsv4leasetime, which expresses the period in seconds. The default value for this file is 90.) This scenario potentially risks data corruption, since another process might acquire a lock in the intervening period and perform file I/O.

Since Linux 3.12, if an NFSv4 client loses contact with the server, any I/O to the file by a process which “thinks” it holds a lock will fail until that process closes and reopens the file. A kernel parameter, nfs.recover_lost_locks, can be set to 1 to obtain the pre-3.12 behavior, whereby the client will attempt to recover lost locks when contact is reestablished with the server. Because of the attendant risk of data corruption, this parameter defaults to 0 (disabled).

BUGS

F_SETFL

It is not possible to use F_SETFL to change the state of the O_DSYNC and O_SYNC flags. Attempts to change the state of these flags are silently ignored.

F_GETOWN

A limitation of the Linux system call conventions on some architectures (notably i386) means that if a (negative) process group ID to be returned by F_GETOWN falls in the range -1 to -4095, then the return value is wrongly interpreted by glibc as an error in the system call; that is, the return value of fcntl() will be -1, and errno will contain the (positive) process group ID. The Linux-specific F_GETOWN_EX operation avoids this problem. Since glibc 2.11, glibc makes the kernel F_GETOWN problem invisible by implementing F_GETOWN using F_GETOWN_EX.

F_SETOWN

In Linux 2.4 and earlier, there is a bug that can occur when an unprivileged process uses F_SETOWN to specify the owner of a socket file descriptor as a process (group) other than the caller. In this case, fcntl() can return -1 with errno set to EPERM, even when the owner process (group) is one that the caller has permission to send signals to. Despite this error return, the file descriptor owner is set, and signals will be sent to the owner.

Deadlock detection

The deadlock-detection algorithm employed by the kernel when dealing with F_SETLKW requests can yield both false negatives (failures to detect deadlocks, leaving a set of deadlocked processes blocked indefinitely) and false positives (EDEADLK errors when there is no deadlock). For example, the kernel limits the lock depth of its dependency search to 10 steps, meaning that circular deadlock chains that exceed that size will not be detected. In addition, the kernel may falsely indicate a deadlock when two or more processes created using the clone(2) CLONE_FILES flag place locks that appear (to the kernel) to conflict.

Mandatory locking

The Linux implementation of mandatory locking is subject to race conditions which render it unreliable: a write(2) call that overlaps with a lock may modify data after the mandatory lock is acquired; a read(2) call that overlaps with a lock may detect changes to data that were made only after a write lock was acquired. Similar races exist between mandatory locks and mmap(2). It is therefore inadvisable to rely on mandatory locking.

3 - Linux cli command prlimit

NAME 🖥️ prlimit 🖥️

get/set resource limits

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/resource.h>
int getrlimit(int resource, struct rlimit *rlim);
int setrlimit(int resource, const struct rlimit *rlim);
int prlimit(pid_t pid, int resource,
 const struct rlimit *_Nullable new_limit,
 struct rlimit *_Nullable old_limit);

Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

prlimit():

    _GNU_SOURCE

DESCRIPTION

The getrlimit() and setrlimit() system calls get and set resource limits. Each resource has an associated soft and hard limit, as defined by the rlimit structure:

struct rlimit {
    rlim_t rlim_cur;  /* Soft limit */
    rlim_t rlim_max;  /* Hard limit (ceiling for rlim_cur) */
};

The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may set only its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability in the initial user namespace) may make arbitrary changes to either limit value.

The value RLIM_INFINITY denotes no limit on a resource (both in the structure returned by getrlimit() and in the structure passed to setrlimit()).

The resource argument must be one of:

RLIMIT_AS
This is the maximum size of the process’s virtual memory (address space). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), mmap(2), and mremap(2), which fail with the error ENOMEM upon exceeding this limit. In addition, automatic stack expansion fails (and generates a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited.

RLIMIT_CORE
This is the maximum size of a core file (see core(5)) in bytes that the process may dump. When this limit is 0, no core dump files are created. When it is nonzero, larger dumps are truncated to this size.

RLIMIT_CPU
This is a limit, in seconds, on the amount of CPU time that the process can consume. When the process reaches the soft limit, it is sent a SIGXCPU signal. The default action for this signal is to terminate the process. However, the signal can be caught, and the handler can return control to the main program. If the process continues to consume CPU time, it will be sent SIGXCPU once per second until the hard limit is reached, at which time it is sent SIGKILL. (This latter point describes Linux behavior. Implementations vary in how they treat processes which continue to consume CPU time after reaching the soft limit. Portable applications that need to catch this signal should perform an orderly termination upon first receipt of SIGXCPU.)

RLIMIT_DATA
This is the maximum size of the process’s data segment (initialized data, uninitialized data, and heap). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), sbrk(2), and (since Linux 4.7) mmap(2), which fail with the error ENOMEM upon encountering the soft limit of this resource.

RLIMIT_FSIZE
This is the maximum size in bytes of files that the process may create. Attempts to extend a file beyond this limit result in delivery of a SIGXFSZ signal. By default, this signal terminates a process, but a process can catch this signal instead, in which case the relevant system call (e.g., write(2), truncate(2)) fails with the error EFBIG.

RLIMIT_LOCKS (Linux 2.4.0 to Linux 2.4.24)
This is a limit on the combined number of flock(2) locks and fcntl(2) leases that this process may establish.

RLIMIT_MEMLOCK
This is the maximum number of bytes of memory that may be locked into RAM. This limit is in effect rounded down to the nearest multiple of the system page size. This limit affects mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9, it also affects the shmctl(2) SHM_LOCK operation, where it sets a maximum on the total bytes in shared memory segments (see shmget(2)) that may be locked by the real user ID of the calling process. The shmctl(2) SHM_LOCK locks are accounted for separately from the per-process memory locks established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED; a process can lock bytes up to this limit in each of these two categories.

Before Linux 2.6.9, this limit controlled the amount of memory that could be locked by a privileged process. Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process may lock, and this limit instead governs the amount of memory that an unprivileged process may lock.

RLIMIT_MSGQUEUE (since Linux 2.6.8)
This is a limit on the number of bytes that can be allocated for POSIX message queues for the real user ID of the calling process. This limit is enforced for mq_open(3). Each message queue that the user creates counts (until it is removed) against this limit according to the formula:

Since Linux 3.5:

bytes = attr.mq_maxmsg * sizeof(struct msg_msg) +
        MIN(attr.mq_maxmsg, MQ_PRIO_MAX) *
              sizeof(struct posix_msg_tree_node)+
                        /* For overhead */
        attr.mq_maxmsg * attr.mq_msgsize;
                        /* For message data */

Linux 3.4 and earlier:

bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) +
                        /* For overhead */
        attr.mq_maxmsg * attr.mq_msgsize;
                        /* For message data */

where attr is the mq_attr structure specified as the fourth argument to mq_open(3), and the msg_msg and posix_msg_tree_node structures are kernel-internal structures.

The “overhead” addend in the formula accounts for overhead bytes required by the implementation and ensures that the user cannot create an unlimited number of zero-length messages (such messages nevertheless each consume some system memory for bookkeeping overhead).

RLIMIT_NICE (since Linux 2.6.12, but see BUGS below)
This specifies a ceiling to which the process’s nice value can be raised using setpriority(2) or nice(2). The actual ceiling for the nice value is calculated as 20 - rlim_cur. The useful range for this limit is thus from 1 (corresponding to a nice value of 19) to 40 (corresponding to a nice value of -20). This unusual choice of range was necessary because negative numbers cannot be specified as resource limit values, since they typically have special meanings. For example, RLIM_INFINITY typically is the same as -1. For more detail on the nice value, see sched(7).

RLIMIT_NOFILE
This specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.)

Since Linux 4.5, this limit also defines the maximum number of file descriptors that an unprivileged process (one without the CAP_SYS_RESOURCE capability) may have “in flight” to other processes, by being passed across UNIX domain sockets. This limit applies to the sendmsg(2) system call. For further details, see unix(7).

RLIMIT_NPROC
This is a limit on the number of extant processes (or, more precisely on Linux, threads) for the real user ID of the calling process. So long as the current number of processes belonging to this process’s real user ID is greater than or equal to this limit, fork(2) fails with the error EAGAIN.

The RLIMIT_NPROC limit is not enforced for processes that have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability, or run with real user ID 0.

RLIMIT_RSS
This is a limit (in bytes) on the process’s resident set (the number of virtual pages resident in RAM). This limit has effect only in Linux 2.4.x, x < 30, and there affects only calls to madvise(2) specifying MADV_WILLNEED.

RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS)
This specifies a ceiling on the real-time priority that may be set for this process using sched_setscheduler(2) and sched_setparam(2).

For further details on real-time scheduling policies, see sched(7).

RLIMIT_RTTIME (since Linux 2.6.25)
This is a limit (in microseconds) on the amount of CPU time that a process scheduled under a real-time scheduling policy may consume without making a blocking system call. For the purpose of this limit, each time a process makes a blocking system call, the count of its consumed CPU time is reset to zero. The CPU time count is not reset if the process continues trying to use the CPU but is preempted, its time slice expires, or it calls sched_yield(2).

Upon reaching the soft limit, the process is sent a SIGXCPU signal. If the process catches or ignores this signal and continues consuming CPU time, then SIGXCPU will be generated once each second until the hard limit is reached, at which point the process is sent a SIGKILL signal.

The intended use of this limit is to stop a runaway real-time process from locking up the system.

For further details on real-time scheduling policies, see sched(7).

RLIMIT_SIGPENDING (since Linux 2.6.8)
This is a limit on the number of signals that may be queued for the real user ID of the calling process. Both standard and real-time signals are counted for the purpose of checking this limit. However, the limit is enforced only for sigqueue(3); it is always possible to use kill(2) to queue one instance of any of the signals that are not already queued to the process.

RLIMIT_STACK
This is the maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated. To handle this signal, a process must employ an alternate signal stack (sigaltstack(2)).

Since Linux 2.6.23, this limit also determines the amount of space used for the process’s command-line arguments and environment variables; for details, see execve(2).

prlimit()

The Linux-specific prlimit() system call combines and extends the functionality of setrlimit() and getrlimit(). It can be used to both set and get the resource limits of an arbitrary process.

The resource argument has the same meaning as for setrlimit() and getrlimit().

If the new_limit argument is not NULL, then the rlimit structure to which it points is used to set new values for the soft and hard limits for resource. If the old_limit argument is not NULL, then a successful call to prlimit() places the previous soft and hard limits for resource in the rlimit structure pointed to by old_limit.

The pid argument specifies the ID of the process on which the call is to operate. If pid is 0, then the call applies to the calling process. To set or get the resources of a process other than itself, the caller must have the CAP_SYS_RESOURCE capability in the user namespace of the process whose resource limits are being changed, or the real, effective, and saved set user IDs of the target process must match the real user ID of the caller and the real, effective, and saved set group IDs of the target process must match the real group ID of the caller.

RETURN VALUE

On success, these system calls return 0. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

EFAULT
A pointer argument points to a location outside the accessible address space.

EINVAL
The value specified in resource is not valid; or, for setrlimit() or prlimit(): rlim->rlim_cur was greater than rlim->rlim_max.

EPERM
An unprivileged process tried to raise the hard limit; the CAP_SYS_RESOURCE capability is required to do this.

EPERM
The caller tried to increase the hard RLIMIT_NOFILE limit above the maximum defined by /proc/sys/fs/nr_open (see proc(5)).

EPERM
(prlimit()) The calling process did not have permission to set limits for the process specified by pid.

ESRCH
Could not find a process with the ID specified in pid.

ATTRIBUTES

For an explanation of the terms used in this section, see attributes(7).

Interface	Attribute	Value
getrlimit(), setrlimit(), prlimit()	Thread safety	MT-Safe

STANDARDS

getrlimit()
setrlimit()
POSIX.1-2008.

prlimit()
Linux.

RLIMIT_MEMLOCK and RLIMIT_NPROC derive from BSD and are not specified in POSIX.1; they are present on the BSDs and Linux, but on few other implementations. RLIMIT_RSS derives from BSD and is not specified in POSIX.1; it is nevertheless present on most implementations. RLIMIT_MSGQUEUE, RLIMIT_NICE, RLIMIT_RTPRIO, RLIMIT_RTTIME, and RLIMIT_SIGPENDING are Linux-specific.

HISTORY

getrlimit()
setrlimit()
POSIX.1-2001, SVr4, 4.3BSD.

prlimit()
Linux 2.6.36, glibc 2.13.

NOTES

A child process created via fork(2) inherits its parent’s resource limits. Resource limits are preserved across execve(2).

Resource limits are per-process attributes that are shared by all of the threads in a process.

Lowering the soft limit for a resource below the process’s current consumption of that resource will succeed (but will prevent the process from further increasing its consumption of the resource).

One can set the resource limits of the shell using the built-in ulimit command (limit in csh(1)). The shell’s resource limits are inherited by the processes that it creates to execute commands.

Since Linux 2.6.24, the resource limits of any process can be inspected via /proc/pid/limits; see proc(5).

Ancient systems provided a vlimit() function with a similar purpose to setrlimit(). For backward compatibility, glibc also provides vlimit(). All new applications should be written using setrlimit().

C library/kernel ABI differences

Since glibc 2.13, the glibc getrlimit() and setrlimit() wrapper functions no longer invoke the corresponding system calls, but instead employ prlimit(), for the reasons described in BUGS.

The name of the glibc wrapper function is prlimit(); the underlying system call is prlimit64().

BUGS

In older Linux kernels, the SIGXCPU and SIGKILL signals delivered when a process encountered the soft and hard RLIMIT_CPU limits were delivered one (CPU) second later than they should have been. This was fixed in Linux 2.6.8.

In Linux 2.6.x kernels before Linux 2.6.17, a RLIMIT_CPU limit of 0 is wrongly treated as “no limit” (like RLIM_INFINITY). Since Linux 2.6.17, setting a limit of 0 does have an effect, but is actually treated as a limit of 1 second.

A kernel bug means that RLIMIT_RTPRIO does not work in Linux 2.6.12; the problem is fixed in Linux 2.6.13.

In Linux 2.6.12, there was an off-by-one mismatch between the priority ranges returned by getpriority(2) and RLIMIT_NICE. This had the effect that the actual ceiling for the nice value was calculated as 19 - rlim_cur. This was fixed in Linux 2.6.13.

Since Linux 2.6.12, if a process reaches its soft RLIMIT_CPU limit and has a handler installed for SIGXCPU, then, in addition to invoking the signal handler, the kernel increases the soft limit by one second. This behavior repeats if the process continues to consume CPU time, until the hard limit is reached, at which point the process is killed. Other implementations do not change the RLIMIT_CPU soft limit in this manner, and the Linux behavior is probably not standards conformant; portable applications should avoid relying on this Linux-specific behavior. The Linux-specific RLIMIT_RTTIME limit exhibits the same behavior when the soft limit is encountered.

Kernels before Linux 2.4.22 did not diagnose the error EINVAL for setrlimit() when rlim->rlim_cur was greater than rlim->rlim_max.

Linux doesn’t return an error when an attempt to set RLIMIT_CPU has failed, for compatibility reasons.

Representation of “large” resource limit values on 32-bit platforms

The glibc getrlimit() and setrlimit() wrapper functions use a 64-bit rlim_t data type, even on 32-bit platforms. However, the rlim_t data type used in the getrlimit() and setrlimit() system calls is a (32-bit) unsigned long. Furthermore, in Linux, the kernel represents resource limits on 32-bit platforms as unsigned long. However, a 32-bit data type is not wide enough. The most pertinent limit here is RLIMIT_FSIZE, which specifies the maximum size to which a file can grow: to be useful, this limit must be represented using a type that is as wide as the type used to represent file offsets—that is, as wide as a 64-bit off_t (assuming a program compiled with _FILE_OFFSET_BITS=64).

To work around this kernel limitation, if a program tried to set a resource limit to a value larger than can be represented in a 32-bit unsigned long, then the glibc setrlimit() wrapper function silently converted the limit value to RLIM_INFINITY. In other words, the requested resource limit setting was silently ignored.

Since glibc 2.13, glibc works around the limitations of the getrlimit() and setrlimit() system calls by implementing setrlimit() and getrlimit() as wrapper functions that call prlimit().

EXAMPLES

The program below demonstrates the use of prlimit().

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <time.h>
int
main(int argc, char *argv[])
{
    pid_t          pid;
    struct rlimit  old, new;
    struct rlimit  *newp;
    if (!(argc == 2 || argc == 4)) {
        fprintf(stderr, "Usage: %s <pid> [<new-soft-limit> "
                "<new-hard-limit>]\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    pid = atoi(argv[1]);        /* PID of target process */
    newp = NULL;
    if (argc == 4) {
        new.rlim_cur = atoi(argv[2]);
        new.rlim_max = atoi(argv[3]);
        newp = &new;
    }
    /* Set CPU time limit of target process; retrieve and display
       previous limit */
    if (prlimit(pid, RLIMIT_CPU, newp, &old) == -1)
        err(EXIT_FAILURE, "prlimit-1");
    printf("Previous limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);
    /* Retrieve and display new CPU time limit */
    if (prlimit(pid, RLIMIT_CPU, NULL, &old) == -1)
        err(EXIT_FAILURE, "prlimit-2");
    printf("New limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);
    exit(EXIT_SUCCESS);
}

4 - Linux cli command utime

NAME 🖥️ utime 🖥️

change file last access and modification times

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <utime.h>
int utime(const char *filename,
 const struct utimbuf *_Nullable times);
#include <sys/time.h>
int utimes(const char *filename,
 const struct timeval times[_Nullable 2]);

DESCRIPTION

Note: modern applications may prefer to use the interfaces described in utimensat(2).

The utime() system call changes the access and modification times of the inode specified by filename to the actime and modtime fields of times respectively. The status change time (ctime) will be set to the current time, even if the other timestamps don’t actually change.

If times is NULL, then the access and modification times of the file are set to the current time.

Changing timestamps is permitted when: either the process has appropriate privileges, or the effective user ID equals the user ID of the file, or times is NULL and the process has write permission for the file.

The utimbuf structure is:

struct utimbuf {
    time_t actime;       /* access time */
    time_t modtime;      /* modification time */
};

The utime() system call allows specification of timestamps with a resolution of 1 second.

The utimes() system call is similar, but the times argument refers to an array rather than a structure. The elements of this array are timeval structures, which allow a precision of 1 microsecond for specifying timestamps. The timeval structure is:

struct timeval {
    long tv_sec;        /* seconds */
    long tv_usec;       /* microseconds */
};

times[0] specifies the new access time, and times[1] specifies the new modification time. If times is NULL, then analogously to utime(), the access and modification times of the file are set to the current time.

RETURN VALUE

On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

EACCES
Search permission is denied for one of the directories in the path prefix of filename (see also path_resolution(7)).

EACCES
times is NULL, the caller’s effective user ID does not match the owner of the file, the caller does not have write access to the file, and the caller is not privileged (Linux: does not have either the CAP_DAC_OVERRIDE or the CAP_FOWNER capability).

ENOENT
filename does not exist.

EPERM
times is not NULL, the caller’s effective UID does not match the owner of the file, and the caller is not privileged (Linux: does not have the CAP_FOWNER capability).

EROFS
filename resides on a read-only filesystem.

STANDARDS

POSIX.1-2008.

HISTORY

utime()
SVr4, POSIX.1-2001. POSIX.1-2008 marks it as obsolete.

utimes()
4.3BSD, POSIX.1-2001.

NOTES

Linux does not allow changing the timestamps on an immutable file, or setting the timestamps to something other than the current time on an append-only file.

5 - Linux cli command getdents64

NAME 🖥️ getdents64 🖥️

get directory entries

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_getdents, unsigned int fd, struct linux_dirent *dirp,
 unsigned int count);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <dirent.h>
ssize_t getdents64(int fd, void dirp[.count], size_t count);

Note: glibc provides no wrapper for getdents(), necessitating the use of syscall(2).

Note: There is no definition of struct linux_dirent in glibc; see NOTES.

DESCRIPTION

These are not the interfaces you are interested in. Look at readdir(3) for the POSIX-conforming C library interface. This page documents the bare kernel system call interfaces.

getdents()

The system call getdents() reads several linux_dirent structures from the directory referred to by the open file descriptor fd into the buffer pointed to by dirp. The argument count specifies the size of that buffer.

The linux_dirent structure is declared as follows:

struct linux_dirent {
    unsigned long  d_ino;     /* Inode number */
    unsigned long  d_off;     /* Not an offset; see below */
    unsigned short d_reclen;  /* Length of this linux_dirent */
    char           d_name[];  /* Filename (null-terminated) */
                      /* length is actually (d_reclen - 2 -
                         offsetof(struct linux_dirent, d_name)) */
    /*
    char           pad;       // Zero padding byte
    char           d_type;    // File type (only since Linux
                              // 2.6.4); offset is (d_reclen - 1)
    */
};

d_ino is an inode number. d_off is a filesystem-specific value with no specific meaning to user space, though on older filesystems it used to be the distance from the start of the directory to the start of the next linux_dirent; see readdir(3). d_reclen is the size of this entire linux_dirent. d_name is a null-terminated filename.

d_type is a byte at the end of the structure that indicates the file type. It contains one of the following values (defined in <dirent.h>):

DT_BLK
This is a block device.

DT_CHR
This is a character device.

DT_DIR
This is a directory.

DT_FIFO
This is a named pipe (FIFO).

DT_LNK
This is a symbolic link.

DT_REG
This is a regular file.

DT_SOCK
This is a UNIX domain socket.

DT_UNKNOWN
The file type is unknown.

The d_type field is implemented since Linux 2.6.4. It occupies a space that was previously a zero-filled padding byte in the linux_dirent structure. Thus, on kernels up to and including Linux 2.6.3, attempting to access this field always provides the value 0 (DT_UNKNOWN).

Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type. All applications must properly handle a return of DT_UNKNOWN.

getdents64()

The original Linux getdents() system call did not handle large filesystems and large file offsets. Consequently, Linux 2.4 added getdents64(), with wider types for the d_ino and d_off fields. In addition, getdents64() supports an explicit d_type field.

The getdents64() system call is like getdents(), except that its second argument is a pointer to a buffer containing structures of the following type:

struct linux_dirent64 {
    ino64_t        d_ino;    /* 64-bit inode number */
    off64_t        d_off;    /* Not an offset; see getdents() */
    unsigned short d_reclen; /* Size of this dirent */
    unsigned char  d_type;   /* File type */
    char           d_name[]; /* Filename (null-terminated) */
};

RETURN VALUE

On success, the number of bytes read is returned. On end of directory, 0 is returned. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

EBADF
Invalid file descriptor fd.

EFAULT
Argument points outside the calling process’s address space.

EINVAL
Result buffer is too small.

ENOENT
No such directory.

ENOTDIR
File descriptor does not refer to a directory.

STANDARDS

None.

HISTORY

SVr4.

getdents64()
glibc 2.30.

NOTES

glibc does not provide a wrapper for getdents(); call getdents() using syscall(2). In that case you will need to define the linux_dirent or linux_dirent64 structure yourself.

Probably, you want to use readdir(3) instead of these system calls.

These calls supersede readdir(2).

EXAMPLES

The program below demonstrates the use of getdents(). The following output shows an example of what we see when running this program on an ext2 directory:

$ ./a.out /testfs/
--------------- nread=120 ---------------
inode#    file type  d_reclen  d_off   d_name
       2  directory    16         12  .
       2  directory    16         24  ..
      11  directory    24         44  lost+found
      12  regular      16         56  a
  228929  directory    16         68  sub
   16353  directory    16         80  sub2
  130817  directory    16       4096  sub3

Program source

#define _GNU_SOURCE
#include <dirent.h>     /* Defines DT_* constants */
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
struct linux_dirent {
    unsigned long  d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    char           d_name[];
};
#define BUF_SIZE 1024
int
main(int argc, char *argv[])
{
    int                  fd;
    char                 d_type;
    char                 buf[BUF_SIZE];
    long                 nread;
    struct linux_dirent  *d;
    fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        err(EXIT_FAILURE, "open");
    for (;;) {
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1)
            err(EXIT_FAILURE, "getdents");
        if (nread == 0)
            break;
        printf("--------------- nread=%ld ---------------\n", nread);
        printf("inode#    file type  d_reclen  d_off   d_name\n");
        for (size_t bpos = 0; bpos < nread;) {
            d = (struct linux_dirent *) (buf + bpos);
            printf("%8lu  ", d->d_ino);
            d_type = *(buf + bpos + d->d_reclen - 1);
            printf("%-10s ", (d_type == DT_REG) ?  "regular" :
                             (d_type == DT_DIR) ?  "directory" :
                             (d_type == DT_FIFO) ? "FIFO" :
                             (d_type == DT_SOCK) ? "socket" :
                             (d_type == DT_LNK) ?  "symlink" :
                             (d_type == DT_BLK) ?  "block dev" :
                             (d_type == DT_CHR) ?  "char dev" : "???");
            printf("%4d %10jd  %s\n", d->d_reclen,
                   (intmax_t) d->d_off, d->d_name);
            bpos += d->d_reclen;
        }
    }
    exit(EXIT_SUCCESS);
}

6 - Linux cli command sync_file_range2

NAME 🖥️ sync_file_range2 🖥️

sync a file segment with disk

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#define _GNU_SOURCE /* See feature_test_macros(7) */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
int sync_file_range(int fd, off_t offset, off_t nbytes,
 unsigned int flags);

DESCRIPTION

sync_file_range() permits fine control when synchronizing the open file referred to by the file descriptor fd with disk.

offset is the starting byte of the file range to be synchronized. nbytes specifies the length of the range to be synchronized, in bytes; if nbytes is zero, then all bytes from offset through to the end of file are synchronized. Synchronization is in units of the system page size: offset is rounded down to a page boundary; (offset+nbytes-1) is rounded up to a page boundary.

The flags bit-mask argument can include any of the following values:

SYNC_FILE_RANGE_WAIT_BEFORE
Wait upon write-out of all pages in the specified range that have already been submitted to the device driver for write-out before performing any write.

SYNC_FILE_RANGE_WRITE
Initiate write-out of all dirty pages in the specified range which are not presently submitted for write-out. Note that even this may block if you attempt to write more than the request queue size.

SYNC_FILE_RANGE_WAIT_AFTER
Wait upon write-out of all pages in the range after performing any write.

Specifying flags as 0 is permitted, as a no-op.

Warning

This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the file’s metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an overwrite. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches.

Some details

SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any I/O errors or ENOSPC conditions and will return these to the caller.

Useful combinations of the flags bits are:

SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE
Ensures that all pages in the specified range which were dirty when sync_file_range() was called are placed under write-out. This is a start-write-for-data-integrity operation.

SYNC_FILE_RANGE_WRITE
Start write-out of all dirty pages in the specified range which are not presently under write-out. This is an asynchronous flush-to-disk operation. This is not suitable for data integrity operations.

SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER)
Wait for completion of write-out of all pages in the specified range. This can be used after an earlier SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE operation to wait for completion of that operation, and obtain its result.

SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER
This is a write-for-data-integrity operation that will ensure that all pages in the specified range which were dirty when sync_file_range() was called are committed to disk.
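
The flag combinations above could be wrapped roughly as follows. This is a sketch only: the helper names are hypothetical, and, as the Warning section explains, none of these calls writes out file metadata or flushes disk write caches, so they are not a substitute for fsync(2) or fdatasync(2).

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Asynchronous flush: queue the dirty pages of the range for
   write-out and return without waiting for the I/O to finish. */
int
start_range_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/* Write-for-data-integrity: wait for earlier write-out, start
   write-out of what is still dirty, then wait for completion. */
int
finish_range_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/* Hypothetical helper: write len bytes at offset, then push just
   that range to the device.  Returns 0, or -1 with errno set. */
int
overwrite_and_flush(int fd, const void *data, size_t len, off_t offset)
{
    ssize_t n = pwrite(fd, data, len, offset);

    if (n == -1)
        return -1;
    if ((size_t) n != len) {    /* Short write: report as an I/O error */
        errno = EIO;
        return -1;
    }
    return finish_range_writeback(fd, offset, (off_t) len);
}
```

A typical pattern is to call start_range_writeback() soon after writing a chunk and finish_range_writeback() later, so that the wait overlaps with other work. Passing nbytes as 0 covers everything from offset to the end of the file.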

RETURN VALUE

On success, sync_file_range() returns 0; on failure -1 is returned and errno is set to indicate the error.

ERRORS

EBADF
fd is not a valid file descriptor.

EINVAL
flags specifies an invalid bit; or offset or nbytes is invalid.

EIO
I/O error.

ENOMEM
Out of memory.

ENOSPC
Out of disk space.

ESPIPE
fd refers to something other than a regular file, a block device, or a directory.

VERSIONS

sync_file_range2()

Some architectures (e.g., PowerPC, ARM) need 64-bit arguments to be aligned in a suitable pair of registers. On such architectures, the call signature of sync_file_range() shown in the SYNOPSIS would force a register to be wasted as padding between the fd and offset arguments. (See syscall(2) for details.) Therefore, these architectures define a different system call that orders the arguments suitably:

int sync_file_range2(int fd, unsigned int flags,
 off_t offset, off_t nbytes);

The behavior of this system call is otherwise exactly the same as sync_file_range().

STANDARDS

Linux.

HISTORY

Linux 2.6.17.

sync_file_range2()

A system call with this signature first appeared on the ARM architecture in Linux 2.6.20, with the name arm_sync_file_range(). It was renamed in Linux 2.6.22, when the analogous system call was added for PowerPC. On architectures where glibc support is provided, glibc transparently wraps sync_file_range2() under the name sync_file_range().

NOTES

_FILE_OFFSET_BITS should be defined to be 64 in code that takes the address of sync_file_range, if the code is intended to be portable to traditional 32-bit x86 and ARM platforms where off_t’s width defaults to 32 bits.

7 - Linux cli command arm_fadvise

NAME 🖥️ arm_fadvise 🖥️

predeclare an access pattern for file data

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <fcntl.h>
int posix_fadvise(int fd, off_t offset, off_t len, int advice);

Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

posix_fadvise():

    _POSIX_C_SOURCE >= 200112L

DESCRIPTION

Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.

The advice applies to a (not necessarily existent) region starting at offset and extending for len bytes (or until the end of the file if len is 0) within the file referred to by fd. The advice is not binding; it merely constitutes an expectation on behalf of the application.

Permissible values for advice include:

POSIX_FADV_NORMAL
Indicates that the application has no advice to give about its access pattern for the specified data. If no advice is given for an open file, this is the default assumption.

POSIX_FADV_SEQUENTIAL
The application expects to access the specified data sequentially (with lower offsets read before higher ones).

POSIX_FADV_RANDOM
The specified data will be accessed in random order.

POSIX_FADV_NOREUSE
The specified data will be accessed only once.

Before Linux 2.6.18, POSIX_FADV_NOREUSE had the same semantics as POSIX_FADV_WILLNEED. This was probably a bug; since Linux 2.6.18, this flag is a no-op.

POSIX_FADV_WILLNEED
The specified data will be accessed in the near future.

POSIX_FADV_WILLNEED initiates a nonblocking read of the specified region into the page cache. The amount of data read may be decreased by the kernel depending on virtual memory load. (A few megabytes will usually be fully satisfied, and more is rarely useful.)

POSIX_FADV_DONTNEED
The specified data will not be accessed in the near future.

POSIX_FADV_DONTNEED attempts to free cached pages associated with the specified region. This is useful, for example, while streaming large files. A program may periodically request the kernel to free cached data that has already been used, so that more useful cached pages are not discarded instead.

Requests to discard partial pages are ignored. It is preferable to preserve needed data rather than discard unneeded data. If the application requires that data be considered for discarding, then offset and len must be page-aligned.

The implementation may attempt to write back dirty pages in the specified region, but this is not guaranteed. Any unwritten dirty pages will not be freed. If the application wishes to ensure that dirty pages will be released, it should call fsync(2) or fdatasync(2) first.

RETURN VALUE

On success, zero is returned. On error, an error number is returned.
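
Because posix_fadvise() returns the error number directly instead of setting errno, callers should test its return value rather than compare against -1. A minimal sketch of a streaming read follows; the helper name read_sequentially() and the 64 kB buffer are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: reads the whole file at path and returns the
   number of bytes read, or -1 if the file cannot be opened or read. */
long
read_sequentially(const char *path)
{
    char buf[65536];
    long total = 0;
    ssize_t n;
    int fd, rc;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* Announce sequential access to the whole file (len 0 means
       "through to the end of the file").  The advice is not binding,
       so a failure is reported but otherwise ignored. */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;

    /* The data has been consumed once; let the kernel free the
       cached pages so that more useful pages are not discarded. */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    close(fd);
    return (n == -1) ? -1 : total;
}
```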

ERRORS

EBADF
The fd argument was not a valid file descriptor.

EINVAL
An invalid value was specified for advice.

ESPIPE
The specified file descriptor refers to a pipe or FIFO. (ESPIPE is the error specified by POSIX, but before Linux 2.6.16, Linux returned EINVAL in this case.)

VERSIONS

Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely. These changes affect the entire file, not just the specified region (but other open file handles to the same file are unaffected).

C library/kernel differences

The name of the wrapper function in the C library is posix_fadvise(). The underlying system call is called fadvise64() (or, on some architectures, fadvise64_64()); the difference between the two is that the former system call assumes that the type of the len argument is size_t, while the latter expects loff_t there.

Architecture-specific variants

Some architectures require 64-bit arguments to be aligned in a suitable pair of registers (see syscall(2) for further detail). On such architectures, the call signature of posix_fadvise() shown in the SYNOPSIS would force a register to be wasted as padding between the fd and offset arguments. Therefore, these architectures define a version of the system call that orders the arguments suitably, but is otherwise exactly the same as posix_fadvise().

For example, since Linux 2.6.14, ARM has the following system call:

long arm_fadvise64_64(int fd, int advice,
 loff_t offset, loff_t len);

These architecture-specific details are generally hidden from applications by the glibc posix_fadvise() wrapper function, which invokes the appropriate architecture-specific system call.

STANDARDS

POSIX.1-2008.

HISTORY

POSIX.1-2001.

Kernel support first appeared in Linux 2.5.60; the underlying system call is called fadvise64(). Library support has been provided since glibc 2.2, via the wrapper function posix_fadvise().

Since Linux 3.18, support for the underlying system call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS configuration option.

The type of the len argument was changed from size_t to off_t in POSIX.1-2001 TC1.

NOTES

The contents of the kernel buffer cache can be cleared via the /proc/sys/vm/drop_caches interface described in proc(5).

One can obtain a snapshot of which pages of a file are resident in the buffer cache by opening a file, mapping it with mmap(2), and then applying mincore(2) to the mapping.

BUGS

Before Linux 2.6.6, if len was specified as 0, then this was interpreted literally as “zero bytes”, rather than as meaning “all bytes through to the end of the file”.

8 - Linux cli command mlock2

NAME 🖥️ mlock2 🖥️

lock and unlock memory

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/mman.h>
int mlock(const void addr[.len], size_t len);
int mlock2(const void addr[.len], size_t len, unsigned int flags);
int munlock(const void addr[.len], size_t len);
int mlockall(int flags);
int munlockall(void);

DESCRIPTION

mlock(), mlock2(), and mlockall() lock part or all of the calling process’s virtual address space into RAM, preventing that memory from being paged to the swap area.

munlock() and munlockall() perform the converse operation, unlocking part or all of the calling process’s virtual address space, so that pages in the specified virtual address range can be swapped out again if required by the kernel memory manager.

Memory locking and unlocking are performed in units of whole pages.

mlock(), mlock2(), and munlock()

mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.

mlock2() also locks pages in the specified range starting at addr and continuing for len bytes. However, the state of the pages contained in that range after the call returns successfully will depend on the value in the flags argument.

The flags argument can be either 0 or the following constant:

MLOCK_ONFAULT
Lock pages that are currently resident and mark the entire range so that the remaining nonresident pages are locked when they are populated by a page fault.

If flags is 0, mlock2() behaves exactly the same as mlock().

munlock() unlocks pages in the address range starting at addr and continuing for len bytes. After this call, all pages that contain a part of the specified memory range can be moved to external swap space again by the kernel.
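
The difference between mlock() and mlock2() with MLOCK_ONFAULT can be sketched as below. The helper name lock_on_fault() is hypothetical, and the fallback to mlock() on ENOSYS is one possible policy for kernels that predate mlock2(), not something the interface requires.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: lock [addr, addr+len) lazily, so that only
   pages that are resident (or later faulted in) are held in RAM.
   Returns 0 on success, or -1 with errno set. */
int
lock_on_fault(void *addr, size_t len)
{
    assert(addr != NULL);

    if (mlock2(addr, len, MLOCK_ONFAULT) == 0)
        return 0;

    /* Assumed policy: on a kernel without mlock2(), fall back to
       mlock(), which faults in and locks every page immediately. */
    if (errno == ENOSYS)
        return mlock(addr, len);

    return -1;
}
```

A large, sparsely used mapping is the intended case: lock_on_fault() avoids populating pages that the program never touches, and munlock() on the same range releases the lock.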

mlockall() and munlockall()

mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data, and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.

The flags argument is constructed as the bitwise OR of one or more of the following constants:

MCL_CURRENT
Lock all pages which are currently mapped into the address space of the process.

MCL_FUTURE
Lock all pages which will become mapped into the address space of the process in the future. These could be, for instance, new pages required by a growing heap and stack as well as new memory-mapped files or shared memory regions.

MCL_ONFAULT (since Linux 4.4)
Used together with MCL_CURRENT, MCL_FUTURE, or both. Mark all current (with MCL_CURRENT) or future (with MCL_FUTURE) mappings to lock pages when they are faulted in. When used with MCL_CURRENT, all present pages are locked, but mlockall() will not fault in non-present pages. When used with MCL_FUTURE, all future mappings will be marked to lock pages when they are faulted in, but they will not be populated by the lock when the mapping is created. MCL_ONFAULT must be used with either MCL_CURRENT or MCL_FUTURE or both.

If MCL_FUTURE has been specified, then a later system call (e.g., mmap(2), sbrk(2), malloc(3)), may fail if it would cause the number of locked bytes to exceed the permitted maximum (see below). In the same circumstances, stack growth may likewise fail: the kernel will deny stack expansion and deliver a SIGSEGV signal to the process.

munlockall() unlocks all pages mapped into the address space of the calling process.

RETURN VALUE

On success, these system calls return 0. On error, -1 is returned, errno is set to indicate the error, and no changes are made to any locks in the address space of the process.

ERRORS

EAGAIN
(mlock(), mlock2(), and munlock()) Some or all of the specified address range could not be locked.

EINVAL
(mlock(), mlock2(), and munlock()) The result of the addition addr+len was less than addr (e.g., the addition may have resulted in an overflow).

EINVAL
(mlock2()) Unknown flags were specified.

EINVAL
(mlockall()) Unknown flags were specified or MCL_ONFAULT was specified without either MCL_FUTURE or MCL_CURRENT.

EINVAL
(Not on Linux) addr was not a multiple of the page size.

ENOMEM
(mlock(), mlock2(), and munlock()) Some of the specified address range does not correspond to mapped pages in the address space of the process.

ENOMEM
(mlock(), mlock2(), and munlock()) Locking or unlocking a region would result in the total number of mappings with distinct attributes (e.g., locked versus unlocked) exceeding the allowed maximum. (For example, unlocking a range in the middle of a currently locked mapping would result in three mappings: two locked mappings at each end and an unlocked mapping in the middle.)

ENOMEM
(Linux 2.6.9 and later) the caller had a nonzero RLIMIT_MEMLOCK soft resource limit, but tried to lock more memory than the limit permitted. This limit is not enforced if the process is privileged (CAP_IPC_LOCK).

ENOMEM
(Linux 2.4 and earlier) the calling process tried to lock more than half of RAM.

EPERM
The caller is not privileged, but needs privilege (CAP_IPC_LOCK) to perform the requested operation.

EPERM
(munlockall()) (Linux 2.6.8 and earlier) The caller was not privileged (CAP_IPC_LOCK).

VERSIONS

Linux

Under Linux, mlock(), mlock2(), and munlock() automatically round addr down to the nearest page boundary. However, the POSIX.1 specification of mlock() and munlock() allows an implementation to require that addr is page aligned, so portable applications should ensure this.
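
A portable caller can do the rounding itself instead of relying on the Linux behavior. The sketch below assumes that the page size is a power of two; page_align_range() and mlock_portable() are illustrative names, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: expand [addr, addr+len) to whole pages.
   Assumes the page size is a power of two. */
void
page_align_range(const void *addr, size_t len,
                 uintptr_t *start, size_t *aligned_len)
{
    uintptr_t page  = (uintptr_t) sysconf(_SC_PAGESIZE);
    uintptr_t begin = (uintptr_t) addr;
    uintptr_t end   = begin + len;

    assert(start != NULL && aligned_len != NULL);

    *start = begin & ~(page - 1);               /* Round start down */
    *aligned_len = (size_t) (((end + page - 1) & ~(page - 1)) - *start);
                                                /* Round end up */
}

/* Lock the pages that contain the buffer, passing a page-aligned
   address as POSIX.1 allows an implementation to require. */
int
mlock_portable(const void *addr, size_t len)
{
    uintptr_t start;
    size_t aligned_len;

    page_align_range(addr, len, &start, &aligned_len);
    return mlock((const void *) start, aligned_len);
}
```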

The VmLck field of the Linux-specific /proc/pid/status file shows how many kilobytes of memory the process with ID PID has locked using mlock(), mlock2(), mlockall(), and mmap(2) MAP_LOCKED.

STANDARDS

mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2008.

mlock2()
Linux.

On POSIX systems on which mlock() and munlock() are available, _POSIX_MEMLOCK_RANGE is defined in <unistd.h> and the number of bytes in a page can be determined from the constant PAGESIZE (if defined) in <limits.h> or by calling sysconf(_SC_PAGESIZE).

On POSIX systems on which mlockall() and munlockall() are available, _POSIX_MEMLOCK is defined in <unistd.h> to a value greater than 0. (See also sysconf(3).)

HISTORY

mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2001, POSIX.1-2008, SVr4.

mlock2()
Linux 4.4, glibc 2.27.

NOTES

Memory locking has two main applications: real-time algorithms and high-security data processing. Real-time applications require deterministic timing, and, like scheduling, paging is one major cause of unexpected program execution delays. Real-time applications will usually also switch to a real-time scheduler with sched_setscheduler(2). Cryptographic security software often handles critical bytes like passwords or secret keys as data structures. As a result of paging, these secrets could be transferred onto a persistent swap store medium, where they might be accessible to the enemy long after the security software has erased the secrets in RAM and terminated. (But be aware that the suspend mode on laptops and some desktop computers will save a copy of the system’s RAM to disk, regardless of memory locks.)

Real-time processes that are using mlockall() to prevent delays on page faults should reserve enough locked stack pages before entering the time-critical section, so that no page fault can be caused by function calls. This can be achieved by calling a function that allocates a sufficiently large automatic variable (an array) and writes to the memory occupied by this array in order to touch these stack pages. This way, enough pages will be mapped for the stack and can be locked into RAM. The dummy writes ensure that not even copy-on-write page faults can occur in the critical section.
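
The stack pre-faulting technique described above might look like the following sketch. STACK_RESERVE is an assumed worst-case stack depth that a real application would have to measure, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define STACK_RESERVE (64 * 1024)   /* Assumed worst-case stack use */

/* Touch STACK_RESERVE bytes of stack so that the pages are mapped
   (and, with MCL_FUTURE in effect, locked) before they are needed.
   The volatile qualifier keeps the compiler from removing the
   dummy writes. */
size_t
prefault_stack(void)
{
    volatile unsigned char dummy[STACK_RESERVE];

    for (size_t i = 0; i < sizeof(dummy); i++)
        dummy[i] = 0;

    return sizeof(dummy);
}

/* Hypothetical setup step to run before the time-critical section.
   Returns 0 on success, or -1 with errno set by mlockall(). */
int
enter_realtime_section(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
        return -1;

    prefault_stack();
    return 0;
}
```

An unprivileged process may see enter_realtime_section() fail with ENOMEM or EPERM, depending on its RLIMIT_MEMLOCK limit.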

Memory locks are not inherited by a child created via fork(2) and are automatically removed (unlocked) during an execve(2) or when the process terminates. The mlockall() MCL_FUTURE and MCL_FUTURE | MCL_ONFAULT settings are not inherited by a child created via fork(2) and are cleared during an execve(2).

Note that fork(2) will prepare the address space for a copy-on-write operation. The consequence is that any write access that follows will cause a page fault that in turn may cause high latencies for a real-time process. Therefore, it is crucial not to invoke fork(2) after an mlockall() or mlock() operation—not even from a thread which runs at a low priority within a process which also has a thread running at elevated priority.

The memory lock on an address range is automatically removed if the address range is unmapped via munmap(2).

Memory locks do not stack, that is, pages which have been locked several times by calls to mlock(), mlock2(), or mlockall() will be unlocked by a single call to munlock() for the corresponding range or by munlockall(). Pages which are mapped to several locations or by several processes stay locked into RAM as long as they are locked at least at one location or by at least one process.

If a call to mlockall() which uses the MCL_FUTURE flag is followed by another call that does not specify this flag, the changes made by the MCL_FUTURE call will be lost.

The mlock2() MLOCK_ONFAULT flag and the mlockall() MCL_ONFAULT flag allow efficient memory locking for applications that deal with large mappings where only a (small) portion of pages in the mapping are touched. In such cases, locking all of the pages in a mapping would incur a significant penalty for memory locking.

Limits and permissions

In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK) in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines a limit on how much memory the process may lock.

Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process can lock and the RLIMIT_MEMLOCK soft resource limit instead defines a limit on how much memory an unprivileged process may lock.

BUGS

In Linux 4.8 and earlier, a bug in the kernel’s accounting of locked memory for unprivileged processes (i.e., without CAP_IPC_LOCK) meant that if the region specified by addr and len overlapped an existing lock, then the already locked bytes in the overlapping region were counted twice when checking against the limit. Such double accounting could incorrectly calculate a “total locked memory” value for the process that exceeded the RLIMIT_MEMLOCK limit, with the result that mlock() and mlock2() would fail on requests that should have succeeded. This bug was fixed in Linux 4.9.

In Linux 2.4 series of kernels up to and including Linux 2.4.17, a bug caused the mlockall() MCL_FUTURE flag to be inherited across a fork(2). This was rectified in Linux 2.4.18.

Since Linux 2.6.9, if a privileged process calls mlockall(MCL_FUTURE) and later drops privileges (loses the CAP_IPC_LOCK capability by, for example, setting its effective UID to a nonzero value), then subsequent memory allocations (e.g., mmap(2), brk(2)) will fail if the RLIMIT_MEMLOCK resource limit is encountered.

9 - Linux cli command tuxcall

NAME 🖥️ tuxcall 🖥️

unimplemented system calls

SYNOPSIS

Unimplemented system calls.

DESCRIPTION

These system calls are not implemented in the Linux kernel.

RETURN VALUE

These system calls always return -1 and set errno to ENOSYS.
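
The return behavior can be observed with a small probe such as the one below. This is a sketch under assumptions: SYS_tuxcall is defined only on some architectures (hence the #ifdef), the function name is hypothetical, and a sandbox or seccomp filter may substitute a different error such as EPERM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical probe: returns the errno value produced by invoking
   the tuxcall system call number, or 0 if this architecture has no
   such number or the call unexpectedly succeeds. */
int
probe_tuxcall_errno(void)
{
#ifdef SYS_tuxcall
    errno = 0;
    if (syscall(SYS_tuxcall) == -1)
        return errno;           /* Expected: ENOSYS */
#endif
    return 0;
}
```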

NOTES

Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.

Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.

Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.

10 - Linux cli command sysinfo

NAME 🖥️ sysinfo 🖥️

return system information

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/sysinfo.h>
int sysinfo(struct sysinfo *info);

DESCRIPTION

sysinfo() returns certain statistics on memory and swap usage, as well as the load average.

Until Linux 2.3.16, sysinfo() returned information in the following structure:

struct sysinfo {
    long uptime;             /* Seconds since boot */
    unsigned long loads[3];  /* 1, 5, and 15 minute load averages */
    unsigned long totalram;  /* Total usable main memory size */
    unsigned long freeram;   /* Available memory size */
    unsigned long sharedram; /* Amount of shared memory */
    unsigned long bufferram; /* Memory used by buffers */
    unsigned long totalswap; /* Total swap space size */
    unsigned long freeswap;  /* Swap space still available */
    unsigned short procs;    /* Number of current processes */
    char _f[22];             /* Pads structure to 64 bytes */
};

In the above structure, the sizes of the memory and swap fields are given in bytes.

Since Linux 2.3.23 (i386) and Linux 2.3.48 (all architectures) the structure is:

struct sysinfo {
    long uptime;             /* Seconds since boot */
    unsigned long loads[3];  /* 1, 5, and 15 minute load averages */
    unsigned long totalram;  /* Total usable main memory size */
    unsigned long freeram;   /* Available memory size */
    unsigned long sharedram; /* Amount of shared memory */
    unsigned long bufferram; /* Memory used by buffers */
    unsigned long totalswap; /* Total swap space size */
    unsigned long freeswap;  /* Swap space still available */
    unsigned short procs;    /* Number of current processes */
    unsigned long totalhigh; /* Total high memory size */
    unsigned long freehigh;  /* Available high memory size */
    unsigned int mem_unit;   /* Memory unit size in bytes */
    char _f[20-2*sizeof(long)-sizeof(int)];
                             /* Padding to 64 bytes */
};

In the above structure, sizes of the memory and swap fields are given as multiples of mem_unit bytes.
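
Since the memory fields are counts of mem_unit-sized blocks, converting them to bytes requires a multiplication, preferably in a type wide enough to hold the result on 32-bit systems. A minimal sketch follows; the helper name ram_bytes() is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/sysinfo.h>

/* Hypothetical helper: stores total and free RAM in bytes.
   Returns 0 on success, or -1 with errno set by sysinfo(). */
int
ram_bytes(unsigned long long *total, unsigned long long *avail)
{
    struct sysinfo si;

    assert(total != NULL && avail != NULL);

    if (sysinfo(&si) == -1)
        return -1;

    /* Widen before multiplying: totalram * mem_unit may not fit
       in an unsigned long on 32-bit platforms. */
    *total = (unsigned long long) si.totalram * si.mem_unit;
    *avail = (unsigned long long) si.freeram * si.mem_unit;
    return 0;
}
```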

RETURN VALUE

On success, sysinfo() returns zero. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

EFAULT
info is not a valid address.

STANDARDS

Linux.

HISTORY

Linux 0.98.pl6.

NOTES

All of the information provided by this system call is also available via /proc/meminfo and /proc/loadavg.

11 - Linux cli command ioprio_set

NAME 🖥️ ioprio_set 🖥️

get/set I/O scheduling class and priority

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <linux/ioprio.h> /* Definition of IOPRIO_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_ioprio_get, int which, int who);
int syscall(SYS_ioprio_set, int which, int who, int ioprio);

Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).

DESCRIPTION

The ioprio_get() and ioprio_set() system calls get and set the I/O scheduling class and priority of one or more threads.

The which and who arguments identify the thread(s) on which the system calls operate. The which argument determines how who is interpreted, and has one of the following values:

IOPRIO_WHO_PROCESS
who is a process ID or thread ID identifying a single process or thread. If who is 0, then operate on the calling thread.

IOPRIO_WHO_PGRP
who is a process group ID identifying all the members of a process group. If who is 0, then operate on the process group of which the caller is a member.

IOPRIO_WHO_USER
who is a user ID identifying all of the processes that have a matching real UID.

If which is specified as IOPRIO_WHO_PGRP or IOPRIO_WHO_USER when calling ioprio_get(), and more than one process matches who, then the returned priority will be the highest one found among all of the matching processes. One priority is said to be higher than another one if it belongs to a higher priority class (IOPRIO_CLASS_RT is the highest priority class; IOPRIO_CLASS_IDLE is the lowest) or if it belongs to the same priority class as the other process but has a higher priority level (a lower priority number means a higher priority level).

The ioprio argument given to ioprio_set() is a bit mask that specifies both the scheduling class and the priority to be assigned to the target process(es). The following macros are used for assembling and dissecting ioprio values:

IOPRIO_PRIO_VALUE(class, data)
Given a scheduling class and priority (data), this macro combines the two values to produce an ioprio value, which is returned as the result of the macro.

IOPRIO_PRIO_CLASS(mask)
Given mask (an ioprio value), this macro returns its I/O class component, that is, one of the values IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, or IOPRIO_CLASS_IDLE.

IOPRIO_PRIO_DATA(mask)
Given mask (an ioprio value), this macro returns its priority (data) component.

See the NOTES section for more information on scheduling classes and priorities, as well as the meaning of specifying ioprio as 0.
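
The macros above might be used as in the following sketch, which sets the calling thread to the best-effort class and reads the value back. The wrapper names are hypothetical. The fallback definitions are an assumption for systems whose kernel headers do not ship <linux/ioprio.h>; they mirror the kernel's classic layout, with the class stored above bit 13.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__has_include)
#  if __has_include(<linux/ioprio.h>)
#    include <linux/ioprio.h>
#  endif
#endif

/* Assumed fallback when <linux/ioprio.h> is unavailable. */
#ifndef IOPRIO_CLASS_SHIFT
#  define IOPRIO_CLASS_SHIFT 13
#  define IOPRIO_PRIO_VALUE(class, data) \
          (((class) << IOPRIO_CLASS_SHIFT) | (data))
#  define IOPRIO_PRIO_CLASS(mask) ((mask) >> IOPRIO_CLASS_SHIFT)
#  define IOPRIO_PRIO_DATA(mask) \
          ((mask) & ((1 << IOPRIO_CLASS_SHIFT) - 1))
enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT,
       IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER };
#endif

/* Set the calling thread (who == 0) to the best-effort class with
   the given priority level.  Returns 0, or -1 with errno set. */
int
set_best_effort_ioprio(int level)
{
    int ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, level);

    return (int) syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio);
}

/* Return the ioprio value of the calling thread, or -1 on error. */
int
get_own_ioprio(void)
{
    return (int) syscall(SYS_ioprio_get, IOPRIO_WHO_PROCESS, 0);
}
```

After a successful set_best_effort_ioprio(4), applying IOPRIO_PRIO_CLASS() and IOPRIO_PRIO_DATA() to the result of get_own_ioprio() should yield IOPRIO_CLASS_BE and 4 respectively.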

I/O priorities are supported for reads and for synchronous (O_DIRECT, O_SYNC) writes. I/O priorities are not supported for asynchronous writes because they are issued outside the context of the program dirtying the memory, and thus program-specific priorities do not apply.

RETURN VALUE

On success, ioprio_get() returns the ioprio value of the process with highest I/O priority of any of the processes that match the criteria specified in which and who. On error, -1 is returned, and errno is set to indicate the error.

On success, ioprio_set() returns 0. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

EINVAL
Invalid value for which or ioprio. Refer to the NOTES section for available scheduler classes and priority levels for ioprio.

EPERM
The calling process does not have the privilege needed to assign this ioprio to the specified process(es). See the NOTES section for more information on required privileges for ioprio_set().

ESRCH
No process(es) could be found that matched the specification in which and who.

STANDARDS

Linux.

HISTORY

Linux 2.6.13.

NOTES

Two or more processes or threads can share an I/O context. This will be the case when clone(2) was called with the CLONE_IO flag. However, by default, the distinct threads of a process will not share the same I/O context. This means that if you want to change the I/O priority of all threads in a process, you may need to call ioprio_set() on each of the threads. The thread ID that you would need for this operation is the one that is returned by gettid(2) or clone(2).

These system calls have an effect only when used in conjunction with an I/O scheduler that supports I/O priorities. As at kernel 2.6.17 the only such scheduler is the Completely Fair Queuing (CFQ) I/O scheduler.

If no I/O priority has been set for a thread, then by default the I/O priority will follow the CPU nice value (setpriority(2)). Before Linux 2.6.24, once an I/O priority had been set using ioprio_set(), there was no way to reset the I/O scheduling behavior to the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to reset to the default I/O scheduling behavior.

Selecting an I/O scheduler

I/O schedulers are selected on a per-device basis via the special file /sys/block/device/queue/scheduler.

One can view the current I/O scheduler via the /sys filesystem. For example, the following command displays a list of all schedulers currently loaded in the kernel:

$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]

The scheduler surrounded by brackets is the one actually in use for the device (sda in the example). Setting another scheduler is done by writing the name of the new scheduler to this file. For example, the following command will set the scheduler for the sda device to cfq:

$ su
Password:
# echo cfq > /sys/block/sda/queue/scheduler

The Completely Fair Queuing (CFQ) I/O scheduler

Since version 3 (also known as CFQ Time Sliced), CFQ implements I/O nice levels similar to those of CPU scheduling. These nice levels are grouped into three scheduling classes, each one containing one or more priority levels:

IOPRIO_CLASS_RT (1)
This is the real-time I/O class. This scheduling class is given higher priority than any other class: processes from this class are given first access to the disk every time. Thus, this I/O class needs to be used with some care: one I/O real-time process can starve the entire system. Within the real-time class, there are 8 levels of class data (priority) that determine exactly how much time this process needs the disk for on each service. The highest real-time priority level is 0; the lowest is 7. In the future, this might change to be more directly mappable to performance, by passing in a desired data rate instead.

IOPRIO_CLASS_BE (2)
This is the best-effort scheduling class, which is the default for any process that hasn’t set a specific I/O priority. The class data (priority) determines how much I/O bandwidth the process will get. Best-effort priority levels are analogous to CPU nice values (see getpriority(2)). The priority level determines a priority relative to other processes in the best-effort scheduling class. Priority levels range from 0 (highest) to 7 (lowest).

IOPRIO_CLASS_IDLE (3)
This is the idle scheduling class. Processes running at this level get I/O time only when no one else needs the disk. The idle class has no class data. Attention is required when assigning this priority class to a process, since it may become starved if higher priority processes are constantly accessing the disk.

Refer to the kernel source file Documentation/block/ioprio.txt for more information on the CFQ I/O Scheduler and an example program.

Required permissions to set I/O priorities

Permission to change a process’s priority is granted or denied based on two criteria:

Process ownership
An unprivileged process may set the I/O priority only for a process whose real UID matches the real or effective UID of the calling process. A process which has the CAP_SYS_NICE capability can change the priority of any process.

What is the desired priority
Attempts to set very high priorities (IOPRIO_CLASS_RT) require the CAP_SYS_ADMIN capability. Up to Linux 2.6.24, CAP_SYS_ADMIN was also required to set a very low priority (IOPRIO_CLASS_IDLE); since Linux 2.6.25, it is no longer required for that class.

A call to ioprio_set() must follow both rules, or the call will fail with the error EPERM.

BUGS

glibc does not yet provide a suitable header file defining the function prototypes and macros described on this page. Suitable definitions can be found in linux/ioprio.h.

12 - Linux cli command setfsuid32

NAME 🖥️ setfsuid32 🖥️

set user identity used for filesystem checks

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/fsuid.h>
[[deprecated]] int setfsuid(uid_t fsuid);

DESCRIPTION

On Linux, a process has both a filesystem user ID and an effective user ID. The (Linux-specific) filesystem user ID is used for permissions checking when accessing filesystem objects, while the effective user ID is used for various other kinds of permissions checks (see credentials(7)).

Normally, the value of the process’s filesystem user ID is the same as the value of its effective user ID. This is because whenever a process’s effective user ID is changed, the kernel also changes the filesystem user ID to be the same as the new value of the effective user ID. A process can cause the value of its filesystem user ID to diverge from its effective user ID by using setfsuid() to change its filesystem user ID to the value given in fsuid.

Explicit calls to setfsuid() and setfsgid(2) are (were) usually used only by programs such as the Linux NFS server that need to change what user and group ID is used for file access without a corresponding change in the real and effective user and group IDs. A change in the normal user IDs for a program such as the NFS server is (was) a security hole that can expose it to unwanted signals. (However, this issue is historical; see below.)

setfsuid() will succeed only if the caller is the superuser or if fsuid matches either the caller’s real user ID, effective user ID, saved set-user-ID, or current filesystem user ID.

RETURN VALUE

On both success and failure, this call returns the previous filesystem user ID of the caller.

STANDARDS

Linux.

HISTORY

Linux 1.2.

At the time when this system call was introduced, one process could send a signal to another process with the same effective user ID. This meant that if a privileged process changed its effective user ID for the purpose of file permission checking, then it could become vulnerable to receiving signals sent by another (unprivileged) process with the same user ID. The filesystem user ID attribute was thus added to allow a process to change its user ID for the purposes of file permission checking without at the same time becoming vulnerable to receiving unwanted signals. Since Linux 2.0, signal permission handling is different (see kill(2)), with the result that a process can change its effective user ID without being vulnerable to receiving signals from unwanted processes. Thus, setfsuid() is nowadays unneeded and should be avoided in new applications (likewise for setfsgid(2)).

The original Linux setfsuid() system call supported only 16-bit user IDs. Subsequently, Linux 2.4 added setfsuid32() supporting 32-bit IDs. The glibc setfsuid() wrapper function transparently deals with the variation across kernel versions.

C library/kernel differences

In glibc 2.15 and earlier, when the wrapper for this system call determines that the argument can’t be passed to the kernel without integer truncation (because the kernel is old and does not support 32-bit user IDs), it will return -1 and set errno to EINVAL without attempting the system call.

BUGS

No error indications of any kind are returned to the caller, and the fact that both successful and unsuccessful calls return the same value makes it impossible to directly determine whether the call succeeded or failed. Instead, the caller must resort to looking at the return value from a further call such as setfsuid(-1) (which will always fail), in order to determine if a preceding call to setfsuid() changed the filesystem user ID. At the very least, EPERM should be returned when the call fails (because the caller lacks the CAP_SETUID capability).

13 - Linux cli command fchown

NAME 🖥️ fchown 🖥️

change ownership of a file

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <unistd.h>
int chown(const char *pathname, uid_t owner, gid_t group);
int fchown(int fd, uid_t owner, gid_t group);
int lchown(const char *pathname, uid_t owner, gid_t group);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int fchownat(int dirfd, const char *pathname,
 uid_t owner, gid_t group, int flags);

Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

fchown(), lchown():

    /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
        || _XOPEN_SOURCE >= 500
        || /* glibc <= 2.19: */ _BSD_SOURCE

fchownat():

    Since glibc 2.10:
        _POSIX_C_SOURCE >= 200809L
    Before glibc 2.10:
        _ATFILE_SOURCE

DESCRIPTION

These system calls change the owner and group of a file. The chown(), fchown(), and lchown() system calls differ only in how the file is specified:

chown() changes the ownership of the file specified by pathname, which is dereferenced if it is a symbolic link.
fchown() changes the ownership of the file referred to by the open file descriptor fd.
lchown() is like chown(), but does not dereference symbolic links.

Only a privileged process (Linux: one with the CAP_CHOWN capability) may change the owner of a file. The owner of a file may change the group of the file to any group of which that owner is a member. A privileged process (Linux: with CAP_CHOWN) may change the group arbitrarily.

If the owner or group is specified as -1, then that ID is not changed.

When the owner or group of an executable file is changed by an unprivileged user, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel version, and since Linux 2.2.13, root is treated like other users. In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown().

When the owner or group of an executable file is changed (by any user), all capability sets for the file are cleared.

fchownat()

The fchownat() system call operates in exactly the same way as chown(), except for the differences described here.

If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chown() for a relative pathname).

If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chown()).

If pathname is absolute, then dirfd is ignored.

The flags argument is a bit mask created by ORing together 0 or more of the following values:

AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.

AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead operate on the link itself, like lchown(). (By default, fchownat() dereferences symbolic links, like chown().)

See openat(2) for an explanation of the need for fchownat().

RETURN VALUE

On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

Depending on the filesystem, errors other than those listed below can be returned.

The more general errors for chown() are listed below.

EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)

EBADF
(fchown()) fd is not a valid open file descriptor.

EBADF
(fchownat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.

EFAULT
pathname points outside your accessible address space.

EINVAL
(fchownat()) Invalid flag specified in flags.

EIO
(fchown()) A low-level I/O error occurred while modifying the inode.

ELOOP
Too many symbolic links were encountered in resolving pathname.

ENAMETOOLONG
pathname is too long.

ENOENT
The file does not exist.

ENOMEM
Insufficient kernel memory was available.

ENOTDIR
A component of the path prefix is not a directory.

ENOTDIR
(fchownat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.

EPERM
The calling process did not have the required permissions (see above) to change owner and/or group.

EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)

EROFS
The named file resides on a read-only filesystem.

VERSIONS

The 4.4BSD version can be used only by the superuser (that is, ordinary users cannot give away files).

STANDARDS

POSIX.1-2008.

HISTORY

chown()
fchown()
lchown()
4.4BSD, SVr4, POSIX.1-2001.

fchownat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.

NOTES

Ownership of new files

When a new file is created (by, for example, open(2) or mkdir(2)), its owner is made the same as the filesystem user ID of the creating process. The group of the file depends on a range of factors, including the type of filesystem, the options used to mount the filesystem, and whether or not the set-group-ID mode bit is enabled on the parent directory. If the filesystem supports the -o grpid (or, synonymously -o bsdgroups) and -o nogrpid (or, synonymously -o sysvgroups) mount(8) options, then the rules are as follows:

If the filesystem is mounted with -o grpid, then the group of a new file is made the same as that of the parent directory.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is disabled on the parent directory, then the group of a new file is made the same as the process’s filesystem GID.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is enabled on the parent directory, then the group of a new file is made the same as that of the parent directory.

As at Linux 4.12, the -o grpid and -o nogrpid mount options are supported by ext2, ext3, ext4, and XFS. Filesystems that don’t support these mount options follow the -o nogrpid rules.

glibc notes

On older kernels where fchownat() is unavailable, the glibc wrapper function falls back to the use of chown() and lchown(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.

NFS

The chown() semantics are deliberately violated on NFS filesystems which have UID mapping enabled. Additionally, the semantics of all system calls which access the file contents are violated, because chown() may cause immediate access revocation on already open files. Client-side caching may lead to a delay between the time when ownership has been changed to allow access for a user and the time when the file can actually be accessed by that user on other clients.

Historical details

The original Linux chown(), fchown(), and lchown() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added chown32(), fchown32(), and lchown32(), supporting 32-bit IDs. The glibc chown(), fchown(), and lchown() wrapper functions transparently deal with the variations across kernel versions.

Before Linux 2.1.81 (except 2.1.46), chown() did not follow symbolic links. Since Linux 2.1.81, chown() does follow symbolic links, and there is a new system call lchown() that does not follow symbolic links. Since Linux 2.1.86, this new call (which has the same semantics as the old chown()) uses the old chown() syscall number, and chown() uses the newly introduced number.

EXAMPLES

The following program changes the ownership of the file named in its second command-line argument to the value specified in its first command-line argument. The new owner can be specified either as a numeric user ID, or as a username (which is converted to a user ID by using getpwnam(3) to perform a lookup in the system password file).

Program source

#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    char           *endptr;
    uid_t          uid;
    struct passwd  *pwd;

    if (argc != 3 || argv[1][0] == '\0') {
        fprintf(stderr, "%s <owner> <file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    uid = strtol(argv[1], &endptr, 10);  /* Allow a numeric string */

    if (*endptr != '\0') {         /* Was not pure numeric string */
        pwd = getpwnam(argv[1]);   /* Try getting UID for username */
        if (pwd == NULL) {
            perror("getpwnam");
            exit(EXIT_FAILURE);
        }

        uid = pwd->pw_uid;
    }

    if (chown(argv[2], uid, -1) == -1) {
        perror("chown");
        exit(EXIT_FAILURE);
    }

    exit(EXIT_SUCCESS);
}

14 - Linux cli command break

NAME 🖥️ break 🖥️

unimplemented system calls

SYNOPSIS

Unimplemented system calls.

DESCRIPTION

These system calls are not implemented in the Linux kernel.

RETURN VALUE

These system calls always return -1 and set errno to ENOSYS.

NOTES

Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.

Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.

Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.

15 - Linux cli command bdflush

NAME 🖥️ bdflush 🖥️

start, flush, or tune buffer-dirty-flush daemon

SYNOPSIS

#include <sys/kdaemon.h>
[[deprecated]] int bdflush(int func, long *address);
[[deprecated]] int bdflush(int func, long data);

DESCRIPTION

Note: Since Linux 2.6, this system call is deprecated and does nothing. It is likely to disappear altogether in a future kernel release. Nowadays, the task performed by bdflush() is handled by the kernel pdflush thread.

bdflush() starts, flushes, or tunes the buffer-dirty-flush daemon. Only a privileged process (one with the CAP_SYS_ADMIN capability) may call bdflush().

If func is negative or 0, and no daemon has been started, then bdflush() enters the daemon code and never returns.

If func is 1, some dirty buffers are written to disk.

If func is 2 or more and is even (low bit is 0), then address is the address of a long word, and the tuning parameter numbered (func-2)/2 is returned to the caller in that address.

If func is 3 or more and is odd (low bit is 1), then data is a long word, and the kernel sets tuning parameter numbered (func-3)/2 to that value.

The set of parameters, their values, and their valid ranges are defined in the Linux kernel source file fs/buffer.c.

RETURN VALUE

If func is negative or 0 and the daemon successfully starts, bdflush() never returns. Otherwise, the return value is 0 on success and -1 on failure, with errno set to indicate the error.

ERRORS

EBUSY
An attempt was made to enter the daemon code after another process has already entered.

EFAULT
address points outside your accessible address space.

EINVAL
An attempt was made to read or write an invalid parameter number, or to write an invalid value to a parameter.

EPERM
Caller does not have the CAP_SYS_ADMIN capability.

STANDARDS

Linux.

HISTORY

Since glibc 2.23, glibc no longer supports this obsolete system call.

16 - Linux cli command sysfs

NAME 🖥️ sysfs 🖥️

get filesystem type information

SYNOPSIS

[[deprecated]] int sysfs(int option, const char *fsname);
[[deprecated]] int sysfs(int option, unsigned int fs_index, char *buf);
[[deprecated]] int sysfs(int option);

DESCRIPTION

Note: if you are looking for information about the sysfs filesystem that is normally mounted at /sys, see sysfs(5).

The (obsolete) sysfs() system call returns information about the filesystem types currently present in the kernel. The specific form of the sysfs() call and the information returned depend on the option in effect:

1
Translate the filesystem identifier string fsname into a filesystem type index.

2
Translate the filesystem type index fs_index into a null-terminated filesystem identifier string. This string will be written to the buffer pointed to by buf. Make sure that buf has enough space to accept the string.

3
Return the total number of filesystem types currently present in the kernel.

The numbering of the filesystem type indexes begins with zero.

RETURN VALUE

On success, sysfs() returns the filesystem index for option 1, zero for option 2, and the number of currently configured filesystems for option 3. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

EFAULT
Either fsname or buf is outside your accessible address space.

EINVAL
fsname is not a valid filesystem type identifier; fs_index is out-of-bounds; option is invalid.

STANDARDS

None.

HISTORY

SVr4.

This System-V derived system call is obsolete; don’t use it. On systems with /proc, the same information can be obtained via /proc; use that interface instead.

BUGS

There is no libc or glibc support. There is no way to guess how large buf should be.

17 - Linux cli command getgid

NAME 🖥️ getgid 🖥️

get group identity

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <unistd.h>
gid_t getgid(void);
gid_t getegid(void);

DESCRIPTION

getgid() returns the real group ID of the calling process.

getegid() returns the effective group ID of the calling process.

ERRORS

These functions are always successful and never modify errno.

VERSIONS

On Alpha, instead of a pair of getgid() and getegid() system calls, a single getxgid() system call is provided, which returns a pair of real and effective GIDs. The glibc getgid() and getegid() wrapper functions transparently deal with this. See syscall(2) for details regarding register mapping.

STANDARDS

POSIX.1-2008.

HISTORY

POSIX.1-2001, 4.3BSD.

The original Linux getgid() and getegid() system calls supported only 16-bit group IDs. Subsequently, Linux 2.4 added getgid32() and getegid32(), supporting 32-bit IDs. The glibc getgid() and getegid() wrapper functions transparently deal with the variations across kernel versions.

18 - Linux cli command shutdown

NAME 🖥️ shutdown 🖥️

shut down part of a full-duplex connection

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/socket.h>
int shutdown(int sockfd, int how);

DESCRIPTION

The shutdown() call causes all or part of a full-duplex connection on the socket associated with sockfd to be shut down. If how is SHUT_RD, further receptions will be disallowed. If how is SHUT_WR, further transmissions will be disallowed. If how is SHUT_RDWR, further receptions and transmissions will be disallowed.

RETURN VALUE

On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

EBADF
sockfd is not a valid file descriptor.

EINVAL
An invalid value was specified in how (but see BUGS).

ENOTCONN
The specified socket is not connected.

ENOTSOCK
The file descriptor sockfd does not refer to a socket.

STANDARDS

POSIX.1-2008.

HISTORY

POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).

NOTES

The constants SHUT_RD, SHUT_WR, and SHUT_RDWR have the values 0, 1, and 2, respectively, and are defined in <sys/socket.h> since glibc-2.1.91.

BUGS

Checks for the validity of how are done in domain-specific code, and before Linux 3.7 not all domains performed these checks. Most notably, UNIX domain sockets simply ignored invalid values. This problem was fixed for UNIX domain sockets in Linux 3.7.

19 - Linux cli command vm86old

NAME 🖥️ vm86old 🖥️

enter virtual 8086 mode

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/vm86.h>
int vm86old(struct vm86_struct *info);
int vm86(unsigned long fn, struct vm86plus_struct *v86);

DESCRIPTION

The system call vm86() was introduced in Linux 0.97p2. In Linux 2.1.15 and 2.0.28, it was renamed to vm86old(), and a new vm86() was introduced. The definition of struct vm86_struct was changed in 1.1.8 and 1.1.9.

These calls cause the process to enter VM86 mode (virtual-8086 in Intel literature), and are used by dosemu.

VM86 mode is an emulation of real mode within a protected mode task.

RETURN VALUE

On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

EFAULT
This return value is specific to i386 and indicates a problem with getting user-space data.

ENOSYS
This return value indicates the call is not implemented on the present architecture.

EPERM
Saved kernel stack exists. (This is a kernel sanity check; the saved stack should exist only within vm86 mode itself.)

STANDARDS

Linux on 32-bit Intel processors.

20 - Linux cli command preadv2

NAME 🖥️ preadv2 🖥️

read or write data into multiple buffers

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/uio.h>
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
 off_t offset);
ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
 off_t offset);
ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
 off_t offset, int flags);
ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
 off_t offset, int flags);

Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

preadv(), pwritev():

    Since glibc 2.19:
        _DEFAULT_SOURCE
    glibc 2.19 and earlier:
        _BSD_SOURCE

DESCRIPTION

The readv() system call reads iovcnt buffers from the file associated with the file descriptor fd into the buffers described by iov (“scatter input”).

The writev() system call writes iovcnt buffers of data described by iov to the file associated with the file descriptor fd (“gather output”).

The pointer iov points to an array of iovec structures, described in iovec(3type).

The readv() system call works just like read(2) except that multiple buffers are filled.

The writev() system call works just like write(2) except that multiple buffers are written out.

Buffers are processed in array order. This means that readv() completely fills iov[0] before proceeding to iov[1], and so on. (If there is insufficient data, then not all buffers pointed to by iov may be filled.) Similarly, writev() writes out the entire contents of iov[0] before proceeding to iov[1], and so on.

The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes; analogously, readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description (see open(2)).

preadv() and pwritev()

The preadv() system call combines the functionality of readv() and pread(2). It performs the same task as readv(), but adds a fourth argument, offset, which specifies the file offset at which the input operation is to be performed.

The pwritev() system call combines the functionality of writev() and pwrite(2). It performs the same task as writev(), but adds a fourth argument, offset, which specifies the file offset at which the output operation is to be performed.

The file offset is not changed by these system calls. The file referred to by fd must be capable of seeking.
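A short sketch of positioned vector I/O follows. It assumes a writable /tmp for a scratch file, and the helper name positioned_roundtrip() is made up for the example. It writes two buffers at offset 4, reads them back from the same offset, and reports the file offset observed afterwards.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes preadv()/pwritev() in glibc */
#endif
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "abc" and "def" at byte offset 4 of a scratch
 * file, read them back from the same offset, and report the file offset
 * seen afterwards through *pos_after. Returns bytes read back, or -1.
 */
ssize_t positioned_roundtrip(char out[6], off_t *pos_after)
{
    char path[] = "/tmp/pvec-demoXXXXXX";    /* assumed scratch location */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                            /* file lives until close() */

    struct iovec wv[2] = {
        { .iov_base = "abc", .iov_len = 3 },
        { .iov_base = "def", .iov_len = 3 },
    };
    struct iovec rv[2] = {
        { .iov_base = out,     .iov_len = 3 },
        { .iov_base = out + 3, .iov_len = 3 },
    };

    ssize_t n = -1;
    if (pwritev(fd, wv, 2, 4) == 6)
        n = preadv(fd, rv, 2, 4);

    /* Neither call moves the file offset, so this is expected to be 0. */
    *pos_after = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return n;
}
```

The lseek() call is there only to observe the offset. A caller that mixes these calls with plain read(2) or write(2) on the same descriptor can rely on the offset staying where the plain calls left it.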

preadv2() and pwritev2()

These system calls are similar to preadv() and pwritev() calls, but add a fifth argument, flags, which modifies the behavior on a per-call basis.

Unlike preadv() and pwritev(), if the offset argument is -1, then the current file offset is used and updated.

The flags argument contains a bitwise OR of zero or more of the following flags:

RWF_DSYNC (since Linux 4.7)
Provide a per-write equivalent of the O_DSYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.

RWF_HIPRI (since Linux 4.6)
High priority read/write. Allows block-based filesystems to use polling of the device, which provides lower latency, but may use additional resources. (Currently, this feature is usable only on a file descriptor opened using the O_DIRECT flag.)

RWF_SYNC (since Linux 4.7)
Provide a per-write equivalent of the O_SYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.

RWF_NOWAIT (since Linux 4.14)
Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read. If no bytes were read, it will return -1 and set errno to EAGAIN (but see BUGS). Currently, this flag is meaningful only for preadv2().

RWF_APPEND (since Linux 4.16)
Provide a per-write equivalent of the O_APPEND open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call. The offset argument does not affect the write operation; the data is always appended to the end of the file. However, if the offset argument is -1, the current file offset is updated.
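The offset and flags arguments can be sketched as follows. The helper names write_at_current() and current_offset_demo() are hypothetical, and the scratch file in /tmp is an assumption. The example passes -1 as the offset, so the current file offset is used and advanced, and passes flags through unchanged.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes preadv2()/pwritev2() in glibc */
#endif
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write two strings with one pwritev2() call at the
 * current file offset (offset == -1), which the call then advances.
 * flags is passed through unchanged, e.g. 0 or RWF_DSYNC.
 */
ssize_t write_at_current(int fd, const char *a, const char *b, int flags)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)a, .iov_len = strlen(a) },
        { .iov_base = (void *)b, .iov_len = strlen(b) },
    };
    return pwritev2(fd, iov, 2, -1, flags);
}

/* Runs the helper twice on a scratch file; returns the final file offset. */
off_t current_offset_demo(void)
{
    char path[] = "/tmp/pvec2-demoXXXXXX";   /* assumed scratch location */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);

    off_t pos = -1;
    if (write_at_current(fd, "key=", "value\n", 0) == 10 &&
        write_at_current(fd, "k2=", "v2\n", 0) == 6)
        pos = lseek(fd, 0, SEEK_CUR);        /* 10 + 6 bytes written */

    close(fd);
    return pos;
}
```

Passing RWF_DSYNC instead of 0 would request O_DSYNC-like behavior for that one write only. On a kernel that does not recognize a flag, the call is expected to fail as described under ERRORS.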

RETURN VALUE

On success, readv(), preadv(), and preadv2() return the number of bytes read; writev(), pwritev(), and pwritev2() return the number of bytes written.

Note that it is not an error for a successful call to transfer fewer bytes than requested (see read(2) and write(2)).

On error, -1 is returned, and errno is set to indicate the error.

ERRORS

The errors are as given for read(2) and write(2). Furthermore, preadv(), preadv2(), pwritev(), and pwritev2() can also fail for the same reasons as lseek(2). Additionally, the following errors are defined:

EINVAL
The sum of the iov_len values overflows an ssize_t value.

EINVAL
The vector count, iovcnt, is less than zero or greater than the permitted maximum.

EOPNOTSUPP
An unknown flag is specified in flags.

VERSIONS

C library/kernel differences

The raw preadv() and pwritev() system calls have call signatures that differ slightly from that of the corresponding GNU C library wrapper functions shown in the SYNOPSIS. The final argument, offset, is unpacked by the wrapper functions into two arguments in the system calls:

    unsigned long pos_l, unsigned long pos

These arguments contain, respectively, the low order and high order 32 bits of offset.
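The split into low-order and high-order halves is plain bit arithmetic. The sketch below illustrates only that arithmetic; the function names are made up, and it is not the code glibc actually uses in its wrappers.

```c
#include <stdint.h>

/* Low-order 32 bits of a 64-bit offset. */
uint32_t offset_low32(uint64_t offset)
{
    return (uint32_t)(offset & 0xffffffffu);
}

/* High-order 32 bits of a 64-bit offset. */
uint32_t offset_high32(uint64_t offset)
{
    return (uint32_t)(offset >> 32);
}

/* Recombine the two halves into the original 64-bit offset. */
uint64_t offset_join(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}
```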

STANDARDS

readv()
writev()
POSIX.1-2008.

preadv()
pwritev()
BSD.

preadv2()
pwritev2()
Linux.

HISTORY

readv()
writev()
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).

preadv(), pwritev(): Linux 2.6.30, glibc 2.10.

preadv2(), pwritev2(): Linux 4.6, glibc 2.26.

Historical C library/kernel differences

To deal with the fact that IOV_MAX was so low on early versions of Linux, the glibc wrapper functions for readv() and writev() did some extra work if they detected that the underlying kernel system call failed because this limit was exceeded. In the case of readv(), the wrapper function allocated a temporary buffer large enough for all of the items specified by iov, passed that buffer in a call to read(2), copied data from the buffer to the locations specified by the iov_base fields of the elements of iov, and then freed the buffer. The wrapper function for writev() performed the analogous task using a temporary buffer and a call to write(2).

The need for this extra effort in the glibc wrapper functions went away with Linux 2.2 and later. However, glibc continued to provide this behavior until glibc 2.10. Starting with glibc 2.9, the wrapper functions provide this behavior only if the library detects that the system is running a Linux kernel older than Linux 2.6.18 (an arbitrarily selected kernel version). And since glibc 2.20 (which requires a minimum of Linux 2.6.32), the glibc wrapper functions always just directly invoke the system calls.

NOTES

POSIX.1 allows an implementation to place a limit on the number of items that can be passed in iov. An implementation can advertise its limit by defining IOV_MAX in <limits.h> or at run time via the return value from sysconf(_SC_IOV_MAX). On modern Linux systems, the limit is 1024. Back in Linux 2.0 days, this limit was 16.
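The run-time query mentioned above can be sketched like this. The helper calls_needed() is a hypothetical addition that shows one way a caller might batch a long iovec array so that no single call exceeds the limit.

```c
#include <unistd.h>

/*
 * Returns the run-time limit on iovec entries per call, or -1 if
 * sysconf() reports no determinate limit or an error.
 */
long iov_limit(void)
{
    return sysconf(_SC_IOV_MAX);
}

/*
 * Hypothetical helper: how many readv()/writev() calls are needed to
 * submit `count` buffers without exceeding `limit` entries per call.
 */
long calls_needed(long count, long limit)
{
    return (count + limit - 1) / limit;      /* ceiling division */
}
```

For example, with a limit of 1024, an array of 2500 buffers would be submitted in three calls.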

BUGS

Linux 5.9 and Linux 5.10 have a bug where preadv2() with the RWF_NOWAIT flag may return 0 even when not at end of file.

EXAMPLES

The following code sample demonstrates the use of writev():

char          *str0 = "hello ";
char          *str1 = "world\n";
ssize_t       nwritten;
struct iovec  iov[2];

iov[0].iov_base = str0;
iov[0].iov_len = strlen(str0);
iov[1].iov_base = str1;
iov[1].iov_len = strlen(str1);

nwritten = writev(STDOUT_FILENO, iov, 2);

21 - Linux cli command setdomainname

NAME 🖥️ setdomainname 🖥️

get/set NIS domain name

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <unistd.h>
int getdomainname(char *name, size_t len);
int setdomainname(const char *name, size_t len);

Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

getdomainname(), setdomainname():

    Since glibc 2.21:
        _DEFAULT_SOURCE
    In glibc 2.19 and 2.20:
        _DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
    Up to and including glibc 2.19:
        _BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)

DESCRIPTION

These functions are used to access or to change the NIS domain name of the host system. More precisely, they operate on the NIS domain name associated with the calling process’s UTS namespace.

setdomainname() sets the domain name to the value given in the character array name. The len argument specifies the number of bytes in name. (Thus, name does not require a terminating null byte.)

getdomainname() returns the null-terminated domain name in the character array name, which has a length of len bytes. If the null-terminated domain name requires more than len bytes, getdomainname() returns the first len bytes (glibc) or gives an error (libc).
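Because the result may be truncated to len bytes, a caller may want to force null termination itself. The following is a minimal sketch of that idea; the helper name read_nis_domain() is an assumption made for the example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes getdomainname() in glibc */
#endif
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the NIS domain name into buf and guarantee
 * null termination even if the C library truncated the name.
 * Returns 0 on success, -1 on error with errno set.
 */
int read_nis_domain(char *buf, size_t len)
{
    if (len == 0) {
        errno = EINVAL;
        return -1;
    }
    if (getdomainname(buf, len) == -1)
        return -1;
    buf[len - 1] = '\0';
    return 0;
}
```

A 64-byte buffer is enough on current kernels, since the limit given under HISTORY already includes the terminating null byte.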

RETURN VALUE

On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

setdomainname() can fail with the following errors:

EFAULT
name pointed outside of user address space.

EINVAL
len was negative or too large.

EPERM
The caller did not have the CAP_SYS_ADMIN capability in the user namespace associated with its UTS namespace (see namespaces(7)).

getdomainname() can fail with the following errors:

EINVAL
For getdomainname() under libc: name is NULL or name is longer than len bytes.

VERSIONS

On most Linux architectures (including x86), there is no getdomainname() system call; instead, glibc implements getdomainname() as a library function that returns a copy of the domainname field returned from a call to uname(2).

STANDARDS

None.

HISTORY

Since Linux 1.0, the limit on the length of a domain name, including the terminating null byte, is 64 bytes. In older kernels, it was 8 bytes.

22 - Linux cli command inotify_init1

NAME 🖥️ inotify_init1 🖥️

initialize an inotify instance

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/inotify.h>
int inotify_init(void);
int inotify_init1(int flags);

DESCRIPTION

For an overview of the inotify API, see inotify(7).

inotify_init() initializes a new inotify instance and returns a file descriptor associated with a new inotify event queue.

If flags is 0, then inotify_init1() is the same as inotify_init(). The following values can be bitwise ORed in flags to obtain different behavior:

IN_NONBLOCK
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.

IN_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
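The two flags can be combined in a single call. The sketch below uses hypothetical helper names: it creates an instance with both flags set, then uses fcntl(2) only to confirm the result.

```c
#include <fcntl.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Create an inotify instance that is non-blocking and close-on-exec in a
 * single call. Returns the new descriptor, or -1 with errno set.
 */
int open_inotify(void)
{
    return inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
}

/* Returns 1 if fd has both O_NONBLOCK and FD_CLOEXEC, 0 if not, -1 on error. */
int has_both_flags(int fd)
{
    int status = fcntl(fd, F_GETFL);
    int fdflags = fcntl(fd, F_GETFD);
    if (status == -1 || fdflags == -1)
        return -1;
    return (status & O_NONBLOCK) && (fdflags & FD_CLOEXEC);
}
```

Setting the flags at creation time also avoids the window in which another thread could fork and exec before a separate fcntl(2) call had marked the descriptor close-on-exec.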

RETURN VALUE

On success, these system calls return a new file descriptor. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

EINVAL
(inotify_init1()) An invalid value was specified in flags.

EMFILE
The user limit on the total number of inotify instances has been reached.

EMFILE
The per-process limit on the number of open file descriptors has been reached.

ENFILE
The system-wide limit on the total number of open files has been reached.

ENOMEM
Insufficient kernel memory is available.

STANDARDS

Linux.

HISTORY

inotify_init()
Linux 2.6.13, glibc 2.4.

inotify_init1()
Linux 2.6.27, glibc 2.9.

23 - Linux cli command ioctl

NAME 🖥️ ioctl 🖥️

control device

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/ioctl.h>
int ioctl(int fd, unsigned long op, ...); /* glibc, BSD */
int ioctl(int fd, int op, ...); /* musl, other UNIX */

DESCRIPTION

The ioctl() system call manipulates the underlying device parameters of special files. In particular, many operating characteristics of character special files (e.g., terminals) may be controlled with ioctl() operations. The argument fd must be an open file descriptor.

The second argument is a device-dependent operation code. The third argument is an untyped pointer to memory. It’s traditionally char *argp (from the days before void * was valid C), and will be so named for this discussion.

An ioctl() op has encoded in it whether the argument is an in parameter or out parameter, and the size of the argument argp in bytes. Macros and defines used in specifying an ioctl() op are located in the file <sys/ioctl.h>. See NOTES.
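As a small illustration of an operation whose third argument is an out parameter, the sketch below uses FIONREAD. That operation is not described on this page and is chosen here as an assumption; the helper name bytes_pending() is also made up for the example.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel how many bytes are waiting to be
 * read on fd. FIONREAD writes an int through the third argument.
 * Returns the byte count, or -1 with errno set.
 */
int bytes_pending(int fd)
{
    int pending = 0;
    if (ioctl(fd, FIONREAD, &pending) == -1)
        return -1;
    return pending;
}
```

Called on the read end of a pipe, this should report 0 before anything is written and 5 after a 5-byte write to the other end.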

RETURN VALUE

Usually, on success zero is returned. A few ioctl() operations use the return value as an output parameter and return a nonnegative value on success. On error, -1 is returned, and errno is set to indicate the error.

ERRORS

EBADF
fd is not a valid file descriptor.

EFAULT
argp references an inaccessible memory area.

EINVAL
op or argp is not valid.

ENOTTY
fd is not associated with a character special device.

ENOTTY
The specified operation does not apply to the kind of object that the file descriptor fd references.
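The ENOTTY and EBADF cases can be observed with a terminal-only operation. The sketch below uses TIOCGWINSZ, which is an assumption brought in for the example, as is the helper name terminal_size().

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: fetch the terminal window size for fd.
 * Returns 0 and fills *rows and *cols on success; returns -1 with errno
 * set (ENOTTY when fd is not a terminal, EBADF when fd is not open).
 */
int terminal_size(int fd, unsigned short *rows, unsigned short *cols)
{
    struct winsize ws;
    if (ioctl(fd, TIOCGWINSZ, &ws) == -1)
        return -1;
    *rows = ws.ws_row;
    *cols = ws.ws_col;
    return 0;
}
```

Calling it on a pipe is expected to fail with ENOTTY, since the operation does not apply to that kind of object.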

VERSIONS

Arguments, returns, and semantics of ioctl() vary according to the device driver in question (the call is used as a catch-all for operations that don’t cleanly fit the UNIX stream I/O model).

STANDARDS

None.

HISTORY

Version 7 AT&T UNIX has

ioctl(int fildes, int op, struct sgttyb *argp);

(where struct sgttyb has historically been used by stty(2) and gtty(2), and is polymorphic by operation type (like a void * would be, if it had been available)).

SysIII documents arg without a type at all.

4.3BSD has

ioctl(int d, unsigned long op, char *argp);

(with char * similarly standing in for void *).

SysVr4 has

int ioctl(int fildes, int op, ... /* arg */);

NOTES

In order to use this call, one needs an open file descriptor. Often the open(2) call has unwanted side effects, which can be avoided under Linux by giving it the O_NONBLOCK flag.

ioctl structure

Ioctl op values are 32-bit constants. In principle these constants are completely arbitrary, but people have tried to build some structure into them.

The old Linux situation was that of mostly 16-bit constants, where the last byte is a serial number, and the preceding byte(s) give a type indicating the driver. Sometimes the major number was used: 0x03 for the HDIO_* ioctls, 0x06 for the LP* ioctls. And sometimes one or more ASCII letters were used. For example, TCGETS has value 0x00005401, with 0x54 = ‘T’ indicating the terminal driver, and CYGETTIMEOUT has value 0x00435906, with 0x43 0x59 = ‘C’ ‘Y’ indicating the cyclades driver.

Later (0.98p5) some more information was built into the number. One has 2 direction bits (00: none, 01: write, 10: read, 11: read/write) followed by 14 size bits (giving the size of the argument), followed by an 8-bit type (collecting the ioctls in groups for a common purpose or a common driver), and an 8-bit serial number.

The macros describing this structure live in <asm/ioctl.h> and are _IO(type,nr) and {_IOR,_IOW,_IOWR}(type,nr,size). They use sizeof(size) so that size is a misnomer here: this third argument is a data type.

Note that the size bits are very unreliable: in lots of cases they are wrong, either because of buggy macros using sizeof(sizeof(struct)), or because of legacy values.

Thus, it seems that the new structure only gave disadvantages: it does not help in checking, but it causes varying values for the various architectures.
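The field layout described above can be inspected with the kernel's _IOC_* decoding macros. In the sketch below, the operation DEMO_GET, the type struct demo_arg, and the helper decode_ioctl() are all hypothetical: they exist only to show encoding and decoding, and nothing is passed to ioctl().

```c
#include <linux/ioctl.h>   /* _IO*, _IOC_* macros (via <asm/ioctl.h>) */

struct demo_arg { int value; };              /* hypothetical argument type */

/* Hypothetical op: the kernel writes a struct demo_arg back to the caller. */
#define DEMO_GET _IOR('x', 7, struct demo_arg)

struct ioctl_fields {
    unsigned dir;    /* _IOC_NONE, _IOC_READ, _IOC_WRITE, or both ORed */
    unsigned size;   /* size of the argument in bytes */
    unsigned type;   /* 8-bit group or driver code */
    unsigned nr;     /* 8-bit serial number */
};

/* Split an op value into its four encoded fields. */
struct ioctl_fields decode_ioctl(unsigned long op)
{
    struct ioctl_fields f = {
        .dir  = _IOC_DIR(op),
        .size = _IOC_SIZE(op),
        .type = _IOC_TYPE(op),
        .nr   = _IOC_NR(op),
    };
    return f;
}
```

Using the macros rather than hand-written shifts keeps the sketch independent of the field widths and direction values, which vary between architectures.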

24 - Linux cli command setrlimit

NAME 🖥️ setrlimit 🖥️

get/set resource limits

LIBRARY

Standard C library (libc, -lc)

SYNOPSIS

#include <sys/resource.h>
int getrlimit(int resource, struct rlimit *rlim);
int setrlimit(int resource, const struct rlimit *rlim);
int prlimit(pid_t pid, int resource,
 const struct rlimit *_Nullable new_limit,
 struct rlimit *_Nullable old_limit);

Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

prlimit():

    _GNU_SOURCE

DESCRIPTION

The getrlimit() and setrlimit() system calls get and set resource limits. Each resource has an associated soft and hard limit, as defined by the rlimit structure:

struct rlimit {
    rlim_t rlim_cur;  /* Soft limit */
    rlim_t rlim_max;  /* Hard limit (ceiling for rlim_cur) */
};