man2
- 1: Linux cli command ptrace
- 2: Linux cli command fcntl
- 3: Linux cli command prlimit
- 4: Linux cli command utime
- 5: Linux cli command getdents64
- 6: Linux cli command sync_file_range2
- 7: Linux cli command arm_fadvise
- 8: Linux cli command mlock2
- 9: Linux cli command tuxcall
- 10: Linux cli command sysinfo
- 11: Linux cli command ioprio_set
- 12: Linux cli command setfsuid32
- 13: Linux cli command fchown
- 14: Linux cli command break
- 15: Linux cli command bdflush
- 16: Linux cli command sysfs
- 17: Linux cli command getgid
- 18: Linux cli command shutdown
- 19: Linux cli command vm86old
- 20: Linux cli command preadv2
- 21: Linux cli command setdomainname
- 22: Linux cli command inotify_init1
- 23: Linux cli command ioctl
- 24: Linux cli command setrlimit
- 25: Linux cli command sendfile
- 26: Linux cli command mq_unlink
- 27: Linux cli command fanotify_mark
- 28: Linux cli command removexattr
- 29: Linux cli command munlock
- 30: Linux cli command alloc_hugepages
- 31: Linux cli command flock
- 32: Linux cli command lchown32
- 33: Linux cli command close_range
- 34: Linux cli command sigreturn
- 35: Linux cli command epoll_wait
- 36: Linux cli command getpid
- 37: Linux cli command finit_module
- 38: Linux cli command getrandom
- 39: Linux cli command creat
- 40: Linux cli command perf_event_open
- 41: Linux cli command stat
- 42: Linux cli command semop
- 43: Linux cli command pread
- 44: Linux cli command syscalls
- 45: Linux cli command umount2
- 46: Linux cli command timer_gettime
- 47: Linux cli command shmctl
- 48: Linux cli command llseek
- 49: Linux cli command security
- 50: Linux cli command eventfd
- 51: Linux cli command s390_pci_mmio_read
- 52: Linux cli command personality
- 53: Linux cli command getresuid
- 54: Linux cli command pidfd_open
- 55: Linux cli command exit
- 56: Linux cli command chdir
- 57: Linux cli command migrate_pages
- 58: Linux cli command mq_getsetattr
- 59: Linux cli command pwritev
- 60: Linux cli command select
- 61: Linux cli command membarrier
- 62: Linux cli command pciconfig_write
- 63: Linux cli command sendmmsg
- 64: Linux cli command capget
- 65: Linux cli command getppid
- 66: Linux cli command setpgid
- 67: Linux cli command munmap
- 68: Linux cli command fdetach
- 69: Linux cli command listen
- 70: Linux cli command lsetxattr
- 71: Linux cli command msgctl
- 72: Linux cli command setfsuid
- 73: Linux cli command delete_module
- 74: Linux cli command getpmsg
- 75: Linux cli command renameat2
- 76: Linux cli command oldlstat
- 77: Linux cli command rt_sigaction
- 78: Linux cli command open
- 79: Linux cli command readlink
- 80: Linux cli command fattach
- 81: Linux cli command stat64
- 82: Linux cli command vserver
- 83: Linux cli command request_key
- 84: Linux cli command linkat
- 85: Linux cli command sigpending
- 86: Linux cli command pwrite64
- 87: Linux cli command rmdir
- 88: Linux cli command getsid
- 89: Linux cli command sysctl
- 90: Linux cli command sched_get_priority_max
- 91: Linux cli command getrlimit
- 92: Linux cli command kill
- 93: Linux cli command pciconfig_read
- 94: Linux cli command pciconfig_iobase
- 95: Linux cli command pipe2
- 96: Linux cli command fstatat64
- 97: Linux cli command time
- 98: Linux cli command putpmsg
- 99: Linux cli command faccessat
- 100: Linux cli command unimplemented
- 101: Linux cli command recvmsg
- 102: Linux cli command adjtimex
- 103: Linux cli command process_vm_readv
- 104: Linux cli command sigaction
- 105: Linux cli command execveat
- 106: Linux cli command clock_settime
- 107: Linux cli command settimeofday
- 108: Linux cli command write
- 109: Linux cli command fstatfs64
- 110: Linux cli command idle
- 111: Linux cli command s390_sthyi
- 112: Linux cli command setuid
- 113: Linux cli command process_vm_writev
- 114: Linux cli command munlockall
- 115: Linux cli command io_getevents
- 116: Linux cli command futex
- 117: Linux cli command semtimedop
- 118: Linux cli command ssetmask
- 119: Linux cli command truncate
- 120: Linux cli command mbind
- 121: Linux cli command setreuid32
- 122: Linux cli command semctl
- 123: Linux cli command gettimeofday
- 124: Linux cli command inl
- 125: Linux cli command gettid
- 126: Linux cli command openat2
- 127: Linux cli command sigprocmask
- 128: Linux cli command keyctl
- 129: Linux cli command brk
- 130: Linux cli command mq_timedreceive
- 131: Linux cli command sendfile64
- 132: Linux cli command timer_create
- 133: Linux cli command getresuid32
- 134: Linux cli command truncate64
- 135: Linux cli command getxattr
- 136: Linux cli command recv
- 137: Linux cli command faccessat2
- 138: Linux cli command fchownat
- 139: Linux cli command get_kernel_syms
- 140: Linux cli command mq_notify
- 141: Linux cli command io_destroy
- 142: Linux cli command sched_rr_get_interval
- 143: Linux cli command msync
- 144: Linux cli command inw_p
- 145: Linux cli command sync_file_range
- 146: Linux cli command setregid32
- 147: Linux cli command sethostname
- 148: Linux cli command newfstatat
- 149: Linux cli command pkey_alloc
- 150: Linux cli command restart_syscall
- 151: Linux cli command inw
- 152: Linux cli command statfs64
- 153: Linux cli command modify_ldt
- 154: Linux cli command outl
- 155: Linux cli command outsb
- 156: Linux cli command outl_p
- 157: Linux cli command readdir
- 158: Linux cli command arm_fadvise64_64
- 159: Linux cli command clone3
- 160: Linux cli command inb_p
- 161: Linux cli command rt_sigqueueinfo
- 162: Linux cli command landlock_create_ruleset
- 163: Linux cli command fchdir
- 164: Linux cli command ioctl_iflags
- 165: Linux cli command fork
- 166: Linux cli command sigsuspend
- 167: Linux cli command lstat
- 168: Linux cli command create_module
- 169: Linux cli command ppoll
- 170: Linux cli command inl_p
- 171: Linux cli command ioctl_ficlonerange
- 172: Linux cli command sched_getscheduler
- 173: Linux cli command intro
- 174: Linux cli command shmdt
- 175: Linux cli command arm_sync_file_range
- 176: Linux cli command sync
- 177: Linux cli command tee
- 178: Linux cli command arch_prctl
- 179: Linux cli command set_mempolicy
- 180: Linux cli command mlock
- 181: Linux cli command stime
- 182: Linux cli command socket
- 183: Linux cli command shmop
- 184: Linux cli command vfork
- 185: Linux cli command lremovexattr
- 186: Linux cli command bpf
- 187: Linux cli command fchmodat
- 188: Linux cli command s390_pci_mmio_write
- 189: Linux cli command mq_timedsend
- 190: Linux cli command oldolduname
- 191: Linux cli command getuid32
- 192: Linux cli command prlimit64
- 193: Linux cli command timer_getoverrun
- 194: Linux cli command landlock_restrict_self
- 195: Linux cli command setgroups32
- 196: Linux cli command setitimer
- 197: Linux cli command sigaltstack
- 198: Linux cli command fremovexattr
- 199: Linux cli command pselect6
- 200: Linux cli command getpgid
- 201: Linux cli command ioctl_fslabel
- 202: Linux cli command swapoff
- 203: Linux cli command writev
- 204: Linux cli command gethostname
- 205: Linux cli command sched_yield
- 206: Linux cli command uname
- 207: Linux cli command outw_p
- 208: Linux cli command fstat64
- 209: Linux cli command getgroups
- 210: Linux cli command fsync
- 211: Linux cli command subpage_prot
- 212: Linux cli command preadv
- 213: Linux cli command getrusage
- 214: Linux cli command openat
- 215: Linux cli command vm86
- 216: Linux cli command llistxattr
- 217: Linux cli command ioprio_get
- 218: Linux cli command lgetxattr
- 219: Linux cli command chroot
- 220: Linux cli command mount
- 221: Linux cli command get_mempolicy
- 222: Linux cli command renameat
- 223: Linux cli command remap_file_pages
- 224: Linux cli command landlock_add_rule
- 225: Linux cli command outw
- 226: Linux cli command setgid32
- 227: Linux cli command getdents
- 228: Linux cli command setup
- 229: Linux cli command seteuid
- 230: Linux cli command read
- 231: Linux cli command sigwaitinfo
- 232: Linux cli command connect
- 233: Linux cli command set_tid_address
- 234: Linux cli command copy_file_range
- 235: Linux cli command dup3
- 236: Linux cli command memfd_secret
- 237: Linux cli command timerfd_gettime
- 238: Linux cli command reboot
- 239: Linux cli command mknodat
- 240: Linux cli command pidfd_getfd
- 241: Linux cli command fadvise64_64
- 242: Linux cli command getresgid32
- 243: Linux cli command unshare
- 244: Linux cli command fadvise64
- 245: Linux cli command symlinkat
- 246: Linux cli command name_to_handle_at
- 247: Linux cli command timer_delete
- 248: Linux cli command pipe
- 249: Linux cli command s390_guarded_storage
- 250: Linux cli command listxattr
- 251: Linux cli command accept
- 252: Linux cli command setegid
- 253: Linux cli command ftruncate64
- 254: Linux cli command get_thread_area
- 255: Linux cli command msgrcv
- 256: Linux cli command geteuid
- 257: Linux cli command mq_open
- 258: Linux cli command mkdirat
- 259: Linux cli command pwritev2
- 260: Linux cli command mlockall
- 261: Linux cli command setgid
- 262: Linux cli command execve
- 263: Linux cli command fstat
- 264: Linux cli command getsockopt
- 265: Linux cli command geteuid32
- 266: Linux cli command quotactl
- 267: Linux cli command pselect
- 268: Linux cli command fstatat
- 269: Linux cli command cacheflush
- 270: Linux cli command mmap2
- 271: Linux cli command unlinkat
- 272: Linux cli command spu_run
- 273: Linux cli command sbrk
- 274: Linux cli command getpagesize
- 275: Linux cli command fgetxattr
- 276: Linux cli command kexec_file_load
- 277: Linux cli command posix_fadvise
- 278: Linux cli command ioctl_userfaultfd
- 279: Linux cli command fdatasync
- 280: Linux cli command kexec_load
- 281: Linux cli command lookup_dcookie
- 282: Linux cli command epoll_pwait2
- 283: Linux cli command syncfs
- 284: Linux cli command prof
- 285: Linux cli command sched_getaffinity
- 286: Linux cli command clock_getres
- 287: Linux cli command inb
- 288: Linux cli command umount
- 289: Linux cli command sched_setscheduler
- 290: Linux cli command fallocate
- 291: Linux cli command add_key
- 292: Linux cli command fsetxattr
- 293: Linux cli command utimes
- 294: Linux cli command getcpu
- 295: Linux cli command stty
- 296: Linux cli command pread64
- 297: Linux cli command vmsplice
- 298: Linux cli command socketcall
- 299: Linux cli command socketpair
- 300: Linux cli command clock_adjtime
- 301: Linux cli command getegid
- 302: Linux cli command timer_settime
- 303: Linux cli command chown
- 304: Linux cli command ioctl_pipe
- 305: Linux cli command fchown32
- 306: Linux cli command epoll_create
- 307: Linux cli command ioctl_getfsmap
- 308: Linux cli command sched_setparam
- 309: Linux cli command pivot_root
- 310: Linux cli command set_robust_list
- 311: Linux cli command nice
- 312: Linux cli command clone2
- 313: Linux cli command waitpid
- 314: Linux cli command kcmp
- 315: Linux cli command setsockopt
- 316: Linux cli command ioctl_fat
- 317: Linux cli command open_by_handle_at
- 318: Linux cli command epoll_ctl
- 319: Linux cli command eventfd2
- 320: Linux cli command lchown
- 321: Linux cli command getpeername
- 322: Linux cli command mmap
- 323: Linux cli command ioctl_tty
- 324: Linux cli command accept4
- 325: Linux cli command inotify_add_watch
- 326: Linux cli command capset
- 327: Linux cli command sched_setaffinity
- 328: Linux cli command memfd_create
- 329: Linux cli command io_cancel
- 330: Linux cli command fstatfs
- 331: Linux cli command fanotify_init
- 332: Linux cli command statfs
- 333: Linux cli command epoll_pwait
- 334: Linux cli command ioperm
- 335: Linux cli command clock_nanosleep
- 336: Linux cli command setgroups
- 337: Linux cli command lseek
- 338: Linux cli command rt_sigprocmask
- 339: Linux cli command getunwind
- 340: Linux cli command fcntl64
- 341: Linux cli command olduname
- 342: Linux cli command select_tut
- 343: Linux cli command mincore
- 344: Linux cli command wait
- 345: Linux cli command getgid32
- 346: Linux cli command ioctl_pagemap_scan
- 347: Linux cli command msgget
- 348: Linux cli command rt_tgsigqueueinfo
- 349: Linux cli command get_robust_list
- 350: Linux cli command dup
- 351: Linux cli command syslog
- 352: Linux cli command phys
- 353: Linux cli command io_setup
- 354: Linux cli command setpriority
- 355: Linux cli command recvmmsg
- 356: Linux cli command signalfd4
- 357: Linux cli command lstat64
- 358: Linux cli command sched_getattr
- 359: Linux cli command readahead
- 360: Linux cli command setpgrp
- 361: Linux cli command poll
- 362: Linux cli command pkey_mprotect
- 363: Linux cli command times
- 364: Linux cli command setresuid32
- 365: Linux cli command getresgid
- 366: Linux cli command getsockname
- 367: Linux cli command umask
- 368: Linux cli command epoll_create1
- 369: Linux cli command ustat
- 370: Linux cli command dup2
- 371: Linux cli command rt_sigreturn
- 372: Linux cli command setfsgid
- 373: Linux cli command shmget
- 374: Linux cli command link
- 375: Linux cli command mkdir
- 376: Linux cli command getegid32
- 377: Linux cli command setresgid32
- 378: Linux cli command sched_getparam
- 379: Linux cli command unlink
- 380: Linux cli command free_hugepages
- 381: Linux cli command iopl
- 382: Linux cli command waitid
- 383: Linux cli command getpriority
- 384: Linux cli command statx
- 385: Linux cli command exit_group
- 386: Linux cli command readv
- 387: Linux cli command getpgrp
- 388: Linux cli command rt_sigtimedwait
- 389: Linux cli command mount_setattr
- 390: Linux cli command pwrite
- 391: Linux cli command mprotect
- 392: Linux cli command getuid
- 393: Linux cli command recvfrom
- 394: Linux cli command setns
- 395: Linux cli command mpx
- 396: Linux cli command nfsservctl
- 397: Linux cli command getmsg
- 398: Linux cli command pkey_free
- 399: Linux cli command pause
- 400: Linux cli command setregid
- 401: Linux cli command pidfd_send_signal
- 402: Linux cli command ioctl_ficlone
- 403: Linux cli command sched_setattr
- 404: Linux cli command process_madvise
- 405: Linux cli command tkill
- 406: Linux cli command clock_gettime
- 407: Linux cli command madvise1
- 408: Linux cli command sched_get_priority_min
- 409: Linux cli command tgkill
- 410: Linux cli command userfaultfd
- 411: Linux cli command bind
- 412: Linux cli command lock
- 413: Linux cli command uselib
- 414: Linux cli command afs_syscall
- 415: Linux cli command splice
- 416: Linux cli command move_pages
- 417: Linux cli command seccomp_unotify
- 418: Linux cli command flistxattr
- 419: Linux cli command insw
- 420: Linux cli command close
- 421: Linux cli command mknod
- 422: Linux cli command outsw
- 423: Linux cli command sendmsg
- 424: Linux cli command setfsgid32
- 425: Linux cli command init_module
- 426: Linux cli command fchmod
- 427: Linux cli command seccomp
- 428: Linux cli command setxattr
- 429: Linux cli command setuid32
- 430: Linux cli command clone
- 431: Linux cli command setresuid
- 432: Linux cli command chown32
- 433: Linux cli command send
- 434: Linux cli command access
- 435: Linux cli command symlink
- 436: Linux cli command nanosleep
- 437: Linux cli command swapon
- 438: Linux cli command ioctl_console
- 439: Linux cli command prctl
- 440: Linux cli command msgsnd
- 441: Linux cli command set_thread_area
- 442: Linux cli command spu_create
- 443: Linux cli command setreuid
- 444: Linux cli command ioctl_ns
- 445: Linux cli command syscall
- 446: Linux cli command getgroups32
- 447: Linux cli command ftruncate
- 448: Linux cli command insl
- 449: Linux cli command shmat
- 450: Linux cli command semget
- 451: Linux cli command insb
- 452: Linux cli command sendto
- 453: Linux cli command rename
- 454: Linux cli command inotify_rm_watch
- 455: Linux cli command outsl
- 456: Linux cli command acct
- 457: Linux cli command isastream
- 458: Linux cli command wait3
- 459: Linux cli command sgetmask
- 460: Linux cli command signal
- 461: Linux cli command rt_sigsuspend
- 462: Linux cli command s390_runtime_instr
- 463: Linux cli command setresgid
- 464: Linux cli command getitimer
- 465: Linux cli command alarm
- 466: Linux cli command perfmonctl
- 467: Linux cli command rt_sigpending
- 468: Linux cli command wait4
- 469: Linux cli command sigtimedwait
- 470: Linux cli command madvise
- 471: Linux cli command putmsg
- 472: Linux cli command open_howtype
- 473: Linux cli command oldfstat
- 474: Linux cli command setsid
- 475: Linux cli command timerfd_create
- 476: Linux cli command getcwd
- 477: Linux cli command inotify_init
- 478: Linux cli command readlinkat
- 479: Linux cli command getdomainname
- 480: Linux cli command gtty
- 481: Linux cli command ipc
- 482: Linux cli command outb_p
- 483: Linux cli command msgop
- 484: Linux cli command ioctl_fideduperange
- 485: Linux cli command utimensat
- 486: Linux cli command query_module
- 487: Linux cli command ugetrlimit
- 488: Linux cli command vhangup
- 489: Linux cli command futimesat
- 490: Linux cli command chmod
- 491: Linux cli command signalfd
- 492: Linux cli command io_submit
- 493: Linux cli command mremap
- 494: Linux cli command oldstat
- 495: Linux cli command timerfd_settime
- 496: Linux cli command outb
1 - Linux cli command ptrace
NAME 🖥️ ptrace 🖥️
process trace
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/ptrace.h>
long ptrace(enum __ptrace_request op, pid_t pid,
            void *addr, void *data);
DESCRIPTION
The ptrace() system call provides a means by which one process (the “tracer”) may observe and control the execution of another process (the “tracee”), and examine and change the tracee’s memory and registers. It is primarily used to implement breakpoint debugging and system call tracing.
A tracee first needs to be attached to the tracer. Attachment and subsequent commands are per thread: in a multithreaded process, every thread can be individually attached to a (potentially different) tracer, or left not attached and thus not debugged. Therefore, “tracee” always means “(one) thread”, never “a (possibly multithreaded) process”. Ptrace commands are always sent to a specific tracee using a call of the form
ptrace(PTRACE_foo, pid, ...)
where pid is the thread ID of the corresponding Linux thread.
(Note that in this page, a “multithreaded process” means a thread group consisting of threads created using the clone(2) CLONE_THREAD flag.)
A process can initiate a trace by calling fork(2) and having the resulting child do a PTRACE_TRACEME, followed (typically) by an execve(2). Alternatively, one process may commence tracing another process using PTRACE_ATTACH or PTRACE_SEIZE.
While being traced, the tracee will stop each time a signal is delivered, even if the signal is being ignored. (An exception is SIGKILL, which has its usual effect.) The tracer will be notified at its next call to waitpid(2) (or one of the related “wait” system calls); that call will return a status value containing information that indicates the cause of the stop in the tracee. While the tracee is stopped, the tracer can use various ptrace operations to inspect and modify the tracee. The tracer then causes the tracee to continue, optionally ignoring the delivered signal (or even delivering a different signal instead).
If the PTRACE_O_TRACEEXEC option is not in effect, all successful calls to execve(2) by the traced process will cause it to be sent a SIGTRAP signal, giving the parent a chance to gain control before the new program begins execution.
When the tracer is finished tracing, it can cause the tracee to continue executing in a normal, untraced mode via PTRACE_DETACH.
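The life cycle described above (fork, PTRACE_TRACEME, stop, wait, detach) can be sketched as follows. This is a minimal illustration rather than a reference implementation; the helper name trace_child_once() is hypothetical, and the child uses raise(SIGSTOP) in place of the more typical execve(2) so that the example stays self-contained.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of any API): trace a forked child through
 * one stop, detach, and return the child's exit code, or -1 on failure. */
int trace_child_once(int child_exit_code)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Tracee: ask to be traced by the parent, then stop so that the
         * tracer can take control before any real work happens. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit(child_exit_code);
    }

    /* Tracer: the SIGSTOP is reported as a stop by waitpid(2). */
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;

    /* The stopped tracee could be inspected here. Detaching with
     * data == 0 resumes it untraced without delivering the SIGSTOP. */
    if (ptrace(PTRACE_DETACH, pid, NULL, NULL) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A caller might check that trace_child_once(42) returns 42, which shows that the child ran to completion after the tracer detached.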
The value of op determines the operation to be performed:
PTRACE_TRACEME
Indicate that this process is to be traced by its parent. A process probably shouldn’t perform this operation if its parent isn’t expecting to trace it. (pid, addr, and data are ignored.)
The PTRACE_TRACEME operation is used only by the tracee; the remaining operations are used only by the tracer. In the following operations, pid specifies the thread ID of the tracee to be acted on. For operations other than PTRACE_ATTACH, PTRACE_SEIZE, PTRACE_INTERRUPT, and PTRACE_KILL, the tracee must be stopped.
PTRACE_PEEKTEXT
PTRACE_PEEKDATA
Read a word at the address addr in the tracee’s memory, returning the word as the result of the ptrace() call. Linux does not have separate text and data address spaces, so these two operations are currently equivalent. (data is ignored; but see NOTES.)
PTRACE_PEEKUSER
Read a word at offset addr in the tracee’s USER area, which holds the registers and other information about the process (see <sys/user.h>). The word is returned as the result of the ptrace() call. Typically, the offset must be word-aligned, though this might vary by architecture. See NOTES. (data is ignored; but see NOTES.)
PTRACE_POKETEXT
PTRACE_POKEDATA
Copy the word data to the address addr in the tracee’s memory. As for PTRACE_PEEKTEXT and PTRACE_PEEKDATA, these two operations are currently equivalent.
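As a rough sketch of the PEEK and POKE operations, the following hypothetical helper reads a word from a traced child and then overwrites it. It assumes that the child is created with fork(2), so that a static variable sits at the same virtual address in both processes; demo_word and peek_poke_roundtrip() are illustrative names only. Because the peeked word is returned as the function result, a value of -1 is ambiguous, so the sketch clears errno before the call and checks it afterwards.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A word in the data segment. After fork(2) the child has its own copy at
 * the same virtual address, so the tracer can pass &demo_word as addr. */
static volatile long demo_word = 0x1234;

/* Hypothetical helper: peek demo_word in a traced child, overwrite it with
 * new_value, and return the child's exit code (the low byte of the word
 * that the child then sees), or -1 on failure. */
int peek_poke_roundtrip(long new_value)
{
    int status;
    long word;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);                   /* wait for the tracer */
        _exit((int)(demo_word & 0xff));   /* report the current value */
    }

    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;

    /* Clear errno first: a peeked word of -1 is legitimate data. */
    errno = 0;
    word = ptrace(PTRACE_PEEKDATA, pid, (void *)&demo_word, NULL);
    if ((word == -1 && errno != 0) || word != 0x1234)
        goto fail;

    /* For the POKE operations, data carries the word itself. */
    if (ptrace(PTRACE_POKEDATA, pid, (void *)&demo_word,
               (void *)new_value) == -1)
        goto fail;

    /* Resume the tracee without delivering the SIGSTOP. */
    if (ptrace(PTRACE_CONT, pid, NULL, NULL) == -1)
        goto fail;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);

fail:
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return -1;
}
```

The tracer's own copy of demo_word is untouched; only the tracee's copy changes, which is why the child's exit code reflects the poked value.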
PTRACE_POKEUSER
Copy the word data to offset addr in the tracee’s USER area. As for PTRACE_PEEKUSER, the offset must typically be word-aligned. In order to maintain the integrity of the kernel, some modifications to the USER area are disallowed.
PTRACE_GETREGS
PTRACE_GETFPREGS
Copy the tracee’s general-purpose or floating-point registers, respectively, to the address data in the tracer. See <sys/user.h> for information on the format of this data. (addr is ignored.) Note that SPARC systems have the meaning of data and addr reversed; that is, data is ignored and the registers are copied to the address addr. PTRACE_GETREGS and PTRACE_GETFPREGS are not present on all architectures.
PTRACE_GETREGSET (since Linux 2.6.34)
Read the tracee’s registers. addr specifies, in an architecture-dependent way, the type of registers to be read. NT_PRSTATUS (with numerical value 1) usually results in reading of general-purpose registers. If the CPU has, for example, floating-point and/or vector registers, they can be retrieved by setting addr to the corresponding NT_foo constant. data points to a struct iovec, which describes the destination buffer’s location and length. On return, the kernel modifies the iov_len field of that structure to indicate the actual number of bytes returned.
PTRACE_SETREGS
PTRACE_SETFPREGS
Modify the tracee’s general-purpose or floating-point registers, respectively, from the address data in the tracer. As for PTRACE_POKEUSER, some general-purpose register modifications may be disallowed. (addr is ignored.) Note that SPARC systems have the meaning of data and addr reversed; that is, data is ignored and the registers are copied from the address addr. PTRACE_SETREGS and PTRACE_SETFPREGS are not present on all architectures.
PTRACE_SETREGSET (since Linux 2.6.34)
Modify the tracee’s registers. The meaning of addr and data is analogous to PTRACE_GETREGSET.
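The struct iovec convention used by PTRACE_GETREGSET might look like the sketch below. To stay architecture-neutral it reads into a generic buffer instead of a register structure from <sys/user.h>; the assumption that 512 words is large enough for any general-purpose register set, and the helper name gp_regset_size(), are illustrative choices and not part of the interface.

```c
#include <elf.h>          /* NT_PRSTATUS */
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>      /* struct iovec */
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: return the number of bytes the kernel reports for
 * the NT_PRSTATUS register set of a stopped child, or -1 on failure. */
long gp_regset_size(void)
{
    /* Assumption: 512 words exceeds the size of the register set. */
    unsigned long buf[512];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    long result = -1;
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit(0);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;

    /* addr selects the register set; data points to the iovec. On
     * success the kernel updates iov_len to the bytes actually written. */
    if (ptrace(PTRACE_GETREGSET, pid, (void *)(long)NT_PRSTATUS, &iov) == 0)
        result = (long)iov.iov_len;

    /* The tracee is no longer needed. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return result;
}
```

A real tracer would normally point iov_base at the architecture's register structure and compare the returned iov_len with the structure's size.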
PTRACE_GETSIGINFO (since Linux 2.3.99-pre6)
Retrieve information about the signal that caused the stop. Copy a siginfo_t structure (see sigaction(2)) from the tracee to the address data in the tracer. (addr is ignored.)
PTRACE_SETSIGINFO (since Linux 2.3.99-pre6)
Set signal information: copy a siginfo_t structure from the address data in the tracer to the tracee. This will affect only signals that would normally be delivered to the tracee and were caught by the tracer. It may be difficult to tell these normal signals from synthetic signals generated by ptrace() itself. (addr is ignored.)
PTRACE_PEEKSIGINFO (since Linux 3.10)
Retrieve siginfo_t structures without removing signals from a queue. addr points to a ptrace_peeksiginfo_args structure that specifies the ordinal position from which copying of signals should start, and the number of signals to copy. siginfo_t structures are copied into the buffer pointed to by data. The return value contains the number of copied signals (zero indicates that there is no signal corresponding to the specified ordinal position). Within the returned siginfo structures, the si_code field includes information (__SI_CHLD, __SI_FAULT, etc.) that is not otherwise exposed to user space.
struct ptrace_peeksiginfo_args {
    u64 off;    /* Ordinal position in queue at which
                   to start copying signals */
    u32 flags;  /* PTRACE_PEEKSIGINFO_SHARED or 0 */
    s32 nr;     /* Number of signals to copy */
};
Currently, there is only one flag, PTRACE_PEEKSIGINFO_SHARED, for dumping signals from the process-wide signal queue. If this flag is not set, signals are read from the per-thread queue of the specified thread.
PTRACE_GETSIGMASK (since Linux 3.11)
Place a copy of the mask of blocked signals (see sigprocmask(2)) in the buffer pointed to by data, which should be a pointer to a buffer of type sigset_t. The addr argument contains the size of the buffer pointed to by data (i.e., sizeof(sigset_t)).
PTRACE_SETSIGMASK (since Linux 3.11)
Change the mask of blocked signals (see sigprocmask(2)) to the value specified in the buffer pointed to by data, which should be a pointer to a buffer of type sigset_t. The addr argument contains the size of the buffer pointed to by data (i.e., sizeof(sigset_t)).
PTRACE_SETOPTIONS (since Linux 2.4.6; see BUGS for caveats)
Set ptrace options from data. (addr is ignored.) data is interpreted as a bit mask of options, which are specified by the following flags:
PTRACE_O_EXITKILL (since Linux 3.8)
Send a SIGKILL signal to the tracee if the tracer exits. This option is useful for ptrace jailers that want to ensure that tracees can never escape the tracer’s control.
PTRACE_O_TRACECLONE (since Linux 2.5.46)
Stop the tracee at the next clone(2) and automatically start tracing the newly cloned process, which will start with a SIGSTOP, or PTRACE_EVENT_STOP if PTRACE_SEIZE was used. A waitpid(2) by the tracer will return a status value such that
status>>8 == (SIGTRAP | (PTRACE_EVENT_CLONE<<8))
The PID of the new process can be retrieved with PTRACE_GETEVENTMSG.
This option may not catch clone(2) calls in all cases. If the tracee calls clone(2) with the CLONE_VFORK flag, PTRACE_EVENT_VFORK will be delivered instead if PTRACE_O_TRACEVFORK is set; otherwise if the tracee calls clone(2) with the exit signal set to SIGCHLD, PTRACE_EVENT_FORK will be delivered if PTRACE_O_TRACEFORK is set.
PTRACE_O_TRACEEXEC (since Linux 2.5.46)
Stop the tracee at the next execve(2). A waitpid(2) by the tracer will return a status value such that
status>>8 == (SIGTRAP | (PTRACE_EVENT_EXEC<<8))
If the execing thread is not a thread group leader, the thread ID is reset to the thread group leader’s ID before this stop. Since Linux 3.0, the former thread ID can be retrieved with PTRACE_GETEVENTMSG.
PTRACE_O_TRACEEXIT (since Linux 2.5.60)
Stop the tracee at exit. A waitpid(2) by the tracer will return a status value such that
status>>8 == (SIGTRAP | (PTRACE_EVENT_EXIT<<8))
The tracee’s exit status can be retrieved with PTRACE_GETEVENTMSG.
The tracee is stopped early during process exit, when registers are still available, allowing the tracer to see where the exit occurred, whereas the normal exit notification is done after the process is finished exiting. Even though context is available, the tracer cannot prevent the exit from happening at this point.
PTRACE_O_TRACEFORK (since Linux 2.5.46)
Stop the tracee at the next fork(2) and automatically start tracing the newly forked process, which will start with a SIGSTOP, or PTRACE_EVENT_STOP if PTRACE_SEIZE was used. A waitpid(2) by the tracer will return a status value such that
status>>8 == (SIGTRAP | (PTRACE_EVENT_FORK<<8))
The PID of the new process can be retrieved with PTRACE_GETEVENTMSG.
PTRACE_O_TRACESYSGOOD (since Linux 2.4.6)
When delivering system call traps, set bit 7 in the signal number (i.e., deliver SIGTRAP|0x80). This makes it easy for the tracer to distinguish normal traps from those caused by a system call.
PTRACE_O_TRACEVFORK (since Linux 2.5.46)
Stop the tracee at the next vfork(2) and automatically start tracing the newly vforked process, which will start with a SIGSTOP, or PTRACE_EVENT_STOP if PTRACE_SEIZE was used. A waitpid(2) by the tracer will return a status value such that
status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK<<8))
The PID of the new process can be retrieved with PTRACE_GETEVENTMSG.
PTRACE_O_TRACEVFORKDONE (since Linux 2.5.60)
Stop the tracee at the completion of the next vfork(2). A waitpid(2) by the tracer will return a status value such that
status>>8 == (SIGTRAP | (PTRACE_EVENT_VFORK_DONE<<8))
The PID of the new process can (since Linux 2.6.18) be retrieved with PTRACE_GETEVENTMSG.
PTRACE_O_TRACESECCOMP (since Linux 3.5)
Stop the tracee when a seccomp(2) SECCOMP_RET_TRACE rule is triggered. A waitpid(2) by the tracer will return a status value such that
status>>8 == (SIGTRAP | (PTRACE_EVENT_SECCOMP<<8))
While this triggers a PTRACE_EVENT stop, it is similar to a syscall-enter-stop. For details, see the note on PTRACE_EVENT_SECCOMP below. The seccomp event message data (from the SECCOMP_RET_DATA portion of the seccomp filter rule) can be retrieved with PTRACE_GETEVENTMSG.
PTRACE_O_SUSPEND_SECCOMP (since Linux 4.3)
Suspend the tracee’s seccomp protections. This applies regardless of mode, and can be used when the tracee has not yet installed seccomp filters. That is, a valid use case is to suspend a tracee’s seccomp protections before they are installed by the tracee, let the tracee install the filters, and then clear this flag when the filters should be resumed. Setting this option requires that the tracer have the CAP_SYS_ADMIN capability, not have any seccomp protections installed, and not have PTRACE_O_SUSPEND_SECCOMP set on itself.
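The status tests quoted for the options above can be wrapped in small predicates. The sketch below is one possible arrangement; the function names are hypothetical, and the expressions are the ones given in the text (status>>8 for event stops, SIGTRAP|0x80 for system call stops when PTRACE_O_TRACESYSGOOD is set).

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Hypothetical helper: nonzero if a waitpid(2) status reports the given
 * PTRACE_EVENT_* stop, using the expression from the option descriptions. */
int is_ptrace_event(int status, int event)
{
    return WIFSTOPPED(status) &&
           (status >> 8) == (SIGTRAP | (event << 8));
}

/* Hypothetical helper: nonzero if the status reports a system call stop
 * of a tracee that has PTRACE_O_TRACESYSGOOD set (bit 7 of the signal). */
int is_syscall_stop(int status)
{
    return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}
```

In a wait loop, a tracer might test is_syscall_stop(status) first and then is_ptrace_event(status, PTRACE_EVENT_EXEC) and the other events it has enabled, treating any remaining stop as a signal-delivery-stop.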
PTRACE_GETEVENTMSG (since Linux 2.5.46)
Retrieve a message (as an unsigned long) about the ptrace event that just happened, placing it at the address data in the tracer. For PTRACE_EVENT_EXIT, this is the tracee’s exit status. For PTRACE_EVENT_FORK, PTRACE_EVENT_VFORK, PTRACE_EVENT_VFORK_DONE, and PTRACE_EVENT_CLONE, this is the PID of the new process. For PTRACE_EVENT_SECCOMP, this is the seccomp(2) filter’s SECCOMP_RET_DATA associated with the triggered rule. (addr is ignored.)
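The following sketch combines PTRACE_O_TRACEEXIT with PTRACE_GETEVENTMSG to read a tracee's exit status before the tracee has finished exiting. It assumes that the message uses the same encoding as a wait(2) status, so that WEXITSTATUS() can be applied to it; the helper name exit_code_via_eventmsg() is hypothetical.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: return the exit code that a traced child reports
 * through the PTRACE_EVENT_EXIT message, or -1 on failure. */
int exit_code_via_eventmsg(int child_exit_code)
{
    unsigned long msg = 0;
    int status, result = -1;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        _exit(child_exit_code);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;

    /* Options are passed as a bit mask in data; addr is ignored. */
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACEEXIT) == -1)
        goto out;
    if (ptrace(PTRACE_CONT, pid, NULL, NULL) == -1)
        goto out;
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;      /* the tracee terminated without an exit stop */

    /* Assumption: the message is encoded like a wait(2) status. */
    if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_EXIT << 8)) &&
        ptrace(PTRACE_GETEVENTMSG, pid, NULL, &msg) == 0)
        result = WEXITSTATUS((int)msg);

out:
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return result;
}
```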
PTRACE_CONT
Restart the stopped tracee process. If data is nonzero, it is interpreted as the number of a signal to be delivered to the tracee; otherwise, no signal is delivered. Thus, for example, the tracer can control whether a signal sent to the tracee is delivered or not. (addr is ignored.)
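To illustrate how the data argument of PTRACE_CONT decides whether a signal reaches the tracee, the sketch below has a child install a SIGUSR1 handler and raise the signal; the tracer then either passes the signal on or suppresses it. The names on_usr1(), handler_ran, and cont_with_signal() are hypothetical.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t handler_ran;

static void on_usr1(int sig)
{
    (void)sig;
    handler_ran = 1;
}

/* Hypothetical helper: the child raises SIGUSR1; the tracer delivers the
 * signal if deliver is nonzero and suppresses it otherwise. Returns 1 if
 * the child's handler ran, 0 if it did not, or -1 on failure. */
int cont_with_signal(int deliver)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        signal(SIGUSR1, on_usr1);
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGUSR1);             /* stops here until the tracer acts */
        _exit(handler_ran ? 1 : 0);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;

    /* data == 0 discards the signal; a nonzero data value is the number
     * of the signal to deliver when the tracee resumes. */
    if (WSTOPSIG(status) != SIGUSR1 ||
        ptrace(PTRACE_CONT, pid, NULL,
               (void *)(long)(deliver ? SIGUSR1 : 0)) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```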
PTRACE_SYSCALL
PTRACE_SINGLESTEP
Restart the stopped tracee as for PTRACE_CONT, but arrange for the tracee to be stopped at the next entry to or exit from a system call, or after execution of a single instruction, respectively. (The tracee will also, as usual, be stopped upon receipt of a signal.) From the tracer’s perspective, the tracee will appear to have been stopped by receipt of a SIGTRAP. So, for PTRACE_SYSCALL, for example, the idea is to inspect the arguments to the system call at the first stop, then do another PTRACE_SYSCALL and inspect the return value of the system call at the second stop. The data argument is treated as for PTRACE_CONT. (addr is ignored.)
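A system call tracing loop built on PTRACE_SYSCALL could be sketched as below. The example sets PTRACE_O_TRACESYSGOOD so that system call stops can be told apart from ordinary SIGTRAP stops, and it only counts the stops; a real tracer would inspect registers at each one. The helper name count_syscall_stops() is hypothetical.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: count the system call stops (entries and exits)
 * that a traced child produces until it terminates, or return -1. */
int count_syscall_stops(void)
{
    int status, count = 0;
    long sig = 0;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);
        syscall(SYS_getpid);    /* one entry stop and one exit stop */
        _exit(0);               /* entry stop only; the call never returns */
    }

    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;

    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) == -1)
        goto fail;

    for (;;) {
        /* data is treated as for PTRACE_CONT: 0 or a signal to deliver. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, (void *)sig) == -1)
            goto fail;
        if (waitpid(pid, &status, 0) != pid)
            return -1;
        if (!WIFSTOPPED(status))
            break;                          /* the tracee has terminated */

        if (WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            count++;                        /* system call entry or exit */
            sig = 0;
        } else {
            sig = WSTOPSIG(status);         /* real signal: pass it on */
        }
    }
    return count;

fail:
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return -1;
}
```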
PTRACE_SET_SYSCALL (since Linux 2.6.16)
When in syscall-enter-stop, change the number of the system call that is about to be executed to the number specified in the data argument. The addr argument is ignored. This operation is currently supported only on arm (and arm64, though only for backwards compatibility), but most other architectures have other means of accomplishing this (usually by changing the register that the userland code passed the system call number in).
PTRACE_SYSEMU
PTRACE_SYSEMU_SINGLESTEP (since Linux 2.6.14)
For PTRACE_SYSEMU, continue and stop on entry to the next system call, which will not be executed. See the documentation on syscall-stops below. For PTRACE_SYSEMU_SINGLESTEP, do the same but also singlestep if not a system call. This call is used by programs like User Mode Linux that want to emulate all the tracee’s system calls. The data argument is treated as for PTRACE_CONT. The addr argument is ignored. These operations are currently supported only on x86.
PTRACE_LISTEN (since Linux 3.4)
Restart the stopped tracee, but prevent it from executing. The resulting state of the tracee is similar to a process which has been stopped by a SIGSTOP (or other stopping signal). See the “group-stop” subsection for additional information. PTRACE_LISTEN works only on tracees attached by PTRACE_SEIZE.
PTRACE_KILL
Send the tracee a SIGKILL to terminate it. (addr and data are ignored.)
This operation is deprecated; do not use it! Instead, send a SIGKILL directly using kill(2) or tgkill(2). The problem with PTRACE_KILL is that it requires the tracee to be in signal-delivery-stop, otherwise it may not work (i.e., may complete successfully but won’t kill the tracee). By contrast, sending a SIGKILL directly has no such limitation.
PTRACE_INTERRUPT (since Linux 3.4)
Stop a tracee. If the tracee is running or sleeping in kernel space and PTRACE_SYSCALL is in effect, the system call is interrupted and syscall-exit-stop is reported. (The interrupted system call is restarted when the tracee is restarted.) If the tracee was already stopped by a signal and PTRACE_LISTEN was sent to it, the tracee stops with PTRACE_EVENT_STOP and WSTOPSIG(status) returns the stop signal. If any other ptrace-stop is generated at the same time (for example, if a signal is sent to the tracee), this ptrace-stop happens. If none of the above applies (for example, if the tracee is running in user space), it stops with PTRACE_EVENT_STOP with WSTOPSIG(status) == SIGTRAP. PTRACE_INTERRUPT only works on tracees attached by PTRACE_SEIZE.
PTRACE_ATTACH
Attach to the process specified in pid, making it a tracee of the calling process. The tracee is sent a SIGSTOP, but will not necessarily have stopped by the completion of this call; use waitpid(2) to wait for the tracee to stop. See the “Attaching and detaching” subsection for additional information. (addr and data are ignored.)
Permission to perform a PTRACE_ATTACH is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see below.
PTRACE_SEIZE (since Linux 3.4)
Attach to the process specified in pid, making it a tracee of the calling process. Unlike PTRACE_ATTACH, PTRACE_SEIZE does not stop the process. Group-stops are reported as PTRACE_EVENT_STOP and WSTOPSIG(status) returns the stop signal. Automatically attached children stop with PTRACE_EVENT_STOP and WSTOPSIG(status) returns SIGTRAP instead of having SIGSTOP signal delivered to them. execve(2) does not deliver an extra SIGTRAP. Only a PTRACE_SEIZEd process can accept PTRACE_INTERRUPT and PTRACE_LISTEN commands. The “seized” behavior just described is inherited by children that are automatically attached using PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, and PTRACE_O_TRACECLONE. addr must be zero. data contains a bit mask of ptrace options to activate immediately.
Permission to perform a PTRACE_SEIZE is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see below.
PTRACE_SECCOMP_GET_FILTER (since Linux 4.4)
This operation allows the tracer to dump the tracee’s classic BPF filters.
addr is an integer specifying the index of the filter to be dumped. The most recently installed filter has the index 0. If addr is greater than the number of installed filters, the operation fails with the error ENOENT.
data is either a pointer to a struct sock_filter array that is large enough to store the BPF program, or NULL if the program is not to be stored.
Upon success, the return value is the number of instructions in the BPF program. If data was NULL, then this return value can be used to correctly size the struct sock_filter array passed in a subsequent call.
This operation fails with the error EACCES if the caller does not have the CAP_SYS_ADMIN capability or if the caller is in strict or filter seccomp mode. If the filter referred to by addr is not a classic BPF filter, the operation fails with the error EMEDIUMTYPE.
This operation is available if the kernel was configured with both the CONFIG_SECCOMP_FILTER and the CONFIG_CHECKPOINT_RESTORE options.
PTRACE_DETACH
Restart the stopped tracee as for PTRACE_CONT, but first detach from it. Under Linux, a tracee can be detached in this way regardless of which method was used to initiate tracing. (addr is ignored.)
PTRACE_GET_THREAD_AREA (since Linux 2.6.0)
This operation performs a similar task to get_thread_area(2). It reads the TLS entry in the GDT whose index is given in addr, placing a copy of the entry into the struct user_desc pointed to by data. (By contrast with get_thread_area(2), the entry_number of the struct user_desc is ignored.)
PTRACE_SET_THREAD_AREA (since Linux 2.6.0)
This operation performs a similar task to set_thread_area(2). It sets the TLS entry in the GDT whose index is given in addr, assigning it the data supplied in the struct user_desc pointed to by data. (By contrast with set_thread_area(2), the entry_number of the struct user_desc is ignored; in other words, this ptrace operation can’t be used to allocate a free TLS entry.)
PTRACE_GET_SYSCALL_INFO (since Linux 5.3)
Retrieve information about the system call that caused the stop. The information is placed into the buffer pointed to by the data argument, which should be a pointer to a buffer of type struct ptrace_syscall_info. The addr argument contains the size of the buffer pointed to by the data argument (i.e., sizeof(struct ptrace_syscall_info)). The return value contains the number of bytes available to be written by the kernel. If the size of the data to be written by the kernel exceeds the size specified by the addr argument, the output data is truncated.
The ptrace_syscall_info structure contains the following fields:
struct ptrace_syscall_info {
    __u8 op;                    /* Type of system call stop */
    __u32 arch;                 /* AUDIT_ARCH_* value; see seccomp(2) */
    __u64 instruction_pointer;  /* CPU instruction pointer */
    __u64 stack_pointer;        /* CPU stack pointer */
    union {
        struct {                /* op == PTRACE_SYSCALL_INFO_ENTRY */
            __u64 nr;           /* System call number */
            __u64 args[6];      /* System call arguments */
        } entry;
        struct {                /* op == PTRACE_SYSCALL_INFO_EXIT */
            __s64 rval;         /* System call return value */
            __u8 is_error;      /* System call error flag;
                                   Boolean: does rval contain
                                   an error value (-ERRCODE) or
                                   a nonerror return value? */
        } exit;
        struct {                /* op == PTRACE_SYSCALL_INFO_SECCOMP */
            __u64 nr;           /* System call number */
            __u64 args[6];      /* System call arguments */
            __u32 ret_data;     /* SECCOMP_RET_DATA portion
                                   of SECCOMP_RET_TRACE
                                   return value */
        } seccomp;
    };
};
The op, arch, instruction_pointer, and stack_pointer fields are defined for all kinds of ptrace system call stops. The rest of the structure is a union; one should read only those fields that are meaningful for the kind of system call stop specified by the op field.
The op field has one of the following values (defined in <linux/ptrace.h>) indicating what type of stop occurred and which part of the union is filled:
PTRACE_SYSCALL_INFO_ENTRY
The entry component of the union contains information relating to a system call entry stop.
PTRACE_SYSCALL_INFO_EXIT
The exit component of the union contains information relating to a system call exit stop.
PTRACE_SYSCALL_INFO_SECCOMP
The seccomp component of the union contains information relating to a PTRACE_EVENT_SECCOMP stop.
PTRACE_SYSCALL_INFO_NONE
No component of the union contains relevant information.
In case of system call entry or exit stops, the data returned by PTRACE_GET_SYSCALL_INFO is limited to type PTRACE_SYSCALL_INFO_NONE unless the PTRACE_O_TRACESYSGOOD option is set before the corresponding system call stop has occurred.
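The rule that only the union member selected by op may be read can be sketched as follows. This is a minimal illustration, not part of the kernel interface: the helper name syscall_info_nr is hypothetical, and it assumes kernel headers recent enough (Linux 5.3 or later) to declare struct ptrace_syscall_info in <linux/ptrace.h>.

```c
#include <linux/ptrace.h>

/* Hypothetical helper: fetch the system call number from a structure that
 * PTRACE_GET_SYSCALL_INFO has already filled in, reading only the union
 * member selected by op.
 * Returns 1 and stores the number in *nr if the stop carries one, else 0. */
static int syscall_info_nr(const struct ptrace_syscall_info *info,
                           unsigned long long *nr)
{
    switch (info->op) {
    case PTRACE_SYSCALL_INFO_ENTRY:
        *nr = info->entry.nr;
        return 1;
    case PTRACE_SYSCALL_INFO_SECCOMP:
        *nr = info->seccomp.nr;
        return 1;
    default:
        /* EXIT carries rval/is_error instead; NONE carries nothing. */
        return 0;
    }
}
```

A tracer would typically call this right after a successful ptrace(PTRACE_GET_SYSCALL_INFO, pid, sizeof(info), &info) at a syscall-stop.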
Death under ptrace
When a (possibly multithreaded) process receives a killing signal (one whose disposition is set to SIG_DFL and whose default action is to kill the process), all threads exit. Tracees report their death to their tracer(s). Notification of this event is delivered via waitpid(2).
Note that the killing signal will first cause signal-delivery-stop (on one tracee only), and only after it is injected by the tracer (or after it was dispatched to a thread which isn’t traced), will death from the signal happen on all tracees within a multithreaded process. (The term “signal-delivery-stop” is explained below.)
SIGKILL does not generate signal-delivery-stop and therefore the tracer can’t suppress it. SIGKILL kills even within system calls (syscall-exit-stop is not generated prior to death by SIGKILL). The net effect is that SIGKILL always kills the process (all its threads), even if some threads of the process are ptraced.
When the tracee calls _exit(2), it reports its death to its tracer. Other threads are not affected.
When any thread executes exit_group(2), every tracee in its thread group reports its death to its tracer.
If the PTRACE_O_TRACEEXIT option is on, PTRACE_EVENT_EXIT will happen before actual death. This applies to exits via exit(2), exit_group(2), and signal deaths (except SIGKILL, depending on the kernel version; see BUGS below), and when threads are torn down on execve(2) in a multithreaded process.
The tracer cannot assume that the ptrace-stopped tracee exists. There are many scenarios when the tracee may die while stopped (such as SIGKILL). Therefore, the tracer must be prepared to handle an ESRCH error on any ptrace operation. Unfortunately, the same error is returned if the tracee exists but is not ptrace-stopped (for commands which require a stopped tracee), or if it is not traced by the process which issued the ptrace call. The tracer needs to keep track of the stopped/running state of the tracee, and interpret ESRCH as “tracee died unexpectedly” only if it knows that the tracee has been observed to enter ptrace-stop. Note that there is no guarantee that waitpid(WNOHANG) will reliably report the tracee’s death status if a ptrace operation returned ESRCH. waitpid(WNOHANG) may return 0 instead. In other words, the tracee may be “not yet fully dead”, but already refusing ptrace operations.
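The bookkeeping described above might look like the following sketch. The struct, enum, and function names are hypothetical; the point is only that ESRCH is read as "tracee died" when the tracer's own records say the tracee was attached and in ptrace-stop.

```c
#include <stdbool.h>

/* Hypothetical per-tracee record maintained by the tracer. */
struct tracee_state {
    bool attached;  /* traced by this tracer */
    bool stopped;   /* observed in ptrace-stop and not restarted since */
};

enum esrch_reading {
    MEANS_TRACEE_DIED,        /* tracee vanished while ptrace-stopped */
    MEANS_TRACER_STATE_WRONG  /* not ours, or not stopped: tracer-side error */
};

/* Interpret an ESRCH failure of a ptrace operation for this tracee. */
static enum esrch_reading interpret_esrch(const struct tracee_state *t)
{
    if (t->attached && t->stopped)
        return MEANS_TRACEE_DIED;
    return MEANS_TRACER_STATE_WRONG;
}
```

The tracer would set stopped when waitpid(2) reports a ptrace-stop and clear it whenever it issues a restarting operation.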
The tracer can’t assume that the tracee always ends its life by reporting WIFEXITED(status) or WIFSIGNALED(status); there are cases where this does not occur. For example, if a thread other than the thread group leader does an execve(2), it disappears; its PID will never be seen again, and any subsequent ptrace stops will be reported under the thread group leader’s PID.
Stopped states
A tracee can be in two states: running or stopped. For the purposes of ptrace, a tracee which is blocked in a system call (such as read(2), pause(2), etc.) is nevertheless considered to be running, even if the tracee is blocked for a long time. The state of the tracee after PTRACE_LISTEN is somewhat of a gray area: it is not in any ptrace-stop (ptrace commands won’t work on it, and it will deliver waitpid(2) notifications), but it also may be considered “stopped” because it is not executing instructions (is not scheduled), and if it was in group-stop before PTRACE_LISTEN, it will not respond to signals until SIGCONT is received.
There are many kinds of states when the tracee is stopped, and in ptrace discussions they are often conflated. Therefore, it is important to use precise terms.
In this manual page, any stopped state in which the tracee is ready to accept ptrace commands from the tracer is called ptrace-stop. Ptrace-stops can be further subdivided into signal-delivery-stop, group-stop, syscall-stop, PTRACE_EVENT stops, and so on. These stopped states are described in detail below.
When the running tracee enters ptrace-stop, it notifies its tracer using waitpid(2) (or one of the other “wait” system calls). Most of this manual page assumes that the tracer waits with:
pid = waitpid(pid_or_minus_1, &status, __WALL);
Ptrace-stopped tracees are reported as returns with pid greater than 0 and WIFSTOPPED(status) true.
The __WALL flag does not include the WSTOPPED and WEXITED flags, but implies their functionality.
Setting the WCONTINUED flag when calling waitpid(2) is not recommended: the “continued” state is per-process and consuming it can confuse the real parent of the tracee.
Use of the WNOHANG flag may cause waitpid(2) to return 0 (“no wait results available yet”) even if the tracer knows there should be a notification. Example:
errno = 0;
ptrace(PTRACE_CONT, pid, 0L, 0L);
if (errno == ESRCH) {
/* tracee is dead */
r = waitpid(pid, &status, __WALL | WNOHANG);
/* r can still be 0 here! */
}
The following kinds of ptrace-stops exist: signal-delivery-stops, group-stops, PTRACE_EVENT stops, syscall-stops. They all are reported by waitpid(2) with WIFSTOPPED(status) true. They may be differentiated by examining the value status>>8, and if there is ambiguity in that value, by querying PTRACE_GETSIGINFO. (Note: the WSTOPSIG(status) macro can’t be used to perform this examination, because it returns the value (status>>8) & 0xff.)
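A first-pass classification of a wait status along these lines could be sketched as below. The enum and function names are hypothetical, and the syscall-stop case assumes the tracer has set PTRACE_O_TRACESYSGOOD; without that option a syscall-stop reports plain SIGTRAP and falls into the ambiguous last category, which needs PTRACE_GETSIGINFO to resolve.

```c
#include <signal.h>
#include <sys/wait.h>

enum stop_kind {
    STOP_NONE,            /* not a ptrace-stop (exited, killed, ...) */
    STOP_EVENT,           /* PTRACE_EVENT stop: event number is status>>16 */
    STOP_SYSCALL,         /* syscall-stop marked by PTRACE_O_TRACESYSGOOD */
    STOP_SIGNAL_OR_GROUP  /* signal-delivery-stop or group-stop */
};

/* Classify a status returned by waitpid(pid, &status, __WALL). */
static enum stop_kind classify_wait_status(int status)
{
    if (!WIFSTOPPED(status))
        return STOP_NONE;
    if ((status >> 16) != 0)          /* event bits live above WSTOPSIG */
        return STOP_EVENT;
    if (WSTOPSIG(status) == (SIGTRAP | 0x80))
        return STOP_SYSCALL;
    return STOP_SIGNAL_OR_GROUP;
}
```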
Signal-delivery-stop
When a (possibly multithreaded) process receives any signal except SIGKILL, the kernel selects an arbitrary thread which handles the signal. (If the signal is generated with tgkill(2), the target thread can be explicitly selected by the caller.) If the selected thread is traced, it enters signal-delivery-stop. At this point, the signal is not yet delivered to the process, and can be suppressed by the tracer. If the tracer doesn’t suppress the signal, it passes the signal to the tracee in the next ptrace restart operation. This second step of signal delivery is called signal injection in this manual page. Note that if the signal is blocked, signal-delivery-stop doesn’t happen until the signal is unblocked, with the usual exception that SIGSTOP can’t be blocked.
Signal-delivery-stop is observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, with the signal returned by WSTOPSIG(status). If the signal is SIGTRAP, this may be a different kind of ptrace-stop; see the “Syscall-stops” and “execve” sections below for details. If WSTOPSIG(status) returns a stopping signal, this may be a group-stop; see below.
Signal injection and suppression
After signal-delivery-stop is observed by the tracer, the tracer should restart the tracee with the call
ptrace(PTRACE_restart, pid, 0, sig)
where PTRACE_restart is one of the restarting ptrace operations. If sig is 0, then a signal is not delivered. Otherwise, the signal sig is delivered. This operation is called signal injection in this manual page, to distinguish it from signal-delivery-stop.
The sig value may be different from the WSTOPSIG(status) value: the tracer can cause a different signal to be injected.
Note that a suppressed signal still causes system calls to return prematurely. In this case, system calls will be restarted: the tracer will observe the tracee to reexecute the interrupted system call (or restart_syscall(2) system call for a few system calls which use a different mechanism for restarting) if the tracer uses PTRACE_SYSCALL. Even system calls (such as poll(2)) which are not restartable after signal are restarted after signal is suppressed; however, kernel bugs exist which cause some system calls to fail with EINTR even though no observable signal is injected to the tracee.
Restarting ptrace commands issued in ptrace-stops other than signal-delivery-stop are not guaranteed to inject a signal, even if sig is nonzero. No error is reported; a nonzero sig may simply be ignored. Ptrace users should not try to “create a new signal” this way: use tgkill(2) instead.
The fact that signal injection operations may be ignored when restarting the tracee after ptrace stops that are not signal-delivery-stops is a cause of confusion among ptrace users. One typical scenario is that the tracer observes group-stop, mistakes it for signal-delivery-stop, restarts the tracee with
ptrace(PTRACE_restart, pid, 0, stopsig)
with the intention of injecting stopsig, but stopsig gets ignored and the tracee continues to run.
The SIGCONT signal has a side effect of waking up (all threads of) a group-stopped process. This side effect happens before signal-delivery-stop. The tracer can’t suppress this side effect (it can only suppress signal injection, which only causes the SIGCONT handler to not be executed in the tracee, if such a handler is installed). In fact, waking up from group-stop may be followed by signal-delivery-stop for signal(s) other than SIGCONT, if they were pending when SIGCONT was delivered. In other words, SIGCONT may not be the first signal observed by the tracee after it was sent.
Stopping signals cause (all threads of) a process to enter group-stop. This side effect happens after signal injection, and therefore can be suppressed by the tracer.
In Linux 2.4 and earlier, the SIGSTOP signal can’t be injected.
PTRACE_GETSIGINFO can be used to retrieve a siginfo_t structure which corresponds to the delivered signal. PTRACE_SETSIGINFO may be used to modify it. If PTRACE_SETSIGINFO has been used to alter siginfo_t, the si_signo field and the sig parameter in the restarting command must match, otherwise the result is undefined.
Group-stop
When a (possibly multithreaded) process receives a stopping signal, all threads stop. If some threads are traced, they enter a group-stop. Note that the stopping signal will first cause signal-delivery-stop (on one tracee only), and only after it is injected by the tracer (or after it was dispatched to a thread which isn’t traced), will group-stop be initiated on all tracees within the multithreaded process. As usual, every tracee reports its group-stop separately to the corresponding tracer.
Group-stop is observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, with the stopping signal available via WSTOPSIG(status). The same result is returned by some other classes of ptrace-stops, therefore the recommended practice is to perform the call
ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
The call can be avoided if the signal is not SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU; only these four signals are stopping signals. If the tracer sees something else, it can’t be a group-stop. Otherwise, the tracer needs to call PTRACE_GETSIGINFO. If PTRACE_GETSIGINFO fails with EINVAL, then it is definitely a group-stop. (Other failure codes are possible, such as ESRCH (“no such process”) if a SIGKILL killed the tracee.)
If the tracee was attached using PTRACE_SEIZE, group-stop is indicated by PTRACE_EVENT_STOP: status>>16 == PTRACE_EVENT_STOP. This allows detection of group-stops without requiring an extra PTRACE_GETSIGINFO call.
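The decision procedure for recognizing a group-stop can be sketched as a small function. The names below are hypothetical, and the seized flag is assumed to be tracked by the tracer itself (true if it attached with PTRACE_SEIZE). The sketch also assumes a C library whose <sys/ptrace.h> defines PTRACE_EVENT_STOP.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

enum group_stop_answer {
    GS_NO,              /* cannot be a group-stop */
    GS_YES,             /* definitely a group-stop */
    GS_ASK_GETSIGINFO   /* ambiguous: EINVAL from PTRACE_GETSIGINFO means yes */
};

/* Only these four signals are stopping signals. */
static bool is_stopping_signal(int sig)
{
    return sig == SIGSTOP || sig == SIGTSTP ||
           sig == SIGTTIN || sig == SIGTTOU;
}

static enum group_stop_answer check_group_stop(int status, bool seized)
{
    if (!WIFSTOPPED(status) || !is_stopping_signal(WSTOPSIG(status)))
        return GS_NO;
    if (seized)  /* seized tracees flag group-stops with PTRACE_EVENT_STOP */
        return (status >> 16) == PTRACE_EVENT_STOP ? GS_YES : GS_NO;
    return GS_ASK_GETSIGINFO;
}
```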
As of Linux 2.6.38, after the tracer sees the tracee ptrace-stop and until it restarts or kills it, the tracee will not run, and will not send notifications (except SIGKILL death) to the tracer, even if the tracer enters into another waitpid(2) call.
The kernel behavior described in the previous paragraph causes a problem with transparent handling of stopping signals. If the tracer restarts the tracee after group-stop, the stopping signal is effectively ignored: the tracee doesn’t remain stopped, it runs. If the tracer doesn’t restart the tracee before entering into the next waitpid(2), future SIGCONT signals will not be reported to the tracer; this would cause the SIGCONT signals to have no effect on the tracee.
Since Linux 3.4, there is a method to overcome this problem: instead of PTRACE_CONT, a PTRACE_LISTEN command can be used to restart a tracee in a way where it does not execute, but waits for a new event which it can report via waitpid(2) (such as when it is restarted by a SIGCONT).
PTRACE_EVENT stops
If the tracer sets PTRACE_O_TRACE_* options, the tracee will enter ptrace-stops called PTRACE_EVENT stops.
PTRACE_EVENT stops are observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, and WSTOPSIG(status) returns SIGTRAP (or for PTRACE_EVENT_STOP, returns the stopping signal if the tracee is in a group-stop). An additional bit is set in the higher byte of the status word: the value status>>8 will be
((PTRACE_EVENT_foo<<8) | SIGTRAP).
The following events exist:
PTRACE_EVENT_VFORK
Stop before return from vfork(2) or clone(2) with the CLONE_VFORK flag. When the tracee is continued after this stop, it will wait for child to exit/exec before continuing its execution (in other words, the usual behavior on vfork(2)).
PTRACE_EVENT_FORK
Stop before return from fork(2) or clone(2) with the exit signal set to SIGCHLD.
PTRACE_EVENT_CLONE
Stop before return from clone(2).
PTRACE_EVENT_VFORK_DONE
Stop before return from vfork(2) or clone(2) with the CLONE_VFORK flag, but after the child unblocked this tracee by exiting or execing.
For all four stops described above, the stop occurs in the parent (i.e., the tracee), not in the newly created thread. PTRACE_GETEVENTMSG can be used to retrieve the new thread’s ID.
PTRACE_EVENT_EXEC
Stop before return from execve(2). Since Linux 3.0, PTRACE_GETEVENTMSG returns the former thread ID.
PTRACE_EVENT_EXIT
Stop before exit (including death from exit_group(2)), signal death, or exit caused by execve(2) in a multithreaded process. PTRACE_GETEVENTMSG returns the exit status. Registers can be examined (unlike when “real” exit happens). The tracee is still alive; it needs to be PTRACE_CONTed or PTRACE_DETACHed to finish exiting.
PTRACE_EVENT_STOP
Stop induced by PTRACE_INTERRUPT command, or group-stop, or initial ptrace-stop when a new child is attached (only if attached using PTRACE_SEIZE).
PTRACE_EVENT_SECCOMP
Stop triggered by a seccomp(2) rule on tracee syscall entry when PTRACE_O_TRACESECCOMP has been set by the tracer. The seccomp event message data (from the SECCOMP_RET_DATA portion of the seccomp filter rule) can be retrieved with PTRACE_GETEVENTMSG. The semantics of this stop are described in detail in a separate section below.
PTRACE_GETSIGINFO on PTRACE_EVENT stops returns SIGTRAP in si_signo, with si_code set to (event<<8) | SIGTRAP.
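The status>>8 test given above translates directly into code. The helper name is_ptrace_event is hypothetical; the comparison is the one from the text. Because it requires SIGTRAP in the low byte, it would not match a PTRACE_EVENT_STOP that reports a stopping signal during group-stop.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* True if status, as returned by waitpid(2), reports the given
 * PTRACE_EVENT_* stop with SIGTRAP as the stop signal. */
static bool is_ptrace_event(int status, int event)
{
    return WIFSTOPPED(status) &&
           (status >> 8) == (SIGTRAP | (event << 8));
}
```

For example, a tracer that set PTRACE_O_TRACEEXEC could test is_ptrace_event(status, PTRACE_EVENT_EXEC) before calling PTRACE_GETEVENTMSG.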
Syscall-stops
If the tracee was restarted by PTRACE_SYSCALL or PTRACE_SYSEMU, the tracee enters syscall-enter-stop just prior to entering any system call (which will not be executed if the restart was using PTRACE_SYSEMU, regardless of any change made to registers at this point or how the tracee is restarted after this stop). No matter which method caused the syscall-enter-stop, if the tracer restarts the tracee with PTRACE_SYSCALL, the tracee enters syscall-exit-stop when the system call is finished, or if it is interrupted by a signal. (That is, signal-delivery-stop never happens between syscall-enter-stop and syscall-exit-stop; it happens after syscall-exit-stop.) If the tracee is continued using any other method (including PTRACE_SYSEMU), no syscall-exit-stop occurs. Note that all mentions of PTRACE_SYSEMU apply equally to PTRACE_SYSEMU_SINGLESTEP.
However, even if the tracee was continued using PTRACE_SYSCALL, it is not guaranteed that the next stop will be a syscall-exit-stop. Other possibilities are that the tracee may stop in a PTRACE_EVENT stop (including seccomp stops), exit (if it entered _exit(2) or exit_group(2)), be killed by SIGKILL, or die silently (if it is a thread group leader, the execve(2) happened in another thread, and that thread is not traced by the same tracer; this situation is discussed later).
Syscall-enter-stop and syscall-exit-stop are observed by the tracer as waitpid(2) returning with WIFSTOPPED(status) true, and WSTOPSIG(status) giving SIGTRAP. If the PTRACE_O_TRACESYSGOOD option was set by the tracer, then WSTOPSIG(status) will give the value (SIGTRAP | 0x80).
Syscall-stops can be distinguished from signal-delivery-stop with SIGTRAP by querying PTRACE_GETSIGINFO for the following cases:
si_code <= 0
SIGTRAP was delivered as a result of a user-space action, for example, a system call (tgkill(2), kill(2), sigqueue(3), etc.), expiration of a POSIX timer, change of state on a POSIX message queue, or completion of an asynchronous I/O operation.
si_code == SI_KERNEL (0x80)
SIGTRAP was sent by the kernel.
si_code == SIGTRAP or si_code == (SIGTRAP|0x80)
This is a syscall-stop.
However, syscall-stops happen very often (twice per system call), and performing PTRACE_GETSIGINFO for every syscall-stop may be somewhat expensive.
Some architectures allow the cases to be distinguished by examining registers. For example, on x86, rax == -ENOSYS in syscall-enter-stop. Since SIGTRAP (like any other signal) always happens after syscall-exit-stop, and at this point rax almost never contains -ENOSYS, the SIGTRAP looks like “syscall-stop which is not syscall-enter-stop”; in other words, it looks like a “stray syscall-exit-stop” and can be detected this way. But such detection is fragile and is best avoided.
Using the PTRACE_O_TRACESYSGOOD option is the recommended method to distinguish syscall-stops from other kinds of ptrace-stops, since it is reliable and does not incur a performance penalty.
Syscall-enter-stop and syscall-exit-stop are indistinguishable from each other by the tracer. The tracer needs to keep track of the sequence of ptrace-stops in order to not misinterpret syscall-enter-stop as syscall-exit-stop or vice versa. In general, a syscall-enter-stop is always followed by syscall-exit-stop, PTRACE_EVENT stop, or the tracee’s death; no other kinds of ptrace-stop can occur in between. However, note that seccomp stops (see below) can cause syscall-exit-stops, without preceding syscall-entry-stops. If seccomp is in use, care needs to be taken not to misinterpret such stops as syscall-entry-stops.
If after syscall-enter-stop, the tracer uses a restarting command other than PTRACE_SYSCALL, syscall-exit-stop is not generated.
PTRACE_GETSIGINFO on syscall-stops returns SIGTRAP in si_signo, with si_code set to SIGTRAP or (SIGTRAP|0x80).
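Since the two syscall-stops look identical, a tracer usually keeps a per-tracee flag and flips it at each one. The sketch below uses hypothetical names and covers only the simple case: it assumes every syscall-enter-stop is restarted with PTRACE_SYSCALL and that no seccomp stops are in play, and it offers a reset hook for the situations in which the exit stop will not arrive.

```c
#include <stdbool.h>

/* Hypothetical per-tracee tracker; initialize in_syscall to false. */
struct syscall_tracker {
    bool in_syscall;  /* true between syscall-enter-stop and its exit stop */
};

enum syscall_stop { STOP_IS_ENTER, STOP_IS_EXIT };

/* Call once for every syscall-stop reported for this tracee. */
static enum syscall_stop on_syscall_stop(struct syscall_tracker *t)
{
    t->in_syscall = !t->in_syscall;
    return t->in_syscall ? STOP_IS_ENTER : STOP_IS_EXIT;
}

/* Call when no syscall-exit-stop will follow, e.g. the tracee was
 * restarted from syscall-enter-stop with something other than
 * PTRACE_SYSCALL. */
static void on_exit_stop_skipped(struct syscall_tracker *t)
{
    t->in_syscall = false;
}
```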
PTRACE_EVENT_SECCOMP stops (Linux 3.5 to Linux 4.7)
The behavior of PTRACE_EVENT_SECCOMP stops and their interaction with other kinds of ptrace stops has changed between kernel versions. This documents the behavior from their introduction until Linux 4.7 (inclusive). The behavior in later kernel versions is documented in the next section.
A PTRACE_EVENT_SECCOMP stop occurs whenever a SECCOMP_RET_TRACE rule is triggered. This is independent of which method was used to restart the system call. Notably, seccomp still runs even if the tracee was restarted using PTRACE_SYSEMU and this system call is unconditionally skipped.
Restarts from this stop will behave as if the stop had occurred right before the system call in question. In particular, both PTRACE_SYSCALL and PTRACE_SYSEMU will normally cause a subsequent syscall-entry-stop. However, if after the PTRACE_EVENT_SECCOMP the system call number is negative, both the syscall-entry-stop and the system call itself will be skipped. This means that if the system call number is negative after a PTRACE_EVENT_SECCOMP and the tracee is restarted using PTRACE_SYSCALL, the next observed stop will be a syscall-exit-stop, rather than the syscall-entry-stop that might have been expected.
PTRACE_EVENT_SECCOMP stops (since Linux 4.8)
Starting with Linux 4.8, the PTRACE_EVENT_SECCOMP stop was reordered to occur between syscall-entry-stop and syscall-exit-stop. Note that seccomp no longer runs (and no PTRACE_EVENT_SECCOMP will be reported) if the system call is skipped due to PTRACE_SYSEMU.
A PTRACE_EVENT_SECCOMP stop behaves much like a syscall-entry-stop (i.e., continuations using PTRACE_SYSCALL will cause syscall-exit-stops, the system call number may be changed, and any other modified registers are visible to the to-be-executed system call as well). Note that a syscall-entry-stop may, but need not, have preceded it.
After a PTRACE_EVENT_SECCOMP stop, seccomp will be rerun, with a SECCOMP_RET_TRACE rule now functioning the same as a SECCOMP_RET_ALLOW. Specifically, this means that if registers are not modified during the PTRACE_EVENT_SECCOMP stop, the system call will then be allowed.
PTRACE_SINGLESTEP stops
[Details of these kinds of stops are yet to be documented.]
Informational and restarting ptrace commands
Most ptrace commands (all except PTRACE_ATTACH, PTRACE_SEIZE, PTRACE_TRACEME, PTRACE_INTERRUPT, and PTRACE_KILL) require the tracee to be in a ptrace-stop, otherwise they fail with ESRCH.
When the tracee is in ptrace-stop, the tracer can read and write data to the tracee using informational commands. These commands leave the tracee in ptrace-stopped state:
ptrace(PTRACE_PEEKTEXT/PEEKDATA/PEEKUSER, pid, addr, 0);
ptrace(PTRACE_POKETEXT/POKEDATA/POKEUSER, pid, addr, long_val);
ptrace(PTRACE_GETREGS/GETFPREGS, pid, 0, &struct);
ptrace(PTRACE_SETREGS/SETFPREGS, pid, 0, &struct);
ptrace(PTRACE_GETREGSET, pid, NT_foo, &iov);
ptrace(PTRACE_SETREGSET, pid, NT_foo, &iov);
ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo);
ptrace(PTRACE_SETSIGINFO, pid, 0, &siginfo);
ptrace(PTRACE_GETEVENTMSG, pid, 0, &long_var);
ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
Note that some errors are not reported. For example, setting signal information (siginfo) may have no effect in some ptrace-stops, yet the call may succeed (return 0 and not set errno); querying PTRACE_GETEVENTMSG may succeed and return some random value if current ptrace-stop is not documented as returning a meaningful event message.
The call
ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_flags);
affects one tracee. The tracee’s current flags are replaced. Flags are inherited by new tracees created and “auto-attached” via active PTRACE_O_TRACEFORK, PTRACE_O_TRACEVFORK, or PTRACE_O_TRACECLONE options.
Another group of commands makes the ptrace-stopped tracee run. They have the form:
ptrace(cmd, pid, 0, sig);
where cmd is PTRACE_CONT, PTRACE_LISTEN, PTRACE_DETACH, PTRACE_SYSCALL, PTRACE_SINGLESTEP, PTRACE_SYSEMU, or PTRACE_SYSEMU_SINGLESTEP. If the tracee is in signal-delivery-stop, sig is the signal to be injected (if it is nonzero). Otherwise, sig may be ignored. (When restarting a tracee from a ptrace-stop other than signal-delivery-stop, recommended practice is to always pass 0 in sig.)
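The choice of the sig argument for a restarting command can be sketched as a tiny helper. The name restart_signal and its parameters are hypothetical; the tracer is assumed to know, from its own classification of the stop, whether the tracee is in signal-delivery-stop.

```c
#include <stdbool.h>

/* Pick the value to pass as sig in ptrace(cmd, pid, 0, sig).
 * Outside signal-delivery-stop the recommended value is always 0;
 * inside it, 0 suppresses the signal and stopsig injects it. */
static int restart_signal(bool in_signal_delivery_stop, int stopsig,
                          bool suppress)
{
    if (!in_signal_delivery_stop || suppress)
        return 0;
    return stopsig;
}
```

A tracer could then restart with, for instance, ptrace(PTRACE_CONT, pid, 0, restart_signal(in_sds, WSTOPSIG(status), false)), where in_sds is its own flag.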
Attaching and detaching
A thread can be attached to the tracer using the call
ptrace(PTRACE_ATTACH, pid, 0, 0);
or
ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_flags);
PTRACE_ATTACH sends SIGSTOP to this thread. If the tracer wants this SIGSTOP to have no effect, it needs to suppress it. Note that if other signals are concurrently sent to this thread during attach, the tracer may see the tracee enter signal-delivery-stop with other signal(s) first! The usual practice is to reinject these signals until SIGSTOP is seen, then suppress SIGSTOP injection. The design bug here is that a ptrace attach and a concurrently delivered SIGSTOP may race and the concurrent SIGSTOP may be lost.
Since attaching sends SIGSTOP and the tracer usually suppresses it, this may cause a stray EINTR return from the currently executing system call in the tracee, as described in the “Signal injection and suppression” section.
Since Linux 3.4, PTRACE_SEIZE can be used instead of PTRACE_ATTACH. PTRACE_SEIZE does not stop the attached process. If you need to stop it after attach (or at any other time) without sending it any signals, use PTRACE_INTERRUPT command.
The operation
ptrace(PTRACE_TRACEME, 0, 0, 0);
turns the calling thread into a tracee. The thread continues to run (doesn’t enter ptrace-stop). A common practice is to follow the PTRACE_TRACEME with
raise(SIGSTOP);
and allow the parent (which is our tracer now) to observe our signal-delivery-stop.
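The PTRACE_TRACEME and raise(SIGSTOP) handshake can be sketched end to end as follows. The function run_traced_child is a hypothetical demo, not an established API: it forks a child that requests tracing, waits for the resulting signal-delivery-stop, suppresses the SIGSTOP by restarting with sig 0, and returns the child's exit code. It returns -1 if tracing is unavailable, for example where ptrace is restricted by a sandbox.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int run_traced_child(int exit_code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: become a tracee of the parent, then stop ourselves. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);            /* parent sees signal-delivery-stop */
        _exit(exit_code);
    }

    /* Parent (now the tracer): wait for the child's first stop. */
    if (waitpid(pid, &status, __WALL) != pid)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;                 /* child terminated and was reaped */

    /* Restart with sig == 0 so that SIGSTOP is suppressed. */
    if (WSTOPSIG(status) != SIGSTOP ||
        ptrace(PTRACE_CONT, pid, NULL, NULL) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, __WALL);
        return -1;
    }

    if (waitpid(pid, &status, __WALL) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A real tracer would normally call PTRACE_SETOPTIONS at that first stop before restarting the tracee.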
If the PTRACE_O_TRACEVFORK, PTRACE_O_TRACEFORK, or PTRACE_O_TRACECLONE options are in effect, then children created by, respectively, vfork(2) or clone(2) with the CLONE_VFORK flag, fork(2) or clone(2) with the exit signal set to SIGCHLD, and other kinds of clone(2), are automatically attached to the same tracer which traced their parent. SIGSTOP is delivered to the children, causing them to enter signal-delivery-stop after they exit the system call which created them.
Detaching of the tracee is performed by:
ptrace(PTRACE_DETACH, pid, 0, sig);
PTRACE_DETACH is a restarting operation; therefore it requires the tracee to be in ptrace-stop. If the tracee is in signal-delivery-stop, a signal can be injected. Otherwise, the sig parameter may be silently ignored.
If the tracee is running when the tracer wants to detach it, the usual solution is to send SIGSTOP (using tgkill(2), to make sure it goes to the correct thread), wait for the tracee to stop in signal-delivery-stop for SIGSTOP and then detach it (suppressing SIGSTOP injection). A design bug is that this can race with concurrent SIGSTOPs. Another complication is that the tracee may enter other ptrace-stops and needs to be restarted and waited for again, until SIGSTOP is seen. Yet another complication is to be sure that the tracee is not already ptrace-stopped, because no signal delivery happens while it is (not even SIGSTOP).
If the tracer dies, all tracees are automatically detached and restarted, unless they were in group-stop. Handling of restart from group-stop is currently buggy, but the “as planned” behavior is to leave the tracee stopped and waiting for SIGCONT. If the tracee is restarted from signal-delivery-stop, the pending signal is injected.
execve(2) under ptrace
When one thread in a multithreaded process calls execve(2), the kernel destroys all other threads in the process, and resets the thread ID of the execing thread to the thread group ID (process ID). (Or, to put things another way, when a multithreaded process does an execve(2), at completion of the call, it appears as though the execve(2) occurred in the thread group leader, regardless of which thread did the execve(2).) This resetting of the thread ID looks very confusing to tracers:
All other threads stop in PTRACE_EVENT_EXIT stop, if the PTRACE_O_TRACEEXIT option was turned on. Then all other threads except the thread group leader report death as if they exited via _exit(2) with exit code 0.
The execing tracee changes its thread ID while it is in the execve(2). (Remember, under ptrace, the “pid” returned from waitpid(2), or fed into ptrace calls, is the tracee’s thread ID.) That is, the tracee’s thread ID is reset to be the same as its process ID, which is the same as the thread group leader’s thread ID.
Then a PTRACE_EVENT_EXEC stop happens, if the PTRACE_O_TRACEEXEC option was turned on.
If the thread group leader has reported its PTRACE_EVENT_EXIT stop by this time, it appears to the tracer that the dead thread leader “reappears from nowhere”. (Note: the thread group leader does not report death via WIFEXITED(status) as long as there is at least one other live thread. This eliminates the possibility that the tracer will see it dying and then reappearing.) If the thread group leader was still alive, for the tracer this may look as if the thread group leader returns from a different system call than it entered, or even “returned from a system call even though it was not in any system call”. If the thread group leader was not traced (or was traced by a different tracer), then during execve(2) it will appear as if it has become a tracee of the tracer of the execing tracee.
All of the above effects are the artifacts of the thread ID change in the tracee.
The PTRACE_O_TRACEEXEC option is the recommended tool for dealing with this situation. First, it enables PTRACE_EVENT_EXEC stop, which occurs before execve(2) returns. In this stop, the tracer can use PTRACE_GETEVENTMSG to retrieve the tracee’s former thread ID. (This feature was introduced in Linux 3.0.) Second, the PTRACE_O_TRACEEXEC option disables legacy SIGTRAP generation on execve(2).
When the tracer receives PTRACE_EVENT_EXEC stop notification, it is guaranteed that, except for this tracee and the thread group leader, no other threads from the process are alive.
On receiving the PTRACE_EVENT_EXEC stop notification, the tracer should clean up all its internal data structures describing the threads of this process, and retain only one data structure: the one which describes the single still-running tracee, with
thread ID == thread group ID == process ID.
Example: two threads call execve(2) at the same time:
*** we get syscall-enter-stop in thread 1: **
PID1 execve("/bin/foo", "foo" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 1 **
*** we get syscall-enter-stop in thread 2: **
PID2 execve("/bin/bar", "bar" <unfinished ...>
*** we issue PTRACE_SYSCALL for thread 2 **
*** we get PTRACE_EVENT_EXEC for PID0, we issue PTRACE_SYSCALL **
*** we get syscall-exit-stop for PID0: **
PID0 <... execve resumed> ) = 0
If the PTRACE_O_TRACEEXEC option is not in effect for the execing tracee, and if the tracee was PTRACE_ATTACHed rather than PTRACE_SEIZEd, the kernel delivers an extra SIGTRAP to the tracee after execve(2) returns. This is an ordinary signal (similar to one which can be generated by kill -TRAP), not a special kind of ptrace-stop. Employing PTRACE_GETSIGINFO for this signal returns si_code set to 0 (SI_USER). This signal may be blocked by signal mask, and thus may be delivered (much) later.
Usually, the tracer (for example, strace(1)) would not want to show this extra post-execve SIGTRAP signal to the user, and would suppress its delivery to the tracee (if SIGTRAP is set to SIG_DFL, it is a killing signal). However, determining which SIGTRAP to suppress is not easy. Setting the PTRACE_O_TRACEEXEC option or using PTRACE_SEIZE and thus suppressing this extra SIGTRAP is the recommended approach.
Real parent
The ptrace API (ab)uses the standard UNIX parent/child signaling over waitpid(2). This used to cause the real parent of the process to stop receiving several kinds of waitpid(2) notifications when the child process is traced by some other process.
Many of these bugs have been fixed, but as of Linux 2.6.38 several still exist; see BUGS below.
As of Linux 2.6.38, the following is believed to work correctly:
- exit/death by signal is reported first to the tracer, then, when the tracer consumes the waitpid(2) result, to the real parent (to the real parent only when the whole multithreaded process exits). If the tracer and the real parent are the same process, the report is sent only once.
RETURN VALUE
On success, the PTRACE_PEEK* operations return the requested data (but see NOTES), the PTRACE_SECCOMP_GET_FILTER operation returns the number of instructions in the BPF program, the PTRACE_GET_SYSCALL_INFO operation returns the number of bytes available to be written by the kernel, and other operations return zero.
On error, all operations return -1, and errno is set to indicate the error. Since the value returned by a successful PTRACE_PEEK* operation may be -1, the caller must clear errno before the call, and then check it afterward to determine whether or not an error occurred.
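The errno convention for PTRACE_PEEK* can be sketched as below. Both helper names, peek_word() and peek_child_word(), are hypothetical. The second helper exists only to exercise the first: it forks a traced child and reads back a word whose address is the same in parent and child.

```c
#define _GNU_SOURCE             /* expose POSIX/Linux declarations */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Word that the child inherits at the same address after fork(). */
static long probe_word;

/* Hypothetical wrapper: returns 0 and fills *out on success, -1 on
   error. errno is cleared first because -1 is also a valid word. */
int peek_word(pid_t pid, void *addr, long *out)
{
    assert(out != NULL);
    errno = 0;
    long w = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (w == -1 && errno != 0)
        return -1;
    *out = w;
    return 0;
}

/* Demo driver: store value, fork a traced child, peek the word back. */
int peek_child_word(long value, long *out)
{
    probe_word = value;

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);         /* stop so the parent can peek */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;

    int rc = peek_word(pid, &probe_word, out);

    kill(pid, SIGKILL);         /* the demo child is no longer needed */
    waitpid(pid, &status, 0);
    return rc;
}
```

With this shape, a stored value of -1 is reported as a successful read rather than being mistaken for an error.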
ERRORS
EBUSY
(i386 only) There was an error with allocating or freeing a debug register.
EFAULT
There was an attempt to read from or write to an invalid area in the tracer’s or the tracee’s memory, probably because the area wasn’t mapped or accessible. Unfortunately, under Linux, different variations of this fault will return EIO or EFAULT more or less arbitrarily.
EINVAL
An attempt was made to set an invalid option.
EIO
op is invalid, or an attempt was made to read from or write to an invalid area in the tracer’s or the tracee’s memory, or there was a word-alignment violation, or an invalid signal was specified during a restart operation.
EPERM
The specified process cannot be traced. This could be because the tracer has insufficient privileges (the required capability is CAP_SYS_PTRACE); unprivileged processes cannot trace processes that they cannot send signals to or those running set-user-ID/set-group-ID programs, for obvious reasons. Alternatively, the process may already be being traced, or (before Linux 2.6.26) be init(1) (PID 1).
ESRCH
The specified process does not exist, or is not currently being traced by the caller, or is not stopped (for operations that require a stopped tracee).
STANDARDS
None.
HISTORY
SVr4, 4.3BSD.
Before Linux 2.6.26, init(1), the process with PID 1, may not be traced.
NOTES
Although arguments to ptrace() are interpreted according to the prototype given, glibc currently declares ptrace() as a variadic function with only the op argument fixed. It is recommended to always supply four arguments, even if the requested operation does not use them, setting unused/ignored arguments to 0L or (void *) 0.
A tracee’s parent continues to be the tracer even if that tracer calls execve(2).
The layout of the contents of memory and the USER area are quite operating-system- and architecture-specific. The offset supplied, and the data returned, might not entirely match with the definition of struct user.
The size of a “word” is determined by the operating-system variant (e.g., for 32-bit Linux it is 32 bits).
This page documents the way the ptrace() call works currently in Linux. Its behavior differs significantly on other flavors of UNIX. In any case, use of ptrace() is highly specific to the operating system and architecture.
Ptrace access mode checking
Various parts of the kernel-user-space API (not just ptrace() operations) require so-called “ptrace access mode” checks, whose outcome determines whether an operation is permitted (or, in a few cases, causes a “read” operation to return sanitized data). These checks are performed in cases where one process can inspect sensitive information about, or in some cases modify the state of, another process. The checks are based on factors such as the credentials and capabilities of the two processes, whether or not the “target” process is dumpable, and the results of checks performed by any enabled Linux Security Module (LSM), for example SELinux, Yama, or Smack, and by the commoncap LSM (which is always invoked).
Prior to Linux 2.6.27, all access checks were of a single type. Since Linux 2.6.27, two access mode levels are distinguished:
PTRACE_MODE_READ
For “read” operations or other operations that are less dangerous, such as: get_robust_list(2); kcmp(2); reading /proc/pid/auxv, /proc/pid/environ, or /proc/pid/stat; or readlink(2) of a /proc/pid/ns/* file.
PTRACE_MODE_ATTACH
For “write” operations, or other operations that are more dangerous, such as: ptrace attaching (PTRACE_ATTACH) to another process or calling process_vm_writev(2). (PTRACE_MODE_ATTACH was effectively the default before Linux 2.6.27.)
Since Linux 4.5, the above access mode checks are combined (ORed) with one of the following modifiers:
PTRACE_MODE_FSCREDS
Use the caller’s filesystem UID and GID (see credentials(7)) or effective capabilities for LSM checks.
PTRACE_MODE_REALCREDS
Use the caller’s real UID and GID or permitted capabilities for LSM checks. This was effectively the default before Linux 4.5.
Because combining one of the credential modifiers with one of the aforementioned access modes is typical, some macros are defined in the kernel sources for the combinations:
PTRACE_MODE_READ_FSCREDS
Defined as PTRACE_MODE_READ | PTRACE_MODE_FSCREDS.
PTRACE_MODE_READ_REALCREDS
Defined as PTRACE_MODE_READ | PTRACE_MODE_REALCREDS.
PTRACE_MODE_ATTACH_FSCREDS
Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_FSCREDS.
PTRACE_MODE_ATTACH_REALCREDS
Defined as PTRACE_MODE_ATTACH | PTRACE_MODE_REALCREDS.
One further modifier can be ORed with the access mode:
PTRACE_MODE_NOAUDIT (since Linux 3.3)
Don’t audit this access mode check. This modifier is employed for ptrace access mode checks (such as checks when reading /proc/pid/stat) that merely cause the output to be filtered or sanitized, rather than causing an error to be returned to the caller. In these cases, accessing the file is not a security violation and there is no reason to generate a security audit record. This modifier suppresses the generation of such an audit record for the particular access check.
Note that all of the PTRACE_MODE_* constants described in this subsection are kernel-internal, and not visible to user space. The constant names are mentioned here in order to label the various kinds of ptrace access mode checks that are performed for various system calls and accesses to various pseudofiles (e.g., under /proc). These names are used in other manual pages to provide a simple shorthand for labeling the different kernel checks.
The algorithm employed for ptrace access mode checking determines whether the calling process is allowed to perform the corresponding action on the target process. (In the case of opening /proc/pid files, the “calling process” is the one opening the file, and the process with the corresponding PID is the “target process”.) The algorithm is as follows:
(1) If the calling thread and the target thread are in the same thread group, access is always allowed.
(2) If the access mode specifies PTRACE_MODE_FSCREDS, then, for the check in the next step, employ the caller’s filesystem UID and GID. (As noted in credentials(7), the filesystem UID and GID almost always have the same values as the corresponding effective IDs.)
Otherwise, the access mode specifies PTRACE_MODE_REALCREDS, so use the caller’s real UID and GID for the checks in the next step. (Most APIs that check the caller’s UID and GID use the effective IDs. For historical reasons, the PTRACE_MODE_REALCREDS check uses the real IDs instead.)
(3) Deny access if neither of the following is true:
- The real, effective, and saved-set user IDs of the target match the caller’s user ID, and the real, effective, and saved-set group IDs of the target match the caller’s group ID.
- The caller has the CAP_SYS_PTRACE capability in the user namespace of the target.
(4) Deny access if the target process “dumpable” attribute has a value other than 1 (SUID_DUMP_USER; see the discussion of PR_SET_DUMPABLE in prctl(2)), and the caller does not have the CAP_SYS_PTRACE capability in the user namespace of the target process.
(5) The kernel LSM security_ptrace_access_check() interface is invoked to see if ptrace access is permitted. The results depend on the LSM(s). The implementation of this interface in the commoncap LSM performs the following steps:
(5.1) If the access mode includes PTRACE_MODE_FSCREDS, then use the caller’s effective capability set in the following check; otherwise (the access mode specifies PTRACE_MODE_REALCREDS, so) use the caller’s permitted capability set.
(5.2) Deny access if neither of the following is true:
- The caller and the target process are in the same user namespace, and the caller’s capabilities are a superset of the target process’s permitted capabilities.
- The caller has the CAP_SYS_PTRACE capability in the target process’s user namespace.
Note that the commoncap LSM does not distinguish between PTRACE_MODE_READ and PTRACE_MODE_ATTACH.
(6) If access has not been denied by any of the preceding steps, then access is allowed.
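The decision sequence above can be modeled as a small pure function. This is only a simplified illustration: the struct layouts, the enum, and the name ptrace_may_access_model() are all hypothetical, the capability and LSM checks are reduced to booleans supplied by the caller, and the kernel's real implementation is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical, simplified view of the calling process. */
struct caller {
    pid_t tgid;            /* thread group ID */
    uid_t ruid, fsuid;     /* real and filesystem UID */
    gid_t rgid, fsgid;     /* real and filesystem GID */
    bool  cap_sys_ptrace;  /* CAP_SYS_PTRACE in the target's user
                              namespace, in whichever capability set
                              the access mode selects */
};

/* Hypothetical, simplified view of the target process. */
struct target {
    pid_t tgid;
    uid_t ruid, euid, suid;
    gid_t rgid, egid, sgid;
    int   dumpable;        /* 1 corresponds to SUID_DUMP_USER */
};

enum cred_mode { MODE_FSCREDS, MODE_REALCREDS };

bool ptrace_may_access_model(const struct caller *c,
                             const struct target *t,
                             enum cred_mode mode, bool lsm_allows)
{
    /* Same thread group: always allowed. */
    if (c->tgid == t->tgid)
        return true;

    /* Select which caller IDs are compared against the target. */
    uid_t uid = (mode == MODE_FSCREDS) ? c->fsuid : c->ruid;
    gid_t gid = (mode == MODE_FSCREDS) ? c->fsgid : c->rgid;

    /* All three target UIDs and GIDs must match, or the caller
       must hold the capability. */
    bool ids_match =
        t->ruid == uid && t->euid == uid && t->suid == uid &&
        t->rgid == gid && t->egid == gid && t->sgid == gid;
    if (!ids_match && !c->cap_sys_ptrace)
        return false;

    /* Non-dumpable targets require the capability. */
    if (t->dumpable != 1 && !c->cap_sys_ptrace)
        return false;

    /* LSM verdict (commoncap and any others), modeled as one flag. */
    if (!lsm_allows)
        return false;

    return true;
}
```

The model makes one design point visible: a mismatch in any single target ID (for example, an effective UID of 0 in a set-user-ID program) is enough to deny an unprivileged caller.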
/proc/sys/kernel/yama/ptrace_scope
On systems with the Yama Linux Security Module (LSM) installed (i.e., the kernel was configured with CONFIG_SECURITY_YAMA), the /proc/sys/kernel/yama/ptrace_scope file (available since Linux 3.4) can be used to restrict the ability to trace a process with ptrace() (and thus also the ability to use tools such as strace(1) and gdb(1)). The goal of such restrictions is to prevent attack escalation whereby a compromised process can ptrace-attach to other sensitive processes (e.g., a GPG agent or an SSH session) owned by the user in order to gain additional credentials that may exist in memory and thus expand the scope of the attack.
More precisely, the Yama LSM limits two types of operations:
Any operation that performs a ptrace access mode PTRACE_MODE_ATTACH check, for example, ptrace() PTRACE_ATTACH. (See the “Ptrace access mode checking” discussion above.)
ptrace() PTRACE_TRACEME.
A process that has the CAP_SYS_PTRACE capability can update the /proc/sys/kernel/yama/ptrace_scope file with one of the following values:
0 (“classic ptrace permissions”)
No additional restrictions on operations that perform PTRACE_MODE_ATTACH checks (beyond those imposed by the commoncap and other LSMs).
The use of PTRACE_TRACEME is unchanged.
1 (“restricted ptrace”) [default value]
When performing an operation that requires a PTRACE_MODE_ATTACH check, the calling process must either have the CAP_SYS_PTRACE capability in the user namespace of the target process or it must have a predefined relationship with the target process. By default, the predefined relationship is that the target process must be a descendant of the caller.
A target process can employ the prctl(2) PR_SET_PTRACER operation to declare an additional PID that is allowed to perform PTRACE_MODE_ATTACH operations on the target. See the kernel source file Documentation/admin-guide/LSM/Yama.rst (or Documentation/security/Yama.txt before Linux 4.13) for further details.
The use of PTRACE_TRACEME is unchanged.
2 (“admin-only attach”)
Only processes with the CAP_SYS_PTRACE capability in the user namespace of the target process may perform PTRACE_MODE_ATTACH operations or trace children that employ PTRACE_TRACEME.
3 (“no attach”)
No process may perform PTRACE_MODE_ATTACH operations or trace children that employ PTRACE_TRACEME.
Once this value has been written to the file, it cannot be changed.
With respect to values 1 and 2, note that creating a new user namespace effectively removes the protection offered by Yama. This is because a process in the parent user namespace whose effective UID matches the UID of the creator of a child namespace has all capabilities (including CAP_SYS_PTRACE) when performing operations within the child user namespace (and further-removed descendants of that namespace). Consequently, when a process tries to use user namespaces to sandbox itself, it inadvertently weakens the protections offered by the Yama LSM.
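As an illustration, a program can inspect the current Yama setting before attempting an attach. The function names below are hypothetical, and the sketch assumes only the file path and the four values described above; on kernels without Yama the file is absent and the reader returns -1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Map a ptrace_scope value to the label used in the list above. */
const char *yama_scope_name(int value)
{
    switch (value) {
    case 0:  return "classic ptrace permissions";
    case 1:  return "restricted ptrace";
    case 2:  return "admin-only attach";
    case 3:  return "no attach";
    default: return "unknown";
    }
}

/* Returns the current ptrace_scope value, or -1 if the file cannot
   be read (for example, because Yama is not enabled). */
int read_yama_scope(void)
{
    FILE *f = fopen("/proc/sys/kernel/yama/ptrace_scope", "r");
    if (f == NULL)
        return -1;

    int value;
    int matched = fscanf(f, "%d", &value);
    fclose(f);
    return matched == 1 ? value : -1;
}
```

A caller might print yama_scope_name(read_yama_scope()) in a diagnostic message when PTRACE_ATTACH fails with EPERM.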
C library/kernel differences
At the system call level, the PTRACE_PEEKTEXT, PTRACE_PEEKDATA, and PTRACE_PEEKUSER operations have a different API: they store the result at the address specified by the data parameter, and the return value is the error flag. The glibc wrapper function provides the API given in DESCRIPTION above, with the result being returned via the function return value.
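The difference can be observed by issuing the same PTRACE_PEEKDATA request through the glibc wrapper and through syscall(2). This is a sketch under assumptions: the helper name peek_both_ways() is hypothetical, and the raw calling convention shown is the generic one (some architectures may differ).

```c
#define _GNU_SOURCE             /* for syscall() and POSIX declarations */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Word that the child inherits at the same address after fork(). */
static long probe_word;

/* Hypothetical demo: read one word from a traced child via both APIs.
   Returns 0 if both reads succeeded, -1 otherwise. */
int peek_both_ways(long value, long *via_wrapper, long *via_syscall)
{
    probe_word = value;

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;

    int rc = 0;

    /* glibc wrapper: the word is the return value, so errno must be
       cleared first to tell a stored -1 from an error. */
    errno = 0;
    *via_wrapper = ptrace(PTRACE_PEEKDATA, pid, &probe_word, NULL);
    if (*via_wrapper == -1 && errno != 0)
        rc = -1;

    /* Raw system call: the word is stored through the data pointer
       and the return value is only an error flag. */
    if (syscall(SYS_ptrace, (long) PTRACE_PEEKDATA, (long) pid,
                &probe_word, via_syscall) != 0)
        rc = -1;

    kill(pid, SIGKILL);         /* the demo child is no longer needed */
    waitpid(pid, &status, 0);
    return rc;
}
```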
BUGS
On hosts with Linux 2.6 kernel headers, PTRACE_SETOPTIONS is declared with a different value than the one for Linux 2.4. This leads to applications compiled with Linux 2.6 kernel headers failing when run on Linux 2.4. This can be worked around by redefining PTRACE_SETOPTIONS to PTRACE_OLDSETOPTIONS, if that is defined.
Group-stop notifications are sent to the tracer, but not to real parent. Last confirmed on 2.6.38.6.
If a thread group leader is traced and exits by calling _exit(2), a PTRACE_EVENT_EXIT stop will happen for it (if requested), but the subsequent WIFEXITED notification will not be delivered until all other threads exit. As explained above, if one of the other threads calls execve(2), the death of the thread group leader will never be reported. If the execed thread is not traced by this tracer, the tracer will never know that execve(2) happened. One possible workaround is to PTRACE_DETACH the thread group leader instead of restarting it in this case. Last confirmed on 2.6.38.6.
A SIGKILL signal may still cause a PTRACE_EVENT_EXIT stop before actual signal death. This may be changed in the future; SIGKILL is meant to always immediately kill tasks even under ptrace. Last confirmed on Linux 3.13.
Some system calls return with EINTR if a signal was sent to a tracee, but delivery was suppressed by the tracer. (This is a very typical operation: it is usually done by debuggers on every attach, in order to not introduce a bogus SIGSTOP.) As of Linux 3.2.9, the following system calls are affected (this list is likely incomplete): epoll_wait(2), and read(2) from an inotify(7) file descriptor. The usual symptom of this bug is that when you attach to a quiescent process with the command
strace -p <process-ID>
then, instead of the usual and expected one-line output such as
restart_syscall(<... resuming interrupted call ...>_
or
select(6, [5], NULL, [5], NULL_
(’_’ denotes the cursor position), you observe more than one line. For example:
clock_gettime(CLOCK_MONOTONIC, {15370, 690928118}) = 0
epoll_wait(4,_
What is not visible here is that the process was blocked in epoll_wait(2) before strace(1) attached to it. Attaching caused epoll_wait(2) to return to user space with the error EINTR. In this particular case, the program reacted to EINTR by checking the current time, and then executing epoll_wait(2) again. (Programs which do not expect such “stray” EINTR errors may behave in an unintended way upon an strace(1) attach.)
Contrary to the normal rules, the glibc wrapper for ptrace() can set errno to zero.
SEE ALSO
gdb(1), ltrace(1), strace(1), clone(2), execve(2), fork(2), gettid(2), prctl(2), seccomp(2), sigaction(2), tgkill(2), vfork(2), waitpid(2), exec(3), capabilities(7), signal(7)
2 - Linux cli command fcntl
NAME
fcntl - manipulate file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h>
int fcntl(int fd, int op, ... /* arg */ );
DESCRIPTION
fcntl() performs one of the operations described below on the open file descriptor fd. The operation is determined by op.
fcntl() can take an optional third argument. Whether or not this argument is required is determined by op. The required argument type is indicated in parentheses after each op name (in most cases, the required type is int, and we identify the argument using the name arg), or void is specified if the argument is not required.
Certain of the operations below are supported only since a particular Linux kernel version. The preferred method of checking whether the host kernel supports a particular operation is to invoke fcntl() with the desired op value and then test whether the call failed with EINVAL, indicating that the kernel does not recognize this value.
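The EINVAL probe described above might look like the following sketch. The function name fcntl_op_recognized() is hypothetical. Two caveats apply: the probe actually performs the operation, so it should be used only with operations whose side effects are acceptable, and EINVAL can also result from a bad argument to an operation the kernel does recognize.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical probe. Returns 1 if the call succeeded, 0 if it failed
   with EINVAL (kernel does not recognize op, or arg is invalid for it),
   and -1 for any other failure, in which case nothing can be concluded
   from this call alone. */
int fcntl_op_recognized(int fd, int op, long arg)
{
    if (fcntl(fd, op, arg) != -1)
        return 1;
    return errno == EINVAL ? 0 : -1;
}
```

For instance, fcntl_op_recognized(fd, F_GETFD, 0) returns 1 on a valid descriptor, while an op number that no kernel defines yields 0.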
Duplicating a file descriptor
F_DUPFD (int)
Duplicate the file descriptor fd using the lowest-numbered available file descriptor greater than or equal to arg. This is different from dup2(2), which uses exactly the file descriptor specified.
On success, the new file descriptor is returned.
See dup(2) for further details.
F_DUPFD_CLOEXEC (int; since Linux 2.6.24)
As for F_DUPFD, but additionally set the close-on-exec flag for the duplicate file descriptor. Specifying this flag permits a program to avoid an additional fcntl() F_SETFD operation to set the FD_CLOEXEC flag. For an explanation of why this flag is useful, see the description of O_CLOEXEC in open(2).
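A short sketch of the two duplication operations follows; the helper names are made up for this example. It duplicates a descriptor at or above a minimum number and lets the caller check whether the close-on-exec flag was set on the copy.

```c
#define _GNU_SOURCE             /* expose F_DUPFD_CLOEXEC under strict -std */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Duplicate fd onto the lowest free descriptor >= min_fd, with the
   close-on-exec flag set atomically on the duplicate. */
int dup_cloexec_from(int fd, int min_fd)
{
    return fcntl(fd, F_DUPFD_CLOEXEC, min_fd);
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if not, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

A duplicate made with plain F_DUPFD reports 0 from has_cloexec(), whereas one made with dup_cloexec_from() reports 1.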
File descriptor flags
The following operations manipulate the flags associated with a file descriptor. Currently, only one such flag is defined: FD_CLOEXEC, the close-on-exec flag. If the FD_CLOEXEC bit is set, the file descriptor will automatically be closed during a successful execve(2). (If the execve(2) fails, the file descriptor is left open.) If the FD_CLOEXEC bit is not set, the file descriptor will remain open across an execve(2).
F_GETFD (void)
Return (as the function result) the file descriptor flags; arg is ignored.
F_SETFD (int)
Set the file descriptor flags to the value specified by arg.
In multithreaded programs, using fcntl() F_SETFD to set the close-on-exec flag at the same time as another thread performs a fork(2) plus execve(2) is vulnerable to a race condition that may unintentionally leak the file descriptor to the program executed in the child process. See the discussion of the O_CLOEXEC flag in open(2) for details and a remedy to the problem.
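Setting or clearing the flag is usually written as a read-modify-write of the descriptor flags, as in the sketch below (set_cloexec() is a hypothetical name). Note that this two-step sequence is exactly what is exposed to the race described above in multithreaded programs.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: turn FD_CLOEXEC on (on != 0) or off (on == 0)
   while preserving any other descriptor flags. Returns 0 on success,
   -1 on error with errno set by fcntl(). */
int set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;

    if (on)
        flags |= FD_CLOEXEC;
    else
        flags &= ~FD_CLOEXEC;

    return fcntl(fd, F_SETFD, flags) == -1 ? -1 : 0;
}
```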
File status flags
Each open file description has certain associated status flags, initialized by open(2) and possibly modified by fcntl(). Duplicated file descriptors (made with dup(2), fcntl(F_DUPFD), fork(2), etc.) refer to the same open file description, and thus share the same file status flags.
The file status flags and their semantics are described in open(2).
F_GETFL (void)
Return (as the function result) the file access mode and the file status flags; arg is ignored.
F_SETFL (int)
Set the file status flags to the value specified by arg. File access mode (O_RDONLY, O_WRONLY, O_RDWR) and file creation flags (i.e., O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC) in arg are ignored. On Linux, this operation can change only the O_APPEND, O_ASYNC, O_DIRECT, O_NOATIME, and O_NONBLOCK flags. It is not possible to change the O_DSYNC and O_SYNC flags; see BUGS, below.
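A common use of F_GETFL and F_SETFL is enabling O_NONBLOCK on an already open descriptor. In the sketch below, both function names are hypothetical; the demo function creates a pipe, makes the read end nonblocking, and reports the errno from reading the empty pipe.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Add O_NONBLOCK to the file status flags, keeping the others. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Demo: returns the errno from reading an empty nonblocking pipe,
   0 if the read unexpectedly succeeded, or -1 on setup failure. */
int nonblocking_pipe_demo(void)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;

    int result = -1;
    if (set_nonblocking(p[0]) == 0) {
        char c;
        if (read(p[0], &c, 1) == -1)
            result = errno;     /* no data: fails instead of blocking */
        else
            result = 0;
    }

    close(p[0]);
    close(p[1]);
    return result;
}
```

Because the flags belong to the open file description, the change is also visible through any duplicate of the descriptor.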
Advisory record locking
Linux implements traditional (“process-associated”) UNIX record locks, as standardized by POSIX. For a Linux-specific alternative with better semantics, see the discussion of open file description locks below.
F_SETLK, F_SETLKW, and F_GETLK are used to acquire, release, and test for the existence of record locks (also known as byte-range, file-segment, or file-region locks). The third argument, lock, is a pointer to a structure that has at least the following fields (in unspecified order).
struct flock {
    ...
    short l_type;    /* Type of lock: F_RDLCK,
                        F_WRLCK, F_UNLCK */
    short l_whence;  /* How to interpret l_start:
                        SEEK_SET, SEEK_CUR, SEEK_END */
    off_t l_start;   /* Starting offset for lock */
    off_t l_len;     /* Number of bytes to lock */
    pid_t l_pid;     /* PID of process blocking our lock
                        (set by F_GETLK and F_OFD_GETLK) */
    ...
};
The l_whence, l_start, and l_len fields of this structure specify the range of bytes we wish to lock. Bytes past the end of the file may be locked, but not bytes before the start of the file.
l_start is the starting offset for the lock, and is interpreted relative to either: the start of the file (if l_whence is SEEK_SET); the current file offset (if l_whence is SEEK_CUR); or the end of the file (if l_whence is SEEK_END). In the final two cases, l_start can be a negative number provided the offset does not lie before the start of the file.
l_len specifies the number of bytes to be locked. If l_len is positive, then the range to be locked covers bytes l_start up to and including l_start+l_len-1. Specifying 0 for l_len has the special meaning: lock all bytes starting at the location specified by l_whence and l_start through to the end of file, no matter how large the file grows.
POSIX.1-2001 allows (but does not require) an implementation to support a negative l_len value; if l_len is negative, the interval described by lock covers bytes l_start+l_len up to and including l_start-1. This is supported since Linux 2.4.21 and Linux 2.5.49.
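The three cases for l_len can be summarized in a few lines of arithmetic. The struct and function below are hypothetical and assume that start is the absolute offset obtained after l_whence has already been applied to l_start.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical description of the bytes covered by a lock request. */
struct byte_range {
    off_t first;    /* first locked byte */
    off_t last;     /* last locked byte; unused when to_eof is true */
    bool  to_eof;   /* lock extends to end of file, however it grows */
};

struct byte_range lock_range(off_t start, off_t len)
{
    struct byte_range r = { start, start, false };

    if (len > 0) {
        r.last = start + len - 1;   /* start .. start+len-1 */
    } else if (len < 0) {
        r.first = start + len;      /* start+len .. start-1 */
        r.last = start - 1;
    } else {
        r.to_eof = true;            /* start .. end of file */
    }
    return r;
}
```

For example, a start of 100 with a length of 10 covers bytes 100 through 109, and the same start with a length of -10 covers bytes 90 through 99.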
The l_type field can be used to place a read (F_RDLCK) or a write (F_WRLCK) lock on a file. Any number of processes may hold a read lock (shared lock) on a file region, but only one process may hold a write lock (exclusive lock). An exclusive lock excludes all other locks, both shared and exclusive. A single process can hold only one type of lock on a file region; if a new lock is applied to an already-locked region, then the existing lock is converted to the new lock type. (Such conversions may involve splitting, shrinking, or coalescing with an existing lock if the byte range specified by the new lock does not precisely coincide with the range of the existing lock.)
F_SETLK (struct flock *)
Acquire a lock (when l_type is F_RDLCK or F_WRLCK) or release a lock (when l_type is F_UNLCK) on the bytes specified by the l_whence, l_start, and l_len fields of lock. If a conflicting lock is held by another process, this call returns -1 and sets errno to EACCES or EAGAIN. (The error returned in this case differs across implementations, so POSIX requires a portable application to check for both errors.)
F_SETLKW (struct flock *)
As for F_SETLK, but if a conflicting lock is held on the file, then wait for that lock to be released. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)).
F_GETLK (struct flock *)
On input to this call, lock describes a lock we would like to place on the file. If the lock could be placed, fcntl() does not actually place it, but returns F_UNLCK in the l_type field of lock and leaves the other fields of the structure unchanged.
If one or more incompatible locks would prevent this lock being placed, then fcntl() returns details about one of those locks in the l_type, l_whence, l_start, and l_len fields of lock. If the conflicting lock is a traditional (process-associated) record lock, then the l_pid field is set to the PID of the process holding that lock. If the conflicting lock is an open file description lock, then l_pid is set to -1. Note that the returned information may already be out of date by the time the caller inspects it.
In order to place a read lock, fd must be open for reading. In order to place a write lock, fd must be open for writing. To place both types of lock, open a file read-write.
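The following sketch shows F_SETLK and F_GETLK working together. All function names are hypothetical, and the temporary file under /tmp is an assumption of the demo. The parent takes a write lock on the first 10 bytes; a forked child, which does not inherit the lock, then uses F_GETLK to discover the conflict and the PID of the holder.

```c
#define _GNU_SOURCE             /* expose POSIX declarations */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Place (F_RDLCK, F_WRLCK) or remove (F_UNLCK) a record lock without
   blocking. Offsets are relative to the start of the file. */
int set_record_lock(int fd, short type, off_t start, off_t len)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLK, &fl);
}

/* Returns 1 and stores the reported l_pid in *holder if a write lock
   on the range would be blocked, 0 if it could be placed, -1 on error. */
int write_lock_blocked(int fd, off_t start, off_t len, pid_t *holder)
{
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    if (fl.l_type == F_UNLCK)
        return 0;
    *holder = fl.l_pid;
    return 1;
}

/* Demo: returns 1 if a child saw the parent's lock as a conflict held
   by the parent's PID, 0 if not, -1 on setup failure. */
int lock_conflict_demo(void)
{
    char path[] = "/tmp/record_lock_XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);               /* file lives on until fd is closed */

    int result = -1;
    if (set_record_lock(fd, F_WRLCK, 0, 10) == 0) {
        pid_t parent = getpid();
        pid_t pid = fork();
        if (pid == 0) {
            pid_t holder = 0;
            int blocked = write_lock_blocked(fd, 0, 10, &holder);
            _exit(blocked == 1 && holder == parent ? 0 : 1);
        }
        int status;
        if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = (WEXITSTATUS(status) == 0);
    }
    close(fd);
    return result;
}
```

Within the locking process itself, the same F_GETLK query reports F_UNLCK, because a process's own record locks never conflict with its new requests.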
When placing locks with F_SETLKW, the kernel detects deadlocks, whereby two or more processes have their lock requests mutually blocked by locks held by the other processes. For example, suppose process A holds a write lock on byte 100 of a file, and process B holds a write lock on byte 200. If each process then attempts to lock the byte already locked by the other process using F_SETLKW, then, without deadlock detection, both processes would remain blocked indefinitely. When the kernel detects such deadlocks, it causes one of the blocking lock requests to immediately fail with the error EDEADLK; an application that encounters such an error should release some of its locks to allow other applications to proceed before attempting to regain the locks that it requires. Circular deadlocks involving more than two processes are also detected. Note, however, that there are limitations to the kernel’s deadlock-detection algorithm; see BUGS.
As well as being removed by an explicit F_UNLCK, record locks are automatically released when the process terminates.
Record locks are not inherited by a child created via fork(2), but are preserved across an execve(2).
Because of the buffering performed by the stdio(3) library, the use of record locking with routines in that package should be avoided; use read(2) and write(2) instead.
The record locks described above are associated with the process (unlike the open file description locks described below). This has some unfortunate consequences:
If a process closes any file descriptor referring to a file, then all of the process’s locks on that file are released, regardless of the file descriptor(s) on which the locks were obtained. This is bad: it means that a process can lose its locks on a file such as /etc/passwd or /etc/mtab when for some reason a library function decides to open, read, and close the same file.
The threads in a process share locks. In other words, a multithreaded program can’t use record locking to ensure that threads don’t simultaneously access the same region of a file.
Open file description locks solve both of these problems.
Open file description locks (non-POSIX)
Open file description locks are advisory byte-range locks whose operation is in most respects identical to the traditional record locks described above. This lock type is Linux-specific, and available since Linux 3.15. (There is a proposal with the Austin Group to include this lock type in the next revision of POSIX.1.) For an explanation of open file descriptions, see open(2).
The principal difference between the two lock types is that whereas traditional record locks are associated with a process, open file description locks are associated with the open file description on which they are acquired, much like locks acquired with flock(2). Consequently (and unlike traditional advisory record locks), open file description locks are inherited across fork(2) (and clone(2) with CLONE_FILES), and are only automatically released on the last close of the open file description, instead of being released on any close of the file.
Conflicting lock combinations (i.e., a read lock and a write lock or two write locks) where one lock is an open file description lock and the other is a traditional record lock conflict even when they are acquired by the same process on the same file descriptor.
Open file description locks placed via the same open file description (i.e., via the same file descriptor, or via a duplicate of the file descriptor created by fork(2), dup(2), fcntl() F_DUPFD, and so on) are always compatible: if a new lock is placed on an already locked region, then the existing lock is converted to the new lock type. (Such conversions may result in splitting, shrinking, or coalescing with an existing lock as discussed above.)
On the other hand, open file description locks may conflict with each other when they are acquired via different open file descriptions. Thus, the threads in a multithreaded program can use open file description locks to synchronize access to a file region by having each thread perform its own open(2) on the file and applying locks via the resulting file descriptor.
As with traditional advisory locks, the third argument to fcntl(), lock, is a pointer to an flock structure. By contrast with traditional record locks, the l_pid field of that structure must be set to zero when using the operations described below.
The operations for working with open file description locks are analogous to those used with traditional locks:
F_OFD_SETLK (struct flock *)
Acquire an open file description lock (when l_type is F_RDLCK or F_WRLCK) or release an open file description lock (when l_type is F_UNLCK) on the bytes specified by the l_whence, l_start, and l_len fields of lock. If a conflicting lock is held on the file, this call returns -1 and sets errno to EAGAIN.
F_OFD_SETLKW (struct flock *)
As for F_OFD_SETLK, but if a conflicting lock is held on the file, then wait for that lock to be released. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)).
F_OFD_GETLK (struct flock *)
On input to this call, lock describes an open file description lock we would like to place on the file. If the lock could be placed, fcntl() does not actually place it, but returns F_UNLCK in the l_type field of lock and leaves the other fields of the structure unchanged. If one or more incompatible locks would prevent this lock being placed, then details about one of these locks are returned via lock, as described above for F_GETLK.
In the current implementation, no deadlock detection is performed for open file description locks. (This contrasts with process-associated record locks, for which the kernel does perform deadlock detection.)
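The compatibility rules above can be illustrated with a small sketch. The helper names (ofd_try_lock, ofd_lock_demo), the 10-byte range, and the temporary file path are assumptions made for illustration only. The sketch places a lock via one open file description, converts it through a duplicate descriptor, and then shows that a second open(2) of the same file yields a description whose lock request conflicts.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Try, without blocking, to set an OFD lock of the given type on the first
   10 bytes of fd. Returns 0 on success, or -1 with errno set. */
int ofd_try_lock(int fd, short type)
{
    assert(type == F_RDLCK || type == F_WRLCK || type == F_UNLCK);
    struct flock lk = {
        .l_type = type,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 10,
        .l_pid = 0,     /* must be zero for the F_OFD_* operations */
    };
    return fcntl(fd, F_OFD_SETLK, &lk);
}

/* Hypothetical demo; returns 0 if the behavior matches the text above. */
int ofd_lock_demo(void)
{
    char path[] = "/tmp/fcntl_ofd_XXXXXX";
    int fd1 = mkstemp(path);
    if (fd1 == -1)
        return -1;
    int fd1_dup = dup(fd1);         /* same open file description as fd1 */
    int fd2 = open(path, O_RDWR);   /* a separate open file description */
    int rc = -1;

    if (fd1_dup != -1 && fd2 != -1 && ofd_try_lock(fd1, F_WRLCK) == 0) {
        /* Same description: the write lock is converted to a read lock. */
        int converted = ofd_try_lock(fd1_dup, F_RDLCK) == 0;

        /* Different description: a write lock conflicts with the read lock. */
        errno = 0;
        int refused = ofd_try_lock(fd2, F_WRLCK) == -1 && errno == EAGAIN;

        /* F_OFD_GETLK describes the lock that stands in the way. */
        struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                               .l_start = 0, .l_len = 10, .l_pid = 0 };
        int reported = fcntl(fd2, F_OFD_GETLK, &probe) == 0 &&
                       probe.l_type == F_RDLCK;

        rc = (converted && refused && reported) ? 0 : -1;
    }
    if (fd2 != -1)
        close(fd2);
    if (fd1_dup != -1)
        close(fd1_dup);
    unlink(path);
    close(fd1);
    return rc;
}
```

A multithreaded program would follow the fd2 pattern: each thread performs its own open(2) and locks through the descriptor it obtained.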
Mandatory locking
Warning: the Linux implementation of mandatory locking is unreliable. See BUGS below. Because of these bugs, and the fact that the feature is believed to be little used, since Linux 4.5, mandatory locking has been made an optional feature, governed by a configuration option (CONFIG_MANDATORY_FILE_LOCKING). This feature is no longer supported at all in Linux 5.15 and above.
By default, both traditional (process-associated) and open file description record locks are advisory. Advisory locks are not enforced and are useful only between cooperating processes.
Both lock types can also be mandatory. Mandatory locks are enforced for all processes. If a process tries to perform an incompatible access (e.g., read(2) or write(2)) on a file region that has an incompatible mandatory lock, then the result depends upon whether the O_NONBLOCK flag is enabled for its open file description. If the O_NONBLOCK flag is not enabled, then the system call is blocked until the lock is removed or converted to a mode that is compatible with the access. If the O_NONBLOCK flag is enabled, then the system call fails with the error EAGAIN.
To make use of mandatory locks, mandatory locking must be enabled both on the filesystem that contains the file to be locked, and on the file itself. Mandatory locking is enabled on a filesystem using the “-o mand” option to mount(8), or the MS_MANDLOCK flag for mount(2). Mandatory locking is enabled on a file by disabling group execute permission on the file and enabling the set-group-ID permission bit (see chmod(1) and chmod(2)).
Mandatory locking is not specified by POSIX. Some other systems also support mandatory locking, although the details of how to enable it vary across systems.
Lost locks
When an advisory lock is obtained on a networked filesystem such as NFS it is possible that the lock might get lost. This may happen due to administrative action on the server, or due to a network partition (i.e., loss of network connectivity with the server) which lasts long enough for the server to assume that the client is no longer functioning.
When the filesystem determines that a lock has been lost, future read(2) or write(2) requests may fail with the error EIO. This error will persist until the lock is removed or the file descriptor is closed. Since Linux 3.12, this happens at least for NFSv4 (including all minor versions).
Some versions of UNIX send a signal (SIGLOST) in this circumstance. Linux does not define this signal, and does not provide any asynchronous notification of lost locks.
Managing signals
F_GETOWN, F_SETOWN, F_GETOWN_EX, F_SETOWN_EX, F_GETSIG, and F_SETSIG are used to manage I/O availability signals:
F_GETOWN (void)
Return (as the function result) the process ID or process group ID currently receiving SIGIO and SIGURG signals for events on file descriptor fd. Process IDs are returned as positive values; process group IDs are returned as negative values (but see BUGS below). arg is ignored.
F_SETOWN (int)
Set the process ID or process group ID that will receive SIGIO and SIGURG signals for events on the file descriptor fd. The target process or process group ID is specified in arg. A process ID is specified as a positive value; a process group ID is specified as a negative value. Most commonly, the calling process specifies itself as the owner (that is, arg is specified as getpid(2)).
As well as setting the file descriptor owner, one must also enable generation of signals on the file descriptor. This is done by using the fcntl() F_SETFL operation to set the O_ASYNC file status flag on the file descriptor. Subsequently, a SIGIO signal is sent whenever input or output becomes possible on the file descriptor. The fcntl() F_SETSIG operation can be used to obtain delivery of a signal other than SIGIO.
Sending a signal to the owner process (group) specified by F_SETOWN is subject to the same permissions checks as are described for kill(2), where the sending process is the one that employs F_SETOWN (but see BUGS below). If this permission check fails, then the signal is silently discarded. Note: The F_SETOWN operation records the caller’s credentials at the time of the fcntl() call, and it is these saved credentials that are used for the permission checks.
If the file descriptor fd refers to a socket, F_SETOWN also selects the recipient of SIGURG signals that are delivered when out-of-band data arrives on that socket. (SIGURG is sent in any situation where select(2) would report the socket as having an “exceptional condition”.)
The following was true in Linux 2.6.x up to and including Linux 2.6.11:
If a nonzero value is given to F_SETSIG in a multithreaded process running with a threading library that supports thread groups (e.g., NPTL), then a positive value given to F_SETOWN has a different meaning: instead of being a process ID identifying a whole process, it is a thread ID identifying a specific thread within a process. Consequently, it may be necessary to pass F_SETOWN the result of gettid(2) instead of getpid(2) to get sensible results when F_SETSIG is used. (In current Linux threading implementations, a main thread’s thread ID is the same as its process ID. This means that a single-threaded program can equally use gettid(2) or getpid(2) in this scenario.) Note, however, that the statements in this paragraph do not apply to the SIGURG signal generated for out-of-band data on a socket: this signal is always sent to either a process or a process group, depending on the value given to F_SETOWN.
The above behavior was accidentally dropped in Linux 2.6.12, and won’t be restored. From Linux 2.6.32 onward, use F_SETOWN_EX to target SIGIO and SIGURG signals at a particular thread.
F_GETOWN_EX (struct f_owner_ex *) (since Linux 2.6.32)
Return the current file descriptor owner settings as defined by a previous F_SETOWN_EX operation. The information is returned in the structure pointed to by arg, which has the following form:
struct f_owner_ex {
int type;
pid_t pid;
};
The type field will have one of the values F_OWNER_TID, F_OWNER_PID, or F_OWNER_PGRP. The pid field is a positive integer representing a thread ID, process ID, or process group ID. See F_SETOWN_EX for more details.
F_SETOWN_EX (struct f_owner_ex *) (since Linux 2.6.32)
This operation performs a similar task to F_SETOWN. It allows the caller to direct I/O availability signals to a specific thread, process, or process group. The caller specifies the target of signals via arg, which is a pointer to an f_owner_ex structure. The type field has one of the following values, which define how pid is interpreted:
F_OWNER_TID
Send the signal to the thread whose thread ID (the value returned by a call to clone(2) or gettid(2)) is specified in pid.
F_OWNER_PID
Send the signal to the process whose ID is specified in pid.
F_OWNER_PGRP
Send the signal to the process group whose ID is specified in pid. (Note that, unlike with F_SETOWN, a process group ID is specified as a positive value here.)
F_GETSIG (void)
Return (as the function result) the signal sent when input or output becomes possible. A value of zero means SIGIO is sent. Any other value (including SIGIO) is the signal sent instead, and in this case additional info is available to the signal handler if installed with SA_SIGINFO. arg is ignored.
F_SETSIG (int)
Set the signal sent when input or output becomes possible to the value given in arg. A value of zero means to send the default SIGIO signal. Any other value (including SIGIO) is the signal to send instead, and in this case additional info is available to the signal handler if installed with SA_SIGINFO.
By using F_SETSIG with a nonzero value, and setting SA_SIGINFO for the signal handler (see sigaction(2)), extra information about I/O events is passed to the handler in a siginfo_t structure. If the si_code field indicates the source is SI_SIGIO, the si_fd field gives the file descriptor associated with the event. Otherwise, there is no indication which file descriptors are pending, and you should use the usual mechanisms (select(2), poll(2), read(2) with O_NONBLOCK set, etc.) to determine which file descriptors are available for I/O.
Note that the file descriptor provided in si_fd is the one that was specified during the F_SETSIG operation. This can lead to an unusual corner case. If the file descriptor is duplicated (dup(2) or similar), and the original file descriptor is closed, then I/O events will continue to be generated, but the si_fd field will contain the number of the now closed file descriptor.
By selecting a real-time signal (value >= SIGRTMIN), multiple I/O events may be queued using the same signal number. (Queuing is dependent on available memory.) Extra information is available if SA_SIGINFO is set for the signal handler, as above.
Note that Linux imposes a limit on the number of real-time signals that may be queued to a process (see getrlimit(2) and signal(7)) and if this limit is reached, then the kernel reverts to delivering SIGIO, and this signal is delivered to the entire process rather than to a specific thread.
Using these mechanisms, a program can implement fully asynchronous I/O without using select(2) or poll(2) most of the time.
The use of O_ASYNC is specific to BSD and Linux. The only use of F_GETOWN and F_SETOWN specified in POSIX.1 is in conjunction with the use of the SIGURG signal on sockets. (POSIX does not specify the SIGIO signal.) F_GETOWN_EX, F_SETOWN_EX, F_GETSIG, and F_SETSIG are Linux-specific. POSIX has asynchronous I/O and the aio_sigevent structure to achieve similar things; these are also available in Linux as part of the GNU C Library (glibc).
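The pieces described in this section (owner, signal number, and O_ASYNC) can be combined as in the following minimal sketch. The helper names (enable_io_signal, io_signal_demo) are hypothetical, and the choice of a pipe, SIGRTMIN, and a two-second timeout is an assumption made for illustration. To keep the example short, the signal is blocked and collected synchronously with sigtimedwait(2) instead of being delivered to an SA_SIGINFO handler.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Direct I/O availability signals for fd to the calling thread, using the
   real-time signal signo. Returns 0 on success, -1 on error. */
int enable_io_signal(int fd, int signo)
{
    assert(signo >= SIGRTMIN && signo <= SIGRTMAX);

    struct f_owner_ex owner = {
        .type = F_OWNER_TID,
        .pid = (pid_t) syscall(SYS_gettid),
    };
    if (fcntl(fd, F_SETOWN_EX, &owner) == -1)
        return -1;
    if (fcntl(fd, F_SETSIG, signo) == -1)
        return -1;

    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC);
}

/* Hypothetical demo: returns 0 if writing to a pipe produced the chosen
   signal with si_fd naming the read end of the pipe. */
int io_signal_demo(void)
{
    int p[2];
    int signo = SIGRTMIN;
    sigset_t set, old;
    siginfo_t info;
    struct timespec timeout = { .tv_sec = 2, .tv_nsec = 0 };
    int rc = -1;

    if (pipe(p) == -1)
        return -1;

    /* Block the signal so that it stays pending until we fetch it. */
    sigemptyset(&set);
    sigaddset(&set, signo);
    if (sigprocmask(SIG_BLOCK, &set, &old) == 0) {
        if (enable_io_signal(p[0], signo) == 0 &&
            write(p[1], "x", 1) == 1 &&
            sigtimedwait(&set, &info, &timeout) == signo &&
            info.si_fd == p[0])
            rc = 0;
        sigprocmask(SIG_SETMASK, &old, NULL);
    }
    close(p[0]);
    close(p[1]);
    return rc;
}
```

A program that handles many descriptors would more likely install a handler with sigaction(2) and SA_SIGINFO, and fall back to poll(2) when plain SIGIO arrives because the real-time signal queue overflowed.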
Leases
F_SETLEASE and F_GETLEASE (Linux 2.4 onward) are used to establish a new lease, and retrieve the current lease, on the open file description referred to by the file descriptor fd. A file lease provides a mechanism whereby the process holding the lease (the “lease holder”) is notified (via delivery of a signal) when a process (the “lease breaker”) tries to open(2) or truncate(2) the file referred to by that file descriptor.
F_SETLEASE (int)
Set or remove a file lease according to which of the following values is specified in the integer arg:
F_RDLCK
Take out a read lease. This will cause the calling process to be notified when the file is opened for writing or is truncated. A read lease can be placed only on a file descriptor that is opened read-only.
F_WRLCK
Take out a write lease. This will cause the caller to be notified when the file is opened for reading or writing or is truncated. A write lease may be placed on a file only if there are no other open file descriptors for the file.
F_UNLCK
Remove our lease from the file.
Leases are associated with an open file description (see open(2)). This means that duplicate file descriptors (created by, for example, fork(2) or dup(2)) refer to the same lease, and this lease may be modified or released using any of these descriptors. Furthermore, the lease is released by either an explicit F_UNLCK operation on any of these duplicate file descriptors, or when all such file descriptors have been closed.
Leases may be taken out only on regular files. An unprivileged process may take out a lease only on a file whose UID (owner) matches the filesystem UID of the process. A process with the CAP_LEASE capability may take out leases on arbitrary files.
F_GETLEASE (void)
Indicates what type of lease is associated with the file descriptor fd by returning either F_RDLCK, F_WRLCK, or F_UNLCK, indicating, respectively, a read lease, a write lease, or no lease. arg is ignored.
When a process (the “lease breaker”) performs an open(2) or truncate(2) that conflicts with a lease established via F_SETLEASE, the system call is blocked by the kernel and the kernel notifies the lease holder by sending it a signal (SIGIO by default). The lease holder should respond to receipt of this signal by doing whatever cleanup is required in preparation for the file to be accessed by another process (e.g., flushing cached buffers) and then either remove or downgrade its lease. A lease is removed by performing an F_SETLEASE operation specifying arg as F_UNLCK. If the lease holder currently holds a write lease on the file, and the lease breaker is opening the file for reading, then it is sufficient for the lease holder to downgrade the lease to a read lease. This is done by performing an F_SETLEASE operation specifying arg as F_RDLCK.
If the lease holder fails to downgrade or remove the lease within the number of seconds specified in /proc/sys/fs/lease-break-time, then the kernel forcibly removes or downgrades the lease holder’s lease.
Once a lease break has been initiated, F_GETLEASE returns the target lease type (either F_RDLCK or F_UNLCK, depending on what would be compatible with the lease breaker) until the lease holder voluntarily downgrades or removes the lease or the kernel forcibly does so after the lease break timer expires.
Once the lease has been voluntarily or forcibly removed or downgraded, and assuming the lease breaker has not unblocked its system call, the kernel permits the lease breaker’s system call to proceed.
If the lease breaker’s blocked open(2) or truncate(2) is interrupted by a signal handler, then the system call fails with the error EINTR, but the other steps still occur as described above. If the lease breaker is killed by a signal while blocked in open(2) or truncate(2), then the other steps still occur as described above. If the lease breaker specifies the O_NONBLOCK flag when calling open(2), then the call immediately fails with the error EWOULDBLOCK, but the other steps still occur as described above.
The default signal used to notify the lease holder is SIGIO, but this can be changed using the F_SETSIG operation to fcntl(). If an F_SETSIG operation is performed (even one specifying SIGIO), and the signal handler is established using SA_SIGINFO, then the handler will receive a siginfo_t structure as its second argument, and the si_fd field of this argument will hold the file descriptor of the leased file that has been accessed by another process. (This is useful if the caller holds leases against multiple files.)
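The lease holder's side of this protocol might look like the following sketch. The function names (comply_with_lease_break, read_lease_demo) and the temporary file path are hypothetical. The first helper relies on the behavior described above, namely that F_GETLEASE reports the target lease type once a break has been initiated; the demo only takes and releases a read lease.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Intended to be called by the lease holder after it has been notified of a
   lease break and has finished its cleanup. While a break is in progress,
   F_GETLEASE reports the target type (F_RDLCK or F_UNLCK), so setting the
   lease to that type downgrades or removes it as required.
   Returns 0 on success, -1 on error. */
int comply_with_lease_break(int fd)
{
    assert(fd >= 0);
    int target = fcntl(fd, F_GETLEASE);
    if (target == -1)
        return -1;
    return fcntl(fd, F_SETLEASE, target);
}

/* Hypothetical demo: take a read lease on a newly created file and return
   the type reported by F_GETLEASE, or -1 if the lease could not be taken. */
int read_lease_demo(void)
{
    char path[] = "/tmp/fcntl_lease_XXXXXX";
    int wfd = mkstemp(path);
    if (wfd == -1)
        return -1;
    close(wfd);     /* the lease is placed on a read-only descriptor instead */

    int type = -1;
    int fd = open(path, O_RDONLY);
    if (fd != -1) {
        if (fcntl(fd, F_SETLEASE, F_RDLCK) == 0) {
            type = fcntl(fd, F_GETLEASE);       /* F_RDLCK while held */
            fcntl(fd, F_SETLEASE, F_UNLCK);     /* give the lease up again */
        }
        close(fd);
    }
    unlink(path);
    return type;
}
```

Some filesystems refuse leases altogether, in which case read_lease_demo() returns -1; callers should treat that as "no lease" and not as a fatal error.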
File and directory change notification (dnotify)
F_NOTIFY (int)
(Linux 2.4 onward) Provide notification when the directory referred to by fd or any of the files that it contains is changed. The events to be notified are specified in arg, which is a bit mask specified by ORing together zero or more of the following bits:
DN_ACCESS
A file was accessed (read(2), pread(2), readv(2), and similar).
DN_MODIFY
A file was modified (write(2), pwrite(2), writev(2), truncate(2), ftruncate(2), and similar).
DN_CREATE
A file was created (open(2), creat(2), mknod(2), mkdir(2), link(2), symlink(2), rename(2) into this directory).
DN_DELETE
A file was unlinked (unlink(2), rename(2) to another directory, rmdir(2)).
DN_RENAME
A file was renamed within this directory (rename(2)).
DN_ATTRIB
The attributes of a file were changed (chown(2), chmod(2), utime(2), utimensat(2), and similar).
(In order to obtain these definitions, the _GNU_SOURCE feature test macro must be defined before including any header files.)
Directory notifications are normally “one-shot”, and the application must reregister to receive further notifications. Alternatively, if DN_MULTISHOT is included in arg, then notification will remain in effect until explicitly removed.
A series of F_NOTIFY requests is cumulative, with the events in arg being added to the set already monitored. To disable notification of all events, make an F_NOTIFY call specifying arg as 0.
Notification occurs via delivery of a signal. The default signal is SIGIO, but this can be changed using the F_SETSIG operation to fcntl(). (Note that SIGIO is one of the nonqueuing standard signals; switching to the use of a real-time signal means that multiple notifications can be queued to the process.) In the latter case, the signal handler receives a siginfo_t structure as its second argument (if the handler was established using SA_SIGINFO) and the si_fd field of this structure contains the file descriptor which generated the notification (useful when establishing notification on multiple directories).
Especially when using DN_MULTISHOT, a real-time signal should be used for notification, so that multiple notifications can be queued.
NOTE: New applications should use the inotify interface (available since Linux 2.6.13), which provides a much superior interface for obtaining notifications of filesystem events. See inotify(7).
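For completeness, a minimal dnotify sketch follows. The function name dnotify_demo, the directory template, the file name "new-file", and the choice of SIGRTMIN+1 are all assumptions made for illustration. The signal is blocked and collected with sigtimedwait(2) to avoid installing a handler. Kernels built without dnotify support are expected to reject F_NOTIFY, which the sketch reports as 0 instead of treating it as an error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical demo: watch a temporary directory for file creation.
   Returns 1 if a notification arrived with si_fd naming the directory,
   0 if the kernel rejected F_NOTIFY, and -1 on any other failure. */
int dnotify_demo(void)
{
    char dir[] = "/tmp/fcntl_dnotify_XXXXXX";
    int signo = SIGRTMIN + 1;
    sigset_t set, old;
    siginfo_t info;
    struct timespec timeout = { .tv_sec = 2, .tv_nsec = 0 };
    int rc = -1;

    assert(signo <= SIGRTMAX);
    if (mkdtemp(dir) == NULL)
        return -1;
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd == -1) {
        rmdir(dir);
        return -1;
    }

    /* Keep the notification signal pending until it is fetched below. */
    sigemptyset(&set);
    sigaddset(&set, signo);
    sigprocmask(SIG_BLOCK, &set, &old);

    if (fcntl(dfd, F_SETSIG, signo) == -1) {
        rc = -1;
    } else if (fcntl(dfd, F_NOTIFY, DN_CREATE | DN_MULTISHOT) == -1) {
        rc = 0;                         /* dnotify not available here */
    } else {
        int fd = openat(dfd, "new-file", O_CREAT | O_WRONLY, 0600);
        if (fd != -1) {
            close(fd);
            if (sigtimedwait(&set, &info, &timeout) == signo &&
                info.si_fd == dfd)
                rc = 1;
            unlinkat(dfd, "new-file", 0);
        }
        fcntl(dfd, F_NOTIFY, 0);        /* disable all notifications */
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    close(dfd);
    rmdir(dir);
    return rc;
}
```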
Changing the capacity of a pipe
F_SETPIPE_SZ (int; since Linux 2.6.35)
Change the capacity of the pipe referred to by fd to be at least arg bytes. An unprivileged process can adjust the pipe capacity to any value between the system page size and the limit defined in /proc/sys/fs/pipe-max-size (see proc(5)). Attempts to set the pipe capacity below the page size are silently rounded up to the page size. Attempts by an unprivileged process to set the pipe capacity above the limit in /proc/sys/fs/pipe-max-size yield the error EPERM; a privileged process (CAP_SYS_RESOURCE) can override the limit.
When allocating the buffer for the pipe, the kernel may use a capacity larger than arg, if that is convenient for the implementation. (In the current implementation, the allocation is the next higher power-of-two page-size multiple of the requested size.) The actual capacity (in bytes) that is set is returned as the function result.
Attempting to set the pipe capacity smaller than the amount of buffer space currently used to store data produces the error EBUSY.
Note that because of the way the pages of the pipe buffer are employed when data is written to the pipe, the number of bytes that can be written may be less than the nominal size, depending on the size of the writes.
F_GETPIPE_SZ (void; since Linux 2.6.35)
Return (as the function result) the capacity of the pipe referred to by fd.
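The rounding behavior can be observed with a short sketch. The function name pipe_capacity_demo is hypothetical. It requests a capacity on the write end of a new pipe and reads the result back from the read end.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical demo: request a capacity of at least `want` bytes on a new
   pipe. Stores the value returned by F_SETPIPE_SZ in *granted and the value
   returned by F_GETPIPE_SZ in *reported. Returns 0 on success, -1 on error. */
int pipe_capacity_demo(int want, int *granted, int *reported)
{
    assert(granted != NULL && reported != NULL);

    int p[2];
    if (pipe(p) == -1)
        return -1;

    /* Either end of the pipe may be used; both refer to the same buffer. */
    *granted = fcntl(p[1], F_SETPIPE_SZ, want);
    *reported = fcntl(p[0], F_GETPIPE_SZ);

    close(p[0]);
    close(p[1]);
    return (*granted == -1 || *reported == -1) ? -1 : 0;
}
```

Requesting 1 byte, for example, is expected to yield a capacity of at least one page, since requests below the page size are rounded up.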
File sealing
File seals limit the set of allowed operations on a given file. For each seal that is set on a file, a specific set of operations will fail with EPERM on this file from now on. The file is said to be sealed. The default set of seals depends on the type of the underlying file and filesystem. For an overview of file sealing, a discussion of its purpose, and some code examples, see memfd_create(2).
Currently, file seals can be applied only to a file descriptor returned by memfd_create(2) (if the MFD_ALLOW_SEALING flag was employed). On other filesystems, all fcntl() operations that operate on seals will return EINVAL.
Seals are a property of an inode. Thus, all open file descriptors referring to the same inode share the same set of seals. Furthermore, seals can never be removed, only added.
F_ADD_SEALS (int; since Linux 3.17)
Add the seals given in the bit-mask argument arg to the set of seals of the inode referred to by the file descriptor fd. Seals cannot be removed again. Once this call succeeds, the seals are enforced by the kernel immediately. If the current set of seals includes F_SEAL_SEAL (see below), then this call will be rejected with EPERM. Adding a seal that is already set is a no-op, provided that F_SEAL_SEAL is not already set. In order to place a seal, the file descriptor fd must be writable.
F_GET_SEALS (void; since Linux 3.17)
Return (as the function result) the current set of seals of the inode referred to by fd. If no seals are set, 0 is returned. If the file does not support sealing, -1 is returned and errno is set to EINVAL.
The following seals are available:
F_SEAL_SEAL
If this seal is set, any further call to fcntl() with F_ADD_SEALS fails with the error EPERM. Therefore, this seal prevents any modifications to the set of seals itself. If the initial set of seals of a file includes F_SEAL_SEAL, then this effectively causes the set of seals to be constant and locked.
F_SEAL_SHRINK
If this seal is set, the file in question cannot be reduced in size. This affects open(2) with the O_TRUNC flag as well as truncate(2) and ftruncate(2). Those calls fail with EPERM if you try to shrink the file in question. Increasing the file size is still possible.
F_SEAL_GROW
If this seal is set, the size of the file in question cannot be increased. This affects write(2) beyond the end of the file, truncate(2), ftruncate(2), and fallocate(2). These calls fail with EPERM if you use them to increase the file size. If you keep the size or shrink it, those calls still work as expected.
F_SEAL_WRITE
If this seal is set, you cannot modify the contents of the file. Note that shrinking or growing the size of the file is still possible and allowed. Thus, this seal is normally used in combination with one of the other seals. This seal affects write(2) and fallocate(2) (only in combination with the FALLOC_FL_PUNCH_HOLE flag). Those calls fail with EPERM if this seal is set. Furthermore, trying to create new shared, writable memory-mappings via mmap(2) will also fail with EPERM.
Using the F_ADD_SEALS operation to set the F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping exists. Such mappings must be unmapped before you can add this seal. Furthermore, if there are any asynchronous I/O operations (io_submit(2)) pending on the file, all outstanding writes will be discarded.
F_SEAL_FUTURE_WRITE (since Linux 5.1)
The effect of this seal is similar to F_SEAL_WRITE, but the contents of the file can still be modified via shared writable mappings that were created prior to the seal being set. Any attempt to create a new writable mapping on the file via mmap(2) will fail with EPERM. Likewise, an attempt to write to the file via write(2) will fail with EPERM.
Using this seal, one process can create a memory buffer that it can continue to modify while sharing that buffer on a “read-only” basis with other processes.
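The effect of the seals can be checked with a short sketch. The function names (add_seals, seal_demo) and the memfd name "seal-demo" are hypothetical. The sketch assumes a C library that provides the memfd_create(2) wrapper. It seals a small in-memory file and verifies that writing, shrinking, and adding further seals all fail with EPERM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Add the given seals to the file referred to by fd; 0 on success. */
int add_seals(int fd, int seals)
{
    assert(fd >= 0);
    return fcntl(fd, F_ADD_SEALS, seals);
}

/* Hypothetical demo: returns 0 if the sealed file rejects modification as
   described above, -1 otherwise. */
int seal_demo(void)
{
    int fd = memfd_create("seal-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd == -1)
        return -1;

    const int seals = F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL;
    int rc = -1;

    if (write(fd, "hello", 5) == 5 &&
        add_seals(fd, seals) == 0 &&
        (fcntl(fd, F_GET_SEALS) & seals) == seals) {

        errno = 0;      /* F_SEAL_WRITE and F_SEAL_GROW both forbid this */
        int write_denied = write(fd, "x", 1) == -1 && errno == EPERM;

        errno = 0;      /* F_SEAL_SHRINK forbids truncation to zero */
        int shrink_denied = ftruncate(fd, 0) == -1 && errno == EPERM;

        errno = 0;      /* F_SEAL_SEAL forbids any further F_ADD_SEALS */
        int reseal_denied = add_seals(fd, F_SEAL_SHRINK) == -1 &&
                            errno == EPERM;

        rc = (write_denied && shrink_denied && reseal_denied) ? 0 : -1;
    }
    close(fd);
    return rc;
}
```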
File read/write hints
Write lifetime hints can be used to inform the kernel about the relative expected lifetime of writes on a given inode or via a particular open file description. (See open(2) for an explanation of open file descriptions.) In this context, the term “write lifetime” means the expected time the data will live on media, before being overwritten or erased.
An application may use the different hint values specified below to separate writes into different write classes, so that multiple users or applications running on a single storage back-end can aggregate their I/O patterns in a consistent manner. However, there are no functional semantics implied by these flags, and different I/O classes can use the write lifetime hints in arbitrary ways, so long as the hints are used consistently.
The following operations can be applied to the file descriptor, fd:
F_GET_RW_HINT (uint64_t *; since Linux 4.13)
Returns the value of the read/write hint associated with the underlying inode referred to by fd.
F_SET_RW_HINT (uint64_t *; since Linux 4.13)
Sets the read/write hint value associated with the underlying inode referred to by fd. This hint persists until either it is explicitly modified or the underlying filesystem is unmounted.
F_GET_FILE_RW_HINT (uint64_t *; since Linux 4.13)
Returns the value of the read/write hint associated with the open file description referred to by fd.
F_SET_FILE_RW_HINT (uint64_t *; since Linux 4.13)
Sets the read/write hint value associated with the open file description referred to by fd.
If an open file description has not been assigned a read/write hint, then it shall use the value assigned to the inode, if any.
The following read/write hints are valid since Linux 4.13:
RWH_WRITE_LIFE_NOT_SET
No specific hint has been set. This is the default value.
RWH_WRITE_LIFE_NONE
No specific write lifetime is associated with this file or inode.
RWH_WRITE_LIFE_SHORT
Data written to this inode or via this open file description is expected to have a short lifetime.
RWH_WRITE_LIFE_MEDIUM
Data written to this inode or via this open file description is expected to have a lifetime longer than data written with RWH_WRITE_LIFE_SHORT.
RWH_WRITE_LIFE_LONG
Data written to this inode or via this open file description is expected to have a lifetime longer than data written with RWH_WRITE_LIFE_MEDIUM.
RWH_WRITE_LIFE_EXTREME
Data written to this inode or via this open file description is expected to have a lifetime longer than data written with RWH_WRITE_LIFE_LONG.
All the write-specific hints are relative to each other, and no individual absolute meaning should be attributed to them.
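Setting and reading back an inode hint might look like the following sketch. The function names (set_inode_write_hint, rw_hint_demo) and the temporary file path are hypothetical. The sketch assumes a C library whose headers define F_SET_RW_HINT and the RWH_* constants when _GNU_SOURCE is defined. Since some kernels or filesystems may reject the operation, a failure to set the hint is reported as 0 instead of as an error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Set the write lifetime hint on the inode referred to by fd.
   Note that the hint is passed by pointer, as a 64-bit value. */
int set_inode_write_hint(int fd, uint64_t hint)
{
    assert(hint <= RWH_WRITE_LIFE_EXTREME);
    return fcntl(fd, F_SET_RW_HINT, &hint);
}

/* Hypothetical demo: returns 1 if a hint could be set and read back,
   0 if setting the hint was rejected, and -1 on any other failure. */
int rw_hint_demo(void)
{
    char path[] = "/tmp/fcntl_hint_XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;

    int rc;
    uint64_t seen = RWH_WRITE_LIFE_NOT_SET;

    if (set_inode_write_hint(fd, RWH_WRITE_LIFE_SHORT) == -1)
        rc = 0;
    else if (fcntl(fd, F_GET_RW_HINT, &seen) == -1)
        rc = -1;
    else
        rc = (seen == RWH_WRITE_LIFE_SHORT) ? 1 : -1;

    unlink(path);
    close(fd);
    return rc;
}
```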
RETURN VALUE
For a successful call, the return value depends on the operation:
F_DUPFD
The new file descriptor.
F_GETFD
Value of file descriptor flags.
F_GETFL
Value of file status flags.
F_GETLEASE
Type of lease held on file descriptor.
F_GETOWN
Value of file descriptor owner.
F_GETSIG
Value of signal sent when read or write becomes possible, or zero for traditional SIGIO behavior.
F_GETPIPE_SZ
F_SETPIPE_SZ
The pipe capacity.
F_GET_SEALS
A bit mask identifying the seals that have been set for the inode referred to by fd.
All other operations
Zero.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES or EAGAIN
Operation is prohibited by locks held by other processes.
EAGAIN
The operation is prohibited because the file has been memory-mapped by another process.
EBADF
fd is not an open file descriptor.
EBADF
op is F_SETLK or F_SETLKW and the file descriptor open mode doesn’t match with the type of lock requested.
EBUSY
op is F_SETPIPE_SZ and the new pipe capacity specified in arg is smaller than the amount of buffer space currently used to store data in the pipe.
EBUSY
op is F_ADD_SEALS, arg includes F_SEAL_WRITE, and there exists a writable, shared mapping on the file referred to by fd.
EDEADLK
It was detected that the specified F_SETLKW operation would cause a deadlock.
EFAULT
lock is outside your accessible address space.
EINTR
op is F_SETLKW or F_OFD_SETLKW and the operation was interrupted by a signal; see signal(7).
EINTR
op is F_GETLK, F_SETLK, F_OFD_GETLK, or F_OFD_SETLK, and the operation was interrupted by a signal before the lock was checked or acquired. This is most likely when locking a remote file (e.g., locking over NFS), but can sometimes happen locally.
EINVAL
The value specified in op is not recognized by this kernel.
EINVAL
op is F_ADD_SEALS and arg includes an unrecognized sealing bit.
EINVAL
op is F_ADD_SEALS or F_GET_SEALS and the filesystem containing the inode referred to by fd does not support sealing.
EINVAL
op is F_DUPFD and arg is negative or is greater than the maximum allowable value (see the discussion of RLIMIT_NOFILE in getrlimit(2)).
EINVAL
op is F_SETSIG and arg is not an allowable signal number.
EINVAL
op is F_OFD_SETLK, F_OFD_SETLKW, or F_OFD_GETLK, and l_pid was not specified as zero.
EMFILE
op is F_DUPFD and the per-process limit on the number of open file descriptors has been reached.
ENOLCK
Too many segment locks open, lock table is full, or a remote locking protocol failed (e.g., locking over NFS).
ENOTDIR
F_NOTIFY was specified in op, but fd does not refer to a directory.
EPERM
op is F_SETPIPE_SZ and the soft or hard user pipe limit has been reached; see pipe(7).
EPERM
Attempted to clear the O_APPEND flag on a file that has the append-only attribute set.
EPERM
op was F_ADD_SEALS, but fd was not open for writing or the current set of seals on the file already includes F_SEAL_SEAL.
STANDARDS
POSIX.1-2008.
F_GETOWN_EX, F_SETOWN_EX, F_SETPIPE_SZ, F_GETPIPE_SZ, F_GETSIG, F_SETSIG, F_NOTIFY, F_GETLEASE, and F_SETLEASE are Linux-specific. (Define the _GNU_SOURCE macro to obtain these definitions.)
F_OFD_SETLK, F_OFD_SETLKW, and F_OFD_GETLK are Linux-specific (and one must define _GNU_SOURCE to obtain their definitions), but work is being done to have them included in the next version of POSIX.1.
F_ADD_SEALS and F_GET_SEALS are Linux-specific.
HISTORY
SVr4, 4.3BSD, POSIX.1-2001.
Only the operations F_DUPFD, F_GETFD, F_SETFD, F_GETFL, F_SETFL, F_GETLK, F_SETLK, and F_SETLKW are specified in POSIX.1-2001.
F_GETOWN and F_SETOWN are specified in POSIX.1-2001. (To get their definitions, define either _XOPEN_SOURCE with the value 500 or greater, or _POSIX_C_SOURCE with the value 200809L or greater.)
F_DUPFD_CLOEXEC is specified in POSIX.1-2008. (To get this definition, define _POSIX_C_SOURCE with the value 200809L or greater, or _XOPEN_SOURCE with the value 700 or greater.)
NOTES
The errors returned by dup2(2) are different from those returned by F_DUPFD.
File locking
The original Linux fcntl() system call was not designed to handle large file offsets (in the flock structure). Consequently, an fcntl64() system call was added in Linux 2.4. The newer system call employs a different structure for file locking, flock64, and corresponding operations, F_GETLK64, F_SETLK64, and F_SETLKW64. However, these details can be ignored by applications using glibc, whose fcntl() wrapper function transparently employs the more recent system call where it is available.
Record locks
Since Linux 2.0, there is no interaction between the types of lock placed by flock(2) and fcntl().
Several systems have more fields in struct flock such as, for example, l_sysid (to identify the machine where the lock is held). Clearly, l_pid alone is not going to be very useful if the process holding the lock may live on a different machine; on Linux, while present on some architectures (such as MIPS32), this field is not used.
Record locking and NFS
Before Linux 3.12, if an NFSv4 client loses contact with the server for a period of time (defined as more than 90 seconds with no communication), it might lose and regain a lock without ever being aware of the fact. (The period of time after which contact is assumed lost is known as the NFSv4 leasetime. On a Linux NFS server, this can be determined by looking at /proc/fs/nfsd/nfsv4leasetime, which expresses the period in seconds. The default value for this file is 90.) This scenario potentially risks data corruption, since another process might acquire a lock in the intervening period and perform file I/O.
Since Linux 3.12, if an NFSv4 client loses contact with the server, any I/O to the file by a process which "thinks" it holds a lock will fail until that process closes and reopens the file. A kernel parameter, nfs.recover_lost_locks, can be set to 1 to obtain the pre-3.12 behavior, whereby the client will attempt to recover lost locks when contact is reestablished with the server. Because of the attendant risk of data corruption, this parameter defaults to 0 (disabled).
BUGS
F_SETFL
It is not possible to use F_SETFL to change the state of the O_DSYNC and O_SYNC flags. Attempts to change the state of these flags are silently ignored.
F_GETOWN
A limitation of the Linux system call conventions on some architectures (notably i386) means that if a (negative) process group ID to be returned by F_GETOWN falls in the range -1 to -4095, then the return value is wrongly interpreted by glibc as an error in the system call; that is, the return value of fcntl() will be -1, and errno will contain the (positive) process group ID. The Linux-specific F_GETOWN_EX operation avoids this problem. Since glibc 2.11, glibc makes the kernel F_GETOWN problem invisible by implementing F_GETOWN using F_GETOWN_EX.
F_SETOWN
In Linux 2.4 and earlier, there is a bug that can occur when an unprivileged process uses F_SETOWN to specify the owner of a socket file descriptor as a process (group) other than the caller. In this case, fcntl() can return -1 with errno set to EPERM, even when the owner process (group) is one that the caller has permission to send signals to. Despite this error return, the file descriptor owner is set, and signals will be sent to the owner.
Deadlock detection
The deadlock-detection algorithm employed by the kernel when dealing with F_SETLKW requests can yield both false negatives (failures to detect deadlocks, leaving a set of deadlocked processes blocked indefinitely) and false positives (EDEADLK errors when there is no deadlock). For example, the kernel limits the lock depth of its dependency search to 10 steps, meaning that circular deadlock chains that exceed that size will not be detected. In addition, the kernel may falsely indicate a deadlock when two or more processes created using the clone(2) CLONE_FILES flag place locks that appear (to the kernel) to conflict.
Mandatory locking
The Linux implementation of mandatory locking is subject to race conditions which render it unreliable: a write(2) call that overlaps with a lock may modify data after the mandatory lock is acquired; a read(2) call that overlaps with a lock may detect changes to data that were made only after a write lock was acquired. Similar races exist between mandatory locks and mmap(2). It is therefore inadvisable to rely on mandatory locking.
SEE ALSO
dup2(2), flock(2), open(2), socket(2), lockf(3), capabilities(7), feature_test_macros(7), lslocks(8)
locks.txt, mandatory-locking.txt, and dnotify.txt in the Linux kernel source directory Documentation/filesystems/ (on older kernels, these files are directly under the Documentation/ directory, and mandatory-locking.txt is called mandatory.txt)
3 - Linux cli command prlimit
NAME
prlimit - get/set resource limits
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/resource.h>
int getrlimit(int resource, struct rlimit *rlim);
int setrlimit(int resource, const struct rlimit *rlim);
int prlimit(pid_t pid, int resource,
const struct rlimit *_Nullable new_limit,
struct rlimit *_Nullable old_limit);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
prlimit():
_GNU_SOURCE
DESCRIPTION
The getrlimit() and setrlimit() system calls get and set resource limits. Each resource has an associated soft and hard limit, as defined by the rlimit structure:
struct rlimit {
    rlim_t rlim_cur;  /* Soft limit */
    rlim_t rlim_max;  /* Hard limit (ceiling for rlim_cur) */
};
The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may set only its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability in the initial user namespace) may make arbitrary changes to either limit value.
The value RLIM_INFINITY denotes no limit on a resource (both in the structure returned by getrlimit() and in the structure passed to setrlimit()).
The resource argument must be one of:
RLIMIT_AS
This is the maximum size of the process's virtual memory (address space). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), mmap(2), and mremap(2), which fail with the error ENOMEM upon exceeding this limit. In addition, automatic stack expansion fails (and generates a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited.
RLIMIT_CORE
This is the maximum size of a core file (see core(5)) in bytes that the process may dump. When 0 no core dump files are created. When nonzero, larger dumps are truncated to this size.
RLIMIT_CPU
This is a limit, in seconds, on the amount of CPU time that the process can consume. When the process reaches the soft limit, it is sent a SIGXCPU signal. The default action for this signal is to terminate the process. However, the signal can be caught, and the handler can return control to the main program. If the process continues to consume CPU time, it will be sent SIGXCPU once per second until the hard limit is reached, at which time it is sent SIGKILL. (This latter point describes Linux behavior. Implementations vary in how they treat processes which continue to consume CPU time after reaching the soft limit. Portable applications that need to catch this signal should perform an orderly termination upon first receipt of SIGXCPU.)
RLIMIT_DATA
This is the maximum size of the process's data segment (initialized data, uninitialized data, and heap). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), sbrk(2), and (since Linux 4.7) mmap(2), which fail with the error ENOMEM upon encountering the soft limit of this resource.
RLIMIT_FSIZE
This is the maximum size in bytes of files that the process may create. Attempts to extend a file beyond this limit result in delivery of a SIGXFSZ signal. By default, this signal terminates a process, but a process can catch this signal instead, in which case the relevant system call (e.g., write(2), truncate(2)) fails with the error EFBIG.
RLIMIT_LOCKS (Linux 2.4.0 to Linux 2.4.24)
This is a limit on the combined number of flock(2) locks and fcntl(2) leases that this process may establish.
RLIMIT_MEMLOCK
This is the maximum number of bytes of memory that may be locked into RAM. This limit is in effect rounded down to the nearest multiple of the system page size. This limit affects mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9, it also affects the shmctl(2) SHM_LOCK operation, where it sets a maximum on the total bytes in shared memory segments (see shmget(2)) that may be locked by the real user ID of the calling process. The shmctl(2) SHM_LOCK locks are accounted for separately from the per-process memory locks established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED; a process can lock bytes up to this limit in each of these two categories.
Before Linux 2.6.9, this limit controlled the amount of memory that could be locked by a privileged process. Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process may lock, and this limit instead governs the amount of memory that an unprivileged process may lock.
RLIMIT_MSGQUEUE (since Linux 2.6.8)
This is a limit on the number of bytes that can be allocated for POSIX message queues for the real user ID of the calling process. This limit is enforced for mq_open(3). Each message queue that the user creates counts (until it is removed) against this limit according to the formula:
Since Linux 3.5:
    bytes = attr.mq_maxmsg * sizeof(struct msg_msg) +
            MIN(attr.mq_maxmsg, MQ_PRIO_MAX) *
                sizeof(struct posix_msg_tree_node) +   /* For overhead */
            attr.mq_maxmsg * attr.mq_msgsize;          /* For message data */
Linux 3.4 and earlier:
    bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) +   /* For overhead */
            attr.mq_maxmsg * attr.mq_msgsize;             /* For message data */
where attr is the mq_attr structure specified as the fourth argument to mq_open(3), and the msg_msg and posix_msg_tree_node structures are kernel-internal structures.
The "overhead" addend in the formula accounts for overhead bytes required by the implementation and ensures that the user cannot create an unlimited number of zero-length messages (such messages nevertheless each consume some system memory for bookkeeping overhead).
RLIMIT_NICE (since Linux 2.6.12, but see BUGS below)
This specifies a ceiling to which the process's nice value can be raised using setpriority(2) or nice(2). The actual ceiling for the nice value is calculated as 20 - rlim_cur. The useful range for this limit is thus from 1 (corresponding to a nice value of 19) to 40 (corresponding to a nice value of -20). This unusual choice of range was necessary because negative numbers cannot be specified as resource limit values, since they typically have special meanings. For example, RLIM_INFINITY typically is the same as -1. For more detail on the nice value, see sched(7).
RLIMIT_NOFILE
This specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.)
Since Linux 4.5, this limit also defines the maximum number of file descriptors that an unprivileged process (one without the CAP_SYS_RESOURCE capability) may have "in flight" to other processes, by being passed across UNIX domain sockets. This limit applies to the sendmsg(2) system call. For further details, see unix(7).
RLIMIT_NPROC
This is a limit on the number of extant processes (or, more precisely on Linux, threads) for the real user ID of the calling process. So long as the current number of processes belonging to this process's real user ID is greater than or equal to this limit, fork(2) fails with the error EAGAIN.
The RLIMIT_NPROC limit is not enforced for processes that have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability, or run with real user ID 0.
RLIMIT_RSS
This is a limit (in bytes) on the process's resident set (the number of virtual pages resident in RAM). This limit has effect only in Linux 2.4.x, x < 30, and there affects only calls to madvise(2) specifying MADV_WILLNEED.
RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS)
This specifies a ceiling on the real-time priority that may be set for this process using sched_setscheduler(2) and sched_setparam(2).
For further details on real-time scheduling policies, see sched(7).
RLIMIT_RTTIME (since Linux 2.6.25)
This is a limit (in microseconds) on the amount of CPU time that a process scheduled under a real-time scheduling policy may consume without making a blocking system call. For the purpose of this limit, each time a process makes a blocking system call, the count of its consumed CPU time is reset to zero. The CPU time count is not reset if the process continues trying to use the CPU but is preempted, its time slice expires, or it calls sched_yield(2).
Upon reaching the soft limit, the process is sent a SIGXCPU signal. If the process catches or ignores this signal and continues consuming CPU time, then SIGXCPU will be generated once each second until the hard limit is reached, at which point the process is sent a SIGKILL signal.
The intended use of this limit is to stop a runaway real-time process from locking up the system.
For further details on real-time scheduling policies, see sched(7).
RLIMIT_SIGPENDING (since Linux 2.6.8)
This is a limit on the number of signals that may be queued for the real user ID of the calling process. Both standard and real-time signals are counted for the purpose of checking this limit. However, the limit is enforced only for sigqueue(3); it is always possible to use kill(2) to queue one instance of any of the signals that are not already queued to the process.
RLIMIT_STACK
This is the maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated. To handle this signal, a process must employ an alternate signal stack (sigaltstack(2)).
Since Linux 2.6.23, this limit also determines the amount of space used for the process's command-line arguments and environment variables; for details, see execve(2).
prlimit()
The Linux-specific prlimit() system call combines and extends the functionality of setrlimit() and getrlimit(). It can be used to both set and get the resource limits of an arbitrary process.
The resource argument has the same meaning as for setrlimit() and getrlimit().
If the new_limit argument is not NULL, then the rlimit structure to which it points is used to set new values for the soft and hard limits for resource. If the old_limit argument is not NULL, then a successful call to prlimit() places the previous soft and hard limits for resource in the rlimit structure pointed to by old_limit.
The pid argument specifies the ID of the process on which the call is to operate. If pid is 0, then the call applies to the calling process. To set or get the resources of a process other than itself, the caller must have the CAP_SYS_RESOURCE capability in the user namespace of the process whose resource limits are being changed, or the real, effective, and saved set user IDs of the target process must match the real user ID of the caller and the real, effective, and saved set group IDs of the target process must match the real group ID of the caller.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
A pointer argument points to a location outside the accessible address space.
EINVAL
The value specified in resource is not valid; or, for setrlimit() or prlimit(): rlim->rlim_cur was greater than rlim->rlim_max.
EPERM
An unprivileged process tried to raise the hard limit; the CAP_SYS_RESOURCE capability is required to do this.
EPERM
The caller tried to increase the hard RLIMIT_NOFILE limit above the maximum defined by /proc/sys/fs/nr_open (see proc(5)).
EPERM
(prlimit()) The calling process did not have permission to set limits for the process specified by pid.
ESRCH
Could not find a process with the ID specified in pid.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
getrlimit(), setrlimit(), prlimit() | Thread safety | MT-Safe |
STANDARDS
getrlimit()
setrlimit()
POSIX.1-2008.
prlimit()
Linux.
RLIMIT_MEMLOCK and RLIMIT_NPROC derive from BSD and are not specified in POSIX.1; they are present on the BSDs and Linux, but on few other implementations. RLIMIT_RSS derives from BSD and is not specified in POSIX.1; it is nevertheless present on most implementations. RLIMIT_MSGQUEUE, RLIMIT_NICE, RLIMIT_RTPRIO, RLIMIT_RTTIME, and RLIMIT_SIGPENDING are Linux-specific.
HISTORY
getrlimit()
setrlimit()
POSIX.1-2001, SVr4, 4.3BSD.
prlimit()
Linux 2.6.36, glibc 2.13.
NOTES
A child process created via fork(2) inherits its parent's resource limits. Resource limits are preserved across execve(2).
Resource limits are per-process attributes that are shared by all of the threads in a process.
Lowering the soft limit for a resource below the process's current consumption of that resource will succeed (but will prevent the process from further increasing its consumption of the resource).
One can set the resource limits of the shell using the built-in ulimit command (limit in csh(1)). The shell's resource limits are inherited by the processes that it creates to execute commands.
Since Linux 2.6.24, the resource limits of any process can be inspected via /proc/pid/limits; see proc(5).
Ancient systems provided a vlimit() function with a similar purpose to setrlimit(). For backward compatibility, glibc also provides vlimit(). All new applications should be written using setrlimit().
C library/kernel ABI differences
Since glibc 2.13, the glibc getrlimit() and setrlimit() wrapper functions no longer invoke the corresponding system calls, but instead employ prlimit(), for the reasons described in BUGS.
The name of the glibc wrapper function is prlimit(); the underlying system call is prlimit64().
BUGS
In older Linux kernels, the SIGXCPU and SIGKILL signals delivered when a process encountered the soft and hard RLIMIT_CPU limits were delivered one (CPU) second later than they should have been. This was fixed in Linux 2.6.8.
In Linux 2.6.x kernels before Linux 2.6.17, a RLIMIT_CPU limit of 0 is wrongly treated as "no limit" (like RLIM_INFINITY). Since Linux 2.6.17, setting a limit of 0 does have an effect, but is actually treated as a limit of 1 second.
A kernel bug means that RLIMIT_RTPRIO does not work in Linux 2.6.12; the problem is fixed in Linux 2.6.13.
In Linux 2.6.12, there was an off-by-one mismatch between the priority ranges returned by getpriority(2) and RLIMIT_NICE. This had the effect that the actual ceiling for the nice value was calculated as 19 - rlim_cur. This was fixed in Linux 2.6.13.
Since Linux 2.6.12, if a process reaches its soft RLIMIT_CPU limit and has a handler installed for SIGXCPU, then, in addition to invoking the signal handler, the kernel increases the soft limit by one second. This behavior repeats if the process continues to consume CPU time, until the hard limit is reached, at which point the process is killed. Other implementations do not change the RLIMIT_CPU soft limit in this manner, and the Linux behavior is probably not standards conformant; portable applications should avoid relying on this Linux-specific behavior. The Linux-specific RLIMIT_RTTIME limit exhibits the same behavior when the soft limit is encountered.
Kernels before Linux 2.4.22 did not diagnose the error EINVAL for setrlimit() when rlim->rlim_cur was greater than rlim->rlim_max.
Linux doesn't return an error when an attempt to set RLIMIT_CPU has failed, for compatibility reasons.
Representation of "large" resource limit values on 32-bit platforms
The glibc getrlimit() and setrlimit() wrapper functions use a 64-bit rlim_t data type, even on 32-bit platforms. However, the rlim_t data type used in the getrlimit() and setrlimit() system calls is a (32-bit) unsigned long. Furthermore, in Linux, the kernel represents resource limits on 32-bit platforms as unsigned long. However, a 32-bit data type is not wide enough. The most pertinent limit here is RLIMIT_FSIZE, which specifies the maximum size to which a file can grow: to be useful, this limit must be represented using a type that is as wide as the type used to represent file offsets, that is, as wide as a 64-bit off_t (assuming a program compiled with _FILE_OFFSET_BITS=64).
To work around this kernel limitation, if a program tried to set a resource limit to a value larger than can be represented in a 32-bit unsigned long, then the glibc setrlimit() wrapper function silently converted the limit value to RLIM_INFINITY. In other words, the requested resource limit setting was silently ignored.
Since glibc 2.13, glibc works around the limitations of the getrlimit() and setrlimit() system calls by implementing setrlimit() and getrlimit() as wrapper functions that call prlimit().
EXAMPLES
The program below demonstrates the use of prlimit().
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    pid_t          pid;
    struct rlimit  old, new;
    struct rlimit  *newp;

    if (!(argc == 2 || argc == 4)) {
        fprintf(stderr, "Usage: %s <pid> [<new-soft-limit> "
                "<new-hard-limit>]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    pid = atoi(argv[1]);        /* PID of target process */

    newp = NULL;
    if (argc == 4) {
        new.rlim_cur = atoi(argv[2]);
        new.rlim_max = atoi(argv[3]);
        newp = &new;
    }

    /* Set CPU time limit of target process; retrieve and display
       previous limit */

    if (prlimit(pid, RLIMIT_CPU, newp, &old) == -1)
        err(EXIT_FAILURE, "prlimit-1");
    printf("Previous limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);

    /* Retrieve and display new CPU time limit */

    if (prlimit(pid, RLIMIT_CPU, NULL, &old) == -1)
        err(EXIT_FAILURE, "prlimit-2");
    printf("New limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);

    exit(EXIT_SUCCESS);
}
SEE ALSO
prlimit(1), dup(2), fcntl(2), fork(2), getrusage(2), mlock(2), mmap(2), open(2), quotactl(2), sbrk(2), shmctl(2), malloc(3), sigqueue(3), ulimit(3), core(5), capabilities(7), cgroups(7), credentials(7), signal(7)
4 - Linux cli command utime
NAME
utime - change file last access and modification times
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <utime.h>
int utime(const char *filename,
const struct utimbuf *_Nullable times);
#include <sys/time.h>
int utimes(const char *filename,
const struct timeval times[_Nullable 2]);
DESCRIPTION
Note: modern applications may prefer to use the interfaces described in utimensat(2).
The utime() system call changes the access and modification times of the inode specified by filename to the actime and modtime fields of times respectively. The status change time (ctime) will be set to the current time, even if the other timestamps don't actually change.
If times is NULL, then the access and modification times of the file are set to the current time.
Changing timestamps is permitted when: either the process has appropriate privileges, or the effective user ID equals the user ID of the file, or times is NULL and the process has write permission for the file.
The utimbuf structure is:
struct utimbuf {
    time_t actime;       /* access time */
    time_t modtime;      /* modification time */
};
The utime() system call allows specification of timestamps with a resolution of 1 second.
The utimes() system call is similar, but the times argument refers to an array rather than a structure. The elements of this array are timeval structures, which allow a precision of 1 microsecond for specifying timestamps. The timeval structure is:
struct timeval {
    long tv_sec;        /* seconds */
    long tv_usec;       /* microseconds */
};
times[0] specifies the new access time, and times[1] specifies the new modification time. If times is NULL, then analogously to utime(), the access and modification times of the file are set to the current time.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of filename (see also path_resolution(7)).
EACCES
times is NULL, the caller's effective user ID does not match the owner of the file, the caller does not have write access to the file, and the caller is not privileged (Linux: does not have either the CAP_DAC_OVERRIDE or the CAP_FOWNER capability).
ENOENT
filename does not exist.
EPERM
times is not NULL, the caller's effective UID does not match the owner of the file, and the caller is not privileged (Linux: does not have the CAP_FOWNER capability).
EROFS
filename resides on a read-only filesystem.
STANDARDS
POSIX.1-2008.
HISTORY
utime()
SVr4, POSIX.1-2001. POSIX.1-2008 marks it as obsolete.
utimes()
4.3BSD, POSIX.1-2001.
NOTES
Linux does not allow changing the timestamps on an immutable file, or setting the timestamps to something other than the current time on an append-only file.
SEE ALSO
chattr(1), touch(1), futimesat(2), stat(2), utimensat(2), futimens(3), futimes(3), inode(7)
5 - Linux cli command getdents64
NAME
getdents64 - get directory entries
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_getdents, unsigned int fd, struct linux_dirent *dirp,
             unsigned int count);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <dirent.h>
ssize_t getdents64(int fd, void dirp[.count], size_t count);
Note: glibc provides no wrapper for getdents(), necessitating the use of syscall(2).
Note: There is no definition of struct linux_dirent in glibc; see NOTES.
DESCRIPTION
These are not the interfaces you are interested in. Look at readdir(3) for the POSIX-conforming C library interface. This page documents the bare kernel system call interfaces.
getdents()
The system call getdents() reads several linux_dirent structures from the directory referred to by the open file descriptor fd into the buffer pointed to by dirp. The argument count specifies the size of that buffer.
The linux_dirent structure is declared as follows:
struct linux_dirent {
    unsigned long  d_ino;     /* Inode number */
    unsigned long  d_off;     /* Not an offset; see below */
    unsigned short d_reclen;  /* Length of this linux_dirent */
    char           d_name[];  /* Filename (null-terminated) */
                      /* length is actually (d_reclen - 2 -
                         offsetof(struct linux_dirent, d_name)) */
    /*
    char           pad;       // Zero padding byte
    char           d_type;    // File type (only since Linux
                              // 2.6.4); offset is (d_reclen - 1)
    */
};
d_ino is an inode number. d_off is a filesystem-specific value with no specific meaning to user space, though on older filesystems it used to be the distance from the start of the directory to the start of the next linux_dirent; see readdir(3). d_reclen is the size of this entire linux_dirent. d_name is a null-terminated filename.
d_type is a byte at the end of the structure that indicates the file type. It contains one of the following values (defined in <dirent.h>):
DT_BLK
This is a block device.
DT_CHR
This is a character device.
DT_DIR
This is a directory.
DT_FIFO
This is a named pipe (FIFO).
DT_LNK
This is a symbolic link.
DT_REG
This is a regular file.
DT_SOCK
This is a UNIX domain socket.
DT_UNKNOWN
The file type is unknown.
The d_type field is implemented since Linux 2.6.4. It occupies a space that was previously a zero-filled padding byte in the linux_dirent structure. Thus, on kernels up to and including Linux 2.6.3, attempting to access this field always provides the value 0 (DT_UNKNOWN).
Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type. All applications must properly handle a return of DT_UNKNOWN.
getdents64()
The original Linux getdents() system call did not handle large filesystems and large file offsets. Consequently, Linux 2.4 added getdents64(), with wider types for the d_ino and d_off fields. In addition, getdents64() supports an explicit d_type field.
The getdents64() system call is like getdents(), except that its second argument is a pointer to a buffer containing structures of the following type:
struct linux_dirent64 {
    ino64_t        d_ino;    /* 64-bit inode number */
    off64_t        d_off;    /* Not an offset; see getdents() */
    unsigned short d_reclen; /* Size of this dirent */
    unsigned char  d_type;   /* File type */
    char           d_name[]; /* Filename (null-terminated) */
};
RETURN VALUE
On success, the number of bytes read is returned. On end of directory, 0 is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
Invalid file descriptor fd.
EFAULT
Argument points outside the calling process's address space.
EINVAL
Result buffer is too small.
ENOENT
No such directory.
ENOTDIR
File descriptor does not refer to a directory.
STANDARDS
None.
HISTORY
SVr4.
getdents64()
glibc 2.30.
NOTES
glibc does not provide a wrapper for getdents(); call getdents() using syscall(2). In that case you will need to define the linux_dirent or linux_dirent64 structure yourself.
Probably, you want to use readdir(3) instead of these system calls.
These calls supersede readdir(2).
EXAMPLES
The program below demonstrates the use of getdents(). The following output shows an example of what we see when running this program on an ext2 directory:
$ ./a.out /testfs/
--------------- nread=120 ---------------
inode#    file type  d_reclen  d_off   d_name
       2  directory    16         12  .
       2  directory    16         24  ..
      11  directory    24         44  lost+found
      12  regular      16         56  a
  228929  directory    16         68  sub
   16353  directory    16         80  sub2
  130817  directory    16       4096  sub3
Program source
#define _GNU_SOURCE
#include <dirent.h>     /* Defines DT_* constants */
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct linux_dirent {
    unsigned long  d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    char           d_name[];
};

#define BUF_SIZE 1024

int
main(int argc, char *argv[])
{
    int                  fd;
    char                 d_type;
    char                 buf[BUF_SIZE];
    long                 nread;
    struct linux_dirent  *d;

    fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        err(EXIT_FAILURE, "open");

    for (;;) {
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1)
            err(EXIT_FAILURE, "getdents");

        if (nread == 0)
            break;

        printf("--------------- nread=%ld ---------------\n", nread);
        printf("inode#    file type  d_reclen  d_off   d_name\n");
        for (size_t bpos = 0; bpos < nread;) {
            d = (struct linux_dirent *) (buf + bpos);
            printf("%8lu  ", d->d_ino);
            d_type = *(buf + bpos + d->d_reclen - 1);
            printf("%-10s ", (d_type == DT_REG) ?  "regular" :
                             (d_type == DT_DIR) ?  "directory" :
                             (d_type == DT_FIFO) ? "FIFO" :
                             (d_type == DT_SOCK) ? "socket" :
                             (d_type == DT_LNK) ?  "symlink" :
                             (d_type == DT_BLK) ?  "block dev" :
                             (d_type == DT_CHR) ?  "char dev" : "???");
            printf("%4d %10jd  %s\n", d->d_reclen,
                   (intmax_t) d->d_off, d->d_name);
            bpos += d->d_reclen;
        }
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
readdir(2), readdir(3), inode(7)
6 - Linux cli command sync_file_range2
NAME
sync_file_range2 - sync a file segment with disk
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
int sync_file_range(int fd, off_t offset, off_t nbytes,
unsigned int flags);
DESCRIPTION
sync_file_range() permits fine control when synchronizing the open file referred to by the file descriptor fd with disk.
offset is the starting byte of the file range to be synchronized. nbytes specifies the length of the range to be synchronized, in bytes; if nbytes is zero, then all bytes from offset through to the end of file are synchronized. Synchronization is in units of the system page size: offset is rounded down to a page boundary; (offset+nbytes-1) is rounded up to a page boundary.
The flags bit-mask argument can include any of the following values:
SYNC_FILE_RANGE_WAIT_BEFORE
Wait upon write-out of all pages in the specified range that have already been submitted to the device driver for write-out before performing any write.
SYNC_FILE_RANGE_WRITE
Initiate write-out of all dirty pages in the specified range which are not presently submitted for write-out. Note that even this may block if you attempt to write more than the request queue size.
SYNC_FILE_RANGE_WAIT_AFTER
Wait upon write-out of all pages in the range after performing any write.
Specifying flags as 0 is permitted, as a no-op.
Warning
This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the file’s metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an overwrite. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches.
Some details
SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any I/O errors or ENOSPC conditions and will return these to the caller.
Useful combinations of the flags bits are:
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE
Ensures that all pages in the specified range which were dirty when sync_file_range() was called are placed under write-out. This is a start-write-for-data-integrity operation.
SYNC_FILE_RANGE_WRITE
Start write-out of all dirty pages in the specified range which are not presently under write-out. This is an asynchronous flush-to-disk operation. This is not suitable for data integrity operations.
SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER)
Wait for completion of write-out of all pages in the specified range. This can be used after an earlier SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE operation to wait for completion of that operation, and obtain its result.
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER
This is a write-for-data-integrity operation that will ensure that all pages in the specified range which were dirty when sync_file_range() was called are committed to disk.
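As an illustration of the last combination above, the following sketch writes a buffer at a given offset and then runs the write-for-data-integrity flag set over just that range. The helper name write_and_sync_range() is hypothetical. As the Warning section explains, this does not write file metadata or flush disk write caches, so it should not be read as a durability guarantee.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes from buf at offset, then wait for
   write-out of that byte range only.
   Returns 0 on success, or -1 with errno set. */
int
write_and_sync_range(int fd, const void *buf, size_t len, off_t offset)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL && len > 0);

    /* pwrite() may write fewer bytes than requested; loop until done. */
    while (done < len) {
        ssize_t n = pwrite(fd, p + done, len - done, offset + (off_t) done);
        if (n == -1)
            return -1;
        if (n == 0) {            /* No progress; report as an I/O error */
            errno = EIO;
            return -1;
        }
        done += (size_t) n;
    }

    /* Wait for earlier write-out, start write-out of dirty pages in the
       range, then wait for that write-out to complete. */
    return sync_file_range(fd, offset, (off_t) len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

Applications that need the data to survive a crash would normally call fdatasync(2) or fsync(2) instead.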
RETURN VALUE
On success, sync_file_range() returns 0; on failure -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
fd is not a valid file descriptor.
EINVAL
flags specifies an invalid bit; or offset or nbytes is invalid.
EIO
I/O error.
ENOMEM
Out of memory.
ENOSPC
Out of disk space.
ESPIPE
fd refers to something other than a regular file, a block device, or a directory.
VERSIONS
sync_file_range2()
Some architectures (e.g., PowerPC, ARM) need 64-bit arguments to be aligned in a suitable pair of registers. On such architectures, the call signature of sync_file_range() shown in the SYNOPSIS would force a register to be wasted as padding between the fd and offset arguments. (See syscall(2) for details.) Therefore, these architectures define a different system call that orders the arguments suitably:
int sync_file_range2(int fd, unsigned int flags,
off_t offset, off_t nbytes);
The behavior of this system call is otherwise exactly the same as sync_file_range().
STANDARDS
Linux.
HISTORY
Linux 2.6.17.
sync_file_range2()
A system call with this signature first appeared on the ARM architecture in Linux 2.6.20, with the name arm_sync_file_range(). It was renamed in Linux 2.6.22, when the analogous system call was added for PowerPC. On architectures where glibc support is provided, glibc transparently wraps sync_file_range2() under the name sync_file_range().
NOTES
_FILE_OFFSET_BITS should be defined to be 64 in code that takes the address of sync_file_range, if the code is intended to be portable to traditional 32-bit x86 and ARM platforms where off_t’s width defaults to 32 bits.
SEE ALSO
fdatasync(2), fsync(2), msync(2), sync(2)
7 - Linux cli command arm_fadvise
NAME
arm_fadvise - predeclare an access pattern for file data
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h>
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
posix_fadvise():
_POSIX_C_SOURCE >= 200112L
DESCRIPTION
Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.
The advice applies to a (not necessarily existent) region starting at offset and extending for len bytes (or until the end of the file if len is 0) within the file referred to by fd. The advice is not binding; it merely constitutes an expectation on behalf of the application.
Permissible values for advice include:
POSIX_FADV_NORMAL
Indicates that the application has no advice to give about its access pattern for the specified data. If no advice is given for an open file, this is the default assumption.
POSIX_FADV_SEQUENTIAL
The application expects to access the specified data sequentially (with lower offsets read before higher ones).
POSIX_FADV_RANDOM
The specified data will be accessed in random order.
POSIX_FADV_NOREUSE
The specified data will be accessed only once.
Before Linux 2.6.18, POSIX_FADV_NOREUSE had the same semantics as POSIX_FADV_WILLNEED. This was probably a bug; since Linux 2.6.18, this flag is a no-op.
POSIX_FADV_WILLNEED
The specified data will be accessed in the near future.
POSIX_FADV_WILLNEED initiates a nonblocking read of the specified region into the page cache. The amount of data read may be decreased by the kernel depending on virtual memory load. (A few megabytes will usually be fully satisfied, and more is rarely useful.)
POSIX_FADV_DONTNEED
The specified data will not be accessed in the near future.
POSIX_FADV_DONTNEED attempts to free cached pages associated with the specified region. This is useful, for example, while streaming large files. A program may periodically request the kernel to free cached data that has already been used, so that more useful cached pages are not discarded instead.
Requests to discard partial pages are ignored. It is preferable to preserve needed data than discard unneeded data. If the application requires that data be considered for discarding, then offset and len must be page-aligned.
The implementation may attempt to write back dirty pages in the specified region, but this is not guaranteed. Any unwritten dirty pages will not be freed. If the application wishes to ensure that dirty pages will be released, it should call fsync(2) or fdatasync(2) first.
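The streaming use case described above might look like the following sketch: announce sequential access, read the file once, then ask the kernel to release the cached pages. The function name stream_and_drop() is hypothetical. Note that posix_fadvise() returns the error number directly rather than setting errno; the helper copies it into errno only for the convenience of its own callers.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: read the file at path from start to end, hinting
   the access pattern to the kernel.
   Returns the number of bytes read, or -1 with errno set. */
long
stream_and_drop(const char *path)
{
    char buf[65536];
    long total = 0;
    ssize_t n;
    int fd, err;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* len == 0 means "through to the end of the file". */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (err != 0) {
        close(fd);
        errno = err;
        return -1;
    }

    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;              /* A real program would process buf here */

    if (n == -1) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    /* The data was read only, so there are no dirty pages to write back.
       The advice is not binding, so its result is deliberately ignored. */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return total;
}
```

Using an offset and length of 0 keeps the request page-aligned, so no partial-page requests are ignored.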
RETURN VALUE
On success, zero is returned. On error, an error number is returned.
ERRORS
EBADF
The fd argument was not a valid file descriptor.
EINVAL
An invalid value was specified for advice.
ESPIPE
The specified file descriptor refers to a pipe or FIFO. (ESPIPE is the error specified by POSIX, but before Linux 2.6.16, Linux returned EINVAL in this case.)
VERSIONS
Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely. These changes affect the entire file, not just the specified region (but other open file handles to the same file are unaffected).
C library/kernel differences
The name of the wrapper function in the C library is posix_fadvise(). The underlying system call is called fadvise64() (or, on some architectures, fadvise64_64()); the difference between the two is that the former system call assumes that the type of the len argument is size_t, while the latter expects loff_t there.
Architecture-specific variants
Some architectures require 64-bit arguments to be aligned in a suitable pair of registers (see syscall(2) for further detail). On such architectures, the call signature of posix_fadvise() shown in the SYNOPSIS would force a register to be wasted as padding between the fd and offset arguments. Therefore, these architectures define a version of the system call that orders the arguments suitably, but is otherwise exactly the same as posix_fadvise().
For example, since Linux 2.6.14, ARM has the following system call:
long arm_fadvise64_64(int fd, int advice,
loff_t offset, loff_t len);
These architecture-specific details are generally hidden from applications by the glibc posix_fadvise() wrapper function, which invokes the appropriate architecture-specific system call.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
Kernel support first appeared in Linux 2.5.60; the underlying system call is called fadvise64(). Library support has been provided since glibc 2.2, via the wrapper function posix_fadvise().
Since Linux 3.18, support for the underlying system call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS configuration option.
The type of the len argument was changed from size_t to off_t in POSIX.1-2001 TC1.
NOTES
The contents of the kernel buffer cache can be cleared via the /proc/sys/vm/drop_caches interface described in proc(5).
One can obtain a snapshot of which pages of a file are resident in the buffer cache by opening a file, mapping it with mmap(2), and then applying mincore(2) to the mapping.
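The open, mmap(2), mincore(2) sequence mentioned above can be sketched as follows. The helper name count_resident_pages() is hypothetical, and the result is only a snapshot: pages may be cached or evicted at any moment after the call.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count how many pages of the file at path are
   resident in memory. Stores the file's page count in *total_pages.
   Returns the resident page count, or -1 on error (including an empty
   file, which cannot be mapped). */
long
count_resident_pages(const char *path, long *total_pages)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    struct stat st;
    long resident = -1;
    int fd;

    assert(path != NULL && total_pages != NULL);

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t) st.st_size;
    size_t npages = (len + page - 1) / page;

    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                   /* The mapping remains valid after close */
    if (map == MAP_FAILED)
        return -1;

    /* mincore() fills one byte per page; the low bit means "resident". */
    unsigned char *vec = malloc(npages);
    if (vec != NULL && mincore(map, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
        *total_pages = (long) npages;
    }

    free(vec);
    munmap(map, len);
    return resident;
}
```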
BUGS
Before Linux 2.6.6, if len was specified as 0, then this was interpreted literally as “zero bytes”, rather than as meaning “all bytes through to the end of the file”.
SEE ALSO
fincore(1), mincore(2), readahead(2), sync_file_range(2), posix_fallocate(3), posix_madvise(3)
8 - Linux cli command mlock2
NAME
mlock2 - lock and unlock memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
int mlock(const void addr[.len], size_t len);
int mlock2(const void addr[.len], size_t len, unsigned int flags);
int munlock(const void addr[.len], size_t len);
int mlockall(int flags);
int munlockall(void);
DESCRIPTION
mlock(), mlock2(), and mlockall() lock part or all of the calling process’s virtual address space into RAM, preventing that memory from being paged to the swap area.
munlock() and munlockall() perform the converse operation, unlocking part or all of the calling process’s virtual address space, so that pages in the specified virtual address range can be swapped out again if required by the kernel memory manager.
Memory locking and unlocking are performed in units of whole pages.
mlock(), mlock2(), and munlock()
mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.
mlock2() also locks pages in the specified range starting at addr and continuing for len bytes. However, the state of the pages contained in that range after the call returns successfully will depend on the value in the flags argument.
The flags argument can be either 0 or the following constant:
MLOCK_ONFAULT
Lock pages that are currently resident and mark the entire range so that the remaining nonresident pages are locked when they are populated by a page fault.
If flags is 0, mlock2() behaves exactly the same as mlock().
munlock() unlocks pages in the address range starting at addr and continuing for len bytes. After this call, all pages that contain a part of the specified memory range can be moved to external swap space again by the kernel.
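A minimal sketch of how these calls might be combined to hold sensitive data in locked memory follows. The helpers alloc_locked() and free_locked() are hypothetical names. The sketch assumes a glibc that provides mlock2() and explicit_bzero(); passing flags as 0 gives mlock() behavior, as noted above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: map len bytes of anonymous memory and lock them.
   flags is passed to mlock2(): 0 or MLOCK_ONFAULT.
   Returns the mapping, or NULL with errno set. */
void *
alloc_locked(size_t len, unsigned int flags)
{
    void *p;

    assert(len > 0);

    /* mmap() returns page-aligned memory, which keeps the call portable
       to systems that require addr to be page aligned. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    if (mlock2(p, len, flags) == -1) {
        int saved = errno;       /* munmap() may overwrite errno */
        munmap(p, len);
        errno = saved;
        return NULL;
    }
    return p;
}

/* Wipe and release a region obtained from alloc_locked().
   Unmapping the range also removes the memory lock.
   Returns 0 on success, or -1 with errno set. */
int
free_locked(void *p, size_t len)
{
    assert(p != NULL);
    explicit_bzero(p, len);      /* Not optimized away, unlike memset() */
    return munmap(p, len);
}
```

A caller should be prepared for alloc_locked() to fail with ENOMEM or EPERM when the RLIMIT_MEMLOCK limit is too small for the request.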
mlockall() and munlockall()
mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data, and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.
The flags argument is constructed as the bitwise OR of one or more of the following constants:
MCL_CURRENT
Lock all pages which are currently mapped into the address space of the process.
MCL_FUTURE
Lock all pages which will become mapped into the address space of the process in the future. These could be, for instance, new pages required by a growing heap and stack as well as new memory-mapped files or shared memory regions.
MCL_ONFAULT (since Linux 4.4)
Used together with MCL_CURRENT, MCL_FUTURE, or both. Mark all current (with MCL_CURRENT) or future (with MCL_FUTURE) mappings to lock pages when they are faulted in. When used with MCL_CURRENT, all present pages are locked, but mlockall() will not fault in non-present pages. When used with MCL_FUTURE, all future mappings will be marked to lock pages when they are faulted in, but they will not be populated by the lock when the mapping is created. MCL_ONFAULT must be used with either MCL_CURRENT or MCL_FUTURE or both.
If MCL_FUTURE has been specified, then a later system call (e.g., mmap(2), sbrk(2), malloc(3)), may fail if it would cause the number of locked bytes to exceed the permitted maximum (see below). In the same circumstances, stack growth may likewise fail: the kernel will deny stack expansion and deliver a SIGSEGV signal to the process.
munlockall() unlocks all pages mapped into the address space of the calling process.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned, errno is set to indicate the error, and no changes are made to any locks in the address space of the process.
ERRORS
EAGAIN
(mlock(), mlock2(), and munlock()) Some or all of the specified address range could not be locked.
EINVAL
(mlock(), mlock2(), and munlock()) The result of the addition addr+len was less than addr (e.g., the addition may have resulted in an overflow).
EINVAL
(mlock2()) Unknown flags were specified.
EINVAL
(mlockall()) Unknown flags were specified or MCL_ONFAULT was specified without either MCL_FUTURE or MCL_CURRENT.
EINVAL
(Not on Linux) addr was not a multiple of the page size.
ENOMEM
(mlock(), mlock2(), and munlock()) Some of the specified address range does not correspond to mapped pages in the address space of the process.
ENOMEM
(mlock(), mlock2(), and munlock()) Locking or unlocking a region would result in the total number of mappings with distinct attributes (e.g., locked versus unlocked) exceeding the allowed maximum. (For example, unlocking a range in the middle of a currently locked mapping would result in three mappings: two locked mappings at each end and an unlocked mapping in the middle.)
ENOMEM
(Linux 2.6.9 and later) the caller had a nonzero RLIMIT_MEMLOCK soft resource limit, but tried to lock more memory than the limit permitted. This limit is not enforced if the process is privileged (CAP_IPC_LOCK).
ENOMEM
(Linux 2.4 and earlier) the calling process tried to lock more than half of RAM.
EPERM
The caller is not privileged, but needs privilege (CAP_IPC_LOCK) to perform the requested operation.
EPERM
(munlockall()) (Linux 2.6.8 and earlier) The caller was not privileged (CAP_IPC_LOCK).
VERSIONS
Linux
Under Linux, mlock(), mlock2(), and munlock() automatically round addr down to the nearest page boundary. However, the POSIX.1 specification of mlock() and munlock() allows an implementation to require that addr is page aligned, so portable applications should ensure this.
The VmLck field of the Linux-specific /proc/pid/status file shows how many kilobytes of memory the process with ID PID has locked using mlock(), mlock2(), mlockall(), and mmap(2) MAP_LOCKED.
STANDARDS
mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2008.
mlock2()
Linux.
On POSIX systems on which mlock() and munlock() are available, _POSIX_MEMLOCK_RANGE is defined in <unistd.h> and the number of bytes in a page can be determined from the constant PAGESIZE (if defined) in <limits.h> or by calling sysconf(_SC_PAGESIZE).
On POSIX systems on which mlockall() and munlockall() are available, _POSIX_MEMLOCK is defined in <unistd.h> to a value greater than 0. (See also sysconf(3).)
HISTORY
mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2001, POSIX.1-2008, SVr4.
mlock2()
Linux 4.4, glibc 2.27.
NOTES
Memory locking has two main applications: real-time algorithms and high-security data processing. Real-time applications require deterministic timing, and, like scheduling, paging is one major cause of unexpected program execution delays. Real-time applications will usually also switch to a real-time scheduler with sched_setscheduler(2). Cryptographic security software often handles critical bytes like passwords or secret keys as data structures. As a result of paging, these secrets could be transferred onto a persistent swap store medium, where they might be accessible to the enemy long after the security software has erased the secrets in RAM and terminated. (But be aware that the suspend mode on laptops and some desktop computers will save a copy of the system’s RAM to disk, regardless of memory locks.)
Real-time processes that are using mlockall() to prevent delays on page faults should reserve enough locked stack pages before entering the time-critical section, so that no page fault can be caused by function calls. This can be achieved by calling a function that allocates a sufficiently large automatic variable (an array) and writes to the memory occupied by this array in order to touch these stack pages. This way, enough pages will be mapped for the stack and can be locked into RAM. The dummy writes ensure that not even copy-on-write page faults can occur in the critical section.
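The stack pre-faulting technique described above can be sketched as follows. The function names and the 64 KiB figure are assumptions chosen for illustration; a real application would size STACK_PREFAULT_BYTES to the deepest stack use of its time-critical section.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumed worst-case stack use of the time-critical section. */
#define STACK_PREFAULT_BYTES (64 * 1024)

static_assert(STACK_PREFAULT_BYTES <= 1024 * 1024,
              "keep the pre-fault well below the stack size limit");

/* Touch STACK_PREFAULT_BYTES of stack so that those pages are mapped
   before the time-critical section begins.
   Returns the number of bytes touched. */
size_t
prefault_stack(void)
{
    /* volatile stores cannot be optimized away by the compiler. */
    volatile unsigned char dummy[STACK_PREFAULT_BYTES];

    for (size_t i = 0; i < sizeof(dummy); i++)
        dummy[i] = 0;

    return sizeof(dummy);
}

/* Lock current and future mappings, then pre-fault the stack so the
   touched stack pages are resident and locked.
   Returns 0 on success, or -1 with errno set by mlockall(). */
int
enter_realtime_memory_mode(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
        return -1;

    prefault_stack();
    return 0;
}
```

enter_realtime_memory_mode() would be called once during initialization, before any thread enters its time-critical loop.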
Memory locks are not inherited by a child created via fork(2) and are automatically removed (unlocked) during an execve(2) or when the process terminates. The mlockall() MCL_FUTURE and MCL_FUTURE | MCL_ONFAULT settings are not inherited by a child created via fork(2) and are cleared during an execve(2).
Note that fork(2) will prepare the address space for a copy-on-write operation. The consequence is that any write access that follows will cause a page fault that in turn may cause high latencies for a real-time process. Therefore, it is crucial not to invoke fork(2) after an mlockall() or mlock() operation, not even from a thread which runs at a low priority within a process which also has a thread running at elevated priority.
The memory lock on an address range is automatically removed if the address range is unmapped via munmap(2).
Memory locks do not stack, that is, pages which have been locked several times by calls to mlock(), mlock2(), or mlockall() will be unlocked by a single call to munlock() for the corresponding range or by munlockall(). Pages which are mapped to several locations or by several processes stay locked into RAM as long as they are locked at least at one location or by at least one process.
If a call to mlockall() which uses the MCL_FUTURE flag is followed by another call that does not specify this flag, the changes made by the MCL_FUTURE call will be lost.
The mlock2() MLOCK_ONFAULT flag and the mlockall() MCL_ONFAULT flag allow efficient memory locking for applications that deal with large mappings where only a (small) portion of pages in the mapping are touched. In such cases, locking all of the pages in a mapping would incur a significant penalty for memory locking.
Limits and permissions
In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK) in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines a limit on how much memory the process may lock.
Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process can lock and the RLIMIT_MEMLOCK soft resource limit instead defines a limit on how much memory an unprivileged process may lock.
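An unprivileged program can inspect this limit before attempting to lock memory. The sketch below uses getrlimit(2); the helper names are hypothetical, and the check is deliberately simple: it ignores memory the process has already locked and does not test for CAP_IPC_LOCK.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical helper: fetch the RLIMIT_MEMLOCK soft limit in bytes.
   Returns 0 on success, or -1 with errno set. */
int
get_memlock_soft_limit(rlim_t *limit)
{
    struct rlimit rl;

    assert(limit != NULL);

    if (getrlimit(RLIMIT_MEMLOCK, &rl) == -1)
        return -1;

    *limit = rl.rlim_cur;
    return 0;
}

/* Returns 1 if a request of len bytes is within the soft limit,
   0 if it exceeds the limit, or -1 on error. */
int
fits_memlock_limit(size_t len)
{
    rlim_t limit;

    if (get_memlock_soft_limit(&limit) == -1)
        return -1;
    if (limit == RLIM_INFINITY)
        return 1;

    return (rlim_t) len <= limit;
}
```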
BUGS
In Linux 4.8 and earlier, a bug in the kernel’s accounting of locked memory for unprivileged processes (i.e., without CAP_IPC_LOCK) meant that if the region specified by addr and len overlapped an existing lock, then the already locked bytes in the overlapping region were counted twice when checking against the limit. Such double accounting could incorrectly calculate a “total locked memory” value for the process that exceeded the RLIMIT_MEMLOCK limit, with the result that mlock() and mlock2() would fail on requests that should have succeeded. This bug was fixed in Linux 4.9.
In Linux 2.4 series of kernels up to and including Linux 2.4.17, a bug caused the mlockall() MCL_FUTURE flag to be inherited across a fork(2). This was rectified in Linux 2.4.18.
Since Linux 2.6.9, if a privileged process calls mlockall(MCL_FUTURE) and later drops privileges (loses the CAP_IPC_LOCK capability by, for example, setting its effective UID to a nonzero value), then subsequent memory allocations (e.g., mmap(2), brk(2)) will fail if the RLIMIT_MEMLOCK resource limit is encountered.
SEE ALSO
mincore(2), mmap(2), setrlimit(2), shmctl(2), sysconf(3), proc(5), capabilities(7)
9 - Linux cli command tuxcall
NAME
tuxcall - unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
10 - Linux cli command sysinfo
NAME
sysinfo - return system information
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/sysinfo.h>
int sysinfo(struct sysinfo *info);
DESCRIPTION
sysinfo() returns certain statistics on memory and swap usage, as well as the load average.
Until Linux 2.3.16, sysinfo() returned information in the following structure:
struct sysinfo {
long uptime; /* Seconds since boot */
unsigned long loads[3]; /* 1, 5, and 15 minute load averages */
unsigned long totalram; /* Total usable main memory size */
unsigned long freeram; /* Available memory size */
unsigned long sharedram; /* Amount of shared memory */
unsigned long bufferram; /* Memory used by buffers */
unsigned long totalswap; /* Total swap space size */
unsigned long freeswap; /* Swap space still available */
unsigned short procs; /* Number of current processes */
char _f[22]; /* Pads structure to 64 bytes */
};
In the above structure, the sizes of the memory and swap fields are given in bytes.
Since Linux 2.3.23 (i386) and Linux 2.3.48 (all architectures) the structure is:
struct sysinfo {
long uptime; /* Seconds since boot */
unsigned long loads[3]; /* 1, 5, and 15 minute load averages */
unsigned long totalram; /* Total usable main memory size */
unsigned long freeram; /* Available memory size */
unsigned long sharedram; /* Amount of shared memory */
unsigned long bufferram; /* Memory used by buffers */
unsigned long totalswap; /* Total swap space size */
unsigned long freeswap; /* Swap space still available */
unsigned short procs; /* Number of current processes */
unsigned long totalhigh; /* Total high memory size */
unsigned long freehigh; /* Available high memory size */
unsigned int mem_unit; /* Memory unit size in bytes */
char _f[20-2*sizeof(long)-sizeof(int)];
/* Padding to 64 bytes */
};
In the above structure, sizes of the memory and swap fields are given as multiples of mem_unit bytes.
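Because the memory fields are counts of mem_unit-sized units, they need to be multiplied out before being reported in bytes. A minimal sketch, using the hypothetical helper names units_to_bytes() and query_memory():

```c
#include <assert.h>
#include <stddef.h>
#include <sys/sysinfo.h>

/* Convert a count of mem_unit-sized units into bytes. A 64-bit result
   avoids overflow on systems where unsigned long is 32 bits wide. */
unsigned long long
units_to_bytes(unsigned long units, unsigned int mem_unit)
{
    return (unsigned long long) units * mem_unit;
}

/* Hypothetical helper: report total and free RAM in bytes.
   Returns 0 on success, or -1 with errno set by sysinfo(). */
int
query_memory(unsigned long long *total, unsigned long long *avail)
{
    struct sysinfo si;

    assert(total != NULL && avail != NULL);

    if (sysinfo(&si) == -1)
        return -1;

    *total = units_to_bytes(si.totalram, si.mem_unit);
    *avail = units_to_bytes(si.freeram, si.mem_unit);
    return 0;
}
```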
RETURN VALUE
On success, sysinfo() returns zero. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
info is not a valid address.
STANDARDS
Linux.
HISTORY
Linux 0.98.pl6.
NOTES
All of the information provided by this system call is also available via /proc/meminfo and /proc/loadavg.
SEE ALSO
proc(5)
11 - Linux cli command ioprio_set
NAME
ioprio_set - get/set I/O scheduling class and priority
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/ioprio.h> /* Definition of IOPRIO_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_ioprio_get, int which, int who);
int syscall(SYS_ioprio_set, int which, int who, int ioprio);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
The ioprio_get() and ioprio_set() system calls get and set the I/O scheduling class and priority of one or more threads.
The which and who arguments identify the thread(s) on which the system calls operate. The which argument determines how who is interpreted, and has one of the following values:
IOPRIO_WHO_PROCESS
who is a process ID or thread ID identifying a single process or thread. If who is 0, then operate on the calling thread.
IOPRIO_WHO_PGRP
who is a process group ID identifying all the members of a process group. If who is 0, then operate on the process group of which the caller is a member.
IOPRIO_WHO_USER
who is a user ID identifying all of the processes that have a matching real UID.
If which is specified as IOPRIO_WHO_PGRP or IOPRIO_WHO_USER when calling ioprio_get(), and more than one process matches who, then the returned priority will be the highest one found among all of the matching processes. One priority is said to be higher than another one if it belongs to a higher priority class (IOPRIO_CLASS_RT is the highest priority class; IOPRIO_CLASS_IDLE is the lowest) or if it belongs to the same priority class as the other process but has a higher priority level (a lower priority number means a higher priority level).
The ioprio argument given to ioprio_set() is a bit mask that specifies both the scheduling class and the priority to be assigned to the target process(es). The following macros are used for assembling and dissecting ioprio values:
IOPRIO_PRIO_VALUE(class, data)
Given a scheduling class and priority (data), this macro combines the two values to produce an ioprio value, which is returned as the result of the macro.
IOPRIO_PRIO_CLASS(mask)
Given mask (an ioprio value), this macro returns its I/O class component, that is, one of the values IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, or IOPRIO_CLASS_IDLE.
IOPRIO_PRIO_DATA(mask)
Given mask (an ioprio value), this macro returns its priority (data) component.
See the NOTES section for more information on scheduling classes and priorities, as well as the meaning of specifying ioprio as 0.
I/O priorities are supported for reads and for synchronous (O_DIRECT, O_SYNC) writes. I/O priorities are not supported for asynchronous writes because they are issued outside the context of the program dirtying the memory, and thus program-specific priorities do not apply.
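To show how an ioprio value is assembled and taken apart, the sketch below defines local stand-ins for the macros rather than relying on <linux/ioprio.h> being installed. Everything prefixed with MY_ or my_ is hypothetical. The numeric values (class stored above bit 13, class numbers 1 to 3, IOPRIO_WHO_PROCESS equal to 1, best-effort levels 0 to 7) are assumptions that mirror the kernel's definitions and should be checked against the header on the target system.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local stand-ins for the IOPRIO_* constants (assumed values). */
enum { MY_IOPRIO_CLASS_RT = 1, MY_IOPRIO_CLASS_BE = 2, MY_IOPRIO_CLASS_IDLE = 3 };
enum { MY_IOPRIO_WHO_PROCESS = 1 };
#define MY_IOPRIO_CLASS_SHIFT 13

/* Equivalent of IOPRIO_PRIO_VALUE(class, data). */
int
my_ioprio_value(int class, int data)
{
    return (class << MY_IOPRIO_CLASS_SHIFT) | data;
}

/* Equivalent of IOPRIO_PRIO_CLASS(mask). */
int
my_ioprio_class(int ioprio)
{
    return ioprio >> MY_IOPRIO_CLASS_SHIFT;
}

/* Equivalent of IOPRIO_PRIO_DATA(mask). */
int
my_ioprio_data(int ioprio)
{
    return ioprio & ((1 << MY_IOPRIO_CLASS_SHIFT) - 1);
}

/* Hypothetical helper: put the calling thread in the best-effort class
   at the given level (0 is the highest priority, 7 the lowest).
   Returns 0 on success, or -1 with errno set. */
int
set_own_best_effort_priority(int level)
{
    assert(level >= 0 && level <= 7);

    /* who == 0 selects the calling thread. */
    return (int) syscall(SYS_ioprio_set, MY_IOPRIO_WHO_PROCESS, 0,
                         my_ioprio_value(MY_IOPRIO_CLASS_BE, level));
}
```

Whether the new priority has any effect depends on the I/O scheduler in use for the device.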
RETURN VALUE
On success, ioprio_get() returns the ioprio value of the process with highest I/O priority of any of the processes that match the criteria specified in which and who. On error, -1 is returned, and errno is set to indicate the error.
On success, ioprio_set() returns 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
Invalid value for which or ioprio. Refer to the NOTES section for available scheduler classes and priority levels for ioprio.
EPERM
The calling process does not have the privilege needed to assign this ioprio to the specified process(es). See the NOTES section for more information on required privileges for ioprio_set().
ESRCH
No process(es) could be found that matched the specification in which and who.
STANDARDS
Linux.
HISTORY
Linux 2.6.13.
NOTES
Two or more processes or threads can share an I/O context. This will be the case when clone(2) was called with the CLONE_IO flag. However, by default, the distinct threads of a process will not share the same I/O context. This means that if you want to change the I/O priority of all threads in a process, you may need to call ioprio_set() on each of the threads. The thread ID that you would need for this operation is the one that is returned by gettid(2) or clone(2).
These system calls have an effect only when used in conjunction with an I/O scheduler that supports I/O priorities. As at kernel 2.6.17 the only such scheduler is the Completely Fair Queuing (CFQ) I/O scheduler.
If no I/O priority has been set for a thread, then by default the I/O priority will follow the CPU nice value (setpriority(2)). Before Linux 2.6.24, once an I/O priority had been set using ioprio_set(), there was no way to reset the I/O scheduling behavior to the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to reset to the default I/O scheduling behavior.
Selecting an I/O scheduler
I/O schedulers are selected on a per-device basis via the special file /sys/block/device/queue/scheduler.
One can view the current I/O scheduler via the /sys filesystem. For example, the following command displays a list of all schedulers currently loaded in the kernel:
$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
The scheduler surrounded by brackets is the one actually in use for the device (sda in the example). Setting another scheduler is done by writing the name of the new scheduler to this file. For example, the following command will set the scheduler for the sda device to cfq:
$ su
Password:
# echo cfq > /sys/block/sda/queue/scheduler
The Completely Fair Queuing (CFQ) I/O scheduler
Since version 3 (also known as CFQ Time Sliced), CFQ implements I/O nice levels similar to those of CPU scheduling. These nice levels are grouped into three scheduling classes, each one containing one or more priority levels:
IOPRIO_CLASS_RT (1)
This is the real-time I/O class. This scheduling class is given higher priority than any other class: processes from this class are given first access to the disk every time. Thus, this I/O class needs to be used with some care: one I/O real-time process can starve the entire system. Within the real-time class, there are 8 levels of class data (priority) that determine exactly how much time this process needs the disk for on each service. The highest real-time priority level is 0; the lowest is 7. In the future, this might change to be more directly mappable to performance, by passing in a desired data rate instead.
IOPRIO_CLASS_BE (2)
This is the best-effort scheduling class, which is the default for any process that hasn’t set a specific I/O priority. The class data (priority) determines how much I/O bandwidth the process will get. Best-effort priority levels are analogous to CPU nice values (see getpriority(2)). The priority level determines a priority relative to other processes in the best-effort scheduling class. Priority levels range from 0 (highest) to 7 (lowest).
IOPRIO_CLASS_IDLE (3)
This is the idle scheduling class. Processes running at this level get I/O time only when no one else needs the disk. The idle class has no class data. Attention is required when assigning this priority class to a process, since it may become starved if higher priority processes are constantly accessing the disk.
Refer to the kernel source file Documentation/block/ioprio.txt for more information on the CFQ I/O Scheduler and an example program.
Required permissions to set I/O priorities
Permission to change a process’s priority is granted or denied based on two criteria:
Process ownership
An unprivileged process may set the I/O priority only for a process whose real UID matches the real or effective UID of the calling process. A process which has the CAP_SYS_NICE capability can change the priority of any process.
What is the desired priority
Attempts to set very high priorities (IOPRIO_CLASS_RT) require the CAP_SYS_ADMIN capability. Linux versions up to 2.6.24 also required CAP_SYS_ADMIN to set a very low priority (IOPRIO_CLASS_IDLE), but since Linux 2.6.25, this is no longer required.
A call to ioprio_set() must follow both rules, or the call will fail with the error EPERM.
BUGS
glibc does not yet provide a suitable header file defining the function prototypes and macros described on this page. Suitable definitions can be found in linux/ioprio.h.
SEE ALSO
ionice(1), getpriority(2), open(2), capabilities(7), cgroups(7)
Documentation/block/ioprio.txt in the Linux kernel source tree
12 - Linux cli command setfsuid32
NAME 🖥️ setfsuid32 🖥️
set user identity used for filesystem checks
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/fsuid.h>
[[deprecated]] int setfsuid(uid_t fsuid);
DESCRIPTION
On Linux, a process has both a filesystem user ID and an effective user ID. The (Linux-specific) filesystem user ID is used for permissions checking when accessing filesystem objects, while the effective user ID is used for various other kinds of permissions checks (see credentials(7)).
Normally, the value of the process’s filesystem user ID is the same as the value of its effective user ID. This is so, because whenever a process’s effective user ID is changed, the kernel also changes the filesystem user ID to be the same as the new value of the effective user ID. A process can cause the value of its filesystem user ID to diverge from its effective user ID by using setfsuid() to change its filesystem user ID to the value given in fsuid.
Explicit calls to setfsuid() and setfsgid(2) are (were) usually used only by programs such as the Linux NFS server that need to change what user and group ID is used for file access without a corresponding change in the real and effective user and group IDs. A change in the normal user IDs for a program such as the NFS server is (was) a security hole that can expose it to unwanted signals. (However, this issue is historical; see below.)
setfsuid() will succeed only if the caller is the superuser or if fsuid matches either the caller’s real user ID, effective user ID, saved set-user-ID, or current filesystem user ID.
RETURN VALUE
On both success and failure, this call returns the previous filesystem user ID of the caller.
STANDARDS
Linux.
HISTORY
Linux 1.2.
At the time when this system call was introduced, one process could send a signal to another process with the same effective user ID. This meant that if a privileged process changed its effective user ID for the purpose of file permission checking, then it could become vulnerable to receiving signals sent by another (unprivileged) process with the same user ID. The filesystem user ID attribute was thus added to allow a process to change its user ID for the purposes of file permission checking without at the same time becoming vulnerable to receiving unwanted signals. Since Linux 2.0, signal permission handling is different (see kill(2)), with the result that a process can change its effective user ID without being vulnerable to receiving signals from unwanted processes. Thus, setfsuid() is nowadays unneeded and should be avoided in new applications (likewise for setfsgid(2)).
The original Linux setfsuid() system call supported only 16-bit user IDs. Subsequently, Linux 2.4 added setfsuid32() supporting 32-bit IDs. The glibc setfsuid() wrapper function transparently deals with the variation across kernel versions.
C library/kernel differences
In glibc 2.15 and earlier, when the wrapper for this system call determines that the argument can’t be passed to the kernel without integer truncation (because the kernel is old and does not support 32-bit user IDs), it will return -1 and set errno to EINVAL without attempting the system call.
BUGS
No error indications of any kind are returned to the caller, and the fact that both successful and unsuccessful calls return the same value makes it impossible to directly determine whether the call succeeded or failed. Instead, the caller must resort to looking at the return value from a further call such as setfsuid(-1) (which will always fail), in order to determine if a preceding call to setfsuid() changed the filesystem user ID. At the very least, EPERM should be returned when the call fails (because the caller lacks the CAP_SETUID capability).
SEE ALSO
kill(2), setfsgid(2), capabilities(7), credentials(7)
13 - Linux cli command fchown
NAME 🖥️ fchown 🖥️
change ownership of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int chown(const char *pathname, uid_t owner, gid_t group);
int fchown(int fd, uid_t owner, gid_t group);
int lchown(const char *pathname, uid_t owner, gid_t group);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int fchownat(int dirfd, const char *pathname,
uid_t owner, gid_t group, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchown(), lchown():
/* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| _XOPEN_SOURCE >= 500
|| /* glibc <= 2.19: */ _BSD_SOURCE
fchownat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These system calls change the owner and group of a file. The chown(), fchown(), and lchown() system calls differ only in how the file is specified:
chown() changes the ownership of the file specified by pathname, which is dereferenced if it is a symbolic link.
fchown() changes the ownership of the file referred to by the open file descriptor fd.
lchown() is like chown(), but does not dereference symbolic links.
Only a privileged process (Linux: one with the CAP_CHOWN capability) may change the owner of a file. The owner of a file may change the group of the file to any group of which that owner is a member. A privileged process (Linux: with CAP_CHOWN) may change the group arbitrarily.
If the owner or group is specified as -1, then that ID is not changed.
When the owner or group of an executable file is changed by an unprivileged user, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel version, and since Linux 2.2.13, root is treated like other users. In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown().
When the owner or group of an executable file is changed (by any user), all capability sets for the file are cleared.
fchownat()
The fchownat() system call operates in exactly the same way as chown(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chown() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chown()).
If pathname is absolute, then dirfd is ignored.
The flags argument is a bit mask created by ORing together 0 or more of the following values:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead operate on the link itself, like lchown(). (By default, fchownat() dereferences symbolic links, like chown().)
See openat(2) for an explanation of the need for fchownat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, errors other than those listed below can be returned.
The more general errors for chown() are listed below.
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EBADF
(fchown()) fd is not a valid open file descriptor.
EBADF
(fchownat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
(fchownat()) Invalid flag specified in flags.
EIO
(fchown()) A low-level I/O error occurred while modifying the inode.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
The file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(fchownat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
The calling process did not have the required permissions (see above) to change owner and/or group.
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The named file resides on a read-only filesystem.
VERSIONS
The 4.4BSD version can be used only by the superuser (that is, ordinary users cannot give away files).
STANDARDS
POSIX.1-2008.
HISTORY
chown()
fchown()
lchown()
4.4BSD, SVr4, POSIX.1-2001.
fchownat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Ownership of new files
When a new file is created (by, for example, open(2) or mkdir(2)), its owner is made the same as the filesystem user ID of the creating process. The group of the file depends on a range of factors, including the type of filesystem, the options used to mount the filesystem, and whether or not the set-group-ID mode bit is enabled on the parent directory. If the filesystem supports the -o grpid (or, synonymously -o bsdgroups) and -o nogrpid (or, synonymously -o sysvgroups) mount(8) options, then the rules are as follows:
If the filesystem is mounted with -o grpid, then the group of a new file is made the same as that of the parent directory.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is disabled on the parent directory, then the group of a new file is made the same as the process’s filesystem GID.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is enabled on the parent directory, then the group of a new file is made the same as that of the parent directory.
As at Linux 4.12, the -o grpid and -o nogrpid mount options are supported by ext2, ext3, ext4, and XFS. Filesystems that don’t support these mount options follow the -o nogrpid rules.
glibc notes
On older kernels where fchownat() is unavailable, the glibc wrapper function falls back to the use of chown() and lchown(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
NFS
The chown() semantics are deliberately violated on NFS filesystems which have UID mapping enabled. Additionally, the semantics of all system calls which access the file contents are violated, because chown() may cause immediate access revocation on already open files. Client-side caching may lead to a delay between the time when ownership has been changed to allow access for a user and the time when the file can actually be accessed by the user on other clients.
Historical details
The original Linux chown(), fchown(), and lchown() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added chown32(), fchown32(), and lchown32(), supporting 32-bit IDs. The glibc chown(), fchown(), and lchown() wrapper functions transparently deal with the variations across kernel versions.
Before Linux 2.1.81 (except 2.1.46), chown() did not follow symbolic links. Since Linux 2.1.81, chown() does follow symbolic links, and there is a new system call lchown() that does not follow symbolic links. Since Linux 2.1.86, this new call (that has the same semantics as the old chown()) has got the same syscall number, and chown() got the newly introduced number.
EXAMPLES
The following program changes the ownership of the file named in its second command-line argument to the value specified in its first command-line argument. The new owner can be specified either as a numeric user ID, or as a username (which is converted to a user ID by using getpwnam(3) to perform a lookup in the system password file).
Program source
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    char           *endptr;
    uid_t          uid;
    struct passwd  *pwd;

    /* Require exactly two arguments and a non-empty owner string */
    if (argc != 3 || argv[1][0] == '\0') {
        fprintf(stderr, "%s <owner> <file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    uid = strtol(argv[1], &endptr, 10);  /* Allow a numeric string */

    if (*endptr != '\0') {         /* Was not pure numeric string */
        pwd = getpwnam(argv[1]);   /* Try getting UID for username */
        if (pwd == NULL) {
            perror("getpwnam");
            exit(EXIT_FAILURE);
        }

        uid = pwd->pw_uid;
    }

    /* Passing -1 as the group leaves the group unchanged */
    if (chown(argv[2], uid, -1) == -1) {
        perror("chown");
        exit(EXIT_FAILURE);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
chgrp(1), chown(1), chmod(2), flock(2), path_resolution(7), symlink(7)
14 - Linux cli command break
NAME 🖥️ break 🖥️
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
15 - Linux cli command bdflush
NAME 🖥️ bdflush 🖥️
start, flush, or tune buffer-dirty-flush daemon
SYNOPSIS
#include <sys/kdaemon.h>
[[deprecated]] int bdflush(int func, long *address);
[[deprecated]] int bdflush(int func, long data);
DESCRIPTION
Note: Since Linux 2.6, this system call is deprecated and does nothing. It is likely to disappear altogether in a future kernel release. Nowadays, the task performed by bdflush() is handled by the kernel pdflush thread.
bdflush() starts, flushes, or tunes the buffer-dirty-flush daemon. Only a privileged process (one with the CAP_SYS_ADMIN capability) may call bdflush().
If func is negative or 0, and no daemon has been started, then bdflush() enters the daemon code and never returns.
If func is 1, some dirty buffers are written to disk.
If func is 2 or more and is even (low bit is 0), then address is the address of a long word, and the tuning parameter numbered (func-2)/2 is returned to the caller in that address.
If func is 3 or more and is odd (low bit is 1), then data is a long word, and the kernel sets tuning parameter numbered (func-3)/2 to that value.
The set of parameters, their values, and their valid ranges are defined in the Linux kernel source file fs/buffer.c.
RETURN VALUE
If func is negative or 0 and the daemon successfully starts, bdflush() never returns. Otherwise, the return value is 0 on success and -1 on failure, with errno set to indicate the error.
ERRORS
EBUSY
An attempt was made to enter the daemon code after another process has already entered.
EFAULT
address points outside your accessible address space.
EINVAL
An attempt was made to read or write an invalid parameter number, or to write an invalid value to a parameter.
EPERM
Caller does not have the CAP_SYS_ADMIN capability.
STANDARDS
Linux.
HISTORY
Since glibc 2.23, glibc no longer supports this obsolete system call.
SEE ALSO
sync(1), fsync(2), sync(2)
16 - Linux cli command sysfs
NAME 🖥️ sysfs 🖥️
get filesystem type information
SYNOPSIS
[[deprecated]] int sysfs(int option, const char *fsname);
[[deprecated]] int sysfs(int option, unsigned int fs_index, char *buf);
[[deprecated]] int sysfs(int option);
DESCRIPTION
Note: if you are looking for information about the sysfs filesystem that is normally mounted at /sys, see sysfs(5).
The (obsolete) sysfs() system call returns information about the filesystem types currently present in the kernel. The specific form of the sysfs() call and the information returned depends on the option in effect:
1
Translate the filesystem identifier string fsname into a filesystem type index.
2
Translate the filesystem type index fs_index into a null-terminated filesystem identifier string. This string will be written to the buffer pointed to by buf. Make sure that buf has enough space to accept the string.
3
Return the total number of filesystem types currently present in the kernel.
The numbering of the filesystem type indexes begins with zero.
RETURN VALUE
On success, sysfs() returns the filesystem index for option 1, zero for option 2, and the number of currently configured filesystems for option 3. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
Either fsname or buf is outside your accessible address space.
EINVAL
fsname is not a valid filesystem type identifier; fs_index is out-of-bounds; option is invalid.
STANDARDS
None.
HISTORY
SVr4.
This System-V derived system call is obsolete; don’t use it. On systems with /proc, the same information can be obtained via /proc; use that interface instead.
BUGS
There is no libc or glibc support. There is no way to guess how large buf should be.
SEE ALSO
proc(5), sysfs(5)
17 - Linux cli command getgid
NAME 🖥️ getgid 🖥️
get group identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
gid_t getgid(void);
gid_t getegid(void);
DESCRIPTION
getgid() returns the real group ID of the calling process.
getegid() returns the effective group ID of the calling process.
ERRORS
These functions are always successful and never modify errno.
VERSIONS
On Alpha, instead of a pair of getgid() and getegid() system calls, a single getxgid() system call is provided, which returns a pair of real and effective GIDs. The glibc getgid() and getegid() wrapper functions transparently deal with this. See syscall(2) for details regarding register mapping.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD.
The original Linux getgid() and getegid() system calls supported only 16-bit group IDs. Subsequently, Linux 2.4 added getgid32() and getegid32(), supporting 32-bit IDs. The glibc getgid() and getegid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getresgid(2), setgid(2), setregid(2), credentials(7)
18 - Linux cli command shutdown
NAME 🖥️ shutdown 🖥️
shut down part of a full-duplex connection
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int shutdown(int sockfd, int how);
DESCRIPTION
The shutdown() call causes all or part of a full-duplex connection on the socket associated with sockfd to be shut down. If how is SHUT_RD, further receptions will be disallowed. If how is SHUT_WR, further transmissions will be disallowed. If how is SHUT_RDWR, further receptions and transmissions will be disallowed.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
sockfd is not a valid file descriptor.
EINVAL
An invalid value was specified in how (but see BUGS).
ENOTCONN
The specified socket is not connected.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
NOTES
The constants SHUT_RD, SHUT_WR, and SHUT_RDWR have the values 0, 1, and 2, respectively, and are defined in <sys/socket.h> since glibc 2.1.91.
BUGS
Checks for the validity of how are done in domain-specific code, and before Linux 3.7 not all domains performed these checks. Most notably, UNIX domain sockets simply ignored invalid values. This problem was fixed for UNIX domain sockets in Linux 3.7.
SEE ALSO
close(2), connect(2), socket(2), socket(7)
19 - Linux cli command vm86old
NAME 🖥️ vm86old 🖥️
enter virtual 8086 mode
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/vm86.h>
int vm86old(struct vm86_struct *info);
int vm86(unsigned long fn, struct vm86plus_struct *v86);
DESCRIPTION
The system call vm86() was introduced in Linux 0.97p2. In Linux 2.1.15 and 2.0.28, it was renamed to vm86old(), and a new vm86() was introduced. The definition of struct vm86_struct was changed in 1.1.8 and 1.1.9.
These calls cause the process to enter VM86 mode (virtual-8086 in Intel literature), and are used by dosemu.
VM86 mode is an emulation of real mode within a protected mode task.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
This return value is specific to i386 and indicates a problem with getting user-space data.
ENOSYS
This return value indicates the call is not implemented on the present architecture.
EPERM
Saved kernel stack exists. (This is a kernel sanity check; the saved stack should exist only within vm86 mode itself.)
STANDARDS
Linux on 32-bit Intel processors.
20 - Linux cli command preadv2
NAME 🖥️ preadv2 🖥️
read or write data into multiple buffers
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/uio.h>
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
preadv(), pwritev():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
The readv() system call reads iovcnt buffers from the file associated with the file descriptor fd into the buffers described by iov (“scatter input”).
The writev() system call writes iovcnt buffers of data described by iov to the file associated with the file descriptor fd (“gather output”).
The pointer iov points to an array of iovec structures, described in iovec(3type).
The readv() system call works just like read(2) except that multiple buffers are filled.
The writev() system call works just like write(2) except that multiple buffers are written out.
Buffers are processed in array order. This means that readv() completely fills iov[0] before proceeding to iov[1], and so on. (If there is insufficient data, then not all buffers pointed to by iov may be filled.) Similarly, writev() writes out the entire contents of iov[0] before proceeding to iov[1], and so on.
The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes; analogously, readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description (see open(2)).
preadv() and pwritev()
The preadv() system call combines the functionality of readv() and pread(2). It performs the same task as readv(), but adds a fourth argument, offset, which specifies the file offset at which the input operation is to be performed.
The pwritev() system call combines the functionality of writev() and pwrite(2). It performs the same task as writev(), but adds a fourth argument, offset, which specifies the file offset at which the output operation is to be performed.
The file offset is not changed by these system calls. The file referred to by fd must be capable of seeking.
preadv2() and pwritev2()
These system calls are similar to preadv() and pwritev() calls, but add a fifth argument, flags, which modifies the behavior on a per-call basis.
Unlike preadv() and pwritev(), if the offset argument is -1, then the current file offset is used and updated.
The flags argument contains a bitwise OR of zero or more of the following flags:
RWF_DSYNC (since Linux 4.7)
Provide a per-write equivalent of the O_DSYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_HIPRI (since Linux 4.6)
High priority read/write. Allows block-based filesystems to use polling of the device, which provides lower latency, but may use additional resources. (Currently, this feature is usable only on a file descriptor opened using the O_DIRECT flag.)
RWF_SYNC (since Linux 4.7)
Provide a per-write equivalent of the O_SYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_NOWAIT (since Linux 4.14)
Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read. If no bytes were read, it will return -1 and set errno to EAGAIN (but see BUGS). Currently, this flag is meaningful only for preadv2().
RWF_APPEND (since Linux 4.16)
Provide a per-write equivalent of the O_APPEND open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call. The offset argument does not affect the write operation; the data is always appended to the end of the file. However, if the offset argument is -1, the current file offset is updated.
RETURN VALUE
On success, readv(), preadv(), and preadv2() return the number of bytes read; writev(), pwritev(), and pwritev2() return the number of bytes written.
Note that it is not an error for a successful call to transfer fewer bytes than requested (see read(2) and write(2)).
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The errors are as given for read(2) and write(2). Furthermore, preadv(), preadv2(), pwritev(), and pwritev2() can also fail for the same reasons as lseek(2). Additionally, the following errors are defined:
EINVAL
The sum of the iov_len values overflows an ssize_t value.
EINVAL
The vector count, iovcnt, is less than zero or greater than the permitted maximum.
EOPNOTSUPP
An unknown flag is specified in flags.
VERSIONS
C library/kernel differences
The raw preadv() and pwritev() system calls have call signatures that differ slightly from that of the corresponding GNU C library wrapper functions shown in the SYNOPSIS. The final argument, offset, is unpacked by the wrapper functions into two arguments in the system calls:
unsigned long pos_l, unsigned long pos
These arguments contain, respectively, the low order and high order 32 bits of offset.
STANDARDS
readv()
writev()
POSIX.1-2008.
preadv()
pwritev()
BSD.
preadv2()
pwritev2()
Linux.
HISTORY
readv()
writev()
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
preadv(), pwritev(): Linux 2.6.30, glibc 2.10.
preadv2(), pwritev2(): Linux 4.6, glibc 2.26.
Historical C library/kernel differences
To deal with the fact that IOV_MAX was so low on early versions of Linux, the glibc wrapper functions for readv() and writev() did some extra work if they detected that the underlying kernel system call failed because this limit was exceeded. In the case of readv(), the wrapper function allocated a temporary buffer large enough for all of the items specified by iov, passed that buffer in a call to read(2), copied data from the buffer to the locations specified by the iov_base fields of the elements of iov, and then freed the buffer. The wrapper function for writev() performed the analogous task using a temporary buffer and a call to write(2).
The need for this extra effort in the glibc wrapper functions went away with Linux 2.2 and later. However, glibc continued to provide this behavior until glibc 2.10. Starting with glibc 2.9, the wrapper functions provide this behavior only if the library detects that the system is running a Linux kernel older than Linux 2.6.18 (an arbitrarily selected kernel version). And since glibc 2.20 (which requires a minimum of Linux 2.6.32), the glibc wrapper functions always just directly invoke the system calls.
NOTES
POSIX.1 allows an implementation to place a limit on the number of items that can be passed in iov. An implementation can advertise its limit by defining IOV_MAX in <limits.h> or at run time via the return value from sysconf(_SC_IOV_MAX). On modern Linux systems, the limit is 1024. Back in Linux 2.0 days, this limit was 16.
BUGS
Linux 5.9 and Linux 5.10 have a bug where preadv2() with the RWF_NOWAIT flag may return 0 even when not at end of file.
EXAMPLES
The following code sample demonstrates the use of writev():
    char          *str0 = "hello ";
    char          *str1 = "world\n";
    ssize_t       nwritten;
    struct iovec  iov[2];

    iov[0].iov_base = str0;
    iov[0].iov_len = strlen(str0);
    iov[1].iov_base = str1;
    iov[1].iov_len = strlen(str1);

    nwritten = writev(STDOUT_FILENO, iov, 2);
SEE ALSO
pread(2), read(2), write(2)
21 - Linux cli command setdomainname
NAME setdomainname
get/set NIS domain name
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int getdomainname(char *name, size_t len);
int setdomainname(const char *name, size_t len);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
getdomainname(), setdomainname():
Since glibc 2.21:
_DEFAULT_SOURCE
In glibc 2.19 and 2.20:
_DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
Up to and including glibc 2.19:
_BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
DESCRIPTION
These functions are used to access or to change the NIS domain name of the host system. More precisely, they operate on the NIS domain name associated with the calling process’s UTS namespace.
setdomainname() sets the domain name to the value given in the character array name. The len argument specifies the number of bytes in name. (Thus, name does not require a terminating null byte.)
getdomainname() returns the null-terminated domain name in the character array name, which has a length of len bytes. If the null-terminated domain name requires more than len bytes, getdomainname() returns the first len bytes (glibc) or gives an error (libc).
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
setdomainname() can fail with the following errors:
EFAULT
name pointed outside of user address space.
EINVAL
len was negative or too large.
EPERM
The caller did not have the CAP_SYS_ADMIN capability in the user namespace associated with its UTS namespace (see namespaces(7)).
getdomainname() can fail with the following errors:
EINVAL
For getdomainname() under libc: name is NULL or name is longer than len bytes.
VERSIONS
On most Linux architectures (including x86), there is no getdomainname() system call; instead, glibc implements getdomainname() as a library function that returns a copy of the domainname field returned from a call to uname(2).
STANDARDS
None.
HISTORY
Since Linux 1.0, the limit on the length of a domain name, including the terminating null byte, is 64 bytes. In older kernels, it was 8 bytes.
SEE ALSO
gethostname(2), sethostname(2), uname(2), uts_namespaces(7)
22 - Linux cli command inotify_init1
NAME inotify_init1
initialize an inotify instance
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/inotify.h>
int inotify_init(void);
int inotify_init1(int flags);
DESCRIPTION
For an overview of the inotify API, see inotify(7).
inotify_init() initializes a new inotify instance and returns a file descriptor associated with a new inotify event queue.
If flags is 0, then inotify_init1() is the same as inotify_init(). The following values can be bitwise ORed in flags to obtain different behavior:
IN_NONBLOCK
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
IN_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
RETURN VALUE
On success, these system calls return a new file descriptor. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
(inotify_init1()) An invalid value was specified in flags.
EMFILE
The user limit on the total number of inotify instances has been reached.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOMEM
Insufficient kernel memory is available.
STANDARDS
Linux.
HISTORY
inotify_init()
Linux 2.6.13, glibc 2.4.
inotify_init1()
Linux 2.6.27, glibc 2.9.
SEE ALSO
inotify_add_watch(2), inotify_rm_watch(2), inotify(7)
23 - Linux cli command ioctl
NAME ioctl
control device
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/ioctl.h>
int ioctl(int fd, unsigned long op, ...); /* glibc, BSD */
int ioctl(int fd, int op, ...); /* musl, other UNIX */
DESCRIPTION
The ioctl() system call manipulates the underlying device parameters of special files. In particular, many operating characteristics of character special files (e.g., terminals) may be controlled with ioctl() operations. The argument fd must be an open file descriptor.
The second argument is a device-dependent operation code. The third argument is an untyped pointer to memory. It’s traditionally char *argp (from the days before void * was valid C), and will be so named for this discussion.
An ioctl() op has encoded in it whether the argument is an in parameter or out parameter, and the size of the argument argp in bytes. Macros and defines used in specifying an ioctl() op are located in the file <sys/ioctl.h>. See NOTES.
RETURN VALUE
Usually, on success zero is returned. A few ioctl() operations use the return value as an output parameter and return a nonnegative value on success. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
fd is not a valid file descriptor.
EFAULT
argp references an inaccessible memory area.
EINVAL
op or argp is not valid.
ENOTTY
fd is not associated with a character special device.
ENOTTY
The specified operation does not apply to the kind of object that the file descriptor fd references.
VERSIONS
Arguments, returns, and semantics of ioctl() vary according to the device driver in question (the call is used as a catch-all for operations that don’t cleanly fit the UNIX stream I/O model).
STANDARDS
None.
HISTORY
Version 7 AT&T UNIX has
ioctl(int fildes, int op, struct sgttyb *argp);
(where struct sgttyb has historically been used by stty(2) and gtty(2), and is polymorphic by operation type (like a void * would be, if it had been available)).
SysIII documents arg without a type at all.
4.3BSD has
ioctl(int d, unsigned long op, char *argp);
(with char * similarly in for void *).
SysVr4 has
int ioctl(int fildes, int op, ... /* arg */);
NOTES
In order to use this call, one needs an open file descriptor. Often the open(2) call has unwanted side effects, which can be avoided under Linux by giving it the O_NONBLOCK flag.
ioctl structure
Ioctl op values are 32-bit constants. In principle these constants are completely arbitrary, but people have tried to build some structure into them.
The old Linux situation was that of mostly 16-bit constants, where the last byte is a serial number, and the preceding byte(s) give a type indicating the driver. Sometimes the major number was used: 0x03 for the HDIO_* ioctls, 0x06 for the LP* ioctls. And sometimes one or more ASCII letters were used. For example, TCGETS has value 0x00005401, with 0x54 = ‘T’ indicating the terminal driver, and CYGETTIMEOUT has value 0x00435906, with 0x43 0x59 = ‘C’ ‘Y’ indicating the cyclades driver.
Later (0.98p5) some more information was built into the number. One has 2 direction bits (00: none, 01: write, 10: read, 11: read/write) followed by 14 size bits (giving the size of the argument), followed by an 8-bit type (collecting the ioctls in groups for a common purpose or a common driver), and an 8-bit serial number.
The macros describing this structure live in <asm/ioctl.h> and are _IO(type,nr) and {_IOR,_IOW,_IOWR}(type,nr,size). They use sizeof(size) so that size is a misnomer here: this third argument is a data type.
Note that the size bits are very unreliable: in lots of cases they are wrong, either because of buggy macros using sizeof(sizeof(struct)), or because of legacy values.
Thus, it seems that the new structure only gave disadvantages: it does not help in checking, but it causes varying values for the various architectures.
SEE ALSO
execve(2), fcntl(2), ioctl_console(2), ioctl_fat(2), ioctl_ficlone(2), ioctl_ficlonerange(2), ioctl_fideduperange(2), ioctl_fslabel(2), ioctl_getfsmap(2), ioctl_iflags(2), ioctl_ns(2), ioctl_tty(2), ioctl_userfaultfd(2), open(2), sd(4), tty(4)
24 - Linux cli command setrlimit
NAME setrlimit
get/set resource limits
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/resource.h>
int getrlimit(int resource, struct rlimit *rlim);
int setrlimit(int resource, const struct rlimit *rlim);
int prlimit(pid_t pid, int resource,
const struct rlimit *_Nullable new_limit,
struct rlimit *_Nullable old_limit);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
prlimit():
_GNU_SOURCE
DESCRIPTION
The getrlimit() and setrlimit() system calls get and set resource limits. Each resource has an associated soft and hard limit, as defined by the rlimit structure:
struct rlimit {
rlim_t rlim_cur; /* Soft limit */
rlim_t rlim_max; /* Hard limit (ceiling for rlim_cur) */
};
The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may set only its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability in the initial user namespace) may make arbitrary changes to either limit value.
The value RLIM_INFINITY denotes no limit on a resource (both in the structure returned by getrlimit() and in the structure passed to setrlimit()).
The resource argument must be one of:
RLIMIT_AS
This is the maximum size of the process’s virtual memory (address space). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), mmap(2), and mremap(2), which fail with the error ENOMEM upon exceeding this limit. In addition, automatic stack expansion fails (and generates a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited.
RLIMIT_CORE
This is the maximum size of a core file (see core(5)) in bytes that the process may dump. When 0, no core dump files are created. When nonzero, larger dumps are truncated to this size.
RLIMIT_CPU
This is a limit, in seconds, on the amount of CPU time that the process can consume. When the process reaches the soft limit, it is sent a SIGXCPU signal. The default action for this signal is to terminate the process. However, the signal can be caught, and the handler can return control to the main program. If the process continues to consume CPU time, it will be sent SIGXCPU once per second until the hard limit is reached, at which time it is sent SIGKILL. (This latter point describes Linux behavior. Implementations vary in how they treat processes which continue to consume CPU time after reaching the soft limit. Portable applications that need to catch this signal should perform an orderly termination upon first receipt of SIGXCPU.)
RLIMIT_DATA
This is the maximum size of the process’s data segment (initialized data, uninitialized data, and heap). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), sbrk(2), and (since Linux 4.7) mmap(2), which fail with the error ENOMEM upon encountering the soft limit of this resource.
RLIMIT_FSIZE
This is the maximum size in bytes of files that the process may create. Attempts to extend a file beyond this limit result in delivery of a SIGXFSZ signal. By default, this signal terminates a process, but a process can catch this signal instead, in which case the relevant system call (e.g., write(2), truncate(2)) fails with the error EFBIG.
RLIMIT_LOCKS (Linux 2.4.0 to Linux 2.4.24)
This is a limit on the combined number of flock(2) locks and fcntl(2) leases that this process may establish.
RLIMIT_MEMLOCK
This is the maximum number of bytes of memory that may be locked into RAM. This limit is in effect rounded down to the nearest multiple of the system page size. This limit affects mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9, it also affects the shmctl(2) SHM_LOCK operation, where it sets a maximum on the total bytes in shared memory segments (see shmget(2)) that may be locked by the real user ID of the calling process. The shmctl(2) SHM_LOCK locks are accounted for separately from the per-process memory locks established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED; a process can lock bytes up to this limit in each of these two categories.
Before Linux 2.6.9, this limit controlled the amount of memory that could be locked by a privileged process. Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process may lock, and this limit instead governs the amount of memory that an unprivileged process may lock.
RLIMIT_MSGQUEUE (since Linux 2.6.8)
This is a limit on the number of bytes that can be allocated for POSIX message queues for the real user ID of the calling process. This limit is enforced for mq_open(3). Each message queue that the user creates counts (until it is removed) against this limit according to the formula:
Since Linux 3.5:
    bytes = attr.mq_maxmsg * sizeof(struct msg_msg) +
            MIN(attr.mq_maxmsg, MQ_PRIO_MAX) *
                sizeof(struct posix_msg_tree_node) +   /* For overhead */
            attr.mq_maxmsg * attr.mq_msgsize;          /* For message data */
Linux 3.4 and earlier:
    bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) +   /* For overhead */
            attr.mq_maxmsg * attr.mq_msgsize;             /* For message data */
where attr is the mq_attr structure specified as the fourth argument to mq_open(3), and the msg_msg and posix_msg_tree_node structures are kernel-internal structures.
The “overhead” addend in the formula accounts for overhead bytes required by the implementation and ensures that the user cannot create an unlimited number of zero-length messages (such messages nevertheless each consume some system memory for bookkeeping overhead).
RLIMIT_NICE (since Linux 2.6.12, but see BUGS below)
This specifies a ceiling to which the process’s nice value can be raised using setpriority(2) or nice(2). The actual ceiling for the nice value is calculated as 20 - rlim_cur. The useful range for this limit is thus from 1 (corresponding to a nice value of 19) to 40 (corresponding to a nice value of -20). This unusual choice of range was necessary because negative numbers cannot be specified as resource limit values, since they typically have special meanings. For example, RLIM_INFINITY typically is the same as -1. For more detail on the nice value, see sched(7).
RLIMIT_NOFILE
This specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.)
Since Linux 4.5, this limit also defines the maximum number of file descriptors that an unprivileged process (one without the CAP_SYS_RESOURCE capability) may have “in flight” to other processes, by being passed across UNIX domain sockets. This limit applies to the sendmsg(2) system call. For further details, see unix(7).
RLIMIT_NPROC
This is a limit on the number of extant processes (or, more precisely on Linux, threads) for the real user ID of the calling process. So long as the current number of processes belonging to this process’s real user ID is greater than or equal to this limit, fork(2) fails with the error EAGAIN.
The RLIMIT_NPROC limit is not enforced for processes that have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability, or run with real user ID 0.
RLIMIT_RSS
This is a limit (in bytes) on the process’s resident set (the number of virtual pages resident in RAM). This limit has effect only in Linux 2.4.x, x < 30, and there affects only calls to madvise(2) specifying MADV_WILLNEED.
RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS)
This specifies a ceiling on the real-time priority that may be set for this process using sched_setscheduler(2) and sched_setparam(2).
For further details on real-time scheduling policies, see sched(7).
RLIMIT_RTTIME (since Linux 2.6.25)
This is a limit (in microseconds) on the amount of CPU time that a process scheduled under a real-time scheduling policy may consume without making a blocking system call. For the purpose of this limit, each time a process makes a blocking system call, the count of its consumed CPU time is reset to zero. The CPU time count is not reset if the process continues trying to use the CPU but is preempted, its time slice expires, or it calls sched_yield(2).
Upon reaching the soft limit, the process is sent a SIGXCPU signal. If the process catches or ignores this signal and continues consuming CPU time, then SIGXCPU will be generated once each second until the hard limit is reached, at which point the process is sent a SIGKILL signal.
The intended use of this limit is to stop a runaway real-time process from locking up the system.
For further details on real-time scheduling policies, see sched(7).
RLIMIT_SIGPENDING (since Linux 2.6.8)
This is a limit on the number of signals that may be queued for the real user ID of the calling process. Both standard and real-time signals are counted for the purpose of checking this limit. However, the limit is enforced only for sigqueue(3); it is always possible to use kill(2) to queue one instance of any of the signals that are not already queued to the process.
RLIMIT_STACK
This is the maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated. To handle this signal, a process must employ an alternate signal stack (sigaltstack(2)).
Since Linux 2.6.23, this limit also determines the amount of space used for the process’s command-line arguments and environment variables; for details, see execve(2).
prlimit()
The Linux-specific prlimit() system call combines and extends the functionality of setrlimit() and getrlimit(). It can be used to both set and get the resource limits of an arbitrary process.
The resource argument has the same meaning as for setrlimit() and getrlimit().
If the new_limit argument is not NULL, then the rlimit structure to which it points is used to set new values for the soft and hard limits for resource. If the old_limit argument is not NULL, then a successful call to prlimit() places the previous soft and hard limits for resource in the rlimit structure pointed to by old_limit.
The pid argument specifies the ID of the process on which the call is to operate. If pid is 0, then the call applies to the calling process. To set or get the resources of a process other than itself, the caller must have the CAP_SYS_RESOURCE capability in the user namespace of the process whose resource limits are being changed, or the real, effective, and saved set user IDs of the target process must match the real user ID of the caller and the real, effective, and saved set group IDs of the target process must match the real group ID of the caller.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
A pointer argument points to a location outside the accessible address space.
EINVAL
The value specified in resource is not valid; or, for setrlimit() or prlimit(): rlim->rlim_cur was greater than rlim->rlim_max.
EPERM
An unprivileged process tried to raise the hard limit; the CAP_SYS_RESOURCE capability is required to do this.
EPERM
The caller tried to increase the hard RLIMIT_NOFILE limit above the maximum defined by /proc/sys/fs/nr_open (see proc(5)).
EPERM
(prlimit()) The calling process did not have permission to set limits for the process specified by pid.
ESRCH
Could not find a process with the ID specified in pid.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
getrlimit(), setrlimit(), prlimit() | Thread safety | MT-Safe |
STANDARDS
getrlimit()
setrlimit()
POSIX.1-2008.
prlimit()
Linux.
RLIMIT_MEMLOCK and RLIMIT_NPROC derive from BSD and are not specified in POSIX.1; they are present on the BSDs and Linux, but on few other implementations. RLIMIT_RSS derives from BSD and is not specified in POSIX.1; it is nevertheless present on most implementations. RLIMIT_MSGQUEUE, RLIMIT_NICE, RLIMIT_RTPRIO, RLIMIT_RTTIME, and RLIMIT_SIGPENDING are Linux-specific.
HISTORY
getrlimit()
setrlimit()
POSIX.1-2001, SVr4, 4.3BSD.
prlimit()
Linux 2.6.36, glibc 2.13.
NOTES
A child process created via fork(2) inherits its parent’s resource limits. Resource limits are preserved across execve(2).
Resource limits are per-process attributes that are shared by all of the threads in a process.
Lowering the soft limit for a resource below the process’s current consumption of that resource will succeed (but will prevent the process from further increasing its consumption of the resource).
One can set the resource limits of the shell using the built-in ulimit command (limit in csh(1)). The shell’s resource limits are inherited by the processes that it creates to execute commands.
Since Linux 2.6.24, the resource limits of any process can be inspected via /proc/pid/limits; see proc(5).
Ancient systems provided a vlimit() function with a similar purpose to setrlimit(). For backward compatibility, glibc also provides vlimit(). All new applications should be written using setrlimit().
C library/kernel ABI differences
Since glibc 2.13, the glibc getrlimit() and setrlimit() wrapper functions no longer invoke the corresponding system calls, but instead employ prlimit(), for the reasons described in BUGS.
The name of the glibc wrapper function is prlimit(); the underlying system call is prlimit64().
BUGS
In older Linux kernels, the SIGXCPU and SIGKILL signals delivered when a process encountered the soft and hard RLIMIT_CPU limits were delivered one (CPU) second later than they should have been. This was fixed in Linux 2.6.8.
In Linux 2.6.x kernels before Linux 2.6.17, a RLIMIT_CPU limit of 0 is wrongly treated as “no limit” (like RLIM_INFINITY). Since Linux 2.6.17, setting a limit of 0 does have an effect, but is actually treated as a limit of 1 second.
A kernel bug means that RLIMIT_RTPRIO does not work in Linux 2.6.12; the problem is fixed in Linux 2.6.13.
In Linux 2.6.12, there was an off-by-one mismatch between the priority ranges returned by getpriority(2) and RLIMIT_NICE. This had the effect that the actual ceiling for the nice value was calculated as 19 - rlim_cur. This was fixed in Linux 2.6.13.
Since Linux 2.6.12, if a process reaches its soft RLIMIT_CPU limit and has a handler installed for SIGXCPU, then, in addition to invoking the signal handler, the kernel increases the soft limit by one second. This behavior repeats if the process continues to consume CPU time, until the hard limit is reached, at which point the process is killed. Other implementations do not change the RLIMIT_CPU soft limit in this manner, and the Linux behavior is probably not standards conformant; portable applications should avoid relying on this Linux-specific behavior. The Linux-specific RLIMIT_RTTIME limit exhibits the same behavior when the soft limit is encountered.
Kernels before Linux 2.4.22 did not diagnose the error EINVAL for setrlimit() when rlim->rlim_cur was greater than rlim->rlim_max.
Linux doesn’t return an error when an attempt to set RLIMIT_CPU has failed, for compatibility reasons.
Representation of “large” resource limit values on 32-bit platforms
The glibc getrlimit() and setrlimit() wrapper functions use a 64-bit rlim_t data type, even on 32-bit platforms. However, the rlim_t data type used in the getrlimit() and setrlimit() system calls is a (32-bit) unsigned long. Furthermore, in Linux, the kernel represents resource limits on 32-bit platforms as unsigned long. However, a 32-bit data type is not wide enough. The most pertinent limit here is RLIMIT_FSIZE, which specifies the maximum size to which a file can grow: to be useful, this limit must be represented using a type that is as wide as the type used to represent file offsets (that is, as wide as a 64-bit off_t, assuming a program compiled with _FILE_OFFSET_BITS=64).
To work around this kernel limitation, if a program tried to set a resource limit to a value larger than can be represented in a 32-bit unsigned long, then the glibc setrlimit() wrapper function silently converted the limit value to RLIM_INFINITY. In other words, the requested resource limit setting was silently ignored.
Since glibc 2.13, glibc works around the limitations of the getrlimit() and setrlimit() system calls by implementing setrlimit() and getrlimit() as wrapper functions that call prlimit().
EXAMPLES
The program below demonstrates the use of prlimit().
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    pid_t          pid;
    struct rlimit  old, new;
    struct rlimit  *newp;

    if (!(argc == 2 || argc == 4)) {
        fprintf(stderr, "Usage: %s <pid> [<new-soft-limit> "
                "<new-hard-limit>]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    pid = atoi(argv[1]);        /* PID of target process */

    newp = NULL;
    if (argc == 4) {
        new.rlim_cur = atoi(argv[2]);
        new.rlim_max = atoi(argv[3]);
        newp = &new;
    }

    /* Set CPU time limit of target process; retrieve and display
       previous limit */

    if (prlimit(pid, RLIMIT_CPU, newp, &old) == -1)
        err(EXIT_FAILURE, "prlimit-1");
    printf("Previous limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);

    /* Retrieve and display new CPU time limit */

    if (prlimit(pid, RLIMIT_CPU, NULL, &old) == -1)
        err(EXIT_FAILURE, "prlimit-2");
    printf("New limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);

    exit(EXIT_SUCCESS);
}
SEE ALSO
prlimit(1), dup(2), fcntl(2), fork(2), getrusage(2), mlock(2), mmap(2), open(2), quotactl(2), sbrk(2), shmctl(2), malloc(3), sigqueue(3), ulimit(3), core(5), capabilities(7), cgroups(7), credentials(7), signal(7)
25 - Linux cli command sendfile
NAME
sendfile - transfer data between file descriptors
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *_Nullable offset,
size_t count);
DESCRIPTION
sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.
in_fd should be a file descriptor opened for reading and out_fd should be a descriptor opened for writing.
If offset is not NULL, then it points to a variable holding the file offset from which sendfile() will start reading data from in_fd. When sendfile() returns, this variable will be set to the offset of the byte following the last byte that was read. If offset is not NULL, then sendfile() does not modify the file offset of in_fd; otherwise the file offset is adjusted to reflect the number of bytes read from in_fd.
If offset is NULL, then data will be read from in_fd starting at the file offset, and the file offset will be updated by the call.
count is the number of bytes to copy between the file descriptors.
The in_fd argument must correspond to a file which supports mmap(2)-like operations (i.e., it cannot be a socket). The exception is that, since Linux 5.12, if out_fd is a pipe, sendfile() desugars to a splice(2), and the restrictions of splice(2) apply instead.
Before Linux 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file. If it’s seekable, then sendfile() changes the file offset appropriately.
RETURN VALUE
If the transfer was successful, the number of bytes written to out_fd is returned. Note that a successful call to sendfile() may write fewer bytes than requested; the caller should be prepared to retry the call if there were unsent bytes. See also NOTES.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EAGAIN
Nonblocking I/O has been selected using O_NONBLOCK and the write would block.
EBADF
The input file was not opened for reading or the output file was not opened for writing.
EFAULT
Bad address.
EINVAL
Descriptor is not valid or locked, or an mmap(2)-like operation is not available for in_fd, or count is negative.
EINVAL
out_fd has the O_APPEND flag set. This is not currently supported by sendfile().
EIO
Unspecified error while reading from in_fd.
ENOMEM
Insufficient memory to read from in_fd.
EOVERFLOW
count is too large; the operation would result in exceeding the maximum size of either the input file or the output file.
ESPIPE
offset is not NULL but the input file is not seekable.
VERSIONS
Other UNIX systems implement sendfile() with different semantics and prototypes. It should not be used in portable programs.
STANDARDS
None.
HISTORY
Linux 2.2, glibc 2.1.
In Linux 2.4 and earlier, out_fd could also refer to a regular file; this possibility went away in the Linux 2.6.x kernel series, but was restored in Linux 2.6.33.
The original Linux sendfile() system call was not designed to handle large file offsets. Consequently, Linux 2.4 added sendfile64(), with a wider type for the offset argument. The glibc sendfile() wrapper function transparently deals with the kernel differences.
NOTES
sendfile() will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)
If you plan to use sendfile() for sending files to a TCP socket, but need to send some header data in front of the file contents, you will find it useful to employ the TCP_CORK option, described in tcp(7), to minimize the number of packets and to tune performance.
Applications may wish to fall back to read(2) and write(2) in the case where sendfile() fails with EINVAL or ENOSYS.
If out_fd refers to a socket or pipe with zero-copy support, callers must ensure the transferred portions of the file referred to by in_fd remain unmodified until the reader on the other end of out_fd has consumed the transferred data.
The Linux-specific splice(2) call supports transferring data between arbitrary file descriptors provided one (or both) of them is a pipe.
SEE ALSO
copy_file_range(2), mmap(2), open(2), socket(2), splice(2)
26 - Linux cli command mq_unlink
NAME
mq_unlink - remove a message queue
LIBRARY
Real-time library (librt, -lrt)
SYNOPSIS
#include <mqueue.h>
int mq_unlink(const char *name);
DESCRIPTION
mq_unlink() removes the specified message queue name. The message queue name is removed immediately. The queue itself is destroyed once any other processes that have the queue open close their descriptors referring to the queue.
RETURN VALUE
On success, mq_unlink() returns 0; on error, -1 is returned, with errno set to indicate the error.
ERRORS
EACCES
The caller does not have permission to unlink this message queue.
ENAMETOOLONG
name was too long.
ENOENT
There is no message queue with the given name.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
mq_unlink() | Thread safety | MT-Safe |
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
SEE ALSO
mq_close(3), mq_getattr(3), mq_notify(3), mq_open(3), mq_receive(3), mq_send(3), mq_overview(7)
27 - Linux cli command fanotify_mark
NAME
fanotify_mark - add, remove, or modify an fanotify mark on a filesystem object
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/fanotify.h>
int fanotify_mark(int fanotify_fd, unsigned int flags,
uint64_t mask, int dirfd,
const char *_Nullable pathname);
DESCRIPTION
For an overview of the fanotify API, see fanotify(7).
fanotify_mark() adds, removes, or modifies an fanotify mark on a filesystem object. The caller must have read permission on the filesystem object that is to be marked.
The fanotify_fd argument is a file descriptor returned by fanotify_init(2).
flags is a bit mask describing the modification to perform. It must include exactly one of the following values:
FAN_MARK_ADD
The events in mask will be added to the mark mask (or to the ignore mask). mask must be nonempty or the error EINVAL will occur.
FAN_MARK_REMOVE
The events in argument mask will be removed from the mark mask (or from the ignore mask). mask must be nonempty or the error EINVAL will occur.
FAN_MARK_FLUSH
Remove either all marks for filesystems, all marks for mounts, or all marks for directories and files from the fanotify group. If flags contains FAN_MARK_MOUNT, all marks for mounts are removed from the group. If flags contains FAN_MARK_FILESYSTEM, all marks for filesystems are removed from the group. Otherwise, all marks for directories and files are removed. No flag other than, and at most one of, the flags FAN_MARK_MOUNT or FAN_MARK_FILESYSTEM can be used in conjunction with FAN_MARK_FLUSH. mask is ignored.
If none of the values above is specified, or more than one is specified, the call fails with the error EINVAL.
In addition, zero or more of the following values may be ORed into flags:
FAN_MARK_DONT_FOLLOW
If pathname is a symbolic link, mark the link itself, rather than the file to which it refers. (By default, fanotify_mark() dereferences pathname if it is a symbolic link.)
FAN_MARK_ONLYDIR
If the filesystem object to be marked is not a directory, the error ENOTDIR shall be raised.
FAN_MARK_MOUNT
Mark the mount specified by pathname. If pathname is not itself a mount point, the mount containing pathname will be marked. All directories, subdirectories, and the contained files of the mount will be monitored. The events which require that filesystem objects are identified by file handles, such as FAN_CREATE, FAN_ATTRIB, FAN_MOVE, and FAN_DELETE_SELF, cannot be provided as a mask when flags contains FAN_MARK_MOUNT. Attempting to do so will result in the error EINVAL being returned. Use of this flag requires the CAP_SYS_ADMIN capability.
FAN_MARK_FILESYSTEM (since Linux 4.20)
Mark the filesystem specified by pathname. The filesystem containing pathname will be marked. All the contained files and directories of the filesystem from any mount point will be monitored. Use of this flag requires the CAP_SYS_ADMIN capability.
FAN_MARK_IGNORED_MASK
The events in mask shall be added to or removed from the ignore mask. Note that the flags FAN_ONDIR and FAN_EVENT_ON_CHILD have no effect when provided with this flag. The effect of setting the flags FAN_ONDIR and FAN_EVENT_ON_CHILD in the mark mask on the events that are set in the ignore mask is undefined and depends on the Linux kernel version. Specifically, prior to Linux 5.9, setting a mark mask on a file and a mark with ignore mask on its parent directory would not result in ignoring events on the file, regardless of the FAN_EVENT_ON_CHILD flag in the parent directory’s mark mask. When the ignore mask is updated with the FAN_MARK_IGNORED_MASK flag on a mark that was previously updated with the FAN_MARK_IGNORE flag, the update fails with the error EEXIST.
FAN_MARK_IGNORE (since Linux 6.0)
This flag has an effect similar to that of the FAN_MARK_IGNORED_MASK flag. The events in mask shall be added to or removed from the ignore mask. Unlike the FAN_MARK_IGNORED_MASK flag, this flag also has the effect that the FAN_ONDIR and FAN_EVENT_ON_CHILD flags take effect on the ignore mask. Specifically, unless the FAN_ONDIR flag is set with FAN_MARK_IGNORE, events on directories will not be ignored. If the flag FAN_EVENT_ON_CHILD is set with FAN_MARK_IGNORE, events on children will be ignored. For example, a mark on a directory that combines a mask containing the FAN_CREATE event and the FAN_ONDIR flag with an ignore mask containing the FAN_CREATE event but not the FAN_ONDIR flag will result in getting only the events for creation of subdirectories. When using the FAN_MARK_IGNORE flag to add to an ignore mask of a mount, filesystem, or directory inode mark, the FAN_MARK_IGNORED_SURV_MODIFY flag must be specified. Failure to do so results in the error EINVAL or EISDIR.
FAN_MARK_IGNORED_SURV_MODIFY
The ignore mask shall survive modify events. If this flag is not set, the ignore mask is cleared when a modify event occurs on the marked object. Omitting this flag is typically used to suppress events (e.g., FAN_OPEN) for a specific file, until that specific file’s content has been modified. It is far less useful to suppress events on an entire filesystem, or mount, or on all files inside a directory, until some file’s content has been modified. For this reason, the FAN_MARK_IGNORE flag requires the FAN_MARK_IGNORED_SURV_MODIFY flag on a mount, filesystem, or directory inode mark. This flag cannot be removed from a mark once set. When the ignore mask is updated without this flag on a mark that was previously updated with the FAN_MARK_IGNORE and FAN_MARK_IGNORED_SURV_MODIFY flags, the update fails with EEXIST error.
FAN_MARK_IGNORE_SURV
This is a synonym for (FAN_MARK_IGNORE|FAN_MARK_IGNORED_SURV_MODIFY).
FAN_MARK_EVICTABLE (since Linux 5.19)
When an inode mark is created with this flag, the inode object will not be pinned to the inode cache, thereby allowing the inode object to be evicted from the inode cache when the memory pressure on the system is high. The eviction of the inode object results in the evictable mark also being lost. When the mask of an evictable inode mark is updated without using the FAN_MARK_EVICTABLE flag, the marked inode is pinned to the inode cache and the mark is no longer evictable. When the mask of a non-evictable inode mark is updated with the FAN_MARK_EVICTABLE flag, the inode mark remains non-evictable and the update fails with the error EEXIST. Mounts and filesystems are not evictable objects; therefore, an attempt to create a mount mark or a filesystem mark with the FAN_MARK_EVICTABLE flag will result in the error EINVAL. For example, inode marks can be used in combination with mount marks to reduce the number of events from uninteresting paths. The event listener reads events, checks if the path reported in the event is of interest, and if it is not, the listener sets a mark with an ignore mask on the directory. Evictable inode marks allow using this method for a large number of directories without the concern of pinning all inodes and exhausting the system’s memory.
mask defines which events shall be listened for (or which shall be ignored). It is a bit mask composed of the following values:
FAN_ACCESS
Create an event when a file or directory (but see BUGS) is accessed (read).
FAN_MODIFY
Create an event when a file is modified (write).
FAN_CLOSE_WRITE
Create an event when a writable file is closed.
FAN_CLOSE_NOWRITE
Create an event when a read-only file or directory is closed.
FAN_OPEN
Create an event when a file or directory is opened.
FAN_OPEN_EXEC (since Linux 5.0)
Create an event when a file is opened with the intent to be executed. See NOTES for additional details.
FAN_ATTRIB (since Linux 5.1)
Create an event when the metadata for a file or directory has changed. An fanotify group that identifies filesystem objects by file handles is required.
FAN_CREATE (since Linux 5.1)
Create an event when a file or directory has been created in a marked parent directory. An fanotify group that identifies filesystem objects by file handles is required.
FAN_DELETE (since Linux 5.1)
Create an event when a file or directory has been deleted in a marked parent directory. An fanotify group that identifies filesystem objects by file handles is required.
FAN_DELETE_SELF (since Linux 5.1)
Create an event when a marked file or directory itself is deleted. An fanotify group that identifies filesystem objects by file handles is required.
FAN_FS_ERROR (since Linux 5.16)
Create an event when a filesystem error leading to inconsistent filesystem metadata is detected. An additional information record of type FAN_EVENT_INFO_TYPE_ERROR is returned for each event in the read buffer. An fanotify group that identifies filesystem objects by file handles is required.
Events of such type are dependent on support from the underlying filesystem. At the time of writing, only the ext4 filesystem reports FAN_FS_ERROR events.
See fanotify(7) for additional details.
FAN_MOVED_FROM (since Linux 5.1)
Create an event when a file or directory has been moved from a marked parent directory. An fanotify group that identifies filesystem objects by file handles is required.
FAN_MOVED_TO (since Linux 5.1)
Create an event when a file or directory has been moved to a marked parent directory. An fanotify group that identifies filesystem objects by file handles is required.
FAN_RENAME (since Linux 5.17)
This event contains the same information provided by the events FAN_MOVED_FROM and FAN_MOVED_TO, but is represented by a single event with up to two information records. An fanotify group that identifies filesystem objects by file handles is required. If the filesystem object to be marked is not a directory, the error ENOTDIR shall be raised.
FAN_MOVE_SELF (since Linux 5.1)
Create an event when a marked file or directory itself has been moved. An fanotify group that identifies filesystem objects by file handles is required.
FAN_OPEN_PERM
Create an event when a permission to open a file or directory is requested. An fanotify file descriptor created with FAN_CLASS_PRE_CONTENT or FAN_CLASS_CONTENT is required.
FAN_OPEN_EXEC_PERM (since Linux 5.0)
Create an event when a permission to open a file for execution is requested. An fanotify file descriptor created with FAN_CLASS_PRE_CONTENT or FAN_CLASS_CONTENT is required. See NOTES for additional details.
FAN_ACCESS_PERM
Create an event when a permission to read a file or directory is requested. An fanotify file descriptor created with FAN_CLASS_PRE_CONTENT or FAN_CLASS_CONTENT is required.
FAN_ONDIR
Create events for directories, for example, when opendir(3), readdir(3) (but see BUGS), and closedir(3) are called. Without this flag, events are created only for files. In the context of directory entry events, such as FAN_CREATE, FAN_DELETE, FAN_MOVED_FROM, and FAN_MOVED_TO, specifying the flag FAN_ONDIR is required in order to create events when subdirectory entries are modified (i.e., mkdir(2)/rmdir(2)).
FAN_EVENT_ON_CHILD
Events for the immediate children of marked directories shall be created. The flag has no effect when marking mounts and filesystems. Note that events are not generated for children of the subdirectories of marked directories. More specifically, the directory entry modification events FAN_CREATE, FAN_DELETE, FAN_MOVED_FROM, and FAN_MOVED_TO are not generated for any entry modifications performed inside subdirectories of marked directories. Note that the events FAN_DELETE_SELF and FAN_MOVE_SELF are not generated for children of marked directories. To monitor complete directory trees it is necessary to mark the relevant mount or filesystem.
The following composed values are defined:
FAN_CLOSE
A file is closed (FAN_CLOSE_WRITE|FAN_CLOSE_NOWRITE).
FAN_MOVE
A file or directory has been moved (FAN_MOVED_FROM|FAN_MOVED_TO).
The filesystem object to be marked is determined by the file descriptor dirfd and the pathname specified in pathname:
If pathname is NULL, dirfd defines the filesystem object to be marked.
If pathname is NULL, and dirfd takes the special value AT_FDCWD, the current working directory is to be marked.
If pathname is absolute, it defines the filesystem object to be marked, and dirfd is ignored.
If pathname is relative, and dirfd does not have the value AT_FDCWD, then the filesystem object to be marked is determined by interpreting pathname relative to the directory referred to by dirfd.
If pathname is relative, and dirfd has the value AT_FDCWD, then the filesystem object to be marked is determined by interpreting pathname relative to the current working directory. (See openat(2) for an explanation of why the dirfd argument is useful.)
RETURN VALUE
On success, fanotify_mark() returns 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
An invalid file descriptor was passed in fanotify_fd.
EBADF
pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EEXIST
The filesystem object indicated by dirfd and pathname has a mark that was updated without the FAN_MARK_EVICTABLE flag, and the user attempted to update the mark with FAN_MARK_EVICTABLE flag.
EEXIST
The filesystem object indicated by dirfd and pathname has a mark that was updated with the FAN_MARK_IGNORE flag, and the user attempted to update the mark with FAN_MARK_IGNORED_MASK flag.
EEXIST
The filesystem object indicated by dirfd and pathname has a mark that was updated with the FAN_MARK_IGNORE and FAN_MARK_IGNORED_SURV_MODIFY flags, and the user attempted to update the mark only with FAN_MARK_IGNORE flag.
EINVAL
An invalid value was passed in flags or mask, or fanotify_fd was not an fanotify file descriptor.
EINVAL
The fanotify file descriptor was opened with FAN_CLASS_NOTIF or the fanotify group identifies filesystem objects by file handles and mask contains a flag for permission events (FAN_OPEN_PERM or FAN_ACCESS_PERM).
EINVAL
The group was initialized without FAN_REPORT_FID but one or more event types specified in the mask require it.
EINVAL
flags contains FAN_MARK_IGNORE, and either FAN_MARK_MOUNT or FAN_MARK_FILESYSTEM, but does not contain FAN_MARK_IGNORED_SURV_MODIFY.
EISDIR
flags contains FAN_MARK_IGNORE, but does not contain FAN_MARK_IGNORED_SURV_MODIFY, and dirfd and pathname specify a directory.
ENODEV
The filesystem object indicated by dirfd and pathname is not associated with a filesystem that supports fsid (e.g., fuse(4)). tmpfs(5) did not support fsid prior to Linux 5.13. This error can be returned only with an fanotify group that identifies filesystem objects by file handles.
ENOENT
The filesystem object indicated by dirfd and pathname does not exist. This error also occurs when trying to remove a mark from an object which is not marked.
ENOMEM
The necessary memory could not be allocated.
ENOSPC
The number of marks for this user exceeds the limit and the FAN_UNLIMITED_MARKS flag was not specified when the fanotify file descriptor was created with fanotify_init(2). See fanotify(7) for details about this limit.
ENOSYS
This kernel does not implement fanotify_mark(). The fanotify API is available only if the kernel was configured with CONFIG_FANOTIFY.
ENOTDIR
flags contains FAN_MARK_ONLYDIR, and dirfd and pathname do not specify a directory.
ENOTDIR
mask contains FAN_RENAME, and dirfd and pathname do not specify a directory.
ENOTDIR
flags contains FAN_MARK_IGNORE, or the fanotify group was initialized with flag FAN_REPORT_TARGET_FID, and mask contains directory entry modification events (e.g., FAN_CREATE, FAN_DELETE), or directory event flags (e.g., FAN_ONDIR, FAN_EVENT_ON_CHILD), and dirfd and pathname do not specify a directory.
EOPNOTSUPP
The object indicated by pathname is associated with a filesystem that does not support the encoding of file handles. This error can be returned only with an fanotify group that identifies filesystem objects by file handles. Calling name_to_handle_at(2) with the flag AT_HANDLE_FID (since Linux 6.5) can be used as a test to check if a filesystem supports reporting events with file handles.
EPERM
The operation is not permitted because the caller lacks a required capability.
EXDEV
The filesystem object indicated by pathname resides within a filesystem subvolume (e.g., btrfs(5)) which uses a different fsid than its root superblock. This error can be returned only with an fanotify group that identifies filesystem objects by file handles.
STANDARDS
Linux.
HISTORY
Linux 2.6.37.
NOTES
FAN_OPEN_EXEC and FAN_OPEN_EXEC_PERM
When using either FAN_OPEN_EXEC or FAN_OPEN_EXEC_PERM within the mask, events of these types will be returned only when the direct execution of a program occurs. More specifically, this means that events of these types will be generated for files that are opened using execve(2), execveat(2), or uselib(2). Events of these types will not be raised in the situation where an interpreter is passed (or reads) a file for interpretation.
Additionally, if a mark has also been placed on the Linux dynamic linker, a user should also expect to receive an event for it when an ELF object has been successfully opened using execve(2) or execveat(2).
For example, if the following ELF binary were to be invoked and a FAN_OPEN_EXEC mark has been placed on /:
$ /bin/echo foo
The listening application in this case would receive FAN_OPEN_EXEC events for both the ELF binary and interpreter, respectively:
/bin/echo
/lib64/ld-linux-x86-64.so.2
BUGS
The following bugs were present before Linux 3.16:
If flags contains FAN_MARK_FLUSH, dirfd and pathname must specify a valid filesystem object, even though this object is not used.
readdir(2) does not generate a FAN_ACCESS event.
If fanotify_mark() is called with FAN_MARK_FLUSH, flags is not checked for invalid values.
SEE ALSO
fanotify_init(2), fanotify(7)
28 - Linux cli command removexattr
NAME
removexattr - remove an extended attribute
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
int removexattr(const char *path, const char *name);
int lremovexattr(const char *path, const char *name);
int fremovexattr(int fd, const char *name);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
removexattr() removes the extended attribute identified by name and associated with the given path in the filesystem.
lremovexattr() is identical to removexattr(), except in the case of a symbolic link, where the extended attribute is removed from the link itself, not the file that it refers to.
fremovexattr() is identical to removexattr(), except that the extended attribute is removed from the open file referred to by fd (as returned by open(2)) rather than from the file named by path.
An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode.
RETURN VALUE
On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
ENODATA
The named attribute does not exist.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
In addition, the errors documented in stat(2) can also occur.
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
SEE ALSO
getfattr(1), setfattr(1), getxattr(2), listxattr(2), open(2), setxattr(2), stat(2), symlink(7), xattr(7)
29 - Linux cli command munlock
NAME
munlock - lock and unlock memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
int mlock(const void addr[.len], size_t len);
int mlock2(const void addr[.len], size_t len, unsigned int flags);
int munlock(const void addr[.len], size_t len);
int mlockall(int flags);
int munlockall(void);
DESCRIPTION
mlock(), mlock2(), and mlockall() lock part or all of the calling process’s virtual address space into RAM, preventing that memory from being paged to the swap area.
munlock() and munlockall() perform the converse operation, unlocking part or all of the calling process’s virtual address space, so that pages in the specified virtual address range can be swapped out again if required by the kernel memory manager.
Memory locking and unlocking are performed in units of whole pages.
mlock(), mlock2(), and munlock()
mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.
mlock2() also locks pages in the specified range starting at addr and continuing for len bytes. However, the state of the pages contained in that range after the call returns successfully will depend on the value in the flags argument.
The flags argument can be either 0 or the following constant:
MLOCK_ONFAULT
Lock pages that are currently resident and mark the entire range so that the remaining nonresident pages are locked when they are populated by a page fault.
If flags is 0, mlock2() behaves exactly the same as mlock().
munlock() unlocks pages in the address range starting at addr and continuing for len bytes. After this call, all pages that contain a part of the specified memory range can be moved to external swap space again by the kernel.
mlockall() and munlockall()
mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data, and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.
The flags argument is constructed as the bitwise OR of one or more of the following constants:
MCL_CURRENT
Lock all pages which are currently mapped into the address space of the process.
MCL_FUTURE
Lock all pages which will become mapped into the address space of the process in the future. These could be, for instance, new pages required by a growing heap and stack as well as new memory-mapped files or shared memory regions.
MCL_ONFAULT (since Linux 4.4)
Used together with MCL_CURRENT, MCL_FUTURE, or both. Mark all current (with MCL_CURRENT) or future (with MCL_FUTURE) mappings to lock pages when they are faulted in. When used with MCL_CURRENT, all present pages are locked, but mlockall() will not fault in non-present pages. When used with MCL_FUTURE, all future mappings will be marked to lock pages when they are faulted in, but they will not be populated by the lock when the mapping is created. MCL_ONFAULT must be used with either MCL_CURRENT or MCL_FUTURE or both.
If MCL_FUTURE has been specified, then a later system call (e.g., mmap(2), sbrk(2), malloc(3)), may fail if it would cause the number of locked bytes to exceed the permitted maximum (see below). In the same circumstances, stack growth may likewise fail: the kernel will deny stack expansion and deliver a SIGSEGV signal to the process.
munlockall() unlocks all pages mapped into the address space of the calling process.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned, errno is set to indicate the error, and no changes are made to any locks in the address space of the process.
ERRORS
EAGAIN
(mlock(), mlock2(), and munlock()) Some or all of the specified address range could not be locked.
EINVAL
(mlock(), mlock2(), and munlock()) The result of the addition addr+len was less than addr (e.g., the addition may have resulted in an overflow).
EINVAL
(mlock2()) Unknown flags were specified.
EINVAL
(mlockall()) Unknown flags were specified or MCL_ONFAULT was specified without either MCL_FUTURE or MCL_CURRENT.
EINVAL
(Not on Linux) addr was not a multiple of the page size.
ENOMEM
(mlock(), mlock2(), and munlock()) Some of the specified address range does not correspond to mapped pages in the address space of the process.
ENOMEM
(mlock(), mlock2(), and munlock()) Locking or unlocking a region would result in the total number of mappings with distinct attributes (e.g., locked versus unlocked) exceeding the allowed maximum. (For example, unlocking a range in the middle of a currently locked mapping would result in three mappings: two locked mappings at each end and an unlocked mapping in the middle.)
ENOMEM
(Linux 2.6.9 and later) the caller had a nonzero RLIMIT_MEMLOCK soft resource limit, but tried to lock more memory than the limit permitted. This limit is not enforced if the process is privileged (CAP_IPC_LOCK).
ENOMEM
(Linux 2.4 and earlier) the calling process tried to lock more than half of RAM.
EPERM
The caller is not privileged, but needs privilege (CAP_IPC_LOCK) to perform the requested operation.
EPERM
(munlockall()) (Linux 2.6.8 and earlier) The caller was not privileged (CAP_IPC_LOCK).
VERSIONS
Linux
Under Linux, mlock(), mlock2(), and munlock() automatically round addr down to the nearest page boundary. However, the POSIX.1 specification of mlock() and munlock() allows an implementation to require that addr is page aligned, so portable applications should ensure this.
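A portable caller can do the rounding itself before calling mlock() or munlock(). The sketch below assumes a power-of-two page size and does not guard against overflow of addr+len; the struct page_range type and the page_align_range() name are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: a page-aligned start address and length. */
struct page_range {
    uintptr_t  start;
    size_t     len;
};

/* Expand [addr, addr + len) outward to whole pages.
 * page_size must be a power of two; overflow is not checked. */
static struct page_range
page_align_range(uintptr_t addr, size_t len, size_t page_size)
{
    uintptr_t          mask = (uintptr_t) page_size - 1;
    uintptr_t          start = addr & ~mask;                /* round down */
    uintptr_t          end = (addr + len + mask) & ~mask;   /* round up */
    struct page_range  r = { start, (size_t) (end - start) };

    return r;
}
```

In real use, page_size would come from sysconf(_SC_PAGESIZE), and the resulting start and len would be passed to mlock().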
The VmLck field of the Linux-specific /proc/pid/status file shows how many kilobytes of memory the process with ID PID has locked using mlock(), mlock2(), mlockall(), and mmap(2) MAP_LOCKED.
STANDARDS
mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2008.
mlock2()
Linux.
On POSIX systems on which mlock() and munlock() are available, _POSIX_MEMLOCK_RANGE is defined in <unistd.h> and the number of bytes in a page can be determined from the constant PAGESIZE (if defined) in <limits.h> or by calling sysconf(_SC_PAGESIZE).
On POSIX systems on which mlockall() and munlockall() are available, _POSIX_MEMLOCK is defined in <unistd.h> to a value greater than 0. (See also sysconf(3).)
HISTORY
mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2001, POSIX.1-2008, SVr4.
mlock2()
Linux 4.4, glibc 2.27.
NOTES
Memory locking has two main applications: real-time algorithms and high-security data processing. Real-time applications require deterministic timing, and, like scheduling, paging is one major cause of unexpected program execution delays. Real-time applications will usually also switch to a real-time scheduler with sched_setscheduler(2). Cryptographic security software often handles critical bytes like passwords or secret keys as data structures. As a result of paging, these secrets could be transferred onto a persistent swap store medium, where they might be accessible to the enemy long after the security software has erased the secrets in RAM and terminated. (But be aware that the suspend mode on laptops and some desktop computers will save a copy of the system’s RAM to disk, regardless of memory locks.)
Real-time processes that are using mlockall() to prevent delays on page faults should reserve enough locked stack pages before entering the time-critical section, so that no page fault can be caused by function calls. This can be achieved by calling a function that allocates a sufficiently large automatic variable (an array) and writes to the memory occupied by this array in order to touch these stack pages. This way, enough pages will be mapped for the stack and can be locked into RAM. The dummy writes ensure that not even copy-on-write page faults can occur in the critical section.
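The stack pre-touching technique described above might look like the following sketch. The 8 KiB figure (PREFAULT_STACK_BYTES) is an assumption; a real application would size it from its own worst-case stack use. The function names are hypothetical.

```c
#include <stddef.h>
#include <sys/mman.h>

#define PREFAULT_STACK_BYTES (8 * 1024)   /* assumed worst-case stack use */

/* Write to every byte of a large automatic array so that the stack pages
 * it occupies are mapped (and, once mlockall() is in effect, locked). */
static void
prefault_stack(void)
{
    volatile unsigned char  dummy[PREFAULT_STACK_BYTES];

    for (size_t i = 0; i < sizeof(dummy); i++)
        dummy[i] = 0;
}

/* Lock current and future mappings, then pre-touch the stack.
 * Returns 0 on success, or -1 with errno set by mlockall(). */
static int
lock_memory_and_prefault(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
        return -1;

    prefault_stack();
    return 0;
}
```

The volatile qualifier is there so that the compiler does not discard the dummy writes, which are the whole point of the function.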
Memory locks are not inherited by a child created via fork(2) and are automatically removed (unlocked) during an execve(2) or when the process terminates. The mlockall() MCL_FUTURE and MCL_FUTURE | MCL_ONFAULT settings are not inherited by a child created via fork(2) and are cleared during an execve(2).
Note that fork(2) will prepare the address space for a copy-on-write operation. The consequence is that any write access that follows will cause a page fault that in turn may cause high latencies for a real-time process. Therefore, it is crucial not to invoke fork(2) after an mlockall() or mlock() operation, not even from a thread which runs at a low priority within a process which also has a thread running at elevated priority.
The memory lock on an address range is automatically removed if the address range is unmapped via munmap(2).
Memory locks do not stack, that is, pages which have been locked several times by calls to mlock(), mlock2(), or mlockall() will be unlocked by a single call to munlock() for the corresponding range or by munlockall(). Pages which are mapped to several locations or by several processes stay locked into RAM as long as they are locked at least at one location or by at least one process.
If a call to mlockall() which uses the MCL_FUTURE flag is followed by another call that does not specify this flag, the changes made by the MCL_FUTURE call will be lost.
The mlock2() MLOCK_ONFAULT flag and the mlockall() MCL_ONFAULT flag allow efficient memory locking for applications that deal with large mappings where only a (small) portion of pages in the mapping are touched. In such cases, locking all of the pages in a mapping would incur a significant penalty for memory locking.
Limits and permissions
In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK) in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines a limit on how much memory the process may lock.
Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process can lock and the RLIMIT_MEMLOCK soft resource limit instead defines a limit on how much memory an unprivileged process may lock.
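An unprivileged process can compare a planned request against the soft limit before calling mlock(). The sketch below is deliberately simplified: the within_memlock_limit() name is hypothetical, and the check ignores both memory the process has already locked and the CAP_IPC_LOCK capability.

```c
#include <stddef.h>
#include <sys/resource.h>

/* Return 1 if locking len bytes fits within the RLIMIT_MEMLOCK soft limit,
 * 0 if it does not, and -1 if the limit cannot be read.
 * Simplification: already-locked memory and privileges are ignored. */
static int
within_memlock_limit(size_t len)
{
    struct rlimit  rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) == -1)
        return -1;

    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;

    return (rlim_t) len <= rl.rlim_cur;
}
```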
BUGS
In Linux 4.8 and earlier, a bug in the kernel’s accounting of locked memory for unprivileged processes (i.e., without CAP_IPC_LOCK) meant that if the region specified by addr and len overlapped an existing lock, then the already locked bytes in the overlapping region were counted twice when checking against the limit. Such double accounting could incorrectly calculate a “total locked memory” value for the process that exceeded the RLIMIT_MEMLOCK limit, with the result that mlock() and mlock2() would fail on requests that should have succeeded. This bug was fixed in Linux 4.9.
In Linux 2.4 series of kernels up to and including Linux 2.4.17, a bug caused the mlockall() MCL_FUTURE flag to be inherited across a fork(2). This was rectified in Linux 2.4.18.
Since Linux 2.6.9, if a privileged process calls mlockall(MCL_FUTURE) and later drops privileges (loses the CAP_IPC_LOCK capability by, for example, setting its effective UID to a nonzero value), then subsequent memory allocations (e.g., mmap(2), brk(2)) will fail if the RLIMIT_MEMLOCK resource limit is encountered.
SEE ALSO
mincore(2), mmap(2), setrlimit(2), shmctl(2), sysconf(3), proc(5), capabilities(7)
30 - Linux cli command alloc_hugepages
NAME 🖥️ alloc_hugepages 🖥️
allocate or free huge pages
SYNOPSIS
void *syscall(SYS_alloc_hugepages, int key, void addr[.len], size_t len,
int prot, int flag);
int syscall(SYS_free_hugepages, void *addr);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
The system calls alloc_hugepages() and free_hugepages() were introduced in Linux 2.5.36 and removed again in Linux 2.5.54. They existed only on i386 and ia64 (when built with CONFIG_HUGETLB_PAGE). In Linux 2.4.20, the syscall numbers exist, but the calls fail with the error ENOSYS.
On i386 the memory management hardware knows about ordinary pages (4 KiB) and huge pages (2 or 4 MiB). Similarly ia64 knows about huge pages of several sizes. These system calls serve to map huge pages into the process’s memory or to free them again. Huge pages are locked into memory, and are not swapped.
The key argument is an identifier. When zero the pages are private, and not inherited by children. When positive the pages are shared with other applications using the same key, and inherited by child processes.
The addr argument of free_hugepages() tells which page is being freed: it was the return value of a call to alloc_hugepages(). (The memory is first actually freed when all users have released it.) The addr argument of alloc_hugepages() is a hint, that the kernel may or may not follow. Addresses must be properly aligned.
The len argument is the length of the required segment. It must be a multiple of the huge page size.
The prot argument specifies the memory protection of the segment. It is one of PROT_READ, PROT_WRITE, PROT_EXEC.
The flag argument is ignored, unless key is positive. In that case, if flag is IPC_CREAT, then a new huge page segment is created when none with the given key existed. If this flag is not set, then ENOENT is returned when no segment with the given key exists.
RETURN VALUE
On success, alloc_hugepages() returns the allocated virtual address, and free_hugepages() returns zero. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
ENOSYS
The system call is not supported on this kernel.
FILES
/proc/sys/vm/nr_hugepages
Number of configured hugetlb pages. This can be read and written.
/proc/meminfo
Gives info on the number of configured hugetlb pages and on their size in the three variables HugePages_Total, HugePages_Free, Hugepagesize.
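The fields named above can be read with ordinary file I/O. The following sketch assumes the usual "Key: value" line layout of /proc/meminfo; the read_meminfo_field() helper and its path parameter (which allows pointing it at a copy of the file) are illustrative.

```c
#include <stdio.h>
#include <string.h>

/* Look up a numeric field such as "HugePages_Total" in a meminfo-style
 * file. Returns 0 and stores the number in *value, or -1 if the file
 * cannot be opened or the field is absent or not numeric. */
static int
read_meminfo_field(const char *path, const char *key, unsigned long *value)
{
    FILE    *fp = fopen(path, "r");
    char    line[256];
    size_t  keylen = strlen(key);
    int     found = -1;

    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL) {
        /* Match "key:" exactly at the start of the line. */
        if (strncmp(line, key, keylen) == 0 && line[keylen] == ':') {
            if (sscanf(line + keylen + 1, "%lu", value) == 1)
                found = 0;
            break;
        }
    }

    fclose(fp);
    return found;
}
```

For example, read_meminfo_field("/proc/meminfo", "Hugepagesize", &v) would store the huge page size; the unit suffix that follows the number on that line is ignored by this sketch.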
STANDARDS
Linux on Intel processors.
HISTORY
These system calls are gone; they existed only in Linux 2.5.36 through to Linux 2.5.54.
NOTES
Now the hugetlbfs filesystem can be used instead. Memory backed by huge pages (if the CPU supports them) is obtained by using mmap(2) to map files in this virtual filesystem.
The maximal number of huge pages can be specified using the hugepages= boot parameter.
31 - Linux cli command flock
NAME 🖥️ flock 🖥️
apply or remove an advisory lock on an open file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/file.h>
int flock(int fd, int op);
DESCRIPTION
Apply or remove an advisory lock on the open file specified by fd. The argument op is one of the following:
LOCK_SH
Place a shared lock. More than one process may hold a shared lock for a given file at a given time.
LOCK_EX
Place an exclusive lock. Only one process may hold an exclusive lock for a given file at a given time.
LOCK_UN
Remove an existing lock held by this process.
A call to flock() may block if an incompatible lock is held by another process. To make a nonblocking request, include LOCK_NB (by ORing) with any of the above operations.
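A nonblocking request is commonly used for lock files. The sketch below opens a file and tries to take an exclusive lock without waiting; the open_locked() name and the 0600 creation mode are assumptions made for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open (creating if needed) path and take an exclusive lock without
 * blocking. Returns the locked descriptor, or -1 with errno set;
 * errno is EWOULDBLOCK if the file is locked via another descriptor. */
static int
open_locked(const char *path)
{
    int  fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd == -1)
        return -1;

    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        int  saved = errno;   /* close() may overwrite errno */

        close(fd);
        errno = saved;
        return -1;
    }

    return fd;
}
```

Closing the returned descriptor releases the lock, provided no duplicate of it is still open.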
A single file may not simultaneously have both shared and exclusive locks.
Locks created by flock() are associated with an open file description (see open(2)). This means that duplicate file descriptors (created by, for example, fork(2) or dup(2)) refer to the same lock, and this lock may be modified or released using any of these file descriptors. Furthermore, the lock is released either by an explicit LOCK_UN operation on any of these duplicate file descriptors, or when all such file descriptors have been closed.
If a process uses open(2) (or similar) to obtain more than one file descriptor for the same file, these file descriptors are treated independently by flock(). An attempt to lock the file using one of these file descriptors may be denied by a lock that the calling process has already placed via another file descriptor.
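The difference between a duplicated descriptor and a second open(2) of the same file can be observed with a short experiment. This is a sketch under the local-filesystem semantics described above; both function names are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Take an exclusive lock via fd_a, then try a nonblocking exclusive lock
 * via fd_b. Returns 1 if the second attempt is denied, 0 if it succeeds,
 * and -1 on any other error. The lock is released before returning. */
static int
second_lock_denied(int fd_a, int fd_b)
{
    int  result;

    if (flock(fd_a, LOCK_EX) == -1)
        return -1;

    if (flock(fd_b, LOCK_EX | LOCK_NB) == 0)
        result = 0;
    else
        result = (errno == EWOULDBLOCK) ? 1 : -1;

    flock(fd_a, LOCK_UN);
    return result;
}

/* Compare a dup(2) duplicate with a second open(2) of the same file. */
static int
compare_dup_and_reopen(const char *path, int *dup_denied, int *reopen_denied)
{
    int  fd1, fd2, fd3;
    int  status = -1;

    fd1 = open(path, O_RDWR | O_CREAT, 0600);
    if (fd1 == -1)
        return -1;
    fd2 = dup(fd1);               /* same open file description */
    fd3 = open(path, O_RDWR);     /* new open file description */

    if (fd2 != -1 && fd3 != -1) {
        *dup_denied = second_lock_denied(fd1, fd2);
        *reopen_denied = second_lock_denied(fd1, fd3);
        status = 0;
    }

    if (fd3 != -1)
        close(fd3);
    if (fd2 != -1)
        close(fd2);
    close(fd1);
    return status;
}
```

On a local filesystem, the duplicate is expected to share the lock (not denied), while the separately opened descriptor is expected to be denied.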
A process may hold only one type of lock (shared or exclusive) on a file. Subsequent flock() calls on an already locked file will convert an existing lock to the new lock mode.
Locks created by flock() are preserved across an execve(2).
A shared or exclusive lock can be placed on a file regardless of the mode in which the file was opened.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
fd is not an open file descriptor.
EINTR
While waiting to acquire a lock, the call was interrupted by delivery of a signal caught by a handler; see signal(7).
EINVAL
op is invalid.
ENOLCK
The kernel ran out of memory for allocating lock records.
EWOULDBLOCK
The file is locked and the LOCK_NB flag was selected.
VERSIONS
Since Linux 2.0, flock() is implemented as a system call in its own right rather than being emulated in the GNU C library as a call to fcntl(2). With this implementation, there is no interaction between the types of lock placed by flock() and fcntl(2), and flock() does not detect deadlock. (Note, however, that on some systems, such as the modern BSDs, flock() and fcntl(2) locks do interact with one another.)
CIFS details
Up to Linux 5.4, flock() is not propagated over SMB. A file with such locks will not appear locked for remote clients.
Since Linux 5.5, flock() locks are emulated with SMB byte-range locks on the entire file. Similarly to NFS, this means that fcntl(2) and flock() locks interact with one another. Another important side-effect is that the locks are not advisory anymore: any IO on a locked file will always fail with EACCES when done from a separate file descriptor. This difference originates from the design of locks in the SMB protocol, which provides mandatory locking semantics.
Remote and mandatory locking semantics may vary with SMB protocol, mount options and server type. See mount.cifs(8) for additional information.
STANDARDS
BSD.
HISTORY
4.4BSD (the flock() call first appeared in 4.2BSD). A version of flock(), possibly implemented in terms of fcntl(2), appears on most UNIX systems.
NFS details
Up to Linux 2.6.11, flock() does not lock files over NFS (i.e., the scope of locks was limited to the local system). Instead, one could use fcntl(2) byte-range locking, which does work over NFS, given a sufficiently recent version of Linux and a server which supports locking.
Since Linux 2.6.12, NFS clients support flock() locks by emulating them as fcntl(2) byte-range locks on the entire file. This means that fcntl(2) and flock() locks do interact with one another over NFS. It also means that in order to place an exclusive lock, the file must be opened for writing.
Since Linux 2.6.37, the kernel supports a compatibility mode that allows flock() locks (and also fcntl(2) byte region locks) to be treated as local; see the discussion of the local_lock option in nfs(5).
NOTES
flock() places advisory locks only; given suitable permissions on a file, a process is free to ignore the use of flock() and perform I/O on the file.
flock() and fcntl(2) locks have different semantics with respect to forked processes and dup(2). On systems that implement flock() using fcntl(2), the semantics of flock() will be different from those described in this manual page.
Converting a lock (shared to exclusive, or vice versa) is not guaranteed to be atomic: the existing lock is first removed, and then a new lock is established. Between these two steps, a pending lock request by another process may be granted, with the result that the conversion either blocks, or fails if LOCK_NB was specified. (This is the original BSD behavior, and occurs on many other implementations.)
SEE ALSO
flock(1), close(2), dup(2), execve(2), fcntl(2), fork(2), open(2), lockf(3), lslocks(8)
Documentation/filesystems/locks.txt in the Linux kernel source tree (Documentation/locks.txt in older kernels)
32 - Linux cli command lchown32
NAME 🖥️ lchown32 🖥️
change ownership of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int chown(const char *pathname, uid_t owner, gid_t group);
int fchown(int fd, uid_t owner, gid_t group);
int lchown(const char *pathname, uid_t owner, gid_t group);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int fchownat(int dirfd, const char *pathname,
uid_t owner, gid_t group, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchown(), lchown():
/* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| _XOPEN_SOURCE >= 500
|| /* glibc <= 2.19: */ _BSD_SOURCE
fchownat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These system calls change the owner and group of a file. The chown(), fchown(), and lchown() system calls differ only in how the file is specified:
chown() changes the ownership of the file specified by pathname, which is dereferenced if it is a symbolic link.
fchown() changes the ownership of the file referred to by the open file descriptor fd.
lchown() is like chown(), but does not dereference symbolic links.
Only a privileged process (Linux: one with the CAP_CHOWN capability) may change the owner of a file. The owner of a file may change the group of the file to any group of which that owner is a member. A privileged process (Linux: with CAP_CHOWN) may change the group arbitrarily.
If the owner or group is specified as -1, then that ID is not changed.
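For instance, a caller that wants to change only the group can pass -1 as the owner. The change_group_only() wrapper below is a hypothetical convenience function, shown only to illustrate the cast that the -1 convention requires.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Change only the group of path; passing (uid_t) -1 as the owner
 * leaves the owner unchanged. Returns 0 on success, -1 on error. */
static int
change_group_only(const char *path, gid_t group)
{
    if (chown(path, (uid_t) -1, group) == -1) {
        perror("chown");
        return -1;
    }

    return 0;
}
```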
When the owner or group of an executable file is changed by an unprivileged user, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel version, and since Linux 2.2.13, root is treated like other users. In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown().
When the owner or group of an executable file is changed (by any user), all capability sets for the file are cleared.
fchownat()
The fchownat() system call operates in exactly the same way as chown(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chown() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chown()).
If pathname is absolute, then dirfd is ignored.
The flags argument is a bit mask created by ORing together 0 or more of the following values:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead operate on the link itself, like lchown(). (By default, fchownat() dereferences symbolic links, like chown().)
See openat(2) for an explanation of the need for fchownat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, errors other than those listed below can be returned.
The more general errors for chown() are listed below.
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EBADF
(fchown()) fd is not a valid open file descriptor.
EBADF
(fchownat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
(fchownat()) Invalid flag specified in flags.
EIO
(fchown()) A low-level I/O error occurred while modifying the inode.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
The file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(fchownat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
The calling process did not have the required permissions (see above) to change owner and/or group.
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The named file resides on a read-only filesystem.
VERSIONS
The 4.4BSD version can be used only by the superuser (that is, ordinary users cannot give away files).
STANDARDS
POSIX.1-2008.
HISTORY
chown()
fchown()
lchown()
4.4BSD, SVr4, POSIX.1-2001.
fchownat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Ownership of new files
When a new file is created (by, for example, open(2) or mkdir(2)), its owner is made the same as the filesystem user ID of the creating process. The group of the file depends on a range of factors, including the type of filesystem, the options used to mount the filesystem, and whether or not the set-group-ID mode bit is enabled on the parent directory. If the filesystem supports the -o grpid (or, synonymously -o bsdgroups) and -o nogrpid (or, synonymously -o sysvgroups) mount(8) options, then the rules are as follows:
If the filesystem is mounted with -o grpid, then the group of a new file is made the same as that of the parent directory.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is disabled on the parent directory, then the group of a new file is made the same as the process’s filesystem GID.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is enabled on the parent directory, then the group of a new file is made the same as that of the parent directory.
As at Linux 4.12, the -o grpid and -o nogrpid mount options are supported by ext2, ext3, ext4, and XFS. Filesystems that don’t support these mount options follow the -o nogrpid rules.
glibc notes
On older kernels where fchownat() is unavailable, the glibc wrapper function falls back to the use of chown() and lchown(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
NFS
The chown() semantics are deliberately violated on NFS filesystems which have UID mapping enabled. Additionally, the semantics of all system calls which access the file contents are violated, because chown() may cause immediate access revocation on already open files. Client-side caching may lead to a delay between the time when ownership has been changed to allow access for a user and the time when the file can actually be accessed by the user on other clients.
Historical details
The original Linux chown(), fchown(), and lchown() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added chown32(), fchown32(), and lchown32(), supporting 32-bit IDs. The glibc chown(), fchown(), and lchown() wrapper functions transparently deal with the variations across kernel versions.
Before Linux 2.1.81 (except 2.1.46), chown() did not follow symbolic links. Since Linux 2.1.81, chown() does follow symbolic links, and there is a new system call lchown() that does not follow symbolic links. Since Linux 2.1.86, this new call (that has the same semantics as the old chown()) has got the same syscall number, and chown() got the newly introduced number.
EXAMPLES
The following program changes the ownership of the file named in its second command-line argument to the value specified in its first command-line argument. The new owner can be specified either as a numeric user ID, or as a username (which is converted to a user ID by using getpwnam(3) to perform a lookup in the system password file).
Program source
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    char           *endptr;
    uid_t          uid;
    struct passwd  *pwd;

    if (argc != 3 || argv[1][0] == '\0') {
        fprintf(stderr, "%s <owner> <file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    uid = strtol(argv[1], &endptr, 10);  /* Allow a numeric string */

    if (*endptr != '\0') {         /* Was not pure numeric string */
        pwd = getpwnam(argv[1]);   /* Try getting UID for username */
        if (pwd == NULL) {
            perror("getpwnam");
            exit(EXIT_FAILURE);
        }

        uid = pwd->pw_uid;
    }

    if (chown(argv[2], uid, -1) == -1) {
        perror("chown");
        exit(EXIT_FAILURE);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
chgrp(1), chown(1), chmod(2), flock(2), path_resolution(7), symlink(7)
33 - Linux cli command close_range
NAME 🖥️ close_range 🖥️
close all file descriptors in a given range
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
#include <linux/close_range.h> /* Definition of CLOSE_RANGE_*
constants */
int close_range(unsigned int first, unsigned int last, int flags);
DESCRIPTION
The close_range() system call closes all open file descriptors from first to last (inclusive).
Errors closing a given file descriptor are currently ignored.
flags is a bit mask containing 0 or more of the following:
CLOSE_RANGE_CLOEXEC (since Linux 5.11)
Set the close-on-exec flag on the specified file descriptors, rather than immediately closing them.
CLOSE_RANGE_UNSHARE
Unshare the specified file descriptors from any other processes before closing them, avoiding races with other threads sharing the file descriptor table.
RETURN VALUE
On success, close_range() returns 0. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EINVAL
flags is not valid, or first is greater than last.
The following can occur with CLOSE_RANGE_UNSHARE (when constructing the new descriptor table):
EMFILE
The number of open file descriptors exceeds the limit specified in /proc/sys/fs/nr_open (see proc(5)). This error can occur in situations where that limit was lowered before a call to close_range() where the CLOSE_RANGE_UNSHARE flag is specified.
ENOMEM
Insufficient kernel memory was available.
STANDARDS
None.
HISTORY
FreeBSD. Linux 5.9, glibc 2.34.
NOTES
Closing all open file descriptors
To avoid blindly calling close(2) on every possible file descriptor number, programs on Linux sometimes list the open file descriptors in /proc/self/fd/ and call close(2) on each one. close_range() achieves the same result without requiring /proc and within a single system call, which provides significant performance benefits.
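For comparison, the /proc-based approach mentioned above can be sketched as follows. This is an illustrative fallback, not a replacement for close_range(); the close_fds_from() name is hypothetical, and the code assumes that /proc is mounted.

```c
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Close every open file descriptor >= lowfd by walking /proc/self/fd.
 * Returns 0 on success, or -1 if /proc/self/fd cannot be opened. */
static int
close_fds_from(int lowfd)
{
    DIR            *dirp = opendir("/proc/self/fd");
    struct dirent  *dp;

    if (dirp == NULL)
        return -1;

    while ((dp = readdir(dirp)) != NULL) {
        char  *end;
        long  fd = strtol(dp->d_name, &end, 10);

        if (end == dp->d_name || *end != '\0')
            continue;       /* skip "." and ".." */
        if (fd < lowfd || fd == dirfd(dirp))
            continue;       /* keep low FDs and the directory's own FD */

        close((int) fd);
    }

    closedir(dirp);
    return 0;
}
```

Each descriptor costs one close(2) call here, plus the directory reads, which is the overhead that close_range() avoids.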
Closing file descriptors before exec
File descriptors can be closed safely using
/* we don't want anything past stderr here */
close_range(3, ~0U, CLOSE_RANGE_UNSHARE);
execve(....);
CLOSE_RANGE_UNSHARE is conceptually equivalent to
unshare(CLONE_FILES);
close_range(first, last, 0);
but can be more efficient: if the unshared range extends past the current maximum number of file descriptors allocated in the caller’s file descriptor table (the common case when last is ~0U), the kernel will unshare a new file descriptor table for the caller up to first, copying as few file descriptors as possible. This avoids subsequent close(2) calls entirely; the whole operation is complete once the table is unshared.
Closing files on exec
CLOSE_RANGE_CLOEXEC is particularly useful in cases where multiple pre-exec setup steps risk conflicting with each other. For example, setting up a seccomp(2) profile can conflict with a close_range() call: if the file descriptors are closed before the seccomp(2) profile is set up, the profile setup can’t use them itself, or control their closure; if the file descriptors are closed afterwards, the seccomp profile can’t block the close_range() call or any fallbacks. Using CLOSE_RANGE_CLOEXEC avoids this: the descriptors can be marked before the seccomp(2) profile is set up, and the profile can control access to close_range() without affecting the calling process.
EXAMPLES
The program shown below opens the files named in its command-line arguments, displays the list of files that it has opened (by iterating through the entries in /proc/PID/fd), uses close_range() to close all file descriptors greater than or equal to 3, and then once more displays the process’s list of open files. The following example demonstrates the use of the program:
$ touch /tmp/a /tmp/b /tmp/c
$ ./a.out /tmp/a /tmp/b /tmp/c
/tmp/a opened as FD 3
/tmp/b opened as FD 4
/tmp/c opened as FD 5
/proc/self/fd/0 ==> /dev/pts/1
/proc/self/fd/1 ==> /dev/pts/1
/proc/self/fd/2 ==> /dev/pts/1
/proc/self/fd/3 ==> /tmp/a
/proc/self/fd/4 ==> /tmp/b
/proc/self/fd/5 ==> /tmp/c
/proc/self/fd/6 ==> /proc/9005/fd
========= About to call close_range() =======
/proc/self/fd/0 ==> /dev/pts/1
/proc/self/fd/1 ==> /dev/pts/1
/proc/self/fd/2 ==> /dev/pts/1
/proc/self/fd/3 ==> /proc/9005/fd
Note that the lines showing the pathname /proc/9005/fd result from the calls to opendir(3).
Program source
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
/* Show the contents of the symbolic links in /proc/self/fd */
static void
show_fds(void)
{
    DIR            *dirp;
    char           path[PATH_MAX], target[PATH_MAX];
    ssize_t        len;
    struct dirent  *dp;

    dirp = opendir("/proc/self/fd");
    if (dirp == NULL) {
        perror("opendir");
        exit(EXIT_FAILURE);
    }

    for (;;) {
        dp = readdir(dirp);
        if (dp == NULL)
            break;

        if (dp->d_type == DT_LNK) {
            snprintf(path, sizeof(path), "/proc/self/fd/%s",
                     dp->d_name);

            len = readlink(path, target, sizeof(target));
            if (len == -1)
                continue;       /* skip links that cannot be read */
            printf("%s ==> %.*s\n", path, (int) len, target);
        }
    }

    closedir(dirp);
}

int
main(int argc, char *argv[])
{
    int  fd;

    /* Open each file named on the command line */
    for (int j = 1; j < argc; j++) {
        fd = open(argv[j], O_RDONLY);
        if (fd == -1) {
            perror(argv[j]);
            exit(EXIT_FAILURE);
        }
        printf("%s opened as FD %d\n", argv[j], fd);
    }

    show_fds();

    printf("========= About to call close_range() =======\n");

    /* Close every descriptor from 3 upward */
    if (close_range(3, ~0U, 0) == -1) {
        perror("close_range");
        exit(EXIT_FAILURE);
    }

    show_fds();
    exit(EXIT_SUCCESS);
}
SEE ALSO
close(2)
34 - Linux cli command sigreturn
NAME 🖥️ sigreturn 🖥️
return from signal handler and cleanup stack frame
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
int sigreturn(...);
DESCRIPTION
If the Linux kernel determines that an unblocked signal is pending for a process, then, at the next transition back to user mode in that process (e.g., upon return from a system call or when the process is rescheduled onto the CPU), it creates a new frame on the user-space stack where it saves various pieces of process context (processor status word, registers, signal mask, and signal stack settings).
The kernel also arranges that, during the transition back to user mode, the signal handler is called, and that, upon return from the handler, control passes to a piece of user-space code commonly called the “signal trampoline”. The signal trampoline code in turn calls sigreturn().
This sigreturn() call undoes everything that was done in order to invoke the signal handler (changing the process’s signal mask and switching signal stacks; see sigaltstack(2)). Using the information that was earlier saved on the user-space stack, sigreturn() restores the process’s signal mask, switches stacks, and restores the process’s context (processor flags and registers, including the stack pointer and instruction pointer), so that the process resumes execution at the point where it was interrupted by the signal.
RETURN VALUE
sigreturn() never returns.
VERSIONS
Many UNIX-type systems have a sigreturn() system call or near equivalent. However, this call is not specified in POSIX, and details of its behavior vary across systems.
STANDARDS
None.
NOTES
sigreturn() exists only to allow the implementation of signal handlers. It should never be called directly. (Indeed, a simple sigreturn() wrapper in the GNU C library simply returns -1, with errno set to ENOSYS.) Details of the arguments (if any) passed to sigreturn() vary depending on the architecture. (On some architectures, such as x86-64, sigreturn() takes no arguments, since all of the information that it requires is available in the stack frame that was previously created by the kernel on the user-space stack.)
Once upon a time, UNIX systems placed the signal trampoline code onto the user stack. Nowadays, pages of the user stack are protected so as to disallow code execution. Thus, on contemporary Linux systems, depending on the architecture, the signal trampoline code lives either in the vdso(7) or in the C library. In the latter case, the C library’s sigaction(2) wrapper function informs the kernel of the location of the trampoline code by placing its address in the sa_restorer field of the sigaction structure, and sets the SA_RESTORER flag in the sa_flags field.
The saved process context information is placed in a ucontext_t structure (see <sys/ucontext.h>). That structure is visible within the signal handler as the third argument of a handler established via sigaction(2) with the SA_SIGINFO flag.
On some other UNIX systems, the operation of the signal trampoline differs a little. In particular, on some systems, upon transitioning back to user mode, the kernel passes control to the trampoline (rather than the signal handler), and the trampoline code calls the signal handler (and then calls sigreturn() once the handler returns).
C library/kernel differences
The original Linux system call was named sigreturn(). However, with the addition of real-time signals in Linux 2.2, a new system call, rt_sigreturn() was added to support an enlarged sigset_t type. The GNU C library hides these details from us, transparently employing rt_sigreturn() when the kernel provides it.
SEE ALSO
kill(2), restart_syscall(2), sigaltstack(2), signal(2), getcontext(3), signal(7), vdso(7)
35 - Linux cli command epoll_wait
NAME 🖥️ epoll_wait 🖥️
wait for an I/O event on an epoll file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/epoll.h>
int epoll_wait(int epfd, struct epoll_event *events,
int maxevents, int timeout);
int epoll_pwait(int epfd, struct epoll_event *events,
int maxevents, int timeout,
const sigset_t *_Nullable sigmask);
int epoll_pwait2(int epfd, struct epoll_event *events,
int maxevents, const struct timespec *_Nullable timeout,
const sigset_t *_Nullable sigmask);
DESCRIPTION
The epoll_wait() system call waits for events on the epoll(7) instance referred to by the file descriptor epfd. The buffer pointed to by events is used to return information from the ready list about file descriptors in the interest list that have some events available. Up to maxevents are returned by epoll_wait(). The maxevents argument must be greater than zero.
The timeout argument specifies the number of milliseconds that epoll_wait() will block. Time is measured against the CLOCK_MONOTONIC clock.
A call to epoll_wait() will block until either:
a file descriptor delivers an event;
the call is interrupted by a signal handler; or
the timeout expires.
Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a timeout of -1 causes epoll_wait() to block indefinitely, while specifying a timeout equal to zero causes epoll_wait() to return immediately, even if no events are available.
The struct epoll_event is described in epoll_event(3type).
The data field of each returned epoll_event structure contains the same data as was specified in the most recent call to epoll_ctl(2) (EPOLL_CTL_ADD, EPOLL_CTL_MOD) for the corresponding open file descriptor.
The events field is a bit mask that indicates the events that have occurred for the corresponding open file description. See epoll_ctl(2) for a list of the bits that may appear in this mask.
epoll_pwait()
The relationship between epoll_wait() and epoll_pwait() is analogous to the relationship between select(2) and pselect(2): like pselect(2), epoll_pwait() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.
The following epoll_pwait() call:
ready = epoll_pwait(epfd, &events, maxevents, timeout, &sigmask);
is equivalent to atomically executing the following calls:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = epoll_wait(epfd, &events, maxevents, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
The sigmask argument may be specified as NULL, in which case epoll_pwait() is equivalent to epoll_wait().
epoll_pwait2()
The epoll_pwait2() system call is equivalent to epoll_pwait() except for the timeout argument. It takes an argument of type timespec to be able to specify nanosecond resolution timeout. This argument functions the same as in pselect(2) and ppoll(2). If timeout is NULL, then epoll_pwait2() can block indefinitely.
RETURN VALUE
On success, epoll_wait() returns the number of file descriptors ready for the requested I/O operation, or zero if no file descriptor became ready during the requested timeout milliseconds. On failure, epoll_wait() returns -1 and errno is set to indicate the error.
ERRORS
EBADF
epfd is not a valid file descriptor.
EFAULT
The memory area pointed to by events is not accessible with write permissions.
EINTR
The call was interrupted by a signal handler before either (1) any of the requested events occurred or (2) the timeout expired; see signal(7).
EINVAL
epfd is not an epoll file descriptor, or maxevents is less than or equal to zero.
STANDARDS
Linux.
HISTORY
epoll_wait()
Linux 2.6, glibc 2.3.2.
epoll_pwait()
Linux 2.6.19, glibc 2.6.
epoll_pwait2()
Linux 5.11.
NOTES
While one thread is blocked in a call to epoll_wait(), it is possible for another thread to add a file descriptor to the waited-upon epoll instance. If the new file descriptor becomes ready, it will cause the epoll_wait() call to unblock.
If more than maxevents file descriptors are ready when epoll_wait() is called, then successive epoll_wait() calls will round robin through the set of ready file descriptors. This behavior helps avoid starvation scenarios, where a process fails to notice that additional file descriptors are ready because it focuses on a set of file descriptors that are already known to be ready.
Note that it is possible to call epoll_wait() on an epoll instance whose interest list is currently empty (or whose interest list becomes empty because file descriptors are closed or removed from the interest list in another thread). The call will block until some file descriptor is later added to the interest list (in another thread) and that file descriptor becomes ready.
C library/kernel differences
The raw epoll_pwait() and epoll_pwait2() system calls have a sixth argument, size_t sigsetsize, which specifies the size in bytes of the sigmask argument. The glibc epoll_pwait() wrapper function specifies this argument as a fixed value (equal to sizeof(sigset_t)).
BUGS
Before Linux 2.6.37, a timeout value larger than approximately LONG_MAX / HZ milliseconds is treated as -1 (i.e., infinity). Thus, for example, on a system where sizeof(long) is 4 and the kernel HZ value is 1000, this means that timeouts greater than 35.79 minutes are treated as infinity.
SEE ALSO
epoll_create(2), epoll_ctl(2), epoll(7)
36 - Linux cli command getpid
NAME 🖥️ getpid 🖥️
get process identification
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
pid_t getpid(void);
pid_t getppid(void);
DESCRIPTION
getpid() returns the process ID (PID) of the calling process. (This is often used by routines that generate unique temporary filenames.)
getppid() returns the process ID of the parent of the calling process. This will be either the ID of the process that created this process using fork(), or, if that process has already terminated, the ID of the process to which this process has been reparented (either init(1) or a “subreaper” process defined via the prctl(2) PR_SET_CHILD_SUBREAPER operation).
ERRORS
These functions are always successful.
VERSIONS
On Alpha, instead of a pair of getpid() and getppid() system calls, a single getxpid() system call is provided, which returns a pair of PID and parent PID. The glibc getpid() and getppid() wrapper functions transparently deal with this. See syscall(2) for details regarding register mapping.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD, SVr4.
C library/kernel differences
From glibc 2.3.4 up to and including glibc 2.24, the glibc wrapper function for getpid() cached PIDs, with the goal of avoiding additional system calls when a process calls getpid() repeatedly. Normally this caching was invisible, but its correct operation relied on support in the wrapper functions for fork(2), vfork(2), and clone(2): if an application bypassed the glibc wrappers for these system calls by using syscall(2), then a call to getpid() in the child would return the wrong value (to be precise: it would return the PID of the parent process). In addition, there were cases where getpid() could return the wrong value even when invoking clone(2) via the glibc wrapper function. (For a discussion of one such case, see BUGS in clone(2).) Furthermore, the complexity of the caching code had been the source of a few bugs within glibc over the years.
Because of the aforementioned problems, since glibc 2.25, the PID cache is removed: calls to getpid() always invoke the actual system call, rather than returning a cached value.
NOTES
If the caller’s parent is in a different PID namespace (see pid_namespaces(7)), getppid() returns 0.
From a kernel perspective, the PID (which is shared by all of the threads in a multithreaded process) is sometimes also known as the thread group ID (TGID). This contrasts with the kernel thread ID (TID), which is unique for each thread. For further details, see gettid(2) and the discussion of the CLONE_THREAD flag in clone(2).
SEE ALSO
clone(2), fork(2), gettid(2), kill(2), exec(3), mkstemp(3), tempnam(3), tmpfile(3), tmpnam(3), credentials(7), pid_namespaces(7)
37 - Linux cli command finit_module
NAME 🖥️ finit_module 🖥️
load a kernel module
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/module.h> /* Definition of MODULE_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_init_module, void module_image[.len], unsigned long len,
const char *param_values);
int syscall(SYS_finit_module, int fd,
const char *param_values, int flags);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
init_module() loads an ELF image into kernel space, performs any necessary symbol relocations, initializes module parameters to values provided by the caller, and then runs the module’s init function. This system call requires privilege.
The module_image argument points to a buffer containing the binary image to be loaded; len specifies the size of that buffer. The module image should be a valid ELF image, built for the running kernel.
The param_values argument is a string containing space-delimited specifications of the values for module parameters (defined inside the module using module_param() and module_param_array()). The kernel parses this string and initializes the specified parameters. Each of the parameter specifications has the form:
name[=value[,value...]]
The parameter name is one of those defined within the module using module_param() (see the Linux kernel source file include/linux/moduleparam.h). The parameter value is optional in the case of bool and invbool parameters. Values for array parameters are specified as a comma-separated list.
finit_module()
The finit_module() system call is like init_module(), but reads the module to be loaded from the file descriptor fd. It is useful when the authenticity of a kernel module can be determined from its location in the filesystem; in cases where that is possible, the overhead of using cryptographically signed modules to determine the authenticity of a module can be avoided. The param_values argument is as for init_module().
The flags argument modifies the operation of finit_module(). It is a bit mask value created by ORing together zero or more of the following flags:
MODULE_INIT_IGNORE_MODVERSIONS
Ignore symbol version hashes.
MODULE_INIT_IGNORE_VERMAGIC
Ignore kernel version magic.
MODULE_INIT_COMPRESSED_FILE (since Linux 5.17)
Use in-kernel module decompression.
There are some safety checks built into a module to ensure that it matches the kernel against which it is loaded. These checks are recorded when the module is built and verified when the module is loaded. First, the module records a “vermagic” string containing the kernel version number and prominent features (such as the CPU type). Second, if the module was built with the CONFIG_MODVERSIONS configuration option enabled, a version hash is recorded for each symbol the module uses. This hash is based on the types of the arguments and return value for the function named by the symbol. In this case, the kernel version number within the “vermagic” string is ignored, as the symbol version hashes are assumed to be sufficiently reliable.
Using the MODULE_INIT_IGNORE_VERMAGIC flag indicates that the “vermagic” string is to be ignored, and the MODULE_INIT_IGNORE_MODVERSIONS flag indicates that the symbol version hashes are to be ignored. If the kernel is built to permit forced loading (i.e., configured with CONFIG_MODULE_FORCE_LOAD), then loading continues, otherwise it fails with the error ENOEXEC as expected for malformed modules.
If the kernel was built with CONFIG_MODULE_DECOMPRESS, the in-kernel decompression feature can be used. User-space code can check whether the kernel supports decompression by reading the /sys/module/compression attribute. If the kernel supports decompression, the compressed file can be passed directly to finit_module() using the MODULE_INIT_COMPRESSED_FILE flag. The in-kernel module decompressor supports the following compression algorithms:
gzip (since Linux 5.17)
xz (since Linux 5.17)
zstd (since Linux 6.2)
The kernel implements only a single decompression method. This is selected during module generation according to the compression method chosen in the kernel configuration.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EBADMSG (since Linux 3.7)
Module signature is misformatted.
EBUSY
Timeout while trying to resolve a symbol reference by this module.
EFAULT
An address argument referred to a location that is outside the process’s accessible address space.
ENOKEY (since Linux 3.7)
Module signature is invalid or the kernel does not have a key for this module. This error is returned only if the kernel was configured with CONFIG_MODULE_SIG_FORCE; if the kernel was not configured with this option, then an invalid or unsigned module simply taints the kernel.
ENOMEM
Out of memory.
EPERM
The caller was not privileged (did not have the CAP_SYS_MODULE capability), or module loading is disabled (see /proc/sys/kernel/modules_disabled in proc(5)).
The following errors may additionally occur for init_module():
EEXIST
A module with this name is already loaded.
EINVAL
param_values is invalid, or some part of the ELF image in module_image contains inconsistencies.
ENOEXEC
The binary image supplied in module_image is not an ELF image, or is an ELF image that is invalid or for a different architecture.
The following errors may additionally occur for finit_module():
EBADF
The file referred to by fd is not opened for reading.
EFBIG
The file referred to by fd is too large.
EINVAL
flags is invalid.
EINVAL
The decompressor sanity checks failed, while loading a compressed module with flag MODULE_INIT_COMPRESSED_FILE set.
ENOEXEC
fd does not refer to an open file.
EOPNOTSUPP (since Linux 5.17)
The flag MODULE_INIT_COMPRESSED_FILE is set to load a compressed module, and the kernel was built without CONFIG_MODULE_DECOMPRESS.
ETXTBSY (since Linux 4.7)
The file referred to by fd is opened for read-write.
In addition to the above errors, if the module’s init function is executed and returns an error, then init_module() or finit_module() fails and errno is set to the value returned by the init function.
STANDARDS
Linux.
HISTORY
finit_module()
Linux 3.8.
The init_module() system call is not supported by glibc. No declaration is provided in glibc headers, but, through a quirk of history, glibc versions before glibc 2.23 did export an ABI for this system call. Therefore, in order to employ this system call, it is (before glibc 2.23) sufficient to manually declare the interface in your code; alternatively, you can invoke the system call using syscall(2).
Linux 2.4 and earlier
In Linux 2.4 and earlier, the init_module() system call was rather different:
#include <linux/module.h>
int init_module(const char *name, struct module *image);
(User-space applications can detect which version of init_module() is available by calling query_module(); the latter call fails with the error ENOSYS on Linux 2.6 and later.)
The older version of the system call loads the relocated module image pointed to by image into kernel space and runs the module’s init function. The caller is responsible for providing the relocated image (since Linux 2.6, the init_module() system call does the relocation).
The module image begins with a module structure and is followed by code and data as appropriate. Since Linux 2.2, the module structure is defined as follows:
struct module {
    unsigned long         size_of_struct;
    struct module        *next;
    const char           *name;
    unsigned long         size;
    long                  usecount;
    unsigned long         flags;
    unsigned int          nsyms;
    unsigned int          ndeps;
    struct module_symbol *syms;
    struct module_ref    *deps;
    struct module_ref    *refs;
    int                 (*init)(void);
    void                (*cleanup)(void);
    const struct exception_table_entry *ex_table_start;
    const struct exception_table_entry *ex_table_end;
#ifdef __alpha__
    unsigned long gp;
#endif
};
All of the pointer fields, with the exception of next and refs, are expected to point within the module body and be initialized as appropriate for kernel space, that is, relocated with the rest of the module.
NOTES
Information about currently loaded modules can be found in /proc/modules and in the file trees under the per-module subdirectories under /sys/module.
See the Linux kernel source file include/linux/module.h for some useful background information.
SEE ALSO
create_module(2), delete_module(2), query_module(2), lsmod(8), modprobe(8)
38 - Linux cli command getrandom
NAME 🖥️ getrandom 🖥️
obtain a series of random bytes
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/random.h>
ssize_t getrandom(void buf[.buflen], size_t buflen, unsigned int flags);
DESCRIPTION
The getrandom() system call fills the buffer pointed to by buf with up to buflen random bytes. These bytes can be used to seed user-space random number generators or for cryptographic purposes.
By default, getrandom() draws entropy from the urandom source (i.e., the same source as the /dev/urandom device). This behavior can be changed via the flags argument.
If the urandom source has been initialized, reads of up to 256 bytes will always return as many bytes as requested and will not be interrupted by signals. No such guarantees apply for larger buffer sizes. For example, if the call is interrupted by a signal handler, it may return a partially filled buffer, or fail with the error EINTR.
If the urandom source has not yet been initialized, then getrandom() will block, unless GRND_NONBLOCK is specified in flags.
The flags argument is a bit mask that can contain zero or more of the following values ORed together:
GRND_RANDOM
If this bit is set, then random bytes are drawn from the random source (i.e., the same source as the /dev/random device) instead of the urandom source. The random source is limited based on the entropy that can be obtained from environmental noise. If the number of available bytes in the random source is less than requested in buflen, the call returns just the available random bytes. If no random bytes are available, the behavior depends on the presence of GRND_NONBLOCK in the flags argument.
GRND_NONBLOCK
By default, when reading from the random source, getrandom() blocks if no random bytes are available, and when reading from the urandom source, it blocks if the entropy pool has not yet been initialized. If the GRND_NONBLOCK flag is set, then getrandom() does not block in these cases, but instead immediately returns -1 with errno set to EAGAIN.
RETURN VALUE
On success, getrandom() returns the number of bytes that were copied to the buffer buf. This may be less than the number of bytes requested via buflen if either GRND_RANDOM was specified in flags and insufficient entropy was present in the random source or the system call was interrupted by a signal.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EAGAIN
The requested entropy was not available, and getrandom() would have blocked if the GRND_NONBLOCK flag was not set.
EFAULT
The address referred to by buf is outside the accessible address space.
EINTR
The call was interrupted by a signal handler; see the description of how interrupted read(2) calls on “slow” devices are handled with and without the SA_RESTART flag in the signal(7) man page.
EINVAL
An invalid flag was specified in flags.
ENOSYS
The glibc wrapper function for getrandom() determined that the underlying kernel does not implement this system call.
STANDARDS
Linux.
HISTORY
Linux 3.17, glibc 2.25.
NOTES
For an overview and comparison of the various interfaces that can be used to obtain randomness, see random(7).
Unlike /dev/random and /dev/urandom, getrandom() does not involve the use of pathnames or file descriptors. Thus, getrandom() can be useful in cases where chroot(2) makes /dev pathnames invisible, and where an application (e.g., a daemon during start-up) closes a file descriptor for one of these files that was opened by a library.
Maximum number of bytes returned
As of Linux 3.19 the following limits apply:
When reading from the urandom source, a maximum of 32Mi-1 bytes is returned by a single call to getrandom() on systems where int has a size of 32 bits.
When reading from the random source, a maximum of 512 bytes is returned.
Interruption by a signal handler
When reading from the urandom source (GRND_RANDOM is not set), getrandom() will block until the entropy pool has been initialized (unless the GRND_NONBLOCK flag was specified). If a request is made to read a large number of bytes (more than 256), getrandom() will block until those bytes have been generated and transferred from kernel memory to buf. When reading from the random source (GRND_RANDOM is set), getrandom() will block until some random bytes become available (unless the GRND_NONBLOCK flag was specified).
The behavior when a call to getrandom() that is blocked while reading from the urandom source is interrupted by a signal handler depends on the initialization state of the entropy pool and on the request size, buflen. If the entropy pool is not yet initialized, then the call fails with the EINTR error. If the entropy pool has been initialized and the request size is large (buflen > 256), the call either succeeds, returning a partially filled buffer, or fails with the error EINTR. If the entropy pool has been initialized and the request size is small (buflen <= 256), then getrandom() will not fail with EINTR. Instead, it will return all of the bytes that have been requested.
When reading from the random source, blocking requests of any size can be interrupted by a signal handler (the call fails with the error EINTR).
Using getrandom() to read small buffers (<= 256 bytes) from the urandom source is the preferred mode of usage.
The special treatment of small values of buflen was designed for compatibility with OpenBSD’s getentropy(3), which is nowadays supported by glibc.
The user of getrandom() must always check the return value, to determine whether either an error occurred or fewer bytes than requested were returned. In the case where GRND_RANDOM is not specified and buflen is less than or equal to 256, a return of fewer bytes than requested should never happen, but the careful programmer will check for this anyway!
BUGS
As of Linux 3.19, the following bug exists:
- Depending on CPU load, getrandom() does not react to interrupts before reading all bytes requested.
SEE ALSO
getentropy(3), random(4), urandom(4), random(7), signal(7)
39 - Linux cli command creat
NAME 🖥️ creat 🖥️
open and possibly create a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h>
int open(const char *pathname, int flags, ...
/* mode_t mode */ );
int creat(const char *pathname, mode_t mode);
int openat(int dirfd, const char *pathname, int flags, ...
/* mode_t mode */ );
/* Documented separately, in openat2(2): */
int openat2(int dirfd, const char *pathname,
const struct open_how *how, size_t size);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
openat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
The open() system call opens the file specified by pathname. If the specified file does not exist, it may optionally (if O_CREAT is specified in flags) be created by open().
The return value of open() is a file descriptor, a small, nonnegative integer that is an index to an entry in the process’s table of open file descriptors. The file descriptor is used in subsequent system calls (read(2), write(2), lseek(2), fcntl(2), etc.) to refer to the open file. The file descriptor returned by a successful call will be the lowest-numbered file descriptor not currently open for the process.
By default, the new file descriptor is set to remain open across an execve(2) (i.e., the FD_CLOEXEC file descriptor flag described in fcntl(2) is initially disabled); the O_CLOEXEC flag, described below, can be used to change this default. The file offset is set to the beginning of the file (see lseek(2)).
A call to open() creates a new open file description, an entry in the system-wide table of open files. The open file description records the file offset and the file status flags (see below). A file descriptor is a reference to an open file description; this reference is unaffected if pathname is subsequently removed or modified to refer to a different file. For further details on open file descriptions, see NOTES.
The argument flags must include one of the following access modes: O_RDONLY, O_WRONLY, or O_RDWR. These request opening the file read-only, write-only, or read/write, respectively.
In addition, zero or more file creation flags and file status flags can be bitwise ORed in flags. The file creation flags are O_CLOEXEC, O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TMPFILE, and O_TRUNC. The file status flags are all of the remaining flags listed below. The distinction between these two groups of flags is that the file creation flags affect the semantics of the open operation itself, while the file status flags affect the semantics of subsequent I/O operations. The file status flags can be retrieved and (in some cases) modified; see fcntl(2) for details.
The full list of file creation flags and file status flags is as follows:
O_APPEND
The file is opened in append mode. Before each write(2), the file offset is positioned at the end of the file, as if with lseek(2). The modification of the file offset and the write operation are performed as a single atomic step.
O_APPEND may lead to corrupted files on NFS filesystems if more than one process appends data to a file at once. This is because NFS does not support appending to a file, so the client kernel has to simulate it, which can’t be done without a race condition.
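As an illustration of append mode, the following sketch opens a file with O_APPEND and writes one record. The helper name append_record() and the mode 0644 are assumptions made for the example; it is intended for local filesystems, given the NFS caveat above.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one record to the file at path, creating
   the file if necessary.  Returns 0 on success, or -1 on error or on a
   short write. */
int append_record(const char *path, const char *record)
{
    size_t len = strlen(record);
    ssize_t n;
    int fd;

    fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd == -1)
        return -1;

    /* With O_APPEND, positioning at end-of-file and writing happen as
       one atomic step, so no explicit lseek(2) is needed. */
    n = write(fd, record, len);

    if (close(fd) == -1 || n != (ssize_t) len)
        return -1;
    return 0;
}
```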
O_ASYNC
Enable signal-driven I/O: generate a signal (SIGIO by default, but this can be changed via fcntl(2)) when input or output becomes possible on this file descriptor. This feature is available only for terminals, pseudoterminals, sockets, and (since Linux 2.6) pipes and FIFOs. See fcntl(2) for further details. See also BUGS, below.
O_CLOEXEC (since Linux 2.6.23)
Enable the close-on-exec flag for the new file descriptor. Specifying this flag permits a program to avoid additional fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.
Note that the use of this flag is essential in some multithreaded programs, because using a separate fcntl(2) F_SETFD operation to set the FD_CLOEXEC flag does not suffice to avoid race conditions where one thread opens a file descriptor and attempts to set its close-on-exec flag using fcntl(2) at the same time as another thread does a fork(2) plus execve(2). Depending on the order of execution, the race may lead to the file descriptor returned by open() being unintentionally leaked to the program executed by the child process created by fork(2). (This kind of race is in principle possible for any system call that creates a file descriptor whose close-on-exec flag should be set, and various other Linux system calls provide an equivalent of the O_CLOEXEC flag to deal with this problem.)
O_CREAT
If pathname does not exist, create it as a regular file.
The owner (user ID) of the new file is set to the effective user ID of the process.
The group ownership (group ID) of the new file is set either to the effective group ID of the process (System V semantics) or to the group ID of the parent directory (BSD semantics). On Linux, the behavior depends on whether the set-group-ID mode bit is set on the parent directory: if that bit is set, then BSD semantics apply; otherwise, System V semantics apply. For some filesystems, the behavior also depends on the bsdgroups and sysvgroups mount options described in mount(8).
The mode argument specifies the file mode bits to be applied when a new file is created. If neither O_CREAT nor O_TMPFILE is specified in flags, then mode is ignored (and can thus be specified as 0, or simply omitted). The mode argument must be supplied if O_CREAT or O_TMPFILE is specified in flags; if it is not supplied, some arbitrary bytes from the stack will be applied as the file mode.
The effective mode is modified by the process’s umask in the usual way: in the absence of a default ACL, the mode of the created file is (mode & ~umask).
Note that mode applies only to future accesses of the newly created file; the open() call that creates a read-only file may well return a read/write file descriptor.
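The (mode & ~umask) rule can be observed with a sketch like the following. created_mode() is a hypothetical helper; it assumes the parent directory has no default ACL, and because umask(2) is process-wide it is not suitable for multithreaded programs as written.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create path with the requested mode under the
   given umask and return the permission bits the new file actually
   received, or (mode_t) -1 on error. */
mode_t created_mode(const char *path, mode_t mode, mode_t mask)
{
    struct stat st;
    mode_t old_mask;
    int fd, rc;

    old_mask = umask(mask);             /* install the mask under test */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
    umask(old_mask);                    /* restore the previous mask */
    if (fd == -1)
        return (mode_t) -1;

    rc = fstat(fd, &st);
    close(fd);
    if (rc == -1)
        return (mode_t) -1;

    return st.st_mode & 07777;          /* keep only the mode bits */
}
```

For example, requesting mode 0666 under a umask of 027 is expected to yield a file with mode 0640.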
The following symbolic constants are provided for mode:
S_IRWXU
00700 user (file owner) has read, write, and execute permission
S_IRUSR
00400 user has read permission
S_IWUSR
00200 user has write permission
S_IXUSR
00100 user has execute permission
S_IRWXG
00070 group has read, write, and execute permission
S_IRGRP
00040 group has read permission
S_IWGRP
00020 group has write permission
S_IXGRP
00010 group has execute permission
S_IRWXO
00007 others have read, write, and execute permission
S_IROTH
00004 others have read permission
S_IWOTH
00002 others have write permission
S_IXOTH
00001 others have execute permission
According to POSIX, the effect when other bits are set in mode is unspecified. On Linux, the following bits are also honored in mode:
S_ISUID
0004000 set-user-ID bit
S_ISGID
0002000 set-group-ID bit (see inode(7)).
S_ISVTX
0001000 sticky bit (see inode(7)).
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is described in raw(8).
O_DIRECTORY
If pathname is not a directory, cause the open to fail. This flag was added in Linux 2.1.126, to avoid denial-of-service problems if opendir(3) is called on a FIFO or tape device.
O_DSYNC
Write operations on the file will complete according to the requirements of synchronized I/O data integrity completion.
By the time write(2) (and similar) return, the output data has been transferred to the underlying hardware, along with any file metadata that would be required to retrieve that data (i.e., as though each write(2) was followed by a call to fdatasync(2)). See NOTES below.
O_EXCL
Ensure that this call creates the file: if this flag is specified in conjunction with O_CREAT, and pathname already exists, then open() fails with the error EEXIST.
When these two flags are specified, symbolic links are not followed: if pathname is a symbolic link, then open() fails regardless of where the symbolic link points.
In general, the behavior of O_EXCL is undefined if it is used without O_CREAT. There is one exception: on Linux 2.6 and later, O_EXCL can be used without O_CREAT if pathname refers to a block device. If the block device is in use by the system (e.g., mounted), open() fails with the error EBUSY.
On NFS, O_EXCL is supported only when using NFSv3 or later on kernel 2.6 or later. In NFS environments where O_EXCL support is not provided, programs that rely on it for performing locking tasks will contain a race condition. Portable programs that want to perform atomic file locking using a lockfile, and need to avoid reliance on NFS support for O_EXCL, can create a unique file on the same filesystem (e.g., incorporating hostname and PID), and use link(2) to make a link to the lockfile. If link(2) returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful.
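On a filesystem where O_EXCL is supported, a simple lockfile might be sketched as follows. try_lock() and release_lock() are hypothetical names, and the sketch does not implement the link(2)-based fallback described above.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to take a lock by creating lockfile
   exclusively.  Returns 1 if the lock was acquired, 0 if the file
   already exists (lock held elsewhere), -1 on any other error. */
int try_lock(const char *lockfile)
{
    int fd = open(lockfile, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd == -1)
        return (errno == EEXIST) ? 0 : -1;
    close(fd);
    return 1;
}

/* Release the lock by removing the lockfile. */
int release_lock(const char *lockfile)
{
    return unlink(lockfile);
}
```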
O_LARGEFILE
(LFS) Allow files whose sizes cannot be represented in an off_t (but can be represented in an off64_t) to be opened. The _LARGEFILE64_SOURCE macro must be defined (before including any header files) in order to obtain this definition. Setting the _FILE_OFFSET_BITS feature test macro to 64 (rather than using O_LARGEFILE) is the preferred method of accessing large files on 32-bit systems (see feature_test_macros(7)).
O_NOATIME (since Linux 2.6.8)
Do not update the file last access time (st_atime in the inode) when the file is read(2).
This flag can be employed only if one of the following conditions is true:
- The effective UID of the process matches the owner UID of the file.
- The calling process has the CAP_FOWNER capability in its user namespace and the owner UID of the file has a mapping in the namespace.
This flag is intended for use by indexing or backup programs, where its use can significantly reduce the amount of disk activity. This flag may not be effective on all filesystems. One example is NFS, where the server maintains the access time.
O_NOCTTY
If pathname refers to a terminal device (see tty(4)), it will not become the process’s controlling terminal even if the process does not have one.
O_NOFOLLOW
If the trailing component (i.e., basename) of pathname is a symbolic link, then the open fails, with the error ELOOP. Symbolic links in earlier components of the pathname will still be followed. (Note that the ELOOP error that can occur in this case is indistinguishable from the case where an open fails because there are too many symbolic links found while resolving components in the prefix part of the pathname.)
This flag is a FreeBSD extension, which was added in Linux 2.1.126, and has subsequently been standardized in POSIX.1-2008.
See also O_PATH below.
O_NONBLOCK or O_NDELAY
When possible, the file is opened in nonblocking mode. Neither the open() nor any subsequent I/O operations on the file descriptor which is returned will cause the calling process to wait.
Note that the setting of this flag has no effect on the operation of poll(2), select(2), epoll(7), and similar, since those interfaces merely inform the caller about whether a file descriptor is “ready”, meaning that an I/O operation performed on the file descriptor with the O_NONBLOCK flag clear would not block.
Note that this flag has no effect for regular files and block devices; that is, I/O operations will (briefly) block when device activity is required, regardless of whether O_NONBLOCK is set. Since O_NONBLOCK semantics might eventually be implemented, applications should not depend upon blocking behavior when specifying this flag for regular files and block devices.
For the handling of FIFOs (named pipes), see also fifo(7). For a discussion of the effect of O_NONBLOCK in conjunction with mandatory file locks and with file leases, see fcntl(2).
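The effect of O_NONBLOCK on a FIFO can be illustrated with the sketch below: instead of blocking until a reader appears, the open fails with ENXIO (see ERRORS). The helper name open_fifo_writer() is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open the write end of a FIFO without blocking.
   Returns the new file descriptor, or -1 with errno set; errno is
   ENXIO when no process has the FIFO open for reading. */
int open_fifo_writer(const char *fifo_path)
{
    return open(fifo_path, O_WRONLY | O_NONBLOCK);
}
```

A caller might treat ENXIO as "no consumer yet" and retry later, rather than as a hard failure.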
O_PATH (since Linux 2.6.39)
Obtain a file descriptor that can be used for two purposes: to indicate a location in the filesystem tree and to perform operations that act purely at the file descriptor level. The file itself is not opened, and other file operations (e.g., read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2), mmap(2)) fail with the error EBADF.
The following operations can be performed on the resulting file descriptor:
- close(2).
- fchdir(2), if the file descriptor refers to a directory (since Linux 3.5).
- fstat(2) (since Linux 3.6).
- fstatfs(2) (since Linux 3.12).
- Duplicating the file descriptor (dup(2), fcntl(2) F_DUPFD, etc.).
- Getting and setting file descriptor flags (fcntl(2) F_GETFD and F_SETFD).
- Retrieving open file status flags using the fcntl(2) F_GETFL operation: the returned flags will include the bit O_PATH.
- Passing the file descriptor as the dirfd argument of openat() and the other “*at()” system calls. This includes linkat(2) with AT_EMPTY_PATH (or via procfs using AT_SYMLINK_FOLLOW) even if the file is not a directory.
- Passing the file descriptor to another process via a UNIX domain socket (see SCM_RIGHTS in unix(7)).
When O_PATH is specified in flags, flag bits other than O_CLOEXEC, O_DIRECTORY, and O_NOFOLLOW are ignored.
Opening a file or directory with the O_PATH flag requires no permissions on the object itself (but does require execute permission on the directories in the path prefix). Depending on the subsequent operation, a check for suitable file permissions may be performed (e.g., fchdir(2) requires execute permission on the directory referred to by its file descriptor argument). By contrast, obtaining a reference to a filesystem object by opening it with the O_RDONLY flag requires that the caller have read permission on the object, even when the subsequent operation (e.g., fchdir(2), fstat(2)) does not require read permission on the object.
If pathname is a symbolic link and the O_NOFOLLOW flag is also specified, then the call returns a file descriptor referring to the symbolic link. This file descriptor can be used as the dirfd argument in calls to fchownat(2), fstatat(2), linkat(2), and readlinkat(2) with an empty pathname to have the calls operate on the symbolic link.
If pathname refers to an automount point that has not yet been triggered, so no other filesystem is mounted on it, then the call returns a file descriptor referring to the automount directory without triggering a mount. fstatfs(2) can then be used to determine if it is, in fact, an untriggered automount point (.f_type == AUTOFS_SUPER_MAGIC).
One use of O_PATH for regular files is to provide the equivalent of POSIX.1’s O_EXEC functionality. This permits us to open a file for which we have execute permission but not read permission, and then execute that file, with steps something like the following:
char buf[PATH_MAX];
fd = open("some_prog", O_PATH);
snprintf(buf, PATH_MAX, "/proc/self/fd/%d", fd);
execl(buf, "some_prog", (char *) NULL);
An O_PATH file descriptor can also be passed as the argument of fexecve(3).
O_SYNC
Write operations on the file will complete according to the requirements of synchronized I/O file integrity completion (by contrast with the synchronized I/O data integrity completion provided by O_DSYNC).
By the time write(2) (or similar) returns, the output data and associated file metadata have been transferred to the underlying hardware (i.e., as though each write(2) was followed by a call to fsync(2)). See NOTES below.
O_TMPFILE (since Linux 3.11)
Create an unnamed temporary regular file. The pathname argument specifies a directory; an unnamed inode will be created in that directory’s filesystem. Anything written to the resulting file will be lost when the last file descriptor is closed, unless the file is given a name.
O_TMPFILE must be specified with one of O_RDWR or O_WRONLY and, optionally, O_EXCL. If O_EXCL is not specified, then linkat(2) can be used to link the temporary file into the filesystem, making it permanent, using code like the following:
char path[PATH_MAX];
fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
S_IRUSR | S_IWUSR);
/* File I/O on 'fd'... */
linkat(fd, "", AT_FDCWD, "/path/for/file", AT_EMPTY_PATH);
/* If the caller doesn't have the CAP_DAC_READ_SEARCH
capability (needed to use AT_EMPTY_PATH with linkat(2)),
and there is a proc(5) filesystem mounted, then the
linkat(2) call above can be replaced with:
snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
AT_SYMLINK_FOLLOW);
*/
In this case, the open() mode argument determines the file permission mode, as with O_CREAT.
Specifying O_EXCL in conjunction with O_TMPFILE prevents a temporary file from being linked into the filesystem in the above manner. (Note that the meaning of O_EXCL in this case is different from the meaning of O_EXCL otherwise.)
There are two main use cases for O_TMPFILE:
- Improved tmpfile(3) functionality: race-free creation of temporary files that (1) are automatically deleted when closed; (2) can never be reached via any pathname; (3) are not subject to symlink attacks; and (4) do not require the caller to devise unique names.
- Creating a file that is initially invisible, which is then populated with data and adjusted to have appropriate filesystem attributes (fchown(2), fchmod(2), fsetxattr(2), etc.) before being atomically linked into the filesystem in a fully formed state (using linkat(2) as described above).
O_TMPFILE requires support by the underlying filesystem; only a subset of Linux filesystems provide that support. In the initial implementation, support was provided in the ext2, ext3, ext4, UDF, Minix, and tmpfs filesystems. Support for other filesystems has subsequently been added as follows: XFS (Linux 3.15); Btrfs (Linux 3.16); F2FS (Linux 3.16); and ubifs (Linux 4.9).
O_TRUNC
If the file already exists and is a regular file and the access mode allows writing (i.e., is O_RDWR or O_WRONLY) it will be truncated to length 0. If the file is a FIFO or terminal device file, the O_TRUNC flag is ignored. Otherwise, the effect of O_TRUNC is unspecified.
creat()
A call to creat() is equivalent to calling open() with flags equal to O_CREAT|O_WRONLY|O_TRUNC.
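The equivalence can be written out directly. creat_via_open() is a hypothetical name used only for this illustration.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: behaves like creat(path, mode), spelled out
   with open().  The result is a write-only descriptor, and an
   existing regular file is truncated to length 0. */
int creat_via_open(const char *path, mode_t mode)
{
    return open(path, O_CREAT | O_WRONLY | O_TRUNC, mode);
}
```

Because the access mode is O_WRONLY, a program that needs to read the file back through the same descriptor would call open() with O_RDWR instead.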
openat()
The openat() system call operates in exactly the same way as open(), except for the differences described here.
The dirfd argument is used in conjunction with the pathname argument as follows:
- If the pathname given in pathname is absolute, then dirfd is ignored.
- If the pathname given in pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like open()).
- If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by open() for a relative pathname). In this case, dirfd must be a directory that was opened for reading (O_RDONLY) or using the O_PATH flag.
- If the pathname given in pathname is relative, and dirfd is not a valid file descriptor, an error (EBADF) results. (Specifying an invalid file descriptor number in dirfd can be used as a means to ensure that pathname is absolute.)
openat2(2)
The openat2(2) system call is an extension of openat(), and provides a superset of the features of openat(). It is documented separately, in openat2(2).
RETURN VALUE
On success, open(), openat(), and creat() return the new file descriptor (a nonnegative integer). On error, -1 is returned and errno is set to indicate the error.
ERRORS
open(), openat(), and creat() can fail with the following errors:
EACCES
The requested access to the file is not allowed, or search permission is denied for one of the directories in the path prefix of pathname, or the file did not exist yet and write access to the parent directory is not allowed. (See also path_resolution(7).)
EACCES
Where O_CREAT is specified, the protected_fifos or protected_regular sysctl is enabled, the file already exists and is a FIFO or regular file, the owner of the file is neither the current user nor the owner of the containing directory, and the containing directory is both world- or group-writable and sticky. For details, see the descriptions of /proc/sys/fs/protected_fifos and /proc/sys/fs/protected_regular in proc_sys_fs(5).
EBADF
(openat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EBUSY
O_EXCL was specified in flags and pathname refers to a block device that is in use by the system (e.g., it is mounted).
EDQUOT
Where O_CREAT is specified, the file does not exist, and the user’s quota of disk blocks or inodes on the filesystem has been exhausted.
EEXIST
pathname already exists and O_CREAT and O_EXCL were used.
EFAULT
pathname points outside your accessible address space.
EFBIG
See EOVERFLOW.
EINTR
While blocked waiting to complete an open of a slow device (e.g., a FIFO; see fifo(7)), the call was interrupted by a signal handler; see signal(7).
EINVAL
The filesystem does not support the O_DIRECT flag. See NOTES for more information.
EINVAL
Invalid value in flags.
EINVAL
O_TMPFILE was specified in flags, but neither O_WRONLY nor O_RDWR was specified.
EINVAL
O_CREAT was specified in flags and the final component (“basename”) of the new file’s pathname is invalid (e.g., it contains characters not permitted by the underlying filesystem).
EINVAL
The final component (“basename”) of pathname is invalid (e.g., it contains characters not permitted by the underlying filesystem).
EISDIR
pathname refers to a directory and the access requested involved writing (that is, O_WRONLY or O_RDWR is set).
EISDIR
pathname refers to an existing directory, O_TMPFILE and one of O_WRONLY or O_RDWR were specified in flags, but this kernel version does not provide the O_TMPFILE functionality.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ELOOP
pathname was a symbolic link, and flags specified O_NOFOLLOW but not O_PATH.
EMFILE
The per-process limit on the number of open file descriptors has been reached (see the description of RLIMIT_NOFILE in getrlimit(2)).
ENAMETOOLONG
pathname was too long.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
pathname refers to a device special file and no corresponding device exists. (This is a Linux kernel bug; in this situation ENXIO must be returned.)
ENOENT
O_CREAT is not set and the named file does not exist.
ENOENT
A directory component in pathname does not exist or is a dangling symbolic link.
ENOENT
pathname refers to a nonexistent directory, O_TMPFILE and one of O_WRONLY or O_RDWR were specified in flags, but this kernel version does not provide the O_TMPFILE functionality.
ENOMEM
The named file is a FIFO, but memory for the FIFO buffer can’t be allocated because the per-user hard limit on memory allocation for pipes has been reached and the caller is not privileged; see pipe(7).
ENOMEM
Insufficient kernel memory was available.
ENOSPC
pathname was to be created but the device containing pathname has no room for the new file.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory, or O_DIRECTORY was specified and pathname was not a directory.
ENOTDIR
(openat()) pathname is a relative pathname and dirfd is a file descriptor referring to a file other than a directory.
ENXIO
O_NONBLOCK | O_WRONLY is set, the named file is a FIFO, and no process has the FIFO open for reading.
ENXIO
The file is a device special file and no corresponding device exists.
ENXIO
The file is a UNIX domain socket.
EOPNOTSUPP
The filesystem containing pathname does not support O_TMPFILE.
EOVERFLOW
pathname refers to a regular file that is too large to be opened. The usual scenario here is that an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 tried to open a file whose size exceeds (1<<31)-1 bytes; see also O_LARGEFILE above. This is the error specified by POSIX.1; before Linux 2.6.24, Linux gave the error EFBIG for this case.
EPERM
The O_NOATIME flag was specified, but the effective user ID of the caller did not match the owner of the file and the caller was not privileged.
EPERM
The operation was prevented by a file seal; see fcntl(2).
EROFS
pathname refers to a file on a read-only filesystem and write access was requested.
ETXTBSY
pathname refers to an executable image which is currently being executed and write access was requested.
ETXTBSY
pathname refers to a file that is currently in use as a swap file, and the O_TRUNC flag was specified.
ETXTBSY
pathname refers to a file that is currently being read by the kernel (e.g., for module/firmware loading), and write access was requested.
EWOULDBLOCK
The O_NONBLOCK flag was specified, and an incompatible lease was held on the file (see fcntl(2)).
VERSIONS
The (undefined) effect of O_RDONLY | O_TRUNC varies among implementations. On many systems the file is actually truncated.
Synchronized I/O
The POSIX.1-2008 “synchronized I/O” option specifies different variants of synchronized I/O, and specifies the open() flags O_SYNC, O_DSYNC, and O_RSYNC for controlling the behavior. Regardless of whether an implementation supports this option, it must at least support the use of O_SYNC for regular files.
Linux implements O_SYNC and O_DSYNC, but not O_RSYNC. Somewhat incorrectly, glibc defines O_RSYNC to have the same value as O_SYNC. (O_RSYNC is defined in the Linux header file <asm/fcntl.h> on HP PA-RISC, but it is not used.)
O_SYNC provides synchronized I/O file integrity completion, meaning write operations will flush data and all associated metadata to the underlying hardware. O_DSYNC provides synchronized I/O data integrity completion, meaning write operations will flush data to the underlying hardware, but will only flush metadata updates that are required to allow a subsequent read operation to complete successfully. Data integrity completion can reduce the number of disk operations that are required for applications that don’t need the guarantees of file integrity completion.
To understand the difference between the two types of completion, consider two pieces of file metadata: the file last modification timestamp (st_mtime) and the file length. All write operations will update the last file modification timestamp, but only writes that add data to the end of the file will change the file length. The last modification timestamp is not needed to ensure that a read completes successfully, but the file length is. Thus, O_DSYNC would only guarantee to flush updates to the file length metadata (whereas O_SYNC would also always flush the last modification timestamp metadata).
Before Linux 2.6.33, Linux implemented only the O_SYNC flag for open(). However, when that flag was specified, most filesystems actually provided the equivalent of synchronized I/O data integrity completion (i.e., O_SYNC was actually implemented as the equivalent of O_DSYNC).
Since Linux 2.6.33, proper O_SYNC support is provided. However, to ensure backward binary compatibility, O_DSYNC was defined with the same value as the historical O_SYNC, and O_SYNC was defined as a new (two-bit) flag value that includes the O_DSYNC flag value. This ensures that applications compiled against new headers get at least O_DSYNC semantics before Linux 2.6.33.
C library/kernel differences
Since glibc 2.26, the glibc wrapper function for open() employs the openat() system call, rather than the kernel’s open() system call. For certain architectures, this is also true before glibc 2.26.
STANDARDS
open(), creat(), openat()
POSIX.1-2008.
openat2(2)
Linux.
The O_DIRECT, O_NOATIME, O_PATH, and O_TMPFILE flags are Linux-specific. One must define _GNU_SOURCE to obtain their definitions.
The O_CLOEXEC, O_DIRECTORY, and O_NOFOLLOW flags are not specified in POSIX.1-2001, but are specified in POSIX.1-2008. Since glibc 2.12, one can obtain their definitions by defining either _POSIX_C_SOURCE with a value greater than or equal to 200809L or _XOPEN_SOURCE with a value greater than or equal to 700. In glibc 2.11 and earlier, one obtains the definitions by defining _GNU_SOURCE.
HISTORY
open(), creat()
SVr4, 4.3BSD, POSIX.1-2001.
openat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Under Linux, the O_NONBLOCK flag is sometimes used in cases where one wants to open but does not necessarily have the intention to read or write. For example, this may be used to open a device in order to get a file descriptor for use with ioctl(2).
Note that open() can open device special files, but creat() cannot create them; use mknod(2) instead.
If the file is newly created, its st_atime, st_ctime, st_mtime fields (respectively, time of last access, time of last status change, and time of last modification; see stat(2)) are set to the current time, and so are the st_ctime and st_mtime fields of the parent directory. Otherwise, if the file is modified because of the O_TRUNC flag, its st_ctime and st_mtime fields are set to the current time.
The files in the /proc/pid/fd directory show the open file descriptors of the process with the PID pid. The files in the /proc/pid/fdinfo directory show even more information about these file descriptors. See proc(5) for further details of both of these directories.
The Linux header file <asm/fcntl.h> doesn’t define O_ASYNC; the (BSD-derived) FASYNC synonym is defined instead.
Open file descriptions
The term open file description is the one used by POSIX to refer to the entries in the system-wide table of open files. In other contexts, this object is variously also called an “open file object”, a “file handle”, an “open file table entry”, or, in kernel-developer parlance, a struct file.
When a file descriptor is duplicated (using dup(2) or similar), the duplicate refers to the same open file description as the original file descriptor, and the two file descriptors consequently share the file offset and file status flags. Such sharing can also occur between processes: a child process created via fork(2) inherits duplicates of its parent’s file descriptors, and those duplicates refer to the same open file descriptions.
Each open() of a file creates a new open file description; thus, there may be multiple open file descriptions corresponding to a file inode.
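The difference between a duplicated descriptor and a second open() can be observed through the file offset. In this sketch, offset_seen_after_write() is a hypothetical helper.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes through fd, then return the
   file offset as seen through other_fd, or (off_t) -1 on error.
   If other_fd shares fd's open file description (e.g., it came from
   dup(2)), the offset has advanced; if it came from a separate
   open(), it has not. */
off_t offset_seen_after_write(int fd, int other_fd,
                              const void *data, size_t len)
{
    if (write(fd, data, len) != (ssize_t) len)
        return (off_t) -1;
    return lseek(other_fd, 0, SEEK_CUR);
}
```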
On Linux, one can use the kcmp(2) KCMP_FILE operation to test whether two file descriptors (in the same process or in two different processes) refer to the same open file description.
NFS
There are many infelicities in the protocol underlying NFS, affecting amongst others O_SYNC and O_NDELAY.
On NFS filesystems with UID mapping enabled, open() may return a file descriptor but, for example, read(2) requests are denied with EACCES. This is because the client performs open() by checking the permissions, but UID mapping is performed by the server upon read and write requests.
FIFOs
Opening the read or write end of a FIFO blocks until the other end is also opened (by another process or thread). See fifo(7) for further details.
File access mode
Unlike the other values that can be specified in flags, the access mode values O_RDONLY, O_WRONLY, and O_RDWR do not specify individual bits. Rather, they define the low order two bits of flags, and are defined respectively as 0, 1, and 2. In other words, the combination O_RDONLY | O_WRONLY is a logical error, and certainly does not have the same meaning as O_RDWR.
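Since the access mode is a two-bit field rather than a set of independent bits, it should be extracted with the O_ACCMODE mask (defined in <fcntl.h>) and compared for equality. A minimal sketch, with hypothetical helper names:

```c
#include <fcntl.h>

/* Return the access mode of fd (O_RDONLY, O_WRONLY, or O_RDWR),
   or -1 on error. */
int access_mode(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return flags & O_ACCMODE;
}

/* Return 1 if fd was opened for writing, 0 if not, -1 on error. */
int is_writable(int fd)
{
    int mode = access_mode(fd);

    if (mode == -1)
        return -1;
    /* Not (flags & O_WRONLY): on Linux that bit test misses O_RDWR. */
    return mode == O_WRONLY || mode == O_RDWR;
}
```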
Linux reserves the special, nonstandard access mode 3 (binary 11) in flags to mean: check for read and write permission on the file and return a file descriptor that can’t be used for reading or writing. This nonstandard access mode is used by some Linux drivers to return a file descriptor that is to be used only for device-specific ioctl(2) operations.
Rationale for openat() and other directory file descriptor APIs
openat() and the other system calls and library functions that take a directory file descriptor argument (i.e., execveat(2), faccessat(2), fanotify_mark(2), fchmodat(2), fchownat(2), fspick(2), fstatat(2), futimesat(2), linkat(2), mkdirat(2), mknodat(2), mount_setattr(2), move_mount(2), name_to_handle_at(2), open_tree(2), openat2(2), readlinkat(2), renameat(2), renameat2(2), statx(2), symlinkat(2), unlinkat(2), utimensat(2), mkfifoat(3), and scandirat(3)) address two problems with the older interfaces that preceded them. Here, the explanation is in terms of the openat() call, but the rationale is analogous for the other interfaces.
First, openat() allows an application to avoid race conditions that could occur when using open() to open files in directories other than the current working directory. These race conditions result from the fact that some component of the directory prefix given to open() could be changed in parallel with the call to open(). Suppose, for example, that we wish to create the file dir1/dir2/xxx.dep if the file dir1/dir2/xxx exists. The problem is that between the existence check and the file-creation step, dir1 or dir2 (which might be symbolic links) could be modified to point to a different location. Such races can be avoided by opening a file descriptor for the target directory, and then specifying that file descriptor as the dirfd argument of (say) fstatat(2) and openat(). The use of the dirfd file descriptor also has other benefits:
the file descriptor is a stable reference to the directory, even if the directory is renamed; and
the open file descriptor prevents the underlying filesystem from being dismounted, just as when a process has a current working directory on a filesystem.
Second, openat() allows the implementation of a per-thread “current working directory”, via file descriptor(s) maintained by the application. (This functionality can also be obtained by tricks based on the use of /proc/self/fd/dirfd, but less efficiently.)
The dirfd argument for these APIs can be obtained by using open() or openat() to open a directory (with either the O_RDONLY or the O_PATH flag). Alternatively, such a file descriptor can be obtained by applying dirfd(3) to a directory stream created using opendir(3).
When these APIs are given a dirfd argument of AT_FDCWD or the specified pathname is absolute, then they handle their pathname argument in the same way as the corresponding conventional APIs. However, in this case, several of the APIs have a flags argument that provides access to functionality that is not available with the corresponding conventional APIs.
O_DIRECT
The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. On Linux, alignment restrictions vary by filesystem and kernel version and might be absent entirely. The handling of misaligned O_DIRECT I/Os also varies; they can either fail with EINVAL or fall back to buffered I/O.
Since Linux 6.1, O_DIRECT support and alignment restrictions for a file can be queried using statx(2), using the STATX_DIOALIGN flag. Support for STATX_DIOALIGN varies by filesystem; see statx(2).
Some filesystems provide their own interfaces for querying O_DIRECT alignment restrictions, for example the XFS_IOC_DIOINFO operation in xfsctl(3). STATX_DIOALIGN should be used instead when it is available.
If none of the above is available, then direct I/O support and alignment restrictions can only be assumed from known characteristics of the filesystem, the individual file, the underlying storage device(s), and the kernel version. In Linux 2.4, most filesystems based on block devices require that the file offset and the length and memory address of all I/O segments be multiples of the filesystem block size (typically 4096 bytes). In Linux 2.6.0, this was relaxed to the logical block size of the block device (typically 512 bytes). A block device’s logical block size can be determined using the ioctl(2) BLKSSZGET operation or from the shell using the command:
blockdev --getss
O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).
The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX also has a fcntl(2) call to query appropriate alignments and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.
O_DIRECT support was added in Linux 2.4.10. Older Linux kernels simply ignore this flag. Some filesystems may not implement the flag, in which case open() fails with the error EINVAL if it is used.
Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the filesystem correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.
The behavior of O_DIRECT with NFS will differ from local filesystems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will bypass the page cache only on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.
In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.
BUGS
Currently, it is not possible to enable signal-driven I/O by specifying O_ASYNC when calling open(); use fcntl(2) to enable this flag.
One must check for two different error codes, EISDIR and ENOENT, when trying to determine whether the kernel supports O_TMPFILE functionality.
When both O_CREAT and O_DIRECTORY are specified in flags and the file specified by pathname does not exist, open() will create a regular file (i.e., O_DIRECTORY is ignored).
SEE ALSO
chmod(2), chown(2), close(2), dup(2), fcntl(2), link(2), lseek(2), mknod(2), mmap(2), mount(2), open_by_handle_at(2), openat2(2), read(2), socket(2), stat(2), umask(2), unlink(2), write(2), fopen(3), acl(5), fifo(7), inode(7), path_resolution(7), symlink(7)
40 - Linux cli command perf_event_open
NAME
perf_event_open - set up performance monitoring
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/perf_event.h> /* Definition of PERF_* constants */
#include <linux/hw_breakpoint.h> /* Definition of HW_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_perf_event_open, struct perf_event_attr *attr,
pid_t pid, int cpu, int group_fd, unsigned long flags);
Note: glibc provides no wrapper for perf_event_open(), necessitating the use of syscall(2).
DESCRIPTION
Given a list of parameters, perf_event_open() returns a file descriptor, for use in subsequent system calls (read(2), mmap(2), prctl(2), fcntl(2), etc.).
A call to perf_event_open() creates a file descriptor that allows measuring performance information. Each file descriptor corresponds to one event that is measured; these can be grouped together to measure multiple events simultaneously.
Events can be enabled and disabled in two ways: via ioctl(2) and via prctl(2). When an event is disabled it does not count or generate overflows but does continue to exist and maintain its count value.
Events come in two flavors: counting and sampled. A counting event is one that is used for counting the aggregate number of events that occur. In general, counting event results are gathered with a read(2) call. A sampling event periodically writes measurements to a buffer that can then be accessed via mmap(2).
Arguments
The pid and cpu arguments allow specifying which process and CPU to monitor:
pid == 0 and cpu == -1
This measures the calling process/thread on any CPU.
pid == 0 and cpu >= 0
This measures the calling process/thread only when running on the specified CPU.
pid > 0 and cpu == -1
This measures the specified process/thread on any CPU.
pid > 0 and cpu >= 0
This measures the specified process/thread only when running on the specified CPU.
pid == -1 and cpu >= 0
This measures all processes/threads on the specified CPU. This requires CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN capability or a /proc/sys/kernel/perf_event_paranoid value of less than 1.
pid == -1 and cpu == -1
This setting is invalid and will return an error.
When pid is greater than zero, permission to perform this system call is governed by CAP_PERFMON (since Linux 5.9) and a ptrace access mode PTRACE_MODE_READ_REALCREDS check on older Linux versions; see ptrace(2).
The group_fd argument allows event groups to be created. An event group has one event which is the group leader. The leader is created first, with group_fd = -1. The rest of the group members are created with subsequent perf_event_open() calls with group_fd being set to the file descriptor of the group leader. (A single event on its own is created with group_fd = -1 and is considered to be a group with only 1 member.) An event group is scheduled onto the CPU as a unit: it will be put onto the CPU only if all of the events in the group can be put onto the CPU. This means that the values of the member events can be meaningfully compared (added, divided to get ratios, and so on) with each other, since they have counted events for the same set of executed instructions.
The flags argument is formed by ORing together zero or more of the following values:
PERF_FLAG_FD_CLOEXEC (since Linux 3.14)
This flag enables the close-on-exec flag for the created event file descriptor, so that the file descriptor is automatically closed on execve(2). Setting the close-on-exec flag at creation time, rather than later with fcntl(2), avoids potential race conditions where the calling thread invokes perf_event_open() and fcntl(2) at the same time as another thread calls fork(2) then execve(2).
PERF_FLAG_FD_NO_GROUP
This flag tells the event to ignore the group_fd parameter except for the purpose of setting up output redirection using the PERF_FLAG_FD_OUTPUT flag.
PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35)
This flag re-routes the event’s sampled output to instead be included in the mmap buffer of the event specified by group_fd.
PERF_FLAG_PID_CGROUP (since Linux 2.6.39)
This flag activates per-container system-wide monitoring. A container is an abstraction that isolates a set of resources for finer-grained control (CPUs, memory, etc.). In this mode, the event is measured only if the thread running on the monitored CPU belongs to the designated container (cgroup). The cgroup is identified by passing a file descriptor opened on its directory in the cgroupfs filesystem. For instance, if the cgroup to monitor is called test, then a file descriptor opened on /dev/cgroup/test (assuming cgroupfs is mounted on /dev/cgroup) must be passed as the pid parameter. cgroup monitoring is available only for system-wide events and may therefore require extra permissions.
The perf_event_attr structure provides detailed configuration information for the event being created.
struct perf_event_attr {
__u32 type; /* Type of event */
__u32 size; /* Size of attribute structure */
__u64 config; /* Type-specific configuration */
union {
__u64 sample_period; /* Period of sampling */
__u64 sample_freq; /* Frequency of sampling */
};
__u64 sample_type; /* Specifies values included in sample */
__u64 read_format; /* Specifies values returned in read */
__u64 disabled : 1, /* off by default */
inherit : 1, /* children inherit it */
pinned : 1, /* must always be on PMU */
exclusive : 1, /* only group on PMU */
exclude_user : 1, /* don't count user */
exclude_kernel : 1, /* don't count kernel */
exclude_hv : 1, /* don't count hypervisor */
exclude_idle : 1, /* don't count when idle */
mmap : 1, /* include mmap data */
comm : 1, /* include comm data */
freq : 1, /* use freq, not period */
inherit_stat : 1, /* per task counts */
enable_on_exec : 1, /* next exec enables */
task : 1, /* trace fork/exit */
watermark : 1, /* wakeup_watermark */
precise_ip : 2, /* skid constraint */
mmap_data : 1, /* non-exec mmap data */
sample_id_all : 1, /* sample_type all events */
exclude_host : 1, /* don't count in host */
exclude_guest : 1, /* don't count in guest */
exclude_callchain_kernel : 1,
/* exclude kernel callchains */
exclude_callchain_user : 1,
/* exclude user callchains */
mmap2 : 1, /* include mmap with inode data */
comm_exec : 1, /* flag comm events that are
due to exec */
use_clockid : 1, /* use clockid for time fields */
context_switch : 1, /* context switch data */
write_backward : 1, /* Write ring buffer from end
to beginning */
namespaces : 1, /* include namespaces data */
ksymbol : 1, /* include ksymbol events */
bpf_event : 1, /* include bpf events */
aux_output : 1, /* generate AUX records
instead of events */
cgroup : 1, /* include cgroup events */
text_poke : 1, /* include text poke events */
build_id : 1, /* use build id in mmap2 events */
inherit_thread : 1, /* children only inherit */
/* if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task
on exec */
sigtrap : 1, /* send synchronous SIGTRAP
on event */
__reserved_1 : 26;
union {
__u32 wakeup_events; /* wakeup every n events */
__u32 wakeup_watermark; /* bytes before wakeup */
};
__u32 bp_type; /* breakpoint type */
union {
__u64 bp_addr; /* breakpoint address */
__u64 kprobe_func; /* for perf_kprobe */
__u64 uprobe_path; /* for perf_uprobe */
__u64 config1; /* extension of config */
};
union {
__u64 bp_len; /* breakpoint length */
__u64 kprobe_addr; /* with kprobe_func == NULL */
__u64 probe_offset; /* for perf_[k,u]probe */
__u64 config2; /* extension of config1 */
};
__u64 branch_sample_type; /* enum perf_branch_sample_type */
__u64 sample_regs_user; /* user regs to dump on samples */
__u32 sample_stack_user; /* size of stack to dump on
samples */
__s32 clockid; /* clock to use for time fields */
__u64 sample_regs_intr; /* regs to dump on samples */
__u32 aux_watermark; /* aux bytes before wakeup */
__u16 sample_max_stack; /* max frames in callchain */
__u16 __reserved_2; /* align to u64 */
__u32 aux_sample_size; /* max aux sample size */
__u32 __reserved_3; /* align to u64 */
__u64 sig_data; /* user data for sigtrap */
};
The fields of the perf_event_attr structure are described in more detail below:
type
This field specifies the overall event type. It has one of the following values:
PERF_TYPE_HARDWARE
This indicates one of the “generalized” hardware events provided by the kernel. See the config field definition for more details.
PERF_TYPE_SOFTWARE
This indicates one of the software-defined events provided by the kernel (even if no hardware support is available).
PERF_TYPE_TRACEPOINT
This indicates a tracepoint provided by the kernel tracepoint infrastructure.
PERF_TYPE_HW_CACHE
This indicates a hardware cache event. This has a special encoding, described in the config field definition.
PERF_TYPE_RAW
This indicates a “raw” implementation-specific event in the config field.
PERF_TYPE_BREAKPOINT (since Linux 2.6.33)
This indicates a hardware breakpoint as provided by the CPU. Breakpoints can be read/write accesses to an address as well as execution of an instruction address.
dynamic PMU
Since Linux 2.6.38, perf_event_open() can support multiple PMUs. To enable this, a value exported by the kernel can be used in the type field to indicate which PMU to use. The value to use can be found in the sysfs filesystem: there is a subdirectory per PMU instance under /sys/bus/event_source/devices. In each subdirectory there is a type file whose content is an integer that can be used in the type field. For instance, /sys/bus/event_source/devices/cpu/type contains the value for the core CPU PMU, which is usually 4.
kprobe and uprobe (since Linux 4.17)
These two dynamic PMUs create a kprobe/uprobe and attach it to the file descriptor generated by perf_event_open. The kprobe/uprobe will be destroyed on the destruction of the file descriptor. See fields kprobe_func, uprobe_path, kprobe_addr, and probe_offset for more details.
size
The size of the perf_event_attr structure for forward/backward compatibility. Set this using sizeof(struct perf_event_attr) to allow the kernel to see the struct size at the time of compilation.
The related define PERF_ATTR_SIZE_VER0 is set to 64; this was the size of the first published struct. PERF_ATTR_SIZE_VER1 is 72, corresponding to the addition of breakpoints in Linux 2.6.33. PERF_ATTR_SIZE_VER2 is 80 corresponding to the addition of branch sampling in Linux 3.4. PERF_ATTR_SIZE_VER3 is 96 corresponding to the addition of sample_regs_user and sample_stack_user in Linux 3.7. PERF_ATTR_SIZE_VER4 is 104 corresponding to the addition of sample_regs_intr in Linux 3.19. PERF_ATTR_SIZE_VER5 is 112 corresponding to the addition of aux_watermark in Linux 4.1.
config
This specifies which event you want, in conjunction with the type field. The config1 and config2 fields are also taken into account in cases where 64 bits is not enough to fully specify the event. The encoding of these fields is event dependent.
There are various ways to set the config field that are dependent on the value of the previously described type field. What follows are various possible settings for config separated out by type.
If type is PERF_TYPE_HARDWARE, we are measuring one of the generalized hardware CPU events. Not all of these are available on all platforms. Set config to one of the following:
PERF_COUNT_HW_CPU_CYCLES
Total cycles. Be wary of what happens during CPU frequency scaling.
PERF_COUNT_HW_INSTRUCTIONS
Retired instructions. Be careful, these can be affected by various issues, most notably hardware interrupt counts.
PERF_COUNT_HW_CACHE_REFERENCES
Cache accesses. Usually this indicates Last Level Cache accesses but this may vary depending on your CPU. This may include prefetches and coherency messages; again this depends on the design of your CPU.
PERF_COUNT_HW_CACHE_MISSES
Cache misses. Usually this indicates Last Level Cache misses; this is intended to be used in conjunction with the PERF_COUNT_HW_CACHE_REFERENCES event to calculate cache miss rates.
PERF_COUNT_HW_BRANCH_INSTRUCTIONS
Retired branch instructions. Prior to Linux 2.6.35, this used the wrong event on AMD processors.
PERF_COUNT_HW_BRANCH_MISSES
Mispredicted branch instructions.
PERF_COUNT_HW_BUS_CYCLES
Bus cycles, which can be different from total cycles.
PERF_COUNT_HW_STALLED_CYCLES_FRONTEND (since Linux 3.0)
Stalled cycles during issue.
PERF_COUNT_HW_STALLED_CYCLES_BACKEND (since Linux 3.0)
Stalled cycles during retirement.
PERF_COUNT_HW_REF_CPU_CYCLES (since Linux 3.3)
Total cycles; not affected by CPU frequency scaling.
If type is PERF_TYPE_SOFTWARE, we are measuring software events provided by the kernel. Set config to one of the following:
PERF_COUNT_SW_CPU_CLOCK
This reports the CPU clock, a high-resolution per-CPU timer.
PERF_COUNT_SW_TASK_CLOCK
This reports a clock count specific to the task that is running.
PERF_COUNT_SW_PAGE_FAULTS
This reports the number of page faults.
PERF_COUNT_SW_CONTEXT_SWITCHES
This counts context switches. Until Linux 2.6.34, these were all reported as user-space events; after that they are reported as happening in the kernel.
PERF_COUNT_SW_CPU_MIGRATIONS
This reports the number of times the process has migrated to a new CPU.
PERF_COUNT_SW_PAGE_FAULTS_MIN
This counts the number of minor page faults. These did not require disk I/O to handle.
PERF_COUNT_SW_PAGE_FAULTS_MAJ
This counts the number of major page faults. These required disk I/O to handle.
PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33)
This counts the number of alignment faults. These happen when unaligned memory accesses happen; the kernel can handle these but it reduces performance. This happens only on some architectures (never on x86).
PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33)
This counts the number of emulation faults. The kernel sometimes traps on unimplemented instructions and emulates them for user space. This can negatively impact performance.
PERF_COUNT_SW_DUMMY (since Linux 3.12)
This is a placeholder event that counts nothing. Informational sample record types such as mmap or comm must be associated with an active event. This dummy event allows gathering such records without requiring a counting event.
PERF_COUNT_SW_BPF_OUTPUT (since Linux 4.4)
This is used to generate raw sample data from BPF. BPF programs can write to this event using bpf_perf_event_output helper.
PERF_COUNT_SW_CGROUP_SWITCHES (since Linux 5.13)
This counts context switches to a task in a different cgroup. In other words, if the next task is in the same cgroup, it won’t count the switch.
If type is PERF_TYPE_TRACEPOINT, then we are measuring kernel tracepoints. The value to use in config can be obtained from under debugfs tracing/events/*/*/id if ftrace is enabled in the kernel.
If type is PERF_TYPE_HW_CACHE, then we are measuring a hardware CPU cache event. To calculate the appropriate config value, use the following equation:
config = (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) | (perf_hw_cache_op_result_id << 16);
where perf_hw_cache_id is one of:
PERF_COUNT_HW_CACHE_L1D
for measuring Level 1 Data Cache
PERF_COUNT_HW_CACHE_L1I
for measuring Level 1 Instruction Cache
PERF_COUNT_HW_CACHE_LL
for measuring Last-Level Cache
PERF_COUNT_HW_CACHE_DTLB
for measuring the Data TLB
PERF_COUNT_HW_CACHE_ITLB
for measuring the Instruction TLB
PERF_COUNT_HW_CACHE_BPU
for measuring the branch prediction unit
PERF_COUNT_HW_CACHE_NODE (since Linux 3.1)
for measuring local memory accesses
and perf_hw_cache_op_id is one of:
PERF_COUNT_HW_CACHE_OP_READ
for read accesses
PERF_COUNT_HW_CACHE_OP_WRITE
for write accesses
PERF_COUNT_HW_CACHE_OP_PREFETCH
for prefetch accesses
and perf_hw_cache_op_result_id is one of:
PERF_COUNT_HW_CACHE_RESULT_ACCESS
to measure accesses
PERF_COUNT_HW_CACHE_RESULT_MISS
to measure misses
If type is PERF_TYPE_RAW, then a custom “raw” config value is needed. Most CPUs support events that are not covered by the “generalized” events. These are implementation defined; see your CPU manual (for example the Intel Volume 3B documentation or the AMD BIOS and Kernel Developer Guide). The libpfm4 library can be used to translate from the name in the architectural manuals to the raw hex value perf_event_open() expects in this field.
If type is PERF_TYPE_BREAKPOINT, then leave config set to zero. Its parameters are set in other places.
If type is kprobe or uprobe, set retprobe (bit 0 of config, see /sys/bus/event_source/devices/[k,u]probe/format/retprobe) for kretprobe/uretprobe. See fields kprobe_func, uprobe_path, kprobe_addr, and probe_offset for more details.
kprobe_func
uprobe_path
kprobe_addr
probe_offset
These fields describe the kprobe/uprobe for dynamic PMUs kprobe and uprobe. For kprobe: use kprobe_func and probe_offset, or use kprobe_addr and leave kprobe_func as NULL. For uprobe: use uprobe_path and probe_offset.
sample_period
sample_freq
A “sampling” event is one that generates an overflow notification every N events, where N is given by sample_period. A sampling event has sample_period > 0. When an overflow occurs, requested data is recorded in the mmap buffer. The sample_type field controls what data is recorded on each overflow.
sample_freq can be used if you wish to use frequency rather than period. In this case, you set the freq flag. The kernel will adjust the sampling period to try and achieve the desired rate. The rate of adjustment is a timer tick.
sample_type
The various bits in this field specify which values to include in the sample. They will be recorded in a ring-buffer, which is available to user space using mmap(2). The order in which the values are saved in the sample is documented in the MMAP Layout subsection below; it is not the enum perf_event_sample_format order.
PERF_SAMPLE_IP
Records instruction pointer.
PERF_SAMPLE_TID
Records the process and thread IDs.
PERF_SAMPLE_TIME
Records a timestamp.
PERF_SAMPLE_ADDR
Records an address, if applicable.
PERF_SAMPLE_READ
Record counter values for all events in a group, not just the group leader.
PERF_SAMPLE_CALLCHAIN
Records the callchain (stack backtrace).
PERF_SAMPLE_ID
Records a unique ID for the opened event’s group leader.
PERF_SAMPLE_CPU
Records CPU number.
PERF_SAMPLE_PERIOD
Records the current sampling period.
PERF_SAMPLE_STREAM_ID
Records a unique ID for the opened event. Unlike PERF_SAMPLE_ID the actual ID is returned, not the group leader. This ID is the same as the one returned by PERF_FORMAT_ID.
PERF_SAMPLE_RAW
Records additional data, if applicable. Usually returned by tracepoint events.
PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
This provides a record of recent branches, as provided by CPU branch sampling hardware (such as Intel Last Branch Record). Not all hardware supports this feature.
See the branch_sample_type field for how to filter which branches are reported.
PERF_SAMPLE_REGS_USER (since Linux 3.7)
Records the current user-level CPU register state (the values in the process before the kernel was called).
PERF_SAMPLE_STACK_USER (since Linux 3.7)
Records the user level stack, allowing stack unwinding.
PERF_SAMPLE_WEIGHT (since Linux 3.10)
Records a hardware provided weight value that expresses how costly the sampled event was. This allows the hardware to highlight expensive events in a profile.
PERF_SAMPLE_DATA_SRC (since Linux 3.10)
Records the data source: where in the memory hierarchy the data associated with the sampled instruction came from. This is available only if the underlying hardware supports this feature.
PERF_SAMPLE_IDENTIFIER (since Linux 3.12)
Places the SAMPLE_ID value in a fixed position in the record, either at the beginning (for sample events) or at the end (if a non-sample event).
This was necessary because a sample stream may have records from various different event sources with different sample_type settings. Parsing the event stream properly was not possible because the format of the record was needed to find SAMPLE_ID, but the format could not be found without knowing what event the sample belonged to (causing a circular dependency).
The PERF_SAMPLE_IDENTIFIER setting makes the event stream always parsable by putting SAMPLE_ID in a fixed location, even though it means having duplicate SAMPLE_ID values in records.
PERF_SAMPLE_TRANSACTION (since Linux 3.13)
Records reasons for transactional memory abort events (for example, from Intel TSX transactional memory support).
The precise_ip setting must be greater than 0 and a transactional memory abort event must be measured or no values will be recorded. Also note that some perf_event measurements, such as sampled cycle counting, may cause extraneous aborts (by causing an interrupt during a transaction).
PERF_SAMPLE_REGS_INTR (since Linux 3.19)
Records a subset of the current CPU register state as specified by sample_regs_intr. Unlike PERF_SAMPLE_REGS_USER the register values will return kernel register state if the overflow happened while kernel code is running. If the CPU supports hardware sampling of register state (i.e., PEBS on Intel x86) and precise_ip is set higher than zero then the register values returned are those captured by hardware at the time of the sampled instruction’s retirement.
PERF_SAMPLE_PHYS_ADDR (since Linux 4.13)
Records physical address of data like in PERF_SAMPLE_ADDR.
PERF_SAMPLE_CGROUP (since Linux 5.7)
Records (perf_event) cgroup ID of the process. This corresponds to the id field in the PERF_RECORD_CGROUP event.
PERF_SAMPLE_DATA_PAGE_SIZE (since Linux 5.11)
Records page size of data like in PERF_SAMPLE_ADDR.
PERF_SAMPLE_CODE_PAGE_SIZE (since Linux 5.11)
Records page size of ip like in PERF_SAMPLE_IP.
PERF_SAMPLE_WEIGHT_STRUCT (since Linux 5.12)
Records hardware provided weight values like in PERF_SAMPLE_WEIGHT, but it can represent multiple values in a struct. This shares the same space as PERF_SAMPLE_WEIGHT, so users can apply either of those, not both. It has the following format and the meaning of each field is dependent on the hardware implementation.
union perf_sample_weight {
u64 full; /* PERF_SAMPLE_WEIGHT */
struct { /* PERF_SAMPLE_WEIGHT_STRUCT */
u32 var1_dw;
u16 var2_w;
u16 var3_w;
};
};
read_format
This field specifies the format of the data returned by read(2) on a perf_event_open() file descriptor.
PERF_FORMAT_TOTAL_TIME_ENABLED
Adds the 64-bit time_enabled field. This can be used to calculate estimated totals if the PMU is overcommitted and multiplexing is happening.
PERF_FORMAT_TOTAL_TIME_RUNNING
Adds the 64-bit time_running field. This can be used to calculate estimated totals if the PMU is overcommitted and multiplexing is happening.
PERF_FORMAT_ID
Adds a 64-bit unique value that corresponds to the event group.
PERF_FORMAT_GROUP
Allows all counter values in an event group to be read with one read.
PERF_FORMAT_LOST (since Linux 6.0)
Adds a 64-bit value that is the number of lost samples for this event. This would be only meaningful when sample_period or sample_freq is set.
disabled
The disabled bit specifies whether the counter starts out disabled or enabled. If disabled, the event can later be enabled by ioctl(2), prctl(2), or enable_on_exec.
When creating an event group, typically the group leader is initialized with disabled set to 1 and any child events are initialized with disabled set to 0. Despite disabled being 0, the child events will not start until the group leader is enabled.
inherit
The inherit bit specifies that this counter should count events of child tasks as well as the task specified. This applies only to new children, not to any existing children at the time the counter is created (nor to any new children of existing children).
Inherit does not work for some combinations of read_format values, such as PERF_FORMAT_GROUP.
pinned
The pinned bit specifies that the counter should always be on the CPU if at all possible. It applies only to hardware counters and only to group leaders. If a pinned counter cannot be put onto the CPU (e.g., because there are not enough hardware counters or because of a conflict with some other event), then the counter goes into an ’error’ state, where reads return end-of-file (i.e., read(2) returns 0) until the counter is subsequently enabled or disabled.
exclusive
The exclusive bit specifies that when this counter’s group is on the CPU, it should be the only group using the CPU’s counters. In the future this may allow monitoring programs to support PMU features that need to run alone so that they do not disrupt other hardware counters.
Note that many unexpected situations may prevent events with the exclusive bit set from ever running. This includes any users running a system-wide measurement as well as any kernel use of the performance counters (including the commonly enabled NMI Watchdog Timer interface).
exclude_user
If this bit is set, the count excludes events that happen in user space.
exclude_kernel
If this bit is set, the count excludes events that happen in kernel space.
exclude_hv
If this bit is set, the count excludes events that happen in the hypervisor. This is mainly for PMUs that have built-in support for handling this (such as POWER). Extra support is needed for handling hypervisor measurements on most machines.
exclude_idle
If set, don’t count when the CPU is running the idle task. While you can currently enable this for any event type, it is ignored for all but software events.
mmap
The mmap bit enables generation of PERF_RECORD_MMAP samples for every mmap(2) call that has PROT_EXEC set. This allows tools to notice new executable code being mapped into a program (dynamic shared libraries for example) so that addresses can be mapped back to the original code.
comm
The comm bit enables tracking of process command name as modified by the execve(2) and prctl(PR_SET_NAME) system calls as well as writing to /proc/self/comm. If the comm_exec flag is also successfully set (possible since Linux 3.16), then the misc flag PERF_RECORD_MISC_COMM_EXEC can be used to differentiate the execve(2) case from the others.
freq
If this bit is set, then sample_freq rather than sample_period is used when setting up the sampling interval.
inherit_stat
This bit enables saving of event counts on context switch for inherited tasks. This is meaningful only if the inherit field is set.
enable_on_exec
If this bit is set, a counter is automatically enabled after a call to execve(2).
task
If this bit is set, then fork/exit notifications are included in the ring buffer.
watermark
If set, have an overflow notification happen when we cross the wakeup_watermark boundary. Otherwise, overflow notifications happen after wakeup_events samples.
precise_ip (since Linux 2.6.35)
This controls the amount of skid. Skid is how many instructions execute between an event of interest happening and the kernel being able to stop and record the event. Smaller skid is better and allows more accurate reporting of which events correspond to which instructions, but hardware is often limited with how small this can be.
The possible values of this field are the following:
0
SAMPLE_IP can have arbitrary skid.
1
SAMPLE_IP must have constant skid.
2
SAMPLE_IP requested to have 0 skid.
3
SAMPLE_IP must have 0 skid. See also the description of PERF_RECORD_MISC_EXACT_IP.
mmap_data (since Linux 2.6.36)
This is the counterpart of the mmap field. This enables generation of PERF_RECORD_MMAP samples for mmap(2) calls that do not have PROT_EXEC set (for example data and SysV shared memory).
sample_id_all (since Linux 2.6.38)
If set, then TID, TIME, ID, STREAM_ID, and CPU can additionally be included in non-PERF_RECORD_SAMPLEs if the corresponding sample_type is selected.
If PERF_SAMPLE_IDENTIFIER is specified, then an additional ID value is included as the last value to ease parsing the record stream. This may lead to the id value appearing twice.
The layout is described by this pseudo-structure:
struct sample_id {
{ u32 pid, tid; } /* if PERF_SAMPLE_TID set */
{ u64 time; } /* if PERF_SAMPLE_TIME set */
{ u64 id; } /* if PERF_SAMPLE_ID set */
{ u64 stream_id;} /* if PERF_SAMPLE_STREAM_ID set */
{ u32 cpu, res; } /* if PERF_SAMPLE_CPU set */
{ u64 id; } /* if PERF_SAMPLE_IDENTIFIER set */
};
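Because each member of the sample_id pseudo-structure occupies 8 bytes when its bit is selected, the trailer size follows directly from sample_type. The sketch below computes it and locates the trailer at the end of a record, which is where it is placed when sample_id_all is set. The function names are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Size in bytes of the sample_id trailer for a given sample_type. */
size_t sample_id_size(uint64_t sample_type)
{
    size_t n = 0;

    if (sample_type & PERF_SAMPLE_TID)        n += 8;  /* u32 pid, tid */
    if (sample_type & PERF_SAMPLE_TIME)       n += 8;  /* u64 time */
    if (sample_type & PERF_SAMPLE_ID)         n += 8;  /* u64 id */
    if (sample_type & PERF_SAMPLE_STREAM_ID)  n += 8;  /* u64 stream_id */
    if (sample_type & PERF_SAMPLE_CPU)        n += 8;  /* u32 cpu, res */
    if (sample_type & PERF_SAMPLE_IDENTIFIER) n += 8;  /* u64 id (last) */
    return n;
}

/* Start of the trailer inside a non-sample record. */
const uint8_t *sample_id_start(const struct perf_event_header *hdr,
                               uint64_t sample_type)
{
    size_t trailer = sample_id_size(sample_type);

    assert(hdr != NULL && hdr->size >= sizeof(*hdr) + trailer);
    return (const uint8_t *)hdr + hdr->size - trailer;
}
```

Bits of sample_type that do not appear in the pseudo-structure (PERF_SAMPLE_IP, for example) contribute nothing to the trailer.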
exclude_host (since Linux 3.2)
When conducting measurements that include processes running VM instances (i.e., have executed a KVM_RUN ioctl(2)), only measure events happening inside a guest instance. This is only meaningful outside the guests; this setting does not change counts gathered inside of a guest. Currently, this functionality is x86 only.
exclude_guest (since Linux 3.2)
When conducting measurements that include processes running VM instances (i.e., have executed a KVM_RUN ioctl(2)), do not measure events happening inside guest instances. This is only meaningful outside the guests; this setting does not change counts gathered inside of a guest. Currently, this functionality is x86 only.
exclude_callchain_kernel (since Linux 3.7)
Do not include kernel callchains.
exclude_callchain_user (since Linux 3.7)
Do not include user callchains.
mmap2 (since Linux 3.16)
Generate an extended executable mmap record that contains enough additional information to uniquely identify shared mappings. The mmap flag must also be set for this to work.
comm_exec (since Linux 3.16)
This is purely a feature-detection flag; it does not change kernel behavior. If this flag can successfully be set, then, when comm is enabled, the PERF_RECORD_MISC_COMM_EXEC flag will be set in the misc field of a comm record header if the rename event being reported was caused by a call to execve(2). This allows tools to distinguish between the various types of process renaming.
use_clockid (since Linux 4.1)
This allows selecting which internal Linux clock to use when generating timestamps via the clockid field. This can make it easier to correlate perf sample times with timestamps generated by other tools.
context_switch (since Linux 4.3)
This enables the generation of PERF_RECORD_SWITCH records when a context switch occurs. It also enables the generation of PERF_RECORD_SWITCH_CPU_WIDE records when sampling in CPU-wide mode. This functionality is in addition to existing tracepoint and software events for measuring context switches. The advantage of this method is that it will give full information even with strict perf_event_paranoid settings.
write_backward (since Linux 4.6)
This causes the ring buffer to be written from the end to the beginning. This is to support reading from an overwritable ring buffer.
namespaces (since Linux 4.11)
This enables the generation of PERF_RECORD_NAMESPACES records when a task enters a new namespace. Each namespace has a combination of device and inode numbers.
ksymbol (since Linux 5.0)
This enables the generation of PERF_RECORD_KSYMBOL records when new kernel symbols are registered or unregistered. This is useful for analyzing dynamic kernel functions like eBPF.
bpf_event (since Linux 5.0)
This enables the generation of PERF_RECORD_BPF_EVENT records when an eBPF program is loaded or unloaded.
aux_output (since Linux 5.4)
This allows normal (non-AUX) events to generate data for AUX events if the hardware supports it.
cgroup (since Linux 5.7)
This enables the generation of PERF_RECORD_CGROUP records when a new cgroup is created (and activated).
text_poke (since Linux 5.8)
This enables the generation of PERF_RECORD_TEXT_POKE records when there’s a change to the kernel text (i.e., self-modifying code).
build_id (since Linux 5.12)
This changes the contents in the PERF_RECORD_MMAP2 to have a build-id instead of device and inode numbers.
inherit_thread (since Linux 5.13)
This disables the inheritance of the event to a child process. Only new threads in the same process (which is cloned with CLONE_THREAD) will inherit the event.
remove_on_exec (since Linux 5.13)
This closes the event when it starts a new process image by execve(2).
sigtrap (since Linux 5.13)
This enables synchronous signal delivery of SIGTRAP on event overflow.
wakeup_events
wakeup_watermark
This union sets how many samples (wakeup_events) or bytes (wakeup_watermark) happen before an overflow notification happens. Which one is used is selected by the watermark bit flag.
wakeup_events counts only PERF_RECORD_SAMPLE record types. To receive overflow notification for all PERF_RECORD types choose watermark and set wakeup_watermark to 1.
Prior to Linux 3.0, setting wakeup_events to 0 resulted in no overflow notifications; more recent kernels treat 0 the same as 1.
bp_type (since Linux 2.6.33)
This chooses the breakpoint type. It is one of:
HW_BREAKPOINT_EMPTY
No breakpoint.
HW_BREAKPOINT_R
Count when we read the memory location.
HW_BREAKPOINT_W
Count when we write the memory location.
HW_BREAKPOINT_RW
Count when we read or write the memory location.
HW_BREAKPOINT_X
Count when we execute code at the memory location.
The values can be combined via a bitwise or, but the combination of HW_BREAKPOINT_R or HW_BREAKPOINT_W with HW_BREAKPOINT_X is not allowed.
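The combination rule can be checked before calling perf_event_open(). The following sketch uses the constants from linux/hw_breakpoint.h; the function name bp_type_is_allowed() is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/hw_breakpoint.h>

/* Returns true if bp_type is a combination described above as allowed:
 * read and write may be ORed together, but neither may be ORed with
 * execute. */
bool bp_type_is_allowed(uint32_t bp_type)
{
    /* Only the R, W, and X bits are defined. */
    assert((bp_type & ~(uint32_t)(HW_BREAKPOINT_RW | HW_BREAKPOINT_X)) == 0);

    bool data = (bp_type & HW_BREAKPOINT_RW) != 0;
    bool exec = (bp_type & HW_BREAKPOINT_X) != 0;

    return !(data && exec);
}
```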
bp_addr (since Linux 2.6.33)
This is the address of the breakpoint. For execution breakpoints, this is the memory address of the instruction of interest; for read and write breakpoints, it is the memory address of the memory location of interest.
config1 (since Linux 2.6.39)
config1 is used for setting events that need an extra register or otherwise do not fit in the regular config field. Raw OFFCORE_EVENTS on Nehalem/Westmere/SandyBridge use this field on Linux 3.3 and later kernels.
bp_len (since Linux 2.6.33)
bp_len is the length of the breakpoint being measured if type is PERF_TYPE_BREAKPOINT. Options are HW_BREAKPOINT_LEN_1, HW_BREAKPOINT_LEN_2, HW_BREAKPOINT_LEN_4, and HW_BREAKPOINT_LEN_8. For an execution breakpoint, set this to sizeof(long).
config2 (since Linux 2.6.39)
config2 is a further extension of the config1 field.
branch_sample_type (since Linux 3.4)
If PERF_SAMPLE_BRANCH_STACK is enabled, then this specifies what branches to include in the branch record.
The first part of the value is the privilege level, which is a combination of one or more of the values listed below. If the user does not set privilege level explicitly, the kernel will use the event’s privilege level. Event and branch privilege levels do not have to match.
PERF_SAMPLE_BRANCH_USER
Branch target is in user space.
PERF_SAMPLE_BRANCH_KERNEL
Branch target is in kernel space.
PERF_SAMPLE_BRANCH_HV
Branch target is in hypervisor.
PERF_SAMPLE_BRANCH_PLM_ALL
A convenience value that is the three preceding values ORed together.
In addition to the privilege value, at least one of the following bits must be set.
PERF_SAMPLE_BRANCH_ANY
Any branch type.
PERF_SAMPLE_BRANCH_ANY_CALL
Any call branch (includes direct calls, indirect calls, and far jumps).
PERF_SAMPLE_BRANCH_IND_CALL
Indirect calls.
PERF_SAMPLE_BRANCH_CALL (since Linux 4.4)
Direct calls.
PERF_SAMPLE_BRANCH_ANY_RETURN
Any return branch.
PERF_SAMPLE_BRANCH_IND_JUMP (since Linux 4.2)
Indirect jumps.
PERF_SAMPLE_BRANCH_COND (since Linux 3.16)
Conditional branches.
PERF_SAMPLE_BRANCH_ABORT_TX (since Linux 3.11)
Transactional memory aborts.
PERF_SAMPLE_BRANCH_IN_TX (since Linux 3.11)
Branch in transactional memory transaction.
PERF_SAMPLE_BRANCH_NO_TX (since Linux 3.11)
Branch not in transactional memory transaction.
PERF_SAMPLE_BRANCH_CALL_STACK (since Linux 4.1)
Branch is part of a hardware-generated call stack. This requires hardware support, currently only found on Intel x86 Haswell or newer.
sample_regs_user (since Linux 3.7)
This bit mask defines the set of user CPU registers to dump on samples. The layout of the register mask is architecture-specific and is described in the kernel header file arch/ARCH/include/uapi/asm/perf_regs.h.
sample_stack_user (since Linux 3.7)
This defines the size of the user stack to dump if PERF_SAMPLE_STACK_USER is specified.
clockid (since Linux 4.1)
If use_clockid is set, then this field selects which internal Linux timer to use for timestamps. The available timers are defined in linux/time.h, with CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_REALTIME, CLOCK_BOOTTIME, and CLOCK_TAI currently supported.
aux_watermark (since Linux 4.1)
This specifies how much data is required to trigger a PERF_RECORD_AUX sample.
sample_max_stack (since Linux 4.8)
When sample_type includes PERF_SAMPLE_CALLCHAIN, this field specifies how many stack frames to report when generating the callchain.
aux_sample_size (since Linux 5.5)
When the PERF_SAMPLE_AUX flag is set, specify the desired size of AUX data. Note that the data returned may be smaller than the specified size.
sig_data (since Linux 5.13)
This data will be copied to user’s signal handler (through si_perf in the siginfo_t) to disambiguate which event triggered the signal.
Reading results
Once a perf_event_open() file descriptor has been opened, the values of the events can be read from the file descriptor. The values that are there are specified by the read_format field in the attr structure at open time.
If you attempt to read into a buffer that is not big enough to hold the data, the error ENOSPC results.
Here is the layout of the data returned by a read:
If PERF_FORMAT_GROUP was specified to allow reading all events in a group at once:
struct read_format {
u64 nr; /* The number of events */
u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
struct {
u64 value; /* The value of the event */
u64 id; /* if PERF_FORMAT_ID */
u64 lost; /* if PERF_FORMAT_LOST */
} values[nr];
};
If PERF_FORMAT_GROUP was not specified:
struct read_format {
u64 value; /* The value of the event */
u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
u64 id; /* if PERF_FORMAT_ID */
u64 lost; /* if PERF_FORMAT_LOST */
};
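Since every member of both layouts is a u64, the buffer size needed to avoid ENOSPC can be derived from read_format and, for groups, the number of events. The sketch below is one way to do that; the function name and the locally defined READ_FORMAT_LOST macro (spelled out so that the example builds against headers older than Linux 6.0) are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Same bit as PERF_FORMAT_LOST; defined here for pre-6.0 headers. */
#define READ_FORMAT_LOST (1ULL << 4)

/* Bytes returned by read(2) for the given read_format.
 * nr is the number of events in the group; it is ignored unless
 * PERF_FORMAT_GROUP is set. */
size_t read_format_size(uint64_t read_format, uint64_t nr)
{
    size_t per_event = 1;   /* value */
    size_t fixed = 0;       /* u64 fields outside the per-event part */

    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) fixed++;
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) fixed++;
    if (read_format & PERF_FORMAT_ID)                 per_event++;
    if (read_format & READ_FORMAT_LOST)               per_event++;

    if (read_format & PERF_FORMAT_GROUP) {
        assert(nr >= 1);
        fixed++;            /* the nr field itself */
        return (fixed + per_event * (size_t)nr) * sizeof(uint64_t);
    }
    return (fixed + per_event) * sizeof(uint64_t);
}
```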
The values read are as follows:
nr
The number of events in this file descriptor. Available only if PERF_FORMAT_GROUP was specified.
time_enabled
time_running
Total time the event was enabled and running. Normally these values are the same. Multiplexing happens if the number of events is more than the number of available PMU counter slots. In that case the events run only part of the time, and the time_enabled and time_running values can be used to scale an estimated value for the count.
value
An unsigned 64-bit value containing the counter result.
id
A globally unique value for this particular event; only present if PERF_FORMAT_ID was specified in read_format.
lost
The number of lost samples of this event; only present if PERF_FORMAT_LOST was specified in read_format.
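When multiplexing has occurred, an estimated total can be obtained by scaling value by the ratio of time_enabled to time_running. The sketch below splits the division into quotient and remainder to reduce the risk of overflow in the intermediate product; the function name scale_count() and the choice to return 0 for an event that never ran are assumptions of this example.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the count the event would have reached had it been on the
 * PMU for the whole time it was enabled. */
uint64_t scale_count(uint64_t value, uint64_t time_enabled,
                     uint64_t time_running)
{
    /* An event cannot run for longer than it was enabled. */
    assert(time_running <= time_enabled);

    if (time_running == 0)
        return 0;           /* never scheduled: nothing to scale */

    uint64_t quot = value / time_running;
    uint64_t rem = value % time_running;

    return quot * time_enabled + (rem * time_enabled) / time_running;
}
```

If time_enabled equals time_running, the result is value unchanged.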
MMAP layout
When using perf_event_open() in sampled mode, asynchronous events (like counter overflow or PROT_EXEC mmap tracking) are logged into a ring-buffer. This ring-buffer is created and accessed through mmap(2).
The mmap size should be 1+2^n pages, where the first page is a metadata page (struct perf_event_mmap_page) that contains various bits of information such as where the ring-buffer head is.
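The length to pass to mmap(2) can be computed as shown in this sketch. The function name perf_mmap_len() is illustrative; the page size would typically come from sysconf(_SC_PAGESIZE).

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of a mapping with one metadata page followed by
 * 2^n ring-buffer pages. */
size_t perf_mmap_len(size_t page_size, unsigned int n)
{
    /* Keep the shift within the width of size_t. */
    assert(page_size != 0 && n < 8 * sizeof(size_t) - 1);

    return (1 + ((size_t)1 << n)) * page_size;
}
```

With 4096-byte pages and n set to 3, this yields 9 pages, or 36864 bytes.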
Before Linux 2.6.39, there is a bug that means you must allocate an mmap ring buffer when sampling even if you do not plan to access it.
The structure of the first metadata mmap page is as follows:
struct perf_event_mmap_page {
__u32 version; /* version number of this structure */
__u32 compat_version; /* lowest version this is compat with */
__u32 lock; /* seqlock for synchronization */
__u32 index; /* hardware counter identifier */
__s64 offset; /* add to hardware counter value */
__u64 time_enabled; /* time event active */
__u64 time_running; /* time event on CPU */
union {
__u64 capabilities;
struct {
__u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
cap_bit0_is_deprecated : 1,
cap_user_rdpmc : 1,
cap_user_time : 1,
cap_user_time_zero : 1,
};
};
__u16 pmc_width;
__u16 time_shift;
__u32 time_mult;
__u64 time_offset;
__u64 __reserved[120]; /* Pad to 1 k */
__u64 data_head; /* head in the data section */
__u64 data_tail; /* user-space written tail */
__u64 data_offset; /* where the buffer starts */
__u64 data_size; /* data buffer size */
__u64 aux_head;
__u64 aux_tail;
__u64 aux_offset;
__u64 aux_size;
}
The following list describes the fields in the perf_event_mmap_page structure in more detail:
version
Version number of this structure.
compat_version
The lowest version this is compatible with.
lock
A seqlock for synchronization.
index
A unique hardware counter identifier.
offset
When using rdpmc for reads this offset value must be added to the one returned by rdpmc to get the current total event count.
time_enabled
Time the event was active.
time_running
Time the event was running.
cap_usr_time / cap_usr_rdpmc / cap_bit0 (since Linux 3.4)
There was a bug in the definition of cap_usr_time and cap_usr_rdpmc from Linux 3.4 until Linux 3.11. Both bits were defined to point to the same location, so it was impossible to know if cap_usr_time or cap_usr_rdpmc were actually set.
Starting with Linux 3.12, these are renamed to cap_bit0 and you should use the cap_user_time and cap_user_rdpmc fields instead.
cap_bit0_is_deprecated (since Linux 3.12)
If set, this bit indicates that the kernel supports the properly separated cap_user_time and cap_user_rdpmc bits.
If not-set, it indicates an older kernel where cap_usr_time and cap_usr_rdpmc map to the same bit and thus both features should be used with caution.
cap_user_rdpmc (since Linux 3.12)
If the hardware supports user-space read of performance counters without syscall (this is the “rdpmc” instruction on x86), then the following code can be used to do a read:
u32 seq, time_mult, time_shift, idx, width;
u64 count, enabled, running;
u64 cyc, time_offset;
do {
seq = pc->lock;
barrier();
enabled = pc->time_enabled;
running = pc->time_running;
if (pc->cap_usr_time && enabled != running) {
cyc = rdtsc();
time_offset = pc->time_offset;
time_mult = pc->time_mult;
time_shift = pc->time_shift;
}
idx = pc->index;
count = pc->offset;
if (pc->cap_usr_rdpmc && idx) {
width = pc->pmc_width;
count += rdpmc(idx - 1);
}
barrier();
} while (pc->lock != seq);
cap_user_time (since Linux 3.12)
This bit indicates the hardware has a constant, nonstop timestamp counter (TSC on x86).
cap_user_time_zero (since Linux 3.12)
Indicates the presence of time_zero which allows mapping timestamp values to the hardware clock.
pmc_width
If cap_usr_rdpmc, this field provides the bit-width of the value read using the rdpmc or equivalent instruction. This can be used to sign extend the result like:
pmc <<= 64 - pmc_width;
pmc >>= 64 - pmc_width; // signed shift right
count += pmc;
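In portable C, right-shifting a negative signed value is implementation-defined, so a self-contained version of the sign extension may prefer an XOR-and-subtract form that produces the same result. The following is a sketch under that assumption; the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a raw counter value that is pmc_width bits wide. */
int64_t pmc_sign_extend(uint64_t raw, unsigned int pmc_width)
{
    assert(pmc_width >= 1 && pmc_width <= 64);

    uint64_t sign = (uint64_t)1 << (pmc_width - 1);
    uint64_t mask = sign | (sign - 1);      /* low pmc_width bits */
    uint64_t v = raw & mask;

    /* Flip the sign bit, then subtract it: values with the top bit
     * set become negative, all others are unchanged. */
    return (int64_t)((v ^ sign) - sign);
}
```

The result would then be added to the offset value read from the metadata page, as in the snippet above.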
time_shift
time_mult
time_offset
If cap_usr_time, these fields can be used to compute the time delta since time_enabled (in nanoseconds) using rdtsc or similar.
u64 quot, rem;
u64 delta;
quot = cyc >> time_shift;
rem = cyc & (((u64)1 << time_shift) - 1);
delta = time_offset + quot * time_mult +
((rem * time_mult) >> time_shift);
Where time_offset, time_mult, time_shift, and cyc are read in the seqcount loop described above. This delta can then be added to enabled and possibly running (if idx), improving the scaling:
enabled += delta;
if (idx)
running += delta;
quot = count / running;
rem = count % running;
count = quot * enabled + (rem * enabled) / running;
time_zero (since Linux 3.12)
If cap_usr_time_zero is set, then the hardware clock (the TSC timestamp counter on x86) can be calculated from the time_zero, time_mult, and time_shift values:
time = timestamp - time_zero;
quot = time / time_mult;
rem = time % time_mult;
cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
And vice versa:
quot = cyc >> time_shift;
rem = cyc & (((u64)1 << time_shift) - 1);
timestamp = time_zero + quot * time_mult +
((rem * time_mult) >> time_shift);
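The two conversions can be wrapped as a pair of functions, as in this sketch. The function names are illustrative, and the parameters are assumed to have been read consistently from the metadata page (for example, inside the seqlock loop).

```c
#include <assert.h>
#include <stdint.h>

/* Hardware clock cycles to perf timestamp. */
uint64_t cyc_to_timestamp(uint64_t cyc, uint64_t time_zero,
                          uint32_t time_mult, uint16_t time_shift)
{
    assert(time_shift < 64);

    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);

    return time_zero + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}

/* Perf timestamp to hardware clock cycles. */
uint64_t timestamp_to_cyc(uint64_t timestamp, uint64_t time_zero,
                          uint32_t time_mult, uint16_t time_shift)
{
    assert(time_mult != 0 && time_shift < 64);

    uint64_t time = timestamp - time_zero;
    uint64_t quot = time / time_mult;
    uint64_t rem = time % time_mult;

    return (quot << time_shift) + (rem << time_shift) / time_mult;
}
```

Because both directions truncate, a round trip is not guaranteed to be exact for every input.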
data_head
This points to the head of the data section. The value continuously increases; it does not wrap. The value needs to be manually wrapped by the size of the mmap buffer before accessing the samples.
On SMP-capable platforms, after reading the data_head value, user space should issue an rmb().
data_tail
When the mapping is PROT_WRITE, the data_tail value should be written by user space to reflect the last read data. In this case, the kernel will not overwrite unread data.
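Because data_head and data_tail are free-running, a consumer has to wrap positions itself and must cope with a record that straddles the end of the buffer. The sketch below shows only that wrapping step; the function name ring_copy() is an assumption, and the memory barrier after reading data_head and the final store to data_tail are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes starting at free-running position pos out of a ring
 * of data_size bytes (a power of two) that begins at data. */
void ring_copy(const uint8_t *data, uint64_t data_size, uint64_t pos,
               void *dst, size_t len)
{
    assert(data_size != 0 && (data_size & (data_size - 1)) == 0);
    assert(len <= data_size);

    uint64_t off = pos & (data_size - 1);   /* manual wrap */
    size_t first = len;

    if (first > data_size - off)
        first = (size_t)(data_size - off);

    memcpy(dst, data + off, first);                     /* up to the end */
    memcpy((uint8_t *)dst + first, data, len - first);  /* wrapped part */
}
```

A reader would typically start at data_tail, copy out one record at a time until it reaches data_head, and then write the new position back to data_tail.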
data_offset (since Linux 4.1)
Contains the offset of the location in the mmap buffer where perf sample data begins.
data_size (since Linux 4.1)
Contains the size of the perf sample region within the mmap buffer.
aux_head
aux_tail
aux_offset
aux_size (since Linux 4.1)
The AUX region allows mmap(2)-ing a separate sample buffer for high-bandwidth data streams (separate from the main perf sample buffer). An example of a high-bandwidth stream is instruction tracing support, as is found in newer Intel processors.
To set up an AUX area, first aux_offset needs to be set with an offset greater than data_offset+data_size and aux_size needs to be set to the desired buffer size. The desired offset and size must be page aligned, and the size must be a power of two. These values are then passed to mmap in order to map the AUX buffer. Pages in the AUX buffer are included as part of the RLIMIT_MEMLOCK resource limit (see setrlimit(2)), and also as part of the perf_event_mlock_kb allowance.
By default, the AUX buffer will be truncated if it will not fit in the available space in the ring buffer. If the AUX buffer is mapped as a read only buffer, then it will operate in ring buffer mode where old data will be overwritten by new. In overwrite mode, it might not be possible to infer where the new data began, and it is the consumer’s job to disable measurement while reading to avoid possible data races.
The aux_head and aux_tail ring buffer pointers have the same behavior and ordering rules as the previously described data_head and data_tail.
The following 2^n ring-buffer pages have the layout described below.
If perf_event_attr.sample_id_all is set, then all event types will carry the sample_type-selected fields that identify where and when an event took place (TID, TIME, ID, CPU, STREAM_ID), as described in PERF_RECORD_SAMPLE below. These fields are stashed just after the perf_event_header and the fields already present for the record type, that is, at the end of the payload. This allows a newer perf.data file to be supported by older perf tools, with the new optional fields being ignored.
The mmap values start with a header:
struct perf_event_header {
__u32 type;
__u16 misc;
__u16 size;
};
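Since every record begins with this header, a consumer can step through a run of records using the size field alone. The sketch below counts records of one type in a contiguous buffer. It assumes that size covers the header as well as the payload and that any ring-buffer wraparound has already been undone; the function name count_records() is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Count records of the given type in buf[0..len).
 * Returns -1 if a header carries an implausible size. */
long count_records(const uint8_t *buf, size_t len, uint32_t type)
{
    assert(buf != NULL || len == 0);

    long n = 0;
    size_t pos = 0;

    while (pos + sizeof(struct perf_event_header) <= len) {
        struct perf_event_header hdr;

        memcpy(&hdr, buf + pos, sizeof(hdr));  /* avoid unaligned reads */
        if (hdr.size < sizeof(hdr) || hdr.size > len - pos)
            return -1;
        if (hdr.type == type)
            n++;
        pos += hdr.size;                       /* jump to the next record */
    }
    return n;
}
```

Skipping by size also lets a tool pass over record types it does not understand.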
Below, we describe the perf_event_header fields in more detail. For ease of reading, the fields with shorter descriptions are presented first.
size
This indicates the size of the record.
misc
The misc field contains additional information about the sample.
The CPU mode can be determined from this value by masking with PERF_RECORD_MISC_CPUMODE_MASK and looking for one of the following (note these are not bit masks, only one can be set at a time):
PERF_RECORD_MISC_CPUMODE_UNKNOWN
Unknown CPU mode.
PERF_RECORD_MISC_KERNEL
Sample happened in the kernel.
PERF_RECORD_MISC_USER
Sample happened in user code.
PERF_RECORD_MISC_HYPERVISOR
Sample happened in the hypervisor.
PERF_RECORD_MISC_GUEST_KERNEL (since Linux 2.6.35)
Sample happened in the guest kernel.
PERF_RECORD_MISC_GUEST_USER (since Linux 2.6.35)
Sample happened in guest user code.
Since the following three statuses are generated by different record types, they alias to the same bit:
PERF_RECORD_MISC_MMAP_DATA (since Linux 3.10)
This is set when the mapping is not executable; otherwise the mapping is executable.
PERF_RECORD_MISC_COMM_EXEC (since Linux 3.16)
This is set for a PERF_RECORD_COMM record on kernels more recent than Linux 3.16 if a process name change was caused by an execve(2) system call.
PERF_RECORD_MISC_SWITCH_OUT (since Linux 4.3)
When a PERF_RECORD_SWITCH or PERF_RECORD_SWITCH_CPU_WIDE record is generated, this bit indicates that the context switch is away from the current process (instead of into the current process).
In addition, the following bits can be set:
PERF_RECORD_MISC_EXACT_IP
This indicates that the content of PERF_SAMPLE_IP points to the actual instruction that triggered the event. See also perf_event_attr.precise_ip.
PERF_RECORD_MISC_SWITCH_OUT_PREEMPT (since Linux 4.17)
When a PERF_RECORD_SWITCH or PERF_RECORD_SWITCH_CPU_WIDE record is generated, this indicates the context switch was a preemption.
PERF_RECORD_MISC_MMAP_BUILD_ID (since Linux 5.12)
This indicates that the content of PERF_RECORD_MMAP2 contains build-ID data instead of device major and minor numbers as well as the inode number.
PERF_RECORD_MISC_EXT_RESERVED (since Linux 2.6.35)
This indicates there is extended data available (currently not used).
PERF_RECORD_MISC_PROC_MAP_PARSE_TIMEOUT
This bit is not set by the kernel. It is reserved for the user-space perf utility to indicate that /proc/pid/maps parsing was taking too long and was stopped, and thus the mmap records may be truncated.
type
The type value is one of the below. The values in the corresponding record (that follows the header) depend on the type selected as shown.
PERF_RECORD_MMAP
The MMAP events record the PROT_EXEC mappings so that we can correlate user-space IPs to code. They have the following structure:
struct {
struct perf_event_header header;
u32 pid, tid;
u64 addr;
u64 len;
u64 pgoff;
char filename[];
};
pid
is the process ID.
tid
is the thread ID.
addr
is the address of the allocated memory.
len
is the length of the allocated memory.
pgoff
is the page offset of the allocated memory.
filename
is a string describing the backing of the allocated memory.
PERF_RECORD_LOST
This record indicates when events are lost.
struct {
struct perf_event_header header;
u64 id;
u64 lost;
struct sample_id sample_id;
};
id
is the unique event ID for the samples that were lost.
lost
is the number of events that were lost.
PERF_RECORD_COMM
This record indicates a change in the process name.
struct {
struct perf_event_header header;
u32 pid;
u32 tid;
char comm[];
struct sample_id sample_id;
};
pid
is the process ID.
tid
is the thread ID.
comm
is a string containing the new name of the process.
PERF_RECORD_EXIT
This record indicates a process exit event.
struct {
struct perf_event_header header;
u32 pid, ppid;
u32 tid, ptid;
u64 time;
struct sample_id sample_id;
};
PERF_RECORD_THROTTLE
PERF_RECORD_UNTHROTTLE
This record indicates a throttle/unthrottle event.
struct {
struct perf_event_header header;
u64 time;
u64 id;
u64 stream_id;
struct sample_id sample_id;
};
PERF_RECORD_FORK
This record indicates a fork event.
struct {
struct perf_event_header header;
u32 pid, ppid;
u32 tid, ptid;
u64 time;
struct sample_id sample_id;
};
PERF_RECORD_READ
This record indicates a read event.
struct {
struct perf_event_header header;
u32 pid, tid;
struct read_format values;
struct sample_id sample_id;
};
PERF_RECORD_SAMPLE
This record indicates a sample.
struct {
struct perf_event_header header;
u64 sample_id; /* if PERF_SAMPLE_IDENTIFIER */
u64 ip; /* if PERF_SAMPLE_IP */
u32 pid, tid; /* if PERF_SAMPLE_TID */
u64 time; /* if PERF_SAMPLE_TIME */
u64 addr; /* if PERF_SAMPLE_ADDR */
u64 id; /* if PERF_SAMPLE_ID */
u64 stream_id; /* if PERF_SAMPLE_STREAM_ID */
u32 cpu, res; /* if PERF_SAMPLE_CPU */
u64 period; /* if PERF_SAMPLE_PERIOD */
struct read_format v;
/* if PERF_SAMPLE_READ */
u64 nr; /* if PERF_SAMPLE_CALLCHAIN */
u64 ips[nr]; /* if PERF_SAMPLE_CALLCHAIN */
u32 size; /* if PERF_SAMPLE_RAW */
char data[size]; /* if PERF_SAMPLE_RAW */
u64 bnr; /* if PERF_SAMPLE_BRANCH_STACK */
struct perf_branch_entry lbr[bnr];
/* if PERF_SAMPLE_BRANCH_STACK */
u64 abi; /* if PERF_SAMPLE_REGS_USER */
u64 regs[weight(mask)];
/* if PERF_SAMPLE_REGS_USER */
u64 size; /* if PERF_SAMPLE_STACK_USER */
char data[size]; /* if PERF_SAMPLE_STACK_USER */
u64 dyn_size; /* if PERF_SAMPLE_STACK_USER &&
size != 0 */
union perf_sample_weight weight;
/* if PERF_SAMPLE_WEIGHT */
/* || PERF_SAMPLE_WEIGHT_STRUCT */
u64 data_src; /* if PERF_SAMPLE_DATA_SRC */
u64 transaction; /* if PERF_SAMPLE_TRANSACTION */
u64 abi; /* if PERF_SAMPLE_REGS_INTR */
u64 regs[weight(mask)];
/* if PERF_SAMPLE_REGS_INTR */
u64 phys_addr; /* if PERF_SAMPLE_PHYS_ADDR */
u64 cgroup; /* if PERF_SAMPLE_CGROUP */
u64 data_page_size;
/* if PERF_SAMPLE_DATA_PAGE_SIZE */
u64 code_page_size;
/* if PERF_SAMPLE_CODE_PAGE_SIZE */
u64 size; /* if PERF_SAMPLE_AUX */
char data[size]; /* if PERF_SAMPLE_AUX */
};
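Since each field is present only if its sample_type bit is set, a parser has to walk the payload in the order shown, advancing past each field that was selected. The sketch below handles only the first five fields to illustrate the pattern. The struct sample_head container and the function name are assumptions of this example, and the caller is assumed to have checked that the payload is long enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

struct sample_head {            /* hypothetical container */
    uint64_t ip, time, addr;
    uint32_t pid, tid;
};

/* Decode the leading fields of a PERF_RECORD_SAMPLE payload (the bytes
 * that follow the header). Returns the number of bytes consumed. */
size_t parse_sample_head(const uint8_t *p, uint64_t sample_type,
                         struct sample_head *out)
{
    assert(p != NULL && out != NULL);

    const uint8_t *start = p;

    memset(out, 0, sizeof(*out));
    if (sample_type & PERF_SAMPLE_IDENTIFIER)
        p += 8;                                 /* sample_id, skipped */
    if (sample_type & PERF_SAMPLE_IP) {
        memcpy(&out->ip, p, 8);
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_TID) {
        memcpy(&out->pid, p, 4);
        memcpy(&out->tid, p + 4, 4);
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_TIME) {
        memcpy(&out->time, p, 8);
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_ADDR) {
        memcpy(&out->addr, p, 8);
        p += 8;
    }
    return (size_t)(p - start);
}
```

The remaining fields follow the same pattern, except that the variable-length ones (ips, data, lbr, regs) must first read their count or size.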
sample_id
If PERF_SAMPLE_IDENTIFIER is enabled, a 64-bit unique ID is included. This is a duplication of the PERF_SAMPLE_ID id value, but included at the beginning of the sample so parsers can easily obtain the value.
ip
If PERF_SAMPLE_IP is enabled, then a 64-bit instruction pointer value is included.
pid
tid
If PERF_SAMPLE_TID is enabled, then a 32-bit process ID and 32-bit thread ID are included.
time
If PERF_SAMPLE_TIME is enabled, then a 64-bit timestamp is included. This is obtained via local_clock() which is a hardware timestamp if available and the jiffies value if not.
addr
If PERF_SAMPLE_ADDR is enabled, then a 64-bit address is included. This is usually the address of a tracepoint, breakpoint, or software event; otherwise the value is 0.
id
If PERF_SAMPLE_ID is enabled, a 64-bit unique ID is included. If the event is a member of an event group, the group leader ID is returned. This ID is the same as the one returned by PERF_FORMAT_ID.
stream_id
If PERF_SAMPLE_STREAM_ID is enabled, a 64-bit unique ID is included. Unlike PERF_SAMPLE_ID the actual ID is returned, not the group leader. This ID is the same as the one returned by PERF_FORMAT_ID.
cpu
res
If PERF_SAMPLE_CPU is enabled, this is a 32-bit value indicating which CPU was being used, in addition to a reserved (unused) 32-bit value.
period
If PERF_SAMPLE_PERIOD is enabled, a 64-bit value indicating the current sampling period is written.
v
If PERF_SAMPLE_READ is enabled, a structure of type read_format is included which has values for all events in the event group. The values included depend on the read_format value used at perf_event_open() time.
nr
ips[nr]
If PERF_SAMPLE_CALLCHAIN is enabled, then a 64-bit number is included which indicates how many 64-bit instruction pointers follow. This is the current callchain.
size
data[size]
If PERF_SAMPLE_RAW is enabled, then a 32-bit value indicating size is included followed by an array of 8-bit values of length size. The values are padded with 0 to have 64-bit alignment.
This RAW record data is opaque with respect to the ABI. The ABI doesn’t make any promises with respect to the stability of its content; it may vary depending on event, hardware, and kernel version.
bnr
lbr[bnr]
If PERF_SAMPLE_BRANCH_STACK is enabled, then a 64-bit value indicating the number of records is included, followed by bnr perf_branch_entry structures which each include the fields:
from
This indicates the source instruction (may not be a branch).
to
The branch target.
mispred
The branch target was mispredicted.
predicted
The branch target was predicted.
in_tx (since Linux 3.11)
The branch was in a transactional memory transaction.
abort (since Linux 3.11)
The branch was in an aborted transactional memory transaction.
cycles (since Linux 4.3)
This reports the number of cycles elapsed since the previous branch stack update.
The entries are from most to least recent, so the first entry has the most recent branch.
Support for mispred, predicted, and cycles is optional; if not supported, those values will be 0.
The type of branches recorded is specified by the branch_sample_type field.
abi
regs[weight(mask)]
If PERF_SAMPLE_REGS_USER is enabled, then the user CPU registers are recorded.
The abi field is one of PERF_SAMPLE_REGS_ABI_NONE, PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
The regs field is an array of the CPU registers that were specified by the sample_regs_user attr field. The number of values is the number of bits set in the sample_regs_user bit mask.
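The weight(mask) notation used in the record layout is the population count of the register mask. A small sketch of that computation follows; the function name regs_count() is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Number of registers selected by a sample_regs_user bit mask. */
unsigned int regs_count(uint64_t mask)
{
    unsigned int n = 0;

    while (mask != 0) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    assert(n <= 64);
    return n;
}
```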
size
data[size]
dyn_size
If PERF_SAMPLE_STACK_USER is enabled, then the user stack is recorded. This can be used to generate stack backtraces. size is the size requested by the user in sample_stack_user or else the maximum record size. data is the stack data (a raw dump of the memory pointed to by the stack pointer at the time of sampling). dyn_size is the amount of data actually dumped (can be less than size). Note that dyn_size is omitted if size is 0.
weight
If PERF_SAMPLE_WEIGHT or PERF_SAMPLE_WEIGHT_STRUCT is enabled, then a 64-bit value provided by the hardware is recorded that indicates how costly the event was. This allows expensive events to stand out more clearly in profiles.
data_src
If PERF_SAMPLE_DATA_SRC is enabled, then a 64-bit value is recorded that is made up of the following fields:
mem_op
Type of opcode, a bitwise combination of:
PERF_MEM_OP_NA
Not available
PERF_MEM_OP_LOAD
Load instruction
PERF_MEM_OP_STORE
Store instruction
PERF_MEM_OP_PFETCH
Prefetch
PERF_MEM_OP_EXEC
Executable code
mem_lvl
Memory hierarchy level hit or miss, a bitwise combination of the following, shifted left by PERF_MEM_LVL_SHIFT:
PERF_MEM_LVL_NA
Not available
PERF_MEM_LVL_HIT
Hit
PERF_MEM_LVL_MISS
Miss
PERF_MEM_LVL_L1
Level 1 cache
PERF_MEM_LVL_LFB
Line fill buffer
PERF_MEM_LVL_L2
Level 2 cache
PERF_MEM_LVL_L3
Level 3 cache
PERF_MEM_LVL_LOC_RAM
Local DRAM
PERF_MEM_LVL_REM_RAM1
Remote DRAM 1 hop
PERF_MEM_LVL_REM_RAM2
Remote DRAM 2 hops
PERF_MEM_LVL_REM_CCE1
Remote cache 1 hop
PERF_MEM_LVL_REM_CCE2
Remote cache 2 hops
PERF_MEM_LVL_IO
I/O memory
PERF_MEM_LVL_UNC
Uncached memory
mem_snoop
Snoop mode, a bitwise combination of the following, shifted left by PERF_MEM_SNOOP_SHIFT:
PERF_MEM_SNOOP_NA
Not available
PERF_MEM_SNOOP_NONE
No snoop
PERF_MEM_SNOOP_HIT
Snoop hit
PERF_MEM_SNOOP_MISS
Snoop miss
PERF_MEM_SNOOP_HITM
Snoop hit modified
mem_lock
Lock instruction, a bitwise combination of the following, shifted left by PERF_MEM_LOCK_SHIFT:
PERF_MEM_LOCK_NA
Not available
PERF_MEM_LOCK_LOCKED
Locked transaction
mem_dtlb
TLB access hit or miss, a bitwise combination of the following, shifted left by PERF_MEM_TLB_SHIFT:
PERF_MEM_TLB_NA
Not available
PERF_MEM_TLB_HIT
Hit
PERF_MEM_TLB_MISS
Miss
PERF_MEM_TLB_L1
Level 1 TLB
PERF_MEM_TLB_L2
Level 2 TLB
PERF_MEM_TLB_WK
Hardware walker
PERF_MEM_TLB_OS
OS fault handler
transaction
If the PERF_SAMPLE_TRANSACTION flag is set, then a 64-bit field is recorded describing the sources of any transactional memory aborts.
The field is a bitwise combination of the following values:
PERF_TXN_ELISION
Abort from an elision type transaction (Intel-CPU-specific).
PERF_TXN_TRANSACTION
Abort from a generic transaction.
PERF_TXN_SYNC
Synchronous abort (related to the reported instruction).
PERF_TXN_ASYNC
Asynchronous abort (not related to the reported instruction).
PERF_TXN_RETRY
Retryable abort (retrying the transaction may have succeeded).
PERF_TXN_CONFLICT
Abort due to memory conflicts with other threads.
PERF_TXN_CAPACITY_WRITE
Abort due to write capacity overflow.
PERF_TXN_CAPACITY_READ
Abort due to read capacity overflow.
In addition, a user-specified abort code can be obtained from the high 32 bits of the field by shifting right by PERF_TXN_ABORT_SHIFT and masking with the value PERF_TXN_ABORT_MASK.
abi
regs[weight(mask)]
If PERF_SAMPLE_REGS_INTR is enabled, then the CPU registers at the time of the sample are recorded (these are not necessarily user-space registers).
The abi field is one of PERF_SAMPLE_REGS_ABI_NONE, PERF_SAMPLE_REGS_ABI_32, or PERF_SAMPLE_REGS_ABI_64.
The regs field is an array of the CPU registers that were specified by the sample_regs_intr attr field. The number of values is the number of bits set in the sample_regs_intr bit mask.
phys_addr
If the PERF_SAMPLE_PHYS_ADDR flag is set, then the 64-bit physical address is recorded.
cgroup
If the PERF_SAMPLE_CGROUP flag is set, then the 64-bit cgroup ID (for the perf_event subsystem) is recorded. To get the pathname of the cgroup, match the ID against the id field of a PERF_RECORD_CGROUP record.
data_page_size
If the PERF_SAMPLE_DATA_PAGE_SIZE flag is set, then the 64-bit page size value of the data address is recorded.
code_page_size
If the PERF_SAMPLE_CODE_PAGE_SIZE flag is set, then the 64-bit page size value of the ip address is recorded.
size
data[size]
If PERF_SAMPLE_AUX is enabled, a snapshot of the aux buffer is recorded.
PERF_RECORD_MMAP2
This record includes extended information on mmap(2) calls returning executable mappings. The format is similar to that of the PERF_RECORD_MMAP record, but includes extra values that allow uniquely identifying shared mappings. Depending on the PERF_RECORD_MISC_MMAP_BUILD_ID bit in the header, the extra values have different layout and meanings.
struct {
struct perf_event_header header;
u32 pid;
u32 tid;
u64 addr;
u64 len;
u64 pgoff;
union {
struct {
u32 maj;
u32 min;
u64 ino;
u64 ino_generation;
};
struct { /* if PERF_RECORD_MISC_MMAP_BUILD_ID */
u8 build_id_size;
u8 __reserved_1;
u16 __reserved_2;
u8 build_id[20];
};
};
u32 prot;
u32 flags;
char filename[];
struct sample_id sample_id;
};
pid
is the process ID.
tid
is the thread ID.
addr
is the address of the allocated memory.
len
is the length of the allocated memory.
pgoff
is the page offset of the allocated memory.
maj
is the major ID of the underlying device.
min
is the minor ID of the underlying device.
ino
is the inode number.
ino_generation
is the inode generation.
build_id_size
is the actual size of build_id field (up to 20).
build_id
is raw data that identifies a binary.
prot
is the protection information.
flags
is the flags information.
filename
is a string describing the backing of the allocated memory.
PERF_RECORD_AUX (since Linux 4.1)
This record reports that new data is available in the separate AUX buffer region.
struct {
struct perf_event_header header;
u64 aux_offset;
u64 aux_size;
u64 flags;
struct sample_id sample_id;
};
aux_offset
offset in the AUX mmap region where the new data begins.
aux_size
size of the data made available.
flags
describes the AUX update.
PERF_AUX_FLAG_TRUNCATED
if set, then the data returned was truncated to fit the available buffer size.
PERF_AUX_FLAG_OVERWRITE
if set, then the data returned has overwritten previous data.
PERF_RECORD_ITRACE_START (since Linux 4.1)
This record indicates which process has initiated an instruction trace event, allowing tools to properly correlate the instruction addresses in the AUX buffer with the proper executable.
struct {
struct perf_event_header header;
u32 pid;
u32 tid;
};
pid
process ID of the thread starting an instruction trace.
tid
thread ID of the thread starting an instruction trace.
PERF_RECORD_LOST_SAMPLES (since Linux 4.2)
When using hardware sampling (such as Intel PEBS) this record indicates some number of samples that may have been lost.
struct {
struct perf_event_header header;
u64 lost;
struct sample_id sample_id;
};
lost
the number of potentially lost samples.
PERF_RECORD_SWITCH (since Linux 4.3)
This record indicates a context switch has happened. The PERF_RECORD_MISC_SWITCH_OUT bit in the misc field indicates whether it was a context switch into or away from the current process.
struct {
struct perf_event_header header;
struct sample_id sample_id;
};
PERF_RECORD_SWITCH_CPU_WIDE (since Linux 4.3)
As with PERF_RECORD_SWITCH this record indicates a context switch has happened, but it only occurs when sampling in CPU-wide mode and provides additional information on the process being switched to/from. The PERF_RECORD_MISC_SWITCH_OUT bit in the misc field indicates whether it was a context switch into or away from the current process.
struct {
struct perf_event_header header;
u32 next_prev_pid;
u32 next_prev_tid;
struct sample_id sample_id;
};
next_prev_pid
The process ID of the previous (if switching in) or next (if switching out) process on the CPU.
next_prev_tid
The thread ID of the previous (if switching in) or next (if switching out) thread on the CPU.
PERF_RECORD_NAMESPACES (since Linux 4.11)
This record includes various namespace information of a process.
struct {
struct perf_event_header header;
u32 pid;
u32 tid;
u64 nr_namespaces;
struct { u64 dev, inode } [nr_namespaces];
struct sample_id sample_id;
};
pid
is the process ID
tid
is the thread ID
nr_namespaces
is the number of namespaces in this record
Each namespace has dev and inode fields and is recorded in the fixed position like below:
NET_NS_INDEX=0
Network namespace
UTS_NS_INDEX=1
UTS namespace
IPC_NS_INDEX=2
IPC namespace
PID_NS_INDEX=3
PID namespace
USER_NS_INDEX=4
User namespace
MNT_NS_INDEX=5
Mount namespace
CGROUP_NS_INDEX=6
Cgroup namespace
PERF_RECORD_KSYMBOL (since Linux 5.0)
This record indicates kernel symbol register/unregister events.
struct {
struct perf_event_header header;
u64 addr;
u32 len;
u16 ksym_type;
u16 flags;
char name[];
struct sample_id sample_id;
};
addr
is the address of the kernel symbol.
len
is the length of the kernel symbol.
ksym_type
is the type of the kernel symbol. Currently the following types are available:
PERF_RECORD_KSYMBOL_TYPE_BPF
The kernel symbol is a BPF function.
flags
If the PERF_RECORD_KSYMBOL_FLAGS_UNREGISTER is set, then this event is for unregistering the kernel symbol.
PERF_RECORD_BPF_EVENT (since Linux 5.0)
This record indicates that a BPF program has been loaded or unloaded.
struct {
struct perf_event_header header;
u16 type;
u16 flags;
u32 id;
u8 tag[BPF_TAG_SIZE];
struct sample_id sample_id;
};
type
is one of the following values:
PERF_BPF_EVENT_PROG_LOAD
A BPF program is loaded
PERF_BPF_EVENT_PROG_UNLOAD
A BPF program is unloaded
id
is the ID of the BPF program.
tag
is the tag of the BPF program. Currently, BPF_TAG_SIZE is defined as 8.
PERF_RECORD_CGROUP (since Linux 5.7)
This record indicates a new cgroup is created and activated.
struct {
struct perf_event_header header;
u64 id;
char path[];
struct sample_id sample_id;
};
id
is the cgroup identifier. This can also be retrieved by name_to_handle_at(2) on the cgroup path (as a file handle).
path
is the path of the cgroup from the root.
PERF_RECORD_TEXT_POKE (since Linux 5.8)
This record indicates a change in the kernel text. This includes the addition and removal of text, in which case the corresponding length is zero.
struct {
struct perf_event_header header;
u64 addr;
u16 old_len;
u16 new_len;
u8 bytes[];
struct sample_id sample_id;
};
addr
is the address of the change
old_len
is the old length
new_len
is the new length
bytes
contains old bytes immediately followed by new bytes.
Overflow handling
Events can be set to notify when a threshold is crossed, indicating an overflow. Overflow conditions can be captured by monitoring the event file descriptor with poll(2), select(2), or epoll(7). Alternatively, the overflow events can be captured via a signal handler, by enabling I/O signaling on the file descriptor; see the discussion of the F_SETOWN and F_SETSIG operations in fcntl(2).
Overflows are generated only by sampling events (sample_period must have a nonzero value).
There are two ways to generate overflow notifications.
The first is to set a wakeup_events or wakeup_watermark value that will trigger if a certain number of samples or bytes have been written to the mmap ring buffer. In this case, POLL_IN is indicated.
The other way is by use of the PERF_EVENT_IOC_REFRESH ioctl. This ioctl adds to a counter that decrements each time the event overflows. When nonzero, POLL_IN is indicated, but once the counter reaches 0 POLL_HUP is indicated and the underlying event is disabled.
Refreshing an event group leader refreshes all siblings and refreshing with a parameter of 0 currently enables infinite refreshes; these behaviors are unsupported and should not be relied on.
Starting with Linux 3.18, POLL_HUP is indicated if the event being monitored is attached to a different process and that process exits.
rdpmc instruction
Starting with Linux 3.4 on x86, you can use the rdpmc instruction to get low-latency reads without having to enter the kernel. Note that using rdpmc is not necessarily faster than other methods for reading event values.
Support for this can be detected with the cap_usr_rdpmc field in the mmap page; documentation on how to calculate event values can be found in that section.
Originally, when rdpmc support was enabled, any process (not just ones with an active perf event) could use the rdpmc instruction to access the counters. Starting with Linux 4.0, rdpmc support is only allowed if an event is currently enabled in a process’s context. To restore the old behavior, write the value 2 to /sys/devices/cpu/rdpmc.
perf_event ioctl calls
Various ioctls act on perf_event_open() file descriptors:
PERF_EVENT_IOC_ENABLE
This enables the individual event or event group specified by the file descriptor argument.
If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument, then all events in a group are enabled, even if the event specified is not the group leader (but see BUGS).
PERF_EVENT_IOC_DISABLE
This disables the individual counter or event group specified by the file descriptor argument.
Enabling or disabling the leader of a group enables or disables the entire group; that is, while the group leader is disabled, none of the counters in the group will count. Enabling or disabling a member of a group other than the leader affects only that counter; disabling a non-leader stops that counter from counting but doesn’t affect any other counter.
If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument, then all events in a group are disabled, even if the event specified is not the group leader (but see BUGS).
PERF_EVENT_IOC_REFRESH
Non-inherited overflow counters can use this to enable a counter for a number of overflows specified by the argument, after which it is disabled. Subsequent calls of this ioctl add the argument value to the current count. An overflow notification with POLL_IN set will happen on each overflow until the count reaches 0; when that happens a notification with POLL_HUP set is sent and the event is disabled. Using an argument of 0 is considered undefined behavior.
PERF_EVENT_IOC_RESET
Reset the event count specified by the file descriptor argument to zero. This resets only the counts; there is no way to reset the multiplexing time_enabled or time_running values.
If the PERF_IOC_FLAG_GROUP bit is set in the ioctl argument, then all events in a group are reset, even if the event specified is not the group leader (but see BUGS).
PERF_EVENT_IOC_PERIOD
This updates the overflow period for the event.
Since Linux 3.7 (on ARM) and Linux 3.14 (all other architectures), the new period takes effect immediately. On older kernels, the new period did not take effect until after the next overflow.
The argument is a pointer to a 64-bit value containing the desired new period.
Prior to Linux 2.6.36, this ioctl always failed due to a bug in the kernel.
PERF_EVENT_IOC_SET_OUTPUT
This tells the kernel to report event notifications to the specified file descriptor rather than the default one. The file descriptors must all be on the same CPU.
The argument specifies the desired file descriptor, or -1 if output should be ignored.
PERF_EVENT_IOC_SET_FILTER (since Linux 2.6.33)
This adds an ftrace filter to this event.
The argument is a pointer to the desired ftrace filter.
PERF_EVENT_IOC_ID (since Linux 3.12)
This returns the event ID value for the given event file descriptor.
The argument is a pointer to a 64-bit unsigned integer to hold the result.
PERF_EVENT_IOC_SET_BPF (since Linux 4.1)
This allows attaching a Berkeley Packet Filter (BPF) program to an existing kprobe tracepoint event. You need CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN privileges to use this ioctl.
The argument is a BPF program file descriptor that was created by a previous bpf(2) system call.
PERF_EVENT_IOC_PAUSE_OUTPUT (since Linux 4.7)
This allows pausing and resuming the event’s ring-buffer. A paused ring-buffer does not prevent generation of samples, but simply discards them. The discarded samples are considered lost, and cause a PERF_RECORD_LOST sample to be generated when possible. An overflow signal may still be triggered by the discarded sample even though the ring-buffer remains empty.
The argument is an unsigned 32-bit integer. A nonzero value pauses the ring-buffer, while a zero value resumes the ring-buffer.
PERF_EVENT_IOC_MODIFY_ATTRIBUTES (since Linux 4.17)
This allows modifying an existing event without the overhead of closing and reopening a new event. Currently this is supported only for breakpoint events.
The argument is a pointer to a perf_event_attr structure containing the updated event settings.
PERF_EVENT_IOC_QUERY_BPF (since Linux 4.16)
This allows querying which Berkeley Packet Filter (BPF) programs are attached to an existing kprobe tracepoint. You can only attach one BPF program per event, but you can have multiple events attached to a tracepoint. Querying this value on one tracepoint event returns the ID of all BPF programs in all events attached to the tracepoint. You need CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN privileges to use this ioctl.
The argument is a pointer to a structure
struct perf_event_query_bpf {
__u32 ids_len;
__u32 prog_cnt;
__u32 ids[0];
};
The ids_len field indicates the number of ids that can fit in the provided ids array. The prog_cnt value is filled in by the kernel with the number of attached BPF programs. The ids array is filled with the ID of each attached BPF program. If there are more programs than will fit in the array, then the kernel will return ENOSPC and ids_len will indicate the number of program IDs that were successfully copied.
Using prctl(2)
A process can enable or disable all currently open event groups using the prctl(2) PR_TASK_PERF_EVENTS_ENABLE and PR_TASK_PERF_EVENTS_DISABLE operations. This applies only to events created locally by the calling process. This does not apply to events created by other processes attached to the calling process or inherited events from a parent process. Only group leaders are enabled and disabled, not any other members of the groups.
perf_event related configuration files
Files in /proc/sys/kernel/
/proc/sys/kernel/perf_event_paranoid
The perf_event_paranoid file can be set to restrict access to the performance counters.
2
allow only user-space measurements (default since Linux 4.6).
1
allow both kernel and user measurements (default before Linux 4.6).
0
allow access to CPU-specific data but not raw tracepoint samples.
-1
no restrictions.
The existence of the perf_event_paranoid file is the official method for determining if a kernel supports perf_event_open().
/proc/sys/kernel/perf_event_max_sample_rate
This sets the maximum sample rate. Setting this too high can allow users to sample at a rate that impacts overall machine performance and potentially lock up the machine. The default value is 100000 (samples per second).
/proc/sys/kernel/perf_event_max_stack
This file sets the maximum depth of stack frame entries reported when generating a call trace.
/proc/sys/kernel/perf_event_mlock_kb
Maximum number of pages an unprivileged user can mlock(2). The default is 516 (kB).
Files in /sys/bus/event_source/devices/
Since Linux 2.6.34, the kernel supports having multiple PMUs available for monitoring. Information on how to program these PMUs can be found under /sys/bus/event_source/devices/. Each subdirectory corresponds to a different PMU.
/sys/bus/event_source/devices/*/type (since Linux 2.6.38)
This contains an integer that can be used in the type field of perf_event_attr to indicate that you wish to use this PMU.
/sys/bus/event_source/devices/cpu/rdpmc (since Linux 3.4)
If this file is 1, then direct user-space access to the performance counter registers is allowed via the rdpmc instruction. This can be disabled by echoing 0 to the file.
As of Linux 4.0 the behavior has changed, so that 1 now means only allow access to processes with active perf events, with 2 indicating the old allow-anyone-access behavior.
/sys/bus/event_source/devices/*/format/ (since Linux 3.4)
This subdirectory contains information on the architecture-specific subfields available for programming the various config fields in the perf_event_attr struct.
The content of each file is the name of the config field, followed by a colon, followed by a series of integer bit ranges separated by commas. For example, the file event may contain the value config1:1,6-10,44 which indicates that event is an attribute that occupies bits 1, 6-10, and 44 of perf_event_attr::config1.
/sys/bus/event_source/devices/*/events/ (since Linux 3.4)
This subdirectory contains files with predefined events. The contents are strings describing the event settings expressed in terms of the fields found in the previously mentioned ./format/ directory. These are not necessarily complete lists of all events supported by a PMU, but usually a subset of events deemed useful or interesting.
The content of each file is a list of attribute names separated by commas. Each entry has an optional value (either hex or decimal). If no value is specified, then it is assumed to be a single-bit field with a value of 1. An example entry may look like this: event=0x2,inv,ldlat=3.
/sys/bus/event_source/devices/*/uevent
This file is the standard kernel device interface for injecting hotplug events.
/sys/bus/event_source/devices/*/cpumask (since Linux 3.7)
The cpumask file contains a comma-separated list of integers that indicate a representative CPU number for each socket (package) on the motherboard. This is needed when setting up uncore or northbridge events, as those PMUs present socket-wide events.
RETURN VALUE
On success, perf_event_open() returns the new file descriptor. On error, -1 is returned and errno is set to indicate the error.
ERRORS
The errors returned by perf_event_open() can be inconsistent, and may vary across processor architectures and performance monitoring units.
E2BIG
Returned if the perf_event_attr size value is too small (smaller than PERF_ATTR_SIZE_VER0), too big (larger than the page size), or larger than the kernel supports and the extra bytes are not zero. When E2BIG is returned, the perf_event_attr size field is overwritten by the kernel to be the size of the structure it was expecting.
EACCES
Returned when the requested event requires CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN permissions (or a more permissive perf_event paranoid setting). Some common cases where an unprivileged process may encounter this error: attaching to a process owned by a different user; monitoring all processes on a given CPU (i.e., specifying the pid argument as -1); and not setting exclude_kernel when the paranoid setting requires it.
EBADF
Returned if the group_fd file descriptor is not valid, or, if PERF_FLAG_PID_CGROUP is set, the cgroup file descriptor in pid is not valid.
EBUSY (since Linux 4.1)
Returned if another event already has exclusive access to the PMU.
EFAULT
Returned if the attr pointer points at an invalid memory address.
EINTR
Returned when trying to mix perf and ftrace handling for a uprobe.
EINVAL
Returned if the specified event is invalid. There are many possible reasons for this. A non-exhaustive list: sample_freq is higher than the maximum setting; the cpu to monitor does not exist; read_format is out of range; sample_type is out of range; the flags value is out of range; exclusive or pinned set and the event is not a group leader; the event config values are out of range or set reserved bits; the generic event selected is not supported; or there is not enough room to add the selected event.
EMFILE
Each opened event uses one file descriptor. If a large number of events are opened, the per-process limit on the number of open file descriptors will be reached, and no more events can be created.
ENODEV
Returned when the event involves a feature not supported by the current CPU.
ENOENT
Returned if the type setting is not valid. This error is also returned for some unsupported generic events.
ENOSPC
Prior to Linux 3.3, if there was not enough room for the event, ENOSPC was returned. In Linux 3.3, this was changed to EINVAL. ENOSPC is still returned if you try to add more breakpoint events than supported by the hardware.
ENOSYS
Returned if PERF_SAMPLE_STACK_USER is set in sample_type and it is not supported by hardware.
EOPNOTSUPP
Returned if an event requiring a specific hardware feature is requested but there is no hardware support. This includes requesting low-skid events if not supported, branch tracing if it is not available, sampling if no PMU interrupt is available, and branch stacks for software events.
EOVERFLOW (since Linux 4.8)
Returned if PERF_SAMPLE_CALLCHAIN is requested and sample_max_stack is larger than the maximum specified in /proc/sys/kernel/perf_event_max_stack.
EPERM
Returned on many (but not all) architectures when an unsupported exclude_hv, exclude_idle, exclude_user, or exclude_kernel setting is specified.
It can also happen, as with EACCES, when the requested event requires CAP_PERFMON (since Linux 5.8) or CAP_SYS_ADMIN permissions (or a more permissive perf_event paranoid setting). This includes setting a breakpoint on a kernel address, and (since Linux 3.13) setting a kernel function-trace tracepoint.
ESRCH
Returned if attempting to attach to a process that does not exist.
STANDARDS
Linux.
HISTORY
perf_event_open() was introduced in Linux 2.6.31 but was called perf_counter_open(). It was renamed in Linux 2.6.32.
NOTES
The official way of knowing if perf_event_open() support is enabled is checking for the existence of the file /proc/sys/kernel/perf_event_paranoid.
The CAP_PERFMON capability (since Linux 5.8) provides a secure approach to performance monitoring and observability operations in a system, according to the principle of least privilege (POSIX IEEE 1003.1e). Accessing system performance monitoring and observability operations using CAP_PERFMON rather than the much more powerful CAP_SYS_ADMIN removes opportunities to misuse credentials and makes operations more secure. CAP_SYS_ADMIN usage for secure system performance monitoring and observability is discouraged in favor of the CAP_PERFMON capability.
BUGS
The F_SETOWN_EX option to fcntl(2) is needed to properly get overflow signals in threads. This was introduced in Linux 2.6.32.
Prior to Linux 2.6.33 (at least for x86), the kernel did not check if events could be scheduled together until read time. The same happens on all known kernels if the NMI watchdog is enabled. This means to see if a given set of events works you have to perf_event_open(), start, then read before you know for sure you can get valid measurements.
Prior to Linux 2.6.34, event constraints were not enforced by the kernel. In that case, some events would silently return “0” if the kernel scheduled them in an improper counter slot.
Prior to Linux 2.6.34, there was a bug when multiplexing where the wrong results could be returned.
Kernels from Linux 2.6.35 to Linux 2.6.39 can quickly crash the kernel if “inherit” is enabled and many threads are started.
Prior to Linux 2.6.35, PERF_FORMAT_GROUP did not work with attached processes.
There is a bug in the kernel code between Linux 2.6.36 and Linux 3.0 that ignores the “watermark” field and acts as if a wakeup_event was chosen if the union has a nonzero value in it.
From Linux 2.6.31 to Linux 3.4, the PERF_IOC_FLAG_GROUP ioctl argument was broken and would repeatedly operate on the event specified rather than iterating across all sibling events in a group.
From Linux 3.4 to Linux 3.11, the mmap cap_usr_rdpmc and cap_usr_time bits mapped to the same location. Code should migrate to the new cap_user_rdpmc and cap_user_time fields instead.
Always double-check your results! Various generalized events have had wrong values. For example, retired branches measured the wrong thing on AMD machines until Linux 2.6.35.
EXAMPLES
The following is a short example that measures the total instruction count of a call to printf(3).
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(SYS_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

int
main(void)
{
    int fd;
    long long count;
    struct perf_event_attr pe;

    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    printf("Measuring instruction count for this printf\n");

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));

    printf("Used %lld instructions\n", count);

    close(fd);
}
SEE ALSO
perf(1), fcntl(2), mmap(2), open(2), prctl(2), read(2)
Documentation/admin-guide/perf-security.rst in the kernel source tree
41 - Linux cli command stat
NAME
stat - get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but, in the case of stat(), fstatat(), and lstat(), execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    struct stat sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    printf("ID of containing device:  [%x,%x]\n",
           major(sb.st_dev),
           minor(sb.st_dev));

    printf("File type:                ");

    switch (sb.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");      break;
    case S_IFCHR:  printf("character device\n");  break;
    case S_IFDIR:  printf("directory\n");         break;
    case S_IFIFO:  printf("FIFO/pipe\n");         break;
    case S_IFLNK:  printf("symlink\n");           break;
    case S_IFREG:  printf("regular file\n");      break;
    case S_IFSOCK: printf("socket\n");            break;
    default:       printf("unknown?\n");          break;
    }

    printf("I-node number:            %ju\n", (uintmax_t) sb.st_ino);

    printf("Mode:                     %jo (octal)\n",
           (uintmax_t) sb.st_mode);

    printf("Link count:               %ju\n", (uintmax_t) sb.st_nlink);
    printf("Ownership:                UID=%ju   GID=%ju\n",
           (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);

    printf("Preferred I/O block size: %jd bytes\n",
           (intmax_t) sb.st_blksize);
    printf("File size:                %jd bytes\n",
           (intmax_t) sb.st_size);
    printf("Blocks allocated:         %jd\n",
           (intmax_t) sb.st_blocks);

    printf("Last status change:       %s", ctime(&sb.st_ctime));
    printf("Last file access:         %s", ctime(&sb.st_atime));
    printf("Last file modification:   %s", ctime(&sb.st_mtime));

    exit(EXIT_SUCCESS);
}
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
42 - Linux cli command semop
NAME 🖥️ semop 🖥️
System V semaphore operations
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/sem.h>
int semop(int semid, struct sembuf *sops, size_t nsops);
int semtimedop(int semid, struct sembuf *sops, size_t nsops,
const struct timespec *_Nullable timeout);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
semtimedop():
_GNU_SOURCE
DESCRIPTION
Each semaphore in a System V semaphore set has the following associated values:
unsigned short semval; /* semaphore value */
unsigned short semzcnt; /* # waiting for zero */
unsigned short semncnt; /* # waiting for increase */
pid_t sempid; /* PID of process that last
modified the semaphore value */
semop() performs operations on selected semaphores in the set indicated by semid. Each of the nsops elements in the array pointed to by sops is a structure that specifies an operation to be performed on a single semaphore. The elements of this structure are of type struct sembuf, containing the following members:
unsigned short sem_num; /* semaphore number */
short sem_op; /* semaphore operation */
short sem_flg; /* operation flags */
Flags recognized in sem_flg are IPC_NOWAIT and SEM_UNDO. If an operation specifies SEM_UNDO, it will be automatically undone when the process terminates.
The set of operations contained in sops is performed in array order, and atomically, that is, the operations are performed either as a complete unit, or not at all. The behavior of the system call if not all operations can be performed immediately depends on the presence of the IPC_NOWAIT flag in the individual sem_flg fields, as noted below.
Each operation is performed on the sem_num-th semaphore of the semaphore set, where the first semaphore of the set is numbered 0. There are three types of operation, distinguished by the value of sem_op.
If sem_op is a positive integer, the operation adds this value to the semaphore value (semval). Furthermore, if SEM_UNDO is specified for this operation, the system subtracts the value sem_op from the semaphore adjustment (semadj) value for this semaphore. This operation can always proceed; it never forces a thread to wait. The calling process must have alter permission on the semaphore set.
If sem_op is zero, the process must have read permission on the semaphore set. This is a “wait-for-zero” operation: if semval is zero, the operation can immediately proceed. Otherwise, if IPC_NOWAIT is specified in sem_flg, semop() fails with errno set to EAGAIN (and none of the operations in sops is performed). Otherwise, semzcnt (the count of threads waiting until this semaphore’s value becomes zero) is incremented by one and the thread sleeps until one of the following occurs:
semval becomes 0, at which time the value of semzcnt is decremented.
The semaphore set is removed: semop() fails, with errno set to EIDRM.
The calling thread catches a signal: the value of semzcnt is decremented and semop() fails, with errno set to EINTR.
If sem_op is less than zero, the process must have alter permission on the semaphore set. If semval is greater than or equal to the absolute value of sem_op, the operation can proceed immediately: the absolute value of sem_op is subtracted from semval, and, if SEM_UNDO is specified for this operation, the system adds the absolute value of sem_op to the semaphore adjustment (semadj) value for this semaphore. If the absolute value of sem_op is greater than semval, and IPC_NOWAIT is specified in sem_flg, semop() fails, with errno set to EAGAIN (and none of the operations in sops is performed). Otherwise, semncnt (the counter of threads waiting for this semaphore’s value to increase) is incremented by one and the thread sleeps until one of the following occurs:
semval becomes greater than or equal to the absolute value of sem_op: the operation now proceeds, as described above.
The semaphore set is removed from the system: semop() fails, with errno set to EIDRM.
The calling thread catches a signal: the value of semncnt is decremented and semop() fails, with errno set to EINTR.
On successful completion, the sempid value for each semaphore specified in the array pointed to by sops is set to the caller’s process ID. In addition, the sem_otime is set to the current time.
semtimedop()
semtimedop() behaves identically to semop() except that in those cases where the calling thread would sleep, the duration of that sleep is limited by the amount of elapsed time specified by the timespec structure whose address is passed in the timeout argument. (This sleep interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the interval may overrun by a small amount.) If the specified time limit has been reached, semtimedop() fails with errno set to EAGAIN (and none of the operations in sops is performed). If the timeout argument is NULL, then semtimedop() behaves exactly like semop().
Note that if semtimedop() is interrupted by a signal, causing the call to fail with the error EINTR, the contents of timeout are left unchanged.
RETURN VALUE
On success, semop() and semtimedop() return 0. On failure, they return -1, and set errno to indicate the error.
ERRORS
E2BIG
The argument nsops is greater than SEMOPM, the maximum number of operations allowed per system call.
EACCES
The calling process does not have the permissions required to perform the specified semaphore operations, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EAGAIN
An operation could not proceed immediately and either IPC_NOWAIT was specified in sem_flg or the time limit specified in timeout expired.
EFAULT
An address specified in either the sops or the timeout argument isn’t accessible.
EFBIG
For some operation the value of sem_num is less than 0 or greater than or equal to the number of semaphores in the set.
EIDRM
The semaphore set was removed.
EINTR
While blocked in this system call, the thread caught a signal; see signal(7).
EINVAL
The semaphore set doesn’t exist, or semid is less than zero, or nsops has a nonpositive value.
ENOMEM
The sem_flg of some operation specified SEM_UNDO and the system does not have enough memory to allocate the undo structure.
ERANGE
For some operation sem_op+semval is greater than SEMVMX, the implementation dependent maximum value for semval.
STANDARDS
POSIX.1-2008.
VERSIONS
semtimedop() first appeared in Linux 2.5.52 (backported into Linux 2.4.22); library support was added in glibc 2.3.3. semop() is specified in POSIX.1-2001 and SVr4.
NOTES
The sem_undo structures of a process aren’t inherited by the child produced by fork(2), but they are inherited across an execve(2) system call.
semop() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.
A semaphore adjustment (semadj) value is a per-process, per-semaphore integer that is the negated sum of all operations performed on a semaphore specifying the SEM_UNDO flag. Each process has a list of semadj values, one value for each semaphore on which it has operated using SEM_UNDO. When a process terminates, each of its per-semaphore semadj values is added to the corresponding semaphore, thus undoing the effect of that process’s operations on the semaphore (but see BUGS below). When a semaphore’s value is directly set using the SETVAL or SETALL request to semctl(2), the corresponding semadj values in all processes are cleared. The clone(2) CLONE_SYSVSEM flag allows more than one process to share a semadj list; see clone(2) for details.
The semval, sempid, semzcnt, and semncnt values for a semaphore can all be retrieved using appropriate semctl(2) calls.
Semaphore limits
The following limits on semaphore set resources affect the semop() call:
SEMOPM
Maximum number of operations allowed for one semop() call. Before Linux 3.19, the default value for this limit was 32. Since Linux 3.19, the default value is 500. On Linux, this limit can be read and modified via the third field of /proc/sys/kernel/sem. Note: this limit should not be raised above 1000, because of the risk that semop() fails due to kernel memory fragmentation when allocating memory to copy the sops array.
SEMVMX
Maximum allowable value for semval: implementation dependent (32767).
The implementation imposes no intrinsic limits on the following system parameters: the adjust-on-exit maximum value (SEMAEM), the system-wide maximum number of undo structures (SEMMNU), and the per-process maximum number of undo entries.
BUGS
When a process terminates, its set of associated semadj structures is used to undo the effect of all of the semaphore operations it performed with the SEM_UNDO flag. This raises a difficulty: if one (or more) of these semaphore adjustments would result in an attempt to decrease a semaphore’s value below zero, what should an implementation do? One possible approach would be to block until all the semaphore adjustments could be performed. This is however undesirable since it could force process termination to block for arbitrarily long periods. Another possibility is that such semaphore adjustments could be ignored altogether (somewhat analogously to failing when IPC_NOWAIT is specified for a semaphore operation). Linux adopts a third approach: decreasing the semaphore value as far as possible (i.e., to zero) and allowing process termination to proceed immediately.
In Linux 2.6.x, x <= 10, there is a bug that in some circumstances prevents a thread that is waiting for a semaphore value to become zero from being woken up when the value does actually become zero. This bug is fixed in Linux 2.6.11.
EXAMPLES
The following code segment uses semop() to atomically wait for the value of semaphore 0 to become zero, and then increment the semaphore value by one.
struct sembuf sops[2];
int semid;
/* Code to set semid omitted */
sops[0].sem_num = 0; /* Operate on semaphore 0 */
sops[0].sem_op = 0; /* Wait for value to equal 0 */
sops[0].sem_flg = 0;
sops[1].sem_num = 0; /* Operate on semaphore 0 */
sops[1].sem_op = 1; /* Increment value by one */
sops[1].sem_flg = 0;
if (semop(semid, sops, 2) == -1) {
    perror("semop");
    exit(EXIT_FAILURE);
}
A further example of the use of semop() can be found in shmop(2).
SEE ALSO
clone(2), semctl(2), semget(2), sigaction(2), capabilities(7), sem_overview(7), sysvipc(7), time(7)
43 - Linux cli command pread
NAME 🖥️ pread 🖥️
read from or write to a file descriptor at a given offset
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
ssize_t pread(int fd, void buf[.count], size_t count,
off_t offset);
ssize_t pwrite(int fd, const void buf[.count], size_t count,
off_t offset);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
pread(), pwrite():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
DESCRIPTION
pread() reads up to count bytes from file descriptor fd at offset offset (from the start of the file) into the buffer starting at buf. The file offset is not changed.
pwrite() writes up to count bytes from the buffer starting at buf to the file descriptor fd at offset offset. The file offset is not changed.
The file referenced by fd must be capable of seeking.
RETURN VALUE
On success, pread() returns the number of bytes read (a return of zero indicates end of file) and pwrite() returns the number of bytes written.
Note that it is not an error for a successful call to transfer fewer bytes than requested (see read(2) and write(2)).
On error, -1 is returned and errno is set to indicate the error.
ERRORS
pread() can fail and set errno to any error specified for read(2) or lseek(2). pwrite() can fail and set errno to any error specified for write(2) or lseek(2).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
Added in Linux 2.1.60; the entries in the i386 system call table were added in Linux 2.1.69. C library support (including emulation using lseek(2) on older kernels without the system calls) was added in glibc 2.1.
C library/kernel differences
On Linux, the underlying system calls were renamed in Linux 2.6: pread() became pread64(), and pwrite() became pwrite64(). The system call numbers remained the same. The glibc pread() and pwrite() wrapper functions transparently deal with the change.
On some 32-bit architectures, the calling signature for these system calls differs, for the reasons described in syscall(2).
NOTES
The pread() and pwrite() system calls are especially useful in multithreaded applications. They allow multiple threads to perform I/O on the same file descriptor without being affected by changes to the file offset by other threads.
BUGS
POSIX requires that opening a file with the O_APPEND flag should have no effect on the location at which pwrite() writes data. However, on Linux, if a file is opened with O_APPEND, pwrite() appends data to the end of the file, regardless of the value of offset.
SEE ALSO
lseek(2), read(2), readv(2), write(2)
44 - Linux cli command syscalls
NAME 🖥️ syscalls 🖥️
Linux system calls
SYNOPSIS
Linux system calls.
DESCRIPTION
The system call is the fundamental interface between an application and the Linux kernel.
System calls and library wrapper functions
System calls are generally not invoked directly, but rather via wrapper functions in glibc (or perhaps some other library). For details of direct invocation of a system call, see intro(2). Often, but not always, the name of the wrapper function is the same as the name of the system call that it invokes. For example, glibc contains a function chdir() which invokes the underlying “chdir” system call.
Often the glibc wrapper function is quite thin, doing little work other than copying arguments to the right registers before invoking the system call, and then setting errno appropriately after the system call has returned. (These are the same steps that are performed by syscall(2), which can be used to invoke system calls for which no wrapper function is provided.) Note: system calls indicate a failure by returning a negative error number to the caller on architectures without a separate error register/flag, as noted in syscall(2); when this happens, the wrapper function negates the returned error number (to make it positive), copies it to errno, and returns -1 to the caller of the wrapper.
Sometimes, however, the wrapper function does some extra work before invoking the system call. For example, nowadays there are (for reasons described below) two related system calls, truncate(2) and truncate64(2), and the glibc truncate() wrapper function checks which of those system calls are provided by the kernel and determines which should be employed.
System call list
Below is a list of the Linux system calls. In the list, the Kernel column indicates the kernel version for those system calls that were new in Linux 2.2, or have appeared since that kernel version. Note the following points:
Where no kernel version is indicated, the system call appeared in Linux 1.0 or earlier.
Where a system call is marked “1.2” this means the system call probably appeared in a Linux 1.1.x kernel version, and first appeared in a stable kernel with 1.2. (Development of the Linux 1.2 kernel was initiated from a branch of Linux 1.0.6 via the Linux 1.1.x unstable kernel series.)
Where a system call is marked “2.0” this means the system call probably appeared in a Linux 1.3.x kernel version, and first appeared in a stable kernel with Linux 2.0. (Development of the Linux 2.0 kernel was initiated from a branch of Linux 1.2.x, somewhere around Linux 1.2.10, via the Linux 1.3.x unstable kernel series.)
Where a system call is marked “2.2” this means the system call probably appeared in a Linux 2.1.x kernel version, and first appeared in a stable kernel with Linux 2.2.0. (Development of the Linux 2.2 kernel was initiated from a branch of Linux 2.0.21 via the Linux 2.1.x unstable kernel series.)
Where a system call is marked “2.4” this means the system call probably appeared in a Linux 2.3.x kernel version, and first appeared in a stable kernel with Linux 2.4.0. (Development of the Linux 2.4 kernel was initiated from a branch of Linux 2.2.8 via the Linux 2.3.x unstable kernel series.)
Where a system call is marked “2.6” this means the system call probably appeared in a Linux 2.5.x kernel version, and first appeared in a stable kernel with Linux 2.6.0. (Development of Linux 2.6 was initiated from a branch of Linux 2.4.15 via the Linux 2.5.x unstable kernel series.)
Starting with Linux 2.6.0, the development model changed, and new system calls may appear in each Linux 2.6.x release. In this case, the exact version number where the system call appeared is shown. This convention continues with the Linux 3.x kernel series, which followed on from Linux 2.6.39; and the Linux 4.x kernel series, which followed on from Linux 3.19; and the Linux 5.x kernel series, which followed on from Linux 4.20; and the Linux 6.x kernel series, which followed on from Linux 5.19.
In some cases, a system call was added to a stable kernel series after it branched from the previous stable kernel series, and then backported into the earlier stable kernel series. For example some system calls that appeared in Linux 2.6.x were also backported into a Linux 2.4.x release after Linux 2.4.15. When this is so, the version where the system call appeared in both of the major kernel series is listed.
The list of system calls that are available as of Linux 5.14 (or in a few cases only on older kernels) is as follows:
System call | Kernel | Notes |
---|---|---|
_llseek(2) | 1.2 | |
_newselect(2) | 2.0 | |
_sysctl(2) | 2.0 | Removed in 5.5 |
accept(2) | 2.0 | See notes on socketcall(2) |
accept4(2) | 2.6.28 | |
access(2) | 1.0 | |
acct(2) | 1.0 | |
add_key(2) | 2.6.10 | |
adjtimex(2) | 1.0 | |
alarm(2) | 1.0 | |
alloc_hugepages(2) | 2.5.36 | Removed in 2.5.44 |
arc_gettls(2) | 3.9 | ARC only |
arc_settls(2) | 3.9 | ARC only |
arc_usr_cmpxchg(2) | 4.9 | ARC only |
arch_prctl(2) | 2.6 | x86_64, x86 since 4.12 |
atomic_barrier(2) | 2.6.34 | m68k only |
atomic_cmpxchg_32(2) | 2.6.34 | m68k only |
bdflush(2) | 1.2 | Deprecated (does nothing) since 2.6 |
bind(2) | 2.0 | See notes on socketcall(2) |
bpf(2) | 3.18 | |
brk(2) | 1.0 | |
breakpoint(2) | 2.2 | ARM OABI only, defined with __ARM_NR prefix |
cacheflush(2) | 1.2 | Not on x86 |
capget(2) | 2.2 | |
capset(2) | 2.2 | |
chdir(2) | 1.0 | |
chmod(2) | 1.0 | |
chown(2) | 2.2 | See chown(2) for version details |
chown32(2) | 2.4 | |
chroot(2) | 1.0 | |
clock_adjtime(2) | 2.6.39 | |
clock_getres(2) | 2.6 | |
clock_gettime(2) | 2.6 | |
clock_nanosleep(2) | 2.6 | |
clock_settime(2) | 2.6 | |
clone2(2) | 2.4 | IA-64 only |
clone(2) | 1.0 | |
clone3(2) | 5.3 | |
close(2) | 1.0 | |
close_range(2) | 5.9 | |
connect(2) | 2.0 | See notes on socketcall(2) |
copy_file_range(2) | 4.5 | |
creat(2) | 1.0 | |
create_module(2) | 1.0 | Removed in 2.6 |
delete_module(2) | 1.0 | |
dup(2) | 1.0 | |
dup2(2) | 1.0 | |
dup3(2) | 2.6.27 | |
epoll_create(2) | 2.6 | |
epoll_create1(2) | 2.6.27 | |
epoll_ctl(2) | 2.6 | |
epoll_pwait(2) | 2.6.19 | |
epoll_pwait2(2) | 5.11 | |
epoll_wait(2) | 2.6 | |
eventfd(2) | 2.6.22 | |
eventfd2(2) | 2.6.27 | |
execv(2) | 2.0 | SPARC/SPARC64 only, for compatibility with SunOS |
execve(2) | 1.0 | |
execveat(2) | 3.19 | |
exit(2) | 1.0 | |
exit_group(2) | 2.6 | |
faccessat(2) | 2.6.16 | |
faccessat2(2) | 5.8 | |
fadvise64(2) | 2.6 | |
fadvise64_64(2) | 2.6 | |
fallocate(2) | 2.6.23 | |
fanotify_init(2) | 2.6.37 | |
fanotify_mark(2) | 2.6.37 | |
fchdir(2) | 1.0 | |
fchmod(2) | 1.0 | |
fchmodat(2) | 2.6.16 | |
fchown(2) | 1.0 | |
fchown32(2) | 2.4 | |
fchownat(2) | 2.6.16 | |
fcntl(2) | 1.0 | |
fcntl64(2) | 2.4 | |
fdatasync(2) | 2.0 | |
fgetxattr(2) | 2.6; 2.4.18 | |
finit_module(2) | 3.8 | |
flistxattr(2) | 2.6; 2.4.18 | |
flock(2) | 2.0 | |
fork(2) | 1.0 | |
free_hugepages(2) | 2.5.36 | Removed in 2.5.44 |
fremovexattr(2) | 2.6; 2.4.18 | |
fsconfig(2) | 5.2 | |
fsetxattr(2) | 2.6; 2.4.18 | |
fsmount(2) | 5.2 | |
fsopen(2) | 5.2 | |
fspick(2) | 5.2 | |
fstat(2) | 1.0 | |
fstat64(2) | 2.4 | |
fstatat64(2) | 2.6.16 | |
fstatfs(2) | 1.0 | |
fstatfs64(2) | 2.6 | |
fsync(2) | 1.0 | |
ftruncate(2) | 1.0 | |
ftruncate64(2) | 2.4 | |
futex(2) | 2.6 | |
futimesat(2) | 2.6.16 | |
get_kernel_syms(2) | 1.0 | Removed in 2.6 |
get_mempolicy(2) | 2.6.6 | |
get_robust_list(2) | 2.6.17 | |
get_thread_area(2) | 2.6 | |
get_tls(2) | 4.15 | ARM OABI only, has __ARM_NR prefix |
getcpu(2) | 2.6.19 | |
getcwd(2) | 2.2 | |
getdents(2) | 2.0 | |
getdents64(2) | 2.4 | |
getdomainname(2) | 2.2 | SPARC, SPARC64; available as osf_getdomainname(2) on Alpha since Linux 2.0 |
getdtablesize(2) | 2.0 | SPARC (removed in 2.6.26), available on Alpha as osf_getdtablesize(2) |
getegid(2) | 1.0 | |
getegid32(2) | 2.4 | |
geteuid(2) | 1.0 | |
geteuid32(2) | 2.4 | |
getgid(2) | 1.0 | |
getgid32(2) | 2.4 | |
getgroups(2) | 1.0 | |
getgroups32(2) | 2.4 | |
gethostname(2) | 2.0 | Alpha, was available on SPARC up to Linux 2.6.26 |
getitimer(2) | 1.0 | |
getpeername(2) | 2.0 | See notes on socketcall(2) |
getpagesize(2) | 2.0 | Alpha, SPARC/SPARC64 only |
getpgid(2) | 1.0 | |
getpgrp(2) | 1.0 | |
getpid(2) | 1.0 | |
getppid(2) | 1.0 | |
getpriority(2) | 1.0 | |
getrandom(2) | 3.17 | |
getresgid(2) | 2.2 | |
getresgid32(2) | 2.4 | |
getresuid(2) | 2.2 | |
getresuid32(2) | 2.4 | |
getrlimit(2) | 1.0 | |
getrusage(2) | 1.0 | |
getsid(2) | 2.0 | |
getsockname(2) | 2.0 | See notes on socketcall(2) |
getsockopt(2) | 2.0 | See notes on socketcall(2) |
gettid(2) | 2.4.11 | |
gettimeofday(2) | 1.0 | |
getuid(2) | 1.0 | |
getuid32(2) | 2.4 | |
getunwind(2) | 2.4.8 | IA-64 only; deprecated |
getxattr(2) | 2.6; 2.4.18 | |
getxgid(2) | 2.0 | Alpha only; see NOTES |
getxpid(2) | 2.0 | Alpha only; see NOTES |
getxuid(2) | 2.0 | Alpha only; see NOTES |
init_module(2) | 1.0 | |
inotify_add_watch(2) | 2.6.13 | |
inotify_init(2) | 2.6.13 | |
inotify_init1(2) | 2.6.27 | |
inotify_rm_watch(2) | 2.6.13 | |
io_cancel(2) | 2.6 | |
io_destroy(2) | 2.6 | |
io_getevents(2) | 2.6 | |
io_pgetevents(2) | 4.18 | |
io_setup(2) | 2.6 | |
io_submit(2) | 2.6 | |
io_uring_enter(2) | 5.1 | |
io_uring_register(2) | 5.1 | |
io_uring_setup(2) | 5.1 | |
ioctl(2) | 1.0 | |
ioperm(2) | 1.0 | |
iopl(2) | 1.0 | |
ioprio_get(2) | 2.6.13 | |
ioprio_set(2) | 2.6.13 | |
ipc(2) | 1.0 | |
kcmp(2) | 3.5 | |
kern_features(2) | 3.7 | SPARC64 only |
kexec_file_load(2) | 3.17 | |
kexec_load(2) | 2.6.13 | |
keyctl(2) | 2.6.10 | |
kill(2) | 1.0 | |
landlock_add_rule(2) | 5.13 | |
landlock_create_ruleset(2) | 5.13 | |
landlock_restrict_self(2) | 5.13 | |
lchown(2) | 1.0 | See chown(2) for version details |
lchown32(2) | 2.4 | |
lgetxattr(2) | 2.6; 2.4.18 | |
link(2) | 1.0 | |
linkat(2) | 2.6.16 | |
listen(2) | 2.0 | See notes on socketcall(2) |
listxattr(2) | 2.6; 2.4.18 | |
llistxattr(2) | 2.6; 2.4.18 | |
lookup_dcookie(2) | 2.6 | |
lremovexattr(2) | 2.6; 2.4.18 | |
lseek(2) | 1.0 | |
lsetxattr(2) | 2.6; 2.4.18 | |
lstat(2) | 1.0 | |
lstat64(2) | 2.4 | |
madvise(2) | 2.4 | |
mbind(2) | 2.6.6 | |
memory_ordering(2) | 2.2 | SPARC64 only |
membarrier(2) | 3.17 | |
memfd_create(2) | 3.17 | |
memfd_secret(2) | 5.14 | |
migrate_pages(2) | 2.6.16 | |
mincore(2) | 2.4 | |
mkdir(2) | 1.0 | |
mkdirat(2) | 2.6.16 | |
mknod(2) | 1.0 | |
mknodat(2) | 2.6.16 | |
mlock(2) | 2.0 | |
mlock2(2) | 4.4 | |
mlockall(2) | 2.0 | |
mmap(2) | 1.0 | |
mmap2(2) | 2.4 | |
modify_ldt(2) | 1.0 | |
mount(2) | 1.0 | |
move_mount(2) | 5.2 | |
move_pages(2) | 2.6.18 | |
mprotect(2) | 1.0 | |
mq_getsetattr(2) | 2.6.6 | |
mq_notify(2) | 2.6.6 | |
mq_open(2) | 2.6.6 | |
mq_timedreceive(2) | 2.6.6 | |
mq_timedsend(2) | 2.6.6 | |
mq_unlink(2) | 2.6.6 | |
mremap(2) | 2.0 | |
msgctl(2) | 2.0 | See notes on ipc(2) |
msgget(2) | 2.0 | See notes on ipc(2) |
msgrcv(2) | 2.0 | See notes on ipc(2) |
msgsnd(2) | 2.0 | See notes on ipc(2) |
msync(2) | 2.0 | |
munlock(2) | 2.0 | |
munlockall(2) | 2.0 | |
munmap(2) | 1.0 | |
name_to_handle_at(2) | 2.6.39 | |
nanosleep(2) | 2.0 | |
newfstatat(2) | 2.6.16 | See stat(2) |
nfsservctl(2) | 2.2 | Removed in 3.1 |
nice(2) | 1.0 | |
old_adjtimex(2) | 2.0 | Alpha only; see NOTES |
old_getrlimit(2) | 2.4 | Old variant of getrlimit(2) that used a different value for RLIM_INFINITY |
oldfstat(2) | 1.0 | |
oldlstat(2) | 1.0 | |
oldolduname(2) | 1.0 | |
oldstat(2) | 1.0 | |
oldumount(2) | 2.4.116 | Name of the old umount(2) syscall on Alpha |
olduname(2) | 1.0 | |
open(2) | 1.0 | |
open_by_handle_at(2) | 2.6.39 | |
open_tree(2) | 5.2 | |
openat(2) | 2.6.16 | |
openat2(2) | 5.6 | |
or1k_atomic(2) | 3.1 | OpenRISC 1000 only |
pause(2) | 1.0 | |
pciconfig_iobase(2) | 2.2.15; 2.4 | Not on x86 |
pciconfig_read(2) | 2.0.26; 2.2 | Not on x86 |
pciconfig_write(2) | 2.0.26; 2.2 | Not on x86 |
perf_event_open(2) | 2.6.31 | Was perf_counter_open() in 2.6.31; renamed in 2.6.32 |
personality(2) | 1.2 | |
perfctr(2) | 2.2 | SPARC only; removed in 2.6.34 |
perfmonctl(2) | 2.4 | IA-64 only; removed in 5.10 |
pidfd_getfd(2) | 5.6 | |
pidfd_send_signal(2) | 5.1 | |
pidfd_open(2) | 5.3 | |
pipe(2) | 1.0 | |
pipe2(2) | 2.6.27 | |
pivot_root(2) | 2.4 | |
pkey_alloc(2) | 4.8 | |
pkey_free(2) | 4.8 | |
pkey_mprotect(2) | 4.8 | |
poll(2) | 2.0.36; 2.2 | |
ppoll(2) | 2.6.16 | |
prctl(2) | 2.2 | |
pread64(2) | Added as "pread" in 2.2; renamed "pread64" in 2.6 | |
preadv(2) | 2.6.30 | |
preadv2(2) | 4.6 | |
prlimit64(2) | 2.6.36 | |
process_madvise(2) | 5.10 | |
process_vm_readv(2) | 3.2 | |
process_vm_writev(2) | 3.2 | |
pselect6(2) | 2.6.16 | |
ptrace(2) | 1.0 | |
pwrite64(2) | Added as "pwrite" in 2.2; renamed "pwrite64" in 2.6 | |
pwritev(2) | 2.6.30 | |
pwritev2(2) | 4.6 | |
query_module(2) | 2.2 | Removed in 2.6 |
quotactl(2) | 1.0 | |
quotactl_fd(2) | 5.14 | |
read(2) | 1.0 | |
readahead(2) | 2.4.13 | |
readdir(2) | 1.0 | |
readlink(2) | 1.0 | |
readlinkat(2) | 2.6.16 | |
readv(2) | 2.0 | |
reboot(2) | 1.0 | |
recv(2) | 2.0 | See notes on socketcall(2) |
recvfrom(2) | 2.0 | See notes on socketcall(2) |
recvmsg(2) | 2.0 | See notes on socketcall(2) |
recvmmsg(2) | 2.6.33 | |
remap_file_pages(2) | 2.6 | Deprecated since 3.16 |
removexattr(2) | 2.6; 2.4.18 | |
rename(2) | 1.0 | |
renameat(2) | 2.6.16 | |
renameat2(2) | 3.15 | |
request_key(2) | 2.6.10 | |
restart_syscall(2) | 2.6 | |
riscv_flush_icache(2) | 4.15 | RISC-V only |
rmdir(2) | 1.0 | |
rseq(2) | 4.18 | |
rt_sigaction(2) | 2.2 | |
rt_sigpending(2) | 2.2 | |
rt_sigprocmask(2) | 2.2 | |
rt_sigqueueinfo(2) | 2.2 | |
rt_sigreturn(2) | 2.2 | |
rt_sigsuspend(2) | 2.2 | |
rt_sigtimedwait(2) | 2.2 | |
rt_tgsigqueueinfo(2) | 2.6.31 | |
rtas(2) | 2.6.2 | PowerPC/PowerPC64 only |
s390_runtime_instr(2) | 3.7 | s390 only |
s390_pci_mmio_read(2) | 3.19 | s390 only |
s390_pci_mmio_write(2) | 3.19 | s390 only |
s390_sthyi(2) | 4.15 | s390 only |
s390_guarded_storage(2) | 4.12 | s390 only |
sched_get_affinity(2) | 2.6 | Name of sched_getaffinity(2) on SPARC and SPARC64 |
sched_get_priority_max(2) | 2.0 | |
sched_get_priority_min(2) | 2.0 | |
sched_getaffinity(2) | 2.6 | |
sched_getattr(2) | 3.14 | |
sched_getparam(2) | 2.0 | |
sched_getscheduler(2) | 2.0 | |
sched_rr_get_interval(2) | 2.0 | |
sched_set_affinity(2) | 2.6 | Name of sched_setaffinity(2) on SPARC and SPARC64 |
sched_setaffinity(2) | 2.6 | |
sched_setattr(2) | 3.14 | |
sched_setparam(2) | 2.0 | |
sched_setscheduler(2) | 2.0 | |
sched_yield(2) | 2.0 | |
seccomp(2) | 3.17 | |
select(2) | 1.0 | |
semctl(2) | 2.0 | See notes on ipc(2) |
semget(2) | 2.0 | See notes on ipc(2) |
semop(2) | 2.0 | See notes on ipc(2) |
semtimedop(2) | 2.6; 2.4.22 | |
send(2) | 2.0 | See notes on socketcall(2) |
sendfile(2) | 2.2 | |
sendfile64(2) | 2.6; 2.4.19 | |
sendmmsg(2) | 3.0 | |
sendmsg(2) | 2.0 | See notes on socketcall(2) |
sendto(2) | 2.0 | See notes on socketcall(2) |
set_mempolicy(2) | 2.6.6 | |
set_robust_list(2) | 2.6.17 | |
set_thread_area(2) | 2.6 | |
set_tid_address(2) | 2.6 | |
set_tls(2) | 2.6.11 | ARM OABI/EABI only (constant has __ARM_NR prefix) |
setdomainname(2) | 1.0 | |
setfsgid(2) | 1.2 | |
setfsgid32(2) | 2.4 | |
setfsuid(2) | 1.2 | |
setfsuid32(2) | 2.4 | |
setgid(2) | 1.0 | |
setgid32(2) | 2.4 | |
setgroups(2) | 1.0 | |
setgroups32(2) | 2.4 | |
sethae(2) | 2.0 | Alpha only; see NOTES |
sethostname(2) | 1.0 | |
setitimer(2) | 1.0 | |
setns(2) | 3.0 | |
setpgid(2) | 1.0 | |
setpgrp(2) | 2.0 | Alternative name for setpgid(2) on Alpha |
setpriority(2) | 1.0 | |
setregid(2) | 1.0 | |
setregid32(2) | 2.4 | |
setresgid(2) | 2.2 | |
setresgid32(2) | 2.4 | |
setresuid(2) | 2.2 | |
setresuid32(2) | 2.4 | |
setreuid(2) | 1.0 | |
setreuid32(2) | 2.4 | |
setrlimit(2) | 1.0 | |
setsid(2) | 1.0 | |
setsockopt(2) | 2.0 | See notes on socketcall(2) |
settimeofday(2) | 1.0 | |
setuid(2) | 1.0 | |
setuid32(2) | 2.4 | |
setup(2) | 1.0 | Removed in 2.2 |
setxattr(2) | 2.6; 2.4.18 | |
sgetmask(2) | 1.0 | |
shmat(2) | 2.0 | See notes on ipc(2) |
shmctl(2) | 2.0 | See notes on ipc(2) |
shmdt(2) | 2.0 | See notes on ipc(2) |
shmget(2) | 2.0 | See notes on ipc(2) |
shutdown(2) | 2.0 | See notes on socketcall(2) |
sigaction(2) | 1.0 | |
sigaltstack(2) | 2.2 | |
signal(2) | 1.0 | |
signalfd(2) | 2.6.22 | |
signalfd4(2) | 2.6.27 | |
sigpending(2) | 1.0 | |
sigprocmask(2) | 1.0 | |
sigreturn(2) | 1.0 | |
sigsuspend(2) | 1.0 | |
socket(2) | 2.0 | See notes on socketcall(2) |
socketcall(2) | 1.0 | |
socketpair(2) | 2.0 | See notes on socketcall(2) |
spill(2) | 2.6.13 | Xtensa only |
splice(2) | 2.6.17 | |
spu_create(2) | 2.6.16 | PowerPC/PowerPC64 only |
spu_run(2) | 2.6.16 | PowerPC/PowerPC64 only |
ssetmask(2) | 1.0 | |
stat(2) | 1.0 | |
stat64(2) | 2.4 | |
statfs(2) | 1.0 | |
statfs64(2) | 2.6 | |
statx(2) | 4.11 | |
stime(2) | 1.0 | |
subpage_prot(2) | 2.6.25 | PowerPC/PowerPC64 only |
swapcontext(2) | 2.6.3 | PowerPC/PowerPC64 only |
switch_endian(2) | 4.1 | PowerPC64 only |
swapoff(2) | 1.0 | |
swapon(2) | 1.0 | |
symlink(2) | 1.0 | |
symlinkat(2) | 2.6.16 | |
sync(2) | 1.0 | |
sync_file_range(2) | 2.6.17 | |
sync_file_range2(2) | 2.6.22 | |
syncfs(2) | 2.6.39 | |
sys_debug_setcontext(2) | 2.6.11 | PowerPC only |
syscall(2) | 1.0 | Still available on ARM OABI and MIPS O32 ABI |
sysfs(2) | 1.2 | |
sysinfo(2) | 1.0 | |
syslog(2) | 1.0 | |
sysmips(2) | 2.6.0 | MIPS only |
tee(2) | 2.6.17 | |
tgkill(2) | 2.6 | |
time(2) | 1.0 | |
timer_create(2) | 2.6 | |
timer_delete(2) | 2.6 | |
timer_getoverrun(2) | 2.6 | |
timer_gettime(2) | 2.6 | |
timer_settime(2) | 2.6 | |
timerfd_create(2) | 2.6.25 | |
timerfd_gettime(2) | 2.6.25 | |
timerfd_settime(2) | 2.6.25 | |
times(2) | 1.0 | |
tkill(2) | 2.6; 2.4.22 | |
truncate(2) | 1.0 | |
truncate64(2) | 2.4 | |
ugetrlimit(2) | 2.4 | |
umask(2) | 1.0 | |
umount(2) | 1.0 | |
umount2(2) | 2.2 | |
uname(2) | 1.0 | |
unlink(2) | 1.0 | |
unlinkat(2) | 2.6.16 | |
unshare(2) | 2.6.16 | |
uselib(2) | 1.0 | |
ustat(2) | 1.0 | |
userfaultfd(2) | 4.3 | |
usr26(2) | 2.4.8.1 | ARM OABI only |
usr32(2) | 2.4.8.1 | ARM OABI only |
utime(2) | 1.0 | |
utimensat(2) | 2.6.22 | |
utimes(2) | 2.2 | |
utrap_install(2) | 2.2 | SPARC64 only |
vfork(2) | 2.2 | |
vhangup(2) | 1.0 | |
vm86old(2) | 1.0 | Was "vm86"; renamed in 2.0.28/2.2 |
vm86(2) | 2.0.28; 2.2 | |
vmsplice(2) | 2.6.17 | |
wait4(2) | 1.0 | |
waitid(2) | 2.6.10 | |
waitpid(2) | 1.0 | |
write(2) | 1.0 | |
writev(2) | 2.0 | |
xtensa(2) | 2.6.13 | Xtensa only |
On many platforms, including x86-32, socket calls are all multiplexed (via glibc wrapper functions) through socketcall(2) and similarly System V IPC calls are multiplexed through ipc(2).
Although slots are reserved for them in the system call table, the following system calls are not implemented in the standard kernel: afs_syscall(2), break(2), ftime(2), getpmsg(2), gtty(2), idle(2), lock(2), madvise1(2), mpx(2), phys(2), prof(2), profil(2), putpmsg(2), security(2), stty(2), tuxcall(2), ulimit(2), and vserver(2) (see also unimplemented(2)). However, ftime(3), profil(3), and ulimit(3) exist as library routines. The slot for phys(2) is in use since Linux 2.1.116 for umount(2); phys(2) will never be implemented. The getpmsg(2) and putpmsg(2) calls are for kernels patched to support STREAMS, and may never be in the standard kernel.
There was briefly set_zone_reclaim(2), added in Linux 2.6.13, and removed in Linux 2.6.16; this system call was never available to user space.
System calls on removed ports
Some system calls only ever existed on Linux architectures that have since been removed from the kernel:
AVR32 (port removed in Linux 4.12)
pread(2)
pwrite(2)
Blackfin (port removed in Linux 4.17)
bfin_spinlock(2) (added in Linux 2.6.22)
dma_memcpy(2) (added in Linux 2.6.22)
pread(2) (added in Linux 2.6.22)
pwrite(2) (added in Linux 2.6.22)
sram_alloc(2) (added in Linux 2.6.22)
sram_free(2) (added in Linux 2.6.22)
Metag (port removed in Linux 4.17)
metag_get_tls(2) (added in Linux 3.9)
metag_set_fpu_flags(2) (added in Linux 3.9)
metag_set_tls(2) (added in Linux 3.9)
metag_setglobalbit(2) (added in Linux 3.9)
Tile (port removed in Linux 4.17)
cmpxchg_badaddr(2) (added in Linux 2.6.36)
NOTES
Roughly speaking, the code belonging to the system call with number __NR_xxx defined in /usr/include/asm/unistd.h can be found in the Linux kernel source in the routine sys_xxx(). There are many exceptions, however, mostly because older system calls were superseded by newer ones, and this has been treated somewhat unsystematically. On platforms with proprietary operating-system emulation, such as sparc, sparc64, and alpha, there are many additional system calls; mips64 also contains a full set of 32-bit system calls.
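The relationship between a system call number and its wrapper can be illustrated with syscall(2). The following is a minimal sketch, assuming an architecture that defines SYS_getpid (x86 and ARM do; Alpha, as described under the architecture-specific details, provides getxpid(2) instead). The helper name raw_getpid() is hypothetical.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_* constants, aliases for __NR_* */
#include <sys/types.h>
#include <unistd.h>

/* Invoke getpid through its raw system call number instead of the
   glibc wrapper. raw_getpid() is a hypothetical helper name. */
pid_t raw_getpid(void)
{
    return (pid_t) syscall(SYS_getpid);
}
```

The value returned by raw_getpid() should match the one returned by the getpid() wrapper in the same process.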
Over time, changes to the interfaces of some system calls have been necessary. One reason for such changes was the need to increase the size of structures or scalar values passed to the system call. Because of these changes, certain architectures (notably, longstanding 32-bit architectures such as i386) now have various groups of related system calls (e.g., truncate(2) and truncate64(2)) which perform similar tasks, but which vary in details such as the size of their arguments. (As noted earlier, applications are generally unaware of this: the glibc wrapper functions do some work to ensure that the right system call is invoked, and that ABI compatibility is preserved for old binaries.) Examples of system calls that exist in multiple versions are the following:
By now there are three different versions of stat(2): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64), with the last being the most current. A similar story applies for lstat(2) and fstat(2).
Similarly, the defines __NR_oldolduname, __NR_olduname, and __NR_uname refer to the routines sys_olduname(), sys_uname(), and sys_newuname().
In Linux 2.0, a new version of vm86(2) appeared, with the old and the new kernel routines being named sys_vm86old() and sys_vm86().
In Linux 2.4, a new version of getrlimit(2) appeared, with the old and the new kernel routines being named sys_old_getrlimit() (slot __NR_getrlimit) and sys_getrlimit() (slot __NR_ugetrlimit).
Linux 2.4 increased the size of user and group IDs from 16 to 32 bits. To support this change, a range of system calls were added (e.g., chown32(2), getuid32(2), getgroups32(2), setresuid32(2)), superseding earlier calls of the same name without the “32” suffix.
Linux 2.4 added support for applications on 32-bit architectures to access large files (i.e., files for which the sizes and file offsets can’t be represented in 32 bits.) To support this change, replacements were required for system calls that deal with file offsets and sizes. Thus the following system calls were added: fcntl64(2), getdents64(2), stat64(2), statfs64(2), truncate64(2), and their analogs that work with file descriptors or symbolic links. These system calls supersede the older system calls which, except in the case of the “stat” calls, have the same name without the “64” suffix.
On newer platforms that only have 64-bit file access and 32-bit UIDs/GIDs (e.g., alpha, ia64, s390x, x86-64), there is just a single version of the UID/GID and file access system calls. On platforms (typically, 32-bit platforms) where the *64 and *32 calls exist, the other versions are obsolete.
The rt_sig* calls were added in Linux 2.2 to support the addition of real-time signals (see signal(7)). These system calls supersede the older system calls of the same name without the “rt_” prefix.
The select(2) and mmap(2) system calls use five or more arguments, which caused problems in the way argument passing on the i386 used to be set up. Thus, while other architectures have sys_select() and sys_mmap() corresponding to __NR_select and __NR_mmap, on i386 one finds old_select() and old_mmap() (routines that use a pointer to an argument block) instead. These days passing five arguments is not a problem any more, and there is a __NR__newselect that corresponds directly to sys_select() and similarly __NR_mmap2. s390x is the only 64-bit architecture that has old_mmap().
Architecture-specific details: Alpha
getxgid(2)
returns a pair of GID and effective GID via registers r0 and r20; it is provided instead of getgid(2) and getegid(2).
getxpid(2)
returns a pair of PID and parent PID via registers r0 and r20; it is provided instead of getpid(2) and getppid(2).
old_adjtimex(2)
is a variant of adjtimex(2) that uses struct timeval32, for compatibility with OSF/1.
getxuid(2)
returns a pair of UID and effective UID via registers r0 and r20; it is provided instead of getuid(2) and geteuid(2).
sethae(2)
is used for configuring the Host Address Extension register on low-cost Alphas in order to access address space beyond the first 27 bits.
SEE ALSO
ausyscall(1), intro(2), syscall(2), unimplemented(2), errno(3), libc(7), vdso(7)
45 - Linux cli command umount2
NAME 🖥️ umount2 🖥️
unmount filesystem
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mount.h>
int umount(const char *target);
int umount2(const char *target, int flags);
DESCRIPTION
umount() and umount2() remove the attachment of the (topmost) filesystem mounted on target.
Appropriate privilege (Linux: the CAP_SYS_ADMIN capability) is required to unmount filesystems.
Linux 2.1.116 added the umount2() system call, which, like umount(), unmounts a target, but allows additional flags controlling the behavior of the operation:
MNT_FORCE (since Linux 2.1.116)
Ask the filesystem to abort pending requests before attempting the unmount. This may allow the unmount to complete without waiting for an inaccessible server, but could cause data loss. If, after aborting requests, some processes still have active references to the filesystem, the unmount will still fail. As of Linux 4.12, MNT_FORCE is supported only on the following filesystems: 9p (since Linux 2.6.16), ceph (since Linux 2.6.34), cifs (since Linux 2.6.12), fuse (since Linux 2.6.16), lustre (since Linux 3.11), and NFS (since Linux 2.1.116).
MNT_DETACH (since Linux 2.4.11)
Perform a lazy unmount: make the mount unavailable for new accesses, immediately disconnect the filesystem and all filesystems mounted below it from each other and from the mount table, and actually perform the unmount when the mount ceases to be busy.
MNT_EXPIRE (since Linux 2.6.8)
Mark the mount as expired. If a mount is not currently in use, then an initial call to umount2() with this flag fails with the error EAGAIN, but marks the mount as expired. The mount remains expired as long as it isn’t accessed by any process. A second umount2() call specifying MNT_EXPIRE unmounts an expired mount. This flag cannot be specified with either MNT_FORCE or MNT_DETACH.
UMOUNT_NOFOLLOW (since Linux 2.6.34)
Don’t dereference target if it is a symbolic link. This flag allows security problems to be avoided in set-user-ID-root programs that allow unprivileged users to unmount filesystems.
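The flags described above can be combined as in the following minimal sketch. The helper names lazy_unmount() and expire_unmount() are hypothetical, and the caller is assumed to hold CAP_SYS_ADMIN; without it the calls are expected to fail.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Lazily unmount `target` without following a trailing symbolic link.
   Returns 0 on success, or -1 with errno set by umount2(). */
int lazy_unmount(const char *target)
{
    if (umount2(target, MNT_DETACH | UMOUNT_NOFOLLOW) == -1) {
        int saved = errno;
        fprintf(stderr, "umount2(%s): %s\n", target, strerror(saved));
        errno = saved;          /* fprintf() may have changed errno */
        return -1;
    }
    return 0;
}

/* Two-step expiry. The first MNT_EXPIRE call on an idle mount fails
   with EAGAIN and marks the mount as expired; a second call unmounts
   it, provided nothing accessed the mount in between. */
int expire_unmount(const char *target)
{
    if (umount2(target, MNT_EXPIRE) == 0)
        return 0;               /* mount had already been marked */
    if (errno != EAGAIN)
        return -1;              /* busy, not a mount point, no privilege */
    return umount2(target, MNT_EXPIRE);
}
```

MNT_EXPIRE is used on its own in expire_unmount() because it cannot be combined with MNT_FORCE or MNT_DETACH.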
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The error values given below result from filesystem type independent errors. Each filesystem type may have its own special errors and its own special behavior. See the Linux kernel source code for details.
EAGAIN
A call to umount2() specifying MNT_EXPIRE successfully marked an unbusy filesystem as expired.
EBUSY
target could not be unmounted because it is busy.
EFAULT
target points outside the user address space.
EINVAL
target is not a mount point.
EINVAL
target is locked; see mount_namespaces(7).
EINVAL
umount2() was called with MNT_EXPIRE and either MNT_DETACH or MNT_FORCE.
EINVAL (since Linux 2.6.34)
umount2() was called with an invalid flag value in flags.
ENAMETOOLONG
A pathname was longer than MAXPATHLEN.
ENOENT
A pathname was empty or had a nonexistent component.
ENOMEM
The kernel could not allocate a free page to copy filenames or data into.
EPERM
The caller does not have the required privileges.
STANDARDS
Linux.
HISTORY
MNT_DETACH and MNT_EXPIRE are available since glibc 2.11.
The original umount() function was called as umount(device) and would return ENOTBLK when called with something other than a block device. In Linux 0.98p4, a call umount(dir) was added, in order to support anonymous devices. In Linux 2.3.99-pre7, the call umount(device) was removed, leaving only umount(dir) (since now devices can be mounted in more than one place, so specifying the device does not suffice).
NOTES
umount() and shared mounts
Shared mounts cause any mount activity on a mount, including umount() operations, to be forwarded to every shared mount in the peer group and every slave mount of that peer group. This means that umount() of any peer in a set of shared mounts will cause all of its peers to be unmounted and all of their slaves to be unmounted as well.
This propagation of unmount activity can be particularly surprising on systems where every mount is shared by default. On such systems, recursively bind mounting the root directory of the filesystem onto a subdirectory and then later unmounting that subdirectory with MNT_DETACH will cause every mount in the mount namespace to be lazily unmounted.
To ensure umount() does not propagate in this fashion, the mount may be remounted using a mount(2) call with a mount_flags argument that includes both MS_REC and MS_PRIVATE prior to umount() being called.
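The remount-then-unmount sequence described above might look like the following sketch. The helper name unmount_private() is hypothetical; passing NULL for the source, filesystem type, and data arguments reflects the usual idiom for a call that only changes the propagation type.

```c
#include <stddef.h>
#include <sys/mount.h>

/* Mark `target` and every mount below it as private, so that the
   following unmount is not propagated to peers or slaves, and then
   lazily unmount it. Returns 0 on success, -1 with errno set. */
int unmount_private(const char *target)
{
    /* Only the propagation type is changed here. */
    if (mount(NULL, target, NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return -1;

    return umount2(target, MNT_DETACH);
}
```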
SEE ALSO
mount(2), mount_namespaces(7), path_resolution(7), mount(8), umount(8)
46 - Linux cli command timer_gettime
NAME 🖥️ timer_gettime 🖥️
arm/disarm and fetch state of POSIX per-process timer
LIBRARY
Real-time library (librt, -lrt)
SYNOPSIS
#include <time.h>
int timer_gettime(timer_t timerid, struct itimerspec *curr_value);
int timer_settime(timer_t timerid, int flags,
                  const struct itimerspec *restrict new_value,
                  struct itimerspec *_Nullable restrict old_value);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
timer_settime(), timer_gettime():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
timer_settime() arms or disarms the timer identified by timerid. The new_value argument is a pointer to an itimerspec structure that specifies the new initial value and the new interval for the timer. The itimerspec structure is described in itimerspec(3type).
Each of the substructures of the itimerspec structure is a timespec(3) structure that allows a time value to be specified in seconds and nanoseconds. These time values are measured according to the clock that was specified when the timer was created by timer_create(2).
If new_value->it_value specifies a nonzero value (i.e., either subfield is nonzero), then timer_settime() arms (starts) the timer, setting it to initially expire at the given time. (If the timer was already armed, then the previous settings are overwritten.) If new_value->it_value specifies a zero value (i.e., both subfields are zero), then the timer is disarmed.
The new_value->it_interval field specifies the period of the timer, in seconds and nanoseconds. If this field is nonzero, then each time that an armed timer expires, the timer is reloaded from the value specified in new_value->it_interval. If new_value->it_interval specifies a zero value, then the timer expires just once, at the time specified by it_value.
By default, the initial expiration time specified in new_value->it_value is interpreted relative to the current time on the timer’s clock at the time of the call. This can be modified by specifying TIMER_ABSTIME in flags, in which case new_value->it_value is interpreted as an absolute value as measured on the timer’s clock; that is, the timer will expire when the clock value reaches the value specified by new_value->it_value. If the specified absolute time has already passed, then the timer expires immediately, and the overrun count (see timer_getoverrun(2)) will be set correctly.
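An absolute expiration can be computed from the current reading of the timer's clock. The following is a minimal sketch, assuming the timer was created on CLOCK_MONOTONIC with timer_create(2); the helper name arm_absolute_oneshot() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Arm `timerid` (assumed to have been created on CLOCK_MONOTONIC) to
   expire once, at the absolute clock value "now + delay_sec".
   Returns 0 on success, -1 with errno set. */
int arm_absolute_oneshot(timer_t timerid, time_t delay_sec)
{
    struct timespec now;
    if (clock_gettime(CLOCK_MONOTONIC, &now) == -1)
        return -1;

    struct itimerspec its = {0};    /* it_interval stays zero: one-shot */
    its.it_value.tv_sec  = now.tv_sec + delay_sec;
    its.it_value.tv_nsec = now.tv_nsec;

    return timer_settime(timerid, TIMER_ABSTIME, &its, NULL);
}
```

Reading the clock and arming the timer against the same clock ID matters here: an absolute value taken from a different clock would not be comparable.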
If the value of the CLOCK_REALTIME clock is adjusted while an absolute timer based on that clock is armed, then the expiration of the timer will be appropriately adjusted. Adjustments to the CLOCK_REALTIME clock have no effect on relative timers based on that clock.
If old_value is not NULL, then it points to a buffer that is used to return the previous interval of the timer (in old_value->it_interval) and the amount of time until the timer would previously have next expired (in old_value->it_value).
timer_gettime() returns the time until next expiration, and the interval, for the timer specified by timerid, in the buffer pointed to by curr_value. The time remaining until the next timer expiration is returned in curr_value->it_value; this is always a relative value, regardless of whether the TIMER_ABSTIME flag was used when arming the timer. If the value returned in curr_value->it_value is zero, then the timer is currently disarmed. The timer interval is returned in curr_value->it_interval. If the value returned in curr_value->it_interval is zero, then this is a “one-shot” timer.
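The interplay of timer_settime() and timer_gettime() can be sketched as follows. This is an illustration under stated assumptions, not a complete program: the helper name arm_and_query() is hypothetical, the timer uses SIGEV_NONE so that no signal is delivered, and the 5-second initial value and 1-second interval are arbitrary. On older glibc versions, linking with -lrt may be needed.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <time.h>

/* Create a timer, arm it relative to now (first expiry in 5 s, then
   every 1 s), read back its state into *cur, and delete the timer.
   Returns 0 on success, -1 with errno set. */
int arm_and_query(struct itimerspec *cur)
{
    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_NONE;      /* no notification; state is polled */

    timer_t timerid;
    if (timer_create(CLOCK_MONOTONIC, &sev, &timerid) == -1)
        return -1;

    struct itimerspec its;
    memset(&its, 0, sizeof(its));
    its.it_value.tv_sec    = 5;         /* nonzero it_value arms the timer */
    its.it_interval.tv_sec = 1;         /* nonzero it_interval makes it periodic */

    int rc = timer_settime(timerid, 0, &its, NULL);
    if (rc == 0)
        rc = timer_gettime(timerid, cur);   /* it_value is the time remaining */

    timer_delete(timerid);
    return rc;
}
```

After a successful call, cur->it_interval holds the 1-second period and cur->it_value holds slightly less than 5 seconds, since the remaining time is always reported as a relative value.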
RETURN VALUE
On success, timer_settime() and timer_gettime() return 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
These functions may fail with the following errors:
EFAULT
new_value, old_value, or curr_value is not a valid pointer.
EINVAL
timerid is invalid.
timer_settime() may fail with the following errors:
EINVAL
new_value.it_value is negative; or new_value.it_value.tv_nsec is negative or greater than 999,999,999.
STANDARDS
POSIX.1-2008.
HISTORY
Linux 2.6. POSIX.1-2001.
EXAMPLES
See timer_create(2).
SEE ALSO
timer_create(2), timer_getoverrun(2), timespec(3), time(7)
47 - Linux cli command shmctl
NAME 🖥️ shmctl 🖥️
System V shared memory control
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/shm.h>
int shmctl(int shmid, int op, struct shmid_ds *buf);
DESCRIPTION
shmctl() performs the control operation specified by op on the System V shared memory segment whose identifier is given in shmid.
The buf argument is a pointer to a shmid_ds structure, defined in <sys/shm.h> as follows:
struct shmid_ds {
    struct ipc_perm shm_perm;    /* Ownership and permissions */
    size_t          shm_segsz;   /* Size of segment (bytes) */
    time_t          shm_atime;   /* Last attach time */
    time_t          shm_dtime;   /* Last detach time */
    time_t          shm_ctime;   /* Creation time/time of last
                                    modification via shmctl() */
    pid_t           shm_cpid;    /* PID of creator */
    pid_t           shm_lpid;    /* PID of last shmat(2)/shmdt(2) */
    shmatt_t        shm_nattch;  /* No. of current attaches */
    ...
};
The fields of the shmid_ds structure are as follows:
shm_perm
This is an ipc_perm structure (see below) that specifies the access permissions on the shared memory segment.
shm_segsz
Size in bytes of the shared memory segment.
shm_atime
Time of the last shmat(2) system call that attached this segment.
shm_dtime
Time of the last shmdt(2) system call that detached this segment.
shm_ctime
Time of creation of segment or time of the last shmctl() IPC_SET operation.
shm_cpid
ID of the process that created the shared memory segment.
shm_lpid
ID of the last process that executed a shmat(2) or shmdt(2) system call on this segment.
shm_nattch
Number of processes that have this segment attached.
The ipc_perm structure is defined as follows (the uid, gid, and mode fields are settable using IPC_SET):
struct ipc_perm {
    key_t          __key;    /* Key supplied to shmget(2) */
    uid_t          uid;      /* Effective UID of owner */
    gid_t          gid;      /* Effective GID of owner */
    uid_t          cuid;     /* Effective UID of creator */
    gid_t          cgid;     /* Effective GID of creator */
    unsigned short mode;     /* Permissions + SHM_DEST and
                                SHM_LOCKED flags */
    unsigned short __seq;    /* Sequence number */
};
The least significant 9 bits of the mode field of the ipc_perm structure define the access permissions for the shared memory segment. The permission bits are as follows:
0400 | Read by user |
0200 | Write by user |
0040 | Read by group |
0020 | Write by group |
0004 | Read by others |
0002 | Write by others |
Bits 0100, 0010, and 0001 (the execute bits) are unused by the system. (It is not necessary to have execute permission on a segment in order to perform a shmat(2) call with the SHM_EXEC flag.)
Valid values for op are:
IPC_STAT
Copy information from the kernel data structure associated with shmid into the shmid_ds structure pointed to by buf. The caller must have read permission on the shared memory segment.
IPC_SET
Write the values of some members of the shmid_ds structure pointed to by buf to the kernel data structure associated with this shared memory segment, updating also its shm_ctime member.
The following fields are updated: shm_perm.uid, shm_perm.gid, and (the least significant 9 bits of) shm_perm.mode.
The effective UID of the calling process must match the owner (shm_perm.uid) or creator (shm_perm.cuid) of the shared memory segment, or the caller must be privileged.
IPC_RMID
Mark the segment to be destroyed. The segment will actually be destroyed only after the last process detaches it (i.e., when the shm_nattch member of the associated structure shmid_ds is zero). The caller must be the owner or creator of the segment, or be privileged. The buf argument is ignored.
If a segment has been marked for destruction, then the (nonstandard) SHM_DEST flag of the shm_perm.mode field in the associated data structure retrieved by IPC_STAT will be set.
The caller must ensure that a segment is eventually destroyed; otherwise its pages that were faulted in will remain in memory or swap.
See also the description of /proc/sys/kernel/shm_rmid_forced in proc(5).
IPC_INFO (Linux-specific)
Return information about system-wide shared memory limits and parameters in the structure pointed to by buf. This structure is of type shminfo (thus, a cast is required), defined in <sys/shm.h> if the _GNU_SOURCE feature test macro is defined:
struct shminfo {
    unsigned long shmmax;  /* Maximum segment size */
    unsigned long shmmin;  /* Minimum segment size;
                              always 1 */
    unsigned long shmmni;  /* Maximum number of segments */
    unsigned long shmseg;  /* Maximum number of segments
                              that a process can attach;
                              unused within kernel */
    unsigned long shmall;  /* Maximum number of pages of
                              shared memory, system-wide */
};
The shmmni, shmmax, and shmall settings can be changed via /proc files of the same name; see proc(5) for details.
SHM_INFO (Linux-specific)
Return a shm_info structure whose fields contain information about system resources consumed by shared memory. This structure is defined in <sys/shm.h> if the _GNU_SOURCE feature test macro is defined:
struct shm_info {
    int           used_ids;        /* # of currently existing
                                      segments */
    unsigned long shm_tot;         /* Total number of shared
                                      memory pages */
    unsigned long shm_rss;         /* # of resident shared
                                      memory pages */
    unsigned long shm_swp;         /* # of swapped shared
                                      memory pages */
    unsigned long swap_attempts;   /* Unused since Linux 2.4 */
    unsigned long swap_successes;  /* Unused since Linux 2.4 */
};
SHM_STAT (Linux-specific)
Return a shmid_ds structure as for IPC_STAT. However, the shmid argument is not a segment identifier, but instead an index into the kernel’s internal array that maintains information about all shared memory segments on the system.
SHM_STAT_ANY (Linux-specific, since Linux 4.17)
Return a shmid_ds structure as for SHM_STAT. However, shm_perm.mode is not checked for read access for shmid, meaning that any user can employ this operation (just as any user may read /proc/sysvipc/shm to obtain the same information).
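The IPC_STAT, IPC_SET, and IPC_RMID operations described above can be exercised together. The following is a minimal sketch, assuming System V shared memory is available to the calling process; the helper name shm_chmod_demo() and the permission values 0600 and 0640 are illustrative choices.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/* Create a private segment of `size` bytes, change its permission
   bits with IPC_SET, read them back with IPC_STAT into *mode_out,
   and mark the segment for destruction.
   Returns 0 on success, -1 on failure. */
int shm_chmod_demo(size_t size, mode_t *mode_out)
{
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    struct shmid_ds ds;
    int rc = shmctl(id, IPC_STAT, &ds);     /* needs read permission */
    if (rc == 0) {
        ds.shm_perm.mode = 0640;            /* only the low 9 bits are applied */
        rc = shmctl(id, IPC_SET, &ds);      /* also updates shm_ctime */
    }
    if (rc == 0)
        rc = shmctl(id, IPC_STAT, &ds);
    if (rc == 0)
        *mode_out = ds.shm_perm.mode & 0777;

    shmctl(id, IPC_RMID, NULL);             /* buf is ignored for IPC_RMID */
    return rc;
}
```

Because the segment is never attached in this sketch, IPC_RMID destroys it immediately; with attached processes, destruction would be deferred until the last detach.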
The caller can prevent or allow swapping of a shared memory segment with the following op values:
SHM_LOCK (Linux-specific)
Prevent swapping of the shared memory segment. The caller must fault in any pages that are required to be present after locking is enabled. If a segment has been locked, then the (nonstandard) SHM_LOCKED flag of the shm_perm.mode field in the associated data structure retrieved by IPC_STAT will be set.
SHM_UNLOCK (Linux-specific)
Unlock the segment, allowing it to be swapped out.
Before Linux 2.6.10, only a privileged process could employ SHM_LOCK and SHM_UNLOCK. Since Linux 2.6.10, an unprivileged process can employ these operations if its effective UID matches the owner or creator UID of the segment, and (for SHM_LOCK) the amount of memory to be locked falls within the RLIMIT_MEMLOCK resource limit (see setrlimit(2)).
RETURN VALUE
A successful IPC_INFO or SHM_INFO operation returns the index of the highest used entry in the kernel’s internal array recording information about all shared memory segments. (This information can be used with repeated SHM_STAT or SHM_STAT_ANY operations to obtain information about all shared memory segments on the system.) A successful SHM_STAT operation returns the identifier of the shared memory segment whose index was given in shmid. Other operations return 0 on success.
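The enumeration pattern mentioned above (SHM_INFO to obtain the highest index, followed by repeated SHM_STAT calls) can be sketched as follows. The helper name count_visible_segments() is hypothetical, and the sketch counts only segments the caller is permitted to read.

```c
#define _GNU_SOURCE
#include <sys/ipc.h>
#include <sys/shm.h>

/* Count the shared memory segments that the caller may inspect.
   Returns the count, or -1 if SHM_INFO itself fails. */
int count_visible_segments(void)
{
    struct shm_info info;

    /* The cast is required because shmctl() is declared to take a
       struct shmid_ds pointer. */
    int max_index = shmctl(0, SHM_INFO, (struct shmid_ds *) &info);
    if (max_index == -1)
        return -1;

    int count = 0;
    for (int i = 0; i <= max_index; i++) {
        struct shmid_ds ds;

        /* For SHM_STAT the first argument is an index, not an
           identifier; unused slots fail with EINVAL and unreadable
           segments fail with EACCES. */
        if (shmctl(i, SHM_STAT, &ds) != -1)
            count++;
    }
    return count;
}
```

A successful SHM_STAT call returns the segment identifier for the given index, so the loop could collect identifiers instead of merely counting them.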
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
IPC_STAT or SHM_STAT is requested and shm_perm.mode does not allow read access for shmid, and the calling process does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EFAULT
The argument op has value IPC_SET or IPC_STAT but the address pointed to by buf isn’t accessible.
EIDRM
shmid points to a removed identifier.
EINVAL
shmid is not a valid identifier, or op is not a valid operation. Or: for a SHM_STAT or SHM_STAT_ANY operation, the index value specified in shmid referred to an array slot that is currently unused.
ENOMEM
(Since Linux 2.6.9), SHM_LOCK was specified and the size of the to-be-locked segment would mean that the total bytes in locked shared memory segments would exceed the limit for the real user ID of the calling process. This limit is defined by the RLIMIT_MEMLOCK soft resource limit (see setrlimit(2)).
EOVERFLOW
IPC_STAT is attempted, and the GID or UID value is too large to be stored in the structure pointed to by buf.
EPERM
IPC_SET or IPC_RMID is attempted, and the effective user ID of the calling process is not that of the creator (found in shm_perm.cuid), or the owner (found in shm_perm.uid), and the process was not privileged (Linux: did not have the CAP_SYS_ADMIN capability).
Or (before Linux 2.6.9), SHM_LOCK or SHM_UNLOCK was specified, but the process was not privileged (Linux: did not have the CAP_IPC_LOCK capability). (Since Linux 2.6.9, this error can also occur if the RLIMIT_MEMLOCK is 0 and the caller is not privileged.)
VERSIONS
Linux permits a process to attach (shmat(2)) a shared memory segment that has already been marked for deletion using shmctl(IPC_RMID). This feature is not available on other UNIX implementations; portable applications should avoid relying on it.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
Various fields in a struct shmid_ds were typed as short under Linux 2.2 and have become long under Linux 2.4. To take advantage of this, a recompilation under glibc-2.1.91 or later should suffice. (The kernel distinguishes old and new calls by an IPC_64 flag in op.)
NOTES
The IPC_INFO, SHM_STAT, and SHM_INFO operations are used by the ipcs(1) program to provide information on allocated resources. In the future, these may be modified or moved to a /proc filesystem interface.
SEE ALSO
mlock(2), setrlimit(2), shmget(2), shmop(2), capabilities(7), sysvipc(7)
48 - Linux cli command llseek
NAME 🖥️ llseek 🖥️
reposition read/write file offset
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS__llseek, unsigned int fd, unsigned long offset_high,
unsigned long offset_low, loff_t *result,
unsigned int whence);
Note: glibc provides no wrapper for _llseek(), necessitating the use of syscall(2).
DESCRIPTION
Note: for information about the llseek(3) library function, see lseek64(3).
The _llseek() system call repositions the offset of the open file description associated with the file descriptor fd to the value
(offset_high << 32) | offset_low
This new offset is a byte offset relative to the beginning of the file, the current file offset, or the end of the file, depending on whether whence is SEEK_SET, SEEK_CUR, or SEEK_END, respectively.
The new file offset is returned in the argument result. The type loff_t is a 64-bit signed type.
This system call exists on various 32-bit platforms to support seeking to large file offsets.
RETURN VALUE
Upon successful completion, _llseek() returns 0. Otherwise, a value of -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
fd is not an open file descriptor.
EFAULT
Problem with copying results to user space.
EINVAL
whence is invalid.
VERSIONS
You probably want to use the lseek(2) wrapper function instead.
STANDARDS
Linux.
SEE ALSO
lseek(2), open(2), lseek64(3)
49 - Linux cli command security
NAME 🖥️ security 🖥️
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
50 - Linux cli command eventfd
NAME 🖥️ eventfd 🖥️
create a file descriptor for event notification
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/eventfd.h>
int eventfd(unsigned int initval, int flags);
DESCRIPTION
eventfd() creates an “eventfd object” that can be used as an event wait/notify mechanism by user-space applications, and by the kernel to notify user-space applications of events. The object contains an unsigned 64-bit integer (uint64_t) counter that is maintained by the kernel. This counter is initialized with the value specified in the argument initval.
As its return value, eventfd() returns a new file descriptor that can be used to refer to the eventfd object.
The following values may be bitwise ORed in flags to change the behavior of eventfd():
EFD_CLOEXEC (since Linux 2.6.27)
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
EFD_NONBLOCK (since Linux 2.6.27)
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
EFD_SEMAPHORE (since Linux 2.6.30)
Provide semaphore-like semantics for reads from the new file descriptor. See below.
Up to Linux 2.6.26, the flags argument is unused, and must be specified as zero.
The following operations can be performed on the file descriptor returned by eventfd():
read(2)
Each successful read(2) returns an 8-byte integer. A read(2) fails with the error EINVAL if the size of the supplied buffer is less than 8 bytes.
The value returned by read(2) is in host byte order, that is, the native byte order for integers on the host machine.
The semantics of read(2) depend on whether the eventfd counter currently has a nonzero value and whether the EFD_SEMAPHORE flag was specified when creating the eventfd file descriptor:
If EFD_SEMAPHORE was not specified and the eventfd counter has a nonzero value, then a read(2) returns 8 bytes containing that value, and the counter’s value is reset to zero.
If EFD_SEMAPHORE was specified and the eventfd counter has a nonzero value, then a read(2) returns 8 bytes containing the value 1, and the counter’s value is decremented by 1.
If the eventfd counter is zero at the time of the call to read(2), then the call either blocks until the counter becomes nonzero (at which time, the read(2) proceeds as described above) or fails with the error EAGAIN if the file descriptor has been made nonblocking.
write(2)
A write(2) call adds the 8-byte integer value supplied in its buffer to the counter. The maximum value that may be stored in the counter is the largest unsigned 64-bit value minus 1 (i.e., 0xfffffffffffffffe). If the addition would cause the counter’s value to exceed the maximum, then the write(2) either blocks until a read(2) is performed on the file descriptor, or fails with the error EAGAIN if the file descriptor has been made nonblocking.
A write(2) fails with the error EINVAL if the size of the supplied buffer is less than 8 bytes, or if an attempt is made to write the value 0xffffffffffffffff.
poll(2)
select(2)
(and similar)
The returned file descriptor supports poll(2) (and analogously epoll(7)) and select(2), as follows:
The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if the counter has a value greater than 0.
The file descriptor is writable (the select(2) writefds argument; the poll(2) POLLOUT flag) if it is possible to write a value of at least “1” without blocking.
If an overflow of the counter value was detected, then select(2) indicates the file descriptor as being both readable and writable, and poll(2) returns a POLLERR event. As noted above, write(2) can never overflow the counter. However, an overflow can occur if 2^64 eventfd “signal posts” were performed by the KAIO subsystem (theoretically possible, but practically unlikely). If an overflow has occurred, then read(2) will return the maximum uint64_t value (i.e., 0xffffffffffffffff).
The eventfd file descriptor also supports the other file-descriptor multiplexing APIs: pselect(2) and ppoll(2).
close(2)
When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same eventfd object have been closed, the resources for the object are freed by the kernel.
A copy of the file descriptor created by eventfd() is inherited by the child produced by fork(2). The duplicate file descriptor is associated with the same eventfd object. File descriptors created by eventfd() are preserved across execve(2), unless the close-on-exec flag has been set.
RETURN VALUE
On success, eventfd() returns a new eventfd file descriptor. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EINVAL
An unsupported value was specified in flags.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
Could not mount (internal) anonymous inode device.
ENOMEM
There was insufficient memory to create a new eventfd file descriptor.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
eventfd() | Thread safety | MT-Safe |
VERSIONS
C library/kernel differences
There are two underlying Linux system calls: eventfd() and the more recent eventfd2(). The former system call does not implement a flags argument. The latter system call implements the flags values described above. The glibc wrapper function will use eventfd2() where it is available.
Additional glibc features
The GNU C library defines an additional type, and two functions that attempt to abstract some of the details of reading and writing on an eventfd file descriptor:
typedef uint64_t eventfd_t;
int eventfd_read(int fd, eventfd_t *value);
int eventfd_write(int fd, eventfd_t value);
The functions perform the read and write operations on an eventfd file descriptor, returning 0 if the correct number of bytes was transferred, or -1 otherwise.
STANDARDS
Linux, GNU.
HISTORY
eventfd()
Linux 2.6.22, glibc 2.8.
eventfd2()
Linux 2.6.27 (see VERSIONS). Since glibc 2.9, the eventfd() wrapper will employ the eventfd2() system call, if it is supported by the kernel.
NOTES
Applications can use an eventfd file descriptor instead of a pipe (see pipe(2)) in all cases where a pipe is used simply to signal events. The kernel overhead of an eventfd file descriptor is much lower than that of a pipe, and only one file descriptor is required (versus the two required for a pipe).
When used in the kernel, an eventfd file descriptor can provide a bridge from kernel to user space, allowing, for example, functionalities like KAIO (kernel AIO) to signal to a file descriptor that some operation is complete.
A key point about an eventfd file descriptor is that it can be monitored just like any other file descriptor using select(2), poll(2), or epoll(7). This means that an application can simultaneously monitor the readiness of “traditional” files and the readiness of other kernel mechanisms that support the eventfd interface. (Without the eventfd() interface, these mechanisms could not be multiplexed via select(2), poll(2), or epoll(7).)
The current value of an eventfd counter can be viewed via the entry for the corresponding file descriptor in the process’s /proc/pid/fdinfo directory. See proc(5) for further details.
EXAMPLES
The following program creates an eventfd file descriptor and then forks to create a child process. While the parent briefly sleeps, the child writes each of the integers supplied in the program’s command-line arguments to the eventfd file descriptor. When the parent has finished sleeping, it reads from the eventfd file descriptor.
The following shell session shows a sample run of the program:
$ ./a.out 1 2 4 7 14
Child writing 1 to efd
Child writing 2 to efd
Child writing 4 to efd
Child writing 7 to efd
Child writing 14 to efd
Child completed write loop
Parent about to read
Parent read 28 (0x1c) from efd
Program source
#include <err.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int       efd;
    uint64_t  u;
    ssize_t   s;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <num>...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    efd = eventfd(0, 0);
    if (efd == -1)
        err(EXIT_FAILURE, "eventfd");

    switch (fork()) {
    case 0:
        for (size_t j = 1; j < argc; j++) {
            printf("Child writing %s to efd\n", argv[j]);
            u = strtoull(argv[j], NULL, 0);
                    /* strtoull() allows various bases */
            s = write(efd, &u, sizeof(uint64_t));
            if (s != sizeof(uint64_t))
                err(EXIT_FAILURE, "write");
        }
        printf("Child completed write loop\n");

        exit(EXIT_SUCCESS);

    default:
        sleep(2);

        printf("Parent about to read\n");
        s = read(efd, &u, sizeof(uint64_t));
        if (s != sizeof(uint64_t))
            err(EXIT_FAILURE, "read");
        printf("Parent read %"PRIu64" (%#"PRIx64") from efd\n", u, u);
        exit(EXIT_SUCCESS);

    case -1:
        err(EXIT_FAILURE, "fork");
    }
}
SEE ALSO
futex(2), pipe(2), poll(2), read(2), select(2), signalfd(2), timerfd_create(2), write(2), epoll(7), sem_overview(7)
51 - Linux cli command s390_pci_mmio_read
NAME 🖥️ s390_pci_mmio_read 🖥️
transfer data to/from PCI MMIO memory page
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_s390_pci_mmio_write, unsigned long mmio_addr,
const void user_buffer[.length], size_t length);
int syscall(SYS_s390_pci_mmio_read, unsigned long mmio_addr,
void user_buffer[.length], size_t length);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
The s390_pci_mmio_write() system call writes length bytes of data from the user-space buffer user_buffer to the PCI MMIO memory location specified by mmio_addr. The s390_pci_mmio_read() system call reads length bytes of data from the PCI MMIO memory location specified by mmio_addr to the user-space buffer user_buffer.
These system calls must be used instead of the simple assignment or data-transfer operations that are used to access the PCI MMIO memory areas mapped to user space on the Linux System z platform. The address specified by mmio_addr must belong to a PCI MMIO memory page mapping in the caller’s address space, and the data being written or read must not cross a page boundary. The length value cannot be greater than the system page size.
RETURN VALUE
On success, s390_pci_mmio_write() and s390_pci_mmio_read() return 0. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
The address in mmio_addr is invalid.
EFAULT
user_buffer does not point to a valid location in the caller’s address space.
EINVAL
Invalid length argument.
ENODEV
PCI support is not enabled.
ENOMEM
Insufficient memory.
STANDARDS
Linux on s390.
HISTORY
Linux 3.19. System z EC12.
SEE ALSO
syscall(2)
52 - Linux cli command personality
NAME 🖥️ personality 🖥️
set the process execution domain
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/personality.h>
int personality(unsigned long persona);
DESCRIPTION
Linux supports different execution domains, or personalities, for each process. Among other things, execution domains tell Linux how to map signal numbers into signal actions. The execution domain system allows Linux to provide limited support for binaries compiled under other UNIX-like operating systems.
If persona is not 0xffffffff, then personality() sets the caller’s execution domain to the value specified by persona. Specifying persona as 0xffffffff provides a way of retrieving the current persona without changing it.
A list of the available execution domains can be found in <sys/personality.h>. The execution domain is a 32-bit value in which the top three bytes are set aside for flags that cause the kernel to modify the behavior of certain system calls so as to emulate historical or architectural quirks. The least significant byte is a value defining the personality the kernel should assume. The flag values are as follows:
ADDR_COMPAT_LAYOUT (since Linux 2.6.9)
With this flag set, provide legacy virtual address space layout.
ADDR_NO_RANDOMIZE (since Linux 2.6.12)
With this flag set, disable address-space-layout randomization.
ADDR_LIMIT_32BIT (since Linux 2.2)
Limit the address space to 32 bits.
ADDR_LIMIT_3GB (since Linux 2.4.0)
With this flag set, use 0xc0000000 as the offset at which to search a virtual memory chunk on mmap(2); otherwise use 0xffffe000. Applies to 32-bit x86 processes only.
FDPIC_FUNCPTRS (since Linux 2.6.11)
User-space function pointers to signal handlers point to descriptors. Applies only to ARM (if BINFMT_ELF_FDPIC is enabled) and SuperH.
MMAP_PAGE_ZERO (since Linux 2.4.0)
Map page 0 as read-only (to support binaries that depend on this SVr4 behavior).
READ_IMPLIES_EXEC (since Linux 2.6.8)
With this flag set, PROT_READ implies PROT_EXEC for mmap(2).
SHORT_INODE (since Linux 2.4.0)
No effect.
STICKY_TIMEOUTS (since Linux 1.2.0)
With this flag set, select(2), pselect(2), and ppoll(2) do not modify the returned timeout argument when interrupted by a signal handler.
UNAME26 (since Linux 3.1)
Have uname(2) report a 2.6.(40+x) version number rather than a MAJOR.x version number. Added as a stopgap measure to support broken applications that could not handle the kernel version-numbering switch from Linux 2.6.x to Linux 3.x.
WHOLE_SECONDS (since Linux 1.2.0)
No effect.
The available execution domains are:
PER_BSD (since Linux 1.2.0)
BSD. (No effects.)
PER_HPUX (since Linux 2.4)
Support for 32-bit HP/UX. This support was never complete, and was dropped so that since Linux 4.0, this value has no effect.
PER_IRIX32 (since Linux 2.2)
IRIX 5 32-bit. Never fully functional; support dropped in Linux 2.6.27. Implies STICKY_TIMEOUTS.
PER_IRIX64 (since Linux 2.2)
IRIX 6 64-bit. Implies STICKY_TIMEOUTS; otherwise no effect.
PER_IRIXN32 (since Linux 2.2)
IRIX 6 new 32-bit. Implies STICKY_TIMEOUTS; otherwise no effect.
PER_ISCR4 (since Linux 1.2.0)
Implies STICKY_TIMEOUTS; otherwise no effect.
PER_LINUX (since Linux 1.2.0)
Linux.
PER_LINUX32 (since Linux 2.2)
uname(2) returns the name of the 32-bit architecture in the machine field (“i686” instead of “x86_64”, etc.).
Under ia64 (Itanium), processes with this personality don’t have the O_LARGEFILE open(2) flag forced.
Under 64-bit ARM, setting this personality is forbidden if execve(2)ing a 32-bit process would also be forbidden (cf. the allow_mismatched_32bit_el0 kernel parameter and Documentation/arm64/asymmetric-32bit.rst).
PER_LINUX32_3GB (since Linux 2.4)
Same as PER_LINUX32, but implies ADDR_LIMIT_3GB.
PER_LINUX_32BIT (since Linux 2.0)
Same as PER_LINUX, but implies ADDR_LIMIT_32BIT.
PER_LINUX_FDPIC (since Linux 2.6.11)
Same as PER_LINUX, but implies FDPIC_FUNCPTRS.
PER_OSF4 (since Linux 2.4)
OSF/1 v4. No effect since Linux 6.1, which removed a.out binary support. Before, on alpha, would clear top 32 bits of iov_len in the user’s buffer for compatibility with old versions of OSF/1 where iov_len was defined as int.
PER_OSR5 (since Linux 2.4)
SCO OpenServer 5. Implies STICKY_TIMEOUTS and WHOLE_SECONDS; otherwise no effect.
PER_RISCOS (since Linux 2.3.7; macro since Linux 2.3.13)
Acorn RISC OS/Arthur (MIPS). No effect. Up to Linux v4.0, would set the emulation altroot to /usr/gnemul/riscos (cf. PER_SUNOS, below). Before then, up to Linux 2.6.3, just Arthur emulation.
PER_SCOSVR3 (since Linux 1.2.0)
SCO UNIX System V Release 3. Same as PER_OSR5, but also implies SHORT_INODE.
PER_SOLARIS (since Linux 2.4)
Solaris. Implies STICKY_TIMEOUTS; otherwise no effect.
PER_SUNOS (since Linux 2.4.0)
Sun OS. Same as PER_BSD, but implies STICKY_TIMEOUTS. Prior to Linux 2.6.26, diverted library and dynamic linker searches to /usr/gnemul. Buggy, largely unmaintained, and almost entirely unused.
PER_SVR3 (since Linux 1.2.0)
AT&T UNIX System V Release 3. Implies STICKY_TIMEOUTS and SHORT_INODE; otherwise no effect.
PER_SVR4 (since Linux 1.2.0)
AT&T UNIX System V Release 4. Implies STICKY_TIMEOUTS and MMAP_PAGE_ZERO; otherwise no effect.
PER_UW7 (since Linux 2.4)
UnixWare 7. Implies STICKY_TIMEOUTS and MMAP_PAGE_ZERO; otherwise no effect.
PER_WYSEV386 (since Linux 1.2.0)
WYSE UNIX System V/386. Implies STICKY_TIMEOUTS and SHORT_INODE; otherwise no effect.
PER_XENIX (since Linux 1.2.0)
XENIX. Implies STICKY_TIMEOUTS and SHORT_INODE; otherwise no effect.
RETURN VALUE
On success, the previous persona is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
The kernel was unable to change the personality.
STANDARDS
Linux.
HISTORY
Linux 1.1.20, glibc 2.3.
SEE ALSO
setarch(8)
53 - Linux cli command getresuid
NAME 🖥️ getresuid 🖥️
get real, effective, and saved user/group IDs
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
int getresuid(uid_t *ruid, uid_t *euid, uid_t *suid);
int getresgid(gid_t *rgid, gid_t *egid, gid_t *sgid);
DESCRIPTION
getresuid() returns the real UID, the effective UID, and the saved set-user-ID of the calling process, in the arguments ruid, euid, and suid, respectively. getresgid() performs the analogous task for the process’s group IDs.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
One of the arguments specified an address outside the calling program’s address space.
STANDARDS
None. These calls also appear on HP-UX and some of the BSDs.
HISTORY
Linux 2.1.44, glibc 2.3.2.
The original Linux getresuid() and getresgid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added getresuid32() and getresgid32(), supporting 32-bit IDs. The glibc getresuid() and getresgid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getuid(2), setresuid(2), setreuid(2), setuid(2), credentials(7)
54 - Linux cli command pidfd_open
NAME 🖥️ pidfd_open 🖥️
obtain a file descriptor that refers to a process
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_pidfd_open, pid_t pid, unsigned int flags);
Note: glibc provides no wrapper for pidfd_open(), necessitating the use of syscall(2).
DESCRIPTION
The pidfd_open() system call creates a file descriptor that refers to the process whose PID is specified in pid. The file descriptor is returned as the function result; the close-on-exec flag is set on the file descriptor.
The flags argument either has the value 0, or contains the following flag:
PIDFD_NONBLOCK (since Linux 5.10)
Return a nonblocking file descriptor. If the process referred to by the file descriptor has not yet terminated, then an attempt to wait on the file descriptor using waitid(2) will immediately return the error EAGAIN rather than blocking.
RETURN VALUE
On success, pidfd_open() returns a file descriptor (a nonnegative integer). On error, -1 is returned and errno is set to indicate the error.
ERRORS
EINVAL
flags is not valid.
EINVAL
pid is not valid.
EMFILE
The per-process limit on the number of open file descriptors has been reached (see the description of RLIMIT_NOFILE in getrlimit(2)).
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
The anonymous inode filesystem is not available in this kernel.
ENOMEM
Insufficient kernel memory was available.
ESRCH
The process specified by pid does not exist.
STANDARDS
Linux.
HISTORY
Linux 5.3.
NOTES
The following code sequence can be used to obtain a file descriptor for the child of fork(2):
pid = fork();
if (pid > 0) { /* If parent */
pidfd = pidfd_open(pid, 0);
...
}
Even if the child has already terminated by the time of the pidfd_open() call, its PID will not have been recycled and the returned file descriptor will refer to the resulting zombie process. Note, however, that this is guaranteed only if the following conditions hold true:
the disposition of SIGCHLD has not been explicitly set to SIG_IGN (see sigaction(2));
the SA_NOCLDWAIT flag was not specified while establishing a handler for SIGCHLD or while setting the disposition of that signal to SIG_DFL (see sigaction(2)); and
the zombie process was not reaped elsewhere in the program (e.g., either by an asynchronously executed signal handler or by wait(2) or similar in another thread).
If any of these conditions does not hold, then the child process (along with a PID file descriptor that refers to it) should instead be created using clone(2) with the CLONE_PIDFD flag.
Use cases for PID file descriptors
A PID file descriptor returned by pidfd_open() (or by clone(2) with the CLONE_PIDFD flag) can be used for the following purposes:
The pidfd_send_signal(2) system call can be used to send a signal to the process referred to by a PID file descriptor.
A PID file descriptor can be monitored using poll(2), select(2), and epoll(7). When the process that it refers to terminates, these interfaces indicate the file descriptor as readable. Note, however, that in the current implementation, nothing can be read from the file descriptor (read(2) on the file descriptor fails with the error EINVAL).
If the PID file descriptor refers to a child of the calling process, then it can be waited on using waitid(2).
The pidfd_getfd(2) system call can be used to obtain a duplicate of a file descriptor of another process referred to by a PID file descriptor.
A PID file descriptor can be used as the argument of setns(2) in order to move into one or more of the same namespaces as the process referred to by the file descriptor.
A PID file descriptor can be used as the argument of process_madvise(2) in order to provide advice on the memory usage patterns of the process referred to by the file descriptor.
The pidfd_open() system call is the preferred way of obtaining a PID file descriptor for an already existing process. The alternative is to obtain a file descriptor by opening a /proc/pid directory. However, the latter technique is possible only if the proc(5) filesystem is mounted; furthermore, the file descriptor obtained in this way is not pollable and can’t be waited on with waitid(2).
EXAMPLES
The program below opens a PID file descriptor for the process whose PID is specified as its command-line argument. It then uses poll(2) to monitor the file descriptor for process exit, as indicated by a POLLIN event.
Program source
#define _GNU_SOURCE
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static int
pidfd_open(pid_t pid, unsigned int flags)
{
    return syscall(SYS_pidfd_open, pid, flags);
}

int
main(int argc, char *argv[])
{
    int            pidfd, ready;
    struct pollfd  pollfd;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pid>\n", argv[0]);
        exit(EXIT_FAILURE);     /* A usage error is a failure */
    }

    pidfd = pidfd_open(atoi(argv[1]), 0);
    if (pidfd == -1) {
        perror("pidfd_open");
        exit(EXIT_FAILURE);
    }

    pollfd.fd = pidfd;
    pollfd.events = POLLIN;

    /* Block until the process referred to by pidfd terminates */
    ready = poll(&pollfd, 1, -1);
    if (ready == -1) {
        perror("poll");
        exit(EXIT_FAILURE);
    }

    printf("Events (%#x): POLLIN is %sset\n", pollfd.revents,
           (pollfd.revents & POLLIN) ? "" : "not ");

    close(pidfd);
    exit(EXIT_SUCCESS);
}
SEE ALSO
clone(2), kill(2), pidfd_getfd(2), pidfd_send_signal(2), poll(2), process_madvise(2), select(2), setns(2), waitid(2), epoll(7)
55 - Linux cli command exit
NAME 🖥️ exit 🖥️
terminate the calling process
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
[[noreturn]] void _exit(int status);
#include <stdlib.h>
[[noreturn]] void _Exit(int status);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
_Exit():
_ISOC99_SOURCE || _POSIX_C_SOURCE >= 200112L
DESCRIPTION
_exit() terminates the calling process “immediately”. Any open file descriptors belonging to the process are closed. Any children of the process are inherited by init(1) (or by the nearest “subreaper” process as defined through the use of the prctl(2) PR_SET_CHILD_SUBREAPER operation). The process’s parent is sent a SIGCHLD signal.
The value status & 0xFF is returned to the parent process as the process’s exit status, and can be collected by the parent using one of the wait(2) family of calls.
The function _Exit() is equivalent to _exit().
RETURN VALUE
These functions do not return.
STANDARDS
_exit()
POSIX.1-2008.
_Exit()
C11, POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
_Exit() was introduced by C99.
NOTES
For a discussion on the effects of an exit, the transmission of exit status, zombie processes, signals sent, and so on, see exit(3).
The function _exit() is like exit(3), but does not call any functions registered with atexit(3) or on_exit(3). Open stdio(3) streams are not flushed. On the other hand, _exit() does close open file descriptors, and this may cause an unknown delay, waiting for pending output to finish. If the delay is undesired, it may be useful to call functions like tcflush(3) before calling _exit(). Whether any pending I/O is canceled, and which pending I/O may be canceled upon _exit(), is implementation-dependent.
C library/kernel differences
The text above in DESCRIPTION describes the traditional effect of _exit(), which is to terminate a process, and these are the semantics specified by POSIX.1 and implemented by the C library wrapper function. On modern systems, this means termination of all threads in the process.
By contrast with the C library wrapper function, the raw Linux _exit() system call terminates only the calling thread, and actions such as reparenting child processes or sending SIGCHLD to the parent process are performed only if this is the last thread in the thread group.
Up to glibc 2.3, the _exit() wrapper function invoked the kernel system call of the same name. Since glibc 2.3, the wrapper function invokes exit_group(2), in order to terminate all of the threads in a process.
SEE ALSO
execve(2), exit_group(2), fork(2), kill(2), wait(2), wait4(2), waitpid(2), atexit(3), exit(3), on_exit(3), termios(3)
56 - Linux cli command chdir
NAME 🖥️ chdir 🖥️
change working directory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int chdir(const char *path);
int fchdir(int fd);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchdir():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| /* glibc up to and including 2.19: */ _BSD_SOURCE
DESCRIPTION
chdir() changes the current working directory of the calling process to the directory specified in path.
fchdir() is identical to chdir(); the only difference is that the directory is given as an open file descriptor.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, other errors can be returned. The more general errors for chdir() are listed below:
EACCES
Search permission is denied for one of the components of path. (See also path_resolution(7).)
EFAULT
path points outside your accessible address space.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving path.
ENAMETOOLONG
path is too long.
ENOENT
The directory specified in path does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of path is not a directory.
The general errors for fchdir() are listed below:
EACCES
Search permission was denied on the directory open on fd.
EBADF
fd is not a valid file descriptor.
ENOTDIR
fd does not refer to a directory.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD.
NOTES
The current working directory is the starting point for interpreting relative pathnames (those not starting with ‘/’).
A child process created via fork(2) inherits its parent’s current working directory. The current working directory is left unchanged by execve(2).
SEE ALSO
chroot(2), getcwd(3), path_resolution(7)
57 - Linux cli command migrate_pages
NAME 🖥️ migrate_pages 🖥️
move all pages in a process to another set of nodes
LIBRARY
NUMA (Non-Uniform Memory Access) policy library (libnuma, -lnuma)
SYNOPSIS
#include <numaif.h>
long migrate_pages(int pid, unsigned long maxnode,
const unsigned long *old_nodes,
const unsigned long *new_nodes);
DESCRIPTION
migrate_pages() attempts to move all pages of the process pid that are in memory nodes old_nodes to the memory nodes in new_nodes. Pages not located in any node in old_nodes will not be migrated. As far as possible, the kernel maintains the relative topology relationship inside old_nodes during the migration to new_nodes.
The old_nodes and new_nodes arguments are pointers to bit masks of node numbers, with up to maxnode bits in each mask. These masks are maintained as arrays of unsigned long integers (in the last long integer, the bits beyond those specified by maxnode are ignored). The maxnode argument is the maximum node number in the bit mask plus one (this is the same as in mbind(2), but different from select(2)).
The pid argument is the ID of the process whose pages are to be moved. To move pages in another process, the caller must be privileged (CAP_SYS_NICE) or the real or effective user ID of the calling process must match the real or saved-set user ID of the target process. If pid is 0, then migrate_pages() moves pages of the calling process.
Pages shared with another process will be moved only if the initiating process has the CAP_SYS_NICE privilege.
RETURN VALUE
On success, migrate_pages() returns the number of pages that could not be moved (i.e., a return of zero means that all pages were successfully moved). On error, it returns -1, and sets errno to indicate the error.
ERRORS
EFAULT
Part or all of the memory range specified by old_nodes/new_nodes and maxnode points outside your accessible address space.
EINVAL
The value specified by maxnode exceeds a kernel-imposed limit. Or, old_nodes or new_nodes specifies one or more node IDs that are greater than the maximum supported node ID. Or, none of the node IDs specified by new_nodes are on-line and allowed by the process’s current cpuset context, or none of the specified nodes contain memory.
EPERM
Insufficient privilege (CAP_SYS_NICE) to move pages of the process specified by pid, or insufficient privilege (CAP_SYS_NICE) to access the specified target nodes.
ESRCH
No process matching pid could be found.
STANDARDS
Linux.
HISTORY
Linux 2.6.16.
NOTES
For information on library support, see numa(7).
Use get_mempolicy(2) with the MPOL_F_MEMS_ALLOWED flag to obtain the set of nodes that are allowed by the calling process’s cpuset. Note that this information is subject to change at any time by manual or automatic reconfiguration of the cpuset.
Use of migrate_pages() may result in pages whose location (node) violates the memory policy established for the specified addresses (see mbind(2)) and/or the specified process (see set_mempolicy(2)). That is, memory policy does not constrain the destination nodes used by migrate_pages().
The <numaif.h> header is not included with glibc, but requires installing libnuma-devel or a similar package.
SEE ALSO
get_mempolicy(2), mbind(2), set_mempolicy(2), numa(3), numa_maps(5), cpuset(7), numa(7), migratepages(8), numastat(8)
Documentation/vm/page_migration.rst in the Linux kernel source tree
58 - Linux cli command mq_getsetattr
NAME 🖥️ mq_getsetattr 🖥️
get/set message queue attributes
SYNOPSIS
#include <mqueue.h> /* Definition of struct mq_attr */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_mq_getsetattr, mqd_t mqdes,
const struct mq_attr *newattr, struct mq_attr *oldattr);
DESCRIPTION
Do not use this system call.
This is the low-level system call used to implement mq_getattr(3) and mq_setattr(3). For an explanation of how this system call operates, see the description of mq_setattr(3).
STANDARDS
None.
NOTES
Never call it unless you are writing a C library!
SEE ALSO
mq_getattr(3), mq_overview(7)
59 - Linux cli command pwritev
NAME 🖥️ pwritev 🖥️
read or write data into multiple buffers
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/uio.h>
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
preadv(), pwritev():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
The readv() system call reads iovcnt buffers from the file associated with the file descriptor fd into the buffers described by iov (“scatter input”).
The writev() system call writes iovcnt buffers of data described by iov to the file associated with the file descriptor fd (“gather output”).
The pointer iov points to an array of iovec structures, described in iovec(3type).
The readv() system call works just like read(2) except that multiple buffers are filled.
The writev() system call works just like write(2) except that multiple buffers are written out.
Buffers are processed in array order. This means that readv() completely fills iov[0] before proceeding to iov[1], and so on. (If there is insufficient data, then not all buffers pointed to by iov may be filled.) Similarly, writev() writes out the entire contents of iov[0] before proceeding to iov[1], and so on.
The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes; analogously, readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description (see open(2)).
preadv() and pwritev()
The preadv() system call combines the functionality of readv() and pread(2). It performs the same task as readv(), but adds a fourth argument, offset, which specifies the file offset at which the input operation is to be performed.
The pwritev() system call combines the functionality of writev() and pwrite(2). It performs the same task as writev(), but adds a fourth argument, offset, which specifies the file offset at which the output operation is to be performed.
The file offset is not changed by these system calls. The file referred to by fd must be capable of seeking.
preadv2() and pwritev2()
These system calls are similar to preadv() and pwritev() calls, but add a fifth argument, flags, which modifies the behavior on a per-call basis.
Unlike preadv() and pwritev(), if the offset argument is -1, then the current file offset is used and updated.
The flags argument contains a bitwise OR of zero or more of the following flags:
RWF_DSYNC (since Linux 4.7)
Provide a per-write equivalent of the O_DSYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_HIPRI (since Linux 4.6)
High priority read/write. Allows block-based filesystems to use polling of the device, which provides lower latency, but may use additional resources. (Currently, this feature is usable only on a file descriptor opened using the O_DIRECT flag.)
RWF_SYNC (since Linux 4.7)
Provide a per-write equivalent of the O_SYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_NOWAIT (since Linux 4.14)
Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read. If no bytes were read, it will return -1 and set errno to EAGAIN (but see BUGS). Currently, this flag is meaningful only for preadv2().
RWF_APPEND (since Linux 4.16)
Provide a per-write equivalent of the O_APPEND open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call. The offset argument does not affect the write operation; the data is always appended to the end of the file. However, if the offset argument is -1, the current file offset is updated.
RETURN VALUE
On success, readv(), preadv(), and preadv2() return the number of bytes read; writev(), pwritev(), and pwritev2() return the number of bytes written.
Note that it is not an error for a successful call to transfer fewer bytes than requested (see read(2) and write(2)).
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The errors are as given for read(2) and write(2). Furthermore, preadv(), preadv2(), pwritev(), and pwritev2() can also fail for the same reasons as lseek(2). Additionally, the following errors are defined:
EINVAL
The sum of the iov_len values overflows an ssize_t value.
EINVAL
The vector count, iovcnt, is less than zero or greater than the permitted maximum.
EOPNOTSUPP
An unknown flag is specified in flags.
VERSIONS
C library/kernel differences
The raw preadv() and pwritev() system calls have call signatures that differ slightly from that of the corresponding GNU C library wrapper functions shown in the SYNOPSIS. The final argument, offset, is unpacked by the wrapper functions into two arguments in the system calls:
unsigned long pos_l, unsigned long pos_h
These arguments contain, respectively, the low order and high order 32 bits of offset.
STANDARDS
readv()
writev()
POSIX.1-2008.
preadv()
pwritev()
BSD.
preadv2()
pwritev2()
Linux.
HISTORY
readv()
writev()
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
preadv(), pwritev(): Linux 2.6.30, glibc 2.10.
preadv2(), pwritev2(): Linux 4.6, glibc 2.26.
Historical C library/kernel differences
To deal with the fact that IOV_MAX was so low on early versions of Linux, the glibc wrapper functions for readv() and writev() did some extra work if they detected that the underlying kernel system call failed because this limit was exceeded. In the case of readv(), the wrapper function allocated a temporary buffer large enough for all of the items specified by iov, passed that buffer in a call to read(2), copied data from the buffer to the locations specified by the iov_base fields of the elements of iov, and then freed the buffer. The wrapper function for writev() performed the analogous task using a temporary buffer and a call to write(2).
The need for this extra effort in the glibc wrapper functions went away with Linux 2.2 and later. However, glibc continued to provide this behavior until glibc 2.10. Starting with glibc 2.9, the wrapper functions provide this behavior only if the library detects that the system is running a Linux kernel older than Linux 2.6.18 (an arbitrarily selected kernel version). And since glibc 2.20 (which requires a minimum of Linux 2.6.32), the glibc wrapper functions always just directly invoke the system calls.
NOTES
POSIX.1 allows an implementation to place a limit on the number of items that can be passed in iov. An implementation can advertise its limit by defining IOV_MAX in <limits.h> or at run time via the return value from sysconf(_SC_IOV_MAX). On modern Linux systems, the limit is 1024. Back in Linux 2.0 days, this limit was 16.
BUGS
Linux 5.9 and Linux 5.10 have a bug where preadv2() with the RWF_NOWAIT flag may return 0 even when not at end of file.
EXAMPLES
The following code sample demonstrates the use of writev():
char          *str0 = "hello ";
char          *str1 = "world\n";
ssize_t       nwritten;
struct iovec  iov[2];

iov[0].iov_base = str0;
iov[0].iov_len = strlen(str0);
iov[1].iov_base = str1;
iov[1].iov_len = strlen(str1);

nwritten = writev(STDOUT_FILENO, iov, 2);
SEE ALSO
pread(2), read(2), write(2)
60 - Linux cli command select
NAME 🖥️ select 🖥️
synchronous I/O multiplexing
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/select.h>
typedef /* ... */ fd_set;
int select(int nfds, fd_set *_Nullable restrict readfds,
fd_set *_Nullable restrict writefds,
fd_set *_Nullable restrict exceptfds,
struct timeval *_Nullable restrict timeout);
void FD_CLR(int fd, fd_set *set);
int FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);
int pselect(int nfds, fd_set *_Nullable restrict readfds,
fd_set *_Nullable restrict writefds,
fd_set *_Nullable restrict exceptfds,
const struct timespec *_Nullable restrict timeout,
const sigset_t *_Nullable restrict sigmask);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
pselect():
_POSIX_C_SOURCE >= 200112L
DESCRIPTION
WARNING: select() can monitor only file descriptor numbers that are less than FD_SETSIZE (1024), an unreasonably low limit for many modern applications, and this limitation will not change. All modern applications should instead use poll(2) or epoll(7), which do not suffer this limitation.
select() allows a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become “ready” for some class of I/O operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2), or a sufficiently small write(2)) without blocking.
fd_set
A structure type that can represent a set of file descriptors. According to POSIX, the maximum number of file descriptors in an fd_set structure is the value of the macro FD_SETSIZE.
File descriptor sets
The principal arguments of select() are three “sets” of file descriptors (declared with the type fd_set), which allow the caller to wait for three classes of events on the specified set of file descriptors. Each of the fd_set arguments may be specified as NULL if no file descriptors are to be watched for the corresponding class of events.
Note well: Upon return, each of the file descriptor sets is modified in place to indicate which file descriptors are currently “ready”. Thus, if using select() within a loop, the sets must be reinitialized before each call.
The contents of a file descriptor set can be manipulated using the following macros:
FD_ZERO()
This macro clears (removes all file descriptors from) set. It should be employed as the first step in initializing a file descriptor set.
FD_SET()
This macro adds the file descriptor fd to set. Adding a file descriptor that is already present in the set is a no-op, and does not produce an error.
FD_CLR()
This macro removes the file descriptor fd from set. Removing a file descriptor that is not present in the set is a no-op, and does not produce an error.
FD_ISSET()
select() modifies the contents of the sets according to the rules described below. After calling select(), the FD_ISSET() macro can be used to test if a file descriptor is still present in a set. FD_ISSET() returns nonzero if the file descriptor fd is present in set, and zero if it is not.
Arguments
The arguments of select() are as follows:
readfds
The file descriptors in this set are watched to see if they are ready for reading. A file descriptor is ready for reading if a read operation will not block; in particular, a file descriptor is also ready on end-of-file.
After select() has returned, readfds will be cleared of all file descriptors except for those that are ready for reading.
writefds
The file descriptors in this set are watched to see if they are ready for writing. A file descriptor is ready for writing if a write operation will not block. However, even if a file descriptor is reported as writable, a large write may still block.
After select() has returned, writefds will be cleared of all file descriptors except for those that are ready for writing.
exceptfds
The file descriptors in this set are watched for “exceptional conditions”. For examples of some exceptional conditions, see the discussion of POLLPRI in poll(2).
After select() has returned, exceptfds will be cleared of all file descriptors except for those for which an exceptional condition has occurred.
nfds
This argument should be set to the highest-numbered file descriptor in any of the three sets, plus 1. The indicated file descriptors in each set are checked, up to this limit (but see BUGS).
timeout
The timeout argument is a timeval structure (shown below) that specifies the interval that select() should block waiting for a file descriptor to become ready. The call will block until either:
a file descriptor becomes ready;
the call is interrupted by a signal handler; or
the timeout expires.
Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.
If both fields of the timeval structure are zero, then select() returns immediately. (This is useful for polling.)
If timeout is specified as NULL, select() blocks indefinitely waiting for a file descriptor to become ready.
pselect()
The pselect() system call allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.
The operation of select() and pselect() is identical, other than these three differences:
select() uses a timeout that is a struct timeval (with seconds and microseconds), while pselect() uses a struct timespec (with seconds and nanoseconds).
select() may update the timeout argument to indicate how much time was left. pselect() does not change this argument.
select() has no sigmask argument, and behaves as pselect() called with NULL sigmask.
sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is not NULL, then pselect() first replaces the current signal mask by the one pointed to by sigmask, then does the “select” function, and then restores the original signal mask. (If sigmask is NULL, the signal mask is not modified during the pselect() call.)
Other than the difference in the precision of the timeout argument, the following pselect() call:
ready = pselect(nfds, &readfds, &writefds, &exceptfds,
timeout, &sigmask);
is equivalent to atomically executing the following calls:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
The reason that pselect() is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic test is needed to prevent race conditions. (Suppose the signal handler sets a global flag and returns. Then a test of this global flag followed by a call of select() could hang indefinitely if the signal arrived just after the test but just before the call. By contrast, pselect() allows one to first block signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.)
The timeout
The timeout argument for select() is a structure of the following type:
struct timeval {
time_t tv_sec; /* seconds */
suseconds_t tv_usec; /* microseconds */
};
The corresponding argument for pselect() is a timespec(3) structure.
On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1 permits either behavior.) This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses a struct timeval for multiple select()s in a loop without reinitializing it. Consider timeout to be undefined after select() returns.
RETURN VALUE
On success, select() and pselect() return the number of file descriptors contained in the three returned descriptor sets (that is, the total number of bits that are set in readfds, writefds, exceptfds). The return value may be zero if the timeout expired before any file descriptors became ready.
On error, -1 is returned, and errno is set to indicate the error; the file descriptor sets are unmodified, and timeout becomes undefined.
ERRORS
EBADF
An invalid file descriptor was given in one of the sets. (Perhaps a file descriptor that was already closed, or one on which an error has occurred.) However, see BUGS.
EINTR
A signal was caught; see signal(7).
EINVAL
nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see getrlimit(2)).
EINVAL
The value contained within timeout is invalid.
ENOMEM
Unable to allocate memory for internal tables.
VERSIONS
On some other UNIX systems, select() can fail with the error EAGAIN if the system fails to allocate kernel-internal resources, rather than ENOMEM as Linux does. POSIX specifies this error for poll(2), but not for select(). Portable programs may wish to check for EAGAIN and loop, just as with EINTR.
STANDARDS
POSIX.1-2008.
HISTORY
select()
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
Generally portable to/from non-BSD systems supporting clones of the BSD socket layer (including System V variants). However, note that the System V variant typically sets the timeout variable before returning, but the BSD variant does not.
pselect()
Linux 2.6.16. POSIX.1g, POSIX.1-2001.
Prior to this, it was emulated in glibc (but see BUGS).
fd_set
POSIX.1-2001.
NOTES
The following header also provides the fd_set type: <sys/time.h>.
An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior. Moreover, POSIX requires fd to be a valid file descriptor.
The operation of select() and pselect() is not affected by the O_NONBLOCK flag.
The self-pipe trick
On systems that lack pselect(), reliable (and more portable) signal trapping can be achieved using the self-pipe trick. In this technique, a signal handler writes a byte to a pipe whose other end is monitored by select() in the main program. (To avoid possibly blocking when writing to a pipe that may be full or reading from a pipe that may be empty, nonblocking I/O is used when reading from and writing to the pipe.)
Emulating usleep(3)
Before the advent of usleep(3), some code employed a call to select() with all three sets empty, nfds zero, and a non-NULL timeout as a fairly portable way to sleep with subsecond precision.
Correspondence between select() and poll() notifications
Within the Linux kernel source, we find the following definitions which show the correspondence between the readable, writable, and exceptional condition notifications of select() and the event notifications provided by poll(2) and epoll(7):
#define POLLIN_SET (EPOLLRDNORM | EPOLLRDBAND | EPOLLIN |
EPOLLHUP | EPOLLERR)
/* Ready for reading */
#define POLLOUT_SET (EPOLLWRBAND | EPOLLWRNORM | EPOLLOUT |
EPOLLERR)
/* Ready for writing */
#define POLLEX_SET (EPOLLPRI)
/* Exceptional condition */
Multithreaded applications
If a file descriptor being monitored by select() is closed in another thread, the result is unspecified. On some UNIX systems, select() unblocks and returns, with an indication that the file descriptor is ready (a subsequent I/O operation will likely fail with an error, unless another process reopens the file descriptor between the time select() returned and the I/O operation is performed). On Linux (and some other systems), closing the file descriptor in another thread has no effect on select(). In summary, any application that relies on a particular behavior in this scenario must be considered buggy.
C library/kernel differences
The Linux kernel allows file descriptor sets of arbitrary size, determining the length of the sets to be checked from the value of nfds. However, in the glibc implementation, the fd_set type is fixed in size. See also BUGS.
The pselect() interface described in this page is implemented by glibc. The underlying Linux system call is named pselect6(). This system call has somewhat different behavior from the glibc wrapper function.
The Linux pselect6() system call modifies its timeout argument. However, the glibc wrapper function hides this behavior by using a local variable for the timeout argument that is passed to the system call. Thus, the glibc pselect() function does not modify its timeout argument; this is the behavior required by POSIX.1-2001.
The final argument of the pselect6() system call is not a sigset_t * pointer, but is instead a structure of the form:
struct {
const kernel_sigset_t *ss; /* Pointer to signal set */
size_t ss_len; /* Size (in bytes) of object
pointed to by 'ss' */
};
This allows the system call to obtain both a pointer to the signal set and its size, while allowing for the fact that most architectures support a maximum of 6 arguments to a system call. See sigprocmask(2) for a discussion of the difference between the kernel and libc notion of the signal set.
Historical glibc details
glibc 2.0 provided an incorrect version of pselect() that did not take a sigmask argument.
From glibc 2.1 to glibc 2.2.1, one must define _GNU_SOURCE in order to obtain the declaration of pselect() from <sys/select.h>.
BUGS
POSIX allows an implementation to define an upper limit, advertised via the constant FD_SETSIZE, on the range of file descriptors that can be specified in a file descriptor set. The Linux kernel imposes no fixed limit, but the glibc implementation makes fd_set a fixed-size type, with FD_SETSIZE defined as 1024, and the FD_*() macros operating according to that limit. To monitor file descriptors greater than 1023, use poll(2) or epoll(7) instead.
The implementation of the fd_set arguments as value-result arguments is a design error that is avoided in poll(2) and epoll(7).
According to POSIX, select() should check all specified file descriptors in the three file descriptor sets, up to the limit nfds-1. However, the current implementation ignores any file descriptor in these sets that is greater than the maximum file descriptor number that the process currently has open. According to POSIX, any such file descriptor that is specified in one of the sets should result in the error EBADF.
Starting with glibc 2.1, glibc provided an emulation of pselect() that was implemented using sigprocmask(2) and select(). This implementation remained vulnerable to the very race condition that pselect() was designed to prevent. Modern versions of glibc use the (race-free) pselect() system call on kernels where it is provided.
On Linux, select() may report a socket file descriptor as “ready for reading”, while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has the wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.
On Linux, select() also modifies timeout if the call is interrupted by a signal handler (i.e., the EINTR error return). This is not permitted by POSIX.1. The Linux pselect() system call has the same behavior, but the glibc wrapper hides this behavior by internally copying the timeout to a local variable and passing that variable to the system call.
EXAMPLES
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
int
main(void)
{
int retval;
fd_set rfds;
struct timeval tv;
/* Watch stdin (fd 0) to see when it has input. */
FD_ZERO(&rfds);
FD_SET(0, &rfds);
/* Wait up to five seconds. */
tv.tv_sec = 5;
tv.tv_usec = 0;
retval = select(1, &rfds, NULL, NULL, &tv);
/* Don't rely on the value of tv now! */
if (retval == -1)
perror("select()");
else if (retval)
printf("Data is available now.\n");
/* FD_ISSET(0, &rfds) will be true. */
else
printf("No data within five seconds.\n");
exit(EXIT_SUCCESS);
}
SEE ALSO
accept(2), connect(2), poll(2), read(2), recv(2), restart_syscall(2), send(2), sigprocmask(2), write(2), timespec(3), epoll(7), time(7)
For a tutorial with discussion and examples, see select_tut(2).
61 - Linux cli command membarrier
NAME membarrier
issue memory barriers on a set of threads
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/membarrier.h> /* Definition of MEMBARRIER_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_membarrier, int cmd, unsigned int flags, int cpu_id);
Note: glibc provides no wrapper for membarrier(), necessitating the use of syscall(2).
DESCRIPTION
The membarrier() system call helps reduce the overhead of the memory barrier instructions required to order memory accesses on multi-core systems. However, this system call is heavier than a memory barrier, so using it effectively is not as simple as replacing memory barriers with this system call, but requires understanding of the details below.
When using memory barriers, keep in mind that each barrier must either be matched with its counterpart barriers, or be used on an architecture whose memory model does not require the matching barriers.
There are cases where one side of the matching barriers (which we will refer to as “fast side”) is executed much more often than the other (which we will refer to as “slow side”). This is a prime target for the use of membarrier(). The key idea is to replace, for these matching barriers, the fast-side memory barriers by simple compiler barriers, for example:
asm volatile ("" : : : "memory")
and replace the slow-side memory barriers by calls to membarrier().
This will add overhead to the slow side, and remove overhead from the fast side, thus resulting in an overall performance increase as long as the slow side is infrequent enough that the overhead of the membarrier() calls does not outweigh the performance gain on the fast side.
The cmd argument is one of the following:
MEMBARRIER_CMD_QUERY (since Linux 4.3)
Query the set of supported commands. The return value of the call is a bit mask of supported commands. MEMBARRIER_CMD_QUERY, which has the value 0, is not itself included in this bit mask. This command is always supported (on kernels where membarrier() is provided).
MEMBARRIER_CMD_GLOBAL (since Linux 4.16)
Ensure that all threads from all processes on the system pass through a state where all memory accesses to user-space addresses match program order between entry to and return from the membarrier() system call. All threads on the system are targeted by this command.
MEMBARRIER_CMD_GLOBAL_EXPEDITED (since Linux 4.16)
Execute a memory barrier on all running threads of all processes that previously registered with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
Upon return from the system call, the calling thread has a guarantee that all running threads have passed through a state where all memory accesses to user-space addresses match program order between entry to and return from the system call (non-running threads are de facto in such a state). This guarantee is provided only for the threads of processes that previously registered with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
Given that registration is about the intent to receive the barriers, it is valid to invoke MEMBARRIER_CMD_GLOBAL_EXPEDITED from a process that has not employed MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
The “expedited” commands complete faster than the non-expedited ones; they never block, but have the downside of causing extra overhead.
MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED (since Linux 4.16)
Register the process’s intent to receive MEMBARRIER_CMD_GLOBAL_EXPEDITED memory barriers.
MEMBARRIER_CMD_PRIVATE_EXPEDITED (since Linux 4.14)
Execute a memory barrier on each running thread belonging to the same process as the calling thread.
Upon return from the system call, the calling thread has a guarantee that all its running thread siblings have passed through a state where all memory accesses to user-space addresses match program order between entry to and return from the system call (non-running threads are de facto in such a state). This guarantee is provided only for threads in the same process as the calling thread.
The “expedited” commands complete faster than the non-expedited ones; they never block, but have the downside of causing extra overhead.
A process must register its intent to use the private expedited command prior to using it.
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED (since Linux 4.14)
Register the process’s intent to use MEMBARRIER_CMD_PRIVATE_EXPEDITED.
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
In addition to providing the memory ordering guarantees described in MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon return from the system call the calling thread has a guarantee that all its running thread siblings have executed a core serializing instruction. This guarantee is provided only for threads in the same process as the calling thread.
The “expedited” commands complete faster than the non-expedited ones; they never block, but have the downside of causing extra overhead.
A process must register its intent to use the private expedited sync core command prior to using it.
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
Register the process’s intent to use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.
MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ (since Linux 5.10)
Upon return from the system call, the calling thread has a guarantee that, if the flags parameter is 0, any rseq critical sections currently running on its running thread siblings have been restarted; if the flags parameter is MEMBARRIER_CMD_FLAG_CPU, this operation is performed only on the CPU indicated by cpu_id. This guarantee is provided only for threads in the same process as the calling thread.
RSEQ membarrier is only available in the “private expedited” form.
A process must register its intent to use the private expedited rseq command prior to using it.
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ (since Linux 5.10)
Register the process’s intent to use MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ.
MEMBARRIER_CMD_SHARED (since Linux 4.3)
This is an alias for MEMBARRIER_CMD_GLOBAL that exists for header backward compatibility.
The flags argument must be specified as 0 unless the command is MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, in which case flags can be either 0 or MEMBARRIER_CMD_FLAG_CPU.
The cpu_id argument is ignored unless flags is MEMBARRIER_CMD_FLAG_CPU, in which case it must specify the CPU targeted by this membarrier command.
All memory accesses performed in program order from each targeted thread are guaranteed to be ordered with respect to membarrier().
If we use the semantic barrier() to represent a compiler barrier forcing memory accesses to be performed in program order across the barrier, and smp_mb() to represent explicit memory barriers forcing full memory ordering across the barrier, we have the following ordering table for each pairing of barrier(), membarrier(), and smp_mb(). The pair ordering is detailed as (O: ordered, X: not ordered):
               barrier()   smp_mb()   membarrier()
barrier()          X           X            O
smp_mb()           X           O            O
membarrier()       O           O            O
RETURN VALUE
On success, the MEMBARRIER_CMD_QUERY operation returns a bit mask of supported commands, and the MEMBARRIER_CMD_GLOBAL, MEMBARRIER_CMD_GLOBAL_EXPEDITED, MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED, MEMBARRIER_CMD_PRIVATE_EXPEDITED, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE operations return zero. On error, -1 is returned, and errno is set to indicate the error.
For a given command, with flags set to 0, this system call is guaranteed to always return the same value until reboot. Further calls with the same arguments will lead to the same result. Therefore, with flags set to 0, error handling is required only for the first call to membarrier().
ERRORS
EINVAL
cmd is invalid, or flags is nonzero, or the MEMBARRIER_CMD_GLOBAL command is disabled because the nohz_full CPU parameter has been set, or the MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE and MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE commands are not implemented by the architecture.
ENOSYS
The membarrier() system call is not implemented by this kernel.
EPERM
The current process was not registered prior to using private expedited commands.
STANDARDS
Linux.
HISTORY
Linux 4.3.
Before Linux 5.10, the prototype was:
int membarrier(int cmd, int flags);
NOTES
A memory barrier instruction is part of the instruction set of architectures with weakly ordered memory models. It orders memory accesses prior to the barrier and after the barrier with respect to matching barriers on other cores. For instance, a load fence can order loads prior to and following that fence with respect to stores ordered by store fences.
Program order is the order in which instructions are ordered in the program assembly code.
Examples where membarrier() can be useful include implementations of Read-Copy-Update libraries and garbage collectors.
EXAMPLES
Assuming a multithreaded application where “fast_path()” is executed very frequently, and where “slow_path()” is executed infrequently, the following code (x86) can be transformed using membarrier():
#include <stdlib.h>
static volatile int a, b;
static void
fast_path(int *read_b)
{
a = 1;
asm volatile ("mfence" : : : "memory");
*read_b = b;
}
static void
slow_path(int *read_a)
{
b = 1;
asm volatile ("mfence" : : : "memory");
*read_a = a;
}
int
main(void)
{
int read_a, read_b;
/*
* Real applications would call fast_path() and slow_path()
* from different threads. Call those from main() to keep
* this example short.
*/
slow_path(&read_a);
fast_path(&read_b);
/*
* read_b == 0 implies read_a == 1 and
* read_a == 0 implies read_b == 1.
*/
if (read_b == 0 && read_a == 0)
abort();
exit(EXIT_SUCCESS);
}
The code above transformed to use membarrier() becomes:
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>
static volatile int a, b;
static int
membarrier(int cmd, unsigned int flags, int cpu_id)
{
return syscall(__NR_membarrier, cmd, flags, cpu_id);
}
static int
init_membarrier(void)
{
int ret;
/* Check that membarrier() is supported. */
ret = membarrier(MEMBARRIER_CMD_QUERY, 0, 0);
if (ret < 0) {
perror("membarrier");
return -1;
}
if (!(ret & MEMBARRIER_CMD_GLOBAL)) {
fprintf(stderr,
"membarrier does not support MEMBARRIER_CMD_GLOBAL\n");
return -1;
}
return 0;
}
static void
fast_path(int *read_b)
{
a = 1;
asm volatile ("" : : : "memory");
*read_b = b;
}
static void
slow_path(int *read_a)
{
b = 1;
membarrier(MEMBARRIER_CMD_GLOBAL, 0, 0);
*read_a = a;
}
int
main(int argc, char *argv[])
{
int read_a, read_b;
if (init_membarrier())
exit(EXIT_FAILURE);
/*
* Real applications would call fast_path() and slow_path()
* from different threads. Call those from main() to keep
* this example short.
*/
slow_path(&read_a);
fast_path(&read_b);
/*
* read_b == 0 implies read_a == 1 and
* read_a == 0 implies read_b == 1.
*/
if (read_b == 0 && read_a == 0)
abort();
exit(EXIT_SUCCESS);
}
62 - Linux cli command pciconfig_write
NAME pciconfig_write
pci device information handling
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <pci.h>
int pciconfig_read(unsigned long bus, unsigned long dfn,
unsigned long off, unsigned long len,
unsigned char *buf);
int pciconfig_write(unsigned long bus, unsigned long dfn,
unsigned long off, unsigned long len,
unsigned char *buf);
int pciconfig_iobase(int which, unsigned long bus,
unsigned long devfn);
DESCRIPTION
Most of the interaction with PCI devices is already handled by the kernel PCI layer, and thus these calls should not normally need to be accessed from user space.
pciconfig_read()
Reads len bytes into buf from the device identified by bus and dfn, starting at offset off.
pciconfig_write()
Writes len bytes from buf to the device identified by bus and dfn, starting at offset off.
pciconfig_iobase()
You pass it a bus/devfn pair and get a physical address for either the memory offset (for things like prep, this is 0xc0000000), the IO base for PIO cycles, or the ISA holes if any.
RETURN VALUE
pciconfig_read()
On success, zero is returned. On error, -1 is returned and errno is set to indicate the error.
pciconfig_write()
On success, zero is returned. On error, -1 is returned and errno is set to indicate the error.
pciconfig_iobase()
Returns information on locations of various I/O regions in physical memory according to the which value. Values for which are: IOBASE_BRIDGE_NUMBER, IOBASE_MEMORY, IOBASE_IO, IOBASE_ISA_IO, IOBASE_ISA_MEM.
ERRORS
EINVAL
len value is invalid. This does not apply to pciconfig_iobase().
EIO
I/O error.
ENODEV
For pciconfig_iobase(), “hose” value is NULL. For the other calls, could not find a slot.
ENOSYS
The system has not implemented these calls (CONFIG_PCI not defined).
EOPNOTSUPP
This return value is valid only for pciconfig_iobase(). It is returned if the value for which is invalid.
EPERM
User does not have the CAP_SYS_ADMIN capability. This does not apply to pciconfig_iobase().
STANDARDS
Linux.
HISTORY
Linux 2.0.26/2.1.11.
SEE ALSO
capabilities(7)
63 - Linux cli command sendmmsg
NAME sendmmsg
send multiple messages on a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/socket.h>
int sendmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen,
int flags);
DESCRIPTION
The sendmmsg() system call is an extension of sendmsg(2) that allows the caller to transmit multiple messages on a socket using a single system call. (This has performance benefits for some applications.)
The sockfd argument is the file descriptor of the socket on which data is to be transmitted.
The msgvec argument is a pointer to an array of mmsghdr structures. The size of this array is specified in vlen.
The mmsghdr structure is defined in <sys/socket.h> as:
struct mmsghdr {
struct msghdr msg_hdr; /* Message header */
unsigned int msg_len; /* Number of bytes transmitted */
};
The msg_hdr field is a msghdr structure, as described in sendmsg(2). The msg_len field is used to return the number of bytes sent from the message in msg_hdr (i.e., the same as the return value from a single sendmsg(2) call).
The flags argument contains flags ORed together. The flags are the same as for sendmsg(2).
A blocking sendmmsg() call blocks until vlen messages have been sent. A nonblocking call sends as many messages as possible (up to the limit specified by vlen) and returns immediately.
On return from sendmmsg(), the msg_len fields of successive elements of msgvec are updated to contain the number of bytes transmitted from the corresponding msg_hdr. The return value of the call indicates the number of elements of msgvec that have been updated.
RETURN VALUE
On success, sendmmsg() returns the number of messages sent from msgvec; if this is less than vlen, the caller can retry with a further sendmmsg() call to send the remaining messages.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Errors are as for sendmsg(2). An error is returned only if no datagrams could be sent. See also BUGS.
STANDARDS
Linux.
HISTORY
Linux 3.0, glibc 2.14.
NOTES
The value specified in vlen is capped to UIO_MAXIOV (1024).
BUGS
If an error occurs after at least one message has been sent, the call succeeds, and returns the number of messages sent. The error code is lost. The caller can retry the transmission, starting at the first failed message, but there is no guarantee that, if an error is returned, it will be the same as the one that was lost on the previous call.
EXAMPLES
The example below uses sendmmsg() to send “onetwo” and “three” in two distinct UDP datagrams using one system call. The contents of the first datagram originate from a pair of buffers.
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
int
main(void)
{
int retval;
int sockfd;
struct iovec msg1[2], msg2;
struct mmsghdr msg[2];
struct sockaddr_in addr;
sockfd = socket(AF_INET, SOCK_DGRAM, 0);
if (sockfd == -1) {
perror("socket()");
exit(EXIT_FAILURE);
}
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
addr.sin_port = htons(1234);
if (connect(sockfd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
perror("connect()");
exit(EXIT_FAILURE);
}
memset(msg1, 0, sizeof(msg1));
msg1[0].iov_base = "one";
msg1[0].iov_len = 3;
msg1[1].iov_base = "two";
msg1[1].iov_len = 3;
memset(&msg2, 0, sizeof(msg2));
msg2.iov_base = "three";
msg2.iov_len = 5;
memset(msg, 0, sizeof(msg));
msg[0].msg_hdr.msg_iov = msg1;
msg[0].msg_hdr.msg_iovlen = 2;
msg[1].msg_hdr.msg_iov = &msg2;
msg[1].msg_hdr.msg_iovlen = 1;
retval = sendmmsg(sockfd, msg, 2, 0);
if (retval == -1)
perror("sendmmsg()");
else
printf("%d messages sent\n", retval);
exit(0);
}
SEE ALSO
recvmmsg(2), sendmsg(2), socket(2), socket(7)
64 - Linux cli command capget
NAME capget
set/get capabilities of thread(s)
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/capability.h> /* Definition of CAP_* and
_LINUX_CAPABILITY_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_capget, cap_user_header_t hdrp,
cap_user_data_t datap);
int syscall(SYS_capset, cap_user_header_t hdrp,
const cap_user_data_t datap);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
These two system calls are the raw kernel interface for getting and setting thread capabilities. They are specific to Linux, and the kernel API is likely to change: use of these system calls (in particular the format of the cap_user_*_t types) is subject to extension with each kernel revision, although old programs will keep working.
The portable interfaces are cap_set_proc(3) and cap_get_proc(3); if possible, you should use those interfaces in applications; see NOTES.
Current details
Now that you have been warned, some current kernel details. The structures are defined as follows.
#define _LINUX_CAPABILITY_VERSION_1 0x19980330
#define _LINUX_CAPABILITY_U32S_1 1
/* V2 added in Linux 2.6.25; deprecated */
#define _LINUX_CAPABILITY_VERSION_2 0x20071026
#define _LINUX_CAPABILITY_U32S_2 2
/* V3 added in Linux 2.6.26 */
#define _LINUX_CAPABILITY_VERSION_3 0x20080522
#define _LINUX_CAPABILITY_U32S_3 2
typedef struct __user_cap_header_struct {
__u32 version;
int pid;
} *cap_user_header_t;
typedef struct __user_cap_data_struct {
__u32 effective;
__u32 permitted;
__u32 inheritable;
} *cap_user_data_t;
The effective, permitted, and inheritable fields are bit masks of the capabilities defined in capabilities(7). Note that the CAP_* values are bit indexes and need to be bit-shifted before ORing into the bit fields. To define the structures for passing to the system call, you have to use the struct __user_cap_header_struct and struct __user_cap_data_struct names because the typedefs are only pointers.
Kernels prior to Linux 2.6.25 prefer 32-bit capabilities with version _LINUX_CAPABILITY_VERSION_1. Linux 2.6.25 added 64-bit capability sets, with version _LINUX_CAPABILITY_VERSION_2. There was, however, an API glitch, and Linux 2.6.26 added _LINUX_CAPABILITY_VERSION_3 to fix the problem.
Note that 64-bit capabilities use datap[0] and datap[1], whereas 32-bit capabilities use only datap[0].
On kernels that support file capabilities (VFS capabilities support), these system calls behave slightly differently. This support was added as an option in Linux 2.6.24, and became fixed (nonoptional) in Linux 2.6.33.
For capget() calls, one can probe the capabilities of any process by specifying its process ID with the hdrp->pid field value.
For details on the data, see capabilities(7).
With VFS capabilities support
VFS capabilities employ a file extended attribute (see xattr(7)) to allow capabilities to be attached to executables. This privilege model obsoletes kernel support for one process asynchronously setting the capabilities of another. That is, on kernels that have VFS capabilities support, when calling capset(), the only permitted values for hdrp->pid are 0 or, equivalently, the value returned by gettid(2).
Without VFS capabilities support
On older kernels that do not provide VFS capabilities support, capset() can, if the caller has the CAP_SETPCAP capability, be used to change not only the caller’s own capabilities, but also the capabilities of other threads. The call operates on the capabilities of the thread specified by the pid field of hdrp when that is nonzero, or on the capabilities of the calling thread if pid is 0. If pid refers to a single-threaded process, then pid can be specified as a traditional process ID; operating on a thread of a multithreaded process requires a thread ID of the type returned by gettid(2). For capset(), pid can also be: -1, meaning perform the change on all threads except the caller and init(1); or a value less than -1, in which case the change is applied to all members of the process group whose ID is -pid.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
The calls fail with the error EINVAL, and set the version field of hdrp to the kernel preferred value of _LINUX_CAPABILITY_VERSION_? when an unsupported version value is specified. In this way, one can probe what the current preferred capability revision is.
ERRORS
EFAULT
Bad memory address. hdrp must not be NULL. datap may be NULL only when the user is trying to determine the preferred capability version format supported by the kernel.
EINVAL
One of the arguments was invalid.
EPERM
An attempt was made to add a capability to the permitted set, or to set a capability in the effective set that is not in the permitted set.
EPERM
An attempt was made to add a capability to the inheritable set, and either:
that capability was not in the caller’s bounding set; or
the capability was not in the caller’s permitted set and the caller lacked the CAP_SETPCAP capability in its effective set.
EPERM
The caller attempted to use capset() to modify the capabilities of a thread other than itself, but lacked sufficient privilege. For kernels supporting VFS capabilities, this is never permitted. For kernels lacking VFS support, the CAP_SETPCAP capability is required. (A bug in kernels before Linux 2.6.11 meant that this error could also occur if a thread without this capability tried to change its own capabilities by specifying the pid field as a nonzero value (i.e., the value returned by getpid(2)) instead of 0.)
ESRCH
No such thread.
STANDARDS
Linux.
NOTES
The portable interface to the capability querying and setting functions is provided by the libcap library.
SEE ALSO
clone(2), gettid(2), capabilities(7)
65 - Linux cli command getppid
NAME getppid
get process identification
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
pid_t getpid(void);
pid_t getppid(void);
DESCRIPTION
getpid() returns the process ID (PID) of the calling process. (This is often used by routines that generate unique temporary filenames.)
getppid() returns the process ID of the parent of the calling process. This will be either the ID of the process that created this process using fork(), or, if that process has already terminated, the ID of the process to which this process has been reparented (either init(1) or a “subreaper” process defined via the prctl(2) PR_SET_CHILD_SUBREAPER operation).
ERRORS
These functions are always successful.
VERSIONS
On Alpha, instead of a pair of getpid() and getppid() system calls, a single getxpid() system call is provided, which returns a pair of PID and parent PID. The glibc getpid() and getppid() wrapper functions transparently deal with this. See syscall(2) for details regarding register mapping.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD, SVr4.
C library/kernel differences
From glibc 2.3.4 up to and including glibc 2.24, the glibc wrapper function for getpid() cached PIDs, with the goal of avoiding additional system calls when a process calls getpid() repeatedly. Normally this caching was invisible, but its correct operation relied on support in the wrapper functions for fork(2), vfork(2), and clone(2): if an application bypassed the glibc wrappers for these system calls by using syscall(2), then a call to getpid() in the child would return the wrong value (to be precise: it would return the PID of the parent process). In addition, there were cases where getpid() could return the wrong value even when invoking clone(2) via the glibc wrapper function. (For a discussion of one such case, see BUGS in clone(2).) Furthermore, the complexity of the caching code had been the source of a few bugs within glibc over the years.
Because of the aforementioned problems, since glibc 2.25, the PID cache is removed: calls to getpid() always invoke the actual system call, rather than returning a cached value.
NOTES
If the caller’s parent is in a different PID namespace (see pid_namespaces(7)), getppid() returns 0.
From a kernel perspective, the PID (which is shared by all of the threads in a multithreaded process) is sometimes also known as the thread group ID (TGID). This contrasts with the kernel thread ID (TID), which is unique for each thread. For further details, see gettid(2) and the discussion of the CLONE_THREAD flag in clone(2).
SEE ALSO
clone(2), fork(2), gettid(2), kill(2), exec(3), mkstemp(3), tempnam(3), tmpfile(3), tmpnam(3), credentials(7), pid_namespaces(7)
66 - Linux cli command setpgid
NAME setpgid
set/get process group
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setpgid(pid_t pid, pid_t pgid);
pid_t getpgid(pid_t pid);
pid_t getpgrp(void); /* POSIX.1 version */
[[deprecated]] pid_t getpgrp(pid_t pid); /* BSD version */
int setpgrp(void); /* System V version */
[[deprecated]] int setpgrp(pid_t pid, pid_t pgid); /* BSD version */
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
getpgid():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
setpgrp() (POSIX.1):
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _SVID_SOURCE
setpgrp() (BSD), getpgrp() (BSD):
[These are available only before glibc 2.19]
_BSD_SOURCE &&
! (_POSIX_SOURCE || _POSIX_C_SOURCE || _XOPEN_SOURCE
|| _GNU_SOURCE || _SVID_SOURCE)
DESCRIPTION
All of these interfaces are available on Linux, and are used for getting and setting the process group ID (PGID) of a process. The preferred, POSIX.1-specified ways of doing this are: getpgrp(void), for retrieving the calling process’s PGID; and setpgid(), for setting a process’s PGID.
setpgid() sets the PGID of the process specified by pid to pgid. If pid is zero, then the process ID of the calling process is used. If pgid is zero, then the PGID of the process specified by pid is made the same as its process ID. If setpgid() is used to move a process from one process group to another (as is done by some shells when creating pipelines), both process groups must be part of the same session (see setsid(2) and credentials(7)). In this case, the pgid specifies an existing process group to be joined and the session ID of that group must match the session ID of the joining process.
The POSIX.1 version of getpgrp(), which takes no arguments, returns the PGID of the calling process.
getpgid() returns the PGID of the process specified by pid. If pid is zero, the process ID of the calling process is used. (Retrieving the PGID of a process other than the caller is rarely necessary, and the POSIX.1 getpgrp() is preferred for retrieving the caller’s PGID.)
The System V-style setpgrp(), which takes no arguments, is equivalent to setpgid(0, 0).
The BSD-specific setpgrp() call, which takes arguments pid and pgid, is a wrapper function that calls
setpgid(pid, pgid)
Since glibc 2.19, the BSD-specific setpgrp() function is no longer exposed by <unistd.h>; calls should be replaced with the setpgid() call shown above.
The BSD-specific getpgrp() call, which takes a single pid argument, is a wrapper function that calls
getpgid(pid)
Since glibc 2.19, the BSD-specific getpgrp() function is no longer exposed by <unistd.h>; calls should be replaced with calls to the POSIX.1 getpgrp() which takes no arguments (if the intent is to obtain the caller’s PGID), or with the getpgid() call shown above.
RETURN VALUE
On success, setpgid() and setpgrp() return zero. On error, -1 is returned, and errno is set to indicate the error.
The POSIX.1 getpgrp() always returns the PGID of the caller.
getpgid(), and the BSD-specific getpgrp() return a process group on success. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
An attempt was made to change the process group ID of one of the children of the calling process and the child had already performed an execve(2) (setpgid(), setpgrp()).
EINVAL
pgid is less than 0 (setpgid(), setpgrp()).
EPERM
An attempt was made to move a process into a process group in a different session, or to change the process group ID of one of the children of the calling process and the child was in a different session, or to change the process group ID of a session leader (setpgid(), setpgrp()).
EPERM
The target process group does not exist. (setpgid(), setpgrp()).
ESRCH
For getpgid(): pid does not match any process. For setpgid(): pid is not the calling process and not a child of the calling process.
STANDARDS
getpgid()
setpgid()
getpgrp() (no args)
setpgrp() (no args)
POSIX.1-2008 (but see HISTORY).
setpgrp() (2 args)
getpgrp() (1 arg)
None.
HISTORY
getpgid()
setpgid()
getpgrp() (no args)
POSIX.1-2001.
setpgrp() (no args)
POSIX.1-2001. POSIX.1-2008 marks it as obsolete.
setpgrp() (2 args)
getpgrp() (1 arg)
4.2BSD.
NOTES
A child created via fork(2) inherits its parent’s process group ID. The PGID is preserved across an execve(2).
Each process group is a member of a session and each process is a member of the session of which its process group is a member. (See credentials(7).)
A session can have a controlling terminal. At any time, one (and only one) of the process groups in the session can be the foreground process group for the terminal; the remaining process groups are in the background. If a signal is generated from the terminal (e.g., typing the interrupt key to generate SIGINT), that signal is sent to the foreground process group. (See termios(3) for a description of the characters that generate signals.) Only the foreground process group may read(2) from the terminal; if a background process group tries to read(2) from the terminal, then the group is sent a SIGTTIN signal, which suspends it. The tcgetpgrp(3) and tcsetpgrp(3) functions are used to get/set the foreground process group of the controlling terminal.
The setpgid() and getpgrp() calls are used by programs such as bash(1) to create process groups in order to implement shell job control.
If the termination of a process causes a process group to become orphaned, and if any member of the newly orphaned process group is stopped, then a SIGHUP signal followed by a SIGCONT signal will be sent to each process in the newly orphaned process group. An orphaned process group is one in which the parent of every member of the process group is either itself also a member of the process group or is a member of a process group in a different session (see also credentials(7)).
SEE ALSO
getuid(2), setsid(2), tcgetpgrp(3), tcsetpgrp(3), termios(3), credentials(7)
67 - Linux cli command munmap
NAME 🖥️ munmap 🖥️
map or unmap files or devices into memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
void *mmap(void addr[.length], size_t length, int prot, int flags,
int fd, off_t offset);
int munmap(void addr[.length], size_t length);
See NOTES for information on feature test macro requirements.
DESCRIPTION
mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the mapping (which must be greater than 0).
If addr is NULL, then the kernel chooses the (page-aligned) address at which to create the mapping; this is the most portable method of creating a new mapping. If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the kernel will pick a nearby page boundary (but always above or equal to the value specified by /proc/sys/vm/mmap_min_addr) and attempt to create the mapping there. If another mapping already exists there, the kernel picks a new address that may or may not depend on the hint. The address of the new mapping is returned as the result of the call.
The contents of a file mapping (as opposed to an anonymous mapping; see MAP_ANONYMOUS below), are initialized using length bytes starting at offset offset in the file (or other object) referred to by the file descriptor fd. offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE).
After the mmap() call has returned, the file descriptor, fd, can be closed immediately without invalidating the mapping.
The prot argument describes the desired memory protection of the mapping (and must not conflict with the open mode of the file). It is either PROT_NONE or the bitwise OR of one or more of the following flags:
PROT_EXEC
Pages may be executed.
PROT_READ
Pages may be read.
PROT_WRITE
Pages may be written.
PROT_NONE
Pages may not be accessed.
The flags argument
The flags argument determines whether updates to the mapping are visible to other processes mapping the same region, and whether updates are carried through to the underlying file. This behavior is determined by including exactly one of the following values in flags:
MAP_SHARED
Share this mapping. Updates to the mapping are visible to other processes mapping the same region, and (in the case of file-backed mappings) are carried through to the underlying file. (To precisely control when updates are carried through to the underlying file requires the use of msync(2).)
MAP_SHARED_VALIDATE (since Linux 4.15)
This flag provides the same behavior as MAP_SHARED except that MAP_SHARED mappings ignore unknown flags in flags. By contrast, when creating a mapping using MAP_SHARED_VALIDATE, the kernel verifies all passed flags are known and fails the mapping with the error EOPNOTSUPP for unknown flags. This mapping type is also required to be able to use some mapping flags (e.g., MAP_SYNC).
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
Both MAP_SHARED and MAP_PRIVATE are described in POSIX.1-2001 and POSIX.1-2008. MAP_SHARED_VALIDATE is a Linux extension.
In addition, zero or more of the following values can be ORed in flags:
MAP_32BIT (since Linux 2.4.20, 2.6)
Put the mapping into the first 2 Gigabytes of the process address space. This flag is supported only on x86-64, for 64-bit programs. It was added to allow thread stacks to be allocated somewhere in the first 2 GB of memory, so as to improve context-switch performance on some early 64-bit processors. Modern x86-64 processors no longer have this performance problem, so use of this flag is not required on those systems. The MAP_32BIT flag is ignored when MAP_FIXED is set.
MAP_ANON
Synonym for MAP_ANONYMOUS; provided for compatibility with other implementations.
MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initialized to zero. The fd argument is ignored; however, some implementations require fd to be -1 if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications should ensure this. The offset argument should be zero. Support for MAP_ANONYMOUS in conjunction with MAP_SHARED was added in Linux 2.4.
MAP_DENYWRITE
This flag is ignored. (Long ago, in Linux 2.0 and earlier, it signaled that attempts to write to the underlying file should fail with ETXTBSY. But this was a source of denial-of-service attacks.)
MAP_EXECUTABLE
This flag is ignored.
MAP_FILE
Compatibility flag. Ignored.
MAP_FIXED
Don’t interpret addr as a hint: place the mapping at exactly that address. addr must be suitably aligned: for most architectures a multiple of the page size is sufficient; however, some architectures may impose additional restrictions. If the memory region specified by addr and length overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded. If the specified address cannot be used, mmap() will fail.
Software that aspires to be portable should use the MAP_FIXED flag with care, keeping in mind that the exact layout of a process’s memory mappings is allowed to change significantly between Linux versions, C library versions, and operating system releases. Carefully read the discussion of this flag in NOTES!
MAP_FIXED_NOREPLACE (since Linux 4.17)
This flag provides behavior that is similar to MAP_FIXED with respect to the addr enforcement, but differs in that MAP_FIXED_NOREPLACE never clobbers a preexisting mapped range. If the requested range would collide with an existing mapping, then this call fails with the error EEXIST. This flag can therefore be used as a way to atomically (with respect to other threads) attempt to map an address range: one thread will succeed; all others will report failure.
Note that older kernels which do not recognize the MAP_FIXED_NOREPLACE flag will typically (upon detecting a collision with a preexisting mapping) fall back to a “non-MAP_FIXED” type of behavior: they will return an address that is different from the requested address. Therefore, backward-compatible software should check the returned address against the requested address.
MAP_GROWSDOWN
This flag is used for stacks. It indicates to the kernel virtual memory system that the mapping should extend downward in memory. The return address is one page lower than the memory area that is actually created in the process’s virtual address space. Touching an address in the “guard” page below the mapping will cause the mapping to grow by a page. This growth can be repeated until the mapping grows to within a page of the high end of the next lower mapping, at which point touching the “guard” page will result in a SIGSEGV signal.
MAP_HUGETLB (since Linux 2.6.32)
Allocate the mapping using “huge” pages. See the Linux kernel source file Documentation/admin-guide/mm/hugetlbpage.rst for further information, as well as NOTES, below.
MAP_HUGE_2MB
MAP_HUGE_1GB (since Linux 3.8)
Used in conjunction with MAP_HUGETLB to select alternative hugetlb page sizes (respectively, 2 MB and 1 GB) on systems that support multiple hugetlb page sizes.
More generally, the desired huge page size can be configured by encoding the base-2 logarithm of the desired page size in the six bits at the offset MAP_HUGE_SHIFT. (A value of zero in this bit field provides the default huge page size; the default huge page size can be discovered via the Hugepagesize field exposed by /proc/meminfo.) Thus, the above two constants are defined as:
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
The range of huge page sizes that are supported by the system can be discovered by listing the subdirectories in /sys/kernel/mm/hugepages.
MAP_LOCKED (since Linux 2.5.37)
Mark the mapped region to be locked in the same way as mlock(2). This implementation will try to populate (prefault) the whole range but the mmap() call doesn’t fail with ENOMEM if this fails. Therefore major faults might happen later on. So the semantic is not as strong as mlock(2). One should use mmap() plus mlock(2) when major faults are not acceptable after the initialization of the mapping. The MAP_LOCKED flag is ignored in older kernels.
MAP_NONBLOCK (since Linux 2.5.46)
This flag is meaningful only in conjunction with MAP_POPULATE. Don’t perform read-ahead: create page table entries only for pages that are already present in RAM. Since Linux 2.6.23, this flag causes MAP_POPULATE to do nothing. One day, the combination of MAP_POPULATE and MAP_NONBLOCK may be reimplemented.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available. See also the discussion of the file /proc/sys/vm/overcommit_memory in proc(5). Before Linux 2.6, this flag had effect only for private writable mappings.
MAP_POPULATE (since Linux 2.5.46)
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. This will help to reduce blocking on page faults later. The mmap() call doesn’t fail if the mapping cannot be populated (for example, due to limitations on the number of mapped huge pages when using MAP_HUGETLB). Support for MAP_POPULATE in conjunction with private mappings was added in Linux 2.6.23.
MAP_STACK (since Linux 2.6.27)
Allocate the mapping at an address suitable for a process or thread stack.
This flag is currently a no-op on Linux. However, by employing this flag, applications can ensure that they transparently obtain support if the flag is implemented in the future. Thus, it is used in the glibc threading implementation to allow for the fact that some architectures may (later) require special treatment for stack allocations. A further reason to employ this flag is portability: MAP_STACK exists (and has an effect) on some other systems (e.g., some of the BSDs).
MAP_SYNC (since Linux 4.15)
This flag is available only with the MAP_SHARED_VALIDATE mapping type; mappings of type MAP_SHARED will silently ignore this flag. This flag is supported only for files supporting DAX (direct mapping of persistent memory). For other files, creating a mapping with this flag results in an EOPNOTSUPP error.
Shared file mappings with this flag provide the guarantee that while some memory is mapped writable in the address space of the process, it will be visible in the same file at the same offset even after the system crashes or is rebooted. In conjunction with the use of appropriate CPU instructions, this provides users of such mappings with a more efficient way of making data modifications persistent.
MAP_UNINITIALIZED (since Linux 2.6.33)
Don’t clear anonymous pages. This flag is intended to improve performance on embedded devices. This flag is honored only if the kernel was configured with the CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option is normally enabled only on embedded devices (i.e., devices where one has complete control of the contents of user memory).
Of the above flags, only MAP_FIXED is specified in POSIX.1-2001 and POSIX.1-2008. However, most systems also support MAP_ANONYMOUS (or its synonym MAP_ANON).
munmap()
The munmap() system call deletes the mappings for the specified address range, and causes further references to addresses within the range to generate invalid memory references. The region is also automatically unmapped when the process is terminated. On the other hand, closing the file descriptor does not unmap the region.
The address addr must be a multiple of the page size (but length need not be). All pages containing a part of the indicated range are unmapped, and subsequent references to these pages will generate SIGSEGV. It is not an error if the indicated range does not contain any mapped pages.
RETURN VALUE
On success, mmap() returns a pointer to the mapped area. On error, the value MAP_FAILED (that is, (void *) -1) is returned, and errno is set to indicate the error.
On success, munmap() returns 0. On failure, it returns -1, and errno is set to indicate the error (probably to EINVAL).
ERRORS
EACCES
A file descriptor refers to a non-regular file. Or a file mapping was requested, but fd is not open for reading. Or MAP_SHARED was requested and PROT_WRITE is set, but fd is not open in read/write (O_RDWR) mode. Or PROT_WRITE is set, but the file is append-only.
EAGAIN
The file has been locked, or too much memory has been locked (see setrlimit(2)).
EBADF
fd is not a valid file descriptor (and MAP_ANONYMOUS was not set).
EEXIST
MAP_FIXED_NOREPLACE was specified in flags, and the range covered by addr and length clashes with an existing mapping.
EINVAL
addr, length, or offset is invalid (e.g., too large, or not aligned on a page boundary).
EINVAL
(since Linux 2.6.12) length was 0.
EINVAL
flags contained none of MAP_PRIVATE, MAP_SHARED, or MAP_SHARED_VALIDATE.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
The underlying filesystem of the specified file does not support memory mapping.
ENOMEM
No memory is available.
ENOMEM
The process’s maximum number of mappings would have been exceeded. This error can also occur for munmap(), when unmapping a region in the middle of an existing mapping, since this results in two smaller mappings on either side of the region being unmapped.
ENOMEM
(since Linux 4.7) The process’s RLIMIT_DATA limit, described in getrlimit(2), would have been exceeded.
ENOMEM
addr is invalid because it exceeds the virtual address space of the CPU.
EOVERFLOW
On 32-bit architecture together with the large file extension (i.e., using 64-bit off_t): the number of pages used for length plus number of pages used for offset would overflow unsigned long (32 bits).
EPERM
The prot argument asks for PROT_EXEC but the mapped area belongs to a file on a filesystem that was mounted no-exec.
EPERM
The operation was prevented by a file seal; see fcntl(2).
EPERM
The MAP_HUGETLB flag was specified, but the caller was not privileged (did not have the CAP_IPC_LOCK capability) and is not a member of the sysctl_hugetlb_shm_group group; see the description of /proc/sys/vm/sysctl_hugetlb_shm_group in proc_sys(5).
ETXTBSY
MAP_DENYWRITE was set but the object specified by fd is open for writing.
Use of a mapped region can result in these signals:
SIGSEGV
Attempted write into a region mapped as read-only.
SIGBUS
Attempted access to a page of the buffer that lies beyond the end of the mapped file. For an explanation of the treatment of the bytes in the page that corresponds to the end of a mapped file that is not a multiple of the page size, see NOTES.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
mmap(), munmap() | Thread safety | MT-Safe |
VERSIONS
On some hardware architectures (e.g., i386), PROT_WRITE implies PROT_READ. It is architecture dependent whether PROT_READ implies PROT_EXEC or not. Portable programs should always set PROT_EXEC if they intend to execute code in the new mapping.
The portable way to create a mapping is to specify addr as 0 (NULL), and omit MAP_FIXED from flags. In this case, the system chooses the address for the mapping; the address is chosen so as not to conflict with any existing mapping, and will not be 0. If the MAP_FIXED flag is specified, and addr is 0 (NULL), then the mapped address will be 0 (NULL).
Certain flags constants are defined only if suitable feature test macros are defined (possibly by default): _DEFAULT_SOURCE with glibc 2.19 or later; or _BSD_SOURCE or _SVID_SOURCE in glibc 2.19 and earlier. (Employing _GNU_SOURCE also suffices, and requiring that macro specifically would have been more logical, since these flags are all Linux-specific.) The relevant flags are: MAP_32BIT, MAP_ANONYMOUS (and the synonym MAP_ANON), MAP_DENYWRITE, MAP_EXECUTABLE, MAP_FILE, MAP_GROWSDOWN, MAP_HUGETLB, MAP_LOCKED, MAP_NONBLOCK, MAP_NORESERVE, MAP_POPULATE, and MAP_STACK.
C library/kernel differences
This page describes the interface provided by the glibc mmap() wrapper function. Originally, this function invoked a system call of the same name. Since Linux 2.4, that system call has been superseded by mmap2(2), and nowadays the glibc mmap() wrapper function invokes mmap2(2) with a suitably adjusted value for offset.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD.
On POSIX systems on which mmap(), msync(2), and munmap() are available, _POSIX_MAPPED_FILES is defined in <unistd.h> to a value greater than 0. (See also sysconf(3).)
NOTES
Memory mapped by mmap() is preserved across fork(2), with the same attributes.
A file is mapped in multiples of the page size. For a file that is not a multiple of the page size, the remaining bytes in the partial page at the end of the mapping are zeroed when mapped, and modifications to that region are not written out to the file. The effect of changing the size of the underlying file of a mapping on the pages that correspond to added or removed regions of the file is unspecified.
An application can determine which pages of a mapping are currently resident in the buffer/page cache using mincore(2).
Using MAP_FIXED safely
The only safe use for MAP_FIXED is where the address range specified by addr and length was previously reserved using another mapping; otherwise, the use of MAP_FIXED is hazardous because it forcibly removes preexisting mappings, making it easy for a multithreaded process to corrupt its own address space.
For example, suppose that thread A looks through /proc/pid/maps in order to locate an unused address range that it can map using MAP_FIXED, while thread B simultaneously acquires part or all of that same address range. When thread A subsequently employs mmap(MAP_FIXED), it will effectively clobber the mapping that thread B created. In this scenario, thread B need not create a mapping directly; simply making a library call that, internally, uses dlopen(3) to load some other shared library, will suffice. The dlopen(3) call will map the library into the process’s address space. Furthermore, almost any library call may be implemented in a way that adds memory mappings to the address space, either with this technique, or by simply allocating memory. Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries.
Since Linux 4.17, a multithreaded program can use the MAP_FIXED_NOREPLACE flag to avoid the hazard described above when attempting to create a mapping at a fixed address that has not been reserved by a preexisting mapping.
Timestamp changes for file-backed mappings
For file-backed mappings, the st_atime field for the mapped file may be updated at any time between the mmap() and the corresponding unmapping; the first reference to a mapped page will update the field if it has not been already.
The st_ctime and st_mtime fields for a file mapped with PROT_WRITE and MAP_SHARED will be updated after a write to the mapped region, and before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one occurs.
Huge page (Huge TLB) mappings
For mappings that employ huge pages, the requirements for the arguments of mmap() and munmap() differ somewhat from the requirements for mappings that use the native system page size.
For mmap(), offset must be a multiple of the underlying huge page size. The system automatically aligns length to be a multiple of the underlying huge page size.
For munmap(), addr and length must both be a multiple of the underlying huge page size.
BUGS
On Linux, there are no guarantees like those suggested above under MAP_NORESERVE. By default, any process can be killed at any moment when the system runs out of memory.
Before Linux 2.6.7, the MAP_POPULATE flag has effect only if prot is specified as PROT_NONE.
SUSv3 specifies that mmap() should fail if length is 0. However, before Linux 2.6.12, mmap() succeeded in this case: no mapping was created and the call returned addr. Since Linux 2.6.12, mmap() fails with the error EINVAL for this case.
POSIX specifies that the system shall always zero fill any partial page at the end of the object and that the system will never write any modification of the object beyond its end. On Linux, when you write data to such a partial page after the end of the object, the data stays in the page cache even after the file is closed and unmapped. Even though the data is never written to the file itself, subsequent mappings may see the modified content. In some cases, this could be fixed by calling msync(2) before the unmap takes place; however, this doesn’t work on tmpfs(5) (for example, when using the POSIX shared memory interface documented in shm_overview(7)).
EXAMPLES
The following program prints part of the file specified in its first command-line argument to standard output. The range of bytes to be printed is specified via offset and length values in the second and third command-line arguments. The program creates a memory mapping of the required pages of the file and then uses write(2) to output the desired bytes.
Program source
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

int
main(int argc, char *argv[])
{
    int          fd;
    char         *addr;
    off_t        offset, pa_offset;
    size_t       length;
    ssize_t      s;
    struct stat  sb;

    if (argc < 3 || argc > 4) {
        fprintf(stderr, "%s file offset [length]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        handle_error("open");

    if (fstat(fd, &sb) == -1)           /* To obtain file size */
        handle_error("fstat");

    offset = atoi(argv[2]);
    pa_offset = offset & ~(sysconf(_SC_PAGE_SIZE) - 1);
        /* offset for mmap() must be page aligned */

    if (offset >= sb.st_size) {
        fprintf(stderr, "offset is past end of file\n");
        exit(EXIT_FAILURE);
    }

    if (argc == 4) {
        length = atoi(argv[3]);
        if (offset + length > sb.st_size)
            length = sb.st_size - offset;
                /* Can't display bytes past end of file */
    } else {    /* No length arg ==> display to end of file */
        length = sb.st_size - offset;
    }

    addr = mmap(NULL, length + offset - pa_offset, PROT_READ,
                MAP_PRIVATE, fd, pa_offset);
    if (addr == MAP_FAILED)
        handle_error("mmap");

    s = write(STDOUT_FILENO, addr + offset - pa_offset, length);
    if (s != length) {
        if (s == -1)
            handle_error("write");

        fprintf(stderr, "partial write");
        exit(EXIT_FAILURE);
    }

    munmap(addr, length + offset - pa_offset);
    close(fd);

    exit(EXIT_SUCCESS);
}
SEE ALSO
ftruncate(2), getpagesize(2), memfd_create(2), mincore(2), mlock(2), mmap2(2), mprotect(2), mremap(2), msync(2), remap_file_pages(2), setrlimit(2), shmat(2), userfaultfd(2), shm_open(3), shm_overview(7)
The descriptions of the following files in proc(5): /proc/pid/maps, /proc/pid/map_files, and /proc/pid/smaps.
B.O. Gallmeister, POSIX.4, O’Reilly, pp. 128–129 and 389–391.
68 - Linux cli command fdetach
NAME 🖥️ fdetach 🖥️
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
69 - Linux cli command listen
NAME 🖥️ listen 🖥️
listen for connections on a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int listen(int sockfd, int backlog);
DESCRIPTION
listen() marks the socket referred to by sockfd as a passive socket, that is, as a socket that will be used to accept incoming connection requests using accept(2).
The sockfd argument is a file descriptor that refers to a socket of type SOCK_STREAM or SOCK_SEQPACKET.
The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EADDRINUSE
Another socket is already listening on the same port.
EADDRINUSE
(Internet domain sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7).
EBADF
The argument sockfd is not a valid file descriptor.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
EOPNOTSUPP
The socket is not of a type that supports the listen() operation.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
NOTES
To accept connections, the following steps are performed:
(1) A socket is created with socket(2).
(2) The socket is bound to a local address using bind(2), so that other sockets may be connect(2)ed to it.
(3) A willingness to accept incoming connections and a queue limit for incoming connections are specified with listen().
(4) Connections are accepted with accept(2).
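The first three steps above can be sketched as follows. This is a minimal illustration, not a complete server: the helper name make_listener() is hypothetical, and binding to all local IPv4 addresses (INADDR_ANY) is an assumption made to keep the example short.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of any library): returns a listening
   TCP socket bound to the given port on all local IPv4 addresses,
   or -1 with errno set. A port of 0 lets the kernel choose one. */
int make_listener(unsigned short port, int backlog)
{
    assert(backlog >= 0);

    int fd = socket(AF_INET, SOCK_STREAM, 0);           /* step 1 */
    if (fd == -1)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||  /* step 2 */
        listen(fd, backlog) == -1) {                                /* step 3 */
        int saved = errno;      /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;      /* step 4: the caller passes fd to accept(2) */
}
```

A caller would typically loop on accept(2) with the returned descriptor and close it when the service shuts down.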
The behavior of the backlog argument on TCP sockets changed with Linux 2.2. Now it specifies the queue length for completely established sockets waiting to be accepted, instead of the number of incomplete connection requests. The maximum length of the queue for incomplete sockets can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog. When syncookies are enabled there is no logical maximum length and this setting is ignored. See tcp(7) for more information.
If the backlog argument is greater than the value in /proc/sys/net/core/somaxconn, then it is silently capped to that value. Since Linux 5.4, the default in this file is 4096; in earlier kernels, the default value is 128. Before Linux 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128.
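To see the cap that applies on a running system, a program can read the /proc file named above. The sketch below is illustrative only; read_somaxconn() and effective_backlog() are hypothetical helper names, and the second function merely mirrors the capping rule described in the text rather than querying the kernel for the value it actually used.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the current value of /proc/sys/net/core/somaxconn,
   or -1 if the file cannot be read or parsed. */
long read_somaxconn(void)
{
    FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");
    if (f == NULL)
        return -1;

    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Mirrors the documented rule: a requested backlog larger than
   somaxconn is capped to somaxconn. If the cap is unknown, the
   requested value is returned unchanged. */
long effective_backlog(long requested)
{
    assert(requested >= 0);

    long cap = read_somaxconn();
    return (cap >= 0 && requested > cap) ? cap : requested;
}
```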
EXAMPLES
See bind(2).
SEE ALSO
accept(2), bind(2), connect(2), socket(2), socket(7)
70 - Linux cli command lsetxattr
NAME
lsetxattr - set an extended attribute value
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
int setxattr(const char *path, const char *name,
const void value[.size], size_t size, int flags);
int lsetxattr(const char *path, const char *name,
const void value[.size], size_t size, int flags);
int fsetxattr(int fd, const char *name,
const void value[.size], size_t size, int flags);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
setxattr() sets the value of the extended attribute identified by name and associated with the given path in the filesystem. The size argument specifies the size (in bytes) of value; a zero-length value is permitted.
lsetxattr() is identical to setxattr(), except in the case of a symbolic link, where the extended attribute is set on the link itself, not the file that it refers to.
fsetxattr() is identical to setxattr(), only the extended attribute is set on the open file referred to by fd (as returned by open(2)) in place of path.
An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. The value of an extended attribute is a chunk of arbitrary textual or binary data of specified length.
By default (i.e., flags is zero), the extended attribute will be created if it does not exist, or the value will be replaced if the attribute already exists. To modify these semantics, one of the following values can be specified in flags:
XATTR_CREATE
Perform a pure create, which fails if the named attribute exists already.
XATTR_REPLACE
Perform a pure replace operation, which fails if the named attribute does not already exist.
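The effect of the two flags can be observed with a short experiment. The following is a sketch under stated assumptions: the attribute name "user.demo", the helper names, and the use of a temporary file under /tmp are all illustrative choices, and the file is assumed not to carry that attribute already. Filesystems that do not support user attributes are reported separately instead of being treated as failures.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Hypothetical helper. Returns 0 if the create/replace semantics were
   observed, 1 if the filesystem refuses user attributes, -1 otherwise. */
int xattr_create_then_replace(const char *path)
{
    assert(path != NULL);

    const char *name = "user.demo";     /* assumed attribute name */
    const char *v1 = "first";
    const char *v2 = "second";

    /* Pure create: succeeds only if the attribute does not exist yet. */
    if (setxattr(path, name, v1, strlen(v1), XATTR_CREATE) == -1)
        return (errno == ENOTSUP || errno == EPERM) ? 1 : -1;

    int rc = -1;
    /* A second pure create must fail with EEXIST; a pure replace
       must then succeed because the attribute now exists. */
    if (setxattr(path, name, v2, strlen(v2), XATTR_CREATE) == -1 &&
        errno == EEXIST &&
        setxattr(path, name, v2, strlen(v2), XATTR_REPLACE) == 0)
        rc = 0;

    removexattr(path, name);            /* clean up the demo attribute */
    return rc;
}

/* Runs the experiment on a freshly created temporary file. */
int xattr_demo_on_temp_file(void)
{
    char tmpl[] = "/tmp/xattr_demo_XXXXXX";
    int fd = mkstemp(tmpl);
    if (fd == -1)
        return -1;

    int rc = xattr_create_then_replace(tmpl);
    close(fd);
    unlink(tmpl);
    return rc;
}
```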
RETURN VALUE
On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EDQUOT
Disk quota limits meant that there is insufficient space remaining to store the extended attribute.
EEXIST
XATTR_CREATE was specified, and the attribute exists already.
ENODATA
XATTR_REPLACE was specified, and the attribute does not exist.
ENOSPC
There is insufficient space remaining to store the extended attribute.
ENOTSUP
The namespace prefix of name is not valid.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
ERANGE
The size of name or value exceeds a filesystem-specific limit.
In addition, the errors documented in stat(2) can also occur.
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
SEE ALSO
getfattr(1), setfattr(1), getxattr(2), listxattr(2), open(2), removexattr(2), stat(2), symlink(7), xattr(7)
71 - Linux cli command msgctl
NAME
msgctl - System V message control operations
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/msg.h>
int msgctl(int msqid, int op, struct msqid_ds *buf);
DESCRIPTION
msgctl() performs the control operation specified by op on the System V message queue with identifier msqid.
The msqid_ds data structure is defined in <sys/msg.h> as follows:
struct msqid_ds {
    struct ipc_perm msg_perm;   /* Ownership and permissions */
    time_t          msg_stime;  /* Time of last msgsnd(2) */
    time_t          msg_rtime;  /* Time of last msgrcv(2) */
    time_t          msg_ctime;  /* Time of creation or last
                                   modification by msgctl() */
    unsigned long   msg_cbytes; /* # of bytes in queue */
    msgqnum_t       msg_qnum;   /* # of messages in queue */
    msglen_t        msg_qbytes; /* Maximum # of bytes in queue */
    pid_t           msg_lspid;  /* PID of last msgsnd(2) */
    pid_t           msg_lrpid;  /* PID of last msgrcv(2) */
};
The fields of the msqid_ds structure are as follows:
msg_perm
This is an ipc_perm structure (see below) that specifies the access permissions on the message queue.
msg_stime
Time of the last msgsnd(2) system call.
msg_rtime
Time of the last msgrcv(2) system call.
msg_ctime
Time of creation of queue or time of last msgctl() IPC_SET operation.
msg_cbytes
Number of bytes in all messages currently on the message queue. This is a nonstandard Linux extension that is not specified in POSIX.
msg_qnum
Number of messages currently on the message queue.
msg_qbytes
Maximum number of bytes of message text allowed on the message queue.
msg_lspid
ID of the process that performed the last msgsnd(2) system call.
msg_lrpid
ID of the process that performed the last msgrcv(2) system call.
The ipc_perm structure is defined as follows (the uid, gid, and mode fields are settable using IPC_SET):
struct ipc_perm {
    key_t          __key;  /* Key supplied to msgget(2) */
    uid_t          uid;    /* Effective UID of owner */
    gid_t          gid;    /* Effective GID of owner */
    uid_t          cuid;   /* Effective UID of creator */
    gid_t          cgid;   /* Effective GID of creator */
    unsigned short mode;   /* Permissions */
    unsigned short __seq;  /* Sequence number */
};
The least significant 9 bits of the mode field of the ipc_perm structure define the access permissions for the message queue. The permission bits are as follows:
0400    Read by user
0200    Write by user
0040    Read by group
0020    Write by group
0004    Read by others
0002    Write by others
Bits 0100, 0010, and 0001 (the execute bits) are unused by the system.
Valid values for op are:
IPC_STAT
Copy information from the kernel data structure associated with msqid into the msqid_ds structure pointed to by buf. The caller must have read permission on the message queue.
IPC_SET
Write the values of some members of the msqid_ds structure pointed to by buf to the kernel data structure associated with this message queue, updating also its msg_ctime member.
The following members of the structure are updated: msg_qbytes, msg_perm.uid, msg_perm.gid, and (the least significant 9 bits of) msg_perm.mode.
The effective UID of the calling process must match the owner (msg_perm.uid) or creator (msg_perm.cuid) of the message queue, or the caller must be privileged. Appropriate privilege (Linux: the CAP_SYS_RESOURCE capability) is required to raise the msg_qbytes value beyond the system parameter MSGMNB.
IPC_RMID
Immediately remove the message queue, awakening all waiting reader and writer processes (with an error return and errno set to EIDRM). The calling process must have appropriate privileges or its effective user ID must be either that of the creator or owner of the message queue. The third argument to msgctl() is ignored in this case.
IPC_INFO (Linux-specific)
Return information about system-wide message queue limits and parameters in the structure pointed to by buf. This structure is of type msginfo (thus, a cast is required), defined in <sys/msg.h> if the _GNU_SOURCE feature test macro is defined:
struct msginfo {
    int msgpool; /* Size in kibibytes of buffer pool
                    used to hold message data;
                    unused within kernel */
    int msgmap;  /* Maximum number of entries in message
                    map; unused within kernel */
    int msgmax;  /* Maximum number of bytes that can be
                    written in a single message */
    int msgmnb;  /* Maximum number of bytes that can be
                    written to queue; used to initialize
                    msg_qbytes during queue creation
                    (msgget(2)) */
    int msgmni;  /* Maximum number of message queues */
    int msgssz;  /* Message segment size;
                    unused within kernel */
    int msgtql;  /* Maximum number of messages on all queues
                    in system; unused within kernel */
    unsigned short msgseg;
                 /* Maximum number of segments;
                    unused within kernel */
};
The msgmni, msgmax, and msgmnb settings can be changed via /proc files of the same name; see proc(5) for details.
MSG_INFO (Linux-specific)
Return a msginfo structure containing the same information as for IPC_INFO, except that the following fields are returned with information about system resources consumed by message queues: the msgpool field returns the number of message queues that currently exist on the system; the msgmap field returns the total number of messages in all queues on the system; and the msgtql field returns the total number of bytes in all messages in all queues on the system.
MSG_STAT (Linux-specific)
Return a msqid_ds structure as for IPC_STAT. However, the msqid argument is not a queue identifier, but instead an index into the kernel’s internal array that maintains information about all message queues on the system.
MSG_STAT_ANY (Linux-specific, since Linux 4.17)
Return a msqid_ds structure as for MSG_STAT. However, msg_perm.mode is not checked for read access for msqid, meaning that any user can employ this operation (just as any user may read /proc/sysvipc/msg to obtain the same information).
RETURN VALUE
On success, IPC_STAT, IPC_SET, and IPC_RMID return 0. A successful IPC_INFO or MSG_INFO operation returns the index of the highest used entry in the kernel’s internal array recording information about all message queues. (This information can be used with repeated MSG_STAT or MSG_STAT_ANY operations to obtain information about all queues on the system.) A successful MSG_STAT or MSG_STAT_ANY operation returns the identifier of the queue whose index was given in msqid.
On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
The argument op is equal to IPC_STAT or MSG_STAT, but the calling process does not have read permission on the message queue msqid, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EFAULT
The argument op has the value IPC_SET or IPC_STAT, but the address pointed to by buf isn’t accessible.
EIDRM
The message queue was removed.
EINVAL
Invalid value for op or msqid. Or: for a MSG_STAT operation, the index value specified in msqid referred to an array slot that is currently unused.
EPERM
The argument op has the value IPC_SET or IPC_RMID, but the effective user ID of the calling process is not the creator (as found in msg_perm.cuid) or the owner (as found in msg_perm.uid) of the message queue, and the caller is not privileged (Linux: does not have the CAP_SYS_ADMIN capability).
EPERM
An attempt (IPC_SET) was made to increase msg_qbytes beyond the system parameter MSGMNB, but the caller is not privileged (Linux: does not have the CAP_SYS_RESOURCE capability).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
Various fields in the struct msqid_ds were typed as short under Linux 2.2 and have become long under Linux 2.4. To take advantage of this, a recompilation under glibc-2.1.91 or later should suffice. (The kernel distinguishes old and new calls by an IPC_64 flag in op.)
NOTES
The IPC_INFO, MSG_STAT, and MSG_INFO operations are used by the ipcs(1) program to provide information on allocated resources. In the future these may be modified or moved to a /proc filesystem interface.
SEE ALSO
msgget(2), msgrcv(2), msgsnd(2), capabilities(7), mq_overview(7), sysvipc(7)
72 - Linux cli command setfsuid
NAME
setfsuid - set user identity used for filesystem checks
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/fsuid.h>
[[deprecated]] int setfsuid(uid_t fsuid);
DESCRIPTION
On Linux, a process has both a filesystem user ID and an effective user ID. The (Linux-specific) filesystem user ID is used for permissions checking when accessing filesystem objects, while the effective user ID is used for various other kinds of permissions checks (see credentials(7)).
Normally, the value of the process’s filesystem user ID is the same as the value of its effective user ID. This is so, because whenever a process’s effective user ID is changed, the kernel also changes the filesystem user ID to be the same as the new value of the effective user ID. A process can cause the value of its filesystem user ID to diverge from its effective user ID by using setfsuid() to change its filesystem user ID to the value given in fsuid.
Explicit calls to setfsuid() and setfsgid(2) are (were) usually used only by programs such as the Linux NFS server that need to change what user and group ID is used for file access without a corresponding change in the real and effective user and group IDs. A change in the normal user IDs for a program such as the NFS server is (was) a security hole that can expose it to unwanted signals. (However, this issue is historical; see below.)
setfsuid() will succeed only if the caller is the superuser or if fsuid matches either the caller’s real user ID, effective user ID, saved set-user-ID, or current filesystem user ID.
RETURN VALUE
On both success and failure, this call returns the previous filesystem user ID of the caller.
STANDARDS
Linux.
HISTORY
Linux 1.2.
At the time when this system call was introduced, one process could send a signal to another process with the same effective user ID. This meant that if a privileged process changed its effective user ID for the purpose of file permission checking, then it could become vulnerable to receiving signals sent by another (unprivileged) process with the same user ID. The filesystem user ID attribute was thus added to allow a process to change its user ID for the purposes of file permission checking without at the same time becoming vulnerable to receiving unwanted signals. Since Linux 2.0, signal permission handling is different (see kill(2)), with the result that a process can change its effective user ID without being vulnerable to receiving signals from unwanted processes. Thus, setfsuid() is nowadays unneeded and should be avoided in new applications (likewise for setfsgid(2)).
The original Linux setfsuid() system call supported only 16-bit user IDs. Subsequently, Linux 2.4 added setfsuid32() supporting 32-bit IDs. The glibc setfsuid() wrapper function transparently deals with the variation across kernel versions.
C library/kernel differences
In glibc 2.15 and earlier, when the wrapper for this system call determines that the argument can’t be passed to the kernel without integer truncation (because the kernel is old and does not support 32-bit user IDs), it will return -1 and set errno to EINVAL without attempting the system call.
BUGS
No error indications of any kind are returned to the caller, and the fact that both successful and unsuccessful calls return the same value makes it impossible to directly determine whether the call succeeded or failed. Instead, the caller must resort to looking at the return value from a further call such as setfsuid(-1) (which will always fail), in order to determine if a preceding call to setfsuid() changed the filesystem user ID. At the very least, EPERM should be returned when the call fails (because the caller lacks the CAP_SETUID capability).
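The workaround described above, a follow-up call with -1 to read back the current value, can be wrapped as shown below. This is a sketch only; set_fsuid_checked() is a hypothetical name, and new code should generally avoid setfsuid() altogether, as noted in HISTORY.

```c
#include <assert.h>
#include <sys/fsuid.h>
#include <sys/types.h>

/* Hypothetical helper: tries to set the filesystem user ID and
   reports whether it took effect. Returns 0 if the filesystem user
   ID equals uid afterwards, -1 otherwise. */
int set_fsuid_checked(uid_t uid)
{
    assert(uid != (uid_t)-1);   /* -1 is reserved for the probe below */

    /* The return value is the previous ID on success and on failure,
       so it cannot tell us whether the change happened. */
    setfsuid(uid);

    /* setfsuid(-1) always fails and therefore returns the current
       filesystem user ID without changing it. */
    uid_t now = (uid_t)setfsuid((uid_t)-1);
    return (now == uid) ? 0 : -1;
}
```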
SEE ALSO
kill(2), setfsgid(2), capabilities(7), credentials(7)
73 - Linux cli command delete_module
NAME
delete_module - unload a kernel module
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h> /* Definition of O_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_delete_module, const char *name, unsigned int flags);
Note: glibc provides no wrapper for delete_module(), necessitating the use of syscall(2).
DESCRIPTION
The delete_module() system call attempts to remove the unused loadable module entry identified by name. If the module has an exit function, then that function is executed before unloading the module. The flags argument is used to modify the behavior of the system call, as described below. This system call requires privilege.
Module removal is attempted according to the following rules:
If there are other loaded modules that depend on (i.e., refer to symbols defined in) this module, then the call fails.
Otherwise, if the reference count for the module (i.e., the number of processes currently using the module) is zero, then the module is immediately unloaded.
If a module has a nonzero reference count, then the behavior depends on the bits set in flags. In normal usage (see NOTES), the O_NONBLOCK flag is always specified, and the O_TRUNC flag may additionally be specified.
The various combinations for flags have the following effect:
flags == O_NONBLOCK
The call returns immediately, with an error.
flags == (O_NONBLOCK | O_TRUNC)
The module is unloaded immediately, regardless of whether it has a nonzero reference count.
(flags & O_NONBLOCK) == 0
If flags does not specify O_NONBLOCK, the following steps occur:
The module is marked so that no new references are permitted.
If the module’s reference count is nonzero, the caller is placed in an uninterruptible sleep state (TASK_UNINTERRUPTIBLE) until the reference count is zero, at which point the call unblocks.
The module is unloaded in the usual way.
The O_TRUNC flag has one further effect on the rules described above. By default, if a module has an init function but no exit function, then an attempt to remove the module fails. However, if O_TRUNC was specified, this requirement is bypassed.
Using the O_TRUNC flag is dangerous! If the kernel was not built with CONFIG_MODULE_FORCE_UNLOAD, this flag is silently ignored. (Normally, CONFIG_MODULE_FORCE_UNLOAD is enabled.) Using this flag taints the kernel (TAINT_FORCED_RMMOD).
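Because glibc provides no wrapper, a caller has to go through syscall(2). The sketch below shows the conventional non-forcing form with O_NONBLOCK only; try_unload() is a hypothetical helper name. Note that with sufficient privilege this really does unload the named module, so it should only be run against a module you intend to remove.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>          /* O_NONBLOCK */
#include <sys/syscall.h>    /* SYS_delete_module */
#include <unistd.h>

/* Hypothetical helper: attempts a non-blocking, non-forced unload.
   Returns 0 on success, or the errno value reported by the kernel
   (for example EPERM, ENOENT, EWOULDBLOCK, or EBUSY). */
int try_unload(const char *name)
{
    assert(name != NULL);

    if (syscall(SYS_delete_module, name, O_NONBLOCK) == 0)
        return 0;
    return errno;
}
```

An unprivileged caller should expect EPERM, and a privileged caller naming a module that is not loaded should expect ENOENT.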
RETURN VALUE
On success, zero is returned. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EBUSY
The module is not “live” (i.e., it is still being initialized or is already marked for removal); or, the module has an init function but has no exit function, and O_TRUNC was not specified in flags.
EFAULT
name refers to a location outside the process’s accessible address space.
ENOENT
No module by that name exists.
EPERM
The caller was not privileged (did not have the CAP_SYS_MODULE capability), or module unloading is disabled (see /proc/sys/kernel/modules_disabled in proc(5)).
EWOULDBLOCK
Other modules depend on this module; or, O_NONBLOCK was specified in flags, but the reference count of this module is nonzero and O_TRUNC was not specified in flags.
STANDARDS
Linux.
HISTORY
The delete_module() system call is not supported by glibc. No declaration is provided in glibc headers, but, through a quirk of history, glibc versions before glibc 2.23 did export an ABI for this system call. Therefore, in order to employ this system call, it is (before glibc 2.23) sufficient to manually declare the interface in your code; alternatively, you can invoke the system call using syscall(2).
Linux 2.4 and earlier
In Linux 2.4 and earlier, the system call took only one argument:
int delete_module(const char *name);
If name is NULL, all unused modules marked auto-clean are removed.
Some further details of differences in the behavior of delete_module() in Linux 2.4 and earlier are not currently explained in this manual page.
NOTES
The uninterruptible sleep that may occur if O_NONBLOCK is omitted from flags is considered undesirable, because the sleeping process is left in an unkillable state. As of Linux 3.7, specifying O_NONBLOCK is optional, but in future kernels it is likely to become mandatory.
SEE ALSO
create_module(2), init_module(2), query_module(2), lsmod(8), modprobe(8), rmmod(8)
74 - Linux cli command getpmsg
NAME
getpmsg - unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
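The behavior can be checked directly on architectures whose headers still define a number for getpmsg. The following is a sketch with a hypothetical helper name; it passes dummy zero arguments, and it assumes that a sandbox or seccomp filter, if present, may substitute its own error (such as EPERM) for ENOSYS.

```c
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: invokes the raw getpmsg system call and
   returns the resulting errno value. Where the architecture has no
   number for it at all, ENOSYS is reported directly. */
int probe_getpmsg(void)
{
#ifdef SYS_getpmsg
    errno = 0;
    long rc = syscall(SYS_getpmsg, 0, 0, 0, 0, 0);
    assert(rc == -1);       /* per the text above, it never succeeds */
    return errno;
#else
    return ENOSYS;
#endif
}
```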
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
75 - Linux cli command renameat2
NAME
renameat2 - change the name or location of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <stdio.h>
int rename(const char *oldpath, const char *newpath);
#include <fcntl.h> /* Definition of AT_* constants */
#include <stdio.h>
int renameat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath);
int renameat2(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath, unsigned int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
renameat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
renameat2():
_GNU_SOURCE
DESCRIPTION
rename() renames a file, moving it between directories if required. Any other hard links to the file (as created using link(2)) are unaffected. Open file descriptors for oldpath are also unaffected.
Various restrictions determine whether or not the rename operation succeeds: see ERRORS below.
If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However, there will probably be a window in which both oldpath and newpath refer to the file being renamed.
If oldpath and newpath are existing hard links referring to the same file, then rename() does nothing, and returns a success status.
If newpath exists but the operation fails for some reason, rename() guarantees to leave an instance of newpath in place.
oldpath can specify a directory. In this case, newpath must either not exist, or it must specify an empty directory.
If oldpath refers to a symbolic link, the link is renamed; if newpath refers to a symbolic link, the link will be overwritten.
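The atomic-replacement guarantee described above is what makes the common "write a temporary file, then rename it into place" pattern work. The sketch below illustrates that pattern; replace_file_atomically() is a hypothetical helper, the ".tmp" suffix is an arbitrary choice, and a short write is treated as a failure to keep the example brief.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replaces the contents of path with data so
   that readers see either the old file or the complete new one.
   Returns 0 on success, -1 on failure with errno set. */
int replace_file_atomically(const char *path, const char *data)
{
    assert(path != NULL && data != NULL);

    /* The temporary file must be on the same filesystem as path,
       otherwise rename() fails with EXDEV. */
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    size_t len = strlen(data);
    int ok = write(fd, data, len) == (ssize_t)len && fsync(fd) == 0;
    int saved = errno;
    if (close(fd) == -1 && ok) {
        ok = 0;
        saved = errno;
    }
    /* The rename is the atomic step: newpath is never missing. */
    if (ok && rename(tmp, path) == -1) {
        ok = 0;
        saved = errno;
    }
    if (!ok) {
        unlink(tmp);
        errno = saved;
        return -1;
    }
    return 0;
}
```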
renameat()
The renameat() system call operates in exactly the same way as rename(), except for the differences described here.
If the pathname given in oldpath is relative, then it is interpreted relative to the directory referred to by the file descriptor olddirfd (rather than relative to the current working directory of the calling process, as is done by rename() for a relative pathname).
If oldpath is relative and olddirfd is the special value AT_FDCWD, then oldpath is interpreted relative to the current working directory of the calling process (like rename()).
If oldpath is absolute, then olddirfd is ignored.
The interpretation of newpath is as for oldpath, except that a relative pathname is interpreted relative to the directory referred to by the file descriptor newdirfd.
See openat(2) for an explanation of the need for renameat().
renameat2()
renameat2() has an additional flags argument. A renameat2() call with a zero flags argument is equivalent to renameat().
The flags argument is a bit mask consisting of zero or more of the following flags:
RENAME_EXCHANGE
Atomically exchange oldpath and newpath. Both pathnames must exist but may be of different types (e.g., one could be a non-empty directory and the other a symbolic link).
RENAME_NOREPLACE
Don’t overwrite newpath of the rename. Return an error if newpath already exists.
RENAME_NOREPLACE can’t be employed together with RENAME_EXCHANGE.
RENAME_NOREPLACE requires support from the underlying filesystem. Support for various filesystems was added as follows:
ext4 (Linux 3.15);
btrfs, tmpfs, and cifs (Linux 3.17);
xfs (Linux 4.0);
Support for many other filesystems was added in Linux 4.9, including ext2, minix, reiserfs, jfs, vfat, and bpf.
RENAME_WHITEOUT (since Linux 3.18)
This operation makes sense only for overlay/union filesystem implementations.
Specifying RENAME_WHITEOUT creates a “whiteout” object at the source of the rename at the same time as performing the rename. The whole operation is atomic, so that if the rename succeeds then the whiteout will also have been created.
A “whiteout” is an object that has special meaning in union/overlay filesystem constructs. In these constructs, multiple layers exist and only the top one is ever modified. A whiteout on an upper layer will effectively hide a matching file in the lower layer, making it appear as if the file didn’t exist.
When a file that exists on the lower layer is renamed, the file is first copied up (if not already on the upper layer) and then renamed on the upper, read-write layer. At the same time, the source file needs to be “whiteouted” (so that the version of the source file in the lower layer is rendered invisible). The whole operation needs to be done atomically.
When not part of a union/overlay, the whiteout appears as a character device with a {0,0} device number. (Note that other union/overlay implementations may employ different methods for storing whiteout entries; specifically, BSD union mount employs a separate inode type, DT_WHT, which, while supported by some filesystems available in Linux, such as CODA and XFS, is ignored by the kernel’s whiteout support code, as of Linux 4.19, at least.)
RENAME_WHITEOUT requires the same privileges as creating a device node (i.e., the CAP_MKNOD capability).
RENAME_WHITEOUT can’t be employed together with RENAME_EXCHANGE.
RENAME_WHITEOUT requires support from the underlying filesystem. Among the filesystems that support it are tmpfs (since Linux 3.18), ext4 (since Linux 3.18), XFS (since Linux 4.1), f2fs (since Linux 4.2), btrfs (since Linux 4.7), and ubifs (since Linux 4.9).
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Write permission is denied for the directory containing oldpath or newpath, or, search permission is denied for one of the directories in the path prefix of oldpath or newpath, or oldpath is a directory and does not allow write permission (needed to update the .. entry). (See also path_resolution(7).)
EBUSY
The rename fails because oldpath or newpath is a directory that is in use by some process (perhaps as current working directory, or as root directory, or because it was open for reading) or is in use by the system (for example as a mount point), while the system considers this an error. (Note that there is no requirement to return EBUSY in such cases; there is nothing wrong with doing the rename anyway. However, a system is allowed to return EBUSY if it cannot otherwise handle such situations.)
EDQUOT
The user’s quota of disk blocks on the filesystem has been exhausted.
EFAULT
oldpath or newpath points outside your accessible address space.
EINVAL
The new pathname contained a path prefix of the old, or, more generally, an attempt was made to make a directory a subdirectory of itself.
EISDIR
newpath is an existing directory, but oldpath is not a directory.
ELOOP
Too many symbolic links were encountered in resolving oldpath or newpath.
EMLINK
oldpath already has the maximum number of links to it, or it was a directory and the directory containing newpath has the maximum number of links.
ENAMETOOLONG
oldpath or newpath was too long.
ENOENT
The link named by oldpath does not exist; or, a directory component in newpath does not exist; or, oldpath or newpath is an empty string.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The device containing the file has no room for the new directory entry.
ENOTDIR
A component used as a directory in oldpath or newpath is not, in fact, a directory. Or, oldpath is a directory, and newpath exists but is not a directory.
ENOTEMPTY or EEXIST
newpath is a nonempty directory, that is, contains entries other than “.” and “..”.
EPERM or EACCES
The directory containing oldpath has the sticky bit (S_ISVTX) set and the process’s effective user ID is neither the user ID of the file to be deleted nor that of the directory containing it, and the process is not privileged (Linux: does not have the CAP_FOWNER capability); or newpath is an existing file and the directory containing it has the sticky bit set and the process’s effective user ID is neither the user ID of the file to be replaced nor that of the directory containing it, and the process is not privileged (Linux: does not have the CAP_FOWNER capability); or the filesystem containing oldpath does not support renaming of the type requested.
EROFS
The file is on a read-only filesystem.
EXDEV
oldpath and newpath are not on the same mounted filesystem. (Linux permits a filesystem to be mounted at multiple points, but rename() does not work across different mount points, even if the same filesystem is mounted on both.)
The following additional errors can occur for renameat() and renameat2():
EBADF
oldpath (newpath) is relative but olddirfd (newdirfd) is not a valid file descriptor.
ENOTDIR
oldpath is relative and olddirfd is a file descriptor referring to a file other than a directory; or similar for newpath and newdirfd.
The following additional errors can occur for renameat2():
EEXIST
flags contains RENAME_NOREPLACE and newpath already exists.
EINVAL
An invalid flag was specified in flags.
EINVAL
Both RENAME_NOREPLACE and RENAME_EXCHANGE were specified in flags.
EINVAL
Both RENAME_WHITEOUT and RENAME_EXCHANGE were specified in flags.
EINVAL
The filesystem does not support one of the flags in flags.
ENOENT
flags contains RENAME_EXCHANGE and newpath does not exist.
EPERM
RENAME_WHITEOUT was specified in flags, but the caller does not have the CAP_MKNOD capability.
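As a minimal sketch of how a caller might react to the errors listed above, the following fragment wraps rename() and maps a few errno values to a short hint. The helper names try_rename() and rename_hint() are hypothetical, and the hint strings are illustrative only; they are not part of any library API.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: returns 0 on success, or the errno value
   left by a failed rename(). */
static int
try_rename(const char *oldpath, const char *newpath)
{
    return (rename(oldpath, newpath) == 0) ? 0 : errno;
}

/* Hypothetical helper: maps selected error numbers from the list
   above to a short, illustrative hint. */
static const char *
rename_hint(int err)
{
    switch (err) {
    case 0:         return "renamed";
    case ENOENT:    return "oldpath or a directory in newpath is missing";
    case EXDEV:     return "different mount points: copy, then unlink";
    case ENOTEMPTY: /* either value may be reported */
    case EEXIST:    return "newpath is a nonempty directory";
    case EACCES:
    case EPERM:     return "permission problem (check sticky-bit rules)";
    default:        return "unhandled error";
    }
}
```

A caller could, for example, print rename_hint(try_rename(oldpath, newpath)) to stderr and fall back to a copy-and-unlink strategy only when the result is EXDEV.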
STANDARDS
rename()
C11, POSIX.1-2008.
renameat()
POSIX.1-2008.
renameat2()
Linux.
HISTORY
rename()
4.3BSD, C89, POSIX.1-2001.
renameat()
Linux 2.6.16, glibc 2.4.
renameat2()
Linux 3.15, glibc 2.28.
glibc notes
On older kernels where renameat() is unavailable, the glibc wrapper function falls back to the use of rename(). When oldpath and newpath are relative pathnames, glibc constructs pathnames based on the symbolic links in /proc/self/fd that correspond to the olddirfd and newdirfd arguments.
BUGS
On NFS filesystems, you cannot assume that if the operation failed, the file was not renamed. If the server does the rename operation and then crashes, the retransmitted RPC, which will be processed when the server is up again, causes a failure. The application is expected to deal with this. See link(2) for a similar problem.
SEE ALSO
mv(1), rename(1), chmod(2), link(2), symlink(2), unlink(2), path_resolution(7), symlink(7)
76 - Linux cli command oldlstat
NAME
oldlstat - get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but, in the case of stat(), fstatat(), and lstat(), execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
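The difference between stat() and lstat() can be sketched as follows. The helper names is_symlink() and resolves_to_directory() are hypothetical; the fragment assumes a glibc build with default feature test macros.

```c
#include <sys/stat.h>

/* Returns 1 if pathname itself is a symbolic link, 0 if it is not,
   and -1 if lstat() fails (errno is set by lstat()). */
static int
is_symlink(const char *pathname)
{
    struct stat sb;

    if (lstat(pathname, &sb) == -1)
        return -1;
    return S_ISLNK(sb.st_mode) ? 1 : 0;
}

/* Returns 1 if pathname, after following any symbolic links, names a
   directory; 0 if it does not; -1 if stat() fails. */
static int
resolves_to_directory(const char *pathname)
{
    struct stat sb;

    if (stat(pathname, &sb) == -1)
        return -1;
    return S_ISDIR(sb.st_mode) ? 1 : 0;
}
```

For a symbolic link that points to a directory, is_symlink() would report 1 while resolves_to_directory() would also report 1, because only the latter follows the link.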
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <time.h>
int
main(int argc, char *argv[])
{
    struct stat sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    printf("ID of containing device: [%x,%x]\n",
           major(sb.st_dev), minor(sb.st_dev));

    printf("File type: ");

    switch (sb.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");     break;
    case S_IFCHR:  printf("character device\n"); break;
    case S_IFDIR:  printf("directory\n");        break;
    case S_IFIFO:  printf("FIFO/pipe\n");        break;
    case S_IFLNK:  printf("symlink\n");          break;
    case S_IFREG:  printf("regular file\n");     break;
    case S_IFSOCK: printf("socket\n");           break;
    default:       printf("unknown?\n");         break;
    }

    printf("I-node number: %ju\n", (uintmax_t) sb.st_ino);
    printf("Mode: %jo (octal)\n", (uintmax_t) sb.st_mode);
    printf("Link count: %ju\n", (uintmax_t) sb.st_nlink);
    printf("Ownership: UID=%ju GID=%ju\n",
           (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);

    printf("Preferred I/O block size: %jd bytes\n",
           (intmax_t) sb.st_blksize);
    printf("File size: %jd bytes\n", (intmax_t) sb.st_size);
    printf("Blocks allocated: %jd\n", (intmax_t) sb.st_blocks);

    /* ctime() already terminates its result with a newline. */
    printf("Last status change: %s", ctime(&sb.st_ctime));
    printf("Last file access: %s", ctime(&sb.st_atime));
    printf("Last file modification: %s", ctime(&sb.st_mtime));

    exit(EXIT_SUCCESS);
}
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
77 - Linux cli command rt_sigaction
NAME
rt_sigaction - examine and change a signal action
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
int sigaction(int signum,
const struct sigaction *_Nullable restrict act,
struct sigaction *_Nullable restrict oldact);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigaction():
_POSIX_C_SOURCE
siginfo_t:
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
The sigaction() system call is used to change the action taken by a process on receipt of a specific signal. (See signal(7) for an overview of signals.)
signum specifies the signal and can be any valid signal except SIGKILL and SIGSTOP.
If act is non-NULL, the new action for signal signum is installed from act. If oldact is non-NULL, the previous action is saved in oldact.
The sigaction structure is defined as something like:
struct sigaction {
void (*sa_handler)(int);
void (*sa_sigaction)(int, siginfo_t *, void *);
sigset_t sa_mask;
int sa_flags;
void (*sa_restorer)(void);
};
On some architectures a union is involved: do not assign to both sa_handler and sa_sigaction.
The sa_restorer field is not intended for application use. (POSIX does not specify a sa_restorer field.) Some further details of the purpose of this field can be found in sigreturn(2).
sa_handler specifies the action to be associated with signum and can be one of the following:
SIG_DFL for the default action.
SIG_IGN to ignore this signal.
A pointer to a signal handling function. This function receives the signal number as its only argument.
If SA_SIGINFO is specified in sa_flags, then sa_sigaction (instead of sa_handler) specifies the signal-handling function for signum. This function receives three arguments, as described below.
sa_mask specifies a mask of signals which should be blocked (i.e., added to the signal mask of the thread in which the signal handler is invoked) during execution of the signal handler. In addition, the signal which triggered the handler will be blocked, unless the SA_NODEFER flag is used.
sa_flags specifies a set of flags which modify the behavior of the signal. It is formed by the bitwise OR of zero or more of the following:
SA_NOCLDSTOP
If signum is SIGCHLD, do not receive notification when child processes stop (i.e., when they receive one of SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU) or resume (i.e., they receive SIGCONT) (see wait(2)). This flag is meaningful only when establishing a handler for SIGCHLD.
SA_NOCLDWAIT (since Linux 2.6)
If signum is SIGCHLD, do not transform children into zombies when they terminate. See also waitpid(2). This flag is meaningful only when establishing a handler for SIGCHLD, or when setting that signal’s disposition to SIG_DFL.
If the SA_NOCLDWAIT flag is set when establishing a handler for SIGCHLD, POSIX.1 leaves it unspecified whether a SIGCHLD signal is generated when a child process terminates. On Linux, a SIGCHLD signal is generated in this case; on some other implementations, it is not.
SA_NODEFER
Do not add the signal to the thread’s signal mask while the handler is executing, unless the signal is specified in act.sa_mask. Consequently, a further instance of the signal may be delivered to the thread while it is executing the handler. This flag is meaningful only when establishing a signal handler.
SA_NOMASK is an obsolete, nonstandard synonym for this flag.
SA_ONSTACK
Call the signal handler on an alternate signal stack provided by sigaltstack(2). If an alternate stack is not available, the default stack will be used. This flag is meaningful only when establishing a signal handler.
SA_RESETHAND
Restore the signal action to the default upon entry to the signal handler. This flag is meaningful only when establishing a signal handler.
SA_ONESHOT is an obsolete, nonstandard synonym for this flag.
SA_RESTART
Provide behavior compatible with BSD signal semantics by making certain system calls restartable across signals. This flag is meaningful only when establishing a signal handler. See signal(7) for a discussion of system call restarting.
SA_RESTORER
Not intended for application use. This flag is used by C libraries to indicate that the sa_restorer field contains the address of a “signal trampoline”. See sigreturn(2) for more details.
SA_SIGINFO (since Linux 2.2)
The signal handler takes three arguments, not one. In this case, sa_sigaction should be set instead of sa_handler. This flag is meaningful only when establishing a signal handler.
SA_UNSUPPORTED (since Linux 5.11)
Used to dynamically probe for flag bit support.
If an attempt to register a handler succeeds with this flag set in act->sa_flags alongside other flags that are potentially unsupported by the kernel, and an immediately subsequent sigaction() call specifying the same signal number and with a non-NULL oldact argument yields SA_UNSUPPORTED clear in oldact->sa_flags, then oldact->sa_flags may be used as a bitmask describing which of the potentially unsupported flags are, in fact, supported. See the section “Dynamically probing for flag bit support” below for more details.
SA_EXPOSE_TAGBITS (since Linux 5.11)
Normally, when delivering a signal, an architecture-specific set of tag bits are cleared from the si_addr field of siginfo_t. If this flag is set, an architecture-specific subset of the tag bits will be preserved in si_addr.
Programs that need to be compatible with Linux versions older than 5.11 must use SA_UNSUPPORTED to probe for support.
The siginfo_t argument to a SA_SIGINFO handler
When the SA_SIGINFO flag is specified in act.sa_flags, the signal handler address is passed via the act.sa_sigaction field. This handler takes three arguments, as follows:
void
handler(int sig, siginfo_t *info, void *ucontext)
{
...
}
These three arguments are as follows:
sig
The number of the signal that caused invocation of the handler.
info
A pointer to a siginfo_t, which is a structure containing further information about the signal, as described below.
ucontext
This is a pointer to a ucontext_t structure, cast to void *. The structure pointed to by this field contains signal context information that was saved on the user-space stack by the kernel; for details, see sigreturn(2). Further information about the ucontext_t structure can be found in getcontext(3) and signal(7). Commonly, the handler function doesn’t make any use of the third argument.
The siginfo_t data type is a structure with the following fields:
siginfo_t {
int si_signo; /* Signal number */
int si_errno; /* An errno value */
int si_code; /* Signal code */
int si_trapno; /* Trap number that caused
hardware-generated signal
(unused on most architectures) */
pid_t si_pid; /* Sending process ID */
uid_t si_uid; /* Real user ID of sending process */
int si_status; /* Exit value or signal */
clock_t si_utime; /* User time consumed */
clock_t si_stime; /* System time consumed */
union sigval si_value; /* Signal value */
int si_int; /* POSIX.1b signal */
void *si_ptr; /* POSIX.1b signal */
int si_overrun; /* Timer overrun count;
POSIX.1b timers */
int si_timerid; /* Timer ID; POSIX.1b timers */
void *si_addr; /* Memory location which caused fault */
long si_band; /* Band event (was int in
glibc 2.3.2 and earlier) */
int si_fd; /* File descriptor */
short si_addr_lsb; /* Least significant bit of address
(since Linux 2.6.32) */
void *si_lower; /* Lower bound when address violation
occurred (since Linux 3.19) */
void *si_upper; /* Upper bound when address violation
occurred (since Linux 3.19) */
int si_pkey; /* Protection key on PTE that caused
fault (since Linux 4.6) */
void *si_call_addr; /* Address of system call instruction
(since Linux 3.5) */
int si_syscall; /* Number of attempted system call
(since Linux 3.5) */
unsigned int si_arch; /* Architecture of attempted system call
(since Linux 3.5) */
}
si_signo, si_errno and si_code are defined for all signals. (si_errno is generally unused on Linux.) The rest of the struct may be a union, so that one should read only the fields that are meaningful for the given signal:
Signals sent with kill(2) and sigqueue(3) fill in si_pid and si_uid. In addition, signals sent with sigqueue(3) fill in si_int and si_ptr with the values specified by the sender of the signal; see sigqueue(3) for more details.
Signals sent by POSIX.1b timers (since Linux 2.6) fill in si_overrun and si_timerid. The si_timerid field is an internal ID used by the kernel to identify the timer; it is not the same as the timer ID returned by timer_create(2). The si_overrun field is the timer overrun count; this is the same information as is obtained by a call to timer_getoverrun(2). These fields are nonstandard Linux extensions.
Signals sent for message queue notification (see the description of SIGEV_SIGNAL in mq_notify(3)) fill in si_int/si_ptr, with the sigev_value supplied to mq_notify(3); si_pid, with the process ID of the message sender; and si_uid, with the real user ID of the message sender.
SIGCHLD fills in si_pid, si_uid, si_status, si_utime, and si_stime, providing information about the child. The si_pid field is the process ID of the child; si_uid is the child’s real user ID. The si_status field contains the exit status of the child (if si_code is CLD_EXITED), or the signal number that caused the process to change state. The si_utime and si_stime contain the user and system CPU time used by the child process; these fields do not include the times used by waited-for children (unlike getrusage(2) and times(2)). Up to Linux 2.6, and since Linux 2.6.27, these fields report CPU time in units of sysconf(_SC_CLK_TCK). In Linux 2.6 kernels before Linux 2.6.27, a bug meant that these fields reported time in units of the (configurable) system jiffy (see time(7)).
SIGILL, SIGFPE, SIGSEGV, SIGBUS, and SIGTRAP fill in si_addr with the address of the fault. On some architectures, these signals also fill in the si_trapno field.
Some suberrors of SIGBUS, in particular BUS_MCEERR_AO and BUS_MCEERR_AR, also fill in si_addr_lsb. This field indicates the least significant bit of the reported address and therefore the extent of the corruption. For example, if a full page was corrupted, si_addr_lsb contains log2(sysconf(_SC_PAGESIZE)). When SIGTRAP is delivered in response to a ptrace(2) event (PTRACE_EVENT_foo), si_addr is not populated, but si_pid and si_uid are populated with the respective process ID and user ID responsible for delivering the trap. In the case of seccomp(2), the tracee will be shown as delivering the event. BUS_MCEERR_* and si_addr_lsb are Linux-specific extensions.
The SEGV_BNDERR suberror of SIGSEGV populates si_lower and si_upper.
The SEGV_PKUERR suberror of SIGSEGV populates si_pkey.
SIGIO/SIGPOLL (the two names are synonyms on Linux) fills in si_band and si_fd. The si_band event is a bit mask containing the same values as are filled in the revents field by poll(2). The si_fd field indicates the file descriptor for which the I/O event occurred; for further details, see the description of F_SETSIG in fcntl(2).
SIGSYS, generated (since Linux 3.5) when a seccomp filter returns SECCOMP_RET_TRAP, fills in si_call_addr, si_syscall, si_arch, si_errno, and other fields as described in seccomp(2).
The si_code field
The si_code field inside the siginfo_t argument that is passed to a SA_SIGINFO signal handler is a value (not a bit mask) indicating why this signal was sent. For a ptrace(2) event, si_code will contain SIGTRAP and have the ptrace event in the high byte:
(SIGTRAP | PTRACE_EVENT_foo << 8).
For a non-ptrace(2) event, the values that can appear in si_code are described in the remainder of this section. Since glibc 2.20, the definitions of most of these symbols are obtained from <signal.h> by defining feature test macros (before including any header file) as follows:
_XOPEN_SOURCE with the value 500 or greater;
_XOPEN_SOURCE and _XOPEN_SOURCE_EXTENDED; or
_POSIX_C_SOURCE with the value 200809L or greater.
For the TRAP_* constants, the symbol definitions are provided only in the first two cases. Before glibc 2.20, no feature test macros were required to obtain these symbols.
For a regular signal, the following list shows the values which can be placed in si_code for any signal, along with the reason that the signal was generated.
SI_USER
kill(2).
SI_KERNEL
Sent by the kernel.
SI_QUEUE
sigqueue(3).
SI_TIMER
POSIX timer expired.
SI_MESGQ (since Linux 2.6.6)
POSIX message queue state changed; see mq_notify(3).
SI_ASYNCIO
AIO completed.
SI_SIGIO
Queued SIGIO (only up to Linux 2.2; from Linux 2.4 onward SIGIO/SIGPOLL fills in si_code as described below).
SI_TKILL (since Linux 2.4.19)
tkill(2) or tgkill(2).
The following values can be placed in si_code for a SIGILL signal:
ILL_ILLOPC
Illegal opcode.
ILL_ILLOPN
Illegal operand.
ILL_ILLADR
Illegal addressing mode.
ILL_ILLTRP
Illegal trap.
ILL_PRVOPC
Privileged opcode.
ILL_PRVREG
Privileged register.
ILL_COPROC
Coprocessor error.
ILL_BADSTK
Internal stack error.
The following values can be placed in si_code for a SIGFPE signal:
FPE_INTDIV
Integer divide by zero.
FPE_INTOVF
Integer overflow.
FPE_FLTDIV
Floating-point divide by zero.
FPE_FLTOVF
Floating-point overflow.
FPE_FLTUND
Floating-point underflow.
FPE_FLTRES
Floating-point inexact result.
FPE_FLTINV
Floating-point invalid operation.
FPE_FLTSUB
Subscript out of range.
The following values can be placed in si_code for a SIGSEGV signal:
SEGV_MAPERR
Address not mapped to object.
SEGV_ACCERR
Invalid permissions for mapped object.
SEGV_BNDERR (since Linux 3.19)
Failed address bound checks.
SEGV_PKUERR (since Linux 4.6)
Access was denied by memory protection keys. See pkeys(7). The protection key which applied to this access is available via si_pkey.
The following values can be placed in si_code for a SIGBUS signal:
BUS_ADRALN
Invalid address alignment.
BUS_ADRERR
Nonexistent physical address.
BUS_OBJERR
Object-specific hardware error.
BUS_MCEERR_AR (since Linux 2.6.32)
Hardware memory error consumed on a machine check; action required.
BUS_MCEERR_AO (since Linux 2.6.32)
Hardware memory error detected in process but not consumed; action optional.
The following values can be placed in si_code for a SIGTRAP signal:
TRAP_BRKPT
Process breakpoint.
TRAP_TRACE
Process trace trap.
TRAP_BRANCH (since Linux 2.4, IA64 only)
Process taken branch trap.
TRAP_HWBKPT (since Linux 2.4, IA64 only)
Hardware breakpoint/watchpoint.
The following values can be placed in si_code for a SIGCHLD signal:
CLD_EXITED
Child has exited.
CLD_KILLED
Child was killed.
CLD_DUMPED
Child terminated abnormally.
CLD_TRAPPED
Traced child has trapped.
CLD_STOPPED
Child has stopped.
CLD_CONTINUED (since Linux 2.6.9)
Stopped child has continued.
The following values can be placed in si_code for a SIGIO/SIGPOLL signal:
POLL_IN
Data input available.
POLL_OUT
Output buffers available.
POLL_MSG
Input message available.
POLL_ERR
I/O error.
POLL_PRI
High priority input available.
POLL_HUP
Device disconnected.
The following value can be placed in si_code for a SIGSYS signal:
SYS_SECCOMP (since Linux 3.5)
Triggered by a seccomp(2) filter rule.
Dynamically probing for flag bit support
The sigaction() call on Linux accepts unknown bits set in act->sa_flags without error. The behavior of the kernel starting with Linux 5.11 is that a second sigaction() will clear unknown bits from oldact->sa_flags. However, historically, a second sigaction() call would typically leave those bits set in oldact->sa_flags.
This means that support for new flags cannot be detected simply by testing for a flag in sa_flags, and a program must test that SA_UNSUPPORTED has been cleared before relying on the contents of sa_flags.
Since the behavior of the signal handler cannot be guaranteed unless the check passes, it is wise to either block the affected signal while registering the handler and performing the check in this case, or where this is not possible, for example if the signal is synchronous, to issue the second sigaction() in the signal handler itself.
In kernels that do not support a specific flag, the kernel’s behavior is as if the flag was not set, even if the flag was set in act->sa_flags.
The flags SA_NOCLDSTOP, SA_NOCLDWAIT, SA_SIGINFO, SA_ONSTACK, SA_RESTART, SA_NODEFER, SA_RESETHAND, and, if defined by the architecture, SA_RESTORER may not be reliably probed for using this mechanism, because they were introduced before Linux 5.11. However, in general, programs may assume that these flags are supported, since they have all been supported since Linux 2.6, which was released in the year 2003.
See EXAMPLES below for a demonstration of the use of SA_UNSUPPORTED.
RETURN VALUE
sigaction() returns 0 on success; on error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
act or oldact points to memory which is not a valid part of the process address space.
EINVAL
An invalid signal was specified. This will also be generated if an attempt is made to change the action for SIGKILL or SIGSTOP, which cannot be caught or ignored.
VERSIONS
C library/kernel differences
The glibc wrapper function for sigaction() gives an error (EINVAL) on attempts to change the disposition of the two real-time signals used internally by the NPTL threading implementation. See nptl(7) for details.
On architectures where the signal trampoline resides in the C library, the glibc wrapper function for sigaction() places the address of the trampoline code in the act.sa_restorer field and sets the SA_RESTORER flag in the act.sa_flags field. See sigreturn(2).
The original Linux system call was named sigaction(). However, with the addition of real-time signals in Linux 2.2, the fixed-size, 32-bit sigset_t type supported by that system call was no longer fit for purpose. Consequently, a new system call, rt_sigaction(), was added to support an enlarged sigset_t type. The new system call takes a fourth argument, size_t sigsetsize, which specifies the size in bytes of the signal sets in act.sa_mask and oldact.sa_mask. This argument is currently required to have the value sizeof(sigset_t) (or the error EINVAL results). The glibc sigaction() wrapper function hides these details from us, transparently calling rt_sigaction() when the kernel provides it.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
POSIX.1-1990 disallowed setting the action for SIGCHLD to SIG_IGN. POSIX.1-2001 and later allow this possibility, so that ignoring SIGCHLD can be used to prevent the creation of zombies (see wait(2)). Nevertheless, the historical BSD and System V behaviors for ignoring SIGCHLD differ, so that the only completely portable method of ensuring that terminated children do not become zombies is to catch the SIGCHLD signal and perform a wait(2) or similar.
POSIX.1-1990 specified only SA_NOCLDSTOP. POSIX.1-2001 added SA_NOCLDWAIT, SA_NODEFER, SA_ONSTACK, SA_RESETHAND, SA_RESTART, and SA_SIGINFO as XSI extensions. POSIX.1-2008 moved SA_NODEFER, SA_RESETHAND, SA_RESTART, and SA_SIGINFO to the base specifications. Use of these latter values in sa_flags may be less portable in applications intended for older UNIX implementations.
The SA_RESETHAND flag is compatible with the SVr4 flag of the same name.
The SA_NODEFER flag is compatible with the SVr4 flag of the same name under kernels 1.3.9 and later. On older kernels the Linux implementation allowed the receipt of any signal, not just the one we are installing (effectively overriding any sa_mask settings).
NOTES
A child created via fork(2) inherits a copy of its parent’s signal dispositions. During an execve(2), the dispositions of handled signals are reset to the default; the dispositions of ignored signals are left unchanged.
According to POSIX, the behavior of a process is undefined after it ignores a SIGFPE, SIGILL, or SIGSEGV signal that was not generated by kill(2) or raise(3). Integer division by zero has an undefined result. On some architectures it will generate a SIGFPE signal. (Dividing the most negative integer by -1 may also generate SIGFPE.) Ignoring this signal might lead to an endless loop.
sigaction() can be called with a NULL second argument to query the current signal handler. It can also be used to check whether a given signal is valid for the current machine by calling it with NULL second and third arguments.
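Both query forms described above can be sketched as follows. This is a minimal illustration; the helper names signal_is_ignored() and signal_is_valid() are hypothetical and not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if signo is currently ignored,
   0 if it is not, and -1 on error (e.g., an invalid signal number). */
int signal_is_ignored(int signo)
{
    struct sigaction cur;

    /* NULL second argument: query only, the disposition is unchanged. */
    if (sigaction(signo, NULL, &cur) == -1)
        return -1;
    return cur.sa_handler == SIG_IGN;
}

/* Hypothetical helper: returns 1 if signo is a valid signal number
   on this machine, 0 otherwise. */
int signal_is_valid(int signo)
{
    /* NULL second and third arguments: validity check only. */
    return sigaction(signo, NULL, NULL) == 0;
}
```

For example, after signal(SIGUSR1, SIG_IGN), signal_is_ignored(SIGUSR1) is expected to return 1.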
It is not possible to block SIGKILL or SIGSTOP (by specifying them in sa_mask). Attempts to do so are silently ignored.
See sigsetops(3) for details on manipulating signal sets.
See signal-safety(7) for a list of the async-signal-safe functions that can be safely called from inside a signal handler.
Undocumented
Before the introduction of SA_SIGINFO, it was also possible to get some additional information about the signal. This was done by providing an sa_handler signal handler with a second argument of type struct sigcontext, which is the same structure as the one that is passed in the uc_mcontext field of the ucontext structure that is passed (via a pointer) in the third argument of the sa_sigaction handler. See the relevant Linux kernel sources for details. This use is obsolete now.
BUGS
When delivering a signal with a SA_SIGINFO handler, the kernel does not always provide meaningful values for all of the fields of the siginfo_t that are relevant for that signal.
Up to and including Linux 2.6.13, specifying SA_NODEFER in sa_flags prevents not only the delivered signal from being masked during execution of the handler, but also the signals specified in sa_mask. This bug was fixed in Linux 2.6.14.
EXAMPLES
See mprotect(2).
Probing for flag support
The following example program exits with status EXIT_SUCCESS if SA_EXPOSE_TAGBITS is determined to be supported, and EXIT_FAILURE otherwise.
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void
handler(int signo, siginfo_t *info, void *context)
{
    struct sigaction oldact;

    if (sigaction(SIGSEGV, NULL, &oldact) == -1
        || (oldact.sa_flags & SA_UNSUPPORTED)
        || !(oldact.sa_flags & SA_EXPOSE_TAGBITS))
    {
        _exit(EXIT_FAILURE);
    }
    _exit(EXIT_SUCCESS);
}

int
main(void)
{
    struct sigaction act = { 0 };

    act.sa_flags = SA_SIGINFO | SA_UNSUPPORTED | SA_EXPOSE_TAGBITS;
    act.sa_sigaction = &handler;
    if (sigaction(SIGSEGV, &act, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    raise(SIGSEGV);
}
SEE ALSO
kill(1), kill(2), pause(2), pidfd_send_signal(2), restart_syscall(2), seccomp(2), sigaltstack(2), signal(2), signalfd(2), sigpending(2), sigprocmask(2), sigreturn(2), sigsuspend(2), wait(2), killpg(3), raise(3), siginterrupt(3), sigqueue(3), sigsetops(3), sigvec(3), core(5), signal(7)
78 - Linux cli command open
NAME 🖥️ open 🖥️
open and possibly create a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h>
int open(const char *pathname, int flags, ...
/* mode_t mode */ );
int creat(const char *pathname, mode_t mode);
int openat(int dirfd, const char *pathname, int flags, ...
/* mode_t mode */ );
/* Documented separately, in
openat2(2):
*/
int openat2(int dirfd, const char *pathname,
const struct open_how *how, size_t size);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
openat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
The open() system call opens the file specified by pathname. If the specified file does not exist, it may optionally (if O_CREAT is specified in flags) be created by open().
The return value of open() is a file descriptor, a small, nonnegative integer that is an index to an entry in the process’s table of open file descriptors. The file descriptor is used in subsequent system calls (read(2), write(2), lseek(2), fcntl(2), etc.) to refer to the open file. The file descriptor returned by a successful call will be the lowest-numbered file descriptor not currently open for the process.
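The lowest-numbered rule can be observed with a small sketch. The helper name reuses_lowest_fd() is hypothetical, and the sketch assumes a single-threaded process (another thread opening files at the same time could take the freed number first).

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: opens path twice, closes the first descriptor,
   then opens path again. Returns 1 if the third open() reused the
   number that was just freed, 0 if not, and -1 on error. */
int reuses_lowest_fd(const char *path)
{
    int a, b, c, result;

    a = open(path, O_RDONLY);
    if (a == -1)
        return -1;
    b = open(path, O_RDONLY);
    if (b == -1) {
        close(a);
        return -1;
    }

    close(a);                   /* 'a' is now the lowest free number */
    c = open(path, O_RDONLY);
    result = (c == -1) ? -1 : (c == a);

    if (c != -1)
        close(c);
    close(b);
    return result;
}
```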
By default, the new file descriptor is set to remain open across an execve(2) (i.e., the FD_CLOEXEC file descriptor flag described in fcntl(2) is initially disabled); the O_CLOEXEC flag, described below, can be used to change this default. The file offset is set to the beginning of the file (see lseek(2)).
A call to open() creates a new open file description, an entry in the system-wide table of open files. The open file description records the file offset and the file status flags (see below). A file descriptor is a reference to an open file description; this reference is unaffected if pathname is subsequently removed or modified to refer to a different file. For further details on open file descriptions, see NOTES.
The argument flags must include one of the following access modes: O_RDONLY, O_WRONLY, or O_RDWR. These request opening the file read-only, write-only, or read/write, respectively.
In addition, zero or more file creation flags and file status flags can be bitwise ORed in flags. The file creation flags are O_CLOEXEC, O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TMPFILE, and O_TRUNC. The file status flags are all of the remaining flags listed below. The distinction between these two groups of flags is that the file creation flags affect the semantics of the open operation itself, while the file status flags affect the semantics of subsequent I/O operations. The file status flags can be retrieved and (in some cases) modified; see fcntl(2) for details.
The full list of file creation flags and file status flags is as follows:
O_APPEND
The file is opened in append mode. Before each write(2), the file offset is positioned at the end of the file, as if with lseek(2). The modification of the file offset and the write operation are performed as a single atomic step.
O_APPEND may lead to corrupted files on NFS filesystems if more than one process appends data to a file at once. This is because NFS does not support appending to a file, so the client kernel has to simulate it, which can’t be done without a race condition.
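A minimal sketch of append mode on a local filesystem follows. The helper name append_after_rewind() is hypothetical. It deliberately seeks to the start of the file before writing, to show that the write still lands at the end.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: opens path in append mode, rewinds, writes msg,
   and returns the file offset after the write, or -1 on error. */
long append_after_rewind(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    off_t end;
    int fd;

    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;

    /* With O_APPEND, this seek does not change where write(2) lands. */
    lseek(fd, 0, SEEK_SET);

    if (write(fd, msg, len) != (ssize_t) len) {
        close(fd);
        return -1;
    }

    end = lseek(fd, 0, SEEK_CUR);   /* offset now sits at end of file */
    close(fd);
    return (long) end;
}
```

If the file already held 3 bytes and msg is 2 bytes long, the returned offset is 5 rather than 2.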
O_ASYNC
Enable signal-driven I/O: generate a signal (SIGIO by default, but this can be changed via fcntl(2)) when input or output becomes possible on this file descriptor. This feature is available only for terminals, pseudoterminals, sockets, and (since Linux 2.6) pipes and FIFOs. See fcntl(2) for further details. See also BUGS, below.
O_CLOEXEC (since Linux 2.6.23)
Enable the close-on-exec flag for the new file descriptor. Specifying this flag permits a program to avoid additional fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.
Note that the use of this flag is essential in some multithreaded programs, because using a separate fcntl(2) F_SETFD operation to set the FD_CLOEXEC flag does not suffice to avoid race conditions where one thread opens a file descriptor and attempts to set its close-on-exec flag using fcntl(2) at the same time as another thread does a fork(2) plus execve(2). Depending on the order of execution, the race may lead to the file descriptor returned by open() being unintentionally leaked to the program executed by the child process created by fork(2). (This kind of race is in principle possible for any system call that creates a file descriptor whose close-on-exec flag should be set, and various other Linux system calls provide an equivalent of the O_CLOEXEC flag to deal with this problem.)
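The difference between the default and O_CLOEXEC can be checked with fcntl(2) F_GETFD. This is a small sketch; the helper names open_cloexec() and has_cloexec() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: opens path read-only with close-on-exec set
   atomically, so no window exists in which another thread's
   fork(2) plus execve(2) could inherit the descriptor. */
int open_cloexec(const char *path)
{
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Hypothetical helper: returns 1 if FD_CLOEXEC is set on fd,
   0 if it is clear, and -1 on error. */
int has_cloexec(int fd)
{
    int fdflags = fcntl(fd, F_GETFD);

    if (fdflags == -1)
        return -1;
    return (fdflags & FD_CLOEXEC) != 0;
}
```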
O_CREAT
If pathname does not exist, create it as a regular file.
The owner (user ID) of the new file is set to the effective user ID of the process.
The group ownership (group ID) of the new file is set either to the effective group ID of the process (System V semantics) or to the group ID of the parent directory (BSD semantics). On Linux, the behavior depends on whether the set-group-ID mode bit is set on the parent directory: if that bit is set, then BSD semantics apply; otherwise, System V semantics apply. For some filesystems, the behavior also depends on the bsdgroups and sysvgroups mount options described in mount(8).
The mode argument specifies the file mode bits to be applied when a new file is created. If neither O_CREAT nor O_TMPFILE is specified in flags, then mode is ignored (and can thus be specified as 0, or simply omitted). The mode argument must be supplied if O_CREAT or O_TMPFILE is specified in flags; if it is not supplied, some arbitrary bytes from the stack will be applied as the file mode.
The effective mode is modified by the process’s umask in the usual way: in the absence of a default ACL, the mode of the created file is (mode & ~umask).
Note that mode applies only to future accesses of the newly created file; the open() call that creates a read-only file may well return a read/write file descriptor.
The following symbolic constants are provided for mode:
S_IRWXU
00700 user (file owner) has read, write, and execute permission
S_IRUSR
00400 user has read permission
S_IWUSR
00200 user has write permission
S_IXUSR
00100 user has execute permission
S_IRWXG
00070 group has read, write, and execute permission
S_IRGRP
00040 group has read permission
S_IWGRP
00020 group has write permission
S_IXGRP
00010 group has execute permission
S_IRWXO
00007 others have read, write, and execute permission
S_IROTH
00004 others have read permission
S_IWOTH
00002 others have write permission
S_IXOTH
00001 others have execute permission
According to POSIX, the effect when other bits are set in mode is unspecified. On Linux, the following bits are also honored in mode:
S_ISUID
0004000 set-user-ID bit
S_ISGID
0002000 set-group-ID bit (see inode(7)).
S_ISVTX
0001000 sticky bit (see inode(7)).
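The (mode & ~umask) rule described under O_CREAT can be sketched as follows. The helper name created_mode() is hypothetical. The sketch assumes the parent directory has no default ACL, and it changes the process-wide umask temporarily, so it is not suitable for multithreaded programs.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: creates path with the requested mode while the
   given umask is in effect, and returns the permission bits the new
   file actually received, or -1 on error. */
long created_mode(const char *path, mode_t requested, mode_t mask)
{
    struct stat st;
    mode_t old;
    int fd, rc;

    old = umask(mask);          /* install the mask to demonstrate */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, requested);
    umask(old);                 /* restore the previous mask */
    if (fd == -1)
        return -1;

    rc = fstat(fd, &st);
    close(fd);
    if (rc == -1)
        return -1;
    return (long) (st.st_mode & 07777);
}
```

For example, a requested mode of 0666 under a umask of 022 yields 0644.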
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is described in raw(8).
O_DIRECTORY
If pathname is not a directory, cause the open to fail. This flag was added in Linux 2.1.126, to avoid denial-of-service problems if opendir(3) is called on a FIFO or tape device.
O_DSYNC
Write operations on the file will complete according to the requirements of synchronized I/O data integrity completion.
By the time write(2) (and similar) return, the output data has been transferred to the underlying hardware, along with any file metadata that would be required to retrieve that data (i.e., as though each write(2) was followed by a call to fdatasync(2)). See NOTES below.
O_EXCL
Ensure that this call creates the file: if this flag is specified in conjunction with O_CREAT, and pathname already exists, then open() fails with the error EEXIST.
When these two flags are specified, symbolic links are not followed: if pathname is a symbolic link, then open() fails regardless of where the symbolic link points.
In general, the behavior of O_EXCL is undefined if it is used without O_CREAT. There is one exception: on Linux 2.6 and later, O_EXCL can be used without O_CREAT if pathname refers to a block device. If the block device is in use by the system (e.g., mounted), open() fails with the error EBUSY.
On NFS, O_EXCL is supported only when using NFSv3 or later on kernel 2.6 or later. In NFS environments where O_EXCL support is not provided, programs that rely on it for performing locking tasks will contain a race condition. Portable programs that want to perform atomic file locking using a lockfile, and need to avoid reliance on NFS support for O_EXCL, can create a unique file on the same filesystem (e.g., incorporating hostname and PID), and use link(2) to make a link to the lockfile. If link(2) returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful.
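The link(2)-based lockfile technique described above might look like the following sketch. The helper name acquire_lock() and the naming scheme for the unique file are assumptions made for illustration; the lock is released by calling unlink(2) on the lockfile.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: tries to take the lock named by lockfile.
   Returns 0 if the lock was acquired, -1 otherwise. */
int acquire_lock(const char *lockfile)
{
    char host[256];
    char unique[4096];
    struct stat st;
    int fd, n, locked;

    if (gethostname(host, sizeof(host)) == -1)
        return -1;
    host[sizeof(host) - 1] = '\0';

    /* Unique name on the same filesystem: lockfile.hostname.PID */
    n = snprintf(unique, sizeof(unique), "%s.%s.%ld",
                 lockfile, host, (long) getpid());
    if (n < 0 || (size_t) n >= sizeof(unique))
        return -1;

    fd = open(unique, O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;
    close(fd);

    if (link(unique, lockfile) == 0)
        locked = 1;             /* link succeeded: lock acquired */
    else                        /* otherwise trust the link count */
        locked = (stat(unique, &st) == 0 && st.st_nlink == 2);

    unlink(unique);             /* the unique name is no longer needed */
    return locked ? 0 : -1;
}
```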
O_LARGEFILE
(LFS) Allow files whose sizes cannot be represented in an off_t (but can be represented in an off64_t) to be opened. The _LARGEFILE64_SOURCE macro must be defined (before including any header files) in order to obtain this definition. Setting the _FILE_OFFSET_BITS feature test macro to 64 (rather than using O_LARGEFILE) is the preferred method of accessing large files on 32-bit systems (see feature_test_macros(7)).
O_NOATIME (since Linux 2.6.8)
Do not update the file last access time (st_atime in the inode) when the file is read(2).
This flag can be employed only if one of the following conditions is true:
The effective UID of the process matches the owner UID of the file.
The calling process has the CAP_FOWNER capability in its user namespace and the owner UID of the file has a mapping in the namespace.
This flag is intended for use by indexing or backup programs, where its use can significantly reduce the amount of disk activity. This flag may not be effective on all filesystems. One example is NFS, where the server maintains the access time.
O_NOCTTY
If pathname refers to a terminal device (see tty(4)), it will not become the process’s controlling terminal even if the process does not have one.
O_NOFOLLOW
If the trailing component (i.e., basename) of pathname is a symbolic link, then the open fails, with the error ELOOP. Symbolic links in earlier components of the pathname will still be followed. (Note that the ELOOP error that can occur in this case is indistinguishable from the case where an open fails because there are too many symbolic links found while resolving components in the prefix part of the pathname.)
This flag is a FreeBSD extension, which was added in Linux 2.1.126, and has subsequently been standardized in POSIX.1-2008.
See also O_PATH below.
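The O_NOFOLLOW failure mode can be demonstrated with a short sketch. The helper name open_nofollow_errno() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: tries to open path read-only without following
   a trailing symbolic link. Returns 0 on success, or the errno value
   from open() on failure (ELOOP if path itself is a symbolic link). */
int open_nofollow_errno(const char *path)
{
    int fd = open(path, O_RDONLY | O_NOFOLLOW);

    if (fd == -1)
        return errno;
    close(fd);
    return 0;
}
```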
O_NONBLOCK or O_NDELAY
When possible, the file is opened in nonblocking mode. Neither the open() nor any subsequent I/O operations on the file descriptor which is returned will cause the calling process to wait.
Note that the setting of this flag has no effect on the operation of poll(2), select(2), epoll(7), and similar, since those interfaces merely inform the caller about whether a file descriptor is “ready”, meaning that an I/O operation performed on the file descriptor with the O_NONBLOCK flag clear would not block.
Note that this flag has no effect for regular files and block devices; that is, I/O operations will (briefly) block when device activity is required, regardless of whether O_NONBLOCK is set. Since O_NONBLOCK semantics might eventually be implemented, applications should not depend upon blocking behavior when specifying this flag for regular files and block devices.
For the handling of FIFOs (named pipes), see also fifo(7). For a discussion of the effect of O_NONBLOCK in conjunction with mandatory file locks and with file leases, see fcntl(2).
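For FIFOs, O_NONBLOCK changes the behavior of open() itself, as this sketch shows. The helper name open_fifo_writer_nonblock() is hypothetical. Per the ERRORS section, opening a FIFO write-only with O_NONBLOCK fails with ENXIO when no process has it open for reading.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: creates a FIFO at path if it does not exist,
   then tries to open it for writing without blocking. Returns the
   file descriptor, or -1 with errno set (ENXIO if there is no
   reader yet). */
int open_fifo_writer_nonblock(const char *path)
{
    if (mkfifo(path, S_IRUSR | S_IWUSR) == -1 && errno != EEXIST)
        return -1;

    /* Without O_NONBLOCK this open would block until a reader appears. */
    return open(path, O_WRONLY | O_NONBLOCK);
}
```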
O_PATH (since Linux 2.6.39)
Obtain a file descriptor that can be used for two purposes: to indicate a location in the filesystem tree and to perform operations that act purely at the file descriptor level. The file itself is not opened, and other file operations (e.g., read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2), mmap(2)) fail with the error EBADF.
The following operations can be performed on the resulting file descriptor:
close(2).
fchdir(2), if the file descriptor refers to a directory (since Linux 3.5).
fstat(2) (since Linux 3.6).
fstatfs(2) (since Linux 3.12).
Duplicating the file descriptor (dup(2), fcntl(2) F_DUPFD, etc.).
Getting and setting file descriptor flags (fcntl(2) F_GETFD and F_SETFD).
Retrieving open file status flags using the fcntl(2) F_GETFL operation: the returned flags will include the bit O_PATH.
Passing the file descriptor as the dirfd argument of openat() and the other “*at()” system calls. This includes linkat(2) with AT_EMPTY_PATH (or via procfs using AT_SYMLINK_FOLLOW) even if the file is not a directory.
Passing the file descriptor to another process via a UNIX domain socket (see SCM_RIGHTS in unix(7)).
When O_PATH is specified in flags, flag bits other than O_CLOEXEC, O_DIRECTORY, and O_NOFOLLOW are ignored.
Opening a file or directory with the O_PATH flag requires no permissions on the object itself (but does require execute permission on the directories in the path prefix). Depending on the subsequent operation, a check for suitable file permissions may be performed (e.g., fchdir(2) requires execute permission on the directory referred to by its file descriptor argument). By contrast, obtaining a reference to a filesystem object by opening it with the O_RDONLY flag requires that the caller have read permission on the object, even when the subsequent operation (e.g., fchdir(2), fstat(2)) does not require read permission on the object.
If pathname is a symbolic link and the O_NOFOLLOW flag is also specified, then the call returns a file descriptor referring to the symbolic link. This file descriptor can be used as the dirfd argument in calls to fchownat(2), fstatat(2), linkat(2), and readlinkat(2) with an empty pathname to have the calls operate on the symbolic link.
If pathname refers to an automount point that has not yet been triggered, so no other filesystem is mounted on it, then the call returns a file descriptor referring to the automount directory without triggering a mount. fstatfs(2) can then be used to determine if it is, in fact, an untriggered automount point (.f_type == AUTOFS_SUPER_MAGIC).
One use of O_PATH for regular files is to provide the equivalent of POSIX.1’s O_EXEC functionality. This permits us to open a file for which we have execute permission but not read permission, and then execute that file, with steps something like the following:
char buf[PATH_MAX];
fd = open("some_prog", O_PATH);
snprintf(buf, PATH_MAX, "/proc/self/fd/%d", fd);
execl(buf, "some_prog", (char *) NULL);
An O_PATH file descriptor can also be passed as the argument of fexecve(3).
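Two of the O_PATH properties listed above (read(2) fails with EBADF, and F_GETFL reports the O_PATH bit) can be checked with a small sketch. The helper name opath_behaves_as_documented() is hypothetical.

```c
#define _GNU_SOURCE             /* needed for O_PATH */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: opens path with O_PATH and returns 1 if read(2)
   is refused with EBADF and F_GETFL reports O_PATH, 0 if either check
   fails, and -1 if the open itself fails. */
int opath_behaves_as_documented(const char *path)
{
    char byte;
    ssize_t n;
    int fd, flags, read_refused, reports_opath;

    fd = open(path, O_PATH | O_CLOEXEC);
    if (fd == -1)
        return -1;

    n = read(fd, &byte, 1);     /* the file itself is not opened */
    read_refused = (n == -1 && errno == EBADF);

    flags = fcntl(fd, F_GETFL);
    reports_opath = (flags != -1 && (flags & O_PATH) != 0);

    close(fd);
    return read_refused && reports_opath;
}
```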
O_SYNC
Write operations on the file will complete according to the requirements of synchronized I/O file integrity completion (by contrast with the synchronized I/O data integrity completion provided by O_DSYNC.)
By the time write(2) (or similar) returns, the output data and associated file metadata have been transferred to the underlying hardware (i.e., as though each write(2) was followed by a call to fsync(2)). See NOTES below.
O_TMPFILE (since Linux 3.11)
Create an unnamed temporary regular file. The pathname argument specifies a directory; an unnamed inode will be created in that directory’s filesystem. Anything written to the resulting file will be lost when the last file descriptor is closed, unless the file is given a name.
O_TMPFILE must be specified with one of O_RDWR or O_WRONLY and, optionally, O_EXCL. If O_EXCL is not specified, then linkat(2) can be used to link the temporary file into the filesystem, making it permanent, using code like the following:
char path[PATH_MAX];
fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
S_IRUSR | S_IWUSR);
/* File I/O on 'fd'... */
linkat(fd, "", AT_FDCWD, "/path/for/file", AT_EMPTY_PATH);
/* If the caller doesn't have the CAP_DAC_READ_SEARCH
capability (needed to use AT_EMPTY_PATH with linkat(2)),
and there is a proc(5) filesystem mounted, then the
linkat(2) call above can be replaced with:
snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
AT_SYMLINK_FOLLOW);
*/
In this case, the open() mode argument determines the file permission mode, as with O_CREAT.
Specifying O_EXCL in conjunction with O_TMPFILE prevents a temporary file from being linked into the filesystem in the above manner. (Note that the meaning of O_EXCL in this case is different from the meaning of O_EXCL otherwise.)
There are two main use cases for O_TMPFILE:
Improved tmpfile(3) functionality: race-free creation of temporary files that (1) are automatically deleted when closed; (2) can never be reached via any pathname; (3) are not subject to symlink attacks; and (4) do not require the caller to devise unique names.
Creating a file that is initially invisible, which is then populated with data and adjusted to have appropriate filesystem attributes (fchown(2), fchmod(2), fsetxattr(2), etc.) before being atomically linked into the filesystem in a fully formed state (using linkat(2) as described above).
O_TMPFILE requires support by the underlying filesystem; only a subset of Linux filesystems provide that support. In the initial implementation, support was provided in the ext2, ext3, ext4, UDF, Minix, and tmpfs filesystems. Support for other filesystems has subsequently been added as follows: XFS (Linux 3.15); Btrfs (Linux 3.16); F2FS (Linux 3.16); and ubifs (Linux 4.9).
O_TRUNC
If the file already exists and is a regular file and the access mode allows writing (i.e., is O_RDWR or O_WRONLY) it will be truncated to length 0. If the file is a FIFO or terminal device file, the O_TRUNC flag is ignored. Otherwise, the effect of O_TRUNC is unspecified.
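A minimal sketch of the regular-file case follows. The helper name size_after_trunc_open() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: opens an existing regular file for writing with
   O_TRUNC and returns its size immediately afterwards (expected to be
   0), or -1 on error. */
long size_after_trunc_open(const char *path)
{
    struct stat st;
    int fd, rc;

    fd = open(path, O_WRONLY | O_TRUNC);    /* truncation happens here */
    if (fd == -1)
        return -1;

    rc = fstat(fd, &st);
    close(fd);
    return (rc == -1) ? -1 : (long) st.st_size;
}
```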
creat()
A call to creat() is equivalent to calling open() with flags equal to O_CREAT|O_WRONLY|O_TRUNC.
openat()
The openat() system call operates in exactly the same way as open(), except for the differences described here.
The dirfd argument is used in conjunction with the pathname argument as follows:
If the pathname given in pathname is absolute, then dirfd is ignored.
If the pathname given in pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like open()).
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by open() for a relative pathname). In this case, dirfd must be a directory that was opened for reading (O_RDONLY) or using the O_PATH flag.
If the pathname given in pathname is relative, and dirfd is not a valid file descriptor, an error (EBADF) results. (Specifying an invalid file descriptor number in dirfd can be used as a means to ensure that pathname is absolute.)
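The dirfd rules above can be sketched as follows. The helper name open_in_dir() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: opens the file 'name' relative to directory
   'dir', independently of the current working directory. Returns the
   new file descriptor, or -1 with errno set. */
int open_in_dir(const char *dir, const char *name)
{
    int dirfd, fd, saved;

    dirfd = open(dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dirfd == -1)
        return -1;

    /* A relative 'name' is resolved against dirfd, not the cwd. */
    fd = openat(dirfd, name, O_RDONLY | O_CLOEXEC);

    saved = errno;              /* close() must not clobber errno */
    close(dirfd);
    errno = saved;
    return fd;
}
```

Passing an invalid descriptor such as -1 as dirfd makes openat() fail with EBADF for a relative pathname, while an absolute pathname is still opened.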
openat2(2)
The openat2(2) system call is an extension of openat(), and provides a superset of the features of openat(). It is documented separately, in openat2(2).
RETURN VALUE
On success, open(), openat(), and creat() return the new file descriptor (a nonnegative integer). On error, -1 is returned and errno is set to indicate the error.
ERRORS
open(), openat(), and creat() can fail with the following errors:
EACCES
The requested access to the file is not allowed, or search permission is denied for one of the directories in the path prefix of pathname, or the file did not exist yet and write access to the parent directory is not allowed. (See also path_resolution(7).)
EACCES
Where O_CREAT is specified, the protected_fifos or protected_regular sysctl is enabled, the file already exists and is a FIFO or regular file, the owner of the file is neither the current user nor the owner of the containing directory, and the containing directory is both world- or group-writable and sticky. For details, see the descriptions of /proc/sys/fs/protected_fifos and /proc/sys/fs/protected_regular in proc_sys_fs(5).
EBADF
(openat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EBUSY
O_EXCL was specified in flags and pathname refers to a block device that is in use by the system (e.g., it is mounted).
EDQUOT
Where O_CREAT is specified, the file does not exist, and the user’s quota of disk blocks or inodes on the filesystem has been exhausted.
EEXIST
pathname already exists and O_CREAT and O_EXCL were used.
EFAULT
pathname points outside your accessible address space.
EFBIG
See EOVERFLOW.
EINTR
While blocked waiting to complete an open of a slow device (e.g., a FIFO; see fifo(7)), the call was interrupted by a signal handler; see signal(7).
EINVAL
The filesystem does not support the O_DIRECT flag. See NOTES for more information.
EINVAL
Invalid value in flags.
EINVAL
O_TMPFILE was specified in flags, but neither O_WRONLY nor O_RDWR was specified.
EINVAL
O_CREAT was specified in flags and the final component (“basename”) of the new file’s pathname is invalid (e.g., it contains characters not permitted by the underlying filesystem).
EINVAL
The final component (“basename”) of pathname is invalid (e.g., it contains characters not permitted by the underlying filesystem).
EISDIR
pathname refers to a directory and the access requested involved writing (that is, O_WRONLY or O_RDWR is set).
EISDIR
pathname refers to an existing directory, O_TMPFILE and one of O_WRONLY or O_RDWR were specified in flags, but this kernel version does not provide the O_TMPFILE functionality.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ELOOP
pathname was a symbolic link, and flags specified O_NOFOLLOW but not O_PATH.
EMFILE
The per-process limit on the number of open file descriptors has been reached (see the description of RLIMIT_NOFILE in getrlimit(2)).
ENAMETOOLONG
pathname was too long.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
pathname refers to a device special file and no corresponding device exists. (This is a Linux kernel bug; in this situation ENXIO must be returned.)
ENOENT
O_CREAT is not set and the named file does not exist.
ENOENT
A directory component in pathname does not exist or is a dangling symbolic link.
ENOENT
pathname refers to a nonexistent directory, O_TMPFILE and one of O_WRONLY or O_RDWR were specified in flags, but this kernel version does not provide the O_TMPFILE functionality.
ENOMEM
The named file is a FIFO, but memory for the FIFO buffer can’t be allocated because the per-user hard limit on memory allocation for pipes has been reached and the caller is not privileged; see pipe(7).
ENOMEM
Insufficient kernel memory was available.
ENOSPC
pathname was to be created but the device containing pathname has no room for the new file.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory, or O_DIRECTORY was specified and pathname was not a directory.
ENOTDIR
(openat()) pathname is a relative pathname and dirfd is a file descriptor referring to a file other than a directory.
ENXIO
O_NONBLOCK | O_WRONLY is set, the named file is a FIFO, and no process has the FIFO open for reading.
ENXIO
The file is a device special file and no corresponding device exists.
ENXIO
The file is a UNIX domain socket.
EOPNOTSUPP
The filesystem containing pathname does not support O_TMPFILE.
EOVERFLOW
pathname refers to a regular file that is too large to be opened. The usual scenario here is that an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 tried to open a file whose size exceeds (1<<31)-1 bytes; see also O_LARGEFILE above. This is the error specified by POSIX.1; before Linux 2.6.24, Linux gave the error EFBIG for this case.
EPERM
The O_NOATIME flag was specified, but the effective user ID of the caller did not match the owner of the file and the caller was not privileged.
EPERM
The operation was prevented by a file seal; see fcntl(2).
EROFS
pathname refers to a file on a read-only filesystem and write access was requested.
ETXTBSY
pathname refers to an executable image which is currently being executed and write access was requested.
ETXTBSY
pathname refers to a file that is currently in use as a swap file, and the O_TRUNC flag was specified.
ETXTBSY
pathname refers to a file that is currently being read by the kernel (e.g., for module/firmware loading), and write access was requested.
EWOULDBLOCK
The O_NONBLOCK flag was specified, and an incompatible lease was held on the file (see fcntl(2)).
VERSIONS
The (undefined) effect of O_RDONLY | O_TRUNC varies among implementations. On many systems the file is actually truncated.
Synchronized I/O
The POSIX.1-2008 “synchronized I/O” option specifies different variants of synchronized I/O, and specifies the open() flags O_SYNC, O_DSYNC, and O_RSYNC for controlling the behavior. Regardless of whether an implementation supports this option, it must at least support the use of O_SYNC for regular files.
Linux implements O_SYNC and O_DSYNC, but not O_RSYNC. Somewhat incorrectly, glibc defines O_RSYNC to have the same value as O_SYNC. (O_RSYNC is defined in the Linux header file <asm/fcntl.h> on HP PA-RISC, but it is not used.)
O_SYNC provides synchronized I/O file integrity completion, meaning write operations will flush data and all associated metadata to the underlying hardware. O_DSYNC provides synchronized I/O data integrity completion, meaning write operations will flush data to the underlying hardware, but will only flush metadata updates that are required to allow a subsequent read operation to complete successfully. Data integrity completion can reduce the number of disk operations that are required for applications that don’t need the guarantees of file integrity completion.
To understand the difference between the two types of completion, consider two pieces of file metadata: the file last modification timestamp (st_mtime) and the file length. All write operations will update the last file modification timestamp, but only writes that add data to the end of the file will change the file length. The last modification timestamp is not needed to ensure that a read completes successfully, but the file length is. Thus, O_DSYNC would only guarantee to flush updates to the file length metadata (whereas O_SYNC would also always flush the last modification timestamp metadata).
Before Linux 2.6.33, Linux implemented only the O_SYNC flag for open(). However, when that flag was specified, most filesystems actually provided the equivalent of synchronized I/O data integrity completion (i.e., O_SYNC was actually implemented as the equivalent of O_DSYNC).
Since Linux 2.6.33, proper O_SYNC support is provided. However, to ensure backward binary compatibility, O_DSYNC was defined with the same value as the historical O_SYNC, and O_SYNC was defined as a new (two-bit) flag value that includes the O_DSYNC flag value. This ensures that applications compiled against new headers get at least O_DSYNC semantics when run on kernels older than Linux 2.6.33.
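The flag layout described above can be checked against the headers in use. This sketch assumes Linux headers where O_SYNC and O_DSYNC are distinct; the helper name osync_includes_odsync() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>

/* Hypothetical helper: returns 1 if every bit of O_DSYNC is also set in
   O_SYNC and the two values differ, which is the layout described for
   Linux 2.6.33 and later headers. Returns 0 otherwise. */
int osync_includes_odsync(void)
{
    return (O_SYNC & O_DSYNC) == O_DSYNC && O_SYNC != O_DSYNC;
}
```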
C library/kernel differences
Since glibc 2.26, the glibc wrapper function for open() employs the openat() system call, rather than the kernel’s open() system call. For certain architectures, this is also true before glibc 2.26.
STANDARDS
open()
creat()
openat()
POSIX.1-2008.
openat2(2) Linux.
The O_DIRECT, O_NOATIME, O_PATH, and O_TMPFILE flags are Linux-specific. One must define _GNU_SOURCE to obtain their definitions.
The O_CLOEXEC, O_DIRECTORY, and O_NOFOLLOW flags are not specified in POSIX.1-2001, but are specified in POSIX.1-2008. Since glibc 2.12, one can obtain their definitions by defining either _POSIX_C_SOURCE with a value greater than or equal to 200809L or _XOPEN_SOURCE with a value greater than or equal to 700. In glibc 2.11 and earlier, one obtains the definitions by defining _GNU_SOURCE.
HISTORY
open()
creat()
SVr4, 4.3BSD, POSIX.1-2001.
openat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Under Linux, the O_NONBLOCK flag is sometimes used in cases where one wants to open but does not necessarily have the intention to read or write. For example, this may be used to open a device in order to get a file descriptor for use with ioctl(2).
Note that open() can open device special files, but creat() cannot create them; use mknod(2) instead.
If the file is newly created, its st_atime, st_ctime, st_mtime fields (respectively, time of last access, time of last status change, and time of last modification; see stat(2)) are set to the current time, and so are the st_ctime and st_mtime fields of the parent directory. Otherwise, if the file is modified because of the O_TRUNC flag, its st_ctime and st_mtime fields are set to the current time.
The files in the /proc/pid/fd directory show the open file descriptors of the process with the PID pid. The files in the /proc/pid/fdinfo directory show even more information about these file descriptors. See proc(5) for further details of both of these directories.
The Linux header file <asm/fcntl.h> doesn’t define O_ASYNC; the (BSD-derived) FASYNC synonym is defined instead.
Open file descriptions
The term open file description is the one used by POSIX to refer to the entries in the system-wide table of open files. In other contexts, this object is variously also called an “open file object”, a “file handle”, an “open file table entry”, or, in kernel-developer parlance, a struct file.
When a file descriptor is duplicated (using dup(2) or similar), the duplicate refers to the same open file description as the original file descriptor, and the two file descriptors consequently share the file offset and file status flags. Such sharing can also occur between processes: a child process created via fork(2) inherits duplicates of its parent’s file descriptors, and those duplicates refer to the same open file descriptions.
Each open() of a file creates a new open file description; thus, there may be multiple open file descriptions corresponding to a file inode.
On Linux, one can use the kcmp(2) KCMP_FILE operation to test whether two file descriptors (in the same process or in two different processes) refer to the same open file description.
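The difference between a duplicated file descriptor and a second open() of the same file can be observed through the file offset. The following is a minimal sketch; the helper name offset_sharing_demo() and the temporary file template are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical demo helper.  Creates a temporary file, writes five
   bytes through the original descriptor, and then reports the offset
   seen through a dup(2)ed descriptor and through a descriptor from a
   separate open(2).  Returns 0 on success, -1 on any failure. */
int
offset_sharing_demo(off_t *dup_offset, off_t *reopen_offset)
{
    char path[] = "/tmp/ofd-demo-XXXXXX";
    int fd, fd_dup, fd_new;
    int ret = -1;

    fd = mkstemp(path);
    if (fd == -1)
        return -1;

    fd_dup = dup(fd);               /* same open file description */
    fd_new = open(path, O_RDONLY);  /* new open file description */

    if (fd_dup != -1 && fd_new != -1 && write(fd, "hello", 5) == 5) {
        *dup_offset = lseek(fd_dup, 0, SEEK_CUR);
        *reopen_offset = lseek(fd_new, 0, SEEK_CUR);
        ret = 0;
    }

    if (fd_new != -1)
        close(fd_new);
    if (fd_dup != -1)
        close(fd_dup);
    close(fd);
    unlink(path);
    return ret;
}
```

The duplicate reports the offset advanced by the write, because it shares the open file description; the separately opened descriptor still reports offset zero.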
NFS
There are many infelicities in the protocol underlying NFS, affecting amongst others O_SYNC and O_NDELAY.
On NFS filesystems with UID mapping enabled, open() may return a file descriptor but, for example, read(2) requests are denied with EACCES. This is because the client performs open() by checking the permissions, but UID mapping is performed by the server upon read and write requests.
FIFOs
Opening the read or write end of a FIFO blocks until the other end is also opened (by another process or thread). See fifo(7) for further details.
File access mode
Unlike the other values that can be specified in flags, the access mode values O_RDONLY, O_WRONLY, and O_RDWR do not specify individual bits. Rather, they define the low order two bits of flags, and are defined respectively as 0, 1, and 2. In other words, the combination O_RDONLY | O_WRONLY is a logical error, and certainly does not have the same meaning as O_RDWR.
Linux reserves the special, nonstandard access mode 3 (binary 11) in flags to mean: check for read and write permission on the file and return a file descriptor that can’t be used for reading or writing. This nonstandard access mode is used by some Linux drivers to return a file descriptor that is to be used only for device-specific ioctl(2) operations.
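Because the access mode is a two-bit field rather than a set of independent bits, it has to be extracted with the O_ACCMODE mask before being compared. The following is a minimal sketch; the helper names flags_allow_read() and flags_allow_write() are illustrative.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical helper: true if the access mode in `flags` permits
   reading.  Testing "flags & O_RDONLY" would always be false, because
   O_RDONLY is defined as 0. */
bool
flags_allow_read(int flags)
{
    int mode = flags & O_ACCMODE;

    return mode == O_RDONLY || mode == O_RDWR;
}

/* Hypothetical helper: true if the access mode in `flags` permits
   writing. */
bool
flags_allow_write(int flags)
{
    int mode = flags & O_ACCMODE;

    return mode == O_WRONLY || mode == O_RDWR;
}
```

With these helpers, O_RDONLY | O_WRONLY is reported as write-only (its value equals O_WRONLY), and the nonstandard Linux access mode 3 is reported as neither readable nor writable.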
Rationale for openat() and other directory file descriptor APIs
openat() and the other system calls and library functions that take a directory file descriptor argument (i.e., execveat(2), faccessat(2), fanotify_mark(2), fchmodat(2), fchownat(2), fspick(2), fstatat(2), futimesat(2), linkat(2), mkdirat(2), mknodat(2), mount_setattr(2), move_mount(2), name_to_handle_at(2), open_tree(2), openat2(2), readlinkat(2), renameat(2), renameat2(2), statx(2), symlinkat(2), unlinkat(2), utimensat(2), mkfifoat(3), and scandirat(3)) address two problems with the older interfaces that preceded them. Here, the explanation is in terms of the openat() call, but the rationale is analogous for the other interfaces.
First, openat() allows an application to avoid race conditions that could occur when using open() to open files in directories other than the current working directory. These race conditions result from the fact that some component of the directory prefix given to open() could be changed in parallel with the call to open(). Suppose, for example, that we wish to create the file dir1/dir2/xxx.dep if the file dir1/dir2/xxx exists. The problem is that between the existence check and the file-creation step, dir1 or dir2 (which might be symbolic links) could be modified to point to a different location. Such races can be avoided by opening a file descriptor for the target directory, and then specifying that file descriptor as the dirfd argument of (say) fstatat(2) and openat(). The use of the dirfd file descriptor also has other benefits:
the file descriptor is a stable reference to the directory, even if the directory is renamed; and
the open file descriptor prevents the underlying filesystem from being dismounted, just as when a process has a current working directory on a filesystem.
Second, openat() allows the implementation of a per-thread “current working directory”, via file descriptor(s) maintained by the application. (This functionality can also be obtained by tricks based on the use of /proc/self/fd/dirfd, but less efficiently.)
The dirfd argument for these APIs can be obtained by using open() or openat() to open a directory (with either the O_RDONLY or the O_PATH flag). Alternatively, such a file descriptor can be obtained by applying dirfd(3) to a directory stream created using opendir(3).
When these APIs are given a dirfd argument of AT_FDCWD or the specified pathname is absolute, then they handle their pathname argument in the same way as the corresponding conventional APIs. However, in this case, several of the APIs have a flags argument that provides access to functionality that is not available with the corresponding conventional APIs.
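The race-free pattern described above (check for one file, then create another in the same directory) can be sketched as follows. The helper name create_dep_if_exists() and the mode 0644 are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: if `name` exists in the directory `dirpath`,
   create `depname` in that same directory.  Both steps go through one
   directory file descriptor, so they refer to the same directory even
   if a component of `dirpath` is changed between them.
   Returns 1 if `depname` was created or already existed, 0 if `name`
   does not exist, and -1 on any other error. */
int
create_dep_if_exists(const char *dirpath, const char *name,
                     const char *depname)
{
    struct stat sb;
    int dirfd, fd, ret;

    dirfd = open(dirpath, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dirfd == -1)
        return -1;

    if (fstatat(dirfd, name, &sb, 0) == -1) {
        ret = (errno == ENOENT) ? 0 : -1;
    } else {
        fd = openat(dirfd, depname, O_WRONLY | O_CREAT | O_CLOEXEC, 0644);
        ret = (fd == -1) ? -1 : 1;
        if (fd != -1)
            close(fd);
    }

    close(dirfd);
    return ret;
}
```

A call such as create_dep_if_exists("dir1/dir2", "xxx", "xxx.dep") resolves the directory prefix only once, when the directory itself is opened.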
O_DIRECT
The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. On Linux, alignment restrictions vary by filesystem and kernel version and might be absent entirely. The handling of misaligned O_DIRECT I/Os also varies; they can either fail with EINVAL or fall back to buffered I/O.
Since Linux 6.1, O_DIRECT support and alignment restrictions for a file can be queried using statx(2), using the STATX_DIOALIGN flag. Support for STATX_DIOALIGN varies by filesystem; see statx(2).
Some filesystems provide their own interfaces for querying O_DIRECT alignment restrictions, for example the XFS_IOC_DIOINFO operation in xfsctl(3). STATX_DIOALIGN should be used instead when it is available.
If none of the above is available, then direct I/O support and alignment restrictions can only be assumed from known characteristics of the filesystem, the individual file, the underlying storage device(s), and the kernel version. In Linux 2.4, most filesystems based on block devices require that the file offset and the length and memory address of all I/O segments be multiples of the filesystem block size (typically 4096 bytes). In Linux 2.6.0, this was relaxed to the logical block size of the block device (typically 512 bytes). A block device’s logical block size can be determined using the ioctl(2) BLKSSZGET operation or from the shell using the command:
blockdev --getss
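Once the required alignment is known, a suitable buffer can be allocated with posix_memalign(3). The following is a minimal sketch; the helper names round_up() and alloc_dio_buffer() are illustrative, and the alignment passed in is assumed to be a power of two obtained by one of the methods described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round `len` up to the next multiple of `align`, which must be a
   power of two. */
size_t
round_up(size_t len, size_t align)
{
    return (len + align - 1) & ~(align - 1);
}

/* Hypothetical helper: allocate a buffer whose address and length are
   both multiples of `align`, for use with I/O on a file opened with
   O_DIRECT.  The rounded length is stored in *actual_len.
   Returns NULL on failure; release the buffer with free(3). */
void *
alloc_dio_buffer(size_t align, size_t min_len, size_t *actual_len)
{
    void *buf = NULL;
    size_t len = round_up(min_len, align);

    if (posix_memalign(&buf, align, len) != 0)
        return NULL;

    *actual_len = len;
    return buf;
}
```

The file offset of each I/O has to satisfy the same alignment; this sketch covers only the buffer address and length.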
O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).
The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX also has an fcntl(2) call to query appropriate alignments and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.
O_DIRECT support was added in Linux 2.4.10. Older Linux kernels simply ignore this flag. Some filesystems may not implement the flag, in which case open() fails with the error EINVAL if it is used.
Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the filesystem correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.
The behavior of O_DIRECT with NFS will differ from local filesystems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will bypass the page cache only on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.
In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.
BUGS
Currently, it is not possible to enable signal-driven I/O by specifying O_ASYNC when calling open(); use fcntl(2) to enable this flag.
One must check for two different error codes, EISDIR and ENOENT, when trying to determine whether the kernel supports O_TMPFILE functionality.
When both O_CREAT and O_DIRECTORY are specified in flags and the file specified by pathname does not exist, open() will create a regular file (i.e., O_DIRECTORY is ignored).
SEE ALSO
chmod(2), chown(2), close(2), dup(2), fcntl(2), link(2), lseek(2), mknod(2), mmap(2), mount(2), open_by_handle_at(2), openat2(2), read(2), socket(2), stat(2), umask(2), unlink(2), write(2), fopen(3), acl(5), fifo(7), inode(7), path_resolution(7), symlink(7)
79 - Linux cli command readlink
NAME
readlink - read value of a symbolic link
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
ssize_t readlink(const char *restrict pathname, char *restrict buf,
size_t bufsiz);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
ssize_t readlinkat(int dirfd, const char *restrict pathname,
char *restrict buf, size_t bufsiz);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
readlink():
_XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L
|| /* glibc <= 2.19: */ _BSD_SOURCE
readlinkat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
readlink() places the contents of the symbolic link pathname in the buffer buf, which has size bufsiz. readlink() does not append a terminating null byte to buf. It will (silently) truncate the contents (to a length of bufsiz characters), in case the buffer is too small to hold all of the contents.
readlinkat()
The readlinkat() system call operates in exactly the same way as readlink(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by readlink() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like readlink()).
If pathname is absolute, then dirfd is ignored.
Since Linux 2.6.39, pathname can be an empty string, in which case the call operates on the symbolic link referred to by dirfd (which should have been obtained using open(2) with the O_PATH and O_NOFOLLOW flags).
See openat(2) for an explanation of the need for readlinkat().
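The following is a minimal sketch of reading a link relative to a directory file descriptor. The helper name read_link_in_dir() is illustrative; it adds the terminating null byte that readlinkat() itself does not append and treats possible truncation as an error.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: read the symbolic link `name` found in the
   directory `dirpath` into `buf` and null-terminate the result.
   Returns the length of the link contents, or -1 on error or if the
   contents may have been truncated (errno is not set in that case). */
ssize_t
read_link_in_dir(const char *dirpath, const char *name,
                 char *buf, size_t bufsiz)
{
    ssize_t n;
    int dirfd;

    if (bufsiz == 0)
        return -1;

    dirfd = open(dirpath, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dirfd == -1)
        return -1;

    /* Reserve one byte for the terminating null byte. */
    n = readlinkat(dirfd, name, buf, bufsiz - 1);
    close(dirfd);

    if (n == -1 || (size_t) n == bufsiz - 1)
        return -1;

    buf[n] = '\0';
    return n;
}
```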
RETURN VALUE
On success, these calls return the number of bytes placed in buf. (If the returned value equals bufsiz, then truncation may have occurred.) On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for a component of the path prefix. (See also path_resolution(7).)
EBADF
(readlinkat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
buf extends outside the process’s allocated address space.
EINVAL
bufsiz is not positive.
EINVAL
The named file (i.e., the final filename component of pathname) is not a symbolic link.
EIO
An I/O error occurred while reading from the filesystem.
ELOOP
Too many symbolic links were encountered in translating the pathname.
ENAMETOOLONG
A pathname, or a component of a pathname, was too long.
ENOENT
The named file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(readlinkat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
STANDARDS
POSIX.1-2008.
HISTORY
readlink()
4.4BSD (first appeared in 4.2BSD), POSIX.1-2001, POSIX.1-2008.
readlinkat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
Up to and including glibc 2.4, the return type of readlink() was declared as int. Nowadays, the return type is declared as ssize_t, as (newly) required in POSIX.1-2001.
glibc
On older kernels where readlinkat() is unavailable, the glibc wrapper function falls back to the use of readlink(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
NOTES
Using a statically sized buffer might not provide enough room for the symbolic link contents. The required size for the buffer can be obtained from the stat.st_size value returned by a call to lstat(2) on the link. However, the number of bytes written by readlink() and readlinkat() should be checked to make sure that the size of the symbolic link did not increase between the calls. Dynamically allocating the buffer for readlink() and readlinkat() also addresses a common portability problem when using PATH_MAX for the buffer size, as this constant is not guaranteed to be defined per POSIX if the system does not have such limit.
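An alternative to sizing the buffer from lstat(2) is to retry with a larger buffer until the result is strictly smaller than the buffer. The following is a minimal sketch of that approach; the helper name read_link_alloc() and the starting size of 64 bytes are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: return a malloc()ed, null-terminated copy of
   the contents of the symbolic link `path`, or NULL on error.  The
   caller releases the result with free(3). */
char *
read_link_alloc(const char *path)
{
    size_t bufsiz = 64;         /* arbitrary starting size */

    for (;;) {
        char *buf = malloc(bufsiz);
        ssize_t n;

        if (buf == NULL)
            return NULL;

        n = readlink(path, buf, bufsiz);
        if (n == -1) {
            free(buf);
            return NULL;
        }

        /* A result shorter than the buffer cannot have been truncated
           and leaves room for the terminating null byte. */
        if ((size_t) n < bufsiz) {
            buf[n] = '\0';
            return buf;
        }

        free(buf);
        bufsiz *= 2;
    }
}
```

This avoids both PATH_MAX and the window between lstat(2) and readlink(), at the cost of possibly calling readlink() more than once.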
EXAMPLES
The following program allocates the buffer needed by readlink() dynamically from the information provided by lstat(2), falling back to a buffer of size PATH_MAX in cases where lstat(2) reports a size of zero.
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    char         *buf;
    ssize_t      nbytes, bufsiz;
    struct stat  sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    /* Add one to the link size, so that we can determine whether
       the buffer returned by readlink() was truncated. */

    bufsiz = sb.st_size + 1;

    /* Some magic symlinks under (for example) /proc and /sys
       report 'st_size' as zero. In that case, take PATH_MAX as
       a "good enough" estimate. */

    if (sb.st_size == 0)
        bufsiz = PATH_MAX;

    buf = malloc(bufsiz);
    if (buf == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    nbytes = readlink(argv[1], buf, bufsiz);
    if (nbytes == -1) {
        perror("readlink");
        exit(EXIT_FAILURE);
    }

    /* Print only 'nbytes' of 'buf', as it doesn't contain a
       terminating null byte ('\0'). */

    printf("'%s' points to '%.*s'\n", argv[1], (int) nbytes, buf);

    /* If the return value was equal to the buffer size, then
       the link target was larger than expected (perhaps because the
       target was changed between the call to lstat() and the call to
       readlink()). Warn the user that the returned target may have
       been truncated. */

    if (nbytes == bufsiz)
        printf("(Returned buffer may have been truncated)\n");

    free(buf);
    exit(EXIT_SUCCESS);
}
SEE ALSO
readlink(1), lstat(2), stat(2), symlink(2), realpath(3), path_resolution(7), symlink(7)
80 - Linux cli command fattach
NAME
fattach - unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
81 - Linux cli command stat64
NAME
stat64 - get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but (in the case of stat(), fstatat(), and lstat()) execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
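The effect of AT_SYMLINK_NOFOLLOW can be sketched as follows. The helper name is_symlink_in_dir() is illustrative; with nofollow set, the call behaves like lstat() on a name inside the directory, and with nofollow clear, it behaves like stat().

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: examine `name` relative to the directory
   `dirpath`.  Returns 1 if the examined object is a symbolic link,
   0 if it is some other type of file, and -1 on error. */
int
is_symlink_in_dir(const char *dirpath, const char *name, bool nofollow)
{
    struct stat sb;
    int dirfd, ret;

    dirfd = open(dirpath, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dirfd == -1)
        return -1;

    if (fstatat(dirfd, name, &sb,
                nofollow ? AT_SYMLINK_NOFOLLOW : 0) == -1)
        ret = -1;
    else
        ret = S_ISLNK(sb.st_mode) ? 1 : 0;

    close(dirfd);
    return ret;
}
```

When the link is followed, the result describes the target, so a symbolic link that resolves successfully is never reported as a link.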
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    struct stat sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    printf("ID of containing device:  [%x,%x]\n",
           major(sb.st_dev),
           minor(sb.st_dev));

    printf("File type:                ");

    switch (sb.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");      break;
    case S_IFCHR:  printf("character device\n");  break;
    case S_IFDIR:  printf("directory\n");         break;
    case S_IFIFO:  printf("FIFO/pipe\n");         break;
    case S_IFLNK:  printf("symlink\n");           break;
    case S_IFREG:  printf("regular file\n");      break;
    case S_IFSOCK: printf("socket\n");            break;
    default:       printf("unknown?\n");          break;
    }

    printf("I-node number:            %ju\n", (uintmax_t) sb.st_ino);

    printf("Mode:                     %jo (octal)\n",
           (uintmax_t) sb.st_mode);

    printf("Link count:               %ju\n", (uintmax_t) sb.st_nlink);
    printf("Ownership:                UID=%ju   GID=%ju\n",
           (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);

    printf("Preferred I/O block size: %jd bytes\n",
           (intmax_t) sb.st_blksize);
    printf("File size:                %jd bytes\n",
           (intmax_t) sb.st_size);
    printf("Blocks allocated:         %jd\n",
           (intmax_t) sb.st_blocks);

    printf("Last status change:       %s", ctime(&sb.st_ctime));
    printf("Last file access:         %s", ctime(&sb.st_atime));
    printf("Last file modification:   %s", ctime(&sb.st_mtime));

    exit(EXIT_SUCCESS);
}
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
82 - Linux cli command vserver
NAME
vserver - unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
83 - Linux cli command request_key
NAME
request_key - request a key from the kernel’s key management facility
LIBRARY
Linux Key Management Utilities (libkeyutils, -lkeyutils)
SYNOPSIS
#include <keyutils.h>
key_serial_t request_key(const char *type, const char *description,
const char *_Nullable callout_info,
key_serial_t dest_keyring);
DESCRIPTION
request_key() attempts to find a key of the given type with a description (name) that matches the specified description. If such a key could not be found, then the key is optionally created. If the key is found or created, request_key() attaches it to the keyring whose ID is specified in dest_keyring and returns the key’s serial number.
request_key() first recursively searches for a matching key in all of the keyrings attached to the calling process. The keyrings are searched in the order: thread-specific keyring, process-specific keyring, and then session keyring.
If request_key() is called from a program invoked by request_key() on behalf of some other process to generate a key, then the keyrings of that other process will be searched next, using that other process’s user ID, group ID, supplementary group IDs, and security context to determine access.
The search of the keyring tree is breadth-first: the keys in each keyring searched are checked for a match before any child keyrings are recursed into. Only keys for which the caller has search permission can be found, and only keyrings for which the caller has search permission may be searched.
If the key is not found and callout_info is NULL, then the call fails with the error ENOKEY.
If the key is not found and callout_info is not NULL, then the kernel attempts to invoke a user-space program to instantiate the key. The details are given below.
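A pure lookup, with no attempt to construct a missing key, can be sketched as follows. This is a minimal illustration that invokes the system call directly through syscall(2) so that it builds without libkeyutils; the helper name find_existing_key() and the local key_serial_t typedef are assumptions, and programs that link against libkeyutils would call request_key() from <keyutils.h> instead.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: key_serial_t is a 32-bit signed integer, as declared in
   <keyutils.h>.  It is declared locally so that the sketch does not
   depend on that header. */
typedef int32_t key_serial_t;

/* Hypothetical helper: search the caller's keyrings for a key of the
   given type and description.  The callout information is NULL, so
   the kernel does not try to construct a missing key, and the
   destination keyring is 0, so no additional linking is requested.
   Returns the key's serial number, or -1 with errno set. */
key_serial_t
find_existing_key(const char *type, const char *description)
{
    return (key_serial_t) syscall(SYS_request_key, type, description,
                                  (const char *) NULL, 0L);
}
```

For a description that matches no key, the call is expected to fail with ENOKEY; in restricted environments (for example, under a seccomp filter) it may instead fail with a different error.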
The dest_keyring serial number may be that of a valid keyring for which the caller has write permission, or it may be one of the following special keyring IDs:
KEY_SPEC_THREAD_KEYRING
This specifies the caller’s thread-specific keyring (see thread-keyring(7)).
KEY_SPEC_PROCESS_KEYRING
This specifies the caller’s process-specific keyring (see process-keyring(7)).
KEY_SPEC_SESSION_KEYRING
This specifies the caller’s session-specific keyring (see session-keyring(7)).
KEY_SPEC_USER_KEYRING
This specifies the caller’s UID-specific keyring (see user-keyring(7)).
KEY_SPEC_USER_SESSION_KEYRING
This specifies the caller’s UID-session keyring (see user-session-keyring(7)).
When the dest_keyring is specified as 0 and no key construction has been performed, then no additional linking is done.
Otherwise, if dest_keyring is 0 and a new key is constructed, the new key will be linked to the “default” keyring. More precisely, when the kernel tries to determine to which keyring the newly constructed key should be linked, it tries the following keyrings, beginning with the keyring set via the keyctl(2) KEYCTL_SET_REQKEY_KEYRING operation and continuing in the order shown below until it finds the first keyring that exists:
The requestor keyring (KEY_REQKEY_DEFL_REQUESTOR_KEYRING, since Linux 2.6.29).
The thread-specific keyring (KEY_REQKEY_DEFL_THREAD_KEYRING; see thread-keyring(7)).
The process-specific keyring (KEY_REQKEY_DEFL_PROCESS_KEYRING; see process-keyring(7)).
The session-specific keyring (KEY_REQKEY_DEFL_SESSION_KEYRING; see session-keyring(7)).
The session keyring for the process’s user ID (KEY_REQKEY_DEFL_USER_SESSION_KEYRING; see user-session-keyring(7)). This keyring is expected to always exist.
The UID-specific keyring (KEY_REQKEY_DEFL_USER_KEYRING; see user-keyring(7)). This keyring is also expected to always exist.
If the keyctl(2) KEYCTL_SET_REQKEY_KEYRING operation specifies KEY_REQKEY_DEFL_DEFAULT (or no KEYCTL_SET_REQKEY_KEYRING operation is performed), then the kernel looks for a keyring starting from the beginning of the list.
Requesting user-space instantiation of a key
If the kernel cannot find a key matching type and description, and callout_info is not NULL, then the kernel attempts to invoke a user-space program to instantiate a key with the given type and description. In this case, the following steps are performed:
The kernel creates an uninstantiated key, U, with the requested type and description.
The kernel creates an authorization key, V, that refers to the key U and records the facts that the caller of request_key() is:
(2.1) the context in which the key U should be instantiated and secured, and
(2.2) the context from which associated key requests may be satisfied.
The authorization key is constructed as follows:
The key type is ".request_key_auth".
The key’s UID and GID are the same as the corresponding filesystem IDs of the requesting process.
The key grants view, read, and search permissions to the key possessor as well as view permission for the key user.
The description (name) of the key is the hexadecimal string representing the ID of the key that is to be instantiated in the requesting program.
The payload of the key is taken from the data specified in callout_info.
Internally, the kernel also records the PID of the process that called request_key().
The kernel creates a process that executes a user-space service such as request-key(8) with a new session keyring that contains a link to the authorization key, V.
This program is supplied with the following command-line arguments:
[0] The string "/sbin/request-key".
[1] The string "create" (indicating that a key is to be created).
[2] The ID of the key that is to be instantiated.
[3] The filesystem UID of the caller of request_key().
[4] The filesystem GID of the caller of request_key().
[5] The ID of the thread keyring of the caller of request_key(). This may be zero if that keyring hasn’t been created.
[6] The ID of the process keyring of the caller of request_key(). This may be zero if that keyring hasn’t been created.
[7] The ID of the session keyring of the caller of request_key().
Note: each of the command-line arguments that is a key ID is encoded in decimal (unlike the key IDs shown in /proc/keys, which are shown as hexadecimal values).
The program spawned in the previous step:
Assumes the authority to instantiate the key U using the keyctl(2) KEYCTL_ASSUME_AUTHORITY operation (typically via the keyctl_assume_authority(3) function).
Obtains the callout data from the payload of the authorization key V (using the keyctl(2) KEYCTL_READ operation (or, more commonly, the keyctl_read(3) function) with a key ID value of KEY_SPEC_REQKEY_AUTH_KEY).
Instantiates the key (or execs another program that performs that task), specifying the payload and destination keyring. (The destination keyring that the requestor specified when calling request_key() can be accessed using the special key ID KEY_SPEC_REQUESTOR_KEYRING.) Instantiation is performed using the keyctl(2) KEYCTL_INSTANTIATE operation (or, more commonly, the keyctl_instantiate(3) function). At this point, the request_key() call completes, and the requesting program can continue execution.
If these steps are unsuccessful, then an ENOKEY error will be returned to the caller of request_key() and a temporary, negatively instantiated key will be installed in the keyring specified by dest_keyring. This will expire after a few seconds, but will cause subsequent calls to request_key() to fail until it does. The purpose of this negatively instantiated key is to prevent (possibly different) processes making repeated requests (that require expensive request-key(8) upcalls) for a key that can’t (at the moment) be positively instantiated.
Once the key has been instantiated, the authorization key (KEY_SPEC_REQKEY_AUTH_KEY) is revoked, and the destination keyring (KEY_SPEC_REQUESTOR_KEYRING) is no longer accessible from the request-key(8) program.
If a key is created, then, regardless of whether it is a valid key or a negatively instantiated key, it will displace any other key with the same type and description from the keyring specified in dest_keyring.
RETURN VALUE
On success, request_key() returns the serial number of the key it found or caused to be created. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
The keyring wasn’t available for modification by the user.
EDQUOT
The key quota for this user would be exceeded by creating this key or linking it to the keyring.
EFAULT
One of type, description, or callout_info points outside the process’s accessible address space.
EINTR
The request was interrupted by a signal; see signal(7).
EINVAL
The size of the string (including the terminating null byte) specified in type or description exceeded the limit (32 bytes and 4096 bytes respectively).
EINVAL
The size of the string (including the terminating null byte) specified in callout_info exceeded the system page size.
EKEYEXPIRED
An expired key was found, but no replacement could be obtained.
EKEYREJECTED
The attempt to generate a new key was rejected.
EKEYREVOKED
A revoked key was found, but no replacement could be obtained.
ENOKEY
No matching key was found.
ENOMEM
Insufficient memory to create a key.
EPERM
The type argument started with a period (’.’).
STANDARDS
Linux.
HISTORY
Linux 2.6.10.
The ability to instantiate keys upon request was added in Linux 2.6.13.
EXAMPLES
The program below demonstrates the use of request_key(). The type, description, and callout_info arguments for the system call are taken from the values supplied in the command-line arguments. The call specifies the session keyring as the target keyring.
In order to demonstrate this program, we first create a suitable entry in the file /etc/request-key.conf.
$ sudo sh
# echo 'create user mtk:* * /bin/keyctl instantiate %k %c %S' \
> /etc/request-key.conf
# exit
This entry specifies that when a new “user” key with the prefix “mtk:” must be instantiated, that task should be performed via the keyctl(1) command’s instantiate operation. The arguments supplied to the instantiate operation are: the ID of the uninstantiated key (%k); the callout data supplied to the request_key() call (%c); and the session keyring (%S) of the requestor (i.e., the caller of request_key()). See request-key.conf(5) for details of these % specifiers.
Then we run the program and check the contents of /proc/keys to verify that the requested key has been instantiated:
$ ./t_request_key user mtk:key1 "Payload data"
$ grep '2dddaf50' /proc/keys
2dddaf50 I--Q--- 1 perm 3f010000 1000 1000 user mtk:key1: 12
For another example of the use of this program, see keyctl(2).
Program source
/* t_request_key.c */

#include <keyutils.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
    key_serial_t key;

    if (argc != 4) {
        fprintf(stderr, "Usage: %s type description callout-data\n",
                argv[0]);
        exit(EXIT_FAILURE);
    }

    key = request_key(argv[1], argv[2], argv[3],
                      KEY_SPEC_SESSION_KEYRING);
    if (key == -1) {
        perror("request_key");
        exit(EXIT_FAILURE);
    }

    printf("Key ID is %jx\n", (uintmax_t) key);

    exit(EXIT_SUCCESS);
}
SEE ALSO
keyctl(1), add_key(2), keyctl(2), keyctl(3), capabilities(7), keyrings(7), keyutils(7), persistent-keyring(7), process-keyring(7), session-keyring(7), thread-keyring(7), user-keyring(7), user-session-keyring(7), request-key(8)
The kernel source files Documentation/security/keys/core.rst and Documentation/security/keys/request-key.rst (or, before Linux 4.13, the files Documentation/security/keys.txt and Documentation/security/keys-request-key.txt).
84 - Linux cli command linkat
NAME linkat
make a new name for a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int link(const char *oldpath, const char *newpath);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int linkat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
linkat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
link() creates a new link (also known as a hard link) to an existing file.
If newpath exists, it will not be overwritten.
This new name may be used exactly as the old one for any operation; both names refer to the same file (and so have the same permissions and ownership) and it is impossible to tell which name was the “original”.
linkat()
The linkat() system call operates in exactly the same way as link(), except for the differences described here.
If the pathname given in oldpath is relative, then it is interpreted relative to the directory referred to by the file descriptor olddirfd (rather than relative to the current working directory of the calling process, as is done by link() for a relative pathname).
If oldpath is relative and olddirfd is the special value AT_FDCWD, then oldpath is interpreted relative to the current working directory of the calling process (like link()).
If oldpath is absolute, then olddirfd is ignored.
The interpretation of newpath is as for oldpath, except that a relative pathname is interpreted relative to the directory referred to by the file descriptor newdirfd.
The following values can be bitwise ORed in flags:
AT_EMPTY_PATH (since Linux 2.6.39)
If oldpath is an empty string, create a link to the file referenced by olddirfd (which may have been obtained using the open(2) O_PATH flag). In this case, olddirfd can refer to any type of file except a directory. This will generally not work if the file has a link count of zero (files created with O_TMPFILE and without O_EXCL are an exception). The caller must have the CAP_DAC_READ_SEARCH capability in order to use this flag. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_FOLLOW (since Linux 2.6.18)
By default, linkat() does not dereference oldpath if it is a symbolic link (like link()). The flag AT_SYMLINK_FOLLOW can be specified in flags to cause oldpath to be dereferenced if it is a symbolic link. If procfs is mounted, this can be used as an alternative to AT_EMPTY_PATH, like this:
linkat(AT_FDCWD, "/proc/self/fd/<fd>", newdirfd,
newname, AT_SYMLINK_FOLLOW);
Before Linux 2.6.18, the flags argument was unused, and had to be specified as 0.
See openat(2) for an explanation of the need for linkat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Write access to the directory containing newpath is denied, or search permission is denied for one of the directories in the path prefix of oldpath or newpath. (See also path_resolution(7).)
EDQUOT
The user’s quota of disk blocks on the filesystem has been exhausted.
EEXIST
newpath already exists.
EFAULT
oldpath or newpath points outside your accessible address space.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving oldpath or newpath.
EMLINK
The file referred to by oldpath already has the maximum number of links to it. For example, on an ext4(5) filesystem that does not employ the dir_index feature, the limit on the number of hard links to a file is 65,000; on btrfs(5), the limit is 65,535 links.
ENAMETOOLONG
oldpath or newpath was too long.
ENOENT
A directory component in oldpath or newpath does not exist or is a dangling symbolic link.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The device containing the file has no room for the new directory entry.
ENOTDIR
A component used as a directory in oldpath or newpath is not, in fact, a directory.
EPERM
oldpath is a directory.
EPERM
The filesystem containing oldpath and newpath does not support the creation of hard links.
EPERM (since Linux 3.6)
The caller does not have permission to create a hard link to this file (see the description of /proc/sys/fs/protected_hardlinks in proc(5)).
EPERM
oldpath is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The file is on a read-only filesystem.
EXDEV
oldpath and newpath are not on the same mounted filesystem. (Linux permits a filesystem to be mounted at multiple points, but link() does not work across different mounts, even if the same filesystem is mounted on both.)
The following additional errors can occur for linkat():
EBADF
oldpath (newpath) is relative but olddirfd (newdirfd) is neither AT_FDCWD nor a valid file descriptor.
EINVAL
An invalid flag value was specified in flags.
ENOENT
AT_EMPTY_PATH was specified in flags, but the caller did not have the CAP_DAC_READ_SEARCH capability.
ENOENT
An attempt was made to link to the /proc/self/fd/NN file corresponding to a file descriptor created with
open(path, O_TMPFILE | O_EXCL, mode);
See open(2).
ENOENT
An attempt was made to link to a /proc/self/fd/NN file corresponding to a file that has been deleted.
ENOENT
oldpath is a relative pathname and olddirfd refers to a directory that has been deleted, or newpath is a relative pathname and newdirfd refers to a directory that has been deleted.
ENOTDIR
oldpath is relative and olddirfd is a file descriptor referring to a file other than a directory; or similar for newpath and newdirfd.
EPERM
AT_EMPTY_PATH was specified in flags, oldpath is an empty string, and olddirfd refers to a directory.
VERSIONS
POSIX.1-2001 says that link() should dereference oldpath if it is a symbolic link. However, since Linux 2.0, Linux does not do so: if oldpath is a symbolic link, then newpath is created as a (hard) link to the same symbolic link file (i.e., newpath becomes a symbolic link to the same file that oldpath refers to). Some other implementations behave in the same manner as Linux. POSIX.1-2008 changes the specification of link(), making it implementation-dependent whether or not oldpath is dereferenced if it is a symbolic link. For precise control over the treatment of symbolic links when creating a link, use linkat().
glibc
On older kernels where linkat() is unavailable, the glibc wrapper function falls back to the use of link(), unless AT_SYMLINK_FOLLOW is specified. When oldpath and newpath are relative pathnames, glibc constructs pathnames based on the symbolic links in /proc/self/fd that correspond to the olddirfd and newdirfd arguments.
STANDARDS
link()
POSIX.1-2008.
HISTORY
link()
SVr4, 4.3BSD, POSIX.1-2001 (but see VERSIONS).
linkat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Hard links, as created by link(), cannot span filesystems. Use symlink(2) if this is required.
BUGS
On NFS filesystems, the return code may be wrong in case the NFS server performs the link creation and dies before it can say so. Use stat(2) to find out if the link got created.
SEE ALSO
ln(1), open(2), rename(2), stat(2), symlink(2), unlink(2), path_resolution(7), symlink(7)
85 - Linux cli command sigpending
NAME sigpending
examine pending signals
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
int sigpending(sigset_t *set);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigpending():
_POSIX_C_SOURCE
DESCRIPTION
sigpending() returns the set of signals that are pending for delivery to the calling thread (i.e., the signals which have been raised while blocked). The mask of pending signals is returned in set.
RETURN VALUE
sigpending() returns 0 on success. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
set points to memory which is not a valid part of the process address space.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
C library/kernel differences
The original Linux system call was named sigpending(). However, with the addition of real-time signals in Linux 2.2, the fixed-size, 32-bit sigset_t argument supported by that system call was no longer fit for purpose. Consequently, a new system call, rt_sigpending(), was added to support an enlarged sigset_t type. The new system call takes a second argument, size_t sigsetsize, which specifies the size in bytes of the signal set in set. The glibc sigpending() wrapper function hides these details from us, transparently calling rt_sigpending() when the kernel provides it.
NOTES
See sigsetops(3) for details on manipulating signal sets.
If a signal is both blocked and has a disposition of “ignored”, it is not added to the mask of pending signals when generated.
The set of signals that is pending for a thread is the union of the set of signals that is pending for that thread and the set of signals that is pending for the process as a whole; see signal(7).
A child created via fork(2) initially has an empty pending signal set; the pending signal set is preserved across an execve(2).
BUGS
Up to and including glibc 2.2.1, there is a bug in the wrapper function for sigpending() which means that information about pending real-time signals is not correctly returned.
SEE ALSO
kill(2), sigaction(2), signal(2), sigprocmask(2), sigsuspend(2), sigsetops(3), signal(7)
86 - Linux cli command pwrite64
NAME pwrite64
read from or write to a file descriptor at a given offset
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
ssize_t pread(int fd, void buf[.count], size_t count,
off_t offset);
ssize_t pwrite(int fd, const void buf[.count], size_t count,
off_t offset);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
pread(), pwrite():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
DESCRIPTION
pread() reads up to count bytes from file descriptor fd at offset offset (from the start of the file) into the buffer starting at buf. The file offset is not changed.
pwrite() writes up to count bytes from the buffer starting at buf to the file descriptor fd at offset offset. The file offset is not changed.
The file referenced by fd must be capable of seeking.
RETURN VALUE
On success, pread() returns the number of bytes read (a return of zero indicates end of file) and pwrite() returns the number of bytes written.
Note that it is not an error for a successful call to transfer fewer bytes than requested (see read(2) and write(2)).
On error, -1 is returned and errno is set to indicate the error.
ERRORS
pread() can fail and set errno to any error specified for read(2) or lseek(2). pwrite() can fail and set errno to any error specified for write(2) or lseek(2).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
Added in Linux 2.1.60; the entries in the i386 system call table were added in Linux 2.1.69. C library support (including emulation using lseek(2) on older kernels without the system calls) was added in glibc 2.1.
C library/kernel differences
On Linux, the underlying system calls were renamed in Linux 2.6: pread() became pread64(), and pwrite() became pwrite64(). The system call numbers remained the same. The glibc pread() and pwrite() wrapper functions transparently deal with the change.
On some 32-bit architectures, the calling signature for these system calls differs, for the reasons described in syscall(2).
NOTES
The pread() and pwrite() system calls are especially useful in multithreaded applications. They allow multiple threads to perform I/O on the same file descriptor without being affected by changes to the file offset by other threads.
BUGS
POSIX requires that opening a file with the O_APPEND flag should have no effect on the location at which pwrite() writes data. However, on Linux, if a file is opened with O_APPEND, pwrite() appends data to the end of the file, regardless of the value of offset.
SEE ALSO
lseek(2), read(2), readv(2), write(2)
87 - Linux cli command rmdir
NAME rmdir
delete a directory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int rmdir(const char *pathname);
DESCRIPTION
rmdir() deletes a directory, which must be empty.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Write access to the directory containing pathname was not allowed, or one of the directories in the path prefix of pathname did not allow search permission. (See also path_resolution(7).)
EBUSY
pathname is currently in use by the system or some process that prevents its removal. On Linux, this means pathname is currently used as a mount point or is the root directory of the calling process.
EFAULT
pathname points outside your accessible address space.
EINVAL
pathname has . as last component.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname was too long.
ENOENT
A directory component in pathname does not exist or is a dangling symbolic link.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
pathname, or a component used as a directory in pathname, is not, in fact, a directory.
ENOTEMPTY
pathname contains entries other than . and .. ; or, pathname has .. as its final component. POSIX.1 also allows EEXIST for this condition.
EPERM
The directory containing pathname has the sticky bit (S_ISVTX) set and the process’s effective user ID is neither the user ID of the file to be deleted nor that of the directory containing it, and the process is not privileged (Linux: does not have the CAP_FOWNER capability).
EPERM
The filesystem containing pathname does not support the removal of directories.
EROFS
pathname refers to a directory on a read-only filesystem.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
BUGS
Infelicities in the protocol underlying NFS can cause the unexpected disappearance of directories which are still being used.
SEE ALSO
rm(1), rmdir(1), chdir(2), chmod(2), mkdir(2), rename(2), unlink(2), unlinkat(2)
88 - Linux cli command getsid
NAME getsid
get session ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
pid_t getsid(pid_t pid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
getsid():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
DESCRIPTION
getsid() returns the session ID of the process with process ID pid. If pid is 0, getsid() returns the session ID of the calling process.
RETURN VALUE
On success, a session ID is returned. On error, (pid_t) -1 is returned, and errno is set to indicate the error.
ERRORS
EPERM
A process with process ID pid exists, but it is not in the same session as the calling process, and the implementation considers this an error.
ESRCH
No process with process ID pid was found.
VERSIONS
Linux does not return EPERM.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4. Linux 2.0.
NOTES
See credentials(7) for a description of sessions and session IDs.
SEE ALSO
getpgid(2), setsid(2), credentials(7)
89 - Linux cli command sysctl
NAME sysctl
read/write system parameters
SYNOPSIS
#include <unistd.h>
#include <linux/sysctl.h>
[[deprecated]] int _sysctl(struct __sysctl_args *args);
DESCRIPTION
This system call no longer exists on current kernels! See NOTES.
The _sysctl() call reads and/or writes kernel parameters, for example the hostname or the maximum number of open files. The argument has the form
struct __sysctl_args {
int *name; /* integer vector describing variable */
int nlen; /* length of this vector */
void *oldval; /* 0 or address where to store old value */
size_t *oldlenp; /* available room for old value,
overwritten by actual size of old value */
void *newval; /* 0 or address of new value */
size_t newlen; /* size of new value */
};
This call does a search in a tree structure, possibly resembling a directory tree under /proc/sys, and if the requested item is found calls some appropriate routine to read or modify the value.
RETURN VALUE
Upon successful completion, _sysctl() returns 0. Otherwise, a value of -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
EPERM
No search permission for one of the encountered “directories”, or no read permission where oldval was nonzero, or no write permission where newval was nonzero.
EFAULT
The invocation asked for the previous value by setting oldval non-NULL, but allowed zero room in oldlenp.
ENOTDIR
name was not found.
STANDARDS
Linux.
HISTORY
Linux 1.3.57. Removed in Linux 5.5, glibc 2.32.
It originated in 4.4BSD. Only Linux has the /proc/sys mirror, and the object naming schemes differ between Linux and 4.4BSD, but the declaration of the sysctl() function is the same in both.
NOTES
Use of this system call was long discouraged: since Linux 2.6.24, uses of this system call result in warnings in the kernel log, and in Linux 5.5, the system call was finally removed. Use the /proc/sys interface instead.
Note that on older kernels where this system call still exists, it is available only if the kernel was configured with the CONFIG_SYSCTL_SYSCALL option. Furthermore, glibc does not provide a wrapper for this system call, necessitating the use of syscall(2).
BUGS
The object names vary between kernel versions, making this system call worthless for applications.
Not all available objects are properly documented.
It is not yet possible to change operating system by writing to /proc/sys/kernel/ostype.
EXAMPLES
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <linux/sysctl.h>

#define ARRAY_SIZE(arr)  (sizeof(arr) / sizeof((arr)[0]))

int _sysctl(struct __sysctl_args *args);

#define OSNAMESZ 100

int
main(void)
{
    int                   name[] = { CTL_KERN, KERN_OSTYPE };
    char                  osname[OSNAMESZ];
    size_t                osnamelth;
    struct __sysctl_args  args;

    memset(&args, 0, sizeof(args));
    args.name = name;
    args.nlen = ARRAY_SIZE(name);
    args.oldval = osname;
    args.oldlenp = &osnamelth;

    osnamelth = sizeof(osname);

    if (syscall(SYS__sysctl, &args) == -1) {
        perror("_sysctl");
        exit(EXIT_FAILURE);
    }

    /* Print at most osnamelth bytes, in case osname is not terminated */
    printf("This machine is running %.*s\n", (int) osnamelth, osname);
    exit(EXIT_SUCCESS);
}
SEE ALSO
proc(5)
90 - Linux cli command sched_get_priority_max
NAME sched_get_priority_max
get static priority range
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sched.h>
int sched_get_priority_max(int policy);
int sched_get_priority_min(int policy);
DESCRIPTION
sched_get_priority_max() returns the maximum priority value that can be used with the scheduling algorithm identified by policy. sched_get_priority_min() returns the minimum priority value that can be used with the scheduling algorithm identified by policy. Supported policy values are SCHED_FIFO, SCHED_RR, SCHED_OTHER, SCHED_BATCH, SCHED_IDLE, and SCHED_DEADLINE. Further details about these policies can be found in sched(7).
Processes with numerically higher priority values are scheduled before processes with numerically lower priority values. Thus, the value returned by sched_get_priority_max() will be greater than the value returned by sched_get_priority_min().
Linux allows the static priority range 1 to 99 for the SCHED_FIFO and SCHED_RR policies, and the priority 0 for the remaining policies. Scheduling priority ranges for the various policies are not alterable.
The range of scheduling priorities may vary on other POSIX systems, so portable applications should use a virtual priority range and map it to the interval given by sched_get_priority_max() and sched_get_priority_min(). POSIX.1 requires a spread of at least 32 between the maximum and the minimum values for SCHED_FIFO and SCHED_RR.
POSIX systems on which sched_get_priority_max() and sched_get_priority_min() are available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
RETURN VALUE
On success, sched_get_priority_max() and sched_get_priority_min() return the maximum/minimum priority value for the named scheduling policy. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
The argument policy does not identify a defined scheduling policy.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
SEE ALSO
sched_getaffinity(2), sched_getparam(2), sched_getscheduler(2), sched_setaffinity(2), sched_setparam(2), sched_setscheduler(2), sched(7)
91 - Linux cli command getrlimit
NAME
getrlimit, setrlimit, prlimit - get/set resource limits
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/resource.h>
int getrlimit(int resource, struct rlimit *rlim);
int setrlimit(int resource, const struct rlimit *rlim);
int prlimit(pid_t pid, int resource,
const struct rlimit *_Nullable new_limit,
struct rlimit *_Nullable old_limit);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
prlimit():
_GNU_SOURCE
DESCRIPTION
The getrlimit() and setrlimit() system calls get and set resource limits. Each resource has an associated soft and hard limit, as defined by the rlimit structure:
struct rlimit {
rlim_t rlim_cur; /* Soft limit */
rlim_t rlim_max; /* Hard limit (ceiling for rlim_cur) */
};
The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may set only its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability in the initial user namespace) may make arbitrary changes to either limit value.
The value RLIM_INFINITY denotes no limit on a resource (both in the structure returned by getrlimit() and in the structure passed to setrlimit()).
The resource argument must be one of:
RLIMIT_AS
This is the maximum size of the process’s virtual memory (address space). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), mmap(2), and mremap(2), which fail with the error ENOMEM upon exceeding this limit. In addition, automatic stack expansion fails (and generates a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited.
RLIMIT_CORE
This is the maximum size of a core file (see core(5)) in bytes that the process may dump. When 0, no core dump files are created. When nonzero, larger dumps are truncated to this size.
RLIMIT_CPU
This is a limit, in seconds, on the amount of CPU time that the process can consume. When the process reaches the soft limit, it is sent a SIGXCPU signal. The default action for this signal is to terminate the process. However, the signal can be caught, and the handler can return control to the main program. If the process continues to consume CPU time, it will be sent SIGXCPU once per second until the hard limit is reached, at which time it is sent SIGKILL. (This latter point describes Linux behavior. Implementations vary in how they treat processes which continue to consume CPU time after reaching the soft limit. Portable applications that need to catch this signal should perform an orderly termination upon first receipt of SIGXCPU.)
RLIMIT_DATA
This is the maximum size of the process’s data segment (initialized data, uninitialized data, and heap). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), sbrk(2), and (since Linux 4.7) mmap(2), which fail with the error ENOMEM upon encountering the soft limit of this resource.
RLIMIT_FSIZE
This is the maximum size in bytes of files that the process may create. Attempts to extend a file beyond this limit result in delivery of a SIGXFSZ signal. By default, this signal terminates a process, but a process can catch this signal instead, in which case the relevant system call (e.g., write(2), truncate(2)) fails with the error EFBIG.
RLIMIT_LOCKS (Linux 2.4.0 to Linux 2.4.24)
This is a limit on the combined number of flock(2) locks and fcntl(2) leases that this process may establish.
RLIMIT_MEMLOCK
This is the maximum number of bytes of memory that may be locked into RAM. This limit is in effect rounded down to the nearest multiple of the system page size. This limit affects mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9, it also affects the shmctl(2) SHM_LOCK operation, where it sets a maximum on the total bytes in shared memory segments (see shmget(2)) that may be locked by the real user ID of the calling process. The shmctl(2) SHM_LOCK locks are accounted for separately from the per-process memory locks established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED; a process can lock bytes up to this limit in each of these two categories.
Before Linux 2.6.9, this limit controlled the amount of memory that could be locked by a privileged process. Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process may lock, and this limit instead governs the amount of memory that an unprivileged process may lock.
RLIMIT_MSGQUEUE (since Linux 2.6.8)
This is a limit on the number of bytes that can be allocated for POSIX message queues for the real user ID of the calling process. This limit is enforced for mq_open(3). Each message queue that the user creates counts (until it is removed) against this limit according to the formula:
Since Linux 3.5:
    bytes = attr.mq_maxmsg * sizeof(struct msg_msg) +
            MIN(attr.mq_maxmsg, MQ_PRIO_MAX) *
                sizeof(struct posix_msg_tree_node) +      /* For overhead */
            attr.mq_maxmsg * attr.mq_msgsize;             /* For message data */
Linux 3.4 and earlier:
    bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) +   /* For overhead */
            attr.mq_maxmsg * attr.mq_msgsize;             /* For message data */
where attr is the mq_attr structure specified as the fourth argument to mq_open(3), and the msg_msg and posix_msg_tree_node structures are kernel-internal structures.
The “overhead” addend in the formula accounts for overhead bytes required by the implementation and ensures that the user cannot create an unlimited number of zero-length messages (such messages nevertheless each consume some system memory for bookkeeping overhead).
RLIMIT_NICE (since Linux 2.6.12, but see BUGS below)
This specifies a ceiling to which the process’s nice value can be raised using setpriority(2) or nice(2). The actual ceiling for the nice value is calculated as 20 - rlim_cur. The useful range for this limit is thus from 1 (corresponding to a nice value of 19) to 40 (corresponding to a nice value of -20). This unusual choice of range was necessary because negative numbers cannot be specified as resource limit values, since they typically have special meanings. For example, RLIM_INFINITY typically is the same as -1. For more detail on the nice value, see sched(7).
RLIMIT_NOFILE
This specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.)
Since Linux 4.5, this limit also defines the maximum number of file descriptors that an unprivileged process (one without the CAP_SYS_RESOURCE capability) may have “in flight” to other processes, by being passed across UNIX domain sockets. This limit applies to the sendmsg(2) system call. For further details, see unix(7).
RLIMIT_NPROC
This is a limit on the number of extant processes (or, more precisely on Linux, threads) for the real user ID of the calling process. So long as the current number of processes belonging to this process’s real user ID is greater than or equal to this limit, fork(2) fails with the error EAGAIN.
The RLIMIT_NPROC limit is not enforced for processes that have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability, or run with real user ID 0.
RLIMIT_RSS
This is a limit (in bytes) on the process’s resident set (the number of virtual pages resident in RAM). This limit has effect only in Linux 2.4.x, x < 30, and there affects only calls to madvise(2) specifying MADV_WILLNEED.
RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS)
This specifies a ceiling on the real-time priority that may be set for this process using sched_setscheduler(2) and sched_setparam(2).
For further details on real-time scheduling policies, see sched(7).
RLIMIT_RTTIME (since Linux 2.6.25)
This is a limit (in microseconds) on the amount of CPU time that a process scheduled under a real-time scheduling policy may consume without making a blocking system call. For the purpose of this limit, each time a process makes a blocking system call, the count of its consumed CPU time is reset to zero. The CPU time count is not reset if the process continues trying to use the CPU but is preempted, its time slice expires, or it calls sched_yield(2).
Upon reaching the soft limit, the process is sent a SIGXCPU signal. If the process catches or ignores this signal and continues consuming CPU time, then SIGXCPU will be generated once each second until the hard limit is reached, at which point the process is sent a SIGKILL signal.
The intended use of this limit is to stop a runaway real-time process from locking up the system.
For further details on real-time scheduling policies, see sched(7).
RLIMIT_SIGPENDING (since Linux 2.6.8)
This is a limit on the number of signals that may be queued for the real user ID of the calling process. Both standard and real-time signals are counted for the purpose of checking this limit. However, the limit is enforced only for sigqueue(3); it is always possible to use kill(2) to queue one instance of any of the signals that are not already queued to the process.
RLIMIT_STACK
This is the maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated. To handle this signal, a process must employ an alternate signal stack (sigaltstack(2)).
Since Linux 2.6.23, this limit also determines the amount of space used for the process’s command-line arguments and environment variables; for details, see execve(2).
prlimit()
The Linux-specific prlimit() system call combines and extends the functionality of setrlimit() and getrlimit(). It can be used to both set and get the resource limits of an arbitrary process.
The resource argument has the same meaning as for setrlimit() and getrlimit().
If the new_limit argument is not NULL, then the rlimit structure to which it points is used to set new values for the soft and hard limits for resource. If the old_limit argument is not NULL, then a successful call to prlimit() places the previous soft and hard limits for resource in the rlimit structure pointed to by old_limit.
The pid argument specifies the ID of the process on which the call is to operate. If pid is 0, then the call applies to the calling process. To set or get the resources of a process other than itself, the caller must have the CAP_SYS_RESOURCE capability in the user namespace of the process whose resource limits are being changed, or the real, effective, and saved set user IDs of the target process must match the real user ID of the caller and the real, effective, and saved set group IDs of the target process must match the real group ID of the caller.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
A pointer argument points to a location outside the accessible address space.
EINVAL
The value specified in resource is not valid; or, for setrlimit() or prlimit(): rlim->rlim_cur was greater than rlim->rlim_max.
EPERM
An unprivileged process tried to raise the hard limit; the CAP_SYS_RESOURCE capability is required to do this.
EPERM
The caller tried to increase the hard RLIMIT_NOFILE limit above the maximum defined by /proc/sys/fs/nr_open (see proc(5)).
EPERM
(prlimit()) The calling process did not have permission to set limits for the process specified by pid.
ESRCH
Could not find a process with the ID specified in pid.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
getrlimit(), setrlimit(), prlimit() | Thread safety | MT-Safe |
STANDARDS
getrlimit()
setrlimit()
POSIX.1-2008.
prlimit()
Linux.
RLIMIT_MEMLOCK and RLIMIT_NPROC derive from BSD and are not specified in POSIX.1; they are present on the BSDs and Linux, but on few other implementations. RLIMIT_RSS derives from BSD and is not specified in POSIX.1; it is nevertheless present on most implementations. RLIMIT_MSGQUEUE, RLIMIT_NICE, RLIMIT_RTPRIO, RLIMIT_RTTIME, and RLIMIT_SIGPENDING are Linux-specific.
HISTORY
getrlimit()
setrlimit()
POSIX.1-2001, SVr4, 4.3BSD.
prlimit()
Linux 2.6.36, glibc 2.13.
NOTES
A child process created via fork(2) inherits its parent’s resource limits. Resource limits are preserved across execve(2).
Resource limits are per-process attributes that are shared by all of the threads in a process.
Lowering the soft limit for a resource below the process’s current consumption of that resource will succeed (but will prevent the process from further increasing its consumption of the resource).
One can set the resource limits of the shell using the built-in ulimit command (limit in csh(1)). The shell’s resource limits are inherited by the processes that it creates to execute commands.
Since Linux 2.6.24, the resource limits of any process can be inspected via /proc/pid/limits; see proc(5).
Ancient systems provided a vlimit() function with a similar purpose to setrlimit(). For backward compatibility, glibc also provides vlimit(). All new applications should be written using setrlimit().
C library/kernel ABI differences
Since glibc 2.13, the glibc getrlimit() and setrlimit() wrapper functions no longer invoke the corresponding system calls, but instead employ prlimit(), for the reasons described in BUGS.
The name of the glibc wrapper function is prlimit(); the underlying system call is prlimit64().
BUGS
In older Linux kernels, the SIGXCPU and SIGKILL signals delivered when a process encountered the soft and hard RLIMIT_CPU limits were delivered one (CPU) second later than they should have been. This was fixed in Linux 2.6.8.
In Linux 2.6.x kernels before Linux 2.6.17, a RLIMIT_CPU limit of 0 is wrongly treated as “no limit” (like RLIM_INFINITY). Since Linux 2.6.17, setting a limit of 0 does have an effect, but is actually treated as a limit of 1 second.
A kernel bug means that RLIMIT_RTPRIO does not work in Linux 2.6.12; the problem is fixed in Linux 2.6.13.
In Linux 2.6.12, there was an off-by-one mismatch between the priority ranges returned by getpriority(2) and RLIMIT_NICE. This had the effect that the actual ceiling for the nice value was calculated as 19 - rlim_cur. This was fixed in Linux 2.6.13.
Since Linux 2.6.12, if a process reaches its soft RLIMIT_CPU limit and has a handler installed for SIGXCPU, then, in addition to invoking the signal handler, the kernel increases the soft limit by one second. This behavior repeats if the process continues to consume CPU time, until the hard limit is reached, at which point the process is killed. Other implementations do not change the RLIMIT_CPU soft limit in this manner, and the Linux behavior is probably not standards conformant; portable applications should avoid relying on this Linux-specific behavior. The Linux-specific RLIMIT_RTTIME limit exhibits the same behavior when the soft limit is encountered.
Kernels before Linux 2.4.22 did not diagnose the error EINVAL for setrlimit() when rlim->rlim_cur was greater than rlim->rlim_max.
Linux doesn’t return an error when an attempt to set RLIMIT_CPU has failed, for compatibility reasons.
Representation of “large” resource limit values on 32-bit platforms
The glibc getrlimit() and setrlimit() wrapper functions use a 64-bit rlim_t data type, even on 32-bit platforms. However, the rlim_t data type used in the getrlimit() and setrlimit() system calls is a (32-bit) unsigned long. Furthermore, in Linux, the kernel represents resource limits on 32-bit platforms as unsigned long. However, a 32-bit data type is not wide enough. The most pertinent limit here is RLIMIT_FSIZE, which specifies the maximum size to which a file can grow: to be useful, this limit must be represented using a type that is as wide as the type used to represent file offsets, that is, as wide as a 64-bit off_t (assuming a program compiled with _FILE_OFFSET_BITS=64).
Before glibc 2.13, to work around this kernel limitation, if a program tried to set a resource limit to a value larger than can be represented in a 32-bit unsigned long, then the glibc setrlimit() wrapper function silently converted the limit value to RLIM_INFINITY. In other words, the requested resource limit setting was silently ignored.
Since glibc 2.13, glibc works around the limitations of the getrlimit() and setrlimit() system calls by implementing setrlimit() and getrlimit() as wrapper functions that call prlimit().
EXAMPLES
The program below demonstrates the use of prlimit().
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <time.h>
int
main(int argc, char *argv[])
{
    pid_t pid;
    struct rlimit old, new;
    struct rlimit *newp;
    if (!(argc == 2 || argc == 4)) {
        fprintf(stderr, "Usage: %s <pid> [<new-soft-limit> "
                "<new-hard-limit>]\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    pid = atoi(argv[1]);        /* PID of target process */
    newp = NULL;
    if (argc == 4) {
        new.rlim_cur = atoi(argv[2]);
        new.rlim_max = atoi(argv[3]);
        newp = &new;
    }
    /* Set CPU time limit of target process; retrieve and display
       previous limit */
    if (prlimit(pid, RLIMIT_CPU, newp, &old) == -1)
        err(EXIT_FAILURE, "prlimit-1");
    printf("Previous limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);
    /* Retrieve and display new CPU time limit */
    if (prlimit(pid, RLIMIT_CPU, NULL, &old) == -1)
        err(EXIT_FAILURE, "prlimit-2");
    printf("New limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);
    exit(EXIT_SUCCESS);
}
SEE ALSO
prlimit(1), dup(2), fcntl(2), fork(2), getrusage(2), mlock(2), mmap(2), open(2), quotactl(2), sbrk(2), shmctl(2), malloc(3), sigqueue(3), ulimit(3), core(5), capabilities(7), cgroups(7), credentials(7), signal(7)
92 - Linux cli command kill
NAME
kill - send signal to a process
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
int kill(pid_t pid, int sig);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
kill():
_POSIX_C_SOURCE
DESCRIPTION
The kill() system call can be used to send any signal to any process group or process.
If pid is positive, then signal sig is sent to the process with the ID specified by pid.
If pid equals 0, then sig is sent to every process in the process group of the calling process.
If pid equals -1, then sig is sent to every process for which the calling process has permission to send signals, except for process 1 (init), but see below.
If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.
If sig is 0, then no signal is sent, but existence and permission checks are still performed; this can be used to check for the existence of a process ID or process group ID that the caller is permitted to signal.
For a process to have permission to send a signal, it must either be privileged (under Linux: have the CAP_KILL capability in the user namespace of the target process), or the real or effective user ID of the sending process must equal the real or saved set-user-ID of the target process. In the case of SIGCONT, it suffices when the sending and receiving processes belong to the same session. (Historically, the rules were different; see NOTES.)
RETURN VALUE
On success (at least one signal was sent), zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
An invalid signal was specified.
EPERM
The calling process does not have permission to send the signal to any of the target processes.
ESRCH
The target process or process group does not exist. Note that an existing process might be a zombie, a process that has terminated execution, but has not yet been wait(2)ed for.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
Linux notes
Across different kernel versions, Linux has enforced different rules for the permissions required for an unprivileged process to send a signal to another process. In Linux 1.0 to 1.2.2, a signal could be sent if the effective user ID of the sender matched the effective user ID of the target, or the real user ID of the sender matched the real user ID of the target. From Linux 1.2.3 until 1.3.77, a signal could be sent if the effective user ID of the sender matched either the real or effective user ID of the target. The current rules, which conform to POSIX.1, were adopted in Linux 1.3.78.
NOTES
The only signals that can be sent to process ID 1, the init process, are those for which init has explicitly installed signal handlers. This is done to ensure that the system is not brought down accidentally.
POSIX.1 requires that kill(-1,sig) send sig to all processes that the calling process may send signals to, except possibly for some implementation-defined system processes. Linux allows a process to signal itself, but on Linux the call kill(-1,sig) does not signal the calling process.
POSIX.1 requires that if a process sends a signal to itself, and the sending thread does not have the signal blocked, and no other thread has it unblocked or is waiting for it in sigwait(3), at least one unblocked signal must be delivered to the sending thread before the kill() returns.
BUGS
In Linux 2.6 up to and including Linux 2.6.7, there was a bug that meant that when sending signals to a process group, kill() failed with the error EPERM if the caller did not have permission to send the signal to any (rather than all) of the members of the process group. Notwithstanding this error return, the signal was still delivered to all of the processes for which the caller had permission to signal.
SEE ALSO
kill(1), _exit(2), pidfd_send_signal(2), signal(2), tkill(2), exit(3), killpg(3), sigqueue(3), capabilities(7), credentials(7), signal(7)
93 - Linux cli command pciconfig_read
NAME
pciconfig_read, pciconfig_write, pciconfig_iobase - pci device information handling
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <pci.h>
int pciconfig_read(unsigned long bus, unsigned long dfn,
unsigned long off, unsigned long len,
unsigned char *buf);
int pciconfig_write(unsigned long bus, unsigned long dfn,
unsigned long off, unsigned long len,
unsigned char *buf);
int pciconfig_iobase(int which, unsigned long bus,
unsigned long devfn);
DESCRIPTION
Most of the interaction with PCI devices is already handled by the kernel PCI layer, and thus these calls should not normally need to be accessed from user space.
pciconfig_read()
Reads len bytes into buf from the device identified by bus and dfn, starting at offset off.
pciconfig_write()
Writes len bytes from buf to the device identified by bus and dfn, starting at offset off.
pciconfig_iobase()
You pass it a bus/devfn pair and get a physical address for either the memory offset (for things like prep, this is 0xc0000000), the IO base for PIO cycles, or the ISA holes if any.
RETURN VALUE
pciconfig_read()
On success, zero is returned. On error, -1 is returned and errno is set to indicate the error.
pciconfig_write()
On success, zero is returned. On error, -1 is returned and errno is set to indicate the error.
pciconfig_iobase()
Returns information on locations of various I/O regions in physical memory according to the which value. Values for which are: IOBASE_BRIDGE_NUMBER, IOBASE_MEMORY, IOBASE_IO, IOBASE_ISA_IO, IOBASE_ISA_MEM.
ERRORS
EINVAL
len value is invalid. This does not apply to pciconfig_iobase().
EIO
I/O error.
ENODEV
For pciconfig_iobase(), “hose” value is NULL. For the other calls, could not find a slot.
ENOSYS
The system has not implemented these calls (CONFIG_PCI not defined).
EOPNOTSUPP
This return value is valid only for pciconfig_iobase(). It is returned if the value for which is invalid.
EPERM
User does not have the CAP_SYS_ADMIN capability. This does not apply to pciconfig_iobase().
STANDARDS
Linux.
HISTORY
Linux 2.0.26/2.1.11.
SEE ALSO
capabilities(7)
94 - Linux cli command pciconfig_iobase
NAME
pciconfig_read, pciconfig_write, pciconfig_iobase - pci device information handling
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <pci.h>
int pciconfig_read(unsigned long bus, unsigned long dfn,
unsigned long off, unsigned long len,
unsigned char *buf);
int pciconfig_write(unsigned long bus, unsigned long dfn,
unsigned long off, unsigned long len,
unsigned char *buf);
int pciconfig_iobase(int which, unsigned long bus,
unsigned long devfn);
DESCRIPTION
Most of the interaction with PCI devices is already handled by the kernel PCI layer, and thus these calls should not normally need to be accessed from user space.
pciconfig_read()
Reads len bytes into buf from the device identified by bus and dfn, starting at offset off.
pciconfig_write()
Writes len bytes from buf to the device identified by bus and dfn, starting at offset off.
pciconfig_iobase()
You pass it a bus/devfn pair and get a physical address for either the memory offset (for things like prep, this is 0xc0000000), the IO base for PIO cycles, or the ISA holes if any.
RETURN VALUE
pciconfig_read()
On success, zero is returned. On error, -1 is returned and errno is set to indicate the error.
pciconfig_write()
On success, zero is returned. On error, -1 is returned and errno is set to indicate the error.
pciconfig_iobase()
Returns information on locations of various I/O regions in physical memory according to the which value. Values for which are: IOBASE_BRIDGE_NUMBER, IOBASE_MEMORY, IOBASE_IO, IOBASE_ISA_IO, IOBASE_ISA_MEM.
ERRORS
EINVAL
len value is invalid. This does not apply to pciconfig_iobase().
EIO
I/O error.
ENODEV
For pciconfig_iobase(), “hose” value is NULL. For the other calls, could not find a slot.
ENOSYS
The system has not implemented these calls (CONFIG_PCI not defined).
EOPNOTSUPP
This return value is valid only for pciconfig_iobase(). It is returned if the value for which is invalid.
EPERM
User does not have the CAP_SYS_ADMIN capability. This does not apply to pciconfig_iobase().
STANDARDS
Linux.
HISTORY
Linux 2.0.26/2.1.11.
SEE ALSO
capabilities(7)
95 - Linux cli command pipe2
NAME
pipe, pipe2 - create pipe
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int pipe(int pipefd[2]);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h> /* Definition of O_* constants */
#include <unistd.h>
int pipe2(int pipefd[2], int flags);
/* On Alpha, IA-64, MIPS, SuperH, and SPARC/SPARC64, pipe() has the
following prototype; see VERSIONS */
#include <unistd.h>
struct fd_pair {
long fd[2];
};
struct fd_pair pipe(void);
DESCRIPTION
pipe() creates a pipe, a unidirectional data channel that can be used for interprocess communication. The array pipefd is used to return two file descriptors referring to the ends of the pipe. pipefd[0] refers to the read end of the pipe. pipefd[1] refers to the write end of the pipe. Data written to the write end of the pipe is buffered by the kernel until it is read from the read end of the pipe. For further details, see pipe(7).
If flags is 0, then pipe2() is the same as pipe(). The following values can be bitwise ORed in flags to obtain different behavior:
O_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the two new file descriptors. See the description of the same flag in open(2) for reasons why this may be useful.
O_DIRECT (since Linux 3.4)
Create a pipe that performs I/O in “packet” mode. Each write(2) to the pipe is dealt with as a separate packet, and read(2)s from the pipe will read one packet at a time. Note the following points:
Writes of greater than PIPE_BUF bytes (see pipe(7)) will be split into multiple packets. The constant PIPE_BUF is defined in <limits.h>.
If a read(2) specifies a buffer size that is smaller than the next packet, then the requested number of bytes are read, and the excess bytes in the packet are discarded. Specifying a buffer size of PIPE_BUF will be sufficient to read the largest possible packets (see the previous point).
Zero-length packets are not supported. (A read(2) that specifies a buffer size of zero is a no-op, and returns 0.)
Older kernels that do not support this flag will indicate this via an EINVAL error.
Since Linux 4.5, it is possible to change the O_DIRECT setting of a pipe file descriptor using fcntl(2).
O_NONBLOCK
Set the O_NONBLOCK file status flag on the open file descriptions referred to by the new file descriptors. Using this flag saves extra calls to fcntl(2) to achieve the same result.
O_NOTIFICATION_PIPE
Since Linux 5.8, a general notification mechanism is built on top of the pipe: the kernel splices notification messages into pipes opened by user space. The owner of the pipe has to tell the kernel which sources of events to watch, and filters can also be applied to select which subevents should be placed into the pipe.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, errno is set to indicate the error, and pipefd is left unchanged.
On Linux (and other systems), pipe() does not modify pipefd on failure. A requirement standardizing this behavior was added in POSIX.1-2008 TC2. The Linux-specific pipe2() system call likewise does not modify pipefd on failure.
ERRORS
EFAULT
pipefd is not valid.
EINVAL
(pipe2()) Invalid value in flags.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENFILE
The user hard limit on memory that can be allocated for pipes has been reached and the caller is not privileged; see pipe(7).
ENOPKG
(pipe2()) O_NOTIFICATION_PIPE was passed in flags and support for notifications (CONFIG_WATCH_QUEUE) is not compiled into the kernel.
VERSIONS
The System V ABI on some architectures allows the use of more than one register for returning multiple values; several architectures (namely, Alpha, IA-64, MIPS, SuperH, and SPARC/SPARC64) (ab)use this feature in order to implement the pipe() system call in a functional manner: the call doesn’t take any arguments and returns a pair of file descriptors as the return value on success. The glibc pipe() wrapper function transparently deals with this. See syscall(2) for information regarding the registers used for storing the second file descriptor.
STANDARDS
pipe()
POSIX.1-2008.
pipe2()
Linux.
HISTORY
pipe()
POSIX.1-2001.
pipe2()
Linux 2.6.27, glibc 2.9.
EXAMPLES
The following program creates a pipe, and then fork(2)s to create a child process; the child inherits a duplicate set of file descriptors that refer to the same pipe. After the fork(2), each process closes the file descriptors that it doesn’t need for the pipe (see pipe(7)). The parent then writes the string contained in the program’s command-line argument to the pipe, and the child reads this string a byte at a time from the pipe and echoes it on standard output.
Program source
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int    pipefd[2];
    char   buf;
    pid_t  cpid;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <string>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (pipe(pipefd) == -1) {
        perror("pipe");
        exit(EXIT_FAILURE);
    }

    cpid = fork();
    if (cpid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }

    if (cpid == 0) {    /* Child reads from pipe */
        close(pipefd[1]);          /* Close unused write end */

        while (read(pipefd[0], &buf, 1) > 0)
            write(STDOUT_FILENO, &buf, 1);

        write(STDOUT_FILENO, "\n", 1);
        close(pipefd[0]);
        _exit(EXIT_SUCCESS);

    } else {            /* Parent writes argv[1] to pipe */
        close(pipefd[0]);          /* Close unused read end */
        write(pipefd[1], argv[1], strlen(argv[1]));
        close(pipefd[1]);          /* Reader will see EOF */
        wait(NULL);                /* Wait for child */
        exit(EXIT_SUCCESS);
    }
}
SEE ALSO
fork(2), read(2), socketpair(2), splice(2), tee(2), vmsplice(2), write(2), popen(3), pipe(7)
96 - Linux cli command fstatat64
NAME 🖥️ fstatat64 🖥️
get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but, in the case of stat(), fstatat(), and lstat(), execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    struct stat sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    printf("ID of containing device: [%x,%x]\n",
           major(sb.st_dev),
           minor(sb.st_dev));

    printf("File type: ");

    switch (sb.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");      break;
    case S_IFCHR:  printf("character device\n");  break;
    case S_IFDIR:  printf("directory\n");         break;
    case S_IFIFO:  printf("FIFO/pipe\n");         break;
    case S_IFLNK:  printf("symlink\n");           break;
    case S_IFREG:  printf("regular file\n");      break;
    case S_IFSOCK: printf("socket\n");            break;
    default:       printf("unknown?\n");          break;
    }

    printf("I-node number: %ju\n", (uintmax_t) sb.st_ino);

    printf("Mode: %jo (octal)\n", (uintmax_t) sb.st_mode);

    printf("Link count: %ju\n", (uintmax_t) sb.st_nlink);
    printf("Ownership: UID=%ju GID=%ju\n",
           (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);

    printf("Preferred I/O block size: %jd bytes\n",
           (intmax_t) sb.st_blksize);
    printf("File size: %jd bytes\n", (intmax_t) sb.st_size);
    printf("Blocks allocated: %jd\n", (intmax_t) sb.st_blocks);

    printf("Last status change: %s", ctime(&sb.st_ctime));
    printf("Last file access: %s", ctime(&sb.st_atime));
    printf("Last file modification: %s", ctime(&sb.st_mtime));

    exit(EXIT_SUCCESS);
}
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
97 - Linux cli command time
NAME 🖥️ time 🖥️
get time in seconds
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <time.h>
time_t time(time_t *_Nullable tloc);
DESCRIPTION
time() returns the time as the number of seconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC).
If tloc is non-NULL, the return value is also stored in the memory pointed to by tloc.
RETURN VALUE
On success, the value of time in seconds since the Epoch is returned. On error, ((time_t) -1) is returned, and errno is set to indicate the error.
ERRORS
EOVERFLOW
The time cannot be represented as a time_t value. This can happen if an executable with 32-bit time_t is run on a 64-bit kernel when the time is 2038-01-19 03:14:08 UTC or later. However, when the system time is out of time_t range in other situations, the behavior is undefined.
EFAULT
tloc points outside your accessible address space (but see BUGS).
On systems where the C library time() wrapper function invokes an implementation provided by the vdso(7) (so that there is no trap into the kernel), an invalid address may instead trigger a SIGSEGV signal.
VERSIONS
POSIX.1 defines seconds since the Epoch using a formula that approximates the number of seconds between a specified time and the Epoch. This formula takes account of the facts that all years that are evenly divisible by 4 are leap years, but years that are evenly divisible by 100 are not leap years unless they are also evenly divisible by 400, in which case they are leap years. This value is not the same as the actual number of seconds between the time and the Epoch, because of leap seconds and because system clocks are not required to be synchronized to a standard reference. Linux systems normally follow the POSIX requirement that this value ignore leap seconds, so that conforming systems interpret it consistently; see POSIX.1-2018 Rationale A.4.16.
Applications intended to run after 2038 should use ABIs with time_t wider than 32 bits; see time_t(3type).
C library/kernel differences
On some architectures, an implementation of time() is provided in the vdso(7).
STANDARDS
C11, POSIX.1-2008.
HISTORY
SVr4, 4.3BSD, C89, POSIX.1-2001.
BUGS
Error returns from this system call are indistinguishable from successful reports that the time is a few seconds before the Epoch, so the C library wrapper function never sets errno as a result of this call.
The tloc argument is obsolescent and should always be NULL in new code. When tloc is NULL, the call cannot fail.
SEE ALSO
date(1), gettimeofday(2), ctime(3), ftime(3), time(7), vdso(7)
98 - Linux cli command putpmsg
NAME 🖥️ putpmsg 🖥️
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
99 - Linux cli command faccessat
NAME 🖥️ faccessat 🖥️
check user’s permissions for a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int access(const char *pathname, int mode);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int faccessat(int dirfd, const char *pathname, int mode, int flags);
/* But see C library/kernel differences, below */
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_faccessat2,
int dirfd, const char *pathname, int mode, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
faccessat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
access() checks whether the calling process can access the file pathname. If pathname is a symbolic link, it is dereferenced.
The mode specifies the accessibility check(s) to be performed, and is either the value F_OK, or a mask consisting of the bitwise OR of one or more of R_OK, W_OK, and X_OK. F_OK tests for the existence of the file. R_OK, W_OK, and X_OK test whether the file exists and grants read, write, and execute permissions, respectively.
The check is done using the calling process’s real UID and GID, rather than the effective IDs as is done when actually attempting an operation (e.g., open(2)) on the file. Similarly, for the root user, the check uses the set of permitted capabilities rather than the set of effective capabilities; and for non-root users, the check uses an empty set of capabilities.
This allows set-user-ID programs and capability-endowed programs to easily determine the invoking user’s authority. In other words, access() does not answer the “can I read/write/execute this file?” question. It answers a slightly different question: “(assuming I’m a setuid binary) can the user who invoked me read/write/execute this file?”, which gives set-user-ID programs the possibility to prevent malicious users from causing them to read files which users shouldn’t be able to read.
If the calling process is privileged (i.e., its real UID is zero), then an X_OK check is successful for a regular file if execute permission is enabled for any of the file owner, group, or other.
faccessat()
faccessat() operates in exactly the same way as access(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by access() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like access()).
If pathname is absolute, then dirfd is ignored.
flags is constructed by ORing together zero or more of the following values:
AT_EACCESS
Perform access checks using the effective user and group IDs. By default, faccessat() uses the real IDs (like access()).
AT_EMPTY_PATH (since Linux 5.8)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself.
See openat(2) for an explanation of the need for faccessat().
faccessat2()
The description of faccessat() given above corresponds to POSIX.1 and to the implementation provided by glibc. However, the glibc implementation was an imperfect emulation (see BUGS) that papered over the fact that the raw Linux faccessat() system call does not have a flags argument. To allow for a proper implementation, Linux 5.8 added the faccessat2() system call, which supports the flags argument and allows a correct implementation of the faccessat() wrapper function.
RETURN VALUE
On success (all requested permissions granted, or mode is F_OK and the file exists), zero is returned. On error (at least one bit in mode asked for a permission that is denied, or mode is F_OK and the file does not exist, or some other error occurred), -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
The requested access would be denied to the file, or search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
(faccessat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
mode was incorrectly specified.
EINVAL
(faccessat()) Invalid flag specified in flags.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory.
ENOTDIR
(faccessat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
Write permission was requested to a file that has the immutable flag set. See also ioctl_iflags(2).
EROFS
Write permission was requested for a file on a read-only filesystem.
ETXTBSY
Write access was requested to an executable which is being executed.
VERSIONS
If the calling process has appropriate privileges (i.e., is superuser), POSIX.1-2001 permits an implementation to indicate success for an X_OK check even if none of the execute file permission bits are set. Linux does not do this.
C library/kernel differences
The raw faccessat() system call takes only the first three arguments. The AT_EACCESS and AT_SYMLINK_NOFOLLOW flags are actually implemented within the glibc wrapper function for faccessat(). If either of these flags is specified, then the wrapper function employs fstatat(2) to determine access permissions, but see BUGS.
glibc notes
On older kernels where faccessat() is unavailable (and when the AT_EACCESS and AT_SYMLINK_NOFOLLOW flags are not specified), the glibc wrapper function falls back to the use of access(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
STANDARDS
access()
faccessat()
POSIX.1-2008.
faccessat2()
Linux.
HISTORY
access()
SVr4, 4.3BSD, POSIX.1-2001.
faccessat()
Linux 2.6.16, glibc 2.4.
faccessat2()
Linux 5.8.
NOTES
Warning: Using these calls to check if a user is authorized to, for example, open a file before actually doing so using open(2) creates a security hole, because the user might exploit the short time interval between checking and opening the file to manipulate it. For this reason, the use of this system call should be avoided. (In the example just described, a safer alternative would be to temporarily switch the process’s effective user ID to the real ID and then call open(2).)
access() always dereferences symbolic links. If you need to check the permissions on a symbolic link, use faccessat() with the flag AT_SYMLINK_NOFOLLOW.
These calls return an error if any of the access types in mode is denied, even if some of the other access types in mode are permitted.
A file is accessible only if the permissions on each of the directories in the path prefix of pathname grant search (i.e., execute) access. If any directory is inaccessible, then the access() call fails, regardless of the permissions on the file itself.
Only access bits are checked, not the file type or contents. Therefore, if a directory is found to be writable, it probably means that files can be created in the directory, and not that the directory can be written as a file. Similarly, a DOS file may be reported as executable, but the execve(2) call will still fail.
These calls may not work correctly on NFSv2 filesystems with UID mapping enabled, because UID mapping is done on the server and hidden from the client, which checks permissions. (NFS versions 3 and higher perform the check on the server.) Similar problems can occur with FUSE mounts.
BUGS
Because the Linux kernel’s faccessat() system call does not support a flags argument, the glibc faccessat() wrapper function provided in glibc 2.32 and earlier emulates the required functionality using a combination of the faccessat() system call and fstatat(2). However, this emulation does not take ACLs into account. Starting with glibc 2.33, the wrapper function avoids this bug by making use of the faccessat2() system call where it is provided by the underlying kernel.
In Linux 2.4 (and earlier) there is some strangeness in the handling of X_OK tests for superuser. If all categories of execute permission are disabled for a nondirectory file, then the only access() test that returns -1 is when mode is specified as just X_OK; if R_OK or W_OK is also specified in mode, then access() returns 0 for such files. Early Linux 2.6 (up to and including Linux 2.6.3) also behaved in the same way as Linux 2.4.
Before Linux 2.6.20, these calls ignored the effect of the MS_NOEXEC flag if it was used to mount(2) the underlying filesystem. Since Linux 2.6.20, the MS_NOEXEC flag is honored.
SEE ALSO
chmod(2), chown(2), open(2), setgid(2), setuid(2), stat(2), euidaccess(3), credentials(7), path_resolution(7), symlink(7)
100 - Linux cli command unimplemented
NAME 🖥️ unimplemented 🖥️
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
101 - Linux cli command recvmsg
NAME 🖥️ recvmsg 🖥️
receive a message from a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
ssize_t recv(int sockfd, void buf[.len], size_t len,
int flags);
ssize_t recvfrom(int sockfd, void buf[restrict .len], size_t len,
int flags,
struct sockaddr *_Nullable restrict src_addr,
socklen_t *_Nullable restrict addrlen);
ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);
DESCRIPTION
The recv(), recvfrom(), and recvmsg() calls are used to receive messages from a socket. They may be used to receive data on both connectionless and connection-oriented sockets. This page first describes common features of all three system calls, and then describes the differences between the calls.
The only difference between recv() and read(2) is the presence of flags. With a zero flags argument, recv() is generally equivalent to read(2) (but see NOTES). Also, the following call
recv(sockfd, buf, len, flags);
is equivalent to
recvfrom(sockfd, buf, len, flags, NULL, NULL);
All three calls return the length of the message on successful completion. If a message is too long to fit in the supplied buffer, excess bytes may be discarded depending on the type of socket the message is received from.
If no messages are available at the socket, the receive calls wait for a message to arrive, unless the socket is nonblocking (see fcntl(2)), in which case the value -1 is returned and errno is set to EAGAIN or EWOULDBLOCK. The receive calls normally return any data available, up to the requested amount, rather than waiting for receipt of the full amount requested.
An application can use select(2), poll(2), or epoll(7) to determine when more data arrives on a socket.
The flags argument
The flags argument is formed by ORing one or more of the following values:
MSG_CMSG_CLOEXEC (recvmsg() only; since Linux 2.6.23)
Set the close-on-exec flag for the file descriptor received via a UNIX domain file descriptor using the SCM_RIGHTS operation (described in unix(7)). This flag is useful for the same reasons as the O_CLOEXEC flag of open(2).
MSG_DONTWAIT (since Linux 2.2)
Enables nonblocking operation; if the operation would block, the call fails with the error EAGAIN or EWOULDBLOCK. This provides similar behavior to setting the O_NONBLOCK flag (via the fcntl(2) F_SETFL operation), but differs in that MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a setting on the open file description (see open(2)), which will affect all threads in the calling process as well as other processes that hold file descriptors referring to the same open file description.
MSG_ERRQUEUE (since Linux 2.2)
This flag specifies that queued errors should be received from the socket error queue. The error is passed in an ancillary message with a type dependent on the protocol (for IPv4, IP_RECVERR). The user should supply a buffer of sufficient size. See cmsg(3) and ip(7) for more information. The payload of the original packet that caused the error is passed as normal data via msg_iov. The original destination address of the datagram that caused the error is supplied via msg_name.
The error is supplied in a sock_extended_err structure:
#define SO_EE_ORIGIN_NONE    0
#define SO_EE_ORIGIN_LOCAL   1
#define SO_EE_ORIGIN_ICMP    2
#define SO_EE_ORIGIN_ICMP6   3

struct sock_extended_err
{
    uint32_t ee_errno;   /* Error number */
    uint8_t  ee_origin;  /* Where the error originated */
    uint8_t  ee_type;    /* Type */
    uint8_t  ee_code;    /* Code */
    uint8_t  ee_pad;     /* Padding */
    uint32_t ee_info;    /* Additional information */
    uint32_t ee_data;    /* Other data */
    /* More data may follow */
};

struct sockaddr *SO_EE_OFFENDER(struct sock_extended_err *);
ee_errno contains the errno number of the queued error. ee_origin is the origin code of where the error originated. The other fields are protocol-specific. The macro SO_EE_OFFENDER returns a pointer to the address of the network object where the error originated from given a pointer to the ancillary message. If this address is not known, the sa_family member of the sockaddr contains AF_UNSPEC and the other fields of the sockaddr are undefined. The payload of the packet that caused the error is passed as normal data.
For local errors, no address is passed (this can be checked with the cmsg_len member of the cmsghdr). For error receives, the MSG_ERRQUEUE flag is set in the msghdr. After an error has been passed, the pending socket error is regenerated based on the next queued error and will be passed on the next socket operation.
MSG_OOB
This flag requests receipt of out-of-band data that would not be received in the normal data stream. Some protocols place expedited data at the head of the normal data queue, and thus this flag cannot be used with such protocols.
MSG_PEEK
This flag causes the receive operation to return data from the beginning of the receive queue without removing that data from the queue. Thus, a subsequent receive call will return the same data.
MSG_TRUNC (since Linux 2.2)
For raw (AF_PACKET), Internet datagram (since Linux 2.4.27/2.6.8), netlink (since Linux 2.6.22), and UNIX datagram as well as sequenced-packet (since Linux 3.4) sockets: return the real length of the packet or datagram, even when it was longer than the passed buffer.
For use with Internet stream sockets, see tcp(7).
MSG_WAITALL (since Linux 2.2)
This flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned. This flag has no effect for datagram sockets.
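As an illustration of combining flags, the sketch below ORs MSG_PEEK with MSG_TRUNC to learn the size of the next datagram without consuming it. The helper name next_datagram_length() is hypothetical, and the approach relies on the socket type supporting MSG_TRUNC as listed above; on a blocking socket the call waits until a datagram is pending.

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: report the full length of the next pending
   datagram without removing it from the receive queue.
   MSG_PEEK leaves the datagram queued; MSG_TRUNC makes recv() return
   the real datagram length even though the buffer holds only 1 byte.
   Returns -1 on error with errno set. */
ssize_t
next_datagram_length(int sockfd)
{
    char probe;

    return recv(sockfd, &probe, sizeof(probe), MSG_PEEK | MSG_TRUNC);
}
```

The returned length can then be used to allocate a buffer large enough for a second, consuming recv() call.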
recvfrom()
recvfrom() places the received message into the buffer buf. The caller must specify the size of the buffer in len.
If src_addr is not NULL, and the underlying protocol provides the source address of the message, that source address is placed in the buffer pointed to by src_addr. In this case, addrlen is a value-result argument. Before the call, it should be initialized to the size of the buffer associated with src_addr. Upon return, addrlen is updated to contain the actual size of the source address. The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
If the caller is not interested in the source address, src_addr and addrlen should be specified as NULL.
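The value-result use of addrlen can be sketched as below. The helper recv_with_source() and its truncated output parameter are assumptions made for this example; struct sockaddr_storage is used because it is large enough for any address family, so truncation would only be expected with a smaller buffer.

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: receive one message and report whether the
   sender's address fit into *addr.  On success, *addrlen holds the
   actual address size and *truncated is nonzero if that size exceeds
   the buffer that was supplied. */
ssize_t
recv_with_source(int sockfd, void *buf, size_t len,
                 struct sockaddr_storage *addr, socklen_t *addrlen,
                 int *truncated)
{
    socklen_t supplied = sizeof(*addr);

    *addrlen = supplied;                /* In: size of address buffer */
    ssize_t n = recvfrom(sockfd, buf, len, 0,
                         (struct sockaddr *) addr, addrlen);
    if (n != -1)                        /* Out: actual address size */
        *truncated = (*addrlen > supplied);
    return n;
}
```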
recv()
The recv() call is normally used only on a connected socket (see connect(2)). It is equivalent to the call:
recvfrom(sockfd, buf, len, flags, NULL, NULL);
recvmsg()
The recvmsg() call uses a msghdr structure to minimize the number of directly supplied arguments. This structure is defined as follows in <sys/socket.h>:
struct msghdr {
    void         *msg_name;       /* Optional address */
    socklen_t     msg_namelen;    /* Size of address */
    struct iovec *msg_iov;        /* Scatter/gather array */
    size_t        msg_iovlen;     /* # elements in msg_iov */
    void         *msg_control;    /* Ancillary data, see below */
    size_t        msg_controllen; /* Ancillary data buffer len */
    int           msg_flags;      /* Flags on received message */
};
The msg_name field points to a caller-allocated buffer that is used to return the source address if the socket is unconnected. The caller should set msg_namelen to the size of this buffer before this call; upon return from a successful call, msg_namelen will contain the length of the returned address. If the application does not need to know the source address, msg_name can be specified as NULL.
The fields msg_iov and msg_iovlen describe scatter-gather locations, as discussed in readv(2).
The field msg_control, which has length msg_controllen, points to a buffer for other protocol control-related messages or miscellaneous ancillary data. When recvmsg() is called, msg_controllen should contain the length of the available buffer in msg_control; upon return from a successful call it will contain the length of the control message sequence.
The messages are of the form:
struct cmsghdr {
    size_t cmsg_len;    /* Data byte count, including header
                           (type is socklen_t in POSIX) */
    int    cmsg_level;  /* Originating protocol */
    int    cmsg_type;   /* Protocol-specific type */
    /* followed by
       unsigned char cmsg_data[]; */
};
Ancillary data should be accessed only by the macros defined in cmsg(3).
As an example, Linux uses this ancillary data mechanism to pass extended errors, IP options, or file descriptors over UNIX domain sockets. For further information on the use of ancillary data in various socket domains, see unix(7) and ip(7).
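The following sketch shows one way to receive a file descriptor passed as SCM_RIGHTS ancillary data, using only the cmsg(3) macros to walk the control buffer. The helper name recv_fd() and the convention of one data byte plus one descriptor per message are assumptions for illustration; a real protocol may carry several descriptors or other control messages.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: receive one byte of ordinary data plus one file
   descriptor sent as SCM_RIGHTS ancillary data over a UNIX domain
   socket.  Returns the received descriptor, or -1 on failure. */
int
recv_fd(int sockfd)
{
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = sizeof(data) };

    /* The union ensures the buffer is suitably aligned for cmsghdr. */
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;

    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl.buf,
        .msg_controllen = sizeof(ctrl.buf),
    };

    if (recvmsg(sockfd, &msg, MSG_CMSG_CLOEXEC) <= 0)
        return -1;
    if (msg.msg_flags & MSG_CTRUNC)
        return -1;                  /* Control data did not fit */

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS
            && c->cmsg_len == CMSG_LEN(sizeof(int))) {
            int fd;
            memcpy(&fd, CMSG_DATA(c), sizeof(fd));
            return fd;
        }
    }
    return -1;                      /* No descriptor was attached */
}
```

MSG_CMSG_CLOEXEC is used here so that the received descriptor does not leak across a later execve(2).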
The msg_flags field in the msghdr is set on return of recvmsg(). It can contain several flags:
MSG_EOR
indicates end-of-record; the data returned completed a record (generally used with sockets of type SOCK_SEQPACKET).
MSG_TRUNC
indicates that the trailing portion of a datagram was discarded because the datagram was larger than the buffer supplied.
MSG_CTRUNC
indicates that some control data was discarded due to lack of space in the buffer for ancillary data.
MSG_OOB
is returned to indicate that expedited or out-of-band data was received.
MSG_ERRQUEUE
indicates that no data was received, only an extended error from the socket error queue.
MSG_CMSG_CLOEXEC (since Linux 2.6.23)
indicates that MSG_CMSG_CLOEXEC was specified in the flags argument of recvmsg().
RETURN VALUE
These calls return the number of bytes received, or -1 if an error occurred. In the event of an error, errno is set to indicate the error.
When a stream socket peer has performed an orderly shutdown, the return value will be 0 (the traditional “end-of-file” return).
Datagram sockets in various domains (e.g., the UNIX and Internet domains) permit zero-length datagrams. When such a datagram is received, the return value is 0.
The value 0 may also be returned if the requested number of bytes to receive from a stream socket was 0.
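For stream sockets, the return values described above are often handled in a loop. The sketch below is one possible shape, with the hypothetical name recv_exact(); it assumes a blocking stream socket and a nonzero len, so that a return of 0 from recv() can only mean an orderly shutdown by the peer.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper for stream sockets: keep calling recv() until
   len bytes have arrived.  Returns len on success, a short count if
   the peer shut down first, or -1 on error with errno set. */
ssize_t
recv_exact(int sockfd, void *buf, size_t len)
{
    size_t got = 0;

    while (got < len) {
        ssize_t n = recv(sockfd, (char *) buf + got, len - got, 0);
        if (n == 0)
            break;              /* Orderly shutdown: end-of-file */
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* Interrupted before any data: retry */
            return -1;
        }
        got += (size_t) n;
    }
    return (ssize_t) got;
}
```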
ERRORS
These are some standard errors generated by the socket layer. Additional errors may be generated and returned from the underlying protocol modules; see their manual pages.
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and the receive operation would block, or a receive timeout had been set and the timeout expired before data was received. POSIX.1 allows either error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities.
EBADF
The argument sockfd is an invalid file descriptor.
ECONNREFUSED
A remote host refused to allow the network connection (typically because it is not running the requested service).
EFAULT
The receive buffer pointer(s) point outside the process’s address space.
EINTR
The receive was interrupted by delivery of a signal before any data was available; see signal(7).
EINVAL
Invalid argument passed.
ENOMEM
Could not allocate memory for recvmsg().
ENOTCONN
The socket is associated with a connection-oriented protocol and has not been connected (see connect(2) and accept(2)).
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
VERSIONS
According to POSIX.1, the msg_controllen field of the msghdr structure should be typed as socklen_t, and the msg_iovlen field should be typed as int, but glibc currently types both as size_t.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
POSIX.1 describes only the MSG_OOB, MSG_PEEK, and MSG_WAITALL flags.
NOTES
If a zero-length datagram is pending, read(2) and recv() with a flags argument of zero provide different behavior. In this circumstance, read(2) has no effect (the datagram remains pending), while recv() consumes the pending datagram.
See recvmmsg(2) for information about a Linux-specific system call that can be used to receive multiple datagrams in a single call.
EXAMPLES
An example of the use of recvfrom() is shown in getaddrinfo(3).
SEE ALSO
fcntl(2), getsockopt(2), read(2), recvmmsg(2), select(2), shutdown(2), socket(2), cmsg(3), sockatmark(3), ip(7), ipv6(7), socket(7), tcp(7), udp(7), unix(7)
102 - Linux cli command adjtimex
NAME 🖥️ adjtimex 🖥️
tune kernel clock
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/timex.h>
int adjtimex(struct timex *buf);
int clock_adjtime(clockid_t clk_id, struct timex *buf);
int ntp_adjtime(struct timex *buf);
DESCRIPTION
Linux uses David L. Mills’ clock adjustment algorithm (see RFC 5905). The system call adjtimex() reads and optionally sets adjustment parameters for this algorithm. It takes a pointer to a timex structure, updates kernel parameters from (selected) field values, and returns the same structure updated with the current kernel values. This structure is declared as follows:
struct timex {
    int  modes;          /* Mode selector */
    long offset;         /* Time offset; nanoseconds, if STA_NANO
                            status flag is set, otherwise
                            microseconds */
    long freq;           /* Frequency offset; see NOTES for units */
    long maxerror;       /* Maximum error (microseconds) */
    long esterror;       /* Estimated error (microseconds) */
    int  status;         /* Clock command/status */
    long constant;       /* PLL (phase-locked loop) time constant */
    long precision;      /* Clock precision
                            (microseconds, read-only) */
    long tolerance;      /* Clock frequency tolerance (read-only);
                            see NOTES for units */
    struct timeval time;
                         /* Current time (read-only, except for
                            ADJ_SETOFFSET); upon return, time.tv_usec
                            contains nanoseconds, if STA_NANO status
                            flag is set, otherwise microseconds */
    long tick;           /* Microseconds between clock ticks */
    long ppsfreq;        /* PPS (pulse per second) frequency
                            (read-only); see NOTES for units */
    long jitter;         /* PPS jitter (read-only); nanoseconds, if
                            STA_NANO status flag is set, otherwise
                            microseconds */
    int  shift;          /* PPS interval duration
                            (seconds, read-only) */
    long stabil;         /* PPS stability (read-only);
                            see NOTES for units */
    long jitcnt;         /* PPS count of jitter limit exceeded
                            events (read-only) */
    long calcnt;         /* PPS count of calibration intervals
                            (read-only) */
    long errcnt;         /* PPS count of calibration errors
                            (read-only) */
    long stbcnt;         /* PPS count of stability limit exceeded
                            events (read-only) */
    int  tai;            /* TAI offset, as set by previous ADJ_TAI
                            operation (seconds, read-only,
                            since Linux 2.6.26) */
    /* Further padding bytes to allow for future expansion */
};
The modes field determines which parameters, if any, to set. (As described later in this page, the constants used for ntp_adjtime() are equivalent but differently named.) It is a bit mask containing a bitwise OR combination of zero or more of the following bits:
ADJ_OFFSET
Set time offset from buf.offset. Since Linux 2.6.26, the supplied value is clamped to the range (-0.5s, +0.5s). In older kernels, an EINVAL error occurs if the supplied value is out of range.
ADJ_FREQUENCY
Set frequency offset from buf.freq. Since Linux 2.6.26, the supplied value is clamped to the range (-32768000, +32768000). In older kernels, an EINVAL error occurs if the supplied value is out of range.
ADJ_MAXERROR
Set maximum time error from buf.maxerror.
ADJ_ESTERROR
Set estimated time error from buf.esterror.
ADJ_STATUS
Set clock status bits from buf.status. A description of these bits is provided below.
ADJ_TIMECONST
Set PLL time constant from buf.constant. If the STA_NANO status flag (see below) is clear, the kernel adds 4 to this value.
ADJ_SETOFFSET (since Linux 2.6.39)
Add buf.time to the current time. If buf.modes includes the ADJ_NANO flag, then buf.time.tv_usec is interpreted as a nanosecond value; otherwise it is interpreted as microseconds.
The value of buf.time is the sum of its two fields, but the field buf.time.tv_usec must always be nonnegative. The following example shows how to normalize a timeval with nanosecond resolution.
while (buf.time.tv_usec < 0) {
    buf.time.tv_sec  -= 1;
    buf.time.tv_usec += 1000000000;
}
ADJ_MICRO (since Linux 2.6.26)
Select microsecond resolution.
ADJ_NANO (since Linux 2.6.26)
Select nanosecond resolution. Only one of ADJ_MICRO and ADJ_NANO should be specified.
ADJ_TAI (since Linux 2.6.26)
Set TAI (International Atomic Time) offset from buf.constant.
ADJ_TAI should not be used in conjunction with ADJ_TIMECONST, since the latter mode also employs the buf.constant field.
For a complete explanation of TAI and the difference between TAI and UTC, see BIPM.
ADJ_TICK
Set tick value from buf.tick.
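For the ADJ_SETOFFSET mode listed above, a signed offset has to be split so that the tv_usec field is nonnegative. The sketch below does this for nanosecond resolution; the helper name offset_from_ns() is hypothetical, and it assumes the caller will also set ADJ_NANO so that tv_usec is read as nanoseconds.

```c
#include <sys/time.h>

/* Hypothetical helper: split a signed offset in nanoseconds into a
   timeval whose tv_usec field carries nanoseconds in the range
   [0, 999999999], suitable for ADJ_SETOFFSET combined with ADJ_NANO. */
struct timeval
offset_from_ns(long long ns)
{
    struct timeval tv;

    tv.tv_sec  = ns / 1000000000LL;     /* Truncates toward zero */
    tv.tv_usec = ns % 1000000000LL;     /* May be negative here */
    if (tv.tv_usec < 0) {               /* Borrow one second */
        tv.tv_sec  -= 1;
        tv.tv_usec += 1000000000;
    }
    return tv;
}
```

For example, an offset of -1 nanosecond becomes tv_sec = -1 and tv_usec = 999999999, whose sum is again -1 ns.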
Alternatively, modes can be specified as either of the following (multibit mask) values, in which case other bits should not be specified in modes:
ADJ_OFFSET_SINGLESHOT
Old-fashioned adjtime(3): (gradually) adjust time by value specified in buf.offset, which specifies an adjustment in microseconds.
ADJ_OFFSET_SS_READ (functional since Linux 2.6.28)
Return (in buf.offset) the remaining amount of time to be adjusted after an earlier ADJ_OFFSET_SINGLESHOT operation. This feature was added in Linux 2.6.24, but did not work correctly until Linux 2.6.28.
Ordinary users are restricted to a value of either 0 or ADJ_OFFSET_SS_READ for modes. Only the superuser may set any parameters.
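Since a modes value of 0 sets nothing, an unprivileged process can use it to read the current parameters. A minimal sketch, with the hypothetical helper name read_clock_params():

```c
#include <string.h>
#include <sys/timex.h>

/* Hypothetical helper: query the kernel clock parameters without
   changing anything.  With modes set to 0 no privilege is required.
   Returns the clock state (TIME_OK, TIME_ERROR, ...) or -1 on error
   with errno set. */
int
read_clock_params(struct timex *out)
{
    memset(out, 0, sizeof(*out));
    out->modes = 0;             /* Read-only: set no parameters */
    return adjtimex(out);
}
```

After a successful call, fields such as out->offset, out->freq, and out->status hold the current kernel values.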
The buf.status field is a bit mask that is used to set and/or retrieve status bits associated with the NTP implementation. Some bits in the mask are both readable and settable, while others are read-only.
STA_PLL (read-write)
Enable phase-locked loop (PLL) updates via ADJ_OFFSET.
STA_PPSFREQ (read-write)
Enable PPS (pulse-per-second) frequency discipline.
STA_PPSTIME (read-write)
Enable PPS time discipline.
STA_FLL (read-write)
Select frequency-locked loop (FLL) mode.
STA_INS (read-write)
Insert a leap second after the last second of the UTC day, thus extending the last minute of the day by one second. Leap-second insertion will occur each day, so long as this flag remains set.
STA_DEL (read-write)
Delete a leap second at the last second of the UTC day. Leap second deletion will occur each day, so long as this flag remains set.
STA_UNSYNC (read-write)
Clock unsynchronized.
STA_FREQHOLD (read-write)
Hold frequency. Normally adjustments made via ADJ_OFFSET result in dampened frequency adjustments also being made. So a single call corrects the current offset, but as offsets in the same direction are made repeatedly, the small frequency adjustments will accumulate to fix the long-term skew.
This flag prevents the small frequency adjustment from being made when correcting for an ADJ_OFFSET value.
STA_PPSSIGNAL (read-only)
A valid PPS (pulse-per-second) signal is present.
STA_PPSJITTER (read-only)
PPS signal jitter exceeded.
STA_PPSWANDER (read-only)
PPS signal wander exceeded.
STA_PPSERROR (read-only)
PPS signal calibration error.
STA_CLOCKERR (read-only)
Clock hardware fault.
STA_NANO (read-only; since Linux 2.6.26)
Resolution (0 = microseconds, 1 = nanoseconds). Set via ADJ_NANO, cleared via ADJ_MICRO.
STA_MODE (since Linux 2.6.26)
Mode (0 = Phase Locked Loop, 1 = Frequency Locked Loop).
STA_CLK (read-only; since Linux 2.6.26)
Clock source (0 = A, 1 = B); currently unused.
Attempts to set read-only status bits are silently ignored.
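Because STA_NANO changes the unit of several fields, code that reads a timex structure usually has to test that bit first. A small sketch, using the hypothetical helper offset_in_ns():

```c
#include <sys/timex.h>

/* Hypothetical helper: return the offset field in nanoseconds,
   whichever resolution the kernel reported in the status field. */
long long
offset_in_ns(const struct timex *t)
{
    if (t->status & STA_NANO)
        return t->offset;                   /* Already nanoseconds */
    return (long long) t->offset * 1000;    /* Microseconds to ns */
}
```

The same test applies to the jitter field and to time.tv_usec.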
clock_adjtime()
The clock_adjtime() system call (added in Linux 2.6.39) behaves like adjtimex() but takes an additional clk_id argument to specify the particular clock on which to act.
ntp_adjtime()
The ntp_adjtime() library function (described in the NTP “Kernel Application Program API”, KAPI) is a more portable interface for performing the same task as adjtimex(). Other than the following points, it is identical to adjtimex():
The constants used in modes are prefixed with “MOD_” rather than “ADJ_”, and have the same suffixes (thus, MOD_OFFSET, MOD_FREQUENCY, and so on), other than the exceptions noted in the following points.
MOD_CLKA is the synonym for ADJ_OFFSET_SINGLESHOT.
MOD_CLKB is the synonym for ADJ_TICK.
There is no synonym for ADJ_OFFSET_SS_READ, which is not described in the KAPI.
RETURN VALUE
On success, adjtimex() and ntp_adjtime() return the clock state; that is, one of the following values:
TIME_OK
Clock synchronized, no leap second adjustment pending.
TIME_INS
Indicates that a leap second will be added at the end of the UTC day.
TIME_DEL
Indicates that a leap second will be deleted at the end of the UTC day.
TIME_OOP
Insertion of a leap second is in progress.
TIME_WAIT
A leap-second insertion or deletion has been completed. This value will be returned until the next ADJ_STATUS operation clears the STA_INS and STA_DEL flags.
TIME_ERROR
The system clock is not synchronized to a reliable server. This value is returned when any of the following holds true:
Either STA_UNSYNC or STA_CLOCKERR is set.
STA_PPSSIGNAL is clear and either STA_PPSFREQ or STA_PPSTIME is set.
STA_PPSTIME and STA_PPSJITTER are both set.
STA_PPSFREQ is set and either STA_PPSWANDER or STA_PPSJITTER is set.
The symbolic name TIME_BAD is a synonym for TIME_ERROR, provided for backward compatibility.
Note that starting with Linux 3.4, the call operates asynchronously and the return value usually will not reflect a state change caused by the call itself.
On failure, these calls return -1 and set errno to indicate the error.
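For logging or diagnostics, the clock state returned on success can be mapped to its symbolic name. The helper clock_state_name() below is a hypothetical convenience, not part of any library:

```c
#include <sys/timex.h>

/* Hypothetical helper: map a clock state returned by adjtimex() or
   ntp_adjtime() to its symbolic name. */
const char *
clock_state_name(int state)
{
    switch (state) {
    case TIME_OK:    return "TIME_OK";
    case TIME_INS:   return "TIME_INS";
    case TIME_DEL:   return "TIME_DEL";
    case TIME_OOP:   return "TIME_OOP";
    case TIME_WAIT:  return "TIME_WAIT";
    case TIME_ERROR: return "TIME_ERROR";
    default:         return "unknown";   /* Includes the -1 error return */
    }
}
```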
ERRORS
EFAULT
buf does not point to writable memory.
EINVAL (before Linux 2.6.26)
An attempt was made to set buf.freq to a value outside the range (-33554432, +33554432).
EINVAL (before Linux 2.6.26)
An attempt was made to set buf.offset to a value outside the permitted range. Before Linux 2.0, the permitted range was (-131072, +131072). From Linux 2.0 onwards, the permitted range was (-512000, +512000).
EINVAL
An attempt was made to set buf.status to a value other than those listed above.
EINVAL
The clk_id given to clock_adjtime() is invalid for one of two reasons. Either the System-V style hard-coded positive clock ID value is out of range, or the dynamic clk_id does not refer to a valid instance of a clock object. See clock_gettime(2) for a discussion of dynamic clocks.
EINVAL
An attempt was made to set buf.tick to a value outside the range 900000/HZ to 1100000/HZ, where HZ is the system timer interrupt frequency.
ENODEV
The hot-pluggable device (like USB for example) represented by a dynamic clk_id has disappeared after its character device was opened. See clock_gettime(2) for a discussion of dynamic clocks.
EOPNOTSUPP
The given clk_id does not support adjustment.
EPERM
buf.modes is neither 0 nor ADJ_OFFSET_SS_READ, and the caller does not have sufficient privilege. Under Linux, the CAP_SYS_TIME capability is required.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
ntp_adjtime() | Thread safety | MT-Safe |
STANDARDS
adjtimex()
clock_adjtime()
Linux.
The preferred API for the NTP daemon is ntp_adjtime().
NOTES
In struct timex, freq, ppsfreq, and stabil are ppm (parts per million) with a 16-bit fractional part, which means that a value of 1 in one of those fields actually means 2^-16 ppm, and 2^16=65536 is 1 ppm. This is the case for both input values (in the case of freq) and output values.
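The fixed-point scaling described above amounts to a division or multiplication by 65536. A short sketch, with hypothetical helper names:

```c
/* Hypothetical helpers: convert between the kernel's scaled frequency
   representation (ppm with a 16-bit fractional part) and plain ppm. */
double
freq_to_ppm(long freq)
{
    return (double) freq / 65536.0;     /* 65536 units == 1 ppm */
}

long
ppm_to_freq(double ppm)
{
    return (long) (ppm * 65536.0);      /* Truncates toward zero */
}
```

With these definitions, a freq value of 32768 corresponds to 0.5 ppm, and 2 ppm corresponds to a freq value of 131072.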
The leap-second processing triggered by STA_INS and STA_DEL is done by the kernel in timer context. Thus, it will take one tick into the second for the leap second to be inserted or deleted.
SEE ALSO
clock_gettime(2), clock_settime(2), settimeofday(2), adjtime(3), ntp_gettime(3), capabilities(7), time(7), adjtimex(8), hwclock(8)
NTP “Kernel Application Program Interface”
103 - Linux cli command process_vm_readv
NAME 🖥️ process_vm_readv 🖥️
transfer data between process address spaces
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/uio.h>
ssize_t process_vm_readv(pid_t pid,
const struct iovec *local_iov,
unsigned long liovcnt,
const struct iovec *remote_iov,
unsigned long riovcnt,
unsigned long flags);
ssize_t process_vm_writev(pid_t pid,
const struct iovec *local_iov,
unsigned long liovcnt,
const struct iovec *remote_iov,
unsigned long riovcnt,
unsigned long flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
process_vm_readv(), process_vm_writev():
_GNU_SOURCE
DESCRIPTION
These system calls transfer data between the address space of the calling process (“the local process”) and the process identified by pid (“the remote process”). The data moves directly between the address spaces of the two processes, without passing through kernel space.
The process_vm_readv() system call transfers data from the remote process to the local process. The data to be transferred is identified by remote_iov and riovcnt: remote_iov is a pointer to an array describing address ranges in the process pid, and riovcnt specifies the number of elements in remote_iov. The data is transferred to the locations specified by local_iov and liovcnt: local_iov is a pointer to an array describing address ranges in the calling process, and liovcnt specifies the number of elements in local_iov.
The process_vm_writev() system call is the converse of process_vm_readv(): it transfers data from the local process to the remote process. Other than the direction of the transfer, the arguments liovcnt, local_iov, riovcnt, and remote_iov have the same meaning as for process_vm_readv().
The local_iov and remote_iov arguments point to an array of iovec structures, described in iovec(3type).
Buffers are processed in array order. This means that process_vm_readv() completely fills local_iov[0] before proceeding to local_iov[1], and so on. Likewise, remote_iov[0] is completely read before proceeding to remote_iov[1], and so on.
Similarly, process_vm_writev() writes out the entire contents of local_iov[0] before proceeding to local_iov[1], and it completely fills remote_iov[0] before proceeding to remote_iov[1].
The lengths of remote_iov[i].iov_len and local_iov[i].iov_len do not have to be the same. Thus, it is possible to split a single local buffer into multiple remote buffers, or vice versa.
The flags argument is currently unused and must be set to 0.
The values specified in the liovcnt and riovcnt arguments must be less than or equal to IOV_MAX (defined in <limits.h> or accessible via the call sysconf(_SC_IOV_MAX)).
The count arguments and local_iov are checked before doing any transfers. If the counts are too big, or local_iov is invalid, or the addresses refer to regions that are inaccessible to the local process, none of the vectors will be processed and an error will be returned immediately.
Note, however, that these system calls do not check the memory regions in the remote process until just before doing the read/write. Consequently, a partial read/write (see RETURN VALUE) may result if one of the remote_iov elements points to an invalid memory region in the remote process. No further reads/writes will be attempted beyond that point. Keep this in mind when attempting to read data of unknown length (such as C strings that are null-terminated) from a remote process, by avoiding spanning memory pages (typically 4 KiB) in a single remote iovec element. (Instead, split the remote read into two remote_iov elements and have them merge back into a single write local_iov entry. The first read entry goes up to the page boundary, while the second starts on the next page boundary.)
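The page-boundary advice above can be sketched as a helper that builds the remote_iov array so that no element spans two pages. The name split_at_pages() and its interface are assumptions for this example; it generalizes the two-element case described in the text to ranges covering any number of pages.

```c
#include <stdint.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: describe the remote range [addr, addr + len)
   with iovec elements that never cross a page boundary.  Fills at most
   max entries of out; returns the number of entries used, or -1 if max
   is too small. */
int
split_at_pages(uintptr_t addr, size_t len, struct iovec *out, int max)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    int n = 0;

    while (len > 0) {
        size_t room = page - (addr % page);  /* Bytes left in this page */
        size_t chunk = len < room ? len : room;

        if (n == max)
            return -1;
        out[n].iov_base = (void *) addr;
        out[n].iov_len = chunk;
        n++;
        addr += chunk;
        len -= chunk;
    }
    return n;
}
```

The resulting array can be passed as remote_iov together with a single local_iov element of len bytes; if a later page is unmapped, the call then returns the bytes from the pages before it rather than failing outright.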
Permission to read from or write to another process is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see ptrace(2).
RETURN VALUE
On success, process_vm_readv() returns the number of bytes read and process_vm_writev() returns the number of bytes written. This return value may be less than the total number of requested bytes, if a partial read/write occurred. (Partial transfers apply at the granularity of iovec elements. These system calls won’t perform a partial transfer that splits a single iovec element.) The caller should check the return value to determine whether a partial read/write occurred.
On error, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
The memory described by local_iov is outside the caller’s accessible address space.
EFAULT
The memory described by remote_iov is outside the accessible address space of the process pid.
EINVAL
The sum of the iov_len values of either local_iov or remote_iov overflows a ssize_t value.
EINVAL
flags is not 0.
EINVAL
liovcnt or riovcnt is too large.
ENOMEM
Could not allocate memory for internal copies of the iovec structures.
EPERM
The caller does not have permission to access the address space of the process pid.
ESRCH
No process with ID pid exists.
STANDARDS
Linux.
HISTORY
Linux 3.2, glibc 2.15.
NOTES
The data transfers performed by process_vm_readv() and process_vm_writev() are not guaranteed to be atomic in any way.
These system calls were designed to permit fast message passing by allowing messages to be exchanged with a single copy operation (rather than the double copy that would be required when using, for example, shared memory or pipes).
EXAMPLES
The following code sample demonstrates the use of process_vm_readv(). It reads 20 bytes at the address 0x10000 from the process with PID 10 and writes the first 10 bytes into buf1 and the second 10 bytes into buf2.
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>

int
main(void)
{
    char          buf1[10];
    char          buf2[10];
    pid_t         pid = 10;    /* PID of remote process */
    ssize_t       nread;
    struct iovec  local[2];
    struct iovec  remote[1];

    local[0].iov_base = buf1;
    local[0].iov_len = 10;
    local[1].iov_base = buf2;
    local[1].iov_len = 10;
    remote[0].iov_base = (void *) 0x10000;
    remote[0].iov_len = 20;

    nread = process_vm_readv(pid, local, 2, remote, 1, 0);
    if (nread != 20)
        exit(EXIT_FAILURE);

    exit(EXIT_SUCCESS);
}
SEE ALSO
readv(2), writev(2)
104 - Linux cli command sigaction
NAME 🖥️ sigaction 🖥️
examine and change a signal action
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
int sigaction(int signum,
const struct sigaction *_Nullable restrict act,
struct sigaction *_Nullable restrict oldact);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigaction():
_POSIX_C_SOURCE
siginfo_t:
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
The sigaction() system call is used to change the action taken by a process on receipt of a specific signal. (See signal(7) for an overview of signals.)
signum specifies the signal and can be any valid signal except SIGKILL and SIGSTOP.
If act is non-NULL, the new action for signal signum is installed from act. If oldact is non-NULL, the previous action is saved in oldact.
The sigaction structure is defined as something like:
struct sigaction {
    void     (*sa_handler)(int);
    void     (*sa_sigaction)(int, siginfo_t *, void *);
    sigset_t   sa_mask;
    int        sa_flags;
    void     (*sa_restorer)(void);
};
On some architectures a union is involved: do not assign to both sa_handler and sa_sigaction.
The sa_restorer field is not intended for application use. (POSIX does not specify a sa_restorer field.) Some further details of the purpose of this field can be found in sigreturn(2).
sa_handler specifies the action to be associated with signum and can be one of the following:
SIG_DFL for the default action.
SIG_IGN to ignore this signal.
A pointer to a signal handling function. This function receives the signal number as its only argument.
If SA_SIGINFO is specified in sa_flags, then sa_sigaction (instead of sa_handler) specifies the signal-handling function for signum. This function receives three arguments, as described below.
sa_mask specifies a mask of signals which should be blocked (i.e., added to the signal mask of the thread in which the signal handler is invoked) during execution of the signal handler. In addition, the signal which triggered the handler will be blocked, unless the SA_NODEFER flag is used.
sa_flags specifies a set of flags which modify the behavior of the signal. It is formed by the bitwise OR of zero or more of the following:
SA_NOCLDSTOP
If signum is SIGCHLD, do not receive notification when child processes stop (i.e., when they receive one of SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU) or resume (i.e., they receive SIGCONT) (see wait(2)). This flag is meaningful only when establishing a handler for SIGCHLD.
SA_NOCLDWAIT (since Linux 2.6)
If signum is SIGCHLD, do not transform children into zombies when they terminate. See also waitpid(2). This flag is meaningful only when establishing a handler for SIGCHLD, or when setting that signal’s disposition to SIG_DFL.
If the SA_NOCLDWAIT flag is set when establishing a handler for SIGCHLD, POSIX.1 leaves it unspecified whether a SIGCHLD signal is generated when a child process terminates. On Linux, a SIGCHLD signal is generated in this case; on some other implementations, it is not.
SA_NODEFER
Do not add the signal to the thread’s signal mask while the handler is executing, unless the signal is specified in act.sa_mask. Consequently, a further instance of the signal may be delivered to the thread while it is executing the handler. This flag is meaningful only when establishing a signal handler.
SA_NOMASK is an obsolete, nonstandard synonym for this flag.
SA_ONSTACK
Call the signal handler on an alternate signal stack provided by sigaltstack(2). If an alternate stack is not available, the default stack will be used. This flag is meaningful only when establishing a signal handler.
SA_RESETHAND
Restore the signal action to the default upon entry to the signal handler. This flag is meaningful only when establishing a signal handler.
SA_ONESHOT is an obsolete, nonstandard synonym for this flag.
SA_RESTART
Provide behavior compatible with BSD signal semantics by making certain system calls restartable across signals. This flag is meaningful only when establishing a signal handler. See signal(7) for a discussion of system call restarting.
SA_RESTORER
Not intended for application use. This flag is used by C libraries to indicate that the sa_restorer field contains the address of a “signal trampoline”. See sigreturn(2) for more details.
SA_SIGINFO (since Linux 2.2)
The signal handler takes three arguments, not one. In this case, sa_sigaction should be set instead of sa_handler. This flag is meaningful only when establishing a signal handler.
SA_UNSUPPORTED (since Linux 5.11)
Used to dynamically probe for flag bit support.
If an attempt to register a handler succeeds with this flag set in act->sa_flags alongside other flags that are potentially unsupported by the kernel, and an immediately subsequent sigaction() call specifying the same signal number and with a non-NULL oldact argument yields SA_UNSUPPORTED clear in oldact->sa_flags, then oldact->sa_flags may be used as a bitmask describing which of the potentially unsupported flags are, in fact, supported. See the section “Dynamically probing for flag bit support” below for more details.
SA_EXPOSE_TAGBITS (since Linux 5.11)
Normally, when delivering a signal, an architecture-specific set of tag bits are cleared from the si_addr field of siginfo_t. If this flag is set, an architecture-specific subset of the tag bits will be preserved in si_addr.
Programs that need to be compatible with Linux versions older than 5.11 must use SA_UNSUPPORTED to probe for support.
The siginfo_t argument to a SA_SIGINFO handler
When the SA_SIGINFO flag is specified in act.sa_flags, the signal handler address is passed via the act.sa_sigaction field. This handler takes three arguments, as follows:
void
handler(int sig, siginfo_t *info, void *ucontext)
{
...
}
These three arguments are as follows:
sig
The number of the signal that caused invocation of the handler.
info
A pointer to a siginfo_t, which is a structure containing further information about the signal, as described below.
ucontext
This is a pointer to a ucontext_t structure, cast to void *. The structure pointed to by this field contains signal context information that was saved on the user-space stack by the kernel; for details, see sigreturn(2). Further information about the ucontext_t structure can be found in getcontext(3) and signal(7). Commonly, the handler function doesn’t make any use of the third argument.
The siginfo_t data type is a structure with the following fields:
siginfo_t {
int si_signo; /* Signal number */
int si_errno; /* An errno value */
int si_code; /* Signal code */
int si_trapno; /* Trap number that caused
hardware-generated signal
(unused on most architectures) */
pid_t si_pid; /* Sending process ID */
uid_t si_uid; /* Real user ID of sending process */
int si_status; /* Exit value or signal */
clock_t si_utime; /* User time consumed */
clock_t si_stime; /* System time consumed */
union sigval si_value; /* Signal value */
int si_int; /* POSIX.1b signal */
void *si_ptr; /* POSIX.1b signal */
int si_overrun; /* Timer overrun count;
POSIX.1b timers */
int si_timerid; /* Timer ID; POSIX.1b timers */
void *si_addr; /* Memory location which caused fault */
long si_band; /* Band event (was int in
glibc 2.3.2 and earlier) */
int si_fd; /* File descriptor */
short si_addr_lsb; /* Least significant bit of address
(since Linux 2.6.32) */
void *si_lower; /* Lower bound when address violation
occurred (since Linux 3.19) */
void *si_upper; /* Upper bound when address violation
occurred (since Linux 3.19) */
int si_pkey; /* Protection key on PTE that caused
fault (since Linux 4.6) */
void *si_call_addr; /* Address of system call instruction
(since Linux 3.5) */
int si_syscall; /* Number of attempted system call
(since Linux 3.5) */
unsigned int si_arch; /* Architecture of attempted system call
(since Linux 3.5) */
}
si_signo, si_errno and si_code are defined for all signals. (si_errno is generally unused on Linux.) The rest of the struct may be a union, so that one should read only the fields that are meaningful for the given signal:
Signals sent with kill(2) and sigqueue(3) fill in si_pid and si_uid. In addition, signals sent with sigqueue(3) fill in si_int and si_ptr with the values specified by the sender of the signal; see sigqueue(3) for more details.
Signals sent by POSIX.1b timers (since Linux 2.6) fill in si_overrun and si_timerid. The si_timerid field is an internal ID used by the kernel to identify the timer; it is not the same as the timer ID returned by timer_create(2). The si_overrun field is the timer overrun count; this is the same information as is obtained by a call to timer_getoverrun(2). These fields are nonstandard Linux extensions.
Signals sent for message queue notification (see the description of SIGEV_SIGNAL in mq_notify(3)) fill in si_int/si_ptr, with the sigev_value supplied to mq_notify(3); si_pid, with the process ID of the message sender; and si_uid, with the real user ID of the message sender.
SIGCHLD fills in si_pid, si_uid, si_status, si_utime, and si_stime, providing information about the child. The si_pid field is the process ID of the child; si_uid is the child’s real user ID. The si_status field contains the exit status of the child (if si_code is CLD_EXITED), or the signal number that caused the process to change state. The si_utime and si_stime contain the user and system CPU time used by the child process; these fields do not include the times used by waited-for children (unlike getrusage(2) and times(2)). Up to Linux 2.6, and since Linux 2.6.27, these fields report CPU time in units of sysconf(_SC_CLK_TCK). In Linux 2.6 kernels before Linux 2.6.27, a bug meant that these fields reported time in units of the (configurable) system jiffy (see time(7)).
SIGILL, SIGFPE, SIGSEGV, SIGBUS, and SIGTRAP fill in si_addr with the address of the fault. On some architectures, these signals also fill in the si_trapno field.
Some suberrors of SIGBUS, in particular BUS_MCEERR_AO and BUS_MCEERR_AR, also fill in si_addr_lsb. This field indicates the least significant bit of the reported address and therefore the extent of the corruption. For example, if a full page was corrupted, si_addr_lsb contains log2(sysconf(_SC_PAGESIZE)). When SIGTRAP is delivered in response to a ptrace(2) event (PTRACE_EVENT_foo), si_addr is not populated, but si_pid and si_uid are populated with the respective process ID and user ID responsible for delivering the trap. In the case of seccomp(2), the tracee will be shown as delivering the event. BUS_MCEERR_* and si_addr_lsb are Linux-specific extensions.
The SEGV_BNDERR suberror of SIGSEGV populates si_lower and si_upper.
The SEGV_PKUERR suberror of SIGSEGV populates si_pkey.
SIGIO/SIGPOLL (the two names are synonyms on Linux) fills in si_band and si_fd. The si_band event is a bit mask containing the same values as are filled in the revents field by poll(2). The si_fd field indicates the file descriptor for which the I/O event occurred; for further details, see the description of F_SETSIG in fcntl(2).
SIGSYS, generated (since Linux 3.5) when a seccomp filter returns SECCOMP_RET_TRAP, fills in si_call_addr, si_syscall, si_arch, si_errno, and other fields as described in seccomp(2).
The si_code field
The si_code field inside the siginfo_t argument that is passed to a SA_SIGINFO signal handler is a value (not a bit mask) indicating why this signal was sent. For a ptrace(2) event, si_code will contain SIGTRAP and have the ptrace event in the high byte:
(SIGTRAP | PTRACE_EVENT_foo << 8).
For a non-ptrace(2) event, the values that can appear in si_code are described in the remainder of this section. Since glibc 2.20, the definitions of most of these symbols are obtained from <signal.h> by defining feature test macros (before including any header file) as follows:
_XOPEN_SOURCE with the value 500 or greater;
_XOPEN_SOURCE and _XOPEN_SOURCE_EXTENDED; or
_POSIX_C_SOURCE with the value 200809L or greater.
For the TRAP_* constants, the symbol definitions are provided only in the first two cases. Before glibc 2.20, no feature test macros were required to obtain these symbols.
For a regular signal, the following list shows the values which can be placed in si_code for any signal, along with the reason that the signal was generated.
SI_USER
kill(2).
SI_KERNEL
Sent by the kernel.
SI_QUEUE
sigqueue(3).
SI_TIMER
POSIX timer expired.
SI_MESGQ (since Linux 2.6.6)
POSIX message queue state changed; see mq_notify(3).
SI_ASYNCIO
AIO completed.
SI_SIGIO
Queued SIGIO (only up to Linux 2.2; from Linux 2.4 onward SIGIO/SIGPOLL fills in si_code as described below).
SI_TKILL (since Linux 2.4.19)
tkill(2) or tgkill(2).
The following values can be placed in si_code for a SIGILL signal:
ILL_ILLOPC
Illegal opcode.
ILL_ILLOPN
Illegal operand.
ILL_ILLADR
Illegal addressing mode.
ILL_ILLTRP
Illegal trap.
ILL_PRVOPC
Privileged opcode.
ILL_PRVREG
Privileged register.
ILL_COPROC
Coprocessor error.
ILL_BADSTK
Internal stack error.
The following values can be placed in si_code for a SIGFPE signal:
FPE_INTDIV
Integer divide by zero.
FPE_INTOVF
Integer overflow.
FPE_FLTDIV
Floating-point divide by zero.
FPE_FLTOVF
Floating-point overflow.
FPE_FLTUND
Floating-point underflow.
FPE_FLTRES
Floating-point inexact result.
FPE_FLTINV
Floating-point invalid operation.
FPE_FLTSUB
Subscript out of range.
The following values can be placed in si_code for a SIGSEGV signal:
SEGV_MAPERR
Address not mapped to object.
SEGV_ACCERR
Invalid permissions for mapped object.
SEGV_BNDERR (since Linux 3.19)
Failed address bound checks.
SEGV_PKUERR (since Linux 4.6)
Access was denied by memory protection keys. See pkeys(7). The protection key which applied to this access is available via si_pkey.
The following values can be placed in si_code for a SIGBUS signal:
BUS_ADRALN
Invalid address alignment.
BUS_ADRERR
Nonexistent physical address.
BUS_OBJERR
Object-specific hardware error.
BUS_MCEERR_AR (since Linux 2.6.32)
Hardware memory error consumed on a machine check; action required.
BUS_MCEERR_AO (since Linux 2.6.32)
Hardware memory error detected in process but not consumed; action optional.
The following values can be placed in si_code for a SIGTRAP signal:
TRAP_BRKPT
Process breakpoint.
TRAP_TRACE
Process trace trap.
TRAP_BRANCH (since Linux 2.4, IA64 only)
Process taken branch trap.
TRAP_HWBKPT (since Linux 2.4, IA64 only)
Hardware breakpoint/watchpoint.
The following values can be placed in si_code for a SIGCHLD signal:
CLD_EXITED
Child has exited.
CLD_KILLED
Child was killed.
CLD_DUMPED
Child terminated abnormally.
CLD_TRAPPED
Traced child has trapped.
CLD_STOPPED
Child has stopped.
CLD_CONTINUED (since Linux 2.6.9)
Stopped child has continued.
The following values can be placed in si_code for a SIGIO/SIGPOLL signal:
POLL_IN
Data input available.
POLL_OUT
Output buffers available.
POLL_MSG
Input message available.
POLL_ERR
I/O error.
POLL_PRI
High priority input available.
POLL_HUP
Device disconnected.
The following value can be placed in si_code for a SIGSYS signal:
SYS_SECCOMP (since Linux 3.5)
Triggered by a seccomp(2) filter rule.
Dynamically probing for flag bit support
The sigaction() call on Linux accepts unknown bits set in act->sa_flags without error. The behavior of the kernel starting with Linux 5.11 is that a second sigaction() will clear unknown bits from oldact->sa_flags. However, historically, a second sigaction() call would typically leave those bits set in oldact->sa_flags.
This means that support for new flags cannot be detected simply by testing for a flag in sa_flags, and a program must test that SA_UNSUPPORTED has been cleared before relying on the contents of sa_flags.
Since the behavior of the signal handler cannot be guaranteed unless the check passes, it is wise to either block the affected signal while registering the handler and performing the check in this case, or where this is not possible, for example if the signal is synchronous, to issue the second sigaction() in the signal handler itself.
In kernels that do not support a specific flag, the kernel’s behavior is as if the flag was not set, even if the flag was set in act->sa_flags.
The flags SA_NOCLDSTOP, SA_NOCLDWAIT, SA_SIGINFO, SA_ONSTACK, SA_RESTART, SA_NODEFER, SA_RESETHAND, and, if defined by the architecture, SA_RESTORER may not be reliably probed for using this mechanism, because they were introduced before Linux 5.11. However, in general, programs may assume that these flags are supported, since they have all been supported since Linux 2.6, which was released in the year 2003.
See EXAMPLES below for a demonstration of the use of SA_UNSUPPORTED.
RETURN VALUE
sigaction() returns 0 on success; on error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
act or oldact points to memory which is not a valid part of the process address space.
EINVAL
An invalid signal was specified. This will also be generated if an attempt is made to change the action for SIGKILL or SIGSTOP, which cannot be caught or ignored.
VERSIONS
C library/kernel differences
The glibc wrapper function for sigaction() gives an error (EINVAL) on attempts to change the disposition of the two real-time signals used internally by the NPTL threading implementation. See nptl(7) for details.
On architectures where the signal trampoline resides in the C library, the glibc wrapper function for sigaction() places the address of the trampoline code in the act.sa_restorer field and sets the SA_RESTORER flag in the act.sa_flags field. See sigreturn(2).
The original Linux system call was named sigaction(). However, with the addition of real-time signals in Linux 2.2, the fixed-size, 32-bit sigset_t type supported by that system call was no longer fit for purpose. Consequently, a new system call, rt_sigaction(), was added to support an enlarged sigset_t type. The new system call takes a fourth argument, size_t sigsetsize, which specifies the size in bytes of the signal sets in act.sa_mask and oldact.sa_mask. This argument is currently required to have the value sizeof(sigset_t) (or the error EINVAL results). The glibc sigaction() wrapper function hides these details from us, transparently calling rt_sigaction() when the kernel provides it.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
POSIX.1-1990 disallowed setting the action for SIGCHLD to SIG_IGN. POSIX.1-2001 and later allow this possibility, so that ignoring SIGCHLD can be used to prevent the creation of zombies (see wait(2)). Nevertheless, the historical BSD and System V behaviors for ignoring SIGCHLD differ, so that the only completely portable method of ensuring that terminated children do not become zombies is to catch the SIGCHLD signal and perform a wait(2) or similar.
POSIX.1-1990 specified only SA_NOCLDSTOP. POSIX.1-2001 added SA_NOCLDWAIT, SA_NODEFER, SA_ONSTACK, SA_RESETHAND, SA_RESTART, and SA_SIGINFO as XSI extensions. POSIX.1-2008 moved SA_NODEFER, SA_RESETHAND, SA_RESTART, and SA_SIGINFO to the base specifications. Use of these latter values in sa_flags may be less portable in applications intended for older UNIX implementations.
The SA_RESETHAND flag is compatible with the SVr4 flag of the same name.
The SA_NODEFER flag is compatible with the SVr4 flag of the same name under kernels 1.3.9 and later. On older kernels the Linux implementation allowed the receipt of any signal, not just the one we are installing (effectively overriding any sa_mask settings).
NOTES
A child created via fork(2) inherits a copy of its parent’s signal dispositions. During an execve(2), the dispositions of handled signals are reset to the default; the dispositions of ignored signals are left unchanged.
According to POSIX, the behavior of a process is undefined after it ignores a SIGFPE, SIGILL, or SIGSEGV signal that was not generated by kill(2) or raise(3). Integer division by zero has an undefined result; on some architectures it generates a SIGFPE signal. (Dividing the most negative integer by -1 may also generate SIGFPE.) Ignoring this signal might lead to an endless loop.
sigaction() can be called with a NULL second argument to query the current signal handler. It can also be used to check whether a given signal is valid for the current machine by calling it with NULL second and third arguments.
It is not possible to block SIGKILL or SIGSTOP (by specifying them in sa_mask). Attempts to do so are silently ignored.
See sigsetops(3) for details on manipulating signal sets.
See signal-safety(7) for a list of the async-signal-safe functions that can be safely called from inside a signal handler.
Undocumented
Before the introduction of SA_SIGINFO, it was also possible to get some additional information about the signal. This was done by providing an sa_handler signal handler with a second argument of type struct sigcontext, which is the same structure as the one that is passed in the uc_mcontext field of the ucontext structure that is passed (via a pointer) in the third argument of the sa_sigaction handler. See the relevant Linux kernel sources for details. This use is obsolete now.
BUGS
When delivering a signal with a SA_SIGINFO handler, the kernel does not always provide meaningful values for all of the fields of the siginfo_t that are relevant for that signal.
Up to and including Linux 2.6.13, specifying SA_NODEFER in sa_flags prevents not only the delivered signal from being masked during execution of the handler, but also the signals specified in sa_mask. This bug was fixed in Linux 2.6.14.
EXAMPLES
See mprotect(2).
Probing for flag support
The following example program exits with status EXIT_SUCCESS if SA_EXPOSE_TAGBITS is determined to be supported, and EXIT_FAILURE otherwise.
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void
handler(int signo, siginfo_t *info, void *context)
{
    struct sigaction oldact;

    /* Second sigaction() call: read back the flags the kernel kept. */
    if (sigaction(SIGSEGV, NULL, &oldact) == -1
        || (oldact.sa_flags & SA_UNSUPPORTED)
        || !(oldact.sa_flags & SA_EXPOSE_TAGBITS))
    {
        _exit(EXIT_FAILURE);
    }
    _exit(EXIT_SUCCESS);
}

int
main(void)
{
    struct sigaction act = { 0 };

    act.sa_flags = SA_SIGINFO | SA_UNSUPPORTED | SA_EXPOSE_TAGBITS;
    act.sa_sigaction = &handler;
    if (sigaction(SIGSEGV, &act, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    /* The check itself runs inside the handler. */
    raise(SIGSEGV);
}
SEE ALSO
kill(1), kill(2), pause(2), pidfd_send_signal(2), restart_syscall(2), seccomp(2), sigaltstack(2), signal(2), signalfd(2), sigpending(2), sigprocmask(2), sigreturn(2), sigsuspend(2), wait(2), killpg(3), raise(3), siginterrupt(3), sigqueue(3), sigsetops(3), sigvec(3), core(5), signal(7)
105 - Linux cli command execveat
NAME
execveat - execute program relative to a directory file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int execveat(int dirfd, const char *pathname,
char *const _Nullable argv[],
char *const _Nullable envp[],
int flags);
DESCRIPTION
The execveat() system call executes the program referred to by the combination of dirfd and pathname. It operates in exactly the same way as execve(2), except for the differences described in this manual page.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by execve(2) for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like execve(2)).
If pathname is absolute, then dirfd is ignored.
If pathname is an empty string and the AT_EMPTY_PATH flag is specified, then the file descriptor dirfd specifies the file to be executed (i.e., dirfd refers to an executable file, rather than a directory).
The flags argument is a bit mask that can include zero or more of the following flags:
AT_EMPTY_PATH
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag).
AT_SYMLINK_NOFOLLOW
If the file identified by dirfd and a non-NULL pathname is a symbolic link, then the call fails with the error ELOOP.
RETURN VALUE
On success, execveat() does not return. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The same errors that occur for execve(2) can also occur for execveat(). The following additional errors can occur for execveat():
EBADF
pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EINVAL
Invalid flag specified in flags.
ELOOP
flags includes AT_SYMLINK_NOFOLLOW and the file identified by dirfd and a non-NULL pathname is a symbolic link.
ENOENT
The program identified by dirfd and pathname requires the use of an interpreter program (such as a script starting with “#!”), but the file descriptor dirfd was opened with the O_CLOEXEC flag, with the result that the program file is inaccessible to the launched interpreter. See BUGS.
ENOTDIR
pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
STANDARDS
Linux.
HISTORY
Linux 3.19, glibc 2.34.
NOTES
In addition to the reasons explained in openat(2), the execveat() system call is also needed to allow fexecve(3) to be implemented on systems that do not have the /proc filesystem mounted.
When asked to execute a script file, the argv[0] that is passed to the script interpreter is a string of the form /dev/fd/N or /dev/fd/N/P, where N is the number of the file descriptor passed via the dirfd argument. A string of the first form occurs when AT_EMPTY_PATH is employed. A string of the second form occurs when the script is specified via both dirfd and pathname; in this case, P is the value given in pathname.
For the same reasons described in fexecve(3), the natural idiom when using execveat() is to set the close-on-exec flag on dirfd. (But see BUGS.)
BUGS
The ENOENT error described above means that it is not possible to set the close-on-exec flag on the file descriptor given to a call of the form:
execveat(fd, "", argv, envp, AT_EMPTY_PATH);
However, the inability to set the close-on-exec flag means that a file descriptor referring to the script leaks through to the script itself. As well as wasting a file descriptor, this leakage can lead to file-descriptor exhaustion in scenarios where scripts recursively employ execveat().
SEE ALSO
execve(2), openat(2), fexecve(3)
106 - Linux cli command clock_settime
NAME
clock_settime - clock and time functions
LIBRARY
Standard C library (libc, -lc), since glibc 2.17
Before glibc 2.17, Real-time library (librt, -lrt)
SYNOPSIS
#include <time.h>
int clock_getres(clockid_t clockid, struct timespec *_Nullable res);
int clock_gettime(clockid_t clockid, struct timespec *tp);
int clock_settime(clockid_t clockid, const struct timespec *tp);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
clock_getres(), clock_gettime(), clock_settime():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
The function clock_getres() finds the resolution (precision) of the specified clock clockid, and, if res is non-NULL, stores it in the struct timespec pointed to by res. The resolution of clocks depends on the implementation and cannot be configured by a particular process. If the time value pointed to by the argument tp of clock_settime() is not a multiple of res, then it is truncated to a multiple of res.
The functions clock_gettime() and clock_settime() retrieve and set the time of the specified clock clockid.
The res and tp arguments are timespec(3) structures.
The clockid argument is the identifier of the particular clock on which to act. A clock may be system-wide and hence visible for all processes, or per-process if it measures time only within a single process.
All implementations support the system-wide real-time clock, which is identified by CLOCK_REALTIME. Its time represents seconds and nanoseconds since the Epoch. When its time is changed, timers for a relative interval are unaffected, but timers for an absolute point in time are affected.
More clocks may be implemented. The interpretation of the corresponding time values and the effect on timers is unspecified.
Sufficiently recent versions of glibc and the Linux kernel support the following clocks:
CLOCK_REALTIME
A settable system-wide clock that measures real (i.e., wall-clock) time. Setting this clock requires appropriate privileges. This clock is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), and by frequency adjustments performed by NTP and similar applications via adjtime(3), adjtimex(2), clock_adjtime(2), and ntp_adjtime(3). This clock normally counts the number of seconds since 1970-01-01 00:00:00 Coordinated Universal Time (UTC) except that it ignores leap seconds; near a leap second it is typically adjusted by NTP to stay roughly in sync with UTC.
CLOCK_REALTIME_ALARM (since Linux 3.0; Linux-specific)
Like CLOCK_REALTIME, but not settable. See timer_create(2) for further details.
CLOCK_REALTIME_COARSE (since Linux 2.6.32; Linux-specific)
A faster but less precise version of CLOCK_REALTIME. This clock is not settable. Use when you need very fast, but not fine-grained timestamps. Requires per-architecture support, and probably also architecture support for this flag in the vdso(7).
CLOCK_TAI (since Linux 3.10; Linux-specific)
A nonsettable system-wide clock derived from wall-clock time but counting leap seconds. This clock does not experience discontinuities or frequency adjustments caused by inserting leap seconds as CLOCK_REALTIME does.
The acronym TAI refers to International Atomic Time.
CLOCK_MONOTONIC
A nonsettable system-wide clock that represents monotonic time since (as described by POSIX) “some unspecified point in the past”. On Linux, that point corresponds to the number of seconds that the system has been running since it was booted.
The CLOCK_MONOTONIC clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), but is affected by frequency adjustments. This clock does not count time that the system is suspended. All CLOCK_MONOTONIC variants guarantee that the time returned by consecutive calls will not go backwards, but successive calls may, depending on the architecture, return identical (not-increased) time values.
CLOCK_MONOTONIC_COARSE (since Linux 2.6.32; Linux-specific)
A faster but less precise version of CLOCK_MONOTONIC. Use when you need very fast, but not fine-grained timestamps. Requires per-architecture support, and probably also architecture support for this flag in the vdso(7).
CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to frequency adjustments. This clock does not count time that the system is suspended.
CLOCK_BOOTTIME (since Linux 2.6.39; Linux-specific)
A nonsettable system-wide clock that is identical to CLOCK_MONOTONIC, except that it also includes any time that the system is suspended. This allows applications to get a suspend-aware monotonic clock without having to deal with the complications of CLOCK_REALTIME, which may have discontinuities if the time is changed using settimeofday(2) or similar.
CLOCK_BOOTTIME_ALARM (since Linux 3.0; Linux-specific)
Like CLOCK_BOOTTIME. See timer_create(2) for further details.
CLOCK_PROCESS_CPUTIME_ID (since Linux 2.6.12)
This is a clock that measures CPU time consumed by this process (i.e., CPU time consumed by all threads in the process). On Linux, this clock is not settable.
CLOCK_THREAD_CPUTIME_ID (since Linux 2.6.12)
This is a clock that measures CPU time consumed by this thread. On Linux, this clock is not settable.
Linux also implements dynamic clock instances as described below.
Dynamic clocks
In addition to the hard-coded System-V style clock IDs described above, Linux also supports POSIX clock operations on certain character devices. Such devices are called “dynamic” clocks, and are supported since Linux 2.6.39.
Using the appropriate macros, open file descriptors may be converted into clock IDs and passed to clock_gettime(), clock_settime(), and clock_adjtime(2). The following example shows how to convert a file descriptor into a dynamic clock ID.
#define CLOCKFD 3
#define FD_TO_CLOCKID(fd) ((~(clockid_t) (fd) << 3) | CLOCKFD)
#define CLOCKID_TO_FD(clk) ((unsigned int) ~((clk) >> 3))
struct timespec ts;
clockid_t clkid;
int fd;
fd = open("/dev/ptp0", O_RDWR);
clkid = FD_TO_CLOCKID(fd);
clock_gettime(clkid, &ts);
RETURN VALUE
clock_gettime(), clock_settime(), and clock_getres() return 0 for success. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
clock_settime() does not have write permission for the dynamic POSIX clock device indicated.
EFAULT
tp points outside the accessible address space.
EINVAL
The clockid specified is invalid for one of two reasons. Either the System-V style hard coded positive value is out of range, or the dynamic clock ID does not refer to a valid instance of a clock object.
EINVAL
(clock_settime()): tp.tv_sec is negative or tp.tv_nsec is outside the range [0, 999,999,999].
EINVAL
The clockid specified in a call to clock_settime() is not a settable clock.
EINVAL (since Linux 4.3)
A call to clock_settime() with a clockid of CLOCK_REALTIME attempted to set the time to a value less than the current value of the CLOCK_MONOTONIC clock.
ENODEV
The hot-pluggable device (for example, a USB device) represented by a dynamic clockid has disappeared after its character device was opened.
ENOTSUP
The operation is not supported by the dynamic POSIX clock device specified.
EOVERFLOW
The timestamp would not fit in time_t range. This can happen if an executable with 32-bit time_t is run on a 64-bit kernel when the time is 2038-01-19 03:14:08 UTC or later. However, when the system time is out of time_t range in other situations, the behavior is undefined.
EPERM
clock_settime() does not have permission to set the clock indicated.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
clock_getres(), clock_gettime(), clock_settime() | Thread safety | MT-Safe |
VERSIONS
POSIX.1 specifies the following:
Setting the value of the CLOCK_REALTIME clock via clock_settime() shall have no effect on threads that are blocked waiting for a relative time service based upon this clock, including the nanosleep() function; nor on the expiration of relative timers based upon this clock. Consequently, these time services shall expire when the requested relative interval elapses, independently of the new or old value of the clock.
According to POSIX.1-2001, a process with “appropriate privileges” may set the CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks using clock_settime(). On Linux, these clocks are not settable (i.e., no process has “appropriate privileges”).
C library/kernel differences
On some architectures, an implementation of clock_gettime() is provided in the vdso(7).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SUSv2. Linux 2.6.
On POSIX systems on which these functions are available, the symbol _POSIX_TIMERS is defined in <unistd.h> to a value greater than 0. POSIX.1-2008 makes these functions mandatory.
The symbols _POSIX_MONOTONIC_CLOCK, _POSIX_CPUTIME, _POSIX_THREAD_CPUTIME indicate that CLOCK_MONOTONIC, CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID are available. (See also sysconf(3).)
Historical note for SMP systems
Before Linux added kernel support for CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID, glibc implemented these clocks on many platforms using timer registers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
If the CPUs in an SMP system have different clock sources, then there is no way to maintain a correlation between the timer registers since each CPU will run at a slightly different frequency. If that is the case, then clock_getcpuclockid(0) will return ENOENT to signify this condition. The two clocks will then be useful only if it can be ensured that a process stays on a certain CPU.
The processors in an SMP system do not start all at exactly the same time and therefore the timer registers are typically running at an offset. Some architectures include code that attempts to limit these offsets on bootup. However, the code cannot guarantee to accurately tune the offsets. glibc contains no provisions to deal with these offsets (unlike the Linux Kernel). Typically these offsets are small and therefore the effects may be negligible in most cases.
Since glibc 2.4, the wrapper functions for the system calls described in this page avoid the abovementioned problems by employing the kernel implementation of CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID, on systems that provide such an implementation (i.e., Linux 2.6.12 and later).
EXAMPLES
The program below demonstrates the use of clock_gettime() and clock_getres() with various clocks. This is an example of what we might see when running the program:
$ ./clock_times x
CLOCK_REALTIME : 1585985459.446 (18356 days + 7h 30m 59s)
resolution: 0.000000001
CLOCK_TAI : 1585985496.447 (18356 days + 7h 31m 36s)
resolution: 0.000000001
CLOCK_MONOTONIC: 52395.722 (14h 33m 15s)
resolution: 0.000000001
CLOCK_BOOTTIME : 72691.019 (20h 11m 31s)
resolution: 0.000000001
Program source
/* clock_times.c
Licensed under GNU General Public License v2 or later.
*/
#define _XOPEN_SOURCE 600
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define SECS_IN_DAY (24 * 60 * 60)
static void
displayClock(clockid_t clock, const char *name, bool showRes)
{
    long             days;
    struct timespec  ts;

    if (clock_gettime(clock, &ts) == -1) {
        perror("clock_gettime");
        exit(EXIT_FAILURE);
    }

    printf("%-15s: %10jd.%03ld (", name,
           (intmax_t) ts.tv_sec, ts.tv_nsec / 1000000);

    days = ts.tv_sec / SECS_IN_DAY;
    if (days > 0)
        printf("%ld days + ", days);

    printf("%2dh %2dm %2ds",
           (int) (ts.tv_sec % SECS_IN_DAY) / 3600,
           (int) (ts.tv_sec % 3600) / 60,
           (int) ts.tv_sec % 60);
    printf(")\n");

    if (clock_getres(clock, &ts) == -1) {
        perror("clock_getres");
        exit(EXIT_FAILURE);
    }

    if (showRes)
        printf("     resolution: %10jd.%09ld\n",
               (intmax_t) ts.tv_sec, ts.tv_nsec);
}

int
main(int argc, char *argv[])
{
    bool showRes = argc > 1;

    displayClock(CLOCK_REALTIME, "CLOCK_REALTIME", showRes);
#ifdef CLOCK_TAI
    displayClock(CLOCK_TAI, "CLOCK_TAI", showRes);
#endif
    displayClock(CLOCK_MONOTONIC, "CLOCK_MONOTONIC", showRes);
#ifdef CLOCK_BOOTTIME
    displayClock(CLOCK_BOOTTIME, "CLOCK_BOOTTIME", showRes);
#endif
    exit(EXIT_SUCCESS);
}
SEE ALSO
date(1), gettimeofday(2), settimeofday(2), time(2), adjtime(3), clock_getcpuclockid(3), ctime(3), ftime(3), pthread_getcpuclockid(3), sysconf(3), timespec(3), time(7), time_namespaces(7), vdso(7), hwclock(8)
107 - Linux cli command settimeofday
NAME
settimeofday - get / set time
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/time.h>
int gettimeofday(struct timeval *restrict tv,
struct timezone *_Nullable restrict tz);
int settimeofday(const struct timeval *tv,
const struct timezone *_Nullable tz);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
settimeofday():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
The functions gettimeofday() and settimeofday() can get and set the time as well as a timezone.
The tv argument is a struct timeval (as specified in <sys/time.h>):
struct timeval {
time_t tv_sec; /* seconds */
suseconds_t tv_usec; /* microseconds */
};
and gives the number of seconds and microseconds since the Epoch (see time(2)).
The tz argument is a struct timezone:
struct timezone {
int tz_minuteswest; /* minutes west of Greenwich */
int tz_dsttime; /* type of DST correction */
};
If either tv or tz is NULL, the corresponding structure is not set or returned. (However, compilation warnings will result if tv is NULL.)
The use of the timezone structure is obsolete; the tz argument should normally be specified as NULL. (See NOTES below.)
Under Linux, there are some peculiar “warp clock” semantics associated with the settimeofday() system call if on the very first call (after booting) that has a non-NULL tz argument, the tv argument is NULL and the tz_minuteswest field is nonzero. (The tz_dsttime field should be zero for this case.) In such a case it is assumed that the CMOS clock is on local time, and that it has to be incremented by this amount to get UTC system time. No doubt it is a bad idea to use this feature.
RETURN VALUE
gettimeofday() and settimeofday() return 0 for success. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
One of tv or tz pointed outside the accessible address space.
EINVAL
(settimeofday()): timezone is invalid.
EINVAL
(settimeofday()): tv.tv_sec is negative or tv.tv_usec is outside the range [0, 999,999].
EINVAL (since Linux 4.3)
(settimeofday()): An attempt was made to set the time to a value less than the current value of the CLOCK_MONOTONIC clock (see clock_gettime(2)).
EPERM
The calling process has insufficient privilege to call settimeofday(); under Linux the CAP_SYS_TIME capability is required.
VERSIONS
C library/kernel differences
On some architectures, an implementation of gettimeofday() is provided in the vdso(7).
The kernel accepts NULL for both tv and tz. The timezone argument is ignored by glibc and musl, and not passed to/from the kernel. Android’s bionic passes the timezone argument to/from the kernel, but Android does not update the kernel timezone based on the device timezone in Settings, so the kernel’s timezone is typically UTC.
STANDARDS
gettimeofday()
POSIX.1-2008 (obsolete).
settimeofday()
None.
HISTORY
SVr4, 4.3BSD. POSIX.1-2001 describes gettimeofday() but not settimeofday(). POSIX.1-2008 marks gettimeofday() as obsolete, recommending the use of clock_gettime(2) instead.
Traditionally, the fields of struct timeval were of type long.
The tz_dsttime field
On a non-Linux kernel, with glibc, the tz_dsttime field of struct timezone will be set to a nonzero value by gettimeofday() if the current timezone has ever had or will have a daylight saving rule applied. In this sense it exactly mirrors the meaning of daylight(3) for the current zone. On Linux, with glibc, the setting of the tz_dsttime field of struct timezone has never been used by settimeofday() or gettimeofday(). Thus, the following is purely of historical interest.
On old systems, the field tz_dsttime contains a symbolic constant (values are given below) that indicates in which part of the year Daylight Saving Time is in force. (Note: this value is constant throughout the year: it does not indicate that DST is in force, it just selects an algorithm.) The daylight saving time algorithms defined are as follows:
DST_NONE /* not on DST */
DST_USA /* USA style DST */
DST_AUST /* Australian style DST */
DST_WET /* Western European DST */
DST_MET /* Middle European DST */
DST_EET /* Eastern European DST */
DST_CAN /* Canada */
DST_GB /* Great Britain and Eire */
DST_RUM /* Romania */
DST_TUR /* Turkey */
DST_AUSTALT /* Australian style with shift in 1986 */
Of course it turned out that the period in which Daylight Saving Time is in force cannot be given by a simple algorithm, one per country; indeed, this period is determined by unpredictable political decisions. So this method of representing timezones has been abandoned.
NOTES
The time returned by gettimeofday() is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the system time). If you need a monotonically increasing clock, see clock_gettime(2).
Macros for operating on timeval structures are described in timeradd(3).
SEE ALSO
date(1), adjtimex(2), clock_gettime(2), time(2), ctime(3), ftime(3), timeradd(3), capabilities(7), time(7), vdso(7), hwclock(8)
108 - Linux cli command write
NAME
write - write to a file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
ssize_t write(int fd, const void buf[.count], size_t count);
DESCRIPTION
write() writes up to count bytes from the buffer starting at buf to the file referred to by the file descriptor fd.
The number of bytes written may be less than count if, for example, there is insufficient space on the underlying physical medium, or the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)), or the call was interrupted by a signal handler after having written less than count bytes. (See also pipe(7).)
For a seekable file (i.e., one to which lseek(2) may be applied, for example, a regular file) writing takes place at the file offset, and the file offset is incremented by the number of bytes actually written. If the file was open(2)ed with O_APPEND, the file offset is first set to the end of the file before writing. The adjustment of the file offset and the write operation are performed as an atomic step.
POSIX requires that a read(2) that can be proved to occur after a write() has returned will return the new data. Note that not all filesystems are POSIX conforming.
According to POSIX.1, if count is greater than SSIZE_MAX, the result is implementation-defined; see NOTES for the upper limit on Linux.
RETURN VALUE
On success, the number of bytes written is returned. On error, -1 is returned, and errno is set to indicate the error.
Note that a successful write() may transfer fewer than count bytes. Such partial writes can occur for various reasons; for example, because there was insufficient space on the disk device to write all of the requested bytes, or because a blocked write() to a socket, pipe, or similar was interrupted by a signal handler after it had transferred some, but before it had transferred all of the requested bytes. In the event of a partial write, the caller can make another write() call to transfer the remaining bytes. The subsequent call will either transfer further bytes or may result in an error (e.g., if the disk is now full).
If count is zero and fd refers to a regular file, then write() may return a failure status if one of the errors below is detected. If no errors are detected, or error detection is not performed, 0 is returned without causing any other effect. If count is zero and fd refers to a file other than a regular file, the results are not specified.
ERRORS
EAGAIN
The file descriptor fd refers to a file other than a socket and has been marked nonblocking (O_NONBLOCK), and the write would block. See open(2) for further details on the O_NONBLOCK flag.
EAGAIN or EWOULDBLOCK
The file descriptor fd refers to a socket and has been marked nonblocking (O_NONBLOCK), and the write would block. POSIX.1-2001 allows either error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities.
EBADF
fd is not a valid file descriptor or is not open for writing.
EDESTADDRREQ
fd refers to a datagram socket for which a peer address has not been set using connect(2).
EDQUOT
The user’s quota of disk blocks on the filesystem containing the file referred to by fd has been exhausted.
EFAULT
buf is outside your accessible address space.
EFBIG
An attempt was made to write a file that exceeds the implementation-defined maximum file size or the process’s file size limit, or to write at a position past the maximum allowed offset.
EINTR
The call was interrupted by a signal before any data was written; see signal(7).
EINVAL
fd is attached to an object which is unsuitable for writing; or the file was opened with the O_DIRECT flag, and either the address specified in buf, the value specified in count, or the file offset is not suitably aligned.
EIO
A low-level I/O error occurred while modifying the inode. This error may relate to the write-back of data written by an earlier write(), which may have been issued to a different file descriptor on the same file. Since Linux 4.13, errors from write-back come with a promise that they may be reported by subsequent write() requests, and will be reported by a subsequent fsync(2) (whether or not they were also reported by write()). An alternate cause of EIO on networked filesystems is when an advisory lock had been taken out on the file descriptor and this lock has been lost. See the Lost locks section of fcntl(2) for further details.
ENOSPC
The device containing the file referred to by fd has no room for the data.
EPERM
The operation was prevented by a file seal; see fcntl(2).
EPIPE
fd is connected to a pipe or socket whose reading end is closed. When this happens the writing process will also receive a SIGPIPE signal. (Thus, the write return value is seen only if the program catches, blocks or ignores this signal.)
Other errors may occur, depending on the object connected to fd.
STANDARDS
POSIX.1-2008.
HISTORY
SVr4, 4.3BSD, POSIX.1-2001.
Under SVr4 a write may be interrupted and return EINTR at any point, not just before any data is written.
NOTES
A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data.
If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has been written, the call succeeds, and returns the number of bytes written.
On Linux, write() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)
An error return value while performing write() using direct I/O does not mean the entire write has failed. Partial data may be written and the data at the file offset on which the write() was attempted should be considered inconsistent.
BUGS
According to POSIX.1-2008/SUSv4 Section XSI 2.9.7 (“Thread Interactions with Regular File Operations”):
All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links: …
Among the APIs subsequently listed are write() and writev(2). And among the effects that should be atomic across threads (and processes) are updates of the file offset. However, before Linux 3.14, this was not the case: if two processes that share an open file description (see open(2)) perform a write() (or writev(2)) at the same time, then the I/O operations were not atomic with respect to updating the file offset, with the result that the blocks of data output by the two processes might (incorrectly) overlap. This problem was fixed in Linux 3.14.
SEE ALSO
close(2), fcntl(2), fsync(2), ioctl(2), lseek(2), open(2), pwrite(2), read(2), select(2), writev(2), fwrite(3)
109 - Linux cli command fstatfs64
NAME
fstatfs64 - get filesystem statistics
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/vfs.h> /* or <sys/statfs.h> */
int statfs(const char *path, struct statfs *buf);
int fstatfs(int fd, struct statfs *buf);
Unless you need the f_type field, you should use the standard statvfs(3) interface instead.
DESCRIPTION
The statfs() system call returns information about a mounted filesystem. path is the pathname of any file within the mounted filesystem. buf is a pointer to a statfs structure defined approximately as follows:
struct statfs {
__fsword_t f_type; /* Type of filesystem (see below) */
__fsword_t f_bsize; /* Optimal transfer block size */
fsblkcnt_t f_blocks; /* Total data blocks in filesystem */
fsblkcnt_t f_bfree; /* Free blocks in filesystem */
fsblkcnt_t f_bavail; /* Free blocks available to
unprivileged user */
fsfilcnt_t f_files; /* Total inodes in filesystem */
fsfilcnt_t f_ffree; /* Free inodes in filesystem */
fsid_t f_fsid; /* Filesystem ID */
__fsword_t f_namelen; /* Maximum length of filenames */
__fsword_t f_frsize; /* Fragment size (since Linux 2.6) */
__fsword_t f_flags; /* Mount flags of filesystem
(since Linux 2.6.36) */
__fsword_t f_spare[xxx];
/* Padding bytes reserved for future use */
};
The following filesystem types may appear in f_type:
ADFS_SUPER_MAGIC 0xadf5
AFFS_SUPER_MAGIC 0xadff
AFS_SUPER_MAGIC 0x5346414f
ANON_INODE_FS_MAGIC 0x09041934 /* Anonymous inode FS (for
pseudofiles that have no name;
e.g., epoll, signalfd, bpf) */
AUTOFS_SUPER_MAGIC 0x0187
BDEVFS_MAGIC 0x62646576
BEFS_SUPER_MAGIC 0x42465331
BFS_MAGIC 0x1badface
BINFMTFS_MAGIC 0x42494e4d
BPF_FS_MAGIC 0xcafe4a11
BTRFS_SUPER_MAGIC 0x9123683e
BTRFS_TEST_MAGIC 0x73727279
CGROUP_SUPER_MAGIC 0x27e0eb /* Cgroup pseudo FS */
CGROUP2_SUPER_MAGIC 0x63677270 /* Cgroup v2 pseudo FS */
CIFS_MAGIC_NUMBER 0xff534d42
CODA_SUPER_MAGIC 0x73757245
COH_SUPER_MAGIC 0x012ff7b7
CRAMFS_MAGIC 0x28cd3d45
DEBUGFS_MAGIC 0x64626720
DEVFS_SUPER_MAGIC 0x1373 /* Linux 2.6.17 and earlier */
DEVPTS_SUPER_MAGIC 0x1cd1
ECRYPTFS_SUPER_MAGIC 0xf15f
EFIVARFS_MAGIC 0xde5e81e4
EFS_SUPER_MAGIC 0x00414a53
EXT_SUPER_MAGIC 0x137d /* Linux 2.0 and earlier */
EXT2_OLD_SUPER_MAGIC 0xef51
EXT2_SUPER_MAGIC 0xef53
EXT3_SUPER_MAGIC 0xef53
EXT4_SUPER_MAGIC 0xef53
F2FS_SUPER_MAGIC 0xf2f52010
FUSE_SUPER_MAGIC 0x65735546
FUTEXFS_SUPER_MAGIC 0xbad1dea /* Unused */
HFS_SUPER_MAGIC 0x4244
HOSTFS_SUPER_MAGIC 0x00c0ffee
HPFS_SUPER_MAGIC 0xf995e849
HUGETLBFS_MAGIC 0x958458f6
ISOFS_SUPER_MAGIC 0x9660
JFFS2_SUPER_MAGIC 0x72b6
JFS_SUPER_MAGIC 0x3153464a
MINIX_SUPER_MAGIC 0x137f /* original minix FS */
MINIX_SUPER_MAGIC2 0x138f /* 30 char minix FS */
MINIX2_SUPER_MAGIC 0x2468 /* minix V2 FS */
MINIX2_SUPER_MAGIC2 0x2478 /* minix V2 FS, 30 char names */
MINIX3_SUPER_MAGIC 0x4d5a /* minix V3 FS, 60 char names */
MQUEUE_MAGIC 0x19800202 /* POSIX message queue FS */
MSDOS_SUPER_MAGIC 0x4d44
MTD_INODE_FS_MAGIC 0x11307854
NCP_SUPER_MAGIC 0x564c
NFS_SUPER_MAGIC 0x6969
NILFS_SUPER_MAGIC 0x3434
NSFS_MAGIC 0x6e736673
NTFS_SB_MAGIC 0x5346544e
OCFS2_SUPER_MAGIC 0x7461636f
OPENPROM_SUPER_MAGIC 0x9fa1
OVERLAYFS_SUPER_MAGIC 0x794c7630
PIPEFS_MAGIC 0x50495045
PROC_SUPER_MAGIC 0x9fa0 /* /proc FS */
PSTOREFS_MAGIC 0x6165676c
QNX4_SUPER_MAGIC 0x002f
QNX6_SUPER_MAGIC 0x68191122
RAMFS_MAGIC 0x858458f6
REISERFS_SUPER_MAGIC 0x52654973
ROMFS_MAGIC 0x7275
SECURITYFS_MAGIC 0x73636673
SELINUX_MAGIC 0xf97cff8c
SMACK_MAGIC 0x43415d53
SMB_SUPER_MAGIC 0x517b
SMB2_MAGIC_NUMBER 0xfe534d42
SOCKFS_MAGIC 0x534f434b
SQUASHFS_MAGIC 0x73717368
SYSFS_MAGIC 0x62656572
SYSV2_SUPER_MAGIC 0x012ff7b6
SYSV4_SUPER_MAGIC 0x012ff7b5
TMPFS_MAGIC 0x01021994
TRACEFS_MAGIC 0x74726163
UDF_SUPER_MAGIC 0x15013346
UFS_MAGIC 0x00011954
USBDEVICE_SUPER_MAGIC 0x9fa2
V9FS_MAGIC 0x01021997
VXFS_SUPER_MAGIC 0xa501fcf5
XENFS_SUPER_MAGIC 0xabba1974
XENIX_SUPER_MAGIC 0x012ff7b4
XFS_SUPER_MAGIC 0x58465342
_XIAFS_SUPER_MAGIC 0x012fd16d /* Linux 2.0 and earlier */
Most of these MAGIC constants are defined in /usr/include/linux/magic.h, and some are hardcoded in kernel sources.
The f_flags field is a bit mask indicating mount options for the filesystem. It contains zero or more of the following bits:
ST_MANDLOCK
Mandatory locking is permitted on the filesystem (see fcntl(2)).
ST_NOATIME
Do not update access times; see mount(2).
ST_NODEV
Disallow access to device special files on this filesystem.
ST_NODIRATIME
Do not update directory access times; see mount(2).
ST_NOEXEC
Execution of programs is disallowed on this filesystem.
ST_NOSUID
The set-user-ID and set-group-ID bits are ignored by exec(3) for executable files on this filesystem.
ST_RDONLY
This filesystem is mounted read-only.
ST_RELATIME
Update atime relative to mtime/ctime; see mount(2).
ST_SYNCHRONOUS
Writes are synched to the filesystem immediately (see the description of O_SYNC in open(2)).
ST_NOSYMFOLLOW (since Linux 5.10)
Symbolic links are not followed when resolving paths; see mount(2).
Nobody knows what f_fsid is supposed to contain (but see below).
Fields that are undefined for a particular filesystem are set to 0.
fstatfs() returns the same information about an open file referenced by descriptor fd.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
(statfs()) Search permission is denied for a component of the path prefix of path. (See also path_resolution(7).)
EBADF
(fstatfs()) fd is not a valid open file descriptor.
EFAULT
buf or path points to an invalid address.
EINTR
The call was interrupted by a signal; see signal(7).
EIO
An I/O error occurred while reading from the filesystem.
ELOOP
(statfs()) Too many symbolic links were encountered in translating path.
ENAMETOOLONG
(statfs()) path is too long.
ENOENT
(statfs()) The file referred to by path does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOSYS
The filesystem does not support this call.
ENOTDIR
(statfs()) A component of the path prefix of path is not a directory.
EOVERFLOW
Some values were too large to be represented in the returned struct.
VERSIONS
The f_fsid field
Solaris, Irix, and POSIX have a system call statvfs(2) that returns a struct statvfs (defined in <sys/statvfs.h>) containing an unsigned long f_fsid. Linux, SunOS, HP-UX, 4.4BSD have a system call statfs() that returns a struct statfs (defined in <sys/vfs.h>) containing a fsid_t f_fsid, where fsid_t is defined as struct { int val[2]; }. The same holds for FreeBSD, except that it uses the include file <sys/mount.h>.
The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type. Several operating systems restrict giving out the f_fsid field to the superuser only (and zero it for unprivileged users), because this field is used in the filehandle of the filesystem when NFS-exported, and giving it out is a security concern.
Under some operating systems, the fsid can be used as the second argument to the sysfs(2) system call.
STANDARDS
Linux.
HISTORY
The Linux statfs() was inspired by the 4.4BSD one (but they do not use the same structure).
The original Linux statfs() and fstatfs() system calls were not designed with extremely large file sizes in mind. Subsequently, Linux 2.6 added new statfs64() and fstatfs64() system calls that employ a new structure, statfs64. The new structure contains the same fields as the original statfs structure, but the sizes of various fields are increased, to accommodate large file sizes. The glibc statfs() and fstatfs() wrapper functions transparently deal with the kernel differences.
LSB has deprecated the library calls statfs() and fstatfs() and tells us to use statvfs(3) and fstatvfs(3) instead.
NOTES
The __fsword_t type used for various fields in the statfs structure definition is a glibc internal type, not intended for public use. This leaves the programmer in a bit of a conundrum when trying to copy or compare these fields to local variables in a program. Using unsigned int for such variables suffices on most systems.
Some systems have only <sys/vfs.h>, other systems also have <sys/statfs.h>, where the former includes the latter. So it seems including the former is the best choice.
BUGS
From Linux 2.6.38 up to and including Linux 3.1, fstatfs() failed with the error ENOSYS for file descriptors created by pipe(2).
SEE ALSO
stat(2), statvfs(3), path_resolution(7)
110 - Linux cli command idle
NAME
idle - make process 0 idle
SYNOPSIS
#include <unistd.h>
[[deprecated]] int idle(void);
DESCRIPTION
idle() is an internal system call used during bootstrap. It marks the process’s pages as swappable, lowers its priority, and enters the main scheduling loop. idle() never returns.
Only process 0 may call idle(). Any user process, even a process with superuser permission, will receive EPERM.
RETURN VALUE
idle() never returns for process 0, and always returns -1 for a user process.
ERRORS
EPERM
Always, for a user process.
STANDARDS
Linux.
HISTORY
Removed in Linux 2.3.13.
111 - Linux cli command s390_sthyi
NAME
s390_sthyi - emulate STHYI instruction
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <asm/sthyi.h> /* Definition of STHYI_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_s390_sthyi, unsigned long function_code,
void *resp_buffer, uint64_t *return_code,
unsigned long flags);
Note: glibc provides no wrapper for s390_sthyi(), necessitating the use of syscall(2).
DESCRIPTION
The s390_sthyi() system call emulates the STHYI (Store Hypervisor Information) instruction. It provides hardware resource information for the machine and its virtualization levels. This includes CPU type and capacity, as well as the machine model and other metrics.
The function_code argument indicates which function to perform. The following code(s) are supported:
STHYI_FC_CP_IFL_CAP
Return CP (Central Processor) and IFL (Integrated Facility for Linux) capacity information.
The resp_buffer argument specifies the address of a response buffer. When the function_code is STHYI_FC_CP_IFL_CAP, the buffer must be one page (4K) in size. If the system call returns 0, the response buffer will be filled with CPU capacity information. Otherwise, the response buffer’s content is unchanged.
The return_code argument stores the return code of the STHYI instruction, using one of the following values:
0
Success.
4
Unsupported function code.
For further details about return_code, function_code, and resp_buffer, see the reference given in NOTES.
The flags argument is provided to allow for future extensions and currently must be set to 0.
RETURN VALUE
On success (that is, emulation succeeded), the return value of s390_sthyi() matches the condition code of the STHYI instruction, which is a value in the range [0..3]. A return value of 0 indicates that CPU capacity information is stored in *resp_buffer. A return value of 3 indicates “unsupported function code” and the content of *resp_buffer is unchanged. The return values 1 and 2 are reserved.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
The value specified in resp_buffer or return_code is not a valid address.
EINVAL
The value specified in flags is nonzero.
ENOMEM
Allocating memory for handling the CPU capacity information failed.
EOPNOTSUPP
The value specified in function_code is not valid.
STANDARDS
Linux on s390.
HISTORY
Linux 4.15.
NOTES
For details of the STHYI instruction, see the documentation page.
When the system call interface is used, the response buffer doesn’t have to fulfill alignment requirements described in the STHYI instruction definition.
The kernel caches the response (for up to one second, as of Linux 4.16). Subsequent system call invocations may return the cached response.
SEE ALSO
syscall(2)
112 - Linux cli command setuid
NAME
setuid - set user identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setuid(uid_t uid);
DESCRIPTION
setuid() sets the effective user ID of the calling process. If the calling process is privileged (more precisely: if the process has the CAP_SETUID capability in its user namespace), the real UID and saved set-user-ID are also set.
Under Linux, setuid() is implemented like the POSIX version with the _POSIX_SAVED_IDS feature. This allows a set-user-ID (other than root) program to drop all of its user privileges, do some un-privileged work, and then reengage the original effective user ID in a secure manner.
If the user is root or the program is set-user-ID-root, special care must be taken: setuid() checks the effective user ID of the caller and, if it is the superuser, sets all process-related user IDs to uid. After this has occurred, it is impossible for the program to regain root privileges.
Thus, a set-user-ID-root program wishing to temporarily drop root privileges, assume the identity of an unprivileged user, and then regain root privileges afterward cannot use setuid(). You can accomplish this with seteuid(2).
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where setuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from setuid().
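The following is a minimal sketch of the checking that the note above calls for, not a definitive implementation. The helper name drop_privileges_permanently() is hypothetical and not part of any library. It refuses to continue when setuid() fails and additionally verifies the resulting real and effective user IDs.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: switch the process to the user ID `target`.
 * For a privileged caller this is permanent, as described above.
 * Returns 0 on success, -1 on failure (errno is set by setuid()). */
static int drop_privileges_permanently(uid_t target)
{
    if (setuid(target) == -1) {
        perror("setuid");
        return -1;          /* never carry on with the old identity */
    }

    /* Defensive check: confirm that the change really took effect. */
    if (getuid() != target || geteuid() != target)
        return -1;

    return 0;
}
```

A caller would typically treat a nonzero return as fatal and exit immediately rather than continue running with unexpected privileges.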
ERRORS
EAGAIN
The call would change the caller’s real UID (i.e., uid does not match the caller’s real UID), but there was a temporary failure allocating the necessary kernel data structures.
EAGAIN
uid does not match the real user ID of the caller and this call would bring the number of processes belonging to the real user ID uid over the caller’s RLIMIT_NPROC resource limit. Since Linux 3.1, this error case no longer occurs (but robust applications should check for this error); see the description of EAGAIN in execve(2).
EINVAL
The user ID specified in uid is not valid in this user namespace.
EPERM
The user is not privileged (Linux: does not have the CAP_SETUID capability in its user namespace) and uid does not match the real UID or saved set-user-ID of the calling process.
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including the one for setuid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
Not quite compatible with the 4.4BSD call, which sets all of the real, saved, and effective user IDs.
The original Linux setuid() system call supported only 16-bit user IDs. Subsequently, Linux 2.4 added setuid32() supporting 32-bit IDs. The glibc setuid() wrapper function transparently deals with the variation across kernel versions.
NOTES
Linux has the concept of the filesystem user ID, normally equal to the effective user ID. The setuid() call also sets the filesystem user ID of the calling process. See setfsuid(2).
If uid is different from the old effective UID, the process will be forbidden from leaving core dumps.
SEE ALSO
getuid(2), seteuid(2), setfsuid(2), setreuid(2), capabilities(7), credentials(7), user_namespaces(7)
113 - Linux cli command process_vm_writev
NAME 🖥️ process_vm_writev 🖥️
transfer data between process address spaces
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/uio.h>
ssize_t process_vm_readv(pid_t pid,
                         const struct iovec *local_iov,
                         unsigned long liovcnt,
                         const struct iovec *remote_iov,
                         unsigned long riovcnt,
                         unsigned long flags);
ssize_t process_vm_writev(pid_t pid,
                          const struct iovec *local_iov,
                          unsigned long liovcnt,
                          const struct iovec *remote_iov,
                          unsigned long riovcnt,
                          unsigned long flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
process_vm_readv(), process_vm_writev():
_GNU_SOURCE
DESCRIPTION
These system calls transfer data between the address space of the calling process (“the local process”) and the process identified by pid (“the remote process”). The data moves directly between the address spaces of the two processes, without passing through kernel space.
The process_vm_readv() system call transfers data from the remote process to the local process. The data to be transferred is identified by remote_iov and riovcnt: remote_iov is a pointer to an array describing address ranges in the process pid, and riovcnt specifies the number of elements in remote_iov. The data is transferred to the locations specified by local_iov and liovcnt: local_iov is a pointer to an array describing address ranges in the calling process, and liovcnt specifies the number of elements in local_iov.
The process_vm_writev() system call is the converse of process_vm_readv(): it transfers data from the local process to the remote process. Other than the direction of the transfer, the arguments liovcnt, local_iov, riovcnt, and remote_iov have the same meaning as for process_vm_readv().
The local_iov and remote_iov arguments point to an array of iovec structures, described in iovec(3type).
Buffers are processed in array order. This means that process_vm_readv() completely fills local_iov[0] before proceeding to local_iov[1], and so on. Likewise, remote_iov[0] is completely read before proceeding to remote_iov[1], and so on.
Similarly, process_vm_writev() writes out the entire contents of local_iov[0] before proceeding to local_iov[1], and it completely fills remote_iov[0] before proceeding to remote_iov[1].
The lengths of remote_iov[i].iov_len and local_iov[i].iov_len do not have to be the same. Thus, it is possible to split a single local buffer into multiple remote buffers, or vice versa.
The flags argument is currently unused and must be set to 0.
The values specified in the liovcnt and riovcnt arguments must be less than or equal to IOV_MAX (defined in <limits.h> or accessible via the call sysconf(_SC_IOV_MAX)).
The count arguments and local_iov are checked before doing any transfers. If the counts are too big, or local_iov is invalid, or the addresses refer to regions that are inaccessible to the local process, none of the vectors will be processed and an error will be returned immediately.
Note, however, that these system calls do not check the memory regions in the remote process until just before doing the read/write. Consequently, a partial read/write (see RETURN VALUE) may result if one of the remote_iov elements points to an invalid memory region in the remote process. No further reads/writes will be attempted beyond that point. Keep this in mind when attempting to read data of unknown length (such as C strings that are null-terminated) from a remote process, by avoiding spanning memory pages (typically 4 KiB) in a single remote iovec element. (Instead, split the remote read into two remote_iov elements and have them merge back into a single write local_iov entry. The first read entry goes up to the page boundary, while the second starts on the next page boundary.)
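The page-boundary advice above can be sketched as follows. This is an illustration under stated assumptions: the helper name split_at_page_boundary() is hypothetical, the page size is taken from sysconf(_SC_PAGESIZE), and the requested length is assumed to be no larger than one page, so that each resulting element stays within a single page.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: describe `len` bytes at `remote_addr` using one
 * or two iovec elements, so that no element crosses a page boundary.
 * Assumes len is at most one page. Returns the number of elements
 * filled in (1 or 2), suitable for use as riovcnt. */
static int split_at_page_boundary(uintptr_t remote_addr, size_t len,
                                  struct iovec out[2])
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    size_t to_boundary = page - (size_t) (remote_addr % page);

    out[0].iov_base = (void *) remote_addr;
    if (len <= to_boundary) {
        out[0].iov_len = len;       /* fits in the first page */
        return 1;
    }

    /* First element stops at the boundary; second starts on the next page. */
    out[0].iov_len = to_boundary;
    out[1].iov_base = (void *) (remote_addr + to_boundary);
    out[1].iov_len = len - to_boundary;
    return 2;
}
```

The resulting array would be passed as remote_iov together with a single local_iov element of len bytes. If the call then returns exactly out[0].iov_len, the second page was presumably not readable, but the data from the first page is still available.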
Permission to read from or write to another process is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see ptrace(2).
RETURN VALUE
On success, process_vm_readv() returns the number of bytes read and process_vm_writev() returns the number of bytes written. This return value may be less than the total number of requested bytes, if a partial read/write occurred. (Partial transfers apply at the granularity of iovec elements. These system calls won’t perform a partial transfer that splits a single iovec element.) The caller should check the return value to determine whether a partial read/write occurred.
On error, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
The memory described by local_iov is outside the caller’s accessible address space.
EFAULT
The memory described by remote_iov is outside the accessible address space of the process pid.
EINVAL
The sum of the iov_len values of either local_iov or remote_iov overflows a ssize_t value.
EINVAL
flags is not 0.
EINVAL
liovcnt or riovcnt is too large.
ENOMEM
Could not allocate memory for internal copies of the iovec structures.
EPERM
The caller does not have permission to access the address space of the process pid.
ESRCH
No process with ID pid exists.
STANDARDS
Linux.
HISTORY
Linux 3.2, glibc 2.15.
NOTES
The data transfers performed by process_vm_readv() and process_vm_writev() are not guaranteed to be atomic in any way.
These system calls were designed to permit fast message passing by allowing messages to be exchanged with a single copy operation (rather than the double copy that would be required when using, for example, shared memory or pipes).
EXAMPLES
The following code sample demonstrates the use of process_vm_readv(). It reads 20 bytes at the address 0x10000 from the process with PID 10 and writes the first 10 bytes into buf1 and the second 10 bytes into buf2.
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>

int
main(void)
{
    char          buf1[10];
    char          buf2[10];
    pid_t         pid = 10;    /* PID of remote process */
    ssize_t       nread;
    struct iovec  local[2];
    struct iovec  remote[1];

    local[0].iov_base = buf1;
    local[0].iov_len = 10;
    local[1].iov_base = buf2;
    local[1].iov_len = 10;
    remote[0].iov_base = (void *) 0x10000;
    remote[0].iov_len = 20;

    nread = process_vm_readv(pid, local, 2, remote, 1, 0);
    if (nread != 20)
        exit(EXIT_FAILURE);

    exit(EXIT_SUCCESS);
}
SEE ALSO
readv(2), writev(2)
114 - Linux cli command munlockall
NAME 🖥️ munlockall 🖥️
lock and unlock memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
int mlock(const void addr[.len], size_t len);
int mlock2(const void addr[.len], size_t len, unsigned int flags);
int munlock(const void addr[.len], size_t len);
int mlockall(int flags);
int munlockall(void);
DESCRIPTION
mlock(), mlock2(), and mlockall() lock part or all of the calling process’s virtual address space into RAM, preventing that memory from being paged to the swap area.
munlock() and munlockall() perform the converse operation, unlocking part or all of the calling process’s virtual address space, so that pages in the specified virtual address range can be swapped out again if required by the kernel memory manager.
Memory locking and unlocking are performed in units of whole pages.
mlock(), mlock2(), and munlock()
mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.
mlock2() also locks pages in the specified range starting at addr and continuing for len bytes. However, the state of the pages contained in that range after the call returns successfully will depend on the value in the flags argument.
The flags argument can be either 0 or the following constant:
MLOCK_ONFAULT
Lock pages that are currently resident and mark the entire range so that the remaining nonresident pages are locked when they are populated by a page fault.
If flags is 0, mlock2() behaves exactly the same as mlock().
munlock() unlocks pages in the address range starting at addr and continuing for len bytes. After this call, all pages that contain a part of the specified memory range can be moved to external swap space again by the kernel.
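As a rough illustration of the mlock()/munlock() pairing described above, the sketch below locks one page of anonymous memory, uses it, wipes it, and unlocks it again. The function name with_locked_page() and the placeholder data are hypothetical. The call may legitimately fail, for example with ENOMEM when the RLIMIT_MEMLOCK limit is exceeded.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo: keep one page of sensitive data out of swap.
 * Returns 0 on success, or the errno value of the call that failed. */
static int with_locked_page(void)
{
    static const char demo[] = "placeholder secret";  /* stand-in data */
    size_t page = (size_t) sysconf(_SC_PAGESIZE);

    /* mmap() returns page-aligned memory, which keeps mlock() portable. */
    unsigned char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return errno;

    if (mlock(buf, page) == -1) {
        int saved = errno;
        munmap(buf, page);
        return saved;
    }

    memcpy(buf, demo, sizeof(demo));

    /* Wipe while the page is still locked, then release the lock. */
    volatile unsigned char *p = buf;
    for (size_t i = 0; i < page; i++)
        p[i] = 0;

    int rc = (munlock(buf, page) == -1) ? errno : 0;
    munmap(buf, page);
    return rc;
}
```

Wiping before munlock() is a design choice in this sketch: once the page is unlocked, the kernel is again free to move it to swap.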
mlockall() and munlockall()
mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data, and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.
The flags argument is constructed as the bitwise OR of one or more of the following constants:
MCL_CURRENT
Lock all pages which are currently mapped into the address space of the process.
MCL_FUTURE
Lock all pages which will become mapped into the address space of the process in the future. These could be, for instance, new pages required by a growing heap and stack as well as new memory-mapped files or shared memory regions.
MCL_ONFAULT (since Linux 4.4)
Used together with MCL_CURRENT, MCL_FUTURE, or both. Mark all current (with MCL_CURRENT) or future (with MCL_FUTURE) mappings to lock pages when they are faulted in. When used with MCL_CURRENT, all present pages are locked, but mlockall() will not fault in non-present pages. When used with MCL_FUTURE, all future mappings will be marked to lock pages when they are faulted in, but they will not be populated by the lock when the mapping is created. MCL_ONFAULT must be used with either MCL_CURRENT or MCL_FUTURE or both.
If MCL_FUTURE has been specified, then a later system call (e.g., mmap(2), sbrk(2), malloc(3)), may fail if it would cause the number of locked bytes to exceed the permitted maximum (see below). In the same circumstances, stack growth may likewise fail: the kernel will deny stack expansion and deliver a SIGSEGV signal to the process.
munlockall() unlocks all pages mapped into the address space of the calling process.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned, errno is set to indicate the error, and no changes are made to any locks in the address space of the process.
ERRORS
EAGAIN
(mlock(), mlock2(), and munlock()) Some or all of the specified address range could not be locked.
EINVAL
(mlock(), mlock2(), and munlock()) The result of the addition addr+len was less than addr (e.g., the addition may have resulted in an overflow).
EINVAL
(mlock2()) Unknown flags were specified.
EINVAL
(mlockall()) Unknown flags were specified or MCL_ONFAULT was specified without either MCL_FUTURE or MCL_CURRENT.
EINVAL
(Not on Linux) addr was not a multiple of the page size.
ENOMEM
(mlock(), mlock2(), and munlock()) Some of the specified address range does not correspond to mapped pages in the address space of the process.
ENOMEM
(mlock(), mlock2(), and munlock()) Locking or unlocking a region would result in the total number of mappings with distinct attributes (e.g., locked versus unlocked) exceeding the allowed maximum. (For example, unlocking a range in the middle of a currently locked mapping would result in three mappings: two locked mappings at each end and an unlocked mapping in the middle.)
ENOMEM
(Linux 2.6.9 and later) the caller had a nonzero RLIMIT_MEMLOCK soft resource limit, but tried to lock more memory than the limit permitted. This limit is not enforced if the process is privileged (CAP_IPC_LOCK).
ENOMEM
(Linux 2.4 and earlier) the calling process tried to lock more than half of RAM.
EPERM
The caller is not privileged, but needs privilege (CAP_IPC_LOCK) to perform the requested operation.
EPERM
(munlockall()) (Linux 2.6.8 and earlier) The caller was not privileged (CAP_IPC_LOCK).
VERSIONS
Linux
Under Linux, mlock(), mlock2(), and munlock() automatically round addr down to the nearest page boundary. However, the POSIX.1 specification of mlock() and munlock() allows an implementation to require that addr is page aligned, so portable applications should ensure this.
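A portable application can do the rounding itself, as in the sketch below. The names struct page_range, page_align_range(), and mlock_portable() are hypothetical helpers introduced only for this illustration; the page size is obtained with sysconf(_SC_PAGESIZE).

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical type: a range whose start is page aligned. */
struct page_range {
    void  *start;
    size_t len;
};

/* Round addr down to a page boundary and grow len by the same amount,
 * so the aligned range still covers every byte of the original one. */
static struct page_range page_align_range(const void *addr, size_t len)
{
    uintptr_t page  = (uintptr_t) sysconf(_SC_PAGESIZE);
    uintptr_t begin = (uintptr_t) addr;
    uintptr_t start = begin - (begin % page);
    struct page_range r = { (void *) start, len + (size_t) (begin - start) };
    return r;
}

/* Hypothetical wrapper: mlock() with an explicitly aligned address. */
static int mlock_portable(const void *addr, size_t len)
{
    struct page_range r = page_align_range(addr, len);
    return mlock(r.start, r.len);
}
```

For example, with 4096-byte pages a request for 50 bytes at offset 100 into a page becomes a request for 150 bytes starting at the page boundary.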
The VmLck field of the Linux-specific /proc/pid/status file shows how many kilobytes of memory the process with ID PID has locked using mlock(), mlock2(), mlockall(), and mmap(2) MAP_LOCKED.
STANDARDS
mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2008.
mlock2()
Linux.
On POSIX systems on which mlock() and munlock() are available, _POSIX_MEMLOCK_RANGE is defined in <unistd.h> and the number of bytes in a page can be determined from the constant PAGESIZE (if defined) in <limits.h> or by calling sysconf(_SC_PAGESIZE).
On POSIX systems on which mlockall() and munlockall() are available, _POSIX_MEMLOCK is defined in <unistd.h> to a value greater than 0. (See also sysconf(3).)
HISTORY
mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2001, POSIX.1-2008, SVr4.
mlock2()
Linux 4.4, glibc 2.27.
NOTES
Memory locking has two main applications: real-time algorithms and high-security data processing. Real-time applications require deterministic timing, and, like scheduling, paging is one major cause of unexpected program execution delays. Real-time applications will usually also switch to a real-time scheduler with sched_setscheduler(2). Cryptographic security software often handles critical bytes like passwords or secret keys as data structures. As a result of paging, these secrets could be transferred onto a persistent swap store medium, where they might be accessible to the enemy long after the security software has erased the secrets in RAM and terminated. (But be aware that the suspend mode on laptops and some desktop computers will save a copy of the system’s RAM to disk, regardless of memory locks.)
Real-time processes that are using mlockall() to prevent delays on page faults should reserve enough locked stack pages before entering the time-critical section, so that no page fault can be caused by function calls. This can be achieved by calling a function that allocates a sufficiently large automatic variable (an array) and writes to the memory occupied by this array in order to touch these stack pages. This way, enough pages will be mapped for the stack and can be locked into RAM. The dummy writes ensure that not even copy-on-write page faults can occur in the critical section.
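The stack pre-faulting technique described above might look like the following sketch. MAX_STACK_NEEDED is an assumed worst-case figure that a real application would have to determine for itself, and both function names are hypothetical.

```c
#include <stddef.h>
#include <sys/mman.h>

#define MAX_STACK_NEEDED (64 * 1024)   /* assumed worst case; tune per application */

/* Touch MAX_STACK_NEEDED bytes of stack so the pages are mapped (and,
 * with mlockall() in effect, locked). Returns the number of bytes written. */
static size_t prefault_stack(void)
{
    volatile unsigned char dummy[MAX_STACK_NEEDED];
    size_t touched = 0;

    /* volatile keeps the compiler from optimizing the dummy writes away. */
    for (size_t i = 0; i < sizeof(dummy); i++) {
        dummy[i] = 0;
        touched++;
    }
    return touched;
}

/* Hypothetical setup step to run before the time-critical section. */
static int enter_realtime_mode(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
        return -1;          /* errno set by mlockall() */
    prefault_stack();
    return 0;
}
```

Calling mlockall() first and touching the stack afterward means the newly mapped stack pages are covered by MCL_FUTURE as soon as they are faulted in.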
Memory locks are not inherited by a child created via fork(2) and are automatically removed (unlocked) during an execve(2) or when the process terminates. The mlockall() MCL_FUTURE and MCL_FUTURE | MCL_ONFAULT settings are not inherited by a child created via fork(2) and are cleared during an execve(2).
Note that fork(2) will prepare the address space for a copy-on-write operation. The consequence is that any write access that follows will cause a page fault that in turn may cause high latencies for a real-time process. Therefore, it is crucial not to invoke fork(2) after an mlockall() or mlock() operation, not even from a thread which runs at a low priority within a process which also has a thread running at elevated priority.
The memory lock on an address range is automatically removed if the address range is unmapped via munmap(2).
Memory locks do not stack, that is, pages which have been locked several times by calls to mlock(), mlock2(), or mlockall() will be unlocked by a single call to munlock() for the corresponding range or by munlockall(). Pages which are mapped to several locations or by several processes stay locked into RAM as long as they are locked at least at one location or by at least one process.
If a call to mlockall() which uses the MCL_FUTURE flag is followed by another call that does not specify this flag, the changes made by the MCL_FUTURE call will be lost.
The mlock2() MLOCK_ONFAULT flag and the mlockall() MCL_ONFAULT flag allow efficient memory locking for applications that deal with large mappings where only a (small) portion of pages in the mapping are touched. In such cases, locking all of the pages in a mapping would incur a significant penalty for memory locking.
Limits and permissions
In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK) in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines a limit on how much memory the process may lock.
Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process can lock and the RLIMIT_MEMLOCK soft resource limit instead defines a limit on how much memory an unprivileged process may lock.
BUGS
In Linux 4.8 and earlier, a bug in the kernel’s accounting of locked memory for unprivileged processes (i.e., without CAP_IPC_LOCK) meant that if the region specified by addr and len overlapped an existing lock, then the already locked bytes in the overlapping region were counted twice when checking against the limit. Such double accounting could incorrectly calculate a “total locked memory” value for the process that exceeded the RLIMIT_MEMLOCK limit, with the result that mlock() and mlock2() would fail on requests that should have succeeded. This bug was fixed in Linux 4.9.
In the Linux 2.4 series of kernels up to and including Linux 2.4.17, a bug caused the mlockall() MCL_FUTURE flag to be inherited across a fork(2). This was rectified in Linux 2.4.18.
Since Linux 2.6.9, if a privileged process calls mlockall(MCL_FUTURE) and later drops privileges (loses the CAP_IPC_LOCK capability by, for example, setting its effective UID to a nonzero value), then subsequent memory allocations (e.g., mmap(2), brk(2)) will fail if the RLIMIT_MEMLOCK resource limit is encountered.
SEE ALSO
mincore(2), mmap(2), setrlimit(2), shmctl(2), sysconf(3), proc(5), capabilities(7)
115 - Linux cli command io_getevents
NAME 🖥️ io_getevents 🖥️
read asynchronous I/O events from the completion queue
LIBRARY
Standard C library (libc, -lc)
Alternatively, Asynchronous I/O library (libaio, -laio); see VERSIONS.
SYNOPSIS
#include <linux/aio_abi.h> /* Definition of *io_* types */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_io_getevents, aio_context_t ctx_id,
long min_nr, long nr, struct io_event *events,
struct timespec *timeout);
Note: glibc provides no wrapper for io_getevents(), necessitating the use of syscall(2).
DESCRIPTION
Note: this page describes the raw Linux system call interface. The wrapper function provided by libaio uses a different type for the ctx_id argument. See VERSIONS.
The io_getevents() system call attempts to read at least min_nr events and up to nr events from the completion queue of the AIO context specified by ctx_id.
The timeout argument specifies the amount of time to wait for events, and is specified as a relative timeout in a timespec(3) structure.
The specified time will be rounded up to the system clock granularity and is guaranteed not to expire early.
Specifying timeout as NULL means block indefinitely until at least min_nr events have been obtained.
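As a sketch of the raw system call interface, the helper below polls a completion queue without blocking by passing min_nr as 0 together with a zero timeout. The name poll_completions() is hypothetical, and the code assumes a platform where the C library's struct timespec matches the layout the kernel expects for this system call (as on typical 64-bit systems).

```c
#define _GNU_SOURCE
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: fetch up to `nr` already-completed events from
 * the AIO context `ctx` without waiting. Returns the number of events
 * read (possibly 0), or -1 with errno set. */
static long poll_completions(aio_context_t ctx, struct io_event *events,
                             long nr)
{
    struct timespec zero = { .tv_sec = 0, .tv_nsec = 0 };

    /* min_nr == 0 plus a zero timeout: return immediately. */
    return syscall(SYS_io_getevents, ctx, 0L, nr, events, &zero);
}
```

The context passed in would come from a prior io_setup(2) call; passing NULL instead of &zero would make the call wait until min_nr events are available.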
RETURN VALUE
On success, io_getevents() returns the number of events read. This may be 0, or a value less than min_nr, if the timeout expired. It may also be a nonzero value less than min_nr, if the call was interrupted by a signal handler.
For the failure return, see VERSIONS.
ERRORS
EFAULT
Either events or timeout is an invalid pointer.
EINTR
Interrupted by a signal handler; see signal(7).
EINVAL
ctx_id is invalid. min_nr is out of range or nr is out of range.
ENOSYS
io_getevents() is not implemented on this architecture.
VERSIONS
You probably want to use the io_getevents() wrapper function provided by libaio.
Note that the libaio wrapper function uses a different type (io_context_t) for the ctx_id argument. Note also that the libaio wrapper does not follow the usual C library conventions for indicating errors: on error it returns a negated error number (the negative of one of the values listed in ERRORS). If the system call is invoked via syscall(2), then the return value follows the usual conventions for indicating an error: -1, with errno set to a (positive) value that indicates the error.
STANDARDS
Linux.
HISTORY
Linux 2.5.
BUGS
An invalid ctx_id may cause a segmentation fault instead of generating the error EINVAL.
SEE ALSO
io_cancel(2), io_destroy(2), io_setup(2), io_submit(2), timespec(3), aio(7), time(7)
116 - Linux cli command futex
NAME 🖥️ futex 🖥️
fast user-space locking
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/futex.h> /* Definition of FUTEX_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_futex, uint32_t *uaddr, int futex_op, uint32_t val,
             const struct timespec *timeout,   /* or: uint32_t val2 */
             uint32_t *uaddr2, uint32_t val3);
Note: glibc provides no wrapper for futex(), necessitating the use of syscall(2).
DESCRIPTION
The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. A user-space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition.
A futex is a 32-bit value (referred to below as a futex word) whose address is supplied to the futex() system call. (Futexes are 32 bits in size on all platforms, including 64-bit systems.) All futex operations are governed by this value. In order to share a futex between processes, the futex is placed in a region of shared memory, created using (for example) mmap(2) or shmat(2). (Thus, the futex word may have different virtual addresses in different processes, but these addresses all refer to the same location in physical memory.) In a multithreaded program, it is sufficient to place the futex word in a global variable shared by all threads.
When executing a futex operation that requests to block a thread, the kernel will block only if the futex word has the value that the calling thread supplied (as one of the arguments of the futex() call) as the expected value of the futex word. The loading of the futex word’s value, the comparison of that value with the expected value, and the actual blocking will happen atomically and will be totally ordered with respect to concurrent operations performed by other threads on the same futex word. Thus, the futex word is used to connect the synchronization in user space with the implementation of blocking by the kernel. Analogously to an atomic compare-and-exchange operation that potentially changes shared memory, blocking via a futex is an atomic compare-and-block operation.
One use of futexes is for implementing locks. The state of the lock (i.e., acquired or not acquired) can be represented as an atomically accessed flag in shared memory. In the uncontended case, a thread can access or modify the lock state with atomic instructions, for example atomically changing it from not acquired to acquired using an atomic compare-and-exchange instruction. (Such instructions are performed entirely in user mode, and the kernel maintains no information about the lock state.) On the other hand, a thread may be unable to acquire a lock because it is already acquired by another thread. It then may pass the lock’s flag as a futex word and the value representing the acquired state as the expected value to a futex() wait operation. This futex() operation will block if and only if the lock is still acquired (i.e., the value in the futex word still matches the “acquired state”). When releasing the lock, a thread has to first reset the lock state to not acquired and then execute a futex operation that wakes threads blocked on the lock flag used as a futex word (this can be further optimized to avoid unnecessary wake-ups). See futex(7) for more detail on how to use futexes.
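The locking scheme just described can be sketched roughly as follows. This is a deliberately simple illustration, not an optimized lock: it issues a wake-up on every release, the names futex_call(), lock_acquire(), and lock_release() are hypothetical, and it assumes that an _Atomic uint32_t has the same representation as a plain uint32_t, so that its address can be passed to the kernel.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper; glibc provides no futex() function of its own. */
static long futex_call(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Lock states used by this sketch: 0 = not acquired, 1 = acquired. */
static void lock_acquire(_Atomic uint32_t *word)
{
    for (;;) {
        uint32_t expected = 0;

        /* Uncontended case: done entirely in user space. */
        if (atomic_compare_exchange_strong(word, &expected, 1))
            return;

        /* Contended: sleep only if the word still reads "acquired".
         * Any return (wake-up, EAGAIN, EINTR) simply leads to a retry. */
        futex_call((uint32_t *) word, FUTEX_WAIT, 1);
    }
}

static void lock_release(_Atomic uint32_t *word)
{
    atomic_store(word, 0);                         /* reset the state first */
    futex_call((uint32_t *) word, FUTEX_WAKE, 1);  /* then wake one waiter */
}
```

A production lock would normally track whether any waiters exist, so that the FUTEX_WAKE call can be skipped in the uncontended case.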
Besides the basic wait and wake-up futex functionality, there are further futex operations aimed at supporting more complex use cases.
Note that no explicit initialization or destruction is necessary to use futexes; the kernel maintains a futex (i.e., the kernel-internal implementation artifact) only while operations such as FUTEX_WAIT, described below, are being performed on a particular futex word.
Arguments
The uaddr argument points to the futex word. On all platforms, futexes are four-byte integers that must be aligned on a four-byte boundary. The operation to perform on the futex is specified in the futex_op argument; val is a value whose meaning and purpose depends on futex_op.
The remaining arguments (timeout, uaddr2, and val3) are required only for certain of the futex operations described below. Where one of these arguments is not required, it is ignored.
For several blocking operations, the timeout argument is a pointer to a timespec structure that specifies a timeout for the operation. However, notwithstanding the prototype shown above, for some operations, the least significant four bytes of this argument are instead used as an integer whose meaning is determined by the operation. For these operations, the kernel casts the timeout value first to unsigned long, then to uint32_t, and in the remainder of this page, this argument is referred to as val2 when interpreted in this fashion.
Where it is required, the uaddr2 argument is a pointer to a second futex word that is employed by the operation.
The interpretation of the final integer argument, val3, depends on the operation.
Futex operations
The futex_op argument consists of two parts: a command that specifies the operation to be performed, bitwise ORed with zero or more options that modify the behaviour of the operation. The options that may be included in futex_op are as follows:
FUTEX_PRIVATE_FLAG (since Linux 2.6.22)
This option bit can be employed with all futex operations. It tells the kernel that the futex is process-private and not shared with another process (i.e., it is being used for synchronization only between threads of the same process). This allows the kernel to make some additional performance optimizations.
As a convenience, <linux/futex.h> defines a set of constants with the suffix _PRIVATE that are equivalents of all of the operations listed below, but with the FUTEX_PRIVATE_FLAG ORed into the constant value. Thus, there are FUTEX_WAIT_PRIVATE, FUTEX_WAKE_PRIVATE, and so on.
FUTEX_CLOCK_REALTIME (since Linux 2.6.28)
This option bit can be employed only with the FUTEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI, (since Linux 4.5) FUTEX_WAIT, and (since Linux 5.14) FUTEX_LOCK_PI2 operations.
If this option is set, the kernel measures the timeout against the CLOCK_REALTIME clock.
If this option is not set, the kernel measures the timeout against the CLOCK_MONOTONIC clock.
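As a rough illustration of how the command and option bits combine in futex_op, the following sketch splits a futex_op value back into its parts. The FUTEX_* constants, including FUTEX_CMD_MASK, come from <linux/futex.h>; the helper names are hypothetical and exist only for this example.

```c
#include <linux/futex.h>
#include <stdbool.h>

/* Hypothetical helpers; only the FUTEX_* constants are provided
   by <linux/futex.h>. */

/* Return the command part of futex_op, with the option bits masked off. */
static int
futex_cmd(int futex_op)
{
    return futex_op & FUTEX_CMD_MASK;
}

/* Return true if FUTEX_PRIVATE_FLAG is set (process-private futex). */
static bool
futex_is_private(int futex_op)
{
    return (futex_op & FUTEX_PRIVATE_FLAG) != 0;
}

/* Return true if timeouts are to be measured against CLOCK_REALTIME. */
static bool
futex_uses_realtime(int futex_op)
{
    return (futex_op & FUTEX_CLOCK_REALTIME) != 0;
}
```

For example, futex_cmd(FUTEX_WAIT_PRIVATE) yields FUTEX_WAIT, since FUTEX_WAIT_PRIVATE is simply FUTEX_WAIT with FUTEX_PRIVATE_FLAG ORed in.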
The operation specified in futex_op is one of the following:
FUTEX_WAIT (since Linux 2.6.0)
This operation tests that the value at the futex word pointed to by the address uaddr still contains the expected value val, and if so, then sleeps waiting for a FUTEX_WAKE operation on the futex word. The load of the value of the futex word is an atomic memory access (i.e., using atomic machine instructions of the respective architecture). This load, the comparison with the expected value, and starting to sleep are performed atomically and totally ordered with respect to other futex operations on the same futex word. If the thread starts to sleep, it is considered a waiter on this futex word. If the futex value does not match val, then the call fails immediately with the error EAGAIN.
The purpose of the comparison with the expected value is to prevent lost wake-ups. If another thread changed the value of the futex word after the calling thread decided to block based on the prior value, and if the other thread executed a FUTEX_WAKE operation (or similar wake-up) after the value change and before this FUTEX_WAIT operation, then the calling thread will observe the value change and will not start to sleep.
If the timeout is not NULL, the structure it points to specifies a timeout for the wait. (This interval will be rounded up to the system clock granularity, and is guaranteed not to expire early.) The timeout is by default measured according to the CLOCK_MONOTONIC clock, but, since Linux 4.5, the CLOCK_REALTIME clock can be selected by specifying FUTEX_CLOCK_REALTIME in futex_op. If timeout is NULL, the call blocks indefinitely.
Note: for FUTEX_WAIT, timeout is interpreted as a relative value. This differs from other futex operations, where timeout is interpreted as an absolute value. To obtain the equivalent of FUTEX_WAIT with an absolute timeout, employ FUTEX_WAIT_BITSET with val3 specified as FUTEX_BITSET_MATCH_ANY.
The arguments uaddr2 and val3 are ignored.
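The value check and the relative timeout of FUTEX_WAIT can be sketched as follows. This is a minimal illustration, not a reference implementation: futex_wait_ms() is a hypothetical helper, ms is assumed to be nonnegative, and the code assumes a platform on which SYS_futex accepts the struct timespec declared in <time.h> (as on x86-64).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: sleep while *uaddr == expected, for at most
   'ms' milliseconds. The timeout is relative and is measured against
   CLOCK_MONOTONIC, since FUTEX_CLOCK_REALTIME is not specified.

   Returns 0 on wake-up, or -1 with errno set: EAGAIN if *uaddr did
   not contain 'expected', ETIMEDOUT if the timeout expired. */
static long
futex_wait_ms(uint32_t *uaddr, uint32_t expected, long ms)
{
    struct timespec rel = {
        .tv_sec  = ms / 1000,
        .tv_nsec = (ms % 1000) * 1000000L,
    };

    return syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, expected,
                   &rel, NULL, 0);
}
```

If the futex word holds 1, then futex_wait_ms(&word, 0, 10) fails at once with EAGAIN, whereas futex_wait_ms(&word, 1, 10) sleeps and, absent a waker, fails with ETIMEDOUT.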
FUTEX_WAKE (since Linux 2.6.0)
This operation wakes at most val of the waiters that are waiting (e.g., inside FUTEX_WAIT) on the futex word at the address uaddr. Most commonly, val is specified as either 1 (wake up a single waiter) or INT_MAX (wake up all waiters). No guarantee is provided about which waiters are awoken (e.g., a waiter with a higher scheduling priority is not guaranteed to be awoken in preference to a waiter with a lower priority).
The arguments timeout, uaddr2, and val3 are ignored.
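To show how FUTEX_WAIT and FUTEX_WAKE are typically combined, here is a minimal sketch of a one-shot event shared by threads of one process. The event_t type and both functions are hypothetical names chosen for this example; the code assumes that _Atomic uint32_t has the same size and representation as uint32_t, which holds on common Linux platforms.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical one-shot event: state is 0 (not signaled) or 1 (signaled). */
typedef struct {
    _Atomic uint32_t state;
} event_t;

/* Block until the event has been signaled. */
static void
event_wait(event_t *ev)
{
    while (atomic_load(&ev->state) == 0) {
        /* The kernel sleeps only if state is still 0, so a wake-up
           that races with this call cannot be lost. Whatever the
           result (wake-up, EAGAIN, EINTR), the loop re-checks. */
        syscall(SYS_futex, &ev->state, FUTEX_WAIT_PRIVATE, 0,
                NULL, NULL, 0);
    }
}

/* Signal the event and wake all waiters.
   Returns the number of waiters woken, or -1 on error. */
static long
event_set(event_t *ev)
{
    atomic_store(&ev->state, 1);
    return syscall(SYS_futex, &ev->state, FUTEX_WAKE_PRIVATE, INT_MAX,
                   NULL, NULL, 0);
}
```

The waker changes the futex word before calling FUTEX_WAKE; this ordering is what lets the value check in FUTEX_WAIT prevent lost wake-ups.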
FUTEX_FD (from Linux 2.6.0 up to and including Linux 2.6.25)
This operation creates a file descriptor that is associated with the futex at uaddr. The caller must close the returned file descriptor after use. When another process or thread performs a FUTEX_WAKE on the futex word, the file descriptor is reported as readable by select(2), poll(2), and epoll(7).
The file descriptor can be used to obtain asynchronous notifications: if val is nonzero, then, when another process or thread executes a FUTEX_WAKE, the caller will receive the signal number that was passed in val.
The arguments timeout, uaddr2, and val3 are ignored.
Because it was inherently racy, FUTEX_FD has been removed from Linux 2.6.26 onward.
FUTEX_REQUEUE (since Linux 2.6.0)
This operation performs the same task as FUTEX_CMP_REQUEUE (see below), except that no check is made using the value in val3. (The argument val3 is ignored.)
FUTEX_CMP_REQUEUE (since Linux 2.6.7)
This operation first checks whether the location uaddr still contains the value val3. If not, the operation fails with the error EAGAIN. Otherwise, the operation wakes up a maximum of val waiters that are waiting on the futex at uaddr. If there are more than val waiters, then the remaining waiters are removed from the wait queue of the source futex at uaddr and added to the wait queue of the target futex at uaddr2. The val2 argument specifies an upper limit on the number of waiters that are requeued to the futex at uaddr2.
The load from uaddr is an atomic memory access (i.e., using atomic machine instructions of the respective architecture). This load, the comparison with val3, and the requeueing of any waiters are performed atomically and totally ordered with respect to other operations on the same futex word.
Typical values to specify for val are 0 or 1. (Specifying INT_MAX is not useful, because it would make the FUTEX_CMP_REQUEUE operation equivalent to FUTEX_WAKE.) The limit value specified via val2 is typically either 1 or INT_MAX. (Specifying the argument as 0 is not useful, because no waiters would then be requeued, which would make the FUTEX_CMP_REQUEUE operation equivalent to FUTEX_WAKE.)
The FUTEX_CMP_REQUEUE operation was added as a replacement for the earlier FUTEX_REQUEUE. The difference is that the check of the value at uaddr can be used to ensure that requeueing happens only under certain conditions, which allows race conditions to be avoided in certain use cases.
Both FUTEX_REQUEUE and FUTEX_CMP_REQUEUE can be used to avoid “thundering herd” wake-ups that could occur when using FUTEX_WAKE in cases where all of the waiters that are woken need to acquire another futex. Consider the following scenario, where multiple waiter threads are waiting on B, a wait queue implemented using a futex:
    lock(A);
    while (!check_value(V)) {
        unlock(A);
        block_on(B);
        lock(A);
    }
    unlock(A);
If a waker thread used FUTEX_WAKE, then all waiters waiting on B would be woken up, and they would all try to acquire lock A. However, waking all of the threads in this manner would be pointless because all except one of the threads would immediately block on lock A again. By contrast, a requeue operation wakes just one waiter and moves the other waiters to lock A, and when the woken waiter unlocks A then the next waiter can proceed.
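The requeue call for this scenario might be sketched as follows. Both function names are hypothetical; the point of the example is that val2 (the requeue limit) is passed in the position of the timeout argument, and val3 carries the value expected at uaddr.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper for FUTEX_CMP_REQUEUE: wake at most nr_wake
   waiters on uaddr and requeue at most nr_requeue of the remaining
   waiters to uaddr2, provided that *uaddr still equals 'expected'.
   Returns the number of waiters woken or requeued, or -1 with errno
   set (EAGAIN if *uaddr != expected). */
static long
futex_cmp_requeue(uint32_t *uaddr, uint32_t nr_wake, uint32_t nr_requeue,
                  uint32_t *uaddr2, uint32_t expected)
{
    /* val2 travels in the slot of the 'timeout' argument. */
    return syscall(SYS_futex, uaddr, FUTEX_CMP_REQUEUE_PRIVATE, nr_wake,
                   (void *) (uintptr_t) nr_requeue, uaddr2, expected);
}

/* Wake one waiter blocked on 'queue' (B in the scenario above) and
   move all other waiters to 'lock' (A in the scenario above). */
static long
wake_one_requeue_rest(uint32_t *queue, uint32_t *lock, uint32_t expected)
{
    return futex_cmp_requeue(queue, 1, INT_MAX, lock, expected);
}
```

With val set to 1 and val2 set to INT_MAX, a single waiter runs immediately while the others wait directly on the lock instead of being woken only to block again.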
FUTEX_WAKE_OP (since Linux 2.6.14)
This operation was added to support some user-space use cases where more than one futex must be handled at the same time. The most notable example is the implementation of pthread_cond_signal(3), which requires operations on two futexes, the one used to implement the mutex and the one used in the implementation of the wait queue associated with the condition variable. FUTEX_WAKE_OP allows such cases to be implemented without leading to high rates of contention and context switching.
The FUTEX_WAKE_OP operation is equivalent to executing the following code atomically and totally ordered with respect to other futex operations on either of the two supplied futex words:
    uint32_t oldval = *(uint32_t *) uaddr2;
    *(uint32_t *) uaddr2 = oldval op oparg;
    futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
    if (oldval cmp cmparg)
        futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
In other words, FUTEX_WAKE_OP does the following:
saves the original value of the futex word at uaddr2 and performs an operation to modify the value of the futex at uaddr2; this is an atomic read-modify-write memory access (i.e., using atomic machine instructions of the respective architecture)
wakes up a maximum of val waiters on the futex for the futex word at uaddr; and
dependent on the results of a test of the original value of the futex word at uaddr2, wakes up a maximum of val2 waiters on the futex for the futex word at uaddr2.
The operation and comparison that are to be performed are encoded in the bits of the argument val3. Pictorially, the encoding is:
    +---+---+-----------+-----------+
    |op |cmp|   oparg   |  cmparg   |
    +---+---+-----------+-----------+
      4   4       12          12    <== # of bits
Expressed in code, the encoding is:
    #define FUTEX_OP(op, oparg, cmp, cmparg) \
                    (((op & 0xf) << 28) | \
                    ((cmp & 0xf) << 24) | \
                    ((oparg & 0xfff) << 12) | \
                    (cmparg & 0xfff))
In the above, op and cmp are each one of the codes listed below. The oparg and cmparg components are literal numeric values, except as noted below.
The op component has one of the following values:
FUTEX_OP_SET 0 /* uaddr2 = oparg; */
FUTEX_OP_ADD 1 /* uaddr2 += oparg; */
FUTEX_OP_OR 2 /* uaddr2 |= oparg; */
FUTEX_OP_ANDN 3 /* uaddr2 &= ~oparg; */
FUTEX_OP_XOR 4 /* uaddr2 ^= oparg; */
In addition, bitwise ORing the following value into op causes (1 << oparg) to be used as the operand:
FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */
The cmp field is one of the following:
FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */
FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */
FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */
FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */
FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */
FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */
The return value of FUTEX_WAKE_OP is the sum of the number of waiters woken on the futex uaddr plus the number of waiters woken on the futex uaddr2.
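The val3 encoding can be explored with a small sketch like the one below. It relies on the FUTEX_OP() macro and the FUTEX_OP_* constants from <linux/futex.h>; the struct and the two functions are hypothetical names for this example, and the decoder simply extracts the raw bit fields shown in the diagram above.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical container for the four raw bit fields of val3. */
struct wake_op_fields {
    unsigned int op;      /* bits 28..31 */
    unsigned int cmp;     /* bits 24..27 */
    unsigned int oparg;   /* bits 12..23 */
    unsigned int cmparg;  /* bits  0..11 */
};

/* Split a val3 value into its raw fields. */
static struct wake_op_fields
decode_wake_op(uint32_t val3)
{
    struct wake_op_fields f = {
        .op     = (val3 >> 28) & 0xf,
        .cmp    = (val3 >> 24) & 0xf,
        .oparg  = (val3 >> 12) & 0xfff,
        .cmparg = val3 & 0xfff,
    };
    return f;
}

/* Atomically add 1 to *uaddr2, wake at most one waiter on uaddr, and,
   if the old value of *uaddr2 was greater than 0, wake at most one
   waiter on uaddr2. val2 is passed in the 'timeout' slot.
   Returns the total number of waiters woken, or -1 on error. */
static long
wake_and_increment(uint32_t *uaddr, uint32_t *uaddr2)
{
    uint32_t val3 = FUTEX_OP(FUTEX_OP_ADD, 1, FUTEX_OP_CMP_GT, 0);

    return syscall(SYS_futex, uaddr, FUTEX_WAKE_OP_PRIVATE, 1,
                   (void *) (uintptr_t) 1, uaddr2, val3);
}
```

When no thread is waiting on either futex word, wake_and_increment() returns 0, but the addition to the word at uaddr2 is still performed.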
FUTEX_WAIT_BITSET (since Linux 2.6.25)
This operation is like FUTEX_WAIT except that val3 is used to provide a 32-bit bit mask to the kernel. This bit mask, in which at least one bit must be set, is stored in the kernel-internal state of the waiter. See the description of FUTEX_WAKE_BITSET for further details.
If timeout is not NULL, the structure it points to specifies an absolute timeout for the wait operation. If timeout is NULL, the operation can block indefinitely.
The uaddr2 argument is ignored.
FUTEX_WAKE_BITSET (since Linux 2.6.25)
This operation is the same as FUTEX_WAKE except that the val3 argument is used to provide a 32-bit bit mask to the kernel. This bit mask, in which at least one bit must be set, is used to select which waiters should be woken up. The selection is done by a bitwise AND of the “wake” bit mask (i.e., the value in val3) and the bit mask which is stored in the kernel-internal state of the waiter (the “wait” bit mask that is set using FUTEX_WAIT_BITSET). All of the waiters for which the result of the AND is nonzero are woken up; the remaining waiters are left sleeping.
The effect of FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET is to allow selective wake-ups among multiple waiters that are blocked on the same futex. However, note that, depending on the use case, employing this bit-mask multiplexing feature on a futex can be less efficient than simply using multiple futexes, because employing bit-mask multiplexing requires the kernel to check all waiters on a futex, including those that are not interested in being woken up (i.e., they do not have the relevant bit set in their “wait” bit mask).
The constant FUTEX_BITSET_MATCH_ANY, which corresponds to all 32 bits set in the bit mask, can be used as the val3 argument for FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET. Other than differences in the handling of the timeout argument, the FUTEX_WAIT operation is equivalent to FUTEX_WAIT_BITSET with val3 specified as FUTEX_BITSET_MATCH_ANY; that is, allow a wake-up by any waker. The FUTEX_WAKE operation is equivalent to FUTEX_WAKE_BITSET with val3 specified as FUTEX_BITSET_MATCH_ANY; that is, wake up any waiter(s).
The uaddr2 and timeout arguments are ignored.
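A sketch of the bit-set operations follows. Both helpers are hypothetical; the wait helper converts a delay in milliseconds (assumed nonnegative) into the absolute CLOCK_MONOTONIC deadline that FUTEX_WAIT_BITSET expects, and it assumes a platform on which SYS_futex accepts the struct timespec declared in <time.h>.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: wait while *uaddr == expected, registering
   'mask' as the "wait" bit mask, until an absolute deadline that
   lies 'ms' milliseconds in the future on CLOCK_MONOTONIC.
   Returns 0 on wake-up, or -1 with errno set. */
static long
wait_bitset_ms(uint32_t *uaddr, uint32_t expected, uint32_t mask, long ms)
{
    struct timespec abs;

    clock_gettime(CLOCK_MONOTONIC, &abs);
    abs.tv_sec  += ms / 1000;
    abs.tv_nsec += (ms % 1000) * 1000000L;
    if (abs.tv_nsec >= 1000000000L) {   /* Normalize the nanoseconds */
        abs.tv_sec++;
        abs.tv_nsec -= 1000000000L;
    }

    return syscall(SYS_futex, uaddr, FUTEX_WAIT_BITSET_PRIVATE, expected,
                   &abs, NULL, mask);
}

/* Hypothetical helper: wake at most 'n' waiters whose "wait" bit mask
   has at least one bit in common with 'mask'.
   Returns the number of waiters woken, or -1 with errno set. */
static long
wake_bitset(uint32_t *uaddr, int n, uint32_t mask)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE_BITSET_PRIVATE, n,
                   NULL, NULL, mask);
}
```

Passing FUTEX_BITSET_MATCH_ANY as the mask to wait_bitset_ms() gives the equivalent of FUTEX_WAIT with an absolute timeout.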
Priority-inheritance futexes
Linux supports priority-inheritance (PI) futexes in order to handle priority-inversion problems that can be encountered with normal futex locks. Priority inversion is the problem that occurs when a high-priority task is blocked waiting to acquire a lock held by a low-priority task, while tasks at an intermediate priority continuously preempt the low-priority task from the CPU. Consequently, the low-priority task makes no progress toward releasing the lock, and the high-priority task remains blocked.
Priority inheritance is a mechanism for dealing with the priority-inversion problem. With this mechanism, when a high-priority task becomes blocked by a lock held by a low-priority task, the priority of the low-priority task is temporarily raised to that of the high-priority task, so that it is not preempted by any intermediate level tasks, and can thus make progress toward releasing the lock. To be effective, priority inheritance must be transitive, meaning that if a high-priority task blocks on a lock held by a lower-priority task that is itself blocked by a lock held by another intermediate-priority task (and so on, for chains of arbitrary length), then both of those tasks (or more generally, all of the tasks in a lock chain) have their priorities raised to be the same as the high-priority task.
From a user-space perspective, what makes a futex PI-aware is a policy agreement (described below) between user space and the kernel about the value of the futex word, coupled with the use of the PI-futex operations described below. (Unlike the other futex operations described above, the PI-futex operations are designed for the implementation of very specific IPC mechanisms.)
The PI-futex operations described below differ from the other futex operations in that they impose policy on the use of the value of the futex word:
If the lock is not acquired, the futex word’s value shall be 0.
If the lock is acquired, the futex word’s value shall be the thread ID (TID; see gettid(2)) of the owning thread.
If the lock is owned and there are threads contending for the lock, then the FUTEX_WAITERS bit shall be set in the futex word’s value; in other words, this value is:
FUTEX_WAITERS | TID
(Note that it is invalid for a PI futex word to have no owner and FUTEX_WAITERS set.)
With this policy in place, a user-space application can acquire an unacquired lock or release a lock using atomic instructions executed in user mode (e.g., a compare-and-swap operation such as cmpxchg on the x86 architecture). Acquiring a lock simply consists of using compare-and-swap to atomically set the futex word’s value to the caller’s TID if its previous value was 0. Releasing a lock requires using compare-and-swap to set the futex word’s value to 0 if the previous value was the expected TID.
If a futex is already acquired (i.e., has a nonzero value), waiters must employ the FUTEX_LOCK_PI or FUTEX_LOCK_PI2 operations to acquire the lock. If other threads are waiting for the lock, then the FUTEX_WAITERS bit is set in the futex value; in this case, the lock owner must employ the FUTEX_UNLOCK_PI operation to release the lock.
In the cases where callers are forced into the kernel (i.e., required to perform a futex() call), they then deal directly with a so-called RT-mutex, a kernel locking mechanism which implements the required priority-inheritance semantics. After the RT-mutex is acquired, the futex value is updated accordingly, before the calling thread returns to user space.
It is important to note that the kernel will update the futex word’s value prior to returning to user space. (This prevents the possibility of the futex word’s value ending up in an invalid state, such as having an owner but the value being 0, or having waiters but not having the FUTEX_WAITERS bit set.)
If a futex has an associated RT-mutex in the kernel (i.e., there are blocked waiters) and the owner of the futex/RT-mutex dies unexpectedly, then the kernel cleans up the RT-mutex and hands it over to the next waiter. This in turn requires that the user-space value is updated accordingly. To indicate that this is required, the kernel sets the FUTEX_OWNER_DIED bit in the futex word along with the thread ID of the new owner. User space can detect this situation via the presence of the FUTEX_OWNER_DIED bit and is then responsible for cleaning up the stale state left over by the dead owner.
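The value policy and the user-space fast path can be sketched as follows. This is a simplified illustration under stated assumptions, not how any particular library implements PI mutexes: all function names are hypothetical, handling of FUTEX_OWNER_DIED is left to the caller, and _Atomic uint32_t is assumed to have the same representation as uint32_t.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thread ID of the caller, as stored in a PI futex word. */
static uint32_t
my_tid(void)
{
    return (uint32_t) syscall(SYS_gettid);
}

/* Acquire: try 0 -> TID in user space; otherwise ask the kernel. */
static long
pi_lock(_Atomic uint32_t *word)
{
    uint32_t expected = 0;

    if (atomic_compare_exchange_strong(word, &expected, my_tid()))
        return 0;                       /* Uncontended fast path */

    return syscall(SYS_futex, word, FUTEX_LOCK_PI_PRIVATE, 0,
                   NULL, NULL, 0);
}

/* Release: try TID -> 0 in user space; this fails if FUTEX_WAITERS
   is set, in which case the kernel must hand over the lock. */
static long
pi_unlock(_Atomic uint32_t *word)
{
    uint32_t expected = my_tid();

    if (atomic_compare_exchange_strong(word, &expected, 0))
        return 0;                       /* No waiters */

    return syscall(SYS_futex, word, FUTEX_UNLOCK_PI_PRIVATE, 0,
                   NULL, NULL, 0);
}

/* Decode the fields of a PI futex word value. */
static uint32_t pi_owner(uint32_t v)       { return v & FUTEX_TID_MASK; }
static bool     pi_has_waiters(uint32_t v) { return (v & FUTEX_WAITERS) != 0; }
static bool     pi_owner_died(uint32_t v)  { return (v & FUTEX_OWNER_DIED) != 0; }
```

In the uncontended case neither function enters the kernel; the futex() calls are reached only when the compare-and-swap fails.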
PI futexes are operated on by specifying one of the values listed below in futex_op. Note that the PI futex operations must be used as paired operations and are subject to some additional requirements:
FUTEX_LOCK_PI, FUTEX_LOCK_PI2, and FUTEX_TRYLOCK_PI pair with FUTEX_UNLOCK_PI. FUTEX_UNLOCK_PI must be called only on a futex owned by the calling thread, as defined by the value policy, otherwise the error EPERM results.
FUTEX_WAIT_REQUEUE_PI pairs with FUTEX_CMP_REQUEUE_PI. This must be performed from a non-PI futex to a distinct PI futex (or the error EINVAL results). Additionally, val (the number of waiters to be woken) must be 1 (or the error EINVAL results).
The PI futex operations are as follows:
FUTEX_LOCK_PI (since Linux 2.6.18)
This operation is used after an attempt to acquire the lock via an atomic user-mode instruction failed because the futex word has a nonzero value, specifically because it contained the (PID-namespace-specific) TID of the lock owner.
The operation checks the value of the futex word at the address uaddr. If the value is 0, then the kernel tries to atomically set the futex value to the caller’s TID. If the futex word’s value is nonzero, the kernel atomically sets the FUTEX_WAITERS bit, which signals the futex owner that it cannot unlock the futex in user space atomically by setting the futex value to 0. After that, the kernel:
Tries to find the thread which is associated with the owner TID.
Creates or reuses kernel state on behalf of the owner. (If this is the first waiter, there is no kernel state for this futex, so kernel state is created by locking the RT-mutex and the futex owner is made the owner of the RT-mutex. If there are existing waiters, then the existing state is reused.)
Attaches the waiter to the futex (i.e., the waiter is enqueued on the RT-mutex waiter list).
If more than one waiter exists, the enqueueing of the waiter is in descending priority order. (For information on priority ordering, see the discussion of the SCHED_DEADLINE, SCHED_FIFO, and SCHED_RR scheduling policies in sched(7).) The owner inherits either the waiter’s CPU bandwidth (if the waiter is scheduled under the SCHED_DEADLINE policy) or the waiter’s priority (if the waiter is scheduled under the SCHED_RR or SCHED_FIFO policy). This inheritance follows the lock chain in the case of nested locking and performs deadlock detection.
The timeout argument provides a timeout for the lock attempt. If timeout is not NULL, the structure it points to specifies an absolute timeout, measured against the CLOCK_REALTIME clock. If timeout is NULL, the operation will block indefinitely.
The uaddr2, val, and val3 arguments are ignored.
FUTEX_LOCK_PI2 (since Linux 5.14)
This operation is the same as FUTEX_LOCK_PI, except that the clock against which timeout is measured is selectable. By default, the (absolute) timeout specified in timeout is measured against the CLOCK_MONOTONIC clock, but if the FUTEX_CLOCK_REALTIME flag is specified in futex_op, then the timeout is measured against the CLOCK_REALTIME clock.
FUTEX_TRYLOCK_PI (since Linux 2.6.18)
This operation tries to acquire the lock at uaddr. It is invoked when a user-space atomic acquire did not succeed because the futex word was not 0.
Because the kernel has access to more state information than user space, acquisition of the lock might succeed if performed by the kernel in cases where the futex word (i.e., the state information accessible to user space) contains stale state (FUTEX_WAITERS and/or FUTEX_OWNER_DIED). This can happen when the owner of the futex died. User space cannot handle this condition in a race-free manner, but the kernel can fix this up and acquire the futex.
The uaddr2, val, timeout, and val3 arguments are ignored.
FUTEX_UNLOCK_PI (since Linux 2.6.18)
This operation wakes the top priority waiter that is waiting in FUTEX_LOCK_PI or FUTEX_LOCK_PI2 on the futex address provided by the uaddr argument.
This is called when the user-space value at uaddr cannot be changed atomically from a TID (of the owner) to 0.
The uaddr2, val, timeout, and val3 arguments are ignored.
FUTEX_CMP_REQUEUE_PI (since Linux 2.6.31)
This operation is a PI-aware variant of FUTEX_CMP_REQUEUE. It requeues waiters that are blocked via FUTEX_WAIT_REQUEUE_PI on uaddr from a non-PI source futex (uaddr) to a PI target futex (uaddr2).
As with FUTEX_CMP_REQUEUE, this operation wakes up a maximum of val waiters that are waiting on the futex at uaddr. However, for FUTEX_CMP_REQUEUE_PI, val is required to be 1 (since the main point is to avoid a thundering herd). The remaining waiters are removed from the wait queue of the source futex at uaddr and added to the wait queue of the target futex at uaddr2.
The val2 and val3 arguments serve the same purposes as for FUTEX_CMP_REQUEUE.
FUTEX_WAIT_REQUEUE_PI (since Linux 2.6.31)
Wait on a non-PI futex at uaddr and potentially be requeued (via a FUTEX_CMP_REQUEUE_PI operation in another task) onto a PI futex at uaddr2. The wait operation on uaddr is the same as for FUTEX_WAIT.
The waiter can be removed from the wait on uaddr without requeueing on uaddr2 via a FUTEX_WAKE operation in another task. In this case, the FUTEX_WAIT_REQUEUE_PI operation fails with the error EAGAIN.
If timeout is not NULL, the structure it points to specifies an absolute timeout for the wait operation. If timeout is NULL, the operation can block indefinitely.
The val3 argument is ignored.
The FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI operations were added to support a fairly specific use case: support for priority-inheritance-aware POSIX threads condition variables. The idea is that these operations should always be paired, in order to ensure that user space and the kernel remain in sync. Thus, in the FUTEX_WAIT_REQUEUE_PI operation, the user-space application pre-specifies the target of the requeue that takes place in the FUTEX_CMP_REQUEUE_PI operation.
RETURN VALUE
In the event of an error (and assuming that futex() was invoked via syscall(2)), all operations return -1 and set errno to indicate the error.
The return value on success depends on the operation, as described in the following list:
FUTEX_WAIT
Returns 0 if the caller was woken up. Note that a wake-up can also be caused by common futex usage patterns in unrelated code that happened to have previously used the futex word’s memory location (e.g., typical futex-based implementations of Pthreads mutexes can cause this under some conditions). Therefore, callers should always conservatively assume that a return value of 0 can mean a spurious wake-up, and use the futex word’s value (i.e., the user-space synchronization scheme) to decide whether to continue to block or not.
FUTEX_WAKE
Returns the number of waiters that were woken up.
FUTEX_FD
Returns the new file descriptor associated with the futex.
FUTEX_REQUEUE
Returns the number of waiters that were woken up.
FUTEX_CMP_REQUEUE
Returns the total number of waiters that were woken up or requeued to the futex for the futex word at uaddr2. If this value is greater than val, then the difference is the number of waiters requeued to the futex for the futex word at uaddr2.
FUTEX_WAKE_OP
Returns the total number of waiters that were woken up. This is the sum of the woken waiters on the two futexes for the futex words at uaddr and uaddr2.
FUTEX_WAIT_BITSET
Returns 0 if the caller was woken up. See FUTEX_WAIT for how to interpret this correctly in practice.
FUTEX_WAKE_BITSET
Returns the number of waiters that were woken up.
FUTEX_LOCK_PI
Returns 0 if the futex was successfully locked.
FUTEX_LOCK_PI2
Returns 0 if the futex was successfully locked.
FUTEX_TRYLOCK_PI
Returns 0 if the futex was successfully locked.
FUTEX_UNLOCK_PI
Returns 0 if the futex was successfully unlocked.
FUTEX_CMP_REQUEUE_PI
Returns the total number of waiters that were woken up or requeued to the futex for the futex word at uaddr2. If this value is greater than val, then the difference is the number of waiters requeued to the futex for the futex word at uaddr2.
FUTEX_WAIT_REQUEUE_PI
Returns 0 if the caller was successfully requeued to the futex for the futex word at uaddr2.
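Because a return value of 0 from FUTEX_WAIT may be a spurious wake-up, waiting code usually re-checks the futex word in a loop. A minimal sketch, using a hypothetical helper name and assuming that _Atomic uint32_t has the same representation as uint32_t:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: block until *word no longer equals 'val'.
   Returns 0 once the value has changed, or -1 with errno set if
   futex() failed for a reason other than EAGAIN or EINTR. */
static int
wait_while_equal(_Atomic uint32_t *word, uint32_t val)
{
    while (atomic_load(word) == val) {
        long s = syscall(SYS_futex, word, FUTEX_WAIT_PRIVATE, val,
                         NULL, NULL, 0);

        /* s == 0 may be a spurious wake-up; EAGAIN means that the
           value had already changed; EINTR means that a signal
           arrived. In each case, the loop condition decides. */
        if (s == -1 && errno != EAGAIN && errno != EINTR)
            return -1;
    }

    return 0;
}
```

The decision to keep blocking is based only on the value of the futex word, never on the return value of the system call alone.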
ERRORS
EACCES
No read access to the memory of a futex word.
EAGAIN
(FUTEX_WAIT, FUTEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI) The value pointed to by uaddr was not equal to the expected value val at the time of the call.
Note: on Linux, the symbolic names EAGAIN and EWOULDBLOCK (both of which appear in different parts of the kernel futex code) have the same value.
EAGAIN
(FUTEX_CMP_REQUEUE, FUTEX_CMP_REQUEUE_PI) The value pointed to by uaddr is not equal to the expected value val3.
EAGAIN
(FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI) The futex owner thread ID of uaddr (for FUTEX_CMP_REQUEUE_PI: uaddr2) is about to exit, but has not yet handled the internal state cleanup. Try again.
EDEADLK
(FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI) The futex word at uaddr is already locked by the caller.
EDEADLK
(FUTEX_CMP_REQUEUE_PI) While requeueing a waiter to the PI futex for the futex word at uaddr2, the kernel detected a deadlock.
EFAULT
A required pointer argument (i.e., uaddr, uaddr2, or timeout) did not point to a valid user-space address.
EINTR
A FUTEX_WAIT or FUTEX_WAIT_BITSET operation was interrupted by a signal (see signal(7)). Before Linux 2.6.22, this error could also be returned for a spurious wakeup; since Linux 2.6.22, this no longer happens.
EINVAL
The operation in futex_op is one of those that employs a timeout, but the supplied timeout argument was invalid (tv_sec was less than zero, or tv_nsec was not less than 1,000,000,000).
EINVAL
The operation specified in futex_op employs one or both of the pointers uaddr and uaddr2, but one of these does not point to a valid object; that is, the address is not four-byte-aligned.
EINVAL
(FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET) The bit mask supplied in val3 is zero.
EINVAL
(FUTEX_CMP_REQUEUE_PI) uaddr equals uaddr2 (i.e., an attempt was made to requeue to the same futex).
EINVAL
(FUTEX_FD) The signal number supplied in val is invalid.
EINVAL
(FUTEX_WAKE, FUTEX_WAKE_OP, FUTEX_WAKE_BITSET, FUTEX_REQUEUE, FUTEX_CMP_REQUEUE) The kernel detected an inconsistency between the user-space state at uaddr and the kernel state; that is, it detected a waiter which waits in FUTEX_LOCK_PI or FUTEX_LOCK_PI2 on uaddr.
EINVAL
(FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_UNLOCK_PI) The kernel detected an inconsistency between the user-space state at uaddr and the kernel state. This indicates either state corruption or that the kernel found a waiter on uaddr which is waiting via FUTEX_WAIT or FUTEX_WAIT_BITSET.
EINVAL
(FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency between the user-space state at uaddr2 and the kernel state; that is, the kernel detected a waiter which waits via FUTEX_WAIT or FUTEX_WAIT_BITSET on uaddr2.
EINVAL
(FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency between the user-space state at uaddr and the kernel state; that is, the kernel detected a waiter which waits via FUTEX_WAIT or FUTEX_WAIT_BITSET on uaddr.
EINVAL
(FUTEX_CMP_REQUEUE_PI) The kernel detected an inconsistency between the user-space state at uaddr and the kernel state; that is, the kernel detected a waiter which waits on uaddr via FUTEX_LOCK_PI or FUTEX_LOCK_PI2 (instead of FUTEX_WAIT_REQUEUE_PI).
EINVAL
(FUTEX_CMP_REQUEUE_PI) An attempt was made to requeue a waiter to a futex other than that specified by the matching FUTEX_WAIT_REQUEUE_PI call for that waiter.
EINVAL
(FUTEX_CMP_REQUEUE_PI) The val argument is not 1.
EINVAL
Invalid argument.
ENFILE
(FUTEX_FD) The system-wide limit on the total number of open files has been reached.
ENOMEM
(FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI) The kernel could not allocate memory to hold state information.
ENOSYS
Invalid operation specified in futex_op.
ENOSYS
The FUTEX_CLOCK_REALTIME option was specified in futex_op, but the accompanying operation was neither FUTEX_WAIT, FUTEX_WAIT_BITSET, FUTEX_WAIT_REQUEUE_PI, nor FUTEX_LOCK_PI2.
ENOSYS
(FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_UNLOCK_PI, FUTEX_CMP_REQUEUE_PI, FUTEX_WAIT_REQUEUE_PI) A run-time check determined that the operation is not available. The PI-futex operations are not implemented on all architectures and are not supported on some CPU variants.
EPERM
(FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI) The caller is not allowed to attach itself to the futex at uaddr (for FUTEX_CMP_REQUEUE_PI: the futex at uaddr2). (This may be caused by a state corruption in user space.)
EPERM
(FUTEX_UNLOCK_PI) The caller does not own the lock represented by the futex word.
ESRCH
(FUTEX_LOCK_PI, FUTEX_LOCK_PI2, FUTEX_TRYLOCK_PI, FUTEX_CMP_REQUEUE_PI) The thread ID in the futex word at uaddr does not exist.
ESRCH
(FUTEX_CMP_REQUEUE_PI) The thread ID in the futex word at uaddr2 does not exist.
ETIMEDOUT
The operation in futex_op employed the timeout specified in timeout, and the timeout expired before the operation completed.
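Several of the errors listed above can be provoked from a single thread, which may help when experimenting with the interface. The probe below is a hypothetical helper; it takes uaddr as a void pointer so that a deliberately misaligned address can be passed, and it assumes a platform on which SYS_futex accepts the struct timespec declared in <time.h>.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical probe: attempt a FUTEX_WAIT and return 0 if the call
   succeeded, or the resulting errno value if it failed. */
static int
futex_wait_errno(void *uaddr, uint32_t val, const struct timespec *timeout)
{
    if (syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, val,
                timeout, NULL, 0) == -1)
        return errno;

    return 0;
}
```

With a futex word holding 0 and a one-millisecond timeout, the probe returns EAGAIN when val is 1, ETIMEDOUT when val is 0 and no waker exists, and EINVAL when tv_nsec is 1,000,000,000 or when the address is not four-byte-aligned.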
STANDARDS
Linux.
HISTORY
Linux 2.6.0.
Initial futex support was merged in Linux 2.5.7 but with different semantics from what was described above. A four-argument system call with the semantics described in this page was introduced in Linux 2.5.40. A fifth argument was added in Linux 2.5.70, and a sixth argument was added in Linux 2.6.7.
EXAMPLES
The program below demonstrates use of futexes in a program where a parent process and a child process use a pair of futexes located inside a shared anonymous mapping to synchronize access to a shared resource: the terminal. The two processes each write nloops (a command-line argument that defaults to 5 if omitted) messages to the terminal and employ a synchronization protocol that ensures that they alternate in writing messages. Upon running this program we see output such as the following:
$ ./futex_demo
Parent (18534) 0
Child (18535) 0
Parent (18534) 1
Child (18535) 1
Parent (18534) 2
Child (18535) 2
Parent (18534) 3
Child (18535) 3
Parent (18534) 4
Child (18535) 4
Program source
/* futex_demo.c
Usage: futex_demo [nloops]
(Default: 5)
Demonstrate the use of futexes in a program where parent and child
use a pair of futexes located inside a shared anonymous mapping to
synchronize access to a shared resource: the terminal. The two
processes each write 'num-loops' messages to the terminal and employ
a synchronization protocol that ensures that they alternate in
writing messages.
*/
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>
static uint32_t *futex1, *futex2, *iaddr;
static int
futex(uint32_t *uaddr, int futex_op, uint32_t val,
const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
{
return syscall(SYS_futex, uaddr, futex_op, val,
timeout, uaddr2, val3);
}
/* Acquire the futex pointed to by 'futexp': wait for its value to
become 1, and then set the value to 0. */
static void
fwait(uint32_t *futexp)
{
long s;
const uint32_t one = 1;
/* atomic_compare_exchange_strong(ptr, oldval, newval)
atomically performs the equivalent of:
if (*ptr == *oldval)
*ptr = newval;
It returns true if the test yielded true and *ptr was updated. */
while (1) {
/* Is the futex available? */
if (atomic_compare_exchange_strong(futexp, &one, 0))
break; /* Yes */
/* Futex is not available; wait. */
s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0);
if (s == -1 && errno != EAGAIN)
err(EXIT_FAILURE, "futex-FUTEX_WAIT");
}
}
/* Release the futex pointed to by 'futexp': if the futex currently
has the value 0, set its value to 1 and then wake any futex waiters,
so that if the peer is blocked in fwait(), it can proceed. */
static void
fpost(uint32_t *futexp)
{
long s;
const uint32_t zero = 0;
/* atomic_compare_exchange_strong() was described
in comments above. */
if (atomic_compare_exchange_strong(futexp, &zero, 1)) {
s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0);
if (s == -1)
err(EXIT_FAILURE, "futex-FUTEX_WAKE");
}
}
int
main(int argc, char *argv[])
{
pid_t childPid;
unsigned int nloops;
setbuf(stdout, NULL);
nloops = (argc > 1) ? atoi(argv[1]) : 5;
/* Create a shared anonymous mapping that will hold the futexes.
Since the futexes are being shared between processes, we
subsequently use the "shared" futex operations (i.e., not the
ones suffixed "_PRIVATE"). */
iaddr = mmap(NULL, sizeof(*iaddr) * 2, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
if (iaddr == MAP_FAILED)
err(EXIT_FAILURE, "mmap");
futex1 = &iaddr[0];
futex2 = &iaddr[1];
*futex1 = 0; /* State: unavailable */
*futex2 = 1; /* State: available */
/* Create a child process that inherits the shared anonymous
mapping. */
childPid = fork();
if (childPid == -1)
err(EXIT_FAILURE, "fork");
if (childPid == 0) { /* Child */
for (unsigned int j = 0; j < nloops; j++) {
fwait(futex1);
printf("Child (%jd) %u\n", (intmax_t) getpid(), j);
fpost(futex2);
}
exit(EXIT_SUCCESS);
}
/* Parent falls through to here. */
for (unsigned int j = 0; j < nloops; j++) {
fwait(futex2);
printf("Parent (%jd) %u\n", (intmax_t) getpid(), j);
fpost(futex1);
}
wait(NULL);
exit(EXIT_SUCCESS);
}
SEE ALSO
get_robust_list(2), restart_syscall(2), pthread_mutexattr_getprotocol(3), futex(7), sched(7)
The following kernel source files:
Documentation/pi-futex.txt
Documentation/futex-requeue-pi.txt
Documentation/locking/rt-mutex.txt
Documentation/locking/rt-mutex-design.txt
Documentation/robust-futex-ABI.txt
Franke, H., Russell, R., and Kirkwood, M., 2002. Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux (from proceedings of the Ottawa Linux Symposium 2002).
Hart, D., 2009. A futex overview and update.
Hart, D. and Guniguntala, D., 2009. Requeue-PI: Making glibc Condvars PI-Aware (from proceedings of the 2009 Real-Time Linux Workshop).
Drepper, U., 2011. Futexes Are Tricky.
Futex example library, futex-*.tar.bz2 at
117 - Linux cli command semtimedop
NAME 🖥️ semtimedop 🖥️
System V semaphore operations
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/sem.h>
int semop(int semid, struct sembuf *sops, size_t nsops);
int semtimedop(int semid, struct sembuf *sops, size_t nsops,
const struct timespec *_Nullable timeout);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
semtimedop():
_GNU_SOURCE
DESCRIPTION
Each semaphore in a System V semaphore set has the following associated values:
unsigned short semval; /* semaphore value */
unsigned short semzcnt; /* # waiting for zero */
unsigned short semncnt; /* # waiting for increase */
pid_t sempid; /* PID of process that last
modified the semaphore value */
semop() performs operations on selected semaphores in the set indicated by semid. Each of the nsops elements in the array pointed to by sops is a structure that specifies an operation to be performed on a single semaphore. The elements of this structure are of type struct sembuf, containing the following members:
unsigned short sem_num; /* semaphore number */
short sem_op; /* semaphore operation */
short sem_flg; /* operation flags */
Flags recognized in sem_flg are IPC_NOWAIT and SEM_UNDO. If an operation specifies SEM_UNDO, it will be automatically undone when the process terminates.
The set of operations contained in sops is performed in array order, and atomically, that is, the operations are performed either as a complete unit, or not at all. The behavior of the system call if not all operations can be performed immediately depends on the presence of the IPC_NOWAIT flag in the individual sem_flg fields, as noted below.
Each operation is performed on the sem_num-th semaphore of the semaphore set, where the first semaphore of the set is numbered 0. There are three types of operation, distinguished by the value of sem_op.
If sem_op is a positive integer, the operation adds this value to the semaphore value (semval). Furthermore, if SEM_UNDO is specified for this operation, the system subtracts the value sem_op from the semaphore adjustment (semadj) value for this semaphore. This operation can always proceed; it never forces a thread to wait. The calling process must have alter permission on the semaphore set.
If sem_op is zero, the process must have read permission on the semaphore set. This is a “wait-for-zero” operation: if semval is zero, the operation can immediately proceed. Otherwise, if IPC_NOWAIT is specified in sem_flg, semop() fails with errno set to EAGAIN (and none of the operations in sops is performed). Otherwise, semzcnt (the count of threads waiting until this semaphore’s value becomes zero) is incremented by one and the thread sleeps until one of the following occurs:
semval becomes 0, at which time the value of semzcnt is decremented.
The semaphore set is removed: semop() fails, with errno set to EIDRM.
The calling thread catches a signal: the value of semzcnt is decremented and semop() fails, with errno set to EINTR.
If sem_op is less than zero, the process must have alter permission on the semaphore set. If semval is greater than or equal to the absolute value of sem_op, the operation can proceed immediately: the absolute value of sem_op is subtracted from semval, and, if SEM_UNDO is specified for this operation, the system adds the absolute value of sem_op to the semaphore adjustment (semadj) value for this semaphore. If the absolute value of sem_op is greater than semval, and IPC_NOWAIT is specified in sem_flg, semop() fails, with errno set to EAGAIN (and none of the operations in sops is performed). Otherwise, semncnt (the counter of threads waiting for this semaphore’s value to increase) is incremented by one and the thread sleeps until one of the following occurs:
semval becomes greater than or equal to the absolute value of sem_op: the operation now proceeds, as described above.
The semaphore set is removed from the system: semop() fails, with errno set to EIDRM.
The calling thread catches a signal: the value of semncnt is decremented and semop() fails, with errno set to EINTR.
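The IPC_NOWAIT case for a negative sem_op can be sketched as a non-blocking "try to acquire" operation. This is a minimal sketch, not part of the original page: make_sem() and try_acquire() are hypothetical helper names, and the set is created with IPC_PRIVATE and mode 0600 purely for illustration.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* semctl(2) requires the caller to define this union. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Hypothetical helper: create a private set holding one semaphore and
   give it the value 'initial'. Returns the set ID, or -1 on error. */
static int
make_sem(int initial)
{
    union semun arg;
    int semid = semget(IPC_PRIVATE, 1, 0600);

    if (semid == -1)
        return -1;
    arg.val = initial;
    if (semctl(semid, 0, SETVAL, arg) == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    return semid;
}

/* Hypothetical helper: try to subtract 1 from semaphore 0 without
   sleeping. Returns 1 if the operation was performed, 0 if it would
   have blocked (semop() failed with EAGAIN), and -1 on other errors. */
static int
try_acquire(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1,
                         .sem_flg = IPC_NOWAIT };

    if (semop(semid, &op, 1) == 0)
        return 1;
    return (errno == EAGAIN) ? 0 : -1;
}
```

With a semaphore created by make_sem(1), the first call to try_acquire() performs the decrement and a second call reports that it would have blocked, leaving semval at 0.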
On successful completion, the sempid value for each semaphore specified in the array pointed to by sops is set to the caller’s process ID. In addition, the sem_otime is set to the current time.
semtimedop()
semtimedop() behaves identically to semop() except that in those cases where the calling thread would sleep, the duration of that sleep is limited by the amount of elapsed time specified by the timespec structure whose address is passed in the timeout argument. (This sleep interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the interval may overrun by a small amount.) If the specified time limit has been reached, semtimedop() fails with errno set to EAGAIN (and none of the operations in sops is performed). If the timeout argument is NULL, then semtimedop() behaves exactly like semop().
Note that if semtimedop() is interrupted by a signal, causing the call to fail with the error EINTR, the contents of timeout are left unchanged.
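As a rough illustration of the time limit, the following sketch waits on semaphore 0 for a bounded interval. The helper name acquire_timed() and its millisecond parameter are assumptions made for this example; only semtimedop() itself comes from this page.

```c
#define _GNU_SOURCE             /* Exposes semtimedop() in <sys/sem.h> */
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: subtract 1 from semaphore 0 of 'semid', sleeping
   for at most 'ms' milliseconds per attempt. Returns 0 on success, 1 if
   the time limit expired (EAGAIN), and -1 on any other error.

   Because 'ts' is left unchanged when the call fails with EINTR, the
   retry below starts a fresh, full-length interval; a caller that needs
   an overall deadline would have to recompute the remaining time. */
static int
acquire_timed(int semid, long ms)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct timespec ts = { .tv_sec = ms / 1000,
                           .tv_nsec = (ms % 1000) * 1000000L };

    for (;;) {
        if (semtimedop(semid, &op, 1, &ts) == 0)
            return 0;
        if (errno == EAGAIN)
            return 1;
        if (errno != EINTR)
            return -1;
    }
}
```

Passing NULL instead of &ts would make the call behave like semop(), as described above.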
RETURN VALUE
On success, semop() and semtimedop() return 0. On failure, they return -1, and set errno to indicate the error.
ERRORS
E2BIG
The argument nsops is greater than SEMOPM, the maximum number of operations allowed per system call.
EACCES
The calling process does not have the permissions required to perform the specified semaphore operations, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EAGAIN
An operation could not proceed immediately and either IPC_NOWAIT was specified in sem_flg or the time limit specified in timeout expired.
EFAULT
An address specified in either the sops or the timeout argument isn’t accessible.
EFBIG
For some operation the value of sem_num is less than 0 or greater than or equal to the number of semaphores in the set.
EIDRM
The semaphore set was removed.
EINTR
While blocked in this system call, the thread caught a signal; see signal(7).
EINVAL
The semaphore set doesn’t exist, or semid is less than zero, or nsops has a nonpositive value.
ENOMEM
The sem_flg of some operation specified SEM_UNDO and the system does not have enough memory to allocate the undo structure.
ERANGE
For some operation sem_op+semval is greater than SEMVMX, the implementation dependent maximum value for semval.
STANDARDS
POSIX.1-2008.
VERSIONS
semop(): POSIX.1-2001, SVr4.
semtimedop(): Linux 2.5.52 (backported into Linux 2.4.22), glibc 2.3.3.
NOTES
The sem_undo structures of a process aren’t inherited by the child produced by fork(2), but they are inherited across an execve(2) system call.
semop() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.
A semaphore adjustment (semadj) value is a per-process, per-semaphore integer that is the negated sum of all operations performed on a semaphore specifying the SEM_UNDO flag. Each process has a list of semadj values: one value for each semaphore on which it has operated using SEM_UNDO. When a process terminates, each of its per-semaphore semadj values is added to the corresponding semaphore, thus undoing the effect of that process’s operations on the semaphore (but see BUGS below). When a semaphore’s value is directly set using the SETVAL or SETALL request to semctl(2), the corresponding semadj values in all processes are cleared. The clone(2) CLONE_SYSVSEM flag allows more than one process to share a semadj list; see clone(2) for details.
The semval, sempid, semzcnt, and semncnt values for a semaphore can all be retrieved using appropriate semctl(2) calls.
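The effect of a semadj value at process termination can be observed with a short experiment. The sketch below is illustrative only; semval_after_child_exit() is a hypothetical helper, and the semaphore set is assumed to have been created by the caller.

```c
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that adds 'delta' (> 0) to
   semaphore 0 using SEM_UNDO and then terminates without reversing the
   operation. Returns the semaphore value seen by the parent after the
   child has been reaped, or -1 on error. */
static int
semval_after_child_exit(int semid, short delta)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {             /* Child */
        struct sembuf op = { .sem_num = 0, .sem_op = delta,
                             .sem_flg = SEM_UNDO };

        /* The child's semadj for this semaphore becomes -delta. */
        _exit(semop(semid, &op, 1) == 0 ? 0 : 1);
    }

    /* Parent: wait for the child, whose semadj value is added to the
       semaphore as part of process termination. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return semctl(semid, 0, GETVAL);
}
```

If the semaphore held the value 3 before the call, the child raises it by delta and the adjustment applied at exit brings it back to 3.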
Semaphore limits
The following limits on semaphore set resources affect the semop() call:
SEMOPM
Maximum number of operations allowed for one semop() call. Before Linux 3.19, the default value for this limit was 32. Since Linux 3.19, the default value is 500. On Linux, this limit can be read and modified via the third field of /proc/sys/kernel/sem. Note: this limit should not be raised above 1000, because of the risk that semop() fails due to kernel memory fragmentation when allocating memory to copy the sops array.
SEMVMX
Maximum allowable value for semval: implementation dependent (32767).
The implementation has no intrinsic limits for the adjust-on-exit maximum value (SEMAEM), the system-wide maximum number of undo structures (SEMMNU), or the per-process maximum number of undo entries.
BUGS
When a process terminates, its set of associated semadj structures is used to undo the effect of all of the semaphore operations it performed with the SEM_UNDO flag. This raises a difficulty: if one (or more) of these semaphore adjustments would result in an attempt to decrease a semaphore’s value below zero, what should an implementation do? One possible approach would be to block until all the semaphore adjustments could be performed. This is however undesirable since it could force process termination to block for arbitrarily long periods. Another possibility is that such semaphore adjustments could be ignored altogether (somewhat analogously to failing when IPC_NOWAIT is specified for a semaphore operation). Linux adopts a third approach: decreasing the semaphore value as far as possible (i.e., to zero) and allowing process termination to proceed immediately.
In Linux 2.6.x, x <= 10, there is a bug that in some circumstances prevents a thread that is waiting for a semaphore value to become zero from being woken up when the value does actually become zero. This bug is fixed in Linux 2.6.11.
EXAMPLES
The following code segment uses semop() to atomically wait for the value of semaphore 0 to become zero, and then increment the semaphore value by one.
struct sembuf sops[2];
int semid;
/* Code to set semid omitted */
sops[0].sem_num = 0; /* Operate on semaphore 0 */
sops[0].sem_op = 0; /* Wait for value to equal 0 */
sops[0].sem_flg = 0;
sops[1].sem_num = 0; /* Operate on semaphore 0 */
sops[1].sem_op = 1; /* Increment value by one */
sops[1].sem_flg = 0;
if (semop(semid, sops, 2) == -1) {
perror("semop");
exit(EXIT_FAILURE);
}
A further example of the use of semop() can be found in shmop(2).
SEE ALSO
clone(2), semctl(2), semget(2), sigaction(2), capabilities(7), sem_overview(7), sysvipc(7), time(7)
118 - Linux cli command ssetmask
NAME 🖥️ ssetmask 🖥️
manipulation of signal mask (obsolete)
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
[[deprecated]] long syscall(SYS_sgetmask, void);
[[deprecated]] long syscall(SYS_ssetmask, long newmask);
DESCRIPTION
These system calls are obsolete. Do not use them; use sigprocmask(2) instead.
sgetmask() returns the signal mask of the calling process.
ssetmask() sets the signal mask of the calling process to the value given in newmask. The previous signal mask is returned.
The signal masks dealt with by these two system calls are plain bit masks (unlike the sigset_t used by sigprocmask(2)); use sigmask(3) to create and inspect these masks.
RETURN VALUE
sgetmask() always successfully returns the signal mask. ssetmask() always succeeds, and returns the previous signal mask.
ERRORS
These system calls always succeed.
STANDARDS
Linux.
HISTORY
Since Linux 3.16, support for these system calls is optional, depending on whether the kernel was built with the CONFIG_SGETMASK_SYSCALL option.
NOTES
These system calls are unaware of signal numbers greater than 31 (i.e., real-time signals).
These system calls do not exist on x86-64.
It is not possible to block SIGSTOP or SIGKILL.
SEE ALSO
sigprocmask(2), signal(7)
119 - Linux cli command truncate
NAME 🖥️ truncate 🖥️
truncate a file to a specified length
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int truncate(const char *path, off_t length);
int ftruncate(int fd, off_t length);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
truncate():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| /* glibc <= 2.19: */ _BSD_SOURCE
ftruncate():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.3.5: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes.
If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0').
The file offset is not changed.
If the size changed, then the st_ctime and st_mtime fields (respectively, time of last status change and time of last modification; see inode(7)) for the file are updated, and the set-user-ID and set-group-ID mode bits may be cleared.
With ftruncate(), the file must be open for writing; with truncate(), the file must be writable.
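The extension behavior and the unchanged file offset can be checked with a small experiment. This is a sketch under stated assumptions: extend_demo() is a hypothetical helper, and the temporary file is placed under /tmp only for illustration.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write "abc" to a temporary file, extend the file
   to 8 bytes with ftruncate(), and copy the resulting contents into
   'out'. The file offset observed after ftruncate() is stored in
   *offset. Returns 0 on success, -1 on error. */
static int
extend_demo(unsigned char out[8], off_t *offset)
{
    char path[] = "/tmp/truncate-demo-XXXXXX";
    int ret = -1;
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);               /* File is removed once fd is closed */

    if (write(fd, "abc", 3) != 3)       /* File offset is now 3 */
        goto out;
    if (ftruncate(fd, 8) == -1)         /* Extend from 3 to 8 bytes */
        goto out;
    *offset = lseek(fd, 0, SEEK_CUR);   /* Query the current offset */
    if (*offset == -1)
        goto out;
    if (pread(fd, out, 8, 0) != 8)      /* Read back the whole file */
        goto out;
    ret = 0;
out:
    close(fd);
    return ret;
}
```

After a successful call, the first three bytes of out hold "abc", the remaining five are null bytes, and the reported offset is still 3.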
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
For truncate():
EACCES
Search permission is denied for a component of the path prefix, or the named file is not writable by the user. (See also path_resolution(7).)
EFAULT
The argument path points outside the process’s allocated address space.
EFBIG
The argument length is larger than the maximum file size. (XSI)
EINTR
While blocked waiting to complete, the call was interrupted by a signal handler; see fcntl(2) and signal(7).
EINVAL
The argument length is negative or larger than the maximum file size.
EIO
An I/O error occurred updating the inode.
EISDIR
The named file is a directory.
ELOOP
Too many symbolic links were encountered in translating the pathname.
ENAMETOOLONG
A component of a pathname exceeded 255 characters, or an entire pathname exceeded 1023 characters.
ENOENT
The named file does not exist.
ENOTDIR
A component of the path prefix is not a directory.
EPERM
The underlying filesystem does not support extending a file beyond its current size.
EPERM
The operation was prevented by a file seal; see fcntl(2).
EROFS
The named file resides on a read-only filesystem.
ETXTBSY
The file is an executable file that is being executed.
For ftruncate() the same errors apply, but instead of things that can be wrong with path, we now have things that can be wrong with the file descriptor, fd:
EBADF
fd is not a valid file descriptor.
EBADF or EINVAL
fd is not open for writing.
EINVAL
fd does not reference a regular file or a POSIX shared memory object.
EINVAL or EBADF
The file descriptor fd is not open for writing. POSIX permits, and portable applications should handle, either error for this case. (Linux produces EINVAL.)
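A portable caller should accept either error for a descriptor that is not open for writing. The sketch below probes this case; readonly_truncate_errno() is a hypothetical helper, and the /tmp location is an assumption made for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: create a scratch file, reopen it read-only, and
   attempt ftruncate() on that descriptor. Returns the errno value set
   by ftruncate(), 0 if the call unexpectedly succeeded, or -1 if the
   scratch file could not be set up. */
static int
readonly_truncate_errno(void)
{
    char path[] = "/tmp/ftruncate-ro-XXXXXX";
    int err;
    int fd = mkstemp(path);     /* mkstemp() opens the file read-write */

    if (fd == -1)
        return -1;
    close(fd);

    fd = open(path, O_RDONLY);  /* Reopen without write access */
    unlink(path);
    if (fd == -1)
        return -1;

    err = (ftruncate(fd, 0) == -1) ? errno : 0;
    close(fd);
    return err;
}
```

Code that checks the result would typically test for (err == EINVAL || err == EBADF) rather than for one specific value.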
VERSIONS
The details in DESCRIPTION are for XSI-compliant systems. For non-XSI-compliant systems, the POSIX standard allows two behaviors for ftruncate() when length exceeds the file length (note that truncate() is not specified at all in such an environment): either returning an error, or extending the file. Like most UNIX implementations, Linux follows the XSI requirement when dealing with native filesystems. However, some nonnative filesystems do not permit truncate() and ftruncate() to be used to extend a file beyond its current length: a notable example on Linux is VFAT.
On some 32-bit architectures, the calling signature for these system calls differs, for the reasons described in syscall(2).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.4BSD, SVr4 (first appeared in 4.2BSD).
The original Linux truncate() and ftruncate() system calls were not designed to handle large file offsets. Consequently, Linux 2.4 added truncate64() and ftruncate64() system calls that handle large files. However, these details can be ignored by applications using glibc, whose wrapper functions transparently employ the more recent system calls where they are available.
NOTES
ftruncate() can also be used to set the size of a POSIX shared memory object; see shm_open(3).
BUGS
A header file bug in glibc 2.12 meant that the minimum value of _POSIX_C_SOURCE required to expose the declaration of ftruncate() was 200809L instead of 200112L. This has been fixed in later glibc versions.
SEE ALSO
truncate(1), open(2), stat(2), path_resolution(7)
120 - Linux cli command mbind
NAME 🖥️ mbind 🖥️
set memory policy for a memory range
LIBRARY
NUMA (Non-Uniform Memory Access) policy library (libnuma, -lnuma)
SYNOPSIS
#include <numaif.h>
long mbind(void addr[.len], unsigned long len, int mode,
const unsigned long nodemask[(.maxnode + ULONG_WIDTH - 1)
/ ULONG_WIDTH],
unsigned long maxnode, unsigned int flags);
DESCRIPTION
mbind() sets the NUMA memory policy, which consists of a policy mode and zero or more nodes, for the memory range starting with addr and continuing for len bytes. The memory policy defines from which node memory is allocated.
If the memory range specified by the addr and len arguments includes an “anonymous” region of memory (that is, a region of memory created using the mmap(2) system call with the MAP_ANONYMOUS flag) or a memory-mapped file mapped using the mmap(2) system call with the MAP_PRIVATE flag, pages will be allocated according to the specified policy only when the application writes (stores) to the page. For anonymous regions, an initial read access will use a shared page in the kernel containing all zeros. For a file mapped with MAP_PRIVATE, an initial read access will allocate pages according to the memory policy of the thread that causes the page to be allocated. This may not be the thread that called mbind().
The specified policy will be ignored for any MAP_SHARED mappings in the specified memory range. Rather the pages will be allocated according to the memory policy of the thread that caused the page to be allocated. Again, this may not be the thread that called mbind().
If the specified memory range includes a shared memory region created using the shmget(2) system call and attached using the shmat(2) system call, pages allocated for the anonymous or shared memory region will be allocated according to the policy specified, regardless of which process attached to the shared memory segment causes the allocation. If, however, the shared memory region was created with the SHM_HUGETLB flag, the huge pages will be allocated according to the policy specified only if the page allocation is caused by the process that calls mbind() for that region.
By default, mbind() has an effect only for new allocations; if the pages inside the range have been already touched before setting the policy, then the policy has no effect. This default behavior may be overridden by the MPOL_MF_MOVE and MPOL_MF_MOVE_ALL flags described below.
The mode argument must specify one of MPOL_DEFAULT, MPOL_BIND, MPOL_INTERLEAVE, MPOL_WEIGHTED_INTERLEAVE, MPOL_PREFERRED, or MPOL_LOCAL (which are described in detail below). All policy modes except MPOL_DEFAULT require the caller to specify the node or nodes to which the mode applies, via the nodemask argument.
The mode argument may also include an optional mode flag. The supported mode flags are:
MPOL_F_NUMA_BALANCING (since Linux 5.15)
When mode is MPOL_BIND, enable the kernel NUMA balancing for the task if it is supported by the kernel. If the flag isn’t supported by the kernel, or is used with mode other than MPOL_BIND, -1 is returned and errno is set to EINVAL.
MPOL_F_STATIC_NODES (since Linux 2.6.26)
A nonempty nodemask specifies physical node IDs. Linux does not remap the nodemask when the thread moves to a different cpuset context, nor when the set of nodes allowed by the thread’s current cpuset context changes.
MPOL_F_RELATIVE_NODES (since Linux 2.6.26)
A nonempty nodemask specifies node IDs that are relative to the set of node IDs allowed by the thread’s current cpuset.
nodemask points to a bit mask of nodes containing up to maxnode bits. The bit mask size is rounded to the next multiple of sizeof(unsigned long), but the kernel will use bits only up to maxnode. A NULL value of nodemask or a maxnode value of zero specifies the empty set of nodes. If the value of maxnode is zero, the nodemask argument is ignored. Where a nodemask is required, it must contain at least one node that is on-line, allowed by the thread’s current cpuset context (unless the MPOL_F_STATIC_NODES mode flag is specified), and contains memory.
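The bit mask layout described above can be sketched with a few helpers. These only build the mask and do not call mbind(); the names BITS_PER_ULONG, nodemask_words(), nodemask_alloc(), and nodemask_set() are hypothetical and are not part of libnuma.

```c
#include <limits.h>
#include <stdlib.h>

/* Number of bits in an unsigned long (ULONG_WIDTH in the SYNOPSIS). */
#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned long words needed to hold 'maxnode' bits. */
static size_t
nodemask_words(unsigned long maxnode)
{
    return (maxnode + BITS_PER_ULONG - 1) / BITS_PER_ULONG;
}

/* Allocate a zeroed mask large enough for 'maxnode' bits. Returns NULL
   if 'maxnode' is 0 or if memory allocation fails. */
static unsigned long *
nodemask_alloc(unsigned long maxnode)
{
    size_t n = nodemask_words(maxnode);

    return n ? calloc(n, sizeof(unsigned long)) : NULL;
}

/* Set the bit for 'node'. Returns 0 on success, or -1 if 'node' does
   not fit in a mask of 'maxnode' bits. */
static int
nodemask_set(unsigned long *mask, unsigned long maxnode,
             unsigned long node)
{
    if (node >= maxnode)
        return -1;
    mask[node / BITS_PER_ULONG] |= 1UL << (node % BITS_PER_ULONG);
    return 0;
}
```

A mask built this way, together with the same maxnode value, would then be passed as the nodemask and maxnode arguments.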
The mode argument must include one of the following values:
MPOL_DEFAULT
This mode requests that any nondefault policy be removed, restoring default behavior. When applied to a range of memory via mbind(), this means to use the thread memory policy, which may have been set with set_mempolicy(2). If the mode of the thread memory policy is also MPOL_DEFAULT, the system-wide default policy will be used. The system-wide default policy allocates pages on the node of the CPU that triggers the allocation. For MPOL_DEFAULT, the nodemask and maxnode arguments must specify the empty set of nodes.
MPOL_BIND
This mode specifies a strict policy that restricts memory allocation to the nodes specified in nodemask. If nodemask specifies more than one node, page allocations will come from the node with sufficient free memory that is closest to the node where the allocation takes place. Pages will not be allocated from any node not specified in nodemask. (Before Linux 2.6.26, page allocations came from the node with the lowest numeric node ID first, until that node contained no free memory. Allocations then came from the node with the next highest node ID specified in nodemask and so forth, until none of the specified nodes contained free memory.)
MPOL_INTERLEAVE
This mode specifies that page allocations be interleaved across the set of nodes specified in nodemask. This optimizes for bandwidth instead of latency by spreading out pages and memory accesses to those pages across multiple nodes. To be effective the memory area should be fairly large, at least 1 MB or bigger with a fairly uniform access pattern. Accesses to a single page of the area will still be limited to the memory bandwidth of a single node.
MPOL_WEIGHTED_INTERLEAVE (since Linux 6.9)
This mode interleaves page allocations across the nodes specified in nodemask according to the weights in /sys/kernel/mm/mempolicy/weighted_interleave. For example, if bits 0, 2, and 5 are set in nodemask, and the contents of /sys/kernel/mm/mempolicy/weighted_interleave/node0, /sys/.../node2, and /sys/.../node5 are 4, 7, and 9, respectively, then pages in this region will be allocated on nodes 0, 2, and 5 in a 4:7:9 ratio.
MPOL_PREFERRED
This mode sets the preferred node for allocation. The kernel will try to allocate pages from this node first and fall back to other nodes if the preferred node is low on free memory. If nodemask specifies more than one node ID, the first node in the mask will be selected as the preferred node. If the nodemask and maxnode arguments specify the empty set, then the memory is allocated on the node of the CPU that triggered the allocation.
MPOL_LOCAL (since Linux 3.8)
This mode specifies “local allocation”; the memory is allocated on the node of the CPU that triggered the allocation (the “local node”). The nodemask and maxnode arguments must specify the empty set. If the “local node” is low on free memory, the kernel will try to allocate memory from other nodes. The kernel will allocate memory from the “local node” whenever memory for this node is available. If the “local node” is not allowed by the thread’s current cpuset context, the kernel will try to allocate memory from other nodes. The kernel will allocate memory from the “local node” whenever it becomes allowed by the thread’s current cpuset context. By contrast, MPOL_DEFAULT reverts to the memory policy of the thread (which may be set via set_mempolicy(2)); that policy may be something other than “local allocation”.
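The 4:7:9 ratio given for MPOL_WEIGHTED_INTERLEAVE above can be checked with a simple model. The function below is an illustrative weighted round-robin written for this example; it is not the kernel's implementation, and weighted_node_index() is a hypothetical name.

```c
#include <stddef.h>

/* Illustrative model only: map the i-th allocated page to an index into
   the list of nodes, giving weights[k] consecutive pages to node k in
   each round. Returns (size_t) -1 if every weight is zero. */
static size_t
weighted_node_index(size_t page, const unsigned *weights, size_t n)
{
    size_t round = 0;

    for (size_t k = 0; k < n; k++)
        round += weights[k];
    if (round == 0)
        return (size_t) -1;

    size_t pos = page % round;  /* Position within the current round */

    for (size_t k = 0; k < n; k++) {
        if (pos < weights[k])
            return k;
        pos -= weights[k];
    }
    return (size_t) -1;         /* Not reached */
}
```

With weights 4, 7, and 9, each round of 20 pages places 4 pages on the first node, 7 on the second, and 9 on the third.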
If MPOL_MF_STRICT is passed in flags and mode is not MPOL_DEFAULT, then the call fails with the error EIO if the existing pages in the memory range don’t follow the policy.
If MPOL_MF_MOVE is specified in flags, then the kernel will attempt to move all the existing pages in the memory range so that they follow the policy. Pages that are shared with other processes will not be moved. If MPOL_MF_STRICT is also specified, then the call fails with the error EIO if some pages could not be moved. If the MPOL_INTERLEAVE policy was specified, pages already residing on the specified nodes will not be moved such that they are interleaved.
If MPOL_MF_MOVE_ALL is passed in flags, then the kernel will attempt to move all existing pages in the memory range regardless of whether other processes use the pages. The calling thread must be privileged (CAP_SYS_NICE) to use this flag. If MPOL_MF_STRICT is also specified, then the call fails with the error EIO if some pages could not be moved. If the MPOL_INTERLEAVE policy was specified, pages already residing on the specified nodes will not be moved such that they are interleaved.
RETURN VALUE
On success, mbind() returns 0; on error, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
Part or all of the memory range specified by nodemask and maxnode points outside your accessible address space. Or, there was an unmapped hole in the memory range specified by addr and len.
EINVAL
An invalid value was specified for flags or mode; or addr + len was less than addr; or addr is not a multiple of the system page size. Or, mode is MPOL_DEFAULT and nodemask specified a nonempty set; or mode is MPOL_BIND or MPOL_INTERLEAVE and nodemask is empty. Or, maxnode exceeds a kernel-imposed limit. Or, nodemask specifies one or more node IDs that are greater than the maximum supported node ID. Or, none of the node IDs specified by nodemask are on-line and allowed by the thread’s current cpuset context, or none of the specified nodes contain memory. Or, the mode argument specified both MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES.
EIO
MPOL_MF_STRICT was specified and an existing page was already on a node that does not follow the policy; or MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and the kernel was unable to move all existing pages in the range.
ENOMEM
Insufficient kernel memory was available.
EPERM
The flags argument included the MPOL_MF_MOVE_ALL flag and the caller does not have the CAP_SYS_NICE privilege.
STANDARDS
Linux.
HISTORY
Linux 2.6.7.
Support for huge page policy was added with Linux 2.6.16. For interleave policy to be effective on huge page mappings the policied memory needs to be tens of megabytes or larger.
Before Linux 5.7, MPOL_MF_STRICT was ignored on huge page mappings.
MPOL_MF_MOVE and MPOL_MF_MOVE_ALL are available only on Linux 2.6.16 and later.
NOTES
For information on library support, see numa(7).
NUMA policy is not supported on a memory-mapped file range that was mapped with the MAP_SHARED flag.
The MPOL_DEFAULT mode can have different effects for mbind() and set_mempolicy(2). When MPOL_DEFAULT is specified for set_mempolicy(2), the thread’s memory policy reverts to the system default policy or local allocation. When MPOL_DEFAULT is specified for a range of memory using mbind(), any pages subsequently allocated for that range will use the thread’s memory policy, as set by set_mempolicy(2). This effectively removes the explicit policy from the specified range, “falling back” to a possibly nondefault policy. To select explicit “local allocation” for a memory range, specify a mode of MPOL_LOCAL or MPOL_PREFERRED with an empty set of nodes. This method will work for set_mempolicy(2), as well.
SEE ALSO
get_mempolicy(2), getcpu(2), mmap(2), set_mempolicy(2), shmat(2), shmget(2), numa(3), cpuset(7), numa(7), numactl(8)
121 - Linux cli command setreuid32
NAME 🖥️ setreuid32 🖥️
set real and/or effective user or group ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setreuid(uid_t ruid, uid_t euid);
int setregid(gid_t rgid, gid_t egid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
setreuid(), setregid():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
setreuid() sets real and effective user IDs of the calling process.
Supplying a value of -1 for either the real or effective user ID forces the system to leave that ID unchanged.
Unprivileged processes may only set the effective user ID to the real user ID, the effective user ID, or the saved set-user-ID.
Unprivileged users may only set the real user ID to the real user ID or the effective user ID.
If the real user ID is set (i.e., ruid is not -1) or the effective user ID is set to a value not equal to the previous real user ID, the saved set-user-ID will be set to the new effective user ID.
Completely analogously, setregid() sets the real and effective group IDs of the calling process, and all of the above holds with “group” instead of “user”.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where setreuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from setreuid().
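The rules above, together with the advice to check every return value, can be sketched as follows. This is a minimal illustration rather than a definitive privilege-dropping routine: the helper name drop_to_real_ids() is hypothetical, and it assumes the program wants to give up any set-user-ID or set-group-ID privilege permanently by setting both real and effective IDs to the current real IDs.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set the real and effective IDs to the current
 * real IDs.  Because ruid is not -1, the saved set-user-ID is also
 * replaced, so the old effective ID cannot be regained afterwards.
 * Returns 0 on success, -1 on failure. */
int drop_to_real_ids(void)
{
    gid_t rgid = getgid();
    uid_t ruid = getuid();

    /* Change the group IDs first: after the user IDs are dropped the
     * process may no longer be permitted to change its group IDs. */
    if (setregid(rgid, rgid) == -1) {
        perror("setregid");
        return -1;
    }
    if (setreuid(ruid, ruid) == -1) {
        perror("setreuid");
        return -1;
    }

    /* Verify, rather than assume, that the change took effect. */
    if (geteuid() != ruid || getegid() != rgid)
        return -1;
    return 0;
}
```

A caller would normally treat a -1 return as fatal and exit instead of continuing with unexpected credentials.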
ERRORS
EAGAIN
The call would change the caller’s real UID (i.e., ruid does not match the caller’s real UID), but there was a temporary failure allocating the necessary kernel data structures.
EAGAIN
ruid does not match the caller’s real UID and this call would bring the number of processes belonging to the real user ID ruid over the caller’s RLIMIT_NPROC resource limit. Since Linux 3.1, this error case no longer occurs (but robust applications should check for this error); see the description of EAGAIN in execve(2).
EINVAL
One or more of the target user or group IDs is not valid in this user namespace.
EPERM
The calling process is not privileged (on Linux, does not have the necessary capability in its user namespace: CAP_SETUID in the case of setreuid(), or CAP_SETGID in the case of setregid()) and a change other than (i) swapping the effective user (group) ID with the real user (group) ID, or (ii) setting one to the value of the other or (iii) setting the effective user (group) ID to the value of the saved set-user-ID (saved set-group-ID) was specified.
VERSIONS
POSIX.1 does not specify all of the UID changes that Linux permits for an unprivileged process. For setreuid(), the effective user ID can be made the same as the real user ID or the saved set-user-ID, and it is unspecified whether unprivileged processes may set the real user ID to the real user ID, the effective user ID, or the saved set-user-ID. For setregid(), the real group ID can be changed to the value of the saved set-group-ID, and the effective group ID can be changed to the value of the real group ID or the saved set-group-ID. The precise details of what ID changes are permitted vary across implementations.
POSIX.1 makes no specification about the effect of these calls on the saved set-user-ID and saved set-group-ID.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD (first appeared in 4.2BSD).
Setting the effective user (group) ID to the saved set-user-ID (saved set-group-ID) is possible since Linux 1.1.37 (1.1.38).
The original Linux setreuid() and setregid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added setreuid32() and setregid32(), supporting 32-bit IDs. The glibc setreuid() and setregid() wrapper functions transparently deal with the variations across kernel versions.
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including those for setreuid() and setregid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
SEE ALSO
getgid(2), getuid(2), seteuid(2), setgid(2), setresuid(2), setuid(2), capabilities(7), credentials(7), user_namespaces(7)
122 - Linux cli command semctl
NAME 🖥️ semctl 🖥️
System V semaphore control operations
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/sem.h>
int semctl(int semid, int semnum, int op, ...);
DESCRIPTION
semctl() performs the control operation specified by op on the System V semaphore set identified by semid, or on the semnum-th semaphore of that set. (The semaphores in a set are numbered starting at 0.)
This function has three or four arguments, depending on op. When there are four, the fourth has the type union semun. The calling program must define this union as follows:
union semun {
    int              val;    /* Value for SETVAL */
    struct semid_ds *buf;    /* Buffer for IPC_STAT, IPC_SET */
    unsigned short  *array;  /* Array for GETALL, SETALL */
    struct seminfo  *__buf;  /* Buffer for IPC_INFO
                                (Linux-specific) */
};
The semid_ds data structure is defined in <sys/sem.h> as follows:
struct semid_ds {
    struct ipc_perm sem_perm;   /* Ownership and permissions */
    time_t          sem_otime;  /* Last semop time */
    time_t          sem_ctime;  /* Creation time/time of last
                                   modification via semctl() */
    unsigned long   sem_nsems;  /* No. of semaphores in set */
};
The fields of the semid_ds structure are as follows:
sem_perm
This is an ipc_perm structure (see below) that specifies the access permissions on the semaphore set.
sem_otime
Time of last semop(2) system call.
sem_ctime
Time of creation of semaphore set or time of last semctl() IPC_SET, SETVAL, or SETALL operation.
sem_nsems
Number of semaphores in the set. Each semaphore of the set is referenced by a nonnegative integer ranging from 0 to sem_nsems-1.
The ipc_perm structure is defined as follows (the uid, gid, and mode fields are settable using IPC_SET):
struct ipc_perm {
    key_t          __key;  /* Key supplied to semget(2) */
    uid_t          uid;    /* Effective UID of owner */
    gid_t          gid;    /* Effective GID of owner */
    uid_t          cuid;   /* Effective UID of creator */
    gid_t          cgid;   /* Effective GID of creator */
    unsigned short mode;   /* Permissions */
    unsigned short __seq;  /* Sequence number */
};
The least significant 9 bits of the mode field of the ipc_perm structure define the access permissions for the semaphore set. The permission bits are as follows:
0400 | Read by user |
0200 | Write by user |
0040 | Read by group |
0020 | Write by group |
0004 | Read by others |
0002 | Write by others |
In effect, “write” means “alter” for a semaphore set. Bits 0100, 0010, and 0001 (the execute bits) are unused by the system.
Valid values for op are:
IPC_STAT
Copy information from the kernel data structure associated with semid into the semid_ds structure pointed to by arg.buf. The argument semnum is ignored. The calling process must have read permission on the semaphore set.
IPC_SET
Write the values of some members of the semid_ds structure pointed to by arg.buf to the kernel data structure associated with this semaphore set, updating also its sem_ctime member.
The following members of the structure are updated: sem_perm.uid, sem_perm.gid, and (the least significant 9 bits of) sem_perm.mode.
The effective UID of the calling process must match the owner (sem_perm.uid) or creator (sem_perm.cuid) of the semaphore set, or the caller must be privileged. The argument semnum is ignored.
IPC_RMID
Immediately remove the semaphore set, awakening all processes blocked in semop(2) calls on the set (with an error return and errno set to EIDRM). The effective user ID of the calling process must match the creator or owner of the semaphore set, or the caller must be privileged. The argument semnum is ignored.
IPC_INFO (Linux-specific)
Return information about system-wide semaphore limits and parameters in the structure pointed to by arg.__buf. This structure is of type seminfo, defined in <sys/sem.h> if the _GNU_SOURCE feature test macro is defined:
struct seminfo {
    int semmap;  /* Number of entries in semaphore
                    map; unused within kernel */
    int semmni;  /* Maximum number of semaphore sets */
    int semmns;  /* Maximum number of semaphores in all
                    semaphore sets */
    int semmnu;  /* System-wide maximum number of undo
                    structures; unused within kernel */
    int semmsl;  /* Maximum number of semaphores in a
                    set */
    int semopm;  /* Maximum number of operations for
                    semop(2) */
    int semume;  /* Maximum number of undo entries per
                    process; unused within kernel */
    int semusz;  /* Size of struct sem_undo */
    int semvmx;  /* Maximum semaphore value */
    int semaem;  /* Max. value that can be recorded for
                    semaphore adjustment (SEM_UNDO) */
};
The semmsl, semmns, semopm, and semmni settings can be changed via /proc/sys/kernel/sem; see proc(5) for details.
SEM_INFO (Linux-specific)
Return a seminfo structure containing the same information as for IPC_INFO, except that the following fields are returned with information about system resources consumed by semaphores: the semusz field returns the number of semaphore sets that currently exist on the system; and the semaem field returns the total number of semaphores in all semaphore sets on the system.
SEM_STAT (Linux-specific)
Return a semid_ds structure as for IPC_STAT. However, the semid argument is not a semaphore identifier, but instead an index into the kernel’s internal array that maintains information about all semaphore sets on the system.
SEM_STAT_ANY (Linux-specific, since Linux 4.17)
Return a semid_ds structure as for SEM_STAT. However, sem_perm.mode is not checked for read access for semid meaning that any user can employ this operation (just as any user may read /proc/sysvipc/sem to obtain the same information).
GETALL
Return semval (i.e., the current value) for all semaphores of the set into arg.array. The argument semnum is ignored. The calling process must have read permission on the semaphore set.
GETNCNT
Return the semncnt value for the semnum-th semaphore of the set (i.e., the number of processes waiting for the semaphore’s value to increase). The calling process must have read permission on the semaphore set.
GETPID
Return the sempid value for the semnum-th semaphore of the set. This is the PID of the process that last performed an operation on that semaphore (but see NOTES). The calling process must have read permission on the semaphore set.
GETVAL
Return semval (i.e., the semaphore value) for the semnum-th semaphore of the set. The calling process must have read permission on the semaphore set.
GETZCNT
Return the semzcnt value for the semnum-th semaphore of the set (i.e., the number of processes waiting for the semaphore value to become 0). The calling process must have read permission on the semaphore set.
SETALL
Set the semval values for all semaphores of the set using arg.array, updating also the sem_ctime member of the semid_ds structure associated with the set. Undo entries (see semop(2)) are cleared for altered semaphores in all processes. If the changes to semaphore values would permit blocked semop(2) calls in other processes to proceed, then those processes are woken up. The argument semnum is ignored. The calling process must have alter (write) permission on the semaphore set.
SETVAL
Set the semaphore value (semval) to arg.val for the semnum-th semaphore of the set, updating also the sem_ctime member of the semid_ds structure associated with the set. Undo entries are cleared for altered semaphores in all processes. If the changes to semaphore values would permit blocked semop(2) calls in other processes to proceed, then those processes are woken up. The calling process must have alter permission on the semaphore set.
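As an illustration of the SETVAL, GETVAL, and IPC_RMID operations described above, the following sketch creates a private set holding one semaphore, sets its value, reads it back, and removes the set. The helper name semctl_roundtrip() and the 0600 permissions are assumptions made for the example; the union semun definition is the one the caller is required to supply.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* The calling program must define this union itself. */
union semun {
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
    struct seminfo  *__buf;
};

/* Hypothetical helper: set semaphore 0 of a new private set to
 * `value`, read it back, and remove the set.  Returns the value read
 * back, or -1 on error. */
int semctl_roundtrip(int value)
{
    union semun arg;
    int semid, result;

    semid = semget(IPC_PRIVATE, 1, 0600);
    if (semid == -1) {
        perror("semget");
        return -1;
    }

    arg.val = value;
    if (semctl(semid, 0, SETVAL, arg) == -1) {
        perror("semctl(SETVAL)");   /* e.g., ERANGE for value < 0 */
        result = -1;
    } else {
        /* GETVAL ignores arg; it is passed only so that semctl()
         * is always called with four arguments. */
        result = semctl(semid, 0, GETVAL, arg);
        if (result == -1)
            perror("semctl(GETVAL)");
    }

    /* Always remove the set: System V IPC objects persist after the
     * creating process exits. */
    if (semctl(semid, 0, IPC_RMID, arg) == -1) {
        perror("semctl(IPC_RMID)");
        return -1;
    }
    return result;
}
```

Calling semctl_roundtrip(5) should yield 5, while a negative argument is expected to fail in SETVAL with ERANGE.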
RETURN VALUE
On success, semctl() returns a nonnegative value depending on op as follows:
GETNCNT
the value of semncnt.
GETPID
the value of sempid.
GETVAL
the value of semval.
GETZCNT
the value of semzcnt.
IPC_INFO
the index of the highest used entry in the kernel’s internal array recording information about all semaphore sets. (This information can be used with repeated SEM_STAT or SEM_STAT_ANY operations to obtain information about all semaphore sets on the system.)
SEM_INFO
as for IPC_INFO.
SEM_STAT
the identifier of the semaphore set whose index was given in semid.
SEM_STAT_ANY
as for SEM_STAT.
All other op values return 0 on success.
On failure, semctl() returns -1 and sets errno to indicate the error.
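The IPC_INFO return value can be combined with repeated SEM_STAT calls as described above. The following is a sketch under stated assumptions: the helper name count_readable_sem_sets() is hypothetical, and it simply skips any slot for which SEM_STAT fails (an unused slot, or a set the caller may not read).

```c
#define _GNU_SOURCE             /* exposes struct seminfo */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* The calling program must define this union itself. */
union semun {
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
    struct seminfo  *__buf;
};

/* Hypothetical helper: count the semaphore sets on the system that
 * the caller can read.  Returns the count, or -1 if IPC_INFO fails. */
int count_readable_sem_sets(void)
{
    struct seminfo info;
    struct semid_ds ds;
    union semun arg;
    int max_index, count = 0;

    /* IPC_INFO returns the index of the highest used entry in the
     * kernel's internal array of semaphore sets. */
    arg.__buf = &info;
    max_index = semctl(0, 0, IPC_INFO, arg);
    if (max_index == -1) {
        perror("semctl(IPC_INFO)");
        return -1;
    }

    arg.buf = &ds;
    for (int i = 0; i <= max_index; i++) {
        /* For SEM_STAT the first argument is an index into that
         * array, not a semaphore identifier. */
        if (semctl(i, 0, SEM_STAT, arg) != -1)
            count++;
    }
    return count;
}
```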
ERRORS
EACCES
The argument op has one of the values GETALL, GETPID, GETVAL, GETNCNT, GETZCNT, IPC_STAT, SEM_STAT, SEM_STAT_ANY, SETALL, or SETVAL and the calling process does not have the required permissions on the semaphore set and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EFAULT
The address pointed to by arg.buf or arg.array isn’t accessible.
EIDRM
The semaphore set was removed.
EINVAL
Invalid value for op or semid. Or: for a SEM_STAT operation, the index value specified in semid referred to an array slot that is currently unused.
EPERM
The argument op has the value IPC_SET or IPC_RMID but the effective user ID of the calling process is not the creator (as found in sem_perm.cuid) or the owner (as found in sem_perm.uid) of the semaphore set, and the process does not have the CAP_SYS_ADMIN capability.
ERANGE
The argument op has the value SETALL or SETVAL and the value to which semval is to be set (for some semaphore of the set) is less than 0 or greater than the implementation limit SEMVMX.
VERSIONS
POSIX.1 specifies the sem_nsems field of the semid_ds structure as having the type unsigned short, and the field is so defined on most other systems. It was also so defined on Linux 2.2 and earlier, but, since Linux 2.4, the field has the type unsigned long.
The sempid value
POSIX.1 defines sempid as the “process ID of [the] last operation” on a semaphore, and explicitly notes that this value is set by a successful semop(2) call, with the implication that no other interface affects the sempid value.
While some implementations conform to the behavior specified in POSIX.1, others do not. (The fault here probably lies with POSIX.1 inasmuch as it likely failed to capture the full range of existing implementation behaviors.) Various other implementations also update sempid for the other operations that update the value of a semaphore: the SETVAL and SETALL operations, as well as the semaphore adjustments performed on process termination as a consequence of the use of the SEM_UNDO flag (see semop(2)).
Linux also updates sempid for SETVAL operations and semaphore adjustments. However, somewhat inconsistently, up to and including Linux 4.5, the kernel did not update sempid for SETALL operations. This was rectified in Linux 4.6.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
Various fields in a struct semid_ds were typed as short under Linux 2.2 and have become long under Linux 2.4. To take advantage of this, a recompilation under glibc-2.1.91 or later should suffice. (The kernel distinguishes old and new calls by an IPC_64 flag in op.)
In some earlier versions of glibc, the semun union was defined in <sys/sem.h>, but POSIX.1 requires that the caller define this union. On versions of glibc where this union is not defined, the macro _SEM_SEMUN_UNDEFINED is defined in <sys/sem.h>.
NOTES
The IPC_INFO, SEM_STAT, and SEM_INFO operations are used by the ipcs(1) program to provide information on allocated resources. In the future these may be modified or moved to a /proc filesystem interface.
The following system limit on semaphore sets affects a semctl() call:
SEMVMX
Maximum value for semval: implementation dependent (32767).
For greater portability, it is best to always call semctl() with four arguments.
EXAMPLES
See shmop(2).
SEE ALSO
ipc(2), semget(2), semop(2), capabilities(7), sem_overview(7), sysvipc(7)
123 - Linux cli command gettimeofday
NAME 🖥️ gettimeofday 🖥️
get / set time
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/time.h>
int gettimeofday(struct timeval *restrict tv,
                 struct timezone *_Nullable restrict tz);
int settimeofday(const struct timeval *tv,
                 const struct timezone *_Nullable tz);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
settimeofday():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
The functions gettimeofday() and settimeofday() can get and set the time as well as a timezone.
The tv argument is a struct timeval (as specified in <sys/time.h>):
struct timeval {
    time_t      tv_sec;   /* seconds */
    suseconds_t tv_usec;  /* microseconds */
};
and gives the number of seconds and microseconds since the Epoch (see time(2)).
The tz argument is a struct timezone:
struct timezone {
    int tz_minuteswest;  /* minutes west of Greenwich */
    int tz_dsttime;      /* type of DST correction */
};
If either tv or tz is NULL, the corresponding structure is not set or returned. (However, compilation warnings will result if tv is NULL.)
The use of the timezone structure is obsolete; the tz argument should normally be specified as NULL. (See NOTES below.)
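A minimal sketch of the usual calling pattern, with tz given as NULL, is shown below. The helper name wall_clock_usec() is hypothetical; it merely combines the two timeval fields into a single microsecond count.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: wall-clock time in microseconds since the
 * Epoch, or -1 on error. */
long long wall_clock_usec(void)
{
    struct timeval tv;

    /* The timezone argument is obsolete, so pass NULL. */
    if (gettimeofday(&tv, NULL) == -1)
        return -1;

    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}
```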
Under Linux, there are some peculiar “warp clock” semantics associated with the settimeofday() system call: if, on the very first call after booting that has a non-NULL tz argument, the tv argument is NULL and the tz_minuteswest field is nonzero, then it is assumed that the CMOS clock is on local time, and that it has to be incremented by this amount to get UTC system time. (The tz_dsttime field should be zero for this case.) No doubt it is a bad idea to use this feature.
RETURN VALUE
gettimeofday() and settimeofday() return 0 for success. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
One of tv or tz pointed outside the accessible address space.
EINVAL
(settimeofday()): timezone is invalid.
EINVAL
(settimeofday()): tv.tv_sec is negative or tv.tv_usec is outside the range [0, 999,999].
EINVAL (since Linux 4.3)
(settimeofday()): An attempt was made to set the time to a value less than the current value of the CLOCK_MONOTONIC clock (see clock_gettime(2)).
EPERM
The calling process has insufficient privilege to call settimeofday(); under Linux the CAP_SYS_TIME capability is required.
VERSIONS
C library/kernel differences
On some architectures, an implementation of gettimeofday() is provided in the vdso(7).
The kernel accepts NULL for both tv and tz. The timezone argument is ignored by glibc and musl, and not passed to/from the kernel. Android’s bionic passes the timezone argument to/from the kernel, but Android does not update the kernel timezone based on the device timezone in Settings, so the kernel’s timezone is typically UTC.
STANDARDS
gettimeofday()
POSIX.1-2008 (obsolete).
settimeofday()
None.
HISTORY
SVr4, 4.3BSD. POSIX.1-2001 describes gettimeofday() but not settimeofday(). POSIX.1-2008 marks gettimeofday() as obsolete, recommending the use of clock_gettime(2) instead.
Traditionally, the fields of struct timeval were of type long.
The tz_dsttime field
On a non-Linux kernel, with glibc, the tz_dsttime field of struct timezone will be set to a nonzero value by gettimeofday() if the current timezone has ever had or will have a daylight saving rule applied. In this sense it exactly mirrors the meaning of daylight(3) for the current zone. On Linux, with glibc, the setting of the tz_dsttime field of struct timezone has never been used by settimeofday() or gettimeofday(). Thus, the following is purely of historical interest.
On old systems, the field tz_dsttime contains a symbolic constant (values are given below) that indicates in which part of the year Daylight Saving Time is in force. (Note: this value is constant throughout the year: it does not indicate that DST is in force, it just selects an algorithm.) The daylight saving time algorithms defined are as follows:
DST_NONE /* not on DST */
DST_USA /* USA style DST */
DST_AUST /* Australian style DST */
DST_WET /* Western European DST */
DST_MET /* Middle European DST */
DST_EET /* Eastern European DST */
DST_CAN /* Canada */
DST_GB /* Great Britain and Eire */
DST_RUM /* Romania */
DST_TUR /* Turkey */
DST_AUSTALT /* Australian style with shift in 1986 */
Of course it turned out that the period in which Daylight Saving Time is in force cannot be given by a simple algorithm, one per country; indeed, this period is determined by unpredictable political decisions. So this method of representing timezones has been abandoned.
NOTES
The time returned by gettimeofday() is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the system time). If you need a monotonically increasing clock, see clock_gettime(2).
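For measuring intervals, a sketch using clock_gettime(2) with CLOCK_MONOTONIC might look like the following. The helper name monotonic_nsec() is hypothetical; the returned value is meaningful only as a difference between two readings, not as a calendar time.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: a monotonic timestamp in nanoseconds, or -1
 * on error.  It is unaffected by manual changes to the system time. */
long long monotonic_nsec(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_MONOTONIC, &ts) == -1)
        return -1;

    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

An elapsed time is then the difference between a reading taken after the work and one taken before it.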
Macros for operating on timeval structures are described in timeradd(3).
SEE ALSO
date(1), adjtimex(2), clock_gettime(2), time(2), ctime(3), ftime(3), timeradd(3), capabilities(7), time(7), vdso(7), hwclock(8)
124 - Linux cli command inl
NAME 🖥️ inl 🖥️
port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width, the w-suffix functions word-width, and the l-suffix functions long-word-width (unsigned int in the synopsis above); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
Use ioperm(2) or alternatively iopl(2) to tell the kernel to allow the user space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
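The sequence of requesting access with ioperm(2) and then reading a port can be sketched as below. This is an illustration only: the helper name read_port_byte() is hypothetical, the code assumes an x86 system (on other architectures it just fails with ENOSYS), and ioperm(2) normally succeeds only for a privileged process (CAP_SYS_RAWIO).

```c
#include <errno.h>
#if defined(__i386__) || defined(__x86_64__)
#include <sys/io.h>
#endif

/* Hypothetical helper: read one byte from an I/O port into *value.
 * Returns 0 on success, or -1 with errno set on failure. */
int read_port_byte(unsigned short port, unsigned char *value)
{
#if defined(__i386__) || defined(__x86_64__)
    /* Ask the kernel for access to this single port; without it,
     * inb() would raise a segmentation fault. */
    if (ioperm(port, 1, 1) == -1)
        return -1;

    *value = inb(port);

    /* Give the permission back once it is no longer needed. */
    if (ioperm(port, 1, 0) == -1)
        return -1;
    return 0;
#else
    (void)port;
    (void)value;
    errno = ENOSYS;     /* port I/O is not available here */
    return -1;
#endif
}
```

Which port is safe to read depends entirely on the hardware, so the choice of port is left to the caller.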
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
125 - Linux cli command gettid
NAME 🖥️ gettid 🖥️
get thread identification
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE
#include <unistd.h>
pid_t gettid(void);
DESCRIPTION
gettid() returns the caller’s thread ID (TID). In a single-threaded process, the thread ID is equal to the process ID (PID, as returned by getpid(2)). In a multithreaded process, all threads have the same PID, but each one has a unique TID. For further details, see the discussion of CLONE_THREAD in clone(2).
RETURN VALUE
On success, returns the thread ID of the calling thread.
ERRORS
This call is always successful.
STANDARDS
Linux.
HISTORY
Linux 2.4.11, glibc 2.30.
NOTES
The thread ID returned by this call is not the same thing as a POSIX thread ID (i.e., the opaque value returned by pthread_self(3)).
In a new thread group created by a clone(2) call that does not specify the CLONE_THREAD flag (or, equivalently, a new process created by fork(2)), the new process is a thread group leader, and its thread group ID (the value returned by getpid(2)) is the same as its thread ID (the value returned by gettid()).
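The relationship between the thread ID and the process ID in a process created by fork(2) can be checked with a short sketch. The helper names my_gettid() and child_tid_equals_pid() are hypothetical; syscall(2) is used instead of the gettid() wrapper on the assumption that the code may be built against a glibc older than 2.30.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: thread ID of the caller, obtained through
 * syscall(2) so that no gettid() wrapper is required. */
pid_t my_gettid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/* Hypothetical helper: fork a child and report whether, in that
 * child, the thread ID equals the process ID.
 * Returns 1 if equal, 0 if not, -1 on error. */
int child_tid_equals_pid(void)
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;

    if (child == 0)     /* child: a new thread group leader */
        _exit(my_gettid() == getpid() ? 0 : 1);

    if (waitpid(child, &status, 0) == -1 || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status) == 0;
}
```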
SEE ALSO
capget(2), clone(2), fcntl(2), fork(2), get_robust_list(2), getpid(2), ioprio_set(2), perf_event_open(2), sched_setaffinity(2), sched_setparam(2), sched_setscheduler(2), tgkill(2), timer_create(2)
126 - Linux cli command openat2
NAME 🖥️ openat2 🖥️
open and possibly create a file (extended)
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h>          /* Definition of O_* and S_* constants */
#include <linux/openat2.h> /* Definition of RESOLVE_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_openat2, int dirfd, const char *pathname,
             struct open_how *how, size_t size);
Note: glibc provides no wrapper for openat2(), necessitating the use of syscall(2).
DESCRIPTION
The openat2() system call is an extension of openat(2) and provides a superset of its functionality.
The openat2() system call opens the file specified by pathname. If the specified file does not exist, it may optionally (if O_CREAT is specified in how.flags) be created.
As with openat(2), if pathname is a relative pathname, then it is interpreted relative to the directory referred to by the file descriptor dirfd (or the current working directory of the calling process, if dirfd is the special value AT_FDCWD). If pathname is an absolute pathname, then dirfd is ignored (unless how.resolve contains RESOLVE_IN_ROOT, in which case pathname is resolved relative to dirfd).
Rather than taking a single flags argument, an extensible structure (how) is passed to allow for future extensions. The size argument must be specified as sizeof(struct open_how).
The open_how structure
The how argument specifies how pathname should be opened, and acts as a superset of the flags and mode arguments to openat(2). This argument is a pointer to an open_how structure, described in open_how(2type).
Any future extensions to openat2() will be implemented as new fields appended to the open_how structure, with a zero value in a new field resulting in the kernel behaving as though that extension field was not present. Therefore, the caller must zero-fill this structure on initialization. (See the “Extensibility” section of the NOTES for more detail on why this is necessary.)
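A minimal sketch of the calling convention follows: the structure is zero-filled, the wanted fields are set, and the call goes through syscall(2) because glibc provides no wrapper. The helper name open_beneath() is hypothetical, and the choice of RESOLVE_BENEATH together with RESOLVE_NO_MAGICLINKS (two of the resolve flags of this call) is only one possible policy. It assumes kernel headers that provide <linux/openat2.h>.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/openat2.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open pathname read-only, refusing any
 * resolution that would leave the directory referred to by dirfd.
 * Returns a file descriptor, or -1 with errno set. */
int open_beneath(int dirfd, const char *pathname)
{
    struct open_how how;

    /* Zero-fill first, so that any fields added by future kernel
     * extensions are seen as "not requested". */
    memset(&how, 0, sizeof(how));
    how.flags = O_RDONLY | O_CLOEXEC;
    how.resolve = RESOLVE_BENEATH | RESOLVE_NO_MAGICLINKS;

    return (int)syscall(SYS_openat2, dirfd, pathname,
                        &how, sizeof(how));
}
```

With this policy an absolute pathname such as /etc/passwd is rejected, and on a kernel without openat2() the call fails as well, so the caller has to inspect errno to tell the cases apart.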
The fields of the open_how structure are as follows:
flags
This field specifies the file creation and file status flags to use when opening the file. All of the O_* flags defined for openat(2) are valid openat2() flag values.
Whereas openat(2) ignores unknown bits in its flags argument, openat2() returns an error if unknown or conflicting flags are specified in how.flags.
mode
This field specifies the mode for the new file, with identical semantics to the mode argument of openat(2).
Whereas openat(2) ignores bits other than those in the range 07777 in its mode argument, openat2() returns an error if how.mode contains bits other than 07777. Similarly, an error is returned if openat2() is called with a nonzero how.mode and how.flags does not contain O_CREAT or O_TMPFILE.
resolve
This is a bit-mask of flags that modify the way in which all components of pathname will be resolved. (See path_resolution(7) for background information.)
The primary use case for these flags is to allow trusted programs to restrict how untrusted paths (or paths inside untrusted directories) are resolved. The full list of resolve flags is as follows:
RESOLVE_BENEATH
Do not permit the path resolution to succeed if any component of the resolution is not a descendant of the directory indicated by dirfd. This causes absolute symbolic links (and absolute values of pathname) to be rejected.
Currently, this flag also disables magic-link resolution (see below). However, this may change in the future. Therefore, to ensure that magic links are not resolved, the caller should explicitly specify RESOLVE_NO_MAGICLINKS.
RESOLVE_IN_ROOT
Treat the directory referred to by dirfd as the root directory while resolving pathname. Absolute symbolic links are interpreted relative to dirfd. If a prefix component of pathname equates to dirfd, then an immediately following .. component likewise equates to dirfd (just as /.. is traditionally equivalent to /). If pathname is an absolute path, it is also interpreted relative to dirfd.
The effect of this flag is as though the calling process had used chroot(2) to (temporarily) modify its root directory (to the directory referred to by dirfd). However, unlike chroot(2) (which changes the filesystem root permanently for a process), RESOLVE_IN_ROOT allows a program to efficiently restrict path resolution on a per-open basis.
Currently, this flag also disables magic-link resolution. However, this may change in the future. Therefore, to ensure that magic links are not resolved, the caller should explicitly specify RESOLVE_NO_MAGICLINKS.
RESOLVE_NO_MAGICLINKS
Disallow all magic-link resolution during path resolution.
Magic links are symbolic link-like objects that are most notably found in proc(5); examples include /proc/pid/exe and /proc/pid/fd/*. (See symlink(7) for more details.)
Unknowingly opening magic links can be risky for some applications. Examples of such risks include the following:
If the process opening a pathname is a controlling process that currently has no controlling terminal (see credentials(7)), then opening a magic link inside /proc/pid/fd that happens to refer to a terminal would cause the process to acquire a controlling terminal.
In a containerized environment, a magic link inside /proc may refer to an object outside the container, and thus may provide a means to escape from the container.
Because of such risks, an application may prefer to disable magic link resolution using the RESOLVE_NO_MAGICLINKS flag.
If the trailing component (i.e., basename) of pathname is a magic link, how.resolve contains RESOLVE_NO_MAGICLINKS, and how.flags contains both O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the magic link will be returned.
RESOLVE_NO_SYMLINKS
Disallow resolution of symbolic links during path resolution. This option implies RESOLVE_NO_MAGICLINKS.
If the trailing component (i.e., basename) of pathname is a symbolic link, how.resolve contains RESOLVE_NO_SYMLINKS, and how.flags contains both O_PATH and O_NOFOLLOW, then an O_PATH file descriptor referencing the symbolic link will be returned.
Note that the effect of the RESOLVE_NO_SYMLINKS flag, which affects the treatment of symbolic links in all of the components of pathname, differs from the effect of the O_NOFOLLOW file creation flag (in how.flags), which affects the handling of symbolic links only in the final component of pathname.
Applications that employ the RESOLVE_NO_SYMLINKS flag are encouraged to make its use configurable (unless it is used for a specific security purpose), as symbolic links are very widely used by end-users. Setting this flag indiscriminately (i.e., for purposes not specifically related to security) for all uses of openat2() may result in spurious errors on previously functional systems. This may occur if, for example, a system pathname that is used by an application is modified (e.g., in a new distribution release) so that a pathname component (now) contains a symbolic link.
RESOLVE_NO_XDEV
Disallow traversal of mount points during path resolution (including all bind mounts). Consequently, pathname must either be on the same mount as the directory referred to by dirfd, or on the same mount as the current working directory if dirfd is specified as AT_FDCWD.
Applications that employ the RESOLVE_NO_XDEV flag are encouraged to make its use configurable (unless it is used for a specific security purpose), as bind mounts are widely used by end-users. Setting this flag indiscriminately (i.e., for purposes not specifically related to security) for all uses of openat2() may result in spurious errors on previously functional systems. This may occur if, for example, a system pathname that is used by an application is modified (e.g., in a new distribution release) so that a pathname component (now) contains a bind mount.
RESOLVE_CACHED
Make the open operation fail unless all path components are already present in the kernel’s lookup cache. If any kind of revalidation or I/O is needed to satisfy the lookup, openat2() fails with the error EAGAIN. This is useful in providing a fast-path open that can be performed without resorting to thread offload, or other mechanisms that an application might use to offload slower operations.
If any bits other than those listed above are set in how.resolve, an error is returned.
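As an illustration of combining resolve flags, the following is a minimal sketch of a helper that opens a path only if resolution stays beneath dirfd, and that explicitly disables magic links as recommended above. The helper name open_beneath() is hypothetical; the sketch assumes kernel headers that provide <linux/openat2.h> and SYS_openat2.

```c
#define _GNU_SOURCE
#include <fcntl.h>          /* O_* constants */
#include <linux/openat2.h>  /* struct open_how, RESOLVE_* constants */
#include <sys/syscall.h>    /* SYS_openat2 */
#include <unistd.h>         /* syscall() */

/* Hypothetical helper: open `path` read-only relative to `dirfd`,
   refusing any resolution that leaves the tree rooted at `dirfd`
   or that passes through a magic link.
   Returns a file descriptor, or -1 with errno set. */
static int open_beneath(int dirfd, const char *path)
{
    /* The designated initializer zero-fills all remaining fields. */
    struct open_how how = {
        .flags   = O_RDONLY | O_CLOEXEC,
        .resolve = RESOLVE_BENEATH | RESOLVE_NO_MAGICLINKS,
    };

    return (int) syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

With this helper, an absolute pathname or a ".." component that would climb above dirfd is expected to fail (with EXDEV on kernels that support openat2()), rather than silently resolving outside the directory.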
RETURN VALUE
On success, a new file descriptor is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The set of errors returned by openat2() includes all of the errors returned by openat(2), as well as the following additional errors:
E2BIG
An extension that this kernel does not support was specified in how. (See the “Extensibility” section of NOTES for more detail on how extensions are handled.)
EAGAIN
how.resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and the kernel could not ensure that a “..” component didn’t escape (due to a race condition or potential attack). The caller may choose to retry the openat2() call.
EAGAIN
RESOLVE_CACHED was set, and the open operation cannot be performed using only cached information. The caller should retry without RESOLVE_CACHED set in how.resolve.
EINVAL
An unknown flag or invalid value was specified in how.
EINVAL
mode is nonzero, but how.flags does not contain O_CREAT or O_TMPFILE.
EINVAL
size was smaller than any known version of struct open_how.
ELOOP
how.resolve contains RESOLVE_NO_SYMLINKS, and one of the path components was a symbolic link (or magic link).
ELOOP
how.resolve contains RESOLVE_NO_MAGICLINKS, and one of the path components was a magic link.
EXDEV
how.resolve contains either RESOLVE_IN_ROOT or RESOLVE_BENEATH, and an escape from the root during path resolution was detected.
EXDEV
how.resolve contains RESOLVE_NO_XDEV, and a path component crosses a mount point.
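The EAGAIN case for RESOLVE_CACHED suggests a two-step pattern: attempt a cached-only open first, and fall back to a normal open if the kernel reports that it cannot complete the lookup from cache. The following is a sketch of that pattern, not a definitive implementation. The helper name open_fast_then_slow() is hypothetical, and the fallback definition of RESOLVE_CACHED (0x20) is an assumption for headers that predate the flag.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef RESOLVE_CACHED
#define RESOLVE_CACHED 0x20   /* assumed value, for older kernel headers */
#endif

/* Hypothetical helper: try a cached-only lookup first; if the kernel
   answers EAGAIN, retry the same open without RESOLVE_CACHED.
   Returns a file descriptor, or -1 with errno set. */
static int open_fast_then_slow(int dirfd, const char *path, int flags)
{
    struct open_how how = {
        .flags   = (unsigned int) flags,
        .resolve = RESOLVE_CACHED,
    };
    int fd = (int) syscall(SYS_openat2, dirfd, path, &how, sizeof(how));

    if (fd == -1 && errno == EAGAIN) {
        /* Slow path: allow revalidation and I/O during the lookup. */
        how.resolve = 0;
        fd = (int) syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
    }
    return fd;
}
```

In a real application the slow-path call would typically be handed to a worker thread, since it may block; the sketch performs it inline for brevity.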
STANDARDS
Linux.
HISTORY
Linux 5.6.
The semantics of RESOLVE_BENEATH were modeled after FreeBSD’s O_BENEATH.
NOTES
Extensibility
In order to allow for future extensibility, openat2() requires the user-space application to specify the size of the open_how structure that it is passing. By providing this information, it is possible for openat2() to provide both forwards- and backwards-compatibility, with size acting as an implicit version number. (Because new extension fields will always be appended, the structure size will always increase.) This extensibility design is very similar to other system calls such as sched_setattr(2), perf_event_open(2), and clone3(2).
If we let usize be the size of the structure as specified by the user-space application, and ksize be the size of the structure which the kernel supports, then there are three cases to consider:
If ksize equals usize, then there is no version mismatch and how can be used verbatim.
If ksize is larger than usize, then there are some extension fields that the kernel supports which the user-space application is unaware of. Because a zero value in any added extension field signifies a no-op, the kernel treats all of the extension fields not provided by the user-space application as having zero values. This provides backwards-compatibility.
If ksize is smaller than usize, then there are some extension fields which the user-space application is aware of but which the kernel does not support. Because any extension field must have its zero values signify a no-op, the kernel can safely ignore the unsupported extension fields if they are all-zero. If any unsupported extension fields are nonzero, then -1 is returned and errno is set to E2BIG. This provides forwards-compatibility.
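The three cases can be modeled in user space. The following sketch is not kernel code; it only mimics the rules described above, using a hypothetical function model_copy_struct() that copies a usize-byte structure into a ksize-byte destination.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* User-space model (not kernel code) of the size negotiation.
   Copies a `usize`-byte caller structure into a `ksize`-byte buffer.
   Returns 0 on success, or -E2BIG if the caller set a field that
   lies beyond what the "kernel" side understands. */
static int model_copy_struct(void *kdst, size_t ksize,
                             const void *usrc, size_t usize)
{
    const unsigned char *u = usrc;
    size_t common = usize < ksize ? usize : ksize;

    /* usize > ksize: unsupported trailing fields must be all-zero. */
    for (size_t i = ksize; i < usize; i++)
        if (u[i] != 0)
            return -E2BIG;

    memcpy(kdst, usrc, common);

    /* ksize > usize: fields the caller does not know about read as zero. */
    if (ksize > usize)
        memset((unsigned char *) kdst + usize, 0, ksize - usize);

    return 0;
}
```

When usize equals ksize, both the loop and the memset() are skipped and the structure is used verbatim.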
Because the definition of struct open_how may change in the future (with new fields being added when system headers are updated), user-space applications should zero-fill struct open_how to ensure that recompiling the program with new headers will not result in spurious errors at run time. The simplest way is to use a designated initializer:
struct open_how how = { .flags = O_RDWR,
.resolve = RESOLVE_IN_ROOT };
or explicitly using memset(3) or similar:
struct open_how how;
memset(&how, 0, sizeof(how));
how.flags = O_RDWR;
how.resolve = RESOLVE_IN_ROOT;
A user-space application that wishes to determine which extensions the running kernel supports can do so by conducting a binary search on size with a structure which has every byte nonzero (to find the largest value which doesn’t produce an error of E2BIG).
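The binary search described above might be sketched as follows. The function name probe_open_how_size() and the bounds PROBE_MIN (24 bytes, assumed to be the size of the original three-field structure) and PROBE_MAX (4096 bytes) are assumptions made for this illustration. Every probe is expected to fail, because a structure filled with nonzero bytes carries invalid flags; only the distinction between E2BIG and any other error matters.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PROBE_MIN 24    /* assumed size of the first struct open_how */
#define PROBE_MAX 4096  /* assumed upper bound for the search */

/* Issue one probe and return the resulting errno (0 if it succeeded). */
static int probe_errno(const void *buf, size_t size)
{
    long fd = syscall(SYS_openat2, AT_FDCWD, ".", buf, size);

    if (fd != -1) {         /* not expected with an all-0xff structure */
        close((int) fd);
        return 0;
    }
    return errno;
}

/* Returns the largest size that does not fail with E2BIG,
   or -1 if openat2() cannot be probed on this system. */
static long probe_open_how_size(void)
{
    unsigned char buf[PROBE_MAX];
    long lo = PROBE_MIN, hi = PROBE_MAX;
    int e;

    memset(buf, 0xff, sizeof(buf));     /* every byte nonzero */

    e = probe_errno(buf, (size_t) lo);
    if (e == ENOSYS || e == EPERM || e == E2BIG)
        return -1;

    /* Invariant: size `lo` is accepted (no E2BIG). */
    while (lo < hi) {
        long mid = lo + (hi - lo + 1) / 2;

        if (probe_errno(buf, (size_t) mid) == E2BIG)
            hi = mid - 1;
        else
            lo = mid;
    }
    return lo;
}
```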
SEE ALSO
openat(2), open_how(2type), path_resolution(7), symlink(7)
127 - Linux cli command sigprocmask
NAME
sigprocmask - examine and change blocked signals
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
/* Prototype for the glibc wrapper function */
int sigprocmask(int how, const sigset_t *_Nullable restrict set,
sigset_t *_Nullable restrict oldset);
#include <signal.h> /* Definition of SIG_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
/* Prototype for the underlying system call */
int syscall(SYS_rt_sigprocmask, int how,
const kernel_sigset_t *_Nullable set,
kernel_sigset_t *_Nullable oldset,
size_t sigsetsize);
/* Prototype for the legacy system call */
[[deprecated]] int syscall(SYS_sigprocmask, int how,
const old_kernel_sigset_t *_Nullable set,
old_kernel_sigset_t *_Nullable oldset);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigprocmask():
_POSIX_C_SOURCE
DESCRIPTION
sigprocmask() is used to fetch and/or change the signal mask of the calling thread. The signal mask is the set of signals whose delivery is currently blocked for the caller (see also signal(7) for more details).
The behavior of the call is dependent on the value of how, as follows.
SIG_BLOCK
The set of blocked signals is the union of the current set and the set argument.
SIG_UNBLOCK
The signals in set are removed from the current set of blocked signals. It is permissible to attempt to unblock a signal which is not blocked.
SIG_SETMASK
The set of blocked signals is set to the argument set.
If oldset is non-NULL, the previous value of the signal mask is stored in oldset.
If set is NULL, then the signal mask is unchanged (i.e., how is ignored), but the current value of the signal mask is nevertheless returned in oldset (if it is not NULL).
A set of functions for modifying and inspecting variables of type sigset_t (“signal sets”) is described in sigsetops(3).
The use of sigprocmask() is unspecified in a multithreaded process; see pthread_sigmask(3).
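The three values of how can be seen together in a short single-threaded sketch: a signal is blocked with SIG_BLOCK, raised while blocked (so that it stays pending), released with SIG_UNBLOCK, and the saved mask is finally reinstated with SIG_SETMASK. The function and handler names are hypothetical, and the choice of SIGUSR1 is arbitrary.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t got_usr1 = 0;

static void on_usr1(int sig)
{
    (void) sig;
    got_usr1 = 1;
}

/* Returns 1 if SIGUSR1 stayed pending while blocked and was delivered
   once unblocked, 0 if not, and -1 on error. */
static int block_raise_unblock(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old, pending;
    int held;

    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, &old_sa) == -1)
        return -1;

    /* SIG_BLOCK: add SIGUSR1 to the mask, saving the previous mask. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    got_usr1 = 0;
    raise(SIGUSR1);                     /* stays pending while blocked */
    sigpending(&pending);
    held = (got_usr1 == 0) && (sigismember(&pending, SIGUSR1) == 1);

    /* SIG_UNBLOCK: the pending signal is delivered to the handler. */
    if (sigprocmask(SIG_UNBLOCK, &block, NULL) == -1)
        return -1;

    /* SIG_SETMASK: reinstate the mask that was in force on entry. */
    if (sigprocmask(SIG_SETMASK, &old, NULL) == -1)
        return -1;

    sigaction(SIGUSR1, &old_sa, NULL);
    return held && got_usr1 == 1;
}
```

In a multithreaded program the same sequence would use pthread_sigmask(3) in place of sigprocmask().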
RETURN VALUE
sigprocmask() returns 0 on success. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
The set or oldset argument points outside the process’s allocated address space.
EINVAL
Either the value specified in how was invalid or the kernel does not support the size passed in sigsetsize.
VERSIONS
C library/kernel differences
The kernel’s definition of sigset_t differs in size from that used by the C library. In this manual page, the former is referred to as kernel_sigset_t (it is nevertheless named sigset_t in the kernel sources).
The glibc wrapper function for sigprocmask() silently ignores attempts to block the two real-time signals that are used internally by the NPTL threading implementation. See nptl(7) for details.
The original Linux system call was named sigprocmask(). However, with the addition of real-time signals in Linux 2.2, the fixed-size, 32-bit sigset_t (referred to as old_kernel_sigset_t in this manual page) type supported by that system call was no longer fit for purpose. Consequently, a new system call, rt_sigprocmask(), was added to support an enlarged sigset_t type (referred to as kernel_sigset_t in this manual page). The new system call takes a fourth argument, size_t sigsetsize, which specifies the size in bytes of the signal sets in set and oldset. This argument is currently required to have a fixed architecture-specific value (equal to sizeof(kernel_sigset_t)).
The glibc sigprocmask() wrapper function hides these details from us, transparently calling rt_sigprocmask() when the kernel provides it.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
NOTES
It is not possible to block SIGKILL or SIGSTOP. Attempts to do so are silently ignored.
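This behavior can be observed by asking to block SIGKILL together with an ordinary signal and then reading the mask back with set specified as NULL. The following sketch uses a hypothetical function name; SIGTERM is chosen arbitrarily as the ordinary signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Returns 1 if the request to block SIGKILL was silently dropped while
   the request to block SIGTERM was honored, 0 if not, -1 on error. */
static int sigkill_block_is_ignored(void)
{
    sigset_t want, old, now;
    int result;

    sigemptyset(&want);
    sigaddset(&want, SIGKILL);
    sigaddset(&want, SIGTERM);

    /* The call succeeds even though SIGKILL cannot be blocked. */
    if (sigprocmask(SIG_BLOCK, &want, &old) == -1)
        return -1;

    /* set == NULL: `how` is ignored and the current mask is returned. */
    if (sigprocmask(SIG_BLOCK, NULL, &now) == -1)
        result = -1;
    else
        result = sigismember(&now, SIGKILL) == 0 &&
                 sigismember(&now, SIGTERM) == 1;

    sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the original mask */
    return result;
}
```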
Each of the threads in a process has its own signal mask.
A child created via fork(2) inherits a copy of its parent’s signal mask; the signal mask is preserved across execve(2).
If SIGBUS, SIGFPE, SIGILL, or SIGSEGV are generated while they are blocked, the result is undefined, unless the signal was generated by kill(2), sigqueue(3), or raise(3).
See sigsetops(3) for details on manipulating signal sets.
Note that it is permissible (although not very useful) to specify both set and oldset as NULL.
SEE ALSO
kill(2), pause(2), sigaction(2), signal(2), sigpending(2), sigsuspend(2), pthread_sigmask(3), sigqueue(3), sigsetops(3), signal(7)
128 - Linux cli command keyctl
NAME
keyctl - manipulate the kernel’s key management facility
LIBRARY
Standard C library (libc, -lc)
Alternatively, Linux Key Management Utilities (libkeyutils, -lkeyutils); see VERSIONS.
SYNOPSIS
#include <linux/keyctl.h> /* Definition of KEY* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_keyctl, int operation, unsigned long arg2,
unsigned long arg3, unsigned long arg4,
unsigned long arg5);
Note: glibc provides no wrapper for keyctl(), necessitating the use of syscall(2).
DESCRIPTION
keyctl() allows user-space programs to perform key manipulation.
The operation performed by keyctl() is determined by the value of the operation argument. Each of these operations is wrapped by the libkeyutils library (provided by the keyutils package) into individual functions (noted below) to permit the compiler to check types.
The permitted values for operation are:
KEYCTL_GET_KEYRING_ID (since Linux 2.6.10)
Map a special key ID to a real key ID for this process.
This operation looks up the special key whose ID is provided in arg2 (cast to key_serial_t). If the special key is found, the ID of the corresponding real key is returned as the function result. The following values may be specified in arg2:
KEY_SPEC_THREAD_KEYRING
This specifies the calling thread’s thread-specific keyring. See thread-keyring(7).
KEY_SPEC_PROCESS_KEYRING
This specifies the caller’s process-specific keyring. See process-keyring(7).
KEY_SPEC_SESSION_KEYRING
This specifies the caller’s session-specific keyring. See session-keyring(7).
KEY_SPEC_USER_KEYRING
This specifies the caller’s UID-specific keyring. See user-keyring(7).
KEY_SPEC_USER_SESSION_KEYRING
This specifies the caller’s UID-session keyring. See user-session-keyring(7).
KEY_SPEC_REQKEY_AUTH_KEY (since Linux 2.6.16)
This specifies the authorization key created by request_key(2) and passed to the process it spawns to generate a key. This key is available only in a request-key(8)-style program that was passed an authorization key by the kernel and ceases to be available once the requested key has been instantiated; see request_key(2).
KEY_SPEC_REQUESTOR_KEYRING (since Linux 2.6.29)
This specifies the key ID for the request_key(2) destination keyring. This keyring is available only in a request-key(8)-style program that was passed an authorization key by the kernel and ceases to be available once the requested key has been instantiated; see request_key(2).
The behavior if the key specified in arg2 does not exist depends on the value of arg3 (cast to int). If arg3 contains a nonzero value, then, if it is appropriate to do so (e.g., when looking up the user, user-session, or session key), a new key is created and its real key ID returned as the function result. Otherwise, the operation fails with the error ENOKEY.
If a valid key ID is specified in arg2, and the key exists, then this operation simply returns the key ID. If the key does not exist, the call fails with error ENOKEY.
The caller must have search permission on a keyring in order for it to be found.
The arguments arg4 and arg5 are ignored.
This operation is exposed by libkeyutils via the function keyctl_get_keyring_ID(3).
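Because glibc provides no wrapper, a direct call goes through syscall(2). The following is a minimal sketch that maps the special ID KEY_SPEC_SESSION_KEYRING to a real key ID; the helper name session_keyring_id() is hypothetical. The special IDs are negative, so they are widened to unsigned long for the call.

```c
#define _GNU_SOURCE
#include <linux/keyctl.h>   /* KEYCTL_* and KEY_SPEC_* constants */
#include <sys/syscall.h>    /* SYS_keyctl */
#include <unistd.h>         /* syscall() */

/* Hypothetical helper: return the real ID of the caller's session
   keyring, or -1 with errno set. If `create` is nonzero, the kernel
   may create the keyring when it does not yet exist. */
static long session_keyring_id(int create)
{
    return syscall(SYS_keyctl, KEYCTL_GET_KEYRING_ID,
                   (unsigned long) KEY_SPEC_SESSION_KEYRING,
                   (unsigned long) create, 0UL, 0UL);
}
```

In sandboxed or containerized environments the call may be denied altogether, so callers should be prepared for a result of -1.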
KEYCTL_JOIN_SESSION_KEYRING (since Linux 2.6.10)
Replace the session keyring this process subscribes to with a new session keyring.
If arg2 is NULL, an anonymous keyring with the description “_ses” is created and the process is subscribed to that keyring as its session keyring, displacing the previous session keyring.
Otherwise, arg2 (cast to char *) is treated as the description (name) of a keyring, and the behavior is as follows:
If a keyring with a matching description exists, the process will attempt to subscribe to that keyring as its session keyring if possible; if that is not possible, an error is returned. In order to subscribe to the keyring, the caller must have search permission on the keyring.
If a keyring with a matching description does not exist, then a new keyring with the specified description is created, and the process is subscribed to that keyring as its session keyring.
The arguments arg3, arg4, and arg5 are ignored.
This operation is exposed by libkeyutils via the function keyctl_join_session_keyring(3).
KEYCTL_UPDATE (since Linux 2.6.10)
Update a key’s data payload.
The arg2 argument (cast to key_serial_t) specifies the ID of the key to be updated. The arg3 argument (cast to void *) points to the new payload and arg4 (cast to size_t) contains the new payload size in bytes.
The caller must have write permission on the key specified and the key type must support updating.
A negatively instantiated key (see the description of KEYCTL_REJECT) can be positively instantiated with this operation.
The arg5 argument is ignored.
This operation is exposed by libkeyutils via the function keyctl_update(3).
KEYCTL_REVOKE (since Linux 2.6.10)
Revoke the key with the ID provided in arg2 (cast to key_serial_t). The key is scheduled for garbage collection; it will no longer be findable, and will be unavailable for further operations. Further attempts to use the key will fail with the error EKEYREVOKED.
The caller must have write or setattr permission on the key.
The arguments arg3, arg4, and arg5 are ignored.
This operation is exposed by libkeyutils via the function keyctl_revoke(3).
KEYCTL_CHOWN (since Linux 2.6.10)
Change the ownership (user and group ID) of a key.
The arg2 argument (cast to key_serial_t) contains the key ID. The arg3 argument (cast to uid_t) contains the new user ID (or -1 in case the user ID shouldn’t be changed). The arg4 argument (cast to gid_t) contains the new group ID (or -1 in case the group ID shouldn’t be changed).
The key must grant the caller setattr permission.
For the UID to be changed, or for the GID to be changed to a group the caller is not a member of, the caller must have the CAP_SYS_ADMIN capability (see capabilities(7)).
If the UID is to be changed, the new user must have sufficient quota to accept the key. If the UID is changed, the quota charge for the key is transferred from the old user to the new user.
The arg5 argument is ignored.
This operation is exposed by libkeyutils via the function keyctl_chown(3).
KEYCTL_SETPERM (since Linux 2.6.10)
Change the permissions of the key with the ID provided in the arg2 argument (cast to key_serial_t) to the permissions provided in the arg3 argument (cast to key_perm_t).
If the caller doesn’t have the CAP_SYS_ADMIN capability, it can change permissions only for the keys it owns. (More precisely: the caller’s filesystem UID must match the UID of the key.)
The key must grant setattr permission to the caller regardless of the caller’s capabilities.
The permissions in arg3 specify masks of available operations for each of the following user categories:
possessor (since Linux 2.6.14)
This is the permission granted to a process that possesses the key (has it attached searchably to one of the process’s keyrings); see keyrings(7).
user
This is the permission granted to a process whose filesystem UID matches the UID of the key.
group
This is the permission granted to a process whose filesystem GID or any of its supplementary GIDs matches the GID of the key.
other
This is the permission granted to other processes that do not match the user and group categories.
The user, group, and other categories are exclusive: if a process matches the user category, it will not receive permissions granted in the group category; if a process matches the user or group category, then it will not receive permissions granted in the other category.
The possessor category grants permissions that are cumulative with the grants from the user, group, or other category.
Each permission mask is eight bits in size, with only six bits currently used. The available permissions are:
view
This permission allows reading attributes of a key.
This permission is required for the KEYCTL_DESCRIBE operation.
The permission bits for each category are KEY_POS_VIEW, KEY_USR_VIEW, KEY_GRP_VIEW, and KEY_OTH_VIEW.
read
This permission allows reading a key’s payload.
This permission is required for the KEYCTL_READ operation.
The permission bits for each category are KEY_POS_READ, KEY_USR_READ, KEY_GRP_READ, and KEY_OTH_READ.
write
This permission allows update or instantiation of a key’s payload. For a keyring, it allows keys to be linked to and unlinked from the keyring.
This permission is required for the KEYCTL_UPDATE, KEYCTL_REVOKE, KEYCTL_CLEAR, KEYCTL_LINK, and KEYCTL_UNLINK operations.
The permission bits for each category are KEY_POS_WRITE, KEY_USR_WRITE, KEY_GRP_WRITE, and KEY_OTH_WRITE.
search
This permission allows keyrings to be searched and keys to be found. Searches can recurse only into nested keyrings that have search permission set.
This permission is required for the KEYCTL_GET_KEYRING_ID, KEYCTL_JOIN_SESSION_KEYRING, KEYCTL_SEARCH, and KEYCTL_INVALIDATE operations.
The permission bits for each category are KEY_POS_SEARCH, KEY_USR_SEARCH, KEY_GRP_SEARCH, and KEY_OTH_SEARCH.
link
This permission allows a key or keyring to be linked to.
This permission is required for the KEYCTL_LINK and KEYCTL_SESSION_TO_PARENT operations.
The permission bits for each category are KEY_POS_LINK, KEY_USR_LINK, KEY_GRP_LINK, and KEY_OTH_LINK.
setattr (since Linux 2.6.15)
This permission allows a key’s UID, GID, and permissions mask to be changed.
This permission is required for the KEYCTL_REVOKE, KEYCTL_CHOWN, and KEYCTL_SETPERM operations.
The permission bits for each category are KEY_POS_SETATTR, KEY_USR_SETATTR, KEY_GRP_SETATTR, and KEY_OTH_SETATTR.
As a convenience, the following macros are defined as masks for all of the permission bits in each of the user categories: KEY_POS_ALL, KEY_USR_ALL, KEY_GRP_ALL, and KEY_OTH_ALL.
The arg4 and arg5 arguments are ignored.
This operation is exposed by libkeyutils via the function keyctl_setperm(3).
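The layout of the permissions mask (four 8-bit categories, six bits used in each) can be illustrated with a small decoder. This is a sketch under stated assumptions: the bit values and category shifts below are defined locally and are assumed to mirror the KEY_POS_*, KEY_USR_*, KEY_GRP_*, and KEY_OTH_* constants; the names PERM_*, SHIFT_*, and format_category() are hypothetical.

```c
#include <stdint.h>

/* Bits within one 8-bit category mask (assumed to mirror the
   KEY_*_VIEW ... KEY_*_SETATTR constants). */
enum {
    PERM_VIEW    = 0x01,
    PERM_READ    = 0x02,
    PERM_WRITE   = 0x04,
    PERM_SEARCH  = 0x08,
    PERM_LINK    = 0x10,
    PERM_SETATTR = 0x20
};

/* Position of each category within the 32-bit mask (assumed). */
enum {
    SHIFT_POSSESSOR = 24,
    SHIFT_USER      = 16,
    SHIFT_GROUP     = 8,
    SHIFT_OTHER     = 0
};

/* Render one category as six characters in the order view, read,
   write, search, link, setattr; '-' marks a permission that is
   not granted. */
static void format_category(uint32_t perm, int shift, char out[7])
{
    static const char letters[6] = { 'v', 'r', 'w', 's', 'l', 'a' };
    uint32_t bits = (perm >> shift) & 0xff;

    for (int i = 0; i < 6; i++)
        out[i] = (bits & (1u << i)) ? letters[i] : '-';
    out[6] = '\0';
}
```

Under these assumptions, the mask 0x3f010000 decodes to "vrwsla" for the possessor, "v-----" for the user, and "------" for both group and other.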
KEYCTL_DESCRIBE (since Linux 2.6.10)
Obtain a string describing the attributes of a specified key.
The ID of the key to be described is specified in arg2 (cast to key_serial_t). The descriptive string is returned in the buffer pointed to by arg3 (cast to char *); arg4 (cast to size_t) specifies the size of that buffer in bytes.
The key must grant the caller view permission.
The returned string is null-terminated and contains the following information about the key:
type;uid;gid;perm;description
In the above, type and description are strings, uid and gid are decimal strings, and perm is a hexadecimal permissions mask. The descriptive string is written with the following format:
%s;%d;%d;%08x;%s
Note: the intention is that the descriptive string should be extensible in future kernel versions. In particular, the description field will not contain semicolons; it should be parsed by working backwards from the end of the string to find the last semicolon. This allows additional semicolon-delimited fields to be inserted in the descriptive string in the future.
Writing to the buffer is attempted only when arg3 is non-NULL and the specified buffer size is large enough to accept the descriptive string (including the terminating null byte). In order to determine whether the buffer size was too small, check to see if the return value of the operation is greater than arg4.
The arg5 argument is ignored.
This operation is exposed by libkeyutils via the function keyctl_describe(3).
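A parser that follows the advice above reads the four leading fields from the front of the string and locates the description by searching backwards for the last semicolon. The following is a sketch; struct key_desc and parse_key_description() are hypothetical names introduced for this illustration.

```c
#include <stdio.h>
#include <string.h>

struct key_desc {
    char type[32];              /* key type, including the null byte */
    int uid;
    int gid;
    unsigned int perm;          /* hexadecimal permissions mask */
    const char *description;    /* points into the input string */
};

/* Parse "type;uid;gid;perm;...;description".
   Returns 0 on success, -1 if the string is malformed. */
static int parse_key_description(const char *s, struct key_desc *out)
{
    /* The description follows the LAST semicolon, so any fields that
       future kernels insert before it are skipped automatically. */
    const char *last = strrchr(s, ';');

    if (last == NULL)
        return -1;

    if (sscanf(s, "%31[^;];%d;%d;%x",
               out->type, &out->uid, &out->gid, &out->perm) != 4)
        return -1;

    out->description = last + 1;
    return 0;
}
```

For example, the string "user;1000;1000;3f010000;mykey" yields the type "user", UID and GID 1000, the mask 0x3f010000, and the description "mykey".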
KEYCTL_CLEAR
Clear the contents of (i.e., unlink all keys from) a keyring.
The ID of the key (which must be of keyring type) is provided in arg2 (cast to key_serial_t).
The caller must have write permission on the keyring.
The arguments arg3, arg4, and arg5 are ignored.
This operation is exposed by libkeyutils via the function keyctl_clear(3).
KEYCTL_LINK (since Linux 2.6.10)
Create a link from a keyring to a key.
The key to be linked is specified in arg2 (cast to key_serial_t); the keyring is specified in arg3 (cast to key_serial_t).
If a key with the same type and description is already linked in the keyring, then that key is displaced from the keyring.
Before creating the link, the kernel checks the nesting of the keyrings and returns appropriate errors if the link would produce a cycle or if the nesting of keyrings would be too deep. (The limit on the nesting of keyrings is determined by the kernel constant KEYRING_SEARCH_MAX_DEPTH, defined with the value 6, and is necessary to prevent overflows on the kernel stack when recursively searching keyrings.)
The caller must have link permission on the key being added and write permission on the keyring.
The arguments arg4 and arg5 are ignored.
This operation is exposed by libkeyutils via the function keyctl_link(3).
KEYCTL_UNLINK (since Linux 2.6.10)
Unlink a key from a keyring.
The ID of the key to be unlinked is specified in arg2 (cast to key_serial_t); the ID of the keyring from which it is to be unlinked is specified in arg3 (cast to key_serial_t).
If the key is not currently linked into the keyring, an error results.
The caller must have write permission on the keyring from which the key is being removed.
If the last link to a key is removed, then that key will be scheduled for destruction.
The arguments arg4 and arg5 are ignored.
This operation is exposed by libkeyutils via the function keyctl_unlink(3).
KEYCTL_SEARCH (since Linux 2.6.10)
Search for a key in a keyring tree, returning its ID and optionally linking it to a specified keyring.
The tree to be searched is specified by passing the ID of the head keyring in arg2 (cast to key_serial_t). The search is performed breadth-first and recursively.
The arg3 and arg4 arguments specify the key to be searched for: arg3 (cast to char *) contains the key type (a null-terminated character string up to 32 bytes in size, including the terminating null byte), and arg4 (cast to char *) contains the description of the key (a null-terminated character string up to 4096 bytes in size, including the terminating null byte).
The source keyring must grant search permission to the caller. When performing the recursive search, only keyrings that grant the caller search permission will be searched. Only keys for which the caller has search permission can be found.
If the key is found, its ID is returned as the function result.
If the key is found and arg5 (cast to key_serial_t) is nonzero, then, subject to the same constraints and rules as KEYCTL_LINK, the key is linked into the keyring whose ID is specified in arg5. If the destination keyring specified in arg5 already contains a link to a key that has the same type and description, then that link will be displaced by a link to the key found by this operation.
Instead of valid existing keyring IDs, the source (arg2) and destination (arg5) keyrings can be one of the special keyring IDs listed under KEYCTL_GET_KEYRING_ID.
This operation is exposed by libkeyutils via the function keyctl_search(3).
KEYCTL_READ (since Linux 2.6.10)
Read the payload data of a key.
The ID of the key whose payload is to be read is specified in arg2 (cast to key_serial_t). This can be the ID of an existing key, or any of the special key IDs listed for KEYCTL_GET_KEYRING_ID.
The payload is placed in the buffer pointed to by arg3 (cast to char *); the size of that buffer must be specified in arg4 (cast to size_t).
The returned data will be processed for presentation according to the key type. For example, a keyring will return an array of key_serial_t entries representing the IDs of all the keys that are linked to it. The user key type will return its data as is. If a key type does not implement this function, the operation fails with the error EOPNOTSUPP.
If arg3 is not NULL, as much of the payload data as will fit is copied into the buffer. On a successful return, the return value is always the total size of the payload data. To determine whether the buffer was of sufficient size, check to see that the return value is less than or equal to the value supplied in arg4.
The key must either grant the caller read permission, or grant the caller search permission when searched for from the process keyrings (i.e., the key is possessed).
The arg5 argument is ignored.
This operation is exposed by libkeyutils via the function keyctl_read(3).
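Since the return value is the total payload size, a caller can size its buffer in two steps: query the size, allocate, read, and retry if the payload grew in between. The following is a sketch of that pattern; the helper name read_key_payload() is hypothetical, and the initial call with a NULL buffer relies on the assumption that the kernel then reports the size without copying any data.

```c
#define _GNU_SOURCE
#include <linux/keyctl.h>   /* KEYCTL_READ */
#include <stdlib.h>
#include <sys/syscall.h>    /* SYS_keyctl */
#include <unistd.h>         /* syscall() */

/* Hypothetical helper: read the whole payload of `key` into a newly
   allocated buffer stored in *out (the caller frees it).
   Returns the payload size, or -1 with errno set. */
static long read_key_payload(int key, char **out)
{
    /* First call: no buffer, only learn the payload size. */
    long need = syscall(SYS_keyctl, KEYCTL_READ,
                        (unsigned long) key, 0UL, 0UL, 0UL);

    while (need >= 0) {
        char *buf = malloc((size_t) need + 1);
        long got;

        if (buf == NULL)
            return -1;

        got = syscall(SYS_keyctl, KEYCTL_READ,
                      (unsigned long) key, buf, (size_t) need, 0UL);
        if (got < 0) {
            free(buf);
            return -1;
        }
        if (got <= need) {          /* buffer was large enough */
            buf[got] = '\0';        /* convenience terminator */
            *out = buf;
            return got;
        }
        free(buf);                  /* payload grew: retry with new size */
        need = got;
    }
    return -1;
}
```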
KEYCTL_INSTANTIATE (since Linux 2.6.10)
(Positively) instantiate an uninstantiated key with a specified payload.
The ID of the key to be instantiated is provided in arg2 (cast to key_serial_t).
The key payload is specified in the buffer pointed to by arg3 (cast to void *); the size of that buffer is specified in arg4 (cast to size_t).
The payload may be a null pointer and the buffer size may be 0 if this is supported by the key type (e.g., it is a keyring).
The operation may fail if the payload data is in the wrong format or is otherwise invalid.
If arg5 (cast to key_serial_t) is nonzero, then, subject to the same constraints and rules as KEYCTL_LINK, the instantiated key is linked into the keyring whose ID is specified in arg5.
The caller must have the appropriate authorization key, and once the uninstantiated key has been instantiated, the authorization key is revoked. In other words, this operation is available only from a request-key(8)-style program. See request_key(2) for an explanation of uninstantiated keys and key instantiation.
This operation is exposed by libkeyutils via the function keyctl_instantiate(3).
KEYCTL_NEGATE (since Linux 2.6.10)
Negatively instantiate an uninstantiated key.
This operation is equivalent to the call:
keyctl(KEYCTL_REJECT, arg2, arg3, ENOKEY, arg4);
The arg5 argument is ignored.
This operation is exposed by libkeyutils via the function keyctl_negate(3).
KEYCTL_SET_REQKEY_KEYRING (since Linux 2.6.13)
Set the default keyring to which implicitly requested keys will be linked for this thread, and return the previous setting. Implicit key requests are those made by internal kernel components, such as can occur when, for example, opening files on an AFS or NFS filesystem. Setting the default keyring also has an effect when requesting a key from user space; see request_key(2) for details.
The arg2 argument (cast to int) should contain one of the following values, to specify the new default keyring:
KEY_REQKEY_DEFL_NO_CHANGE
Don’t change the default keyring. This can be used to discover the current default keyring (without changing it).
KEY_REQKEY_DEFL_DEFAULT
This selects the default behaviour, which is to use the thread-specific keyring if there is one, otherwise the process-specific keyring if there is one, otherwise the session keyring if there is one, otherwise the UID-specific session keyring, otherwise the user-specific keyring.
KEY_REQKEY_DEFL_THREAD_KEYRING
Use the thread-specific keyring (thread-keyring(7)) as the new default keyring.
KEY_REQKEY_DEFL_PROCESS_KEYRING
Use the process-specific keyring (process-keyring(7)) as the new default keyring.
KEY_REQKEY_DEFL_SESSION_KEYRING
Use the session-specific keyring (session-keyring(7)) as the new default keyring.
KEY_REQKEY_DEFL_USER_KEYRING
Use the UID-specific keyring (user-keyring(7)) as the new default keyring.
KEY_REQKEY_DEFL_USER_SESSION_KEYRING
Use the UID-specific session keyring (user-session-keyring(7)) as the new default keyring.
KEY_REQKEY_DEFL_REQUESTOR_KEYRING (since Linux 2.6.29)
Use the requestor keyring.
All other values are invalid.
The arguments arg3, arg4, and arg5 are ignored.
The setting controlled by this operation is inherited by the child of fork(2) and preserved across execve(2).
This operation is exposed by libkeyutils via the function keyctl_set_reqkey_keyring(3).
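As a minimal sketch of the query case described above, the following helper reads the current default keyring setting without changing it. The helper name current_reqkey_default() is hypothetical, and the raw syscall(2) interface is used here only to avoid a dependency on libkeyutils; a real program would more likely call keyctl_set_reqkey_keyring(3).

```c
#define _GNU_SOURCE
#include <linux/keyctl.h>   /* KEYCTL_* and KEY_REQKEY_DEFL_* constants */
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the calling thread's current default
 * request-key keyring setting (one of the KEY_REQKEY_DEFL_* values)
 * without changing it, or -1 on error with errno set.
 */
long
current_reqkey_default(void)
{
    /* arg3, arg4, and arg5 are ignored by this operation. */
    return syscall(SYS_keyctl, KEYCTL_SET_REQKEY_KEYRING,
                   (long) KEY_REQKEY_DEFL_NO_CHANGE, 0L, 0L, 0L);
}
```

Passing any other KEY_REQKEY_DEFL_* value in place of KEY_REQKEY_DEFL_NO_CHANGE would change the setting and return the previous one.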
KEYCTL_SET_TIMEOUT (since Linux 2.6.16)
Set a timeout on a key.
The ID of the key is specified in arg2 (cast to key_serial_t). The timeout value, in seconds from the current time, is specified in arg3 (cast to unsigned int). The timeout is measured against the realtime clock.
Specifying the timeout value as 0 clears any existing timeout on the key.
The /proc/keys file displays the remaining time until each key will expire. (This is the only method of discovering the timeout on a key.)
The caller must either have the setattr permission on the key or hold an instantiation authorization token for the key (see request_key(2)).
The key and any links to the key will be automatically garbage collected after the timeout expires. Subsequent attempts to access the key will then fail with the error EKEYEXPIRED.
This operation cannot be used to set timeouts on revoked, expired, or negatively instantiated keys.
The arguments arg4 and arg5 are ignored.
This operation is exposed by libkeyutils via the function keyctl_set_timeout(3).
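The following sketch wraps the operation in a small helper. The name set_key_timeout() is hypothetical, and the raw syscall(2) interface is an assumption made to keep the example free of library dependencies; keyctl_set_timeout(3) is the usual route.

```c
#define _GNU_SOURCE
#include <linux/keyctl.h>   /* KEYCTL_SET_TIMEOUT */
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: make the key `id` expire `seconds` seconds from
 * now. Passing 0 for `seconds` clears any existing timeout instead.
 * Returns 0 on success, or -1 with errno set on failure.
 */
long
set_key_timeout(int id, unsigned int seconds)
{
    /* arg4 and arg5 are ignored by this operation. */
    return syscall(SYS_keyctl, KEYCTL_SET_TIMEOUT,
                   (long) id, (unsigned long) seconds, 0L, 0L);
}
```

For example, set_key_timeout(key, 300) would give a key five minutes to live, and set_key_timeout(key, 0) would remove that limit again, provided the caller has setattr permission on the key.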
KEYCTL_ASSUME_AUTHORITY (since Linux 2.6.16)
Assume (or divest) the authority for the calling thread to instantiate a key.
The arg2 argument (cast to key_serial_t) specifies either a nonzero key ID to assume authority, or the value 0 to divest authority.
If arg2 is nonzero, then it specifies the ID of an uninstantiated key for which authority is to be assumed. That key can then be instantiated using one of KEYCTL_INSTANTIATE, KEYCTL_INSTANTIATE_IOV, KEYCTL_REJECT, or KEYCTL_NEGATE. Once the key has been instantiated, the thread is automatically divested of authority to instantiate the key.
Authority over a key can be assumed only if the calling thread has present in its keyrings the authorization key that is associated with the specified key. (In other words, the KEYCTL_ASSUME_AUTHORITY operation is available only from a request-key(8)-style program; see request_key(2) for an explanation of how this operation is used.) The caller must have search permission on the authorization key.
If the specified key has a matching authorization key, then the ID of that key is returned. The authorization key can be read (KEYCTL_READ) to obtain the callout information passed to request_key(2).
If the ID given in arg2 is 0, then the currently assumed authority is cleared (divested), and the value 0 is returned.
The KEYCTL_ASSUME_AUTHORITY mechanism allows a program such as request-key(8) to assume the necessary authority to instantiate a new uninstantiated key that was created as a consequence of a call to request_key(2). For further information, see request_key(2) and the kernel source file Documentation/security/keys-request-key.txt.
The arguments arg3, arg4, and arg5 are ignored.
This operation is exposed by libkeyutils via the function keyctl_assume_authority(3).
KEYCTL_GET_SECURITY (since Linux 2.6.26)
Get the LSM (Linux Security Module) security label of the specified key.
The ID of the key whose security label is to be fetched is specified in arg2 (cast to key_serial_t). The security label (terminated by a null byte) will be placed in the buffer pointed to by the arg3 argument (cast to char *); the size of the buffer must be provided in arg4 (cast to size_t).
If arg3 is specified as NULL or the buffer size specified in arg4 is too small, the full size of the security label string (including the terminating null byte) is returned as the function result, and nothing is copied to the buffer.
The caller must have view permission on the specified key.
The returned security label string will be rendered in a form appropriate to the LSM in force. For example, with SELinux, it may look like:
unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
If no LSM is currently in force, then an empty string is placed in the buffer.
The arg5 argument is ignored.
This operation is exposed by libkeyutils via the functions keyctl_get_security(3) and keyctl_get_security_alloc(3).
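Because the required size is returned when arg3 is NULL, a caller can size the buffer with a first call and fetch the label with a second. The sketch below shows one way to do this; the helper name key_security_label() is hypothetical, and the raw syscall(2) interface is used only to avoid depending on libkeyutils (whose keyctl_get_security_alloc(3) serves the same purpose).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/keyctl.h>   /* KEYCTL_GET_SECURITY */
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the LSM security label of key `id` as a
 * malloc()ed, null-terminated string, or NULL on error with errno set.
 * The caller frees the result.
 */
char *
key_security_label(int id)
{
    for (;;) {
        /* First call: NULL buffer, so only the required size is returned. */
        long size = syscall(SYS_keyctl, KEYCTL_GET_SECURITY,
                            (long) id, NULL, 0UL, 0L);
        if (size == -1)
            return NULL;

        char *buf = malloc((size_t) size);
        if (buf == NULL)
            return NULL;

        /* Second call: copy the label, including its null terminator. */
        long got = syscall(SYS_keyctl, KEYCTL_GET_SECURITY,
                           (long) id, buf, (unsigned long) size, 0L);
        if (got != -1 && got <= size)
            return buf;

        int saved = errno;
        free(buf);
        if (got == -1) {
            errno = saved;
            return NULL;
        }
        /* The label grew between the two calls; try again. */
    }
}
```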
KEYCTL_SESSION_TO_PARENT (since Linux 2.6.32)
Replace the session keyring to which the parent of the calling process subscribes with the session keyring of the calling process.
The keyring will be replaced in the parent process at the point where the parent next transitions from kernel space to user space.
The keyring must exist and must grant the caller link permission. The parent process must be single-threaded and have the same effective ownership as this process and must not be set-user-ID or set-group-ID. The UID of the parent process’s existing session keyring (if it has one), as well as the UID of the caller’s session keyring, must match the caller’s effective UID.
The fact that it is the parent process that is affected by this operation allows a program such as the shell to start a child process that uses this operation to change the shell’s session keyring. (This is what the keyctl(1) new_session command does.)
The arguments arg2, arg3, arg4, and arg5 are ignored.
This operation is exposed by libkeyutils via the function keyctl_session_to_parent(3).
KEYCTL_REJECT (since Linux 2.6.39)
Mark a key as negatively instantiated and set an expiration timer on the key. This operation provides a superset of the functionality of the earlier KEYCTL_NEGATE operation.
The ID of the key that is to be negatively instantiated is specified in arg2 (cast to key_serial_t). The arg3 (cast to unsigned int) argument specifies the lifetime of the key, in seconds. The arg4 argument (cast to unsigned int) specifies the error to be returned when a search hits this key; typically, this is one of EKEYREJECTED, EKEYREVOKED, or EKEYEXPIRED.
If arg5 (cast to key_serial_t) is nonzero, then, subject to the same constraints and rules as KEYCTL_LINK, the negatively instantiated key is linked into the keyring whose ID is specified in arg5.
The caller must have the appropriate authorization key, and once the uninstantiated key has been instantiated, the authorization key is revoked. In other words, this operation is available only from a request-key(8)-style program. See request_key(2) for an explanation of uninstantiated keys and key instantiation.
This operation is exposed by libkeyutils via the function keyctl_reject(3).
KEYCTL_INSTANTIATE_IOV (since Linux 2.6.39)
Instantiate an uninstantiated key with a payload specified via a vector of buffers.
This operation is the same as KEYCTL_INSTANTIATE, but the payload data is specified as an array of iovec structures (see iovec(3type)).
The pointer to the payload vector is specified in arg3 (cast as const struct iovec *). The number of items in the vector is specified in arg4 (cast as unsigned int).
The arg2 (key ID) and arg5 (keyring ID) are interpreted as for KEYCTL_INSTANTIATE.
This operation is exposed by libkeyutils via the function keyctl_instantiate_iov(3).
KEYCTL_INVALIDATE (since Linux 3.5)
Mark a key as invalid.
The ID of the key to be invalidated is specified in arg2 (cast to key_serial_t).
To invalidate a key, the caller must have search permission on the key.
This operation marks the key as invalid and schedules immediate garbage collection. The garbage collector removes the invalidated key from all keyrings and deletes the key when its reference count reaches zero. After this operation, the key will be ignored by all searches, even if it is not yet deleted.
Keys that are marked invalid become invisible to normal key operations immediately, though they are still visible in /proc/keys (marked with an ‘i’ flag) until they are actually removed.
The arguments arg3, arg4, and arg5 are ignored.
This operation is exposed by libkeyutils via the function keyctl_invalidate(3).
KEYCTL_GET_PERSISTENT (since Linux 3.13)
Get the persistent keyring (persistent-keyring(7)) for a specified user and link it to a specified keyring.
The user ID is specified in arg2 (cast to uid_t). If the value -1 is specified, the caller’s real user ID is used. The ID of the destination keyring is specified in arg3 (cast to key_serial_t).
The caller must have the CAP_SETUID capability in its user namespace in order to fetch the persistent keyring for a user ID that does not match either the real or effective user ID of the caller.
If the call is successful, a link to the persistent keyring is added to the keyring whose ID was specified in arg3.
The caller must have write permission on the keyring.
The persistent keyring will be created by the kernel if it does not yet exist.
Each time the KEYCTL_GET_PERSISTENT operation is performed, the persistent keyring will have its expiration timeout reset to the value in:
/proc/sys/kernel/keys/persistent_keyring_expiry
Should the timeout be reached, the persistent keyring will be removed and everything it pins can then be garbage collected.
Persistent keyrings were added in Linux 3.13.
The arguments arg4 and arg5 are ignored.
This operation is exposed by libkeyutils via the function keyctl_get_persistent(3).
KEYCTL_DH_COMPUTE (since Linux 4.7)
Compute a Diffie-Hellman shared secret or public key, optionally applying a key derivation function (KDF) to the result.
The arg2 argument is a pointer to a set of parameters containing serial numbers for three “user” keys used in the Diffie-Hellman calculation, packaged in a structure of the following form:
struct keyctl_dh_params {
int32_t private; /* The local private key */
int32_t prime; /* The prime, known to both parties */
int32_t base; /* The base integer: either a shared
generator or the remote public key */
};
Each of the three keys specified in this structure must grant the caller read permission. The payloads of these keys are used to calculate the Diffie-Hellman result as:
base ^ private mod prime
If the base is the shared generator, the result is the local public key. If the base is the remote public key, the result is the shared secret.
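To make the formula concrete, here is a toy sketch of the same calculation in user space using square-and-multiply. It is an illustration only: the function name mod_pow() is hypothetical, the operands are limited to 32 bits, and real Diffie-Hellman parameters are far larger and need a multiple-precision library.

```c
#include <stdint.h>

/*
 * Compute (base ^ exponent) mod modulus by square-and-multiply.
 * Toy-sized: operands are limited to 32 bits so that the 64-bit
 * intermediate products cannot overflow. modulus must be nonzero.
 */
uint32_t
mod_pow(uint32_t base, uint32_t exponent, uint32_t modulus)
{
    uint64_t result = 1 % modulus;
    uint64_t b = base % modulus;

    while (exponent > 0) {
        if (exponent & 1)
            result = result * b % modulus;   /* multiply step */
        b = b * b % modulus;                 /* square step */
        exponent >>= 1;
    }
    return (uint32_t) result;
}
```

With prime 23 and generator 5, a party whose private key is 6 obtains the public key mod_pow(5, 6, 23), which is 8, and a party whose private key is 15 obtains 19. Each side then uses the other's public key as the base: mod_pow(19, 6, 23) and mod_pow(8, 15, 23) both yield the shared secret 2.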
The arg3 argument (cast to char *) points to a buffer where the result of the calculation is placed. The size of that buffer is specified in arg4 (cast to size_t).
The buffer must be large enough to accommodate the output data, otherwise an error is returned. If arg4 is specified as zero, the buffer is not used and the operation returns the minimum required buffer size (i.e., the length of the prime).
Diffie-Hellman computations can be performed in user space, but require a multiple-precision integer (MPI) library. Moving the implementation into the kernel gives access to the kernel MPI implementation, and allows access to secure or acceleration hardware.
Adding support for DH computation to the keyctl() system call was considered a good fit due to the DH algorithm’s use for deriving shared keys; it also allows the type of the key to determine which DH implementation (software or hardware) is appropriate.
If the arg5 argument is NULL, then the DH result itself is returned. Otherwise (since Linux 4.12), it is a pointer to a structure which specifies parameters of the KDF operation to be applied:
struct keyctl_kdf_params {
char *hashname; /* Hash algorithm name */
char *otherinfo; /* SP800-56A OtherInfo */
__u32 otherinfolen; /* Length of otherinfo data */
__u32 __spare[8]; /* Reserved */
};
The hashname field is a null-terminated string which specifies the name of a hash algorithm, available in the kernel’s crypto API, to be applied to the DH result in the KDF operation. The list of available hashes is rather tricky to observe: refer to the “Kernel Crypto API Architecture” documentation for how hash names are constructed, and to your kernel’s source and configuration for which ciphers and templates with type CRYPTO_ALG_TYPE_SHASH are available.
The otherinfo field contains the OtherInfo data described in SP800-56A section 5.8.1.2 and is algorithm-specific. This data is concatenated with the result of the DH operation and is provided as an input to the KDF operation. Its size is provided in the otherinfolen field and is limited by the KEYCTL_KDF_MAX_OI_LEN constant, which is defined in security/keys/internal.h with a value of 64.
The __spare field is currently unused. It was ignored until Linux 4.13 (but still should be user-addressable since it is copied to the kernel), and should contain zeros since Linux 4.13.
The KDF implementation complies with SP800-56A as well as with SP800-108 (the counter KDF).
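The field rules above can be sketched as a small initializer. The helper name init_kdf_params() and the EXAMPLE_KDF_MAX_OI_LEN macro are hypothetical (the real limit is kernel-internal, so the value 64 is copied from the text above), and the structure definition is assumed to come from <linux/keyctl.h>.

```c
#include <errno.h>
#include <linux/keyctl.h>   /* struct keyctl_kdf_params */
#include <string.h>

/* Limit quoted in the text above; the real constant is kernel-internal. */
#define EXAMPLE_KDF_MAX_OI_LEN 64

/*
 * Hypothetical helper: fill in *kdf for use as arg5 of KEYCTL_DH_COMPUTE.
 * Returns 0 on success, or -1 with errno set to EMSGSIZE if otherinfo
 * is longer than the limit described above.
 */
int
init_kdf_params(struct keyctl_kdf_params *kdf, char *hashname,
                char *otherinfo, size_t otherinfolen)
{
    if (otherinfolen > EXAMPLE_KDF_MAX_OI_LEN) {
        errno = EMSGSIZE;
        return -1;
    }

    /* Zero everything first so that __spare contains only zeros. */
    memset(kdf, 0, sizeof(*kdf));
    kdf->hashname = hashname;
    kdf->otherinfo = otherinfo;
    kdf->otherinfolen = (__u32) otherinfolen;
    return 0;
}
```

Clearing the whole structure before setting the named fields keeps the call valid on kernels that reject nonzero __spare values.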
This operation is exposed by libkeyutils (from libkeyutils 1.5.10 onwards) via the functions keyctl_dh_compute(3) and keyctl_dh_compute_alloc(3).
KEYCTL_RESTRICT_KEYRING (since Linux 4.12)
Apply a key-linking restriction to the keyring with the ID provided in arg2 (cast to key_serial_t). The caller must have setattr permission on the key. If arg3 is NULL, any attempt to add a key to the keyring is blocked; otherwise it contains a pointer to a string with a key type name and arg4 contains a pointer to a string that describes the type-specific restriction. As of Linux 4.12, only the type “asymmetric” has restrictions defined:
builtin_trusted
Allows only keys that are signed by a key linked to the built-in keyring (".builtin_trusted_keys").
builtin_and_secondary_trusted
Allows only keys that are signed by a key linked to the secondary keyring (".secondary_trusted_keys") or, by extension, a key in a built-in keyring, as the latter is linked to the former.
key_or_keyring:key
key_or_keyring:key:chain
If key specifies the ID of a key of type “asymmetric”, then only keys that are signed by this key are allowed.
If key specifies the ID of a keyring, then only keys that are signed by a key linked to this keyring are allowed.
If “:chain” is specified, keys that are signed by a key linked to the destination keyring (that is, the keyring with the ID specified in the arg2 argument) are also allowed.
Note that a restriction can be configured only once for the specified keyring; once a restriction is set, it can’t be overridden.
The argument arg5 is ignored.
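The key_or_keyring restriction strings described above can be assembled with snprintf(3), as in the sketch below. The helper name format_key_restriction() is hypothetical, and writing the key ID in decimal is an assumption about what the kernel accepts.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write "key_or_keyring:<id>" or, if `chain` is
 * nonzero, "key_or_keyring:<id>:chain" into buf. Returns the length of
 * the string, or -1 if buf is too small to hold it.
 */
int
format_key_restriction(char *buf, size_t size, int key_id, int chain)
{
    int n = snprintf(buf, size, "key_or_keyring:%d%s",
                     key_id, chain ? ":chain" : "");

    return (n < 0 || (size_t) n >= size) ? -1 : n;
}
```

The resulting string would be passed as arg4, with arg3 pointing to the string "asymmetric". Since a restriction can be set only once per keyring, it is worth validating the string before making the call.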
RETURN VALUE
For a successful call, the return value depends on the operation:
KEYCTL_GET_KEYRING_ID
The ID of the requested keyring.
KEYCTL_JOIN_SESSION_KEYRING
The ID of the joined session keyring.
KEYCTL_DESCRIBE
The size of the description (including the terminating null byte), irrespective of the provided buffer size.
KEYCTL_SEARCH
The ID of the key that was found.
KEYCTL_READ
The amount of data that is available in the key, irrespective of the provided buffer size.
KEYCTL_SET_REQKEY_KEYRING
The ID of the previous default keyring to which implicitly requested keys were linked (one of KEY_REQKEY_DEFL_*).
KEYCTL_ASSUME_AUTHORITY
Either 0, if the ID given was 0, or the ID of the authorization key matching the specified key, if a nonzero key ID was provided.
KEYCTL_GET_SECURITY
The size of the LSM security label string (including the terminating null byte), irrespective of the provided buffer size.
KEYCTL_GET_PERSISTENT
The ID of the persistent keyring.
KEYCTL_DH_COMPUTE
The number of bytes copied to the buffer, or, if arg4 is 0, the required buffer size.
All other operations
Zero.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
The requested operation wasn’t permitted.
EAGAIN
operation was KEYCTL_DH_COMPUTE and there was an error during crypto module initialization.
EDEADLK
operation was KEYCTL_LINK and the requested link would result in a cycle.
EDEADLK
operation was KEYCTL_RESTRICT_KEYRING and the requested keyring restriction would result in a cycle.
EDQUOT
The key quota for the caller’s user would be exceeded by creating a key or linking it to the keyring.
EEXIST
operation was KEYCTL_RESTRICT_KEYRING and the keyring provided in the arg2 argument already has a restriction set.
EFAULT
operation was KEYCTL_DH_COMPUTE and one of the following has failed:
copying of the struct keyctl_dh_params, provided in the arg2 argument, from user space;
copying of the struct keyctl_kdf_params, provided in the non-NULL arg5 argument, from user space (in case kernel supports performing KDF operation on DH operation result);
copying of the data pointed to by the hashname field of the struct keyctl_kdf_params from user space;
copying of the data pointed to by the otherinfo field of the struct keyctl_kdf_params from user space if the otherinfolen field was nonzero;
copying of the result to user space.
EINVAL
operation was KEYCTL_SETPERM and an invalid permission bit was specified in arg3.
EINVAL
operation was KEYCTL_SEARCH and the size of the description in arg4 (including the terminating null byte) exceeded 4096 bytes.
EINVAL
size of the string (including the terminating null byte) specified in arg3 (the key type) or arg4 (the key description) exceeded the limit (32 bytes and 4096 bytes respectively).
EINVAL (before Linux 4.12)
operation was KEYCTL_DH_COMPUTE and the arg5 argument was non-NULL.
EINVAL
operation was KEYCTL_DH_COMPUTE and the digest size of the hashing algorithm supplied is zero.
EINVAL
operation was KEYCTL_DH_COMPUTE and the buffer size provided is not enough to hold the result. Provide 0 as a buffer size in order to obtain the minimum buffer size.
EINVAL
operation was KEYCTL_DH_COMPUTE and the hash name provided in the hashname field of the struct keyctl_kdf_params pointed to by the arg5 argument is too long (the limit is implementation-specific and varies between kernel versions, but it is deemed big enough for all valid algorithm names).
EINVAL
operation was KEYCTL_DH_COMPUTE and the __spare field of the struct keyctl_kdf_params provided in the arg5 argument contains nonzero values.
EKEYEXPIRED
An expired key was found or specified.
EKEYREJECTED
A rejected key was found or specified.
EKEYREVOKED
A revoked key was found or specified.
ELOOP
operation was KEYCTL_LINK and the requested link would cause the maximum nesting depth for keyrings to be exceeded.
EMSGSIZE
operation was KEYCTL_DH_COMPUTE and the buffer length exceeds KEYCTL_KDF_MAX_OUTPUT_LEN (which is 1024 currently) or the otherinfolen field of the struct keyctl_kdf_params passed in arg5 exceeds KEYCTL_KDF_MAX_OI_LEN (which is 64 currently).
ENFILE (before Linux 3.13)
operation was KEYCTL_LINK and the keyring is full. (Before Linux 3.13, the available space for storing keyring links was limited to a single page of memory; since Linux 3.13, there is no fixed limit.)
ENOENT
operation was KEYCTL_UNLINK and the key to be unlinked isn’t linked to the keyring.
ENOENT
operation was KEYCTL_DH_COMPUTE and the hashing algorithm specified in the hashname field of the struct keyctl_kdf_params pointed to by the arg5 argument hasn’t been found.
ENOENT
operation was KEYCTL_RESTRICT_KEYRING and the type provided in the arg3 argument doesn’t support setting key linking restrictions.
ENOKEY
No matching key was found or an invalid key was specified.
ENOKEY
The value KEYCTL_GET_KEYRING_ID was specified in operation, the key specified in arg2 did not exist, and arg3 was zero (meaning don’t create the key if it didn’t exist).
ENOMEM
One of the kernel’s memory allocation routines failed during the execution of the system call.
ENOTDIR
A key of keyring type was expected but the ID of a key with a different type was provided.
EOPNOTSUPP
operation was KEYCTL_READ and the key type does not support reading (e.g., the type is “login”).
EOPNOTSUPP
operation was KEYCTL_UPDATE and the key type does not support updating.
EOPNOTSUPP
operation was KEYCTL_RESTRICT_KEYRING, the type provided in the arg3 argument was “asymmetric”, and the key specified in the restriction specification provided in arg4 has a type other than “asymmetric” or “keyring”.
EPERM
operation was KEYCTL_GET_PERSISTENT, arg2 specified a UID other than the calling thread’s real or effective UID, and the caller did not have the CAP_SETUID capability.
EPERM
operation was KEYCTL_SESSION_TO_PARENT and either: all of the UIDs (GIDs) of the parent process do not match the effective UID (GID) of the calling process; the UID of the parent’s existing session keyring or the UID of the caller’s session keyring did not match the effective UID of the caller; the parent process is not single-threaded; or the parent process is init(1) or a kernel thread.
ETIMEDOUT
operation was KEYCTL_DH_COMPUTE and the initialization of crypto modules has timed out.
VERSIONS
A wrapper is provided in the libkeyutils library. (The accompanying package provides the <keyutils.h> header file.) However, rather than using this system call directly, you probably want to use the various library functions mentioned in the descriptions of individual operations above.
STANDARDS
Linux.
HISTORY
Linux 2.6.10.
EXAMPLES
The program below provides a subset of the functionality of the request-key(8) program provided by the keyutils package. For informational purposes, the program records various information in a log file.
As described in request_key(2), the request-key(8) program is invoked with command-line arguments that describe a key that is to be instantiated. The example program fetches and logs these arguments. The program assumes authority to instantiate the requested key, and then instantiates that key.
The following shell session demonstrates the use of this program. In the session, we compile the program and then use it to temporarily replace the standard request-key(8) program. (Note that temporarily disabling the standard request-key(8) program may not be safe on some systems.) While our example program is installed, we use the example program shown in request_key(2) to request a key.
$ cc -o key_instantiate key_instantiate.c -lkeyutils
$ sudo mv /sbin/request-key /sbin/request-key.backup
$ sudo cp key_instantiate /sbin/request-key
$ ./t_request_key user mykey somepayloaddata
Key ID is 20d035bf
$ sudo mv /sbin/request-key.backup /sbin/request-key
Looking at the log file created by this program, we can see the command-line arguments supplied to our example program:
$ cat /tmp/key_instantiate.log
Time: Mon Nov 7 13:06:47 2016
Command line arguments:
argv[0]: /sbin/request-key
operation: create
key_to_instantiate: 20d035bf
UID: 1000
GID: 1000
thread_keyring: 0
process_keyring: 0
session_keyring: 256e6a6
Key description: user;1000;1000;3f010000;mykey
Auth key payload: somepayloaddata
Destination keyring: 256e6a6
Auth key description: .request_key_auth;1000;1000;0b010000;20d035bf
The last few lines of the above output show that the example program was able to fetch:
the description of the key to be instantiated, which included the name of the key (mykey);
the payload of the authorization key, which consisted of the data (somepayloaddata) passed to request_key(2);
the destination keyring that was specified in the call to request_key(2); and
the description of the authorization key, where we can see that the name of the authorization key matches the ID of the key that is to be instantiated (20d035bf).
The example program in request_key(2) specified the destination keyring as KEY_SPEC_SESSION_KEYRING. By examining the contents of /proc/keys, we can see that this was translated to the ID of the destination keyring (0256e6a6) shown in the log output above; we can also see the newly created key with the name mykey and ID 20d035bf.
$ cat /proc/keys | egrep 'mykey|256e6a6'
0256e6a6 I--Q--- 194 perm 3f030000 1000 1000 keyring _ses: 3
20d035bf I--Q--- 1 perm 3f010000 1000 1000 user mykey: 16
Program source
/* key_instantiate.c */

#include <errno.h>
#include <keyutils.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>

#ifndef KEY_SPEC_REQUESTOR_KEYRING
#define KEY_SPEC_REQUESTOR_KEYRING      (-8)
#endif

int
main(int argc, char *argv[])
{
    int           akp_size;       /* Size of auth_key_payload */
    int           auth_key;
    char          dbuf[256];
    char          auth_key_payload[256];
    char          *operation;
    FILE          *fp;
    gid_t         gid;
    uid_t         uid;
    time_t        t;
    key_serial_t  key_to_instantiate, dest_keyring;
    key_serial_t  thread_keyring, process_keyring, session_keyring;

    if (argc != 8) {
        fprintf(stderr, "Usage: %s op key uid gid thread_keyring "
                        "process_keyring session_keyring\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fp = fopen("/tmp/key_instantiate.log", "w");
    if (fp == NULL)
        exit(EXIT_FAILURE);

    setbuf(fp, NULL);

    t = time(NULL);
    fprintf(fp, "Time: %s\n", ctime(&t));

    /*
     * The kernel passes a fixed set of arguments to the program
     * that it execs; fetch them.
     */
    operation = argv[1];
    key_to_instantiate = atoi(argv[2]);
    uid = atoi(argv[3]);
    gid = atoi(argv[4]);
    thread_keyring = atoi(argv[5]);
    process_keyring = atoi(argv[6]);
    session_keyring = atoi(argv[7]);

    fprintf(fp, "Command line arguments:\n");
    fprintf(fp, "  argv[0]:            %s\n", argv[0]);
    fprintf(fp, "  operation:          %s\n", operation);
    fprintf(fp, "  key_to_instantiate: %jx\n",
            (uintmax_t) key_to_instantiate);
    fprintf(fp, "  UID:                %jd\n", (intmax_t) uid);
    fprintf(fp, "  GID:                %jd\n", (intmax_t) gid);
    fprintf(fp, "  thread_keyring:     %jx\n",
            (uintmax_t) thread_keyring);
    fprintf(fp, "  process_keyring:    %jx\n",
            (uintmax_t) process_keyring);
    fprintf(fp, "  session_keyring:    %jx\n",
            (uintmax_t) session_keyring);
    fprintf(fp, "\n");

    /*
     * Assume the authority to instantiate the key named in argv[2].
     */
    if (keyctl(KEYCTL_ASSUME_AUTHORITY, key_to_instantiate) == -1) {
        fprintf(fp, "KEYCTL_ASSUME_AUTHORITY failed: %s\n",
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    /*
     * Fetch the description of the key that is to be instantiated.
     */
    if (keyctl(KEYCTL_DESCRIBE, key_to_instantiate,
               dbuf, sizeof(dbuf)) == -1) {
        fprintf(fp, "KEYCTL_DESCRIBE failed: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }

    fprintf(fp, "Key description:      %s\n", dbuf);

    /*
     * Fetch the payload of the authorization key, which is
     * actually the callout data given to request_key().
     */
    akp_size = keyctl(KEYCTL_READ, KEY_SPEC_REQKEY_AUTH_KEY,
                      auth_key_payload, sizeof(auth_key_payload));
    if (akp_size == -1) {
        fprintf(fp, "KEYCTL_READ failed: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }

    /*
     * KEYCTL_READ returns the full payload size even if the buffer
     * was too small, so check before writing the terminator.
     */
    if (akp_size >= (int) sizeof(auth_key_payload)) {
        fprintf(fp, "Auth key payload too large for buffer\n");
        exit(EXIT_FAILURE);
    }

    auth_key_payload[akp_size] = '\0';
    fprintf(fp, "Auth key payload:     %s\n", auth_key_payload);

    /*
     * For interest, get the ID of the authorization key and
     * display it.
     */
    auth_key = keyctl(KEYCTL_GET_KEYRING_ID,
                      KEY_SPEC_REQKEY_AUTH_KEY);
    if (auth_key == -1) {
        fprintf(fp, "KEYCTL_GET_KEYRING_ID failed: %s\n",
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    fprintf(fp, "Auth key ID:          %jx\n", (uintmax_t) auth_key);

    /*
     * Fetch key ID for the request_key(2) destination keyring.
     */
    dest_keyring = keyctl(KEYCTL_GET_KEYRING_ID,
                          KEY_SPEC_REQUESTOR_KEYRING);
    if (dest_keyring == -1) {
        fprintf(fp, "KEYCTL_GET_KEYRING_ID failed: %s\n",
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    fprintf(fp, "Destination keyring:  %jx\n", (uintmax_t) dest_keyring);

    /*
     * Fetch the description of the authorization key. This
     * allows us to see the key type, UID, GID, permissions,
     * and description (name) of the key. Among other things,
     * we will see that the name of the key is a hexadecimal
     * string representing the ID of the key to be instantiated.
     */
    if (keyctl(KEYCTL_DESCRIBE, KEY_SPEC_REQKEY_AUTH_KEY,
               dbuf, sizeof(dbuf)) == -1) {
        fprintf(fp, "KEYCTL_DESCRIBE failed: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }

    fprintf(fp, "Auth key description: %s\n", dbuf);

    /*
     * Instantiate the key using the callout data that was supplied
     * in the payload of the authorization key.
     */
    if (keyctl(KEYCTL_INSTANTIATE, key_to_instantiate,
               auth_key_payload, akp_size + 1, dest_keyring) == -1) {
        fprintf(fp, "KEYCTL_INSTANTIATE failed: %s\n",
                strerror(errno));
        exit(EXIT_FAILURE);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
keyctl(1), add_key(2), request_key(2), keyctl(3), keyctl_assume_authority(3), keyctl_chown(3), keyctl_clear(3), keyctl_describe(3), keyctl_describe_alloc(3), keyctl_dh_compute(3), keyctl_dh_compute_alloc(3), keyctl_get_keyring_ID(3), keyctl_get_persistent(3), keyctl_get_security(3), keyctl_get_security_alloc(3), keyctl_instantiate(3), keyctl_instantiate_iov(3), keyctl_invalidate(3), keyctl_join_session_keyring(3), keyctl_link(3), keyctl_negate(3), keyctl_read(3), keyctl_read_alloc(3), keyctl_reject(3), keyctl_revoke(3), keyctl_search(3), keyctl_session_to_parent(3), keyctl_set_reqkey_keyring(3), keyctl_set_timeout(3), keyctl_setperm(3), keyctl_unlink(3), keyctl_update(3), recursive_key_scan(3), recursive_session_key_scan(3), capabilities(7), credentials(7), keyrings(7), keyutils(7), persistent-keyring(7), process-keyring(7), session-keyring(7), thread-keyring(7), user-keyring(7), user_namespaces(7), user-session-keyring(7), request-key(8)
The kernel source files under Documentation/security/keys/ (or, before Linux 4.13, in the file Documentation/security/keys.txt).
129 - Linux cli command brk
NAME
brk - change data segment size
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int brk(void *addr);
void *sbrk(intptr_t increment);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
brk(), sbrk():
Since glibc 2.19:
_DEFAULT_SOURCE
|| ((_XOPEN_SOURCE >= 500) &&
! (_POSIX_C_SOURCE >= 200112L))
From glibc 2.12 to glibc 2.19:
_BSD_SOURCE || _SVID_SOURCE
|| ((_XOPEN_SOURCE >= 500) &&
! (_POSIX_C_SOURCE >= 200112L))
Before glibc 2.12:
_BSD_SOURCE || _SVID_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION
brk() and sbrk() change the location of the program break, which defines the end of the process’s data segment (i.e., the program break is the first location after the end of the uninitialized data segment). Increasing the program break has the effect of allocating memory to the process; decreasing the break deallocates memory.
brk() sets the end of the data segment to the value specified by addr, when that value is reasonable, the system has enough memory, and the process does not exceed its maximum data size (see setrlimit(2)).
sbrk() increments the program’s data space by increment bytes. Calling sbrk() with an increment of 0 can be used to find the current location of the program break.
RETURN VALUE
On success, brk() returns zero. On error, -1 is returned, and errno is set to ENOMEM.
On success, sbrk() returns the previous program break. (If the break was increased, then this value is a pointer to the start of the newly allocated memory). On error, (void *) -1 is returned, and errno is set to ENOMEM.
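The return conventions above can be illustrated with a small wrapper around sbrk(). This is a sketch only, and the helper name grow_break() is hypothetical; as noted elsewhere on this page, malloc(3) is the better choice in real programs.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: grow the data segment by `bytes` bytes and
 * return a pointer to the start of the newly allocated region, or
 * NULL on failure (errno is then ENOMEM).
 */
void *
grow_break(intptr_t bytes)
{
    void *old_break = sbrk(bytes);

    /* sbrk() reports failure as (void *) -1, not as NULL. */
    return (old_break == (void *) -1) ? NULL : old_break;
}
```

Because the previous break is also the start of the new memory, the value returned by grow_break(4096) equals what sbrk(0) reported just before the call, and a later sbrk(-4096) gives the memory back.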
STANDARDS
None.
HISTORY
4.3BSD; SUSv1, marked LEGACY in SUSv2, removed in POSIX.1-2001.
NOTES
Avoid using brk() and sbrk(): the malloc(3) memory allocation package is the portable and comfortable way of allocating memory.
Various systems use various types for the argument of sbrk(). Common choices are int, ssize_t, ptrdiff_t, and intptr_t.
C library/kernel differences
The return value described above for brk() is the behavior provided by the glibc wrapper function for the Linux brk() system call. (On most other implementations, the return value from brk() is the same; this return value was also specified in SUSv2.) However, the actual Linux system call returns the new program break on success. On failure, the system call returns the current break. The glibc wrapper function does some work (i.e., checks whether the new break is less than addr) to provide the 0 and -1 return values described above.
On Linux, sbrk() is implemented as a library function that uses the brk() system call, and does some internal bookkeeping so that it can return the old break value.
SEE ALSO
execve(2), getrlimit(2), end(3), malloc(3)
130 - Linux cli command mq_timedreceive
NAME 🖥️ mq_timedreceive 🖥️
receive a message from a message queue
LIBRARY
Real-time library (librt, -lrt)
SYNOPSIS
#include <mqueue.h>
ssize_t mq_receive(mqd_t mqdes, char msg_ptr[.msg_len],
size_t msg_len, unsigned int *msg_prio);
#include <time.h>
#include <mqueue.h>
ssize_t mq_timedreceive(mqd_t mqdes, char *restrict msg_ptr[.msg_len],
size_t msg_len, unsigned int *restrict msg_prio,
const struct timespec *restrict abs_timeout);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
mq_timedreceive():
_POSIX_C_SOURCE >= 200112L
DESCRIPTION
mq_receive() removes the oldest message with the highest priority from the message queue referred to by the message queue descriptor mqdes, and places it in the buffer pointed to by msg_ptr. The msg_len argument specifies the size of the buffer pointed to by msg_ptr; this must be greater than or equal to the mq_msgsize attribute of the queue (see mq_getattr(3)). If msg_prio is not NULL, then the buffer to which it points is used to return the priority associated with the received message.
If the queue is empty, then, by default, mq_receive() blocks until a message becomes available, or the call is interrupted by a signal handler. If the O_NONBLOCK flag is enabled for the message queue description, then the call instead fails immediately with the error EAGAIN.
mq_timedreceive() behaves just like mq_receive(), except that if the queue is empty and the O_NONBLOCK flag is not enabled for the message queue description, then abs_timeout points to a structure which specifies how long the call will block. This value is an absolute timeout in seconds and nanoseconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC), specified in a timespec(3) structure.
If no message is available, and the timeout has already expired by the time of the call, mq_timedreceive() returns immediately.
RETURN VALUE
On success, mq_receive() and mq_timedreceive() return the number of bytes in the received message; on error, -1 is returned, with errno set to indicate the error.
ERRORS
EAGAIN
The queue was empty, and the O_NONBLOCK flag was set for the message queue description referred to by mqdes.
EBADF
The descriptor specified in mqdes was invalid or not opened for reading.
EINTR
The call was interrupted by a signal handler; see signal(7).
EINVAL
The call would have blocked, and abs_timeout was invalid, either because tv_sec was less than zero, or because tv_nsec was less than zero or greater than 1000 million.
EMSGSIZE
msg_len was less than the mq_msgsize attribute of the message queue.
ETIMEDOUT
The call timed out before a message could be transferred.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
mq_receive(), mq_timedreceive() | Thread safety | MT-Safe |
VERSIONS
On Linux, mq_timedreceive() is a system call, and mq_receive() is a library function layered on top of that system call.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
SEE ALSO
mq_close(3), mq_getattr(3), mq_notify(3), mq_open(3), mq_send(3), mq_unlink(3), timespec(3), mq_overview(7), time(7)
131 - Linux cli command sendfile64
NAME 🖥️ sendfile64 🖥️
transfer data between file descriptors
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *_Nullable offset,
size_t count);
DESCRIPTION
sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.
in_fd should be a file descriptor opened for reading and out_fd should be a descriptor opened for writing.
If offset is not NULL, then it points to a variable holding the file offset from which sendfile() will start reading data from in_fd. When sendfile() returns, this variable will be set to the offset of the byte following the last byte that was read. If offset is not NULL, then sendfile() does not modify the file offset of in_fd; otherwise the file offset is adjusted to reflect the number of bytes read from in_fd.
If offset is NULL, then data will be read from in_fd starting at the file offset, and the file offset will be updated by the call.
count is the number of bytes to copy between the file descriptors.
The in_fd argument must correspond to a file which supports mmap(2)-like operations (i.e., it cannot be a socket). The exception is when out_fd is a pipe on Linux 5.12 or later: in that case sendfile() desugars to a splice(2) call, and the restrictions of splice(2) apply instead.
Before Linux 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file. If it’s seekable, then sendfile() changes the file offset appropriately.
RETURN VALUE
If the transfer was successful, the number of bytes written to out_fd is returned. Note that a successful call to sendfile() may write fewer bytes than requested; the caller should be prepared to retry the call if there were unsent bytes. See also NOTES.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EAGAIN
Nonblocking I/O has been selected using O_NONBLOCK and the write would block.
EBADF
The input file was not opened for reading or the output file was not opened for writing.
EFAULT
Bad address.
EINVAL
Descriptor is not valid or locked, or an mmap(2)-like operation is not available for in_fd, or count is negative.
EINVAL
out_fd has the O_APPEND flag set. This is not currently supported by sendfile().
EIO
Unspecified error while reading from in_fd.
ENOMEM
Insufficient memory to read from in_fd.
EOVERFLOW
count is too large; the operation would result in exceeding the maximum size of either the input file or the output file.
ESPIPE
offset is not NULL but the input file is not seekable.
VERSIONS
Other UNIX systems implement sendfile() with different semantics and prototypes. It should not be used in portable programs.
STANDARDS
None.
HISTORY
Linux 2.2, glibc 2.1.
In Linux 2.4 and earlier, out_fd could also refer to a regular file; this possibility went away in the Linux 2.6.x kernel series, but was restored in Linux 2.6.33.
The original Linux sendfile() system call was not designed to handle large file offsets. Consequently, Linux 2.4 added sendfile64(), with a wider type for the offset argument. The glibc sendfile() wrapper function transparently deals with the kernel differences.
NOTES
sendfile() will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)
If you plan to use sendfile() for sending files to a TCP socket, but need to send some header data in front of the file contents, you will find it useful to employ the TCP_CORK option, described in tcp(7), to minimize the number of packets and to tune performance.
Applications may wish to fall back to read(2) and write(2) in the case where sendfile() fails with EINVAL or ENOSYS.
If out_fd refers to a socket or pipe with zero-copy support, callers must ensure the transferred portions of the file referred to by in_fd remain unmodified until the reader on the other end of out_fd has consumed the transferred data.
The Linux-specific splice(2) call supports transferring data between arbitrary file descriptors provided one (or both) of them is a pipe.
SEE ALSO
copy_file_range(2), mmap(2), open(2), socket(2), splice(2)
132 - Linux cli command timer_create
NAME 🖥️ timer_create 🖥️
create a POSIX per-process timer
LIBRARY
Real-time library (librt, -lrt)
SYNOPSIS
#include <signal.h> /* Definition of SIGEV_* constants */
#include <time.h>
int timer_create(clockid_t clockid,
struct sigevent *_Nullable restrict sevp,
timer_t *restrict timerid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
timer_create():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
timer_create() creates a new per-process interval timer. The ID of the new timer is returned in the buffer pointed to by timerid, which must be a non-null pointer. This ID is unique within the process, until the timer is deleted. The new timer is initially disarmed.
The clockid argument specifies the clock that the new timer uses to measure time. It can be specified as one of the following values:
CLOCK_REALTIME
A settable system-wide real-time clock.
CLOCK_MONOTONIC
A nonsettable monotonically increasing clock that measures time from some unspecified point in the past that does not change after system startup.
CLOCK_PROCESS_CPUTIME_ID (since Linux 2.6.12)
A clock that measures (user and system) CPU time consumed by (all of the threads in) the calling process.
CLOCK_THREAD_CPUTIME_ID (since Linux 2.6.12)
A clock that measures (user and system) CPU time consumed by the calling thread.
CLOCK_BOOTTIME (since Linux 2.6.39)
Like CLOCK_MONOTONIC, this is a monotonically increasing clock. However, whereas the CLOCK_MONOTONIC clock does not measure the time while a system is suspended, the CLOCK_BOOTTIME clock does include the time during which the system is suspended. This is useful for applications that need to be suspend-aware. CLOCK_REALTIME is not suitable for such applications, since that clock is affected by discontinuous changes to the system clock.
CLOCK_REALTIME_ALARM (since Linux 3.0)
This clock is like CLOCK_REALTIME, but will wake the system if it is suspended. The caller must have the CAP_WAKE_ALARM capability in order to set a timer against this clock.
CLOCK_BOOTTIME_ALARM (since Linux 3.0)
This clock is like CLOCK_BOOTTIME, but will wake the system if it is suspended. The caller must have the CAP_WAKE_ALARM capability in order to set a timer against this clock.
CLOCK_TAI (since Linux 3.10)
A system-wide clock derived from wall-clock time but counting leap seconds.
See clock_getres(2) for some further details on the above clocks.
As well as the above values, clockid can be specified as the clockid returned by a call to clock_getcpuclockid(3) or pthread_getcpuclockid(3).
The sevp argument points to a sigevent structure that specifies how the caller should be notified when the timer expires. For the definition and general details of this structure, see sigevent(3type).
The sevp.sigev_notify field can have the following values:
SIGEV_NONE
Don’t asynchronously notify when the timer expires. Progress of the timer can be monitored using timer_gettime(2).
SIGEV_SIGNAL
Upon timer expiration, generate the signal sigev_signo for the process. See sigevent(3type) for general details. The si_code field of the siginfo_t structure will be set to SI_TIMER. At any point in time, at most one signal is queued to the process for a given timer; see timer_getoverrun(2) for more details.
SIGEV_THREAD
Upon timer expiration, invoke sigev_notify_function as if it were the start function of a new thread. See sigevent(3type) for details.
SIGEV_THREAD_ID (Linux-specific)
As for SIGEV_SIGNAL, but the signal is targeted at the thread whose ID is given in sigev_notify_thread_id, which must be a thread in the same process as the caller. The sigev_notify_thread_id field specifies a kernel thread ID, that is, the value returned by clone(2) or gettid(2). This flag is intended only for use by threading libraries.
Specifying sevp as NULL is equivalent to specifying a pointer to a sigevent structure in which sigev_notify is SIGEV_SIGNAL, sigev_signo is SIGALRM, and sigev_value.sival_int is the timer ID.
RETURN VALUE
On success, timer_create() returns 0, and the ID of the new timer is placed in *timerid. On failure, -1 is returned, and errno is set to indicate the error.
ERRORS
EAGAIN
Temporary error during kernel allocation of timer structures.
EINVAL
Clock ID, sigev_notify, sigev_signo, or sigev_notify_thread_id is invalid.
ENOMEM
Could not allocate memory.
ENOTSUP
The kernel does not support creating a timer against this clockid.
EPERM
clockid was CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM but the caller did not have the CAP_WAKE_ALARM capability.
VERSIONS
C library/kernel differences
Part of the implementation of the POSIX timers API is provided by glibc. In particular:
Much of the functionality for SIGEV_THREAD is implemented within glibc, rather than the kernel. (This is necessarily so, since the thread involved in handling the notification is one that must be managed by the C library POSIX threads implementation.) Although the notification delivered to the process is via a thread, internally the NPTL implementation uses a sigev_notify value of SIGEV_THREAD_ID along with a real-time signal that is reserved by the implementation (see nptl(7)).
The implementation of the default case where sevp is NULL is handled inside glibc, which invokes the underlying system call with a suitably populated sigevent structure.
The timer IDs presented at user level are maintained by glibc, which maps these IDs to the timer IDs employed by the kernel.
STANDARDS
POSIX.1-2008.
HISTORY
Linux 2.6. POSIX.1-2001.
Prior to Linux 2.6, glibc provided an incomplete user-space implementation (CLOCK_REALTIME timers only) using POSIX threads, and before glibc 2.17, the implementation falls back to this technique on systems running kernels older than Linux 2.6.
NOTES
A program may create multiple interval timers using timer_create().
Timers are not inherited by the child of a fork(2), and are disarmed and deleted during an execve(2).
The kernel preallocates a “queued real-time signal” for each timer created using timer_create(). Consequently, the number of timers is limited by the RLIMIT_SIGPENDING resource limit (see setrlimit(2)).
The timers created by timer_create() are commonly known as “POSIX (interval) timers”. The POSIX timers API consists of the following interfaces:
timer_create()
Create a timer.
timer_settime(2)
Arm (start) or disarm (stop) a timer.
timer_gettime(2)
Fetch the time remaining until the next expiration of a timer, along with the interval setting of the timer.
timer_getoverrun(2)
Return the overrun count for the last timer expiration.
timer_delete(2)
Disarm and delete a timer.
Since Linux 3.10, the /proc/pid/timers file can be used to list the POSIX timers for the process with PID pid. See proc(5) for further information.
Since Linux 4.10, support for POSIX timers is a configurable option that is enabled by default. Kernel support can be disabled via the CONFIG_POSIX_TIMERS option.
EXAMPLES
The program below takes two arguments: a sleep period in seconds, and a timer frequency in nanoseconds. The program establishes a handler for the signal it uses for the timer, blocks that signal, creates and arms a timer that expires with the given frequency, sleeps for the specified number of seconds, and then unblocks the timer signal. Assuming that the timer expired at least once while the program slept, the signal handler will be invoked, and the handler displays some information about the timer notification. The program terminates after one invocation of the signal handler.
In the following example run, the program sleeps for 1 second, after creating a timer that has a frequency of 100 nanoseconds. By the time the signal is unblocked and delivered, there have been around ten million overruns.
$ ./a.out 1 100
Establishing handler for signal 34
Blocking signal 34
timer ID is 0x804c008
Sleeping for 1 seconds
Unblocking signal 34
Caught signal 34
sival_ptr = 0xbfb174f4; *sival_ptr = 0x804c008
overrun count = 10004886
Program source
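#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CLOCKID CLOCK_REALTIME
#define SIG SIGRTMIN
#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static void
print_siginfo(siginfo_t *si)
{
    int      or;
    timer_t  *tidp;
    tidp = si->si_value.sival_ptr;
    printf("    sival_ptr = %p; ", si->si_value.sival_ptr);
    printf("    *sival_ptr = %#jx\n", (uintmax_t) *tidp);
    or = timer_getoverrun(*tidp);
    if (or == -1)
        errExit("timer_getoverrun");
    else
        printf("    overrun count = %d\n", or);
}

static void
handler(int sig, siginfo_t *si, void *uc)
{
    /* Note: calling printf() from a signal handler is not safe
       (and should not be done in production programs), since
       printf() is not async-signal-safe; see signal-safety(7).
       Nevertheless, we use printf() here as a simple way of
       showing that the handler was called. */
    printf("Caught signal %d\n", sig);
    print_siginfo(si);
    signal(sig, SIG_IGN);
}

int
main(int argc, char *argv[])
{
    timer_t            timerid;
    sigset_t           mask;
    long long          freq_nanosecs;
    struct sigevent    sev;
    struct sigaction   sa;
    struct itimerspec  its;

    if (argc != 3) {
        fprintf(stderr, "Usage: %s <sleep-secs> <freq-nanosecs>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Establish handler for timer signal. */
    printf("Establishing handler for signal %d\n", SIG);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIG, &sa, NULL) == -1)
        errExit("sigaction");

    /* Block timer signal temporarily. */
    printf("Blocking signal %d\n", SIG);
    sigemptyset(&mask);
    sigaddset(&mask, SIG);
    if (sigprocmask(SIG_SETMASK, &mask, NULL) == -1)
        errExit("sigprocmask");

    /* Create the timer. */
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIG;
    sev.sigev_value.sival_ptr = &timerid;
    if (timer_create(CLOCKID, &sev, &timerid) == -1)
        errExit("timer_create");
    printf("timer ID is %#jx\n", (uintmax_t) timerid);

    /* Start the timer. */
    freq_nanosecs = atoll(argv[2]);
    its.it_value.tv_sec = freq_nanosecs / 1000000000;
    its.it_value.tv_nsec = freq_nanosecs % 1000000000;
    its.it_interval.tv_sec = its.it_value.tv_sec;
    its.it_interval.tv_nsec = its.it_value.tv_nsec;
    if (timer_settime(timerid, 0, &its, NULL) == -1)
        errExit("timer_settime");

    /* Sleep for a while; meanwhile, the timer may expire
       multiple times. */
    printf("Sleeping for %d seconds\n", atoi(argv[1]));
    sleep(atoi(argv[1]));

    /* Unblock the timer signal, so that timer notification
       can be delivered. */
    printf("Unblocking signal %d\n", SIG);
    if (sigprocmask(SIG_UNBLOCK, &mask, NULL) == -1)
        errExit("sigprocmask");

    exit(EXIT_SUCCESS);
}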
SEE ALSO
clock_gettime(2), setitimer(2), timer_delete(2), timer_getoverrun(2), timer_settime(2), timerfd_create(2), clock_getcpuclockid(3), pthread_getcpuclockid(3), pthreads(7), sigevent(3type), signal(7), time(7)
133 - Linux cli command getresuid32
NAME 🖥️ getresuid32 🖥️
get real, effective, and saved user/group IDs
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
int getresuid(uid_t *ruid, uid_t *euid, uid_t *suid);
int getresgid(gid_t *rgid, gid_t *egid, gid_t *sgid);
DESCRIPTION
getresuid() returns the real UID, the effective UID, and the saved set-user-ID of the calling process, in the arguments ruid, euid, and suid, respectively. getresgid() performs the analogous task for the process’s group IDs.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
One of the arguments specified an address outside the calling program’s address space.
STANDARDS
None. These calls also appear on HP-UX and some of the BSDs.
HISTORY
Linux 2.1.44, glibc 2.3.2.
The original Linux getresuid() and getresgid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added getresuid32() and getresgid32(), supporting 32-bit IDs. The glibc getresuid() and getresgid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getuid(2), setresuid(2), setreuid(2), setuid(2), credentials(7)
134 - Linux cli command truncate64
NAME 🖥️ truncate64 🖥️
truncate a file to a specified length
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int truncate(const char *path, off_t length);
int ftruncate(int fd, off_t length);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
truncate():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| /* glibc <= 2.19: */ _BSD_SOURCE
ftruncate():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.3.5: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes.
If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes (‘\0’).
The file offset is not changed.
If the size changed, then the st_ctime and st_mtime fields (respectively, time of last status change and time of last modification; see inode(7)) for the file are updated, and the set-user-ID and set-group-ID mode bits may be cleared.
With ftruncate(), the file must be open for writing; with truncate(), the file must be writable.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
For truncate():
EACCES
Search permission is denied for a component of the path prefix, or the named file is not writable by the user. (See also path_resolution(7).)
EFAULT
The argument path points outside the process’s allocated address space.
EFBIG
The argument length is larger than the maximum file size. (XSI)
EINTR
While blocked waiting to complete, the call was interrupted by a signal handler; see fcntl(2) and signal(7).
EINVAL
The argument length is negative or larger than the maximum file size.
EIO
An I/O error occurred updating the inode.
EISDIR
The named file is a directory.
ELOOP
Too many symbolic links were encountered in translating the pathname.
ENAMETOOLONG
A component of a pathname exceeded 255 characters, or an entire pathname exceeded 1023 characters.
ENOENT
The named file does not exist.
ENOTDIR
A component of the path prefix is not a directory.
EPERM
The underlying filesystem does not support extending a file beyond its current size.
EPERM
The operation was prevented by a file seal; see fcntl(2).
EROFS
The named file resides on a read-only filesystem.
ETXTBSY
The file is an executable file that is being executed.
For ftruncate() the same errors apply, but instead of things that can be wrong with path, we now have things that can be wrong with the file descriptor, fd:
EBADF
fd is not a valid file descriptor.
EBADF or EINVAL
fd is not open for writing.
EINVAL
fd does not reference a regular file or a POSIX shared memory object.
EINVAL or EBADF
The file descriptor fd is not open for writing. POSIX permits, and portable applications should handle, either error for this case. (Linux produces EINVAL.)
VERSIONS
The details in DESCRIPTION are for XSI-compliant systems. For non-XSI-compliant systems, the POSIX standard allows two behaviors for ftruncate() when length exceeds the file length (note that truncate() is not specified at all in such an environment): either returning an error, or extending the file. Like most UNIX implementations, Linux follows the XSI requirement when dealing with native filesystems. However, some nonnative filesystems do not permit truncate() and ftruncate() to be used to extend a file beyond its current length: a notable example on Linux is VFAT.
On some 32-bit architectures, the calling signature for these system calls differs, for the reasons described in syscall(2).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.4BSD, SVr4 (first appeared in 4.2BSD).
The original Linux truncate() and ftruncate() system calls were not designed to handle large file offsets. Consequently, Linux 2.4 added truncate64() and ftruncate64() system calls that handle large files. However, these details can be ignored by applications using glibc, whose wrapper functions transparently employ the more recent system calls where they are available.
NOTES
ftruncate() can also be used to set the size of a POSIX shared memory object; see shm_open(3).
BUGS
A header file bug in glibc 2.12 meant that the minimum value of _POSIX_C_SOURCE required to expose the declaration of ftruncate() was 200809L instead of 200112L. This has been fixed in later glibc versions.
SEE ALSO
truncate(1), open(2), stat(2), path_resolution(7)
135 - Linux cli command getxattr
NAME 🖥️ getxattr 🖥️
retrieve an extended attribute value
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
ssize_t getxattr(const char *path, const char *name,
void value[.size], size_t size);
ssize_t lgetxattr(const char *path, const char *name,
void value[.size], size_t size);
ssize_t fgetxattr(int fd, const char *name,
void value[.size], size_t size);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
getxattr() retrieves the value of the extended attribute identified by name and associated with the given path in the filesystem. The attribute value is placed in the buffer pointed to by value; size specifies the size of that buffer. The return value of the call is the number of bytes placed in value.
lgetxattr() is identical to getxattr(), except in the case of a symbolic link, where the link itself is interrogated, not the file that it refers to.
fgetxattr() is identical to getxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path.
An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. The value of an extended attribute is a chunk of arbitrary textual or binary data that was assigned using setxattr(2).
If size is specified as zero, these calls return the current size of the named extended attribute (and leave value unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the attribute value may change between the two calls, so that it is still necessary to check the return status from the second call.)
RETURN VALUE
On success, these calls return a nonnegative value which is the size (in bytes) of the extended attribute value. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
E2BIG
The size of the attribute value is larger than the maximum size allowed; the attribute cannot be retrieved. This can happen on filesystems that support very large attribute values such as NFSv4, for example.
ENODATA
The named attribute does not exist, or the process has no access to this attribute.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
ERANGE
The size of the value buffer is too small to hold the result.
In addition, the errors documented in stat(2) can also occur.
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
EXAMPLES
See listxattr(2).
SEE ALSO
getfattr(1), setfattr(1), listxattr(2), open(2), removexattr(2), setxattr(2), stat(2), symlink(7), xattr(7)
136 - Linux cli command recv
NAME 🖥️ recv 🖥️
receive a message from a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
ssize_t recv(int sockfd, void buf[.len], size_t len,
int flags);
ssize_t recvfrom(int sockfd, void buf[restrict .len], size_t len,
int flags,
struct sockaddr *_Nullable restrict src_addr,
socklen_t *_Nullable restrict addrlen);
ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);
DESCRIPTION
The recv(), recvfrom(), and recvmsg() calls are used to receive messages from a socket. They may be used to receive data on both connectionless and connection-oriented sockets. This page first describes common features of all three system calls, and then describes the differences between the calls.
The only difference between recv() and read(2) is the presence of flags. With a zero flags argument, recv() is generally equivalent to read(2) (but see NOTES). Also, the following call
recv(sockfd, buf, len, flags);
is equivalent to
recvfrom(sockfd, buf, len, flags, NULL, NULL);
All three calls return the length of the message on successful completion. If a message is too long to fit in the supplied buffer, excess bytes may be discarded depending on the type of socket the message is received from.
If no messages are available at the socket, the receive calls wait for a message to arrive, unless the socket is nonblocking (see fcntl(2)), in which case the value -1 is returned and errno is set to EAGAIN or EWOULDBLOCK. The receive calls normally return any data available, up to the requested amount, rather than waiting for receipt of the full amount requested.
An application can use select(2), poll(2), or epoll(7) to determine when more data arrives on a socket.
The flags argument
The flags argument is formed by ORing one or more of the following values:
MSG_CMSG_CLOEXEC (recvmsg() only; since Linux 2.6.23)
Set the close-on-exec flag for the file descriptor received via a UNIX domain file descriptor using the SCM_RIGHTS operation (described in unix(7)). This flag is useful for the same reasons as the O_CLOEXEC flag of open(2).
MSG_DONTWAIT (since Linux 2.2)
Enables nonblocking operation; if the operation would block, the call fails with the error EAGAIN or EWOULDBLOCK. This provides similar behavior to setting the O_NONBLOCK flag (via the fcntl(2) F_SETFL operation), but differs in that MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a setting on the open file description (see open(2)), which will affect all threads in the calling process as well as other processes that hold file descriptors referring to the same open file description.
MSG_ERRQUEUE (since Linux 2.2)
This flag specifies that queued errors should be received from the socket error queue. The error is passed in an ancillary message with a type dependent on the protocol (for IPv4 IP_RECVERR). The user should supply a buffer of sufficient size. See cmsg(3) and ip(7) for more information. The payload of the original packet that caused the error is passed as normal data via msg_iov. The original destination address of the datagram that caused the error is supplied via msg_name.
The error is supplied in a sock_extended_err structure:
#define SO_EE_ORIGIN_NONE 0
#define SO_EE_ORIGIN_LOCAL 1
#define SO_EE_ORIGIN_ICMP 2
#define SO_EE_ORIGIN_ICMP6 3
struct sock_extended_err
{
uint32_t ee_errno; /* Error number */
uint8_t ee_origin; /* Where the error originated */
uint8_t ee_type; /* Type */
uint8_t ee_code; /* Code */
uint8_t ee_pad; /* Padding */
uint32_t ee_info; /* Additional information */
uint32_t ee_data; /* Other data */
/* More data may follow */
};
struct sockaddr *SO_EE_OFFENDER(struct sock_extended_err *);
ee_errno contains the errno number of the queued error. ee_origin is the origin code of where the error originated. The other fields are protocol-specific. The macro SO_EE_OFFENDER returns a pointer to the address of the network object where the error originated from given a pointer to the ancillary message. If this address is not known, the sa_family member of the sockaddr contains AF_UNSPEC and the other fields of the sockaddr are undefined. The payload of the packet that caused the error is passed as normal data.
For local errors, no address is passed (this can be checked with the cmsg_len member of the cmsghdr). For error receives, the MSG_ERRQUEUE flag is set in the msghdr. After an error has been passed, the pending socket error is regenerated based on the next queued error and will be passed on the next socket operation.
MSG_OOB
This flag requests receipt of out-of-band data that would not be received in the normal data stream. Some protocols place expedited data at the head of the normal data queue, and thus this flag cannot be used with such protocols.
MSG_PEEK
This flag causes the receive operation to return data from the beginning of the receive queue without removing that data from the queue. Thus, a subsequent receive call will return the same data.
MSG_TRUNC (since Linux 2.2)
For raw (AF_PACKET), Internet datagram (since Linux 2.4.27/2.6.8), netlink (since Linux 2.6.22), and UNIX datagram as well as sequenced-packet (since Linux 3.4) sockets: return the real length of the packet or datagram, even when it was longer than the passed buffer.
For use with Internet stream sockets, see tcp(7).
MSG_WAITALL (since Linux 2.2)
This flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned. This flag has no effect for datagram sockets.
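The flags above can be combined. The sketch below shows three small helpers built on MSG_PEEK, MSG_DONTWAIT, and MSG_TRUNC; the helper names are hypothetical, and next_datagram_size() assumes a socket type for which MSG_TRUNC reports the real datagram length (see the MSG_TRUNC entry above).

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Copy up to len pending bytes into buf without consuming them.
 * Never blocks; fails with EAGAIN or EWOULDBLOCK if nothing is queued. */
ssize_t peek_pending(int sockfd, void *buf, size_t len)
{
    return recv(sockfd, buf, len, MSG_PEEK | MSG_DONTWAIT);
}

/* Return 1 if the receive queue is empty, 0 if data (or end-of-file)
 * is pending, and -1 on any other error. */
int queue_is_empty(int sockfd)
{
    char probe;

    if (recv(sockfd, &probe, 1, MSG_PEEK | MSG_DONTWAIT) != -1)
        return 0;
    return (errno == EAGAIN || errno == EWOULDBLOCK) ? 1 : -1;
}

/* Report the full length of the next datagram without consuming it,
 * even though only one byte of buffer space is supplied. */
ssize_t next_datagram_size(int sockfd)
{
    char probe;

    return recv(sockfd, &probe, sizeof(probe),
                MSG_PEEK | MSG_TRUNC | MSG_DONTWAIT);
}
```

A caller might use next_datagram_size() to allocate a buffer of the right size before the real recv() call.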
recvfrom()
recvfrom() places the received message into the buffer buf. The caller must specify the size of the buffer in len.
If src_addr is not NULL, and the underlying protocol provides the source address of the message, that source address is placed in the buffer pointed to by src_addr. In this case, addrlen is a value-result argument. Before the call, it should be initialized to the size of the buffer associated with src_addr. Upon return, addrlen is updated to contain the actual size of the source address. The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
If the caller is not interested in the source address, src_addr and addrlen should be specified as NULL.
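The value-result handling of addrlen is easy to get wrong, so here is a minimal sketch. The wrapper name recv_datagram() is hypothetical; struct sockaddr_storage is used because it is large enough for any address family.

```c
#include <sys/socket.h>
#include <sys/types.h>

/* Receive one datagram and record who sent it.
 * On success, *src_len holds the address size reported by the kernel;
 * a value larger than sizeof(*src) would mean the address was truncated. */
ssize_t recv_datagram(int sockfd, void *buf, size_t len,
                      struct sockaddr_storage *src, socklen_t *src_len)
{
    *src_len = sizeof(*src);    /* Value-result: pass in the buffer size */
    return recvfrom(sockfd, buf, len, 0, (struct sockaddr *) src, src_len);
}
```

The caller can then inspect src->ss_family to decide how to interpret the address.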
recv()
The recv() call is normally used only on a connected socket (see connect(2)). It is equivalent to the call:
recvfrom(sockfd, buf, len, flags, NULL, NULL);
recvmsg()
The recvmsg() call uses a msghdr structure to minimize the number of directly supplied arguments. This structure is defined as follows in <sys/socket.h>:
struct msghdr {
void *msg_name; /* Optional address */
socklen_t msg_namelen; /* Size of address */
struct iovec *msg_iov; /* Scatter/gather array */
size_t msg_iovlen; /* # elements in msg_iov */
void *msg_control; /* Ancillary data, see below */
size_t msg_controllen; /* Ancillary data buffer len */
int msg_flags; /* Flags on received message */
};
The msg_name field points to a caller-allocated buffer that is used to return the source address if the socket is unconnected. The caller should set msg_namelen to the size of this buffer before this call; upon return from a successful call, msg_namelen will contain the length of the returned address. If the application does not need to know the source address, msg_name can be specified as NULL.
The fields msg_iov and msg_iovlen describe scatter-gather locations, as discussed in readv(2).
The field msg_control, which has length msg_controllen, points to a buffer for other protocol control-related messages or miscellaneous ancillary data. When recvmsg() is called, msg_controllen should contain the length of the available buffer in msg_control; upon return from a successful call it will contain the length of the control message sequence.
The messages are of the form:
struct cmsghdr {
size_t cmsg_len; /* Data byte count, including header
(type is socklen_t in POSIX) */
int cmsg_level; /* Originating protocol */
int cmsg_type; /* Protocol-specific type */
/* followed by
unsigned char cmsg_data[]; */
};
Ancillary data should be accessed only by the macros defined in cmsg(3).
As an example, Linux uses this ancillary data mechanism to pass extended errors, IP options, or file descriptors over UNIX domain sockets. For further information on the use of ancillary data in various socket domains, see unix(7) and ip(7).
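To make the ancillary data mechanism concrete, the following sketch receives a single file descriptor passed with SCM_RIGHTS over a UNIX domain socket. The function name recv_fd() is hypothetical, and the sketch assumes the sender transmits exactly one byte of ordinary data together with exactly one descriptor.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Receive one byte of data plus one file descriptor.
 * Returns the new descriptor (with close-on-exec set),
 * or -1 if no descriptor arrived intact. */
int recv_fd(int sockfd)
{
    char data;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union {                     /* The union ensures cmsghdr alignment */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sockfd, &msg, MSG_CMSG_CLOEXEC) <= 0)
        return -1;
    if (msg.msg_flags & MSG_CTRUNC)
        return -1;              /* Control buffer was too small */

    /* Walk the control messages using only the cmsg(3) macros */
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS
            && c->cmsg_len == CMSG_LEN(sizeof(int))) {
            int fd;
            memcpy(&fd, CMSG_DATA(c), sizeof(fd));
            return fd;
        }
    }
    return -1;
}
```

The descriptor is copied out with memcpy() because CMSG_DATA() is not guaranteed to return a suitably aligned pointer.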
The msg_flags field in the msghdr is set on return of recvmsg(). It can contain several flags:
MSG_EOR
indicates end-of-record; the data returned completed a record (generally used with sockets of type SOCK_SEQPACKET).
MSG_TRUNC
indicates that the trailing portion of a datagram was discarded because the datagram was larger than the buffer supplied.
MSG_CTRUNC
indicates that some control data was discarded due to lack of space in the buffer for ancillary data.
MSG_OOB
is returned to indicate that expedited or out-of-band data was received.
MSG_ERRQUEUE
indicates that no data was received, only an extended error from the socket error queue.
MSG_CMSG_CLOEXEC (since Linux 2.6.23)
indicates that MSG_CMSG_CLOEXEC was specified in the flags argument of recvmsg().
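The sketch below ties together the scatter-gather fields and msg_flags: it splits one incoming message across two buffers and hands msg_flags back to the caller so that truncation can be detected. The function name and the header/body split are assumptions made for illustration.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Receive one message, scattering it across a header buffer and a
 * body buffer. No source address or ancillary data is requested.
 * On success, *flags_out (if not NULL) receives msg_flags. */
ssize_t recv_header_and_body(int sockfd,
                             void *hdr, size_t hdr_len,
                             void *body, size_t body_len,
                             int *flags_out)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = hdr_len  },
        { .iov_base = body, .iov_len = body_len },
    };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));   /* msg_name and msg_control stay NULL */
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    ssize_t n = recvmsg(sockfd, &msg, 0);
    if (n != -1 && flags_out != NULL)
        *flags_out = msg.msg_flags;
    return n;
}
```

On a datagram socket, a caller would typically test the returned flags against MSG_TRUNC and treat a truncated message as an error or retry with larger buffers.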
RETURN VALUE
These calls return the number of bytes received, or -1 if an error occurred. In the event of an error, errno is set to indicate the error.
When a stream socket peer has performed an orderly shutdown, the return value will be 0 (the traditional “end-of-file” return).
Datagram sockets in various domains (e.g., the UNIX and Internet domains) permit zero-length datagrams. When such a datagram is received, the return value is 0.
The value 0 may also be returned if the requested number of bytes to receive from a stream socket was 0.
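Because a stream socket may return fewer bytes than requested, and 0 signals an orderly shutdown, callers often loop. A minimal sketch follows; recv_exact() is a hypothetical helper name, and the sketch assumes a blocking stream socket.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Read exactly len bytes from a stream socket.
 * Returns len on success, a short count (possibly 0) if the peer
 * shut down first, or -1 with errno set on error. */
ssize_t recv_exact(int sockfd, void *buf, size_t len)
{
    char *p = buf;
    size_t got = 0;

    while (got < len) {
        ssize_t n = recv(sockfd, p + got, len - got, 0);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* Interrupted before any data: retry */
            return -1;
        }
        if (n == 0)
            break;              /* Orderly shutdown by the peer */
        got += (size_t) n;
    }
    return (ssize_t) got;
}
```

The MSG_WAITALL flag asks the kernel for similar behavior, but as noted above it can still return early, so a loop of this kind remains useful.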
ERRORS
These are some standard errors generated by the socket layer. Additional errors may be generated and returned from the underlying protocol modules; see their manual pages.
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and the receive operation would block, or a receive timeout had been set and the timeout expired before data was received. POSIX.1 allows either error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities.
EBADF
The argument sockfd is an invalid file descriptor.
ECONNREFUSED
A remote host refused to allow the network connection (typically because it is not running the requested service).
EFAULT
The receive buffer pointer(s) point outside the process’s address space.
EINTR
The receive was interrupted by delivery of a signal before any data was available; see signal(7).
EINVAL
Invalid argument passed.
ENOMEM
Could not allocate memory for recvmsg().
ENOTCONN
The socket is associated with a connection-oriented protocol and has not been connected (see connect(2) and accept(2)).
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
VERSIONS
According to POSIX.1, the msg_controllen field of the msghdr structure should be typed as socklen_t, and the msg_iovlen field should be typed as int, but glibc currently types both as size_t.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
POSIX.1 describes only the MSG_OOB, MSG_PEEK, and MSG_WAITALL flags.
NOTES
If a zero-length datagram is pending, read(2) and recv() with a flags argument of zero provide different behavior. In this circumstance, read(2) has no effect (the datagram remains pending), while recv() consumes the pending datagram.
See recvmmsg(2) for information about a Linux-specific system call that can be used to receive multiple datagrams in a single call.
EXAMPLES
An example of the use of recvfrom() is shown in getaddrinfo(3).
SEE ALSO
fcntl(2), getsockopt(2), read(2), recvmmsg(2), select(2), shutdown(2), socket(2), cmsg(3), sockatmark(3), ip(7), ipv6(7), socket(7), tcp(7), udp(7), unix(7)
137 - Linux cli command faccessat2
NAME
faccessat2 - check user’s permissions for a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int access(const char *pathname, int mode);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int faccessat(int dirfd, const char *pathname, int mode, int flags);
/* But see C library/kernel differences, below */
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_faccessat2,
            int dirfd, const char *pathname, int mode, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
faccessat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
access() checks whether the calling process can access the file pathname. If pathname is a symbolic link, it is dereferenced.
The mode specifies the accessibility check(s) to be performed, and is either the value F_OK, or a mask consisting of the bitwise OR of one or more of R_OK, W_OK, and X_OK. F_OK tests for the existence of the file. R_OK, W_OK, and X_OK test whether the file exists and grants read, write, and execute permissions, respectively.
The check is done using the calling process’s real UID and GID, rather than the effective IDs as is done when actually attempting an operation (e.g., open(2)) on the file. Similarly, for the root user, the check uses the set of permitted capabilities rather than the set of effective capabilities; and for non-root users, the check uses an empty set of capabilities.
This allows set-user-ID programs and capability-endowed programs to easily determine the invoking user’s authority. In other words, access() does not answer the “can I read/write/execute this file?” question. It answers a slightly different question: “(assuming I’m a setuid binary) can the user who invoked me read/write/execute this file?”, which gives set-user-ID programs the possibility to prevent malicious users from causing them to read files which users shouldn’t be able to read.
If the calling process is privileged (i.e., its real UID is zero), then an X_OK check is successful for a regular file if execute permission is enabled for any of the file owner, group, or other.
faccessat()
faccessat() operates in exactly the same way as access(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by access() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like access()).
If pathname is absolute, then dirfd is ignored.
flags is constructed by ORing together zero or more of the following values:
AT_EACCESS
Perform access checks using the effective user and group IDs. By default, faccessat() uses the real IDs (like access()).
AT_EMPTY_PATH (since Linux 5.8)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself.
See openat(2) for an explanation of the need for faccessat().
faccessat2()
The description of faccessat() given above corresponds to POSIX.1 and to the implementation provided by glibc. However, the glibc implementation was an imperfect emulation (see BUGS) that papered over the fact that the raw Linux faccessat() system call does not have a flags argument. To allow for a proper implementation, Linux 5.8 added the faccessat2() system call, which supports the flags argument and allows a correct implementation of the faccessat() wrapper function.
RETURN VALUE
On success (all requested permissions granted, or mode is F_OK and the file exists), zero is returned. On error (at least one bit in mode asked for a permission that is denied, or mode is F_OK and the file does not exist, or some other error occurred), -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
The requested access would be denied to the file, or search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
(faccessat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
mode was incorrectly specified.
EINVAL
(faccessat()) Invalid flag specified in flags.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory.
ENOTDIR
(faccessat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
Write permission was requested to a file that has the immutable flag set. See also ioctl_iflags(2).
EROFS
Write permission was requested for a file on a read-only filesystem.
ETXTBSY
Write access was requested to an executable which is being executed.
VERSIONS
If the calling process has appropriate privileges (i.e., is superuser), POSIX.1-2001 permits an implementation to indicate success for an X_OK check even if none of the execute file permission bits are set. Linux does not do this.
C library/kernel differences
The raw faccessat() system call takes only the first three arguments. The AT_EACCESS and AT_SYMLINK_NOFOLLOW flags are actually implemented within the glibc wrapper function for faccessat(). If either of these flags is specified, then the wrapper function employs fstatat(2) to determine access permissions, but see BUGS.
glibc notes
On older kernels where faccessat() is unavailable (and when the AT_EACCESS and AT_SYMLINK_NOFOLLOW flags are not specified), the glibc wrapper function falls back to the use of access(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
STANDARDS
access()
faccessat()
POSIX.1-2008.
faccessat2()
Linux.
HISTORY
access()
SVr4, 4.3BSD, POSIX.1-2001.
faccessat()
Linux 2.6.16, glibc 2.4.
faccessat2()
Linux 5.8.
NOTES
Warning: Using these calls to check if a user is authorized to, for example, open a file before actually doing so using open(2) creates a security hole, because the user might exploit the short time interval between checking and opening the file to manipulate it. For this reason, the use of this system call should be avoided. (In the example just described, a safer alternative would be to temporarily switch the process’s effective user ID to the real ID and then call open(2).)
access() always dereferences symbolic links. If you need to check the permissions on a symbolic link, use faccessat() with the flag AT_SYMLINK_NOFOLLOW.
These calls return an error if any of the access types in mode is denied, even if some of the other access types in mode are permitted.
A file is accessible only if the permissions on each of the directories in the path prefix of pathname grant search (i.e., execute) access. If any directory is inaccessible, then the access() call fails, regardless of the permissions on the file itself.
Only access bits are checked, not the file type or contents. Therefore, if a directory is found to be writable, it probably means that files can be created in the directory, and not that the directory can be written as a file. Similarly, a DOS file may be reported as executable, but the execve(2) call will still fail.
These calls may not work correctly on NFSv2 filesystems with UID mapping enabled, because UID mapping is done on the server and hidden from the client, which checks permissions. (NFS versions 3 and higher perform the check on the server.) Similar problems can occur with FUSE mounts.
BUGS
Because the Linux kernel’s faccessat() system call does not support a flags argument, the glibc faccessat() wrapper function provided in glibc 2.32 and earlier emulates the required functionality using a combination of the faccessat() system call and fstatat(2). However, this emulation does not take ACLs into account. Starting with glibc 2.33, the wrapper function avoids this bug by making use of the faccessat2() system call where it is provided by the underlying kernel.
In Linux 2.4 (and earlier) there is some strangeness in the handling of X_OK tests for superuser. If all categories of execute permission are disabled for a nondirectory file, then the only access() test that returns -1 is when mode is specified as just X_OK; if R_OK or W_OK is also specified in mode, then access() returns 0 for such files. Early Linux 2.6 (up to and including Linux 2.6.3) also behaved in the same way as Linux 2.4.
Before Linux 2.6.20, these calls ignored the effect of the MS_NOEXEC flag if it was used to mount(2) the underlying filesystem. Since Linux 2.6.20, the MS_NOEXEC flag is honored.
SEE ALSO
chmod(2), chown(2), open(2), setgid(2), setuid(2), stat(2), euidaccess(3), credentials(7), path_resolution(7), symlink(7)
138 - Linux cli command fchownat
NAME
fchownat - change ownership of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int chown(const char *pathname, uid_t owner, gid_t group);
int fchown(int fd, uid_t owner, gid_t group);
int lchown(const char *pathname, uid_t owner, gid_t group);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int fchownat(int dirfd, const char *pathname,
uid_t owner, gid_t group, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchown(), lchown():
/* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| _XOPEN_SOURCE >= 500
|| /* glibc <= 2.19: */ _BSD_SOURCE
fchownat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These system calls change the owner and group of a file. The chown(), fchown(), and lchown() system calls differ only in how the file is specified:
chown() changes the ownership of the file specified by pathname, which is dereferenced if it is a symbolic link.
fchown() changes the ownership of the file referred to by the open file descriptor fd.
lchown() is like chown(), but does not dereference symbolic links.
Only a privileged process (Linux: one with the CAP_CHOWN capability) may change the owner of a file. The owner of a file may change the group of the file to any group of which that owner is a member. A privileged process (Linux: with CAP_CHOWN) may change the group arbitrarily.
If the owner or group is specified as -1, then that ID is not changed.
When the owner or group of an executable file is changed by an unprivileged user, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel version, and since Linux 2.2.13, root is treated like other users. In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown().
When the owner or group of an executable file is changed (by any user), all capability sets for the file are cleared.
fchownat()
The fchownat() system call operates in exactly the same way as chown(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chown() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chown()).
If pathname is absolute, then dirfd is ignored.
The flags argument is a bit mask created by ORing together 0 or more of the following values:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead operate on the link itself, like lchown(). (By default, fchownat() dereferences symbolic links, like chown().)
See openat(2) for an explanation of the need for fchownat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, errors other than those listed below can be returned.
The more general errors for chown() are listed below.
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EBADF
(fchown()) fd is not a valid open file descriptor.
EBADF
(fchownat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
(fchownat()) Invalid flag specified in flags.
EIO
(fchown()) A low-level I/O error occurred while modifying the inode.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
The file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(fchownat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
The calling process did not have the required permissions (see above) to change owner and/or group.
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The named file resides on a read-only filesystem.
VERSIONS
The 4.4BSD version can be used only by the superuser (that is, ordinary users cannot give away files).
STANDARDS
POSIX.1-2008.
HISTORY
chown()
fchown()
lchown()
4.4BSD, SVr4, POSIX.1-2001.
fchownat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Ownership of new files
When a new file is created (by, for example, open(2) or mkdir(2)), its owner is made the same as the filesystem user ID of the creating process. The group of the file depends on a range of factors, including the type of filesystem, the options used to mount the filesystem, and whether or not the set-group-ID mode bit is enabled on the parent directory. If the filesystem supports the -o grpid (or, synonymously -o bsdgroups) and -o nogrpid (or, synonymously -o sysvgroups) mount(8) options, then the rules are as follows:
If the filesystem is mounted with -o grpid, then the group of a new file is made the same as that of the parent directory.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is disabled on the parent directory, then the group of a new file is made the same as the process’s filesystem GID.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is enabled on the parent directory, then the group of a new file is made the same as that of the parent directory.
As at Linux 4.12, the -o grpid and -o nogrpid mount options are supported by ext2, ext3, ext4, and XFS. Filesystems that don’t support these mount options follow the -o nogrpid rules.
glibc notes
On older kernels where fchownat() is unavailable, the glibc wrapper function falls back to the use of chown() and lchown(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
NFS
The chown() semantics are deliberately violated on NFS filesystems which have UID mapping enabled. Additionally, the semantics of all system calls which access the file contents are violated, because chown() may cause immediate access revocation on already open files. Client-side caching may lead to a delay between the time when ownership has been changed to allow access for a user and the time when the file can actually be accessed by the user on other clients.
Historical details
The original Linux chown(), fchown(), and lchown() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added chown32(), fchown32(), and lchown32(), supporting 32-bit IDs. The glibc chown(), fchown(), and lchown() wrapper functions transparently deal with the variations across kernel versions.
Before Linux 2.1.81 (except 2.1.46), chown() did not follow symbolic links. Since Linux 2.1.81, chown() does follow symbolic links, and there is a new system call lchown() that does not follow symbolic links. Since Linux 2.1.86, this new call (that has the same semantics as the old chown()) has got the same syscall number, and chown() got the newly introduced number.
EXAMPLES
The following program changes the ownership of the file named in its second command-line argument to the value specified in its first command-line argument. The new owner can be specified either as a numeric user ID, or as a username (which is converted to a user ID by using getpwnam(3) to perform a lookup in the system password file).
Program source
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    char           *endptr;
    uid_t          uid;
    struct passwd  *pwd;

    if (argc != 3 || argv[1][0] == '\0') {
        fprintf(stderr, "%s <owner> <file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    uid = strtol(argv[1], &endptr, 10);  /* Allow a numeric string */

    if (*endptr != '\0') {         /* Was not pure numeric string */
        pwd = getpwnam(argv[1]);   /* Try getting UID for username */
        if (pwd == NULL) {
            perror("getpwnam");
            exit(EXIT_FAILURE);
        }

        uid = pwd->pw_uid;
    }

    if (chown(argv[2], uid, -1) == -1) {
        perror("chown");
        exit(EXIT_FAILURE);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
chgrp(1), chown(1), chmod(2), flock(2), path_resolution(7), symlink(7)
139 - Linux cli command get_kernel_syms
NAME
get_kernel_syms - retrieve exported kernel and module symbols
SYNOPSIS
#include <linux/module.h>
[[deprecated]] int get_kernel_syms(struct kernel_sym *table);
DESCRIPTION
Note: This system call is present only before Linux 2.6.
If table is NULL, get_kernel_syms() returns the number of symbols available for query. Otherwise, it fills in a table of structures:
struct kernel_sym {
unsigned long value;
char name[60];
};
The symbols are interspersed with magic symbols of the form #module-name, with the kernel having an empty name. The value associated with a symbol of this form is the address at which the module is loaded.
The symbols exported from each module follow their magic module tag and the modules are returned in the reverse of the order in which they were loaded.
RETURN VALUE
On success, returns the number of symbols copied to table. On error, -1 is returned and errno is set to indicate the error.
ERRORS
There is only one possible error return:
ENOSYS
get_kernel_syms() is not supported in this version of the kernel.
STANDARDS
Linux.
HISTORY
Removed in Linux 2.6.
This obsolete system call is not supported by glibc. No declaration is provided in glibc headers, but, through a quirk of history, glibc versions before glibc 2.23 did export an ABI for this system call. Therefore, in order to employ this system call, it was sufficient to manually declare the interface in your code; alternatively, you could invoke the system call using syscall(2).
BUGS
There is no way to indicate the size of the buffer allocated for table. If symbols have been added to the kernel since the program queried for the symbol table size, memory will be corrupted.
The length of exported symbol names is limited to 59 characters.
Because of these limitations, this system call is deprecated in favor of query_module(2) (which is itself nowadays deprecated in favor of other interfaces described on its manual page).
SEE ALSO
create_module(2), delete_module(2), init_module(2), query_module(2)
140 - Linux cli command mq_notify
NAME 🖥️ mq_notify 🖥️
register for notification when a message is available
LIBRARY
Real-time library (librt, -lrt)
SYNOPSIS
#include <mqueue.h>
#include <signal.h> /* Definition of SIGEV_* constants */
int mq_notify(mqd_t mqdes, const struct sigevent *sevp);
DESCRIPTION
mq_notify() allows the calling process to register or unregister for delivery of an asynchronous notification when a new message arrives on the empty message queue referred to by the message queue descriptor mqdes.
The sevp argument is a pointer to a sigevent structure. For the definition and general details of this structure, see sigevent(3type).
If sevp is a non-null pointer, then mq_notify() registers the calling process to receive message notification. The sigev_notify field of the sigevent structure to which sevp points specifies how notification is to be performed. This field has one of the following values:
SIGEV_NONE
A “null” notification: the calling process is registered as the target for notification, but when a message arrives, no notification is sent.
SIGEV_SIGNAL
Notify the process by sending the signal specified in sigev_signo. See sigevent(3type) for general details. The si_code field of the siginfo_t structure will be set to SI_MESGQ. In addition, si_pid will be set to the PID of the process that sent the message, and si_uid will be set to the real user ID of the sending process.
SIGEV_THREAD
Upon message delivery, invoke sigev_notify_function as if it were the start function of a new thread. See sigevent(3type) for details.
Only one process can be registered to receive notification from a message queue.
If sevp is NULL, and the calling process is currently registered to receive notifications for this message queue, then the registration is removed; another process can then register to receive a message notification for this queue.
Message notification occurs only when a new message arrives and the queue was previously empty. If the queue was not empty at the time mq_notify() was called, then a notification will occur only after the queue is emptied and a new message arrives.
If another process or thread is waiting to read a message from an empty queue using mq_receive(3), then any message notification registration is ignored: the message is delivered to the process or thread calling mq_receive(3), and the message notification registration remains in effect.
Notification occurs once: after a notification is delivered, the notification registration is removed, and another process can register for message notification. If the notified process wishes to receive the next notification, it can use mq_notify() to request a further notification. This should be done before emptying all unread messages from the queue. (Placing the queue in nonblocking mode is useful for emptying the queue of messages without blocking once it is empty.)
RETURN VALUE
On success, mq_notify() returns 0; on error, -1 is returned, with errno set to indicate the error.
ERRORS
EBADF
The message queue descriptor specified in mqdes is invalid.
EBUSY
Another process has already registered to receive notification for this message queue.
EINVAL
sevp->sigev_notify is not one of the permitted values; or sevp->sigev_notify is SIGEV_SIGNAL and sevp->sigev_signo is not a valid signal number.
ENOMEM
Insufficient memory.
POSIX.1-2008 says that an implementation may generate an EINVAL error if sevp is NULL, and the caller is not currently registered to receive notifications for the queue mqdes.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
mq_notify() | Thread safety | MT-Safe |
VERSIONS
C library/kernel differences
In the glibc implementation, the mq_notify() library function is implemented on top of the system call of the same name. When sevp is NULL, or specifies a notification mechanism other than SIGEV_THREAD, the library function directly invokes the system call. For SIGEV_THREAD, much of the implementation resides within the library, rather than the kernel. (This is necessarily so, since the thread involved in handling the notification is one that must be managed by the C library POSIX threads implementation.) The implementation involves the use of a raw netlink(7) socket and creates a new thread for each notification that is delivered to the process.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
EXAMPLES
The following program registers a notification request for the message queue named in its command-line argument. Notification is performed by creating a thread. The thread executes a function which reads one message from the queue and then terminates the process.
Program source
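#include <fcntl.h>      /* Definition of O_* constants */
#include <mqueue.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

static void                     /* Thread start function */
tfunc(union sigval sv)
{
    struct mq_attr attr;
    ssize_t nr;
    void *buf;
    mqd_t mqdes = *((mqd_t *) sv.sival_ptr);

    /* Determine max. msg size; allocate buffer to receive msg */

    if (mq_getattr(mqdes, &attr) == -1)
        handle_error("mq_getattr");
    buf = malloc(attr.mq_msgsize);
    if (buf == NULL)
        handle_error("malloc");

    nr = mq_receive(mqdes, buf, attr.mq_msgsize, NULL);
    if (nr == -1)
        handle_error("mq_receive");

    printf("Read %zd bytes from MQ\n", nr);
    free(buf);
    exit(EXIT_SUCCESS);         /* Terminate the process */
}

int
main(int argc, char *argv[])
{
    mqd_t mqdes;
    struct sigevent sev;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <mq-name>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    mqdes = mq_open(argv[1], O_RDONLY);
    if (mqdes == (mqd_t) -1)
        handle_error("mq_open");

    /* Request notification via a new thread that runs tfunc() */

    sev.sigev_notify = SIGEV_THREAD;
    sev.sigev_notify_function = tfunc;
    sev.sigev_notify_attributes = NULL;
    sev.sigev_value.sival_ptr = &mqdes;   /* Arg. to thread func. */
    if (mq_notify(mqdes, &sev) == -1)
        handle_error("mq_notify");

    pause();    /* Process will be terminated by thread function */
}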
SEE ALSO
mq_close(3), mq_getattr(3), mq_open(3), mq_receive(3), mq_send(3), mq_unlink(3), mq_overview(7), sigevent(3type)
141 - Linux cli command io_destroy
NAME 🖥️ io_destroy 🖥️
destroy an asynchronous I/O context
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/aio_abi.h> /* Definition of aio_context_t */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_io_destroy, aio_context_t ctx_id);
Note: glibc provides no wrapper for io_destroy(), necessitating the use of syscall(2).
DESCRIPTION
Note: this page describes the raw Linux system call interface. The wrapper function provided by libaio uses a different type for the ctx_id argument. See VERSIONS.
The io_destroy() system call will attempt to cancel all outstanding asynchronous I/O operations against ctx_id, will block on the completion of all operations that could not be canceled, and will destroy the ctx_id.
RETURN VALUE
On success, io_destroy() returns 0. For the failure return, see VERSIONS.
ERRORS
EFAULT
The context pointed to is invalid.
EINVAL
The AIO context specified by ctx_id is invalid.
ENOSYS
io_destroy() is not implemented on this architecture.
VERSIONS
You probably want to use the io_destroy() wrapper function provided by libaio.
Note that the libaio wrapper function uses a different type (io_context_t) for the ctx_id argument. Note also that the libaio wrapper does not follow the usual C library conventions for indicating errors: on error it returns a negated error number (the negative of one of the values listed in ERRORS). If the system call is invoked via syscall(2), then the return value follows the usual conventions for indicating an error: -1, with errno set to a (positive) value that indicates the error.
STANDARDS
Linux.
HISTORY
Linux 2.5.
SEE ALSO
io_cancel(2), io_getevents(2), io_setup(2), io_submit(2), aio(7)
142 - Linux cli command sched_rr_get_interval
NAME 🖥️ sched_rr_get_interval 🖥️
get the SCHED_RR interval for the named process
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sched.h>
int sched_rr_get_interval(pid_t pid, struct timespec *tp);
DESCRIPTION
sched_rr_get_interval() writes into the timespec(3) structure pointed to by tp the round-robin time quantum for the process identified by pid. The specified process should be running under the SCHED_RR scheduling policy.
If pid is zero, the time quantum for the calling process is written into *tp.
RETURN VALUE
On success, sched_rr_get_interval() returns 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
Problem with copying information to user space.
EINVAL
Invalid pid.
ENOSYS
The system call is not yet implemented (only on rather old kernels).
ESRCH
Could not find a process with the ID pid.
VERSIONS
Linux
Linux 3.9 added a new mechanism for adjusting (and viewing) the SCHED_RR quantum: the /proc/sys/kernel/sched_rr_timeslice_ms file exposes the quantum as a millisecond value, whose default is 100. Writing 0 to this file resets the quantum to the default value.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
Linux
POSIX does not specify any mechanism for controlling the size of the round-robin time quantum. Older Linux kernels provide a (nonportable) method of doing this. The quantum can be controlled by adjusting the process’s nice value (see setpriority(2)). Assigning a negative (i.e., high) nice value results in a longer quantum; assigning a positive (i.e., low) nice value results in a shorter quantum. The default quantum is 0.1 seconds; the degree to which changing the nice value affects the quantum has varied somewhat across kernel versions. This method of adjusting the quantum was removed starting with Linux 2.6.24.
NOTES
POSIX systems on which sched_rr_get_interval() is available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
SEE ALSO
timespec(3), sched(7)
143 - Linux cli command msync
NAME 🖥️ msync 🖥️
synchronize a file with a memory map
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
int msync(void addr[.length], size_t length, int flags);
DESCRIPTION
msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to the filesystem. Without use of this call, there is no guarantee that changes are written back before munmap(2) is called. To be more precise, the part of the file that corresponds to the memory area starting at addr and having length length is updated.
The flags argument should specify exactly one of MS_ASYNC and MS_SYNC, and may additionally include the MS_INVALIDATE bit. These bits have the following meanings:
MS_ASYNC
Specifies that an update be scheduled, but the call returns immediately.
MS_SYNC
Requests an update and waits for it to complete.
MS_INVALIDATE
Asks to invalidate other mappings of the same file (so that they can be updated with the fresh values just written).
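The following sketch shows one way MS_SYNC might be used: a file is mapped with MAP_SHARED, modified through the mapping, and flushed before the mapping is removed. The function name store_via_mapping() is hypothetical, and the whole file is treated as a single range for simplicity.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write text into the file at path through a shared mapping and wait
 * for the update to reach the filesystem. Returns 0 on success, -1 on
 * error.
 */
int
store_via_mapping(const char *path, const char *text)
{
    size_t len = strlen(text);
    char *map;
    int fd, rc = -1;

    if (len == 0)               /* mmap() rejects a zero length */
        return -1;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    /* The file must be large enough to back the whole mapping. */
    if (ftruncate(fd, (off_t) len) == 0) {
        map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            memcpy(map, text, len);

            /* addr comes from mmap(), so it is page-aligned;
               MS_SYNC waits for the write-back to complete. */
            rc = msync(map, len, MS_SYNC);

            if (munmap(map, len) == -1)
                rc = -1;
        }
    }

    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```

Replacing MS_SYNC with MS_ASYNC would let the call return without waiting for the update to complete.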
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBUSY
MS_INVALIDATE was specified in flags, and a memory lock exists for the specified address range.
EINVAL
addr is not a multiple of PAGESIZE; or any bit other than MS_ASYNC | MS_INVALIDATE | MS_SYNC is set in flags; or both MS_SYNC and MS_ASYNC are set in flags.
ENOMEM
The indicated memory (or part of it) was not mapped.
VERSIONS
According to POSIX, either MS_SYNC or MS_ASYNC must be specified in flags, and indeed failure to include one of these flags will cause msync() to fail on some systems. However, Linux permits a call to msync() that specifies neither of these flags, with semantics that are (currently) equivalent to specifying MS_ASYNC. (Since Linux 2.6.19, MS_ASYNC is in fact a no-op, since the kernel properly tracks dirty pages and flushes them to storage as necessary.) Notwithstanding the Linux behavior, portable, future-proof applications should ensure that they specify either MS_SYNC or MS_ASYNC in flags.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
This call was introduced in Linux 1.3.21, and then used EFAULT instead of ENOMEM. In Linux 2.4.19, this was changed to the POSIX value ENOMEM.
On POSIX systems on which msync() is available, both _POSIX_MAPPED_FILES and _POSIX_SYNCHRONIZED_IO are defined in <unistd.h> to a value greater than 0. (See also sysconf(3).)
SEE ALSO
mmap(2)
B.O. Gallmeister, POSIX.4, O’Reilly, pp. 128–129 and 389–391.
144 - Linux cli command inw_p
NAME 🖥️ inw_p 🖥️
port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width, the w-suffix functions word-width (16 bits), and the l-suffix functions long-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
You use ioperm(2) or alternatively iopl(2) to tell the kernel to allow the user space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
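The permission step described above might look like the following sketch, which enables access to a single port with ioperm(2), reads one byte with inb(), and revokes the access again. The helper name read_port_byte() is hypothetical, and the preprocessor guard is an assumption made here so that the code also compiles on architectures without <sys/io.h>.

```c
#include <errno.h>

#if defined(__i386__) || defined(__x86_64__)
#include <sys/io.h>

/*
 * Read one byte from an I/O port. Returns 0 on success, or -1 with
 * errno set if the kernel refuses access (typically EPERM when the
 * caller lacks the required privilege).
 */
int
read_port_byte(unsigned short port, unsigned char *value)
{
    if (ioperm(port, 1, 1) == -1)       /* Enable access to one port */
        return -1;
    *value = inb(port);
    (void) ioperm(port, 1, 0);          /* Revoke the access again */
    return 0;
}

#else

/* Fallback for architectures without these port I/O functions. */
int
read_port_byte(unsigned short port, unsigned char *value)
{
    (void) port;
    (void) value;
    errno = ENOSYS;
    return -1;
}

#endif
```

An output helper would follow the same pattern, keeping in mind that outb() takes the value first and the port second.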
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
145 - Linux cli command sync_file_range
NAME 🖥️ sync_file_range 🖥️
sync a file segment with disk
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
int sync_file_range(int fd, off_t offset, off_t nbytes,
unsigned int flags);
DESCRIPTION
sync_file_range() permits fine control when synchronizing the open file referred to by the file descriptor fd with disk.
offset is the starting byte of the file range to be synchronized. nbytes specifies the length of the range to be synchronized, in bytes; if nbytes is zero, then all bytes from offset through to the end of file are synchronized. Synchronization is in units of the system page size: offset is rounded down to a page boundary; (offset+nbytes-1) is rounded up to a page boundary.
The flags bit-mask argument can include any of the following values:
SYNC_FILE_RANGE_WAIT_BEFORE
Wait upon write-out of all pages in the specified range that have already been submitted to the device driver for write-out before performing any write.
SYNC_FILE_RANGE_WRITE
Initiate write-out of all dirty pages in the specified range which are not presently submitted for write-out. Note that even this may block if you attempt to write more than the request queue size.
SYNC_FILE_RANGE_WAIT_AFTER
Wait upon write-out of all pages in the range after performing any write.
Specifying flags as 0 is permitted, as a no-op.
Warning
This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the file’s metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an overwrite. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches.
Some details
SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any I/O errors or ENOSPC conditions and will return these to the caller.
Useful combinations of the flags bits are:
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE
Ensures that all pages in the specified range which were dirty when sync_file_range() was called are placed under write-out. This is a start-write-for-data-integrity operation.
SYNC_FILE_RANGE_WRITE
Start write-out of all dirty pages in the specified range which are not presently under write-out. This is an asynchronous flush-to-disk operation. This is not suitable for data integrity operations.
SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER)
Wait for completion of write-out of all pages in the specified range. This can be used after an earlier SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE operation to wait for completion of that operation, and obtain its result.
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER
This is a write-for-data-integrity operation that will ensure that all pages in the specified range which were dirty when sync_file_range() was called are committed to disk.
RETURN VALUE
On success, sync_file_range() returns 0; on failure -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
fd is not a valid file descriptor.
EINVAL
flags specifies an invalid bit; or offset or nbytes is invalid.
EIO
I/O error.
ENOMEM
Out of memory.
ENOSPC
Out of disk space.
ESPIPE
fd refers to something other than a regular file, a block device, or a directory.
VERSIONS
sync_file_range2()
Some architectures (e.g., PowerPC, ARM) need 64-bit arguments to be aligned in a suitable pair of registers. On such architectures, the call signature of sync_file_range() shown in the SYNOPSIS would force a register to be wasted as padding between the fd and offset arguments. (See syscall(2) for details.) Therefore, these architectures define a different system call that orders the arguments suitably:
int sync_file_range2(int fd, unsigned int flags,
off_t offset, off_t nbytes);
The behavior of this system call is otherwise exactly the same as sync_file_range().
STANDARDS
Linux.
HISTORY
Linux 2.6.17.
sync_file_range2()
A system call with this signature first appeared on the ARM architecture in Linux 2.6.20, with the name arm_sync_file_range(). It was renamed in Linux 2.6.22, when the analogous system call was added for PowerPC. On architectures where glibc support is provided, glibc transparently wraps sync_file_range2() under the name sync_file_range().
NOTES
_FILE_OFFSET_BITS should be defined to be 64 in code that takes the address of sync_file_range, if the code is intended to be portable to traditional 32-bit x86 and ARM platforms where off_t’s width defaults to 32 bits.
SEE ALSO
fdatasync(2), fsync(2), msync(2), sync(2)
146 - Linux cli command setregid32
NAME 🖥️ setregid32 🖥️
set real and/or effective user or group ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setreuid(uid_t ruid, uid_t euid);
int setregid(gid_t rgid, gid_t egid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
setreuid(), setregid():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
setreuid() sets real and effective user IDs of the calling process.
Supplying a value of -1 for either the real or effective user ID forces the system to leave that ID unchanged.
Unprivileged processes may only set the effective user ID to the real user ID, the effective user ID, or the saved set-user-ID.
Unprivileged users may only set the real user ID to the real user ID or the effective user ID.
If the real user ID is set (i.e., ruid is not -1) or the effective user ID is set to a value not equal to the previous real user ID, the saved set-user-ID will be set to the new effective user ID.
Completely analogously, setregid() sets the real and effective group IDs of the calling process, and all of the above holds with “group” instead of “user”.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where setreuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from setreuid().
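A minimal sketch of checking for failure follows. The helper name drop_to() is hypothetical. It changes the group IDs before the user IDs, checks both return values, and then verifies the result. Supplementary groups are deliberately left out of this illustration and would need separate handling.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Set the real and effective IDs of the calling process to uid and
 * gid. Returns 0 on success, -1 if any step fails or if the IDs do
 * not have the expected values afterward.
 */
int
drop_to(uid_t uid, gid_t gid)
{
    /* Change the group IDs first: once the user IDs have changed,
       the process may no longer be allowed to change group IDs. */
    if (setregid(gid, gid) == -1)
        return -1;

    /* Because ruid is not -1, the saved set-user-ID is also set to
       the new effective user ID. */
    if (setreuid(uid, uid) == -1)
        return -1;

    /* Verify the outcome instead of trusting the return values. */
    if (getgid() != gid || getegid() != gid ||
        getuid() != uid || geteuid() != uid)
        return -1;

    return 0;
}
```

A caller should treat a return value of -1 as fatal and terminate, rather than continue running with credentials it meant to give up.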
ERRORS
EAGAIN
The call would change the caller’s real UID (i.e., ruid does not match the caller’s real UID), but there was a temporary failure allocating the necessary kernel data structures.
EAGAIN
ruid does not match the caller’s real UID and this call would bring the number of processes belonging to the real user ID ruid over the caller’s RLIMIT_NPROC resource limit. Since Linux 3.1, this error case no longer occurs (but robust applications should check for this error); see the description of EAGAIN in execve(2).
EINVAL
One or more of the target user or group IDs is not valid in this user namespace.
EPERM
The calling process is not privileged (on Linux, does not have the necessary capability in its user namespace: CAP_SETUID in the case of setreuid(), or CAP_SETGID in the case of setregid()) and a change other than (i) swapping the effective user (group) ID with the real user (group) ID, or (ii) setting one to the value of the other or (iii) setting the effective user (group) ID to the value of the saved set-user-ID (saved set-group-ID) was specified.
VERSIONS
POSIX.1 does not specify all of the UID changes that Linux permits for an unprivileged process. For setreuid(), the effective user ID can be made the same as the real user ID or the saved set-user-ID, and it is unspecified whether unprivileged processes may set the real user ID to the real user ID, the effective user ID, or the saved set-user-ID. For setregid(), the real group ID can be changed to the value of the saved set-group-ID, and the effective group ID can be changed to the value of the real group ID or the saved set-group-ID. The precise details of what ID changes are permitted vary across implementations.
POSIX.1 makes no specification about the effect of these calls on the saved set-user-ID and saved set-group-ID.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD (first appeared in 4.2BSD).
Setting the effective user (group) ID to the saved set-user-ID (saved set-group-ID) is possible since Linux 1.1.37 (1.1.38).
The original Linux setreuid() and setregid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added setreuid32() and setregid32(), supporting 32-bit IDs. The glibc setreuid() and setregid() wrapper functions transparently deal with the variations across kernel versions.
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including those for setreuid() and setregid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
SEE ALSO
getgid(2), getuid(2), seteuid(2), setgid(2), setresuid(2), setuid(2), capabilities(7), credentials(7), user_namespaces(7)
147 - Linux cli command sethostname
NAME 🖥️ sethostname 🖥️
get/set hostname
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int gethostname(char *name, size_t len);
int sethostname(const char *name, size_t len);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
gethostname():
_XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
sethostname():
Since glibc 2.21:
_DEFAULT_SOURCE
In glibc 2.19 and 2.20:
_DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
Up to and including glibc 2.19:
_BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
DESCRIPTION
These system calls are used to access or to change the system hostname. More precisely, they operate on the hostname associated with the calling process’s UTS namespace.
sethostname() sets the hostname to the value given in the character array name. The len argument specifies the number of bytes in name. (Thus, name does not require a terminating null byte.)
gethostname() returns the null-terminated hostname in the character array name, which has a length of len bytes. If the null-terminated hostname is too large to fit, then the name is truncated, and no error is returned (but see NOTES below). POSIX.1 says that if such truncation occurs, then it is unspecified whether the returned buffer includes a terminating null byte.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
name is an invalid address.
EINVAL
len is negative or, for sethostname(), len is larger than the maximum allowed size.
ENAMETOOLONG
(glibc gethostname()) len is smaller than the actual size. (Before glibc 2.1, glibc uses EINVAL for this case.)
EPERM
For sethostname(), the caller did not have the CAP_SYS_ADMIN capability in the user namespace associated with its UTS namespace (see namespaces(7)).
VERSIONS
SUSv2 guarantees that “Host names are limited to 255 bytes”. POSIX.1 guarantees that “Host names (not including the terminating null byte) are limited to HOST_NAME_MAX bytes”. On Linux, HOST_NAME_MAX is defined with the value 64, which has been the limit since Linux 1.0 (earlier kernels imposed a limit of 8 bytes).
C library/kernel differences
The GNU C library does not employ the gethostname() system call; instead, it implements gethostname() as a library function that calls uname(2) and copies up to len bytes from the returned nodename field into name. Having performed the copy, the function then checks if the length of the nodename was greater than or equal to len, and if it is, then the function returns -1 with errno set to ENAMETOOLONG; in this case, a terminating null byte is not included in the returned name.
STANDARDS
gethostname()
POSIX.1-2008.
sethostname()
None.
HISTORY
SVr4, 4.4BSD (these interfaces first appeared in 4.2BSD). POSIX.1-2001 and POSIX.1-2008 specify gethostname() but not sethostname().
Versions of glibc before glibc 2.2 handle the case where the length of the nodename was greater than or equal to len differently: nothing is copied into name and the function returns -1 with errno set to ENAMETOOLONG.
SEE ALSO
hostname(1), getdomainname(2), setdomainname(2), uname(2), uts_namespaces(7)
148 - Linux cli command newfstatat
NAME 🖥️ newfstatat 🖥️
get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but, in the case of stat(), fstatat(), and lstat(), execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
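A common use of these calls is to decide whether two pathnames name the same file by comparing the st_dev and st_ino fields. The sketch below assumes that comparison is what the caller wants; same_file() is a hypothetical helper, not a library function.

```c
#include <sys/stat.h>

/*
 * Hypothetical helper: report whether two pathnames resolve to the
 * same file by comparing device ID and inode number. Symbolic links
 * are followed because stat() is used; substituting lstat() would
 * compare the links themselves.
 * Returns 1 if same, 0 if different, -1 if either stat() call fails.
 */
int
same_file(const char *path_a, const char *path_b)
{
    struct stat a, b;

    if (stat(path_a, &a) == -1 || stat(path_b, &b) == -1)
        return -1;

    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

The same comparison works with fstat() when one side is already open as a file descriptor.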
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <time.h>
int
main(int argc, char *argv[])
{
    struct stat sb;
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }
    printf("ID of containing device:  [%x,%x]\n",
           major(sb.st_dev), minor(sb.st_dev));
    printf("File type:                ");
    switch (sb.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");     break;
    case S_IFCHR:  printf("character device\n"); break;
    case S_IFDIR:  printf("directory\n");        break;
    case S_IFIFO:  printf("FIFO/pipe\n");        break;
    case S_IFLNK:  printf("symlink\n");          break;
    case S_IFREG:  printf("regular file\n");     break;
    case S_IFSOCK: printf("socket\n");           break;
    default:       printf("unknown?\n");         break;
    }
    printf("I-node number:            %ju\n", (uintmax_t) sb.st_ino);
    printf("Mode:                     %jo (octal)\n",
           (uintmax_t) sb.st_mode);
    printf("Link count:               %ju\n", (uintmax_t) sb.st_nlink);
    printf("Ownership:                UID=%ju   GID=%ju\n",
           (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);
    printf("Preferred I/O block size: %jd bytes\n",
           (intmax_t) sb.st_blksize);
    printf("File size:                %jd bytes\n",
           (intmax_t) sb.st_size);
    printf("Blocks allocated:         %jd\n", (intmax_t) sb.st_blocks);
    printf("Last status change:       %s", ctime(&sb.st_ctime));
    printf("Last file access:         %s", ctime(&sb.st_atime));
    printf("Last file modification:   %s", ctime(&sb.st_mtime));
    exit(EXIT_SUCCESS);
}
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
149 - Linux cli command pkey_alloc
NAME 🖥️ pkey_alloc 🖥️
allocate or free a protection key
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/mman.h>
int pkey_alloc(unsigned int flags, unsigned int access_rights);
int pkey_free(int pkey);
DESCRIPTION
pkey_alloc() allocates a protection key (pkey) and allows it to be passed to pkey_mprotect(2).
The flags argument of pkey_alloc() is reserved for future use and currently must always be specified as 0.
The pkey_alloc() access_rights argument may contain zero or more disable operations:
PKEY_DISABLE_ACCESS
Disable all data access to memory covered by the returned protection key.
PKEY_DISABLE_WRITE
Disable write access to memory covered by the returned protection key.
pkey_free() frees a protection key and makes it available for later allocations. After a protection key has been freed, it may no longer be used in any protection-key-related operations.
An application should not call pkey_free() on any protection key which has been assigned to an address range by pkey_mprotect(2) and which is still in use. The behavior in this case is undefined and may result in an error.
RETURN VALUE
On success, pkey_alloc() returns a positive protection key value. On success, pkey_free() returns zero. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
pkey, flags, or access_rights is invalid.
ENOSPC
(pkey_alloc()) All protection keys available for the current process have been allocated. The number of keys available is architecture-specific and implementation-specific and may be reduced by kernel-internal use of certain keys. There are currently 15 keys available to user programs on x86.
This error will also be returned if the processor or operating system does not support protection keys. Applications should always be prepared to handle this error, since factors outside of the application’s control can reduce the number of available pkeys.
STANDARDS
Linux.
HISTORY
Linux 4.9, glibc 2.27.
NOTES
pkey_alloc() is always safe to call regardless of whether or not the operating system supports protection keys. It can be used in lieu of any other mechanism for detecting pkey support and will simply fail with the error ENOSPC if the operating system has no pkey support.
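The probing approach described above can be sketched as follows. This is an illustration under stated assumptions: pkeys_usable() is a hypothetical helper, and it invokes the system calls through syscall(2) so that it builds without _GNU_SOURCE or glibc 2.27; with a recent glibc, the pkey_alloc() and pkey_free() wrappers could be called instead.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical probe: returns 1 if a protection key could be
 * allocated (the key is freed again before returning), or 0 if
 * allocation failed for any reason.
 */
int
pkeys_usable(void)
{
#if defined(SYS_pkey_alloc) && defined(SYS_pkey_free)
    /* flags must be 0; access_rights 0 disables nothing. */
    long pkey = syscall(SYS_pkey_alloc, 0UL, 0UL);

    if (pkey == -1)
        return 0;   /* ENOSPC per the text above; ENOSYS if the
                       kernel lacks the system call altogether */

    syscall(SYS_pkey_free, pkey);
    return 1;
#else
    errno = ENOSYS; /* headers too old to know the system calls */
    return 0;
#endif
}
```

Because the key is freed immediately, a later allocation can still fail with ENOSPC, so callers should keep checking the result of each real allocation.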
The kernel guarantees that the contents of the hardware rights register (PKRU) will be preserved only for allocated protection keys. Any time a key is unallocated (either before the first call returning that key from pkey_alloc() or after it is freed via pkey_free()), the kernel may make arbitrary changes to the parts of the rights register affecting access to that key.
EXAMPLES
See pkeys(7).
SEE ALSO
pkey_mprotect(2), pkeys(7)
150 - Linux cli command restart_syscall
NAME 🖥️ restart_syscall 🖥️
restart a system call after interruption by a stop signal
SYNOPSIS
long restart_syscall(void);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
The restart_syscall() system call is used to restart certain system calls after a process that was stopped by a signal (e.g., SIGSTOP or SIGTSTP) is later resumed after receiving a SIGCONT signal. This system call is designed only for internal use by the kernel.
restart_syscall() is used for restarting only those system calls that, when restarted, should adjust their time-related parameters, namely poll(2) (since Linux 2.6.24), nanosleep(2) (since Linux 2.6), clock_nanosleep(2) (since Linux 2.6), and futex(2), when employed with the FUTEX_WAIT (since Linux 2.6.22) and FUTEX_WAIT_BITSET (since Linux 2.6.31) operations. restart_syscall() restarts the interrupted system call with a time argument that is suitably adjusted to account for the time that has already elapsed (including the time where the process was stopped by a signal). Without the restart_syscall() mechanism, restarting these system calls would not correctly deduct the already elapsed time when the process continued execution.
RETURN VALUE
The return value of restart_syscall() is the return value of whatever system call is being restarted.
ERRORS
errno is set as per the errors for whatever system call is being restarted by restart_syscall().
STANDARDS
Linux.
HISTORY
Linux 2.6.
NOTES
There is no glibc wrapper for this system call, because it is intended for use only by the kernel and should never be called by applications.
The kernel uses restart_syscall() to ensure that when a system call is restarted after a process has been stopped by a signal and then resumed by SIGCONT, then the time that the process spent in the stopped state is counted against the timeout interval specified in the original system call. In the case of system calls that take a timeout argument and automatically restart after a stop signal plus SIGCONT, but which do not have the restart_syscall() mechanism built in, then, after the process resumes execution, the time that the process spent in the stop state is not counted against the timeout value. Notable examples of system calls that suffer this problem are ppoll(2), select(2), and pselect(2).
From user space, the operation of restart_syscall() is largely invisible: to the process that made the system call that is restarted, it appears as though that system call executed and returned in the usual fashion.
SEE ALSO
sigaction(2), sigreturn(2), signal(7)
151 - Linux cli command inw
NAME 🖥️ inw 🖥️
port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width, the w-suffix functions are word-width (16 bits), and the l-suffix functions are long-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
Use ioperm(2) or, alternatively, iopl(2) to tell the kernel to allow the user-space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
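The sequence described above (request access, perform the I/O, drop access) might look like the following sketch. read_port_byte() is a hypothetical helper; the architecture guard is an assumption added so that the code also compiles where <sys/io.h> is not provided, and the call is expected to fail unless the process has the privilege that ioperm(2) requires.

```c
#include <errno.h>
#if defined(__i386__) || defined(__x86_64__)
#include <sys/io.h>
#endif

/*
 * Hypothetical helper: read one byte from an I/O port.
 * Returns 0 and stores the byte in *value on success, or -1 with
 * errno set if access to the port could not be obtained.
 */
int
read_port_byte(unsigned short port, unsigned char *value)
{
#if defined(__i386__) || defined(__x86_64__)
    if (ioperm(port, 1, 1) == -1)   /* request access to this one port */
        return -1;

    *value = inb(port);

    ioperm(port, 1, 0);             /* drop the permission again */
    return 0;
#else
    (void) port;
    (void) value;
    errno = ENOSYS;                 /* no port I/O via <sys/io.h> here */
    return -1;
#endif
}
```

Which port is safe to read depends entirely on the hardware, so any port number passed to this helper is the caller's responsibility.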
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
152 - Linux cli command statfs64
NAME 🖥️ statfs64 🖥️
get filesystem statistics
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/vfs.h> /* or <sys/statfs.h> */
int statfs(const char *path, struct statfs *buf);
int fstatfs(int fd, struct statfs *buf);
Unless you need the f_type field, you should use the standard statvfs(3) interface instead.
DESCRIPTION
The statfs() system call returns information about a mounted filesystem. path is the pathname of any file within the mounted filesystem. buf is a pointer to a statfs structure defined approximately as follows:
struct statfs {
    __fsword_t f_type;    /* Type of filesystem (see below) */
    __fsword_t f_bsize;   /* Optimal transfer block size */
    fsblkcnt_t f_blocks;  /* Total data blocks in filesystem */
    fsblkcnt_t f_bfree;   /* Free blocks in filesystem */
    fsblkcnt_t f_bavail;  /* Free blocks available to
                             unprivileged user */
    fsfilcnt_t f_files;   /* Total inodes in filesystem */
    fsfilcnt_t f_ffree;   /* Free inodes in filesystem */
    fsid_t     f_fsid;    /* Filesystem ID */
    __fsword_t f_namelen; /* Maximum length of filenames */
    __fsword_t f_frsize;  /* Fragment size (since Linux 2.6) */
    __fsword_t f_flags;   /* Mount flags of filesystem
                             (since Linux 2.6.36) */
    __fsword_t f_spare[xxx];
                          /* Padding bytes reserved for future use */
};
The following filesystem types may appear in f_type:
ADFS_SUPER_MAGIC 0xadf5
AFFS_SUPER_MAGIC 0xadff
AFS_SUPER_MAGIC 0x5346414f
ANON_INODE_FS_MAGIC 0x09041934 /* Anonymous inode FS (for
pseudofiles that have no name;
e.g., epoll, signalfd, bpf) */
AUTOFS_SUPER_MAGIC 0x0187
BDEVFS_MAGIC 0x62646576
BEFS_SUPER_MAGIC 0x42465331
BFS_MAGIC 0x1badface
BINFMTFS_MAGIC 0x42494e4d
BPF_FS_MAGIC 0xcafe4a11
BTRFS_SUPER_MAGIC 0x9123683e
BTRFS_TEST_MAGIC 0x73727279
CGROUP_SUPER_MAGIC 0x27e0eb /* Cgroup pseudo FS */
CGROUP2_SUPER_MAGIC 0x63677270 /* Cgroup v2 pseudo FS */
CIFS_MAGIC_NUMBER 0xff534d42
CODA_SUPER_MAGIC 0x73757245
COH_SUPER_MAGIC 0x012ff7b7
CRAMFS_MAGIC 0x28cd3d45
DEBUGFS_MAGIC 0x64626720
DEVFS_SUPER_MAGIC 0x1373 /* Linux 2.6.17 and earlier */
DEVPTS_SUPER_MAGIC 0x1cd1
ECRYPTFS_SUPER_MAGIC 0xf15f
EFIVARFS_MAGIC 0xde5e81e4
EFS_SUPER_MAGIC 0x00414a53
EXT_SUPER_MAGIC 0x137d /* Linux 2.0 and earlier */
EXT2_OLD_SUPER_MAGIC 0xef51
EXT2_SUPER_MAGIC 0xef53
EXT3_SUPER_MAGIC 0xef53
EXT4_SUPER_MAGIC 0xef53
F2FS_SUPER_MAGIC 0xf2f52010
FUSE_SUPER_MAGIC 0x65735546
FUTEXFS_SUPER_MAGIC 0xbad1dea /* Unused */
HFS_SUPER_MAGIC 0x4244
HOSTFS_SUPER_MAGIC 0x00c0ffee
HPFS_SUPER_MAGIC 0xf995e849
HUGETLBFS_MAGIC 0x958458f6
ISOFS_SUPER_MAGIC 0x9660
JFFS2_SUPER_MAGIC 0x72b6
JFS_SUPER_MAGIC 0x3153464a
MINIX_SUPER_MAGIC 0x137f /* original minix FS */
MINIX_SUPER_MAGIC2 0x138f /* 30 char minix FS */
MINIX2_SUPER_MAGIC 0x2468 /* minix V2 FS */
MINIX2_SUPER_MAGIC2 0x2478 /* minix V2 FS, 30 char names */
MINIX3_SUPER_MAGIC 0x4d5a /* minix V3 FS, 60 char names */
MQUEUE_MAGIC 0x19800202 /* POSIX message queue FS */
MSDOS_SUPER_MAGIC 0x4d44
MTD_INODE_FS_MAGIC 0x11307854
NCP_SUPER_MAGIC 0x564c
NFS_SUPER_MAGIC 0x6969
NILFS_SUPER_MAGIC 0x3434
NSFS_MAGIC 0x6e736673
NTFS_SB_MAGIC 0x5346544e
OCFS2_SUPER_MAGIC 0x7461636f
OPENPROM_SUPER_MAGIC 0x9fa1
OVERLAYFS_SUPER_MAGIC 0x794c7630
PIPEFS_MAGIC 0x50495045
PROC_SUPER_MAGIC 0x9fa0 /* /proc FS */
PSTOREFS_MAGIC 0x6165676c
QNX4_SUPER_MAGIC 0x002f
QNX6_SUPER_MAGIC 0x68191122
RAMFS_MAGIC 0x858458f6
REISERFS_SUPER_MAGIC 0x52654973
ROMFS_MAGIC 0x7275
SECURITYFS_MAGIC 0x73636673
SELINUX_MAGIC 0xf97cff8c
SMACK_MAGIC 0x43415d53
SMB_SUPER_MAGIC 0x517b
SMB2_MAGIC_NUMBER 0xfe534d42
SOCKFS_MAGIC 0x534f434b
SQUASHFS_MAGIC 0x73717368
SYSFS_MAGIC 0x62656572
SYSV2_SUPER_MAGIC 0x012ff7b6
SYSV4_SUPER_MAGIC 0x012ff7b5
TMPFS_MAGIC 0x01021994
TRACEFS_MAGIC 0x74726163
UDF_SUPER_MAGIC 0x15013346
UFS_MAGIC 0x00011954
USBDEVICE_SUPER_MAGIC 0x9fa2
V9FS_MAGIC 0x01021997
VXFS_SUPER_MAGIC 0xa501fcf5
XENFS_SUPER_MAGIC 0xabba1974
XENIX_SUPER_MAGIC 0x012ff7b4
XFS_SUPER_MAGIC 0x58465342
_XIAFS_SUPER_MAGIC 0x012fd16d /* Linux 2.0 and earlier */
Most of these MAGIC constants are defined in /usr/include/linux/magic.h, and some are hardcoded in kernel sources.
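As an illustration of how f_type might be consumed, the sketch below maps a handful of the magic numbers listed above to names. fs_type_name() and the known_fs table are hypothetical; the numeric values are copied from the list above, and <linux/magic.h> provides the same values as named constants.

```c
#include <stddef.h>
#include <sys/vfs.h>

struct fs_magic {               /* hypothetical lookup-table entry */
    unsigned long magic;
    const char *name;
};

/* A few values copied from the list of f_type constants. */
static const struct fs_magic known_fs[] = {
    { 0xef53,     "ext2/ext3/ext4" },
    { 0x01021994, "tmpfs" },
    { 0x9fa0,     "proc" },
    { 0x9123683e, "btrfs" },
    { 0x6969,     "nfs" },
    { 0x794c7630, "overlayfs" },
};

/*
 * Hypothetical helper: name the type of the filesystem containing
 * path. Returns NULL if statfs() fails, "unknown" if the type is not
 * in the table above.
 */
const char *
fs_type_name(const char *path)
{
    struct statfs sfs;
    unsigned long type;
    size_t i;

    if (statfs(path, &sfs) == -1)
        return NULL;

    /* Copy out of the glibc-internal __fsword_t; keep the low 32 bits. */
    type = (unsigned long) sfs.f_type & 0xffffffffUL;

    for (i = 0; i < sizeof known_fs / sizeof known_fs[0]; i++)
        if (known_fs[i].magic == type)
            return known_fs[i].name;

    return "unknown";
}
```

Note that ext2, ext3, and ext4 share one magic number, so f_type alone cannot tell them apart.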
The f_flags field is a bit mask indicating mount options for the filesystem. It contains zero or more of the following bits:
ST_MANDLOCK
Mandatory locking is permitted on the filesystem (see fcntl(2)).
ST_NOATIME
Do not update access times; see mount(2).
ST_NODEV
Disallow access to device special files on this filesystem.
ST_NODIRATIME
Do not update directory access times; see mount(2).
ST_NOEXEC
Execution of programs is disallowed on this filesystem.
ST_NOSUID
The set-user-ID and set-group-ID bits are ignored by exec(3) for executable files on this filesystem.
ST_RDONLY
This filesystem is mounted read-only.
ST_RELATIME
Update atime relative to mtime/ctime; see mount(2).
ST_SYNCHRONOUS
Writes are synched to the filesystem immediately (see the description of O_SYNC in open(2)).
ST_NOSYMFOLLOW (since Linux 5.10)
Symbolic links are not followed when resolving paths; see mount(2).
Nobody knows what f_fsid is supposed to contain (but see below).
Fields that are undefined for a particular filesystem are set to 0.
fstatfs() returns the same information about an open file referenced by descriptor fd.
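The f_flags bits and f_namelen can be read as in the following sketch. mount_props_of() and struct mount_props are hypothetical names; the sketch assumes that ST_RDONLY and ST_NOSUID are taken from <sys/statvfs.h>, and it tests only those two flags because they should be visible without additional feature test macros.

```c
#include <sys/statvfs.h>    /* ST_RDONLY, ST_NOSUID */
#include <sys/vfs.h>

struct mount_props {        /* hypothetical result type */
    int read_only;          /* nonzero if ST_RDONLY is set */
    int no_suid;            /* nonzero if ST_NOSUID is set */
    unsigned long name_max; /* copy of f_namelen */
};

/*
 * Hypothetical helper: summarize a few properties of the filesystem
 * containing path. Returns 0 on success, or -1 with errno set by
 * statfs().
 */
int
mount_props_of(const char *path, struct mount_props *out)
{
    struct statfs sfs;

    if (statfs(path, &sfs) == -1)
        return -1;

    out->read_only = (sfs.f_flags & ST_RDONLY) != 0;
    out->no_suid = (sfs.f_flags & ST_NOSUID) != 0;
    out->name_max = (unsigned long) sfs.f_namelen;
    return 0;
}
```

The same function could be written around fstatfs() for a file that is already open.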
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
(statfs()) Search permission is denied for a component of the path prefix of path. (See also path_resolution(7).)
EBADF
(fstatfs()) fd is not a valid open file descriptor.
EFAULT
buf or path points to an invalid address.
EINTR
The call was interrupted by a signal; see signal(7).
EIO
An I/O error occurred while reading from the filesystem.
ELOOP
(statfs()) Too many symbolic links were encountered in translating path.
ENAMETOOLONG
(statfs()) path is too long.
ENOENT
(statfs()) The file referred to by path does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOSYS
The filesystem does not support this call.
ENOTDIR
(statfs()) A component of the path prefix of path is not a directory.
EOVERFLOW
Some values were too large to be represented in the returned struct.
VERSIONS
The f_fsid field
Solaris, Irix, and POSIX have a system call statvfs(2) that returns a struct statvfs (defined in <sys/statvfs.h>) containing an unsigned long f_fsid. Linux, SunOS, HP-UX, 4.4BSD have a system call statfs() that returns a struct statfs (defined in <sys/vfs.h>) containing a fsid_t f_fsid, where fsid_t is defined as struct { int val[2]; }. The same holds for FreeBSD, except that it uses the include file <sys/mount.h>.
The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type. Several operating systems restrict giving out the f_fsid field to the superuser only (and zero it for unprivileged users), because this field is used in the filehandle of the filesystem when NFS-exported, and giving it out is a security concern.
Under some operating systems, the fsid can be used as the second argument to the sysfs(2) system call.
STANDARDS
Linux.
HISTORY
The Linux statfs() was inspired by the 4.4BSD one (but they do not use the same structure).
The original Linux statfs() and fstatfs() system calls were not designed with extremely large file sizes in mind. Subsequently, Linux 2.6 added new statfs64() and fstatfs64() system calls that employ a new structure, statfs64. The new structure contains the same fields as the original statfs structure, but the sizes of various fields are increased, to accommodate large file sizes. The glibc statfs() and fstatfs() wrapper functions transparently deal with the kernel differences.
LSB has deprecated the library calls statfs() and fstatfs() and tells us to use statvfs(3) and fstatvfs(3) instead.
NOTES
The __fsword_t type used for various fields in the statfs structure definition is a glibc internal type, not intended for public use. This leaves the programmer in a bit of a conundrum when trying to copy or compare these fields to local variables in a program. Using unsigned int for such variables suffices on most systems.
Some systems have only <sys/vfs.h>, other systems also have <sys/statfs.h>, where the former includes the latter. So it seems including the former is the best choice.
BUGS
From Linux 2.6.38 up to and including Linux 3.1, fstatfs() failed with the error ENOSYS for file descriptors created by pipe(2).
SEE ALSO
stat(2), statvfs(3), path_resolution(7)
153 - Linux cli command modify_ldt
NAME 🖥️ modify_ldt 🖥️
get or set a per-process LDT entry
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <asm/ldt.h> /* Definition of struct user_desc */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_modify_ldt, int func, void ptr[.bytecount],
unsigned long bytecount);
Note: glibc provides no wrapper for modify_ldt(), necessitating the use of syscall(2).
DESCRIPTION
modify_ldt() reads or writes the local descriptor table (LDT) for a process. The LDT is an array of segment descriptors that can be referenced by user code. Linux allows processes to configure a per-process (actually per-mm) LDT. For more information about the LDT, see the Intel Software Developer’s Manual or the AMD Architecture Programming Manual.
When func is 0, modify_ldt() reads the LDT into the memory pointed to by ptr. The number of bytes read is the smaller of bytecount and the actual size of the LDT, although the kernel may act as though the LDT is padded with additional trailing zero bytes. On success, modify_ldt() will return the number of bytes read.
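The read operation (func 0) can be sketched as follows. read_ldt() is a hypothetical helper; the SYS_modify_ldt guard is an assumption added so that the code also compiles on architectures that do not define this system call.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to bytecount bytes of the calling
 * process's LDT into buf.
 * Returns the number of bytes read (0 if no LDT has been set up),
 * or -1 with errno set.
 */
long
read_ldt(void *buf, unsigned long bytecount)
{
#ifdef SYS_modify_ldt
    return syscall(SYS_modify_ldt, 0, buf, bytecount);  /* func 0: read */
#else
    (void) buf;
    (void) bytecount;
    errno = ENOSYS;     /* system call not defined on this architecture */
    return -1;
#endif
}
```

A process that has never installed an LDT entry would normally see a return value of 0 here.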
When func is 1 or 0x11, modify_ldt() modifies the LDT entry indicated by ptr->entry_number. ptr points to a user_desc structure and bytecount must equal the size of this structure.
The user_desc structure is defined in <asm/ldt.h> as:
struct user_desc {
    unsigned int  entry_number;
    unsigned int  base_addr;
    unsigned int  limit;
    unsigned int  seg_32bit:1;
    unsigned int  contents:2;
    unsigned int  read_exec_only:1;
    unsigned int  limit_in_pages:1;
    unsigned int  seg_not_present:1;
    unsigned int  useable:1;
};
In Linux 2.4 and earlier, this structure was named modify_ldt_ldt_s.
The contents field is the segment type (data, expand-down data, non-conforming code, or conforming code). The other fields match their descriptions in the CPU manual, although modify_ldt() cannot set the hardware-defined “accessed” bit described in the CPU manual.
A user_desc is considered “empty” if read_exec_only and seg_not_present are set to 1 and all of the other fields are 0. An LDT entry can be cleared by setting it to an “empty” user_desc or, if func is 1, by setting both base_addr and limit to 0.
A conforming code segment (i.e., one with contents==3) will be rejected if func is 1 or if seg_not_present is 0.
When func is 2, modify_ldt() will read zeros. This appears to be a leftover from Linux 2.4.
RETURN VALUE
On success, modify_ldt() returns either the actual number of bytes read (for reading) or 0 (for writing). On failure, modify_ldt() returns -1 and sets errno to indicate the error.
ERRORS
EFAULT
ptr points outside the address space.
EINVAL
ptr is 0, or func is 1 and bytecount is not equal to the size of the structure user_desc, or func is 1 or 0x11 and the new LDT entry has invalid values.
ENOSYS
func is neither 0, 1, 2, nor 0x11.
STANDARDS
Linux.
NOTES
modify_ldt() should not be used for thread-local storage, as it slows down context switches and only supports a limited number of threads. Threading libraries should use set_thread_area(2) or arch_prctl(2) instead, except on extremely old kernels that do not support those system calls.
The normal use for modify_ldt() is to run legacy 16-bit or segmented 32-bit code. Not all kernels allow 16-bit segments to be installed, however.
Even on 64-bit kernels, modify_ldt() cannot be used to create a long mode (i.e., 64-bit) code segment. The undocumented field “lm” in user_desc is not useful, and, despite its name, does not result in a long mode segment.
BUGS
On 64-bit kernels before Linux 3.19, setting the “lm” bit in user_desc prevents the descriptor from being considered empty. Keep in mind that the “lm” bit does not exist in the 32-bit headers, but these buggy kernels will still notice the bit even when set in a 32-bit process.
SEE ALSO
arch_prctl(2), set_thread_area(2), vm86(2)
154 - Linux cli command outl
NAME 🖥️ outl 🖥️
port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width (8 bits), the w-suffix functions word-width (16 bits), and the l-suffix functions long-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
Use ioperm(2) or, alternatively, iopl(2) to tell the kernel to allow the user-space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
155 - Linux cli command outsb
NAME
outsb - port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width (8 bits), the w-suffix functions word-width (16 bits), and the l-suffix functions long-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
Use ioperm(2) or, alternatively, iopl(2) to tell the kernel to allow the user-space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
156 - Linux cli command outl_p
NAME
outl_p - port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width (8 bits), the w-suffix functions word-width (16 bits), and the l-suffix functions long-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
Use ioperm(2) or, alternatively, iopl(2) to tell the kernel to allow the user-space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
157 - Linux cli command readdir
NAME
readdir - read directory entry
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_readdir, unsigned int fd,
struct old_linux_dirent *dirp, unsigned int count);
Note: There is no definition of struct old_linux_dirent; see NOTES.
DESCRIPTION
This is not the function you are interested in. Look at readdir(3) for the POSIX conforming C library interface. This page documents the bare kernel system call interface, which is superseded by getdents(2).
readdir() reads one old_linux_dirent structure from the directory referred to by the file descriptor fd into the buffer pointed to by dirp. The argument count is ignored; at most one old_linux_dirent structure is read.
The old_linux_dirent structure is declared (privately in Linux kernel file fs/readdir.c) as follows:
struct old_linux_dirent {
    unsigned long  d_ino;     /* inode number */
    unsigned long  d_offset;  /* offset to this old_linux_dirent */
    unsigned short d_namlen;  /* length of this d_name */
    char           d_name[1]; /* filename (null-terminated) */
};
d_ino is an inode number. d_offset is the distance from the start of the directory to this old_linux_dirent. d_namlen is the size of d_name, not counting the terminating null byte ('\0'). d_name is a null-terminated filename.
RETURN VALUE
On success, 1 is returned. On end of directory, 0 is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
Invalid file descriptor fd.
EFAULT
Argument points outside the calling process’s address space.
EINVAL
Result buffer is too small.
ENOENT
No such directory.
ENOTDIR
File descriptor does not refer to a directory.
VERSIONS
You will need to define the old_linux_dirent structure yourself. However, you should probably use readdir(3) instead.
This system call does not exist on x86-64.
STANDARDS
Linux.
SEE ALSO
getdents(2), readdir(3)
158 - Linux cli command arm_fadvise64_64
NAME
arm_fadvise64_64 - predeclare an access pattern for file data
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h>
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
posix_fadvise():
_POSIX_C_SOURCE >= 200112L
DESCRIPTION
Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.
The advice applies to a (not necessarily existent) region starting at offset and extending for len bytes (or until the end of the file if len is 0) within the file referred to by fd. The advice is not binding; it merely constitutes an expectation on behalf of the application.
Permissible values for advice include:
POSIX_FADV_NORMAL
Indicates that the application has no advice to give about its access pattern for the specified data. If no advice is given for an open file, this is the default assumption.
POSIX_FADV_SEQUENTIAL
The application expects to access the specified data sequentially (with lower offsets read before higher ones).
POSIX_FADV_RANDOM
The specified data will be accessed in random order.
POSIX_FADV_NOREUSE
The specified data will be accessed only once.
Before Linux 2.6.18, POSIX_FADV_NOREUSE had the same semantics as POSIX_FADV_WILLNEED. This was probably a bug; since Linux 2.6.18, this flag is a no-op.
POSIX_FADV_WILLNEED
The specified data will be accessed in the near future.
POSIX_FADV_WILLNEED initiates a nonblocking read of the specified region into the page cache. The amount of data read may be decreased by the kernel depending on virtual memory load. (A few megabytes will usually be fully satisfied, and more is rarely useful.)
POSIX_FADV_DONTNEED
The specified data will not be accessed in the near future.
POSIX_FADV_DONTNEED attempts to free cached pages associated with the specified region. This is useful, for example, while streaming large files. A program may periodically request the kernel to free cached data that has already been used, so that more useful cached pages are not discarded instead.
Requests to discard partial pages are ignored. Preserving needed data is preferable to discarding unneeded data. If the application requires that data be considered for discarding, then offset and len must be page-aligned.
The implementation may attempt to write back dirty pages in the specified region, but this is not guaranteed. Any unwritten dirty pages will not be freed. If the application wishes to ensure that dirty pages will be released, it should call fsync(2) or fdatasync(2) first.
RETURN VALUE
On success, zero is returned. On error, an error number is returned.
ERRORS
EBADF
The fd argument was not a valid file descriptor.
EINVAL
An invalid value was specified for advice.
ESPIPE
The specified file descriptor refers to a pipe or FIFO. (ESPIPE is the error specified by POSIX, but before Linux 2.6.16, Linux returned EINVAL in this case.)
VERSIONS
Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely. These changes affect the entire file, not just the specified region (but other open file handles to the same file are unaffected).
C library/kernel differences
The name of the wrapper function in the C library is posix_fadvise(). The underlying system call is called fadvise64() (or, on some architectures, fadvise64_64()); the difference between the two is that the former system call assumes that the type of the len argument is size_t, while the latter expects loff_t there.
Architecture-specific variants
Some architectures require 64-bit arguments to be aligned in a suitable pair of registers (see syscall(2) for further detail). On such architectures, the call signature of posix_fadvise() shown in the SYNOPSIS would force a register to be wasted as padding between the fd and offset arguments. Therefore, these architectures define a version of the system call that orders the arguments suitably, but is otherwise exactly the same as posix_fadvise().
For example, since Linux 2.6.14, ARM has the following system call:
long arm_fadvise64_64(int fd, int advice,
loff_t offset, loff_t len);
These architecture-specific details are generally hidden from applications by the glibc posix_fadvise() wrapper function, which invokes the appropriate architecture-specific system call.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
Kernel support first appeared in Linux 2.5.60; the underlying system call is called fadvise64(). Library support has been provided since glibc 2.2, via the wrapper function posix_fadvise().
Since Linux 3.18, support for the underlying system call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS configuration option.
The type of the len argument was changed from size_t to off_t in POSIX.1-2001 TC1.
NOTES
The contents of the kernel buffer cache can be cleared via the /proc/sys/vm/drop_caches interface described in proc(5).
One can obtain a snapshot of which pages of a file are resident in the buffer cache by opening a file, mapping it with mmap(2), and then applying mincore(2) to the mapping.
BUGS
Before Linux 2.6.6, if len was specified as 0, then this was interpreted literally as “zero bytes”, rather than as meaning “all bytes through to the end of the file”.
SEE ALSO
fincore(1), mincore(2), readahead(2), sync_file_range(2), posix_fallocate(3), posix_madvise(3)
159 - Linux cli command clone3
NAME
clone3 - create a child process
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
/* Prototype for the glibc wrapper function */
#define _GNU_SOURCE
#include <sched.h>
int clone(int (*fn)(void *_Nullable), void *stack, int flags,
          void *_Nullable arg, ... /* pid_t *_Nullable parent_tid,
                                      void *_Nullable tls,
                                      pid_t *_Nullable child_tid */ );
/* For the prototype of the raw clone() system call, see NOTES */
#include <linux/sched.h> /* Definition of struct clone_args */
#include <sched.h> /* Definition of CLONE_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_clone3, struct clone_args *cl_args, size_t size);
Note: glibc provides no wrapper for clone3(), necessitating the use of syscall(2).
DESCRIPTION
These system calls create a new (“child”) process, in a manner similar to fork(2).
By contrast with fork(2), these system calls provide more precise control over what pieces of execution context are shared between the calling process and the child process. For example, using these system calls, the caller can control whether or not the two processes share the virtual address space, the table of file descriptors, and the table of signal handlers. These system calls also allow the new child process to be placed in separate namespaces(7).
Note that in this manual page, “calling process” normally corresponds to “parent process”. But see the descriptions of CLONE_PARENT and CLONE_THREAD below.
This page describes the following interfaces:
- The glibc clone() wrapper function and the underlying system call on which it is based. The main text describes the wrapper function; the differences for the raw system call are described toward the end of this page.
- The newer clone3() system call.
In the remainder of this page, the terminology “the clone call” is used when noting details that apply to all of these interfaces.
The clone() wrapper function
When the child process is created with the clone() wrapper function, it commences execution by calling the function pointed to by the argument fn. (This differs from fork(2), where execution continues in the child from the point of the fork(2) call.) The arg argument is passed as the argument of the function fn.
When the fn(arg) function returns, the child process terminates. The integer returned by fn is the exit status for the child process. The child process may also terminate explicitly by calling exit(2) or after receiving a fatal signal.
The stack argument specifies the location of the stack used by the child process. Since the child and calling process may share memory, it is not possible for the child process to execute in the same stack as the calling process. The calling process must therefore set up memory space for the child stack and pass a pointer to this space to clone(). Stacks grow downward on all processors that run Linux (except the HP PA processors), so stack usually points to the topmost address of the memory space set up for the child stack. Note that clone() does not provide a means whereby the caller can inform the kernel of the size of the stack area.
The remaining arguments to clone() are discussed below.
clone3()
The clone3() system call provides a superset of the functionality of the older clone() interface. It also provides a number of API improvements, including: space for additional flags bits; cleaner separation in the use of various arguments; and the ability to specify the size of the child’s stack area.
As with fork(2), clone3() returns in both the parent and the child. It returns 0 in the child process and returns the PID of the child in the parent.
The cl_args argument of clone3() is a structure of the following form:
struct clone_args {
    u64 flags;        /* Flags bit mask */
    u64 pidfd;        /* Where to store PID file descriptor
                         (int *) */
    u64 child_tid;    /* Where to store child TID,
                         in child's memory (pid_t *) */
    u64 parent_tid;   /* Where to store child TID,
                         in parent's memory (pid_t *) */
    u64 exit_signal;  /* Signal to deliver to parent on
                         child termination */
    u64 stack;        /* Pointer to lowest byte of stack */
    u64 stack_size;   /* Size of stack */
    u64 tls;          /* Location of new TLS */
    u64 set_tid;      /* Pointer to a pid_t array
                         (since Linux 5.5) */
    u64 set_tid_size; /* Number of elements in set_tid
                         (since Linux 5.5) */
    u64 cgroup;       /* File descriptor for target cgroup
                         of child (since Linux 5.7) */
};
The size argument that is supplied to clone3() should be initialized to the size of this structure. (The existence of the size argument permits future extensions to the clone_args structure.)
The stack for the child process is specified via cl_args.stack, which points to the lowest byte of the stack area, and cl_args.stack_size, which specifies the size of the stack in bytes. In the case where the CLONE_VM flag (see below) is specified, a stack must be explicitly allocated and specified. Otherwise, these two fields can be specified as NULL and 0, which causes the child to use the same stack area as the parent (in the child’s own virtual address space).
The remaining fields in the cl_args argument are discussed below.
Equivalence between clone() and clone3() arguments
Unlike the older clone() interface, where arguments are passed individually, in the newer clone3() interface the arguments are packaged into the clone_args structure shown above. This structure allows for a superset of the information passed via the clone() arguments.
The following table shows the equivalence between the arguments of clone() and the fields in the clone_args argument supplied to clone3():
clone()         clone3()        Notes
                cl_args field
flags & ~0xff   flags           For most flags; details below
parent_tid      pidfd           See CLONE_PIDFD
child_tid       child_tid       See CLONE_CHILD_SETTID
parent_tid      parent_tid      See CLONE_PARENT_SETTID
flags & 0xff    exit_signal
stack           stack
---             stack_size
tls             tls             See CLONE_SETTLS
---             set_tid         See below for details
---             set_tid_size
---             cgroup          See CLONE_INTO_CGROUP
The child termination signal
When the child process terminates, a signal may be sent to the parent. The termination signal is specified in the low byte of flags (clone()) or in cl_args.exit_signal (clone3()). If this signal is specified as anything other than SIGCHLD, then the parent process must specify the __WALL or __WCLONE options when waiting for the child with wait(2). If no signal (i.e., zero) is specified, then the parent process is not signaled when the child terminates.
The set_tid array
By default, the kernel chooses the next sequential PID for the new process in each of the PID namespaces where it is present. When creating a process with clone3(), the set_tid array (available since Linux 5.5) can be used to select specific PIDs for the process in some or all of the PID namespaces where it is present. If the PID of the newly created process should be set only for the current PID namespace or in the newly created PID namespace (if flags contains CLONE_NEWPID) then the first element in the set_tid array has to be the desired PID and set_tid_size needs to be 1.
If the PID of the newly created process should have a certain value in multiple PID namespaces, then the set_tid array can have multiple entries. The first entry defines the PID in the most deeply nested PID namespace and each of the following entries contains the PID in the corresponding ancestor PID namespace. The number of PID namespaces in which a PID should be set is defined by set_tid_size which cannot be larger than the number of currently nested PID namespaces.
To create a process with the following PIDs in a PID namespace hierarchy:
PID NS level   Requested PID   Notes
0              31496           Outermost PID namespace
1              42
2              7               Innermost PID namespace
Set the array to:
set_tid[0] = 7;
set_tid[1] = 42;
set_tid[2] = 31496;
set_tid_size = 3;
If only the PIDs in the two innermost PID namespaces need to be specified, set the array to:
set_tid[0] = 7;
set_tid[1] = 42;
set_tid_size = 2;
The PID in the PID namespaces outside the two innermost PID namespaces is selected the same way as any other PID is selected.
The set_tid feature requires CAP_SYS_ADMIN or (since Linux 5.9) CAP_CHECKPOINT_RESTORE in all owning user namespaces of the target PID namespaces.
Callers may only choose a PID greater than 1 in a given PID namespace if an init process (i.e., a process with PID 1) already exists in that namespace. Otherwise the PID entry for this PID namespace must be 1.
The flags mask
Both clone() and clone3() allow a flags bit mask that modifies their behavior and allows the caller to specify what is shared between the calling process and the child process. This bit mask (the flags argument of clone() or the cl_args.flags field passed to clone3()) is referred to as the flags mask in the remainder of this page.
The flags mask is specified as a bitwise OR of zero or more of the constants listed below. Except as noted below, these flags are available (and have the same effect) in both clone() and clone3().
CLONE_CHILD_CLEARTID (since Linux 2.5.49)
Clear (zero) the child thread ID at the location pointed to by child_tid (clone()) or cl_args.child_tid (clone3()) in child memory when the child exits, and do a wakeup on the futex at that address. The address involved may be changed by the set_tid_address(2) system call. This is used by threading libraries.
CLONE_CHILD_SETTID (since Linux 2.5.49)
Store the child thread ID at the location pointed to by child_tid (clone()) or cl_args.child_tid (clone3()) in the child’s memory. The store operation completes before the clone call returns control to user space in the child process. (Note that the store operation may not have completed before the clone call returns in the parent process, which is relevant if the CLONE_VM flag is also employed.)
CLONE_CLEAR_SIGHAND (since Linux 5.5)
By default, signal dispositions in the child thread are the same as in the parent. If this flag is specified, then all signals that are handled in the parent (and not set to SIG_IGN) are reset to their default dispositions (SIG_DFL) in the child.
Specifying this flag together with CLONE_SIGHAND is nonsensical and disallowed.
CLONE_DETACHED (historical)
For a while (during the Linux 2.5 development series) there was a CLONE_DETACHED flag, which caused the parent not to receive a signal when the child terminated. Ultimately, the effect of this flag was subsumed under the CLONE_THREAD flag and by the time Linux 2.6.0 was released, this flag had no effect. Starting in Linux 2.6.2, the need to give this flag together with CLONE_THREAD disappeared.
This flag is still defined, but it is usually ignored when calling clone(). However, see the description of CLONE_PIDFD for some exceptions.
CLONE_FILES (since Linux 2.0)
If CLONE_FILES is set, the calling process and the child process share the same file descriptor table. Any file descriptor created by the calling process or by the child process is also valid in the other process. Similarly, if one of the processes closes a file descriptor, or changes its associated flags (using the fcntl(2) F_SETFD operation), the other process is also affected. If a process sharing a file descriptor table calls execve(2), its file descriptor table is duplicated (unshared).
If CLONE_FILES is not set, the child process inherits a copy of all file descriptors opened in the calling process at the time of the clone call. Subsequent operations that open or close file descriptors, or change file descriptor flags, performed by either the calling process or the child process do not affect the other process. Note, however, that the duplicated file descriptors in the child refer to the same open file descriptions as the corresponding file descriptors in the calling process, and thus share file offsets and file status flags (see open(2)).
CLONE_FS (since Linux 2.0)
If CLONE_FS is set, the caller and the child process share the same filesystem information. This includes the root of the filesystem, the current working directory, and the umask. Any call to chroot(2), chdir(2), or umask(2) performed by the calling process or the child process also affects the other process.
If CLONE_FS is not set, the child process works on a copy of the filesystem information of the calling process at the time of the clone call. Calls to chroot(2), chdir(2), or umask(2) performed later by one of the processes do not affect the other process.
CLONE_INTO_CGROUP (since Linux 5.7)
By default, a child process is placed in the same version 2 cgroup as its parent. The CLONE_INTO_CGROUP flag allows the child process to be created in a different version 2 cgroup. (Note that CLONE_INTO_CGROUP has effect only for version 2 cgroups.)
In order to place the child process in a different cgroup, the caller specifies CLONE_INTO_CGROUP in cl_args.flags and passes a file descriptor that refers to a version 2 cgroup in the cl_args.cgroup field. (This file descriptor can be obtained by opening a cgroup v2 directory using either the O_RDONLY or the O_PATH flag.) Note that all of the usual restrictions (described in cgroups(7)) on placing a process into a version 2 cgroup apply.
Among the possible use cases for CLONE_INTO_CGROUP are the following:
- Spawning a process into a cgroup different from the parent’s cgroup makes it possible for a service manager to directly spawn new services into dedicated cgroups. This eliminates the accounting jitter that would be caused if the child process was first created in the same cgroup as the parent and then moved into the target cgroup. Furthermore, spawning the child process directly into a target cgroup is significantly cheaper than moving the child process into the target cgroup after it has been created.
- The CLONE_INTO_CGROUP flag also allows the creation of frozen child processes by spawning them into a frozen cgroup. (See cgroups(7) for a description of the freezer controller.)
- For threaded applications (or even thread implementations which make use of cgroups to limit individual threads), it is possible to establish a fixed cgroup layout before spawning each thread directly into its target cgroup.
CLONE_IO (since Linux 2.6.25)
If CLONE_IO is set, then the new process shares an I/O context with the calling process. If this flag is not set, then (as with fork(2)) the new process has its own I/O context.
The I/O context is the I/O scope of the disk scheduler (i.e., what the I/O scheduler uses to model scheduling of a process’s I/O). If processes share the same I/O context, they are treated as one by the I/O scheduler. As a consequence, they get to share disk time. For some I/O schedulers, if two processes share an I/O context, they will be allowed to interleave their disk access. If several threads are doing I/O on behalf of the same process (aio_read(3), for instance), they should employ CLONE_IO to get better I/O performance.
If the kernel is not configured with the CONFIG_BLOCK option, this flag is a no-op.
CLONE_NEWCGROUP (since Linux 4.6)
Create the process in a new cgroup namespace. If this flag is not set, then (as with fork(2)) the process is created in the same cgroup namespaces as the calling process.
For further information on cgroup namespaces, see cgroup_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
CLONE_NEWIPC (since Linux 2.6.19)
If CLONE_NEWIPC is set, then create the process in a new IPC namespace. If this flag is not set, then (as with fork(2)), the process is created in the same IPC namespace as the calling process.
For further information on IPC namespaces, see ipc_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWIPC. This flag can’t be specified in conjunction with CLONE_SYSVSEM.
CLONE_NEWNET (since Linux 2.6.24)
(The implementation of this flag was completed only by about Linux 2.6.29.)
If CLONE_NEWNET is set, then create the process in a new network namespace. If this flag is not set, then (as with fork(2)) the process is created in the same network namespace as the calling process.
For further information on network namespaces, see network_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNET.
CLONE_NEWNS (since Linux 2.4.19)
If CLONE_NEWNS is set, the cloned child is started in a new mount namespace, initialized with a copy of the namespace of the parent. If CLONE_NEWNS is not set, the child lives in the same mount namespace as the parent.
For further information on mount namespaces, see namespaces(7) and mount_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNS. It is not permitted to specify both CLONE_NEWNS and CLONE_FS in the same clone call.
CLONE_NEWPID (since Linux 2.6.24)
If CLONE_NEWPID is set, then create the process in a new PID namespace. If this flag is not set, then (as with fork(2)) the process is created in the same PID namespace as the calling process.
For further information on PID namespaces, see namespaces(7) and pid_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWPID. This flag can’t be specified in conjunction with CLONE_THREAD.
CLONE_NEWUSER
(This flag first became meaningful for clone() in Linux 2.6.23, the current clone() semantics were merged in Linux 3.5, and the final pieces to make the user namespaces completely usable were merged in Linux 3.8.)
If CLONE_NEWUSER is set, then create the process in a new user namespace. If this flag is not set, then (as with fork(2)) the process is created in the same user namespace as the calling process.
For further information on user namespaces, see namespaces(7) and user_namespaces(7).
Before Linux 3.8, use of CLONE_NEWUSER required that the caller have three capabilities: CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID. Starting with Linux 3.8, no privileges are needed to create a user namespace.
This flag can’t be specified in conjunction with CLONE_THREAD or CLONE_PARENT. For security reasons, CLONE_NEWUSER cannot be specified in conjunction with CLONE_FS.
CLONE_NEWUTS (since Linux 2.6.19)
If CLONE_NEWUTS is set, then create the process in a new UTS namespace, whose identifiers are initialized by duplicating the identifiers from the UTS namespace of the calling process. If this flag is not set, then (as with fork(2)) the process is created in the same UTS namespace as the calling process.
For further information on UTS namespaces, see uts_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWUTS.
CLONE_PARENT (since Linux 2.3.12)
If CLONE_PARENT is set, then the parent of the new child (as returned by getppid(2)) will be the same as that of the calling process.
If CLONE_PARENT is not set, then (as with fork(2)) the child’s parent is the calling process.
Note that it is the parent process, as returned by getppid(2), which is signaled when the child terminates, so that if CLONE_PARENT is set, then the parent of the calling process, rather than the calling process itself, is signaled.
The CLONE_PARENT flag can’t be used in clone calls by the global init process (PID 1 in the initial PID namespace) and init processes in other PID namespaces. This restriction prevents the creation of multi-rooted process trees as well as the creation of unreapable zombies in the initial PID namespace.
CLONE_PARENT_SETTID (since Linux 2.5.49)
Store the child thread ID at the location pointed to by parent_tid (clone()) or cl_args.parent_tid (clone3()) in the parent’s memory. (In Linux 2.5.32-2.5.48 there was a flag CLONE_SETTID that did this.) The store operation completes before the clone call returns control to user space.
CLONE_PID (Linux 2.0 to Linux 2.5.15)
If CLONE_PID is set, the child process is created with the same process ID as the calling process. This is good for hacking the system, but otherwise of not much use. From Linux 2.3.21 onward, this flag could be specified only by the system boot process (PID 0). The flag disappeared completely from the kernel sources in Linux 2.5.16. Subsequently, the kernel silently ignored this bit if it was specified in the flags mask. Much later, the same bit was recycled for use as the CLONE_PIDFD flag.
CLONE_PIDFD (since Linux 5.2)
If this flag is specified, a PID file descriptor referring to the child process is allocated and placed at a specified location in the parent’s memory. The close-on-exec flag is set on this new file descriptor. PID file descriptors can be used for the purposes described in pidfd_open(2).
When using clone3(), the PID file descriptor is placed at the location pointed to by cl_args.pidfd.
When using clone(), the PID file descriptor is placed at the location pointed to by parent_tid. Since the parent_tid argument is used to return the PID file descriptor, CLONE_PIDFD cannot be used with CLONE_PARENT_SETTID when calling clone().
It is currently not possible to use this flag together with CLONE_THREAD. This means that the process identified by the PID file descriptor will always be a thread group leader.
If the obsolete CLONE_DETACHED flag is specified alongside CLONE_PIDFD when calling clone(), an error is returned. An error also results if CLONE_DETACHED is specified when calling clone3(). This error behavior ensures that the bit corresponding to CLONE_DETACHED can be reused for further PID file descriptor features in the future.
CLONE_PTRACE (since Linux 2.2)
If CLONE_PTRACE is specified, and the calling process is being traced, then trace the child also (see ptrace(2)).
CLONE_SETTLS (since Linux 2.5.32)
The TLS (Thread Local Storage) descriptor is set to tls.
The interpretation of tls and the resulting effect is architecture dependent. On x86, tls is interpreted as a struct user_desc * (see set_thread_area(2)). On x86-64 it is the new value to be set for the %fs base register (see the ARCH_SET_FS argument to arch_prctl(2)). On architectures with a dedicated TLS register, it is the new value of that register.
Use of this flag requires detailed knowledge and generally it should not be used except in libraries implementing threading.
CLONE_SIGHAND (since Linux 2.0)
If CLONE_SIGHAND is set, the calling process and the child process share the same table of signal handlers. If the calling process or child process calls sigaction(2) to change the behavior associated with a signal, the behavior is changed in the other process as well. However, the calling process and child processes still have distinct signal masks and sets of pending signals. So, one of them may block or unblock signals using sigprocmask(2) without affecting the other process.
If CLONE_SIGHAND is not set, the child process inherits a copy of the signal handlers of the calling process at the time of the clone call. Calls to sigaction(2) performed later by one of the processes have no effect on the other process.
Since Linux 2.6.0, the flags mask must also include CLONE_VM if CLONE_SIGHAND is specified.
CLONE_STOPPED (since Linux 2.6.0)
If CLONE_STOPPED is set, then the child is initially stopped (as though it was sent a SIGSTOP signal), and must be resumed by sending it a SIGCONT signal.
This flag was deprecated from Linux 2.6.25 onward, and was removed altogether in Linux 2.6.38. Since then, the kernel silently ignores it without error. Starting with Linux 4.6, the same bit was reused for the CLONE_NEWCGROUP flag.
CLONE_SYSVSEM (since Linux 2.5.10)
If CLONE_SYSVSEM is set, then the child and the calling process share a single list of System V semaphore adjustment (semadj) values (see semop(2)). In this case, the shared list accumulates semadj values across all processes sharing the list, and semaphore adjustments are performed only when the last process that is sharing the list terminates (or ceases sharing the list using unshare(2)). If this flag is not set, then the child has a separate semadj list that is initially empty.
CLONE_THREAD (since Linux 2.4.0)
If CLONE_THREAD is set, the child is placed in the same thread group as the calling process. To make the remainder of the discussion of CLONE_THREAD more readable, the term “thread” is used to refer to the processes within a thread group.
Thread groups were a feature added in Linux 2.4 to support the POSIX threads notion of a set of threads that share a single PID. Internally, this shared PID is the so-called thread group identifier (TGID) for the thread group. Since Linux 2.4, calls to getpid(2) return the TGID of the caller.
The threads within a group can be distinguished by their (system-wide) unique thread IDs (TID). A new thread’s TID is available as the function result returned to the caller, and a thread can obtain its own TID using gettid(2).
When a clone call is made without specifying CLONE_THREAD, then the resulting thread is placed in a new thread group whose TGID is the same as the thread’s TID. This thread is the leader of the new thread group.
A new thread created with CLONE_THREAD has the same parent process as the process that made the clone call (i.e., like CLONE_PARENT), so that calls to getppid(2) return the same value for all of the threads in a thread group. When a CLONE_THREAD thread terminates, the thread that created it is not sent a SIGCHLD (or other termination) signal; nor can the status of such a thread be obtained using wait(2). (The thread is said to be detached.)
After all of the threads in a thread group terminate, the parent process of the thread group is sent a SIGCHLD (or other termination) signal.
If any of the threads in a thread group performs an execve(2), then all threads other than the thread group leader are terminated, and the new program is executed in the thread group leader.
If one of the threads in a thread group creates a child using fork(2), then any thread in the group can wait(2) for that child.
Since Linux 2.5.35, the flags mask must also include CLONE_SIGHAND if CLONE_THREAD is specified (and note that, since Linux 2.6.0, CLONE_SIGHAND also requires CLONE_VM to be included).
Signal dispositions and actions are process-wide: if an unhandled signal is delivered to a thread, then it will affect (terminate, stop, continue, be ignored in) all members of the thread group.
Each thread has its own signal mask, as set by sigprocmask(2).
A signal may be process-directed or thread-directed. A process-directed signal is targeted at a thread group (i.e., a TGID), and is delivered to an arbitrarily selected thread from among those that are not blocking the signal. A signal may be process-directed because it was generated by the kernel for reasons other than a hardware exception, or because it was sent using kill(2) or sigqueue(3). A thread-directed signal is targeted at (i.e., delivered to) a specific thread. A signal may be thread directed because it was sent using tgkill(2) or pthread_sigqueue(3), or because the thread executed a machine language instruction that triggered a hardware exception (e.g., invalid memory access triggering SIGSEGV or a floating-point exception triggering SIGFPE).
A call to sigpending(2) returns a signal set that is the union of the pending process-directed signals and the signals that are pending for the calling thread.
If a process-directed signal is delivered to a thread group, and the thread group has installed a handler for the signal, then the handler is invoked in exactly one, arbitrarily selected member of the thread group that has not blocked the signal. If multiple threads in a group are waiting to accept the same signal using sigwaitinfo(2), the kernel will arbitrarily select one of these threads to receive the signal.
CLONE_UNTRACED (since Linux 2.5.46)
If CLONE_UNTRACED is specified, then a tracing process cannot force CLONE_PTRACE on this child process.
CLONE_VFORK (since Linux 2.2)
If CLONE_VFORK is set, the execution of the calling process is suspended until the child releases its virtual memory resources via a call to execve(2) or _exit(2) (as with vfork(2)).
If CLONE_VFORK is not set, then both the calling process and the child are schedulable after the call, and an application should not rely on execution occurring in any particular order.
CLONE_VM (since Linux 2.0)
If CLONE_VM is set, the calling process and the child process run in the same memory space. In particular, memory writes performed by the calling process or by the child process are also visible in the other process. Moreover, any memory mapping or unmapping performed with mmap(2) or munmap(2) by the child or calling process also affects the other process.
If CLONE_VM is not set, the child process runs in a separate copy of the memory space of the calling process at the time of the clone call. Memory writes or file mappings/unmappings performed by one of the processes do not affect the other, as with fork(2).
If the CLONE_VM flag is specified and the CLONE_VFORK flag is not specified, then any alternate signal stack that was established by sigaltstack(2) is cleared in the child process.
RETURN VALUE
On success, the thread ID of the child process is returned in the caller’s thread of execution. On failure, -1 is returned in the caller’s context, no child process is created, and errno is set to indicate the error.
ERRORS
EACCES (clone3() only)
CLONE_INTO_CGROUP was specified in cl_args.flags, but the restrictions (described in cgroups(7)) on placing the child process into the version 2 cgroup referred to by cl_args.cgroup are not met.
EAGAIN
Too many processes are already running; see fork(2).
EBUSY (clone3() only)
CLONE_INTO_CGROUP was specified in cl_args.flags, but the file descriptor specified in cl_args.cgroup refers to a version 2 cgroup in which a domain controller is enabled.
EEXIST (clone3() only)
One (or more) of the PIDs specified in set_tid already exists in the corresponding PID namespace.
EINVAL
Both CLONE_SIGHAND and CLONE_CLEAR_SIGHAND were specified in the flags mask.
EINVAL
CLONE_SIGHAND was specified in the flags mask, but CLONE_VM was not. (Since Linux 2.6.0.)
EINVAL
CLONE_THREAD was specified in the flags mask, but CLONE_SIGHAND was not. (Since Linux 2.5.35.)
EINVAL
CLONE_THREAD was specified in the flags mask, but the current process previously called unshare(2) with the CLONE_NEWPID flag or used setns(2) to reassociate itself with a PID namespace.
EINVAL
Both CLONE_FS and CLONE_NEWNS were specified in the flags mask.
EINVAL (since Linux 3.9)
Both CLONE_NEWUSER and CLONE_FS were specified in the flags mask.
EINVAL
Both CLONE_NEWIPC and CLONE_SYSVSEM were specified in the flags mask.
EINVAL
CLONE_NEWPID and one (or both) of CLONE_THREAD or CLONE_PARENT were specified in the flags mask.
EINVAL
CLONE_NEWUSER and CLONE_THREAD were specified in the flags mask.
EINVAL (since Linux 2.6.32)
CLONE_PARENT was specified, and the caller is an init process.
EINVAL
Returned by the glibc clone() wrapper function when fn or stack is specified as NULL.
EINVAL
CLONE_NEWIPC was specified in the flags mask, but the kernel was not configured with the CONFIG_SYSVIPC and CONFIG_IPC_NS options.
EINVAL
CLONE_NEWNET was specified in the flags mask, but the kernel was not configured with the CONFIG_NET_NS option.
EINVAL
CLONE_NEWPID was specified in the flags mask, but the kernel was not configured with the CONFIG_PID_NS option.
EINVAL
CLONE_NEWUSER was specified in the flags mask, but the kernel was not configured with the CONFIG_USER_NS option.
EINVAL
CLONE_NEWUTS was specified in the flags mask, but the kernel was not configured with the CONFIG_UTS_NS option.
EINVAL
stack is not aligned to a suitable boundary for this architecture. For example, on aarch64, stack must be a multiple of 16.
EINVAL (clone3() only)
CLONE_DETACHED was specified in the flags mask.
EINVAL (clone() only)
CLONE_PIDFD was specified together with CLONE_DETACHED in the flags mask.
EINVAL
CLONE_PIDFD was specified together with CLONE_THREAD in the flags mask.
EINVAL (clone() only)
CLONE_PIDFD was specified together with CLONE_PARENT_SETTID in the flags mask.
EINVAL (clone3() only)
set_tid_size is greater than the number of nested PID namespaces.
EINVAL (clone3() only)
One of the PIDs specified in set_tid was invalid.
EINVAL (clone3() only)
CLONE_THREAD or CLONE_PARENT was specified in the flags mask, but a signal was specified in exit_signal.
EINVAL (AArch64 only, Linux 4.6 and earlier)
stack was not aligned to a 128-bit boundary.
ENOMEM
Cannot allocate sufficient memory to allocate a task structure for the child, or to copy those parts of the caller’s context that need to be copied.
ENOSPC (since Linux 3.7)
CLONE_NEWPID was specified in the flags mask, but the limit on the nesting depth of PID namespaces would have been exceeded; see pid_namespaces(7).
ENOSPC (since Linux 4.9; beforehand EUSERS)
CLONE_NEWUSER was specified in the flags mask, and the call would cause the limit on the number of nested user namespaces to be exceeded. See user_namespaces(7).
From Linux 3.11 to Linux 4.8, the error diagnosed in this case was EUSERS.
ENOSPC (since Linux 4.9)
One of the values in the flags mask specified the creation of a new user namespace, but doing so would have caused the limit defined by the corresponding file in /proc/sys/user to be exceeded. For further details, see namespaces(7).
EOPNOTSUPP (clone3() only)
CLONE_INTO_CGROUP was specified in cl_args.flags, but the file descriptor specified in cl_args.cgroup refers to a version 2 cgroup that is in the domain invalid state.
EPERM
CLONE_NEWCGROUP, CLONE_NEWIPC, CLONE_NEWNET, CLONE_NEWNS, CLONE_NEWPID, or CLONE_NEWUTS was specified by an unprivileged process (process without CAP_SYS_ADMIN).
EPERM
CLONE_PID was specified by a process other than process 0. (This error occurs only on Linux 2.5.15 and earlier.)
EPERM
CLONE_NEWUSER was specified in the flags mask, but either the effective user ID or the effective group ID of the caller does not have a mapping in the parent namespace (see user_namespaces(7)).
EPERM (since Linux 3.9)
CLONE_NEWUSER was specified in the flags mask and the caller is in a chroot environment (i.e., the caller’s root directory does not match the root directory of the mount namespace in which it resides).
EPERM (clone3() only)
set_tid_size was greater than zero, and the caller lacks the CAP_SYS_ADMIN capability in one or more of the user namespaces that own the corresponding PID namespaces.
ERESTARTNOINTR (since Linux 2.6.17)
System call was interrupted by a signal and will be restarted. (This can be seen only during a trace.)
EUSERS (Linux 3.11 to Linux 4.8)
CLONE_NEWUSER was specified in the flags mask, and the limit on the number of nested user namespaces would be exceeded. See the discussion of the ENOSPC error above.
VERSIONS
The glibc clone() wrapper function makes some changes in the memory pointed to by stack (changes required to set the stack up correctly for the child) before invoking the clone() system call. So, in cases where clone() is used to recursively create children, do not use the buffer employed for the parent’s stack as the stack of the child.
On i386, clone() should not be called through vsyscall, but directly through int $0x80.
C library/kernel differences
The raw clone() system call corresponds more closely to fork(2) in that execution in the child continues from the point of the call. As such, the fn and arg arguments of the clone() wrapper function are omitted.
In contrast to the glibc wrapper, the raw clone() system call accepts NULL as a stack argument (and clone3() likewise allows cl_args.stack to be NULL). In this case, the child uses a duplicate of the parent’s stack. (Copy-on-write semantics ensure that the child gets separate copies of stack pages when either process modifies the stack.) In this case, for correct operation, the CLONE_VM option should not be specified. (If the child shares the parent’s memory because of the use of the CLONE_VM flag, then no copy-on-write duplication occurs and chaos is likely to result.)
The order of the arguments also differs in the raw system call, and there are variations in the arguments across architectures, as detailed in the following paragraphs.
The raw system call interface on x86-64 and some other architectures (including sh, tile, and alpha) is:
long clone(unsigned long flags, void *stack,
int *parent_tid, int *child_tid,
unsigned long tls);
On x86-32, and several other common architectures (including score, ARM, ARM 64, PA-RISC, arc, Power PC, xtensa, and MIPS), the order of the last two arguments is reversed:
long clone(unsigned long flags, void *stack,
int *parent_tid, unsigned long tls,
int *child_tid);
On the cris and s390 architectures, the order of the first two arguments is reversed:
long clone(void *stack, unsigned long flags,
int *parent_tid, int *child_tid,
unsigned long tls);
On the microblaze architecture, an additional argument is supplied:
long clone(unsigned long flags, void *stack,
int stack_size, /* Size of stack */
int *parent_tid, int *child_tid,
unsigned long tls);
blackfin, m68k, and sparc
The argument-passing conventions on blackfin, m68k, and sparc are different from the descriptions above. For details, see the kernel (and glibc) source.
ia64
On ia64, a different interface is used:
int __clone2(int (*fn)(void *),
void *stack_base, size_t stack_size,
int flags, void *arg, ...
/* pid_t *parent_tid, struct user_desc *tls,
pid_t *child_tid */ );
The prototype shown above is for the glibc wrapper function; for the system call itself, the prototype can be described as follows (it is identical to the clone() prototype on microblaze):
long clone2(unsigned long flags, void *stack_base,
int stack_size, /* Size of stack */
int *parent_tid, int *child_tid,
unsigned long tls);
__clone2() operates in the same way as clone(), except that stack_base points to the lowest address of the child’s stack area, and stack_size specifies the size of the stack pointed to by stack_base.
STANDARDS
Linux.
HISTORY
clone3()
Linux 5.3.
Linux 2.4 and earlier
In the Linux 2.4.x series, CLONE_THREAD generally does not make the parent of the new thread the same as the parent of the calling process. However, from Linux 2.4.7 to Linux 2.4.18 the CLONE_THREAD flag implied the CLONE_PARENT flag (as in Linux 2.6.0 and later).
In Linux 2.4 and earlier, clone() does not take arguments parent_tid, tls, and child_tid.
NOTES
One use of these system calls is to implement threads: multiple flows of control in a program that run concurrently in a shared address space.
The kcmp(2) system call can be used to test whether two processes share various resources such as a file descriptor table, System V semaphore undo operations, or a virtual address space.
Handlers registered using pthread_atfork(3) are not executed during a clone call.
BUGS
GNU C library versions 2.3.4 up to and including 2.24 contained a wrapper function for getpid(2) that performed caching of PIDs. This caching relied on support in the glibc wrapper for clone(), but limitations in the implementation meant that the cache was not up to date in some circumstances. In particular, if a signal was delivered to the child immediately after the clone() call, then a call to getpid(2) in a handler for the signal could return the PID of the calling process (“the parent”), if the clone wrapper had not yet had a chance to update the PID cache in the child. (This discussion ignores the case where the child was created using CLONE_THREAD, when getpid(2) should return the same value in the child and in the process that called clone(), since the caller and the child are in the same thread group. The stale-cache problem also does not occur if the flags argument includes CLONE_VM.) To get the truth, it was sometimes necessary to use code such as the following:
#include <syscall.h>
pid_t mypid;
mypid = syscall(SYS_getpid);
Because of the stale-cache problem, as well as other problems noted in getpid(2), the PID caching feature was removed in glibc 2.25.
EXAMPLES
The following program demonstrates the use of clone() to create a child process that executes in a separate UTS namespace. The child changes the hostname in its UTS namespace. Both parent and child then display the system hostname, making it possible to see that the hostname differs in the UTS namespaces of the parent and child. For an example of the use of this program, see setns(2).
Within the sample program, we allocate the memory that is to be used for the child’s stack using mmap(2) rather than malloc(3) for the following reasons:
mmap(2) allocates a block of memory that starts on a page boundary and is a multiple of the page size. This is useful if we want to establish a guard page (a page with protection PROT_NONE) at the end of the stack using mprotect(2).
We can specify the MAP_STACK flag to request a mapping that is suitable for a stack. For the moment, this flag is a no-op on Linux, but it exists and has effect on some other systems, so we should include it for portability.
Program source
#define _GNU_SOURCE
#include <err.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>
static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct utsname uts;

    /* Change hostname in UTS namespace of child. */

    if (sethostname(arg, strlen(arg)) == -1)
        err(EXIT_FAILURE, "sethostname");

    /* Retrieve and display hostname. */

    if (uname(&uts) == -1)
        err(EXIT_FAILURE, "uname");
    printf("uts.nodename in child: %s\n", uts.nodename);

    /* Keep the namespace open for a while, by sleeping.
       This allows some experimentation--for example, another
       process might join the namespace. */

    sleep(200);

    return 0;           /* Child terminates now */
}

#define STACK_SIZE (1024 * 1024)    /* Stack size for cloned child */

int
main(int argc, char *argv[])
{
    char            *stack;         /* Start of stack buffer */
    char            *stackTop;      /* End of stack buffer */
    pid_t           pid;
    struct utsname  uts;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s
SEE ALSO
fork(2), futex(2), getpid(2), gettid(2), kcmp(2), mmap(2), pidfd_open(2), set_thread_area(2), set_tid_address(2), setns(2), tkill(2), unshare(2), wait(2), capabilities(7), namespaces(7), pthreads(7)
160 - Linux cli command inb_p
NAME
inb_p - port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width, the w-suffix functions word-width (16 bits), and the l-suffix functions long-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
Use ioperm(2) or, alternatively, iopl(2) to tell the kernel to allow the user-space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
161 - Linux cli command rt_sigqueueinfo
NAME
rt_sigqueueinfo - queue a signal and data
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/signal.h> /* Definition of SI_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_rt_sigqueueinfo, pid_t tgid,
int sig, siginfo_t *info);
int syscall(SYS_rt_tgsigqueueinfo, pid_t tgid, pid_t tid,
int sig, siginfo_t *info);
Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION
The rt_sigqueueinfo() and rt_tgsigqueueinfo() system calls are the low-level interfaces used to send a signal plus data to a process or thread. The receiver of the signal can obtain the accompanying data by establishing a signal handler with the sigaction(2) SA_SIGINFO flag.
These system calls are not intended for direct application use; they are provided to allow the implementation of sigqueue(3) and pthread_sigqueue(3).
The rt_sigqueueinfo() system call sends the signal sig to the thread group with the ID tgid. (The term “thread group” is synonymous with “process”, and tgid corresponds to the traditional UNIX process ID.) The signal will be delivered to an arbitrary member of the thread group (i.e., one of the threads that is not currently blocking the signal).
The info argument specifies the data to accompany the signal. This argument is a pointer to a structure of type siginfo_t, described in sigaction(2) (and defined by including <signal.h>). The caller should set the following fields in this structure:
si_code
This should be one of the SI_* codes in the Linux kernel source file include/asm-generic/siginfo.h. If the signal is being sent to any process other than the caller itself, the following restrictions apply:
The code can’t be a value greater than or equal to zero. In particular, it can’t be SI_USER, which is used by the kernel to indicate a signal sent by kill(2), nor can it be SI_KERNEL, which is used to indicate a signal generated by the kernel.
The code can’t (since Linux 2.6.39) be SI_TKILL, which is used by the kernel to indicate a signal sent using tgkill(2).
si_pid
This should be set to a process ID, typically the process ID of the sender.
si_uid
This should be set to a user ID, typically the real user ID of the sender.
si_value
This field contains the user data to accompany the signal. For more information, see the description of the last (union sigval) argument of sigqueue(3).
Internally, the kernel sets the si_signo field to the value specified in sig, so that the receiver of the signal can also obtain the signal number via that field.
The rt_tgsigqueueinfo() system call is like rt_sigqueueinfo(), but sends the signal and data to the single thread specified by the combination of tgid, a thread group ID, and tid, a thread in that thread group.
RETURN VALUE
On success, these system calls return 0. On error, they return -1 and errno is set to indicate the error.
ERRORS
EAGAIN
The limit of signals which may be queued has been reached. (See signal(7) for further information.)
EINVAL
sig, tgid, or tid was invalid.
EPERM
The caller does not have permission to send the signal to the target. For the required permissions, see kill(2).
EPERM
tgid specifies a process other than the caller and info->si_code is invalid.
ESRCH
rt_sigqueueinfo(): No thread group matching tgid was found.
rt_tgsigqueueinfo(): No thread matching tgid and tid was found.
STANDARDS
Linux.
HISTORY
rt_sigqueueinfo()
Linux 2.2.
rt_tgsigqueueinfo()
Linux 2.6.31.
NOTES
Since these system calls are not intended for application use, there are no glibc wrapper functions; use syscall(2) in the unlikely case that you want to call them directly.
As with kill(2), the null signal (0) can be used to check if the specified process or thread exists.
SEE ALSO
kill(2), pidfd_send_signal(2), sigaction(2), sigprocmask(2), tgkill(2), pthread_sigqueue(3), sigqueue(3), signal(7)
162 - Linux cli command landlock_create_ruleset
NAME landlock_create_ruleset
create a new Landlock ruleset
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/landlock.h> /* Definition of LANDLOCK_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
int syscall(SYS_landlock_create_ruleset,
            const struct landlock_ruleset_attr *attr,
            size_t size, uint32_t flags);
DESCRIPTION
A Landlock ruleset identifies a set of rules (i.e., actions on objects). This landlock_create_ruleset() system call enables creating a new file descriptor identifying a ruleset. This file descriptor can then be used by landlock_add_rule(2) and landlock_restrict_self(2). See landlock(7) for a global overview.
attr specifies the properties of the new ruleset. It points to the following structure:
struct landlock_ruleset_attr {
    __u64 handled_access_fs;
};
handled_access_fs is a bitmask of actions that is handled by this ruleset and should then be forbidden if no rule explicitly allows them (see Filesystem actions in landlock(7)). This enables simply restricting ambient rights (e.g., global filesystem access) and is needed for compatibility reasons.
size must be specified as sizeof(struct landlock_ruleset_attr) for compatibility reasons.
flags must be 0 if attr is used. Otherwise, flags can be set to:
LANDLOCK_CREATE_RULESET_VERSION
If attr is NULL and size is 0, then the returned value is the highest supported Landlock ABI version (starting at 1). This version can be used for a best-effort security approach, which is encouraged when user space is not pinned to a specific kernel version. All features documented in these man pages are available with the version 1.
RETURN VALUE
On success, landlock_create_ruleset() returns a new Landlock ruleset file descriptor, or a Landlock ABI version, according to flags.
ERRORS
landlock_create_ruleset() can fail for the following reasons:
EOPNOTSUPP
Landlock is supported by the kernel but disabled at boot time.
EINVAL
Unknown flags, or unknown access, or too small size.
E2BIG
size is too big.
EFAULT
attr was not a valid address.
ENOMSG
Empty accesses (i.e., attr->handled_access_fs is 0).
STANDARDS
Linux.
HISTORY
Linux 5.13.
EXAMPLES
See landlock(7).
SEE ALSO
landlock_add_rule(2), landlock_restrict_self(2), landlock(7)
163 - Linux cli command fchdir
NAME fchdir
change working directory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int chdir(const char *path);
int fchdir(int fd);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchdir():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| /* glibc up to and including 2.19: */ _BSD_SOURCE
DESCRIPTION
chdir() changes the current working directory of the calling process to the directory specified in path.
fchdir() is identical to chdir(); the only difference is that the directory is given as an open file descriptor.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, other errors can be returned. The more general errors for chdir() are listed below:
EACCES
Search permission is denied for one of the components of path. (See also path_resolution(7).)
EFAULT
path points outside your accessible address space.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving path.
ENAMETOOLONG
path is too long.
ENOENT
The directory specified in path does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of path is not a directory.
The general errors for fchdir() are listed below:
EACCES
Search permission was denied on the directory open on fd.
EBADF
fd is not a valid file descriptor.
ENOTDIR
fd does not refer to a directory.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD.
NOTES
The current working directory is the starting point for interpreting relative pathnames (those not starting with ‘/’).
A child process created via fork(2) inherits its parent’s current working directory. The current working directory is left unchanged by execve(2).
SEE ALSO
chroot(2), getcwd(3), path_resolution(7)
164 - Linux cli command ioctl_iflags
NAME ioctl_iflags
ioctl() operations for inode flags
DESCRIPTION
Various Linux filesystems support the notion of inode flags: attributes that modify the semantics of files and directories. These flags can be retrieved and modified using two ioctl(2) operations:
int attr;
fd = open("pathname", ...);
ioctl(fd, FS_IOC_GETFLAGS, &attr);  /* Place current flags in 'attr' */
attr |= FS_NOATIME_FL;              /* Tweak returned bit mask */
ioctl(fd, FS_IOC_SETFLAGS, &attr);  /* Update flags for inode referred to by 'fd' */
The lsattr(1) and chattr(1) shell commands provide interfaces to these two operations, allowing a user to view and modify the inode flags associated with a file.
The following flags are supported (shown along with the corresponding letter used to indicate the flag by lsattr(1) and chattr(1)):
FS_APPEND_FL ‘a’
The file can be opened only with the O_APPEND flag. (This restriction applies even to the superuser.) Only a privileged process (CAP_LINUX_IMMUTABLE) can set or clear this attribute.
FS_COMPR_FL ‘c’
Store the file in a compressed format on disk. This flag is not supported by most of the mainstream filesystem implementations; one exception is btrfs(5).
FS_DIRSYNC_FL ‘D’ (since Linux 2.6.0)
Write directory changes synchronously to disk. This flag provides semantics equivalent to the mount(2) MS_DIRSYNC option, but on a per-directory basis. This flag can be applied only to directories.
FS_IMMUTABLE_FL ‘i’
The file is immutable: no changes are permitted to the file contents or metadata (permissions, timestamps, ownership, link count, and so on). (This restriction applies even to the superuser.) Only a privileged process (CAP_LINUX_IMMUTABLE) can set or clear this attribute.
FS_JOURNAL_DATA_FL ‘j’
Enable journaling of file data on ext3(5) and ext4(5) filesystems. On a filesystem that is journaling in ordered or writeback mode, a privileged (CAP_SYS_RESOURCE) process can set this flag to enable journaling of data updates on a per-file basis.
FS_NOATIME_FL ‘A’
Don’t update the file last access time when the file is accessed. This can provide I/O performance benefits for applications that do not care about the accuracy of this timestamp. This flag provides functionality similar to the mount(2) MS_NOATIME flag, but on a per-file basis.
FS_NOCOW_FL ‘C’ (since Linux 2.6.39)
The file will not be subject to copy-on-write updates. This flag has an effect only on filesystems that support copy-on-write semantics, such as Btrfs. See chattr(1) and btrfs(5).
FS_NODUMP_FL ‘d’
Don’t include this file in backups made using dump(8).
FS_NOTAIL_FL ‘t’
This flag is supported only on Reiserfs. It disables the Reiserfs tail-packing feature, which tries to pack small files (and the final fragment of larger files) into the same disk block as the file metadata.
FS_PROJINHERIT_FL ‘P’ (since Linux 4.5)
Inherit the quota project ID. Files and subdirectories will inherit the project ID of the directory. This flag can be applied only to directories.
FS_SECRM_FL ‘s’
Mark the file for secure deletion. This feature is not implemented by any filesystem, since the task of securely erasing a file from a recording medium is surprisingly difficult.
FS_SYNC_FL ‘S’
Make file updates synchronous. For files, this makes all writes synchronous (as though all opens of the file were with the O_SYNC flag). For directories, this has the same effect as the FS_DIRSYNC_FL flag.
FS_TOPDIR_FL ‘T’
Mark a directory for special treatment under the Orlov block-allocation strategy. See chattr(1) for details. This flag can be applied only to directories and has an effect only for ext2, ext3, and ext4.
FS_UNRM_FL ‘u’
Allow the file to be undeleted if it is deleted. This feature is not implemented by any filesystem, since it is possible to implement file-recovery mechanisms outside the kernel.
In most cases, when any of the above flags is set on a directory, the flag is inherited by files and subdirectories created inside that directory. Exceptions include FS_TOPDIR_FL, which is not inheritable, and FS_DIRSYNC_FL, which is inherited only by subdirectories.
STANDARDS
Linux.
NOTES
In order to change the inode flags of a file using the FS_IOC_SETFLAGS operation, the effective user ID of the caller must match the owner of the file, or the caller must have the CAP_FOWNER capability.
The type of the argument given to the FS_IOC_GETFLAGS and FS_IOC_SETFLAGS operations is int *, notwithstanding the implication in the kernel source file include/uapi/linux/fs.h that the argument is long *.
SEE ALSO
chattr(1), lsattr(1), mount(2), btrfs(5), ext4(5), xfs(5), xattr(7), mount(8)
165 - Linux cli command fork
NAME fork
create a child process
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
pid_t fork(void);
DESCRIPTION
fork() creates a new process by duplicating the calling process. The new process is referred to as the child process. The calling process is referred to as the parent process.
The child process and the parent process run in separate memory spaces. At the time of fork() both memory spaces have the same content. Memory writes, file mappings (mmap(2)), and unmappings (munmap(2)) performed by one of the processes do not affect the other.
The child process is an exact duplicate of the parent process except for the following points:
The child has its own unique process ID, and this PID does not match the ID of any existing process group (setpgid(2)) or session.
The child’s parent process ID is the same as the parent’s process ID.
The child does not inherit its parent’s memory locks (mlock(2), mlockall(2)).
Process resource utilizations (getrusage(2)) and CPU time counters (times(2)) are reset to zero in the child.
The child’s set of pending signals is initially empty (sigpending(2)).
The child does not inherit semaphore adjustments from its parent (semop(2)).
The child does not inherit process-associated record locks from its parent (fcntl(2)). (On the other hand, it does inherit fcntl(2) open file description locks and flock(2) locks from its parent.)
The child does not inherit timers from its parent (setitimer(2), alarm(2), timer_create(2)).
The child does not inherit outstanding asynchronous I/O operations from its parent (aio_read(3), aio_write(3)), nor does it inherit any asynchronous I/O contexts from its parent (see io_setup(2)).
The process attributes in the preceding list are all specified in POSIX.1. The parent and child also differ with respect to the following Linux-specific process attributes:
The child does not inherit directory change notifications (dnotify) from its parent (see the description of F_NOTIFY in fcntl(2)).
The prctl(2) PR_SET_PDEATHSIG setting is reset so that the child does not receive a signal when its parent terminates.
The default timer slack value is set to the parent’s current timer slack value. See the description of PR_SET_TIMERSLACK in prctl(2).
Memory mappings that have been marked with the madvise(2) MADV_DONTFORK flag are not inherited across a fork().
Memory in address ranges that have been marked with the madvise(2) MADV_WIPEONFORK flag is zeroed in the child after a fork(). (The MADV_WIPEONFORK setting remains in place for those address ranges in the child.)
The termination signal of the child is always SIGCHLD (see clone(2)).
The port access permission bits set by ioperm(2) are not inherited by the child; the child must turn on any bits that it requires using ioperm(2).
Note the following further points:
The child process is created with a single thread: the one that called fork(). The entire virtual address space of the parent is replicated in the child, including the states of mutexes, condition variables, and other pthreads objects; the use of pthread_atfork(3) may be helpful for dealing with problems that this can cause.
After a fork() in a multithreaded program, the child can safely call only async-signal-safe functions (see signal-safety(7)) until such time as it calls execve(2).
The child inherits copies of the parent’s set of open file descriptors. Each file descriptor in the child refers to the same open file description (see open(2)) as the corresponding file descriptor in the parent. This means that the two file descriptors share open file status flags, file offset, and signal-driven I/O attributes (see the description of F_SETOWN and F_SETSIG in fcntl(2)).
The child inherits copies of the parent’s set of open message queue descriptors (see mq_overview(7)). Each file descriptor in the child refers to the same open message queue description as the corresponding file descriptor in the parent. This means that the two file descriptors share the same flags (mq_flags).
The child inherits copies of the parent’s set of open directory streams (see opendir(3)). POSIX.1 says that the corresponding directory streams in the parent and child may share the directory stream positioning; on Linux/glibc they do not.
RETURN VALUE
On success, the PID of the child process is returned in the parent, and 0 is returned in the child. On failure, -1 is returned in the parent, no child process is created, and errno is set to indicate the error.
ERRORS
EAGAIN
A system-imposed limit on the number of threads was encountered. There are a number of limits that may trigger this error:
the RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which limits the number of processes and threads for a real user ID, was reached;
the kernel’s system-wide limit on the number of processes and threads, /proc/sys/kernel/threads-max, was reached (see proc(5));
the maximum number of PIDs, /proc/sys/kernel/pid_max, was reached (see proc(5)); or
the PID limit (pids.max) imposed by the cgroup “process number” (PIDs) controller was reached.
EAGAIN
The caller is operating under the SCHED_DEADLINE scheduling policy and does not have the reset-on-fork flag set. See sched(7).
ENOMEM
fork() failed to allocate the necessary kernel structures because memory is tight.
ENOMEM
An attempt was made to create a child process in a PID namespace whose “init” process has terminated. See pid_namespaces(7).
ENOSYS
fork() is not supported on this platform (for example, hardware without a Memory-Management Unit).
ERESTARTNOINTR (since Linux 2.6.17)
System call was interrupted by a signal and will be restarted. (This can be seen only during a trace.)
VERSIONS
C library/kernel differences
Since glibc 2.3.3, rather than invoking the kernel’s fork() system call, the glibc fork() wrapper that is provided as part of the NPTL threading implementation invokes clone(2) with flags that provide the same effect as the traditional system call. (A call to fork() is equivalent to a call to clone(2) specifying flags as just SIGCHLD.) The glibc wrapper invokes any fork handlers that have been established using pthread_atfork(3).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
NOTES
Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent’s page tables, and to create a unique task structure for the child.
EXAMPLES
See pipe(2) and wait(2) for more examples.
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
    pid_t pid;

    if (signal(SIGCHLD, SIG_IGN) == SIG_ERR) {
        perror("signal");
        exit(EXIT_FAILURE);
    }
    pid = fork();
    switch (pid) {
    case -1:
        perror("fork");
        exit(EXIT_FAILURE);
    case 0:
        puts("Child exiting.");
        exit(EXIT_SUCCESS);
    default:
        printf("Child is PID %jd\n", (intmax_t) pid);
        puts("Parent exiting.");
        exit(EXIT_SUCCESS);
    }
}
SEE ALSO
clone(2), execve(2), exit(2), setrlimit(2), unshare(2), vfork(2), wait(2), daemon(3), pthread_atfork(3), capabilities(7), credentials(7)
166 - Linux cli command sigsuspend
NAME sigsuspend
wait for a signal
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
int sigsuspend(const sigset_t *mask);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigsuspend():
_POSIX_C_SOURCE
DESCRIPTION
sigsuspend() temporarily replaces the signal mask of the calling thread with the mask given by mask and then suspends the thread until delivery of a signal whose action is to invoke a signal handler or to terminate a process.
If the signal terminates the process, then sigsuspend() does not return. If the signal is caught, then sigsuspend() returns after the signal handler returns, and the signal mask is restored to the state before the call to sigsuspend().
It is not possible to block SIGKILL or SIGSTOP; specifying these signals in mask has no effect on the thread’s signal mask.
RETURN VALUE
sigsuspend() always returns -1, with errno set to indicate the error (normally, EINTR).
ERRORS
EFAULT
mask points to memory which is not a valid part of the process address space.
EINTR
The call was interrupted by a signal; see signal(7).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
C library/kernel differences
The original Linux system call was named sigsuspend(). However, with the addition of real-time signals in Linux 2.2, the fixed-size, 32-bit sigset_t type supported by that system call was no longer fit for purpose. Consequently, a new system call, rt_sigsuspend(), was added to support an enlarged sigset_t type. The new system call takes a second argument, size_t sigsetsize, which specifies the size in bytes of the signal set in mask. This argument is currently required to have the value sizeof(sigset_t) (or the error EINVAL results). The glibc sigsuspend() wrapper function hides these details from us, transparently calling rt_sigsuspend() when the kernel provides it.
NOTES
Normally, sigsuspend() is used in conjunction with sigprocmask(2) in order to prevent delivery of a signal during the execution of a critical code section. The caller first blocks the signals with sigprocmask(2). When the critical code has completed, the caller then waits for the signals by calling sigsuspend() with the signal mask that was returned by sigprocmask(2) (in the oldset argument).
See sigsetops(3) for details on manipulating signal sets.
SEE ALSO
kill(2), pause(2), sigaction(2), signal(2), sigprocmask(2), sigwaitinfo(2), sigsetops(3), sigwait(3), signal(7)
167 - Linux cli command lstat
NAME lstat
get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but, in the case of stat(), fstatat(), and lstat(), execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    struct stat sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    printf("ID of containing device:  [%x,%x]\n",
           major(sb.st_dev),
           minor(sb.st_dev));

    printf("File type:                ");

    switch (sb.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");      break;
    case S_IFCHR:  printf("character device\n");  break;
    case S_IFDIR:  printf("directory\n");         break;
    case S_IFIFO:  printf("FIFO/pipe\n");         break;
    case S_IFLNK:  printf("symlink\n");           break;
    case S_IFREG:  printf("regular file\n");      break;
    case S_IFSOCK: printf("socket\n");            break;
    default:       printf("unknown?\n");          break;
    }

    printf("I-node number:            %ju\n", (uintmax_t) sb.st_ino);

    printf("Mode:                     %jo (octal)\n",
           (uintmax_t) sb.st_mode);

    printf("Link count:               %ju\n", (uintmax_t) sb.st_nlink);
    printf("Ownership:                UID=%ju   GID=%ju\n",
           (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);

    printf("Preferred I/O block size: %jd bytes\n",
           (intmax_t) sb.st_blksize);
    printf("File size:                %jd bytes\n",
           (intmax_t) sb.st_size);
    printf("Blocks allocated:         %jd\n",
           (intmax_t) sb.st_blocks);

    printf("Last status change:       %s", ctime(&sb.st_ctime));
    printf("Last file access:         %s", ctime(&sb.st_atime));
    printf("Last file modification:   %s", ctime(&sb.st_mtime));

    exit(EXIT_SUCCESS);
}
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
168 - Linux cli command create_module
NAME 🖥️ create_module 🖥️
create a loadable module entry
SYNOPSIS
#include <linux/module.h>
[[deprecated]] caddr_t create_module(const char *name, size_t size);
DESCRIPTION
Note: This system call is present only before Linux 2.6.
create_module() attempts to create a loadable module entry and reserve the kernel memory that will be needed to hold the module. This system call requires privilege.
RETURN VALUE
On success, returns the kernel address at which the module will reside. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EEXIST
A module by that name already exists.
EFAULT
name is outside the program’s accessible address space.
EINVAL
The requested size is too small even for the module header information.
ENOMEM
The kernel could not allocate a contiguous block of memory large enough for the module.
ENOSYS
create_module() is not supported in this version of the kernel (e.g., Linux 2.6 or later).
EPERM
The caller was not privileged (did not have the CAP_SYS_MODULE capability).
STANDARDS
Linux.
HISTORY
Removed in Linux 2.6.
This obsolete system call is not supported by glibc. No declaration is provided in glibc headers, but, through a quirk of history, glibc versions before glibc 2.23 did export an ABI for this system call. Therefore, in order to employ this system call, it was sufficient to manually declare the interface in your code; alternatively, you could invoke the system call using syscall(2).
SEE ALSO
delete_module(2), init_module(2), query_module(2)
169 - Linux cli command ppoll
NAME 🖥️ ppoll 🖥️
wait for some event on a file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <poll.h>
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <poll.h>
int ppoll(struct pollfd *fds, nfds_t nfds,
const struct timespec *_Nullable tmo_p,
const sigset_t *_Nullable sigmask);
DESCRIPTION
poll() performs a similar task to select(2): it waits for one of a set of file descriptors to become ready to perform I/O. The Linux-specific epoll(7) API performs a similar task, but offers features beyond those found in poll().
The set of file descriptors to be monitored is specified in the fds argument, which is an array of structures of the following form:
struct pollfd {
int fd; /* file descriptor */
short events; /* requested events */
short revents; /* returned events */
};
The caller should specify the number of items in the fds array in nfds.
The field fd contains a file descriptor for an open file. If this field is negative, then the corresponding events field is ignored and the revents field returns zero. (This provides an easy way of ignoring a file descriptor for a single poll() call: simply set the fd field to its bitwise complement.)
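The bitwise-complement trick can be sketched as follows. This is a minimal illustration, not part of the original page; poll_once() is a hypothetical helper name. Because the complement of any valid (nonnegative) descriptor is negative, including descriptor 0, the kernel skips the entry and leaves revents at zero.

```c
#include <poll.h>
#include <stddef.h>

/* Poll a single descriptor for readability without blocking.
   If 'ignore' is nonzero, the fd field is set to its bitwise
   complement, which is negative for every valid descriptor, so the
   entry is skipped and revents comes back as 0.
   Returns poll()'s result; the revents field is stored in
   *revents_out when that pointer is not NULL.
   (poll_once is a hypothetical helper, not a library function.) */
int
poll_once(int fd, int ignore, short *revents_out)
{
    struct pollfd pfd;
    int ready;

    pfd.fd = ignore ? ~fd : fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    ready = poll(&pfd, 1, 0);       /* timeout 0: do not block */

    if (revents_out != NULL)
        *revents_out = pfd.revents;
    return ready;
}
```

Applying the complement a second time restores the original descriptor, so an entry can be switched off for one call and back on for the next.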
The field events is an input parameter, a bit mask specifying the events the application is interested in for the file descriptor fd. This field may be specified as zero, in which case the only events that can be returned in revents are POLLHUP, POLLERR, and POLLNVAL (see below).
The field revents is an output parameter, filled by the kernel with the events that actually occurred. The bits returned in revents can include any of those specified in events, or one of the values POLLERR, POLLHUP, or POLLNVAL. (These three bits are meaningless in the events field, and will be set in the revents field whenever the corresponding condition is true.)
If none of the events requested (and no error) has occurred for any of the file descriptors, then poll() blocks until one of the events occurs.
The timeout argument specifies the number of milliseconds that poll() should block waiting for a file descriptor to become ready. The call will block until either:
a file descriptor becomes ready;
the call is interrupted by a signal handler; or
the timeout expires.
Being “ready” means that the requested operation will not block; thus, poll()ing regular files, block devices, and other files with no reasonable polling semantic always returns instantly as ready to read and write.
Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a negative value in timeout means an infinite timeout. Specifying a timeout of zero causes poll() to return immediately, even if no file descriptors are ready.
The bits that may be set/returned in events and revents are defined in <poll.h>:
POLLIN
There is data to read.
POLLPRI
There is some exceptional condition on the file descriptor. Possibilities include:
There is out-of-band data on a TCP socket (see tcp(7)).
A pseudoterminal master in packet mode has seen a state change on the slave (see ioctl_tty(2)).
A cgroup.events file has been modified (see cgroups(7)).
POLLOUT
Writing is now possible, though a write larger than the available space in a socket or pipe will still block (unless O_NONBLOCK is set).
POLLRDHUP (since Linux 2.6.17)
Stream socket peer closed connection, or shut down writing half of connection. The _GNU_SOURCE feature test macro must be defined (before including any header files) in order to obtain this definition.
POLLERR
Error condition (only returned in revents; ignored in events). This bit is also set for a file descriptor referring to the write end of a pipe when the read end has been closed.
POLLHUP
Hang up (only returned in revents; ignored in events). Note that when reading from a channel such as a pipe or a stream socket, this event merely indicates that the peer closed its end of the channel. Subsequent reads from the channel will return 0 (end of file) only after all outstanding data in the channel has been consumed.
POLLNVAL
Invalid request: fd not open (only returned in revents; ignored in events).
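The three revents-only bits can be observed with a pipe. The following is a small sketch under the assumption of Linux pipe semantics as described above; probe_revents() is a hypothetical helper name.

```c
#include <poll.h>

/* Return the revents bits that a non-blocking poll() reports for fd,
   or -1 if poll() itself fails.
   (probe_revents is a hypothetical helper.) */
int
probe_revents(int fd, short events)
{
    struct pollfd pfd = { .fd = fd, .events = events };

    if (poll(&pfd, 1, 0) == -1)
        return -1;

    return pfd.revents;
}
```

With a pipe created by pipe(2), closing the write end should make the read end report POLLHUP, closing the read end should make the write end report POLLERR, and probing a descriptor that has already been closed should yield POLLNVAL.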
When compiling with _XOPEN_SOURCE defined, one also has the following, which convey no further information beyond the bits listed above:
POLLRDNORM
Equivalent to POLLIN.
POLLRDBAND
Priority band data can be read (generally unused on Linux).
POLLWRNORM
Equivalent to POLLOUT.
POLLWRBAND
Priority data may be written.
Linux also knows about, but does not use, POLLMSG.
ppoll()
The relationship between poll() and ppoll() is analogous to the relationship between select(2) and pselect(2): like pselect(2), ppoll() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.
Other than the difference in the precision of the timeout argument, the following ppoll() call:
ready = ppoll(&fds, nfds, tmo_p, &sigmask);
is nearly equivalent to atomically executing the following calls:
sigset_t origmask;
int timeout;
timeout = (tmo_p == NULL) ? -1 :
(tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000);
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = poll(&fds, nfds, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
The above code segment is described as nearly equivalent because whereas a negative timeout value for poll() is interpreted as an infinite timeout, a negative value expressed in *tmo_p results in an error from ppoll().
See the description of pselect(2) for an explanation of why ppoll() is necessary.
If the sigmask argument is specified as NULL, then no signal mask manipulation is performed (and thus ppoll() differs from poll() only in the precision of the timeout argument).
The tmo_p argument specifies an upper limit on the amount of time that ppoll() will block. This argument is a pointer to a timespec(3) structure.
If tmo_p is specified as NULL, then ppoll() can block indefinitely.
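A typical use of ppoll() keeps a signal blocked during normal execution and unblocks it only for the duration of the wait. The sketch below illustrates that pattern; the helper name wait_fd_or_sigusr1(), the choice of SIGUSR1, and the return-value convention are assumptions made for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

static volatile sig_atomic_t got_signal;

static void
on_signal(int sig)
{
    (void) sig;
    got_signal = 1;
}

/* Wait until fd is readable, SIGUSR1 is caught, or *tmo_p expires.
   SIGUSR1 is blocked except while ppoll() is waiting, so a signal
   sent just before the call stays pending instead of being lost.
   Returns 1 if fd is readable, 0 on timeout, -2 if SIGUSR1 was
   caught, and -1 on any other error. SIGUSR1 is left blocked on
   return. (Hypothetical helper; single-threaded use assumed.) */
int
wait_fd_or_sigusr1(int fd, const struct timespec *tmo_p)
{
    struct sigaction sa;
    sigset_t block, waitmask;
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;

    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    /* Block SIGUSR1 and fetch the previous mask */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &waitmask) == -1)
        return -1;

    /* Mask to install atomically while waiting: SIGUSR1 unblocked */
    sigdelset(&waitmask, SIGUSR1);

    got_signal = 0;
    ready = ppoll(&pfd, 1, tmo_p, &waitmask);
    if (ready == -1 && errno == EINTR && got_signal)
        return -2;

    return ready;
}
```

Doing the same with separate sigprocmask() and poll() calls would leave a window between unblocking the signal and starting the wait, which is the race that ppoll() closes.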
RETURN VALUE
On success, poll() returns a nonnegative value which is the number of elements in the fds array whose revents fields have been set to a nonzero value (indicating an event or an error). A return value of zero indicates that the system call timed out before any file descriptors became ready.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
fds points outside the process’s accessible address space. The array given as argument was not contained in the calling program’s address space.
EINTR
A signal occurred before any requested event; see signal(7).
EINVAL
The nfds value exceeds the RLIMIT_NOFILE value.
EINVAL
(ppoll()) The timeout value expressed in *tmo_p is invalid (negative).
ENOMEM
Unable to allocate memory for kernel data structures.
VERSIONS
On some other UNIX systems, poll() can fail with the error EAGAIN if the system fails to allocate kernel-internal resources, rather than ENOMEM as Linux does. POSIX permits this behavior. Portable programs may wish to check for EAGAIN and loop, just as with EINTR.
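A portable retry loop along the lines suggested above might look like this. It is a minimal sketch; poll_retry() is a hypothetical wrapper name, and restarting with the full timeout on every retry is a simplification chosen for brevity.

```c
#include <errno.h>
#include <poll.h>

/* Call poll(), retrying when it fails with EINTR or EAGAIN.
   Note: each retry starts again with the full timeout, so the total
   wait can exceed 'timeout' if the call is interrupted repeatedly.
   (poll_retry is a hypothetical wrapper.) */
int
poll_retry(struct pollfd *fds, nfds_t nfds, int timeout)
{
    int ready;

    do {
        ready = poll(fds, nfds, timeout);
    } while (ready == -1 && (errno == EINTR || errno == EAGAIN));

    return ready;
}
```

A caller that needs a strict deadline would have to recompute the remaining time before each retry, for example from clock_gettime(2) readings.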
Some implementations define the nonstandard constant INFTIM with the value -1 for use as a timeout for poll(). This constant is not provided in glibc.
C library/kernel differences
The Linux ppoll() system call modifies its tmo_p argument. However, the glibc wrapper function hides this behavior by using a local variable for the timeout argument that is passed to the system call. Thus, the glibc ppoll() function does not modify its tmo_p argument.
The raw ppoll() system call has a fifth argument, size_t sigsetsize, which specifies the size in bytes of the sigmask argument. The glibc ppoll() wrapper function specifies this argument as a fixed value (equal to sizeof(kernel_sigset_t)). See sigprocmask(2) for a discussion on the differences between the kernel and the libc notion of the sigset.
STANDARDS
poll()
POSIX.1-2008.
ppoll()
Linux.
HISTORY
poll()
POSIX.1-2001. Linux 2.1.23.
On older kernels that lack this system call, the glibc poll() wrapper function provides emulation using select(2).
ppoll()
Linux 2.6.16, glibc 2.4.
NOTES
The operation of poll() and ppoll() is not affected by the O_NONBLOCK flag.
For a discussion of what may happen if a file descriptor being monitored by poll() is closed in another thread, see select(2).
BUGS
See the discussion of spurious readiness notifications under the BUGS section of select(2).
EXAMPLES
The program below opens each of the files named in its command-line arguments and monitors the resulting file descriptors for readiness to read (POLLIN). The program loops, repeatedly using poll() to monitor the file descriptors, printing the number of ready file descriptors on return. For each ready file descriptor, the program:
displays the returned revents field in a human-readable form;
if the file descriptor is readable, reads some data from it, and displays that data on standard output; and
if the file descriptor was not readable, but some other event occurred (presumably POLLHUP), closes the file descriptor.
Suppose we run the program in one terminal, asking it to open a FIFO:
$ mkfifo myfifo
$ ./poll_input myfifo
In a second terminal window, we then open the FIFO for writing, write some data to it, and close the FIFO:
$ echo aaaaabbbbbccccc > myfifo
In the terminal where we are running the program, we would then see:
Opened "myfifo" on fd 3
About to poll()
Ready: 1
fd=3; events: POLLIN POLLHUP
read 10 bytes: aaaaabbbbb
About to poll()
Ready: 1
fd=3; events: POLLIN POLLHUP
read 6 bytes: ccccc
About to poll()
Ready: 1
fd=3; events: POLLHUP
closing fd 3
All file descriptors closed; bye
In the above output, we see that poll() returned three times:
On the first return, the bits returned in the revents field were POLLIN, indicating that the file descriptor is readable, and POLLHUP, indicating that the other end of the FIFO has been closed. The program then consumed some of the available input.
The second return from poll() also indicated POLLIN and POLLHUP; the program then consumed the last of the available input.
On the final return, poll() indicated only POLLHUP on the FIFO, at which point the file descriptor was closed and the program terminated.
Program source
/* poll_input.c
Licensed under GNU General Public License v2 or later.
*/
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)
int
main(int argc, char *argv[])
{
    int            ready;
    char           buf[10];
    nfds_t         num_open_fds, nfds;
    ssize_t        s;
    struct pollfd  *pfds;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s file...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    num_open_fds = nfds = argc - 1;
    pfds = calloc(nfds, sizeof(struct pollfd));
    if (pfds == NULL)
        errExit("malloc");

    /* Open each file on command line, and add it to 'pfds' array. */

    for (nfds_t j = 0; j < nfds; j++) {
        pfds[j].fd = open(argv[j + 1], O_RDONLY);
        if (pfds[j].fd == -1)
            errExit("open");

        printf("Opened \"%s\" on fd %d\n", argv[j + 1], pfds[j].fd);

        pfds[j].events = POLLIN;
    }

    /* Keep calling poll() as long as at least one file descriptor is
       open. */

    while (num_open_fds > 0) {
        printf("About to poll()\n");
        ready = poll(pfds, nfds, -1);
        if (ready == -1)
            errExit("poll");

        printf("Ready: %d\n", ready);

        /* Deal with array returned by poll(). */

        for (nfds_t j = 0; j < nfds; j++) {
            if (pfds[j].revents != 0) {
                printf("  fd=%d; events: %s%s%s\n", pfds[j].fd,
                       (pfds[j].revents & POLLIN)  ? "POLLIN "  : "",
                       (pfds[j].revents & POLLHUP) ? "POLLHUP " : "",
                       (pfds[j].revents & POLLERR) ? "POLLERR " : "");

                if (pfds[j].revents & POLLIN) {
                    s = read(pfds[j].fd, buf, sizeof(buf));
                    if (s == -1)
                        errExit("read");
                    printf("    read %zd bytes: %.*s\n",
                           s, (int) s, buf);
                } else {                /* POLLERR | POLLHUP */
                    printf("    closing fd %d\n", pfds[j].fd);
                    if (close(pfds[j].fd) == -1)
                        errExit("close");
                    /* A negative fd makes later poll() calls ignore
                       this entry instead of reporting POLLNVAL. */
                    pfds[j].fd = -1;
                    num_open_fds--;
                }
            }
        }
    }

    printf("All file descriptors closed; bye\n");
    exit(EXIT_SUCCESS);
}
SEE ALSO
restart_syscall(2), select(2), select_tut(2), timespec(3), epoll(7), time(7)
170 - Linux cli command inl_p
NAME 🖥️ inl_p 🖥️
port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width, the w-suffix functions word-width, and the l-suffix functions 32 bits wide; the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
You use ioperm(2) or alternatively iopl(2) to tell the kernel to allow the user space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
171 - Linux cli command ioctl_ficlonerange
NAME 🖥️ ioctl_ficlonerange 🖥️
share some of the data of one file with another file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/fs.h> /* Definition of FICLONE* constants */
#include <sys/ioctl.h>
int ioctl(int dest_fd, FICLONERANGE, struct file_clone_range *arg);
int ioctl(int dest_fd, FICLONE, int src_fd);
DESCRIPTION
If a filesystem supports sharing physical storage between multiple files (“reflink”), this ioctl(2) operation can be used to make some of the data in the src_fd file appear in the dest_fd file by sharing the underlying storage, which is faster than making a separate physical copy of the data. Both files must reside within the same filesystem. If a file write should occur to a shared region, the filesystem must ensure that the changes remain private to the file being written. This behavior is commonly referred to as “copy on write”.
This ioctl reflinks up to src_length bytes from file descriptor src_fd at offset src_offset into the file dest_fd at offset dest_offset, provided that both are files. If src_length is zero, the ioctl reflinks to the end of the source file. This information is conveyed in a structure of the following form:
struct file_clone_range {
__s64 src_fd;
__u64 src_offset;
__u64 src_length;
__u64 dest_offset;
};
Clones are atomic with regard to concurrent writes, so no locks need to be taken to obtain a consistent cloned copy.
The FICLONE ioctl clones entire files.
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Error codes can be one of, but are not limited to, the following:
EBADF
src_fd is not open for reading; dest_fd is not open for writing or is open for append-only writes; or the filesystem which src_fd resides on does not support reflink.
EINVAL
The filesystem does not support reflinking the ranges of the given files. This error can also appear if either file descriptor represents a device, FIFO, or socket. Disk filesystems generally require the offset and length arguments to be aligned to the fundamental block size. XFS and Btrfs do not support overlapping reflink ranges in the same file.
EISDIR
One of the files is a directory and the filesystem does not support shared regions in directories.
EOPNOTSUPP
This can appear if the filesystem does not support reflinking either file descriptor, or if either file descriptor refers to special inodes.
EPERM
dest_fd is immutable.
ETXTBSY
One of the files is a swap file. Swap files cannot share storage.
EXDEV
dest_fd and src_fd are not on the same mounted filesystem.
STANDARDS
Linux.
HISTORY
Linux 4.5.
They were previously known as BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE, and were private to Btrfs.
NOTES
Because a copy-on-write operation requires the allocation of new storage, the fallocate(2) operation may unshare shared blocks to guarantee that subsequent writes will not fail because of lack of disk space.
SEE ALSO
ioctl(2)
172 - Linux cli command sched_getscheduler
NAME 🖥️ sched_getscheduler 🖥️
set and get scheduling policy/parameters
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sched.h>
int sched_setscheduler(pid_t pid, int policy,
const struct sched_param *param);
int sched_getscheduler(pid_t pid);
DESCRIPTION
The sched_setscheduler() system call sets both the scheduling policy and parameters for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and parameters of the calling thread will be set.
The scheduling parameters are specified in the param argument, which is a pointer to a structure of the following form:
struct sched_param {
...
int sched_priority;
...
};
In the current implementation, the structure contains only one field, sched_priority. The interpretation of param depends on the selected policy.
Currently, Linux supports the following “normal” (i.e., non-real-time) scheduling policies as values that may be specified in policy:
SCHED_OTHER
the standard round-robin time-sharing policy;
SCHED_BATCH
for “batch” style execution of processes; and
SCHED_IDLE
for running very low priority background jobs.
For each of the above policies, param->sched_priority must be 0.
Various “real-time” policies are also supported, for special time-critical applications that need precise control over the way in which runnable threads are selected for execution. For the rules governing when a process may use these policies, see sched(7). The real-time policies that may be specified in policy are:
SCHED_FIFO
a first-in, first-out policy; and
SCHED_RR
a round-robin policy.
For each of the above policies, param->sched_priority specifies a scheduling priority for the thread. This is a number in the range returned by calling sched_get_priority_min(2) and sched_get_priority_max(2) with the specified policy. On Linux, these system calls return, respectively, 1 and 99.
Since Linux 2.6.32, the SCHED_RESET_ON_FORK flag can be ORed in policy when calling sched_setscheduler(). As a result of including this flag, children created by fork(2) do not inherit privileged scheduling policies. See sched(7) for details.
sched_getscheduler() returns the current scheduling policy of the thread identified by pid. If pid equals zero, the policy of the calling thread will be retrieved.
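As an illustration, the sketch below queries the policy of a thread and checks whether a priority is valid for a given policy. Both helper names are hypothetical. The code masks out SCHED_RESET_ON_FORK before comparing, on the assumption that the kernel may report that flag ORed into the returned policy.

```c
#define _GNU_SOURCE     /* SCHED_BATCH, SCHED_IDLE, SCHED_RESET_ON_FORK */
#include <sched.h>
#include <stddef.h>
#include <sys/types.h>

/* Return the name of the scheduling policy of thread 'pid'
   (0 means the calling thread), or NULL if the query fails.
   (current_policy_name is a hypothetical helper.) */
const char *
current_policy_name(pid_t pid)
{
    int policy = sched_getscheduler(pid);

    if (policy == -1)
        return NULL;

    /* Assumption: the reset-on-fork flag may be ORed into the value */
    switch (policy & ~SCHED_RESET_ON_FORK) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_BATCH: return "SCHED_BATCH";
    case SCHED_IDLE:  return "SCHED_IDLE";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "unknown";
    }
}

/* Return 1 if 'prio' lies in the priority range of 'policy',
   0 if it does not, and -1 if the range cannot be determined.
   (priority_in_range is a hypothetical helper.) */
int
priority_in_range(int policy, int prio)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (lo == -1 || hi == -1)
        return -1;

    return prio >= lo && prio <= hi;
}
```

A program could call priority_in_range() before sched_setscheduler() to catch an EINVAL case early, for instance a nonzero priority combined with SCHED_OTHER.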
RETURN VALUE
On success, sched_setscheduler() returns zero. On success, sched_getscheduler() returns the policy for the thread (a nonnegative integer). On error, both calls return -1, and errno is set to indicate the error.
ERRORS
EINVAL
Invalid arguments: pid is negative or param is NULL.
EINVAL
(sched_setscheduler()) policy is not one of the recognized policies.
EINVAL
(sched_setscheduler()) param does not make sense for the specified policy.
EPERM
The calling thread does not have appropriate privileges.
ESRCH
The thread whose ID is pid could not be found.
VERSIONS
POSIX.1 does not detail the permissions that an unprivileged thread requires in order to call sched_setscheduler(), and details vary across systems. For example, the Solaris 7 manual page says that the real or effective user ID of the caller must match the real user ID or the saved set-user-ID of the target.
The scheduling policy and parameters are in fact per-thread attributes on Linux. The value returned from a call to gettid(2) can be passed in the argument pid. Specifying pid as 0 will operate on the attributes of the calling thread, and passing the value returned from a call to getpid(2) will operate on the attributes of the main thread of the thread group. (If you are using the POSIX threads API, then use pthread_setschedparam(3), pthread_getschedparam(3), and pthread_setschedprio(3), instead of the sched_*(2) system calls.)
STANDARDS
POSIX.1-2008 (but see BUGS below).
SCHED_BATCH and SCHED_IDLE are Linux-specific.
HISTORY
POSIX.1-2001.
NOTES
Further details of the semantics of all of the above “normal” and “real-time” scheduling policies can be found in the sched(7) manual page. That page also describes an additional policy, SCHED_DEADLINE, which is settable only via sched_setattr(2).
POSIX systems on which sched_setscheduler() and sched_getscheduler() are available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
BUGS
POSIX.1 says that on success, sched_setscheduler() should return the previous scheduling policy. Linux sched_setscheduler() does not conform to this requirement, since it always returns 0 on success.
SEE ALSO
chrt(1), nice(2), sched_get_priority_max(2), sched_get_priority_min(2), sched_getaffinity(2), sched_getattr(2), sched_getparam(2), sched_rr_get_interval(2), sched_setaffinity(2), sched_setattr(2), sched_setparam(2), sched_yield(2), setpriority(2), capabilities(7), cpuset(7), sched(7)
173 - Linux cli command intro
NAME 🖥️ intro 🖥️
introduction to system calls
DESCRIPTION
Section 2 of the manual describes the Linux system calls. A system call is an entry point into the Linux kernel. Usually, system calls are not invoked directly: instead, most system calls have corresponding C library wrapper functions which perform the steps required (e.g., trapping to kernel mode) in order to invoke the system call. Thus, making a system call looks the same as invoking a normal library function.
In many cases, the C library wrapper function does nothing more than:
copying arguments and the unique system call number to the registers where the kernel expects them;
trapping to kernel mode, at which point the kernel does the real work of the system call;
setting errno if the system call returns an error number when the kernel returns the CPU to user mode.
However, in a few cases, a wrapper function may do rather more than this, for example, performing some preprocessing of the arguments before trapping to kernel mode, or postprocessing of values returned by the system call. Where this is the case, the manual pages in Section 2 generally try to note the details of both the (usually GNU) C library API interface and the raw system call. Most commonly, the main DESCRIPTION will focus on the C library interface, and differences for the system call are covered in the NOTES section.
For a list of the Linux system calls, see syscalls(2).
RETURN VALUE
On error, most system calls return a negative error number (i.e., the negated value of one of the constants described in errno(3)). The C library wrapper hides this detail from the caller: when a system call returns a negative value, the wrapper copies the absolute value into the errno variable, and returns -1 as the return value of the wrapper.
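The errno convention can be demonstrated with a call that is certain to fail. The sketch below closes an invalid descriptor on purpose; errno_from_bad_close() is a hypothetical helper written for this illustration.

```c
#include <errno.h>
#include <unistd.h>

/* Call close() on an invalid descriptor and return the errno value
   that the C library wrapper stored. The kernel reports the failure
   as a negative error number; the caller only sees -1 plus errno.
   Returns 0 if the call unexpectedly succeeds.
   (errno_from_bad_close is a hypothetical helper.) */
int
errno_from_bad_close(void)
{
    errno = 0;

    if (close(-1) == -1)
        return errno;

    return 0;
}
```

Here the returned value is expected to be EBADF, since -1 is never a valid file descriptor.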
The value returned by a successful system call depends on the call. Many system calls return 0 on success, but some can return nonzero values from a successful call. The details are described in the individual manual pages.
In some cases, the programmer must define a feature test macro in order to obtain the declaration of a system call from the header file specified in the man page SYNOPSIS section. (Where required, these feature test macros must be defined before including any header files.) In such cases, the required macro is described in the man page. For further information on feature test macros, see feature_test_macros(7).
STANDARDS
Certain terms and abbreviations are used to indicate UNIX variants and standards to which calls in this section conform. See standards(7).
NOTES
Calling directly
In most cases, it is unnecessary to invoke a system call directly, but there are times when the Standard C library does not implement a nice wrapper function for you. In this case, the programmer must manually invoke the system call using syscall(2). Historically, this was also possible using one of the _syscall macros described in _syscall(2).
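As a sketch of a direct invocation, the following calls gettid through syscall(2), using the SYS_gettid constant from <sys/syscall.h>. gettid is chosen here because older glibc versions provided no wrapper for it; raw_gettid() is a hypothetical helper name.

```c
#include <sys/syscall.h>    /* SYS_* system call numbers */
#include <sys/types.h>
#include <unistd.h>

/* Invoke the gettid system call directly through syscall(2).
   syscall() follows the usual convention: on error it returns -1
   and sets errno.
   (raw_gettid is a hypothetical helper.) */
pid_t
raw_gettid(void)
{
    return (pid_t) syscall(SYS_gettid);
}
```

In the main thread of a process, the value returned should equal the result of getpid(2).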
Authors and copyright conditions
Look at the header of the manual page source for the author(s) and copyright conditions. Note that these can be different from page to page!
SEE ALSO
_syscall(2), syscall(2), syscalls(2), errno(3), intro(3), capabilities(7), credentials(7), feature_test_macros(7), mq_overview(7), path_resolution(7), pipe(7), pty(7), sem_overview(7), shm_overview(7), signal(7), socket(7), standards(7), symlink(7), system_data_types(7), sysvipc(7), time(7)
174 - Linux cli command shmdt
NAME 🖥️ shmdt 🖥️
System V shared memory operations
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/shm.h>
void *shmat(int shmid, const void *_Nullable shmaddr, int shmflg);
int shmdt(const void *shmaddr);
DESCRIPTION
shmat()
shmat() attaches the System V shared memory segment identified by shmid to the address space of the calling process. The attaching address is specified by shmaddr with one of the following criteria:
If shmaddr is NULL, the system chooses a suitable (unused) page-aligned address to attach the segment.
If shmaddr isn’t NULL and SHM_RND is specified in shmflg, the attach occurs at the address equal to shmaddr rounded down to the nearest multiple of SHMLBA.
Otherwise, shmaddr must be a page-aligned address at which the attach occurs.
In addition to SHM_RND, the following flags may be specified in the shmflg bit-mask argument:
SHM_EXEC (Linux-specific; since Linux 2.6.9)
Allow the contents of the segment to be executed. The caller must have execute permission on the segment.
SHM_RDONLY
Attach the segment for read-only access. The process must have read permission for the segment. If this flag is not specified, the segment is attached for read and write access, and the process must have read and write permission for the segment. There is no notion of a write-only shared memory segment.
SHM_REMAP (Linux-specific)
This flag specifies that the mapping of the segment should replace any existing mapping in the range starting at shmaddr and continuing for the size of the segment. (Normally, an EINVAL error would result if a mapping already exists in this address range.) In this case, shmaddr must not be NULL.
The brk(2) value of the calling process is not altered by the attach. The segment will automatically be detached at process exit. The same segment may be attached both read-only and read-write, and more than once, in the process’s address space.
A successful shmat() call updates the members of the shmid_ds structure (see shmctl(2)) associated with the shared memory segment as follows:
shm_atime is set to the current time.
shm_lpid is set to the process-ID of the calling process.
shm_nattch is incremented by one.
shmdt()
shmdt() detaches the shared memory segment located at the address specified by shmaddr from the address space of the calling process. The to-be-detached segment must be currently attached with shmaddr equal to the value returned by the attaching shmat() call.
On a successful shmdt() call, the system updates the members of the shmid_ds structure associated with the shared memory segment as follows:
shm_dtime is set to the current time.
shm_lpid is set to the process-ID of the calling process.
shm_nattch is decremented by one. If it becomes 0 and the segment is marked for deletion, the segment is deleted.
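The attach and detach steps described above can be sketched in a single process. This is a minimal illustration, not part of the original page: the helper name shm_roundtrip() and its parameters are hypothetical, and error handling is reduced to returning -1. It creates a private segment, attaches it with a NULL shmaddr, reads shm_nattch through shmctl(2), copies a string through the mapping, marks the segment for deletion, and detaches.

```c
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: copy msg through a private System V segment.
   Stores the attach count seen after shmat() in *nattch.
   Returns 0 on success, -1 on failure. */
int
shm_roundtrip(const char *msg, char *out, size_t outsz,
              unsigned long *nattch)
{
    struct shmid_ds  ds;
    size_t           len = strlen(msg) + 1;
    char             *addr;
    int              shmid;

    if (len > outsz) {
        errno = EINVAL;
        return -1;
    }

    shmid = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (shmid == -1)
        return -1;

    /* A NULL shmaddr lets the system choose the attach address. */
    addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1) {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    /* shm_nattch was incremented by the successful shmat(). */
    if (shmctl(shmid, IPC_STAT, &ds) == -1) {
        shmdt(addr);
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }
    *nattch = ds.shm_nattch;

    memcpy(addr, msg, len);     /* write through the mapping */
    memcpy(out, addr, len);     /* read it back */

    /* Mark the segment for deletion; it is destroyed once the
       shmdt() below drops the last attach. */
    shmctl(shmid, IPC_RMID, NULL);
    return shmdt(addr);
}
```

Because the segment is private and attached only once, the value stored in *nattch is expected to be 1. The address passed to shmdt() is exactly the value returned by shmat().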
RETURN VALUE
On success, shmat() returns the address of the attached shared memory segment; on error, (void *) -1 is returned, and errno is set to indicate the error.
On success, shmdt() returns 0; on error -1 is returned, and errno is set to indicate the error.
ERRORS
shmat() can fail with one of the following errors:
EACCES
The calling process does not have the required permissions for the requested attach type, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EIDRM
shmid points to a removed identifier.
EINVAL
Invalid shmid value, unaligned (i.e., not page-aligned and SHM_RND was not specified) or invalid shmaddr value, or can’t attach segment at shmaddr, or SHM_REMAP was specified and shmaddr was NULL.
ENOMEM
Could not allocate memory for the descriptor or for the page tables.
shmdt() can fail with one of the following errors:
EINVAL
There is no shared memory segment attached at shmaddr; or, shmaddr is not aligned on a page boundary.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
In SVID 3 (or perhaps earlier), the type of the shmaddr argument was changed from char * into const void *, and the returned type of shmat() from char * into void *.
NOTES
After a fork(2), the child inherits the attached shared memory segments.
After an execve(2), all attached shared memory segments are detached from the process.
Upon _exit(2), all attached shared memory segments are detached from the process.
Using shmat() with shmaddr equal to NULL is the preferred, portable way of attaching a shared memory segment. Be aware that the shared memory segment attached in this way may be attached at different addresses in different processes. Therefore, any pointers maintained within the shared memory must be made relative (typically to the starting address of the segment), rather than absolute.
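One way to keep pointers relative is to store byte offsets from the start of the segment and convert them on each access. The sketch below is illustrative only: the structures seg_header and node and the helpers node_at(), build_list(), and sum_list() are hypothetical names, and an ordinary buffer stands in for the attached segment.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: offsets are measured from the segment start;
   offset 0 is occupied by the header, so it doubles as "no node". */
struct seg_header {
    size_t  head_off;
};

struct node {
    size_t  next_off;
    int     value;
};

/* Turn a stored offset into a pointer valid for this mapping. */
static struct node *
node_at(void *base, size_t off)
{
    return off != 0 ? (struct node *) ((char *) base + off) : NULL;
}

/* Build a two-element list (values 10 and 32) inside the region. */
void
build_list(void *base)
{
    struct seg_header  *hdr = base;
    size_t             off_a = sizeof(struct seg_header);
    size_t             off_b = off_a + sizeof(struct node);
    struct node        *a, *b;

    memset(base, 0, off_b + sizeof(struct node));
    hdr->head_off = off_a;

    a = node_at(base, off_a);
    b = node_at(base, off_b);
    a->value = 10;
    a->next_off = off_b;
    b->value = 32;
    b->next_off = 0;
}

/* Walk the list using only the base address of this mapping. */
int
sum_list(void *base)
{
    struct seg_header  *hdr = base;
    int                sum = 0;

    for (struct node *n = node_at(base, hdr->head_off); n != NULL;
         n = node_at(base, n->next_off))
        sum += n->value;

    return sum;
}
```

The same bytes yield the same list wherever they are mapped, since sum_list() never dereferences an address that was computed by another process.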
On Linux, it is possible to attach a shared memory segment even if it is already marked to be deleted. However, POSIX.1 does not specify this behavior and many other implementations do not support it.
The following system parameter affects shmat():
SHMLBA
Segment low boundary address multiple. When explicitly specifying an attach address in a call to shmat(), the caller should ensure that the address is a multiple of this value. This is necessary on some architectures, in order either to ensure good CPU cache performance or to ensure that different attaches of the same segment have consistent views within the CPU cache. SHMLBA is normally some multiple of the system page size. (On many Linux architectures, SHMLBA is the same as the system page size.)
The implementation places no intrinsic per-process limit on the number of shared memory segments (SHMSEG).
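The rounding applied when SHM_RND is specified can be written out as plain arithmetic. This is a small sketch for illustration: the function name shm_rnd_address() is hypothetical, and it only computes the address; it does not attach anything.

```c
#include <stdint.h>
#include <sys/shm.h>

/* Hypothetical helper: the address at which an attach would occur
   for a given shmaddr when SHM_RND is set, that is, shmaddr rounded
   down to the nearest multiple of SHMLBA. */
uintptr_t
shm_rnd_address(uintptr_t shmaddr)
{
    uintptr_t  lba = (uintptr_t) SHMLBA;

    return shmaddr - (shmaddr % lba);
}
```

An address that is already a multiple of SHMLBA is returned unchanged.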
EXAMPLES
The two programs shown below exchange a string using a shared memory segment. Further details about the programs are given below. First, we show a shell session demonstrating their use.
In one terminal window, we run the “reader” program, which creates a System V shared memory segment and a System V semaphore set. The program prints out the IDs of the created objects, and then waits for the semaphore to change value.
$ ./svshm_string_read
shmid = 1114194; semid = 15
In another terminal window, we run the “writer” program. The “writer” program takes three command-line arguments: the IDs of the shared memory segment and semaphore set created by the “reader”, and a string. It attaches the existing shared memory segment, copies the string to the shared memory, and modifies the semaphore value.
$ ./svshm_string_write 1114194 15 'Hello, world'
Returning to the terminal where the “reader” is running, we see that the program has ceased waiting on the semaphore and has printed the string that was copied into the shared memory segment by the writer:
Hello, world
Program source: svshm_string.h
The following header file is included by the “reader” and “writer” programs:
/* svshm_string.h
Licensed under GNU General Public License v2 or later.
*/
#ifndef SVSHM_STRING_H
#define SVSHM_STRING_H
#include <stdio.h>
#include <stdlib.h>
#include <sys/sem.h>
#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)
union semun { /* Used in calls to semctl() */
int val;
struct semid_ds *buf;
unsigned short *array;
#if defined(__linux__)
struct seminfo *__buf;
#endif
};
#define MEM_SIZE 4096
#endif // include guard
Program source: svshm_string_read.c
The “reader” program creates a shared memory segment and a semaphore set containing one semaphore. It then attaches the shared memory object into its address space and initializes the semaphore value to 1. Finally, the program waits for the semaphore value to become 0, and afterwards prints the string that has been copied into the shared memory segment by the “writer”.
/* svshm_string_read.c
Licensed under GNU General Public License v2 or later.
*/
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include "svshm_string.h"
int
main(void)
{
    int            semid, shmid;
    char           *addr;
    union semun    arg, dummy;
    struct sembuf  sop;

    /* Create shared memory and semaphore set containing one
       semaphore. */

    shmid = shmget(IPC_PRIVATE, MEM_SIZE, IPC_CREAT | 0600);
    if (shmid == -1)
        errExit("shmget");

    semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1)
        errExit("semget");

    /* Attach shared memory into our address space. */

    addr = shmat(shmid, NULL, SHM_RDONLY);
    if (addr == (void *) -1)
        errExit("shmat");

    /* Initialize semaphore 0 in set with value 1. */

    arg.val = 1;
    if (semctl(semid, 0, SETVAL, arg) == -1)
        errExit("semctl");

    printf("shmid = %d; semid = %d\n", shmid, semid);

    /* Wait for semaphore value to become 0. */

    sop.sem_num = 0;
    sop.sem_op = 0;
    sop.sem_flg = 0;

    if (semop(semid, &sop, 1) == -1)
        errExit("semop");

    /* Print the string from shared memory. */

    printf("%s\n", addr);

    /* Remove shared memory and semaphore set. */

    if (shmctl(shmid, IPC_RMID, NULL) == -1)
        errExit("shmctl");
    if (semctl(semid, 0, IPC_RMID, dummy) == -1)
        errExit("semctl");

    exit(EXIT_SUCCESS);
}
Program source: svshm_string_write.c
The writer program takes three command-line arguments: the IDs of the shared memory segment and semaphore set that have already been created by the “reader”, and a string. It attaches the shared memory segment into its address space, and then decrements the semaphore value to 0 in order to inform the “reader” that it can now examine the contents of the shared memory.
/* svshm_string_write.c
Licensed under GNU General Public License v2 or later.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include "svshm_string.h"
int
main(int argc, char *argv[])
{
    int            semid, shmid;
    char           *addr;
    size_t         len;
    struct sembuf  sop;

    if (argc != 4) {
        fprintf(stderr, "Usage: %s shmid semid string\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    len = strlen(argv[3]) + 1;  /* +1 to include trailing '\0' */
    if (len > MEM_SIZE) {
        fprintf(stderr, "String is too big!\n");
        exit(EXIT_FAILURE);
    }

    /* Get object IDs from command-line. */

    shmid = atoi(argv[1]);
    semid = atoi(argv[2]);

    /* Attach shared memory into our address space and copy string
       (including trailing null byte) into memory. */

    addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1)
        errExit("shmat");

    memcpy(addr, argv[3], len);

    /* Decrement semaphore to 0. */

    sop.sem_num = 0;
    sop.sem_op = -1;
    sop.sem_flg = 0;

    if (semop(semid, &sop, 1) == -1)
        errExit("semop");

    exit(EXIT_SUCCESS);
}
SEE ALSO
brk(2), mmap(2), shmctl(2), shmget(2), capabilities(7), shm_overview(7), sysvipc(7)
175 - Linux cli command arm_sync_file_range
NAME 🖥️ arm_sync_file_range 🖥️
sync a file segment with disk
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
int sync_file_range(int fd, off_t offset, off_t nbytes,
unsigned int flags);
DESCRIPTION
sync_file_range() permits fine control when synchronizing the open file referred to by the file descriptor fd with disk.
offset is the starting byte of the file range to be synchronized. nbytes specifies the length of the range to be synchronized, in bytes; if nbytes is zero, then all bytes from offset through to the end of file are synchronized. Synchronization is in units of the system page size: offset is rounded down to a page boundary; (offset+nbytes-1) is rounded up to a page boundary.
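The page-granular rounding can be made concrete with a little arithmetic. The sketch below is one reading of the rule, not the kernel's code: it assumes that "rounded up" means the range extends through the last byte of the page containing offset+nbytes-1, it requires nbytes > 0 (the "to end of file" case is not modeled), and the names byte_range and effective_sync_range() are hypothetical.

```c
#include <sys/types.h>

/* Hypothetical result type: first and last byte actually covered. */
struct byte_range {
    off_t  first;
    off_t  last;
};

/* Compute the page-aligned range covering [offset, offset+nbytes-1].
   page_size is passed in explicitly; a real caller might obtain it
   from sysconf(_SC_PAGESIZE). Requires nbytes > 0. */
struct byte_range
effective_sync_range(off_t offset, off_t nbytes, off_t page_size)
{
    struct byte_range  r;
    off_t              end = offset + nbytes - 1;  /* last requested byte */

    r.first = offset - (offset % page_size);
    r.last = end - (end % page_size) + page_size - 1;

    return r;
}
```

With a 4096-byte page, a request for 100 bytes at offset 5000 therefore covers bytes 4096 through 8191, that is, the whole second page.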
The flags bit-mask argument can include any of the following values:
SYNC_FILE_RANGE_WAIT_BEFORE
Wait upon write-out of all pages in the specified range that have already been submitted to the device driver for write-out before performing any write.
SYNC_FILE_RANGE_WRITE
Initiate write-out of all dirty pages in the specified range which are not presently submitted for write-out. Note that even this may block if you attempt to write more than the request queue size.
SYNC_FILE_RANGE_WAIT_AFTER
Wait upon write-out of all pages in the range after performing any write.
Specifying flags as 0 is permitted, as a no-op.
Warning
This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the file’s metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an overwrite. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches.
Some details
SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any I/O errors or ENOSPC conditions and will return these to the caller.
Useful combinations of the flags bits are:
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE
Ensures that all pages in the specified range which were dirty when sync_file_range() was called are placed under write-out. This is a start-write-for-data-integrity operation.
SYNC_FILE_RANGE_WRITE
Start write-out of all dirty pages in the specified range which are not presently under write-out. This is an asynchronous flush-to-disk operation. This is not suitable for data integrity operations.
SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER)
Wait for completion of write-out of all pages in the specified range. This can be used after an earlier SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE operation to wait for completion of that operation, and obtain its result.
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER
This is a write-for-data-integrity operation that will ensure that all pages in the specified range which were dirty when sync_file_range() was called are committed to disk.
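The last combination above might be wrapped as follows. This is a minimal sketch with a hypothetical helper name, write_range_and_wait(); it returns 0 or the errno value so that conditions such as EIO and ENOSPC reach the caller. The caveats in the Warning above still apply: file metadata is not written and disk write caches are not flushed.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper: start write-out of the dirty pages in the
   range and wait for completion, using all three flags.
   Returns 0 on success, or the errno value on failure. */
int
write_range_and_wait(int fd, off_t offset, off_t nbytes)
{
    unsigned int  flags = SYNC_FILE_RANGE_WAIT_BEFORE
                        | SYNC_FILE_RANGE_WRITE
                        | SYNC_FILE_RANGE_WAIT_AFTER;

    if (sync_file_range(fd, offset, nbytes, flags) == -1)
        return errno;

    return 0;
}
```

A caller that needs the data and metadata to survive a crash would normally use fdatasync(2) or fsync(2) instead.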
RETURN VALUE
On success, sync_file_range() returns 0; on failure -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
fd is not a valid file descriptor.
EINVAL
flags specifies an invalid bit; or offset or nbytes is invalid.
EIO
I/O error.
ENOMEM
Out of memory.
ENOSPC
Out of disk space.
ESPIPE
fd refers to something other than a regular file, a block device, or a directory.
VERSIONS
sync_file_range2()
Some architectures (e.g., PowerPC, ARM) need 64-bit arguments to be aligned in a suitable pair of registers. On such architectures, the call signature of sync_file_range() shown in the SYNOPSIS would force a register to be wasted as padding between the fd and offset arguments. (See syscall(2) for details.) Therefore, these architectures define a different system call that orders the arguments suitably:
int sync_file_range2(int fd, unsigned int flags,
off_t offset, off_t nbytes);
The behavior of this system call is otherwise exactly the same as sync_file_range().
STANDARDS
Linux.
HISTORY
Linux 2.6.17.
sync_file_range2()
A system call with this signature first appeared on the ARM architecture in Linux 2.6.20, with the name arm_sync_file_range(). It was renamed in Linux 2.6.22, when the analogous system call was added for PowerPC. On architectures where glibc support is provided, glibc transparently wraps sync_file_range2() under the name sync_file_range().
NOTES
_FILE_OFFSET_BITS should be defined to be 64 in code that takes the address of sync_file_range, if the code is intended to be portable to traditional 32-bit x86 and ARM platforms where off_t’s width defaults to 32 bits.
SEE ALSO
fdatasync(2), fsync(2), msync(2), sync(2)
176 - Linux cli command sync
NAME 🖥️ sync 🖥️
commit filesystem caches to disk
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
void sync(void);
int syncfs(int fd);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sync():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE
syncfs():
_GNU_SOURCE
DESCRIPTION
sync() causes all pending modifications to filesystem metadata and cached file data to be written to the underlying filesystems.
syncfs() is like sync(), but synchronizes just the filesystem containing the file referred to by the open file descriptor fd.
RETURN VALUE
syncfs() returns 0 on success; on error, it returns -1 and sets errno to indicate the error.
ERRORS
sync() is always successful.
syncfs() can fail for at least the following reasons:
EBADF
fd is not a valid file descriptor.
EIO
An error occurred during synchronization. This error may relate to data written to any file on the filesystem, or on metadata related to the filesystem itself.
ENOSPC
Disk space was exhausted while synchronizing.
ENOSPC, EDQUOT
Data was written to a file on NFS or another filesystem which does not allocate space at the time of a write(2) system call, and some previous write failed due to insufficient storage space.
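Since syncfs() reports errors while sync() cannot, a caller that cares about the outcome has to check the return value. The following sketch shows one way to do that; the helper names syncfs_errno() and sync_filesystem_of() are hypothetical, and both return 0 or an errno value rather than following any established convention.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: call syncfs() and return 0 on success or the
   errno value (for example EBADF, EIO, ENOSPC) on failure. */
int
syncfs_errno(int fd)
{
    return syncfs(fd) == -1 ? errno : 0;
}

/* Hypothetical helper: synchronize the filesystem that contains
   path. Returns 0 on success or an errno value from open(2) or
   syncfs(). */
int
sync_filesystem_of(const char *path)
{
    int  fd, err;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return errno;

    err = syncfs_errno(fd);
    close(fd);

    return err;
}
```

Any file or directory on the target filesystem can serve as path, because syncfs() acts on the whole filesystem and not on the individual file.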
VERSIONS
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, Linux waits for I/O completions, and thus sync() or syncfs() provide the same guarantees as fsync() called on every file in the system or filesystem, respectively.
STANDARDS
sync()
POSIX.1-2008.
syncfs()
Linux.
HISTORY
sync()
POSIX.1-2001, SVr4, 4.3BSD.
syncfs()
Linux 2.6.39, glibc 2.14.
Since glibc 2.2.2, the Linux prototype for sync() is as listed above, following the various standards. In glibc 2.2.1 and earlier, it was “int sync(void)”, and sync() always returned 0.
In mainline kernel versions prior to Linux 5.8, syncfs() will fail only when passed a bad file descriptor (EBADF). Since Linux 5.8, syncfs() will also report an error if one or more inodes failed to be written back since the last syncfs() call.
BUGS
Before Linux 1.3.20, Linux did not wait for I/O to complete before returning.
SEE ALSO
sync(1), fdatasync(2), fsync(2)
177 - Linux cli command tee
NAME 🖥️ tee 🖥️
duplicating pipe content
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags);
DESCRIPTION
tee() duplicates up to len bytes of data from the pipe referred to by the file descriptor fd_in to the pipe referred to by the file descriptor fd_out. It does not consume the data that is duplicated from fd_in; therefore, that data can be copied by a subsequent splice(2).
flags is a bit mask that is composed by ORing together zero or more of the following values:
SPLICE_F_MOVE
Currently has no effect for tee(); see splice(2).
SPLICE_F_NONBLOCK
Do not block on I/O; see splice(2) for further details.
SPLICE_F_MORE
Currently has no effect for tee(), but may be implemented in the future; see splice(2).
SPLICE_F_GIFT
Unused for tee(); see vmsplice(2).
RETURN VALUE
Upon successful completion, tee() returns the number of bytes that were duplicated between the input and output. A return value of 0 means that there was no data to transfer, and it would not make sense to block, because there are no writers connected to the write end of the pipe referred to by fd_in.
On error, tee() returns -1 and errno is set to indicate the error.
ERRORS
EAGAIN
SPLICE_F_NONBLOCK was specified in flags or one of the file descriptors had been marked as nonblocking (O_NONBLOCK), and the operation would block.
EINVAL
fd_in or fd_out does not refer to a pipe; or fd_in and fd_out refer to the same pipe.
ENOMEM
Out of memory.
STANDARDS
Linux.
HISTORY
Linux 2.6.17, glibc 2.5.
NOTES
Conceptually, tee() copies the data between the two pipes. In reality no real data copying takes place though: under the covers, tee() assigns data to the output by merely grabbing a reference to the input.
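That tee() leaves the input data in place can be checked with two pipes inside one process. The sketch below is illustrative: the helper name tee_keeps_input() is hypothetical, and it assumes a short message (well under PIPE_BUF) so that each write(2) and read(2) transfers the whole message in one call.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical demo: write msg into one pipe, tee() it into a second
   pipe, then read the message back from BOTH pipes.
   Returns 0 on success, -1 on failure. */
int
tee_keeps_input(const char *msg, char *from_in, char *from_out,
                size_t bufsz)
{
    int      in[2], out[2];
    int      ret = -1;
    size_t   len = strlen(msg);
    ssize_t  want = (ssize_t) len;

    if (len == 0 || len >= bufsz)
        return -1;
    if (pipe(in) == -1)
        return -1;
    if (pipe(out) == -1) {
        close(in[0]);
        close(in[1]);
        return -1;
    }

    if (write(in[1], msg, len) != want)
        goto done;

    /* Duplicate the data; it is not consumed from the "in" pipe. */
    if (tee(in[0], out[1], len, 0) != want)
        goto done;

    if (read(out[0], from_out, len) != want)
        goto done;
    if (read(in[0], from_in, len) != want)   /* still there */
        goto done;

    from_in[len] = '\0';
    from_out[len] = '\0';
    ret = 0;

done:
    close(in[0]);
    close(in[1]);
    close(out[0]);
    close(out[1]);
    return ret;
}
```

Both output buffers end up holding the message, which is what allows a later splice(2) to move the same data elsewhere.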
EXAMPLES
The example below implements a basic tee(1) program using the tee() system call. Here is an example of its use:
$ date | ./a.out out.log | cat
Tue Oct 28 10:06:00 CET 2014
$ cat out.log
Tue Oct 28 10:06:00 CET 2014
Program source
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    int      fd;
    ssize_t  len, slen;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    for (;;) {
        /*
         * tee stdin to stdout.
         */
        len = tee(STDIN_FILENO, STDOUT_FILENO,
                  INT_MAX, SPLICE_F_NONBLOCK);
        if (len < 0) {
            if (errno == EAGAIN)
                continue;
            perror("tee");
            exit(EXIT_FAILURE);
        }
        if (len == 0)
            break;

        /*
         * Consume stdin by splicing it to a file.
         */
        while (len > 0) {
            slen = splice(STDIN_FILENO, NULL, fd, NULL,
                          len, SPLICE_F_MOVE);
            if (slen < 0) {
                perror("splice");
                exit(EXIT_FAILURE);
            }
            len -= slen;
        }
    }

    close(fd);
    exit(EXIT_SUCCESS);
}
SEE ALSO
splice(2), vmsplice(2), pipe(7)
178 - Linux cli command arch_prctl
NAME 🖥️ arch_prctl 🖥️
set architecture-specific thread state
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <asm/prctl.h> /* Definition of ARCH_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_arch_prctl, int op, unsigned long addr);
int syscall(SYS_arch_prctl, int op, unsigned long *addr);
Note: glibc provides no wrapper for arch_prctl(), necessitating the use of syscall(2).
DESCRIPTION
arch_prctl() sets architecture-specific process or thread state. op selects an operation and passes argument addr to it; addr is interpreted as either an unsigned long for the “set” operations, or as an unsigned long *, for the “get” operations.
Subfunctions for both x86 and x86-64 are:
ARCH_SET_CPUID (since Linux 4.12)
Enable (addr != 0) or disable (addr == 0) the cpuid instruction for the calling thread. The instruction is enabled by default. If disabled, any execution of a cpuid instruction will instead generate a SIGSEGV signal. This feature can be used to emulate cpuid results that differ from what the underlying hardware would have produced (e.g., in a paravirtualization setting).
The ARCH_SET_CPUID setting is preserved across fork(2) and clone(2) but reset to the default (i.e., cpuid enabled) on execve(2).
ARCH_GET_CPUID (since Linux 4.12)
Return the setting of the flag manipulated by ARCH_SET_CPUID as the result of the system call (1 for enabled, 0 for disabled). addr is ignored.
Subfunctions for x86-64 only are:
ARCH_SET_FS
Set the 64-bit base for the FS register to addr.
ARCH_GET_FS
Return the 64-bit base value for the FS register of the calling thread in the unsigned long pointed to by addr.
ARCH_SET_GS
Set the 64-bit base for the GS register to addr.
ARCH_GET_GS
Return the 64-bit base value for the GS register of the calling thread in the unsigned long pointed to by addr.
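A "get" operation passes a pointer in addr, as the following sketch of ARCH_GET_FS shows. The helper name get_fs_base() is hypothetical. Because the operation exists only on x86-64, the sketch compiles the call conditionally and, as an assumption made here for illustration, reports ENOSYS on other architectures.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>
#if defined(__x86_64__)
#include <asm/prctl.h>
#endif

/* Hypothetical helper: store the FS base of the calling thread in
   *base. Returns 0 on success; on error, returns -1 with errno set. */
int
get_fs_base(unsigned long *base)
{
#if defined(__x86_64__)
    /* No glibc wrapper exists, so the call goes through syscall(2). */
    return (int) syscall(SYS_arch_prctl, ARCH_GET_FS, base);
#else
    (void) base;
    errno = ENOSYS;     /* assumption: treat as "not available" */
    return -1;
#endif
}
```

Reading the base is harmless. Changing it with ARCH_SET_FS is a different matter, since the threading library typically relies on FS.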
RETURN VALUE
On success, arch_prctl() returns 0; on error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
addr points to an unmapped address or is outside the process address space.
EINVAL
op is not a valid operation.
ENODEV
ARCH_SET_CPUID was requested, but the underlying hardware does not support CPUID faulting.
EPERM
addr is outside the process address space.
STANDARDS
Linux/x86-64.
NOTES
arch_prctl() is supported only on Linux/x86-64 for 64-bit programs currently.
The 64-bit base changes when a new 32-bit segment selector is loaded.
ARCH_SET_GS is disabled in some kernels.
Context switches for 64-bit segment bases are rather expensive. As an optimization, if a 32-bit TLS base address is used, arch_prctl() may use a real TLS entry as if set_thread_area(2) had been called, instead of manipulating the segment base register directly. Memory in the first 2 GB of address space can be allocated by using mmap(2) with the MAP_32BIT flag.
Because of the aforementioned optimization, using arch_prctl() and set_thread_area(2) in the same thread is dangerous, as they may overwrite each other’s TLS entries.
FS may be already used by the threading library. Programs that use ARCH_SET_FS directly are very likely to crash.
SEE ALSO
mmap(2), modify_ldt(2), prctl(2), set_thread_area(2)
AMD X86-64 Programmer’s manual
179 - Linux cli command set_mempolicy
NAME 🖥️ set_mempolicy 🖥️
set default NUMA memory policy for a thread and its children
LIBRARY
NUMA (Non-Uniform Memory Access) policy library (libnuma, -lnuma)
SYNOPSIS
#include <numaif.h>
long set_mempolicy(int mode, const unsigned long *nodemask,
unsigned long maxnode);
DESCRIPTION
set_mempolicy() sets the NUMA memory policy of the calling thread, which consists of a policy mode and zero or more nodes, to the values specified by the mode, nodemask, and maxnode arguments.
A NUMA machine has different memory controllers with different distances to specific CPUs. The memory policy defines from which node memory is allocated for the thread.
This system call defines the default policy for the thread. The thread policy governs allocation of pages in the process’s address space outside of memory ranges controlled by a more specific policy set by mbind(2). The thread default policy also controls allocation of any pages for memory-mapped files mapped using the mmap(2) call with the MAP_PRIVATE flag and that are only read (loaded) from by the thread and of memory-mapped files mapped using the mmap(2) call with the MAP_SHARED flag, regardless of the access type. The policy is applied only when a new page is allocated for the thread. For anonymous memory this is when the page is first touched by the thread.
The mode argument must specify one of MPOL_DEFAULT, MPOL_BIND, MPOL_INTERLEAVE, MPOL_WEIGHTED_INTERLEAVE, MPOL_PREFERRED, or MPOL_LOCAL (which are described in detail below). All modes except MPOL_DEFAULT require the caller to specify the node or nodes to which the mode applies, via the nodemask argument.
The mode argument may also include an optional mode flag. The supported mode flags are:
MPOL_F_NUMA_BALANCING (since Linux 5.12)
When mode is MPOL_BIND, enable the kernel NUMA balancing for the task if it is supported by the kernel. If the flag isn’t supported by the kernel, or is used with mode other than MPOL_BIND, -1 is returned and errno is set to EINVAL.
MPOL_F_RELATIVE_NODES (since Linux 2.6.26)
A nonempty nodemask specifies node IDs that are relative to the set of node IDs allowed by the process’s current cpuset.
MPOL_F_STATIC_NODES (since Linux 2.6.26)
A nonempty nodemask specifies physical node IDs. Linux will not remap the nodemask when the process moves to a different cpuset context, nor when the set of nodes allowed by the process’s current cpuset context changes.
nodemask points to a bit mask of node IDs that contains up to maxnode bits. The bit mask size is rounded to the next multiple of sizeof(unsigned long), but the kernel will use bits only up to maxnode. A NULL value of nodemask or a maxnode value of zero specifies the empty set of nodes. If the value of maxnode is zero, the nodemask argument is ignored.
Where a nodemask is required, it must contain at least one node that is on-line, allowed by the process’s current cpuset context (unless the MPOL_F_STATIC_NODES mode flag is specified), and contains memory. If the MPOL_F_STATIC_NODES flag is set in mode and a required nodemask contains no nodes that are allowed by the process’s current cpuset context, the memory policy reverts to local allocation. This effectively overrides the specified policy until the process’s cpuset context includes one or more of the nodes specified by nodemask.
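The layout of the nodemask bit mask can be sketched without calling the kernel. The helpers below are hypothetical names introduced for illustration (nodemask_words(), nodemask_set(), nodemask_isset()); they only build and inspect an array of unsigned long, which a caller would then pass as nodemask together with a suitable maxnode.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_ULONG  (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned long words needed to hold maxnode bits; the
   mask size is rounded up to a whole number of words. */
size_t
nodemask_words(unsigned long maxnode)
{
    return (maxnode + BITS_PER_ULONG - 1) / BITS_PER_ULONG;
}

/* Set the bit for node. The caller must ensure that the mask has
   room for it. */
void
nodemask_set(unsigned long *mask, unsigned long node)
{
    mask[node / BITS_PER_ULONG] |= 1UL << (node % BITS_PER_ULONG);
}

/* Return 1 if the bit for node is set, 0 otherwise. */
int
nodemask_isset(const unsigned long *mask, unsigned long node)
{
    return (int) ((mask[node / BITS_PER_ULONG]
                   >> (node % BITS_PER_ULONG)) & 1UL);
}
```

Setting nodes 0, 2, and 5 in a zeroed mask leaves the first word equal to 0x25.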
The mode argument must include one of the following values:
MPOL_DEFAULT
This mode specifies that any nondefault thread memory policy be removed, so that the memory policy “falls back” to the system default policy. The system default policy is “local allocation”; that is, allocate memory on the node of the CPU that triggered the allocation. nodemask must be specified as NULL. If the “local node” contains no free memory, the system will attempt to allocate memory from a “near by” node.
MPOL_BIND
This mode defines a strict policy that restricts memory allocation to the nodes specified in nodemask. If nodemask specifies more than one node, page allocations will come from the node with the lowest numeric node ID first, until that node contains no free memory. Allocations will then come from the node with the next highest node ID specified in nodemask and so forth, until none of the specified nodes contain free memory. Pages will not be allocated from any node not specified in the nodemask.
MPOL_INTERLEAVE
This mode interleaves page allocations across the nodes specified in nodemask in numeric node ID order. This optimizes for bandwidth instead of latency by spreading out pages and memory accesses to those pages across multiple nodes. However, accesses to a single page will still be limited to the memory bandwidth of a single node.
MPOL_WEIGHTED_INTERLEAVE (since Linux 6.9)
This mode interleaves page allocations across the nodes specified in nodemask according to the weights in /sys/kernel/mm/mempolicy/weighted_interleave. For example, if bits 0, 2, and 5 are set in nodemask, and the contents of /sys/kernel/mm/mempolicy/weighted_interleave/node0, /sys/.../node2, and /sys/.../node5 are 4, 7, and 9, respectively, then pages in this region will be allocated on nodes 0, 2, and 5 in a 4:7:9 ratio.
MPOL_PREFERRED
This mode sets the preferred node for allocation. The kernel will try to allocate pages from this node first and fall back to “near by” nodes if the preferred node is low on free memory. If nodemask specifies more than one node ID, the first node in the mask will be selected as the preferred node. If the nodemask and maxnode arguments specify the empty set, then the policy specifies “local allocation” (like the system default policy discussed above).
MPOL_LOCAL (since Linux 3.8)
This mode specifies “local allocation”; the memory is allocated on the node of the CPU that triggered the allocation (the “local node”). The nodemask and maxnode arguments must specify the empty set. If the “local node” is low on free memory, the kernel will try to allocate memory from other nodes. The kernel will allocate memory from the “local node” whenever memory for this node is available. If the “local node” is not allowed by the process’s current cpuset context, the kernel will try to allocate memory from other nodes. The kernel will allocate memory from the “local node” whenever it becomes allowed by the process’s current cpuset context.
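The 4:7:9 ratio given for MPOL_WEIGHTED_INTERLEAVE can be worked through as arithmetic. The sketch below is a model only, not the kernel's allocator: the function name weighted_split() is hypothetical, and the handling of a final partial round (leftover pages go to nodes in index order, each up to its weight) is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical model: split "pages" page allocations across n nodes
   in proportion to weights[]; out[i] receives the count for node i.
   Whole rounds follow the weights exactly; the leftover pages of a
   partial round are handed out in index order (an assumption). */
void
weighted_split(const unsigned long *weights, size_t n,
               unsigned long pages, unsigned long *out)
{
    unsigned long  total = 0, rounds, left;

    for (size_t i = 0; i < n; i++) {
        total += weights[i];
        out[i] = 0;
    }
    if (total == 0)
        return;

    rounds = pages / total;     /* complete rounds of "total" pages */
    left = pages % total;       /* pages in the final partial round */

    for (size_t i = 0; i < n; i++) {
        unsigned long  extra = left < weights[i] ? left : weights[i];

        out[i] = rounds * weights[i] + extra;
        left -= extra;
    }
}
```

With weights 4, 7, and 9, a run of 200 pages is ten complete rounds of 20, giving 40, 70, and 90 pages on the three nodes.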
The thread memory policy is preserved across an execve(2), and is inherited by child threads created using fork(2) or clone(2).
RETURN VALUE
On success, set_mempolicy() returns 0; on error, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
Part or all of the memory range specified by nodemask and maxnode points outside your accessible address space.
EINVAL
mode is invalid. Or, mode is MPOL_DEFAULT and nodemask is nonempty, or mode is MPOL_BIND or MPOL_INTERLEAVE and nodemask is empty. Or, maxnode specifies more than a page worth of bits. Or, nodemask specifies one or more node IDs that are greater than the maximum supported node ID. Or, none of the node IDs specified by nodemask are on-line and allowed by the process’s current cpuset context, or none of the specified nodes contain memory. Or, the mode argument specified both MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES. Or, the MPOL_F_NUMA_BALANCING flag isn’t supported by the kernel, or is used with a mode other than MPOL_BIND.
ENOMEM
Insufficient kernel memory was available.
STANDARDS
Linux.
HISTORY
Linux 2.6.7.
NOTES
Memory policy is not remembered if the page is swapped out. When such a page is paged back in, it will use the policy of the thread or memory range that is in effect at the time the page is allocated.
For information on library support, see numa(7).
SEE ALSO
get_mempolicy(2), getcpu(2), mbind(2), mmap(2), numa(3), cpuset(7), numa(7), numactl(8)
180 - Linux cli command mlock
NAME
mlock - lock and unlock memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
int mlock(const void addr[.len], size_t len);
int mlock2(const void addr[.len], size_t len, unsigned int flags);
int munlock(const void addr[.len], size_t len);
int mlockall(int flags);
int munlockall(void);
DESCRIPTION
mlock(), mlock2(), and mlockall() lock part or all of the calling process’s virtual address space into RAM, preventing that memory from being paged to the swap area.
munlock() and munlockall() perform the converse operation, unlocking part or all of the calling process’s virtual address space, so that pages in the specified virtual address range can be swapped out again if required by the kernel memory manager.
Memory locking and unlocking are performed in units of whole pages.
mlock(), mlock2(), and munlock()
mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.
mlock2() also locks pages in the specified range starting at addr and continuing for len bytes. However, the state of the pages contained in that range after the call returns successfully will depend on the value in the flags argument.
The flags argument can be either 0 or the following constant:
MLOCK_ONFAULT
Lock pages that are currently resident and mark the entire range so that the remaining nonresident pages are locked when they are populated by a page fault.
If flags is 0, mlock2() behaves exactly the same as mlock().
munlock() unlocks pages in the address range starting at addr and continuing for len bytes. After this call, all pages that contain a part of the specified memory range can be moved to external swap space again by the kernel.
mlockall() and munlockall()
mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data, and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.
The flags argument is constructed as the bitwise OR of one or more of the following constants:
MCL_CURRENT
Lock all pages which are currently mapped into the address space of the process.
MCL_FUTURE
Lock all pages which will become mapped into the address space of the process in the future. These could be, for instance, new pages required by a growing heap and stack as well as new memory-mapped files or shared memory regions.
MCL_ONFAULT (since Linux 4.4)
Used together with MCL_CURRENT, MCL_FUTURE, or both. Mark all current (with MCL_CURRENT) or future (with MCL_FUTURE) mappings to lock pages when they are faulted in. When used with MCL_CURRENT, all present pages are locked, but mlockall() will not fault in non-present pages. When used with MCL_FUTURE, all future mappings will be marked to lock pages when they are faulted in, but they will not be populated by the lock when the mapping is created. MCL_ONFAULT must be used with either MCL_CURRENT or MCL_FUTURE or both.
If MCL_FUTURE has been specified, then a later system call (e.g., mmap(2), sbrk(2), malloc(3)), may fail if it would cause the number of locked bytes to exceed the permitted maximum (see below). In the same circumstances, stack growth may likewise fail: the kernel will deny stack expansion and deliver a SIGSEGV signal to the process.
munlockall() unlocks all pages mapped into the address space of the calling process.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned, errno is set to indicate the error, and no changes are made to any locks in the address space of the process.
ERRORS
EAGAIN
(mlock(), mlock2(), and munlock()) Some or all of the specified address range could not be locked.
EINVAL
(mlock(), mlock2(), and munlock()) The result of the addition addr+len was less than addr (e.g., the addition may have resulted in an overflow).
EINVAL
(mlock2()) Unknown flags were specified.
EINVAL
(mlockall()) Unknown flags were specified or MCL_ONFAULT was specified without either MCL_FUTURE or MCL_CURRENT.
EINVAL
(Not on Linux) addr was not a multiple of the page size.
ENOMEM
(mlock(), mlock2(), and munlock()) Some of the specified address range does not correspond to mapped pages in the address space of the process.
ENOMEM
(mlock(), mlock2(), and munlock()) Locking or unlocking a region would result in the total number of mappings with distinct attributes (e.g., locked versus unlocked) exceeding the allowed maximum. (For example, unlocking a range in the middle of a currently locked mapping would result in three mappings: two locked mappings at each end and an unlocked mapping in the middle.)
ENOMEM
(Linux 2.6.9 and later) the caller had a nonzero RLIMIT_MEMLOCK soft resource limit, but tried to lock more memory than the limit permitted. This limit is not enforced if the process is privileged (CAP_IPC_LOCK).
ENOMEM
(Linux 2.4 and earlier) the calling process tried to lock more than half of RAM.
EPERM
The caller is not privileged, but needs privilege (CAP_IPC_LOCK) to perform the requested operation.
EPERM
(munlockall()) (Linux 2.6.8 and earlier) The caller was not privileged (CAP_IPC_LOCK).
VERSIONS
Linux
Under Linux, mlock(), mlock2(), and munlock() automatically round addr down to the nearest page boundary. However, the POSIX.1 specification of mlock() and munlock() allows an implementation to require that addr is page aligned, so portable applications should ensure this.
The VmLck field of the Linux-specific /proc/PID/status file shows how many kilobytes of memory the process with ID PID has locked using mlock(), mlock2(), mlockall(), and mmap(2) MAP_LOCKED.
STANDARDS
mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2008.
mlock2()
Linux.
On POSIX systems on which mlock() and munlock() are available, _POSIX_MEMLOCK_RANGE is defined in <unistd.h> and the number of bytes in a page can be determined from the constant PAGESIZE (if defined) in <limits.h> or by calling sysconf(_SC_PAGESIZE).
On POSIX systems on which mlockall() and munlockall() are available, _POSIX_MEMLOCK is defined in <unistd.h> to a value greater than 0. (See also sysconf(3).)
HISTORY
mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2001, POSIX.1-2008, SVr4.
mlock2()
Linux 4.4, glibc 2.27.
NOTES
Memory locking has two main applications: real-time algorithms and high-security data processing. Real-time applications require deterministic timing, and, like scheduling, paging is one major cause of unexpected program execution delays. Real-time applications will usually also switch to a real-time scheduler with sched_setscheduler(2). Cryptographic security software often handles critical bytes like passwords or secret keys as data structures. As a result of paging, these secrets could be transferred onto a persistent swap store medium, where they might be accessible to the enemy long after the security software has erased the secrets in RAM and terminated. (But be aware that the suspend mode on laptops and some desktop computers will save a copy of the system’s RAM to disk, regardless of memory locks.)
Real-time processes that are using mlockall() to prevent delays on page faults should reserve enough locked stack pages before entering the time-critical section, so that no page fault can be caused by function calls. This can be achieved by calling a function that allocates a sufficiently large automatic variable (an array) and writes to the memory occupied by this array in order to touch these stack pages. This way, enough pages will be mapped for the stack and can be locked into RAM. The dummy writes ensure that not even copy-on-write page faults can occur in the critical section.
Memory locks are not inherited by a child created via fork(2) and are automatically removed (unlocked) during an execve(2) or when the process terminates. The mlockall() MCL_FUTURE and MCL_FUTURE | MCL_ONFAULT settings are not inherited by a child created via fork(2) and are cleared during an execve(2).
Note that fork(2) will prepare the address space for a copy-on-write operation. The consequence is that any write access that follows will cause a page fault that in turn may cause high latencies for a real-time process. Therefore, it is crucial not to invoke fork(2) after an mlockall() or mlock() operation, not even from a thread which runs at a low priority within a process which also has a thread running at elevated priority.
The memory lock on an address range is automatically removed if the address range is unmapped via munmap(2).
Memory locks do not stack, that is, pages which have been locked several times by calls to mlock(), mlock2(), or mlockall() will be unlocked by a single call to munlock() for the corresponding range or by munlockall(). Pages which are mapped to several locations or by several processes stay locked into RAM as long as they are locked at least at one location or by at least one process.
If a call to mlockall() which uses the MCL_FUTURE flag is followed by another call that does not specify this flag, the changes made by the MCL_FUTURE call will be lost.
The mlock2() MLOCK_ONFAULT flag and the mlockall() MCL_ONFAULT flag allow efficient memory locking for applications that deal with large mappings where only a (small) portion of pages in the mapping are touched. In such cases, locking all of the pages in a mapping would incur a significant penalty for memory locking.
Limits and permissions
In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK) in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines a limit on how much memory the process may lock.
Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process can lock and the RLIMIT_MEMLOCK soft resource limit instead defines a limit on how much memory an unprivileged process may lock.
BUGS
In Linux 4.8 and earlier, a bug in the kernel’s accounting of locked memory for unprivileged processes (i.e., without CAP_IPC_LOCK) meant that if the region specified by addr and len overlapped an existing lock, then the already locked bytes in the overlapping region were counted twice when checking against the limit. Such double accounting could incorrectly calculate a “total locked memory” value for the process that exceeded the RLIMIT_MEMLOCK limit, with the result that mlock() and mlock2() would fail on requests that should have succeeded. This bug was fixed in Linux 4.9.
In Linux 2.4 series of kernels up to and including Linux 2.4.17, a bug caused the mlockall() MCL_FUTURE flag to be inherited across a fork(2). This was rectified in Linux 2.4.18.
Since Linux 2.6.9, if a privileged process calls mlockall(MCL_FUTURE) and later drops privileges (loses the CAP_IPC_LOCK capability by, for example, setting its effective UID to a nonzero value), then subsequent memory allocations (e.g., mmap(2), brk(2)) will fail if the RLIMIT_MEMLOCK resource limit is encountered.
SEE ALSO
mincore(2), mmap(2), setrlimit(2), shmctl(2), sysconf(3), proc(5), capabilities(7)
181 - Linux cli command stime
NAME
stime - set time
SYNOPSIS
#include <time.h>
[[deprecated]] int stime(const time_t *t);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
stime():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_SVID_SOURCE
DESCRIPTION
NOTE: This function is deprecated; use clock_settime(2) instead.
stime() sets the system’s idea of the time and date. The time, pointed to by t, is measured in seconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC). stime() may be executed only by the superuser.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
Error in getting information from user space.
EPERM
The calling process has insufficient privilege. Under Linux, the CAP_SYS_TIME privilege is required.
STANDARDS
None.
HISTORY
SVr4.
Starting with glibc 2.31, this function is no longer available to newly linked applications and is no longer declared in <time.h>.
SEE ALSO
date(1), settimeofday(2), capabilities(7)
182 - Linux cli command socket
NAME
socket - create an endpoint for communication
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int socket(int domain, int type, int protocol);
DESCRIPTION
socket() creates an endpoint for communication and returns a file descriptor that refers to that endpoint. The file descriptor returned by a successful call will be the lowest-numbered file descriptor not currently open for the process.
The domain argument specifies a communication domain; this selects the protocol family which will be used for communication. These families are defined in <sys/socket.h>. The formats currently understood by the Linux kernel include:
Name | Purpose | Man page |
AF_UNIX | Local communication | unix(7) |
AF_LOCAL | Synonym for AF_UNIX | |
AF_INET | IPv4 Internet protocols | ip(7) |
AF_AX25 | Amateur radio AX.25 protocol | ax25(4) |
AF_IPX | IPX - Novell protocols | |
AF_APPLETALK | AppleTalk | ddp(7) |
AF_X25 | ITU-T X.25 / ISO/IEC 8208 protocol | x25(7) |
AF_INET6 | IPv6 Internet protocols | ipv6(7) |
AF_DECnet | DECnet protocol sockets | |
AF_KEY | Key management protocol, originally developed for usage with IPsec | |
AF_NETLINK | Kernel user interface device | netlink(7) |
AF_PACKET | Low-level packet interface | packet(7) |
AF_RDS | Reliable Datagram Sockets (RDS) protocol | rds(7) rds-rdma(7) |
AF_PPPOX | Generic PPP transport layer, for setting up L2 tunnels (L2TP and PPPoE) | |
AF_LLC | Logical link control (IEEE 802.2 LLC) protocol | |
AF_IB | InfiniBand native addressing | |
AF_MPLS | Multiprotocol Label Switching | |
AF_CAN | Controller Area Network automotive bus protocol | |
AF_TIPC | TIPC, "cluster domain sockets" protocol | |
AF_BLUETOOTH | Bluetooth low-level socket protocol | |
AF_ALG | Interface to kernel crypto API | |
AF_VSOCK | VSOCK (originally "VMWare VSockets") protocol for hypervisor-guest communication | vsock(7) |
AF_KCM | KCM (kernel connection multiplexer) interface | |
AF_XDP | XDP (express data path) interface | |
Further details of the above address families, as well as information on several other address families, can be found in address_families(7).
The socket has the indicated type, which specifies the communication semantics. Currently defined types are:
SOCK_STREAM
Provides sequenced, reliable, two-way, connection-based byte streams. An out-of-band data transmission mechanism may be supported.
SOCK_DGRAM
Supports datagrams (connectionless, unreliable messages of a fixed maximum length).
SOCK_SEQPACKET
Provides a sequenced, reliable, two-way connection-based data transmission path for datagrams of fixed maximum length; a consumer is required to read an entire packet with each input system call.
SOCK_RAW
Provides raw network protocol access.
SOCK_RDM
Provides a reliable datagram layer that does not guarantee ordering.
SOCK_PACKET
Obsolete and should not be used in new programs; see packet(7).
Some socket types may not be implemented by all protocol families.
Since Linux 2.6.27, the type argument serves a second purpose: in addition to specifying a socket type, it may include the bitwise OR of any of the following values, to modify the behavior of socket():
SOCK_NONBLOCK
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
SOCK_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
The protocol specifies a particular protocol to be used with the socket. Normally only a single protocol exists to support a particular socket type within a given protocol family, in which case protocol can be specified as 0. However, it is possible that many protocols may exist, in which case a particular protocol must be specified in this manner. The protocol number to use is specific to the “communication domain” in which communication is to take place; see protocols(5). See getprotoent(3) on how to map protocol name strings to protocol numbers.
Sockets of type SOCK_STREAM are full-duplex byte streams. They do not preserve record boundaries. A stream socket must be in a connected state before any data may be sent or received on it. A connection to another socket is created with a connect(2) call. Once connected, data may be transferred using read(2) and write(2) calls or some variant of the send(2) and recv(2) calls. When a session has been completed a close(2) may be performed. Out-of-band data may also be transmitted as described in send(2) and received as described in recv(2).
The communications protocols which implement a SOCK_STREAM ensure that data is not lost or duplicated. If a piece of data for which the peer protocol has buffer space cannot be successfully transmitted within a reasonable length of time, then the connection is considered to be dead. When SO_KEEPALIVE is enabled on the socket the protocol checks in a protocol-specific manner if the other end is still alive. A SIGPIPE signal is raised if a process sends or receives on a broken stream; this causes naive processes, which do not handle the signal, to exit.
SOCK_SEQPACKET sockets employ the same system calls as SOCK_STREAM sockets. The only difference is that read(2) calls will return only the amount of data requested, and any data remaining in the arriving packet will be discarded. Also all message boundaries in incoming datagrams are preserved.
SOCK_DGRAM and SOCK_RAW sockets allow sending of datagrams to correspondents named in sendto(2) calls. Datagrams are generally received with recvfrom(2), which returns the next datagram along with the address of its sender.
SOCK_PACKET is an obsolete socket type to receive raw packets directly from the device driver. Use packet(7) instead.
An fcntl(2) F_SETOWN operation can be used to specify a process or process group to receive a SIGURG signal when the out-of-band data arrives or SIGPIPE signal when a SOCK_STREAM connection breaks unexpectedly. This operation may also be used to set the process or process group that receives the I/O and asynchronous notification of I/O events via SIGIO. Using F_SETOWN is equivalent to an ioctl(2) call with the FIOSETOWN or SIOCSPGRP argument.
When the network signals an error condition to the protocol module (e.g., using an ICMP message for IP) the pending error flag is set for the socket. The next operation on this socket will return the error code of the pending error. For some protocols it is possible to enable a per-socket error queue to retrieve detailed information about the error; see IP_RECVERR in ip(7).
The operation of sockets is controlled by socket level options. These options are defined in <sys/socket.h>. The functions setsockopt(2) and getsockopt(2) are used to set and get options.
RETURN VALUE
On success, a file descriptor for the new socket is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Permission to create a socket of the specified type and/or protocol is denied.
EAFNOSUPPORT
The implementation does not support the specified address family.
EINVAL
Unknown protocol, or protocol family not available.
EINVAL
Invalid flags in type.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOBUFS or ENOMEM
Insufficient memory is available. The socket cannot be created until sufficient resources are freed.
EPROTONOSUPPORT
The protocol type or the specified protocol is not supported within this domain.
Other errors may be generated by the underlying protocol modules.
STANDARDS
POSIX.1-2008.
SOCK_NONBLOCK and SOCK_CLOEXEC are Linux-specific.
HISTORY
POSIX.1-2001, 4.4BSD.
socket() appeared in 4.2BSD. It is generally portable to/from non-BSD systems supporting clones of the BSD socket layer (including System V variants).
The manifest constants used under 4.x BSD for protocol families are PF_UNIX, PF_INET, and so on, while AF_UNIX, AF_INET, and so on are used for address families. However, even the BSD man page already states: “The protocol family generally is the same as the address family”, and subsequent standards use AF_* everywhere.
EXAMPLES
An example of the use of socket() is shown in getaddrinfo(3).
SEE ALSO
accept(2), bind(2), close(2), connect(2), fcntl(2), getpeername(2), getsockname(2), getsockopt(2), ioctl(2), listen(2), read(2), recv(2), select(2), send(2), shutdown(2), socketpair(2), write(2), getprotoent(3), address_families(7), ip(7), socket(7), tcp(7), udp(7), unix(7)
“An Introductory 4.3BSD Interprocess Communication Tutorial” and “BSD Interprocess Communication Tutorial”, reprinted in UNIX Programmer’s Supplementary Documents Volume 1.
183 - Linux cli command shmop
NAME
shmop - System V shared memory operations
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/shm.h>
void *shmat(int shmid, const void *_Nullable shmaddr, int shmflg);
int shmdt(const void *shmaddr);
DESCRIPTION
shmat()
shmat() attaches the System V shared memory segment identified by shmid to the address space of the calling process. The attaching address is specified by shmaddr with one of the following criteria:
If shmaddr is NULL, the system chooses a suitable (unused) page-aligned address to attach the segment.
If shmaddr isn’t NULL and SHM_RND is specified in shmflg, the attach occurs at the address equal to shmaddr rounded down to the nearest multiple of SHMLBA.
Otherwise, shmaddr must be a page-aligned address at which the attach occurs.
In addition to SHM_RND, the following flags may be specified in the shmflg bit-mask argument:
SHM_EXEC (Linux-specific; since Linux 2.6.9)
Allow the contents of the segment to be executed. The caller must have execute permission on the segment.
SHM_RDONLY
Attach the segment for read-only access. The process must have read permission for the segment. If this flag is not specified, the segment is attached for read and write access, and the process must have read and write permission for the segment. There is no notion of a write-only shared memory segment.
SHM_REMAP (Linux-specific)
This flag specifies that the mapping of the segment should replace any existing mapping in the range starting at shmaddr and continuing for the size of the segment. (Normally, an EINVAL error would result if a mapping already exists in this address range.) In this case, shmaddr must not be NULL.
The brk(2) value of the calling process is not altered by the attach. The segment will automatically be detached at process exit. The same segment may be attached as a read and as a read-write one, and more than once, in the process’s address space.
A successful shmat() call updates the members of the shmid_ds structure (see shmctl(2)) associated with the shared memory segment as follows:
shm_atime is set to the current time.
shm_lpid is set to the process-ID of the calling process.
shm_nattch is incremented by one.
shmdt()
shmdt() detaches the shared memory segment located at the address specified by shmaddr from the address space of the calling process. The to-be-detached segment must be currently attached with shmaddr equal to the value returned by the attaching shmat() call.
On a successful shmdt() call, the system updates the members of the shmid_ds structure associated with the shared memory segment as follows:
shm_dtime is set to the current time.
shm_lpid is set to the process-ID of the calling process.
shm_nattch is decremented by one. If it becomes 0 and the segment is marked for deletion, the segment is deleted.
RETURN VALUE
On success, shmat() returns the address of the attached shared memory segment; on error, (void *) -1 is returned, and errno is set to indicate the error.
On success, shmdt() returns 0; on error, -1 is returned, and errno is set to indicate the error.
ERRORS
shmat() can fail with one of the following errors:
EACCES
The calling process does not have the required permissions for the requested attach type, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EIDRM
shmid points to a removed identifier.
EINVAL
Invalid shmid value, unaligned (i.e., not page-aligned and SHM_RND was not specified) or invalid shmaddr value, or can’t attach segment at shmaddr, or SHM_REMAP was specified and shmaddr was NULL.
ENOMEM
Could not allocate memory for the descriptor or for the page tables.
shmdt() can fail with one of the following errors:
EINVAL
There is no shared memory segment attached at shmaddr; or, shmaddr is not aligned on a page boundary.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
In SVID 3 (or perhaps earlier), the type of the shmaddr argument was changed from char * into const void *, and the returned type of shmat() from char * into void *.
NOTES
After a fork(2), the child inherits the attached shared memory segments.
After an execve(2), all attached shared memory segments are detached from the process.
Upon _exit(2), all attached shared memory segments are detached from the process.
Using shmat() with shmaddr equal to NULL is the preferred, portable way of attaching a shared memory segment. Be aware that the shared memory segment attached in this way may be attached at different addresses in different processes. Therefore, any pointers maintained within the shared memory must be made relative (typically to the starting address of the segment), rather than absolute.
On Linux, it is possible to attach a shared memory segment even if it is already marked to be deleted. However, POSIX.1 does not specify this behavior and many other implementations do not support it.
The following system parameter affects shmat():
SHMLBA
Segment low boundary address multiple. When explicitly specifying an attach address in a call to shmat(), the caller should ensure that the address is a multiple of this value. This is necessary on some architectures, in order either to ensure good CPU cache performance or to ensure that different attaches of the same segment have consistent views within the CPU cache. SHMLBA is normally some multiple of the system page size. (On many Linux architectures, SHMLBA is the same as the system page size.)
The implementation places no intrinsic per-process limit on the number of shared memory segments (SHMSEG).
EXAMPLES
The two programs shown below exchange a string using a shared memory segment. Further details about the programs are given below. First, we show a shell session demonstrating their use.
In one terminal window, we run the “reader” program, which creates a System V shared memory segment and a System V semaphore set. The program prints out the IDs of the created objects, and then waits for the semaphore to change value.
$ ./svshm_string_read
shmid = 1114194; semid = 15
In another terminal window, we run the “writer” program. The “writer” program takes three command-line arguments: the IDs of the shared memory segment and semaphore set created by the “reader”, and a string. It attaches the existing shared memory segment, copies the string to the shared memory, and modifies the semaphore value.
$ ./svshm_string_write 1114194 15 'Hello, world'
Returning to the terminal where the “reader” is running, we see that the program has ceased waiting on the semaphore and has printed the string that was copied into the shared memory segment by the writer:
Hello, world
Program source: svshm_string.h
The following header file is included by the “reader” and “writer” programs:
/* svshm_string.h
Licensed under GNU General Public License v2 or later.
*/
#ifndef SVSHM_STRING_H
#define SVSHM_STRING_H
#include <stdio.h>
#include <stdlib.h>
#include <sys/sem.h>
#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)
union semun { /* Used in calls to semctl() */
int val;
struct semid_ds *buf;
unsigned short *array;
#if defined(__linux__)
struct seminfo *__buf;
#endif
};
#define MEM_SIZE 4096
#endif // include guard
Program source: svshm_string_read.c
The “reader” program creates a shared memory segment and a semaphore set containing one semaphore. It then attaches the shared memory object into its address space and initializes the semaphore value to 1. Finally, the program waits for the semaphore value to become 0, and afterwards prints the string that has been copied into the shared memory segment by the “writer”.
/* svshm_string_read.c
Licensed under GNU General Public License v2 or later.
*/
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include "svshm_string.h"
int
main(void)
{
int semid, shmid;
char *addr;
union semun arg, dummy;
struct sembuf sop;
/* Create shared memory and semaphore set containing one
semaphore. */
shmid = shmget(IPC_PRIVATE, MEM_SIZE, IPC_CREAT | 0600);
if (shmid == -1)
errExit("shmget");
semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
if (semid == -1)
errExit("semget");
/* Attach shared memory into our address space. */
addr = shmat(shmid, NULL, SHM_RDONLY);
if (addr == (void *) -1)
errExit("shmat");
/* Initialize semaphore 0 in set with value 1. */
arg.val = 1;
if (semctl(semid, 0, SETVAL, arg) == -1)
errExit("semctl");
printf("shmid = %d; semid = %d\n", shmid, semid);
/* Wait for semaphore value to become 0. */
sop.sem_num = 0;
sop.sem_op = 0;
sop.sem_flg = 0;
if (semop(semid, &sop, 1) == -1)
errExit("semop");
/* Print the string from shared memory. */
printf("%s\n", addr);
/* Remove shared memory and semaphore set. */
if (shmctl(shmid, IPC_RMID, NULL) == -1)
errExit("shmctl");
if (semctl(semid, 0, IPC_RMID, dummy) == -1)
errExit("semctl");
exit(EXIT_SUCCESS);
}
Program source: svshm_string_write.c
The writer program takes three command-line arguments: the IDs of the shared memory segment and semaphore set that have already been created by the “reader”, and a string. It attaches the shared memory segment into its address space, and then decrements the semaphore value to 0 in order to inform the “reader” that it can now examine the contents of the shared memory.
/* svshm_string_write.c
Licensed under GNU General Public License v2 or later.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include "svshm_string.h"
int
main(int argc, char *argv[])
{
int semid, shmid;
char *addr;
size_t len;
struct sembuf sop;
if (argc != 4) {
fprintf(stderr, "Usage: %s shmid semid string\n", argv[0]);
exit(EXIT_FAILURE);
}
len = strlen(argv[3]) + 1; /* +1 to include trailing '\0' */
if (len > MEM_SIZE) {
fprintf(stderr, "String is too big!\n");
exit(EXIT_FAILURE);
}
/* Get object IDs from command-line. */
shmid = atoi(argv[1]);
semid = atoi(argv[2]);
/* Attach shared memory into our address space and copy string
(including trailing null byte) into memory. */
addr = shmat(shmid, NULL, 0);
if (addr == (void *) -1)
errExit("shmat");
memcpy(addr, argv[3], len);
/* Decrement semaphore to 0. */
sop.sem_num = 0;
sop.sem_op = -1;
sop.sem_flg = 0;
if (semop(semid, &sop, 1) == -1)
errExit("semop");
exit(EXIT_SUCCESS);
}
SEE ALSO
brk(2), mmap(2), shmctl(2), shmget(2), capabilities(7), shm_overview(7), sysvipc(7)
184 - Linux cli command vfork
NAME
vfork - create a child process and block parent
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
pid_t vfork(void);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
vfork():
Since glibc 2.12:
(_XOPEN_SOURCE >= 500) && ! (_POSIX_C_SOURCE >= 200809L)
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE
Before glibc 2.12:
_BSD_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION
Standard description
(From POSIX.1) The vfork() function has the same effect as fork(2), except that the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit(2) or one of the exec(3) family of functions.
Linux description
vfork(), just like fork(2), creates a child process of the calling process. For details and return value and errors, see fork(2).
vfork() is a special case of clone(2). It is used to create new processes without copying the page tables of the parent process. It may be useful in performance-sensitive applications where a child is created which then immediately issues an execve(2).
vfork() differs from fork(2) in that the calling thread is suspended until the child terminates (either normally, by calling _exit(2), or abnormally, after delivery of a fatal signal), or it makes a call to execve(2). Until that point, the child shares all memory with its parent, including the stack. The child must not return from the current function or call exit(3) (which would have the effect of calling exit handlers established by the parent process and flushing the parent’s stdio(3) buffers), but may call _exit(2).
As with fork(2), the child process created by vfork() inherits copies of various of the caller’s process attributes (e.g., file descriptors, signal dispositions, and current working directory); the vfork() call differs only in the treatment of the virtual address space, as described above.
Signals sent to the parent arrive after the child releases the parent’s memory (i.e., after the child terminates or calls execve(2)).
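The rules above (the child may only call execve(2) or _exit(2), and must not return or modify shared data) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the helper name run_shell_with_vfork() is hypothetical, and it assumes that /bin/sh exists. The argument vector is built in the parent before vfork() so that the child does not need to write to memory it shares with the parent.

```c
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run "/bin/sh -c command" in a vfork() child.
   Returns the child's exit status, or -1 on error or abnormal
   termination. */
int
run_shell_with_vfork(const char *command)
{
    /* Prepared before vfork(): the child only reads this array. */
    char *const argv[] = { "sh", "-c", (char *) command, NULL };
    pid_t pid;
    int status;

    pid = vfork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: the parent is suspended and its memory is shared.
           Only execve() or _exit() is called from here. */
        execve("/bin/sh", argv, environ);
        _exit(127);             /* Reached only if execve() failed */
    }

    /* Parent: resumes once the child has called execve() or exited. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_shell_with_vfork("exit 7") is expected to return 7. The child uses _exit(2) rather than exit(3) so that the parent's stdio buffers and exit handlers are left untouched.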
Historic description
Under Linux, fork(2) is implemented using copy-on-write pages, so the only penalty incurred by fork(2) is the time and memory required to duplicate the parent’s page tables, and to create a unique task structure for the child. However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterward an exec(3) is done. Thus, for greater efficiency, BSD introduced the vfork() system call, which did not fully copy the address space of the parent process, but borrowed the parent’s memory and thread of control until a call to execve(2) or an exit occurred. The parent process was suspended while the child was using its resources. The use of vfork() was tricky: for example, not modifying data in the parent process depended on knowing which variables were held in a register.
VERSIONS
The requirements put on vfork() by the standards are weaker than those put on fork(2), so an implementation where the two are synonymous is compliant. In particular, the programmer cannot rely on the parent remaining blocked until the child either terminates or calls execve(2), and cannot rely on any specific behavior with respect to shared memory.
Some consider the semantics of vfork() to be an architectural blemish, and the 4.2BSD man page stated: “This system call will be eliminated when proper system sharing mechanisms are implemented. Users should not depend on the memory sharing semantics of vfork as it will, in that case, be made synonymous to fork.” However, even though modern memory management hardware has decreased the performance difference between fork(2) and vfork(), there are various reasons why Linux and other systems have retained vfork():
Some performance-critical applications require the small performance advantage conferred by vfork().
vfork() can be implemented on systems that lack a memory-management unit (MMU), but fork(2) can’t be implemented on such systems. (POSIX.1-2008 removed vfork() from the standard; the POSIX rationale for the posix_spawn(3) function notes that that function, which provides functionality equivalent to fork(2)+ exec(3), is designed to be implementable on systems that lack an MMU.)
On systems where memory is constrained, vfork() avoids the need to temporarily commit memory (see the description of /proc/sys/vm/overcommit_memory in proc(5)) in order to execute a new program. (This can be especially beneficial where a large parent process wishes to execute a small helper program in a child process.) By contrast, using fork(2) in this scenario requires either committing an amount of memory equal to the size of the parent process (if strict overcommitting is in force) or overcommitting memory with the risk that a process is terminated by the out-of-memory (OOM) killer.
Linux notes
Fork handlers established using pthread_atfork(3) are not called when a multithreaded program employing the NPTL threading library calls vfork(). Fork handlers are called in this case in a program using the LinuxThreads threading library. (See pthreads(7) for a description of Linux threading libraries.)
A call to vfork() is equivalent to calling clone(2) with flags specified as:
CLONE_VM | CLONE_VFORK | SIGCHLD
STANDARDS
None.
HISTORY
4.3BSD; POSIX.1-2001 (but marked OBSOLETE). POSIX.1-2008 removes the specification of vfork().
The vfork() system call appeared in 3.0BSD. In 4.4BSD it was made synonymous to fork(2) but NetBSD introduced it again; see . In Linux, it was equivalent to fork(2) until Linux 2.2.0-pre6 or so. Since Linux 2.2.0-pre9 (on i386, somewhat later on other architectures) it is an independent system call. Support was added in glibc 2.0.112.
CAVEATS
The child process should take care not to modify the memory in unintended ways, since such changes will be seen by the parent process once the child terminates or executes another program. In this regard, signal handlers can be especially problematic: if a signal handler that is invoked in the child of vfork() changes memory, those changes may result in an inconsistent process state from the perspective of the parent process (e.g., memory changes would be visible in the parent, but changes to the state of open file descriptors would not be visible).
When vfork() is called in a multithreaded process, only the calling thread is suspended until the child terminates or executes a new program. This means that the child is sharing an address space with other running code. This can be dangerous if another thread in the parent process changes credentials (using setuid(2) or similar), since there are now two processes with different privilege levels running in the same address space. As an example of the dangers, suppose that a multithreaded program running as root creates a child using vfork(). After the vfork(), a thread in the parent process drops the process to an unprivileged user in order to run some untrusted code (e.g., via a plug-in opened with dlopen(3)). In this case, attacks are possible where the parent process uses mmap(2) to map in code that will be executed by the privileged child process.
BUGS
Details of the signal handling are obscure and differ between systems. The BSD man page states: “To avoid a possible deadlock situation, processes that are children in the middle of a vfork() are never sent SIGTTOU or SIGTTIN signals; rather, output or ioctls are allowed and input attempts result in an end-of-file indication.”
SEE ALSO
clone(2), execve(2), _exit(2), fork(2), unshare(2), wait(2)
185 - Linux cli command lremovexattr
NAME
lremovexattr - remove an extended attribute
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
int removexattr(const char *path, const char *name);
int lremovexattr(const char *path, const char *name);
int fremovexattr(int fd, const char *name);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
removexattr() removes the extended attribute identified by name and associated with the given path in the filesystem.
lremovexattr() is identical to removexattr(), except in the case of a symbolic link, where the extended attribute is removed from the link itself, not the file that it refers to.
fremovexattr() is identical to removexattr(), only the extended attribute is removed from the open file referred to by fd (as returned by open(2)) in place of path.
An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode.
RETURN VALUE
On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
ENODATA
The named attribute does not exist.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
In addition, the errors documented in stat(2) can also occur.
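The return value and error conventions above can be illustrated with a small wrapper. This is a sketch only: the helper name remove_link_xattr() and its three-way return convention are assumptions made for illustration. It treats ENODATA ("attribute does not exist") as a distinct, non-fatal outcome and leaves all other errors, including ENOTSUP and the stat(2) errors, to the caller.

```c
#include <errno.h>
#include <sys/xattr.h>

/* Hypothetical helper: remove the extended attribute "name" from "path"
   without following a trailing symbolic link.
   Returns 0 if the attribute was removed, 1 if it did not exist
   (ENODATA), and -1 on any other error, with errno left as set by
   lremovexattr(). */
int
remove_link_xattr(const char *path, const char *name)
{
    if (lremovexattr(path, name) == 0)
        return 0;

    if (errno == ENODATA)
        return 1;

    return -1;
}
```

A call such as remove_link_xattr("some_link", "user.comment") would operate on the link itself; the path and the attribute name here are placeholders.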
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
SEE ALSO
getfattr(1), setfattr(1), getxattr(2), listxattr(2), open(2), setxattr(2), stat(2), symlink(7), xattr(7)
186 - Linux cli command bpf
NAME
bpf - perform a command on an extended BPF map or program
SYNOPSIS
#include <linux/bpf.h>
int bpf(int cmd, union bpf_attr *attr, unsigned int size);
DESCRIPTION
The bpf() system call performs a range of operations related to extended Berkeley Packet Filters. Extended BPF (or eBPF) is similar to the original (“classic”) BPF (cBPF) used to filter network packets. For both cBPF and eBPF programs, the kernel statically analyzes the programs before loading them, in order to ensure that they cannot harm the running system.
eBPF extends cBPF in multiple ways, including the ability to call a fixed set of in-kernel helper functions (via the BPF_CALL opcode extension provided by eBPF) and access shared data structures such as eBPF maps.
Extended BPF Design/Architecture
eBPF maps are a generic data structure for storage of different data types. Data types are generally treated as binary blobs, so a user just specifies the size of the key and the size of the value at map-creation time. In other words, a key/value for a given map can have an arbitrary structure.
A user process can create multiple maps (with key/value-pairs being opaque bytes of data) and access them via file descriptors. Different eBPF programs can access the same maps in parallel. It’s up to the user process and eBPF program to decide what they store inside maps.
There’s one special map type, called a program array. This type of map stores file descriptors referring to other eBPF programs. When a lookup in the map is performed, the program flow is redirected in-place to the beginning of another eBPF program and does not return back to the calling program. The level of nesting has a fixed limit of 32, so that infinite loops cannot be crafted. At run time, the program file descriptors stored in the map can be modified, so program functionality can be altered based on specific requirements. All programs referred to in a program-array map must have been previously loaded into the kernel via bpf(). If a map lookup fails, the current program continues its execution. See BPF_MAP_TYPE_PROG_ARRAY below for further details.
Generally, eBPF programs are loaded by the user process and automatically unloaded when the process exits. In some cases, for example, tc-bpf(8), the program will continue to stay alive inside the kernel even after the process that loaded the program exits. In that case, the tc subsystem holds a reference to the eBPF program after the file descriptor has been closed by the user-space program. Thus, whether a specific program continues to live inside the kernel depends on how it is further attached to a given kernel subsystem after it was loaded via bpf().
Each eBPF program is a set of instructions that is safe to run until its completion. An in-kernel verifier statically determines that the eBPF program terminates and is safe to execute. During verification, the kernel increments reference counts for each of the maps that the eBPF program uses, so that the attached maps can’t be removed until the program is unloaded.
eBPF programs can be attached to different events. These events can be the arrival of network packets, tracing events, classification events by network queueing disciplines (for eBPF programs attached to a tc(8) classifier), and other types that may be added in the future. A new event triggers execution of the eBPF program, which may store information about the event in eBPF maps. Beyond storing data, eBPF programs may call a fixed set of in-kernel helper functions.
The same eBPF program can be attached to multiple events and different eBPF programs can access the same map:
tracing     tracing    tracing    packet      packet     packet
event A     event B    event C    on eth0     on eth1    on eth2
 |             |         |          |           |          ^
 |             |         |          |           v          |
 --> tracing <--     tracing      socket    tc ingress   tc egress
      prog_1          prog_2      prog_3    classifier    action
      |  |              |           |         prog_4      prog_5
   |---  -----|  |------|          map_3        |           |
 map_1       map_2                              --| map_4 |--
Arguments
The operation to be performed by the bpf() system call is determined by the cmd argument. Each operation takes an accompanying argument, provided via attr, which is a pointer to a union of type bpf_attr (see below). The unused fields and padding must be zeroed out before the call. The size argument is the size of the union pointed to by attr.
The value provided in cmd is one of the following:
BPF_MAP_CREATE
Create a map and return a file descriptor that refers to the map. The close-on-exec file descriptor flag (see fcntl(2)) is automatically enabled for the new file descriptor.
BPF_MAP_LOOKUP_ELEM
Look up an element by key in a specified map and return its value.
BPF_MAP_UPDATE_ELEM
Create or update an element (key/value pair) in a specified map.
BPF_MAP_DELETE_ELEM
Look up and delete an element by key in a specified map.
BPF_MAP_GET_NEXT_KEY
Look up an element by key in a specified map and return the key of the next element.
BPF_PROG_LOAD
Verify and load an eBPF program, returning a new file descriptor associated with the program. The close-on-exec file descriptor flag (see fcntl(2)) is automatically enabled for the new file descriptor.
The bpf_attr union consists of various anonymous structures that are used by different bpf() commands:
union bpf_attr {
struct { /* Used by BPF_MAP_CREATE */
__u32 map_type;
__u32 key_size; /* size of key in bytes */
__u32 value_size; /* size of value in bytes */
__u32 max_entries; /* maximum number of entries
in a map */
};
struct { /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
commands */
__u32 map_fd;
__aligned_u64 key;
union {
__aligned_u64 value;
__aligned_u64 next_key;
};
__u64 flags;
};
struct { /* Used by BPF_PROG_LOAD */
__u32 prog_type;
__u32 insn_cnt;
__aligned_u64 insns; /* 'const struct bpf_insn *' */
__aligned_u64 license; /* 'const char *' */
__u32 log_level; /* verbosity level of verifier */
__u32 log_size; /* size of user buffer */
__aligned_u64 log_buf; /* user supplied 'char *'
buffer */
__u32 kern_version;
/* checked when prog_type=kprobe
(since Linux 4.1) */
};
} __attribute__((aligned(8)));
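The wrapper functions shown on this page call bpf() and ptr_to_u64() without defining them. One possible way to provide them is sketched below, assuming that the C library offers no bpf() wrapper, so the call is made through syscall(2). The helper create_map_zeroed() is hypothetical; it shows memset() as a conservative way to meet the requirement that unused fields and padding of bpf_attr be zeroed before the call.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: no C library wrapper is available, so invoke the system
   call directly. */
int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Convert a user-space pointer to the 64-bit form used in bpf_attr. */
static inline uint64_t
ptr_to_u64(const void *ptr)
{
    return (uint64_t) (uintptr_t) ptr;
}

/* Hypothetical helper: create a map after zeroing the whole union, so
   that every unused field and padding byte is zero. */
int
create_map_zeroed(enum bpf_map_type map_type, unsigned int key_size,
                  unsigned int value_size, unsigned int max_entries)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = map_type;
    attr.key_size = key_size;
    attr.value_size = value_size;
    attr.max_entries = max_entries;

    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

On success the returned file descriptor should eventually be passed to close(2); on failure, -1 is returned and errno describes the error (for example, EPERM when the caller lacks the required privileges).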
eBPF maps
Maps are a generic data structure for storage of different types of data. They allow sharing of data between eBPF kernel programs, and also between kernel and user-space applications.
Each map type has the following attributes:
type
maximum number of elements
key size in bytes
value size in bytes
The following wrapper functions demonstrate how various bpf() commands can be used to access the maps. The functions use the cmd argument to invoke different operations.
BPF_MAP_CREATE
The BPF_MAP_CREATE command creates a new map, returning a new file descriptor that refers to the map.
int
bpf_create_map(enum bpf_map_type map_type,
unsigned int key_size,
unsigned int value_size,
unsigned int max_entries)
{
union bpf_attr attr = {
.map_type = map_type,
.key_size = key_size,
.value_size = value_size,
.max_entries = max_entries
};
return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
The new map has the type specified by map_type, and attributes as specified in key_size, value_size, and max_entries. On success, this operation returns a file descriptor. On error, -1 is returned and errno is set to EINVAL, EPERM, or ENOMEM.
The key_size and value_size attributes will be used by the verifier during program loading to check that the program is calling bpf_map_*_elem() helper functions with a correctly initialized key and to check that the program doesn’t access the map element value beyond the specified value_size. For example, when a map is created with a key_size of 8 and the eBPF program calls
bpf_map_lookup_elem(map_fd, fp - 4)
the program will be rejected, since the in-kernel helper function
bpf_map_lookup_elem(map_fd, void *key)
expects to read 8 bytes from the location pointed to by key, but the fp - 4 (where fp is the top of the stack) starting address will cause out-of-bounds stack access.
Similarly, when a map is created with a value_size of 1 and the eBPF program contains
value = bpf_map_lookup_elem(...);
*(u32 *) value = 1;
the program will be rejected, since it accesses the value pointer beyond the specified 1 byte value_size limit.
Currently, the following values are supported for map_type:
enum bpf_map_type {
BPF_MAP_TYPE_UNSPEC, /* Reserve 0 as invalid map type */
BPF_MAP_TYPE_HASH,
BPF_MAP_TYPE_ARRAY,
BPF_MAP_TYPE_PROG_ARRAY,
BPF_MAP_TYPE_PERF_EVENT_ARRAY,
BPF_MAP_TYPE_PERCPU_HASH,
BPF_MAP_TYPE_PERCPU_ARRAY,
BPF_MAP_TYPE_STACK_TRACE,
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
BPF_MAP_TYPE_LPM_TRIE,
BPF_MAP_TYPE_ARRAY_OF_MAPS,
BPF_MAP_TYPE_HASH_OF_MAPS,
BPF_MAP_TYPE_DEVMAP,
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_CPUMAP,
BPF_MAP_TYPE_XSKMAP,
BPF_MAP_TYPE_SOCKHASH,
BPF_MAP_TYPE_CGROUP_STORAGE,
BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
BPF_MAP_TYPE_QUEUE,
BPF_MAP_TYPE_STACK,
/* See /usr/include/linux/bpf.h for the full list. */
};
map_type selects one of the available map implementations in the kernel. For all map types, eBPF programs access maps with the same bpf_map_lookup_elem() and bpf_map_update_elem() helper functions. Further details of the various map types are given below.
BPF_MAP_LOOKUP_ELEM
The BPF_MAP_LOOKUP_ELEM command looks up an element with a given key in the map referred to by the file descriptor fd.
int
bpf_lookup_elem(int fd, const void *key, void *value)
{
union bpf_attr attr = {
.map_fd = fd,
.key = ptr_to_u64(key),
.value = ptr_to_u64(value),
};
return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
If an element is found, the operation returns zero and stores the element’s value into value, which must point to a buffer of value_size bytes.
If no element is found, the operation returns -1 and sets errno to ENOENT.
BPF_MAP_UPDATE_ELEM
The BPF_MAP_UPDATE_ELEM command creates or updates an element with a given key/value in the map referred to by the file descriptor fd.
int
bpf_update_elem(int fd, const void *key, const void *value,
uint64_t flags)
{
union bpf_attr attr = {
.map_fd = fd,
.key = ptr_to_u64(key),
.value = ptr_to_u64(value),
.flags = flags,
};
return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}
The flags argument should be specified as one of the following:
BPF_ANY
Create a new element or update an existing element.
BPF_NOEXIST
Create a new element only if it did not exist.
BPF_EXIST
Update an existing element.
On success, the operation returns zero. On error, -1 is returned and errno is set to EINVAL, EPERM, ENOMEM, or E2BIG. E2BIG indicates that the number of elements in the map reached the max_entries limit specified at map creation time. EEXIST will be returned if flags specifies BPF_NOEXIST and the element with key already exists in the map. ENOENT will be returned if flags specifies BPF_EXIST and the element with key doesn’t exist in the map.
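The effect of the flags argument can be illustrated with an "insert only if absent" operation. The sketch below is an assumption-laden example rather than an established API: the helper put_if_absent() and its return convention are hypothetical, and bpf() is defined locally through syscall(2). It passes BPF_NOEXIST and interprets EEXIST as "the key was already present".

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Hypothetical helper: add key/value to the map only if key is absent.
   Returns 1 if the element was created, 0 if the key already existed
   (EEXIST), and -1 on any other error. */
int
put_if_absent(int map_fd, const void *key, const void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (uint32_t) map_fd;
    attr.key = (uint64_t) (uintptr_t) key;
    attr.value = (uint64_t) (uintptr_t) value;
    attr.flags = BPF_NOEXIST;

    if (bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)) == 0)
        return 1;

    return (errno == EEXIST) ? 0 : -1;
}
```

Replacing BPF_NOEXIST with BPF_EXIST would give the opposite behavior (update only), with ENOENT reported when the key is missing.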
BPF_MAP_DELETE_ELEM
The BPF_MAP_DELETE_ELEM command deletes the element whose key is key from the map referred to by the file descriptor fd.
int
bpf_delete_elem(int fd, const void *key)
{
union bpf_attr attr = {
.map_fd = fd,
.key = ptr_to_u64(key),
};
return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
}
On success, zero is returned. If the element is not found, -1 is returned and errno is set to ENOENT.
BPF_MAP_GET_NEXT_KEY
The BPF_MAP_GET_NEXT_KEY command looks up an element by key in the map referred to by the file descriptor fd and sets the next_key pointer to the key of the next element.
int
bpf_get_next_key(int fd, const void *key, void *next_key)
{
union bpf_attr attr = {
.map_fd = fd,
.key = ptr_to_u64(key),
.next_key = ptr_to_u64(next_key),
};
return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
}
If key is found, the operation returns zero and sets the next_key pointer to the key of the next element. If key is not found, the operation returns zero and sets the next_key pointer to the key of the first element. If key is the last element, -1 is returned and errno is set to ENOENT. Other possible errno values are ENOMEM, EFAULT, EPERM, and EINVAL. This method can be used to iterate over all elements in the map.
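The iteration technique mentioned above can be sketched as a loop that repeatedly asks for the key following the one just returned, stopping when ENOENT signals that the last key has been reached. The helper count_map_keys() is hypothetical. It starts with a null key pointer, on the assumption (true since Linux 4.12, to the best of our knowledge) that a null key yields the first key; on older kernels a key known to be absent from the map would be needed instead.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Hypothetical helper: count the keys in a map by walking it with
   BPF_MAP_GET_NEXT_KEY. Returns the number of keys, or -1 on error. */
long
count_map_keys(int map_fd, size_t key_size)
{
    union bpf_attr attr;
    const void *key = NULL;     /* Assumption: NULL requests the first key */
    unsigned char *cur, *next;
    long count = 0;

    cur = malloc(key_size ? key_size : 1);
    next = malloc(key_size ? key_size : 1);
    if (cur == NULL || next == NULL) {
        free(cur);
        free(next);
        return -1;
    }

    for (;;) {
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = (uint32_t) map_fd;
        attr.key = (uint64_t) (uintptr_t) key;
        attr.next_key = (uint64_t) (uintptr_t) next;

        if (bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)) == -1) {
            if (errno != ENOENT)
                count = -1;     /* Real error, not end of iteration */
            break;
        }

        count++;
        memcpy(cur, next, key_size);    /* Continue from the key just seen */
        key = cur;
    }

    free(cur);
    free(next);
    return count;
}
```

Because a key that is not found causes the first key to be returned, a walk that runs while other code deletes elements may restart from the beginning; callers that need an exact count should take that into account.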
close(map_fd)
Delete the map referred to by the file descriptor map_fd. When the user-space program that created a map exits, all maps will be deleted automatically (but see NOTES).
eBPF map types
The following map types are supported:
BPF_MAP_TYPE_HASH
Hash-table maps have the following characteristics:
Maps are created and destroyed by user-space programs. Both user-space and eBPF programs can perform lookup, update, and delete operations.
The kernel takes care of allocating and freeing key/value pairs.
The map_update_elem() helper will fail to insert a new element when the max_entries limit is reached. (This ensures that eBPF programs cannot exhaust memory.)
map_update_elem() replaces existing elements atomically.
Hash-table maps are optimized for speed of lookup.
BPF_MAP_TYPE_ARRAY
Array maps have the following characteristics:
Optimized for fastest possible lookup. In the future the verifier/JIT compiler may recognize lookup() operations that employ a constant key and optimize them into a constant pointer. It is possible to optimize a non-constant key into direct pointer arithmetic as well, since pointers and value_size are constant for the life of the eBPF program. In other words, array_map_lookup_elem() may be ‘inlined’ by the verifier/JIT compiler while preserving concurrent access to this map from user space.
All array elements are pre-allocated and zero-initialized at init time.
The key is an array index, and must be exactly four bytes.
map_delete_elem() fails with the error EINVAL, since elements cannot be deleted.
map_update_elem() replaces elements in a nonatomic fashion; for atomic updates, a hash-table map should be used instead. There is however one special case that can also be used with arrays: the atomic built-in __sync_fetch_and_add() can be used on 32 and 64 bit atomic counters. For example, it can be applied on the whole value itself if it represents a single counter, or in case of a structure containing multiple counters, it could be used on individual counters. This is quite often useful for aggregation and accounting of events.
Among the uses for array maps are the following:
As “global” eBPF variables: an array of 1 element whose key is (index) 0 and where the value is a collection of ‘global’ variables which eBPF programs can use to keep state between events.
Aggregation of tracing events into a fixed set of buckets.
Accounting of networking events, for example, number of packets and packet sizes.
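The array-map properties listed above (a four-byte index key and zero-initialized, pre-allocated elements) can be observed from user space with a sketch such as the following. The helper read_array_slot0() is hypothetical, bpf() is defined locally through syscall(2), and the call may fail with EPERM or a similar error when the caller is not permitted to create maps.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int) syscall(__NR_bpf, cmd, attr, size);
}

/* Hypothetical helper: create a one-element array map holding a 64-bit
   "global" value, read element 0 into *out, and close the map.
   Returns 0 on success, or -1 on error. */
int
read_array_slot0(uint64_t *out)
{
    union bpf_attr attr;
    uint32_t index = 0;         /* Array keys are 4-byte indexes */
    int fd, rc;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_ARRAY;
    attr.key_size = sizeof(index);
    attr.value_size = sizeof(*out);
    attr.max_entries = 1;

    fd = bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    if (fd == -1)
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (uint32_t) fd;
    attr.key = (uint64_t) (uintptr_t) &index;
    attr.value = (uint64_t) (uintptr_t) out;

    rc = bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
    close(fd);
    return rc;
}
```

When the call succeeds, *out is expected to be zero, since no update has been made to the freshly created array.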
BPF_MAP_TYPE_PROG_ARRAY (since Linux 4.2)
A program array map is a special kind of array map whose map values contain only file descriptors referring to other eBPF programs. Thus, both the key_size and value_size must be exactly four bytes. This map is used in conjunction with the bpf_tail_call() helper.
This means that an eBPF program with a program array map attached to it can call from kernel side into
void bpf_tail_call(void *context, void *prog_map,
unsigned int index);
and therefore replace its own program flow with the one from the program at the given program array slot, if present. This can be regarded as kind of a jump table to a different eBPF program. The invoked program will then reuse the same stack. When a jump into the new program has been performed, it won’t return to the old program anymore.
If no eBPF program is found at the given index of the program array (because the map slot doesn’t contain a valid program file descriptor, the specified lookup index/key is out of bounds, or the limit of 32 nested calls has been exceeded), execution continues with the current eBPF program. This can be used as a fall-through for default cases.
A program array map is useful, for example, in tracing or networking, to handle individual system calls or protocols in their own subprograms and use their identifiers as an individual map index. This approach may result in performance benefits, and also makes it possible to overcome the maximum instruction limit of a single eBPF program. In dynamic environments, a user-space daemon might atomically replace individual subprograms at run-time with newer versions to alter overall program behavior, for instance, if global policies change.
eBPF programs
The BPF_PROG_LOAD command is used to load an eBPF program into the kernel. The return value for this command is a new file descriptor associated with this eBPF program.
char bpf_log_buf[LOG_BUF_SIZE];
int
bpf_prog_load(enum bpf_prog_type type,
const struct bpf_insn *insns, int insn_cnt,
const char *license)
{
union bpf_attr attr = {
.prog_type = type,
.insns = ptr_to_u64(insns),
.insn_cnt = insn_cnt,
.license = ptr_to_u64(license),
.log_buf = ptr_to_u64(bpf_log_buf),
.log_size = LOG_BUF_SIZE,
.log_level = 1,
};
return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}
prog_type is one of the available program types:
enum bpf_prog_type {
BPF_PROG_TYPE_UNSPEC, /* Reserve 0 as invalid
program type */
BPF_PROG_TYPE_SOCKET_FILTER,
BPF_PROG_TYPE_KPROBE,
BPF_PROG_TYPE_SCHED_CLS,
BPF_PROG_TYPE_SCHED_ACT,
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
BPF_PROG_TYPE_CGROUP_SKB,
BPF_PROG_TYPE_CGROUP_SOCK,
BPF_PROG_TYPE_LWT_IN,
BPF_PROG_TYPE_LWT_OUT,
BPF_PROG_TYPE_LWT_XMIT,
BPF_PROG_TYPE_SOCK_OPS,
BPF_PROG_TYPE_SK_SKB,
BPF_PROG_TYPE_CGROUP_DEVICE,
BPF_PROG_TYPE_SK_MSG,
BPF_PROG_TYPE_RAW_TRACEPOINT,
BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2,
BPF_PROG_TYPE_SK_REUSEPORT,
BPF_PROG_TYPE_FLOW_DISSECTOR,
/* See /usr/include/linux/bpf.h for the full list. */
};
For further details of eBPF program types, see below.
The remaining fields of bpf_attr are set as follows:
insns is an array of struct bpf_insn instructions.
insn_cnt is the number of instructions in the program referred to by insns.
license is a license string, which must be GPL compatible to call helper functions marked gpl_only. (The licensing rules are the same as for kernel modules, so that also dual licenses, such as “Dual BSD/GPL”, may be used.)
log_buf is a pointer to a caller-allocated buffer in which the in-kernel verifier can store the verification log. This log is a multi-line string that can be checked by the program author in order to understand how the verifier came to the conclusion that the eBPF program is unsafe. The format of the output can change at any time as the verifier evolves.
log_size is the size of the buffer pointed to by log_buf. If the size of the buffer is not large enough to store all verifier messages, -1 is returned and errno is set to ENOSPC.
log_level is the verbosity level of the verifier. A value of zero means that the verifier will not provide a log; in this case, log_buf must be a null pointer, and log_size must be zero.
Applying close(2) to the file descriptor returned by BPF_PROG_LOAD will unload the eBPF program (but see NOTES).
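The field descriptions above can be put together in a small self-contained sketch. It assumes that the <linux/bpf.h> kernel header is installed and calls the system call directly through syscall(2); the function name load_trivial_socket_filter(), the buffer name verifier_log, and the bound of five retries on EAGAIN are illustrative choices, not part of any API. The program it loads does nothing except return 0.

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LOG_BUF_SIZE 65536

static char verifier_log[LOG_BUF_SIZE];

/* Hypothetical helper: load a two-instruction socket filter
   ("r0 = 0; exit") and return its file descriptor, or -1 with
   errno set. */
int
load_trivial_socket_filter(void)
{
    struct bpf_insn prog[] = {
        /* r0 = 0 */
        { .code = BPF_ALU64 | BPF_MOV | BPF_K,
          .dst_reg = BPF_REG_0, .imm = 0 },
        /* return r0 */
        { .code = BPF_JMP | BPF_EXIT },
    };
    union bpf_attr attr;
    int fd, tries = 0, saved_errno;

    memset(&attr, 0, sizeof(attr));    /* unused fields stay zero */
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns     = (__u64) (unsigned long) prog;
    attr.insn_cnt  = sizeof(prog) / sizeof(prog[0]);
    attr.license   = (__u64) (unsigned long) "GPL";
    attr.log_buf   = (__u64) (unsigned long) verifier_log;
    attr.log_size  = LOG_BUF_SIZE;
    attr.log_level = 1;

    /* EAGAIN: the verifier saw a pending signal; retry unchanged
       (the bound of 5 attempts is an arbitrary choice). */
    do {
        fd = (int) syscall(SYS_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
    } while (fd == -1 && errno == EAGAIN && ++tries < 5);

    if (fd == -1) {
        saved_errno = errno;
        fprintf(stderr, "BPF_PROG_LOAD: %s\n%s",
                strerror(saved_errno), verifier_log);
        errno = saved_errno;           /* preserve errno for the caller */
    }
    return fd;
}
```

On a system where the caller lacks the required privilege, the call is expected to fail with EPERM; on success, the caller owns the returned descriptor and should close(2) it when done.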
Maps are accessible from eBPF programs and are used to exchange data between eBPF programs and between eBPF programs and user-space programs. For example, eBPF programs can process various events (like kprobe, packets) and store their data into a map, and user-space programs can then fetch data from the map. Conversely, user-space programs can use a map as a configuration mechanism, populating the map with values checked by the eBPF program, which then modifies its behavior on the fly according to those values.
eBPF program types
The eBPF program type (prog_type) determines the subset of kernel helper functions that the program may call. The program type also determines the program input (context), that is, the format of struct bpf_context (which is the data blob passed into the eBPF program as the first argument).
For example, a tracing program does not have the exact same subset of helper functions as a socket filter program (though they may have some helpers in common). Similarly, the input (context) for a tracing program is a set of register values, while for a socket filter it is a network packet.
The set of functions available to eBPF programs of a given type may increase in the future.
The following program types are supported:
BPF_PROG_TYPE_SOCKET_FILTER (since Linux 3.19)
Currently, the set of functions for BPF_PROG_TYPE_SOCKET_FILTER is:
bpf_map_lookup_elem(map_fd, void *key)
/* look up key in a map_fd */
bpf_map_update_elem(map_fd, void *key, void *value)
/* update key/value */
bpf_map_delete_elem(map_fd, void *key)
/* delete key in a map_fd */
The bpf_context argument is a pointer to a struct __sk_buff.
BPF_PROG_TYPE_KPROBE (since Linux 4.1)
[To be documented]
BPF_PROG_TYPE_SCHED_CLS (since Linux 4.1)
[To be documented]
BPF_PROG_TYPE_SCHED_ACT (since Linux 4.1)
[To be documented]
Events
Once a program is loaded, it can be attached to an event. Various kernel subsystems have different ways to do so.
Since Linux 3.19, the following call will attach the program prog_fd to the socket sockfd, which was created by an earlier call to socket(2):
setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF,
&prog_fd, sizeof(prog_fd));
Since Linux 4.1, the following call may be used to attach the eBPF program referred to by the file descriptor prog_fd to a perf event file descriptor, event_fd, that was created by a previous call to perf_event_open(2):
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
RETURN VALUE
For a successful call, the return value depends on the operation:
BPF_MAP_CREATE
The new file descriptor associated with the eBPF map.
BPF_PROG_LOAD
The new file descriptor associated with the eBPF program.
All other commands
Zero.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
E2BIG
The eBPF program is too large or a map reached the max_entries limit (maximum number of elements).
EACCES
For BPF_PROG_LOAD, even though all program instructions are valid, the program has been rejected because it was deemed unsafe. This may be because it may have accessed a disallowed memory region or an uninitialized stack/register or because the function constraints don’t match the actual types or because there was a misaligned memory access. In this case, it is recommended to call bpf() again with log_level = 1 and examine log_buf for the specific reason provided by the verifier.
EAGAIN
For BPF_PROG_LOAD, indicates that needed resources are blocked. This happens when the verifier detects pending signals while it is checking the validity of the bpf program. In this case, just call bpf() again with the same parameters.
EBADF
fd is not an open file descriptor.
EFAULT
One of the pointers (key or value or log_buf or insns) is outside the accessible address space.
EINVAL
The value specified in cmd is not recognized by this kernel.
EINVAL
For BPF_MAP_CREATE, either map_type or attributes are invalid.
EINVAL
For BPF_MAP_*_ELEM commands, some of the fields of union bpf_attr that are not used by this command are not set to zero.
EINVAL
For BPF_PROG_LOAD, indicates an attempt to load an invalid program. eBPF programs can be deemed invalid due to unrecognized instructions, the use of reserved fields, jumps out of range, infinite loops or calls of unknown functions.
ENOENT
For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM, indicates that the element with the given key was not found.
ENOMEM
Cannot allocate sufficient memory.
EPERM
The call was made without sufficient privilege (without the CAP_SYS_ADMIN capability).
STANDARDS
Linux.
HISTORY
Linux 3.18.
NOTES
Prior to Linux 4.4, all bpf() commands require the caller to have the CAP_SYS_ADMIN capability. From Linux 4.4 onwards, an unprivileged user may create limited programs of type BPF_PROG_TYPE_SOCKET_FILTER and associated maps. However, they may not store kernel pointers within the maps and are presently limited to the following helper functions:
- get_random
- get_smp_processor_id
- tail_call
- ktime_get_ns
Unprivileged access may be blocked by writing the value 1 to the file /proc/sys/kernel/unprivileged_bpf_disabled.
eBPF objects (maps and programs) can be shared between processes. For example, after fork(2), the child inherits file descriptors referring to the same eBPF objects. In addition, file descriptors referring to eBPF objects can be transferred over UNIX domain sockets. File descriptors referring to eBPF objects can be duplicated in the usual way, using dup(2) and similar calls. An eBPF object is deallocated only after all file descriptors referring to the object have been closed.
eBPF programs can be written in a restricted C that is compiled (using the clang compiler) into eBPF bytecode. Various features are omitted from this restricted C, such as loops, global variables, variadic functions, floating-point numbers, and passing structures as function arguments. Some examples can be found in the samples/bpf/*_kern.c files in the kernel source tree.
The kernel contains a just-in-time (JIT) compiler that translates eBPF bytecode into native machine code for better performance. Before Linux 4.15, the JIT compiler is disabled by default, but its operation can be controlled by writing one of the following integer strings to the file /proc/sys/net/core/bpf_jit_enable:
0
Disable JIT compilation (default).
1
Normal compilation.
2
Debugging mode. The generated opcodes are dumped in hexadecimal into the kernel log. These opcodes can then be disassembled using the program tools/net/bpf_jit_disasm.c provided in the kernel source tree.
Since Linux 4.15, the kernel may be configured with the CONFIG_BPF_JIT_ALWAYS_ON option. In this case, the JIT compiler is always enabled, and the bpf_jit_enable is initialized to 1 and is immutable. (This kernel configuration option was provided as a mitigation for one of the Spectre attacks against the BPF interpreter.)
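The current JIT setting can be inspected from a program by reading the file named above. The following is a minimal sketch; the helper names read_proc_int() and jit_mode_name() are invented for illustration, and the mapping of values to descriptions simply follows the list above.

```c
#include <stdio.h>

/* Hypothetical helper: return the integer stored in a /proc/sys
   file, or -1 if the file cannot be opened or parsed. */
int
read_proc_int(const char *path)
{
    FILE *fp = fopen(path, "r");
    int value;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &value) != 1)
        value = -1;
    fclose(fp);
    return value;
}

/* Hypothetical helper: describe a bpf_jit_enable value. */
const char *
jit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "disabled";
    case 1:  return "enabled";
    case 2:  return "enabled, debugging mode";
    default: return "unknown or unreadable";
    }
}
```

For example, jit_mode_name(read_proc_int("/proc/sys/net/core/bpf_jit_enable")) yields a description of the current mode; the same reader can be pointed at /proc/sys/kernel/unprivileged_bpf_disabled.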
The JIT compiler for eBPF is currently available for the following architectures:
- x86-64 (since Linux 3.18; cBPF since Linux 3.0);
- ARM32 (since Linux 3.18; cBPF since Linux 3.4);
- SPARC 32 (since Linux 3.18; cBPF since Linux 3.5);
- ARM-64 (since Linux 3.18);
- s390 (since Linux 4.1; cBPF since Linux 3.7);
- PowerPC 64 (since Linux 4.8; cBPF since Linux 3.1);
- SPARC 64 (since Linux 4.12);
- x86-32 (since Linux 4.18);
- MIPS 64 (since Linux 4.18; cBPF since Linux 3.16);
- riscv (since Linux 5.1).
EXAMPLES
/* bpf+sockets example:
* 1. create array map of 256 elements
* 2. load program that counts number of packets received
* r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
* map[r0]++
* 3. attach prog_fd to raw socket via setsockopt()
* 4. print number of received TCP/UDP packets every second
*/
int
main(int argc, char *argv[])
{
int sock, map_fd, prog_fd, key;
long long value = 0, tcp_cnt, udp_cnt;
map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key),
sizeof(value), 256);
if (map_fd < 0) {
        printf("failed to create map '%s'\n", strerror(errno));
        /* likely not run as root */
        return 1;
    }

    struct bpf_insn prog[] = {
        BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
                /* r6 = r1 */
        BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)),
                /* r0 = ip->proto */
        BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
                /* *(u32 *)(fp - 4) = r0 */
        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
                /* r2 = fp */
        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
                /* r2 = r2 - 4 */
        BPF_LD_MAP_FD(BPF_REG_1, map_fd),
                /* r1 = map_fd */
        BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
                /* r0 = map_lookup(r1, r2) */
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
                /* if (r0 == 0) goto pc+2 */
        BPF_MOV64_IMM(BPF_REG_1, 1),
                /* r1 = 1 */
        BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
                /* lock *(u64 *) r0 += r1 */
        BPF_MOV64_IMM(BPF_REG_0, 0),
                /* r0 = 0 */
        BPF_EXIT_INSN(),
                /* return r0 */
    };

    prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog,
                            sizeof(prog) / sizeof(prog[0]), "GPL");

    sock = open_raw_sock("lo");

    assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
                      sizeof(prog_fd)) == 0);

    for (;;) {
        key = IPPROTO_TCP;
        assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
        key = IPPROTO_UDP;
        assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
        printf("TCP %lld UDP %lld packets\n", tcp_cnt, udp_cnt);
        sleep(1);
    }

    return 0;
}
Some complete working code can be found in the samples/bpf directory in the kernel source tree.
SEE ALSO
seccomp(2), bpf-helpers(7), socket(7), tc(8), tc-bpf(8)
Both classic and extended BPF are explained in the kernel source file Documentation/networking/filter.txt.
187 - Linux cli command fchmodat
NAME
fchmodat - change permissions of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int chmod(const char *pathname, mode_t mode);
int fchmod(int fd, mode_t mode);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchmod():
Since glibc 2.24:
_POSIX_C_SOURCE >= 199309L
glibc 2.19 to glibc 2.23:
_POSIX_C_SOURCE
glibc 2.16 to glibc 2.19:
_BSD_SOURCE || _POSIX_C_SOURCE
glibc 2.12 to glibc 2.16:
_BSD_SOURCE || _XOPEN_SOURCE >= 500
|| _POSIX_C_SOURCE >= 200809L
glibc 2.11 and earlier:
_BSD_SOURCE || _XOPEN_SOURCE >= 500
fchmodat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
The chmod() and fchmod() system calls change a file’s mode bits. (The file mode consists of the file permission bits plus the set-user-ID, set-group-ID, and sticky bits.) These system calls differ only in how the file is specified:
chmod() changes the mode of the file whose pathname is given in pathname, which is dereferenced if it is a symbolic link.
fchmod() changes the mode of the file referred to by the open file descriptor fd.
The new file mode is specified in mode, which is a bit mask created by ORing together zero or more of the following:
S_ISUID (04000)
set-user-ID (set process effective user ID on execve(2))
S_ISGID (02000)
set-group-ID (set process effective group ID on execve(2); mandatory locking, as described in fcntl(2); take a new file’s group from parent directory, as described in chown(2) and mkdir(2))
S_ISVTX (01000)
sticky bit (restricted deletion flag, as described in unlink(2))
S_IRUSR (00400)
read by owner
S_IWUSR (00200)
write by owner
S_IXUSR (00100)
execute/search by owner (“search” applies for directories, and means that entries within the directory can be accessed)
S_IRGRP (00040)
read by group
S_IWGRP (00020)
write by group
S_IXGRP (00010)
execute/search by group
S_IROTH (00004)
read by others
S_IWOTH (00002)
write by others
S_IXOTH (00001)
execute/search by others
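As an illustration of how these constants combine, the following sketch sets the mode rw-r--r-- (octal 0644) on a scratch file with fchmod() and reads it back with fstat(2). The function name fchmod_demo() and the temporary-file template under /tmp are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: create a scratch file, set rw-r--r-- with
   fchmod(), and return the mode bits read back with fstat(),
   or -1 on error. */
int
fchmod_demo(void)
{
    char path[] = "/tmp/fchmod_demo_XXXXXX";
    struct stat sb;
    int fd, result = -1;

    fd = mkstemp(path);
    if (fd == -1)
        return -1;

    /* 00400 | 00200 | 00040 | 00004 == 0644 */
    if (fchmod(fd, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH) == 0
        && fstat(fd, &sb) == 0)
        result = sb.st_mode & 07777;   /* keep only the mode bits */

    close(fd);
    unlink(path);
    return result;
}
```

Unlike the mode argument of open(2), the mode passed to fchmod() is not modified by the process umask, so the value read back should be exactly the one that was set.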
The effective UID of the calling process must match the owner of the file, or the process must be privileged (Linux: it must have the CAP_FOWNER capability).
If the calling process is not privileged (Linux: does not have the CAP_FSETID capability), and the group of the file does not match the effective group ID of the process or one of its supplementary group IDs, the S_ISGID bit will be turned off, but this will not cause an error to be returned.
As a security measure, depending on the filesystem, the set-user-ID and set-group-ID execution bits may be turned off if a file is written. (On Linux, this occurs if the writing process does not have the CAP_FSETID capability.) On some filesystems, only the superuser can set the sticky bit, which may have a special meaning. For the sticky bit, and for set-user-ID and set-group-ID bits on directories, see inode(7).
On NFS filesystems, restricting the permissions will immediately influence already open files, because the access control is done on the server, but open files are maintained by the client. Widening the permissions may be delayed for other clients if attribute caching is enabled on them.
fchmodat()
The fchmodat() system call operates in exactly the same way as chmod(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chmod() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chmod()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include the following flag:
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead operate on the link itself. This flag is not currently implemented.
See openat(2) for an explanation of the need for fchmodat().
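The directory-relative behavior can be sketched as follows. The example creates a scratch directory, opens it, and changes the mode of a file inside it by name relative to dirfd. The function name fchmodat_demo(), the directory template under /tmp, and the file name "scratch" are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: returns the resulting mode bits (expected
   0440), or -1 on error. */
int
fchmodat_demo(void)
{
    char dir[] = "/tmp/fchmodat_demo_XXXXXX";
    struct stat sb;
    int dirfd, fd, result = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1) {
        rmdir(dir);
        return -1;
    }

    fd = openat(dirfd, "scratch", O_CREAT | O_WRONLY | O_EXCL, 0600);
    if (fd != -1) {
        close(fd);
        /* "scratch" is relative, so it is resolved against dirfd,
           not against the current working directory. */
        if (fchmodat(dirfd, "scratch", S_IRUSR | S_IRGRP, 0) == 0
            && fstatat(dirfd, "scratch", &sb, 0) == 0)
            result = sb.st_mode & 07777;
        unlinkat(dirfd, "scratch", 0);
    }

    close(dirfd);
    rmdir(dir);
    return result;
}
```

Passing AT_FDCWD in place of dirfd would make the same call resolve "scratch" against the current working directory, as chmod() does.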
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, errors other than those listed below can be returned.
The more general errors for chmod() are listed below:
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EBADF
(fchmod()) The file descriptor fd is not valid.
EBADF
(fchmodat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
(fchmodat()) Invalid flag specified in flags.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
The file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(fchmodat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
ENOTSUP
(fchmodat()) flags specified AT_SYMLINK_NOFOLLOW, which is not supported.
EPERM
The effective UID does not match the owner of the file, and the process is not privileged (Linux: it does not have the CAP_FOWNER capability).
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The named file resides on a read-only filesystem.
VERSIONS
C library/kernel differences
The GNU C library fchmodat() wrapper function implements the POSIX-specified interface described in this page. This interface differs from the underlying Linux system call, which does not have a flags argument.
glibc notes
On older kernels where fchmodat() is unavailable, the glibc wrapper function falls back to the use of chmod(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
STANDARDS
POSIX.1-2008.
HISTORY
chmod()
fchmod()
4.4BSD, SVr4, POSIX.1-2001.
fchmodat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
SEE ALSO
chmod(1), chown(2), execve(2), open(2), stat(2), inode(7), path_resolution(7), symlink(7)
188 - Linux cli command s390_pci_mmio_write
NAME
s390_pci_mmio_write - transfer data to/from PCI MMIO memory page
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_s390_pci_mmio_write, unsigned long mmio_addr,
const void user_buffer[.length], size_t length);
int syscall(SYS_s390_pci_mmio_read, unsigned long mmio_addr,
void user_buffer[.length], size_t length);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
The s390_pci_mmio_write() system call writes length bytes of data from the user-space buffer user_buffer to the PCI MMIO memory location specified by mmio_addr. The s390_pci_mmio_read() system call reads length bytes of data from the PCI MMIO memory location specified by mmio_addr to the user-space buffer user_buffer.
These system calls must be used instead of the simple assignment or data-transfer operations that are used to access the PCI MMIO memory areas mapped to user space on the Linux System z platform. The address specified by mmio_addr must belong to a PCI MMIO memory page mapping in the caller’s address space, and the data being written or read must not cross a page boundary. The length value cannot be greater than the system page size.
RETURN VALUE
On success, s390_pci_mmio_write() and s390_pci_mmio_read() return 0. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
The address in mmio_addr is invalid.
EFAULT
user_buffer does not point to a valid location in the caller’s address space.
EINVAL
Invalid length argument.
ENODEV
PCI support is not enabled.
ENOMEM
Insufficient memory.
STANDARDS
Linux on s390.
HISTORY
Linux 3.19. System z EC12.
SEE ALSO
syscall(2)
189 - Linux cli command mq_timedsend
NAME
mq_timedsend - send a message to a message queue
LIBRARY
Real-time library (librt, -lrt)
SYNOPSIS
#include <mqueue.h>
int mq_send(mqd_t mqdes, const char msg_ptr[.msg_len],
size_t msg_len, unsigned int msg_prio);
#include <time.h>
#include <mqueue.h>
int mq_timedsend(mqd_t mqdes, const char msg_ptr[.msg_len],
size_t msg_len, unsigned int msg_prio,
const struct timespec *abs_timeout);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
mq_timedsend():
_POSIX_C_SOURCE >= 200112L
DESCRIPTION
mq_send() adds the message pointed to by msg_ptr to the message queue referred to by the message queue descriptor mqdes. The msg_len argument specifies the length of the message pointed to by msg_ptr; this length must be less than or equal to the queue’s mq_msgsize attribute. Zero-length messages are allowed.
The msg_prio argument is a nonnegative integer that specifies the priority of this message. Messages are placed on the queue in decreasing order of priority, with newer messages of the same priority being placed after older messages with the same priority. See mq_overview(7) for details on the range for the message priority.
If the message queue is already full (i.e., the number of messages on the queue equals the queue’s mq_maxmsg attribute), then, by default, mq_send() blocks until sufficient space becomes available to allow the message to be queued, or until the call is interrupted by a signal handler. If the O_NONBLOCK flag is enabled for the message queue description, then the call instead fails immediately with the error EAGAIN.
mq_timedsend() behaves just like mq_send(), except that if the queue is full and the O_NONBLOCK flag is not enabled for the message queue description, then abs_timeout points to a structure which specifies how long the call will block. This value is an absolute timeout in seconds and nanoseconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC), specified in a timespec(3) structure.
If the message queue is full, and the timeout has already expired by the time of the call, mq_timedsend() returns immediately.
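Because abs_timeout is an absolute time rather than an interval, callers usually derive it from the current time. The sketch below shows one way to do that; the helper name deadline_after_ms() is hypothetical, and it assumes a nonnegative interval in milliseconds.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: absolute CLOCK_REALTIME deadline that lies
   ms milliseconds (ms >= 0) after the current time. */
struct timespec
deadline_after_ms(long ms)
{
    struct timespec ts;

    clock_gettime(CLOCK_REALTIME, &ts);    /* time since the Epoch */
    ts.tv_sec  += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {       /* keep tv_nsec in range */
        ts.tv_sec  += 1;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}
```

The address of the returned structure can then be passed as the abs_timeout argument of mq_timedsend(). The normalization step matters because an out-of-range tv_nsec causes the call to fail with EINVAL if it would otherwise have blocked.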
RETURN VALUE
On success, mq_send() and mq_timedsend() return zero; on error, -1 is returned, with errno set to indicate the error.
ERRORS
EAGAIN
The queue was full, and the O_NONBLOCK flag was set for the message queue description referred to by mqdes.
EBADF
The descriptor specified in mqdes was invalid or not opened for writing.
EINTR
The call was interrupted by a signal handler; see signal(7).
EINVAL
The call would have blocked, and abs_timeout was invalid, either because tv_sec was less than zero, or because tv_nsec was less than zero or greater than or equal to 1000 million.
EMSGSIZE
msg_len was greater than the mq_msgsize attribute of the message queue.
ETIMEDOUT
The call timed out before a message could be transferred.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
mq_send(), mq_timedsend() | Thread safety | MT-Safe |
VERSIONS
On Linux, mq_timedsend() is a system call, and mq_send() is a library function layered on top of that system call.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
SEE ALSO
mq_close(3), mq_getattr(3), mq_notify(3), mq_open(3), mq_receive(3), mq_unlink(3), timespec(3), mq_overview(7), time(7)
190 - Linux cli command oldolduname
NAME
oldolduname - get name and information about current kernel
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/utsname.h>
int uname(struct utsname *buf);
DESCRIPTION
uname() returns system information in the structure pointed to by buf. The utsname struct is defined in <sys/utsname.h>:
struct utsname {
char sysname[]; /* Operating system name (e.g., "Linux") */
char nodename[]; /* Name within communications network
to which the node is attached, if any */
char release[]; /* Operating system release
(e.g., "2.6.28") */
char version[]; /* Operating system version */
char machine[]; /* Hardware type identifier */
#ifdef _GNU_SOURCE
char domainname[]; /* NIS or YP domain name */
#endif
};
The length of the arrays in a struct utsname is unspecified (see NOTES); the fields are terminated by a null byte (‘\0’).
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
buf is not valid.
VERSIONS
The domainname member (the NIS or YP domain name) is a GNU extension.
The length of the fields in the struct varies. Some operating systems or libraries use a hardcoded 9 or 33 or 65 or 257. Other systems use SYS_NMLN or _SYS_NMLN or UTSLEN or _UTSNAME_LENGTH. Clearly, it is a bad idea to use any of these constants; just use sizeof(…). SVr4 uses 257, “to support Internet hostnames”; this is the largest value likely to be encountered in the wild.
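A short sketch of the sizeof approach follows. The helper name describe_kernel() is invented for the example; the caller sizes its buffer from the structure itself instead of relying on any of the constants mentioned above.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: format "sysname release machine" into out.
   Returns 0 on success, or -1 if uname() fails or out is too
   small. */
int
describe_kernel(char *out, size_t outlen)
{
    struct utsname u;
    int n;

    if (uname(&u) == -1)
        return -1;
    n = snprintf(out, outlen, "%s %s %s",
                 u.sysname, u.release, u.machine);
    return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
}
```

A caller can declare char buf[sizeof(struct utsname)]; and pass sizeof(buf) as outlen. That buffer is always large enough here, because the three fields, the two separating spaces, and the terminating null byte together cannot exceed the size of the structure.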
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD.
C library/kernel differences
Over time, increases in the size of the utsname structure have led to three successive versions of uname(): sys_olduname() (slot __NR_oldolduname), sys_uname() (slot __NR_olduname), and sys_newuname() (slot __NR_uname). The first one used length 9 for all fields; the second used 65; the third also uses 65 but adds the domainname field. The glibc uname() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel.
NOTES
The kernel has the name, release, version, and supported machine type built in. Conversely, the nodename field is configured by the administrator to match the network (this is what the BSD historically calls the “hostname”, and is set via sethostname(2)). Similarly, the domainname field is set via setdomainname(2).
Part of the utsname information is also accessible via /proc/sys/kernel/{ostype, hostname, osrelease, version, domainname}.
SEE ALSO
uname(1), getdomainname(2), gethostname(2), uts_namespaces(7)
191 - Linux cli command getuid32
NAME
getuid32 - get user identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
uid_t getuid(void);
uid_t geteuid(void);
DESCRIPTION
getuid() returns the real user ID of the calling process.
geteuid() returns the effective user ID of the calling process.
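A common use of the two calls together is to detect whether a process is running with an effective user ID that differs from its real one, as happens in a set-user-ID program. A minimal sketch, with the hypothetical helper name euid_differs():

```c
#include <unistd.h>

/* Hypothetical helper: return 1 if the real and effective user IDs
   of the calling process differ, 0 otherwise. */
int
euid_differs(void)
{
    return getuid() != geteuid();
}
```

Neither call can fail, so no error checking is needed.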
ERRORS
These functions are always successful and never modify errno.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD.
In UNIX V6 the getuid() call returned (euid << 8) + uid. UNIX V7 introduced separate calls getuid() and geteuid().
The original Linux getuid() and geteuid() system calls supported only 16-bit user IDs. Subsequently, Linux 2.4 added getuid32() and geteuid32(), supporting 32-bit IDs. The glibc getuid() and geteuid() wrapper functions transparently deal with the variations across kernel versions.
On Alpha, instead of a pair of getuid() and geteuid() system calls, a single getxuid() system call is provided, which returns a pair of real and effective UIDs. The glibc getuid() and geteuid() wrapper functions transparently deal with this. See syscall(2) for details regarding register mapping.
SEE ALSO
getresuid(2), setreuid(2), setuid(2), credentials(7)
192 - Linux cli command prlimit64
NAME
prlimit64 - get/set resource limits
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/resource.h>
int getrlimit(int resource, struct rlimit *rlim);
int setrlimit(int resource, const struct rlimit *rlim);
int prlimit(pid_t pid, int resource,
const struct rlimit *_Nullable new_limit,
struct rlimit *_Nullable old_limit);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
prlimit():
_GNU_SOURCE
DESCRIPTION
The getrlimit() and setrlimit() system calls get and set resource limits. Each resource has an associated soft and hard limit, as defined by the rlimit structure:
struct rlimit {
rlim_t rlim_cur; /* Soft limit */
rlim_t rlim_max; /* Hard limit (ceiling for rlim_cur) */
};
The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may set only its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability in the initial user namespace) may make arbitrary changes to either limit value.
The value RLIM_INFINITY denotes no limit on a resource (both in the structure returned by getrlimit() and in the structure passed to setrlimit()).
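The usual pattern for changing a soft limit is to fetch the current pair with getrlimit(), modify rlim_cur, and write the pair back with setrlimit(), leaving the hard limit untouched. The sketch below assumes a hypothetical helper name, set_soft_limit(), and chooses to clamp the request to the hard limit rather than fail.

```c
#include <sys/resource.h>

/* Hypothetical helper: set the soft limit of resource to soft,
   clamped to the current hard limit. Returns 0 on success, or -1
   with errno set. */
int
set_soft_limit(int resource, rlim_t soft)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) == -1)
        return -1;

    /* An unprivileged process may not exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && soft > rl.rlim_max)
        soft = rl.rlim_max;

    rl.rlim_cur = soft;            /* rl.rlim_max is left unchanged */
    return setrlimit(resource, &rl);
}
```

For example, set_soft_limit(RLIMIT_CORE, 0) disables core dumps for the calling process while keeping the option of raising the soft limit again later, up to the hard limit.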
The resource argument must be one of:
RLIMIT_AS
This is the maximum size of the process’s virtual memory (address space). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), mmap(2), and mremap(2), which fail with the error ENOMEM upon exceeding this limit. In addition, automatic stack expansion fails (and generates a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited.
RLIMIT_CORE
This is the maximum size of a core file (see core(5)) in bytes that the process may dump. When 0, no core dump files are created. When nonzero, larger dumps are truncated to this size.
RLIMIT_CPU
This is a limit, in seconds, on the amount of CPU time that the process can consume. When the process reaches the soft limit, it is sent a SIGXCPU signal. The default action for this signal is to terminate the process. However, the signal can be caught, and the handler can return control to the main program. If the process continues to consume CPU time, it will be sent SIGXCPU once per second until the hard limit is reached, at which time it is sent SIGKILL. (This latter point describes Linux behavior. Implementations vary in how they treat processes which continue to consume CPU time after reaching the soft limit. Portable applications that need to catch this signal should perform an orderly termination upon first receipt of SIGXCPU.)
RLIMIT_DATA
This is the maximum size of the process’s data segment (initialized data, uninitialized data, and heap). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), sbrk(2), and (since Linux 4.7) mmap(2), which fail with the error ENOMEM upon encountering the soft limit of this resource.
RLIMIT_FSIZE
This is the maximum size in bytes of files that the process may create. Attempts to extend a file beyond this limit result in delivery of a SIGXFSZ signal. By default, this signal terminates a process, but a process can catch this signal instead, in which case the relevant system call (e.g., write(2), truncate(2)) fails with the error EFBIG.
RLIMIT_LOCKS (Linux 2.4.0 to Linux 2.4.24)
This is a limit on the combined number of flock(2) locks and fcntl(2) leases that this process may establish.
RLIMIT_MEMLOCK
This is the maximum number of bytes of memory that may be locked into RAM. This limit is in effect rounded down to the nearest multiple of the system page size. This limit affects mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9, it also affects the shmctl(2) SHM_LOCK operation, where it sets a maximum on the total bytes in shared memory segments (see shmget(2)) that may be locked by the real user ID of the calling process. The shmctl(2) SHM_LOCK locks are accounted for separately from the per-process memory locks established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED; a process can lock bytes up to this limit in each of these two categories.
Before Linux 2.6.9, this limit controlled the amount of memory that could be locked by a privileged process. Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process may lock, and this limit instead governs the amount of memory that an unprivileged process may lock.
RLIMIT_MSGQUEUE (since Linux 2.6.8)
This is a limit on the number of bytes that can be allocated for POSIX message queues for the real user ID of the calling process. This limit is enforced for mq_open(3). Each message queue that the user creates counts (until it is removed) against this limit according to the formula:
Since Linux 3.5:
    bytes = attr.mq_maxmsg * sizeof(struct msg_msg) +
            MIN(attr.mq_maxmsg, MQ_PRIO_MAX) *
                sizeof(struct posix_msg_tree_node) +
                                /* For overhead */
            attr.mq_maxmsg * attr.mq_msgsize;
                                /* For message data */
Linux 3.4 and earlier:
    bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) +
                                /* For overhead */
            attr.mq_maxmsg * attr.mq_msgsize;
                                /* For message data */
where attr is the mq_attr structure specified as the fourth argument to mq_open(3), and the msg_msg and posix_msg_tree_node structures are kernel-internal structures.
The “overhead” addend in the formula accounts for overhead bytes required by the implementation and ensures that the user cannot create an unlimited number of zero-length messages (such messages nevertheless each consume some system memory for bookkeeping overhead).
RLIMIT_NICE (since Linux 2.6.12, but see BUGS below)
This specifies a ceiling to which the process’s nice value can be raised using setpriority(2) or nice(2). The actual ceiling for the nice value is calculated as 20 - rlim_cur. The useful range for this limit is thus from 1 (corresponding to a nice value of 19) to 40 (corresponding to a nice value of -20). This unusual choice of range was necessary because negative numbers cannot be specified as resource limit values, since they typically have special meanings. For example, RLIM_INFINITY typically is the same as -1. For more detail on the nice value, see sched(7).
RLIMIT_NOFILE
This specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.)
Since Linux 4.5, this limit also defines the maximum number of file descriptors that an unprivileged process (one without the CAP_SYS_RESOURCE capability) may have “in flight” to other processes, by being passed across UNIX domain sockets. This limit applies to the sendmsg(2) system call. For further details, see unix(7).
RLIMIT_NPROC
This is a limit on the number of extant processes (or, more precisely on Linux, threads) for the real user ID of the calling process. So long as the current number of processes belonging to this process’s real user ID is greater than or equal to this limit, fork(2) fails with the error EAGAIN.
The RLIMIT_NPROC limit is not enforced for processes that have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability, or run with real user ID 0.
RLIMIT_RSS
This is a limit (in bytes) on the process’s resident set (the number of virtual pages resident in RAM). This limit has effect only in Linux 2.4.x, x < 30, and there affects only calls to madvise(2) specifying MADV_WILLNEED.
RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS)
This specifies a ceiling on the real-time priority that may be set for this process using sched_setscheduler(2) and sched_setparam(2).
For further details on real-time scheduling policies, see sched(7).
RLIMIT_RTTIME (since Linux 2.6.25)
This is a limit (in microseconds) on the amount of CPU time that a process scheduled under a real-time scheduling policy may consume without making a blocking system call. For the purpose of this limit, each time a process makes a blocking system call, the count of its consumed CPU time is reset to zero. The CPU time count is not reset if the process continues trying to use the CPU but is preempted, its time slice expires, or it calls sched_yield(2).
Upon reaching the soft limit, the process is sent a SIGXCPU signal. If the process catches or ignores this signal and continues consuming CPU time, then SIGXCPU will be generated once each second until the hard limit is reached, at which point the process is sent a SIGKILL signal.
The intended use of this limit is to stop a runaway real-time process from locking up the system.
For further details on real-time scheduling policies, see sched(7).
RLIMIT_SIGPENDING (since Linux 2.6.8)
This is a limit on the number of signals that may be queued for the real user ID of the calling process. Both standard and real-time signals are counted for the purpose of checking this limit. However, the limit is enforced only for sigqueue(3); it is always possible to use kill(2) to queue one instance of any of the signals that are not already queued to the process.
RLIMIT_STACK
This is the maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated. To handle this signal, a process must employ an alternate signal stack (sigaltstack(2)).
Since Linux 2.6.23, this limit also determines the amount of space used for the process’s command-line arguments and environment variables; for details, see execve(2).
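As one illustration of how a limit from the list above is enforced, the sketch below lowers the soft RLIMIT_NOFILE limit and then opens /dev/null until open(2) fails. The helper name probe_nofile_limit() and the MAX_PROBE cap are assumptions made for this example. The failing call is expected to report EMFILE, and every descriptor obtained should be numbered below the soft limit.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

#define MAX_PROBE 256   /* Arbitrary cap chosen for this sketch */

/* Hypothetical helper: lower the soft RLIMIT_NOFILE limit to `soft`,
   then open /dev/null repeatedly until open(2) fails.  Stores the
   errno of the failing open() in *open_errno and returns the highest
   descriptor number obtained (-1 if none could be opened or if the
   limit could not be set). */
int
probe_nofile_limit(rlim_t soft, int *open_errno)
{
    struct rlimit rl;
    int fds[MAX_PROBE];
    int n = 0, highest = -1;

    assert(soft < MAX_PROBE && open_errno != NULL);
    *open_errno = 0;

    if (getrlimit(RLIMIT_NOFILE, &rl) == -1)
        return -1;
    rl.rlim_cur = soft;
    if (setrlimit(RLIMIT_NOFILE, &rl) == -1)
        return -1;

    while (n < MAX_PROBE) {
        int fd = open("/dev/null", O_RDONLY);

        if (fd == -1) {
            *open_errno = errno;    /* Expected: EMFILE */
            break;
        }
        fds[n++] = fd;
        highest = fd;
    }

    while (n > 0)                   /* Release the probe descriptors */
        close(fds[--n]);

    return highest;
}
```

The soft limit stays lowered after the helper returns; a caller that needs the earlier value would have to save it and restore it with setrlimit().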
prlimit()
The Linux-specific prlimit() system call combines and extends the functionality of setrlimit() and getrlimit(). It can be used to both set and get the resource limits of an arbitrary process.
The resource argument has the same meaning as for setrlimit() and getrlimit().
If the new_limit argument is not NULL, then the rlimit structure to which it points is used to set new values for the soft and hard limits for resource. If the old_limit argument is not NULL, then a successful call to prlimit() places the previous soft and hard limits for resource in the rlimit structure pointed to by old_limit.
The pid argument specifies the ID of the process on which the call is to operate. If pid is 0, then the call applies to the calling process. To set or get the resources of a process other than itself, the caller must have the CAP_SYS_RESOURCE capability in the user namespace of the process whose resource limits are being changed, or the real, effective, and saved set user IDs of the target process must match the real user ID of the caller and the real, effective, and saved set group IDs of the target process must match the real group ID of the caller.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
A pointer argument points to a location outside the accessible address space.
EINVAL
The value specified in resource is not valid; or, for setrlimit() or prlimit(): rlim->rlim_cur was greater than rlim->rlim_max.
EPERM
An unprivileged process tried to raise the hard limit; the CAP_SYS_RESOURCE capability is required to do this.
EPERM
The caller tried to increase the hard RLIMIT_NOFILE limit above the maximum defined by /proc/sys/fs/nr_open (see proc(5)).
EPERM
(prlimit()) The calling process did not have permission to set limits for the process specified by pid.
ESRCH
Could not find a process with the ID specified in pid.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
getrlimit(), setrlimit(), prlimit() | Thread safety | MT-Safe |
STANDARDS
getrlimit()
setrlimit()
POSIX.1-2008.
prlimit()
Linux.
RLIMIT_MEMLOCK and RLIMIT_NPROC derive from BSD and are not specified in POSIX.1; they are present on the BSDs and Linux, but on few other implementations. RLIMIT_RSS derives from BSD and is not specified in POSIX.1; it is nevertheless present on most implementations. RLIMIT_MSGQUEUE, RLIMIT_NICE, RLIMIT_RTPRIO, RLIMIT_RTTIME, and RLIMIT_SIGPENDING are Linux-specific.
HISTORY
getrlimit()
setrlimit()
POSIX.1-2001, SVr4, 4.3BSD.
prlimit()
Linux 2.6.36, glibc 2.13.
NOTES
A child process created via fork(2) inherits its parent’s resource limits. Resource limits are preserved across execve(2).
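Inheritance across fork(2) can be checked with a sketch along the following lines. The helper name child_inherits_core_limit() is hypothetical. It sets the soft RLIMIT_CORE limit in the parent and has the child report, through its exit status, whether it sees the same value.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: set the soft RLIMIT_CORE limit to `soft` in the
   calling process, then fork a child that checks whether it sees the
   same soft limit.  Returns 1 if the child inherited the value, 0 if
   not, and -1 on error. */
int
child_inherits_core_limit(rlim_t soft)
{
    struct rlimit rl;
    pid_t pid;
    int status;

    if (getrlimit(RLIMIT_CORE, &rl) == -1)
        return -1;
    rl.rlim_cur = soft;
    if (setrlimit(RLIMIT_CORE, &rl) == -1)
        return -1;

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {                     /* Child */
        struct rlimit seen;

        if (getrlimit(RLIMIT_CORE, &seen) == -1)
            _exit(2);
        _exit(seen.rlim_cur == soft ? 0 : 1);
    }

    assert(pid > 0);                    /* Parent: pid is the child's ID */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```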
Resource limits are per-process attributes that are shared by all of the threads in a process.
Lowering the soft limit for a resource below the process’s current consumption of that resource will succeed (but will prevent the process from further increasing its consumption of the resource).
One can set the resource limits of the shell using the built-in ulimit command (limit in csh(1)). The shell’s resource limits are inherited by the processes that it creates to execute commands.
Since Linux 2.6.24, the resource limits of any process can be inspected via /proc/pid/limits; see proc(5).
Ancient systems provided a vlimit() function with a similar purpose to setrlimit(). For backward compatibility, glibc also provides vlimit(). All new applications should be written using setrlimit().
C library/kernel ABI differences
Since glibc 2.13, the glibc getrlimit() and setrlimit() wrapper functions no longer invoke the corresponding system calls, but instead employ prlimit(), for the reasons described in BUGS.
The name of the glibc wrapper function is prlimit(); the underlying system call is prlimit64().
BUGS
In older Linux kernels, the SIGXCPU and SIGKILL signals delivered when a process encountered the soft and hard RLIMIT_CPU limits were delivered one (CPU) second later than they should have been. This was fixed in Linux 2.6.8.
In Linux 2.6.x kernels before Linux 2.6.17, a RLIMIT_CPU limit of 0 is wrongly treated as “no limit” (like RLIM_INFINITY). Since Linux 2.6.17, setting a limit of 0 does have an effect, but is actually treated as a limit of 1 second.
A kernel bug means that RLIMIT_RTPRIO does not work in Linux 2.6.12; the problem is fixed in Linux 2.6.13.
In Linux 2.6.12, there was an off-by-one mismatch between the priority ranges returned by getpriority(2) and RLIMIT_NICE. This had the effect that the actual ceiling for the nice value was calculated as 19 - rlim_cur. This was fixed in Linux 2.6.13.
Since Linux 2.6.12, if a process reaches its soft RLIMIT_CPU limit and has a handler installed for SIGXCPU, then, in addition to invoking the signal handler, the kernel increases the soft limit by one second. This behavior repeats if the process continues to consume CPU time, until the hard limit is reached, at which point the process is killed. Other implementations do not change the RLIMIT_CPU soft limit in this manner, and the Linux behavior is probably not standards conformant; portable applications should avoid relying on this Linux-specific behavior. The Linux-specific RLIMIT_RTTIME limit exhibits the same behavior when the soft limit is encountered.
Kernels before Linux 2.4.22 did not diagnose the error EINVAL for setrlimit() when rlim->rlim_cur was greater than rlim->rlim_max.
Linux doesn’t return an error when an attempt to set RLIMIT_CPU has failed, for compatibility reasons.
Representation of “large” resource limit values on 32-bit platforms
The glibc getrlimit() and setrlimit() wrapper functions use a 64-bit rlim_t data type, even on 32-bit platforms. However, the rlim_t data type used in the getrlimit() and setrlimit() system calls is a (32-bit) unsigned long. Furthermore, in Linux, the kernel represents resource limits on 32-bit platforms as unsigned long. However, a 32-bit data type is not wide enough. The most pertinent limit here is RLIMIT_FSIZE, which specifies the maximum size to which a file can grow: to be useful, this limit must be represented using a type that is as wide as the type used to represent file offsets, that is, as wide as a 64-bit off_t (assuming a program compiled with _FILE_OFFSET_BITS=64).
To work around this kernel limitation, if a program tried to set a resource limit to a value larger than can be represented in a 32-bit unsigned long, then the glibc setrlimit() wrapper function silently converted the limit value to RLIM_INFINITY. In other words, the requested resource limit setting was silently ignored.
Since glibc 2.13, glibc works around the limitations of the getrlimit() and setrlimit() system calls by implementing setrlimit() and getrlimit() as wrapper functions that call prlimit().
EXAMPLES
The program below demonstrates the use of prlimit().
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    pid_t          pid;
    struct rlimit  old, new;
    struct rlimit  *newp;

    if (!(argc == 2 || argc == 4)) {
        fprintf(stderr, "Usage: %s <pid> [<new-soft-limit> "
                "<new-hard-limit>]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    pid = atoi(argv[1]);        /* PID of target process */

    newp = NULL;
    if (argc == 4) {
        new.rlim_cur = atoi(argv[2]);
        new.rlim_max = atoi(argv[3]);
        newp = &new;
    }

    /* Set CPU time limit of target process; retrieve and display
       previous limit */

    if (prlimit(pid, RLIMIT_CPU, newp, &old) == -1)
        err(EXIT_FAILURE, "prlimit-1");
    printf("Previous limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);

    /* Retrieve and display new CPU time limit */

    if (prlimit(pid, RLIMIT_CPU, NULL, &old) == -1)
        err(EXIT_FAILURE, "prlimit-2");
    printf("New limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);

    exit(EXIT_SUCCESS);
}
SEE ALSO
prlimit(1), dup(2), fcntl(2), fork(2), getrusage(2), mlock(2), mmap(2), open(2), quotactl(2), sbrk(2), shmctl(2), malloc(3), sigqueue(3), ulimit(3), core(5), capabilities(7), cgroups(7), credentials(7), signal(7)
193 - Linux cli command timer_getoverrun
NAME
timer_getoverrun - get overrun count for a POSIX per-process timer
LIBRARY
Real-time library (librt, -lrt)
SYNOPSIS
#include <time.h>
int timer_getoverrun(timer_t timerid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
timer_getoverrun():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
timer_getoverrun() returns the “overrun count” for the timer referred to by timerid. An application can use the overrun count to accurately calculate the number of timer expirations that would have occurred over a given time interval. Timer overruns can occur both when receiving expiration notifications via signals (SIGEV_SIGNAL), and via threads (SIGEV_THREAD).
When expiration notifications are delivered via a signal, overruns can occur as follows. Regardless of whether or not a real-time signal is used for timer notifications, the system queues at most one signal per timer. (This is the behavior specified by POSIX.1. The alternative, queuing one signal for each timer expiration, could easily result in overflowing the allowed limits for queued signals on the system.) Because of system scheduling delays, or because the signal may be temporarily blocked, there can be a delay between the time when the notification signal is generated and the time when it is delivered (e.g., caught by a signal handler) or accepted (e.g., using sigwaitinfo(2)). In this interval, further timer expirations may occur. The timer overrun count is the number of additional timer expirations that occurred between the time when the signal was generated and when it was delivered or accepted.
Timer overruns can also occur when expiration notifications are delivered via invocation of a thread, since there may be an arbitrary delay between an expiration of the timer and the invocation of the notification thread, and in that delay interval, additional timer expirations may occur.
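The signal-based case can be reproduced with a sketch like the one below. The helper name count_overruns() and its parameters are hypothetical. It keeps SIGRTMIN blocked while a periodic timer expires several times, accepts the single queued signal with sigwaitinfo(2), and then reads the overrun count. On older glibc versions the program may need to be linked with -lrt, as the LIBRARY section indicates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: arm a periodic timer with period `period_ms`
   (1..999) that notifies via SIGRTMIN, keep that signal blocked for
   `block_ms` (1..999) milliseconds, then accept one notification and
   return the overrun count reported for it (-1 on error). */
int
count_overruns(long period_ms, long block_ms)
{
    sigset_t set, oldset;
    struct sigevent sev;
    struct itimerspec its;
    struct timespec pause_time, zero;
    timer_t timerid;
    siginfo_t info;
    int overruns = -1;

    assert(period_ms > 0 && period_ms < 1000);
    assert(block_ms > 0 && block_ms < 1000);

    /* Block the notification signal so that expirations accumulate. */
    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    if (sigprocmask(SIG_BLOCK, &set, &oldset) == -1)
        return -1;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGRTMIN;
    if (timer_create(CLOCK_MONOTONIC, &sev, &timerid) == -1)
        goto restore_mask;

    memset(&its, 0, sizeof(its));
    its.it_value.tv_nsec = period_ms * 1000000L;
    its.it_interval.tv_nsec = period_ms * 1000000L;
    if (timer_settime(timerid, 0, &its, NULL) == -1)
        goto delete_timer;

    memset(&pause_time, 0, sizeof(pause_time));
    pause_time.tv_nsec = block_ms * 1000000L;
    nanosleep(&pause_time, NULL);

    /* Accept the single queued signal, then ask how many further
       expirations occurred before it was accepted. */
    if (sigwaitinfo(&set, &info) != -1)
        overruns = timer_getoverrun(timerid);

delete_timer:
    timer_delete(timerid);
    /* Discard any notification queued after the one accepted above. */
    memset(&zero, 0, sizeof(zero));
    while (sigtimedwait(&set, NULL, &zero) > 0)
        continue;
restore_mask:
    sigprocmask(SIG_SETMASK, &oldset, NULL);
    return overruns;
}
```

With a 1 ms period and the signal blocked for 50 ms, the returned count would be expected to be well above zero, although the exact value depends on timer resolution and scheduling.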
RETURN VALUE
On success, timer_getoverrun() returns the overrun count of the specified timer; this count may be 0 if no overruns have occurred. On failure, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
timerid is not a valid timer ID.
VERSIONS
When timer notifications are delivered via signals (SIGEV_SIGNAL), on Linux it is also possible to obtain the overrun count via the si_overrun field of the siginfo_t structure (see sigaction(2)). This allows an application to avoid the overhead of making a system call to obtain the overrun count, but is a nonportable extension to POSIX.1.
POSIX.1 discusses timer overruns only in the context of timer notifications using signals.
STANDARDS
POSIX.1-2008.
HISTORY
Linux 2.6. POSIX.1-2001.
BUGS
POSIX.1 specifies that if the timer overrun count is equal to or greater than an implementation-defined maximum, DELAYTIMER_MAX, then timer_getoverrun() should return DELAYTIMER_MAX. However, before Linux 4.19, if the timer overrun value exceeds the maximum representable integer, the counter cycles, starting once more from low values. Since Linux 4.19, timer_getoverrun() returns DELAYTIMER_MAX (defined as INT_MAX in <limits.h>) in this case (and the overrun value is reset to 0).
EXAMPLES
See timer_create(2).
SEE ALSO
clock_gettime(2), sigaction(2), signalfd(2), sigwaitinfo(2), timer_create(2), timer_delete(2), timer_settime(2), signal(7), time(7)
194 - Linux cli command landlock_restrict_self
NAME
landlock_restrict_self - enforce a Landlock ruleset
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/landlock.h> /* Definition of LANDLOCK_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
int syscall(SYS_landlock_restrict_self, int ruleset_fd,
uint32_t flags);
DESCRIPTION
Once a Landlock ruleset is populated with the desired rules, the landlock_restrict_self() system call enables enforcing this ruleset on the calling thread. See landlock(7) for a global overview.
A thread can be restricted with multiple rulesets that are then composed together to form the thread’s Landlock domain. This can be seen as a stack of rulesets but it is implemented in a more efficient way. A domain can only be updated in such a way that the constraints of every past and future composed ruleset will restrict the thread and its future children for their entire life. It is then possible to gradually enforce tailored access control policies with multiple independent rulesets coming from different sources (e.g., init system configuration, user session policy, built-in application policy). However, most applications should only need one call to landlock_restrict_self() and they should avoid arbitrary numbers of such calls because of the composed rulesets limit. Instead, developers are encouraged to build a tailored ruleset through multiple calls to landlock_add_rule(2).
In order to enforce a ruleset, either the caller must have the CAP_SYS_ADMIN capability in its user namespace, or the thread must already have the no_new_privs bit set. As for seccomp(2), this avoids scenarios where unprivileged processes can affect the behavior of privileged children (e.g., because of set-user-ID binaries). If that bit was not already set by an ancestor of this thread, the thread must make the following call:
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
ruleset_fd is a Landlock ruleset file descriptor obtained with landlock_create_ruleset(2) and fully populated with a set of calls to landlock_add_rule(2).
flags must be 0.
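The sequence described above (set no_new_privs, then enforce the ruleset) might be sketched as follows. Both helper names are hypothetical. The ruleset used here handles LANDLOCK_ACCESS_FS_WRITE_FILE and deliberately contains no rules, which is an assumption made to keep the example short; a real program would normally add rules with landlock_add_rule(2) first. The work is done in a forked child so that the calling process is not itself restricted.

```c
#include <assert.h>
#include <linux/landlock.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: enforce `ruleset_fd` on the calling thread,
   setting no_new_privs first as described above. */
static int
enforce_ruleset(int ruleset_fd)
{
    assert(ruleset_fd >= 0);

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return (int) syscall(SYS_landlock_restrict_self, ruleset_fd, 0);
}

/* Hypothetical helper: in a forked child (so the caller itself stays
   unrestricted), create a ruleset that handles
   LANDLOCK_ACCESS_FS_WRITE_FILE but contains no rules, and enforce it.
   Returns 0 if enforcement succeeded, 1 if no ruleset could be created
   (for example, Landlock is unsupported or disabled), 2 if enforcement
   failed, and -1 on fork/wait errors. */
int
try_restrict_in_child(void)
{
    pid_t pid;
    int status;

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {                     /* Child */
        struct landlock_ruleset_attr attr = {
            .handled_access_fs = LANDLOCK_ACCESS_FS_WRITE_FILE,
        };
        int ruleset_fd;

        ruleset_fd = (int) syscall(SYS_landlock_create_ruleset,
                                   &attr, sizeof(attr), 0);
        if (ruleset_fd == -1)
            _exit(1);
        _exit(enforce_ruleset(ruleset_fd) == 0 ? 0 : 2);
    }

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because the restriction lasts for the lifetime of the restricted thread and its future children, confining the call to a short-lived child is a convenient way to experiment without affecting the parent.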
RETURN VALUE
On success, landlock_restrict_self() returns 0.
ERRORS
landlock_restrict_self() can fail for the following reasons:
EOPNOTSUPP
Landlock is supported by the kernel but disabled at boot time.
EINVAL
flags is not 0.
EBADF
ruleset_fd is not a file descriptor for the current thread.
EBADFD
ruleset_fd is not a ruleset file descriptor.
EPERM
ruleset_fd has no read access to the underlying ruleset, or the calling thread is neither running with no_new_privs nor holding the CAP_SYS_ADMIN capability in its user namespace.
E2BIG
The maximum number of composed rulesets is reached for the calling thread. This limit is currently 64.
STANDARDS
Linux.
HISTORY
Linux 5.13.
EXAMPLES
See landlock(7).
SEE ALSO
landlock_create_ruleset(2), landlock_add_rule(2), landlock(7)
195 - Linux cli command setgroups32
NAME
setgroups32 - get/set list of supplementary group IDs
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int getgroups(int size, gid_t list[]);
#include <grp.h>
int setgroups(size_t size, const gid_t *_Nullable list);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
setgroups():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
getgroups() returns the supplementary group IDs of the calling process in list. The argument size should be set to the maximum number of items that can be stored in the buffer pointed to by list. If the calling process is a member of more than size supplementary groups, then an error results.
It is unspecified whether the effective group ID of the calling process is included in the returned list. (Thus, an application should also call getegid(2) and add or remove the resulting value.)
If size is zero, list is not modified, but the total number of supplementary group IDs for the process is returned. This allows the caller to determine the size of a dynamically allocated list to be used in a further call to getgroups().
setgroups() sets the supplementary group IDs for the calling process. Appropriate privileges are required (see the description of the EPERM error, below). The size argument specifies the number of supplementary group IDs in the buffer pointed to by list. A process can drop all of its supplementary groups with the call:
setgroups(0, NULL);
RETURN VALUE
On success, getgroups() returns the number of supplementary group IDs. On error, -1 is returned, and errno is set to indicate the error.
On success, setgroups() returns 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
list has an invalid address.
getgroups() can additionally fail with the following error:
EINVAL
size is less than the number of supplementary group IDs, but is not zero.
setgroups() can additionally fail with the following errors:
EINVAL
size is greater than NGROUPS_MAX (32 before Linux 2.6.4; 65536 since Linux 2.6.4).
ENOMEM
Out of memory.
EPERM
The calling process has insufficient privilege (the caller does not have the CAP_SETGID capability in the user namespace in which it resides).
EPERM (since Linux 3.19)
The use of setgroups() is denied in this user namespace. See the description of /proc/pid/setgroups in user_namespaces(7).
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including the one for setgroups()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
getgroups()
POSIX.1-2008.
setgroups()
None.
HISTORY
getgroups()
SVr4, 4.3BSD, POSIX.1-2001.
setgroups()
SVr4, 4.3BSD. Since setgroups() requires privilege, it is not covered by POSIX.1.
The original Linux getgroups() system call supported only 16-bit group IDs. Subsequently, Linux 2.4 added getgroups32(), supporting 32-bit IDs. The glibc getgroups() wrapper function transparently deals with the variation across kernel versions.
NOTES
A process can have up to NGROUPS_MAX supplementary group IDs in addition to the effective group ID. The constant NGROUPS_MAX is defined in <limits.h>. The set of supplementary group IDs is inherited from the parent process, and preserved across an execve(2).
The maximum number of supplementary group IDs can be found at run time using sysconf(3):
long ngroups_max;
ngroups_max = sysconf(_SC_NGROUPS_MAX);
The maximum return value of getgroups() cannot be larger than one more than this value. Since Linux 2.6.4, the maximum number of supplementary group IDs is also exposed via the Linux-specific read-only file, /proc/sys/kernel/ngroups_max.
SEE ALSO
getgid(2), setgid(2), getgrouplist(3), group_member(3), initgroups(3), capabilities(7), credentials(7)
196 - Linux cli command setitimer
NAME
setitimer - get or set value of an interval timer
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/time.h>
int getitimer(int which, struct itimerval *curr_value);
int setitimer(int which, const struct itimerval *restrict new_value,
struct itimerval *_Nullable restrict old_value);
DESCRIPTION
These system calls provide access to interval timers, that is, timers that initially expire at some point in the future, and (optionally) at regular intervals after that. When a timer expires, a signal is generated for the calling process, and the timer is reset to the specified interval (if the interval is nonzero).
Three types of timers, specified via the which argument, are provided, each of which counts against a different clock and generates a different signal on timer expiration:
ITIMER_REAL
This timer counts down in real (i.e., wall clock) time. At each expiration, a SIGALRM signal is generated.
ITIMER_VIRTUAL
This timer counts down against the user-mode CPU time consumed by the process. (The measurement includes CPU time consumed by all threads in the process.) At each expiration, a SIGVTALRM signal is generated.
ITIMER_PROF
This timer counts down against the total (i.e., both user and system) CPU time consumed by the process. (The measurement includes CPU time consumed by all threads in the process.) At each expiration, a SIGPROF signal is generated.
In conjunction with ITIMER_VIRTUAL, this timer can be used to profile user and system CPU time consumed by the process.
A process has only one of each of the three types of timers.
Timer values are defined by the following structures:
struct itimerval {
    struct timeval it_interval; /* Interval for periodic timer */
    struct timeval it_value;    /* Time until next expiration */
};

struct timeval {
    time_t      tv_sec;         /* seconds */
    suseconds_t tv_usec;        /* microseconds */
};
getitimer()
The function getitimer() places the current value of the timer specified by which in the buffer pointed to by curr_value.
The it_value substructure is populated with the amount of time remaining until the next expiration of the specified timer. This value changes as the timer counts down, and will be reset to it_interval when the timer expires. If both fields of it_value are zero, then this timer is currently disarmed (inactive).
The it_interval substructure is populated with the timer interval. If both fields of it_interval are zero, then this is a single-shot timer (i.e., it expires just once).
setitimer()
The function setitimer() arms or disarms the timer specified by which, by setting the timer to the value specified by new_value. If old_value is non-NULL, the buffer it points to is used to return the previous value of the timer (i.e., the same information that is returned by getitimer()).
If either field in new_value.it_value is nonzero, then the timer is armed to initially expire at the specified time. If both fields in new_value.it_value are zero, then the timer is disarmed.
The new_value.it_interval field specifies the new interval for the timer; if both of its subfields are zero, the timer is single-shot.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
new_value, old_value, or curr_value is not a valid pointer.
EINVAL
which is not one of ITIMER_REAL, ITIMER_VIRTUAL, or ITIMER_PROF; or (since Linux 2.6.22) one of the tv_usec fields in the structure pointed to by new_value contains a value outside the range [0, 999999].
VERSIONS
The standards are silent on the meaning of the call:
setitimer(which, NULL, &old_value);
Many systems (Solaris, the BSDs, and perhaps others) treat this as equivalent to:
getitimer(which, &old_value);
In Linux, this is treated as being equivalent to a call in which the new_value fields are zero; that is, the timer is disabled. Don’t use this Linux misfeature: it is nonportable and unnecessary.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD (this call first appeared in 4.2BSD). POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending the use of the POSIX timers API (timer_gettime(2), timer_settime(2), etc.) instead.
NOTES
Timers will never expire before the requested time, but may expire some (short) time afterward, which depends on the system timer resolution and on the system load; see time(7). (But see BUGS below.) If the timer expires while the process is active (always true for ITIMER_VIRTUAL), the signal will be delivered immediately when generated.
A child created via fork(2) does not inherit its parent’s interval timers. Interval timers are preserved across an execve(2).
POSIX.1 leaves the interaction between setitimer() and the three interfaces alarm(2), sleep(3), and usleep(3) unspecified.
BUGS
The generation and delivery of a signal are distinct, and only one instance of each of the signals listed above may be pending for a process. Under very heavy loading, an ITIMER_REAL timer may expire before the signal from a previous expiration has been delivered. The second signal in such an event will be lost.
Before Linux 2.6.16, timer values are represented in jiffies. If a request is made to set a timer with a value whose jiffies representation exceeds MAX_SEC_IN_JIFFIES (defined in include/linux/jiffies.h), then the timer is silently truncated to this ceiling value. On Linux/i386 (where, since Linux 2.6.13, the default jiffy is 0.004 seconds), this means that the ceiling value for a timer is approximately 99.42 days. Since Linux 2.6.16, the kernel uses a different internal representation for times, and this ceiling is removed.
On certain systems (including i386), Linux kernels before Linux 2.6.12 have a bug which will produce premature timer expirations of up to one jiffy under some circumstances. This bug is fixed in Linux 2.6.12.
POSIX.1-2001 says that setitimer() should fail if a tv_usec value is specified that is outside of the range [0, 999999]. However, up to and including Linux 2.6.21, Linux does not give an error, but instead silently adjusts the corresponding seconds value for the timer. From Linux 2.6.22 onward, this nonconformance has been repaired: an improper tv_usec value results in an EINVAL error.
SEE ALSO
gettimeofday(2), sigaction(2), signal(2), timer_create(2), timerfd_create(2), time(7)
197 - Linux cli command sigaltstack
NAME
sigaltstack - set and/or get signal stack context
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
int sigaltstack(const stack_t *_Nullable restrict ss,
stack_t *_Nullable restrict old_ss);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigaltstack():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
sigaltstack() allows a thread to define a new alternate signal stack and/or retrieve the state of an existing alternate signal stack. An alternate signal stack is used during the execution of a signal handler if the establishment of that handler (see sigaction(2)) requested it.
The normal sequence of events for using an alternate signal stack is the following:
1. Allocate an area of memory to be used for the alternate signal stack.
2. Use sigaltstack() to inform the system of the existence and location of the alternate signal stack.
3. When establishing a signal handler using sigaction(2), inform the system that the signal handler should be executed on the alternate signal stack by specifying the SA_ONSTACK flag.
The ss argument is used to specify a new alternate signal stack, while the old_ss argument is used to retrieve information about the currently established signal stack. If we are interested in performing just one of these tasks, then the other argument can be specified as NULL.
The stack_t type used to type the arguments of this function is defined as follows:
typedef struct {
void *ss_sp; /* Base address of stack */
int ss_flags; /* Flags */
size_t ss_size; /* Number of bytes in stack */
} stack_t;
To establish a new alternate signal stack, the fields of this structure are set as follows:
ss.ss_flags
This field contains either 0, or the following flag:
SS_AUTODISARM (since Linux 4.7)
Clear the alternate signal stack settings on entry to the signal handler. When the signal handler returns, the previous alternate signal stack settings are restored.
This flag was added in order to make it safe to switch away from the signal handler with swapcontext(3). Without this flag, a subsequently handled signal will corrupt the state of the switched-away signal handler. On kernels where this flag is not supported, sigaltstack() fails with the error EINVAL when this flag is supplied.
ss.ss_sp
This field specifies the starting address of the stack. When a signal handler is invoked on the alternate stack, the kernel automatically aligns the address given in ss.ss_sp to a suitable address boundary for the underlying hardware architecture.
ss.ss_size
This field specifies the size of the stack. The constant SIGSTKSZ is defined to be large enough to cover the usual size requirements for an alternate signal stack, and the constant MINSIGSTKSZ defines the minimum size required to execute a signal handler.
To disable an existing stack, specify ss.ss_flags as SS_DISABLE. In this case, the kernel ignores any other flags in ss.ss_flags and the remaining fields in ss.
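Because kernels before Linux 4.7 reject SS_AUTODISARM with EINVAL, a program that prefers the flag can fall back to plain flags. The following is a sketch under stated assumptions: install_altstack() is a hypothetical helper, and the fallback definition of SS_AUTODISARM (for C library headers that do not provide it) uses the value from the Linux UAPI header <linux/signal.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>

/* Assumption: the C library headers may not define SS_AUTODISARM;
   this is the value used by the Linux UAPI headers. */
#ifndef SS_AUTODISARM
#define SS_AUTODISARM (1U << 31)
#endif

/* Hypothetical helper: allocate and install an alternate signal stack
   of `size` bytes, preferring SS_AUTODISARM and retrying without it if
   the kernel reports EINVAL. Returns the stack memory, or NULL. */
static void *
install_altstack(size_t size)
{
    stack_t ss;

    assert(size >= (size_t) MINSIGSTKSZ);

    ss.ss_sp = malloc(size);
    if (ss.ss_sp == NULL)
        return NULL;
    ss.ss_size = size;
    ss.ss_flags = (int) SS_AUTODISARM;

    if (sigaltstack(&ss, NULL) == -1) {
        if (errno != EINVAL)
            goto fail;
        ss.ss_flags = 0;        /* Flag unsupported: retry without it */
        if (sigaltstack(&ss, NULL) == -1)
            goto fail;
    }
    return ss.ss_sp;

fail:
    free(ss.ss_sp);
    return NULL;
}
```

The caller keeps the returned pointer so that the memory can be freed after the stack has been disabled with SS_DISABLE.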
If old_ss is not NULL, then it is used to return information about the alternate signal stack which was in effect prior to the call to sigaltstack(). The old_ss.ss_sp and old_ss.ss_size fields return the starting address and size of that stack. The old_ss.ss_flags field may return any of the following values:
SS_ONSTACK
The thread is currently executing on the alternate signal stack. (Note that it is not possible to change the alternate signal stack if the thread is currently executing on it.)
SS_DISABLE
The alternate signal stack is currently disabled.
Alternatively, this value is returned if the thread is currently executing on an alternate signal stack that was established using the SS_AUTODISARM flag. In this case, it is safe to switch away from the signal handler with swapcontext(3). It is also possible to set up a different alternate signal stack using a further call to sigaltstack().
SS_AUTODISARM
The alternate signal stack has been marked to be autodisarmed as described above.
By specifying ss as NULL, and old_ss as a non-NULL value, one can obtain the current settings for the alternate signal stack without changing them.
RETURN VALUE
sigaltstack() returns 0 on success, or -1 on failure with errno set to indicate the error.
ERRORS
EFAULT
Either ss or old_ss is not NULL and points to an area outside of the process’s address space.
EINVAL
ss is not NULL and the ss_flags field contains an invalid flag.
ENOMEM
The specified size of the new alternate signal stack ss.ss_size was less than MINSIGSTKSZ.
EPERM
An attempt was made to change the alternate signal stack while it was active (i.e., the thread was already executing on the current alternate signal stack).
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
sigaltstack() | Thread safety | MT-Safe |
STANDARDS
POSIX.1-2008.
SS_AUTODISARM is a Linux extension.
HISTORY
POSIX.1-2001, SUSv2, SVr4.
NOTES
The most common usage of an alternate signal stack is to handle the SIGSEGV signal that is generated if the space available for the standard stack is exhausted: in this case, a signal handler for SIGSEGV cannot be invoked on the standard stack; if we wish to handle it, we must use an alternate signal stack.
Establishing an alternate signal stack is useful if a thread expects that it may exhaust its standard stack. This may occur, for example, because the stack grows so large that it encounters the upwardly growing heap, or it reaches a limit established by a call to setrlimit(RLIMIT_STACK, &rlim). If the standard stack is exhausted, the kernel sends the thread a SIGSEGV signal. In these circumstances the only way to catch this signal is on an alternate signal stack.
On most hardware architectures supported by Linux, stacks grow downward. sigaltstack() automatically takes account of the direction of stack growth.
Functions called from a signal handler executing on an alternate signal stack will also use the alternate signal stack. (This also applies to any handlers invoked for other signals while the thread is executing on the alternate signal stack.) Unlike the standard stack, the system does not automatically extend the alternate signal stack. Exceeding the allocated size of the alternate signal stack will lead to unpredictable results.
A successful call to execve(2) removes any existing alternate signal stack. A child process created via fork(2) inherits a copy of its parent’s alternate signal stack settings. The same is also true for a child process created using clone(2), unless the clone flags include CLONE_VM and do not include CLONE_VFORK, in which case any alternate signal stack that was established in the parent is disabled in the child process.
sigaltstack() supersedes the older sigstack() call. For backward compatibility, glibc also provides sigstack(). All new applications should be written using sigaltstack().
History
4.2BSD had a sigstack() system call. It used a slightly different struct, and had the major disadvantage that the caller had to know the direction of stack growth.
BUGS
In Linux 2.2 and earlier, the only flag that could be specified in ss.ss_flags was SS_DISABLE. In the lead up to the release of the Linux 2.4 kernel, a change was made so that sigaltstack() accepts ss.ss_flags==SS_ONSTACK with the same meaning as ss.ss_flags==0 (i.e., the inclusion of SS_ONSTACK in ss.ss_flags is a no-op). On other implementations, and according to POSIX.1, SS_ONSTACK appears only as a reported flag in old_ss.ss_flags. On Linux, there is no need ever to specify SS_ONSTACK in ss.ss_flags, and indeed doing so should be avoided on portability grounds: various other systems give an error if SS_ONSTACK is specified in ss.ss_flags.
EXAMPLES
The following code segment demonstrates the use of sigaltstack() (and sigaction(2)) to install an alternate signal stack that is employed by a handler for the SIGSEGV signal:
stack_t ss;
struct sigaction sa;

ss.ss_sp = malloc(SIGSTKSZ);
if (ss.ss_sp == NULL) {
    perror("malloc");
    exit(EXIT_FAILURE);
}

ss.ss_size = SIGSTKSZ;
ss.ss_flags = 0;
if (sigaltstack(&ss, NULL) == -1) {
    perror("sigaltstack");
    exit(EXIT_FAILURE);
}

sa.sa_flags = SA_ONSTACK;
sa.sa_handler = handler;      /* Address of a signal handler */
sigemptyset(&sa.sa_mask);
if (sigaction(SIGSEGV, &sa, NULL) == -1) {
    perror("sigaction");
    exit(EXIT_FAILURE);
}
SEE ALSO
execve(2), setrlimit(2), sigaction(2), siglongjmp(3), sigsetjmp(3), signal(7)
198 - Linux cli command fremovexattr
NAME
fremovexattr - remove an extended attribute
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
int removexattr(const char *path, const char *name);
int lremovexattr(const char *path, const char *name);
int fremovexattr(int fd, const char *name);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
removexattr() removes the extended attribute identified by name and associated with the given path in the filesystem.
lremovexattr() is identical to removexattr(), except in the case of a symbolic link, where the extended attribute is removed from the link itself, not the file that it refers to.
fremovexattr() is identical to removexattr(), only the extended attribute is removed from the open file referred to by fd (as returned by open(2)) in place of path.
An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode.
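As an illustration, the sketch below removes an attribute through an open file descriptor and treats a missing attribute as success. The helper name remove_xattr_if_present() and the decision to ignore ENODATA are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/xattr.h>

/* Hypothetical helper: remove the attribute `name` (including its
   namespace prefix, e.g. "user.comment") from the open file `fd`.
   A missing attribute (ENODATA) is treated as success.
   Returns 0 on success, or -1 with errno set. */
static int
remove_xattr_if_present(int fd, const char *name)
{
    assert(name != NULL);

    if (fremovexattr(fd, name) == 0)
        return 0;
    return (errno == ENODATA) ? 0 : -1;
}
```

Other errors, such as ENOTSUP on a filesystem without extended attribute support, are passed back to the caller unchanged.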
RETURN VALUE
On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
ENODATA
The named attribute does not exist.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
In addition, the errors documented in stat(2) can also occur.
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
SEE ALSO
getfattr(1), setfattr(1), getxattr(2), listxattr(2), open(2), setxattr(2), stat(2), symlink(7), xattr(7)
199 - Linux cli command pselect6
NAME
pselect6 - synchronous I/O multiplexing
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/select.h>
typedef /* ... */ fd_set;
int select(int nfds, fd_set *_Nullable restrict readfds,
fd_set *_Nullable restrict writefds,
fd_set *_Nullable restrict exceptfds,
struct timeval *_Nullable restrict timeout);
void FD_CLR(int fd, fd_set *set);
int FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);
int pselect(int nfds, fd_set *_Nullable restrict readfds,
fd_set *_Nullable restrict writefds,
fd_set *_Nullable restrict exceptfds,
const struct timespec *_Nullable restrict timeout,
const sigset_t *_Nullable restrict sigmask);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
pselect():
_POSIX_C_SOURCE >= 200112L
DESCRIPTION
WARNING: select() can monitor only file descriptor numbers that are less than FD_SETSIZE (1024), an unreasonably low limit for many modern applications, and this limitation will not change. All modern applications should instead use poll(2) or epoll(7), which do not suffer this limitation.
select() allows a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become “ready” for some class of I/O operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2), or a sufficiently small write(2)) without blocking.
fd_set
A structure type that can represent a set of file descriptors. According to POSIX, the maximum number of file descriptors in an fd_set structure is the value of the macro FD_SETSIZE.
File descriptor sets
The principal arguments of select() are three “sets” of file descriptors (declared with the type fd_set), which allow the caller to wait for three classes of events on the specified set of file descriptors. Each of the fd_set arguments may be specified as NULL if no file descriptors are to be watched for the corresponding class of events.
Note well: Upon return, each of the file descriptor sets is modified in place to indicate which file descriptors are currently “ready”. Thus, if using select() within a loop, the sets must be reinitialized before each call.
The contents of a file descriptor set can be manipulated using the following macros:
FD_ZERO()
This macro clears (removes all file descriptors from) set. It should be employed as the first step in initializing a file descriptor set.
FD_SET()
This macro adds the file descriptor fd to set. Adding a file descriptor that is already present in the set is a no-op, and does not produce an error.
FD_CLR()
This macro removes the file descriptor fd from set. Removing a file descriptor that is not present in the set is a no-op, and does not produce an error.
FD_ISSET()
select() modifies the contents of the sets according to the rules described below. After calling select(), the FD_ISSET() macro can be used to test if a file descriptor is still present in a set. FD_ISSET() returns nonzero if the file descriptor fd is present in set, and zero if it is not.
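The macros can be combined as in the following sketch, which fills a set from an array and returns the highest descriptor plus one (the value select() expects as nfds). The helper name build_fd_set() is hypothetical.

```c
#include <assert.h>
#include <sys/select.h>

/* Hypothetical helper: initialize `set` from an array of descriptors.
   Returns the highest descriptor plus 1, or 0 for an empty array. */
static int
build_fd_set(fd_set *set, const int *fds, int count)
{
    int maxfd = -1;

    FD_ZERO(set);               /* Always start from an empty set */
    for (int i = 0; i < count; i++) {
        /* FD_SET() is undefined outside [0, FD_SETSIZE). */
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], set);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }
    return maxfd + 1;
}
```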
Arguments
The arguments of select() are as follows:
readfds
The file descriptors in this set are watched to see if they are ready for reading. A file descriptor is ready for reading if a read operation will not block; in particular, a file descriptor is also ready on end-of-file.
After select() has returned, readfds will be cleared of all file descriptors except for those that are ready for reading.
writefds
The file descriptors in this set are watched to see if they are ready for writing. A file descriptor is ready for writing if a write operation will not block. However, even if a file descriptor is reported as writable, a large write may still block.
After select() has returned, writefds will be cleared of all file descriptors except for those that are ready for writing.
exceptfds
The file descriptors in this set are watched for “exceptional conditions”. For examples of some exceptional conditions, see the discussion of POLLPRI in poll(2).
After select() has returned, exceptfds will be cleared of all file descriptors except for those for which an exceptional condition has occurred.
nfds
This argument should be set to the highest-numbered file descriptor in any of the three sets, plus 1. The indicated file descriptors in each set are checked, up to this limit (but see BUGS).
timeout
The timeout argument is a timeval structure (shown below) that specifies the interval that select() should block waiting for a file descriptor to become ready. The call will block until either:
a file descriptor becomes ready;
the call is interrupted by a signal handler; or
the timeout expires.
Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.
If both fields of the timeval structure are zero, then select() returns immediately. (This is useful for polling.)
If timeout is specified as NULL, select() blocks indefinitely waiting for a file descriptor to become ready.
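The arguments can be put together as in the sketch below, which waits for a single descriptor and then reads from it. The helper read_with_timeout() and its convention of reporting a timeout as -1 with errno set to ETIMEDOUT are assumptions for this example. The set and the timeout are rebuilt on every call because select() may overwrite both.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms milliseconds for fd to
   become readable, then read up to len bytes. Returns the result of
   read(2), or -1 with errno set (ETIMEDOUT if nothing arrived). */
static ssize_t
read_with_timeout(int fd, void *buf, size_t len, long timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int ready;

    assert(fd >= 0 && fd < FD_SETSIZE);
    assert(timeout_ms >= 0);

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* nfds is the highest descriptor in any set, plus 1. */
    ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ready == -1)
        return -1;
    if (ready == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    return read(fd, buf, len);
}
```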
pselect()
The pselect() system call allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.
The operation of select() and pselect() is identical, other than these three differences:
select() uses a timeout that is a struct timeval (with seconds and microseconds), while pselect() uses a struct timespec (with seconds and nanoseconds).
select() may update the timeout argument to indicate how much time was left. pselect() does not change this argument.
select() has no sigmask argument, and behaves as pselect() called with NULL sigmask.
sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is not NULL, then pselect() first replaces the current signal mask by the one pointed to by sigmask, then does the “select” function, and then restores the original signal mask. (If sigmask is NULL, the signal mask is not modified during the pselect() call.)
Other than the difference in the precision of the timeout argument, the following pselect() call:
ready = pselect(nfds, &readfds, &writefds, &exceptfds,
timeout, &sigmask);
is equivalent to atomically executing the following calls:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
The reason that pselect() is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic test is needed to prevent race conditions. (Suppose the signal handler sets a global flag and returns. Then a test of this global flag followed by a call of select() could hang indefinitely if the signal arrived just after the test but just before the call. By contrast, pselect() allows one to first block signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.)
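The pattern described above might look as follows in a single-threaded program. This is a sketch: the names got_signal, on_signal(), prepare_signal(), and read_byte_or_signal() are hypothetical, and sigprocmask(2) is used on the assumption that only one thread exists.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal;    /* Set by the handler */

static void
on_signal(int signo)
{
    (void) signo;
    got_signal = 1;
}

/* Hypothetical helper: block signo, install the handler, and store in
   *waitmask the mask to use while waiting (signo unblocked). */
static int
prepare_signal(int signo, sigset_t *waitmask)
{
    struct sigaction sa;
    sigset_t block;

    sigemptyset(&block);
    sigaddset(&block, signo);
    if (sigprocmask(SIG_BLOCK, &block, waitmask) == -1)
        return -1;
    sigdelset(waitmask, signo);

    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Hypothetical helper: returns 1 after storing one byte from fd in
   *out, 0 if the signal was caught, -1 on error or end-of-file. */
static int
read_byte_or_signal(int fd, const sigset_t *waitmask, char *out)
{
    fd_set readfds;

    assert(fd >= 0 && fd < FD_SETSIZE);

    /* The signal is blocked here, so this test cannot race with it. */
    if (got_signal)
        return 0;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    /* The signal is unblocked only for the duration of pselect(). */
    if (pselect(fd + 1, &readfds, NULL, NULL, NULL, waitmask) == -1)
        return (errno == EINTR && got_signal) ? 0 : -1;

    return read(fd, out, 1) == 1 ? 1 : -1;
}
```

The caller is expected to reset got_signal after handling the event and before waiting again.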
The timeout
The timeout argument for select() is a structure of the following type:
struct timeval {
time_t tv_sec; /* seconds */
suseconds_t tv_usec; /* microseconds */
};
The corresponding argument for pselect() is a timespec(3) structure.
On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1 permits either behavior.) This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses a struct timeval for multiple select()s in a loop without reinitializing it. Consider timeout to be undefined after select() returns.
RETURN VALUE
On success, select() and pselect() return the number of file descriptors contained in the three returned descriptor sets (that is, the total number of bits that are set in readfds, writefds, exceptfds). The return value may be zero if the timeout expired before any file descriptors became ready.
On error, -1 is returned, and errno is set to indicate the error; the file descriptor sets are unmodified, and timeout becomes undefined.
ERRORS
EBADF
An invalid file descriptor was given in one of the sets. (Perhaps a file descriptor that was already closed, or one on which an error has occurred.) However, see BUGS.
EINTR
A signal was caught; see signal(7).
EINVAL
nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see getrlimit(2)).
EINVAL
The value contained within timeout is invalid.
ENOMEM
Unable to allocate memory for internal tables.
VERSIONS
On some other UNIX systems, select() can fail with the error EAGAIN if the system fails to allocate kernel-internal resources, rather than ENOMEM as Linux does. POSIX specifies this error for poll(2), but not for select(). Portable programs may wish to check for EAGAIN and loop, just as with EINTR.
STANDARDS
POSIX.1-2008.
HISTORY
select()
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
Generally portable to/from non-BSD systems supporting clones of the BSD socket layer (including System V variants). However, note that the System V variant typically sets the timeout variable before returning, but the BSD variant does not.
pselect()
Linux 2.6.16. POSIX.1g, POSIX.1-2001.
Prior to this, it was emulated in glibc (but see BUGS).
fd_set
POSIX.1-2001.
NOTES
The following header also provides the fd_set type: <sys/time.h>.
An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior. Moreover, POSIX requires fd to be a valid file descriptor.
The operation of select() and pselect() is not affected by the O_NONBLOCK flag.
The self-pipe trick
On systems that lack pselect(), reliable (and more portable) signal trapping can be achieved using the self-pipe trick. In this technique, a signal handler writes a byte to a pipe whose other end is monitored by select() in the main program. (To avoid possibly blocking when writing to a pipe that may be full or reading from a pipe that may be empty, nonblocking I/O is used when reading from and writing to the pipe.)
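A minimal sketch of the self-pipe trick follows. The names self_pipe, wake_handler(), self_pipe_init(), self_pipe_wait(), and self_pipe_drain() are hypothetical; a real program would normally add its other descriptors to the same set.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

static int self_pipe[2] = { -1, -1 };   /* [0] read end, [1] write end */

static void
wake_handler(int signo)
{
    int saved_errno = errno;
    unsigned char byte = (unsigned char) signo;

    /* write(2) is async-signal-safe; if the pipe is full (EAGAIN), a
       wakeup is already pending, so the failure is ignored. */
    if (write(self_pipe[1], &byte, 1) == -1) {
        /* nothing to do */
    }
    errno = saved_errno;
}

static int
set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    return flags == -1 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Create the pipe, make both ends nonblocking, install the handler. */
static int
self_pipe_init(int signo)
{
    struct sigaction sa;

    if (pipe(self_pipe) == -1)
        return -1;
    if (set_nonblocking(self_pipe[0]) == -1
        || set_nonblocking(self_pipe[1]) == -1)
        return -1;

    sa.sa_handler = wake_handler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Returns 1 if a signal has been recorded, 0 on timeout, -1 on error. */
static int
self_pipe_wait(long timeout_ms)
{
    assert(self_pipe[0] >= 0 && self_pipe[0] < FD_SETSIZE);

    for (;;) {
        fd_set readfds;
        struct timeval tv;
        int ready;

        FD_ZERO(&readfds);
        FD_SET(self_pipe[0], &readfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        ready = select(self_pipe[0] + 1, &readfds, NULL, NULL, &tv);
        if (ready == -1 && errno == EINTR)
            continue;       /* Handler ran; its byte is in the pipe */
        return ready > 0 ? 1 : ready;
    }
}

/* Discard all queued bytes; returns how many were read. */
static int
self_pipe_drain(void)
{
    unsigned char buf[64];
    ssize_t n;
    int total = 0;

    while ((n = read(self_pipe[0], buf, sizeof buf)) > 0)
        total += (int) n;
    return total;
}
```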
Emulating usleep(3)
Before the advent of usleep(3), some code employed a call to select() with all three sets empty, nfds zero, and a non-NULL timeout as a fairly portable way to sleep with subsecond precision.
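That idiom can be sketched as follows; the helper name select_sleep_us() is hypothetical. New code would normally call nanosleep(2) instead.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: sleep for usec microseconds by calling select()
   with no file descriptors. Returns 0 once the timeout has expired, or
   -1 with errno set (for example EINTR if a signal handler ran). */
static int
select_sleep_us(long usec)
{
    struct timeval tv;

    assert(usec >= 0);

    tv.tv_sec = usec / 1000000;
    tv.tv_usec = usec % 1000000;
    return select(0, NULL, NULL, NULL, &tv);
}
```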
Correspondence between select() and poll() notifications
Within the Linux kernel source, we find the following definitions which show the correspondence between the readable, writable, and exceptional condition notifications of select() and the event notifications provided by poll(2) and epoll(7):
#define POLLIN_SET (EPOLLRDNORM | EPOLLRDBAND | EPOLLIN |
EPOLLHUP | EPOLLERR)
/* Ready for reading */
#define POLLOUT_SET (EPOLLWRBAND | EPOLLWRNORM | EPOLLOUT |
EPOLLERR)
/* Ready for writing */
#define POLLEX_SET (EPOLLPRI)
/* Exceptional condition */
Multithreaded applications
If a file descriptor being monitored by select() is closed in another thread, the result is unspecified. On some UNIX systems, select() unblocks and returns, with an indication that the file descriptor is ready (a subsequent I/O operation will likely fail with an error, unless another process reopens the file descriptor between the time select() returned and the I/O operation is performed). On Linux (and some other systems), closing the file descriptor in another thread has no effect on select(). In summary, any application that relies on a particular behavior in this scenario must be considered buggy.
C library/kernel differences
The Linux kernel allows file descriptor sets of arbitrary size, determining the length of the sets to be checked from the value of nfds. However, in the glibc implementation, the fd_set type is fixed in size. See also BUGS.
The pselect() interface described in this page is implemented by glibc. The underlying Linux system call is named pselect6(). This system call has somewhat different behavior from the glibc wrapper function.
The Linux pselect6() system call modifies its timeout argument. However, the glibc wrapper function hides this behavior by using a local variable for the timeout argument that is passed to the system call. Thus, the glibc pselect() function does not modify its timeout argument; this is the behavior required by POSIX.1-2001.
The final argument of the pselect6() system call is not a sigset_t * pointer, but is instead a structure of the form:
struct {
const kernel_sigset_t *ss; /* Pointer to signal set */
size_t ss_len; /* Size (in bytes) of object
pointed to by 'ss' */
};
This allows the system call to obtain both a pointer to the signal set and its size, while allowing for the fact that most architectures support a maximum of 6 arguments to a system call. See sigprocmask(2) for a discussion of the difference between the kernel and libc notion of the signal set.
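For illustration only, the raw system call could be invoked as sketched below. The assumptions are significant: a 64-bit Linux architecture on which SYS_pselect6 exists and struct timespec matches the kernel layout, and a kernel signal set that occupies the first _NSIG / 8 bytes of the C library's sigset_t. The names pselect6_sigarg and raw_pselect6() are hypothetical. Ordinary programs should call pselect() instead.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Layout of the sixth pselect6() argument, as described above. */
struct pselect6_sigarg {
    const sigset_t *ss;     /* Pointer to signal set */
    size_t          ss_len; /* Size (in bytes) of the kernel signal set */
};

/* Hypothetical wrapper: watch only readfds. Unlike the glibc pselect()
   wrapper, the kernel may update *timeout. Returns the number of ready
   descriptors, or -1 with errno set. */
static long
raw_pselect6(int nfds, fd_set *readfds, struct timespec *timeout,
             const sigset_t *mask)
{
    /* Assumption: the kernel set is _NSIG / 8 bytes (8 on x86-64). */
    struct pselect6_sigarg arg = { mask, _NSIG / 8 };

    assert(nfds >= 0);
    return syscall(SYS_pselect6, nfds, readfds, NULL, NULL, timeout, &arg);
}
```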
Historical glibc details
glibc 2.0 provided an incorrect version of pselect() that did not take a sigmask argument.
From glibc 2.1 to glibc 2.2.1, one must define _GNU_SOURCE in order to obtain the declaration of pselect() from <sys/select.h>.
BUGS
POSIX allows an implementation to define an upper limit, advertised via the constant FD_SETSIZE, on the range of file descriptors that can be specified in a file descriptor set. The Linux kernel imposes no fixed limit, but the glibc implementation makes fd_set a fixed-size type, with FD_SETSIZE defined as 1024, and the FD_*() macros operating according to that limit. To monitor file descriptors greater than 1023, use poll(2) or epoll(7) instead.
The implementation of the fd_set arguments as value-result arguments is a design error that is avoided in poll(2) and epoll(7).
According to POSIX, select() should check all specified file descriptors in the three file descriptor sets, up to the limit nfds-1. However, the current implementation ignores any file descriptor in these sets that is greater than the maximum file descriptor number that the process currently has open. According to POSIX, any such file descriptor that is specified in one of the sets should result in the error EBADF.
Starting with glibc 2.1, glibc provided an emulation of pselect() that was implemented using sigprocmask(2) and select(). This implementation remained vulnerable to the very race condition that pselect() was designed to prevent. Modern versions of glibc use the (race-free) pselect() system call on kernels where it is provided.
On Linux, select() may report a socket file descriptor as “ready for reading”, while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has the wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.
On Linux, select() also modifies timeout if the call is interrupted by a signal handler (i.e., the EINTR error return). This is not permitted by POSIX.1. The Linux pselect() system call has the same behavior, but the glibc wrapper hides this behavior by internally copying the timeout to a local variable and passing that variable to the system call.
EXAMPLES
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>

int
main(void)
{
    int             retval;
    fd_set          rfds;
    struct timeval  tv;

    /* Watch stdin (fd 0) to see when it has input. */

    FD_ZERO(&rfds);
    FD_SET(0, &rfds);

    /* Wait up to five seconds. */

    tv.tv_sec = 5;
    tv.tv_usec = 0;

    retval = select(1, &rfds, NULL, NULL, &tv);
    /* Don't rely on the value of tv now! */

    if (retval == -1)
        perror("select()");
    else if (retval)
        printf("Data is available now.\n");
        /* FD_ISSET(0, &rfds) will be true. */
    else
        printf("No data within five seconds.\n");

    exit(EXIT_SUCCESS);
}
SEE ALSO
accept(2), connect(2), poll(2), read(2), recv(2), restart_syscall(2), send(2), sigprocmask(2), write(2), timespec(3), epoll(7), time(7)
For a tutorial with discussion and examples, see select_tut(2).
200 - Linux cli command getpgid
NAME
getpgid - set/get process group
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setpgid(pid_t pid, pid_t pgid);
pid_t getpgid(pid_t pid);
pid_t getpgrp(void); /* POSIX.1 version */
[[deprecated]] pid_t getpgrp(pid_t pid); /* BSD version */
int setpgrp(void); /* System V version */
[[deprecated]] int setpgrp(pid_t pid, pid_t pgid); /* BSD version */
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
getpgid():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
setpgrp() (POSIX.1):
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _SVID_SOURCE
setpgrp() (BSD), getpgrp() (BSD):
[These are available only before glibc 2.19]
_BSD_SOURCE &&
! (_POSIX_SOURCE || _POSIX_C_SOURCE || _XOPEN_SOURCE
|| _GNU_SOURCE || _SVID_SOURCE)
DESCRIPTION
All of these interfaces are available on Linux, and are used for getting and setting the process group ID (PGID) of a process. The preferred, POSIX.1-specified ways of doing this are: getpgrp(void), for retrieving the calling process’s PGID; and setpgid(), for setting a process’s PGID.
setpgid() sets the PGID of the process specified by pid to pgid. If pid is zero, then the process ID of the calling process is used. If pgid is zero, then the PGID of the process specified by pid is made the same as its process ID. If setpgid() is used to move a process from one process group to another (as is done by some shells when creating pipelines), both process groups must be part of the same session (see setsid(2) and credentials(7)). In this case, the pgid specifies an existing process group to be joined and the session ID of that group must match the session ID of the joining process.
The POSIX.1 version of getpgrp(), which takes no arguments, returns the PGID of the calling process.
getpgid() returns the PGID of the process specified by pid. If pid is zero, the process ID of the calling process is used. (Retrieving the PGID of a process other than the caller is rarely necessary; for retrieving the caller’s own PGID, the POSIX.1 getpgrp() is preferred.)
The System V-style setpgrp(), which takes no arguments, is equivalent to setpgid(0, 0).
The BSD-specific setpgrp() call, which takes arguments pid and pgid, is a wrapper function that calls
setpgid(pid, pgid)
Since glibc 2.19, the BSD-specific setpgrp() function is no longer exposed by <unistd.h>; calls should be replaced with the setpgid() call shown above.
The BSD-specific getpgrp() call, which takes a single pid argument, is a wrapper function that calls
getpgid(pid)
Since glibc 2.19, the BSD-specific getpgrp() function is no longer exposed by <unistd.h>; calls should be replaced with calls to the POSIX.1 getpgrp() which takes no arguments (if the intent is to obtain the caller’s PGID), or with the getpgid() call shown above.
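As a rough illustration of setpgid() and getpgid() working together, the sketch below forks a child, places it in its own process group, and then checks the result with getpgid(). Calling setpgid() from both the parent and the child is a common way to avoid depending on which of them runs first. The helper name child_gets_own_group() is made up for this example and is not part of any library.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (illustration only).
 * Returns 1 if the child ended up as leader of its own process group,
 * 0 if it did not, and -1 on error. */
int child_gets_own_group(void)
{
    pid_t child = fork();
    if (child == -1)
        return -1;

    if (child == 0) {
        /* Child: pid 0 means "me", pgid 0 means "use my PID as the PGID". */
        setpgid(0, 0);
        pause();                /* sleep until the parent kills us */
        _exit(0);
    }

    /* Parent: request the same change, so the outcome does not depend on
     * whether the child has been scheduled yet. */
    int rc = setpgid(child, child);
    pid_t pgid = (rc == 0) ? getpgid(child) : (pid_t) -1;

    /* Clean up the child before reporting. */
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);

    if (pgid == (pid_t) -1)
        return -1;
    return pgid == child;
}
```

A shell building a pipeline would do something similar, except that later children would be passed the PGID of the first child rather than their own PID.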
RETURN VALUE
On success, setpgid() and setpgrp() return zero. On error, -1 is returned, and errno is set to indicate the error.
The POSIX.1 getpgrp() always returns the PGID of the caller.
getpgid() and the BSD-specific getpgrp() return a process group ID on success. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
An attempt was made to change the process group ID of one of the children of the calling process and the child had already performed an execve(2) (setpgid(), setpgrp()).
EINVAL
pgid is less than 0 (setpgid(), setpgrp()).
EPERM
An attempt was made to move a process into a process group in a different session, or to change the process group ID of one of the children of the calling process and the child was in a different session, or to change the process group ID of a session leader (setpgid(), setpgrp()).
EPERM
The target process group does not exist (setpgid(), setpgrp()).
ESRCH
For getpgid(): pid does not match any process. For setpgid(): pid is not the calling process and not a child of the calling process.
STANDARDS
getpgid()
setpgid()
getpgrp() (no args)
setpgrp() (no args)
POSIX.1-2008 (but see HISTORY).
setpgrp() (2 args)
getpgrp() (1 arg)
None.
HISTORY
getpgid()
setpgid()
getpgrp() (no args)
POSIX.1-2001.
setpgrp() (no args)
POSIX.1-2001. POSIX.1-2008 marks it as obsolete.
setpgrp() (2 args)
getpgrp() (1 arg)
4.2BSD.
NOTES
A child created via fork(2) inherits its parent’s process group ID. The PGID is preserved across an execve(2).
Each process group is a member of a session and each process is a member of the session of which its process group is a member. (See credentials(7).)
A session can have a controlling terminal. At any time, one (and only one) of the process groups in the session can be the foreground process group for the terminal; the remaining process groups are in the background. If a signal is generated from the terminal (e.g., typing the interrupt key to generate SIGINT), that signal is sent to the foreground process group. (See termios(3) for a description of the characters that generate signals.) Only the foreground process group may read(2) from the terminal; if a background process group tries to read(2) from the terminal, then the group is sent a SIGTTIN signal, which suspends it. The tcgetpgrp(3) and tcsetpgrp(3) functions are used to get/set the foreground process group of the controlling terminal.
The setpgid() and getpgrp() calls are used by programs such as bash(1) to create process groups in order to implement shell job control.
If the termination of a process causes a process group to become orphaned, and if any member of the newly orphaned process group is stopped, then a SIGHUP signal followed by a SIGCONT signal will be sent to each process in the newly orphaned process group. An orphaned process group is one in which the parent of every member of the process group is either itself also a member of the process group or is a member of a process group in a different session (see also credentials(7)).
SEE ALSO
getuid(2), setsid(2), tcgetpgrp(3), tcsetpgrp(3), termios(3), credentials(7)
201 - Linux cli command ioctl_fslabel
NAME 🖥️ ioctl_fslabel 🖥️
get or set a filesystem label
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/fs.h> /* Definition of *FSLABEL* constants */
#include <sys/ioctl.h>
int ioctl(int fd, FS_IOC_GETFSLABEL, char label[FSLABEL_MAX]);
int ioctl(int fd, FS_IOC_SETFSLABEL, char label[FSLABEL_MAX]);
DESCRIPTION
If a filesystem supports online label manipulation, these ioctl(2) operations can be used to get or set the filesystem label for the filesystem on which fd resides. The FS_IOC_SETFSLABEL operation requires privilege (CAP_SYS_ADMIN).
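A minimal sketch of reading a label with FS_IOC_GETFSLABEL follows. It assumes kernel headers recent enough to define the FSLABEL constants in <linux/fs.h>; the helper name get_fs_label() is invented for this example. Many filesystems do not support the operation, in which case the call is expected to fail (typically with ENOTTY).

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>    /* FS_IOC_GETFSLABEL, FSLABEL_MAX */
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): read the label of the
 * filesystem on which path resides into label.
 * Returns 0 on success, -1 on error with errno set by open() or ioctl(). */
int get_fs_label(const char *path, char label[FSLABEL_MAX])
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    int rc = ioctl(fd, FS_IOC_GETFSLABEL, label);

    /* Preserve the ioctl() error across close(). */
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;

    return rc == 0 ? 0 : -1;
}
```

Setting a label would follow the same shape with FS_IOC_SETFSLABEL and a null-terminated string, but requires CAP_SYS_ADMIN.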
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Possible errors include (but are not limited to) the following:
EFAULT
label references an inaccessible memory area.
EINVAL
The specified label exceeds the maximum label length for the filesystem.
ENOTTY
This can appear if the filesystem does not support online label manipulation.
EPERM
The calling process does not have sufficient permissions to set the label.
STANDARDS
Linux.
HISTORY
Linux 4.18.
They were previously known as BTRFS_IOC_GET_FSLABEL and BTRFS_IOC_SET_FSLABEL and were private to Btrfs.
NOTES
The maximum string length for this interface is FSLABEL_MAX, including the terminating null byte (‘\0’). Filesystems have differing maximum label lengths, which may or may not include the terminating null. The string provided to FS_IOC_SETFSLABEL must always be null-terminated, and the string returned by FS_IOC_GETFSLABEL will always be null-terminated.
SEE ALSO
ioctl(2), blkid(8)
202 - Linux cli command swapoff
NAME 🖥️ swapoff 🖥️
start/stop swapping to file/device
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/swap.h>
int swapon(const char *path, int swapflags);
int swapoff(const char *path);
DESCRIPTION
swapon() sets the swap area to the file or block device specified by path. swapoff() stops swapping to the file or block device specified by path.
If the SWAP_FLAG_PREFER flag is specified in the swapon() swapflags argument, the new swap area will have a higher priority than default. The priority is encoded within swapflags as:
(prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK
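The encoding above can be wrapped in a small function. This is only a sketch: the helper name swap_flags_with_priority() is hypothetical, and it assumes the caller passes a nonnegative priority that fits within SWAP_FLAG_PRIO_MASK.

```c
#include <sys/swap.h>

/* Hypothetical helper (illustration only): build a swapflags value that
 * requests the given priority. Assumes prio is nonnegative and small
 * enough to fit in the SWAP_FLAG_PRIO_MASK field; bits outside the mask
 * are silently dropped. */
int swap_flags_with_priority(int prio)
{
    return SWAP_FLAG_PREFER
         | ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);
}
```

A privileged caller might then pass the result as the swapflags argument of swapon(), for a path that has already been prepared with mkswap(8).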
If the SWAP_FLAG_DISCARD flag is specified in the swapon() swapflags argument, freed swap pages will be discarded before they are reused, if the swap device supports the discard or trim operation. (This may improve performance on some Solid State Devices, but often it does not.) See also NOTES.
These functions may be used only by a privileged process (one having the CAP_SYS_ADMIN capability).
Priority
Each swap area has a priority, either high or low. The default priority is low. Within the low-priority areas, newer areas are even lower priority than older areas.
All priorities set with swapflags are high-priority, higher than default. They may have any nonnegative value chosen by the caller. Higher numbers mean higher priority.
Swap pages are allocated from areas in priority order, highest priority first. For areas with different priorities, a higher-priority area is exhausted before using a lower-priority area. If two or more areas have the same priority, and it is the highest priority available, pages are allocated on a round-robin basis between them.
As of Linux 1.3.6, the kernel usually follows these rules, but there are exceptions.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBUSY
(for swapon()) The specified path is already being used as a swap area.
EINVAL
The file path exists, but refers neither to a regular file nor to a block device.
EINVAL
(swapon()) The indicated path does not contain a valid swap signature or resides on an in-memory filesystem such as tmpfs(5).
EINVAL (since Linux 3.4)
(swapon()) An invalid flag value was specified in swapflags.
EINVAL
(swapoff()) path is not currently a swap area.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOENT
The file path does not exist.
ENOMEM
The system has insufficient memory to start swapping.
EPERM
The caller does not have the CAP_SYS_ADMIN capability. Alternatively, the maximum number of swap files are already in use; see NOTES below.
STANDARDS
Linux.
HISTORY
The swapflags argument was introduced in Linux 1.3.2.
NOTES
The partition or path must be prepared with mkswap(8).
There is an upper limit on the number of swap files that may be used, defined by the kernel constant MAX_SWAPFILES. Before Linux 2.4.10, MAX_SWAPFILES had the value 8; since Linux 2.4.10, it has the value 32. If the kernel is built with the CONFIG_MIGRATION option (which reserves swap table entries for the page migration features of mbind(2) and migrate_pages(2)), the limit is decreased by 2 (thus 30) since Linux 2.6.18, and by 3 (thus 29) since Linux 5.19. Since Linux 2.6.32, the limit is further decreased by 1 if the kernel is built with the CONFIG_MEMORY_FAILURE option. Since Linux 5.14, the limit is further decreased by 4 if the kernel is built with the CONFIG_DEVICE_PRIVATE option. Since Linux 5.19, the limit is further decreased by 1 if the kernel is built with the CONFIG_PTE_MARKER option.
Discard of swap pages was introduced in Linux 2.6.29, then made conditional on the SWAP_FLAG_DISCARD flag in Linux 2.6.36, which still discards the entire swap area when swapon() is called, even if that flag bit is not set.
SEE ALSO
mkswap(8), swapoff(8), swapon(8)
203 - Linux cli command writev
NAME 🖥️ writev 🖥️
read or write data into multiple buffers
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/uio.h>
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
preadv(), pwritev():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
The readv() system call reads iovcnt buffers from the file associated with the file descriptor fd into the buffers described by iov (“scatter input”).
The writev() system call writes iovcnt buffers of data described by iov to the file associated with the file descriptor fd (“gather output”).
The pointer iov points to an array of iovec structures, described in iovec(3type).
The readv() system call works just like read(2) except that multiple buffers are filled.
The writev() system call works just like write(2) except that multiple buffers are written out.
Buffers are processed in array order. This means that readv() completely fills iov[0] before proceeding to iov[1], and so on. (If there is insufficient data, then not all buffers pointed to by iov may be filled.) Similarly, writev() writes out the entire contents of iov[0] before proceeding to iov[1], and so on.
The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes; analogously, readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description (see open(2)).
preadv() and pwritev()
The preadv() system call combines the functionality of readv() and pread(2). It performs the same task as readv(), but adds a fourth argument, offset, which specifies the file offset at which the input operation is to be performed.
The pwritev() system call combines the functionality of writev() and pwrite(2). It performs the same task as writev(), but adds a fourth argument, offset, which specifies the file offset at which the output operation is to be performed.
The file offset is not changed by these system calls. The file referred to by fd must be capable of seeking.
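The following sketch shows positional gather output and scatter input on a scratch file. It assumes /tmp is writable; the helper name positional_io_demo() and the file name template are invented for this example.

```c
#define _DEFAULT_SOURCE 1
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): writes "hello world\n" at
 * offset 0 of a scratch file with pwritev(), then reads 5 bytes starting
 * at offset 6 with preadv(), scattering them into first (3 bytes) and
 * second (2 bytes). Returns the number of bytes read, or -1 on error. */
ssize_t positional_io_demo(char first[3], char second[2])
{
    char path[] = "/tmp/preadv-demo-XXXXXX";  /* assumes /tmp is writable */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);               /* the file disappears once fd is closed */

    struct iovec out[2] = {
        { .iov_base = "hello ",  .iov_len = 6 },
        { .iov_base = "world\n", .iov_len = 6 },
    };
    ssize_t nread = -1;

    if (pwritev(fd, out, 2, 0) == 12) {
        struct iovec in[2] = {
            { .iov_base = first,  .iov_len = 3 },
            { .iov_base = second, .iov_len = 2 },
        };
        /* Buffers are filled in array order; the file offset of fd is
         * left unchanged by both calls. */
        nread = preadv(fd, in, 2, 6);
    }

    close(fd);
    return nread;
}
```

Because neither call moves the file offset, several threads could issue such calls on the same descriptor without coordinating lseek(2) calls.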
preadv2() and pwritev2()
These system calls are similar to preadv() and pwritev() calls, but add a fifth argument, flags, which modifies the behavior on a per-call basis.
Unlike preadv() and pwritev(), if the offset argument is -1, then the current file offset is used and updated.
The flags argument contains a bitwise OR of zero or more of the following flags:
RWF_DSYNC (since Linux 4.7)
Provide a per-write equivalent of the O_DSYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_HIPRI (since Linux 4.6)
High priority read/write. Allows block-based filesystems to use polling of the device, which provides lower latency, but may use additional resources. (Currently, this feature is usable only on a file descriptor opened using the O_DIRECT flag.)
RWF_SYNC (since Linux 4.7)
Provide a per-write equivalent of the O_SYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_NOWAIT (since Linux 4.14)
Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read. If no bytes were read, it will return -1 and set errno to EAGAIN (but see BUGS). Currently, this flag is meaningful only for preadv2().
RWF_APPEND (since Linux 4.16)
Provide a per-write equivalent of the O_APPEND open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call. The offset argument does not affect the write operation; the data is always appended to the end of the file. However, if the offset argument is -1, the current file offset is updated.
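As a small sketch of the offset -1 convention, the helper below reads from the current file offset with preadv2() and no flags, which makes the call behave much like readv(). It assumes glibc 2.26 or later with _GNU_SOURCE defined; the name read_at_current_offset() is hypothetical.

```c
#define _GNU_SOURCE 1
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): read up to len bytes from fd
 * at its current file offset. Passing -1 as the offset tells preadv2()
 * to use, and update, the current file offset; flags is 0 here. */
ssize_t read_at_current_offset(int fd, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    return preadv2(fd, &iov, 1, -1, 0);
}
```

Replacing the final 0 with one of the RWF_* flags listed above changes the behavior for that single call only.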
RETURN VALUE
On success, readv(), preadv(), and preadv2() return the number of bytes read; writev(), pwritev(), and pwritev2() return the number of bytes written.
Note that it is not an error for a successful call to transfer fewer bytes than requested (see read(2) and write(2)).
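Since a short transfer is not an error, callers that need everything written usually loop. One possible shape for such a loop is sketched below; the helper name writev_all() is made up, and the function modifies the caller's iov array as it advances.

```c
#include <errno.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): keep calling writev() until
 * every byte described by iov has been written. The iov array is
 * modified in place. Returns the total number of bytes written, or -1
 * on error. */
ssize_t writev_all(int fd, struct iovec *iov, int iovcnt)
{
    ssize_t total = 0;

    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        total += n;

        /* Skip buffers that were written completely. */
        size_t left = (size_t) n;
        while (iovcnt > 0 && left >= iov->iov_len) {
            left -= iov->iov_len;
            iov++;
            iovcnt--;
        }

        /* Advance into a partially written buffer, if any. */
        if (iovcnt > 0) {
            iov->iov_base = (char *) iov->iov_base + left;
            iov->iov_len -= left;
        }
    }
    return total;
}
```

Note that output produced by such a loop is no longer a single atomic block, because it may span several writev() calls.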
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The errors are as given for read(2) and write(2). Furthermore, preadv(), preadv2(), pwritev(), and pwritev2() can also fail for the same reasons as lseek(2). Additionally, the following errors are defined:
EINVAL
The sum of the iov_len values overflows an ssize_t value.
EINVAL
The vector count, iovcnt, is less than zero or greater than the permitted maximum.
EOPNOTSUPP
An unknown flag is specified in flags.
VERSIONS
C library/kernel differences
The raw preadv() and pwritev() system calls have call signatures that differ slightly from that of the corresponding GNU C library wrapper functions shown in the SYNOPSIS. The final argument, offset, is unpacked by the wrapper functions into two arguments in the system calls:
unsigned long pos_l, unsigned long pos
These arguments contain, respectively, the low order and high order 32 bits of offset.
STANDARDS
readv()
writev()
POSIX.1-2008.
preadv()
pwritev()
BSD.
preadv2()
pwritev2()
Linux.
HISTORY
readv()
writev()
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
preadv(), pwritev(): Linux 2.6.30, glibc 2.10.
preadv2(), pwritev2(): Linux 4.6, glibc 2.26.
Historical C library/kernel differences
To deal with the fact that IOV_MAX was so low on early versions of Linux, the glibc wrapper functions for readv() and writev() did some extra work if they detected that the underlying kernel system call failed because this limit was exceeded. In the case of readv(), the wrapper function allocated a temporary buffer large enough for all of the items specified by iov, passed that buffer in a call to read(2), copied data from the buffer to the locations specified by the iov_base fields of the elements of iov, and then freed the buffer. The wrapper function for writev() performed the analogous task using a temporary buffer and a call to write(2).
The need for this extra effort in the glibc wrapper functions went away with Linux 2.2 and later. However, glibc continued to provide this behavior until glibc 2.10. Starting with glibc 2.9, the wrapper functions provide this behavior only if the library detects that the system is running a Linux kernel older than Linux 2.6.18 (an arbitrarily selected kernel version). And since glibc 2.20 (which requires a minimum of Linux 2.6.32), the glibc wrapper functions always just directly invoke the system calls.
NOTES
POSIX.1 allows an implementation to place a limit on the number of items that can be passed in iov. An implementation can advertise its limit by defining IOV_MAX in <limits.h> or at run time via the return value from sysconf(_SC_IOV_MAX). On modern Linux systems, the limit is 1024. Back in Linux 2.0 days, this limit was 16.
BUGS
Linux 5.9 and Linux 5.10 have a bug where preadv2() with the RWF_NOWAIT flag may return 0 even when not at end of file.
EXAMPLES
The following code sample demonstrates the use of writev():
char *str0 = "hello ";
char *str1 = "world\n";
ssize_t nwritten;
struct iovec iov[2];
iov[0].iov_base = str0;
iov[0].iov_len = strlen(str0);
iov[1].iov_base = str1;
iov[1].iov_len = strlen(str1);
nwritten = writev(STDOUT_FILENO, iov, 2);
SEE ALSO
pread(2), read(2), write(2)
204 - Linux cli command gethostname
NAME 🖥️ gethostname 🖥️
get/set hostname
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int gethostname(char *name, size_t len);
int sethostname(const char *name, size_t len);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
gethostname():
_XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
sethostname():
Since glibc 2.21:
_DEFAULT_SOURCE
In glibc 2.19 and 2.20:
_DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
Up to and including glibc 2.19:
_BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
DESCRIPTION
These system calls are used to access or to change the system hostname. More precisely, they operate on the hostname associated with the calling process’s UTS namespace.
sethostname() sets the hostname to the value given in the character array name. The len argument specifies the number of bytes in name. (Thus, name does not require a terminating null byte.)
gethostname() returns the null-terminated hostname in the character array name, which has a length of len bytes. If the null-terminated hostname is too large to fit, then the name is truncated, and no error is returned (but see NOTES below). POSIX.1 says that if such truncation occurs, then it is unspecified whether the returned buffer includes a terminating null byte.
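Because termination is unspecified when truncation occurs, a cautious caller can size the buffer from HOST_NAME_MAX and terminate it explicitly. The sketch below does that; the helper name get_hostname_terminated() is hypothetical, and the fallback definition of HOST_NAME_MAX is an assumption for systems whose <limits.h> does not provide it.

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>
#include <unistd.h>

#ifndef HOST_NAME_MAX
#define HOST_NAME_MAX 64        /* assumed fallback: the Linux value */
#endif

/* Hypothetical helper (illustration only): fetch the hostname into buf
 * and make sure the result is null-terminated even if it was truncated.
 * Returns 0 on success, -1 on error. */
int get_hostname_terminated(char *buf, size_t size)
{
    if (size == 0)
        return -1;
    if (gethostname(buf, size) == -1)
        return -1;

    buf[size - 1] = '\0';       /* harmless if already terminated */
    return 0;
}
```

A buffer declared as char name[HOST_NAME_MAX + 1] leaves room for the longest permitted hostname plus the terminating null byte.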
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
name is an invalid address.
EINVAL
len is negative or, for sethostname(), len is larger than the maximum allowed size.
ENAMETOOLONG
(glibc gethostname()) len is smaller than the actual size. (Before glibc 2.1, glibc uses EINVAL for this case.)
EPERM
For sethostname(), the caller did not have the CAP_SYS_ADMIN capability in the user namespace associated with its UTS namespace (see namespaces(7)).
VERSIONS
SUSv2 guarantees that “Host names are limited to 255 bytes”. POSIX.1 guarantees that “Host names (not including the terminating null byte) are limited to HOST_NAME_MAX bytes”. On Linux, HOST_NAME_MAX is defined with the value 64, which has been the limit since Linux 1.0 (earlier kernels imposed a limit of 8 bytes).
C library/kernel differences
The GNU C library does not employ the gethostname() system call; instead, it implements gethostname() as a library function that calls uname(2) and copies up to len bytes from the returned nodename field into name. Having performed the copy, the function then checks if the length of the nodename was greater than or equal to len, and if it is, then the function returns -1 with errno set to ENAMETOOLONG; in this case, a terminating null byte is not included in the returned name.
STANDARDS
gethostname()
POSIX.1-2008.
sethostname()
None.
HISTORY
SVr4, 4.4BSD (these interfaces first appeared in 4.2BSD). POSIX.1-2001 and POSIX.1-2008 specify gethostname() but not sethostname().
Versions of glibc before glibc 2.2 handle the case where the length of the nodename was greater than or equal to len differently: nothing is copied into name and the function returns -1 with errno set to ENAMETOOLONG.
SEE ALSO
hostname(1), getdomainname(2), setdomainname(2), uname(2), uts_namespaces(7)
205 - Linux cli command sched_yield
NAME 🖥️ sched_yield 🖥️
yield the processor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sched.h>
int sched_yield(void);
DESCRIPTION
sched_yield() causes the calling thread to relinquish the CPU. The thread is moved to the end of the queue for its static priority and a new thread gets to run.
RETURN VALUE
On success, sched_yield() returns 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
In the Linux implementation, sched_yield() always succeeds.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001 (but optional). POSIX.1-2008.
Before POSIX.1-2008, systems on which sched_yield() is available defined _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
CAVEATS
sched_yield() is intended for use with real-time scheduling policies (i.e., SCHED_FIFO or SCHED_RR). Use of sched_yield() with nondeterministic scheduling policies such as SCHED_OTHER is unspecified and very likely means your application design is broken.
If the calling thread is the only thread in the highest priority list at that time, it will continue to run after a call to sched_yield().
Avoid calling sched_yield() unnecessarily or inappropriately (e.g., when resources needed by other schedulable threads are still held by the caller), since doing so will result in unnecessary context switches, which will degrade system performance.
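One way to respect these caveats is to yield only when the calling thread actually runs under a real-time policy. The sketch below is an illustration under that assumption, not a recommended pattern; the helper name yield_if_realtime() is invented, and a policy value that has SCHED_RESET_ON_FORK ORed in is simply treated as non-real-time here.

```c
#include <sched.h>

/* Hypothetical helper (illustration only): call sched_yield() only if
 * the calling thread uses SCHED_FIFO or SCHED_RR.
 * Returns 1 if the thread yielded, 0 if the call was skipped,
 * and -1 on error. */
int yield_if_realtime(void)
{
    int policy = sched_getscheduler(0);     /* 0 = the calling thread */
    if (policy == -1)
        return -1;

    if (policy != SCHED_FIFO && policy != SCHED_RR)
        return 0;                           /* e.g., SCHED_OTHER: skip */

    return sched_yield() == 0 ? 1 : -1;
}
```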
SEE ALSO
sched(7)
206 - Linux cli command uname
NAME 🖥️ uname 🖥️
get name and information about current kernel
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/utsname.h>
int uname(struct utsname *buf);
DESCRIPTION
uname() returns system information in the structure pointed to by buf. The utsname struct is defined in <sys/utsname.h>:
struct utsname {
char sysname[]; /* Operating system name (e.g., "Linux") */
char nodename[]; /* Name within communications network
to which the node is attached, if any */
char release[]; /* Operating system release
(e.g., "2.6.28") */
char version[]; /* Operating system version */
char machine[]; /* Hardware type identifier */
#ifdef _GNU_SOURCE
char domainname[]; /* NIS or YP domain name */
#endif
};
The length of the arrays in a struct utsname is unspecified (see NOTES); the fields are terminated by a null byte (‘\0’).
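A short sketch of calling uname() and using the null-terminated fields follows. The helper name describe_kernel() and the output format are invented for this example; no array lengths are hard-coded.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper (illustration only): format a one-line summary
 * such as "<sysname> <release> (<machine>)" into out.
 * Returns 0 on success, -1 on error or if out is too small. */
int describe_kernel(char *out, size_t size)
{
    struct utsname u;

    if (uname(&u) == -1)
        return -1;

    /* The fields are null-terminated, so they can be used as strings. */
    int n = snprintf(out, size, "%s %s (%s)",
                     u.sysname, u.release, u.machine);

    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```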
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
buf is not valid.
VERSIONS
The domainname member (the NIS or YP domain name) is a GNU extension.
The length of the fields in the struct varies. Some operating systems or libraries use a hardcoded 9 or 33 or 65 or 257. Other systems use SYS_NMLN or _SYS_NMLN or UTSLEN or _UTSNAME_LENGTH. Clearly, it is a bad idea to use any of these constants; just use sizeof(…). SVr4 uses 257, “to support Internet hostnames”; this is the largest value likely to be encountered in the wild.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD.
C library/kernel differences
Over time, increases in the size of the utsname structure have led to three successive versions of uname(): sys_olduname() (slot __NR_oldolduname), sys_uname() (slot __NR_olduname), and sys_newuname() (slot __NR_uname). The first one used length 9 for all fields; the second used 65; the third also uses 65 but adds the domainname field. The glibc uname() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel.
NOTES
The kernel has the name, release, version, and supported machine type built in. Conversely, the nodename field is configured by the administrator to match the network (this is what BSD historically calls the “hostname”, and is set via sethostname(2)). Similarly, the domainname field is set via setdomainname(2).
Part of the utsname information is also accessible via /proc/sys/kernel/{ostype, hostname, osrelease, version, domainname}.
SEE ALSO
uname(1), getdomainname(2), gethostname(2), uts_namespaces(7)
207 - Linux cli command outw_p
NAME 🖥️ outw_p 🖥️
port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width, the w-suffix functions are word-width, and the l-suffix functions are long-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
Use ioperm(2) or, alternatively, iopl(2) to tell the kernel to allow the user-space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
208 - Linux cli command fstat64
NAME 🖥️ fstat64 🖥️
get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but, in the case of stat(), fstatat(), and lstat(), execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>
    #include <time.h>

    int
    main(int argc, char *argv[])
    {
        struct stat sb;

        if (argc != 2) {
            fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
            exit(EXIT_FAILURE);
        }

        if (lstat(argv[1], &sb) == -1) {
            perror("lstat");
            exit(EXIT_FAILURE);
        }

        printf("ID of containing device:  [%x,%x]\n",
               major(sb.st_dev),
               minor(sb.st_dev));

        printf("File type:                ");

        switch (sb.st_mode & S_IFMT) {
        case S_IFBLK:  printf("block device\n");            break;
        case S_IFCHR:  printf("character device\n");        break;
        case S_IFDIR:  printf("directory\n");               break;
        case S_IFIFO:  printf("FIFO/pipe\n");               break;
        case S_IFLNK:  printf("symlink\n");                 break;
        case S_IFREG:  printf("regular file\n");            break;
        case S_IFSOCK: printf("socket\n");                  break;
        default:       printf("unknown?\n");                break;
        }

        printf("I-node number:            %ju\n", (uintmax_t) sb.st_ino);

        printf("Mode:                     %jo (octal)\n",
               (uintmax_t) sb.st_mode);

        printf("Link count:               %ju\n", (uintmax_t) sb.st_nlink);
        printf("Ownership:                UID=%ju   GID=%ju\n",
               (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);

        printf("Preferred I/O block size: %jd bytes\n",
               (intmax_t) sb.st_blksize);
        printf("File size:                %jd bytes\n",
               (intmax_t) sb.st_size);
        printf("Blocks allocated:         %jd\n",
               (intmax_t) sb.st_blocks);

        printf("Last status change:       %s", ctime(&sb.st_ctime));
        printf("Last file access:         %s", ctime(&sb.st_atime));
        printf("Last file modification:   %s", ctime(&sb.st_mtime));

        exit(EXIT_SUCCESS);
    }
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
209 - Linux cli command getgroups
NAME 🖥️ getgroups 🖥️
get/set list of supplementary group IDs
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int getgroups(int size, gid_t list[]);
#include <grp.h>
int setgroups(size_t size, const gid_t *_Nullable list);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
setgroups():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
getgroups() returns the supplementary group IDs of the calling process in list. The argument size should be set to the maximum number of items that can be stored in the buffer pointed to by list. If the calling process is a member of more than size supplementary groups, then an error results.
It is unspecified whether the effective group ID of the calling process is included in the returned list. (Thus, an application should also call getegid(2) and add or remove the resulting value.)
If size is zero, list is not modified, but the total number of supplementary group IDs for the process is returned. This allows the caller to determine the size of a dynamically allocated list to be used in a further call to getgroups().
setgroups() sets the supplementary group IDs for the calling process. Appropriate privileges are required (see the description of the EPERM error, below). The size argument specifies the number of supplementary group IDs in the buffer pointed to by list. A process can drop all of its supplementary groups with the call:
setgroups(0, NULL);
RETURN VALUE
On success, getgroups() returns the number of supplementary group IDs. On error, -1 is returned, and errno is set to indicate the error.
On success, setgroups() returns 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
list has an invalid address.
getgroups() can additionally fail with the following error:
EINVAL
size is less than the number of supplementary group IDs, but is not zero.
setgroups() can additionally fail with the following errors:
EINVAL
size is greater than NGROUPS_MAX (32 before Linux 2.6.4; 65536 since Linux 2.6.4).
ENOMEM
Out of memory.
EPERM
The calling process has insufficient privilege (the caller does not have the CAP_SETGID capability in the user namespace in which it resides).
EPERM (since Linux 3.19)
The use of setgroups() is denied in this user namespace. See the description of /proc/pid/setgroups in user_namespaces(7).
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including the one for setgroups()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
getgroups()
POSIX.1-2008.
setgroups()
None.
HISTORY
getgroups()
SVr4, 4.3BSD, POSIX.1-2001.
setgroups()
SVr4, 4.3BSD. Since setgroups() requires privilege, it is not covered by POSIX.1.
The original Linux getgroups() system call supported only 16-bit group IDs. Subsequently, Linux 2.4 added getgroups32(), supporting 32-bit IDs. The glibc getgroups() wrapper function transparently deals with the variation across kernel versions.
NOTES
A process can have up to NGROUPS_MAX supplementary group IDs in addition to the effective group ID. The constant NGROUPS_MAX is defined in <limits.h>. The set of supplementary group IDs is inherited from the parent process, and preserved across an execve(2).
The maximum number of supplementary group IDs can be found at run time using sysconf(3):
long ngroups_max;
ngroups_max = sysconf(_SC_NGROUPS_MAX);
The maximum return value of getgroups() cannot be larger than one more than this value. Since Linux 2.6.4, the maximum number of supplementary group IDs is also exposed via the Linux-specific read-only file, /proc/sys/kernel/ngroups_max.
SEE ALSO
getgid(2), setgid(2), getgrouplist(3), group_member(3), initgroups(3), capabilities(7), credentials(7)
210 - Linux cli command fsync
NAME 🖥️ fsync 🖥️
synchronize a file’s in-core state with storage device
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int fsync(int fd);
int fdatasync(int fd);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fsync():
glibc 2.16 and later:
No feature test macros need be defined
glibc up to and including 2.15:
_BSD_SOURCE || _XOPEN_SOURCE
|| /* Since glibc 2.8: */ _POSIX_C_SOURCE >= 200112L
fdatasync():
_POSIX_C_SOURCE >= 199309L || _XOPEN_SOURCE >= 500
DESCRIPTION
fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
As well as flushing the file data, fsync() also flushes the metadata information associated with the file (see inode(7)).
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see inode(7)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush.
The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.
RETURN VALUE
On success, these system calls return zero. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
fd is not a valid open file descriptor.
EINTR
The function was interrupted by a signal; see signal(7).
EIO
An error occurred during synchronization. This error may relate to data written to some other file descriptor on the same file. Since Linux 4.13, errors from write-back will be reported to all file descriptors that might have written the data which triggered the error. Some filesystems (e.g., NFS) keep close track of which data came through which file descriptor, and give more precise reporting. Other filesystems (e.g., most local filesystems) will report errors to all file descriptors that were open on the file when the error was recorded.
ENOSPC
Disk space was exhausted while synchronizing.
EROFS
EINVAL
fd is bound to a special file (e.g., a pipe, FIFO, or socket) which does not support synchronization.
ENOSPC
EDQUOT
fd is bound to a file on NFS or another filesystem which does not allocate space at the time of a write(2) system call, and some previous write failed due to insufficient storage space.
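The EINVAL and EROFS cases listed above can be told apart from other failures by inspecting errno, as in this sketch. The helper name supports_fsync() is hypothetical, and note that probing this way performs a real flush when the descriptor does support synchronization.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if fsync(fd) succeeds, 0 if the
 * descriptor refers to something that does not support
 * synchronization (EINVAL or EROFS), and -1 for any other error.
 */
static int
supports_fsync(int fd)
{
    if (fsync(fd) == 0)
        return 1;

    if (errno == EINVAL || errno == EROFS)
        return 0;

    return -1;
}
```

On Linux, calling this on either end of a pipe would be expected to return 0, while an invalid descriptor yields -1 (EBADF).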
VERSIONS
On POSIX systems on which fdatasync() is available, _POSIX_SYNCHRONIZED_IO is defined in <unistd.h> to a value greater than 0. (See also sysconf(3).)
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.2BSD.
In Linux 2.2 and earlier, fdatasync() is equivalent to fsync(), and so has no performance advantage.
The fsync() implementations in older kernels and lesser-used filesystems do not know how to flush disk caches. In these cases disk caches need to be disabled using hdparm(8) or sdparm(8) to guarantee safe operation.
Under AT&T UNIX System V Release 4 fd needs to be opened for writing. This is by itself incompatible with the original BSD interface and forbidden by POSIX, but nevertheless survives in HP-UX and AIX.
SEE ALSO
sync(1), bdflush(2), open(2), posix_fadvise(2), pwritev(2), sync(2), sync_file_range(2), fflush(3), fileno(3), hdparm(8), mount(8)
211 - Linux cli command subpage_prot
NAME 🖥️ subpage_prot 🖥️
define a subpage protection for an address range
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_subpage_prot, unsigned long addr, unsigned long len,
uint32_t *map);
Note: glibc provides no wrapper for subpage_prot(), necessitating the use of syscall(2).
DESCRIPTION
The PowerPC-specific subpage_prot() system call provides the facility to control the access permissions on individual 4 kB subpages on systems configured with a page size of 64 kB.
The protection map is applied to the memory pages in the region starting at addr and continuing for len bytes. Both of these arguments must be aligned to a 64-kB boundary.
The protection map is specified in the buffer pointed to by map. The map has 2 bits per 4 kB subpage; thus each 32-bit word specifies the protections of the sixteen 4 kB subpages inside one 64 kB page (so the number of 32-bit words pointed to by map should equal the number of 64-kB pages specified by len). Each 2-bit field in the protection map is either 0 to allow any access, 1 to prevent writes, or 2 or 3 to prevent all accesses.
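The arithmetic of the map layout can be sketched as below. Everything here is illustrative: the names map_words() and set_subpage_prot() are hypothetical, and the placement of subpage 0 in the most significant two bits of each word is an assumption that this page does not state and that should be checked against the kernel source before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

#define SUBPAGE_SIZE  4096UL    /* 4 kB subpage */
#define PAGE_64K      65536UL   /* 64 kB page */

/* Number of 32-bit map words needed for a region of len bytes:
   one word per 64 kB page */
static size_t
map_words(unsigned long len)
{
    return len / PAGE_64K;
}

/*
 * Store the 2-bit protection value prot (0 to 3) for the 4 kB
 * subpage that contains the byte at the given offset from addr.
 * ASSUMPTION: subpage 0 of a page occupies bits 31..30 of its
 * word, subpage 15 occupies bits 1..0.
 */
static void
set_subpage_prot(uint32_t *map, unsigned long offset, unsigned int prot)
{
    size_t word = offset / PAGE_64K;
    unsigned int sub = (unsigned int) ((offset % PAGE_64K) / SUBPAGE_SIZE);
    unsigned int shift = 30 - 2 * sub;

    map[word] &= ~(UINT32_C(3) << shift);           /* clear old field */
    map[word] |= (uint32_t) (prot & 3) << shift;    /* set new field */
}
```

A map prepared this way would then be passed as the final argument of syscall(SYS_subpage_prot, addr, len, map).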
RETURN VALUE
On success, subpage_prot() returns 0. Otherwise, one of the error codes specified below is returned.
ERRORS
EFAULT
The buffer referred to by map is not accessible.
EINVAL
The addr or len arguments are incorrect. Both of these arguments must be aligned to a multiple of the system page size, and they must not refer to a region outside of the address space of the process or to a region that consists of huge pages.
ENOMEM
Out of memory.
STANDARDS
Linux.
HISTORY
Linux 2.6.25 (PowerPC).
The system call is provided only if the kernel is configured with CONFIG_PPC_64K_PAGES.
NOTES
Normal page protections (at the 64-kB page level) also apply; the subpage protection mechanism is an additional constraint, so putting 0 in a 2-bit field won’t allow writes to a page that is otherwise write-protected.
Rationale
This system call is provided to assist writing emulators that operate using 64-kB pages on PowerPC systems. When emulating systems such as x86, which uses a smaller page size, the emulator can no longer use the memory-management unit (MMU) and normal system calls for controlling page protections. (The emulator could emulate the MMU by checking and possibly remapping the address for each memory access in software, but that is slow.) The idea is that the emulator supplies an array of protection masks to apply to a specified range of virtual addresses. These masks are applied at the level where hardware page-table entries (PTEs) are inserted into the hardware page table based on the Linux PTEs, so the Linux PTEs are not affected. Implicit in this is that the regions of the address space that are protected are switched to use 4-kB hardware pages rather than 64-kB hardware pages (on machines with hardware 64-kB page support).
SEE ALSO
mprotect(2), syscall(2)
Documentation/admin-guide/mm/hugetlbpage.rst in the Linux kernel source tree
212 - Linux cli command preadv
NAME 🖥️ preadv 🖥️
read or write data into multiple buffers
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/uio.h>
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
preadv(), pwritev():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
The readv() system call reads iovcnt buffers from the file associated with the file descriptor fd into the buffers described by iov (“scatter input”).
The writev() system call writes iovcnt buffers of data described by iov to the file associated with the file descriptor fd (“gather output”).
The pointer iov points to an array of iovec structures, described in iovec(3type).
The readv() system call works just like read(2) except that multiple buffers are filled.
The writev() system call works just like write(2) except that multiple buffers are written out.
Buffers are processed in array order. This means that readv() completely fills iov[0] before proceeding to iov[1], and so on. (If there is insufficient data, then not all buffers pointed to by iov may be filled.) Similarly, writev() writes out the entire contents of iov[0] before proceeding to iov[1], and so on.
The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes; analogously, readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description (see open(2)).
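As an illustration of scatter input, the sketch below reads a fixed-size header and a body with one readv() call. The helper names read_header_and_body() and scatter_demo() are hypothetical; the demo pushes a short message through a pipe only so that the example is self-contained.

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: the first hdr_len bytes read land in hdr,
   any further bytes (up to body_len) land in body */
static ssize_t
read_header_and_body(int fd, void *hdr, size_t hdr_len,
                     void *body, size_t body_len)
{
    struct iovec iov[2];

    iov[0].iov_base = hdr;
    iov[0].iov_len = hdr_len;
    iov[1].iov_base = body;
    iov[1].iov_len = body_len;

    return readv(fd, iov, 2);
}

/*
 * Hypothetical demonstration: send msg (assumed to be short enough
 * to fit in the pipe buffer) through a pipe and scatter it back
 * into a 4-byte header plus a body. Returns the number of bytes
 * read, or -1 on error.
 */
static ssize_t
scatter_demo(const char *msg, char hdr[4], char *body, size_t body_len)
{
    size_t len = strlen(msg);
    ssize_t n = -1;
    int p[2];

    if (pipe(p) == -1)
        return -1;

    if (write(p[1], msg, len) == (ssize_t) len) {
        close(p[1]);    /* so that the read sees end of file */
        p[1] = -1;
        n = read_header_and_body(p[0], hdr, 4, body, body_len);
    }

    close(p[0]);
    if (p[1] != -1)
        close(p[1]);
    return n;
}
```

Because buffers are filled in array order, the header buffer is completely filled before any byte is stored in the body buffer.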
preadv() and pwritev()
The preadv() system call combines the functionality of readv() and pread(2). It performs the same task as readv(), but adds a fourth argument, offset, which specifies the file offset at which the input operation is to be performed.
The pwritev() system call combines the functionality of writev() and pwrite(2). It performs the same task as writev(), but adds a fourth argument, offset, which specifies the file offset at which the output operation is to be performed.
The file offset is not changed by these system calls. The file referred to by fd must be capable of seeking.
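The following sketch shows a positional read with preadv() and checks that the file offset is left alone. The helper names read_at() and preadv_demo() are hypothetical, and the scratch-file path is supplied by the caller.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* for preadv() on glibc */
#endif
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes at file offset off,
   using a single-element vector */
static ssize_t
read_at(int fd, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    return preadv(fd, &iov, 1, off);
}

/*
 * Hypothetical demonstration on a scratch file: write "0123456789",
 * read 4 bytes at offset 3 into out, and report whether the file
 * offset was left untouched. Returns 1 if unchanged, 0 if it moved,
 * -1 on error. The scratch file is removed afterwards.
 */
static int
preadv_demo(const char *path, char out[4])
{
    off_t before;
    int fd, result = -1;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    if (write(fd, "0123456789", 10) == 10) {
        before = lseek(fd, 0, SEEK_CUR);
        if (before != (off_t) -1 && read_at(fd, out, 4, 3) == 4)
            result = (lseek(fd, 0, SEEK_CUR) == before);
    }

    close(fd);
    unlink(path);
    return result;
}
```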
preadv2() and pwritev2()
These system calls are similar to preadv() and pwritev() calls, but add a fifth argument, flags, which modifies the behavior on a per-call basis.
Unlike preadv() and pwritev(), if the offset argument is -1, then the current file offset is used and updated.
The flags argument contains a bitwise OR of zero or more of the following flags:
RWF_DSYNC (since Linux 4.7)
Provide a per-write equivalent of the O_DSYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_HIPRI (since Linux 4.6)
High priority read/write. Allows block-based filesystems to use polling of the device, which provides lower latency, but may use additional resources. (Currently, this feature is usable only on a file descriptor opened using the O_DIRECT flag.)
RWF_SYNC (since Linux 4.7)
Provide a per-write equivalent of the O_SYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_NOWAIT (since Linux 4.14)
Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read. If no bytes were read, it will return -1 and set errno to EAGAIN (but see BUGS). Currently, this flag is meaningful only for preadv2().
RWF_APPEND (since Linux 4.16)
Provide a per-write equivalent of the O_APPEND open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call. The offset argument does not affect the write operation; the data is always appended to the end of the file. However, if the offset argument is -1, the current file offset is updated.
RETURN VALUE
On success, readv(), preadv(), and preadv2() return the number of bytes read; writev(), pwritev(), and pwritev2() return the number of bytes written.
Note that it is not an error for a successful call to transfer fewer bytes than requested (see read(2) and write(2)).
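Since a successful call may transfer fewer bytes than requested, callers that need everything written usually loop. The sketch below is one way to do that; the helper name writev_all() is hypothetical, and it modifies the caller's iov array when an element is only partly written.

```c
#include <errno.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: call writev() repeatedly until every byte
 * described by iov has been written. Retries on EINTR.
 * Returns 0 on success, -1 on error with errno set.
 */
static int
writev_all(int fd, struct iovec *iov, int iovcnt)
{
    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);

        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }

        /* Skip the elements that were written completely */
        while (iovcnt > 0 && (size_t) n >= iov->iov_len) {
            n -= (ssize_t) iov->iov_len;
            iov++;
            iovcnt--;
        }

        /* Advance into an element that was written only in part */
        if (iovcnt > 0) {
            iov->iov_base = (char *) iov->iov_base + n;
            iov->iov_len -= (size_t) n;
        }
    }

    return 0;
}
```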
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The errors are as given for read(2) and write(2). Furthermore, preadv(), preadv2(), pwritev(), and pwritev2() can also fail for the same reasons as lseek(2). Additionally, the following errors are defined:
EINVAL
The sum of the iov_len values overflows an ssize_t value.
EINVAL
The vector count, iovcnt, is less than zero or greater than the permitted maximum.
EOPNOTSUPP
An unknown flag is specified in flags.
VERSIONS
C library/kernel differences
The raw preadv() and pwritev() system calls have call signatures that differ slightly from that of the corresponding GNU C library wrapper functions shown in the SYNOPSIS. The final argument, offset, is unpacked by the wrapper functions into two arguments in the system calls:
unsigned long pos_l, unsigned long pos_h
These arguments contain, respectively, the low order and high order 32 bits of offset.
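Purely as an arithmetic illustration of splitting an offset into low-order and high-order 32-bit halves, consider the sketch below. The function names and the lo/hi parameter names are hypothetical, and the sketch models the 32-bit case described above; it does not invoke the raw system calls.

```c
#include <stdint.h>

/* Hypothetical helper: split a 64-bit offset into its low-order
   and high-order 32 bits */
static void
split_offset(int64_t offset, uint32_t *lo, uint32_t *hi)
{
    *lo = (uint32_t) ((uint64_t) offset & 0xffffffffu);
    *hi = (uint32_t) ((uint64_t) offset >> 32);
}

/* Hypothetical helper: the inverse operation, recombining the two
   halves into the original offset */
static int64_t
join_offset(uint32_t lo, uint32_t hi)
{
    return (int64_t) (((uint64_t) hi << 32) | lo);
}
```

For example, the offset 0x123456789 splits into a low half of 0x23456789 and a high half of 0x1.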
STANDARDS
readv()
writev()
POSIX.1-2008.
preadv()
pwritev()
BSD.
preadv2()
pwritev2()
Linux.
HISTORY
readv()
writev()
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
preadv(), pwritev(): Linux 2.6.30, glibc 2.10.
preadv2(), pwritev2(): Linux 4.6, glibc 2.26.
Historical C library/kernel differences
To deal with the fact that IOV_MAX was so low on early versions of Linux, the glibc wrapper functions for readv() and writev() did some extra work if they detected that the underlying kernel system call failed because this limit was exceeded. In the case of readv(), the wrapper function allocated a temporary buffer large enough for all of the items specified by iov, passed that buffer in a call to read(2), copied data from the buffer to the locations specified by the iov_base fields of the elements of iov, and then freed the buffer. The wrapper function for writev() performed the analogous task using a temporary buffer and a call to write(2).
The need for this extra effort in the glibc wrapper functions went away with Linux 2.2 and later. However, glibc continued to provide this behavior until glibc 2.10. Starting with glibc 2.9, the wrapper functions provide this behavior only if the library detects that the system is running a Linux kernel older than Linux 2.6.18 (an arbitrarily selected kernel version). And since glibc 2.20 (which requires a minimum of Linux 2.6.32), the glibc wrapper functions always just directly invoke the system calls.
NOTES
POSIX.1 allows an implementation to place a limit on the number of items that can be passed in iov. An implementation can advertise its limit by defining IOV_MAX in <limits.h> or at run time via the return value from sysconf(_SC_IOV_MAX). On modern Linux systems, the limit is 1024. Back in Linux 2.0 days, this limit was 16.
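A program can query the limit at run time rather than hard-coding it, as in this sketch. The helper name iov_limit() is hypothetical; the fallback of 16 is the POSIX minimum (_XOPEN_IOV_MAX) and is used only if neither sysconf() nor IOV_MAX provides a value.

```c
#include <limits.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the maximum number of iovec items
 * accepted per call, preferring the run-time value.
 */
static long
iov_limit(void)
{
    long n = sysconf(_SC_IOV_MAX);

    if (n > 0)
        return n;

#ifdef IOV_MAX
    return IOV_MAX;     /* compile-time limit from <limits.h> */
#else
    return 16;          /* _XOPEN_IOV_MAX, the guaranteed minimum */
#endif
}
```

A caller holding more buffers than iov_limit() returns would need to split them across several readv() or writev() calls.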
BUGS
Linux 5.9 and Linux 5.10 have a bug where preadv2() with the RWF_NOWAIT flag may return 0 even when not at end of file.
EXAMPLES
The following code sample demonstrates the use of writev():
    char          *str0 = "hello ";
    char          *str1 = "world\n";
    ssize_t       nwritten;
    struct iovec  iov[2];

    iov[0].iov_base = str0;
    iov[0].iov_len = strlen(str0);
    iov[1].iov_base = str1;
    iov[1].iov_len = strlen(str1);

    nwritten = writev(STDOUT_FILENO, iov, 2);
SEE ALSO
pread(2), read(2), write(2)
213 - Linux cli command getrusage
NAME 🖥️ getrusage 🖥️
get resource usage
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/resource.h>
int getrusage(int who, struct rusage *usage);
DESCRIPTION
getrusage() returns resource usage measures for who, which can be one of the following:
RUSAGE_SELF
Return resource usage statistics for the calling process, which is the sum of resources used by all threads in the process.
RUSAGE_CHILDREN
Return resource usage statistics for all children of the calling process that have terminated and been waited for. These statistics will include the resources used by grandchildren, and further removed descendants, if all of the intervening descendants waited on their terminated children.
RUSAGE_THREAD (since Linux 2.6.26)
Return resource usage statistics for the calling thread. The _GNU_SOURCE feature test macro must be defined (before including any header file) in order to obtain the definition of this constant from <sys/resource.h>.
The resource usages are returned in the structure pointed to by usage, which has the following form:
struct rusage {
struct timeval ru_utime; /* user CPU time used */
struct timeval ru_stime; /* system CPU time used */
long ru_maxrss; /* maximum resident set size */
long ru_ixrss; /* integral shared memory size */
long ru_idrss; /* integral unshared data size */
long ru_isrss; /* integral unshared stack size */
long ru_minflt; /* page reclaims (soft page faults) */
long ru_majflt; /* page faults (hard page faults) */
long ru_nswap; /* swaps */
long ru_inblock; /* block input operations */
long ru_oublock; /* block output operations */
long ru_msgsnd; /* IPC messages sent */
long ru_msgrcv; /* IPC messages received */
long ru_nsignals; /* signals received */
long ru_nvcsw; /* voluntary context switches */
long ru_nivcsw; /* involuntary context switches */
};
Not all fields are completed; unmaintained fields are set to zero by the kernel. (The unmaintained fields are provided for compatibility with other systems, and because they may one day be supported on Linux.) The fields are interpreted as follows:
ru_utime
This is the total amount of time spent executing in user mode, expressed in a timeval structure (seconds plus microseconds).
ru_stime
This is the total amount of time spent executing in kernel mode, expressed in a timeval structure (seconds plus microseconds).
ru_maxrss (since Linux 2.6.32)
This is the maximum resident set size used (in kilobytes). For RUSAGE_CHILDREN, this is the resident set size of the largest child, not the maximum resident set size of the process tree.
ru_ixrss (unmaintained)
This field is currently unused on Linux.
ru_idrss (unmaintained)
This field is currently unused on Linux.
ru_isrss (unmaintained)
This field is currently unused on Linux.
ru_minflt
The number of page faults serviced without any I/O activity; here I/O activity is avoided by “reclaiming” a page frame from the list of pages awaiting reallocation.
ru_majflt
The number of page faults serviced that required I/O activity.
ru_nswap (unmaintained)
This field is currently unused on Linux.
ru_inblock (since Linux 2.6.22)
The number of times the filesystem had to perform input.
ru_oublock (since Linux 2.6.22)
The number of times the filesystem had to perform output.
ru_msgsnd (unmaintained)
This field is currently unused on Linux.
ru_msgrcv (unmaintained)
This field is currently unused on Linux.
ru_nsignals (unmaintained)
This field is currently unused on Linux.
ru_nvcsw (since Linux 2.6)
The number of times a context switch resulted due to a process voluntarily giving up the processor before its time slice was completed (usually to await availability of a resource).
ru_nivcsw (since Linux 2.6)
The number of times a context switch resulted due to a higher priority process becoming runnable or because the current process exceeded its time slice.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
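As a minimal sketch of a typical call, the following checks the return value and combines ru_utime and ru_stime into a single figure. The helper names timeval_to_seconds() and process_cpu_seconds() are hypothetical, chosen only for illustration.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval (seconds plus microseconds) to seconds. */
static double
timeval_to_seconds(struct timeval tv)
{
    return (double) tv.tv_sec + (double) tv.tv_usec / 1e6;
}

/* Hypothetical helper: total CPU time (user + system) consumed so far
   by the calling process, or -1.0 if getrusage() fails. */
double
process_cpu_seconds(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) == -1)
        return -1.0;

    return timeval_to_seconds(ru.ru_utime) + timeval_to_seconds(ru.ru_stime);
}
```

Sampling this value before and after a section of code gives a rough measure of the CPU time that section consumed across all threads of the process.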
ERRORS
EFAULT
usage points outside the accessible address space.
EINVAL
who is invalid.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
getrusage() | Thread safety | MT-Safe |
STANDARDS
POSIX.1-2008.
POSIX.1 specifies getrusage(), but specifies only the fields ru_utime and ru_stime.
RUSAGE_THREAD is Linux-specific.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
Before Linux 2.6.9, if the disposition of SIGCHLD is set to SIG_IGN then the resource usages of child processes are automatically included in the value returned by RUSAGE_CHILDREN, although POSIX.1-2001 explicitly prohibits this. This nonconformance is rectified in Linux 2.6.9 and later.
The structure definition shown at the start of this page was taken from 4.3BSD Reno.
Ancient systems provided a vtimes() function with a similar purpose to getrusage(). For backward compatibility, glibc (up to and including version 2.32) also provided vtimes(). All new applications should be written using getrusage(). (Since version 2.33, glibc no longer provides a vtimes() implementation.)
NOTES
Resource usage metrics are preserved across an execve(2).
SEE ALSO
clock_gettime(2), getrlimit(2), times(2), wait(2), wait4(2), clock(3), proc_pid_stat(5), proc_pid_io(5)
214 - Linux cli command openat
NAME 🖥️ openat 🖥️
open and possibly create a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h>
int open(const char *pathname, int flags, ...
/* mode_t mode */ );
int creat(const char *pathname, mode_t mode);
int openat(int dirfd, const char *pathname, int flags, ...
/* mode_t mode */ );
/* Documented separately, in openat2(2): */
int openat2(int dirfd, const char *pathname,
const struct open_how *how, size_t size);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
openat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
The open() system call opens the file specified by pathname. If the specified file does not exist, it may optionally (if O_CREAT is specified in flags) be created by open().
The return value of open() is a file descriptor, a small, nonnegative integer that is an index to an entry in the process’s table of open file descriptors. The file descriptor is used in subsequent system calls (read(2), write(2), lseek(2), fcntl(2), etc.) to refer to the open file. The file descriptor returned by a successful call will be the lowest-numbered file descriptor not currently open for the process.
By default, the new file descriptor is set to remain open across an execve(2) (i.e., the FD_CLOEXEC file descriptor flag described in fcntl(2) is initially disabled); the O_CLOEXEC flag, described below, can be used to change this default. The file offset is set to the beginning of the file (see lseek(2)).
A call to open() creates a new open file description, an entry in the system-wide table of open files. The open file description records the file offset and the file status flags (see below). A file descriptor is a reference to an open file description; this reference is unaffected if pathname is subsequently removed or modified to refer to a different file. For further details on open file descriptions, see NOTES.
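The distinction between file descriptors and open file descriptions can be illustrated with a small sketch. It assumes (per dup(2)) that a duplicated descriptor refers to the same open file description, while a second open() creates a new one. The helper name compare_offsets() is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper. Opens `path` twice and duplicates the first
   descriptor, then reads one byte through the first descriptor.
   On success returns 0 and stores the resulting file offsets:
     *dup_off  - offset seen through the dup(2) copy
     *open_off - offset seen through the second open()
   Returns -1 on error (including an empty file). */
int
compare_offsets(const char *path, off_t *dup_off, off_t *open_off)
{
    char c;
    int rc = -1;
    int fd1 = open(path, O_RDONLY);
    int fd2 = (fd1 == -1) ? -1 : dup(fd1);             /* same description */
    int fd3 = (fd1 == -1) ? -1 : open(path, O_RDONLY); /* new description */

    if (fd1 != -1 && fd2 != -1 && fd3 != -1 && read(fd1, &c, 1) == 1) {
        *dup_off = lseek(fd2, 0, SEEK_CUR);
        *open_off = lseek(fd3, 0, SEEK_CUR);
        rc = 0;
    }

    if (fd3 != -1)
        close(fd3);
    if (fd2 != -1)
        close(fd2);
    if (fd1 != -1)
        close(fd1);
    return rc;
}
```

After the one-byte read, the duplicated descriptor reports offset 1 because it shares the description with the first descriptor, whereas the independently opened descriptor still reports offset 0.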
The argument flags must include one of the following access modes: O_RDONLY, O_WRONLY, or O_RDWR. These request opening the file read-only, write-only, or read/write, respectively.
In addition, zero or more file creation flags and file status flags can be bitwise ORed in flags. The file creation flags are O_CLOEXEC, O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TMPFILE, and O_TRUNC. The file status flags are all of the remaining flags listed below. The distinction between these two groups of flags is that the file creation flags affect the semantics of the open operation itself, while the file status flags affect the semantics of subsequent I/O operations. The file status flags can be retrieved and (in some cases) modified; see fcntl(2) for details.
The full list of file creation flags and file status flags is as follows:
O_APPEND
The file is opened in append mode. Before each write(2), the file offset is positioned at the end of the file, as if with lseek(2). The modification of the file offset and the write operation are performed as a single atomic step.
O_APPEND may lead to corrupted files on NFS filesystems if more than one process appends data to a file at once. This is because NFS does not support appending to a file, so the client kernel has to simulate it, which can’t be done without a race condition.
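A minimal sketch of append-mode logging on a local filesystem follows. The helper name append_line() and the mode 0644 are illustrative assumptions.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append one line to a log file, creating the
   file (mode 0644 before the umask is applied) if it does not exist.
   Returns 0 on success, -1 on error. */
int
append_line(const char *path, const char *line)
{
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd == -1)
        return -1;

    size_t len = strlen(line);
    /* With O_APPEND, positioning at end-of-file and writing are
       performed as a single atomic step. */
    ssize_t n = write(fd, line, len);

    if (close(fd) == -1)
        return -1;
    return (n == (ssize_t) len) ? 0 : -1;
}
```

Because the offset is repositioned before each write(2), no explicit lseek(2) to the end of the file is needed.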
O_ASYNC
Enable signal-driven I/O: generate a signal (SIGIO by default, but this can be changed via fcntl(2)) when input or output becomes possible on this file descriptor. This feature is available only for terminals, pseudoterminals, sockets, and (since Linux 2.6) pipes and FIFOs. See fcntl(2) for further details. See also BUGS, below.
O_CLOEXEC (since Linux 2.6.23)
Enable the close-on-exec flag for the new file descriptor. Specifying this flag permits a program to avoid additional fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.
Note that the use of this flag is essential in some multithreaded programs, because using a separate fcntl(2) F_SETFD operation to set the FD_CLOEXEC flag does not suffice to avoid race conditions where one thread opens a file descriptor and attempts to set its close-on-exec flag using fcntl(2) at the same time as another thread does a fork(2) plus execve(2). Depending on the order of execution, the race may lead to the file descriptor returned by open() being unintentionally leaked to the program executed by the child process created by fork(2). (This kind of race is in principle possible for any system call that creates a file descriptor whose close-on-exec flag should be set, and various other Linux system calls provide an equivalent of the O_CLOEXEC flag to deal with this problem.)
O_CREAT
If pathname does not exist, create it as a regular file.
The owner (user ID) of the new file is set to the effective user ID of the process.
The group ownership (group ID) of the new file is set either to the effective group ID of the process (System V semantics) or to the group ID of the parent directory (BSD semantics). On Linux, the behavior depends on whether the set-group-ID mode bit is set on the parent directory: if that bit is set, then BSD semantics apply; otherwise, System V semantics apply. For some filesystems, the behavior also depends on the bsdgroups and sysvgroups mount options described in mount(8).
The mode argument specifies the file mode bits to be applied when a new file is created. If neither O_CREAT nor O_TMPFILE is specified in flags, then mode is ignored (and can thus be specified as 0, or simply omitted). The mode argument must be supplied if O_CREAT or O_TMPFILE is specified in flags; if it is not supplied, some arbitrary bytes from the stack will be applied as the file mode.
The effective mode is modified by the process’s umask in the usual way: in the absence of a default ACL, the mode of the created file is (mode & ~umask).
Note that mode applies only to future accesses of the newly created file; the open() call that creates a read-only file may well return a read/write file descriptor.
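The (mode & ~umask) rule can be checked with a sketch like the following. The helper name created_mode() is hypothetical, and the code assumes that path does not yet exist and that its parent directory has no default ACL.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path` requesting `mode` while the
   process umask is `mask`, and return the permission bits the new
   file actually received, or (mode_t) -1 on error. */
mode_t
created_mode(const char *path, mode_t mode, mode_t mask)
{
    struct stat st;
    mode_t old = umask(mask);   /* set the umask, remember the old one */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
    umask(old);                 /* restore the caller's umask */

    if (fd == -1)
        return (mode_t) -1;

    int rc = fstat(fd, &st);
    close(fd);
    return (rc == -1) ? (mode_t) -1 : (st.st_mode & 07777);
}
```

For example, requesting mode 0666 under a umask of 022 yields a file with mode 0644. Note that the umask is a per-process attribute, so temporarily changing it as done here is not safe in a multithreaded program.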
The following symbolic constants are provided for mode:
S_IRWXU
00700 user (file owner) has read, write, and execute permission
S_IRUSR
00400 user has read permission
S_IWUSR
00200 user has write permission
S_IXUSR
00100 user has execute permission
S_IRWXG
00070 group has read, write, and execute permission
S_IRGRP
00040 group has read permission
S_IWGRP
00020 group has write permission
S_IXGRP
00010 group has execute permission
S_IRWXO
00007 others have read, write, and execute permission
S_IROTH
00004 others have read permission
S_IWOTH
00002 others have write permission
S_IXOTH
00001 others have execute permission
According to POSIX, the effect when other bits are set in mode is unspecified. On Linux, the following bits are also honored in mode:
S_ISUID
0004000 set-user-ID bit
S_ISGID
0002000 set-group-ID bit (see inode(7)).
S_ISVTX
0001000 sticky bit (see inode(7)).
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is described in raw(8).
O_DIRECTORY
If pathname is not a directory, cause the open to fail. This flag was added in Linux 2.1.126, to avoid denial-of-service problems if opendir(3) is called on a FIFO or tape device.
O_DSYNC
Write operations on the file will complete according to the requirements of synchronized I/O data integrity completion.
By the time write(2) (and similar) return, the output data has been transferred to the underlying hardware, along with any file metadata that would be required to retrieve that data (i.e., as though each write(2) was followed by a call to fdatasync(2)). See NOTES below.
O_EXCL
Ensure that this call creates the file: if this flag is specified in conjunction with O_CREAT, and pathname already exists, then open() fails with the error EEXIST.
When these two flags are specified, symbolic links are not followed: if pathname is a symbolic link, then open() fails regardless of where the symbolic link points.
In general, the behavior of O_EXCL is undefined if it is used without O_CREAT. There is one exception: on Linux 2.6 and later, O_EXCL can be used without O_CREAT if pathname refers to a block device. If the block device is in use by the system (e.g., mounted), open() fails with the error EBUSY.
On NFS, O_EXCL is supported only when using NFSv3 or later on kernel 2.6 or later. In NFS environments where O_EXCL support is not provided, programs that rely on it for performing locking tasks will contain a race condition. Portable programs that want to perform atomic file locking using a lockfile, and need to avoid reliance on NFS support for O_EXCL, can create a unique file on the same filesystem (e.g., incorporating hostname and PID), and use link(2) to make a link to the lockfile. If link(2) returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful.
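On filesystems where O_EXCL is supported, exclusive creation can serve as a simple lockfile primitive. The following is a minimal sketch; the helper name try_lockfile() and the mode 0600 are illustrative assumptions.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to take a lock by creating `path` exclusively.
   Returns 1 if the lock was acquired, 0 if it is already held
   (the file exists), and -1 on any other error. */
int
try_lockfile(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return (errno == EEXIST) ? 0 : -1;

    close(fd);
    return 1;
}
```

The lock is released by removing the file with unlink(2). The existence check and the creation happen in a single step, so two processes cannot both see the file as absent and both create it.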
O_LARGEFILE
(LFS) Allow files whose sizes cannot be represented in an off_t (but can be represented in an off64_t) to be opened. The _LARGEFILE64_SOURCE macro must be defined (before including any header files) in order to obtain this definition. Setting the _FILE_OFFSET_BITS feature test macro to 64 (rather than using O_LARGEFILE) is the preferred method of accessing large files on 32-bit systems (see feature_test_macros(7)).
O_NOATIME (since Linux 2.6.8)
Do not update the file last access time (st_atime in the inode) when the file is read(2).
This flag can be employed only if one of the following conditions is true:
The effective UID of the process matches the owner UID of the file.
The calling process has the CAP_FOWNER capability in its user namespace and the owner UID of the file has a mapping in the namespace.
This flag is intended for use by indexing or backup programs, where its use can significantly reduce the amount of disk activity. This flag may not be effective on all filesystems. One example is NFS, where the server maintains the access time.
O_NOCTTY
If pathname refers to a terminal device (see tty(4)), it will not become the process’s controlling terminal even if the process does not have one.
O_NOFOLLOW
If the trailing component (i.e., basename) of pathname is a symbolic link, then the open fails, with the error ELOOP. Symbolic links in earlier components of the pathname will still be followed. (Note that the ELOOP error that can occur in this case is indistinguishable from the case where an open fails because there are too many symbolic links found while resolving components in the prefix part of the pathname.)
This flag is a FreeBSD extension, which was added in Linux 2.1.126, and has subsequently been standardized in POSIX.1-2008.
See also O_PATH below.
O_NONBLOCK or O_NDELAY
When possible, the file is opened in nonblocking mode. Neither the open() nor any subsequent I/O operations on the file descriptor which is returned will cause the calling process to wait.
Note that the setting of this flag has no effect on the operation of poll(2), select(2), epoll(7), and similar, since those interfaces merely inform the caller about whether a file descriptor is “ready”, meaning that an I/O operation performed on the file descriptor with the O_NONBLOCK flag clear would not block.
Note that this flag has no effect for regular files and block devices; that is, I/O operations will (briefly) block when device activity is required, regardless of whether O_NONBLOCK is set. Since O_NONBLOCK semantics might eventually be implemented, applications should not depend upon blocking behavior when specifying this flag for regular files and block devices.
For the handling of FIFOs (named pipes), see also fifo(7). For a discussion of the effect of O_NONBLOCK in conjunction with mandatory file locks and with file leases, see fcntl(2).
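One place where O_NONBLOCK visibly changes the behavior of open() itself is a FIFO opened for writing: with no reader present, the call fails with ENXIO (see ERRORS) instead of blocking. A minimal sketch follows; the helper name open_fifo_writer_nonblock() and its -2 return convention are illustrative assumptions.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: try to open a FIFO for writing without blocking.
   Returns the descriptor if a reader is present, -2 if there is no
   reader yet (ENXIO), and -1 on any other error. */
int
open_fifo_writer_nonblock(const char *path)
{
    int fd = open(path, O_WRONLY | O_NONBLOCK);
    if (fd == -1)
        return (errno == ENXIO) ? -2 : -1;
    return fd;
}
```

A caller that receives -2 can retry later, or fall back to a blocking open if waiting for a reader is acceptable.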
O_PATH (since Linux 2.6.39)
Obtain a file descriptor that can be used for two purposes: to indicate a location in the filesystem tree and to perform operations that act purely at the file descriptor level. The file itself is not opened, and other file operations (e.g., read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2), mmap(2)) fail with the error EBADF.
The following operations can be performed on the resulting file descriptor:
close(2).
fchdir(2), if the file descriptor refers to a directory (since Linux 3.5).
fstat(2) (since Linux 3.6).
fstatfs(2) (since Linux 3.12).
Duplicating the file descriptor (dup(2), fcntl(2) F_DUPFD, etc.).
Getting and setting file descriptor flags (fcntl(2) F_GETFD and F_SETFD).
Retrieving open file status flags using the fcntl(2) F_GETFL operation: the returned flags will include the bit O_PATH.
Passing the file descriptor as the dirfd argument of openat() and the other “*at()” system calls. This includes linkat(2) with AT_EMPTY_PATH (or via procfs using AT_SYMLINK_FOLLOW) even if the file is not a directory.
Passing the file descriptor to another process via a UNIX domain socket (see SCM_RIGHTS in unix(7)).
When O_PATH is specified in flags, flag bits other than O_CLOEXEC, O_DIRECTORY, and O_NOFOLLOW are ignored.
Opening a file or directory with the O_PATH flag requires no permissions on the object itself (but does require execute permission on the directories in the path prefix). Depending on the subsequent operation, a check for suitable file permissions may be performed (e.g., fchdir(2) requires execute permission on the directory referred to by its file descriptor argument). By contrast, obtaining a reference to a filesystem object by opening it with the O_RDONLY flag requires that the caller have read permission on the object, even when the subsequent operation (e.g., fchdir(2), fstat(2)) does not require read permission on the object.
If pathname is a symbolic link and the O_NOFOLLOW flag is also specified, then the call returns a file descriptor referring to the symbolic link. This file descriptor can be used as the dirfd argument in calls to fchownat(2), fstatat(2), linkat(2), and readlinkat(2) with an empty pathname to have the calls operate on the symbolic link.
If pathname refers to an automount point that has not yet been triggered, so no other filesystem is mounted on it, then the call returns a file descriptor referring to the automount directory without triggering a mount. fstatfs(2) can then be used to determine if it is, in fact, an untriggered automount point (.f_type == AUTOFS_SUPER_MAGIC).
One use of O_PATH for regular files is to provide the equivalent of POSIX.1’s O_EXEC functionality. This permits us to open a file for which we have execute permission but not read permission, and then execute that file, with steps something like the following:
char buf[PATH_MAX];
fd = open("some_prog", O_PATH);
snprintf(buf, PATH_MAX, "/proc/self/fd/%d", fd);
execl(buf, "some_prog", (char *) NULL);
An O_PATH file descriptor can also be passed as the argument of fexecve(3).
O_SYNC
Write operations on the file will complete according to the requirements of synchronized I/O file integrity completion (by contrast with the synchronized I/O data integrity completion provided by O_DSYNC.)
By the time write(2) (or similar) returns, the output data and associated file metadata have been transferred to the underlying hardware (i.e., as though each write(2) was followed by a call to fsync(2)). See NOTES below.
O_TMPFILE (since Linux 3.11)
Create an unnamed temporary regular file. The pathname argument specifies a directory; an unnamed inode will be created in that directory’s filesystem. Anything written to the resulting file will be lost when the last file descriptor is closed, unless the file is given a name.
O_TMPFILE must be specified with one of O_RDWR or O_WRONLY and, optionally, O_EXCL. If O_EXCL is not specified, then linkat(2) can be used to link the temporary file into the filesystem, making it permanent, using code like the following:
char path[PATH_MAX];
fd = open("/path/to/dir", O_TMPFILE | O_RDWR,
S_IRUSR | S_IWUSR);
/* File I/O on 'fd'... */
linkat(fd, "", AT_FDCWD, "/path/for/file", AT_EMPTY_PATH);
/* If the caller doesn't have the CAP_DAC_READ_SEARCH
capability (needed to use AT_EMPTY_PATH with linkat(2)),
and there is a proc(5) filesystem mounted, then the
linkat(2) call above can be replaced with:
snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file",
AT_SYMLINK_FOLLOW);
*/
In this case, the open() mode argument determines the file permission mode, as with O_CREAT.
Specifying O_EXCL in conjunction with O_TMPFILE prevents a temporary file from being linked into the filesystem in the above manner. (Note that the meaning of O_EXCL in this case is different from the meaning of O_EXCL otherwise.)
There are two main use cases for O_TMPFILE:
Improved tmpfile(3) functionality: race-free creation of temporary files that (1) are automatically deleted when closed; (2) can never be reached via any pathname; (3) are not subject to symlink attacks; and (4) do not require the caller to devise unique names.
Creating a file that is initially invisible, which is then populated with data and adjusted to have appropriate filesystem attributes (fchown(2), fchmod(2), fsetxattr(2), etc.) before being atomically linked into the filesystem in a fully formed state (using linkat(2) as described above).
O_TMPFILE requires support by the underlying filesystem; only a subset of Linux filesystems provide that support. In the initial implementation, support was provided in the ext2, ext3, ext4, UDF, Minix, and tmpfs filesystems. Support for other filesystems has subsequently been added as follows: XFS (Linux 3.15); Btrfs (Linux 3.16); F2FS (Linux 3.16); and ubifs (Linux 4.9).
O_TRUNC
If the file already exists and is a regular file and the access mode allows writing (i.e., is O_RDWR or O_WRONLY) it will be truncated to length 0. If the file is a FIFO or terminal device file, the O_TRUNC flag is ignored. Otherwise, the effect of O_TRUNC is unspecified.
creat()
A call to creat() is equivalent to calling open() with flags equal to O_CREAT|O_WRONLY|O_TRUNC.
openat()
The openat() system call operates in exactly the same way as open(), except for the differences described here.
The dirfd argument is used in conjunction with the pathname argument as follows:
If the pathname given in pathname is absolute, then dirfd is ignored.
If the pathname given in pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like open()).
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by open() for a relative pathname). In this case, dirfd must be a directory that was opened for reading (O_RDONLY) or using the O_PATH flag.
If the pathname given in pathname is relative, and dirfd is not a valid file descriptor, an error (EBADF) results. (Specifying an invalid file descriptor number in dirfd can be used as a means to ensure that pathname is absolute.)
openat2(2)
The openat2(2) system call is an extension of openat(), and provides a superset of the features of openat(). It is documented separately, in openat2(2).
RETURN VALUE
On success, open(), openat(), and creat() return the new file descriptor (a nonnegative integer). On error, -1 is returned and errno is set to indicate the error.
ERRORS
open(), openat(), and creat() can fail with the following errors:
EACCES
The requested access to the file is not allowed, or search permission is denied for one of the directories in the path prefix of pathname, or the file did not exist yet and write access to the parent directory is not allowed. (See also path_resolution(7).)
EACCES
Where O_CREAT is specified, the protected_fifos or protected_regular sysctl is enabled, the file already exists and is a FIFO or regular file, the owner of the file is neither the current user nor the owner of the containing directory, and the containing directory is both world- or group-writable and sticky. For details, see the descriptions of /proc/sys/fs/protected_fifos and /proc/sys/fs/protected_regular in proc_sys_fs(5).
EBADF
(openat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EBUSY
O_EXCL was specified in flags and pathname refers to a block device that is in use by the system (e.g., it is mounted).
EDQUOT
Where O_CREAT is specified, the file does not exist, and the user’s quota of disk blocks or inodes on the filesystem has been exhausted.
EEXIST
pathname already exists and O_CREAT and O_EXCL were used.
EFAULT
pathname points outside your accessible address space.
EFBIG
See EOVERFLOW.
EINTR
While blocked waiting to complete an open of a slow device (e.g., a FIFO; see fifo(7)), the call was interrupted by a signal handler; see signal(7).
EINVAL
The filesystem does not support the O_DIRECT flag. See NOTES for more information.
EINVAL
Invalid value in flags.
EINVAL
O_TMPFILE was specified in flags, but neither O_WRONLY nor O_RDWR was specified.
EINVAL
O_CREAT was specified in flags and the final component (“basename”) of the new file’s pathname is invalid (e.g., it contains characters not permitted by the underlying filesystem).
EINVAL
The final component (“basename”) of pathname is invalid (e.g., it contains characters not permitted by the underlying filesystem).
EISDIR
pathname refers to a directory and the access requested involved writing (that is, O_WRONLY or O_RDWR is set).
EISDIR
pathname refers to an existing directory, O_TMPFILE and one of O_WRONLY or O_RDWR were specified in flags, but this kernel version does not provide the O_TMPFILE functionality.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ELOOP
pathname was a symbolic link, and flags specified O_NOFOLLOW but not O_PATH.
EMFILE
The per-process limit on the number of open file descriptors has been reached (see the description of RLIMIT_NOFILE in getrlimit(2)).
ENAMETOOLONG
pathname was too long.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
pathname refers to a device special file and no corresponding device exists. (This is a Linux kernel bug; in this situation ENXIO must be returned.)
ENOENT
O_CREAT is not set and the named file does not exist.
ENOENT
A directory component in pathname does not exist or is a dangling symbolic link.
ENOENT
pathname refers to a nonexistent directory, O_TMPFILE and one of O_WRONLY or O_RDWR were specified in flags, but this kernel version does not provide the O_TMPFILE functionality.
ENOMEM
The named file is a FIFO, but memory for the FIFO buffer can’t be allocated because the per-user hard limit on memory allocation for pipes has been reached and the caller is not privileged; see pipe(7).
ENOMEM
Insufficient kernel memory was available.
ENOSPC
pathname was to be created but the device containing pathname has no room for the new file.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory, or O_DIRECTORY was specified and pathname was not a directory.
ENOTDIR
(openat()) pathname is a relative pathname and dirfd is a file descriptor referring to a file other than a directory.
ENXIO
O_NONBLOCK | O_WRONLY is set, the named file is a FIFO, and no process has the FIFO open for reading.
ENXIO
The file is a device special file and no corresponding device exists.
ENXIO
The file is a UNIX domain socket.
EOPNOTSUPP
The filesystem containing pathname does not support O_TMPFILE.
EOVERFLOW
pathname refers to a regular file that is too large to be opened. The usual scenario here is that an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 tried to open a file whose size exceeds (1<<31)-1 bytes; see also O_LARGEFILE above. This is the error specified by POSIX.1; before Linux 2.6.24, Linux gave the error EFBIG for this case.
EPERM
The O_NOATIME flag was specified, but the effective user ID of the caller did not match the owner of the file and the caller was not privileged.
EPERM
The operation was prevented by a file seal; see fcntl(2).
EROFS
pathname refers to a file on a read-only filesystem and write access was requested.
ETXTBSY
pathname refers to an executable image which is currently being executed and write access was requested.
ETXTBSY
pathname refers to a file that is currently in use as a swap file, and the O_TRUNC flag was specified.
ETXTBSY
pathname refers to a file that is currently being read by the kernel (e.g., for module/firmware loading), and write access was requested.
EWOULDBLOCK
The O_NONBLOCK flag was specified, and an incompatible lease was held on the file (see fcntl(2)).
VERSIONS
The (undefined) effect of O_RDONLY | O_TRUNC varies among implementations. On many systems the file is actually truncated.
Synchronized I/O
The POSIX.1-2008 “synchronized I/O” option specifies different variants of synchronized I/O, and specifies the open() flags O_SYNC, O_DSYNC, and O_RSYNC for controlling the behavior. Regardless of whether an implementation supports this option, it must at least support the use of O_SYNC for regular files.
Linux implements O_SYNC and O_DSYNC, but not O_RSYNC. Somewhat incorrectly, glibc defines O_RSYNC to have the same value as O_SYNC. (O_RSYNC is defined in the Linux header file <asm/fcntl.h> on HP PA-RISC, but it is not used.)
O_SYNC provides synchronized I/O file integrity completion, meaning write operations will flush data and all associated metadata to the underlying hardware. O_DSYNC provides synchronized I/O data integrity completion, meaning write operations will flush data to the underlying hardware, but will only flush metadata updates that are required to allow a subsequent read operation to complete successfully. Data integrity completion can reduce the number of disk operations that are required for applications that don’t need the guarantees of file integrity completion.
To understand the difference between the two types of completion, consider two pieces of file metadata: the file last modification timestamp (st_mtime) and the file length. All write operations will update the last file modification timestamp, but only writes that add data to the end of the file will change the file length. The last modification timestamp is not needed to ensure that a read completes successfully, but the file length is. Thus, O_DSYNC would only guarantee to flush updates to the file length metadata (whereas O_SYNC would also always flush the last modification timestamp metadata).
Before Linux 2.6.33, Linux implemented only the O_SYNC flag for open(). However, when that flag was specified, most filesystems actually provided the equivalent of synchronized I/O data integrity completion (i.e., O_SYNC was actually implemented as the equivalent of O_DSYNC).
Since Linux 2.6.33, proper O_SYNC support is provided. However, to ensure backward binary compatibility, O_DSYNC was defined with the same value as the historical O_SYNC, and O_SYNC was defined as a new (two-bit) flag value that includes the O_DSYNC flag value. This ensures that applications compiled against new headers get at least O_DSYNC semantics before Linux 2.6.33.
C library/kernel differences
Since glibc 2.26, the glibc wrapper function for open() employs the openat() system call, rather than the kernel’s open() system call. For certain architectures, this is also true before glibc 2.26.
STANDARDS
open()
creat()
openat()
POSIX.1-2008.
openat2(2) Linux.
The O_DIRECT, O_NOATIME, O_PATH, and O_TMPFILE flags are Linux-specific. One must define _GNU_SOURCE to obtain their definitions.
The O_CLOEXEC, O_DIRECTORY, and O_NOFOLLOW flags are not specified in POSIX.1-2001, but are specified in POSIX.1-2008. Since glibc 2.12, one can obtain their definitions by defining either _POSIX_C_SOURCE with a value greater than or equal to 200809L or _XOPEN_SOURCE with a value greater than or equal to 700. In glibc 2.11 and earlier, one obtains the definitions by defining _GNU_SOURCE.
HISTORY
open()
creat()
SVr4, 4.3BSD, POSIX.1-2001.
openat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Under Linux, the O_NONBLOCK flag is sometimes used in cases where one wants to open but does not necessarily have the intention to read or write. For example, this may be used to open a device in order to get a file descriptor for use with ioctl(2).
Note that open() can open device special files, but creat() cannot create them; use mknod(2) instead.
If the file is newly created, its st_atime, st_ctime, st_mtime fields (respectively, time of last access, time of last status change, and time of last modification; see stat(2)) are set to the current time, and so are the st_ctime and st_mtime fields of the parent directory. Otherwise, if the file is modified because of the O_TRUNC flag, its st_ctime and st_mtime fields are set to the current time.
The files in the /proc/pid/fd directory show the open file descriptors of the process with the PID pid. The files in the /proc/pid/fdinfo directory show even more information about these file descriptors. See proc(5) for further details of both of these directories.
The Linux header file <asm/fcntl.h> doesn’t define O_ASYNC; the (BSD-derived) FASYNC synonym is defined instead.
Open file descriptions
The term open file description is the one used by POSIX to refer to the entries in the system-wide table of open files. In other contexts, this object is variously also called an “open file object”, a “file handle”, an “open file table entry”, or, in kernel-developer parlance, a struct file.
When a file descriptor is duplicated (using dup(2) or similar), the duplicate refers to the same open file description as the original file descriptor, and the two file descriptors consequently share the file offset and file status flags. Such sharing can also occur between processes: a child process created via fork(2) inherits duplicates of its parent’s file descriptors, and those duplicates refer to the same open file descriptions.
Each open() of a file creates a new open file description; thus, there may be multiple open file descriptions corresponding to a file inode.
On Linux, one can use the kcmp(2) KCMP_FILE operation to test whether two file descriptors (in the same process or in two different processes) refer to the same open file description.
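The sharing of the file offset between duplicated descriptors can be observed directly. The following is a rough sketch, not an authoritative test (kcmp(2) is the reliable tool); the helper name shares_offset() is hypothetical, and it assumes both descriptors refer to seekable files.

```c
#include <fcntl.h>
#include <unistd.h>

/* Heuristic: returns 1 if fd_a and fd_b appear to share a file offset
   (and therefore, presumably, an open file description), 0 if they do
   not, and -1 on error. The offset of fd_a is moved temporarily and
   then restored. */
int shares_offset(int fd_a, int fd_b)
{
    off_t orig_a = lseek(fd_a, 0, SEEK_CUR);
    off_t orig_b = lseek(fd_b, 0, SEEK_CUR);
    if (orig_a == -1 || orig_b == -1)
        return -1;

    /* Move fd_a one byte forward and see whether fd_b follows. */
    if (lseek(fd_a, orig_a + 1, SEEK_SET) == -1)
        return -1;
    off_t now_b = lseek(fd_b, 0, SEEK_CUR);

    /* Restore the original offset of fd_a (and of fd_b, if shared). */
    if (lseek(fd_a, orig_a, SEEK_SET) == -1 || now_b == -1)
        return -1;

    return orig_a == orig_b && now_b == orig_a + 1;
}
```

A descriptor obtained with dup(2) should report 1 against its original, while a second open() of the same pathname should report 0, since it creates a new open file description.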
NFS
There are many infelicities in the protocol underlying NFS, affecting amongst others O_SYNC and O_NDELAY.
On NFS filesystems with UID mapping enabled, open() may return a file descriptor but, for example, read(2) requests are denied with EACCES. This is because the client performs open() by checking the permissions, but UID mapping is performed by the server upon read and write requests.
FIFOs
Opening the read or write end of a FIFO blocks until the other end is also opened (by another process or thread). See fifo(7) for further details.
File access mode
Unlike the other values that can be specified in flags, the access mode values O_RDONLY, O_WRONLY, and O_RDWR do not specify individual bits. Rather, they define the low order two bits of flags, and are defined respectively as 0, 1, and 2. In other words, the combination O_RDONLY | O_WRONLY is a logical error, and certainly does not have the same meaning as O_RDWR.
Linux reserves the special, nonstandard access mode 3 (binary 11) in flags to mean: check for read and write permission on the file and return a file descriptor that can’t be used for reading or writing. This nonstandard access mode is used by some Linux drivers to return a file descriptor that is to be used only for device-specific ioctl(2) operations.
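Because the access mode is a two-bit field rather than a set of independent bits, it has to be extracted with the O_ACCMODE mask before it is compared. A minimal sketch, with hypothetical helper names:

```c
#include <fcntl.h>

/* Returns the access mode (O_RDONLY, O_WRONLY, O_RDWR, or the
   nonstandard value 3) of an open descriptor, or -1 on error. */
int fd_access_mode(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return flags & O_ACCMODE;
}

/* Nonzero if the access mode in flags permits reading. */
int flags_allow_read(int flags)
{
    int mode = flags & O_ACCMODE;
    return mode == O_RDONLY || mode == O_RDWR;
}

/* Nonzero if the access mode in flags permits writing. */
int flags_allow_write(int flags)
{
    int mode = flags & O_ACCMODE;
    return mode == O_WRONLY || mode == O_RDWR;
}
```

A test such as (flags & O_RDONLY) is always zero, because O_RDONLY is defined as 0; comparing the masked value avoids that mistake. For the nonstandard mode 3, both helpers return zero, matching the description above.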
Rationale for openat() and other directory file descriptor APIs
openat() and the other system calls and library functions that take a directory file descriptor argument (i.e., execveat(2), faccessat(2), fanotify_mark(2), fchmodat(2), fchownat(2), fspick(2), fstatat(2), futimesat(2), linkat(2), mkdirat(2), mknodat(2), mount_setattr(2), move_mount(2), name_to_handle_at(2), open_tree(2), openat2(2), readlinkat(2), renameat(2), renameat2(2), statx(2), symlinkat(2), unlinkat(2), utimensat(2), mkfifoat(3), and scandirat(3)) address two problems with the older interfaces that preceded them. Here, the explanation is in terms of the openat() call, but the rationale is analogous for the other interfaces.
First, openat() allows an application to avoid race conditions that could occur when using open() to open files in directories other than the current working directory. These race conditions result from the fact that some component of the directory prefix given to open() could be changed in parallel with the call to open(). Suppose, for example, that we wish to create the file dir1/dir2/xxx.dep if the file dir1/dir2/xxx exists. The problem is that between the existence check and the file-creation step, dir1 or dir2 (which might be symbolic links) could be modified to point to a different location. Such races can be avoided by opening a file descriptor for the target directory, and then specifying that file descriptor as the dirfd argument of (say) fstatat(2) and openat(). The use of the dirfd file descriptor also has other benefits:
the file descriptor is a stable reference to the directory, even if the directory is renamed; and
the open file descriptor prevents the underlying filesystem from being dismounted, just as when a process has a current working directory on a filesystem.
Second, openat() allows the implementation of a per-thread “current working directory”, via file descriptor(s) maintained by the application. (This functionality can also be obtained by tricks based on the use of /proc/self/fd/dirfd, but less efficiently.)
The dirfd argument for these APIs can be obtained by using open() or openat() to open a directory (with either the O_RDONLY or the O_PATH flag). Alternatively, such a file descriptor can be obtained by applying dirfd(3) to a directory stream created using opendir(3).
When these APIs are given a dirfd argument of AT_FDCWD or the specified pathname is absolute, then they handle their pathname argument in the same way as the corresponding conventional APIs. However, in this case, several of the APIs have a flags argument that provides access to functionality that is not available with the corresponding conventional APIs.
O_DIRECT
The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. In Linux, alignment restrictions vary by filesystem and kernel version and might be absent entirely. The handling of misaligned O_DIRECT I/Os also varies; they can either fail with EINVAL or fall back to buffered I/O.
Since Linux 6.1, O_DIRECT support and alignment restrictions for a file can be queried using statx(2), using the STATX_DIOALIGN flag. Support for STATX_DIOALIGN varies by filesystem; see statx(2).
Some filesystems provide their own interfaces for querying O_DIRECT alignment restrictions, for example the XFS_IOC_DIOINFO operation in xfsctl(3). STATX_DIOALIGN should be used instead when it is available.
If none of the above is available, then direct I/O support and alignment restrictions can only be assumed from known characteristics of the filesystem, the individual file, the underlying storage device(s), and the kernel version. In Linux 2.4, most filesystems based on block devices require that the file offset and the length and memory address of all I/O segments be multiples of the filesystem block size (typically 4096 bytes). In Linux 2.6.0, this was relaxed to the logical block size of the block device (typically 512 bytes). A block device’s logical block size can be determined using the ioctl(2) BLKSSZGET operation or from the shell using the command:
blockdev --getss
O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).
The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX also has an fcntl(2) call to query appropriate alignments and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.
O_DIRECT support was added in Linux 2.4.10. Older Linux kernels simply ignore this flag. Some filesystems may not implement the flag, in which case open() fails with the error EINVAL if it is used.
Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the filesystem correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.
The behavior of O_DIRECT with NFS will differ from local filesystems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will bypass the page cache only on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.
In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.
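Treating O_DIRECT as an optional optimization could look like the sketch below: direct I/O is attempted only on request, and the open falls back to buffered I/O when the filesystem rejects the flag with EINVAL. The function name open_prefer_direct() is hypothetical. The definition of O_DIRECT as 0 is only a fallback for builds where _GNU_SOURCE was not defined before the first header was included; in that case the sketch always uses buffered I/O.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#ifndef O_DIRECT
#define O_DIRECT 0   /* Not visible without _GNU_SOURCE: buffered only. */
#endif

/* Open path with the given flags and mode. If want_direct is nonzero,
   try O_DIRECT first and fall back to a normal open when the
   filesystem rejects the flag with EINVAL. On success, *is_direct is
   set to 1 if the returned descriptor uses direct I/O, else 0.
   Caveat: a failed first attempt may already have created the file,
   so combining this helper with O_EXCL is not recommended. */
int open_prefer_direct(const char *path, int flags, mode_t mode,
                       int want_direct, int *is_direct)
{
    *is_direct = 0;

    if (want_direct && O_DIRECT != 0) {
        int fd = open(path, flags | O_DIRECT, mode);
        if (fd != -1) {
            *is_direct = 1;
            return fd;
        }
        if (errno != EINVAL)
            return -1;
    }

    return open(path, flags, mode);
}
```

The caller still has to honor the alignment restrictions whenever *is_direct is 1, and should keep the buffered path as the default, as recommended above.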
BUGS
Currently, it is not possible to enable signal-driven I/O by specifying O_ASYNC when calling open(); use fcntl(2) to enable this flag.
One must check for two different error codes, EISDIR and ENOENT, when trying to determine whether the kernel supports O_TMPFILE functionality.
When both O_CREAT and O_DIRECTORY are specified in flags and the file specified by pathname does not exist, open() will create a regular file (i.e., O_DIRECTORY is ignored).
SEE ALSO
chmod(2), chown(2), close(2), dup(2), fcntl(2), link(2), lseek(2), mknod(2), mmap(2), mount(2), open_by_handle_at(2), openat2(2), read(2), socket(2), stat(2), umask(2), unlink(2), write(2), fopen(3), acl(5), fifo(7), inode(7), path_resolution(7), symlink(7)
215 - Linux cli command vm86
NAME 🖥️ vm86 🖥️
enter virtual 8086 mode
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/vm86.h>
int vm86old(struct vm86_struct *info);
int vm86(unsigned long fn, struct vm86plus_struct *v86);
DESCRIPTION
The system call vm86() was introduced in Linux 0.97p2. In Linux 2.1.15 and 2.0.28, it was renamed to vm86old(), and a new vm86() was introduced. The definition of struct vm86_struct was changed in 1.1.8 and 1.1.9.
These calls cause the process to enter VM86 mode (virtual-8086 in Intel literature), and are used by dosemu.
VM86 mode is an emulation of real mode within a protected mode task.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
This return value is specific to i386 and indicates a problem with getting user-space data.
ENOSYS
This return value indicates the call is not implemented on the present architecture.
EPERM
Saved kernel stack exists. (This is a kernel sanity check; the saved stack should exist only within vm86 mode itself.)
STANDARDS
Linux on 32-bit Intel processors.
216 - Linux cli command llistxattr
NAME 🖥️ llistxattr 🖥️
list extended attribute names
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
ssize_t listxattr(const char *path, char *_Nullable list, size_t size);
ssize_t llistxattr(const char *path, char *_Nullable list, size_t size);
ssize_t flistxattr(int fd, char *_Nullable list, size_t size);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
listxattr() retrieves the list of extended attribute names associated with the given path in the filesystem. The retrieved list is placed in list, a caller-allocated buffer whose size (in bytes) is specified in the argument size. The list is the set of (null-terminated) names, one after the other. Names of extended attributes to which the calling process does not have access may be omitted from the list. The length of the attribute name list is returned.
llistxattr() is identical to listxattr(), except in the case of a symbolic link, where the list of names of extended attributes associated with the link itself is retrieved, not the file that it refers to.
flistxattr() is identical to listxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path.
A single extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode.
If size is specified as zero, these calls return the current size of the list of extended attribute names (and leave list unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the set of extended attributes may change between the two calls, so that it is still necessary to check the return status from the second call.)
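The two-call pattern, including a retry when the list grows between the calls, might be sketched as follows. The function name list_xattr_names() and the limit of five attempts are arbitrary choices made for illustration.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Returns a malloc()ed buffer holding the null-terminated attribute
   names of path, one after the other, and stores the total length in
   *len_out. For a file without attributes, a one-byte buffer is
   returned and *len_out is 0. Returns NULL with errno set on error.
   The caller must free() the result. */
char *list_xattr_names(const char *path, ssize_t *len_out)
{
    for (int attempt = 0; attempt < 5; attempt++) {
        /* First call: ask only for the required size. */
        ssize_t need = listxattr(path, NULL, 0);
        if (need == -1)
            return NULL;

        char *buf = malloc(need > 0 ? (size_t) need : 1);
        if (buf == NULL)
            return NULL;
        if (need == 0) {
            *len_out = 0;
            return buf;
        }

        /* Second call: fetch the names; the list may have grown. */
        ssize_t got = listxattr(path, buf, (size_t) need);
        if (got != -1) {
            *len_out = got;
            return buf;
        }

        int saved_errno = errno;
        free(buf);
        if (saved_errno != ERANGE) {
            errno = saved_errno;
            return NULL;
        }
        /* ERANGE: the list grew between the calls; try again. */
    }

    errno = ERANGE;
    return NULL;
}
```

The empty-list case is returned early because passing a size of zero to the second call would only query the size again instead of filling the buffer.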
Example
The list of names is returned as an unordered array of null-terminated character strings (attribute names are separated by null bytes ('\0')), like this:
user.name1\0system.name1\0user.name2\0
Filesystems that implement POSIX ACLs using extended attributes might return a list like this:
system.posix_acl_access\0system.posix_acl_default\0
RETURN VALUE
On success, a nonnegative number is returned indicating the size of the extended attribute name list. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
E2BIG
The size of the list of extended attribute names is larger than the maximum size allowed; the list cannot be retrieved. This can happen on filesystems that support an unlimited number of extended attributes per file such as XFS, for example. See BUGS.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
ERANGE
The size of the list buffer is too small to hold the result.
In addition, the errors documented in stat(2) can also occur.
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
BUGS
As noted in xattr(7), the VFS imposes a limit of 64 kB on the size of the extended attribute name list returned by listxattr(). If the total size of attribute names attached to a file exceeds this limit, it is no longer possible to retrieve the list of attribute names.
EXAMPLES
The following program demonstrates the usage of listxattr() and getxattr(2). For the file whose pathname is provided as a command-line argument, it lists all extended file attributes and their values.
To keep the code simple, the program assumes that attribute keys and values are constant during the execution of the program. A production program should expect and handle changes during execution of the program. For example, the number of bytes required for attribute keys might increase between the two calls to listxattr(). An application could handle this possibility using a loop that retries the call (perhaps up to a predetermined maximum number of attempts) with a larger buffer each time it fails with the error ERANGE. Calls to getxattr(2) could be handled similarly.
The following output was recorded by first creating a file, setting some extended file attributes, and then listing the attributes with the example program.
Example output
$ touch /tmp/foo
$ setfattr -n user.fred -v chocolate /tmp/foo
$ setfattr -n user.frieda -v bar /tmp/foo
$ setfattr -n user.empty /tmp/foo
$ ./listxattr /tmp/foo
user.fred: chocolate
user.frieda: bar
user.empty: <no value>
Program source (listxattr.c)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>

int
main(int argc, char *argv[])
{
    char     *buf, *key, *val;
    ssize_t  buflen, keylen, vallen;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s path\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /*
     * Determine the length of the buffer needed.
     */
    buflen = listxattr(argv[1], NULL, 0);
    if (buflen == -1) {
        perror("listxattr");
        exit(EXIT_FAILURE);
    }
    if (buflen == 0) {
        printf("%s has no attributes.\n", argv[1]);
        exit(EXIT_SUCCESS);
    }

    /*
     * Allocate the buffer.
     */
    buf = malloc(buflen);
    if (buf == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    /*
     * Copy the list of attribute keys to the buffer.
     */
    buflen = listxattr(argv[1], buf, buflen);
    if (buflen == -1) {
        perror("listxattr");
        exit(EXIT_FAILURE);
    }

    /*
     * Loop over the list of zero terminated strings with the
     * attribute keys. Use the remaining buffer length to determine
     * the end of the list.
     */
    key = buf;
    while (buflen > 0) {

        /*
         * Output attribute key.
         */
        printf("%s: ", key);

        /*
         * Determine length of the value.
         */
        vallen = getxattr(argv[1], key, NULL, 0);
        if (vallen == -1)
            perror("getxattr");

        if (vallen > 0) {

            /*
             * Allocate value buffer.
             * One extra byte is needed to append 0x00.
             */
            val = malloc(vallen + 1);
            if (val == NULL) {
                perror("malloc");
                exit(EXIT_FAILURE);
            }

            /*
             * Copy value to buffer.
             */
            vallen = getxattr(argv[1], key, val, vallen);
            if (vallen == -1) {
                perror("getxattr");
            } else {
                /*
                 * Output attribute value.
                 */
                val[vallen] = 0;
                printf("%s", val);
            }

            free(val);
        } else if (vallen == 0) {
            printf("<no value>");
        }

        printf("\n");

        /*
         * Forward to next attribute key.
         */
        keylen = strlen(key) + 1;
        buflen -= keylen;
        key += keylen;
    }

    free(buf);
    exit(EXIT_SUCCESS);
}
SEE ALSO
getfattr(1), setfattr(1), getxattr(2), open(2), removexattr(2), setxattr(2), stat(2), symlink(7), xattr(7)
217 - Linux cli command ioprio_get
NAME 🖥️ ioprio_get 🖥️
get/set I/O scheduling class and priority
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/ioprio.h> /* Definition of IOPRIO_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_ioprio_get, int which, int who);
int syscall(SYS_ioprio_set, int which, int who, int ioprio);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
The ioprio_get() and ioprio_set() system calls get and set the I/O scheduling class and priority of one or more threads.
The which and who arguments identify the thread(s) on which the system calls operate. The which argument determines how who is interpreted, and has one of the following values:
IOPRIO_WHO_PROCESS
who is a process ID or thread ID identifying a single process or thread. If who is 0, then operate on the calling thread.
IOPRIO_WHO_PGRP
who is a process group ID identifying all the members of a process group. If who is 0, then operate on the process group of which the caller is a member.
IOPRIO_WHO_USER
who is a user ID identifying all of the processes that have a matching real UID.
If which is specified as IOPRIO_WHO_PGRP or IOPRIO_WHO_USER when calling ioprio_get(), and more than one process matches who, then the returned priority will be the highest one found among all of the matching processes. One priority is said to be higher than another one if it belongs to a higher priority class (IOPRIO_CLASS_RT is the highest priority class; IOPRIO_CLASS_IDLE is the lowest) or if it belongs to the same priority class as the other process but has a higher priority level (a lower priority number means a higher priority level).
The ioprio argument given to ioprio_set() is a bit mask that specifies both the scheduling class and the priority to be assigned to the target process(es). The following macros are used for assembling and dissecting ioprio values:
IOPRIO_PRIO_VALUE(class, data)
Given a scheduling class and priority (data), this macro combines the two values to produce an ioprio value, which is returned as the result of the macro.
IOPRIO_PRIO_CLASS(mask)
Given mask (an ioprio value), this macro returns its I/O class component, that is, one of the values IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, or IOPRIO_CLASS_IDLE.
IOPRIO_PRIO_DATA(mask)
Given mask (an ioprio value), this macro returns its priority (data) component.
See the NOTES section for more information on scheduling classes and priorities, as well as the meaning of specifying ioprio as 0.
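The macros above can be combined with syscall(2) roughly as follows. This is a sketch, not a definitive implementation: the helper names are hypothetical, and the fallback definitions are used only when <linux/ioprio.h> is not installed; they are assumed to mirror the kernel's values (class in the bits above a 13-bit priority field).

```c
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__has_include)
#  if __has_include(<linux/ioprio.h>)
#    include <linux/ioprio.h>
#  endif
#endif

#ifndef IOPRIO_CLASS_SHIFT
/* Fallback definitions (assumed to match the kernel header). */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_CLASS(mask) ((mask) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(mask) ((mask) & ((1 << IOPRIO_CLASS_SHIFT) - 1))
#define IOPRIO_PRIO_VALUE(class, data) \
        (((class) << IOPRIO_CLASS_SHIFT) | (data))
enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE,
       IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER };
#endif

/* Fetch the I/O scheduling class and priority of the calling thread.
   Returns 0 on success, or -1 with errno set on error. */
int get_own_ioprio(int *class_out, int *data_out)
{
    int prio = (int) syscall(SYS_ioprio_get, IOPRIO_WHO_PROCESS, 0);
    if (prio == -1)
        return -1;

    *class_out = (int) IOPRIO_PRIO_CLASS(prio);
    *data_out = (int) IOPRIO_PRIO_DATA(prio);
    return 0;
}

/* Put the calling thread in the best-effort class at the given level
   (0 is the highest, 7 the lowest). Returns 0, or -1 on error. */
int set_own_ioprio_best_effort(int level)
{
    int prio = (int) IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, level);
    return (int) syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio);
}
```

Passing 0 as who with IOPRIO_WHO_PROCESS targets the calling thread only, so a multithreaded program would need to make the call in each thread.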
I/O priorities are supported for reads and for synchronous (O_DIRECT, O_SYNC) writes. I/O priorities are not supported for asynchronous writes because they are issued outside the context of the program dirtying the memory, and thus program-specific priorities do not apply.
RETURN VALUE
On success, ioprio_get() returns the ioprio value of the process with highest I/O priority of any of the processes that match the criteria specified in which and who. On error, -1 is returned, and errno is set to indicate the error.
On success, ioprio_set() returns 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
Invalid value for which or ioprio. Refer to the NOTES section for available scheduler classes and priority levels for ioprio.
EPERM
The calling process does not have the privilege needed to assign this ioprio to the specified process(es). See the NOTES section for more information on required privileges for ioprio_set().
ESRCH
No process(es) could be found that matched the specification in which and who.
STANDARDS
Linux.
HISTORY
Linux 2.6.13.
NOTES
Two or more processes or threads can share an I/O context. This will be the case when clone(2) was called with the CLONE_IO flag. However, by default, the distinct threads of a process will not share the same I/O context. This means that if you want to change the I/O priority of all threads in a process, you may need to call ioprio_set() on each of the threads. The thread ID that you would need for this operation is the one that is returned by gettid(2) or clone(2).
These system calls have an effect only when used in conjunction with an I/O scheduler that supports I/O priorities. As of Linux 2.6.17, the only such scheduler is the Completely Fair Queuing (CFQ) I/O scheduler.
If no I/O priority has been set for a thread, then by default the I/O priority will follow the CPU nice value (setpriority(2)). Before Linux 2.6.24, once an I/O priority had been set using ioprio_set(), there was no way to reset the I/O scheduling behavior to the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to reset to the default I/O scheduling behavior.
Selecting an I/O scheduler
I/O schedulers are selected on a per-device basis via the special file /sys/block/device/queue/scheduler.
One can view the current I/O scheduler via the /sys filesystem. For example, the following command displays a list of all schedulers currently loaded in the kernel:
$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
The scheduler surrounded by brackets is the one actually in use for the device (sda in the example). Setting another scheduler is done by writing the name of the new scheduler to this file. For example, the following command will set the scheduler for the sda device to cfq:
$ su
Password:
# echo cfq > /sys/block/sda/queue/scheduler
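A program that reads the scheduler file needs to pick out the bracketed name. The following is a small sketch of that parsing step; the function name active_scheduler() is hypothetical, and the caller is assumed to have already read one line from the file.

```c
#include <stddef.h>
#include <string.h>

/* Copy the bracketed (active) scheduler name from a line such as
   "noop anticipatory deadline [cfq]" into out. Returns 0 on success,
   or -1 if no bracketed name is found or out is too small. */
int active_scheduler(const char *line, char *out, size_t outsize)
{
    const char *lbr = strchr(line, '[');
    if (lbr == NULL)
        return -1;

    const char *rbr = strchr(lbr + 1, ']');
    if (rbr == NULL)
        return -1;

    size_t len = (size_t) (rbr - (lbr + 1));
    if (len + 1 > outsize)
        return -1;

    memcpy(out, lbr + 1, len);
    out[len] = '\0';
    return 0;
}
```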
The Completely Fair Queuing (CFQ) I/O scheduler
Since version 3 (also known as CFQ Time Sliced), CFQ implements I/O nice levels similar to those of CPU scheduling. These nice levels are grouped into three scheduling classes, each one containing one or more priority levels:
IOPRIO_CLASS_RT (1)
This is the real-time I/O class. This scheduling class is given higher priority than any other class: processes from this class are given first access to the disk every time. Thus, this I/O class needs to be used with some care: one I/O real-time process can starve the entire system. Within the real-time class, there are 8 levels of class data (priority) that determine exactly how much time this process needs the disk for on each service. The highest real-time priority level is 0; the lowest is 7. In the future, this might change to be more directly mappable to performance, by passing in a desired data rate instead.
IOPRIO_CLASS_BE (2)
This is the best-effort scheduling class, which is the default for any process that hasn’t set a specific I/O priority. The class data (priority) determines how much I/O bandwidth the process will get. Best-effort priority levels are analogous to CPU nice values (see getpriority(2)). The priority level determines a priority relative to other processes in the best-effort scheduling class. Priority levels range from 0 (highest) to 7 (lowest).
IOPRIO_CLASS_IDLE (3)
This is the idle scheduling class. Processes running at this level get I/O time only when no one else needs the disk. The idle class has no class data. Attention is required when assigning this priority class to a process, since it may become starved if higher priority processes are constantly accessing the disk.
Refer to the kernel source file Documentation/block/ioprio.txt for more information on the CFQ I/O Scheduler and an example program.
Required permissions to set I/O priorities
Permission to change a process’s priority is granted or denied based on two criteria:
Process ownership
An unprivileged process may set the I/O priority only for a process whose real UID matches the real or effective UID of the calling process. A process which has the CAP_SYS_NICE capability can change the priority of any process.
What is the desired priority
Attempts to set very high priorities (IOPRIO_CLASS_RT) require the CAP_SYS_ADMIN capability. Linux 2.6.24 and earlier also required CAP_SYS_ADMIN to set a very low priority (IOPRIO_CLASS_IDLE), but since Linux 2.6.25, this is no longer required.
A call to ioprio_set() must follow both rules, or the call will fail with the error EPERM.
BUGS
glibc does not yet provide a suitable header file defining the function prototypes and macros described on this page. Suitable definitions can be found in linux/ioprio.h.
SEE ALSO
ionice(1), getpriority(2), open(2), capabilities(7), cgroups(7)
Documentation/block/ioprio.txt in the Linux kernel source tree
218 - Linux cli command lgetxattr
NAME 🖥️ lgetxattr 🖥️
retrieve an extended attribute value
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
ssize_t getxattr(const char *path, const char *name,
void value[.size], size_t size);
ssize_t lgetxattr(const char *path, const char *name,
void value[.size], size_t size);
ssize_t fgetxattr(int fd, const char *name,
void value[.size], size_t size);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
getxattr() retrieves the value of the extended attribute identified by name and associated with the given path in the filesystem. The attribute value is placed in the buffer pointed to by value; size specifies the size of that buffer. The return value of the call is the number of bytes placed in value.
lgetxattr() is identical to getxattr(), except in the case of a symbolic link, where the link itself is interrogated, not the file that it refers to.
fgetxattr() is identical to getxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path.
An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. The value of an extended attribute is a chunk of arbitrary textual or binary data that was assigned using setxattr(2).
If size is specified as zero, these calls return the current size of the named extended attribute (and leave value unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the attribute value may change between the two calls, so that it is still necessary to check the return status from the second call.)
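An alternative to always making two calls is to try a modest buffer first and query the size only when the call fails with ERANGE. The sketch below takes that approach; the function name get_xattr_value(), the initial guess of 64 bytes, and the limit of eight attempts are arbitrary choices made for illustration.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Returns a malloc()ed copy of the value of attribute name on path,
   with a terminating null byte appended for convenience, and stores
   the value length in *len_out. Returns NULL with errno set on error.
   The caller must free() the result. */
char *get_xattr_value(const char *path, const char *name, ssize_t *len_out)
{
    size_t cap = 64;    /* Arbitrary first guess. */

    for (int attempt = 0; attempt < 8; attempt++) {
        char *buf = malloc(cap + 1);
        if (buf == NULL)
            return NULL;

        ssize_t got = getxattr(path, name, buf, cap);
        if (got >= 0) {
            buf[got] = '\0';
            *len_out = got;
            return buf;
        }

        int saved_errno = errno;
        free(buf);
        if (saved_errno != ERANGE) {
            errno = saved_errno;
            return NULL;
        }

        /* Buffer too small: ask for the current size and retry. */
        ssize_t need = getxattr(path, name, NULL, 0);
        if (need == -1)
            return NULL;
        cap = (size_t) need > cap ? (size_t) need : cap * 2;
    }

    errno = ERANGE;
    return NULL;
}
```

Substituting lgetxattr() for getxattr() in both calls would interrogate a symbolic link itself rather than the file it refers to.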
RETURN VALUE
On success, these calls return a nonnegative value which is the size (in bytes) of the extended attribute value. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
E2BIG
The size of the attribute value is larger than the maximum size allowed; the attribute cannot be retrieved. This can happen on filesystems that support very large attribute values such as NFSv4, for example.
ENODATA
The named attribute does not exist, or the process has no access to this attribute.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
ERANGE
The size of the value buffer is too small to hold the result.
In addition, the errors documented in stat(2) can also occur.
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
EXAMPLES
See listxattr(2).
SEE ALSO
getfattr(1), setfattr(1), listxattr(2), open(2), removexattr(2), setxattr(2), stat(2), symlink(7), xattr(7)
219 - Linux cli command chroot
NAME 🖥️ chroot 🖥️
change root directory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int chroot(const char *path);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
chroot():
Since glibc 2.2.2:
_XOPEN_SOURCE && ! (_POSIX_C_SOURCE >= 200112L)
|| /* Since glibc 2.20: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE
Before glibc 2.2.2:
none
DESCRIPTION
chroot() changes the root directory of the calling process to that specified in path. This directory will be used for pathnames beginning with /. The root directory is inherited by all children of the calling process.
Only a privileged process (Linux: one with the CAP_SYS_CHROOT capability in its user namespace) may call chroot().
This call changes an ingredient in the pathname resolution process and does nothing else. In particular, it is not intended to be used for any kind of security purpose, neither to fully sandbox a process nor to restrict filesystem system calls. In the past, chroot() has been used by daemons to restrict themselves prior to passing paths supplied by untrusted users to system calls such as open(2). However, if a folder is moved out of the chroot directory, an attacker can exploit that to get out of the chroot directory as well. The easiest way to do that is to chdir(2) to the to-be-moved directory, wait for it to be moved out, then open a path like ../../../etc/passwd.
A slightly trickier variation also works under some circumstances if chdir(2) is not permitted. If a daemon allows a “chroot directory” to be specified, that usually means that if you want to prevent remote users from accessing files outside the chroot directory, you must ensure that folders are never moved out of it.
This call does not change the current working directory, so that after the call ‘.’ can be outside the tree rooted at ‘/’. In particular, the superuser can escape from a “chroot jail” by doing:
mkdir foo; chroot foo; cd ..
This call does not close open file descriptors, and such file descriptors may allow access to files outside the chroot tree.
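Because chroot() leaves the current working directory untouched, callers usually pair it with chdir(2). The following is a minimal sketch of that sequence; enter_chroot() is a hypothetical helper name. It does nothing about file descriptors that are already open, and, as noted above, it should not be read as a security boundary.

```c
#define _DEFAULT_SOURCE
#include <unistd.h>

/*
 * Hypothetical helper: make `dir` the root directory and move the
 * working directory inside it. Returns 0 on success, -1 with errno
 * set on failure. The caller needs CAP_SYS_CHROOT.
 */
int enter_chroot(const char *dir)
{
    /* Move into the directory first, so "." is not left outside. */
    if (chdir(dir) == -1)
        return -1;

    /* "." now names `dir`; use it as the root for pathname resolution. */
    if (chroot(".") == -1)
        return -1;

    /* Reset the working directory to the new root. */
    return chdir("/");
}
```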
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, other errors can be returned. The more general errors are listed below:
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EFAULT
path points outside your accessible address space.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving path.
ENAMETOOLONG
path is too long.
ENOENT
The file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of path is not a directory.
EPERM
The caller has insufficient privilege.
STANDARDS
None.
HISTORY
SVr4, 4.4BSD, SUSv2 (marked LEGACY). This function is not part of POSIX.1-2001.
NOTES
A child process created via fork(2) inherits its parent’s root directory. The root directory is left unchanged by execve(2).
The magic symbolic link, /proc/pid/root, can be used to discover a process’s root directory; see proc(5) for details.
FreeBSD has a stronger jail() system call.
SEE ALSO
chroot(1), chdir(2), pivot_root(2), path_resolution(7), switch_root(8)
220 - Linux cli command mount
NAME 🖥️ mount 🖥️
mount filesystem
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mount.h>
int mount(const char *source, const char *target,
const char *filesystemtype, unsigned long mountflags,
const void *_Nullable data);
DESCRIPTION
mount() attaches the filesystem specified by source (which is often a pathname referring to a device, but can also be the pathname of a directory or file, or a dummy string) to the location (a directory or file) specified by the pathname in target.
Appropriate privilege (Linux: the CAP_SYS_ADMIN capability) is required to mount filesystems.
Values for the filesystemtype argument supported by the kernel are listed in /proc/filesystems (e.g., “btrfs”, “ext4”, “jfs”, “xfs”, “vfat”, “fuse”, “tmpfs”, “cgroup”, “proc”, “mqueue”, “nfs”, “cifs”, “iso9660”). Further types may become available when the appropriate modules are loaded.
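A program can check /proc/filesystems before calling mount(). The sketch below assumes the usual layout of that file, in which each line holds an optional "nodev" marker, a tab, and the type name; fs_type_listed() is a hypothetical helper name. A result of 0 does not prove the type is unusable, since a module that provides it may simply not be loaded yet.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if `fstype` is listed in
 * /proc/filesystems, 0 if it is not, and -1 if the file cannot be
 * opened.
 */
int fs_type_listed(const char *fstype)
{
    FILE *fp = fopen("/proc/filesystems", "r");
    if (fp == NULL)
        return -1;

    char line[256];
    int found = 0;
    while (!found && fgets(line, sizeof line, fp) != NULL) {
        /* Strip the newline; the type name is the last tab-separated field. */
        line[strcspn(line, "\n")] = '\0';
        const char *tab = strrchr(line, '\t');
        const char *name = (tab != NULL) ? tab + 1 : line;
        if (strcmp(name, fstype) == 0)
            found = 1;
    }
    fclose(fp);
    return found;
}
```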
The data argument is interpreted by the different filesystems. Typically it is a string of comma-separated options understood by this filesystem. See mount(8) for details of the options available for each filesystem type. This argument may be specified as NULL, if there are no options.
A call to mount() performs one of a number of general types of operation, depending on the bits specified in mountflags. The choice of which operation to perform is determined by testing the bits set in mountflags, with the tests being conducted in the order listed here:
Remount an existing mount: mountflags includes MS_REMOUNT.
Create a bind mount: mountflags includes MS_BIND.
Change the propagation type of an existing mount: mountflags includes one of MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE.
Move an existing mount to a new location: mountflags includes MS_MOVE.
Create a new mount: mountflags includes none of the above flags.
Each of these operations is detailed later in this page. Further flags may be specified in mountflags to modify the behavior of mount(), as described below.
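The order of the tests can be restated as a small classifier. This is an illustrative sketch only: the enum and the function classify_mount_op() are hypothetical names, and the code mirrors the documented order rather than reproducing kernel source.

```c
#include <sys/mount.h>

/* Hypothetical labels for the five general operation types. */
enum mount_op { OP_REMOUNT, OP_BIND, OP_PROPAGATION, OP_MOVE, OP_NEW };

/* Apply the tests in the documented order; the first match wins. */
enum mount_op classify_mount_op(unsigned long mountflags)
{
    if (mountflags & MS_REMOUNT)
        return OP_REMOUNT;
    if (mountflags & MS_BIND)
        return OP_BIND;
    if (mountflags & (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
        return OP_PROPAGATION;
    if (mountflags & MS_MOVE)
        return OP_MOVE;
    return OP_NEW;
}
```

For example, MS_REMOUNT | MS_BIND | MS_RDONLY is classified as a remount, because MS_REMOUNT is tested before MS_BIND.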
Additional mount flags
The list below describes the additional flags that can be specified in mountflags. Note that some operation types ignore some or all of these flags, as described later in this page.
MS_DIRSYNC (since Linux 2.5.19)
Make directory changes on this filesystem synchronous. (This property can be obtained for individual directories or subtrees using chattr(1).)
MS_LAZYTIME (since Linux 4.0)
Reduce on-disk updates of inode timestamps (atime, mtime, ctime) by maintaining these changes only in memory. The on-disk timestamps are updated only when:
the inode needs to be updated for some change unrelated to file timestamps;
the application employs fsync(2), syncfs(2), or sync(2);
an undeleted inode is evicted from memory; or
more than 24 hours have passed since the inode was written to disk.
This mount option significantly reduces writes needed to update the inode’s timestamps, especially mtime and atime. However, in the event of a system crash, the atime and mtime fields on disk might be out of date by up to 24 hours.
Examples of workloads where this option could be of significant benefit include frequent random writes to preallocated files, as well as cases where the MS_STRICTATIME mount option is also enabled. (The advantage of combining MS_STRICTATIME and MS_LAZYTIME is that stat(2) will return the correctly updated atime, but the atime updates will be flushed to disk only in the cases listed above.)
MS_MANDLOCK
Permit mandatory locking on files in this filesystem. (Mandatory locking must still be enabled on a per-file basis, as described in fcntl(2).) Since Linux 4.5, this mount option requires the CAP_SYS_ADMIN capability and a kernel configured with the CONFIG_MANDATORY_FILE_LOCKING option. Mandatory locking has been fully deprecated in Linux 5.15, so this flag should be considered deprecated.
MS_NOATIME
Do not update access times for (all types of) files on this filesystem.
MS_NODEV
Do not allow access to devices (special files) on this filesystem.
MS_NODIRATIME
Do not update access times for directories on this filesystem. This flag provides a subset of the functionality provided by MS_NOATIME; that is, MS_NOATIME implies MS_NODIRATIME.
MS_NOEXEC
Do not allow programs to be executed from this filesystem.
MS_NOSUID
Do not honor set-user-ID and set-group-ID bits or file capabilities when executing programs from this filesystem. In addition, SELinux domain transitions require the permission nosuid_transition, which in turn needs also the policy capability nnp_nosuid_transition.
MS_RDONLY
Mount filesystem read-only.
MS_REC (since Linux 2.4.11)
Used in conjunction with MS_BIND to create a recursive bind mount, and in conjunction with the propagation type flags to recursively change the propagation type of all of the mounts in a subtree. See below for further details.
MS_RELATIME (since Linux 2.6.20)
When a file on this filesystem is accessed, update the file’s last access time (atime) only if the current value of atime is less than or equal to the file’s last modification time (mtime) or last status change time (ctime). This option is useful for programs, such as mutt(1), that need to know when a file has been read since it was last modified. Since Linux 2.6.30, the kernel defaults to the behavior provided by this flag (unless MS_NOATIME was specified), and the MS_STRICTATIME flag is required to obtain traditional semantics. In addition, since Linux 2.6.30, the file’s last access time is always updated if it is more than 1 day old.
MS_SILENT (since Linux 2.6.17)
Suppress the display of certain (printk()) warning messages in the kernel log. This flag supersedes the misnamed and obsolete MS_VERBOSE flag (available since Linux 2.4.12), which has the same meaning.
MS_STRICTATIME (since Linux 2.6.30)
Always update the last access time (atime) when files on this filesystem are accessed. (This was the default behavior before Linux 2.6.30.) Specifying this flag overrides the effect of setting the MS_NOATIME and MS_RELATIME flags.
MS_SYNCHRONOUS
Make writes on this filesystem synchronous (as though the O_SYNC flag to open(2) was specified for all file opens to this filesystem).
MS_NOSYMFOLLOW (since Linux 5.10)
Do not follow symbolic links when resolving paths. Symbolic links can still be created, and readlink(1), readlink(2), realpath(1), and realpath(3) all still work properly.
From Linux 2.4 onward, some of the above flags are settable on a per-mount basis, while others apply to the superblock of the mounted filesystem, meaning that all mounts of the same filesystem share those flags. (Previously, all of the flags were per-superblock.)
The per-mount-point flags are as follows:
Since Linux 2.4: MS_NODEV, MS_NOEXEC, and MS_NOSUID flags are settable on a per-mount-point basis.
Additionally, since Linux 2.6.16: MS_NOATIME and MS_NODIRATIME.
Additionally, since Linux 2.6.20: MS_RELATIME.
The following flags are per-superblock: MS_DIRSYNC, MS_LAZYTIME, MS_MANDLOCK, MS_SILENT, and MS_SYNCHRONOUS. The initial settings of these flags are determined on the first mount of the filesystem, and will be shared by all subsequent mounts of the same filesystem. Subsequently, the settings of the flags can be changed via a remount operation (see below). Such changes will be visible via all mounts associated with the filesystem.
Since Linux 2.6.16, MS_RDONLY can be set or cleared on a per-mount-point basis as well as on the underlying filesystem superblock. The mounted filesystem will be writable only if neither the filesystem nor the mountpoint are flagged as read-only.
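The MS_RELATIME rule described earlier in this list (as it applies since Linux 2.6.30) can be written as a predicate. The sketch below restates the wording of this page at one-second granularity; relatime_updates_atime() is a hypothetical name, and the code is not taken from the kernel.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical predicate: would an access at time `now` update the
 * last access time under MS_RELATIME?
 */
bool relatime_updates_atime(time_t last_access, time_t last_modify,
                            time_t last_change, time_t now)
{
    const time_t one_day = 24 * 60 * 60;

    /* atime is not newer than mtime or ctime: update it. */
    if (last_access <= last_modify || last_access <= last_change)
        return true;

    /* Otherwise update only if atime is more than one day old. */
    return now - last_access > one_day;
}
```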
Remounting an existing mount
An existing mount may be remounted by specifying MS_REMOUNT in mountflags. This allows you to change the mountflags and data of an existing mount without having to unmount and remount the filesystem. target should be the same value specified in the initial mount() call.
The source and filesystemtype arguments are ignored.
The mountflags and data arguments should match the values used in the original mount() call, except for those parameters that are being deliberately changed.
The following mountflags can be changed: MS_LAZYTIME, MS_MANDLOCK, MS_NOATIME, MS_NODEV, MS_NODIRATIME, MS_NOEXEC, MS_NOSUID, MS_RELATIME, MS_RDONLY, MS_STRICTATIME (whose effect is to clear the MS_NOATIME and MS_RELATIME flags), and MS_SYNCHRONOUS. Attempts to change the setting of the MS_DIRSYNC and MS_SILENT flags during a remount are silently ignored. Note that changes to per-superblock flags are visible via all mounts of the associated filesystem (because the per-superblock flags are shared by all mounts).
Since Linux 3.17, if none of MS_NOATIME, MS_NODIRATIME, MS_RELATIME, or MS_STRICTATIME is specified in mountflags, then the remount operation preserves the existing values of these flags (rather than defaulting to MS_RELATIME).
Since Linux 2.6.26, the MS_REMOUNT flag can be used with MS_BIND to modify only the per-mount-point flags. This is particularly useful for setting or clearing the “read-only” flag on a mount without changing the underlying filesystem. Specifying mountflags as:
MS_REMOUNT | MS_BIND | MS_RDONLY
will make access through this mountpoint read-only, without affecting other mounts.
Creating a bind mount
If mountflags includes MS_BIND (available since Linux 2.4), then perform a bind mount. A bind mount makes a file or a directory subtree visible at another point within the single directory hierarchy. Bind mounts may cross filesystem boundaries and span chroot(2) jails.
The filesystemtype and data arguments are ignored.
The remaining bits (other than MS_REC, described below) in the mountflags argument are also ignored. (The bind mount has the same mount options as the underlying mount.) However, see the discussion of remounting above, for a method of making an existing bind mount read-only.
By default, when a directory is bind mounted, only that directory is mounted; if there are any submounts under the directory tree, they are not bind mounted. If the MS_REC flag is also specified, then a recursive bind mount operation is performed: all submounts under the source subtree (other than unbindable mounts) are also bind mounted at the corresponding location in the target subtree.
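A read-only bind mount therefore takes two calls: the bind itself, then a remount that changes only the per-mount-point flags. The following is a minimal sketch; bind_mount_readonly() is a hypothetical helper, "none" is a dummy source string for the remount (where source is ignored), and the caller needs CAP_SYS_ADMIN.

```c
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: bind `source` at `target`, then make the new
 * mount read-only. Returns 0 on success, -1 with errno set on failure.
 */
int bind_mount_readonly(const char *source, const char *target,
                        int recursive)
{
    unsigned long bind_flags = MS_BIND | (recursive ? MS_REC : 0);

    /* Step 1: create the bind mount; filesystemtype and data are ignored. */
    if (mount(source, target, NULL, bind_flags, NULL) == -1)
        return -1;

    /* Step 2: remount with MS_BIND so that only this mount point is
       changed. Submounts copied by MS_REC keep their own flags. */
    return mount("none", target, NULL,
                 MS_REMOUNT | MS_BIND | MS_RDONLY, NULL);
}
```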
Changing the propagation type of an existing mount
If mountflags includes one of MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE (all available since Linux 2.6.15), then the propagation type of an existing mount is changed. If more than one of these flags is specified, an error results.
The only other flags that can be specified while changing the propagation type are MS_REC (described below) and MS_SILENT (which is ignored).
The source, filesystemtype, and data arguments are ignored.
The meanings of the propagation type flags are as follows:
MS_SHARED
Make this mount shared. Mount and unmount events immediately under this mount will propagate to the other mounts that are members of this mount’s peer group. Propagation here means that the same mount or unmount will automatically occur under all of the other mounts in the peer group. Conversely, mount and unmount events that take place under peer mounts will propagate to this mount.
MS_PRIVATE
Make this mount private. Mount and unmount events do not propagate into or out of this mount.
MS_SLAVE
If this is a shared mount that is a member of a peer group that contains other members, convert it to a slave mount. If this is a shared mount that is a member of a peer group that contains no other members, convert it to a private mount. Otherwise, the propagation type of the mount is left unchanged.
When a mount is a slave, mount and unmount events propagate into this mount from the (master) shared peer group of which it was formerly a member. Mount and unmount events under this mount do not propagate to any peer.
A mount can be the slave of another peer group while at the same time sharing mount and unmount events with a peer group of which it is a member.
MS_UNBINDABLE
Make this mount unbindable. This is like a private mount, and in addition this mount can’t be bind mounted. When a recursive bind mount (mount() with the MS_BIND and MS_REC flags) is performed on a directory subtree, any unbindable mounts within the subtree are automatically pruned (i.e., not replicated) when replicating that subtree to produce the target subtree.
By default, changing the propagation type affects only the target mount. If the MS_REC flag is also specified in mountflags, then the propagation type of all mounts under target is also changed.
For further details regarding mount propagation types (including the default propagation type assigned to new mounts), see mount_namespaces(7).
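The flag rules for a propagation change (exactly one type flag, optionally combined with MS_REC or MS_SILENT) can be checked before calling mount(). In this sketch, propagation_flags_valid() and set_propagation() are hypothetical helper names, and "none" is a dummy source string, since source is ignored for this operation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mount.h>

#define PROPAGATION_FLAGS (MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE)

/* True if mountflags is a well-formed propagation change request. */
bool propagation_flags_valid(unsigned long mountflags)
{
    unsigned long type = mountflags & PROPAGATION_FLAGS;
    unsigned long others =
        mountflags & ~(unsigned long)(PROPAGATION_FLAGS | MS_REC | MS_SILENT);

    /* Exactly one type bit: nonzero and a power of two. */
    bool exactly_one = type != 0 && (type & (type - 1)) == 0;
    return exactly_one && others == 0;
}

/* Change the propagation type of `target`, optionally for its subtree. */
int set_propagation(const char *target, unsigned long type, bool recursive)
{
    unsigned long flags = type | (recursive ? MS_REC : 0);

    if (!propagation_flags_valid(flags)) {
        errno = EINVAL;
        return -1;
    }
    return mount("none", target, NULL, flags, NULL);
}
```

For example, set_propagation("/", MS_PRIVATE, true) would mark every mount under the root as private, given sufficient privilege.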
Moving a mount
If mountflags contains the flag MS_MOVE (available since Linux 2.4.18), then move a subtree: source specifies an existing mount and target specifies the new location to which that mount is to be relocated. The move is atomic: at no point is the subtree unmounted.
The remaining bits in the mountflags argument are ignored, as are the filesystemtype and data arguments.
Creating a new mount
If none of MS_REMOUNT, MS_BIND, MS_MOVE, MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE is specified in mountflags, then mount() performs its default action: creating a new mount. source specifies the source for the new mount, and target specifies the directory at which to create the mount point.
The filesystemtype and data arguments are employed, and further bits may be specified in mountflags to modify the behavior of the call.
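As an illustration of a new mount that uses all five arguments, the sketch below mounts a small tmpfs. The helper name mount_small_tmpfs() and the option values are assumptions chosen for the example; the source string is a dummy, because tmpfs has no backing device.

```c
#include <sys/mount.h>

/*
 * Hypothetical helper: mount a 16 MiB tmpfs at `target`, which must
 * be an existing directory. Returns 0 on success, -1 with errno set
 * on failure. The caller needs CAP_SYS_ADMIN.
 */
int mount_small_tmpfs(const char *target)
{
    return mount("tmpfs",                  /* source: dummy string */
                 target,                   /* where to attach the mount */
                 "tmpfs",                  /* filesystemtype */
                 MS_NOSUID | MS_NODEV | MS_NOEXEC,
                 "size=16m,mode=0755");    /* data: tmpfs options */
}
```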
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The error values given below result from filesystem type independent errors. Each filesystem type may have its own special errors and its own special behavior. See the Linux kernel source code for details.
EACCES
A component of a path was not searchable. (See also path_resolution(7).)
EACCES
Mounting a read-only filesystem was attempted without giving the MS_RDONLY flag.
The filesystem may be read-only for various reasons, including: it resides on a read-only optical disk; it resides on a device with a physical switch that has been set to mark the device read-only; the filesystem implementation was compiled with read-only support; or errors were detected when initially mounting the filesystem, so that it was marked read-only and can’t be remounted as read-write (until the errors are fixed).
Some filesystems instead return the error EROFS on an attempt to mount a read-only filesystem.
EACCES
The block device source is located on a filesystem mounted with the MS_NODEV option.
EBUSY
An attempt was made to stack a new mount directly on top of an existing mount point that was created in this mount namespace with the same source and target.
EBUSY
source cannot be remounted read-only, because it still holds files open for writing.
EFAULT
One of the pointer arguments points outside the user address space.
EINVAL
source had an invalid superblock.
EINVAL
A remount operation (MS_REMOUNT) was attempted, but source was not already mounted on target.
EINVAL
A move operation (MS_MOVE) was attempted, but the mount tree under source includes unbindable mounts and target is a mount that has propagation type MS_SHARED.
EINVAL
A move operation (MS_MOVE) was attempted, but the parent mount of source mount has propagation type MS_SHARED.
EINVAL
A move operation (MS_MOVE) was attempted, but source was not a mount, or was ‘/’.
EINVAL
A bind operation (MS_BIND) was requested where source referred to a mount namespace magic link (i.e., a /proc/pid/ns/mnt magic link or a bind mount to such a link) and the propagation type of the parent mount of target was MS_SHARED, but propagation of the requested bind mount could lead to a circular dependency that might prevent the mount namespace from ever being freed.
EINVAL
mountflags includes more than one of MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE.
EINVAL
mountflags includes MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE and also includes a flag other than MS_REC or MS_SILENT.
EINVAL
An attempt was made to bind mount an unbindable mount.
EINVAL
In an unprivileged mount namespace (i.e., a mount namespace owned by a user namespace that was created by an unprivileged user), a bind mount operation (MS_BIND) was attempted without specifying MS_REC, which would have revealed the filesystem tree underneath one of the submounts of the directory being bound.
ELOOP
Too many links encountered during pathname resolution.
ELOOP
A move operation was attempted, and target is a descendant of source.
EMFILE
(In case no block device is required:) Table of dummy devices is full.
ENAMETOOLONG
A pathname was longer than MAXPATHLEN.
ENODEV
filesystemtype not configured in the kernel.
ENOENT
A pathname was empty or had a nonexistent component.
ENOMEM
The kernel could not allocate a free page to copy filenames or data into.
ENOTBLK
source is not a block device (and a device was required).
ENOTDIR
target, or a prefix of source, is not a directory.
ENXIO
The major number of the block device source is out of range.
EPERM
The caller does not have the required privileges.
EPERM
An attempt was made to modify (MS_REMOUNT) the MS_RDONLY, MS_NOSUID, or MS_NOEXEC flag, or one of the “atime” flags (MS_NOATIME, MS_NODIRATIME, MS_RELATIME) of an existing mount, but the mount is locked; see mount_namespaces(7).
EROFS
Mounting a read-only filesystem was attempted without giving the MS_RDONLY flag. See EACCES, above.
STANDARDS
Linux.
HISTORY
The definitions of MS_DIRSYNC, MS_MOVE, MS_PRIVATE, MS_REC, MS_RELATIME, MS_SHARED, MS_SLAVE, MS_STRICTATIME, and MS_UNBINDABLE were added to glibc headers in glibc 2.12.
Since Linux 2.4 a single filesystem can be mounted at multiple mount points, and multiple mounts can be stacked on the same mount point.
The mountflags argument may have the magic number 0xC0ED (MS_MGC_VAL) in the top 16 bits. (All of the other flags discussed in DESCRIPTION occupy the low order 16 bits of mountflags.) Specifying MS_MGC_VAL was required before Linux 2.4, but since Linux 2.4 is no longer required and is ignored if specified.
The original MS_SYNC flag was renamed MS_SYNCHRONOUS in Linux 1.1.69 when a different MS_SYNC was added to <mman.h>.
Before Linux 2.4 an attempt to execute a set-user-ID or set-group-ID program on a filesystem mounted with MS_NOSUID would fail with EPERM. Since Linux 2.4 the set-user-ID and set-group-ID bits are just silently ignored in this case.
NOTES
Mount namespaces
Starting with Linux 2.4.19, Linux provides mount namespaces. A mount namespace is the set of filesystem mounts that are visible to a process. Mount namespaces can be (and usually are) shared between multiple processes, and changes to the namespace (i.e., mounts and unmounts) by one process are visible to all other processes sharing the same namespace. (The pre-2.4.19 Linux situation can be considered as one in which a single namespace was shared by every process on the system.)
A child process created by fork(2) shares its parent’s mount namespace; the mount namespace is preserved across an execve(2).
A process can obtain a private mount namespace if:
it was created using the clone(2) CLONE_NEWNS flag, in which case its new namespace is initialized to be a copy of the namespace of the process that called clone(2); or
it calls unshare(2) with the CLONE_NEWNS flag, which causes the caller’s mount namespace to obtain a private copy of the namespace that it was previously sharing with other processes, so that future mounts and unmounts by the caller are invisible to other processes (except child processes that the caller subsequently creates) and vice versa.
For further details on mount namespaces, see mount_namespaces(7).
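The unshare(2) route can be sketched as follows. enter_private_mount_namespace() is a hypothetical helper name. The second step is an assumption about typical setups: where the root mount has shared propagation, mount events would otherwise still propagate to peer mounts in other namespaces, so the sketch marks the whole tree private. The caller needs CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: give the calling process its own mount
 * namespace. Returns 0 on success, -1 with errno set on failure.
 */
int enter_private_mount_namespace(void)
{
    /* Obtain a private copy of the current mount namespace. */
    if (unshare(CLONE_NEWNS) == -1)
        return -1;

    /* Stop mount and unmount events from propagating to or from
       peer mounts that live in other namespaces. */
    return mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);
}
```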
Parental relationship between mounts
Each mount has a parent mount. The overall parental relationship of all mounts defines the single directory hierarchy seen by the processes within a mount namespace.
The parent of a new mount is defined when the mount is created. In the usual case, the parent of a new mount is the mount of the filesystem containing the directory or file at which the new mount is attached. In the case where a new mount is stacked on top of an existing mount, the parent of the new mount is the previous mount that was stacked at that location.
The parental relationship between mounts can be discovered via the /proc/pid/mountinfo file (see below).
/proc/pid/mounts and /proc/pid/mountinfo
The Linux-specific /proc/pid/mounts file exposes the list of mounts in the mount namespace of the process with the specified ID. The /proc/pid/mountinfo file exposes even more information about mounts, including the propagation type and mount ID information that makes it possible to discover the parental relationship between mounts. See proc(5) and mount_namespaces(7) for details of this file.
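The parental relationship can be read from the first fields of each mountinfo line. The sketch below assumes the field order documented in proc(5) (mount ID, parent ID, major:minor, root, mount point) and lines shorter than the read buffer; struct mount_entry, parse_mountinfo_line(), and list_mounts() are hypothetical names.

```c
#include <stdio.h>

/* Hypothetical record holding the fields this sketch extracts. */
struct mount_entry {
    int id;                  /* mount ID */
    int parent_id;           /* ID of the parent mount */
    char mount_point[4096];  /* mount point; spaces appear as \040 */
};

/* Parse one mountinfo line. Returns 0 on success, -1 on a bad line. */
int parse_mountinfo_line(const char *line, struct mount_entry *out)
{
    /* Skip the major:minor and root fields; they are not needed here. */
    int n = sscanf(line, "%d %d %*s %*s %4095s",
                   &out->id, &out->parent_id, out->mount_point);
    return n == 3 ? 0 : -1;
}

/* Print "ID <- parent ID  mount point" for each mount of this process.
   Returns the number of mounts printed, or -1 if the file is missing. */
int list_mounts(FILE *out)
{
    FILE *fp = fopen("/proc/self/mountinfo", "r");
    if (fp == NULL)
        return -1;

    char line[8192];
    struct mount_entry e;
    int count = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (parse_mountinfo_line(line, &e) == 0) {
            fprintf(out, "%d <- parent %d  %s\n",
                    e.id, e.parent_id, e.mount_point);
            count++;
        }
    }
    fclose(fp);
    return count;
}
```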
SEE ALSO
mountpoint(1), chroot(2), ioctl_iflags(2), mount_setattr(2), pivot_root(2), umount(2), mount_namespaces(7), path_resolution(7), findmnt(8), lsblk(8), mount(8), umount(8)
221 - Linux cli command get_mempolicy
NAME 🖥️ get_mempolicy 🖥️
retrieve NUMA memory policy for a thread
LIBRARY
NUMA (Non-Uniform Memory Access) policy library (libnuma, -lnuma)
SYNOPSIS
#include <numaif.h>
long get_mempolicy(int *mode,
unsigned long nodemask[(.maxnode + ULONG_WIDTH - 1)
/ ULONG_WIDTH],
unsigned long maxnode, void *addr,
unsigned long flags);
DESCRIPTION
get_mempolicy() retrieves the NUMA policy of the calling thread or of a memory address, depending on the setting of flags.
A NUMA machine has different memory controllers with different distances to specific CPUs. The memory policy defines from which node memory is allocated for the thread.
If flags is specified as 0, then information about the calling thread’s default policy (as set by set_mempolicy(2)) is returned, in the buffers pointed to by mode and nodemask. The value returned in these arguments may be used to restore the thread’s policy to its state at the time of the call to get_mempolicy() using set_mempolicy(2). When flags is 0, addr must be specified as NULL.
If flags specifies MPOL_F_MEMS_ALLOWED (available since Linux 2.6.24), the mode argument is ignored and the set of nodes (memories) that the thread is allowed to specify in subsequent calls to mbind(2) or set_mempolicy(2) (in the absence of any mode flags) is returned in nodemask. It is not permitted to combine MPOL_F_MEMS_ALLOWED with either MPOL_F_ADDR or MPOL_F_NODE.
If flags specifies MPOL_F_ADDR, then information is returned about the policy governing the memory address given in addr. This policy may be different from the thread’s default policy if mbind(2) or one of the helper functions described in numa(3) has been used to establish a policy for the memory range containing addr.
If the mode argument is not NULL, then get_mempolicy() will store the policy mode and any optional mode flags of the requested NUMA policy in the location pointed to by this argument. If nodemask is not NULL, then the nodemask associated with the policy will be stored in the location pointed to by this argument. maxnode specifies the number of node IDs that can be stored into nodemask, that is, the maximum node ID plus one. The value specified by maxnode is always rounded to a multiple of sizeof(unsigned long)*8.
If flags specifies both MPOL_F_NODE and MPOL_F_ADDR, get_mempolicy() will return the node ID of the node on which the address addr is allocated into the location pointed to by mode. If no page has yet been allocated for the specified address, get_mempolicy() will allocate a page as if the thread had performed a read (load) access to that address, and return the ID of the node where that page was allocated.
If flags specifies MPOL_F_NODE, but not MPOL_F_ADDR, and the thread’s current policy is MPOL_INTERLEAVE or MPOL_WEIGHTED_INTERLEAVE, then get_mempolicy() will return in the location pointed to by a non-NULL mode argument, the node ID of the next node that will be used for interleaving of internal kernel pages allocated on behalf of the thread. These allocations include pages for memory-mapped files in process memory ranges mapped using the mmap(2) call with the MAP_PRIVATE flag for read accesses, and in memory ranges mapped with the MAP_SHARED flag for all accesses.
Other flag values are reserved.
For an overview of the possible policies see set_mempolicy(2).
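The sizing of nodemask and a query of the calling thread's default policy (flags equal to 0, addr equal to NULL) can be sketched as below. To avoid depending on libnuma, the sketch invokes the system call through syscall(2); that choice, the assumption that SYS_get_mempolicy is defined for the target architecture, and the helper names nodemask_words() and query_thread_policy() are not part of this page.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Number of unsigned longs needed to hold `maxnode` node bits. */
size_t nodemask_words(unsigned long maxnode)
{
    const unsigned long bits = sizeof(unsigned long) * CHAR_BIT;
    return (maxnode + bits - 1) / bits;
}

/*
 * Hypothetical wrapper: fetch the calling thread's default policy.
 * `nodemask` may be NULL (with maxnode 0) if only the mode is wanted.
 * Returns 0 on success, -1 with errno set on failure.
 */
long query_thread_policy(int *mode, unsigned long *nodemask,
                         unsigned long maxnode)
{
    /* flags == 0 requires addr == NULL. */
    return syscall(SYS_get_mempolicy, mode, nodemask, maxnode, NULL, 0UL);
}
```

A caller that wants the nodemask would allocate nodemask_words(maxnode) unsigned longs, with maxnode at least as large as the number of node IDs supported by the system.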
RETURN VALUE
On success, get_mempolicy() returns 0; on error, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
Part or all of the memory range specified by nodemask and maxnode points outside your accessible address space.
EINVAL
The value specified by maxnode is less than the number of node IDs supported by the system; or flags specified values other than MPOL_F_NODE or MPOL_F_ADDR; or flags specified MPOL_F_ADDR and addr is NULL; or flags did not specify MPOL_F_ADDR and addr is not NULL; or flags specified MPOL_F_NODE but not MPOL_F_ADDR and the current thread policy is neither MPOL_INTERLEAVE nor MPOL_WEIGHTED_INTERLEAVE; or flags specified MPOL_F_MEMS_ALLOWED with either MPOL_F_ADDR or MPOL_F_NODE. (There are other EINVAL cases as well.)
STANDARDS
Linux.
HISTORY
Linux 2.6.7.
NOTES
For information on library support, see numa(7).
SEE ALSO
getcpu(2), mbind(2), mmap(2), set_mempolicy(2), numa(3), numa(7), numactl(8)
222 - Linux cli command renameat
NAME 🖥️ renameat 🖥️
change the name or location of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <stdio.h>
int rename(const char *oldpath, const char *newpath);
#include <fcntl.h> /* Definition of AT_* constants */
#include <stdio.h>
int renameat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath);
int renameat2(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath, unsigned int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
renameat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
renameat2():
_GNU_SOURCE
DESCRIPTION
rename() renames a file, moving it between directories if required. Any other hard links to the file (as created using link(2)) are unaffected. Open file descriptors for oldpath are also unaffected.
Various restrictions determine whether or not the rename operation succeeds: see ERRORS below.
If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However, there will probably be a window in which both oldpath and newpath refer to the file being renamed.
If oldpath and newpath are existing hard links referring to the same file, then rename() does nothing, and returns a success status.
If newpath exists but the operation fails for some reason, rename() guarantees to leave an instance of newpath in place.
oldpath can specify a directory. In this case, newpath must either not exist, or it must specify an empty directory.
If oldpath refers to a symbolic link, the link is renamed; if newpath refers to a symbolic link, the link will be overwritten.
renameat()
The renameat() system call operates in exactly the same way as rename(), except for the differences described here.
If the pathname given in oldpath is relative, then it is interpreted relative to the directory referred to by the file descriptor olddirfd (rather than relative to the current working directory of the calling process, as is done by rename() for a relative pathname).
If oldpath is relative and olddirfd is the special value AT_FDCWD, then oldpath is interpreted relative to the current working directory of the calling process (like rename()).
If oldpath is absolute, then olddirfd is ignored.
The interpretation of newpath is as for oldpath, except that a relative pathname is interpreted relative to the directory referred to by the file descriptor newdirfd.
See openat(2) for an explanation of the need for renameat().
renameat2()
renameat2() has an additional flags argument. A renameat2() call with a zero flags argument is equivalent to renameat().
The flags argument is a bit mask consisting of zero or more of the following flags:
RENAME_EXCHANGE
Atomically exchange oldpath and newpath. Both pathnames must exist but may be of different types (e.g., one could be a non-empty directory and the other a symbolic link).
RENAME_NOREPLACE
Don’t overwrite newpath of the rename. Return an error if newpath already exists.
RENAME_NOREPLACE can’t be employed together with RENAME_EXCHANGE.
RENAME_NOREPLACE requires support from the underlying filesystem. Support for various filesystems was added as follows:
ext4 (Linux 3.15);
btrfs, tmpfs, and cifs (Linux 3.17);
xfs (Linux 4.0);
Support for many other filesystems was added in Linux 4.9, including ext2, minix, reiserfs, jfs, vfat, and bpf.
RENAME_WHITEOUT (since Linux 3.18)
This operation makes sense only for overlay/union filesystem implementations.
Specifying RENAME_WHITEOUT creates a “whiteout” object at the source of the rename at the same time as performing the rename. The whole operation is atomic, so that if the rename succeeds then the whiteout will also have been created.
A “whiteout” is an object that has special meaning in union/overlay filesystem constructs. In these constructs, multiple layers exist and only the top one is ever modified. A whiteout on an upper layer will effectively hide a matching file in the lower layer, making it appear as if the file didn’t exist.
When a file that exists on the lower layer is renamed, the file is first copied up (if not already on the upper layer) and then renamed on the upper, read-write layer. At the same time, the source file needs to be “whiteouted” (so that the version of the source file in the lower layer is rendered invisible). The whole operation needs to be done atomically.
When not part of a union/overlay, the whiteout appears as a character device with a {0,0} device number. (Note that other union/overlay implementations may employ different methods for storing whiteout entries; specifically, BSD union mount employs a separate inode type, DT_WHT, which, while supported by some filesystems available in Linux, such as CODA and XFS, is ignored by the kernel’s whiteout support code, as of Linux 4.19, at least.)
RENAME_WHITEOUT requires the same privileges as creating a device node (i.e., the CAP_MKNOD capability).
RENAME_WHITEOUT can’t be employed together with RENAME_EXCHANGE.
RENAME_WHITEOUT requires support from the underlying filesystem. Among the filesystems that support it are tmpfs (since Linux 3.18), ext4 (since Linux 3.18), XFS (since Linux 4.1), f2fs (since Linux 4.2), btrfs (since Linux 4.7), and ubifs (since Linux 4.9).
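As a rough sketch of the first two flags, the helpers below wrap renameat2() with AT_FDCWD for both directory descriptors. The helper names are hypothetical, and the code assumes glibc 2.28 or later, which provides the renameat2() wrapper.

```c
#define _GNU_SOURCE             /* for renameat2() and RENAME_* */
#include <errno.h>
#include <fcntl.h>              /* AT_FDCWD */
#include <stdio.h>

/*
 * Hypothetical helper: move oldpath to newpath only if newpath does not
 * already exist. Returns 0 on success, or -1 with errno set; EEXIST
 * means newpath exists, EINVAL typically means the filesystem does not
 * support RENAME_NOREPLACE.
 */
int rename_noreplace(const char *oldpath, const char *newpath)
{
    return renameat2(AT_FDCWD, oldpath, AT_FDCWD, newpath,
                     RENAME_NOREPLACE);
}

/*
 * Hypothetical helper: atomically swap two existing pathnames.
 * Returns 0 on success, or -1 with errno set; ENOENT means one of the
 * pathnames does not exist.
 */
int rename_exchange(const char *path_a, const char *path_b)
{
    return renameat2(AT_FDCWD, path_a, AT_FDCWD, path_b,
                     RENAME_EXCHANGE);
}
```

A caller that wants "create but never clobber" semantics would treat EEXIST from rename_noreplace() as an expected outcome rather than a failure.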
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Write permission is denied for the directory containing oldpath or newpath, or, search permission is denied for one of the directories in the path prefix of oldpath or newpath, or oldpath is a directory and does not allow write permission (needed to update the .. entry). (See also path_resolution(7).)
EBUSY
The rename fails because oldpath or newpath is a directory that is in use by some process (perhaps as current working directory, or as root directory, or because it was open for reading) or is in use by the system (for example as a mount point), while the system considers this an error. (Note that there is no requirement to return EBUSY in such cases; there is nothing wrong with doing the rename anyway. However, the system is allowed to return EBUSY if it cannot otherwise handle such situations.)
EDQUOT
The user’s quota of disk blocks on the filesystem has been exhausted.
EFAULT
oldpath or newpath points outside your accessible address space.
EINVAL
The new pathname contained a path prefix of the old, or, more generally, an attempt was made to make a directory a subdirectory of itself.
EISDIR
newpath is an existing directory, but oldpath is not a directory.
ELOOP
Too many symbolic links were encountered in resolving oldpath or newpath.
EMLINK
oldpath already has the maximum number of links to it, or it was a directory and the directory containing newpath has the maximum number of links.
ENAMETOOLONG
oldpath or newpath was too long.
ENOENT
The link named by oldpath does not exist; or, a directory component in newpath does not exist; or, oldpath or newpath is an empty string.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The device containing the file has no room for the new directory entry.
ENOTDIR
A component used as a directory in oldpath or newpath is not, in fact, a directory. Or, oldpath is a directory, and newpath exists but is not a directory.
ENOTEMPTY or EEXIST
newpath is a nonempty directory, that is, contains entries other than “.” and “..”.
EPERM or EACCES
The directory containing oldpath has the sticky bit (S_ISVTX) set and the process’s effective user ID is neither the user ID of the file to be deleted nor that of the directory containing it, and the process is not privileged (Linux: does not have the CAP_FOWNER capability); or newpath is an existing file and the directory containing it has the sticky bit set and the process’s effective user ID is neither the user ID of the file to be replaced nor that of the directory containing it, and the process is not privileged (Linux: does not have the CAP_FOWNER capability); or the filesystem containing oldpath does not support renaming of the type requested.
EROFS
The file is on a read-only filesystem.
EXDEV
oldpath and newpath are not on the same mounted filesystem. (Linux permits a filesystem to be mounted at multiple points, but rename() does not work across different mount points, even if the same filesystem is mounted on both.)
The following additional errors can occur for renameat() and renameat2():
EBADF
oldpath (newpath) is relative but olddirfd (newdirfd) is not a valid file descriptor.
ENOTDIR
oldpath is relative and olddirfd is a file descriptor referring to a file other than a directory; or similar for newpath and newdirfd.
The following additional errors can occur for renameat2():
EEXIST
flags contains RENAME_NOREPLACE and newpath already exists.
EINVAL
An invalid flag was specified in flags.
EINVAL
Both RENAME_NOREPLACE and RENAME_EXCHANGE were specified in flags.
EINVAL
Both RENAME_WHITEOUT and RENAME_EXCHANGE were specified in flags.
EINVAL
The filesystem does not support one of the flags in flags.
ENOENT
flags contains RENAME_EXCHANGE and newpath does not exist.
EPERM
RENAME_WHITEOUT was specified in flags, but the caller does not have the CAP_MKNOD capability.
STANDARDS
rename()
C11, POSIX.1-2008.
renameat()
POSIX.1-2008.
renameat2()
Linux.
HISTORY
rename()
4.3BSD, C89, POSIX.1-2001.
renameat()
Linux 2.6.16, glibc 2.4.
renameat2()
Linux 3.15, glibc 2.28.
glibc notes
On older kernels where renameat() is unavailable, the glibc wrapper function falls back to the use of rename(). When oldpath and newpath are relative pathnames, glibc constructs pathnames based on the symbolic links in /proc/self/fd that correspond to the olddirfd and newdirfd arguments.
BUGS
On NFS filesystems, you cannot assume that if the operation failed, the file was not renamed. If the server performs the rename and then crashes, the retransmitted RPC, which is processed once the server is up again, fails. The application is expected to deal with this. See link(2) for a similar problem.
SEE ALSO
mv(1), rename(1), chmod(2), link(2), symlink(2), unlink(2), path_resolution(7), symlink(7)
223 - Linux cli command remap_file_pages
NAME 🖥️ remap_file_pages 🖥️
create a nonlinear file mapping
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/mman.h>
[[deprecated]] int remap_file_pages(void addr[.size], size_t size,
int prot, size_t pgoff, int flags);
DESCRIPTION
Note: this system call was marked as deprecated starting with Linux 3.16. In Linux 4.0, the implementation was replaced by a slower in-kernel emulation. Those few applications that use this system call should consider migrating to alternatives. This change was made because the kernel code for this system call was complex, and it is believed to be little used or perhaps even completely unused. While it had some use cases in database applications on 32-bit systems, those use cases don’t exist on 64-bit systems.
The remap_file_pages() system call is used to create a nonlinear mapping, that is, a mapping in which the pages of the file are mapped into a nonsequential order in memory. The advantage of using remap_file_pages() over using repeated calls to mmap(2) is that the former approach does not require the kernel to create additional VMA (Virtual Memory Area) data structures.
To create a nonlinear mapping we perform the following steps:
1.
Use mmap(2) to create a mapping (which is initially linear). This mapping must be created with the MAP_SHARED flag.
2.
Use one or more calls to remap_file_pages() to rearrange the correspondence between the pages of the mapping and the pages of the file. It is possible to map the same page of a file into multiple locations within the mapped region.
The pgoff and size arguments specify the region of the file that is to be relocated within the mapping: pgoff is a file offset in units of the system page size; size is the length of the region in bytes.
The addr argument serves two purposes. First, it identifies the mapping whose pages we want to rearrange. Thus, addr must be an address that falls within a region previously mapped by a call to mmap(2). Second, addr specifies the address at which the file pages identified by pgoff and size will be placed.
The values specified in addr and size should be multiples of the system page size. If they are not, then the kernel rounds both values down to the nearest multiple of the page size.
The prot argument must be specified as 0.
The flags argument has the same meaning as for mmap(2), but all flags other than MAP_NONBLOCK are ignored.
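The two steps above can be sketched as follows. This is an illustrative sketch only: the helper name is hypothetical, and it assumes that the file named by path is at least two pages long. It maps the first two pages with MAP_SHARED and then swaps them within the mapping.

```c
#define _GNU_SOURCE             /* for remap_file_pages() */
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the first two pages of the file at path and
 * swap them, so that file page 1 appears at the start of the mapping
 * and file page 0 appears one page later.
 * Returns the mapping address, or MAP_FAILED with errno set.
 */
void *map_first_two_pages_swapped(const char *path)
{
    size_t ps = (size_t) sysconf(_SC_PAGESIZE);
    int saved_errno;

    int fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd == -1)
        return MAP_FAILED;

    /* Step 1: create an ordinary (linear) MAP_SHARED mapping. */
    char *base = mmap(NULL, 2 * ps, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    saved_errno = errno;
    close(fd);                  /* the mapping keeps the file open */
    errno = saved_errno;
    if (base == MAP_FAILED)
        return MAP_FAILED;

    /* Step 2: rearrange. prot must be 0; pgoff is in pages, size in bytes. */
    if (remap_file_pages(base, ps, 0, 1, 0) == -1 ||
        remap_file_pages(base + ps, ps, 0, 0, 0) == -1) {
        saved_errno = errno;
        munmap(base, 2 * ps);
        errno = saved_errno;
        return MAP_FAILED;
    }
    return base;
}
```

Since the system call is deprecated, new code would normally obtain the same layout with two mmap(2) calls using MAP_FIXED.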
RETURN VALUE
On success, remap_file_pages() returns 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
addr does not refer to a valid mapping created with the MAP_SHARED flag.
EINVAL
addr, size, prot, or pgoff is invalid.
STANDARDS
Linux.
HISTORY
Linux 2.5.46, glibc 2.3.3.
NOTES
Since Linux 2.6.23, remap_file_pages() creates non-linear mappings only on in-memory filesystems such as tmpfs(5), hugetlbfs or ramfs. On filesystems with a backing store, remap_file_pages() is not much more efficient than using mmap(2) to adjust which parts of the file are mapped to which addresses.
SEE ALSO
getpagesize(2), mmap(2), mmap2(2), mprotect(2), mremap(2), msync(2)
224 - Linux cli command landlock_add_rule
NAME 🖥️ landlock_add_rule 🖥️
add a new Landlock rule to a ruleset
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/landlock.h> /* Definition of LANDLOCK_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
int syscall(SYS_landlock_add_rule, int ruleset_fd,
enum landlock_rule_type rule_type,
const void *rule_attr, uint32_t flags);
DESCRIPTION
A Landlock rule describes an action on an object. An object is currently a file hierarchy, and the related filesystem actions are defined with a set of access rights. This landlock_add_rule() system call enables adding a new Landlock rule to an existing ruleset created with landlock_create_ruleset(2). See landlock(7) for a global overview.
ruleset_fd is a Landlock ruleset file descriptor obtained with landlock_create_ruleset(2).
rule_type identifies the structure type pointed to by rule_attr. Currently, Linux supports the following rule_type value:
LANDLOCK_RULE_PATH_BENEATH
This defines the object type as a file hierarchy. In this case, rule_attr points to the following structure:
struct landlock_path_beneath_attr {
__u64 allowed_access;
__s32 parent_fd;
} __attribute__((packed));
allowed_access contains a bitmask of allowed filesystem actions for this file hierarchy (see Filesystem actions in landlock(7)).
parent_fd is an opened file descriptor, preferably with the O_PATH flag, which identifies the parent directory of the file hierarchy or just a file.
flags must be 0.
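A minimal sketch of adding a LANDLOCK_RULE_PATH_BENEATH rule is shown here. The helper name is hypothetical, and the code assumes kernel headers that provide <linux/landlock.h> and a C library whose <sys/syscall.h> defines the SYS_landlock_* constants. It builds a ruleset that handles two read access rights and allows both beneath one directory; it does not enforce the ruleset (see landlock_restrict_self(2)).

```c
#define _GNU_SOURCE             /* for O_PATH */
#include <errno.h>
#include <fcntl.h>
#include <linux/landlock.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a ruleset handling read access and add a
 * rule allowing that access beneath the directory dirpath.
 * Returns the ruleset file descriptor, or -1 with errno set.
 */
int ruleset_allow_read_beneath(const char *dirpath)
{
    const __u64 access = LANDLOCK_ACCESS_FS_READ_FILE |
                         LANDLOCK_ACCESS_FS_READ_DIR;
    struct landlock_ruleset_attr ruleset_attr = {
        .handled_access_fs = access,
    };
    int saved_errno;

    int ruleset_fd = (int) syscall(SYS_landlock_create_ruleset,
                                   &ruleset_attr, sizeof(ruleset_attr), 0);
    if (ruleset_fd == -1)
        return -1;

    int parent_fd = open(dirpath, O_PATH | O_DIRECTORY | O_CLOEXEC);
    if (parent_fd == -1) {
        saved_errno = errno;
        close(ruleset_fd);
        errno = saved_errno;
        return -1;
    }

    /* allowed_access must be a subset of the ruleset's handled accesses. */
    struct landlock_path_beneath_attr rule = {
        .allowed_access = access,
        .parent_fd = parent_fd,
    };
    long rc = syscall(SYS_landlock_add_rule, ruleset_fd,
                      LANDLOCK_RULE_PATH_BENEATH, &rule, 0);
    saved_errno = errno;
    close(parent_fd);           /* the ruleset keeps its own reference */
    if (rc == -1) {
        close(ruleset_fd);
        errno = saved_errno;
        return -1;
    }
    return ruleset_fd;
}
```

On kernels without Landlock the first system call fails with ENOSYS, and with EOPNOTSUPP when Landlock is disabled at boot time, so callers that treat sandboxing as best-effort usually check for those two values.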
RETURN VALUE
On success, landlock_add_rule() returns 0. On error, -1 is returned and errno is set to indicate the error.
ERRORS
landlock_add_rule() can fail for the following reasons:
EOPNOTSUPP
Landlock is supported by the kernel but disabled at boot time.
EINVAL
flags is not 0, or the rule accesses are inconsistent (i.e., rule_attr->allowed_access is not a subset of the ruleset handled accesses).
ENOMSG
Empty accesses (i.e., rule_attr->allowed_access is 0).
EBADF
ruleset_fd is not a file descriptor for the current thread, or a member of rule_attr is not a file descriptor as expected.
EBADFD
ruleset_fd is not a ruleset file descriptor, or a member of rule_attr is not the expected file descriptor type.
EPERM
ruleset_fd has no write access to the underlying ruleset.
EFAULT
rule_attr was not a valid address.
STANDARDS
Linux.
HISTORY
Linux 5.13.
EXAMPLES
See landlock(7).
SEE ALSO
landlock_create_ruleset(2), landlock_restrict_self(2), landlock(7)
225 - Linux cli command outw
NAME 🖥️ outw 🖥️
port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width, the w-suffix functions word-width (16 bits), and the l-suffix functions long-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
You use ioperm(2) or alternatively iopl(2) to tell the kernel to allow the user space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
226 - Linux cli command setgid32
NAME 🖥️ setgid32 🖥️
set group identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setgid(gid_t gid);
DESCRIPTION
setgid() sets the effective group ID of the calling process. If the calling process is privileged (more precisely: has the CAP_SETGID capability in its user namespace), the real GID and saved set-group-ID are also set.
Under Linux, setgid() is implemented like the POSIX version with the _POSIX_SAVED_IDS feature. This allows a set-group-ID program that is not set-user-ID-root to drop all of its group privileges, do some un-privileged work, and then reengage the original effective group ID in a secure manner.
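The drop-and-reengage pattern described above might look like the following sketch. Both function names are hypothetical. The sketch assumes an unprivileged set-group-ID program: in a process with CAP_SETGID, setgid() also changes the real and saved IDs, and setegid(2) is the appropriate call for a temporary switch.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sample of "unprivileged work": report whether the
 * effective GID currently equals the real GID (1 if so, 0 if not).
 */
int egid_is_real_gid(void)
{
    return getegid() == getgid();
}

/*
 * Hypothetical helper: run work() with the effective GID set to the
 * real GID, then restore the original effective GID, which remains
 * available as the saved set-group-ID.
 * Returns the (non-negative) result of work(), or -1 if either
 * setgid() call fails.
 */
int run_with_real_gid(int (*work)(void))
{
    gid_t saved_egid = getegid();

    if (setgid(getgid()) == -1)     /* drop group privilege */
        return -1;

    int result = work();

    if (setgid(saved_egid) == -1)   /* reengage original effective GID */
        return -1;

    return result;
}
```

For example, run_with_real_gid(egid_is_real_gid) returns 1 when the drop took effect while the work ran.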
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
The group ID specified in gid is not valid in this user namespace.
EPERM
The calling process is not privileged (does not have the CAP_SETGID capability in its user namespace), and gid does not match the real group ID or saved set-group-ID of the calling process.
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including the one for setgid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
The original Linux setgid() system call supported only 16-bit group IDs. Subsequently, Linux 2.4 added setgid32() supporting 32-bit IDs. The glibc setgid() wrapper function transparently deals with the variation across kernel versions.
SEE ALSO
getgid(2), setegid(2), setregid(2), capabilities(7), credentials(7), user_namespaces(7)
227 - Linux cli command getdents
NAME 🖥️ getdents 🖥️
get directory entries
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_getdents, unsigned int fd, struct linux_dirent *dirp,
unsigned int count);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <dirent.h>
ssize_t getdents64(int fd, void dirp[.count], size_t count);
Note: glibc provides no wrapper for getdents(), necessitating the use of syscall(2).
Note: There is no definition of struct linux_dirent in glibc; see NOTES.
DESCRIPTION
These are not the interfaces you are interested in. Look at readdir(3) for the POSIX-conforming C library interface. This page documents the bare kernel system call interfaces.
getdents()
The system call getdents() reads several linux_dirent structures from the directory referred to by the open file descriptor fd into the buffer pointed to by dirp. The argument count specifies the size of that buffer.
The linux_dirent structure is declared as follows:
struct linux_dirent {
    unsigned long  d_ino;     /* Inode number */
    unsigned long  d_off;     /* Not an offset; see below */
    unsigned short d_reclen;  /* Length of this linux_dirent */
    char           d_name[];  /* Filename (null-terminated) */
                      /* length is actually (d_reclen - 2 -
                         offsetof(struct linux_dirent, d_name)) */
    /*
    char           pad;       // Zero padding byte
    char           d_type;    // File type (only since Linux
                              // 2.6.4); offset is (d_reclen - 1)
    */
}
d_ino is an inode number. d_off is a filesystem-specific value with no specific meaning to user space, though on older filesystems it used to be the distance from the start of the directory to the start of the next linux_dirent; see readdir(3). d_reclen is the size of this entire linux_dirent. d_name is a null-terminated filename.
d_type is a byte at the end of the structure that indicates the file type. It contains one of the following values (defined in <dirent.h>):
DT_BLK
This is a block device.
DT_CHR
This is a character device.
DT_DIR
This is a directory.
DT_FIFO
This is a named pipe (FIFO).
DT_LNK
This is a symbolic link.
DT_REG
This is a regular file.
DT_SOCK
This is a UNIX domain socket.
DT_UNKNOWN
The file type is unknown.
The d_type field is implemented since Linux 2.6.4. It occupies a space that was previously a zero-filled padding byte in the linux_dirent structure. Thus, on kernels up to and including Linux 2.6.3, attempting to access this field always provides the value 0 (DT_UNKNOWN).
Currently, only some filesystems (among them: Btrfs, ext2, ext3, and ext4) have full support for returning the file type in d_type. All applications must properly handle a return of DT_UNKNOWN.
getdents64()
The original Linux getdents() system call did not handle large filesystems and large file offsets. Consequently, Linux 2.4 added getdents64(), with wider types for the d_ino and d_off fields. In addition, getdents64() supports an explicit d_type field.
The getdents64() system call is like getdents(), except that its second argument is a pointer to a buffer containing structures of the following type:
struct linux_dirent64 {
    ino64_t        d_ino;    /* 64-bit inode number */
    off64_t        d_off;    /* Not an offset; see getdents() */
    unsigned short d_reclen; /* Size of this dirent */
    unsigned char  d_type;   /* File type */
    char           d_name[]; /* Filename (null-terminated) */
};
RETURN VALUE
On success, the number of bytes read is returned. On end of directory, 0 is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
Invalid file descriptor fd.
EFAULT
Argument points outside the calling process’s address space.
EINVAL
Result buffer is too small.
ENOENT
No such directory.
ENOTDIR
File descriptor does not refer to a directory.
STANDARDS
None.
HISTORY
SVr4.
getdents64()
glibc 2.30.
NOTES
glibc does not provide a wrapper for getdents(); call getdents() using syscall(2). In that case you will need to define the linux_dirent or linux_dirent64 structure yourself.
Probably, you want to use readdir(3) instead of these system calls.
These calls supersede readdir(2).
EXAMPLES
The program below demonstrates the use of getdents(). The following output shows an example of what we see when running this program on an ext2 directory:
$ ./a.out /testfs/
--------------- nread=120 ---------------
inode#    file type  d_reclen  d_off   d_name
       2  directory    16         12  .
       2  directory    16         24  ..
      11  directory    24         44  lost+found
      12  regular      16         56  a
  228929  directory    16         68  sub
   16353  directory    16         80  sub2
  130817  directory    16       4096  sub3
Program source
#define _GNU_SOURCE
#include <dirent.h>     /* Defines DT_* constants */
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct linux_dirent {
    unsigned long  d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    char           d_name[];
};

#define BUF_SIZE 1024

int
main(int argc, char *argv[])
{
    int                  fd;
    char                 d_type;
    char                 buf[BUF_SIZE];
    long                 nread;
    struct linux_dirent  *d;

    fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        err(EXIT_FAILURE, "open");

    for (;;) {
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1)
            err(EXIT_FAILURE, "getdents");

        if (nread == 0)
            break;

        printf("--------------- nread=%ld ---------------\n", nread);
        printf("inode#    file type  d_reclen  d_off   d_name\n");
        for (size_t bpos = 0; bpos < nread;) {
            d = (struct linux_dirent *) (buf + bpos);
            printf("%8lu  ", d->d_ino);
            /* d_type is stored in the last byte of each record */
            d_type = *(buf + bpos + d->d_reclen - 1);
            printf("%-10s ", (d_type == DT_REG) ?  "regular" :
                             (d_type == DT_DIR) ?  "directory" :
                             (d_type == DT_FIFO) ? "FIFO" :
                             (d_type == DT_SOCK) ? "socket" :
                             (d_type == DT_LNK) ?  "symlink" :
                             (d_type == DT_BLK) ?  "block dev" :
                             (d_type == DT_CHR) ?  "char dev" : "???");
            printf("%4d %10jd  %s\n", d->d_reclen,
                   (intmax_t) d->d_off, d->d_name);
            bpos += d->d_reclen;
        }
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
readdir(2), readdir(3), inode(7)
228 - Linux cli command setup
NAME 🖥️ setup 🖥️
setup devices and filesystems, mount root filesystem
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
[[deprecated]] int setup(void);
DESCRIPTION
setup() is called once from within linux/init/main.c. It calls initialization functions for devices and filesystems configured into the kernel and then mounts the root filesystem.
No user process may call setup(). Any user process, even a process with superuser permission, will receive EPERM.
RETURN VALUE
setup() always returns -1 for a user process.
ERRORS
EPERM
Always, for a user process.
STANDARDS
Linux.
VERSIONS
Removed in Linux 2.1.121.
The calling sequence varied: at some times setup() has had a single argument void *BIOS and at other times a single argument int magic.
229 - Linux cli command seteuid
NAME 🖥️ seteuid 🖥️
set effective user or group ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int seteuid(uid_t euid);
int setegid(gid_t egid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
seteuid(), setegid():
_POSIX_C_SOURCE >= 200112L
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
seteuid() sets the effective user ID of the calling process. Unprivileged processes may only set the effective user ID to the real user ID, the effective user ID or the saved set-user-ID.
Precisely the same holds for setegid() with “group” instead of “user”.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where seteuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from seteuid().
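A minimal sketch of the checking that this note calls for is given below. The helper names are hypothetical. Each helper tests the return value of seteuid() and also confirms the resulting effective user ID.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: temporarily switch the effective UID to the
 * real UID, remembering the previous effective UID in *saved_euid.
 * Returns 0 on success, or -1 if the switch did not take effect.
 */
int drop_euid(uid_t *saved_euid)
{
    *saved_euid = geteuid();

    if (seteuid(getuid()) == -1)
        return -1;              /* never continue with privileges intact */

    return geteuid() == getuid() ? 0 : -1;
}

/*
 * Hypothetical helper: switch the effective UID back to the value
 * recorded by drop_euid(). Returns 0 on success, or -1 on failure.
 */
int restore_euid(uid_t saved_euid)
{
    if (seteuid(saved_euid) == -1)
        return -1;

    return geteuid() == saved_euid ? 0 : -1;
}
```

A set-user-ID program would typically call drop_euid() early, and terminate rather than proceed if either helper returns -1.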
ERRORS
EINVAL
The target user or group ID is not valid in this user namespace.
EPERM
In the case of seteuid(): the calling process is not privileged (does not have the CAP_SETUID capability in its user namespace) and euid does not match the current real user ID, current effective user ID, or current saved set-user-ID.
In the case of setegid(): the calling process is not privileged (does not have the CAP_SETGID capability in its user namespace) and egid does not match the current real group ID, current effective group ID, or current saved set-group-ID.
VERSIONS
Setting the effective user (group) ID to the saved set-user-ID (saved set-group-ID) is possible since Linux 1.1.37 (1.1.38). On an arbitrary system one should check _POSIX_SAVED_IDS.
Under glibc 2.0, seteuid(euid) is equivalent to setreuid(-1, euid) and hence may change the saved set-user-ID. Under glibc 2.1 and later, it is equivalent to setresuid(-1, euid, -1) and hence does not change the saved set-user-ID. Analogous remarks hold for setegid(), with the difference that the change in implementation from setregid(-1, egid) to setresgid(-1, egid, -1) occurred in glibc 2.2 or 2.3 (depending on the hardware architecture).
According to POSIX.1, seteuid() (setegid()) need not permit euid (egid) to be the same value as the current effective user (group) ID, and some implementations do not permit this.
C library/kernel differences
On Linux, seteuid() and setegid() are implemented as library functions that call, respectively, setresuid(2) and setresgid(2).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD.
SEE ALSO
geteuid(2), setresuid(2), setreuid(2), setuid(2), capabilities(7), credentials(7), user_namespaces(7)
230 - Linux cli command read
NAME 🖥️ read 🖥️
read from a file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
ssize_t read(int fd, void buf[.count], size_t count);
DESCRIPTION
read() attempts to read up to count bytes from file descriptor fd into the buffer starting at buf.
On files that support seeking, the read operation commences at the file offset, and the file offset is incremented by the number of bytes read. If the file offset is at or past the end of file, no bytes are read, and read() returns zero.
If count is zero, read() may detect the errors described below. In the absence of any errors, or if read() does not check for errors, a read() with a count of 0 returns zero and has no other effects.
According to POSIX.1, if count is greater than SSIZE_MAX, the result is implementation-defined; see NOTES for the upper limit on Linux.
RETURN VALUE
On success, the number of bytes read is returned (zero indicates end of file), and the file position is advanced by this number. It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal. See also NOTES.
On error, -1 is returned, and errno is set to indicate the error. In this case, it is left unspecified whether the file position (if any) changes.
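Because a short count is not an error, code that needs a fixed number of bytes usually loops. The sketch below is one possible way to do that; read_full() is a hypothetical helper name, and the choice to discard a partial count when a later call fails is an assumption made for brevity.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until 'count' bytes have
   arrived, end of file is reached, or a real error occurs.
   Returns the number of bytes stored in 'buf' (less than 'count' only
   at end of file), or -1 on error with errno set. */
ssize_t
read_full(int fd, void *buf, size_t count)
{
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, (char *) buf + total, count - total);

        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;
        }
        if (n == 0)
            break;              /* end of file */
        total += (size_t) n;
    }
    return (ssize_t) total;
}
```

On a pipe that has received "hel" and then "lo" before the write end was closed, a call asking for 16 bytes would return 5, and the next call would return 0.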
ERRORS
EAGAIN
The file descriptor fd refers to a file other than a socket and has been marked nonblocking (O_NONBLOCK), and the read would block. See open(2) for further details on the O_NONBLOCK flag.
EAGAIN or EWOULDBLOCK
The file descriptor fd refers to a socket and has been marked nonblocking (O_NONBLOCK), and the read would block. POSIX.1-2001 allows either error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities.
EBADF
fd is not a valid file descriptor or is not open for reading.
EFAULT
buf is outside your accessible address space.
EINTR
The call was interrupted by a signal before any data was read; see signal(7).
EINVAL
fd is attached to an object which is unsuitable for reading; or the file was opened with the O_DIRECT flag, and either the address specified in buf, the value specified in count, or the file offset is not suitably aligned.
EINVAL
fd was created via a call to timerfd_create(2) and the wrong size buffer was given to read(); see timerfd_create(2) for further information.
EIO
I/O error. This will happen for example when the process is in a background process group, tries to read from its controlling terminal, and either it is ignoring or blocking SIGTTIN or its process group is orphaned. It may also occur when there is a low-level I/O error while reading from a disk or tape. A further possible cause of EIO on networked filesystems is when an advisory lock had been taken out on the file descriptor and this lock has been lost. See the Lost locks section of fcntl(2) for further details.
EISDIR
fd refers to a directory.
Other errors may occur, depending on the object connected to fd.
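The EAGAIN and EWOULDBLOCK entries above suggest checking for both values after a failed nonblocking read. A minimal sketch, assuming two hypothetical helpers, set_nonblocking() and read_would_block(), that are not part of any library:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: turn on O_NONBLOCK while preserving the other
   file status flags. Returns 0 on success, -1 on error. */
int
set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: given the errno value left by a failed read(),
   report whether it only means "no data available yet". Both constants
   are tested because they are not required to be equal. */
int
read_would_block(int err)
{
    return err == EAGAIN || err == EWOULDBLOCK;
}
```

A caller would pass errno to read_would_block() immediately after read() returns -1, before any other call can overwrite it.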
STANDARDS
POSIX.1-2008.
HISTORY
SVr4, 4.3BSD, POSIX.1-2001.
NOTES
On Linux, read() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)
On NFS filesystems, reading small amounts of data will update the timestamp only the first time; subsequent calls may not do so. This is caused by client-side attribute caching, because most if not all NFS clients leave st_atime (last file access time) updates to the server, and client-side reads satisfied from the client’s cache will not cause st_atime updates on the server as there are no server-side reads. UNIX semantics can be obtained by disabling client-side attribute caching, but in most situations this will substantially increase server load and decrease performance.
BUGS
According to POSIX.1-2008/SUSv4 Section XSI 2.9.7 (“Thread Interactions with Regular File Operations”):
All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links: …
Among the APIs subsequently listed are read() and readv(2). And among the effects that should be atomic across threads (and processes) are updates of the file offset. However, before Linux 3.14, this was not the case: if two processes that share an open file description (see open(2)) perform a read() (or readv(2)) at the same time, then the I/O operations were not atomic with respect to updating the file offset, with the result that the reads in the two processes might (incorrectly) overlap in the blocks of data that they obtained. This problem was fixed in Linux 3.14.
SEE ALSO
close(2), fcntl(2), ioctl(2), lseek(2), open(2), pread(2), readdir(2), readlink(2), readv(2), select(2), write(2), fread(3)
231 - Linux cli command sigwaitinfo
NAME 🖥️ sigwaitinfo 🖥️
synchronously wait for queued signals
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
int sigwaitinfo(const sigset_t *restrict set,
siginfo_t *_Nullable restrict info);
int sigtimedwait(const sigset_t *restrict set,
siginfo_t *_Nullable restrict info,
const struct timespec *restrict timeout);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigwaitinfo(), sigtimedwait():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
sigwaitinfo() suspends execution of the calling thread until one of the signals in set is pending. (If one of the signals in set is already pending for the calling thread, sigwaitinfo() will return immediately.)
sigwaitinfo() removes the signal from the set of pending signals and returns the signal number as its function result. If the info argument is not NULL, then the buffer that it points to is used to return a structure of type siginfo_t (see sigaction(2)) containing information about the signal.
If multiple signals in set are pending for the caller, the signal that is retrieved by sigwaitinfo() is determined according to the usual ordering rules; see signal(7) for further details.
sigtimedwait() operates in exactly the same way as sigwaitinfo() except that it has an additional argument, timeout, which specifies the interval for which the thread is suspended waiting for a signal. (This interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the interval may overrun by a small amount.) This argument is a timespec(3) structure.
If both fields of this structure are specified as 0, a poll is performed: sigtimedwait() returns immediately, either with information about a signal that was pending for the caller, or with an error if none of the signals in set was pending.
RETURN VALUE
On success, both sigwaitinfo() and sigtimedwait() return a signal number (i.e., a value greater than zero). On failure both calls return -1, with errno set to indicate the error.
ERRORS
EAGAIN
No signal in set became pending within the timeout period specified to sigtimedwait().
EINTR
The wait was interrupted by a signal handler; see signal(7). (This handler was for a signal other than one of those in set.)
EINVAL
timeout was invalid.
VERSIONS
C library/kernel differences
On Linux, sigwaitinfo() is a library function implemented on top of sigtimedwait().
The glibc wrapper functions for sigwaitinfo() and sigtimedwait() silently ignore attempts to wait for the two real-time signals that are used internally by the NPTL threading implementation. See nptl(7) for details.
The original Linux system call was named sigtimedwait(). However, with the addition of real-time signals in Linux 2.2, the fixed-size, 32-bit sigset_t type supported by that system call was no longer fit for purpose. Consequently, a new system call, rt_sigtimedwait(), was added to support an enlarged sigset_t type. The new system call takes a fourth argument, size_t sigsetsize, which specifies the size in bytes of the signal set in set. This argument is currently required to have the value sizeof(sigset_t) (or the error EINVAL results). The glibc sigtimedwait() wrapper function hides these details from us, transparently calling rt_sigtimedwait() when the kernel provides it.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
NOTES
In normal usage, the calling program blocks the signals in set via a prior call to sigprocmask(2) (so that the default disposition for these signals does not occur if they become pending between successive calls to sigwaitinfo() or sigtimedwait()) and does not establish handlers for these signals. In a multithreaded program, the signal should be blocked in all threads, in order to prevent the signal being treated according to its default disposition in a thread other than the one calling sigwaitinfo() or sigtimedwait().
The set of signals that is pending for a given thread is the union of the set of signals that is pending specifically for that thread and the set of signals that is pending for the process as a whole (see signal(7)).
Attempts to wait for SIGKILL and SIGSTOP are silently ignored.
If multiple threads of a process are blocked waiting for the same signal(s) in sigwaitinfo() or sigtimedwait(), then exactly one of the threads will actually receive the signal if it becomes pending for the process as a whole; which of the threads receives the signal is indeterminate.
sigwaitinfo() and sigtimedwait() can’t be used to receive signals that are synchronously generated, such as the SIGSEGV signal that results from accessing an invalid memory address or the SIGFPE signal that results from an arithmetic error. Such signals can be caught only via a signal handler.
POSIX leaves the meaning of a NULL value for the timeout argument of sigtimedwait() unspecified, permitting the possibility that this has the same meaning as a call to sigwaitinfo(), and indeed this is what is done on Linux.
SEE ALSO
kill(2), sigaction(2), signal(2), signalfd(2), sigpending(2), sigprocmask(2), sigqueue(3), sigsetops(3), sigwait(3), timespec(3), signal(7), time(7)
232 - Linux cli command connect
NAME 🖥️ connect 🖥️
initiate a connection on a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int connect(int sockfd, const struct sockaddr *addr,
socklen_t addrlen);
DESCRIPTION
The connect() system call connects the socket referred to by the file descriptor sockfd to the address specified by addr. The addrlen argument specifies the size of addr. The format of the address in addr is determined by the address space of the socket sockfd; see socket(2) for further details.
If the socket sockfd is of type SOCK_DGRAM, then addr is the address to which datagrams are sent by default, and the only address from which datagrams are received. If the socket is of type SOCK_STREAM or SOCK_SEQPACKET, this call attempts to make a connection to the socket that is bound to the address specified by addr.
Some protocol sockets (e.g., UNIX domain stream sockets) may successfully connect() only once.
Some protocol sockets (e.g., datagram sockets in the UNIX and Internet domains) may use connect() multiple times to change their association.
Some protocol sockets (e.g., TCP sockets as well as datagram sockets in the UNIX and Internet domains) may dissolve the association by connecting to an address with the sa_family member of sockaddr set to AF_UNSPEC; thereafter, the socket can be connected to another address. (AF_UNSPEC is supported since Linux 2.2.)
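Dissolving an association with AF_UNSPEC can be sketched as follows. The helper name dissolve_association() is hypothetical; passing a zeroed generic struct sockaddr is an assumption that works for the socket types named above on Linux.

```c
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: dissolve the association of a connected socket
   by connecting it to an address whose family is AF_UNSPEC.
   Returns 0 on success, -1 on error with errno set. */
int
dissolve_association(int sockfd)
{
    struct sockaddr addr;

    memset(&addr, 0, sizeof(addr));
    addr.sa_family = AF_UNSPEC;

    return connect(sockfd, &addr, sizeof(addr));
}
```

After a successful call on a connected datagram socket, the socket has no default peer, and connect() can be used again to choose a new one.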
RETURN VALUE
If the connection or binding succeeds, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The following are general socket errors only. There may be other domain-specific error codes.
EACCES
For UNIX domain sockets, which are identified by pathname: Write permission is denied on the socket file, or search permission is denied for one of the directories in the path prefix. (See also path_resolution(7).)
EACCES or EPERM
The user tried to connect to a broadcast address without having the socket broadcast flag enabled or the connection request failed because of a local firewall rule.
EACCES
It can also be returned if an SELinux policy denied a connection (for example, if there is a policy saying that an HTTP proxy can only connect to ports associated with HTTP servers, and the proxy tries to connect to a different port).
EADDRINUSE
Local address is already in use.
EADDRNOTAVAIL
(Internet domain sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7).
EAFNOSUPPORT
The passed address didn’t have the correct address family in its sa_family field.
EAGAIN
For UNIX domain sockets, the socket is nonblocking, and the connection cannot be completed immediately. For other socket families, there are insufficient entries in the routing cache.
EALREADY
The socket is nonblocking and a previous connection attempt has not yet been completed.
EBADF
sockfd is not a valid open file descriptor.
ECONNREFUSED
A connect() on a stream socket found no one listening on the remote address.
EFAULT
The socket structure address is outside the user’s address space.
EINPROGRESS
The socket is nonblocking and the connection cannot be completed immediately. (UNIX domain sockets fail with EAGAIN instead.) It is possible to select(2) or poll(2) for completion by selecting the socket for writing. After select(2) indicates writability, use getsockopt(2) to read the SO_ERROR option at level SOL_SOCKET to determine whether connect() completed successfully (SO_ERROR is zero) or unsuccessfully (SO_ERROR is one of the usual error codes listed here, explaining the reason for the failure).
EINTR
The system call was interrupted by a signal that was caught; see signal(7).
EISCONN
The socket is already connected.
ENETUNREACH
Network is unreachable.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
EPROTOTYPE
The socket type does not support the requested communications protocol. This error can occur, for example, on an attempt to connect a UNIX domain datagram socket to a stream socket.
ETIMEDOUT
Timeout while attempting connection. The server may be too busy to accept new connections. Note that for IP sockets the timeout may be very long when syncookies are enabled on the server.
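The completion procedure described under EINPROGRESS can be sketched with poll(2) and SO_ERROR. This is an illustration only: connect_with_timeout() is a hypothetical name, the socket is assumed to be nonblocking already, and reporting an expired wait as ETIMEDOUT is a choice made for this sketch.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical helper: start connect() on a nonblocking socket and
   wait up to 'timeout_ms' milliseconds for the outcome.
   Returns 0 once connected, -1 on failure with errno set. */
int
connect_with_timeout(int sockfd, const struct sockaddr *addr,
                     socklen_t addrlen, int timeout_ms)
{
    struct pollfd pfd = { .fd = sockfd, .events = POLLOUT };
    int err, ready;
    socklen_t errlen = sizeof(err);

    if (connect(sockfd, addr, addrlen) == 0)
        return 0;                   /* connected immediately */
    if (errno != EINPROGRESS)
        return -1;                  /* genuine failure */

    /* Wait for writability; an interrupted wait restarts with the
       full timeout, which keeps the sketch short. */
    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready == -1 && errno == EINTR);

    if (ready == -1)
        return -1;
    if (ready == 0) {
        errno = ETIMEDOUT;          /* nothing happened in time */
        return -1;
    }

    /* Writability alone does not mean success: read SO_ERROR. */
    if (getsockopt(sockfd, SOL_SOCKET, SO_ERROR, &err, &errlen) == -1)
        return -1;
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}
```

On failure, the advice in NOTES still applies: close the socket and create a new one before retrying.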
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD (connect() first appeared in 4.2BSD).
NOTES
If connect() fails, consider the state of the socket as unspecified. Portable applications should close the socket and create a new one for reconnecting.
EXAMPLES
An example of the use of connect() is shown in getaddrinfo(3).
SEE ALSO
accept(2), bind(2), getsockname(2), listen(2), socket(2), path_resolution(7), selinux(8)
233 - Linux cli command set_tid_address
NAME 🖥️ set_tid_address 🖥️
set pointer to thread ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
pid_t syscall(SYS_set_tid_address, int *tidptr);
Note: glibc provides no wrapper for set_tid_address(), necessitating the use of syscall(2).
DESCRIPTION
For each thread, the kernel maintains two attributes (addresses) called set_child_tid and clear_child_tid. These two attributes contain the value NULL by default.
set_child_tid
If a thread is started using clone(2) with the CLONE_CHILD_SETTID flag, set_child_tid is set to the value passed in the ctid argument of that system call.
When set_child_tid is set, the very first thing the new thread does is to write its thread ID at this address.
clear_child_tid
If a thread is started using clone(2) with the CLONE_CHILD_CLEARTID flag, clear_child_tid is set to the value passed in the ctid argument of that system call.
The system call set_tid_address() sets the clear_child_tid value for the calling thread to tidptr.
When a thread whose clear_child_tid is not NULL terminates and is sharing memory with other threads, 0 is written at the address specified in clear_child_tid and the kernel performs the following operation:
futex(clear_child_tid, FUTEX_WAKE, 1, NULL, NULL, 0);
The effect of this operation is to wake a single thread that is performing a futex wait on the memory location. Errors from the futex wake operation are ignored.
RETURN VALUE
set_tid_address() always returns the caller’s thread ID.
ERRORS
set_tid_address() always succeeds.
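Since glibc provides no wrapper, a call goes through syscall(2). The sketch below uses a hypothetical wrapper name, set_clear_child_tid(). Note the assumption that nothing else in the process depends on the previous clear_child_tid value; a threading library may already rely on it, so this is for illustration rather than general use.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: point the calling thread's clear_child_tid at
   'tidptr' and return the caller's thread ID. The memory at 'tidptr'
   must remain valid until the thread terminates. */
pid_t
set_clear_child_tid(int *tidptr)
{
    return (pid_t) syscall(SYS_set_tid_address, tidptr);
}
```

The return value matches what gettid(2) reports for the calling thread.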
STANDARDS
Linux.
HISTORY
Linux 2.5.48.
Details as given here are valid since Linux 2.5.49.
SEE ALSO
clone(2), futex(2), gettid(2)
234 - Linux cli command copy_file_range
NAME 🖥️ copy_file_range 🖥️
Copy a range of data from one file to another
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <unistd.h>
ssize_t copy_file_range(int fd_in, off_t *_Nullable off_in,
int fd_out, off_t *_Nullable off_out,
size_t len, unsigned int flags);
DESCRIPTION
The copy_file_range() system call performs an in-kernel copy between two file descriptors without the additional cost of transferring data from the kernel to user space and then back into the kernel. It copies up to len bytes of data from the source file descriptor fd_in to the target file descriptor fd_out, overwriting any data that exists within the requested range of the target file.
The following semantics apply for off_in, and similar statements apply to off_out:
If off_in is NULL, then bytes are read from fd_in starting from the file offset, and the file offset is adjusted by the number of bytes copied.
If off_in is not NULL, then off_in must point to a buffer that specifies the starting offset where bytes from fd_in will be read. The file offset of fd_in is not changed, but off_in is adjusted appropriately.
fd_in and fd_out can refer to the same file. If they refer to the same file, then the source and target ranges are not allowed to overlap.
The flags argument is provided to allow for future extensions and currently must be set to 0.
RETURN VALUE
Upon successful completion, copy_file_range() will return the number of bytes copied between files. This could be less than the length originally requested. If the file offset of fd_in is at or past the end of file, no bytes are copied, and copy_file_range() returns zero.
On error, copy_file_range() returns -1 and errno is set to indicate the error.
ERRORS
EBADF
One or more file descriptors are not valid.
EBADF
fd_in is not open for reading; or fd_out is not open for writing.
EBADF
The O_APPEND flag is set for the open file description (see open(2)) referred to by the file descriptor fd_out.
EFBIG
An attempt was made to write at a position past the maximum file offset the kernel supports.
EFBIG
An attempt was made to write a range that exceeds the allowed maximum file size. The maximum file size differs between filesystem implementations and can be different from the maximum allowed file offset.
EFBIG
An attempt was made to write beyond the process’s file size resource limit. This may also result in the process receiving a SIGXFSZ signal.
EINVAL
The flags argument is not 0.
EINVAL
fd_in and fd_out refer to the same file and the source and target ranges overlap.
EINVAL
Either fd_in or fd_out is not a regular file.
EIO
A low-level I/O error occurred while copying.
EISDIR
Either fd_in or fd_out refers to a directory.
ENOMEM
Out of memory.
ENOSPC
There is not enough space on the target filesystem to complete the copy.
EOPNOTSUPP (since Linux 5.19)
The filesystem does not support this operation.
EOVERFLOW
The requested source or destination range is too large to represent in the specified data types.
EPERM
fd_out refers to an immutable file.
ETXTBSY
Either fd_in or fd_out refers to an active swap file.
EXDEV (before Linux 5.3)
The files referred to by fd_in and fd_out are not on the same filesystem.
EXDEV (since Linux 5.19)
The files referred to by fd_in and fd_out are not on the same filesystem, and the source and target filesystems are not of the same type, or do not support cross-filesystem copy.
VERSIONS
A major rework of the kernel implementation occurred in Linux 5.3. Areas of the API that weren’t clearly defined were clarified and the API bounds are much more strictly checked than on earlier kernels.
Since Linux 5.19, cross-filesystem copies can be achieved when both filesystems are of the same type, and that filesystem implements support for it. See BUGS for behavior prior to Linux 5.19.
Applications should target the behaviour and requirements of Linux 5.19, which were also backported to earlier stable kernels.
STANDARDS
Linux, GNU.
HISTORY
Linux 4.5, but glibc 2.27 provides a user-space emulation when it is not available.
NOTES
If fd_in is a sparse file, then copy_file_range() may expand any holes existing in the requested range. Users may benefit from calling copy_file_range() in a loop, and using the lseek(2) SEEK_DATA and SEEK_HOLE operations to find the locations of data segments.
copy_file_range() gives filesystems an opportunity to implement “copy acceleration” techniques, such as the use of reflinks (i.e., two or more inodes that share pointers to the same copy-on-write disk blocks) or server-side-copy (in the case of NFS).
_FILE_OFFSET_BITS should be defined to be 64 in code that uses non-null off_in or off_out or that takes the address of copy_file_range, if the code is intended to be portable to traditional 32-bit x86 and ARM platforms where off_t’s width defaults to 32 bits.
BUGS
In Linux 5.3 to Linux 5.18, cross-filesystem copies were implemented by the kernel, if the operation was not supported by individual filesystems. However, on some virtual filesystems, the call failed to copy, while still reporting success.
EXAMPLES
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int          fd_in, fd_out;
    off_t        len, ret;
    struct stat  stat;

    if (argc != 3) {
        fprintf(stderr, "Usage: %s <source> <destination>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd_in = open(argv[1], O_RDONLY);
    if (fd_in == -1) {
        perror("open (argv[1])");
        exit(EXIT_FAILURE);
    }

    if (fstat(fd_in, &stat) == -1) {
        perror("fstat");
        exit(EXIT_FAILURE);
    }

    len = stat.st_size;

    fd_out = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd_out == -1) {
        perror("open (argv[2])");
        exit(EXIT_FAILURE);
    }

    do {
        ret = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
        if (ret == -1) {
            perror("copy_file_range");
            exit(EXIT_FAILURE);
        }

        len -= ret;
    } while (len > 0 && ret > 0);

    close(fd_in);
    close(fd_out);
    exit(EXIT_SUCCESS);
}
SEE ALSO
lseek(2), sendfile(2), splice(2)
235 - Linux cli command dup3
NAME 🖥️ dup3 🖥️
duplicate a file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int dup(int oldfd);
int dup2(int oldfd, int newfd);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h> /* Definition of O_* constants */
#include <unistd.h>
int dup3(int oldfd, int newfd, int flags);
DESCRIPTION
The dup() system call allocates a new file descriptor that refers to the same open file description as the descriptor oldfd. (For an explanation of open file descriptions, see open(2).) The new file descriptor number is guaranteed to be the lowest-numbered file descriptor that was unused in the calling process.
After a successful return, the old and new file descriptors may be used interchangeably. Since the two file descriptors refer to the same open file description, they share file offset and file status flags; for example, if the file offset is modified by using lseek(2) on one of the file descriptors, the offset is also changed for the other file descriptor.
The two file descriptors do not share file descriptor flags (the close-on-exec flag). The close-on-exec flag (FD_CLOEXEC; see fcntl(2)) for the duplicate descriptor is off.
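The sharing rules above can be observed directly. The following sketch, with the hypothetical name dup_shares_offset(), duplicates a seekable descriptor, moves the offset through the duplicate, and checks both the shared offset and the cleared close-on-exec flag.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical check: 'fd' must refer to a seekable file.
   Returns 1 if an offset change made through a duplicate is visible
   through 'fd' and the duplicate's close-on-exec flag is off,
   0 if not, and -1 on error. */
int
dup_shares_offset(int fd)
{
    off_t before, after;
    int copy, fdflags, shared;

    before = lseek(fd, 0, SEEK_CUR);
    if (before == -1)
        return -1;

    copy = dup(fd);
    if (copy == -1)
        return -1;

    fdflags = fcntl(copy, F_GETFD);

    /* Move the offset via the duplicate... */
    if (fdflags == -1 || lseek(copy, before + 1, SEEK_SET) == -1) {
        close(copy);
        return -1;
    }

    /* ...and observe it through the original descriptor. */
    after = lseek(fd, 0, SEEK_CUR);
    shared = (after == before + 1) && !(fdflags & FD_CLOEXEC);

    lseek(fd, before, SEEK_SET);    /* restore the original offset */
    close(copy);
    return shared;
}
```

For any regular file opened with open(2), the function is expected to return 1.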
dup2()
The dup2() system call performs the same task as dup(), but instead of using the lowest-numbered unused file descriptor, it uses the file descriptor number specified in newfd. In other words, the file descriptor newfd is adjusted so that it now refers to the same open file description as oldfd.
If the file descriptor newfd was previously open, it is closed before being reused; the close is performed silently (i.e., any errors during the close are not reported by dup2()).
The steps of closing and reusing the file descriptor newfd are performed atomically. This is important, because trying to implement equivalent functionality using close(2) and dup() would be subject to race conditions, whereby newfd might be reused between the two steps. Such reuse could happen because the main program is interrupted by a signal handler that allocates a file descriptor, or because a parallel thread allocates a file descriptor.
Note the following points:
If oldfd is not a valid file descriptor, then the call fails, and newfd is not closed.
If oldfd is a valid file descriptor, and newfd has the same value as oldfd, then dup2() does nothing, and returns newfd.
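A common use of dup2() is to redirect a descriptor temporarily and restore it later. The sketch below shows one way to do this; redirect_begin() and redirect_end() are hypothetical helper names, not part of any library.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: make 'target' refer to the same open file
   description as 'source'.
   Returns a saved duplicate of the old 'target' (to be passed to
   redirect_end()), or -1 on error with errno set. */
int
redirect_begin(int target, int source)
{
    int saved = dup(target);

    if (saved == -1)
        return -1;

    if (dup2(source, target) == -1) {
        int e = errno;

        close(saved);
        errno = e;              /* report the dup2() error */
        return -1;
    }
    return saved;
}

/* Hypothetical helper: restore 'target' from 'saved' and release the
   saved duplicate. Returns 0 on success, -1 on error. */
int
redirect_end(int target, int saved)
{
    int ret = dup2(saved, target);
    int e = errno;

    close(saved);
    errno = e;
    return ret == -1 ? -1 : 0;
}
```

For example, a program could call redirect_begin(STDOUT_FILENO, logfd) before running code that prints to standard output, and redirect_end(STDOUT_FILENO, saved) afterwards; logfd here stands for any descriptor the caller has opened.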
dup3()
dup3() is the same as dup2(), except that:
The caller can force the close-on-exec flag to be set for the new file descriptor by specifying O_CLOEXEC in flags. See the description of the same flag in open(2) for reasons why this may be useful.
If oldfd equals newfd, then dup3() fails with the error EINVAL.
RETURN VALUE
On success, these system calls return the new file descriptor. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
oldfd isn’t an open file descriptor.
EBADF
newfd is out of the allowed range for file descriptors (see the discussion of RLIMIT_NOFILE in getrlimit(2)).
EBUSY
(Linux only) This may be returned by dup2() or dup3() during a race condition with open(2) and dup().
EINTR
The dup2() or dup3() call was interrupted by a signal; see signal(7).
EINVAL
(dup3()) flags contain an invalid value.
EINVAL
(dup3()) oldfd was equal to newfd.
EMFILE
The per-process limit on the number of open file descriptors has been reached (see the discussion of RLIMIT_NOFILE in getrlimit(2)).
STANDARDS
dup()
dup2()
POSIX.1-2008.
dup3()
Linux.
HISTORY
dup()
dup2()
POSIX.1-2001, SVr4, 4.3BSD.
dup3()
Linux 2.6.27, glibc 2.9.
NOTES
The error returned by dup2() is different from that returned by fcntl(…, F_DUPFD, …) when newfd is out of range. On some systems, dup2() also sometimes returns EINVAL like F_DUPFD.
If newfd was open, any errors that would have been reported at close(2) time are lost. If this is of concern, then (unless the program is single-threaded and does not allocate file descriptors in signal handlers) the correct approach is not to close newfd before calling dup2(), because of the race condition described above. Instead, code something like the following could be used:
    /* Obtain a duplicate of 'newfd' that can subsequently
       be used to check for close() errors; an EBADF error
       means that 'newfd' was not open. */

    tmpfd = dup(newfd);
    if (tmpfd == -1 && errno != EBADF) {
        /* Handle unexpected dup() error. */
    }

    /* Atomically duplicate 'oldfd' on 'newfd'. */

    if (dup2(oldfd, newfd) == -1) {
        /* Handle dup2() error. */
    }

    /* Now check for close() errors on the file originally
       referred to by 'newfd'. */

    if (tmpfd != -1) {
        if (close(tmpfd) == -1) {
            /* Handle errors from close. */
        }
    }
SEE ALSO
close(2), fcntl(2), open(2), pidfd_getfd(2)
236 - Linux cli command memfd_secret
NAME memfd_secret
create an anonymous RAM-based file to access secret memory regions
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_memfd_secret, unsigned int flags);
Note: glibc provides no wrapper for memfd_secret(), necessitating the use of syscall(2).
DESCRIPTION
memfd_secret() creates an anonymous RAM-based file and returns a file descriptor that refers to it. The file provides a way to create and access memory regions with stronger protection than usual RAM-based files and anonymous memory mappings. Once all open references to the file are closed, it is automatically released. The initial size of the file is set to 0. Following the call, the file size should be set using ftruncate(2).
The memory areas backing the file created with memfd_secret(2) are visible only to the processes that have access to the file descriptor. The memory region is removed from the kernel page tables and only the page tables of the processes holding the file descriptor map the corresponding physical memory. (Thus, the pages in the region can’t be accessed by the kernel itself, so that, for example, pointers to the region can’t be passed to system calls.)
The following values may be bitwise ORed in flags to control the behavior of memfd_secret():
FD_CLOEXEC
Set the close-on-exec flag on the new file descriptor, which causes the region to be removed from the process on execve(2). See the description of the O_CLOEXEC flag in open(2).
As its return value, memfd_secret() returns a new file descriptor that refers to an anonymous file. This file descriptor is opened for both reading and writing (O_RDWR) and O_LARGEFILE is set for the file descriptor.
With respect to fork(2) and execve(2), the usual semantics apply for the file descriptor created by memfd_secret(). A copy of the file descriptor is inherited by the child produced by fork(2) and refers to the same file. The file descriptor is preserved across execve(2), unless the close-on-exec flag has been set.
The memory region is locked into memory in the same way as with mlock(2), so that it will never be written into swap, and hibernation is inhibited for as long as any memfd_secret() descriptions exist. However the implementation of memfd_secret() will not try to populate the whole range during the mmap(2) call that attaches the region into the process’s address space; instead, the pages are only actually allocated as they are faulted in. The amount of memory allowed for memory mappings of the file descriptor obeys the same rules as mlock(2) and cannot exceed RLIMIT_MEMLOCK.
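The sequence described above (create the descriptor, set the file size with ftruncate(2), then attach the region with mmap(2)) might be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: secret_alloc() is a hypothetical helper name, the fallback definition of SYS_memfd_secret assumes the generic system call number 447 for headers older than Linux 5.14, and MAP_SHARED is used on the assumption that private mappings of such a file are rejected.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: generic system call number, used only when the
   installed kernel headers do not define SYS_memfd_secret. */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

/* Hypothetical helper: map 'len' bytes of secret memory.
   Returns the mapping on success, or NULL with errno set
   (for example, ENOSYS if the call is unavailable or disabled). */
static void *
secret_alloc(size_t len)
{
    int fd, saved;
    void *p;

    fd = syscall(SYS_memfd_secret, FD_CLOEXEC);
    if (fd == -1)
        return NULL;

    /* The file is created with size 0; size it before mapping. */
    if (ftruncate(fd, (off_t) len) == -1) {
        saved = errno;
        close(fd);
        errno = saved;
        return NULL;
    }

    /* Pages are allocated lazily, as they are first touched. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    saved = errno;
    close(fd);          /* The mapping keeps the file alive. */
    errno = saved;

    return (p == MAP_FAILED) ? NULL : p;
}
```

A caller would typically treat a NULL return with errno set to ENOSYS as "not supported on this system" and fall back to ordinary memory, and would release the region with munmap(2) when done.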
RETURN VALUE
On success, memfd_secret() returns a new file descriptor. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EINVAL
flags included unknown bits.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOMEM
There was insufficient memory to create a new anonymous file.
ENOSYS
memfd_secret() is not implemented on this architecture, or has not been enabled on the kernel command line with secretmem.enable=1.
STANDARDS
Linux.
HISTORY
Linux 5.14.
NOTES
The memfd_secret() system call is designed to allow a user-space process to create a range of memory that is inaccessible to anybody else, the kernel included. There is no absolute guarantee that the kernel won’t be able to access memory ranges backed by memfd_secret() under any circumstances, but it is nevertheless much harder to exfiltrate data from these regions.
memfd_secret() provides the following protections:
Enhanced protection (in conjunction with all the other in-kernel attack prevention systems) against ROP attacks. The absence of any in-kernel primitive for accessing memory backed by memfd_secret() means that a one-gadget ROP attack can’t work to perform data exfiltration. The attacker would need to find enough ROP gadgets to reconstruct the missing page table entries, which significantly increases the difficulty of the attack, especially when other protections like the kernel stack size limit and address space layout randomization are in place.
Prevent cross-process user-space memory exposures. Once a region for a memfd_secret() memory mapping is allocated, the user can’t accidentally pass it into the kernel to be transmitted somewhere. The memory pages in this region cannot be accessed via the direct map and they are disallowed in get_user_pages.
Harden against exploited kernel flaws. In order to access memory areas backed by memfd_secret(), a kernel-side attack would need to either walk the page tables and create new ones, or spawn a new privileged user-space process to perform secrets exfiltration using ptrace(2).
The way memfd_secret() allocates and locks the memory may impact overall system performance. For this reason, the system call is disabled by default and is available only if the system administrator has turned it on using the “secretmem.enable=y” kernel parameter.
To prevent potential data leaks of memory regions backed by memfd_secret() from a hibernation image, hibernation is prevented when there are active memfd_secret() users.
SEE ALSO
fcntl(2), ftruncate(2), mlock(2), memfd_create(2), mmap(2), setrlimit(2)
237 - Linux cli command timerfd_gettime
NAME timerfd_gettime
timers that notify via file descriptors
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/timerfd.h>
int timerfd_create(int clockid, int flags);
int timerfd_settime(int fd, int flags,
const struct itimerspec *new_value,
struct itimerspec *_Nullable old_value);
int timerfd_gettime(int fd, struct itimerspec *curr_value);
DESCRIPTION
These system calls create and operate on a timer that delivers timer expiration notifications via a file descriptor. They provide an alternative to the use of setitimer(2) or timer_create(2), with the advantage that the file descriptor may be monitored by select(2), poll(2), and epoll(7).
The use of these three system calls is analogous to the use of timer_create(2), timer_settime(2), and timer_gettime(2). (There is no analog of timer_getoverrun(2), since that functionality is provided by read(2), as described below.)
timerfd_create()
timerfd_create() creates a new timer object, and returns a file descriptor that refers to that timer. The clockid argument specifies the clock that is used to mark the progress of the timer, and must be one of the following:
CLOCK_REALTIME
A settable system-wide real-time clock.
CLOCK_MONOTONIC
A nonsettable monotonically increasing clock that measures time from some unspecified point in the past that does not change after system startup.
CLOCK_BOOTTIME (Since Linux 3.15)
Like CLOCK_MONOTONIC, this is a monotonically increasing clock. However, whereas the CLOCK_MONOTONIC clock does not measure the time while a system is suspended, the CLOCK_BOOTTIME clock does include the time during which the system is suspended. This is useful for applications that need to be suspend-aware. CLOCK_REALTIME is not suitable for such applications, since that clock is affected by discontinuous changes to the system clock.
CLOCK_REALTIME_ALARM (since Linux 3.11)
This clock is like CLOCK_REALTIME, but will wake the system if it is suspended. The caller must have the CAP_WAKE_ALARM capability in order to set a timer against this clock.
CLOCK_BOOTTIME_ALARM (since Linux 3.11)
This clock is like CLOCK_BOOTTIME, but will wake the system if it is suspended. The caller must have the CAP_WAKE_ALARM capability in order to set a timer against this clock.
See clock_getres(2) for some further details on the above clocks.
The current value of each of these clocks can be retrieved using clock_gettime(2).
Starting with Linux 2.6.27, the following values may be bitwise ORed in flags to change the behavior of timerfd_create():
TFD_NONBLOCK
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
TFD_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
In Linux versions up to and including 2.6.26, flags must be specified as zero.
timerfd_settime()
timerfd_settime() arms (starts) or disarms (stops) the timer referred to by the file descriptor fd.
The new_value argument specifies the initial expiration and interval for the timer. The itimerspec structure used for this argument is described in itimerspec(3type).
new_value.it_value specifies the initial expiration of the timer, in seconds and nanoseconds. Setting either field of new_value.it_value to a nonzero value arms the timer. Setting both fields of new_value.it_value to zero disarms the timer.
Setting one or both fields of new_value.it_interval to nonzero values specifies the period, in seconds and nanoseconds, for repeated timer expirations after the initial expiration. If both fields of new_value.it_interval are zero, the timer expires just once, at the time specified by new_value.it_value.
By default, the initial expiration time specified in new_value is interpreted relative to the current time on the timer’s clock at the time of the call (i.e., new_value.it_value specifies a time relative to the current value of the clock specified by clockid). An absolute timeout can be selected via the flags argument.
The flags argument is a bit mask that can include the following values:
TFD_TIMER_ABSTIME
Interpret new_value.it_value as an absolute value on the timer’s clock. The timer will expire when the value of the timer’s clock reaches the value specified in new_value.it_value.
TFD_TIMER_CANCEL_ON_SET
If this flag is specified along with TFD_TIMER_ABSTIME and the clock for this timer is CLOCK_REALTIME or CLOCK_REALTIME_ALARM, then mark this timer as cancelable if the real-time clock undergoes a discontinuous change (settimeofday(2), clock_settime(2), or similar). When such changes occur, a current or future read(2) from the file descriptor will fail with the error ECANCELED.
If the old_value argument is not NULL, then the itimerspec structure that it points to is used to return the setting of the timer that was current at the time of the call; see the description of timerfd_gettime() following.
timerfd_gettime()
timerfd_gettime() returns, in curr_value, an itimerspec structure that contains the current setting of the timer referred to by the file descriptor fd.
The it_value field returns the amount of time until the timer will next expire. If both fields of this structure are zero, then the timer is currently disarmed. This field always contains a relative value, regardless of whether the TFD_TIMER_ABSTIME flag was specified when setting the timer.
The it_interval field returns the interval of the timer. If both fields of this structure are zero, then the timer is set to expire just once, at the time specified by curr_value.it_value.
Operating on a timer file descriptor
The file descriptor returned by timerfd_create() supports the following additional operations:
read(2)
If the timer has already expired one or more times since its settings were last modified using timerfd_settime(), or since the last successful read(2), then the buffer given to read(2) returns an unsigned 8-byte integer (uint64_t) containing the number of expirations that have occurred. (The returned value is in host byte order, that is, the native byte order for integers on the host machine.)
If no timer expirations have occurred at the time of the read(2), then the call either blocks until the next timer expiration, or fails with the error EAGAIN if the file descriptor has been made nonblocking (via the use of the fcntl(2) F_SETFL operation to set the O_NONBLOCK flag).
A read(2) fails with the error EINVAL if the size of the supplied buffer is less than 8 bytes.
If the associated clock is either CLOCK_REALTIME or CLOCK_REALTIME_ALARM, the timer is absolute (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET was specified when calling timerfd_settime(), then read(2) fails with the error ECANCELED if the real-time clock undergoes a discontinuous change. (This allows the reading application to discover such discontinuous changes to the clock.)
If the associated clock is either CLOCK_REALTIME or CLOCK_REALTIME_ALARM, the timer is absolute (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET was not specified when calling timerfd_settime(), then a discontinuous negative change to the clock (e.g., clock_settime(2)) may cause read(2) to unblock, but return a value of 0 (i.e., no bytes read), if the clock change occurs after the time expired, but before the read(2) on the file descriptor.
poll(2)
select(2)
(and similar)
The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if one or more timer expirations have occurred.
The file descriptor also supports the other file-descriptor multiplexing APIs: pselect(2), ppoll(2), and epoll(7).
ioctl(2)
The following timerfd-specific command is supported:
TFD_IOC_SET_TICKS (since Linux 3.17)
Adjust the number of timer expirations that have occurred. The argument is a pointer to a nonzero 8-byte integer (uint64_t*) containing the new number of expirations. Once the number is set, any waiter on the timer is woken up. The only purpose of this command is to restore the expirations for the purpose of checkpoint/restore. This operation is available only if the kernel was configured with the CONFIG_CHECKPOINT_RESTORE option.
close(2)
When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same timer object have been closed, the timer is disarmed and its resources are freed by the kernel.
fork(2) semantics
After a fork(2), the child inherits a copy of the file descriptor created by timerfd_create(). The file descriptor refers to the same underlying timer object as the corresponding file descriptor in the parent, and read(2)s in the child will return information about expirations of the timer.
execve(2) semantics
A file descriptor created by timerfd_create() is preserved across execve(2), and continues to generate timer expirations if the timer was armed.
RETURN VALUE
On success, timerfd_create() returns a new file descriptor. On error, -1 is returned and errno is set to indicate the error.
timerfd_settime() and timerfd_gettime() return 0 on success; on error they return -1, and set errno to indicate the error.
ERRORS
timerfd_create() can fail with the following errors:
EINVAL
The clockid is not valid.
EINVAL
flags is invalid; or, in Linux 2.6.26 or earlier, flags is nonzero.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
Could not mount (internal) anonymous inode device.
ENOMEM
There was insufficient kernel memory to create the timer.
EPERM
clockid was CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM but the caller did not have the CAP_WAKE_ALARM capability.
timerfd_settime() and timerfd_gettime() can fail with the following errors:
EBADF
fd is not a valid file descriptor.
EFAULT
new_value, old_value, or curr_value is not a valid pointer.
EINVAL
fd is not a valid timerfd file descriptor.
timerfd_settime() can also fail with the following errors:
ECANCELED
See NOTES.
EINVAL
new_value is not properly initialized (one of the tv_nsec falls outside the range zero to 999,999,999).
EINVAL
flags is invalid.
STANDARDS
Linux.
HISTORY
Linux 2.6.25, glibc 2.8.
NOTES
Suppose the following scenario for a CLOCK_REALTIME or CLOCK_REALTIME_ALARM timer that was created with timerfd_create():
The timer has been started (timerfd_settime()) with the TFD_TIMER_ABSTIME and TFD_TIMER_CANCEL_ON_SET flags;
a discontinuous change (e.g., settimeofday(2)) is subsequently made to the CLOCK_REALTIME clock; and
the caller once more calls timerfd_settime() to rearm the timer (without first doing a read(2) on the file descriptor).
In this case the following occurs:
The timerfd_settime() returns -1 with errno set to ECANCELED. (This enables the caller to know that the previous timer was affected by a discontinuous change to the clock.)
The timer is successfully rearmed with the settings provided in the second timerfd_settime() call. (This was probably an implementation accident, but won’t be fixed now, in case there are applications that depend on this behaviour.)
BUGS
Currently, timerfd_create() supports fewer types of clock IDs than timer_create(2).
EXAMPLES
The following program creates a timer and then monitors its progress. The program accepts up to three command-line arguments. The first argument specifies the number of seconds for the initial expiration of the timer. The second argument specifies the interval for the timer, in seconds. The third argument specifies the number of times the program should allow the timer to expire before terminating. The second and third command-line arguments are optional.
The following shell session demonstrates the use of the program:
$ a.out 3 1 100
0.000: timer started
3.000: read: 1; total=1
4.000: read: 1; total=2
^Z # type control-Z to suspend the program
[1]+ Stopped ./timerfd3_demo 3 1 100
$ fg # Resume execution after a few seconds
a.out 3 1 100
9.660: read: 5; total=7
10.000: read: 1; total=8
11.000: read: 1; total=9
^C # type control-C to terminate the program
Program source
#include <err.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/timerfd.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

static void
print_elapsed_time(void)
{
    int secs, nsecs;
    static int first_call = 1;
    struct timespec curr;
    static struct timespec start;

    if (first_call) {
        first_call = 0;
        if (clock_gettime(CLOCK_MONOTONIC, &start) == -1)
            err(EXIT_FAILURE, "clock_gettime");
    }

    if (clock_gettime(CLOCK_MONOTONIC, &curr) == -1)
        err(EXIT_FAILURE, "clock_gettime");

    secs = curr.tv_sec - start.tv_sec;
    nsecs = curr.tv_nsec - start.tv_nsec;
    if (nsecs < 0) {
        secs--;
        nsecs += 1000000000;
    }
    printf("%d.%03d: ", secs, (nsecs + 500000) / 1000000);
}

int
main(int argc, char *argv[])
{
    int fd;
    ssize_t s;
    uint64_t exp, tot_exp, max_exp;
    struct timespec now;
    struct itimerspec new_value;

    if (argc != 2 && argc != 4) {
        fprintf(stderr, "%s init-secs [interval-secs max-exp]\n",
                argv[0]);
        exit(EXIT_FAILURE);
    }

    if (clock_gettime(CLOCK_REALTIME, &now) == -1)
        err(EXIT_FAILURE, "clock_gettime");

    /* Create a CLOCK_REALTIME absolute timer with initial
       expiration and interval as specified in command line. */

    new_value.it_value.tv_sec = now.tv_sec + atoi(argv[1]);
    new_value.it_value.tv_nsec = now.tv_nsec;
    if (argc == 2) {
        new_value.it_interval.tv_sec = 0;
        max_exp = 1;
    } else {
        new_value.it_interval.tv_sec = atoi(argv[2]);
        max_exp = atoi(argv[3]);
    }
    new_value.it_interval.tv_nsec = 0;

    fd = timerfd_create(CLOCK_REALTIME, 0);
    if (fd == -1)
        err(EXIT_FAILURE, "timerfd_create");

    if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1)
        err(EXIT_FAILURE, "timerfd_settime");

    print_elapsed_time();
    printf("timer started\n");

    for (tot_exp = 0; tot_exp < max_exp;) {
        s = read(fd, &exp, sizeof(uint64_t));
        if (s != sizeof(uint64_t))
            err(EXIT_FAILURE, "read");

        tot_exp += exp;
        print_elapsed_time();
        printf("read: %" PRIu64 "; total=%" PRIu64 "\n", exp, tot_exp);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
eventfd(2), poll(2), read(2), select(2), setitimer(2), signalfd(2), timer_create(2), timer_gettime(2), timer_settime(2), timespec(3), epoll(7), time(7)
238 - Linux cli command reboot
NAME reboot
reboot or enable/disable Ctrl-Alt-Del
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
/* Since Linux 2.1.30 there are symbolic names LINUX_REBOOT_*
for the constants and a fourth argument to the call: */
#include <linux/reboot.h> /* Definition of LINUX_REBOOT_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_reboot, int magic, int magic2, int op, void *arg);
/* Under glibc and most alternative libc's (including uclibc, dietlibc,
musl and a few others), some of the constants involved have gotten
symbolic names RB_*, and the library call is a 1-argument
wrapper around the system call: */
#include <sys/reboot.h> /* Definition of RB_* constants */
#include <unistd.h>
int reboot(int op);
DESCRIPTION
The reboot() call reboots the system, or enables/disables the reboot keystroke (abbreviated CAD, since the default is Ctrl-Alt-Delete; it can be changed using loadkeys(1)).
This system call fails (with the error EINVAL) unless magic equals LINUX_REBOOT_MAGIC1 (that is, 0xfee1dead) and magic2 equals LINUX_REBOOT_MAGIC2 (that is, 0x28121969). The following values are also permitted for magic2: LINUX_REBOOT_MAGIC2A (that is, 0x05121996) since Linux 2.1.17, LINUX_REBOOT_MAGIC2B (that is, 0x16041998) since Linux 2.1.97, and LINUX_REBOOT_MAGIC2C (that is, 0x20112000) since Linux 2.5.71. (The hexadecimal values of these constants are meaningful.)
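The acceptance rule for the two magic arguments can be expressed as a small predicate. The sketch below only mirrors the rule as described here; reboot_magic_ok() is a hypothetical helper that never invokes the system call, and it assumes that the kernel headers providing <linux/reboot.h> are installed.

```c
#include <linux/reboot.h>

/* Hypothetical helper: return 1 if the pair (magic, magic2) would pass
   the magic-number check, or 0 if the call would fail with EINVAL.
   This function does not call reboot(). */
static int
reboot_magic_ok(unsigned int magic, unsigned int magic2)
{
    if (magic != LINUX_REBOOT_MAGIC1)
        return 0;

    return magic2 == LINUX_REBOOT_MAGIC2
        || magic2 == LINUX_REBOOT_MAGIC2A
        || magic2 == LINUX_REBOOT_MAGIC2B
        || magic2 == LINUX_REBOOT_MAGIC2C;
}
```

Programs that use the one-argument reboot() library wrapper do not need such a check, since the wrapper supplies the magic values itself.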
The op argument can have the following values:
LINUX_REBOOT_CMD_CAD_OFF
(RB_DISABLE_CAD, 0). CAD is disabled. This means that the CAD keystroke will cause a SIGINT signal to be sent to init (process 1), whereupon this process may decide upon a proper action (maybe: kill all processes, sync, reboot).
LINUX_REBOOT_CMD_CAD_ON
(RB_ENABLE_CAD, 0x89abcdef). CAD is enabled. This means that the CAD keystroke will immediately cause the action associated with LINUX_REBOOT_CMD_RESTART.
LINUX_REBOOT_CMD_HALT
(RB_HALT_SYSTEM, 0xcdef0123; since Linux 1.1.76). The message “System halted.” is printed, and the system is halted. Control is given to the ROM monitor, if there is one. If not preceded by a sync(2), data will be lost.
LINUX_REBOOT_CMD_KEXEC
(RB_KEXEC, 0x45584543, since Linux 2.6.13). Execute a kernel that has been loaded earlier with kexec_load(2). This option is available only if the kernel was configured with CONFIG_KEXEC.
LINUX_REBOOT_CMD_POWER_OFF
(RB_POWER_OFF, 0x4321fedc; since Linux 2.1.30). The message “Power down.” is printed, the system is stopped, and all power is removed from the system, if possible. If not preceded by a sync(2), data will be lost.
LINUX_REBOOT_CMD_RESTART
(RB_AUTOBOOT, 0x1234567). The message “Restarting system.” is printed, and a default restart is performed immediately. If not preceded by a sync(2), data will be lost.
LINUX_REBOOT_CMD_RESTART2
(0xa1b2c3d4; since Linux 2.1.30). The message “Restarting system with command ‘%s’” is printed, and a restart (using the command string given in arg) is performed immediately. If not preceded by a sync(2), data will be lost.
LINUX_REBOOT_CMD_SW_SUSPEND
(RB_SW_SUSPEND, 0xd000fce1; since Linux 2.5.18). The system is suspended (hibernated) to disk. This option is available only if the kernel was configured with CONFIG_HIBERNATION.
Only the superuser may call reboot().
The precise effect of the above actions depends on the architecture. For the i386 architecture, the additional argument does not do anything at present (2.1.122), but the type of reboot can be determined by kernel command-line arguments (“reboot=…”) to be either warm or cold, and either hard or through the BIOS.
Behavior inside PID namespaces
Since Linux 3.4, if reboot() is called from a PID namespace other than the initial PID namespace with one of the op values listed below, it performs a “reboot” of that namespace: the “init” process of the PID namespace is immediately terminated, with the effects described in pid_namespaces(7).
The values that can be supplied in op when calling reboot() in this case are as follows:
LINUX_REBOOT_CMD_RESTART
LINUX_REBOOT_CMD_RESTART2
The “init” process is terminated, and wait(2) in the parent process reports that the child was killed with a SIGHUP signal.
LINUX_REBOOT_CMD_POWER_OFF
LINUX_REBOOT_CMD_HALT
The “init” process is terminated, and wait(2) in the parent process reports that the child was killed with a SIGINT signal.
For the other op values, reboot() returns -1 and errno is set to EINVAL.
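A parent process that supervises the “init” process of a PID namespace could interpret the wait status along the following lines. This is only a sketch: the enum and classify_ns_exit() are hypothetical names, and the mapping is taken directly from the signal numbers listed above.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical classification of how a namespace "init" ended. */
enum ns_shutdown { NS_RESTART, NS_HALT, NS_OTHER };

/* Hypothetical helper: interpret the status returned by wait(2) for
   the "init" process of a PID namespace. */
static enum ns_shutdown
classify_ns_exit(int wstatus)
{
    if (!WIFSIGNALED(wstatus))
        return NS_OTHER;        /* Normal exit, stop, and so on. */

    switch (WTERMSIG(wstatus)) {
    case SIGHUP:                /* CMD_RESTART or CMD_RESTART2 */
        return NS_RESTART;
    case SIGINT:                /* CMD_POWER_OFF or CMD_HALT */
        return NS_HALT;
    default:
        return NS_OTHER;
    }
}
```

A container manager might use NS_RESTART to decide to start the namespace again and NS_HALT to tear it down.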
RETURN VALUE
For the values of op that stop or restart the system, a successful call to reboot() does not return. For the other op values, zero is returned on success. In all cases, -1 is returned on failure, and errno is set to indicate the error.
ERRORS
EFAULT
Problem with getting user-space data under LINUX_REBOOT_CMD_RESTART2.
EINVAL
Bad magic numbers or op.
EPERM
The calling process has insufficient privilege to call reboot(); the caller must have the CAP_SYS_BOOT capability inside its user namespace.
STANDARDS
Linux.
SEE ALSO
systemctl(1), systemd(1), kexec_load(2), sync(2), bootparam(7), capabilities(7), ctrlaltdel(8), halt(8), shutdown(8)
239 - Linux cli command mknodat
NAME mknodat
create a special or ordinary file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int mknod(const char *pathname, mode_t mode, dev_t dev);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int mknodat(int dirfd, const char *pathname, mode_t mode, dev_t dev);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
mknod():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE || _SVID_SOURCE
DESCRIPTION
The system call mknod() creates a filesystem node (file, device special file, or named pipe) named pathname, with attributes specified by mode and dev.
The mode argument specifies both the file mode to use and the type of node to be created. It should be a combination (using bitwise OR) of one of the file types listed below and zero or more of the file mode bits listed in inode(7).
The file mode is modified by the process’s umask in the usual way: in the absence of a default ACL, the permissions of the created node are (mode & ~umask).
The file type must be one of S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO, or S_IFSOCK to specify a regular file (which will be created empty), character special file, block special file, FIFO (named pipe), or UNIX domain socket, respectively. (Zero file type is equivalent to type S_IFREG.)
If the file type is S_IFCHR or S_IFBLK, then dev specifies the major and minor numbers of the newly created device special file (makedev(3) may be useful to build the value for dev); otherwise it is ignored.
If pathname already exists, or is a symbolic link, this call fails with an EEXIST error.
The newly created node will be owned by the effective user ID of the process. If the directory containing the node has the set-group-ID bit set, or if the filesystem is mounted with BSD group semantics, the new node will inherit the group ownership from its parent directory; otherwise it will be owned by the effective group ID of the process.
mknodat()
The mknodat() system call operates in exactly the same way as mknod(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by mknod() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like mknod()).
If pathname is absolute, then dirfd is ignored.
See openat(2) for an explanation of the need for mknodat().
RETURN VALUE
mknod() and mknodat() return zero on success. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
The parent directory does not allow write permission to the process, or one of the directories in the path prefix of pathname did not allow search permission. (See also path_resolution(7).)
EBADF
(mknodat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EDQUOT
The user’s quota of disk blocks or inodes on the filesystem has been exhausted.
EEXIST
pathname already exists. This includes the case where pathname is a symbolic link, dangling or not.
EFAULT
pathname points outside your accessible address space.
EINVAL
mode requested creation of something other than a regular file, device special file, FIFO or socket.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname was too long.
ENOENT
A directory component in pathname does not exist or is a dangling symbolic link.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The device containing pathname has no room for the new node.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory.
ENOTDIR
(mknodat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
mode requested creation of something other than a regular file, FIFO (named pipe), or UNIX domain socket, and the caller is not privileged (Linux: does not have the CAP_MKNOD capability); also returned if the filesystem containing pathname does not support the type of node requested.
EROFS
pathname refers to a file on a read-only filesystem.
VERSIONS
POSIX.1-2001 says: “The only portable use of mknod() is to create a FIFO-special file. If mode is not S_IFIFO or dev is not 0, the behavior of mknod() is unspecified.” However, nowadays one should never use mknod() for this purpose; one should use mkfifo(3), a function especially defined for this purpose.
Under Linux, mknod() cannot be used to create directories. One should make directories with mkdir(2).
STANDARDS
POSIX.1-2008.
HISTORY
mknod()
SVr4, 4.4BSD, POSIX.1-2001 (but see VERSIONS).
mknodat()
Linux 2.6.16, glibc 2.4. POSIX.1-2008.
NOTES
There are many infelicities in the protocol underlying NFS. Some of these affect mknod() and mknodat().
SEE ALSO
mknod(1), chmod(2), chown(2), fcntl(2), mkdir(2), mount(2), socket(2), stat(2), umask(2), unlink(2), makedev(3), mkfifo(3), acl(5), path_resolution(7)
240 - Linux cli command pidfd_getfd
NAME 🖥️ pidfd_getfd 🖥️
obtain a duplicate of another process’s file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_pidfd_getfd, int pidfd, int targetfd,
unsigned int flags);
Note: glibc provides no wrapper for pidfd_getfd(), necessitating the use of syscall(2).
DESCRIPTION
The pidfd_getfd() system call allocates a new file descriptor in the calling process. This new file descriptor is a duplicate of an existing file descriptor, targetfd, in the process referred to by the PID file descriptor pidfd.
The duplicate file descriptor refers to the same open file description (see open(2)) as the original file descriptor in the process referred to by pidfd. The two file descriptors thus share file status flags and file offset. Furthermore, operations on the underlying file object (for example, assigning an address to a socket object using bind(2)) can equally be performed via the duplicate file descriptor.
The close-on-exec flag (FD_CLOEXEC; see fcntl(2)) is set on the file descriptor returned by pidfd_getfd().
The flags argument is reserved for future use. Currently, it must be specified as 0.
Permission to duplicate another process’s file descriptor is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check (see ptrace(2)).
RETURN VALUE
On success, pidfd_getfd() returns a file descriptor (a nonnegative integer). On error, -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
pidfd is not a valid PID file descriptor.
EBADF
targetfd is not an open file descriptor in the process referred to by pidfd.
EINVAL
flags is not 0.
EMFILE
The per-process limit on the number of open file descriptors has been reached (see the description of RLIMIT_NOFILE in getrlimit(2)).
ENFILE
The system-wide limit on the total number of open files has been reached.
EPERM
The calling process did not have PTRACE_MODE_ATTACH_REALCREDS permissions (see ptrace(2)) over the process referred to by pidfd.
ESRCH
The process referred to by pidfd does not exist (i.e., it has terminated and been waited on).
STANDARDS
Linux.
HISTORY
Linux 5.6.
NOTES
For a description of PID file descriptors, see pidfd_open(2).
The effect of pidfd_getfd() is similar to the use of SCM_RIGHTS messages described in unix(7), but differs in the following respects:
In order to pass a file descriptor using an SCM_RIGHTS message, the two processes must first establish a UNIX domain socket connection.
The use of SCM_RIGHTS requires cooperation on the part of the process whose file descriptor is being copied. By contrast, no such cooperation is necessary when using pidfd_getfd().
The ability to use pidfd_getfd() is restricted by a PTRACE_MODE_ATTACH_REALCREDS ptrace access mode check.
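The call sequence described on this page can be sketched as follows: obtain a PID file descriptor with pidfd_open(2), then duplicate one of the target's descriptors with pidfd_getfd(). The wrapper names sys_pidfd_open and sys_pidfd_getfd and the helper steal_fd are hypothetical, and the sketch assumes kernel headers recent enough to define SYS_pidfd_open and SYS_pidfd_getfd.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical thin wrappers around the raw system calls. */
static int
sys_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int) syscall(SYS_pidfd_open, pid, flags);
}

static int
sys_pidfd_getfd(int pidfd, int targetfd, unsigned int flags)
{
    return (int) syscall(SYS_pidfd_getfd, pidfd, targetfd, flags);
}

/* Hypothetical helper: duplicate descriptor `targetfd` of process `pid`
   into the calling process. Returns the new descriptor (which has
   close-on-exec set), or -1 with errno set. */
static int
steal_fd(pid_t pid, int targetfd)
{
    int pidfd, fd, saved;

    pidfd = sys_pidfd_open(pid, 0);
    if (pidfd == -1)
        return -1;

    fd = sys_pidfd_getfd(pidfd, targetfd, 0);   /* flags must be 0 */

    saved = errno;          /* keep the error from pidfd_getfd() */
    close(pidfd);
    errno = saved;
    return fd;
}
```

Because the duplicate refers to the same open file description as the original, data written through the returned descriptor is visible through the target's descriptor as well. A failure with EPERM indicates that the ptrace access mode check was not passed.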
SEE ALSO
clone3(2), dup(2), kcmp(2), pidfd_open(2)
241 - Linux cli command fadvise64_64
NAME 🖥️ fadvise64_64 🖥️
predeclare an access pattern for file data
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h>
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
posix_fadvise():
_POSIX_C_SOURCE >= 200112L
DESCRIPTION
Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.
The advice applies to a (not necessarily existent) region starting at offset and extending for len bytes (or until the end of the file if len is 0) within the file referred to by fd. The advice is not binding; it merely constitutes an expectation on behalf of the application.
Permissible values for advice include:
POSIX_FADV_NORMAL
Indicates that the application has no advice to give about its access pattern for the specified data. If no advice is given for an open file, this is the default assumption.
POSIX_FADV_SEQUENTIAL
The application expects to access the specified data sequentially (with lower offsets read before higher ones).
POSIX_FADV_RANDOM
The specified data will be accessed in random order.
POSIX_FADV_NOREUSE
The specified data will be accessed only once.
Before Linux 2.6.18, POSIX_FADV_NOREUSE had the same semantics as POSIX_FADV_WILLNEED. This was probably a bug; since Linux 2.6.18, this flag is a no-op.
POSIX_FADV_WILLNEED
The specified data will be accessed in the near future.
POSIX_FADV_WILLNEED initiates a nonblocking read of the specified region into the page cache. The amount of data read may be decreased by the kernel depending on virtual memory load. (A few megabytes will usually be fully satisfied, and more is rarely useful.)
POSIX_FADV_DONTNEED
The specified data will not be accessed in the near future.
POSIX_FADV_DONTNEED attempts to free cached pages associated with the specified region. This is useful, for example, while streaming large files. A program may periodically request the kernel to free cached data that has already been used, so that more useful cached pages are not discarded instead.
Requests to discard partial pages are ignored: preserving needed data is preferable to discarding unneeded data. If the application requires that data be considered for discarding, then offset and len must be page-aligned.
The implementation may attempt to write back dirty pages in the specified region, but this is not guaranteed. Any unwritten dirty pages will not be freed. If the application wishes to ensure that dirty pages will be released, it should call fsync(2) or fdatasync(2) first.
RETURN VALUE
On success, zero is returned. On error, an error number is returned.
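Note that the error is delivered as the return value rather than through errno. A minimal sketch of checking it, assuming a hypothetical helper count_bytes_sequential that announces sequential access for a whole file (len of 0) and then reads it:

```c
#define _POSIX_C_SOURCE 200809L   /* the page requires >= 200112L */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read `path` from start to end after advising the
   kernel of sequential access. Returns the number of bytes read, or -1
   if the file could not be opened or read. */
static long
count_bytes_sequential(const char *path)
{
    char buf[65536];
    long total = 0;
    ssize_t n;
    int fd, err;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* offset 0, len 0: the advice covers the file through to its end. */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (err != 0)       /* err is an error number; errno is not used */
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;

    close(fd);
    return n == -1 ? -1 : total;
}
```

The sketch treats a failed posix_fadvise() as non-fatal, since the advice is only a hint and reading works without it.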
ERRORS
EBADF
The fd argument was not a valid file descriptor.
EINVAL
An invalid value was specified for advice.
ESPIPE
The specified file descriptor refers to a pipe or FIFO. (ESPIPE is the error specified by POSIX, but before Linux 2.6.16, Linux returned EINVAL in this case.)
VERSIONS
Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely. These changes affect the entire file, not just the specified region (but other open file handles to the same file are unaffected).
C library/kernel differences
The name of the wrapper function in the C library is posix_fadvise(). The underlying system call is called fadvise64() (or, on some architectures, fadvise64_64()); the difference between the two is that the former system call assumes that the type of the len argument is size_t, while the latter expects loff_t there.
Architecture-specific variants
Some architectures require 64-bit arguments to be aligned in a suitable pair of registers (see syscall(2) for further detail). On such architectures, the call signature of posix_fadvise() shown in the SYNOPSIS would force a register to be wasted as padding between the fd and offset arguments. Therefore, these architectures define a version of the system call that orders the arguments suitably, but is otherwise exactly the same as posix_fadvise().
For example, since Linux 2.6.14, ARM has the following system call:
long arm_fadvise64_64(int fd, int advice,
loff_t offset, loff_t len);
These architecture-specific details are generally hidden from applications by the glibc posix_fadvise() wrapper function, which invokes the appropriate architecture-specific system call.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
Kernel support first appeared in Linux 2.5.60; the underlying system call is called fadvise64(). Library support has been provided since glibc 2.2, via the wrapper function posix_fadvise().
Since Linux 3.18, support for the underlying system call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS configuration option.
The type of the len argument was changed from size_t to off_t in POSIX.1-2001 TC1.
NOTES
The contents of the kernel buffer cache can be cleared via the /proc/sys/vm/drop_caches interface described in proc(5).
One can obtain a snapshot of which pages of a file are resident in the buffer cache by opening a file, mapping it with mmap(2), and then applying mincore(2) to the mapping.
BUGS
Before Linux 2.6.6, if len was specified as 0, then this was interpreted literally as “zero bytes”, rather than as meaning “all bytes through to the end of the file”.
SEE ALSO
fincore(1), mincore(2), readahead(2), sync_file_range(2), posix_fallocate(3), posix_madvise(3)
242 - Linux cli command getresgid32
NAME 🖥️ getresgid32 🖥️
get real, effective, and saved user/group IDs
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
int getresuid(uid_t *ruid, uid_t *euid, uid_t *suid);
int getresgid(gid_t *rgid, gid_t *egid, gid_t *sgid);
DESCRIPTION
getresuid() returns the real UID, the effective UID, and the saved set-user-ID of the calling process, in the arguments ruid, euid, and suid, respectively. getresgid() performs the analogous task for the process’s group IDs.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
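A minimal sketch of calling both functions and printing the results; the helper name print_ids is an illustrative assumption.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: print the real, effective, and saved IDs of the
   calling process. Returns 0 on success, or -1 with errno set. */
static int
print_ids(void)
{
    uid_t ruid, euid, suid;
    gid_t rgid, egid, sgid;

    if (getresuid(&ruid, &euid, &suid) == -1)
        return -1;
    if (getresgid(&rgid, &egid, &sgid) == -1)
        return -1;

    /* uid_t and gid_t are cast for printing; their width is not fixed. */
    printf("UID: real=%ld effective=%ld saved=%ld\n",
           (long) ruid, (long) euid, (long) suid);
    printf("GID: real=%ld effective=%ld saved=%ld\n",
           (long) rgid, (long) egid, (long) sgid);
    return 0;
}
```

For a process that has not changed its credentials, the three values in each line are normally identical.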
ERRORS
EFAULT
One of the arguments specified an address outside the calling program’s address space.
STANDARDS
None. These calls also appear on HP-UX and some of the BSDs.
HISTORY
Linux 2.1.44, glibc 2.3.2.
The original Linux getresuid() and getresgid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added getresuid32() and getresgid32(), supporting 32-bit IDs. The glibc getresuid() and getresgid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getuid(2), setresuid(2), setreuid(2), setuid(2), credentials(7)
243 - Linux cli command unshare
NAME 🖥️ unshare 🖥️
disassociate parts of the process execution context
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE
#include <sched.h>
int unshare(int flags);
DESCRIPTION
unshare() allows a process (or thread) to disassociate parts of its execution context that are currently being shared with other processes (or threads). Part of the execution context, such as the mount namespace, is shared implicitly when a new process is created using fork(2) or vfork(2), while other parts, such as virtual memory, may be shared by explicit request when creating a process or thread using clone(2).
The main use of unshare() is to allow a process to control its shared execution context without creating a new process.
The flags argument is a bit mask that specifies which parts of the execution context should be unshared. This argument is specified by ORing together zero or more of the following constants:
CLONE_FILES
Reverse the effect of the clone(2) CLONE_FILES flag. Unshare the file descriptor table, so that the calling process no longer shares its file descriptors with any other process.
CLONE_FS
Reverse the effect of the clone(2) CLONE_FS flag. Unshare filesystem attributes, so that the calling process no longer shares its root directory (chroot(2)), current directory (chdir(2)), or umask (umask(2)) attributes with any other process.
CLONE_NEWCGROUP (since Linux 4.6)
This flag has the same effect as the clone(2) CLONE_NEWCGROUP flag. Unshare the cgroup namespace. Use of CLONE_NEWCGROUP requires the CAP_SYS_ADMIN capability.
CLONE_NEWIPC (since Linux 2.6.19)
This flag has the same effect as the clone(2) CLONE_NEWIPC flag. Unshare the IPC namespace, so that the calling process has a private copy of the IPC namespace which is not shared with any other process. Specifying this flag automatically implies CLONE_SYSVSEM as well. Use of CLONE_NEWIPC requires the CAP_SYS_ADMIN capability.
CLONE_NEWNET (since Linux 2.6.24)
This flag has the same effect as the clone(2) CLONE_NEWNET flag. Unshare the network namespace, so that the calling process is moved into a new network namespace which is not shared with any previously existing process. Use of CLONE_NEWNET requires the CAP_SYS_ADMIN capability.
CLONE_NEWNS
This flag has the same effect as the clone(2) CLONE_NEWNS flag. Unshare the mount namespace, so that the calling process has a private copy of its namespace which is not shared with any other process. Specifying this flag automatically implies CLONE_FS as well. Use of CLONE_NEWNS requires the CAP_SYS_ADMIN capability. For further information, see mount_namespaces(7).
CLONE_NEWPID (since Linux 3.8)
This flag has the same effect as the clone(2) CLONE_NEWPID flag. Unshare the PID namespace, so that the calling process has a new PID namespace for its children which is not shared with any previously existing process. The calling process is not moved into the new namespace. The first child created by the calling process will have the process ID 1 and will assume the role of init(1) in the new namespace. CLONE_NEWPID automatically implies CLONE_THREAD as well. Use of CLONE_NEWPID requires the CAP_SYS_ADMIN capability. For further information, see pid_namespaces(7).
CLONE_NEWTIME (since Linux 5.6)
Unshare the time namespace, so that the calling process has a new time namespace for its children which is not shared with any previously existing process. The calling process is not moved into the new namespace. Use of CLONE_NEWTIME requires the CAP_SYS_ADMIN capability. For further information, see time_namespaces(7).
CLONE_NEWUSER (since Linux 3.8)
This flag has the same effect as the clone(2) CLONE_NEWUSER flag. Unshare the user namespace, so that the calling process is moved into a new user namespace which is not shared with any previously existing process. As with the child process created by clone(2) with the CLONE_NEWUSER flag, the caller obtains a full set of capabilities in the new namespace.
CLONE_NEWUSER requires that the calling process is not threaded; specifying CLONE_NEWUSER automatically implies CLONE_THREAD. Since Linux 3.9, CLONE_NEWUSER also automatically implies CLONE_FS. CLONE_NEWUSER requires that the user ID and group ID of the calling process are mapped to user IDs and group IDs in the user namespace of the calling process at the time of the call.
For further information on user namespaces, see user_namespaces(7).
CLONE_NEWUTS (since Linux 2.6.19)
This flag has the same effect as the clone(2) CLONE_NEWUTS flag. Unshare the UTS namespace, so that the calling process has a private copy of the UTS namespace which is not shared with any other process. Use of CLONE_NEWUTS requires the CAP_SYS_ADMIN capability.
CLONE_SYSVSEM (since Linux 2.6.26)
This flag reverses the effect of the clone(2) CLONE_SYSVSEM flag. Unshare System V semaphore adjustment (semadj) values, so that the calling process has a new empty semadj list that is not shared with any other process. If this is the last process that has a reference to the process’s current semadj list, then the adjustments in that list are applied to the corresponding semaphores, as described in semop(2).
In addition, CLONE_THREAD, CLONE_SIGHAND, and CLONE_VM can be specified in flags if the caller is single threaded (i.e., it is not sharing its address space with another process or thread). In this case, these flags have no effect. (Note also that specifying CLONE_THREAD automatically implies CLONE_VM, and specifying CLONE_VM automatically implies CLONE_SIGHAND.) If the process is multithreaded, then the use of these flags results in an error.
If flags is specified as zero, then unshare() is a no-op; no changes are made to the calling process’s execution context.
RETURN VALUE
On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EINVAL
An invalid bit was specified in flags.
EINVAL
CLONE_THREAD, CLONE_SIGHAND, or CLONE_VM was specified in flags, and the caller is multithreaded.
EINVAL
CLONE_NEWIPC was specified in flags, but the kernel was not configured with the CONFIG_SYSVIPC and CONFIG_IPC_NS options.
EINVAL
CLONE_NEWNET was specified in flags, but the kernel was not configured with the CONFIG_NET_NS option.
EINVAL
CLONE_NEWPID was specified in flags, but the kernel was not configured with the CONFIG_PID_NS option.
EINVAL
CLONE_NEWUSER was specified in flags, but the kernel was not configured with the CONFIG_USER_NS option.
EINVAL
CLONE_NEWUTS was specified in flags, but the kernel was not configured with the CONFIG_UTS_NS option.
EINVAL
CLONE_NEWPID was specified in flags, but the process has previously called unshare() with the CLONE_NEWPID flag.
ENOMEM
Cannot allocate sufficient memory to copy parts of caller’s context that need to be unshared.
ENOSPC (since Linux 3.7)
CLONE_NEWPID was specified in flags, but the limit on the nesting depth of PID namespaces would have been exceeded; see pid_namespaces(7).
ENOSPC (since Linux 4.9; beforehand EUSERS)
CLONE_NEWUSER was specified in flags, and the call would cause the limit on the number of nested user namespaces to be exceeded. See user_namespaces(7).
From Linux 3.11 to Linux 4.8, the error diagnosed in this case was EUSERS.
ENOSPC (since Linux 4.9)
One of the values in flags specified the creation of a new user namespace, but doing so would have caused the limit defined by the corresponding file in /proc/sys/user to be exceeded. For further details, see namespaces(7).
EPERM
The calling process did not have the required privileges for this operation.
EPERM
CLONE_NEWUSER was specified in flags, but either the effective user ID or the effective group ID of the caller does not have a mapping in the parent namespace (see user_namespaces(7)).
EPERM (since Linux 3.9)
CLONE_NEWUSER was specified in flags and the caller is in a chroot environment (i.e., the caller’s root directory does not match the root directory of the mount namespace in which it resides).
EUSERS (from Linux 3.11 to Linux 4.8)
CLONE_NEWUSER was specified in flags, and the limit on the number of nested user namespaces would be exceeded. See the discussion of the ENOSPC error above.
STANDARDS
Linux.
HISTORY
Linux 2.6.16.
NOTES
Not all of the process attributes that can be shared when a new process is created using clone(2) can be unshared using unshare(). In particular, as of Linux 3.8, unshare() does not implement flags that reverse the effects of CLONE_SIGHAND, CLONE_THREAD, or CLONE_VM. Such functionality may be added in the future, if required.
Creating all kinds of namespace, except user namespaces, requires the CAP_SYS_ADMIN capability. However, since creating a user namespace automatically confers a full set of capabilities, creating both a user namespace and any other type of namespace in the same unshare() call does not require the CAP_SYS_ADMIN capability in the original namespace.
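The point about combining a user namespace with another namespace type can be sketched as follows. The helper try_private_hostname is hypothetical: it forks a child, which creates a user namespace and a UTS namespace in a single unshare() call and then changes the hostname inside the new UTS namespace. The sketch assumes that unprivileged user namespaces are permitted on the system; where they are disabled or filtered, the child's unshare() is expected to fail.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if the child created both namespaces
   and set the hostname, the child's errno value if unshare() or
   sethostname() failed, or -1 if the child could not be created or
   did not exit normally. */
static int
try_private_hostname(const char *name)
{
    int status;
    pid_t pid;

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: CLONE_NEWUSER in the same call supplies the
           capabilities needed to create the UTS namespace. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) == -1)
            _exit(errno);

        /* Affects only the child's private UTS namespace. */
        if (sethostname(name, strlen(name)) == -1)
            _exit(errno);

        _exit(0);
    }

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status);
}
```

Running the namespace change in a child keeps the caller's own execution context untouched, so the hostname seen by the parent does not change.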
EXAMPLES
The program below provides a simple implementation of the unshare(1) command, which unshares one or more namespaces and executes the command supplied in its command-line arguments. Here’s an example of the use of this program, running a shell in a new mount namespace, and verifying that the original shell and the new shell are in separate mount namespaces:
$ readlink /proc/$$/ns/mnt
mnt:[4026531840]
$ sudo ./unshare -m /bin/bash
# readlink /proc/$$/ns/mnt
mnt:[4026532325]
The differing output of the two readlink(1) commands shows that the two shells are in different mount namespaces.
Program source
/* unshare.c
A simple implementation of the unshare(1) command: unshare
namespaces and execute a command.
*/
#define _GNU_SOURCE
#include <err.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] program [arg...]\n", pname);
    fprintf(stderr, "Options can be:\n");
    fprintf(stderr, "    -C   unshare cgroup namespace\n");
    fprintf(stderr, "    -i   unshare IPC namespace\n");
    fprintf(stderr, "    -m   unshare mount namespace\n");
    fprintf(stderr, "    -n   unshare network namespace\n");
    fprintf(stderr, "    -p   unshare PID namespace\n");
    fprintf(stderr, "    -t   unshare time namespace\n");
    fprintf(stderr, "    -u   unshare UTS namespace\n");
    fprintf(stderr, "    -U   unshare user namespace\n");
    exit(EXIT_FAILURE);
}

int
main(int argc, char *argv[])
{
    int flags, opt;

    flags = 0;

    while ((opt = getopt(argc, argv, "CimnptuU")) != -1) {
        switch (opt) {
        case 'C': flags |= CLONE_NEWCGROUP; break;
        case 'i': flags |= CLONE_NEWIPC;    break;
        case 'm': flags |= CLONE_NEWNS;     break;
        case 'n': flags |= CLONE_NEWNET;    break;
        case 'p': flags |= CLONE_NEWPID;    break;
        case 't': flags |= CLONE_NEWTIME;   break;
        case 'u': flags |= CLONE_NEWUTS;    break;
        case 'U': flags |= CLONE_NEWUSER;   break;
        default:  usage(argv[0]);
        }
    }

    if (optind >= argc)
        usage(argv[0]);

    if (unshare(flags) == -1)
        err(EXIT_FAILURE, "unshare");

    execvp(argv[optind], &argv[optind]);
    err(EXIT_FAILURE, "execvp");
}
SEE ALSO
unshare(1), clone(2), fork(2), kcmp(2), setns(2), vfork(2), namespaces(7)
Documentation/userspace-api/unshare.rst in the Linux kernel source tree (or Documentation/unshare.txt before Linux 4.12)
244 - Linux cli command fadvise64
NAME 🖥️ fadvise64 🖥️
predeclare an access pattern for file data
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h>
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
posix_fadvise():
_POSIX_C_SOURCE >= 200112L
DESCRIPTION
Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.
The advice applies to a (not necessarily existent) region starting at offset and extending for len bytes (or until the end of the file if len is 0) within the file referred to by fd. The advice is not binding; it merely constitutes an expectation on behalf of the application.
Permissible values for advice include:
POSIX_FADV_NORMAL
Indicates that the application has no advice to give about its access pattern for the specified data. If no advice is given for an open file, this is the default assumption.
POSIX_FADV_SEQUENTIAL
The application expects to access the specified data sequentially (with lower offsets read before higher ones).
POSIX_FADV_RANDOM
The specified data will be accessed in random order.
POSIX_FADV_NOREUSE
The specified data will be accessed only once.
Before Linux 2.6.18, POSIX_FADV_NOREUSE had the same semantics as POSIX_FADV_WILLNEED. This was probably a bug; since Linux 2.6.18, this flag is a no-op.
POSIX_FADV_WILLNEED
The specified data will be accessed in the near future.
POSIX_FADV_WILLNEED initiates a nonblocking read of the specified region into the page cache. The amount of data read may be decreased by the kernel depending on virtual memory load. (A few megabytes will usually be fully satisfied, and more is rarely useful.)
POSIX_FADV_DONTNEED
The specified data will not be accessed in the near future.
POSIX_FADV_DONTNEED attempts to free cached pages associated with the specified region. This is useful, for example, while streaming large files. A program may periodically request the kernel to free cached data that has already been used, so that more useful cached pages are not discarded instead.
Requests to discard partial pages are ignored: preserving needed data is preferable to discarding unneeded data. If the application requires that data be considered for discarding, then offset and len must be page-aligned.
The implementation may attempt to write back dirty pages in the specified region, but this is not guaranteed. Any unwritten dirty pages will not be freed. If the application wishes to ensure that dirty pages will be released, it should call fsync(2) or fdatasync(2) first.
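The alignment and write-back points above can be combined in one helper. This is a sketch under stated assumptions: the name drop_cached_range is hypothetical, the range is shrunk inward to whole pages, and a len of 0 is treated as an empty range rather than as "through to the end of the file".

```c
#define _POSIX_C_SOURCE 200809L   /* the page requires >= 200112L */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to drop cached pages that lie
   wholly inside [offset, offset + len). Returns 0 on success (also when
   the range contains no whole page) or a positive error number. */
static int
drop_cached_range(int fd, off_t offset, off_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t start, end;

    if (page <= 0 || offset < 0 || len < 0)
        return EINVAL;

    start = (offset + page - 1) / page * page;   /* round up */
    end = (offset + len) / page * page;          /* round down */
    if (end <= start)
        return 0;           /* partial pages only: nothing to request */

    /* Write back dirty data first; unwritten dirty pages are not freed. */
    if (fdatasync(fd) == -1)
        return errno;

    return posix_fadvise(fd, start, end - start, POSIX_FADV_DONTNEED);
}
```

A streaming program could call such a helper periodically on the region it has already consumed.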
RETURN VALUE
On success, zero is returned. On error, an error number is returned.
ERRORS
EBADF
The fd argument was not a valid file descriptor.
EINVAL
An invalid value was specified for advice.
ESPIPE
The specified file descriptor refers to a pipe or FIFO. (ESPIPE is the error specified by POSIX, but before Linux 2.6.16, Linux returned EINVAL in this case.)
VERSIONS
Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely. These changes affect the entire file, not just the specified region (but other open file handles to the same file are unaffected).
C library/kernel differences
The name of the wrapper function in the C library is posix_fadvise(). The underlying system call is called fadvise64() (or, on some architectures, fadvise64_64()); the difference between the two is that the former system call assumes that the type of the len argument is size_t, while the latter expects loff_t there.
Architecture-specific variants
Some architectures require 64-bit arguments to be aligned in a suitable pair of registers (see syscall(2) for further detail). On such architectures, the call signature of posix_fadvise() shown in the SYNOPSIS would force a register to be wasted as padding between the fd and offset arguments. Therefore, these architectures define a version of the system call that orders the arguments suitably, but is otherwise exactly the same as posix_fadvise().
For example, since Linux 2.6.14, ARM has the following system call:
long arm_fadvise64_64(int fd, int advice,
loff_t offset, loff_t len);
These architecture-specific details are generally hidden from applications by the glibc posix_fadvise() wrapper function, which invokes the appropriate architecture-specific system call.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
Kernel support first appeared in Linux 2.5.60; the underlying system call is called fadvise64(). Library support has been provided since glibc 2.2, via the wrapper function posix_fadvise().
Since Linux 3.18, support for the underlying system call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS configuration option.
The type of the len argument was changed from size_t to off_t in POSIX.1-2001 TC1.
NOTES
The contents of the kernel buffer cache can be cleared via the /proc/sys/vm/drop_caches interface described in proc(5).
One can obtain a snapshot of which pages of a file are resident in the buffer cache by opening a file, mapping it with mmap(2), and then applying mincore(2) to the mapping.
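The residency snapshot described above can be sketched as follows. The helper name resident_pages is hypothetical, and the sketch reports only a count; the result is a snapshot that may be stale as soon as it is returned.

```c
#define _DEFAULT_SOURCE           /* for the mincore() declaration */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count the pages of `path` that mincore(2) reports
   as resident. Stores the file's page count in *total. Returns the
   resident count, or -1 on error (including an empty file, which cannot
   be mapped). */
static long
resident_pages(const char *path, long *total)
{
    long page = sysconf(_SC_PAGESIZE);
    long resident = -1;
    unsigned char *vec = NULL;
    void *map = MAP_FAILED;
    struct stat sb;
    size_t npages, i;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    if (page <= 0 || fstat(fd, &sb) == -1 || sb.st_size <= 0)
        goto out;

    npages = ((size_t) sb.st_size + (size_t) page - 1) / (size_t) page;
    vec = malloc(npages);
    map = mmap(NULL, (size_t) sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (vec == NULL || map == MAP_FAILED)
        goto out;

    if (mincore(map, (size_t) sb.st_size, vec) == 0) {
        resident = 0;
        for (i = 0; i < npages; i++)
            resident += vec[i] & 1;     /* low bit set: page resident */
        *total = (long) npages;
    }

out:
    free(vec);
    if (map != MAP_FAILED)
        munmap(map, (size_t) sb.st_size);
    close(fd);
    return resident;
}
```

Comparing the count before and after a POSIX_FADV_DONTNEED or POSIX_FADV_WILLNEED request is one way to observe whether the advice had any effect.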
BUGS
Before Linux 2.6.6, if len was specified as 0, then this was interpreted literally as “zero bytes”, rather than as meaning “all bytes through to the end of the file”.
SEE ALSO
fincore(1), mincore(2), readahead(2), sync_file_range(2), posix_fallocate(3), posix_madvise(3)
245 - Linux cli command symlinkat
NAME 🖥️ symlinkat 🖥️
make a new name for a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int symlink(const char *target, const char *linkpath);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int symlinkat(const char *target, int newdirfd, const char *linkpath);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
symlink():
_XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L
|| /* glibc <= 2.19: */ _BSD_SOURCE
symlinkat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
symlink() creates a symbolic link named linkpath which contains the string target.
Symbolic links are interpreted at run time as if the contents of the link had been substituted into the path being followed to find a file or directory.
Symbolic links may contain .. path components, which (if used at the start of the link) refer to the parent directories of that in which the link resides.
A symbolic link (also known as a soft link) may point to an existing file or to a nonexistent one; the latter case is known as a dangling link.
The permissions of a symbolic link are irrelevant; the ownership is ignored when following the link (except when the protected_symlinks feature is enabled, as explained in proc(5)), but is checked when removal or renaming of the link is requested and the link is in a directory with the sticky bit (S_ISVTX) set.
If linkpath exists, it will not be overwritten.
symlinkat()
The symlinkat() system call operates in exactly the same way as symlink(), except for the differences described here.
If the pathname given in linkpath is relative, then it is interpreted relative to the directory referred to by the file descriptor newdirfd (rather than relative to the current working directory of the calling process, as is done by symlink() for a relative pathname).
If linkpath is relative and newdirfd is the special value AT_FDCWD, then linkpath is interpreted relative to the current working directory of the calling process (like symlink()).
If linkpath is absolute, then newdirfd is ignored.
See openat(2) for an explanation of the need for symlinkat().
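A minimal sketch of the relative-path case: the link name is resolved relative to a directory file descriptor, and the link contents are then read back with readlinkat(2). The helper name link_in_dir and its parameters are illustrative assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create the symbolic link `name` inside directory
   `dirpath`, containing the string `target`, then read the link back
   into `buf` (of size `bufsz`, at least 1). Returns 0 on success, or -1
   with errno set. */
static int
link_in_dir(const char *dirpath, const char *target, const char *name,
            char *buf, size_t bufsz)
{
    int dirfd, saved, ret = -1;
    ssize_t n;

    dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    /* `name` is relative, so it is interpreted relative to dirfd.
       `target` is stored as is; it is not checked and may dangle. */
    if (symlinkat(target, dirfd, name) == -1)
        goto out;

    n = readlinkat(dirfd, name, buf, bufsz - 1);
    if (n == -1)
        goto out;
    buf[n] = '\0';          /* readlinkat() does not add a terminator */
    ret = 0;

out:
    saved = errno;          /* keep the error from the failing call */
    close(dirfd);
    errno = saved;
    return ret;
}
```

Calling the helper a second time with the same name fails with EEXIST, because an existing linkpath is never overwritten.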
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Write access to the directory containing linkpath is denied, or one of the directories in the path prefix of linkpath did not allow search permission. (See also path_resolution(7).)
EBADF
(symlinkat()) linkpath is relative but newdirfd is neither AT_FDCWD nor a valid file descriptor.
EDQUOT
The user’s quota of resources on the filesystem has been exhausted. The resources could be inodes or disk blocks, depending on the filesystem implementation.
EEXIST
linkpath already exists.
EFAULT
target or linkpath points outside your accessible address space.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving linkpath.
ENAMETOOLONG
target or linkpath was too long.
ENOENT
A directory component in linkpath does not exist or is a dangling symbolic link, or target or linkpath is an empty string.
ENOENT
(symlinkat()) linkpath is a relative pathname and newdirfd refers to a directory that has been deleted.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The device containing the file has no room for the new directory entry.
ENOTDIR
A component used as a directory in linkpath is not, in fact, a directory.
ENOTDIR
(symlinkat()) linkpath is relative and newdirfd is a file descriptor referring to a file other than a directory.
EPERM
The filesystem containing linkpath does not support the creation of symbolic links.
EROFS
linkpath is on a read-only filesystem.
STANDARDS
POSIX.1-2008.
HISTORY
symlink()
SVr4, 4.3BSD, POSIX.1-2001.
symlinkat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
glibc notes
On older kernels where symlinkat() is unavailable, the glibc wrapper function falls back to the use of symlink(). When linkpath is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the newdirfd argument.
NOTES
No checking of target is done.
Deleting the name referred to by a symbolic link will actually delete the file (unless it also has other hard links). If this behavior is not desired, use link(2).
SEE ALSO
ln(1), namei(1), lchown(2), link(2), lstat(2), open(2), readlink(2), rename(2), unlink(2), path_resolution(7), symlink(7)
246 - Linux cli command name_to_handle_at
NAME
name_to_handle_at - obtain handle for a pathname and open file via a handle
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
int name_to_handle_at(int dirfd, const char *pathname,
struct file_handle *handle,
int *mount_id, int flags);
int open_by_handle_at(int mount_fd, struct file_handle *handle,
int flags);
DESCRIPTION
The name_to_handle_at() and open_by_handle_at() system calls split the functionality of openat(2) into two parts: name_to_handle_at() returns an opaque handle that corresponds to a specified file; open_by_handle_at() opens the file corresponding to a handle returned by a previous call to name_to_handle_at() and returns an open file descriptor.
name_to_handle_at()
The name_to_handle_at() system call returns a file handle and a mount ID corresponding to the file specified by the dirfd and pathname arguments. The file handle is returned via the argument handle, which is a pointer to a structure of the following form:
struct file_handle {
unsigned int handle_bytes; /* Size of f_handle [in, out] */
int handle_type; /* Handle type [out] */
unsigned char f_handle[0]; /* File identifier (sized by
caller) [out] */
};
It is the caller’s responsibility to allocate the structure with a size large enough to hold the handle returned in f_handle. Before the call, the handle_bytes field should be initialized to contain the allocated size for f_handle. (The constant MAX_HANDLE_SZ, defined in <fcntl.h>, specifies the maximum expected size for a file handle. It is not a guaranteed upper limit as future filesystems may require more space.) Upon successful return, the handle_bytes field is updated to contain the number of bytes actually written to f_handle.
The caller can discover the required size for the file_handle structure by making a call in which handle->handle_bytes is zero; in this case, the call fails with the error EOVERFLOW and handle->handle_bytes is set to indicate the required size; the caller can then use this information to allocate a structure of the correct size (see EXAMPLES below). Some care is needed here as EOVERFLOW can also indicate that no file handle is available for this particular name in a filesystem which does normally support file-handle lookup. This case can be detected when the EOVERFLOW error is returned without handle_bytes being increased.
Other than the use of the handle_bytes field, the caller should treat the file_handle structure as an opaque data type: the handle_type and f_handle fields can be used in a subsequent call to open_by_handle_at(). The caller can also use the opaque file_handle to compare the identity of filesystem objects that were queried at different times and possibly at different paths. The fanotify(7) subsystem can report events with an information record containing a file_handle to identify the filesystem object.
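The identity comparison mentioned above might be sketched as follows. The helper names get_handle() and same_object() are hypothetical. The sketch assumes that a buffer of MAX_HANDLE_SZ bytes is large enough (the call fails with EOVERFLOW otherwise) and, as a simplification, treats objects reported under different mount IDs as different, which can give false negatives for bind mounts.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

/* Obtain a handle for 'path' (symbolic links are not followed).
   Returns a malloc()ed structure that the caller frees, or NULL
   with errno set. */
static struct file_handle *
get_handle(const char *path, int *mount_id)
{
    struct file_handle *fhp;
    int saved_errno;

    fhp = malloc(sizeof(*fhp) + MAX_HANDLE_SZ);
    if (fhp == NULL)
        return NULL;

    fhp->handle_bytes = MAX_HANDLE_SZ;  /* Space available in f_handle */
    if (name_to_handle_at(AT_FDCWD, path, fhp, mount_id, 0) == -1) {
        saved_errno = errno;
        free(fhp);
        errno = saved_errno;
        return NULL;
    }
    return fhp;
}

/* Returns 1 if both paths name the same filesystem object, 0 if
   they do not, and -1 if a handle could not be obtained. */
int
same_object(const char *path1, const char *path2)
{
    int mnt1, mnt2, same, saved_errno;
    struct file_handle *h1, *h2;

    h1 = get_handle(path1, &mnt1);
    if (h1 == NULL)
        return -1;

    h2 = get_handle(path2, &mnt2);
    if (h2 == NULL) {
        saved_errno = errno;
        free(h1);
        errno = saved_errno;
        return -1;
    }

    /* Apart from handle_bytes, the fields are opaque: compare them
       bytewise and do not interpret them. */
    same = mnt1 == mnt2 &&
           h1->handle_type == h2->handle_type &&
           h1->handle_bytes == h2->handle_bytes &&
           memcmp(h1->f_handle, h2->f_handle, h1->handle_bytes) == 0;

    free(h1);
    free(h2);
    return same;
}
```

On filesystems that do not support file handles, same_object() returns -1 with errno set to EOPNOTSUPP, so callers need a fallback such as comparing st_dev and st_ino from stat(2).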
The flags argument is a bit mask constructed by ORing together zero or more of AT_HANDLE_FID, AT_EMPTY_PATH, and AT_SYMLINK_FOLLOW, described below.
When flags contains the AT_HANDLE_FID flag (since Linux 6.5), the caller indicates that the returned file_handle is needed to identify the filesystem object, and not for opening the file later, so it should be expected that a subsequent call to open_by_handle_at() with the returned file_handle may fail.
Together, the pathname and dirfd arguments identify the file for which a handle is to be obtained. There are four distinct cases:
If pathname is a nonempty string containing an absolute pathname, then a handle is returned for the file referred to by that pathname. In this case, dirfd is ignored.
If pathname is a nonempty string containing a relative pathname and dirfd has the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the caller, and a handle is returned for the file to which it refers.
If pathname is a nonempty string containing a relative pathname and dirfd is a file descriptor referring to a directory, then pathname is interpreted relative to the directory referred to by dirfd, and a handle is returned for the file to which it refers. (See openat(2) for an explanation of why “directory file descriptors” are useful.)
If pathname is an empty string and flags specifies the value AT_EMPTY_PATH, then dirfd can be an open file descriptor referring to any type of file, or AT_FDCWD, meaning the current working directory, and a handle is returned for the file to which it refers.
The mount_id argument returns an identifier for the filesystem mount that corresponds to pathname. This corresponds to the first field in one of the records in /proc/self/mountinfo. Opening the pathname in the fifth field of that record yields a file descriptor for the mount point; that file descriptor can be used in a subsequent call to open_by_handle_at(). mount_id is returned both for a successful call and for a call that results in the error EOVERFLOW.
By default, name_to_handle_at() does not dereference pathname if it is a symbolic link, and thus returns a handle for the link itself. If AT_SYMLINK_FOLLOW is specified in flags, pathname is dereferenced if it is a symbolic link (so that the call returns a handle for the file referred to by the link).
name_to_handle_at() does not trigger a mount when the final component of the pathname is an automount point. When a filesystem supports both file handles and automount points, a name_to_handle_at() call on an automount point will return with error EOVERFLOW without having increased handle_bytes. This can happen since Linux 4.13 with NFS when accessing a directory which is on a separate filesystem on the server. In this case, the automount can be triggered by adding a “/” to the end of the pathname.
open_by_handle_at()
The open_by_handle_at() system call opens the file referred to by handle, a file handle returned by a previous call to name_to_handle_at().
The mount_fd argument is a file descriptor for any object (file, directory, etc.) in the mounted filesystem with respect to which handle should be interpreted. The special value AT_FDCWD can be specified, meaning the current working directory of the caller.
The flags argument is as for open(2). If handle refers to a symbolic link, the caller must specify the O_PATH flag, and the symbolic link is not dereferenced; the O_NOFOLLOW flag, if specified, is ignored.
The caller must have the CAP_DAC_READ_SEARCH capability to invoke open_by_handle_at().
RETURN VALUE
On success, name_to_handle_at() returns 0, and open_by_handle_at() returns a file descriptor (a nonnegative integer).
In the event of an error, both system calls return -1 and set errno to indicate the error.
ERRORS
name_to_handle_at() and open_by_handle_at() can fail for the same errors as openat(2). In addition, they can fail with the errors noted below.
name_to_handle_at() can fail with the following errors:
EFAULT
pathname, mount_id, or handle points outside your accessible address space.
EINVAL
flags includes an invalid bit value.
EINVAL
handle->handle_bytes is greater than MAX_HANDLE_SZ.
ENOENT
pathname is an empty string, but AT_EMPTY_PATH was not specified in flags.
ENOTDIR
The file descriptor supplied in dirfd does not refer to a directory, and it is not the case that both flags includes AT_EMPTY_PATH and pathname is an empty string.
EOPNOTSUPP
The filesystem does not support decoding of a pathname to a file handle.
EOVERFLOW
The handle->handle_bytes value passed into the call was too small. When this error occurs, handle->handle_bytes is updated to indicate the required size for the handle.
open_by_handle_at() can fail with the following errors:
EBADF
mount_fd is not an open file descriptor.
EBADF
pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
handle points outside your accessible address space.
EINVAL
handle->handle_bytes is greater than MAX_HANDLE_SZ or is equal to zero.
ELOOP
handle refers to a symbolic link, but O_PATH was not specified in flags.
EPERM
The caller does not have the CAP_DAC_READ_SEARCH capability.
ESTALE
The specified handle is not valid for opening a file. This error will occur if, for example, the file has been deleted. This error can also occur if the handle was acquired using the AT_HANDLE_FID flag and the filesystem does not support open_by_handle_at().
VERSIONS
FreeBSD has a broadly similar pair of system calls in the form of getfh() and openfh().
STANDARDS
Linux.
HISTORY
Linux 2.6.39, glibc 2.14.
NOTES
A file handle can be generated in one process using name_to_handle_at() and later used in a different process that calls open_by_handle_at().
Some filesystems don’t support the translation of pathnames to file handles, for example, /proc, /sys, and various network filesystems. Some filesystems support the translation of pathnames to file handles, but do not support using those file handles in open_by_handle_at().
A file handle may become invalid (“stale”) if a file is deleted, or for other filesystem-specific reasons. Invalid handles are reported by an ESTALE error from open_by_handle_at().
These system calls are designed for use by user-space file servers. For example, a user-space NFS server might generate a file handle and pass it to an NFS client. Later, when the client wants to open the file, it could pass the handle back to the server. This sort of functionality allows a user-space file server to operate in a stateless fashion with respect to the files it serves.
If pathname refers to a symbolic link and flags does not specify AT_SYMLINK_FOLLOW, then name_to_handle_at() returns a handle for the link (rather than the file to which it refers). The process receiving the handle can later perform operations on the symbolic link by converting the handle to a file descriptor using open_by_handle_at() with the O_PATH flag, and then passing the file descriptor as the dirfd argument in system calls such as readlinkat(2) and fchownat(2).
Obtaining a persistent filesystem ID
The mount IDs in /proc/self/mountinfo can be reused as filesystems are unmounted and mounted. Therefore, the mount ID returned by name_to_handle_at() (in *mount_id) should not be treated as a persistent identifier for the corresponding mounted filesystem. However, an application can use the information in the mountinfo record that corresponds to the mount ID to derive a persistent identifier.
For example, one can use the device name in the fifth field of the mountinfo record to search for the corresponding device UUID via the symbolic links in /dev/disk/by-uuid. (A more comfortable way of obtaining the UUID is to use the libblkid(3) library.) That process can then be reversed, using the UUID to look up the device name, and then obtaining the corresponding mount point, in order to produce the mount_fd argument used by open_by_handle_at().
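The UUID lookup through the by-uuid symbolic links could be sketched like this. The helper name uuid_for_device() is hypothetical, and the sketch assumes the links live in /dev/disk/by-uuid, as is usual on udev-based systems. It does not use libblkid.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BY_UUID_DIR "/dev/disk/by-uuid"     /* Assumed udev layout */

/* Search BY_UUID_DIR for a symbolic link that resolves to 'device'
   (for example, "/dev/sda1"). If one is found and fits, copy the
   link name (the UUID) into 'uuid' and return 1. Return 0 if no
   link matches or the directory cannot be read. */
int
uuid_for_device(const char *device, char *uuid, size_t size)
{
    char linkpath[PATH_MAX], resolved[PATH_MAX], devreal[PATH_MAX];
    struct dirent *de;
    DIR *dp;
    int found = 0;

    if (realpath(device, devreal) == NULL)
        return 0;

    dp = opendir(BY_UUID_DIR);
    if (dp == NULL)
        return 0;

    while (!found && (de = readdir(dp)) != NULL) {
        if (de->d_name[0] == '.')
            continue;               /* Skip "." and ".." */

        snprintf(linkpath, sizeof(linkpath), "%s/%s",
                 BY_UUID_DIR, de->d_name);

        /* Each entry is a symbolic link to the device node. */
        if (realpath(linkpath, resolved) != NULL &&
            strcmp(resolved, devreal) == 0 &&
            strlen(de->d_name) < size) {
            strcpy(uuid, de->d_name);
            found = 1;
        }
    }

    closedir(dp);
    return found;
}
```

The reverse direction (UUID to device name) amounts to calling realpath(3) on the corresponding entry in the same directory.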
EXAMPLES
The two programs below demonstrate the use of name_to_handle_at() and open_by_handle_at(). The first program (t_name_to_handle_at.c) uses name_to_handle_at() to obtain the file handle and mount ID for the file specified in its command-line argument; the handle and mount ID are written to standard output.
The second program (t_open_by_handle_at.c) reads a mount ID and file handle from standard input. The program then employs open_by_handle_at() to open the file using that handle. If an optional command-line argument is supplied, then the mount_fd argument for open_by_handle_at() is obtained by opening the directory named in that argument. Otherwise, mount_fd is obtained by scanning /proc/self/mountinfo to find a record whose mount ID matches the mount ID read from standard input, and the mount directory specified in that record is opened. (These programs do not deal with the fact that mount IDs are not persistent.)
The following shell session demonstrates the use of these two programs:
$ echo 'Can you please think about it?' > cecilia.txt
$ ./t_name_to_handle_at cecilia.txt > fh
$ ./t_open_by_handle_at < fh
open_by_handle_at: Operation not permitted
$ sudo ./t_open_by_handle_at < fh # Need CAP_DAC_READ_SEARCH
Read 31 bytes
$ rm cecilia.txt
Now we delete and (quickly) re-create the file so that it has the same content and (by chance) the same inode. Nevertheless, open_by_handle_at() recognizes that the original file referred to by the file handle no longer exists.
$ stat --printf="%i\n" cecilia.txt # Display inode number
4072121
$ rm cecilia.txt
$ echo 'Can you please think about it?' > cecilia.txt
$ stat --printf="%i\n" cecilia.txt # Check inode number
4072121
$ sudo ./t_open_by_handle_at < fh
open_by_handle_at: Stale NFS file handle
Program source: t_name_to_handle_at.c
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
int
main(int argc, char *argv[])
{
    int                 mount_id, fhsize, flags, dirfd;
    char                *pathname;
    struct file_handle  *fhp;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s pathname\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    pathname = argv[1];

    /* Allocate file_handle structure. */

    fhsize = sizeof(*fhp);
    fhp = malloc(fhsize);
    if (fhp == NULL)
        err(EXIT_FAILURE, "malloc");

    /* Make an initial call to name_to_handle_at() to discover
       the size required for file handle. */

    dirfd = AT_FDCWD;           /* For name_to_handle_at() calls */
    flags = 0;                  /* For name_to_handle_at() calls */
    fhp->handle_bytes = 0;
    if (name_to_handle_at(dirfd, pathname, fhp,
                          &mount_id, flags) != -1
        || errno != EOVERFLOW)
    {
        fprintf(stderr, "Unexpected result from name_to_handle_at()\n");
        exit(EXIT_FAILURE);
    }

    /* Reallocate file_handle structure with correct size. */

    fhsize = sizeof(*fhp) + fhp->handle_bytes;
    fhp = realloc(fhp, fhsize);         /* Copies fhp->handle_bytes */
    if (fhp == NULL)
        err(EXIT_FAILURE, "realloc");

    /* Get file handle from pathname supplied on command line. */

    if (name_to_handle_at(dirfd, pathname, fhp, &mount_id, flags) == -1)
        err(EXIT_FAILURE, "name_to_handle_at");

    /* Write mount ID, file handle size, and file handle to stdout,
       for later reuse by t_open_by_handle_at.c. */

    printf("%d\n", mount_id);
    printf("%u %d   ", fhp->handle_bytes, fhp->handle_type);
    for (size_t j = 0; j < fhp->handle_bytes; j++)
        printf(" %02x", fhp->f_handle[j]);
    printf("\n");

    exit(EXIT_SUCCESS);
}
Program source: t_open_by_handle_at.c
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
/* Scan /proc/self/mountinfo to find the line whose mount ID matches
'mount_id'. (An easier way to do this is to install and use the
'libmount' library provided by the 'util-linux' project.)
Open the corresponding mount path and return the resulting file
descriptor. */
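static int
open_mount_path_by_id(int mount_id)
{
    int      mi_mount_id, found;
    char     mount_path[PATH_MAX];
    char     *linep;
    FILE     *fp;
    size_t   lsize;
    ssize_t  nread;

    fp = fopen("/proc/self/mountinfo", "r");
    if (fp == NULL)
        err(EXIT_FAILURE, "fopen");

    found = 0;
    linep = NULL;
    while (!found) {
        nread = getline(&linep, &lsize, fp);
        if (nread == -1)
            break;

        nread = sscanf(linep, "%d %*d %*s %*s %s",
                       &mi_mount_id, mount_path);
        if (nread != 2) {
            fprintf(stderr, "Bad sscanf()\n");
            exit(EXIT_FAILURE);
        }

        if (mi_mount_id == mount_id)
            found = 1;
    }
    free(linep);
    fclose(fp);

    if (!found) {
        fprintf(stderr, "Could not find mount point\n");
        exit(EXIT_FAILURE);
    }

    return open(mount_path, O_RDONLY);
}

int
main(int argc, char *argv[])
{
    int                 mount_id, fd, mount_fd, handle_bytes;
    char                buf[1000];
#define LINE_SIZE 100
    char                line1[LINE_SIZE], line2[LINE_SIZE];
    char                *nextp;
    ssize_t             nread;
    struct file_handle  *fhp;

    if ((argc > 1 && strcmp(argv[1], "--help") == 0) || argc > 2) {
        fprintf(stderr, "Usage: %s [mount-path]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Standard input contains mount ID and file handle information:

         Line 1: <mount_id>
         Line 2: <handle_bytes> <handle_type>   <bytes of handle in hex>
    */

    if (fgets(line1, sizeof(line1), stdin) == NULL ||
        fgets(line2, sizeof(line2), stdin) == NULL)
    {
        fprintf(stderr, "Missing mount_id / file handle\n");
        exit(EXIT_FAILURE);
    }

    mount_id = atoi(line1);
    handle_bytes = strtoul(line2, &nextp, 0);

    /* Given handle_bytes, we can now allocate file_handle structure. */

    fhp = malloc(sizeof(*fhp) + handle_bytes);
    if (fhp == NULL)
        err(EXIT_FAILURE, "malloc");

    fhp->handle_bytes = handle_bytes;
    fhp->handle_type = strtoul(nextp, &nextp, 0);
    for (size_t j = 0; j < fhp->handle_bytes; j++)
        fhp->f_handle[j] = strtoul(nextp, &nextp, 16);

    /* Obtain file descriptor for mount point, either by opening
       the pathname specified on the command line, or by scanning
       /proc/self/mountinfo to find a mount that matches the
       'mount_id' that we received from stdin. */

    if (argc > 1)
        mount_fd = open(argv[1], O_RDONLY);
    else
        mount_fd = open_mount_path_by_id(mount_id);
    if (mount_fd == -1)
        err(EXIT_FAILURE, "opening mount fd");

    /* Open file using handle and mount point. */

    fd = open_by_handle_at(mount_fd, fhp, O_RDONLY);
    if (fd == -1)
        err(EXIT_FAILURE, "open_by_handle_at");

    /* Try reading a few bytes from the file. */

    nread = read(fd, buf, sizeof(buf));
    if (nread == -1)
        err(EXIT_FAILURE, "read");

    printf("Read %zd bytes\n", nread);

    exit(EXIT_SUCCESS);
}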
SEE ALSO
open(2), libblkid(3), blkid(8), findfs(8), mount(8)
The libblkid and libmount documentation in the latest util-linux release at
247 - Linux cli command timer_delete
NAME
timer_delete - delete a POSIX per-process timer
LIBRARY
Real-time library (librt, -lrt)
SYNOPSIS
#include <time.h>
int timer_delete(timer_t timerid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
timer_delete():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
timer_delete() deletes the timer whose ID is given in timerid. If the timer was armed at the time of this call, it is disarmed before being deleted. The treatment of any pending signal generated by the deleted timer is unspecified.
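A minimal sketch of deleting an armed timer follows. The helper name create_arm_delete() is hypothetical, and SIGEV_NONE is used so that no signal is ever generated, which sidesteps the unspecified handling of pending signals. On older glibc versions the program may need to be linked with -lrt.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <time.h>

/* Create a timer that delivers no notification, arm it to expire
   'seconds' from now, and delete it while it is still armed.
   Returns 0 on success, -1 on error with errno set. */
int
create_arm_delete(time_t seconds)
{
    struct sigevent sev;
    struct itimerspec its;
    timer_t timerid;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_NONE;
    if (timer_create(CLOCK_MONOTONIC, &sev, &timerid) == -1)
        return -1;

    memset(&its, 0, sizeof(its));
    its.it_value.tv_sec = seconds;      /* One-shot expiration */
    if (timer_settime(timerid, 0, &its, NULL) == -1) {
        timer_delete(timerid);
        return -1;
    }

    /* The timer is armed at this point; timer_delete() disarms
       it before deleting it. */
    return timer_delete(timerid);
}
```

After timer_delete() returns, the value in timerid must not be passed to any other timer function.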
RETURN VALUE
On success, timer_delete() returns 0. On failure, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
timerid is not a valid timer ID.
STANDARDS
POSIX.1-2008.
HISTORY
Linux 2.6. POSIX.1-2001.
SEE ALSO
clock_gettime(2), timer_create(2), timer_getoverrun(2), timer_settime(2), time(7)
248 - Linux cli command pipe
NAME
pipe - create pipe
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int pipe(int pipefd[2]);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h> /* Definition of O_* constants */
#include <unistd.h>
int pipe2(int pipefd[2], int flags);
/* On Alpha, IA-64, MIPS, SuperH, and SPARC/SPARC64, pipe() has the
following prototype; see VERSIONS */
#include <unistd.h>
struct fd_pair {
long fd[2];
};
struct fd_pair pipe(void);
DESCRIPTION
pipe() creates a pipe, a unidirectional data channel that can be used for interprocess communication. The array pipefd is used to return two file descriptors referring to the ends of the pipe. pipefd[0] refers to the read end of the pipe. pipefd[1] refers to the write end of the pipe. Data written to the write end of the pipe is buffered by the kernel until it is read from the read end of the pipe. For further details, see pipe(7).
If flags is 0, then pipe2() is the same as pipe(). The following values can be bitwise ORed in flags to obtain different behavior:
O_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the two new file descriptors. See the description of the same flag in open(2) for reasons why this may be useful.
O_DIRECT (since Linux 3.4)
Create a pipe that performs I/O in “packet” mode. Each write(2) to the pipe is dealt with as a separate packet, and read(2)s from the pipe will read one packet at a time. Note the following points:
Writes of greater than PIPE_BUF bytes (see pipe(7)) will be split into multiple packets. The constant PIPE_BUF is defined in <limits.h>.
If a read(2) specifies a buffer size that is smaller than the next packet, then the requested number of bytes are read, and the excess bytes in the packet are discarded. Specifying a buffer size of PIPE_BUF will be sufficient to read the largest possible packets (see the previous point).
Zero-length packets are not supported. (A read(2) that specifies a buffer size of zero is a no-op, and returns 0.)
Older kernels that do not support this flag will indicate this via an EINVAL error.
Since Linux 4.5, it is possible to change the O_DIRECT setting of a pipe file descriptor using fcntl(2).
O_NONBLOCK
Set the O_NONBLOCK file status flag on the open file descriptions referred to by the new file descriptors. Using this flag saves extra calls to fcntl(2) to achieve the same result.
O_NOTIFICATION_PIPE
Since Linux 5.8, a general notification mechanism is built on top of the pipe: the kernel splices notification messages into pipes opened by user space. The owner of the pipe has to tell the kernel which sources of events to watch; filters can also be applied to select which subevents should be placed into the pipe.
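The packet-mode and nonblocking behavior of the flags above might be observed with a sketch like the following. The helper name packet_demo() and its out-parameters are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write two packets to an O_DIRECT pipe and read them back with a
   buffer large enough for either. sizes[0] and sizes[1] receive
   the results of the first two reads; sizes[2] receives the result
   of a third read on the now-empty nonblocking pipe, and *err
   receives errno from that read.
   Returns 0 if the demonstration ran, -1 if setup failed. */
int
packet_demo(ssize_t sizes[3], int *err)
{
    int pipefd[2];
    char buf[64];

    /* Fails with EINVAL on kernels that lack O_DIRECT for pipes. */
    if (pipe2(pipefd, O_DIRECT | O_NONBLOCK | O_CLOEXEC) == -1)
        return -1;

    if (write(pipefd[1], "abc", 3) != 3 ||
        write(pipefd[1], "defgh", 5) != 5) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    /* Each read returns at most one packet, even though the
       buffer could hold both. */
    sizes[0] = read(pipefd[0], buf, sizeof(buf));
    sizes[1] = read(pipefd[0], buf, sizeof(buf));

    /* With O_NONBLOCK set, reading an empty pipe fails rather
       than blocking. */
    errno = 0;
    sizes[2] = read(pipefd[0], buf, sizeof(buf));
    *err = errno;

    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}
```

Without O_DIRECT, the first read would be expected to return all 8 bytes at once, because an ordinary pipe is a plain byte stream.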
RETURN VALUE
On success, zero is returned. On error, -1 is returned, errno is set to indicate the error, and pipefd is left unchanged.
On Linux (and other systems), pipe() does not modify pipefd on failure. A requirement standardizing this behavior was added in POSIX.1-2008 TC2. The Linux-specific pipe2() system call likewise does not modify pipefd on failure.
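The guarantee that pipefd is left untouched on failure can be checked with a short sketch. The helper name is hypothetical, and O_APPEND is used here only as an example of a flag bit that pipe2() is not documented to accept.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Call pipe2() with an unsupported flag.
   Returns 1 if the call failed with EINVAL and left the array
   untouched, 0 otherwise. */
int
failed_pipe2_keeps_array(void)
{
    int pipefd[2] = { -42, -42 };       /* Sentinel values */

    if (pipe2(pipefd, O_APPEND) != -1) {
        /* Unexpected success: release the descriptors. */
        close(pipefd[0]);
        close(pipefd[1]);
        return 0;
    }

    return errno == EINVAL && pipefd[0] == -42 && pipefd[1] == -42;
}
```

Code that relies on this behavior can initialize the array to -1 and later close only the entries that are nonnegative.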
ERRORS
EFAULT
pipefd is not valid.
EINVAL
(pipe2()) Invalid value in flags.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENFILE
The user hard limit on memory that can be allocated for pipes has been reached and the caller is not privileged; see pipe(7).
ENOPKG
(pipe2()) O_NOTIFICATION_PIPE was passed in flags and support for notifications (CONFIG_WATCH_QUEUE) is not compiled into the kernel.
VERSIONS
The System V ABI on some architectures allows the use of more than one register for returning multiple values; several architectures (namely, Alpha, IA-64, MIPS, SuperH, and SPARC/SPARC64) (ab)use this feature in order to implement the pipe() system call in a functional manner: the call doesn’t take any arguments and returns a pair of file descriptors as the return value on success. The glibc pipe() wrapper function transparently deals with this. See syscall(2) for information regarding the registers used for storing the second file descriptor.
STANDARDS
pipe()
POSIX.1-2008.
pipe2()
Linux.
HISTORY
pipe()
POSIX.1-2001.
pipe2()
Linux 2.6.27, glibc 2.9.
EXAMPLES
The following program creates a pipe, and then fork(2)s to create a child process; the child inherits a duplicate set of file descriptors that refer to the same pipe. After the fork(2), each process closes the file descriptors that it doesn’t need for the pipe (see pipe(7)). The parent then writes the string contained in the program’s command-line argument to the pipe, and the child reads this string a byte at a time from the pipe and echoes it on standard output.
Program source
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    int    pipefd[2];
    char   buf;
    pid_t  cpid;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <string>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (pipe(pipefd) == -1) {
        perror("pipe");
        exit(EXIT_FAILURE);
    }

    cpid = fork();
    if (cpid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }

    if (cpid == 0) {    /* Child reads from pipe */
        close(pipefd[1]);          /* Close unused write end */

        while (read(pipefd[0], &buf, 1) > 0)
            write(STDOUT_FILENO, &buf, 1);

        write(STDOUT_FILENO, "\n", 1);
        close(pipefd[0]);
        _exit(EXIT_SUCCESS);

    } else {            /* Parent writes argv[1] to pipe */
        close(pipefd[0]);          /* Close unused read end */
        write(pipefd[1], argv[1], strlen(argv[1]));
        close(pipefd[1]);          /* Reader will see EOF */
        wait(NULL);                /* Wait for child */
        exit(EXIT_SUCCESS);
    }
}
SEE ALSO
fork(2), read(2), socketpair(2), splice(2), tee(2), vmsplice(2), write(2), popen(3), pipe(7)
249 - Linux cli command s390_guarded_storage
NAME
s390_guarded_storage - operations with z/Architecture guarded storage facility
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <asm/guarded_storage.h> /* Definition of GS_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_s390_guarded_storage, int command,
struct gs_cb *gs_cb);
Note: glibc provides no wrapper for s390_guarded_storage(), necessitating the use of syscall(2).
DESCRIPTION
The s390_guarded_storage() system call enables the use of the Guarded Storage Facility (a z/Architecture-specific feature) for user-space processes.
The guarded storage facility is a hardware feature that allows marking up to 64 memory regions (as of z14) as guarded; reading a pointer with the newly introduced “Load Guarded” (LGG) or “Load Logical and Shift Guarded” (LLGFSG) instruction causes a range check on the loaded value and invokes a (previously set up) user-space handler if one of the guarded regions is affected.
The command argument indicates which function to perform. The following commands are supported:
GS_ENABLE
Enable the guarded storage facility for the calling task. The initial content of the guarded storage control block will be all zeros. After enablement, user-space code can use the “Load Guarded Storage Controls” (LGSC) instruction (or the load_gs_cb() function wrapper provided in the asm/guarded_storage.h header) to load an arbitrary control block. While the facility is enabled for a task, the kernel will save and restore that task’s content of the guarded storage registers on context switch.
GS_DISABLE
Disables the use of the guarded storage facility for the calling task. The kernel will cease to save and restore the content of the guarded storage registers; the task-specific content of these registers is lost.
GS_SET_BC_CB
Set a broadcast guarded storage control block to the one provided in the gs_cb argument. This is called per thread and associates a specific guarded storage control block with the calling task. This control block will be used in the broadcast command GS_BROADCAST.
GS_CLEAR_BC_CB
Clears the broadcast guarded storage control block. The guarded storage control block will no longer have the association established by the GS_SET_BC_CB command.
GS_BROADCAST
Sends a broadcast to all thread siblings of the calling task. Every sibling that has established a broadcast guarded storage control block will load this control block and will be enabled for guarded storage. The broadcast guarded storage control block is consumed; a second broadcast without a refresh of the stored control block with GS_SET_BC_CB will not have any effect.
The gs_cb argument specifies the address of a guarded storage control block structure and is currently used only by the GS_SET_BC_CB command; all other aforementioned commands ignore this argument.
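Because glibc provides no wrapper, a call has to go through syscall(2). The sketch below wraps the GS_ENABLE command; the helper name gs_enable_for_task() is hypothetical, and the ENOSYS fallback for systems whose headers do not define SYS_s390_guarded_storage is an assumption added so that the code compiles on other architectures.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifdef SYS_s390_guarded_storage
#include <asm/guarded_storage.h>    /* GS_* constants, struct gs_cb */
#endif

/* Enable the guarded storage facility for the calling task.
   Returns 0 on success, -1 with errno set on failure. Where the
   system call number is not defined (non-s390 headers), fails
   with ENOSYS; that fallback is an assumption of this sketch. */
int
gs_enable_for_task(void)
{
#ifdef SYS_s390_guarded_storage
    /* GS_ENABLE ignores the gs_cb argument, so NULL is passed. */
    return (int) syscall(SYS_s390_guarded_storage, GS_ENABLE, NULL);
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

On s390 hardware without the facility, the call is expected to fail with EOPNOTSUPP, as listed under ERRORS.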
RETURN VALUE
On success, the return value of s390_guarded_storage() is 0.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
command was GS_SET_BC_CB and the copying of the guarded storage control block structure pointed to by the gs_cb argument has failed.
EINVAL
The value provided in the command argument was not valid.
ENOMEM
command was one of GS_ENABLE or GS_SET_BC_CB, and the allocation of a new guarded storage control block has failed.
EOPNOTSUPP
The guarded storage facility is not supported by the hardware.
STANDARDS
Linux on s390.
HISTORY
Linux 4.12. System z14.
NOTES
The description of the guarded storage facility along with related instructions and Guarded Storage Control Block and Guarded Storage Event Parameter List structure layouts is available in “z/Architecture Principles of Operations” beginning from the twelfth edition.
The gs_cb structure has a field gsepla (Guarded Storage Event Parameter List Address), which is a user-space pointer to a Guarded Storage Event Parameter List structure (that contains the address of the aforementioned event handler in the gseha field), and its layout is available as a gs_epl structure type definition in the asm/guarded_storage.h header.
SEE ALSO
syscall(2)
250 - Linux cli command listxattr
NAME
listxattr - list extended attribute names
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
ssize_t listxattr(const char *path, char *_Nullable list, size_t size);
ssize_t llistxattr(const char *path, char *_Nullable list, size_t size);
ssize_t flistxattr(int fd, char *_Nullable list, size_t size);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
listxattr() retrieves the list of extended attribute names associated with the given path in the filesystem. The retrieved list is placed in list, a caller-allocated buffer whose size (in bytes) is specified in the argument size. The list is the set of (null-terminated) names, one after the other. Names of extended attributes to which the calling process does not have access may be omitted from the list. The length of the attribute name list is returned.
llistxattr() is identical to listxattr(), except in the case of a symbolic link, where the list of names of extended attributes associated with the link itself is retrieved, not the file that it refers to.
flistxattr() is identical to listxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path.
A single extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode.
If size is specified as zero, these calls return the current size of the list of extended attribute names (and leave list unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the set of extended attributes may change between the two calls, so that it is still necessary to check the return status from the second call.)
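The probe-then-fetch pattern described above, including a retry when the list grows between the two calls, might be sketched as follows. The helper name list_names() and the retry bound MAX_TRIES are assumptions made for this illustration; they are not part of any library.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Hypothetical helper: returns a malloc()ed buffer holding the name list
 * for path and stores its length in *len_out. Returns NULL on error
 * (errno set, *len_out == -1) or when there are no names (*len_out == 0).
 * The caller frees any non-NULL result.
 */
char *
list_names(const char *path, ssize_t *len_out)
{
    /* MAX_TRIES is an arbitrary bound chosen for this sketch. */
    enum { MAX_TRIES = 5 };

    *len_out = -1;
    for (int i = 0; i < MAX_TRIES; i++) {
        /* Probe: a size of 0 asks only for the current list length. */
        ssize_t need = listxattr(path, NULL, 0);
        if (need == -1)
            return NULL;
        if (need == 0) {
            *len_out = 0;
            return NULL;
        }

        char *buf = malloc(need);
        if (buf == NULL)
            return NULL;

        /* Fetch: this can still fail if the list changed meanwhile. */
        ssize_t got = listxattr(path, buf, need);
        if (got >= 0) {
            *len_out = got;
            return buf;
        }

        int saved = errno;
        free(buf);
        if (saved != ERANGE) {
            errno = saved;
            return NULL;
        }
        /* ERANGE: the list grew between the two calls; probe again. */
    }
    errno = ERANGE;
    return NULL;
}
```

Probing again on each attempt, rather than doubling a guessed size, keeps the buffer no larger than the kernel last reported.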
Example
The list of names is returned as an unordered array of null-terminated character strings (attribute names are separated by null bytes ('\0')), like this:
user.name1\0system.name1\0user.name2\0
Filesystems that implement POSIX ACLs using extended attributes might return a list like this:
system.posix_acl_access\0system.posix_acl_default\0
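Walking such a list is a matter of stepping over each null-terminated name until the reported length is used up. A minimal sketch follows; the helper name print_names() is hypothetical.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper: prints each name in a listxattr()-style buffer of
 * len bytes and returns how many names it found.
 */
size_t
print_names(const char *list, ssize_t len)
{
    size_t count = 0;

    if (list == NULL || len <= 0)
        return 0;

    const char *end = list + len;

    /* Each name ends with a null byte, so strlen() finds its extent. */
    for (const char *name = list; name < end; name += strlen(name) + 1) {
        printf("%s\n", name);
        count++;
    }
    return count;
}
```

Given the first example list above and its length, this helper would print three names.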
RETURN VALUE
On success, a nonnegative number is returned indicating the size of the extended attribute name list. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
E2BIG
The size of the list of extended attribute names is larger than the maximum size allowed; the list cannot be retrieved. This can happen on filesystems that support an unlimited number of extended attributes per file such as XFS, for example. See BUGS.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
ERANGE
The size of the list buffer is too small to hold the result.
In addition, the errors documented in stat(2) can also occur.
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
BUGS
As noted in xattr(7), the VFS imposes a limit of 64 kB on the size of the extended attribute name list returned by listxattr(). If the total size of attribute names attached to a file exceeds this limit, it is no longer possible to retrieve the list of attribute names.
EXAMPLES
The following program demonstrates the usage of listxattr() and getxattr(2). For the file whose pathname is provided as a command-line argument, it lists all extended file attributes and their values.
To keep the code simple, the program assumes that attribute keys and values are constant during the execution of the program. A production program should expect and handle changes during execution of the program. For example, the number of bytes required for attribute keys might increase between the two calls to listxattr(). An application could handle this possibility using a loop that retries the call (perhaps up to a predetermined maximum number of attempts) with a larger buffer each time it fails with the error ERANGE. Calls to getxattr(2) could be handled similarly.
The following output was recorded by first creating a file, setting some extended file attributes, and then listing the attributes with the example program.
Example output
$ touch /tmp/foo
$ setfattr -n user.fred -v chocolate /tmp/foo
$ setfattr -n user.frieda -v bar /tmp/foo
$ setfattr -n user.empty /tmp/foo
$ ./listxattr /tmp/foo
user.fred: chocolate
user.frieda: bar
user.empty: <no value>
Program source (listxattr.c)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>

int
main(int argc, char *argv[])
{
    char     *buf, *key, *val;
    ssize_t  buflen, keylen, vallen;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s path\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /*
     * Determine the length of the buffer needed.
     */
    buflen = listxattr(argv[1], NULL, 0);
    if (buflen == -1) {
        perror("listxattr");
        exit(EXIT_FAILURE);
    }
    if (buflen == 0) {
        printf("%s has no attributes.\n", argv[1]);
        exit(EXIT_SUCCESS);
    }

    /*
     * Allocate the buffer.
     */
    buf = malloc(buflen);
    if (buf == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    /*
     * Copy the list of attribute keys to the buffer.
     */
    buflen = listxattr(argv[1], buf, buflen);
    if (buflen == -1) {
        perror("listxattr");
        exit(EXIT_FAILURE);
    }

    /*
     * Loop over the list of zero terminated strings with the
     * attribute keys. Use the remaining buffer length to determine
     * the end of the list.
     */
    key = buf;
    while (buflen > 0) {

        /*
         * Output attribute key.
         */
        printf("%s: ", key);

        /*
         * Determine length of the value.
         */
        vallen = getxattr(argv[1], key, NULL, 0);
        if (vallen == -1)
            perror("getxattr");

        if (vallen > 0) {

            /*
             * Allocate value buffer.
             * One extra byte is needed to append 0x00.
             */
            val = malloc(vallen + 1);
            if (val == NULL) {
                perror("malloc");
                exit(EXIT_FAILURE);
            }

            /*
             * Copy value to buffer.
             */
            vallen = getxattr(argv[1], key, val, vallen);
            if (vallen == -1) {
                perror("getxattr");
            } else {
                /*
                 * Output attribute value.
                 */
                val[vallen] = 0;
                printf("%s", val);
            }

            free(val);
        } else if (vallen == 0) {
            printf("<no value>");
        }

        printf("\n");

        /*
         * Forward to next attribute key.
         */
        keylen = strlen(key) + 1;
        buflen -= keylen;
        key += keylen;
    }

    free(buf);
    exit(EXIT_SUCCESS);
}
SEE ALSO
getfattr(1), setfattr(1), getxattr(2), open(2), removexattr(2), setxattr(2), stat(2), symlink(7), xattr(7)
251 - Linux cli command accept
NAME accept
accept a connection on a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int accept(int sockfd, struct sockaddr *_Nullable restrict addr,
socklen_t *_Nullable restrict addrlen);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/socket.h>
int accept4(int sockfd, struct sockaddr *_Nullable restrict addr,
socklen_t *_Nullable restrict addrlen, int flags);
DESCRIPTION
The accept() system call is used with connection-based socket types (SOCK_STREAM, SOCK_SEQPACKET). It extracts the first connection request on the queue of pending connections for the listening socket, sockfd, creates a new connected socket, and returns a new file descriptor referring to that socket. The newly created socket is not in the listening state. The original socket sockfd is unaffected by this call.
The argument sockfd is a socket that has been created with socket(2), bound to a local address with bind(2), and is listening for connections after a listen(2).
The argument addr is a pointer to a sockaddr structure. This structure is filled in with the address of the peer socket, as known to the communications layer. The exact format of the address returned in addr is determined by the socket’s address family (see socket(2) and the respective protocol man pages). When addr is NULL, nothing is filled in; in this case, addrlen is not used, and should also be NULL.
The addrlen argument is a value-result argument: the caller must initialize it to contain the size (in bytes) of the structure pointed to by addr; on return it will contain the actual size of the peer address.
The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
If no pending connections are present on the queue, and the socket is not marked as nonblocking, accept() blocks the caller until a connection is present. If the socket is marked nonblocking and no pending connections are present on the queue, accept() fails with the error EAGAIN or EWOULDBLOCK.
In order to be notified of incoming connections on a socket, you can use select(2), poll(2), or epoll(7). A readable event will be delivered when a new connection is attempted and you may then call accept() to get a socket for that connection. Alternatively, you can set the socket to deliver SIGIO when activity occurs on a socket; see socket(7) for details.
If flags is 0, then accept4() is the same as accept(). The following values can be bitwise ORed in flags to obtain different behavior:
SOCK_NONBLOCK
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
SOCK_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
RETURN VALUE
On success, these system calls return a file descriptor for the accepted socket (a nonnegative integer). On error, -1 is returned, errno is set to indicate the error, and addrlen is left unchanged.
Error handling
Linux accept() (and accept4()) passes already-pending network errors on the new socket as an error code from accept(). This behavior differs from other BSD socket implementations. For reliable operation the application should detect the network errors defined for the protocol after accept() and treat them like EAGAIN by retrying. In the case of TCP/IP, these are ENETDOWN, EPROTO, ENOPROTOOPT, EHOSTDOWN, ENONET, EHOSTUNREACH, EOPNOTSUPP, and ENETUNREACH.
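One way to apply this advice is to classify the error after a failed accept() and retry when it is one of the listed codes. The predicate below is a sketch under that reading; its name is hypothetical, and the error list is simply the TCP/IP list from the paragraph above.

```c
#include <errno.h>

/*
 * Hypothetical predicate: returns 1 if err is one of the errors that the
 * text above suggests treating like EAGAIN after accept() on a TCP/IP
 * socket, and 0 otherwise.
 */
int
accept_error_is_transient(int err)
{
    switch (err) {
    case ENETDOWN:
    case EPROTO:
    case ENOPROTOOPT:
    case EHOSTDOWN:
    case ENONET:
    case EHOSTUNREACH:
    case EOPNOTSUPP:
    case ENETUNREACH:
        return 1;
    default:
        /* Compared separately because the two may share one value. */
        return err == EAGAIN || err == EWOULDBLOCK;
    }
}
```

A server loop could call accept(), and on -1 continue the loop when accept_error_is_transient(errno) is nonzero instead of shutting down the listener.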
ERRORS
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and no connections are present to be accepted. POSIX.1-2001 and POSIX.1-2008 allow either error to be returned for this case, and do not require these constants to have the same value, so a portable application should check for both possibilities.
EBADF
sockfd is not an open file descriptor.
ECONNABORTED
A connection has been aborted.
EFAULT
The addr argument is not in a writable part of the user address space.
EINTR
The system call was interrupted by a signal that was caught before a valid connection arrived; see signal(7).
EINVAL
Socket is not listening for connections, or addrlen is invalid (e.g., is negative).
EINVAL
(accept4()) invalid value in flags.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOBUFS
ENOMEM
Not enough free memory. This often means that the memory allocation is limited by the socket buffer limits, not by the system memory.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
EOPNOTSUPP
The referenced socket is not of type SOCK_STREAM.
EPERM
Firewall rules forbid connection.
EPROTO
Protocol error.
In addition, network errors for the new socket and as defined for the protocol may be returned. Various Linux kernels can return other errors such as ENOSR, ESOCKTNOSUPPORT, EPROTONOSUPPORT, ETIMEDOUT. The value ERESTARTSYS may be seen during a trace.
VERSIONS
On Linux, the new socket returned by accept() does not inherit file status flags such as O_NONBLOCK and O_ASYNC from the listening socket. This behavior differs from the canonical BSD sockets implementation. Portable programs should not rely on inheritance or noninheritance of file status flags and always explicitly set all required flags on the socket returned from accept().
STANDARDS
accept()
POSIX.1-2008.
accept4()
Linux.
HISTORY
accept()
POSIX.1-2001, SVr4, 4.4BSD (accept() first appeared in 4.2BSD).
accept4()
Linux 2.6.28, glibc 2.10.
NOTES
There may not always be a connection waiting after a SIGIO is delivered or select(2), poll(2), or epoll(7) return a readability event because the connection might have been removed by an asynchronous network error or another thread before accept() is called. If this happens, then the call will block waiting for the next connection to arrive. To ensure that accept() never blocks, the passed socket sockfd needs to have the O_NONBLOCK flag set (see socket(7)).
For certain protocols which require an explicit confirmation, such as DECnet, accept() can be thought of as merely dequeuing the next connection request and not implying confirmation. Confirmation can be implied by a normal read or write on the new file descriptor, and rejection can be implied by closing the new socket. Currently, only DECnet has these semantics on Linux.
The socklen_t type
In the original BSD sockets implementation (and on other older systems) the third argument of accept() was declared as an int *. A POSIX.1g draft standard wanted to change it into a size_t *; later POSIX standards and glibc 2.x have socklen_t *.
EXAMPLES
See bind(2).
SEE ALSO
bind(2), connect(2), listen(2), select(2), socket(2), socket(7)
252 - Linux cli command setegid
NAME setegid
set effective user or group ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int seteuid(uid_t euid);
int setegid(gid_t egid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
seteuid(), setegid():
_POSIX_C_SOURCE >= 200112L
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
seteuid() sets the effective user ID of the calling process. Unprivileged processes may only set the effective user ID to the real user ID, the effective user ID or the saved set-user-ID.
Precisely the same holds for setegid() with “group” instead of “user”.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where seteuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from seteuid().
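A minimal sketch of a checked call follows. The helper name set_euid_checked() is hypothetical, and the geteuid(2) verification after the call is an extra precaution added for this example, not something the text above requires.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: switches the effective user ID and confirms that
 * the change took effect. Returns 0 on success, -1 on failure.
 */
int
set_euid_checked(uid_t euid)
{
    if (seteuid(euid) == -1) {
        perror("seteuid");
        return -1;
    }

    /* Extra precaution: confirm the new effective UID is in place. */
    if (geteuid() != euid) {
        fprintf(stderr, "seteuid: effective UID was not changed\n");
        return -1;
    }
    return 0;
}
```

A set-user-ID program that drops privileges with such a helper should stop rather than continue if it returns -1.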
ERRORS
EINVAL
The target user or group ID is not valid in this user namespace.
EPERM
In the case of seteuid(): the calling process is not privileged (does not have the CAP_SETUID capability in its user namespace) and euid does not match the current real user ID, current effective user ID, or current saved set-user-ID.
In the case of setegid(): the calling process is not privileged (does not have the CAP_SETGID capability in its user namespace) and egid does not match the current real group ID, current effective group ID, or current saved set-group-ID.
VERSIONS
Setting the effective user (group) ID to the saved set-user-ID (saved set-group-ID) is possible since Linux 1.1.37 (1.1.38). On an arbitrary system one should check _POSIX_SAVED_IDS.
Under glibc 2.0, seteuid(euid) is equivalent to setreuid(-1, euid) and hence may change the saved set-user-ID. Under glibc 2.1 and later, it is equivalent to setresuid(-1, euid, -1) and hence does not change the saved set-user-ID. Analogous remarks hold for setegid(), with the difference that the change in implementation from setregid(-1, egid) to setresgid(-1, egid, -1) occurred in glibc 2.2 or 2.3 (depending on the hardware architecture).
According to POSIX.1, seteuid() (setegid()) need not permit euid (egid) to be the same value as the current effective user (group) ID, and some implementations do not permit this.
C library/kernel differences
On Linux, seteuid() and setegid() are implemented as library functions that call, respectively, setresuid(2) and setresgid(2).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD.
SEE ALSO
geteuid(2), setresuid(2), setreuid(2), setuid(2), capabilities(7), credentials(7), user_namespaces(7)
253 - Linux cli command ftruncate64
NAME ftruncate64
truncate a file to a specified length
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int truncate(const char *path, off_t length);
int ftruncate(int fd, off_t length);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
truncate():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| /* glibc <= 2.19: */ _BSD_SOURCE
ftruncate():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.3.5: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes.
If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0').
The file offset is not changed.
If the size changed, then the st_ctime and st_mtime fields (respectively, time of last status change and time of last modification; see inode(7)) for the file are updated, and the set-user-ID and set-group-ID mode bits may be cleared.
With ftruncate(), the file must be open for writing; with truncate(), the file must be writable.
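The behavior described above can be observed with a small helper that resizes an open file and reports the resulting size. This is only a sketch; the name resize_file() is hypothetical.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: sets the size of the file open on fd to length
 * bytes and returns the size then reported by fstat(2), or -1 on error.
 * fd must be open for writing.
 */
off_t
resize_file(int fd, off_t length)
{
    struct stat st;

    if (ftruncate(fd, length) == -1)
        return -1;
    if (fstat(fd, &st) == -1)
        return -1;
    return st.st_size;
}
```

After a 3-byte file is extended to 4096 bytes this way on a native filesystem, a read at any offset from 3 to 4095 yields null bytes, and the file offset stays where it was before the call.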
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
For truncate():
EACCES
Search permission is denied for a component of the path prefix, or the named file is not writable by the user. (See also path_resolution(7).)
EFAULT
The argument path points outside the process’s allocated address space.
EFBIG
The argument length is larger than the maximum file size. (XSI)
EINTR
While blocked waiting to complete, the call was interrupted by a signal handler; see fcntl(2) and signal(7).
EINVAL
The argument length is negative or larger than the maximum file size.
EIO
An I/O error occurred updating the inode.
EISDIR
The named file is a directory.
ELOOP
Too many symbolic links were encountered in translating the pathname.
ENAMETOOLONG
A component of a pathname exceeded 255 characters, or an entire pathname exceeded 1023 characters.
ENOENT
The named file does not exist.
ENOTDIR
A component of the path prefix is not a directory.
EPERM
The underlying filesystem does not support extending a file beyond its current size.
EPERM
The operation was prevented by a file seal; see fcntl(2).
EROFS
The named file resides on a read-only filesystem.
ETXTBSY
The file is an executable file that is being executed.
For ftruncate() the same errors apply, but instead of things that can be wrong with path, we now have things that can be wrong with the file descriptor, fd:
EBADF
fd is not a valid file descriptor.
EBADF or EINVAL
fd is not open for writing.
EINVAL
fd does not reference a regular file or a POSIX shared memory object.
EINVAL or EBADF
The file descriptor fd is not open for writing. POSIX permits, and portable applications should handle, either error for this case. (Linux produces EINVAL.)
VERSIONS
The details in DESCRIPTION are for XSI-compliant systems. For non-XSI-compliant systems, the POSIX standard allows two behaviors for ftruncate() when length exceeds the file length (note that truncate() is not specified at all in such an environment): either returning an error, or extending the file. Like most UNIX implementations, Linux follows the XSI requirement when dealing with native filesystems. However, some nonnative filesystems do not permit truncate() and ftruncate() to be used to extend a file beyond its current length: a notable example on Linux is VFAT.
On some 32-bit architectures, the calling signature for these system calls differs, for the reasons described in syscall(2).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.4BSD, SVr4 (first appeared in 4.2BSD).
The original Linux truncate() and ftruncate() system calls were not designed to handle large file offsets. Consequently, Linux 2.4 added truncate64() and ftruncate64() system calls that handle large files. However, these details can be ignored by applications using glibc, whose wrapper functions transparently employ the more recent system calls where they are available.
NOTES
ftruncate() can also be used to set the size of a POSIX shared memory object; see shm_open(3).
BUGS
A header file bug in glibc 2.12 meant that the minimum value of _POSIX_C_SOURCE required to expose the declaration of ftruncate() was 200809L instead of 200112L. This has been fixed in later glibc versions.
SEE ALSO
truncate(1), open(2), stat(2), path_resolution(7)
254 - Linux cli command get_thread_area
NAME get_thread_area
manipulate thread-local storage information
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
#if defined __i386__ || defined __x86_64__
# include <asm/ldt.h> /* Definition of struct user_desc */
int syscall(SYS_get_thread_area, struct user_desc *u_info);
int syscall(SYS_set_thread_area, struct user_desc *u_info);
#elif defined __m68k__
int syscall(SYS_get_thread_area);
int syscall(SYS_set_thread_area, unsigned long tp);
#elif defined __mips__ || defined __csky__
int syscall(SYS_set_thread_area, unsigned long addr);
#endif
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
These calls provide architecture-specific support for a thread-local storage implementation. At the moment, set_thread_area() is available on m68k, MIPS, C-SKY, and x86 (both 32-bit and 64-bit variants); get_thread_area() is available on m68k and x86.
On m68k, MIPS and C-SKY, set_thread_area() allows storing an arbitrary pointer (provided in the tp argument on m68k and in the addr argument on MIPS and C-SKY) in the kernel data structure associated with the calling thread; this pointer can later be retrieved using get_thread_area() (see also NOTES for information regarding obtaining the thread pointer on MIPS).
On x86, Linux dedicates three global descriptor table (GDT) entries for thread-local storage. For more information about the GDT, see the Intel Software Developer’s Manual or the AMD Architecture Programming Manual.
Both of these system calls take an argument that is a pointer to a structure of the following type:
struct user_desc {
unsigned int entry_number;
unsigned int base_addr;
unsigned int limit;
unsigned int seg_32bit:1;
unsigned int contents:2;
unsigned int read_exec_only:1;
unsigned int limit_in_pages:1;
unsigned int seg_not_present:1;
unsigned int useable:1;
#ifdef __x86_64__
unsigned int lm:1;
#endif
};
get_thread_area() reads the GDT entry indicated by u_info->entry_number and fills in the rest of the fields in u_info.
set_thread_area() sets a TLS entry in the GDT.
The TLS array entry set by set_thread_area() corresponds to the value of u_info->entry_number passed in by the user. If this value is in bounds, set_thread_area() writes the TLS descriptor pointed to by u_info into the thread’s TLS array.
When set_thread_area() is passed an entry_number of -1, it searches for a free TLS entry. If set_thread_area() finds a free TLS entry, the value of u_info->entry_number is set upon return to show which entry was changed.
A user_desc is considered “empty” if read_exec_only and seg_not_present are set to 1 and all of the other fields are 0. If an “empty” descriptor is passed to set_thread_area(), the corresponding TLS entry will be cleared. See BUGS for additional details.
Since Linux 3.19, set_thread_area() cannot be used to write non-present segments, 16-bit segments, or code segments, although clearing a segment is still acceptable.
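On x86, a request to clear a TLS entry might be built as in the sketch below, which zeroes the whole structure first (as BUGS recommends) and then sets the two bits that mark the descriptor as empty. The helper names make_empty_desc() and clear_tls_entry() are hypothetical, and the code is compiled only on x86 because struct user_desc comes from asm/ldt.h.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <asm/ldt.h>        /* Definition of struct user_desc */
#include <string.h>
#include <sys/syscall.h>    /* Definition of SYS_* constants */
#include <unistd.h>

/*
 * Hypothetical helper: fills *d with an "empty" descriptor. The
 * entry_number field only selects which TLS slot the request targets.
 */
void
make_empty_desc(struct user_desc *d, unsigned int entry)
{
    memset(d, 0, sizeof(*d));   /* also zeroes the padding bits */
    d->entry_number = entry;
    d->read_exec_only = 1;
    d->seg_not_present = 1;
}

/*
 * Hypothetical helper: asks the kernel to clear the given TLS entry.
 * Returns 0 on success, or -1 with errno set.
 */
long
clear_tls_entry(unsigned int entry)
{
    struct user_desc d;

    make_empty_desc(&d, entry);
    return syscall(SYS_set_thread_area, &d);
}
#endif
```

As the ERRORS section notes, the call fails with ENOSYS when invoked as a 64-bit system call, so clear_tls_entry() is meaningful only in 32-bit programs.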
RETURN VALUE
On x86, these system calls return 0 on success, and -1 on failure, with errno set to indicate the error.
On C-SKY, MIPS and m68k, set_thread_area() always returns 0. On m68k, get_thread_area() returns the thread area pointer value (previously set via set_thread_area()).
ERRORS
EFAULT
u_info is an invalid pointer.
EINVAL
u_info->entry_number is out of bounds.
ENOSYS
get_thread_area() or set_thread_area() was invoked as a 64-bit system call.
ESRCH
(set_thread_area()) A free TLS entry could not be located.
STANDARDS
Linux.
HISTORY
set_thread_area()
Linux 2.5.29.
get_thread_area()
Linux 2.5.32.
NOTES
These system calls are generally intended for use only by threading libraries.
arch_prctl(2) can interfere with set_thread_area() on x86. See arch_prctl(2) for more details. This is not normally a problem, as arch_prctl(2) is normally used only by 64-bit programs.
On MIPS, the current value of the thread area pointer can be obtained using the instruction:
rdhwr dest, $29
This instruction traps and is handled by the kernel.
BUGS
On 64-bit kernels before Linux 3.19, one of the padding bits in user_desc, if set, would prevent the descriptor from being considered empty (see modify_ldt(2)). As a result, the only reliable way to clear a TLS entry is to use memset(3) to zero the entire user_desc structure, including padding bits, and then to set the read_exec_only and seg_not_present bits. On Linux 3.19, a user_desc consisting entirely of zeros except for entry_number will also be interpreted as a request to clear a TLS entry, but this behaved differently on older kernels.
Prior to Linux 3.19, the DS and ES segment registers must not reference TLS entries.
SEE ALSO
arch_prctl(2), modify_ldt(2), ptrace(2) (PTRACE_GET_THREAD_AREA and PTRACE_SET_THREAD_AREA)
255 - Linux cli command msgrcv
NAME msgrcv
System V message queue operations
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/msg.h>
int msgsnd(int msqid, const void msgp[.msgsz], size_t msgsz,
int msgflg);
ssize_t msgrcv(int msqid, void msgp[.msgsz], size_t msgsz, long msgtyp,
int msgflg);
DESCRIPTION
The msgsnd() and msgrcv() system calls are used to send messages to, and receive messages from, a System V message queue. The calling process must have write permission on the message queue in order to send a message, and read permission to receive a message.
The msgp argument is a pointer to a caller-defined structure of the following general form:
struct msgbuf {
long mtype; /* message type, must be > 0 */
char mtext[1]; /* message data */
};
The mtext field is an array (or other structure) whose size is specified by msgsz, a nonnegative integer value. Messages of zero length (i.e., no mtext field) are permitted. The mtype field must have a strictly positive integer value. This value can be used by the receiving process for message selection (see the description of msgrcv() below).
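A caller-defined structure of this form, together with one send and one receive, might be sketched as follows. The structure name text_msg, the 64-byte payload size, and the helper names are all assumptions made for this illustration.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/* Caller-defined layout; the 64-byte payload size is an assumption. */
struct text_msg {
    long mtype;         /* message type, must be > 0 */
    char mtext[64];     /* message data */
};

/* Hypothetical helper: queues text (clipped to fit) under type. */
int
send_text(int msqid, long type, const char *text)
{
    struct text_msg m;
    size_t len = strlen(text) + 1;      /* include the terminator */

    if (len > sizeof(m.mtext))
        len = sizeof(m.mtext);
    m.mtype = type;
    memcpy(m.mtext, text, len);
    m.mtext[len - 1] = '\0';

    /* msgsz counts only the bytes of mtext, not the mtype field. */
    return msgsnd(msqid, &m, len, 0);
}

/* Hypothetical helper: receives the first message of any type. */
ssize_t
recv_text(int msqid, struct text_msg *out)
{
    return msgrcv(msqid, out, sizeof(out->mtext), 0, 0);
}
```

A queue for trying this out can be created with msgget(IPC_PRIVATE, 0600) and removed afterwards with msgctl(msqid, IPC_RMID, NULL).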
msgsnd()
The msgsnd() system call appends a copy of the message pointed to by msgp to the message queue whose identifier is specified by msqid.
If sufficient space is available in the queue, msgsnd() succeeds immediately. The queue capacity is governed by the msg_qbytes field in the associated data structure for the message queue. During queue creation this field is initialized to MSGMNB bytes, but this limit can be modified using msgctl(2). A message queue is considered to be full if either of the following conditions is true:
Adding a new message to the queue would cause the total number of bytes in the queue to exceed the queue’s maximum size (the msg_qbytes field).
Adding another message to the queue would cause the total number of messages in the queue to exceed the queue’s maximum size (the msg_qbytes field). This check is necessary to prevent an unlimited number of zero-length messages being placed on the queue. Although such messages contain no data, they nevertheless consume (locked) kernel memory.
If insufficient space is available in the queue, then the default behavior of msgsnd() is to block until space becomes available. If IPC_NOWAIT is specified in msgflg, then the call instead fails with the error EAGAIN.
A blocked msgsnd() call may also fail if:
the queue is removed, in which case the system call fails with errno set to EIDRM; or
a signal is caught, in which case the system call fails with errno set to EINTR; see signal(7). (msgsnd() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.)
Upon successful completion the message queue data structure is updated as follows:
msg_lspid is set to the process ID of the calling process.
msg_qnum is incremented by 1.
msg_stime is set to the current time.
msgrcv()
The msgrcv() system call removes a message from the queue specified by msqid and places it in the buffer pointed to by msgp.
The argument msgsz specifies the maximum size in bytes for the member mtext of the structure pointed to by the msgp argument. If the message text has length greater than msgsz, then the behavior depends on whether MSG_NOERROR is specified in msgflg. If MSG_NOERROR is specified, then the message text will be truncated (and the truncated part will be lost); if MSG_NOERROR is not specified, then the message isn’t removed from the queue and the system call fails returning -1 with errno set to E2BIG.
Unless MSG_COPY is specified in msgflg (see below), the msgtyp argument specifies the type of message requested, as follows:
If msgtyp is 0, then the first message in the queue is read.
If msgtyp is greater than 0, then the first message in the queue of type msgtyp is read, unless MSG_EXCEPT was specified in msgflg, in which case the first message in the queue of type not equal to msgtyp will be read.
If msgtyp is less than 0, then the first message in the queue with the lowest type less than or equal to the absolute value of msgtyp will be read.
The msgflg argument is a bit mask constructed by ORing together zero or more of the following flags:
IPC_NOWAIT
Return immediately if no message of the requested type is in the queue. The system call fails with errno set to ENOMSG.
MSG_COPY (since Linux 3.8)
Nondestructively fetch a copy of the message at the ordinal position in the queue specified by msgtyp (messages are considered to be numbered starting at 0).
This flag must be specified in conjunction with IPC_NOWAIT, with the result that, if there is no message available at the given position, the call fails immediately with the error ENOMSG. Because they alter the meaning of msgtyp in orthogonal ways, MSG_COPY and MSG_EXCEPT may not both be specified in msgflg.
The MSG_COPY flag was added for the implementation of the kernel checkpoint-restore facility and is available only if the kernel was built with the CONFIG_CHECKPOINT_RESTORE option.
MSG_EXCEPT
Used with msgtyp greater than 0 to read the first message in the queue with message type that differs from msgtyp.
MSG_NOERROR
Truncate the message text if it is longer than msgsz bytes.
If no message of the requested type is available and IPC_NOWAIT isn’t specified in msgflg, the calling process is blocked until one of the following conditions occurs:
A message of the desired type is placed in the queue.
The message queue is removed from the system. In this case, the system call fails with errno set to EIDRM.
The calling process catches a signal. In this case, the system call fails with errno set to EINTR. (msgrcv() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.)
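Because msgrcv() is not restarted after a signal handler runs, callers that want a blocking receive usually retry on EINTR themselves. The following is a minimal sketch of that pattern; struct text_msg, send_text(), and msgrcv_retry() are hypothetical names introduced for illustration.

```c
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

/* Hypothetical message layout for this sketch. */
struct text_msg {
    long mtype;
    char mtext[64];
};

/* Send a null-terminated string (truncated to fit) as a message. */
static int
send_text(int qid, long mtype, const char *text)
{
    struct text_msg msg;

    msg.mtype = mtype;
    strncpy(msg.mtext, text, sizeof(msg.mtext) - 1);
    msg.mtext[sizeof(msg.mtext) - 1] = '\0';

    return msgsnd(qid, &msg, strlen(msg.mtext) + 1, 0);
}

/* Receive a message, retrying when a signal handler interrupts the call. */
static ssize_t
msgrcv_retry(int qid, struct text_msg *msg, long msgtyp, int msgflg)
{
    ssize_t n;

    do {
        n = msgrcv(qid, msg, sizeof(msg->mtext), msgtyp, msgflg);
    } while (n == -1 && errno == EINTR);

    return n;
}
```

Other failures, such as EIDRM when the queue is removed, are still reported to the caller, since retrying them would not help.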
Upon successful completion the message queue data structure is updated as follows:
msg_lrpid is set to the process ID of the calling process.
msg_qnum is decremented by 1.
msg_rtime is set to the current time.
RETURN VALUE
On success, msgsnd() returns 0 and msgrcv() returns the number of bytes actually copied into the mtext array. On failure, both functions return -1, and set errno to indicate the error.
ERRORS
msgsnd() can fail with the following errors:
EACCES
The calling process does not have write permission on the message queue, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EAGAIN
The message can’t be sent due to the msg_qbytes limit for the queue and IPC_NOWAIT was specified in msgflg.
EFAULT
The address pointed to by msgp isn’t accessible.
EIDRM
The message queue was removed.
EINTR
The process caught a signal while sleeping on a full message queue.
EINVAL
Invalid msqid value, or nonpositive mtype value, or invalid msgsz value (less than 0 or greater than the system value MSGMAX).
ENOMEM
The system does not have enough memory to make a copy of the message pointed to by msgp.
msgrcv() can fail with the following errors:
E2BIG
The message text length is greater than msgsz and MSG_NOERROR isn’t specified in msgflg.
EACCES
The calling process does not have read permission on the message queue, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EFAULT
The address pointed to by msgp isn’t accessible.
EIDRM
While the process was sleeping to receive a message, the message queue was removed.
EINTR
While the process was sleeping to receive a message, the process caught a signal; see signal(7).
EINVAL
msqid was invalid, or msgsz was less than 0.
EINVAL (since Linux 3.14)
msgflg specified MSG_COPY, but not IPC_NOWAIT.
EINVAL (since Linux 3.14)
msgflg specified both MSG_COPY and MSG_EXCEPT.
ENOMSG
IPC_NOWAIT was specified in msgflg and no message of the requested type existed on the message queue.
ENOMSG
IPC_NOWAIT and MSG_COPY were specified in msgflg and the queue contains fewer than msgtyp messages.
ENOSYS (since Linux 3.8)
Both MSG_COPY and IPC_NOWAIT were specified in msgflg, and this kernel was configured without CONFIG_CHECKPOINT_RESTORE.
STANDARDS
POSIX.1-2008.
The MSG_EXCEPT and MSG_COPY flags are Linux-specific; their definitions can be obtained by defining the _GNU_SOURCE feature test macro.
HISTORY
POSIX.1-2001, SVr4.
The msgp argument is declared as struct msgbuf * in glibc 2.0 and 2.1. It is declared as void * in glibc 2.2 and later, as required by SUSv2 and SUSv3.
NOTES
The following limits on message queue resources affect the msgsnd() call:
MSGMAX
Maximum size of a message text, in bytes (default value: 8192 bytes). On Linux, this limit can be read and modified via /proc/sys/kernel/msgmax.
MSGMNB
Maximum number of bytes that can be held in a message queue (default value: 16384 bytes). On Linux, this limit can be read and modified via /proc/sys/kernel/msgmnb. A privileged process (Linux: a process with the CAP_SYS_RESOURCE capability) can increase the size of a message queue beyond MSGMNB using the msgctl(2) IPC_SET operation.
The implementation has no intrinsic system-wide limits on the number of message headers (MSGTQL) and the number of bytes in the message pool (MSGPOOL).
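On Linux, the current MSGMAX value can be read from the /proc file named above. The helper below is a small sketch; read_msgmax() is a hypothetical name, and the function returns -1 when the file is missing or unreadable (for example, on a non-Linux system).

```c
#include <stdio.h>

/* Returns the current MSGMAX value in bytes, or -1 if it cannot be read. */
static long
read_msgmax(void)
{
    FILE *fp = fopen("/proc/sys/kernel/msgmax", "r");
    long value = -1;

    if (fp == NULL)
        return -1;

    if (fscanf(fp, "%ld", &value) != 1)
        value = -1;

    fclose(fp);
    return value;
}
```

A sender could compare its message size against this value before calling msgsnd(), rather than waiting for EINVAL.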
BUGS
In Linux 3.13 and earlier, if msgrcv() was called with the MSG_COPY flag, but without IPC_NOWAIT, and the message queue contained fewer than msgtyp messages, then the call would block until the next message was written to the queue. At that point, the call would return a copy of the message, regardless of whether that message was at the ordinal position msgtyp. This bug is fixed in Linux 3.14.
Specifying both MSG_COPY and MSG_EXCEPT in msgflg is a logical error (since these flags impose different interpretations on msgtyp). In Linux 3.13 and earlier, this error was not diagnosed by msgrcv(). This bug is fixed in Linux 3.14.
EXAMPLES
The program below demonstrates the use of msgsnd() and msgrcv().
The example program is first run with the -s option to send a message and then run again with the -r option to receive a message.
The following shell session shows a sample run of the program:
$ ./a.out -s
sent: a message at Wed Mar 4 16:25:45 2015
$ ./a.out -r
message received: a message at Wed Mar 4 16:25:45 2015
Program source
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <time.h>
#include <unistd.h>

struct msgbuf {
    long mtype;
    char mtext[80];
};

static void
usage(char *prog_name, char *msg)
{
    if (msg != NULL)
        fputs(msg, stderr);

    fprintf(stderr, "Usage: %s [options]\n", prog_name);
    fprintf(stderr, "Options are:\n");
    fprintf(stderr, "-s        send message using msgsnd()\n");
    fprintf(stderr, "-r        read message using msgrcv()\n");
    fprintf(stderr, "-t        message type (default is 1)\n");
    fprintf(stderr, "-k        message queue key (default is 1234)\n");
    exit(EXIT_FAILURE);
}

static void
send_msg(int qid, int msgtype)
{
    time_t         t;
    struct msgbuf  msg;

    msg.mtype = msgtype;

    time(&t);
    snprintf(msg.mtext, sizeof(msg.mtext), "a message at %s",
             ctime(&t));

    if (msgsnd(qid, &msg, sizeof(msg.mtext),
               IPC_NOWAIT) == -1)
    {
        perror("msgsnd error");
        exit(EXIT_FAILURE);
    }
    printf("sent: %s\n", msg.mtext);
}

static void
get_msg(int qid, int msgtype)
{
    struct msgbuf msg;

    if (msgrcv(qid, &msg, sizeof(msg.mtext), msgtype,
               MSG_NOERROR | IPC_NOWAIT) == -1) {
        if (errno != ENOMSG) {
            perror("msgrcv");
            exit(EXIT_FAILURE);
        }
        printf("No message available for msgrcv()\n");
    } else {
        printf("message received: %s\n", msg.mtext);
    }
}

int
main(int argc, char *argv[])
{
    int  qid, opt;
    int  mode = 0;               /* 1 = send, 2 = receive */
    int  msgtype = 1;
    int  msgkey = 1234;

    while ((opt = getopt(argc, argv, "srt:k:")) != -1) {
        switch (opt) {
        case 's':
            mode = 1;
            break;
        case 'r':
            mode = 2;
            break;
        case 't':
            msgtype = atoi(optarg);
            if (msgtype <= 0)
                usage(argv[0], "-t option must be greater than 0\n");
            break;
        case 'k':
            msgkey = atoi(optarg);
            break;
        default:
            usage(argv[0], "Unrecognized option\n");
        }
    }

    if (mode == 0)
        usage(argv[0], "must use either -s or -r option\n");

    qid = msgget(msgkey, IPC_CREAT | 0666);

    if (qid == -1) {
        perror("msgget");
        exit(EXIT_FAILURE);
    }

    if (mode == 2)
        get_msg(qid, msgtype);
    else
        send_msg(qid, msgtype);

    exit(EXIT_SUCCESS);
}
SEE ALSO
msgctl(2), msgget(2), capabilities(7), mq_overview(7), sysvipc(7)
256 - Linux cli command geteuid
NAME 🖥️ geteuid 🖥️
get user identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
uid_t getuid(void);
uid_t geteuid(void);
DESCRIPTION
getuid() returns the real user ID of the calling process.
geteuid() returns the effective user ID of the calling process.
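A common use of the two calls is to detect whether the process is running with an effective user ID that differs from its real one, as is the case inside a set-user-ID program. The helpers below are a minimal sketch; ids_differ() and effective_root() are hypothetical names.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* True when the effective user ID differs from the real user ID. */
static bool
ids_differ(void)
{
    return getuid() != geteuid();
}

/* True when the process currently has an effective user ID of 0. */
static bool
effective_root(void)
{
    return geteuid() == 0;
}
```

Neither call can fail, so no error checking is needed around them.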
ERRORS
These functions are always successful and never modify errno.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD.
In UNIX V6 the getuid() call returned (euid << 8) + uid. UNIX V7 introduced separate calls getuid() and geteuid().
The original Linux getuid() and geteuid() system calls supported only 16-bit user IDs. Subsequently, Linux 2.4 added getuid32() and geteuid32(), supporting 32-bit IDs. The glibc getuid() and geteuid() wrapper functions transparently deal with the variations across kernel versions.
On Alpha, instead of a pair of getuid() and geteuid() system calls, a single getxuid() system call is provided, which returns a pair of real and effective UIDs. The glibc getuid() and geteuid() wrapper functions transparently deal with this. See syscall(2) for details regarding register mapping.
SEE ALSO
getresuid(2), setreuid(2), setuid(2), credentials(7)
257 - Linux cli command mq_open
NAME 🖥️ mq_open 🖥️
open a message queue
LIBRARY
Real-time library (librt, -lrt)
SYNOPSIS
#include <fcntl.h> /* For O_* constants */
#include <sys/stat.h> /* For mode constants */
#include <mqueue.h>
mqd_t mq_open(const char *name, int oflag);
mqd_t mq_open(const char *name, int oflag, mode_t mode,
struct mq_attr *attr);
DESCRIPTION
mq_open() creates a new POSIX message queue or opens an existing queue. The queue is identified by name. For details of the construction of name, see mq_overview(7).
The oflag argument specifies flags that control the operation of the call. (Definitions of the flags values can be obtained by including <fcntl.h>.) Exactly one of the following must be specified in oflag:
O_RDONLY
Open the queue to receive messages only.
O_WRONLY
Open the queue to send messages only.
O_RDWR
Open the queue to both send and receive messages.
Zero or more of the following flags can additionally be ORed in oflag:
O_CLOEXEC (since Linux 2.6.26)
Set the close-on-exec flag for the message queue descriptor. See open(2) for a discussion of why this flag is useful.
O_CREAT
Create the message queue if it does not exist. The owner (user ID) of the message queue is set to the effective user ID of the calling process. The group ownership (group ID) is set to the effective group ID of the calling process.
O_EXCL
If O_CREAT was specified in oflag, and a queue with the given name already exists, then fail with the error EEXIST.
O_NONBLOCK
Open the queue in nonblocking mode. In circumstances where mq_receive(3) and mq_send(3) would normally block, these functions instead fail with the error EAGAIN.
If O_CREAT is specified in oflag, then two additional arguments must be supplied. The mode argument specifies the permissions to be placed on the new queue, as for open(2). (Symbolic definitions for the permissions bits can be obtained by including <sys/stat.h>.) The permissions settings are masked against the process umask.
The fields of the struct mq_attr pointed to by attr specify the maximum number of messages and the maximum size of messages that the queue will allow. This structure is defined as follows:
struct mq_attr {
long mq_flags; /* Flags (ignored for mq_open()) */
long mq_maxmsg; /* Max. # of messages on queue */
long mq_msgsize; /* Max. message size (bytes) */
long mq_curmsgs; /* # of messages currently in queue
(ignored for mq_open()) */
};
Only the mq_maxmsg and mq_msgsize fields are employed when calling mq_open(); the values in the remaining fields are ignored.
If attr is NULL, then the queue is created with implementation-defined default attributes. Since Linux 3.5, two /proc files can be used to control these defaults; see mq_overview(7) for details.
RETURN VALUE
On success, mq_open() returns a message queue descriptor for use by other message queue functions. On error, mq_open() returns (mqd_t) -1, with errno set to indicate the error.
ERRORS
EACCES
The queue exists, but the caller does not have permission to open it in the specified mode.
EACCES
name contained more than one slash.
EEXIST
Both O_CREAT and O_EXCL were specified in oflag, but a queue with this name already exists.
EINVAL
name doesn’t follow the format in mq_overview(7).
EINVAL
O_CREAT was specified in oflag, and attr was not NULL, but attr->mq_maxmsg or attr->mq_msgsize was invalid. Both of these fields must be greater than zero. In a process that is unprivileged (does not have the CAP_SYS_RESOURCE capability), attr->mq_maxmsg must be less than or equal to the msg_max limit, and attr->mq_msgsize must be less than or equal to the msgsize_max limit. In addition, even in a privileged process, attr->mq_maxmsg cannot exceed the HARD_MAX limit. (See mq_overview(7) for details of these limits.)
EMFILE
The per-process limit on the number of open file and message queue descriptors has been reached (see the description of RLIMIT_NOFILE in getrlimit(2)).
ENAMETOOLONG
name was too long.
ENFILE
The system-wide limit on the total number of open files and message queues has been reached.
ENOENT
The O_CREAT flag was not specified in oflag, and no queue with this name exists.
ENOENT
name was just “/” followed by no other characters.
ENOMEM
Insufficient memory.
ENOSPC
Insufficient space for the creation of a new message queue. This probably occurred because the queues_max limit was encountered; see mq_overview(7).
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
mq_open() | Thread safety | MT-Safe |
VERSIONS
C library/kernel differences
The mq_open() library function is implemented on top of a system call of the same name. The library function performs the check that the name starts with a slash (/), giving the EINVAL error if it does not. The kernel system call expects name to contain no preceding slash, so the C library function passes name without the preceding slash (i.e., name+1) to the system call.
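The name rules scattered through ERRORS and this section can be collected into one check. The function below is only a model of the documented behavior, not the glibc or kernel implementation; check_mq_name() is a hypothetical name, and length limits (ENAMETOOLONG) are deliberately not modelled.

```c
#include <errno.h>
#include <string.h>

/*
 * Returns 0 if `name` passes the documented format checks, otherwise
 * the errno value that mq_open() is documented to report.
 */
static int
check_mq_name(const char *name)
{
    if (name[0] != '/')
        return EINVAL;      /* library check: must start with a slash */

    if (name[1] == '\0')
        return ENOENT;      /* just "/" followed by nothing */

    if (strchr(name + 1, '/') != NULL)
        return EACCES;      /* more than one slash */

    return 0;
}
```

Under this model, "/jobs" is accepted, "jobs" yields EINVAL, "/" yields ENOENT, and "/a/b" yields EACCES.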
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
BUGS
Before Linux 2.6.14, the process umask was not applied to the permissions specified in mode.
SEE ALSO
mq_close(3), mq_getattr(3), mq_notify(3), mq_receive(3), mq_send(3), mq_unlink(3), mq_overview(7)
258 - Linux cli command mkdirat
NAME 🖥️ mkdirat 🖥️
create a directory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int mkdir(const char *pathname, mode_t mode);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int mkdirat(int dirfd, const char *pathname, mode_t mode);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
mkdirat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
mkdir() attempts to create a directory named pathname.
The argument mode specifies the mode for the new directory (see inode(7)). It is modified by the process’s umask in the usual way: in the absence of a default ACL, the mode of the created directory is (mode & ~umask & 0777). Whether other mode bits are honored for the created directory depends on the operating system. For Linux, see NOTES below.
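The mode formula above is plain bit arithmetic. The sketch below evaluates it for a given mode and umask; effective_dir_mode() is a hypothetical helper, and it covers only the permission bits in the formula (no default ACL, no S_ISVTX handling).

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Permission bits of a new directory, per (mode & ~umask & 0777). */
static mode_t
effective_dir_mode(mode_t mode, mode_t mask)
{
    return mode & ~mask & 0777;
}
```

With a mode of 0777 and a umask of 022, the result is 0755: the umask clears the write bits for group and others.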
The newly created directory will be owned by the effective user ID of the process. If the parent directory has the set-group-ID bit set, or if the filesystem is mounted with BSD group semantics (mount -o bsdgroups or, synonymously, mount -o grpid), the new directory will inherit the group ownership from its parent; otherwise it will be owned by the effective group ID of the process.
If the parent directory has the set-group-ID bit set, then so will the newly created directory.
mkdirat()
The mkdirat() system call operates in exactly the same way as mkdir(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by mkdir() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like mkdir()).
If pathname is absolute, then dirfd is ignored.
See openat(2) for an explanation of the need for mkdirat().
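A typical use of mkdirat() is to open the parent directory once and then create entries relative to that descriptor. The sketch below shows this pattern; make_subdir() is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create directory `name` inside `parent` using a directory file
 * descriptor. Returns 0 on success, or -1 with errno set.
 */
static int
make_subdir(const char *parent, const char *name, mode_t mode)
{
    int dirfd, rc, saved_errno;

    dirfd = open(parent, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    rc = mkdirat(dirfd, name, mode);    /* name is resolved relative to dirfd */

    saved_errno = errno;                /* close() must not clobber the error */
    close(dirfd);
    errno = saved_errno;

    return rc;
}
```

Because the relative name is resolved against the open descriptor, the result does not depend on the current working directory and is unaffected if the parent is renamed after it has been opened.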
RETURN VALUE
mkdir() and mkdirat() return zero on success. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
The parent directory does not allow write permission to the process, or one of the directories in pathname did not allow search permission. (See also path_resolution(7).)
EBADF
(mkdirat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EDQUOT
The user’s quota of disk blocks or inodes on the filesystem has been exhausted.
EEXIST
pathname already exists (not necessarily as a directory). This includes the case where pathname is a symbolic link, dangling or not.
EFAULT
pathname points outside your accessible address space.
EINVAL
The final component (“basename”) of the new directory’s pathname is invalid (e.g., it contains characters not permitted by the underlying filesystem).
ELOOP
Too many symbolic links were encountered in resolving pathname.
EMLINK
The number of links to the parent directory would exceed LINK_MAX.
ENAMETOOLONG
pathname was too long.
ENOENT
A directory component in pathname does not exist or is a dangling symbolic link.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The device containing pathname has no room for the new directory.
ENOSPC
The new directory cannot be created because the user’s disk quota is exhausted.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory.
ENOTDIR
(mkdirat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
The filesystem containing pathname does not support the creation of directories.
EROFS
pathname refers to a file on a read-only filesystem.
VERSIONS
Under Linux, apart from the permission bits, the S_ISVTX mode bit is also honored.
glibc notes
On older kernels where mkdirat() is unavailable, the glibc wrapper function falls back to the use of mkdir(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
STANDARDS
POSIX.1-2008.
HISTORY
mkdir()
SVr4, BSD, POSIX.1-2001.
mkdirat()
Linux 2.6.16, glibc 2.4.
NOTES
There are many infelicities in the protocol underlying NFS. Some of these affect mkdir().
SEE ALSO
mkdir(1), chmod(2), chown(2), mknod(2), mount(2), rmdir(2), stat(2), umask(2), unlink(2), acl(5), path_resolution(7)
259 - Linux cli command pwritev2
NAME 🖥️ pwritev2 🖥️
read or write data into multiple buffers
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/uio.h>
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
preadv(), pwritev():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
The readv() system call reads iovcnt buffers from the file associated with the file descriptor fd into the buffers described by iov (“scatter input”).
The writev() system call writes iovcnt buffers of data described by iov to the file associated with the file descriptor fd (“gather output”).
The pointer iov points to an array of iovec structures, described in iovec(3type).
The readv() system call works just like read(2) except that multiple buffers are filled.
The writev() system call works just like write(2) except that multiple buffers are written out.
Buffers are processed in array order. This means that readv() completely fills iov[0] before proceeding to iov[1], and so on. (If there is insufficient data, then not all buffers pointed to by iov may be filled.) Similarly, writev() writes out the entire contents of iov[0] before proceeding to iov[1], and so on.
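The array-order rule can be observed with a pipe. In this sketch, two strings are gathered into one writev() call and then scattered by one readv() call into buffers of different sizes; scatter_gather_roundtrip() is a hypothetical name, and the buffer sizes are chosen for illustration.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write "hello " and "world" through a pipe with one writev(), then
 * read them back with one readv() into a 4-byte and an 8-byte buffer.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t
scatter_gather_roundtrip(char first[4], char second[8])
{
    const char *s0 = "hello ", *s1 = "world";
    struct iovec out[2], in[2];
    ssize_t nread = -1;
    int fds[2];

    if (pipe(fds) == -1)
        return -1;

    out[0].iov_base = (void *) s0;
    out[0].iov_len = strlen(s0);
    out[1].iov_base = (void *) s1;
    out[1].iov_len = strlen(s1);

    in[0].iov_base = first;
    in[0].iov_len = 4;
    in[1].iov_base = second;
    in[1].iov_len = 8;

    if (writev(fds[1], out, 2) == 11)
        nread = readv(fds[0], in, 2);   /* fills in[0] before in[1] */

    close(fds[0]);
    close(fds[1]);
    return nread;
}
```

The first buffer receives "hell" and is completely filled before the remaining seven bytes, "o world", are placed in the second buffer, which is left partly unused.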
The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes; analogously, readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description (see open(2)).
preadv() and pwritev()
The preadv() system call combines the functionality of readv() and pread(2). It performs the same task as readv(), but adds a fourth argument, offset, which specifies the file offset at which the input operation is to be performed.
The pwritev() system call combines the functionality of writev() and pwrite(2). It performs the same task as writev(), but adds a fourth argument, offset, which specifies the file offset at which the output operation is to be performed.
The file offset is not changed by these system calls. The file referred to by fd must be capable of seeking.
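The following sketch writes two fragments at a fixed position with pwritev(), reads them back with preadv(), and reports the file offset afterwards. positional_io_demo() is a hypothetical name; the scratch file comes from tmpfile(3), and the position 100 is arbitrary.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Returns the number of bytes read back, or -1 on error. The file
 * offset observed after both calls is stored in *offset_after.
 */
static ssize_t
positional_io_demo(char *buf, size_t buflen, off_t *offset_after)
{
    const char *head = "abc", *tail = "def";
    struct iovec out[2], in[1];
    ssize_t nread = -1;
    FILE *fp = tmpfile();       /* scratch file, removed on close */
    int fd;

    if (fp == NULL)
        return -1;
    fd = fileno(fp);

    out[0].iov_base = (void *) head;
    out[0].iov_len = strlen(head);
    out[1].iov_base = (void *) tail;
    out[1].iov_len = strlen(tail);
    in[0].iov_base = buf;
    in[0].iov_len = buflen;

    /* Write at byte 100, then read the same range back. */
    if (pwritev(fd, out, 2, 100) == 6)
        nread = preadv(fd, in, 1, 100);

    *offset_after = lseek(fd, 0, SEEK_CUR);
    fclose(fp);
    return nread;
}
```

With a six-byte buffer, the call reads back "abcdef", and the reported offset is still 0 because neither call moves the file offset.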
preadv2() and pwritev2()
These system calls are similar to preadv() and pwritev() calls, but add a fifth argument, flags, which modifies the behavior on a per-call basis.
Unlike preadv() and pwritev(), if the offset argument is -1, then the current file offset is used and updated.
The flags argument contains a bitwise OR of zero or more of the following flags:
RWF_DSYNC (since Linux 4.7)
Provide a per-write equivalent of the O_DSYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_HIPRI (since Linux 4.6)
High priority read/write. Allows block-based filesystems to use polling of the device, which provides lower latency, but may use additional resources. (Currently, this feature is usable only on a file descriptor opened using the O_DIRECT flag.)
RWF_SYNC (since Linux 4.7)
Provide a per-write equivalent of the O_SYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_NOWAIT (since Linux 4.14)
Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read. If no bytes were read, it will return -1 and set errno to EAGAIN (but see BUGS). Currently, this flag is meaningful only for preadv2().
RWF_APPEND (since Linux 4.16)
Provide a per-write equivalent of the O_APPEND open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call. The offset argument does not affect the write operation; the data is always appended to the end of the file. However, if the offset argument is -1, the current file offset is updated.
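The special offset value -1 can be illustrated with preadv2() and a flags argument of 0. This is a minimal sketch that assumes a C library providing the preadv2() wrapper (glibc 2.26 or later); read_at_current_offset() is a hypothetical name.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write "abcdef" to a scratch file, seek to byte 2, then read with
 * preadv2() and an offset of -1. Returns the number of bytes read, or
 * -1 on error; the resulting file offset is stored in *offset_after.
 */
static ssize_t
read_at_current_offset(char *buf, size_t buflen, off_t *offset_after)
{
    struct iovec iov = { .iov_base = buf, .iov_len = buflen };
    ssize_t nread = -1;
    FILE *fp = tmpfile();
    int fd;

    if (fp == NULL)
        return -1;
    fd = fileno(fp);

    if (write(fd, "abcdef", 6) == 6 && lseek(fd, 2, SEEK_SET) == 2)
        nread = preadv2(fd, &iov, 1, -1, 0);    /* -1: use and update the file offset */

    *offset_after = lseek(fd, 0, SEEK_CUR);
    fclose(fp);
    return nread;
}
```

With an eight-byte buffer, the call reads the four bytes "cdef" starting at the current position, and the file offset advances from 2 to 6.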
RETURN VALUE
On success, readv(), preadv(), and preadv2() return the number of bytes read; writev(), pwritev(), and pwritev2() return the number of bytes written.
Note that it is not an error for a successful call to transfer fewer bytes than requested (see read(2) and write(2)).
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The errors are as given for read(2) and write(2). Furthermore, preadv(), preadv2(), pwritev(), and pwritev2() can also fail for the same reasons as lseek(2). Additionally, the following errors are defined:
EINVAL
The sum of the iov_len values overflows an ssize_t value.
EINVAL
The vector count, iovcnt, is less than zero or greater than the permitted maximum.
EOPNOTSUPP
An unknown flag is specified in flags.
VERSIONS
C library/kernel differences
The raw preadv() and pwritev() system calls have call signatures that differ slightly from that of the corresponding GNU C library wrapper functions shown in the SYNOPSIS. The final argument, offset, is unpacked by the wrapper functions into two arguments in the system calls:
unsigned long pos_l, unsigned long pos_h
These arguments contain, respectively, the low order and high order 32 bits of offset.
STANDARDS
readv()
writev()
POSIX.1-2008.
preadv()
pwritev()
BSD.
preadv2()
pwritev2()
Linux.
HISTORY
readv()
writev()
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
preadv(), pwritev(): Linux 2.6.30, glibc 2.10.
preadv2(), pwritev2(): Linux 4.6, glibc 2.26.
Historical C library/kernel differences
To deal with the fact that IOV_MAX was so low on early versions of Linux, the glibc wrapper functions for readv() and writev() did some extra work if they detected that the underlying kernel system call failed because this limit was exceeded. In the case of readv(), the wrapper function allocated a temporary buffer large enough for all of the items specified by iov, passed that buffer in a call to read(2), copied data from the buffer to the locations specified by the iov_base fields of the elements of iov, and then freed the buffer. The wrapper function for writev() performed the analogous task using a temporary buffer and a call to write(2).
The need for this extra effort in the glibc wrapper functions went away with Linux 2.2 and later. However, glibc continued to provide this behavior until glibc 2.10. Starting with glibc 2.9, the wrapper functions provide this behavior only if the library detects that the system is running a Linux kernel older than Linux 2.6.18 (an arbitrarily selected kernel version). And since glibc 2.20 (which requires a minimum of Linux 2.6.32), the glibc wrapper functions always just directly invoke the system calls.
NOTES
POSIX.1 allows an implementation to place a limit on the number of items that can be passed in iov. An implementation can advertise its limit by defining IOV_MAX in <limits.h> or at run time via the return value from sysconf(_SC_IOV_MAX). On modern Linux systems, the limit is 1024. Back in Linux 2.0 days, this limit was 16.
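A program that may pass many buffers can query the limit at run time and split its work accordingly. The helpers below are a sketch; iov_limit() and calls_needed() are hypothetical names.

```c
#include <unistd.h>

/* Run-time limit on iovcnt, or -1 if the limit is indeterminate. */
static long
iov_limit(void)
{
    return sysconf(_SC_IOV_MAX);
}

/*
 * Number of readv()/writev() calls needed to pass `count` buffers when
 * at most `limit` may be passed per call. Returns -1 for a bad limit.
 */
static long
calls_needed(long count, long limit)
{
    if (limit <= 0 || count < 0)
        return -1;
    return (count + limit - 1) / limit;     /* round up */
}
```

With the limit of 1024 mentioned above, 2500 buffers need three calls.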
BUGS
Linux 5.9 and Linux 5.10 have a bug where preadv2() with the RWF_NOWAIT flag may return 0 even when not at end of file.
EXAMPLES
The following code sample demonstrates the use of writev():
char *str0 = "hello ";
char *str1 = "world\n";
ssize_t nwritten;
struct iovec iov[2];

iov[0].iov_base = str0;
iov[0].iov_len = strlen(str0);
iov[1].iov_base = str1;
iov[1].iov_len = strlen(str1);

nwritten = writev(STDOUT_FILENO, iov, 2);
SEE ALSO
pread(2), read(2), write(2)
260 - Linux cli command mlockall
NAME 🖥️ mlockall 🖥️
lock and unlock memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
int mlock(const void addr[.len], size_t len);
int mlock2(const void addr[.len], size_t len, unsigned int flags);
int munlock(const void addr[.len], size_t len);
int mlockall(int flags);
int munlockall(void);
DESCRIPTION
mlock(), mlock2(), and mlockall() lock part or all of the calling process’s virtual address space into RAM, preventing that memory from being paged to the swap area.
munlock() and munlockall() perform the converse operation, unlocking part or all of the calling process’s virtual address space, so that pages in the specified virtual address range can be swapped out again if required by the kernel memory manager.
Memory locking and unlocking are performed in units of whole pages.
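Because locking works on whole pages, a request covering only part of a page affects every page the range touches. The sketch below computes that page-aligned range with plain arithmetic; struct page_range and covering_pages() are hypothetical names, and the page size is passed in rather than queried so the calculation is easy to follow.

```c
#include <stddef.h>
#include <stdint.h>

struct page_range {
    uintptr_t start;    /* page-aligned start address */
    size_t    length;   /* multiple of the page size */
};

/* Smallest page-aligned range that contains [addr, addr + len). */
static struct page_range
covering_pages(uintptr_t addr, size_t len, size_t page_size)
{
    struct page_range r;
    uintptr_t end = addr + len;     /* one past the last byte */
    uintptr_t aligned_end;

    r.start = addr - (addr % page_size);                            /* round down */
    aligned_end = end + (page_size - end % page_size) % page_size;  /* round up */
    r.length = aligned_end - r.start;

    return r;
}
```

With 4096-byte pages, a 10-byte range starting at address 4090 straddles a page boundary, so the affected range starts at 0 and is 8192 bytes long.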
mlock(), mlock2(), and munlock()
mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.
mlock2() also locks pages in the specified range starting at addr and continuing for len bytes. However, the state of the pages contained in that range after the call returns successfully will depend on the value in the flags argument.
The flags argument can be either 0 or the following constant:
MLOCK_ONFAULT
Lock pages that are currently resident and mark the entire range so that the remaining nonresident pages are locked when they are populated by a page fault.
If flags is 0, mlock2() behaves exactly the same as mlock().
munlock() unlocks pages in the address range starting at addr and continuing for len bytes. After this call, all pages that contain a part of the specified memory range can be moved to external swap space again by the kernel.
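A minimal sketch of the lock, use, unlock sequence for a single page follows. use_locked_page() is a hypothetical name, and the call can legitimately fail with ENOMEM or EPERM in environments with a small or zero RLIMIT_MEMLOCK limit.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Allocate one page, lock it into RAM, use it, then unlock and free it.
 * Returns 0 on success, or -1 with errno set.
 */
static int
use_locked_page(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    int err, rc;

    if (page_size <= 0)
        return -1;

    err = posix_memalign(&buf, (size_t) page_size, (size_t) page_size);
    if (err != 0) {
        errno = err;
        return -1;
    }

    if (mlock(buf, (size_t) page_size) == -1) {
        err = errno;
        free(buf);
        errno = err;
        return -1;
    }

    memset(buf, 0x5a, (size_t) page_size);  /* stand-in for real work */

    rc = munlock(buf, (size_t) page_size);
    err = errno;
    free(buf);
    errno = err;
    return rc;
}
```

Aligning the buffer to a page boundary is not required on Linux, but it keeps the locked region to exactly one page.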
mlockall() and munlockall()
mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data, and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.
The flags argument is constructed as the bitwise OR of one or more of the following constants:
MCL_CURRENT
Lock all pages which are currently mapped into the address space of the process.
MCL_FUTURE
Lock all pages which will become mapped into the address space of the process in the future. These could be, for instance, new pages required by a growing heap and stack as well as new memory-mapped files or shared memory regions.
MCL_ONFAULT (since Linux 4.4)
Used together with MCL_CURRENT, MCL_FUTURE, or both. Mark all current (with MCL_CURRENT) or future (with MCL_FUTURE) mappings to lock pages when they are faulted in. When used with MCL_CURRENT, all present pages are locked, but mlockall() will not fault in non-present pages. When used with MCL_FUTURE, all future mappings will be marked to lock pages when they are faulted in, but they will not be populated by the lock when the mapping is created. MCL_ONFAULT must be used with either MCL_CURRENT or MCL_FUTURE or both.
If MCL_FUTURE has been specified, then a later system call (e.g., mmap(2), sbrk(2), malloc(3)), may fail if it would cause the number of locked bytes to exceed the permitted maximum (see below). In the same circumstances, stack growth may likewise fail: the kernel will deny stack expansion and deliver a SIGSEGV signal to the process.
munlockall() unlocks all pages mapped into the address space of the calling process.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned, errno is set to indicate the error, and no changes are made to any locks in the address space of the process.
ERRORS
EAGAIN
(mlock(), mlock2(), and munlock()) Some or all of the specified address range could not be locked.
EINVAL
(mlock(), mlock2(), and munlock()) The result of the addition addr+len was less than addr (e.g., the addition may have resulted in an overflow).
EINVAL
(mlock2()) Unknown flags were specified.
EINVAL
(mlockall()) Unknown flags were specified or MCL_ONFAULT was specified without either MCL_FUTURE or MCL_CURRENT.
EINVAL
(Not on Linux) addr was not a multiple of the page size.
ENOMEM
(mlock(), mlock2(), and munlock()) Some of the specified address range does not correspond to mapped pages in the address space of the process.
ENOMEM
(mlock(), mlock2(), and munlock()) Locking or unlocking a region would result in the total number of mappings with distinct attributes (e.g., locked versus unlocked) exceeding the allowed maximum. (For example, unlocking a range in the middle of a currently locked mapping would result in three mappings: two locked mappings at each end and an unlocked mapping in the middle.)
ENOMEM
(Linux 2.6.9 and later) the caller had a nonzero RLIMIT_MEMLOCK soft resource limit, but tried to lock more memory than the limit permitted. This limit is not enforced if the process is privileged (CAP_IPC_LOCK).
ENOMEM
(Linux 2.4 and earlier) the calling process tried to lock more than half of RAM.
EPERM
The caller is not privileged, but needs privilege (CAP_IPC_LOCK) to perform the requested operation.
EPERM
(munlockall()) (Linux 2.6.8 and earlier) The caller was not privileged (CAP_IPC_LOCK).
VERSIONS
Linux
Under Linux, mlock(), mlock2(), and munlock() automatically round addr down to the nearest page boundary. However, the POSIX.1 specification of mlock() and munlock() allows an implementation to require that addr is page aligned, so portable applications should ensure this.
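As an illustration of the portability advice above, the following is a minimal sketch of how a caller might expand a range to whole pages before passing it to mlock(). The helper name align_to_pages() is hypothetical and not part of any library; it assumes the page size reported by sysconf(_SC_PAGESIZE) is a power of two.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: expand [addr, addr + len) to whole pages.
 * Returns 0 on success, or -1 if the page size is unavailable or
 * the range would overflow the address space. */
int align_to_pages(uintptr_t addr, size_t len,
                   uintptr_t *start, size_t *aligned_len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    uintptr_t mask = (uintptr_t)page - 1;   /* assumes a power-of-two page size */
    uintptr_t end = addr + len;

    if (end < addr)                         /* addr + len overflowed (cf. EINVAL) */
        return -1;
    if (end > UINTPTR_MAX - mask)           /* rounding up would overflow */
        return -1;

    uintptr_t lo = addr & ~mask;            /* round the start down */
    uintptr_t hi = (end + mask) & ~mask;    /* round the end up */

    *start = lo;
    *aligned_len = (size_t)(hi - lo);
    return 0;
}
```

The resulting start and length could then be passed to mlock() or munlock() on systems that insist on page-aligned addresses.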
The VmLck field of the Linux-specific /proc/pid/status file shows how many kilobytes of memory the process with ID PID has locked using mlock(), mlock2(), mlockall(), and mmap(2) MAP_LOCKED.
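A small sketch of reading that field for the calling process follows. The function names parse_vmlck_line() and read_vmlck_kb() are hypothetical, and the code assumes the usual "VmLck: <number> kB" line layout described in proc(5).

```c
#include <stdio.h>

/* Hypothetical helper: if 'line' is the VmLck line of a status file,
 * store the value (in kB) in *kb and return 1; otherwise return 0
 * and leave *kb untouched. */
int parse_vmlck_line(const char *line, long *kb)
{
    return sscanf(line, "VmLck: %ld kB", kb) == 1;
}

/* Return the locked-memory size of the calling process in kB,
 * or -1 if /proc/self/status cannot be read or has no VmLck line. */
long read_vmlck_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    char line[256];
    long kb = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (parse_vmlck_line(line, &kb))
            break;
    }
    fclose(f);
    return kb;
}
```

Calling read_vmlck_kb() before and after an mlock() call is one way to observe the effect of a lock while debugging.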
STANDARDS
mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2008.
mlock2()
Linux.
On POSIX systems on which mlock() and munlock() are available, _POSIX_MEMLOCK_RANGE is defined in <unistd.h> and the number of bytes in a page can be determined from the constant PAGESIZE (if defined) in <limits.h> or by calling sysconf(_SC_PAGESIZE).
On POSIX systems on which mlockall() and munlockall() are available, _POSIX_MEMLOCK is defined in <unistd.h> to a value greater than 0. (See also sysconf(3).)
HISTORY
mlock()
munlock()
mlockall()
munlockall()
POSIX.1-2001, POSIX.1-2008, SVr4.
mlock2()
Linux 4.4, glibc 2.27.
NOTES
Memory locking has two main applications: real-time algorithms and high-security data processing. Real-time applications require deterministic timing, and, like scheduling, paging is one major cause of unexpected program execution delays. Real-time applications will usually also switch to a real-time scheduler with sched_setscheduler(2). Cryptographic security software often handles critical bytes like passwords or secret keys as data structures. As a result of paging, these secrets could be transferred onto a persistent swap store medium, where they might be accessible to the enemy long after the security software has erased the secrets in RAM and terminated. (But be aware that the suspend mode on laptops and some desktop computers will save a copy of the system’s RAM to disk, regardless of memory locks.)
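The security use case might be sketched as follows. The helper names lock_secret() and wipe_and_unlock() are hypothetical; the sketch assumes the secret lives in a caller-supplied buffer and uses a volatile write loop so that the compiler does not optimize the wipe away.

```c
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: lock the pages holding a secret so that they
 * are not written to swap. Returns 0 on success, -1 with errno set
 * (for example ENOMEM if RLIMIT_MEMLOCK would be exceeded). */
int lock_secret(void *buf, size_t len)
{
    return mlock(buf, len);
}

/* Hypothetical helper: overwrite the secret, then unlock its pages.
 * Note that munlock() acts on whole pages, so any other data that
 * shares those pages is unlocked as well. */
void wipe_and_unlock(void *buf, size_t len)
{
    volatile unsigned char *p = buf;

    for (size_t i = 0; i < len; i++)
        p[i] = 0;                 /* volatile stores are not elided */

    (void)munlock(buf, len);      /* failure is ignored in this sketch */
}
```

A caller would typically check the result of lock_secret() and decide whether to continue without the lock or to abort.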
Real-time processes that are using mlockall() to prevent delays on page faults should reserve enough locked stack pages before entering the time-critical section, so that no page fault can be caused by function calls. This can be achieved by calling a function that allocates a sufficiently large automatic variable (an array) and writes to the memory occupied by this array in order to touch these stack pages. This way, enough pages will be mapped for the stack and can be locked into RAM. The dummy writes ensure that not even copy-on-write page faults can occur in the critical section.
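The stack pre-faulting technique described above could look roughly like this. The function name prefault_stack() and the 64 KiB reserve size are assumptions chosen for illustration; a real application would size the array to its own worst-case stack usage and would call mlockall() first.

```c
#include <stddef.h>
#include <unistd.h>

/* Assumed reserve size; choose a value that covers the deepest call
 * chain of the time-critical section. */
#define PREFAULT_STACK_BYTES (64 * 1024)

/* Hypothetical helper: touch one byte in every page of a large
 * automatic array so that the stack pages are mapped (and, after
 * mlockall(), locked). Returns the number of pages written to. */
size_t prefault_stack(void)
{
    volatile unsigned char dummy[PREFAULT_STACK_BYTES];
    long page = sysconf(_SC_PAGESIZE);
    size_t touched = 0;

    if (page <= 0)
        page = 4096;              /* fallback assumption */

    for (size_t i = 0; i < sizeof dummy; i += (size_t)page) {
        dummy[i] = 0;             /* the write also resolves copy-on-write */
        touched++;
    }
    return touched;
}
```

The array is declared volatile so that the dummy writes are actually performed rather than removed by the optimizer.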
Memory locks are not inherited by a child created via fork(2) and are automatically removed (unlocked) during an execve(2) or when the process terminates. The mlockall() MCL_FUTURE and MCL_FUTURE | MCL_ONFAULT settings are not inherited by a child created via fork(2) and are cleared during an execve(2).
Note that fork(2) will prepare the address space for a copy-on-write operation. The consequence is that any write access that follows will cause a page fault that in turn may cause high latencies for a real-time process. Therefore, it is crucial not to invoke fork(2) after an mlockall() or mlock() operation, not even from a thread which runs at a low priority within a process which also has a thread running at elevated priority.
The memory lock on an address range is automatically removed if the address range is unmapped via munmap(2).
Memory locks do not stack, that is, pages which have been locked several times by calls to mlock(), mlock2(), or mlockall() will be unlocked by a single call to munlock() for the corresponding range or by munlockall(). Pages which are mapped to several locations or by several processes stay locked into RAM as long as they are locked at least at one location or by at least one process.
If a call to mlockall() which uses the MCL_FUTURE flag is followed by another call that does not specify this flag, the changes made by the MCL_FUTURE call will be lost.
The mlock2() MLOCK_ONFAULT flag and the mlockall() MCL_ONFAULT flag allow efficient memory locking for applications that deal with large mappings where only a (small) portion of pages in the mapping are touched. In such cases, locking all of the pages in a mapping would incur a significant penalty for memory locking.
Limits and permissions
In Linux 2.6.8 and earlier, a process must be privileged (CAP_IPC_LOCK) in order to lock memory and the RLIMIT_MEMLOCK soft resource limit defines a limit on how much memory the process may lock.
Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process can lock and the RLIMIT_MEMLOCK soft resource limit instead defines a limit on how much memory an unprivileged process may lock.
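An unprivileged program can inspect the limit before attempting to lock memory. The sketch below is illustrative only: memlock_soft_limit() and fits_memlock_limit() are hypothetical names, and the second helper merely mirrors the arithmetic of the limit check rather than the kernel's actual accounting.

```c
#include <sys/resource.h>

/* Hypothetical helper: fetch the soft RLIMIT_MEMLOCK limit in bytes.
 * Returns 0 on success, -1 with errno set on failure. The stored
 * value may be RLIM_INFINITY. */
int memlock_soft_limit(rlim_t *soft)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) == -1)
        return -1;
    *soft = rl.rlim_cur;
    return 0;
}

/* Hypothetical helper: would locking 'want' more bytes stay within
 * 'soft', given that 'already' bytes are locked? Returns 1 or 0. */
int fits_memlock_limit(rlim_t soft, rlim_t already, rlim_t want)
{
    if (soft == RLIM_INFINITY)
        return 1;
    if (already > soft)
        return 0;
    return want <= soft - already;
}
```

A process holding CAP_IPC_LOCK is not subject to this limit on Linux 2.6.9 and later, so the check is only meaningful for unprivileged callers.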
BUGS
In Linux 4.8 and earlier, a bug in the kernel’s accounting of locked memory for unprivileged processes (i.e., without CAP_IPC_LOCK) meant that if the region specified by addr and len overlapped an existing lock, then the already locked bytes in the overlapping region were counted twice when checking against the limit. Such double accounting could incorrectly calculate a “total locked memory” value for the process that exceeded the RLIMIT_MEMLOCK limit, with the result that mlock() and mlock2() would fail on requests that should have succeeded. This bug was fixed in Linux 4.9.
In Linux 2.4 series of kernels up to and including Linux 2.4.17, a bug caused the mlockall() MCL_FUTURE flag to be inherited across a fork(2). This was rectified in Linux 2.4.18.
Since Linux 2.6.9, if a privileged process calls mlockall(MCL_FUTURE) and later drops privileges (loses the CAP_IPC_LOCK capability by, for example, setting its effective UID to a nonzero value), then subsequent memory allocations (e.g., mmap(2), brk(2)) will fail if the RLIMIT_MEMLOCK resource limit is encountered.
SEE ALSO
mincore(2), mmap(2), setrlimit(2), shmctl(2), sysconf(3), proc(5), capabilities(7)
261 - Linux cli command setgid
NAME
setgid - set group identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setgid(gid_t gid);
DESCRIPTION
setgid() sets the effective group ID of the calling process. If the calling process is privileged (more precisely: has the CAP_SETGID capability in its user namespace), the real GID and saved set-group-ID are also set.
Under Linux, setgid() is implemented like the POSIX version with the _POSIX_SAVED_IDS feature. This allows a set-group-ID program that is not set-user-ID-root to drop all of its group privileges, do some un-privileged work, and then reengage the original effective group ID in a secure manner.
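The drop-and-reengage pattern mentioned above might be sketched as follows for an unprivileged set-group-ID program. The helper name run_with_real_gid() is hypothetical. In a process that has CAP_SETGID, the first setgid() call would also overwrite the real and saved IDs, so this sketch does not apply there.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: run work() with the effective GID temporarily
 * set to the real GID, then restore the original effective GID.
 * Returns 0 on success, -1 with errno set if a setgid() call fails. */
int run_with_real_gid(void (*work)(void))
{
    gid_t original_egid = getegid();

    /* Drop: permitted because the target equals the real GID. */
    if (setgid(getgid()) == -1)
        return -1;

    if (work != NULL)
        work();                   /* unprivileged work happens here */

    /* Reengage: permitted because the saved set-group-ID still
     * holds the original effective GID. */
    if (setgid(original_egid) == -1)
        return -1;

    return 0;
}
```

Both return values are checked because continuing with an unexpected group identity is a common source of security bugs.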
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
The group ID specified in gid is not valid in this user namespace.
EPERM
The calling process is not privileged (does not have the CAP_SETGID capability in its user namespace), and gid does not match the real group ID or saved set-group-ID of the calling process.
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including the one for setgid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
The original Linux setgid() system call supported only 16-bit group IDs. Subsequently, Linux 2.4 added setgid32() supporting 32-bit IDs. The glibc setgid() wrapper function transparently deals with the variation across kernel versions.
SEE ALSO
getgid(2), setegid(2), setregid(2), capabilities(7), credentials(7), user_namespaces(7)
262 - Linux cli command execve
NAME
execve - execute program
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int execve(const char *pathname, char *const _Nullable argv[],
char *const _Nullable envp[]);
DESCRIPTION
execve() executes the program referred to by pathname. This causes the program that is currently being run by the calling process to be replaced with a new program, with newly initialized stack, heap, and (initialized and uninitialized) data segments.
pathname must be either a binary executable, or a script starting with a line of the form:
#!interpreter [optional-arg]
For details of the latter case, see “Interpreter scripts” below.
argv is an array of pointers to strings passed to the new program as its command-line arguments. By convention, the first of these strings (i.e., argv[0]) should contain the filename associated with the file being executed. The argv array must be terminated by a null pointer. (Thus, in the new program, argv[argc] will be a null pointer.)
envp is an array of pointers to strings, conventionally of the form key=value, which are passed as the environment of the new program. The envp array must be terminated by a null pointer.
This manual page describes the Linux system call in detail; for an overview of the nomenclature and the many, often preferable, standardised variants of this function provided by libc, including ones that search the PATH environment variable, see exec(3).
The argument vector and environment can be accessed by the new program’s main function, when it is defined as:
int main(int argc, char *argv[], char *envp[])
Note, however, that the use of a third argument to the main function is not specified in POSIX.1; according to POSIX.1, the environment should be accessed via the external variable environ(7).
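A minimal sketch of the POSIX-style access through environ follows. The function name lookup_environ() is hypothetical and only illustrates walking the array; real programs would normally call getenv(3).

```c
#include <stddef.h>
#include <string.h>

extern char **environ;   /* must be declared by the program; see environ(7) */

/* Hypothetical helper: return the value of 'key' from environ, or
 * NULL if no entry of the form key=value exists. */
const char *lookup_environ(const char *key)
{
    size_t klen = strlen(key);

    if (environ == NULL)
        return NULL;

    for (char **e = environ; *e != NULL; e++) {
        /* Match "key=" exactly, so that "A" does not match "AB=...". */
        if (strncmp(*e, key, klen) == 0 && (*e)[klen] == '=')
            return *e + klen + 1;
    }
    return NULL;
}
```

The loop stops at the terminating null pointer, the same terminator that execve() requires at the end of envp.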
execve() does not return on success, and the text, initialized data, uninitialized data (bss), and stack of the calling process are overwritten according to the contents of the newly loaded program.
If the current program is being ptraced, a SIGTRAP signal is sent to it after a successful execve().
If the set-user-ID bit is set on the program file referred to by pathname, then the effective user ID of the calling process is changed to that of the owner of the program file. Similarly, if the set-group-ID bit is set on the program file, then the effective group ID of the calling process is set to the group of the program file.
The aforementioned transformations of the effective IDs are not performed (i.e., the set-user-ID and set-group-ID bits are ignored) if any of the following is true:
the no_new_privs attribute is set for the calling thread (see prctl(2));
the underlying filesystem is mounted nosuid (the MS_NOSUID flag for mount(2)); or
the calling process is being ptraced.
The capabilities of the program file (see capabilities(7)) are also ignored if any of the above are true.
The effective user ID of the process is copied to the saved set-user-ID; similarly, the effective group ID is copied to the saved set-group-ID. This copying takes place after any effective ID changes that occur because of the set-user-ID and set-group-ID mode bits.
The process’s real UID and real GID, as well as its supplementary group IDs, are unchanged by a call to execve().
If the executable is an a.out dynamically linked binary executable containing shared-library stubs, the Linux dynamic linker ld.so(8) is called at the start of execution to bring needed shared objects into memory and link the executable with them.
If the executable is a dynamically linked ELF executable, the interpreter named in the PT_INTERP segment is used to load the needed shared objects. This interpreter is typically /lib/ld-linux.so.2 for binaries linked with glibc (see ld-linux.so(8)).
Effect on process attributes
All process attributes are preserved during an execve(), except the following:
The dispositions of any signals that are being caught are reset to the default (signal(7)).
Any alternate signal stack is not preserved (sigaltstack(2)).
Memory mappings are not preserved (mmap(2)).
Attached System V shared memory segments are detached (shmat(2)).
POSIX shared memory regions are unmapped (shm_open(3)).
Open POSIX message queue descriptors are closed (mq_overview(7)).
Any open POSIX named semaphores are closed (sem_overview(7)).
POSIX timers are not preserved (timer_create(2)).
Any open directory streams are closed (opendir(3)).
Memory locks are not preserved (mlock(2), mlockall(2)).
Exit handlers are not preserved (atexit(3), on_exit(3)).
The floating-point environment is reset to the default (see fenv(3)).
The process attributes in the preceding list are all specified in POSIX.1. The following Linux-specific process attributes are also not preserved during an execve():
The process’s “dumpable” attribute is set to the value 1, unless a set-user-ID program, a set-group-ID program, or a program with capabilities is being executed, in which case the dumpable flag may instead be reset to the value in /proc/sys/fs/suid_dumpable, in the circumstances described under PR_SET_DUMPABLE in prctl(2). Note that changes to the “dumpable” attribute may cause ownership of files in the process’s /proc/pid directory to change to root:root, as described in proc(5).
The prctl(2) PR_SET_KEEPCAPS flag is cleared.
(Since Linux 2.4.36 / 2.6.23) If a set-user-ID or set-group-ID program is being executed, then the parent death signal set by prctl(2) PR_SET_PDEATHSIG flag is cleared.
The process name, as set by prctl(2) PR_SET_NAME (and displayed by ps -o comm), is reset to the name of the new executable file.
The SECBIT_KEEP_CAPS securebits flag is cleared. See capabilities(7).
The termination signal is reset to SIGCHLD (see clone(2)).
The file descriptor table is unshared, undoing the effect of the CLONE_FILES flag of clone(2).
Note the following further points:
All threads other than the calling thread are destroyed during an execve(). Mutexes, condition variables, and other pthreads objects are not preserved.
The equivalent of setlocale(LC_ALL, "C") is executed at program start-up.
POSIX.1 specifies that the dispositions of any signals that are ignored or set to the default are left unchanged. POSIX.1 specifies one exception: if SIGCHLD is being ignored, then an implementation may leave the disposition unchanged or reset it to the default; Linux does the former.
Any outstanding asynchronous I/O operations are canceled (aio_read(3), aio_write(3)).
For the handling of capabilities during execve(), see capabilities(7).
By default, file descriptors remain open across an execve(). File descriptors that are marked close-on-exec are closed; see the description of FD_CLOEXEC in fcntl(2). (If a file descriptor is closed, this will cause the release of all record locks obtained on the underlying file by this process. See fcntl(2) for details.) POSIX.1 says that if file descriptors 0, 1, and 2 would otherwise be closed after a successful execve(), and the process would gain privilege because the set-user-ID or set-group-ID mode bit was set on the executed file, then the system may open an unspecified file for each of these file descriptors. As a general principle, no portable program, whether privileged or not, can assume that these three file descriptors will remain closed across an execve().
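To keep a descriptor from leaking into the new program, the close-on-exec flag can be set before calling execve(). The helper name set_cloexec() below is hypothetical; the fcntl(2) operations it uses are the standard F_GETFD and F_SETFD.

```c
#include <fcntl.h>

/* Hypothetical helper: mark 'fd' close-on-exec while preserving any
 * other descriptor flags. Returns 0 on success, -1 with errno set. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1)
        return -1;
    return 0;
}
```

Where the descriptor is created by open(2), passing O_CLOEXEC at creation time avoids the window between opening the file and setting the flag.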
Interpreter scripts
An interpreter script is a text file that has execute permission enabled and whose first line is of the form:
#!interpreter [optional-arg]
The interpreter must be a valid pathname for an executable file.
If the pathname argument of execve() specifies an interpreter script, then interpreter will be invoked with the following arguments:
interpreter [optional-arg] pathname arg...
where pathname is the pathname of the file specified as the first argument of execve(), and arg… is the series of words pointed to by the argv argument of execve(), starting at argv[1]. Note that there is no way to get the argv[0] that was passed to the execve() call.
For portable use, optional-arg should either be absent, or be specified as a single word (i.e., it should not contain white space); see NOTES below.
Since Linux 2.6.28, the kernel permits the interpreter of a script to itself be a script. This permission is recursive, up to a limit of four recursions, so that the interpreter may be a script which is interpreted by a script, and so on.
Limits on size of arguments and environment
Most UNIX implementations impose some limit on the total size of the command-line argument (argv) and environment (envp) strings that may be passed to a new program. POSIX.1 allows an implementation to advertise this limit using the ARG_MAX constant (either defined in <limits.h> or available at run time using the call sysconf(_SC_ARG_MAX)).
Before Linux 2.6.23, the memory used to store the environment and argument strings was limited to 32 pages (defined by the kernel constant MAX_ARG_PAGES). On architectures with a 4-kB page size, this yields a maximum size of 128 kB.
On Linux 2.6.23 and later, most architectures support a size limit derived from the soft RLIMIT_STACK resource limit (see getrlimit(2)) that is in force at the time of the execve() call. (Architectures with no memory management unit are excepted: they maintain the limit that was in effect before Linux 2.6.23.) This change allows programs to have a much larger argument and/or environment list. For these architectures, the total size is limited to 1/4 of the allowed stack size. (Imposing the 1/4-limit ensures that the new program always has some stack space.) Additionally, the total size is limited to 3/4 of the value of the kernel constant _STK_LIM (8 MiB). Since Linux 2.6.25, the kernel also places a floor of 32 pages on this size limit, so that, even when RLIMIT_STACK is set very low, applications are guaranteed to have at least as much argument and environment space as was provided by Linux 2.6.22 and earlier. (This guarantee was not provided in Linux 2.6.23 and 2.6.24.) Additionally, the limit per string is 32 pages (the kernel constant MAX_ARG_STRLEN), and the maximum number of strings is 0x7FFFFFFF.
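The arithmetic described above can be sketched as a small function. This is an approximation of the rules as stated in the text, not the kernel's code: estimate_arg_limit() is a hypothetical name, and the 8 MiB value for _STK_LIM and the 32-page floor are taken from the paragraph above. The authoritative value at run time is sysconf(_SC_ARG_MAX).

```c
#include <sys/resource.h>

#define STK_LIM_BYTES (8UL * 1024 * 1024)   /* _STK_LIM, per the text above */
#define ARG_FLOOR_PAGES 32UL                /* floor since Linux 2.6.25 */

/* Hypothetical helper: estimate the total argv + envp size limit for
 * a given soft RLIMIT_STACK value and page size. */
unsigned long estimate_arg_limit(rlim_t stack_soft, unsigned long page_size)
{
    unsigned long limit = STK_LIM_BYTES / 4 * 3;    /* 3/4 of _STK_LIM */
    rlim_t quarter = stack_soft / 4;                /* 1/4 of the stack limit */
    unsigned long floor_bytes = ARG_FLOOR_PAGES * page_size;

    if (quarter < limit)
        limit = (unsigned long)quarter;
    if (limit < floor_bytes)
        limit = floor_bytes;
    return limit;
}
```

For example, with the common 8 MiB soft stack limit and 4-kB pages this yields 2 MiB, while an unlimited stack is capped at 6 MiB by the _STK_LIM rule.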
RETURN VALUE
On success, execve() does not return. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
E2BIG
The total number of bytes in the environment (envp) and argument list (argv) is too large, an argument or environment string is too long, or the full pathname of the executable is too long. The terminating null byte is counted as part of the string length.
EACCES
Search permission is denied on a component of the path prefix of pathname or the name of a script interpreter. (See also path_resolution(7).)
EACCES
The file or a script interpreter is not a regular file.
EACCES
Execute permission is denied for the file or a script or ELF interpreter.
EACCES
The filesystem is mounted noexec.
EAGAIN (since Linux 3.1)
Having changed its real UID using one of the set*uid() calls, the caller was (and is now still) above its RLIMIT_NPROC resource limit (see setrlimit(2)). For a more detailed explanation of this error, see NOTES.
EFAULT
pathname or one of the pointers in the vectors argv or envp points outside your accessible address space.
EINVAL
An ELF executable had more than one PT_INTERP segment (i.e., tried to name more than one interpreter).
EIO
An I/O error occurred.
EISDIR
An ELF interpreter was a directory.
ELIBBAD
An ELF interpreter was not in a recognized format.
ELOOP
Too many symbolic links were encountered in resolving pathname or the name of a script or ELF interpreter.
ELOOP
The maximum recursion limit was reached during recursive script interpretation (see “Interpreter scripts”, above). Before Linux 3.8, the error produced for this case was ENOEXEC.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENAMETOOLONG
pathname is too long.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOENT
The file pathname or a script or ELF interpreter does not exist.
ENOEXEC
An executable is not in a recognized format, is for the wrong architecture, or has some other format error that means it cannot be executed.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix of pathname or a script or ELF interpreter is not a directory.
EPERM
The filesystem is mounted nosuid, the user is not the superuser, and the file has the set-user-ID or set-group-ID bit set.
EPERM
The process is being traced, the user is not the superuser and the file has the set-user-ID or set-group-ID bit set.
EPERM
A “capability-dumb” application would not obtain the full set of permitted capabilities granted by the executable file. See capabilities(7).
ETXTBSY
The specified executable was open for writing by one or more processes.
VERSIONS
POSIX does not document the #! behavior, but it exists (with some variations) on other UNIX systems.
On Linux, argv and envp can be specified as NULL. In both cases, this has the same effect as specifying the argument as a pointer to a list containing a single null pointer. Do not take advantage of this nonstandard and nonportable misfeature! On many other UNIX systems, specifying argv as NULL will result in an error (EFAULT). Some other UNIX systems treat the envp==NULL case the same as Linux.
POSIX.1 says that values returned by sysconf(3) should be invariant over the lifetime of a process. However, since Linux 2.6.23, if the RLIMIT_STACK resource limit changes, then the value reported by _SC_ARG_MAX will also change, to reflect the fact that the limit on space for holding command-line arguments and environment variables has changed.
Interpreter scripts
The kernel imposes a maximum length on the text that follows the “#!” characters at the start of a script; characters beyond the limit are ignored. Before Linux 5.1, the limit is 127 characters. Since Linux 5.1, the limit is 255 characters.
The semantics of the optional-arg argument of an interpreter script vary across implementations. On Linux, the entire string following the interpreter name is passed as a single argument to the interpreter, and this string can include white space. However, behavior differs on some other systems. Some systems use the first white space to terminate optional-arg. On some systems, an interpreter script can have multiple arguments, and white spaces in optional-arg are used to delimit the arguments.
Linux (like most other modern UNIX systems) ignores the set-user-ID and set-group-ID bits on scripts.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
With UNIX V6, the argument list of an exec() call was ended by 0, while the argument list of main was ended by -1. Thus, this argument list was not directly usable in a further exec() call. Since UNIX V7, both are NULL.
NOTES
One sometimes sees execve() (and the related functions described in exec(3)) described as “executing a new process” (or similar). This is a highly misleading description: there is no new process; many attributes of the calling process remain unchanged (in particular, its PID). All that execve() does is arrange for an existing process (the calling process) to execute a new program.
Set-user-ID and set-group-ID processes cannot be ptrace(2)d.
The result of mounting a filesystem nosuid varies across Linux kernel versions: some will refuse execution of set-user-ID and set-group-ID executables when this would give the user powers they did not have already (and return EPERM), some will just ignore the set-user-ID and set-group-ID bits and exec() successfully.
In most cases where execve() fails, control returns to the original executable image, and the caller of execve() can then handle the error. However, in (rare) cases (typically caused by resource exhaustion), failure may occur past the point of no return: the original executable image has been torn down, but the new image could not be completely built. In such cases, the kernel kills the process with a SIGSEGV (SIGKILL until Linux 3.17) signal.
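A common way to handle both outcomes is to call execve() in a child and inspect the wait status in the parent. The sketch below is one possible arrangement; the name spawn_and_wait() is hypothetical, and the exit status 127 for a failed execve() is a shell-style convention assumed here, not something execve() itself defines.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run 'path' in a child process and wait for it.
 * Returns the child's exit status (127 if execve() failed in the
 * child), or -1 if fork()/waitpid() failed or the child was killed
 * by a signal. */
int spawn_and_wait(const char *path, char *const argv[], char *const envp[])
{
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        execve(path, argv, envp);
        _exit(127);               /* reached only if execve() returned */
    }

    int status;
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }

    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1;                    /* e.g. killed past the point of no return */
}
```

The child uses _exit(2) rather than exit(3) so that exit handlers and stdio buffers inherited from the parent are not run or flushed a second time.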
execve() and EAGAIN
A more detailed explanation of the EAGAIN error that can occur (since Linux 3.1) when calling execve() is as follows.
The EAGAIN error can occur when a preceding call to setuid(2), setreuid(2), or setresuid(2) caused the real user ID of the process to change, and that change caused the process to exceed its RLIMIT_NPROC resource limit (i.e., the number of processes belonging to the new real UID exceeds the resource limit). From Linux 2.6.0 to Linux 3.0, this caused the set*uid() call to fail. (Before Linux 2.6, the resource limit was not imposed on processes that changed their user IDs.)
Since Linux 3.1, the scenario just described no longer causes the set*uid() call to fail, because it too often led to security holes where buggy applications didn’t check the return status and assumed that, if the caller had root privileges, the call would always succeed. Instead, the set*uid() calls now successfully change the real UID, but the kernel sets an internal flag, named PF_NPROC_EXCEEDED, to note that the RLIMIT_NPROC resource limit has been exceeded. If the PF_NPROC_EXCEEDED flag is set and the resource limit is still exceeded at the time of a subsequent execve() call, that call fails with the error EAGAIN. This kernel logic ensures that the RLIMIT_NPROC resource limit is still enforced for the common privileged daemon workflow, namely fork(2) + set*uid() + execve().
If the resource limit was not still exceeded at the time of the execve() call (because other processes belonging to this real UID terminated between the set*uid() call and the execve() call), then the execve() call succeeds and the kernel clears the PF_NPROC_EXCEEDED process flag. The flag is also cleared if a subsequent call to fork(2) by this process succeeds.
EXAMPLES
The following program is designed to be execed by the second program below. It just echoes its command-line arguments, one per line.
/* myecho.c */
#include <stdio.h>
#include <stdlib.h>
int
main(int argc, char *argv[])
{
    for (size_t j = 0; j < argc; j++)
        printf("argv[%zu]: %s\n", j, argv[j]);
    exit(EXIT_SUCCESS);
}
This program can be used to exec the program named in its command-line argument:
/* execve.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    static char *newargv[] = { NULL, "hello", "world", NULL };
    static char *newenviron[] = { NULL };
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <file-to-exec>\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    newargv[0] = argv[1];
    execve(argv[1], newargv, newenviron);
    perror("execve");   /* execve() returns only on error */
    exit(EXIT_FAILURE);
}
We can use the second program to exec the first as follows:
$ cc myecho.c -o myecho
$ cc execve.c -o execve
$ ./execve ./myecho
argv[0]: ./myecho
argv[1]: hello
argv[2]: world
We can also use these programs to demonstrate the use of a script interpreter. To do this we create a script whose “interpreter” is our myecho program:
$ cat > script
#!./myecho script-arg
^D
$ chmod +x script
We can then use our program to exec the script:
$ ./execve ./script
argv[0]: ./myecho
argv[1]: script-arg
argv[2]: ./script
argv[3]: hello
argv[4]: world
SEE ALSO
chmod(2), execveat(2), fork(2), get_robust_list(2), ptrace(2), exec(3), fexecve(3), getauxval(3), getopt(3), system(3), capabilities(7), credentials(7), environ(7), path_resolution(7), ld.so(8)
263 - Linux cli command fstat
NAME
fstat - get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but (in the case of stat(), fstatat(), and lstat()) execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <time.h>
int
main(int argc, char *argv[])
{
    struct stat sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    printf("ID of containing device:  [%x,%x]\n",
           major(sb.st_dev),
           minor(sb.st_dev));

    printf("File type:                ");

    switch (sb.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");            break;
    case S_IFCHR:  printf("character device\n");        break;
    case S_IFDIR:  printf("directory\n");               break;
    case S_IFIFO:  printf("FIFO/pipe\n");               break;
    case S_IFLNK:  printf("symlink\n");                 break;
    case S_IFREG:  printf("regular file\n");            break;
    case S_IFSOCK: printf("socket\n");                  break;
    default:       printf("unknown?\n");                break;
    }

    printf("I-node number:            %ju\n", (uintmax_t) sb.st_ino);

    printf("Mode:                     %jo (octal)\n",
           (uintmax_t) sb.st_mode);

    printf("Link count:               %ju\n", (uintmax_t) sb.st_nlink);
    printf("Ownership:                UID=%ju   GID=%ju\n",
           (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);

    printf("Preferred I/O block size: %jd bytes\n",
           (intmax_t) sb.st_blksize);
    printf("File size:                %jd bytes\n",
           (intmax_t) sb.st_size);
    printf("Blocks allocated:         %jd\n",
           (intmax_t) sb.st_blocks);

    printf("Last status change:       %s", ctime(&sb.st_ctime));
    printf("Last file access:         %s", ctime(&sb.st_atime));
    printf("Last file modification:   %s", ctime(&sb.st_mtime));

    exit(EXIT_SUCCESS);
}
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
264 - Linux cli command getsockopt
NAME getsockopt
get and set options on sockets
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int getsockopt(int sockfd, int level, int optname,
void optval[restrict *.optlen],
socklen_t *restrict optlen);
int setsockopt(int sockfd, int level, int optname,
const void optval[.optlen],
socklen_t optlen);
DESCRIPTION
getsockopt() and setsockopt() manipulate options for the socket referred to by the file descriptor sockfd. Options may exist at multiple protocol levels; they are always present at the uppermost socket level.
When manipulating socket options, the level at which the option resides and the name of the option must be specified. To manipulate options at the sockets API level, level is specified as SOL_SOCKET. To manipulate options at any other level the protocol number of the appropriate protocol controlling the option is supplied. For example, to indicate that an option is to be interpreted by the TCP protocol, level should be set to the protocol number of TCP; see getprotoent(3).
The arguments optval and optlen are used to access option values for setsockopt(). For getsockopt() they identify a buffer in which the value for the requested option(s) is to be returned. For getsockopt(), optlen is a value-result argument, initially containing the size of the buffer pointed to by optval, and modified on return to indicate the actual size of the value returned. If no option value is to be supplied or returned, optval may be NULL.
Optname and any specified options are passed uninterpreted to the appropriate protocol module for interpretation. The include file <sys/socket.h> contains definitions for socket level options, described below. Options at other protocol levels vary in format and name; consult the appropriate entries in section 4 of the manual.
Most socket-level options utilize an int argument for optval. For setsockopt(), the argument should be nonzero to enable a boolean option, or zero if the option is to be disabled.
For a description of the available socket options see socket(7) and the appropriate protocol man pages.
RETURN VALUE
On success, zero is returned for the standard options. On error, -1 is returned, and errno is set to indicate the error.
Netfilter allows the programmer to define custom socket options with associated handlers; for such options, the return value on success is the value returned by the handler.
ERRORS
EBADF
The argument sockfd is not a valid file descriptor.
EFAULT
The address pointed to by optval is not in a valid part of the process address space. For getsockopt(), this error may also be returned if optlen is not in a valid part of the process address space.
EINVAL
optlen is invalid in setsockopt(). In some cases this error can also occur for an invalid value in optval (e.g., for the IP_ADD_MEMBERSHIP option described in ip(7)).
ENOPROTOOPT
The option is unknown at the level indicated.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD (first appeared in 4.2BSD).
BUGS
Several of the socket options should be handled at lower levels of the system.
SEE ALSO
ioctl(2), socket(2), getprotoent(3), protocols(5), ip(7), packet(7), socket(7), tcp(7), udp(7), unix(7)
265 - Linux cli command geteuid32
NAME geteuid32
get user identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
uid_t getuid(void);
uid_t geteuid(void);
DESCRIPTION
getuid() returns the real user ID of the calling process.
geteuid() returns the effective user ID of the calling process.
ERRORS
These functions are always successful and never modify errno.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD.
In UNIX V6 the getuid() call returned (euid << 8) + uid. UNIX V7 introduced separate calls getuid() and geteuid().
The original Linux getuid() and geteuid() system calls supported only 16-bit user IDs. Subsequently, Linux 2.4 added getuid32() and geteuid32(), supporting 32-bit IDs. The glibc getuid() and geteuid() wrapper functions transparently deal with the variations across kernel versions.
On Alpha, instead of a pair of getuid() and geteuid() system calls, a single getxuid() system call is provided, which returns a pair of real and effective UIDs. The glibc getuid() and geteuid() wrapper functions transparently deal with this. See syscall(2) for details regarding register mapping.
SEE ALSO
getresuid(2), setreuid(2), setuid(2), credentials(7)
266 - Linux cli command quotactl
NAME quotactl
manipulate disk quotas
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/quota.h>
#include <xfs/xqm.h> /* Definition of Q_X* and XFS_QUOTA_* constants
(or <linux/dqblk_xfs.h>; see NOTES) */
int quotactl(int op, const char *_Nullable special, int id,
caddr_t addr);
DESCRIPTION
The quota system can be used to set per-user, per-group, and per-project limits on the amount of disk space used on a filesystem. For each user and/or group, a soft limit and a hard limit can be set for each filesystem. The hard limit can’t be exceeded. The soft limit can be exceeded, but warnings will ensue. Moreover, the user can’t exceed the soft limit for more than the grace period duration (one week by default) at a time; after this, the soft limit counts as a hard limit.
The quotactl() call manipulates disk quotas. The op argument indicates an operation to be applied to the user or group ID specified in id. To initialize the op argument, use the QCMD(subop, type) macro. The type value is either USRQUOTA, for user quotas, GRPQUOTA, for group quotas, or (since Linux 4.1) PRJQUOTA, for project quotas. The subop value is described below.
The special argument is a pointer to a null-terminated string containing the pathname of the (mounted) block special device for the filesystem being manipulated.
The addr argument is the address of an optional, operation-specific, data structure that is copied in or out of the system. The interpretation of addr is given with each operation below.
The subop value is one of the following operations:
Q_QUOTAON
Turn on quotas for a filesystem. The id argument is the identification number of the quota format to be used. Currently, there are three supported quota formats:
QFMT_VFS_OLD
The original quota format.
QFMT_VFS_V0
The standard VFS v0 quota format, which can handle 32-bit UIDs and GIDs and quota limits up to 2^42 bytes and 2^32 inodes.
QFMT_VFS_V1
A quota format that can handle 32-bit UIDs and GIDs and quota limits of 2^63 - 1 bytes and 2^63 - 1 inodes.
The addr argument points to the pathname of a file containing the quotas for the filesystem. The quota file must exist; it is normally created with the quotacheck(8) program.
Quota information can also be stored in hidden system inodes for ext4, XFS, and other filesystems if the filesystem is configured so. In this case, there are no visible quota files and there is no need to use quotacheck(8). Quota information is always kept consistent by the filesystem and the Q_QUOTAON operation serves only to enable enforcement of quota limits. The presence of hidden system inodes with quota information is indicated by the DQF_SYS_FILE flag in the dqi_flags field returned by the Q_GETINFO operation.
This operation requires privilege (CAP_SYS_ADMIN).
Q_QUOTAOFF
Turn off quotas for a filesystem. The addr and id arguments are ignored. This operation requires privilege (CAP_SYS_ADMIN).
Q_GETQUOTA
Get disk quota limits and current usage for user or group id. The addr argument is a pointer to a dqblk structure defined in <sys/quota.h> as follows:
/* uint64_t is an unsigned 64-bit integer;
uint32_t is an unsigned 32-bit integer */
struct dqblk { /* Definition since Linux 2.4.22 */
uint64_t dqb_bhardlimit; /* Absolute limit on disk
quota blocks alloc */
uint64_t dqb_bsoftlimit; /* Preferred limit on
disk quota blocks */
uint64_t dqb_curspace; /* Current occupied space
(in bytes) */
uint64_t dqb_ihardlimit; /* Maximum number of
allocated inodes */
uint64_t dqb_isoftlimit; /* Preferred inode limit */
uint64_t dqb_curinodes; /* Current number of
allocated inodes */
uint64_t dqb_btime; /* Time limit for excessive
disk use */
uint64_t dqb_itime; /* Time limit for excessive
files */
uint32_t dqb_valid; /* Bit mask of QIF_*
constants */
};
/* Flags in dqb_valid that indicate which fields in
dqblk structure are valid. */
#define QIF_BLIMITS 1
#define QIF_SPACE 2
#define QIF_ILIMITS 4
#define QIF_INODES 8
#define QIF_BTIME 16
#define QIF_ITIME 32
#define QIF_LIMITS (QIF_BLIMITS | QIF_ILIMITS)
#define QIF_USAGE (QIF_SPACE | QIF_INODES)
#define QIF_TIMES (QIF_BTIME | QIF_ITIME)
#define QIF_ALL (QIF_LIMITS | QIF_USAGE | QIF_TIMES)
The dqb_valid field is a bit mask that is set to indicate the entries in the dqblk structure that are valid. Currently, the kernel fills in all entries of the dqblk structure and marks them as valid in the dqb_valid field. Unprivileged users may retrieve only their own quotas; a privileged user (CAP_SYS_ADMIN) can retrieve the quotas of any user.
Q_GETNEXTQUOTA (since Linux 4.6)
This operation is the same as Q_GETQUOTA, but it returns quota information for the next ID greater than or equal to id that has a quota set.
The addr argument is a pointer to a nextdqblk structure whose fields are as for the dqblk, except for the addition of a dqb_id field that is used to return the ID for which quota information is being returned:
struct nextdqblk {
uint64_t dqb_bhardlimit;
uint64_t dqb_bsoftlimit;
uint64_t dqb_curspace;
uint64_t dqb_ihardlimit;
uint64_t dqb_isoftlimit;
uint64_t dqb_curinodes;
uint64_t dqb_btime;
uint64_t dqb_itime;
uint32_t dqb_valid;
uint32_t dqb_id;
};
Q_SETQUOTA
Set quota information for user or group id, using the information supplied in the dqblk structure pointed to by addr. The dqb_valid field of the dqblk structure indicates which entries in the structure have been set by the caller. This operation supersedes the Q_SETQLIM and Q_SETUSE operations in the previous quota interfaces. This operation requires privilege (CAP_SYS_ADMIN).
Q_GETINFO (since Linux 2.4.22)
Get information (like grace times) about the quota file. The addr argument should be a pointer to a dqinfo structure. This structure is defined in <sys/quota.h> as follows:
/* uint64_t is an unsigned 64-bit integer;
uint32_t is an unsigned 32-bit integer */
struct dqinfo { /* Defined since Linux 2.4.22 */
uint64_t dqi_bgrace; /* Time before block soft limit
becomes hard limit */
uint64_t dqi_igrace; /* Time before inode soft limit
becomes hard limit */
uint32_t dqi_flags; /* Flags for quotafile
(DQF_*) */
uint32_t dqi_valid;
};
/* Bits for dqi_flags */
/* Quota format QFMT_VFS_OLD */
#define DQF_ROOT_SQUASH (1 << 0) /* Root squash enabled */
/* Before Linux v4.0, this had been defined
privately as V1_DQF_RSQUASH */
/* Quota format QFMT_VFS_V0 / QFMT_VFS_V1 */
#define DQF_SYS_FILE (1 << 16) /* Quota stored in
a system file */
/* Flags in dqi_valid that indicate which fields in
dqinfo structure are valid. */
#define IIF_BGRACE 1
#define IIF_IGRACE 2
#define IIF_FLAGS 4
#define IIF_ALL (IIF_BGRACE | IIF_IGRACE | IIF_FLAGS)
The dqi_valid field in the dqinfo structure indicates the entries in the structure that are valid. Currently, the kernel fills in all entries of the dqinfo structure and marks them all as valid in the dqi_valid field. The id argument is ignored.
Q_SETINFO (since Linux 2.4.22)
Set information about the quota file. The addr argument should be a pointer to a dqinfo structure. The dqi_valid field of the dqinfo structure indicates the entries in the structure that have been set by the caller. This operation supersedes the Q_SETGRACE and Q_SETFLAGS operations in the previous quota interfaces. The id argument is ignored. This operation requires privilege (CAP_SYS_ADMIN).
Q_GETFMT (since Linux 2.4.22)
Get quota format used on the specified filesystem. The addr argument should be a pointer to a 4-byte buffer where the format number will be stored.
Q_SYNC
Update the on-disk copy of quota usages for a filesystem. If special is NULL, then all filesystems with active quotas are sync’ed. The addr and id arguments are ignored.
Q_GETSTATS (supported up to Linux 2.4.21)
Get statistics and other generic information about the quota subsystem. The addr argument should be a pointer to a dqstats structure in which data should be stored. This structure is defined in <sys/quota.h>. The special and id arguments are ignored.
This operation is obsolete and was removed in Linux 2.4.22. Files in /proc/sys/fs/quota/ carry the information instead.
For XFS filesystems making use of the XFS Quota Manager (XQM), the above operations are bypassed and the following operations are used:
Q_XQUOTAON
Turn on quotas for an XFS filesystem. XFS provides the ability to turn on/off quota limit enforcement with quota accounting. Therefore, XFS expects addr to be a pointer to an unsigned int that contains a bitwise combination of the following flags (defined in <xfs/xqm.h>):
XFS_QUOTA_UDQ_ACCT /* User quota accounting */
XFS_QUOTA_UDQ_ENFD /* User quota limits enforcement */
XFS_QUOTA_GDQ_ACCT /* Group quota accounting */
XFS_QUOTA_GDQ_ENFD /* Group quota limits enforcement */
XFS_QUOTA_PDQ_ACCT /* Project quota accounting */
XFS_QUOTA_PDQ_ENFD /* Project quota limits enforcement */
This operation requires privilege (CAP_SYS_ADMIN). The id argument is ignored.
Q_XQUOTAOFF
Turn off quotas for an XFS filesystem. As with Q_XQUOTAON, XFS filesystems expect a pointer to an unsigned int that specifies whether quota accounting and/or limit enforcement need to be turned off (using the same flags as for the Q_XQUOTAON operation). This operation requires privilege (CAP_SYS_ADMIN). The id argument is ignored.
Q_XGETQUOTA
Get disk quota limits and current usage for user id. The addr argument is a pointer to an fs_disk_quota structure, which is defined in <xfs/xqm.h> as follows:
/* All the blk units are in BBs (Basic Blocks) of
512 bytes. */
#define FS_DQUOT_VERSION 1 /* fs_disk_quota.d_version */
#define XFS_USER_QUOTA (1<<0) /* User quota type */
#define XFS_PROJ_QUOTA (1<<1) /* Project quota type */
#define XFS_GROUP_QUOTA (1<<2) /* Group quota type */
struct fs_disk_quota {
int8_t d_version; /* Version of this structure */
int8_t d_flags; /* XFS_{USER,PROJ,GROUP}_QUOTA */
uint16_t d_fieldmask; /* Field specifier */
uint32_t d_id; /* User, project, or group ID */
uint64_t d_blk_hardlimit; /* Absolute limit on
disk blocks */
uint64_t d_blk_softlimit; /* Preferred limit on
disk blocks */
uint64_t d_ino_hardlimit; /* Maximum # allocated
inodes */
uint64_t d_ino_softlimit; /* Preferred inode limit */
uint64_t d_bcount; /* # disk blocks owned by
the user */
uint64_t d_icount; /* # inodes owned by the user */
int32_t d_itimer; /* Zero if within inode limits */
/* If not, we refuse service */
int32_t d_btimer; /* Similar to above; for
disk blocks */
uint16_t d_iwarns; /* # warnings issued with
respect to # of inodes */
uint16_t d_bwarns; /* # warnings issued with
respect to disk blocks */
int32_t d_padding2; /* Padding - for future use */
uint64_t d_rtb_hardlimit; /* Absolute limit on realtime
(RT) disk blocks */
uint64_t d_rtb_softlimit; /* Preferred limit on RT
disk blocks */
uint64_t d_rtbcount; /* # realtime blocks owned */
int32_t d_rtbtimer; /* Similar to above; for RT
disk blocks */
uint16_t d_rtbwarns; /* # warnings issued with
respect to RT disk blocks */
int16_t d_padding3; /* Padding - for future use */
char d_padding4[8]; /* Yet more padding */
};
Unprivileged users may retrieve only their own quotas; a privileged user (CAP_SYS_ADMIN) may retrieve the quotas of any user.
Q_XGETNEXTQUOTA (since Linux 4.6)
This operation is the same as Q_XGETQUOTA, but it returns (in the fs_disk_quota structure pointed to by addr) quota information for the next ID greater than or equal to id that has a quota set. Note that since fs_disk_quota already has a d_id field, no separate structure type is needed (in contrast with the Q_GETQUOTA and Q_GETNEXTQUOTA operations).
Q_XSETQLIM
Set disk quota limits for user id. The addr argument is a pointer to an fs_disk_quota structure. This operation requires privilege (CAP_SYS_ADMIN).
Q_XGETQSTAT
Returns XFS filesystem-specific quota information in the fs_quota_stat structure pointed to by addr. This is useful for finding out how much space is used to store quota information, and also to get the quota on/off status of a given local XFS filesystem. The fs_quota_stat structure itself is defined as follows:
#define FS_QSTAT_VERSION 1 /* fs_quota_stat.qs_version */
struct fs_qfilestat {
uint64_t qfs_ino; /* Inode number */
uint64_t qfs_nblks; /* Number of BBs
512-byte-blocks */
uint32_t qfs_nextents; /* Number of extents */
};
struct fs_quota_stat {
int8_t qs_version; /* Version number for
future changes */
uint16_t qs_flags; /* XFS_QUOTA_{U,P,G}DQ_{ACCT,ENFD} */
int8_t qs_pad; /* Unused */
struct fs_qfilestat qs_uquota; /* User quota storage
information */
struct fs_qfilestat qs_gquota; /* Group quota storage
information */
uint32_t qs_incoredqs; /* Number of dquots in core */
int32_t qs_btimelimit; /* Limit for blocks timer */
int32_t qs_itimelimit; /* Limit for inodes timer */
int32_t qs_rtbtimelimit;/* Limit for RT
blocks timer */
uint16_t qs_bwarnlimit; /* Limit for # of warnings */
uint16_t qs_iwarnlimit; /* Limit for # of warnings */
};
The id argument is ignored.
Q_XGETQSTATV
Returns XFS filesystem-specific quota information in the fs_quota_statv structure pointed to by addr. This version of the operation uses a structure with proper versioning support, along with appropriate layout (all fields are naturally aligned) and padding to avoid special compat handling; it also provides the ability to get statistics regarding the project quota file. The fs_quota_statv structure itself is defined as follows:
#define FS_QSTATV_VERSION1 1 /* fs_quota_statv.qs_version */
struct fs_qfilestatv {
uint64_t qfs_ino; /* Inode number */
uint64_t qfs_nblks; /* Number of BBs
512-byte-blocks */
uint32_t qfs_nextents; /* Number of extents */
uint32_t qfs_pad; /* Pad for 8-byte alignment */
};
struct fs_quota_statv {
int8_t qs_version; /* Version for future
changes */
uint8_t qs_pad1; /* Pad for 16-bit alignment */
uint16_t qs_flags; /* XFS_QUOTA_.* flags */
uint32_t qs_incoredqs; /* Number of dquots incore */
struct fs_qfilestatv qs_uquota; /* User quota
information */
struct fs_qfilestatv qs_gquota; /* Group quota
information */
struct fs_qfilestatv qs_pquota; /* Project quota
information */
int32_t qs_btimelimit; /* Limit for blocks timer */
int32_t qs_itimelimit; /* Limit for inodes timer */
int32_t qs_rtbtimelimit; /* Limit for RT blocks
timer */
uint16_t qs_bwarnlimit; /* Limit for # of warnings */
uint16_t qs_iwarnlimit; /* Limit for # of warnings */
uint64_t qs_pad2[8]; /* For future proofing */
};
The qs_version field of the structure should be filled with the version of the structure supported by the callee (for now, only FS_QSTATV_VERSION1 is supported). The kernel will fill the structure in accordance with the version provided. The id argument is ignored.
Q_XQUOTARM (buggy until Linux 3.16)
Free the disk space taken by disk quotas. The addr argument should be a pointer to an unsigned int value containing flags (the same as in d_flags field of fs_disk_quota structure) which identify what types of quota should be removed. (Note that the quota type passed in the op argument is ignored, but should remain valid in order to pass preliminary quotactl syscall handler checks.)
Quotas must have already been turned off. The id argument is ignored.
Q_XQUOTASYNC (since Linux 2.6.15; no-op since Linux 3.4)
This operation was an XFS quota equivalent to Q_SYNC, but it has been a no-op since Linux 3.4, as sync(1) now writes quota information to disk (in addition to the other filesystem metadata that it writes out). The special, id, and addr arguments are ignored.
RETURN VALUE
On success, quotactl() returns 0; on error -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
op is Q_QUOTAON, and the quota file pointed to by addr exists, but is not a regular file or is not on the filesystem pointed to by special.
EBUSY
op is Q_QUOTAON, but another Q_QUOTAON had already been performed.
EFAULT
addr or special is invalid.
EINVAL
op or type is invalid.
EINVAL
op is Q_QUOTAON, but the specified quota file is corrupted.
EINVAL (since Linux 5.5)
op is Q_XQUOTARM, but addr does not point to valid quota types.
ENOENT
The file specified by special or addr does not exist.
ENOSYS
The kernel has not been compiled with the CONFIG_QUOTA option.
ENOTBLK
special is not a block device.
EPERM
The caller lacked the required privilege (CAP_SYS_ADMIN) for the specified operation.
ERANGE
op is Q_SETQUOTA, but the specified limits are out of the range allowed by the quota format.
ESRCH
No disk quota is found for the indicated user. Quotas have not been turned on for this filesystem.
ESRCH
op is Q_QUOTAON, but the specified quota format was not found.
ESRCH
op is Q_GETNEXTQUOTA or Q_XGETNEXTQUOTA, but there is no ID greater than or equal to id that has an active quota.
NOTES
Instead of <xfs/xqm.h> one can use <linux/dqblk_xfs.h>, taking into account that there are several naming discrepancies:
Quota enabling flags (of format XFS_QUOTA_[UGP]DQ_{ACCT,ENFD}) are defined without a leading “X”, as FS_QUOTA_[UGP]DQ_{ACCT,ENFD}.
The same is true for XFS_{USER,GROUP,PROJ}_QUOTA quota type flags, which are defined as FS_{USER,GROUP,PROJ}_QUOTA.
The dqblk_xfs.h header file defines its own XQM_USRQUOTA, XQM_GRPQUOTA, and XQM_PRJQUOTA constants for the available quota types, but their values are the same as for constants without the XQM_ prefix.
SEE ALSO
quota(1), getrlimit(2), quotacheck(8), quotaon(8)
267 - Linux cli command pselect
NAME pselect
synchronous I/O multiplexing
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/select.h>
typedef /* ... */ fd_set;
int select(int nfds, fd_set *_Nullable restrict readfds,
fd_set *_Nullable restrict writefds,
fd_set *_Nullable restrict exceptfds,
struct timeval *_Nullable restrict timeout);
void FD_CLR(int fd, fd_set *set);
int FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);
int pselect(int nfds, fd_set *_Nullable restrict readfds,
fd_set *_Nullable restrict writefds,
fd_set *_Nullable restrict exceptfds,
const struct timespec *_Nullable restrict timeout,
const sigset_t *_Nullable restrict sigmask);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
pselect():
_POSIX_C_SOURCE >= 200112L
DESCRIPTION
WARNING: select() can monitor only file descriptor numbers that are less than FD_SETSIZE (1024), an unreasonably low limit for many modern applications, and this limitation will not change. All modern applications should instead use poll(2) or epoll(7), which do not suffer this limitation.
select() allows a program to monitor multiple file descriptors, waiting until one or more of the file descriptors become “ready” for some class of I/O operation (e.g., input possible). A file descriptor is considered ready if it is possible to perform a corresponding I/O operation (e.g., read(2), or a sufficiently small write(2)) without blocking.
fd_set
A structure type that can represent a set of file descriptors. According to POSIX, the maximum number of file descriptors in an fd_set structure is the value of the macro FD_SETSIZE.
File descriptor sets
The principal arguments of select() are three “sets” of file descriptors (declared with the type fd_set), which allow the caller to wait for three classes of events on the specified set of file descriptors. Each of the fd_set arguments may be specified as NULL if no file descriptors are to be watched for the corresponding class of events.
Note well: Upon return, each of the file descriptor sets is modified in place to indicate which file descriptors are currently “ready”. Thus, if using select() within a loop, the sets must be reinitialized before each call.
The contents of a file descriptor set can be manipulated using the following macros:
FD_ZERO()
This macro clears (removes all file descriptors from) set. It should be employed as the first step in initializing a file descriptor set.
FD_SET()
This macro adds the file descriptor fd to set. Adding a file descriptor that is already present in the set is a no-op, and does not produce an error.
FD_CLR()
This macro removes the file descriptor fd from set. Removing a file descriptor that is not present in the set is a no-op, and does not produce an error.
FD_ISSET()
select() modifies the contents of the sets according to the rules described below. After calling select(), the FD_ISSET() macro can be used to test if a file descriptor is still present in a set. FD_ISSET() returns nonzero if the file descriptor fd is present in set, and zero if it is not.
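As a minimal sketch of these macros in use, the helper below rebuilds the set and the timeout on every call, waits for one descriptor, and then reads from it. The names read_with_timeout and READ_TIMED_OUT are hypothetical and exist only for this illustration.

```c
#include <sys/select.h>
#include <unistd.h>

#define READ_TIMED_OUT (-2)   /* hypothetical sentinel, not a libc value */

/* Read up to len bytes from fd, waiting at most timeout_sec seconds.
 * Returns the result of read(2), READ_TIMED_OUT if nothing became
 * readable in time, or -1 on error. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, long timeout_sec)
{
    fd_set rfds;
    struct timeval tv;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;              /* fd cannot be stored in an fd_set */

    FD_ZERO(&rfds);             /* always start from an empty set */
    FD_SET(fd, &rfds);          /* watch fd for readability */
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    int ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready == -1)
        return -1;
    if (ready == 0 || !FD_ISSET(fd, &rfds))
        return READ_TIMED_OUT;  /* set came back empty */
    return read(fd, buf, len);
}
```

Because the set and the timeout are locals initialized inside the function, calling the helper repeatedly in a loop never reuses values that select() has already overwritten.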
Arguments
The arguments of select() are as follows:
readfds
The file descriptors in this set are watched to see if they are ready for reading. A file descriptor is ready for reading if a read operation will not block; in particular, a file descriptor is also ready on end-of-file.
After select() has returned, readfds will be cleared of all file descriptors except for those that are ready for reading.
writefds
The file descriptors in this set are watched to see if they are ready for writing. A file descriptor is ready for writing if a write operation will not block. However, even if a file descriptor is reported as writable, a large write may still block.
After select() has returned, writefds will be cleared of all file descriptors except for those that are ready for writing.
exceptfds
The file descriptors in this set are watched for “exceptional conditions”. For examples of some exceptional conditions, see the discussion of POLLPRI in poll(2).
After select() has returned, exceptfds will be cleared of all file descriptors except for those for which an exceptional condition has occurred.
nfds
This argument should be set to the highest-numbered file descriptor in any of the three sets, plus 1. The indicated file descriptors in each set are checked, up to this limit (but see BUGS).
timeout
The timeout argument is a timeval structure (shown below) that specifies the interval that select() should block waiting for a file descriptor to become ready. The call will block until either:
a file descriptor becomes ready;
the call is interrupted by a signal handler; or
the timeout expires.
Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.
If both fields of the timeval structure are zero, then select() returns immediately. (This is useful for polling.)
If timeout is specified as NULL, select() blocks indefinitely waiting for a file descriptor to become ready.
pselect()
The pselect() system call allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.
The operation of select() and pselect() is identical, other than these three differences:
select() uses a timeout that is a struct timeval (with seconds and microseconds), while pselect() uses a struct timespec (with seconds and nanoseconds).
select() may update the timeout argument to indicate how much time was left. pselect() does not change this argument.
select() has no sigmask argument, and behaves as pselect() called with NULL sigmask.
sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is not NULL, then pselect() first replaces the current signal mask by the one pointed to by sigmask, then does the “select” function, and then restores the original signal mask. (If sigmask is NULL, the signal mask is not modified during the pselect() call.)
Other than the difference in the precision of the timeout argument, the following pselect() call:
ready = pselect(nfds, &readfds, &writefds, &exceptfds,
timeout, &sigmask);
is equivalent to atomically executing the following calls:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
The reason that pselect() is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic test is needed to prevent race conditions. (Suppose the signal handler sets a global flag and returns. Then a test of this global flag followed by a call of select() could hang indefinitely if the signal arrived just after the test but just before the call. By contrast, pselect() allows one to first block signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.)
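The block-test-wait pattern described above might look like the following sketch for a single-threaded program. The helper count_until_signal, the flag stop_requested, and the handler on_stop are hypothetical names; a multithreaded program would use pthread_sigmask(3) instead of sigprocmask(2).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>

static volatile sig_atomic_t stop_requested;   /* set by the handler */

static void on_stop(int sig)
{
    (void) sig;
    stop_requested = 1;
}

/* Count bytes arriving on fd until signo is caught or end-of-file is
 * reached. Returns the byte count, or -1 on error. */
long count_until_signal(int fd, int signo)
{
    struct sigaction sa;
    sigset_t block, orig, waitmask;
    char buf[256];
    long total = 0;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;

    sa.sa_handler = on_stop;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, NULL) == -1)
        return -1;

    /* Block signo so that it can only be delivered inside pselect(). */
    sigemptyset(&block);
    sigaddset(&block, signo);
    if (sigprocmask(SIG_BLOCK, &block, &orig) == -1)
        return -1;
    waitmask = orig;
    sigdelset(&waitmask, signo);

    while (!stop_requested) {       /* tested while signo is blocked */
        fd_set rfds;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        /* Atomically unblock signo and wait for fd. */
        int ready = pselect(fd + 1, &rfds, NULL, NULL, NULL, &waitmask);
        if (ready == -1) {
            if (errno == EINTR)
                continue;           /* go back and re-test the flag */
            total = -1;
            break;
        }
        if (ready > 0 && FD_ISSET(fd, &rfds)) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n < 0)
                total = -1;
            if (n <= 0)
                break;              /* error or end-of-file */
            total += n;
        }
    }

    stop_requested = 0;
    sigprocmask(SIG_SETMASK, &orig, NULL);
    return total;
}
```

Since the flag is only tested while the signal is blocked, a signal that arrives between the test and the wait stays pending and interrupts pselect() instead of being lost.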
The timeout
The timeout argument for select() is a structure of the following type:
struct timeval {
time_t tv_sec; /* seconds */
suseconds_t tv_usec; /* microseconds */
};
The corresponding argument for pselect() is a timespec(3) structure.
On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1 permits either behavior.) This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses a struct timeval for multiple select()s in a loop without reinitializing it. Consider timeout to be undefined after select() returns.
RETURN VALUE
On success, select() and pselect() return the number of file descriptors contained in the three returned descriptor sets (that is, the total number of bits that are set in readfds, writefds, exceptfds). The return value may be zero if the timeout expired before any file descriptors became ready.
On error, -1 is returned, and errno is set to indicate the error; the file descriptor sets are unmodified, and timeout becomes undefined.
ERRORS
EBADF
An invalid file descriptor was given in one of the sets. (Perhaps a file descriptor that was already closed, or one on which an error has occurred.) However, see BUGS.
EINTR
A signal was caught; see signal(7).
EINVAL
nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see getrlimit(2)).
EINVAL
The value contained within timeout is invalid.
ENOMEM
Unable to allocate memory for internal tables.
VERSIONS
On some other UNIX systems, select() can fail with the error EAGAIN if the system fails to allocate kernel-internal resources, rather than ENOMEM as Linux does. POSIX specifies this error for poll(2), but not for select(). Portable programs may wish to check for EAGAIN and loop, just as with EINTR.
STANDARDS
POSIX.1-2008.
HISTORY
select()
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
Generally portable to/from non-BSD systems supporting clones of the BSD socket layer (including System V variants). However, note that the System V variant typically sets the timeout variable before returning, but the BSD variant does not.
pselect()
Linux 2.6.16. POSIX.1g, POSIX.1-2001.
Prior to this, it was emulated in glibc (but see BUGS).
fd_set
POSIX.1-2001.
NOTES
The following header also provides the fd_set type: <sys/time.h>.
An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior. Moreover, POSIX requires fd to be a valid file descriptor.
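One way to stay clear of that undefined behavior is to range-check the descriptor before touching the set. The wrapper name fd_set_checked below is hypothetical; it is a sketch, not part of any library.

```c
#include <sys/select.h>

/* Add fd to set only if an fd_set can represent it.
 * Returns 0 on success, -1 if fd is negative or >= FD_SETSIZE. */
int fd_set_checked(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;
    FD_SET(fd, set);
    return 0;
}
```

A caller that gets -1 back would typically fall back to poll(2) or epoll(7) for that descriptor.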
The operation of select() and pselect() is not affected by the O_NONBLOCK flag.
The self-pipe trick
On systems that lack pselect(), reliable (and more portable) signal trapping can be achieved using the self-pipe trick. In this technique, a signal handler writes a byte to a pipe whose other end is monitored by select() in the main program. (To avoid possibly blocking when writing to a pipe that may be full or reading from a pipe that may be empty, nonblocking I/O is used when reading from and writing to the pipe.)
Emulating usleep(3)
Before the advent of usleep(3), some code employed a call to select() with all three sets empty, nfds zero, and a non-NULL timeout as a fairly portable way to sleep with subsecond precision.
Correspondence between select() and poll() notifications
Within the Linux kernel source, we find the following definitions which show the correspondence between the readable, writable, and exceptional condition notifications of select() and the event notifications provided by poll(2) and epoll(7):
#define POLLIN_SET (EPOLLRDNORM | EPOLLRDBAND | EPOLLIN |
EPOLLHUP | EPOLLERR)
/* Ready for reading */
#define POLLOUT_SET (EPOLLWRBAND | EPOLLWRNORM | EPOLLOUT |
EPOLLERR)
/* Ready for writing */
#define POLLEX_SET (EPOLLPRI)
/* Exceptional condition */
Multithreaded applications
If a file descriptor being monitored by select() is closed in another thread, the result is unspecified. On some UNIX systems, select() unblocks and returns, with an indication that the file descriptor is ready (a subsequent I/O operation will likely fail with an error, unless another process reopens the file descriptor between the time select() returned and the I/O operation is performed). On Linux (and some other systems), closing the file descriptor in another thread has no effect on select(). In summary, any application that relies on a particular behavior in this scenario must be considered buggy.
C library/kernel differences
The Linux kernel allows file descriptor sets of arbitrary size, determining the length of the sets to be checked from the value of nfds. However, in the glibc implementation, the fd_set type is fixed in size. See also BUGS.
The pselect() interface described in this page is implemented by glibc. The underlying Linux system call is named pselect6(). This system call has somewhat different behavior from the glibc wrapper function.
The Linux pselect6() system call modifies its timeout argument. However, the glibc wrapper function hides this behavior by using a local variable for the timeout argument that is passed to the system call. Thus, the glibc pselect() function does not modify its timeout argument; this is the behavior required by POSIX.1-2001.
The final argument of the pselect6() system call is not a sigset_t * pointer, but is instead a structure of the form:
struct {
const kernel_sigset_t *ss; /* Pointer to signal set */
size_t ss_len; /* Size (in bytes) of object
pointed to by 'ss' */
};
This allows the system call to obtain both a pointer to the signal set and its size, while allowing for the fact that most architectures support a maximum of 6 arguments to a system call. See sigprocmask(2) for a discussion of the difference between the kernel and libc notion of the signal set.
Historical glibc details
glibc 2.0 provided an incorrect version of pselect() that did not take a sigmask argument.
From glibc 2.1 to glibc 2.2.1, one must define _GNU_SOURCE in order to obtain the declaration of pselect() from <sys/select.h>.
BUGS
POSIX allows an implementation to define an upper limit, advertised via the constant FD_SETSIZE, on the range of file descriptors that can be specified in a file descriptor set. The Linux kernel imposes no fixed limit, but the glibc implementation makes fd_set a fixed-size type, with FD_SETSIZE defined as 1024, and the FD_*() macros operating according to that limit. To monitor file descriptors greater than 1023, use poll(2) or epoll(7) instead.
The implementation of the fd_set arguments as value-result arguments is a design error that is avoided in poll(2) and epoll(7).
According to POSIX, select() should check all specified file descriptors in the three file descriptor sets, up to the limit nfds-1. However, the current implementation ignores any file descriptor in these sets that is greater than the maximum file descriptor number that the process currently has open. According to POSIX, any such file descriptor that is specified in one of the sets should result in the error EBADF.
Starting with glibc 2.1, glibc provided an emulation of pselect() that was implemented using sigprocmask(2) and select(). This implementation remained vulnerable to the very race condition that pselect() was designed to prevent. Modern versions of glibc use the (race-free) pselect() system call on kernels where it is provided.
On Linux, select() may report a socket file descriptor as “ready for reading”, while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has the wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.
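A sketch of that precaution: put the descriptor into nonblocking mode once, then treat EAGAIN after a readiness report as a spurious wakeup rather than an error. The helper names and the -2 sentinel are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into nonblocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Read after select() reported fd as readable. Returns the byte count,
 * 0 on end-of-file, -1 on error, or -2 if the readiness report turned
 * out to be spurious and no data was available. */
ssize_t read_ready(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;              /* go back to select() and wait again */
    return n;
}
```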
On Linux, select() also modifies timeout if the call is interrupted by a signal handler (i.e., the EINTR error return). This is not permitted by POSIX.1. The Linux pselect() system call has the same behavior, but the glibc wrapper hides this behavior by internally copying the timeout to a local variable and passing that variable to the system call.
EXAMPLES
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>

int
main(void)
{
    int             retval;
    fd_set          rfds;
    struct timeval  tv;

    /* Watch stdin (fd 0) to see when it has input. */

    FD_ZERO(&rfds);
    FD_SET(0, &rfds);

    /* Wait up to five seconds. */

    tv.tv_sec = 5;
    tv.tv_usec = 0;

    retval = select(1, &rfds, NULL, NULL, &tv);
    /* Don't rely on the value of tv now! */

    if (retval == -1)
        perror("select()");
    else if (retval)
        printf("Data is available now.\n");
        /* FD_ISSET(0, &rfds) will be true. */
    else
        printf("No data within five seconds.\n");

    exit(EXIT_SUCCESS);
}
SEE ALSO
accept(2), connect(2), poll(2), read(2), recv(2), restart_syscall(2), send(2), sigprocmask(2), write(2), timespec(3), epoll(7), time(7)
For a tutorial with discussion and examples, see select_tut(2).
268 - Linux cli command fstatat
NAME fstatat
get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but (in the case of stat(), fstatat(), and lstat()) execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
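The dirfd and flags rules above can be illustrated with two small helpers. The names is_dir_at and is_symlink_at are hypothetical; this is a sketch, not a replacement for checking errno in real code.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>

/* Report whether name, resolved relative to dirfd, is a directory
 * after following symbolic links (stat()-like behavior).
 * Returns 1 or 0, or -1 if fstatat() fails. */
int is_dir_at(int dirfd, const char *name)
{
    struct stat sb;

    if (fstatat(dirfd, name, &sb, 0) == -1)
        return -1;
    return S_ISDIR(sb.st_mode) ? 1 : 0;
}

/* Report whether name, resolved relative to dirfd, is itself a
 * symbolic link (lstat()-like behavior via AT_SYMLINK_NOFOLLOW).
 * Returns 1 or 0, or -1 if fstatat() fails. */
int is_symlink_at(int dirfd, const char *name)
{
    struct stat sb;

    if (fstatat(dirfd, name, &sb, AT_SYMLINK_NOFOLLOW) == -1)
        return -1;
    return S_ISLNK(sb.st_mode) ? 1 : 0;
}
```

Passing AT_FDCWD as dirfd makes both helpers behave like their stat() and lstat() counterparts for relative pathnames.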
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    struct stat sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    printf("ID of containing device:  [%x,%x]\n",
           major(sb.st_dev),
           minor(sb.st_dev));

    printf("File type:                ");

    switch (sb.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");      break;
    case S_IFCHR:  printf("character device\n");  break;
    case S_IFDIR:  printf("directory\n");         break;
    case S_IFIFO:  printf("FIFO/pipe\n");         break;
    case S_IFLNK:  printf("symlink\n");           break;
    case S_IFREG:  printf("regular file\n");      break;
    case S_IFSOCK: printf("socket\n");            break;
    default:       printf("unknown?\n");          break;
    }

    printf("I-node number:            %ju\n", (uintmax_t) sb.st_ino);

    printf("Mode:                     %jo (octal)\n",
           (uintmax_t) sb.st_mode);

    printf("Link count:               %ju\n", (uintmax_t) sb.st_nlink);
    printf("Ownership:                UID=%ju   GID=%ju\n",
           (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);

    printf("Preferred I/O block size: %jd bytes\n",
           (intmax_t) sb.st_blksize);
    printf("File size:                %jd bytes\n",
           (intmax_t) sb.st_size);
    printf("Blocks allocated:         %jd\n",
           (intmax_t) sb.st_blocks);

    printf("Last status change:       %s", ctime(&sb.st_ctime));
    printf("Last file access:         %s", ctime(&sb.st_atime));
    printf("Last file modification:   %s", ctime(&sb.st_mtime));

    exit(EXIT_SUCCESS);
}
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
269 - Linux cli command cacheflush
NAME cacheflush
flush contents of instruction and/or data cache
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/cachectl.h>
int cacheflush(void addr[.nbytes], int nbytes, int cache);
Note: On some architectures, there is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
cacheflush() flushes the contents of the indicated cache(s) for the user addresses in the range addr to (addr+nbytes-1). cache may be one of:
ICACHE
Flush the instruction cache.
DCACHE
Write back to memory and invalidate the affected valid cache lines.
BCACHE
Same as (ICACHE|DCACHE).
RETURN VALUE
cacheflush() returns 0 on success. On error, it returns -1 and sets errno to indicate the error.
ERRORS
EFAULT
Some or all of the address range addr to (addr+nbytes-1) is not accessible.
EINVAL
cache is not one of ICACHE, DCACHE, or BCACHE (but see BUGS).
VERSIONS
cacheflush() should not be used in programs intended to be portable. On Linux, this call first appeared on the MIPS architecture; nowadays Linux also provides a cacheflush() system call on some other architectures, but with different arguments.
Architecture-specific variants
glibc provides a wrapper for this system call, with the prototype shown in SYNOPSIS, for the following architectures: ARC, CSKY, MIPS, and NIOS2.
On some other architectures, Linux provides this system call, with different arguments:
M68K:
int cacheflush(unsigned long addr, int scope, int cache,
unsigned long len);
SH:
int cacheflush(unsigned long addr, unsigned long len, int op);
NDS32:
int cacheflush(unsigned int start, unsigned int end, int cache);
On the above architectures, glibc does not provide a wrapper for this system call; call it using syscall(2).
GCC alternative
Unless you need the finer grained control that this system call provides, you probably want to use the GCC built-in function __builtin___clear_cache(), which provides a portable interface across platforms supported by GCC and compatible compilers:
void __builtin___clear_cache(void *begin, void *end);
On platforms that don’t require instruction cache flushes, __builtin___clear_cache() has no effect.
Note: On some GCC-compatible compilers, the prototype for this built-in function uses char * instead of void * for the parameters.
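As a small sketch of the built-in in use, the hypothetical helper install_code below copies bytes into a buffer and then flushes that range. It assumes dst is already writable; making the memory executable (for example with mprotect(2)) is outside the scope of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Copy len bytes of generated code into dst, then make sure the
 * instruction cache does not hold stale contents for that range. */
void install_code(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    /* The char * casts keep the call acceptable to compilers whose
     * prototype uses char * rather than void *. */
    __builtin___clear_cache((char *) dst, (char *) dst + len);
}
```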
STANDARDS
Historically, this system call was available on all MIPS UNIX variants including RISC/os, IRIX, Ultrix, NetBSD, OpenBSD, and FreeBSD (and also on some non-UNIX MIPS operating systems), so that the existence of this call in MIPS operating systems is a de-facto standard.
BUGS
Linux kernels older than Linux 2.6.11 ignore the addr and nbytes arguments and always flush the whole cache, making this function fairly expensive.
This function always behaves as if BCACHE has been passed for the cache argument and does not do any error checking on the cache argument.
270 - Linux cli command mmap2
NAME mmap2
map files or devices into memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h> /* Definition of MAP_* and PROT_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
void *syscall(SYS_mmap2, unsigned long addr, unsigned long length,
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoffset);
DESCRIPTION
This is probably not the system call that you are interested in; instead, see mmap(2), which describes the glibc wrapper function that invokes this system call.
The mmap2() system call provides the same interface as mmap(2), except that the final argument specifies the offset into the file in 4096-byte units (instead of bytes, as is done by mmap(2)). This enables applications that use a 32-bit off_t to map large files (up to 2^44 bytes).
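The unit conversion can be sketched as plain arithmetic. The helper byte_offset_to_pgoffset and the macro MMAP2_UNIT are hypothetical names; the sketch assumes the 4096-byte unit described above and a 32-bit pgoffset, which is where the 2^44-byte limit (2^32 units of 2^12 bytes) comes from.

```c
#include <stdint.h>

#define MMAP2_UNIT 4096ULL   /* offset unit assumed by this sketch */

/* Convert a byte offset into the pgoffset argument of mmap2().
 * Returns 0 and stores the result, or -1 if the offset is not a
 * multiple of 4096 or the result does not fit in 32 bits. */
int byte_offset_to_pgoffset(uint64_t byte_offset, unsigned long *pgoffset)
{
    uint64_t units;

    if (byte_offset % MMAP2_UNIT != 0)
        return -1;
    units = byte_offset / MMAP2_UNIT;
    if (units > UINT32_MAX)
        return -1;              /* at or beyond 2^44 bytes */
    *pgoffset = (unsigned long) units;
    return 0;
}
```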
RETURN VALUE
On success, mmap2() returns a pointer to the mapped area. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
Problem with getting the data from user space.
EINVAL
(Various platforms where the page size is not 4096 bytes.) pgoffset * 4096 is not a multiple of the system page size.
mmap2() can also return any of the errors described in mmap(2).
VERSIONS
On architectures where this system call is present, the glibc mmap() wrapper function invokes this system call rather than the mmap(2) system call.
This system call does not exist on x86-64.
On ia64, the unit for offset is actually the system page size, rather than 4096 bytes.
STANDARDS
Linux.
HISTORY
Linux 2.3.31.
SEE ALSO
getpagesize(2), mmap(2), mremap(2), msync(2), shm_open(3)
271 - Linux cli command unlinkat
NAME unlinkat
delete a name and possibly the file it refers to
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int unlink(const char *pathname);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int unlinkat(int dirfd, const char *pathname, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
unlinkat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
unlink() deletes a name from the filesystem. If that name was the last link to a file and no processes have the file open, the file is deleted and the space it was using is made available for reuse.
If the name was the last link to a file but any processes still have the file open, the file will remain in existence until the last file descriptor referring to it is closed.
If the name referred to a symbolic link, the link is removed.
If the name referred to a socket, FIFO, or device, the name for it is removed but processes which have the object open may continue to use it.
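The rule that an unlinked file survives until its last descriptor is closed is what makes the familiar "anonymous temporary file" idiom work. The following is a minimal sketch of that idiom, not a definitive implementation; the helper name make_anonymous_file is hypothetical and the caller is assumed to pass a path in a writable directory.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a file at 'path', remove its name at once,
 * and return the still-open descriptor. Returns -1 on error.
 */
int make_anonymous_file(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;

    if (unlink(path) == -1) {   /* the name goes away, the open file stays */
        close(fd);
        return -1;
    }

    return fd;                  /* the space is reclaimed once fd is closed */
}
```

The returned descriptor can be read and written as usual, although the file can no longer be found by name.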
unlinkat()
The unlinkat() system call operates in exactly the same way as either unlink() or rmdir(2) (depending on whether or not flags includes the AT_REMOVEDIR flag) except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by unlink() and rmdir(2) for a relative pathname).
If the pathname given in pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like unlink() and rmdir(2)).
If the pathname given in pathname is absolute, then dirfd is ignored.
flags is a bit mask that can either be specified as 0, or by ORing together flag values that control the operation of unlinkat(). Currently, only one such flag is defined:
AT_REMOVEDIR
By default, unlinkat() performs the equivalent of unlink() on pathname. If the AT_REMOVEDIR flag is specified, it performs the equivalent of rmdir(2) on pathname.
See openat(2) for an explanation of the need for unlinkat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Write access to the directory containing pathname is not allowed for the process’s effective UID, or one of the directories in pathname did not allow search permission. (See also path_resolution(7).)
EBUSY
The file pathname cannot be unlinked because it is being used by the system or another process; for example, it is a mount point or the NFS client software created it to represent an active but otherwise nameless inode (“NFS silly renamed”).
EFAULT
pathname points outside your accessible address space.
EIO
An I/O error occurred.
EISDIR
pathname refers to a directory. (This is the non-POSIX value returned since Linux 2.1.132.)
ELOOP
Too many symbolic links were encountered in translating pathname.
ENAMETOOLONG
pathname was too long.
ENOENT
A component in pathname does not exist or is a dangling symbolic link, or pathname is empty.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory.
EPERM
The system does not allow unlinking of directories, or unlinking of directories requires privileges that the calling process doesn’t have. (This is the POSIX prescribed error return; as noted above, Linux returns EISDIR for this case.)
EPERM (Linux only)
The filesystem does not allow unlinking of files.
EPERM or EACCES
The directory containing pathname has the sticky bit (S_ISVTX) set and the process’s effective UID is neither the UID of the file to be deleted nor that of the directory containing it, and the process is not privileged (Linux: does not have the CAP_FOWNER capability).
EPERM
The file to be unlinked is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
pathname refers to a file on a read-only filesystem.
The same errors that occur for unlink() and rmdir(2) can also occur for unlinkat(). The following additional errors can occur for unlinkat():
EBADF
pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EINVAL
An invalid flag value was specified in flags.
EISDIR
pathname refers to a directory, and AT_REMOVEDIR was not specified in flags.
ENOTDIR
pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
STANDARDS
POSIX.1-2008.
HISTORY
unlink()
SVr4, 4.3BSD, POSIX.1-2001.
unlinkat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
glibc
On older kernels where unlinkat() is unavailable, the glibc wrapper function falls back to the use of unlink() or rmdir(2). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
BUGS
Infelicities in the protocol underlying NFS can cause the unexpected disappearance of files which are still being used.
SEE ALSO
rm(1), unlink(1), chmod(2), link(2), mknod(2), open(2), rename(2), rmdir(2), mkfifo(3), remove(3), path_resolution(7), symlink(7)
272 - Linux cli command spu_run
NAME 🖥️ spu_run 🖥️
execute an SPU context
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/spu.h> /* Definition of SPU_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_spu_run, int fd, uint32_t *npc, uint32_t *event);
Note: glibc provides no wrapper for spu_run(), necessitating the use of syscall(2).
DESCRIPTION
The spu_run() system call is used on PowerPC machines that implement the Cell Broadband Engine Architecture in order to access Synergistic Processor Units (SPUs). The fd argument is a file descriptor returned by spu_create(2) that refers to a specific SPU context. When the context gets scheduled to a physical SPU, it starts execution at the instruction pointer passed in npc.
Execution of SPU code happens synchronously, meaning that spu_run() blocks while the SPU is still running. If there is a need to execute SPU code in parallel with other code on either the main CPU or other SPUs, a new thread of execution must be created first (e.g., using pthread_create(3)).
When spu_run() returns, the current value of the SPU program counter is written to npc, so successive calls to spu_run() can use the same npc pointer.
The event argument provides a buffer for an extended status code. If the SPU context was created with the SPU_CREATE_EVENTS_ENABLED flag, then this buffer is populated by the Linux kernel before spu_run() returns.
The status code may be one (or more) of the following constants:
SPE_EVENT_DMA_ALIGNMENT
A DMA alignment error occurred.
SPE_EVENT_INVALID_DMA
An invalid MFC DMA command was attempted.
SPE_EVENT_SPE_DATA_STORAGE
A DMA storage error occurred.
SPE_EVENT_SPE_ERROR
An illegal instruction was executed.
NULL is a valid value for the event argument. In this case, the events will not be reported to the calling process.
RETURN VALUE
On success, spu_run() returns the value of the spu_status register. On failure, it returns -1 and sets errno to indicate the error.
The spu_status register value is a bit mask of status codes and optionally a 14-bit code returned from the stop-and-signal instruction on the SPU. The bit masks for the status codes are:
0x02
SPU was stopped by a stop-and-signal instruction.
0x04
SPU was stopped by a halt instruction.
0x08
SPU is waiting for a channel.
0x10
SPU is in single-step mode.
0x20
SPU has tried to execute an invalid instruction.
0x40
SPU has tried to access an invalid channel.
0x3fff0000
The bits masked with this value contain the code returned from a stop-and-signal instruction. These bits are valid only if the 0x02 bit is set.
If spu_run() has not returned an error, one or more bits among the lower eight ones are always set.
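The bit layout above can be decoded with a few masks and a shift. The following sketch uses only the numeric values listed in this section; the macro and function names are illustrative and do not come from any header.

```c
#include <stdint.h>

/* Numeric values are from the list above; the names are illustrative. */
#define DEMO_SPU_STOPPED_BY_STOP  0x02u
#define DEMO_SPU_STOPPED_BY_HALT  0x04u
#define DEMO_SPU_STOP_CODE_MASK   0x3fff0000u

/*
 * Return the 14-bit stop-and-signal code carried in 'status', or -1 if
 * the 0x02 bit is clear (in which case the code bits are not valid).
 */
int spu_stop_code(uint32_t status)
{
    if (!(status & DEMO_SPU_STOPPED_BY_STOP))
        return -1;

    return (int) ((status & DEMO_SPU_STOP_CODE_MASK) >> 16);
}

/* Return nonzero if the SPU was stopped by a halt instruction. */
int spu_halted(uint32_t status)
{
    return (status & DEMO_SPU_STOPPED_BY_HALT) != 0;
}
```

For a status value of 0x12340002, spu_stop_code() yields 0x1234.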
ERRORS
EBADF
fd is not a valid file descriptor.
EFAULT
npc is not a valid pointer, or event is non-NULL and an invalid pointer.
EINTR
A signal occurred while spu_run() was in progress; see signal(7). The npc value has been updated to the new program counter value if necessary.
EINVAL
fd is not a valid file descriptor returned from spu_create(2).
ENOMEM
There was not enough memory available to handle a page fault resulting from a Memory Flow Controller (MFC) direct memory access.
ENOSYS
The functionality is not provided by the current system, because either the hardware does not provide SPUs or the spufs module is not loaded.
STANDARDS
Linux on PowerPC.
HISTORY
Linux 2.6.16.
NOTES
spu_run() is meant to be used from libraries that implement a more abstract interface to SPUs, not to be used from regular applications. See for the recommended libraries.
EXAMPLES
The following is an example of running a simple, one-instruction SPU program with the spu_run() system call.
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>    /* Definition of SYS_* constants */
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int       context, fd, spu_status;
    uint32_t  instruction, npc;

    context = syscall(SYS_spu_create, "/spu/example-context", 0, 0755);
    if (context == -1)
        err(EXIT_FAILURE, "spu_create");

    /*
     * Write a 'stop 0x1234' instruction to the SPU's
     * local store memory.
     */
    instruction = 0x00001234;

    fd = open("/spu/example-context/mem", O_RDWR);
    if (fd == -1)
        err(EXIT_FAILURE, "open");
    if (write(fd, &instruction, sizeof(instruction))
            != (ssize_t) sizeof(instruction))
        err(EXIT_FAILURE, "write");

    /*
     * Set npc to the starting instruction address of the
     * SPU program. Since we wrote the instruction at the
     * start of the mem file, the entry point will be 0x0.
     */
    npc = 0;

    spu_status = syscall(SYS_spu_run, context, &npc, NULL);
    if (spu_status == -1)
        err(EXIT_FAILURE, "spu_run");

    /*
     * We should see a status code of 0x12340002:
     *   0x00000002 (spu was stopped due to stop-and-signal)
     * | 0x12340000 (the stop-and-signal code)
     */
    printf("SPU Status: %#08x\n", (unsigned int) spu_status);

    exit(EXIT_SUCCESS);
}
SEE ALSO
close(2), spu_create(2), capabilities(7), spufs(7)
273 - Linux cli command sbrk
NAME 🖥️ sbrk 🖥️
change data segment size
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int brk(void *addr);
void *sbrk(intptr_t increment);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
brk(), sbrk():
Since glibc 2.19:
_DEFAULT_SOURCE
|| ((_XOPEN_SOURCE >= 500) &&
! (_POSIX_C_SOURCE >= 200112L))
From glibc 2.12 to glibc 2.19:
_BSD_SOURCE || _SVID_SOURCE
|| ((_XOPEN_SOURCE >= 500) &&
! (_POSIX_C_SOURCE >= 200112L))
Before glibc 2.12:
_BSD_SOURCE || _SVID_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION
brk() and sbrk() change the location of the program break, which defines the end of the process’s data segment (i.e., the program break is the first location after the end of the uninitialized data segment). Increasing the program break has the effect of allocating memory to the process; decreasing the break deallocates memory.
brk() sets the end of the data segment to the value specified by addr, when that value is reasonable, the system has enough memory, and the process does not exceed its maximum data size (see setrlimit(2)).
sbrk() increments the program’s data space by increment bytes. Calling sbrk() with an increment of 0 can be used to find the current location of the program break.
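The relationship between the value returned by sbrk() and the current break can be seen in the sketch below. The helper name probe_break is hypothetical, and the sketch assumes that nothing else (for example, malloc(3) in another thread) moves the break between the calls.

```c
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: grow the data segment by 'nbytes', measure how far
 * the break moved, then shrink the segment again.
 * Returns the distance moved, or -1 if sbrk() failed.
 */
intptr_t probe_break(intptr_t nbytes)
{
    void *start = sbrk(nbytes);     /* previous break: start of new memory */
    if (start == (void *) -1)
        return -1;

    void *grown = sbrk(0);          /* an increment of 0 reports the break */
    intptr_t moved = (char *) grown - (char *) start;

    sbrk(-nbytes);                  /* give the memory back */
    return moved;
}
```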
RETURN VALUE
On success, brk() returns zero. On error, -1 is returned, and errno is set to ENOMEM.
On success, sbrk() returns the previous program break. (If the break was increased, then this value is a pointer to the start of the newly allocated memory). On error, (void *) -1 is returned, and errno is set to ENOMEM.
STANDARDS
None.
HISTORY
4.3BSD; SUSv1, marked LEGACY in SUSv2, removed in POSIX.1-2001.
NOTES
Avoid using brk() and sbrk(): the malloc(3) memory allocation package is the portable and comfortable way of allocating memory.
Various systems use various types for the argument of sbrk(). Common are int, ssize_t, ptrdiff_t, intptr_t.
C library/kernel differences
The return value described above for brk() is the behavior provided by the glibc wrapper function for the Linux brk() system call. (On most other implementations, the return value from brk() is the same; this return value was also specified in SUSv2.) However, the actual Linux system call returns the new program break on success. On failure, the system call returns the current break. The glibc wrapper function does some work (i.e., checks whether the new break is less than addr) to provide the 0 and -1 return values described above.
On Linux, sbrk() is implemented as a library function that uses the brk() system call, and does some internal bookkeeping so that it can return the old break value.
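The difference between the wrapper and the raw system call can be observed directly. The sketch below relies on the behavior described above, that a failing brk system call returns the current break, and assumes that an address of 0 is always rejected; the function name raw_current_break is hypothetical.

```c
#include <sys/syscall.h>    /* SYS_brk */
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel for the current program break by
 * issuing the raw system call with an address it cannot accept.
 */
void *raw_current_break(void)
{
    return (void *) syscall(SYS_brk, 0UL);
}
```

In a single-threaded program, the value returned is expected to equal the result of sbrk(0).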
SEE ALSO
execve(2), getrlimit(2), end(3), malloc(3)
274 - Linux cli command getpagesize
NAME 🖥️ getpagesize 🖥️
get memory page size
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int getpagesize(void);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
getpagesize():
Since glibc 2.20:
_DEFAULT_SOURCE || ! (_POSIX_C_SOURCE >= 200112L)
glibc 2.12 to glibc 2.19:
_BSD_SOURCE || ! (_POSIX_C_SOURCE >= 200112L)
Before glibc 2.12:
_BSD_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION
The function getpagesize() returns the number of bytes in a memory page, where “page” is a fixed-length block, the unit for memory allocation and file mapping performed by mmap(2).
VERSIONS
A user program should not hard-code a page size, neither as a literal nor using the PAGE_SIZE macro, because some architectures support multiple page sizes.
This manual page is in section 2 because Alpha, SPARC, and SPARC64 all have a Linux system call getpagesize() though other architectures do not, and use the ELF auxiliary vector instead.
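Instead of hard-coding a page size, a program can query it at run time. The sketch below uses sysconf(3) with _SC_PAGESIZE, which is the portable way to obtain the same value; the helper names are hypothetical.

```c
#include <stddef.h>
#include <unistd.h>

/* Query the page size at run time rather than assuming a constant. */
long runtime_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Round 'len' up to a whole number of pages, e.g. before calling mmap(2). */
size_t round_up_to_page(size_t len)
{
    size_t page = (size_t) runtime_page_size();

    return ((len + page - 1) / page) * page;
}
```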
STANDARDS
None.
HISTORY
This call first appeared in 4.2BSD. SVr4, 4.4BSD, SUSv2. In SUSv2 the getpagesize() call was labeled LEGACY, and it was removed in POSIX.1-2001.
glibc 2.0 returned a constant even on architectures with multiple page sizes.
SEE ALSO
mmap(2), sysconf(3)
275 - Linux cli command fgetxattr
NAME 🖥️ fgetxattr 🖥️
retrieve an extended attribute value
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
ssize_t getxattr(const char *path, const char *name,
void value[.size], size_t size);
ssize_t lgetxattr(const char *path, const char *name,
void value[.size], size_t size);
ssize_t fgetxattr(int fd, const char *name,
void value[.size], size_t size);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
getxattr() retrieves the value of the extended attribute identified by name and associated with the given path in the filesystem. The attribute value is placed in the buffer pointed to by value; size specifies the size of that buffer. The return value of the call is the number of bytes placed in value.
lgetxattr() is identical to getxattr(), except in the case of a symbolic link, where the link itself is interrogated, not the file that it refers to.
fgetxattr() is identical to getxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path.
An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. The value of an extended attribute is a chunk of arbitrary textual or binary data that was assigned using setxattr(2).
If size is specified as zero, these calls return the current size of the named extended attribute (and leave value unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the attribute value may change between the two calls, so that it is still necessary to check the return status from the second call.)
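The size-query pattern described above, including the retry that is needed when the value changes between the two calls, might be sketched as follows. The helper name read_xattr is hypothetical. Allocating one byte more than the reported size is a design choice of this sketch: it keeps the second call from being treated as another size query when the attribute is empty.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Hypothetical helper: read attribute 'name' of the open file 'fd' into a
 * newly allocated buffer. On success, returns the buffer (to be freed by
 * the caller) and stores the value length in *len. On failure, returns
 * NULL with errno set.
 */
void *read_xattr(int fd, const char *name, size_t *len)
{
    for (;;) {
        ssize_t size = fgetxattr(fd, name, NULL, 0);    /* size query */
        if (size == -1)
            return NULL;

        size_t cap = (size_t) size + 1;
        void *buf = malloc(cap);
        if (buf == NULL)
            return NULL;

        ssize_t got = fgetxattr(fd, name, buf, cap);
        if (got >= 0) {
            *len = (size_t) got;
            return buf;
        }

        int saved = errno;
        free(buf);
        if (saved != ERANGE) {
            errno = saved;
            return NULL;
        }
        /* ERANGE: the value grew between the two calls; query again. */
    }
}
```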
RETURN VALUE
On success, these calls return a nonnegative value which is the size (in bytes) of the extended attribute value. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
E2BIG
The size of the attribute value is larger than the maximum size allowed; the attribute cannot be retrieved. This can happen on filesystems that support very large attribute values such as NFSv4, for example.
ENODATA
The named attribute does not exist, or the process has no access to this attribute.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
ERANGE
The size of the value buffer is too small to hold the result.
In addition, the errors documented in stat(2) can also occur.
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
EXAMPLES
See listxattr(2).
SEE ALSO
getfattr(1), setfattr(1), listxattr(2), open(2), removexattr(2), setxattr(2), stat(2), symlink(7), xattr(7)
276 - Linux cli command kexec_file_load
NAME 🖥️ kexec_file_load 🖥️
load a new kernel for later execution
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/kexec.h> /* Definition of KEXEC_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_kexec_load, unsigned long entry,
unsigned long nr_segments, struct kexec_segment *segments,
unsigned long flags);
long syscall(SYS_kexec_file_load, int kernel_fd, int initrd_fd,
unsigned long cmdline_len, const char *cmdline,
unsigned long flags);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
The kexec_load() system call loads a new kernel that can be executed later by reboot(2).
The flags argument is a bit mask that controls the operation of the call. The following values can be specified in flags:
KEXEC_ON_CRASH (since Linux 2.6.13)
Execute the new kernel automatically on a system crash. This “crash kernel” is loaded into an area of reserved memory that is determined at boot time using the crashkernel kernel command-line parameter. The location of this reserved memory is exported to user space via the /proc/iomem file, in an entry labeled “Crash kernel”. A user-space application can parse this file and prepare a list of segments (see below) that specify this reserved memory as destination. If this flag is specified, the kernel checks that the target segments specified in segments fall within the reserved region.
KEXEC_PRESERVE_CONTEXT (since Linux 2.6.27)
Preserve the system hardware and software states before executing the new kernel. This could be used for system suspend. This flag is available only if the kernel was configured with CONFIG_KEXEC_JUMP, and is effective only if nr_segments is greater than 0.
The high-order bits (corresponding to the mask 0xffff0000) of flags contain the architecture of the to-be-executed kernel. Specify (OR) the constant KEXEC_ARCH_DEFAULT to use the current architecture, or one of the following architecture constants KEXEC_ARCH_386, KEXEC_ARCH_68K, KEXEC_ARCH_X86_64, KEXEC_ARCH_PPC, KEXEC_ARCH_PPC64, KEXEC_ARCH_IA_64, KEXEC_ARCH_ARM, KEXEC_ARCH_S390, KEXEC_ARCH_SH, KEXEC_ARCH_MIPS, and KEXEC_ARCH_MIPS_LE. The architecture must be executable on the CPU of the system.
The entry argument is the physical entry address in the kernel image. The nr_segments argument is the number of segments pointed to by the segments pointer; the kernel imposes an (arbitrary) limit of 16 on the number of segments. The segments argument is an array of kexec_segment structures which define the kernel layout:
struct kexec_segment {
void *buf; /* Buffer in user space */
size_t bufsz; /* Buffer length in user space */
void *mem; /* Physical address of kernel */
size_t memsz; /* Physical address length */
};
The kernel image defined by segments is copied from the calling process into the kernel either in regular memory or in reserved memory (if KEXEC_ON_CRASH is set). The kernel first performs various sanity checks on the information passed in segments. If these checks pass, the kernel copies the segment data to kernel memory. Each segment specified in segments is copied as follows:
buf and bufsz identify a memory region in the caller’s virtual address space that is the source of the copy. The value in bufsz may not exceed the value in the memsz field.
mem and memsz specify a physical address range that is the target of the copy. The values specified in both fields must be multiples of the system page size.
bufsz bytes are copied from the source buffer to the target kernel buffer. If bufsz is less than memsz, then the excess bytes in the kernel buffer are zeroed out.
In case of a normal kexec (i.e., the KEXEC_ON_CRASH flag is not set), the segment data is loaded in any available memory and is moved to the final destination at kexec reboot time (e.g., when the kexec(8) command is executed with the -e option).
In case of kexec on panic (i.e., the KEXEC_ON_CRASH flag is set), the segment data is loaded to reserved memory at the time of the call, and, after a crash, the kexec mechanism simply passes control to that kernel.
The kexec_load() system call is available only if the kernel was configured with CONFIG_KEXEC.
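A caller can check the documented constraints on the segment array before issuing the system call. The sketch below is a user-space pre-check written for illustration, not the kernel's own validation code. It uses a local stand-in for struct kexec_segment (struct demo_segment, with the physical address stored as an integer) so that it compiles without <linux/kexec.h>; all names in it are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for struct kexec_segment (assumption of this sketch). */
struct demo_segment {
    const void *buf;    /* Buffer in user space */
    size_t      bufsz;  /* Buffer length in user space */
    uintptr_t   mem;    /* Physical target address, as an integer */
    size_t      memsz;  /* Length of the physical target range */
};

#define DEMO_SEGMENT_MAX 16     /* limit on nr_segments stated above */

/* Return true if 'segs' satisfies the constraints listed in the text. */
bool segments_look_valid(const struct demo_segment *segs, size_t n,
                         size_t page_size)
{
    if (n > DEMO_SEGMENT_MAX)
        return false;

    for (size_t i = 0; i < n; i++) {
        /* bufsz may not exceed memsz. */
        if (segs[i].bufsz > segs[i].memsz)
            return false;

        /* mem and memsz must be multiples of the page size. */
        if (segs[i].mem % page_size != 0 || segs[i].memsz % page_size != 0)
            return false;

        /* Target ranges must not overlap one another. */
        for (size_t j = 0; j < i; j++) {
            if (segs[i].mem < segs[j].mem + segs[j].memsz &&
                segs[j].mem < segs[i].mem + segs[i].memsz)
                return false;
        }
    }

    return true;
}
```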
kexec_file_load()
The kexec_file_load() system call is similar to kexec_load(), but it takes a different set of arguments. It reads the kernel to be loaded from the file referred to by the file descriptor kernel_fd, and the initrd (initial RAM disk) to be loaded from the file referred to by the file descriptor initrd_fd. The cmdline argument is a pointer to a buffer containing the command line for the new kernel. The cmdline_len argument specifies the size of the buffer. The last byte in the buffer must be a null byte ('\0').
The flags argument is a bit mask which modifies the behavior of the call. The following values can be specified in flags:
KEXEC_FILE_UNLOAD
Unload the currently loaded kernel.
KEXEC_FILE_ON_CRASH
Load the new kernel in the memory region reserved for the crash kernel (as for KEXEC_ON_CRASH). This kernel is booted if the currently running kernel crashes.
KEXEC_FILE_NO_INITRAMFS
Loading initrd/initramfs is optional. Specify this flag if no initramfs is being loaded. If this flag is set, the value passed in initrd_fd is ignored.
The kexec_file_load() system call was added to provide support for systems where “kexec” loading should be restricted to only kernels that are signed. This system call is available only if the kernel was configured with CONFIG_KEXEC_FILE.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EADDRNOTAVAIL
The KEXEC_ON_CRASH flag was specified, but the region specified by the mem and memsz fields of one of the segments entries lies outside the range of memory reserved for the crash kernel.
EADDRNOTAVAIL
The value in a mem or memsz field in one of the segments entries is not a multiple of the system page size.
EBADF
kernel_fd or initrd_fd is not a valid file descriptor.
EBUSY
Another crash kernel is already being loaded or a crash kernel is already in use.
EINVAL
flags is invalid.
EINVAL
The value of a bufsz field in one of the segments entries exceeds the value in the corresponding memsz field.
EINVAL
nr_segments exceeds KEXEC_SEGMENT_MAX (16).
EINVAL
Two or more of the kernel target buffers overlap.
EINVAL
The value in cmdline[cmdline_len-1] is not '\0'.
EINVAL
The file referred to by kernel_fd or initrd_fd is empty (length zero).
ENOEXEC
kernel_fd does not refer to an open file, or the kernel can’t load this file. Currently, the file must be a bzImage and contain an x86 kernel that is loadable above 4 GiB in memory (see the kernel source file Documentation/x86/boot.txt).
ENOMEM
Could not allocate memory.
EPERM
The caller does not have the CAP_SYS_BOOT capability.
STANDARDS
Linux.
HISTORY
kexec_load()
Linux 2.6.13.
kexec_file_load()
Linux 3.17.
SEE ALSO
reboot(2), syscall(2), kexec(8)
The kernel source files Documentation/kdump/kdump.txt and Documentation/admin-guide/kernel-parameters.txt
277 - Linux cli command posix_fadvise
NAME 🖥️ posix_fadvise 🖥️
predeclare an access pattern for file data
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h>
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
posix_fadvise():
_POSIX_C_SOURCE >= 200112L
DESCRIPTION
Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.
The advice applies to a (not necessarily existent) region starting at offset and extending for len bytes (or until the end of the file if len is 0) within the file referred to by fd. The advice is not binding; it merely constitutes an expectation on behalf of the application.
Permissible values for advice include:
POSIX_FADV_NORMAL
Indicates that the application has no advice to give about its access pattern for the specified data. If no advice is given for an open file, this is the default assumption.
POSIX_FADV_SEQUENTIAL
The application expects to access the specified data sequentially (with lower offsets read before higher ones).
POSIX_FADV_RANDOM
The specified data will be accessed in random order.
POSIX_FADV_NOREUSE
The specified data will be accessed only once.
Before Linux 2.6.18, POSIX_FADV_NOREUSE had the same semantics as POSIX_FADV_WILLNEED. This was probably a bug; since Linux 2.6.18, this flag is a no-op.
POSIX_FADV_WILLNEED
The specified data will be accessed in the near future.
POSIX_FADV_WILLNEED initiates a nonblocking read of the specified region into the page cache. The amount of data read may be decreased by the kernel depending on virtual memory load. (A few megabytes will usually be fully satisfied, and more is rarely useful.)
POSIX_FADV_DONTNEED
The specified data will not be accessed in the near future.
POSIX_FADV_DONTNEED attempts to free cached pages associated with the specified region. This is useful, for example, while streaming large files. A program may periodically request the kernel to free cached data that has already been used, so that more useful cached pages are not discarded instead.
Requests to discard partial pages are ignored, because preserving needed data is considered preferable to discarding unneeded data. If the application requires that data be considered for discarding, then offset and len must be page-aligned.
The implementation may attempt to write back dirty pages in the specified region, but this is not guaranteed. Any unwritten dirty pages will not be freed. If the application wishes to ensure that dirty pages will be released, it should call fsync(2) or fdatasync(2) first.
RETURN VALUE
On success, zero is returned. On error, an error number is returned.
ERRORS
EBADF
The fd argument was not a valid file descriptor.
EINVAL
An invalid value was specified for advice.
ESPIPE
The specified file descriptor refers to a pipe or FIFO. (ESPIPE is the error specified by POSIX, but before Linux 2.6.16, Linux returned EINVAL in this case.)
VERSIONS
Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely. These changes affect the entire file, not just the specified region (but other open file handles to the same file are unaffected).
C library/kernel differences
The name of the wrapper function in the C library is posix_fadvise(). The underlying system call is called fadvise64() (or, on some architectures, fadvise64_64()); the difference between the two is that the former system call assumes that the type of the len argument is size_t, while the latter expects loff_t there.
Architecture-specific variants
Some architectures require 64-bit arguments to be aligned in a suitable pair of registers (see syscall(2) for further detail). On such architectures, the call signature of posix_fadvise() shown in the SYNOPSIS would force a register to be wasted as padding between the fd and offset arguments. Therefore, these architectures define a version of the system call that orders the arguments suitably, but is otherwise exactly the same as posix_fadvise().
For example, since Linux 2.6.14, ARM has the following system call:
long arm_fadvise64_64(int fd, int advice,
loff_t offset, loff_t len);
These architecture-specific details are generally hidden from applications by the glibc posix_fadvise() wrapper function, which invokes the appropriate architecture-specific system call.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
Kernel support first appeared in Linux 2.5.60; the underlying system call is called fadvise64(). Library support has been provided since glibc 2.2, via the wrapper function posix_fadvise().
Since Linux 3.18, support for the underlying system call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS configuration option.
The type of the len argument was changed from size_t to off_t in POSIX.1-2001 TC1.
NOTES
The contents of the kernel buffer cache can be cleared via the /proc/sys/vm/drop_caches interface described in proc(5).
One can obtain a snapshot of which pages of a file are resident in the buffer cache by opening a file, mapping it with mmap(2), and then applying mincore(2) to the mapping.
BUGS
Before Linux 2.6.6, if len was specified as 0, then this was interpreted literally as “zero bytes”, rather than as meaning “all bytes through to the end of the file”.
SEE ALSO
fincore(1), mincore(2), readahead(2), sync_file_range(2), posix_fallocate(3), posix_madvise(3)
278 - Linux cli command ioctl_userfaultfd
NAME 🖥️ ioctl_userfaultfd 🖥️
create a file descriptor for handling page faults in user space
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/userfaultfd.h> /* Definition of UFFD* constants */
#include <sys/ioctl.h>
int ioctl(int fd, int op, ...);
DESCRIPTION
Various ioctl(2) operations can be performed on a userfaultfd object (created by a call to userfaultfd(2)) using calls of the form:
ioctl(fd, op, argp);
In the above, fd is a file descriptor referring to a userfaultfd object, op is one of the operations listed below, and argp is a pointer to a data structure that is specific to op.
The various ioctl(2) operations are described below. The UFFDIO_API, UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure userfaultfd behavior. These operations allow the caller to choose what features will be enabled and what kinds of events will be delivered to the application. The remaining operations are range operations. These operations enable the calling application to resolve page-fault events.
UFFDIO_API
(Since Linux 4.3.) Enable operation of the userfaultfd and perform API handshake.
The argp argument is a pointer to a uffdio_api structure, defined as:
struct uffdio_api {
__u64 api; /* Requested API version (input) */
__u64 features; /* Requested features (input/output) */
__u64 ioctls; /* Available ioctl() operations (output) */
};
The api field denotes the API version requested by the application. The kernel verifies that it can support the requested API version, and sets the features and ioctls fields to bit masks representing all the available features and the generic ioctl(2) operations available.
Since Linux 4.11, applications should use the features field to perform a two-step handshake. First, UFFDIO_API is called with the features field set to zero. The kernel responds by setting all supported feature bits.
Applications which do not require any specific features can begin using the userfaultfd immediately. Applications which do need specific features should call UFFDIO_API again with a subset of the reported feature bits set to enable those features.
Before Linux 4.11, the features field must be initialized to zero before the call to UFFDIO_API, and zero (i.e., no feature bits) is placed in the features field by the kernel upon return from ioctl(2).
If the application sets unsupported feature bits, the kernel will zero out the returned uffdio_api structure and return EINVAL.
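A sketch of the handshake follows. It takes the conservative route noted under BUGS: the descriptor used for the query is closed and a fresh one is opened before features are enabled. The helper names (uffd_missing_features, uffd_open_api, uffd_open_with_features) are hypothetical, and userfaultfd(2) is invoked through syscall(2) on the assumption that the C library provides no wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Feature bits the caller wants that the kernel did not report. */
uint64_t uffd_missing_features(uint64_t supported, uint64_t wanted)
{
    return wanted & ~supported;
}

/* Create a userfaultfd object and run UFFDIO_API with the given features.
 * On success returns the descriptor; *api holds the kernel's reply. */
static int uffd_open_api(uint64_t features, struct uffdio_api *api)
{
    assert(api != NULL);
    int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (fd == -1)
        return -1;

    *api = (struct uffdio_api){ .api = UFFD_API, .features = features };
    if (ioctl(fd, UFFDIO_API, api) == -1) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Return a userfaultfd with exactly the features in wanted enabled,
 * or -1 with errno set (EINVAL if the kernel lacks a wanted feature). */
int uffd_open_with_features(uint64_t wanted)
{
    struct uffdio_api api;

    int fd = uffd_open_api(0, &api);      /* step 1: query with features = 0 */
    if (fd == -1 || wanted == 0)
        return fd;                        /* error, or nothing to enable */

    close(fd);                            /* reopen before enabling (see BUGS) */
    if (uffd_missing_features(api.features, wanted) != 0) {
        errno = EINVAL;
        return -1;
    }
    return uffd_open_api(wanted, &api);   /* step 2: enable the subset */
}
```

A caller that needs, say, UFFD_FEATURE_EVENT_FORK would pass that constant as wanted and fall back to a feature-free descriptor if the call fails with EINVAL.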
The following feature bits may be set:
UFFD_FEATURE_EVENT_FORK (since Linux 4.11)
When this feature is enabled, the userfaultfd objects associated with a parent process are duplicated into the child process during fork(2) and a UFFD_EVENT_FORK event is delivered to the userfaultfd monitor.
UFFD_FEATURE_EVENT_REMAP (since Linux 4.11)
If this feature is enabled, when the faulting process invokes mremap(2), the userfaultfd monitor will receive an event of type UFFD_EVENT_REMAP.
UFFD_FEATURE_EVENT_REMOVE (since Linux 4.11)
If this feature is enabled, when the faulting process calls madvise(2) with the MADV_DONTNEED or MADV_REMOVE advice value to free a virtual memory area, the userfaultfd monitor will receive an event of type UFFD_EVENT_REMOVE.
UFFD_FEATURE_EVENT_UNMAP (since Linux 4.11)
If this feature is enabled, when the faulting process unmaps virtual memory either explicitly with munmap(2), or implicitly during either mmap(2) or mremap(2), the userfaultfd monitor will receive an event of type UFFD_EVENT_UNMAP.
UFFD_FEATURE_MISSING_HUGETLBFS (since Linux 4.11)
If this feature bit is set, the kernel supports registering userfaultfd ranges on hugetlbfs virtual memory areas.
UFFD_FEATURE_MISSING_SHMEM (since Linux 4.11)
If this feature bit is set, the kernel supports registering userfaultfd ranges on shared memory areas. This includes all kernel shared memory APIs: System V shared memory, tmpfs(5), shared mappings of /dev/zero, mmap(2) with the MAP_SHARED flag set, memfd_create(2), and so on.
UFFD_FEATURE_SIGBUS (since Linux 4.14)
If this feature bit is set, no page-fault events (UFFD_EVENT_PAGEFAULT) will be delivered. Instead, a SIGBUS signal will be sent to the faulting process. Applications using this feature will not require the use of a userfaultfd monitor for processing memory accesses to the regions registered with userfaultfd.
UFFD_FEATURE_THREAD_ID (since Linux 4.14)
If this feature bit is set, uffd_msg.pagefault.feat.ptid will be set to the faulted thread ID for each page-fault message.
UFFD_FEATURE_PAGEFAULT_FLAG_WP (since Linux 5.10)
If this feature bit is set, userfaultfd supports write-protect faults for anonymous memory. (Note that shmem / hugetlbfs support is indicated by a separate feature.)
UFFD_FEATURE_MINOR_HUGETLBFS (since Linux 5.13)
If this feature bit is set, the kernel supports registering userfaultfd ranges in minor mode on hugetlbfs-backed memory areas.
UFFD_FEATURE_MINOR_SHMEM (since Linux 5.14)
If this feature bit is set, the kernel supports registering userfaultfd ranges in minor mode on shmem-backed memory areas.
UFFD_FEATURE_EXACT_ADDRESS (since Linux 5.18)
If this feature bit is set, uffd_msg.pagefault.address will be set to the exact page-fault address that was reported by the hardware, and will not mask the offset within the page. Note that old Linux versions might indicate the exact address as well, even though the feature bit is not set.
UFFD_FEATURE_WP_HUGETLBFS_SHMEM (since Linux 5.19)
If this feature bit is set, userfaultfd supports write-protect faults for hugetlbfs and shmem / tmpfs memory.
UFFD_FEATURE_WP_UNPOPULATED (since Linux 6.4)
If this feature bit is set, the kernel will handle anonymous memory the same way as file memory, by allowing the user to write-protect unpopulated page table entries.
UFFD_FEATURE_POISON (since Linux 6.6)
If this feature bit is set, the kernel supports resolving faults with the UFFDIO_POISON ioctl.
UFFD_FEATURE_WP_ASYNC (since Linux 6.7)
If this feature bit is set, write-protection faults are resolved asynchronously by the kernel.
The returned ioctls field can contain the following bits:
1 << _UFFDIO_API
The UFFDIO_API operation is supported.
1 << _UFFDIO_REGISTER
The UFFDIO_REGISTER operation is supported.
1 << _UFFDIO_UNREGISTER
The UFFDIO_UNREGISTER operation is supported.
This ioctl(2) operation returns 0 on success. On error, -1 is returned and errno is set to indicate the error. If an error occurs, the kernel may zero the provided uffdio_api structure. The caller should treat its contents as unspecified, and reinitialize it before re-attempting another UFFDIO_API call. Possible errors include:
EFAULT
argp refers to an address that is outside the calling process’s accessible address space.
EINVAL
The API version requested in the api field is not supported by this kernel, or the features field passed to the kernel includes feature bits that are not supported by the current kernel version.
EINVAL
A previous UFFDIO_API call already enabled one or more features for this userfaultfd. Calling UFFDIO_API twice, the first time with no features set, is explicitly allowed as per the two-step feature detection handshake.
EPERM
The UFFD_FEATURE_EVENT_FORK feature was enabled, but the calling process doesn’t have the CAP_SYS_PTRACE capability.
UFFDIO_REGISTER
(Since Linux 4.3.) Register a memory address range with the userfaultfd object. The pages in the range must be “compatible”. Please refer to the list of register modes below for the compatible memory backends for each mode.
The argp argument is a pointer to a uffdio_register structure, defined as:
struct uffdio_range {
__u64 start; /* Start of range */
__u64 len; /* Length of range (bytes) */
};
struct uffdio_register {
struct uffdio_range range;
__u64 mode; /* Desired mode of operation (input) */
__u64 ioctls; /* Available ioctl() operations (output) */
};
The range field defines a memory range starting at start and continuing for len bytes that should be handled by the userfaultfd.
The mode field defines the mode of operation desired for this memory region. The following values may be bitwise ORed to set the userfaultfd mode for the specified range:
UFFDIO_REGISTER_MODE_MISSING
Track page faults on missing pages. Since Linux 4.3, only private anonymous ranges are compatible. Since Linux 4.11, hugetlbfs and shared memory ranges are also compatible.
UFFDIO_REGISTER_MODE_WP
Track page faults on write-protected pages. Since Linux 5.7, only private anonymous ranges are compatible.
UFFDIO_REGISTER_MODE_MINOR
Track minor page faults. Since Linux 5.13, only hugetlbfs ranges are compatible. Since Linux 5.14, compatibility with shmem ranges was added.
If the operation is successful, the kernel modifies the ioctls bit-mask field to indicate which ioctl(2) operations are available for the specified range. This returned bit mask can contain the following bits:
1 << _UFFDIO_COPY
The UFFDIO_COPY operation is supported.
1 << _UFFDIO_WAKE
The UFFDIO_WAKE operation is supported.
1 << _UFFDIO_WRITEPROTECT
The UFFDIO_WRITEPROTECT operation is supported.
1 << _UFFDIO_ZEROPAGE
The UFFDIO_ZEROPAGE operation is supported.
1 << _UFFDIO_CONTINUE
The UFFDIO_CONTINUE operation is supported.
1 << _UFFDIO_POISON
The UFFDIO_POISON operation is supported.
This ioctl(2) operation returns 0 on success. On error, -1 is returned and errno is set to indicate the error. Possible errors include:
EBUSY
A mapping in the specified range is registered with another userfaultfd object.
EFAULT
argp refers to an address that is outside the calling process’s accessible address space.
EINVAL
An invalid or unsupported bit was specified in the mode field; or the mode field was zero.
EINVAL
There is no mapping in the specified address range.
EINVAL
range.start or range.len is not a multiple of the system page size; or, range.len is zero; or these fields are otherwise invalid.
EINVAL
There is an incompatible mapping in the specified address range.
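The registration rules above can be sketched as follows: check the range against the documented alignment requirements, register it in UFFDIO_REGISTER_MODE_MISSING mode, and test bits in the returned ioctls mask. All three function names are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Mirror of the documented EINVAL rule: start and len must be multiples
 * of the page size, and len must be nonzero. */
int uffd_range_ok(uint64_t start, uint64_t len, uint64_t page_size)
{
    assert(page_size != 0);
    return len != 0 && start % page_size == 0 && len % page_size == 0;
}

/* Test one bit of a returned ioctls mask, for example nr = _UFFDIO_COPY. */
int uffd_op_supported(uint64_t ioctls, unsigned nr)
{
    return (int)((ioctls >> nr) & 1);
}

/* Register [addr, addr + len) for missing-page faults.
 * Returns 0 and stores the available range operations in *ioctls_out,
 * or returns -1 with errno set. */
int uffd_register_missing(int uffd, void *addr, uint64_t len,
                          uint64_t *ioctls_out)
{
    uint64_t page = (uint64_t)sysconf(_SC_PAGESIZE);
    if (!uffd_range_ok((uintptr_t)addr, len, page)) {
        errno = EINVAL;               /* fail early, as the kernel would */
        return -1;
    }

    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
        return -1;

    *ioctls_out = reg.ioctls;
    return 0;
}
```

After a successful call, uffd_op_supported(mask, _UFFDIO_ZEROPAGE) tells the monitor whether UFFDIO_ZEROPAGE can be used on that particular range.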
UFFDIO_UNREGISTER
(Since Linux 4.3.) Unregister a memory address range from userfaultfd. The pages in the range must be “compatible” (see the description of UFFDIO_REGISTER).
The address range to unregister is specified in the uffdio_range structure pointed to by argp.
This ioctl(2) operation returns 0 on success. On error, -1 is returned and errno is set to indicate the error. Possible errors include:
EINVAL
Either the start or the len field of the uffdio_range structure was not a multiple of the system page size; or the len field was zero; or these fields were otherwise invalid.
EINVAL
There is an incompatible mapping in the specified address range.
EINVAL
There was no mapping in the specified address range.
UFFDIO_COPY
(Since Linux 4.3.) Atomically copy a continuous memory chunk into the userfault registered range and optionally wake up the blocked thread. The source and destination addresses and the number of bytes to copy are specified by the src, dst, and len fields of the uffdio_copy structure pointed to by argp:
struct uffdio_copy {
__u64 dst; /* Destination of copy */
__u64 src; /* Source of copy */
__u64 len; /* Number of bytes to copy */
__u64 mode; /* Flags controlling behavior of copy */
__s64 copy; /* Number of bytes copied, or negated error */
};
The following values may be bitwise ORed in mode to change the behavior of the UFFDIO_COPY operation:
UFFDIO_COPY_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution.
UFFDIO_COPY_MODE_WP
Copy the page with read-only permission. This allows the user to trap the next write to the page, which will block and generate another write-protect userfault message. This is used only when both UFFDIO_REGISTER_MODE_MISSING and UFFDIO_REGISTER_MODE_WP modes are enabled for the registered range.
The copy field is used by the kernel to return the number of bytes that was actually copied, or an error (a negated errno-style value). If the value returned in copy doesn’t match the value that was specified in len, the operation fails with the error EAGAIN. The copy field is output-only; it is not read by the UFFDIO_COPY operation.
This ioctl(2) operation returns 0 on success. In this case, the entire area was copied. On error, -1 is returned and errno is set to indicate the error. Possible errors include:
EAGAIN
The number of bytes copied (i.e., the value returned in the copy field) does not equal the value that was specified in the len field.
EINVAL
Either dst or len was not a multiple of the system page size, or the range specified by src and len or dst and len was invalid.
EINVAL
An invalid bit was specified in the mode field.
ENOENT (since Linux 4.11)
The faulting process has changed its virtual memory layout simultaneously with an outstanding UFFDIO_COPY operation.
ENOSPC (from Linux 4.11 until Linux 4.13)
The faulting process has exited at the time of a UFFDIO_COPY operation.
ESRCH (since Linux 4.13)
The faulting process has exited at the time of a UFFDIO_COPY operation.
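The interaction between the copy field and EAGAIN can be illustrated with a bounded retry loop. This sketch assumes that, when the call fails with EAGAIN and copy is positive, copy counts bytes already installed, so the request can be advanced past them. The names uffd_copy_advance and uffd_copy_all, and the max_retries parameter, are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* After a partial copy, move the request past the bytes already copied.
 * Returns the number of bytes that remain to be copied. */
uint64_t uffd_copy_advance(struct uffdio_copy *c)
{
    assert(c->copy > 0 && (uint64_t)c->copy <= c->len);
    c->dst += (uint64_t)c->copy;
    c->src += (uint64_t)c->copy;
    c->len -= (uint64_t)c->copy;
    c->copy = 0;
    return c->len;
}

/* Copy len bytes from src into the registered range at dst, retrying a
 * bounded number of times on EAGAIN.
 * Returns 0 on success, or -1 with errno set. */
int uffd_copy_all(int uffd, void *dst, const void *src, uint64_t len,
                  int max_retries)
{
    struct uffdio_copy c = {
        .dst = (uintptr_t)dst,
        .src = (uintptr_t)src,
        .len = len,
        .mode = 0,                    /* wake the faulting thread when done */
    };

    for (int attempt = 0; attempt <= max_retries; attempt++) {
        c.copy = 0;                   /* output-only, cleared for clarity */
        if (ioctl(uffd, UFFDIO_COPY, &c) == 0)
            return 0;                 /* the entire area was copied */
        if (errno != EAGAIN)
            return -1;
        if (c.copy > 0)
            uffd_copy_advance(&c);    /* skip what was already copied */
    }
    errno = EAGAIN;
    return -1;
}
```

Errors such as ENOENT or ESRCH are passed straight back to the caller, since they indicate that the faulting process changed its layout or exited rather than a transient condition.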
UFFDIO_ZEROPAGE
(Since Linux 4.3.) Zero out a memory range registered with userfaultfd.
The requested range is specified by the range field of the uffdio_zeropage structure pointed to by argp:
struct uffdio_zeropage {
struct uffdio_range range;
__u64 mode; /* Flags controlling behavior of zeropage */
__s64 zeropage; /* Number of bytes zeroed, or negated error */
};
The following value may be bitwise ORed in mode to change the behavior of the UFFDIO_ZEROPAGE operation:
UFFDIO_ZEROPAGE_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution.
The zeropage field is used by the kernel to return the number of bytes that was actually zeroed, or an error in the same manner as UFFDIO_COPY. If the value returned in the zeropage field doesn’t match the value that was specified in range.len, the operation fails with the error EAGAIN. The zeropage field is output-only; it is not read by the UFFDIO_ZEROPAGE operation.
This ioctl(2) operation returns 0 on success. In this case, the entire area was zeroed. On error, -1 is returned and errno is set to indicate the error. Possible errors include:
EAGAIN
The number of bytes zeroed (i.e., the value returned in the zeropage field) does not equal the value that was specified in the range.len field.
EINVAL
Either range.start or range.len was not a multiple of the system page size; or range.len was zero; or the range specified was invalid.
EINVAL
An invalid bit was specified in the mode field.
ESRCH (since Linux 4.13)
The faulting process has exited at the time of a UFFDIO_ZEROPAGE operation.
UFFDIO_WAKE
(Since Linux 4.3.) Wake up the thread waiting for page-fault resolution on a specified memory address range.
The UFFDIO_WAKE operation is used in conjunction with UFFDIO_COPY and UFFDIO_ZEROPAGE operations that have the UFFDIO_COPY_MODE_DONTWAKE or UFFDIO_ZEROPAGE_MODE_DONTWAKE bit set in the mode field. The userfault monitor can perform several UFFDIO_COPY and UFFDIO_ZEROPAGE operations in a batch and then explicitly wake up the faulting thread using UFFDIO_WAKE.
The argp argument is a pointer to a uffdio_range structure (shown above) that specifies the address range.
This ioctl(2) operation returns 0 on success. On error, -1 is returned and errno is set to indicate the error. Possible errors include:
EINVAL
The start or the len field of the uffdio_range structure was not a multiple of the system page size; or len was zero; or the specified range was otherwise invalid.
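The batching pattern described for UFFDIO_WAKE might be sketched like this: several pages are zeroed with UFFDIO_ZEROPAGE_MODE_DONTWAKE, then a single UFFDIO_WAKE covers all of them. The helper names are hypothetical. The sketch also assumes it is acceptable to wake waiters on any other page that happens to lie inside the covering range.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Smallest range that covers every page address in pages[0..n). */
struct uffdio_range uffd_covering_range(const uint64_t *pages, size_t n,
                                        uint64_t page_size)
{
    assert(n > 0 && page_size != 0);
    uint64_t lo = pages[0], hi = pages[0];
    for (size_t i = 1; i < n; i++) {
        if (pages[i] < lo)
            lo = pages[i];
        if (pages[i] > hi)
            hi = pages[i];
    }
    return (struct uffdio_range){ .start = lo, .len = hi - lo + page_size };
}

/* Zero each listed page without waking anyone, then issue one wake-up
 * for the covering range. Returns 0 on success, -1 with errno set. */
int uffd_zero_batch(int uffd, const uint64_t *pages, size_t n,
                    uint64_t page_size)
{
    for (size_t i = 0; i < n; i++) {
        struct uffdio_zeropage z = {
            .range = { .start = pages[i], .len = page_size },
            .mode = UFFDIO_ZEROPAGE_MODE_DONTWAKE,
        };
        if (ioctl(uffd, UFFDIO_ZEROPAGE, &z) == -1)
            return -1;
    }

    struct uffdio_range wake = uffd_covering_range(pages, n, page_size);
    return ioctl(uffd, UFFDIO_WAKE, &wake);
}
```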
UFFDIO_WRITEPROTECT
(Since Linux 5.7.) Write-protect or write-unprotect a memory range that was registered with userfaultfd in mode UFFDIO_REGISTER_MODE_WP.
The argp argument is a pointer to a uffdio_writeprotect structure as shown below:
struct uffdio_writeprotect {
struct uffdio_range range; /* Range to change write permission */
__u64 mode; /* Mode to change write permission */
};
There are two mode bits that are supported in this structure:
UFFDIO_WRITEPROTECT_MODE_WP
When this mode bit is set, the ioctl will be a write-protect operation upon the memory range specified by range. Otherwise it will be a write-unprotect operation upon the specified range, which can be used to resolve a userfaultfd write-protect page fault.
UFFDIO_WRITEPROTECT_MODE_DONTWAKE
When this mode bit is set, do not wake up any thread that waits for page-fault resolution after the operation. This can be specified only if UFFDIO_WRITEPROTECT_MODE_WP is not specified.
This ioctl(2) operation returns 0 on success. On error, -1 is returned and errno is set to indicate the error. Possible errors include:
EINVAL
The start or the len field of the uffdio_range structure was not a multiple of the system page size; or len was zero; or the specified range was otherwise invalid.
EAGAIN
The process was interrupted; retry this call.
ENOENT
The range specified in range is not valid. For example, the virtual address does not exist, or it is not registered with userfaultfd write-protect mode.
EFAULT
Encountered a generic fault during processing.
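The rule that UFFDIO_WRITEPROTECT_MODE_DONTWAKE may be given only when UFFDIO_WRITEPROTECT_MODE_WP is absent can be checked before the ioctl is issued. The following is a sketch with hypothetical helper names; it assumes kernel headers from Linux 5.7 or later so that the structure and mode constants are defined.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Return 1 if mode is a permitted combination of the two mode bits. */
int uffd_wp_mode_valid(uint64_t mode)
{
    const uint64_t known = UFFDIO_WRITEPROTECT_MODE_WP |
                           UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
    if (mode & ~known)
        return 0;                     /* unknown bit */
    return (mode & known) != known;   /* WP and DONTWAKE are exclusive */
}

/* Write-protect (mode contains MODE_WP) or write-unprotect the range.
 * Returns 0 on success, or -1 with errno set. */
int uffd_write_protect(int uffd, uint64_t start, uint64_t len, uint64_t mode)
{
    assert(len != 0);
    if (!uffd_wp_mode_valid(mode)) {
        errno = EINVAL;
        return -1;
    }

    struct uffdio_writeprotect wp = {
        .range = { .start = start, .len = len },
        .mode = mode,
    };
    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}
```

Passing mode 0 un-protects the range and wakes any thread blocked on a write-protect fault, which is the usual way to resolve such a fault.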
UFFDIO_CONTINUE
(Since Linux 5.13.) Resolve a minor page fault by installing page table entries for existing pages in the page cache.
The argp argument is a pointer to a uffdio_continue structure as shown below:
struct uffdio_continue {
struct uffdio_range range;
/* Range to install PTEs for and continue */
__u64 mode; /* Flags controlling the behavior of continue */
__s64 mapped; /* Number of bytes mapped, or negated error */
};
The following value may be bitwise ORed in mode to change the behavior of the UFFDIO_CONTINUE operation:
UFFDIO_CONTINUE_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution.
The mapped field is used by the kernel to return the number of bytes that were actually mapped, or an error in the same manner as UFFDIO_COPY. If the value returned in the mapped field doesn’t match the value that was specified in range.len, the operation fails with the error EAGAIN. The mapped field is output-only; it is not read by the UFFDIO_CONTINUE operation.
This ioctl(2) operation returns 0 on success. In this case, the entire area was mapped. On error, -1 is returned and errno is set to indicate the error. Possible errors include:
EAGAIN
The number of bytes mapped (i.e., the value returned in the mapped field) does not equal the value that was specified in the range.len field.
EEXIST
One or more pages were already mapped in the given range.
EFAULT
No existing page could be found in the page cache for the given range.
EINVAL
Either range.start or range.len was not a multiple of the system page size; or range.len was zero; or the range specified was invalid.
EINVAL
An invalid bit was specified in the mode field.
ENOENT
The faulting process has changed its virtual memory layout simultaneously with an outstanding UFFDIO_CONTINUE operation.
ENOMEM
Allocating memory needed to set up the page table mappings failed.
ESRCH
The faulting process has exited at the time of a UFFDIO_CONTINUE operation.
UFFDIO_POISON
(Since Linux 6.6.) Mark an address range as “poisoned”. Future accesses to these addresses will raise a SIGBUS signal. Unlike MADV_HWPOISON this works by installing page table entries, rather than “really” poisoning the underlying physical pages. This means it only affects this particular address space.
The argp argument is a pointer to a uffdio_poison structure as shown below:
struct uffdio_poison {
struct uffdio_range range;
/* Range to install poison PTE markers in */
__u64 mode; /* Flags controlling the behavior of poison */
__s64 updated; /* Number of bytes poisoned, or negated error */
};
The following value may be bitwise ORed in mode to change the behavior of the UFFDIO_POISON operation:
UFFDIO_POISON_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault resolution.
The updated field is used by the kernel to return the number of bytes that were actually poisoned, or an error in the same manner as UFFDIO_COPY. If the value returned in the updated field doesn’t match the value that was specified in range.len, the operation fails with the error EAGAIN. The updated field is output-only; it is not read by the UFFDIO_POISON operation.
This ioctl(2) operation returns 0 on success. In this case, the entire area was poisoned. On error, -1 is returned and errno is set to indicate the error. Possible errors include:
EAGAIN
The number of bytes poisoned (i.e., the value returned in the updated field) does not equal the value that was specified in the range.len field.
EINVAL
Either range.start or range.len was not a multiple of the system page size; or range.len was zero; or the range specified was invalid.
EINVAL
An invalid bit was specified in the mode field.
EEXIST
One or more pages were already mapped in the given range.
ENOENT
The faulting process has changed its virtual memory layout simultaneously with an outstanding UFFDIO_POISON operation.
ENOMEM
Allocating memory for page table entries failed.
ESRCH
The faulting process has exited at the time of a UFFDIO_POISON operation.
RETURN VALUE
See descriptions of the individual operations, above.
ERRORS
See descriptions of the individual operations, above. In addition, the following general errors can occur for all of the operations described above:
EFAULT
argp does not point to a valid memory address.
EINVAL
(For all operations except UFFDIO_API.) The userfaultfd object has not yet been enabled (via the UFFDIO_API operation).
STANDARDS
Linux.
BUGS
To detect the available userfault features and then enable a subset of them, the userfaultfd file descriptor must be closed after the first UFFDIO_API operation (which queries feature availability) and reopened before the second UFFDIO_API operation (which actually enables the desired features).
EXAMPLES
See userfaultfd(2).
SEE ALSO
ioctl(2), mmap(2), userfaultfd(2)
Documentation/admin-guide/mm/userfaultfd.rst in the Linux kernel source tree
279 - Linux cli command fdatasync
NAME 🖥️ fdatasync 🖥️
synchronize a file’s in-core state with storage device
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int fsync(int fd);
int fdatasync(int fd);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fsync():
glibc 2.16 and later:
No feature test macros need be defined
glibc up to and including 2.15:
_BSD_SOURCE || _XOPEN_SOURCE
|| /* Since glibc 2.8: */ _POSIX_C_SOURCE >= 200112L
fdatasync():
_POSIX_C_SOURCE >= 199309L || _XOPEN_SOURCE >= 500
DESCRIPTION
fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
As well as flushing the file data, fsync() also flushes the metadata information associated with the file (see inode(7)).
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see inode(7)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by, say, ftruncate(2)) would require a metadata flush.
The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.
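As an illustration of how the two calls divide the work, the sketch below writes a new file, uses fdatasync() for the contents, and then calls fsync() on the containing directory so that the directory entry is flushed as well. The function name write_durably and its parameters are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Create or truncate dir/name, write len bytes of data, and flush both
 * the file contents and the directory entry.
 * Returns 0 on success, or -1 with errno set by the failing call. */
int write_durably(const char *dir, const char *name,
                  const void *data, size_t len)
{
    assert(dir != NULL && name != NULL && data != NULL);

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd == -1)
        return -1;
    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd == -1) {
        close(dfd);
        return -1;
    }

    int rc = -1;
    const char *p = data;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            goto out;
        }
        p += n;
        left -= (size_t)n;
    }

    /* The file size changed, so fdatasync() flushes that metadata too;
     * timestamps are not required to reach the device here. */
    if (fdatasync(fd) == -1)
        goto out;

    /* The new directory entry needs its own fsync() on the directory. */
    if (fsync(dfd) == -1)
        goto out;
    rc = 0;
out:
    close(fd);
    close(dfd);
    return rc;
}
```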
RETURN VALUE
On success, these system calls return zero. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
fd is not a valid open file descriptor.
EINTR
The function was interrupted by a signal; see signal(7).
EIO
An error occurred during synchronization. This error may relate to data written to some other file descriptor on the same file. Since Linux 4.13, errors from write-back will be reported to all file descriptors that might have written the data which triggered the error. Some filesystems (e.g., NFS) keep close track of which data came through which file descriptor, and give more precise reporting. Other filesystems (e.g., most local filesystems) will report errors to all file descriptors that were open on the file when the error was recorded.
ENOSPC
Disk space was exhausted while synchronizing.
EROFS
EINVAL
fd is bound to a special file (e.g., a pipe, FIFO, or socket) which does not support synchronization.
ENOSPC
EDQUOT
fd is bound to a file on NFS or another filesystem which does not allocate space at the time of a write(2) system call, and some previous write failed due to insufficient storage space.
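One possible way to act on these error codes is sketched below: retry when interrupted by a signal, and report descriptors that cannot be synchronized separately from real failures. The name sync_if_supported and its return convention are hypothetical choices for this example.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Flush fd with fsync(), retrying if interrupted by a signal.
 * Returns 0 if the data was flushed,
 *         1 if fd does not support synchronization (EINVAL or EROFS),
 *        -1 on any other error, with errno left set by fsync(). */
int sync_if_supported(int fd)
{
    for (;;) {
        if (fsync(fd) == 0)
            return 0;
        if (errno == EINTR)
            continue;                 /* interrupted: try again */
        if (errno == EINVAL || errno == EROFS)
            return 1;                 /* e.g. a pipe, FIFO, or socket */
        return -1;                    /* EBADF, EIO, ENOSPC, EDQUOT, ... */
    }
}
```

An EIO result is deliberately not retried here; the caller has to decide how to recover, since the error may relate to data written through another descriptor.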
VERSIONS
On POSIX systems on which fdatasync() is available, _POSIX_SYNCHRONIZED_IO is defined in <unistd.h> to a value greater than 0. (See also sysconf(3).)
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.2BSD.
In Linux 2.2 and earlier, fdatasync() is equivalent to fsync(), and so has no performance advantage.
The fsync() implementations in older kernels and lesser-used filesystems do not know how to flush disk caches. In these cases disk caches need to be disabled using hdparm(8) or sdparm(8) to guarantee safe operation.
Under AT&T UNIX System V Release 4 fd needs to be opened for writing. This is by itself incompatible with the original BSD interface and forbidden by POSIX, but nevertheless survives in HP-UX and AIX.
SEE ALSO
sync(1), bdflush(2), open(2), posix_fadvise(2), pwritev(2), sync(2), sync_file_range(2), fflush(3), fileno(3), hdparm(8), mount(8)
280 - Linux cli command kexec_load
NAME 🖥️ kexec_load 🖥️
load a new kernel for later execution
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/kexec.h> /* Definition of KEXEC_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_kexec_load, unsigned long entry,
unsigned long nr_segments, struct kexec_segment *segments,
unsigned long flags);
long syscall(SYS_kexec_file_load, int kernel_fd, int initrd_fd,
unsigned long cmdline_len, const char *cmdline,
unsigned long flags);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
The kexec_load() system call loads a new kernel that can be executed later by reboot(2).
The flags argument is a bit mask that controls the operation of the call. The following values can be specified in flags:
KEXEC_ON_CRASH (since Linux 2.6.13)
Execute the new kernel automatically on a system crash. This “crash kernel” is loaded into an area of reserved memory that is determined at boot time using the crashkernel kernel command-line parameter. The location of this reserved memory is exported to user space via the /proc/iomem file, in an entry labeled “Crash kernel”. A user-space application can parse this file and prepare a list of segments (see below) that specify this reserved memory as destination. If this flag is specified, the kernel checks that the target segments specified in segments fall within the reserved region.
KEXEC_PRESERVE_CONTEXT (since Linux 2.6.27)
Preserve the system hardware and software states before executing the new kernel. This could be used for system suspend. This flag is available only if the kernel was configured with CONFIG_KEXEC_JUMP, and is effective only if nr_segments is greater than 0.
The high-order bits (corresponding to the mask 0xffff0000) of flags contain the architecture of the to-be-executed kernel. Specify (OR) the constant KEXEC_ARCH_DEFAULT to use the current architecture, or one of the following architecture constants: KEXEC_ARCH_386, KEXEC_ARCH_68K, KEXEC_ARCH_X86_64, KEXEC_ARCH_PPC, KEXEC_ARCH_PPC64, KEXEC_ARCH_IA_64, KEXEC_ARCH_ARM, KEXEC_ARCH_S390, KEXEC_ARCH_SH, KEXEC_ARCH_MIPS, and KEXEC_ARCH_MIPS_LE. The architecture must be executable on the CPU of the system.
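The layout of flags can be illustrated with two small helpers that combine option bits with an architecture constant and extract the architecture again. The macro ARCH_BITS and both function names are hypothetical; the mask value is the one quoted in the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/kexec.h>

/* Mask from the text: the high-order 16 bits carry the architecture. */
#define ARCH_BITS 0xffff0000UL

/* Combine option bits (such as KEXEC_ON_CRASH) with one architecture
 * constant (such as KEXEC_ARCH_DEFAULT) into a flags value. */
unsigned long kexec_make_flags(unsigned long options, unsigned long arch)
{
    assert((options & ARCH_BITS) == 0);   /* options live in the low bits */
    assert((arch & ~ARCH_BITS) == 0);     /* arch lives in the high bits */
    return options | arch;
}

/* Extract the architecture part of a flags value. */
unsigned long kexec_flags_arch(unsigned long flags)
{
    return flags & ARCH_BITS;
}
```

For example, kexec_make_flags(KEXEC_ON_CRASH, KEXEC_ARCH_DEFAULT) yields a value suitable for loading a crash kernel for the running architecture.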
The entry argument is the physical entry address in the kernel image. The nr_segments argument is the number of segments pointed to by the segments pointer; the kernel imposes an (arbitrary) limit of 16 on the number of segments. The segments argument is an array of kexec_segment structures which define the kernel layout:
struct kexec_segment {
void *buf; /* Buffer in user space */
size_t bufsz; /* Buffer length in user space */
void *mem; /* Physical address of kernel */
size_t memsz; /* Physical address length */
};
The kernel image defined by segments is copied from the calling process into the kernel either in regular memory or in reserved memory (if KEXEC_ON_CRASH is set). The kernel first performs various sanity checks on the information passed in segments. If these checks pass, the kernel copies the segment data to kernel memory. Each segment specified in segments is copied as follows:
buf and bufsz identify a memory region in the caller’s virtual address space that is the source of the copy. The value in bufsz may not exceed the value in the memsz field.
mem and memsz specify a physical address range that is the target of the copy. The values specified in both fields must be multiples of the system page size.
bufsz bytes are copied from the source buffer to the target kernel buffer. If bufsz is less than memsz, then the excess bytes in the kernel buffer are zeroed out.
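The per-segment rules above can be checked in user space before the system call is attempted. This sketch uses hypothetical function names; it covers only the size, alignment, and segment-count rules, and does not check for overlapping target ranges or for the crash-kernel reservation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/kexec.h>

/* Return 1 if the segment list satisfies the documented constraints:
 * at most KEXEC_SEGMENT_MAX entries, bufsz <= memsz, and mem and memsz
 * both multiples of the page size. Return 0 otherwise. */
int kexec_segments_valid(const struct kexec_segment *segs,
                         unsigned long nr_segments, size_t page_size)
{
    assert(page_size != 0);
    if (nr_segments > KEXEC_SEGMENT_MAX)
        return 0;

    for (unsigned long i = 0; i < nr_segments; i++) {
        if (segs[i].bufsz > segs[i].memsz)
            return 0;
        if ((uintptr_t)segs[i].mem % page_size != 0 ||
            segs[i].memsz % page_size != 0)
            return 0;
    }
    return 1;
}

/* Number of trailing bytes in the target that the kernel zeroes out
 * because the source buffer is shorter than the target range. */
size_t kexec_zero_fill(const struct kexec_segment *seg)
{
    assert(seg->bufsz <= seg->memsz);
    return seg->memsz - seg->bufsz;
}
```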
In case of a normal kexec (i.e., the KEXEC_ON_CRASH flag is not set), the segment data is loaded in any available memory and is moved to the final destination at kexec reboot time (e.g., when the kexec(8) command is executed with the -e option).
In case of kexec on panic (i.e., the KEXEC_ON_CRASH flag is set), the segment data is loaded to reserved memory at the time of the call, and, after a crash, the kexec mechanism simply passes control to that kernel.
The kexec_load() system call is available only if the kernel was configured with CONFIG_KEXEC.
kexec_file_load()
The kexec_file_load() system call is similar to kexec_load(), but it takes a different set of arguments. It reads the kernel to be loaded from the file referred to by the file descriptor kernel_fd, and the initrd (initial RAM disk) to be loaded from the file referred to by the file descriptor initrd_fd. The cmdline argument is a pointer to a buffer containing the command line for the new kernel. The cmdline_len argument specifies the size of the buffer. The last byte in the buffer must be a null byte (‘\0’).
The flags argument is a bit mask which modifies the behavior of the call. The following values can be specified in flags:
KEXEC_FILE_UNLOAD
Unload the currently loaded kernel.
KEXEC_FILE_ON_CRASH
Load the new kernel in the memory region reserved for the crash kernel (as for KEXEC_ON_CRASH). This kernel is booted if the currently running kernel crashes.
KEXEC_FILE_NO_INITRAMFS
Loading initrd/initramfs is optional. Specify this flag if no initramfs is being loaded. If this flag is set, the value passed in initrd_fd is ignored.
The kexec_file_load() system call was added to provide support for systems where “kexec” loading should be restricted to only kernels that are signed. This system call is available only if the kernel was configured with CONFIG_KEXEC_FILE.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EADDRNOTAVAIL
The KEXEC_ON_CRASH flag was specified, but the region specified by the mem and memsz fields of one of the segments entries lies outside the range of memory reserved for the crash kernel.
EADDRNOTAVAIL
The value in a mem or memsz field in one of the segments entries is not a multiple of the system page size.
EBADF
kernel_fd or initrd_fd is not a valid file descriptor.
EBUSY
Another crash kernel is already being loaded or a crash kernel is already in use.
EINVAL
flags is invalid.
EINVAL
The value of a bufsz field in one of the segments entries exceeds the value in the corresponding memsz field.
EINVAL
nr_segments exceeds KEXEC_SEGMENT_MAX (16).
EINVAL
Two or more of the kernel target buffers overlap.
EINVAL
The value in cmdline[cmdline_len-1] is not '\0'.
EINVAL
The file referred to by kernel_fd or initrd_fd is empty (length zero).
ENOEXEC
kernel_fd does not refer to an open file, or the kernel can’t load this file. Currently, the file must be a bzImage and contain an x86 kernel that is loadable above 4 GiB in memory (see the kernel source file Documentation/x86/boot.txt).
ENOMEM
Could not allocate memory.
EPERM
The caller does not have the CAP_SYS_BOOT capability.
STANDARDS
Linux.
HISTORY
kexec_load()
Linux 2.6.13.
kexec_file_load()
Linux 3.17.
SEE ALSO
reboot(2), syscall(2), kexec(8)
The kernel source files Documentation/kdump/kdump.txt and Documentation/admin-guide/kernel-parameters.txt
281 - Linux cli command lookup_dcookie
NAME 🖥️ lookup_dcookie 🖥️
return a directory entry’s path
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_lookup_dcookie, uint64_t cookie, char *buffer,
size_t len);
Note: glibc provides no wrapper for lookup_dcookie(), necessitating the use of syscall(2).
DESCRIPTION
Look up the full path of the directory entry specified by the value cookie. The cookie is an opaque identifier uniquely identifying a particular directory entry. The buffer given is filled in with the full path of the directory entry.
For lookup_dcookie() to return successfully, the kernel must still hold a cookie reference to the directory entry.
RETURN VALUE
On success, lookup_dcookie() returns the length of the path string copied into the buffer. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
The buffer was not valid.
EINVAL
The kernel has no registered cookie/directory entry mappings at the time of lookup, or the cookie does not refer to a valid directory entry.
ENAMETOOLONG
The name could not fit in the buffer.
ENOMEM
The kernel could not allocate memory for the temporary buffer holding the path.
EPERM
The process does not have the capability CAP_SYS_ADMIN required to look up cookie values.
ERANGE
The buffer was not large enough to hold the path of the directory entry.
STANDARDS
Linux.
HISTORY
Linux 2.5.43.
The ENAMETOOLONG error was added in Linux 2.5.70.
NOTES
lookup_dcookie() is a special-purpose system call, currently used only by the oprofile(1) profiler. It relies on a kernel driver to register cookies for directory entries.
The path returned may be suffixed by the string " (deleted)" if the directory entry has been removed.
SEE ALSO
oprofile(1)
282 - Linux cli command epoll_pwait2
NAME 🖥️ epoll_pwait2 🖥️
wait for an I/O event on an epoll file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/epoll.h>
int epoll_wait(int epfd, struct epoll_event *events,
int maxevents, int timeout);
int epoll_pwait(int epfd, struct epoll_event *events,
int maxevents, int timeout,
const sigset_t *_Nullable sigmask);
int epoll_pwait2(int epfd, struct epoll_event *events,
int maxevents, const struct timespec *_Nullable timeout,
const sigset_t *_Nullable sigmask);
DESCRIPTION
The epoll_wait() system call waits for events on the epoll(7) instance referred to by the file descriptor epfd. The buffer pointed to by events is used to return information from the ready list about file descriptors in the interest list that have some events available. Up to maxevents are returned by epoll_wait(). The maxevents argument must be greater than zero.
The timeout argument specifies the number of milliseconds that epoll_wait() will block. Time is measured against the CLOCK_MONOTONIC clock.
A call to epoll_wait() will block until either:
a file descriptor delivers an event;
the call is interrupted by a signal handler; or
the timeout expires.
Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a timeout of -1 causes epoll_wait() to block indefinitely, while specifying a timeout equal to zero causes epoll_wait() to return immediately, even if no events are available.
The struct epoll_event is described in epoll_event(3type).
The data field of each returned epoll_event structure contains the same data as was specified in the most recent call to epoll_ctl(2) (EPOLL_CTL_ADD, EPOLL_CTL_MOD) for the corresponding open file descriptor.
The events field is a bit mask that indicates the events that have occurred for the corresponding open file description. See epoll_ctl(2) for a list of the bits that may appear in this mask.
epoll_pwait()
The relationship between epoll_wait() and epoll_pwait() is analogous to the relationship between select(2) and pselect(2): like pselect(2), epoll_pwait() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.
The following epoll_pwait() call:
ready = epoll_pwait(epfd, &events, maxevents, timeout, &sigmask);
is equivalent to atomically executing the following calls:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = epoll_wait(epfd, &events, maxevents, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
The sigmask argument may be specified as NULL, in which case epoll_pwait() is equivalent to epoll_wait().
epoll_pwait2()
The epoll_pwait2() system call is equivalent to epoll_pwait() except for the timeout argument. It takes an argument of type timespec to be able to specify nanosecond resolution timeout. This argument functions the same as in pselect(2) and ppoll(2). If timeout is NULL, then epoll_pwait2() can block indefinitely.
RETURN VALUE
On success, epoll_wait() returns the number of file descriptors ready for the requested I/O operation, or zero if no file descriptor became ready during the requested timeout milliseconds. On failure, epoll_wait() returns -1 and errno is set to indicate the error.
ERRORS
EBADF
epfd is not a valid file descriptor.
EFAULT
The memory area pointed to by events is not accessible with write permissions.
EINTR
The call was interrupted by a signal handler before either (1) any of the requested events occurred or (2) the timeout expired; see signal(7).
EINVAL
epfd is not an epoll file descriptor, or maxevents is less than or equal to zero.
STANDARDS
Linux.
HISTORY
epoll_wait()
Linux 2.6, glibc 2.3.2.
epoll_pwait()
Linux 2.6.19, glibc 2.6.
epoll_pwait2()
Linux 5.11.
NOTES
While one thread is blocked in a call to epoll_wait(), it is possible for another thread to add a file descriptor to the waited-upon epoll instance. If the new file descriptor becomes ready, it will cause the epoll_wait() call to unblock.
If more than maxevents file descriptors are ready when epoll_wait() is called, then successive epoll_wait() calls will round robin through the set of ready file descriptors. This behavior helps avoid starvation scenarios, where a process fails to notice that additional file descriptors are ready because it focuses on a set of file descriptors that are already known to be ready.
Note that it is possible to call epoll_wait() on an epoll instance whose interest list is currently empty (or whose interest list becomes empty because file descriptors are closed or removed from the interest in another thread). The call will block until some file descriptor is later added to the interest list (in another thread) and that file descriptor becomes ready.
C library/kernel differences
The raw epoll_pwait() and epoll_pwait2() system calls have a sixth argument, size_t sigsetsize, which specifies the size in bytes of the sigmask argument. The glibc epoll_pwait() wrapper function specifies this argument as a fixed value (equal to sizeof(sigset_t)).
BUGS
Before Linux 2.6.37, a timeout value larger than approximately LONG_MAX / HZ milliseconds is treated as -1 (i.e., infinity). Thus, for example, on a system where sizeof(long) is 4 and the kernel HZ value is 1000, this means that timeouts greater than 35.79 minutes are treated as infinity.
SEE ALSO
epoll_create(2), epoll_ctl(2), epoll(7)
283 - Linux cli command syncfs
NAME 🖥️ syncfs 🖥️
commit filesystem caches to disk
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
void sync(void);
int syncfs(int fd);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sync():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE
syncfs():
_GNU_SOURCE
DESCRIPTION
sync() causes all pending modifications to filesystem metadata and cached file data to be written to the underlying filesystems.
syncfs() is like sync(), but synchronizes just the filesystem containing the file referred to by the open file descriptor fd.
RETURN VALUE
syncfs() returns 0 on success; on error, it returns -1 and sets errno to indicate the error.
ERRORS
sync() is always successful.
syncfs() can fail for at least the following reasons:
EBADF
fd is not a valid file descriptor.
EIO
An error occurred during synchronization. This error may relate to data written to any file on the filesystem, or on metadata related to the filesystem itself.
ENOSPC
Disk space was exhausted while synchronizing.
ENOSPC, EDQUOT
Data was written to a file on NFS or another filesystem which does not allocate space at the time of a write(2) system call, and some previous write failed due to insufficient storage space.
VERSIONS
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, Linux waits for I/O completions, and thus sync() and syncfs() provide the same guarantees as fsync() called on every file in the system or filesystem, respectively.
STANDARDS
sync()
POSIX.1-2008.
syncfs()
Linux.
HISTORY
sync()
POSIX.1-2001, SVr4, 4.3BSD.
syncfs()
Linux 2.6.39, glibc 2.14.
Since glibc 2.2.2, the Linux prototype for sync() is as listed above, following the various standards. In glibc 2.2.1 and earlier, it was “int sync(void)”, and sync() always returned 0.
In mainline kernel versions prior to Linux 5.8, syncfs() will fail only when passed a bad file descriptor (EBADF). Since Linux 5.8, syncfs() will also report an error if one or more inodes failed to be written back since the last syncfs() call.
BUGS
Before Linux 1.3.20, Linux did not wait for I/O to complete before returning.
SEE ALSO
sync(1), fdatasync(2), fsync(2)
284 - Linux cli command prof
NAME 🖥️ prof 🖥️
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
285 - Linux cli command sched_getaffinity
NAME 🖥️ sched_getaffinity 🖥️
set and get a thread’s CPU affinity mask
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sched.h>
int sched_setaffinity(pid_t pid, size_t cpusetsize,
const cpu_set_t *mask);
int sched_getaffinity(pid_t pid, size_t cpusetsize,
cpu_set_t *mask);
DESCRIPTION
A thread’s CPU affinity mask determines the set of CPUs on which it is eligible to run. On a multiprocessor system, setting the CPU affinity mask can be used to obtain performance benefits. For example, by dedicating one CPU to a particular thread (i.e., setting the affinity mask of that thread to specify a single CPU, and setting the affinity mask of all other threads to exclude that CPU), it is possible to ensure maximum execution speed for that thread. Restricting a thread to run on a single CPU also avoids the performance cost caused by the cache invalidation that occurs when a thread ceases to execute on one CPU and then recommences execution on a different CPU.
A CPU affinity mask is represented by the cpu_set_t structure, a “CPU set”, pointed to by mask. A set of macros for manipulating CPU sets is described in CPU_SET(3).
sched_setaffinity() sets the CPU affinity mask of the thread whose ID is pid to the value specified by mask. If pid is zero, then the calling thread is used. The argument cpusetsize is the length (in bytes) of the data pointed to by mask. Normally this argument would be specified as sizeof(cpu_set_t).
If the thread specified by pid is not currently running on one of the CPUs specified in mask, then that thread is migrated to one of the CPUs specified in mask.
sched_getaffinity() writes the affinity mask of the thread whose ID is pid into the cpu_set_t structure pointed to by mask. The cpusetsize argument specifies the size (in bytes) of mask. If pid is zero, then the mask of the calling thread is returned.
RETURN VALUE
On success, sched_setaffinity() and sched_getaffinity() return 0 (but see “C library/kernel differences” below, which notes that the underlying sched_getaffinity() differs in its return value). On failure, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
A supplied memory address was invalid.
EINVAL
The affinity bit mask mask contains no processors that are currently physically on the system and permitted to the thread according to any restrictions that may be imposed by cpuset cgroups or the “cpuset” mechanism described in cpuset(7).
EINVAL
(sched_getaffinity() and, before Linux 2.6.9, sched_setaffinity()) cpusetsize is smaller than the size of the affinity mask used by the kernel.
EPERM
(sched_setaffinity()) The calling thread does not have appropriate privileges. The caller needs an effective user ID equal to the real user ID or effective user ID of the thread identified by pid, or it must possess the CAP_SYS_NICE capability in the user namespace of the thread pid.
ESRCH
The thread whose ID is pid could not be found.
STANDARDS
Linux.
HISTORY
Linux 2.5.8, glibc 2.3.
Initially, the glibc interfaces included a cpusetsize argument, typed as unsigned int. In glibc 2.3.3, the cpusetsize argument was removed, but was then restored in glibc 2.3.4, with type size_t.
NOTES
After a call to sched_setaffinity(), the set of CPUs on which the thread will actually run is the intersection of the set specified in the mask argument and the set of CPUs actually present on the system. The system may further restrict the set of CPUs on which the thread runs if the “cpuset” mechanism described in cpuset(7) is being used. These restrictions on the actual set of CPUs on which the thread will run are silently imposed by the kernel.
There are various ways of determining the number of CPUs available on the system, including: inspecting the contents of /proc/cpuinfo; using sysconf(3) to obtain the values of the _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN parameters; and inspecting the list of CPU directories under /sys/devices/system/cpu/.
sched(7) has a description of the Linux scheduling scheme.
The affinity mask is a per-thread attribute that can be adjusted independently for each of the threads in a thread group. The value returned from a call to gettid(2) can be passed in the argument pid. Specifying pid as 0 will set the attribute for the calling thread, and passing the value returned from a call to getpid(2) will set the attribute for the main thread of the thread group. (If you are using the POSIX threads API, then use pthread_setaffinity_np(3) instead of sched_setaffinity().)
The isolcpus boot option can be used to isolate one or more CPUs at boot time, so that no processes are scheduled onto those CPUs. Following the use of this boot option, the only way to schedule processes onto the isolated CPUs is via sched_setaffinity() or the cpuset(7) mechanism. For further information, see the kernel source file Documentation/admin-guide/kernel-parameters.txt. As noted in that file, isolcpus is the preferred mechanism of isolating CPUs (versus the alternative of manually setting the CPU affinity of all processes on the system).
A child created via fork(2) inherits its parent’s CPU affinity mask. The affinity mask is preserved across an execve(2).
C library/kernel differences
This manual page describes the glibc interface for the CPU affinity calls. The actual system call interface is slightly different, with the mask being typed as unsigned long *, reflecting the fact that the underlying implementation of CPU sets is a simple bit mask.
On success, the raw sched_getaffinity() system call returns the number of bytes copied into the mask buffer; this will be the minimum of cpusetsize and the size (in bytes) of the cpumask_t data type that is used internally by the kernel to represent the CPU set bit mask.
Handling systems with large CPU affinity masks
The underlying system calls (which represent CPU masks as bit masks of type unsigned long *) impose no restriction on the size of the CPU mask. However, the cpu_set_t data type used by glibc has a fixed size of 128 bytes, meaning that the maximum CPU number that can be represented is 1023. If the kernel CPU affinity mask is larger than 1024 bits, then calls of the form:
sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
fail with the error EINVAL, the error produced by the underlying system call for the case where the mask size specified in cpusetsize is smaller than the size of the affinity mask used by the kernel. (Depending on the system CPU topology, the kernel affinity mask can be substantially larger than the number of active CPUs in the system.)
When working on systems with large kernel CPU affinity masks, one must dynamically allocate the mask argument (see CPU_ALLOC(3)). Currently, the only way to do this is by probing for the size of the required mask using sched_getaffinity() calls with increasing mask sizes (until the call does not fail with the error EINVAL).
Be aware that CPU_ALLOC(3) may allocate a slightly larger CPU set than requested (because CPU sets are implemented as bit masks allocated in units of sizeof(long)). Consequently, sched_getaffinity() can set bits beyond the requested allocation size, because the kernel sees a few additional bits. Therefore, the caller should iterate over the bits in the returned set, counting those which are set, and stop upon reaching the value returned by CPU_COUNT(3) (rather than iterating over the number of bits requested to be allocated).
EXAMPLES
The program below creates a child process. The parent and child then each assign themselves to a specified CPU and execute identical loops that consume some CPU time. Before terminating, the parent waits for the child to complete. The program takes three command-line arguments: the CPU number for the parent, the CPU number for the child, and the number of loop iterations that both processes should perform.
As the sample runs below demonstrate, the amount of real and CPU time consumed when running the program will depend on intra-core caching effects and whether the processes are using the same CPU.
We first employ lscpu(1) to determine that this (x86) system has two cores, each with two CPUs:
$ lscpu | egrep -i 'core.*:|socket'
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
We then time the operation of the example program for three cases: both processes running on the same CPU; both processes running on different CPUs on the same core; and both processes running on different CPUs on different cores.
$ time -p ./a.out 0 0 100000000
real 14.75
user 3.02
sys 11.73
$ time -p ./a.out 0 1 100000000
real 11.52
user 3.98
sys 19.06
$ time -p ./a.out 0 3 100000000
real 7.89
user 3.29
sys 12.07
Program source
#define _GNU_SOURCE
#include <err.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int           parentCPU, childCPU;
    cpu_set_t     set;
    unsigned int  nloops;

    if (argc != 4) {
        fprintf(stderr, "Usage: %s parent-cpu child-cpu num-loops\n",
                argv[0]);
        exit(EXIT_FAILURE);
    }

    parentCPU = atoi(argv[1]);
    childCPU = atoi(argv[2]);
    nloops = atoi(argv[3]);

    CPU_ZERO(&set);

    switch (fork()) {
    case -1:            /* Error */
        err(EXIT_FAILURE, "fork");

    case 0:             /* Child */
        CPU_SET(childCPU, &set);

        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1)
            err(EXIT_FAILURE, "sched_setaffinity");

        for (unsigned int j = 0; j < nloops; j++)
            getppid();

        exit(EXIT_SUCCESS);

    default:            /* Parent */
        CPU_SET(parentCPU, &set);

        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1)
            err(EXIT_FAILURE, "sched_setaffinity");

        for (unsigned int j = 0; j < nloops; j++)
            getppid();

        wait(NULL);     /* Wait for child to terminate */
        exit(EXIT_SUCCESS);
    }
}
SEE ALSO
lscpu(1), nproc(1), taskset(1), clone(2), getcpu(2), getpriority(2), gettid(2), nice(2), sched_get_priority_max(2), sched_get_priority_min(2), sched_getscheduler(2), sched_setscheduler(2), setpriority(2), CPU_SET(3), get_nprocs(3), pthread_setaffinity_np(3), sched_getcpu(3), capabilities(7), cpuset(7), sched(7), numactl(8)
286 - Linux cli command clock_getres
NAME 🖥️ clock_getres 🖥️
clock and time functions
LIBRARY
Standard C library (libc, -lc), since glibc 2.17
Before glibc 2.17, Real-time library (librt, -lrt)
SYNOPSIS
#include <time.h>
int clock_getres(clockid_t clockid, struct timespec *_Nullable res);
int clock_gettime(clockid_t clockid, struct timespec *tp);
int clock_settime(clockid_t clockid, const struct timespec *tp);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
clock_getres(), clock_gettime(), clock_settime():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
The function clock_getres() finds the resolution (precision) of the specified clock clockid, and, if res is non-NULL, stores it in the struct timespec pointed to by res. The resolution of clocks depends on the implementation and cannot be configured by a particular process. If the time value pointed to by the argument tp of clock_settime() is not a multiple of res, then it is truncated to a multiple of res.
The functions clock_gettime() and clock_settime() retrieve and set the time of the specified clock clockid.
The res and tp arguments are timespec(3) structures.
The clockid argument is the identifier of the particular clock on which to act. A clock may be system-wide and hence visible for all processes, or per-process if it measures time only within a single process.
All implementations support the system-wide real-time clock, which is identified by CLOCK_REALTIME. Its time represents seconds and nanoseconds since the Epoch. When its time is changed, timers for a relative interval are unaffected, but timers for an absolute point in time are affected.
More clocks may be implemented. The interpretation of the corresponding time values and the effect on timers is unspecified.
Sufficiently recent versions of glibc and the Linux kernel support the following clocks:
CLOCK_REALTIME
A settable system-wide clock that measures real (i.e., wall-clock) time. Setting this clock requires appropriate privileges. This clock is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), and by frequency adjustments performed by NTP and similar applications via adjtime(3), adjtimex(2), clock_adjtime(2), and ntp_adjtime(3). This clock normally counts the number of seconds since 1970-01-01 00:00:00 Coordinated Universal Time (UTC) except that it ignores leap seconds; near a leap second it is typically adjusted by NTP to stay roughly in sync with UTC.
CLOCK_REALTIME_ALARM (since Linux 3.0; Linux-specific)
Like CLOCK_REALTIME, but not settable. See timer_create(2) for further details.
CLOCK_REALTIME_COARSE (since Linux 2.6.32; Linux-specific)
A faster but less precise version of CLOCK_REALTIME. This clock is not settable. Use when you need very fast, but not fine-grained timestamps. Requires per-architecture support, and probably also architecture support for this flag in the vdso(7).
CLOCK_TAI (since Linux 3.10; Linux-specific)
A nonsettable system-wide clock derived from wall-clock time but counting leap seconds. This clock does not experience discontinuities or frequency adjustments caused by inserting leap seconds as CLOCK_REALTIME does.
The acronym TAI refers to International Atomic Time.
CLOCK_MONOTONIC
A nonsettable system-wide clock that represents monotonic time since (as described by POSIX) “some unspecified point in the past”. On Linux, that point corresponds to the number of seconds that the system has been running since it was booted.
The CLOCK_MONOTONIC clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), but is affected by frequency adjustments. This clock does not count time that the system is suspended. All CLOCK_MONOTONIC variants guarantee that the time returned by consecutive calls will not go backwards, but successive calls may, depending on the architecture, return identical (not-increased) time values.
CLOCK_MONOTONIC_COARSE (since Linux 2.6.32; Linux-specific)
A faster but less precise version of CLOCK_MONOTONIC. Use when you need very fast, but not fine-grained timestamps. Requires per-architecture support, and probably also architecture support for this flag in the vdso(7).
CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to frequency adjustments. This clock does not count time that the system is suspended.
CLOCK_BOOTTIME (since Linux 2.6.39; Linux-specific)
A nonsettable system-wide clock that is identical to CLOCK_MONOTONIC, except that it also includes any time that the system is suspended. This allows applications to get a suspend-aware monotonic clock without having to deal with the complications of CLOCK_REALTIME, which may have discontinuities if the time is changed using settimeofday(2) or similar.
CLOCK_BOOTTIME_ALARM (since Linux 3.0; Linux-specific)
Like CLOCK_BOOTTIME. See timer_create(2) for further details.
CLOCK_PROCESS_CPUTIME_ID (since Linux 2.6.12)
This is a clock that measures CPU time consumed by this process (i.e., CPU time consumed by all threads in the process). On Linux, this clock is not settable.
CLOCK_THREAD_CPUTIME_ID (since Linux 2.6.12)
This is a clock that measures CPU time consumed by this thread. On Linux, this clock is not settable.
Linux also implements dynamic clock instances as described below.
Dynamic clocks
In addition to the hard-coded System-V style clock IDs described above, Linux also supports POSIX clock operations on certain character devices. Such devices are called “dynamic” clocks, and are supported since Linux 2.6.39.
Using the appropriate macros, open file descriptors may be converted into clock IDs and passed to clock_gettime(), clock_settime(), and clock_adjtime(2). The following example shows how to convert a file descriptor into a dynamic clock ID.
#define CLOCKFD 3
#define FD_TO_CLOCKID(fd) ((~(clockid_t) (fd) << 3) | CLOCKFD)
#define CLOCKID_TO_FD(clk) ((unsigned int) ~((clk) >> 3))
struct timespec ts;
clockid_t clkid;
int fd;
fd = open("/dev/ptp0", O_RDWR);
clkid = FD_TO_CLOCKID(fd);
clock_gettime(clkid, &ts);
RETURN VALUE
clock_gettime(), clock_settime(), and clock_getres() return 0 for success. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
clock_settime() does not have write permission for the dynamic POSIX clock device indicated.
EFAULT
tp points outside the accessible address space.
EINVAL
The clockid specified is invalid for one of two reasons: either the System-V style hard-coded positive value is out of range, or the dynamic clock ID does not refer to a valid instance of a clock object.
EINVAL
(clock_settime()): tp.tv_sec is negative or tp.tv_nsec is outside the range [0, 999,999,999].
EINVAL
The clockid specified in a call to clock_settime() is not a settable clock.
EINVAL (since Linux 4.3)
A call to clock_settime() with a clockid of CLOCK_REALTIME attempted to set the time to a value less than the current value of the CLOCK_MONOTONIC clock.
ENODEV
The hot-pluggable device (for example, a USB device) represented by a dynamic clockid has disappeared after its character device was opened.
ENOTSUP
The operation is not supported by the dynamic POSIX clock device specified.
EOVERFLOW
The timestamp would not fit in time_t range. This can happen if an executable with 32-bit time_t is run on a 64-bit kernel when the time is 2038-01-19 03:14:08 UTC or later. However, when the system time is out of time_t range in other situations, the behavior is undefined.
EPERM
clock_settime() does not have permission to set the clock indicated.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface                                        | Attribute     | Value
clock_getres(), clock_gettime(), clock_settime() | Thread safety | MT-Safe
VERSIONS
POSIX.1 specifies the following:
Setting the value of the CLOCK_REALTIME clock via clock_settime() shall have no effect on threads that are blocked waiting for a relative time service based upon this clock, including the nanosleep() function; nor on the expiration of relative timers based upon this clock. Consequently, these time services shall expire when the requested relative interval elapses, independently of the new or old value of the clock.
According to POSIX.1-2001, a process with “appropriate privileges” may set the CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks using clock_settime(). On Linux, these clocks are not settable (i.e., no process has “appropriate privileges”).
C library/kernel differences
On some architectures, an implementation of clock_gettime() is provided in the vdso(7).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SUSv2. Linux 2.6.
On POSIX systems on which these functions are available, the symbol _POSIX_TIMERS is defined in <unistd.h> to a value greater than 0. POSIX.1-2008 makes these functions mandatory.
The symbols _POSIX_MONOTONIC_CLOCK, _POSIX_CPUTIME, _POSIX_THREAD_CPUTIME indicate that CLOCK_MONOTONIC, CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID are available. (See also sysconf(3).)
Historical note for SMP systems
Before Linux added kernel support for CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID, glibc implemented these clocks on many platforms using timer registers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
If the CPUs in an SMP system have different clock sources, then there is no way to maintain a correlation between the timer registers since each CPU will run at a slightly different frequency. If that is the case, then clock_getcpuclockid(0) will return ENOENT to signify this condition. The two clocks will then be useful only if it can be ensured that a process stays on a certain CPU.
The processors in an SMP system do not start all at exactly the same time and therefore the timer registers are typically running at an offset. Some architectures include code that attempts to limit these offsets on bootup. However, the code cannot guarantee to accurately tune the offsets. glibc contains no provisions to deal with these offsets (unlike the Linux Kernel). Typically these offsets are small and therefore the effects may be negligible in most cases.
Since glibc 2.4, the wrapper functions for the system calls described in this page avoid the abovementioned problems by employing the kernel implementation of CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID, on systems that provide such an implementation (i.e., Linux 2.6.12 and later).
EXAMPLES
The program below demonstrates the use of clock_gettime() and clock_getres() with various clocks. This is an example of what we might see when running the program:
$ ./clock_times x
CLOCK_REALTIME : 1585985459.446 (18356 days + 7h 30m 59s)
resolution: 0.000000001
CLOCK_TAI : 1585985496.447 (18356 days + 7h 31m 36s)
resolution: 0.000000001
CLOCK_MONOTONIC: 52395.722 (14h 33m 15s)
resolution: 0.000000001
CLOCK_BOOTTIME : 72691.019 (20h 11m 31s)
resolution: 0.000000001
Program source
/* clock_times.c
   Licensed under GNU General Public License v2 or later.
*/
#define _XOPEN_SOURCE 600
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SECS_IN_DAY (24 * 60 * 60)

static void
displayClock(clockid_t clock, const char *name, bool showRes)
{
    long days;
    struct timespec ts;

    if (clock_gettime(clock, &ts) == -1) {
        perror("clock_gettime");
        exit(EXIT_FAILURE);
    }

    printf("%-15s: %10jd.%03ld (", name,
           (intmax_t) ts.tv_sec, ts.tv_nsec / 1000000);

    days = ts.tv_sec / SECS_IN_DAY;
    if (days > 0)
        printf("%ld days + ", days);

    printf("%2dh %2dm %2ds",
           (int) (ts.tv_sec % SECS_IN_DAY) / 3600,
           (int) (ts.tv_sec % 3600) / 60,
           (int) ts.tv_sec % 60);
    printf(")\n");

    if (clock_getres(clock, &ts) == -1) {
        perror("clock_getres");
        exit(EXIT_FAILURE);
    }

    if (showRes)
        printf("     resolution: %10jd.%09ld\n",
               (intmax_t) ts.tv_sec, ts.tv_nsec);
}

int
main(int argc, char *argv[])
{
    bool showRes = argc > 1;

    displayClock(CLOCK_REALTIME, "CLOCK_REALTIME", showRes);
#ifdef CLOCK_TAI
    displayClock(CLOCK_TAI, "CLOCK_TAI", showRes);
#endif
    displayClock(CLOCK_MONOTONIC, "CLOCK_MONOTONIC", showRes);
#ifdef CLOCK_BOOTTIME
    displayClock(CLOCK_BOOTTIME, "CLOCK_BOOTTIME", showRes);
#endif
    exit(EXIT_SUCCESS);
}
SEE ALSO
date(1), gettimeofday(2), settimeofday(2), time(2), adjtime(3), clock_getcpuclockid(3), ctime(3), ftime(3), pthread_getcpuclockid(3), sysconf(3), timespec(3), time(7), time_namespaces(7), vdso(7), hwclock(8)
287 - Linux cli command inb
NAME
inb - port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, and the in* functions do port input. The b-suffix functions are byte-width, the w-suffix functions word-width, and the l-suffix functions long-word (32-bit) width; the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
Use ioperm(2) or, alternatively, iopl(2) to tell the kernel to allow the user-space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
288 - Linux cli command umount
NAME
umount - unmount filesystem
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mount.h>
int umount(const char *target);
int umount2(const char *target, int flags);
DESCRIPTION
umount() and umount2() remove the attachment of the (topmost) filesystem mounted on target.
Appropriate privilege (Linux: the CAP_SYS_ADMIN capability) is required to unmount filesystems.
Linux 2.1.116 added the umount2() system call, which, like umount(), unmounts a target, but allows additional flags controlling the behavior of the operation:
MNT_FORCE (since Linux 2.1.116)
Ask the filesystem to abort pending requests before attempting the unmount. This may allow the unmount to complete without waiting for an inaccessible server, but could cause data loss. If, after aborting requests, some processes still have active references to the filesystem, the unmount will still fail. As of Linux 4.12, MNT_FORCE is supported only on the following filesystems: 9p (since Linux 2.6.16), ceph (since Linux 2.6.34), cifs (since Linux 2.6.12), fuse (since Linux 2.6.16), lustre (since Linux 3.11), and NFS (since Linux 2.1.116).
MNT_DETACH (since Linux 2.4.11)
Perform a lazy unmount: make the mount unavailable for new accesses, immediately disconnect the filesystem and all filesystems mounted below it from each other and from the mount table, and actually perform the unmount when the mount ceases to be busy.
MNT_EXPIRE (since Linux 2.6.8)
Mark the mount as expired. If a mount is not currently in use, then an initial call to umount2() with this flag fails with the error EAGAIN, but marks the mount as expired. The mount remains expired as long as it isn’t accessed by any process. A second umount2() call specifying MNT_EXPIRE unmounts an expired mount. This flag cannot be specified with either MNT_FORCE or MNT_DETACH.
UMOUNT_NOFOLLOW (since Linux 2.6.34)
Don’t dereference target if it is a symbolic link. This flag allows security problems to be avoided in set-user-ID-root programs that allow unprivileged users to unmount filesystems.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The error values given below result from filesystem type independent errors. Each filesystem type may have its own special errors and its own special behavior. See the Linux kernel source code for details.
EAGAIN
A call to umount2() specifying MNT_EXPIRE successfully marked an unbusy filesystem as expired.
EBUSY
target could not be unmounted because it is busy.
EFAULT
target points outside the user address space.
EINVAL
target is not a mount point.
EINVAL
target is locked; see mount_namespaces(7).
EINVAL
umount2() was called with MNT_EXPIRE and either MNT_DETACH or MNT_FORCE.
EINVAL (since Linux 2.6.34)
umount2() was called with an invalid flag value in flags.
ENAMETOOLONG
A pathname was longer than MAXPATHLEN.
ENOENT
A pathname was empty or had a nonexistent component.
ENOMEM
The kernel could not allocate a free page to copy filenames or data into.
EPERM
The caller does not have the required privileges.
STANDARDS
Linux.
HISTORY
MNT_DETACH and MNT_EXPIRE are available since glibc 2.11.
The original umount() function was called as umount(device) and would return ENOTBLK when called with something other than a block device. In Linux 0.98p4, a call umount(dir) was added, in order to support anonymous devices. In Linux 2.3.99-pre7, the call umount(device) was removed, leaving only umount(dir) (since now devices can be mounted in more than one place, so specifying the device does not suffice).
NOTES
umount() and shared mounts
Shared mounts cause any mount activity on a mount, including umount() operations, to be forwarded to every shared mount in the peer group and every slave mount of that peer group. This means that umount() of any peer in a set of shared mounts will cause all of its peers to be unmounted and all of their slaves to be unmounted as well.
This propagation of unmount activity can be particularly surprising on systems where every mount is shared by default. On such systems, recursively bind mounting the root directory of the filesystem onto a subdirectory and then later unmounting that subdirectory with MNT_DETACH will cause every mount in the mount namespace to be lazily unmounted.
To ensure umount() does not propagate in this fashion, the mount may be remounted using a mount(2) call with a mount_flags argument that includes both MS_REC and MS_PRIVATE prior to umount() being called.
SEE ALSO
mount(2), mount_namespaces(7), path_resolution(7), mount(8), umount(8)
289 - Linux cli command sched_setscheduler
NAME
sched_setscheduler - set and get scheduling policy/parameters
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sched.h>
int sched_setscheduler(pid_t pid, int policy,
const struct sched_param *param);
int sched_getscheduler(pid_t pid);
DESCRIPTION
The sched_setscheduler() system call sets both the scheduling policy and parameters for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and parameters of the calling thread will be set.
The scheduling parameters are specified in the param argument, which is a pointer to a structure of the following form:
struct sched_param {
...
int sched_priority;
...
};
In the current implementation, the structure contains only one field, sched_priority. The interpretation of param depends on the selected policy.
Currently, Linux supports the following “normal” (i.e., non-real-time) scheduling policies as values that may be specified in policy:
SCHED_OTHER
the standard round-robin time-sharing policy;
SCHED_BATCH
for “batch” style execution of processes; and
SCHED_IDLE
for running very low priority background jobs.
For each of the above policies, param->sched_priority must be 0.
Various “real-time” policies are also supported, for special time-critical applications that need precise control over the way in which runnable threads are selected for execution. For the rules governing when a process may use these policies, see sched(7). The real-time policies that may be specified in policy are:
SCHED_FIFO
a first-in, first-out policy; and
SCHED_RR
a round-robin policy.
For each of the above policies, param->sched_priority specifies a scheduling priority for the thread. This is a number in the range returned by calling sched_get_priority_min(2) and sched_get_priority_max(2) with the specified policy. On Linux, these system calls return, respectively, 1 and 99.
Since Linux 2.6.32, the SCHED_RESET_ON_FORK flag can be ORed in policy when calling sched_setscheduler(). As a result of including this flag, children created by fork(2) do not inherit privileged scheduling policies. See sched(7) for details.
sched_getscheduler() returns the current scheduling policy of the thread identified by pid. If pid equals zero, the policy of the calling thread will be retrieved.
RETURN VALUE
On success, sched_setscheduler() returns zero. On success, sched_getscheduler() returns the policy for the thread (a nonnegative integer). On error, both calls return -1, and errno is set to indicate the error.
ERRORS
EINVAL
Invalid arguments: pid is negative or param is NULL.
EINVAL
(sched_setscheduler()) policy is not one of the recognized policies.
EINVAL
(sched_setscheduler()) param does not make sense for the specified policy.
EPERM
The calling thread does not have appropriate privileges.
ESRCH
The thread whose ID is pid could not be found.
VERSIONS
POSIX.1 does not detail the permissions that an unprivileged thread requires in order to call sched_setscheduler(), and details vary across systems. For example, the Solaris 7 manual page says that the real or effective user ID of the caller must match the real user ID or the saved set-user-ID of the target.
The scheduling policy and parameters are in fact per-thread attributes on Linux. The value returned from a call to gettid(2) can be passed in the argument pid. Specifying pid as 0 will operate on the attributes of the calling thread, and passing the value returned from a call to getpid(2) will operate on the attributes of the main thread of the thread group. (If you are using the POSIX threads API, then use pthread_setschedparam(3), pthread_getschedparam(3), and pthread_setschedprio(3), instead of the sched_*(2) system calls.)
STANDARDS
POSIX.1-2008 (but see BUGS below).
SCHED_BATCH and SCHED_IDLE are Linux-specific.
HISTORY
POSIX.1-2001.
NOTES
Further details of the semantics of all of the above “normal” and “real-time” scheduling policies can be found in the sched(7) manual page. That page also describes an additional policy, SCHED_DEADLINE, which is settable only via sched_setattr(2).
POSIX systems on which sched_setscheduler() and sched_getscheduler() are available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
BUGS
POSIX.1 says that on success, sched_setscheduler() should return the previous scheduling policy. Linux sched_setscheduler() does not conform to this requirement, since it always returns 0 on success.
SEE ALSO
chrt(1), nice(2), sched_get_priority_max(2), sched_get_priority_min(2), sched_getaffinity(2), sched_getattr(2), sched_getparam(2), sched_rr_get_interval(2), sched_setaffinity(2), sched_setattr(2), sched_setparam(2), sched_yield(2), setpriority(2), capabilities(7), cpuset(7), sched(7)
290 - Linux cli command fallocate
NAME
fallocate - manipulate file space
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
int fallocate(int fd, int mode, off_t offset, off_t len);
DESCRIPTION
This is a nonportable, Linux-specific system call. For the portable, POSIX.1-specified method of ensuring that space is allocated for a file, see posix_fallocate(3).
fallocate() allows the caller to directly manipulate the allocated disk space for the file referred to by fd for the byte range starting at offset and continuing for len bytes.
The mode argument determines the operation to be performed on the given range. Details of the supported operations are given in the subsections below.
Allocating disk space
The default operation (i.e., mode is zero) of fallocate() allocates the disk space within the range specified by offset and len. The file size (as reported by stat(2)) will be changed if offset+len is greater than the file size. Any subregion within the range specified by offset and len that did not contain data before the call will be initialized to zero. This default behavior closely resembles the behavior of the posix_fallocate(3) library function, and is intended as a method of optimally implementing that function.
After a successful call, subsequent writes into the range specified by offset and len are guaranteed not to fail because of lack of disk space.
If the FALLOC_FL_KEEP_SIZE flag is specified in mode, the behavior of the call is similar, but the file size will not be changed even if offset+len is greater than the file size. Preallocating zeroed blocks beyond the end of the file in this manner is useful for optimizing append workloads.
If the FALLOC_FL_UNSHARE_RANGE flag is specified in mode, shared file data extents will be made private to the file to guarantee that a subsequent write will not fail due to lack of space. Typically, this will be done by performing a copy-on-write operation on all shared data in the file. This flag may not be supported by all filesystems.
Because allocation is done in block size chunks, fallocate() may allocate a larger range of disk space than was specified.
Deallocating file space
Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte range starting at offset and continuing for len bytes. Within the specified range, partial filesystem blocks are zeroed, and whole filesystem blocks are removed from the file. After a successful call, subsequent reads from this range will return zeros.
The FALLOC_FL_PUNCH_HOLE flag must be ORed with FALLOC_FL_KEEP_SIZE in mode; in other words, even when punching off the end of the file, the file size (as reported by stat(2)) does not change.
Not all filesystems support FALLOC_FL_PUNCH_HOLE; if a filesystem doesn’t support the operation, an error is returned. The operation is supported on at least the following filesystems:
XFS (since Linux 2.6.38)
ext4 (since Linux 3.0)
Btrfs (since Linux 3.7)
tmpfs(5) (since Linux 3.5)
gfs2(5) (since Linux 4.16)
Collapsing file space
Specifying the FALLOC_FL_COLLAPSE_RANGE flag (available since Linux 3.15) in mode removes a byte range from a file, without leaving a hole. The byte range to be collapsed starts at offset and continues for len bytes. At the completion of the operation, the contents of the file starting at the location offset+len will be appended at the location offset, and the file will be len bytes smaller.
A filesystem may place limitations on the granularity of the operation, in order to ensure efficient implementation. Typically, offset and len must be a multiple of the filesystem logical block size, which varies according to the filesystem type and configuration. If a filesystem has such a requirement, fallocate() fails with the error EINVAL if this requirement is violated.
If the region specified by offset plus len reaches or passes the end of file, an error is returned; instead, use ftruncate(2) to truncate a file.
No other flags may be specified in mode in conjunction with FALLOC_FL_COLLAPSE_RANGE.
As of Linux 3.15, FALLOC_FL_COLLAPSE_RANGE is supported by ext4 (only for extent-based files) and XFS.
Zeroing file space
Specifying the FALLOC_FL_ZERO_RANGE flag (available since Linux 3.15) in mode zeros space in the byte range starting at offset and continuing for len bytes. Within the specified range, blocks are preallocated for the regions that span the holes in the file. After a successful call, subsequent reads from this range will return zeros.
Zeroing is done within the filesystem, preferably by converting the range into unwritten extents. This approach means that the specified range will not be physically zeroed out on the device (except for partial blocks at either end of the range), and I/O is (otherwise) required only to update metadata.
If the FALLOC_FL_KEEP_SIZE flag is additionally specified in mode, the behavior of the call is similar, but the file size will not be changed even if offset+len is greater than the file size. This behavior is the same as when preallocating space with FALLOC_FL_KEEP_SIZE specified.
Not all filesystems support FALLOC_FL_ZERO_RANGE; if a filesystem doesn’t support the operation, an error is returned. The operation is supported on at least the following filesystems:
XFS (since Linux 3.15)
ext4, for extent-based files (since Linux 3.15)
SMB3 (since Linux 3.17)
Btrfs (since Linux 4.16)
Increasing file space
Specifying the FALLOC_FL_INSERT_RANGE flag (available since Linux 4.1) in mode increases the file space by inserting a hole within the file size without overwriting any existing data. The hole will start at offset and continue for len bytes. When inserting the hole inside the file, the contents of the file starting at offset will be shifted upward (i.e., to a higher file offset) by len bytes. Inserting a hole inside a file increases the file size by len bytes.
This mode has the same limitations as FALLOC_FL_COLLAPSE_RANGE regarding the granularity of the operation. If the granularity requirements are not met, fallocate() fails with the error EINVAL. If the offset is equal to or greater than the end of file, an error is returned. For such operations (i.e., inserting a hole at the end of file), ftruncate(2) should be used.
No other flags may be specified in mode in conjunction with FALLOC_FL_INSERT_RANGE.
FALLOC_FL_INSERT_RANGE requires filesystem support. Filesystems that support this operation include XFS (since Linux 4.1) and ext4 (since Linux 4.2).
RETURN VALUE
On success, fallocate() returns zero. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
fd is not a valid file descriptor, or is not opened for writing.
EFBIG
offset+len exceeds the maximum file size.
EFBIG
mode is FALLOC_FL_INSERT_RANGE, and the current file size+len exceeds the maximum file size.
EINTR
A signal was caught during execution; see signal(7).
EINVAL
offset was less than 0, or len was less than or equal to 0.
EINVAL
mode is FALLOC_FL_COLLAPSE_RANGE and the range specified by offset plus len reaches or passes the end of the file.
EINVAL
mode is FALLOC_FL_INSERT_RANGE and the range specified by offset reaches or passes the end of the file.
EINVAL
mode is FALLOC_FL_COLLAPSE_RANGE or FALLOC_FL_INSERT_RANGE, but either offset or len is not a multiple of the filesystem block size.
EINVAL
mode contains one of FALLOC_FL_COLLAPSE_RANGE or FALLOC_FL_INSERT_RANGE and also other flags; no other flags are permitted with FALLOC_FL_COLLAPSE_RANGE or FALLOC_FL_INSERT_RANGE.
EINVAL
mode is FALLOC_FL_COLLAPSE_RANGE, FALLOC_FL_ZERO_RANGE, or FALLOC_FL_INSERT_RANGE, but the file referred to by fd is not a regular file.
EIO
An I/O error occurred while reading from or writing to a filesystem.
ENODEV
fd does not refer to a regular file or a directory. (If fd is a pipe or FIFO, a different error results.)
ENOSPC
There is not enough space left on the device containing the file referred to by fd.
ENOSYS
This kernel does not implement fallocate().
EOPNOTSUPP
The filesystem containing the file referred to by fd does not support this operation; or the mode is not supported by the filesystem containing the file referred to by fd.
EPERM
The file referred to by fd is marked immutable (see chattr(1)).
EPERM
mode specifies FALLOC_FL_PUNCH_HOLE, FALLOC_FL_COLLAPSE_RANGE, or FALLOC_FL_INSERT_RANGE and the file referred to by fd is marked append-only (see chattr(1)).
EPERM
The operation was prevented by a file seal; see fcntl(2).
ESPIPE
fd refers to a pipe or FIFO.
ETXTBSY
mode specifies FALLOC_FL_COLLAPSE_RANGE or FALLOC_FL_INSERT_RANGE, but the file referred to by fd is currently being executed.
STANDARDS
Linux.
HISTORY
fallocate()
Linux 2.6.23, glibc 2.10.
FALLOC_FL_*
glibc 2.18.
SEE ALSO
fallocate(1), ftruncate(2), posix_fadvise(3), posix_fallocate(3)
291 - Linux cli command add_key
NAME
add_key - add a key to the kernel’s key management facility
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <keyutils.h>
key_serial_t add_key(const char *type, const char *description,
const void payload[.plen], size_t plen,
key_serial_t keyring);
Note: There is no glibc wrapper for this system call; see NOTES.
DESCRIPTION
add_key() creates or updates a key of the given type and description, instantiates it with the payload of length plen, attaches it to the nominated keyring, and returns the key’s serial number.
The key may be rejected if the provided data is in the wrong format or it is invalid in some other way.
If the destination keyring already contains a key that matches the specified type and description, then, if the key type supports it, that key will be updated rather than a new key being created; if not, a new key (with a different ID) will be created and it will displace the link to the extant key from the keyring.
The destination keyring serial number may be that of a valid keyring for which the caller has write permission. Alternatively, it may be one of the following special keyring IDs:
KEY_SPEC_THREAD_KEYRING
This specifies the caller’s thread-specific keyring (thread-keyring(7)).
KEY_SPEC_PROCESS_KEYRING
This specifies the caller’s process-specific keyring (process-keyring(7)).
KEY_SPEC_SESSION_KEYRING
This specifies the caller’s session-specific keyring (session-keyring(7)).
KEY_SPEC_USER_KEYRING
This specifies the caller’s UID-specific keyring (user-keyring(7)).
KEY_SPEC_USER_SESSION_KEYRING
This specifies the caller’s UID-session keyring (user-session-keyring(7)).
Key types
The key type is a string that specifies the key’s type. Internally, the kernel defines a number of key types that are available in the core key management code. Among the types that are available for user-space use and can be specified as the type argument to add_key() are the following:
“keyring”
Keyrings are special key types that may contain links to sequences of other keys of any type. If this interface is used to create a keyring, then payload should be NULL and plen should be zero.
“user”
This is a general purpose key type whose payload may be read and updated by user-space applications. The key is kept entirely within kernel memory. The payload for keys of this type is a blob of arbitrary data of up to 32,767 bytes.
“logon” (since Linux 3.3)
This key type is essentially the same as “user”, but it does not permit the key to be read. This is suitable for storing payloads that you do not want to be readable from user space.
This key type vets the description to ensure that it is qualified by a “service” prefix, by checking to ensure that the description contains a ‘:’ that is preceded by other characters.
“big_key” (since Linux 3.13)
This key type is similar to “user”, but may hold a payload of up to 1 MiB. If the key payload is large enough, then it may be stored encrypted in tmpfs (which can be swapped out) rather than kernel memory.
For further details on these key types, see keyrings(7).
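The description rule for “logon” keys can be checked in user space before calling add_key(). The following is a minimal sketch, not code from keyutils: the helper name logon_description_ok() is hypothetical, and it only mirrors the rule as stated above (a ‘:’ preceded by at least one other character).

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not part of keyutils): returns true if desc
   contains a ':' that is preceded by at least one other character,
   which is the "service:" qualification described for "logon" keys. */
static bool
logon_description_ok(const char *desc)
{
    const char *colon = strchr(desc, ':');   /* first ':' or NULL */

    return colon != NULL && colon != desc;
}
```

A description such as "afs:cell" passes this check, while ":cell" and "cell" do not; the kernel remains the final authority and reports EINVAL for descriptions it rejects.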
RETURN VALUE
On success, add_key() returns the serial number of the key it created or updated. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
The keyring wasn’t available for modification by the user.
EDQUOT
The key quota for this user would be exceeded by creating this key or linking it to the keyring.
EFAULT
One or more of type, description, and payload point outside the process’s accessible address space.
EINVAL
The size of the string (including the terminating null byte) specified in type or description exceeded the limit (32 bytes and 4096 bytes respectively).
EINVAL
The payload data was invalid.
EINVAL
type was “logon” and the description was not qualified with a prefix string of the form “service:”.
EKEYEXPIRED
The keyring has expired.
EKEYREVOKED
The keyring has been revoked.
ENOKEY
The keyring doesn’t exist.
ENOMEM
Insufficient memory to create a key.
EPERM
The type started with a period (’.’). Key types that begin with a period are reserved to the implementation.
EPERM
type was “keyring” and the description started with a period (’.’). Keyrings with descriptions (names) that begin with a period are reserved to the implementation.
STANDARDS
Linux.
HISTORY
Linux 2.6.10.
NOTES
glibc does not provide a wrapper for this system call. A wrapper is provided in the libkeyutils library. (The accompanying package provides the <keyutils.h> header file.) When employing the wrapper in that library, link with -lkeyutils.
EXAMPLES
The program below creates a key with the type, description, and payload specified in its command-line arguments, and links that key into the session keyring. The following shell session demonstrates the use of the program:
$ ./a.out user mykey "Some payload"
Key ID is 64a4dca
$ grep '64a4dca' /proc/keys
064a4dca I--Q--- 1 perm 3f010000 1000 1000 user mykey: 12
Program source
#include <keyutils.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char *argv[])
{
    key_serial_t key;

    if (argc != 4) {
        fprintf(stderr, "Usage: %s type description payload\n",
                argv[0]);
        exit(EXIT_FAILURE);
    }

    key = add_key(argv[1], argv[2], argv[3], strlen(argv[3]),
                  KEY_SPEC_SESSION_KEYRING);
    if (key == -1) {
        perror("add_key");
        exit(EXIT_FAILURE);
    }

    printf("Key ID is %jx\n", (uintmax_t) key);

    exit(EXIT_SUCCESS);
}
SEE ALSO
keyctl(1), keyctl(2), request_key(2), keyctl(3), keyrings(7), keyutils(7), persistent-keyring(7), process-keyring(7), session-keyring(7), thread-keyring(7), user-keyring(7), user-session-keyring(7)
The kernel source files Documentation/security/keys/core.rst and Documentation/security/keys/request-key.rst (or, before Linux 4.13, the files Documentation/security/keys.txt and Documentation/security/keys-request-key.txt).
292 - Linux cli command fsetxattr
NAME fsetxattr
set an extended attribute value
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
int setxattr(const char *path, const char *name,
const void value[.size], size_t size, int flags);
int lsetxattr(const char *path, const char *name,
const void value[.size], size_t size, int flags);
int fsetxattr(int fd, const char *name,
const void value[.size], size_t size, int flags);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
setxattr() sets the value of the extended attribute identified by name and associated with the given path in the filesystem. The size argument specifies the size (in bytes) of value; a zero-length value is permitted.
lsetxattr() is identical to setxattr(), except in the case of a symbolic link, where the extended attribute is set on the link itself, not the file that it refers to.
fsetxattr() is identical to setxattr(), except that the extended attribute is set on the open file referred to by fd (as returned by open(2)) instead of on the file named by path.
An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. The value of an extended attribute is a chunk of arbitrary textual or binary data of specified length.
By default (i.e., flags is zero), the extended attribute will be created if it does not exist, or the value will be replaced if the attribute already exists. To modify these semantics, one of the following values can be specified in flags:
XATTR_CREATE
Perform a pure create, which fails if the named attribute exists already.
XATTR_REPLACE
Perform a pure replace operation, which fails if the named attribute does not already exist.
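The three flag settings can be compared side by side on a scratch file. The following is a minimal sketch under stated assumptions: the helper names create_once() and xattr_flags_demo(), the attribute names user.demo and user.missing, and the /tmp location are all illustrative, and the filesystem holding /tmp may not accept user.* attributes at all (in which case the sketch reports that rather than failing).

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Set name=value on fd only if the attribute does not exist yet.
   Returns 0 if created, 1 if it already existed (EEXIST),
   -1 on any other error, with errno as set by fsetxattr(). */
static int
create_once(int fd, const char *name, const char *value)
{
    if (fsetxattr(fd, name, value, strlen(value), XATTR_CREATE) == 0)
        return 0;
    return (errno == EEXIST) ? 1 : -1;
}

/* Exercise XATTR_CREATE, XATTR_REPLACE, and flags == 0 on a scratch
   file. Returns 0 if every step behaved as described, 1 if the
   filesystem does not support user.* attributes, -1 otherwise. */
int
xattr_flags_demo(void)
{
    char path[] = "/tmp/xattr_demo_XXXXXX";   /* illustrative location */
    int fd = mkstemp(path);
    int result = -1;

    if (fd == -1)
        return -1;

    int first = create_once(fd, "user.demo", "v1");

    if (first == -1 && errno == ENOTSUP) {
        result = 1;                 /* no xattr support here */
    } else if (first == 0
               /* a second pure create must fail with EEXIST */
               && create_once(fd, "user.demo", "v2") == 1
               /* a pure replace of an existing attribute succeeds */
               && fsetxattr(fd, "user.demo", "v3", 2, XATTR_REPLACE) == 0
               /* a pure replace of a missing attribute fails */
               && fsetxattr(fd, "user.missing", "x", 1, XATTR_REPLACE) == -1
               && errno == ENODATA
               /* flags == 0 creates or replaces as needed */
               && fsetxattr(fd, "user.demo", "v4", 2, 0) == 0) {
        result = 0;
    }

    close(fd);
    unlink(path);
    return result;
}
```

Using fsetxattr() on the descriptor returned by mkstemp(3) avoids resolving the pathname again for each step.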
RETURN VALUE
On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EDQUOT
Disk quota limits mean that there is insufficient space remaining to store the extended attribute.
EEXIST
XATTR_CREATE was specified, and the attribute exists already.
ENODATA
XATTR_REPLACE was specified, and the attribute does not exist.
ENOSPC
There is insufficient space remaining to store the extended attribute.
ENOTSUP
The namespace prefix of name is not valid.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
ERANGE
The size of name or value exceeds a filesystem-specific limit.
In addition, the errors documented in stat(2) can also occur.
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
SEE ALSO
getfattr(1), setfattr(1), getxattr(2), listxattr(2), open(2), removexattr(2), stat(2), symlink(7), xattr(7)
293 - Linux cli command utimes
NAME utimes
change file last access and modification times
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <utime.h>
int utime(const char *filename,
const struct utimbuf *_Nullable times);
#include <sys/time.h>
int utimes(const char *filename,
const struct timeval times[_Nullable 2]);
DESCRIPTION
Note: modern applications may prefer to use the interfaces described in utimensat(2).
The utime() system call changes the access and modification times of the inode specified by filename to the actime and modtime fields of times respectively. The status change time (ctime) will be set to the current time, even if the other timestamps don’t actually change.
If times is NULL, then the access and modification times of the file are set to the current time.
Changing timestamps is permitted when the process has appropriate privileges, when the effective user ID equals the user ID of the file, or when times is NULL and the process has write permission for the file.
The utimbuf structure is:
struct utimbuf {
time_t actime; /* access time */
time_t modtime; /* modification time */
};
The utime() system call allows specification of timestamps with a resolution of 1 second.
The utimes() system call is similar, but the times argument refers to an array rather than a structure. The elements of this array are timeval structures, which allow a precision of 1 microsecond for specifying timestamps. The timeval structure is:
struct timeval {
long tv_sec; /* seconds */
long tv_usec; /* microseconds */
};
times[0] specifies the new access time, and times[1] specifies the new modification time. If times is NULL, then analogously to utime(), the access and modification times of the file are set to the current time.
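The array layout taken by utimes() can be illustrated with a short round trip. This is a sketch under stated assumptions: the function names set_file_times() and utimes_roundtrip(), the /tmp location, and the two timestamp values are illustrative only.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

/* Set the access and modification times of path to the given
   seconds-since-Epoch values (microsecond parts left at zero).
   Returns 0 on success, -1 on error with errno set by utimes(). */
int
set_file_times(const char *path, time_t atime_sec, time_t mtime_sec)
{
    struct timeval times[2];

    times[0].tv_sec = atime_sec;    /* times[0]: new access time */
    times[0].tv_usec = 0;
    times[1].tv_sec = mtime_sec;    /* times[1]: new modification time */
    times[1].tv_usec = 0;

    return utimes(path, times);
}

/* Round trip on a scratch file: returns 0 if stat(2) reports the
   timestamps that were just set, -1 otherwise. */
int
utimes_roundtrip(void)
{
    char path[] = "/tmp/utimes_demo_XXXXXX";   /* illustrative location */
    struct stat st;
    int fd = mkstemp(path);
    int ok;

    if (fd == -1)
        return -1;
    close(fd);

    ok = set_file_times(path, 1000000000, 1500000000) == 0
         && stat(path, &st) == 0
         && st.st_atime == 1000000000
         && st.st_mtime == 1500000000;

    unlink(path);
    return ok ? 0 : -1;
}
```

Passing NULL instead of the array would set both timestamps to the current time, as described above.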
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of filename (see also path_resolution(7)).
EACCES
times is NULL, the caller’s effective user ID does not match the owner of the file, the caller does not have write access to the file, and the caller is not privileged (Linux: does not have either the CAP_DAC_OVERRIDE or the CAP_FOWNER capability).
ENOENT
filename does not exist.
EPERM
times is not NULL, the caller’s effective UID does not match the owner of the file, and the caller is not privileged (Linux: does not have the CAP_FOWNER capability).
EROFS
filename resides on a read-only filesystem.
STANDARDS
POSIX.1-2008.
HISTORY
utime()
SVr4, POSIX.1-2001. POSIX.1-2008 marks it as obsolete.
utimes()
4.3BSD, POSIX.1-2001.
NOTES
Linux does not allow changing the timestamps on an immutable file, or setting the timestamps to something other than the current time on an append-only file.
SEE ALSO
chattr(1), touch(1), futimesat(2), stat(2), utimensat(2), futimens(3), futimes(3), inode(7)
294 - Linux cli command getcpu
NAME getcpu
determine CPU and NUMA node on which the calling thread is running
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sched.h>
int getcpu(unsigned int *_Nullable cpu, unsigned int *_Nullable node);
DESCRIPTION
The getcpu() system call identifies the processor and node on which the calling thread or process is currently running and writes them into the integers pointed to by the cpu and node arguments. The processor is a unique small integer identifying a CPU. The node is a unique small integer identifying a NUMA node. When either cpu or node is NULL, nothing is written to the respective pointer.
The information placed in cpu is guaranteed to be current only at the time of the call: unless the CPU affinity has been fixed using sched_setaffinity(2), the kernel might change the CPU at any time. (Normally this does not happen because the scheduler tries to minimize movements between CPUs to keep caches hot, but it is possible.) The caller must allow for the possibility that the information returned in cpu and node is no longer current by the time the call returns.
RETURN VALUE
On success, 0 is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
Arguments point outside the calling process’s address space.
STANDARDS
Linux.
HISTORY
Linux 2.6.19 (x86-64 and i386), glibc 2.29.
C library/kernel differences
The kernel system call has a third argument:
int getcpu(unsigned int *cpu, unsigned int *node,
struct getcpu_cache *tcache);
The tcache argument is unused since Linux 2.6.24, and (when invoking the system call directly) should be specified as NULL, unless portability to Linux 2.6.23 or earlier is required.
In Linux 2.6.23 and earlier, if the tcache argument was non-NULL, then it specified a pointer to a caller-allocated buffer in thread-local storage that was used to provide a caching mechanism for getcpu(). Use of the cache could speed getcpu() calls, at the cost that there was a very small chance that the returned information would be out of date. The caching mechanism was considered to cause problems when migrating threads between CPUs, and so the argument is now ignored.
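When the system call is invoked directly, the third argument can simply be passed as NULL. The following is a minimal sketch of that direct invocation through syscall(2); the wrapper name raw_getcpu() is hypothetical and is not a glibc interface.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: invoke the raw system call, passing NULL for
   the unused tcache argument. Either pointer may be NULL.
   Returns 0 on success, -1 on error with errno set. */
int
raw_getcpu(unsigned int *cpu, unsigned int *node)
{
    return (int) syscall(SYS_getcpu, cpu, node, NULL);
}
```

As with the library function, the values written may already be stale by the time the caller inspects them unless the thread's CPU affinity has been fixed.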
NOTES
Linux makes a best effort to make this call as fast as possible. (On some architectures, this is done via an implementation in the vdso(7).) The intention of getcpu() is to allow programs to make optimizations with per-CPU data or for NUMA optimization.
SEE ALSO
mbind(2), sched_setaffinity(2), set_mempolicy(2), sched_getcpu(3), cpuset(7), vdso(7)
295 - Linux cli command stty
NAME stty
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
296 - Linux cli command pread64
NAME pread64
read from or write to a file descriptor at a given offset
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
ssize_t pread(int fd, void buf[.count], size_t count,
off_t offset);
ssize_t pwrite(int fd, const void buf[.count], size_t count,
off_t offset);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
pread(), pwrite():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
DESCRIPTION
pread() reads up to count bytes from file descriptor fd at offset offset (from the start of the file) into the buffer starting at buf. The file offset is not changed.
pwrite() writes up to count bytes from the buffer starting at buf to the file descriptor fd at offset offset. The file offset is not changed.
The file referenced by fd must be capable of seeking.
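That positional I/O leaves the file offset alone can be checked with lseek(2). This is a minimal sketch, assuming a scratch file under /tmp; the function name pread_pwrite_demo() and the offset 100 are illustrative. It treats a short transfer as a failure for simplicity, although short transfers are not errors in general.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 if positional I/O round-trips the data and leaves the
   file offset untouched, -1 otherwise. */
int
pread_pwrite_demo(void)
{
    char path[] = "/tmp/pread_demo_XXXXXX";   /* illustrative location */
    char buf[5] = {0};
    int fd = mkstemp(path);
    int ok;

    if (fd == -1)
        return -1;

    ok = pwrite(fd, "hello", 5, 100) == 5          /* write at offset 100 */
         && lseek(fd, 0, SEEK_CUR) == 0            /* offset is still 0 */
         && pread(fd, buf, sizeof(buf), 100) == 5  /* read the same range */
         && memcmp(buf, "hello", 5) == 0
         && lseek(fd, 0, SEEK_CUR) == 0;           /* and still unchanged */

    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

Because neither call moves the offset, several threads can share fd and each address its own region of the file.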
RETURN VALUE
On success, pread() returns the number of bytes read (a return of zero indicates end of file) and pwrite() returns the number of bytes written.
Note that it is not an error for a successful call to transfer fewer bytes than requested (see read(2) and write(2)).
On error, -1 is returned and errno is set to indicate the error.
ERRORS
pread() can fail and set errno to any error specified for read(2) or lseek(2). pwrite() can fail and set errno to any error specified for write(2) or lseek(2).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
Added in Linux 2.1.60; the entries in the i386 system call table were added in Linux 2.1.69. C library support (including emulation using lseek(2) on older kernels without the system calls) was added in glibc 2.1.
C library/kernel differences
On Linux, the underlying system calls were renamed in Linux 2.6: pread() became pread64(), and pwrite() became pwrite64(). The system call numbers remained the same. The glibc pread() and pwrite() wrapper functions transparently deal with the change.
On some 32-bit architectures, the calling signature for these system calls differs, for the reasons described in syscall(2).
NOTES
The pread() and pwrite() system calls are especially useful in multithreaded applications. They allow multiple threads to perform I/O on the same file descriptor without being affected by changes to the file offset by other threads.
BUGS
POSIX requires that opening a file with the O_APPEND flag should have no effect on the location at which pwrite() writes data. However, on Linux, if a file is opened with O_APPEND, pwrite() appends data to the end of the file, regardless of the value of offset.
SEE ALSO
lseek(2), read(2), readv(2), write(2)
297 - Linux cli command vmsplice
NAME vmsplice
splice user pages to/from a pipe
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
ssize_t vmsplice(int fd, const struct iovec *iov,
size_t nr_segs, unsigned int flags);
DESCRIPTION
If fd is opened for writing, the vmsplice() system call maps nr_segs ranges of user memory described by iov into a pipe. If fd is opened for reading, the vmsplice() system call fills nr_segs ranges of user memory described by iov from a pipe. The file descriptor fd must refer to a pipe.
The pointer iov points to an array of iovec structures as described in iovec(3type).
The flags argument is a bit mask that is composed by ORing together zero or more of the following values:
SPLICE_F_MOVE
Unused for vmsplice(); see splice(2).
SPLICE_F_NONBLOCK
Do not block on I/O; see splice(2) for further details.
SPLICE_F_MORE
Currently has no effect for vmsplice(), but may be implemented in the future; see splice(2).
SPLICE_F_GIFT
The user pages are a gift to the kernel. The application must never modify this memory afterward; otherwise, the page cache and on-disk data may differ. Gifting pages to the kernel means that a subsequent splice(2) SPLICE_F_MOVE can successfully move the pages; if this flag is not specified, then a subsequent splice(2) SPLICE_F_MOVE must copy the pages. Data must also be properly page aligned, both in memory and length.
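The write direction (user memory into a pipe) can be sketched as follows. This is a minimal illustration with flags set to zero, so none of the flags above are exercised; the function name vmsplice_demo() and the message contents are illustrative.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Returns 0 if a user buffer spliced into a pipe can be read back
   unchanged, -1 otherwise. */
int
vmsplice_demo(void)
{
    char msg[] = "spliced";
    char out[sizeof(msg)] = {0};
    struct iovec iov = { .iov_base = msg, .iov_len = sizeof(msg) };
    int pfd[2];
    int ok;

    if (pipe(pfd) == -1)
        return -1;

    /* pfd[1] is the write end, so the one segment described by iov
       is transferred into the pipe; msg is not modified before the
       data is read back. */
    ok = vmsplice(pfd[1], &iov, 1, 0) == (ssize_t) sizeof(msg)
         && read(pfd[0], out, sizeof(out)) == (ssize_t) sizeof(msg)
         && memcmp(out, msg, sizeof(msg)) == 0;

    close(pfd[0]);
    close(pfd[1]);
    return ok ? 0 : -1;
}
```

The buffer here is far smaller than the pipe capacity, so the call is not expected to block even without SPLICE_F_NONBLOCK.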
RETURN VALUE
Upon successful completion, vmsplice() returns the number of bytes transferred to the pipe. On error, vmsplice() returns -1 and errno is set to indicate the error.
ERRORS
EAGAIN
SPLICE_F_NONBLOCK was specified in flags, and the operation would block.
EBADF
fd is either not valid or doesn’t refer to a pipe.
EINVAL
nr_segs is greater than IOV_MAX; or memory is not aligned when SPLICE_F_GIFT is set.
ENOMEM
Out of memory.
STANDARDS
Linux.
HISTORY
Linux 2.6.17, glibc 2.5.
NOTES
vmsplice() follows the other vectorized read/write type functions when it comes to limitations on the number of segments being passed in. This limit is IOV_MAX as defined in <limits.h>. Currently, this limit is 1024.
vmsplice() really supports true splicing only from user memory to a pipe. In the opposite direction, it actually just copies the data to user space. But this makes the interface nice and symmetric and enables people to build on vmsplice() with room for future improvement in performance.
SEE ALSO
splice(2), tee(2), pipe(7)
298 - Linux cli command socketcall
NAME socketcall
socket system calls
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/net.h> /* Definition of SYS_* constants */
#include <sys/syscall.h> /* Definition of SYS_socketcall */
#include <unistd.h>
int syscall(SYS_socketcall, int call, unsigned long *args);
Note: glibc provides no wrapper for socketcall(), necessitating the use of syscall(2).
DESCRIPTION
socketcall() is a common kernel entry point for the socket system calls. call determines which socket function to invoke. args points to a block containing the actual arguments, which are passed through to the appropriate call.
User programs should call the appropriate functions by their usual names. Only standard library implementors and kernel hackers need to know about socketcall().
call | Man page |
SYS_SOCKET | socket(2) |
SYS_BIND | bind(2) |
SYS_CONNECT | connect(2) |
SYS_LISTEN | listen(2) |
SYS_ACCEPT | accept(2) |
SYS_GETSOCKNAME | getsockname(2) |
SYS_GETPEERNAME | getpeername(2) |
SYS_SOCKETPAIR | socketpair(2) |
SYS_SEND | send(2) |
SYS_RECV | recv(2) |
SYS_SENDTO | sendto(2) |
SYS_RECVFROM | recvfrom(2) |
SYS_SHUTDOWN | shutdown(2) |
SYS_SETSOCKOPT | setsockopt(2) |
SYS_GETSOCKOPT | getsockopt(2) |
SYS_SENDMSG | sendmsg(2) |
SYS_RECVMSG | recvmsg(2) |
SYS_ACCEPT4 | accept4(2) |
SYS_RECVMMSG | recvmmsg(2) |
SYS_SENDMMSG | sendmmsg(2) |
VERSIONS
On some architectures (for example, x86-64 and ARM) there is no socketcall() system call; instead socket(2), accept(2), bind(2), and so on really are implemented as separate system calls.
STANDARDS
Linux.
HISTORY
On x86-32, socketcall() was historically the only entry point for the sockets API. However, starting in Linux 4.3, direct system calls are provided on x86-32 for the sockets API. This facilitates the creation of seccomp(2) filters that filter sockets system calls (for new user-space binaries that are compiled to use the new entry points) and also provides a (very) small performance improvement.
SEE ALSO
accept(2), bind(2), connect(2), getpeername(2), getsockname(2), getsockopt(2), listen(2), recv(2), recvfrom(2), recvmsg(2), send(2), sendmsg(2), sendto(2), setsockopt(2), shutdown(2), socket(2), socketpair(2)
299 - Linux cli command socketpair
NAME socketpair
create a pair of connected sockets
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int socketpair(int domain, int type, int protocol, int sv[2]);
DESCRIPTION
The socketpair() call creates an unnamed pair of connected sockets in the specified domain, of the specified type, and using the optionally specified protocol. For further details of these arguments, see socket(2).
The file descriptors used in referencing the new sockets are returned in sv[0] and sv[1]. The two sockets are indistinguishable.
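Since the two sockets are indistinguishable, data can flow in either direction. The following is a minimal sketch using AF_UNIX stream sockets; the function name socketpair_demo() and the message contents are illustrative, and short reads are treated as failures for simplicity.

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if a message can be passed in both directions across a
   freshly created socket pair, -1 otherwise. */
int
socketpair_demo(void)
{
    int sv[2];
    char buf[4] = {0};
    int ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    ok = write(sv[0], "ping", 4) == 4          /* sv[0] -> sv[1] */
         && read(sv[1], buf, sizeof(buf)) == 4
         && memcmp(buf, "ping", 4) == 0
         && write(sv[1], "pong", 4) == 4       /* sv[1] -> sv[0] */
         && read(sv[0], buf, sizeof(buf)) == 4
         && memcmp(buf, "pong", 4) == 0;

    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

A common pattern is to call fork(2) after socketpair() and let parent and child each keep one of the two descriptors.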
RETURN VALUE
On success, zero is returned. On error, -1 is returned, errno is set to indicate the error, and sv is left unchanged.
On Linux (and other systems), socketpair() does not modify sv on failure. A requirement standardizing this behavior was added in POSIX.1-2008 TC2.
ERRORS
EAFNOSUPPORT
The specified address family is not supported on this machine.
EFAULT
The address sv does not specify a valid part of the process address space.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
EOPNOTSUPP
The specified protocol does not support creation of socket pairs.
EPROTONOSUPPORT
The specified protocol is not supported on this machine.
VERSIONS
On Linux, the only supported domains for this call are AF_UNIX (or synonymously, AF_LOCAL) and AF_TIPC (since Linux 4.12).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.4BSD.
socketpair() first appeared in 4.2BSD. It is generally portable to/from non-BSD systems supporting clones of the BSD socket layer (including System V variants).
Since Linux 2.6.27, socketpair() supports the SOCK_NONBLOCK and SOCK_CLOEXEC flags in the type argument, as described in socket(2).
SEE ALSO
pipe(2), read(2), socket(2), write(2), socket(7), unix(7)
300 - Linux cli command clock_adjtime
NAME clock_adjtime
tune kernel clock
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/timex.h>
int adjtimex(struct timex *buf);
int clock_adjtime(clockid_t clk_id, struct timex *buf);
int ntp_adjtime(struct timex *buf);
DESCRIPTION
Linux uses David L. Mills’ clock adjustment algorithm (see RFC 5905). The system call adjtimex() reads and optionally sets adjustment parameters for this algorithm. It takes a pointer to a timex structure, updates kernel parameters from (selected) field values, and returns the same structure updated with the current kernel values. This structure is declared as follows:
struct timex {
int modes; /* Mode selector */
long offset; /* Time offset; nanoseconds, if STA_NANO
status flag is set, otherwise
microseconds */
long freq; /* Frequency offset; see NOTES for units */
long maxerror; /* Maximum error (microseconds) */
long esterror; /* Estimated error (microseconds) */
int status; /* Clock command/status */
long constant; /* PLL (phase-locked loop) time constant */
long precision; /* Clock precision
(microseconds, read-only) */
long tolerance; /* Clock frequency tolerance (read-only);
see NOTES for units */
struct timeval time;
/* Current time (read-only, except for
ADJ_SETOFFSET); upon return, time.tv_usec
contains nanoseconds, if STA_NANO status
flag is set, otherwise microseconds */
long tick; /* Microseconds between clock ticks */
long ppsfreq; /* PPS (pulse per second) frequency
(read-only); see NOTES for units */
long jitter; /* PPS jitter (read-only); nanoseconds, if
STA_NANO status flag is set, otherwise
microseconds */
int shift; /* PPS interval duration
(seconds, read-only) */
long stabil; /* PPS stability (read-only);
see NOTES for units */
long jitcnt; /* PPS count of jitter limit exceeded
events (read-only) */
long calcnt; /* PPS count of calibration intervals
(read-only) */
long errcnt; /* PPS count of calibration errors
(read-only) */
long stbcnt; /* PPS count of stability limit exceeded
events (read-only) */
int tai; /* TAI offset, as set by previous ADJ_TAI
operation (seconds, read-only,
since Linux 2.6.26) */
/* Further padding bytes to allow for future expansion */
};
The modes field determines which parameters, if any, to set. (As described later in this page, the constants used for ntp_adjtime() are equivalent but differently named.) It is a bit mask containing a bitwise OR combination of zero or more of the following bits:
ADJ_OFFSET
Set time offset from buf.offset. Since Linux 2.6.26, the supplied value is clamped to the range (-0.5s, +0.5s). In older kernels, an EINVAL error occurs if the supplied value is out of range.
ADJ_FREQUENCY
Set frequency offset from buf.freq. Since Linux 2.6.26, the supplied value is clamped to the range (-32768000, +32768000). In older kernels, an EINVAL error occurs if the supplied value is out of range.
ADJ_MAXERROR
Set maximum time error from buf.maxerror.
ADJ_ESTERROR
Set estimated time error from buf.esterror.
ADJ_STATUS
Set clock status bits from buf.status. A description of these bits is provided below.
ADJ_TIMECONST
Set PLL time constant from buf.constant. If the STA_NANO status flag (see below) is clear, the kernel adds 4 to this value.
ADJ_SETOFFSET (since Linux 2.6.39)
Add buf.time to the current time. If buf.modes includes the ADJ_NANO flag, then buf.time.tv_usec is interpreted as a nanosecond value; otherwise it is interpreted as microseconds.
The value of buf.time is the sum of its two fields, but the field buf.time.tv_usec must always be nonnegative. The following example shows how to normalize a timeval with nanosecond resolution.
while (buf.time.tv_usec < 0) {
buf.time.tv_sec -= 1;
buf.time.tv_usec += 1000000000;
}
ADJ_MICRO (since Linux 2.6.26)
Select microsecond resolution.
ADJ_NANO (since Linux 2.6.26)
Select nanosecond resolution. Only one of ADJ_MICRO and ADJ_NANO should be specified.
ADJ_TAI (since Linux 2.6.26)
Set TAI (International Atomic Time) offset from buf.constant.
ADJ_TAI should not be used in conjunction with ADJ_TIMECONST, since the latter mode also employs the buf.constant field.
For a complete explanation of TAI and the difference between TAI and UTC, see BIPM.
ADJ_TICK
Set tick value from buf.tick.
Alternatively, modes can be specified as either of the following (multibit mask) values, in which case other bits should not be specified in modes:
ADJ_OFFSET_SINGLESHOT
Old-fashioned adjtime(3): (gradually) adjust time by value specified in buf.offset, which specifies an adjustment in microseconds.
ADJ_OFFSET_SS_READ (functional since Linux 2.6.28)
Return (in buf.offset) the remaining amount of time to be adjusted after an earlier ADJ_OFFSET_SINGLESHOT operation. This feature was added in Linux 2.6.24, but did not work correctly until Linux 2.6.28.
Ordinary users are restricted to a value of either 0 or ADJ_OFFSET_SS_READ for modes. Only the superuser may set any parameters.
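Preparing a timex structure for ADJ_SETOFFSET from a signed nanosecond offset requires keeping the subsecond field nonnegative. The following is a minimal sketch that only fills in the structure and never calls adjtimex(); the function name fill_setoffset() is hypothetical, and setting ADJ_NANO in the modes field (alongside the other ADJ_* mode bits) is an assumption of this sketch.

```c
#include <string.h>
#include <sys/timex.h>

/* Hypothetical helper: fill *tx so that it requests a one-off step of
   delta_ns nanoseconds (which may be negative). Nothing is applied
   here; the caller would pass tx to adjtimex() with privilege. */
void
fill_setoffset(struct timex *tx, long long delta_ns)
{
    const long long nsec_per_sec = 1000000000LL;
    long long sec = delta_ns / nsec_per_sec;    /* truncates toward zero */
    long long nsec = delta_ns % nsec_per_sec;   /* same sign as delta_ns */

    if (nsec < 0) {     /* borrow one second so nsec is nonnegative */
        sec -= 1;
        nsec += nsec_per_sec;
    }

    memset(tx, 0, sizeof(*tx));
    tx->modes = ADJ_SETOFFSET | ADJ_NANO;   /* assumption: ADJ_NANO in modes */
    tx->time.tv_sec = sec;
    tx->time.tv_usec = nsec;    /* holds nanoseconds in this mode */
}
```

For example, an offset of -1 nanosecond is represented as tv_sec = -1 and tv_usec = 999999999, which sums to the intended value.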
The buf.status field is a bit mask that is used to set and/or retrieve status bits associated with the NTP implementation. Some bits in the mask are both readable and settable, while others are read-only.
STA_PLL (read-write)
Enable phase-locked loop (PLL) updates via ADJ_OFFSET.
STA_PPSFREQ (read-write)
Enable PPS (pulse-per-second) frequency discipline.
STA_PPSTIME (read-write)
Enable PPS time discipline.
STA_FLL (read-write)
Select frequency-locked loop (FLL) mode.
STA_INS (read-write)
Insert a leap second after the last second of the UTC day, thus extending the last minute of the day by one second. Leap-second insertion will occur each day, so long as this flag remains set.
STA_DEL (read-write)
Delete a leap second at the last second of the UTC day. Leap second deletion will occur each day, so long as this flag remains set.
STA_UNSYNC (read-write)
Clock unsynchronized.
STA_FREQHOLD (read-write)
Hold frequency. Normally adjustments made via ADJ_OFFSET result in dampened frequency adjustments also being made. So a single call corrects the current offset, but as offsets in the same direction are made repeatedly, the small frequency adjustments will accumulate to fix the long-term skew.
This flag prevents the small frequency adjustment from being made when correcting for an ADJ_OFFSET value.
STA_PPSSIGNAL (read-only)
A valid PPS (pulse-per-second) signal is present.
STA_PPSJITTER (read-only)
PPS signal jitter exceeded.
STA_PPSWANDER (read-only)
PPS signal wander exceeded.
STA_PPSERROR (read-only)
PPS signal calibration error.
STA_CLOCKERR (read-only)
Clock hardware fault.
STA_NANO (read-only; since Linux 2.6.26)
Resolution (0 = microsecond, 1 = nanoseconds). Set via ADJ_NANO, cleared via ADJ_MICRO.
STA_MODE (since Linux 2.6.26)
Mode (0 = Phase Locked Loop, 1 = Frequency Locked Loop).
STA_CLK (read-only; since Linux 2.6.26)
Clock source (0 = A, 1 = B); currently unused.
Attempts to set read-only status bits are silently ignored.
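The status bits listed above can be decoded into readable names with a small table. This is only an illustrative sketch: describe_status() is a hypothetical helper, and it assumes the STA_* constants are exposed by <sys/timex.h>, as they are with glibc.

```c
#include <stdio.h>
#include <sys/timex.h>

/* Write the names of the status bits set in `status` into `out`,
 * separated by spaces. Returns the number of names written. */
int describe_status(int status, char *out, size_t outlen)
{
    static const struct { int bit; const char *name; } bits[] = {
        { STA_PLL, "STA_PLL" },             { STA_PPSFREQ, "STA_PPSFREQ" },
        { STA_PPSTIME, "STA_PPSTIME" },     { STA_FLL, "STA_FLL" },
        { STA_INS, "STA_INS" },             { STA_DEL, "STA_DEL" },
        { STA_UNSYNC, "STA_UNSYNC" },       { STA_FREQHOLD, "STA_FREQHOLD" },
        { STA_PPSSIGNAL, "STA_PPSSIGNAL" }, { STA_PPSJITTER, "STA_PPSJITTER" },
        { STA_PPSWANDER, "STA_PPSWANDER" }, { STA_PPSERROR, "STA_PPSERROR" },
        { STA_CLOCKERR, "STA_CLOCKERR" },   { STA_NANO, "STA_NANO" },
    };
    size_t used = 0;
    int count = 0;

    if (outlen > 0)
        out[0] = '\0';
    for (size_t i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
        if (!(status & bits[i].bit))
            continue;
        int n = snprintf(out + used, outlen - used, "%s%s",
                         count ? " " : "", bits[i].name);
        if (n < 0 || (size_t)n >= outlen - used) {
            if (outlen > used)
                out[used] = '\0';   /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```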
clock_adjtime()
The clock_adjtime() system call (added in Linux 2.6.39) behaves like adjtimex() but takes an additional clk_id argument to specify the particular clock on which to act.
ntp_adjtime()
The ntp_adjtime() library function (described in the NTP “Kernel Application Program API”, KAPI) is a more portable interface for performing the same task as adjtimex(). Other than the following points, it is identical to adjtimex():
The constants used in modes are prefixed with “MOD_” rather than “ADJ_”, and have the same suffixes (thus, MOD_OFFSET, MOD_FREQUENCY, and so on), other than the exceptions noted in the following points.
MOD_CLKA is the synonym for ADJ_OFFSET_SINGLESHOT.
MOD_CLKB is the synonym for ADJ_TICK.
There is no synonym for ADJ_OFFSET_SS_READ, which is not described in the KAPI.
RETURN VALUE
On success, adjtimex() and ntp_adjtime() return the clock state; that is, one of the following values:
TIME_OK
Clock synchronized, no leap second adjustment pending.
TIME_INS
Indicates that a leap second will be added at the end of the UTC day.
TIME_DEL
Indicates that a leap second will be deleted at the end of the UTC day.
TIME_OOP
Insertion of a leap second is in progress.
TIME_WAIT
A leap-second insertion or deletion has been completed. This value will be returned until the next ADJ_STATUS operation clears the STA_INS and STA_DEL flags.
TIME_ERROR
The system clock is not synchronized to a reliable server. This value is returned when any of the following holds true:
Either STA_UNSYNC or STA_CLOCKERR is set.
STA_PPSSIGNAL is clear and either STA_PPSFREQ or STA_PPSTIME is set.
STA_PPSTIME and STA_PPSJITTER are both set.
STA_PPSFREQ is set and either STA_PPSWANDER or STA_PPSJITTER is set.
The symbolic name TIME_BAD is a synonym for TIME_ERROR, provided for backward compatibility.
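The four conditions listed for TIME_ERROR can be written as a predicate over the status bit mask. The sketch below is a user-space restatement of that list, not the kernel's code; the name status_implies_time_error() is hypothetical.

```c
#include <stdbool.h>
#include <sys/timex.h>

/* Return true if `status` matches any condition listed for TIME_ERROR. */
bool status_implies_time_error(int status)
{
    /* Either STA_UNSYNC or STA_CLOCKERR is set. */
    if (status & (STA_UNSYNC | STA_CLOCKERR))
        return true;
    /* STA_PPSSIGNAL is clear and either STA_PPSFREQ or STA_PPSTIME is set. */
    if (!(status & STA_PPSSIGNAL) && (status & (STA_PPSFREQ | STA_PPSTIME)))
        return true;
    /* STA_PPSTIME and STA_PPSJITTER are both set. */
    if ((status & STA_PPSTIME) && (status & STA_PPSJITTER))
        return true;
    /* STA_PPSFREQ is set and either STA_PPSWANDER or STA_PPSJITTER is set. */
    if ((status & STA_PPSFREQ) && (status & (STA_PPSWANDER | STA_PPSJITTER)))
        return true;
    return false;
}
```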
Note that starting with Linux 3.4, the call operates asynchronously and the return value usually will not reflect a state change caused by the call itself.
On failure, these calls return -1 and set errno to indicate the error.
ERRORS
EFAULT
buf does not point to writable memory.
EINVAL (before Linux 2.6.26)
An attempt was made to set buf.freq to a value outside the range (-33554432, +33554432).
EINVAL (before Linux 2.6.26)
An attempt was made to set buf.offset to a value outside the permitted range. Before Linux 2.0, the permitted range was (-131072, +131072). From Linux 2.0 onwards, the permitted range was (-512000, +512000).
EINVAL
An attempt was made to set buf.status to a value other than those listed above.
EINVAL
The clk_id given to clock_adjtime() is invalid for one of two reasons. Either the System-V style hard-coded positive clock ID value is out of range, or the dynamic clk_id does not refer to a valid instance of a clock object. See clock_gettime(2) for a discussion of dynamic clocks.
EINVAL
An attempt was made to set buf.tick to a value outside the range 900000/HZ to 1100000/HZ, where HZ is the system timer interrupt frequency.
ENODEV
The hot-pluggable device (like USB for example) represented by a dynamic clk_id has disappeared after its character device was opened. See clock_gettime(2) for a discussion of dynamic clocks.
EOPNOTSUPP
The given clk_id does not support adjustment.
EPERM
buf.modes is neither 0 nor ADJ_OFFSET_SS_READ, and the caller does not have sufficient privilege. Under Linux, the CAP_SYS_TIME capability is required.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
ntp_adjtime() | Thread safety | MT-Safe |
STANDARDS
adjtimex()
clock_adjtime()
Linux.
The preferred API for the NTP daemon is ntp_adjtime().
NOTES
In struct timex, freq, ppsfreq, and stabil are ppm (parts per million) with a 16-bit fractional part, which means that a value of 1 in one of those fields actually means 2^-16 ppm, and 2^16=65536 is 1 ppm. This is the case for both input values (in the case of freq) and output values.
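The scaling described above (a field value of 65536 equals 1 ppm) can be illustrated with two conversion helpers. The names and the rounding choice are assumptions made for this sketch.

```c
/* One ppm corresponds to 2^16 units in freq, ppsfreq, and stabil. */
#define TIMEX_PPM_SCALE 65536L

/* Convert a scaled struct timex field value to parts per million. */
double timex_freq_to_ppm(long freq)
{
    return (double)freq / TIMEX_PPM_SCALE;
}

/* Convert parts per million to the scaled representation,
 * rounding to the nearest unit (half away from zero). */
long ppm_to_timex_freq(double ppm)
{
    double scaled = ppm * TIMEX_PPM_SCALE;
    return (long)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}
```

For instance, a freq value of 32768 corresponds to 0.5 ppm.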
The leap-second processing triggered by STA_INS and STA_DEL is done by the kernel in timer context. Thus, it will take one tick into the second for the leap second to be inserted or deleted.
SEE ALSO
clock_gettime(2), clock_settime(2), settimeofday(2), adjtime(3), ntp_gettime(3), capabilities(7), time(7), adjtimex(8), hwclock(8)
NTP “Kernel Application Program Interface”
301 - Linux cli command getegid
NAME
getegid - get group identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
gid_t getgid(void);
gid_t getegid(void);
DESCRIPTION
getgid() returns the real group ID of the calling process.
getegid() returns the effective group ID of the calling process.
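A short sketch of how the two calls are commonly compared: the real and effective group IDs differ, for example, while a set-group-ID program is running. The helper name gid_is_elevated() is hypothetical.

```c
#include <stdbool.h>
#include <unistd.h>

/* True when the real and effective group IDs of the caller differ. */
bool gid_is_elevated(void)
{
    return getgid() != getegid();
}
```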
ERRORS
These functions are always successful and never modify errno.
VERSIONS
On Alpha, instead of a pair of getgid() and getegid() system calls, a single getxgid() system call is provided, which returns a pair of real and effective GIDs. The glibc getgid() and getegid() wrapper functions transparently deal with this. See syscall(2) for details regarding register mapping.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD.
The original Linux getgid() and getegid() system calls supported only 16-bit group IDs. Subsequently, Linux 2.4 added getgid32() and getegid32(), supporting 32-bit IDs. The glibc getgid() and getegid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getresgid(2), setgid(2), setregid(2), credentials(7)
302 - Linux cli command timer_settime
NAME
timer_settime - arm/disarm and fetch state of POSIX per-process timer
LIBRARY
Real-time library (librt, -lrt)
SYNOPSIS
#include <time.h>
int timer_gettime(timer_t timerid, struct itimerspec *curr_value);
int timer_settime(timer_t timerid, int flags,
const struct itimerspec *restrict new_value,
struct itimerspec *_Nullable restrict old_value);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
timer_settime(), timer_gettime():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
timer_settime() arms or disarms the timer identified by timerid. The new_value argument is a pointer to an itimerspec structure that specifies the new initial value and the new interval for the timer. The itimerspec structure is described in itimerspec(3type).
Each of the substructures of the itimerspec structure is a timespec(3) structure that allows a time value to be specified in seconds and nanoseconds. These time values are measured according to the clock that was specified when the timer was created by timer_create(2).
If new_value->it_value specifies a nonzero value (i.e., either subfield is nonzero), then timer_settime() arms (starts) the timer, setting it to initially expire at the given time. (If the timer was already armed, then the previous settings are overwritten.) If new_value->it_value specifies a zero value (i.e., both subfields are zero), then the timer is disarmed.
The new_value->it_interval field specifies the period of the timer, in seconds and nanoseconds. If this field is nonzero, then each time that an armed timer expires, the timer is reloaded from the value specified in new_value->it_interval. If new_value->it_interval specifies a zero value, then the timer expires just once, at the time specified by it_value.
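To make the roles of it_value and it_interval concrete, the sketch below fills an itimerspec from millisecond counts. make_timer_spec() is a hypothetical helper that assumes nonnegative inputs. A zero initial_ms yields a structure that disarms the timer, and a zero interval_ms yields a one-shot timer.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Build an itimerspec: first expiration after initial_ms milliseconds,
 * then one expiration every interval_ms milliseconds (0 = one-shot). */
struct itimerspec make_timer_spec(long initial_ms, long interval_ms)
{
    struct itimerspec its;

    its.it_value.tv_sec = initial_ms / 1000;
    its.it_value.tv_nsec = (initial_ms % 1000) * 1000000L;
    its.it_interval.tv_sec = interval_ms / 1000;
    its.it_interval.tv_nsec = (interval_ms % 1000) * 1000000L;
    return its;
}
```

The result would be passed as new_value, for example timer_settime(timerid, 0, &its, NULL), with timerid obtained from timer_create(2).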
By default, the initial expiration time specified in new_value->it_value is interpreted relative to the current time on the timer’s clock at the time of the call. This can be modified by specifying TIMER_ABSTIME in flags, in which case new_value->it_value is interpreted as an absolute value as measured on the timer’s clock; that is, the timer will expire when the clock value reaches the value specified by new_value->it_value. If the specified absolute time has already passed, then the timer expires immediately, and the overrun count (see timer_getoverrun(2)) will be set correctly.
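For TIMER_ABSTIME, the expiration must be expressed on the timer's own clock. One possible way to compute it, sketched below with hypothetical helper names, is to read that clock and add a delay while keeping tv_nsec below one second.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Add delay_ns (assumed >= 0) to t, keeping tv_nsec in [0, 999999999]. */
struct timespec timespec_add_ns(struct timespec t, long long delay_ns)
{
    t.tv_sec += (time_t)(delay_ns / 1000000000LL);
    t.tv_nsec += (long)(delay_ns % 1000000000LL);
    if (t.tv_nsec >= 1000000000L) {
        t.tv_sec += 1;
        t.tv_nsec -= 1000000000L;
    }
    return t;
}

/* Fill `out` with a one-shot absolute expiration delay_ns from now on
 * clock_id. Returns 0 on success, -1 if clock_gettime() fails. */
int absolute_deadline(clockid_t clock_id, long long delay_ns,
                      struct itimerspec *out)
{
    struct timespec now;

    if (clock_gettime(clock_id, &now) == -1)
        return -1;
    out->it_value = timespec_add_ns(now, delay_ns);
    out->it_interval.tv_sec = 0;
    out->it_interval.tv_nsec = 0;
    return 0;
}
```

The structure would then be passed with the flag set, as in timer_settime(timerid, TIMER_ABSTIME, &its, NULL), for a timer created on the same clock.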
If the value of the CLOCK_REALTIME clock is adjusted while an absolute timer based on that clock is armed, then the expiration of the timer will be appropriately adjusted. Adjustments to the CLOCK_REALTIME clock have no effect on relative timers based on that clock.
If old_value is not NULL, then it points to a buffer that is used to return the previous interval of the timer (in old_value->it_interval) and the amount of time until the timer would previously have next expired (in old_value->it_value).
timer_gettime() returns the time until next expiration, and the interval, for the timer specified by timerid, in the buffer pointed to by curr_value. The time remaining until the next timer expiration is returned in curr_value->it_value; this is always a relative value, regardless of whether the TIMER_ABSTIME flag was used when arming the timer. If the value returned in curr_value->it_value is zero, then the timer is currently disarmed. The timer interval is returned in curr_value->it_interval. If the value returned in curr_value->it_interval is zero, then this is a “one-shot” timer.
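The interpretation rules for the structure returned by timer_gettime() can be captured in a few predicates. The helper names below are hypothetical; they only restate the rules given above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <time.h>

/* A zero it_value means the timer is currently disarmed. */
bool timer_spec_is_disarmed(const struct itimerspec *cur)
{
    return cur->it_value.tv_sec == 0 && cur->it_value.tv_nsec == 0;
}

/* A zero it_interval means the timer is a one-shot timer. */
bool timer_spec_is_one_shot(const struct itimerspec *cur)
{
    return cur->it_interval.tv_sec == 0 && cur->it_interval.tv_nsec == 0;
}

/* Time remaining until the next expiration, in seconds (always relative). */
double timer_spec_remaining_seconds(const struct itimerspec *cur)
{
    return (double)cur->it_value.tv_sec + cur->it_value.tv_nsec / 1e9;
}
```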
RETURN VALUE
On success, timer_settime() and timer_gettime() return 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
These functions may fail with the following errors:
EFAULT
new_value, old_value, or curr_value is not a valid pointer.
EINVAL
timerid is invalid.
timer_settime() may fail with the following errors:
EINVAL
new_value.it_value is negative; or new_value.it_value.tv_nsec is negative or greater than 999,999,999.
STANDARDS
POSIX.1-2008.
HISTORY
Linux 2.6. POSIX.1-2001.
EXAMPLES
See timer_create(2).
SEE ALSO
timer_create(2), timer_getoverrun(2), timespec(3), time(7)
303 - Linux cli command chown
NAME
chown - change ownership of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int chown(const char *pathname, uid_t owner, gid_t group);
int fchown(int fd, uid_t owner, gid_t group);
int lchown(const char *pathname, uid_t owner, gid_t group);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int fchownat(int dirfd, const char *pathname,
uid_t owner, gid_t group, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchown(), lchown():
/* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| _XOPEN_SOURCE >= 500
|| /* glibc <= 2.19: */ _BSD_SOURCE
fchownat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These system calls change the owner and group of a file. The chown(), fchown(), and lchown() system calls differ only in how the file is specified:
chown() changes the ownership of the file specified by pathname, which is dereferenced if it is a symbolic link.
fchown() changes the ownership of the file referred to by the open file descriptor fd.
lchown() is like chown(), but does not dereference symbolic links.
Only a privileged process (Linux: one with the CAP_CHOWN capability) may change the owner of a file. The owner of a file may change the group of the file to any group of which that owner is a member. A privileged process (Linux: with CAP_CHOWN) may change the group arbitrarily.
If the owner or group is specified as -1, then that ID is not changed.
When the owner or group of an executable file is changed by an unprivileged user, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel version, and since Linux 2.2.13, root is treated like other users. In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown().
When the owner or group of an executable file is changed (by any user), all capability sets for the file are cleared.
fchownat()
The fchownat() system call operates in exactly the same way as chown(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chown() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chown()).
If pathname is absolute, then dirfd is ignored.
The flags argument is a bit mask created by ORing together 0 or more of the following values:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead operate on the link itself, like lchown(). (By default, fchownat() dereferences symbolic links, like chown().)
See openat(2) for an explanation of the need for fchownat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, errors other than those listed below can be returned.
The more general errors for chown() are listed below.
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EBADF
(fchown()) fd is not a valid open file descriptor.
EBADF
(fchownat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
(fchownat()) Invalid flag specified in flags.
EIO
(fchown()) A low-level I/O error occurred while modifying the inode.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
The file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(fchownat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
The calling process did not have the required permissions (see above) to change owner and/or group.
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The named file resides on a read-only filesystem.
VERSIONS
The 4.4BSD version can be used only by the superuser (that is, ordinary users cannot give away files).
STANDARDS
POSIX.1-2008.
HISTORY
chown()
fchown()
lchown()
4.4BSD, SVr4, POSIX.1-2001.
fchownat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Ownership of new files
When a new file is created (by, for example, open(2) or mkdir(2)), its owner is made the same as the filesystem user ID of the creating process. The group of the file depends on a range of factors, including the type of filesystem, the options used to mount the filesystem, and whether or not the set-group-ID mode bit is enabled on the parent directory. If the filesystem supports the -o grpid (or, synonymously -o bsdgroups) and -o nogrpid (or, synonymously -o sysvgroups) mount(8) options, then the rules are as follows:
If the filesystem is mounted with -o grpid, then the group of a new file is made the same as that of the parent directory.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is disabled on the parent directory, then the group of a new file is made the same as the process’s filesystem GID.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is enabled on the parent directory, then the group of a new file is made the same as that of the parent directory.
As at Linux 4.12, the -o grpid and -o nogrpid mount options are supported by ext2, ext3, ext4, and XFS. Filesystems that don’t support these mount options follow the -o nogrpid rules.
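The three mount-option rules above reduce to a single decision, sketched here as a pure function. new_file_group() and its parameters are hypothetical and only model the rules; they do not query a real filesystem.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Predict the group of a newly created file.
 * mounted_grpid: filesystem mounted with -o grpid (false means -o nogrpid)
 * parent_setgid: set-group-ID bit enabled on the parent directory */
gid_t new_file_group(bool mounted_grpid, bool parent_setgid,
                     gid_t parent_gid, gid_t process_fsgid)
{
    if (mounted_grpid || parent_setgid)
        return parent_gid;      /* inherit the parent directory's group */
    return process_fsgid;       /* use the creating process's filesystem GID */
}
```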
glibc notes
On older kernels where fchownat() is unavailable, the glibc wrapper function falls back to the use of chown() and lchown(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
NFS
The chown() semantics are deliberately violated on NFS filesystems which have UID mapping enabled. Additionally, the semantics of all system calls which access the file contents are violated, because chown() may cause immediate access revocation on already open files. Client-side caching may lead to a delay between the time when ownership has been changed to allow access for a user and the time when the file can actually be accessed by the user on other clients.
Historical details
The original Linux chown(), fchown(), and lchown() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added chown32(), fchown32(), and lchown32(), supporting 32-bit IDs. The glibc chown(), fchown(), and lchown() wrapper functions transparently deal with the variations across kernel versions.
Before Linux 2.1.81 (except 2.1.46), chown() did not follow symbolic links. Since Linux 2.1.81, chown() does follow symbolic links, and there is a new system call lchown() that does not follow symbolic links. Since Linux 2.1.86, this new call (that has the same semantics as the old chown()) has got the same syscall number, and chown() got the newly introduced number.
EXAMPLES
The following program changes the ownership of the file named in its second command-line argument to the value specified in its first command-line argument. The new owner can be specified either as a numeric user ID, or as a username (which is converted to a user ID by using getpwnam(3) to perform a lookup in the system password file).
Program source
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    char *endptr;
    uid_t uid;
    struct passwd *pwd;

    if (argc != 3 || argv[1][0] == '\0') {
        fprintf(stderr, "%s <owner> <file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    uid = strtol(argv[1], &endptr, 10);  /* Allow a numeric string */

    if (*endptr != '\0') {               /* Was not pure numeric string */
        pwd = getpwnam(argv[1]);         /* Try getting UID for username */
        if (pwd == NULL) {
            perror("getpwnam");
            exit(EXIT_FAILURE);
        }
        uid = pwd->pw_uid;
    }

    if (chown(argv[2], uid, -1) == -1) {
        perror("chown");
        exit(EXIT_FAILURE);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
chgrp(1), chown(1), chmod(2), flock(2), path_resolution(7), symlink(7)
304 - Linux cli command ioctl_pipe
NAME
ioctl_pipe - ioctl() operations for the general notification mechanism
SYNOPSIS
#include <linux/watch_queue.h> /* Definition of IOC_WATCH_QUEUE_* */
#include <sys/ioctl.h>
int ioctl(int pipefd[1], IOC_WATCH_QUEUE_SET_SIZE, int size);
int ioctl(int pipefd[1], IOC_WATCH_QUEUE_SET_FILTER,
struct watch_notification_filter *filter);
DESCRIPTION
The following ioctl(2) operations are provided to set up general notification queue parameters. The notification queue is built on the top of a pipe(2) opened with the O_NOTIFICATION_PIPE flag.
IOC_WATCH_QUEUE_SET_SIZE (since Linux 5.8)
Preallocates the pipe buffer memory so that it can fit size notification messages. Currently, size must be between 1 and 512.
IOC_WATCH_QUEUE_SET_FILTER (since Linux 5.8)
A watch queue filter can limit the events that are received. Filters are passed in a struct watch_notification_filter, and each filter is described by a struct watch_notification_type_filter structure.
struct watch_notification_filter {
__u32 nr_filters;
__u32 __reserved;
struct watch_notification_type_filter filters[];
};
struct watch_notification_type_filter {
__u32 type;
__u32 info_filter;
__u32 info_mask;
__u32 subtype_filter[8];
};
SEE ALSO
pipe(2), ioctl(2)
305 - Linux cli command fchown32
NAME
fchown32 - change ownership of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int chown(const char *pathname, uid_t owner, gid_t group);
int fchown(int fd, uid_t owner, gid_t group);
int lchown(const char *pathname, uid_t owner, gid_t group);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int fchownat(int dirfd, const char *pathname,
uid_t owner, gid_t group, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchown(), lchown():
/* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| _XOPEN_SOURCE >= 500
|| /* glibc <= 2.19: */ _BSD_SOURCE
fchownat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These system calls change the owner and group of a file. The chown(), fchown(), and lchown() system calls differ only in how the file is specified:
chown() changes the ownership of the file specified by pathname, which is dereferenced if it is a symbolic link.
fchown() changes the ownership of the file referred to by the open file descriptor fd.
lchown() is like chown(), but does not dereference symbolic links.
Only a privileged process (Linux: one with the CAP_CHOWN capability) may change the owner of a file. The owner of a file may change the group of the file to any group of which that owner is a member. A privileged process (Linux: with CAP_CHOWN) may change the group arbitrarily.
If the owner or group is specified as -1, then that ID is not changed.
When the owner or group of an executable file is changed by an unprivileged user, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel version, and since Linux 2.2.13, root is treated like other users. In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown().
When the owner or group of an executable file is changed (by any user), all capability sets for the file are cleared.
fchownat()
The fchownat() system call operates in exactly the same way as chown(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chown() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chown()).
If pathname is absolute, then dirfd is ignored.
The flags argument is a bit mask created by ORing together 0 or more of the following values:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead operate on the link itself, like lchown(). (By default, fchownat() dereferences symbolic links, like chown().)
See openat(2) for an explanation of the need for fchownat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, errors other than those listed below can be returned.
The more general errors for chown() are listed below.
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EBADF
(fchown()) fd is not a valid open file descriptor.
EBADF
(fchownat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
(fchownat()) Invalid flag specified in flags.
EIO
(fchown()) A low-level I/O error occurred while modifying the inode.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
The file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(fchownat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
The calling process did not have the required permissions (see above) to change owner and/or group.
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The named file resides on a read-only filesystem.
VERSIONS
The 4.4BSD version can be used only by the superuser (that is, ordinary users cannot give away files).
STANDARDS
POSIX.1-2008.
HISTORY
chown()
fchown()
lchown()
4.4BSD, SVr4, POSIX.1-2001.
fchownat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Ownership of new files
When a new file is created (by, for example, open(2) or mkdir(2)), its owner is made the same as the filesystem user ID of the creating process. The group of the file depends on a range of factors, including the type of filesystem, the options used to mount the filesystem, and whether or not the set-group-ID mode bit is enabled on the parent directory. If the filesystem supports the -o grpid (or, synonymously -o bsdgroups) and -o nogrpid (or, synonymously -o sysvgroups) mount(8) options, then the rules are as follows:
If the filesystem is mounted with -o grpid, then the group of a new file is made the same as that of the parent directory.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is disabled on the parent directory, then the group of a new file is made the same as the process’s filesystem GID.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is enabled on the parent directory, then the group of a new file is made the same as that of the parent directory.
As at Linux 4.12, the -o grpid and -o nogrpid mount options are supported by ext2, ext3, ext4, and XFS. Filesystems that don’t support these mount options follow the -o nogrpid rules.
glibc notes
On older kernels where fchownat() is unavailable, the glibc wrapper function falls back to the use of chown() and lchown(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
NFS
The chown() semantics are deliberately violated on NFS filesystems which have UID mapping enabled. Additionally, the semantics of all system calls which access the file contents are violated, because chown() may cause immediate access revocation on already open files. Client-side caching may lead to a delay between the time when ownership has been changed to allow access for a user and the time when the file can actually be accessed by the user on other clients.
Historical details
The original Linux chown(), fchown(), and lchown() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added chown32(), fchown32(), and lchown32(), supporting 32-bit IDs. The glibc chown(), fchown(), and lchown() wrapper functions transparently deal with the variations across kernel versions.
Before Linux 2.1.81 (except 2.1.46), chown() did not follow symbolic links. Since Linux 2.1.81, chown() does follow symbolic links, and there is a new system call lchown() that does not follow symbolic links. Since Linux 2.1.86, this new call (that has the same semantics as the old chown()) has got the same syscall number, and chown() got the newly introduced number.
EXAMPLES
The following program changes the ownership of the file named in its second command-line argument to the value specified in its first command-line argument. The new owner can be specified either as a numeric user ID, or as a username (which is converted to a user ID by using getpwnam(3) to perform a lookup in the system password file).
Program source
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    char           *endptr;
    uid_t          uid;
    struct passwd  *pwd;

    if (argc != 3 || argv[1][0] == '\0') {
        fprintf(stderr, "%s <owner> <file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    uid = strtol(argv[1], &endptr, 10);  /* Allow a numeric string */

    if (*endptr != '\0') {         /* Was not pure numeric string */
        pwd = getpwnam(argv[1]);   /* Try getting UID for username */
        if (pwd == NULL) {
            perror("getpwnam");
            exit(EXIT_FAILURE);
        }

        uid = pwd->pw_uid;
    }

    if (chown(argv[2], uid, -1) == -1) {
        perror("chown");
        exit(EXIT_FAILURE);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
chgrp(1), chown(1), chmod(2), flock(2), path_resolution(7), symlink(7)
306 - Linux cli command epoll_create
NAME
epoll_create - open an epoll file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/epoll.h>
int epoll_create(int size);
int epoll_create1(int flags);
DESCRIPTION
epoll_create() creates a new epoll(7) instance. Since Linux 2.6.8, the size argument is ignored, but must be greater than zero; see HISTORY.
epoll_create() returns a file descriptor referring to the new epoll instance. This file descriptor is used for all the subsequent calls to the epoll interface. When no longer required, the file descriptor returned by epoll_create() should be closed by using close(2). When all file descriptors referring to an epoll instance have been closed, the kernel destroys the instance and releases the associated resources for reuse.
epoll_create1()
If flags is 0, then, other than the fact that the obsolete size argument is dropped, epoll_create1() is the same as epoll_create(). The following value can be included in flags to obtain different behavior:
EPOLL_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
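As an illustration of the flag described above, the following minimal sketch creates an epoll instance with EPOLL_CLOEXEC and then reads the descriptor flags back with fcntl(2). The helper names open_epoll_cloexec() and has_cloexec() are hypothetical and not part of any library.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: open an epoll instance whose descriptor is
   closed automatically across execve(2).  Returns the descriptor,
   or -1 with errno set by epoll_create1(). */
int
open_epoll_cloexec(void)
{
    return epoll_create1(EPOLL_CLOEXEC);
}

/* Hypothetical helper: report whether FD_CLOEXEC is set on 'fd'.
   Returns 1 if set, 0 if clear, -1 on error (errno set by fcntl()). */
int
has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

A caller would typically check the result of open_epoll_cloexec() for -1, use the descriptor with epoll_ctl(2) and epoll_wait(2), and finally release it with close(2).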
RETURN VALUE
On success, these system calls return a file descriptor (a nonnegative integer). On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
size is not positive.
EINVAL
(epoll_create1()) Invalid value specified in flags.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOMEM
There was insufficient memory to create the kernel object.
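The EINVAL case for a nonpositive size can be observed with a small wrapper such as the one below. This is only a sketch; epoll_create_checked() is a hypothetical name, and returning a negated errno value is a convention chosen for this example.

```c
#include <errno.h>
#include <sys/epoll.h>

/* Hypothetical wrapper: returns the new epoll file descriptor on
   success, or the negated errno value (for example -EINVAL when
   'size' is not positive) on failure. */
int
epoll_create_checked(int size)
{
    int fd = epoll_create(size);

    return (fd == -1) ? -errno : fd;
}
```

With this wrapper, a call with size 0 is expected to yield -EINVAL, while any positive size should yield a usable descriptor.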
STANDARDS
Linux.
HISTORY
epoll_create()
Linux 2.6, glibc 2.3.2.
epoll_create1()
Linux 2.6.27, glibc 2.9.
In the initial epoll_create() implementation, the size argument informed the kernel of the number of file descriptors that the caller expected to add to the epoll instance. The kernel used this information as a hint for the amount of space to initially allocate in internal data structures describing events. (If necessary, the kernel would allocate more space if the caller’s usage exceeded the hint given in size.) Nowadays, this hint is no longer required (the kernel dynamically sizes the required data structures without needing the hint), but size must still be greater than zero, in order to ensure backward compatibility when new epoll applications are run on older kernels.
Prior to Linux 2.6.29, a /proc/sys/fs/epoll/max_user_instances kernel parameter limited live epolls for each real user ID, and caused epoll_create() to fail with EMFILE on overrun.
SEE ALSO
close(2), epoll_ctl(2), epoll_wait(2), epoll(7)
307 - Linux cli command ioctl_getfsmap
NAME
ioctl_getfsmap - retrieve the physical layout of the filesystem
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/fsmap.h> /* Definition of FS_IOC_GETFSMAP,
FM?_OF_*, and *FMR_OWN_* constants */
#include <sys/ioctl.h>
int ioctl(int fd, FS_IOC_GETFSMAP, struct fsmap_head * arg);
DESCRIPTION
This ioctl(2) operation retrieves physical extent mappings for a filesystem. This information can be used to discover which files are mapped to a physical block, examine free space, or find known bad blocks, among other things.
The sole argument to this operation should be a pointer to a single struct fsmap_head:
struct fsmap {
    __u32 fmr_device;      /* Device ID */
    __u32 fmr_flags;       /* Mapping flags */
    __u64 fmr_physical;    /* Device offset of segment */
    __u64 fmr_owner;       /* Owner ID */
    __u64 fmr_offset;      /* File offset of segment */
    __u64 fmr_length;      /* Length of segment */
    __u64 fmr_reserved[3]; /* Must be zero */
};

struct fsmap_head {
    __u32 fmh_iflags;       /* Control flags */
    __u32 fmh_oflags;       /* Output flags */
    __u32 fmh_count;        /* # of entries in array incl. input */
    __u32 fmh_entries;      /* # of entries filled in (output) */
    __u64 fmh_reserved[6];  /* Must be zero */

    struct fsmap fmh_keys[2];  /* Low and high keys for
                                  the mapping search */
    struct fsmap fmh_recs[];   /* Returned records */
};
The two fmh_keys array elements specify the lowest and highest reverse-mapping key for which the application would like physical mapping information. A reverse mapping key consists of the tuple (device, block, owner, offset). The owner and offset fields are part of the key because some filesystems support sharing physical blocks between multiple files and therefore may return multiple mappings for a given physical block.
Filesystem mappings are copied into the fmh_recs array, which immediately follows the header data.
Fields of struct fsmap_head
The fmh_iflags field is a bit mask passed to the kernel to alter the output. No flags are currently defined, so the caller must set this value to zero.
The fmh_oflags field is a bit mask of flags set by the kernel concerning the returned mappings. If FMH_OF_DEV_T is set, then the fmr_device field represents a dev_t structure containing the major and minor numbers of the block device.
The fmh_count field contains the number of elements in the array being passed to the kernel. If this value is 0, fmh_entries will be set to the number of records that would have been returned had the array been large enough; no mapping information will be returned.
The fmh_entries field contains the number of elements in the fmh_recs array that contain useful information.
The fmh_reserved fields must be set to zero.
Keys
The two key records in fsmap_head.fmh_keys specify the lowest and highest extent records in the keyspace that the caller wants returned. A filesystem that can share blocks between files likely requires the tuple (device, physical, owner, offset, flags) to uniquely index any filesystem mapping record. Classic non-sharing filesystems might be able to identify any record with only (device, physical, flags). For example, if the low key is set to (8:0, 36864, 0, 0, 0), the filesystem will only return records for extents starting at or above 36 KiB on disk. If the high key is set to (8:0, 1048576, 0, 0, 0), only records below 1 MiB will be returned. The format of fmr_device in the keys must match the format of the same field in the output records, as defined below. By convention, the field fsmap_head.fmh_keys[0] must contain the low key and fsmap_head.fmh_keys[1] must contain the high key for the operation.
For convenience, if fmr_length is set in the low key, it will be added to fmr_physical or fmr_offset as appropriate. The caller can take advantage of this subtlety to set up subsequent calls by copying fsmap_head.fmh_recs[fsmap_head.fmh_entries - 1] into the low key. The function fsmap_advance (defined in linux/fsmap.h) provides this functionality.
Fields of struct fsmap
The fmr_device field uniquely identifies the underlying storage device. If the FMH_OF_DEV_T flag is set in the header’s fmh_oflags field, this field contains a dev_t from which major and minor numbers can be extracted. If the flag is not set, this field contains a value that must be unique for each unique storage device.
The fmr_physical field contains the disk address of the extent in bytes.
The fmr_owner field contains the owner of the extent. This is an inode number unless FMR_OF_SPECIAL_OWNER is set in the fmr_flags field, in which case the value is determined by the filesystem. See the section below about owner values for more details.
The fmr_offset field contains the logical address in the mapping record in bytes. This field has no meaning if the FMR_OF_SPECIAL_OWNER or FMR_OF_EXTENT_MAP flags are set in fmr_flags.
The fmr_length field contains the length of the extent in bytes.
The fmr_flags field is a bit mask of extent state flags. The bits are:
FMR_OF_PREALLOC
The extent is allocated but not yet written.
FMR_OF_ATTR_FORK
This extent contains extended attribute data.
FMR_OF_EXTENT_MAP
This extent contains extent map information for the owner.
FMR_OF_SHARED
Parts of this extent may be shared.
FMR_OF_SPECIAL_OWNER
The fmr_owner field contains a special value instead of an inode number.
FMR_OF_LAST
This is the last record in the data set.
The fmr_reserved field will be set to zero.
Owner values
Generally, the value of the fmr_owner field for non-metadata extents should be an inode number. However, filesystems are under no obligation to report inode numbers; they may instead report FMR_OWN_UNKNOWN if the inode number cannot easily be retrieved, if the caller lacks sufficient privilege, if the filesystem does not support stable inode numbers, or for any other reason. If a filesystem wishes to condition the reporting of inode numbers based on process capabilities, it is strongly urged that the CAP_SYS_ADMIN capability be used for this purpose.
The following special owner values are generic to all filesystems:
FMR_OWN_FREE
Free space.
FMR_OWN_UNKNOWN
This extent is in use but its owner is not known or not easily retrieved.
FMR_OWN_METADATA
This extent is filesystem metadata.
XFS can return the following special owner values:
XFS_FMR_OWN_FREE
Free space.
XFS_FMR_OWN_UNKNOWN
This extent is in use but its owner is not known or not easily retrieved.
XFS_FMR_OWN_FS
Static filesystem metadata which exists at a fixed address. These are the AG superblock, the AGF, the AGFL, and the AGI headers.
XFS_FMR_OWN_LOG
The filesystem journal.
XFS_FMR_OWN_AG
Allocation group metadata, such as the free space btrees and the reverse mapping btrees.
XFS_FMR_OWN_INOBT
The inode and free inode btrees.
XFS_FMR_OWN_INODES
Inode records.
XFS_FMR_OWN_REFC
Reference count information.
XFS_FMR_OWN_COW
This extent is being used to stage a copy-on-write.
XFS_FMR_OWN_DEFECTIVE
This extent has been marked defective either by the filesystem or the underlying device.
ext4 can return the following special owner values:
EXT4_FMR_OWN_FREE
Free space.
EXT4_FMR_OWN_UNKNOWN
This extent is in use but its owner is not known or not easily retrieved.
EXT4_FMR_OWN_FS
Static filesystem metadata which exists at a fixed address. This is the superblock and the group descriptors.
EXT4_FMR_OWN_LOG
The filesystem journal.
EXT4_FMR_OWN_INODES
Inode records.
EXT4_FMR_OWN_BLKBM
Block bit map.
EXT4_FMR_OWN_INOBM
Inode bit map.
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The error placed in errno can be one of, but is not limited to, the following:
EBADF
fd is not open for reading.
EBADMSG
The filesystem has detected a checksum error in the metadata.
EFAULT
The pointer passed in was not mapped to a valid memory address.
EINVAL
The array is not long enough, the keys do not point to a valid part of the filesystem, the low key points to a higher point in the filesystem’s physical storage address space than the high key, or a nonzero value was passed in one of the fields that must be zero.
ENOMEM
Insufficient memory to process the operation.
EOPNOTSUPP
The filesystem does not support this operation.
EUCLEAN
The filesystem metadata is corrupt and needs repair.
STANDARDS
Linux.
Not all filesystems support it.
HISTORY
Linux 4.12.
EXAMPLES
See io/fsmap.c in the xfsprogs distribution for a sample program.
SEE ALSO
ioctl(2)
308 - Linux cli command sched_setparam
NAME
sched_setparam - set and get scheduling parameters
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sched.h>
int sched_setparam(pid_t pid, const struct sched_param *param);
int sched_getparam(pid_t pid, struct sched_param *param);
struct sched_param {
    ...
    int sched_priority;
    ...
};
DESCRIPTION
sched_setparam() sets the scheduling parameters associated with the scheduling policy for the thread whose thread ID is specified in pid. If pid is zero, then the parameters of the calling thread are set. The interpretation of the argument param depends on the scheduling policy of the thread identified by pid. See sched(7) for a description of the scheduling policies supported under Linux.
sched_getparam() retrieves the scheduling parameters for the thread identified by pid. If pid is zero, then the parameters of the calling thread are retrieved.
sched_setparam() checks the validity of param for the scheduling policy of the thread. The value param->sched_priority must lie within the range given by sched_get_priority_min(2) and sched_get_priority_max(2).
For a discussion of the privileges and resource limits related to scheduling priority and policy, see sched(7).
POSIX systems on which sched_setparam() and sched_getparam() are available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
RETURN VALUE
On success, sched_setparam() and sched_getparam() return 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
Invalid arguments: param is NULL or pid is negative.
EINVAL
(sched_setparam()) The argument param does not make sense for the current scheduling policy.
EPERM
(sched_setparam()) The caller does not have appropriate privileges (Linux: does not have the CAP_SYS_NICE capability).
ESRCH
The thread whose ID is pid could not be found.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
SEE ALSO
getpriority(2), gettid(2), nice(2), sched_get_priority_max(2), sched_get_priority_min(2), sched_getaffinity(2), sched_getscheduler(2), sched_setaffinity(2), sched_setattr(2), sched_setscheduler(2), setpriority(2), capabilities(7), sched(7)
309 - Linux cli command pivot_root
NAME
pivot_root - change the root mount
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_pivot_root, const char *new_root, const char *put_old);
Note: glibc provides no wrapper for pivot_root(), necessitating the use of syscall(2).
DESCRIPTION
pivot_root() changes the root mount in the mount namespace of the calling process. More precisely, it moves the root mount to the directory put_old and makes new_root the new root mount. The calling process must have the CAP_SYS_ADMIN capability in the user namespace that owns the caller’s mount namespace.
pivot_root() changes the root directory and the current working directory of each process or thread in the same mount namespace to new_root if they point to the old root directory. (See also NOTES.) On the other hand, pivot_root() does not change the caller’s current working directory (unless it is on the old root directory), and thus it should be followed by a chdir("/") call.
The following restrictions apply:
new_root and put_old must be directories.
new_root and put_old must not be on the same mount as the current root.
put_old must be at or underneath new_root; that is, adding some nonnegative number of “/..” suffixes to the pathname pointed to by put_old must yield the same directory as new_root.
new_root must be a path to a mount point, but can’t be "/". A path that is not already a mount point can be converted into one by bind mounting the path onto itself.
The propagation type of the parent mount of new_root and the parent mount of the current root directory must not be MS_SHARED; similarly, if put_old is an existing mount point, its propagation type must not be MS_SHARED. These restrictions ensure that pivot_root() never propagates any changes to another mount namespace.
The current root directory must be a mount point.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
pivot_root() may fail with any of the same errors as stat(2). Additionally, it may fail with the following errors:
EBUSY
new_root or put_old is on the current root mount. (This error covers the pathological case where new_root is "/".)
EINVAL
new_root is not a mount point.
EINVAL
put_old is not at or underneath new_root.
EINVAL
The current root directory is not a mount point (because of an earlier chroot(2)).
EINVAL
The current root is on the rootfs (initial ramfs) mount; see NOTES.
EINVAL
Either the mount point at new_root, or the parent mount of that mount point, has propagation type MS_SHARED.
EINVAL
put_old is a mount point and has the propagation type MS_SHARED.
ENOTDIR
new_root or put_old is not a directory.
EPERM
The calling process does not have the CAP_SYS_ADMIN capability.
STANDARDS
Linux.
HISTORY
Linux 2.3.41.
NOTES
A command-line interface for this system call is provided by pivot_root(8).
pivot_root() allows the caller to switch to a new root filesystem while at the same time placing the old root mount at a location under new_root from where it can subsequently be unmounted. (The fact that it moves all processes that have a root directory or current working directory on the old root directory to the new root frees the old root directory of users, allowing the old root mount to be unmounted more easily.)
One use of pivot_root() is during system startup, when the system mounts a temporary root filesystem (e.g., an initrd(4)), then mounts the real root filesystem, and eventually turns the latter into the root directory of all relevant processes and threads. A modern use is to set up a root filesystem during the creation of a container.
The fact that pivot_root() modifies process root and current working directories in the manner noted in DESCRIPTION is necessary in order to prevent kernel threads from keeping the old root mount busy with their root and current working directories, even if they never access the filesystem in any way.
The rootfs (initial ramfs) cannot be pivot_root()ed. The recommended method of changing the root filesystem in this case is to delete everything in rootfs, overmount rootfs with the new root, attach stdin/stdout/stderr to the new /dev/console, and exec the new init(1). Helper programs for this process exist; see switch_root(8).
pivot_root(".", ".")
new_root and put_old may be the same directory. In particular, the following sequence allows a pivot-root operation without needing to create and remove a temporary directory:
chdir(new_root);
pivot_root(".", ".");
umount2(".", MNT_DETACH);
This sequence succeeds because the pivot_root() call stacks the old root mount point on top of the new root mount point at /. At that point, the calling process’s root directory and current working directory refer to the new root mount point (new_root). During the subsequent umount2() call, resolution of "." starts with new_root and then moves up the list of mounts stacked at /, with the result that the old root mount point is unmounted.
Historical notes
For many years, this manual page carried the following text:
pivot_root() may or may not change the current root and the current working directory of any processes or threads which use the old root directory. The caller of pivot_root() must ensure that processes with root or current working directory at the old root operate correctly in either case. An easy way to ensure this is to change their root and current working directory to new_root before invoking pivot_root().
This text, written before the system call implementation was even finalized in the kernel, was probably intended to warn users at that time that the implementation might change before final release. However, the behavior stated in DESCRIPTION has remained consistent since this system call was first implemented and will not change now.
EXAMPLES
The program below demonstrates the use of pivot_root() inside a mount namespace that is created using clone(2). After pivoting to the root directory named in the program’s first command-line argument, the child created by clone(2) then executes the program named in the remaining command-line arguments.
We demonstrate the program by creating a directory that will serve as the new root filesystem and placing a copy of the (statically linked) busybox(1) executable in that directory.
$ mkdir /tmp/rootfs
$ ls -id /tmp/rootfs # Show inode number of new root directory
319459 /tmp/rootfs
$ cp $(which busybox) /tmp/rootfs
$ PS1='bbsh$ ' sudo ./pivot_root_demo /tmp/rootfs /busybox sh
bbsh$ PATH=/
bbsh$ busybox ln busybox ln
bbsh$ ln busybox echo
bbsh$ ln busybox ls
bbsh$ ls
busybox echo ln ls
bbsh$ ls -id / # Compare with inode number above
319459 /
bbsh$ echo 'hello world'
hello world
Program source
/* pivot_root_demo.c */

#define _GNU_SOURCE
#include <err.h>
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static int
pivot_root(const char *new_root, const char *put_old)
{
    return syscall(SYS_pivot_root, new_root, put_old);
}

#define STACK_SIZE (1024 * 1024)

static int              /* Startup function for cloned child */
child(void *arg)
{
    char        path[PATH_MAX];
    char        **args = arg;
    char        *new_root = args[0];
    const char  *put_old = "/oldrootfs";

    /* Ensure that 'new_root' and its parent mount don't have
       shared propagation (which would cause pivot_root() to
       return an error), and prevent propagation of mount
       events to the initial mount namespace. */

    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        err(EXIT_FAILURE, "mount-MS_PRIVATE");

    /* Ensure that 'new_root' is a mount point. */

    if (mount(new_root, new_root, NULL, MS_BIND, NULL) == -1)
        err(EXIT_FAILURE, "mount-MS_BIND");

    /* Create directory to which old root will be pivoted. */

    snprintf(path, sizeof(path), "%s/%s", new_root, put_old);
    if (mkdir(path, 0777) == -1)
        err(EXIT_FAILURE, "mkdir");

    /* And pivot the root filesystem. */

    if (pivot_root(new_root, path) == -1)
        err(EXIT_FAILURE, "pivot_root");

    /* Switch the current working directory to "/". */

    if (chdir("/") == -1)
        err(EXIT_FAILURE, "chdir");

    /* Unmount old root and remove mount point. */

    if (umount2(put_old, MNT_DETACH) == -1)
        perror("umount2");
    if (rmdir(put_old) == -1)
        perror("rmdir");

    /* Execute the command specified in argv[1]... */

    execv(args[1], &args[1]);
    err(EXIT_FAILURE, "execv");
}

int
main(int argc, char *argv[])
{
    char *stack;

    /* Create a child process in a new mount namespace. */

    stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        err(EXIT_FAILURE, "mmap");

    if (clone(child, stack + STACK_SIZE,
              CLONE_NEWNS | SIGCHLD, &argv[1]) == -1)
        err(EXIT_FAILURE, "clone");

    /* Parent falls through to here; wait for child. */

    if (wait(NULL) == -1)
        err(EXIT_FAILURE, "wait");

    exit(EXIT_SUCCESS);
}
SEE ALSO
chdir(2), chroot(2), mount(2), stat(2), initrd(4), mount_namespaces(7), pivot_root(8), switch_root(8)
310 - Linux cli command set_robust_list
NAME
set_robust_list - get/set list of robust futexes
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/futex.h> /* Definition of struct robust_list_head */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_get_robust_list, int pid,
struct robust_list_head **head_ptr, size_t *len_ptr);
long syscall(SYS_set_robust_list,
struct robust_list_head *head, size_t len);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
These system calls deal with per-thread robust futex lists. These lists are managed in user space: the kernel knows only about the location of the head of the list. A thread can inform the kernel of the location of its robust futex list using set_robust_list(). The address of a thread’s robust futex list can be obtained using get_robust_list().
The purpose of the robust futex list is to ensure that if a thread accidentally fails to unlock a futex before terminating or calling execve(2), another thread that is waiting on that futex is notified that the former owner of the futex has died. This notification consists of two pieces: the FUTEX_OWNER_DIED bit is set in the futex word, and the kernel performs a futex(2) FUTEX_WAKE operation on one of the threads waiting on the futex.
The get_robust_list() system call returns the head of the robust futex list of the thread whose thread ID is specified in pid. If pid is 0, the head of the list for the calling thread is returned. The list head is stored in the location pointed to by head_ptr. The size of the object pointed to by **head_ptr is stored in len_ptr.
Permission to employ get_robust_list() is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check; see ptrace(2).
The set_robust_list() system call requests the kernel to record the head of the list of robust futexes owned by the calling thread. The head argument is the list head to record. The len argument should be sizeof(*head).
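Because glibc provides no wrapper, a program that wants to inspect its own robust list has to go through syscall(2). The following sketch reads the list head of the calling thread; get_own_robust_list() is a hypothetical helper name.

```c
#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: fetch the robust futex list head of the
   calling thread (pid 0).  On success, returns 0, stores the head
   pointer in *head and the size reported by the kernel in *len.
   On failure, returns -1 with errno set. */
int
get_own_robust_list(struct robust_list_head **head, size_t *len)
{
    if (syscall(SYS_get_robust_list, 0, head, len) == -1)
        return -1;
    return 0;
}
```

The size stored in *len is expected to equal sizeof(struct robust_list_head), which is also the value that set_robust_list() requires in its len argument.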
RETURN VALUE
The set_robust_list() and get_robust_list() system calls return zero when the operation is successful, an error code otherwise.
ERRORS
The set_robust_list() system call can fail with the following error:
EINVAL
len does not equal sizeof(struct robust_list_head).
The get_robust_list() system call can fail with the following errors:
EFAULT
The head of the robust futex list can’t be stored at the location head_ptr.
EPERM
The calling process does not have permission to see the robust futex list of the thread with the thread ID pid, and does not have the CAP_SYS_PTRACE capability.
ESRCH
No thread with the thread ID pid could be found.
VERSIONS
These system calls were added in Linux 2.6.17.
NOTES
These system calls are not needed by normal applications.
A thread can have only one robust futex list; therefore applications that wish to use this functionality should use the robust mutexes provided by glibc.
In the initial implementation, a thread waiting on a futex was notified that the owner had died only if the owner terminated. Starting with Linux 2.6.28, notification was extended to include the case where the owner performs an execve(2).
The thread IDs mentioned in the main text are kernel thread IDs of the kind returned by clone(2) and gettid(2).
SEE ALSO
futex(2), pthread_mutexattr_setrobust(3)
Documentation/robust-futexes.txt and Documentation/robust-futex-ABI.txt in the Linux kernel source tree
311 - Linux cli command nice
NAME
nice - change process priority
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int nice(int inc);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
nice():
_XOPEN_SOURCE
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE || _SVID_SOURCE
DESCRIPTION
nice() adds inc to the nice value for the calling thread. (A higher nice value means a lower priority.)
The range of the nice value is +19 (low priority) to -20 (high priority). Attempts to set a nice value outside the range are clamped to the range.
Traditionally, only a privileged process could lower the nice value (i.e., set a higher priority). However, since Linux 2.6.12, an unprivileged process can decrease the nice value of a target process that has a suitable RLIMIT_NICE soft limit; see getrlimit(2) for details.
RETURN VALUE
On success, the new nice value is returned (but see NOTES below). On error, -1 is returned, and errno is set to indicate the error.
A successful call can legitimately return -1. To detect an error, set errno to 0 before the call, and check whether it is nonzero after nice() returns -1.
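The errno-based check described above can be wrapped as in the following sketch, which separates the status from the nice value so that a legitimate result of -1 is not mistaken for a failure. nice_checked() is a hypothetical helper name.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: add 'inc' to the nice value.  Returns 0 on
   success and stores the new nice value in *new_nice; returns -1
   with errno set on failure. */
int
nice_checked(int inc, int *new_nice)
{
    int ret;

    errno = 0;              /* So a stale errno is not misread */
    ret = nice(inc);
    if (ret == -1 && errno != 0)
        return -1;          /* Genuine error */

    *new_nice = ret;        /* May legitimately be -1 */
    return 0;
}
```

Calling nice_checked(0, &value) leaves the nice value unchanged and simply reports it.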
ERRORS
EPERM
The calling process attempted to increase its priority by supplying a negative inc but has insufficient privileges. Under Linux, the CAP_SYS_NICE capability is required. (But see the discussion of the RLIMIT_NICE resource limit in setrlimit(2).)
VERSIONS
C library/kernel differences
POSIX.1 specifies that nice() should return the new nice value. However, the raw Linux system call returns 0 on success. Likewise, the nice() wrapper function provided in glibc 2.2.3 and earlier returns 0 on success.
Since glibc 2.2.4, the nice() wrapper function provided by glibc provides conformance to POSIX.1 by calling getpriority(2) to obtain the new nice value, which is then returned to the caller.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
NOTES
For further details on the nice value, see sched(7).
Note: the addition of the “autogroup” feature in Linux 2.6.38 means that the nice value no longer has its traditional effect in many circumstances. For details, see sched(7).
SEE ALSO
nice(1), renice(1), fork(2), getpriority(2), getrlimit(2), setpriority(2), capabilities(7), sched(7)
312 - Linux cli command clone2
NAME 🖥️ clone2 🖥️
create a child process
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
/* Prototype for the glibc wrapper function */
#define _GNU_SOURCE
#include <sched.h>
int clone(int (*fn)(void *_Nullable), void *stack, int flags,
          void *_Nullable arg, ...  /* pid_t *_Nullable parent_tid,
                                       void *_Nullable tls,
                                       pid_t *_Nullable child_tid */ );
/* For the prototype of the raw clone() system call, see NOTES */
#include <linux/sched.h> /* Definition of struct clone_args */
#include <sched.h> /* Definition of CLONE_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_clone3, struct clone_args *cl_args, size_t size);
Note: glibc provides no wrapper for clone3(), necessitating the use of syscall(2).
DESCRIPTION
These system calls create a new (“child”) process, in a manner similar to fork(2).
By contrast with fork(2), these system calls provide more precise control over what pieces of execution context are shared between the calling process and the child process. For example, using these system calls, the caller can control whether or not the two processes share the virtual address space, the table of file descriptors, and the table of signal handlers. These system calls also allow the new child process to be placed in separate namespaces(7).
Note that in this manual page, “calling process” normally corresponds to “parent process”. But see the descriptions of CLONE_PARENT and CLONE_THREAD below.
This page describes the following interfaces:
The glibc clone() wrapper function and the underlying system call on which it is based. The main text describes the wrapper function; the differences for the raw system call are described toward the end of this page.
The newer clone3() system call.
In the remainder of this page, the terminology “the clone call” is used when noting details that apply to all of these interfaces.
The clone() wrapper function
When the child process is created with the clone() wrapper function, it commences execution by calling the function pointed to by the argument fn. (This differs from fork(2), where execution continues in the child from the point of the fork(2) call.) The arg argument is passed as the argument of the function fn.
When the fn(arg) function returns, the child process terminates. The integer returned by fn is the exit status for the child process. The child process may also terminate explicitly by calling exit(2) or after receiving a fatal signal.
The stack argument specifies the location of the stack used by the child process. Since the child and calling process may share memory, it is not possible for the child process to execute in the same stack as the calling process. The calling process must therefore set up memory space for the child stack and pass a pointer to this space to clone(). Stacks grow downward on all processors that run Linux (except the HP PA processors), so stack usually points to the topmost address of the memory space set up for the child stack. Note that clone() does not provide a means whereby the caller can inform the kernel of the size of the stack area.
The remaining arguments to clone() are discussed below.
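The stack handling described above can be sketched as follows, assuming a downward-growing stack. The helper name run_child_on_new_stack(), the function child_fn(), and the 64 KiB stack size are choices made for this illustration only.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size chosen for this sketch */

/* Child entry point: its return value becomes the child's exit status. */
static int child_fn(void *arg)
{
    return *(int *)arg;
}

/* Hypothetical helper: runs child_fn() in a clone()d child and returns the
   child's exit status, or -1 on failure. */
int run_child_on_new_stack(int status)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Pass the topmost address of the area, since the stack grows down. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE, SIGCHLD, &status);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int wstatus;
    int result = -1;
    if (waitpid(pid, &wstatus, 0) == pid && WIFEXITED(wstatus))
        result = WEXITSTATUS(wstatus);
    free(stack);
    return result;
}
```

Because CLONE_VM is not used here, the child works on its own copy of the parent's memory, so the pointer to status remains valid in the child.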
clone3()
The clone3() system call provides a superset of the functionality of the older clone() interface. It also provides a number of API improvements, including: space for additional flags bits; cleaner separation in the use of various arguments; and the ability to specify the size of the child’s stack area.
As with fork(2), clone3() returns in both the parent and the child. It returns 0 in the child process and returns the PID of the child in the parent.
The cl_args argument of clone3() is a structure of the following form:
struct clone_args {
    u64 flags;        /* Flags bit mask */
    u64 pidfd;        /* Where to store PID file descriptor
                         (int *) */
    u64 child_tid;    /* Where to store child TID,
                         in child's memory (pid_t *) */
    u64 parent_tid;   /* Where to store child TID,
                         in parent's memory (pid_t *) */
    u64 exit_signal;  /* Signal to deliver to parent on
                         child termination */
    u64 stack;        /* Pointer to lowest byte of stack */
    u64 stack_size;   /* Size of stack */
    u64 tls;          /* Location of new TLS */
    u64 set_tid;      /* Pointer to a pid_t array
                         (since Linux 5.5) */
    u64 set_tid_size; /* Number of elements in set_tid
                         (since Linux 5.5) */
    u64 cgroup;       /* File descriptor for target cgroup
                         of child (since Linux 5.7) */
};
The size argument that is supplied to clone3() should be initialized to the size of this structure. (The existence of the size argument permits future extensions to the clone_args structure.)
The stack for the child process is specified via cl_args.stack, which points to the lowest byte of the stack area, and cl_args.stack_size, which specifies the size of the stack in bytes. In the case where the CLONE_VM flag (see below) is specified, a stack must be explicitly allocated and specified. Otherwise, these two fields can be specified as NULL and 0, which causes the child to use the same stack area as the parent (in the child’s own virtual address space).
The remaining fields in the cl_args argument are discussed below.
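A fork-like use of clone3(), with stack and stack_size left as 0 as described above, might look like the following sketch. It assumes kernel headers recent enough to define struct clone_args and SYS_clone3. The helper name clone3_fork_like() is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/sched.h>    /* struct clone_args */
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: creates a child with clone3() that exits with the
   given status. Returns that status as seen by the parent, or -1 on
   failure (for example, errno is ENOSYS on kernels without clone3()). */
int clone3_fork_like(int status)
{
    struct clone_args args;
    memset(&args, 0, sizeof(args));   /* zero every field we do not set */
    args.exit_signal = SIGCHLD;       /* stack and stack_size stay 0 */

    long pid = syscall(SYS_clone3, &args, sizeof(args));
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(status);                /* child: clone3() returned 0 */

    int wstatus;
    if (waitpid((pid_t)pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

Passing sizeof(args) as the size argument matches the structure layout the program was compiled against.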
Equivalence between clone() and clone3() arguments
Unlike the older clone() interface, where arguments are passed individually, in the newer clone3() interface the arguments are packaged into the clone_args structure shown above. This structure allows for a superset of the information passed via the clone() arguments.
The following table shows the equivalence between the arguments of clone() and the fields in the clone_args argument supplied to clone3():
clone()          clone3()         Notes
                 cl_args field
flags & ~0xff    flags            For most flags; details below
parent_tid       pidfd            See CLONE_PIDFD
child_tid        child_tid        See CLONE_CHILD_SETTID
parent_tid       parent_tid       See CLONE_PARENT_SETTID
flags & 0xff     exit_signal
stack            stack
---              stack_size
tls              tls              See CLONE_SETTLS
---              set_tid          See below for details
---              set_tid_size
---              cgroup           See CLONE_INTO_CGROUP
The child termination signal
When the child process terminates, a signal may be sent to the parent. The termination signal is specified in the low byte of flags (clone()) or in cl_args.exit_signal (clone3()). If this signal is specified as anything other than SIGCHLD, then the parent process must specify the __WALL or __WCLONE options when waiting for the child with wait(2). If no signal (i.e., zero) is specified, then the parent process is not signaled when the child terminates.
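The split of the clone() flags argument into a termination signal (low byte) and the remaining flag bits can be illustrated with two small helpers. Their names are assumptions made for this sketch.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>

/* Low byte of the clone() flags argument: the termination signal
   (0 means the parent is not signaled). */
int clone_exit_signal(unsigned long flags)
{
    return (int)(flags & 0xff);
}

/* Remaining bits: the CLONE_* flags proper. */
unsigned long clone_flag_bits(unsigned long flags)
{
    return flags & ~0xffUL;
}
```

For a call that passes CLONE_VM | CLONE_FS | SIGCHLD, the first helper yields SIGCHLD, which corresponds to cl_args.exit_signal in clone3(), and the second yields CLONE_VM | CLONE_FS, which corresponds to cl_args.flags.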
The set_tid array
By default, the kernel chooses the next sequential PID for the new process in each of the PID namespaces where it is present. When creating a process with clone3(), the set_tid array (available since Linux 5.5) can be used to select specific PIDs for the process in some or all of the PID namespaces where it is present. If the PID of the newly created process should be set only for the current PID namespace or in the newly created PID namespace (if flags contains CLONE_NEWPID) then the first element in the set_tid array has to be the desired PID and set_tid_size needs to be 1.
If the PID of the newly created process should have a certain value in multiple PID namespaces, then the set_tid array can have multiple entries. The first entry defines the PID in the most deeply nested PID namespace and each of the following entries contains the PID in the corresponding ancestor PID namespace. The number of PID namespaces in which a PID should be set is defined by set_tid_size which cannot be larger than the number of currently nested PID namespaces.
To create a process with the following PIDs in a PID namespace hierarchy:
PID NS level    Requested PID    Notes
0               31496            Outermost PID namespace
1               42
2               7                Innermost PID namespace
Set the array to:
set_tid[0] = 7;
set_tid[1] = 42;
set_tid[2] = 31496;
set_tid_size = 3;
If only the PIDs in the two innermost PID namespaces need to be specified, set the array to:
set_tid[0] = 7;
set_tid[1] = 42;
set_tid_size = 2;
The PID in the PID namespaces outside the two innermost PID namespaces is selected the same way as any other PID is selected.
The set_tid feature requires CAP_SYS_ADMIN or (since Linux 5.9) CAP_CHECKPOINT_RESTORE in all owning user namespaces of the target PID namespaces.
Callers may only choose a PID greater than 1 in a given PID namespace if an init process (i.e., a process with PID 1) already exists in that namespace. Otherwise the PID entry for this PID namespace must be 1.
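The ordering rule for set_tid (innermost namespace first) is easy to get backwards. The following sketch converts a list written outermost-first, as in the table above, into the layout that set_tid expects. The helper name fill_set_tid() is hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper: pids_by_level[0] is the PID wanted in the outermost
   namespace of interest and pids_by_level[n - 1] the PID wanted in the
   innermost one. Writes the set_tid layout (innermost first) into set_tid
   and returns the value to use for set_tid_size. */
size_t fill_set_tid(const pid_t *pids_by_level, size_t n, pid_t *set_tid)
{
    for (size_t i = 0; i < n; i++)
        set_tid[i] = pids_by_level[n - 1 - i];
    return n;
}
```

The resulting array address would then be stored in cl_args.set_tid (as a 64-bit integer) and the returned count in cl_args.set_tid_size.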
The flags mask
Both clone() and clone3() allow a flags bit mask that modifies their behavior and allows the caller to specify what is shared between the calling process and the child process. This bit mask (the flags argument of clone() or the cl_args.flags field passed to clone3()) is referred to as the flags mask in the remainder of this page.
The flags mask is specified as a bitwise OR of zero or more of the constants listed below. Except as noted below, these flags are available (and have the same effect) in both clone() and clone3().
CLONE_CHILD_CLEARTID (since Linux 2.5.49)
Clear (zero) the child thread ID at the location pointed to by child_tid (clone()) or cl_args.child_tid (clone3()) in child memory when the child exits, and do a wakeup on the futex at that address. The address involved may be changed by the set_tid_address(2) system call. This is used by threading libraries.
CLONE_CHILD_SETTID (since Linux 2.5.49)
Store the child thread ID at the location pointed to by child_tid (clone()) or cl_args.child_tid (clone3()) in the child’s memory. The store operation completes before the clone call returns control to user space in the child process. (Note that the store operation may not have completed before the clone call returns in the parent process, which is relevant if the CLONE_VM flag is also employed.)
CLONE_CLEAR_SIGHAND (since Linux 5.5)
By default, signal dispositions in the child thread are the same as in the parent. If this flag is specified, then all signals that are handled in the parent (and not set to SIG_IGN) are reset to their default dispositions (SIG_DFL) in the child.
Specifying this flag together with CLONE_SIGHAND is nonsensical and disallowed.
CLONE_DETACHED (historical)
For a while (during the Linux 2.5 development series) there was a CLONE_DETACHED flag, which caused the parent not to receive a signal when the child terminated. Ultimately, the effect of this flag was subsumed under the CLONE_THREAD flag and by the time Linux 2.6.0 was released, this flag had no effect. Starting in Linux 2.6.2, the need to give this flag together with CLONE_THREAD disappeared.
This flag is still defined, but it is usually ignored when calling clone(). However, see the description of CLONE_PIDFD for some exceptions.
CLONE_FILES (since Linux 2.0)
If CLONE_FILES is set, the calling process and the child process share the same file descriptor table. Any file descriptor created by the calling process or by the child process is also valid in the other process. Similarly, if one of the processes closes a file descriptor, or changes its associated flags (using the fcntl(2) F_SETFD operation), the other process is also affected. If a process sharing a file descriptor table calls execve(2), its file descriptor table is duplicated (unshared).
If CLONE_FILES is not set, the child process inherits a copy of all file descriptors opened in the calling process at the time of the clone call. Subsequent operations that open or close file descriptors, or change file descriptor flags, performed by either the calling process or the child process do not affect the other process. Note, however, that the duplicated file descriptors in the child refer to the same open file descriptions as the corresponding file descriptors in the calling process, and thus share file offsets and file status flags (see open(2)).
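The difference between a shared and a copied file descriptor table can be observed with a sketch like the following, in which the child closes a descriptor and the parent then checks whether its own descriptor is still valid. All helper names and the stack size are assumptions for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

static int close_fd_in_child(void *arg)
{
    return close(*(int *)arg) == 0 ? 0 : 1;
}

/* Returns 1 if the child's close() also closed the parent's descriptor,
   0 if the parent's descriptor is still open, -1 on failure. */
int child_close_affects_parent(int share_files)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    int flags = (share_files ? CLONE_FILES : 0) | SIGCHLD;
    pid_t pid = clone(close_fd_in_child, stack + STACK_SIZE, flags, &fds[0]);

    int result = -1;
    if (pid != -1 && waitpid(pid, NULL, 0) == pid) {
        /* F_GETFD fails with EBADF once the descriptor is gone. */
        result = (fcntl(fds[0], F_GETFD) == -1 && errno == EBADF) ? 1 : 0;
    }
    if (result != 1)
        close(fds[0]);           /* still open in the parent */
    close(fds[1]);
    free(stack);
    return result;
}
```

With share_files nonzero the function is expected to report that the descriptor vanished in the parent as well; with share_files zero the parent's copy stays open.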
CLONE_FS (since Linux 2.0)
If CLONE_FS is set, the caller and the child process share the same filesystem information. This includes the root of the filesystem, the current working directory, and the umask. Any call to chroot(2), chdir(2), or umask(2) performed by the calling process or the child process also affects the other process.
If CLONE_FS is not set, the child process works on a copy of the filesystem information of the calling process at the time of the clone call. Calls to chroot(2), chdir(2), or umask(2) performed later by one of the processes do not affect the other process.
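As an illustration of shared filesystem information, the sketch below lets the child change the umask and reports the umask the parent sees afterward. The helper names, the stack size, and the particular mask values (022 and 077) are arbitrary choices for this example.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

static int tighten_umask(void *arg)
{
    (void)arg;
    umask(077);
    return 0;
}

/* Returns the parent's umask after the child has called umask(077),
   or -1 if the child could not be created. */
int parent_umask_after_child(int share_fs)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    mode_t original = umask(022);        /* known starting point */
    int flags = (share_fs ? CLONE_FS : 0) | SIGCHLD;
    pid_t pid = clone(tighten_umask, stack + STACK_SIZE, flags, NULL);
    if (pid != -1)
        waitpid(pid, NULL, 0);
    mode_t seen = umask(original);       /* read the mask, restore the old one */

    free(stack);
    return pid == -1 ? -1 : (int)seen;
}
```

With CLONE_FS the parent should observe the child's value; without it the parent keeps the mask it set before the clone call.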
CLONE_INTO_CGROUP (since Linux 5.7)
By default, a child process is placed in the same version 2 cgroup as its parent. The CLONE_INTO_CGROUP flag allows the child process to be created in a different version 2 cgroup. (Note that CLONE_INTO_CGROUP has effect only for version 2 cgroups.)
In order to place the child process in a different cgroup, the caller specifies CLONE_INTO_CGROUP in cl_args.flags and passes a file descriptor that refers to a version 2 cgroup in the cl_args.cgroup field. (This file descriptor can be obtained by opening a cgroup v2 directory using either the O_RDONLY or the O_PATH flag.) Note that all of the usual restrictions (described in cgroups(7)) on placing a process into a version 2 cgroup apply.
Among the possible use cases for CLONE_INTO_CGROUP are the following:
Spawning a process into a cgroup different from the parent’s cgroup makes it possible for a service manager to directly spawn new services into dedicated cgroups. This eliminates the accounting jitter that would be caused if the child process was first created in the same cgroup as the parent and then moved into the target cgroup. Furthermore, spawning the child process directly into a target cgroup is significantly cheaper than moving the child process into the target cgroup after it has been created.
The CLONE_INTO_CGROUP flag also allows the creation of frozen child processes by spawning them into a frozen cgroup. (See cgroups(7) for a description of the freezer controller.)
For threaded applications (or even thread implementations which make use of cgroups to limit individual threads), it is possible to establish a fixed cgroup layout before spawning each thread directly into its target cgroup.
CLONE_IO (since Linux 2.6.25)
If CLONE_IO is set, then the new process shares an I/O context with the calling process. If this flag is not set, then (as with fork(2)) the new process has its own I/O context.
The I/O context is the I/O scope of the disk scheduler (i.e., what the I/O scheduler uses to model scheduling of a process’s I/O). If processes share the same I/O context, they are treated as one by the I/O scheduler. As a consequence, they get to share disk time. For some I/O schedulers, if two processes share an I/O context, they will be allowed to interleave their disk access. If several threads are doing I/O on behalf of the same process (aio_read(3), for instance), they should employ CLONE_IO to get better I/O performance.
If the kernel is not configured with the CONFIG_BLOCK option, this flag is a no-op.
CLONE_NEWCGROUP (since Linux 4.6)
Create the process in a new cgroup namespace. If this flag is not set, then (as with fork(2)) the process is created in the same cgroup namespaces as the calling process.
For further information on cgroup namespaces, see cgroup_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
CLONE_NEWIPC (since Linux 2.6.19)
If CLONE_NEWIPC is set, then create the process in a new IPC namespace. If this flag is not set, then (as with fork(2)), the process is created in the same IPC namespace as the calling process.
For further information on IPC namespaces, see ipc_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWIPC. This flag can’t be specified in conjunction with CLONE_SYSVSEM.
CLONE_NEWNET (since Linux 2.6.24)
(The implementation of this flag was completed only by about Linux 2.6.29.)
If CLONE_NEWNET is set, then create the process in a new network namespace. If this flag is not set, then (as with fork(2)) the process is created in the same network namespace as the calling process.
For further information on network namespaces, see network_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNET.
CLONE_NEWNS (since Linux 2.4.19)
If CLONE_NEWNS is set, the cloned child is started in a new mount namespace, initialized with a copy of the namespace of the parent. If CLONE_NEWNS is not set, the child lives in the same mount namespace as the parent.
For further information on mount namespaces, see namespaces(7) and mount_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNS. It is not permitted to specify both CLONE_NEWNS and CLONE_FS in the same clone call.
CLONE_NEWPID (since Linux 2.6.24)
If CLONE_NEWPID is set, then create the process in a new PID namespace. If this flag is not set, then (as with fork(2)) the process is created in the same PID namespace as the calling process.
For further information on PID namespaces, see namespaces(7) and pid_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWPID. This flag can’t be specified in conjunction with CLONE_THREAD.
CLONE_NEWUSER
(This flag first became meaningful for clone() in Linux 2.6.23, the current clone() semantics were merged in Linux 3.5, and the final pieces to make the user namespaces completely usable were merged in Linux 3.8.)
If CLONE_NEWUSER is set, then create the process in a new user namespace. If this flag is not set, then (as with fork(2)) the process is created in the same user namespace as the calling process.
For further information on user namespaces, see namespaces(7) and user_namespaces(7).
Before Linux 3.8, use of CLONE_NEWUSER required that the caller have three capabilities: CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID. Starting with Linux 3.8, no privileges are needed to create a user namespace.
This flag can’t be specified in conjunction with CLONE_THREAD or CLONE_PARENT. For security reasons, CLONE_NEWUSER cannot be specified in conjunction with CLONE_FS.
CLONE_NEWUTS (since Linux 2.6.19)
If CLONE_NEWUTS is set, then create the process in a new UTS namespace, whose identifiers are initialized by duplicating the identifiers from the UTS namespace of the calling process. If this flag is not set, then (as with fork(2)) the process is created in the same UTS namespace as the calling process.
For further information on UTS namespaces, see uts_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWUTS.
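The isolation provided by a new UTS namespace might be checked as in the following sketch: the child changes the hostname inside its new namespace, and the parent verifies that its own hostname is unchanged. The helper names and the hostname "sketch-child" are assumptions for illustration; without CAP_SYS_ADMIN the clone call is expected to fail with EPERM.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

static int rename_host(void *arg)
{
    const char *name = arg;
    return sethostname(name, strlen(name)) == 0 ? 0 : 1;
}

/* Returns 1 if the child renamed the host in its own UTS namespace while
   the parent's hostname stayed the same, 0 otherwise, and -1 if the child
   could not be created (for example, EPERM without CAP_SYS_ADMIN). */
int uts_namespace_isolates_hostname(void)
{
    static char name[] = "sketch-child";   /* hypothetical hostname */
    char before[256] = "";
    char after[256] = "";

    if (gethostname(before, sizeof(before) - 1) == -1)
        return -1;
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    pid_t pid = clone(rename_host, stack + STACK_SIZE,
                      CLONE_NEWUTS | SIGCHLD, name);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int wstatus;
    int child_ok = waitpid(pid, &wstatus, 0) == pid &&
                   WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 0;
    free(stack);

    if (gethostname(after, sizeof(after) - 1) == -1)
        return -1;
    return child_ok && strcmp(before, after) == 0;
}
```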
CLONE_PARENT (since Linux 2.3.12)
If CLONE_PARENT is set, then the parent of the new child (as returned by getppid(2)) will be the same as that of the calling process.
If CLONE_PARENT is not set, then (as with fork(2)) the child’s parent is the calling process.
Note that it is the parent process, as returned by getppid(2), which is signaled when the child terminates, so that if CLONE_PARENT is set, then the parent of the calling process, rather than the calling process itself, is signaled.
The CLONE_PARENT flag can’t be used in clone calls by the global init process (PID 1 in the initial PID namespace) and init processes in other PID namespaces. This restriction prevents the creation of multi-rooted process trees as well as the creation of unreapable zombies in the initial PID namespace.
CLONE_PARENT_SETTID (since Linux 2.5.49)
Store the child thread ID at the location pointed to by parent_tid (clone()) or cl_args.parent_tid (clone3()) in the parent’s memory. (In Linux 2.5.32-2.5.48 there was a flag CLONE_SETTID that did this.) The store operation completes before the clone call returns control to user space.
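Since the store completes before the clone call returns, the parent can read the location immediately. The sketch below compares the stored value with the return value of clone(). The helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/* Returns 1 if the TID stored through parent_tid equals the value
   returned by clone(), 0 if it does not, -1 on failure. */
int parent_settid_matches(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    pid_t stored = -1;
    /* Variadic arguments, in order: parent_tid, tls, child_tid. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_PARENT_SETTID | SIGCHLD, NULL,
                      &stored, NULL, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int match = (stored == pid);
    waitpid(pid, NULL, 0);
    free(stack);
    return match;
}
```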
CLONE_PID (Linux 2.0 to Linux 2.5.15)
If CLONE_PID is set, the child process is created with the same process ID as the calling process. This is good for hacking the system, but otherwise of not much use. From Linux 2.3.21 onward, this flag could be specified only by the system boot process (PID 0). The flag disappeared completely from the kernel sources in Linux 2.5.16. Subsequently, the kernel silently ignored this bit if it was specified in the flags mask. Much later, the same bit was recycled for use as the CLONE_PIDFD flag.
CLONE_PIDFD (since Linux 5.2)
If this flag is specified, a PID file descriptor referring to the child process is allocated and placed at a specified location in the parent’s memory. The close-on-exec flag is set on this new file descriptor. PID file descriptors can be used for the purposes described in pidfd_open(2).
When using clone3(), the PID file descriptor is placed at the location pointed to by cl_args.pidfd.
When using clone(), the PID file descriptor is placed at the location pointed to by parent_tid. Since the parent_tid argument is used to return the PID file descriptor, CLONE_PIDFD cannot be used with CLONE_PARENT_SETTID when calling clone().
It is currently not possible to use this flag together with CLONE_THREAD. This means that the process identified by the PID file descriptor will always be a thread group leader.
If the obsolete CLONE_DETACHED flag is specified alongside CLONE_PIDFD when calling clone(), an error is returned. An error also results if CLONE_DETACHED is specified when calling clone3(). This error behavior ensures that the bit corresponding to CLONE_DETACHED can be reused for further PID file descriptor features in the future.
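With the clone() wrapper, the PID file descriptor arrives through the parent_tid argument. The following sketch obtains one and checks that the close-on-exec flag is set, as described above. The helper names are assumptions, and the fallback definition of CLONE_PIDFD is provided only for C libraries whose headers predate the flag.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000   /* fallback for older C library headers */
#endif

#define STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/* Returns 1 if a PID file descriptor was returned with close-on-exec set,
   0 if not, and -1 if the clone call itself failed. */
int pidfd_has_cloexec(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    int pidfd = -1;
    /* With CLONE_PIDFD, the parent_tid slot receives the descriptor. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_PIDFD | SIGCHLD, NULL,
                      &pidfd, NULL, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int fdflags = fcntl(pidfd, F_GETFD);
    int ok = (fdflags != -1 && (fdflags & FD_CLOEXEC)) ? 1 : 0;

    waitpid(pid, NULL, 0);
    if (pidfd >= 0)
        close(pidfd);
    free(stack);
    return ok;
}
```

On kernels older than Linux 5.2, where this bit was silently ignored, no descriptor is stored and the function is expected to return 0.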
CLONE_PTRACE (since Linux 2.2)
If CLONE_PTRACE is specified, and the calling process is being traced, then trace the child also (see ptrace(2)).
CLONE_SETTLS (since Linux 2.5.32)
The TLS (Thread Local Storage) descriptor is set to tls.
The interpretation of tls and the resulting effect is architecture dependent. On x86, tls is interpreted as a struct user_desc * (see set_thread_area(2)). On x86-64 it is the new value to be set for the %fs base register (see the ARCH_SET_FS argument to arch_prctl(2)). On architectures with a dedicated TLS register, it is the new value of that register.
Use of this flag requires detailed knowledge and generally it should not be used except in libraries implementing threading.
CLONE_SIGHAND (since Linux 2.0)
If CLONE_SIGHAND is set, the calling process and the child process share the same table of signal handlers. If the calling process or child process calls sigaction(2) to change the behavior associated with a signal, the behavior is changed in the other process as well. However, the calling process and child processes still have distinct signal masks and sets of pending signals. So, one of them may block or unblock signals using sigprocmask(2) without affecting the other process.
If CLONE_SIGHAND is not set, the child process inherits a copy of the signal handlers of the calling process at the time of the clone call. Calls to sigaction(2) performed later by one of the processes have no effect on the other process.
Since Linux 2.6.0, the flags mask must also include CLONE_VM if CLONE_SIGHAND is specified.
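The requirement that CLONE_SIGHAND be accompanied by CLONE_VM can be observed directly: a clone call that violates it should fail without creating a child. A minimal sketch, with hypothetical helper names:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

static int child_fn(void *arg)
{
    (void)arg;
    return 0;
}

/* Attempts a clone call with CLONE_SIGHAND but without CLONE_VM and
   returns the resulting errno value (0 if the call unexpectedly
   succeeded, -1 if no stack could be allocated). */
int sighand_without_vm_errno(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    errno = 0;
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_SIGHAND | SIGCHLD, NULL);
    int err = errno;
    if (pid != -1) {             /* not expected on Linux 2.6.0 and later */
        waitpid(pid, NULL, 0);
        err = 0;
    }
    free(stack);
    return err;
}
```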
CLONE_STOPPED (since Linux 2.6.0)
If CLONE_STOPPED is set, then the child is initially stopped (as though it was sent a SIGSTOP signal), and must be resumed by sending it a SIGCONT signal.
This flag was deprecated from Linux 2.6.25 onward, and was removed altogether in Linux 2.6.38. Since then, the kernel silently ignores it without error. Starting with Linux 4.6, the same bit was reused for the CLONE_NEWCGROUP flag.
CLONE_SYSVSEM (since Linux 2.5.10)
If CLONE_SYSVSEM is set, then the child and the calling process share a single list of System V semaphore adjustment (semadj) values (see semop(2)). In this case, the shared list accumulates semadj values across all processes sharing the list, and semaphore adjustments are performed only when the last process that is sharing the list terminates (or ceases sharing the list using unshare(2)). If this flag is not set, then the child has a separate semadj list that is initially empty.
CLONE_THREAD (since Linux 2.4.0)
If CLONE_THREAD is set, the child is placed in the same thread group as the calling process. To make the remainder of the discussion of CLONE_THREAD more readable, the term “thread” is used to refer to the processes within a thread group.
Thread groups were a feature added in Linux 2.4 to support the POSIX threads notion of a set of threads that share a single PID. Internally, this shared PID is the so-called thread group identifier (TGID) for the thread group. Since Linux 2.4, calls to getpid(2) return the TGID of the caller.
The threads within a group can be distinguished by their (system-wide) unique thread IDs (TID). A new thread’s TID is available as the function result returned to the caller, and a thread can obtain its own TID using gettid(2).
When a clone call is made without specifying CLONE_THREAD, then the resulting thread is placed in a new thread group whose TGID is the same as the thread’s TID. This thread is the leader of the new thread group.
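The relationship between TGID and TID for a child created without CLONE_THREAD can be checked from inside the child, as in this sketch. It uses raw system calls for both identifiers. The helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

/* Exits with status 0 if the child's TGID equals its TID. */
static int is_group_leader(void *arg)
{
    (void)arg;
    long tgid = syscall(SYS_getpid);
    long tid = syscall(SYS_gettid);
    return tgid == tid ? 0 : 1;
}

/* Returns 1 if a child created without CLONE_THREAD leads its own thread
   group, 0 if not, -1 on failure. */
int child_leads_own_thread_group(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    pid_t pid = clone(is_group_leader, stack + STACK_SIZE, SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int wstatus;
    int result = -1;
    if (waitpid(pid, &wstatus, 0) == pid && WIFEXITED(wstatus))
        result = (WEXITSTATUS(wstatus) == 0);
    free(stack);
    return result;
}
```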
A new thread created with CLONE_THREAD has the same parent process as the process that made the clone call (i.e., like CLONE_PARENT), so that calls to getppid(2) return the same value for all of the threads in a thread group. When a CLONE_THREAD thread terminates, the thread that created it is not sent a SIGCHLD (or other termination) signal; nor can the status of such a thread be obtained using wait(2). (The thread is said to be detached.)
After all of the threads in a thread group terminate, the parent process of the thread group is sent a SIGCHLD (or other termination) signal.
If any of the threads in a thread group performs an execve(2), then all threads other than the thread group leader are terminated, and the new program is executed in the thread group leader.
If one of the threads in a thread group creates a child using fork(2), then any thread in the group can wait(2) for that child.
Since Linux 2.5.35, the flags mask must also include CLONE_SIGHAND if CLONE_THREAD is specified (and note that, since Linux 2.6.0, CLONE_SIGHAND also requires CLONE_VM to be included).
Signal dispositions and actions are process-wide: if an unhandled signal is delivered to a thread, then it will affect (terminate, stop, continue, be ignored in) all members of the thread group.
Each thread has its own signal mask, as set by sigprocmask(2).
A signal may be process-directed or thread-directed. A process-directed signal is targeted at a thread group (i.e., a TGID), and is delivered to an arbitrarily selected thread from among those that are not blocking the signal. A signal may be process-directed because it was generated by the kernel for reasons other than a hardware exception, or because it was sent using kill(2) or sigqueue(3). A thread-directed signal is targeted at (i.e., delivered to) a specific thread. A signal may be thread directed because it was sent using tgkill(2) or pthread_sigqueue(3), or because the thread executed a machine language instruction that triggered a hardware exception (e.g., invalid memory access triggering SIGSEGV or a floating-point exception triggering SIGFPE).
A call to sigpending(2) returns a signal set that is the union of the pending process-directed signals and the signals that are pending for the calling thread.
If a process-directed signal is delivered to a thread group, and the thread group has installed a handler for the signal, then the handler is invoked in exactly one, arbitrarily selected member of the thread group that has not blocked the signal. If multiple threads in a group are waiting to accept the same signal using sigwaitinfo(2), the kernel will arbitrarily select one of these threads to receive the signal.
CLONE_UNTRACED (since Linux 2.5.46)
If CLONE_UNTRACED is specified, then a tracing process cannot force CLONE_PTRACE on this child process.
CLONE_VFORK (since Linux 2.2)
If CLONE_VFORK is set, the execution of the calling process is suspended until the child releases its virtual memory resources via a call to execve(2) or _exit(2) (as with vfork(2)).
If CLONE_VFORK is not set, then both the calling process and the child are schedulable after the call, and an application should not rely on execution occurring in any particular order.
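The suspension of the caller under CLONE_VFORK can be illustrated by combining it with CLONE_VM: the child writes to a variable in the shared address space, and the parent reads it immediately after the clone call returns, before waiting for the child. This is a sketch; the helper names and the value 42 are arbitrary.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

static int store_answer(void *arg)
{
    *(volatile int *)arg = 42;   /* visible to the parent via CLONE_VM */
    return 0;
}

/* Returns the value the parent sees right after clone() returns,
   or -1 on failure. */
int value_seen_after_vfork_clone(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    volatile int shared = 0;
    pid_t pid = clone(store_answer, stack + STACK_SIZE,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, (void *)&shared);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    /* The parent resumes only after the child has released its memory,
       so the write has already happened by now. */
    int seen = shared;

    waitpid(pid, NULL, 0);
    free(stack);
    return seen;
}
```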
CLONE_VM (since Linux 2.0)
If CLONE_VM is set, the calling process and the child process run in the same memory space. In particular, memory writes performed by the calling process or by the child process are also visible in the other process. Moreover, any memory mapping or unmapping performed with mmap(2) or munmap(2) by the child or calling process also affects the other process.
If CLONE_VM is not set, the child process runs in a separate copy of the memory space of the calling process at the time of the clone call. Memory writes or file mappings/unmappings performed by one of the processes do not affect the other, as with fork(2).
If the CLONE_VM flag is specified and the CLONE_VFORK flag is not specified, then any alternate signal stack that was established by sigaltstack(2) is cleared in the child process.
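The effect of CLONE_VM on the visibility of memory writes can be compared with the fork(2)-like default using a sketch such as the following. The helper names are assumptions for illustration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

static int set_flag(void *arg)
{
    *(volatile int *)arg = 1;
    return 0;
}

/* Returns 1 if a write made by the child is visible in the parent,
   0 if it is not, -1 on failure. */
int child_write_visible(int share_vm)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    volatile int flag = 0;
    int flags = (share_vm ? CLONE_VM : 0) | SIGCHLD;
    pid_t pid = clone(set_flag, stack + STACK_SIZE, flags, (void *)&flag);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    /* Wait for the child to terminate before inspecting the variable. */
    int seen = -1;
    if (waitpid(pid, NULL, 0) == pid)
        seen = flag;
    free(stack);
    return seen;
}
```

With share_vm nonzero the parent should see the child's write; otherwise the child modified only its own copy of the memory.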
RETURN VALUE
On success, the thread ID of the child process is returned in the caller’s thread of execution. On failure, -1 is returned in the caller’s context, no child process is created, and errno is set to indicate the error.
ERRORS
EACCES (clone3() only)
CLONE_INTO_CGROUP was specified in cl_args.flags, but the restrictions (described in cgroups(7)) on placing the child process into the version 2 cgroup referred to by cl_args.cgroup are not met.
EAGAIN
Too many processes are already running; see fork(2).
EBUSY (clone3() only)
CLONE_INTO_CGROUP was specified in cl_args.flags, but the file descriptor specified in cl_args.cgroup refers to a version 2 cgroup in which a domain controller is enabled.
EEXIST (clone3() only)
One (or more) of the PIDs specified in set_tid already exists in the corresponding PID namespace.
EINVAL
Both CLONE_SIGHAND and CLONE_CLEAR_SIGHAND were specified in the flags mask.
EINVAL
CLONE_SIGHAND was specified in the flags mask, but CLONE_VM was not. (Since Linux 2.6.0.)
EINVAL
CLONE_THREAD was specified in the flags mask, but CLONE_SIGHAND was not. (Since Linux 2.5.35.)
EINVAL
CLONE_THREAD was specified in the flags mask, but the current process previously called unshare(2) with the CLONE_NEWPID flag or used setns(2) to reassociate itself with a PID namespace.
EINVAL
Both CLONE_FS and CLONE_NEWNS were specified in the flags mask.
EINVAL (since Linux 3.9)
Both CLONE_NEWUSER and CLONE_FS were specified in the flags mask.
EINVAL
Both CLONE_NEWIPC and CLONE_SYSVSEM were specified in the flags mask.
EINVAL
CLONE_NEWPID and one (or both) of CLONE_THREAD or CLONE_PARENT were specified in the flags mask.
EINVAL
CLONE_NEWUSER and CLONE_THREAD were specified in the flags mask.
EINVAL (since Linux 2.6.32)
CLONE_PARENT was specified, and the caller is an init process.
EINVAL
Returned by the glibc clone() wrapper function when fn or stack is specified as NULL.
EINVAL
CLONE_NEWIPC was specified in the flags mask, but the kernel was not configured with the CONFIG_SYSVIPC and CONFIG_IPC_NS options.
EINVAL
CLONE_NEWNET was specified in the flags mask, but the kernel was not configured with the CONFIG_NET_NS option.
EINVAL
CLONE_NEWPID was specified in the flags mask, but the kernel was not configured with the CONFIG_PID_NS option.
EINVAL
CLONE_NEWUSER was specified in the flags mask, but the kernel was not configured with the CONFIG_USER_NS option.
EINVAL
CLONE_NEWUTS was specified in the flags mask, but the kernel was not configured with the CONFIG_UTS_NS option.
EINVAL
stack is not aligned to a suitable boundary for this architecture. For example, on aarch64, stack must be a multiple of 16.
EINVAL (clone3() only)
CLONE_DETACHED was specified in the flags mask.
EINVAL (clone() only)
CLONE_PIDFD was specified together with CLONE_DETACHED in the flags mask.
EINVAL
CLONE_PIDFD was specified together with CLONE_THREAD in the flags mask.
EINVAL (clone() only)
CLONE_PIDFD was specified together with CLONE_PARENT_SETTID in the flags mask.
EINVAL (clone3() only)
set_tid_size is greater than the number of nested PID namespaces.
EINVAL (clone3() only)
One of the PIDs specified in set_tid was invalid.
EINVAL (clone3() only)
CLONE_THREAD or CLONE_PARENT was specified in the flags mask, but a signal was specified in exit_signal.
EINVAL (AArch64 only, Linux 4.6 and earlier)
stack was not aligned to a 128-bit boundary.
ENOMEM
Cannot allocate sufficient memory to allocate a task structure for the child, or to copy those parts of the caller’s context that need to be copied.
ENOSPC (since Linux 3.7)
CLONE_NEWPID was specified in the flags mask, but the limit on the nesting depth of PID namespaces would have been exceeded; see pid_namespaces(7).
ENOSPC (since Linux 4.9; beforehand EUSERS)
CLONE_NEWUSER was specified in the flags mask, and the call would cause the limit on the number of nested user namespaces to be exceeded. See user_namespaces(7).
From Linux 3.11 to Linux 4.8, the error diagnosed in this case was EUSERS.
ENOSPC (since Linux 4.9)
One of the values in the flags mask specified the creation of a new user namespace, but doing so would have caused the limit defined by the corresponding file in /proc/sys/user to be exceeded. For further details, see namespaces(7).
EOPNOTSUPP (clone3() only)
CLONE_INTO_CGROUP was specified in cl_args.flags, but the file descriptor specified in cl_args.cgroup refers to a version 2 cgroup that is in the domain invalid state.
EPERM
CLONE_NEWCGROUP, CLONE_NEWIPC, CLONE_NEWNET, CLONE_NEWNS, CLONE_NEWPID, or CLONE_NEWUTS was specified by an unprivileged process (process without CAP_SYS_ADMIN).
EPERM
CLONE_PID was specified by a process other than process 0. (This error occurs only on Linux 2.5.15 and earlier.)
EPERM
CLONE_NEWUSER was specified in the flags mask, but either the effective user ID or the effective group ID of the caller does not have a mapping in the parent namespace (see user_namespaces(7)).
EPERM (since Linux 3.9)
CLONE_NEWUSER was specified in the flags mask and the caller is in a chroot environment (i.e., the caller’s root directory does not match the root directory of the mount namespace in which it resides).
EPERM (clone3() only)
set_tid_size was greater than zero, and the caller lacks the CAP_SYS_ADMIN capability in one or more of the user namespaces that own the corresponding PID namespaces.
ERESTARTNOINTR (since Linux 2.6.17)
System call was interrupted by a signal and will be restarted. (This can be seen only during a trace.)
EUSERS (Linux 3.11 to Linux 4.8)
CLONE_NEWUSER was specified in the flags mask, and the limit on the number of nested user namespaces would be exceeded. See the discussion of the ENOSPC error above.
VERSIONS
The glibc clone() wrapper function makes some changes in the memory pointed to by stack (changes required to set the stack up correctly for the child) before invoking the clone() system call. So, in cases where clone() is used to recursively create children, do not use the buffer employed for the parent’s stack as the stack of the child.
On i386, clone() should not be called through vsyscall, but directly through int $0x80.
C library/kernel differences
The raw clone() system call corresponds more closely to fork(2) in that execution in the child continues from the point of the call. As such, the fn and arg arguments of the clone() wrapper function are omitted.
In contrast to the glibc wrapper, the raw clone() system call accepts NULL as a stack argument (and clone3() likewise allows cl_args.stack to be NULL). In this case, the child uses a duplicate of the parent’s stack. (Copy-on-write semantics ensure that the child gets separate copies of stack pages when either process modifies the stack.) In this case, for correct operation, the CLONE_VM option should not be specified. (If the child shares the parent’s memory because of the use of the CLONE_VM flag, then no copy-on-write duplication occurs and chaos is likely to result.)
The order of the arguments also differs in the raw system call, and there are variations in the arguments across architectures, as detailed in the following paragraphs.
The raw system call interface on x86-64 and some other architectures (including sh, tile, and alpha) is:
long clone(unsigned long flags, void *stack,
int *parent_tid, int *child_tid,
unsigned long tls);
On x86-32, and several other common architectures (including score, ARM, ARM 64, PA-RISC, arc, Power PC, xtensa, and MIPS), the order of the last two arguments is reversed:
long clone(unsigned long flags, void *stack,
int *parent_tid, unsigned long tls,
int *child_tid);
On the cris and s390 architectures, the order of the first two arguments is reversed:
long clone(void *stack, unsigned long flags,
int *parent_tid, int *child_tid,
unsigned long tls);
On the microblaze architecture, an additional argument is supplied:
long clone(unsigned long flags, void *stack,
int stack_size, /* Size of stack */
int *parent_tid, int *child_tid,
unsigned long tls);
blackfin, m68k, and sparc
The argument-passing conventions on blackfin, m68k, and sparc are different from the descriptions above. For details, see the kernel (and glibc) source.
ia64
On ia64, a different interface is used:
int __clone2(int (*fn)(void *),
void *stack_base, size_t stack_size,
int flags, void *arg, ...
/* pid_t *parent_tid, struct user_desc *tls,
pid_t *child_tid */ );
The prototype shown above is for the glibc wrapper function; for the system call itself, the prototype can be described as follows (it is identical to the clone() prototype on microblaze):
long clone2(unsigned long flags, void *stack_base,
int stack_size, /* Size of stack */
int *parent_tid, int *child_tid,
unsigned long tls);
__clone2() operates in the same way as clone(), except that stack_base points to the lowest address of the child’s stack area, and stack_size specifies the size of the stack pointed to by stack_base.
STANDARDS
Linux.
HISTORY
clone3()
Linux 5.3.
Linux 2.4 and earlier
In the Linux 2.4.x series, CLONE_THREAD generally does not make the parent of the new thread the same as the parent of the calling process. However, from Linux 2.4.7 to Linux 2.4.18 the CLONE_THREAD flag implied the CLONE_PARENT flag (as in Linux 2.6.0 and later).
In Linux 2.4 and earlier, clone() does not take arguments parent_tid, tls, and child_tid.
NOTES
One use of these system calls is to implement threads: multiple flows of control in a program that run concurrently in a shared address space.
The kcmp(2) system call can be used to test whether two processes share various resources such as a file descriptor table, System V semaphore undo operations, or a virtual address space.
Handlers registered using pthread_atfork(3) are not executed during a clone call.
BUGS
GNU C library versions 2.3.4 up to and including 2.24 contained a wrapper function for getpid(2) that performed caching of PIDs. This caching relied on support in the glibc wrapper for clone(), but limitations in the implementation meant that the cache was not up to date in some circumstances. In particular, if a signal was delivered to the child immediately after the clone() call, then a call to getpid(2) in a handler for the signal could return the PID of the calling process (“the parent”), if the clone wrapper had not yet had a chance to update the PID cache in the child. (This discussion ignores the case where the child was created using CLONE_THREAD, when getpid(2) should return the same value in the child and in the process that called clone(), since the caller and the child are in the same thread group. The stale-cache problem also does not occur if the flags argument includes CLONE_VM.) To get the truth, it was sometimes necessary to use code such as the following:
#include <syscall.h>
pid_t mypid;
mypid = syscall(SYS_getpid);
Because of the stale-cache problem, as well as other problems noted in getpid(2), the PID caching feature was removed in glibc 2.25.
EXAMPLES
The following program demonstrates the use of clone() to create a child process that executes in a separate UTS namespace. The child changes the hostname in its UTS namespace. Both parent and child then display the system hostname, making it possible to see that the hostname differs in the UTS namespaces of the parent and child. For an example of the use of this program, see setns(2).
Within the sample program, we allocate the memory that is to be used for the child’s stack using mmap(2) rather than malloc(3) for the following reasons:
mmap(2) allocates a block of memory that starts on a page boundary and is a multiple of the page size. This is useful if we want to establish a guard page (a page with protection PROT_NONE) at the end of the stack using mprotect(2).
We can specify the MAP_STACK flag to request a mapping that is suitable for a stack. For the moment, this flag is a no-op on Linux, but it exists and has effect on some other systems, so we should include it for portability.
Program source
#define _GNU_SOURCE
#include <err.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>
static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct utsname uts;

    /* Change hostname in UTS namespace of child. */
    if (sethostname(arg, strlen(arg)) == -1)
        err(EXIT_FAILURE, "sethostname");

    /* Retrieve and display hostname. */
    if (uname(&uts) == -1)
        err(EXIT_FAILURE, "uname");
    printf("uts.nodename in child:  %s\n", uts.nodename);

    /* Keep the namespace open for a while, by sleeping.
       This allows some experimentation--for example, another
       process might join the namespace. */
    sleep(200);

    return 0;           /* Child terminates now */
}

#define STACK_SIZE (1024 * 1024)    /* Stack size for cloned child */

int
main(int argc, char *argv[])
{
    char            *stack;         /* Start of stack buffer */
    char            *stackTop;      /* End of stack buffer */
    pid_t           pid;
    struct utsname  uts;

    if (argc < 2) {
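        fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
        exit(EXIT_SUCCESS);
    }

    /* Allocate memory to be used for the stack of the child. */
    stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        err(EXIT_FAILURE, "mmap");

    stackTop = stack + STACK_SIZE;  /* Assume stack grows downward */

    /* Create child that has its own UTS namespace;
       child commences execution in childFunc(). */
    pid = clone(childFunc, stackTop, CLONE_NEWUTS | SIGCHLD, argv[1]);
    if (pid == -1)
        err(EXIT_FAILURE, "clone");
    printf("clone() returned %jd\n", (intmax_t) pid);

    /* Parent falls through to here */
    sleep(1);           /* Give child time to change its hostname */

    /* Display hostname in parent's UTS namespace. This will be
       different from hostname in child's UTS namespace. */
    if (uname(&uts) == -1)
        err(EXIT_FAILURE, "uname");
    printf("uts.nodename in parent: %s\n", uts.nodename);

    if (waitpid(pid, NULL, 0) == -1)    /* Wait for child */
        err(EXIT_FAILURE, "waitpid");
    printf("child has terminated\n");

    exit(EXIT_SUCCESS);
}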
SEE ALSO
fork(2), futex(2), getpid(2), gettid(2), kcmp(2), mmap(2), pidfd_open(2), set_thread_area(2), set_tid_address(2), setns(2), tkill(2), unshare(2), wait(2), capabilities(7), namespaces(7), pthreads(7)
313 - Linux cli command waitpid
NAME
waitpid - wait for process to change state
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/wait.h>
pid_t wait(int *_Nullable wstatus);
pid_t waitpid(pid_t pid, int *_Nullable wstatus, int options);
int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
/* This is the glibc and POSIX interface; see
NOTES for information on the raw system call. */
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
waitid():
Since glibc 2.26:
_XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200809L
glibc 2.25 and earlier:
_XOPEN_SOURCE
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
All of these system calls are used to wait for state changes in a child of the calling process, and obtain information about the child whose state has changed. A state change is considered to be: the child terminated; the child was stopped by a signal; or the child was resumed by a signal. In the case of a terminated child, performing a wait allows the system to release the resources associated with the child; if a wait is not performed, then the terminated child remains in a “zombie” state (see NOTES below).
If a child has already changed state, then these calls return immediately. Otherwise, they block until either a child changes state or a signal handler interrupts the call (assuming that system calls are not automatically restarted using the SA_RESTART flag of sigaction(2)). In the remainder of this page, a child whose state has changed and which has not yet been waited upon by one of these system calls is termed waitable.
wait() and waitpid()
The wait() system call suspends execution of the calling thread until one of its children terminates. The call wait(&wstatus) is equivalent to:
waitpid(-1, &wstatus, 0);
The waitpid() system call suspends execution of the calling thread until a child specified by pid argument has changed state. By default, waitpid() waits only for terminated children, but this behavior is modifiable via the options argument, as described below.
The value of pid can be:
< -1
meaning wait for any child process whose process group ID is equal to the absolute value of pid.
-1
meaning wait for any child process.
0
meaning wait for any child process whose process group ID is equal to that of the calling process at the time of the call to waitpid().
> 0
meaning wait for the child whose process ID is equal to the value of pid.
The value of options is an OR of zero or more of the following constants:
WNOHANG
return immediately if no child has exited.
WUNTRACED
also return if a child has stopped (but not traced via ptrace(2)). Status for traced children which have stopped is provided even if this option is not specified.
WCONTINUED (since Linux 2.6.10)
also return if a stopped child has been resumed by delivery of SIGCONT.
(For Linux-only options, see below.)
If wstatus is not NULL, wait() and waitpid() store status information in the int to which it points. This integer can be inspected with the following macros (which take the integer itself as an argument, not a pointer to it, as is done in wait() and waitpid()!):
WIFEXITED(wstatus)
returns true if the child terminated normally, that is, by calling exit(3) or _exit(2), or by returning from main().
WEXITSTATUS(wstatus)
returns the exit status of the child. This consists of the least significant 8 bits of the status argument that the child specified in a call to exit(3) or _exit(2) or as the argument for a return statement in main(). This macro should be employed only if WIFEXITED returned true.
WIFSIGNALED(wstatus)
returns true if the child process was terminated by a signal.
WTERMSIG(wstatus)
returns the number of the signal that caused the child process to terminate. This macro should be employed only if WIFSIGNALED returned true.
WCOREDUMP(wstatus)
returns true if the child produced a core dump (see core(5)). This macro should be employed only if WIFSIGNALED returned true.
This macro is not specified in POSIX.1-2001 and is not available on some UNIX implementations (e.g., AIX, SunOS). Therefore, enclose its use inside #ifdef WCOREDUMP … #endif.
WIFSTOPPED(wstatus)
returns true if the child process was stopped by delivery of a signal; this is possible only if the call was done using WUNTRACED or when the child is being traced (see ptrace(2)).
WSTOPSIG(wstatus)
returns the number of the signal which caused the child to stop. This macro should be employed only if WIFSTOPPED returned true.
WIFCONTINUED(wstatus)
(since Linux 2.6.10) returns true if the child process was resumed by delivery of SIGCONT.
waitid()
The waitid() system call (available since Linux 2.6.9) provides more precise control over which child state changes to wait for.
The idtype and id arguments select the child(ren) to wait for, as follows:
idtype == P_PID
Wait for the child whose process ID matches id.
idtype == P_PIDFD (since Linux 5.4)
Wait for the child referred to by the PID file descriptor specified in id. (See pidfd_open(2) for further information on PID file descriptors.)
idtype == P_PGID
Wait for any child whose process group ID matches id. Since Linux 5.4, if id is zero, then wait for any child that is in the same process group as the caller’s process group at the time of the call.
idtype == P_ALL
Wait for any child; id is ignored.
The child state changes to wait for are specified by ORing one or more of the following flags in options:
WEXITED
Wait for children that have terminated.
WSTOPPED
Wait for children that have been stopped by delivery of a signal.
WCONTINUED
Wait for (previously stopped) children that have been resumed by delivery of SIGCONT.
The following flags may additionally be ORed in options:
WNOHANG
As for waitpid().
WNOWAIT
Leave the child in a waitable state; a later wait call can be used to again retrieve the child status information.
Upon successful return, waitid() fills in the following fields of the siginfo_t structure pointed to by infop:
si_pid
The process ID of the child.
si_uid
The real user ID of the child. (This field is not set on most other implementations.)
si_signo
Always set to SIGCHLD.
si_status
Either the exit status of the child, as given to _exit(2) (or exit(3)), or the signal that caused the child to terminate, stop, or continue. The si_code field can be used to determine how to interpret this field.
si_code
Set to one of: CLD_EXITED (child called _exit(2)); CLD_KILLED (child killed by signal); CLD_DUMPED (child killed by signal, and dumped core); CLD_STOPPED (child stopped by signal); CLD_TRAPPED (traced child has trapped); or CLD_CONTINUED (child continued by SIGCONT).
If WNOHANG was specified in options and there were no children in a waitable state, then waitid() returns 0 immediately and the state of the siginfo_t structure pointed to by infop depends on the implementation. To (portably) distinguish this case from that where a child was in a waitable state, zero out the si_pid field before the call and check for a nonzero value in this field after the call returns.
POSIX.1-2008 Technical Corrigendum 1 (2013) adds the requirement that when WNOHANG is specified in options and there were no children in a waitable state, then waitid() should zero out the si_pid and si_signo fields of the structure. On Linux and other implementations that adhere to this requirement, it is not necessary to zero out the si_pid field before calling waitid(). However, not all implementations follow the POSIX.1 specification on this point.
RETURN VALUE
wait(): on success, returns the process ID of the terminated child; on failure, -1 is returned.
waitpid(): on success, returns the process ID of the child whose state has changed; if WNOHANG was specified and one or more child(ren) specified by pid exist, but have not yet changed state, then 0 is returned. On failure, -1 is returned.
waitid(): returns 0 on success or if WNOHANG was specified and no child(ren) specified by id has yet changed state; on failure, -1 is returned.
On failure, each of these calls sets errno to indicate the error.
ERRORS
EAGAIN
The PID file descriptor specified in id is nonblocking and the process that it refers to has not terminated.
ECHILD
(for wait()) The calling process does not have any unwaited-for children.
ECHILD
(for waitpid() or waitid()) The process specified by pid (waitpid()) or idtype and id (waitid()) does not exist or is not a child of the calling process. (This can happen for one’s own child if the action for SIGCHLD is set to SIG_IGN. See also the Linux Notes section about threads.)
EINTR
WNOHANG was not set and an unblocked signal or a SIGCHLD was caught; see signal(7).
EINVAL
The options argument was invalid.
ESRCH
(for wait() or waitpid()) pid is equal to INT_MIN.
VERSIONS
C library/kernel differences
wait() is actually a library function that (in glibc) is implemented as a call to wait4(2).
On some architectures, there is no waitpid() system call; instead, this interface is implemented via a C library wrapper function that calls wait4(2).
The raw waitid() system call takes a fifth argument, of type struct rusage *. If this argument is non-NULL, then it is used to return resource usage information about the child, in the same manner as wait4(2). See getrusage(2) for details.
STANDARDS
POSIX.1-2008.
HISTORY
SVr4, 4.3BSD, POSIX.1-2001.
NOTES
A child that terminates but has not been waited for becomes a “zombie”. The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. If a parent process terminates, then its “zombie” children (if any) are adopted by init(1) (or by the nearest “subreaper” process as defined through the use of the prctl(2) PR_SET_CHILD_SUBREAPER operation); init(1) automatically performs a wait to remove the zombies.
POSIX.1-2001 specifies that if the disposition of SIGCHLD is set to SIG_IGN or the SA_NOCLDWAIT flag is set for SIGCHLD (see sigaction(2)), then children that terminate do not become zombies and a call to wait() or waitpid() will block until all children have terminated, and then fail with errno set to ECHILD. (The original POSIX standard left the behavior of setting SIGCHLD to SIG_IGN unspecified. Note that even though the default disposition of SIGCHLD is “ignore”, explicitly setting the disposition to SIG_IGN results in different treatment of zombie process children.)
Linux 2.6 conforms to the POSIX requirements. However, Linux 2.4 (and earlier) does not: if a wait() or waitpid() call is made while SIGCHLD is being ignored, the call behaves just as though SIGCHLD were not being ignored, that is, the call blocks until the next child terminates and then returns the process ID and status of that child.
Linux notes
In the Linux kernel, a kernel-scheduled thread is not a distinct construct from a process. Instead, a thread is simply a process that is created using the Linux-unique clone(2) system call; other routines such as the portable pthread_create(3) call are implemented using clone(2). Before Linux 2.4, a thread was just a special case of a process, and as a consequence one thread could not wait on the children of another thread, even when the latter belongs to the same thread group. However, POSIX prescribes such functionality, and since Linux 2.4 a thread can, and by default will, wait on children of other threads in the same thread group.
The following Linux-specific options are for use with children created using clone(2); they can also, since Linux 4.7, be used with waitid():
__WCLONE
Wait for “clone” children only. If omitted, then wait for “non-clone” children only. (A “clone” child is one which delivers no signal, or a signal other than SIGCHLD to its parent upon termination.) This option is ignored if __WALL is also specified.
__WALL (since Linux 2.4)
Wait for all children, regardless of type (“clone” or “non-clone”).
__WNOTHREAD (since Linux 2.4)
Do not wait for children of other threads in the same thread group. This was the default before Linux 2.4.
Since Linux 4.7, the __WALL flag is automatically implied if the child is being ptraced.
BUGS
According to POSIX.1-2008, an application calling waitid() must ensure that infop points to a siginfo_t structure (i.e., that it is a non-null pointer). On Linux, if infop is NULL, waitid() succeeds, and returns the process ID of the waited-for child. Applications should avoid relying on this inconsistent, nonstandard, and unnecessary feature.
EXAMPLES
The following program demonstrates the use of fork(2) and waitpid(). The program creates a child process. If no command-line argument is supplied to the program, then the child suspends its execution using pause(2), to allow the user to send signals to the child. Otherwise, if a command-line argument is supplied, then the child exits immediately, using the integer supplied on the command line as the exit status. The parent process executes a loop that monitors the child using waitpid(), and uses the W*() macros described above to analyze the wait status value.
The following shell session demonstrates the use of the program:
$ ./a.out &
Child PID is 32360
[1] 32359
$ kill -STOP 32360
stopped by signal 19
$ kill -CONT 32360
continued
$ kill -TERM 32360
killed by signal 15
[1]+ Done ./a.out
$
Program source
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    int    wstatus;
    pid_t  cpid, w;

    cpid = fork();
    if (cpid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }

    if (cpid == 0) {            /* Code executed by child */
        printf("Child PID is %jd\n", (intmax_t) getpid());
        if (argc == 1)
            pause();                    /* Wait for signals */
        _exit(atoi(argv[1]));

    } else {                    /* Code executed by parent */
        do {
            w = waitpid(cpid, &wstatus, WUNTRACED | WCONTINUED);
            if (w == -1) {
                perror("waitpid");
                exit(EXIT_FAILURE);
            }

            if (WIFEXITED(wstatus)) {
                printf("exited, status=%d\n", WEXITSTATUS(wstatus));
            } else if (WIFSIGNALED(wstatus)) {
                printf("killed by signal %d\n", WTERMSIG(wstatus));
            } else if (WIFSTOPPED(wstatus)) {
                printf("stopped by signal %d\n", WSTOPSIG(wstatus));
            } else if (WIFCONTINUED(wstatus)) {
                printf("continued\n");
            }
        } while (!WIFEXITED(wstatus) && !WIFSIGNALED(wstatus));
        exit(EXIT_SUCCESS);
    }
}
SEE ALSO
_exit(2), clone(2), fork(2), kill(2), ptrace(2), sigaction(2), signal(2), wait4(2), pthread_create(3), core(5), credentials(7), signal(7)
314 - Linux cli command kcmp
NAME
kcmp - compare two processes to determine if they share a kernel resource
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/kcmp.h> /* Definition of KCMP_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_kcmp, pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
Note: glibc provides no wrapper for kcmp(), necessitating the use of syscall(2).
DESCRIPTION
The kcmp() system call can be used to check whether the two processes identified by pid1 and pid2 share a kernel resource such as virtual memory, file descriptors, and so on.
Permission to employ kcmp() is governed by ptrace access mode PTRACE_MODE_READ_REALCREDS checks against both pid1 and pid2; see ptrace(2).
The type argument specifies which resource is to be compared in the two processes. It has one of the following values:
KCMP_FILE
Check whether a file descriptor idx1 in the process pid1 refers to the same open file description (see open(2)) as file descriptor idx2 in the process pid2. Two file descriptors can come to refer to the same open file description as a result of dup(2) (and similar), fork(2), or passing file descriptors via a domain socket (see unix(7)).
KCMP_FILES
Check whether the processes share the same set of open file descriptors. The arguments idx1 and idx2 are ignored. See the discussion of the CLONE_FILES flag in clone(2).
KCMP_FS
Check whether the processes share the same filesystem information (i.e., file mode creation mask, working directory, and filesystem root). The arguments idx1 and idx2 are ignored. See the discussion of the CLONE_FS flag in clone(2).
KCMP_IO
Check whether the processes share I/O context. The arguments idx1 and idx2 are ignored. See the discussion of the CLONE_IO flag in clone(2).
KCMP_SIGHAND
Check whether the processes share the same table of signal dispositions. The arguments idx1 and idx2 are ignored. See the discussion of the CLONE_SIGHAND flag in clone(2).
KCMP_SYSVSEM
Check whether the processes share the same list of System V semaphore undo operations. The arguments idx1 and idx2 are ignored. See the discussion of the CLONE_SYSVSEM flag in clone(2).
KCMP_VM
Check whether the processes share the same address space. The arguments idx1 and idx2 are ignored. See the discussion of the CLONE_VM flag in clone(2).
KCMP_EPOLL_TFD (since Linux 4.13)
Check whether the file descriptor idx1 of the process pid1 is present in the epoll(7) instance described by idx2 of the process pid2. The argument idx2 is a pointer to a structure where the target file is described. This structure has the form:
struct kcmp_epoll_slot {
__u32 efd;
__u32 tfd;
__u64 toff;
};
Within this structure, efd is an epoll file descriptor returned from epoll_create(2), tfd is a target file descriptor number, and toff is a target file offset counted from zero. Several different targets may be registered with the same file descriptor number and setting a specific offset helps to investigate each of them.
Note that kcmp() is not protected against false positives, which may occur if the processes are currently running. To obtain meaningful results, stop the processes by sending SIGSTOP (see signal(7)) before inspecting them with this system call.
RETURN VALUE
The return value of a successful call to kcmp() is simply the result of arithmetic comparison of kernel pointers (when the kernel compares resources, it uses their memory addresses).
The easiest way to explain is to consider an example. Suppose that v1 and v2 are the addresses of appropriate resources, then the return value is one of the following:
0
v1 is equal to v2; in other words, the two processes share the resource.
1
v1 is less than v2.
2
v1 is greater than v2.
3
v1 is not equal to v2, but ordering information is unavailable.
On error, -1 is returned, and errno is set to indicate the error.
kcmp() was designed to return values suitable for sorting. This is particularly handy if one needs to compare a large number of file descriptors.
ERRORS
EBADF
type is KCMP_FILE and idx1 or idx2 is not an open file descriptor.
EFAULT
The epoll slot addressed by idx2 is outside of the user’s address space.
EINVAL
type is invalid.
ENOENT
The target file is not present in the epoll(7) instance.
EPERM
Insufficient permission to inspect process resources. The CAP_SYS_PTRACE capability is required to inspect processes that you do not own. Other ptrace limitations may also apply, such as CONFIG_SECURITY_YAMA, which, when /proc/sys/kernel/yama/ptrace_scope is 2, limits kcmp() to child processes; see ptrace(2).
ESRCH
Process pid1 or pid2 does not exist.
STANDARDS
Linux.
HISTORY
Linux 3.5.
Before Linux 5.12, this system call is available only if the kernel is configured with CONFIG_CHECKPOINT_RESTORE, since the original purpose of the system call was for the checkpoint/restore in user space (CRIU) feature. (The alternative to this system call would have been to expose suitable process information via the proc(5) filesystem; this was deemed to be unsuitable for security reasons.) Since Linux 5.12, this system call is also available if the kernel is configured with CONFIG_KCMP.
NOTES
See clone(2) for some background information on the shared resources referred to on this page.
EXAMPLES
The program below uses kcmp() to test whether pairs of file descriptors refer to the same open file description. The program tests different cases for the file descriptor pairs, as described in the program output. An example run of the program is as follows:
$ ./a.out
Parent PID is 1144
Parent opened file on FD 3
PID of child of fork() is 1145
Compare duplicate FDs from different processes:
kcmp(1145, 1144, KCMP_FILE, 3, 3) ==> same
Child opened file on FD 4
Compare FDs from distinct open()s in same process:
kcmp(1145, 1145, KCMP_FILE, 3, 4) ==> different
Child duplicated FD 3 to create FD 5
Compare duplicated FDs in same process:
kcmp(1145, 1145, KCMP_FILE, 3, 5) ==> same
Program source
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <linux/kcmp.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
static int
kcmp(pid_t pid1, pid_t pid2, int type,
     unsigned long idx1, unsigned long idx2)
{
    return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
}
static void
test_kcmp(char *msg, pid_t pid1, pid_t pid2, int fd_a, int fd_b)
{
    printf("\t%s\n", msg);
    printf("\t\tkcmp(%jd, %jd, KCMP_FILE, %d, %d) ==> %s\n",
           (intmax_t) pid1, (intmax_t) pid2, fd_a, fd_b,
           (kcmp(pid1, pid2, KCMP_FILE, fd_a, fd_b) == 0) ?
                        "same" : "different");
}
int
main(void)
{
    int fd1, fd2, fd3;
    static const char pathname[] = "/tmp/kcmp.test";
    fd1 = open(pathname, O_CREAT | O_RDWR, 0600);
    if (fd1 == -1)
        err(EXIT_FAILURE, "open");
    printf("Parent PID is %jd\n", (intmax_t) getpid());
    printf("Parent opened file on FD %d\n\n", fd1);
    switch (fork()) {
    case -1:
        err(EXIT_FAILURE, "fork");
    case 0:
        printf("PID of child of fork() is %jd\n", (intmax_t) getpid());
        test_kcmp("Compare duplicate FDs from different processes:",
                  getpid(), getppid(), fd1, fd1);
        fd2 = open(pathname, O_CREAT | O_RDWR, 0600);
        if (fd2 == -1)
            err(EXIT_FAILURE, "open");
        printf("Child opened file on FD %d\n", fd2);
        test_kcmp("Compare FDs from distinct open()s in same process:",
                  getpid(), getpid(), fd1, fd2);
        fd3 = dup(fd1);
        if (fd3 == -1)
            err(EXIT_FAILURE, "dup");
        printf("Child duplicated FD %d to create FD %d\n", fd1, fd3);
        test_kcmp("Compare duplicated FDs in same process:",
                  getpid(), getpid(), fd1, fd3);
        break;
    default:
        wait(NULL);
    }
    exit(EXIT_SUCCESS);
}
SEE ALSO
clone(2), unshare(2)
315 - Linux cli command setsockopt
NAME
setsockopt - get and set options on sockets
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int getsockopt(int sockfd, int level, int optname,
void optval[restrict *.optlen],
socklen_t *restrict optlen);
int setsockopt(int sockfd, int level, int optname,
const void optval[.optlen],
socklen_t optlen);
DESCRIPTION
getsockopt() and setsockopt() manipulate options for the socket referred to by the file descriptor sockfd. Options may exist at multiple protocol levels; they are always present at the uppermost socket level.
When manipulating socket options, the level at which the option resides and the name of the option must be specified. To manipulate options at the sockets API level, level is specified as SOL_SOCKET. To manipulate options at any other level the protocol number of the appropriate protocol controlling the option is supplied. For example, to indicate that an option is to be interpreted by the TCP protocol, level should be set to the protocol number of TCP; see getprotoent(3).
The arguments optval and optlen are used to access option values for setsockopt(). For getsockopt(), they identify a buffer in which the value of the requested option is to be returned. For getsockopt(), optlen is a value-result argument, initially containing the size of the buffer pointed to by optval, and modified on return to indicate the actual size of the value returned. If no option value is to be supplied or returned, optval may be NULL.
Optname and any specified options are passed uninterpreted to the appropriate protocol module for interpretation. The include file <sys/socket.h> contains definitions for socket level options, described below. Options at other protocol levels vary in format and name; consult the appropriate entries in section 4 of the manual.
Most socket-level options utilize an int argument for optval. For setsockopt(), the argument should be nonzero to enable a boolean option, or zero if the option is to be disabled.
For a description of the available socket options see socket(7) and the appropriate protocol man pages.
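The calling conventions described above can be sketched as follows. This is a minimal illustration, not a complete program: the helper names set_bool_sockopt() and get_int_sockopt() are hypothetical, and only int-valued options at the SOL_SOCKET level are handled. Note how optlen is passed by value to setsockopt() but as a value-result pointer to getsockopt().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: enable (nonzero) or disable (zero) a boolean
   option at the socket level. */
static int
set_bool_sockopt(int sockfd, int optname, int enable)
{
    int val = enable ? 1 : 0;

    return setsockopt(sockfd, SOL_SOCKET, optname, &val, sizeof(val));
}

/* Hypothetical helper: read an int-valued option at the socket level.
   Returns 0 on success, -1 on error or unexpected value size. */
static int
get_int_sockopt(int sockfd, int optname, int *val)
{
    socklen_t len;

    assert(val != NULL);
    len = sizeof(*val);         /* In: size of the buffer */
    if (getsockopt(sockfd, SOL_SOCKET, optname, val, &len) == -1)
        return -1;

    /* Out: len now holds the size of the value actually returned. */
    return (len == sizeof(*val)) ? 0 : -1;
}
```

For example, set_bool_sockopt(fd, SO_REUSEADDR, 1) followed by get_int_sockopt(fd, SO_REUSEADDR, &v) should leave a nonzero value in v.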
RETURN VALUE
On success, zero is returned for the standard options. On error, -1 is returned, and errno is set to indicate the error.
Netfilter allows the programmer to define custom socket options with associated handlers; for such options, the return value on success is the value returned by the handler.
ERRORS
EBADF
The argument sockfd is not a valid file descriptor.
EFAULT
The address pointed to by optval is not in a valid part of the process address space. For getsockopt(), this error may also be returned if optlen is not in a valid part of the process address space.
EINVAL
optlen invalid in setsockopt(). In some cases this error can also occur for an invalid value in optval (e.g., for the IP_ADD_MEMBERSHIP option described in ip(7)).
ENOPROTOOPT
The option is unknown at the level indicated.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD (first appeared in 4.2BSD).
BUGS
Several of the socket options should be handled at lower levels of the system.
SEE ALSO
ioctl(2), socket(2), getprotoent(3), protocols(5), ip(7), packet(7), socket(7), tcp(7), udp(7), unix(7)
316 - Linux cli command ioctl_fat
NAME
ioctl_fat - manipulating the FAT filesystem
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/msdos_fs.h> /* Definition of [V]FAT_* and
                               ATTR_* constants */
#include <sys/ioctl.h>
int ioctl(int fd, FAT_IOCTL_GET_ATTRIBUTES, uint32_t *attr);
int ioctl(int fd, FAT_IOCTL_SET_ATTRIBUTES, uint32_t *attr);
int ioctl(int fd, FAT_IOCTL_GET_VOLUME_ID, uint32_t *id);
int ioctl(int fd, VFAT_IOCTL_READDIR_BOTH,
struct __fat_dirent entry[2]);
int ioctl(int fd, VFAT_IOCTL_READDIR_SHORT,
struct __fat_dirent entry[2]);
DESCRIPTION
The ioctl(2) system call can be used to read and write metadata of FAT filesystems that are not accessible using other system calls.
Reading and setting file attributes
Files and directories in the FAT filesystem possess an attribute bit mask that can be read with FAT_IOCTL_GET_ATTRIBUTES and written with FAT_IOCTL_SET_ATTRIBUTES.
The fd argument contains a file descriptor for a file or directory. It is sufficient to create the file descriptor by calling open(2) with the O_RDONLY flag.
The attr argument contains a pointer to a bit mask. The bits of the bit mask are:
ATTR_RO
This bit specifies that the file or directory is read-only.
ATTR_HIDDEN
This bit specifies that the file or directory is hidden.
ATTR_SYS
This bit specifies that the file is a system file.
ATTR_VOLUME
This bit specifies that the file is a volume label. This attribute is read-only.
ATTR_DIR
This bit specifies that this is a directory. This attribute is read-only.
ATTR_ARCH
This bit indicates that this file or directory should be archived. It is set when a file is created or modified. It is reset by an archiving system.
The zero value ATTR_NONE can be used to indicate that no attribute bit is set.
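To make the bit mask concrete, the following sketch renders an attribute value as a fixed-width flag string, one character per attribute bit listed above. The helper name fat_attr_str() and the letter scheme are assumptions made for illustration; the kernel does not define any such textual form.

```c
#include <assert.h>
#include <linux/msdos_fs.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write a 6-character string describing 'attr'
   into 'buf' (which must hold at least 7 bytes). Each position shows
   a letter if the corresponding bit is set, or '-' if it is clear. */
static void
fat_attr_str(uint32_t attr, char buf[7])
{
    static const struct {
        uint32_t  bit;
        char      letter;
    } map[] = {
        { ATTR_RO,     'R' },
        { ATTR_HIDDEN, 'H' },
        { ATTR_SYS,    'S' },
        { ATTR_VOLUME, 'V' },
        { ATTR_DIR,    'D' },
        { ATTR_ARCH,   'A' },
    };
    size_t n = sizeof(map) / sizeof(map[0]);

    assert(buf != NULL);
    for (size_t i = 0; i < n; i++)
        buf[i] = (attr & map[i].bit) ? map[i].letter : '-';
    buf[n] = '\0';
}
```

With this scheme, a read-only file that is flagged for archiving (ATTR_RO | ATTR_ARCH) is rendered as "R----A", and ATTR_NONE as "------".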
Reading the volume ID
FAT filesystems are identified by a volume ID. The volume ID can be read with FAT_IOCTL_GET_VOLUME_ID.
The fd argument can be a file descriptor for any file or directory of the filesystem. It is sufficient to create the file descriptor by calling open(2) with the O_RDONLY flag.
The id argument is a pointer to the field that will be filled with the volume ID. Typically the volume ID is displayed to the user as a group of two 16-bit fields:
printf("Volume ID %04x-%04x\n", id >> 16, id & 0xFFFF);
Reading short filenames of a directory
A file or directory on a FAT filesystem always has a short filename consisting of up to 8 capital letters, optionally followed by a period and up to 3 capital letters for the file extension. If the actual filename does not fit into this scheme, it is stored as a long filename of up to 255 UTF-16 characters.
The short filenames in a directory can be read with VFAT_IOCTL_READDIR_SHORT. VFAT_IOCTL_READDIR_BOTH reads both the short and the long filenames.
The fd argument must be a file descriptor for a directory. It is sufficient to create the file descriptor by calling open(2) with the O_RDONLY flag. The file descriptor can be used only once to iterate over the directory entries by calling ioctl(2) repeatedly.
The entry argument is a two-element array of the following structures:
struct __fat_dirent {
    long            d_ino;
    __kernel_off_t  d_off;
    unsigned short  d_reclen;
    char            d_name[256];
};
The first entry in the array is for the short filename. The second entry is for the long filename.
The d_ino and d_off fields are filled only for long filenames. The d_ino field holds the inode number of the directory. The d_off field holds the offset of the file entry in the directory. As these values are not available for short filenames, the user code should simply ignore them.
The field d_reclen contains the length of the filename in the field d_name. To keep backward compatibility, a length of 0 for the short filename signals that the end of the directory has been reached. However, the preferred method for detecting the end of the directory is to test the ioctl(2) return value. If no long filename exists, field d_reclen is set to 0 and d_name is a character string of length 0 for the long filename.
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the error.
For VFAT_IOCTL_READDIR_BOTH and VFAT_IOCTL_READDIR_SHORT a return value of 1 signals that a new directory entry has been read and a return value of 0 signals that the end of the directory has been reached.
ERRORS
ENOENT
This error is returned by VFAT_IOCTL_READDIR_BOTH and VFAT_IOCTL_READDIR_SHORT if the file descriptor fd refers to a removed, but still open directory.
ENOTDIR
This error is returned by VFAT_IOCTL_READDIR_BOTH and VFAT_IOCTL_READDIR_SHORT if the file descriptor fd does not refer to a directory.
ENOTTY
The file descriptor fd does not refer to an object in a FAT filesystem.
For further error values, see ioctl(2).
STANDARDS
Linux.
HISTORY
VFAT_IOCTL_READDIR_BOTH
VFAT_IOCTL_READDIR_SHORT
Linux 2.0.
FAT_IOCTL_GET_ATTRIBUTES
FAT_IOCTL_SET_ATTRIBUTES
Linux 2.6.12.
FAT_IOCTL_GET_VOLUME_ID
Linux 3.11.
EXAMPLES
Toggling the archive flag
The following program demonstrates the usage of ioctl(2) to manipulate file attributes. The program reads and displays the archive attribute of a file. After inverting the value of the attribute, the program reads and displays the attribute again.
The following was recorded when applying the program for the file /mnt/user/foo:
# ./toggle_fat_archive_flag /mnt/user/foo
Archive flag is set
Toggling archive flag
Archive flag is not set
Program source (toggle_fat_archive_flag.c)
#include <fcntl.h>
#include <linux/msdos_fs.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
/*
* Read file attributes of a file on a FAT filesystem.
* Output the state of the archive flag.
*/
static uint32_t
readattr(int fd)
{
    int ret;
    uint32_t attr;
    ret = ioctl(fd, FAT_IOCTL_GET_ATTRIBUTES, &attr);
    if (ret == -1) {
        perror("ioctl");
        exit(EXIT_FAILURE);
    }
    if (attr & ATTR_ARCH)
        printf("Archive flag is set\n");
    else
        printf("Archive flag is not set\n");
    return attr;
}
int
main(int argc, char *argv[])
{
    int fd;
    int ret;
    uint32_t attr;
    if (argc != 2) {
        printf("Usage: %s FILENAME\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }
    /*
     * Read and display the FAT file attributes.
     */
    attr = readattr(fd);
    /*
     * Invert archive attribute.
     */
    printf("Toggling archive flag\n");
    attr ^= ATTR_ARCH;
    /*
     * Write the changed FAT file attributes.
     */
    ret = ioctl(fd, FAT_IOCTL_SET_ATTRIBUTES, &attr);
    if (ret == -1) {
        perror("ioctl");
        exit(EXIT_FAILURE);
    }
    /*
     * Read and display the FAT file attributes.
     */
    readattr(fd);
    close(fd);
    exit(EXIT_SUCCESS);
}
Reading the volume ID
The following program demonstrates the use of ioctl(2) to display the volume ID of a FAT filesystem.
The following output was recorded when applying the program for directory /mnt/user:
$ ./display_fat_volume_id /mnt/user
Volume ID 6443-6241
Program source (display_fat_volume_id.c)
#include <fcntl.h>
#include <linux/msdos_fs.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    int fd;
    int ret;
    uint32_t id;
    if (argc != 2) {
        printf("Usage: %s FILENAME\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }
    /*
     * Read volume ID.
     */
    ret = ioctl(fd, FAT_IOCTL_GET_VOLUME_ID, &id);
    if (ret == -1) {
        perror("ioctl");
        exit(EXIT_FAILURE);
    }
    /*
     * Format the output as two groups of 16 bits each.
     */
    printf("Volume ID %04x-%04x\n", id >> 16, id & 0xFFFF);
    close(fd);
    exit(EXIT_SUCCESS);
}
Listing a directory
The following program demonstrates the use of ioctl(2) to list a directory.
The following was recorded when applying the program to the directory /mnt/user:
$ ./fat_dir /mnt/user
. -> ''
.. -> ''
ALONGF~1.TXT -> 'a long filename.txt'
UPPER.TXT -> ''
LOWER.TXT -> 'lower.txt'
Program source
#include <fcntl.h>
#include <linux/msdos_fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    int fd;
    int ret;
    struct __fat_dirent entry[2];
    if (argc != 2) {
        printf("Usage: %s DIRECTORY\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    /*
     * Open file descriptor for the directory.
     */
    fd = open(argv[1], O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }
    for (;;) {
        /*
         * Read next directory entry.
         */
        ret = ioctl(fd, VFAT_IOCTL_READDIR_BOTH, entry);
        /*
         * If an error occurs, the return value is -1.
         * If the end of the directory list has been reached,
         * the return value is 0.
         * For backward compatibility the end of the directory
         * list is also signaled by d_reclen == 0.
         */
        if (ret < 1)
            break;
        /*
         * Write both the short name and the long name.
         */
        printf("%s -> '%s'\n", entry[0].d_name, entry[1].d_name);
    }
    if (ret == -1) {
        perror("VFAT_IOCTL_READDIR_BOTH");
        exit(EXIT_FAILURE);
    }
    /*
     * Close the file descriptor.
     */
    close(fd);
    exit(EXIT_SUCCESS);
}
SEE ALSO
ioctl(2)
317 - Linux cli command open_by_handle_at
NAME
open_by_handle_at - obtain handle for a pathname and open file via a handle
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h>
int name_to_handle_at(int dirfd, const char *pathname,
struct file_handle *handle,
int *mount_id, int flags);
int open_by_handle_at(int mount_fd, struct file_handle *handle,
int flags);
DESCRIPTION
The name_to_handle_at() and open_by_handle_at() system calls split the functionality of openat(2) into two parts: name_to_handle_at() returns an opaque handle that corresponds to a specified file; open_by_handle_at() opens the file corresponding to a handle returned by a previous call to name_to_handle_at() and returns an open file descriptor.
name_to_handle_at()
The name_to_handle_at() system call returns a file handle and a mount ID corresponding to the file specified by the dirfd and pathname arguments. The file handle is returned via the argument handle, which is a pointer to a structure of the following form:
struct file_handle {
unsigned int handle_bytes; /* Size of f_handle [in, out] */
int handle_type; /* Handle type [out] */
unsigned char f_handle[0]; /* File identifier (sized by
caller) [out] */
};
It is the caller’s responsibility to allocate the structure with a size large enough to hold the handle returned in f_handle. Before the call, the handle_bytes field should be initialized to contain the allocated size for f_handle. (The constant MAX_HANDLE_SZ, defined in <fcntl.h>, specifies the maximum expected size for a file handle. It is not a guaranteed upper limit as future filesystems may require more space.) Upon successful return, the handle_bytes field is updated to contain the number of bytes actually written to f_handle.
The caller can discover the required size for the file_handle structure by making a call in which handle->handle_bytes is zero; in this case, the call fails with the error EOVERFLOW and handle->handle_bytes is set to indicate the required size; the caller can then use this information to allocate a structure of the correct size (see EXAMPLES below). Some care is needed here as EOVERFLOW can also indicate that no file handle is available for this particular name in a filesystem which does normally support file-handle lookup. This case can be detected when the EOVERFLOW error is returned without handle_bytes being increased.
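The size-discovery protocol just described, including the ambiguous EOVERFLOW case, might be wrapped up along the following lines. This is a sketch under stated assumptions: the helper name get_handle() is hypothetical, flags is fixed at 0, and the pathname is interpreted relative to AT_FDCWD.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc()ed file handle for 'pathname',
   or NULL with errno set. The mount ID is stored in *mount_id. */
static struct file_handle *
get_handle(const char *pathname, int *mount_id)
{
    struct file_handle probe = { .handle_bytes = 0 };
    struct file_handle *fhp;
    int rc, saved;

    assert(pathname != NULL && mount_id != NULL);

    /* Probe call with a zero-sized buffer: expected to fail with
       EOVERFLOW and report the required size in handle_bytes. */
    rc = name_to_handle_at(AT_FDCWD, pathname, &probe, mount_id, 0);
    if (rc == -1 && errno != EOVERFLOW)
        return NULL;            /* Some other error; errno is set */

    if (rc != -1 || probe.handle_bytes == 0) {
        /* Either an unexpected success with a zero-sized buffer, or
           EOVERFLOW without a size increase, meaning that no handle
           is available for this name. */
        errno = EOVERFLOW;
        return NULL;
    }

    fhp = malloc(sizeof(*fhp) + probe.handle_bytes);
    if (fhp == NULL)
        return NULL;
    fhp->handle_bytes = probe.handle_bytes;

    /* Second call: buffer is now large enough. */
    if (name_to_handle_at(AT_FDCWD, pathname, fhp, mount_id, 0) == -1) {
        saved = errno;
        free(fhp);
        errno = saved;
        return NULL;
    }
    return fhp;
}
```

The caller owns the returned structure and releases it with free(3). A production version would probably also retry if the required size changes between the two calls.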
Other than the use of the handle_bytes field, the caller should treat the file_handle structure as an opaque data type: the handle_type and f_handle fields can be used in a subsequent call to open_by_handle_at(). The caller can also use the opaque file_handle to compare the identity of filesystem objects that were queried at different times and possibly at different paths. The fanotify(7) subsystem can report events with an information record containing a file_handle to identify the filesystem object.
The flags argument is a bit mask constructed by ORing together zero or more of AT_HANDLE_FID, AT_EMPTY_PATH, and AT_SYMLINK_FOLLOW, described below.
When flags contains the AT_HANDLE_FID flag (since Linux 6.5), the caller indicates that the returned file_handle is needed only to identify the filesystem object, not to open the file later, so a subsequent call to open_by_handle_at() with the returned file_handle may fail.
Together, the pathname and dirfd arguments identify the file for which a handle is to be obtained. There are four distinct cases:
If pathname is a nonempty string containing an absolute pathname, then a handle is returned for the file referred to by that pathname. In this case, dirfd is ignored.
If pathname is a nonempty string containing a relative pathname and dirfd has the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the caller, and a handle is returned for the file to which it refers.
If pathname is a nonempty string containing a relative pathname and dirfd is a file descriptor referring to a directory, then pathname is interpreted relative to the directory referred to by dirfd, and a handle is returned for the file to which it refers. (See openat(2) for an explanation of why “directory file descriptors” are useful.)
If pathname is an empty string and flags specifies the value AT_EMPTY_PATH, then dirfd can be an open file descriptor referring to any type of file, or AT_FDCWD, meaning the current working directory, and a handle is returned for the file to which it refers.
The mount_id argument returns an identifier for the filesystem mount that corresponds to pathname. This corresponds to the first field in one of the records in /proc/self/mountinfo. Opening the pathname in the fifth field of that record yields a file descriptor for the mount point; that file descriptor can be used in a subsequent call to open_by_handle_at(). mount_id is returned both for a successful call and for a call that results in the error EOVERFLOW.
By default, name_to_handle_at() does not dereference pathname if it is a symbolic link, and thus returns a handle for the link itself. If AT_SYMLINK_FOLLOW is specified in flags, pathname is dereferenced if it is a symbolic link (so that the call returns a handle for the file referred to by the link).
name_to_handle_at() does not trigger a mount when the final component of the pathname is an automount point. When a filesystem supports both file handles and automount points, a name_to_handle_at() call on an automount point will return with error EOVERFLOW without having increased handle_bytes. This can happen since Linux 4.13 with NFS when accessing a directory which is on a separate filesystem on the server. In this case, the automount can be triggered by adding a “/” to the end of the pathname.
open_by_handle_at()
The open_by_handle_at() system call opens the file referred to by handle, a file handle returned by a previous call to name_to_handle_at().
The mount_fd argument is a file descriptor for any object (file, directory, etc.) in the mounted filesystem with respect to which handle should be interpreted. The special value AT_FDCWD can be specified, meaning the current working directory of the caller.
The flags argument is as for open(2). If handle refers to a symbolic link, the caller must specify the O_PATH flag, and the symbolic link is not dereferenced; the O_NOFOLLOW flag, if specified, is ignored.
The caller must have the CAP_DAC_READ_SEARCH capability to invoke open_by_handle_at().
RETURN VALUE
On success, name_to_handle_at() returns 0, and open_by_handle_at() returns a file descriptor (a nonnegative integer).
In the event of an error, both system calls return -1 and set errno to indicate the error.
ERRORS
name_to_handle_at() and open_by_handle_at() can fail for the same errors as openat(2). In addition, they can fail with the errors noted below.
name_to_handle_at() can fail with the following errors:
EFAULT
pathname, mount_id, or handle points outside your accessible address space.
EINVAL
flags includes an invalid bit value.
EINVAL
handle->handle_bytes is greater than MAX_HANDLE_SZ.
ENOENT
pathname is an empty string, but AT_EMPTY_PATH was not specified in flags.
ENOTDIR
The file descriptor supplied in dirfd does not refer to a directory, and it is not the case that both flags includes AT_EMPTY_PATH and pathname is an empty string.
EOPNOTSUPP
The filesystem does not support decoding of a pathname to a file handle.
EOVERFLOW
The handle->handle_bytes value passed into the call was too small. When this error occurs, handle->handle_bytes is updated to indicate the required size for the handle.
open_by_handle_at() can fail with the following errors:
EBADF
mount_fd is not an open file descriptor.
EBADF
pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
handle points outside your accessible address space.
EINVAL
handle->handle_bytes is greater than MAX_HANDLE_SZ or is equal to zero.
ELOOP
handle refers to a symbolic link, but O_PATH was not specified in flags.
EPERM
The caller does not have the CAP_DAC_READ_SEARCH capability.
ESTALE
The specified handle is not valid for opening a file. This error will occur if, for example, the file has been deleted. This error can also occur if the handle was acquired using the AT_HANDLE_FID flag and the filesystem does not support open_by_handle_at().
VERSIONS
FreeBSD has a broadly similar pair of system calls in the form of getfh() and openfh().
STANDARDS
Linux.
HISTORY
Linux 2.6.39, glibc 2.14.
NOTES
A file handle can be generated in one process using name_to_handle_at() and later used in a different process that calls open_by_handle_at().
Some filesystems don’t support the translation of pathnames to file handles, for example, /proc, /sys, and various network filesystems. Some filesystems support the translation of pathnames to file handles, but do not support using those file handles in open_by_handle_at().
A file handle may become invalid (“stale”) if a file is deleted, or for other filesystem-specific reasons. An invalid handle is reported by an ESTALE error from open_by_handle_at().
These system calls are designed for use by user-space file servers. For example, a user-space NFS server might generate a file handle and pass it to an NFS client. Later, when the client wants to open the file, it could pass the handle back to the server. This sort of functionality allows a user-space file server to operate in a stateless fashion with respect to the files it serves.
If pathname refers to a symbolic link and flags does not specify AT_SYMLINK_FOLLOW, then name_to_handle_at() returns a handle for the link (rather than the file to which it refers). The process receiving the handle can later perform operations on the symbolic link by converting the handle to a file descriptor using open_by_handle_at() with the O_PATH flag, and then passing the file descriptor as the dirfd argument in system calls such as readlinkat(2) and fchownat(2).
Obtaining a persistent filesystem ID
The mount IDs in /proc/self/mountinfo can be reused as filesystems are unmounted and mounted. Therefore, the mount ID returned by name_to_handle_at() (in *mount_id) should not be treated as a persistent identifier for the corresponding mounted filesystem. However, an application can use the information in the mountinfo record that corresponds to the mount ID to derive a persistent identifier.
For example, one can use the device name in the mount source field of the mountinfo record (the field that follows the filesystem type) to search for the corresponding device UUID via the symbolic links in /dev/disk/by-uuid. (A more comfortable way of obtaining the UUID is to use the libblkid(3) library.) That process can then be reversed, using the UUID to look up the device name, and then obtaining the corresponding mount point, in order to produce the mount_fd argument used by open_by_handle_at().
EXAMPLES
The two programs below demonstrate the use of name_to_handle_at() and open_by_handle_at(). The first program (t_name_to_handle_at.c) uses name_to_handle_at() to obtain the file handle and mount ID for the file specified in its command-line argument; the handle and mount ID are written to standard output.
The second program (t_open_by_handle_at.c) reads a mount ID and file handle from standard input. The program then employs open_by_handle_at() to open the file using that handle. If an optional command-line argument is supplied, then the mount_fd argument for open_by_handle_at() is obtained by opening the directory named in that argument. Otherwise, mount_fd is obtained by scanning /proc/self/mountinfo to find a record whose mount ID matches the mount ID read from standard input, and the mount directory specified in that record is opened. (These programs do not deal with the fact that mount IDs are not persistent.)
The following shell session demonstrates the use of these two programs:
$ echo 'Can you please think about it?' > cecilia.txt
$ ./t_name_to_handle_at cecilia.txt > fh
$ ./t_open_by_handle_at < fh
open_by_handle_at: Operation not permitted
$ sudo ./t_open_by_handle_at < fh # Need CAP_DAC_READ_SEARCH
Read 31 bytes
Now we delete and (quickly) re-create the file so that it has the same content and (by chance) the same inode. Nevertheless, open_by_handle_at() recognizes that the original file referred to by the file handle no longer exists.
$ stat --printf="%i\n" cecilia.txt # Display inode number
4072121
$ rm cecilia.txt
$ echo 'Can you please think about it?' > cecilia.txt
$ stat --printf="%i\n" cecilia.txt # Check inode number
4072121
$ sudo ./t_open_by_handle_at < fh
open_by_handle_at: Stale NFS file handle
Program source: t_name_to_handle_at.c
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
int
main(int argc, char *argv[])
{
    int mount_id, fhsize, flags, dirfd;
    char *pathname;
    struct file_handle *fhp;
    if (argc != 2) {
        fprintf(stderr, "Usage: %s pathname\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    pathname = argv[1];
    /* Allocate file_handle structure. */
    fhsize = sizeof(*fhp);
    fhp = malloc(fhsize);
    if (fhp == NULL)
        err(EXIT_FAILURE, "malloc");
    /* Make an initial call to name_to_handle_at() to discover
       the size required for file handle. */
    dirfd = AT_FDCWD;           /* For name_to_handle_at() calls */
    flags = 0;                  /* For name_to_handle_at() calls */
    fhp->handle_bytes = 0;
    if (name_to_handle_at(dirfd, pathname, fhp,
                          &mount_id, flags) != -1
        || errno != EOVERFLOW)
    {
        fprintf(stderr, "Unexpected result from name_to_handle_at()\n");
        exit(EXIT_FAILURE);
    }
    /* Reallocate file_handle structure with correct size. */
    fhsize = sizeof(*fhp) + fhp->handle_bytes;
    fhp = realloc(fhp, fhsize);         /* Copies fhp->handle_bytes */
    if (fhp == NULL)
        err(EXIT_FAILURE, "realloc");
    /* Get file handle from pathname supplied on command line. */
    if (name_to_handle_at(dirfd, pathname, fhp, &mount_id, flags) == -1)
        err(EXIT_FAILURE, "name_to_handle_at");
    /* Write mount ID, file handle size, and file handle to stdout,
       for later reuse by t_open_by_handle_at.c. */
    printf("%d\n", mount_id);
    printf("%u %d   ", fhp->handle_bytes, fhp->handle_type);
    for (size_t j = 0; j < fhp->handle_bytes; j++)
        printf(" %02x", fhp->f_handle[j]);
    printf("\n");
    exit(EXIT_SUCCESS);
}
Program source: t_open_by_handle_at.c
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
/* Scan /proc/self/mountinfo to find the line whose mount ID matches
'mount_id'. (An easier way to do this is to install and use the
'libmount' library provided by the 'util-linux' project.)
Open the corresponding mount path and return the resulting file
descriptor. */
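static int
open_mount_path_by_id(int mount_id)
{
    int mi_mount_id, found;
    char mount_path[PATH_MAX];
    char *linep;
    FILE *fp;
    size_t lsize;
    ssize_t nread;
    fp = fopen("/proc/self/mountinfo", "r");
    if (fp == NULL)
        err(EXIT_FAILURE, "fopen");
    found = 0;
    linep = NULL;
    while (!found) {
        nread = getline(&linep, &lsize, fp);
        if (nread == -1)
            break;
        nread = sscanf(linep, "%d %*d %*s %*s %s",
                       &mi_mount_id, mount_path);
        if (nread != 2) {
            fprintf(stderr, "Bad sscanf()\n");
            exit(EXIT_FAILURE);
        }
        if (mi_mount_id == mount_id)
            found = 1;
    }
    free(linep);
    fclose(fp);
    if (!found) {
        fprintf(stderr, "Could not find mount point\n");
        exit(EXIT_FAILURE);
    }
    return open(mount_path, O_RDONLY);
}
int
main(int argc, char *argv[])
{
    int mount_id, fd, mount_fd, handle_bytes;
    char buf[1000];
#define LINE_SIZE 100
    char line1[LINE_SIZE], line2[LINE_SIZE];
    char *nextp;
    ssize_t nread;
    struct file_handle *fhp;
    if ((argc > 1 && strcmp(argv[1], "--help") == 0) || argc > 2) {
        fprintf(stderr, "Usage: %s [mount-path]\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    /* Standard input contains mount ID and file handle information:
         Line 1: <mount_id>
         Line 2: <handle_bytes> <handle_type> <bytes of handle in hex>
    */
    if (fgets(line1, sizeof(line1), stdin) == NULL ||
        fgets(line2, sizeof(line2), stdin) == NULL)
    {
        fprintf(stderr, "Missing mount_id / file handle\n");
        exit(EXIT_FAILURE);
    }
    mount_id = atoi(line1);
    handle_bytes = strtoul(line2, &nextp, 0);
    /* Given handle_bytes, we can now allocate file_handle structure. */
    fhp = malloc(sizeof(*fhp) + handle_bytes);
    if (fhp == NULL)
        err(EXIT_FAILURE, "malloc");
    fhp->handle_bytes = handle_bytes;
    fhp->handle_type = strtoul(nextp, &nextp, 0);
    for (size_t j = 0; j < fhp->handle_bytes; j++)
        fhp->f_handle[j] = strtoul(nextp, &nextp, 16);
    /* Obtain file descriptor for mount point, either by opening
       the pathname specified on the command line, or by scanning
       /proc/self/mountinfo to find a mount that matches the
       'mount_id' that we received from stdin. */
    if (argc > 1)
        mount_fd = open(argv[1], O_RDONLY);
    else
        mount_fd = open_mount_path_by_id(mount_id);
    if (mount_fd == -1)
        err(EXIT_FAILURE, "opening mount fd");
    /* Open file using handle and mount point. */
    fd = open_by_handle_at(mount_fd, fhp, O_RDONLY);
    if (fd == -1)
        err(EXIT_FAILURE, "open_by_handle_at");
    /* Try reading a few bytes from the file. */
    nread = read(fd, buf, sizeof(buf));
    if (nread == -1)
        err(EXIT_FAILURE, "read");
    printf("Read %zd bytes\n", nread);
    exit(EXIT_SUCCESS);
}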
SEE ALSO
open(2), libblkid(3), blkid(8), findfs(8), mount(8)
The libblkid and libmount documentation in the latest util-linux release at
318 - Linux cli command epoll_ctl
NAME 🖥️ epoll_ctl 🖥️
control interface for an epoll file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/epoll.h>
int epoll_ctl(int epfd, int op, int fd,
struct epoll_event *_Nullable event);
DESCRIPTION
This system call is used to add, modify, or remove entries in the interest list of the epoll(7) instance referred to by the file descriptor epfd. It requests that the operation op be performed for the target file descriptor, fd.
Valid values for the op argument are:
EPOLL_CTL_ADD
Add an entry to the interest list of the epoll file descriptor, epfd. The entry includes the file descriptor, fd, a reference to the corresponding open file description (see epoll(7) and open(2)), and the settings specified in event.
EPOLL_CTL_MOD
Change the settings associated with fd in the interest list to the new settings specified in event.
EPOLL_CTL_DEL
Remove (deregister) the target file descriptor fd from the interest list. The event argument is ignored and can be NULL (but see BUGS below).
The event argument describes the object linked to the file descriptor fd. The struct epoll_event is described in epoll_event(3type).
The data member of the epoll_event structure specifies data that the kernel should save and then return (via epoll_wait(2)) when this file descriptor becomes ready.
The events member of the epoll_event structure is a bit mask composed by ORing together zero or more event types, returned by epoll_wait(2), and input flags, which affect its behaviour, but aren’t returned. The available event types are:
EPOLLIN
The associated file is available for read(2) operations.
EPOLLOUT
The associated file is available for write(2) operations.
EPOLLRDHUP (since Linux 2.6.17)
Stream socket peer closed connection, or shut down writing half of connection. (This flag is especially useful for writing simple code to detect peer shutdown when using edge-triggered monitoring.)
EPOLLPRI
There is an exceptional condition on the file descriptor. See the discussion of POLLPRI in poll(2).
EPOLLERR
Error condition happened on the associated file descriptor. This event is also reported for the write end of a pipe when the read end has been closed.
epoll_wait(2) will always report for this event; it is not necessary to set it in events when calling epoll_ctl().
EPOLLHUP
Hang up happened on the associated file descriptor.
epoll_wait(2) will always wait for this event; it is not necessary to set it in events when calling epoll_ctl().
Note that when reading from a channel such as a pipe or a stream socket, this event merely indicates that the peer closed its end of the channel. Subsequent reads from the channel will return 0 (end of file) only after all outstanding data in the channel has been consumed.
And the available input flags are:
EPOLLET
Requests edge-triggered notification for the associated file descriptor. The default behavior for epoll is level-triggered. See epoll(7) for more detailed information about edge-triggered and level-triggered notification.
EPOLLONESHOT (since Linux 2.6.2)
Requests one-shot notification for the associated file descriptor. This means that after an event is reported for the file descriptor by epoll_wait(2), the file descriptor is disabled in the interest list and no other events will be reported by the epoll interface. The user must call epoll_ctl() with EPOLL_CTL_MOD to rearm the file descriptor with a new event mask.
EPOLLWAKEUP (since Linux 3.5)
If EPOLLONESHOT and EPOLLET are clear and the process has the CAP_BLOCK_SUSPEND capability, ensure that the system does not enter “suspend” or “hibernate” while this event is pending or being processed. The event is considered as being “processed” from the time when it is returned by a call to epoll_wait(2) until the next call to epoll_wait(2) on the same epoll(7) file descriptor, the closure of that file descriptor, the removal of the event file descriptor with EPOLL_CTL_DEL, or the clearing of EPOLLWAKEUP for the event file descriptor with EPOLL_CTL_MOD. See also BUGS.
EPOLLEXCLUSIVE (since Linux 4.5)
Sets an exclusive wakeup mode for the epoll file descriptor that is being attached to the target file descriptor, fd. When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2). The default in this scenario (when EPOLLEXCLUSIVE is not set) is for all epoll file descriptors to receive an event. EPOLLEXCLUSIVE is thus useful for avoiding thundering herd problems in certain scenarios.
If the same file descriptor is in multiple epoll instances, some with the EPOLLEXCLUSIVE flag, and others without, then events will be provided to all epoll instances that did not specify EPOLLEXCLUSIVE, and at least one of the epoll instances that did specify EPOLLEXCLUSIVE.
The following values may be specified in conjunction with EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and EPOLLET. EPOLLHUP and EPOLLERR can also be specified, but this is not required: as usual, these events are always reported if they occur, regardless of whether they are specified in events. Attempts to specify other values in events yield the error EINVAL.
EPOLLEXCLUSIVE may be used only in an EPOLL_CTL_ADD operation; attempts to employ it with EPOLL_CTL_MOD yield an error. If EPOLLEXCLUSIVE has been set using epoll_ctl(), then a subsequent EPOLL_CTL_MOD on the same epfd, fd pair yields an error. A call to epoll_ctl() that specifies EPOLLEXCLUSIVE in events and specifies the target file descriptor fd as an epoll instance will likewise fail. The error in all of these cases is EINVAL.
RETURN VALUE
When successful, epoll_ctl() returns zero. When an error occurs, epoll_ctl() returns -1 and errno is set to indicate the error.
ERRORS
EBADF
epfd or fd is not a valid file descriptor.
EEXIST
op was EPOLL_CTL_ADD, and the supplied file descriptor fd is already registered with this epoll instance.
EINVAL
epfd is not an epoll file descriptor, or fd is the same as epfd, or the requested operation op is not supported by this interface.
EINVAL
An invalid event type was specified along with EPOLLEXCLUSIVE in events.
EINVAL
op was EPOLL_CTL_MOD and events included EPOLLEXCLUSIVE.
EINVAL
op was EPOLL_CTL_MOD and the EPOLLEXCLUSIVE flag has previously been applied to this epfd, fd pair.
EINVAL
EPOLLEXCLUSIVE was specified in event and fd refers to an epoll instance.
ELOOP
fd refers to an epoll instance and this EPOLL_CTL_ADD operation would result in a circular loop of epoll instances monitoring one another or a nesting depth of epoll instances greater than 5.
ENOENT
op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not registered with this epoll instance.
ENOMEM
There was insufficient memory to handle the requested op control operation.
ENOSPC
The limit imposed by /proc/sys/fs/epoll/max_user_watches was encountered while trying to register (EPOLL_CTL_ADD) a new file descriptor on an epoll instance. See epoll(7) for further details.
EPERM
The target file fd does not support epoll. This error can occur if fd refers to, for example, a regular file or a directory.
STANDARDS
Linux.
HISTORY
Linux 2.6, glibc 2.3.2.
NOTES
The epoll interface supports all file descriptors that support poll(2).
BUGS
Before Linux 2.6.9, the EPOLL_CTL_DEL operation required a non-null pointer in event, even though this argument is ignored. Since Linux 2.6.9, event can be specified as NULL when using EPOLL_CTL_DEL. Applications that need to be portable to kernels before Linux 2.6.9 should specify a non-null pointer in event.
If EPOLLWAKEUP is specified in flags, but the caller does not have the CAP_BLOCK_SUSPEND capability, then the EPOLLWAKEUP flag is silently ignored. This unfortunate behavior is necessary because no validity checks were performed on the flags argument in the original implementation, and the addition of the EPOLLWAKEUP with a check that caused the call to fail if the caller did not have the CAP_BLOCK_SUSPEND capability caused a breakage in at least one existing user-space application that happened to randomly (and uselessly) specify this bit. A robust application should therefore double check that it has the CAP_BLOCK_SUSPEND capability if attempting to use the EPOLLWAKEUP flag.
SEE ALSO
epoll_create(2), epoll_wait(2), poll(2), epoll(7)
319 - Linux cli command eventfd2
NAME 🖥️ eventfd2 🖥️
create a file descriptor for event notification
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/eventfd.h>
int eventfd(unsigned int initval, int flags);
DESCRIPTION
eventfd() creates an “eventfd object” that can be used as an event wait/notify mechanism by user-space applications, and by the kernel to notify user-space applications of events. The object contains an unsigned 64-bit integer (uint64_t) counter that is maintained by the kernel. This counter is initialized with the value specified in the argument initval.
As its return value, eventfd() returns a new file descriptor that can be used to refer to the eventfd object.
The following values may be bitwise ORed in flags to change the behavior of eventfd():
EFD_CLOEXEC (since Linux 2.6.27)
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
EFD_NONBLOCK (since Linux 2.6.27)
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
EFD_SEMAPHORE (since Linux 2.6.30)
Provide semaphore-like semantics for reads from the new file descriptor. See below.
Up to Linux 2.6.26, the flags argument is unused, and must be specified as zero.
The following operations can be performed on the file descriptor returned by eventfd():
read(2)
Each successful read(2) returns an 8-byte integer. A read(2) fails with the error EINVAL if the size of the supplied buffer is less than 8 bytes.
The value returned by read(2) is in host byte order, that is, the native byte order for integers on the host machine.
The semantics of read(2) depend on whether the eventfd counter currently has a nonzero value and whether the EFD_SEMAPHORE flag was specified when creating the eventfd file descriptor:
If EFD_SEMAPHORE was not specified and the eventfd counter has a nonzero value, then a read(2) returns 8 bytes containing that value, and the counter’s value is reset to zero.
If EFD_SEMAPHORE was specified and the eventfd counter has a nonzero value, then a read(2) returns 8 bytes containing the value 1, and the counter’s value is decremented by 1.
If the eventfd counter is zero at the time of the call to read(2), then the call either blocks until the counter becomes nonzero (at which time, the read(2) proceeds as described above) or fails with the error EAGAIN if the file descriptor has been made nonblocking.
write(2)
A write(2) call adds the 8-byte integer value supplied in its buffer to the counter. The maximum value that may be stored in the counter is the largest unsigned 64-bit value minus 1 (i.e., 0xfffffffffffffffe). If the addition would cause the counter’s value to exceed the maximum, then the write(2) either blocks until a read(2) is performed on the file descriptor, or fails with the error EAGAIN if the file descriptor has been made nonblocking.
A write(2) fails with the error EINVAL if the size of the supplied buffer is less than 8 bytes, or if an attempt is made to write the value 0xffffffffffffffff.
poll(2)
select(2)
(and similar)
The returned file descriptor supports poll(2) (and analogously epoll(7)) and select(2), as follows:
The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if the counter has a value greater than 0.
The file descriptor is writable (the select(2) writefds argument; the poll(2) POLLOUT flag) if it is possible to write a value of at least “1” without blocking.
If an overflow of the counter value was detected, then select(2) indicates the file descriptor as being both readable and writable, and poll(2) returns a POLLERR event. As noted above, write(2) can never overflow the counter. However an overflow can occur if 2^64 eventfd “signal posts” were performed by the KAIO subsystem (theoretically possible, but practically unlikely). If an overflow has occurred, then read(2) will return that maximum uint64_t value (i.e., 0xffffffffffffffff).
The eventfd file descriptor also supports the other file-descriptor multiplexing APIs: pselect(2) and ppoll(2).
close(2)
When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same eventfd object have been closed, the resources for the object are freed by the kernel.
A copy of the file descriptor created by eventfd() is inherited by the child produced by fork(2). The duplicate file descriptor is associated with the same eventfd object. File descriptors created by eventfd() are preserved across execve(2), unless the close-on-exec flag has been set.
RETURN VALUE
On success, eventfd() returns a new eventfd file descriptor. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EINVAL
An unsupported value was specified in flags.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
Could not mount (internal) anonymous inode device.
ENOMEM
There was insufficient memory to create a new eventfd file descriptor.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
eventfd() | Thread safety | MT-Safe |
VERSIONS
C library/kernel differences
There are two underlying Linux system calls: eventfd() and the more recent eventfd2(). The former system call does not implement a flags argument. The latter system call implements the flags values described above. The glibc wrapper function will use eventfd2() where it is available.
Additional glibc features
The GNU C library defines an additional type, and two functions that attempt to abstract some of the details of reading and writing on an eventfd file descriptor:
typedef uint64_t eventfd_t;
int eventfd_read(int fd, eventfd_t *value);
int eventfd_write(int fd, eventfd_t value);
The functions perform the read and write operations on an eventfd file descriptor, returning 0 if the correct number of bytes was transferred, or -1 otherwise.
STANDARDS
Linux, GNU.
HISTORY
eventfd()
Linux 2.6.22, glibc 2.8.
eventfd2()
Linux 2.6.27 (see VERSIONS). Since glibc 2.9, the eventfd() wrapper will employ the eventfd2() system call, if it is supported by the kernel.
NOTES
Applications can use an eventfd file descriptor instead of a pipe (see pipe(2)) in all cases where a pipe is used simply to signal events. The kernel overhead of an eventfd file descriptor is much lower than that of a pipe, and only one file descriptor is required (versus the two required for a pipe).
When used in the kernel, an eventfd file descriptor can provide a bridge from kernel to user space, allowing, for example, functionalities like KAIO (kernel AIO) to signal to a file descriptor that some operation is complete.
A key point about an eventfd file descriptor is that it can be monitored just like any other file descriptor using select(2), poll(2), or epoll(7). This means that an application can simultaneously monitor the readiness of “traditional” files and the readiness of other kernel mechanisms that support the eventfd interface. (Without the eventfd() interface, these mechanisms could not be multiplexed via select(2), poll(2), or epoll(7).)
The current value of an eventfd counter can be viewed via the entry for the corresponding file descriptor in the process’s /proc/pid/fdinfo directory. See proc(5) for further details.
EXAMPLES
The following program creates an eventfd file descriptor and then forks to create a child process. While the parent briefly sleeps, the child writes each of the integers supplied in the program’s command-line arguments to the eventfd file descriptor. When the parent has finished sleeping, it reads from the eventfd file descriptor.
The following shell session shows a sample run of the program:
$ ./a.out 1 2 4 7 14
Child writing 1 to efd
Child writing 2 to efd
Child writing 4 to efd
Child writing 7 to efd
Child writing 14 to efd
Child completed write loop
Parent about to read
Parent read 28 (0x1c) from efd
Program source
#include <err.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    int efd;
    uint64_t u;
    ssize_t s;
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <num>...\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    efd = eventfd(0, 0);
    if (efd == -1)
        err(EXIT_FAILURE, "eventfd");
    switch (fork()) {
    case 0:
        for (size_t j = 1; j < argc; j++) {
            printf("Child writing %s to efd\n", argv[j]);
            u = strtoull(argv[j], NULL, 0);
                    /* strtoull() allows various bases */
            s = write(efd, &u, sizeof(uint64_t));
            if (s != sizeof(uint64_t))
                err(EXIT_FAILURE, "write");
        }
        printf("Child completed write loop\n");
        exit(EXIT_SUCCESS);
    default:
        sleep(2);
        printf("Parent about to read\n");
        s = read(efd, &u, sizeof(uint64_t));
        if (s != sizeof(uint64_t))
            err(EXIT_FAILURE, "read");
        printf("Parent read %"PRIu64" (%#"PRIx64") from efd\n", u, u);
        exit(EXIT_SUCCESS);
    case -1:
        err(EXIT_FAILURE, "fork");
    }
}
SEE ALSO
futex(2), pipe(2), poll(2), read(2), select(2), signalfd(2), timerfd_create(2), write(2), epoll(7), sem_overview(7)
320 - Linux cli command lchown
NAME 🖥️ lchown 🖥️
change ownership of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int chown(const char *pathname, uid_t owner, gid_t group);
int fchown(int fd, uid_t owner, gid_t group);
int lchown(const char *pathname, uid_t owner, gid_t group);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int fchownat(int dirfd, const char *pathname,
uid_t owner, gid_t group, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchown(), lchown():
/* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| _XOPEN_SOURCE >= 500
|| /* glibc <= 2.19: */ _BSD_SOURCE
fchownat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These system calls change the owner and group of a file. The chown(), fchown(), and lchown() system calls differ only in how the file is specified:
chown() changes the ownership of the file specified by pathname, which is dereferenced if it is a symbolic link.
fchown() changes the ownership of the file referred to by the open file descriptor fd.
lchown() is like chown(), but does not dereference symbolic links.
Only a privileged process (Linux: one with the CAP_CHOWN capability) may change the owner of a file. The owner of a file may change the group of the file to any group of which that owner is a member. A privileged process (Linux: with CAP_CHOWN) may change the group arbitrarily.
If the owner or group is specified as -1, then that ID is not changed.
When the owner or group of an executable file is changed by an unprivileged user, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel version, and since Linux 2.2.13, root is treated like other users. In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown().
When the owner or group of an executable file is changed (by any user), all capability sets for the file are cleared.
fchownat()
The fchownat() system call operates in exactly the same way as chown(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chown() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chown()).
If pathname is absolute, then dirfd is ignored.
The flags argument is a bit mask created by ORing together 0 or more of the following values:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead operate on the link itself, like lchown(). (By default, fchownat() dereferences symbolic links, like chown().)
See openat(2) for an explanation of the need for fchownat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, errors other than those listed below can be returned.
The more general errors for chown() are listed below.
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EBADF
(fchown()) fd is not a valid open file descriptor.
EBADF
(fchownat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
(fchownat()) Invalid flag specified in flags.
EIO
(fchown()) A low-level I/O error occurred while modifying the inode.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
The file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(fchownat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
The calling process did not have the required permissions (see above) to change owner and/or group.
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The named file resides on a read-only filesystem.
VERSIONS
The 4.4BSD version can be used only by the superuser (that is, ordinary users cannot give away files).
STANDARDS
POSIX.1-2008.
HISTORY
chown()
fchown()
lchown()
4.4BSD, SVr4, POSIX.1-2001.
fchownat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Ownership of new files
When a new file is created (by, for example, open(2) or mkdir(2)), its owner is made the same as the filesystem user ID of the creating process. The group of the file depends on a range of factors, including the type of filesystem, the options used to mount the filesystem, and whether or not the set-group-ID mode bit is enabled on the parent directory. If the filesystem supports the -o grpid (or, synonymously -o bsdgroups) and -o nogrpid (or, synonymously -o sysvgroups) mount(8) options, then the rules are as follows:
If the filesystem is mounted with -o grpid, then the group of a new file is made the same as that of the parent directory.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is disabled on the parent directory, then the group of a new file is made the same as the process’s filesystem GID.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is enabled on the parent directory, then the group of a new file is made the same as that of the parent directory.
As at Linux 4.12, the -o grpid and -o nogrpid mount options are supported by ext2, ext3, ext4, and XFS. Filesystems that don’t support these mount options follow the -o nogrpid rules.
glibc notes
On older kernels where fchownat() is unavailable, the glibc wrapper function falls back to the use of chown() and lchown(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
NFS
The chown() semantics are deliberately violated on NFS filesystems which have UID mapping enabled. Additionally, the semantics of all system calls which access the file contents are violated, because chown() may cause immediate access revocation on already open files. Client side caching may lead to a delay between the time where ownership has been changed to allow access for a user and the time where the file can actually be accessed by the user on other clients.
Historical details
The original Linux chown(), fchown(), and lchown() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added chown32(), fchown32(), and lchown32(), supporting 32-bit IDs. The glibc chown(), fchown(), and lchown() wrapper functions transparently deal with the variations across kernel versions.
Before Linux 2.1.81 (except 2.1.46), chown() did not follow symbolic links. Since Linux 2.1.81, chown() does follow symbolic links, and there is a new system call lchown() that does not follow symbolic links. Since Linux 2.1.86, this new call (that has the same semantics as the old chown()) has got the same syscall number, and chown() got the newly introduced number.
EXAMPLES
The following program changes the ownership of the file named in its second command-line argument to the value specified in its first command-line argument. The new owner can be specified either as a numeric user ID, or as a username (which is converted to a user ID by using getpwnam(3) to perform a lookup in the system password file).
Program source
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    char *endptr;
    uid_t uid;
    struct passwd *pwd;
    if (argc != 3 || argv[1][0] == '\0') {
        fprintf(stderr, "%s <owner> <file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    uid = strtol(argv[1], &endptr, 10);  /* Allow a numeric string */
    if (*endptr != '\0') {         /* Was not pure numeric string */
        pwd = getpwnam(argv[1]);   /* Try getting UID for username */
        if (pwd == NULL) {
            perror("getpwnam");
            exit(EXIT_FAILURE);
        }
        uid = pwd->pw_uid;
    }
    if (chown(argv[2], uid, -1) == -1) {
        perror("chown");
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}
SEE ALSO
chgrp(1), chown(1), chmod(2), flock(2), path_resolution(7), symlink(7)
321 - Linux cli command getpeername
NAME 🖥️ getpeername 🖥️
get name of connected peer socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int getpeername(int sockfd, struct sockaddr *restrict addr,
socklen_t *restrict addrlen);
DESCRIPTION
getpeername() returns the address of the peer connected to the socket sockfd, in the buffer pointed to by addr. The addrlen argument should be initialized to indicate the amount of space pointed to by addr. On return it contains the actual size of the name returned (in bytes).
The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
The argument sockfd is not a valid file descriptor.
EFAULT
The addr argument points to memory not in a valid part of the process address space.
EINVAL
addrlen is invalid (e.g., is negative).
ENOBUFS
Insufficient resources were available in the system to perform the operation.
ENOTCONN
The socket is not connected.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD (first appeared in 4.2BSD).
NOTES
For stream sockets, once a connect(2) has been performed, either socket can call getpeername() to obtain the address of the peer socket. On the other hand, datagram sockets are connectionless. Calling connect(2) on a datagram socket merely sets the peer address for outgoing datagrams sent with write(2) or send(2). The caller of connect(2) can use getpeername() to obtain the peer address that it earlier set for the socket. However, the peer socket is unaware of this information, and calling getpeername() on the peer socket will return no useful information (unless a connect(2) call was also executed on the peer). Note also that the receiver of a datagram can obtain the address of the sender when using recvfrom(2).
SEE ALSO
accept(2), bind(2), getsockname(2), ip(7), socket(7), unix(7)
322 - Linux cli command mmap
NAME 🖥️ mmap 🖥️
map or unmap files or devices into memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
void *mmap(void addr[.length], size_t length, int prot, int flags,
int fd, off_t offset);
int munmap(void addr[.length], size_t length);
See NOTES for information on feature test macro requirements.
DESCRIPTION
mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the mapping (which must be greater than 0).
If addr is NULL, then the kernel chooses the (page-aligned) address at which to create the mapping; this is the most portable method of creating a new mapping. If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the kernel will pick a nearby page boundary (but always above or equal to the value specified by /proc/sys/vm/mmap_min_addr) and attempt to create the mapping there. If another mapping already exists there, the kernel picks a new address that may or may not depend on the hint. The address of the new mapping is returned as the result of the call.
The contents of a file mapping (as opposed to an anonymous mapping; see MAP_ANONYMOUS below), are initialized using length bytes starting at offset offset in the file (or other object) referred to by the file descriptor fd. offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE).
After the mmap() call has returned, the file descriptor, fd, can be closed immediately without invalidating the mapping.
The prot argument describes the desired memory protection of the mapping (and must not conflict with the open mode of the file). It is either PROT_NONE or the bitwise OR of one or more of the following flags:
PROT_EXEC
Pages may be executed.
PROT_READ
Pages may be read.
PROT_WRITE
Pages may be written.
PROT_NONE
Pages may not be accessed.
The flags argument
The flags argument determines whether updates to the mapping are visible to other processes mapping the same region, and whether updates are carried through to the underlying file. This behavior is determined by including exactly one of the following values in flags:
MAP_SHARED
Share this mapping. Updates to the mapping are visible to other processes mapping the same region, and (in the case of file-backed mappings) are carried through to the underlying file. (To precisely control when updates are carried through to the underlying file requires the use of msync(2).)
MAP_SHARED_VALIDATE (since Linux 4.15)
This flag provides the same behavior as MAP_SHARED except that MAP_SHARED mappings ignore unknown flags in flags. By contrast, when creating a mapping using MAP_SHARED_VALIDATE, the kernel verifies all passed flags are known and fails the mapping with the error EOPNOTSUPP for unknown flags. This mapping type is also required to be able to use some mapping flags (e.g., MAP_SYNC).
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
Both MAP_SHARED and MAP_PRIVATE are described in POSIX.1-2001 and POSIX.1-2008. MAP_SHARED_VALIDATE is a Linux extension.
In addition, zero or more of the following values can be ORed in flags:
MAP_32BIT (since Linux 2.4.20, 2.6)
Put the mapping into the first 2 Gigabytes of the process address space. This flag is supported only on x86-64, for 64-bit programs. It was added to allow thread stacks to be allocated somewhere in the first 2 GB of memory, so as to improve context-switch performance on some early 64-bit processors. Modern x86-64 processors no longer have this performance problem, so use of this flag is not required on those systems. The MAP_32BIT flag is ignored when MAP_FIXED is set.
MAP_ANON
Synonym for MAP_ANONYMOUS; provided for compatibility with other implementations.
MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initialized to zero. The fd argument is ignored; however, some implementations require fd to be -1 if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications should ensure this. The offset argument should be zero. Support for MAP_ANONYMOUS in conjunction with MAP_SHARED was added in Linux 2.4.
MAP_DENYWRITE
This flag is ignored. (Long ago, in Linux 2.0 and earlier, it signaled that attempts to write to the underlying file should fail with ETXTBSY. But this was a source of denial-of-service attacks.)
MAP_EXECUTABLE
This flag is ignored.
MAP_FILE
Compatibility flag. Ignored.
MAP_FIXED
Don’t interpret addr as a hint: place the mapping at exactly that address. addr must be suitably aligned: for most architectures a multiple of the page size is sufficient; however, some architectures may impose additional restrictions. If the memory region specified by addr and length overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded. If the specified address cannot be used, mmap() will fail.
Software that aspires to be portable should use the MAP_FIXED flag with care, keeping in mind that the exact layout of a process’s memory mappings is allowed to change significantly between Linux versions, C library versions, and operating system releases. Carefully read the discussion of this flag in NOTES!
MAP_FIXED_NOREPLACE (since Linux 4.17)
This flag provides behavior that is similar to MAP_FIXED with respect to the addr enforcement, but differs in that MAP_FIXED_NOREPLACE never clobbers a preexisting mapped range. If the requested range would collide with an existing mapping, then this call fails with the error EEXIST. This flag can therefore be used as a way to atomically (with respect to other threads) attempt to map an address range: one thread will succeed; all others will report failure.
Note that older kernels which do not recognize the MAP_FIXED_NOREPLACE flag will typically (upon detecting a collision with a preexisting mapping) fall back to a “non-MAP_FIXED” type of behavior: they will return an address that is different from the requested address. Therefore, backward-compatible software should check the returned address against the requested address.
MAP_GROWSDOWN
This flag is used for stacks. It indicates to the kernel virtual memory system that the mapping should extend downward in memory. The return address is one page lower than the memory area that is actually created in the process’s virtual address space. Touching an address in the “guard” page below the mapping will cause the mapping to grow by a page. This growth can be repeated until the mapping grows to within a page of the high end of the next lower mapping, at which point touching the “guard” page will result in a SIGSEGV signal.
MAP_HUGETLB (since Linux 2.6.32)
Allocate the mapping using “huge” pages. See the Linux kernel source file Documentation/admin-guide/mm/hugetlbpage.rst for further information, as well as NOTES, below.
MAP_HUGE_2MB
MAP_HUGE_1GB (since Linux 3.8)
Used in conjunction with MAP_HUGETLB to select alternative hugetlb page sizes (respectively, 2 MB and 1 GB) on systems that support multiple hugetlb page sizes.
More generally, the desired huge page size can be configured by encoding the base-2 logarithm of the desired page size in the six bits at the offset MAP_HUGE_SHIFT. (A value of zero in this bit field provides the default huge page size; the default huge page size can be discovered via the Hugepagesize field exposed by /proc/meminfo.) Thus, the above two constants are defined as:
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
The range of huge page sizes that are supported by the system can be discovered by listing the subdirectories in /sys/kernel/mm/hugepages.
MAP_LOCKED (since Linux 2.5.37)
Mark the mapped region to be locked in the same way as mlock(2). This implementation will try to populate (prefault) the whole range but the mmap() call doesn’t fail with ENOMEM if this fails. Therefore major faults might happen later on. So the semantic is not as strong as mlock(2). One should use mmap() plus mlock(2) when major faults are not acceptable after the initialization of the mapping. The MAP_LOCKED flag is ignored in older kernels.
MAP_NONBLOCK (since Linux 2.5.46)
This flag is meaningful only in conjunction with MAP_POPULATE. Don’t perform read-ahead: create page table entries only for pages that are already present in RAM. Since Linux 2.6.23, this flag causes MAP_POPULATE to do nothing. One day, the combination of MAP_POPULATE and MAP_NONBLOCK may be reimplemented.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available. See also the discussion of the file /proc/sys/vm/overcommit_memory in proc(5). Before Linux 2.6, this flag had effect only for private writable mappings.
MAP_POPULATE (since Linux 2.5.46)
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. This will help to reduce blocking on page faults later. The mmap() call doesn’t fail if the mapping cannot be populated (for example, due to limitations on the number of mapped huge pages when using MAP_HUGETLB). Support for MAP_POPULATE in conjunction with private mappings was added in Linux 2.6.23.
MAP_STACK (since Linux 2.6.27)
Allocate the mapping at an address suitable for a process or thread stack.
This flag is currently a no-op on Linux. However, by employing this flag, applications can ensure that they transparently obtain support if the flag is implemented in the future. Thus, it is used in the glibc threading implementation to allow for the fact that some architectures may (later) require special treatment for stack allocations. A further reason to employ this flag is portability: MAP_STACK exists (and has an effect) on some other systems (e.g., some of the BSDs).
MAP_SYNC (since Linux 4.15)
This flag is available only with the MAP_SHARED_VALIDATE mapping type; mappings of type MAP_SHARED will silently ignore this flag. This flag is supported only for files supporting DAX (direct mapping of persistent memory). For other files, creating a mapping with this flag results in an EOPNOTSUPP error.
Shared file mappings with this flag provide the guarantee that while some memory is mapped writable in the address space of the process, it will be visible in the same file at the same offset even after the system crashes or is rebooted. In conjunction with the use of appropriate CPU instructions, this provides users of such mappings with a more efficient way of making data modifications persistent.
MAP_UNINITIALIZED (since Linux 2.6.33)
Don’t clear anonymous pages. This flag is intended to improve performance on embedded devices. This flag is honored only if the kernel was configured with the CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option is normally enabled only on embedded devices (i.e., devices where one has complete control of the contents of user memory).
Of the above flags, only MAP_FIXED is specified in POSIX.1-2001 and POSIX.1-2008. However, most systems also support MAP_ANONYMOUS (or its synonym MAP_ANON).
munmap()
The munmap() system call deletes the mappings for the specified address range, and causes further references to addresses within the range to generate invalid memory references. The region is also automatically unmapped when the process is terminated. On the other hand, closing the file descriptor does not unmap the region.
The address addr must be a multiple of the page size (but length need not be). All pages containing a part of the indicated range are unmapped, and subsequent references to these pages will generate SIGSEGV. It is not an error if the indicated range does not contain any mapped pages.
RETURN VALUE
On success, mmap() returns a pointer to the mapped area. On error, the value MAP_FAILED (that is, (void *) -1) is returned, and errno is set to indicate the error.
On success, munmap() returns 0. On failure, it returns -1, and errno is set to indicate the error (probably to EINVAL).
ERRORS
EACCES
A file descriptor refers to a non-regular file. Or a file mapping was requested, but fd is not open for reading. Or MAP_SHARED was requested and PROT_WRITE is set, but fd is not open in read/write (O_RDWR) mode. Or PROT_WRITE is set, but the file is append-only.
EAGAIN
The file has been locked, or too much memory has been locked (see setrlimit(2)).
EBADF
fd is not a valid file descriptor (and MAP_ANONYMOUS was not set).
EEXIST
MAP_FIXED_NOREPLACE was specified in flags, and the range covered by addr and length clashes with an existing mapping.
EINVAL
We don’t like addr, length, or offset (e.g., they are too large, or not aligned on a page boundary).
EINVAL
(since Linux 2.6.12) length was 0.
EINVAL
flags contained none of MAP_PRIVATE, MAP_SHARED, or MAP_SHARED_VALIDATE.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
The underlying filesystem of the specified file does not support memory mapping.
ENOMEM
No memory is available.
ENOMEM
The process’s maximum number of mappings would have been exceeded. This error can also occur for munmap(), when unmapping a region in the middle of an existing mapping, since this results in two smaller mappings on either side of the region being unmapped.
ENOMEM
(since Linux 4.7) The process’s RLIMIT_DATA limit, described in getrlimit(2), would have been exceeded.
ENOMEM
We don’t like addr, because it exceeds the virtual address space of the CPU.
EOVERFLOW
On 32-bit architecture together with the large file extension (i.e., using 64-bit off_t): the number of pages used for length plus number of pages used for offset would overflow unsigned long (32 bits).
EPERM
The prot argument asks for PROT_EXEC but the mapped area belongs to a file on a filesystem that was mounted no-exec.
EPERM
The operation was prevented by a file seal; see fcntl(2).
EPERM
The MAP_HUGETLB flag was specified, but the caller was not privileged (did not have the CAP_IPC_LOCK capability) and is not a member of the sysctl_hugetlb_shm_group group; see the description of /proc/sys/vm/sysctl_hugetlb_shm_group in proc_sys(5).
ETXTBSY
MAP_DENYWRITE was set but the object specified by fd is open for writing.
Use of a mapped region can result in these signals:
SIGSEGV
Attempted write into a region mapped as read-only.
SIGBUS
Attempted access to a page of the buffer that lies beyond the end of the mapped file. For an explanation of the treatment of the bytes in the page that corresponds to the end of a mapped file that is not a multiple of the page size, see NOTES.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
mmap(), munmap() | Thread safety | MT-Safe |
VERSIONS
On some hardware architectures (e.g., i386), PROT_WRITE implies PROT_READ. It is architecture dependent whether PROT_READ implies PROT_EXEC or not. Portable programs should always set PROT_EXEC if they intend to execute code in the new mapping.
The portable way to create a mapping is to specify addr as 0 (NULL), and omit MAP_FIXED from flags. In this case, the system chooses the address for the mapping; the address is chosen so as not to conflict with any existing mapping, and will not be 0. If the MAP_FIXED flag is specified, and addr is 0 (NULL), then the mapped address will be 0 (NULL).
Certain flags constants are defined only if suitable feature test macros are defined (possibly by default): _DEFAULT_SOURCE with glibc 2.19 or later; or _BSD_SOURCE or _SVID_SOURCE in glibc 2.19 and earlier. (Employing _GNU_SOURCE also suffices, and requiring that macro specifically would have been more logical, since these flags are all Linux-specific.) The relevant flags are: MAP_32BIT, MAP_ANONYMOUS (and the synonym MAP_ANON), MAP_DENYWRITE, MAP_EXECUTABLE, MAP_FILE, MAP_GROWSDOWN, MAP_HUGETLB, MAP_LOCKED, MAP_NONBLOCK, MAP_NORESERVE, MAP_POPULATE, and MAP_STACK.
C library/kernel differences
This page describes the interface provided by the glibc mmap() wrapper function. Originally, this function invoked a system call of the same name. Since Linux 2.4, that system call has been superseded by mmap2(2), and nowadays the glibc mmap() wrapper function invokes mmap2(2) with a suitably adjusted value for offset.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD.
On POSIX systems on which mmap(), msync(2), and munmap() are available, _POSIX_MAPPED_FILES is defined in <unistd.h> to a value greater than 0. (See also sysconf(3).)
NOTES
Memory mapped by mmap() is preserved across fork(2), with the same attributes.
A file is mapped in multiples of the page size. For a file that is not a multiple of the page size, the remaining bytes in the partial page at the end of the mapping are zeroed when mapped, and modifications to that region are not written out to the file. The effect of changing the size of the underlying file of a mapping on the pages that correspond to added or removed regions of the file is unspecified.
An application can determine which pages of a mapping are currently resident in the buffer/page cache using mincore(2).
Using MAP_FIXED safely
The only safe use for MAP_FIXED is where the address range specified by addr and length was previously reserved using another mapping; otherwise, the use of MAP_FIXED is hazardous because it forcibly removes preexisting mappings, making it easy for a multithreaded process to corrupt its own address space.
For example, suppose that thread A looks through /proc/pid/maps in order to locate an unused address range that it can map using MAP_FIXED, while thread B simultaneously acquires part or all of that same address range. When thread A subsequently employs mmap(MAP_FIXED), it will effectively clobber the mapping that thread B created. In this scenario, thread B need not create a mapping directly; simply making a library call that, internally, uses dlopen(3) to load some other shared library, will suffice. The dlopen(3) call will map the library into the process’s address space. Furthermore, almost any library call may be implemented in a way that adds memory mappings to the address space, either with this technique, or by simply allocating memory. Examples include brk(2), malloc(3), pthread_create(3), and the PAM libraries.
Since Linux 4.17, a multithreaded program can use the MAP_FIXED_NOREPLACE flag to avoid the hazard described above when attempting to create a mapping at a fixed address that has not been reserved by a preexisting mapping.
Timestamp changes for file-backed mappings
For file-backed mappings, the st_atime field for the mapped file may be updated at any time between the mmap() and the corresponding unmapping; the first reference to a mapped page will update the field if it has not been already.
The st_ctime and st_mtime fields for a file mapped with PROT_WRITE and MAP_SHARED will be updated after a write to the mapped region, and before a subsequent msync(2) with the MS_SYNC or MS_ASYNC flag, if one occurs.
Huge page (Huge TLB) mappings
For mappings that employ huge pages, the requirements for the arguments of mmap() and munmap() differ somewhat from the requirements for mappings that use the native system page size.
For mmap(), offset must be a multiple of the underlying huge page size. The system automatically aligns length to be a multiple of the underlying huge page size.
For munmap(), addr and length must both be a multiple of the underlying huge page size.
BUGS
On Linux, there are no guarantees like those suggested above under MAP_NORESERVE. By default, any process can be killed at any moment when the system runs out of memory.
Before Linux 2.6.7, the MAP_POPULATE flag has effect only if prot is specified as PROT_NONE.
SUSv3 specifies that mmap() should fail if length is 0. However, before Linux 2.6.12, mmap() succeeded in this case: no mapping was created and the call returned addr. Since Linux 2.6.12, mmap() fails with the error EINVAL for this case.
POSIX specifies that the system shall always zero fill any partial page at the end of the object and that the system will never write any modification of the object beyond its end. On Linux, when you write data to such a partial page after the end of the object, the data stays in the page cache even after the file is closed and unmapped. Even though the data is never written to the file itself, subsequent mappings may see the modified content. In some cases, this could be fixed by calling msync(2) before the unmap takes place; however, this doesn’t work on tmpfs(5) (for example, when using the POSIX shared memory interface documented in shm_overview(7)).
EXAMPLES
The following program prints part of the file specified in its first command-line argument to standard output. The range of bytes to be printed is specified via offset and length values in the second and third command-line arguments. The program creates a memory mapping of the required pages of the file and then uses write(2) to output the desired bytes.
Program source
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

int
main(int argc, char *argv[])
{
    int          fd;
    char         *addr;
    off_t        offset, pa_offset;
    size_t       length;
    ssize_t      s;
    struct stat  sb;

    if (argc < 3 || argc > 4) {
        fprintf(stderr, "%s file offset [length]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        handle_error("open");

    if (fstat(fd, &sb) == -1)           /* To obtain file size */
        handle_error("fstat");

    offset = atoi(argv[2]);
    pa_offset = offset & ~(sysconf(_SC_PAGE_SIZE) - 1);
        /* offset for mmap() must be page aligned */

    if (offset >= sb.st_size) {
        fprintf(stderr, "offset is past end of file\n");
        exit(EXIT_FAILURE);
    }

    if (argc == 4) {
        length = atoi(argv[3]);
        if (offset + length > sb.st_size)
            length = sb.st_size - offset;
                /* Can't display bytes past end of file */
    } else {    /* No length arg ==> display to end of file */
        length = sb.st_size - offset;
    }

    addr = mmap(NULL, length + offset - pa_offset, PROT_READ,
                MAP_PRIVATE, fd, pa_offset);
    if (addr == MAP_FAILED)
        handle_error("mmap");

    s = write(STDOUT_FILENO, addr + offset - pa_offset, length);
    if (s != length) {
        if (s == -1)
            handle_error("write");

        fprintf(stderr, "partial write");
        exit(EXIT_FAILURE);
    }

    munmap(addr, length + offset - pa_offset);
    close(fd);

    exit(EXIT_SUCCESS);
}
SEE ALSO
ftruncate(2), getpagesize(2), memfd_create(2), mincore(2), mlock(2), mmap2(2), mprotect(2), mremap(2), msync(2), remap_file_pages(2), setrlimit(2), shmat(2), userfaultfd(2), shm_open(3), shm_overview(7)
The descriptions of the following files in proc(5): /proc/pid/maps, /proc/pid/map_files, and /proc/pid/smaps.
B.O. Gallmeister, POSIX.4, O’Reilly, pp. 128-129 and 389-391.
323 - Linux cli command ioctl_tty
NAME 🖥️ ioctl_tty 🖥️
ioctls for terminals and serial lines
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/ioctl.h>
#include <asm/termbits.h> /* Definition of struct termios,
struct termios2, and
Bnnn, BOTHER, CBAUD, CLOCAL,
TC*{FLUSH,ON,OFF} and other constants */
int ioctl(int fd, int op, ...);
DESCRIPTION
The ioctl(2) call for terminals and serial ports accepts many possible operation arguments. Most require a third argument, of varying type, here called argp or arg.
Use of ioctl() makes for nonportable programs. Use the POSIX interface described in termios(3) whenever possible.
Please note that struct termios from <asm/termbits.h> is different and incompatible with struct termios from <termios.h>. These ioctl calls require struct termios from <asm/termbits.h>.
Get and set terminal attributes
TCGETS
Argument: struct termios *argp
Equivalent to tcgetattr(fd, argp).
Get the current serial port settings.
TCSETS
Argument: const struct termios *argp
Equivalent to tcsetattr(fd, TCSANOW, argp).
Set the current serial port settings.
TCSETSW
Argument: const struct termios *argp
Equivalent to tcsetattr(fd, TCSADRAIN, argp).
Allow the output buffer to drain, and set the current serial port settings.
TCSETSF
Argument: const struct termios *argp
Equivalent to tcsetattr(fd, TCSAFLUSH, argp).
Allow the output buffer to drain, discard pending input, and set the current serial port settings.
The following four ioctls, added in Linux 2.6.20, are just like TCGETS, TCSETS, TCSETSW, TCSETSF, except that they take a struct termios2 * instead of a struct termios *. If the structure member c_cflag contains the flag BOTHER, then the baud rate is stored in the structure members c_ispeed and c_ospeed as integer values. These ioctls are not supported on all architectures.
TCGETS2 | struct termios2 *argp |
TCSETS2 | const struct termios2 *argp |
TCSETSW2 | const struct termios2 *argp |
TCSETSF2 | const struct termios2 *argp |
The following four ioctls are just like TCGETS, TCSETS, TCSETSW, TCSETSF, except that they take a struct termio * instead of a struct termios *.
TCGETA | struct termio *argp |
TCSETA | const struct termio *argp |
TCSETAW | const struct termio *argp |
TCSETAF | const struct termio *argp |
Locking the termios structure
The termios structure of a terminal can be locked. The lock is itself a termios structure, with nonzero bits or fields indicating a locked value.
TIOCGLCKTRMIOS
Argument: struct termios *argp
Gets the locking status of the termios structure of the terminal.
TIOCSLCKTRMIOS
Argument: const struct termios *argp
Sets the locking status of the termios structure of the terminal. Only a process with the CAP_SYS_ADMIN capability can do this.
Get and set window size
Window sizes are kept in the kernel, but not used by the kernel (except in the case of virtual consoles, where the kernel will update the window size when the size of the virtual console changes, for example, by loading a new font).
TIOCGWINSZ
Argument: struct winsize *argp
Get window size.
TIOCSWINSZ
Argument: const struct winsize *argp
Set window size.
The struct used by these ioctls is defined as
struct winsize {
unsigned short ws_row;
unsigned short ws_col;
unsigned short ws_xpixel; /* unused */
unsigned short ws_ypixel; /* unused */
};
When the window size changes, a SIGWINCH signal is sent to the foreground process group.
Sending a break
TCSBRK
Argument: int arg
Equivalent to tcsendbreak(fd, arg).
If the terminal is using asynchronous serial data transmission, and arg is zero, then send a break (a stream of zero bits) for between 0.25 and 0.5 seconds. If the terminal is not using asynchronous serial data transmission, then either a break is sent, or the function returns without doing anything. When arg is nonzero, nobody knows what will happen.
(SVr4, UnixWare, Solaris, and Linux treat tcsendbreak(fd,arg) with nonzero arg like tcdrain(fd). SunOS treats arg as a multiplier, and sends a stream of bits arg times as long as done for zero arg. DG/UX and AIX treat arg (when nonzero) as a time interval measured in milliseconds. HP-UX ignores arg.)
TCSBRKP
Argument: int arg
So-called “POSIX version” of TCSBRK. It treats nonzero arg as a time interval measured in deciseconds, and does nothing when the driver does not support breaks.
TIOCSBRK
Argument: void
Turn break on, that is, start sending zero bits.
TIOCCBRK
Argument: void
Turn break off, that is, stop sending zero bits.
Software flow control
TCXONC
Argument: int arg
Equivalent to tcflow(fd, arg).
See tcflow(3) for the argument values TCOOFF, TCOON, TCIOFF, TCION.
Buffer count and flushing
FIONREAD
Argument: int *argp
Get the number of bytes in the input buffer.
TIOCINQ
Argument: int *argp
Same as FIONREAD.
TIOCOUTQ
Argument: int *argp
Get the number of bytes in the output buffer.
TCFLSH
Argument: int arg
Equivalent to tcflush(fd, arg).
See tcflush(3) for the argument values TCIFLUSH, TCOFLUSH, TCIOFLUSH.
TIOCSERGETLSR
Argument: int *argp
Get the line status register. The status register has the TIOCSER_TEMT bit set when the output buffer is empty and the hardware transmitter is also physically empty.
Not all serial tty drivers support this operation.
tcdrain(3) does not wait and returns immediately when the TIOCSER_TEMT bit is set.
Faking input
TIOCSTI
Argument: const char *argp
Insert the given byte in the input queue.
Since Linux 6.2, this operation may require the CAP_SYS_ADMIN capability (if the dev.tty.legacy_tiocsti sysctl variable is set to false).
Redirecting console output
TIOCCONS
Argument: void
Redirect output that would have gone to /dev/console or /dev/tty0 to the given terminal. If that was a pseudoterminal master, send it to the slave. Before Linux 2.6.10, anybody can do this as long as the output was not redirected yet; since Linux 2.6.10, only a process with the CAP_SYS_ADMIN capability may do this. If output was redirected already, then EBUSY is returned, but redirection can be stopped by using this ioctl with fd pointing at /dev/console or /dev/tty0.
Controlling terminal
TIOCSCTTY
Argument: int arg
Make the given terminal the controlling terminal of the calling process. The calling process must be a session leader and not have a controlling terminal already. For this case, arg should be specified as zero.
If this terminal is already the controlling terminal of a different session group, then the ioctl fails with EPERM, unless the caller has the CAP_SYS_ADMIN capability and arg equals 1, in which case the terminal is stolen, and all processes that had it as controlling terminal lose it.
TIOCNOTTY
Argument: void
If the given terminal was the controlling terminal of the calling process, give up this controlling terminal. If the process was session leader, then send SIGHUP and SIGCONT to the foreground process group and all processes in the current session lose their controlling terminal.
Process group and session ID
TIOCGPGRP
Argument: pid_t *argp
When successful, equivalent to *argp = tcgetpgrp(fd).
Get the process group ID of the foreground process group on this terminal.
TIOCSPGRP
Argument: const pid_t *argp
Equivalent to tcsetpgrp(fd, *argp).
Set the foreground process group ID of this terminal.
TIOCGSID
Argument: pid_t *argp
When successful, equivalent to *argp = tcgetsid(fd).
Get the session ID of the given terminal. This fails with the error ENOTTY if the terminal is not a master pseudoterminal and not our controlling terminal. Strange.
Exclusive mode
TIOCEXCL
Argument: void
Put the terminal into exclusive mode. No further open(2) operations on the terminal are permitted. (They fail with EBUSY, except for a process with the CAP_SYS_ADMIN capability.)
TIOCGEXCL
Argument: int *argp
(since Linux 3.8) If the terminal is currently in exclusive mode, place a nonzero value in the location pointed to by argp; otherwise, place zero in *argp.
TIOCNXCL
Argument: void
Disable exclusive mode.
Line discipline
TIOCGETD
Argument: int *argp
Get the line discipline of the terminal.
TIOCSETD
Argument: const int *argp
Set the line discipline of the terminal.
Pseudoterminal ioctls
TIOCPKT
Argument: const int *argp
Enable (when *argp is nonzero) or disable packet mode. Can be applied to the master side of a pseudoterminal only (and will return ENOTTY otherwise). In packet mode, each subsequent read(2) will return a packet that either contains a single nonzero control byte, or has a single byte containing zero ('\0') followed by data written on the slave side of the pseudoterminal. If the first byte is not TIOCPKT_DATA (0), it is an OR of one or more of the following bits:
TIOCPKT_FLUSHREAD | The read queue for the terminal is flushed. |
TIOCPKT_FLUSHWRITE | The write queue for the terminal is flushed. |
TIOCPKT_STOP | Output to the terminal is stopped. |
TIOCPKT_START | Output to the terminal is restarted. |
TIOCPKT_DOSTOP | The start and stop characters are ^S/^Q. |
TIOCPKT_NOSTOP | The start and stop characters are not ^S/^Q. |
While packet mode is in use, the presence of control status information to be read from the master side may be detected by a select(2) for exceptional conditions or a poll(2) for the POLLPRI event.
This mode is used by rlogin(1) and rlogind(8) to implement a remote-echoed, locally ^S/^Q flow-controlled remote login.
TIOCGPKT
Argument: const int *argp
(since Linux 3.8) Return the current packet mode setting in the integer pointed to by argp.
TIOCSPTLCK
Argument: int *argp
Set (if *argp is nonzero) or remove (if *argp is zero) the lock on the pseudoterminal slave device. (See also unlockpt(3).)
TIOCGPTLCK
Argument: int *argp
(since Linux 3.8) Place the current lock state of the pseudoterminal slave device in the location pointed to by argp.
TIOCGPTPEER
Argument: int flags
(since Linux 4.13) Given a file descriptor in fd that refers to a pseudoterminal master, open (with the given open(2)-style flags) and return a new file descriptor that refers to the peer pseudoterminal slave device. This operation can be performed regardless of whether the pathname of the slave device is accessible through the calling process’s mount namespace.
Security-conscious programs interacting with namespaces may wish to use this operation rather than open(2) with the pathname returned by ptsname(3), and similar library functions that have insecure APIs. (For example, confusion can occur in some cases using ptsname(3) with a pathname where a devpts filesystem has been mounted in a different mount namespace.)
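The following sketch shows one way to obtain the slave side through TIOCGPTPEER instead of opening the pathname from ptsname(3). The function name open_pty_pair is hypothetical, and the ENOSYS fallback for headers that lack TIOCGPTPEER is an assumption made for this example rather than documented behavior.

```c
#define _GNU_SOURCE             /* posix_openpt(), grantpt(), unlockpt() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Open a pseudoterminal pair: fds[0] is the master, fds[1] the slave.
   Returns 0 on success, or -1 with errno set. */
int
open_pty_pair(int fds[2])
{
    assert(fds != NULL);
#ifdef TIOCGPTPEER
    int master, slave, saved;

    master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master == -1)
        return -1;

    if (grantpt(master) == 0 && unlockpt(master) == 0) {
        /* Ask the kernel for the peer directly; no pathname is involved. */
        slave = ioctl(master, TIOCGPTPEER, O_RDWR | O_NOCTTY);
        if (slave != -1) {
            fds[0] = master;
            fds[1] = slave;
            return 0;
        }
    }

    saved = errno;              /* keep the first error across close() */
    close(master);
    errno = saved;
    return -1;
#else
    errno = ENOSYS;             /* assumption: headers predate Linux 4.13 */
    return -1;
#endif
}
```

The flags passed to TIOCGPTPEER are ordinary open(2) flags, so O_CLOEXEC can be added in the same way if the descriptor should not survive an execve(2).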
The BSD ioctls TIOCSTOP, TIOCSTART, TIOCUCNTL, and TIOCREMOTE have not been implemented under Linux.
Modem control
TIOCMGET
Argument: int *argp
Get the status of modem bits.
TIOCMSET
Argument: const int *argp
Set the status of modem bits.
TIOCMBIC
Argument: const int *argp
Clear the indicated modem bits.
TIOCMBIS
Argument: const int *argp
Set the indicated modem bits.
The following bits are used by the above ioctls:
TIOCM_LE | DSR (data set ready/line enable) |
TIOCM_DTR | DTR (data terminal ready) |
TIOCM_RTS | RTS (request to send) |
TIOCM_ST | Secondary TXD (transmit) |
TIOCM_SR | Secondary RXD (receive) |
TIOCM_CTS | CTS (clear to send) |
TIOCM_CAR | DCD (data carrier detect) |
TIOCM_CD | see TIOCM_CAR |
TIOCM_RNG | RNG (ring) |
TIOCM_RI | see TIOCM_RNG |
TIOCM_DSR | DSR (data set ready) |
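To make the bit table above concrete, here is a small sketch that turns a value returned by TIOCMGET into a readable list of names. The function modem_bits_to_str is a hypothetical helper, not part of any library; the aliases TIOCM_CD and TIOCM_RI are reported under the names TIOCM_CAR and TIOCM_RNG.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>          /* TIOCM_* constants */

/* Write the names of the modem bits set in bits into buf, separated by
   single spaces. Output is truncated if buf is too small. Returns buf. */
char *
modem_bits_to_str(int bits, char *buf, size_t size)
{
    static const struct { int bit; const char *name; } tbl[] = {
        { TIOCM_LE,  "LE"  }, { TIOCM_DTR, "DTR" }, { TIOCM_RTS, "RTS" },
        { TIOCM_ST,  "ST"  }, { TIOCM_SR,  "SR"  }, { TIOCM_CTS, "CTS" },
        { TIOCM_CAR, "CAR" }, { TIOCM_RNG, "RNG" }, { TIOCM_DSR, "DSR" },
    };

    assert(buf != NULL && size > 0);
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
        if (!(bits & tbl[i].bit))
            continue;
        size_t used = strlen(buf);
        snprintf(buf + used, size - used, "%s%s",
                 used ? " " : "", tbl[i].name);
    }
    return buf;
}
```

After a successful ioctl(fd, TIOCMGET, &bits), the value in bits can be passed straight to this helper for logging.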
TIOCMIWAIT
Argument: int arg
Wait for any of the 4 modem bits (DCD, RI, DSR, CTS) to change. The bits of interest are specified as a bit mask in arg, by ORing together any of the bit values, TIOCM_RNG, TIOCM_DSR, TIOCM_CD, and TIOCM_CTS. The caller should use TIOCGICOUNT to see which bit has changed.
TIOCGICOUNT
Argument: struct serial_icounter_struct *argp
Get counts of input serial line interrupts (DCD, RI, DSR, CTS). The counts are written to the serial_icounter_struct structure pointed to by argp.
Note: both 1->0 and 0->1 transitions are counted, except for RI, where only 0->1 transitions are counted.
Marking a line as local
TIOCGSOFTCAR
Argument: int *argp
(“Get software carrier flag”) Get the status of the CLOCAL flag in the c_cflag field of the termios structure.
TIOCSSOFTCAR
Argument: const int *argp
(“Set software carrier flag”) Set the CLOCAL flag in the termios structure when *argp is nonzero, and clear it otherwise.
If the CLOCAL flag for a line is off, the hardware carrier detect (DCD) signal is significant, and an open(2) of the corresponding terminal will block until DCD is asserted, unless the O_NONBLOCK flag is given. If CLOCAL is set, the line behaves as if DCD is always asserted. The software carrier flag is usually turned on for local devices, and is off for lines with modems.
Linux-specific
For the TIOCLINUX ioctl, see ioctl_console(2).
Kernel debugging
#include <linux/tty.h>
TIOCTTYGSTRUCT
Argument: struct tty_struct *argp
Get the tty_struct corresponding to fd. This operation was removed in Linux 2.5.67.
RETURN VALUE
The ioctl(2) system call returns 0 on success. On error, it returns -1 and sets errno to indicate the error.
ERRORS
EINVAL
Invalid operation parameter.
ENOIOCTLCMD
Unknown operation.
ENOTTY
Inappropriate fd.
EPERM
Insufficient permission.
EXAMPLES
Check the condition of DTR on the serial port.
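#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int
main(void)
{
    int fd, serial;

    fd = open("/dev/ttyS0", O_RDONLY);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }
    if (ioctl(fd, TIOCMGET, &serial) == -1) {
        perror("TIOCMGET");
        close(fd);
        exit(EXIT_FAILURE);
    }
    if (serial & TIOCM_DTR)
        puts("TIOCM_DTR is set");
    else
        puts("TIOCM_DTR is not set");
    close(fd);
    exit(EXIT_SUCCESS);
}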
Get or set arbitrary baudrate on the serial port.
/* SPDX-License-Identifier: GPL-2.0-or-later */
#include <asm/termbits.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
#if !defined BOTHER
    fprintf(stderr, "BOTHER is unsupported\n");
    /* Program may fallback to TCGETS/TCSETS with Bnnn constants */
    exit(EXIT_FAILURE);
#else
    /* Declare tio structure, its type depends on supported ioctl */
# if defined TCGETS2
    struct termios2 tio;
# else
    struct termios tio;
# endif
    int fd, rc;

    if (argc != 2 && argc != 3 && argc != 4) {
        fprintf(stderr, "Usage: %s device [output [input] ]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDWR | O_NONBLOCK | O_NOCTTY);
    if (fd < 0) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    /* Get the current serial port settings via supported ioctl */
# if defined TCGETS2
    rc = ioctl(fd, TCGETS2, &tio);
# else
    rc = ioctl(fd, TCGETS, &tio);
# endif
    if (rc) {
        perror("TCGETS");
        close(fd);
        exit(EXIT_FAILURE);
    }

    /* Change baud rate when more arguments were provided */
    if (argc == 3 || argc == 4) {
        /* Clear the current output baud rate and fill a new value */
        tio.c_cflag &= ~CBAUD;
        tio.c_cflag |= BOTHER;
        tio.c_ospeed = atoi(argv[2]);

        /* Clear the current input baud rate and fill a new value */
        tio.c_cflag &= ~(CBAUD << IBSHIFT);
        tio.c_cflag |= BOTHER << IBSHIFT;
        /* When 4th argument is not provided reuse output baud rate */
        tio.c_ispeed = (argc == 4) ? atoi(argv[3]) : atoi(argv[2]);

        /* Set new serial port settings via supported ioctl */
# if defined TCSETS2
        rc = ioctl(fd, TCSETS2, &tio);
# else
        rc = ioctl(fd, TCSETS, &tio);
# endif
        if (rc) {
            perror("TCSETS");
            close(fd);
            exit(EXIT_FAILURE);
        }

        /* And get new values which were really configured */
# if defined TCGETS2
        rc = ioctl(fd, TCGETS2, &tio);
# else
        rc = ioctl(fd, TCGETS, &tio);
# endif
        if (rc) {
            perror("TCGETS");
            close(fd);
            exit(EXIT_FAILURE);
        }
    }

    close(fd);

    printf("output baud rate: %u\n", tio.c_ospeed);
    printf("input baud rate: %u\n", tio.c_ispeed);

    exit(EXIT_SUCCESS);
#endif
}
SEE ALSO
ldattach(8), ioctl(2), ioctl_console(2), termios(3), pty(7)
324 - Linux cli command accept4
NAME
accept4 - accept a connection on a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int accept(int sockfd, struct sockaddr *_Nullable restrict addr,
socklen_t *_Nullable restrict addrlen);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/socket.h>
int accept4(int sockfd, struct sockaddr *_Nullable restrict addr,
socklen_t *_Nullable restrict addrlen, int flags);
DESCRIPTION
The accept() system call is used with connection-based socket types (SOCK_STREAM, SOCK_SEQPACKET). It extracts the first connection request on the queue of pending connections for the listening socket, sockfd, creates a new connected socket, and returns a new file descriptor referring to that socket. The newly created socket is not in the listening state. The original socket sockfd is unaffected by this call.
The argument sockfd is a socket that has been created with socket(2), bound to a local address with bind(2), and is listening for connections after a listen(2).
The argument addr is a pointer to a sockaddr structure. This structure is filled in with the address of the peer socket, as known to the communications layer. The exact format of the address returned in addr is determined by the socket’s address family (see socket(2) and the respective protocol man pages). When addr is NULL, nothing is filled in; in this case, addrlen is not used, and should also be NULL.
The addrlen argument is a value-result argument: the caller must initialize it to contain the size (in bytes) of the structure pointed to by addr; on return it will contain the actual size of the peer address.
The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
If no pending connections are present on the queue, and the socket is not marked as nonblocking, accept() blocks the caller until a connection is present. If the socket is marked nonblocking and no pending connections are present on the queue, accept() fails with the error EAGAIN or EWOULDBLOCK.
In order to be notified of incoming connections on a socket, you can use select(2), poll(2), or epoll(7). A readable event will be delivered when a new connection is attempted and you may then call accept() to get a socket for that connection. Alternatively, you can set the socket to deliver SIGIO when activity occurs on a socket; see socket(7) for details.
If flags is 0, then accept4() is the same as accept(). The following values can be bitwise ORed in flags to obtain different behavior:
SOCK_NONBLOCK
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
SOCK_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
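A minimal sketch of the two flags in use: the wrapper below accepts one connection with both SOCK_NONBLOCK and SOCK_CLOEXEC set, so no follow-up fcntl(2) calls are needed. The name accept_nb_cloexec and the choice to retry on EINTR are assumptions made for this example.

```c
#define _GNU_SOURCE             /* accept4() */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Accept one connection on listen_fd. The new socket is nonblocking and
   close-on-exec. peer and peer_len may both be NULL if the peer address
   is not wanted. Returns the new descriptor, or -1 with errno set. */
int
accept_nb_cloexec(int listen_fd, struct sockaddr_storage *peer,
                  socklen_t *peer_len)
{
    int fd;

    /* addr and addrlen are either both supplied or both NULL */
    assert((peer == NULL) == (peer_len == NULL));

    do {
        if (peer_len != NULL)
            *peer_len = sizeof(*peer);  /* value-result: buffer size in */
        fd = accept4(listen_fd, (struct sockaddr *) peer, peer_len,
                     SOCK_NONBLOCK | SOCK_CLOEXEC);
    } while (fd == -1 && errno == EINTR);

    return fd;
}
```

Because the flags are applied atomically by the kernel, there is no window in which another thread could fork(2) and execve(2) while the new descriptor still lacks FD_CLOEXEC.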
RETURN VALUE
On success, these system calls return a file descriptor for the accepted socket (a nonnegative integer). On error, -1 is returned, errno is set to indicate the error, and addrlen is left unchanged.
Error handling
Linux accept() (and accept4()) passes already-pending network errors on the new socket as an error code from accept(). This behavior differs from other BSD socket implementations. For reliable operation the application should detect the network errors defined for the protocol after accept() and treat them like EAGAIN by retrying. In the case of TCP/IP, these are ENETDOWN, EPROTO, ENOPROTOOPT, EHOSTDOWN, ENONET, EHOSTUNREACH, EOPNOTSUPP, and ENETUNREACH.
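The retry advice above can be sketched as a small classifier that an accept loop consults after a failed call. The function name accept_errno_is_transient is hypothetical; the list of errors is the TCP/IP list given in the text, plus EAGAIN and EWOULDBLOCK themselves.

```c
#include <assert.h>
#include <errno.h>

/* Return 1 if a failed accept() with this errno value should simply be
   retried (treated like EAGAIN), 0 if it is a real failure. */
int
accept_errno_is_transient(int err)
{
    assert(err >= 0);           /* errno values are never negative */

    switch (err) {
    case EAGAIN:
#if EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
    case ENETDOWN:
    case EPROTO:
    case ENOPROTOOPT:
    case EHOSTDOWN:
    case ENONET:
    case EHOSTUNREACH:
    case EOPNOTSUPP:
    case ENETUNREACH:
        return 1;
    default:
        return 0;
    }
}
```

A server loop would call this with the saved errno after accept() returns -1 and go back to waiting for readiness when it returns 1, rather than logging an error or shutting down the listener.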
ERRORS
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and no connections are present to be accepted. POSIX.1-2001 and POSIX.1-2008 allow either error to be returned for this case, and do not require these constants to have the same value, so a portable application should check for both possibilities.
EBADF
sockfd is not an open file descriptor.
ECONNABORTED
A connection has been aborted.
EFAULT
The addr argument is not in a writable part of the user address space.
EINTR
The system call was interrupted by a signal that was caught before a valid connection arrived; see signal(7).
EINVAL
Socket is not listening for connections, or addrlen is invalid (e.g., is negative).
EINVAL
(accept4()) invalid value in flags.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOBUFS
ENOMEM
Not enough free memory. This often means that the memory allocation is limited by the socket buffer limits, not by the system memory.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
EOPNOTSUPP
The referenced socket is not of type SOCK_STREAM.
EPERM
Firewall rules forbid connection.
EPROTO
Protocol error.
In addition, network errors for the new socket and as defined for the protocol may be returned. Various Linux kernels can return other errors such as ENOSR, ESOCKTNOSUPPORT, EPROTONOSUPPORT, ETIMEDOUT. The value ERESTARTSYS may be seen during a trace.
VERSIONS
On Linux, the new socket returned by accept() does not inherit file status flags such as O_NONBLOCK and O_ASYNC from the listening socket. This behavior differs from the canonical BSD sockets implementation. Portable programs should not rely on inheritance or noninheritance of file status flags and always explicitly set all required flags on the socket returned from accept().
STANDARDS
accept()
POSIX.1-2008.
accept4()
Linux.
HISTORY
accept()
POSIX.1-2001, SVr4, 4.4BSD (accept() first appeared in 4.2BSD).
accept4()
Linux 2.6.28, glibc 2.10.
NOTES
There may not always be a connection waiting after a SIGIO is delivered or select(2), poll(2), or epoll(7) return a readability event because the connection might have been removed by an asynchronous network error or another thread before accept() is called. If this happens, then the call will block waiting for the next connection to arrive. To ensure that accept() never blocks, the passed socket sockfd needs to have the O_NONBLOCK flag set (see socket(7)).
For certain protocols which require an explicit confirmation, such as DECnet, accept() can be thought of as merely dequeuing the next connection request and not implying confirmation. Confirmation can be implied by a normal read or write on the new file descriptor, and rejection can be implied by closing the new socket. Currently, only DECnet has these semantics on Linux.
The socklen_t type
In the original BSD sockets implementation (and on other older systems) the third argument of accept() was declared as an int *. A POSIX.1g draft standard wanted to change it into a size_t *; later POSIX standards and glibc 2.x have socklen_t *.
EXAMPLES
See bind(2).
SEE ALSO
bind(2), connect(2), listen(2), select(2), socket(2), socket(7)
325 - Linux cli command inotify_add_watch
NAME
inotify_add_watch - add a watch to an initialized inotify instance
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/inotify.h>
int inotify_add_watch(int fd, const char *pathname, uint32_t mask);
DESCRIPTION
inotify_add_watch() adds a new watch, or modifies an existing watch, for the file whose location is specified in pathname; the caller must have read permission for this file. The fd argument is a file descriptor referring to the inotify instance whose watch list is to be modified. The events to be monitored for pathname are specified in the mask bit-mask argument. See inotify(7) for a description of the bits that can be set in mask.
A successful call to inotify_add_watch() returns a unique watch descriptor for this inotify instance, for the filesystem object (inode) that corresponds to pathname. If the filesystem object was not previously being watched by this inotify instance, then the watch descriptor is newly allocated. If the filesystem object was already being watched (perhaps via a different link to the same object), then the descriptor for the existing watch is returned.
The watch descriptor is returned by later read(2)s from the inotify file descriptor. These reads fetch inotify_event structures (see inotify(7)) indicating filesystem events; the watch descriptor inside this structure identifies the object for which the event occurred.
RETURN VALUE
On success, inotify_add_watch() returns a watch descriptor (a nonnegative integer). On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
Read access to the given file is not permitted.
EBADF
The given file descriptor is not valid.
EEXIST
mask contains IN_MASK_CREATE and pathname refers to a file already being watched by the same fd.
EFAULT
pathname points outside of the process’s accessible address space.
EINVAL
The given event mask contains no valid events; or mask contains both IN_MASK_ADD and IN_MASK_CREATE; or fd is not an inotify file descriptor.
ENAMETOOLONG
pathname is too long.
ENOENT
A directory component in pathname does not exist or is a dangling symbolic link.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The user limit on the total number of inotify watches was reached or the kernel failed to allocate a needed resource.
ENOTDIR
mask contains IN_ONLYDIR and pathname is not a directory.
STANDARDS
Linux.
HISTORY
Linux 2.6.13.
EXAMPLES
See inotify(7).
SEE ALSO
inotify_init(2), inotify_rm_watch(2), inotify(7)
326 - Linux cli command capset
NAME
capset - set/get capabilities of thread(s)
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/capability.h> /* Definition of CAP_* and
_LINUX_CAPABILITY_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_capget, cap_user_header_t hdrp,
cap_user_data_t datap);
int syscall(SYS_capset, cap_user_header_t hdrp,
const cap_user_data_t datap);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
These two system calls are the raw kernel interface for getting and setting thread capabilities. They are specific to Linux, and the kernel API is likely to change: use of these system calls (in particular the format of the cap_user_*_t types) is subject to extension with each kernel revision, although old programs will keep working.
The portable interfaces are cap_set_proc(3) and cap_get_proc(3); if possible, you should use those interfaces in applications; see NOTES.
Current details
Now that you have been warned, some current kernel details. The structures are defined as follows.
#define _LINUX_CAPABILITY_VERSION_1 0x19980330
#define _LINUX_CAPABILITY_U32S_1 1
/* V2 added in Linux 2.6.25; deprecated */
#define _LINUX_CAPABILITY_VERSION_2 0x20071026
#define _LINUX_CAPABILITY_U32S_2 2
/* V3 added in Linux 2.6.26 */
#define _LINUX_CAPABILITY_VERSION_3 0x20080522
#define _LINUX_CAPABILITY_U32S_3 2
typedef struct __user_cap_header_struct {
    __u32 version;
    int   pid;
} *cap_user_header_t;

typedef struct __user_cap_data_struct {
    __u32 effective;
    __u32 permitted;
    __u32 inheritable;
} *cap_user_data_t;
The effective, permitted, and inheritable fields are bit masks of the capabilities defined in capabilities(7). Note that the CAP_* values are bit indexes and need to be bit-shifted before ORing into the bit fields. To define the structures for passing to the system call, you have to use the struct __user_cap_header_struct and struct __user_cap_data_struct names because the typedefs are only pointers.
Kernels prior to Linux 2.6.25 prefer 32-bit capabilities with version _LINUX_CAPABILITY_VERSION_1. Linux 2.6.25 added 64-bit capability sets, with version _LINUX_CAPABILITY_VERSION_2. There was, however, an API glitch, and Linux 2.6.26 added _LINUX_CAPABILITY_VERSION_3 to fix the problem.
Note that 64-bit capabilities use datap[0] and datap[1], whereas 32-bit capabilities use only datap[0].
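The structure usage and bit shifting described above can be sketched as follows, assuming the version 3 layout with two data elements. The names caps_get and cap_effective_has are hypothetical helpers written for this illustration; applications should normally prefer the libcap interfaces.

```c
#include <assert.h>
#include <linux/capability.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fetch the capability sets of thread pid (0 means the caller) using the
   version 3 format, which needs two data elements. Returns 0 or -1. */
int
caps_get(pid_t pid,
         struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3])
{
    /* The struct names are used because the typedefs are only pointers. */
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3,
        .pid = pid,
    };

    return (int) syscall(SYS_capget, &hdr, data);
}

/* Return 1 if capability number cap is in the effective set, else 0.
   CAP_* values are bit indexes: 0..31 live in data[0], 32..63 in data[1]. */
int
cap_effective_has(
    const struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3],
    int cap)
{
    assert(cap >= 0 && cap < 32 * _LINUX_CAPABILITY_U32S_3);
    return (data[cap / 32].effective >> (cap % 32)) & 1;
}
```

For example, after a successful caps_get(0, data), the call cap_effective_has(data, CAP_NET_ADMIN) reports whether the calling thread currently holds that capability in its effective set.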
On kernels that support file capabilities (VFS capabilities support), these system calls behave slightly differently. This support was added as an option in Linux 2.6.24, and became fixed (nonoptional) in Linux 2.6.33.
For capget() calls, one can probe the capabilities of any process by specifying its process ID with the hdrp->pid field value.
For details on the data, see capabilities(7).
With VFS capabilities support
VFS capabilities employ a file extended attribute (see xattr(7)) to allow capabilities to be attached to executables. This privilege model obsoletes kernel support for one process asynchronously setting the capabilities of another. That is, on kernels that have VFS capabilities support, when calling capset(), the only permitted values for hdrp->pid are 0 or, equivalently, the value returned by gettid(2).
Without VFS capabilities support
On older kernels that do not provide VFS capabilities support capset() can, if the caller has the CAP_SETPCAP capability, be used to change not only the caller’s own capabilities, but also the capabilities of other threads. The call operates on the capabilities of the thread specified by the pid field of hdrp when that is nonzero, or on the capabilities of the calling thread if pid is 0. If pid refers to a single-threaded process, then pid can be specified as a traditional process ID; operating on a thread of a multithreaded process requires a thread ID of the type returned by gettid(2). For capset(), pid can also be: -1, meaning perform the change on all threads except the caller and init(1); or a value less than -1, in which case the change is applied to all members of the process group whose ID is -pid.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
The calls fail with the error EINVAL, and set the version field of hdrp to the kernel preferred value of _LINUX_CAPABILITY_VERSION_? when an unsupported version value is specified. In this way, one can probe what the current preferred capability revision is.
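The probing technique mentioned above might look like the sketch below. The function name cap_probe_version is hypothetical, and using 0 as the deliberately unsupported version is an assumption of this example; the return value of the system call is not relied on, only the rewritten version field.

```c
#include <assert.h>
#include <linux/capability.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel which capability version it prefers.
   Returns 0 and stores the version in *version_out, or -1 if the kernel
   did not write a version back. */
int
cap_probe_version(__u32 *version_out)
{
    struct __user_cap_header_struct hdr = { .version = 0, .pid = 0 };

    assert(version_out != NULL);

    /* 0 is assumed not to be a supported version, so the kernel should
       overwrite hdr.version with its preferred value. datap may be NULL
       when probing. */
    (void) syscall(SYS_capget, &hdr, NULL);

    if (hdr.version == 0)
        return -1;
    *version_out = hdr.version;
    return 0;
}
```

A program could compare the stored value against _LINUX_CAPABILITY_VERSION_3 to decide how many data elements to allocate before making the real call.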
ERRORS
EFAULT
Bad memory address. hdrp must not be NULL. datap may be NULL only when the user is trying to determine the preferred capability version format supported by the kernel.
EINVAL
One of the arguments was invalid.
EPERM
An attempt was made to add a capability to the permitted set, or to set a capability in the effective set that is not in the permitted set.
EPERM
An attempt was made to add a capability to the inheritable set, and either:
that capability was not in the caller’s bounding set; or
the capability was not in the caller’s permitted set and the caller lacked the CAP_SETPCAP capability in its effective set.
EPERM
The caller attempted to use capset() to modify the capabilities of a thread other than itself, but lacked sufficient privilege. For kernels supporting VFS capabilities, this is never permitted. For kernels lacking VFS support, the CAP_SETPCAP capability is required. (A bug in kernels before Linux 2.6.11 meant that this error could also occur if a thread without this capability tried to change its own capabilities by specifying the pid field as a nonzero value (i.e., the value returned by getpid(2)) instead of 0.)
ESRCH
No such thread.
STANDARDS
Linux.
NOTES
The portable interface to the capability querying and setting functions is provided by the libcap library.
SEE ALSO
clone(2), gettid(2), capabilities(7)
327 - Linux cli command sched_setaffinity
NAME
sched_setaffinity - set and get a thread’s CPU affinity mask
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sched.h>
int sched_setaffinity(pid_t pid, size_t cpusetsize,
const cpu_set_t *mask);
int sched_getaffinity(pid_t pid, size_t cpusetsize,
cpu_set_t *mask);
DESCRIPTION
A thread’s CPU affinity mask determines the set of CPUs on which it is eligible to run. On a multiprocessor system, setting the CPU affinity mask can be used to obtain performance benefits. For example, by dedicating one CPU to a particular thread (i.e., setting the affinity mask of that thread to specify a single CPU, and setting the affinity mask of all other threads to exclude that CPU), it is possible to ensure maximum execution speed for that thread. Restricting a thread to run on a single CPU also avoids the performance cost caused by the cache invalidation that occurs when a thread ceases to execute on one CPU and then recommences execution on a different CPU.
A CPU affinity mask is represented by the cpu_set_t structure, a “CPU set”, pointed to by mask. A set of macros for manipulating CPU sets is described in CPU_SET(3).
sched_setaffinity() sets the CPU affinity mask of the thread whose ID is pid to the value specified by mask. If pid is zero, then the calling thread is used. The argument cpusetsize is the length (in bytes) of the data pointed to by mask. Normally this argument would be specified as sizeof(cpu_set_t).
If the thread specified by pid is not currently running on one of the CPUs specified in mask, then that thread is migrated to one of the CPUs specified in mask.
sched_getaffinity() writes the affinity mask of the thread whose ID is pid into the cpu_set_t structure pointed to by mask. The cpusetsize argument specifies the size (in bytes) of mask. If pid is zero, then the mask of the calling thread is returned.
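The two calls can be combined as in the sketch below, which reads the calling thread's mask and then restricts the thread to a single CPU. The function names first_allowed_cpu and pin_self_to_cpu are hypothetical, and the code assumes the kernel mask fits in a fixed-size cpu_set_t.

```c
#define _GNU_SOURCE             /* sched_*affinity(), CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on,
   or -1 if the mask could not be read. */
int
first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) == -1)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restrict the calling thread to one CPU.
   Returns 0 on success, or -1 with errno set. */
int
pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* pid 0: caller */
}
```

Picking a CPU that is already in the current mask avoids the EINVAL case where the requested set contains no CPU that the thread is permitted to use.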
RETURN VALUE
On success, sched_setaffinity() and sched_getaffinity() return 0 (but see “C library/kernel differences” below, which notes that the underlying sched_getaffinity() differs in its return value). On failure, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
A supplied memory address was invalid.
EINVAL
The affinity bit mask mask contains no processors that are currently physically on the system and permitted to the thread according to any restrictions that may be imposed by cpuset cgroups or the “cpuset” mechanism described in cpuset(7).
EINVAL
(sched_getaffinity() and, before Linux 2.6.9, sched_setaffinity()) cpusetsize is smaller than the size of the affinity mask used by the kernel.
EPERM
(sched_setaffinity()) The calling thread does not have appropriate privileges. The caller needs an effective user ID equal to the real user ID or effective user ID of the thread identified by pid, or it must possess the CAP_SYS_NICE capability in the user namespace of the thread pid.
ESRCH
The thread whose ID is pid could not be found.
STANDARDS
Linux.
HISTORY
Linux 2.5.8, glibc 2.3.
Initially, the glibc interfaces included a cpusetsize argument, typed as unsigned int. In glibc 2.3.3, the cpusetsize argument was removed, but was then restored in glibc 2.3.4, with type size_t.
NOTES
After a call to sched_setaffinity(), the set of CPUs on which the thread will actually run is the intersection of the set specified in the mask argument and the set of CPUs actually present on the system. The system may further restrict the set of CPUs on which the thread runs if the “cpuset” mechanism described in cpuset(7) is being used. These restrictions on the actual set of CPUs on which the thread will run are silently imposed by the kernel.
There are various ways of determining the number of CPUs available on the system, including: inspecting the contents of /proc/cpuinfo; using sysconf(3) to obtain the values of the _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN parameters; and inspecting the list of CPU directories under /sys/devices/system/cpu/.
sched(7) has a description of the Linux scheduling scheme.
The affinity mask is a per-thread attribute that can be adjusted independently for each of the threads in a thread group. The value returned from a call to gettid(2) can be passed in the argument pid. Specifying pid as 0 will set the attribute for the calling thread, and passing the value returned from a call to getpid(2) will set the attribute for the main thread of the thread group. (If you are using the POSIX threads API, then use pthread_setaffinity_np(3) instead of sched_setaffinity().)
The isolcpus boot option can be used to isolate one or more CPUs at boot time, so that no processes are scheduled onto those CPUs. Following the use of this boot option, the only way to schedule processes onto the isolated CPUs is via sched_setaffinity() or the cpuset(7) mechanism. For further information, see the kernel source file Documentation/admin-guide/kernel-parameters.txt. As noted in that file, isolcpus is the preferred mechanism of isolating CPUs (versus the alternative of manually setting the CPU affinity of all processes on the system).
A child created via fork(2) inherits its parent’s CPU affinity mask. The affinity mask is preserved across an execve(2).
C library/kernel differences
This manual page describes the glibc interface for the CPU affinity calls. The actual system call interface is slightly different, with the mask being typed as unsigned long *, reflecting the fact that the underlying implementation of CPU sets is a simple bit mask.
On success, the raw sched_getaffinity() system call returns the number of bytes copied into the mask buffer; this will be the minimum of cpusetsize and the size (in bytes) of the cpumask_t data type that is used internally by the kernel to represent the CPU set bit mask.
Handling systems with large CPU affinity masks
The underlying system calls (which represent CPU masks as bit masks of type unsigned long *) impose no restriction on the size of the CPU mask. However, the cpu_set_t data type used by glibc has a fixed size of 128 bytes, meaning that the maximum CPU number that can be represented is 1023. If the kernel CPU affinity mask is larger than 1024 bits, then calls of the form:
sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
fail with the error EINVAL, the error produced by the underlying system call for the case where the mask size specified in cpusetsize is smaller than the size of the affinity mask used by the kernel. (Depending on the system CPU topology, the kernel affinity mask can be substantially larger than the number of active CPUs in the system.)
When working on systems with large kernel CPU affinity masks, one must dynamically allocate the mask argument (see CPU_ALLOC(3)). Currently, the only way to do this is by probing for the size of the required mask using sched_getaffinity() calls with increasing mask sizes (until the call does not fail with the error EINVAL).
Be aware that CPU_ALLOC(3) may allocate a slightly larger CPU set than requested (because CPU sets are implemented as bit masks allocated in units of sizeof(long)). Consequently, sched_getaffinity() can set bits beyond the requested allocation size, because the kernel sees a few additional bits. Therefore, the caller should iterate over the bits in the returned set, counting those which are set, and stop upon reaching the value returned by CPU_COUNT(3) (rather than iterating over the number of bits requested to be allocated).
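The probing approach described above can be sketched as a loop that doubles the allocation until the call stops failing with EINVAL. The function name get_affinity_dynamic, the starting size of 1024 CPUs, and the upper bound of 2^20 CPUs are assumptions chosen for this example.

```c
#define _GNU_SOURCE             /* CPU_ALLOC() and the CPU_*_S macros */
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>

/* Fetch the calling thread's affinity mask into a dynamically sized set.
   On success, returns the set (release it with CPU_FREE) and stores its
   size in bytes in *size_out. On failure, returns NULL with errno set. */
cpu_set_t *
get_affinity_dynamic(size_t *size_out)
{
    assert(size_out != NULL);

    /* Double the set size while the kernel reports it as too small. */
    for (int ncpus = 1024; ncpus <= (1 << 20); ncpus *= 2) {
        size_t size = CPU_ALLOC_SIZE(ncpus);
        cpu_set_t *set = CPU_ALLOC(ncpus);
        int saved;

        if (set == NULL)
            return NULL;
        CPU_ZERO_S(size, set);

        if (sched_getaffinity(0, size, set) == 0) {
            *size_out = size;
            return set;
        }

        saved = errno;          /* CPU_FREE may clobber errno */
        CPU_FREE(set);
        errno = saved;
        if (saved != EINVAL)    /* only EINVAL means "mask too small" */
            return NULL;
    }

    errno = EINVAL;
    return NULL;
}
```

The returned size must be passed to the _S variants of the macros, for example CPU_COUNT_S(size, set) or CPU_ISSET_S(cpu, size, set), since the plain macros assume a fixed-size cpu_set_t.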
EXAMPLES
The program below creates a child process. The parent and child then each assign themselves to a specified CPU and execute identical loops that consume some CPU time. Before terminating, the parent waits for the child to complete. The program takes three command-line arguments: the CPU number for the parent, the CPU number for the child, and the number of loop iterations that both processes should perform.
As the sample runs below demonstrate, the amount of real and CPU time consumed when running the program will depend on intra-core caching effects and whether the processes are using the same CPU.
We first employ lscpu(1) to determine that this (x86) system has two cores, each with two CPUs:
$ lscpu | egrep -i 'core.*:|socket'
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
We then time the operation of the example program for three cases: both processes running on the same CPU; both processes running on different CPUs on the same core; and both processes running on different CPUs on different cores.
$ time -p ./a.out 0 0 100000000
real 14.75
user 3.02
sys 11.73
$ time -p ./a.out 0 1 100000000
real 11.52
user 3.98
sys 19.06
$ time -p ./a.out 0 3 100000000
real 7.89
user 3.29
sys 12.07
Program source
#define _GNU_SOURCE
#include <err.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    int           parentCPU, childCPU;
    cpu_set_t     set;
    unsigned int  nloops;

    if (argc != 4) {
        fprintf(stderr, "Usage: %s parent-cpu child-cpu num-loops\n",
                argv[0]);
        exit(EXIT_FAILURE);
    }

    parentCPU = atoi(argv[1]);
    childCPU = atoi(argv[2]);
    nloops = atoi(argv[3]);

    CPU_ZERO(&set);

    switch (fork()) {
    case -1:            /* Error */
        err(EXIT_FAILURE, "fork");

    case 0:             /* Child */
        CPU_SET(childCPU, &set);

        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1)
            err(EXIT_FAILURE, "sched_setaffinity");

        for (unsigned int j = 0; j < nloops; j++)
            getppid();

        exit(EXIT_SUCCESS);

    default:            /* Parent */
        CPU_SET(parentCPU, &set);

        if (sched_setaffinity(getpid(), sizeof(set), &set) == -1)
            err(EXIT_FAILURE, "sched_setaffinity");

        for (unsigned int j = 0; j < nloops; j++)
            getppid();

        wait(NULL);     /* Wait for child to terminate */
        exit(EXIT_SUCCESS);
    }
}
SEE ALSO
lscpu(1), nproc(1), taskset(1), clone(2), getcpu(2), getpriority(2), gettid(2), nice(2), sched_get_priority_max(2), sched_get_priority_min(2), sched_getscheduler(2), sched_setscheduler(2), setpriority(2), CPU_SET(3), get_nprocs(3), pthread_setaffinity_np(3), sched_getcpu(3), capabilities(7), cpuset(7), sched(7), numactl(8)
328 - Linux cli command memfd_create
NAME
memfd_create - create an anonymous file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/mman.h>
int memfd_create(const char *name, unsigned int flags);
DESCRIPTION
memfd_create() creates an anonymous file and returns a file descriptor that refers to it. The file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on. However, unlike a regular file, it lives in RAM and has a volatile backing storage. Once all references to the file are dropped, it is automatically released. Anonymous memory is used for all backing pages of the file. Therefore, files created by memfd_create() have the same semantics as other anonymous memory allocations such as those allocated using mmap(2) with the MAP_ANONYMOUS flag.
The initial size of the file is set to 0. Following the call, the file size should be set using ftruncate(2). (Alternatively, the file may be populated by calls to write(2) or similar.)
The name supplied in name is used as a filename and will be displayed as the target of the corresponding symbolic link in the directory /proc/self/fd/. The displayed name is always prefixed with memfd: and serves only for debugging purposes. Names do not affect the behavior of the file descriptor, and as such multiple files can have the same name without any side effects.
The following values may be bitwise ORed in flags to change the behavior of memfd_create():
MFD_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
MFD_ALLOW_SEALING
Allow sealing operations on this file. See the discussion of the F_ADD_SEALS and F_GET_SEALS operations in fcntl(2), and also NOTES, below. The initial set of seals is empty. If this flag is not set, the initial set of seals will be F_SEAL_SEAL, meaning that no other seals can be set on the file.
MFD_HUGETLB (since Linux 4.14)
The anonymous file will be created in the hugetlbfs filesystem using huge pages. See the Linux kernel source file Documentation/admin-guide/mm/hugetlbpage.rst for more information about hugetlbfs. Specifying both MFD_HUGETLB and MFD_ALLOW_SEALING in flags is supported since Linux 4.16.
MFD_HUGE_2MB
MFD_HUGE_1GB
...
Used in conjunction with MFD_HUGETLB to select alternative hugetlb page sizes (respectively, 2 MB, 1 GB, …) on systems that support multiple hugetlb page sizes. Definitions for known huge page sizes are included in the header file <linux/memfd.h>.
For details on encoding huge page sizes not included in the header file, see the discussion of the similarly named constants in mmap(2).
Unused bits in flags must be 0.
As its return value, memfd_create() returns a new file descriptor that can be used to refer to the file. This file descriptor is opened for both reading and writing (O_RDWR) and O_LARGEFILE is set for the file descriptor.
With respect to fork(2) and execve(2), the usual semantics apply for the file descriptor created by memfd_create(). A copy of the file descriptor is inherited by the child produced by fork(2) and refers to the same file. The file descriptor is preserved across execve(2), unless the close-on-exec flag has been set.
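As a rough illustration of the sequence described above (create the file, size it with ftruncate(2), then populate it through a mapping), consider the following sketch. The helper name make_memfd_with_message() and its parameters are hypothetical and exist only for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous file of 'len' bytes, copy 'msg'
   into it through a shared mapping, and return the file descriptor
   (or -1 on error, with errno set by the failing call). */
static int
make_memfd_with_message(const char *name, const char *msg, size_t len)
{
    assert(name != NULL && msg != NULL);
    assert(strlen(msg) < len);

    /* The new file starts out with size 0. */
    int fd = memfd_create(name, MFD_CLOEXEC);
    if (fd == -1)
        return -1;

    /* Set the file size before mapping it. */
    if (ftruncate(fd, (off_t) len) == -1) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    strcpy(p, msg);     /* Populate the file via the mapping */
    munmap(p, len);

    return fd;
}
```

The returned descriptor behaves like one for a regular file, so the data can afterward be read back with read(2) or pread(2), or mapped again.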
RETURN VALUE
On success, memfd_create() returns a new file descriptor. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
The address in name points to invalid memory.
EINVAL
flags included unknown bits.
EINVAL
name was too long. (The limit is 249 bytes, excluding the terminating null byte.)
EINVAL
Both MFD_HUGETLB and MFD_ALLOW_SEALING were specified in flags (before Linux 4.16; the combination is supported since Linux 4.16).
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOMEM
There was insufficient memory to create a new anonymous file.
EPERM
The MFD_HUGETLB flag was specified, but the caller was not privileged (did not have the CAP_IPC_LOCK capability) and is not a member of the sysctl_hugetlb_shm_group group; see the description of /proc/sys/vm/sysctl_hugetlb_shm_group in proc(5).
STANDARDS
Linux.
HISTORY
Linux 3.17, glibc 2.27.
NOTES
The memfd_create() system call provides a simple alternative to manually mounting a tmpfs(5) filesystem and creating and opening a file in that filesystem. The primary purpose of memfd_create() is to create files and associated file descriptors that are used with the file-sealing APIs provided by fcntl(2).
The memfd_create() system call also has uses without file sealing (which is why file sealing is disabled unless explicitly requested with the MFD_ALLOW_SEALING flag). In particular, it can be used as an alternative to creating files in /tmp or as an alternative to using the open(2) O_TMPFILE flag in cases where there is no intention to actually link the resulting file into the filesystem.
File sealing
In the absence of file sealing, processes that communicate via shared memory must either trust each other, or take measures to deal with the possibility that an untrusted peer may manipulate the shared memory region in problematic ways. For example, an untrusted peer might modify the contents of the shared memory at any time, or shrink the shared memory region. The former possibility leaves the local process vulnerable to time-of-check-to-time-of-use race conditions (typically dealt with by copying data from the shared memory region before checking and using it). The latter possibility leaves the local process vulnerable to SIGBUS signals when an attempt is made to access a now-nonexistent location in the shared memory region. (Dealing with this possibility necessitates the use of a handler for the SIGBUS signal.)
Dealing with untrusted peers imposes extra complexity on code that employs shared memory. Memory sealing enables that extra complexity to be eliminated, by allowing a process to operate secure in the knowledge that its peer can’t modify the shared memory in an undesired fashion.
An example of the usage of the sealing mechanism is as follows:
1. The first process creates a tmpfs(5) file using memfd_create(). The call yields a file descriptor used in subsequent steps.
2. The first process sizes the file created in the previous step using ftruncate(2), maps it using mmap(2), and populates the shared memory with the desired data.
3. The first process uses the fcntl(2) F_ADD_SEALS operation to place one or more seals on the file, in order to restrict further modifications on the file. (If placing the seal F_SEAL_WRITE, then it will be necessary to first unmap the shared writable mapping created in the previous step. Otherwise, behavior similar to F_SEAL_WRITE can be achieved by using F_SEAL_FUTURE_WRITE, which will prevent future writes via mmap(2) and write(2) from succeeding while keeping existing shared writable mappings.)
4. A second process obtains a file descriptor for the tmpfs(5) file and maps it. Among the possible ways in which this could happen are the following:
   - The process that called memfd_create() could transfer the resulting file descriptor to the second process via a UNIX domain socket (see unix(7) and cmsg(3)). The second process then maps the file using mmap(2).
   - The second process is created via fork(2) and thus automatically inherits the file descriptor and mapping. (Note that in this case and the next, there is a natural trust relationship between the two processes, since they are running under the same user ID. Therefore, file sealing would not normally be necessary.)
   - The second process opens the file /proc/<pid>/fd/<fd>, where <pid> is the PID of the first process (the one that called memfd_create()), and <fd> is the number of the file descriptor returned by the call to memfd_create() in that process. The second process then maps the file using mmap(2).
5. The second process uses the fcntl(2) F_GET_SEALS operation to retrieve the bit mask of seals that has been applied to the file. This bit mask can be inspected in order to determine what kinds of restrictions have been placed on file modifications. If desired, the second process can apply further seals to impose additional restrictions (so long as the F_SEAL_SEAL seal has not yet been applied).
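The producer and consumer sides of this mechanism might look roughly like the sketch below. It is a simplified illustration, not the page's own example: the helper names create_sealed_memfd() and has_required_seals() are hypothetical, the particular set of seals is an arbitrary choice, and the mapping and data-population step is omitted (so F_SEAL_WRITE can be applied without first unmapping anything).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Producer side (hypothetical helper): create the file, size it, and then
   seal it against resizing, writing, and further sealing.
   Returns the file descriptor, or -1 on error. */
static int
create_sealed_memfd(const char *name, off_t len)
{
    assert(name != NULL && len > 0);

    int fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd == -1)
        return -1;

    if (ftruncate(fd, len) == -1) {
        close(fd);
        return -1;
    }

    /* No writable mapping exists here, so F_SEAL_WRITE can be placed. */
    if (fcntl(fd, F_ADD_SEALS,
              F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL) == -1) {
        close(fd);
        return -1;
    }

    return fd;
}

/* Consumer side (hypothetical helper): accept the file only if the seals
   this process relies on are present. Returns 1 if so, 0 otherwise. */
static int
has_required_seals(int fd)
{
    const int required = F_SEAL_SHRINK | F_SEAL_WRITE;
    int seals = fcntl(fd, F_GET_SEALS);

    if (seals == -1)
        return 0;

    return (seals & required) == required;
}
```

Once has_required_seals() returns 1, the consumer can map the file read-only and use its contents without copying them first, since the file can be neither shrunk nor modified.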
EXAMPLES
Below are shown two example programs that demonstrate the use of memfd_create() and the file sealing API.
The first program, t_memfd_create.c, creates a tmpfs(5) file using memfd_create(), sets a size for the file, maps it into memory, and optionally places some seals on the file. The program accepts up to three command-line arguments, of which the first two are required. The first argument is the name to associate with the file, the second argument is the size to be set for the file, and the optional third argument is a string of characters that specify seals to be set on the file.
The second program, t_get_seals.c, can be used to open an existing file that was created via memfd_create() and inspect the set of seals that have been applied to that file.
The following shell session demonstrates the use of these programs. First we create a tmpfs(5) file and set some seals on it:
$ ./t_memfd_create my_memfd_file 4096 sw &
[1] 11775
PID: 11775; fd: 3; /proc/11775/fd/3
At this point, the t_memfd_create program continues to run in the background. From another program, we can obtain a file descriptor for the file created by memfd_create() by opening the /proc/pid/fd file that corresponds to the file descriptor opened by memfd_create(). Using that pathname, we inspect the content of the /proc/pid/fd symbolic link, and use our t_get_seals program to view the seals that have been placed on the file:
$ readlink /proc/11775/fd/3
/memfd:my_memfd_file (deleted)
$ ./t_get_seals /proc/11775/fd/3
Existing seals: WRITE SHRINK
Program source: t_memfd_create.c
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    int           fd;
    char          *name, *seals_arg;
    ssize_t       len;
    unsigned int  seals;

    if (argc < 3) {
        fprintf(stderr, "%s name size [seals]\n", argv[0]);
        fprintf(stderr, "\t'seals' can contain any of the "
                "following characters:\n");
        fprintf(stderr, "\t\tg - F_SEAL_GROW\n");
        fprintf(stderr, "\t\ts - F_SEAL_SHRINK\n");
        fprintf(stderr, "\t\tw - F_SEAL_WRITE\n");
        fprintf(stderr, "\t\tW - F_SEAL_FUTURE_WRITE\n");
        fprintf(stderr, "\t\tS - F_SEAL_SEAL\n");
        exit(EXIT_FAILURE);
    }

    name = argv[1];
    len = atoi(argv[2]);
    seals_arg = argv[3];

    /* Create an anonymous file in tmpfs; allow seals to be
       placed on the file. */

    fd = memfd_create(name, MFD_ALLOW_SEALING);
    if (fd == -1)
        err(EXIT_FAILURE, "memfd_create");

    /* Size the file as specified on the command line. */

    if (ftruncate(fd, len) == -1)
        err(EXIT_FAILURE, "truncate");

    printf("PID: %jd; fd: %d; /proc/%jd/fd/%d\n",
           (intmax_t) getpid(), fd, (intmax_t) getpid(), fd);

    /* Code to map the file and populate the mapping with data
       omitted. */

    /* If a 'seals' command-line argument was supplied, set some
       seals on the file. */

    if (seals_arg != NULL) {
        seals = 0;

        if (strchr(seals_arg, 'g') != NULL)
            seals |= F_SEAL_GROW;
        if (strchr(seals_arg, 's') != NULL)
            seals |= F_SEAL_SHRINK;
        if (strchr(seals_arg, 'w') != NULL)
            seals |= F_SEAL_WRITE;
        if (strchr(seals_arg, 'W') != NULL)
            seals |= F_SEAL_FUTURE_WRITE;
        if (strchr(seals_arg, 'S') != NULL)
            seals |= F_SEAL_SEAL;

        if (fcntl(fd, F_ADD_SEALS, seals) == -1)
            err(EXIT_FAILURE, "fcntl");
    }

    /* Keep running, so that the file created by memfd_create()
       continues to exist. */

    pause();

    exit(EXIT_SUCCESS);
}
Program source: t_get_seals.c
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
int
main(int argc, char *argv[])
{
    int           fd;
    unsigned int  seals;

    if (argc != 2) {
        fprintf(stderr, "%s /proc/PID/fd/FD\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDWR);
    if (fd == -1)
        err(EXIT_FAILURE, "open");

    seals = fcntl(fd, F_GET_SEALS);
    if (seals == -1)
        err(EXIT_FAILURE, "fcntl");

    printf("Existing seals:");
    if (seals & F_SEAL_SEAL)
        printf(" SEAL");
    if (seals & F_SEAL_GROW)
        printf(" GROW");
    if (seals & F_SEAL_WRITE)
        printf(" WRITE");
    if (seals & F_SEAL_FUTURE_WRITE)
        printf(" FUTURE_WRITE");
    if (seals & F_SEAL_SHRINK)
        printf(" SHRINK");
    printf("\n");

    /* Code to map the file and access the contents of the
       resulting mapping omitted. */

    exit(EXIT_SUCCESS);
}
SEE ALSO
fcntl(2), ftruncate(2), memfd_secret(2), mmap(2), shmget(2), shm_open(3)
329 - Linux cli command io_cancel
NAME
io_cancel - cancel an outstanding asynchronous I/O operation
LIBRARY
Standard C library (libc, -lc)
Alternatively, Asynchronous I/O library (libaio, -laio); see VERSIONS.
SYNOPSIS
#include <linux/aio_abi.h> /* Definition of needed types */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_io_cancel, aio_context_t ctx_id, struct iocb *iocb,
struct io_event *result);
DESCRIPTION
Note: this page describes the raw Linux system call interface. The wrapper function provided by libaio uses a different type for the ctx_id argument. See VERSIONS.
The io_cancel() system call attempts to cancel an asynchronous I/O operation previously submitted with io_submit(2). The iocb argument describes the operation to be canceled and the ctx_id argument is the AIO context to which the operation was submitted. If the operation is successfully canceled, the event will be copied into the memory pointed to by result without being placed into the completion queue.
RETURN VALUE
On success, io_cancel() returns 0. For the failure return, see VERSIONS.
ERRORS
EAGAIN
The iocb specified was not canceled.
EFAULT
One of the data structures points to invalid data.
EINVAL
The AIO context specified by ctx_id is invalid.
ENOSYS
io_cancel() is not implemented on this architecture.
VERSIONS
You probably want to use the io_cancel() wrapper function provided by libaio.
Note that the libaio wrapper function uses a different type (io_context_t) for the ctx_id argument. Note also that the libaio wrapper does not follow the usual C library conventions for indicating errors: on error it returns a negated error number (the negative of one of the values listed in ERRORS). If the system call is invoked via syscall(2), then the return value follows the usual conventions for indicating an error: -1, with errno set to a (positive) value that indicates the error.
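The two error conventions described above can be contrasted in a short sketch. The function names raw_io_cancel() and io_cancel_negerrno() are hypothetical names chosen for this example; the second function only imitates the negated-error-number convention and is not the libaio wrapper itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/aio_abi.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Raw system call (hypothetical wrapper name): follows the usual
   convention of returning -1 with errno set to a positive error value. */
static int
raw_io_cancel(aio_context_t ctx_id, struct iocb *iocb,
              struct io_event *result)
{
    assert(iocb != NULL && result != NULL);

    return (int) syscall(SYS_io_cancel, ctx_id, iocb, result);
}

/* Same call, but reporting failure as a negated error number, which is
   the convention the page describes for the libaio wrapper. */
static int
io_cancel_negerrno(aio_context_t ctx_id, struct iocb *iocb,
                   struct io_event *result)
{
    if (raw_io_cancel(ctx_id, iocb, result) == -1)
        return -errno;

    return 0;
}
```

Code that mixes the two interfaces needs to keep track of which convention a given call site uses, because testing for -1 is only meaningful with the raw system call.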
STANDARDS
Linux.
HISTORY
Linux 2.5.
SEE ALSO
io_destroy(2), io_getevents(2), io_setup(2), io_submit(2), aio(7)
330 - Linux cli command fstatfs
NAME
statfs, fstatfs - get filesystem statistics
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/vfs.h> /* or <sys/statfs.h> */
int statfs(const char *path, struct statfs *buf);
int fstatfs(int fd, struct statfs *buf);
Unless you need the f_type field, you should use the standard statvfs(3) interface instead.
DESCRIPTION
The statfs() system call returns information about a mounted filesystem. path is the pathname of any file within the mounted filesystem. buf is a pointer to a statfs structure defined approximately as follows:
struct statfs {
__fsword_t f_type; /* Type of filesystem (see below) */
__fsword_t f_bsize; /* Optimal transfer block size */
fsblkcnt_t f_blocks; /* Total data blocks in filesystem */
fsblkcnt_t f_bfree; /* Free blocks in filesystem */
fsblkcnt_t f_bavail; /* Free blocks available to
unprivileged user */
fsfilcnt_t f_files; /* Total inodes in filesystem */
fsfilcnt_t f_ffree; /* Free inodes in filesystem */
fsid_t f_fsid; /* Filesystem ID */
__fsword_t f_namelen; /* Maximum length of filenames */
__fsword_t f_frsize; /* Fragment size (since Linux 2.6) */
__fsword_t f_flags; /* Mount flags of filesystem
(since Linux 2.6.36) */
__fsword_t f_spare[xxx];
/* Padding bytes reserved for future use */
};
The following filesystem types may appear in f_type:
ADFS_SUPER_MAGIC 0xadf5
AFFS_SUPER_MAGIC 0xadff
AFS_SUPER_MAGIC 0x5346414f
ANON_INODE_FS_MAGIC 0x09041934 /* Anonymous inode FS (for
pseudofiles that have no name;
e.g., epoll, signalfd, bpf) */
AUTOFS_SUPER_MAGIC 0x0187
BDEVFS_MAGIC 0x62646576
BEFS_SUPER_MAGIC 0x42465331
BFS_MAGIC 0x1badface
BINFMTFS_MAGIC 0x42494e4d
BPF_FS_MAGIC 0xcafe4a11
BTRFS_SUPER_MAGIC 0x9123683e
BTRFS_TEST_MAGIC 0x73727279
CGROUP_SUPER_MAGIC 0x27e0eb /* Cgroup pseudo FS */
CGROUP2_SUPER_MAGIC 0x63677270 /* Cgroup v2 pseudo FS */
CIFS_MAGIC_NUMBER 0xff534d42
CODA_SUPER_MAGIC 0x73757245
COH_SUPER_MAGIC 0x012ff7b7
CRAMFS_MAGIC 0x28cd3d45
DEBUGFS_MAGIC 0x64626720
DEVFS_SUPER_MAGIC 0x1373 /* Linux 2.6.17 and earlier */
DEVPTS_SUPER_MAGIC 0x1cd1
ECRYPTFS_SUPER_MAGIC 0xf15f
EFIVARFS_MAGIC 0xde5e81e4
EFS_SUPER_MAGIC 0x00414a53
EXT_SUPER_MAGIC 0x137d /* Linux 2.0 and earlier */
EXT2_OLD_SUPER_MAGIC 0xef51
EXT2_SUPER_MAGIC 0xef53
EXT3_SUPER_MAGIC 0xef53
EXT4_SUPER_MAGIC 0xef53
F2FS_SUPER_MAGIC 0xf2f52010
FUSE_SUPER_MAGIC 0x65735546
FUTEXFS_SUPER_MAGIC 0xbad1dea /* Unused */
HFS_SUPER_MAGIC 0x4244
HOSTFS_SUPER_MAGIC 0x00c0ffee
HPFS_SUPER_MAGIC 0xf995e849
HUGETLBFS_MAGIC 0x958458f6
ISOFS_SUPER_MAGIC 0x9660
JFFS2_SUPER_MAGIC 0x72b6
JFS_SUPER_MAGIC 0x3153464a
MINIX_SUPER_MAGIC 0x137f /* original minix FS */
MINIX_SUPER_MAGIC2 0x138f /* 30 char minix FS */
MINIX2_SUPER_MAGIC 0x2468 /* minix V2 FS */
MINIX2_SUPER_MAGIC2 0x2478 /* minix V2 FS, 30 char names */
MINIX3_SUPER_MAGIC 0x4d5a /* minix V3 FS, 60 char names */
MQUEUE_MAGIC 0x19800202 /* POSIX message queue FS */
MSDOS_SUPER_MAGIC 0x4d44
MTD_INODE_FS_MAGIC 0x11307854
NCP_SUPER_MAGIC 0x564c
NFS_SUPER_MAGIC 0x6969
NILFS_SUPER_MAGIC 0x3434
NSFS_MAGIC 0x6e736673
NTFS_SB_MAGIC 0x5346544e
OCFS2_SUPER_MAGIC 0x7461636f
OPENPROM_SUPER_MAGIC 0x9fa1
OVERLAYFS_SUPER_MAGIC 0x794c7630
PIPEFS_MAGIC 0x50495045
PROC_SUPER_MAGIC 0x9fa0 /* /proc FS */
PSTOREFS_MAGIC 0x6165676c
QNX4_SUPER_MAGIC 0x002f
QNX6_SUPER_MAGIC 0x68191122
RAMFS_MAGIC 0x858458f6
REISERFS_SUPER_MAGIC 0x52654973
ROMFS_MAGIC 0x7275
SECURITYFS_MAGIC 0x73636673
SELINUX_MAGIC 0xf97cff8c
SMACK_MAGIC 0x43415d53
SMB_SUPER_MAGIC 0x517b
SMB2_MAGIC_NUMBER 0xfe534d42
SOCKFS_MAGIC 0x534f434b
SQUASHFS_MAGIC 0x73717368
SYSFS_MAGIC 0x62656572
SYSV2_SUPER_MAGIC 0x012ff7b6
SYSV4_SUPER_MAGIC 0x012ff7b5
TMPFS_MAGIC 0x01021994
TRACEFS_MAGIC 0x74726163
UDF_SUPER_MAGIC 0x15013346
UFS_MAGIC 0x00011954
USBDEVICE_SUPER_MAGIC 0x9fa2
V9FS_MAGIC 0x01021997
VXFS_SUPER_MAGIC 0xa501fcf5
XENFS_SUPER_MAGIC 0xabba1974
XENIX_SUPER_MAGIC 0x012ff7b4
XFS_SUPER_MAGIC 0x58465342
_XIAFS_SUPER_MAGIC 0x012fd16d /* Linux 2.0 and earlier */
Most of these MAGIC constants are defined in /usr/include/linux/magic.h, and some are hardcoded in kernel sources.
The f_flags field is a bit mask indicating mount options for the filesystem. It contains zero or more of the following bits:
ST_MANDLOCK
Mandatory locking is permitted on the filesystem (see fcntl(2)).
ST_NOATIME
Do not update access times; see mount(2).
ST_NODEV
Disallow access to device special files on this filesystem.
ST_NODIRATIME
Do not update directory access times; see mount(2).
ST_NOEXEC
Execution of programs is disallowed on this filesystem.
ST_NOSUID
The set-user-ID and set-group-ID bits are ignored by exec(3) for executable files on this filesystem.
ST_RDONLY
This filesystem is mounted read-only.
ST_RELATIME
Update atime relative to mtime/ctime; see mount(2).
ST_SYNCHRONOUS
Writes are synched to the filesystem immediately (see the description of O_SYNC in open(2)).
ST_NOSYMFOLLOW (since Linux 5.10)
Symbolic links are not followed when resolving paths; see mount(2).
Nobody knows what f_fsid is supposed to contain (but see below).
Fields that are undefined for a particular filesystem are set to 0.
fstatfs() returns the same information about an open file referenced by descriptor fd.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
(statfs()) Search permission is denied for a component of the path prefix of path. (See also path_resolution(7).)
EBADF
(fstatfs()) fd is not a valid open file descriptor.
EFAULT
buf or path points to an invalid address.
EINTR
The call was interrupted by a signal; see signal(7).
EIO
An I/O error occurred while reading from the filesystem.
ELOOP
(statfs()) Too many symbolic links were encountered in translating path.
ENAMETOOLONG
(statfs()) path is too long.
ENOENT
(statfs()) The file referred to by path does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOSYS
The filesystem does not support this call.
ENOTDIR
(statfs()) A component of the path prefix of path is not a directory.
EOVERFLOW
Some values were too large to be represented in the returned struct.
VERSIONS
The f_fsid field
Solaris, Irix, and POSIX have a system call statvfs(2) that returns a struct statvfs (defined in <sys/statvfs.h>) containing an unsigned long f_fsid. Linux, SunOS, HP-UX, 4.4BSD have a system call statfs() that returns a struct statfs (defined in <sys/vfs.h>) containing a fsid_t f_fsid, where fsid_t is defined as struct { int val[2]; }. The same holds for FreeBSD, except that it uses the include file <sys/mount.h>.
The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type. Several operating systems restrict giving out the f_fsid field to the superuser only (and zero it for unprivileged users), because this field is used in the filehandle of the filesystem when NFS-exported, and giving it out is a security concern.
Under some operating systems, the fsid can be used as the second argument to the sysfs(2) system call.
STANDARDS
Linux.
HISTORY
The Linux statfs() was inspired by the 4.4BSD one (but they do not use the same structure).
The original Linux statfs() and fstatfs() system calls were not designed with extremely large file sizes in mind. Subsequently, Linux 2.6 added new statfs64() and fstatfs64() system calls that employ a new structure, statfs64. The new structure contains the same fields as the original statfs structure, but the sizes of various fields are increased, to accommodate large file sizes. The glibc statfs() and fstatfs() wrapper functions transparently deal with the kernel differences.
LSB has deprecated the library calls statfs() and fstatfs() and tells us to use statvfs(3) and fstatvfs(3) instead.
NOTES
The __fsword_t type used for various fields in the statfs structure definition is a glibc internal type, not intended for public use. This leaves the programmer in a bit of a conundrum when trying to copy or compare these fields to local variables in a program. Using unsigned int for such variables suffices on most systems.
Some systems have only <sys/vfs.h>, other systems also have <sys/statfs.h>, where the former includes the latter. So it seems including the former is the best choice.
BUGS
From Linux 2.6.38 up to and including Linux 3.1, fstatfs() failed with the error ENOSYS for file descriptors created by pipe(2).
SEE ALSO
stat(2), statvfs(3), path_resolution(7)
331 - Linux cli command fanotify_init
NAME
fanotify_init - create and initialize fanotify group
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h> /* Definition of O_* constants */
#include <sys/fanotify.h>
int fanotify_init(unsigned int flags, unsigned int event_f_flags);
DESCRIPTION
For an overview of the fanotify API, see fanotify(7).
fanotify_init() initializes a new fanotify group and returns a file descriptor for the event queue associated with the group.
The file descriptor is used in calls to fanotify_mark(2) to specify the files, directories, mounts, or filesystems for which fanotify events shall be created. These events are received by reading from the file descriptor. Some events are only informative, indicating that a file has been accessed. Other events can be used to determine whether another application is permitted to access a file or directory. Permission to access filesystem objects is granted by writing to the file descriptor.
Multiple programs may be using the fanotify interface at the same time to monitor the same files.
The number of fanotify groups per user is limited. See fanotify(7) for details about this limit.
The flags argument contains a multi-bit field defining the notification class of the listening application and further single bit fields specifying the behavior of the file descriptor.
If multiple listeners for permission events exist, the notification class is used to establish the sequence in which the listeners receive the events.
Only one of the following notification classes may be specified in flags:
FAN_CLASS_PRE_CONTENT
This value allows the receipt of events notifying that a file has been accessed and events for permission decisions if a file may be accessed. It is intended for event listeners that need to access files before they contain their final data. This notification class might be used by hierarchical storage managers, for example. Use of this flag requires the CAP_SYS_ADMIN capability.
FAN_CLASS_CONTENT
This value allows the receipt of events notifying that a file has been accessed and events for permission decisions if a file may be accessed. It is intended for event listeners that need to access files when they already contain their final content. This notification class might be used by malware detection programs, for example. Use of this flag requires the CAP_SYS_ADMIN capability.
FAN_CLASS_NOTIF
This is the default value. It does not need to be specified. This value only allows the receipt of events notifying that a file has been accessed. Permission decisions before the file is accessed are not possible.
Listeners with different notification classes will receive events in the order FAN_CLASS_PRE_CONTENT, FAN_CLASS_CONTENT, FAN_CLASS_NOTIF. The order of notification for listeners in the same notification class is undefined.
The following bits can additionally be set in flags:
FAN_CLOEXEC
Set the close-on-exec flag (FD_CLOEXEC) on the new file descriptor. See the description of the O_CLOEXEC flag in open(2).
FAN_NONBLOCK
Enable the nonblocking flag (O_NONBLOCK) for the file descriptor. Reading from the file descriptor will not block. Instead, if no data is available, read(2) fails with the error EAGAIN.
FAN_UNLIMITED_QUEUE
Remove the limit on the number of events in the event queue. See fanotify(7) for details about this limit. Use of this flag requires the CAP_SYS_ADMIN capability.
FAN_UNLIMITED_MARKS
Remove the limit on the number of fanotify marks per user. See fanotify(7) for details about this limit. Use of this flag requires the CAP_SYS_ADMIN capability.
FAN_REPORT_TID (since Linux 4.20)
Report thread ID (TID) instead of process ID (PID) in the pid field of the struct fanotify_event_metadata supplied to read(2) (see fanotify(7)). Use of this flag requires the CAP_SYS_ADMIN capability.
FAN_ENABLE_AUDIT (since Linux 4.15)
Enable generation of audit log records about access mediation performed by permission events. The permission event response has to be marked with the FAN_AUDIT flag for an audit log record to be generated. Use of this flag requires the CAP_AUDIT_WRITE capability.
FAN_REPORT_FID (since Linux 5.1)
This value allows the receipt of events which contain additional information about the underlying filesystem object correlated to an event. An additional record of type FAN_EVENT_INFO_TYPE_FID encapsulates the information about the object and is included alongside the generic event metadata structure. The file descriptor that is used to represent the object correlated to an event is instead substituted with a file handle. It is intended for applications that may find the use of a file handle to identify an object more suitable than a file descriptor. Additionally, it may be used for applications monitoring a directory or a filesystem that are interested in the directory entry modification events FAN_CREATE, FAN_DELETE, FAN_MOVE, and FAN_RENAME, or in events such as FAN_ATTRIB, FAN_DELETE_SELF, and FAN_MOVE_SELF. All the events above require an fanotify group that identifies filesystem objects by file handles. Note that without the flag FAN_REPORT_TARGET_FID, for the directory entry modification events, there is an information record that identifies the modified directory and not the created/deleted/moved child object. The use of FAN_CLASS_CONTENT or FAN_CLASS_PRE_CONTENT is not permitted with this flag and will result in the error EINVAL. See fanotify(7) for additional details.
FAN_REPORT_DIR_FID (since Linux 5.9)
Events for fanotify groups initialized with this flag will contain (see exceptions below) additional information about a directory object correlated to an event. An additional record of type FAN_EVENT_INFO_TYPE_DFID encapsulates the information about the directory object and is included alongside the generic event metadata structure. For events that occur on a non-directory object, the additional structure includes a file handle that identifies the parent directory filesystem object. Note that there is no guarantee that the directory filesystem object will be found at the location described by the file handle information at the time the event is received. When combined with the flag FAN_REPORT_FID, two records may be reported with events that occur on a non-directory object, one to identify the non-directory object itself and one to identify the parent directory object. Note that in some cases, a filesystem object does not have a parent, for example, when an event occurs on an unlinked but open file. In that case, with the FAN_REPORT_FID flag, the event will be reported with only one record to identify the non-directory object itself, because there is no directory associated with the event. Without the FAN_REPORT_FID flag, no event will be reported. See fanotify(7) for additional details.
FAN_REPORT_NAME (since Linux 5.9)
Events for fanotify groups initialized with this flag will contain additional information about the name of the directory entry correlated to an event. This flag must be provided in conjunction with the flag FAN_REPORT_DIR_FID. Providing this flag value without FAN_REPORT_DIR_FID will result in the error EINVAL. This flag may be combined with the flag FAN_REPORT_FID. An additional record of type FAN_EVENT_INFO_TYPE_DFID_NAME, which encapsulates the information about the directory entry, is included alongside the generic event metadata structure and substitutes the additional information record of type FAN_EVENT_INFO_TYPE_DFID. The additional record includes a file handle that identifies a directory filesystem object followed by a name that identifies an entry in that directory. For the directory entry modification events FAN_CREATE, FAN_DELETE, and FAN_MOVE, the reported name is that of the created/deleted/moved directory entry. The event FAN_RENAME may contain two information records. One of type FAN_EVENT_INFO_TYPE_OLD_DFID_NAME identifying the old directory entry, and another of type FAN_EVENT_INFO_TYPE_NEW_DFID_NAME identifying the new directory entry. For other events that occur on a directory object, the reported file handle is that of the directory object itself and the reported name is ‘.’. For other events that occur on a non-directory object, the reported file handle is that of the parent directory object and the reported name is the name of a directory entry where the object was located at the time of the event. The rationale behind this logic is that the reported directory file handle can be passed to open_by_handle_at(2) to get an open directory file descriptor and that file descriptor along with the reported name can be used to call fstatat(2). 
The same rule that applies to record type FAN_EVENT_INFO_TYPE_DFID also applies to record type FAN_EVENT_INFO_TYPE_DFID_NAME: if a non-directory object has no parent, either the event will not be reported or it will be reported without the directory entry information. Note that there is no guarantee that the filesystem object will be found at the location described by the directory entry information at the time the event is received. See fanotify(7) for additional details.
FAN_REPORT_DFID_NAME
This is a synonym for (FAN_REPORT_DIR_FID|FAN_REPORT_NAME).
FAN_REPORT_TARGET_FID (since Linux 5.17)
Events for fanotify groups initialized with this flag will contain additional information about the child correlated with directory entry modification events. This flag must be provided in conjunction with the flags FAN_REPORT_FID, FAN_REPORT_DIR_FID, and FAN_REPORT_NAME, or else the error EINVAL will be returned. For the directory entry modification events FAN_CREATE, FAN_DELETE, FAN_MOVE, and FAN_RENAME, an additional record of type FAN_EVENT_INFO_TYPE_FID is reported in addition to the information records of type FAN_EVENT_INFO_TYPE_DFID, FAN_EVENT_INFO_TYPE_DFID_NAME, FAN_EVENT_INFO_TYPE_OLD_DFID_NAME, and FAN_EVENT_INFO_TYPE_NEW_DFID_NAME. The additional record includes a file handle that identifies the filesystem child object that the directory entry is referring to.
FAN_REPORT_DFID_NAME_TARGET
This is a synonym for (FAN_REPORT_DFID_NAME|FAN_REPORT_FID|FAN_REPORT_TARGET_FID).
FAN_REPORT_PIDFD (since Linux 5.15)
Events for fanotify groups initialized with this flag will contain an additional information record alongside the generic fanotify_event_metadata structure. This information record will be of type FAN_EVENT_INFO_TYPE_PIDFD and will contain a pidfd for the process that was responsible for generating an event. A pidfd returned in this information record object is no different from the pidfd that is returned when calling pidfd_open(2). This information record is intended for applications that need to determine reliably whether the process responsible for generating an event has been recycled or terminated. The use of the FAN_REPORT_TID flag along with FAN_REPORT_PIDFD is currently not supported and attempting to do so will result in the error EINVAL being returned. This limitation is imposed by the pidfd API, which currently supports the creation of pidfds only for thread-group leaders. Creating pidfds for non-thread-group leaders may be supported at some point in the future, so this restriction may eventually be lifted. For more details on information records, see fanotify(7).
The event_f_flags argument defines the file status flags that will be set on the open file descriptions that are created for fanotify events. For details of these flags, see the description of the flags values in open(2). event_f_flags includes a multi-bit field for the access mode. This field can take the following values:
O_RDONLY
This value allows only read access.
O_WRONLY
This value allows only write access.
O_RDWR
This value allows read and write access.
Additional bits can be set in event_f_flags. The most useful values are:
O_LARGEFILE
Enable support for files exceeding 2 GB. Failing to set this flag will result in an EOVERFLOW error when trying to open a large file which is monitored by an fanotify group on a 32-bit system.
O_CLOEXEC (since Linux 3.18)
Enable the close-on-exec flag for the file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
The following are also allowable: O_APPEND, O_DSYNC, O_NOATIME, O_NONBLOCK, and O_SYNC. Specifying any other flag in event_f_flags yields the error EINVAL (but see BUGS).
RETURN VALUE
On success, fanotify_init() returns a new file descriptor. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
An invalid value was passed in flags or event_f_flags. FAN_ALL_INIT_FLAGS (deprecated since Linux 4.20) defines all allowable bits for flags.
EMFILE
The number of fanotify groups for this user exceeds the limit. See fanotify(7) for details about this limit.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENOMEM
The allocation of memory for the notification group failed.
ENOSYS
This kernel does not implement fanotify_init(). The fanotify API is available only if the kernel was configured with CONFIG_FANOTIFY.
EPERM
The operation is not permitted because the caller lacks a required capability.
VERSIONS
Prior to Linux 5.13, calling fanotify_init() required the CAP_SYS_ADMIN capability. Since Linux 5.13, users may call fanotify_init() without the CAP_SYS_ADMIN capability to create and initialize an fanotify group with limited functionality.
The limitations imposed on an event listener created by a user without the CAP_SYS_ADMIN capability are as follows:
The user cannot request an unlimited event queue by using FAN_UNLIMITED_QUEUE.
The user cannot request an unlimited number of marks by using FAN_UNLIMITED_MARKS.
The user cannot request either of the notification classes FAN_CLASS_CONTENT or FAN_CLASS_PRE_CONTENT. This means that the user cannot request permission events.
The user is required to create a group that identifies filesystem objects by file handles, for example, by providing the FAN_REPORT_FID flag.
The user is limited to marking inodes only. Marking a mount or filesystem via fanotify_mark() through the use of FAN_MARK_MOUNT or FAN_MARK_FILESYSTEM is not permitted.
The event object in the event queue is limited in terms of the information that is made available to the unprivileged user. A user will also not receive the pid that generated the event, unless the listening process itself generated the event.
STANDARDS
Linux.
HISTORY
Linux 2.6.37.
BUGS
The following bug was present before Linux 3.18:
- The O_CLOEXEC flag is ignored when passed in event_f_flags.
The following bug was present before Linux 3.14:
- The event_f_flags argument is not checked for invalid flags. Flags that are intended only for internal use, such as FMODE_EXEC, can be set, and will consequently be set for the file descriptors returned when reading from the fanotify file descriptor.
SEE ALSO
fanotify_mark(2), fanotify(7)
332 - Linux cli command statfs
NAME
statfs - get filesystem statistics
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/vfs.h> /* or <sys/statfs.h> */
int statfs(const char *path, struct statfs *buf);
int fstatfs(int fd, struct statfs *buf);
Unless you need the f_type field, you should use the standard statvfs(3) interface instead.
DESCRIPTION
The statfs() system call returns information about a mounted filesystem. path is the pathname of any file within the mounted filesystem. buf is a pointer to a statfs structure defined approximately as follows:
struct statfs {
    __fsword_t f_type;    /* Type of filesystem (see below) */
    __fsword_t f_bsize;   /* Optimal transfer block size */
    fsblkcnt_t f_blocks;  /* Total data blocks in filesystem */
    fsblkcnt_t f_bfree;   /* Free blocks in filesystem */
    fsblkcnt_t f_bavail;  /* Free blocks available to
                             unprivileged user */
    fsfilcnt_t f_files;   /* Total inodes in filesystem */
    fsfilcnt_t f_ffree;   /* Free inodes in filesystem */
    fsid_t     f_fsid;    /* Filesystem ID */
    __fsword_t f_namelen; /* Maximum length of filenames */
    __fsword_t f_frsize;  /* Fragment size (since Linux 2.6) */
    __fsword_t f_flags;   /* Mount flags of filesystem
                             (since Linux 2.6.36) */
    __fsword_t f_spare[xxx];
                          /* Padding bytes reserved for future use */
};
The following filesystem types may appear in f_type:
ADFS_SUPER_MAGIC 0xadf5
AFFS_SUPER_MAGIC 0xadff
AFS_SUPER_MAGIC 0x5346414f
ANON_INODE_FS_MAGIC 0x09041934 /* Anonymous inode FS (for
pseudofiles that have no name;
e.g., epoll, signalfd, bpf) */
AUTOFS_SUPER_MAGIC 0x0187
BDEVFS_MAGIC 0x62646576
BEFS_SUPER_MAGIC 0x42465331
BFS_MAGIC 0x1badface
BINFMTFS_MAGIC 0x42494e4d
BPF_FS_MAGIC 0xcafe4a11
BTRFS_SUPER_MAGIC 0x9123683e
BTRFS_TEST_MAGIC 0x73727279
CGROUP_SUPER_MAGIC 0x27e0eb /* Cgroup pseudo FS */
CGROUP2_SUPER_MAGIC 0x63677270 /* Cgroup v2 pseudo FS */
CIFS_MAGIC_NUMBER 0xff534d42
CODA_SUPER_MAGIC 0x73757245
COH_SUPER_MAGIC 0x012ff7b7
CRAMFS_MAGIC 0x28cd3d45
DEBUGFS_MAGIC 0x64626720
DEVFS_SUPER_MAGIC 0x1373 /* Linux 2.6.17 and earlier */
DEVPTS_SUPER_MAGIC 0x1cd1
ECRYPTFS_SUPER_MAGIC 0xf15f
EFIVARFS_MAGIC 0xde5e81e4
EFS_SUPER_MAGIC 0x00414a53
EXT_SUPER_MAGIC 0x137d /* Linux 2.0 and earlier */
EXT2_OLD_SUPER_MAGIC 0xef51
EXT2_SUPER_MAGIC 0xef53
EXT3_SUPER_MAGIC 0xef53
EXT4_SUPER_MAGIC 0xef53
F2FS_SUPER_MAGIC 0xf2f52010
FUSE_SUPER_MAGIC 0x65735546
FUTEXFS_SUPER_MAGIC 0xbad1dea /* Unused */
HFS_SUPER_MAGIC 0x4244
HOSTFS_SUPER_MAGIC 0x00c0ffee
HPFS_SUPER_MAGIC 0xf995e849
HUGETLBFS_MAGIC 0x958458f6
ISOFS_SUPER_MAGIC 0x9660
JFFS2_SUPER_MAGIC 0x72b6
JFS_SUPER_MAGIC 0x3153464a
MINIX_SUPER_MAGIC 0x137f /* original minix FS */
MINIX_SUPER_MAGIC2 0x138f /* 30 char minix FS */
MINIX2_SUPER_MAGIC 0x2468 /* minix V2 FS */
MINIX2_SUPER_MAGIC2 0x2478 /* minix V2 FS, 30 char names */
MINIX3_SUPER_MAGIC 0x4d5a /* minix V3 FS, 60 char names */
MQUEUE_MAGIC 0x19800202 /* POSIX message queue FS */
MSDOS_SUPER_MAGIC 0x4d44
MTD_INODE_FS_MAGIC 0x11307854
NCP_SUPER_MAGIC 0x564c
NFS_SUPER_MAGIC 0x6969
NILFS_SUPER_MAGIC 0x3434
NSFS_MAGIC 0x6e736673
NTFS_SB_MAGIC 0x5346544e
OCFS2_SUPER_MAGIC 0x7461636f
OPENPROM_SUPER_MAGIC 0x9fa1
OVERLAYFS_SUPER_MAGIC 0x794c7630
PIPEFS_MAGIC 0x50495045
PROC_SUPER_MAGIC 0x9fa0 /* /proc FS */
PSTOREFS_MAGIC 0x6165676c
QNX4_SUPER_MAGIC 0x002f
QNX6_SUPER_MAGIC 0x68191122
RAMFS_MAGIC 0x858458f6
REISERFS_SUPER_MAGIC 0x52654973
ROMFS_MAGIC 0x7275
SECURITYFS_MAGIC 0x73636673
SELINUX_MAGIC 0xf97cff8c
SMACK_MAGIC 0x43415d53
SMB_SUPER_MAGIC 0x517b
SMB2_MAGIC_NUMBER 0xfe534d42
SOCKFS_MAGIC 0x534f434b
SQUASHFS_MAGIC 0x73717368
SYSFS_MAGIC 0x62656572
SYSV2_SUPER_MAGIC 0x012ff7b6
SYSV4_SUPER_MAGIC 0x012ff7b5
TMPFS_MAGIC 0x01021994
TRACEFS_MAGIC 0x74726163
UDF_SUPER_MAGIC 0x15013346
UFS_MAGIC 0x00011954
USBDEVICE_SUPER_MAGIC 0x9fa2
V9FS_MAGIC 0x01021997
VXFS_SUPER_MAGIC 0xa501fcf5
XENFS_SUPER_MAGIC 0xabba1974
XENIX_SUPER_MAGIC 0x012ff7b4
XFS_SUPER_MAGIC 0x58465342
_XIAFS_SUPER_MAGIC 0x012fd16d /* Linux 2.0 and earlier */
Most of these MAGIC constants are defined in /usr/include/linux/magic.h, and some are hardcoded in kernel sources.
The f_flags field is a bit mask indicating mount options for the filesystem. It contains zero or more of the following bits:
ST_MANDLOCK
Mandatory locking is permitted on the filesystem (see fcntl(2)).
ST_NOATIME
Do not update access times; see mount(2).
ST_NODEV
Disallow access to device special files on this filesystem.
ST_NODIRATIME
Do not update directory access times; see mount(2).
ST_NOEXEC
Execution of programs is disallowed on this filesystem.
ST_NOSUID
The set-user-ID and set-group-ID bits are ignored by exec(3) for executable files on this filesystem.
ST_RDONLY
This filesystem is mounted read-only.
ST_RELATIME
Update atime relative to mtime/ctime; see mount(2).
ST_SYNCHRONOUS
Writes are synched to the filesystem immediately (see the description of O_SYNC in open(2)).
ST_NOSYMFOLLOW (since Linux 5.10)
Symbolic links are not followed when resolving paths; see mount(2).
Nobody knows what f_fsid is supposed to contain (but see below).
Fields that are undefined for a particular filesystem are set to 0.
fstatfs() returns the same information about an open file referenced by descriptor fd.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
(statfs()) Search permission is denied for a component of the path prefix of path. (See also path_resolution(7).)
EBADF
(fstatfs()) fd is not a valid open file descriptor.
EFAULT
buf or path points to an invalid address.
EINTR
The call was interrupted by a signal; see signal(7).
EIO
An I/O error occurred while reading from the filesystem.
ELOOP
(statfs()) Too many symbolic links were encountered in translating path.
ENAMETOOLONG
(statfs()) path is too long.
ENOENT
(statfs()) The file referred to by path does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOSYS
The filesystem does not support this call.
ENOTDIR
(statfs()) A component of the path prefix of path is not a directory.
EOVERFLOW
Some values were too large to be represented in the returned struct.
VERSIONS
The f_fsid field
Solaris, Irix, and POSIX have a system call statvfs(2) that returns a struct statvfs (defined in <sys/statvfs.h>) containing an unsigned long f_fsid. Linux, SunOS, HP-UX, 4.4BSD have a system call statfs() that returns a struct statfs (defined in <sys/vfs.h>) containing a fsid_t f_fsid, where fsid_t is defined as struct { int val[2]; }. The same holds for FreeBSD, except that it uses the include file <sys/mount.h>.
The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type. Several operating systems restrict giving out the f_fsid field to the superuser only (and zero it for unprivileged users), because this field is used in the filehandle of the filesystem when NFS-exported, and giving it out is a security concern.
Under some operating systems, the fsid can be used as the second argument to the sysfs(2) system call.
STANDARDS
Linux.
HISTORY
The Linux statfs() was inspired by the 4.4BSD one (but they do not use the same structure).
The original Linux statfs() and fstatfs() system calls were not designed with extremely large file sizes in mind. Subsequently, Linux 2.6 added new statfs64() and fstatfs64() system calls that employ a new structure, statfs64. The new structure contains the same fields as the original statfs structure, but the sizes of various fields are increased, to accommodate large file sizes. The glibc statfs() and fstatfs() wrapper functions transparently deal with the kernel differences.
LSB has deprecated the library calls statfs() and fstatfs() and tells us to use statvfs(3) and fstatvfs(3) instead.
NOTES
The __fsword_t type used for various fields in the statfs structure definition is a glibc internal type, not intended for public use. This leaves the programmer in a bit of a conundrum when trying to copy or compare these fields to local variables in a program. Using unsigned int for such variables suffices on most systems.
Some systems have only <sys/vfs.h>, other systems also have <sys/statfs.h>, where the former includes the latter. So it seems including the former is the best choice.
BUGS
From Linux 2.6.38 up to and including Linux 3.1, fstatfs() failed with the error ENOSYS for file descriptors created by pipe(2).
SEE ALSO
stat(2), statvfs(3), path_resolution(7)
333 - Linux cli command epoll_pwait
NAME
epoll_pwait - wait for an I/O event on an epoll file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/epoll.h>
int epoll_wait(int epfd, struct epoll_event *events,
int maxevents, int timeout);
int epoll_pwait(int epfd, struct epoll_event *events,
int maxevents, int timeout,
const sigset_t *_Nullable sigmask);
int epoll_pwait2(int epfd, struct epoll_event *events,
int maxevents, const struct timespec *_Nullable timeout,
const sigset_t *_Nullable sigmask);
DESCRIPTION
The epoll_wait() system call waits for events on the epoll(7) instance referred to by the file descriptor epfd. The buffer pointed to by events is used to return information from the ready list about file descriptors in the interest list that have some events available. Up to maxevents are returned by epoll_wait(). The maxevents argument must be greater than zero.
The timeout argument specifies the number of milliseconds that epoll_wait() will block. Time is measured against the CLOCK_MONOTONIC clock.
A call to epoll_wait() will block until either:
a file descriptor delivers an event;
the call is interrupted by a signal handler; or
the timeout expires.
Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a timeout of -1 causes epoll_wait() to block indefinitely, while specifying a timeout equal to zero causes epoll_wait() to return immediately, even if no events are available.
The struct epoll_event is described in epoll_event(3type).
The data field of each returned epoll_event structure contains the same data as was specified in the most recent call to epoll_ctl(2) (EPOLL_CTL_ADD, EPOLL_CTL_MOD) for the corresponding open file descriptor.
The events field is a bit mask that indicates the events that have occurred for the corresponding open file description. See epoll_ctl(2) for a list of the bits that may appear in this mask.
epoll_pwait()
The relationship between epoll_wait() and epoll_pwait() is analogous to the relationship between select(2) and pselect(2): like pselect(2), epoll_pwait() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.
The following epoll_pwait() call:
ready = epoll_pwait(epfd, &events, maxevents, timeout, &sigmask);
is equivalent to atomically executing the following calls:
sigset_t origmask;
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = epoll_wait(epfd, &events, maxevents, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
The sigmask argument may be specified as NULL, in which case epoll_pwait() is equivalent to epoll_wait().
epoll_pwait2()
The epoll_pwait2() system call is equivalent to epoll_pwait() except for the timeout argument. It takes an argument of type timespec to be able to specify nanosecond resolution timeout. This argument functions the same as in pselect(2) and ppoll(2). If timeout is NULL, then epoll_pwait2() can block indefinitely.
RETURN VALUE
On success, epoll_wait() returns the number of file descriptors ready for the requested I/O operation, or zero if no file descriptor became ready during the requested timeout milliseconds. On failure, epoll_wait() returns -1 and errno is set to indicate the error.
ERRORS
EBADF
epfd is not a valid file descriptor.
EFAULT
The memory area pointed to by events is not accessible with write permissions.
EINTR
The call was interrupted by a signal handler before either (1) any of the requested events occurred or (2) the timeout expired; see signal(7).
EINVAL
epfd is not an epoll file descriptor, or maxevents is less than or equal to zero.
STANDARDS
Linux.
HISTORY
epoll_wait()
Linux 2.6, glibc 2.3.2.
epoll_pwait()
Linux 2.6.19, glibc 2.6.
epoll_pwait2()
Linux 5.11.
NOTES
While one thread is blocked in a call to epoll_wait(), it is possible for another thread to add a file descriptor to the waited-upon epoll instance. If the new file descriptor becomes ready, it will cause the epoll_wait() call to unblock.
If more than maxevents file descriptors are ready when epoll_wait() is called, then successive epoll_wait() calls will round robin through the set of ready file descriptors. This behavior helps avoid starvation scenarios, where a process fails to notice that additional file descriptors are ready because it focuses on a set of file descriptors that are already known to be ready.
Note that it is possible to call epoll_wait() on an epoll instance whose interest list is currently empty (or whose interest list becomes empty because file descriptors are closed or removed from the interest list in another thread). The call will block until some file descriptor is later added to the interest list (in another thread) and that file descriptor becomes ready.
C library/kernel differences
The raw epoll_pwait() and epoll_pwait2() system calls have a sixth argument, size_t sigsetsize, which specifies the size in bytes of the sigmask argument. The glibc epoll_pwait() wrapper function specifies this argument as a fixed value (equal to sizeof(sigset_t)).
BUGS
Before Linux 2.6.37, a timeout value larger than approximately LONG_MAX / HZ milliseconds is treated as -1 (i.e., infinity). Thus, for example, on a system where sizeof(long) is 4 and the kernel HZ value is 1000, this means that timeouts greater than 35.79 minutes are treated as infinity.
SEE ALSO
epoll_create(2), epoll_ctl(2), epoll(7)
334 - Linux cli command ioperm
NAME
ioperm - set port input/output permissions
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
int ioperm(unsigned long from, unsigned long num, int turn_on);
DESCRIPTION
ioperm() sets the port access permission bits for the calling thread for num bits starting from port address from. If turn_on is nonzero, then permission for the specified bits is enabled; otherwise it is disabled. If turn_on is nonzero, the calling thread must be privileged (CAP_SYS_RAWIO).
Before Linux 2.6.8, only the first 0x3ff I/O ports could be specified in this manner. For more ports, the iopl(2) system call had to be used (with a level argument of 3). Since Linux 2.6.8, 65,536 I/O ports can be specified.
Permissions are inherited by the child created by fork(2) (but see HISTORY). Permissions are preserved across execve(2); this is useful for giving port access permissions to unprivileged programs.
This call is mostly for the i386 architecture. On many other architectures it does not exist or will always return an error.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
Invalid values for from or num.
EIO
(on PowerPC) This call is not supported.
ENOMEM
Out of memory.
EPERM
The calling thread has insufficient privilege.
VERSIONS
glibc has an ioperm() prototype both in <sys/io.h> and in <sys/perm.h>. Avoid the latter; it is available on i386 only.
STANDARDS
Linux.
HISTORY
Before Linux 2.4, permissions were not inherited by a child created by fork(2).
NOTES
The /proc/ioports file shows the I/O ports that are currently allocated on the system.
SEE ALSO
iopl(2), outb(2), capabilities(7)
335 - Linux cli command clock_nanosleep
NAME
clock_nanosleep - high-resolution sleep with specifiable clock
LIBRARY
Standard C library (libc, -lc), since glibc 2.17
Before glibc 2.17, Real-time library (librt, -lrt)
SYNOPSIS
#include <time.h>
int clock_nanosleep(clockid_t clockid, int flags,
const struct timespec *t,
struct timespec *_Nullable remain);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
clock_nanosleep():
_POSIX_C_SOURCE >= 200112L
DESCRIPTION
Like nanosleep(2), clock_nanosleep() allows the calling thread to sleep for an interval specified with nanosecond precision. It differs in allowing the caller to select the clock against which the sleep interval is to be measured, and in allowing the sleep interval to be specified as either an absolute or a relative value.
The time values passed to and returned by this call are specified using timespec(3) structures.
The clockid argument specifies the clock against which the sleep interval is to be measured. This argument can have one of the following values:
CLOCK_REALTIME
A settable system-wide real-time clock.
CLOCK_TAI (since Linux 3.10)
A system-wide clock derived from wall-clock time but counting leap seconds.
CLOCK_MONOTONIC
A nonsettable, monotonically increasing clock that measures time since some unspecified point in the past that does not change after system startup.
CLOCK_BOOTTIME (since Linux 2.6.39)
Identical to CLOCK_MONOTONIC, except that it also includes any time that the system is suspended.
CLOCK_PROCESS_CPUTIME_ID
A settable per-process clock that measures CPU time consumed by all threads in the process.
See clock_getres(2) for further details on these clocks. In addition, the CPU clock IDs returned by clock_getcpuclockid(3) and pthread_getcpuclockid(3) can also be passed in clockid.
If flags is 0, then the value specified in t is interpreted as an interval relative to the current value of the clock specified by clockid.
If flags is TIMER_ABSTIME, then t is interpreted as an absolute time as measured by the clock, clockid. If t is less than or equal to the current value of the clock, then clock_nanosleep() returns immediately without suspending the calling thread.
clock_nanosleep() suspends the execution of the calling thread until either at least the time specified by t has elapsed, or a signal is delivered that causes a signal handler to be called or that terminates the process.
If the call is interrupted by a signal handler, clock_nanosleep() fails with the error EINTR. In addition, if remain is not NULL, and flags was not TIMER_ABSTIME, it returns the remaining unslept time in remain. This value can then be used to call clock_nanosleep() again and complete a (relative) sleep.
RETURN VALUE
On successfully sleeping for the requested interval, clock_nanosleep() returns 0. If the call is interrupted by a signal handler or encounters an error, then it returns one of the positive error numbers listed in ERRORS.
ERRORS
EFAULT
t or remain specified an invalid address.
EINTR
The sleep was interrupted by a signal handler; see signal(7).
EINVAL
The value in the tv_nsec field was not in the range [0, 999999999] or tv_sec was negative.
EINVAL
clockid was invalid. (CLOCK_THREAD_CPUTIME_ID is not a permitted value for clockid.)
ENOTSUP
The kernel does not support sleeping against this clockid.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001. Linux 2.6, glibc 2.1.
NOTES
If the interval specified in t is not an exact multiple of the granularity of the underlying clock (see time(7)), then the interval will be rounded up to the next multiple. Furthermore, after the sleep completes, there may still be a delay before the CPU becomes free to once again execute the calling thread.
Using an absolute timer is useful for preventing timer drift problems of the type described in nanosleep(2). (Such problems are exacerbated in programs that try to restart a relative sleep that is repeatedly interrupted by signals.) To perform a relative sleep that avoids these problems, call clock_gettime(2) for the desired clock, add the desired interval to the returned time value, and then call clock_nanosleep() with the TIMER_ABSTIME flag.
clock_nanosleep() is never restarted after being interrupted by a signal handler, regardless of the use of the sigaction(2) SA_RESTART flag.
The remain argument is unused, and unnecessary, when flags is TIMER_ABSTIME. (An absolute sleep can be restarted using the same t argument.)
POSIX.1 specifies that clock_nanosleep() has no effect on signal dispositions or the signal mask.
POSIX.1 specifies that after changing the value of the CLOCK_REALTIME clock via clock_settime(2), the new clock value shall be used to determine the time at which a thread blocked on an absolute clock_nanosleep() will wake up; if the new clock value falls past the end of the sleep interval, then the clock_nanosleep() call will return immediately.
POSIX.1 specifies that changing the value of the CLOCK_REALTIME clock via clock_settime(2) shall have no effect on a thread that is blocked on a relative clock_nanosleep().
SEE ALSO
clock_getres(2), nanosleep(2), restart_syscall(2), timer_create(2), sleep(3), timespec(3), usleep(3), time(7)
336 - Linux cli command setgroups
NAME
setgroups - get/set list of supplementary group IDs
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int getgroups(int size, gid_t list[]);
#include <grp.h>
int setgroups(size_t size, const gid_t *_Nullable list);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
setgroups():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
getgroups() returns the supplementary group IDs of the calling process in list. The argument size should be set to the maximum number of items that can be stored in the buffer pointed to by list. If the calling process is a member of more than size supplementary groups, then an error results.
It is unspecified whether the effective group ID of the calling process is included in the returned list. (Thus, an application should also call getegid(2) and add or remove the resulting value.)
If size is zero, list is not modified, but the total number of supplementary group IDs for the process is returned. This allows the caller to determine the size of a dynamically allocated list to be used in a further call to getgroups().
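The two-call pattern described above (first ask for the count, then allocate and fetch) might look like the following sketch. The helper name get_supplementary_groups() is hypothetical, and the sketch ignores the possibility that another thread changes the group list between the two calls.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return a malloc()ed array holding the
 * supplementary group IDs of the calling process and store the number
 * of entries in *count. Returns NULL on error. The caller frees the
 * array.
 */
gid_t *get_supplementary_groups(int *count)
{
    /* size == 0: list is not touched, only the count is returned. */
    int n = getgroups(0, NULL);
    if (n == -1)
        return NULL;

    /* Allocate at least one element so that a process without
       supplementary groups still gets a non-NULL pointer. */
    gid_t *list = malloc((size_t)(n > 0 ? n : 1) * sizeof(*list));
    if (list == NULL)
        return NULL;

    n = getgroups(n, list);
    if (n == -1) {
        free(list);
        return NULL;
    }

    *count = n;
    return list;
}
```

Because it is unspecified whether the effective group ID appears in the list, a caller that needs it would still compare the result against getegid(2).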
setgroups() sets the supplementary group IDs for the calling process. Appropriate privileges are required (see the description of the EPERM error, below). The size argument specifies the number of supplementary group IDs in the buffer pointed to by list. A process can drop all of its supplementary groups with the call:
setgroups(0, NULL);
RETURN VALUE
On success, getgroups() returns the number of supplementary group IDs. On error, -1 is returned, and errno is set to indicate the error.
On success, setgroups() returns 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
list has an invalid address.
getgroups() can additionally fail with the following error:
EINVAL
size is less than the number of supplementary group IDs, but is not zero.
setgroups() can additionally fail with the following errors:
EINVAL
size is greater than NGROUPS_MAX (32 before Linux 2.6.4; 65536 since Linux 2.6.4).
ENOMEM
Out of memory.
EPERM
The calling process has insufficient privilege (the caller does not have the CAP_SETGID capability in the user namespace in which it resides).
EPERM (since Linux 3.19)
The use of setgroups() is denied in this user namespace. See the description of /proc/pid/setgroups in user_namespaces(7).
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including the one for setgroups()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
getgroups()
POSIX.1-2008.
setgroups()
None.
HISTORY
getgroups()
SVr4, 4.3BSD, POSIX.1-2001.
setgroups()
SVr4, 4.3BSD. Since setgroups() requires privilege, it is not covered by POSIX.1.
The original Linux getgroups() system call supported only 16-bit group IDs. Subsequently, Linux 2.4 added getgroups32(), supporting 32-bit IDs. The glibc getgroups() wrapper function transparently deals with the variation across kernel versions.
NOTES
A process can have up to NGROUPS_MAX supplementary group IDs in addition to the effective group ID. The constant NGROUPS_MAX is defined in <limits.h>. The set of supplementary group IDs is inherited from the parent process, and preserved across an execve(2).
The maximum number of supplementary group IDs can be found at run time using sysconf(3):
long ngroups_max;
ngroups_max = sysconf(_SC_NGROUPS_MAX);
The maximum return value of getgroups() cannot be larger than one more than this value. Since Linux 2.6.4, the maximum number of supplementary group IDs is also exposed via the Linux-specific read-only file, /proc/sys/kernel/ngroups_max.
SEE ALSO
getgid(2), setgid(2), getgrouplist(3), group_member(3), initgroups(3), capabilities(7), credentials(7)
337 - Linux cli command lseek
NAME
lseek - reposition read/write file offset
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
off_t lseek(int fd, off_t offset, int whence);
DESCRIPTION
lseek() repositions the file offset of the open file description associated with the file descriptor fd to the argument offset according to the directive whence as follows:
SEEK_SET
The file offset is set to offset bytes.
SEEK_CUR
The file offset is set to its current location plus offset bytes.
SEEK_END
The file offset is set to the size of the file plus offset bytes.
lseek() allows the file offset to be set beyond the end of the file (but this does not change the size of the file). If data is later written at this point, subsequent reads of the data in the gap (a “hole”) return null bytes (‘\0’) until data is actually written into the gap.
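As an illustration of seeking past the end of a file, the sketch below writes one byte at offset 0, seeks forward, and writes a second byte, leaving a gap in between. The helper name make_file_with_gap() and the /tmp location are assumptions made for the example only.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an unnamed temporary file containing
 * 'A' at offset 0 and 'B' at offset second_offset, with nothing
 * written in between. Returns an open read/write descriptor, or -1
 * on error.
 */
int make_file_with_gap(off_t second_offset)
{
    char path[] = "/tmp/lseek-gap-XXXXXX";   /* assumed location */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);   /* the file persists until fd is closed */

    if (write(fd, "A", 1) != 1 ||
        lseek(fd, second_offset, SEEK_SET) == (off_t) -1 ||
        write(fd, "B", 1) != 1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After make_file_with_gap(4096), the file size is 4097 bytes and a read at any offset from 1 to 4095 returns a null byte, whether or not the filesystem actually stores the gap as a hole.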
Seeking file data and holes
Since Linux 3.1, Linux supports the following additional values for whence:
SEEK_DATA
Adjust the file offset to the next location in the file greater than or equal to offset containing data. If offset points to data, then the file offset is set to offset.
SEEK_HOLE
Adjust the file offset to the next hole in the file greater than or equal to offset. If offset points into the middle of a hole, then the file offset is set to offset. If there is no hole past offset, then the file offset is adjusted to the end of the file (i.e., there is an implicit hole at the end of any file).
In both of the above cases, lseek() fails if offset points past the end of the file.
These operations allow applications to map holes in a sparsely allocated file. This can be useful for applications such as file backup tools, which can save space when creating backups and preserve holes, if they have a mechanism for discovering holes.
For the purposes of these operations, a hole is a sequence of zeros that (normally) has not been allocated in the underlying file storage. However, a filesystem is not obliged to report holes, so these operations are not a guaranteed mechanism for mapping the storage space actually allocated to a file. (Furthermore, a sequence of zeros that actually has been written to the underlying storage may not be reported as a hole.) In the simplest implementation, a filesystem can support the operations by making SEEK_HOLE always return the offset of the end of the file, and making SEEK_DATA always return offset (i.e., even if the location referred to by offset is a hole, it can be considered to consist of data that is a sequence of zeros).
The _GNU_SOURCE feature test macro must be defined in order to obtain the definitions of SEEK_DATA and SEEK_HOLE from <unistd.h>.
The SEEK_HOLE and SEEK_DATA operations are supported for the following filesystems:
Btrfs (since Linux 3.1)
OCFS (since Linux 3.2)
XFS (since Linux 3.5)
ext4 (since Linux 3.8)
tmpfs(5) (since Linux 3.8)
NFS (since Linux 3.18)
FUSE (since Linux 4.5)
GFS2 (since Linux 4.15)
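A backup-style tool could walk the data regions of a file by alternating SEEK_DATA and SEEK_HOLE, as in the following sketch. The function name count_data_extents() is hypothetical. On a filesystem that uses the simplest implementation described above, every non-empty file is reported as a single data region, so the result is a lower bound on sparseness rather than an exact map.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the data regions that the filesystem
 * reports for the file at path. Returns the count, or -1 on error.
 */
long count_data_extents(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    long extents = 0;
    off_t pos = 0;

    for (;;) {
        /* Find the next offset >= pos that contains data. */
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data == (off_t) -1) {
            if (errno != ENXIO)      /* ENXIO: no data at or after pos */
                extents = -1;
            break;
        }

        /* Find where that data region ends (next hole or end of file). */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole == (off_t) -1 || hole <= data) {
            extents = -1;            /* error, or file changed under us */
            break;
        }

        extents++;
        pos = hole;
    }

    close(fd);
    return extents;
}
```

The loop ends when SEEK_DATA fails with ENXIO, which is how lseek() reports that offset is at or beyond the end of the file or inside a trailing hole.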
RETURN VALUE
Upon successful completion, lseek() returns the resulting offset location as measured in bytes from the beginning of the file. On error, the value (off_t) -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
fd is not an open file descriptor.
EINVAL
whence is not valid. Or: the resulting file offset would be negative, or beyond the end of a seekable device.
ENXIO
whence is SEEK_DATA or SEEK_HOLE, and offset is beyond the end of the file, or whence is SEEK_DATA and offset is within a hole at the end of the file.
EOVERFLOW
The resulting file offset cannot be represented in an off_t.
ESPIPE
fd is associated with a pipe, socket, or FIFO.
VERSIONS
On Linux, using lseek() on a terminal device fails with the error ESPIPE.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
SEEK_DATA and SEEK_HOLE are nonstandard extensions also present in Solaris, FreeBSD, and DragonFly BSD; they are proposed for inclusion in the next POSIX revision (Issue 8).
NOTES
See open(2) for a discussion of the relationship between file descriptors, open file descriptions, and files.
If the O_APPEND file status flag is set on the open file description, then a write(2) always moves the file offset to the end of the file, regardless of the use of lseek().
Some devices are incapable of seeking and POSIX does not specify which devices must support lseek().
SEE ALSO
dup(2), fallocate(2), fork(2), open(2), fseek(3), lseek64(3), posix_fallocate(3)
338 - Linux cli command rt_sigprocmask
NAME
rt_sigprocmask - examine and change blocked signals
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
/* Prototype for the glibc wrapper function */
int sigprocmask(int how, const sigset_t *_Nullable restrict set,
                sigset_t *_Nullable restrict oldset);
#include <signal.h> /* Definition of SIG_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
/* Prototype for the underlying system call */
int syscall(SYS_rt_sigprocmask, int how,
            const kernel_sigset_t *_Nullable set,
            kernel_sigset_t *_Nullable oldset,
            size_t sigsetsize);
/* Prototype for the legacy system call */
[[deprecated]] int syscall(SYS_sigprocmask, int how,
                           const old_kernel_sigset_t *_Nullable set,
                           old_kernel_sigset_t *_Nullable oldset);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigprocmask():
_POSIX_C_SOURCE
DESCRIPTION
sigprocmask() is used to fetch and/or change the signal mask of the calling thread. The signal mask is the set of signals whose delivery is currently blocked for the caller (see also signal(7) for more details).
The behavior of the call is dependent on the value of how, as follows.
SIG_BLOCK
The set of blocked signals is the union of the current set and the set argument.
SIG_UNBLOCK
The signals in set are removed from the current set of blocked signals. It is permissible to attempt to unblock a signal which is not blocked.
SIG_SETMASK
The set of blocked signals is set to the argument set.
If oldset is non-NULL, the previous value of the signal mask is stored in oldset.
If set is NULL, then the signal mask is unchanged (i.e., how is ignored), but the current value of the signal mask is nevertheless returned in oldset (if it is not NULL).
A set of functions for modifying and inspecting variables of type sigset_t (“signal sets”) is described in sigsetops(3).
The use of sigprocmask() is unspecified in a multithreaded process; see pthread_sigmask(3).
RETURN VALUE
sigprocmask() returns 0 on success. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
The set or oldset argument points outside the process’s allocated address space.
EINVAL
Either the value specified in how was invalid or the kernel does not support the size passed in sigsetsize.
VERSIONS
C library/kernel differences
The kernel’s definition of sigset_t differs in size from that used by the C library. In this manual page, the former is referred to as kernel_sigset_t (it is nevertheless named sigset_t in the kernel sources).
The glibc wrapper function for sigprocmask() silently ignores attempts to block the two real-time signals that are used internally by the NPTL threading implementation. See nptl(7) for details.
The original Linux system call was named sigprocmask(). However, with the addition of real-time signals in Linux 2.2, the fixed-size, 32-bit sigset_t (referred to as old_kernel_sigset_t in this manual page) type supported by that system call was no longer fit for purpose. Consequently, a new system call, rt_sigprocmask(), was added to support an enlarged sigset_t type (referred to as kernel_sigset_t in this manual page). The new system call takes a fourth argument, size_t sigsetsize, which specifies the size in bytes of the signal sets in set and oldset. This argument is currently required to have a fixed, architecture-specific value (equal to sizeof(kernel_sigset_t)).
The glibc sigprocmask() wrapper function hides these details from us, transparently calling rt_sigprocmask() when the kernel provides it.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
NOTES
It is not possible to block SIGKILL or SIGSTOP. Attempts to do so are silently ignored.
Each of the threads in a process has its own signal mask.
A child created via fork(2) inherits a copy of its parent’s signal mask; the signal mask is preserved across execve(2).
If SIGBUS, SIGFPE, SIGILL, or SIGSEGV are generated while they are blocked, the result is undefined, unless the signal was generated by kill(2), sigqueue(3), or raise(3).
See sigsetops(3) for details on manipulating signal sets.
Note that it is permissible (although not very useful) to specify both set and oldset as NULL.
SEE ALSO
kill(2), pause(2), sigaction(2), signal(2), sigpending(2), sigsuspend(2), pthread_sigmask(3), sigqueue(3), sigsetops(3), signal(7)
339 - Linux cli command getunwind
NAME
getunwind - copy the unwind data to caller’s buffer
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/unwind.h>
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
[[deprecated]] long syscall(SYS_getunwind, void buf[.buf_size],
                            size_t buf_size);
DESCRIPTION
Note: this system call is obsolete.
The IA-64-specific getunwind() system call copies the kernel’s call frame unwind data into the buffer pointed to by buf and returns the size of the unwind data; this data describes the gate page (kernel code that is mapped into user space).
The size of the buffer buf is specified in buf_size. The data is copied only if buf_size is greater than or equal to the size of the unwind data and buf is not NULL; otherwise, no data is copied, and the call succeeds, returning the size that would be needed to store the unwind data.
The first part of the unwind data contains an unwind table. The rest contains the associated unwind information, in no particular order. The unwind table contains entries of the following form:
u64 start; (64-bit address of start of function)
u64 end; (64-bit address of end of function)
u64 info; (BUF-relative offset to unwind info)
An entry whose start value is zero indicates the end of the table. For more information about the format, see the IA-64 Software Conventions and Runtime Architecture manual.
RETURN VALUE
On success, getunwind() returns the size of the unwind data. On error, -1 is returned and errno is set to indicate the error.
ERRORS
getunwind() fails with the error EFAULT if the unwind info can’t be stored in the space specified by buf.
STANDARDS
Linux on IA-64.
HISTORY
Linux 2.4.
This system call has been deprecated. The modern way to obtain the kernel’s unwind data is via the vdso(7).
SEE ALSO
getauxval(3)
340 - Linux cli command fcntl64
NAME
fcntl64 - manipulate file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h>
int fcntl(int fd, int op, ... /* arg */ );
DESCRIPTION
fcntl() performs one of the operations described below on the open file descriptor fd. The operation is determined by op.
fcntl() can take an optional third argument. Whether or not this argument is required is determined by op. The required argument type is indicated in parentheses after each op name (in most cases, the required type is int, and we identify the argument using the name arg), or void is specified if the argument is not required.
Certain of the operations below are supported only since a particular Linux kernel version. The preferred method of checking whether the host kernel supports a particular operation is to invoke fcntl() with the desired op value and then test whether the call failed with EINVAL, indicating that the kernel does not recognize this value.
Duplicating a file descriptor
F_DUPFD (int)
Duplicate the file descriptor fd using the lowest-numbered available file descriptor greater than or equal to arg. This is different from dup2(2), which uses exactly the file descriptor specified.
On success, the new file descriptor is returned.
See dup(2) for further details.
F_DUPFD_CLOEXEC (int; since Linux 2.6.24)
As for F_DUPFD, but additionally set the close-on-exec flag for the duplicate file descriptor. Specifying this flag permits a program to avoid an additional fcntl() F_SETFD operation to set the FD_CLOEXEC flag. For an explanation of why this flag is useful, see the description of O_CLOEXEC in open(2).
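A possible way to combine F_DUPFD_CLOEXEC with the EINVAL probing technique mentioned earlier is sketched below. The helper name dup_cloexec_at_least() is hypothetical. Note the assumption in the fallback path: EINVAL is treated as "operation not recognized", although it is also returned for an out-of-range arg, in which case the fallback simply fails with EINVAL again. The fallback is two separate calls and therefore not atomic.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: duplicate fd onto the lowest-numbered free
 * descriptor >= min_fd with the close-on-exec flag set. Returns the
 * new descriptor, or -1 on error.
 */
int dup_cloexec_at_least(int fd, int min_fd)
{
    /* Preferred: duplicate and set FD_CLOEXEC in a single step. */
    int newfd = fcntl(fd, F_DUPFD_CLOEXEC, min_fd);
    if (newfd != -1 || errno != EINVAL)
        return newfd;

    /* Assumed meaning of EINVAL here: the kernel does not know
       F_DUPFD_CLOEXEC. Fall back to two steps (not atomic). */
    newfd = fcntl(fd, F_DUPFD, min_fd);
    if (newfd == -1)
        return -1;

    if (fcntl(newfd, F_SETFD, FD_CLOEXEC) == -1) {
        int saved = errno;
        close(newfd);
        errno = saved;
        return -1;
    }
    return newfd;
}
```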
File descriptor flags
The following operations manipulate the flags associated with a file descriptor. Currently, only one such flag is defined: FD_CLOEXEC, the close-on-exec flag. If the FD_CLOEXEC bit is set, the file descriptor will automatically be closed during a successful execve(2). (If the execve(2) fails, the file descriptor is left open.) If the FD_CLOEXEC bit is not set, the file descriptor will remain open across an execve(2).
F_GETFD (void)
Return (as the function result) the file descriptor flags; arg is ignored.
F_SETFD (int)
Set the file descriptor flags to the value specified by arg.
In multithreaded programs, using fcntl() F_SETFD to set the close-on-exec flag at the same time as another thread performs a fork(2) plus execve(2) is vulnerable to a race condition that may unintentionally leak the file descriptor to the program executed in the child process. See the discussion of the O_CLOEXEC flag in open(2) for details and a remedy to the problem.
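Since F_SETFD replaces the whole set of descriptor flags, a careful caller reads the current flags first and changes only the bit of interest. The sketch below follows that read-modify-write pattern; the helper name set_cloexec() is hypothetical, and the race described above still applies to it in multithreaded programs.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: turn the close-on-exec flag of fd on (on != 0)
 * or off (on == 0) while preserving any other descriptor flags.
 * Returns 0 on success, -1 on error.
 */
int set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;

    if (on)
        flags |= FD_CLOEXEC;
    else
        flags &= ~FD_CLOEXEC;

    return fcntl(fd, F_SETFD, flags) == -1 ? -1 : 0;
}
```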
File status flags
Each open file description has certain associated status flags, initialized by open(2) and possibly modified by fcntl(). Duplicated file descriptors (made with dup(2), fcntl(F_DUPFD), fork(2), etc.) refer to the same open file description, and thus share the same file status flags.
The file status flags and their semantics are described in open(2).
F_GETFL (void)
Return (as the function result) the file access mode and the file status flags; arg is ignored.
F_SETFL (int)
Set the file status flags to the value specified by arg. File access mode (O_RDONLY, O_WRONLY, O_RDWR) and file creation flags (i.e., O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC) in arg are ignored. On Linux, this operation can change only the O_APPEND, O_ASYNC, O_DIRECT, O_NOATIME, and O_NONBLOCK flags. It is not possible to change the O_DSYNC and O_SYNC flags; see BUGS, below.
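The same read-modify-write pattern applies to the file status flags. The sketch below enables O_NONBLOCK; the helper name set_nonblocking() is hypothetical. The access mode bits returned by F_GETFL are passed back unchanged, which is harmless because F_SETFL ignores them.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: enable O_NONBLOCK on the open file description
 * referred to by fd, leaving the other status flags as they are.
 * Returns 0 on success, -1 on error.
 */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;

    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}
```

Because the status flags belong to the open file description, the change is also visible through every duplicate of fd.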
Advisory record locking
Linux implements traditional (“process-associated”) UNIX record locks, as standardized by POSIX. For a Linux-specific alternative with better semantics, see the discussion of open file description locks below.
F_SETLK, F_SETLKW, and F_GETLK are used to acquire, release, and test for the existence of record locks (also known as byte-range, file-segment, or file-region locks). The third argument, lock, is a pointer to a structure that has at least the following fields (in unspecified order).
struct flock {
    ...
    short l_type;    /* Type of lock: F_RDLCK,
                        F_WRLCK, F_UNLCK */
    short l_whence;  /* How to interpret l_start:
                        SEEK_SET, SEEK_CUR, SEEK_END */
    off_t l_start;   /* Starting offset for lock */
    off_t l_len;     /* Number of bytes to lock */
    pid_t l_pid;     /* PID of process blocking our lock
                        (set by F_GETLK and F_OFD_GETLK) */
    ...
};
The l_whence, l_start, and l_len fields of this structure specify the range of bytes we wish to lock. Bytes past the end of the file may be locked, but not bytes before the start of the file.
l_start is the starting offset for the lock, and is interpreted relative to either: the start of the file (if l_whence is SEEK_SET); the current file offset (if l_whence is SEEK_CUR); or the end of the file (if l_whence is SEEK_END). In the final two cases, l_start can be a negative number provided the offset does not lie before the start of the file.
l_len specifies the number of bytes to be locked. If l_len is positive, then the range to be locked covers bytes l_start up to and including l_start+l_len-1. Specifying 0 for l_len has the special meaning: lock all bytes starting at the location specified by l_whence and l_start through to the end of file, no matter how large the file grows.
POSIX.1-2001 allows (but does not require) an implementation to support a negative l_len value; if l_len is negative, the interval described by lock covers bytes l_start+l_len up to and including l_start-1. This is supported since Linux 2.4.21 and Linux 2.5.49.
The l_type field can be used to place a read (F_RDLCK) or a write (F_WRLCK) lock on a file. Any number of processes may hold a read lock (shared lock) on a file region, but only one process may hold a write lock (exclusive lock). An exclusive lock excludes all other locks, both shared and exclusive. A single process can hold only one type of lock on a file region; if a new lock is applied to an already-locked region, then the existing lock is converted to the new lock type. (Such conversions may involve splitting, shrinking, or coalescing with an existing lock if the byte range specified by the new lock does not precisely coincide with the range of the existing lock.)
F_SETLK (struct flock *)
Acquire a lock (when l_type is F_RDLCK or F_WRLCK) or release a lock (when l_type is F_UNLCK) on the bytes specified by the l_whence, l_start, and l_len fields of lock. If a conflicting lock is held by another process, this call returns -1 and sets errno to EACCES or EAGAIN. (The error returned in this case differs across implementations, so POSIX requires a portable application to check for both errors.)
F_SETLKW (struct flock *)
As for F_SETLK, but if a conflicting lock is held on the file, then wait for that lock to be released. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)).
F_GETLK (struct flock *)
On input to this call, lock describes a lock we would like to place on the file. If the lock could be placed, fcntl() does not actually place it, but returns F_UNLCK in the l_type field of lock and leaves the other fields of the structure unchanged.
If one or more incompatible locks would prevent this lock being placed, then fcntl() returns details about one of those locks in the l_type, l_whence, l_start, and l_len fields of lock. If the conflicting lock is a traditional (process-associated) record lock, then the l_pid field is set to the PID of the process holding that lock. If the conflicting lock is an open file description lock, then l_pid is set to -1. Note that the returned information may already be out of date by the time the caller inspects it.
In order to place a read lock, fd must be open for reading. In order to place a write lock, fd must be open for writing. To place both types of lock, open a file read-write.
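The sketch below exercises F_SETLK and F_GETLK together. All function names are hypothetical. It relies on two properties of traditional record locks: a process never conflicts with its own locks, so F_GETLK in the lock holder reports F_UNLCK, and a child created by fork(2) does not hold the parent's locks, so the same query in the child reports the parent as the holder.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Describe the byte range [start, start + len) from the file start. */
static struct flock make_lock(short type, off_t start, off_t len)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fl;
}

/* Place (F_RDLCK, F_WRLCK) or release (F_UNLCK) a lock without
   blocking. Returns 0, or -1 with errno set (EACCES or EAGAIN on
   conflict). */
int set_region_lock(int fd, short type, off_t start, off_t len)
{
    struct flock fl = make_lock(type, start, len);
    return fcntl(fd, F_SETLK, &fl);
}

/* Test whether a write lock could be placed. Returns 1 if it could,
   0 if it is blocked (*holder receives l_pid), -1 on error. */
int write_lock_available(int fd, off_t start, off_t len, pid_t *holder)
{
    struct flock fl = make_lock(F_WRLCK, start, len);
    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    if (fl.l_type == F_UNLCK)
        return 1;
    *holder = fl.l_pid;
    return 0;
}

/* Run the same test in a forked child. Returns 1 if the child found
   the range blocked by a lock held by its parent, 0 if not, -1 on
   error. */
int child_sees_parent_lock(int fd, off_t start, off_t len)
{
    pid_t child = fork();
    if (child == -1)
        return -1;
    if (child == 0) {
        pid_t holder = 0;
        int avail = write_lock_available(fd, start, len, &holder);
        _exit(avail == 0 && holder == getppid() ? 0 : 1);
    }

    int status;
    if (waitpid(child, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

As noted for F_GETLK, the answer is only a snapshot: by the time the caller acts on it, the lock may have been released or a new one placed.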
When placing locks with F_SETLKW, the kernel detects deadlocks, whereby two or more processes have their lock requests mutually blocked by locks held by the other processes. For example, suppose process A holds a write lock on byte 100 of a file, and process B holds a write lock on byte 200. If each process then attempts to lock the byte already locked by the other process using F_SETLKW, then, without deadlock detection, both processes would remain blocked indefinitely. When the kernel detects such deadlocks, it causes one of the blocking lock requests to immediately fail with the error EDEADLK; an application that encounters such an error should release some of its locks to allow other applications to proceed before attempting to regain the locks that it requires. Circular deadlocks involving more than two processes are also detected. Note, however, that there are limitations to the kernel’s deadlock-detection algorithm; see BUGS.
As well as being removed by an explicit F_UNLCK, record locks are automatically released when the process terminates.
Record locks are not inherited by a child created via fork(2), but are preserved across an execve(2).
Because of the buffering performed by the stdio(3) library, the use of record locking with routines in that package should be avoided; use read(2) and write(2) instead.
The record locks described above are associated with the process (unlike the open file description locks described below). This has some unfortunate consequences:
If a process closes any file descriptor referring to a file, then all of the process’s locks on that file are released, regardless of the file descriptor(s) on which the locks were obtained. This is bad: it means that a process can lose its locks on a file such as /etc/passwd or /etc/mtab when for some reason a library function decides to open, read, and close the same file.
The threads in a process share locks. In other words, a multithreaded program can’t use record locking to ensure that threads don’t simultaneously access the same region of a file.
Open file description locks solve both of these problems.
Open file description locks (non-POSIX)
Open file description locks are advisory byte-range locks whose operation is in most respects identical to the traditional record locks described above. This lock type is Linux-specific, and available since Linux 3.15. (There is a proposal with the Austin Group to include this lock type in the next revision of POSIX.1.) For an explanation of open file descriptions, see open(2).
The principal difference between the two lock types is that whereas traditional record locks are associated with a process, open file description locks are associated with the open file description on which they are acquired, much like locks acquired with flock(2). Consequently (and unlike traditional advisory record locks), open file description locks are inherited across fork(2) (and clone(2) with CLONE_FILES), and are only automatically released on the last close of the open file description, instead of being released on any close of the file.
Conflicting lock combinations (i.e., a read lock and a write lock or two write locks) where one lock is an open file description lock and the other is a traditional record lock conflict even when they are acquired by the same process on the same file descriptor.
Open file description locks placed via the same open file description (i.e., via the same file descriptor, or via a duplicate of the file descriptor created by fork(2), dup(2), fcntl() F_DUPFD, and so on) are always compatible: if a new lock is placed on an already locked region, then the existing lock is converted to the new lock type. (Such conversions may result in splitting, shrinking, or coalescing with an existing lock as discussed above.)
On the other hand, open file description locks may conflict with each other when they are acquired via different open file descriptions. Thus, the threads in a multithreaded program can use open file description locks to synchronize access to a file region by having each thread perform its own open(2) on the file and applying locks via the resulting file descriptor.
As with traditional advisory locks, the third argument to fcntl(), lock, is a pointer to an flock structure. By contrast with traditional record locks, the l_pid field of that structure must be set to zero when using the operations described below.
The operations for working with open file description locks are analogous to those used with traditional locks:
F_OFD_SETLK (struct flock *)
Acquire an open file description lock (when l_type is F_RDLCK or F_WRLCK) or release an open file description lock (when l_type is F_UNLCK) on the bytes specified by the l_whence, l_start, and l_len fields of lock. If a conflicting lock is held by another process, this call returns -1 and sets errno to EAGAIN.
F_OFD_SETLKW (struct flock *)
As for F_OFD_SETLK, but if a conflicting lock is held on the file, then wait for that lock to be released. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)).
F_OFD_GETLK (struct flock *)
On input to this call, lock describes an open file description lock we would like to place on the file. If the lock could be placed, fcntl() does not actually place it, but returns F_UNLCK in the l_type field of lock and leaves the other fields of the structure unchanged. If one or more incompatible locks would prevent this lock being placed, then details about one of these locks are returned via lock, as described above for F_GETLK.
In the current implementation, no deadlock detection is performed for open file description locks. (This contrasts with process-associated record locks, for which the kernel does perform deadlock detection.)
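The following sketch illustrates how open file description locks behave within a single process. The helper name ofd_try_write_lock() is hypothetical. It assumes a kernel and C library that provide F_OFD_SETLK (Linux 3.15 or later, with _GNU_SOURCE defined). Locks taken through the same open file description are compatible, whereas a second open(2) of the same file yields a separate open file description whose lock requests can conflict.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to place an open file description write
 * lock on bytes [start, start + len) without blocking. Returns 1 if
 * the lock was placed, 0 if a conflicting lock exists (EAGAIN), and
 * -1 on any other error.
 */
int ofd_try_write_lock(int fd, off_t start, off_t len)
{
    struct flock fl = {0};   /* l_pid must be zero for OFD operations */

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;

    if (fcntl(fd, F_OFD_SETLK, &fl) == 0)
        return 1;
    return errno == EAGAIN ? 0 : -1;
}
```

With two descriptors obtained from separate open(2) calls on the same file, the first call to ofd_try_write_lock() on a range succeeds and an overlapping request on the other descriptor returns 0, which is the property that lets threads synchronize with one another.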
Mandatory locking
Warning: the Linux implementation of mandatory locking is unreliable. See BUGS below. Because of these bugs, and the fact that the feature is believed to be little used, since Linux 4.5, mandatory locking has been made an optional feature, governed by a configuration option (CONFIG_MANDATORY_FILE_LOCKING). This feature is no longer supported at all in Linux 5.15 and above.
By default, both traditional (process-associated) and open file description record locks are advisory. Advisory locks are not enforced and are useful only between cooperating processes.
Both lock types can also be mandatory. Mandatory locks are enforced for all processes. If a process tries to perform an incompatible access (e.g., read(2) or write(2)) on a file region that has an incompatible mandatory lock, then the result depends upon whether the O_NONBLOCK flag is enabled for its open file description. If the O_NONBLOCK flag is not enabled, then the system call is blocked until the lock is removed or converted to a mode that is compatible with the access. If the O_NONBLOCK flag is enabled, then the system call fails with the error EAGAIN.
To make use of mandatory locks, mandatory locking must be enabled both on the filesystem that contains the file to be locked, and on the file itself. Mandatory locking is enabled on a filesystem using the “-o mand” option to mount(8), or the MS_MANDLOCK flag for mount(2). Mandatory locking is enabled on a file by disabling group execute permission on the file and enabling the set-group-ID permission bit (see chmod(1) and chmod(2)).
Mandatory locking is not specified by POSIX. Some other systems also support mandatory locking, although the details of how to enable it vary across systems.
Lost locks
When an advisory lock is obtained on a networked filesystem such as NFS, it is possible that the lock might get lost. This may happen due to administrative action on the server, or due to a network partition (i.e., loss of network connectivity with the server) which lasts long enough for the server to assume that the client is no longer functioning.
When the filesystem determines that a lock has been lost, future read(2) or write(2) requests may fail with the error EIO. This error will persist until the lock is removed or the file descriptor is closed. Since Linux 3.12, this happens at least for NFSv4 (including all minor versions).
Some versions of UNIX send a signal (SIGLOST) in this circumstance. Linux does not define this signal, and does not provide any asynchronous notification of lost locks.
Managing signals
F_GETOWN, F_SETOWN, F_GETOWN_EX, F_SETOWN_EX, F_GETSIG, and F_SETSIG are used to manage I/O availability signals:
F_GETOWN (void)
Return (as the function result) the process ID or process group ID currently receiving SIGIO and SIGURG signals for events on file descriptor fd. Process IDs are returned as positive values; process group IDs are returned as negative values (but see BUGS below). arg is ignored.
F_SETOWN (int)
Set the process ID or process group ID that will receive SIGIO and SIGURG signals for events on the file descriptor fd. The target process or process group ID is specified in arg. A process ID is specified as a positive value; a process group ID is specified as a negative value. Most commonly, the calling process specifies itself as the owner (that is, arg is specified as getpid(2)).
As well as setting the file descriptor owner, one must also enable generation of signals on the file descriptor. This is done by using the fcntl() F_SETFL operation to set the O_ASYNC file status flag on the file descriptor. Subsequently, a SIGIO signal is sent whenever input or output becomes possible on the file descriptor. The fcntl() F_SETSIG operation can be used to obtain delivery of a signal other than SIGIO.
Sending a signal to the owner process (group) specified by F_SETOWN is subject to the same permissions checks as are described for kill(2), where the sending process is the one that employs F_SETOWN (but see BUGS below). If this permission check fails, then the signal is silently discarded. Note: The F_SETOWN operation records the caller’s credentials at the time of the fcntl() call, and it is these saved credentials that are used for the permission checks.
If the file descriptor fd refers to a socket, F_SETOWN also selects the recipient of SIGURG signals that are delivered when out-of-band data arrives on that socket. (SIGURG is sent in any situation where select(2) would report the socket as having an “exceptional condition”.)
The following was true in Linux 2.6.x up to and including Linux 2.6.11:
If a nonzero value is given to F_SETSIG in a multithreaded process running with a threading library that supports thread groups (e.g., NPTL), then a positive value given to F_SETOWN has a different meaning: instead of being a process ID identifying a whole process, it is a thread ID identifying a specific thread within a process. Consequently, it may be necessary to pass F_SETOWN the result of gettid(2) instead of getpid(2) to get sensible results when F_SETSIG is used. (In current Linux threading implementations, a main thread’s thread ID is the same as its process ID. This means that a single-threaded program can equally use gettid(2) or getpid(2) in this scenario.) Note, however, that the statements in this paragraph do not apply to the SIGURG signal generated for out-of-band data on a socket: this signal is always sent to either a process or a process group, depending on the value given to F_SETOWN.
The above behavior was accidentally dropped in Linux 2.6.12, and won’t be restored. From Linux 2.6.32 onward, use F_SETOWN_EX to target SIGIO and SIGURG signals at a particular thread.
F_GETOWN_EX (struct f_owner_ex *) (since Linux 2.6.32)
Return the current file descriptor owner settings as defined by a previous F_SETOWN_EX operation. The information is returned in the structure pointed to by arg, which has the following form:
struct f_owner_ex {
    int   type;
    pid_t pid;
};
The type field will have one of the values F_OWNER_TID, F_OWNER_PID, or F_OWNER_PGRP. The pid field is a positive integer representing a thread ID, process ID, or process group ID. See F_SETOWN_EX for more details.
F_SETOWN_EX (struct f_owner_ex *) (since Linux 2.6.32)
This operation performs a similar task to F_SETOWN. It allows the caller to direct I/O availability signals to a specific thread, process, or process group. The caller specifies the target of signals via arg, which is a pointer to an f_owner_ex structure. The type field has one of the following values, which define how pid is interpreted:
F_OWNER_TID
Send the signal to the thread whose thread ID (the value returned by a call to clone(2) or gettid(2)) is specified in pid.
F_OWNER_PID
Send the signal to the process whose ID is specified in pid.
F_OWNER_PGRP
Send the signal to the process group whose ID is specified in pid. (Note that, unlike with F_SETOWN, a process group ID is specified as a positive value here.)
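As a rough sketch of F_SETOWN_EX with F_OWNER_TID, the following hypothetical helper (own_fd_by_this_thread() is an illustrative name) makes the calling thread the owner of a file descriptor. It assumes Linux 2.6.32 or later and obtains the thread ID through syscall(2), because older glibc versions provide no gettid() wrapper.

```c
#define _GNU_SOURCE     /* for F_SETOWN_EX and struct f_owner_ex */
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: direct I/O availability signals for fd to the
   calling thread only.  Returns 0 on success, -1 with errno set. */
int own_fd_by_this_thread(int fd)
{
    struct f_owner_ex owner;

    owner.type = F_OWNER_TID;
    owner.pid = (pid_t) syscall(SYS_gettid);   /* same value as gettid(2) */

    return fcntl(fd, F_SETOWN_EX, &owner);
}
```

The setting can be read back with F_GETOWN_EX, which fills in a struct f_owner_ex of the same form.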
F_GETSIG (void)
Return (as the function result) the signal sent when input or output becomes possible. A value of zero means SIGIO is sent. Any other value (including SIGIO) is the signal sent instead, and in this case additional info is available to the signal handler if installed with SA_SIGINFO. arg is ignored.
F_SETSIG (int)
Set the signal sent when input or output becomes possible to the value given in arg. A value of zero means to send the default SIGIO signal. Any other value (including SIGIO) is the signal to send instead, and in this case additional info is available to the signal handler if installed with SA_SIGINFO.
By using F_SETSIG with a nonzero value, and setting SA_SIGINFO for the signal handler (see sigaction(2)), extra information about I/O events is passed to the handler in a siginfo_t structure. If the si_code field indicates the source is SI_SIGIO, the si_fd field gives the file descriptor associated with the event. Otherwise, there is no indication which file descriptors are pending, and you should use the usual mechanisms (select(2), poll(2), read(2) with O_NONBLOCK set, etc.) to determine which file descriptors are available for I/O.
Note that the file descriptor provided in si_fd is the one that was specified during the F_SETSIG operation. This can lead to an unusual corner case. If the file descriptor is duplicated (dup(2) or similar), and the original file descriptor is closed, then I/O events will continue to be generated, but the si_fd field will contain the number of the now closed file descriptor.
By selecting a real-time signal (value >= SIGRTMIN), multiple I/O events may be queued using the same signal number. (Queuing is dependent on available memory.) Extra information is available if SA_SIGINFO is set for the signal handler, as above.
Note that Linux imposes a limit on the number of real-time signals that may be queued to a process (see getrlimit(2) and signal(7)) and if this limit is reached, then the kernel reverts to delivering SIGIO, and this signal is delivered to the entire process rather than to a specific thread.
Using these mechanisms, a program can implement fully asynchronous I/O without using select(2) or poll(2) most of the time.
The use of O_ASYNC is specific to BSD and Linux. The only use of F_GETOWN and F_SETOWN specified in POSIX.1 is in conjunction with the use of the SIGURG signal on sockets. (POSIX does not specify the SIGIO signal.) F_GETOWN_EX, F_SETOWN_EX, F_GETSIG, and F_SETSIG are Linux-specific. POSIX has asynchronous I/O and the aio_sigevent structure to achieve similar things; these are also available in Linux as part of the GNU C Library (glibc).
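The three steps described in this section (choose an owner, choose a signal, enable O_ASYNC) might be combined as in the sketch below. The helper name enable_io_signal() is hypothetical, and the choice of signal is left to the caller (for example, SIGRTMIN so that events can be queued).

```c
#define _GNU_SOURCE     /* for F_SETSIG */
#include <fcntl.h>
#include <signal.h>     /* callers typically pass SIGRTMIN or similar */
#include <unistd.h>

/* Hypothetical helper: ask the kernel to send signo to the calling
   process whenever input or output becomes possible on fd.
   Returns 0 on success, -1 with errno set. */
int enable_io_signal(int fd, int signo)
{
    int flags;

    if (fcntl(fd, F_SETOWN, getpid()) == -1)      /* who receives the signal */
        return -1;
    if (fcntl(fd, F_SETSIG, signo) == -1)         /* which signal is sent */
        return -1;

    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC);   /* turn signal generation on */
}
```

A handler for signo that was installed with SA_SIGINFO can then read the si_fd field of its siginfo_t argument to learn which descriptor became ready.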
Leases
F_SETLEASE and F_GETLEASE (Linux 2.4 onward) are used to establish a new lease, and retrieve the current lease, on the open file description referred to by the file descriptor fd. A file lease provides a mechanism whereby the process holding the lease (the “lease holder”) is notified (via delivery of a signal) when a process (the “lease breaker”) tries to open(2) or truncate(2) the file referred to by that file descriptor.
F_SETLEASE (int)
Set or remove a file lease according to which of the following values is specified in the integer arg:
F_RDLCK
Take out a read lease. This will cause the calling process to be notified when the file is opened for writing or is truncated. A read lease can be placed only on a file descriptor that is opened read-only.
F_WRLCK
Take out a write lease. This will cause the caller to be notified when the file is opened for reading or writing or is truncated. A write lease may be placed on a file only if there are no other open file descriptors for the file.
F_UNLCK
Remove our lease from the file.
Leases are associated with an open file description (see open(2)). This means that duplicate file descriptors (created by, for example, fork(2) or dup(2)) refer to the same lease, and this lease may be modified or released using any of these descriptors. Furthermore, the lease is released by either an explicit F_UNLCK operation on any of these duplicate file descriptors, or when all such file descriptors have been closed.
Leases may be taken out only on regular files. An unprivileged process may take out a lease only on a file whose UID (owner) matches the filesystem UID of the process. A process with the CAP_LEASE capability may take out leases on arbitrary files.
F_GETLEASE (void)
Indicates what type of lease is associated with the file descriptor fd by returning either F_RDLCK, F_WRLCK, or F_UNLCK, indicating, respectively, a read lease, a write lease, or no lease. arg is ignored.
When a process (the “lease breaker”) performs an open(2) or truncate(2) that conflicts with a lease established via F_SETLEASE, the system call is blocked by the kernel and the kernel notifies the lease holder by sending it a signal (SIGIO by default). The lease holder should respond to receipt of this signal by doing whatever cleanup is required in preparation for the file to be accessed by another process (e.g., flushing cached buffers) and then either remove or downgrade its lease. A lease is removed by performing an F_SETLEASE operation specifying arg as F_UNLCK. If the lease holder currently holds a write lease on the file, and the lease breaker is opening the file for reading, then it is sufficient for the lease holder to downgrade the lease to a read lease. This is done by performing an F_SETLEASE operation specifying arg as F_RDLCK.
If the lease holder fails to downgrade or remove the lease within the number of seconds specified in /proc/sys/fs/lease-break-time, then the kernel forcibly removes or downgrades the lease holder’s lease.
Once a lease break has been initiated, F_GETLEASE returns the target lease type (either F_RDLCK or F_UNLCK, depending on what would be compatible with the lease breaker) until the lease holder voluntarily downgrades or removes the lease or the kernel forcibly does so after the lease break timer expires.
Once the lease has been voluntarily or forcibly removed or downgraded, and assuming the lease breaker has not unblocked its system call, the kernel permits the lease breaker’s system call to proceed.
If the lease breaker’s blocked open(2) or truncate(2) is interrupted by a signal handler, then the system call fails with the error EINTR, but the other steps still occur as described above. If the lease breaker is killed by a signal while blocked in open(2) or truncate(2), then the other steps still occur as described above. If the lease breaker specifies the O_NONBLOCK flag when calling open(2), then the call immediately fails with the error EWOULDBLOCK, but the other steps still occur as described above.
The default signal used to notify the lease holder is SIGIO, but this can be changed using the F_SETSIG operation to fcntl(). If an F_SETSIG operation is performed (even one specifying SIGIO), and the signal handler is established using SA_SIGINFO, then the handler will receive a siginfo_t structure as its second argument, and the si_fd field of this argument will hold the file descriptor of the leased file that has been accessed by another process. (This is useful if the caller holds leases against multiple files.)
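The lease life cycle described above could look roughly like this from the lease holder's side. Both helper names are hypothetical; the sketch assumes a regular file owned by the caller on a filesystem that supports leases, and it omits the signal handler that would trigger the release.

```c
#define _GNU_SOURCE     /* for F_SETLEASE and F_GETLEASE */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open path read-only and take out a read lease.
   Returns the leased file descriptor, or -1 with errno set. */
int open_with_read_lease(const char *path)
{
    int fd = open(path, O_RDONLY);   /* a read lease needs a read-only fd */

    if (fd == -1)
        return -1;
    if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1) {
        int saved = errno;           /* keep the fcntl() error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Hypothetical helper: to be called once the lease-break signal has
   arrived and any cleanup (e.g., flushing caches) is complete. */
int release_lease(int fd)
{
    return fcntl(fd, F_SETLEASE, F_UNLCK);
}
```

If release_lease() is not called within the lease break time, the kernel removes or downgrades the lease itself, so the holder should treat its cached state as invalid from the moment the signal is delivered.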
File and directory change notification (dnotify)
F_NOTIFY (int)
(Linux 2.4 onward) Provide notification when the directory referred to by fd or any of the files that it contains is changed. The events to be notified are specified in arg, which is a bit mask specified by ORing together zero or more of the following bits:
DN_ACCESS
A file was accessed (read(2), pread(2), readv(2), and similar).
DN_MODIFY
A file was modified (write(2), pwrite(2), writev(2), truncate(2), ftruncate(2), and similar).
DN_CREATE
A file was created (open(2), creat(2), mknod(2), mkdir(2), link(2), symlink(2), rename(2) into this directory).
DN_DELETE
A file was unlinked (unlink(2), rename(2) to another directory, rmdir(2)).
DN_RENAME
A file was renamed within this directory (rename(2)).
DN_ATTRIB
The attributes of a file were changed (chown(2), chmod(2), utime(2), utimensat(2), and similar).
(In order to obtain these definitions, the _GNU_SOURCE feature test macro must be defined before including any header files.)
Directory notifications are normally “one-shot”, and the application must reregister to receive further notifications. Alternatively, if DN_MULTISHOT is included in arg, then notification will remain in effect until explicitly removed.
A series of F_NOTIFY requests is cumulative, with the events in arg being added to the set already monitored. To disable notification of all events, make an F_NOTIFY call specifying arg as 0.
Notification occurs via delivery of a signal. The default signal is SIGIO, but this can be changed using the F_SETSIG operation to fcntl(). (Note that SIGIO is one of the nonqueuing standard signals; switching to the use of a real-time signal means that multiple notifications can be queued to the process.) In the latter case, the signal handler receives a siginfo_t structure as its second argument (if the handler was established using SA_SIGINFO) and the si_fd field of this structure contains the file descriptor which generated the notification (useful when establishing notification on multiple directories).
Especially when using DN_MULTISHOT, a real-time signal should be used for notification, so that multiple notifications can be queued.
NOTE: New applications should use the inotify interface (available since Linux 2.6.13), which provides a much superior interface for obtaining notifications of filesystem events. See inotify(7).
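For code that must still use dnotify, registration and removal might be sketched as follows. The helper names are hypothetical, the set of events is an arbitrary choice for illustration, and the kernel must have been built with dnotify support; new code should prefer inotify(7).

```c
#define _GNU_SOURCE     /* for F_NOTIFY, DN_*, and F_SETSIG */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>

/* Hypothetical helper: request a queued real-time signal each time a
   file is created in, or unlinked from, the directory open on dirfd.
   Returns 0 on success, -1 with errno set. */
int watch_dir_membership(int dirfd, int rt_signo)
{
    /* Insist on a real-time signal so that notifications can queue. */
    if (rt_signo < SIGRTMIN || rt_signo > SIGRTMAX) {
        errno = EINVAL;
        return -1;
    }
    if (fcntl(dirfd, F_SETSIG, rt_signo) == -1)
        return -1;
    return fcntl(dirfd, F_NOTIFY, DN_CREATE | DN_DELETE | DN_MULTISHOT);
}

/* Hypothetical helper: disable notification of all events on dirfd. */
int unwatch_dir(int dirfd)
{
    return fcntl(dirfd, F_NOTIFY, 0);
}
```

DN_MULTISHOT is included so that the registration survives the first event; without it, the caller would have to reregister after every notification.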
Changing the capacity of a pipe
F_SETPIPE_SZ (int; since Linux 2.6.35)
Change the capacity of the pipe referred to by fd to be at least arg bytes. An unprivileged process can adjust the pipe capacity to any value between the system page size and the limit defined in /proc/sys/fs/pipe-max-size (see proc(5)). Attempts to set the pipe capacity below the page size are silently rounded up to the page size. Attempts by an unprivileged process to set the pipe capacity above the limit in /proc/sys/fs/pipe-max-size yield the error EPERM; a privileged process (CAP_SYS_RESOURCE) can override the limit.
When allocating the buffer for the pipe, the kernel may use a capacity larger than arg, if that is convenient for the implementation. (In the current implementation, the allocation is the next higher power-of-two page-size multiple of the requested size.) The actual capacity (in bytes) that is set is returned as the function result.
Attempting to set the pipe capacity smaller than the amount of buffer space currently used to store data produces the error EBUSY.
Note that because of the way the pages of the pipe buffer are employed when data is written to the pipe, the number of bytes that can be written may be less than the nominal size, depending on the size of the writes.
F_GETPIPE_SZ (void; since Linux 2.6.35)
Return (as the function result) the capacity of the pipe referred to by fd.
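A minimal sketch of the two pipe-capacity operations used together is shown below; grow_pipe() is a hypothetical helper name. It relies on the behavior described above, namely that F_SETPIPE_SZ returns the capacity actually set, which may exceed the request.

```c
#define _GNU_SOURCE     /* for F_SETPIPE_SZ and F_GETPIPE_SZ */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: make sure the pipe referred to by pipefd can hold
   at least want bytes.  Returns the resulting capacity in bytes, or -1
   with errno set (e.g., EPERM if want exceeds pipe-max-size for an
   unprivileged caller). */
int grow_pipe(int pipefd, int want)
{
    int cur = fcntl(pipefd, F_GETPIPE_SZ);

    if (cur == -1)
        return -1;
    if (cur >= want)
        return cur;                               /* already large enough */
    return fcntl(pipefd, F_SETPIPE_SZ, want);     /* kernel may round up */
}
```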
File Sealing
File seals limit the set of allowed operations on a given file. For each seal that is set on a file, a specific set of operations will fail with EPERM on this file from now on. The file is said to be sealed. The default set of seals depends on the type of the underlying file and filesystem. For an overview of file sealing, a discussion of its purpose, and some code examples, see memfd_create(2).
Currently, file seals can be applied only to a file descriptor returned by memfd_create(2) (if the MFD_ALLOW_SEALING flag was employed). On other filesystems, all fcntl() operations that operate on seals will return EINVAL.
Seals are a property of an inode. Thus, all open file descriptors referring to the same inode share the same set of seals. Furthermore, seals can never be removed, only added.
F_ADD_SEALS (int; since Linux 3.17)
Add the seals given in the bit-mask argument arg to the set of seals of the inode referred to by the file descriptor fd. Seals cannot be removed again. Once this call succeeds, the seals are enforced by the kernel immediately. If the current set of seals includes F_SEAL_SEAL (see below), then this call will be rejected with EPERM. Adding a seal that is already set is a no-op, provided that F_SEAL_SEAL is not already set. In order to place a seal, the file descriptor fd must be writable.
F_GET_SEALS (void; since Linux 3.17)
Return (as the function result) the current set of seals of the inode referred to by fd. If no seals are set, 0 is returned. If the file does not support sealing, -1 is returned and errno is set to EINVAL.
The following seals are available:
F_SEAL_SEAL
If this seal is set, any further call to fcntl() with F_ADD_SEALS fails with the error EPERM. Therefore, this seal prevents any modifications to the set of seals itself. If the initial set of seals of a file includes F_SEAL_SEAL, then this effectively causes the set of seals to be constant and locked.
F_SEAL_SHRINK
If this seal is set, the file in question cannot be reduced in size. This affects open(2) with the O_TRUNC flag as well as truncate(2) and ftruncate(2). Those calls fail with EPERM if you try to shrink the file in question. Increasing the file size is still possible.
F_SEAL_GROW
If this seal is set, the size of the file in question cannot be increased. This affects write(2) beyond the end of the file, truncate(2), ftruncate(2), and fallocate(2). These calls fail with EPERM if you use them to increase the file size. If you keep the size or shrink it, those calls still work as expected.
F_SEAL_WRITE
If this seal is set, you cannot modify the contents of the file. Note that shrinking or growing the size of the file is still possible and allowed. Thus, this seal is normally used in combination with one of the other seals. This seal affects write(2) and fallocate(2) (only in combination with the FALLOC_FL_PUNCH_HOLE flag). Those calls fail with EPERM if this seal is set. Furthermore, trying to create new shared, writable memory-mappings via mmap(2) will also fail with EPERM.
Using the F_ADD_SEALS operation to set the F_SEAL_WRITE seal fails with EBUSY if any writable, shared mapping exists. Such mappings must be unmapped before you can add this seal. Furthermore, if there are any asynchronous I/O operations (io_submit(2)) pending on the file, all outstanding writes will be discarded.
F_SEAL_FUTURE_WRITE (since Linux 5.1)
The effect of this seal is similar to F_SEAL_WRITE, but the contents of the file can still be modified via shared writable mappings that were created prior to the seal being set. Any attempt to create a new writable mapping on the file via mmap(2) will fail with EPERM. Likewise, an attempt to write to the file via write(2) will fail with EPERM.
Using this seal, one process can create a memory buffer that it can continue to modify while sharing that buffer on a “read-only” basis with other processes.
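As an illustration of F_ADD_SEALS, the sketch below creates a memfd whose size is frozen while its contents remain writable. The helper name make_fixed_size_memfd() is hypothetical, and the code assumes Linux 3.17 or later with a glibc that provides the memfd_create() wrapper.

```c
#define _GNU_SOURCE     /* for memfd_create(), F_ADD_SEALS, F_SEAL_* */
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous file of exactly size bytes
   and seal it against shrinking and growing.
   Returns the file descriptor, or -1 with errno set. */
int make_fixed_size_memfd(const char *name, off_t size)
{
    int fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);

    if (fd == -1)
        return -1;
    if (ftruncate(fd, size) == -1 ||
        fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) == -1) {
        int saved = errno;      /* keep the original error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

After this call, ftruncate(2) to any other size fails with EPERM, while write(2) within the existing size still succeeds because F_SEAL_WRITE was not added. Adding F_SEAL_SEAL as well would prevent any later change to the seal set.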
File read/write hints
Write lifetime hints can be used to inform the kernel about the relative expected lifetime of writes on a given inode or via a particular open file description. (See open(2) for an explanation of open file descriptions.) In this context, the term “write lifetime” means the expected time the data will live on media, before being overwritten or erased.
An application may use the different hint values specified below to separate writes into different write classes, so that multiple users or applications running on a single storage back-end can aggregate their I/O patterns in a consistent manner. However, there are no functional semantics implied by these flags, and different I/O classes can use the write lifetime hints in arbitrary ways, so long as the hints are used consistently.
The following operations can be applied to the file descriptor, fd:
F_GET_RW_HINT (uint64_t *; since Linux 4.13)
Returns the value of the read/write hint associated with the underlying inode referred to by fd.
F_SET_RW_HINT (uint64_t *; since Linux 4.13)
Sets the read/write hint value associated with the underlying inode referred to by fd. This hint persists until either it is explicitly modified or the underlying filesystem is unmounted.
F_GET_FILE_RW_HINT (uint64_t *; since Linux 4.13)
Returns the value of the read/write hint associated with the open file description referred to by fd.
F_SET_FILE_RW_HINT (uint64_t *; since Linux 4.13)
Sets the read/write hint value associated with the open file description referred to by fd.
If an open file description has not been assigned a read/write hint, then it shall use the value assigned to the inode, if any.
The following read/write hints are valid since Linux 4.13:
RWH_WRITE_LIFE_NOT_SET
No specific hint has been set. This is the default value.
RWH_WRITE_LIFE_NONE
No specific write lifetime is associated with this file or inode.
RWH_WRITE_LIFE_SHORT
Data written to this inode or via this open file description is expected to have a short lifetime.
RWH_WRITE_LIFE_MEDIUM
Data written to this inode or via this open file description is expected to have a lifetime longer than data written with RWH_WRITE_LIFE_SHORT.
RWH_WRITE_LIFE_LONG
Data written to this inode or via this open file description is expected to have a lifetime longer than data written with RWH_WRITE_LIFE_MEDIUM.
RWH_WRITE_LIFE_EXTREME
Data written to this inode or via this open file description is expected to have a lifetime longer than data written with RWH_WRITE_LIFE_LONG.
All the write-specific hints are relative to each other, and no individual absolute meaning should be attributed to them.
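Note that the hint operations take a pointer to a uint64_t rather than the value itself. A minimal sketch, assuming Linux 4.13 or later and a glibc whose <fcntl.h> defines these names under _GNU_SOURCE (the helper name create_with_write_hint() is hypothetical):

```c
#define _GNU_SOURCE     /* for F_SET_RW_HINT and RWH_WRITE_LIFE_* */
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) the file at path for writing
   and attach a write lifetime hint such as RWH_WRITE_LIFE_SHORT to its
   inode.  Returns the file descriptor, or -1 with errno set. */
int create_with_write_hint(const char *path, uint64_t hint)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;
    if (fcntl(fd, F_SET_RW_HINT, &hint) == -1) {   /* arg is a pointer */
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

Since the hints carry no functional semantics, an application could equally choose to ignore a failure of F_SET_RW_HINT and continue with the unhinted descriptor; this sketch treats it as an error only to keep the example short.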
RETURN VALUE
For a successful call, the return value depends on the operation:
F_DUPFD
The new file descriptor.
F_GETFD
Value of file descriptor flags.
F_GETFL
Value of file status flags.
F_GETLEASE
Type of lease held on file descriptor.
F_GETOWN
Value of file descriptor owner.
F_GETSIG
Value of signal sent when read or write becomes possible, or zero for traditional SIGIO behavior.
F_GETPIPE_SZ
F_SETPIPE_SZ
The pipe capacity.
F_GET_SEALS
A bit mask identifying the seals that have been set for the inode referred to by fd.
All other operations
Zero.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES or EAGAIN
Operation is prohibited by locks held by other processes.
EAGAIN
The operation is prohibited because the file has been memory-mapped by another process.
EBADF
fd is not an open file descriptor.
EBADF
op is F_SETLK or F_SETLKW and the file descriptor open mode doesn’t match with the type of lock requested.
EBUSY
op is F_SETPIPE_SZ and the new pipe capacity specified in arg is smaller than the amount of buffer space currently used to store data in the pipe.
EBUSY
op is F_ADD_SEALS, arg includes F_SEAL_WRITE, and there exists a writable, shared mapping on the file referred to by fd.
EDEADLK
It was detected that the specified F_SETLKW operation would cause a deadlock.
EFAULT
lock is outside your accessible address space.
EINTR
op is F_SETLKW or F_OFD_SETLKW and the operation was interrupted by a signal; see signal(7).
EINTR
op is F_GETLK, F_SETLK, F_OFD_GETLK, or F_OFD_SETLK, and the operation was interrupted by a signal before the lock was checked or acquired. This is most likely when locking a remote file (e.g., locking over NFS), but can sometimes happen locally.
EINVAL
The value specified in op is not recognized by this kernel.
EINVAL
op is F_ADD_SEALS and arg includes an unrecognized sealing bit.
EINVAL
op is F_ADD_SEALS or F_GET_SEALS and the filesystem containing the inode referred to by fd does not support sealing.
EINVAL
op is F_DUPFD and arg is negative or is greater than the maximum allowable value (see the discussion of RLIMIT_NOFILE in getrlimit(2)).
EINVAL
op is F_SETSIG and arg is not an allowable signal number.
EINVAL
op is F_OFD_SETLK, F_OFD_SETLKW, or F_OFD_GETLK, and l_pid was not specified as zero.
EMFILE
op is F_DUPFD and the per-process limit on the number of open file descriptors has been reached.
ENOLCK
Too many segment locks open, lock table is full, or a remote locking protocol failed (e.g., locking over NFS).
ENOTDIR
F_NOTIFY was specified in op, but fd does not refer to a directory.
EPERM
op is F_SETPIPE_SZ and the soft or hard user pipe limit has been reached; see pipe(7).
EPERM
Attempted to clear the O_APPEND flag on a file that has the append-only attribute set.
EPERM
op was F_ADD_SEALS, but fd was not open for writing or the current set of seals on the file already includes F_SEAL_SEAL.
STANDARDS
POSIX.1-2008.
F_GETOWN_EX, F_SETOWN_EX, F_SETPIPE_SZ, F_GETPIPE_SZ, F_GETSIG, F_SETSIG, F_NOTIFY, F_GETLEASE, and F_SETLEASE are Linux-specific. (Define the _GNU_SOURCE macro to obtain these definitions.)
F_OFD_SETLK, F_OFD_SETLKW, and F_OFD_GETLK are Linux-specific (and one must define _GNU_SOURCE to obtain their definitions), but work is being done to have them included in the next version of POSIX.1.
F_ADD_SEALS and F_GET_SEALS are Linux-specific.
HISTORY
SVr4, 4.3BSD, POSIX.1-2001.
Only the operations F_DUPFD, F_GETFD, F_SETFD, F_GETFL, F_SETFL, F_GETLK, F_SETLK, and F_SETLKW are specified in POSIX.1-2001.
F_GETOWN and F_SETOWN are specified in POSIX.1-2001. (To get their definitions, define either _XOPEN_SOURCE with the value 500 or greater, or _POSIX_C_SOURCE with the value 200809L or greater.)
F_DUPFD_CLOEXEC is specified in POSIX.1-2008. (To get this definition, define _POSIX_C_SOURCE with the value 200809L or greater, or _XOPEN_SOURCE with the value 700 or greater.)
NOTES
The errors returned by dup2(2) are different from those returned by F_DUPFD.
File locking
The original Linux fcntl() system call was not designed to handle large file offsets (in the flock structure). Consequently, an fcntl64() system call was added in Linux 2.4. The newer system call employs a different structure for file locking, flock64, and corresponding operations, F_GETLK64, F_SETLK64, and F_SETLKW64. However, these details can be ignored by applications using glibc, whose fcntl() wrapper function transparently employs the more recent system call where it is available.
Record locks
Since Linux 2.0, there is no interaction between the types of lock placed by flock(2) and fcntl().
Several systems have more fields in struct flock such as, for example, l_sysid (to identify the machine where the lock is held). Clearly, l_pid alone is not going to be very useful if the process holding the lock may live on a different machine; on Linux, while present on some architectures (such as MIPS32), this field is not used.
Record locking and NFS
Before Linux 3.12, if an NFSv4 client loses contact with the server for a period of time (defined as more than 90 seconds with no communication), it might lose and regain a lock without ever being aware of the fact. (The period of time after which contact is assumed lost is known as the NFSv4 leasetime. On a Linux NFS server, this can be determined by looking at /proc/fs/nfsd/nfsv4leasetime, which expresses the period in seconds. The default value for this file is 90.) This scenario potentially risks data corruption, since another process might acquire a lock in the intervening period and perform file I/O.
Since Linux 3.12, if an NFSv4 client loses contact with the server, any I/O to the file by a process which “thinks” it holds a lock will fail until that process closes and reopens the file. A kernel parameter, nfs.recover_lost_locks, can be set to 1 to obtain the pre-3.12 behavior, whereby the client will attempt to recover lost locks when contact is reestablished with the server. Because of the attendant risk of data corruption, this parameter defaults to 0 (disabled).
BUGS
F_SETFL
It is not possible to use F_SETFL to change the state of the O_DSYNC and O_SYNC flags. Attempts to change the state of these flags are silently ignored.
F_GETOWN
A limitation of the Linux system call conventions on some architectures (notably i386) means that if a (negative) process group ID to be returned by F_GETOWN falls in the range -1 to -4095, then the return value is wrongly interpreted by glibc as an error in the system call; that is, the return value of fcntl() will be -1, and errno will contain the (positive) process group ID. The Linux-specific F_GETOWN_EX operation avoids this problem. Since glibc 2.11, glibc makes the kernel F_GETOWN problem invisible by implementing F_GETOWN using F_GETOWN_EX.
F_SETOWN
In Linux 2.4 and earlier, there is a bug that can occur when an unprivileged process uses F_SETOWN to specify the owner of a socket file descriptor as a process (group) other than the caller. In this case, fcntl() can return -1 with errno set to EPERM, even when the owner process (group) is one that the caller has permission to send signals to. Despite this error return, the file descriptor owner is set, and signals will be sent to the owner.
Deadlock detection
The deadlock-detection algorithm employed by the kernel when dealing with F_SETLKW requests can yield both false negatives (failures to detect deadlocks, leaving a set of deadlocked processes blocked indefinitely) and false positives (EDEADLK errors when there is no deadlock). For example, the kernel limits the lock depth of its dependency search to 10 steps, meaning that circular deadlock chains that exceed that size will not be detected. In addition, the kernel may falsely indicate a deadlock when two or more processes created using the clone(2) CLONE_FILES flag place locks that appear (to the kernel) to conflict.
Mandatory locking
The Linux implementation of mandatory locking is subject to race conditions which render it unreliable: a write(2) call that overlaps with a lock may modify data after the mandatory lock is acquired; a read(2) call that overlaps with a lock may detect changes to data that were made only after a write lock was acquired. Similar races exist between mandatory locks and mmap(2). It is therefore inadvisable to rely on mandatory locking.
SEE ALSO
dup2(2), flock(2), open(2), socket(2), lockf(3), capabilities(7), feature_test_macros(7), lslocks(8)
locks.txt, mandatory-locking.txt, and dnotify.txt in the Linux kernel source directory Documentation/filesystems/ (on older kernels, these files are directly under the Documentation/ directory, and mandatory-locking.txt is called mandatory.txt)
341 - Linux cli command olduname
NAME 🖥️ olduname 🖥️
get name and information about current kernel
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/utsname.h>
int uname(struct utsname *buf);
DESCRIPTION
uname() returns system information in the structure pointed to by buf. The utsname struct is defined in <sys/utsname.h>:
struct utsname {
    char sysname[];    /* Operating system name (e.g., "Linux") */
    char nodename[];   /* Name within communications network
                          to which the node is attached, if any */
    char release[];    /* Operating system release
                          (e.g., "2.6.28") */
    char version[];    /* Operating system version */
    char machine[];    /* Hardware type identifier */
#ifdef _GNU_SOURCE
    char domainname[]; /* NIS or YP domain name */
#endif
};
The length of the arrays in a struct utsname is unspecified (see NOTES); the fields are terminated by a null byte ('\0').
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
buf is not valid.
VERSIONS
The domainname member (the NIS or YP domain name) is a GNU extension.
The length of the fields in the struct varies. Some operating systems or libraries use a hardcoded 9 or 33 or 65 or 257. Other systems use SYS_NMLN or _SYS_NMLN or UTSLEN or _UTSNAME_LENGTH. Clearly, it is a bad idea to use any of these constants; just use sizeof(…). SVr4 uses 257, “to support Internet hostnames”; this is the largest value likely to be encountered in the wild.
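The advice to rely on sizeof() rather than a hardcoded field length can be sketched as follows. This is only an illustration: the helper name copy_nodename() is hypothetical and is not part of any library.

```c
#include <string.h>
#include <sys/utsname.h>

/*
 * Hypothetical helper: copy the node name reported by uname() into dst.
 * All sizes come from sizeof, never from a hardcoded field length.
 * Returns 0 on success, -1 if uname() fails or dst is too small.
 */
static int
copy_nodename(char *dst, size_t dstlen)
{
    struct utsname u;
    const char *end;
    size_t len;

    if (uname(&u) == -1)
        return -1;

    /* The search is bounded by the real size of the field on this system. */
    end = memchr(u.nodename, '\0', sizeof(u.nodename));
    len = (end != NULL) ? (size_t) (end - u.nodename) : sizeof(u.nodename);

    if (len >= dstlen)          /* need room for the terminating null byte */
        return -1;

    memcpy(dst, u.nodename, len);
    dst[len] = '\0';
    return 0;
}
```

A caller that wants a buffer guaranteed to be large enough can declare it as char name[sizeof(((struct utsname *) 0)->nodename)] and pass sizeof(name) as dstlen.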
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD.
C library/kernel differences
Over time, increases in the size of the utsname structure have led to three successive versions of uname(): sys_olduname() (slot __NR_oldolduname), sys_uname() (slot __NR_olduname), and sys_newuname() (slot __NR_uname). The first one used length 9 for all fields; the second used 65; the third also uses 65 but adds the domainname field. The glibc uname() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel.
NOTES
The kernel has the name, release, version, and supported machine type built in. Conversely, the nodename field is configured by the administrator to match the network (this is what the BSD historically calls the “hostname”, and is set via sethostname(2)). Similarly, the domainname field is set via setdomainname(2).
Part of the utsname information is also accessible via /proc/sys/kernel/{ostype, hostname, osrelease, version, domainname}.
SEE ALSO
uname(1), getdomainname(2), gethostname(2), uts_namespaces(7)
342 - Linux cli command select_tut
NAME 🖥️ select_tut 🖥️
synchronous I/O multiplexing
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
See select(2)
DESCRIPTION
The select() and pselect() system calls are used to efficiently monitor multiple file descriptors, to see if any of them is, or becomes, “ready”; that is, to see whether I/O becomes possible, or an “exceptional condition” has occurred on any of the file descriptors.
This page provides background and tutorial information on the use of these system calls. For details of the arguments and semantics of select() and pselect(), see select(2).
Combining signal and data events
pselect() is useful if you are waiting for a signal as well as for file descriptor(s) to become ready for I/O. Programs that receive signals normally use the signal handler only to raise a global flag. The global flag will indicate that the event must be processed in the main loop of the program. A signal will cause the select() (or pselect()) call to return with errno set to EINTR. This behavior is essential so that signals can be processed in the main loop of the program, otherwise select() would block indefinitely.
Now, somewhere in the main loop will be a conditional to check the global flag. So we must ask: what if a signal arrives after the conditional, but before the select() call? The answer is that select() would block indefinitely, even though an event is actually pending. This race condition is solved by the pselect() call. This call can be used to set the signal mask to a set of signals that are to be received only within the pselect() call. For instance, let us say that the event in question was the exit of a child process. Before the start of the main loop, we would block SIGCHLD using sigprocmask(2). Our pselect() call would enable SIGCHLD by using an empty signal mask. Our program would look like:
static volatile sig_atomic_t got_SIGCHLD = 0;
static void
child_sig_handler(int sig)
{
    got_SIGCHLD = 1;
}
int
main(int argc, char *argv[])
{
    sigset_t sigmask, empty_mask;
    struct sigaction sa;
    fd_set readfds, writefds, exceptfds;
    int r;
    sigemptyset(&sigmask);
    sigaddset(&sigmask, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &sigmask, NULL) == -1) {
        perror("sigprocmask");
        exit(EXIT_FAILURE);
    }
    sa.sa_flags = 0;
    sa.sa_handler = child_sig_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }
    sigemptyset(&empty_mask);
    for (;;) { /* main loop */
        /* Initialize readfds, writefds, and exceptfds
           before the pselect() call. (Code omitted.) */
        r = pselect(nfds, &readfds, &writefds, &exceptfds,
                    NULL, &empty_mask);
        if (r == -1 && errno != EINTR) {
            /* Handle error */
        }
        if (got_SIGCHLD) {
            got_SIGCHLD = 0;
            /* Handle signalled event here; e.g., wait() for all
               terminated children. (Code omitted.) */
        }
        /* main body of program */
    }
}
Practical
So what is the point of select()? Can’t I just read and write to my file descriptors whenever I want? The point of select() is that it watches multiple descriptors at the same time and properly puts the process to sleep if there is no activity. UNIX programmers often find themselves in a position where they have to handle I/O from more than one file descriptor where the data flow may be intermittent. If you were to merely create a sequence of read(2) and write(2) calls, you would find that one of your calls may block waiting for data from/to a file descriptor, while another file descriptor is unused though ready for I/O. select() efficiently copes with this situation.
Select law
Many people who try to use select() come across behavior that is difficult to understand and produces nonportable or borderline results. For instance, the above program is carefully written not to block at any point, even though it does not set its file descriptors to nonblocking mode. It is easy to introduce subtle errors that will remove the advantage of using select(), so here is a list of essentials to watch for when using select().
1.
You should always try to use select() without a timeout. Your program should have nothing to do if there is no data available. Code that depends on timeouts is not usually portable and is difficult to debug.
2.
The value nfds must be properly calculated for efficiency as explained above.
3.
No file descriptor must be added to any set if you do not intend to check its result after the select() call, and respond appropriately. See next rule.
4.
After select() returns, all file descriptors in all sets should be checked to see if they are ready.
5.
The functions read(2), recv(2), write(2), and send(2) do not necessarily read/write the full amount of data that you have requested. If they do read/write the full amount, it’s because you have a low traffic load and a fast stream. This is not always going to be the case. You should cope with the case of your functions managing to send or receive only a single byte.
6.
Never read/write only in single bytes at a time unless you are really sure that you have a small amount of data to process. It is extremely inefficient not to read/write as much data as you can buffer each time. The buffers in the example below are 1024 bytes although they could easily be made larger.
7.
Calls to read(2), recv(2), write(2), send(2), and select() can fail with the error EINTR, and calls to read(2), recv(2), write(2), and send(2) can fail with errno set to EAGAIN (EWOULDBLOCK). These results must be properly managed (not done properly above). If your program is not going to receive any signals, then it is unlikely you will get EINTR. If your program does not set nonblocking I/O, you will not get EAGAIN.
8.
Never call read(2), recv(2), write(2), or send(2) with a buffer length of zero.
9.
If the functions read(2), recv(2), write(2), and send(2) fail with errors other than those listed in 7., or one of the input functions returns 0, indicating end of file, then you should not pass that file descriptor to select() again. In the example below, I close the file descriptor immediately, and then set it to -1 to prevent it being included in a set.
10.
The timeout value must be initialized with each new call to select(), since some operating systems modify the structure. pselect() however does not modify its timeout structure.
11.
Since select() modifies its file descriptor sets, if the call is being used in a loop, then the sets must be reinitialized before each call.
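Several of the rules above can be seen together in a small sketch. The helper read_when_ready() is hypothetical; it watches a single descriptor for reading only, so it illustrates the rules rather than replacing a full multiplexing loop.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until fd is readable, then read up to len
 * bytes. Returns the number of bytes read (possibly fewer than len),
 * 0 at end of file, or -1 on error.
 */
static ssize_t
read_when_ready(int fd, void *buf, size_t len)
{
    fd_set readfds;
    ssize_t n;

    /* Rule 8: never issue a zero-length read. */
    if (fd < 0 || fd >= FD_SETSIZE || len == 0) {
        errno = EINVAL;
        return -1;
    }

    for (;;) {
        /* Rule 11: select() modifies the set, so rebuild it each time. */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);

        /* Rules 1 and 2: no timeout; nfds is the highest descriptor + 1. */
        if (select(fd + 1, &readfds, NULL, NULL, NULL) == -1) {
            if (errno == EINTR)
                continue;               /* Rule 7: interrupted, retry */
            return -1;
        }

        /* Rule 4: check the descriptor that was placed in the set. */
        if (!FD_ISSET(fd, &readfds))
            continue;

        /* Rule 5: the read may return fewer bytes than requested. */
        n = read(fd, buf, len);
        if (n == -1 && (errno == EINTR || errno == EAGAIN))
            continue;                   /* Rule 7 */
        return n;
    }
}
```

When the helper returns 0 or -1, the caller should stop passing that descriptor to select(), as rule 9 requires.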
RETURN VALUE
See select(2).
NOTES
Generally speaking, all operating systems that support sockets also support select(). select() can be used to solve many problems in a portable and efficient way that naive programmers try to solve in a more complicated manner using threads, forking, IPCs, signals, memory sharing, and so on.
The poll(2) system call has the same functionality as select(), and is somewhat more efficient when monitoring sparse file descriptor sets. It is nowadays widely available, but historically was less portable than select().
The Linux-specific epoll(7) API provides an interface that is more efficient than select(2) and poll(2) when monitoring large numbers of file descriptors.
EXAMPLES
Here is an example that better demonstrates the true utility of select(). The listing below is a TCP forwarding program that forwards from one TCP port to another.
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
static int forward_port;
#undef max
#define max(x, y) ((x) > (y) ? (x) : (y))
static int
listen_socket(int listen_port)
{
    int lfd;
    int yes;
    struct sockaddr_in addr;
    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd == -1) {
        perror("socket");
        return -1;
    }
    yes = 1;
    if (setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR,
                   &yes, sizeof(yes)) == -1)
    {
        perror("setsockopt");
        close(lfd);
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_port = htons(listen_port);
    addr.sin_family = AF_INET;
    if (bind(lfd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
        perror("bind");
        close(lfd);
        return -1;
    }
    printf("accepting connections on port %d\n", listen_port);
    listen(lfd, 10);
    return lfd;
}
static int
connect_socket(int connect_port, char *address)
{
    int cfd;
    struct sockaddr_in addr;
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd == -1) {
        perror("socket");
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_port = htons(connect_port);
    addr.sin_family = AF_INET;
    if (!inet_aton(address, (struct in_addr *) &addr.sin_addr.s_addr)) {
        fprintf(stderr, "inet_aton(): bad IP address format\n");
        close(cfd);
        return -1;
    }
    if (connect(cfd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
        perror("connect()");
        shutdown(cfd, SHUT_RDWR);
        close(cfd);
        return -1;
    }
    return cfd;
}
#define SHUT_FD1 do {                      \
        if (fd1 >= 0) {                    \
            shutdown(fd1, SHUT_RDWR);      \
            close(fd1);                    \
            fd1 = -1;                      \
        }                                  \
    } while (0)
#define SHUT_FD2 do {                      \
        if (fd2 >= 0) {                    \
            shutdown(fd2, SHUT_RDWR);      \
            close(fd2);                    \
            fd2 = -1;                      \
        }                                  \
    } while (0)
#define BUF_SIZE 1024
int
main(int argc, char *argv[])
{
    int h;
    int ready, nfds;
    int fd1 = -1, fd2 = -1;
    int buf1_avail = 0, buf1_written = 0;
    int buf2_avail = 0, buf2_written = 0;
    char buf1[BUF_SIZE], buf2[BUF_SIZE];
    fd_set readfds, writefds, exceptfds;
    ssize_t nbytes;
    if (argc != 4) {
fprintf(stderr, “Usage
fwd
The above program properly forwards most kinds of TCP connections including OOB signal data transmitted by telnet servers. It handles the tricky problem of having data flow in both directions simultaneously. You might think it more efficient to use a fork(2) call and devote a thread to each stream. This becomes more tricky than you might suspect. Another idea is to set nonblocking I/O using fcntl(2). This also has its problems because you end up using inefficient timeouts.
The program does not handle more than one simultaneous connection at a time, although it could easily be extended to do this with a linked list of buffers, one for each connection. At the moment, new connections cause the current connection to be dropped.
SEE ALSO
accept(2), connect(2), poll(2), read(2), recv(2), select(2), send(2), sigprocmask(2), write(2), epoll(7)
343 - Linux cli command mincore
NAME 🖥️ mincore 🖥️
determine whether pages are resident in memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
int mincore(void addr[.length], size_t length, unsigned char *vec);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
mincore():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE || _SVID_SOURCE
DESCRIPTION
mincore() returns a vector that indicates whether pages of the calling process’s virtual memory are resident in core (RAM), and so will not cause a disk access (page fault) if referenced. The kernel returns residency information about the pages starting at the address addr, and continuing for length bytes.
The addr argument must be a multiple of the system page size. The length argument need not be a multiple of the page size, but since residency information is returned for whole pages, length is effectively rounded up to the next multiple of the page size. One may obtain the page size (PAGE_SIZE) using sysconf(_SC_PAGESIZE).
The vec argument must point to an array containing at least (length+PAGE_SIZE-1) / PAGE_SIZE bytes. On return, the least significant bit of each byte will be set if the corresponding page is currently resident in memory, and be clear otherwise. (The settings of the other bits in each byte are undefined; these bits are reserved for possible later use.) Of course the information returned in vec is only a snapshot: pages that are not locked in memory can come and go at any moment, and the contents of vec may already be stale by the time this call returns.
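The sizing rule for vec and the meaning of the least significant bit can be sketched as follows. The helper count_resident_pages() is hypothetical, and its result is only a snapshot, as noted above.

```c
#define _DEFAULT_SOURCE         /* for mincore(); see the SYNOPSIS */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the resident pages in [addr, addr + length).
 * addr must be a multiple of the page size.
 * Returns the count, or -1 on error.
 */
static long
count_resident_pages(void *addr, size_t length)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t npages, i;
    unsigned char *vec;
    long resident = 0;

    if (page_size <= 0 || length == 0)
        return -1;

    /* One byte per page; length is rounded up to whole pages. */
    npages = (length + (size_t) page_size - 1) / (size_t) page_size;
    vec = malloc(npages);
    if (vec == NULL)
        return -1;

    if (mincore(addr, length, vec) == -1) {
        free(vec);
        return -1;
    }

    for (i = 0; i < npages; i++)
        if (vec[i] & 1)         /* only the least significant bit is defined */
            resident++;

    free(vec);
    return resident;
}
```

Masking with 1 matters because the other bits of each byte are undefined; comparing a byte with 1 directly would depend on those bits.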
RETURN VALUE
On success, mincore() returns zero. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EAGAIN
The kernel is temporarily out of resources.
EFAULT
vec points to an invalid address.
EINVAL
addr is not a multiple of the page size.
ENOMEM
length is greater than (TASK_SIZE - addr). (This could occur if a negative value is specified for length, since that value will be interpreted as a large unsigned integer.) In Linux 2.6.11 and earlier, the error EINVAL was returned for this condition.
ENOMEM
addr to addr + length contained unmapped memory.
STANDARDS
None.
HISTORY
Linux 2.3.99pre1, glibc 2.2.
First appeared in 4.4BSD.
NetBSD, FreeBSD, OpenBSD, Solaris 8, AIX 5.1, SunOS 4.1.
BUGS
Before Linux 2.6.21, mincore() did not return correct information for MAP_PRIVATE mappings, or for nonlinear mappings (established using remap_file_pages(2)).
SEE ALSO
fincore(1), madvise(2), mlock(2), mmap(2), posix_fadvise(2), posix_madvise(3)
344 - Linux cli command wait
NAME 🖥️ wait 🖥️
wait for process to change state
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/wait.h>
pid_t wait(int *_Nullable wstatus);
pid_t waitpid(pid_t pid, int *_Nullable wstatus, int options);
int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
/* This is the glibc and POSIX interface; see
NOTES for information on the raw system call. */
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
waitid():
Since glibc 2.26:
_XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200809L
glibc 2.25 and earlier:
_XOPEN_SOURCE
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
All of these system calls are used to wait for state changes in a child of the calling process, and obtain information about the child whose state has changed. A state change is considered to be: the child terminated; the child was stopped by a signal; or the child was resumed by a signal. In the case of a terminated child, performing a wait allows the system to release the resources associated with the child; if a wait is not performed, then the terminated child remains in a “zombie” state (see NOTES below).
If a child has already changed state, then these calls return immediately. Otherwise, they block until either a child changes state or a signal handler interrupts the call (assuming that system calls are not automatically restarted using the SA_RESTART flag of sigaction(2)). In the remainder of this page, a child whose state has changed and which has not yet been waited upon by one of these system calls is termed waitable.
wait() and waitpid()
The wait() system call suspends execution of the calling thread until one of its children terminates. The call wait(&wstatus) is equivalent to:
waitpid(-1, &wstatus, 0);
The waitpid() system call suspends execution of the calling thread until a child specified by pid argument has changed state. By default, waitpid() waits only for terminated children, but this behavior is modifiable via the options argument, as described below.
The value of pid can be:
< -1
meaning wait for any child process whose process group ID is equal to the absolute value of pid.
-1
meaning wait for any child process.
0
meaning wait for any child process whose process group ID is equal to that of the calling process at the time of the call to waitpid().
> 0
meaning wait for the child whose process ID is equal to the value of pid.
The value of options is an OR of zero or more of the following constants:
WNOHANG
return immediately if no child has exited.
WUNTRACED
also return if a child has stopped (but not traced via ptrace(2)). Status for traced children which have stopped is provided even if this option is not specified.
WCONTINUED (since Linux 2.6.10)
also return if a stopped child has been resumed by delivery of SIGCONT.
(For Linux-only options, see below.)
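As an illustration of pid == -1 combined with WNOHANG, the following sketch collects every child that has already terminated without ever blocking. The helper name reap_terminated_children() is hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>             /* fork(), _exit() in the caller */

/*
 * Hypothetical helper: reap all children that have already terminated.
 * Never blocks. Returns the number of children reaped.
 */
static int
reap_terminated_children(void)
{
    int reaped = 0;
    pid_t w;

    for (;;) {
        w = waitpid(-1, NULL, WNOHANG);     /* -1: any child */
        if (w > 0) {
            reaped++;                       /* one child collected */
            continue;
        }
        if (w == -1 && errno == EINTR)
            continue;                       /* interrupted, retry */

        /* w == 0: children exist but none has terminated yet;
           w == -1: typically ECHILD, no children remain. */
        break;
    }
    return reaped;
}
```

A loop of this kind is commonly run from the main loop after a SIGCHLD handler has set a flag, because several children may terminate before the flag is examined.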
If wstatus is not NULL, wait() and waitpid() store status information in the int to which it points. This integer can be inspected with the following macros (which take the integer itself as an argument, not a pointer to it, as is done in wait() and waitpid()!):
WIFEXITED(wstatus)
returns true if the child terminated normally, that is, by calling exit(3) or _exit(2), or by returning from main().
WEXITSTATUS(wstatus)
returns the exit status of the child. This consists of the least significant 8 bits of the status argument that the child specified in a call to exit(3) or _exit(2) or as the argument for a return statement in main(). This macro should be employed only if WIFEXITED returned true.
WIFSIGNALED(wstatus)
returns true if the child process was terminated by a signal.
WTERMSIG(wstatus)
returns the number of the signal that caused the child process to terminate. This macro should be employed only if WIFSIGNALED returned true.
WCOREDUMP(wstatus)
returns true if the child produced a core dump (see core(5)). This macro should be employed only if WIFSIGNALED returned true.
This macro is not specified in POSIX.1-2001 and is not available on some UNIX implementations (e.g., AIX, SunOS). Therefore, enclose its use inside #ifdef WCOREDUMP … #endif.
WIFSTOPPED(wstatus)
returns true if the child process was stopped by delivery of a signal; this is possible only if the call was done using WUNTRACED or when the child is being traced (see ptrace(2)).
WSTOPSIG(wstatus)
returns the number of the signal which caused the child to stop. This macro should be employed only if WIFSTOPPED returned true.
WIFCONTINUED(wstatus)
(since Linux 2.6.10) returns true if the child process was resumed by delivery of SIGCONT.
waitid()
The waitid() system call (available since Linux 2.6.9) provides more precise control over which child state changes to wait for.
The idtype and id arguments select the child(ren) to wait for, as follows:
idtype == P_PID
Wait for the child whose process ID matches id.
idtype == P_PIDFD (since Linux 5.4)
Wait for the child referred to by the PID file descriptor specified in id. (See pidfd_open(2) for further information on PID file descriptors.)
idtype == P_PGID
Wait for any child whose process group ID matches id. Since Linux 5.4, if id is zero, then wait for any child that is in the same process group as the caller’s process group at the time of the call.
idtype == P_ALL
Wait for any child; id is ignored.
The child state changes to wait for are specified by ORing one or more of the following flags in options:
WEXITED
Wait for children that have terminated.
WSTOPPED
Wait for children that have been stopped by delivery of a signal.
WCONTINUED
Wait for (previously stopped) children that have been resumed by delivery of SIGCONT.
The following flags may additionally be ORed in options:
WNOHANG
As for waitpid().
WNOWAIT
Leave the child in a waitable state; a later wait call can be used to again retrieve the child status information.
Upon successful return, waitid() fills in the following fields of the siginfo_t structure pointed to by infop:
si_pid
The process ID of the child.
si_uid
The real user ID of the child. (This field is not set on most other implementations.)
si_signo
Always set to SIGCHLD.
si_status
Either the exit status of the child, as given to _exit(2) (or exit(3)), or the signal that caused the child to terminate, stop, or continue. The si_code field can be used to determine how to interpret this field.
si_code
Set to one of: CLD_EXITED (child called _exit(2)); CLD_KILLED (child killed by signal); CLD_DUMPED (child killed by signal, and dumped core); CLD_STOPPED (child stopped by signal); CLD_TRAPPED (traced child has trapped); or CLD_CONTINUED (child continued by SIGCONT).
If WNOHANG was specified in options and there were no children in a waitable state, then waitid() returns 0 immediately and the state of the siginfo_t structure pointed to by infop depends on the implementation. To (portably) distinguish this case from that where a child was in a waitable state, zero out the si_pid field before the call and check for a nonzero value in this field after the call returns.
POSIX.1-2008 Technical Corrigendum 1 (2013) adds the requirement that when WNOHANG is specified in options and there were no children in a waitable state, then waitid() should zero out the si_pid and si_signo fields of the structure. On Linux and other implementations that adhere to this requirement, it is not necessary to zero out the si_pid field before calling waitid(). However, not all implementations follow the POSIX.1 specification on this point.
RETURN VALUE
wait(): on success, returns the process ID of the terminated child; on failure, -1 is returned.
waitpid(): on success, returns the process ID of the child whose state has changed; if WNOHANG was specified and one or more child(ren) specified by pid exist, but have not yet changed state, then 0 is returned. On failure, -1 is returned.
waitid(): returns 0 on success or if WNOHANG was specified and no child(ren) specified by id has yet changed state; on failure, -1 is returned.
On failure, each of these calls sets errno to indicate the error.
ERRORS
EAGAIN
The PID file descriptor specified in id is nonblocking and the process that it refers to has not terminated.
ECHILD
(for wait()) The calling process does not have any unwaited-for children.
ECHILD
(for waitpid() or waitid()) The process specified by pid (waitpid()) or idtype and id (waitid()) does not exist or is not a child of the calling process. (This can happen for one’s own child if the action for SIGCHLD is set to SIG_IGN. See also the Linux Notes section about threads.)
EINTR
WNOHANG was not set and an unblocked signal or a SIGCHLD was caught; see signal(7).
EINVAL
The options argument was invalid.
ESRCH
(for wait() or waitpid()) pid is equal to INT_MIN.
VERSIONS
C library/kernel differences
wait() is actually a library function that (in glibc) is implemented as a call to wait4(2).
On some architectures, there is no waitpid() system call; instead, this interface is implemented via a C library wrapper function that calls wait4(2).
The raw waitid() system call takes a fifth argument, of type struct rusage *. If this argument is non-NULL, then it is used to return resource usage information about the child, in the same manner as wait4(2). See getrusage(2) for details.
STANDARDS
POSIX.1-2008.
HISTORY
SVr4, 4.3BSD, POSIX.1-2001.
NOTES
A child that terminates but has not been waited for becomes a “zombie”. The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. If a parent process terminates, then its “zombie” children (if any) are adopted by init(1) (or by the nearest “subreaper” process as defined through the use of the prctl(2) PR_SET_CHILD_SUBREAPER operation); init(1) automatically performs a wait to remove the zombies.
POSIX.1-2001 specifies that if the disposition of SIGCHLD is set to SIG_IGN or the SA_NOCLDWAIT flag is set for SIGCHLD (see sigaction(2)), then children that terminate do not become zombies and a call to wait() or waitpid() will block until all children have terminated, and then fail with errno set to ECHILD. (The original POSIX standard left the behavior of setting SIGCHLD to SIG_IGN unspecified. Note that even though the default disposition of SIGCHLD is “ignore”, explicitly setting the disposition to SIG_IGN results in different treatment of zombie process children.)
Linux 2.6 conforms to the POSIX requirements. However, Linux 2.4 (and earlier) does not: if a wait() or waitpid() call is made while SIGCHLD is being ignored, the call behaves just as though SIGCHLD were not being ignored, that is, the call blocks until the next child terminates and then returns the process ID and status of that child.
Linux notes
In the Linux kernel, a kernel-scheduled thread is not a distinct construct from a process. Instead, a thread is simply a process that is created using the Linux-unique clone(2) system call; other routines such as the portable pthread_create(3) call are implemented using clone(2). Before Linux 2.4, a thread was just a special case of a process, and as a consequence one thread could not wait on the children of another thread, even when the latter belongs to the same thread group. However, POSIX prescribes such functionality, and since Linux 2.4 a thread can, and by default will, wait on children of other threads in the same thread group.
The following Linux-specific options are for use with children created using clone(2); they can also, since Linux 4.7, be used with waitid():
__WCLONE
Wait for “clone” children only. If omitted, then wait for “non-clone” children only. (A “clone” child is one which delivers no signal, or a signal other than SIGCHLD to its parent upon termination.) This option is ignored if __WALL is also specified.
__WALL (since Linux 2.4)
Wait for all children, regardless of type (“clone” or “non-clone”).
__WNOTHREAD (since Linux 2.4)
Do not wait for children of other threads in the same thread group. This was the default before Linux 2.4.
Since Linux 4.7, the __WALL flag is automatically implied if the child is being ptraced.
BUGS
According to POSIX.1-2008, an application calling waitid() must ensure that infop points to a siginfo_t structure (i.e., that it is a non-null pointer). On Linux, if infop is NULL, waitid() succeeds, and returns the process ID of the waited-for child. Applications should avoid relying on this inconsistent, nonstandard, and unnecessary feature.
EXAMPLES
The following program demonstrates the use of fork(2) and waitpid(). The program creates a child process. If no command-line argument is supplied to the program, then the child suspends its execution using pause(2), to allow the user to send signals to the child. Otherwise, if a command-line argument is supplied, then the child exits immediately, using the integer supplied on the command line as the exit status. The parent process executes a loop that monitors the child using waitpid(), and uses the W*() macros described above to analyze the wait status value.
The following shell session demonstrates the use of the program:
$ ./a.out &
Child PID is 32360
[1] 32359
$ kill -STOP 32360
stopped by signal 19
$ kill -CONT 32360
continued
$ kill -TERM 32360
killed by signal 15
[1]+ Done ./a.out
$
Program source
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    int    wstatus;
    pid_t  cpid, w;

    cpid = fork();
    if (cpid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }

    if (cpid == 0) {            /* Code executed by child */
        printf("Child PID is %jd\n", (intmax_t) getpid());
        if (argc == 1)
            pause();            /* Wait for signals */
        _exit(atoi(argv[1]));
    } else {                    /* Code executed by parent */
        do {
            w = waitpid(cpid, &wstatus, WUNTRACED | WCONTINUED);
            if (w == -1) {
                perror("waitpid");
                exit(EXIT_FAILURE);
            }

            if (WIFEXITED(wstatus)) {
                printf("exited, status=%d\n", WEXITSTATUS(wstatus));
            } else if (WIFSIGNALED(wstatus)) {
                printf("killed by signal %d\n", WTERMSIG(wstatus));
            } else if (WIFSTOPPED(wstatus)) {
                printf("stopped by signal %d\n", WSTOPSIG(wstatus));
            } else if (WIFCONTINUED(wstatus)) {
                printf("continued\n");
            }
        } while (!WIFEXITED(wstatus) && !WIFSIGNALED(wstatus));
        exit(EXIT_SUCCESS);
    }
}
SEE ALSO
_exit(2), clone(2), fork(2), kill(2), ptrace(2), sigaction(2), signal(2), wait4(2), pthread_create(3), core(5), credentials(7), signal(7)
345 - Linux cli command getgid32
NAME 🖥️ getgid32 🖥️
get group identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
gid_t getgid(void);
gid_t getegid(void);
DESCRIPTION
getgid() returns the real group ID of the calling process.
getegid() returns the effective group ID of the calling process.
ERRORS
These functions are always successful and never modify errno.
VERSIONS
On Alpha, instead of a pair of getgid() and getegid() system calls, a single getxgid() system call is provided, which returns a pair of real and effective GIDs. The glibc getgid() and getegid() wrapper functions transparently deal with this. See syscall(2) for details regarding register mapping.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD.
The original Linux getgid() and getegid() system calls supported only 16-bit group IDs. Subsequently, Linux 2.4 added getgid32() and getegid32(), supporting 32-bit IDs. The glibc getgid() and getegid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getresgid(2), setgid(2), setregid(2), credentials(7)
346 - Linux cli command ioctl_pagemap_scan
NAME 🖥️ ioctl_pagemap_scan 🖥️
get and/or clear page flags
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/fs.h> /* Definition of struct pm_scan_arg,
struct page_region, and PAGE_IS_* constants */
#include <sys/ioctl.h>
int ioctl(int pagemap_fd, PAGEMAP_SCAN, struct pm_scan_arg *arg);
DESCRIPTION
This ioctl(2) is used to get and optionally clear some specific flags from page table entries. The information is returned with PAGE_SIZE granularity.
To start tracking the written state (flag) of a page or range of memory, UFFD_FEATURE_WP_ASYNC must be enabled by a UFFDIO_API ioctl(2) on the userfaultfd, and the memory range must be registered with a UFFDIO_REGISTER ioctl(2) in UFFDIO_REGISTER_MODE_WP mode.
Supported page flags
The following page table entry flags are supported:
PAGE_IS_WPALLOWED
The page has asynchronous write-protection enabled.
PAGE_IS_WRITTEN
The page has been written to since it was write protected.
PAGE_IS_FILE
The page is file backed.
PAGE_IS_PRESENT
The page is present in the memory.
PAGE_IS_SWAPPED
The page is swapped.
PAGE_IS_PFNZERO
The page has zero PFN.
PAGE_IS_HUGE
The page is THP or Hugetlb backed.
Supported operations
The get operation is always performed if the output buffer is specified. The other operations are as follows:
PM_SCAN_WP_MATCHING
Write protect the matched pages.
PM_SCAN_CHECK_WPASYNC
Abort the scan when a page is found which doesn’t have the Userfaultfd Asynchronous Write protection enabled.
The struct pm_scan_arg argument
struct pm_scan_arg {
    __u64 size;
    __u64 flags;
    __u64 start;
    __u64 end;
    __u64 walk_end;
    __u64 vec;
    __u64 vec_len;
    __u64 max_pages;
    __u64 category_inverted;
    __u64 category_mask;
    __u64 category_anyof_mask;
    __u64 return_mask;
};
size
This field should be set to the size of the structure in bytes, as in sizeof(struct pm_scan_arg).
flags
The operations to be performed are specified in it.
start
The starting address of the scan is specified in it.
end
The ending address of the scan is specified in it.
walk_end
The kernel returns the scan’s ending address in this field. A walk_end equal to end means that the scan has completed over the entire range.
vec
The address of page_region array for output.
struct page_region {
__u64 start;
__u64 end;
__u64 categories;
};
vec_len
The length of the page_region struct array.
max_pages
It is the optional limit for the number of output pages required.
category_inverted
PAGE_IS_* categories whose values match if 0 instead of 1.
category_mask
Skip pages for which any PAGE_IS_* category doesn’t match.
category_anyof_mask
Skip pages for which no PAGE_IS_* category matches.
return_mask
PAGE_IS_* categories that are to be reported in page_region.
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Error codes can be one of, but are not limited to, the following:
EINVAL
Invalid arguments, i.e., an invalid size of the argument, invalid flags, invalid categories, a start address that isn’t aligned with PAGE_SIZE, or vec_len specified when vec is NULL.
EFAULT
Invalid arg pointer, invalid vec pointer, or invalid address range specified by start and end.
ENOMEM
No memory is available.
EINTR
Fatal signal is pending.
STANDARDS
Linux.
HISTORY
Linux 6.7.
SEE ALSO
ioctl(2)
347 - Linux cli command msgget
NAME 🖥️ msgget 🖥️
get a System V message queue identifier
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/msg.h>
int msgget(key_t key, int msgflg);
DESCRIPTION
The msgget() system call returns the System V message queue identifier associated with the value of the key argument. It may be used either to obtain the identifier of a previously created message queue (when msgflg is zero and key does not have the value IPC_PRIVATE), or to create a new message queue.
A new message queue is created if key has the value IPC_PRIVATE, or if key isn’t IPC_PRIVATE, no message queue with the given key exists, and IPC_CREAT is specified in msgflg.
If msgflg specifies both IPC_CREAT and IPC_EXCL and a message queue already exists for key, then msgget() fails with errno set to EEXIST. (This is analogous to the effect of the combination O_CREAT | O_EXCL for open(2).)
Upon creation, the least significant bits of the argument msgflg define the permissions of the message queue. These permission bits have the same format and semantics as the permissions specified for the mode argument of open(2). (The execute permissions are not used.)
If a new message queue is created, then its associated data structure msqid_ds (see msgctl(2)) is initialized as follows:
msg_perm.cuid and msg_perm.uid are set to the effective user ID of the calling process.
msg_perm.cgid and msg_perm.gid are set to the effective group ID of the calling process.
The least significant 9 bits of msg_perm.mode are set to the least significant 9 bits of msgflg.
msg_qnum, msg_lspid, msg_lrpid, msg_stime, and msg_rtime are set to 0.
msg_ctime is set to the current time.
msg_qbytes is set to the system limit MSGMNB.
If the message queue already exists the permissions are verified, and a check is made to see if it is marked for destruction.
RETURN VALUE
On success, msgget() returns the message queue identifier (a nonnegative integer). On failure, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
A message queue exists for key, but the calling process does not have permission to access the queue, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EEXIST
IPC_CREAT and IPC_EXCL were specified in msgflg, but a message queue already exists for key.
ENOENT
No message queue exists for key and msgflg did not specify IPC_CREAT.
ENOMEM
A message queue has to be created but the system does not have enough memory for the new data structure.
ENOSPC
A message queue has to be created but the system limit for the maximum number of message queues (MSGMNI) would be exceeded.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
Linux
Until Linux 2.3.20, Linux would return EIDRM for a msgget() on a message queue scheduled for deletion.
NOTES
IPC_PRIVATE isn’t a flag field but a key_t type. If this special value is used for key, the system call ignores everything but the least significant 9 bits of msgflg and creates a new message queue (on success).
The following is a system limit on message queue resources affecting a msgget() call:
MSGMNI
System-wide limit on the number of message queues. Before Linux 3.19, the default value for this limit was calculated using a formula based on available system memory. Since Linux 3.19, the default value is 32,000. On Linux, this limit can be read and modified via /proc/sys/kernel/msgmni.
BUGS
The name choice IPC_PRIVATE was perhaps unfortunate; IPC_NEW would more clearly show its function.
SEE ALSO
msgctl(2), msgrcv(2), msgsnd(2), ftok(3), capabilities(7), mq_overview(7), sysvipc(7)
348 - Linux cli command rt_tgsigqueueinfo
NAME 🖥️ rt_tgsigqueueinfo 🖥️
queue a signal and data
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/signal.h> /* Definition of SI_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_rt_sigqueueinfo, pid_t tgid,
int sig, siginfo_t *info);
int syscall(SYS_rt_tgsigqueueinfo, pid_t tgid, pid_t tid,
int sig, siginfo_t *info);
Note: There are no glibc wrappers for these system calls; see NOTES.
DESCRIPTION
The rt_sigqueueinfo() and rt_tgsigqueueinfo() system calls are the low-level interfaces used to send a signal plus data to a process or thread. The receiver of the signal can obtain the accompanying data by establishing a signal handler with the sigaction(2) SA_SIGINFO flag.
These system calls are not intended for direct application use; they are provided to allow the implementation of sigqueue(3) and pthread_sigqueue(3).
The rt_sigqueueinfo() system call sends the signal sig to the thread group with the ID tgid. (The term “thread group” is synonymous with “process”, and tgid corresponds to the traditional UNIX process ID.) The signal will be delivered to an arbitrary member of the thread group (i.e., one of the threads that is not currently blocking the signal).
The info argument specifies the data to accompany the signal. This argument is a pointer to a structure of type siginfo_t, described in sigaction(2) (and defined by including <signal.h>). The caller should set the following fields in this structure:
si_code
This should be one of the SI_* codes in the Linux kernel source file include/asm-generic/siginfo.h. If the signal is being sent to any process other than the caller itself, the following restrictions apply:
The code can’t be a value greater than or equal to zero. In particular, it can’t be SI_USER, which is used by the kernel to indicate a signal sent by kill(2), and nor can it be SI_KERNEL, which is used to indicate a signal generated by the kernel.
The code can’t (since Linux 2.6.39) be SI_TKILL, which is used by the kernel to indicate a signal sent using tgkill(2).
si_pid
This should be set to a process ID, typically the process ID of the sender.
si_uid
This should be set to a user ID, typically the real user ID of the sender.
si_value
This field contains the user data to accompany the signal. For more information, see the description of the last (union sigval) argument of sigqueue(3).
Internally, the kernel sets the si_signo field to the value specified in sig, so that the receiver of the signal can also obtain the signal number via that field.
The rt_tgsigqueueinfo() system call is like rt_sigqueueinfo(), but sends the signal and data to the single thread specified by the combination of tgid, a thread group ID, and tid, a thread in that thread group.
RETURN VALUE
On success, these system calls return 0. On error, they return -1 and errno is set to indicate the error.
ERRORS
EAGAIN
The limit of signals which may be queued has been reached. (See signal(7) for further information.)
EINVAL
sig, tgid, or tid was invalid.
EPERM
The caller does not have permission to send the signal to the target. For the required permissions, see kill(2).
EPERM
tgid specifies a process other than the caller and info->si_code is invalid.
ESRCH
rt_sigqueueinfo(): No thread group matching tgid was found.
rt_tgsigqueueinfo(): No thread matching tgid and tid was found.
STANDARDS
Linux.
HISTORY
rt_sigqueueinfo()
Linux 2.2.
rt_tgsigqueueinfo()
Linux 2.6.31.
NOTES
Since these system calls are not intended for application use, there are no glibc wrapper functions; use syscall(2) in the unlikely case that you want to call them directly.
As with kill(2), the null signal (0) can be used to check if the specified process or thread exists.
SEE ALSO
kill(2), pidfd_send_signal(2), sigaction(2), sigprocmask(2), tgkill(2), pthread_sigqueue(3), sigqueue(3), signal(7)
349 - Linux cli command get_robust_list
NAME 🖥️ get_robust_list 🖥️
get/set list of robust futexes
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/futex.h> /* Definition of struct robust_list_head */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_get_robust_list, int pid,
struct robust_list_head **head_ptr, size_t *len_ptr);
long syscall(SYS_set_robust_list,
struct robust_list_head *head, size_t len);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
These system calls deal with per-thread robust futex lists. These lists are managed in user space: the kernel knows only about the location of the head of the list. A thread can inform the kernel of the location of its robust futex list using set_robust_list(). The address of a thread’s robust futex list can be obtained using get_robust_list().
The purpose of the robust futex list is to ensure that if a thread accidentally fails to unlock a futex before terminating or calling execve(2), another thread that is waiting on that futex is notified that the former owner of the futex has died. This notification consists of two pieces: the FUTEX_OWNER_DIED bit is set in the futex word, and the kernel performs a futex(2) FUTEX_WAKE operation on one of the threads waiting on the futex.
The get_robust_list() system call returns the head of the robust futex list of the thread whose thread ID is specified in pid. If pid is 0, the head of the list for the calling thread is returned. The list head is stored in the location pointed to by head_ptr. The size of the object pointed to by *head_ptr is stored in the location pointed to by len_ptr.
Permission to employ get_robust_list() is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check; see ptrace(2).
The set_robust_list() system call requests the kernel to record the head of the list of robust futexes owned by the calling thread. The head argument is the list head to record. The len argument should be sizeof(*head).
RETURN VALUE
The set_robust_list() and get_robust_list() system calls return zero when the operation is successful, an error code otherwise.
ERRORS
The set_robust_list() system call can fail with the following error:
EINVAL
len does not equal sizeof(struct robust_list_head).
The get_robust_list() system call can fail with the following errors:
EFAULT
The head of the robust futex list can’t be stored at the location head_ptr.
EPERM
The calling process does not have permission to see the robust futex list of the thread with the thread ID pid, and does not have the CAP_SYS_PTRACE capability.
ESRCH
No thread with the thread ID pid could be found.
VERSIONS
These system calls were added in Linux 2.6.17.
NOTES
These system calls are not needed by normal applications.
A thread can have only one robust futex list; therefore applications that wish to use this functionality should use the robust mutexes provided by glibc.
In the initial implementation, a thread waiting on a futex was notified that the owner had died only if the owner terminated. Starting with Linux 2.6.28, notification was extended to include the case where the owner performs an execve(2).
The thread IDs mentioned in the main text are kernel thread IDs of the kind returned by clone(2) and gettid(2).
SEE ALSO
futex(2), pthread_mutexattr_setrobust(3)
Documentation/robust-futexes.txt and Documentation/robust-futex-ABI.txt in the Linux kernel source tree
350 - Linux cli command dup
NAME 🖥️ dup 🖥️
duplicate a file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int dup(int oldfd);
int dup2(int oldfd, int newfd);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h> /* Definition of O_* constants */
#include <unistd.h>
int dup3(int oldfd, int newfd, int flags);
DESCRIPTION
The dup() system call allocates a new file descriptor that refers to the same open file description as the descriptor oldfd. (For an explanation of open file descriptions, see open(2).) The new file descriptor number is guaranteed to be the lowest-numbered file descriptor that was unused in the calling process.
After a successful return, the old and new file descriptors may be used interchangeably. Since the two file descriptors refer to the same open file description, they share file offset and file status flags; for example, if the file offset is modified by using lseek(2) on one of the file descriptors, the offset is also changed for the other file descriptor.
The two file descriptors do not share file descriptor flags (the close-on-exec flag). The close-on-exec flag (FD_CLOEXEC; see fcntl(2)) for the duplicate descriptor is off.
dup2()
The dup2() system call performs the same task as dup(), but instead of using the lowest-numbered unused file descriptor, it uses the file descriptor number specified in newfd. In other words, the file descriptor newfd is adjusted so that it now refers to the same open file description as oldfd.
If the file descriptor newfd was previously open, it is closed before being reused; the close is performed silently (i.e., any errors during the close are not reported by dup2()).
The steps of closing and reusing the file descriptor newfd are performed atomically. This is important, because trying to implement equivalent functionality using close(2) and dup() would be subject to race conditions, whereby newfd might be reused between the two steps. Such reuse could happen because the main program is interrupted by a signal handler that allocates a file descriptor, or because a parallel thread allocates a file descriptor.
Note the following points:
If oldfd is not a valid file descriptor, then the call fails, and newfd is not closed.
If oldfd is a valid file descriptor, and newfd has the same value as oldfd, then dup2() does nothing, and returns newfd.
dup3()
dup3() is the same as dup2(), except that:
The caller can force the close-on-exec flag to be set for the new file descriptor by specifying O_CLOEXEC in flags. See the description of the same flag in open(2) for reasons why this may be useful.
If oldfd equals newfd, then dup3() fails with the error EINVAL.
RETURN VALUE
On success, these system calls return the new file descriptor. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
oldfd isn’t an open file descriptor.
EBADF
newfd is out of the allowed range for file descriptors (see the discussion of RLIMIT_NOFILE in getrlimit(2)).
EBUSY
(Linux only) This may be returned by dup2() or dup3() during a race condition with open(2) and dup().
EINTR
The dup2() or dup3() call was interrupted by a signal; see signal(7).
EINVAL
(dup3()) flags contain an invalid value.
EINVAL
(dup3()) oldfd was equal to newfd.
EMFILE
The per-process limit on the number of open file descriptors has been reached (see the discussion of RLIMIT_NOFILE in getrlimit(2)).
STANDARDS
dup()
dup2()
POSIX.1-2008.
dup3()
Linux.
HISTORY
dup()
dup2()
POSIX.1-2001, SVr4, 4.3BSD.
dup3()
Linux 2.6.27, glibc 2.9.
NOTES
The error returned by dup2() is different from that returned by fcntl(…, F_DUPFD, …) when newfd is out of range. On some systems, dup2() also sometimes returns EINVAL like F_DUPFD.
If newfd was open, any errors that would have been reported at close(2) time are lost. If this is of concern, then (unless the program is single-threaded and does not allocate file descriptors in signal handlers) the correct approach is not to close newfd before calling dup2(), because of the race condition described above. Instead, code something like the following could be used:
/* Obtain a duplicate of 'newfd' that can subsequently
   be used to check for close() errors; an EBADF error
   means that 'newfd' was not open. */

tmpfd = dup(newfd);
if (tmpfd == -1 && errno != EBADF) {
    /* Handle unexpected dup() error. */
}

/* Atomically duplicate 'oldfd' on 'newfd'. */

if (dup2(oldfd, newfd) == -1) {
    /* Handle dup2() error. */
}

/* Now check for close() errors on the file originally
   referred to by 'newfd'. */

if (tmpfd != -1) {
    if (close(tmpfd) == -1) {
        /* Handle errors from close. */
    }
}
SEE ALSO
close(2), fcntl(2), open(2), pidfd_getfd(2)
351 - Linux cli command syslog
NAME 🖥️ syslog 🖥️
read and/or clear kernel message ring buffer; set console_loglevel
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/klog.h> /* Definition of SYSLOG_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_syslog, int type, char *bufp, int len);
/* The glibc interface */
#include <sys/klog.h>
int klogctl(int type, char *bufp, int len);
DESCRIPTION
Note: Probably, you are looking for the C library function syslog(), which talks to syslogd(8); see syslog(3) for details.
This page describes the kernel syslog() system call, which is used to control the kernel printk() buffer; the glibc wrapper function for the system call is called klogctl().
The kernel log buffer
The kernel has a cyclic buffer of length LOG_BUF_LEN in which messages given as arguments to the kernel function printk() are stored (regardless of their log level). In early kernels, LOG_BUF_LEN had the value 4096; from Linux 1.3.54, it was 8192; from Linux 2.1.113, it was 16384; since Linux 2.4.23/2.6, the value is a kernel configuration option (CONFIG_LOG_BUF_SHIFT, default value dependent on the architecture). Since Linux 2.6.6, the size can be queried with command type 10 (see below).
Commands
The type argument determines the action taken by this function. The list below specifies the values for type. The symbolic names are defined in the kernel source, but are not exported to user space; you will either need to use the numbers, or define the names yourself.
SYSLOG_ACTION_CLOSE (0)
Close the log. Currently a NOP.
SYSLOG_ACTION_OPEN (1)
Open the log. Currently a NOP.
SYSLOG_ACTION_READ (2)
Read from the log. The call waits until the kernel log buffer is nonempty, and then reads at most len bytes into the buffer pointed to by bufp. The call returns the number of bytes read. Bytes read from the log disappear from the log buffer: the information can be read only once. This is the function executed by the kernel when a user program reads /proc/kmsg.
SYSLOG_ACTION_READ_ALL (3)
Read all messages remaining in the ring buffer, placing them in the buffer pointed to by bufp. The call reads the last len bytes from the log buffer (nondestructively), but will not read more than was written into the buffer since the last “clear ring buffer” command (see command 5 below). The call returns the number of bytes read.
SYSLOG_ACTION_READ_CLEAR (4)
Read and clear all messages remaining in the ring buffer. The call does precisely the same as for a type of 3, but also executes the “clear ring buffer” command.
SYSLOG_ACTION_CLEAR (5)
The call executes just the “clear ring buffer” command. The bufp and len arguments are ignored.
This command does not really clear the ring buffer. Rather, it sets a kernel bookkeeping variable that determines the results returned by commands 3 (SYSLOG_ACTION_READ_ALL) and 4 (SYSLOG_ACTION_READ_CLEAR). This command has no effect on commands 2 (SYSLOG_ACTION_READ) and 9 (SYSLOG_ACTION_SIZE_UNREAD).
SYSLOG_ACTION_CONSOLE_OFF (6)
The command saves the current value of console_loglevel and then sets console_loglevel to minimum_console_loglevel, so that no messages are printed to the console. Before Linux 2.6.32, the command simply sets console_loglevel to minimum_console_loglevel. See the discussion of /proc/sys/kernel/printk, below.
The bufp and len arguments are ignored.
SYSLOG_ACTION_CONSOLE_ON (7)
If a previous SYSLOG_ACTION_CONSOLE_OFF command has been performed, this command restores console_loglevel to the value that was saved by that command. Before Linux 2.6.32, this command simply sets console_loglevel to default_console_loglevel. See the discussion of /proc/sys/kernel/printk, below.
The bufp and len arguments are ignored.
SYSLOG_ACTION_CONSOLE_LEVEL (8)
The call sets console_loglevel to the value given in len, which must be an integer between 1 and 8 (inclusive). The kernel silently enforces a minimum value of minimum_console_loglevel for len. See the log level section for details. The bufp argument is ignored.
SYSLOG_ACTION_SIZE_UNREAD (9) (since Linux 2.4.10)
The call returns the number of bytes currently available to be read from the kernel log buffer via command 2 (SYSLOG_ACTION_READ). The bufp and len arguments are ignored.
SYSLOG_ACTION_SIZE_BUFFER (10) (since Linux 2.6.6)
This command returns the total size of the kernel log buffer. The bufp and len arguments are ignored.
All commands except 3 and 10 require privilege. In Linux kernels before Linux 2.6.37, command types 3 and 10 are allowed to unprivileged processes; since Linux 2.6.37, these commands are allowed to unprivileged processes only if /proc/sys/kernel/dmesg_restrict has the value 0. Before Linux 2.6.37, “privileged” means that the caller has the CAP_SYS_ADMIN capability. Since Linux 2.6.37, “privileged” means that the caller has either the CAP_SYS_ADMIN capability (now deprecated for this purpose) or the (new) CAP_SYSLOG capability.
/proc/sys/kernel/printk
/proc/sys/kernel/printk is a writable file containing four integer values that influence kernel printk() behavior when printing or logging error messages. The four values are:
console_loglevel
Only messages with a log level lower than this value will be printed to the console. The default value for this field is DEFAULT_CONSOLE_LOGLEVEL (7), but it is set to 4 if the kernel command line contains the word “quiet”, 10 if the kernel command line contains the word “debug”, and to 15 in case of a kernel fault (the 10 and 15 are just silly, and equivalent to 8). The value of console_loglevel can be set (to a value in the range 1 to 8) by a syslog() call with a type of 8.
default_message_loglevel
This value will be used as the log level for printk() messages that do not have an explicit level. Up to and including Linux 2.6.38, the hard-coded default value for this field was 4 (KERN_WARNING); since Linux 2.6.39, the default value is defined by the kernel configuration option CONFIG_DEFAULT_MESSAGE_LOGLEVEL, which defaults to 4.
minimum_console_loglevel
The value in this field is the minimum value to which console_loglevel can be set.
default_console_loglevel
This is the default value for console_loglevel.
The log level
Every printk() message has its own log level. If the log level is not explicitly specified as part of the message, it defaults to default_message_loglevel. The conventional meaning of the log level is as follows:
Kernel constant | Level value | Meaning |
KERN_EMERG | 0 | System is unusable |
KERN_ALERT | 1 | Action must be taken immediately |
KERN_CRIT | 2 | Critical conditions |
KERN_ERR | 3 | Error conditions |
KERN_WARNING | 4 | Warning conditions |
KERN_NOTICE | 5 | Normal but significant condition |
KERN_INFO | 6 | Informational |
KERN_DEBUG | 7 | Debug-level messages |
The kernel printk() routine will print a message on the console only if it has a log level less than the value of console_loglevel.
RETURN VALUE
For type equal to 2, 3, or 4, a successful call to syslog() returns the number of bytes read. For type 9, syslog() returns the number of bytes currently available to be read on the kernel log buffer. For type 10, syslog() returns the total size of the kernel log buffer. For other values of type, 0 is returned on success.
In case of error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
Bad arguments (e.g., bad type; or for type 2, 3, or 4, buf is NULL, or len is less than zero; or for type 8, the level is outside the range 1 to 8).
ENOSYS
This syslog() system call is not available, because the kernel was compiled with the CONFIG_PRINTK kernel-configuration option disabled.
EPERM
An attempt was made to change console_loglevel or clear the kernel message ring buffer by a process without sufficient privilege (more precisely: without the CAP_SYS_ADMIN or CAP_SYSLOG capability).
ERESTARTSYS
System call was interrupted by a signal; nothing was read. (This can be seen only during a trace.)
STANDARDS
Linux.
HISTORY
From the very start, people noted that it is unfortunate that a system call and a library routine of the same name are entirely different animals.
SEE ALSO
dmesg(1), syslog(3), capabilities(7)
352 - Linux cli command phys
NAME 🖥️ phys 🖥️
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
353 - Linux cli command io_setup
NAME 🖥️ io_setup 🖥️
create an asynchronous I/O context
LIBRARY
Standard C library (libc, -lc)
Alternatively, Asynchronous I/O library (libaio, -laio); see VERSIONS.
SYNOPSIS
#include <linux/aio_abi.h> /* Defines needed types */
long io_setup(unsigned int nr_events, aio_context_t *ctx_idp);
Note: There is no glibc wrapper for this system call; see VERSIONS.
DESCRIPTION
Note: this page describes the raw Linux system call interface. The wrapper function provided by libaio uses a different type for the ctx_idp argument. See VERSIONS.
The io_setup() system call creates an asynchronous I/O context suitable for concurrently processing nr_events operations. The ctx_idp argument must not point to an AIO context that already exists, and must be initialized to 0 prior to the call. On successful creation of the AIO context, *ctx_idp is filled in with the resulting handle.
RETURN VALUE
On success, io_setup() returns 0. For the failure return, see VERSIONS.
ERRORS
EAGAIN
The specified nr_events exceeds the limit of available events, as defined in /proc/sys/fs/aio-max-nr (see proc(5)).
EFAULT
An invalid pointer is passed for ctx_idp.
EINVAL
ctx_idp is not initialized, or the specified nr_events exceeds internal limits. nr_events should be greater than 0.
ENOMEM
Insufficient kernel resources are available.
ENOSYS
io_setup() is not implemented on this architecture.
VERSIONS
glibc does not provide a wrapper for this system call. You could invoke it using syscall(2). But instead, you probably want to use the io_setup() wrapper function provided by libaio.
Note that the libaio wrapper function uses a different type (io_context_t *) for the ctx_idp argument. Note also that the libaio wrapper does not follow the usual C library conventions for indicating errors: on error it returns a negated error number (the negative of one of the values listed in ERRORS). If the system call is invoked via syscall(2), then the return value follows the usual conventions for indicating an error: -1, with errno set to a (positive) value that indicates the error.
STANDARDS
Linux.
HISTORY
Linux 2.5.
SEE ALSO
io_cancel(2), io_destroy(2), io_getevents(2), io_submit(2), aio(7)
354 - Linux cli command setpriority
NAME 🖥️ setpriority 🖥️
get/set program scheduling priority
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/resource.h>
int getpriority(int which, id_t who);
int setpriority(int which, id_t who, int prio);
DESCRIPTION
The scheduling priority of the process, process group, or user, as indicated by which and who, is obtained with the getpriority() call and set with the setpriority() call. The process attribute dealt with by these system calls is the same attribute (also known as the “nice” value) that is dealt with by nice(2).
The value which is one of PRIO_PROCESS, PRIO_PGRP, or PRIO_USER, and who is interpreted relative to which (a process identifier for PRIO_PROCESS, process group identifier for PRIO_PGRP, and a user ID for PRIO_USER). A zero value for who denotes (respectively) the calling process, the process group of the calling process, or the real user ID of the calling process.
The prio argument is a value in the range -20 to 19 (but see NOTES below), with -20 being the highest priority and 19 being the lowest priority. Attempts to set a priority outside this range are silently clamped to the range. The default priority is 0; lower values give a process a higher scheduling priority.
The getpriority() call returns the highest priority (lowest numerical value) enjoyed by any of the specified processes. The setpriority() call sets the priorities of all of the specified processes to the specified value.
Traditionally, only a privileged process could lower the nice value (i.e., set a higher priority). However, since Linux 2.6.12, an unprivileged process can decrease the nice value of a target process that has a suitable RLIMIT_NICE soft limit; see getrlimit(2) for details.
RETURN VALUE
On success, getpriority() returns the calling thread’s nice value, which may be a negative number. On error, it returns -1 and sets errno to indicate the error.
Since a successful call to getpriority() can legitimately return the value -1, it is necessary to clear errno prior to the call, then check errno afterward to determine if -1 is an error or a legitimate value.
setpriority() returns 0 on success. On failure, it returns -1 and sets errno to indicate the error.
ERRORS
EACCES
The caller attempted to set a lower nice value (i.e., a higher process priority), but did not have the required privilege (on Linux: did not have the CAP_SYS_NICE capability).
EINVAL
which was not one of PRIO_PROCESS, PRIO_PGRP, or PRIO_USER.
EPERM
A process was located, but its effective user ID did not match either the effective or the real user ID of the caller, and the caller was not privileged (on Linux: did not have the CAP_SYS_NICE capability). But see NOTES below.
ESRCH
No process was located using the which and who values specified.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD (these interfaces first appeared in 4.2BSD).
NOTES
For further details on the nice value, see sched(7).
Note: the addition of the “autogroup” feature in Linux 2.6.38 means that the nice value no longer has its traditional effect in many circumstances. For details, see sched(7).
A child created by fork(2) inherits its parent’s nice value. The nice value is preserved across execve(2).
The details on the condition for EPERM depend on the system. The above description is what POSIX.1-2001 says, and seems to be followed on all System V-like systems. Linux kernels before Linux 2.6.12 required the real or effective user ID of the caller to match the real user of the process who (instead of its effective user ID). Linux 2.6.12 and later require the effective user ID of the caller to match the real or effective user ID of the process who. All BSD-like systems (SunOS 4.1.3, Ultrix 4.2, 4.3BSD, FreeBSD 4.3, OpenBSD-2.5, …) behave in the same manner as Linux 2.6.12 and later.
C library/kernel differences
The getpriority system call returns nice values translated to the range 40..1, since a negative return value would be interpreted as an error. The glibc wrapper function for getpriority() translates the value back according to the formula unice = 20 - knice (thus, the 40..1 range returned by the kernel corresponds to the range -20..19 as seen by user space).
BUGS
According to POSIX, the nice value is a per-process setting. However, under the current Linux/NPTL implementation of POSIX threads, the nice value is a per-thread attribute: different threads in the same process can have different nice values. Portable applications should avoid relying on the Linux behavior, which may be made standards conformant in the future.
SEE ALSO
nice(1), renice(1), fork(2), capabilities(7), sched(7)
Documentation/scheduler/sched-nice-design.txt in the Linux kernel source tree (since Linux 2.6.23)
355 - Linux cli command recvmmsg
NAME 🖥️ recvmmsg 🖥️
receive multiple messages on a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/socket.h>
int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen,
             int flags, struct timespec *timeout);
DESCRIPTION
The recvmmsg() system call is an extension of recvmsg(2) that allows the caller to receive multiple messages from a socket using a single system call. (This has performance benefits for some applications.) A further extension over recvmsg(2) is support for a timeout on the receive operation.
The sockfd argument is the file descriptor of the socket to receive data from.
The msgvec argument is a pointer to an array of mmsghdr structures. The size of this array is specified in vlen.
The mmsghdr structure is defined in <sys/socket.h> as:
struct mmsghdr {
struct msghdr msg_hdr; /* Message header */
unsigned int msg_len; /* Number of received bytes for header */
};
The msg_hdr field is a msghdr structure, as described in recvmsg(2). The msg_len field is the number of bytes returned for the message in the entry. This field has the same value as the return value of a single recvmsg(2) on the header.
The flags argument contains flags ORed together. The flags are the same as documented for recvmsg(2), with the following addition:
MSG_WAITFORONE (since Linux 2.6.34)
Turns on MSG_DONTWAIT after the first message has been received.
The timeout argument points to a struct timespec (see clock_gettime(2)) defining a timeout (seconds plus nanoseconds) for the receive operation (but see BUGS!). (This interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.) If timeout is NULL, then the operation blocks indefinitely.
A blocking recvmmsg() call blocks until vlen messages have been received or until the timeout expires. A nonblocking call reads as many messages as are available (up to the limit specified by vlen) and returns immediately.
On return from recvmmsg(), successive elements of msgvec are updated to contain information about each received message: msg_len contains the size of the received message; the subfields of msg_hdr are updated as described in recvmsg(2). The return value of the call indicates the number of elements of msgvec that have been updated.
RETURN VALUE
On success, recvmmsg() returns the number of messages received in msgvec; on error, -1 is returned, and errno is set to indicate the error.
ERRORS
Errors are as for recvmsg(2). In addition, the following error can occur:
EINVAL
timeout is invalid.
See also BUGS.
STANDARDS
Linux.
HISTORY
Linux 2.6.33, glibc 2.12.
BUGS
The timeout argument does not work as intended. The timeout is checked only after the receipt of each datagram, so that if up to vlen-1 datagrams are received before the timeout expires, but then no further datagrams are received, the call will block forever.
If an error occurs after at least one message has been received, the call succeeds, and returns the number of messages received. The error code is expected to be returned on a subsequent call to recvmmsg(). In the current implementation, however, the error code can be overwritten in the meantime by an unrelated network event on a socket, for example an incoming ICMP packet.
EXAMPLES
The following program uses recvmmsg() to receive multiple messages on a socket and stores them in multiple buffers. The call returns if all buffers are filled or if the timeout specified has expired.
The following snippet periodically generates UDP datagrams containing a random number:
$ while true; do echo $RANDOM > /dev/udp/127.0.0.1/1234;
sleep 0.25; done
These datagrams are read by the example application, which can give the following output:
$ ./a.out
5 messages received
1 11782
2 11345
3 304
4 13514
5 28421
Program source
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

int
main(void)
{
#define VLEN 10
#define BUFSIZE 200
#define TIMEOUT 1
    int sockfd, retval;
    char bufs[VLEN][BUFSIZE+1];
    struct iovec iovecs[VLEN];
    struct mmsghdr msgs[VLEN];
    struct timespec timeout;
    struct sockaddr_in addr;

    sockfd = socket(AF_INET, SOCK_DGRAM, 0);
    if (sockfd == -1) {
        perror("socket()");
        exit(EXIT_FAILURE);
    }

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(1234);
    if (bind(sockfd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
        perror("bind()");
        exit(EXIT_FAILURE);
    }

    memset(msgs, 0, sizeof(msgs));
    for (size_t i = 0; i < VLEN; i++) {
        iovecs[i].iov_base = bufs[i];
        iovecs[i].iov_len = BUFSIZE;
        msgs[i].msg_hdr.msg_iov = &iovecs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    timeout.tv_sec = TIMEOUT;
    timeout.tv_nsec = 0;

    retval = recvmmsg(sockfd, msgs, VLEN, 0, &timeout);
    if (retval == -1) {
        perror("recvmmsg()");
        exit(EXIT_FAILURE);
    }

    printf("%d messages received\n", retval);
    for (size_t i = 0; i < retval; i++) {
        bufs[i][msgs[i].msg_len] = 0;
        printf("%zu %s", i+1, bufs[i]);
    }
    exit(EXIT_SUCCESS);
}
SEE ALSO
clock_gettime(2), recvmsg(2), sendmmsg(2), sendmsg(2), socket(2), socket(7)
356 - Linux cli command signalfd4
NAME 🖥️ signalfd4 🖥️
create a file descriptor for accepting signals
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/signalfd.h>
int signalfd(int fd, const sigset_t *mask, int flags);
DESCRIPTION
signalfd() creates a file descriptor that can be used to accept signals targeted at the caller. This provides an alternative to the use of a signal handler or sigwaitinfo(2), and has the advantage that the file descriptor may be monitored by select(2), poll(2), and epoll(7).
The mask argument specifies the set of signals that the caller wishes to accept via the file descriptor. This argument is a signal set whose contents can be initialized using the macros described in sigsetops(3). Normally, the set of signals to be received via the file descriptor should be blocked using sigprocmask(2), to prevent the signals being handled according to their default dispositions. It is not possible to receive SIGKILL or SIGSTOP signals via a signalfd file descriptor; these signals are silently ignored if specified in mask.
If the fd argument is -1, then the call creates a new file descriptor and associates the signal set specified in mask with that file descriptor. If fd is not -1, then it must specify a valid existing signalfd file descriptor, and mask is used to replace the signal set associated with that file descriptor.
Starting with Linux 2.6.27, the following values may be bitwise ORed in flags to change the behavior of signalfd():
SFD_NONBLOCK
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
SFD_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
Up to Linux 2.6.26, the flags argument is unused, and must be specified as zero.
signalfd() returns a file descriptor that supports the following operations:
read(2)
If one or more of the signals specified in mask is pending for the process, then the buffer supplied to read(2) is used to return one or more signalfd_siginfo structures (see below) that describe the signals. The read(2) returns information for as many signals as are pending and will fit in the supplied buffer. The buffer must be at least sizeof(struct signalfd_siginfo) bytes. The return value of the read(2) is the total number of bytes read.
As a consequence of the read(2), the signals are consumed, so that they are no longer pending for the process (i.e., will not be caught by signal handlers, and cannot be accepted using sigwaitinfo(2)).
If none of the signals in mask is pending for the process, then the read(2) either blocks until one of the signals in mask is generated for the process, or fails with the error EAGAIN if the file descriptor has been made nonblocking.
poll(2)
select(2)
(and similar)
The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if one or more of the signals in mask is pending for the process.
The signalfd file descriptor also supports the other file-descriptor multiplexing APIs: pselect(2), ppoll(2), and epoll(7).
close(2)
When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same signalfd object have been closed, the resources for the object are freed by the kernel.
The signalfd_siginfo structure
The format of the signalfd_siginfo structure(s) returned by read(2) calls on a signalfd file descriptor is as follows:
struct signalfd_siginfo {
    uint32_t ssi_signo;    /* Signal number */
    int32_t  ssi_errno;    /* Error number (unused) */
    int32_t  ssi_code;     /* Signal code */
    uint32_t ssi_pid;      /* PID of sender */
    uint32_t ssi_uid;      /* Real UID of sender */
    int32_t  ssi_fd;       /* File descriptor (SIGIO) */
    uint32_t ssi_tid;      /* Kernel timer ID (POSIX timers) */
    uint32_t ssi_band;     /* Band event (SIGIO) */
    uint32_t ssi_overrun;  /* POSIX timer overrun count */
    uint32_t ssi_trapno;   /* Trap number that caused signal */
    int32_t  ssi_status;   /* Exit status or signal (SIGCHLD) */
    int32_t  ssi_int;      /* Integer sent by sigqueue(3) */
    uint64_t ssi_ptr;      /* Pointer sent by sigqueue(3) */
    uint64_t ssi_utime;    /* User CPU time consumed (SIGCHLD) */
    uint64_t ssi_stime;    /* System CPU time consumed
                              (SIGCHLD) */
    uint64_t ssi_addr;     /* Address that generated signal
                              (for hardware-generated signals) */
    uint16_t ssi_addr_lsb; /* Least significant bit of address
                              (SIGBUS; since Linux 2.6.37) */
    uint8_t  pad[X];       /* Pad size to 128 bytes (allow for
                              additional fields in the future) */
};
Each of the fields in this structure is analogous to the similarly named field in the siginfo_t structure. The siginfo_t structure is described in sigaction(2). Not all fields in the returned signalfd_siginfo structure will be valid for a specific signal; the set of valid fields can be determined from the value returned in the ssi_code field. This field is the analog of the siginfo_t si_code field; see sigaction(2) for details.
fork(2) semantics
After a fork(2), the child inherits a copy of the signalfd file descriptor. A read(2) from the file descriptor in the child will return information about signals queued to the child.
Semantics of file descriptor passing
As with other file descriptors, signalfd file descriptors can be passed to another process via a UNIX domain socket (see unix(7)). In the receiving process, a read(2) from the received file descriptor will return information about signals queued to that process.
execve(2) semantics
Just like any other file descriptor, a signalfd file descriptor remains open across an execve(2), unless it has been marked for close-on-exec (see fcntl(2)). Any signals that were available for reading before the execve(2) remain available to the newly loaded program. (This is analogous to traditional signal semantics, where a blocked signal that is pending remains pending across an execve(2).)
Thread semantics
The semantics of signalfd file descriptors in a multithreaded program mirror the standard semantics for signals. In other words, when a thread reads from a signalfd file descriptor, it will read the signals that are directed to the thread itself and the signals that are directed to the process (i.e., the entire thread group). (A thread will not be able to read signals that are directed to other threads in the process.)
epoll(7) semantics
If a process adds (via epoll_ctl(2)) a signalfd file descriptor to an epoll(7) instance, then epoll_wait(2) returns events only for signals sent to that process. In particular, if the process then uses fork(2) to create a child process, then the child will be able to read(2) signals that are sent to it using the signalfd file descriptor, but epoll_wait(2) will not indicate that the signalfd file descriptor is ready. In this scenario, a possible workaround is that after the fork(2), the child process can close the signalfd file descriptor that it inherited from the parent process and then create another signalfd file descriptor and add it to the epoll instance. Alternatively, the parent and the child could delay creating their (separate) signalfd file descriptors and adding them to the epoll instance until after the call to fork(2).
RETURN VALUE
On success, signalfd() returns a signalfd file descriptor; this is either a new file descriptor (if fd was -1), or fd if fd was a valid signalfd file descriptor. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
The fd file descriptor is not a valid file descriptor.
EINVAL
fd is not a valid signalfd file descriptor.
EINVAL
flags is invalid; or, in Linux 2.6.26 or earlier, flags is nonzero.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
Could not mount (internal) anonymous inode device.
ENOMEM
There was insufficient memory to create a new signalfd file descriptor.
VERSIONS
C library/kernel differences
The underlying Linux system call requires an additional argument, size_t sizemask, which specifies the size of the mask argument. The glibc signalfd() wrapper function does not include this argument, since it provides the required value for the underlying system call.
There are two underlying Linux system calls: signalfd() and the more recent signalfd4(). The former system call does not implement a flags argument. The latter system call implements the flags values described above. Starting with glibc 2.9, the signalfd() wrapper function will use signalfd4() where it is available.
STANDARDS
Linux.
HISTORY
signalfd()
Linux 2.6.22, glibc 2.8.
signalfd4()
Linux 2.6.27.
NOTES
A process can create multiple signalfd file descriptors. This makes it possible to accept different signals on different file descriptors. (This may be useful if monitoring the file descriptors using select(2), poll(2), or epoll(7): the arrival of different signals will make different file descriptors ready.) If a signal appears in the mask of more than one of the file descriptors, then occurrences of that signal can be read (once) from any one of the file descriptors.
Attempts to include SIGKILL and SIGSTOP in mask are silently ignored.
The signal mask employed by a signalfd file descriptor can be viewed via the entry for the corresponding file descriptor in the process’s /proc/pid/fdinfo directory. See proc(5) for further details.
Limitations
The signalfd mechanism can’t be used to receive signals that are synchronously generated, such as the SIGSEGV signal that results from accessing an invalid memory address or the SIGFPE signal that results from an arithmetic error. Such signals can be caught only via a signal handler.
As described above, in normal usage one blocks the signals that will be accepted via signalfd(). If spawning a child process to execute a helper program (that does not need the signalfd file descriptor), then, after the call to fork(2), you will normally want to unblock those signals before calling execve(2), so that the helper program can see any signals that it expects to see. Be aware, however, that this won’t be possible in the case of a helper program spawned behind the scenes by any library function that the program may call. In such cases, one must fall back to using a traditional signal handler that writes to a file descriptor monitored by select(2), poll(2), or epoll(7).
BUGS
Before Linux 2.6.25, the ssi_ptr and ssi_int fields are not filled in with the data accompanying a signal sent by sigqueue(3).
EXAMPLES
The program below accepts the signals SIGINT and SIGQUIT via a signalfd file descriptor. The program terminates after accepting a SIGQUIT signal. The following shell session demonstrates the use of the program:
$ ./signalfd_demo
^C # Control-C generates SIGINT
Got SIGINT
^C
Got SIGINT
^\ # Control-\ generates SIGQUIT
Got SIGQUIT
$
Program source
#include <err.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
    int sfd;
    ssize_t s;
    sigset_t mask;
    struct signalfd_siginfo fdsi;

    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);
    sigaddset(&mask, SIGQUIT);

    /* Block signals so that they aren't handled
       according to their default dispositions. */

    if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
        err(EXIT_FAILURE, "sigprocmask");

    sfd = signalfd(-1, &mask, 0);
    if (sfd == -1)
        err(EXIT_FAILURE, "signalfd");

    for (;;) {
        s = read(sfd, &fdsi, sizeof(fdsi));
        if (s != sizeof(fdsi))
            err(EXIT_FAILURE, "read");

        if (fdsi.ssi_signo == SIGINT) {
            printf("Got SIGINT\n");
        } else if (fdsi.ssi_signo == SIGQUIT) {
            printf("Got SIGQUIT\n");
            exit(EXIT_SUCCESS);
        } else {
            printf("Read unexpected signal\n");
        }
    }
}
SEE ALSO
eventfd(2), poll(2), read(2), select(2), sigaction(2), sigprocmask(2), sigwaitinfo(2), timerfd_create(2), sigsetops(3), sigwait(3), epoll(7), signal(7)
357 - Linux cli command lstat64
NAME 🖥️ lstat64 🖥️
get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but (in the case of stat(), fstatat(), and lstat()) execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
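One way to avoid the EOVERFLOW case just described is to request a 64-bit off_t and verify it at compile time. The following is a sketch only; the helper name file_size() is hypothetical, and the macro must be defined before any header is included for it to take effect.

```c
#define _FILE_OFFSET_BITS 64    /* must precede every #include */
#include <sys/stat.h>
#include <sys/types.h>

/* Fail the build rather than risk EOVERFLOW on large files at run time. */
_Static_assert(sizeof(off_t) >= 8, "off_t is narrower than 64 bits");

/* Hypothetical helper: returns the size of pathname in bytes, or -1 on
   error (errno is set by stat()). */
off_t
file_size(const char *pathname)
{
    struct stat sb;

    if (stat(pathname, &sb) == -1)
        return -1;

    return sb.st_size;
}
```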
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    struct stat sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    printf("ID of containing device:  [%x,%x]\n",
           major(sb.st_dev),
           minor(sb.st_dev));

    printf("File type:                ");

    switch (sb.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");            break;
    case S_IFCHR:  printf("character device\n");        break;
    case S_IFDIR:  printf("directory\n");               break;
    case S_IFIFO:  printf("FIFO/pipe\n");               break;
    case S_IFLNK:  printf("symlink\n");                 break;
    case S_IFREG:  printf("regular file\n");            break;
    case S_IFSOCK: printf("socket\n");                  break;
    default:       printf("unknown?\n");                break;
    }

    printf("I-node number:            %ju\n", (uintmax_t) sb.st_ino);

    printf("Mode:                     %jo (octal)\n",
           (uintmax_t) sb.st_mode);

    printf("Link count:               %ju\n", (uintmax_t) sb.st_nlink);
    printf("Ownership:                UID=%ju   GID=%ju\n",
           (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);

    printf("Preferred I/O block size: %jd bytes\n",
           (intmax_t) sb.st_blksize);
    printf("File size:                %jd bytes\n",
           (intmax_t) sb.st_size);
    printf("Blocks allocated:         %jd\n",
           (intmax_t) sb.st_blocks);

    printf("Last status change:       %s", ctime(&sb.st_ctime));
    printf("Last file access:         %s", ctime(&sb.st_atime));
    printf("Last file modification:   %s", ctime(&sb.st_mtime));

    exit(EXIT_SUCCESS);
}
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
358 - Linux cli command sched_getattr
NAME 🖥️ sched_getattr 🖥️
set and get scheduling policy and attributes
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sched.h> /* Definition of SCHED_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_sched_setattr, pid_t pid, struct sched_attr *attr,
            unsigned int flags);
int syscall(SYS_sched_getattr, pid_t pid, struct sched_attr *attr,
            unsigned int size, unsigned int flags);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
sched_setattr()
The sched_setattr() system call sets the scheduling policy and associated attributes for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and attributes of the calling thread will be set.
Currently, Linux supports the following “normal” (i.e., non-real-time) scheduling policies as values that may be specified in the sched_policy field of attr:
SCHED_OTHER
the standard round-robin time-sharing policy;
SCHED_BATCH
for “batch” style execution of processes; and
SCHED_IDLE
for running very low priority background jobs.
Various “real-time” policies are also supported, for special time-critical applications that need precise control over the way in which runnable threads are selected for execution. For the rules governing when a process may use these policies, see sched(7). The real-time policies that may be specified in the sched_policy field of attr are:
SCHED_FIFO
a first-in, first-out policy; and
SCHED_RR
a round-robin policy.
Linux also provides the following policy:
SCHED_DEADLINE
a deadline scheduling policy; see sched(7) for details.
The attr argument is a pointer to a structure that defines the new scheduling policy and attributes for the specified thread. This structure has the following form:
struct sched_attr {
    u32 size;              /* Size of this structure */
    u32 sched_policy;      /* Policy (SCHED_*) */
    u64 sched_flags;       /* Flags */
    s32 sched_nice;        /* Nice value (SCHED_OTHER,
                              SCHED_BATCH) */
    u32 sched_priority;    /* Static priority (SCHED_FIFO,
                              SCHED_RR) */
    /* Remaining fields are for SCHED_DEADLINE */
    u64 sched_runtime;
    u64 sched_deadline;
    u64 sched_period;
};
The fields of the sched_attr structure are as follows:
size
This field should be set to the size of the structure in bytes, as in sizeof(struct sched_attr). If the provided structure is smaller than the kernel structure, any additional fields are assumed to be ‘0’. If the provided structure is larger than the kernel structure, the kernel verifies that all additional fields are 0; if they are not, sched_setattr() fails with the error E2BIG and updates size to contain the size of the kernel structure.
The above behavior when the size of the user-space sched_attr structure does not match the size of the kernel structure allows for future extensibility of the interface. Malformed applications that pass oversize structures won’t break in the future if the size of the kernel sched_attr structure is increased. In the future, it could also allow applications that know about a larger user-space sched_attr structure to determine whether they are running on an older kernel that does not support the larger structure.
sched_policy
This field specifies the scheduling policy, as one of the SCHED_* values listed above.
sched_flags
This field contains zero or more of the following flags that are ORed together to control scheduling behavior:
SCHED_FLAG_RESET_ON_FORK
Children created by fork(2) do not inherit privileged scheduling policies. See sched(7) for details.
SCHED_FLAG_RECLAIM (since Linux 4.13)
This flag allows a SCHED_DEADLINE thread to reclaim bandwidth unused by other real-time threads.
SCHED_FLAG_DL_OVERRUN (since Linux 4.16)
This flag allows an application to get informed about run-time overruns in SCHED_DEADLINE threads. Such overruns may be caused by (for example) coarse execution time accounting or incorrect parameter assignment. Notification takes the form of a SIGXCPU signal which is generated on each overrun.
This SIGXCPU signal is process-directed (see signal(7)) rather than thread-directed. This is probably a bug. On the one hand, sched_setattr() is being used to set a per-thread attribute. On the other hand, if the process-directed signal is delivered to a thread inside the process other than the one that had a run-time overrun, the application has no way of knowing which thread overran.
sched_nice
This field specifies the nice value to be set when specifying sched_policy as SCHED_OTHER or SCHED_BATCH. The nice value is a number in the range -20 (high priority) to +19 (low priority); see sched(7).
sched_priority
This field specifies the static priority to be set when specifying sched_policy as SCHED_FIFO or SCHED_RR. The allowed range of priorities for these policies can be determined using sched_get_priority_min(2) and sched_get_priority_max(2). For other policies, this field must be specified as 0.
sched_runtime
This field specifies the “Runtime” parameter for deadline scheduling. The value is expressed in nanoseconds. This field, and the next two fields, are used only for SCHED_DEADLINE scheduling; for further details, see sched(7).
sched_deadline
This field specifies the “Deadline” parameter for deadline scheduling. The value is expressed in nanoseconds.
sched_period
This field specifies the “Period” parameter for deadline scheduling. The value is expressed in nanoseconds.
The flags argument is provided to allow for future extensions to the interface; in the current implementation it must be specified as 0.
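The following is a minimal sketch of a sched_setattr() call made through syscall(2). It assumes a locally declared mirror of the structure shown above; the names struct my_sched_attr and set_other_nice() are hypothetical, and the struct name deliberately differs from sched_attr in case the system headers provide their own definition.

```c
#include <sched.h>          /* SCHED_OTHER */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>    /* SYS_sched_setattr */
#include <unistd.h>         /* syscall() */

/* Local mirror of the initially published structure (48 bytes).
   The name is hypothetical. */
struct my_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/* Hypothetical helper: place the calling thread under SCHED_OTHER with
   the given nice value. Returns 0 on success, or -1 with errno set. */
int
set_other_nice(int nice_value)
{
    struct my_sched_attr attr;

    memset(&attr, 0, sizeof(attr));     /* unused fields must be 0 */
    attr.size = sizeof(attr);
    attr.sched_policy = SCHED_OTHER;
    attr.sched_nice = nice_value;       /* -20 (high) to +19 (low) */

    /* pid 0 means the calling thread; flags must be 0. */
    return (int) syscall(SYS_sched_setattr, 0, &attr, 0);
}
```

Raising the nice value, for example with set_other_nice(19), normally needs no privilege, whereas lowering it may fail with EPERM.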
sched_getattr()
The sched_getattr() system call fetches the scheduling policy and the associated attributes for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and attributes of the calling thread will be retrieved.
The size argument should be set to the size of the sched_attr structure as known to user space. The value must be at least as large as the size of the initially published sched_attr structure, or the call fails with the error EINVAL.
The retrieved scheduling attributes are placed in the fields of the sched_attr structure pointed to by attr. The kernel sets attr.size to the size of its sched_attr structure.
If the caller-provided attr buffer is larger than the kernel’s sched_attr structure, the additional bytes in the user-space structure are not touched. If the caller-provided structure is smaller than the kernel sched_attr structure, the kernel will silently not return any values which would be stored outside the provided space. As with sched_setattr(), these semantics allow for future extensibility of the interface.
The flags argument is provided to allow for future extensions to the interface; in the current implementation it must be specified as 0.
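A corresponding sketch for sched_getattr() is shown below, again through syscall(2). The names struct my_sched_attr and get_sched_attr() are hypothetical; the structure mirrors the initially published 48-byte layout, so on kernels with a larger structure only the fields that fit are returned.

```c
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>    /* SYS_sched_getattr */
#include <sys/types.h>      /* pid_t */
#include <unistd.h>         /* syscall() */

/* Local mirror of the initially published structure (48 bytes).
   The name is hypothetical. */
struct my_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/* Hypothetical helper: fetch the scheduling attributes of the thread
   whose ID is pid (0 means the calling thread) into *attr.
   Returns 0 on success, or -1 with errno set. */
int
get_sched_attr(pid_t pid, struct my_sched_attr *attr)
{
    memset(attr, 0, sizeof(*attr));

    /* size tells the kernel how much space user space provides;
       flags must be 0. */
    return (int) syscall(SYS_sched_getattr, pid, attr,
                         (unsigned int) sizeof(*attr), 0);
}
```

After a successful call, attr->sched_policy holds one of the SCHED_* values and, for the normal policies, attr->sched_nice holds the nice value.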
RETURN VALUE
On success, sched_setattr() and sched_getattr() return 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
sched_getattr() and sched_setattr() can both fail for the following reasons:
EINVAL
attr is NULL; or pid is negative; or flags is not zero.
ESRCH
The thread whose ID is pid could not be found.
In addition, sched_getattr() can fail for the following reasons:
E2BIG
The buffer specified by size and attr is too small.
EINVAL
size is invalid; that is, it is smaller than the initial version of the sched_attr structure (48 bytes) or larger than the system page size.
In addition, sched_setattr() can fail for the following reasons:
E2BIG
The buffer specified by size and attr is larger than the kernel structure, and one or more of the excess bytes is nonzero.
EBUSY
SCHED_DEADLINE admission control failure, see sched(7).
EINVAL
attr.sched_policy is not one of the recognized policies; or attr.sched_flags contains an unrecognized flag; or attr.sched_priority is invalid; or attr.sched_policy is SCHED_DEADLINE and the deadline scheduling parameters in attr are invalid.
EPERM
The caller does not have appropriate privileges.
EPERM
The CPU affinity mask of the thread specified by pid does not include all CPUs in the system (see sched_setaffinity(2)).
STANDARDS
Linux.
HISTORY
Linux 3.14.
NOTES
glibc does not provide wrappers for these system calls; call them using syscall(2).
sched_setattr() provides a superset of the functionality of sched_setscheduler(2), sched_setparam(2), nice(2), and (other than the ability to set the priority of all processes belonging to a specified user or all processes in a specified group) setpriority(2). Analogously, sched_getattr() provides a superset of the functionality of sched_getscheduler(2), sched_getparam(2), and (partially) getpriority(2).
BUGS
In Linux versions up to 3.15, sched_setattr() failed with the error EFAULT instead of E2BIG for the case described in ERRORS.
Up to Linux 5.3, sched_getattr() failed with the error EFBIG if the in-kernel sched_attr structure was larger than the size passed by user space.
SEE ALSO
chrt(1), nice(2), sched_get_priority_max(2), sched_get_priority_min(2), sched_getaffinity(2), sched_getparam(2), sched_getscheduler(2), sched_rr_get_interval(2), sched_setaffinity(2), sched_setparam(2), sched_setscheduler(2), sched_yield(2), setpriority(2), pthread_getschedparam(3), pthread_setschedparam(3), pthread_setschedprio(3), capabilities(7), cpuset(7), sched(7)
359 - Linux cli command readahead
NAME 🖥️ readahead 🖥️
initiate file readahead into page cache
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
ssize_t readahead(int fd, off_t offset, size_t count);
DESCRIPTION
readahead() initiates readahead on a file so that subsequent reads from that file will be satisfied from the cache, and not block on disk I/O (assuming the readahead was initiated early enough and that other activity on the system did not in the meantime flush pages from the cache).
The fd argument is a file descriptor identifying the file which is to be read. The offset argument specifies the starting point from which data is to be read and count specifies the number of bytes to be read. I/O is performed in whole pages, so that offset is effectively rounded down to a page boundary and bytes are read up to the next page boundary greater than or equal to (offset+count). readahead() does not read beyond the end of the file. The file offset of the open file description referred to by the file descriptor fd is left unchanged.
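As an illustration, the sketch below opens a file and asks the kernel to start reading its first count bytes into the page cache. The helper name open_with_readahead() is hypothetical; the sketch treats readahead() purely as an optimization and therefore ignores its result.

```c
#define _GNU_SOURCE             /* exposes readahead() */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open pathname read-only and request readahead of
   its first count bytes. Returns the file descriptor, or -1 if open()
   fails (errno is set by open()). */
int
open_with_readahead(const char *pathname, size_t count)
{
    int fd;

    fd = open(pathname, O_RDONLY);
    if (fd == -1)
        return -1;

    /* A failure here (for example EINVAL for an unsupported file type)
       only means that later reads may block on disk I/O, so it is
       ignored in this sketch. */
    (void) readahead(fd, 0, count);

    return fd;
}
```

The file offset is left unchanged by the call, so a subsequent read(2) on the returned descriptor still starts at offset 0.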
RETURN VALUE
On success, readahead() returns 0; on failure, -1 is returned, with errno set to indicate the error.
ERRORS
EBADF
fd is not a valid file descriptor or is not open for reading.
EINVAL
fd does not refer to a file type to which readahead() can be applied.
VERSIONS
On some 32-bit architectures, the calling signature for this system call differs, for the reasons described in syscall(2).
STANDARDS
Linux.
HISTORY
Linux 2.4.13, glibc 2.3.
NOTES
_FILE_OFFSET_BITS should be defined to be 64 in code that uses a pointer to readahead, if the code is intended to be portable to traditional 32-bit x86 and ARM platforms where off_t’s width defaults to 32 bits.
BUGS
readahead() attempts to schedule the reads in the background and return immediately. However, it may block while it reads the filesystem metadata needed to locate the requested blocks. This occurs frequently with ext[234] on large files using indirect blocks instead of extents, giving the appearance that the call blocks until the requested data has been read.
SEE ALSO
lseek(2), madvise(2), mmap(2), posix_fadvise(2), read(2)
360 - Linux cli command setpgrp
NAME 🖥️ setpgrp 🖥️
set/get process group
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setpgid(pid_t pid, pid_t pgid);
pid_t getpgid(pid_t pid);
pid_t getpgrp(void); /* POSIX.1 version */
[[deprecated]] pid_t getpgrp(pid_t pid); /* BSD version */
int setpgrp(void); /* System V version */
[[deprecated]] int setpgrp(pid_t pid, pid_t pgid); /* BSD version */
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
getpgid():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
setpgrp() (POSIX.1):
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _SVID_SOURCE
setpgrp() (BSD), getpgrp() (BSD):
[These are available only before glibc 2.19]
_BSD_SOURCE &&
! (_POSIX_SOURCE || _POSIX_C_SOURCE || _XOPEN_SOURCE
|| _GNU_SOURCE || _SVID_SOURCE)
DESCRIPTION
All of these interfaces are available on Linux, and are used for getting and setting the process group ID (PGID) of a process. The preferred, POSIX.1-specified ways of doing this are: getpgrp(void), for retrieving the calling process’s PGID; and setpgid(), for setting a process’s PGID.
setpgid() sets the PGID of the process specified by pid to pgid. If pid is zero, then the process ID of the calling process is used. If pgid is zero, then the PGID of the process specified by pid is made the same as its process ID. If setpgid() is used to move a process from one process group to another (as is done by some shells when creating pipelines), both process groups must be part of the same session (see setsid(2) and credentials(7)). In this case, the pgid specifies an existing process group to be joined and the session ID of that group must match the session ID of the joining process.
The POSIX.1 version of getpgrp(), which takes no arguments, returns the PGID of the calling process.
getpgid() returns the PGID of the process specified by pid. If pid is zero, the process ID of the calling process is used. (Retrieving the PGID of a process other than the caller is rarely necessary, and the POSIX.1 getpgrp() is preferred for retrieving the caller's own PGID.)
The System V-style setpgrp(), which takes no arguments, is equivalent to setpgid(0, 0).
The BSD-specific setpgrp() call, which takes arguments pid and pgid, is a wrapper function that calls
setpgid(pid, pgid)
Since glibc 2.19, the BSD-specific setpgrp() function is no longer exposed by <unistd.h>; calls should be replaced with the setpgid() call shown above.
The BSD-specific getpgrp() call, which takes a single pid argument, is a wrapper function that calls
getpgid(pid)
Since glibc 2.19, the BSD-specific getpgrp() function is no longer exposed by <unistd.h>; calls should be replaced with calls to the POSIX.1 getpgrp() which takes no arguments (if the intent is to obtain the caller’s PGID), or with the getpgid() call shown above.
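The sketch below illustrates setpgid() and getpgid() together: a parent places a newly forked child into a process group of its own and then reads the child's PGID back. The helper name child_in_own_group() is hypothetical, and the pipe is used only to keep the child alive until the parent has finished.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child, make it the leader of a new process
   group, and check the result. Returns 1 if the child's PGID equals its
   PID, 0 if it does not, and -1 on error. */
int
child_in_own_group(void)
{
    int pfd[2];
    pid_t child, pgid;
    char c;

    if (pipe(pfd) == -1)
        return -1;

    child = fork();
    if (child == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    if (child == 0) {
        /* Child: block until the parent closes the write end. */
        close(pfd[1]);
        while (read(pfd[0], &c, 1) > 0)
            continue;
        _exit(0);
    }

    /* Parent: the child has not called execve(2), so this is allowed.
       Passing 0 as pgid would have the same effect. */
    close(pfd[0]);
    pgid = (setpgid(child, child) == -1) ? -1 : getpgid(child);

    close(pfd[1]);              /* end of file releases the child */
    waitpid(child, NULL, 0);

    if (pgid == -1)
        return -1;
    return pgid == child;
}
```

Job-control shells typically also call setpgid(0, 0) in the child, so that the group exists regardless of which process runs first after fork(2).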
RETURN VALUE
On success, setpgid() and setpgrp() return zero. On error, -1 is returned, and errno is set to indicate the error.
The POSIX.1 getpgrp() always returns the PGID of the caller.
getpgid() and the BSD-specific getpgrp() return a process group ID on success. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
An attempt was made to change the process group ID of one of the children of the calling process and the child had already performed an execve(2) (setpgid(), setpgrp()).
EINVAL
pgid is less than 0 (setpgid(), setpgrp()).
EPERM
An attempt was made to move a process into a process group in a different session, or to change the process group ID of one of the children of the calling process and the child was in a different session, or to change the process group ID of a session leader (setpgid(), setpgrp()).
EPERM
The target process group does not exist (setpgid(), setpgrp()).
ESRCH
For getpgid(): pid does not match any process. For setpgid(): pid is not the calling process and not a child of the calling process.
STANDARDS
getpgid()
setpgid()
getpgrp() (no args)
setpgrp() (no args)
POSIX.1-2008 (but see HISTORY).
setpgrp() (2 args)
getpgrp() (1 arg)
None.
HISTORY
getpgid()
setpgid()
getpgrp() (no args)
POSIX.1-2001.
setpgrp() (no args)
POSIX.1-2001. POSIX.1-2008 marks it as obsolete.
setpgrp() (2 args)
getpgrp() (1 arg)
4.2BSD.
NOTES
A child created via fork(2) inherits its parent’s process group ID. The PGID is preserved across an execve(2).
Each process group is a member of a session and each process is a member of the session of which its process group is a member. (See credentials(7).)
A session can have a controlling terminal. At any time, one (and only one) of the process groups in the session can be the foreground process group for the terminal; the remaining process groups are in the background. If a signal is generated from the terminal (e.g., typing the interrupt key to generate SIGINT), that signal is sent to the foreground process group. (See termios(3) for a description of the characters that generate signals.) Only the foreground process group may read(2) from the terminal; if a background process group tries to read(2) from the terminal, then the group is sent a SIGTTIN signal, which suspends it. The tcgetpgrp(3) and tcsetpgrp(3) functions are used to get/set the foreground process group of the controlling terminal.
The setpgid() and getpgrp() calls are used by programs such as bash(1) to create process groups in order to implement shell job control.
If the termination of a process causes a process group to become orphaned, and if any member of the newly orphaned process group is stopped, then a SIGHUP signal followed by a SIGCONT signal will be sent to each process in the newly orphaned process group. An orphaned process group is one in which the parent of every member of the process group is either itself also a member of the process group or is a member of a process group in a different session (see also credentials(7)).
SEE ALSO
getuid(2), setsid(2), tcgetpgrp(3), tcsetpgrp(3), termios(3), credentials(7)
361 - Linux cli command poll
NAME 🖥️ poll 🖥️
wait for some event on a file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <poll.h>
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <poll.h>
int ppoll(struct pollfd *fds, nfds_t nfds,
const struct timespec *_Nullable tmo_p,
const sigset_t *_Nullable sigmask);
DESCRIPTION
poll() performs a similar task to select(2): it waits for one of a set of file descriptors to become ready to perform I/O. The Linux-specific epoll(7) API performs a similar task, but offers features beyond those found in poll().
The set of file descriptors to be monitored is specified in the fds argument, which is an array of structures of the following form:
struct pollfd {
    int   fd;         /* file descriptor */
    short events;     /* requested events */
    short revents;    /* returned events */
};
The caller should specify the number of items in the fds array in nfds.
The field fd contains a file descriptor for an open file. If this field is negative, then the corresponding events field is ignored and the revents field returns zero. (This provides an easy way of ignoring a file descriptor for a single poll() call: simply set the fd field to its bitwise complement.)
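The bitwise-complement technique can be sketched as follows. The helper name pollfd_toggle() is hypothetical. Unlike arithmetic negation, the complement also works for file descriptor 0, because ~0 is -1.

```c
#include <poll.h>

/* Hypothetical helper: flip an entry between "monitored" and "ignored".
   Complementing a nonnegative fd always yields a negative value, which
   poll() skips; complementing again restores the original fd. */
void
pollfd_toggle(struct pollfd *pfd)
{
    pfd->fd = ~pfd->fd;
}
```

Calling pollfd_toggle() once before poll() and once after it ignores the descriptor for that single call without losing its value.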
The field events is an input parameter, a bit mask specifying the events the application is interested in for the file descriptor fd. This field may be specified as zero, in which case the only events that can be returned in revents are POLLHUP, POLLERR, and POLLNVAL (see below).
The field revents is an output parameter, filled by the kernel with the events that actually occurred. The bits returned in revents can include any of those specified in events, or one of the values POLLERR, POLLHUP, or POLLNVAL. (These three bits are meaningless in the events field, and will be set in the revents field whenever the corresponding condition is true.)
If none of the events requested (and no error) has occurred for any of the file descriptors, then poll() blocks until one of the events occurs.
The timeout argument specifies the number of milliseconds that poll() should block waiting for a file descriptor to become ready. The call will block until either:
a file descriptor becomes ready;
the call is interrupted by a signal handler; or
the timeout expires.
Being “ready” means that the requested operation will not block; thus, poll()ing regular files, block devices, and other files with no reasonable polling semantic always returns instantly as ready to read and write.
Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a negative value in timeout means an infinite timeout. Specifying a timeout of zero causes poll() to return immediately, even if no file descriptors are ready.
The bits that may be set/returned in events and revents are defined in <poll.h>:
POLLIN
There is data to read.
POLLPRI
There is some exceptional condition on the file descriptor. Possibilities include:
There is out-of-band data on a TCP socket (see tcp(7)).
A pseudoterminal master in packet mode has seen a state change on the slave (see ioctl_tty(2)).
A cgroup.events file has been modified (see cgroups(7)).
POLLOUT
Writing is now possible, though a write larger than the available space in a socket or pipe will still block (unless O_NONBLOCK is set).
POLLRDHUP (since Linux 2.6.17)
Stream socket peer closed connection, or shut down writing half of connection. The _GNU_SOURCE feature test macro must be defined (before including any header files) in order to obtain this definition.
POLLERR
Error condition (only returned in revents; ignored in events). This bit is also set for a file descriptor referring to the write end of a pipe when the read end has been closed.
POLLHUP
Hang up (only returned in revents; ignored in events). Note that when reading from a channel such as a pipe or a stream socket, this event merely indicates that the peer closed its end of the channel. Subsequent reads from the channel will return 0 (end of file) only after all outstanding data in the channel has been consumed.
POLLNVAL
Invalid request: fd not open (only returned in revents; ignored in events).
When compiling with _XOPEN_SOURCE defined, one also has the following, which convey no further information beyond the bits listed above:
POLLRDNORM
Equivalent to POLLIN.
POLLRDBAND
Priority band data can be read (generally unused on Linux).
POLLWRNORM
Equivalent to POLLOUT.
POLLWRBAND
Priority data may be written.
Linux also knows about, but does not use, POLLMSG.
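A minimal sketch of a single-descriptor poll() call, using the event bits listed above, might look like this. The helper name wait_events() is hypothetical.

```c
#include <poll.h>

/* Hypothetical helper: wait up to timeout_ms milliseconds for fd to
   become readable. Returns -1 on error (errno set by poll()), 0 if the
   timeout expired, and otherwise the revents bit mask, which may contain
   POLLIN as well as POLLERR, POLLHUP, or POLLNVAL. */
int
wait_events(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;

    ready = poll(&pfd, 1, timeout_ms);
    if (ready <= 0)
        return ready;           /* error or timeout */

    return pfd.revents;
}
```

A caller would typically test the result with a bit mask, for example (wait_events(fd, 1000) & POLLIN), rather than compare it for equality, because several bits can be set at once.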
ppoll()
The relationship between poll() and ppoll() is analogous to the relationship between select(2) and pselect(2): like pselect(2), ppoll() allows an application to safely wait until either a file descriptor becomes ready or until a signal is caught.
Other than the difference in the precision of the timeout argument, the following ppoll() call:
ready = ppoll(&fds, nfds, tmo_p, &sigmask);
is nearly equivalent to atomically executing the following calls:
sigset_t origmask;
int timeout;

timeout = (tmo_p == NULL) ? -1 :
          (tmo_p->tv_sec * 1000 + tmo_p->tv_nsec / 1000000);
pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = poll(&fds, nfds, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);
The above code segment is described as nearly equivalent because whereas a negative timeout value for poll() is interpreted as an infinite timeout, a negative value expressed in *tmo_p results in an error from ppoll().
See the description of pselect(2) for an explanation of why ppoll() is necessary.
If the sigmask argument is specified as NULL, then no signal mask manipulation is performed (and thus ppoll() differs from poll() only in the precision of the timeout argument).
The tmo_p argument specifies an upper limit on the amount of time that ppoll() will block. This argument is a pointer to a timespec(3) structure.
If tmo_p is specified as NULL, then ppoll() can block indefinitely.
RETURN VALUE
On success, poll() returns a nonnegative value which is the number of elements in the pollfds whose revents fields have been set to a nonzero value (indicating an event or an error). A return value of zero indicates that the system call timed out before any file descriptors became ready.
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
fds points outside the process’s accessible address space; that is, the array given as argument was not contained in the calling program’s address space.
EINTR
A signal occurred before any requested event; see signal(7).
EINVAL
The nfds value exceeds the RLIMIT_NOFILE value.
EINVAL
(ppoll()) The timeout value expressed in *tmo_p is invalid (negative).
ENOMEM
Unable to allocate memory for kernel data structures.
VERSIONS
On some other UNIX systems, poll() can fail with the error EAGAIN if the system fails to allocate kernel-internal resources, rather than ENOMEM as Linux does. POSIX permits this behavior. Portable programs may wish to check for EAGAIN and loop, just as with EINTR.
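A retry loop of the kind suggested above might look like the following sketch. The names poll_retry() and poll_retry_demo() are hypothetical, and the loop makes a simplifying assumption: each retry starts again with the full timeout instead of the time remaining.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical wrapper: call poll() again when it fails with EINTR
   or, on systems that report it, EAGAIN. Simplification: each retry
   restarts with the full timeout instead of the time remaining. */
static int
poll_retry(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    int ready;

    assert(fds != NULL || nfds == 0);

    do {
        ready = poll(fds, nfds, timeout_ms);
    } while (ready == -1 && (errno == EINTR || errno == EAGAIN));

    return ready;
}

/* Demonstration on a pipe: returns the number of ready descriptors
   reported once one byte has been written, or -1 on failure. */
static int
poll_retry_demo(void)
{
    int pipefd[2];
    int ready = -1;
    struct pollfd pfd;

    if (pipe(pipefd) == -1)
        return -1;

    pfd.fd = pipefd[0];
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Nothing written yet: a zero timeout returns 0 at once. */
    if (poll_retry(&pfd, 1, 0) == 0 && write(pipefd[1], "x", 1) == 1)
        ready = poll_retry(&pfd, 1, 1000);

    close(pipefd[0]);
    close(pipefd[1]);
    return ready;
}
```

A caller that needs an accurate overall deadline would have to recompute the remaining time before each retry, for example with clock_gettime(2).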
Some implementations define the nonstandard constant INFTIM with the value -1 for use as a timeout for poll(). This constant is not provided in glibc.
C library/kernel differences
The Linux ppoll() system call modifies its tmo_p argument. However, the glibc wrapper function hides this behavior by using a local variable for the timeout argument that is passed to the system call. Thus, the glibc ppoll() function does not modify its tmo_p argument.
The raw ppoll() system call has a fifth argument, size_t sigsetsize, which specifies the size in bytes of the sigmask argument. The glibc ppoll() wrapper function specifies this argument as a fixed value (equal to sizeof(kernel_sigset_t)). See sigprocmask(2) for a discussion on the differences between the kernel and the libc notion of the sigset.
STANDARDS
poll()
POSIX.1-2008.
ppoll()
Linux.
HISTORY
poll()
POSIX.1-2001. Linux 2.1.23.
On older kernels that lack this system call, the glibc poll() wrapper function provides emulation using select(2).
ppoll()
Linux 2.6.16, glibc 2.4.
NOTES
The operation of poll() and ppoll() is not affected by the O_NONBLOCK flag.
For a discussion of what may happen if a file descriptor being monitored by poll() is closed in another thread, see select(2).
BUGS
See the discussion of spurious readiness notifications under the BUGS section of select(2).
EXAMPLES
The program below opens each of the files named in its command-line arguments and monitors the resulting file descriptors for readiness to read (POLLIN). The program loops, repeatedly using poll() to monitor the file descriptors, printing the number of ready file descriptors on return. For each ready file descriptor, the program:
displays the returned revents field in a human-readable form;
if the file descriptor is readable, reads some data from it, and displays that data on standard output; and
if the file descriptor was not readable, but some other event occurred (presumably POLLHUP), closes the file descriptor.
Suppose we run the program in one terminal, asking it to open a FIFO:
$ mkfifo myfifo
$ ./poll_input myfifo
In a second terminal window, we then open the FIFO for writing, write some data to it, and close the FIFO:
$ echo aaaaabbbbbccccc > myfifo
In the terminal where we are running the program, we would then see:
Opened "myfifo" on fd 3
About to poll()
Ready: 1
fd=3; events: POLLIN POLLHUP
read 10 bytes: aaaaabbbbb
About to poll()
Ready: 1
fd=3; events: POLLIN POLLHUP
read 6 bytes: ccccc
About to poll()
Ready: 1
fd=3; events: POLLHUP
closing fd 3
All file descriptors closed; bye
In the above output, we see that poll() returned three times:
On the first return, the bits returned in the revents field were POLLIN, indicating that the file descriptor is readable, and POLLHUP, indicating that the other end of the FIFO has been closed. The program then consumed some of the available input.
The second return from poll() also indicated POLLIN and POLLHUP; the program then consumed the last of the available input.
On the final return, poll() indicated only POLLHUP on the FIFO, at which point the file descriptor was closed and the program terminated.
Program source
/* poll_input.c

   Licensed under GNU General Public License v2 or later.
*/
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

int
main(int argc, char *argv[])
{
    int            ready;
    char           buf[10];
    nfds_t         num_open_fds, nfds;
    ssize_t        s;
    struct pollfd  *pfds;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s file...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    num_open_fds = nfds = argc - 1;
    pfds = calloc(nfds, sizeof(struct pollfd));
    if (pfds == NULL)
        errExit("calloc");

    /* Open each file on command line, and add it to 'pfds' array. */

    for (nfds_t j = 0; j < nfds; j++) {
        pfds[j].fd = open(argv[j + 1], O_RDONLY);
        if (pfds[j].fd == -1)
            errExit("open");

        printf("Opened \"%s\" on fd %d\n", argv[j + 1], pfds[j].fd);

        pfds[j].events = POLLIN;
    }

    /* Keep calling poll() as long as at least one file descriptor is
       open. */

    while (num_open_fds > 0) {
        printf("About to poll()\n");
        ready = poll(pfds, nfds, -1);
        if (ready == -1)
            errExit("poll");

        printf("Ready: %d\n", ready);

        /* Deal with array returned by poll(). */

        for (nfds_t j = 0; j < nfds; j++) {
            if (pfds[j].revents != 0) {
                printf("  fd=%d; events: %s%s%s\n", pfds[j].fd,
                       (pfds[j].revents & POLLIN)  ? "POLLIN "  : "",
                       (pfds[j].revents & POLLHUP) ? "POLLHUP " : "",
                       (pfds[j].revents & POLLERR) ? "POLLERR " : "");

                if (pfds[j].revents & POLLIN) {
                    s = read(pfds[j].fd, buf, sizeof(buf));
                    if (s == -1)
                        errExit("read");
                    printf("    read %zd bytes: %.*s\n",
                           s, (int) s, buf);
                } else {                /* POLLERR | POLLHUP */
                    printf("    closing fd %d\n", pfds[j].fd);
                    if (close(pfds[j].fd) == -1)
                        errExit("close");

                    /* A negative fd makes poll() ignore this entry,
                       so the closed descriptor is not polled again. */
                    pfds[j].fd = -1;
                    num_open_fds--;
                }
            }
        }
    }

    printf("All file descriptors closed; bye\n");
    exit(EXIT_SUCCESS);
}
SEE ALSO
restart_syscall(2), select(2), select_tut(2), timespec(3), epoll(7), time(7)
362 - Linux cli command pkey_mprotect
NAME 🖥️ pkey_mprotect 🖥️
set protection on a region of memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
int mprotect(void addr[.len], size_t len, int prot);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/mman.h>
int pkey_mprotect(void addr[.len], size_t len, int prot, int pkey);
DESCRIPTION
mprotect() changes the access protections for the calling process’s memory pages containing any part of the address range in the interval [addr, addr+len-1]. addr must be aligned to a page boundary.
If the calling process tries to access memory in a manner that violates the protections, then the kernel generates a SIGSEGV signal for the process.
prot is a combination of the following access flags: PROT_NONE or a bitwise OR of the other values in the following list:
PROT_NONE
The memory cannot be accessed at all.
PROT_READ
The memory can be read.
PROT_WRITE
The memory can be modified.
PROT_EXEC
The memory can be executed.
PROT_SEM (since Linux 2.5.7)
The memory can be used for atomic operations. This flag was introduced as part of the futex(2) implementation (in order to guarantee the ability to perform atomic operations required by commands such as FUTEX_WAIT), but is not currently used on any architecture.
PROT_SAO (since Linux 2.6.26)
The memory should have strong access ordering. This feature is specific to the PowerPC architecture (version 2.06 of the architecture specification adds the SAO CPU feature, and it is available on POWER 7 or PowerPC A2, for example).
Additionally (since Linux 2.6.0), prot can have one of the following flags set:
PROT_GROWSUP
Apply the protection mode up to the end of a mapping that grows upwards. (Such mappings are created for the stack area on architectures, for example HP-PARISC, that have an upwardly growing stack.)
PROT_GROWSDOWN
Apply the protection mode down to the beginning of a mapping that grows downward (which should be a stack segment or a segment mapped with the MAP_GROWSDOWN flag set).
Like mprotect(), pkey_mprotect() changes the protection on the pages specified by addr and len. The pkey argument specifies the protection key (see pkeys(7)) to assign to the memory. The protection key must be allocated with pkey_alloc(2) before it is passed to pkey_mprotect(). For an example of the use of this system call, see pkeys(7).
RETURN VALUE
On success, mprotect() and pkey_mprotect() return zero. On error, these system calls return -1, and errno is set to indicate the error.
ERRORS
EACCES
The memory cannot be given the specified access. This can happen, for example, if you mmap(2) a file to which you have read-only access, then ask mprotect() to mark it PROT_WRITE.
EINVAL
addr is not a valid pointer, or not a multiple of the system page size.
EINVAL
(pkey_mprotect()) pkey has not been allocated with pkey_alloc(2).
EINVAL
Both PROT_GROWSUP and PROT_GROWSDOWN were specified in prot.
EINVAL
Invalid flags specified in prot.
EINVAL
(PowerPC architecture) PROT_SAO was specified in prot, but SAO hardware feature is not available.
ENOMEM
Internal kernel structures could not be allocated.
ENOMEM
Addresses in the range [addr, addr+len-1] are invalid for the address space of the process, or specify one or more pages that are not mapped. (Before Linux 2.4.19, the error EFAULT was incorrectly produced for these cases.)
ENOMEM
Changing the protection of a memory region would result in the total number of mappings with distinct attributes (e.g., read versus read/write protection) exceeding the allowed maximum. (For example, making the protection of a range PROT_READ in the middle of a region currently protected as PROT_READ|PROT_WRITE would result in three mappings: two read/write mappings at each end and a read-only mapping in the middle.)
VERSIONS
POSIX says that the behavior of mprotect() is unspecified if it is applied to a region of memory that was not obtained via mmap(2).
On Linux, it is always permissible to call mprotect() on any address in a process’s address space (except for the kernel vsyscall area). In particular, it can be used to change existing code mappings to be writable.
Whether PROT_EXEC has any effect different from PROT_READ depends on processor architecture, kernel version, and process state. If READ_IMPLIES_EXEC is set in the process’s personality flags (see personality(2)), specifying PROT_READ will implicitly add PROT_EXEC.
On some hardware architectures (e.g., i386), PROT_WRITE implies PROT_READ.
POSIX.1 says that an implementation may permit access other than that specified in prot, but at a minimum can allow write access only if PROT_WRITE has been set, and must not allow any access if PROT_NONE has been set.
Applications should be careful when mixing use of mprotect() and pkey_mprotect(). On x86, when mprotect() is used with prot set to PROT_EXEC a pkey may be allocated and set on the memory implicitly by the kernel, but only when the pkey was 0 previously.
On systems that do not support protection keys in hardware, pkey_mprotect() may still be used, but pkey must be set to -1. When called this way, the operation of pkey_mprotect() is equivalent to mprotect().
STANDARDS
mprotect()
POSIX.1-2008.
pkey_mprotect()
Linux.
HISTORY
mprotect()
POSIX.1-2001, SVr4.
pkey_mprotect()
Linux 4.9, glibc 2.27.
EXAMPLES
The program below demonstrates the use of mprotect(). The program allocates four pages of memory, makes the third of these pages read-only, and then executes a loop that walks upward through the allocated region modifying bytes.
An example of what we might see when running the program is the following:
$ ./a.out
Start of region: 0x804c000
Got SIGSEGV at address: 0x804e000
Program source
#include <malloc.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

static char *buffer;

static void
handler(int sig, siginfo_t *si, void *unused)
{
    /* Note: calling printf() from a signal handler is not safe
       (and should not be done in production programs), since
       printf() is not async-signal-safe; see signal-safety(7).
       Nevertheless, we use printf() here as a simple way of
       showing that the handler was called. */

    printf("Got SIGSEGV at address: %p\n", si->si_addr);
    exit(EXIT_FAILURE);
}

int
main(void)
{
    int               pagesize;
    struct sigaction  sa;

    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = handler;
    if (sigaction(SIGSEGV, &sa, NULL) == -1)
        handle_error("sigaction");

    pagesize = sysconf(_SC_PAGE_SIZE);
    if (pagesize == -1)
        handle_error("sysconf");

    /* Allocate a buffer aligned on a page boundary;
       initial protection is PROT_READ | PROT_WRITE. */

    buffer = memalign(pagesize, 4 * pagesize);
    if (buffer == NULL)
        handle_error("memalign");

    printf("Start of region:        %p\n", buffer);

    if (mprotect(buffer + pagesize * 2, pagesize,
                 PROT_READ) == -1)
        handle_error("mprotect");

    for (char *p = buffer ; ; )
        *(p++) = 'a';

    printf("Loop completed\n");     /* Should never happen */
    exit(EXIT_SUCCESS);
}
SEE ALSO
mmap(2), sysconf(3), pkeys(7)
363 - Linux cli command times
NAME 🖥️ times 🖥️
get process times
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/times.h>
clock_t times(struct tms *buf);
DESCRIPTION
times() stores the current process times in the struct tms that buf points to. The struct tms is as defined in <sys/times.h>:
struct tms {
clock_t tms_utime; /* user time */
clock_t tms_stime; /* system time */
clock_t tms_cutime; /* user time of children */
clock_t tms_cstime; /* system time of children */
};
The tms_utime field contains the CPU time spent executing instructions of the calling process. The tms_stime field contains the CPU time spent executing inside the kernel while performing tasks on behalf of the calling process.
The tms_cutime field contains the sum of the tms_utime and tms_cutime values for all waited-for terminated children. The tms_cstime field contains the sum of the tms_stime and tms_cstime values for all waited-for terminated children.
Times for terminated children (and their descendants) are added in at the moment wait(2) or waitpid(2) returns their process ID. In particular, times of grandchildren that the children did not wait for are never seen.
All times reported are in clock ticks.
RETURN VALUE
times() returns the number of clock ticks that have elapsed since an arbitrary point in the past. The return value may overflow the possible range of type clock_t. On error, (clock_t) -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
buf points outside the process’s address space.
VERSIONS
On Linux, the buf argument can be specified as NULL, with the result that times() just returns a function result. However, POSIX does not specify this behavior, and most other UNIX implementations require a non-NULL value for buf.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
In POSIX.1-1996 the symbol CLK_TCK (defined in <time.h>) is mentioned as obsolescent. It is obsolete now.
Before Linux 2.6.9, if the disposition of SIGCHLD is set to SIG_IGN, then the times of terminated children are automatically included in the tms_cstime and tms_cutime fields, although POSIX.1-2001 says that this should happen only if the calling process wait(2)s on its children. This nonconformance is rectified in Linux 2.6.9 and later.
On Linux, the "arbitrary point in the past" from which the return value of times() is measured has varied across kernel versions. On Linux 2.4 and earlier, this point is the moment the system was booted. Since Linux 2.6, this point is (2^32/HZ) - 300 seconds before system boot time. This variability across kernel versions (and across UNIX implementations), combined with the fact that the returned value may overflow the range of clock_t, means that a portable application would be wise to avoid using this value. To measure changes in elapsed time, use clock_gettime(2) instead.
SVr1-3 returns long and the struct members are of type time_t although they store clock ticks, not seconds since the Epoch. V7 used long for the struct members, because it had no type time_t yet.
NOTES
The number of clock ticks per second can be obtained using:
sysconf(_SC_CLK_TCK);
Note that clock(3) also returns a value of type clock_t, but this value is measured in units of CLOCKS_PER_SEC, not the clock ticks used by times().
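Converting the tick counts from times() into seconds might be done as in the sketch below. The names ticks_to_seconds() and process_cpu_seconds() are hypothetical; checking errno in addition to the return value is an assumption motivated by the spurious -1 described under BUGS.

```c
#include <assert.h>
#include <errno.h>
#include <sys/times.h>
#include <unistd.h>

/* Hypothetical helper: convert clock ticks to seconds, given the
   tick rate reported by sysconf(_SC_CLK_TCK). */
static double
ticks_to_seconds(clock_t ticks, long ticks_per_sec)
{
    assert(ticks_per_sec > 0);
    return (double) ticks / (double) ticks_per_sec;
}

/* CPU time (user + system) consumed so far by the calling process,
   in seconds, or -1.0 on failure. */
static double
process_cpu_seconds(void)
{
    struct tms t;
    long ticks_per_sec = sysconf(_SC_CLK_TCK);

    if (ticks_per_sec <= 0)
        return -1.0;

    /* The return value alone cannot signal failure reliably (see
       BUGS), so errno is checked as well. */
    errno = 0;
    if (times(&t) == (clock_t) -1 && errno != 0)
        return -1.0;

    return ticks_to_seconds(t.tms_utime + t.tms_stime, ticks_per_sec);
}
```

The tick rate is queried at run time because it is a property of the system, not a compile-time constant.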
BUGS
A limitation of the Linux system call conventions on some architectures (notably i386) means that on Linux 2.6 there is a small time window (41 seconds) soon after boot when times() can return -1, falsely indicating that an error occurred. The same problem can occur when the return value wraps past the maximum value that can be stored in clock_t.
SEE ALSO
time(1), getrusage(2), wait(2), clock(3), sysconf(3), time(7)
364 - Linux cli command setresuid32
NAME 🖥️ setresuid32 🖥️
set real, effective, and saved user or group ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
int setresuid(uid_t ruid, uid_t euid, uid_t suid);
int setresgid(gid_t rgid, gid_t egid, gid_t sgid);
DESCRIPTION
setresuid() sets the real user ID, the effective user ID, and the saved set-user-ID of the calling process.
An unprivileged process may change its real UID, effective UID, and saved set-user-ID, each to one of: the current real UID, the current effective UID, or the current saved set-user-ID.
A privileged process (on Linux, one having the CAP_SETUID capability) may set its real UID, effective UID, and saved set-user-ID to arbitrary values.
If one of the arguments equals -1, the corresponding value is not changed.
Regardless of what changes are made to the real UID, effective UID, and saved set-user-ID, the filesystem UID is always set to the same value as the (possibly new) effective UID.
Completely analogously, setresgid() sets the real GID, effective GID, and saved set-group-ID of the calling process (and always modifies the filesystem GID to be the same as the effective GID), with the same restrictions for unprivileged processes.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where setresuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from setresuid().
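The check called for in the note above might be sketched as follows. The helper name drop_to_uid() is hypothetical, and verifying the result with getresuid(2) is an extra precaution that the text does not require.

```c
#define _GNU_SOURCE             /* for setresuid() and getresuid() */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set the real, effective, and saved UIDs to
   'uid' and confirm that the change took effect.
   Returns 0 on success, -1 on failure. */
static int
drop_to_uid(uid_t uid)
{
    uid_t ruid, euid, suid;

    /* -1 means "leave unchanged", so it is not a valid target. */
    assert(uid != (uid_t) -1);

    /* Never ignore this return value: continuing after a failed
       call would leave the process with its old privileges. */
    if (setresuid(uid, uid, uid) == -1) {
        perror("setresuid");
        return -1;
    }

    /* Verify the result instead of assuming it. */
    if (getresuid(&ruid, &euid, &suid) == -1)
        return -1;

    return (ruid == uid && euid == uid && suid == uid) ? 0 : -1;
}
```

A set-user-ID program would typically call drop_to_uid(getuid()) once its privileged work is done and terminate if the call fails. Because the saved set-user-ID is changed as well, an unprivileged process cannot regain the old IDs afterward.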
ERRORS
EAGAIN
The call would change the caller’s real UID (i.e., ruid does not match the caller’s real UID), but there was a temporary failure allocating the necessary kernel data structures.
EAGAIN
ruid does not match the caller’s real UID and this call would bring the number of processes belonging to the real user ID ruid over the caller’s RLIMIT_NPROC resource limit. Since Linux 3.1, this error case no longer occurs (but robust applications should check for this error); see the description of EAGAIN in execve(2).
EINVAL
One or more of the target user or group IDs is not valid in this user namespace.
EPERM
The calling process is not privileged (did not have the necessary capability in its user namespace) and tried to change the IDs to values that are not permitted. For setresuid(), the necessary capability is CAP_SETUID; for setresgid(), it is CAP_SETGID.
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including those for setresuid() and setresgid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
None.
HISTORY
Linux 2.1.44, glibc 2.3.2. HP-UX, FreeBSD.
The original Linux setresuid() and setresgid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added setresuid32() and setresgid32(), supporting 32-bit IDs. The glibc setresuid() and setresgid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getresuid(2), getuid(2), setfsgid(2), setfsuid(2), setreuid(2), setuid(2), capabilities(7), credentials(7), user_namespaces(7)
365 - Linux cli command getresgid
NAME 🖥️ getresgid 🖥️
get real, effective, and saved user/group IDs
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
int getresuid(uid_t *ruid, uid_t *euid, uid_t *suid);
int getresgid(gid_t *rgid, gid_t *egid, gid_t *sgid);
DESCRIPTION
getresuid() returns the real UID, the effective UID, and the saved set-user-ID of the calling process, in the arguments ruid, euid, and suid, respectively. getresgid() performs the analogous task for the process’s group IDs.
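A minimal sketch of calling both functions follows. The function name print_credentials() and the output format are hypothetical.

```c
#define _GNU_SOURCE             /* for getresuid() and getresgid() */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Print the three user IDs and three group IDs of the caller.
   Returns 0 on success, -1 if either call fails. */
static int
print_credentials(void)
{
    uid_t ruid, euid, suid;
    gid_t rgid, egid, sgid;

    if (getresuid(&ruid, &euid, &suid) == -1) {
        perror("getresuid");
        return -1;
    }
    if (getresgid(&rgid, &egid, &sgid) == -1) {
        perror("getresgid");
        return -1;
    }

    /* The first two values of each set match what the classic
       getuid()/geteuid() and getgid()/getegid() calls report. */
    assert(ruid == getuid() && euid == geteuid());
    assert(rgid == getgid() && egid == getegid());

    printf("uid: real=%ld effective=%ld saved=%ld\n",
           (long) ruid, (long) euid, (long) suid);
    printf("gid: real=%ld effective=%ld saved=%ld\n",
           (long) rgid, (long) egid, (long) sgid);
    return 0;
}
```

The saved set-user-ID and saved set-group-ID are the values that the older getuid(2) family cannot report, which is the main reason to use these calls.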
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
One of the arguments specified an address outside the calling program’s address space.
STANDARDS
None. These calls also appear on HP-UX and some of the BSDs.
HISTORY
Linux 2.1.44, glibc 2.3.2.
The original Linux getresuid() and getresgid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added getresuid32() and getresgid32(), supporting 32-bit IDs. The glibc getresuid() and getresgid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getuid(2), setresuid(2), setreuid(2), setuid(2), credentials(7)
366 - Linux cli command getsockname
NAME 🖥️ getsockname 🖥️
get socket name
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int getsockname(int sockfd, struct sockaddr *restrict addr,
socklen_t *restrict addrlen);
DESCRIPTION
getsockname() returns the current address to which the socket sockfd is bound, in the buffer pointed to by addr. The addrlen argument should be initialized to indicate the amount of space (in bytes) pointed to by addr. On return it contains the actual size of the socket address.
The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
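One common use of getsockname() is to discover which port the kernel assigned after binding to port 0. The sketch below assumes an IPv4 UDP socket bound to the loopback address; the function name ephemeral_port_demo() is hypothetical.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a UDP socket to the loopback address with port 0 (the kernel
   chooses a free port), then use getsockname() to learn the port.
   Returns the port in host byte order, or -1 on failure. */
static int
ephemeral_port_demo(void)
{
    struct sockaddr_in addr;
    socklen_t addrlen;
    int port = -1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1)
        goto out;

    /* Initialize addrlen to the buffer size before the call. */
    addrlen = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *) &addr, &addrlen) == -1)
        goto out;

    /* A larger value on return would mean the address was truncated. */
    if (addrlen > sizeof(addr))
        goto out;

    assert(addr.sin_family == AF_INET);
    port = ntohs(addr.sin_port);
out:
    close(fd);
    return port;
}
```

Code that must handle several address families would normally pass a struct sockaddr_storage instead, so that truncation cannot occur.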
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
The argument sockfd is not a valid file descriptor.
EFAULT
The addr argument points to memory not in a valid part of the process address space.
EINVAL
addrlen is invalid (e.g., is negative).
ENOBUFS
Insufficient resources were available in the system to perform the operation.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD (first appeared in 4.2BSD).
SEE ALSO
bind(2), socket(2), getifaddrs(3), ip(7), socket(7), unix(7)
367 - Linux cli command umask
NAME 🖥️ umask 🖥️
set file mode creation mask
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
mode_t umask(mode_t mask);
DESCRIPTION
umask() sets the calling process’s file mode creation mask (umask) to mask & 0777 (i.e., only the file permission bits of mask are used), and returns the previous value of the mask.
The umask is used by open(2), mkdir(2), and other system calls that create files to modify the permissions placed on newly created files or directories. Specifically, permissions in the umask are turned off from the mode argument to open(2) and mkdir(2).
Alternatively, if the parent directory has a default ACL (see acl(5)), the umask is ignored, the default ACL is inherited, the permission bits are set based on the inherited ACL, and permission bits absent in the mode argument are turned off. For example, the following default ACL is equivalent to a umask of 022:
u::rwx,g::r-x,o::r-x
Combining the effect of this default ACL with a mode argument of 0666 (rw-rw-rw-), the resulting file permissions would be 0644 (rw-r--r--).
The constants that should be used to specify mask are described in inode(7).
The typical default value for the process umask is S_IWGRP | S_IWOTH (octal 022). In the usual case where the mode argument to open(2) is specified as:
S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH
(octal 0666) when creating a new file, the permissions on the resulting file will be:
S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH
(because 0666 & ~022 = 0644; i.e., rw-r--r--).
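The arithmetic above can be written as a one-line function. This is an illustrative sketch only: the name mode_after_umask() is hypothetical, and it models the case where no default ACL applies.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: permissions a new file receives when created
   with mode 'requested' under umask 'mask', assuming that the parent
   directory has no default ACL. */
static mode_t
mode_after_umask(mode_t requested, mode_t mask)
{
    /* Only permission and set-ID/sticky bits are expected here. */
    assert((requested & ~(mode_t) 07777) == 0);

    /* Bits set in the umask are turned off in the result. */
    return requested & ~(mask & 0777);
}
```

For example, mode_after_umask(0666, 022) yields 0644, and a mask of 077 applied to 0666 yields 0600.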
RETURN VALUE
This system call always succeeds and the previous value of the mask is returned.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
NOTES
A child process created via fork(2) inherits its parent’s umask. The umask is left unchanged by execve(2).
It is impossible to use umask() to fetch a process’s umask without at the same time changing it. A second call to umask() would then be needed to restore the umask. The nonatomicity of these two steps provides the potential for races in multithreaded programs.
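The two-step idiom described above might be sketched as follows. The name query_umask() is hypothetical, and using 077 as the temporary mask is a design assumption: a file created by another thread during the window then gets tighter permissions than intended, not looser ones.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: return the current umask by setting a
   temporary mask and restoring the old one at once. Not safe if
   other threads create files between the two calls. */
static mode_t
query_umask(void)
{
    mode_t old = umask(077);    /* restrictive while we hold it */

    umask(old);                 /* put the original value back */

    /* The kernel keeps only the file permission bits. */
    assert((old & ~(mode_t) 0777) == 0);
    return old;
}
```

The race remains, so this idiom is best kept to single-threaded programs or to startup code that runs before other threads exist.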
Since Linux 4.7, the umask of any process can be viewed via the Umask field of /proc/pid/status. Inspecting this field in /proc/self/status allows a process to retrieve its umask without at the same time changing it.
The umask setting also affects the permissions assigned to POSIX IPC objects (mq_open(3), sem_open(3), shm_open(3)), FIFOs (mkfifo(3)), and UNIX domain sockets (unix(7)) created by the process. The umask does not affect the permissions assigned to System V IPC objects created by the process (using msgget(2), semget(2), shmget(2)).
SEE ALSO
chmod(2), mkdir(2), open(2), stat(2), acl(5)
368 - Linux cli command epoll_create1
NAME 🖥️ epoll_create1 🖥️
open an epoll file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/epoll.h>
int epoll_create(int size);
int epoll_create1(int flags);
DESCRIPTION
epoll_create() creates a new epoll(7) instance. Since Linux 2.6.8, the size argument is ignored, but must be greater than zero; see HISTORY.
epoll_create() returns a file descriptor referring to the new epoll instance. This file descriptor is used for all the subsequent calls to the epoll interface. When no longer required, the file descriptor returned by epoll_create() should be closed by using close(2). When all file descriptors referring to an epoll instance have been closed, the kernel destroys the instance and releases the associated resources for reuse.
epoll_create1()
If flags is 0, then, other than the fact that the obsolete size argument is dropped, epoll_create1() is the same as epoll_create(). The following value can be included in flags to obtain different behavior:
EPOLL_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
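The effect of EPOLL_CLOEXEC can be observed with fcntl(2), as in the following sketch. The function name epoll_cloexec_demo() is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Create an epoll instance with EPOLL_CLOEXEC and report whether
   FD_CLOEXEC is set on the new descriptor.
   Returns 1 if set, 0 if not, -1 on error. */
static int
epoll_cloexec_demo(void)
{
    int flags;
    int epfd = epoll_create1(EPOLL_CLOEXEC);

    if (epfd == -1)
        return -1;
    assert(epfd >= 0);

    /* Read back the file descriptor flags. */
    flags = fcntl(epfd, F_GETFD);

    /* Close the descriptor when it is no longer needed, so that
       the kernel can release the epoll instance. */
    if (close(epfd) == -1 || flags == -1)
        return -1;

    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

Setting the flag at creation time avoids the window that would exist between epoll_create() and a separate fcntl(2) F_SETFD call in a multithreaded program.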
RETURN VALUE
On success, these system calls return a file descriptor (a nonnegative integer). On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
size is not positive.
EINVAL
(epoll_create1()) Invalid value specified in flags.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOMEM
There was insufficient memory to create the kernel object.
STANDARDS
Linux.
HISTORY
epoll_create()
Linux 2.6, glibc 2.3.2.
epoll_create1()
Linux 2.6.27, glibc 2.9.
In the initial epoll_create() implementation, the size argument informed the kernel of the number of file descriptors that the caller expected to add to the epoll instance. The kernel used this information as a hint for the amount of space to initially allocate in internal data structures describing events. (If necessary, the kernel would allocate more space if the caller’s usage exceeded the hint given in size.) Nowadays, this hint is no longer required (the kernel dynamically sizes the required data structures without needing the hint), but size must still be greater than zero, in order to ensure backward compatibility when new epoll applications are run on older kernels.
Prior to Linux 2.6.29, a /proc/sys/fs/epoll/max_user_instances kernel parameter limited live epolls for each real user ID, and caused epoll_create() to fail with EMFILE on overrun.
SEE ALSO
close(2), epoll_ctl(2), epoll_wait(2), epoll(7)
369 - Linux cli command ustat
NAME 🖥️ ustat 🖥️
get filesystem statistics
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/types.h>
#include <unistd.h> /* libc[45] */
#include <ustat.h> /* glibc2 */
[[deprecated]] int ustat(dev_t dev, struct ustat *ubuf);
DESCRIPTION
ustat() returns information about a mounted filesystem. dev is a device number identifying a device containing a mounted filesystem. ubuf is a pointer to a ustat structure that contains the following members:
daddr_t f_tfree; /* Total free blocks */
ino_t f_tinode; /* Number of free inodes */
char f_fname[6]; /* Filsys name */
char f_fpack[6]; /* Filsys pack name */
The last two fields, f_fname and f_fpack, are not implemented and will always be filled with null bytes ('\0').
RETURN VALUE
On success, zero is returned and the ustat structure pointed to by ubuf will be filled in. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
ubuf points outside of your accessible address space.
EINVAL
dev does not refer to a device containing a mounted filesystem.
ENOSYS
The mounted filesystem referenced by dev does not support this operation, or the kernel is older than Linux 1.3.16.
STANDARDS
None.
HISTORY
SVr4. Removed in glibc 2.28.
ustat() is deprecated and has been provided only for compatibility. All new programs should use statfs(2) instead.
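As a rough sketch of the recommended replacement, the following code obtains the counterparts of f_tfree and f_tinode with statfs(2). Note that statfs(2) takes a pathname within the mounted filesystem rather than a device number. The structure fs_free and the function query_free() are hypothetical names used only for this illustration.

```c
#include <sys/vfs.h>

/* Hypothetical container for the two values that ustat() used to report. */
struct fs_free {
    long long blocks;   /* Free blocks (statfs f_bfree) */
    long long inodes;   /* Free inodes (statfs f_ffree) */
};

/* Hypothetical helper: fills *out for the filesystem containing path.
   Returns 0 on success, or -1 on error with errno set by statfs(). */
int query_free(const char *path, struct fs_free *out)
{
    struct statfs sfs;

    if (statfs(path, &sfs) == -1)
        return -1;
    out->blocks = (long long) sfs.f_bfree;
    out->inodes = (long long) sfs.f_ffree;
    return 0;
}
```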
HP-UX notes
The HP-UX version of the ustat structure has an additional field, f_blksize, that is unknown elsewhere. HP-UX warns: For some filesystems, the number of free inodes does not change. Such filesystems will return -1 in the field f_tinode. For some filesystems, inodes are dynamically allocated. Such filesystems will return the current number of free inodes.
SEE ALSO
stat(2), statfs(2)
370 - Linux cli command dup2
NAME 🖥️ dup2 🖥️
duplicate a file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int dup(int oldfd);
int dup2(int oldfd, int newfd);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h> /* Definition of O_* constants */
#include <unistd.h>
int dup3(int oldfd, int newfd, int flags);
DESCRIPTION
The dup() system call allocates a new file descriptor that refers to the same open file description as the descriptor oldfd. (For an explanation of open file descriptions, see open(2).) The new file descriptor number is guaranteed to be the lowest-numbered file descriptor that was unused in the calling process.
After a successful return, the old and new file descriptors may be used interchangeably. Since the two file descriptors refer to the same open file description, they share file offset and file status flags; for example, if the file offset is modified by using lseek(2) on one of the file descriptors, the offset is also changed for the other file descriptor.
The two file descriptors do not share file descriptor flags (the close-on-exec flag). The close-on-exec flag (FD_CLOEXEC; see fcntl(2)) for the duplicate descriptor is off.
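The sharing rules above can be checked with a short sketch such as the following. It is only an illustration: the helpers open_scratch_file(), dup_shares_offset(), and dup_clears_cloexec() are hypothetical names, and the scratch file location under /tmp is an assumption.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: opens an already-unlinked scratch file.
   Returns a file descriptor, or -1 on error. */
int open_scratch_file(void)
{
    char path[] = "/tmp/dup_demo_XXXXXX";
    int fd = mkstemp(path);

    if (fd != -1)
        unlink(path);       /* The file disappears once fd is closed. */
    return fd;
}

/* Returns 1 if an lseek() on fd is visible through a duplicate of fd,
   0 if it is not, and -1 on error. */
int dup_shares_offset(int fd)
{
    int copy = dup(fd);
    off_t seen;

    if (copy == -1)
        return -1;
    if (lseek(fd, 7, SEEK_SET) == -1) {
        close(copy);
        return -1;
    }
    seen = lseek(copy, 0, SEEK_CUR);    /* Query the offset via the copy. */
    close(copy);
    return seen == 7;
}

/* Returns 1 if a duplicate of fd has FD_CLOEXEC clear even though the
   flag is set on fd itself, 0 if not, and -1 on error. */
int dup_clears_cloexec(int fd)
{
    int copy, flags;

    if (fcntl(fd, F_SETFD, FD_CLOEXEC) == -1)
        return -1;
    copy = dup(fd);
    if (copy == -1)
        return -1;
    flags = fcntl(copy, F_GETFD);
    close(copy);
    return flags == -1 ? -1 : (flags & FD_CLOEXEC) == 0;
}
```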
dup2()
The dup2() system call performs the same task as dup(), but instead of using the lowest-numbered unused file descriptor, it uses the file descriptor number specified in newfd. In other words, the file descriptor newfd is adjusted so that it now refers to the same open file description as oldfd.
If the file descriptor newfd was previously open, it is closed before being reused; the close is performed silently (i.e., any errors during the close are not reported by dup2()).
The steps of closing and reusing the file descriptor newfd are performed atomically. This is important, because trying to implement equivalent functionality using close(2) and dup() would be subject to race conditions, whereby newfd might be reused between the two steps. Such reuse could happen because the main program is interrupted by a signal handler that allocates a file descriptor, or because a parallel thread allocates a file descriptor.
Note the following points:
If oldfd is not a valid file descriptor, then the call fails, and newfd is not closed.
If oldfd is a valid file descriptor, and newfd has the same value as oldfd, then dup2() does nothing, and returns newfd.
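The three behaviors described above (silent replacement of an open newfd, failure without closing newfd when oldfd is invalid, and the no-op when oldfd equals newfd) can be exercised with a sketch like the following. The function names redirect_fd(), fd_is_open(), and dup2_demo() are hypothetical; a pipe is used only because it makes the aliasing easy to observe.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Makes newfd refer to the same open file description as oldfd.
   Returns newfd, or -1 on error. */
int redirect_fd(int oldfd, int newfd)
{
    return dup2(oldfd, newfd);
}

/* Returns 1 if fd is an open file descriptor, 0 otherwise. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/* Exercises the three cases; returns 0 if all behave as described,
   or -1 otherwise. */
int dup2_demo(void)
{
    int p[2], target, ok;
    char c = 0;

    if (pipe(p) == -1)
        return -1;
    target = dup(p[0]);     /* Any open descriptor that will be replaced. */
    if (target == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* Case 1: target is closed silently and now aliases the write end. */
    ok = redirect_fd(p[1], target) == target
         && write(target, "x", 1) == 1
         && read(p[0], &c, 1) == 1 && c == 'x';

    /* Case 2: invalid oldfd; the call fails and target stays open. */
    errno = 0;
    ok = ok && redirect_fd(-1, target) == -1 && errno == EBADF
         && fd_is_open(target);

    /* Case 3: oldfd equals newfd; nothing happens, newfd is returned. */
    ok = ok && redirect_fd(target, target) == target;

    close(target);
    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```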
dup3()
dup3() is the same as dup2(), except that:
The caller can force the close-on-exec flag to be set for the new file descriptor by specifying O_CLOEXEC in flags. See the description of the same flag in open(2) for reasons why this may be useful.
If oldfd equals newfd, then dup3() fails with the error EINVAL.
RETURN VALUE
On success, these system calls return the new file descriptor. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
oldfd isn’t an open file descriptor.
EBADF
newfd is out of the allowed range for file descriptors (see the discussion of RLIMIT_NOFILE in getrlimit(2)).
EBUSY
(Linux only) This may be returned by dup2() or dup3() during a race condition with open(2) and dup().
EINTR
The dup2() or dup3() call was interrupted by a signal; see signal(7).
EINVAL
(dup3()) flags contain an invalid value.
EINVAL
(dup3()) oldfd was equal to newfd.
EMFILE
The per-process limit on the number of open file descriptors has been reached (see the discussion of RLIMIT_NOFILE in getrlimit(2)).
STANDARDS
dup()
dup2()
POSIX.1-2008.
dup3()
Linux.
HISTORY
dup()
dup2()
POSIX.1-2001, SVr4, 4.3BSD.
dup3()
Linux 2.6.27, glibc 2.9.
NOTES
The error returned by dup2() is different from that returned by fcntl(…, F_DUPFD, …) when newfd is out of range. On some systems, dup2() also sometimes returns EINVAL like F_DUPFD.
If newfd was open, any errors that would have been reported at close(2) time are lost. If this is of concern, then (unless the program is single-threaded and does not allocate file descriptors in signal handlers) the correct approach is not to close newfd before calling dup2(), because of the race condition described above. Instead, code like the following could be used:
    /* Obtain a duplicate of 'newfd' that can subsequently
       be used to check for close() errors; an EBADF error
       means that 'newfd' was not open. */
    tmpfd = dup(newfd);
    if (tmpfd == -1 && errno != EBADF) {
        /* Handle unexpected dup() error. */
    }

    /* Atomically duplicate 'oldfd' on 'newfd'. */
    if (dup2(oldfd, newfd) == -1) {
        /* Handle dup2() error. */
    }

    /* Now check for close() errors on the file originally
       referred to by 'newfd'. */
    if (tmpfd != -1) {
        if (close(tmpfd) == -1) {
            /* Handle errors from close. */
        }
    }
SEE ALSO
close(2), fcntl(2), open(2), pidfd_getfd(2)
371 - Linux cli command rt_sigreturn
NAME 🖥️ rt_sigreturn 🖥️
return from signal handler and cleanup stack frame
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
int sigreturn(...);
DESCRIPTION
If the Linux kernel determines that an unblocked signal is pending for a process, then, at the next transition back to user mode in that process (e.g., upon return from a system call or when the process is rescheduled onto the CPU), it creates a new frame on the user-space stack where it saves various pieces of process context (processor status word, registers, signal mask, and signal stack settings).
The kernel also arranges that, during the transition back to user mode, the signal handler is called, and that, upon return from the handler, control passes to a piece of user-space code commonly called the “signal trampoline”. The signal trampoline code in turn calls sigreturn().
This sigreturn() call undoes everything that was done (changing the process’s signal mask, switching signal stacks; see sigaltstack(2)) in order to invoke the signal handler. Using the information that was earlier saved on the user-space stack, sigreturn() restores the process’s signal mask, switches stacks, and restores the process’s context (processor flags and registers, including the stack pointer and instruction pointer), so that the process resumes execution at the point where it was interrupted by the signal.
RETURN VALUE
sigreturn() never returns.
VERSIONS
Many UNIX-type systems have a sigreturn() system call or near equivalent. However, this call is not specified in POSIX, and details of its behavior vary across systems.
STANDARDS
None.
NOTES
sigreturn() exists only to allow the implementation of signal handlers. It should never be called directly. (Indeed, a simple sigreturn() wrapper in the GNU C library simply returns -1, with errno set to ENOSYS.) Details of the arguments (if any) passed to sigreturn() vary depending on the architecture. (On some architectures, such as x86-64, sigreturn() takes no arguments, since all of the information that it requires is available in the stack frame that was previously created by the kernel on the user-space stack.)
Once upon a time, UNIX systems placed the signal trampoline code onto the user stack. Nowadays, pages of the user stack are protected so as to disallow code execution. Thus, on contemporary Linux systems, depending on the architecture, the signal trampoline code lives either in the vdso(7) or in the C library. In the latter case, the C library’s sigaction(2) wrapper function informs the kernel of the location of the trampoline code by placing its address in the sa_restorer field of the sigaction structure, and sets the SA_RESTORER flag in the sa_flags field.
The saved process context information is placed in a ucontext_t structure (see <sys/ucontext.h>). That structure is visible within the signal handler as the third argument of a handler established via sigaction(2) with the SA_SIGINFO flag.
On some other UNIX systems, the operation of the signal trampoline differs a little. In particular, on some systems, upon transitioning back to user mode, the kernel passes control to the trampoline (rather than the signal handler), and the trampoline code calls the signal handler (and then calls sigreturn() once the handler returns).
C library/kernel differences
The original Linux system call was named sigreturn(). However, with the addition of real-time signals in Linux 2.2, a new system call, rt_sigreturn(), was added to support an enlarged sigset_t type. The GNU C library hides these details from us, transparently employing rt_sigreturn() when the kernel provides it.
SEE ALSO
kill(2), restart_syscall(2), sigaltstack(2), signal(2), getcontext(3), signal(7), vdso(7)
372 - Linux cli command setfsgid
NAME 🖥️ setfsgid 🖥️
set group identity used for filesystem checks
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/fsuid.h>
[[deprecated]] int setfsgid(gid_t fsgid);
DESCRIPTION
On Linux, a process has both a filesystem group ID and an effective group ID. The (Linux-specific) filesystem group ID is used for permissions checking when accessing filesystem objects, while the effective group ID is used for some other kinds of permissions checks (see credentials(7)).
Normally, the value of the process’s filesystem group ID is the same as the value of its effective group ID. This is so, because whenever a process’s effective group ID is changed, the kernel also changes the filesystem group ID to be the same as the new value of the effective group ID. A process can cause the value of its filesystem group ID to diverge from its effective group ID by using setfsgid() to change its filesystem group ID to the value given in fsgid.
setfsgid() will succeed only if the caller is the superuser or if fsgid matches either the caller’s real group ID, effective group ID, saved set-group-ID, or current filesystem group ID.
RETURN VALUE
On both success and failure, this call returns the previous filesystem group ID of the caller.
STANDARDS
Linux.
HISTORY
Linux 1.2.
C library/kernel differences
In glibc 2.15 and earlier, when the wrapper for this system call determines that the argument can’t be passed to the kernel without integer truncation (because the kernel is old and does not support 32-bit group IDs), it will return -1 and set errno to EINVAL without attempting the system call.
NOTES
The filesystem group ID concept and the setfsgid() system call were invented for historical reasons that are no longer applicable on modern Linux kernels. See setfsuid(2) for a discussion of why the use of both setfsuid(2) and setfsgid() is nowadays unneeded.
The original Linux setfsgid() system call supported only 16-bit group IDs. Subsequently, Linux 2.4 added setfsgid32() supporting 32-bit IDs. The glibc setfsgid() wrapper function transparently deals with the variation across kernel versions.
BUGS
No error indications of any kind are returned to the caller, and the fact that both successful and unsuccessful calls return the same value makes it impossible to directly determine whether the call succeeded or failed. Instead, the caller must resort to looking at the return value from a further call such as setfsgid(-1) (which will always fail), in order to determine if a preceding call to setfsgid() changed the filesystem group ID. At the very least, EPERM should be returned when the call fails (because the caller lacks the CAP_SETGID capability).
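The workaround described above might be sketched as follows. The helper names current_fsgid(), set_fsgid_checked(), and fsgid_matches_egid() are hypothetical; the sketch relies only on the documented behavior that setfsgid(-1) always fails and returns the current filesystem group ID.

```c
#include <sys/fsuid.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the caller's current filesystem group ID without changing it:
   setfsgid(-1) always fails, and the call returns the previous value. */
gid_t current_fsgid(void)
{
    return (gid_t) setfsgid((gid_t) -1);
}

/* Tries to set the filesystem group ID, then verifies the result with a
   further call. Returns 0 if the ID is now gid, or -1 if it is not. */
int set_fsgid_checked(gid_t gid)
{
    setfsgid(gid);          /* The return value cannot signal failure. */
    return current_fsgid() == gid ? 0 : -1;
}

/* Returns 1 if the filesystem group ID equals the effective group ID,
   which is the normal state of a process. */
int fsgid_matches_egid(void)
{
    return current_fsgid() == getegid();
}
```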
SEE ALSO
kill(2), setfsuid(2), capabilities(7), credentials(7)
373 - Linux cli command shmget
NAME 🖥️ shmget 🖥️
allocates a System V shared memory segment
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/shm.h>
int shmget(key_t key, size_t size, int shmflg);
DESCRIPTION
shmget() returns the identifier of the System V shared memory segment associated with the value of the argument key. It may be used either to obtain the identifier of a previously created shared memory segment (when shmflg is zero and key does not have the value IPC_PRIVATE), or to create a new segment.
A new shared memory segment, with size equal to the value of size rounded up to a multiple of PAGE_SIZE, is created if key has the value IPC_PRIVATE or key isn’t IPC_PRIVATE, no shared memory segment corresponding to key exists, and IPC_CREAT is specified in shmflg.
If shmflg specifies both IPC_CREAT and IPC_EXCL and a shared memory segment already exists for key, then shmget() fails with errno set to EEXIST. (This is analogous to the effect of the combination O_CREAT | O_EXCL for open(2).)
The value shmflg is composed of:
IPC_CREAT
Create a new segment. If this flag is not used, then shmget() will find the segment associated with key and check to see if the user has permission to access the segment.
IPC_EXCL
This flag is used with IPC_CREAT to ensure that this call creates the segment. If the segment already exists, the call fails.
SHM_HUGETLB (since Linux 2.6)
Allocate the segment using “huge” pages. See the Linux kernel source file Documentation/admin-guide/mm/hugetlbpage.rst for further information.
SHM_HUGE_2MB
SHM_HUGE_1GB (since Linux 3.8)
Used in conjunction with SHM_HUGETLB to select alternative hugetlb page sizes (respectively, 2 MB and 1 GB) on systems that support multiple hugetlb page sizes.
More generally, the desired huge page size can be configured by encoding the base-2 logarithm of the desired page size in the six bits at the offset SHM_HUGE_SHIFT. Thus, the above two constants are defined as:
#define SHM_HUGE_2MB (21 << SHM_HUGE_SHIFT)
#define SHM_HUGE_1GB (30 << SHM_HUGE_SHIFT)
For some additional details, see the discussion of the similarly named constants in mmap(2).
SHM_NORESERVE (since Linux 2.6.15)
This flag serves the same purpose as the mmap(2) MAP_NORESERVE flag. Do not reserve swap space for this segment. When swap space is reserved, one has the guarantee that it is possible to modify the segment. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available. See also the discussion of the file /proc/sys/vm/overcommit_memory in proc(5).
In addition to the above flags, the least significant 9 bits of shmflg specify the permissions granted to the owner, group, and others. These bits have the same format, and the same meaning, as the mode argument of open(2). Presently, execute permissions are not used by the system.
When a new shared memory segment is created, its contents are initialized to zero values, and its associated data structure, shmid_ds (see shmctl(2)), is initialized as follows:
shm_perm.cuid and shm_perm.uid are set to the effective user ID of the calling process.
shm_perm.cgid and shm_perm.gid are set to the effective group ID of the calling process.
The least significant 9 bits of shm_perm.mode are set to the least significant 9 bits of shmflg.
shm_segsz is set to the value of size.
shm_lpid, shm_nattch, shm_atime, and shm_dtime are set to 0.
shm_ctime is set to the current time.
If the shared memory segment already exists, the permissions are verified, and a check is made to see if it is marked for destruction.
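The initialization rules above can be observed with a sketch like the following, which creates a private segment, reads back its shmid_ds structure with shmctl(2), and then marks the segment for destruction. The function name private_segment_demo() and the permission value 0600 are choices made for this illustration, and the sketch assumes that System V shared memory is available to the caller.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Creates a private segment of the given size and checks the fields that
   shmget() initializes. Returns 0 if they match the description, or -1
   on error or mismatch. */
int private_segment_demo(size_t size)
{
    struct shmid_ds ds;
    int id, ok;

    id = shmget(IPC_PRIVATE, size, 0600);
    if (id == -1)
        return -1;

    ok = shmctl(id, IPC_STAT, &ds) == 0
         && ds.shm_segsz == size                  /* Size as requested */
         && ds.shm_nattch == 0                    /* Nobody attached yet */
         && (ds.shm_perm.mode & 0777) == 0600;    /* Low 9 bits of shmflg */

    /* System V segments persist until removed, so always clean up. */
    shmctl(id, IPC_RMID, NULL);
    return ok ? 0 : -1;
}
```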
RETURN VALUE
On success, a valid shared memory identifier is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
The user does not have permission to access the shared memory segment, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EEXIST
IPC_CREAT and IPC_EXCL were specified in shmflg, but a shared memory segment already exists for key.
EINVAL
A new segment was to be created and size is less than SHMMIN or greater than SHMMAX.
EINVAL
A segment for the given key exists, but size is greater than the size of that segment.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOENT
No segment exists for the given key, and IPC_CREAT was not specified.
ENOMEM
No memory could be allocated for segment overhead.
ENOSPC
All possible shared memory IDs have been taken (SHMMNI), or allocating a segment of the requested size would cause the system to exceed the system-wide limit on shared memory (SHMALL).
EPERM
The SHM_HUGETLB flag was specified, but the caller was not privileged (did not have the CAP_IPC_LOCK capability) and is not a member of the sysctl_hugetlb_shm_group group; see the description of /proc/sys/vm/sysctl_hugetlb_shm_group in proc(5).
STANDARDS
POSIX.1-2008.
SHM_HUGETLB and SHM_NORESERVE are Linux extensions.
HISTORY
POSIX.1-2001, SVr4.
NOTES
IPC_PRIVATE isn’t a flag field but a key_t type. If this special value is used for key, the system call ignores all but the least significant 9 bits of shmflg and creates a new shared memory segment.
Shared memory limits
The following limits on shared memory segment resources affect the shmget() call:
SHMALL
System-wide limit on the total amount of shared memory, measured in units of the system page size.
On Linux, this limit can be read and modified via /proc/sys/kernel/shmall. Since Linux 3.16, the default value for this limit is:
ULONG_MAX - 2^24
The effect of this value (which is suitable for both 32-bit and 64-bit systems) is to impose no limitation on allocations. This value, rather than ULONG_MAX, was chosen as the default to prevent some cases where historical applications simply raised the existing limit without first checking its current value. Such applications would cause the value to overflow if the limit was set at ULONG_MAX.
From Linux 2.4 up to Linux 3.15, the default value for this limit was:
SHMMAX / PAGE_SIZE * (SHMMNI / 16)
If SHMMAX and SHMMNI were not modified, then multiplying the result of this formula by the page size (to get a value in bytes) yielded a value of 8 GB as the limit on the total memory used by all shared memory segments.
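The arithmetic behind the 8 GB figure can be reproduced with a small sketch. The function name old_shmall_default_pages() is hypothetical, and the page size of 4096 bytes used in the note below is an assumption (it is the common value on x86); the SHMMAX and SHMMNI defaults are the ones given on this page.

```c
#include <stdint.h>

/* Evaluates the pre-3.16 default SHMALL formula, in pages:
   SHMMAX / PAGE_SIZE * (SHMMNI / 16). */
uint64_t old_shmall_default_pages(uint64_t shmmax, uint64_t page_size,
                                  uint64_t shmmni)
{
    return shmmax / page_size * (shmmni / 16);
}
```

With SHMMAX = 0x2000000, SHMMNI = 4096, and the assumed page size of 4096, the formula gives 8192 * 256 = 2097152 pages, and 2097152 * 4096 bytes is 8 GiB.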
SHMMAX
Maximum size in bytes for a shared memory segment.
On Linux, this limit can be read and modified via /proc/sys/kernel/shmmax. Since Linux 3.16, the default value for this limit is:
ULONG_MAX - 2^24
The effect of this value (which is suitable for both 32-bit and 64-bit systems) is to impose no limitation on allocations. See the description of SHMALL for a discussion of why this default value (rather than ULONG_MAX) is used.
From Linux 2.2 up to Linux 3.15, the default value of this limit was 0x2000000 (32 MiB).
Because it is not possible to map just part of a shared memory segment, the amount of virtual memory places another limit on the maximum size of a usable segment: for example, on i386 the largest segments that can be mapped have a size of around 2.8 GB, and on x86-64 the limit is around 127 TB.
SHMMIN
Minimum size in bytes for a shared memory segment: implementation dependent (currently 1 byte, though PAGE_SIZE is the effective minimum size).
SHMMNI
System-wide limit on the number of shared memory segments. In Linux 2.2, the default value for this limit was 128; since Linux 2.4, the default value is 4096.
On Linux, this limit can be read and modified via /proc/sys/kernel/shmmni.
The implementation has no specific limits for the per-process maximum number of shared memory segments (SHMSEG).
Linux notes
Until Linux 2.3.30, Linux would return EIDRM for a shmget() on a shared memory segment scheduled for deletion.
BUGS
The name choice IPC_PRIVATE was perhaps unfortunate; IPC_NEW would more clearly show its function.
EXAMPLES
See shmop(2).
SEE ALSO
memfd_create(2), shmat(2), shmctl(2), shmdt(2), ftok(3), capabilities(7), shm_overview(7), sysvipc(7)
374 - Linux cli command link
NAME 🖥️ link 🖥️
make a new name for a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int link(const char *oldpath, const char *newpath);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int linkat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
linkat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
link() creates a new link (also known as a hard link) to an existing file.
If newpath exists, it will not be overwritten.
This new name may be used exactly as the old one for any operation; both names refer to the same file (and so have the same permissions and ownership) and it is impossible to tell which name was the “original”.
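A minimal sketch of these properties follows. It creates a file in a scratch directory, links it, confirms that both names refer to the same inode with a link count of 2, and confirms that a second link() to the existing name fails with EEXIST. The names same_file() and link_demo(), and the scratch location under /tmp, are assumptions made for this illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both names refer to the same file (same device and inode),
   0 if they do not, and -1 on error. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) == -1 || stat(b, &sb) == -1)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Returns 0 if link() behaves as described, or -1 otherwise. */
int link_demo(void)
{
    char dir[] = "/tmp/link_demo_XXXXXX";
    char oldp[64], newp[64];
    struct stat st;
    int fd, ok;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(oldp, sizeof oldp, "%s/old", dir);
    snprintf(newp, sizeof newp, "%s/new", dir);

    fd = open(oldp, O_CREAT | O_WRONLY | O_EXCL, 0600);
    if (fd != -1)
        close(fd);

    /* Both names now refer to one file whose link count is 2. */
    ok = fd != -1
         && link(oldp, newp) == 0
         && same_file(oldp, newp) == 1
         && stat(oldp, &st) == 0 && st.st_nlink == 2;

    /* An existing newpath is never overwritten. */
    errno = 0;
    ok = ok && link(oldp, newp) == -1 && errno == EEXIST;

    unlink(newp);
    unlink(oldp);
    rmdir(dir);
    return ok ? 0 : -1;
}
```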
linkat()
The linkat() system call operates in exactly the same way as link(), except for the differences described here.
If the pathname given in oldpath is relative, then it is interpreted relative to the directory referred to by the file descriptor olddirfd (rather than relative to the current working directory of the calling process, as is done by link() for a relative pathname).
If oldpath is relative and olddirfd is the special value AT_FDCWD, then oldpath is interpreted relative to the current working directory of the calling process (like link()).
If oldpath is absolute, then olddirfd is ignored.
The interpretation of newpath is as for oldpath, except that a relative pathname is interpreted relative to the directory referred to by the file descriptor newdirfd.
The following values can be bitwise ORed in flags:
AT_EMPTY_PATH (since Linux 2.6.39)
If oldpath is an empty string, create a link to the file referenced by olddirfd (which may have been obtained using the open(2) O_PATH flag). In this case, olddirfd can refer to any type of file except a directory. This will generally not work if the file has a link count of zero (files created with O_TMPFILE and without O_EXCL are an exception). The caller must have the CAP_DAC_READ_SEARCH capability in order to use this flag. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_FOLLOW (since Linux 2.6.18)
By default, linkat() does not dereference oldpath if it is a symbolic link (like link()). The flag AT_SYMLINK_FOLLOW can be specified in flags to cause oldpath to be dereferenced if it is a symbolic link. If procfs is mounted, this can be used as an alternative to AT_EMPTY_PATH, like this:
linkat(AT_FDCWD, "/proc/self/fd/<fd>", newdirfd,
newname, AT_SYMLINK_FOLLOW);
Before Linux 2.6.18, the flags argument was unused, and had to be specified as 0.
See openat(2) for an explanation of the need for linkat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Write access to the directory containing newpath is denied, or search permission is denied for one of the directories in the path prefix of oldpath or newpath. (See also path_resolution(7).)
EDQUOT
The user’s quota of disk blocks on the filesystem has been exhausted.
EEXIST
newpath already exists.
EFAULT
oldpath or newpath points outside your accessible address space.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving oldpath or newpath.
EMLINK
The file referred to by oldpath already has the maximum number of links to it. For example, on an ext4(5) filesystem that does not employ the dir_index feature, the limit on the number of hard links to a file is 65,000; on btrfs(5), the limit is 65,535 links.
ENAMETOOLONG
oldpath or newpath was too long.
ENOENT
A directory component in oldpath or newpath does not exist or is a dangling symbolic link.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The device containing the file has no room for the new directory entry.
ENOTDIR
A component used as a directory in oldpath or newpath is not, in fact, a directory.
EPERM
oldpath is a directory.
EPERM
The filesystem containing oldpath and newpath does not support the creation of hard links.
EPERM (since Linux 3.6)
The caller does not have permission to create a hard link to this file (see the description of /proc/sys/fs/protected_hardlinks in proc(5)).
EPERM
oldpath is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The file is on a read-only filesystem.
EXDEV
oldpath and newpath are not on the same mounted filesystem. (Linux permits a filesystem to be mounted at multiple points, but link() does not work across different mounts, even if the same filesystem is mounted on both.)
The following additional errors can occur for linkat():
EBADF
oldpath (newpath) is relative but olddirfd (newdirfd) is neither AT_FDCWD nor a valid file descriptor.
EINVAL
An invalid flag value was specified in flags.
ENOENT
AT_EMPTY_PATH was specified in flags, but the caller did not have the CAP_DAC_READ_SEARCH capability.
ENOENT
An attempt was made to link to the /proc/self/fd/NN file corresponding to a file descriptor created with
open(path, O_TMPFILE | O_EXCL, mode);
See open(2).
ENOENT
An attempt was made to link to a /proc/self/fd/NN file corresponding to a file that has been deleted.
ENOENT
oldpath is a relative pathname and olddirfd refers to a directory that has been deleted, or newpath is a relative pathname and newdirfd refers to a directory that has been deleted.
ENOTDIR
oldpath is relative and olddirfd is a file descriptor referring to a file other than a directory; or similar for newpath and newdirfd.
EPERM
AT_EMPTY_PATH was specified in flags, oldpath is an empty string, and olddirfd refers to a directory.
VERSIONS
POSIX.1-2001 says that link() should dereference oldpath if it is a symbolic link. However, since Linux 2.0, Linux does not do so: if oldpath is a symbolic link, then newpath is created as a (hard) link to the same symbolic link file (i.e., newpath becomes a symbolic link to the same file that oldpath refers to). Some other implementations behave in the same manner as Linux. POSIX.1-2008 changes the specification of link(), making it implementation-dependent whether or not oldpath is dereferenced if it is a symbolic link. For precise control over the treatment of symbolic links when creating a link, use linkat().
glibc
On older kernels where linkat() is unavailable, the glibc wrapper function falls back to the use of link(), unless the AT_SYMLINK_FOLLOW flag is specified. When oldpath and newpath are relative pathnames, glibc constructs pathnames based on the symbolic links in /proc/self/fd that correspond to the olddirfd and newdirfd arguments.
STANDARDS
link()
POSIX.1-2008.
HISTORY
link()
SVr4, 4.3BSD, POSIX.1-2001 (but see VERSIONS).
linkat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Hard links, as created by link(), cannot span filesystems. Use symlink(2) if this is required.
BUGS
On NFS filesystems, the return code may be wrong in case the NFS server performs the link creation and dies before it can say so. Use stat(2) to find out if the link got created.
SEE ALSO
ln(1), open(2), rename(2), stat(2), symlink(2), unlink(2), path_resolution(7), symlink(7)
375 - Linux cli command mkdir
NAME 🖥️ mkdir 🖥️
create a directory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int mkdir(const char *pathname, mode_t mode);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int mkdirat(int dirfd, const char *pathname, mode_t mode);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
mkdirat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
mkdir() attempts to create a directory named pathname.
The argument mode specifies the mode for the new directory (see inode(7)). It is modified by the process’s umask in the usual way: in the absence of a default ACL, the mode of the created directory is (mode & ~umask & 0777). Whether other mode bits are honored for the created directory depends on the operating system. For Linux, see VERSIONS below.
The newly created directory will be owned by the effective user ID of the process. If the directory containing the file has the set-group-ID bit set, or if the filesystem is mounted with BSD group semantics (mount -o bsdgroups or, synonymously, mount -o grpid), the new directory will inherit the group ownership from its parent; otherwise it will be owned by the effective group ID of the process.
If the parent directory has the set-group-ID bit set, then so will the newly created directory.
mkdirat()
The mkdirat() system call operates in exactly the same way as mkdir(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by mkdir() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like mkdir()).
If pathname is absolute, then dirfd is ignored.
See openat(2) for an explanation of the need for mkdirat().
RETURN VALUE
mkdir() and mkdirat() return zero on success. On error, -1 is returned and errno is set to indicate the error.
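The following sketch shows mkdir() and mkdirat() used together, under the assumption that the caller wants a directory and one subdirectory inside it. make_nested() is a hypothetical helper introduced only for this example.

```c
#define _POSIX_C_SOURCE 200809L   /* for mkdirat() and O_DIRECTORY */
#include <errno.h>
#include <fcntl.h>      /* open(), O_DIRECTORY */
#include <sys/stat.h>   /* mkdir(), mkdirat() */
#include <unistd.h>     /* close() */

/* make_nested() is a hypothetical helper, not part of any library.
   It creates `parent` with mkdir(), then creates `child` inside it
   with mkdirat(). Returns 0 on success; on failure returns -1 with
   errno set by the call that failed. */
int make_nested(const char *parent, const char *child)
{
    /* Requested mode 0755; the resulting mode is (mode & ~umask). */
    if (mkdir(parent, 0755) == -1)
        return -1;

    int dirfd = open(parent, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    /* child is a relative pathname, so it is resolved relative to
       dirfd rather than the current working directory. */
    int rc = mkdirat(dirfd, child, 0700);
    int saved_errno = errno;

    close(dirfd);
    errno = saved_errno;   /* keep the error from mkdirat(), if any */
    return rc;
}
```

Calling make_nested("parent", "child") a second time fails in mkdir() with EEXIST, since pathname already exists.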
ERRORS
EACCES
The parent directory does not allow write permission to the process, or one of the directories in pathname did not allow search permission. (See also path_resolution(7).)
EBADF
(mkdirat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EDQUOT
The user’s quota of disk blocks or inodes on the filesystem has been exhausted.
EEXIST
pathname already exists (not necessarily as a directory). This includes the case where pathname is a symbolic link, dangling or not.
EFAULT
pathname points outside your accessible address space.
EINVAL
The final component (“basename”) of the new directory’s pathname is invalid (e.g., it contains characters not permitted by the underlying filesystem).
ELOOP
Too many symbolic links were encountered in resolving pathname.
EMLINK
The number of links to the parent directory would exceed LINK_MAX.
ENAMETOOLONG
pathname was too long.
ENOENT
A directory component in pathname does not exist or is a dangling symbolic link.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The device containing pathname has no room for the new directory.
ENOSPC
The new directory cannot be created because the user’s disk quota is exhausted.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory.
ENOTDIR
(mkdirat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
The filesystem containing pathname does not support the creation of directories.
EROFS
pathname refers to a file on a read-only filesystem.
VERSIONS
Under Linux, apart from the permission bits, the S_ISVTX mode bit is also honored.
glibc notes
On older kernels where mkdirat() is unavailable, the glibc wrapper function falls back to the use of mkdir(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
STANDARDS
POSIX.1-2008.
HISTORY
mkdir()
SVr4, BSD, POSIX.1-2001.
mkdirat()
Linux 2.6.16, glibc 2.4.
NOTES
There are many infelicities in the protocol underlying NFS. Some of these affect mkdir().
SEE ALSO
mkdir(1), chmod(2), chown(2), mknod(2), mount(2), rmdir(2), stat(2), umask(2), unlink(2), acl(5), path_resolution(7)
376 - Linux cli command getegid32
NAME 🖥️ getegid32 🖥️
get group identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
gid_t getgid(void);
gid_t getegid(void);
DESCRIPTION
getgid() returns the real group ID of the calling process.
getegid() returns the effective group ID of the calling process.
ERRORS
These functions are always successful and never modify errno.
VERSIONS
On Alpha, instead of a pair of getgid() and getegid() system calls, a single getxgid() system call is provided, which returns a pair of real and effective GIDs. The glibc getgid() and getegid() wrapper functions transparently deal with this. See syscall(2) for details regarding register mapping.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD.
The original Linux getgid() and getegid() system calls supported only 16-bit group IDs. Subsequently, Linux 2.4 added getgid32() and getegid32(), supporting 32-bit IDs. The glibc getgid() and getegid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getresgid(2), setgid(2), setregid(2), credentials(7)
377 - Linux cli command setresgid32
NAME 🖥️ setresgid32 🖥️
set real, effective, and saved user or group ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
int setresuid(uid_t ruid, uid_t euid, uid_t suid);
int setresgid(gid_t rgid, gid_t egid, gid_t sgid);
DESCRIPTION
setresuid() sets the real user ID, the effective user ID, and the saved set-user-ID of the calling process.
An unprivileged process may change its real UID, effective UID, and saved set-user-ID, each to one of: the current real UID, the current effective UID, or the current saved set-user-ID.
A privileged process (on Linux, one having the CAP_SETUID capability) may set its real UID, effective UID, and saved set-user-ID to arbitrary values.
If one of the arguments equals -1, the corresponding value is not changed.
Regardless of what changes are made to the real UID, effective UID, and saved set-user-ID, the filesystem UID is always set to the same value as the (possibly new) effective UID.
Completely analogously, setresgid() sets the real GID, effective GID, and saved set-group-ID of the calling process (and always modifies the filesystem GID to be the same as the effective GID), with the same restrictions for unprivileged processes.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where setresuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from setresuid().
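A minimal sketch of checked use follows. It assumes the goal is to set all three user IDs and all three group IDs to the current real IDs, which an unprivileged process is permitted to do. drop_to_real_ids() is a hypothetical helper, and supplementary groups are deliberately left untouched.

```c
#define _GNU_SOURCE     /* See feature_test_macros(7) */
#include <unistd.h>

/* drop_to_real_ids() is a hypothetical helper, not a library call.
   It sets the real, effective, and saved IDs to the current real
   IDs and verifies the result. Supplementary groups are not changed.
   Returns 0 on success, -1 on failure. */
int drop_to_real_ids(void)
{
    uid_t ruid = getuid();
    gid_t rgid = getgid();

    /* Change the group IDs first: after the effective UID changes,
       the process may no longer be permitted to change its GIDs. */
    if (setresgid(rgid, rgid, rgid) == -1)
        return -1;
    if (setresuid(ruid, ruid, ruid) == -1)
        return -1;

    /* Never assume success; read the IDs back and compare. */
    uid_t r, e, s;
    gid_t rg, eg, sg;
    if (getresuid(&r, &e, &s) == -1 || getresgid(&rg, &eg, &sg) == -1)
        return -1;
    if (r != ruid || e != ruid || s != ruid)
        return -1;
    if (rg != rgid || eg != rgid || sg != rgid)
        return -1;
    return 0;
}
```

The read-back step is extra caution on top of the return-value checks; whether it is worth the cost is a design choice for the caller.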
ERRORS
EAGAIN
The call would change the caller’s real UID (i.e., ruid does not match the caller’s real UID), but there was a temporary failure allocating the necessary kernel data structures.
EAGAIN
ruid does not match the caller’s real UID and this call would bring the number of processes belonging to the real user ID ruid over the caller’s RLIMIT_NPROC resource limit. Since Linux 3.1, this error case no longer occurs (but robust applications should check for this error); see the description of EAGAIN in execve(2).
EINVAL
One or more of the target user or group IDs is not valid in this user namespace.
EPERM
The calling process is not privileged (did not have the necessary capability in its user namespace) and tried to change the IDs to values that are not permitted. For setresuid(), the necessary capability is CAP_SETUID; for setresgid(), it is CAP_SETGID.
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including those for setresuid() and setresgid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
None.
HISTORY
Linux 2.1.44, glibc 2.3.2. HP-UX, FreeBSD.
The original Linux setresuid() and setresgid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added setresuid32() and setresgid32(), supporting 32-bit IDs. The glibc setresuid() and setresgid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getresuid(2), getuid(2), setfsgid(2), setfsuid(2), setreuid(2), setuid(2), capabilities(7), credentials(7), user_namespaces(7)
378 - Linux cli command sched_getparam
NAME 🖥️ sched_getparam 🖥️
set and get scheduling parameters
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sched.h>
int sched_setparam(pid_t pid, const struct sched_param *param);
int sched_getparam(pid_t pid, struct sched_param *param);
struct sched_param {
...
int sched_priority;
...
};
DESCRIPTION
sched_setparam() sets the scheduling parameters associated with the scheduling policy for the thread whose thread ID is specified in pid. If pid is zero, then the parameters of the calling thread are set. The interpretation of the argument param depends on the scheduling policy of the thread identified by pid. See sched(7) for a description of the scheduling policies supported under Linux.
sched_getparam() retrieves the scheduling parameters for the thread identified by pid. If pid is zero, then the parameters of the calling thread are retrieved.
sched_setparam() checks the validity of param for the scheduling policy of the thread. The value param->sched_priority must lie within the range given by sched_get_priority_min(2) and sched_get_priority_max(2).
For a discussion of the privileges and resource limits related to scheduling priority and policy, see sched(7).
POSIX systems on which sched_setparam() and sched_getparam() are available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
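Retrieving the priority of a thread can be sketched as follows. get_priority() is a hypothetical wrapper written for this example; it only reads the parameters and does not attempt to change them.

```c
#include <sched.h>       /* sched_getparam(), struct sched_param */
#include <sys/types.h>   /* pid_t */

/* get_priority() is a hypothetical helper. It stores the scheduling
   priority of thread `pid` (0 means the calling thread) in *prio.
   Returns 0 on success, -1 on error (errno is set). */
int get_priority(pid_t pid, int *prio)
{
    struct sched_param param;

    if (sched_getparam(pid, &param) == -1)
        return -1;

    *prio = param.sched_priority;
    return 0;
}
```

A call such as get_priority(0, &prio) queries the calling thread; passing a negative pid fails, as listed under ERRORS.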
RETURN VALUE
On success, sched_setparam() and sched_getparam() return 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
Invalid arguments: param is NULL or pid is negative.
EINVAL
(sched_setparam()) The argument param does not make sense for the current scheduling policy.
EPERM
(sched_setparam()) The caller does not have appropriate privileges (Linux: does not have the CAP_SYS_NICE capability).
ESRCH
The thread whose ID is pid could not be found.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
SEE ALSO
getpriority(2), gettid(2), nice(2), sched_get_priority_max(2), sched_get_priority_min(2), sched_getaffinity(2), sched_getscheduler(2), sched_setaffinity(2), sched_setattr(2), sched_setscheduler(2), setpriority(2), capabilities(7), sched(7)
379 - Linux cli command unlink
NAME 🖥️ unlink 🖥️
delete a name and possibly the file it refers to
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int unlink(const char *pathname);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int unlinkat(int dirfd, const char *pathname, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
unlinkat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
unlink() deletes a name from the filesystem. If that name was the last link to a file and no processes have the file open, the file is deleted and the space it was using is made available for reuse.
If the name was the last link to a file but any processes still have the file open, the file will remain in existence until the last file descriptor referring to it is closed.
If the name referred to a symbolic link, the link is removed.
If the name referred to a socket, FIFO, or device, the name for it is removed but processes which have the object open may continue to use it.
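The behavior for a file that is still open can be demonstrated with a short sketch. unlink_while_open() is a hypothetical helper; it assumes path does not already exist and that the caller may create it.

```c
#include <fcntl.h>    /* open() */
#include <string.h>   /* memcmp() */
#include <unistd.h>   /* unlink(), access(), write(), lseek(), read() */

/* unlink_while_open() is a hypothetical helper. It creates `path`,
   removes its name while a descriptor is still open, and checks that
   the descriptor remains usable. Returns 0 if so, -1 otherwise. */
int unlink_while_open(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;

    if (unlink(path) == -1) {
        close(fd);
        return -1;
    }

    /* The name is gone, so a lookup by pathname now fails... */
    int name_gone = (access(path, F_OK) == -1);

    /* ...but the file itself still exists while fd is open. */
    const char msg[] = "still here";
    char buf[sizeof msg] = { 0 };
    int ok = name_gone
        && write(fd, msg, sizeof msg) == (ssize_t) sizeof msg
        && lseek(fd, 0, SEEK_SET) == 0
        && read(fd, buf, sizeof buf) == (ssize_t) sizeof buf
        && memcmp(buf, msg, sizeof msg) == 0;

    close(fd);   /* last descriptor closed: the file is now deleted */
    return ok ? 0 : -1;
}
```

This pattern is sometimes used for temporary files that should disappear automatically when the process exits.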
unlinkat()
The unlinkat() system call operates in exactly the same way as either unlink() or rmdir(2) (depending on whether or not flags includes the AT_REMOVEDIR flag) except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by unlink() and rmdir(2) for a relative pathname).
If the pathname given in pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like unlink() and rmdir(2)).
If the pathname given in pathname is absolute, then dirfd is ignored.
flags is a bit mask that can either be specified as 0, or by ORing together flag values that control the operation of unlinkat(). Currently, only one such flag is defined:
AT_REMOVEDIR
By default, unlinkat() performs the equivalent of unlink() on pathname. If the AT_REMOVEDIR flag is specified, it performs the equivalent of rmdir(2) on pathname.
See openat(2) for an explanation of the need for unlinkat().
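Both modes of unlinkat() appear in the sketch below, which removes one file from a directory and then the directory itself. remove_file_then_dir() is a hypothetical helper, and it assumes the directory contains nothing except the named file.

```c
#define _POSIX_C_SOURCE 200809L   /* for unlinkat() and O_DIRECTORY */
#include <errno.h>
#include <fcntl.h>    /* open(), AT_FDCWD, AT_REMOVEDIR, O_DIRECTORY */
#include <unistd.h>   /* unlinkat(), close() */

/* remove_file_then_dir() is a hypothetical helper. It removes the
   file `name` inside directory `dir`, then removes `dir` itself.
   Returns 0 on success, -1 on error (errno is set). */
int remove_file_then_dir(const char *dir, const char *name)
{
    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1)
        return -1;

    /* flags == 0: equivalent of unlink(), resolved relative to dirfd. */
    int rc = unlinkat(dirfd, name, 0);
    int saved_errno = errno;
    close(dirfd);
    if (rc == -1) {
        errno = saved_errno;
        return -1;
    }

    /* AT_REMOVEDIR: equivalent of rmdir(2). AT_FDCWD makes a relative
       `dir` resolve against the current working directory. */
    return unlinkat(AT_FDCWD, dir, AT_REMOVEDIR);
}
```

Calling unlinkat() on the directory with flags set to 0 instead fails with EISDIR on Linux, as described under ERRORS.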
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Write access to the directory containing pathname is not allowed for the process’s effective UID, or one of the directories in pathname did not allow search permission. (See also path_resolution(7).)
EBUSY
The file pathname cannot be unlinked because it is being used by the system or another process; for example, it is a mount point or the NFS client software created it to represent an active but otherwise nameless inode (“NFS silly renamed”).
EFAULT
pathname points outside your accessible address space.
EIO
An I/O error occurred.
EISDIR
pathname refers to a directory. (This is the non-POSIX value returned since Linux 2.1.132.)
ELOOP
Too many symbolic links were encountered in translating pathname.
ENAMETOOLONG
pathname was too long.
ENOENT
A component in pathname does not exist or is a dangling symbolic link, or pathname is empty.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory.
EPERM
The system does not allow unlinking of directories, or unlinking of directories requires privileges that the calling process doesn’t have. (This is the POSIX prescribed error return; as noted above, Linux returns EISDIR for this case.)
EPERM (Linux only)
The filesystem does not allow unlinking of files.
EPERM or EACCES
The directory containing pathname has the sticky bit (S_ISVTX) set and the process’s effective UID is neither the UID of the file to be deleted nor that of the directory containing it, and the process is not privileged (Linux: does not have the CAP_FOWNER capability).
EPERM
The file to be unlinked is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
pathname refers to a file on a read-only filesystem.
The same errors that occur for unlink() and rmdir(2) can also occur for unlinkat(). The following additional errors can occur for unlinkat():
EBADF
pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EINVAL
An invalid flag value was specified in flags.
EISDIR
pathname refers to a directory, and AT_REMOVEDIR was not specified in flags.
ENOTDIR
pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
STANDARDS
POSIX.1-2008.
HISTORY
unlink()
SVr4, 4.3BSD, POSIX.1-2001.
unlinkat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
glibc
On older kernels where unlinkat() is unavailable, the glibc wrapper function falls back to the use of unlink() or rmdir(2). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
BUGS
Infelicities in the protocol underlying NFS can cause the unexpected disappearance of files which are still being used.
SEE ALSO
rm(1), unlink(1), chmod(2), link(2), mknod(2), open(2), rename(2), rmdir(2), mkfifo(3), remove(3), path_resolution(7), symlink(7)
380 - Linux cli command free_hugepages
NAME 🖥️ free_hugepages 🖥️
allocate or free huge pages
SYNOPSIS
void *syscall(SYS_alloc_hugepages, int key, void addr[.len], size_t len,
int prot, int flag);
int syscall(SYS_free_hugepages, void *addr);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
The system calls alloc_hugepages() and free_hugepages() were introduced in Linux 2.5.36 and removed again in Linux 2.5.54. They existed only on i386 and ia64 (when built with CONFIG_HUGETLB_PAGE). In Linux 2.4.20, the syscall numbers exist, but the calls fail with the error ENOSYS.
On i386 the memory management hardware knows about ordinary pages (4 KiB) and huge pages (2 or 4 MiB). Similarly ia64 knows about huge pages of several sizes. These system calls serve to map huge pages into the process’s memory or to free them again. Huge pages are locked into memory, and are not swapped.
The key argument is an identifier. When zero the pages are private, and not inherited by children. When positive the pages are shared with other applications using the same key, and inherited by child processes.
The addr argument of free_hugepages() tells which page is being freed: it was the return value of a call to alloc_hugepages(). (The memory is first actually freed when all users have released it.) The addr argument of alloc_hugepages() is a hint that the kernel may or may not follow. Addresses must be properly aligned.
The len argument is the length of the required segment. It must be a multiple of the huge page size.
The prot argument specifies the memory protection of the segment. It is one of PROT_READ, PROT_WRITE, PROT_EXEC.
The flag argument is ignored, unless key is positive. In that case, if flag is IPC_CREAT, then a new huge page segment is created when none with the given key existed. If this flag is not set, then ENOENT is returned when no segment with the given key exists.
RETURN VALUE
On success, alloc_hugepages() returns the allocated virtual address, and free_hugepages() returns zero. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
ENOSYS
The system call is not supported on this kernel.
FILES
/proc/sys/vm/nr_hugepages
Number of configured hugetlb pages. This can be read and written.
/proc/meminfo
Gives info on the number of configured hugetlb pages and on their size in the three variables HugePages_Total, HugePages_Free, Hugepagesize.
STANDARDS
Linux on Intel processors.
HISTORY
These system calls are gone; they existed only in Linux 2.5.36 through to Linux 2.5.54.
NOTES
Now the hugetlbfs filesystem can be used instead. Memory backed by huge pages (if the CPU supports them) is obtained by using mmap(2) to map files in this virtual filesystem.
The maximal number of huge pages can be specified using the hugepages= boot parameter.
381 - Linux cli command iopl
NAME 🖥️ iopl 🖥️
change I/O privilege level
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
[[deprecated]] int iopl(int level);
DESCRIPTION
iopl() changes the I/O privilege level of the calling thread, as specified by the two least significant bits in level.
The I/O privilege level for a normal thread is 0. Permissions are inherited from parents to children.
This call is deprecated, is significantly slower than ioperm(2), and is only provided for older X servers which require access to all 65536 I/O ports. It is mostly for the i386 architecture. On many other architectures it does not exist or will always return an error.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
level is greater than 3.
ENOSYS
This call is unimplemented.
EPERM
The calling thread has insufficient privilege to call iopl(); the CAP_SYS_RAWIO capability is required to raise the I/O privilege level above its current value.
VERSIONS
glibc2 has a prototype both in <sys/io.h> and in <sys/perm.h>. Avoid the latter; it is available on i386 only.
STANDARDS
Linux.
HISTORY
Prior to Linux 5.5, iopl() allowed the thread to disable interrupts while running at a higher I/O privilege level. This will probably crash the system, and is not recommended.
Prior to Linux 3.7, on some architectures (such as i386), permissions were inherited by the child produced by fork(2) and were preserved across execve(2). This behavior was inadvertently changed in Linux 3.7, and won’t be reinstated.
SEE ALSO
ioperm(2), outb(2), capabilities(7)
382 - Linux cli command waitid
NAME 🖥️ waitid 🖥️
wait for process to change state
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/wait.h>
pid_t wait(int *_Nullable wstatus);
pid_t waitpid(pid_t pid, int *_Nullable wstatus, int options);
int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
/* This is the glibc and POSIX interface; see
NOTES for information on the raw system call. */
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
waitid():
Since glibc 2.26:
_XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200809L
glibc 2.25 and earlier:
_XOPEN_SOURCE
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
All of these system calls are used to wait for state changes in a child of the calling process, and obtain information about the child whose state has changed. A state change is considered to be: the child terminated; the child was stopped by a signal; or the child was resumed by a signal. In the case of a terminated child, performing a wait allows the system to release the resources associated with the child; if a wait is not performed, then the terminated child remains in a “zombie” state (see NOTES below).
If a child has already changed state, then these calls return immediately. Otherwise, they block until either a child changes state or a signal handler interrupts the call (assuming that system calls are not automatically restarted using the SA_RESTART flag of sigaction(2)). In the remainder of this page, a child whose state has changed and which has not yet been waited upon by one of these system calls is termed waitable.
wait() and waitpid()
The wait() system call suspends execution of the calling thread until one of its children terminates. The call wait(&wstatus) is equivalent to:
waitpid(-1, &wstatus, 0);
The waitpid() system call suspends execution of the calling thread until a child specified by the pid argument has changed state. By default, waitpid() waits only for terminated children, but this behavior is modifiable via the options argument, as described below.
The value of pid can be:
< -1
meaning wait for any child process whose process group ID is equal to the absolute value of pid.
-1
meaning wait for any child process.
0
meaning wait for any child process whose process group ID is equal to that of the calling process at the time of the call to waitpid().
> 0
meaning wait for the child whose process ID is equal to the value of pid.
The value of options is an OR of zero or more of the following constants:
WNOHANG
return immediately if no child has exited.
WUNTRACED
also return if a child has stopped (but not traced via ptrace(2)). Status for traced children which have stopped is provided even if this option is not specified.
WCONTINUED (since Linux 2.6.10)
also return if a stopped child has been resumed by delivery of SIGCONT.
(For Linux-only options, see below.)
If wstatus is not NULL, wait() and waitpid() store status information in the int to which it points. This integer can be inspected with the following macros (which take the integer itself as an argument, not a pointer to it, as is done in wait() and waitpid()!):
WIFEXITED(wstatus)
returns true if the child terminated normally, that is, by calling exit(3) or _exit(2), or by returning from main().
WEXITSTATUS(wstatus)
returns the exit status of the child. This consists of the least significant 8 bits of the status argument that the child specified in a call to exit(3) or _exit(2) or as the argument for a return statement in main(). This macro should be employed only if WIFEXITED returned true.
WIFSIGNALED(wstatus)
returns true if the child process was terminated by a signal.
WTERMSIG(wstatus)
returns the number of the signal that caused the child process to terminate. This macro should be employed only if WIFSIGNALED returned true.
WCOREDUMP(wstatus)
returns true if the child produced a core dump (see core(5)). This macro should be employed only if WIFSIGNALED returned true.
This macro is not specified in POSIX.1-2001 and is not available on some UNIX implementations (e.g., AIX, SunOS). Therefore, enclose its use inside #ifdef WCOREDUMP … #endif.
WIFSTOPPED(wstatus)
returns true if the child process was stopped by delivery of a signal; this is possible only if the call was done using WUNTRACED or when the child is being traced (see ptrace(2)).
WSTOPSIG(wstatus)
returns the number of the signal which caused the child to stop. This macro should be employed only if WIFSTOPPED returned true.
WIFCONTINUED(wstatus)
(since Linux 2.6.10) returns true if the child process was resumed by delivery of SIGCONT.
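The macros above can be put together in a short sketch that forks a child, waits for it, and decodes the status. run_child_and_wait() is a hypothetical helper written for this illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>   /* waitpid(), WIFEXITED(), WEXITSTATUS() */
#include <unistd.h>     /* fork(), _exit() */

/* run_child_and_wait() is a hypothetical helper. It forks a child
   that immediately calls _exit(code), waits for that child, and
   returns the exit status seen by the parent. Returns -1 on error
   or if the child did not terminate normally. */
int run_child_and_wait(int code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);            /* child */

    int wstatus;
    pid_t w;
    do {
        /* Retry if a signal handler interrupted the call. */
        w = waitpid(pid, &wstatus, 0);
    } while (w == -1 && errno == EINTR);
    if (w == -1)
        return -1;

    /* The macros take the integer itself, not a pointer to it. */
    if (WIFEXITED(wstatus))
        return WEXITSTATUS(wstatus);

    return -1;                  /* e.g., WIFSIGNALED(wstatus) was true */
}
```

Because only the least significant 8 bits of the status are reported, run_child_and_wait(0x107) yields 7, the same as run_child_and_wait(7).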
waitid()
The waitid() system call (available since Linux 2.6.9) provides more precise control over which child state changes to wait for.
The idtype and id arguments select the child(ren) to wait for, as follows:
idtype == P_PID
Wait for the child whose process ID matches id.
idtype == P_PIDFD (since Linux 5.4)
Wait for the child referred to by the PID file descriptor specified in id. (See pidfd_open(2) for further information on PID file descriptors.)
idtype == P_PGID
Wait for any child whose process group ID matches id. Since Linux 5.4, if id is zero, then wait for any child that is in the same process group as the caller’s process group at the time of the call.
idtype == P_ALL
Wait for any child; id is ignored.
The child state changes to wait for are specified by ORing one or more of the following flags in options:
WEXITED
Wait for children that have terminated.
WSTOPPED
Wait for children that have been stopped by delivery of a signal.
WCONTINUED
Wait for (previously stopped) children that have been resumed by delivery of SIGCONT.
The following flags may additionally be ORed in options:
WNOHANG
As for waitpid().
WNOWAIT
Leave the child in a waitable state; a later wait call can be used to again retrieve the child status information.
Upon successful return, waitid() fills in the following fields of the siginfo_t structure pointed to by infop:
si_pid
The process ID of the child.
si_uid
The real user ID of the child. (This field is not set on most other implementations.)
si_signo
Always set to SIGCHLD.
si_status
Either the exit status of the child, as given to _exit(2) (or exit(3)), or the signal that caused the child to terminate, stop, or continue. The si_code field can be used to determine how to interpret this field.
si_code
Set to one of: CLD_EXITED (child called _exit(2)); CLD_KILLED (child killed by signal); CLD_DUMPED (child killed by signal, and dumped core); CLD_STOPPED (child stopped by signal); CLD_TRAPPED (traced child has trapped); or CLD_CONTINUED (child continued by SIGCONT).
If WNOHANG was specified in options and there were no children in a waitable state, then waitid() returns 0 immediately and the state of the siginfo_t structure pointed to by infop depends on the implementation. To (portably) distinguish this case from that where a child was in a waitable state, zero out the si_pid field before the call and check for a nonzero value in this field after the call returns.
POSIX.1-2008 Technical Corrigendum 1 (2013) adds the requirement that when WNOHANG is specified in options and there were no children in a waitable state, then waitid() should zero out the si_pid and si_signo fields of the structure. On Linux and other implementations that adhere to this requirement, it is not necessary to zero out the si_pid field before calling waitid(). However, not all implementations follow the POSIX.1 specification on this point.
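The portable WNOHANG technique described above can be sketched as follows. poll_child() is a hypothetical helper; it zeroes the whole siginfo_t structure (and thus si_pid) before the call, so the check works whether or not the implementation follows the Technical Corrigendum.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>     /* siginfo_t, CLD_* codes */
#include <string.h>     /* memset() */
#include <sys/types.h>
#include <sys/wait.h>   /* waitid(), P_PID, WEXITED, WNOHANG */

/* poll_child() is a hypothetical helper. It checks, without
   blocking, whether child `pid` has terminated.
   Returns 1 if it has (*info is filled in), 0 if the child is not
   yet waitable, and -1 on error (errno is set). */
int poll_child(pid_t pid, siginfo_t *info)
{
    /* Zero si_pid first: waitid() returns 0 both when a child was
       reaped and when WNOHANG found nothing to report. */
    memset(info, 0, sizeof *info);

    if (waitid(P_PID, (id_t) pid, info, WEXITED | WNOHANG) == -1)
        return -1;

    return info->si_pid != 0;
}
```

After a return value of 1, info->si_code and info->si_status describe how the child terminated, as listed above.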
RETURN VALUE
wait(): on success, returns the process ID of the terminated child; on failure, -1 is returned.
waitpid(): on success, returns the process ID of the child whose state has changed; if WNOHANG was specified and one or more child(ren) specified by pid exist, but have not yet changed state, then 0 is returned. On failure, -1 is returned.
waitid(): returns 0 on success or if WNOHANG was specified and no child(ren) specified by id has yet changed state; on failure, -1 is returned.
On failure, each of these calls sets errno to indicate the error.
ERRORS
EAGAIN
The PID file descriptor specified in id is nonblocking and the process that it refers to has not terminated.
ECHILD
(for wait()) The calling process does not have any unwaited-for children.
ECHILD
(for waitpid() or waitid()) The process specified by pid (waitpid()) or idtype and id (waitid()) does not exist or is not a child of the calling process. (This can happen for one’s own child if the action for SIGCHLD is set to SIG_IGN. See also the Linux Notes section about threads.)
EINTR
WNOHANG was not set and an unblocked signal or a SIGCHLD was caught; see signal(7).
EINVAL
The options argument was invalid.
ESRCH
(for wait() or waitpid()) pid is equal to INT_MIN.
VERSIONS
C library/kernel differences
wait() is actually a library function that (in glibc) is implemented as a call to wait4(2).
On some architectures, there is no waitpid() system call; instead, this interface is implemented via a C library wrapper function that calls wait4(2).
The raw waitid() system call takes a fifth argument, of type struct rusage *. If this argument is non-NULL, then it is used to return resource usage information about the child, in the same manner as wait4(2). See getrusage(2) for details.
STANDARDS
POSIX.1-2008.
HISTORY
SVr4, 4.3BSD, POSIX.1-2001.
NOTES
A child that terminates but has not been waited for becomes a “zombie”. The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child. As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes. If a parent process terminates, then its “zombie” children (if any) are adopted by init(1) (or by the nearest “subreaper” process as defined through the use of the prctl(2) PR_SET_CHILD_SUBREAPER operation); init(1) automatically performs a wait to remove the zombies.
POSIX.1-2001 specifies that if the disposition of SIGCHLD is set to SIG_IGN or the SA_NOCLDWAIT flag is set for SIGCHLD (see sigaction(2)), then children that terminate do not become zombies and a call to wait() or waitpid() will block until all children have terminated, and then fail with errno set to ECHILD. (The original POSIX standard left the behavior of setting SIGCHLD to SIG_IGN unspecified. Note that even though the default disposition of SIGCHLD is “ignore”, explicitly setting the disposition to SIG_IGN results in different treatment of zombie process children.)
Linux 2.6 conforms to the POSIX requirements. However, Linux 2.4 (and earlier) does not: if a wait() or waitpid() call is made while SIGCHLD is being ignored, the call behaves just as though SIGCHLD were not being ignored, that is, the call blocks until the next child terminates and then returns the process ID and status of that child.
Linux notes
In the Linux kernel, a kernel-scheduled thread is not a distinct construct from a process. Instead, a thread is simply a process that is created using the Linux-unique clone(2) system call; other routines such as the portable pthread_create(3) call are implemented using clone(2). Before Linux 2.4, a thread was just a special case of a process, and as a consequence one thread could not wait on the children of another thread, even when the latter belongs to the same thread group. However, POSIX prescribes such functionality, and since Linux 2.4 a thread can, and by default will, wait on children of other threads in the same thread group.
The following Linux-specific options are for use with children created using clone(2); they can also, since Linux 4.7, be used with waitid():
__WCLONE
Wait for “clone” children only. If omitted, then wait for “non-clone” children only. (A “clone” child is one which delivers no signal, or a signal other than SIGCHLD to its parent upon termination.) This option is ignored if __WALL is also specified.
__WALL (since Linux 2.4)
Wait for all children, regardless of type (“clone” or “non-clone”).
__WNOTHREAD (since Linux 2.4)
Do not wait for children of other threads in the same thread group. This was the default before Linux 2.4.
Since Linux 4.7, the __WALL flag is automatically implied if the child is being ptraced.
BUGS
According to POSIX.1-2008, an application calling waitid() must ensure that infop points to a siginfo_t structure (i.e., that it is a non-null pointer). On Linux, if infop is NULL, waitid() succeeds, and returns the process ID of the waited-for child. Applications should avoid relying on this inconsistent, nonstandard, and unnecessary feature.
EXAMPLES
The following program demonstrates the use of fork(2) and waitpid(). The program creates a child process. If no command-line argument is supplied to the program, then the child suspends its execution using pause(2), to allow the user to send signals to the child. Otherwise, if a command-line argument is supplied, then the child exits immediately, using the integer supplied on the command line as the exit status. The parent process executes a loop that monitors the child using waitpid(), and uses the W*() macros described above to analyze the wait status value.
The following shell session demonstrates the use of the program:
$ ./a.out &
Child PID is 32360
[1] 32359
$ kill -STOP 32360
stopped by signal 19
$ kill -CONT 32360
continued
$ kill -TERM 32360
killed by signal 15
[1]+ Done ./a.out
$
Program source
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    int    wstatus;
    pid_t  cpid, w;

    cpid = fork();
    if (cpid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }

    if (cpid == 0) {            /* Code executed by child */
        printf("Child PID is %jd\n", (intmax_t) getpid());
        if (argc == 1)
            pause();                    /* Wait for signals */
        _exit(atoi(argv[1]));

    } else {                    /* Code executed by parent */
        do {
            w = waitpid(cpid, &wstatus, WUNTRACED | WCONTINUED);
            if (w == -1) {
                perror("waitpid");
                exit(EXIT_FAILURE);
            }

            if (WIFEXITED(wstatus)) {
                printf("exited, status=%d\n", WEXITSTATUS(wstatus));
            } else if (WIFSIGNALED(wstatus)) {
                printf("killed by signal %d\n", WTERMSIG(wstatus));
            } else if (WIFSTOPPED(wstatus)) {
                printf("stopped by signal %d\n", WSTOPSIG(wstatus));
            } else if (WIFCONTINUED(wstatus)) {
                printf("continued\n");
            }
        } while (!WIFEXITED(wstatus) && !WIFSIGNALED(wstatus));
        exit(EXIT_SUCCESS);
    }
}
SEE ALSO
_exit(2), clone(2), fork(2), kill(2), ptrace(2), sigaction(2), signal(2), wait4(2), pthread_create(3), core(5), credentials(7), signal(7)
383 - Linux cli command getpriority
NAME 🖥️ getpriority 🖥️
get/set program scheduling priority
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/resource.h>
int getpriority(int which, id_t who);
int setpriority(int which, id_t who, int prio);
DESCRIPTION
The scheduling priority of the process, process group, or user, as indicated by which and who, is obtained with the getpriority() call and set with the setpriority() call. The process attribute dealt with by these system calls is the same attribute (also known as the “nice” value) that is dealt with by nice(2).
The value which is one of PRIO_PROCESS, PRIO_PGRP, or PRIO_USER, and who is interpreted relative to which (a process identifier for PRIO_PROCESS, process group identifier for PRIO_PGRP, and a user ID for PRIO_USER). A zero value for who denotes (respectively) the calling process, the process group of the calling process, or the real user ID of the calling process.
The prio argument is a value in the range -20 to 19 (but see NOTES below), with -20 being the highest priority and 19 being the lowest priority. Attempts to set a priority outside this range are silently clamped to the range. The default priority is 0; lower values give a process a higher scheduling priority.
The getpriority() call returns the highest priority (lowest numerical value) enjoyed by any of the specified processes. The setpriority() call sets the priorities of all of the specified processes to the specified value.
Traditionally, only a privileged process could lower the nice value (i.e., set a higher priority). However, since Linux 2.6.12, an unprivileged process can decrease the nice value of a target process that has a suitable RLIMIT_NICE soft limit; see getrlimit(2) for details.
RETURN VALUE
On success, getpriority() returns the calling thread’s nice value, which may be a negative number. On error, it returns -1 and sets errno to indicate the error.
Since a successful call to getpriority() can legitimately return the value -1, it is necessary to clear errno prior to the call, then check errno afterward to determine if -1 is an error or a legitimate value.
setpriority() returns 0 on success. On failure, it returns -1 and sets errno to indicate the error.
ERRORS
EACCES
The caller attempted to set a lower nice value (i.e., a higher process priority), but did not have the required privilege (on Linux: did not have the CAP_SYS_NICE capability).
EINVAL
which was not one of PRIO_PROCESS, PRIO_PGRP, or PRIO_USER.
EPERM
A process was located, but its effective user ID did not match either the effective or the real user ID of the caller, and was not privileged (on Linux: did not have the CAP_SYS_NICE capability). But see NOTES below.
ESRCH
No process was located using the which and who values specified.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD (these interfaces first appeared in 4.2BSD).
NOTES
For further details on the nice value, see sched(7).
Note: the addition of the “autogroup” feature in Linux 2.6.38 means that the nice value no longer has its traditional effect in many circumstances. For details, see sched(7).
A child created by fork(2) inherits its parent’s nice value. The nice value is preserved across execve(2).
The details on the condition for EPERM depend on the system. The above description is what POSIX.1-2001 says, and seems to be followed on all System V-like systems. Linux kernels before Linux 2.6.12 required the real or effective user ID of the caller to match the real user of the process who (instead of its effective user ID). Linux 2.6.12 and later require the effective user ID of the caller to match the real or effective user ID of the process who. All BSD-like systems (SunOS 4.1.3, Ultrix 4.2, 4.3BSD, FreeBSD 4.3, OpenBSD-2.5, …) behave in the same manner as Linux 2.6.12 and later.
C library/kernel differences
The getpriority system call returns nice values translated to the range 40..1, since a negative return value would be interpreted as an error. The glibc wrapper function for getpriority() translates the value back according to the formula unice = 20 - knice (thus, the 40..1 range returned by the kernel corresponds to the range -20..19 as seen by user space).
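The translation can be checked by comparing the raw system call with the wrapper. This is a sketch under the assumption that the architecture defines SYS_getpriority with the semantics described above; the function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: invoke the raw system call for the caller.
   Returns the kernel's value (1..40), or -1 on error. */
static long
raw_kernel_nice(void)
{
    return syscall(SYS_getpriority, PRIO_PROCESS, 0);
}

/* Apply the translation given in the text: unice = 20 - knice. */
static int
user_nice_from_kernel(long knice)
{
    assert(knice >= 1 && knice <= 40);
    return (int) (20 - knice);
}
```

For an unmodified process with nice value 0, the raw call is expected to return 20, which the formula maps back to 0.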
BUGS
According to POSIX, the nice value is a per-process setting. However, under the current Linux/NPTL implementation of POSIX threads, the nice value is a per-thread attribute: different threads in the same process can have different nice values. Portable applications should avoid relying on the Linux behavior, which may be made standards conformant in the future.
SEE ALSO
nice(1), renice(1), fork(2), capabilities(7), sched(7)
Documentation/scheduler/sched-nice-design.txt in the Linux kernel source tree (since Linux 2.6.23)
384 - Linux cli command statx
NAME 🖥️ statx 🖥️
get file status (extended)
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int statx(int dirfd, const char *restrict pathname, int flags,
unsigned int mask, struct statx *restrict statxbuf);
DESCRIPTION
This function returns information about a file, storing it in the buffer pointed to by statxbuf. The returned buffer is a structure of the following type:
struct statx {
__u32 stx_mask; /* Mask of bits indicating
filled fields */
__u32 stx_blksize; /* Block size for filesystem I/O */
__u64 stx_attributes; /* Extra file attribute indicators */
__u32 stx_nlink; /* Number of hard links */
__u32 stx_uid; /* User ID of owner */
__u32 stx_gid; /* Group ID of owner */
__u16 stx_mode; /* File type and mode */
__u64 stx_ino; /* Inode number */
__u64 stx_size; /* Total size in bytes */
__u64 stx_blocks; /* Number of 512B blocks allocated */
__u64 stx_attributes_mask;
/* Mask to show what's supported
in stx_attributes */
/* The following fields are file timestamps */
struct statx_timestamp stx_atime; /* Last access */
struct statx_timestamp stx_btime; /* Creation */
struct statx_timestamp stx_ctime; /* Last status change */
struct statx_timestamp stx_mtime; /* Last modification */
/* If this file represents a device, then the next two
fields contain the ID of the device */
__u32 stx_rdev_major; /* Major ID */
__u32 stx_rdev_minor; /* Minor ID */
/* The next two fields contain the ID of the device
containing the filesystem where the file resides */
__u32 stx_dev_major; /* Major ID */
__u32 stx_dev_minor; /* Minor ID */
__u64 stx_mnt_id; /* Mount ID */
/* Direct I/O alignment restrictions */
__u32 stx_dio_mem_align;
__u32 stx_dio_offset_align;
};
The file timestamps are structures of the following type:
struct statx_timestamp {
__s64 tv_sec; /* Seconds since the Epoch (UNIX time) */
__u32 tv_nsec; /* Nanoseconds since tv_sec */
};
(Note that reserved space and padding is omitted.)
Invoking statx():
To access a file’s status, no permissions are required on the file itself, but in the case of statx() with a pathname, execute (search) permission is required on all of the directories in pathname that lead to the file.
statx() uses pathname, dirfd, and flags to identify the target file in one of the following ways:
An absolute pathname
If pathname begins with a slash, then it is an absolute pathname that identifies the target file. In this case, dirfd is ignored.
A relative pathname
If pathname is a string that begins with a character other than a slash and dirfd is AT_FDCWD, then pathname is a relative pathname that is interpreted relative to the process’s current working directory.
A directory-relative pathname
If pathname is a string that begins with a character other than a slash and dirfd is a file descriptor that refers to a directory, then pathname is a relative pathname that is interpreted relative to the directory referred to by dirfd. (See openat(2) for an explanation of why this is useful.)
By file descriptor
If pathname is an empty string and the AT_EMPTY_PATH flag is specified in flags (see below), then the target file is the one referred to by the file descriptor dirfd.
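Three of the lookup styles listed above (absolute pathname, directory-relative pathname, and file descriptor) can be compared in a short sketch. It assumes glibc 2.28 or later and a writable /tmp directory; the helper names size_via_statx() and lookup_three_ways() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: return the size of the file that statx()
   resolves from (dirfd, pathname, flags), or -1 on error or if the
   filesystem did not report a size. */
static long long
size_via_statx(int dirfd, const char *pathname, int flags)
{
    struct statx stx;

    assert(pathname != NULL);

    if (statx(dirfd, pathname, flags, STATX_SIZE, &stx) == -1)
        return -1;
    if (!(stx.stx_mask & STATX_SIZE))
        return -1;
    return (long long) stx.stx_size;
}

/* Hypothetical demo: create a 5-byte temporary file under /tmp and look
   it up in three ways.  Returns 1 if every lookup reports 5 bytes. */
static int
lookup_three_ways(void)
{
    char  path[] = "/tmp/statx-demo-XXXXXX";
    int   fd, dirfd, ok = 0;

    fd = mkstemp(path);
    if (fd == -1)
        return 0;
    dirfd = open("/tmp", O_RDONLY | O_DIRECTORY);

    if (dirfd != -1 && write(fd, "hello", 5) == 5) {
        ok = size_via_statx(AT_FDCWD, path, 0) == 5     /* Absolute pathname */
          && size_via_statx(dirfd, path + 5, 0) == 5    /* Relative to dirfd; skips "/tmp/" */
          && size_via_statx(fd, "", AT_EMPTY_PATH) == 5; /* By file descriptor */
    }

    if (dirfd != -1)
        close(dirfd);
    unlink(path);
    close(fd);
    return ok;
}
```

In the absolute-pathname call, dirfd is ignored, so any value could be passed; AT_FDCWD is used only for readability.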
flags can be used to influence a pathname-based lookup. A value for flags is constructed by ORing together zero or more of the following constants:
AT_EMPTY_PATH
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory.
If dirfd is AT_FDCWD, the call operates on the current working directory.
AT_NO_AUTOMOUNT
Don’t automount the terminal (“basename”) component of pathname if it is a directory that is an automount point. This allows the caller to gather attributes of an automount point (rather than the location it would mount). This flag has no effect if the mount point has already been mounted over.
The AT_NO_AUTOMOUNT flag can be used in tools that scan directories to prevent mass-automounting of a directory of automount points.
All of stat(2), lstat(2), and fstatat(2) act as though AT_NO_AUTOMOUNT was set.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(2).
flags can also be used to control what sort of synchronization the kernel will do when querying a file on a remote filesystem. This is done by ORing in one of the following values:
AT_STATX_SYNC_AS_STAT
Do whatever stat(2) does. This is the default and is very much filesystem-specific.
AT_STATX_FORCE_SYNC
Force the attributes to be synchronized with the server. This may require that a network filesystem perform a data writeback to get the timestamps correct.
AT_STATX_DONT_SYNC
Don’t synchronize anything, but rather just take whatever the system has cached if possible. This may mean that the information returned is approximate, but, on a network filesystem, it may not involve a round trip to the server - even if no lease is held.
The mask argument to statx() is used to tell the kernel which fields the caller is interested in. mask is an ORed combination of the following constants:
STATX_TYPE | Want stx_mode & S_IFMT |
STATX_MODE | Want stx_mode & ~S_IFMT |
STATX_NLINK | Want stx_nlink |
STATX_UID | Want stx_uid |
STATX_GID | Want stx_gid |
STATX_ATIME | Want stx_atime |
STATX_MTIME | Want stx_mtime |
STATX_CTIME | Want stx_ctime |
STATX_INO | Want stx_ino |
STATX_SIZE | Want stx_size |
STATX_BLOCKS | Want stx_blocks |
STATX_BASIC_STATS | [All of the above] |
STATX_BTIME | Want stx_btime |
STATX_ALL | The same as STATX_BASIC_STATS | STATX_BTIME. It is deprecated and should not be used. |
STATX_MNT_ID | Want stx_mnt_id (since Linux 5.8) |
STATX_DIOALIGN | Want stx_dio_mem_align and stx_dio_offset_align (since Linux 6.1; support varies by filesystem) |
Note that, in general, the kernel does not reject values in mask other than the above. (For an exception, see EINVAL in ERRORS.) Instead, it simply informs the caller which values are supported by this kernel and filesystem via the statx.stx_mask field. Therefore, do not simply set mask to UINT_MAX (all bits set), as one or more bits may, in the future, be used to specify an extension to the buffer.
The returned information
The status information for the target file is returned in the statx structure pointed to by statxbuf. Included in this is stx_mask which indicates what other information has been returned. stx_mask has the same format as the mask argument and bits are set in it to indicate which fields have been filled in.
It should be noted that the kernel may return fields that weren’t requested and may fail to return fields that were requested, depending on what the backing filesystem supports. (Fields that are given values despite being unrequested can just be ignored.) In either case, stx_mask will not be equal to mask.
If a filesystem does not support a field or if it has an unrepresentable value (for instance, a file with an exotic type), then the mask bit corresponding to that field will be cleared in stx_mask even if the user asked for it and a dummy value will be filled in for compatibility purposes if one is available (e.g., a dummy UID and GID may be specified to mount under some circumstances).
A filesystem may also fill in fields that the caller didn’t ask for if it has values for them available and the information is available at no extra cost. If this happens, the corresponding bits will be set in stx_mask.
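Because the returned stx_mask may differ from the requested mask, callers typically test each bit before trusting a field. The following is a minimal sketch of that pattern; the helper names are hypothetical, and the output format is only an example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: request the basic stats plus the creation time.
   Returns 0 on success, -1 on error (errno is set). */
static int
query_with_btime(const char *pathname, struct statx *stx)
{
    assert(pathname != NULL && stx != NULL);
    return statx(AT_FDCWD, pathname, 0,
                 STATX_BASIC_STATS | STATX_BTIME, stx);
}

/* A field is usable only if its bit is set in the returned stx_mask. */
static int
field_is_valid(const struct statx *stx, unsigned int bit)
{
    return (stx->stx_mask & bit) == bit;
}

/* Print the creation time only if the filesystem actually supplied it. */
static void
print_btime(const struct statx *stx)
{
    if (field_is_valid(stx, STATX_BTIME))
        printf("created: %lld.%09u\n",
               (long long) stx->stx_btime.tv_sec,
               (unsigned int) stx->stx_btime.tv_nsec);
    else
        printf("creation time not available on this filesystem\n");
}
```

Whether STATX_BTIME comes back set depends on the filesystem, so the sketch makes no assumption about it and handles both cases.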
Note: for performance and simplicity reasons, different fields in the statx structure may contain state information from different moments during the execution of the system call. For example, if stx_mode or stx_uid is changed by another process by calling chmod(2) or chown(2), statx() might return the old stx_mode together with the new stx_uid, or the old stx_uid together with the new stx_mode.
Apart from stx_mask (which is described above), the fields in the statx structure are:
stx_blksize
The “preferred” block size for efficient filesystem I/O. (Writing to a file in smaller chunks may cause an inefficient read-modify-rewrite.)
stx_attributes
Further status information about the file (see below for more information).
stx_nlink
The number of hard links on a file.
stx_uid
This field contains the user ID of the owner of the file.
stx_gid
This field contains the ID of the group owner of the file.
stx_mode
The file type and mode. See inode(7) for details.
stx_ino
The inode number of the file.
stx_size
The size of the file (if it is a regular file or a symbolic link) in bytes. The size of a symbolic link is the length of the pathname it contains, without a terminating null byte.
stx_blocks
The number of blocks allocated to the file on the medium, in 512-byte units. (This may be smaller than stx_size/512 when the file has holes.)
stx_attributes_mask
A mask indicating which bits in stx_attributes are supported by the VFS and the filesystem.
stx_atime
The file’s last access timestamp.
stx_btime
The file’s creation timestamp.
stx_ctime
The file’s last status change timestamp.
stx_mtime
The file’s last modification timestamp.
stx_dev_major and stx_dev_minor
The device on which this file (inode) resides.
stx_rdev_major and stx_rdev_minor
The device that this file (inode) represents if the file is of block or character device type.
stx_mnt_id
The mount ID of the mount containing the file. This is the same number reported by name_to_handle_at(2) and corresponds to the number in the first field in one of the records in /proc/self/mountinfo.
stx_dio_mem_align
The alignment (in bytes) required for user memory buffers for direct I/O (O_DIRECT) on this file, or 0 if direct I/O is not supported on this file.
STATX_DIOALIGN (stx_dio_mem_align and stx_dio_offset_align) is supported on block devices since Linux 6.1. The support on regular files varies by filesystem; it is supported by ext4, f2fs, and xfs since Linux 6.1.
stx_dio_offset_align
The alignment (in bytes) required for file offsets and I/O segment lengths for direct I/O (O_DIRECT) on this file, or 0 if direct I/O is not supported on this file. This will only be nonzero if stx_dio_mem_align is nonzero, and vice versa.
For further information on the above fields, see inode(7).
File attributes
The stx_attributes field contains a set of ORed flags that indicate additional attributes of the file. Note that any attribute that is not indicated as supported by stx_attributes_mask has no usable value here. The bits in stx_attributes_mask correspond bit-by-bit to stx_attributes.
The flags are as follows:
STATX_ATTR_COMPRESSED
The file is compressed by the filesystem and may take extra resources to access.
STATX_ATTR_IMMUTABLE
The file cannot be modified: it cannot be deleted or renamed, no hard links can be created to this file and no data can be written to it. See chattr(1).
STATX_ATTR_APPEND
The file can only be opened in append mode for writing. Random access writing is not permitted. See chattr(1).
STATX_ATTR_NODUMP
File is not a candidate for backup when a backup program such as dump(8) is run. See chattr(1).
STATX_ATTR_ENCRYPTED
The file is encrypted by the filesystem, and a key is required to decrypt it.
STATX_ATTR_VERITY (since Linux 5.5)
The file has fs-verity enabled. It cannot be written to, and all reads from it will be verified against a cryptographic hash that covers the entire file (e.g., via a Merkle tree).
STATX_ATTR_DAX (since Linux 5.8)
The file is in the DAX (CPU direct access) state. DAX state attempts to minimize software cache effects for both I/O and memory mappings of this file. It requires a filesystem which has been configured to support DAX.
DAX generally assumes all accesses are via CPU load / store instructions which can minimize overhead for small accesses, but may adversely affect CPU utilization for large transfers.
File I/O is done directly to/from user-space buffers and memory mapped I/O may be performed with direct memory mappings that bypass the kernel page cache.
While the DAX property tends to result in data being transferred synchronously, it does not give the same guarantees as the O_SYNC flag (see open(2)), where data and the necessary metadata are transferred together.
A DAX file may support being mapped with the MAP_SYNC flag, which enables a program to use CPU cache flush instructions to persist CPU store operations without an explicit fsync(2). See mmap(2) for more information.
STATX_ATTR_MOUNT_ROOT (since Linux 5.8)
The file is the root of a mount.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname or statxbuf is NULL or points to a location outside the process’s accessible address space.
EINVAL
Invalid flag specified in flags.
EINVAL
Reserved flag specified in mask. (Currently, there is one such flag, designated by the constant STATX__RESERVED, with the value 0x80000000U.)
ELOOP
Too many symbolic links encountered while traversing the pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist, or pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory or pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
STANDARDS
Linux.
HISTORY
Linux 4.11, glibc 2.28.
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), name_to_handle_at(2), readlink(2), stat(2), utime(2), proc(5), capabilities(7), inode(7), symlink(7)
385 - Linux cli command exit_group
NAME 🖥️ exit_group 🖥️
exit all threads in a process
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
[[noreturn]] void syscall(SYS_exit_group, int status);
Note: glibc provides no wrapper for exit_group(), necessitating the use of syscall(2).
DESCRIPTION
This system call terminates all threads in the calling process’s thread group.
RETURN VALUE
This system call does not return.
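Since there is no glibc wrapper, a direct invocation goes through syscall(2). The sketch below runs the call in a forked child so that the caller survives to observe the result; the function name child_exit_group_status() is hypothetical, and the child here is single-threaded for simplicity.

```c
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: fork a child that terminates its whole thread
   group with a raw exit_group call.  Returns the exit status seen by
   the parent, or -1 on failure. */
static int
child_exit_group_status(int status)
{
    int    wstatus;
    pid_t  cpid;

    assert(status >= 0 && status <= 255);   /* Only the low 8 bits are reported */

    cpid = fork();
    if (cpid == -1)
        return -1;

    if (cpid == 0) {
        syscall(SYS_exit_group, status);    /* Does not return */
        _exit(127);                         /* Not reached */
    }

    if (waitpid(cpid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;
    return WEXITSTATUS(wstatus);
}
```

From the parent's point of view the result is indistinguishable from the child calling _exit(2).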
STANDARDS
Linux.
HISTORY
Linux 2.5.35.
NOTES
Since glibc 2.3, this is the system call invoked when the _exit(2) wrapper function is called.
SEE ALSO
_exit(2)
386 - Linux cli command readv
NAME 🖥️ readv 🖥️
read or write data into multiple buffers
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/uio.h>
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
off_t offset);
ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
preadv(), pwritev():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
The readv() system call reads iovcnt buffers from the file associated with the file descriptor fd into the buffers described by iov (“scatter input”).
The writev() system call writes iovcnt buffers of data described by iov to the file associated with the file descriptor fd (“gather output”).
The pointer iov points to an array of iovec structures, described in iovec(3type).
The readv() system call works just like read(2) except that multiple buffers are filled.
The writev() system call works just like write(2) except that multiple buffers are written out.
Buffers are processed in array order. This means that readv() completely fills iov[0] before proceeding to iov[1], and so on. (If there is insufficient data, then not all buffers pointed to by iov may be filled.) Similarly, writev() writes out the entire contents of iov[0] before proceeding to iov[1], and so on.
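The array-order behavior can be seen with a pipe: two buffers are gathered by writev(), and the resulting byte stream is scattered into buffers of different sizes by readv(). This is a minimal sketch; the function name pipe_scatter_gather() and the buffer sizes are illustrative choices.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical demo: gather "hello" and ", world" into a pipe, then
   scatter the 12 bytes into a 3-byte and a 9-byte buffer.  `head` must
   hold 4 bytes and `tail` 10 bytes (data plus a terminating null byte).
   Returns the number of bytes read, or -1 on error. */
static ssize_t
pipe_scatter_gather(char *head, char *tail)
{
    char          part1[] = "hello";
    char          part2[] = ", world";
    struct iovec  out[2], in[2];
    int           pfd[2];
    ssize_t       nread = -1;

    assert(head != NULL && tail != NULL);
    memset(head, 0, 4);
    memset(tail, 0, 10);

    out[0].iov_base = part1;  out[0].iov_len = 5;
    out[1].iov_base = part2;  out[1].iov_len = 7;
    in[0].iov_base = head;    in[0].iov_len = 3;
    in[1].iov_base = tail;    in[1].iov_len = 9;

    if (pipe(pfd) == -1)
        return -1;

    /* out[0] is written in full before out[1]; in[0] is filled before in[1]. */
    if (writev(pfd[1], out, 2) == 12)
        nread = readv(pfd[0], in, 2);

    close(pfd[0]);
    close(pfd[1]);
    return nread;
}
```

After a successful call, head holds "hel" and tail holds "lo, world": the boundaries of the output buffers are not preserved, only the byte order.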
The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes; analogously, readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description (see open(2)).
preadv() and pwritev()
The preadv() system call combines the functionality of readv() and pread(2). It performs the same task as readv(), but adds a fourth argument, offset, which specifies the file offset at which the input operation is to be performed.
The pwritev() system call combines the functionality of writev() and pwrite(2). It performs the same task as writev(), but adds a fourth argument, offset, which specifies the file offset at which the output operation is to be performed.
The file offset is not changed by these system calls. The file referred to by fd must be capable of seeking.
preadv2() and pwritev2()
These system calls are similar to preadv() and pwritev() calls, but add a fifth argument, flags, which modifies the behavior on a per-call basis.
Unlike preadv() and pwritev(), if the offset argument is -1, then the current file offset is used and updated.
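The difference in file-offset handling between preadv() and preadv2() with an offset of -1 can be sketched as follows. The example assumes glibc 2.26 or later and a writable /tmp directory; the function name read3_and_report_offset() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical demo on a temporary file containing "0123456789" whose
   file offset has been set to 2.  If use_current_offset is zero, read
   3 bytes with preadv() at explicit offset 4; otherwise read 3 bytes
   with preadv2() using offset -1 and no flags.  `buf` must hold 4
   bytes.  Returns the file offset after the read, or -1 on error. */
static off_t
read3_and_report_offset(int use_current_offset, char *buf)
{
    char          path[] = "/tmp/preadv-demo-XXXXXX";
    struct iovec  iov;
    off_t         after = -1;
    ssize_t       n;
    int           fd;

    assert(buf != NULL);
    memset(buf, 0, 4);
    iov.iov_base = buf;
    iov.iov_len = 3;

    fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);               /* The file persists until fd is closed */

    if (write(fd, "0123456789", 10) == 10 && lseek(fd, 2, SEEK_SET) == 2) {
        if (use_current_offset)
            n = preadv2(fd, &iov, 1, -1, 0);    /* Uses and updates the offset */
        else
            n = preadv(fd, &iov, 1, 4);         /* Leaves the offset alone */

        if (n == 3)
            after = lseek(fd, 0, SEEK_CUR);
    }

    close(fd);
    return after;
}
```

With preadv() the buffer receives "456" and the offset stays at 2; with preadv2() and offset -1 the buffer receives "234" and the offset advances to 5.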
The flags argument contains a bitwise OR of zero or more of the following flags:
RWF_DSYNC (since Linux 4.7)
Provide a per-write equivalent of the O_DSYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_HIPRI (since Linux 4.6)
High priority read/write. Allows block-based filesystems to use polling of the device, which provides lower latency, but may use additional resources. (Currently, this feature is usable only on a file descriptor opened using the O_DIRECT flag.)
RWF_SYNC (since Linux 4.7)
Provide a per-write equivalent of the O_SYNC open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call.
RWF_NOWAIT (since Linux 4.14)
Do not wait for data which is not immediately available. If this flag is specified, the preadv2() system call will return instantly if it would have to read data from the backing storage or wait for a lock. If some data was successfully read, it will return the number of bytes read. If no bytes were read, it will return -1 and set errno to EAGAIN (but see BUGS). Currently, this flag is meaningful only for preadv2().
RWF_APPEND (since Linux 4.16)
Provide a per-write equivalent of the O_APPEND open(2) flag. This flag is meaningful only for pwritev2(), and its effect applies only to the data range written by the system call. The offset argument does not affect the write operation; the data is always appended to the end of the file. However, if the offset argument is -1, the current file offset is updated.
RETURN VALUE
On success, readv(), preadv(), and preadv2() return the number of bytes read; writev(), pwritev(), and pwritev2() return the number of bytes written.
Note that it is not an error for a successful call to transfer fewer bytes than requested (see read(2) and write(2)).
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
The errors are as given for read(2) and write(2). Furthermore, preadv(), preadv2(), pwritev(), and pwritev2() can also fail for the same reasons as lseek(2). Additionally, the following errors are defined:
EINVAL
The sum of the iov_len values overflows an ssize_t value.
EINVAL
The vector count, iovcnt, is less than zero or greater than the permitted maximum.
EOPNOTSUPP
An unknown flag is specified in flags.
VERSIONS
C library/kernel differences
The raw preadv() and pwritev() system calls have call signatures that differ slightly from that of the corresponding GNU C library wrapper functions shown in the SYNOPSIS. The final argument, offset, is unpacked by the wrapper functions into two arguments in the system calls:
unsigned long pos_l, unsigned long pos_h
These arguments contain, respectively, the low order and high order 32 bits of offset.
STANDARDS
readv()
writev()
POSIX.1-2008.
preadv()
pwritev()
BSD.
preadv2()
pwritev2()
Linux.
HISTORY
readv()
writev()
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
preadv(), pwritev(): Linux 2.6.30, glibc 2.10.
preadv2(), pwritev2(): Linux 4.6, glibc 2.26.
Historical C library/kernel differences
To deal with the fact that IOV_MAX was so low on early versions of Linux, the glibc wrapper functions for readv() and writev() did some extra work if they detected that the underlying kernel system call failed because this limit was exceeded. In the case of readv(), the wrapper function allocated a temporary buffer large enough for all of the items specified by iov, passed that buffer in a call to read(2), copied data from the buffer to the locations specified by the iov_base fields of the elements of iov, and then freed the buffer. The wrapper function for writev() performed the analogous task using a temporary buffer and a call to write(2).
The need for this extra effort in the glibc wrapper functions went away with Linux 2.2 and later. However, glibc continued to provide this behavior until glibc 2.10. Starting with glibc 2.9, the wrapper functions provide this behavior only if the library detects that the system is running a Linux kernel older than Linux 2.6.18 (an arbitrarily selected kernel version). And since glibc 2.20 (which requires a minimum of Linux 2.6.32), the glibc wrapper functions always just directly invoke the system calls.
NOTES
POSIX.1 allows an implementation to place a limit on the number of items that can be passed in iov. An implementation can advertise its limit by defining IOV_MAX in <limits.h> or at run time via the return value from sysconf(_SC_IOV_MAX). On modern Linux systems, the limit is 1024. Back in Linux 2.0 days, this limit was 16.
BUGS
Linux 5.9 and Linux 5.10 have a bug where preadv2() with the RWF_NOWAIT flag may return 0 even when not at end of file.
EXAMPLES
The following code sample demonstrates the use of writev():
char *str0 = "hello ";
char *str1 = "world\n";
ssize_t nwritten;
struct iovec iov[2];
iov[0].iov_base = str0;
iov[0].iov_len = strlen(str0);
iov[1].iov_base = str1;
iov[1].iov_len = strlen(str1);
nwritten = writev(STDOUT_FILENO, iov, 2);
SEE ALSO
pread(2), read(2), write(2)
387 - Linux cli command getpgrp
NAME
getpgrp - set/get process group
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setpgid(pid_t pid, pid_t pgid);
pid_t getpgid(pid_t pid);
pid_t getpgrp(void); /* POSIX.1 version */
[[deprecated]] pid_t getpgrp(pid_t pid); /* BSD version */
int setpgrp(void); /* System V version */
[[deprecated]] int setpgrp(pid_t pid, pid_t pgid); /* BSD version */
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
getpgid():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
setpgrp() (POSIX.1):
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _SVID_SOURCE
setpgrp() (BSD), getpgrp() (BSD):
[These are available only before glibc 2.19]
_BSD_SOURCE &&
! (_POSIX_SOURCE || _POSIX_C_SOURCE || _XOPEN_SOURCE
|| _GNU_SOURCE || _SVID_SOURCE)
DESCRIPTION
All of these interfaces are available on Linux, and are used for getting and setting the process group ID (PGID) of a process. The preferred, POSIX.1-specified ways of doing this are: getpgrp(void), for retrieving the calling process’s PGID; and setpgid(), for setting a process’s PGID.
setpgid() sets the PGID of the process specified by pid to pgid. If pid is zero, then the process ID of the calling process is used. If pgid is zero, then the PGID of the process specified by pid is made the same as its process ID. If setpgid() is used to move a process from one process group to another (as is done by some shells when creating pipelines), both process groups must be part of the same session (see setsid(2) and credentials(7)). In this case, the pgid specifies an existing process group to be joined and the session ID of that group must match the session ID of the joining process.
The POSIX.1 version of getpgrp(), which takes no arguments, returns the PGID of the calling process.
getpgid() returns the PGID of the process specified by pid. If pid is zero, the process ID of the calling process is used. (Retrieving the PGID of a process other than the caller is rarely necessary, and the POSIX.1 getpgrp() is preferred for retrieving the caller’s own PGID.)
The System V-style setpgrp(), which takes no arguments, is equivalent to setpgid(0, 0).
The BSD-specific setpgrp() call, which takes arguments pid and pgid, is a wrapper function that calls
setpgid(pid, pgid)
Since glibc 2.19, the BSD-specific setpgrp() function is no longer exposed by <unistd.h>; calls should be replaced with the setpgid() call shown above.
The BSD-specific getpgrp() call, which takes a single pid argument, is a wrapper function that calls
getpgid(pid)
Since glibc 2.19, the BSD-specific getpgrp() function is no longer exposed by <unistd.h>; calls should be replaced with calls to the POSIX.1 getpgrp() which takes no arguments (if the intent is to obtain the caller’s PGID), or with the getpgid() call shown above.
RETURN VALUE
On success, setpgid() and setpgrp() return zero. On error, -1 is returned, and errno is set to indicate the error.
The POSIX.1 getpgrp() always returns the PGID of the caller.
getpgid() and the BSD-specific getpgrp() return a process group ID on success. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
An attempt was made to change the process group ID of one of the children of the calling process and the child had already performed an execve(2) (setpgid(), setpgrp()).
EINVAL
pgid is less than 0 (setpgid(), setpgrp()).
EPERM
An attempt was made to move a process into a process group in a different session, or to change the process group ID of one of the children of the calling process and the child was in a different session, or to change the process group ID of a session leader (setpgid(), setpgrp()).
EPERM
The target process group does not exist (setpgid(), setpgrp()).
ESRCH
For getpgid(): pid does not match any process. For setpgid(): pid is not the calling process and not a child of the calling process.
STANDARDS
getpgid()
setpgid()
getpgrp() (no args)
setpgrp() (no args)
POSIX.1-2008 (but see HISTORY).
setpgrp() (2 args)
getpgrp() (1 arg)
None.
HISTORY
getpgid()
setpgid()
getpgrp() (no args)
POSIX.1-2001.
setpgrp() (no args)
POSIX.1-2001. POSIX.1-2008 marks it as obsolete.
setpgrp() (2 args)
getpgrp() (1 arg)
4.2BSD.
NOTES
A child created via fork(2) inherits its parent’s process group ID. The PGID is preserved across an execve(2).
Each process group is a member of a session and each process is a member of the session of which its process group is a member. (See credentials(7).)
A session can have a controlling terminal. At any time, one (and only one) of the process groups in the session can be the foreground process group for the terminal; the remaining process groups are in the background. If a signal is generated from the terminal (e.g., typing the interrupt key to generate SIGINT), that signal is sent to the foreground process group. (See termios(3) for a description of the characters that generate signals.) Only the foreground process group may read(2) from the terminal; if a background process group tries to read(2) from the terminal, then the group is sent a SIGTTIN signal, which suspends it. The tcgetpgrp(3) and tcsetpgrp(3) functions are used to get/set the foreground process group of the controlling terminal.
The setpgid() and getpgrp() calls are used by programs such as bash(1) to create process groups in order to implement shell job control.
If the termination of a process causes a process group to become orphaned, and if any member of the newly orphaned process group is stopped, then a SIGHUP signal followed by a SIGCONT signal will be sent to each process in the newly orphaned process group. An orphaned process group is one in which the parent of every member of the process group is either itself also a member of the process group or is a member of a process group in a different session (see also credentials(7)).
SEE ALSO
getuid(2), setsid(2), tcgetpgrp(3), tcsetpgrp(3), termios(3), credentials(7)
388 - Linux cli command rt_sigtimedwait
NAME
rt_sigtimedwait - synchronously wait for queued signals
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
int sigwaitinfo(const sigset_t *restrict set,
siginfo_t *_Nullable restrict info);
int sigtimedwait(const sigset_t *restrict set,
siginfo_t *_Nullable restrict info,
const struct timespec *restrict timeout);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigwaitinfo(), sigtimedwait():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
sigwaitinfo() suspends execution of the calling thread until one of the signals in set is pending. (If one of the signals in set is already pending for the calling thread, sigwaitinfo() will return immediately.)
sigwaitinfo() removes the signal from the set of pending signals and returns the signal number as its function result. If the info argument is not NULL, then the buffer that it points to is used to return a structure of type siginfo_t (see sigaction(2)) containing information about the signal.
If multiple signals in set are pending for the caller, the signal that is retrieved by sigwaitinfo() is determined according to the usual ordering rules; see signal(7) for further details.
sigtimedwait() operates in exactly the same way as sigwaitinfo() except that it has an additional argument, timeout, which specifies the interval for which the thread is suspended waiting for a signal. (This interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the interval may overrun by a small amount.) This argument is a timespec(3) structure.
If both fields of this structure are specified as 0, a poll is performed: sigtimedwait() returns immediately, either with information about a signal that was pending for the caller, or with an error if none of the signals in set was pending.
RETURN VALUE
On success, both sigwaitinfo() and sigtimedwait() return a signal number (i.e., a value greater than zero). On failure both calls return -1, with errno set to indicate the error.
ERRORS
EAGAIN
No signal in set became pending within the timeout period specified to sigtimedwait().
EINTR
The wait was interrupted by a signal handler; see signal(7). (This handler was for a signal other than one of those in set.)
EINVAL
timeout was invalid.
VERSIONS
C library/kernel differences
On Linux, sigwaitinfo() is a library function implemented on top of sigtimedwait().
The glibc wrapper functions for sigwaitinfo() and sigtimedwait() silently ignore attempts to wait for the two real-time signals that are used internally by the NPTL threading implementation. See nptl(7) for details.
The original Linux system call was named sigtimedwait(). However, with the addition of real-time signals in Linux 2.2, the fixed-size, 32-bit sigset_t type supported by that system call was no longer fit for purpose. Consequently, a new system call, rt_sigtimedwait(), was added to support an enlarged sigset_t type. The new system call takes a fourth argument, size_t sigsetsize, which specifies the size in bytes of the signal set in set. This argument is currently required to have the value sizeof(sigset_t) (or the error EINVAL results). The glibc sigtimedwait() wrapper function hides these details from us, transparently calling rt_sigtimedwait() when the kernel provides it.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
NOTES
In normal usage, the calling program blocks the signals in set via a prior call to sigprocmask(2) (so that the default disposition for these signals does not occur if they become pending between successive calls to sigwaitinfo() or sigtimedwait()) and does not establish handlers for these signals. In a multithreaded program, the signal should be blocked in all threads, in order to prevent the signal being treated according to its default disposition in a thread other than the one calling sigwaitinfo() or sigtimedwait().
The set of signals that is pending for a given thread is the union of the set of signals that is pending specifically for that thread and the set of signals that is pending for the process as a whole (see signal(7)).
Attempts to wait for SIGKILL and SIGSTOP are silently ignored.
If multiple threads of a process are blocked waiting for the same signal(s) in sigwaitinfo() or sigtimedwait(), then exactly one of the threads will actually receive the signal if it becomes pending for the process as a whole; which of the threads receives the signal is indeterminate.
sigwaitinfo() and sigtimedwait() can’t be used to receive signals that are synchronously generated, such as the SIGSEGV signal that results from accessing an invalid memory address or the SIGFPE signal that results from an arithmetic error. Such signals can be caught only via a signal handler.
POSIX leaves the meaning of a NULL value for the timeout argument of sigtimedwait() unspecified, permitting the possibility that this has the same meaning as a call to sigwaitinfo(), and indeed this is what is done on Linux.
SEE ALSO
kill(2), sigaction(2), signal(2), signalfd(2), sigpending(2), sigprocmask(2), sigqueue(3), sigsetops(3), sigwait(3), timespec(3), signal(7), time(7)
389 - Linux cli command mount_setattr
NAME
mount_setattr - change properties of a mount or mount tree
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/fcntl.h> /* Definition of AT_* constants */
#include <linux/mount.h> /* Definition of MOUNT_ATTR_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_mount_setattr, int dirfd, const char *pathname,
unsigned int flags, struct mount_attr *attr, size_t size);
Note: glibc provides no wrapper for mount_setattr(), necessitating the use of syscall(2).
DESCRIPTION
The mount_setattr() system call changes the mount properties of a mount or an entire mount tree. If pathname is a relative pathname, then it is interpreted relative to the directory referred to by the file descriptor dirfd. If dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process. If pathname is the empty string and AT_EMPTY_PATH is specified in flags, then the mount properties of the mount identified by dirfd are changed. (See openat(2) for an explanation of why the dirfd argument is useful.)
The mount_setattr() system call uses an extensible structure (struct mount_attr) to allow for future extensions. Any non-flag extensions to mount_setattr() will be implemented as new fields appended to this structure, with a zero value in a new field resulting in the kernel behaving as though that extension field was not present. Therefore, the caller must zero-fill this structure on initialization. See the “Extensibility” subsection under NOTES for more details.
The size argument should usually be specified as sizeof(struct mount_attr). However, if the caller is using a kernel that supports an extended struct mount_attr, but the caller does not intend to make use of these features, it is possible to pass the size of an earlier version of the structure together with the extended structure. This allows the kernel to not copy later parts of the structure that aren’t used anyway. With each extension that changes the size of struct mount_attr, the kernel will expose a definition of the form MOUNT_ATTR_SIZE_VERnumber. For example, the macro for the size of the initial version of struct mount_attr is MOUNT_ATTR_SIZE_VER0.
The flags argument can be used to alter the pathname resolution behavior. The supported values are:
AT_EMPTY_PATH
If pathname is the empty string, change the mount properties on dirfd itself.
AT_RECURSIVE
Change the mount properties of the entire mount tree.
AT_SYMLINK_NOFOLLOW
Don’t follow trailing symbolic links.
AT_NO_AUTOMOUNT
Don’t trigger automounts.
The attr argument of mount_setattr() is a structure of the following form:
struct mount_attr {
__u64 attr_set; /* Mount properties to set */
__u64 attr_clr; /* Mount properties to clear */
__u64 propagation; /* Mount propagation type */
__u64 userns_fd; /* User namespace file descriptor */
};
The attr_set and attr_clr members are used to specify the mount properties that are supposed to be set or cleared for a mount or mount tree. Flags set in attr_set enable a property on a mount or mount tree, and flags set in attr_clr remove a property from a mount or mount tree.
When changing mount properties, the kernel will first clear the flags specified in the attr_clr field, and then set the flags specified in the attr_set field. For example, these settings:
struct mount_attr attr = {
.attr_clr = MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODEV,
.attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
};
are equivalent to the following steps:
unsigned int current_mnt_flags = mnt->mnt_flags;
/*
* Clear all flags set in .attr_clr,
* clearing MOUNT_ATTR_NOEXEC and MOUNT_ATTR_NODEV.
*/
current_mnt_flags &= ~attr->attr_clr;
/*
* Now set all flags set in .attr_set,
* applying MOUNT_ATTR_RDONLY and MOUNT_ATTR_NOSUID.
*/
current_mnt_flags |= attr->attr_set;
mnt->mnt_flags = current_mnt_flags;
As a result of this change, the mount or mount tree (a) is read-only; (b) blocks the execution of set-user-ID and set-group-ID programs; (c) allows execution of programs; and (d) allows access to devices.
Multiple changes with the same set of flags requested in attr_clr and attr_set are guaranteed to be idempotent after the changes have been applied.
The following mount attributes can be specified in the attr_set or attr_clr fields:
MOUNT_ATTR_RDONLY
If set in attr_set, makes the mount read-only. If set in attr_clr, removes the read-only setting if set on the mount.
MOUNT_ATTR_NOSUID
If set in attr_set, causes the mount not to honor the set-user-ID and set-group-ID mode bits and file capabilities when executing programs. If set in attr_clr, clears the set-user-ID, set-group-ID, and file capability restriction if set on this mount.
MOUNT_ATTR_NODEV
If set in attr_set, prevents access to devices on this mount. If set in attr_clr, removes the restriction that prevented accessing devices on this mount.
MOUNT_ATTR_NOEXEC
If set in attr_set, prevents executing programs on this mount. If set in attr_clr, removes the restriction that prevented executing programs on this mount.
MOUNT_ATTR_NOSYMFOLLOW
If set in attr_set, prevents following symbolic links on this mount. If set in attr_clr, removes the restriction that prevented following symbolic links on this mount.
MOUNT_ATTR_NODIRATIME
If set in attr_set, prevents updating access time for directories on this mount. If set in attr_clr, removes the restriction that prevented updating access time for directories. Note that MOUNT_ATTR_NODIRATIME can be combined with other access-time settings and is implied by the noatime setting. All other access-time settings are mutually exclusive.
MOUNT_ATTR__ATIME - changing access-time settings
The access-time values listed below are an enumeration that includes the value zero, expressed in the bits defined by the mask MOUNT_ATTR__ATIME. Even though these bits are an enumeration (in contrast to the other mount flags such as MOUNT_ATTR_NOEXEC), they are nonetheless passed in attr_set and attr_clr for consistency with fsmount(2), which introduced this behavior.
Note that, since the access-time values are an enumeration rather than bit values, a caller wanting to transition to a different access-time setting cannot simply specify the access-time setting in attr_set, but must also include MOUNT_ATTR__ATIME in the attr_clr field. The kernel will verify that MOUNT_ATTR__ATIME isn’t partially set in attr_clr (i.e., all bits in the MOUNT_ATTR__ATIME bit field are either set or clear), and that attr_set doesn’t have any access-time bits set if MOUNT_ATTR__ATIME isn’t set in attr_clr.
MOUNT_ATTR_RELATIME
When a file is accessed via this mount, update the file’s last access time (atime) only if the current value of atime is less than or equal to the file’s last modification time (mtime) or last status change time (ctime).
To enable this access-time setting on a mount or mount tree, MOUNT_ATTR_RELATIME must be set in attr_set and MOUNT_ATTR__ATIME must be set in the attr_clr field.
MOUNT_ATTR_NOATIME
Do not update access times for (all types of) files on this mount.
To enable this access-time setting on a mount or mount tree, MOUNT_ATTR_NOATIME must be set in attr_set and MOUNT_ATTR__ATIME must be set in the attr_clr field.
MOUNT_ATTR_STRICTATIME
Always update the last access time (atime) when files are accessed on this mount.
To enable this access-time setting on a mount or mount tree, MOUNT_ATTR_STRICTATIME must be set in attr_set and MOUNT_ATTR__ATIME must be set in the attr_clr field.
MOUNT_ATTR_IDMAP
If set in attr_set, creates an ID-mapped mount. The ID mapping is taken from the user namespace specified in userns_fd and attached to the mount.
Since it is not supported to change the ID mapping of a mount after it has been ID mapped, it is invalid to specify MOUNT_ATTR_IDMAP in attr_clr.
For further details, see the subsection “ID-mapped mounts” under NOTES.
The propagation field is used to specify the propagation type of the mount or mount tree. This field either has the value zero, meaning leave the propagation type unchanged, or it has one of the following values:
MS_PRIVATE
Turn all mounts into private mounts.
MS_SHARED
Turn all mounts into shared mounts.
MS_SLAVE
Turn all mounts into dependent mounts.
MS_UNBINDABLE
Turn all mounts into unbindable mounts.
For further details on the above propagation types, see mount_namespaces(7).
RETURN VALUE
On success, mount_setattr() returns zero. On error, -1 is returned and errno is set to indicate the cause of the error.
ERRORS
EBADF
pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EBADF
userns_fd is not a valid file descriptor.
EBUSY
The caller tried to change the mount to MOUNT_ATTR_RDONLY, but the mount still holds files open for writing.
EBUSY
The caller tried to create an ID-mapped mount by raising MOUNT_ATTR_IDMAP and specifying userns_fd, but the mount still holds files open for writing.
EINVAL
The pathname specified via the dirfd and pathname arguments to mount_setattr() isn’t a mount point.
EINVAL
An unsupported value was set in flags.
EINVAL
An unsupported value was specified in the attr_set field of mount_attr.
EINVAL
An unsupported value was specified in the attr_clr field of mount_attr.
EINVAL
An unsupported value was specified in the propagation field of mount_attr.
EINVAL
More than one of MS_SHARED, MS_SLAVE, MS_PRIVATE, or MS_UNBINDABLE was set in the propagation field of mount_attr.
EINVAL
An access-time setting was specified in the attr_set field without MOUNT_ATTR__ATIME being set in the attr_clr field.
EINVAL
MOUNT_ATTR_IDMAP was specified in attr_clr.
EINVAL
A file descriptor value was specified in userns_fd which exceeds INT_MAX.
EINVAL
A valid file descriptor value was specified in userns_fd, but the file descriptor did not refer to a user namespace.
EINVAL
The underlying filesystem does not support ID-mapped mounts.
EINVAL
The mount that is to be ID mapped is not a detached mount; that is, the mount has not previously been visible in a mount namespace.
EINVAL
A partial access-time setting was specified in attr_clr instead of MOUNT_ATTR__ATIME being set.
EINVAL
The mount is located outside the caller’s mount namespace.
EINVAL
The underlying filesystem has been mounted in a mount namespace that is owned by a noninitial user namespace.
ENOENT
A pathname was empty or had a nonexistent component.
ENOMEM
When changing mount propagation to MS_SHARED, a new peer group ID needs to be allocated for all mounts without a peer group ID set. This allocation failed because there was not enough memory to allocate the relevant internal structures.
ENOSPC
When changing mount propagation to MS_SHARED, a new peer group ID needs to be allocated for all mounts without a peer group ID set. This allocation failed because the kernel has run out of IDs.
EPERM
One of the mounts had at least one of MOUNT_ATTR_NOATIME, MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC, MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is locked. Mount attributes become locked on a mount if:
A new mount or mount tree is created causing mount propagation across user namespaces (i.e., propagation to a mount namespace owned by a different user namespace). The kernel will lock the aforementioned flags to prevent these sensitive properties from being altered.
A new mount and user namespace pair is created. This happens for example when specifying CLONE_NEWUSER | CLONE_NEWNS in unshare(2), clone(2), or clone3(2). The aforementioned flags become locked in the new mount namespace to prevent sensitive mount properties from being altered. Since the newly created mount namespace will be owned by the newly created user namespace, a calling process that is privileged in the new user namespace would, in the absence of such locking, be able to alter sensitive mount properties (e.g., to remount a mount that was marked read-only as read-write in the new mount namespace).
EPERM
A valid file descriptor value was specified in userns_fd, but the file descriptor refers to the initial user namespace.
EPERM
An attempt was made to add an ID mapping to a mount that is already ID mapped.
EPERM
The caller does not have CAP_SYS_ADMIN in the initial user namespace.
STANDARDS
Linux.
HISTORY
Linux 5.12.
NOTES
ID-mapped mounts
Creating an ID-mapped mount makes it possible to change the ownership of all files located under a mount. Thus, ID-mapped mounts make it possible to change ownership in a temporary and localized way. It is a localized change because the ownership changes are visible only via a specific mount. All other users and locations where the filesystem is exposed are unaffected. It is a temporary change because the ownership changes are tied to the lifetime of the mount.
Whenever callers interact with the filesystem through an ID-mapped mount, the ID mapping of the mount will be applied to user and group IDs associated with filesystem objects. This encompasses the user and group IDs associated with inodes and also the following xattr(7) keys:
security.capability, whenever filesystem capabilities are stored or returned in the VFS_CAP_REVISION_3 format, which stores a root user ID alongside the capabilities (see capabilities(7)).
system.posix_acl_access and system.posix_acl_default, whenever user IDs or group IDs are stored in ACL_USER or ACL_GROUP entries.
The following conditions must be met in order to create an ID-mapped mount:
The caller must have the CAP_SYS_ADMIN capability in the user namespace the filesystem was mounted in.
The underlying filesystem must support ID-mapped mounts. Currently, the following filesystems support ID-mapped mounts:
- xfs(5) (since Linux 5.12)
- ext4(5) (since Linux 5.12)
- FAT (since Linux 5.12)
- btrfs(5) (since Linux 5.15)
- ntfs3 (since Linux 5.15)
- f2fs (since Linux 5.18)
- erofs (since Linux 5.19)
- overlayfs (ID-mapped lower and upper layers supported since Linux 5.19)
- squashfs (since Linux 6.2)
- tmpfs (since Linux 6.3)
- cephfs (since Linux 6.7)
- hugetlbfs (since Linux 6.9)
The mount must not already be ID-mapped. This also implies that the ID mapping of a mount cannot be altered.
The mount must not have any writers.
The mount must be a detached mount; that is, it must have been created by calling open_tree(2) with the OPEN_TREE_CLONE flag and it must not already have been visible in a mount namespace. (To put things another way: the mount must not have been attached to the filesystem hierarchy with a system call such as move_mount(2).)
ID mappings can be created for user IDs, group IDs, and project IDs. An ID mapping is essentially a mapping of a range of user or group IDs into another or the same range of user or group IDs. ID mappings are written to map files as three numbers separated by white space. The first two numbers specify the starting user or group ID in each of the two user namespaces. The third number specifies the range of the ID mapping. For example, a mapping for user IDs such as “1000 1001 1” would indicate that user ID 1000 in the caller’s user namespace is mapped to user ID 1001 in its ancestor user namespace. Since the map range is 1, only user ID 1000 is mapped.
It is possible to specify up to 340 ID mappings for each ID mapping type. If any user IDs or group IDs are not mapped, all files owned by that unmapped user or group ID will appear as being owned by the overflow user ID or overflow group ID respectively.
Further details on setting up ID mappings can be found in user_namespaces(7).
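The three-number map format described above can be illustrated with a small user-space sketch. It parses one map line and translates an ID from the caller's user namespace to the ancestor user namespace. The names struct id_map, parse_id_map() and map_id() are hypothetical and exist only for this illustration; the kernel's own handling is more involved, and the overflow ID is passed in by the caller rather than read from the system.

```c
#include <stdio.h>

/* Hypothetical illustration type: one line of a map file. */
struct id_map {
    unsigned int inside;   /* first ID in the caller's user namespace */
    unsigned int outside;  /* first ID in the ancestor user namespace */
    unsigned int count;    /* length of the mapped range */
};

/* Parse "inside outside count"; returns 0 on success, -1 on bad input. */
static int
parse_id_map(const char *line, struct id_map *m)
{
    if (sscanf(line, "%u %u %u", &m->inside, &m->outside, &m->count) != 3)
        return -1;
    return m->count > 0 ? 0 : -1;
}

/* Translate an ID from the caller's namespace to the ancestor namespace.
   IDs outside the mapped range are reported as 'overflow_id'. */
static unsigned int
map_id(const struct id_map *m, unsigned int id, unsigned int overflow_id)
{
    if (id >= m->inside && id - m->inside < m->count)
        return m->outside + (id - m->inside);
    return overflow_id;
}
```

With the line "1000 1001 1", map_id() translates 1000 to 1001, and any other ID falls through to the overflow value supplied by the caller.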
In the common case, the user namespace passed in userns_fd (together with MOUNT_ATTR_IDMAP in attr_set) to create an ID-mapped mount will be the user namespace of a container. In other scenarios it will be a dedicated user namespace associated with a user’s login session, as is the case for portable home directories in systemd-homed.service(8). It is also perfectly fine to create a dedicated user namespace for the sake of ID mapping a mount.
ID-mapped mounts can be useful in the following and a variety of other scenarios:
Sharing files or filesystems between multiple users or multiple machines, especially in complex scenarios. For example, ID-mapped mounts are used to implement portable home directories in systemd-homed.service(8), where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different user IDs and group IDs. This effectively makes it possible to assign random user IDs and group IDs at login time.
Sharing files or filesystems from the host with unprivileged containers. This allows a user to avoid having to change ownership permanently through chown(2).
ID mapping a container’s root filesystem. Users don’t need to change ownership permanently through chown(2). Especially for large root filesystems, using chown(2) can be prohibitively expensive.
Sharing files or filesystems between containers with non-overlapping ID mappings.
Implementing discretionary access (DAC) permission checking for filesystems lacking a concept of ownership.
Efficiently changing ownership on a per-mount basis. In contrast to chown(2), changing ownership of large sets of files is instantaneous with ID-mapped mounts. This is especially useful when ownership of an entire root filesystem of a virtual machine or container is to be changed as mentioned above. With ID-mapped mounts, a single mount_setattr() system call will be sufficient to change the ownership of all files.
Taking the current ownership into account. ID mappings specify precisely what a user or group ID is supposed to be mapped to. This contrasts with the chown(2) system call which cannot by itself take the current ownership of the files it changes into account. It simply changes the ownership to the specified user ID and group ID.
Locally and temporarily restricted ownership changes. ID-mapped mounts make it possible to change ownership locally, restricting the ownership changes to specific mounts, and temporarily as the ownership changes only apply as long as the mount exists. By contrast, changing ownership via the chown(2) system call changes the ownership globally and permanently.
Extensibility
In order to allow for future extensibility, mount_setattr() requires the user-space application to specify the size of the mount_attr structure that it is passing. By providing this information, it is possible for mount_setattr() to provide both forwards- and backwards-compatibility, with size acting as an implicit version number. (Because new extension fields will always be appended, the structure size will always increase.) This extensibility design is very similar to other system calls such as sched_setattr(2), perf_event_open(2), clone3(2) and openat2(2).
Let usize be the size of the structure as specified by the user-space application, and let ksize be the size of the structure which the kernel supports, then there are three cases to consider:
If ksize equals usize, then there is no version mismatch and attr can be used verbatim.
If ksize is larger than usize, then there are some extension fields that the kernel supports which the user-space application is unaware of. Because a zero value in any added extension field signifies a no-op, the kernel treats all of the extension fields not provided by the user-space application as having zero values. This provides backwards-compatibility.
If ksize is smaller than usize, then there are some extension fields which the user-space application is aware of but which the kernel does not support. Because any extension field must have its zero values signify a no-op, the kernel can safely ignore the unsupported extension fields if they are all zero. If any unsupported extension fields are non-zero, then -1 is returned and errno is set to E2BIG. This provides forwards-compatibility.
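The three cases above can be sketched in user space as follows. This is not kernel code: copy_versioned_struct() is a hypothetical helper that treats dst as the structure the "kernel" understands (ksize bytes) and src as the structure supplied by the application (usize bytes).

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* User-space sketch of the three size cases; not kernel code.
   Copies 'usize' bytes of 'src' into a 'ksize'-byte 'dst'.
   Returns 0 on success, or -E2BIG if trailing bytes that the
   receiver does not know about are nonzero. */
static int
copy_versioned_struct(void *dst, size_t ksize, const void *src, size_t usize)
{
    const unsigned char *s = src;
    size_t n = usize < ksize ? usize : ksize;

    /* usize > ksize: unknown extension fields must all be zero. */
    for (size_t i = ksize; i < usize; i++)
        if (s[i] != 0)
            return -E2BIG;

    /* usize < ksize: fields the caller did not supply read as zero. */
    memset(dst, 0, ksize);
    memcpy(dst, s, n);
    return 0;
}
```

When usize equals ksize, neither the loop nor the zero padding has any effect and the structure is copied verbatim.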
Because the definition of struct mount_attr may change in the future (with new fields being added when system headers are updated), user-space applications should zero-fill struct mount_attr to ensure that recompiling the program with new headers will not result in spurious errors at run time. The simplest way is to use a designated initializer:
struct mount_attr attr = {
.attr_set = MOUNT_ATTR_RDONLY,
.attr_clr = MOUNT_ATTR_NODEV
};
Alternatively, the structure can be zero-filled using memset(3) or similar functions:
struct mount_attr attr;
memset(&attr, 0, sizeof(attr));
attr.attr_set = MOUNT_ATTR_RDONLY;
attr.attr_clr = MOUNT_ATTR_NODEV;
A user-space application that wishes to determine which extensions the running kernel supports can do so by conducting a binary search on size with a structure which has every byte nonzero (to find the largest value which doesn’t produce an error of E2BIG).
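The binary search mentioned above might look like the following sketch. The probe callback is hypothetical: a real one would call mount_setattr() with a structure of the given size and every byte nonzero, and report whether the call failed with E2BIG. Here simulated_probe() stands in for the kernel so that the search itself can be shown in isolation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe: returns true if a structure of 'size' bytes is
   accepted (that is, the call does not fail with E2BIG). */
typedef bool (*size_probe_fn)(size_t size, void *ctx);

/* Largest size in [lo, hi] accepted by 'probe', assuming that 'lo' is
   accepted and that acceptance is monotonic: every size up to the
   kernel's size passes and every larger size fails. */
static size_t
largest_supported_size(size_probe_fn probe, void *ctx, size_t lo, size_t hi)
{
    while (lo < hi) {
        size_t mid = lo + (hi - lo + 1) / 2;   /* round up to make progress */

        if (probe(mid, ctx))
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Stand-in for a real kernel: accepts sizes up to *(size_t *) ctx. */
static bool
simulated_probe(size_t size, void *ctx)
{
    return size <= *(size_t *) ctx;
}
```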
EXAMPLES
/*
* This program allows the caller to create a new detached mount
* and set various properties on it.
*/
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <getopt.h>
#include <linux/mount.h>
#include <linux/types.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
static inline int
mount_setattr(int dirfd, const char *pathname, unsigned int flags,
struct mount_attr *attr, size_t size)
{
return syscall(SYS_mount_setattr, dirfd, pathname, flags,
attr, size);
}
static inline int
open_tree(int dirfd, const char *filename, unsigned int flags)
{
return syscall(SYS_open_tree, dirfd, filename, flags);
}
static inline int
move_mount(int from_dirfd, const char *from_pathname,
int to_dirfd, const char *to_pathname, unsigned int flags)
{
return syscall(SYS_move_mount, from_dirfd, from_pathname,
to_dirfd, to_pathname, flags);
}
static const struct option longopts[] = {
{"map-mount", required_argument, NULL, 'a'},
{"recursive", no_argument, NULL, 'b'},
{"read-only", no_argument, NULL, 'c'},
{"block-setid", no_argument, NULL, 'd'},
{"block-devices", no_argument, NULL, 'e'},
{"block-exec", no_argument, NULL, 'f'},
{"no-access-time", no_argument, NULL, 'g'},
{ NULL, 0, NULL, 0 },
};
int
main(int argc, char *argv[])
{
int fd_userns = -1;
int fd_tree;
int index = 0;
int ret;
bool recursive = false;
const char *source;
const char *target;
struct mount_attr *attr = &(struct mount_attr){};
while ((ret = getopt_long_only(argc, argv, "",
longopts, &index)) != -1) {
switch (ret) {
case 'a':
fd_userns = open(optarg, O_RDONLY | O_CLOEXEC);
if (fd_userns == -1)
err(EXIT_FAILURE, "open(%s)", optarg);
break;
case 'b':
recursive = true;
break;
case 'c':
attr->attr_set |= MOUNT_ATTR_RDONLY;
break;
case 'd':
attr->attr_set |= MOUNT_ATTR_NOSUID;
break;
case 'e':
attr->attr_set |= MOUNT_ATTR_NODEV;
break;
case 'f':
attr->attr_set |= MOUNT_ATTR_NOEXEC;
break;
case 'g':
attr->attr_set |= MOUNT_ATTR_NOATIME;
attr->attr_clr |= MOUNT_ATTR__ATIME;
break;
default:
errx(EXIT_FAILURE, "Invalid argument specified");
}
}
if ((argc - optind) < 2)
errx(EXIT_FAILURE, "Missing source or target mount point");
source = argv[optind];
target = argv[optind + 1];
/* In the following, -1 as the 'dirfd' argument ensures that
open_tree() fails if 'source' is not an absolute pathname. */
fd_tree = open_tree(-1, source,
OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
if (fd_tree == -1)
err(EXIT_FAILURE, "open_tree(%s)", source);
if (fd_userns >= 0) {
attr->attr_set |= MOUNT_ATTR_IDMAP;
attr->userns_fd = fd_userns;
}
ret = mount_setattr(fd_tree, "",
AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0),
attr, sizeof(struct mount_attr));
if (ret == -1)
err(EXIT_FAILURE, "mount_setattr");
close(fd_userns);
/* In the following, -1 as the 'to_dirfd' argument ensures that
move_mount() fails if 'target' is not an absolute pathname. */
ret = move_mount(fd_tree, "", -1, target,
MOVE_MOUNT_F_EMPTY_PATH);
if (ret == -1)
err(EXIT_FAILURE, "move_mount() to %s", target);
close(fd_tree);
exit(EXIT_SUCCESS);
}
SEE ALSO
newgidmap(1), newuidmap(1), clone(2), mount(2), unshare(2), proc(5), capabilities(7), mount_namespaces(7), user_namespaces(7), xattr(7)
390 - Linux cli command pwrite
NAME pwrite
read from or write to a file descriptor at a given offset
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
ssize_t pread(int fd, void buf[.count], size_t count,
off_t offset);
ssize_t pwrite(int fd, const void buf[.count], size_t count,
off_t offset);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
pread(), pwrite():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
DESCRIPTION
pread() reads up to count bytes from file descriptor fd at offset offset (from the start of the file) into the buffer starting at buf. The file offset is not changed.
pwrite() writes up to count bytes from the buffer starting at buf to the file descriptor fd at offset offset. The file offset is not changed.
The file referenced by fd must be capable of seeking.
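As a minimal sketch of the behavior described above, the following function writes and reads at an explicit offset and then checks that the file offset has not moved. The function name positional_io_demo() is hypothetical, and the temporary file from tmpfile(3) is used only to keep the example self-contained.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes "hello" at offset 100 of an anonymous temporary file, reads it
   back with pread(), and returns the file offset observed afterwards
   (expected to be 0, since neither call moves it). Returns -1 on error. */
static off_t
positional_io_demo(void)
{
    char buf[5];
    off_t pos = -1;
    FILE *fp = tmpfile();

    if (fp == NULL)
        return -1;

    int fd = fileno(fp);

    if (pwrite(fd, "hello", 5, 100) == 5 &&
        pread(fd, buf, sizeof(buf), 100) == 5 &&
        memcmp(buf, "hello", 5) == 0)
        pos = lseek(fd, 0, SEEK_CUR);   /* query the offset without moving it */

    fclose(fp);
    return pos;
}
```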
RETURN VALUE
On success, pread() returns the number of bytes read (a return of zero indicates end of file) and pwrite() returns the number of bytes written.
Note that it is not an error for a successful call to transfer fewer bytes than requested (see read(2) and write(2)).
On error, -1 is returned and errno is set to indicate the error.
ERRORS
pread() can fail and set errno to any error specified for read(2) or lseek(2). pwrite() can fail and set errno to any error specified for write(2) or lseek(2).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
Added in Linux 2.1.60; the entries in the i386 system call table were added in Linux 2.1.69. C library support (including emulation using lseek(2) on older kernels without the system calls) was added in glibc 2.1.
C library/kernel differences
On Linux, the underlying system calls were renamed in Linux 2.6: pread() became pread64(), and pwrite() became pwrite64(). The system call numbers remained the same. The glibc pread() and pwrite() wrapper functions transparently deal with the change.
On some 32-bit architectures, the calling signature for these system calls differs, for the reasons described in syscall(2).
NOTES
The pread() and pwrite() system calls are especially useful in multithreaded applications. They allow multiple threads to perform I/O on the same file descriptor without being affected by changes to the file offset by other threads.
BUGS
POSIX requires that opening a file with the O_APPEND flag should have no effect on the location at which pwrite() writes data. However, on Linux, if a file is opened with O_APPEND, pwrite() appends data to the end of the file, regardless of the value of offset.
SEE ALSO
lseek(2), read(2), readv(2), write(2)
391 - Linux cli command mprotect
NAME mprotect
set protection on a region of memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
int mprotect(void addr[.len], size_t len, int prot);
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/mman.h>
int pkey_mprotect(void addr[.len], size_t len, int prot, int pkey);
DESCRIPTION
mprotect() changes the access protections for the calling process’s memory pages containing any part of the address range in the interval [addr, addr+len-1]. addr must be aligned to a page boundary.
If the calling process tries to access memory in a manner that violates the protections, then the kernel generates a SIGSEGV signal for the process.
prot is a combination of the following access flags: PROT_NONE or a bitwise OR of the other values in the following list:
PROT_NONE
The memory cannot be accessed at all.
PROT_READ
The memory can be read.
PROT_WRITE
The memory can be modified.
PROT_EXEC
The memory can be executed.
PROT_SEM (since Linux 2.5.7)
The memory can be used for atomic operations. This flag was introduced as part of the futex(2) implementation (in order to guarantee the ability to perform atomic operations required by commands such as FUTEX_WAIT), but is not currently used on any architecture.
PROT_SAO (since Linux 2.6.26)
The memory should have strong access ordering. This feature is specific to the PowerPC architecture (version 2.06 of the architecture specification adds the SAO CPU feature, and it is available on POWER 7 or PowerPC A2, for example).
Additionally (since Linux 2.6.0), prot can have one of the following flags set:
PROT_GROWSUP
Apply the protection mode up to the end of a mapping that grows upwards. (Such mappings are created for the stack area on architectures, for example HP-PARISC, that have an upwardly growing stack.)
PROT_GROWSDOWN
Apply the protection mode down to the beginning of a mapping that grows downward (which should be a stack segment or a segment mapped with the MAP_GROWSDOWN flag set).
Like mprotect(), pkey_mprotect() changes the protection on the pages specified by addr and len. The pkey argument specifies the protection key (see pkeys(7)) to assign to the memory. The protection key must be allocated with pkey_alloc(2) before it is passed to pkey_mprotect(). For an example of the use of this system call, see pkeys(7).
RETURN VALUE
On success, mprotect() and pkey_mprotect() return zero. On error, these system calls return -1, and errno is set to indicate the error.
ERRORS
EACCES
The memory cannot be given the specified access. This can happen, for example, if you mmap(2) a file to which you have read-only access, then ask mprotect() to mark it PROT_WRITE.
EINVAL
addr is not a valid pointer, or not a multiple of the system page size.
EINVAL
(pkey_mprotect()) pkey has not been allocated with pkey_alloc(2)
EINVAL
Both PROT_GROWSUP and PROT_GROWSDOWN were specified in prot.
EINVAL
Invalid flags specified in prot.
EINVAL
(PowerPC architecture) PROT_SAO was specified in prot, but SAO hardware feature is not available.
ENOMEM
Internal kernel structures could not be allocated.
ENOMEM
Addresses in the range [addr, addr+len-1] are invalid for the address space of the process, or specify one or more pages that are not mapped. (Before Linux 2.4.19, the error EFAULT was incorrectly produced for these cases.)
ENOMEM
Changing the protection of a memory region would result in the total number of mappings with distinct attributes (e.g., read versus read/write protection) exceeding the allowed maximum. (For example, making the protection of a range PROT_READ in the middle of a region currently protected as PROT_READ|PROT_WRITE would result in three mappings: two read/write mappings at each end and a read-only mapping in the middle.)
VERSIONS
POSIX says that the behavior of mprotect() is unspecified if it is applied to a region of memory that was not obtained via mmap(2).
On Linux, it is always permissible to call mprotect() on any address in a process’s address space (except for the kernel vsyscall area). In particular, it can be used to change existing code mappings to be writable.
Whether PROT_EXEC has any effect different from PROT_READ depends on processor architecture, kernel version, and process state. If READ_IMPLIES_EXEC is set in the process’s personality flags (see personality(2)), specifying PROT_READ will implicitly add PROT_EXEC.
On some hardware architectures (e.g., i386), PROT_WRITE implies PROT_READ.
POSIX.1 says that an implementation may permit access other than that specified in prot, but at a minimum can allow write access only if PROT_WRITE has been set, and must not allow any access if PROT_NONE has been set.
Applications should be careful when mixing use of mprotect() and pkey_mprotect(). On x86, when mprotect() is used with prot set to PROT_EXEC a pkey may be allocated and set on the memory implicitly by the kernel, but only when the pkey was 0 previously.
On systems that do not support protection keys in hardware, pkey_mprotect() may still be used, but pkey must be set to -1. When called this way, the operation of pkey_mprotect() is equivalent to mprotect().
STANDARDS
mprotect()
POSIX.1-2008.
pkey_mprotect()
Linux.
HISTORY
mprotect()
POSIX.1-2001, SVr4.
pkey_mprotect()
Linux 4.9, glibc 2.27.
NOTES
EXAMPLES
The program below demonstrates the use of mprotect(). The program allocates four pages of memory, makes the third of these pages read-only, and then executes a loop that walks upward through the allocated region modifying bytes.
An example of what we might see when running the program is the following:
$ ./a.out
Start of region: 0x804c000
Got SIGSEGV at address: 0x804e000
Program source
#include <malloc.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

static char *buffer;

static void
handler(int sig, siginfo_t *si, void *unused)
{
    /* Note: calling printf() from a signal handler is not safe
       (and should not be done in production programs), since
       printf() is not async-signal-safe; see signal-safety(7).
       Nevertheless, we use printf() here as a simple way of
       showing that the handler was called. */

    printf("Got SIGSEGV at address: %p\n", si->si_addr);
    exit(EXIT_FAILURE);
}

int
main(void)
{
    int pagesize;
    struct sigaction sa;

    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = handler;
    if (sigaction(SIGSEGV, &sa, NULL) == -1)
        handle_error("sigaction");

    pagesize = sysconf(_SC_PAGE_SIZE);
    if (pagesize == -1)
        handle_error("sysconf");

    /* Allocate a buffer aligned on a page boundary;
       initial protection is PROT_READ | PROT_WRITE. */

    buffer = memalign(pagesize, 4 * pagesize);
    if (buffer == NULL)
        handle_error("memalign");

    printf("Start of region: %p\n", buffer);

    if (mprotect(buffer + pagesize * 2, pagesize, PROT_READ) == -1)
        handle_error("mprotect");

    for (char *p = buffer ; ; )
        *(p++) = 'a';

    printf("Loop completed\n");     /* Should never happen */
    exit(EXIT_SUCCESS);
}
SEE ALSO
mmap(2), sysconf(3), pkeys(7)
392 - Linux cli command getuid
NAME getuid
get user identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
uid_t getuid(void);
uid_t geteuid(void);
DESCRIPTION
getuid() returns the real user ID of the calling process.
geteuid() returns the effective user ID of the calling process.
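One common use of the two calls is to detect whether the real and effective user IDs differ, as is typically the case inside a set-user-ID program. The helper name below is hypothetical; this is only a small sketch of that comparison.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* True if the real and effective user IDs differ, as happens while a
   set-user-ID program runs with the file owner's privileges. */
static bool
real_and_effective_uid_differ(void)
{
    return getuid() != geteuid();
}
```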
ERRORS
These functions are always successful and never modify errno.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD.
In UNIX V6 the getuid() call returned (euid << 8) + uid. UNIX V7 introduced separate calls getuid() and geteuid().
The original Linux getuid() and geteuid() system calls supported only 16-bit user IDs. Subsequently, Linux 2.4 added getuid32() and geteuid32(), supporting 32-bit IDs. The glibc getuid() and geteuid() wrapper functions transparently deal with the variations across kernel versions.
On Alpha, instead of a pair of getuid() and geteuid() system calls, a single getxuid() system call is provided, which returns a pair of real and effective UIDs. The glibc getuid() and geteuid() wrapper functions transparently deal with this. See syscall(2) for details regarding register mapping.
SEE ALSO
getresuid(2), setreuid(2), setuid(2), credentials(7)
393 - Linux cli command recvfrom
NAME recvfrom
receive a message from a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
ssize_t recv(int sockfd, void buf[.len], size_t len,
int flags);
ssize_t recvfrom(int sockfd, void buf[restrict .len], size_t len,
int flags,
struct sockaddr *_Nullable restrict src_addr,
socklen_t *_Nullable restrict addrlen);
ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags);
DESCRIPTION
The recv(), recvfrom(), and recvmsg() calls are used to receive messages from a socket. They may be used to receive data on both connectionless and connection-oriented sockets. This page first describes common features of all three system calls, and then describes the differences between the calls.
The only difference between recv() and read(2) is the presence of flags. With a zero flags argument, recv() is generally equivalent to read(2) (but see NOTES). Also, the following call
recv(sockfd, buf, len, flags);
is equivalent to
recvfrom(sockfd, buf, len, flags, NULL, NULL);
All three calls return the length of the message on successful completion. If a message is too long to fit in the supplied buffer, excess bytes may be discarded depending on the type of socket the message is received from.
If no messages are available at the socket, the receive calls wait for a message to arrive, unless the socket is nonblocking (see fcntl(2)), in which case the value -1 is returned and errno is set to EAGAIN or EWOULDBLOCK. The receive calls normally return any data available, up to the requested amount, rather than waiting for receipt of the full amount requested.
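The nonblocking case can be sketched as follows. Instead of marking the socket nonblocking with fcntl(2), this example uses the per-call MSG_DONTWAIT flag, and it checks for both error values as a portable program should. The function name recv_would_block() is hypothetical, and the socketpair(2) of UNIX datagram sockets is only a convenient way to obtain a socket with no pending messages.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Creates a connected pair of UNIX datagram sockets, sends nothing, and
   tries a nonblocking receive. Returns 1 if recv() failed with EAGAIN or
   EWOULDBLOCK, 0 if it behaved differently, -1 if setup failed. */
static int
recv_would_block(void)
{
    int sv[2];
    char buf[16];
    int result = 0;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    ssize_t n = recv(sv[0], buf, sizeof(buf), MSG_DONTWAIT);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        result = 1;

    close(sv[0]);
    close(sv[1]);
    return result;
}
```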
An application can use select(2), poll(2), or epoll(7) to determine when more data arrives on a socket.
The flags argument
The flags argument is formed by ORing one or more of the following values:
MSG_CMSG_CLOEXEC (recvmsg() only; since Linux 2.6.23)
Set the close-on-exec flag for the file descriptor received via a UNIX domain file descriptor using the SCM_RIGHTS operation (described in unix(7)). This flag is useful for the same reasons as the O_CLOEXEC flag of open(2).
MSG_DONTWAIT (since Linux 2.2)
Enables nonblocking operation; if the operation would block, the call fails with the error EAGAIN or EWOULDBLOCK. This provides similar behavior to setting the O_NONBLOCK flag (via the fcntl(2) F_SETFL operation), but differs in that MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a setting on the open file description (see open(2)), which will affect all threads in the calling process as well as other processes that hold file descriptors referring to the same open file description.
MSG_ERRQUEUE (since Linux 2.2)
This flag specifies that queued errors should be received from the socket error queue. The error is passed in an ancillary message with a type dependent on the protocol (for IPv4 IP_RECVERR). The user should supply a buffer of sufficient size. See cmsg(3) and ip(7) for more information. The payload of the original packet that caused the error is passed as normal data via msg_iov. The original destination address of the datagram that caused the error is supplied via msg_name.
The error is supplied in a sock_extended_err structure:
#define SO_EE_ORIGIN_NONE 0
#define SO_EE_ORIGIN_LOCAL 1
#define SO_EE_ORIGIN_ICMP 2
#define SO_EE_ORIGIN_ICMP6 3
struct sock_extended_err
{
uint32_t ee_errno; /* Error number */
uint8_t ee_origin; /* Where the error originated */
uint8_t ee_type; /* Type */
uint8_t ee_code; /* Code */
uint8_t ee_pad; /* Padding */
uint32_t ee_info; /* Additional information */
uint32_t ee_data; /* Other data */
/* More data may follow */
};
struct sockaddr *SO_EE_OFFENDER(struct sock_extended_err *);
ee_errno contains the errno number of the queued error. ee_origin is the origin code of where the error originated. The other fields are protocol-specific. The macro SO_EE_OFFENDER returns a pointer to the address of the network object where the error originated from given a pointer to the ancillary message. If this address is not known, the sa_family member of the sockaddr contains AF_UNSPEC and the other fields of the sockaddr are undefined. The payload of the packet that caused the error is passed as normal data.
For local errors, no address is passed (this can be checked with the cmsg_len member of the cmsghdr). For error receives, the MSG_ERRQUEUE flag is set in the msghdr. After an error has been passed, the pending socket error is regenerated based on the next queued error and will be passed on the next socket operation.
MSG_OOB
This flag requests receipt of out-of-band data that would not be received in the normal data stream. Some protocols place expedited data at the head of the normal data queue, and thus this flag cannot be used with such protocols.
MSG_PEEK
This flag causes the receive operation to return data from the beginning of the receive queue without removing that data from the queue. Thus, a subsequent receive call will return the same data.
MSG_TRUNC (since Linux 2.2)
For raw (AF_PACKET), Internet datagram (since Linux 2.4.27/2.6.8), netlink (since Linux 2.6.22), and UNIX datagram as well as sequenced-packet (since Linux 3.4) sockets: return the real length of the packet or datagram, even when it was longer than the passed buffer.
For use with Internet stream sockets, see tcp(7).
MSG_WAITALL (since Linux 2.2)
This flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned. This flag has no effect for datagram sockets.
recvfrom()
recvfrom() places the received message into the buffer buf. The caller must specify the size of the buffer in len.
If src_addr is not NULL, and the underlying protocol provides the source address of the message, that source address is placed in the buffer pointed to by src_addr. In this case, addrlen is a value-result argument. Before the call, it should be initialized to the size of the buffer associated with src_addr. Upon return, addrlen is updated to contain the actual size of the source address. The returned address is truncated if the buffer provided is too small; in this case, addrlen will return a value greater than was supplied to the call.
If the caller is not interested in the source address, src_addr and addrlen should be specified as NULL.
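The value-result use of addrlen can be sketched with a UDP socket that sends a datagram to itself over the loopback interface. The function name recvfrom_loopback_demo() is hypothetical, and the sketch assumes that IPv4 loopback networking is available.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Binds a UDP socket to an ephemeral loopback port, sends one datagram
   to itself, and receives it with recvfrom(). Returns 0 if the reported
   source address matches the socket's own address, -1 otherwise. */
static int
recvfrom_loopback_demo(void)
{
    struct sockaddr_in self, src;
    socklen_t selflen = sizeof(self);
    socklen_t srclen = sizeof(src);     /* value-result: size in, length out */
    char buf[16];
    int ret = -1;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd == -1)
        return -1;

    memset(&self, 0, sizeof(self));
    self.sin_family = AF_INET;
    self.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    self.sin_port = 0;                  /* let the kernel pick a port */

    if (bind(fd, (struct sockaddr *) &self, sizeof(self)) == 0 &&
        getsockname(fd, (struct sockaddr *) &self, &selflen) == 0 &&
        sendto(fd, "ping", 4, 0, (struct sockaddr *) &self, selflen) == 4 &&
        recvfrom(fd, buf, sizeof(buf), 0,
                 (struct sockaddr *) &src, &srclen) == 4 &&
        srclen == sizeof(src) &&
        src.sin_port == self.sin_port &&
        src.sin_addr.s_addr == self.sin_addr.s_addr)
        ret = 0;

    close(fd);
    return ret;
}
```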
recv()
The recv() call is normally used only on a connected socket (see connect(2)). It is equivalent to the call:
recvfrom(sockfd, buf, len, flags, NULL, NULL);
recvmsg()
The recvmsg() call uses a msghdr structure to minimize the number of directly supplied arguments. This structure is defined as follows in <sys/socket.h>:
struct msghdr {
void *msg_name; /* Optional address */
socklen_t msg_namelen; /* Size of address */
struct iovec *msg_iov; /* Scatter/gather array */
size_t msg_iovlen; /* # elements in msg_iov */
void *msg_control; /* Ancillary data, see below */
size_t msg_controllen; /* Ancillary data buffer len */
int msg_flags; /* Flags on received message */
};
The msg_name field points to a caller-allocated buffer that is used to return the source address if the socket is unconnected. The caller should set msg_namelen to the size of this buffer before this call; upon return from a successful call, msg_namelen will contain the length of the returned address. If the application does not need to know the source address, msg_name can be specified as NULL.
The fields msg_iov and msg_iovlen describe scatter-gather locations, as discussed in readv(2).
The field msg_control, which has length msg_controllen, points to a buffer for other protocol control-related messages or miscellaneous ancillary data. When recvmsg() is called, msg_controllen should contain the length of the available buffer in msg_control; upon return from a successful call it will contain the length of the control message sequence.
The messages are of the form:
struct cmsghdr {
size_t cmsg_len; /* Data byte count, including header
(type is socklen_t in POSIX) */
int cmsg_level; /* Originating protocol */
int cmsg_type; /* Protocol-specific type */
/* followed by
unsigned char cmsg_data[]; */
};
Ancillary data should be accessed only by the macros defined in cmsg(3).
As an example, Linux uses this ancillary data mechanism to pass extended errors, IP options, or file descriptors over UNIX domain sockets. For further information on the use of ancillary data in various socket domains, see unix(7) and ip(7).
The msg_flags field in the msghdr is set on return of recvmsg(). It can contain several flags:
MSG_EOR
indicates end-of-record; the data returned completed a record (generally used with sockets of type SOCK_SEQPACKET).
MSG_TRUNC
indicates that the trailing portion of a datagram was discarded because the datagram was larger than the buffer supplied.
MSG_CTRUNC
indicates that some control data was discarded due to lack of space in the buffer for ancillary data.
MSG_OOB
is returned to indicate that expedited or out-of-band data was received.
MSG_ERRQUEUE
indicates that no data was received, but that an extended error from the socket error queue was.
MSG_CMSG_CLOEXEC (since Linux 2.6.23)
indicates that MSG_CMSG_CLOEXEC was specified in the flags argument of recvmsg().
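The following sketch ties together the msghdr fields and the returned msg_flags. It receives a datagram that is larger than the supplied scatter buffers and checks that MSG_TRUNC is reported. The function name recvmsg_truncation_demo() is hypothetical, and a UNIX datagram socketpair(2) is used only to keep the example self-contained.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sends a 10-byte datagram over a UNIX socket pair and receives it into
   two small buffers (3 + 3 bytes) with recvmsg(). Returns 0 if the call
   delivered 6 bytes, scattered them across both buffers, and reported
   MSG_TRUNC in msg_flags; -1 otherwise. */
static int
recvmsg_truncation_demo(void)
{
    int sv[2];
    char a[3], b[3];
    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = sizeof(a) },
        { .iov_base = b, .iov_len = sizeof(b) },
    };
    struct msghdr msg;
    int ret = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    memset(&msg, 0, sizeof(msg));       /* no address, no ancillary data */
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    if (send(sv[1], "0123456789", 10, 0) == 10 &&
        recvmsg(sv[0], &msg, 0) == 6 &&
        memcmp(a, "012", 3) == 0 &&
        memcmp(b, "345", 3) == 0 &&
        (msg.msg_flags & MSG_TRUNC))
        ret = 0;

    close(sv[0]);
    close(sv[1]);
    return ret;
}
```

Because MSG_TRUNC is not passed in the flags argument here, the return value is the number of bytes actually copied rather than the full length of the datagram.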
RETURN VALUE
These calls return the number of bytes received, or -1 if an error occurred. In the event of an error, errno is set to indicate the error.
When a stream socket peer has performed an orderly shutdown, the return value will be 0 (the traditional “end-of-file” return).
Datagram sockets in various domains (e.g., the UNIX and Internet domains) permit zero-length datagrams. When such a datagram is received, the return value is 0.
The value 0 may also be returned if the requested number of bytes to receive from a stream socket was 0.
ERRORS
These are some standard errors generated by the socket layer. Additional errors may be generated and returned from the underlying protocol modules; see their manual pages.
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and the receive operation would block, or a receive timeout had been set and the timeout expired before data was received. POSIX.1 allows either error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities.
EBADF
The argument sockfd is an invalid file descriptor.
ECONNREFUSED
A remote host refused to allow the network connection (typically because it is not running the requested service).
EFAULT
The receive buffer pointer(s) point outside the process’s address space.
EINTR
The receive was interrupted by delivery of a signal before any data was available; see signal(7).
EINVAL
Invalid argument passed.
ENOMEM
Could not allocate memory for recvmsg().
ENOTCONN
The socket is associated with a connection-oriented protocol and has not been connected (see connect(2) and accept(2)).
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
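As an illustration of the EAGAIN/EWOULDBLOCK and EINTR cases above, the following sketch classifies the outcome of a single nonblocking recv(). The enum, try_recv(), and demo_would_block() are hypothetical names introduced for this example, which assumes that the MSG_DONTWAIT flag is available (as it is on Linux).

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Outcome codes used by this sketch (hypothetical names). */
enum recv_status { RECV_DATA, RECV_EOF, RECV_WOULD_BLOCK, RECV_ERROR };

/* Attempt one nonblocking receive, retrying if interrupted by a
   signal. On RECV_DATA, *nread holds the number of bytes received. */
enum recv_status
try_recv(int fd, void *buf, size_t len, ssize_t *nread)
{
    for (;;) {
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);

        if (n > 0) {
            *nread = n;
            return RECV_DATA;
        }
        if (n == 0)
            return RECV_EOF;    /* stream peer shut down, or len was 0 */
        if (errno == EINTR)
            continue;           /* interrupted before any data arrived */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return RECV_WOULD_BLOCK;    /* check both, as POSIX.1 allows */
        return RECV_ERROR;
    }
}

/* Hypothetical demonstration on a connected UNIX stream socket pair:
   returns 1 if an empty socket reports RECV_WOULD_BLOCK and a
   subsequent 5-byte send is then received in full, 0 otherwise. */
int
demo_would_block(void)
{
    int sv[2], ok;
    char buf[16];
    ssize_t n = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 0;

    ok = try_recv(sv[1], buf, sizeof(buf), &n) == RECV_WOULD_BLOCK;
    ok = ok && send(sv[0], "hello", 5, 0) == 5;
    ok = ok && try_recv(sv[1], buf, sizeof(buf), &n) == RECV_DATA && n == 5;

    close(sv[0]);
    close(sv[1]);
    return ok;
}
```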
VERSIONS
According to POSIX.1, the msg_controllen field of the msghdr structure should be typed as socklen_t, and the msg_iovlen field should be typed as int, but glibc currently types both as size_t.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.4BSD (first appeared in 4.2BSD).
POSIX.1 describes only the MSG_OOB, MSG_PEEK, and MSG_WAITALL flags.
NOTES
If a zero-length datagram is pending, read(2) and recv() with a flags argument of zero provide different behavior. In this circumstance, read(2) has no effect (the datagram remains pending), while recv() consumes the pending datagram.
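The recv() half of this behavior can be sketched as follows, assuming a UNIX domain datagram socket pair. The function name demo_zero_length_datagram() is hypothetical, and the example does not exercise the read(2) case.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demonstration: queue a zero-length datagram followed by
   a one-byte datagram, then receive twice with recv().
   Returns 1 if the first recv() returned 0 (consuming the empty
   datagram) and the second returned the one-byte datagram. */
int
demo_zero_length_datagram(void)
{
    int sv[2], ok;
    char buf[8];

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return 0;

    ok = send(sv[0], "", 0, 0) == 0;            /* empty datagram */
    ok = ok && send(sv[0], "x", 1, 0) == 1;     /* one-byte datagram */
    ok = ok && recv(sv[1], buf, sizeof(buf), 0) == 0;
    ok = ok && recv(sv[1], buf, sizeof(buf), 0) == 1 && buf[0] == 'x';

    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

Because a return value of 0 can also mean end-of-file on a stream socket, code that handles both socket types needs to know which kind of socket it is reading from.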
See recvmmsg(2) for information about a Linux-specific system call that can be used to receive multiple datagrams in a single call.
EXAMPLES
An example of the use of recvfrom() is shown in getaddrinfo(3).
SEE ALSO
fcntl(2), getsockopt(2), read(2), recvmmsg(2), select(2), shutdown(2), socket(2), cmsg(3), sockatmark(3), ip(7), ipv6(7), socket(7), tcp(7), udp(7), unix(7)
394 - Linux cli command setns
NAME
setns - reassociate thread with a namespace
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sched.h>
int setns(int fd, int nstype);
DESCRIPTION
The setns() system call allows the calling thread to move into different namespaces. The fd argument is one of the following:
a file descriptor referring to one of the magic links in a /proc/pid/ns/ directory (or a bind mount to such a link);
a PID file descriptor (see pidfd_open(2)).
The nstype argument is interpreted differently in each case.
fd refers to a /proc/pid/ns/ link
If fd refers to a /proc/pid/ns/ link, then setns() reassociates the calling thread with the namespace associated with that link, subject to any constraints imposed by the nstype argument. In this usage, each call to setns() changes just one of the caller’s namespace memberships.
The nstype argument specifies which type of namespace the calling thread may be reassociated with. This argument can have one of the following values:
0
Allow any type of namespace to be joined.
CLONE_NEWCGROUP (since Linux 4.6)
fd must refer to a cgroup namespace.
CLONE_NEWIPC (since Linux 3.0)
fd must refer to an IPC namespace.
CLONE_NEWNET (since Linux 3.0)
fd must refer to a network namespace.
CLONE_NEWNS (since Linux 3.8)
fd must refer to a mount namespace.
CLONE_NEWPID (since Linux 3.8)
fd must refer to a descendant PID namespace.
CLONE_NEWTIME (since Linux 5.8)
fd must refer to a time namespace.
CLONE_NEWUSER (since Linux 3.8)
fd must refer to a user namespace.
CLONE_NEWUTS (since Linux 3.0)
fd must refer to a UTS namespace.
Specifying nstype as 0 suffices if the caller knows (or does not care) what type of namespace is referred to by fd. Specifying a nonzero value for nstype is useful if the caller does not know what type of namespace is referred to by fd and wants to ensure that the namespace is of a particular type. (The caller might not know the type of the namespace referred to by fd if the file descriptor was opened by another process and, for example, passed to the caller via a UNIX domain socket.)
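A minimal sketch of this usage follows: it opens a /proc/pid/ns/ link and passes a nonzero nstype so that setns() rejects a namespace of the wrong type. The helper names ns_link_path() and join_namespace() are hypothetical, and the caller is assumed to hold the capabilities described under "Details for specific namespace types".

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Build the pathname of a namespace link, e.g. /proc/1234/ns/uts.
   Returns 0 on success, -1 if the result does not fit in buf. */
int
ns_link_path(char *buf, size_t size, pid_t pid, const char *name)
{
    int n = snprintf(buf, size, "/proc/%ld/ns/%s", (long) pid, name);

    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}

/* Join the namespace 'name' (e.g. "uts") of process 'pid', insisting
   that it is of type 'nstype' (a CLONE_NEW* constant, or 0 for any).
   Returns 0 on success, or -1 with errno set. */
int
join_namespace(pid_t pid, const char *name, int nstype)
{
    char path[64];
    int fd, ret, saved_errno;

    if (strchr(name, '/') != NULL) {    /* accept a bare link name only */
        errno = EINVAL;
        return -1;
    }
    if (ns_link_path(path, sizeof(path), pid, name) == -1) {
        errno = ENAMETOOLONG;
        return -1;
    }

    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* Fails with EINVAL if fd does not refer to an 'nstype' namespace. */
    ret = setns(fd, nstype);
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return ret;
}
```

For example, join_namespace(3550, "uts", CLONE_NEWUTS) would attempt to join the UTS namespace of PID 3550 and fail rather than silently join a namespace of some other type.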
fd is a PID file descriptor
Since Linux 5.8, fd may refer to a PID file descriptor obtained from pidfd_open(2) or clone(2). In this usage, setns() atomically moves the calling thread into one or more of the same namespaces as the thread referred to by fd.
The nstype argument is a bit mask specified by ORing together one or more of the CLONE_NEW* namespace constants listed above. The caller is moved into each of the target thread’s namespaces that is specified in nstype; the caller’s memberships in the remaining namespaces are left unchanged.
For example, the following code would move the caller into the same user, network, and UTS namespaces as PID 1234, but would leave the caller’s other namespace memberships unchanged:
int fd = pidfd_open(1234, 0);
setns(fd, CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWUTS);
Details for specific namespace types
Note the following details and restrictions when reassociating with specific namespace types:
User namespaces
A process reassociating itself with a user namespace must have the CAP_SYS_ADMIN capability in the target user namespace. (This necessarily implies that it is only possible to join a descendant user namespace.) Upon successfully joining a user namespace, a process is granted all capabilities in that namespace, regardless of its user and group IDs.
A multithreaded process may not change user namespace with setns().
It is not permitted to use setns() to reenter the caller’s current user namespace. This prevents a caller that has dropped capabilities from regaining those capabilities via a call to setns().
For security reasons, a process can’t join a new user namespace if it is sharing filesystem-related attributes (the attributes whose sharing is controlled by the clone(2) CLONE_FS flag) with another process.
For further details on user namespaces, see user_namespaces(7).
Mount namespaces
Changing the mount namespace requires that the caller possess both CAP_SYS_CHROOT and CAP_SYS_ADMIN capabilities in its own user namespace and CAP_SYS_ADMIN in the user namespace that owns the target mount namespace.
A process can’t join a new mount namespace if it is sharing filesystem-related attributes (the attributes whose sharing is controlled by the clone(2) CLONE_FS flag) with another process.
See user_namespaces(7) for details on the interaction of user namespaces and mount namespaces.
PID namespaces
In order to reassociate itself with a new PID namespace, the caller must have the CAP_SYS_ADMIN capability both in its own user namespace and in the user namespace that owns the target PID namespace.
Reassociating the PID namespace has somewhat different semantics from other namespace types. Reassociating the calling thread with a PID namespace changes only the PID namespace that subsequently created child processes of the caller will be placed in; it does not change the PID namespace of the caller itself.
Reassociating with a PID namespace is allowed only if the target PID namespace is a descendant (child, grandchild, etc.) of, or is the same as, the current PID namespace of the caller.
For further details on PID namespaces, see pid_namespaces(7).
Cgroup namespaces
In order to reassociate itself with a new cgroup namespace, the caller must have the CAP_SYS_ADMIN capability both in its own user namespace and in the user namespace that owns the target cgroup namespace.
Using setns() to change the caller’s cgroup namespace does not change the caller’s cgroup memberships.
Network, IPC, time, and UTS namespaces
In order to reassociate itself with a new network, IPC, time, or UTS namespace, the caller must have the CAP_SYS_ADMIN capability both in its own user namespace and in the user namespace that owns the target namespace.
RETURN VALUE
On success, setns() returns 0. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
fd is not a valid file descriptor.
EINVAL
fd refers to a namespace whose type does not match that specified in nstype.
EINVAL
There is a problem with reassociating the thread with the specified namespace.
EINVAL
The caller tried to join an ancestor (parent, grandparent, and so on) PID namespace.
EINVAL
The caller attempted to join the user namespace in which it is already a member.
EINVAL
The caller shares filesystem (CLONE_FS) state (in particular, the root directory) with other processes and tried to join a new user namespace.
EINVAL
The caller is multithreaded and tried to join a new user namespace.
EINVAL
fd is a PID file descriptor and nstype is invalid (e.g., it is 0).
ENOMEM
Cannot allocate sufficient memory to change the specified namespace.
EPERM
The calling thread did not have the required capability for this operation.
ESRCH
fd is a PID file descriptor but the process it refers to no longer exists (i.e., it has terminated and been waited on).
STANDARDS
Linux.
VERSIONS
Linux 3.0, glibc 2.14.
NOTES
For further information on the /proc/pid/ns/ magic links, see namespaces(7).
Not all of the attributes that can be shared when a new thread is created using clone(2) can be changed using setns().
EXAMPLES
The program below takes two or more arguments. The first argument specifies the pathname of a namespace file in an existing /proc/pid/ns/ directory. The remaining arguments specify a command and its arguments. The program opens the namespace file, joins that namespace using setns(), and executes the specified command inside that namespace.
The following shell session demonstrates the use of this program (compiled as a binary named ns_exec) in conjunction with the CLONE_NEWUTS example program in the clone(2) man page (compiled as a binary named newuts).
We begin by executing the example program in clone(2) in the background. That program creates a child in a separate UTS namespace. The child changes the hostname in its namespace, and then both processes display the hostnames in their UTS namespaces, so that we can see that they are different.
$ su # Need privilege for namespace operations
Password:
# ./newuts bizarro &
[1] 3549
clone() returned 3550
uts.nodename in child: bizarro
uts.nodename in parent: antero
# uname -n # Verify hostname in the shell
antero
We then run the program shown below, using it to execute a shell. Inside that shell, we verify that the hostname is the one set by the child created by the first program:
# ./ns_exec /proc/3550/ns/uts /bin/bash
# uname -n # Executed in shell started by ns_exec
bizarro
Program source
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    int fd;

    if (argc < 3) {
        fprintf(stderr, "%s /proc/PID/ns/FILE cmd args...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Get file descriptor for namespace; the file descriptor is opened
       with O_CLOEXEC so as to ensure that it is not inherited by the
       program that is later executed. */
    fd = open(argv[1], O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        err(EXIT_FAILURE, "open");

    if (setns(fd, 0) == -1)       /* Join that namespace */
        err(EXIT_FAILURE, "setns");

    execvp(argv[2], &argv[2]);    /* Execute a command in namespace */
    err(EXIT_FAILURE, "execvp");
}
SEE ALSO
nsenter(1), clone(2), fork(2), unshare(2), vfork(2), namespaces(7), unix(7)
395 - Linux cli command mpx
NAME
mpx - unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
396 - Linux cli command nfsservctl
NAME
nfsservctl - syscall interface to kernel nfs daemon
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/nfsd/syscall.h>
long nfsservctl(int cmd, struct nfsctl_arg *argp,
union nfsctl_res *resp);
DESCRIPTION
Note: Since Linux 3.1, this system call no longer exists. It has been replaced by a set of files in the nfsd filesystem; see nfsd(7).
/*
* These are the commands understood by nfsctl().
*/
#define NFSCTL_SVC 0 /* This is a server process. */
#define NFSCTL_ADDCLIENT 1 /* Add an NFS client. */
#define NFSCTL_DELCLIENT 2 /* Remove an NFS client. */
#define NFSCTL_EXPORT 3 /* Export a filesystem. */
#define NFSCTL_UNEXPORT 4 /* Unexport a filesystem. */
#define NFSCTL_UGIDUPDATE 5 /* Update a client's UID/GID map
(only in Linux 2.4.x and earlier). */
#define NFSCTL_GETFH 6 /* Get a file handle (used by mountd(8))
(only in Linux 2.4.x and earlier). */
struct nfsctl_arg {
int ca_version; /* safeguard */
union {
struct nfsctl_svc u_svc;
struct nfsctl_client u_client;
struct nfsctl_export u_export;
struct nfsctl_uidmap u_umap;
struct nfsctl_fhparm u_getfh;
unsigned int u_debug;
} u;
};
union nfsctl_res {
struct knfs_fh cr_getfh;
unsigned int cr_debug;
};
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
STANDARDS
Linux.
HISTORY
Removed in Linux 3.1. Removed in glibc 2.28.
SEE ALSO
nfsd(7)
397 - Linux cli command getmsg
NAME
getmsg - unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
398 - Linux cli command pkey_free
NAME
pkey_free - allocate or free a protection key
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/mman.h>
int pkey_alloc(unsigned int flags, unsigned int access_rights);
int pkey_free(int pkey);
DESCRIPTION
pkey_alloc() allocates a protection key (pkey) and allows it to be passed to pkey_mprotect(2).
The pkey_alloc() flags argument is reserved for future use and currently must always be specified as 0.
The pkey_alloc() access_rights argument may contain zero or more disable operations:
PKEY_DISABLE_ACCESS
Disable all data access to memory covered by the returned protection key.
PKEY_DISABLE_WRITE
Disable write access to memory covered by the returned protection key.
pkey_free() frees a protection key and makes it available for later allocations. After a protection key has been freed, it may no longer be used in any protection-key-related operations.
An application should not call pkey_free() on any protection key which has been assigned to an address range by pkey_mprotect(2) and which is still in use. The behavior in this case is undefined and may result in an error.
RETURN VALUE
On success, pkey_alloc() returns a positive protection key value. On success, pkey_free() returns zero. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
pkey, flags, or access_rights is invalid.
ENOSPC
(pkey_alloc()) All protection keys available for the current process have been allocated. The number of keys available is architecture-specific and implementation-specific and may be reduced by kernel-internal use of certain keys. There are currently 15 keys available to user programs on x86.
This error will also be returned if the processor or operating system does not support protection keys. Applications should always be prepared to handle this error, since factors outside of the application’s control can reduce the number of available pkeys.
STANDARDS
Linux.
HISTORY
Linux 4.9, glibc 2.27.
NOTES
pkey_alloc() is always safe to call regardless of whether or not the operating system supports protection keys. It can be used in lieu of any other mechanism for detecting pkey support and will simply fail with the error ENOSPC if the operating system has no pkey support.
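The probing approach described above might look like the following sketch. The function name pkeys_usable() is hypothetical, and the example assumes a C library that provides the pkey_alloc() and pkey_free() wrappers (glibc 2.27 or later).

```c
#define _GNU_SOURCE
#include <sys/mman.h>

/* Returns 1 if a protection key could be allocated (and was then
   freed again), 0 otherwise. A result of 0 may mean either that
   protection keys are unsupported or that no key is currently free. */
int
pkeys_usable(void)
{
    int pkey = pkey_alloc(0, 0);    /* flags must be 0; no rights disabled */

    if (pkey == -1)
        return 0;

    pkey_free(pkey);
    return 1;
}
```

A caller that gets 0 from this probe can fall back to plain mprotect(2)-style protection without treating the failure as fatal.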
The kernel guarantees that the contents of the hardware rights register (PKRU) will be preserved only for allocated protection keys. Any time a key is unallocated (either before the first call returning that key from pkey_alloc() or after it is freed via pkey_free()), the kernel may make arbitrary changes to the parts of the rights register affecting access to that key.
EXAMPLES
See pkeys(7).
SEE ALSO
pkey_mprotect(2), pkeys(7)
399 - Linux cli command pause
NAME
pause - wait for signal
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int pause(void);
DESCRIPTION
pause() causes the calling process (or thread) to sleep until a signal is delivered that either terminates the process or causes the invocation of a signal-catching function.
RETURN VALUE
pause() returns only when a signal was caught and the signal-catching function returned. In this case, pause() returns -1, and errno is set to EINTR.
ERRORS
EINTR
A signal was caught and the signal-catching function returned.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
SEE ALSO
kill(2), select(2), signal(2), sigsuspend(2)
400 - Linux cli command setregid
NAME
setregid - set real and/or effective user or group ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setreuid(uid_t ruid, uid_t euid);
int setregid(gid_t rgid, gid_t egid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
setreuid(), setregid():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
setreuid() sets real and effective user IDs of the calling process.
Supplying a value of -1 for either the real or effective user ID forces the system to leave that ID unchanged.
Unprivileged processes may only set the effective user ID to the real user ID, the effective user ID, or the saved set-user-ID.
Unprivileged users may only set the real user ID to the real user ID or the effective user ID.
If the real user ID is set (i.e., ruid is not -1) or the effective user ID is set to a value not equal to the previous real user ID, the saved set-user-ID will be set to the new effective user ID.
Completely analogously, setregid() sets real and effective group IDs of the calling process, and all of the above holds with “group” instead of “user”.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where setreuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from setreuid().
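As a hedged sketch of checking for failure, the function below (hypothetical name drop_to_real_uid()) sets both the real and effective user IDs to the caller's current real UID and then verifies the result with getresuid(2). Because ruid is not -1, the saved set-user-ID is also expected to become the new effective UID, as described in DESCRIPTION.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>

/* Set the real and effective user IDs to the caller's current real
   UID and confirm that all three IDs (real, effective, saved) now
   match. Returns 0 on success, -1 on failure. */
int
drop_to_real_uid(void)
{
    uid_t uid = getuid();
    uid_t r, e, s;

    if (setreuid(uid, uid) == -1)
        return -1;              /* never ignore this failure */

    if (getresuid(&r, &e, &s) == -1)
        return -1;

    return (r == uid && e == uid && s == uid) ? 0 : -1;
}
```

A program that treats a -1 return here as fatal avoids continuing with privileges it believed it had given up.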
ERRORS
EAGAIN
The call would change the caller’s real UID (i.e., ruid does not match the caller’s real UID), but there was a temporary failure allocating the necessary kernel data structures.
EAGAIN
ruid does not match the caller’s real UID and this call would bring the number of processes belonging to the real user ID ruid over the caller’s RLIMIT_NPROC resource limit. Since Linux 3.1, this error case no longer occurs (but robust applications should check for this error); see the description of EAGAIN in execve(2).
EINVAL
One or more of the target user or group IDs is not valid in this user namespace.
EPERM
The calling process is not privileged (on Linux, does not have the necessary capability in its user namespace: CAP_SETUID in the case of setreuid(), or CAP_SETGID in the case of setregid()) and a change other than (i) swapping the effective user (group) ID with the real user (group) ID, or (ii) setting one to the value of the other or (iii) setting the effective user (group) ID to the value of the saved set-user-ID (saved set-group-ID) was specified.
VERSIONS
POSIX.1 does not specify all of the UID changes that Linux permits for an unprivileged process. For setreuid(), the effective user ID can be made the same as the real user ID or the saved set-user-ID, and it is unspecified whether unprivileged processes may set the real user ID to the real user ID, the effective user ID, or the saved set-user-ID. For setregid(), the real group ID can be changed to the value of the saved set-group-ID, and the effective group ID can be changed to the value of the real group ID or the saved set-group-ID. The precise details of what ID changes are permitted vary across implementations.
POSIX.1 makes no specification about the effect of these calls on the saved set-user-ID and saved set-group-ID.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD (first appeared in 4.2BSD).
Setting the effective user (group) ID to the saved set-user-ID (saved set-group-ID) is possible since Linux 1.1.37 (1.1.38).
The original Linux setreuid() and setregid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added setreuid32() and setregid32(), supporting 32-bit IDs. The glibc setreuid() and setregid() wrapper functions transparently deal with the variations across kernel versions.
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including those for setreuid() and setregid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
SEE ALSO
getgid(2), getuid(2), seteuid(2), setgid(2), setresuid(2), setuid(2), capabilities(7), credentials(7), user_namespaces(7)
401 - Linux cli command pidfd_send_signal
NAME
pidfd_send_signal - send a signal to a process specified by a file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/signal.h> /* Definition of SIG* constants */
#include <signal.h> /* Definition of SI_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_pidfd_send_signal, int pidfd, int sig,
siginfo_t *_Nullable info, unsigned int flags);
Note: glibc provides no wrapper for pidfd_send_signal(), necessitating the use of syscall(2).
DESCRIPTION
The pidfd_send_signal() system call sends the signal sig to the target process referred to by pidfd, a PID file descriptor that refers to a process.
If the info argument points to a siginfo_t buffer, that buffer should be populated as described in rt_sigqueueinfo(2).
If the info argument is a null pointer, this is equivalent to specifying a pointer to a siginfo_t buffer whose fields match the values that are implicitly supplied when a signal is sent using kill(2):
si_signo is set to the signal number;
si_errno is set to 0;
si_code is set to SI_USER;
si_pid is set to the caller’s PID; and
si_uid is set to the caller’s real user ID.
The calling process must either be in the same PID namespace as the process referred to by pidfd, or be in an ancestor of that namespace.
The flags argument is reserved for future use; currently, this argument must be specified as 0.
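The null-pointer form of info described above can be wrapped as in the sketch below. The name pidfd_kill() is hypothetical, and the example assumes kernel headers that define SYS_pidfd_send_signal.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Send 'sig' to the process referred to by 'pidfd', passing a null
   'info' so that the kernel supplies the same siginfo_t fields that
   kill(2) would. 'flags' is passed as 0, as currently required.
   Returns 0 on success, -1 with errno set on error. */
int
pidfd_kill(int pidfd, int sig)
{
    return (int) syscall(SYS_pidfd_send_signal, pidfd, sig, NULL, 0);
}
```

A caller holding a PID file descriptor could then write pidfd_kill(pidfd, SIGTERM) and check for -1 in the usual way.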
RETURN VALUE
On success, pidfd_send_signal() returns 0. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
pidfd is not a valid PID file descriptor.
EINVAL
sig is not a valid signal.
EINVAL
The calling process is not in a PID namespace from which it can send a signal to the target process.
EINVAL
flags is not 0.
EPERM
The calling process does not have permission to send the signal to the target process.
EPERM
pidfd doesn’t refer to the calling process, and info.si_code is invalid (see rt_sigqueueinfo(2)).
ESRCH
The target process does not exist (i.e., it has terminated and been waited on).
STANDARDS
Linux.
HISTORY
Linux 5.1.
NOTES
PID file descriptors
The pidfd argument is a PID file descriptor, a file descriptor that refers to a process. Such a file descriptor can be obtained in any of the following ways:
by opening a /proc/pid directory;
using pidfd_open(2); or
via the PID file descriptor that is returned by a call to clone(2) or clone3(2) that specifies the CLONE_PIDFD flag.
The pidfd_send_signal() system call allows the avoidance of race conditions that occur when using traditional interfaces (such as kill(2)) to signal a process. The problem is that the traditional interfaces specify the target process via a process ID (PID), with the result that the sender may accidentally send a signal to the wrong process if the originally intended target process has terminated and its PID has been recycled for another process. By contrast, a PID file descriptor is a stable reference to a specific process; if that process terminates, pidfd_send_signal() fails with the error ESRCH.
EXAMPLES
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
static int
pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
unsigned int flags)
{
return syscall(SYS_pidfd_send_signal, pidfd, sig, info, flags);
}
int
main(int argc, char *argv[])
{
    int        pidfd, sig;
    char       path[PATH_MAX];
    siginfo_t  info;

    if (argc != 3) {
        fprintf(stderr, "Usage: %s <pid> <signal>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    sig = atoi(argv[2]);

    /* Obtain a PID file descriptor by opening the /proc/PID directory
       of the target process. */
    snprintf(path, sizeof(path), "/proc/%s", argv[1]);
    pidfd = open(path, O_RDONLY);
    if (pidfd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    /* Populate a 'siginfo_t' structure for use with
       pidfd_send_signal(). */
    memset(&info, 0, sizeof(info));
    info.si_code = SI_QUEUE;
    info.si_signo = sig;
    info.si_errno = 0;
    info.si_uid = getuid();
    info.si_pid = getpid();
    info.si_value.sival_int = 1234;

    /* Send the signal. */
    if (pidfd_send_signal(pidfd, sig, &info, 0) == -1) {
        perror("pidfd_send_signal");
        exit(EXIT_FAILURE);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
clone(2), kill(2), pidfd_open(2), rt_sigqueueinfo(2), sigaction(2), pid_namespaces(7), signal(7)
402 - Linux cli command ioctl_ficlone
NAME
ioctl_ficlone - share some of the data of one file with another file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/fs.h> /* Definition of FICLONE* constants */
#include <sys/ioctl.h>
int ioctl(int dest_fd, FICLONERANGE, struct file_clone_range *arg);
int ioctl(int dest_fd, FICLONE, int src_fd);
DESCRIPTION
If a filesystem supports sharing physical storage between multiple files (“reflink”), this ioctl(2) operation can be used to make some of the data in the src_fd file appear in the dest_fd file by sharing the underlying storage, which is faster than making a separate physical copy of the data. Both files must reside within the same filesystem. If a file write should occur to a shared region, the filesystem must ensure that the changes remain private to the file being written. This behavior is commonly referred to as “copy on write”.
This ioctl reflinks up to src_length bytes from file descriptor src_fd at offset src_offset into the file dest_fd at offset dest_offset, provided that both are files. If src_length is zero, the ioctl reflinks to the end of the source file. This information is conveyed in a structure of the following form:
struct file_clone_range {
__s64 src_fd;
__u64 src_offset;
__u64 src_length;
__u64 dest_offset;
};
Clones are atomic with regard to concurrent writes, so no locks need to be taken to obtain a consistent cloned copy.
The FICLONE ioctl clones entire files.
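The two operations described above might be wrapped as in the following minimal sketch. The helper names make_clone_range(), reflink_whole_file(), and reflink_range() are hypothetical and not part of any library; the sketch assumes kernel headers that define FICLONE, FICLONERANGE, and struct file_clone_range in <linux/fs.h>.

```c
#include <assert.h>
#include <errno.h>
#include <linux/fs.h>   /* FICLONE, FICLONERANGE, struct file_clone_range */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: describe a clone of `len` bytes (0 means "to the
   end of the source file") from src_fd at src_off to offset dest_off. */
struct file_clone_range
make_clone_range(int src_fd, uint64_t src_off, uint64_t len,
                 uint64_t dest_off)
{
    struct file_clone_range r;

    memset(&r, 0, sizeof(r));
    r.src_fd = src_fd;
    r.src_offset = src_off;
    r.src_length = len;
    r.dest_offset = dest_off;
    return r;
}

/* Hypothetical helper: reflink all of src_fd into dest_fd.
   Returns the ioctl result: -1 with errno set on failure (for example
   EOPNOTSUPP or EXDEV, where a caller might fall back to copying). */
int
reflink_whole_file(int dest_fd, int src_fd)
{
    return ioctl(dest_fd, FICLONE, src_fd);
}

/* Hypothetical helper: reflink only the range described by `r`. */
int
reflink_range(int dest_fd, const struct file_clone_range *r)
{
    assert(r != NULL);
    return ioctl(dest_fd, FICLONERANGE, r);
}
```

A caller would open the source for reading and the destination for writing on the same filesystem, then call reflink_whole_file(dest_fd, src_fd), or build a range with make_clone_range() and pass it to reflink_range().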
RETURN VALUE
On success, 0 is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Error codes can be one of, but are not limited to, the following:
EBADF
src_fd is not open for reading; dest_fd is not open for writing or is open for append-only writes; or the filesystem which src_fd resides on does not support reflink.
EINVAL
The filesystem does not support reflinking the ranges of the given files. This error can also appear if either file descriptor represents a device, FIFO, or socket. Disk filesystems generally require the offset and length arguments to be aligned to the fundamental block size. XFS and Btrfs do not support overlapping reflink ranges in the same file.
EISDIR
One of the files is a directory and the filesystem does not support shared regions in directories.
EOPNOTSUPP
This can appear if the filesystem does not support reflinking either file descriptor, or if either file descriptor refers to special inodes.
EPERM
dest_fd is immutable.
ETXTBSY
One of the files is a swap file. Swap files cannot share storage.
EXDEV
dest_fd and src_fd are not on the same mounted filesystem.
STANDARDS
Linux.
HISTORY
Linux 4.5.
These operations were previously known as BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE, and were private to Btrfs.
NOTES
Because a copy-on-write operation requires the allocation of new storage, the fallocate(2) operation may unshare shared blocks to guarantee that subsequent writes will not fail because of lack of disk space.
SEE ALSO
ioctl(2)
403 - Linux cli command sched_setattr
NAME
sched_setattr - set and get scheduling policy and attributes
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sched.h> /* Definition of SCHED_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_sched_setattr, pid_t pid, struct sched_attr *attr,
unsigned int flags);
int syscall(SYS_sched_getattr, pid_t pid, struct sched_attr *attr,
unsigned int size, unsigned int flags);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
sched_setattr()
The sched_setattr() system call sets the scheduling policy and associated attributes for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and attributes of the calling thread will be set.
Currently, Linux supports the following “normal” (i.e., non-real-time) scheduling policies as values that may be specified in the sched_policy field of attr:
SCHED_OTHER
the standard round-robin time-sharing policy;
SCHED_BATCH
for “batch” style execution of processes; and
SCHED_IDLE
for running very low priority background jobs.
Various “real-time” policies are also supported, for special time-critical applications that need precise control over the way in which runnable threads are selected for execution. For the rules governing when a process may use these policies, see sched(7). The real-time policies that may be specified in sched_policy are:
SCHED_FIFO
a first-in, first-out policy; and
SCHED_RR
a round-robin policy.
Linux also provides the following policy:
SCHED_DEADLINE
a deadline scheduling policy; see sched(7) for details.
The attr argument is a pointer to a structure that defines the new scheduling policy and attributes for the specified thread. This structure has the following form:
struct sched_attr {
u32 size; /* Size of this structure */
u32 sched_policy; /* Policy (SCHED_*) */
u64 sched_flags; /* Flags */
s32 sched_nice; /* Nice value (SCHED_OTHER,
SCHED_BATCH) */
u32 sched_priority; /* Static priority (SCHED_FIFO,
SCHED_RR) */
/* Remaining fields are for SCHED_DEADLINE */
u64 sched_runtime;
u64 sched_deadline;
u64 sched_period;
};
The fields of the sched_attr structure are as follows:
size
This field should be set to the size of the structure in bytes, as in sizeof(struct sched_attr). If the provided structure is smaller than the kernel structure, any additional fields are assumed to be ‘0’. If the provided structure is larger than the kernel structure, the kernel verifies that all additional fields are 0; if they are not, sched_setattr() fails with the error E2BIG and updates size to contain the size of the kernel structure.
The above behavior when the size of the user-space sched_attr structure does not match the size of the kernel structure allows for future extensibility of the interface. Malformed applications that pass oversize structures won’t break in the future if the size of the kernel sched_attr structure is increased. In the future, it could also allow applications that know about a larger user-space sched_attr structure to determine whether they are running on an older kernel that does not support the larger structure.
sched_policy
This field specifies the scheduling policy, as one of the SCHED_* values listed above.
sched_flags
This field contains zero or more of the following flags that are ORed together to control scheduling behavior:
SCHED_FLAG_RESET_ON_FORK
Children created by fork(2) do not inherit privileged scheduling policies. See sched(7) for details.
SCHED_FLAG_RECLAIM (since Linux 4.13)
This flag allows a SCHED_DEADLINE thread to reclaim bandwidth unused by other real-time threads.
SCHED_FLAG_DL_OVERRUN (since Linux 4.16)
This flag allows an application to get informed about run-time overruns in SCHED_DEADLINE threads. Such overruns may be caused by (for example) coarse execution time accounting or incorrect parameter assignment. Notification takes the form of a SIGXCPU signal which is generated on each overrun.
This SIGXCPU signal is process-directed (see signal(7)) rather than thread-directed. This is probably a bug. On the one hand, sched_setattr() is being used to set a per-thread attribute. On the other hand, if the process-directed signal is delivered to a thread inside the process other than the one that had a run-time overrun, the application has no way of knowing which thread overran.
sched_nice
This field specifies the nice value to be set when specifying sched_policy as SCHED_OTHER or SCHED_BATCH. The nice value is a number in the range -20 (high priority) to +19 (low priority); see sched(7).
sched_priority
This field specifies the static priority to be set when specifying sched_policy as SCHED_FIFO or SCHED_RR. The allowed range of priorities for these policies can be determined using sched_get_priority_min(2) and sched_get_priority_max(2). For other policies, this field must be specified as 0.
sched_runtime
This field specifies the “Runtime” parameter for deadline scheduling. The value is expressed in nanoseconds. This field, and the next two fields, are used only for SCHED_DEADLINE scheduling; for further details, see sched(7).
sched_deadline
This field specifies the “Deadline” parameter for deadline scheduling. The value is expressed in nanoseconds.
sched_period
This field specifies the “Period” parameter for deadline scheduling. The value is expressed in nanoseconds.
The flags argument is provided to allow for future extensions to the interface; in the current implementation it must be specified as 0.
sched_getattr()
The sched_getattr() system call fetches the scheduling policy and the associated attributes for the thread whose ID is specified in pid. If pid equals zero, the scheduling policy and attributes of the calling thread will be retrieved.
The size argument should be set to the size of the sched_attr structure as known to user space. The value must be at least as large as the size of the initially published sched_attr structure, or the call fails with the error EINVAL.
The retrieved scheduling attributes are placed in the fields of the sched_attr structure pointed to by attr. The kernel sets attr.size to the size of its sched_attr structure.
If the caller-provided attr buffer is larger than the kernel’s sched_attr structure, the additional bytes in the user-space structure are not touched. If the caller-provided structure is smaller than the kernel sched_attr structure, the kernel will silently not return any values which would be stored outside the provided space. As with sched_setattr(), these semantics allow for future extensibility of the interface.
The flags argument is provided to allow for future extensions to the interface; in the current implementation it must be specified as 0.
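Because these system calls must be invoked through syscall(2), a caller typically supplies its own structure definition and small wrappers. The following is a minimal sketch under stated assumptions: struct sketch_sched_attr mirrors the 48-byte layout shown above under a hypothetical name (so that it cannot clash with a definition supplied by a system header), and sketch_sched_getattr() and sketch_set_nice() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copy of the structure layout shown above (hypothetical name). */
struct sketch_sched_attr {
    uint32_t size;            /* Size of this structure */
    uint32_t sched_policy;    /* Policy (SCHED_*) */
    uint64_t sched_flags;     /* Flags */
    int32_t  sched_nice;      /* Nice value (SCHED_OTHER, SCHED_BATCH) */
    uint32_t sched_priority;  /* Static priority (SCHED_FIFO, SCHED_RR) */
    uint64_t sched_runtime;   /* Remaining fields: SCHED_DEADLINE only */
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/* Hypothetical wrapper: fetch the attributes of thread `pid`
   (0 means the calling thread).  Returns 0, or -1 with errno set. */
int
sketch_sched_getattr(pid_t pid, struct sketch_sched_attr *attr)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    return (int) syscall(SYS_sched_getattr, pid, attr,
                         (unsigned int) sizeof(*attr), 0U);
}

/* Hypothetical wrapper: select SCHED_OTHER with the given nice value
   for thread `pid`.  Returns 0, or -1 with errno set. */
int
sketch_set_nice(pid_t pid, int nice_value)
{
    struct sketch_sched_attr attr;

    memset(&attr, 0, sizeof(attr));   /* unused fields must be zero */
    attr.size = sizeof(attr);
    attr.sched_policy = SCHED_OTHER;
    attr.sched_nice = nice_value;
    return (int) syscall(SYS_sched_setattr, pid, &attr, 0U);
}
```

Zeroing the structure before filling it in keeps sched_priority and the deadline fields at 0, which the field descriptions above require for policies that do not use them.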
RETURN VALUE
On success, sched_setattr() and sched_getattr() return 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
sched_getattr() and sched_setattr() can both fail for the following reasons:
EINVAL
attr is NULL; or pid is negative; or flags is not zero.
ESRCH
The thread whose ID is pid could not be found.
In addition, sched_getattr() can fail for the following reasons:
E2BIG
The buffer specified by size and attr is too small.
EINVAL
size is invalid; that is, it is smaller than the initial version of the sched_attr structure (48 bytes) or larger than the system page size.
In addition, sched_setattr() can fail for the following reasons:
E2BIG
The buffer specified by size and attr is larger than the kernel structure, and one or more of the excess bytes is nonzero.
EBUSY
SCHED_DEADLINE admission control failure, see sched(7).
EINVAL
attr.sched_policy is not one of the recognized policies; or attr.sched_flags contains a flag that the kernel does not recognize; or attr.sched_priority is invalid; or attr.sched_policy is SCHED_DEADLINE and the deadline scheduling parameters in attr are invalid.
EPERM
The caller does not have appropriate privileges.
EPERM
The CPU affinity mask of the thread specified by pid does not include all CPUs in the system (see sched_setaffinity(2)).
STANDARDS
Linux.
HISTORY
Linux 3.14.
NOTES
glibc does not provide wrappers for these system calls; call them using syscall(2).
sched_setattr() provides a superset of the functionality of sched_setscheduler(2), sched_setparam(2), nice(2), and (other than the ability to set the priority of all processes belonging to a specified user or all processes in a specified group) setpriority(2). Analogously, sched_getattr() provides a superset of the functionality of sched_getscheduler(2), sched_getparam(2), and (partially) getpriority(2).
BUGS
In Linux versions up to 3.15, sched_setattr() failed with the error EFAULT instead of E2BIG for the case described in ERRORS.
Up to Linux 5.3, sched_getattr() failed with the error EFBIG if the in-kernel sched_attr structure was larger than the size passed by user space.
SEE ALSO
chrt(1), nice(2), sched_get_priority_max(2), sched_get_priority_min(2), sched_getaffinity(2), sched_getparam(2), sched_getscheduler(2), sched_rr_get_interval(2), sched_setaffinity(2), sched_setparam(2), sched_setscheduler(2), sched_yield(2), setpriority(2), pthread_getschedparam(3), pthread_setschedparam(3), pthread_setschedprio(3), capabilities(7), cpuset(7), sched(7)
404 - Linux cli command process_madvise
NAME
process_madvise - give advice about use of memory to a process
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
ssize_t process_madvise(int pidfd, const struct iovec iovec[.n],
size_t n, int advice, unsigned int flags);
DESCRIPTION
The process_madvise() system call is used to give advice or directions to the kernel about the address ranges of another process or of the calling process. It provides the advice for the address ranges described by iovec and n. The goal of such advice is to improve system or application performance.
The pidfd argument is a PID file descriptor (see pidfd_open(2)) that specifies the process to which the advice is to be applied.
The pointer iovec points to an array of iovec structures, described in iovec(3type).
n specifies the number of elements in the array of iovec structures. This value must be less than or equal to IOV_MAX (defined in <limits.h> or accessible via the call sysconf(_SC_IOV_MAX)).
The advice argument is one of the following values:
MADV_COLD
See madvise(2).
MADV_COLLAPSE
See madvise(2).
MADV_PAGEOUT
See madvise(2).
MADV_WILLNEED
See madvise(2).
The flags argument is reserved for future use; currently, this argument must be specified as 0.
The n and iovec arguments are checked before applying any advice. If n is too big, or iovec is invalid, then an error will be returned immediately and no advice will be applied.
The advice might be applied to only a part of iovec if one of its elements points to an invalid memory region in the remote process. No further elements will be processed beyond that point. (See the discussion regarding partial advice in RETURN VALUE.)
Starting in Linux 5.12, permission to apply advice to another process is governed by a ptrace access mode PTRACE_MODE_READ_FSCREDS check (see ptrace(2)); in addition, because of the performance implications of applying the advice, the caller must have the CAP_SYS_NICE capability (see capabilities(7)).
RETURN VALUE
On success, process_madvise() returns the number of bytes advised. This return value may be less than the total number of requested bytes, if an error occurred after some iovec elements were already processed. The caller should check the return value to determine whether a partial advice occurred.
On error, -1 is returned and errno is set to indicate the error.
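The partial-advice check described above might look like the following minimal sketch. The names requested_bytes(), sketch_process_madvise(), and advise_all() are hypothetical. The sketch calls the raw system call through syscall(2) so that it also builds with C libraries that lack a process_madvise() wrapper, and it assumes that SYS_process_madvise is defined by the build headers (otherwise the wrapper simply fails with ENOSYS).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>      /* MADV_* constants */
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>       /* struct iovec */
#include <unistd.h>

/* Sum of the iov_len fields: the number of bytes the caller asked for. */
size_t
requested_bytes(const struct iovec *iov, size_t n)
{
    size_t total = 0;

    assert(iov != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += iov[i].iov_len;
    return total;
}

/* Hypothetical wrapper around the raw system call; flags is always 0. */
ssize_t
sketch_process_madvise(int pidfd, const struct iovec *iov, size_t n,
                       int advice)
{
#ifdef SYS_process_madvise
    return syscall(SYS_process_madvise, pidfd, iov, n, advice, 0U);
#else
    errno = ENOSYS;
    return -1;
#endif
}

/* Returns 1 if every requested byte was advised, 0 if the advice was
   applied only partially, and -1 (with errno set) on outright failure. */
int
advise_all(int pidfd, const struct iovec *iov, size_t n, int advice)
{
    ssize_t done = sketch_process_madvise(pidfd, iov, n, advice);

    if (done == -1)
        return -1;
    return (size_t) done == requested_bytes(iov, n) ? 1 : 0;
}
```

A caller would obtain pidfd from pidfd_open(2), fill the iovec array with address ranges that are valid in the target process, and treat a return value of 0 from advise_all() as a signal to inspect which ranges were skipped.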
ERRORS
EBADF
pidfd is not a valid PID file descriptor.
EFAULT
The memory described by iovec is outside the accessible address space of the process referred to by pidfd.
EINVAL
flags is not 0.
EINVAL
The sum of the iov_len values of iovec overflows a ssize_t value.
EINVAL
n is too large.
ENOMEM
Could not allocate memory for internal copies of the iovec structures.
EPERM
The caller does not have permission to access the address space of the process pidfd.
ESRCH
The target process does not exist (i.e., it has terminated and been waited on).
See madvise(2) for advice-specific errors.
STANDARDS
Linux.
HISTORY
Linux 5.10. glibc 2.36.
Support for this system call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS configuration option.
When this system call first appeared in Linux 5.10, permission to apply advice to another process was entirely governed by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check (see ptrace(2)). This requirement was relaxed in Linux 5.12 so that the caller no longer requires full control over the target process.
SEE ALSO
madvise(2), pidfd_open(2), process_vm_readv(2), process_vm_writev(2)
405 - Linux cli command tkill
NAME
tkill - send a signal to a thread
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h> /* Definition of SIG* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
[[deprecated]] int syscall(SYS_tkill, pid_t tid, int sig);
#include <signal.h>
int tgkill(pid_t tgid, pid_t tid, int sig);
Note: glibc provides no wrapper for tkill(), necessitating the use of syscall(2).
DESCRIPTION
tgkill() sends the signal sig to the thread with the thread ID tid in the thread group tgid. (By contrast, kill(2) can be used to send a signal only to a process (i.e., thread group) as a whole, and the signal will be delivered to an arbitrary thread within that process.)
tkill() is an obsolete predecessor to tgkill(). It allows only the target thread ID to be specified, which may result in the wrong thread being signaled if a thread terminates and its thread ID is recycled. Avoid using this system call.
These are the raw system call interfaces, meant for internal thread library use.
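As an illustration only, the following sketch directs a signal at the calling thread by combining getpid(2) with the raw gettid and tgkill system calls. The names sketch_tgkill() and signal_self_thread() are hypothetical. The sketch also assumes that, as with kill(2), a signal number of 0 performs only the existence and permission checks.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper around the raw system call, usable even where
   the C library provides no tgkill() wrapper. */
int
sketch_tgkill(pid_t tgid, pid_t tid, int sig)
{
    assert(sig >= 0);
    return (int) syscall(SYS_tgkill, tgid, tid, sig);
}

/* Send `sig` to the calling thread only.  The thread group ID is the
   process ID; the thread ID comes from the gettid system call. */
int
signal_self_thread(int sig)
{
    pid_t tgid = getpid();
    pid_t tid = (pid_t) syscall(SYS_gettid);

    return sketch_tgkill(tgid, tid, sig);
}
```

Passing both the thread group ID and the thread ID is what protects against the thread ID recycling problem that makes tkill() unsafe.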
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EAGAIN
The RLIMIT_SIGPENDING resource limit was reached and sig is a real-time signal.
EAGAIN
Insufficient kernel memory was available and sig is a real-time signal.
EINVAL
An invalid thread ID, thread group ID, or signal was specified.
EPERM
Permission denied. For the required permissions, see kill(2).
ESRCH
No process with the specified thread ID (and thread group ID) exists.
STANDARDS
Linux.
HISTORY
tkill()
Linux 2.4.19 / 2.5.4.
tgkill()
Linux 2.5.75, glibc 2.30.
NOTES
See the description of CLONE_THREAD in clone(2) for an explanation of thread groups.
SEE ALSO
clone(2), gettid(2), kill(2), rt_sigqueueinfo(2)
406 - Linux cli command clock_gettime
NAME
clock_gettime - clock and time functions
LIBRARY
Standard C library (libc, -lc), since glibc 2.17
Before glibc 2.17, Real-time library (librt, -lrt)
SYNOPSIS
#include <time.h>
int clock_getres(clockid_t clockid, struct timespec *_Nullable res);
int clock_gettime(clockid_t clockid, struct timespec *tp);
int clock_settime(clockid_t clockid, const struct timespec *tp);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
clock_getres(), clock_gettime(), clock_settime():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
The function clock_getres() finds the resolution (precision) of the specified clock clockid, and, if res is non-NULL, stores it in the struct timespec pointed to by res. The resolution of clocks depends on the implementation and cannot be configured by a particular process. If the time value pointed to by the argument tp of clock_settime() is not a multiple of res, then it is truncated to a multiple of res.
The functions clock_gettime() and clock_settime() retrieve and set the time of the specified clock clockid.
The res and tp arguments are timespec(3) structures.
The clockid argument is the identifier of the particular clock on which to act. A clock may be system-wide and hence visible for all processes, or per-process if it measures time only within a single process.
All implementations support the system-wide real-time clock, which is identified by CLOCK_REALTIME. Its time represents seconds and nanoseconds since the Epoch. When its time is changed, timers for a relative interval are unaffected, but timers for an absolute point in time are affected.
More clocks may be implemented. The interpretation of the corresponding time values and the effect on timers is unspecified.
Sufficiently recent versions of glibc and the Linux kernel support the following clocks:
CLOCK_REALTIME
A settable system-wide clock that measures real (i.e., wall-clock) time. Setting this clock requires appropriate privileges. This clock is affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), and by frequency adjustments performed by NTP and similar applications via adjtime(3), adjtimex(2), clock_adjtime(2), and ntp_adjtime(3). This clock normally counts the number of seconds since 1970-01-01 00:00:00 Coordinated Universal Time (UTC) except that it ignores leap seconds; near a leap second it is typically adjusted by NTP to stay roughly in sync with UTC.
CLOCK_REALTIME_ALARM (since Linux 3.0; Linux-specific)
Like CLOCK_REALTIME, but not settable. See timer_create(2) for further details.
CLOCK_REALTIME_COARSE (since Linux 2.6.32; Linux-specific)
A faster but less precise version of CLOCK_REALTIME. This clock is not settable. Use when you need very fast, but not fine-grained timestamps. Requires per-architecture support, and probably also architecture support for this flag in the vdso(7).
CLOCK_TAI (since Linux 3.10; Linux-specific)
A nonsettable system-wide clock derived from wall-clock time but counting leap seconds. This clock does not experience discontinuities or frequency adjustments caused by inserting leap seconds as CLOCK_REALTIME does.
The acronym TAI refers to International Atomic Time.
CLOCK_MONOTONIC
A nonsettable system-wide clock that represents monotonic time since, as described by POSIX, “some unspecified point in the past”. On Linux, that point corresponds to the number of seconds that the system has been running since it was booted.
The CLOCK_MONOTONIC clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), but is affected by frequency adjustments. This clock does not count time that the system is suspended. All CLOCK_MONOTONIC variants guarantee that the time returned by consecutive calls will not go backwards, but successive calls may, depending on the architecture, return identical (not-increased) time values.
CLOCK_MONOTONIC_COARSE (since Linux 2.6.32; Linux-specific)
A faster but less precise version of CLOCK_MONOTONIC. Use when you need very fast, but not fine-grained timestamps. Requires per-architecture support, and probably also architecture support for this flag in the vdso(7).
CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to frequency adjustments. This clock does not count time that the system is suspended.
CLOCK_BOOTTIME (since Linux 2.6.39; Linux-specific)
A nonsettable system-wide clock that is identical to CLOCK_MONOTONIC, except that it also includes any time that the system is suspended. This allows applications to get a suspend-aware monotonic clock without having to deal with the complications of CLOCK_REALTIME, which may have discontinuities if the time is changed using settimeofday(2) or similar.
CLOCK_BOOTTIME_ALARM (since Linux 3.0; Linux-specific)
Like CLOCK_BOOTTIME. See timer_create(2) for further details.
CLOCK_PROCESS_CPUTIME_ID (since Linux 2.6.12)
This is a clock that measures CPU time consumed by this process (i.e., CPU time consumed by all threads in the process). On Linux, this clock is not settable.
CLOCK_THREAD_CPUTIME_ID (since Linux 2.6.12)
This is a clock that measures CPU time consumed by this thread. On Linux, this clock is not settable.
Linux also implements dynamic clock instances as described below.
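Since CLOCK_MONOTONIC is unaffected by discontinuous jumps in the system time, it is a common choice for measuring elapsed intervals. The following is a minimal sketch of that use; the helper names timespec_diff_ns() and monotonic_ns_since() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds from `start` to `end`.  Both values are assumed to come
   from the same clock, with `end` not earlier than `start`. */
int64_t
timespec_diff_ns(struct timespec start, struct timespec end)
{
    return (int64_t) (end.tv_sec - start.tv_sec) * 1000000000LL
           + (end.tv_nsec - start.tv_nsec);
}

/* Nanoseconds elapsed on CLOCK_MONOTONIC since `start`, or -1 if the
   clock could not be read. */
int64_t
monotonic_ns_since(const struct timespec *start)
{
    struct timespec now;

    assert(start != NULL);
    if (clock_gettime(CLOCK_MONOTONIC, &now) == -1)
        return -1;
    return timespec_diff_ns(*start, now);
}
```

A caller records a starting point with clock_gettime(CLOCK_MONOTONIC, &t0), performs the work to be timed, and then calls monotonic_ns_since(&t0). Because successive readings may be identical, a result of 0 is possible.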
Dynamic clocks
In addition to the hard-coded System-V style clock IDs described above, Linux also supports POSIX clock operations on certain character devices. Such devices are called “dynamic” clocks, and are supported since Linux 2.6.39.
Using the appropriate macros, open file descriptors may be converted into clock IDs and passed to clock_gettime(), clock_settime(), and clock_adjtime(2). The following example shows how to convert a file descriptor into a dynamic clock ID.
#define CLOCKFD 3
#define FD_TO_CLOCKID(fd) ((~(clockid_t) (fd) << 3) | CLOCKFD)
#define CLOCKID_TO_FD(clk) ((unsigned int) ~((clk) >> 3))
struct timespec ts;
clockid_t clkid;
int fd;
fd = open("/dev/ptp0", O_RDWR);
clkid = FD_TO_CLOCKID(fd);
clock_gettime(clkid, &ts);
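The fragment above omits error handling. A slightly fuller sketch, reusing the same macros, might look as follows. The function name read_dynamic_clock() is hypothetical, and opening the device read-only is an assumption made here on the basis that the clock is only read, not set.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

#define CLOCKFD 3
#define FD_TO_CLOCKID(fd)   ((~(clockid_t) (fd) << 3) | CLOCKFD)
#define CLOCKID_TO_FD(clk)  ((unsigned int) ~((clk) >> 3))

/* Read the time of the dynamic clock behind the character device at
   `path` (for example /dev/ptp0, if such a device exists).
   Returns 0 on success, or -1 with errno set. */
int
read_dynamic_clock(const char *path, struct timespec *ts)
{
    int fd, rc, saved_errno;

    assert(path != NULL && ts != NULL);

    fd = open(path, O_RDONLY);   /* assumption: read-only access suffices */
    if (fd == -1)
        return -1;

    rc = clock_gettime(FD_TO_CLOCKID(fd), ts);

    saved_errno = errno;         /* close() must not clobber errno */
    close(fd);
    errno = saved_errno;
    return rc;
}
```

The clock ID is valid only while the file descriptor remains open, which is why the descriptor is closed after the clock has been read.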
RETURN VALUE
clock_gettime(), clock_settime(), and clock_getres() return 0 for success. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
clock_settime() does not have write permission for the dynamic POSIX clock device indicated.
EFAULT
tp points outside the accessible address space.
EINVAL
The clockid specified is invalid for one of two reasons. Either the System-V style hard coded positive value is out of range, or the dynamic clock ID does not refer to a valid instance of a clock object.
EINVAL
(clock_settime()): tp.tv_sec is negative or tp.tv_nsec is outside the range [0, 999,999,999].
EINVAL
The clockid specified in a call to clock_settime() is not a settable clock.
EINVAL (since Linux 4.3)
A call to clock_settime() with a clockid of CLOCK_REALTIME attempted to set the time to a value less than the current value of the CLOCK_MONOTONIC clock.
ENODEV
The hot-pluggable device (a USB device, for example) represented by a dynamic clockid has disappeared after its character device was opened.
ENOTSUP
The operation is not supported by the dynamic POSIX clock device specified.
EOVERFLOW
The timestamp would not fit in time_t range. This can happen if an executable with 32-bit time_t is run on a 64-bit kernel when the time is 2038-01-19 03:14:08 UTC or later. However, when the system time is out of time_t range in other situations, the behavior is undefined.
EPERM
clock_settime() does not have permission to set the clock indicated.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
clock_getres(), clock_gettime(), clock_settime() | Thread safety | MT-Safe |
VERSIONS
POSIX.1 specifies the following:
Setting the value of the CLOCK_REALTIME clock via clock_settime() shall have no effect on threads that are blocked waiting for a relative time service based upon this clock, including the nanosleep() function; nor on the expiration of relative timers based upon this clock. Consequently, these time services shall expire when the requested relative interval elapses, independently of the new or old value of the clock.
According to POSIX.1-2001, a process with “appropriate privileges” may set the CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks using clock_settime(). On Linux, these clocks are not settable (i.e., no process has “appropriate privileges”).
C library/kernel differences
On some architectures, an implementation of clock_gettime() is provided in the vdso(7).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SUSv2. Linux 2.6.
On POSIX systems on which these functions are available, the symbol _POSIX_TIMERS is defined in <unistd.h> to a value greater than 0. POSIX.1-2008 makes these functions mandatory.
The symbols _POSIX_MONOTONIC_CLOCK, _POSIX_CPUTIME, _POSIX_THREAD_CPUTIME indicate that CLOCK_MONOTONIC, CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID are available. (See also sysconf(3).)
Historical note for SMP systems
Before Linux added kernel support for CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID, glibc implemented these clocks on many platforms using timer registers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
If the CPUs in an SMP system have different clock sources, then there is no way to maintain a correlation between the timer registers since each CPU will run at a slightly different frequency. If that is the case, then clock_getcpuclockid(0) will return ENOENT to signify this condition. The two clocks will then be useful only if it can be ensured that a process stays on a certain CPU.
The processors in an SMP system do not start all at exactly the same time and therefore the timer registers are typically running at an offset. Some architectures include code that attempts to limit these offsets on bootup. However, the code cannot guarantee to accurately tune the offsets. glibc contains no provisions to deal with these offsets (unlike the Linux Kernel). Typically these offsets are small and therefore the effects may be negligible in most cases.
Since glibc 2.4, the wrapper functions for the system calls described in this page avoid the abovementioned problems by employing the kernel implementation of CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID, on systems that provide such an implementation (i.e., Linux 2.6.12 and later).
EXAMPLES
The program below demonstrates the use of clock_gettime() and clock_getres() with various clocks. This is an example of what we might see when running the program:
$ ./clock_times x
CLOCK_REALTIME : 1585985459.446 (18356 days + 7h 30m 59s)
resolution: 0.000000001
CLOCK_TAI : 1585985496.447 (18356 days + 7h 31m 36s)
resolution: 0.000000001
CLOCK_MONOTONIC: 52395.722 (14h 33m 15s)
resolution: 0.000000001
CLOCK_BOOTTIME : 72691.019 (20h 11m 31s)
resolution: 0.000000001
Program source
/* clock_times.c
Licensed under GNU General Public License v2 or later.
*/
#define _XOPEN_SOURCE 600
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define SECS_IN_DAY (24 * 60 * 60)
static void
displayClock(clockid_t clock, const char *name, bool showRes)
{
    long             days;
    struct timespec  ts;

    if (clock_gettime(clock, &ts) == -1) {
        perror("clock_gettime");
        exit(EXIT_FAILURE);
    }

    printf("%-15s: %10jd.%03ld (", name,
           (intmax_t) ts.tv_sec, ts.tv_nsec / 1000000);

    days = ts.tv_sec / SECS_IN_DAY;
    if (days > 0)
        printf("%ld days + ", days);

    printf("%2dh %2dm %2ds",
           (int) (ts.tv_sec % SECS_IN_DAY) / 3600,
           (int) (ts.tv_sec % 3600) / 60,
           (int) ts.tv_sec % 60);
    printf(")\n");

    if (clock_getres(clock, &ts) == -1) {
        perror("clock_getres");
        exit(EXIT_FAILURE);
    }

    if (showRes)
        printf("     resolution: %10jd.%09ld\n",
               (intmax_t) ts.tv_sec, ts.tv_nsec);
}

int
main(int argc, char *argv[])
{
    bool showRes = argc > 1;

    displayClock(CLOCK_REALTIME, "CLOCK_REALTIME", showRes);
#ifdef CLOCK_TAI
    displayClock(CLOCK_TAI, "CLOCK_TAI", showRes);
#endif
    displayClock(CLOCK_MONOTONIC, "CLOCK_MONOTONIC", showRes);
#ifdef CLOCK_BOOTTIME
    displayClock(CLOCK_BOOTTIME, "CLOCK_BOOTTIME", showRes);
#endif
    exit(EXIT_SUCCESS);
}
SEE ALSO
date(1), gettimeofday(2), settimeofday(2), time(2), adjtime(3), clock_getcpuclockid(3), ctime(3), ftime(3), pthread_getcpuclockid(3), sysconf(3), timespec(3), time(7), time_namespaces(7), vdso(7), hwclock(8)
407 - Linux cli command madvise1
NAME
madvise1 - unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2), exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2), exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
408 - Linux cli command sched_get_priority_min
NAME
sched_get_priority_min - get static priority range
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sched.h>
int sched_get_priority_max(int policy);
int sched_get_priority_min(int policy);
DESCRIPTION
sched_get_priority_max() returns the maximum priority value that can be used with the scheduling algorithm identified by policy. sched_get_priority_min() returns the minimum priority value that can be used with the scheduling algorithm identified by policy. Supported policy values are SCHED_FIFO, SCHED_RR, SCHED_OTHER, SCHED_BATCH, SCHED_IDLE, and SCHED_DEADLINE. Further details about these policies can be found in sched(7).
Processes with numerically higher priority values are scheduled before processes with numerically lower priority values. Thus, the value returned by sched_get_priority_max() will be greater than the value returned by sched_get_priority_min().
Linux allows the static priority range 1 to 99 for the SCHED_FIFO and SCHED_RR policies, and the priority 0 for the remaining policies. Scheduling priority ranges for the various policies are not alterable.
The range of scheduling priorities may vary on other POSIX systems, so it is a good idea for portable applications to use a virtual priority range and map it to the interval given by sched_get_priority_max() and sched_get_priority_min(). POSIX.1 requires a spread of at least 32 between the maximum and the minimum values for SCHED_FIFO and SCHED_RR.
POSIX systems on which sched_get_priority_max() and sched_get_priority_min() are available define _POSIX_PRIORITY_SCHEDULING in <unistd.h>.
RETURN VALUE
On success, sched_get_priority_max() and sched_get_priority_min() return the maximum/minimum priority value for the named scheduling policy. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
The argument policy does not identify a defined scheduling policy.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
SEE ALSO
sched_getaffinity(2), sched_getparam(2), sched_getscheduler(2), sched_setaffinity(2), sched_setparam(2), sched_setscheduler(2), sched(7)
409 - Linux cli command tgkill
NAME
tgkill - send a signal to a thread
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h> /* Definition of SIG* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
[[deprecated]] int syscall(SYS_tkill, pid_t tid, int sig);
#include <signal.h>
int tgkill(pid_t tgid, pid_t tid, int sig);
Note: glibc provides no wrapper for tkill(), necessitating the use of syscall(2).
DESCRIPTION
tgkill() sends the signal sig to the thread with the thread ID tid in the thread group tgid. (By contrast, kill(2) can be used to send a signal only to a process (i.e., thread group) as a whole, and the signal will be delivered to an arbitrary thread within that process.)
tkill() is an obsolete predecessor to tgkill(). It allows only the target thread ID to be specified, which may result in the wrong thread being signaled if a thread terminates and its thread ID is recycled. Avoid using this system call.
These are the raw system call interfaces, meant for internal thread library use.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EAGAIN
The RLIMIT_SIGPENDING resource limit was reached and sig is a real-time signal.
EAGAIN
Insufficient kernel memory was available and sig is a real-time signal.
EINVAL
An invalid thread ID, thread group ID, or signal was specified.
EPERM
Permission denied. For the required permissions, see kill(2).
ESRCH
No process with the specified thread ID (and thread group ID) exists.
STANDARDS
Linux.
HISTORY
tkill()
Linux 2.4.19 / 2.5.4.
tgkill()
Linux 2.5.75, glibc 2.30.
NOTES
See the description of CLONE_THREAD in clone(2) for an explanation of thread groups.
SEE ALSO
clone(2), gettid(2), kill(2), rt_sigqueueinfo(2)
410 - Linux cli command userfaultfd
NAME
userfaultfd - create a file descriptor for handling page faults in user space
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h> /* Definition of O_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <linux/userfaultfd.h> /* Definition of UFFD_* constants */
#include <unistd.h>
int syscall(SYS_userfaultfd, int flags);
Note: glibc provides no wrapper for userfaultfd(), necessitating the use of syscall(2).
DESCRIPTION
userfaultfd() creates a new userfaultfd object that can be used for delegation of page-fault handling to a user-space application, and returns a file descriptor that refers to the new object. The new userfaultfd object is configured using ioctl(2).
Once the userfaultfd object is configured, the application can use read(2) to receive userfaultfd notifications. The reads from userfaultfd may be blocking or non-blocking, depending on the value of flags used for the creation of the userfaultfd or subsequent calls to fcntl(2).
The following values may be bitwise ORed in flags to change the behavior of userfaultfd():
O_CLOEXEC
Enable the close-on-exec flag for the new userfaultfd file descriptor. See the description of the O_CLOEXEC flag in open(2).
O_NONBLOCK
Enables non-blocking operation for the userfaultfd object. See the description of the O_NONBLOCK flag in open(2).
UFFD_USER_MODE_ONLY
This is a userfaultfd-specific flag that was introduced in Linux 5.11. When set, the userfaultfd object will only be able to handle page faults that originate from user space on the registered regions. When a kernel-originated fault is triggered on the registered range with this userfaultfd, a SIGBUS signal will be delivered.
When the last file descriptor referring to a userfaultfd object is closed, all memory ranges that were registered with the object are unregistered and unread events are flushed.
Userfaultfd supports three modes of registration:
UFFDIO_REGISTER_MODE_MISSING (since Linux 4.10)
When registered with UFFDIO_REGISTER_MODE_MISSING mode, user-space will receive a page-fault notification when a missing page is accessed. The faulted thread will be stopped from execution until the page fault is resolved from user-space by either an UFFDIO_COPY or an UFFDIO_ZEROPAGE ioctl.
UFFDIO_REGISTER_MODE_MINOR (since Linux 5.13)
When registered with UFFDIO_REGISTER_MODE_MINOR mode, user-space will receive a page-fault notification when a minor page fault occurs. That is, when a backing page is in the page cache, but page table entries don’t yet exist. The faulted thread will be stopped from execution until the page fault is resolved from user-space by an UFFDIO_CONTINUE ioctl.
UFFDIO_REGISTER_MODE_WP (since Linux 5.7)
When registered with UFFDIO_REGISTER_MODE_WP mode, user-space will receive a page-fault notification when a write-protected page is written. The faulted thread will be stopped from execution until user-space write-unprotects the page using an UFFDIO_WRITEPROTECT ioctl.
Multiple modes can be enabled at the same time for the same memory range.
Since Linux 4.14, a userfaultfd page-fault notification can selectively embed faulting thread ID information into the notification. One needs to enable this feature explicitly using the UFFD_FEATURE_THREAD_ID feature bit when initializing the userfaultfd context. By default, thread ID reporting is disabled.
Usage
The userfaultfd mechanism is designed to allow a thread in a multithreaded program to perform user-space paging for the other threads in the process. When a page fault occurs for one of the regions registered to the userfaultfd object, the faulting thread is put to sleep and an event is generated that can be read via the userfaultfd file descriptor. The fault-handling thread reads events from this file descriptor and services them using the operations described in ioctl_userfaultfd(2). When servicing the page fault events, the fault-handling thread can trigger a wake-up for the sleeping thread.
It is possible for the faulting threads and the fault-handling threads to run in the context of different processes. In this case, these threads may belong to different programs, and the program that executes the faulting threads will not necessarily cooperate with the program that handles the page faults. In such non-cooperative mode, the process that monitors userfaultfd and handles page faults needs to be aware of the changes in the virtual memory layout of the faulting process to avoid memory corruption.
Since Linux 4.11, userfaultfd can also notify the fault-handling threads about changes in the virtual memory layout of the faulting process. In addition, if the faulting process invokes fork(2), the userfaultfd objects associated with the parent may be duplicated into the child process and the userfaultfd monitor will be notified (via the UFFD_EVENT_FORK described below) about the file descriptor associated with the userfault objects created for the child process, which allows the userfaultfd monitor to perform user-space paging for the child process. Unlike page faults which have to be synchronous and require an explicit or implicit wakeup, all other events are delivered asynchronously and the non-cooperative process resumes execution as soon as the userfaultfd manager executes read(2). The userfaultfd manager should carefully synchronize calls to UFFDIO_COPY with the processing of events.
The current asynchronous model of the event delivery is optimal for single threaded non-cooperative userfaultfd manager implementations.
Since Linux 5.7, userfaultfd is able to do synchronous page dirty tracking using the new write-protect register mode. One should check against the feature bit UFFD_FEATURE_PAGEFAULT_FLAG_WP before using this feature. Similar to the original userfaultfd missing mode, the write-protect mode will generate a userfaultfd notification when the protected page is written. The user needs to resolve the page fault by unprotecting the faulted page and kicking the faulted thread to continue. For more information, please refer to the “Userfaultfd write-protect mode” section.
Userfaultfd operation
After the userfaultfd object is created with userfaultfd(), the application must enable it using the UFFDIO_API ioctl(2) operation. This operation allows a two-step handshake between the kernel and user space to determine what API version and features the kernel supports, and then to enable those features user space wants. This operation must be performed before any of the other ioctl(2) operations described below (or those operations fail with the EINVAL error).
After a successful UFFDIO_API operation, the application then registers memory address ranges using the UFFDIO_REGISTER ioctl(2) operation. After successful completion of a UFFDIO_REGISTER operation, a page fault occurring in the requested memory range, and satisfying the mode defined at the registration time, will be forwarded by the kernel to the user-space application. The application can then use various (e.g., UFFDIO_COPY, UFFDIO_ZEROPAGE, or UFFDIO_CONTINUE) ioctl(2) operations to resolve the page fault.
Since Linux 4.14, if the application sets the UFFD_FEATURE_SIGBUS feature bit using the UFFDIO_API ioctl(2), no page-fault notification will be forwarded to user space. Instead a SIGBUS signal is delivered to the faulting process. With this feature, userfaultfd can be used for robustness purposes to simply catch any access to areas within the registered address range that do not have pages allocated, without having to listen to userfaultfd events. No userfaultfd monitor will be required for dealing with such memory accesses. For example, this feature can be useful for applications that want to prevent the kernel from automatically allocating pages and filling holes in sparse files when the hole is accessed through a memory mapping.
The UFFD_FEATURE_SIGBUS feature is implicitly inherited through fork(2) if used in combination with UFFD_FEATURE_EVENT_FORK.
Details of the various ioctl(2) operations can be found in ioctl_userfaultfd(2).
Since Linux 4.11, events other than page-fault may be enabled during the UFFDIO_API operation.
Before Linux 4.11, userfaultfd could be used only with anonymous private memory mappings. Since Linux 4.11, userfaultfd can also be used with hugetlbfs and shared memory mappings.
Userfaultfd write-protect mode (since Linux 5.7)
Since Linux 5.7, userfaultfd supports write-protect mode for anonymous memory. The user needs to first check availability of this feature using UFFDIO_API ioctl against the feature bit UFFD_FEATURE_PAGEFAULT_FLAG_WP before using this feature.
Since Linux 5.19, write-protect mode is also supported on shmem and hugetlbfs memory types. It can be detected with the feature bit UFFD_FEATURE_WP_HUGETLBFS_SHMEM.
To register with userfaultfd write-protect mode, the user needs to initiate the UFFDIO_REGISTER ioctl with mode UFFDIO_REGISTER_MODE_WP set. Note that it is legal to monitor the same memory range with multiple modes. For example, the user can do UFFDIO_REGISTER with the mode set to UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP. When only UFFDIO_REGISTER_MODE_WP is registered, user-space will not receive any notification when a missing page is written. Instead, user-space will receive a write-protect page-fault notification only when an existing but write-protected page is written.
After the UFFDIO_REGISTER ioctl has completed with UFFDIO_REGISTER_MODE_WP mode set, the user can write-protect any existing memory within the range using the ioctl UFFDIO_WRITEPROTECT where uffdio_writeprotect.mode should be set to UFFDIO_WRITEPROTECT_MODE_WP.
When a write-protect event happens, user-space will receive a page-fault notification whose uffd_msg.pagefault.flags will have the UFFD_PAGEFAULT_FLAG_WP flag set. Note: since only writes can trigger this kind of fault, write-protect notifications will always have the UFFD_PAGEFAULT_FLAG_WRITE bit set along with the UFFD_PAGEFAULT_FLAG_WP bit.
To resolve a write-protection page fault, the user should initiate another UFFDIO_WRITEPROTECT ioctl, covering the faulted page or range, whose uffdio_writeprotect.mode has the flag UFFDIO_WRITEPROTECT_MODE_WP cleared.
Userfaultfd minor fault mode (since Linux 5.13)
Since Linux 5.13, userfaultfd supports minor fault mode. In this mode, fault messages are produced not for major faults (where the page was missing), but rather for minor faults, where a page exists in the page cache, but the page table entries are not yet present. The user needs to first check availability of this feature using the UFFDIO_API ioctl with the appropriate feature bits set before using this feature: UFFD_FEATURE_MINOR_HUGETLBFS since Linux 5.13, or UFFD_FEATURE_MINOR_SHMEM since Linux 5.14.
To register with userfaultfd minor fault mode, the user needs to initiate the UFFDIO_REGISTER ioctl with mode UFFDIO_REGISTER_MODE_MINOR set.
When a minor fault occurs, user-space will receive a page-fault notification whose uffd_msg.pagefault.flags will have the UFFD_PAGEFAULT_FLAG_MINOR flag set.
To resolve a minor page fault, the handler should decide whether or not the existing page contents need to be modified first. If so, this should be done in-place via a second, non-userfaultfd-registered mapping to the same backing page (e.g., by mapping the shmem or hugetlbfs file twice). Once the page is considered “up to date”, the fault can be resolved by initiating an UFFDIO_CONTINUE ioctl, which installs the page table entries and (by default) wakes up the faulting thread(s).
Minor fault mode supports only hugetlbfs-backed (since Linux 5.13) and shmem-backed (since Linux 5.14) memory.
Reading from the userfaultfd structure
Each read(2) from the userfaultfd file descriptor returns one or more uffd_msg structures, each of which describes a page-fault event or an event required for the non-cooperative userfaultfd usage:
struct uffd_msg {
    __u8  event;            /* Type of event */
    ...
    union {
        struct {
            __u64 flags;    /* Flags describing fault */
            __u64 address;  /* Faulting address */
            union {
                __u32 ptid; /* Thread ID of the fault */
            } feat;
        } pagefault;
        struct {            /* Since Linux 4.11 */
            __u32 ufd;      /* Userfault file descriptor
                               of the child process */
        } fork;
        struct {            /* Since Linux 4.11 */
            __u64 from;     /* Old address of remapped area */
            __u64 to;       /* New address of remapped area */
            __u64 len;      /* Original mapping length */
        } remap;
        struct {            /* Since Linux 4.11 */
            __u64 start;    /* Start address of removed area */
            __u64 end;      /* End address of removed area */
        } remove;
        ...
    } arg;
    /* Padding fields omitted */
} __packed;
If multiple events are available and the supplied buffer is large enough, read(2) returns as many events as will fit in the supplied buffer. If the buffer supplied to read(2) is smaller than the size of the uffd_msg structure, the read(2) fails with the error EINVAL.
The fields set in the uffd_msg structure are as follows:
event
The type of event. Depending on the event type, different fields of the arg union represent details required for the event processing. The non-page-fault events are generated only when the appropriate feature is enabled during the API handshake with the UFFDIO_API ioctl(2).
The following values can appear in the event field:
UFFD_EVENT_PAGEFAULT (since Linux 4.3)
A page-fault event. The page-fault details are available in the pagefault field.
UFFD_EVENT_FORK (since Linux 4.11)
Generated when the faulting process invokes fork(2) (or clone(2) without the CLONE_VM flag). The event details are available in the fork field.
UFFD_EVENT_REMAP (since Linux 4.11)
Generated when the faulting process invokes mremap(2). The event details are available in the remap field.
UFFD_EVENT_REMOVE (since Linux 4.11)
Generated when the faulting process invokes madvise(2) with MADV_DONTNEED or MADV_REMOVE advice. The event details are available in the remove field.
UFFD_EVENT_UNMAP (since Linux 4.11)
Generated when the faulting process unmaps a memory range, either explicitly using munmap(2) or implicitly during mmap(2) or mremap(2). The event details are available in the remove field.
pagefault.address
The address that triggered the page fault.
pagefault.flags
A bit mask of flags that describe the event. For UFFD_EVENT_PAGEFAULT, the following flags may appear:
UFFD_PAGEFAULT_FLAG_WP
If this flag is set, then the fault was a write-protect fault.
UFFD_PAGEFAULT_FLAG_MINOR
If this flag is set, then the fault was a minor fault.
UFFD_PAGEFAULT_FLAG_WRITE
If this flag is set, then the fault was a write fault.
If neither UFFD_PAGEFAULT_FLAG_WP nor UFFD_PAGEFAULT_FLAG_MINOR is set, then the fault was a missing fault.
pagefault.feat.ptid
The thread ID that triggered the page fault.
fork.ufd
The file descriptor associated with the userfault object created for the child created by fork(2).
remap.from
The original address of the memory range that was remapped using mremap(2).
remap.to
The new address of the memory range that was remapped using mremap(2).
remap.len
The original length of the memory range that was remapped using mremap(2).
remove.start
The start address of the memory range that was freed using madvise(2) or unmapped.
remove.end
The end address of the memory range that was freed using madvise(2) or unmapped.
A read(2) on a userfaultfd file descriptor can fail with the following errors:
EINVAL
The userfaultfd object has not yet been enabled using the UFFDIO_API ioctl(2) operation.
If the O_NONBLOCK flag is enabled in the associated open file description, the userfaultfd file descriptor can be monitored with poll(2), select(2), and epoll(7). When events are available, the file descriptor is indicated as readable. If the O_NONBLOCK flag is not enabled, then poll(2) (always) indicates the file as having a POLLERR condition, and select(2) indicates the file descriptor as both readable and writable.
RETURN VALUE
On success, userfaultfd() returns a new file descriptor that refers to the userfaultfd object. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
An unsupported value was specified in flags.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOMEM
Insufficient kernel memory was available.
EPERM (since Linux 5.2)
The caller is not privileged (does not have the CAP_SYS_PTRACE capability in the initial user namespace), and /proc/sys/vm/unprivileged_userfaultfd has the value 0.
STANDARDS
Linux.
HISTORY
Linux 4.3.
Support for hugetlbfs and shared memory areas and non-page-fault events was added in Linux 4.11.
NOTES
The userfaultfd mechanism can be used as an alternative to traditional user-space paging techniques based on the use of the SIGSEGV signal and mmap(2). It can also be used to implement lazy restore for checkpoint/restore mechanisms, as well as post-copy migration to allow (nearly) uninterrupted execution when transferring virtual machines and Linux containers from one host to another.
BUGS
If the UFFD_FEATURE_EVENT_FORK feature is enabled and a system call from the fork(2) family is interrupted by a signal or fails, a stale userfaultfd descriptor might be created. In this case, a spurious UFFD_EVENT_FORK will be delivered to the userfaultfd monitor.
EXAMPLES
The program below demonstrates the use of the userfaultfd mechanism. The program creates two threads, one of which acts as the page-fault handler for the process, for the pages in a demand-page zero region created using mmap(2).
The program takes one command-line argument, which is the number of pages that will be created in a mapping whose page faults will be handled via userfaultfd. After creating a userfaultfd object, the program then creates an anonymous private mapping of the specified size and registers the address range of that mapping using the UFFDIO_REGISTER ioctl(2) operation. The program then creates a second thread that will perform the task of handling page faults.
The main thread then walks through the pages of the mapping fetching bytes from successive pages. Because the pages have not yet been accessed, the first access of a byte in each page will trigger a page-fault event on the userfaultfd file descriptor.
Each of the page-fault events is handled by the second thread, which sits in a loop processing input from the userfaultfd file descriptor. In each loop iteration, the second thread first calls poll(2) to check the state of the file descriptor, and then reads an event from the file descriptor. All such events should be UFFD_EVENT_PAGEFAULT events, which the thread handles by copying a page of data into the faulting region using the UFFDIO_COPY ioctl(2) operation.
The following is an example of what we see when running the program:
$ ./userfaultfd_demo 3
Address returned by mmap() = 0x7fd30106c000
fault_handler_thread():
poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106c00f
(uffdio_copy.copy returned 4096)
Read address 0x7fd30106c00f in main(): A
Read address 0x7fd30106c40f in main(): A
Read address 0x7fd30106c80f in main(): A
Read address 0x7fd30106cc0f in main(): A
fault_handler_thread():
poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106d00f
(uffdio_copy.copy returned 4096)
Read address 0x7fd30106d00f in main(): B
Read address 0x7fd30106d40f in main(): B
Read address 0x7fd30106d80f in main(): B
Read address 0x7fd30106dc0f in main(): B
fault_handler_thread():
poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fd30106e00f
(uffdio_copy.copy returned 4096)
Read address 0x7fd30106e00f in main(): C
Read address 0x7fd30106e40f in main(): C
Read address 0x7fd30106e80f in main(): C
Read address 0x7fd30106ec0f in main(): C
Program source
/* userfaultfd_demo.c
Licensed under the GNU General Public License version 2 or later.
*/
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
static int page_size;

static void *
fault_handler_thread(void *arg)
{
    int                 nready;
    long                uffd;   /* userfaultfd file descriptor */
    ssize_t             nread;
    struct pollfd       pollfd;
    struct uffdio_copy  uffdio_copy;

    static int              fault_cnt = 0; /* Number of faults so far handled */
    static char             *page = NULL;
    static struct uffd_msg  msg;           /* Data read from userfaultfd */

    uffd = (long) arg;

    /* Create a page that will be copied into the faulting region. */
    if (page == NULL) {
        page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            err(EXIT_FAILURE, "mmap");
    }

    /* Loop, handling incoming events on the userfaultfd
       file descriptor. */
    for (;;) {

        /* See what poll() tells us about the userfaultfd. */
        pollfd.fd = uffd;
        pollfd.events = POLLIN;
        nready = poll(&pollfd, 1, -1);
        if (nready == -1)
            err(EXIT_FAILURE, "poll");

        printf("\nfault_handler_thread():\n");
        printf("    poll() returns: nready = %d; "
               "POLLIN = %d; POLLERR = %d\n", nready,
               (pollfd.revents & POLLIN) != 0,
               (pollfd.revents & POLLERR) != 0);

        /* Read an event from the userfaultfd. */
        nread = read(uffd, &msg, sizeof(msg));
        if (nread == 0) {
            printf("EOF on userfaultfd!\n");
            exit(EXIT_FAILURE);
        }
        if (nread == -1)
            err(EXIT_FAILURE, "read");

        /* We expect only one kind of event; verify that assumption. */
        if (msg.event != UFFD_EVENT_PAGEFAULT) {
            fprintf(stderr, "Unexpected event on userfaultfd\n");
            exit(EXIT_FAILURE);
        }

        /* Display info about the page-fault event. */
        printf("    UFFD_EVENT_PAGEFAULT event: ");
        printf("flags = %"PRIx64"; ", msg.arg.pagefault.flags);
        printf("address = %"PRIx64"\n", msg.arg.pagefault.address);

        /* Copy the page pointed to by 'page' into the faulting
           region. Vary the contents that are copied in, so that it
           is more obvious that each fault is handled separately. */
        memset(page, 'A' + fault_cnt % 20, page_size);
        fault_cnt++;

        uffdio_copy.src = (unsigned long) page;

        /* We need to handle page faults in units of pages(!).
           So, round faulting address down to page boundary. */
        uffdio_copy.dst = (unsigned long) msg.arg.pagefault.address &
                                          ~(page_size - 1);
        uffdio_copy.len = page_size;
        uffdio_copy.mode = 0;
        uffdio_copy.copy = 0;
        if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy) == -1)
            err(EXIT_FAILURE, "ioctl-UFFDIO_COPY");

        printf("        (uffdio_copy.copy returned %"PRId64")\n",
               uffdio_copy.copy);
    }
}

int
main(int argc, char *argv[])
{
    int        s;
    char       c;
    char       *addr;   /* Start of region handled by userfaultfd */
    long       uffd;    /* userfaultfd file descriptor */
    size_t     len, l;  /* Length of region handled by userfaultfd */
    pthread_t  thr;     /* ID of thread that handles page faults */
    struct uffdio_api       uffdio_api;
    struct uffdio_register  uffdio_register;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s num-pages\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    page_size = sysconf(_SC_PAGE_SIZE);
    len = strtoull(argv[1], NULL, 0) * page_size;

    /* Create and enable userfaultfd object. */
    uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1)
        err(EXIT_FAILURE, "userfaultfd");

    /* NOTE: Two-step feature handshake is not needed here, since this
       example doesn't require any specific features.

       Programs that do should call UFFDIO_API twice: once with
       features = 0 to detect features supported by this kernel, and
       again with the subset of features the program actually wants to
       enable. */
    uffdio_api.api = UFFD_API;
    uffdio_api.features = 0;
    if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1)
        err(EXIT_FAILURE, "ioctl-UFFDIO_API");

    /* Create a private anonymous mapping. The memory will be
       demand-zero paged--that is, not yet allocated. When we
       actually touch the memory, it will be allocated via
       the userfaultfd. */
    addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
        err(EXIT_FAILURE, "mmap");

    printf("Address returned by mmap() = %p\n", addr);

    /* Register the memory range of the mapping we just created for
       handling by the userfaultfd object. In mode, we request to track
       missing pages (i.e., pages that have not yet been faulted in). */
    uffdio_register.range.start = (unsigned long) addr;
    uffdio_register.range.len = len;
    uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1)
        err(EXIT_FAILURE, "ioctl-UFFDIO_REGISTER");

    /* Create a thread that will process the userfaultfd events. */
    s = pthread_create(&thr, NULL, fault_handler_thread, (void *) uffd);
    if (s != 0) {
        /* pthread_create() returns an error number instead of setting
           errno. The BSD errc() function may be unavailable in glibc,
           so store the error in errno and report it with err(). */
        errno = s;
        err(EXIT_FAILURE, "pthread_create");
    }

    /* Main thread now touches memory in the mapping, touching
       locations 1024 bytes apart. This will trigger userfaultfd
       events for all pages in the region. */
    l = 0xf;    /* Ensure that faulting address is not on a page
                   boundary, in order to test that we correctly
                   handle that case in fault_handler_thread(). */
    while (l < len) {
        c = addr[l];
        printf("Read address %p in %s(): ", addr + l, __func__);
        printf("%c\n", c);
        l += 1024;
        usleep(100000);         /* Slow things down a little */
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
fcntl(2), ioctl(2), ioctl_userfaultfd(2), madvise(2), mmap(2)
Documentation/admin-guide/mm/userfaultfd.rst in the Linux kernel source tree
411 - Linux cli command bind
NAME
bind - bind a name to a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
int bind(int sockfd, const struct sockaddr *addr,
socklen_t addrlen);
DESCRIPTION
When a socket is created with socket(2), it exists in a name space (address family) but has no address assigned to it. bind() assigns the address specified by addr to the socket referred to by the file descriptor sockfd. addrlen specifies the size, in bytes, of the address structure pointed to by addr. Traditionally, this operation is called “assigning a name to a socket”.
It is normally necessary to assign a local address using bind() before a SOCK_STREAM socket may receive connections (see accept(2)).
The rules used in name binding vary between address families. Consult the manual entries in Section 7 for detailed information. For AF_INET, see ip(7); for AF_INET6, see ipv6(7); for AF_UNIX, see unix(7); for AF_APPLETALK, see ddp(7); for AF_PACKET, see packet(7); for AF_X25, see x25(7); and for AF_NETLINK, see netlink(7).
The actual structure passed for the addr argument will depend on the address family. The sockaddr structure is defined as something like:
struct sockaddr {
    sa_family_t sa_family;
    char        sa_data[14];
};
The only purpose of this structure is to cast the structure pointer passed in addr in order to avoid compiler warnings. See EXAMPLES below.
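As an illustration of the cast described above, here is a minimal sketch for the AF_INET family. The helper name bind_ipv4() and its parameters are hypothetical and not part of any library; the sketch binds to a kernel-chosen port (port 0) and reads the assigned port back with getsockname(2).

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind a TCP socket to the IPv4 address
   host_addr (host byte order, e.g., INADDR_LOOPBACK) on a port chosen
   by the kernel. Returns the socket, or -1 on error; on success,
   *port holds the assigned port in host byte order. */
int
bind_ipv4(uint32_t host_addr, unsigned short *port)
{
    struct sockaddr_in sin;
    socklen_t len = sizeof(sin);
    int sfd;

    sfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sfd == -1)
        return -1;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(0);            /* 0: let the kernel pick */
    sin.sin_addr.s_addr = htonl(host_addr);

    /* bind() takes a generic struct sockaddr *, hence the casts. */
    if (bind(sfd, (struct sockaddr *) &sin, sizeof(sin)) == -1 ||
        getsockname(sfd, (struct sockaddr *) &sin, &len) == -1) {
        close(sfd);
        return -1;
    }

    *port = ntohs(sin.sin_port);
    return sfd;
}
```

A caller might use bind_ipv4(INADDR_LOOPBACK, &port) and then pass the returned descriptor to listen(2).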
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
The address is protected, and the user is not the superuser.
EADDRINUSE
The given address is already in use.
EADDRINUSE
(Internet domain sockets) The port number was specified as zero in the socket address structure, but, upon attempting to bind to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7).
EBADF
sockfd is not a valid file descriptor.
EINVAL
The socket is already bound to an address.
EINVAL
addrlen is wrong, or addr is not a valid address for this socket’s domain.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
The following errors are specific to UNIX domain (AF_UNIX) sockets:
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EADDRNOTAVAIL
A nonexistent interface was requested or the requested address was not local.
EFAULT
addr points outside the user’s accessible address space.
ELOOP
Too many symbolic links were encountered in resolving addr.
ENAMETOOLONG
addr is too long.
ENOENT
A component in the directory prefix of the socket pathname does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
EROFS
The socket inode would reside on a read-only filesystem.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD (bind() first appeared in 4.2BSD).
BUGS
The transparent proxy options are not described.
EXAMPLES
An example of the use of bind() with Internet domain sockets can be found in getaddrinfo(3).
The following example shows how to bind a stream socket in the UNIX (AF_UNIX) domain, and accept connections:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define MY_SOCK_PATH "/somepath"
#define LISTEN_BACKLOG 50

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

int
main(void)
{
    int                 sfd, cfd;
    socklen_t           peer_addr_size;
    struct sockaddr_un  my_addr, peer_addr;

    sfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sfd == -1)
        handle_error("socket");

    memset(&my_addr, 0, sizeof(my_addr));
    my_addr.sun_family = AF_UNIX;
    strncpy(my_addr.sun_path, MY_SOCK_PATH,
            sizeof(my_addr.sun_path) - 1);

    if (bind(sfd, (struct sockaddr *) &my_addr,
             sizeof(my_addr)) == -1)
        handle_error("bind");

    if (listen(sfd, LISTEN_BACKLOG) == -1)
        handle_error("listen");

    /* Now we can accept incoming connections one
       at a time using accept(2). */

    peer_addr_size = sizeof(peer_addr);
    cfd = accept(sfd, (struct sockaddr *) &peer_addr,
                 &peer_addr_size);
    if (cfd == -1)
        handle_error("accept");

    /* Code to deal with incoming connection(s)... */

    if (close(sfd) == -1)
        handle_error("close");

    if (unlink(MY_SOCK_PATH) == -1)
        handle_error("unlink");
}
SEE ALSO
accept(2), connect(2), getsockname(2), listen(2), socket(2), getaddrinfo(3), getifaddrs(3), ip(7), ipv6(7), path_resolution(7), socket(7), unix(7)
412 - Linux cli command lock
NAME 🖥️ lock 🖥️
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
413 - Linux cli command uselib
NAME 🖥️ uselib 🖥️
load shared library
SYNOPSIS
#include <unistd.h>
[[deprecated]] int uselib(const char *library);
DESCRIPTION
The system call uselib() serves to load a shared library to be used by the calling process. It is given a pathname. The address where to load is found in the library itself. The library can have any recognized binary format.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
In addition to all of the error codes returned by open(2) and mmap(2), the following may also be returned:
EACCES
The library specified by library does not have read or execute permission, or the caller does not have search permission for one of the directories in the path prefix. (See also path_resolution(7).)
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOEXEC
The file specified by library is not an executable of a known type; for example, it does not have the correct magic numbers.
STANDARDS
Linux.
HISTORY
This obsolete system call is not supported by glibc. No declaration is provided in glibc headers, but, through a quirk of history, glibc before glibc 2.23 did export an ABI for this system call. Therefore, in order to employ this system call, it was sufficient to manually declare the interface in your code; alternatively, you could invoke the system call using syscall(2).
In ancient libc versions (before glibc 2.0), uselib() was used to load the shared libraries with names found in an array of names in the binary.
Since Linux 3.15, this system call is available only when the kernel is configured with the CONFIG_USELIB option.
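Because glibc provides no declaration, one way to reach this system call is through syscall(2), as mentioned above. The following is only a sketch: the wrapper name my_uselib() is hypothetical, and the #ifdef guard is an assumption to cover architectures whose headers do not define SYS_uselib.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: invoke uselib by system call number.
   Returns 0 on success, or -1 with errno set. ENOSYS is expected
   when the kernel was built without CONFIG_USELIB or when the
   architecture has no such system call. */
int
my_uselib(const char *library)
{
#ifdef SYS_uselib
    return (int) syscall(SYS_uselib, library);
#else
    (void) library;
    errno = ENOSYS;
    return -1;
#endif
}
```

On most current systems the call is likely to fail with ENOSYS; new code should use dlopen(3) instead.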
SEE ALSO
ar(1), gcc(1), ld(1), ldd(1), mmap(2), open(2), dlopen(3), capabilities(7), ld.so(8)
414 - Linux cli command afs_syscall
NAME 🖥️ afs_syscall 🖥️
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
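The behavior described above can be observed with a short probe. This is a sketch under assumptions: the helper name probe_afs_syscall() is hypothetical, and SYS_afs_syscall is defined only on some architectures (hence the #ifdef). A sandbox or seccomp filter may also report a different error than ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: invoke afs_syscall by number and return the
   resulting errno value, or 0 if the call unexpectedly succeeded.
   Returns ENOSYS directly on architectures whose headers do not
   define SYS_afs_syscall. */
int
probe_afs_syscall(void)
{
#ifdef SYS_afs_syscall
    errno = 0;
    if (syscall(SYS_afs_syscall) == -1)
        return errno;
    return 0;
#else
    return ENOSYS;
#endif
}
```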
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
415 - Linux cli command splice
NAME 🖥️ splice 🖥️
splice data to/from a pipe
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
ssize_t splice(int fd_in, off_t *_Nullable off_in,
int fd_out, off_t *_Nullable off_out,
size_t len, unsigned int flags);
DESCRIPTION
splice() moves data between two file descriptors without copying between kernel address space and user address space. It transfers up to len bytes of data from the file descriptor fd_in to the file descriptor fd_out, where one of the file descriptors must refer to a pipe.
The following semantics apply for fd_in and off_in:
If fd_in refers to a pipe, then off_in must be NULL.
If fd_in does not refer to a pipe and off_in is NULL, then bytes are read from fd_in starting from the file offset, and the file offset is adjusted appropriately.
If fd_in does not refer to a pipe and off_in is not NULL, then off_in must point to a buffer which specifies the starting offset from which bytes will be read from fd_in; in this case, the file offset of fd_in is not changed.
Analogous statements apply for fd_out and off_out.
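The offset rules above can be illustrated with a small sketch. The helper name splice_from_offset() and its parameters are hypothetical; the sketch assumes fd_in is a regular file and pipe_wr is the write end of a pipe.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: move up to len bytes from the regular file
   fd_in, starting at file position pos, into the pipe write end
   pipe_wr. Because off_in is non-NULL, the file offset of fd_in is
   not changed; off_out must be NULL because pipe_wr is a pipe.
   Returns the number of bytes spliced, 0 at end of input, or -1. */
ssize_t
splice_from_offset(int fd_in, off_t pos, int pipe_wr, size_t len)
{
    off_t off_in = pos;     /* local copy of the starting offset */

    return splice(fd_in, &off_in, pipe_wr, NULL, len, 0);
}
```

Passing NULL instead of &off_in would read from, and advance, the file offset of fd_in.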
The flags argument is a bit mask that is composed by ORing together zero or more of the following values:
SPLICE_F_MOVE
Attempt to move pages instead of copying. This is only a hint to the kernel: pages may still be copied if the kernel cannot move the pages from the pipe, or if the pipe buffers don’t refer to full pages. The initial implementation of this flag was buggy: therefore starting in Linux 2.6.21 it is a no-op (but is still permitted in a splice() call); in the future, a correct implementation may be restored.
SPLICE_F_NONBLOCK
Do not block on I/O. This makes the splice pipe operations nonblocking, but splice() may nevertheless block because the file descriptors that are spliced to/from may block (unless they have the O_NONBLOCK flag set).
SPLICE_F_MORE
More data will be coming in a subsequent splice. This is a helpful hint when the fd_out refers to a socket (see also the description of MSG_MORE in send(2), and the description of TCP_CORK in tcp(7)).
SPLICE_F_GIFT
Unused for splice(); see vmsplice(2).
RETURN VALUE
Upon successful completion, splice() returns the number of bytes spliced to or from the pipe.
A return value of 0 means end of input. If fd_in refers to a pipe, then this means that there was no data to transfer, and it would not make sense to block because there are no writers connected to the write end of the pipe.
On error, splice() returns -1 and errno is set to indicate the error.
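The return values described above lend themselves to a copy loop. The following is a minimal sketch, not a definitive implementation: the helper name splice_copy() and the 65536-byte chunk size are assumptions, and an intermediate pipe is used because one end of each splice() call must be a pipe.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: copy everything from fd_in to fd_out by
   splicing through an intermediate pipe. Returns the number of
   bytes copied, or -1 on error. */
long long
splice_copy(int fd_in, int fd_out)
{
    int p[2];
    long long total = 0;

    if (pipe(p) == -1)
        return -1;

    for (;;) {
        /* Fill the pipe; a return value of 0 means end of input. */
        ssize_t n = splice(fd_in, NULL, p[1], NULL, 65536, 0);
        if (n == -1)
            total = -1;
        if (n <= 0)
            break;

        /* Drain exactly what was placed in the pipe. */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, fd_out, NULL, (size_t) n, 0);
            if (m <= 0) {
                total = -1;
                goto out;
            }
            n -= m;
            total += m;
        }
    }

out:
    close(p[0]);
    close(p[1]);
    return total;
}
```

Note that fd_out must not be opened in append mode, since splice() then fails with EINVAL.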
ERRORS
EAGAIN
SPLICE_F_NONBLOCK was specified in flags or one of the file descriptors had been marked as nonblocking (O_NONBLOCK), and the operation would block.
EBADF
One or both file descriptors are not valid, or do not have proper read-write mode.
EINVAL
The target filesystem doesn’t support splicing.
EINVAL
The target file is opened in append mode.
EINVAL
Neither of the file descriptors refers to a pipe.
EINVAL
An offset was given for a nonseekable device (e.g., a pipe).
EINVAL
fd_in and fd_out refer to the same pipe.
ENOMEM
Out of memory.
ESPIPE
Either off_in or off_out was not NULL, but the corresponding file descriptor refers to a pipe.
STANDARDS
Linux.
HISTORY
Linux 2.6.17, glibc 2.5.
In Linux 2.6.30 and earlier, exactly one of fd_in and fd_out was required to be a pipe. Since Linux 2.6.31, both arguments may refer to pipes.
NOTES
The three system calls splice(), vmsplice(2), and tee(2) provide user-space programs with full control over an arbitrary kernel buffer, implemented within the kernel using the same type of buffer that is used for a pipe. In overview, these system calls perform the following tasks:
splice()
moves data from the buffer to an arbitrary file descriptor, or vice versa, or from one buffer to another.
tee(2)
“copies” the data from one buffer to another.
vmsplice(2)
“copies” data from user space into the buffer.
Though we talk of copying, actual copies are generally avoided. The kernel does this by implementing a pipe buffer as a set of reference-counted pointers to pages of kernel memory. The kernel creates “copies” of pages in a buffer by creating new pointers (for the output buffer) referring to the pages, and increasing the reference counts for the pages: only pointers are copied, not the pages of the buffer.
_FILE_OFFSET_BITS should be defined to be 64 in code that uses non-null off_in or off_out or that takes the address of splice, if the code is intended to be portable to traditional 32-bit x86 and ARM platforms where off_t’s width defaults to 32 bits.
EXAMPLES
See tee(2).
SEE ALSO
copy_file_range(2), sendfile(2), tee(2), vmsplice(2), pipe(7)
416 - Linux cli command move_pages
NAME 🖥️ move_pages 🖥️
move individual pages of a process to another node
LIBRARY
NUMA (Non-Uniform Memory Access) policy library (libnuma, -lnuma)
SYNOPSIS
#include <numaif.h>
long move_pages(int pid, unsigned long count, void *pages[.count],
const int nodes[.count], int status[.count], int flags);
DESCRIPTION
move_pages() moves the specified pages of the process pid to the memory nodes specified by nodes. The result of the move is reflected in status. The flags indicate constraints on the pages to be moved.
pid is the ID of the process in which pages are to be moved. If pid is 0, then move_pages() moves pages of the calling process.
To move pages in another process requires the following privileges:
Up to and including Linux 4.12: the caller must be privileged (CAP_SYS_NICE) or the real or effective user ID of the calling process must match the real or saved-set user ID of the target process.
Because the older rules allowed the caller to discover various virtual address choices made by the kernel that could lead to the defeat of address-space-layout randomization for a process owned by the same UID as the caller, the rules were changed starting with Linux 4.13. Since Linux 4.13, permission is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check with respect to the target process; see ptrace(2).
count is the number of pages to move. It defines the size of the three arrays pages, nodes, and status.
pages is an array of pointers to the pages that should be moved. These are pointers that should be aligned to page boundaries. Addresses are specified as seen by the process specified by pid.
nodes is an array of integers that specify the desired location for each page. Each element in the array is a node number. nodes can also be NULL, in which case move_pages() does not move any pages but instead will return the node where each page currently resides, in the status array. Obtaining the status of each page may be necessary to determine pages that need to be moved.
status is an array of integers that return the status of each page. The array contains valid values only if move_pages() did not return an error. Preinitializing the array to a value that cannot represent a real NUMA node or a valid status-array error can help to identify pages that have been migrated.
flags specify what types of pages to move. MPOL_MF_MOVE means that only pages that are in exclusive use by the process are to be moved. MPOL_MF_MOVE_ALL means that pages shared between multiple processes can also be moved. The process must be privileged (CAP_SYS_NICE) to use MPOL_MF_MOVE_ALL.
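To illustrate the query mode (nodes set to NULL) and the preinitialization of status, here is a minimal sketch. The helper name query_page_node() is hypothetical. As an assumption made to avoid a dependency on libnuma, the sketch invokes the system call through syscall(2) rather than the move_pages() wrapper declared in <numaif.h>.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: ask where the page containing addr in the
   calling process currently resides. Returns the result of the
   system call (0 on success, -1 with errno set on error). On
   success, *status holds a node number or a negative status value
   such as -ENOENT. */
long
query_page_node(void *addr, int *status)
{
    uintptr_t page_size = (uintptr_t) sysconf(_SC_PAGESIZE);
    void *pages[1];

    /* Round down to a page boundary, as the pages array expects. */
    pages[0] = (void *) ((uintptr_t) addr & ~(page_size - 1));

    /* A value that is neither a node number nor a status error. */
    *status = INT_MIN;

    /* pid 0 is the calling process; nodes == NULL means query only. */
    return syscall(SYS_move_pages, 0, 1UL, pages, NULL, status, 0);
}
```

On kernels built without NUMA support, the call can be expected to fail with ENOSYS.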
Page states in the status array
The following values can be returned in each element of the status array.
0..MAX_NUMNODES
Identifies the node on which the page resides.
-EACCES
The page is mapped by multiple processes and can be moved only if MPOL_MF_MOVE_ALL is specified.
-EBUSY
The page is currently busy and cannot be moved. Try again later. This occurs if a page is undergoing I/O or another kernel subsystem is holding a reference to the page.
-EFAULT
This is a zero page or the memory area is not mapped by the process.
-EIO
Unable to write back a page. The page has to be written back in order to move it since the page is dirty and the filesystem does not provide a migration function that would allow the move of dirty pages.
-EINVAL
A dirty page cannot be moved. The filesystem does not provide a migration function and has no ability to write back pages.
-ENOENT
The page is not present.
-ENOMEM
Unable to allocate memory on target node.
RETURN VALUE
On success, move_pages() returns zero. On error, it returns -1, and sets errno to indicate the error. If a positive value is returned, it is the number of nonmigrated pages.
ERRORS
Positive value
The number of nonmigrated pages if they were the result of nonfatal reasons (since Linux 4.17).
E2BIG
Too many pages to move. Since Linux 2.6.29, the kernel no longer generates this error.
EACCES
One of the target nodes is not allowed by the current cpuset.
EFAULT
Parameter array could not be accessed.
EINVAL
Flags other than MPOL_MF_MOVE and MPOL_MF_MOVE_ALL were specified, or an attempt was made to migrate pages of a kernel thread.
ENODEV
One of the target nodes is not online.
EPERM
The caller specified MPOL_MF_MOVE_ALL without sufficient privileges (CAP_SYS_NICE). Or, the caller attempted to move pages of a process belonging to another user but did not have privilege to do so (CAP_SYS_NICE).
ESRCH
Process does not exist.
STANDARDS
Linux.
HISTORY
Linux 2.6.18.
NOTES
For information on library support, see numa(7).
Use get_mempolicy(2) with the MPOL_F_MEMS_ALLOWED flag to obtain the set of nodes that are allowed by the current cpuset. Note that this information is subject to change at any time by manual or automatic reconfiguration of the cpuset.
Use of this function may result in pages whose location (node) violates the memory policy established for the specified addresses (See mbind(2)) and/or the specified process (See set_mempolicy(2)). That is, memory policy does not constrain the destination nodes used by move_pages().
The <numaif.h> header is not included with glibc, but requires installing libnuma-devel or a similar package.
SEE ALSO
get_mempolicy(2), mbind(2), set_mempolicy(2), numa(3), numa_maps(5), cpuset(7), numa(7), migratepages(8), numastat(8)
417 - Linux cli command seccomp_unotify
NAME 🖥️ seccomp_unotify 🖥️
Seccomp user-space notification mechanism
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>
int seccomp(unsigned int operation, unsigned int flags, void *args);
#include <sys/ioctl.h>
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
struct seccomp_notif *req);
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
struct seccomp_notif_resp *resp);
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ADDFD,
struct seccomp_notif_addfd *addfd);
DESCRIPTION
This page describes the user-space notification mechanism provided by the Secure Computing (seccomp) facility. As well as the use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SECCOMP_RET_USER_NOTIF action value, and the SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this mechanism involves the use of a number of related ioctl(2) operations (described below).
Overview
In conventional usage of a seccomp filter, the decision about how to treat a system call is made by the filter itself. By contrast, the user-space notification mechanism allows the seccomp filter to delegate the handling of the system call to another user-space process. Note that this mechanism is explicitly not intended as a method implementing security policy; see NOTES.
In the discussion that follows, the thread(s) on which the seccomp filter is installed is (are) referred to as the target, and the process that is notified by the user-space notification mechanism is referred to as the supervisor.
A suitably privileged supervisor can use the user-space notification mechanism to perform actions on behalf of the target. The advantage of the user-space notification mechanism is that the supervisor will usually be able to retrieve information about the target and the performed system call that the seccomp filter itself cannot. (A seccomp filter is limited in the information it can obtain and the actions that it can perform because it is running on a virtual machine inside the kernel.)
An overview of the steps performed by the target and the supervisor is as follows:
The target establishes a seccomp filter in the usual manner, but with two differences:
The seccomp(2) flags argument includes the flag SECCOMP_FILTER_FLAG_NEW_LISTENER. Consequently, the return value of the (successful) seccomp(2) call is a new “listening” file descriptor that can be used to receive notifications. Only one “listening” seccomp filter can be installed for a thread.
In cases where it is appropriate, the seccomp filter returns the action value SECCOMP_RET_USER_NOTIF. This return value will trigger a notification event.
In order that the supervisor can obtain notifications using the listening file descriptor, (a duplicate of) that file descriptor must be passed from the target to the supervisor. One way in which this could be done is by passing the file descriptor over a UNIX domain socket connection between the target and the supervisor (using the SCM_RIGHTS ancillary message type described in unix(7)). Another way to do this is through the use of pidfd_getfd(2).
The supervisor will receive notification events on the listening file descriptor. These events are returned as structures of type seccomp_notif. Because this structure and its size may evolve over kernel versions, the supervisor must first determine the size of this structure using the seccomp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a structure of type seccomp_notif_sizes. The supervisor allocates a buffer of size seccomp_notif_sizes.seccomp_notif bytes to receive notification events. In addition, the supervisor allocates another buffer of size seccomp_notif_sizes.seccomp_notif_resp bytes for the response (a struct seccomp_notif_resp structure) that it will provide to the kernel (and thus the target).
The target then performs its workload, which includes system calls that will be controlled by the seccomp filter. Whenever one of these system calls causes the filter to return the SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet) execute the system call; instead, execution of the target is temporarily blocked inside the kernel (in a sleep state that is interruptible by signals) and a notification event is generated on the listening file descriptor.
The supervisor can now repeatedly monitor the listening file descriptor for SECCOMP_RET_USER_NOTIF-triggered events. To do this, the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation to read information about a notification event; this operation blocks until an event is available. The operation returns a seccomp_notif structure containing information about the system call that is being attempted by the target. (As described in NOTES, the file descriptor can also be monitored with select(2), poll(2), or epoll(7).)
The seccomp_notif structure returned by the SECCOMP_IOCTL_NOTIF_RECV operation includes the same information (a seccomp_data structure) that was passed to the seccomp filter. This information allows the supervisor to discover the system call number and the arguments for the target’s system call. In addition, the notification event contains the ID of the thread that triggered the notification and a unique cookie value that is used in subsequent SECCOMP_IOCTL_NOTIF_ID_VALID and SECCOMP_IOCTL_NOTIF_SEND operations.
The information in the notification can be used to discover the values of pointer arguments for the target’s system call. (This is something that can’t be done from within a seccomp filter.) One way in which the supervisor can do this is to open the corresponding /proc/tid/mem file (see proc(5)) and read bytes from the location that corresponds to one of the pointer arguments whose value is supplied in the notification event. (The supervisor must be careful to avoid a race condition that can occur when doing this; see the description of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.) In addition, the supervisor can access other system information that is visible in user space but which is not accessible from a seccomp filter.
Having obtained information as per the previous step, the supervisor may then choose to perform an action in response to the target’s system call (which, as noted above, is not executed when the seccomp filter returns the SECCOMP_RET_USER_NOTIF action value).
One example use case here relates to containers. The target may be located inside a container where it does not have sufficient capabilities to mount a filesystem in the container’s mount namespace. However, the supervisor may be a more privileged process that does have sufficient capabilities to perform the mount operation.
The supervisor then sends a response to the notification. The information in this response is used by the kernel to construct a return value for the target’s system call and provide a value that will be assigned to the errno variable of the target.
The response is sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation, which is used to transmit a seccomp_notif_resp structure to the kernel. This structure must include the cookie value that the supervisor obtained in the seccomp_notif structure returned by the SECCOMP_IOCTL_NOTIF_RECV operation; the cookie allows the kernel to associate the response with the target.
Once the notification has been sent, the system call in the target thread unblocks, returning the information that was provided by the supervisor in the notification response.
As a variation on the last two steps, the supervisor can send a response that tells the kernel that it should execute the target thread’s system call; see the discussion of SECCOMP_USER_NOTIF_FLAG_CONTINUE, below.
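The supervisor's side of the steps above can be sketched as follows. This is a minimal illustration, not a complete supervisor: the helper names alloc_notif_buffers() and deny_one() are hypothetical, the policy (answering every notification with a spoofed EPERM failure) is chosen only for brevity, and the listening file descriptor is assumed to have been obtained from the target already.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/seccomp.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: determine the structure sizes and allocate
   the request and response buffers. Returns 0 on success, -1 on
   error. */
int
alloc_notif_buffers(struct seccomp_notif_sizes *sizes,
                    struct seccomp_notif **req,
                    struct seccomp_notif_resp **resp)
{
    /* glibc has no seccomp() wrapper, so use syscall(2). */
    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
        return -1;

    /* Never allocate less than the compile-time structure sizes, so
       that the field accesses in deny_one() stay in bounds. */
    if (sizes->seccomp_notif < sizeof(struct seccomp_notif))
        sizes->seccomp_notif = sizeof(struct seccomp_notif);
    if (sizes->seccomp_notif_resp < sizeof(struct seccomp_notif_resp))
        sizes->seccomp_notif_resp = sizeof(struct seccomp_notif_resp);

    *req = malloc(sizes->seccomp_notif);
    *resp = malloc(sizes->seccomp_notif_resp);
    if (*req == NULL || *resp == NULL) {
        free(*req);
        free(*resp);
        *req = NULL;
        *resp = NULL;
        return -1;
    }
    return 0;
}

/* Hypothetical helper: wait for one notification and answer it with
   a spoofed EPERM failure. Returns 0 if a response was sent, or -1
   with errno set. */
int
deny_one(int notify_fd, const struct seccomp_notif_sizes *sizes,
         struct seccomp_notif *req, struct seccomp_notif_resp *resp)
{
    /* The request buffer must be zeroed before the call. */
    memset(req, 0, sizes->seccomp_notif);
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1)
        return -1;

    memset(resp, 0, sizes->seccomp_notif_resp);
    resp->id = req->id;         /* cookie from the notification */
    resp->error = -EPERM;       /* negative error number */

    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
```

A real supervisor would call deny_one() (or a more selective variant) in a loop and decide per system call, using req->data, how to respond.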
IOCTL OPERATIONS
The following ioctl(2) operations are supported by the seccomp user-space notification file descriptor. For each of these operations, the first (file descriptor) argument of ioctl(2) is the listening file descriptor returned by a call to seccomp(2) with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
SECCOMP_IOCTL_NOTIF_RECV
The SECCOMP_IOCTL_NOTIF_RECV operation (available since Linux 5.0) is used to obtain a user-space notification event. If no such event is currently pending, the operation blocks until an event occurs. The third ioctl(2) argument is a pointer to a structure of the following form which contains information about the event. This structure must be zeroed out before the call.
struct seccomp_notif {
    __u64  id;              /* Cookie */
    __u32  pid;             /* TID of target thread */
    __u32  flags;           /* Currently unused (0) */
    struct seccomp_data data;   /* See seccomp(2) */
};
The fields in this structure are as follows:
id
This is a cookie for the notification. Each such cookie is guaranteed to be unique for the corresponding seccomp filter.
The cookie can be used with the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation described below.
When returning a notification response to the kernel, the supervisor must include the cookie value in the seccomp_notif_resp structure that is specified as the argument of the SECCOMP_IOCTL_NOTIF_SEND operation.
pid
This is the thread ID of the target thread that triggered the notification event.
flags
This is a bit mask of flags providing further information on the event. In the current implementation, this field is always zero.
data
This is a seccomp_data structure containing information about the system call that triggered the notification. This is the same structure that is passed to the seccomp filter. See seccomp(2) for details of this structure.
On success, this operation returns 0; on failure, -1 is returned, and errno is set to indicate the cause of the error. This operation can fail with the following errors:
EINVAL (since Linux 5.5)
The seccomp_notif structure that was passed to the call contained nonzero fields.
ENOENT
The target thread was killed by a signal as the notification information was being generated, or the target’s (blocked) system call was interrupted by a signal handler.
SECCOMP_IOCTL_NOTIF_ID_VALID
The SECCOMP_IOCTL_NOTIF_ID_VALID operation (available since Linux 5.0) is used to check that a notification ID returned by an earlier SECCOMP_IOCTL_NOTIF_RECV operation is still valid (i.e., that the target still exists and its system call is still blocked waiting for a response).
The third ioctl(2) argument is a pointer to the cookie (id) returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
This operation is necessary to avoid race conditions that can occur when the thread whose ID (pid) was returned by the SECCOMP_IOCTL_NOTIF_RECV operation terminates, and that ID is reused by another thread or process. An example of this kind of race is the following:
A notification is generated on the listening file descriptor. The returned seccomp_notif contains the TID of the target thread (in the pid field of the structure).
The target terminates.
Another thread or process is created on the system that by chance reuses the TID that was freed when the target terminated.
The supervisor open(2)s the /proc/tid/mem file for the TID obtained in step 1, with the intention of (say) inspecting the memory location(s) that contain the argument(s) of the system call that triggered the notification in step 1.
In the above scenario, the risk is that the supervisor may try to access the memory of a process other than the target. This race can be avoided by following the call to open(2) with a SECCOMP_IOCTL_NOTIF_ID_VALID operation to verify that the process that generated the notification is still alive. (Note that if the target terminates after the latter step, a subsequent read(2) from the file descriptor may return 0, indicating end of file.)
See NOTES for a discussion of other cases where SECCOMP_IOCTL_NOTIF_ID_VALID checks must be performed.
On success (i.e., the notification ID is still valid), this operation returns 0. On failure (i.e., the notification ID is no longer valid), -1 is returned, and errno is set to ENOENT.
SECCOMP_IOCTL_NOTIF_SEND
The SECCOMP_IOCTL_NOTIF_SEND operation (available since Linux 5.0) is used to send a notification response back to the kernel. The third ioctl(2) argument is a pointer to a structure of the following form:
struct seccomp_notif_resp {
__u64 id; /* Cookie value */
__s64 val; /* Success return value */
__s32 error; /* 0 (success) or negative error number */
__u32 flags; /* See below */
};
The fields of this structure are as follows:
id
This is the cookie value that was obtained using the SECCOMP_IOCTL_NOTIF_RECV operation. This cookie value allows the kernel to correctly associate this response with the system call that triggered the user-space notification.
val
This is the value that will be used for a spoofed success return for the target’s system call; see below.
error
This is the value that will be used as the error number (errno) for a spoofed error return for the target’s system call; see below.
flags
This is a bit mask that includes zero or more of the following flags:
SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
Tell the kernel to execute the target’s system call.
Two kinds of response are possible:
A response to the kernel telling it to execute the target’s system call. In this case, the flags field includes SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error and val fields must be zero.
This kind of response can be useful in cases where the supervisor needs to do deeper analysis of the target’s system call than is possible from a seccomp filter (e.g., examining the values of pointer arguments), and, having decided that the system call does not require emulation by the supervisor, the supervisor wants the system call to be executed normally in the target.
The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used with caution; see NOTES.
A spoofed return value for the target’s system call. In this case, the kernel does not execute the target’s system call, instead causing the system call to return a spoofed value as specified by fields of the seccomp_notif_resp structure. The supervisor should set the fields of this structure as follows:
flags does not contain SECCOMP_USER_NOTIF_FLAG_CONTINUE.
error is set either to 0 for a spoofed “success” return or to a negative error number for a spoofed “failure” return. In the former case, the kernel causes the target’s system call to return the value specified in the val field. In the latter case, the kernel causes the target’s system call to return -1, and errno is assigned the negated error value.
val is set to a value that will be used as the return value for a spoofed “success” return for the target’s system call. The value in this field is ignored if the error field contains a nonzero value.
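The two kinds of response can be summarized in code. The sketch below fills in a seccomp_notif_resp structure for each case; the helper names (resp_continue(), resp_success(), resp_failure()) are hypothetical, and the fallback definition of SECCOMP_USER_NOTIF_FLAG_CONTINUE is only an assumption for building against headers older than Linux 5.5.

```c
#include <assert.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <string.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE   /* Headers older than Linux 5.5 */
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Hypothetical helper: ask the kernel to execute the target's system call */
static void
resp_continue(struct seccomp_notif_resp *resp, uint64_t id)
{
    assert(resp != NULL);
    memset(resp, 0, sizeof(*resp));     /* 'val' and 'error' must be 0 */
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Hypothetical helper: spoof a successful return of 'val' */
static void
resp_success(struct seccomp_notif_resp *resp, uint64_t id, int64_t val)
{
    assert(resp != NULL);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
}

/* Hypothetical helper: spoof a failure; 'err' is a positive errno value */
static void
resp_failure(struct seccomp_notif_resp *resp, uint64_t id, int err)
{
    assert(resp != NULL && err > 0);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;                 /* The kernel expects a negative value */
}
```

In each case the filled-in structure would then be passed as the third argument of a SECCOMP_IOCTL_NOTIF_SEND ioctl(2) call.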
On success, this operation returns 0; on failure, -1 is returned, and errno is set to indicate the cause of the error. This operation can fail with the following errors:
EINPROGRESS
A response to this notification has already been sent.
EINVAL
An invalid value was specified in the flags field.
EINVAL
The flags field contained SECCOMP_USER_NOTIF_FLAG_CONTINUE, and the error or val field was not zero.
ENOENT
The blocked system call in the target has been interrupted by a signal handler or the target has terminated.
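One way a supervisor might act on these errors is sketched below. The enumeration and the function classify_send_error() are hypothetical names introduced only for illustration; the mapping simply restates the error list above.

```c
#include <assert.h>
#include <errno.h>

enum send_outcome {
    SEND_TARGET_GONE,       /* ENOENT: discard the response */
    SEND_ALREADY_ANSWERED,  /* EINPROGRESS: a response was sent earlier */
    SEND_BAD_RESPONSE,      /* EINVAL: the response was filled in wrongly */
    SEND_OTHER_ERROR        /* Anything else: report it */
};

/* Hypothetical helper: map the errno value left by a failed
   SECCOMP_IOCTL_NOTIF_SEND to the action the supervisor should take */
static enum send_outcome
classify_send_error(int err)
{
    assert(err != 0);

    switch (err) {
    case ENOENT:
        return SEND_TARGET_GONE;
    case EINPROGRESS:
        return SEND_ALREADY_ANSWERED;
    case EINVAL:
        return SEND_BAD_RESPONSE;
    default:
        return SEND_OTHER_ERROR;
    }
}
```

Treating ENOENT as a normal event rather than a fatal error matters in practice: the target may be interrupted or killed at any moment, so a supervisor that exits on this error would be fragile.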
SECCOMP_IOCTL_NOTIF_ADDFD
The SECCOMP_IOCTL_NOTIF_ADDFD operation (available since Linux 5.9) allows the supervisor to install a file descriptor into the target’s file descriptor table. Much like the use of SCM_RIGHTS messages described in unix(7), this operation is semantically equivalent to duplicating a file descriptor from the supervisor’s file descriptor table into the target’s file descriptor table.
The SECCOMP_IOCTL_NOTIF_ADDFD operation permits the supervisor to emulate a target system call (such as socket(2) or openat(2)) that generates a file descriptor. The supervisor can perform the system call that generates the file descriptor (and associated open file description) and then use this operation to allocate a file descriptor that refers to the same open file description in the target. (For an explanation of open file descriptions, see open(2).)
Once this operation has been performed, the supervisor can close its copy of the file descriptor.
In the target, the received file descriptor is subject to the same Linux Security Module (LSM) checks as are applied to a file descriptor that is received in an SCM_RIGHTS ancillary message. If the file descriptor refers to a socket, it inherits the cgroup version 1 network controller settings (classid and netprioidx) of the target.
The third ioctl(2) argument is a pointer to a structure of the following form:
struct seccomp_notif_addfd {
__u64 id; /* Cookie value */
__u32 flags; /* Flags */
__u32 srcfd; /* Local file descriptor number */
__u32 newfd; /* 0 or desired file descriptor
number in target */
__u32 newfd_flags; /* Flags to set on target file
descriptor */
};
The fields in this structure are as follows:
id
This field should be set to the notification ID (cookie value) that was obtained via SECCOMP_IOCTL_NOTIF_RECV.
flags
This field is a bit mask of flags that modify the behavior of the operation. The following flags are supported:
SECCOMP_ADDFD_FLAG_SETFD
When allocating the file descriptor in the target, use the file descriptor number specified in the newfd field.
SECCOMP_ADDFD_FLAG_SEND (since Linux 5.14)
Perform the equivalent of SECCOMP_IOCTL_NOTIF_ADDFD plus SECCOMP_IOCTL_NOTIF_SEND as an atomic operation. On successful invocation, the target process’s errno will be 0 and the return value will be the file descriptor number that was allocated in the target. If allocating the file descriptor in the target fails, the target’s system call continues to be blocked until a successful response is sent.
srcfd
This field should be set to the number of the file descriptor in the supervisor that is to be duplicated.
newfd
This field determines which file descriptor number is allocated in the target. If the SECCOMP_ADDFD_FLAG_SETFD flag is set, then this field specifies which file descriptor number should be allocated. If this file descriptor number is already open in the target, it is atomically closed and reused. If the descriptor duplication fails due to an LSM check, or if srcfd is not a valid file descriptor, the file descriptor newfd will not be closed in the target process.
If the SECCOMP_ADDFD_FLAG_SETFD flag is not set, then this field must be 0, and the kernel allocates the lowest unused file descriptor number in the target.
newfd_flags
This field is a bit mask specifying flags that should be set on the file descriptor that is received in the target process. Currently, only the following flag is implemented:
O_CLOEXEC
Set the close-on-exec flag on the received file descriptor.
On success, this ioctl(2) call returns the number of the file descriptor that was allocated in the target. Assuming that the emulated system call is one that returns a file descriptor as its function result (e.g., socket(2)), this value can be used as the return value (resp.val) that is supplied in the response that is subsequently sent with the SECCOMP_IOCTL_NOTIF_SEND operation.
On error, -1 is returned and errno is set to indicate the cause of the error.
This operation can fail with the following errors:
EBADF
Allocating the file descriptor in the target would cause the target’s RLIMIT_NOFILE limit to be exceeded (see getrlimit(2)).
EBUSY
If the flag SECCOMP_ADDFD_FLAG_SEND is used, this means the operation cannot proceed until other SECCOMP_IOCTL_NOTIF_ADDFD requests are processed.
EINPROGRESS
The user-space notification specified in the id field exists but has not yet been fetched (by a SECCOMP_IOCTL_NOTIF_RECV) or has already been responded to (by a SECCOMP_IOCTL_NOTIF_SEND).
EINVAL
An invalid flag was specified in the flags or newfd_flags field, or the newfd field is nonzero and the SECCOMP_ADDFD_FLAG_SETFD flag was not specified in the flags field.
EMFILE
The file descriptor number specified in newfd exceeds the limit specified in /proc/sys/fs/nr_open.
ENOENT
The blocked system call in the target has been interrupted by a signal handler or the target has terminated.
Here is some sample code (with error handling omitted) that uses the SECCOMP_IOCTL_NOTIF_ADDFD operation (here, to emulate a call to openat(2)):
int fd, targetFd;
fd = openat(req->data.args[0], path, req->data.args[2],
req->data.args[3]);
struct seccomp_notif_addfd addfd;
addfd.id = req->id; /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
addfd.srcfd = fd;
addfd.newfd = 0;
addfd.flags = 0;
addfd.newfd_flags = O_CLOEXEC;
targetFd = ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
close(fd); /* No longer needed in supervisor */
struct seccomp_notif_resp *resp;
/* Code to allocate 'resp' omitted */
resp->id = req->id;
resp->error = 0; /* "Success" */
resp->val = targetFd;
resp->flags = 0;
ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);
NOTES
One example use case for the user-space notification mechanism is to allow a container manager (a process which is typically running with more privilege than the processes inside the container) to mount block devices or create device nodes for the container. The mount use case provides an example of where the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag is useful. Upon receiving a notification for the mount(2) system call, the container manager (the “supervisor”) can distinguish a request to mount a block filesystem (which would not be possible for a “target” process inside the container) and mount that filesystem. If, on the other hand, the container manager detects that the operation could be performed by the process inside the container (e.g., a mount of a tmpfs(5) filesystem), it can notify the kernel that the target process’s mount(2) system call can continue.
select()/poll()/epoll semantics
The file descriptor returned when seccomp(2) is employed with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using poll(2), epoll(7), and select(2). These interfaces indicate that the file descriptor is ready as follows:
When a notification is pending, these interfaces indicate that the file descriptor is readable. Following such an indication, a subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block, returning either information about a notification or else failing with the error ENOENT if the target has been killed by a signal or its system call has been interrupted by a signal handler.
After the notification has been received (i.e., by the SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces indicate that the file descriptor is writable, meaning that a notification response can be sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation.
After the last thread using the filter has terminated and been reaped using waitpid(2) (or similar), the file descriptor indicates an end-of-file condition (readable in select(2); POLLHUP/EPOLLHUP in poll(2)/epoll_wait(2)).
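The readiness rules above suggest polling the listening file descriptor before attempting to receive a notification, so that the supervisor never blocks on a target that has already gone away. The following is a minimal sketch under that assumption; the enumeration and the function wait_for_notification() are hypothetical names.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

enum notify_state {
    NOTIFY_PENDING,     /* Readable: a receive operation will not block */
    NOTIFY_HANGUP,      /* All threads using the filter have terminated */
    NOTIFY_TIMEOUT,     /* Nothing happened within 'timeoutMs' */
    NOTIFY_ERROR        /* poll(2) failed, or reported POLLERR/POLLNVAL */
};

/* Hypothetical helper: wait up to 'timeoutMs' milliseconds (-1 means
   forever) for an event on the notification file descriptor */
static enum notify_state
wait_for_notification(int notifyFd, int timeoutMs)
{
    struct pollfd pfd = { .fd = notifyFd, .events = POLLIN };
    int ready;

    assert(notifyFd >= 0);

    ready = poll(&pfd, 1, timeoutMs);
    if (ready == -1)
        return NOTIFY_ERROR;
    if (ready == 0)
        return NOTIFY_TIMEOUT;
    if (pfd.revents & POLLHUP)          /* Checked first: target is gone */
        return NOTIFY_HANGUP;
    if (pfd.revents & POLLIN)
        return NOTIFY_PENDING;
    return NOTIFY_ERROR;
}
```

A supervisor loop built on this helper would perform SECCOMP_IOCTL_NOTIF_RECV only after NOTIFY_PENDING, and would stop when it sees NOTIFY_HANGUP.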
Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
The intent of the user-space notification feature is to allow system calls to be performed on behalf of the target. The target’s system call should either be handled by the supervisor or allowed to continue normally in the kernel (where standard security policies will be applied).
Note well: this mechanism must not be used to make security policy decisions about the system call, which would be inherently race-prone for reasons described next.
The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution. If set by the supervisor, the target’s system call will continue. However, there is a time-of-check, time-of-use race here, since an attacker could exploit the interval of time where the target is blocked waiting on the “continue” response to do things such as rewriting the system call arguments.
Note furthermore that a user-space notifier can be bypassed if the existing filters allow the use of seccomp(2) or prctl(2) to install a filter that returns an action value with a higher precedence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).
It should thus be absolutely clear that the seccomp user-space notification mechanism cannot be used to implement a security policy! It should only ever be used in scenarios where a more privileged process supervises the system calls of a lesser privileged target to get around kernel-enforced security restrictions when the supervisor deems this safe. In other words, in order to continue a system call, the supervisor should be sure that another security mechanism or the kernel itself will sufficiently block the system call if its arguments are rewritten to something unsafe.
Caveats regarding the use of /proc/tid/mem
The discussion above noted the need to use the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the /proc/tid/mem file of the target to avoid the possibility of accessing the memory of the wrong process in the event that the target terminates and its ID is recycled by another (unrelated) thread. However, the use of this ioctl(2) operation is also necessary in other situations, as explained in the following paragraphs.
Consider the following scenario, where the supervisor tries to read the pathname argument of a target’s blocked mount(2) system call:
From one of its functions (func()), the target calls mount(2), which triggers a user-space notification and causes the target to block.
The supervisor receives the notification, opens /proc/tid/mem, and (successfully) performs the SECCOMP_IOCTL_NOTIF_ID_VALID check.
The target receives a signal, which causes the mount(2) to abort.
The signal handler executes in the target, and returns.
Upon return from the handler, the execution of func() resumes, and it returns (and perhaps other functions are called, overwriting the memory that had been used for the stack frame of func()).
Using the address provided in the notification information, the supervisor reads from the target’s memory location that used to contain the pathname.
The supervisor now calls mount(2) with some arbitrary bytes obtained in the previous step.
The conclusion from the above scenario is this: since the target’s blocked system call may be interrupted by a signal handler, the supervisor must be written to expect that the target may abandon its system call at any time; in such an event, any information that the supervisor obtained from the target’s memory must be considered invalid.
To prevent such scenarios, every read from the target’s memory must be separated from use of the bytes so obtained by a SECCOMP_IOCTL_NOTIF_ID_VALID check. In the above example, the check would be placed between the two final steps. An example of such a check is shown in EXAMPLES.
Following on from the above, it should be clear that a write by the supervisor into the target’s memory can never be considered safe.
Caveats regarding blocking system calls
Suppose that the target performs a blocking system call (e.g., accept(2)) that the supervisor should handle. The supervisor might then in turn execute the same blocking system call.
In this scenario, it is important to note that if the target’s system call is now interrupted by a signal, the supervisor is not informed of this. If the supervisor does not take suitable steps to actively discover that the target’s system call has been canceled, various difficulties can occur. Taking the example of accept(2), the supervisor might remain blocked in its accept(2) holding a port number that the target (which, after the interruption by the signal handler, perhaps closed its listening socket) might expect to be able to reuse in a bind(2) call.
Therefore, when the supervisor wishes to emulate a blocking system call, it must do so in such a way that it gets informed if the target’s system call is interrupted by a signal handler. For example, if the supervisor itself executes the same blocking system call, then it could employ a separate thread that uses the SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is still blocked in its system call. Alternatively, in the accept(2) example, the supervisor might use poll(2) to monitor both the notification file descriptor (so as to discover when the target’s accept(2) call has been interrupted) and the listening file descriptor (so as to know when a connection is available).
If the target’s system call is interrupted, the supervisor must take care to release resources (e.g., file descriptors) that it acquired on behalf of the target.
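The poll(2)-based alternative for the accept(2) example might look like the sketch below. The function wait_listener_or_notify() and the WAKE_* constants are hypothetical. Note one assumption: an event on the notification file descriptor does not by itself prove that the target's call was interrupted (it could be a notification from another thread), so the caller is expected to confirm with SECCOMP_IOCTL_NOTIF_ID_VALID before abandoning the emulated call.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define WAKE_LISTENER 0x1   /* A connection may be available for accept(2) */
#define WAKE_NOTIFY   0x2   /* Re-check the cookie before doing more work */

/* Hypothetical helper: wait until either descriptor has an event.
   Returns a mask of WAKE_* bits, 0 on timeout, or -1 if poll(2) fails. */
static int
wait_listener_or_notify(int listenFd, int notifyFd, int timeoutMs)
{
    struct pollfd pfds[2] = {
        { .fd = listenFd, .events = POLLIN },
        { .fd = notifyFd, .events = POLLIN },
    };
    int mask = 0;

    assert(listenFd >= 0 && notifyFd >= 0);

    if (poll(pfds, 2, timeoutMs) == -1)
        return -1;
    if (pfds[0].revents != 0)
        mask |= WAKE_LISTENER;
    if (pfds[1].revents != 0)
        mask |= WAKE_NOTIFY;
    return mask;
}
```

With this approach the supervisor calls accept(2) only when WAKE_LISTENER is reported and the cookie is still valid, so it never sits blocked in accept(2) on behalf of a target that has abandoned its call.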
Interaction with SA_RESTART signal handlers
Consider the following scenario:
The target process has used sigaction(2) to install a signal handler with the SA_RESTART flag.
The target has made a system call that triggered a seccomp user-space notification and the target is currently blocked until the supervisor sends a notification response.
A signal is delivered to the target and the signal handler is executed.
When (if) the supervisor attempts to send a notification response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will fail with the ENOENT error.
In this scenario, the kernel will restart the target’s system call. Consequently, the supervisor will receive another user-space notification. Thus, depending on how many times the blocked system call is interrupted by a signal handler, the supervisor may receive multiple notifications for the same instance of a system call in the target.
One oddity is that system call restarting as described in this scenario will occur even for the blocking system calls listed in signal(7) that would never normally be restarted by the SA_RESTART flag.
Furthermore, if the supervisor response is a file descriptor added with SECCOMP_IOCTL_NOTIF_ADDFD, then the flag SECCOMP_ADDFD_FLAG_SEND can be used to atomically add the file descriptor and return that value, making sure no file descriptors are inadvertently leaked into the target.
BUGS
If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed after the target terminates, then the ioctl(2) call simply blocks (rather than returning an error to indicate that the target no longer exists).
EXAMPLES
The (somewhat contrived) program shown below demonstrates the use of the interfaces described in this page. The program creates a child process that serves as the “target” process. The child process installs a seccomp filter that returns the SECCOMP_RET_USER_NOTIF action value if a call is made to mkdir(2). The child process then calls mkdir(2) once for each of the supplied command-line arguments, and reports the result returned by the call. After processing all arguments, the child process terminates.
The parent process acts as the supervisor, listening for the notifications that are generated when the target process calls mkdir(2). When such a notification occurs, the supervisor examines the memory of the target process (using /proc/pid/mem) to discover the pathname argument that was supplied to the mkdir(2) call, and performs one of the following actions:
If the pathname begins with the prefix “/tmp/”, then the supervisor attempts to create the specified directory, and then spoofs a return for the target process based on the return value of the supervisor’s mkdir(2) call. In the event that that call succeeds, the spoofed success return value is the length of the pathname.
If the pathname begins with “./” (i.e., it is a relative pathname), the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the kernel to say that the kernel should execute the target process’s mkdir(2) call.
If the pathname begins with some other prefix, the supervisor spoofs an error return for the target process, so that the target process’s mkdir(2) call appears to fail with the error EOPNOTSUPP (“Operation not supported”). Additionally, if the specified pathname is exactly “/bye”, then the supervisor terminates.
This program can be used to demonstrate various aspects of the behavior of the seccomp user-space notification mechanism. To aid such demonstrations, the program logs various messages to show the operation of the target process (lines prefixed “T:”) and the supervisor (indented lines prefixed “S:”).
In the following example, the target attempts to create the directory /tmp/x. Upon receiving the notification, the supervisor creates the directory on the target’s behalf, and spoofs a success return to be received by the target process’s mkdir(2) call.
$ ./seccomp_unotify /tmp/x
T: PID = 23168
T: about to mkdir("/tmp/x")
        S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
        S: executing: mkdir("/tmp/x", 0700)
        S: success! spoofed return = 6
        S: sending response (flags = 0; val = 6; error = 0)
T: SUCCESS: mkdir(2) returned 6
T: terminating
        S: target has terminated; bye
In the above output, note that the spoofed return value seen by the target process is 6 (the length of the pathname /tmp/x), whereas a normal mkdir(2) call returns 0 on success.
In the next example, the target attempts to create a directory using the relative pathname ./sub. Since this pathname starts with “./”, the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the kernel, and the kernel then (successfully) executes the target process’s mkdir(2) call.
$ ./seccomp_unotify ./sub
T: PID = 23204
T: about to mkdir("./sub")
        S: got notification (ID 0xddb16abe25b4c12) for PID 23204
        S: target can execute system call
        S: sending response (flags = 0x1; val = 0; error = 0)
T: SUCCESS: mkdir(2) returned 0
T: terminating
        S: target has terminated; bye
If the target process attempts to create a directory with a pathname that doesn’t start with “./” and doesn’t begin with the prefix “/tmp/”, then the supervisor spoofs an error return (EOPNOTSUPP, “Operation not supported”) for the target’s mkdir(2) call (which is not executed):
$ ./seccomp_unotify /xxx
T: PID = 23178
T: about to mkdir("/xxx")
        S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
        S: spoofing error response (Operation not supported)
        S: sending response (flags = 0; val = 0; error = -95)
T: ERROR: mkdir(2): Operation not supported
T: terminating
        S: target has terminated; bye
In the next example, the target process attempts to create a directory with the pathname /tmp/nosuchdir/b. Upon receiving the notification, the supervisor attempts to create that directory, but the mkdir(2) call fails because the directory /tmp/nosuchdir does not exist. Consequently, the supervisor spoofs an error return that passes the error that it received back to the target process’s mkdir(2) call.
$ ./seccomp_unotify /tmp/nosuchdir/b
T: PID = 23199
T: about to mkdir("/tmp/nosuchdir/b")
        S: got notification (ID 0x8744454293506046) for PID 23199
        S: executing: mkdir("/tmp/nosuchdir/b", 0700)
        S: failure! (errno = 2; No such file or directory)
        S: sending response (flags = 0; val = 0; error = -2)
T: ERROR: mkdir(2): No such file or directory
T: terminating
        S: target has terminated; bye
If the supervisor receives a notification and sees that the argument of the target’s mkdir(2) is the string “/bye”, then (as well as spoofing an EOPNOTSUPP error), the supervisor terminates. If the target process subsequently executes another mkdir(2) that triggers its seccomp filter to return the SECCOMP_RET_USER_NOTIF action value, then the kernel causes the target process’s system call to fail with the error ENOSYS (“Function not implemented”). This is demonstrated by the following example:
$ ./seccomp_unotify /bye /tmp/y
T: PID = 23185
T: about to mkdir("/bye")
        S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
        S: spoofing error response (Operation not supported)
        S: sending response (flags = 0; val = 0; error = -95)
        S: terminating **********
T: ERROR: mkdir(2): Operation not supported
T: about to mkdir("/tmp/y")
T: ERROR: mkdir(2): Function not implemented
T: terminating
Program source
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
/* Send the file descriptor 'fd' over the connected UNIX domain socket
'sockfd'. Returns 0 on success, or -1 on error. */
static int
sendfd(int sockfd, int fd)
{
int data;
struct iovec iov;
struct msghdr msgh;
struct cmsghdr *cmsgp;
/* Allocate a char array of suitable size to hold the ancillary data.
However, since this buffer is in reality a 'struct cmsghdr', use a
union to ensure that it is suitably aligned. */
union {
char buf[CMSG_SPACE(sizeof(int))];
/* Space large enough to hold an 'int' */
struct cmsghdr align;
} controlMsg;
/* The 'msg_name' field can be used to specify the address of the
destination socket when sending a datagram. However, we do not
need to use this field because 'sockfd' is a connected socket. */
msgh.msg_name = NULL;
msgh.msg_namelen = 0;
/* On Linux, we must transmit at least one byte of real data in
order to send ancillary data. We transmit an arbitrary integer
whose value is ignored by recvfd(). */
msgh.msg_iov = &iov;
msgh.msg_iovlen = 1;
iov.iov_base = &data;
iov.iov_len = sizeof(int);
data = 12345;
/* Set 'msghdr' fields that describe ancillary data */
msgh.msg_control = controlMsg.buf;
msgh.msg_controllen = sizeof(controlMsg.buf);
/* Set up ancillary data describing file descriptor to send */
cmsgp = CMSG_FIRSTHDR(&msgh);
cmsgp->cmsg_level = SOL_SOCKET;
cmsgp->cmsg_type = SCM_RIGHTS;
cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
/* Send real plus ancillary data */
if (sendmsg(sockfd, &msgh, 0) == -1)
return -1;
return 0;
}
/* Receive a file descriptor on a connected UNIX domain socket. Returns
the received file descriptor on success, or -1 on error. */
static int
recvfd(int sockfd)
{
int data, fd;
ssize_t nr;
struct iovec iov;
struct msghdr msgh;
/* Allocate a char buffer for the ancillary data. See the comments
in sendfd() */
union {
char buf[CMSG_SPACE(sizeof(int))];
struct cmsghdr align;
} controlMsg;
struct cmsghdr *cmsgp;
/* The 'msg_name' field can be used to obtain the address of the
sending socket. However, we do not need this information. */
msgh.msg_name = NULL;
msgh.msg_namelen = 0;
/* Specify buffer for receiving real data */
msgh.msg_iov = &iov;
msgh.msg_iovlen = 1;
iov.iov_base = &data; /* Real data is an 'int' */
iov.iov_len = sizeof(int);
/* Set 'msghdr' fields that describe ancillary data */
msgh.msg_control = controlMsg.buf;
msgh.msg_controllen = sizeof(controlMsg.buf);
/* Receive real plus ancillary data; real data is ignored */
nr = recvmsg(sockfd, &msgh, 0);
if (nr == -1)
return -1;
cmsgp = CMSG_FIRSTHDR(&msgh);
/* Check the validity of the 'cmsghdr' */
if (cmsgp == NULL
|| cmsgp->cmsg_len != CMSG_LEN(sizeof(int))
|| cmsgp->cmsg_level != SOL_SOCKET
|| cmsgp->cmsg_type != SCM_RIGHTS)
{
errno = EINVAL;
return -1;
}
/* Return the received file descriptor to our caller */
memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
return fd;
}
static void
sigchldHandler(int sig)
{
char msg[] = "\tS: target has terminated; bye\n";
write(STDOUT_FILENO, msg, sizeof(msg) - 1);
_exit(EXIT_SUCCESS);
}
static int
seccomp(unsigned int operation, unsigned int flags, void *args)
{
return syscall(SYS_seccomp, operation, flags, args);
}
/* The following is the x86-64-specific BPF boilerplate code for checking
that the BPF program is running on the right architecture + ABI. At
completion of these instructions, the accumulator contains the system
call number. */
/* For the x32 ABI, all system call numbers have bit 30 set */
#define X32_SYSCALL_BIT 0x40000000
#define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
(offsetof(struct seccomp_data, arch))), \
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
(offsetof(struct seccomp_data, nr))), \
BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)
/* installNotifyFilter() installs a seccomp filter that generates
user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
calls mkdir(2); the filter allows all other system calls.
The function return value is a file descriptor from which the
user-space notifications can be fetched. */
static int
installNotifyFilter(void)
{
int notifyFd;
struct sock_filter filter[] = {
X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,
/* mkdir() triggers notification to user-space supervisor */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mkdir, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
/* Every other system call is allowed */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
struct sock_fprog prog = {
.len = ARRAY_SIZE(filter),
.filter = filter,
};
/* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
as a result, seccomp() returns a notification file descriptor. */
notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
if (notifyFd == -1)
err(EXIT_FAILURE, "seccomp-install-notify-filter");
return notifyFd;
}
/* Close a pair of sockets created by socketpair() */
static void
closeSocketPair(int sockPair[2])
{
if (close(sockPair[0]) == -1)
err(EXIT_FAILURE, "closeSocketPair-close-0");
if (close(sockPair[1]) == -1)
err(EXIT_FAILURE, "closeSocketPair-close-1");
}
/* Implementation of the target process; create a child process that:
(1) installs a seccomp filter with the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
(2) writes the seccomp notification file descriptor returned from
the previous step onto the UNIX domain socket, 'sockPair[0]';
(3) calls mkdir(2) for each element of 'argv'.
The function return value in the parent is the PID of the child
process; the child does not return from this function. */
static pid_t
targetProcess(int sockPair[2], char *argv[])
{
int notifyFd, s;
pid_t targetPid;
targetPid = fork();
if (targetPid == -1)
err(EXIT_FAILURE, "fork");
if (targetPid > 0) /* In parent, return PID of child */
return targetPid;
/* Child falls through to here */
printf("T: PID = %ld\n", (long) getpid());
/* Install seccomp filter(s) */
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
err(EXIT_FAILURE, "prctl");
notifyFd = installNotifyFilter();
/* Pass the notification file descriptor to the tracing process over
a UNIX domain socket */
if (sendfd(sockPair[0], notifyFd) == -1)
err(EXIT_FAILURE, "sendfd");
/* Notification and socket FDs are no longer needed in target */
if (close(notifyFd) == -1)
err(EXIT_FAILURE, "close-target-notify-fd");
closeSocketPair(sockPair);
/* Perform a mkdir() call for each of the command-line arguments */
for (char **ap = argv; *ap != NULL; ap++) {
printf("\nT: about to mkdir(\"%s\")\n", *ap);
s = mkdir(*ap, 0700);
if (s == -1)
perror("T: ERROR: mkdir(2)");
else
printf("T: SUCCESS: mkdir(2) returned %d\n", s);
}
printf("\nT: terminating\n");
exit(EXIT_SUCCESS);
}
/* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
operation is still valid. It will no longer be valid if the target
process has terminated or is no longer blocked in the system call that
generated the notification (because it was interrupted by a signal).
This operation can be used when doing such things as accessing
/proc/PID files in the target process in order to avoid TOCTOU race
conditions where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV
terminates and is reused by another process. */
static bool
cookieIsValid(int notifyFd, uint64_t id)
{
return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}
/* Access the memory of the target process in order to fetch the
pathname referred to by the system call argument 'argNum' in
'req->data.args[]'. The pathname is returned in 'path',
a buffer of 'len' bytes allocated by the caller.
Returns true if the pathname is successfully fetched, and false
otherwise. For possible causes of failure, see the comments below. */
static bool
getTargetPathname(struct seccomp_notif *req, int notifyFd,
int argNum, char *path, size_t len)
{
int procMemFd;
char procMemPath[PATH_MAX];
ssize_t nread;
snprintf(procMemPath, sizeof(procMemPath), “/proc/%d/mem”, req->pid);
procMemFd = open(procMemPath, O_RDONLY | O_CLOEXEC);
if (procMemFd == -1)
return false;
/ Check that the process whose info we are accessing is still alive
and blocked in the system call that caused the notification.
If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed in
cookieIsValid()) succeeded, we know that the /proc/PID/mem file
descriptor that we opened corresponded to the process for which we
received a notification. If that process subsequently terminates,
then read() on that file descriptor will return 0 (EOF). /
if (!cookieIsValid(notifyFd, req->id)) {
close(procMemFd);
return false;
}
/ Read bytes at the location containing the pathname argument /
nread = pread(procMemFd, path, len, req->data.args[argNum]);
close(procMemFd);
if (nread <= 0)
return false;
/ Once again check that the notification ID is still valid. The
case we are particularly concerned about here is that just
before we fetched the pathname, the target’s blocked system
call was interrupted by a signal handler, and after the handler
returned, the target carried on execution (past the interrupted
system call). In that case, we have no guarantees about what we
are reading, since the target’s memory may have been arbitrarily
changed by subsequent operations. /
if (!cookieIsValid(notifyFd, req->id)) {
perror(” S: notification ID check failed!!!”);
return false;
}
/ Even if the target’s system call was not interrupted by a signal,
we have no guarantees about what was in the memory of the target
process. (The memory may have been modified by another thread, or
even by an external attacking process.) We therefore treat the
buffer returned by pread() as untrusted input. The buffer should
contain a terminating null byte; if not, then we will trigger an
error for the target process. /
if (strnlen(path, nread) < nread)
return true;
return false;
}
/* Allocate buffers for the seccomp user-space notification request and
   response structures. It is the caller's responsibility to free the
   buffers returned via 'req' and 'resp'. */
static void
allocSeccompNotifBuffers(struct seccomp_notif **req,
                         struct seccomp_notif_resp **resp,
                         struct seccomp_notif_sizes *sizes)
{
    size_t  resp_size;

    /* Discover the sizes of the structures that are used to receive
       notifications and send notification responses, and allocate
       buffers of those sizes. */
    if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
        err(EXIT_FAILURE, "seccomp-SECCOMP_GET_NOTIF_SIZES");

    *req = malloc(sizes->seccomp_notif);
    if (*req == NULL)
        err(EXIT_FAILURE, "malloc-seccomp_notif");

    /* When allocating the response buffer, we must allow for the fact
       that the user-space binary may have been built with user-space
       headers where 'struct seccomp_notif_resp' is bigger than the
       response buffer expected by the (older) kernel. Therefore, we
       allocate a buffer that is the maximum of the two sizes. This
       ensures that if the supervisor places bytes into the response
       structure that are past the response size that the kernel expects,
       then the supervisor is not touching an invalid memory location. */
    resp_size = sizes->seccomp_notif_resp;
    if (sizeof(struct seccomp_notif_resp) > resp_size)
        resp_size = sizeof(struct seccomp_notif_resp);

    *resp = malloc(resp_size);
    if (*resp == NULL)
        err(EXIT_FAILURE, "malloc-seccomp_notif_resp");
}
/* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
   descriptor, 'notifyFd'. */
static void
handleNotifications(int notifyFd)
{
    bool                        pathOK;
    char                        path[PATH_MAX];
    struct seccomp_notif        *req;
    struct seccomp_notif_resp   *resp;
    struct seccomp_notif_sizes  sizes;

    allocSeccompNotifBuffers(&req, &resp, &sizes);

    /* Loop handling notifications */
    for (;;) {

        /* Wait for next notification, returning info in '*req' */
        memset(req, 0, sizes.seccomp_notif);
        if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
            if (errno == EINTR)
                continue;
            err(EXIT_FAILURE, "\tS: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
        }

        printf("\tS: got notification (ID %#llx) for PID %d\n",
               req->id, req->pid);

        /* The only system call that can generate a notification event
           is mkdir(2). Nevertheless, we check that the notified system
           call is indeed mkdir() as kind of future-proofing of this
           code in case the seccomp filter is later modified to
           generate notifications for other system calls. */
        if (req->data.nr != SYS_mkdir) {
            printf("\tS: notification contained unexpected "
                   "system call number; bye!!!\n");
            exit(EXIT_FAILURE);
        }

        pathOK = getTargetPathname(req, notifyFd, 0, path, sizeof(path));

        /* Prepopulate some fields of the response */
        resp->id = req->id;     /* Response includes notification ID */
        resp->flags = 0;
        resp->val = 0;

        /* If getTargetPathname() failed, trigger an EINVAL error
           response (sending this response may yield an error if the
           failure occurred because the notification ID was no longer
           valid); if the directory is in /tmp, then create it on behalf
           of the supervisor; if the pathname starts with '.', tell the
           kernel to let the target process execute the mkdir();
           otherwise, give an error for a directory pathname in any other
           location. */
        if (!pathOK) {
            resp->error = -EINVAL;
            printf("\tS: spoofing error for invalid pathname (%s)\n",
                   strerror(-resp->error));
        } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
            printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
                   path, req->data.args[1]);

            if (mkdir(path, req->data.args[1]) == 0) {
                resp->error = 0;            /* "Success" */
                resp->val = strlen(path);   /* Used as return value of
                                               mkdir() in target */
                printf("\tS: success! spoofed return = %lld\n",
                       resp->val);
            } else {

                /* If mkdir() failed in the supervisor, pass the error
                   back to the target */
                resp->error = -errno;
                printf("\tS: failure! (errno = %d; %s)\n", errno,
                       strerror(errno));
            }
        } else if (strncmp(path, "./", strlen("./")) == 0) {
            resp->error = resp->val = 0;
            resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
            printf("\tS: target can execute system call\n");
        } else {
            resp->error = -EOPNOTSUPP;
            printf("\tS: spoofing error response (%s)\n",
                   strerror(-resp->error));
        }

        /* Send a response to the notification */
        printf("\tS: sending response "
               "(flags = %#x; val = %lld; error = %d)\n",
               resp->flags, resp->val, resp->error);

        if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
            if (errno == ENOENT)
                printf("\tS: response failed with ENOENT; "
                       "perhaps target process's syscall was "
                       "interrupted by a signal?\n");
            else
                perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
        }

        /* If the pathname is just "/bye", then the supervisor breaks out
           of the loop and terminates. This allows us to see what happens
           if the target process makes further calls to mkdir(2). */
        if (strcmp(path, "/bye") == 0)
            break;
    }

    free(req);
    free(resp);
    printf("\tS: terminating **********\n");
    exit(EXIT_FAILURE);
}
/* Implementation of the supervisor process:
   (1) obtains the notification file descriptor from 'sockPair[1]'
   (2) handles notifications that arrive on that file descriptor. */
static void
supervisor(int sockPair[2])
{
    int  notifyFd;

    notifyFd = recvfd(sockPair[1]);
    if (notifyFd == -1)
        err(EXIT_FAILURE, "recvfd");

    closeSocketPair(sockPair);  /* We no longer need the socket pair */

    handleNotifications(notifyFd);
}
int
main(int argc, char *argv[])
{
    int               sockPair[2];
    struct sigaction  sa;

    setbuf(stdout, NULL);

    if (argc < 2) {
        fprintf(stderr, "At least one pathname argument is required\n");
        exit(EXIT_FAILURE);
    }

    /* Create a UNIX domain socket that is used to pass the seccomp
       notification file descriptor from the target process to the
       supervisor process. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
        err(EXIT_FAILURE, "socketpair");

    /* Create a child process (the "target") that installs seccomp
       filtering. The target process writes the seccomp notification
       file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
       each directory in the command-line arguments. */
    (void) targetProcess(sockPair, &argv[optind]);

    /* Catch SIGCHLD when the target terminates, so that the
       supervisor can also terminate. */
    sa.sa_handler = sigchldHandler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) == -1)
        err(EXIT_FAILURE, "sigaction");

    supervisor(sockPair);

    exit(EXIT_SUCCESS);
}
SEE ALSO
ioctl(2), pidfd_getfd(2), pidfd_open(2), seccomp(2)
A further example program can be found in the kernel source file samples/seccomp/user-trap.c.
418 - Linux cli command flistxattr
NAME
flistxattr - list extended attribute names
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
ssize_t listxattr(const char *path, char *_Nullable list, size_t size);
ssize_t llistxattr(const char *path, char *_Nullable list, size_t size);
ssize_t flistxattr(int fd, char *_Nullable list, size_t size);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
listxattr() retrieves the list of extended attribute names associated with the given path in the filesystem. The retrieved list is placed in list, a caller-allocated buffer whose size (in bytes) is specified in the argument size. The list is the set of (null-terminated) names, one after the other. Names of extended attributes to which the calling process does not have access may be omitted from the list. The length of the attribute name list is returned.
llistxattr() is identical to listxattr(), except in the case of a symbolic link, where the list of names of extended attributes associated with the link itself is retrieved, not the file that it refers to.
flistxattr() is identical to listxattr(), only the open file referred to by fd (as returned by open(2)) is interrogated in place of path.
A single extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode.
If size is specified as zero, these calls return the current size of the list of extended attribute names (and leave list unchanged). This can be used to determine the size of the buffer that should be supplied in a subsequent call. (But, bear in mind that there is a possibility that the set of extended attributes may change between the two calls, so that it is still necessary to check the return status from the second call.)
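The two-call pattern described above might be sketched as follows. This is a minimal illustration, not code from this page: the helper name fetch_xattr_names() and the limit of five attempts are assumptions chosen for the example.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical helper: fetch the attribute name list of 'path' into a
   malloc()ed buffer returned via '*out'. Returns the list length in
   bytes, or -1 with errno set. The caller frees '*out'. */
ssize_t
fetch_xattr_names(const char *path, char **out)
{
    *out = NULL;

    /* The attempt limit is an arbitrary choice for this sketch */
    for (int attempt = 0; attempt < 5; attempt++) {

        /* size == 0: ask only for the current size of the list */
        ssize_t need = listxattr(path, NULL, 0);
        if (need == -1)
            return -1;
        if (need == 0)
            return 0;               /* No attributes; '*out' stays NULL */

        char *buf = malloc(need);
        if (buf == NULL)
            return -1;

        /* The list may have grown since the size query, so the return
           status of this second call must still be checked */
        ssize_t got = listxattr(path, buf, need);
        if (got >= 0) {
            *out = buf;
            return got;
        }

        int saved = errno;
        free(buf);
        if (saved != ERANGE) {
            errno = saved;
            return -1;
        }
        /* ERANGE: the list grew between the two calls; try again */
    }

    errno = ERANGE;
    return -1;
}
```

The size is queried afresh on each attempt rather than doubling the buffer, so the allocation tracks what the kernel currently reports.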
Example
The list of names is returned as an unordered array of null-terminated character strings (attribute names are separated by null bytes ('\0')), like this:
user.name1\0system.name1\0user.name2\0
Filesystems that implement POSIX ACLs using extended attributes might return a list like this:
system.posix_acl_access\0system.posix_acl_default\0
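A buffer in this layout can be walked one name at a time. The sketch below is illustrative only; the helper name print_xattr_names() is an assumption, and 'len' is taken to be the value returned by listxattr().

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: print each name held in a listxattr()-style
   buffer and return the number of names found. */
size_t
print_xattr_names(const char *list, ssize_t len)
{
    size_t count = 0;

    if (list == NULL || len <= 0)
        return 0;

    const char *end = list + len;

    for (const char *name = list; name < end; ) {
        size_t remaining = (size_t) (end - name);
        size_t n = strnlen(name, remaining);

        if (n == remaining)         /* No terminating null byte: stop */
            break;

        printf("%s\n", name);
        count++;
        name += n + 1;              /* Skip the name and its null byte */
    }

    return count;
}
```

Bounding each step with strnlen() keeps the walk inside the buffer even if the final name were not null-terminated.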
RETURN VALUE
On success, a nonnegative number is returned indicating the size of the extended attribute name list. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
E2BIG
The size of the list of extended attribute names is larger than the maximum size allowed; the list cannot be retrieved. This can happen on filesystems that support an unlimited number of extended attributes per file such as XFS, for example. See BUGS.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
ERANGE
The size of the list buffer is too small to hold the result.
In addition, the errors documented in stat(2) can also occur.
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
BUGS
As noted in xattr(7), the VFS imposes a limit of 64 kB on the size of the extended attribute name list returned by listxattr(). If the total size of attribute names attached to a file exceeds this limit, it is no longer possible to retrieve the list of attribute names.
EXAMPLES
The following program demonstrates the usage of listxattr() and getxattr(2). For the file whose pathname is provided as a command-line argument, it lists all extended file attributes and their values.
To keep the code simple, the program assumes that attribute keys and values are constant during the execution of the program. A production program should expect and handle changes during execution of the program. For example, the number of bytes required for attribute keys might increase between the two calls to listxattr(). An application could handle this possibility using a loop that retries the call (perhaps up to a predetermined maximum number of attempts) with a larger buffer each time it fails with the error ERANGE. Calls to getxattr(2) could be handled similarly.
The following output was recorded by first creating a file, setting some extended file attributes, and then listing the attributes with the example program.
Example output
$ touch /tmp/foo
$ setfattr -n user.fred -v chocolate /tmp/foo
$ setfattr -n user.frieda -v bar /tmp/foo
$ setfattr -n user.empty /tmp/foo
$ ./listxattr /tmp/foo
user.fred: chocolate
user.frieda: bar
user.empty: <no value>
Program source (listxattr.c)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>

int
main(int argc, char *argv[])
{
    char     *buf, *key, *val;
    ssize_t  buflen, keylen, vallen;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s path\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /*
     * Determine the length of the buffer needed.
     */
    buflen = listxattr(argv[1], NULL, 0);
    if (buflen == -1) {
        perror("listxattr");
        exit(EXIT_FAILURE);
    }
    if (buflen == 0) {
        printf("%s has no attributes.\n", argv[1]);
        exit(EXIT_SUCCESS);
    }

    /*
     * Allocate the buffer.
     */
    buf = malloc(buflen);
    if (buf == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    /*
     * Copy the list of attribute keys to the buffer.
     */
    buflen = listxattr(argv[1], buf, buflen);
    if (buflen == -1) {
        perror("listxattr");
        exit(EXIT_FAILURE);
    }

    /*
     * Loop over the list of zero terminated strings with the
     * attribute keys. Use the remaining buffer length to determine
     * the end of the list.
     */
    key = buf;
    while (buflen > 0) {

        /*
         * Output attribute key.
         */
        printf("%s: ", key);

        /*
         * Determine length of the value.
         */
        vallen = getxattr(argv[1], key, NULL, 0);
        if (vallen == -1)
            perror("getxattr");

        if (vallen > 0) {

            /*
             * Allocate value buffer.
             * One extra byte is needed to append 0x00.
             */
            val = malloc(vallen + 1);
            if (val == NULL) {
                perror("malloc");
                exit(EXIT_FAILURE);
            }

            /*
             * Copy value to buffer.
             */
            vallen = getxattr(argv[1], key, val, vallen);
            if (vallen == -1) {
                perror("getxattr");
            } else {
                /*
                 * Output attribute value.
                 */
                val[vallen] = 0;
                printf("%s", val);
            }

            free(val);
        } else if (vallen == 0) {
            printf("<no value>");
        }

        printf("\n");

        /*
         * Forward to next attribute key.
         */
        keylen = strlen(key) + 1;
        buflen -= keylen;
        key += keylen;
    }

    free(buf);
    exit(EXIT_SUCCESS);
}
SEE ALSO
getfattr(1), setfattr(1), getxattr(2), open(2), removexattr(2), setxattr(2), stat(2), symlink(7), xattr(7)
419 - Linux cli command insw
NAME
insw - port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width and the w-suffix functions word-width; the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
You use ioperm(2) or alternatively iopl(2) to tell the kernel to allow the user space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
420 - Linux cli command close
NAME
close - close a file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int close(int fd);
DESCRIPTION
close() closes a file descriptor, so that it no longer refers to any file and may be reused. Any record locks (see fcntl(2)) held on the file it was associated with, and owned by the process, are removed regardless of the file descriptor that was used to obtain the lock. This has some unfortunate consequences and one should be extra careful when using advisory record locking. See fcntl(2) for discussion of the risks and consequences as well as for the (probably preferred) open file description locks.
If fd is the last file descriptor referring to the underlying open file description (see open(2)), the resources associated with the open file description are freed; if the file descriptor was the last reference to a file which has been removed using unlink(2), the file is deleted.
RETURN VALUE
close() returns zero on success. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
fd isn’t a valid open file descriptor.
EINTR
The close() call was interrupted by a signal; see signal(7).
EIO
An I/O error occurred.
ENOSPC
EDQUOT
On NFS, these errors are not normally reported against the first write which exceeds the available storage space, but instead against a subsequent write(2), fsync(2), or close().
See NOTES for a discussion of why close() should not be retried after an error.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
NOTES
A successful close does not guarantee that the data has been successfully saved to disk, as the kernel uses the buffer cache to defer writes. Typically, filesystems do not flush buffers when a file is closed. If you need to be sure that the data is physically stored on the underlying disk, use fsync(2). (It will depend on the disk hardware at this point.)
The close-on-exec file descriptor flag can be used to ensure that a file descriptor is automatically closed upon a successful execve(2); see fcntl(2) for details.
Multithreaded processes and close()
It is probably unwise to close file descriptors while they may be in use by system calls in other threads in the same process. Since a file descriptor may be reused, there are some obscure race conditions that may cause unintended side effects.
Furthermore, consider the following scenario where two threads are performing operations on the same file descriptor:
One thread is blocked in an I/O system call on the file descriptor. For example, it is trying to write(2) to a pipe that is already full, or trying to read(2) from a stream socket which currently has no available data.
Another thread closes the file descriptor.
The behavior in this situation varies across systems. On some systems, when the file descriptor is closed, the blocking system call returns immediately with an error.
On Linux (and possibly some other systems), the behavior is different: the blocking I/O system call holds a reference to the underlying open file description, and this reference keeps the description open until the I/O system call completes. (See open(2) for a discussion of open file descriptions.) Thus, the blocking system call in the first thread may successfully complete after the close() in the second thread.
Dealing with error returns from close()
A careful programmer will check the return value of close(), since it is quite possible that errors on a previous write(2) operation are reported only on the final close() that releases the open file description. Failing to check the return value when closing a file may lead to silent loss of data. This can especially be observed with NFS and with disk quota.
Note, however, that a failure return should be used only for diagnostic purposes (i.e., a warning to the application that there may still be I/O pending or there may have been failed I/O) or remedial purposes (e.g., writing the file once more or creating a backup).
Retrying the close() after a failure return is the wrong thing to do, since this may cause a reused file descriptor from another thread to be closed. This can occur because the Linux kernel always releases the file descriptor early in the close operation, freeing it for reuse; the steps that may return an error, such as flushing data to the filesystem or device, occur only later in the close operation.
Many other implementations similarly always close the file descriptor (except in the case of EBADF, meaning that the file descriptor was invalid) even if they subsequently report an error on return from close(). POSIX.1 is currently silent on this point, but there are plans to mandate this behavior in the next major release of the standard.
A careful programmer who wants to know about I/O errors may precede close() with a call to fsync(2).
The EINTR error is a somewhat special case. Regarding the EINTR error, POSIX.1-2008 says:
If close() is interrupted by a signal that is to be caught, it shall return -1 with errno set to EINTR and the state of fildes is unspecified.
This permits the behavior that occurs on Linux and many other implementations, where, as with other errors that may be reported by close(), the file descriptor is guaranteed to be closed. However, it also permits another possibility: that the implementation returns an EINTR error and keeps the file descriptor open. (According to its documentation, HP-UX’s close() does this.) The caller must then once more use close() to close the file descriptor, to avoid file descriptor leaks. This divergence in implementation behaviors provides a difficult hurdle for portable applications, since on many implementations, close() must not be called again after an EINTR error, and on at least one, close() must be called again. There are plans to address this conundrum for the next major release of the POSIX.1 standard.
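The advice in this section (check the result, consider fsync(2) first, and never retry close() on Linux) could be sketched as below. The helper name flush_and_close() is hypothetical, and the sketch assumes a descriptor that refers to a regular file, since fsync(2) can fail with EINVAL on descriptors such as pipes or sockets. It also assumes the Linux behavior described above, where the descriptor is released even when close() reports an error.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: flush 'fd' and close it exactly once.
   Returns 0 on success, or -1 with errno set to the first error seen. */
int
flush_and_close(int fd)
{
    int saved = 0;

    /* Surface pending write errors while the descriptor is still open */
    if (fsync(fd) == -1)
        saved = errno;

    /* Call close() once only. On Linux the descriptor is released even
       if this call fails (including with EINTR), so retrying could close
       a descriptor that another thread has since been given. */
    if (close(fd) == -1 && saved == 0)
        saved = errno;

    if (saved != 0) {
        errno = saved;
        return -1;
    }
    return 0;
}
```

A failure return is meant for diagnostics or remedial action by the caller, such as rewriting the file; it is not a signal to call close() again.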
SEE ALSO
close_range(2), fcntl(2), fsync(2), open(2), shutdown(2), unlink(2), fclose(3)
421 - Linux cli command mknod
NAME
mknod - create a special or ordinary file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int mknod(const char *pathname, mode_t mode, dev_t dev);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int mknodat(int dirfd, const char *pathname, mode_t mode, dev_t dev);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
mknod():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE || _SVID_SOURCE
DESCRIPTION
The system call mknod() creates a filesystem node (file, device special file, or named pipe) named pathname, with attributes specified by mode and dev.
The mode argument specifies both the file mode to use and the type of node to be created. It should be a combination (using bitwise OR) of one of the file types listed below and zero or more of the file mode bits listed in inode(7).
The file mode is modified by the process’s umask in the usual way: in the absence of a default ACL, the permissions of the created node are (mode & ~umask).
The file type must be one of S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO, or S_IFSOCK to specify a regular file (which will be created empty), character special file, block special file, FIFO (named pipe), or UNIX domain socket, respectively. (Zero file type is equivalent to type S_IFREG.)
If the file type is S_IFCHR or S_IFBLK, then dev specifies the major and minor numbers of the newly created device special file (makedev(3) may be useful to build the value for dev); otherwise it is ignored.
If pathname already exists, or is a symbolic link, this call fails with an EEXIST error.
The newly created node will be owned by the effective user ID of the process. If the directory containing the node has the set-group-ID bit set, or if the filesystem is mounted with BSD group semantics, the new node will inherit the group ownership from its parent directory; otherwise it will be owned by the effective group ID of the process.
mknodat()
The mknodat() system call operates in exactly the same way as mknod(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by mknod() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like mknod()).
If pathname is absolute, then dirfd is ignored.
See openat(2) for an explanation of the need for mknodat().
RETURN VALUE
mknod() and mknodat() return zero on success. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
The parent directory does not allow write permission to the process, or one of the directories in the path prefix of pathname did not allow search permission. (See also path_resolution(7).)
EBADF
(mknodat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EDQUOT
The user’s quota of disk blocks or inodes on the filesystem has been exhausted.
EEXIST
pathname already exists. This includes the case where pathname is a symbolic link, dangling or not.
EFAULT
pathname points outside your accessible address space.
EINVAL
mode requested creation of something other than a regular file, device special file, FIFO or socket.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname was too long.
ENOENT
A directory component in pathname does not exist or is a dangling symbolic link.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The device containing pathname has no room for the new node.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory.
ENOTDIR
(mknodat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
mode requested creation of something other than a regular file, FIFO (named pipe), or UNIX domain socket, and the caller is not privileged (Linux: does not have the CAP_MKNOD capability); also returned if the filesystem containing pathname does not support the type of node requested.
EROFS
pathname refers to a file on a read-only filesystem.
VERSIONS
POSIX.1-2001 says: “The only portable use of mknod() is to create a FIFO-special file. If mode is not S_IFIFO or dev is not 0, the behavior of mknod() is unspecified.” However, nowadays one should never use mknod() for this purpose; one should use mkfifo(3), a function especially defined for this purpose.
Under Linux, mknod() cannot be used to create directories. One should make directories with mkdir(2).
STANDARDS
POSIX.1-2008.
HISTORY
mknod()
SVr4, 4.4BSD, POSIX.1-2001 (but see VERSIONS).
mknodat()
Linux 2.6.16, glibc 2.4. POSIX.1-2008.
NOTES
There are many infelicities in the protocol underlying NFS. Some of these affect mknod() and mknodat().
SEE ALSO
mknod(1), chmod(2), chown(2), fcntl(2), mkdir(2), mount(2), socket(2), stat(2), umask(2), unlink(2), makedev(3), mkfifo(3), acl(5), path_resolution(7)
422 - Linux cli command outsw
NAME
outsw - port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width, the w-suffix functions word-width, and the l-suffix functions long-width (unsigned int in the prototypes above); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
You must use ioperm(2) or, alternatively, iopl(2) to tell the kernel to allow the user-space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
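As a rough sketch of the sequence described above (request access with ioperm(2), then call the port functions), the following hypothetical helper writes one byte to a port and reads it back. The helper name and the fallback for non-x86 builds are assumptions made for illustration only; which port is safe to touch depends entirely on the hardware.

```c
#include <errno.h>

#if defined(__i386__) || defined(__x86_64__)
#include <sys/io.h>

/* Hypothetical helper: write `value` to `port`, then read the port back.
 * Returns 0 on success; returns -1 with errno set if ioperm(2) refuses. */
int port_write_read(unsigned short port, unsigned char value,
                    unsigned char *readback)
{
    if (ioperm(port, 1, 1) == -1)   /* ask for access to one port */
        return -1;                  /* typically EPERM without privilege */

    outb(value, port);              /* note the order: value, then port */
    *readback = inb(port);

    ioperm(port, 1, 0);             /* give the access back */
    return 0;
}
#else
/* Assumption: <sys/io.h> port I/O is not usable on this architecture. */
int port_write_read(unsigned short port, unsigned char value,
                    unsigned char *readback)
{
    (void)port; (void)value; (void)readback;
    errno = ENOSYS;
    return -1;
}
#endif
```

Checking the ioperm(2) result first turns the segmentation fault mentioned above into an ordinary error return. As the page advises, build with -O or -O2.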
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
423 - Linux cli command sendmsg
NAME 🖥️ sendmsg 🖥️
send a message on a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
ssize_t send(int sockfd, const void buf[.len], size_t len, int flags);
ssize_t sendto(int sockfd, const void buf[.len], size_t len, int flags,
const struct sockaddr *dest_addr, socklen_t addrlen);
ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);
DESCRIPTION
The system calls send(), sendto(), and sendmsg() are used to transmit a message to another socket.
The send() call may be used only when the socket is in a connected state (so that the intended recipient is known). The only difference between send() and write(2) is the presence of flags. With a zero flags argument, send() is equivalent to write(2). Also, the following call
send(sockfd, buf, len, flags);
is equivalent to
sendto(sockfd, buf, len, flags, NULL, 0);
The argument sockfd is the file descriptor of the sending socket.
If sendto() is used on a connection-mode (SOCK_STREAM, SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are ignored (and the error EISCONN may be returned when they are not NULL and 0), and the error ENOTCONN is returned when the socket was not actually connected. Otherwise, the address of the target is given by dest_addr with addrlen specifying its size. For sendmsg(), the address of the target is given by msg.msg_name, with msg.msg_namelen specifying its size.
For send() and sendto(), the message is found in buf and has length len. For sendmsg(), the message is pointed to by the elements of the array msg.msg_iov. The sendmsg() call also allows sending ancillary data (also known as control information).
If the message is too long to pass atomically through the underlying protocol, the error EMSGSIZE is returned, and the message is not transmitted.
No indication of failure to deliver is implicit in a send(). Locally detected errors are indicated by a return value of -1.
When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in nonblocking I/O mode. In nonblocking mode it would fail with the error EAGAIN or EWOULDBLOCK in this case. The select(2) call may be used to determine when it is possible to send more data.
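The nonblocking behavior described above can be observed with a small sketch. It sets O_NONBLOCK on one end of a UNIX-domain socket pair whose peer never reads, then sends until the kernel reports that the send buffer is full. The function names and the 4096-byte chunk size are assumptions chosen for the demonstration.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Put `fd` into nonblocking mode and send until the send buffer is full.
 * Returns the number of bytes queued, or -1 on any other error. */
long fill_send_buffer(int fd)
{
    char chunk[4096];
    long total = 0;
    int fl = fcntl(fd, F_GETFL);

    if (fl == -1 || fcntl(fd, F_SETFL, fl | O_NONBLOCK) == -1)
        return -1;
    memset(chunk, 'x', sizeof(chunk));

    for (;;) {
        ssize_t n = send(fd, chunk, sizeof(chunk), 0);
        if (n >= 0) {
            total += n;             /* may be a partial send */
            continue;
        }
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;           /* a blocking socket would wait here */
        return -1;
    }
}

/* Demonstration on a connected socket pair whose peer never reads. */
long demo_fill(void)
{
    int sv[2];
    long queued;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    queued = fill_send_buffer(sv[0]);
    close(sv[0]);
    close(sv[1]);
    return queued;
}
```

A real program would normally wait with select(2) or poll(2) at the point where this sketch returns, then resume sending.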
The flags argument
The flags argument is the bitwise OR of zero or more of the following flags.
MSG_CONFIRM (since Linux 2.3.15)
Tell the link layer that forward progress happened: you got a successful reply from the other side. If the link layer doesn’t get this it will regularly reprobe the neighbor (e.g., via a unicast ARP). Valid only on SOCK_DGRAM and SOCK_RAW sockets and currently implemented only for IPv4 and IPv6. See arp(7) for details.
MSG_DONTROUTE
Don’t use a gateway to send out the packet; send only to hosts on directly connected networks. This is usually used only by diagnostic or routing programs. This is defined only for protocol families that route; packet sockets don’t.
MSG_DONTWAIT (since Linux 2.2)
Enables nonblocking operation; if the operation would block, EAGAIN or EWOULDBLOCK is returned. This provides similar behavior to setting the O_NONBLOCK flag (via the fcntl(2) F_SETFL operation), but differs in that MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a setting on the open file description (see open(2)), which will affect all threads in the calling process as well as other processes that hold file descriptors referring to the same open file description.
MSG_EOR (since Linux 2.2)
Terminates a record (when this notion is supported, as for sockets of type SOCK_SEQPACKET).
MSG_MORE (since Linux 2.4.4)
The caller has more data to send. This flag is used with TCP sockets to obtain the same effect as the TCP_CORK socket option (see tcp(7)), with the difference that this flag can be set on a per-call basis.
Since Linux 2.6, this flag is also supported for UDP sockets, and informs the kernel to package all of the data sent in calls with this flag set into a single datagram which is transmitted only when a call is performed that does not specify this flag. (See also the UDP_CORK socket option described in udp(7).)
MSG_NOSIGNAL (since Linux 2.2)
Don’t generate a SIGPIPE signal if the peer on a stream-oriented socket has closed the connection. The EPIPE error is still returned. This provides similar behavior to using sigaction(2) to ignore SIGPIPE, but, whereas MSG_NOSIGNAL is a per-call feature, ignoring SIGPIPE sets a process attribute that affects all threads in the process.
MSG_OOB
Sends out-of-band data on sockets that support this notion (e.g., of type SOCK_STREAM); the underlying protocol must also support out-of-band data.
MSG_FASTOPEN (since Linux 3.7)
Attempts TCP Fast Open (RFC7413) and sends data in the SYN like a combination of connect(2) and write(2), by performing an implicit connect(2) operation. It blocks until the data is buffered and the handshake has completed. For a non-blocking socket, it returns the number of bytes buffered and sent in the SYN packet. If the cookie is not available locally, it returns EINPROGRESS, and sends a SYN with a Fast Open cookie request automatically. The caller needs to write the data again when the socket is connected. On errors, it sets the same errno as connect(2) if the handshake fails. This flag requires enabling TCP Fast Open client support on sysctl net.ipv4.tcp_fastopen.
Refer to TCP_FASTOPEN_CONNECT socket option in tcp(7) for an alternative approach.
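To illustrate a per-call flag, here is a minimal sketch of MSG_NOSIGNAL. It sends on a stream socket whose peer has already been closed and reports the resulting errno instead of letting SIGPIPE terminate the process. The function name is hypothetical, and a UNIX-domain socket pair is used only because it needs no network setup.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send one byte on a stream socket whose peer has been closed.
 * Returns the errno value reported by send(), 0 if the send
 * unexpectedly succeeded, or -1 if the socket pair could not be made. */
int send_to_closed_peer(void)
{
    int sv[2];
    int err = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    close(sv[1]);                   /* the peer goes away */

    /* Without MSG_NOSIGNAL this call would also raise SIGPIPE, whose
     * default action terminates the process. */
    if (send(sv[0], "x", 1, MSG_NOSIGNAL) == -1)
        err = errno;

    close(sv[0]);
    return err;
}
```

On Linux the value returned here is expected to be EPIPE, matching the description of MSG_NOSIGNAL above.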
sendmsg()
The definition of the msghdr structure employed by sendmsg() is as follows:
struct msghdr {
void *msg_name; /* Optional address */
socklen_t msg_namelen; /* Size of address */
struct iovec *msg_iov; /* Scatter/gather array */
size_t msg_iovlen; /* # elements in msg_iov */
void *msg_control; /* Ancillary data, see below */
size_t msg_controllen; /* Ancillary data buffer len */
int msg_flags; /* Flags (unused) */
};
The msg_name field is used on an unconnected socket to specify the target address for a datagram. It points to a buffer containing the address; the msg_namelen field should be set to the size of the address. For a connected socket, these fields should be specified as NULL and 0, respectively.
The msg_iov and msg_iovlen fields specify scatter-gather locations, as for writev(2).
You may send control information (ancillary data) using the msg_control and msg_controllen members. The maximum control buffer length the kernel can process is limited per socket by the value in /proc/sys/net/core/optmem_max; see socket(7). For further information on the use of ancillary data in various socket domains, see unix(7) and ip(7).
The msg_flags field is ignored.
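The following sketch shows one way to fill in a msghdr for a connected socket: two separate buffers are gathered into a single message through msg_iov, while msg_name and the ancillary-data fields are left zeroed. The function names and the sample strings are assumptions for illustration.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send `head` followed by `body` with a single sendmsg() call. */
ssize_t send_two_parts(int fd, const char *head, const char *body)
{
    struct iovec iov[2];
    struct msghdr msg;

    iov[0].iov_base = (void *)head;
    iov[0].iov_len = strlen(head);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);

    /* Connected socket: msg_name stays NULL and msg_namelen 0;
     * no ancillary data is attached. */
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    return sendmsg(fd, &msg, 0);
}

/* Round trip over a datagram socket pair.
 * Returns 1 if the receiver sees the two parts as one message. */
int demo_gather(void)
{
    int sv[2];
    char buf[32];
    ssize_t sent, got;
    int ok;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;
    sent = send_two_parts(sv[0], "hello, ", "world");
    got = recv(sv[1], buf, sizeof(buf), 0);
    ok = (sent == 12 && got == 12 && memcmp(buf, "hello, world", 12) == 0);
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

Zeroing the whole structure first is a defensive choice; it avoids passing stale pointers in the fields that are not used.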
RETURN VALUE
On success, these calls return the number of bytes sent. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
These are some standard errors generated by the socket layer. Additional errors may be generated and returned from the underlying protocol modules; see their respective manual pages.
EACCES
(For UNIX domain sockets, which are identified by pathname) Write permission is denied on the destination socket file, or search permission is denied for one of the directories in the path prefix. (See path_resolution(7).)
(For UDP sockets) An attempt was made to send to a network/broadcast address as though it was a unicast address.
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and the requested operation would block. POSIX.1-2001 allows either error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities.
EAGAIN
(Internet domain datagram sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7).
EALREADY
Another Fast Open is in progress.
EBADF
sockfd is not a valid open file descriptor.
ECONNRESET
Connection reset by peer.
EDESTADDRREQ
The socket is not connection-mode, and no peer address is set.
EFAULT
An invalid user space address was specified for an argument.
EINTR
A signal occurred before any data was transmitted; see signal(7).
EINVAL
Invalid argument passed.
EISCONN
The connection-mode socket was connected already but a recipient was specified. (Now either this error is returned, or the recipient specification is ignored.)
EMSGSIZE
The socket type requires that message be sent atomically, and the size of the message to be sent made this impossible.
ENOBUFS
The output queue for a network interface was full. This generally indicates that the interface has stopped sending, but may be caused by transient congestion. (Normally, this does not occur in Linux. Packets are just silently dropped when a device queue overflows.)
ENOMEM
No memory available.
ENOTCONN
The socket is not connected, and no target has been given.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
EOPNOTSUPP
Some bit in the flags argument is inappropriate for the socket type.
EPIPE
The local end has been shut down on a connection oriented socket. In this case, the process will also receive a SIGPIPE unless MSG_NOSIGNAL is set.
VERSIONS
According to POSIX.1-2001, the msg_controllen field of the msghdr structure should be typed as socklen_t, and the msg_iovlen field should be typed as int, but glibc currently types both as size_t.
STANDARDS
POSIX.1-2008.
MSG_CONFIRM is a Linux extension.
HISTORY
4.4BSD, SVr4, POSIX.1-2001 (first appeared in 4.2BSD).
POSIX.1-2001 describes only the MSG_OOB and MSG_EOR flags. POSIX.1-2008 adds a specification of MSG_NOSIGNAL.
NOTES
See sendmmsg(2) for information about a Linux-specific system call that can be used to transmit multiple datagrams in a single call.
BUGS
Linux may return EPIPE instead of ENOTCONN.
EXAMPLES
An example of the use of sendto() is shown in getaddrinfo(3).
SEE ALSO
fcntl(2), getsockopt(2), recv(2), select(2), sendfile(2), sendmmsg(2), shutdown(2), socket(2), write(2), cmsg(3), ip(7), ipv6(7), socket(7), tcp(7), udp(7), unix(7)
424 - Linux cli command setfsgid32
NAME 🖥️ setfsgid32 🖥️
set group identity used for filesystem checks
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/fsuid.h>
[[deprecated]] int setfsgid(gid_t fsgid);
DESCRIPTION
On Linux, a process has both a filesystem group ID and an effective group ID. The (Linux-specific) filesystem group ID is used for permissions checking when accessing filesystem objects, while the effective group ID is used for some other kinds of permissions checks (see credentials(7)).
Normally, the value of the process’s filesystem group ID is the same as the value of its effective group ID. This is so because, whenever a process’s effective group ID is changed, the kernel also changes the filesystem group ID to be the same as the new value of the effective group ID. A process can cause the value of its filesystem group ID to diverge from its effective group ID by using setfsgid() to change its filesystem group ID to the value given in fsgid.
setfsgid() will succeed only if the caller is the superuser or if fsgid matches either the caller’s real group ID, effective group ID, saved set-group-ID, or current filesystem group ID.
RETURN VALUE
On both success and failure, this call returns the previous filesystem group ID of the caller.
STANDARDS
Linux.
HISTORY
Linux 1.2.
C library/kernel differences
In glibc 2.15 and earlier, when the wrapper for this system call determines that the argument can’t be passed to the kernel without integer truncation (because the kernel is old and does not support 32-bit group IDs), it will return -1 and set errno to EINVAL without attempting the system call.
NOTES
The filesystem group ID concept and the setfsgid() system call were invented for historical reasons that are no longer applicable on modern Linux kernels. See setfsuid(2) for a discussion of why the use of both setfsuid(2) and setfsgid() is nowadays unneeded.
The original Linux setfsgid() system call supported only 16-bit group IDs. Subsequently, Linux 2.4 added setfsgid32() supporting 32-bit IDs. The glibc setfsgid() wrapper function transparently deals with the variation across kernel versions.
BUGS
No error indications of any kind are returned to the caller, and the fact that both successful and unsuccessful calls return the same value makes it impossible to directly determine whether the call succeeded or failed. Instead, the caller must resort to looking at the return value from a further call such as setfsgid(-1) (which will always fail), in order to determine if a preceding call to setfsgid() changed the filesystem group ID. At the very least, EPERM should be returned when the call fails (because the caller lacks the CAP_SETGID capability).
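The workaround described above can be sketched as a small wrapper. The helper name is hypothetical; it relies only on the documented behavior that setfsgid(-1) always fails and therefore just reports the current filesystem group ID.

```c
#include <sys/fsuid.h>
#include <sys/types.h>
#include <unistd.h>

/* Try to change the filesystem group ID and report whether it took effect.
 * Returns 1 if the filesystem group ID is now `fsgid`, 0 otherwise. */
int setfsgid_checked(gid_t fsgid)
{
    /* The return value here is only the previous ID, so it is ignored. */
    setfsgid(fsgid);

    /* This second call always fails and returns the current ID. */
    return (gid_t)setfsgid((gid_t)-1) == fsgid;
}
```

For example, setfsgid_checked(getegid()) should report success for any caller, because the effective group ID is always an acceptable value.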
SEE ALSO
kill(2), setfsuid(2), capabilities(7), credentials(7)
425 - Linux cli command init_module
NAME 🖥️ init_module 🖥️
load a kernel module
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/module.h> /* Definition of MODULE_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_init_module, void module_image[.len], unsigned long len,
const char *param_values);
int syscall(SYS_finit_module, int fd,
const char *param_values, int flags);
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
init_module() loads an ELF image into kernel space, performs any necessary symbol relocations, initializes module parameters to values provided by the caller, and then runs the module’s init function. This system call requires privilege.
The module_image argument points to a buffer containing the binary image to be loaded; len specifies the size of that buffer. The module image should be a valid ELF image, built for the running kernel.
The param_values argument is a string containing space-delimited specifications of the values for module parameters (defined inside the module using module_param() and module_param_array()). The kernel parses this string and initializes the specified parameters. Each of the parameter specifications has the form:
name[=value[,value…]]
The parameter name is one of those defined within the module using module_param() (see the Linux kernel source file include/linux/moduleparam.h). The parameter value is optional in the case of bool and invbool parameters. Values for array parameters are specified as a comma-separated list.
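As a small illustration of the format above, this sketch builds one specification for an integer array parameter. The helper and the parameter name used in the usage note are hypothetical; they do not belong to any real module.

```c
#include <stddef.h>
#include <stdio.h>

/* Format "name=v1,v2,..." into buf.
 * Returns the length written, or -1 if buf is too small. */
int format_array_param(char *buf, size_t size, const char *name,
                       const int *values, size_t count)
{
    size_t used;
    int n = snprintf(buf, size, "%s=", name);

    if (n < 0 || (size_t)n >= size)
        return -1;
    used = (size_t)n;

    for (size_t i = 0; i < count; i++) {
        /* Array values are separated by commas, with no spaces. */
        n = snprintf(buf + used, size - used, "%s%d",
                     i ? "," : "", values[i]);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

With name "ports" and values {1, 2, 3}, the buffer holds "ports=1,2,3". Several such specifications would then be joined with spaces to form param_values.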
finit_module()
The finit_module() system call is like init_module(), but reads the module to be loaded from the file descriptor fd. It is useful when the authenticity of a kernel module can be determined from its location in the filesystem; in cases where that is possible, the overhead of using cryptographically signed modules to determine the authenticity of a module can be avoided. The param_values argument is as for init_module().
The flags argument modifies the operation of finit_module(). It is a bit mask value created by ORing together zero or more of the following flags:
MODULE_INIT_IGNORE_MODVERSIONS
Ignore symbol version hashes.
MODULE_INIT_IGNORE_VERMAGIC
Ignore kernel version magic.
MODULE_INIT_COMPRESSED_FILE (since Linux 5.17)
Use in-kernel module decompression.
There are some safety checks built into a module to ensure that it matches the kernel against which it is loaded. These checks are recorded when the module is built and verified when the module is loaded. First, the module records a “vermagic” string containing the kernel version number and prominent features (such as the CPU type). Second, if the module was built with the CONFIG_MODVERSIONS configuration option enabled, a version hash is recorded for each symbol the module uses. This hash is based on the types of the arguments and return value for the function named by the symbol. In this case, the kernel version number within the “vermagic” string is ignored, as the symbol version hashes are assumed to be sufficiently reliable.
Using the MODULE_INIT_IGNORE_VERMAGIC flag indicates that the “vermagic” string is to be ignored, and the MODULE_INIT_IGNORE_MODVERSIONS flag indicates that the symbol version hashes are to be ignored. If the kernel is built to permit forced loading (i.e., configured with CONFIG_MODULE_FORCE_LOAD), then loading continues; otherwise it fails with the error ENOEXEC, as expected for malformed modules.
If the kernel was built with CONFIG_MODULE_DECOMPRESS, the in-kernel decompression feature can be used. User-space code can check if the kernel supports decompression by reading the /sys/module/compression attribute. If the kernel supports decompression, the compressed file can be passed directly to finit_module() using the MODULE_INIT_COMPRESSED_FILE flag. The in-kernel module decompressor supports the following compression algorithms:
gzip (since Linux 5.17)
xz (since Linux 5.17)
zstd (since Linux 6.2)
The kernel implements only a single decompression method. This is selected during module generation according to the compression method chosen in the kernel configuration.
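Because glibc provides no wrapper, a caller has to go through syscall(2). The following is a minimal sketch of such a call; the helper name is hypothetical, and the flags value is passed through unchanged (the MODULE_INIT_* constants come from <linux/module.h>, which is not included here).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: load the module file at `path` via finit_module().
 * Returns 0 on success, or -1 with errno set on failure. */
int load_module_file(const char *path, const char *param_values, int flags)
{
    int fd, ret, saved;

    /* Open read-only: a descriptor opened read-write is rejected
     * (ETXTBSY since Linux 4.7). */
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    ret = (int)syscall(SYS_finit_module, fd, param_values, flags);

    saved = errno;                  /* keep the syscall's errno across close() */
    close(fd);
    errno = saved;
    return ret;
}
```

A call such as load_module_file(path, "", 0) passes no parameters and no flags. Without the CAP_SYS_MODULE capability the system call is expected to fail with EPERM.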
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EBADMSG (since Linux 3.7)
Module signature is misformatted.
EBUSY
Timeout while trying to resolve a symbol reference by this module.
EFAULT
An address argument referred to a location that is outside the process’s accessible address space.
ENOKEY (since Linux 3.7)
Module signature is invalid or the kernel does not have a key for this module. This error is returned only if the kernel was configured with CONFIG_MODULE_SIG_FORCE; if the kernel was not configured with this option, then an invalid or unsigned module simply taints the kernel.
ENOMEM
Out of memory.
EPERM
The caller was not privileged (did not have the CAP_SYS_MODULE capability), or module loading is disabled (see /proc/sys/kernel/modules_disabled in proc(5)).
The following errors may additionally occur for init_module():
EEXIST
A module with this name is already loaded.
EINVAL
param_values is invalid, or some part of the ELF image in module_image contains inconsistencies.
ENOEXEC
The binary image supplied in module_image is not an ELF image, or is an ELF image that is invalid or for a different architecture.
The following errors may additionally occur for finit_module():
EBADF
The file referred to by fd is not opened for reading.
EFBIG
The file referred to by fd is too large.
EINVAL
flags is invalid.
EINVAL
The decompressor sanity checks failed, while loading a compressed module with flag MODULE_INIT_COMPRESSED_FILE set.
ENOEXEC
fd does not refer to an open file.
EOPNOTSUPP (since Linux 5.17)
The flag MODULE_INIT_COMPRESSED_FILE is set to load a compressed module, and the kernel was built without CONFIG_MODULE_DECOMPRESS.
ETXTBSY (since Linux 4.7)
The file referred to by fd is opened for read-write.
In addition to the above errors, if the module’s init function is executed and returns an error, then init_module() or finit_module() fails and errno is set to the value returned by the init function.
STANDARDS
Linux.
HISTORY
finit_module()
Linux 3.8.
The init_module() system call is not supported by glibc. No declaration is provided in glibc headers, but, through a quirk of history, glibc versions before glibc 2.23 did export an ABI for this system call. Therefore, in order to employ this system call, it is (before glibc 2.23) sufficient to manually declare the interface in your code; alternatively, you can invoke the system call using syscall(2).
Linux 2.4 and earlier
In Linux 2.4 and earlier, the init_module() system call was rather different:
#include <linux/module.h>
int init_module(const char *name, struct module *image);
(User-space applications can detect which version of init_module() is available by calling query_module(); the latter call fails with the error ENOSYS on Linux 2.6 and later.)
The older version of the system call loads the relocated module image pointed to by image into kernel space and runs the module’s init function. The caller is responsible for providing the relocated image (since Linux 2.6, the init_module() system call does the relocation).
The module image begins with a module structure and is followed by code and data as appropriate. Since Linux 2.2, the module structure is defined as follows:
struct module {
unsigned long size_of_struct;
struct module *next;
const char *name;
unsigned long size;
long usecount;
unsigned long flags;
unsigned int nsyms;
unsigned int ndeps;
struct module_symbol *syms;
struct module_ref *deps;
struct module_ref *refs;
int (*init)(void);
void (*cleanup)(void);
const struct exception_table_entry *ex_table_start;
const struct exception_table_entry *ex_table_end;
#ifdef __alpha__
unsigned long gp;
#endif
};
All of the pointer fields, with the exception of next and refs, are expected to point within the module body and be initialized as appropriate for kernel space, that is, relocated with the rest of the module.
NOTES
Information about currently loaded modules can be found in /proc/modules and in the file trees under the per-module subdirectories under /sys/module.
See the Linux kernel source file include/linux/module.h for some useful background information.
SEE ALSO
create_module(2), delete_module(2), query_module(2), lsmod(8), modprobe(8)
426 - Linux cli command fchmod
NAME 🖥️ fchmod 🖥️
change permissions of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int chmod(const char *pathname, mode_t mode);
int fchmod(int fd, mode_t mode);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchmod():
Since glibc 2.24:
_POSIX_C_SOURCE >= 199309L
glibc 2.19 to glibc 2.23:
_POSIX_C_SOURCE
glibc 2.16 to glibc 2.19:
_BSD_SOURCE || _POSIX_C_SOURCE
glibc 2.12 to glibc 2.16:
_BSD_SOURCE || _XOPEN_SOURCE >= 500
|| _POSIX_C_SOURCE >= 200809L
glibc 2.11 and earlier:
_BSD_SOURCE || _XOPEN_SOURCE >= 500
fchmodat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
The chmod() and fchmod() system calls change a file’s mode bits. (The file mode consists of the file permission bits plus the set-user-ID, set-group-ID, and sticky bits.) These system calls differ only in how the file is specified:
chmod() changes the mode of the file whose pathname is given in pathname, which is dereferenced if it is a symbolic link.
fchmod() changes the mode of the file referred to by the open file descriptor fd.
The new file mode is specified in mode, which is a bit mask created by ORing together zero or more of the following:
S_ISUID (04000)
set-user-ID (set process effective user ID on execve(2))
S_ISGID (02000)
set-group-ID (set process effective group ID on execve(2); mandatory locking, as described in fcntl(2); take a new file’s group from parent directory, as described in chown(2) and mkdir(2))
S_ISVTX (01000)
sticky bit (restricted deletion flag, as described in unlink(2))
S_IRUSR (00400)
read by owner
S_IWUSR (00200)
write by owner
S_IXUSR (00100)
execute/search by owner (“search” applies for directories, and means that entries within the directory can be accessed)
S_IRGRP (00040)
read by group
S_IWGRP (00020)
write by group
S_IXGRP (00010)
execute/search by group
S_IROTH (00004)
read by others
S_IWOTH (00002)
write by others
S_IXOTH (00001)
execute/search by others
The effective UID of the calling process must match the owner of the file, or the process must be privileged (Linux: it must have the CAP_FOWNER capability).
If the calling process is not privileged (Linux: does not have the CAP_FSETID capability), and the group of the file does not match the effective group ID of the process or one of its supplementary group IDs, the S_ISGID bit will be turned off, but this will not cause an error to be returned.
As a security measure, depending on the filesystem, the set-user-ID and set-group-ID execution bits may be turned off if a file is written. (On Linux, this occurs if the writing process does not have the CAP_FSETID capability.) On some filesystems, only the superuser can set the sticky bit, which may have a special meaning. For the sticky bit, and for set-user-ID and set-group-ID bits on directories, see inode(7).
On NFS filesystems, restricting the permissions will immediately influence already open files, because the access control is done on the server, but open files are maintained by the client. Widening the permissions may be delayed for other clients if attribute caching is enabled on them.
fchmodat()
The fchmodat() system call operates in exactly the same way as chmod(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chmod() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chmod()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include the following flag:
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead operate on the link itself. This flag is not currently implemented.
See openat(2) for an explanation of the need for fchmodat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, errors other than those listed below can be returned.
The more general errors for chmod() are listed below:
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EBADF
(fchmod()) The file descriptor fd is not valid.
EBADF
(fchmodat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
(fchmodat()) Invalid flag specified in flags.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
The file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(fchmodat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
ENOTSUP
(fchmodat()) flags specified AT_SYMLINK_NOFOLLOW, which is not supported.
EPERM
The effective UID does not match the owner of the file, and the process is not privileged (Linux: it does not have the CAP_FOWNER capability).
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The named file resides on a read-only filesystem.
VERSIONS
C library/kernel differences
The GNU C library fchmodat() wrapper function implements the POSIX-specified interface described in this page. This interface differs from the underlying Linux system call, which does not have a flags argument.
glibc notes
On older kernels where fchmodat() is unavailable, the glibc wrapper function falls back to the use of chmod(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
STANDARDS
POSIX.1-2008.
HISTORY
chmod()
fchmod()
4.4BSD, SVr4, POSIX.1-2001.
fchmodat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
SEE ALSO
chmod(1), chown(2), execve(2), open(2), stat(2), inode(7), path_resolution(7), symlink(7)
427 - Linux cli command seccomp
NAME 🖥️ seccomp 🖥️
operate on Secure Computing state of the process
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/seccomp.h> /* Definition of SECCOMP_* constants */
#include <linux/filter.h> /* Definition of struct sock_fprog */
#include <linux/audit.h> /* Definition of AUDIT_* constants */
#include <linux/signal.h> /* Definition of SIG* constants */
#include <sys/ptrace.h> /* Definition of PTRACE_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_seccomp, unsigned int operation, unsigned int flags,
void *args);
Note: glibc provides no wrapper for seccomp(), necessitating the use of syscall(2).
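Because there is no glibc wrapper, programs usually define a small helper around syscall(2). The following is a minimal sketch of such a helper; the name sys_seccomp is hypothetical and is not provided by any library.

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper (not a glibc function): forwards its arguments
   to the raw seccomp system call.  Returns whatever syscall(2)
   returns; on failure that is -1 with errno set. */
static long
sys_seccomp(unsigned int operation, unsigned int flags, void *args)
{
    return syscall(SYS_seccomp, operation, flags, args);
}
```

A call such as sys_seccomp(SECCOMP_SET_MODE_STRICT, 0, NULL) then reads much like a call to an ordinary library function.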
DESCRIPTION
The seccomp() system call operates on the Secure Computing (seccomp) state of the calling process.
Currently, Linux supports the following operation values:
SECCOMP_SET_MODE_STRICT
The only system calls that the calling thread is permitted to make are read(2), write(2), _exit(2) (but not exit_group(2)), and sigreturn(2). Other system calls result in the termination of the calling thread, or termination of the entire process with the SIGKILL signal when there is only one thread. Strict secure computing mode is useful for number-crunching applications that may need to execute untrusted byte code, perhaps obtained by reading from a pipe or socket.
Note that although the calling thread can no longer call sigprocmask(2), it can use sigreturn(2) to block all signals apart from SIGKILL and SIGSTOP. This means that alarm(2) (for example) is not sufficient for restricting the process’s execution time. Instead, to reliably terminate the process, SIGKILL must be used. This can be done by using timer_create(2) with SIGEV_SIGNAL and sigev_signo set to SIGKILL, or by using setrlimit(2) to set the hard limit for RLIMIT_CPU.
This operation is available only if the kernel is configured with CONFIG_SECCOMP enabled.
The value of flags must be 0, and args must be NULL.
This operation is functionally identical to the call:
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
SECCOMP_SET_MODE_FILTER
The system calls allowed are defined by a pointer to a Berkeley Packet Filter (BPF) passed via args. This argument is a pointer to a struct sock_fprog; it can be designed to filter arbitrary system calls and system call arguments. If the filter is invalid, seccomp() fails, returning EINVAL in errno.
If fork(2) or clone(2) is allowed by the filter, any child processes will be constrained to the same system call filters as the parent. If execve(2) is allowed, the existing filters will be preserved across a call to execve(2).
In order to use the SECCOMP_SET_MODE_FILTER operation, either the calling thread must have the CAP_SYS_ADMIN capability in its user namespace, or the thread must already have the no_new_privs bit set. If that bit was not already set by an ancestor of this thread, the thread must make the following call:
prctl(PR_SET_NO_NEW_PRIVS, 1);
Otherwise, the SECCOMP_SET_MODE_FILTER operation fails and returns EACCES in errno. This requirement ensures that an unprivileged process cannot apply a malicious filter and then invoke a set-user-ID or other privileged program using execve(2), thus potentially compromising that program. (Such a malicious filter might, for example, cause an attempt to use setuid(2) to set the caller’s user IDs to nonzero values to instead return 0 without actually making the system call. Thus, the program might be tricked into retaining superuser privileges in circumstances where it is possible to influence it to do dangerous things because it did not actually drop privileges.)
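As a minimal sketch of the no_new_privs step, the helper below sets the bit and then reads it back with PR_GET_NO_NEW_PRIVS. The function name ensure_no_new_privs is an assumption made for illustration. Note that the bit cannot be cleared once it is set.

```c
#include <sys/prctl.h>

/* Hypothetical helper: set no_new_privs for the calling thread and
   confirm that it took effect.  Returns 0 on success, -1 on failure
   (errno is set by prctl(2) when the first call fails). */
static int
ensure_no_new_privs(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    /* PR_GET_NO_NEW_PRIVS returns the current value of the bit. */
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1 ? 0 : -1;
}
```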
If prctl(2) or seccomp() is allowed by the attached filter, further filters may be added. This will increase evaluation time, but allows for further reduction of the attack surface during execution of a thread.
The SECCOMP_SET_MODE_FILTER operation is available only if the kernel is configured with CONFIG_SECCOMP_FILTER enabled.
When flags is 0, this operation is functionally identical to the call:
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);
The recognized flags are:
SECCOMP_FILTER_FLAG_LOG (since Linux 4.14)
All filter return actions except SECCOMP_RET_ALLOW should be logged. An administrator may override this filter flag by preventing specific actions from being logged via the /proc/sys/kernel/seccomp/actions_logged file.
SECCOMP_FILTER_FLAG_NEW_LISTENER (since Linux 5.0)
After successfully installing the filter program, return a new user-space notification file descriptor. (The close-on-exec flag is set for the file descriptor.) When the filter returns SECCOMP_RET_USER_NOTIF a notification will be sent to this file descriptor.
At most one seccomp filter using the SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be installed for a thread.
See seccomp_unotify(2) for further details.
SECCOMP_FILTER_FLAG_SPEC_ALLOW (since Linux 4.17)
Disable Speculative Store Bypass mitigation.
SECCOMP_FILTER_FLAG_TSYNC
When adding a new filter, synchronize all other threads of the calling process to the same seccomp filter tree. A “filter tree” is the ordered list of filters attached to a thread. (Attaching identical filters in separate seccomp() calls results in different filters from this perspective.)
If any thread cannot synchronize to the same filter tree, the call will not attach the new seccomp filter, and will fail, returning the first thread ID found that cannot synchronize. Synchronization will fail if another thread in the same process is in SECCOMP_MODE_STRICT or if it has attached new seccomp filters to itself, diverging from the calling thread’s filter tree.
SECCOMP_GET_ACTION_AVAIL (since Linux 4.14)
Test to see if an action is supported by the kernel. This operation is helpful to confirm that the kernel knows of a more recently added filter return action since the kernel treats all unknown actions as SECCOMP_RET_KILL_PROCESS.
The value of flags must be 0, and args must be a pointer to an unsigned 32-bit filter return action.
SECCOMP_GET_NOTIF_SIZES (since Linux 5.0)
Get the sizes of the seccomp user-space notification structures. Since these structures may evolve and grow over time, this command can be used to determine how much memory to allocate for sending and receiving notifications.
The value of flags must be 0, and args must be a pointer to a struct seccomp_notif_sizes, which has the following form:
struct seccomp_notif_sizes {
__u16 seccomp_notif; /* Size of notification structure */
__u16 seccomp_notif_resp; /* Size of response structure */
__u16 seccomp_data; /* Size of 'struct seccomp_data' */
};
See seccomp_unotify(2) for further details.
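The two query operations, SECCOMP_GET_ACTION_AVAIL and SECCOMP_GET_NOTIF_SIZES, can be sketched as follows. The helper names action_available and query_notif_sizes are hypothetical, and the sketch assumes kernel headers recent enough (Linux 5.0 or later) to define both operations.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the kernel supports 'action',
   0 if it reports EOPNOTSUPP, and -1 on any other error (for
   example, a kernel that does not know the operation at all). */
static int
action_available(uint32_t action)
{
    if (syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0)
        return 1;
    return errno == EOPNOTSUPP ? 0 : -1;
}

/* Hypothetical helper: fill 'sizes' with the structure sizes used by
   the running kernel.  Returns 0 on success, -1 with errno set on
   error. */
static int
query_notif_sizes(struct seccomp_notif_sizes *sizes)
{
    return (int) syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes);
}
```

A caller might check action_available(SECCOMP_RET_USER_NOTIF) before building a filter that returns that action, and size its notification buffers from the seccomp_notif and seccomp_notif_resp fields rather than from sizeof().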
Filters
When adding filters via SECCOMP_SET_MODE_FILTER, args points to a filter program:
struct sock_fprog {
unsigned short len; /* Number of BPF instructions */
struct sock_filter *filter; /* Pointer to array of
BPF instructions */
};
Each program must contain one or more BPF instructions:
struct sock_filter { /* Filter block */
__u16 code; /* Actual filter code */
__u8 jt; /* Jump true */
__u8 jf; /* Jump false */
__u32 k; /* Generic multiuse field */
};
When executing the instructions, the BPF program operates on the system call information made available (i.e., use the BPF_ABS addressing mode) as a (read-only) buffer of the following form:
struct seccomp_data {
int nr; /* System call number */
__u32 arch; /* AUDIT_ARCH_* value
(see <linux/audit.h>) */
__u64 instruction_pointer; /* CPU instruction pointer */
__u64 args[6]; /* Up to 6 system call arguments */
};
Because numbering of system calls varies between architectures and some architectures (e.g., x86-64) allow user-space code to use the calling conventions of multiple architectures (and the convention being used may vary over the life of a process that uses execve(2) to execute binaries that employ the different conventions), it is usually necessary to verify the value of the arch field.
It is strongly recommended to use an allow-list approach whenever possible because such an approach is more robust and simple. A deny-list will have to be updated whenever a potentially dangerous system call is added (or a dangerous flag or option if those are deny-listed), and it is often possible to alter the representation of a value without altering its meaning, leading to a deny-list bypass. See also Caveats below.
The arch field is not unique for all calling conventions. The x86-64 ABI and the x32 ABI both use AUDIT_ARCH_X86_64 as arch, and they run on the same processors. Instead, the mask __X32_SYSCALL_BIT is used on the system call number to tell the two ABIs apart.
This means that a policy must either deny all syscalls with __X32_SYSCALL_BIT or it must recognize syscalls with and without __X32_SYSCALL_BIT set. A list of system calls to be denied based on nr that does not also contain nr values with __X32_SYSCALL_BIT set can be bypassed by a malicious program that sets __X32_SYSCALL_BIT.
Additionally, kernels prior to Linux 5.4 incorrectly permitted nr in the ranges 512-547 as well as the corresponding non-x32 syscalls ORed with __X32_SYSCALL_BIT. For example, nr == 521 and nr == (101 | __X32_SYSCALL_BIT) would result in invocations of ptrace(2) with potentially confused x32-vs-x86_64 semantics in the kernel. Policies intended to work on kernels before Linux 5.4 must ensure that they deny or otherwise correctly handle these system calls. On Linux 5.4 and newer, such system calls will fail with the error ENOSYS, without doing anything.
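The bit test that separates the two ABIs can be written out in plain C, as a sketch of the check a policy has to encode in BPF. The macro X32_SYSCALL_BIT is defined locally with the value of __X32_SYSCALL_BIT (bit 30), and both function names are hypothetical.

```c
#include <stdbool.h>

/* Same value as the kernel's __X32_SYSCALL_BIT; defined locally
   because not every set of headers exposes that macro. */
#define X32_SYSCALL_BIT 0x40000000u

/* True if 'nr', as seen in seccomp_data with arch equal to
   AUDIT_ARCH_X86_64, has the x32 bit set. */
static bool
is_x32_syscall(int nr)
{
    return ((unsigned int) nr & X32_SYSCALL_BIT) != 0;
}

/* The system call number with the x32 bit cleared. */
static int
strip_x32_bit(int nr)
{
    return (int) ((unsigned int) nr & ~X32_SYSCALL_BIT);
}
```

A deny-list that compares nr only against plain x86-64 numbers never matches a value for which is_x32_syscall() is true, which is exactly the bypass described above.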
The instruction_pointer field provides the address of the machine-language instruction that performed the system call. This might be useful in conjunction with the use of /proc/pid/maps to perform checks based on which region (mapping) of the program made the system call. (Probably, it is wise to lock down the mmap(2) and mprotect(2) system calls to prevent the program from subverting such checks.)
When checking values from args, keep in mind that arguments are often silently truncated before being processed, but after the seccomp check. For example, this happens if the i386 ABI is used on an x86-64 kernel: although the kernel will normally not look beyond the 32 lowest bits of the arguments, the values of the full 64-bit registers will be present in the seccomp data. A less surprising example is that if the x86-64 ABI is used to perform a system call that takes an argument of type int, the more-significant half of the argument register is ignored by the system call, but visible in the seccomp data.
A seccomp filter returns a 32-bit value consisting of two parts: the most significant 16 bits (corresponding to the mask defined by the constant SECCOMP_RET_ACTION_FULL) contain one of the “action” values listed below; the least significant 16 bits (defined by the constant SECCOMP_RET_DATA) are “data” to be associated with this return value.
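The split into action and data can be illustrated with the two masks from <linux/seccomp.h>. This is only a sketch; the helper names ret_action, ret_data, and make_errno_ret are hypothetical.

```c
#include <linux/seccomp.h>
#include <stdint.h>

/* Upper 16 bits of a filter return value: the action. */
static uint32_t
ret_action(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION_FULL;
}

/* Lower 16 bits of a filter return value: the data. */
static uint32_t
ret_data(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA;
}

/* Compose the value a filter returns to make a system call fail
   with the error number 'err'. */
static uint32_t
make_errno_ret(int err)
{
    return SECCOMP_RET_ERRNO | ((uint32_t) err & SECCOMP_RET_DATA);
}
```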
If multiple filters exist, they are all executed, in reverse order of their addition to the filter tree; that is, the most recently installed filter is executed first. (Note that all filters will be called even if one of the earlier filters returns SECCOMP_RET_KILL. This is done to simplify the kernel code and to provide a tiny speed-up in the execution of sets of filters by avoiding a check for this uncommon case.) The return value for the evaluation of a given system call is the first-seen action value of highest precedence (along with its accompanying data) returned by execution of all of the filters.
In decreasing order of precedence, the action values that may be returned by a seccomp filter are:
SECCOMP_RET_KILL_PROCESS (since Linux 4.14)
This value results in immediate termination of the process, with a core dump. The system call is not executed. By contrast with SECCOMP_RET_KILL_THREAD below, all threads in the thread group are terminated. (For a discussion of thread groups, see the description of the CLONE_THREAD flag in clone(2).)
The process terminates as though killed by a SIGSYS signal. Even if a signal handler has been registered for SIGSYS, the handler will be ignored in this case and the process always terminates. To a parent process that is waiting on this process (using waitpid(2) or similar), the returned wstatus will indicate that its child was terminated as though by a SIGSYS signal.
SECCOMP_RET_KILL_THREAD (or SECCOMP_RET_KILL)
This value results in immediate termination of the thread that made the system call. The system call is not executed. Other threads in the same thread group will continue to execute.
The thread terminates as though killed by a SIGSYS signal. See SECCOMP_RET_KILL_PROCESS above.
Before Linux 4.11, any process terminated in this way would not trigger a coredump (even though SIGSYS is documented in signal(7) as having a default action of termination with a core dump). Since Linux 4.11, a single-threaded process will dump core if terminated in this way.
With the addition of SECCOMP_RET_KILL_PROCESS in Linux 4.14, SECCOMP_RET_KILL_THREAD was added as a synonym for SECCOMP_RET_KILL, in order to more clearly distinguish the two actions.
Note: the use of SECCOMP_RET_KILL_THREAD to kill a single thread in a multithreaded process is likely to leave the process in a permanently inconsistent and possibly corrupt state.
SECCOMP_RET_TRAP
This value results in the kernel sending a thread-directed SIGSYS signal to the triggering thread. (The system call is not executed.) Various fields will be set in the siginfo_t structure (see sigaction(2)) associated with the signal:
si_signo will contain SIGSYS.
si_call_addr will show the address of the system call instruction.
si_syscall and si_arch will indicate which system call was attempted.
si_code will contain SYS_SECCOMP.
si_errno will contain the SECCOMP_RET_DATA portion of the filter return value.
The program counter will be as though the system call happened (i.e., the program counter will not point to the system call instruction). The return value register will contain an architecture-dependent value; if resuming execution, set it to something appropriate for the system call. (The architecture dependency is because replacing it with ENOSYS could overwrite some useful information.)
SECCOMP_RET_ERRNO
This value results in the SECCOMP_RET_DATA portion of the filter’s return value being passed to user space as the errno value without executing the system call.
SECCOMP_RET_USER_NOTIF (since Linux 5.0)
Forward the system call to an attached user-space supervisor process to allow that process to decide what to do with the system call. If there is no attached supervisor (either because the filter was not installed with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag or because the file descriptor was closed), the filter returns ENOSYS (similar to what happens when a filter returns SECCOMP_RET_TRACE and there is no tracer). See seccomp_unotify(2) for further details.
Note that the supervisor process will not be notified if another filter returns an action value with a precedence greater than SECCOMP_RET_USER_NOTIF.
SECCOMP_RET_TRACE
When returned, this value will cause the kernel to attempt to notify a ptrace(2)-based tracer prior to executing the system call. If there is no tracer present, the system call is not executed and returns a failure status with errno set to ENOSYS.
A tracer will be notified if it requests PTRACE_O_TRACESECCOMP using ptrace(PTRACE_SETOPTIONS). The tracer will be notified of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of the filter’s return value will be available to the tracer via PTRACE_GETEVENTMSG.
The tracer can skip the system call by changing the system call number to -1. Alternatively, the tracer can change the system call requested by changing the system call to a valid system call number. If the tracer asks to skip the system call, then the system call will appear to return the value that the tracer puts in the return value register.
Before Linux 4.8, the seccomp check will not be run again after the tracer is notified. (This means that, on older kernels, seccomp-based sandboxes must not allow use of ptrace(2), even of other sandboxed processes, without extreme care; ptracers can use this mechanism to escape from the seccomp sandbox.)
Note that a tracer process will not be notified if another filter returns an action value with a precedence greater than SECCOMP_RET_TRACE.
SECCOMP_RET_LOG (since Linux 4.14)
This value results in the system call being executed after the filter return action is logged. An administrator may override the logging of this action via the /proc/sys/kernel/seccomp/actions_logged file.
SECCOMP_RET_ALLOW
This value results in the system call being executed.
If an action value other than one of the above is specified, then the filter action is treated as either SECCOMP_RET_KILL_PROCESS (since Linux 4.14) or SECCOMP_RET_KILL_THREAD (in Linux 4.13 and earlier).
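The rule for combining the results of several filters ("first-seen action value of highest precedence") can be sketched in user space as below. The sketch relies on an observation about the constants in <linux/seccomp.h>: read as signed 32-bit numbers, the action values sort in exactly the precedence order listed above, with SECCOMP_RET_KILL_PROCESS lowest. The function names are hypothetical, and this is an illustration rather than the kernel's code.

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* Map a filter return value to a sort key: the action bits read as
   a signed 32-bit number.  A lower key means higher precedence. */
static int64_t
precedence_key(uint32_t ret)
{
    uint32_t action = ret & SECCOMP_RET_ACTION_FULL;

    return action >= 0x80000000u ? (int64_t) action - 0x100000000LL
                                 : (int64_t) action;
}

/* 'rets' holds the values returned by each filter, in execution
   order (most recently installed filter first).  The result is the
   first value seen with the highest-precedence action. */
static uint32_t
combine_results(const uint32_t *rets, size_t n)
{
    uint32_t best = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++)
        if (precedence_key(rets[i]) < precedence_key(best))
            best = rets[i];
    return best;
}
```

With results { SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO | 5, SECCOMP_RET_ERRNO | 7 }, the combined value is SECCOMP_RET_ERRNO | 5: the two errno results tie on precedence, so the first one seen wins and its data is kept.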
/proc interfaces
The files in the directory /proc/sys/kernel/seccomp provide additional seccomp information and configuration:
actions_avail (since Linux 4.14)
A read-only ordered list of seccomp filter return actions in string form. The ordering, from left-to-right, is in decreasing order of precedence. The list represents the set of seccomp filter return actions supported by the kernel.
actions_logged (since Linux 4.14)
A read-write ordered list of seccomp filter return actions that are allowed to be logged. Writes to the file do not need to be in ordered form but reads from the file will be ordered in the same way as the actions_avail file.
It is important to note that the value of actions_logged does not prevent certain filter return actions from being logged when the audit subsystem is configured to audit a task. If the action is not found in the actions_logged file, the final decision on whether to audit the action for that task is ultimately left up to the audit subsystem to decide for all filter return actions other than SECCOMP_RET_ALLOW.
The “allow” string is not accepted in the actions_logged file as it is not possible to log SECCOMP_RET_ALLOW actions. Attempting to write “allow” to the file will fail with the error EINVAL.
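A program that has read one of these files into a buffer can look for an action name with a whole-word search. The helper below is a sketch; list_has_action is a hypothetical name, and the sketch assumes that names in the list are separated by spaces with an optional trailing newline.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if 'name' appears in 'list' as a complete word, where words
   are delimited by spaces, newlines, or the ends of the string. */
static bool
list_has_action(const char *list, const char *name)
{
    size_t len = strlen(name);
    const char *p = list;

    if (len == 0)
        return false;

    while ((p = strstr(p, name)) != NULL) {
        bool starts = (p == list) || p[-1] == ' ' || p[-1] == '\n';
        bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts && ends)
            return true;
        p++;    /* Keep searching after this partial match */
    }
    return false;
}
```

For instance, searching a buffer holding "kill_process kill_thread trap errno log allow" for "kill" reports no match, because "kill" occurs only as part of longer names.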
Audit logging of seccomp actions
Since Linux 4.14, the kernel provides the facility to log the actions returned by seccomp filters in the audit log. The kernel makes the decision to log an action based on the action type, whether or not the action is present in the actions_logged file, and whether kernel auditing is enabled (e.g., via the kernel boot option audit=1). The rules are as follows:
If the action is SECCOMP_RET_ALLOW, the action is not logged.
Otherwise, if the action is either SECCOMP_RET_KILL_PROCESS or SECCOMP_RET_KILL_THREAD, and that action appears in the actions_logged file, the action is logged.
Otherwise, if the filter has requested logging (the SECCOMP_FILTER_FLAG_LOG flag) and the action appears in the actions_logged file, the action is logged.
Otherwise, if kernel auditing is enabled and the process is being audited (autrace(8)), the action is logged.
Otherwise, the action is not logged.
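The rules above can be transcribed directly into a decision function. The sketch below follows the rules in the order given; the function and parameter names are hypothetical, and the code is a reading aid rather than the kernel's implementation.

```c
#include <linux/seccomp.h>
#include <stdbool.h>
#include <stdint.h>

/* in_actions_logged: the action appears in the actions_logged file.
   filter_flag_log:   the filter was installed with
                      SECCOMP_FILTER_FLAG_LOG.
   task_audited:      kernel auditing is enabled and the process is
                      being audited. */
static bool
action_is_logged(uint32_t ret, bool in_actions_logged,
                 bool filter_flag_log, bool task_audited)
{
    uint32_t action = ret & SECCOMP_RET_ACTION_FULL;

    if (action == SECCOMP_RET_ALLOW)
        return false;
    if ((action == SECCOMP_RET_KILL_PROCESS ||
         action == SECCOMP_RET_KILL_THREAD) && in_actions_logged)
        return true;
    if (filter_flag_log && in_actions_logged)
        return true;
    if (task_audited)
        return true;
    return false;
}
```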
RETURN VALUE
On success, seccomp() returns 0. On error, if SECCOMP_FILTER_FLAG_TSYNC was used, the return value is the ID of the thread that caused the synchronization failure. (This ID is a kernel thread ID of the type returned by clone(2) and gettid(2).) On other errors, -1 is returned, and errno is set to indicate the error.
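Because a failed thread synchronization is reported as a positive thread ID rather than as -1, callers that pass SECCOMP_FILTER_FLAG_TSYNC need a three-way check. The sketch below shows one way to classify the result; the enum and function names are hypothetical, and it assumes the call did not also pass SECCOMP_FILTER_FLAG_NEW_LISTENER (where a successful call yields a file descriptor).

```c
#include <sys/types.h>

enum tsync_outcome {
    RC_OK,              /* Filter installed */
    RC_ERRNO_SET,       /* -1 returned; consult errno */
    RC_TSYNC_BLOCKED    /* A thread could not be synchronized */
};

/* Classify the return value 'rc' of a seccomp() call made with
   SECCOMP_FILTER_FLAG_TSYNC.  When a thread blocked the
   synchronization, its ID is stored in '*blocking_tid'. */
static enum tsync_outcome
classify_tsync_return(long rc, pid_t *blocking_tid)
{
    if (rc == 0)
        return RC_OK;
    if (rc > 0) {
        *blocking_tid = (pid_t) rc;
        return RC_TSYNC_BLOCKED;
    }
    return RC_ERRNO_SET;
}
```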
ERRORS
seccomp() can fail for the following reasons:
EACCES
The caller did not have the CAP_SYS_ADMIN capability in its user namespace, or had not set no_new_privs before using SECCOMP_SET_MODE_FILTER.
EBUSY
While installing a new filter, the SECCOMP_FILTER_FLAG_NEW_LISTENER flag was specified, but a previous filter had already been installed with that flag.
EFAULT
args was not a valid address.
EINVAL
operation is unknown or is not supported by this kernel version or configuration.
EINVAL
The specified flags are invalid for the given operation.
EINVAL
operation included BPF_ABS, but the specified offset was not aligned to a 32-bit boundary or exceeded sizeof(struct seccomp_data).
EINVAL
A secure computing mode has already been set, and operation differs from the existing setting.
EINVAL
operation specified SECCOMP_SET_MODE_FILTER, but the filter program pointed to by args was not valid or the length of the filter program was zero or exceeded BPF_MAXINSNS (4096) instructions.
ENOMEM
Out of memory.
ENOMEM
The total length of all filter programs attached to the calling thread would exceed MAX_INSNS_PER_PATH (32768) instructions. Note that for the purposes of calculating this limit, each already existing filter program incurs an overhead penalty of 4 instructions.
EOPNOTSUPP
operation specified SECCOMP_GET_ACTION_AVAIL, but the kernel does not support the filter return action specified by args.
ESRCH
Another thread caused a failure during thread sync, but its ID could not be determined.
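The second ENOMEM condition is simple arithmetic: the new program's length, plus the length of every filter already attached, plus 4 instructions for each of those existing filters, must not exceed 32768. The following sketch works through that calculation; the function name and macro PER_FILTER_PENALTY are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_INSNS_PER_PATH 32768  /* Limit quoted in the ENOMEM entry */
#define PER_FILTER_PENALTY 4      /* Overhead per existing filter */

/* True if a new filter of 'new_len' instructions can be added to a
   thread that already has 'n_existing' filters with the given
   lengths. */
static bool
new_filter_fits(const unsigned short *existing_lens, size_t n_existing,
                unsigned short new_len)
{
    unsigned long total = new_len;

    for (size_t i = 0; i < n_existing; i++)
        total += (unsigned long) existing_lens[i] + PER_FILTER_PENALTY;

    return total <= MAX_INSNS_PER_PATH;
}
```

For example, seven attached filters of 4096 instructions each count as 7 * (4096 + 4) = 28700, which leaves room for one more filter of at most 4068 instructions.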
STANDARDS
Linux.
HISTORY
Linux 3.17.
NOTES
Rather than hand-coding seccomp filters as shown in the example below, you may prefer to employ the libseccomp library, which provides a front-end for generating seccomp filters.
The Seccomp field of the /proc/pid/status file provides a method of viewing the seccomp mode of a process; see proc(5).
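As a sketch of how the field might be read programmatically, the helper below extracts the value of the Seccomp line from the text of a status file that the caller has already loaded into memory. The function name is hypothetical. Per proc(5), the value is 0 when seccomp is disabled, 1 for strict mode, and 2 for filter mode.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the number on the "Seccomp:" line of 'status_text', or -1
   if there is no such line.  Lines such as "Seccomp_filters:" are
   skipped because the comparison includes the colon. */
static int
seccomp_mode_from_status(const char *status_text)
{
    const char *p = status_text;

    while (p != NULL && *p != '\0') {
        if (strncmp(p, "Seccomp:", 8) == 0)
            return (int) strtol(p + 8, NULL, 10);
        p = strchr(p, '\n');    /* Advance to the next line */
        if (p != NULL)
            p++;
    }
    return -1;
}
```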
seccomp() provides a superset of the functionality provided by the prctl(2) PR_SET_SECCOMP operation (which does not support flags).
Since Linux 4.4, the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation can be used to dump a process’s seccomp filters.
Architecture support for seccomp BPF
Architecture support for seccomp BPF filtering is available on the following architectures:
x86-64, i386, x32 (since Linux 3.5)
ARM (since Linux 3.8)
s390 (since Linux 3.8)
MIPS (since Linux 3.16)
ARM-64 (since Linux 3.19)
PowerPC (since Linux 4.3)
Tile (since Linux 4.3)
PA-RISC (since Linux 4.6)
Caveats
There are various subtleties to consider when applying seccomp filters to a program, including the following:
Some traditional system calls have user-space implementations in the vdso(7) on many architectures. Notable examples include clock_gettime(2), gettimeofday(2), and time(2). On such architectures, seccomp filtering for these system calls will have no effect. (However, there are cases where the vdso(7) implementations may fall back to invoking the true system call, in which case seccomp filters would see the system call.)
Seccomp filtering is based on system call numbers. However, applications typically do not directly invoke system calls, but instead call wrapper functions in the C library which in turn invoke the system calls. Consequently, one must be aware of the following:
The glibc wrappers for some traditional system calls may actually employ system calls with different names in the kernel. For example, the exit(2) wrapper function actually employs the exit_group(2) system call, and the fork(2) wrapper function actually calls clone(2).
The behavior of wrapper functions may vary across architectures, according to the range of system calls provided on those architectures. In other words, the same wrapper function may invoke different system calls on different architectures.
Finally, the behavior of wrapper functions can change across glibc versions. For example, in older versions, the glibc wrapper function for open(2) invoked the system call of the same name, but starting in glibc 2.26, the implementation switched to calling openat(2) on all architectures.
The consequence of the above points is that it may be necessary to filter for a system call other than might be expected. Various manual pages in Section 2 provide helpful details about the differences between wrapper functions and the underlying system calls in subsections entitled C library/kernel differences.
Furthermore, note that the application of seccomp filters even risks causing bugs in an application, when the filters cause unexpected failures for legitimate operations that the application might need to perform. Such bugs may not easily be discovered when testing the seccomp filters if the bugs occur in rarely used application code paths.
Seccomp-specific BPF details
Note the following BPF details specific to seccomp filters:
The BPF_H and BPF_B size modifiers are not supported: all operations must load and store (4-byte) words (BPF_W).
To access the contents of the seccomp_data buffer, use the BPF_ABS addressing mode modifier.
The BPF_LEN addressing mode modifier yields an immediate mode operand whose value is the size of the seccomp_data buffer.
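One practical consequence of the word-only restriction is that each 64-bit entry of args must be loaded as two 32-bit halves, and which half comes first in memory depends on the byte order of the machine. The helpers below are a sketch of how the BPF_ABS offsets might be computed; their names are hypothetical.

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* Detect the host byte order at run time. */
static int
host_is_little_endian(void)
{
    const uint16_t probe = 1;

    return *(const unsigned char *) &probe == 1;
}

/* Offset in seccomp_data of the less-significant 32 bits of
   args[n]. */
static uint32_t
arg_lo_offset(unsigned int n)
{
    uint32_t base = (uint32_t) (offsetof(struct seccomp_data, args) +
                                n * sizeof(uint64_t));

    return host_is_little_endian() ? base : base + 4;
}

/* Offset in seccomp_data of the more-significant 32 bits of
   args[n]. */
static uint32_t
arg_hi_offset(unsigned int n)
{
    uint32_t base = (uint32_t) (offsetof(struct seccomp_data, args) +
                                n * sizeof(uint64_t));

    return host_is_little_endian() ? base + 4 : base;
}
```

A filter would then load the low half of the first argument with BPF_STMT(BPF_LD | BPF_W | BPF_ABS, arg_lo_offset(0)). Checking the high half as well guards against the silently truncated arguments described earlier.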
EXAMPLES
The program below accepts four or more arguments. The first three arguments are a system call number, a numeric architecture identifier, and an error number. The program uses these values to construct a BPF filter that is used at run time to perform the following checks:
[1] If the program is not running on the specified architecture, the BPF filter kills the process (SECCOMP_RET_KILL_PROCESS).
[2] If the program attempts to execute the system call with the specified number, the BPF filter causes the system call to fail, with errno being set to the specified error number.
The remaining command-line arguments specify the pathname and additional arguments of a program that the example program should attempt to execute using execv(3) (a library function that employs the execve(2) system call). Some example runs of the program are shown below.
First, we display the architecture that we are running on (x86-64) and then construct a shell function that looks up system call numbers on this architecture:
$ uname -m
x86_64
$ syscall_nr() {
cat /usr/src/linux/arch/x86/syscalls/syscall_64.tbl | \
awk '$2 != "x32" && $3 == "'$1'" { print $1 }'
}
When the BPF filter rejects a system call (case [2] above), it causes the system call to fail with the error number specified on the command line. In the experiments shown here, we’ll use error number 99:
$ errno 99
EADDRNOTAVAIL 99 Cannot assign requested address
In the following example, we attempt to run the command whoami(1), but the BPF filter rejects the execve(2) system call, so that the command is not even executed:
$ syscall_nr execve
59
$ ./a.out
Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
AUDIT_ARCH_X86_64: 0xC000003E
$ ./a.out 59 0xC000003E 99 /bin/whoami
execv: Cannot assign requested address
In the next example, the BPF filter rejects the write(2) system call, so that, although it is successfully started, the whoami(1) command is not able to write output:
$ syscall_nr write
1
$ ./a.out 1 0xC000003E 99 /bin/whoami
In the final example, the BPF filter rejects a system call that is not used by the whoami(1) command, so it is able to successfully execute and produce output:
$ syscall_nr preadv
295
$ ./a.out 295 0xC000003E 99 /bin/whoami
cecilia
Program source
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#define X32_SYSCALL_BIT 0x40000000
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
static int
install_filter(int syscall_nr, unsigned int t_arch, int f_errno)
{
    unsigned int upper_nr_limit = 0xffffffff;

    /* Assume that AUDIT_ARCH_X86_64 means the normal x86-64 ABI
       (in the x32 ABI, all system calls have bit 30 set in the
       'nr' field, meaning the numbers are >= X32_SYSCALL_BIT). */
    if (t_arch == AUDIT_ARCH_X86_64)
        upper_nr_limit = X32_SYSCALL_BIT - 1;

    struct sock_filter filter[] = {
        /* [0] Load architecture from 'seccomp_data' buffer into
               accumulator. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 (offsetof(struct seccomp_data, arch))),

        /* [1] Jump forward 5 instructions if architecture does not
               match 't_arch'. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),

        /* [2] Load system call number from 'seccomp_data' buffer into
               accumulator. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 (offsetof(struct seccomp_data, nr))),

        /* [3] Check ABI - only needed for x86-64 in deny-list use
               cases. Use BPF_JGT instead of checking against the bit
               mask to avoid having to reload the syscall number. */
        BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),

        /* [4] Jump forward 1 instruction if system call number
               does not match 'syscall_nr'. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),

        /* [5] Matching architecture and system call: don't execute
               the system call, and return 'f_errno' in 'errno'. */
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

        /* [6] Destination of system call number mismatch: allow other
               system calls. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        /* [7] Destination of architecture mismatch: kill process. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    };

    struct sock_fprog prog = {
        .len = ARRAY_SIZE(filter),
        .filter = filter,
    };

    if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
        perror("seccomp");
        return 1;
    }

    return 0;
}

int
main(int argc, char *argv[])
{
    if (argc < 5) {
        fprintf(stderr, "Usage: "
                "%s <syscall_nr> <arch> <errno> <prog> [<args>]\n"
                "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\n"
                "                 AUDIT_ARCH_X86_64: 0x%X\n"
                "\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
        exit(EXIT_FAILURE);
    }

    /* The filter can be installed without CAP_SYS_ADMIN only if
       no_new_privs is set. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("prctl");
        exit(EXIT_FAILURE);
    }

    if (install_filter(strtol(argv[1], NULL, 0),
                       strtoul(argv[2], NULL, 0),
                       strtol(argv[3], NULL, 0)))
        exit(EXIT_FAILURE);

    /* Run the requested program under the filter; execv() returns
       only on failure. */
    execv(argv[4], &argv[4]);
    perror("execv");
    exit(EXIT_FAILURE);
}
SEE ALSO
bpfc(1), strace(1), bpf(2), prctl(2), ptrace(2), seccomp_unotify(2), sigaction(2), proc(5), signal(7), socket(7)
Various pages from the libseccomp library, including: scmp_sys_resolver(1), seccomp_export_bpf(3), seccomp_init(3), seccomp_load(3), and seccomp_rule_add(3).
The kernel source files Documentation/networking/filter.txt and Documentation/userspace-api/seccomp_filter.rst (or Documentation/prctl/seccomp_filter.txt before Linux 4.13).
McCanne, S. and Jacobson, V. (1992) The BSD Packet Filter: A New Architecture for User-level Packet Capture, Proceedings of the USENIX Winter 1993 Conference
428 - Linux cli command setxattr
NAME
setxattr - set an extended attribute value
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/xattr.h>
int setxattr(const char *path, const char *name,
const void value[.size], size_t size, int flags);
int lsetxattr(const char *path, const char *name,
const void value[.size], size_t size, int flags);
int fsetxattr(int fd, const char *name,
const void value[.size], size_t size, int flags);
DESCRIPTION
Extended attributes are name:value pairs associated with inodes (files, directories, symbolic links, etc.). They are extensions to the normal attributes which are associated with all inodes in the system (i.e., the stat(2) data). A complete overview of extended attributes concepts can be found in xattr(7).
setxattr() sets the value of the extended attribute identified by name and associated with the given path in the filesystem. The size argument specifies the size (in bytes) of value; a zero-length value is permitted.
lsetxattr() is identical to setxattr(), except in the case of a symbolic link, where the extended attribute is set on the link itself, not the file that it refers to.
fsetxattr() is identical to setxattr(), only the extended attribute is set on the open file referred to by fd (as returned by open(2)) in place of path.
An extended attribute name is a null-terminated string. The name includes a namespace prefix; there may be several, disjoint namespaces associated with an individual inode. The value of an extended attribute is a chunk of arbitrary textual or binary data of specified length.
By default (i.e., flags is zero), the extended attribute will be created if it does not exist, or the value will be replaced if the attribute already exists. To modify these semantics, one of the following values can be specified in flags:
XATTR_CREATE
Perform a pure create, which fails if the named attribute exists already.
XATTR_REPLACE
Perform a pure replace operation, which fails if the named attribute does not already exist.
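The two flags can be wrapped in small helpers that make the intent explicit at the call site. The sketch below is one possible shape; the helper names create_xattr and replace_xattr are hypothetical, and values are treated as C strings for brevity even though attribute values may be arbitrary binary data.

```c
#include <errno.h>
#include <string.h>
#include <sys/xattr.h>

/* Create attribute 'name' on 'path', refusing to overwrite an
   existing value.  Returns 0 on success or a negated errno value
   (for example, -EEXIST if the attribute already exists). */
static int
create_xattr(const char *path, const char *name, const char *value)
{
    if (setxattr(path, name, value, strlen(value), XATTR_CREATE) == -1)
        return -errno;
    return 0;
}

/* Replace the value of attribute 'name' on 'path', refusing to
   create it.  Returns 0 on success or a negated errno value (for
   example, -ENODATA if the attribute does not exist). */
static int
replace_xattr(const char *path, const char *name, const char *value)
{
    if (setxattr(path, name, value, strlen(value), XATTR_REPLACE) == -1)
        return -errno;
    return 0;
}
```

A caller might use create_xattr(path, "user.demo", "1") to tag a file only once; "user.demo" is an illustrative name in the user namespace, and the call fails with ENOTSUP on filesystems without extended attribute support.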
RETURN VALUE
On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EDQUOT
Disk quota limits meant that there is insufficient space remaining to store the extended attribute.
EEXIST
XATTR_CREATE was specified, and the attribute exists already.
ENODATA
XATTR_REPLACE was specified, and the attribute does not exist.
ENOSPC
There is insufficient space remaining to store the extended attribute.
ENOTSUP
The namespace prefix of name is not valid.
ENOTSUP
Extended attributes are not supported by the filesystem, or are disabled.
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
ERANGE
The size of name or value exceeds a filesystem-specific limit.
In addition, the errors documented in stat(2) can also occur.
STANDARDS
Linux.
HISTORY
Linux 2.4, glibc 2.3.
SEE ALSO
getfattr(1), setfattr(1), getxattr(2), listxattr(2), open(2), removexattr(2), stat(2), symlink(7), xattr(7)
429 - Linux cli command setuid32
NAME 🖥️ setuid32 🖥️
set user identity
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setuid(uid_t uid);
DESCRIPTION
setuid() sets the effective user ID of the calling process. If the calling process is privileged (more precisely: if the process has the CAP_SETUID capability in its user namespace), the real UID and saved set-user-ID are also set.
Under Linux, setuid() is implemented like the POSIX version with the _POSIX_SAVED_IDS feature. This allows a set-user-ID (other than root) program to drop all of its user privileges, do some un-privileged work, and then reengage the original effective user ID in a secure manner.
If the user is root or the program is set-user-ID-root, special care must be taken: setuid() checks the effective user ID of the caller and if it is the superuser, all process-related user IDs are set to uid. After this has occurred, it is impossible for the program to regain root privileges.
Thus, a set-user-ID-root program wishing to temporarily drop root privileges, assume the identity of an unprivileged user, and then regain root privileges afterward cannot use setuid(). You can accomplish this with seteuid(2).
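A temporary switch of the effective user ID with seteuid(2) might look like the following sketch. The names with_effective_uid() and record_euid() are hypothetical helpers for this illustration, not part of any library.

```c
#include <sys/types.h>
#include <unistd.h>

/* Example callback: store the effective UID seen while switched. */
void record_euid(void *arg)
{
    *(uid_t *)arg = geteuid();
}

/* Run work(arg) with the effective UID temporarily set to `uid`, then
 * restore the previous effective UID. Returns 0 on success, -1 on failure. */
int with_effective_uid(uid_t uid, void (*work)(void *), void *arg)
{
    uid_t saved = geteuid();

    if (seteuid(uid) == -1)
        return -1;
    work(arg);
    if (seteuid(saved) == -1)
        return -1;      /* could not regain the original identity */
    return geteuid() == saved ? 0 : -1;
}
```

The restore step works because seteuid(2) leaves the saved set-user-ID untouched, whereas a privileged setuid() overwrites it.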
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where setuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from setuid().
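The failure check can be sketched as follows. The helper name drop_privileges_to() is hypothetical; the re-check of the resulting IDs is a defensive habit assumed here for a privileged caller, not something the interface requires.

```c
#include <sys/types.h>
#include <unistd.h>

/* Permanently switch to `uid`. Returns 0 only if setuid() succeeded and
 * both the real and effective UIDs now equal `uid`; returns -1 otherwise. */
int drop_privileges_to(uid_t uid)
{
    if (setuid(uid) == -1)
        return -1;      /* never continue with the old identity on failure */
    if (getuid() != uid || geteuid() != uid)
        return -1;      /* defensive re-check of the resulting IDs */
    return 0;
}
```

A caller would normally treat a nonzero return as fatal and exit rather than continue running with unexpected privileges.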
ERRORS
EAGAIN
The call would change the caller’s real UID (i.e., uid does not match the caller’s real UID), but there was a temporary failure allocating the necessary kernel data structures.
EAGAIN
uid does not match the real user ID of the caller and this call would bring the number of processes belonging to the real user ID uid over the caller’s RLIMIT_NPROC resource limit. Since Linux 3.1, this error case no longer occurs (but robust applications should check for this error); see the description of EAGAIN in execve(2).
EINVAL
The user ID specified in uid is not valid in this user namespace.
EPERM
The user is not privileged (Linux: does not have the CAP_SETUID capability in its user namespace) and uid does not match the real UID or saved set-user-ID of the calling process.
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including the one for setuid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
Not quite compatible with the 4.4BSD call, which sets all of the real, saved, and effective user IDs.
The original Linux setuid() system call supported only 16-bit user IDs. Subsequently, Linux 2.4 added setuid32() supporting 32-bit IDs. The glibc setuid() wrapper function transparently deals with the variation across kernel versions.
NOTES
Linux has the concept of the filesystem user ID, normally equal to the effective user ID. The setuid() call also sets the filesystem user ID of the calling process. See setfsuid(2).
If uid is different from the old effective UID, the process will be forbidden from leaving core dumps.
SEE ALSO
getuid(2), seteuid(2), setfsuid(2), setreuid(2), capabilities(7), credentials(7), user_namespaces(7)
430 - Linux cli command clone
NAME 🖥️ clone 🖥️
create a child process
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
/* Prototype for the glibc wrapper function */
#define _GNU_SOURCE
#include <sched.h>
int clone(int (*fn)(void *_Nullable), void *stack, int flags,
void *_Nullable arg, ... /* pid_t *_Nullable parent_tid,
void *_Nullable tls,
pid_t *_Nullable child_tid */ );
/* For the prototype of the raw clone() system call, see NOTES */
#include <linux/sched.h> /* Definition of struct clone_args */
#include <sched.h> /* Definition of CLONE_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(SYS_clone3, struct clone_args *cl_args, size_t size);
Note: glibc provides no wrapper for clone3(), necessitating the use of syscall(2).
DESCRIPTION
These system calls create a new (“child”) process, in a manner similar to fork(2).
By contrast with fork(2), these system calls provide more precise control over what pieces of execution context are shared between the calling process and the child process. For example, using these system calls, the caller can control whether or not the two processes share the virtual address space, the table of file descriptors, and the table of signal handlers. These system calls also allow the new child process to be placed in separate namespaces(7).
Note that in this manual page, “calling process” normally corresponds to “parent process”. But see the descriptions of CLONE_PARENT and CLONE_THREAD below.
This page describes the following interfaces:
The glibc clone() wrapper function and the underlying system call on which it is based. The main text describes the wrapper function; the differences for the raw system call are described toward the end of this page.
The newer clone3() system call.
In the remainder of this page, the terminology “the clone call” is used when noting details that apply to all of these interfaces.
The clone() wrapper function
When the child process is created with the clone() wrapper function, it commences execution by calling the function pointed to by the argument fn. (This differs from fork(2), where execution continues in the child from the point of the fork(2) call.) The arg argument is passed as the argument of the function fn.
When the fn(arg) function returns, the child process terminates. The integer returned by fn is the exit status for the child process. The child process may also terminate explicitly by calling exit(2) or after receiving a fatal signal.
The stack argument specifies the location of the stack used by the child process. Since the child and calling process may share memory, it is not possible for the child process to execute in the same stack as the calling process. The calling process must therefore set up memory space for the child stack and pass a pointer to this space to clone(). Stacks grow downward on all processors that run Linux (except the HP PA processors), so stack usually points to the topmost address of the memory space set up for the child stack. Note that clone() does not provide a means whereby the caller can inform the kernel of the size of the stack area.
The remaining arguments to clone() are discussed below.
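The stack handling described above can be sketched as follows. The names child_main() and run_in_clone_child(), and the 1 MiB stack size, are assumptions made for this illustration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (1024 * 1024)  /* assumed size for this sketch */

/* Example child function: its return value becomes the exit status. */
int child_main(void *arg)
{
    return *(int *)arg;
}

/* Run fn(arg) in a clone() child and return its exit status, or -1 on error. */
int run_in_clone_child(int (*fn)(void *), void *arg)
{
    int status;
    char *stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* Stacks grow downward, so pass the topmost address of the region. */
    pid_t pid = clone(fn, stack + CHILD_STACK_SIZE, SIGCHLD, arg);
    if (pid == -1) {
        munmap(stack, CHILD_STACK_SIZE);
        return -1;
    }
    pid_t reaped = waitpid(pid, &status, 0);
    munmap(stack, CHILD_STACK_SIZE);
    return (reaped == pid && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

Because CLONE_VM is not set here, the child runs on its own copy of the mapping, and the parent may unmap its copy once the child has been reaped.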
clone3()
The clone3() system call provides a superset of the functionality of the older clone() interface. It also provides a number of API improvements, including: space for additional flags bits; cleaner separation in the use of various arguments; and the ability to specify the size of the child’s stack area.
As with fork(2), clone3() returns in both the parent and the child. It returns 0 in the child process and returns the PID of the child in the parent.
The cl_args argument of clone3() is a structure of the following form:
struct clone_args {
u64 flags; /* Flags bit mask */
u64 pidfd; /* Where to store PID file descriptor
(int *) */
u64 child_tid; /* Where to store child TID,
in child's memory (pid_t *) */
u64 parent_tid; /* Where to store child TID,
in parent's memory (pid_t *) */
u64 exit_signal; /* Signal to deliver to parent on
child termination */
u64 stack; /* Pointer to lowest byte of stack */
u64 stack_size; /* Size of stack */
u64 tls; /* Location of new TLS */
u64 set_tid; /* Pointer to a pid_t array
(since Linux 5.5) */
u64 set_tid_size; /* Number of elements in set_tid
(since Linux 5.5) */
u64 cgroup; /* File descriptor for target cgroup
of child (since Linux 5.7) */
};
The size argument that is supplied to clone3() should be initialized to the size of this structure. (The existence of the size argument permits future extensions to the clone_args structure.)
The stack for the child process is specified via cl_args.stack, which points to the lowest byte of the stack area, and cl_args.stack_size, which specifies the size of the stack in bytes. In the case where the CLONE_VM flag (see below) is specified, a stack must be explicitly allocated and specified. Otherwise, these two fields can be specified as NULL and 0, which causes the child to use the same stack area as the parent (in the child’s own virtual address space).
The remaining fields in the cl_args argument are discussed below.
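A fork(2)-like use of clone3() through syscall(2) might look like the sketch below. The helper name clone3_run() and its -2 return convention are hypothetical; treating EPERM as "blocked by a sandbox filter" is an assumption of this sketch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/sched.h>    /* struct clone_args */
#include <signal.h>         /* SIGCHLD */
#include <string.h>
#include <sys/syscall.h>    /* SYS_clone3 */
#include <sys/wait.h>
#include <unistd.h>

/* Create a child with clone3() that exits with `code`, wait for it, and
 * return its exit status. Returns -2 if clone3() is unavailable (ENOSYS,
 * or EPERM, assumed here to mean a sandbox filter), -1 on other errors. */
int clone3_run(int code)
{
    struct clone_args args;
    int status;

    memset(&args, 0, sizeof(args));   /* unused fields are left as zero */
    args.exit_signal = SIGCHLD;       /* lets the parent use plain wait(2) */
    /* stack and stack_size stay 0: permitted because CLONE_VM is not set */

    long pid = syscall(SYS_clone3, &args, sizeof(args));
    if (pid == -1)
        return (errno == ENOSYS || errno == EPERM) ? -2 : -1;
    if (pid == 0)
        _exit(code);                  /* child: clone3() returned 0 */

    if (waitpid((pid_t)pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Passing sizeof(args) as the size argument lets the kernel tell which version of the structure the caller was compiled against.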
Equivalence between clone() and clone3() arguments
Unlike the older clone() interface, where arguments are passed individually, in the newer clone3() interface the arguments are packaged into the clone_args structure shown above. This structure allows for a superset of the information passed via the clone() arguments.
The following table shows the equivalence between the arguments of clone() and the fields in the clone_args argument supplied to clone3():
clone()          clone3()         Notes
                 cl_args field
flags & ~0xff    flags            For most flags; details below
parent_tid       pidfd            See CLONE_PIDFD
child_tid        child_tid        See CLONE_CHILD_SETTID
parent_tid       parent_tid       See CLONE_PARENT_SETTID
flags & 0xff     exit_signal
stack            stack
---              stack_size
tls              tls              See CLONE_SETTLS
---              set_tid          See below for details
---              set_tid_size
---              cgroup           See CLONE_INTO_CGROUP
The child termination signal
When the child process terminates, a signal may be sent to the parent. The termination signal is specified in the low byte of flags (clone()) or in cl_args.exit_signal (clone3()). If this signal is specified as anything other than SIGCHLD, then the parent process must specify the __WALL or __WCLONE options when waiting for the child with wait(2). If no signal (i.e., zero) is specified, then the parent process is not signaled when the child terminates.
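The case of a child with no termination signal can be sketched as follows. The names exit_with() and run_unsignaled_child(), and the stack size, are assumptions for this illustration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)     /* assumed size for this sketch */

/* Child function: exit with the status that arg points to. */
int exit_with(void *arg)
{
    return *(int *)arg;
}

/* Create a child whose termination signal is 0 (the parent is not
 * signaled), reap it with __WALL, and return its exit status (-1 on error). */
int run_unsignaled_child(int code)
{
    int status;
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* Low byte of flags is 0: no signal is sent when the child terminates. */
    pid_t pid = clone(exit_with, stack + STACK_SIZE, 0, &code);
    if (pid == -1) {
        munmap(stack, STACK_SIZE);
        return -1;
    }
    /* The termination signal is not SIGCHLD, so __WALL (or __WCLONE) is needed. */
    pid_t reaped = waitpid(pid, &status, __WALL);
    munmap(stack, STACK_SIZE);
    return (reaped == pid && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```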
The set_tid array
By default, the kernel chooses the next sequential PID for the new process in each of the PID namespaces where it is present. When creating a process with clone3(), the set_tid array (available since Linux 5.5) can be used to select specific PIDs for the process in some or all of the PID namespaces where it is present. If the PID of the newly created process should be set only for the current PID namespace or in the newly created PID namespace (if flags contains CLONE_NEWPID) then the first element in the set_tid array has to be the desired PID and set_tid_size needs to be 1.
If the PID of the newly created process should have a certain value in multiple PID namespaces, then the set_tid array can have multiple entries. The first entry defines the PID in the most deeply nested PID namespace and each of the following entries contains the PID in the corresponding ancestor PID namespace. The number of PID namespaces in which a PID should be set is defined by set_tid_size which cannot be larger than the number of currently nested PID namespaces.
To create a process with the following PIDs in a PID namespace hierarchy:
PID NS level   Requested PID   Notes
0              31496           Outermost PID namespace
1              42
2              7               Innermost PID namespace
Set the array to:
set_tid[0] = 7;
set_tid[1] = 42;
set_tid[2] = 31496;
set_tid_size = 3;
If only the PIDs in the two innermost PID namespaces need to be specified, set the array to:
set_tid[0] = 7;
set_tid[1] = 42;
set_tid_size = 2;
The PID in the PID namespaces outside the two innermost PID namespaces is selected the same way as any other PID is selected.
The set_tid feature requires CAP_SYS_ADMIN or (since Linux 5.9) CAP_CHECKPOINT_RESTORE in all owning user namespaces of the target PID namespaces.
Callers may only choose a PID greater than 1 in a given PID namespace if an init process (i.e., a process with PID 1) already exists in that namespace. Otherwise the PID entry for this PID namespace must be 1.
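Filling in the set_tid fields of struct clone_args could be sketched as below. The helper name request_pids() is hypothetical, and the sketch only prepares the structure; the actual clone3() call would still need the privileges described above.

```c
#define _GNU_SOURCE
#include <linux/sched.h>    /* struct clone_args */
#include <signal.h>         /* SIGCHLD */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Prepare `args` so that clone3() requests the PIDs in `pids`, where
 * pids[0] is for the innermost PID namespace and pids[n-1] for the
 * outermost one requested. The array must stay valid until the call. */
void request_pids(struct clone_args *args, const pid_t *pids, size_t n)
{
    memset(args, 0, sizeof(*args));
    args->exit_signal  = SIGCHLD;
    args->set_tid      = (uint64_t)(uintptr_t)pids;  /* pointer carried in a u64 */
    args->set_tid_size = n;
}
```

For the three-level hierarchy in the table above, the caller would pass an array holding 7, 42, and 31496, in that order, with n set to 3.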
The flags mask
Both clone() and clone3() allow a flags bit mask that modifies their behavior and allows the caller to specify what is shared between the calling process and the child process. This bit mask (the flags argument of clone() or the cl_args.flags field passed to clone3()) is referred to as the flags mask in the remainder of this page.
The flags mask is specified as a bitwise OR of zero or more of the constants listed below. Except as noted below, these flags are available (and have the same effect) in both clone() and clone3().
CLONE_CHILD_CLEARTID (since Linux 2.5.49)
Clear (zero) the child thread ID at the location pointed to by child_tid (clone()) or cl_args.child_tid (clone3()) in child memory when the child exits, and do a wakeup on the futex at that address. The address involved may be changed by the set_tid_address(2) system call. This is used by threading libraries.
CLONE_CHILD_SETTID (since Linux 2.5.49)
Store the child thread ID at the location pointed to by child_tid (clone()) or cl_args.child_tid (clone3()) in the child’s memory. The store operation completes before the clone call returns control to user space in the child process. (Note that the store operation may not have completed before the clone call returns in the parent process, which is relevant if the CLONE_VM flag is also employed.)
CLONE_CLEAR_SIGHAND (since Linux 5.5)
By default, signal dispositions in the child thread are the same as in the parent. If this flag is specified, then all signals that are handled in the parent (and not set to SIG_IGN) are reset to their default dispositions (SIG_DFL) in the child.
Specifying this flag together with CLONE_SIGHAND is nonsensical and disallowed.
CLONE_DETACHED (historical)
For a while (during the Linux 2.5 development series) there was a CLONE_DETACHED flag, which caused the parent not to receive a signal when the child terminated. Ultimately, the effect of this flag was subsumed under the CLONE_THREAD flag and by the time Linux 2.6.0 was released, this flag had no effect. Starting in Linux 2.6.2, the need to give this flag together with CLONE_THREAD disappeared.
This flag is still defined, but it is usually ignored when calling clone(). However, see the description of CLONE_PIDFD for some exceptions.
CLONE_FILES (since Linux 2.0)
If CLONE_FILES is set, the calling process and the child process share the same file descriptor table. Any file descriptor created by the calling process or by the child process is also valid in the other process. Similarly, if one of the processes closes a file descriptor, or changes its associated flags (using the fcntl(2) F_SETFD operation), the other process is also affected. If a process sharing a file descriptor table calls execve(2), its file descriptor table is duplicated (unshared).
If CLONE_FILES is not set, the child process inherits a copy of all file descriptors opened in the calling process at the time of the clone call. Subsequent operations that open or close file descriptors, or change file descriptor flags, performed by either the calling process or the child process do not affect the other process. Note, however, that the duplicated file descriptors in the child refer to the same open file descriptions as the corresponding file descriptors in the calling process, and thus share file offsets and file status flags (see open(2)).
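The difference between a shared and a copied file descriptor table can be observed with a sketch like the following. The names close_in_child() and child_close_reaches_parent(), the stack size, and the use of /dev/null are assumptions for this illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)     /* assumed size for this sketch */

/* Child function: close the descriptor whose number arg points to. */
int close_in_child(void *arg)
{
    return close(*(int *)arg) == 0 ? 0 : 1;
}

/* Returns 1 if a close() in the child also closed the parent's descriptor,
 * 0 if the parent's descriptor survived, -1 on error. */
int child_close_reaches_parent(int share_files)
{
    int result = -1;
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;
    int fd = open("/dev/null", O_RDONLY);
    if (fd == -1) {
        munmap(stack, STACK_SIZE);
        return -1;
    }
    pid_t pid = clone(close_in_child, stack + STACK_SIZE,
                      (share_files ? CLONE_FILES : 0) | SIGCHLD, &fd);
    if (pid != -1 && waitpid(pid, NULL, 0) == pid) {
        /* EBADF means fd no longer exists in the parent's table. */
        result = (fcntl(fd, F_GETFD) == -1 && errno == EBADF) ? 1 : 0;
    }
    if (result != 1)
        close(fd);                  /* still open here: clean up */
    munmap(stack, STACK_SIZE);
    return result;
}
```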
CLONE_FS (since Linux 2.0)
If CLONE_FS is set, the caller and the child process share the same filesystem information. This includes the root of the filesystem, the current working directory, and the umask. Any call to chroot(2), chdir(2), or umask(2) performed by the calling process or the child process also affects the other process.
If CLONE_FS is not set, the child process works on a copy of the filesystem information of the calling process at the time of the clone call. Calls to chroot(2), chdir(2), or umask(2) performed later by one of the processes do not affect the other process.
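The effect of CLONE_FS on the umask can be demonstrated with a sketch like this one. The names set_strict_umask() and child_umask_reaches_parent(), and the mask values 022 and 077, are assumptions chosen for the illustration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)     /* assumed size for this sketch */

/* Child function: change the umask. */
int set_strict_umask(void *arg)
{
    (void)arg;
    umask(077);
    return 0;
}

/* Returns 1 if the child's umask() call changed the parent's umask,
 * 0 if it did not, -1 on error. The caller's umask is restored. */
int child_umask_reaches_parent(int share_fs)
{
    int result = -1;
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    mode_t original = umask(022);   /* known starting value; keep the old one */
    pid_t pid = clone(set_strict_umask, stack + STACK_SIZE,
                      (share_fs ? CLONE_FS : 0) | SIGCHLD, NULL);
    if (pid != -1 && waitpid(pid, NULL, 0) == pid)
        result = (umask(original) == 077);  /* read current mask and restore */
    else
        umask(original);
    munmap(stack, STACK_SIZE);
    return result;
}
```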
CLONE_INTO_CGROUP (since Linux 5.7)
By default, a child process is placed in the same version 2 cgroup as its parent. The CLONE_INTO_CGROUP flag allows the child process to be created in a different version 2 cgroup. (Note that CLONE_INTO_CGROUP has effect only for version 2 cgroups.)
In order to place the child process in a different cgroup, the caller specifies CLONE_INTO_CGROUP in cl_args.flags and passes a file descriptor that refers to a version 2 cgroup in the cl_args.cgroup field. (This file descriptor can be obtained by opening a cgroup v2 directory using either the O_RDONLY or the O_PATH flag.) Note that all of the usual restrictions (described in cgroups(7)) on placing a process into a version 2 cgroup apply.
Among the possible use cases for CLONE_INTO_CGROUP are the following:
Spawning a process into a cgroup different from the parent’s cgroup makes it possible for a service manager to directly spawn new services into dedicated cgroups. This eliminates the accounting jitter that would be caused if the child process was first created in the same cgroup as the parent and then moved into the target cgroup. Furthermore, spawning the child process directly into a target cgroup is significantly cheaper than moving the child process into the target cgroup after it has been created.
The CLONE_INTO_CGROUP flag also allows the creation of frozen child processes by spawning them into a frozen cgroup. (See cgroups(7) for a description of the freezer controller.)
For threaded applications (or even thread implementations which make use of cgroups to limit individual threads), it is possible to establish a fixed cgroup layout before spawning each thread directly into its target cgroup.
CLONE_IO (since Linux 2.6.25)
If CLONE_IO is set, then the new process shares an I/O context with the calling process. If this flag is not set, then (as with fork(2)) the new process has its own I/O context.
The I/O context is the I/O scope of the disk scheduler (i.e., what the I/O scheduler uses to model scheduling of a process’s I/O). If processes share the same I/O context, they are treated as one by the I/O scheduler. As a consequence, they get to share disk time. For some I/O schedulers, if two processes share an I/O context, they will be allowed to interleave their disk access. If several threads are doing I/O on behalf of the same process (aio_read(3), for instance), they should employ CLONE_IO to get better I/O performance.
If the kernel is not configured with the CONFIG_BLOCK option, this flag is a no-op.
CLONE_NEWCGROUP (since Linux 4.6)
Create the process in a new cgroup namespace. If this flag is not set, then (as with fork(2)) the process is created in the same cgroup namespaces as the calling process.
For further information on cgroup namespaces, see cgroup_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
CLONE_NEWIPC (since Linux 2.6.19)
If CLONE_NEWIPC is set, then create the process in a new IPC namespace. If this flag is not set, then (as with fork(2)), the process is created in the same IPC namespace as the calling process.
For further information on IPC namespaces, see ipc_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWIPC. This flag can’t be specified in conjunction with CLONE_SYSVSEM.
CLONE_NEWNET (since Linux 2.6.24)
(The implementation of this flag was completed only by about Linux 2.6.29.)
If CLONE_NEWNET is set, then create the process in a new network namespace. If this flag is not set, then (as with fork(2)) the process is created in the same network namespace as the calling process.
For further information on network namespaces, see network_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNET.
CLONE_NEWNS (since Linux 2.4.19)
If CLONE_NEWNS is set, the cloned child is started in a new mount namespace, initialized with a copy of the namespace of the parent. If CLONE_NEWNS is not set, the child lives in the same mount namespace as the parent.
For further information on mount namespaces, see namespaces(7) and mount_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWNS. It is not permitted to specify both CLONE_NEWNS and CLONE_FS in the same clone call.
CLONE_NEWPID (since Linux 2.6.24)
If CLONE_NEWPID is set, then create the process in a new PID namespace. If this flag is not set, then (as with fork(2)) the process is created in the same PID namespace as the calling process.
For further information on PID namespaces, see namespaces(7) and pid_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWPID. This flag can’t be specified in conjunction with CLONE_THREAD.
CLONE_NEWUSER
(This flag first became meaningful for clone() in Linux 2.6.23, the current clone() semantics were merged in Linux 3.5, and the final pieces to make the user namespaces completely usable were merged in Linux 3.8.)
If CLONE_NEWUSER is set, then create the process in a new user namespace. If this flag is not set, then (as with fork(2)) the process is created in the same user namespace as the calling process.
For further information on user namespaces, see namespaces(7) and user_namespaces(7).
Before Linux 3.8, use of CLONE_NEWUSER required that the caller have three capabilities: CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID. Starting with Linux 3.8, no privileges are needed to create a user namespace.
This flag can’t be specified in conjunction with CLONE_THREAD or CLONE_PARENT. For security reasons, CLONE_NEWUSER cannot be specified in conjunction with CLONE_FS.
CLONE_NEWUTS (since Linux 2.6.19)
If CLONE_NEWUTS is set, then create the process in a new UTS namespace, whose identifiers are initialized by duplicating the identifiers from the UTS namespace of the calling process. If this flag is not set, then (as with fork(2)) the process is created in the same UTS namespace as the calling process.
For further information on UTS namespaces, see uts_namespaces(7).
Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWUTS.
CLONE_PARENT (since Linux 2.3.12)
If CLONE_PARENT is set, then the parent of the new child (as returned by getppid(2)) will be the same as that of the calling process.
If CLONE_PARENT is not set, then (as with fork(2)) the child’s parent is the calling process.
Note that it is the parent process, as returned by getppid(2), which is signaled when the child terminates, so that if CLONE_PARENT is set, then the parent of the calling process, rather than the calling process itself, is signaled.
The CLONE_PARENT flag can’t be used in clone calls by the global init process (PID 1 in the initial PID namespace) and init processes in other PID namespaces. This restriction prevents the creation of multi-rooted process trees as well as the creation of unreapable zombies in the initial PID namespace.
CLONE_PARENT_SETTID (since Linux 2.5.49)
Store the child thread ID at the location pointed to by parent_tid (clone()) or cl_args.parent_tid (clone3()) in the parent’s memory. (In Linux 2.5.32-2.5.48 there was a flag CLONE_SETTID that did this.) The store operation completes before the clone call returns control to user space.
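With the glibc wrapper, the parent_tid pointer is passed as the first optional argument after arg. A minimal sketch, in which the names do_nothing() and clone_recording_tid() and the stack size are assumptions:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define STACK_SIZE (256 * 1024)     /* assumed size for this sketch */

/* Child function: exit immediately. */
int do_nothing(void *arg)
{
    (void)arg;
    return 0;
}

/* Create a child with CLONE_PARENT_SETTID so that the kernel stores the
 * child's thread ID in *tid_out. Returns clone()'s result: the child's
 * thread ID on success, -1 on error. The child is reaped before return. */
pid_t clone_recording_tid(pid_t *tid_out)
{
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    pid_t pid = clone(do_nothing, stack + STACK_SIZE,
                      CLONE_PARENT_SETTID | SIGCHLD, NULL, tid_out);
    if (pid != -1)
        waitpid(pid, NULL, 0);      /* reap the child */
    munmap(stack, STACK_SIZE);
    return pid;
}
```

On success, the value stored through tid_out matches the value returned by clone().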
CLONE_PID (Linux 2.0 to Linux 2.5.15)
If CLONE_PID is set, the child process is created with the same process ID as the calling process. This is good for hacking the system, but otherwise of not much use. From Linux 2.3.21 onward, this flag could be specified only by the system boot process (PID 0). The flag disappeared completely from the kernel sources in Linux 2.5.16. Subsequently, the kernel silently ignored this bit if it was specified in the flags mask. Much later, the same bit was recycled for use as the CLONE_PIDFD flag.
CLONE_PIDFD (since Linux 5.2)
If this flag is specified, a PID file descriptor referring to the child process is allocated and placed at a specified location in the parent’s memory. The close-on-exec flag is set on this new file descriptor. PID file descriptors can be used for the purposes described in pidfd_open(2).
When using clone3(), the PID file descriptor is placed at the location pointed to by cl_args.pidfd.
When using clone(), the PID file descriptor is placed at the location pointed to by parent_tid. Since the parent_tid argument is used to return the PID file descriptor, CLONE_PIDFD cannot be used with CLONE_PARENT_SETTID when calling clone().
It is currently not possible to use this flag together with CLONE_THREAD. This means that the process identified by the PID file descriptor will always be a thread group leader.
If the obsolete CLONE_DETACHED flag is specified alongside CLONE_PIDFD when calling clone(), an error is returned. An error also results if CLONE_DETACHED is specified when calling clone3(). This error behavior ensures that the bit corresponding to CLONE_DETACHED can be reused for further PID file descriptor features in the future.
CLONE_PTRACE (since Linux 2.2)
If CLONE_PTRACE is specified, and the calling process is being traced, then trace the child also (see ptrace(2)).
CLONE_SETTLS (since Linux 2.5.32)
The TLS (Thread Local Storage) descriptor is set to tls.
The interpretation of tls and the resulting effect is architecture dependent. On x86, tls is interpreted as a struct user_desc * (see set_thread_area(2)). On x86-64 it is the new value to be set for the %fs base register (see the ARCH_SET_FS argument to arch_prctl(2)). On architectures with a dedicated TLS register, it is the new value of that register.
Use of this flag requires detailed knowledge and generally it should not be used except in libraries implementing threading.
CLONE_SIGHAND (since Linux 2.0)
If CLONE_SIGHAND is set, the calling process and the child process share the same table of signal handlers. If the calling process or child process calls sigaction(2) to change the behavior associated with a signal, the behavior is changed in the other process as well. However, the calling process and child processes still have distinct signal masks and sets of pending signals. So, one of them may block or unblock signals using sigprocmask(2) without affecting the other process.
If CLONE_SIGHAND is not set, the child process inherits a copy of the signal handlers of the calling process at the time of the clone call. Calls to sigaction(2) performed later by one of the processes have no effect on the other process.
Since Linux 2.6.0, the flags mask must also include CLONE_VM if CLONE_SIGHAND is specified.
CLONE_STOPPED (since Linux 2.6.0)
If CLONE_STOPPED is set, then the child is initially stopped (as though it was sent a SIGSTOP signal), and must be resumed by sending it a SIGCONT signal.
This flag was deprecated from Linux 2.6.25 onward, and was removed altogether in Linux 2.6.38. Since then, the kernel silently ignores it without error. Starting with Linux 4.6, the same bit was reused for the CLONE_NEWCGROUP flag.
CLONE_SYSVSEM (since Linux 2.5.10)
If CLONE_SYSVSEM is set, then the child and the calling process share a single list of System V semaphore adjustment (semadj) values (see semop(2)). In this case, the shared list accumulates semadj values across all processes sharing the list, and semaphore adjustments are performed only when the last process that is sharing the list terminates (or ceases sharing the list using unshare(2)). If this flag is not set, then the child has a separate semadj list that is initially empty.
CLONE_THREAD (since Linux 2.4.0)
If CLONE_THREAD is set, the child is placed in the same thread group as the calling process. To make the remainder of the discussion of CLONE_THREAD more readable, the term “thread” is used to refer to the processes within a thread group.
Thread groups were a feature added in Linux 2.4 to support the POSIX threads notion of a set of threads that share a single PID. Internally, this shared PID is the so-called thread group identifier (TGID) for the thread group. Since Linux 2.4, calls to getpid(2) return the TGID of the caller.
The threads within a group can be distinguished by their (system-wide) unique thread IDs (TID). A new thread’s TID is available as the function result returned to the caller, and a thread can obtain its own TID using gettid(2).
When a clone call is made without specifying CLONE_THREAD, then the resulting thread is placed in a new thread group whose TGID is the same as the thread’s TID. This thread is the leader of the new thread group.
A new thread created with CLONE_THREAD has the same parent process as the process that made the clone call (i.e., like CLONE_PARENT), so that calls to getppid(2) return the same value for all of the threads in a thread group. When a CLONE_THREAD thread terminates, the thread that created it is not sent a SIGCHLD (or other termination) signal; nor can the status of such a thread be obtained using wait(2). (The thread is said to be detached.)
After all of the threads in a thread group terminate, the parent process of the thread group is sent a SIGCHLD (or other termination) signal.
If any of the threads in a thread group performs an execve(2), then all threads other than the thread group leader are terminated, and the new program is executed in the thread group leader.
If one of the threads in a thread group creates a child using fork(2), then any thread in the group can wait(2) for that child.
Since Linux 2.5.35, the flags mask must also include CLONE_SIGHAND if CLONE_THREAD is specified (and note that, since Linux 2.6.0, CLONE_SIGHAND also requires CLONE_VM to be included).
Signal dispositions and actions are process-wide: if an unhandled signal is delivered to a thread, then it will affect (terminate, stop, continue, be ignored in) all members of the thread group.
Each thread has its own signal mask, as set by sigprocmask(2).
A signal may be process-directed or thread-directed. A process-directed signal is targeted at a thread group (i.e., a TGID), and is delivered to an arbitrarily selected thread from among those that are not blocking the signal. A signal may be process-directed because it was generated by the kernel for reasons other than a hardware exception, or because it was sent using kill(2) or sigqueue(3). A thread-directed signal is targeted at (i.e., delivered to) a specific thread. A signal may be thread directed because it was sent using tgkill(2) or pthread_sigqueue(3), or because the thread executed a machine language instruction that triggered a hardware exception (e.g., invalid memory access triggering SIGSEGV or a floating-point exception triggering SIGFPE).
A call to sigpending(2) returns a signal set that is the union of the pending process-directed signals and the signals that are pending for the calling thread.
If a process-directed signal is delivered to a thread group, and the thread group has installed a handler for the signal, then the handler is invoked in exactly one, arbitrarily selected member of the thread group that has not blocked the signal. If multiple threads in a group are waiting to accept the same signal using sigwaitinfo(2), the kernel will arbitrarily select one of these threads to receive the signal.
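The interplay between a blocked signal, sigpending(2), and sigwaitinfo(2) can be sketched in a single-threaded process as follows. This is a minimal illustration, not part of the original page: the helper name accept_blocked_signal() is hypothetical, and the sketch assumes that no other thread in the process has the signal unblocked.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: block sig, send it to our own thread group,
   confirm that it is pending, then accept it synchronously.
   Returns the accepted signal number, or -1 on error. */
int
accept_blocked_signal(int sig)
{
    sigset_t   set, oldset, pending;
    siginfo_t  info;
    int        got;

    sigemptyset(&set);
    sigaddset(&set, sig);

    /* Block the signal in the calling thread's signal mask. */
    if (sigprocmask(SIG_BLOCK, &set, &oldset) == -1)
        return -1;

    /* kill(2) generates a process-directed signal. */
    if (kill(getpid(), sig) == -1)
        return -1;

    /* The signal stays pending because the only thread blocks it. */
    if (sigpending(&pending) == -1 || sigismember(&pending, sig) != 1)
        return -1;

    /* Accept the pending signal without running a handler. */
    got = sigwaitinfo(&set, &info);

    /* Restore the original signal mask. */
    if (sigprocmask(SIG_SETMASK, &oldset, NULL) == -1)
        return -1;

    return got;
}
```

In a multithreaded program, pthread_sigmask(3) would normally be used in place of sigprocmask(2), and the kernel would be free to pick any thread that does not block the signal.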
CLONE_UNTRACED (since Linux 2.5.46)
If CLONE_UNTRACED is specified, then a tracing process cannot force CLONE_PTRACE on this child process.
CLONE_VFORK (since Linux 2.2)
If CLONE_VFORK is set, the execution of the calling process is suspended until the child releases its virtual memory resources via a call to execve(2) or _exit(2) (as with vfork(2)).
If CLONE_VFORK is not set, then both the calling process and the child are schedulable after the call, and an application should not rely on execution occurring in any particular order.
CLONE_VM (since Linux 2.0)
If CLONE_VM is set, the calling process and the child process run in the same memory space. In particular, memory writes performed by the calling process or by the child process are also visible in the other process. Moreover, any memory mapping or unmapping performed with mmap(2) or munmap(2) by the child or calling process also affects the other process.
If CLONE_VM is not set, the child process runs in a separate copy of the memory space of the calling process at the time of the clone call. Memory writes or file mappings/unmappings performed by one of the processes do not affect the other, as with fork(2).
If the CLONE_VM flag is specified and the CLONE_VFORK flag is not specified, then any alternate signal stack that was established by sigaltstack(2) is cleared in the child process.
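The effect of CLONE_VM on memory writes can be sketched as follows. This is an illustrative example under stated assumptions, not part of the original page: the names value_seen_by_parent(), store_value(), and DEMO_STACK_SIZE are hypothetical, the stack is assumed to grow downward, and 64 kB is assumed to be enough stack for the trivial child function.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (64 * 1024)  /* Assumed sufficient for the child */

static int                           /* Start function for the child */
store_value(void *arg)
{
    *(int *) arg = 42;               /* Write through the shared pointer */
    return 0;
}

/* Hypothetical helper: run a child that writes 42 to a variable in the
   caller, and return the value the caller sees after the child has
   terminated (or -1 on error). */
int
value_seen_by_parent(int extra_flags)
{
    int    value = 0;
    int    failed;
    char  *stack;
    pid_t  pid;

    stack = mmap(NULL, DEMO_STACK_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* SIGCHLD as the termination signal lets waitpid() reap the child. */
    pid = clone(store_value, stack + DEMO_STACK_SIZE,
                extra_flags | SIGCHLD, &value);
    failed = (pid == -1 || waitpid(pid, NULL, 0) == -1);

    munmap(stack, DEMO_STACK_SIZE);
    return failed ? -1 : value;
}
```

Under these assumptions, value_seen_by_parent(CLONE_VM) yields 42 because the child writes into the caller's memory, whereas value_seen_by_parent(0) yields 0 because the child modifies only its own copy.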
RETURN VALUE
On success, the thread ID of the child process is returned in the caller’s thread of execution. On failure, -1 is returned in the caller’s context, no child process is created, and errno is set to indicate the error.
ERRORS
EACCES (clone3() only)
CLONE_INTO_CGROUP was specified in cl_args.flags, but the restrictions (described in cgroups(7)) on placing the child process into the version 2 cgroup referred to by cl_args.cgroup are not met.
EAGAIN
Too many processes are already running; see fork(2).
EBUSY (clone3() only)
CLONE_INTO_CGROUP was specified in cl_args.flags, but the file descriptor specified in cl_args.cgroup refers to a version 2 cgroup in which a domain controller is enabled.
EEXIST (clone3() only)
One (or more) of the PIDs specified in set_tid already exists in the corresponding PID namespace.
EINVAL
Both CLONE_SIGHAND and CLONE_CLEAR_SIGHAND were specified in the flags mask.
EINVAL
CLONE_SIGHAND was specified in the flags mask, but CLONE_VM was not. (Since Linux 2.6.0.)
EINVAL
CLONE_THREAD was specified in the flags mask, but CLONE_SIGHAND was not. (Since Linux 2.5.35.)
EINVAL
CLONE_THREAD was specified in the flags mask, but the current process previously called unshare(2) with the CLONE_NEWPID flag or used setns(2) to reassociate itself with a PID namespace.
EINVAL
Both CLONE_FS and CLONE_NEWNS were specified in the flags mask.
EINVAL (since Linux 3.9)
Both CLONE_NEWUSER and CLONE_FS were specified in the flags mask.
EINVAL
Both CLONE_NEWIPC and CLONE_SYSVSEM were specified in the flags mask.
EINVAL
CLONE_NEWPID and one (or both) of CLONE_THREAD or CLONE_PARENT were specified in the flags mask.
EINVAL
CLONE_NEWUSER and CLONE_THREAD were specified in the flags mask.
EINVAL (since Linux 2.6.32)
CLONE_PARENT was specified, and the caller is an init process.
EINVAL
Returned by the glibc clone() wrapper function when fn or stack is specified as NULL.
EINVAL
CLONE_NEWIPC was specified in the flags mask, but the kernel was not configured with the CONFIG_SYSVIPC and CONFIG_IPC_NS options.
EINVAL
CLONE_NEWNET was specified in the flags mask, but the kernel was not configured with the CONFIG_NET_NS option.
EINVAL
CLONE_NEWPID was specified in the flags mask, but the kernel was not configured with the CONFIG_PID_NS option.
EINVAL
CLONE_NEWUSER was specified in the flags mask, but the kernel was not configured with the CONFIG_USER_NS option.
EINVAL
CLONE_NEWUTS was specified in the flags mask, but the kernel was not configured with the CONFIG_UTS_NS option.
EINVAL
stack is not aligned to a suitable boundary for this architecture. For example, on aarch64, stack must be a multiple of 16.
EINVAL (clone3() only)
CLONE_DETACHED was specified in the flags mask.
EINVAL (clone() only)
CLONE_PIDFD was specified together with CLONE_DETACHED in the flags mask.
EINVAL
CLONE_PIDFD was specified together with CLONE_THREAD in the flags mask.
EINVAL (clone() only)
CLONE_PIDFD was specified together with CLONE_PARENT_SETTID in the flags mask.
EINVAL (clone3() only)
set_tid_size is greater than the number of nested PID namespaces.
EINVAL (clone3() only)
One of the PIDs specified in set_tid was invalid.
EINVAL (clone3() only)
CLONE_THREAD or CLONE_PARENT was specified in the flags mask, but a signal was specified in exit_signal.
EINVAL (AArch64 only, Linux 4.6 and earlier)
stack was not aligned to a 128-bit boundary.
ENOMEM
Cannot allocate sufficient memory to allocate a task structure for the child, or to copy those parts of the caller’s context that need to be copied.
ENOSPC (since Linux 3.7)
CLONE_NEWPID was specified in the flags mask, but the limit on the nesting depth of PID namespaces would have been exceeded; see pid_namespaces(7).
ENOSPC (since Linux 4.9; beforehand EUSERS)
CLONE_NEWUSER was specified in the flags mask, and the call would cause the limit on the number of nested user namespaces to be exceeded. See user_namespaces(7).
From Linux 3.11 to Linux 4.8, the error diagnosed in this case was EUSERS.
ENOSPC (since Linux 4.9)
One of the values in the flags mask specified the creation of a new user namespace, but doing so would have caused the limit defined by the corresponding file in /proc/sys/user to be exceeded. For further details, see namespaces(7).
EOPNOTSUPP (clone3() only)
CLONE_INTO_CGROUP was specified in cl_args.flags, but the file descriptor specified in cl_args.cgroup refers to a version 2 cgroup that is in the domain invalid state.
EPERM
CLONE_NEWCGROUP, CLONE_NEWIPC, CLONE_NEWNET, CLONE_NEWNS, CLONE_NEWPID, or CLONE_NEWUTS was specified by an unprivileged process (process without CAP_SYS_ADMIN).
EPERM
CLONE_PID was specified by a process other than process 0. (This error occurs only on Linux 2.5.15 and earlier.)
EPERM
CLONE_NEWUSER was specified in the flags mask, but either the effective user ID or the effective group ID of the caller does not have a mapping in the parent namespace (see user_namespaces(7)).
EPERM (since Linux 3.9)
CLONE_NEWUSER was specified in the flags mask and the caller is in a chroot environment (i.e., the caller’s root directory does not match the root directory of the mount namespace in which it resides).
EPERM (clone3() only)
set_tid_size was greater than zero, and the caller lacks the CAP_SYS_ADMIN capability in one or more of the user namespaces that own the corresponding PID namespaces.
ERESTARTNOINTR (since Linux 2.6.17)
System call was interrupted by a signal and will be restarted. (This can be seen only during a trace.)
EUSERS (Linux 3.11 to Linux 4.8)
CLONE_NEWUSER was specified in the flags mask, and the limit on the number of nested user namespaces would be exceeded. See the discussion of the ENOSPC error above.
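Two of the EINVAL cases listed above (CLONE_SIGHAND without CLONE_VM, and CLONE_THREAD without CLONE_SIGHAND) can be observed with a small probe. This is a sketch under stated assumptions, not part of the original page: clone_errno(), do_nothing(), and PROBE_STACK_SIZE are hypothetical names, and the helper is intended only for flag combinations that are expected to fail.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define PROBE_STACK_SIZE (64 * 1024)

static int
do_nothing(void *arg)
{
    (void) arg;
    return 0;
}

/* Hypothetical helper: call clone() with the given flags and return
   the resulting errno value, or 0 if a child was created (and reaped).
   Meant only for flag combinations that are expected to be rejected. */
int
clone_errno(int flags)
{
    char  *stack;
    pid_t  pid;
    int    saved = 0;

    stack = mmap(NULL, PROBE_STACK_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return errno;

    pid = clone(do_nothing, stack + PROBE_STACK_SIZE,
                flags | SIGCHLD, NULL);
    if (pid == -1)
        saved = errno;               /* On failure, no child was created */
    else
        waitpid(pid, NULL, 0);

    munmap(stack, PROBE_STACK_SIZE);
    return saved;
}
```

On a kernel that enforces the rules described above, clone_errno(CLONE_SIGHAND) and clone_errno(CLONE_THREAD) both return EINVAL.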
VERSIONS
The glibc clone() wrapper function makes some changes in the memory pointed to by stack (changes required to set the stack up correctly for the child) before invoking the clone() system call. So, in cases where clone() is used to recursively create children, do not use the buffer employed for the parent’s stack as the stack of the child.
On i386, clone() should not be called through vsyscall, but directly through int $0x80.
C library/kernel differences
The raw clone() system call corresponds more closely to fork(2) in that execution in the child continues from the point of the call. As such, the fn and arg arguments of the clone() wrapper function are omitted.
In contrast to the glibc wrapper, the raw clone() system call accepts NULL as a stack argument (and clone3() likewise allows cl_args.stack to be NULL). In this case, the child uses a duplicate of the parent’s stack. (Copy-on-write semantics ensure that the child gets separate copies of stack pages when either process modifies the stack.) In this case, for correct operation, the CLONE_VM option should not be specified. (If the child shares the parent’s memory because of the use of the CLONE_VM flag, then no copy-on-write duplication occurs and chaos is likely to result.)
The order of the arguments also differs in the raw system call, and there are variations in the arguments across architectures, as detailed in the following paragraphs.
The raw system call interface on x86-64 and some other architectures (including sh, tile, and alpha) is:
long clone(unsigned long flags, void *stack,
int *parent_tid, int *child_tid,
unsigned long tls);
On x86-32, and several other common architectures (including score, ARM, ARM 64, PA-RISC, arc, Power PC, xtensa, and MIPS), the order of the last two arguments is reversed:
long clone(unsigned long flags, void *stack,
int *parent_tid, unsigned long tls,
int *child_tid);
On the cris and s390 architectures, the order of the first two arguments is reversed:
long clone(void *stack, unsigned long flags,
int *parent_tid, int *child_tid,
unsigned long tls);
On the microblaze architecture, an additional argument is supplied:
long clone(unsigned long flags, void *stack,
int stack_size, /* Size of stack */
int *parent_tid, int *child_tid,
unsigned long tls);
blackfin, m68k, and sparc
The argument-passing conventions on blackfin, m68k, and sparc are different from the descriptions above. For details, see the kernel (and glibc) source.
ia64
On ia64, a different interface is used:
int __clone2(int (*fn)(void *),
void *stack_base, size_t stack_size,
int flags, void *arg, ...
/* pid_t *parent_tid, struct user_desc *tls,
pid_t *child_tid */ );
The prototype shown above is for the glibc wrapper function; for the system call itself, the prototype can be described as follows (it is identical to the clone() prototype on microblaze):
long clone2(unsigned long flags, void *stack_base,
int stack_size, /* Size of stack */
int *parent_tid, int *child_tid,
unsigned long tls);
__clone2() operates in the same way as clone(), except that stack_base points to the lowest address of the child’s stack area, and stack_size specifies the size of the stack pointed to by stack_base.
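The fork-like behavior of the raw system call with a NULL stack can be sketched using syscall(2). This example is an illustration, not part of the original page: raw_clone_exit_status() is a hypothetical name, and the argument order assumes an architecture such as x86-64 where flags comes first (it would not be correct on cris or s390, and microblaze takes an extra argument).

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: create a child with the raw clone() system call
   and return the child's exit status, or -1 on error. */
int
raw_clone_exit_status(void)
{
    long  pid;
    int   wstatus;

    /* A NULL stack makes the child run on a copy-on-write duplicate of
       the caller's stack, so CLONE_VM must not be specified. The last
       three arguments are unused here, so their order does not matter. */
    pid = syscall(SYS_clone, (unsigned long) SIGCHLD, NULL, NULL, NULL, 0UL);
    if (pid == -1)
        return -1;

    if (pid == 0)
        _exit(7);                    /* Child continues from the call */

    if (waitpid((pid_t) pid, &wstatus, 0) == -1 || !WIFEXITED(wstatus))
        return -1;

    return WEXITSTATUS(wstatus);
}
```

The child calls _exit(2) immediately because the C library is not informed that a new process was created by this route.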
STANDARDS
Linux.
HISTORY
clone3()
Linux 5.3.
Linux 2.4 and earlier
In the Linux 2.4.x series, CLONE_THREAD generally does not make the parent of the new thread the same as the parent of the calling process. However, from Linux 2.4.7 to Linux 2.4.18 the CLONE_THREAD flag implied the CLONE_PARENT flag (as in Linux 2.6.0 and later).
In Linux 2.4 and earlier, clone() does not take arguments parent_tid, tls, and child_tid.
NOTES
One use of these system calls is to implement threads: multiple flows of control in a program that run concurrently in a shared address space.
The kcmp(2) system call can be used to test whether two processes share various resources such as a file descriptor table, System V semaphore undo operations, or a virtual address space.
Handlers registered using pthread_atfork(3) are not executed during a clone call.
BUGS
GNU C library versions 2.3.4 up to and including 2.24 contained a wrapper function for getpid(2) that performed caching of PIDs. This caching relied on support in the glibc wrapper for clone(), but limitations in the implementation meant that the cache was not up to date in some circumstances. In particular, if a signal was delivered to the child immediately after the clone() call, then a call to getpid(2) in a handler for the signal could return the PID of the calling process (“the parent”), if the clone wrapper had not yet had a chance to update the PID cache in the child. (This discussion ignores the case where the child was created using CLONE_THREAD, when getpid(2) should return the same value in the child and in the process that called clone(), since the caller and the child are in the same thread group. The stale-cache problem also does not occur if the flags argument includes CLONE_VM.) To get the truth, it was sometimes necessary to use code such as the following:
#include <syscall.h>
pid_t mypid;
mypid = syscall(SYS_getpid);
Because of the stale-cache problem, as well as other problems noted in getpid(2), the PID caching feature was removed in glibc 2.25.
EXAMPLES
The following program demonstrates the use of clone() to create a child process that executes in a separate UTS namespace. The child changes the hostname in its UTS namespace. Both parent and child then display the system hostname, making it possible to see that the hostname differs in the UTS namespaces of the parent and child. For an example of the use of this program, see setns(2).
Within the sample program, we allocate the memory that is to be used for the child’s stack using mmap(2) rather than malloc(3) for the following reasons:
mmap(2) allocates a block of memory that starts on a page boundary and is a multiple of the page size. This is useful if we want to establish a guard page (a page with protection PROT_NONE) at the end of the stack using mprotect(2).
We can specify the MAP_STACK flag to request a mapping that is suitable for a stack. For the moment, this flag is a no-op on Linux, but it exists and has effect on some other systems, so we should include it for portability.
Program source
#define _GNU_SOURCE
#include <err.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>
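static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct utsname uts;

    /* Change hostname in UTS namespace of child. */

    if (sethostname(arg, strlen(arg)) == -1)
        err(EXIT_FAILURE, "sethostname");

    /* Retrieve and display hostname. */

    if (uname(&uts) == -1)
        err(EXIT_FAILURE, "uname");
    printf("uts.nodename in child:  %s\n", uts.nodename);

    /* Keep the namespace open for a while, by sleeping.
       This allows some experimentation; for example, another
       process might join the namespace. */

    sleep(200);

    return 0;           /* Child terminates now */
}

#define STACK_SIZE (1024 * 1024)    /* Stack size for cloned child */

int
main(int argc, char *argv[])
{
    char            *stack;         /* Start of stack buffer */
    char            *stackTop;      /* End of stack buffer */
    pid_t           pid;
    struct utsname  uts;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Allocate memory to be used for the stack of the child. */

    stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        err(EXIT_FAILURE, "mmap");

    stackTop = stack + STACK_SIZE;  /* Assume stack grows downward */

    /* Create child that has its own UTS namespace;
       child commences execution in childFunc(). */

    pid = clone(childFunc, stackTop, CLONE_NEWUTS | SIGCHLD, argv[1]);
    if (pid == -1)
        err(EXIT_FAILURE, "clone");
    printf("clone() returned %jd\n", (intmax_t) pid);

    /* Parent falls through to here. */

    sleep(1);           /* Give child time to change its hostname */

    /* Display hostname in parent's UTS namespace. This will be
       different from hostname in child's UTS namespace. */

    if (uname(&uts) == -1)
        err(EXIT_FAILURE, "uname");
    printf("uts.nodename in parent: %s\n", uts.nodename);

    if (waitpid(pid, NULL, 0) == -1)    /* Wait for child */
        err(EXIT_FAILURE, "waitpid");
    printf("child has terminated\n");

    exit(EXIT_SUCCESS);
}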
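The guard page mentioned in the discussion of mmap(2) above can be sketched as follows. This is a minimal illustration, not part of the original program: alloc_guarded_stack() is a hypothetical name, and the sketch assumes that the stack grows downward, so that the far end of the stack is the page at the lowest address.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map size bytes of stack plus one guard page
   below it. Returns the lowest usable stack address, or NULL on
   error. The top of the stack is the returned pointer plus size. */
char *
alloc_guarded_stack(size_t size)
{
    long   page = sysconf(_SC_PAGESIZE);
    char  *base;

    if (page <= 0)
        return NULL;

    base = mmap(NULL, size + page, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Make the lowest page inaccessible, so that running off the end
       of the stack faults instead of silently corrupting memory. */
    if (mprotect(base, page, PROT_NONE) == -1) {
        munmap(base, size + page);
        return NULL;
    }

    return base + page;
}
```

A caller would pass the returned pointer plus size as the stack argument of clone(), and later release the mapping with munmap(2), starting one page below the returned pointer.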
SEE ALSO
fork(2), futex(2), getpid(2), gettid(2), kcmp(2), mmap(2), pidfd_open(2), set_thread_area(2), set_tid_address(2), setns(2), tkill(2), unshare(2), wait(2), capabilities(7), namespaces(7), pthreads(7)
431 - Linux cli command setresuid
NAME
setresuid - set real, effective, and saved user or group ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
int setresuid(uid_t ruid, uid_t euid, uid_t suid);
int setresgid(gid_t rgid, gid_t egid, gid_t sgid);
DESCRIPTION
setresuid() sets the real user ID, the effective user ID, and the saved set-user-ID of the calling process.
An unprivileged process may change its real UID, effective UID, and saved set-user-ID, each to one of: the current real UID, the current effective UID, or the current saved set-user-ID.
A privileged process (on Linux, one having the CAP_SETUID capability) may set its real UID, effective UID, and saved set-user-ID to arbitrary values.
If one of the arguments equals -1, the corresponding value is not changed.
Regardless of what changes are made to the real UID, effective UID, and saved set-user-ID, the filesystem UID is always set to the same value as the (possibly new) effective UID.
Completely analogously, setresgid() sets the real GID, effective GID, and saved set-group-ID of the calling process (and always modifies the filesystem GID to be the same as the effective GID), with the same restrictions for unprivileged processes.
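The use of -1 to leave an ID unchanged can be sketched as follows. This example is illustrative rather than part of the original page; set_effective_to_real() is a hypothetical name. Setting the effective UID to the current real UID is one of the changes that an unprivileged process is permitted to make.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make the effective UID equal to the real UID,
   leaving the real UID and the saved set-user-ID untouched.
   Returns 0 on success, -1 on error. */
int
set_effective_to_real(void)
{
    uid_t  ruid, euid, suid, old_ruid, old_suid;

    if (getresuid(&old_ruid, &euid, &old_suid) == -1)
        return -1;

    /* -1 means "do not change this ID". */
    if (setresuid((uid_t) -1, old_ruid, (uid_t) -1) == -1)
        return -1;

    /* Confirm that only the effective UID could have changed. */
    if (getresuid(&ruid, &euid, &suid) == -1)
        return -1;

    return (ruid == old_ruid && euid == old_ruid && suid == old_suid)
           ? 0 : -1;
}
```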
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where setresuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from setresuid().
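A privilege-dropping routine that checks every return value might look like the following sketch. It is an illustration under stated assumptions, not a complete recipe: drop_privileges() is a hypothetical name, and supplementary groups (see setgroups(2)) are deliberately not handled here.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set all three user IDs to uid and all three
   group IDs to gid, then verify the result.
   Returns 0 on success, -1 on any failure. */
int
drop_privileges(uid_t uid, gid_t gid)
{
    uid_t  ruid, euid, suid;
    gid_t  rgid, egid, sgid;

    /* Change the group IDs first: once the user IDs have been
       dropped, the process may no longer be allowed to do so. */
    if (setresgid(gid, gid, gid) == -1)
        return -1;

    if (setresuid(uid, uid, uid) == -1)
        return -1;

    /* Never assume success; read the IDs back and compare. */
    if (getresuid(&ruid, &euid, &suid) == -1 ||
        getresgid(&rgid, &egid, &sgid) == -1)
        return -1;

    if (ruid != uid || euid != uid || suid != uid ||
        rgid != gid || egid != gid || sgid != gid)
        return -1;

    return 0;
}
```

A caller that receives -1 should treat it as fatal rather than continue with unknown credentials.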
ERRORS
EAGAIN
The call would change the caller’s real UID (i.e., ruid does not match the caller’s real UID), but there was a temporary failure allocating the necessary kernel data structures.
EAGAIN
ruid does not match the caller’s real UID and this call would bring the number of processes belonging to the real user ID ruid over the caller’s RLIMIT_NPROC resource limit. Since Linux 3.1, this error case no longer occurs (but robust applications should check for this error); see the description of EAGAIN in execve(2).
EINVAL
One or more of the target user or group IDs is not valid in this user namespace.
EPERM
The calling process is not privileged (did not have the necessary capability in its user namespace) and tried to change the IDs to values that are not permitted. For setresuid(), the necessary capability is CAP_SETUID; for setresgid(), it is CAP_SETGID.
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including those for setresuid() and setresgid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
None.
HISTORY
Linux 2.1.44, glibc 2.3.2. HP-UX, FreeBSD.
The original Linux setresuid() and setresgid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added setresuid32() and setresgid32(), supporting 32-bit IDs. The glibc setresuid() and setresgid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getresuid(2), getuid(2), setfsgid(2), setfsuid(2), setreuid(2), setuid(2), capabilities(7), credentials(7), user_namespaces(7)
432 - Linux cli command chown32
NAME
chown32 - change ownership of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int chown(const char *pathname, uid_t owner, gid_t group);
int fchown(int fd, uid_t owner, gid_t group);
int lchown(const char *pathname, uid_t owner, gid_t group);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int fchownat(int dirfd, const char *pathname,
uid_t owner, gid_t group, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchown(), lchown():
/* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| _XOPEN_SOURCE >= 500
|| /* glibc <= 2.19: */ _BSD_SOURCE
fchownat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These system calls change the owner and group of a file. The chown(), fchown(), and lchown() system calls differ only in how the file is specified:
chown() changes the ownership of the file specified by pathname, which is dereferenced if it is a symbolic link.
fchown() changes the ownership of the file referred to by the open file descriptor fd.
lchown() is like chown(), but does not dereference symbolic links.
Only a privileged process (Linux: one with the CAP_CHOWN capability) may change the owner of a file. The owner of a file may change the group of the file to any group of which that owner is a member. A privileged process (Linux: with CAP_CHOWN) may change the group arbitrarily.
If the owner or group is specified as -1, then that ID is not changed.
When the owner or group of an executable file is changed by an unprivileged user, the S_ISUID and S_ISGID mode bits are cleared. POSIX does not specify whether this also should happen when root does the chown(); the Linux behavior depends on the kernel version, and since Linux 2.2.13, root is treated like other users. In case of a non-group-executable file (i.e., one for which the S_IXGRP bit is not set) the S_ISGID bit indicates mandatory locking, and is not cleared by a chown().
When the owner or group of an executable file is changed (by any user), all capability sets for the file are cleared.
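The rule that an ID specified as -1 is left unchanged can be sketched with a temporary file. This example is illustrative, not part of the original page: demo_chown_group_only() is a hypothetical name, /tmp is assumed to be writable, and the caller's effective GID is assumed to be a group that the caller may assign to its own files.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: create a temporary file, set its group to the
   caller's effective GID, and leave the owner unchanged.
   Returns 0 on success, -1 on error. */
int
demo_chown_group_only(void)
{
    char         path[] = "/tmp/chown-demo-XXXXXX";
    struct stat  before, after;
    int          fd, ret = -1;

    fd = mkstemp(path);
    if (fd == -1)
        return -1;

    /* Owner is given as -1, so only the group is changed. */
    if (fstat(fd, &before) == 0 &&
        chown(path, (uid_t) -1, getegid()) == 0 &&
        fstat(fd, &after) == 0 &&
        after.st_uid == before.st_uid &&
        after.st_gid == getegid())
        ret = 0;

    close(fd);
    unlink(path);
    return ret;
}
```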
fchownat()
The fchownat() system call operates in exactly the same way as chown(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chown() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chown()).
If pathname is absolute, then dirfd is ignored.
The flags argument is a bit mask created by ORing together zero or more of the following values:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead operate on the link itself, like lchown(). (By default, fchownat() dereferences symbolic links, like chown().)
See openat(2) for an explanation of the need for fchownat().
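The difference between the default behavior of fchownat() and AT_SYMLINK_NOFOLLOW can be sketched with a dangling symbolic link. This example is illustrative, not part of the original page: demo_fchownat_nofollow() and the file names are hypothetical, /tmp is assumed to be writable, and the filesystem is assumed to allow attribute changes on symbolic links.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if following a dangling link fails
   with ENOENT while operating on the link itself succeeds. */
int
demo_fchownat_nofollow(void)
{
    char  dir[] = "/tmp/fchownat-demo-XXXXXX";
    int   dirfd, follow, follow_errno, nofollow, ret = -1;

    if (mkdtemp(dir) == NULL)
        return -1;

    dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1) {
        rmdir(dir);
        return -1;
    }

    /* "link" refers to a target that does not exist. */
    if (symlinkat("missing-target", dirfd, "link") == 0) {
        /* Default: the link is dereferenced, like chown(). */
        follow = fchownat(dirfd, "link", (uid_t) -1, (gid_t) -1, 0);
        follow_errno = errno;

        /* With AT_SYMLINK_NOFOLLOW: operate on the link, like lchown(). */
        nofollow = fchownat(dirfd, "link", (uid_t) -1, (gid_t) -1,
                            AT_SYMLINK_NOFOLLOW);

        if (follow == -1 && follow_errno == ENOENT && nofollow == 0)
            ret = 0;

        unlinkat(dirfd, "link", 0);
    }

    close(dirfd);
    rmdir(dir);
    return ret;
}
```

Both pathnames are relative, so they are resolved against dirfd rather than against the current working directory.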
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, errors other than those listed below can be returned.
The more general errors for chown() are listed below.
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EBADF
(fchown()) fd is not a valid open file descriptor.
EBADF
(fchownat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
(fchownat()) Invalid flag specified in flags.
EIO
(fchown()) A low-level I/O error occurred while modifying the inode.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
The file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(fchownat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
The calling process did not have the required permissions (see above) to change owner and/or group.
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The named file resides on a read-only filesystem.
VERSIONS
The 4.4BSD version can be used only by the superuser (that is, ordinary users cannot give away files).
STANDARDS
POSIX.1-2008.
HISTORY
chown()
fchown()
lchown()
4.4BSD, SVr4, POSIX.1-2001.
fchownat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
NOTES
Ownership of new files
When a new file is created (by, for example, open(2) or mkdir(2)), its owner is made the same as the filesystem user ID of the creating process. The group of the file depends on a range of factors, including the type of filesystem, the options used to mount the filesystem, and whether or not the set-group-ID mode bit is enabled on the parent directory. If the filesystem supports the -o grpid (or, synonymously -o bsdgroups) and -o nogrpid (or, synonymously -o sysvgroups) mount(8) options, then the rules are as follows:
If the filesystem is mounted with -o grpid, then the group of a new file is made the same as that of the parent directory.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is disabled on the parent directory, then the group of a new file is made the same as the process’s filesystem GID.
If the filesystem is mounted with -o nogrpid and the set-group-ID bit is enabled on the parent directory, then the group of a new file is made the same as that of the parent directory.
As of Linux 4.12, the -o grpid and -o nogrpid mount options are supported by ext2, ext3, ext4, and XFS. Filesystems that don’t support these mount options follow the -o nogrpid rules.
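The ownership rule for new files can be checked with a short sketch. This is an illustration, not part of the original page: new_file_owner_is_euid() is a hypothetical name, /tmp is assumed to be a writable local filesystem, and the filesystem UID is assumed to equal the effective UID (the default unless setfsuid(2) has been used).

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: create a file and compare its owner with the
   caller's effective UID. Returns 1 if they match, 0 if they differ,
   and -1 on error. */
int
new_file_owner_is_euid(void)
{
    char         path[] = "/tmp/owner-demo-XXXXXX";
    struct stat  st;
    int          fd, ret = -1;

    fd = mkstemp(path);
    if (fd == -1)
        return -1;

    if (fstat(fd, &st) == 0)
        ret = (st.st_uid == geteuid()) ? 1 : 0;

    close(fd);
    unlink(path);
    return ret;
}
```

The group of the new file is intentionally not checked, since it depends on the mount options and on the set-group-ID bit of the parent directory, as described above.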
glibc notes
On older kernels where fchownat() is unavailable, the glibc wrapper function falls back to the use of chown() and lchown(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
NFS
The chown() semantics are deliberately violated on NFS filesystems that have UID mapping enabled. Additionally, the semantics of all system calls that access the file contents are violated, because chown() may cause immediate access revocation on already open files. Client-side caching may lead to a delay between the time when ownership has been changed to allow access for a user and the time when the file can actually be accessed by the user on other clients.
Historical details
The original Linux chown(), fchown(), and lchown() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added chown32(), fchown32(), and lchown32(), supporting 32-bit IDs. The glibc chown(), fchown(), and lchown() wrapper functions transparently deal with the variations across kernel versions.
Before Linux 2.1.81 (except 2.1.46), chown() did not follow symbolic links. Since Linux 2.1.81, chown() does follow symbolic links, and there is a new system call lchown() that does not follow symbolic links. Since Linux 2.1.86, this new call (that has the same semantics as the old chown()) has got the same syscall number, and chown() got the newly introduced number.
EXAMPLES
The following program changes the ownership of the file named in its second command-line argument to the value specified in its first command-line argument. The new owner can be specified either as a numeric user ID, or as a username (which is converted to a user ID by using getpwnam(3) to perform a lookup in the system password file).
Program source
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    char           *endptr;
    uid_t          uid;
    struct passwd  *pwd;

    if (argc != 3 || argv[1][0] == '\0') {
        fprintf(stderr, "%s <owner> <file>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    uid = strtol(argv[1], &endptr, 10);  /* Allow a numeric string */

    if (*endptr != '\0') {         /* Was not pure numeric string */
        pwd = getpwnam(argv[1]);   /* Try getting UID for username */
        if (pwd == NULL) {
            perror("getpwnam");
            exit(EXIT_FAILURE);
        }

        uid = pwd->pw_uid;
    }

    if (chown(argv[2], uid, -1) == -1) {
        perror("chown");
        exit(EXIT_FAILURE);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
chgrp(1), chown(1), chmod(2), flock(2), path_resolution(7), symlink(7)
433 - Linux cli command send
NAME
send - send a message on a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
ssize_t send(int sockfd, const void buf[.len], size_t len, int flags);
ssize_t sendto(int sockfd, const void buf[.len], size_t len, int flags,
               const struct sockaddr *dest_addr, socklen_t addrlen);
ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);
DESCRIPTION
The system calls send(), sendto(), and sendmsg() are used to transmit a message to another socket.
The send() call may be used only when the socket is in a connected state (so that the intended recipient is known). The only difference between send() and write(2) is the presence of flags. With a zero flags argument, send() is equivalent to write(2). Also, the following call
send(sockfd, buf, len, flags);
is equivalent to
sendto(sockfd, buf, len, flags, NULL, 0);
The argument sockfd is the file descriptor of the sending socket.
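The equivalences described above can be sketched on a connected pair of UNIX domain stream sockets. This example is illustrative, not part of the original page; demo_send_equivalents() is a hypothetical name.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: transmit five bytes each with send(), sendto()
   with a NULL address, and write(), then count what the peer receives.
   Returns the total number of bytes received, or -1 on error. */
long
demo_send_equivalents(void)
{
    int      sv[2];
    char     buf[64];
    ssize_t  n;
    long     total = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    if (send(sv[0], "hello", 5, 0) != 5 ||
        sendto(sv[0], "world", 5, 0, NULL, 0) != 5 ||
        write(sv[0], "again", 5) != 5)
        total = -1;

    close(sv[0]);                   /* Peer will see end-of-file */

    /* Drain the receiving end until end-of-file. */
    while (total >= 0 && (n = recv(sv[1], buf, sizeof(buf), 0)) != 0) {
        if (n == -1)
            total = -1;
        else
            total += n;
    }

    close(sv[1]);
    return total;
}
```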
If sendto() is used on a connection-mode (SOCK_STREAM, SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are ignored (and the error EISCONN may be returned when they are not NULL and 0), and the error ENOTCONN is returned when the socket was not actually connected. Otherwise, the address of the target is given by dest_addr with addrlen specifying its size. For sendmsg(), the address of the target is given by msg.msg_name, with msg.msg_namelen specifying its size.
For send() and sendto(), the message is found in buf and has length len. For sendmsg(), the message is pointed to by the elements of the array msg.msg_iov. The sendmsg() call also allows sending ancillary data (also known as control information).
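Sending a message gathered from several buffers with sendmsg() can be sketched as follows. This example is illustrative, not part of the original page: demo_sendmsg_gather() is a hypothetical name, and a UNIX domain datagram socket pair is used so that the two buffers arrive as one message.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical demo: send "hello " and "world" as a single datagram
   using two msg_iov elements. Returns 0 if the peer receives
   "hello world" in one message, -1 otherwise. */
int
demo_sendmsg_gather(void)
{
    int            sv[2], ret = -1;
    char           part1[] = "hello ";
    char           part2[] = "world";
    char           buf[64];
    struct iovec   iov[2];
    struct msghdr  msg;
    ssize_t        sent, n;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    iov[0].iov_base = part1;
    iov[0].iov_len  = strlen(part1);
    iov[1].iov_base = part2;
    iov[1].iov_len  = strlen(part2);

    /* No destination address (connected pair) and no ancillary data. */
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov    = iov;
    msg.msg_iovlen = 2;

    sent = sendmsg(sv[0], &msg, 0);
    n = recv(sv[1], buf, sizeof(buf), 0);

    if (sent == 11 && n == 11 && memcmp(buf, "hello world", 11) == 0)
        ret = 0;

    close(sv[0]);
    close(sv[1]);
    return ret;
}
```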
If the message is too long to pass atomically through the underlying protocol, the error EMSGSIZE is returned, and the message is not transmitted.
No indication of failure to deliver is implicit in a send(). Locally detected errors are indicated by a return value of -1.
When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in nonblocking I/O mode. In nonblocking mode it would fail with the error EAGAIN or EWOULDBLOCK in this case. The select(2) call may be used to determine when it is possible to send more data.
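The nonblocking failure mode can be sketched by filling the buffers of a socket pair that nobody reads from. This example is illustrative, not part of the original page: bytes_queued_before_eagain() is a hypothetical name, the per-call MSG_DONTWAIT flag is used instead of placing the socket in nonblocking mode, and the 64 MB cap is an arbitrary safety limit.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: queue data until send() would block.
   Returns the number of bytes queued before EAGAIN/EWOULDBLOCK,
   or -1 on any other outcome. */
long
bytes_queued_before_eagain(void)
{
    int      sv[2], saved_errno = 0;
    char     chunk[4096];
    long     total = 0;
    ssize_t  n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    memset(chunk, 'x', sizeof(chunk));

    /* Nobody reads from sv[1], so the buffers eventually fill up. */
    while (total < (64L << 20)) {
        n = send(sv[0], chunk, sizeof(chunk), MSG_DONTWAIT);
        if (n == -1) {
            saved_errno = errno;
            break;
        }
        total += n;
    }

    close(sv[0]);
    close(sv[1]);

    if (saved_errno != EAGAIN && saved_errno != EWOULDBLOCK)
        return -1;
    return total;
}
```

A real program would typically wait with select(2) or poll(2) for the socket to become writable and then retry, rather than give up.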
The flags argument
The flags argument is the bitwise OR of zero or more of the following flags.
MSG_CONFIRM (since Linux 2.3.15)
Tell the link layer that forward progress happened: you got a successful reply from the other side. If the link layer doesn’t get this it will regularly reprobe the neighbor (e.g., via a unicast ARP). Valid only on SOCK_DGRAM and SOCK_RAW sockets and currently implemented only for IPv4 and IPv6. See arp(7) for details.
MSG_DONTROUTE
Don’t use a gateway to send out the packet, send to hosts only on directly connected networks. This is usually used only by diagnostic or routing programs. This is defined only for protocol families that route; packet sockets don’t.
MSG_DONTWAIT (since Linux 2.2)
Enables nonblocking operation; if the operation would block, EAGAIN or EWOULDBLOCK is returned. This provides similar behavior to setting the O_NONBLOCK flag (via the fcntl(2) F_SETFL operation), but differs in that MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a setting on the open file description (see open(2)), which will affect all threads in the calling process as well as other processes that hold file descriptors referring to the same open file description.
MSG_EOR (since Linux 2.2)
Terminates a record (when this notion is supported, as for sockets of type SOCK_SEQPACKET).
MSG_MORE (since Linux 2.4.4)
The caller has more data to send. This flag is used with TCP sockets to obtain the same effect as the TCP_CORK socket option (see tcp(7)), with the difference that this flag can be set on a per-call basis.
Since Linux 2.6, this flag is also supported for UDP sockets, and informs the kernel to package all of the data sent in calls with this flag set into a single datagram which is transmitted only when a call is performed that does not specify this flag. (See also the UDP_CORK socket option described in udp(7).)
MSG_NOSIGNAL (since Linux 2.2)
Don’t generate a SIGPIPE signal if the peer on a stream-oriented socket has closed the connection. The EPIPE error is still returned. This provides similar behavior to using sigaction(2) to ignore SIGPIPE, but, whereas MSG_NOSIGNAL is a per-call feature, ignoring SIGPIPE sets a process attribute that affects all threads in the process.
MSG_OOB
Sends out-of-band data on sockets that support this notion (e.g., of type SOCK_STREAM); the underlying protocol must also support out-of-band data.
MSG_FASTOPEN (since Linux 3.7)
Attempts TCP Fast Open (RFC7413) and sends data in the SYN like a combination of connect(2) and write(2), by performing an implicit connect(2) operation. It blocks until the data is buffered and the handshake has completed. For a non-blocking socket, it returns the number of bytes buffered and sent in the SYN packet. If the cookie is not available locally, it returns EINPROGRESS, and sends a SYN with a Fast Open cookie request automatically. The caller needs to write the data again when the socket is connected. On errors, it sets the same errno as connect(2) if the handshake fails. This flag requires enabling TCP Fast Open client support on sysctl net.ipv4.tcp_fastopen.
Refer to TCP_FASTOPEN_CONNECT socket option in tcp(7) for an alternative approach.
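The per-call nature of these flags can be illustrated with a minimal sketch. It assumes a connected stream socket; the helper names send_all_nosignal() and demo_send_flags() are hypothetical and not part of any library. The helper passes MSG_NOSIGNAL so that a closed peer shows up as the EPIPE error rather than a SIGPIPE signal, and it retries when a signal interrupts the call before any data is sent.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send the whole buffer on a connected socket.
 * Returns 0 on success, or -1 with errno set by send(). */
int send_all_nosignal(int sockfd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        /* MSG_NOSIGNAL: a closed peer yields EPIPE, not SIGPIPE. */
        ssize_t n = send(sockfd, p, len, MSG_NOSIGNAL);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before any data was sent */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical self-check over a UNIX domain socket pair.
 * Returns 0 if the data arrives and a send to a closed peer fails with EPIPE. */
int demo_send_flags(void)
{
    int sv[2];
    char in[5];
    int rc, saved_errno;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    /* Normal case: the bytes arrive at the peer. */
    if (send_all_nosignal(sv[0], "hello", 5) == -1 ||
        recv(sv[1], in, sizeof(in), MSG_WAITALL) != 5 ||
        memcmp(in, "hello", 5) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Peer closed: expect -1 with EPIPE and no SIGPIPE. */
    close(sv[1]);
    rc = send_all_nosignal(sv[0], "x", 1);
    saved_errno = errno;
    close(sv[0]);
    return (rc == -1 && saved_errno == EPIPE) ? 0 : -1;
}
```

Because the flag is given per call, other code in the same process that relies on the default SIGPIPE behavior is unaffected.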
sendmsg()
The definition of the msghdr structure employed by sendmsg() is as follows:
struct msghdr {
    void         *msg_name;       /* Optional address */
    socklen_t     msg_namelen;    /* Size of address */
    struct iovec *msg_iov;        /* Scatter/gather array */
    size_t        msg_iovlen;     /* # elements in msg_iov */
    void         *msg_control;    /* Ancillary data, see below */
    size_t        msg_controllen; /* Ancillary data buffer len */
    int           msg_flags;      /* Flags (unused) */
};
The msg_name field is used on an unconnected socket to specify the target address for a datagram. It points to a buffer containing the address; the msg_namelen field should be set to the size of the address. For a connected socket, these fields should be specified as NULL and 0, respectively.
The msg_iov and msg_iovlen fields specify scatter-gather locations, as for writev(2).
You may send control information (ancillary data) using the msg_control and msg_controllen members. The maximum control buffer length the kernel can process is limited per socket by the value in /proc/sys/net/core/optmem_max; see socket(7). For further information on the use of ancillary data in various socket domains, see unix(7) and ip(7).
The msg_flags field is ignored.
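As a rough sketch of how these fields fit together, the following gathers two separate buffers into one message on a connected socket. The helper names send_two_parts() and demo_sendmsg() are hypothetical; no ancillary data is sent, so msg_control stays NULL.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send a header and a body as a single message. */
ssize_t send_two_parts(int sockfd, const char *hdr, const char *body)
{
    struct iovec iov[2];
    struct msghdr msg;

    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len = strlen(hdr);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);

    /* Zeroing gives msg_name = NULL and msg_namelen = 0 (connected
     * socket) and leaves msg_control empty (no ancillary data). */
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    return sendmsg(sockfd, &msg, 0);
}

/* Hypothetical self-check: the two parts arrive as one datagram. */
int demo_sendmsg(void)
{
    int sv[2];
    char in[32];
    int ok = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return -1;

    if (send_two_parts(sv[0], "key=", "value") == 9) {
        ssize_t n = recv(sv[1], in, sizeof(in), 0);
        if (n == 9 && memcmp(in, "key=value", 9) == 0)
            ok = 0;
    }
    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

On an unconnected datagram socket, the same helper would additionally set msg_name and msg_namelen to the destination address.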
RETURN VALUE
On success, these calls return the number of bytes sent. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
These are some standard errors generated by the socket layer. Additional errors may be generated and returned from the underlying protocol modules; see their respective manual pages.
EACCES
(For UNIX domain sockets, which are identified by pathname) Write permission is denied on the destination socket file, or search permission is denied for one of the directories in the path prefix. (See path_resolution(7).)
(For UDP sockets) An attempt was made to send to a network/broadcast address as though it was a unicast address.
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and the requested operation would block. POSIX.1-2001 allows either error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities.
EAGAIN
(Internet domain datagram sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7).
EALREADY
Another Fast Open is in progress.
EBADF
sockfd is not a valid open file descriptor.
ECONNRESET
Connection reset by peer.
EDESTADDRREQ
The socket is not connection-mode, and no peer address is set.
EFAULT
An invalid user space address was specified for an argument.
EINTR
A signal occurred before any data was transmitted; see signal(7).
EINVAL
Invalid argument passed.
EISCONN
The connection-mode socket was connected already but a recipient was specified. (Now either this error is returned, or the recipient specification is ignored.)
EMSGSIZE
The socket type requires that message be sent atomically, and the size of the message to be sent made this impossible.
ENOBUFS
The output queue for a network interface was full. This generally indicates that the interface has stopped sending, but may be caused by transient congestion. (Normally, this does not occur in Linux. Packets are just silently dropped when a device queue overflows.)
ENOMEM
No memory available.
ENOTCONN
The socket is not connected, and no target has been given.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
EOPNOTSUPP
Some bit in the flags argument is inappropriate for the socket type.
EPIPE
The local end has been shut down on a connection oriented socket. In this case, the process will also receive a SIGPIPE unless MSG_NOSIGNAL is set.
VERSIONS
According to POSIX.1-2001, the msg_controllen field of the msghdr structure should be typed as socklen_t, and the msg_iovlen field should be typed as int, but glibc currently types both as size_t.
STANDARDS
POSIX.1-2008.
MSG_CONFIRM is a Linux extension.
HISTORY
4.4BSD, SVr4, POSIX.1-2001. (These interfaces first appeared in 4.2BSD.)
POSIX.1-2001 describes only the MSG_OOB and MSG_EOR flags. POSIX.1-2008 adds a specification of MSG_NOSIGNAL.
NOTES
See sendmmsg(2) for information about a Linux-specific system call that can be used to transmit multiple datagrams in a single call.
BUGS
Linux may return EPIPE instead of ENOTCONN.
EXAMPLES
An example of the use of sendto() is shown in getaddrinfo(3).
SEE ALSO
fcntl(2), getsockopt(2), recv(2), select(2), sendfile(2), sendmmsg(2), shutdown(2), socket(2), write(2), cmsg(3), ip(7), ipv6(7), socket(7), tcp(7), udp(7), unix(7)
434 - Linux cli command access
NAME access
check user’s permissions for a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int access(const char *pathname, int mode);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int faccessat(int dirfd, const char *pathname, int mode, int flags);
/* But see C library/kernel differences, below */
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_faccessat2,
int dirfd, const char *pathname, int mode, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
faccessat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
access() checks whether the calling process can access the file pathname. If pathname is a symbolic link, it is dereferenced.
The mode specifies the accessibility check(s) to be performed, and is either the value F_OK, or a mask consisting of the bitwise OR of one or more of R_OK, W_OK, and X_OK. F_OK tests for the existence of the file. R_OK, W_OK, and X_OK test whether the file exists and grants read, write, and execute permissions, respectively.
The check is done using the calling process’s real UID and GID, rather than the effective IDs as is done when actually attempting an operation (e.g., open(2)) on the file. Similarly, for the root user, the check uses the set of permitted capabilities rather than the set of effective capabilities; and for non-root users, the check uses an empty set of capabilities.
This allows set-user-ID programs and capability-endowed programs to easily determine the invoking user’s authority. In other words, access() does not answer the “can I read/write/execute this file?” question. It answers a slightly different question: “(assuming I’m a setuid binary) can the user who invoked me read/write/execute this file?”, which gives set-user-ID programs the possibility to prevent malicious users from causing them to read files which users shouldn’t be able to read.
If the calling process is privileged (i.e., its real UID is zero), then an X_OK check is successful for a regular file if execute permission is enabled for any of the file owner, group, or other.
faccessat()
faccessat() operates in exactly the same way as access(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by access() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like access()).
If pathname is absolute, then dirfd is ignored.
flags is constructed by ORing together zero or more of the following values:
AT_EACCESS
Perform access checks using the effective user and group IDs. By default, faccessat() uses the real IDs (like access()).
AT_EMPTY_PATH (since Linux 5.8)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself.
See openat(2) for an explanation of the need for faccessat().
faccessat2()
The description of faccessat() given above corresponds to POSIX.1 and to the implementation provided by glibc. However, the glibc implementation was an imperfect emulation (see BUGS) that papered over the fact that the raw Linux faccessat() system call does not have a flags argument. To allow for a proper implementation, Linux 5.8 added the faccessat2() system call, which supports the flags argument and allows a correct implementation of the faccessat() wrapper function.
RETURN VALUE
On success (all requested permissions granted, or mode is F_OK and the file exists), zero is returned. On error (at least one bit in mode asked for a permission that is denied, or mode is F_OK and the file does not exist, or some other error occurred), -1 is returned, and errno is set to indicate the error.
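The return convention can be sketched as follows. The helper names path_exists() and demo_access() are hypothetical, and the demo assumes that /tmp is writable. It distinguishes "does not exist" from other errors by inspecting errno, and it uses faccessat() with AT_EACCESS to check against the effective IDs instead of the real IDs.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>      /* AT_FDCWD, AT_EACCESS */
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: 1 if pathname exists, 0 if it does not,
 * -1 if some other error occurred (errno is left set). */
int path_exists(const char *pathname)
{
    if (access(pathname, F_OK) == 0)
        return 1;
    return (errno == ENOENT || errno == ENOTDIR) ? 0 : -1;
}

/* Hypothetical self-check on a temporary file (assumes /tmp is writable).
 * Returns 0 if every check behaves as described. */
int demo_access(void)
{
    char path[] = "/tmp/access-demo-XXXXXX";
    int fd, ok;

    fd = mkstemp(path);         /* creates the file with mode 0600 */
    if (fd == -1)
        return -1;
    close(fd);

    ok = path_exists(path) == 1
      /* Both bits must be granted, checked with the real IDs. */
      && access(path, R_OK | W_OK) == 0
      /* Same file, checked with the effective IDs. */
      && faccessat(AT_FDCWD, path, R_OK, AT_EACCESS) == 0;

    unlink(path);
    ok = ok && path_exists(path) == 0;
    return ok ? 0 : -1;
}
```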
ERRORS
EACCES
The requested access would be denied to the file, or search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
(faccessat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
mode was incorrectly specified.
EINVAL
(faccessat()) Invalid flag specified in flags.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component used as a directory in pathname is not, in fact, a directory.
ENOTDIR
(faccessat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EPERM
Write permission was requested to a file that has the immutable flag set. See also ioctl_iflags(2).
EROFS
Write permission was requested for a file on a read-only filesystem.
ETXTBSY
Write access was requested to an executable which is being executed.
VERSIONS
If the calling process has appropriate privileges (i.e., is superuser), POSIX.1-2001 permits an implementation to indicate success for an X_OK check even if none of the execute file permission bits are set. Linux does not do this.
C library/kernel differences
The raw faccessat() system call takes only the first three arguments. The AT_EACCESS and AT_SYMLINK_NOFOLLOW flags are actually implemented within the glibc wrapper function for faccessat(). If either of these flags is specified, then the wrapper function employs fstatat(2) to determine access permissions, but see BUGS.
glibc notes
On older kernels where faccessat() is unavailable (and when the AT_EACCESS and AT_SYMLINK_NOFOLLOW flags are not specified), the glibc wrapper function falls back to the use of access(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
STANDARDS
access()
faccessat()
POSIX.1-2008.
faccessat2()
Linux.
HISTORY
access()
SVr4, 4.3BSD, POSIX.1-2001.
faccessat()
Linux 2.6.16, glibc 2.4.
faccessat2()
Linux 5.8.
NOTES
Warning: Using these calls to check if a user is authorized to, for example, open a file before actually doing so using open(2) creates a security hole, because the user might exploit the short time interval between checking and opening the file to manipulate it. For this reason, the use of this system call should be avoided. (In the example just described, a safer alternative would be to temporarily switch the process’s effective user ID to the real ID and then call open(2).)
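The safer alternative mentioned in the warning might look like the following sketch. The helper name open_as_real_user() is hypothetical. It handles only the user ID; a complete set-user-ID program would presumably also need to switch its group ID and supplementary groups, which is omitted here.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open pathname with the privileges of the real
 * user, so the permission check and the open are a single operation.
 * Returns a file descriptor, or -1 with errno set. */
int open_as_real_user(const char *pathname, int flags)
{
    uid_t saved_euid = geteuid();
    int fd, saved_errno;

    /* Temporarily drop to the real user ID. */
    if (seteuid(getuid()) == -1)
        return -1;

    fd = open(pathname, flags); /* the kernel checks permissions here */
    saved_errno = errno;

    /* Restore the original effective user ID. */
    if (seteuid(saved_euid) == -1) {
        if (fd != -1)
            close(fd);
        return -1;
    }

    errno = saved_errno;
    return fd;
}
```

Unlike an access() call followed by open(2), this leaves no interval in which the file can be swapped between the check and the use.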
access() always dereferences symbolic links. If you need to check the permissions on a symbolic link, use faccessat() with the flag AT_SYMLINK_NOFOLLOW.
These calls return an error if any of the access types in mode is denied, even if some of the other access types in mode are permitted.
A file is accessible only if the permissions on each of the directories in the path prefix of pathname grant search (i.e., execute) access. If any directory is inaccessible, then the access() call fails, regardless of the permissions on the file itself.
Only access bits are checked, not the file type or contents. Therefore, if a directory is found to be writable, it probably means that files can be created in the directory, and not that the directory can be written as a file. Similarly, a DOS file may be reported as executable, but the execve(2) call will still fail.
These calls may not work correctly on NFSv2 filesystems with UID mapping enabled, because UID mapping is done on the server and hidden from the client, which checks permissions. (NFS versions 3 and higher perform the check on the server.) Similar problems can occur with FUSE mounts.
BUGS
Because the Linux kernel’s faccessat() system call does not support a flags argument, the glibc faccessat() wrapper function provided in glibc 2.32 and earlier emulates the required functionality using a combination of the faccessat() system call and fstatat(2). However, this emulation does not take ACLs into account. Starting with glibc 2.33, the wrapper function avoids this bug by making use of the faccessat2() system call where it is provided by the underlying kernel.
In Linux 2.4 (and earlier) there is some strangeness in the handling of X_OK tests for superuser. If all categories of execute permission are disabled for a nondirectory file, then the only access() test that returns -1 is when mode is specified as just X_OK; if R_OK or W_OK is also specified in mode, then access() returns 0 for such files. Early Linux 2.6 (up to and including Linux 2.6.3) also behaved in the same way as Linux 2.4.
Before Linux 2.6.20, these calls ignored the effect of the MS_NOEXEC flag if it was used to mount(2) the underlying filesystem. Since Linux 2.6.20, the MS_NOEXEC flag is honored.
SEE ALSO
chmod(2), chown(2), open(2), setgid(2), setuid(2), stat(2), euidaccess(3), credentials(7), path_resolution(7), symlink(7)
435 - Linux cli command symlink
NAME symlink
make a new name for a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int symlink(const char *target, const char *linkpath);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
int symlinkat(const char *target, int newdirfd, const char *linkpath);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
symlink():
_XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L
|| /* glibc <= 2.19: */ _BSD_SOURCE
symlinkat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
symlink() creates a symbolic link named linkpath which contains the string target.
Symbolic links are interpreted at run time as if the contents of the link had been substituted into the path being followed to find a file or directory.
Symbolic links may contain .. path components, which (if used at the start of the link) refer to the parent directories of that in which the link resides.
A symbolic link (also known as a soft link) may point to an existing file or to a nonexistent one; the latter case is known as a dangling link.
The permissions of a symbolic link are irrelevant; the ownership is ignored when following the link (except when the protected_symlinks feature is enabled, as explained in proc(5)), but is checked when removal or renaming of the link is requested and the link is in a directory with the sticky bit (S_ISVTX) set.
If linkpath exists, it will not be overwritten.
symlinkat()
The symlinkat() system call operates in exactly the same way as symlink(), except for the differences described here.
If the pathname given in linkpath is relative, then it is interpreted relative to the directory referred to by the file descriptor newdirfd (rather than relative to the current working directory of the calling process, as is done by symlink() for a relative pathname).
If linkpath is relative and newdirfd is the special value AT_FDCWD, then linkpath is interpreted relative to the current working directory of the calling process (like symlink()).
If linkpath is absolute, then newdirfd is ignored.
See openat(2) for an explanation of the need for symlinkat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
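The behavior described above can be exercised with a short sketch. The function name demo_symlink() is hypothetical, and the demo assumes that /tmp is writable. It creates a dangling link, reads the stored string back with readlink(2), confirms that an existing linkpath is not overwritten, and creates a second link relative to a directory file descriptor with symlinkat().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical self-check; returns 0 if every step behaves as described. */
int demo_symlink(void)
{
    char dir[] = "/tmp/symlink-demo-XXXXXX";    /* assumed writable */
    char linkpath[PATH_MAX];
    char buf[64];
    const char *target = "no-such-target";
    int dirfd, ok = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(linkpath, sizeof(linkpath), "%s/dangling", dir);

    /* The target is stored as a string; it need not exist. */
    if (symlink(target, linkpath) == 0) {
        ssize_t n = readlink(linkpath, buf, sizeof(buf));
        int stored = (n == (ssize_t)strlen(target) &&
                      memcmp(buf, target, (size_t)n) == 0);

        /* An existing linkpath is never overwritten. */
        int refused = (symlink("other", linkpath) == -1 &&
                       errno == EEXIST);

        if (stored && refused)
            ok = 0;
        unlink(linkpath);
    }

    /* symlinkat(): a relative linkpath is resolved against dirfd. */
    dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd == -1) {
        ok = -1;
    } else {
        if (symlinkat(target, dirfd, "relative") == -1)
            ok = -1;
        unlinkat(dirfd, "relative", 0);
        close(dirfd);
    }

    rmdir(dir);
    return ok;
}
```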
ERRORS
EACCES
Write access to the directory containing linkpath is denied, or one of the directories in the path prefix of linkpath did not allow search permission. (See also path_resolution(7).)
EBADF
(symlinkat()) linkpath is relative but newdirfd is neither AT_FDCWD nor a valid file descriptor.
EDQUOT
The user’s quota of resources on the filesystem has been exhausted. The resources could be inodes or disk blocks, depending on the filesystem implementation.
EEXIST
linkpath already exists.
EFAULT
target or linkpath points outside your accessible address space.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving linkpath.
ENAMETOOLONG
target or linkpath was too long.
ENOENT
A directory component in linkpath does not exist or is a dangling symbolic link, or target or linkpath is an empty string.
ENOENT
(symlinkat()) linkpath is a relative pathname and newdirfd refers to a directory that has been deleted.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The device containing the file has no room for the new directory entry.
ENOTDIR
A component used as a directory in linkpath is not, in fact, a directory.
ENOTDIR
(symlinkat()) linkpath is relative and newdirfd is a file descriptor referring to a file other than a directory.
EPERM
The filesystem containing linkpath does not support the creation of symbolic links.
EROFS
linkpath is on a read-only filesystem.
STANDARDS
POSIX.1-2008.
HISTORY
symlink()
SVr4, 4.3BSD, POSIX.1-2001.
symlinkat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
glibc notes
On older kernels where symlinkat() is unavailable, the glibc wrapper function falls back to the use of symlink(). When linkpath is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the newdirfd argument.
NOTES
No checking of target is done.
Deleting the name referred to by a symbolic link will actually delete the file (unless it also has other hard links). If this behavior is not desired, use link(2).
SEE ALSO
ln(1), namei(1), lchown(2), link(2), lstat(2), open(2), readlink(2), rename(2), unlink(2), path_resolution(7), symlink(7)
436 - Linux cli command nanosleep
NAME nanosleep
high-resolution sleep
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <time.h>
int nanosleep(const struct timespec *duration,
struct timespec *_Nullable rem);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
nanosleep():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
nanosleep() suspends the execution of the calling thread until either at least the time specified in *duration has elapsed, or a signal is delivered that triggers the invocation of a handler in the calling thread or that terminates the process.
If the call is interrupted by a signal handler, nanosleep() returns -1, sets errno to EINTR, and writes the remaining time into the structure pointed to by rem unless rem is NULL. The value of *rem can then be used to call nanosleep() again and complete the specified pause (but see NOTES).
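A minimal sketch of that restart pattern follows. The helper name sleep_fully() is hypothetical. Each time the call fails with EINTR, the remaining time written to rem becomes the next request.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical helper: sleep for the full interval, resuming after
 * signal handlers. Returns 0 on success, or -1 with errno set
 * (for example EINVAL for an out-of-range tv_nsec). */
int sleep_fully(time_t sec, long nsec)
{
    struct timespec req = { .tv_sec = sec, .tv_nsec = nsec };
    struct timespec rem;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;              /* continue with the time still left */
    }
    return 0;
}
```

For example, sleep_fully(0, 500000000L) pauses for about half a second. Each restart is still a relative sleep, so frequent interruptions can lengthen the total pause.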
The timespec(3) structure is used to specify intervals of time with nanosecond precision.
The value of the nanoseconds field must be in the range [0, 999999999].
Compared to sleep(3) and usleep(3), nanosleep() has the following advantages: it provides a higher resolution for specifying the sleep interval; POSIX.1 explicitly specifies that it does not interact with signals; and it makes the task of resuming a sleep that has been interrupted by a signal handler easier.
RETURN VALUE
On successfully sleeping for the requested duration, nanosleep() returns 0. If the call is interrupted by a signal handler or encounters an error, then it returns -1, with errno set to indicate the error.
ERRORS
EFAULT
Problem with copying information from user space.
EINTR
The pause has been interrupted by a signal that was delivered to the thread (see signal(7)). The remaining sleep time has been written into *rem so that the thread can easily call nanosleep() again and continue with the pause.
EINVAL
The value in the tv_nsec field was not in the range [0, 999999999] or tv_sec was negative.
VERSIONS
POSIX.1 specifies that nanosleep() should measure time against the CLOCK_REALTIME clock. However, Linux measures the time using the CLOCK_MONOTONIC clock. This probably does not matter, since the POSIX.1 specification for clock_settime(2) says that discontinuous changes in CLOCK_REALTIME should not affect nanosleep():
Setting the value of the CLOCK_REALTIME clock via clock_settime(2) shall have no effect on threads that are blocked waiting for a relative time service based upon this clock, including the nanosleep() function; … Consequently, these time services shall expire when the requested duration elapses, independently of the new or old value of the clock.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
In order to support applications requiring much more precise pauses (e.g., in order to control some time-critical hardware), nanosleep() would handle pauses of up to 2 milliseconds by busy waiting with microsecond precision when called from a thread scheduled under a real-time policy like SCHED_FIFO or SCHED_RR. This special extension was removed in Linux 2.5.39, and is thus not available in Linux 2.6.0 and later kernels.
NOTES
If the duration is not an exact multiple of the granularity of the underlying clock (see time(7)), then the interval will be rounded up to the next multiple. Furthermore, after the sleep completes, there may still be a delay before the CPU becomes free to once again execute the calling thread.
The fact that nanosleep() sleeps for a relative interval can be problematic if the call is repeatedly restarted after being interrupted by signals, since the time between the interruptions and restarts of the call will lead to drift in the time when the sleep finally completes. This problem can be avoided by using clock_nanosleep(2) with an absolute time value.
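One way to apply that advice is sketched below. The helper name sleep_until_deadline_ms() is hypothetical, and the choice of CLOCK_MONOTONIC is an assumption. The deadline is computed once, so restarting after a signal does not push the wake-up time later.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical helper: sleep until an absolute CLOCK_MONOTONIC deadline
 * that lies ms milliseconds in the future. Returns 0 on success, -1 on
 * failure. */
int sleep_until_deadline_ms(long ms)
{
    struct timespec deadline;
    int err;

    if (ms < 0 || clock_gettime(CLOCK_MONOTONIC, &deadline) == -1)
        return -1;

    deadline.tv_sec += ms / 1000;
    deadline.tv_nsec += (ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {      /* normalize tv_nsec */
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    /* clock_nanosleep() returns the error number directly instead of
     * setting errno. The deadline is absolute, so a restart after
     * EINTR reuses it unchanged. */
    do {
        err = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                              &deadline, NULL);
    } while (err == EINTR);

    return err == 0 ? 0 : -1;
}
```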
BUGS
If a program that catches signals and uses nanosleep() receives signals at a very high rate, then scheduling delays and rounding errors in the kernel’s calculation of the sleep interval and the returned remain value mean that the remain value may steadily increase on successive restarts of the nanosleep() call. To avoid such problems, use clock_nanosleep(2) with the TIMER_ABSTIME flag to sleep to an absolute deadline.
In Linux 2.4, if nanosleep() is stopped by a signal (e.g., SIGTSTP), then the call fails with the error EINTR after the thread is resumed by a SIGCONT signal. If the system call is subsequently restarted, then the time that the thread spent in the stopped state is not counted against the sleep interval. This problem is fixed in Linux 2.6.0 and later kernels.
SEE ALSO
clock_nanosleep(2), restart_syscall(2), sched_setscheduler(2), timer_create(2), sleep(3), timespec(3), usleep(3), time(7)
437 - Linux cli command swapon
NAME swapon
start/stop swapping to file/device
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/swap.h>
int swapon(const char *path, int swapflags);
int swapoff(const char *path);
DESCRIPTION
swapon() sets the swap area to the file or block device specified by path. swapoff() stops swapping to the file or block device specified by path.
If the SWAP_FLAG_PREFER flag is specified in the swapon() swapflags argument, the new swap area will have a higher priority than default. The priority is encoded within swapflags as:
(prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK
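As a small sketch of that encoding, the following builds a swapflags value from a priority. The helper name swapflags_with_priority() is hypothetical; the constants come from <sys/swap.h>.

```c
#include <sys/swap.h>

/* Hypothetical helper: build a swapflags value that requests the given
 * priority. Bits of prio that do not fit in the mask are discarded. */
int swapflags_with_priority(int prio)
{
    return SWAP_FLAG_PREFER |
           ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);
}
```

The result would be passed as the second argument of swapon(), for example swapon(path, swapflags_with_priority(5)), which succeeds only for a process with the CAP_SYS_ADMIN capability.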
If the SWAP_FLAG_DISCARD flag is specified in the swapon() swapflags argument, freed swap pages will be discarded before they are reused, if the swap device supports the discard or trim operation. (This may improve performance on some Solid State Devices, but often it does not.) See also NOTES.
These functions may be used only by a privileged process (one having the CAP_SYS_ADMIN capability).
Priority
Each swap area has a priority, either high or low. The default priority is low. Within the low-priority areas, newer areas are even lower priority than older areas.
All priorities set with swapflags are high-priority, higher than default. They may have any nonnegative value chosen by the caller. Higher numbers mean higher priority.
Swap pages are allocated from areas in priority order, highest priority first. For areas with different priorities, a higher-priority area is exhausted before using a lower-priority area. If two or more areas have the same priority, and it is the highest priority available, pages are allocated on a round-robin basis between them.
As of Linux 1.3.6, the kernel usually follows these rules, but there are exceptions.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EBUSY
(for swapon()) The specified path is already being used as a swap area.
EINVAL
The file path exists, but refers neither to a regular file nor to a block device.
EINVAL
(swapon()) The indicated path does not contain a valid swap signature or resides on an in-memory filesystem such as tmpfs(5).
EINVAL (since Linux 3.4)
(swapon()) An invalid flag value was specified in swapflags.
EINVAL
(swapoff()) path is not currently a swap area.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOENT
The file path does not exist.
ENOMEM
The system has insufficient memory to start swapping.
EPERM
The caller does not have the CAP_SYS_ADMIN capability. Alternatively, the maximum number of swap files are already in use; see NOTES below.
STANDARDS
Linux.
HISTORY
The swapflags argument was introduced in Linux 1.3.2.
NOTES
The partition or path must be prepared with mkswap(8).
There is an upper limit on the number of swap files that may be used, defined by the kernel constant MAX_SWAPFILES. Before Linux 2.4.10, MAX_SWAPFILES had the value 8; since Linux 2.4.10, it has the value 32. If the kernel is built with the CONFIG_MIGRATION option (which reserves swap table entries for the page migration features of mbind(2) and migrate_pages(2)), the limit is decreased by 2 (thus 30) since Linux 2.6.18, and by 3 (thus 29) since Linux 5.19. Since Linux 2.6.32, the limit is further decreased by 1 if the kernel is built with the CONFIG_MEMORY_FAILURE option. Since Linux 5.14, the limit is further decreased by 4 if the kernel is built with the CONFIG_DEVICE_PRIVATE option. Since Linux 5.19, the limit is further decreased by 1 if the kernel is built with the CONFIG_PTE_MARKER option.
Discard of swap pages was introduced in Linux 2.6.29, then made conditional on the SWAP_FLAG_DISCARD flag in Linux 2.6.36, which still discards the entire swap area when swapon() is called, even if that flag bit is not set.
SEE ALSO
mkswap(8), swapoff(8), swapon(8)
438 - Linux cli command ioctl_console
NAME 🖥️ ioctl_console 🖥️
ioctls for console terminal and virtual consoles
DESCRIPTION
The following Linux-specific ioctl(2) operations are supported for console terminals and virtual consoles. Each operation requires a third argument, assumed here to be argp.
KDGETLED
Get state of LEDs. argp points to a char. The lower three bits of *argp are set to the state of the LEDs, as follows:
LED_CAP | 0x04 | caps lock led |
LED_NUM | 0x02 | num lock led |
LED_SCR | 0x01 | scroll lock led |
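A minimal sketch of reading and decoding the LED state follows. It assumes fd refers to an open console device; the helper names get_leds(), led_on(), and print_leds() are hypothetical, while KDGETLED and the LED_* constants come from <linux/kd.h>.

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kd.h>   /* KDGETLED, LED_CAP, LED_NUM, LED_SCR */

/* Query the LED state of the console open on fd.
 * Returns 0 on success, -1 on error (errno is set by ioctl). */
int get_leds(int fd, char *state)
{
    return ioctl(fd, KDGETLED, state);
}

/* Returns 1 if the given LED bit is set in state, 0 otherwise. */
int led_on(char state, int led)
{
    return (state & led) != 0;
}

/* Print the three LED states decoded from the low three bits. */
void print_leds(char state)
{
    printf("caps lock:   %s\n", led_on(state, LED_CAP) ? "on" : "off");
    printf("num lock:    %s\n", led_on(state, LED_NUM) ? "on" : "off");
    printf("scroll lock: %s\n", led_on(state, LED_SCR) ? "on" : "off");
}
```

On a file descriptor that is not a console, get_leds() is expected to fail, so callers should check its return value before decoding.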
KDSETLED
Set the LEDs. The LEDs are set to correspond to the lower three bits of the unsigned long integer in argp. However, if a higher order bit is set, the LEDs revert to normal: displaying the state of the keyboard functions of caps lock, num lock, and scroll lock.
Before Linux 1.1.54, the LEDs just reflected the state of the corresponding keyboard flags, and KDGETLED/KDSETLED would also change the keyboard flags. Since Linux 1.1.54 the LEDs can be made to display arbitrary information, but by default they display the keyboard flags. The following two ioctls are used to access the keyboard flags.
KDGKBLED
Get keyboard flags CapsLock, NumLock, ScrollLock (not lights). argp points to a char which is set to the flag state. The low order three bits (mask 0x7) get the current flag state, and the low order bits of the next nibble (mask 0x70) get the default flag state. (Since Linux 1.1.54.)
KDSKBLED
Set keyboard flags CapsLock, NumLock, ScrollLock (not lights). argp is an unsigned long integer that has the desired flag state. The low order three bits (mask 0x7) have the flag state, and the low order bits of the next nibble (mask 0x70) have the default flag state. (Since Linux 1.1.54.)
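The two masks described for KDGKBLED and KDSKBLED can be illustrated with a few bit operations. The helper names below are hypothetical; only the masks 0x7 and 0x70 are taken from the text.

```c
/* Current flag state: low order three bits (mask 0x7). */
unsigned kbled_current(unsigned char v)
{
    return v & 0x07;
}

/* Default flag state: low order bits of the next nibble (mask 0x70). */
unsigned kbled_default(unsigned char v)
{
    return (v >> 4) & 0x07;
}

/* Combine both 3-bit states into a value suitable as the
 * KDSKBLED argument, following the same layout. */
unsigned long kbled_pack(unsigned current, unsigned dflt)
{
    return (unsigned long)(current & 0x07) |
           ((unsigned long)(dflt & 0x07) << 4);
}
```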
KDGKBTYPE
Get keyboard type. This returns the value KB_101, defined as 0x02.
KDADDIO
Add I/O port as valid. Equivalent to ioperm(arg,1,1).
KDDELIO
Delete I/O port as valid. Equivalent to ioperm(arg,1,0).
KDENABIO
Enable I/O to video board. Equivalent to ioperm(0x3b4, 0x3df-0x3b4+1, 1).
KDDISABIO
Disable I/O to video board. Equivalent to ioperm(0x3b4, 0x3df-0x3b4+1, 0).
KDSETMODE
Set text/graphics mode. argp is an unsigned integer containing one of:
KD_TEXT | 0x00 |
KD_GRAPHICS | 0x01 |
KDGETMODE
Get text/graphics mode. argp points to an int which is set to one of the values shown above for KDSETMODE.
KDMKTONE
Generate tone of specified length. The lower 16 bits of the unsigned long integer in argp specify the period in clock cycles, and the upper 16 bits give the duration in msec. If the duration is zero, the sound is turned off. Control returns immediately. For example, argp = (125<<16) + 0x637 would specify the beep normally associated with a ctrl-G. (Thus since Linux 0.99pl1; broken in Linux 2.1.49-50.)
KIOCSOUND
Start or stop sound generation. The lower 16 bits of argp specify the period in clock cycles (that is, argp = 1193180/frequency). argp = 0 turns sound off. In either case, control returns immediately.
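The argument encodings for KIOCSOUND and KDMKTONE can be sketched as follows. The names TONE_CLOCK_HZ, tone_period(), and mktone_arg() are hypothetical; the constant 1193180 and the 16-bit split are taken from the descriptions above. Clamping the period to 16 bits is a choice made for this sketch.

```c
/* Clock rate used in the text: argp = 1193180/frequency. */
#define TONE_CLOCK_HZ 1193180UL

/* Period in clock cycles for a frequency in Hz (KIOCSOUND argument).
 * Returns 0 for a frequency of 0, which turns the sound off.
 * Very low frequencies are clamped so the period fits in 16 bits. */
unsigned long tone_period(unsigned freq_hz)
{
    unsigned long period;

    if (freq_hz == 0)
        return 0;
    period = TONE_CLOCK_HZ / freq_hz;
    return period > 0xffff ? 0xffff : period;
}

/* KDMKTONE argument: period in the lower 16 bits,
 * duration in milliseconds in the upper 16 bits. */
unsigned long mktone_arg(unsigned long period, unsigned duration_ms)
{
    return ((unsigned long)(duration_ms & 0xffff) << 16) | (period & 0xffff);
}
```

For instance, mktone_arg(0x637, 125) reproduces the ctrl-G beep value (125<<16) + 0x637 mentioned under KDMKTONE.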
GIO_CMAP
Get the current default color map from kernel. argp points to a 48-byte array. (Since Linux 1.3.3.)
PIO_CMAP
Change the default text-mode color map. argp points to a 48-byte array which contains, in order, the Red, Green, and Blue values for the 16 available screen colors: 0 is off, and 255 is full intensity. The default colors are, in order: black, dark red, dark green, brown, dark blue, dark purple, dark cyan, light grey, dark grey, bright red, bright green, yellow, bright blue, bright purple, bright cyan, and white. (Since Linux 1.3.3.)
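A sketch of addressing entries in the 48-byte color map follows. It assumes the map is laid out as 16 consecutive Red, Green, Blue triples (one triple per color), which is how the description above is read here; the names cmap_offset() and cmap_set() are hypothetical.

```c
enum { CMAP_COLORS = 16, CMAP_BYTES = 48 };

/* Byte offset of a channel (0 = red, 1 = green, 2 = blue) for color
 * idx, assuming interleaved R,G,B triples. Returns -1 if out of range. */
int cmap_offset(int idx, int channel)
{
    if (idx < 0 || idx >= CMAP_COLORS || channel < 0 || channel > 2)
        return -1;
    return 3 * idx + channel;
}

/* Store one color in the map: 0 is off, 255 is full intensity.
 * Returns 0 on success, -1 if idx is out of range. */
int cmap_set(unsigned char cmap[CMAP_BYTES], int idx,
             unsigned char r, unsigned char g, unsigned char b)
{
    if (cmap_offset(idx, 0) < 0)
        return -1;
    cmap[cmap_offset(idx, 0)] = r;
    cmap[cmap_offset(idx, 1)] = g;
    cmap[cmap_offset(idx, 2)] = b;
    return 0;
}
```

A program would usually fetch the current map with GIO_CMAP, change the entries it cares about, and write the whole array back with PIO_CMAP.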
GIO_FONT
Gets 256-character screen font in expanded form. argp points to an 8192-byte array. Fails with error code EINVAL if the currently loaded font is a 512-character font, or if the console is not in text mode.
GIO_FONTX
Gets screen font and associated information. argp points to a struct consolefontdesc (see PIO_FONTX). On call, the charcount field should be set to the maximum number of characters that would fit in the buffer pointed to by chardata. On return, the charcount and charheight are filled with the respective data for the currently loaded font, and the chardata array contains the font data if the initial value of charcount indicated enough space was available; otherwise the buffer is untouched and errno is set to ENOMEM. (Since Linux 1.3.1.)
PIO_FONT
Sets 256-character screen font. Load font into the EGA/VGA character generator. argp points to an 8192-byte map, with 32 bytes per character. Only the first N of them are used for an 8xN font (0 < N <= 32). This call also invalidates the Unicode mapping.
PIO_FONTX
Sets screen font and associated rendering information. argp points to a
struct consolefontdesc {
unsigned short charcount; /* characters in font (256 or 512) */
unsigned short charheight; /* scan lines per character (1-32) */
char *chardata; /* font data in expanded form */
};
If necessary, the screen will be appropriately resized, and SIGWINCH sent to the appropriate processes. This call also invalidates the Unicode mapping. (Since Linux 1.3.1.)
PIO_FONTRESET
Resets the screen font, size, and Unicode mapping to the bootup defaults. argp is unused, but should be set to NULL to ensure compatibility with future versions of Linux. (Since Linux 1.3.28.)
GIO_SCRNMAP
Get screen mapping from kernel. argp points to an area of size E_TABSZ, which is loaded with the font positions used to display each character. This call is likely to return useless information if the currently loaded font is more than 256 characters.
GIO_UNISCRNMAP
Get full Unicode screen mapping from kernel. argp points to an area of size E_TABSZ*sizeof(unsigned short), which is loaded with the Unicodes each character represents. A special set of Unicodes, starting at U+F000, is used to represent “direct to font” mappings. (Since Linux 1.3.1.)
PIO_SCRNMAP
Loads the “user definable” (fourth) table in the kernel which maps bytes into console screen symbols. argp points to an area of size E_TABSZ.
PIO_UNISCRNMAP
Loads the “user definable” (fourth) table in the kernel which maps bytes into Unicodes, which are then translated into screen symbols according to the currently loaded Unicode-to-font map. Special Unicodes starting at U+F000 can be used to map directly to the font symbols. (Since Linux 1.3.1.)
GIO_UNIMAP
Get Unicode-to-font mapping from kernel. argp points to a
struct unimapdesc {
unsigned short entry_ct;
struct unipair *entries;
};
where entries points to an array of
struct unipair {
unsigned short unicode;
unsigned short fontpos;
};
(Since Linux 1.1.92.)
PIO_UNIMAP
Put Unicode-to-font mapping in kernel. argp points to a struct unimapdesc. (Since Linux 1.1.92.)
PIO_UNIMAPCLR
Clear table, possibly advise hash algorithm. argp points to a
struct unimapinit {
unsigned short advised_hashsize; /* 0 if no opinion */
unsigned short advised_hashstep; /* 0 if no opinion */
unsigned short advised_hashlevel; /* 0 if no opinion */
};
(Since Linux 1.1.92.)
KDGKBMODE
Gets current keyboard mode. argp points to a long which is set to one of these:
K_RAW | 0x00 /* Raw (scancode) mode */ |
K_XLATE | 0x01 /* Translate keycodes using keymap */ |
K_MEDIUMRAW | 0x02 /* Medium raw (scancode) mode */ |
K_UNICODE | 0x03 /* Unicode mode */ |
K_OFF | 0x04 /* Disabled mode; since Linux 2.6.39 */ |
KDSKBMODE
Sets current keyboard mode. argp is a long equal to one of the values shown for KDGKBMODE.
KDGKBMETA
Gets meta key handling mode. argp points to a long which is set to one of these:
K_METABIT | 0x03 | set high order bit |
K_ESCPREFIX | 0x04 | escape prefix |
KDSKBMETA
Sets meta key handling mode. argp is a long equal to one of the values shown above for KDGKBMETA.
KDGKBENT
Gets one entry in key translation table (keycode to action code). argp points to a
struct kbentry {
unsigned char kb_table;
unsigned char kb_index;
unsigned short kb_value;
};
with the first two members filled in: kb_table selects the key table (0 <= kb_table < MAX_NR_KEYMAPS), and kb_index is the keycode (0 <= kb_index < NR_KEYS). kb_value is set to the corresponding action code, or K_HOLE if there is no such key, or K_NOSUCHMAP if kb_table is invalid.
KDSKBENT
Sets one entry in translation table. argp points to a struct kbentry.
KDGKBSENT
Gets one function key string. argp points to a
struct kbsentry {
unsigned char kb_func;
unsigned char kb_string[512];
};
kb_string is set to the (null-terminated) string corresponding to the kb_func-th function key action code.
KDSKBSENT
Sets one function key string entry. argp points to a struct kbsentry.
KDGKBDIACR
Read kernel accent table. argp points to a
struct kbdiacrs {
unsigned int kb_cnt;
struct kbdiacr kbdiacr[256];
};
where kb_cnt is the number of entries in the array, each of which is a
struct kbdiacr {
unsigned char diacr;
unsigned char base;
unsigned char result;
};
KDGETKEYCODE
Read kernel keycode table entry (scan code to keycode). argp points to a
struct kbkeycode {
unsigned int scancode;
unsigned int keycode;
};
keycode is set to correspond to the given scancode. (89 <= scancode <= 255 only. For 1 <= scancode <= 88, keycode==scancode.) (Since Linux 1.1.63.)
KDSETKEYCODE
Write kernel keycode table entry. argp points to a struct kbkeycode. (Since Linux 1.1.63.)
KDSIGACCEPT
The calling process indicates its willingness to accept the signal argp when it is generated by pressing an appropriate key combination. (1 <= argp <= NSIG). (See spawn_console() in linux/drivers/char/keyboard.c.)
VT_OPENQRY
Returns the first available (non-opened) console. argp points to an int which is set to the number of the vt (1 <= *argp <= MAX_NR_CONSOLES).
VT_GETMODE
Get mode of active vt. argp points to a
struct vt_mode {
char mode; /* vt mode */
char waitv; /* if set, hang on writes if not active */
short relsig; /* signal to raise on release op */
short acqsig; /* signal to raise on acquisition */
short frsig; /* unused (set to 0) */
};
which is set to the mode of the active vt. mode is set to one of these values:
VT_AUTO | auto vt switching |
VT_PROCESS | process controls switching |
VT_ACKACQ | acknowledge switch |
VT_SETMODE
Set mode of active vt. argp points to a struct vt_mode.
VT_GETSTATE
Get global vt state info. argp points to a
struct vt_stat {
unsigned short v_active; /* active vt */
unsigned short v_signal; /* signal to send */
unsigned short v_state; /* vt bit mask */
};
For each vt in use, the corresponding bit in the v_state member is set. (Linux 1.0 through Linux 1.1.92.)
VT_RELDISP
Release a display.
VT_ACTIVATE
Switch to vt argp (1 <= argp <= MAX_NR_CONSOLES).
VT_WAITACTIVE
Wait until vt argp has been activated.
VT_DISALLOCATE
Deallocate the memory associated with vt argp. (Since Linux 1.1.54.)
VT_RESIZE
Set the kernel’s idea of screensize. argp points to a
struct vt_sizes {
unsigned short v_rows; /* # rows */
unsigned short v_cols; /* # columns */
unsigned short v_scrollsize; /* no longer used */
};
Note that this does not change the videomode. See resizecons(8). (Since Linux 1.1.54.)
VT_RESIZEX
Set the kernel’s idea of various screen parameters. argp points to a
struct vt_consize {
unsigned short v_rows; /* number of rows */
unsigned short v_cols; /* number of columns */
unsigned short v_vlin; /* number of pixel rows on screen */
unsigned short v_clin; /* number of pixel rows per character */
unsigned short v_vcol; /* number of pixel columns on screen */
unsigned short v_ccol; /* number of pixel columns per character */
};
Any parameter may be set to zero, indicating “no change”, but if multiple parameters are set, they must be self-consistent. Note that this does not change the videomode. See resizecons(8). (Since Linux 1.3.3.)
The action of the following ioctls depends on the first byte in the struct pointed to by argp, referred to here as the subcode. These are legal only for the superuser or the owner of the current terminal. Symbolic subcodes are available in <linux/tiocl.h> since Linux 2.5.71.
TIOCLINUX, subcode=0
Dump the screen. Disappeared in Linux 1.1.92. (With Linux 1.1.92 or later, read from /dev/vcsN or /dev/vcsaN instead.)
TIOCLINUX, subcode=1
Get task information. Disappeared in Linux 1.1.92.
TIOCLINUX, subcode=TIOCL_SETSEL
Set selection. argp points to a
struct {
char subcode;
short xs, ys, xe, ye;
short sel_mode;
};
xs and ys are the starting column and row. xe and ye are the ending column and row. (Upper left corner is row=column=1.) sel_mode is 0 for character-by-character selection, 1 for word-by-word selection, or 2 for line-by-line selection. The indicated screen characters are highlighted and saved in a kernel buffer.
Since Linux 6.7, using this subcode requires the CAP_SYS_ADMIN capability.
TIOCLINUX, subcode=TIOCL_PASTESEL
Paste selection. The characters in the selection buffer are written to fd.
Since Linux 6.7, using this subcode requires the CAP_SYS_ADMIN capability.
TIOCLINUX, subcode=TIOCL_UNBLANKSCREEN
Unblank the screen.
TIOCLINUX, subcode=TIOCL_SELLOADLUT
Sets contents of a 256-bit look up table defining characters in a “word”, for word-by-word selection. (Since Linux 1.1.32.)
Since Linux 6.7, using this subcode requires the CAP_SYS_ADMIN capability.
TIOCLINUX, subcode=TIOCL_GETSHIFTSTATE
argp points to a char which is set to the value of the kernel variable shift_state. (Since Linux 1.1.32.)
TIOCLINUX, subcode=TIOCL_GETMOUSEREPORTING
argp points to a char which is set to the value of the kernel variable report_mouse. (Since Linux 1.1.33.)
TIOCLINUX, subcode=8
Dump screen width and height, cursor position, and all the character-attribute pairs. (Linux 1.1.67 through Linux 1.1.91 only. With Linux 1.1.92 or later, read from /dev/vcsa* instead.)
TIOCLINUX, subcode=9
Restore screen width and height, cursor position, and all the character-attribute pairs. (Linux 1.1.67 through Linux 1.1.91 only. With Linux 1.1.92 or later, write to /dev/vcsa* instead.)
TIOCLINUX, subcode=TIOCL_SETVESABLANK
Handles the Power Saving feature of the new generation of monitors. VESA screen blanking mode is set to argp[1], which governs what screen blanking does:
0
Screen blanking is disabled.
1
The current video adapter register settings are saved, then the controller is programmed to turn off the vertical synchronization pulses. This puts the monitor into “standby” mode. If your monitor has an Off_Mode timer, then it will eventually power down by itself.
2
The current settings are saved, then both the vertical and horizontal synchronization pulses are turned off. This puts the monitor into “off” mode. If your monitor has no Off_Mode timer, or if you want your monitor to power down immediately when the blank_timer times out, then you choose this option. (Caution: Powering down frequently will damage the monitor.) (Since Linux 1.1.76.)
TIOCLINUX, subcode=TIOCL_SETKMSGREDIRECT
Change target of kernel messages (“console”): by default, and if this is set to 0, messages are written to the currently active VT. The VT to write to is a single byte following subcode. (Since Linux 2.5.36.)
TIOCLINUX, subcode=TIOCL_GETFGCONSOLE
Returns the number of VT currently in foreground. (Since Linux 2.5.36.)
TIOCLINUX, subcode=TIOCL_SCROLLCONSOLE
Scroll the foreground VT by the specified amount of lines down, or half the screen if 0. lines is *(((int32_t *)&subcode) + 1). (Since Linux 2.5.67.)
TIOCLINUX, subcode=TIOCL_BLANKSCREEN
Blank the foreground VT, ignoring “pokes” (typing): can only be unblanked explicitly (by switching VTs, to text mode, etc.). (Since Linux 2.5.71.)
TIOCLINUX, subcode=TIOCL_BLANKEDSCREEN
Returns the number of VT currently blanked, 0 if none. (Since Linux 2.5.71.)
TIOCLINUX, subcode=16
Never used.
TIOCLINUX, subcode=TIOCL_GETKMSGREDIRECT
Returns target of kernel messages. (Since Linux 2.6.17.)
RETURN VALUE
On success, 0 is returned (except where indicated). On failure, -1 is returned, and errno is set to indicate the error.
ERRORS
EBADF
The file descriptor is invalid.
EINVAL
The file descriptor or argp is invalid.
ENOTTY
The file descriptor is not associated with a character special device, or the specified operation does not apply to it.
EPERM
Insufficient permission.
NOTES
Warning: Do not regard this man page as documentation of the Linux console ioctls. This is provided for the curious only, as an alternative to reading the source. Ioctls are undocumented Linux internals, liable to be changed without warning. (And indeed, this page more or less describes the situation as of kernel version 1.1.94; there are many minor and not-so-minor differences with earlier versions.)
Very often, ioctls are introduced for communication between the kernel and one particular well-known program (fdisk, hdparm, setserial, tunelp, loadkeys, selection, setfont, etc.), and their behavior will be changed when required by this particular program.
Programs using these ioctls will not be portable to other versions of UNIX, will not work on older versions of Linux, and will not work on future versions of Linux.
Use POSIX functions.
SEE ALSO
dumpkeys(1), kbd_mode(1), loadkeys(1), mknod(1), setleds(1), setmetamode(1), execve(2), fcntl(2), ioctl_tty(2), ioperm(2), termios(3), console_codes(4), mt(4), sd(4), tty(4), ttyS(4), vcs(4), vcsa(4), charsets(7), mapscrn(8), resizecons(8), setfont(8)
/usr/include/linux/kd.h, /usr/include/linux/vt.h
439 - Linux cli command prctl
NAME 🖥️ prctl 🖥️
operations on a process or thread
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/prctl.h>
int prctl(int op, ...
/* unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5 */ );
DESCRIPTION
prctl() manipulates various aspects of the behavior of the calling thread or process.
Note that careless use of some prctl() operations can confuse the user-space run-time environment, so these operations should be used with care.
prctl() is called with a first argument describing what to do (with values defined in <linux/prctl.h>), and further arguments with a significance depending on the first one. The first argument can be:
PR_CAP_AMBIENT (since Linux 4.3)
Reads or changes the ambient capability set of the calling thread, according to the value of arg2, which must be one of the following:
PR_CAP_AMBIENT_RAISE
The capability specified in arg3 is added to the ambient set. The specified capability must already be present in both the permitted and the inheritable sets of the process. This operation is not permitted if the SECBIT_NO_CAP_AMBIENT_RAISE securebit is set.
PR_CAP_AMBIENT_LOWER
The capability specified in arg3 is removed from the ambient set.
PR_CAP_AMBIENT_IS_SET
The prctl() call returns 1 if the capability in arg3 is in the ambient set and 0 if it is not.
PR_CAP_AMBIENT_CLEAR_ALL
All capabilities will be removed from the ambient set. This operation requires setting arg3 to zero.
In all of the above operations, arg4 and arg5 must be specified as 0.
Higher-level interfaces layered on top of the above operations are provided in the libcap(3) library in the form of cap_get_ambient(3), cap_set_ambient(3), and cap_reset_ambient(3).
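The raw PR_CAP_AMBIENT operations described above might be wrapped as in the following sketch. The helper names are hypothetical; the constants come from <sys/prctl.h> and <linux/capability.h>. Arguments are cast to unsigned long because prctl() is variadic.

```c
#include <sys/prctl.h>
#include <linux/capability.h>

/* Returns 1 if cap is in the ambient set, 0 if it is not,
 * or -1 on error (errno set), for example on kernels before 4.3. */
int ambient_has(unsigned long cap)
{
    return prctl(PR_CAP_AMBIENT, (unsigned long)PR_CAP_AMBIENT_IS_SET,
                 cap, 0UL, 0UL);
}

/* Try to add cap to the ambient set. Fails unless cap is already
 * in both the permitted and the inheritable sets. */
int ambient_raise(unsigned long cap)
{
    return prctl(PR_CAP_AMBIENT, (unsigned long)PR_CAP_AMBIENT_RAISE,
                 cap, 0UL, 0UL);
}

/* Remove every capability from the ambient set (arg3 must be zero). */
int ambient_clear_all(void)
{
    return prctl(PR_CAP_AMBIENT, (unsigned long)PR_CAP_AMBIENT_CLEAR_ALL,
                 0UL, 0UL, 0UL);
}
```

Most programs are better served by the libcap(3) wrappers mentioned above; the direct calls are shown only to make the argument layout concrete.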
PR_CAPBSET_READ (since Linux 2.6.25)
Return (as the function result) 1 if the capability specified in arg2 is in the calling thread’s capability bounding set, or 0 if it is not. (The capability constants are defined in <linux/capability.h>.) The capability bounding set dictates whether the process can receive the capability through a file’s permitted capability set on a subsequent call to execve(2).
If the capability specified in arg2 is not valid, then the call fails with the error EINVAL.
A higher-level interface layered on top of this operation is provided in the libcap(3) library in the form of cap_get_bound(3).
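Because PR_CAPBSET_READ fails with EINVAL for an invalid capability number, one way to enumerate the bounding set is to probe capability numbers upward from zero until the call fails. The sketch below does this; the function name and the probe ceiling of 64 are assumptions made for the example.

```c
#include <sys/prctl.h>

/* Count the capabilities present in the calling thread's bounding set.
 * Probing stops at the first capability number the kernel rejects. */
int count_bounding_caps(void)
{
    int count = 0;

    for (unsigned long cap = 0; cap < 64; cap++) {  /* 64: arbitrary ceiling */
        int r = prctl(PR_CAPBSET_READ, cap, 0UL, 0UL, 0UL);

        if (r < 0)
            break;          /* typically EINVAL: not a valid capability */
        count += r;         /* r is 1 if present, 0 if dropped */
    }
    return count;
}
```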
PR_CAPBSET_DROP (since Linux 2.6.25)
If the calling thread has the CAP_SETPCAP capability within its user namespace, then drop the capability specified by arg2 from the calling thread’s capability bounding set. Any children of the calling thread will inherit the newly reduced bounding set.
The call fails with the error: EPERM if the calling thread does not have the CAP_SETPCAP capability; EINVAL if arg2 does not represent a valid capability; or EINVAL if file capabilities are not enabled in the kernel, in which case bounding sets are not supported.
A higher-level interface layered on top of this operation is provided in the libcap(3) library in the form of cap_drop_bound(3).
PR_SET_CHILD_SUBREAPER (since Linux 3.4)
If arg2 is nonzero, set the “child subreaper” attribute of the calling process; if arg2 is zero, unset the attribute.
A subreaper fulfills the role of init(1) for its descendant processes. When a process becomes orphaned (i.e., its immediate parent terminates), then that process will be reparented to the nearest still living ancestor subreaper. Subsequently, calls to getppid(2) in the orphaned process will now return the PID of the subreaper process, and when the orphan terminates, it is the subreaper process that will receive a SIGCHLD signal and will be able to wait(2) on the process to discover its termination status.
The setting of the “child subreaper” attribute is not inherited by children created by fork(2) and clone(2). The setting is preserved across execve(2).
Establishing a subreaper process is useful in session management frameworks where a hierarchical group of processes is managed by a subreaper process that needs to be informed when one of the processes, for example a double-forked daemon, terminates (perhaps so that it can restart that process). Some init(1) frameworks (e.g., systemd(1)) employ a subreaper process for similar reasons.
PR_GET_CHILD_SUBREAPER (since Linux 3.4)
Return the “child subreaper” setting of the caller, in the location pointed to by (int *) arg2.
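A minimal sketch of setting the attribute and reading it back follows. The function name is hypothetical. Note that PR_GET_CHILD_SUBREAPER stores its result through the pointer passed in arg2 rather than returning it.

```c
#include <sys/prctl.h>

/* Mark the calling process as a child subreaper and read the setting
 * back. Returns the stored setting (1 when set), or -1 on error. */
int become_subreaper(void)
{
    int is_subreaper = 0;

    if (prctl(PR_SET_CHILD_SUBREAPER, 1UL, 0UL, 0UL, 0UL) == -1)
        return -1;
    if (prctl(PR_GET_CHILD_SUBREAPER, (unsigned long)&is_subreaper,
              0UL, 0UL, 0UL) == -1)
        return -1;
    return is_subreaper;
}
```

After this call, orphaned descendants are reparented to the caller, which should then be prepared to wait(2) for them.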
PR_SET_DUMPABLE (since Linux 2.3.20)
Set the state of the “dumpable” attribute, which determines whether core dumps are produced for the calling process upon delivery of a signal whose default behavior is to produce a core dump.
Up to and including Linux 2.6.12, arg2 must be either 0 (SUID_DUMP_DISABLE, process is not dumpable) or 1 (SUID_DUMP_USER, process is dumpable). Between Linux 2.6.13 and Linux 2.6.17, the value 2 was also permitted, which caused any binary which normally would not be dumped to be dumped readable by root only; for security reasons, this feature has been removed. (See also the description of /proc/sys/fs/suid_dumpable in proc(5).)
Normally, the “dumpable” attribute is set to 1. However, it is reset to the current value contained in the file /proc/sys/fs/suid_dumpable (which by default has the value 0), in the following circumstances:
The process’s effective user or group ID is changed.
The process’s filesystem user or group ID is changed (see credentials(7)).
The process executes (execve(2)) a set-user-ID or set-group-ID program, resulting in a change of either the effective user ID or the effective group ID.
The process executes (execve(2)) a program that has file capabilities (see capabilities(7)), but only if the permitted capabilities gained exceed those already permitted for the process.
Processes that are not dumpable cannot be attached via ptrace(2) PTRACE_ATTACH; see ptrace(2) for further details.
If a process is not dumpable, the ownership of files in the process’s /proc/pid directory is affected as described in proc(5).
PR_GET_DUMPABLE (since Linux 2.3.20)
Return (as the function result) the current state of the calling process’s dumpable attribute.
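The two dumpable operations can be combined as in the sketch below, which sets the attribute and then reads it back. The function name is hypothetical; a process handling secrets might call it with 0 early in startup.

```c
#include <sys/prctl.h>

/* Set the "dumpable" attribute to value (0 or 1) and return the state
 * reported by PR_GET_DUMPABLE afterwards, or -1 on error. */
int set_dumpable(int value)
{
    if (prctl(PR_SET_DUMPABLE, (unsigned long)value, 0UL, 0UL, 0UL) == -1)
        return -1;
    return prctl(PR_GET_DUMPABLE, 0UL, 0UL, 0UL, 0UL);
}
```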
PR_SET_ENDIAN (since Linux 2.6.18, PowerPC only)
Set the endian-ness of the calling process to the value given in arg2, which should be one of the following: PR_ENDIAN_BIG, PR_ENDIAN_LITTLE, or PR_ENDIAN_PPC_LITTLE (PowerPC pseudo little endian).
PR_GET_ENDIAN (since Linux 2.6.18, PowerPC only)
Return the endian-ness of the calling process, in the location pointed to by (int *) arg2.
PR_SET_FP_MODE (since Linux 4.0, only on MIPS)
On the MIPS architecture, user-space code can be built using an ABI which permits linking with code that has more restrictive floating-point (FP) requirements. For example, user-space code may be built to target the O32 FPXX ABI and linked with code built for either one of the more restrictive FP32 or FP64 ABIs. When more restrictive code is linked in, the overall requirement for the process is to use the more restrictive floating-point mode.
Because the kernel has no means of knowing in advance which mode the process should be executed in, and because these restrictions can change over the lifetime of the process, the PR_SET_FP_MODE operation is provided to allow control of the floating-point mode from user space.
The (unsigned int) arg2 argument is a bit mask describing the floating-point mode used:
PR_FP_MODE_FR
When this bit is unset (so called FR=0 or FR0 mode), the 32 floating-point registers are 32 bits wide, and 64-bit registers are represented as a pair of registers (even- and odd- numbered, with the even-numbered register containing the lower 32 bits, and the odd-numbered register containing the higher 32 bits).
When this bit is set (on supported hardware), the 32 floating-point registers are 64 bits wide (so called FR=1 or FR1 mode). Note that modern MIPS implementations (MIPS R6 and newer) support FR=1 mode only.
Applications that use the O32 FP32 ABI can operate only when this bit is unset (FR=0; or they can be used with FRE enabled, see below). Applications that use the O32 FP64 ABI (and the O32 FP64A ABI, which exists to provide the ability to operate with existing FP32 code; see below) can operate only when this bit is set (FR=1). Applications that use the O32 FPXX ABI can operate with either FR=0 or FR=1.
PR_FP_MODE_FRE
Enable emulation of 32-bit floating-point mode. When this mode is enabled, it emulates 32-bit floating-point operations by raising a reserved-instruction exception on every instruction that uses 32-bit formats and the kernel then handles the instruction in software. (The problem lies in the discrepancy of handling odd-numbered registers which are the high 32 bits of 64-bit registers with even numbers in FR=0 mode and the lower 32-bit parts of odd-numbered 64-bit registers in FR=1 mode.) Enabling this bit is necessary when code with the O32 FP32 ABI should operate with code with the compatible O32 FPXX or O32 FP64A ABIs (which require FR=1 FPU mode) or when it is executed on newer hardware (MIPS R6 onwards) which lacks FR=0 mode support when a binary with the FP32 ABI is used.
Note that this mode makes sense only when the FPU is in 64-bit mode (FR=1).
Note that the use of emulation inherently has a significant performance hit and should be avoided if possible.
In the N32/N64 ABI, 64-bit floating-point mode is always used, so FPU emulation is not required and the FPU always operates in FR=1 mode.
This operation is mainly intended for use by the dynamic linker (ld.so(8)).
The arguments arg3, arg4, and arg5 are ignored.
PR_GET_FP_MODE (since Linux 4.0, only on MIPS)
Return (as the function result) the current floating-point mode (see the description of PR_SET_FP_MODE for details).
On success, the call returns a bit mask which represents the current floating-point mode.
The arguments arg2, arg3, arg4, and arg5 are ignored.
PR_SET_FPEMU (since Linux 2.4.18, 2.5.9, only on ia64)
Set floating-point emulation control bits to arg2. Pass PR_FPEMU_NOPRINT to silently emulate floating-point operation accesses, or PR_FPEMU_SIGFPE to not emulate floating-point operations and send SIGFPE instead.
PR_GET_FPEMU (since Linux 2.4.18, 2.5.9, only on ia64)
Return floating-point emulation control bits, in the location pointed to by (int *) arg2.
PR_SET_FPEXC (since Linux 2.4.21, 2.5.32, only on PowerPC)
Set floating-point exception mode to arg2. Pass PR_FP_EXC_SW_ENABLE to use FPEXC for FP exception enables, PR_FP_EXC_DIV for floating-point divide by zero, PR_FP_EXC_OVF for floating-point overflow, PR_FP_EXC_UND for floating-point underflow, PR_FP_EXC_RES for floating-point inexact result, PR_FP_EXC_INV for floating-point invalid operation, PR_FP_EXC_DISABLED for FP exceptions disabled, PR_FP_EXC_NONRECOV for async nonrecoverable exception mode, PR_FP_EXC_ASYNC for async recoverable exception mode, PR_FP_EXC_PRECISE for precise exception mode.
PR_GET_FPEXC (since Linux 2.4.21, 2.5.32, only on PowerPC)
Return floating-point exception mode, in the location pointed to by (int *) arg2.
PR_SET_IO_FLUSHER (since Linux 5.6)
If a user process is involved in the block layer or filesystem I/O path, and can allocate memory while processing I/O requests, it must set arg2 to 1. This will put the process in the IO_FLUSHER state, which allows it special treatment to make progress when allocating memory. If arg2 is 0, the process will clear the IO_FLUSHER state, and the default behavior will be used.
The calling process must have the CAP_SYS_RESOURCE capability.
arg3, arg4, and arg5 must be zero.
The IO_FLUSHER state is inherited by a child process created via fork(2) and is preserved across execve(2).
Examples of IO_FLUSHER applications are FUSE daemons, SCSI device emulation daemons, and daemons that perform error handling like multipath path recovery applications.
PR_GET_IO_FLUSHER (since Linux 5.6)
Return (as the function result) the IO_FLUSHER state of the caller. A value of 1 indicates that the caller is in the IO_FLUSHER state; 0 indicates that the caller is not in the IO_FLUSHER state.
The calling process must have the CAP_SYS_RESOURCE capability.
arg2, arg3, arg4, and arg5 must be zero.
PR_SET_KEEPCAPS (since Linux 2.2.18)
Set the state of the calling thread’s “keep capabilities” flag. The effect of this flag is described in capabilities(7). arg2 must be either 0 (clear the flag) or 1 (set the flag). The “keep capabilities” value will be reset to 0 on subsequent calls to execve(2).
PR_GET_KEEPCAPS (since Linux 2.2.18)
Return (as the function result) the current state of the calling thread’s “keep capabilities” flag. See capabilities(7) for a description of this flag.
PR_MCE_KILL (since Linux 2.6.32)
Set the machine check memory corruption kill policy for the calling thread. If arg2 is PR_MCE_KILL_CLEAR, clear the thread memory corruption kill policy and use the system-wide default. (The system-wide default is defined by /proc/sys/vm/memory_failure_early_kill; see proc(5).) If arg2 is PR_MCE_KILL_SET, use a thread-specific memory corruption kill policy. In this case, arg3 defines whether the policy is early kill (PR_MCE_KILL_EARLY), late kill (PR_MCE_KILL_LATE), or the system-wide default (PR_MCE_KILL_DEFAULT). Early kill means that the thread receives a SIGBUS signal as soon as hardware memory corruption is detected inside its address space. In late kill mode, the process is killed only when it accesses a corrupted page. See sigaction(2) for more information on the SIGBUS signal. The policy is inherited by children. The remaining unused prctl() arguments must be zero for future compatibility.
PR_MCE_KILL_GET (since Linux 2.6.32)
Return (as the function result) the current per-process machine check kill policy. All unused prctl() arguments must be zero.
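As a sketch of the two machine check operations, the function below requests the early kill policy for the calling thread and then reads the policy back. The function name is hypothetical; all unused arguments are passed as zero, as the descriptions above require.

```c
#include <sys/prctl.h>

/* Select the thread-specific "early kill" policy and return the policy
 * reported by PR_MCE_KILL_GET afterwards, or -1 on error. */
int set_mce_early_kill(void)
{
    if (prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_SET,
              (unsigned long)PR_MCE_KILL_EARLY, 0UL, 0UL) == -1)
        return -1;
    return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}
```

A caller choosing this policy would normally also install a SIGBUS handler, since that signal is delivered as soon as corruption is detected.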
PR_SET_MM (since Linux 3.3)
Modify certain kernel memory map descriptor fields of the calling process. Usually these fields are set by the kernel and dynamic loader (see ld.so(8) for more information) and a regular application should not use this feature. However, there are cases, such as self-modifying programs, where a program might find it useful to change its own memory map.
The calling process must have the CAP_SYS_RESOURCE capability. The value in arg2 is one of the options below, while arg3 provides a new value for the option. The arg4 and arg5 arguments must be zero if unused.
Before Linux 3.10, this feature is available only if the kernel is built with the CONFIG_CHECKPOINT_RESTORE option enabled.
PR_SET_MM_START_CODE
Set the address above which the program text can run. The corresponding memory area must be readable and executable, but not writable or shareable (see mprotect(2) and mmap(2) for more information).
PR_SET_MM_END_CODE
Set the address below which the program text can run. The corresponding memory area must be readable and executable, but not writable or shareable.
PR_SET_MM_START_DATA
Set the address above which initialized and uninitialized (bss) data are placed. The corresponding memory area must be readable and writable, but not executable or shareable.
PR_SET_MM_END_DATA
Set the address below which initialized and uninitialized (bss) data are placed. The corresponding memory area must be readable and writable, but not executable or shareable.
PR_SET_MM_START_STACK
Set the start address of the stack. The corresponding memory area must be readable and writable.
PR_SET_MM_START_BRK
Set the address above which the program heap can be expanded with a brk(2) call. The address must be greater than the ending address of the current program data segment. In addition, the combined size of the resulting heap and the size of the data segment can’t exceed the RLIMIT_DATA resource limit (see setrlimit(2)).
PR_SET_MM_BRK
Set the current brk(2) value. The requirements for the address are the same as for the PR_SET_MM_START_BRK option.
The following options are available since Linux 3.5.
PR_SET_MM_ARG_START
Set the address above which the program command line is placed.
PR_SET_MM_ARG_END
Set the address below which the program command line is placed.
PR_SET_MM_ENV_START
Set the address above which the program environment is placed.
PR_SET_MM_ENV_END
Set the address below which the program environment is placed.
The address passed with PR_SET_MM_ARG_START, PR_SET_MM_ARG_END, PR_SET_MM_ENV_START, and PR_SET_MM_ENV_END should belong to a process stack area. Thus, the corresponding memory area must be readable, writable, and (depending on the kernel configuration) have the MAP_GROWSDOWN attribute set (see mmap(2)).
PR_SET_MM_AUXV
Set a new auxiliary vector. The arg3 argument should provide the address of the vector, and arg4 its size.
PR_SET_MM_EXE_FILE
Supersede the /proc/pid/exe symbolic link with a new one pointing to a new executable file identified by the file descriptor provided in the arg3 argument. The file descriptor should be obtained with a regular open(2) call.
To change the symbolic link, one needs to unmap all existing executable memory areas, including those created by the kernel itself (for example the kernel usually creates at least one executable memory area for the ELF .text section).
In Linux 4.9 and earlier, the PR_SET_MM_EXE_FILE operation can be performed only once in a process’s lifetime; attempting to perform the operation a second time results in the error EPERM. This restriction was enforced for security reasons that were subsequently deemed specious, and the restriction was removed in Linux 4.10 because some user-space applications needed to perform this operation more than once.
The following options are available since Linux 3.18.
PR_SET_MM_MAP
Provides one-shot access to all the addresses by passing in a struct prctl_mm_map (as defined in <linux/prctl.h>). The arg4 argument should provide the size of the struct.
This feature is available only if the kernel is built with the CONFIG_CHECKPOINT_RESTORE option enabled.
PR_SET_MM_MAP_SIZE
Returns the size of the struct prctl_mm_map the kernel expects. This allows user space to find a compatible struct. The arg4 argument should be a pointer to an unsigned int.
This feature is available only if the kernel is built with the CONFIG_CHECKPOINT_RESTORE option enabled.
PR_SET_VMA (since Linux 5.17)
Sets an attribute specified in arg2 for virtual memory areas starting from the address specified in arg3 and spanning the size specified in arg4. arg5 specifies the value of the attribute to be set.
Note that assigning an attribute to a virtual memory area might prevent it from being merged with adjacent virtual memory areas due to the difference in that attribute’s value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should be a pointer to a null-terminated string containing the name. The name length including null byte cannot exceed 80 bytes. If arg5 is NULL, the name of the appropriate anonymous virtual memory areas will be reset. The name can contain only printable ASCII characters (including space), except ‘[’, ‘]’, ‘\’, ‘$’, and ‘`’.
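The naming rules can be checked in user space before calling prctl(). The following is a minimal sketch: the helper names are hypothetical, the fallback constant values are those of <linux/prctl.h>, and a kernel built without anonymous VMA naming support is assumed to reject the call, in which case the mapping is simply left unnamed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Fallback definitions for older headers. */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

/* Hypothetical helper: apply the documented rules, that is, at most
 * 80 bytes including the null byte, printable ASCII only, and none
 * of [ ] \ $ `. Returns 1 if the name is acceptable, 0 otherwise. */
int anon_vma_name_is_valid(const char *name)
{
    size_t len = strlen(name);

    if (len + 1 > 80)
        return 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        if (c < 0x20 || c > 0x7e || strchr("[]\\$`", c) != NULL)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: map `len` bytes of anonymous memory and try to
 * label it. Returns the mapping, or NULL if mmap(2) failed. *named is
 * set to 1 only if the kernel accepted the name. */
void *map_named_region(size_t len, const char *name, int *named)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    *named = 0;
    if (p == MAP_FAILED)
        return NULL;
    if (anon_vma_name_is_valid(name) &&
        prctl(PR_SET_VMA, (unsigned long)PR_SET_VMA_ANON_NAME,
              (unsigned long)p, (unsigned long)len,
              (unsigned long)name) == 0)
        *named = 1;
    return p;
}
```

When the call succeeds, the region is expected to be listed with its name in /proc/pid/maps.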
PR_MPX_ENABLE_MANAGEMENT
PR_MPX_DISABLE_MANAGEMENT (since Linux 3.19, removed in Linux 5.4; only on x86)
Enable or disable kernel management of Memory Protection eXtensions (MPX) bounds tables. The arg2, arg3, arg4, and arg5 arguments must be zero.
MPX is a hardware-assisted mechanism for performing bounds checking on pointers. It consists of a set of registers storing bounds information and a set of special instruction prefixes that tell the CPU on which instructions it should do bounds enforcement. There is a limited number of these registers and when there are more pointers than registers, their contents must be “spilled” into a set of tables. These tables are called “bounds tables” and the MPX prctl() operations control whether the kernel manages their allocation and freeing.
When management is enabled, the kernel will take over allocation and freeing of the bounds tables. It does this by trapping the #BR exceptions that result at first use of missing bounds tables and instead of delivering the exception to user space, it allocates the table and populates the bounds directory with the location of the new table. For freeing, the kernel checks to see if bounds tables are present for memory which is not allocated, and frees them if so.
Before enabling MPX management using PR_MPX_ENABLE_MANAGEMENT, the application must first have allocated a user-space buffer for the bounds directory and placed the location of that directory in the bndcfgu register.
These calls fail if the CPU or kernel does not support MPX. Kernel support for MPX is enabled via the CONFIG_X86_INTEL_MPX configuration option. You can check whether the CPU supports MPX by looking for the mpx CPUID bit, like with the following command:
cat /proc/cpuinfo | grep ' mpx '
A thread may not switch in or out of long (64-bit) mode while MPX is enabled.
All threads in a process are affected by these calls.
The child of a fork(2) inherits the state of MPX management. During execve(2), MPX management is reset to a state as if PR_MPX_DISABLE_MANAGEMENT had been called.
For further information on Intel MPX, see the kernel source file Documentation/x86/intel_mpx.txt.
Due to a lack of toolchain support, PR_MPX_ENABLE_MANAGEMENT and PR_MPX_DISABLE_MANAGEMENT are not supported in Linux 5.4 and later.
PR_SET_NAME (since Linux 2.6.9)
Set the name of the calling thread, using the value in the location pointed to by (char *) arg2. The name can be up to 16 bytes long, including the terminating null byte. (If the length of the string, including the terminating null byte, exceeds 16 bytes, the string is silently truncated.) This is the same attribute that can be set via pthread_setname_np(3) and retrieved using pthread_getname_np(3). The attribute is likewise accessible via /proc/self/task/tid/comm (see proc(5)), where tid is the thread ID of the calling thread, as returned by gettid(2).
PR_GET_NAME (since Linux 2.6.11)
Return the name of the calling thread, in the buffer pointed to by (char *) arg2. The buffer should allow space for up to 16 bytes; the returned string will be null-terminated.
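A short sketch of setting and reading back the thread name follows. The helper rename_thread() is hypothetical; it shows the 16-byte limit described above.

```c
#include <string.h>
#include <sys/prctl.h>

/* Hypothetical helper: set the calling thread's name, then read it
 * back into out[16]. Returns 0 on success, -1 on error. Names longer
 * than 15 characters are silently truncated by the kernel. */
int rename_thread(const char *name, char out[16])
{
    if (prctl(PR_SET_NAME, (unsigned long)name, 0UL, 0UL, 0UL) == -1)
        return -1;
    memset(out, 0, 16);
    return prctl(PR_GET_NAME, (unsigned long)out, 0UL, 0UL, 0UL);
}
```

Calling rename_thread("a-very-long-thread-name", buf) should leave the first 15 characters of that string in buf.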
PR_SET_NO_NEW_PRIVS (since Linux 3.5)
Set the calling thread’s no_new_privs attribute to the value in arg2. With no_new_privs set to 1, execve(2) promises not to grant privileges to do anything that could not have been done without the execve(2) call (for example, rendering the set-user-ID and set-group-ID mode bits, and file capabilities non-functional). Once set, the no_new_privs attribute cannot be unset. The setting of this attribute is inherited by children created by fork(2) and clone(2), and preserved across execve(2).
Since Linux 4.10, the value of a thread’s no_new_privs attribute can be viewed via the NoNewPrivs field in the /proc/pid/status file.
For more information, see the kernel source file Documentation/userspace-api/no_new_privs.rst (or Documentation/prctl/no_new_privs.txt before Linux 4.13). See also seccomp(2).
PR_GET_NO_NEW_PRIVS (since Linux 3.5)
Return (as the function result) the value of the no_new_privs attribute for the calling thread. A value of 0 indicates the regular execve(2) behavior. A value of 1 indicates execve(2) will operate in the privilege-restricting mode described above.
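A typical use is to set the attribute once, early in a program, before installing a seccomp filter or executing untrusted code. The sketch below uses a hypothetical helper name and only shows the set-then-verify pattern.

```c
#include <sys/prctl.h>

/* Hypothetical helper: irreversibly set no_new_privs for the calling
 * thread and return the value read back (1 on success), or -1 on
 * error. */
int enable_no_new_privs(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL) == -1)
        return -1;
    return prctl(PR_GET_NO_NEW_PRIVS, 0UL, 0UL, 0UL, 0UL);
}
```

Because the attribute cannot be unset, calling the helper a second time is harmless and returns the same result.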
PR_PAC_RESET_KEYS (since Linux 5.0, only on arm64)
Securely reset the thread’s pointer authentication keys to fresh random values generated by the kernel.
The set of keys to be reset is specified by arg2, which must be a logical OR of zero or more of the following:
PR_PAC_APIAKEY
instruction authentication key A
PR_PAC_APIBKEY
instruction authentication key B
PR_PAC_APDAKEY
data authentication key A
PR_PAC_APDBKEY
data authentication key B
PR_PAC_APGAKEY
generic authentication “A” key.
(Yes folks, there really is no generic B key.)
As a special case, if arg2 is zero, then all the keys are reset. Since new keys could be added in future, this is the recommended way to completely wipe the existing keys when establishing a clean execution context. Note that there is no need to use PR_PAC_RESET_KEYS in preparation for calling execve(2), since execve(2) resets all the pointer authentication keys.
The remaining arguments arg3, arg4, and arg5 must all be zero.
If the arguments are invalid, and in particular if arg2 contains set bits that are unrecognized or that correspond to a key not available on this platform, then the call fails with error EINVAL.
Warning: Because the compiler or run-time environment may be using some or all of the keys, a successful PR_PAC_RESET_KEYS may crash the calling process. The conditions for using it safely are complex and system-dependent. Don’t use it unless you know what you are doing.
For more information, see the kernel source file Documentation/arm64/pointer-authentication.rst (or Documentation/arm64/pointer-authentication.txt before Linux 5.3).
PR_SET_PDEATHSIG (since Linux 2.1.57)
Set the parent-death signal of the calling process to arg2 (either a signal value in the range [1, NSIG - 1], or 0 to clear). This is the signal that the calling process will get when its parent dies.
Warning: the “parent” in this case is considered to be the thread that created this process. In other words, the signal will be sent when that thread terminates (via, for example, pthread_exit(3)), rather than after all of the threads in the parent process terminate.
The parent-death signal is sent upon subsequent termination of the parent thread and also upon termination of each subreaper process (see the description of PR_SET_CHILD_SUBREAPER above) to which the caller is subsequently reparented. If the parent thread and all ancestor subreapers have already terminated by the time of the PR_SET_PDEATHSIG operation, then no parent-death signal is sent to the caller.
The parent-death signal is process-directed (see signal(7)) and, if the child installs a handler using the sigaction(2) SA_SIGINFO flag, the si_pid field of the siginfo_t argument of the handler contains the PID of the terminating parent process.
The parent-death signal setting is cleared for the child of a fork(2). It is also (since Linux 2.4.36 / 2.6.23) cleared when executing a set-user-ID or set-group-ID binary, or a binary that has associated capabilities (see capabilities(7)); otherwise, this value is preserved across execve(2). The parent-death signal setting is also cleared upon changes to any of the following thread credentials: effective user ID, effective group ID, filesystem user ID, or filesystem group ID.
PR_GET_PDEATHSIG (since Linux 2.3.15)
Return the current value of the parent process death signal, in the location pointed to by (int *) arg2.
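The set and get operations can be exercised together as follows. This is a minimal sketch with a hypothetical helper name. Note that PR_GET_PDEATHSIG stores its result through a pointer rather than returning it.

```c
#include <signal.h>
#include <sys/prctl.h>

/* Hypothetical helper: request `sig` on parent death (0 clears the
 * setting) and return the value the kernel has stored, or -1 on
 * error. */
int set_parent_death_signal(int sig)
{
    int current = -1;

    if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig, 0UL, 0UL, 0UL) == -1)
        return -1;
    if (prctl(PR_GET_PDEATHSIG, (unsigned long)&current,
              0UL, 0UL, 0UL) == -1)
        return -1;
    return current;
}
```

A child would usually call this right after fork(2). Since the setting has no effect if the parent has already terminated, a common precaution is to compare getppid(2) with the expected parent PID afterwards.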
PR_SET_PTRACER (since Linux 3.4)
This is meaningful only when the Yama LSM is enabled and in mode 1 (“restricted ptrace”, visible via /proc/sys/kernel/yama/ptrace_scope). When a “ptracer process ID” is passed in arg2, the caller is declaring that the ptracer process can ptrace(2) the calling process as if it were a direct process ancestor. Each PR_SET_PTRACER operation replaces the previous “ptracer process ID”. Employing PR_SET_PTRACER with arg2 set to 0 clears the caller’s “ptracer process ID”. If arg2 is PR_SET_PTRACER_ANY, the ptrace restrictions introduced by Yama are effectively disabled for the calling process.
For further information, see the kernel source file Documentation/admin-guide/LSM/Yama.rst (or Documentation/security/Yama.txt before Linux 4.13).
PR_SET_SECCOMP (since Linux 2.6.23)
Set the secure computing (seccomp) mode for the calling thread, to limit the available system calls. The more recent seccomp(2) system call provides a superset of the functionality of PR_SET_SECCOMP, and is the preferred interface for new applications.
The seccomp mode is selected via arg2. (The seccomp constants are defined in <linux/seccomp.h>.) The following values can be specified:
SECCOMP_MODE_STRICT (since Linux 2.6.23)
See the description of SECCOMP_SET_MODE_STRICT in seccomp(2).
This operation is available only if the kernel is configured with CONFIG_SECCOMP enabled.
SECCOMP_MODE_FILTER (since Linux 3.5)
The allowed system calls are defined by a pointer to a Berkeley Packet Filter passed in arg3. This argument is a pointer to struct sock_fprog; it can be designed to filter arbitrary system calls and system call arguments. See the description of SECCOMP_SET_MODE_FILTER in seccomp(2).
This operation is available only if the kernel is configured with CONFIG_SECCOMP_FILTER enabled.
For further details on seccomp filtering, see seccomp(2).
PR_GET_SECCOMP (since Linux 2.6.23)
Return (as the function result) the secure computing mode of the calling thread. If the caller is not in secure computing mode, this operation returns 0; if the caller is in strict secure computing mode, then the prctl() call will cause a SIGKILL signal to be sent to the process. If the caller is in filter mode, and this system call is allowed by the seccomp filters, it returns 2; otherwise, the process is killed with a SIGKILL signal.
This operation is available only if the kernel is configured with CONFIG_SECCOMP enabled.
Since Linux 3.8, the Seccomp field of the /proc/pid/status file provides a method of obtaining the same information, without the risk that the process is killed; see proc(5).
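Reading the Seccomp field can be sketched as follows. The helper name is hypothetical, and the code assumes that the field appears in /proc/self/status as a line of the form "Seccomp:" followed by a decimal mode number.

```c
#include <stdio.h>

/* Hypothetical helper: parse the "Seccomp:" field of
 * /proc/self/status. Returns 0 (disabled), 1 (strict), 2 (filter),
 * or -1 if the file or the field is not available. */
int seccomp_mode_from_proc(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    int mode = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* Matches only the exact "Seccomp:" key. */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }
    fclose(f);
    return mode;
}
```

Unlike PR_GET_SECCOMP, this approach makes no prctl() call, so it cannot trigger a SIGKILL in strict mode as long as the file is already open or the needed system calls are permitted.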
PR_SET_SECUREBITS (since Linux 2.6.26)
Set the “securebits” flags of the calling thread to the value supplied in arg2. See capabilities(7).
PR_GET_SECUREBITS (since Linux 2.6.26)
Return (as the function result) the “securebits” flags of the calling thread. See capabilities(7).
PR_GET_SPECULATION_CTRL (since Linux 4.17)
Return (as the function result) the state of the speculation misfeature specified in arg2. Currently, the only permitted value for this argument is PR_SPEC_STORE_BYPASS (otherwise the call fails with the error ENODEV).
The return value uses bits 0-4 with the following meaning:
PR_SPEC_PRCTL
Mitigation can be controlled per thread by PR_SET_SPECULATION_CTRL.
PR_SPEC_ENABLE
The speculation feature is enabled, mitigation is disabled.
PR_SPEC_DISABLE
The speculation feature is disabled, mitigation is enabled.
PR_SPEC_FORCE_DISABLE
Same as PR_SPEC_DISABLE but cannot be undone.
PR_SPEC_DISABLE_NOEXEC (since Linux 5.1)
Same as PR_SPEC_DISABLE, but the state will be cleared on execve(2).
If all bits are 0, then the CPU is not affected by the speculation misfeature.
If PR_SPEC_PRCTL is set, then per-thread control of the mitigation is available. If not set, prctl() for the speculation misfeature will fail.
The arg3, arg4, and arg5 arguments must be specified as 0; otherwise the call fails with the error EINVAL.
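The returned bit mask can be reduced to a simple classification. The sketch below is one possible interpretation, not the kernel's own: the enum and function names are hypothetical, and the fallback constant values are those of <linux/prctl.h> for systems with older headers.

```c
#include <sys/prctl.h>

/* Fallback definitions for older headers. */
#ifndef PR_GET_SPECULATION_CTRL
#define PR_GET_SPECULATION_CTRL 52
#endif
#ifndef PR_SPEC_STORE_BYPASS
#define PR_SPEC_STORE_BYPASS 0
#endif
#ifndef PR_SPEC_PRCTL
#define PR_SPEC_PRCTL         (1UL << 0)
#define PR_SPEC_ENABLE        (1UL << 1)
#define PR_SPEC_DISABLE       (1UL << 2)
#define PR_SPEC_FORCE_DISABLE (1UL << 3)
#endif
#ifndef PR_SPEC_DISABLE_NOEXEC
#define PR_SPEC_DISABLE_NOEXEC (1UL << 4)
#endif

/* Hypothetical classification of a PR_GET_SPECULATION_CTRL result. */
enum spec_state {
    SPEC_ERROR = -1,        /* call failed or unexpected bits */
    SPEC_NOT_AFFECTED = 0,  /* all bits 0 */
    SPEC_MITIGATION_OFF = 1,
    SPEC_MITIGATION_ON = 2
};

enum spec_state classify_spec_ctrl(int ret)
{
    unsigned long bits;

    if (ret < 0)
        return SPEC_ERROR;
    bits = (unsigned long)ret;
    if (bits == 0)
        return SPEC_NOT_AFFECTED;
    if (bits & (PR_SPEC_DISABLE | PR_SPEC_FORCE_DISABLE |
                PR_SPEC_DISABLE_NOEXEC))
        return SPEC_MITIGATION_ON;
    if (bits & PR_SPEC_ENABLE)
        return SPEC_MITIGATION_OFF;
    return SPEC_ERROR;
}

/* Query the speculative store bypass state of the calling thread. */
enum spec_state store_bypass_state(void)
{
    return classify_spec_ctrl(prctl(PR_GET_SPECULATION_CTRL,
                                    (unsigned long)PR_SPEC_STORE_BYPASS,
                                    0UL, 0UL, 0UL));
}
```

A caller that also needs to know whether per-thread control is possible can test the PR_SPEC_PRCTL bit of the raw return value.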
PR_SET_SPECULATION_CTRL (since Linux 4.17)
Sets the state of the speculation misfeature specified in arg2. The speculation-misfeature settings are per-thread attributes.
Currently, arg2 must be one of:
PR_SPEC_STORE_BYPASS
Set the state of the speculative store bypass misfeature.
PR_SPEC_INDIRECT_BRANCH (since Linux 4.20)
Set the state of the indirect branch speculation misfeature.
If arg2 does not have one of the above values, then the call fails with the error ENODEV.
The arg3 argument is used to hand in the control value, which is one of the following:
PR_SPEC_ENABLE
The speculation feature is enabled, mitigation is disabled.
PR_SPEC_DISABLE
The speculation feature is disabled, mitigation is enabled.
PR_SPEC_FORCE_DISABLE
Same as PR_SPEC_DISABLE, but cannot be undone. A subsequent PR_SET_SPECULATION_CTRL call with the same value for arg2 and PR_SPEC_ENABLE in arg3 will fail with the error EPERM.
PR_SPEC_DISABLE_NOEXEC (since Linux 5.1)
Same as PR_SPEC_DISABLE, but the state will be cleared on execve(2). Currently only supported for arg2 equal to PR_SPEC_STORE_BYPASS.
Any unsupported value in arg3 will result in the call failing with the error ERANGE.
The arg4 and arg5 arguments must be specified as 0; otherwise the call fails with the error EINVAL.
The speculation feature can also be controlled by the spec_store_bypass_disable boot parameter. This parameter may enforce a read-only policy which will result in the prctl() call failing with the error ENXIO. For further details, see the kernel source file Documentation/admin-guide/kernel-parameters.txt.
PR_SVE_SET_VL (since Linux 4.15, only on arm64)
Configure the thread’s SVE vector length, as specified by (int) arg2. Arguments arg3, arg4, and arg5 are ignored.
The bits of arg2 corresponding to PR_SVE_VL_LEN_MASK must be set to the desired vector length in bytes. This is interpreted as an upper bound: the kernel will select the greatest available vector length that does not exceed the value specified. In particular, specifying SVE_VL_MAX (defined in <asm/sigcontext.h>) for the PR_SVE_VL_LEN_MASK bits requests the maximum supported vector length.
In addition, the other bits of arg2 must be set to one of the following combinations of flags:
0
Perform the change immediately. At the next execve(2) in the thread, the vector length will be reset to the value configured in /proc/sys/abi/sve_default_vector_length.
PR_SVE_VL_INHERIT
Perform the change immediately. Subsequent execve(2) calls will preserve the new vector length.
PR_SVE_SET_VL_ONEXEC
Defer the change, so that it is performed at the next execve(2) in the thread. Further execve(2) calls will reset the vector length to the value configured in /proc/sys/abi/sve_default_vector_length.
PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT
Defer the change, so that it is performed at the next execve(2) in the thread. Further execve(2) calls will preserve the new vector length.
In all cases, any previously pending deferred change is canceled.
The call fails with error EINVAL if SVE is not supported on the platform, if arg2 is unrecognized or invalid, or the value in the bits of arg2 corresponding to PR_SVE_VL_LEN_MASK is outside the range SVE_VL_MIN..SVE_VL_MAX or is not a multiple of 16.
On success, a nonnegative value is returned that describes the selected configuration. If PR_SVE_SET_VL_ONEXEC was included in arg2, then the configuration described by the return value will take effect at the next execve(2). Otherwise, the configuration is already in effect when the PR_SVE_SET_VL call returns. In either case, the value is encoded in the same way as the return value of PR_SVE_GET_VL. Note that there is no explicit flag in the return value corresponding to PR_SVE_SET_VL_ONEXEC.
The configuration (including any pending deferred change) is inherited across fork(2) and clone(2).
For more information, see the kernel source file Documentation/arm64/sve.rst (or Documentation/arm64/sve.txt before Linux 5.3).
Warning: Because the compiler or run-time environment may be using SVE, using this call without the PR_SVE_SET_VL_ONEXEC flag may crash the calling process. The conditions for using it safely are complex and system-dependent. Don’t use it unless you really know what you are doing.
PR_SVE_GET_VL (since Linux 4.15, only on arm64)
Get the thread’s current SVE vector length configuration.
Arguments arg2, arg3, arg4, and arg5 are ignored.
Provided that the kernel and platform support SVE, this operation always succeeds, returning a nonnegative value that describes the current configuration. The bits corresponding to PR_SVE_VL_LEN_MASK contain the currently configured vector length in bytes. The bit corresponding to PR_SVE_VL_INHERIT indicates whether the vector length will be inherited across execve(2).
Note that there is no way to determine whether there is a pending vector length change that has not yet taken effect.
For more information, see the kernel source file Documentation/arm64/sve.rst (or Documentation/arm64/sve.txt before Linux 5.3).
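The value returned by PR_SVE_GET_VL (and by a successful PR_SVE_SET_VL) can be unpacked as shown below. This is a sketch only: the struct and function names are hypothetical, and the fallback constant values are those of <linux/prctl.h>. On platforms without SVE the query is expected to fail.

```c
#include <sys/prctl.h>

/* Fallback definitions for older headers. */
#ifndef PR_SVE_GET_VL
#define PR_SVE_GET_VL 51
#endif
#ifndef PR_SVE_VL_LEN_MASK
#define PR_SVE_VL_LEN_MASK 0xffff
#endif
#ifndef PR_SVE_VL_INHERIT
#define PR_SVE_VL_INHERIT (1 << 17)
#endif

/* Hypothetical decoded form of the return value. */
struct sve_config {
    int vector_length;  /* in bytes */
    int inherit;        /* 1 if preserved across execve(2) */
};

struct sve_config decode_sve_config(int ret)
{
    struct sve_config c;

    c.vector_length = ret & PR_SVE_VL_LEN_MASK;
    c.inherit = (ret & PR_SVE_VL_INHERIT) != 0;
    return c;
}

/* Query the current configuration of the calling thread.
 * Returns 0 on success, -1 if the call fails. */
int query_sve_config(struct sve_config *out)
{
    int ret = prctl(PR_SVE_GET_VL, 0UL, 0UL, 0UL, 0UL);

    if (ret < 0)
        return -1;
    *out = decode_sve_config(ret);
    return 0;
}
```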
PR_SET_SYSCALL_USER_DISPATCH (since Linux 5.11, x86 only)
Configure the Syscall User Dispatch mechanism for the calling thread. This mechanism allows an application to selectively intercept system calls so that they can be handled within the application itself. Interception takes the form of a thread-directed SIGSYS signal that is delivered to the thread when it makes a system call. If intercepted, the system call is not executed by the kernel.
To enable this mechanism, arg2 should be set to PR_SYS_DISPATCH_ON. Once enabled, further system calls will be selectively intercepted, depending on a control variable provided by user space. In this case, arg3 and arg4 respectively identify the offset and length of a single contiguous memory region in the process address space from where system calls are always allowed to be executed, regardless of the control variable. (Typically, this area would include the area of memory containing the C library.)
arg5 points to a char-sized variable that is a fast switch to allow/block system call execution without the overhead of doing another system call to re-configure Syscall User Dispatch. This control variable can either be set to SYSCALL_DISPATCH_FILTER_BLOCK to block system calls from executing or to SYSCALL_DISPATCH_FILTER_ALLOW to temporarily allow them to be executed. This value is checked by the kernel on every system call entry, and any unexpected value will raise an uncatchable SIGSYS at that time, killing the application.
When a system call is intercepted, the kernel sends a thread-directed SIGSYS signal to the triggering thread. Various fields will be set in the siginfo_t structure (see sigaction(2)) associated with the signal:
si_signo will contain SIGSYS.
si_call_addr will show the address of the system call instruction.
si_syscall and si_arch will indicate which system call was attempted.
si_code will contain SYS_USER_DISPATCH.
si_errno will be set to 0.
The program counter will be as though the system call happened (i.e., the program counter will not point to the system call instruction).
When the signal handler returns to the kernel, the system call completes immediately and returns to the calling thread, without actually being executed. If necessary (i.e., when emulating the system call in user space), the signal handler should set the system call return value to a sane value, by modifying the register context stored in the ucontext argument of the signal handler. See sigaction(2), sigreturn(2), and getcontext(3) for more information.
If arg2 is set to PR_SYS_DISPATCH_OFF, Syscall User Dispatch is disabled for that thread. The remaining arguments must be set to 0.
The setting is not preserved across fork(2), clone(2), or execve(2).
For more information, see the kernel source file Documentation/admin-guide/syscall-user-dispatch.rst.
PR_SET_TAGGED_ADDR_CTRL (since Linux 5.4, only on arm64)
Controls support for passing tagged user-space addresses to the kernel (i.e., addresses where bits 56-63 are not all zero).
The level of support is selected by arg2, which can be one of the following:
0
Addresses that are passed for the purpose of being dereferenced by the kernel must be untagged.
PR_TAGGED_ADDR_ENABLE
Addresses that are passed for the purpose of being dereferenced by the kernel may be tagged, with the exceptions summarized below.
The remaining arguments arg3, arg4, and arg5 must all be zero.
On success, the mode specified in arg2 is set for the calling thread and the return value is 0. If the arguments are invalid, the mode specified in arg2 is unrecognized, or if this feature is unsupported by the kernel or disabled via /proc/sys/abi/tagged_addr_disabled, the call fails with the error EINVAL.
In particular, if prctl(PR_SET_TAGGED_ADDR_CTRL, 0, 0, 0, 0) fails with EINVAL, then all addresses passed to the kernel must be untagged.
Irrespective of which mode is set, addresses passed to certain interfaces must always be untagged:
brk(2), mmap(2), shmat(2), shmdt(2), and the new_address argument of mremap(2).
(Prior to Linux 5.6 these accepted tagged addresses, but the behaviour may not be what you expect. Don’t rely on it.)
“polymorphic” interfaces that accept pointers to arbitrary types cast to a void * or other generic type, specifically prctl(), ioctl(2), and in general setsockopt(2) (only certain specific setsockopt(2) options allow tagged addresses).
This list of exclusions may shrink when moving from one kernel version to a later kernel version. While the kernel may make some guarantees for backwards compatibility reasons, for the purposes of new software the effect of passing tagged addresses to these interfaces is unspecified.
The mode set by this call is inherited across fork(2) and clone(2). The mode is reset by execve(2) to 0 (i.e., tagged addresses not permitted in the user/kernel ABI).
For more information, see the kernel source file Documentation/arm64/tagged-address-abi.rst.
Warning: This call is primarily intended for use by the run-time environment. A successful PR_SET_TAGGED_ADDR_CTRL call elsewhere may crash the calling process. The conditions for using it safely are complex and system-dependent. Don’t use it unless you know what you are doing.
PR_GET_TAGGED_ADDR_CTRL (since Linux 5.4, only on arm64)
Returns the current tagged address mode for the calling thread.
Arguments arg2, arg3, arg4, and arg5 must all be zero.
If the arguments are invalid or this feature is disabled or unsupported by the kernel, the call fails with EINVAL. In particular, if prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0) fails with EINVAL, then this feature is definitely either unsupported, or disabled via /proc/sys/abi/tagged_addr_disabled. In this case, all addresses passed to the kernel must be untagged.
Otherwise, the call returns a nonnegative value describing the current tagged address mode, encoded in the same way as the arg2 argument of PR_SET_TAGGED_ADDR_CTRL.
For more information, see the kernel source file Documentation/arm64/tagged-address-abi.rst.
PR_TASK_PERF_EVENTS_DISABLE (since Linux 2.6.31)
Disable all performance counters attached to the calling process, regardless of whether the counters were created by this process or another process. Performance counters created by the calling process for other processes are unaffected. For more information on performance counters, see the Linux kernel source file tools/perf/design.txt.
Originally called PR_TASK_PERF_COUNTERS_DISABLE; renamed (retaining the same numerical value) in Linux 2.6.32.
PR_TASK_PERF_EVENTS_ENABLE (since Linux 2.6.31)
The converse of PR_TASK_PERF_EVENTS_DISABLE; enable performance counters attached to the calling process.
Originally called PR_TASK_PERF_COUNTERS_ENABLE; renamed in Linux 2.6.32.
PR_SET_THP_DISABLE (since Linux 3.15)
Set the state of the “THP disable” flag for the calling thread. If arg2 has a nonzero value, the flag is set, otherwise it is cleared. Setting this flag provides a method for disabling transparent huge pages for jobs where the code cannot be modified, and using a malloc hook with madvise(2) is not an option (i.e., statically allocated data). The setting of the “THP disable” flag is inherited by a child created via fork(2) and is preserved across execve(2).
PR_GET_THP_DISABLE (since Linux 3.15)
Return (as the function result) the current setting of the “THP disable” flag for the calling thread: either 1, if the flag is set, or 0, if it is not.
PR_GET_TID_ADDRESS (since Linux 3.5)
Return the clear_child_tid address set by set_tid_address(2) and the clone(2) CLONE_CHILD_CLEARTID flag, in the location pointed to by (int **) arg2. This feature is available only if the kernel is built with the CONFIG_CHECKPOINT_RESTORE option enabled. Note that since the prctl() system call does not have a compat implementation for the AMD64 x32 and MIPS n32 ABIs, and the kernel writes out a pointer using the kernel’s pointer size, this operation expects a user-space buffer of 8 (not 4) bytes on these ABIs.
PR_SET_TIMERSLACK (since Linux 2.6.28)
Each thread has two associated timer slack values: a “default” value, and a “current” value. This operation sets the “current” timer slack value for the calling thread. arg2 is an unsigned long value, so the maximum “current” value is ULONG_MAX and the minimum “current” value is 1. If the nanosecond value supplied in arg2 is greater than zero, then the “current” value is set to this value. If arg2 is equal to zero, the “current” timer slack is reset to the thread’s “default” timer slack value.
The “current” timer slack is used by the kernel to group timer expirations for the calling thread that are close to one another; as a consequence, timer expirations for the thread may be up to the specified number of nanoseconds late (but will never expire early). Grouping timer expirations can help reduce system power consumption by minimizing CPU wake-ups.
The timer expirations affected by timer slack are those set by select(2), pselect(2), poll(2), ppoll(2), epoll_wait(2), epoll_pwait(2), clock_nanosleep(2), nanosleep(2), and futex(2) (and thus the library functions implemented via futexes, including pthread_cond_timedwait(3), pthread_mutex_timedlock(3), pthread_rwlock_timedrdlock(3), pthread_rwlock_timedwrlock(3), and sem_timedwait(3)).
Timer slack is not applied to threads that are scheduled under a real-time scheduling policy (see sched_setscheduler(2)).
When a new thread is created, the two timer slack values are made the same as the “current” value of the creating thread. Thereafter, a thread can adjust its “current” timer slack value via PR_SET_TIMERSLACK. The “default” value can’t be changed. The timer slack values of init (PID 1), the ancestor of all processes, are 50,000 nanoseconds (50 microseconds). The timer slack value is inherited by a child created via fork(2), and is preserved across execve(2).
Since Linux 4.6, the “current” timer slack value of any process can be examined and changed via the file /proc/pid/timerslack_ns. See proc(5).
PR_GET_TIMERSLACK (since Linux 2.6.28)
Return (as the function result) the “current” timer slack value of the calling thread.
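Adjusting and then reading the “current” timer slack might look like the sketch below. The helper name is hypothetical. Because the glibc prctl() wrapper returns an int, the sketch is only suitable for slack values that fit in that type.

```c
#include <sys/prctl.h>

/* Hypothetical helper: set the "current" timer slack in nanoseconds
 * and return the value read back, or -1 on error. Passing 0 restores
 * the thread's "default" value. */
int set_timer_slack_ns(unsigned long ns)
{
    if (prctl(PR_SET_TIMERSLACK, ns, 0UL, 0UL, 0UL) == -1)
        return -1;
    return prctl(PR_GET_TIMERSLACK, 0UL, 0UL, 0UL, 0UL);
}
```

For a thread under a normal scheduling policy, set_timer_slack_ns(200000UL) is expected to return 200000, after which blocking calls such as nanosleep(2) may wake up to 200 microseconds late.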
PR_SET_TIMING (since Linux 2.6.0)
Set whether to use (normal, traditional) statistical process timing or accurate timestamp-based process timing, by passing PR_TIMING_STATISTICAL or PR_TIMING_TIMESTAMP to arg2. PR_TIMING_TIMESTAMP is not currently implemented (attempting to set this mode will yield the error EINVAL).
PR_GET_TIMING (since Linux 2.6.0)
Return (as the function result) which process timing method is currently in use.
PR_SET_TSC (since Linux 2.6.26, x86 only)
Set the state of the flag determining whether the timestamp counter can be read by the process. Pass PR_TSC_ENABLE to arg2 to allow it to be read, or PR_TSC_SIGSEGV to generate a SIGSEGV when the process tries to read the timestamp counter.
PR_GET_TSC (since Linux 2.6.26, x86 only)
Return the state of the flag determining whether the timestamp counter can be read, in the location pointed to by (int *) arg2.
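Unlike most of the "get" operations, PR_GET_TSC stores its result through a pointer rather than returning it. The following minimal sketch illustrates that calling convention; the helper name tsc_access_mode() is hypothetical.

```c
#include <sys/prctl.h>

/* Hypothetical helper: report whether the calling process may read the
 * timestamp counter. Returns PR_TSC_ENABLE or PR_TSC_SIGSEGV on success,
 * or -1 on error (for example, EINVAL on a non-x86 system). */
int tsc_access_mode(void)
{
    int mode = 0;

    /* The flag is written to the location passed as arg2. */
    if (prctl(PR_GET_TSC, &mode, 0UL, 0UL, 0UL) == -1)
        return -1;
    return mode;
}
```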
PR_SET_UNALIGN
(Only on: ia64, since Linux 2.3.48; parisc, since Linux 2.6.15; PowerPC, since Linux 2.6.18; Alpha, since Linux 2.6.22; sh, since Linux 2.6.34; tile, since Linux 3.12) Set unaligned access control bits to arg2. Pass PR_UNALIGN_NOPRINT to silently fix up unaligned user accesses, or PR_UNALIGN_SIGBUS to generate SIGBUS on unaligned user access. Alpha also supports an additional flag with the value of 4 and no corresponding named constant, which instructs kernel to not fix up unaligned accesses (it is analogous to providing the UAC_NOFIX flag in SSI_NVPAIRS operation of the setsysinfo() system call on Tru64).
PR_GET_UNALIGN
(See PR_SET_UNALIGN for information on versions and architectures.) Return unaligned access control bits, in the location pointed to by (unsigned int *) arg2.
PR_GET_AUXV (since Linux 6.4)
Get the auxiliary vector (auxv) into the buffer pointed to by (void *) arg2, whose length is given by arg3. If the buffer is not long enough for the full auxiliary vector, the copy will be truncated. Return (as the function result) the full length of the auxiliary vector. arg4 and arg5 must be 0.
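Because the function result is the full length of the vector rather than the number of bytes copied, a caller can detect truncation by comparing the result with the buffer size. A minimal sketch follows; copy_auxv() is a hypothetical helper, and the fallback definition of PR_GET_AUXV is an assumption made for building against headers that predate Linux 6.4.

```c
#include <stddef.h>
#include <sys/prctl.h>

#ifndef PR_GET_AUXV
#define PR_GET_AUXV 0x41555856   /* assumed fallback for older headers */
#endif

/* Hypothetical helper: copy up to len bytes of the auxiliary vector into
 * buf. Returns the full length of the vector in bytes; a result larger
 * than len means the copy was truncated. Returns -1 on error, for
 * example EINVAL on kernels that do not know PR_GET_AUXV. */
long copy_auxv(void *buf, size_t len)
{
    return prctl(PR_GET_AUXV, (unsigned long) buf, (unsigned long) len,
                 0UL, 0UL);
}
```

If the result exceeds len, the caller can allocate a buffer of the reported size and call again.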
PR_SET_MDWE (since Linux 6.3)
Set the calling process’ Memory-Deny-Write-Execute protection mask. Once protection bits are set, they can not be changed. arg2 must be a bit mask of:
PR_MDWE_REFUSE_EXEC_GAIN
New memory mapping protections can’t be writable and executable. Non-executable mappings can’t become executable.
PR_MDWE_NO_INHERIT (since Linux 6.6)
Do not propagate MDWE protection to child processes on fork(2). Setting this bit requires setting PR_MDWE_REFUSE_EXEC_GAIN too.
PR_GET_MDWE (since Linux 6.3)
Return (as the function result) the Memory-Deny-Write-Execute protection mask of the calling process. (See PR_SET_MDWE for information on the protection mask bits.)
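The rule that PR_MDWE_NO_INHERIT must be combined with PR_MDWE_REFUSE_EXEC_GAIN can be captured when building the mask. This is a sketch under stated assumptions: mdwe_mask() and enable_mdwe() are hypothetical helpers, and the fallback constant definitions are there only for headers that predate these operations.

```c
#include <sys/prctl.h>

/* Assumed fallbacks for headers older than Linux 6.3 / 6.6. */
#ifndef PR_SET_MDWE
#define PR_SET_MDWE 65
#endif
#ifndef PR_GET_MDWE
#define PR_GET_MDWE 66
#endif
#ifndef PR_MDWE_REFUSE_EXEC_GAIN
#define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
#endif
#ifndef PR_MDWE_NO_INHERIT
#define PR_MDWE_NO_INHERIT (1UL << 1)
#endif

/* Hypothetical helper: build a valid protection mask.
 * PR_MDWE_REFUSE_EXEC_GAIN is always set, because PR_MDWE_NO_INHERIT
 * is only accepted together with it. */
unsigned long mdwe_mask(int inherit_on_fork)
{
    unsigned long mask = PR_MDWE_REFUSE_EXEC_GAIN;

    if (!inherit_on_fork)
        mask |= PR_MDWE_NO_INHERIT;
    return mask;
}

/* Hypothetical helper: enable MDWE for the calling process. The bits
 * cannot be changed afterwards. Returns 0 on success, -1 on error. */
int enable_mdwe(int inherit_on_fork)
{
    return prctl(PR_SET_MDWE, mdwe_mask(inherit_on_fork), 0UL, 0UL, 0UL);
}
```

Since the setting is irreversible, a program would typically call enable_mdwe() once, early in main(), after any mappings that need to be both writable and executable have been set up.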
RETURN VALUE
On success, PR_CAP_AMBIENT+PR_CAP_AMBIENT_IS_SET, PR_CAPBSET_READ, PR_GET_DUMPABLE, PR_GET_FP_MODE, PR_GET_IO_FLUSHER, PR_GET_KEEPCAPS, PR_MCE_KILL_GET, PR_GET_NO_NEW_PRIVS, PR_GET_SECUREBITS, PR_GET_SPECULATION_CTRL, PR_SVE_GET_VL, PR_SVE_SET_VL, PR_GET_TAGGED_ADDR_CTRL, PR_GET_THP_DISABLE, PR_GET_TIMING, PR_GET_TIMERSLACK, PR_GET_AUXV, and (if it returns) PR_GET_SECCOMP return the nonnegative values described above. All other op values return 0 on success. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
op is PR_SET_SECCOMP and arg2 is SECCOMP_MODE_FILTER, but the process does not have the CAP_SYS_ADMIN capability or has not set the no_new_privs attribute (see the discussion of PR_SET_NO_NEW_PRIVS above).
EACCES
op is PR_SET_MM, and arg3 is PR_SET_MM_EXE_FILE, the file is not executable.
EBADF
op is PR_SET_MM, arg3 is PR_SET_MM_EXE_FILE, and the file descriptor passed in arg4 is not valid.
EBUSY
op is PR_SET_MM, arg3 is PR_SET_MM_EXE_FILE, and this is the second attempt to change the /proc/pid/exe symbolic link, which is prohibited.
EFAULT
arg2 is an invalid address.
EFAULT
op is PR_SET_SECCOMP, arg2 is SECCOMP_MODE_FILTER, the system was built with CONFIG_SECCOMP_FILTER, and arg3 is an invalid address.
EFAULT
op is PR_SET_SYSCALL_USER_DISPATCH and arg5 has an invalid address.
EINVAL
The value of op is not recognized, or not supported on this system.
EINVAL
op is PR_MCE_KILL or PR_MCE_KILL_GET or PR_SET_MM, and unused prctl() arguments were not specified as zero.
EINVAL
arg2 is not a valid value for this op.
EINVAL
op is PR_SET_SECCOMP or PR_GET_SECCOMP, and the kernel was not configured with CONFIG_SECCOMP.
EINVAL
op is PR_SET_SECCOMP, arg2 is SECCOMP_MODE_FILTER, and the kernel was not configured with CONFIG_SECCOMP_FILTER.
EINVAL
op is PR_SET_MM, and one of the following is true:
arg4 or arg5 is nonzero;
arg3 is greater than TASK_SIZE (the limit on the size of the user address space for this architecture);
arg2 is PR_SET_MM_START_CODE, PR_SET_MM_END_CODE, PR_SET_MM_START_DATA, PR_SET_MM_END_DATA, or PR_SET_MM_START_STACK, and the permissions of the corresponding memory area are not as required;
arg2 is PR_SET_MM_START_BRK or PR_SET_MM_BRK, and arg3 is less than or equal to the end of the data segment or specifies a value that would cause the RLIMIT_DATA resource limit to be exceeded.
EINVAL
op is PR_SET_PTRACER and arg2 is not 0, PR_SET_PTRACER_ANY, or the PID of an existing process.
EINVAL
op is PR_SET_PDEATHSIG and arg2 is not a valid signal number.
EINVAL
op is PR_SET_DUMPABLE and arg2 is neither SUID_DUMP_DISABLE nor SUID_DUMP_USER.
EINVAL
op is PR_SET_TIMING and arg2 is not PR_TIMING_STATISTICAL.
EINVAL
op is PR_SET_NO_NEW_PRIVS and arg2 is not equal to 1 or arg3, arg4, or arg5 is nonzero.
EINVAL
op is PR_GET_NO_NEW_PRIVS and arg2, arg3, arg4, or arg5 is nonzero.
EINVAL
op is PR_SET_THP_DISABLE and arg3, arg4, or arg5 is nonzero.
EINVAL
op is PR_GET_THP_DISABLE and arg2, arg3, arg4, or arg5 is nonzero.
EINVAL
op is PR_CAP_AMBIENT and an unused argument (arg4, arg5, or, in the case of PR_CAP_AMBIENT_CLEAR_ALL, arg3) is nonzero; or arg2 has an invalid value; or arg2 is PR_CAP_AMBIENT_LOWER, PR_CAP_AMBIENT_RAISE, or PR_CAP_AMBIENT_IS_SET and arg3 does not specify a valid capability.
EINVAL
op was PR_GET_SPECULATION_CTRL or PR_SET_SPECULATION_CTRL and unused arguments to prctl() are not 0.
EINVAL
op is PR_PAC_RESET_KEYS and the arguments are invalid or unsupported. See the description of PR_PAC_RESET_KEYS above for details.
EINVAL
op is PR_SVE_SET_VL and the arguments are invalid or unsupported, or SVE is not available on this platform. See the description of PR_SVE_SET_VL above for details.
EINVAL
op is PR_SVE_GET_VL and SVE is not available on this platform.
EINVAL
op is PR_SET_SYSCALL_USER_DISPATCH and one of the following is true:
arg2 is PR_SYS_DISPATCH_OFF and the remaining arguments are not 0;
arg2 is PR_SYS_DISPATCH_ON and the memory range specified is outside the address space of the process.
arg2 is invalid.
EINVAL
op is PR_SET_TAGGED_ADDR_CTRL and the arguments are invalid or unsupported. See the description of PR_SET_TAGGED_ADDR_CTRL above for details.
EINVAL
op is PR_GET_TAGGED_ADDR_CTRL and the arguments are invalid or unsupported. See the description of PR_GET_TAGGED_ADDR_CTRL above for details.
ENODEV
op was PR_SET_SPECULATION_CTRL and the kernel or CPU does not support the requested speculation misfeature.
ENXIO
op was PR_MPX_ENABLE_MANAGEMENT or PR_MPX_DISABLE_MANAGEMENT and the kernel or the CPU does not support MPX management. Check that the kernel and processor have MPX support.
ENXIO
op was PR_SET_SPECULATION_CTRL and control of the selected speculation misfeature is not possible. See PR_GET_SPECULATION_CTRL for the bit fields to determine which option is available.
EOPNOTSUPP
op is PR_SET_FP_MODE and arg2 has an invalid or unsupported value.
EPERM
op is PR_SET_SECUREBITS, and the caller does not have the CAP_SETPCAP capability, or tried to unset a “locked” flag, or tried to set a flag whose corresponding locked flag was set (see capabilities(7)).
EPERM
op is PR_SET_SPECULATION_CTRL wherein the speculation was disabled with PR_SPEC_FORCE_DISABLE and caller tried to enable it again.
EPERM
op is PR_SET_KEEPCAPS, and the caller’s SECBIT_KEEP_CAPS_LOCKED flag is set (see capabilities(7)).
EPERM
op is PR_CAPBSET_DROP, and the caller does not have the CAP_SETPCAP capability.
EPERM
op is PR_SET_MM, and the caller does not have the CAP_SYS_RESOURCE capability.
EPERM
op is PR_CAP_AMBIENT and arg2 is PR_CAP_AMBIENT_RAISE, but either the capability specified in arg3 is not present in the process’s permitted and inheritable capability sets, or the SECBIT_NO_CAP_AMBIENT_RAISE securebit has been set (see capabilities(7)).
ERANGE
op was PR_SET_SPECULATION_CTRL and arg3 is not PR_SPEC_ENABLE, PR_SPEC_DISABLE, PR_SPEC_FORCE_DISABLE, nor PR_SPEC_DISABLE_NOEXEC.
VERSIONS
IRIX has a prctl() system call (also introduced in Linux 2.1.44 as irix_prctl on the MIPS architecture), with prototype
ptrdiff_t prctl(int op, int arg2, int arg3);
and operations to get the maximum number of processes per user, get the maximum number of processors the calling process can use, find out whether a specified process is currently blocked, get or set the maximum stack size, and so on.
STANDARDS
Linux.
HISTORY
Linux 2.1.57, glibc 2.0.6
SEE ALSO
signal(2), core(5)
440 - Linux cli command msgsnd
NAME
msgsnd - System V message queue operations
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/msg.h>
int msgsnd(int msqid, const void msgp[.msgsz], size_t msgsz,
           int msgflg);
ssize_t msgrcv(int msqid, void msgp[.msgsz], size_t msgsz, long msgtyp,
               int msgflg);
DESCRIPTION
The msgsnd() and msgrcv() system calls are used to send messages to, and receive messages from, a System V message queue. The calling process must have write permission on the message queue in order to send a message, and read permission to receive a message.
The msgp argument is a pointer to a caller-defined structure of the following general form:
struct msgbuf {
long mtype; /* message type, must be > 0 */
char mtext[1]; /* message data */
};
The mtext field is an array (or other structure) whose size is specified by msgsz, a nonnegative integer value. Messages of zero length (i.e., no mtext field) are permitted. The mtype field must have a strictly positive integer value. This value can be used by the receiving process for message selection (see the description of msgrcv() below).
msgsnd()
The msgsnd() system call appends a copy of the message pointed to by msgp to the message queue whose identifier is specified by msqid.
If sufficient space is available in the queue, msgsnd() succeeds immediately. The queue capacity is governed by the msg_qbytes field in the associated data structure for the message queue. During queue creation this field is initialized to MSGMNB bytes, but this limit can be modified using msgctl(2). A message queue is considered to be full if either of the following conditions is true:
Adding a new message to the queue would cause the total number of bytes in the queue to exceed the queue’s maximum size (the msg_qbytes field).
Adding another message to the queue would cause the total number of messages in the queue to exceed the queue’s maximum size (the msg_qbytes field). This check is necessary to prevent an unlimited number of zero-length messages being placed on the queue. Although such messages contain no data, they nevertheless consume (locked) kernel memory.
If insufficient space is available in the queue, then the default behavior of msgsnd() is to block until space becomes available. If IPC_NOWAIT is specified in msgflg, then the call instead fails with the error EAGAIN.
A blocked msgsnd() call may also fail if:
the queue is removed, in which case the system call fails with errno set to EIDRM; or
a signal is caught, in which case the system call fails with errno set to EINTR; see signal(7). (msgsnd() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.)
Upon successful completion the message queue data structure is updated as follows:
msg_lspid is set to the process ID of the calling process.
msg_qnum is incremented by 1.
msg_stime is set to the current time.
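Because an interrupted msgsnd() is never restarted automatically, a caller that wants to keep waiting has to retry by hand. The following is a minimal sketch of that pattern; struct text_msg and send_retry() are hypothetical names introduced for illustration.

```c
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Caller-defined message layout (hypothetical name), following the
 * general form of struct msgbuf. */
struct text_msg {
    long mtype;        /* message type, must be > 0 */
    char mtext[64];    /* message data */
};

/* Hypothetical helper: send text with the given type, retrying when the
 * call is interrupted by a signal handler (EINTR). All other errors,
 * such as EAGAIN (when flags contains IPC_NOWAIT) or EIDRM, are passed
 * back to the caller. Returns 0 on success, -1 on error with errno set. */
int send_retry(int qid, long type, const char *text, int flags)
{
    struct text_msg m;
    size_t len = strlen(text);

    if (len > sizeof(m.mtext))
        len = sizeof(m.mtext);    /* this sketch silently truncates */
    m.mtype = type;
    memcpy(m.mtext, text, len);

    for (;;) {
        if (msgsnd(qid, &m, len, flags) == 0)
            return 0;
        if (errno != EINTR)
            return -1;
    }
}
```

Note that msgsz counts only the bytes of mtext, not the mtype field.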
msgrcv()
The msgrcv() system call removes a message from the queue specified by msqid and places it in the buffer pointed to by msgp.
The argument msgsz specifies the maximum size in bytes for the member mtext of the structure pointed to by the msgp argument. If the message text has length greater than msgsz, then the behavior depends on whether MSG_NOERROR is specified in msgflg. If MSG_NOERROR is specified, then the message text will be truncated (and the truncated part will be lost); if MSG_NOERROR is not specified, then the message isn’t removed from the queue and the system call fails returning -1 with errno set to E2BIG.
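The effect of MSG_NOERROR on an oversized message can be illustrated with a private queue. This is a sketch only: truncation_demo() and the two structure names are hypothetical, and the queue is created with IPC_PRIVATE purely to keep the example self-contained.

```c
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

struct small_msg { long mtype; char mtext[4]; };    /* receive buffer */
struct big_msg   { long mtype; char mtext[16]; };   /* send buffer */

/* Hypothetical demonstration: queue a 10-byte message, then try to
 * receive it into a 4-byte mtext, first without and then with
 * MSG_NOERROR. Stores the errno of the first attempt in *first_errno
 * and returns the byte count of the second attempt, or -1 on failure. */
ssize_t truncation_demo(int *first_errno)
{
    struct big_msg out = { .mtype = 1 };
    struct small_msg in;
    ssize_t n;
    int qid = msgget(IPC_PRIVATE, 0600);

    if (qid == -1)
        return -1;
    memcpy(out.mtext, "0123456789", 10);
    if (msgsnd(qid, &out, 10, IPC_NOWAIT) == -1) {
        msgctl(qid, IPC_RMID, NULL);
        return -1;
    }

    /* Without MSG_NOERROR the call fails and the message stays queued. */
    n = msgrcv(qid, &in, sizeof(in.mtext), 0, IPC_NOWAIT);
    *first_errno = (n == -1) ? errno : 0;

    /* With MSG_NOERROR the text is cut down to sizeof(in.mtext) bytes. */
    n = msgrcv(qid, &in, sizeof(in.mtext), 0, IPC_NOWAIT | MSG_NOERROR);

    msgctl(qid, IPC_RMID, NULL);
    return n;
}
```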
Unless MSG_COPY is specified in msgflg (see below), the msgtyp argument specifies the type of message requested, as follows:
If msgtyp is 0, then the first message in the queue is read.
If msgtyp is greater than 0, then the first message in the queue of type msgtyp is read, unless MSG_EXCEPT was specified in msgflg, in which case the first message in the queue of type not equal to msgtyp will be read.
If msgtyp is less than 0, then the first message in the queue with the lowest type less than or equal to the absolute value of msgtyp will be read.
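The three selection rules above can be observed on a small queue. In this minimal sketch, pick_type() and struct typed_msg are hypothetical names; the helper queues messages of types 3, 1, and 2 in that order on a private queue and reports which one a single msgrcv() call selects.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

struct typed_msg { long mtype; char mtext[1]; };

/* Hypothetical demonstration: returns the type of the message that
 * msgrcv() selects for the given msgtyp and flags, or -1 if no message
 * matches or an error occurs. */
long pick_type(long msgtyp, int flags)
{
    static const long types[] = { 3, 1, 2 };   /* queue order */
    struct typed_msg m;
    long got = -1;
    int qid = msgget(IPC_PRIVATE, 0600);

    if (qid == -1)
        return -1;
    for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
        m.mtype = types[i];
        m.mtext[0] = 'x';
        if (msgsnd(qid, &m, sizeof(m.mtext), IPC_NOWAIT) == -1)
            goto out;
    }
    /* IPC_NOWAIT keeps the demonstration from blocking on no match. */
    if (msgrcv(qid, &m, sizeof(m.mtext), msgtyp, flags | IPC_NOWAIT) != -1)
        got = m.mtype;
out:
    msgctl(qid, IPC_RMID, NULL);
    return got;
}
```

With this queue, a msgtyp of 0 selects the type 3 message (first in the queue), a msgtyp of 2 selects type 2, and a msgtyp of -2 selects type 1 (the lowest type not exceeding 2).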
The msgflg argument is a bit mask constructed by ORing together zero or more of the following flags:
IPC_NOWAIT
Return immediately if no message of the requested type is in the queue. The system call fails with errno set to ENOMSG.
MSG_COPY (since Linux 3.8)
Nondestructively fetch a copy of the message at the ordinal position in the queue specified by msgtyp (messages are considered to be numbered starting at 0).
This flag must be specified in conjunction with IPC_NOWAIT, with the result that, if there is no message available at the given position, the call fails immediately with the error ENOMSG. Because they alter the meaning of msgtyp in orthogonal ways, MSG_COPY and MSG_EXCEPT may not both be specified in msgflg.
The MSG_COPY flag was added for the implementation of the kernel checkpoint-restore facility and is available only if the kernel was built with the CONFIG_CHECKPOINT_RESTORE option.
MSG_EXCEPT
Used with msgtyp greater than 0 to read the first message in the queue with message type that differs from msgtyp.
MSG_NOERROR
Truncate the message text if it is longer than msgsz bytes.
If no message of the requested type is available and IPC_NOWAIT isn’t specified in msgflg, the calling process is blocked until one of the following conditions occurs:
A message of the desired type is placed in the queue.
The message queue is removed from the system. In this case, the system call fails with errno set to EIDRM.
The calling process catches a signal. In this case, the system call fails with errno set to EINTR. (msgrcv() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.)
Upon successful completion the message queue data structure is updated as follows:
msg_lrpid is set to the process ID of the calling process.
msg_qnum is decremented by 1.
msg_rtime is set to the current time.
RETURN VALUE
On success, msgsnd() returns 0 and msgrcv() returns the number of bytes actually copied into the mtext array. On failure, both functions return -1, and set errno to indicate the error.
ERRORS
msgsnd() can fail with the following errors:
EACCES
The calling process does not have write permission on the message queue, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EAGAIN
The message can’t be sent due to the msg_qbytes limit for the queue and IPC_NOWAIT was specified in msgflg.
EFAULT
The address pointed to by msgp isn’t accessible.
EIDRM
The message queue was removed.
EINTR
Sleeping on a full message queue condition, the process caught a signal.
EINVAL
Invalid msqid value, or nonpositive mtype value, or invalid msgsz value (less than 0 or greater than the system value MSGMAX).
ENOMEM
The system does not have enough memory to make a copy of the message pointed to by msgp.
msgrcv() can fail with the following errors:
E2BIG
The message text length is greater than msgsz and MSG_NOERROR isn’t specified in msgflg.
EACCES
The calling process does not have read permission on the message queue, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EFAULT
The address pointed to by msgp isn’t accessible.
EIDRM
While the process was sleeping to receive a message, the message queue was removed.
EINTR
While the process was sleeping to receive a message, the process caught a signal; see signal(7).
EINVAL
msqid was invalid, or msgsz was less than 0.
EINVAL (since Linux 3.14)
msgflg specified MSG_COPY, but not IPC_NOWAIT.
EINVAL (since Linux 3.14)
msgflg specified both MSG_COPY and MSG_EXCEPT.
ENOMSG
IPC_NOWAIT was specified in msgflg and no message of the requested type existed on the message queue.
ENOMSG
IPC_NOWAIT and MSG_COPY were specified in msgflg and the queue contains fewer than msgtyp messages.
ENOSYS (since Linux 3.8)
Both MSG_COPY and IPC_NOWAIT were specified in msgflg, and this kernel was configured without CONFIG_CHECKPOINT_RESTORE.
STANDARDS
POSIX.1-2008.
The MSG_EXCEPT and MSG_COPY flags are Linux-specific; their definitions can be obtained by defining the _GNU_SOURCE feature test macro.
HISTORY
POSIX.1-2001, SVr4.
The msgp argument is declared as struct msgbuf * in glibc 2.0 and 2.1. It is declared as void * in glibc 2.2 and later, as required by SUSv2 and SUSv3.
NOTES
The following limits on message queue resources affect the msgsnd() call:
MSGMAX
Maximum size of a message text, in bytes (default value: 8192 bytes). On Linux, this limit can be read and modified via /proc/sys/kernel/msgmax.
MSGMNB
Maximum number of bytes that can be held in a message queue (default value: 16384 bytes). On Linux, this limit can be read and modified via /proc/sys/kernel/msgmnb. A privileged process (Linux: a process with the CAP_SYS_RESOURCE capability) can increase the size of a message queue beyond MSGMNB using the msgctl(2) IPC_SET operation.
The implementation has no intrinsic system-wide limits on the number of message headers (MSGTQL) and the number of bytes in the message pool (MSGPOOL).
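On Linux, the limits listed above can be inspected at run time through the /proc files named in each entry. A minimal sketch, assuming /proc is mounted; read_msg_limit() is a hypothetical helper.

```c
#include <stdio.h>

/* Hypothetical helper: read one numeric limit, such as "msgmax" or
 * "msgmnb", from /proc/sys/kernel. Returns the value, or -1 if the file
 * cannot be opened or parsed. */
long read_msg_limit(const char *name)
{
    char path[128];
    long value = -1;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/sys/kernel/%s", name);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &value) != 1)
        value = -1;
    fclose(fp);
    return value;
}
```

A sender could compare its message size against read_msg_limit("msgmax") before calling msgsnd(), instead of waiting for the EINVAL error.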
BUGS
In Linux 3.13 and earlier, if msgrcv() was called with the MSG_COPY flag, but without IPC_NOWAIT, and the message queue contained fewer than msgtyp messages, then the call would block until the next message was written to the queue. At that point, the call would return a copy of the message, regardless of whether that message was at the ordinal position msgtyp. This bug is fixed in Linux 3.14.
Specifying both MSG_COPY and MSG_EXCEPT in msgflg is a logical error (since these flags impose different interpretations on msgtyp). In Linux 3.13 and earlier, this error was not diagnosed by msgrcv(). This bug is fixed in Linux 3.14.
EXAMPLES
The program below demonstrates the use of msgsnd() and msgrcv().
The example program is first run with the -s option to send a message and then run again with the -r option to receive a message.
The following shell session shows a sample run of the program:
$ ./a.out -s
sent: a message at Wed Mar 4 16:25:45 2015
$ ./a.out -r
message received: a message at Wed Mar 4 16:25:45 2015
Program source
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <time.h>
#include <unistd.h>

struct msgbuf {
    long mtype;
    char mtext[80];
};

static void
usage(char *prog_name, char *msg)
{
    if (msg != NULL)
        fputs(msg, stderr);

    fprintf(stderr, "Usage: %s [options]\n", prog_name);
    fprintf(stderr, "Options are:\n");
    fprintf(stderr, "-s        send message using msgsnd()\n");
    fprintf(stderr, "-r        read message using msgrcv()\n");
    fprintf(stderr, "-t        message type (default is 1)\n");
    fprintf(stderr, "-k        message queue key (default is 1234)\n");
    exit(EXIT_FAILURE);
}

static void
send_msg(int qid, int msgtype)
{
    time_t         t;
    struct msgbuf  msg;

    msg.mtype = msgtype;

    time(&t);
    snprintf(msg.mtext, sizeof(msg.mtext), "a message at %s",
             ctime(&t));

    if (msgsnd(qid, &msg, sizeof(msg.mtext),
               IPC_NOWAIT) == -1)
    {
        perror("msgsnd error");
        exit(EXIT_FAILURE);
    }
    printf("sent: %s\n", msg.mtext);
}

static void
get_msg(int qid, int msgtype)
{
    struct msgbuf  msg;

    if (msgrcv(qid, &msg, sizeof(msg.mtext), msgtype,
               MSG_NOERROR | IPC_NOWAIT) == -1) {
        if (errno != ENOMSG) {
            perror("msgrcv");
            exit(EXIT_FAILURE);
        }
        printf("No message available for msgrcv()\n");
    } else {
        printf("message received: %s\n", msg.mtext);
    }
}

int
main(int argc, char *argv[])
{
    int  qid, opt;
    int  mode = 0;               /* 1 = send, 2 = receive */
    int  msgtype = 1;
    int  msgkey = 1234;

    while ((opt = getopt(argc, argv, "srt:k:")) != -1) {
        switch (opt) {
        case 's':
            mode = 1;
            break;
        case 'r':
            mode = 2;
            break;
        case 't':
            msgtype = atoi(optarg);
            if (msgtype <= 0)
                usage(argv[0], "-t option must be greater than 0\n");
            break;
        case 'k':
            msgkey = atoi(optarg);
            break;
        default:
            usage(argv[0], "Unrecognized option\n");
        }
    }

    if (mode == 0)
        usage(argv[0], "must use either -s or -r option\n");

    qid = msgget(msgkey, IPC_CREAT | 0666);

    if (qid == -1) {
        perror("msgget");
        exit(EXIT_FAILURE);
    }

    if (mode == 2)
        get_msg(qid, msgtype);
    else
        send_msg(qid, msgtype);

    exit(EXIT_SUCCESS);
}
SEE ALSO
msgctl(2), msgget(2), capabilities(7), mq_overview(7), sysvipc(7)
441 - Linux cli command set_thread_area
NAME
set_thread_area - manipulate thread-local storage information
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
#if defined __i386__ || defined __x86_64__
# include <asm/ldt.h> /* Definition of struct user_desc */
int syscall(SYS_get_thread_area, struct user_desc *u_info);
int syscall(SYS_set_thread_area, struct user_desc *u_info);
#elif defined __m68k__
int syscall(SYS_get_thread_area);
int syscall(SYS_set_thread_area, unsigned long tp);
#elif defined __mips__ || defined __csky__
int syscall(SYS_set_thread_area, unsigned long addr);
#endif
Note: glibc provides no wrappers for these system calls, necessitating the use of syscall(2).
DESCRIPTION
These calls provide architecture-specific support for a thread-local storage implementation. At the moment, set_thread_area() is available on m68k, MIPS, C-SKY, and x86 (both 32-bit and 64-bit variants); get_thread_area() is available on m68k and x86.
On m68k, MIPS and C-SKY, set_thread_area() allows storing an arbitrary pointer (provided in the tp argument on m68k and in the addr argument on MIPS and C-SKY) in the kernel data structure associated with the calling thread; this pointer can later be retrieved using get_thread_area() (see also NOTES for information regarding obtaining the thread pointer on MIPS).
On x86, Linux dedicates three global descriptor table (GDT) entries for thread-local storage. For more information about the GDT, see the Intel Software Developer’s Manual or the AMD Architecture Programming Manual.
Both of these system calls take an argument that is a pointer to a structure of the following type:
struct user_desc {
unsigned int entry_number;
unsigned int base_addr;
unsigned int limit;
unsigned int seg_32bit:1;
unsigned int contents:2;
unsigned int read_exec_only:1;
unsigned int limit_in_pages:1;
unsigned int seg_not_present:1;
unsigned int useable:1;
#ifdef __x86_64__
unsigned int lm:1;
#endif
};
get_thread_area() reads the GDT entry indicated by u_info->entry_number and fills in the rest of the fields in u_info.
set_thread_area() sets a TLS entry in the GDT.
The TLS array entry set by set_thread_area() corresponds to the value of u_info->entry_number passed in by the user. If this value is in bounds, set_thread_area() writes the TLS descriptor pointed to by u_info into the thread’s TLS array.
When set_thread_area() is passed an entry_number of -1, it searches for a free TLS entry. If set_thread_area() finds a free TLS entry, the value of u_info->entry_number is set upon return to show which entry was changed.
A user_desc is considered “empty” if read_exec_only and seg_not_present are set to 1 and all of the other fields are 0. If an “empty” descriptor is passed to set_thread_area(), the corresponding TLS entry will be cleared. See BUGS for additional details.
Since Linux 3.19, set_thread_area() cannot be used to write non-present segments, 16-bit segments, or code segments, although clearing a segment is still acceptable.
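As an illustration of the x86 calling convention described above, the following sketch reads one TLS descriptor with get_thread_area(). It is a minimal sketch under stated assumptions: tls_entry_base() is a hypothetical helper, and on architectures other than x86 it simply reports ENOSYS rather than attempting the call.

```c
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__i386__) || defined(__x86_64__)
# include <asm/ldt.h>    /* Definition of struct user_desc */
#endif

/* Hypothetical helper: read the base address of the TLS descriptor in
 * GDT slot `entry` into *base. Returns 0 on success, or -1 on error with
 * errno set. ENOSYS is expected when the call is made as a 64-bit system
 * call, and is also what this sketch reports on non-x86 architectures. */
int tls_entry_base(unsigned int entry, unsigned int *base)
{
#if (defined(__i386__) || defined(__x86_64__)) && defined(SYS_get_thread_area)
    struct user_desc desc;

    /* Zero the whole structure, including padding bits. */
    memset(&desc, 0, sizeof(desc));
    desc.entry_number = entry;

    if (syscall(SYS_get_thread_area, &desc) == -1)
        return -1;
    *base = desc.base_addr;
    return 0;
#else
    (void) entry;
    (void) base;
    errno = ENOSYS;
    return -1;
#endif
}
```

The sketch deliberately uses only get_thread_area(): it does not modify any descriptor, so it cannot disturb the TLS entries that the threading library relies on.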
RETURN VALUE
On x86, these system calls return 0 on success, and -1 on failure, with errno set to indicate the error.
On C-SKY, MIPS and m68k, set_thread_area() always returns 0. On m68k, get_thread_area() returns the thread area pointer value (previously set via set_thread_area()).
ERRORS
EFAULT
u_info is an invalid pointer.
EINVAL
u_info->entry_number is out of bounds.
ENOSYS
get_thread_area() or set_thread_area() was invoked as a 64-bit system call.
ESRCH
(set_thread_area()) A free TLS entry could not be located.
STANDARDS
Linux.
HISTORY
set_thread_area()
Linux 2.5.29.
get_thread_area()
Linux 2.5.32.
NOTES
These system calls are generally intended for use only by threading libraries.
arch_prctl(2) can interfere with set_thread_area() on x86. See arch_prctl(2) for more details. This is not normally a problem, as arch_prctl(2) is normally used only by 64-bit programs.
On MIPS, the current value of the thread area pointer can be obtained using the instruction:
rdhwr dest, $29
This instruction traps and is handled by kernel.
BUGS
On 64-bit kernels before Linux 3.19, one of the padding bits in user_desc, if set, would prevent the descriptor from being considered empty (see modify_ldt(2)). As a result, the only reliable way to clear a TLS entry is to use memset(3) to zero the entire user_desc structure, including padding bits, and then to set the read_exec_only and seg_not_present bits. On Linux 3.19, a user_desc consisting entirely of zeros except for entry_number will also be interpreted as a request to clear a TLS entry, but this behaved differently on older kernels.
Prior to Linux 3.19, the DS and ES segment registers must not reference TLS entries.
SEE ALSO
arch_prctl(2), modify_ldt(2), ptrace(2) (PTRACE_GET_THREAD_AREA and PTRACE_SET_THREAD_AREA)
442 - Linux cli command spu_create
NAME
spu_create - create a new spu context
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/spu.h> /* Definition of SPU_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_spu_create, const char *pathname, unsigned int flags,
            mode_t mode, int neighbor_fd);
Note: glibc provides no wrapper for spu_create(), necessitating the use of syscall(2).
DESCRIPTION
The spu_create() system call is used on PowerPC machines that implement the Cell Broadband Engine Architecture in order to access Synergistic Processor Units (SPUs). It creates a new logical context for an SPU in pathname and returns a file descriptor associated with it. pathname must refer to a nonexistent directory in the mount point of the SPU filesystem (spufs). If spu_create() is successful, a directory is created at pathname and it is populated with the files described in spufs(7).
When a context is created, the returned file descriptor can only be passed to spu_run(2), used as the dirfd argument to the *at family of system calls (e.g., openat(2)), or closed; other operations are not defined. A logical SPU context is destroyed (along with all files created within the context’s pathname directory) once the last reference to the context has gone; this usually occurs when the file descriptor returned by spu_create() is closed.
The mode argument (minus any bits set in the process’s umask(2)) specifies the permissions used for creating the new directory in spufs. See stat(2) for a full list of the possible mode values.
The neighbor_fd is used only when the SPU_CREATE_AFFINITY_SPU flag is specified; see below.
The flags argument can be zero or any bitwise OR-ed combination of the following constants:
SPU_CREATE_EVENTS_ENABLED
Rather than using signals for reporting DMA errors, use the event argument to spu_run(2).
SPU_CREATE_GANG
Create an SPU gang instead of a context. (A gang is a group of SPU contexts that are functionally related to each other and which share common scheduling parameters: priority and policy. In the future, gang scheduling may be implemented causing the group to be switched in and out as a single unit.)
A new directory will be created at the location specified by the pathname argument. This gang may be used to hold other SPU contexts, by providing a pathname that is within the gang directory to further calls to spu_create().
SPU_CREATE_NOSCHED
Create a context that is not affected by the SPU scheduler. Once the context is run, it will not be scheduled out until it is destroyed by the creating process.
Because the context cannot be removed from the SPU, some functionality is disabled for SPU_CREATE_NOSCHED contexts. Only a subset of the files will be available in this context directory in spufs. Additionally, SPU_CREATE_NOSCHED contexts cannot dump a core file when crashing.
Creating SPU_CREATE_NOSCHED contexts requires the CAP_SYS_NICE capability.
SPU_CREATE_ISOLATE
Create an isolated SPU context. Isolated contexts are protected from some PPE (PowerPC Processing Element) operations, such as access to the SPU local store and the NPC register.
Creating SPU_CREATE_ISOLATE contexts also requires the SPU_CREATE_NOSCHED flag.
SPU_CREATE_AFFINITY_SPU (since Linux 2.6.23)
Create a context with affinity to another SPU context. This affinity information is used within the SPU scheduling algorithm. Using this flag requires that a file descriptor referring to the other SPU context be passed in the neighbor_fd argument.
SPU_CREATE_AFFINITY_MEM (since Linux 2.6.23)
Create a context with affinity to system memory. This affinity information is used within the SPU scheduling algorithm.
RETURN VALUE
On success, spu_create() returns a new file descriptor. On failure, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
The current user does not have write access to the spufs(7) mount point.
EEXIST
An SPU context already exists at the given pathname.
EFAULT
pathname is not a valid string pointer in the calling process’s address space.
EINVAL
pathname is not a directory in the spufs(7) mount point, or invalid flags have been provided.
ELOOP
Too many symbolic links were found while resolving pathname.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENAMETOOLONG
pathname is too long.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
An isolated context was requested, but the hardware does not support SPU isolation.
ENOENT
Part of pathname could not be resolved.
ENOMEM
The kernel could not allocate all resources required.
ENOSPC
There are not enough SPU resources available to create a new context or the user-specific limit for the number of SPU contexts has been reached.
ENOSYS
The functionality is not provided by the current system, because either the hardware does not provide SPUs or the spufs module is not loaded.
ENOTDIR
A part of pathname is not a directory.
EPERM
The SPU_CREATE_NOSCHED flag has been given, but the user does not have the CAP_SYS_NICE capability.
FILES
pathname must point to a location beneath the mount point of spufs. By convention, it gets mounted in /spu.
STANDARDS
Linux on PowerPC.
HISTORY
Linux 2.6.16.
Prior to the addition of the SPU_CREATE_AFFINITY_SPU flag in Linux 2.6.23, the spu_create() system call took only three arguments (i.e., there was no neighbor_fd argument).
NOTES
spu_create() is meant to be used from libraries that implement a more abstract interface to SPUs, not to be used from regular applications. See for the recommended libraries.
EXAMPLES
See spu_run(2) for an example of the use of spu_create().
SEE ALSO
close(2), spu_run(2), capabilities(7), spufs(7)
443 - Linux cli command setreuid
NAME 🖥️ setreuid 🖥️
set real and/or effective user or group ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int setreuid(uid_t ruid, uid_t euid);
int setregid(gid_t rgid, gid_t egid);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
setreuid(), setregid():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
setreuid() sets real and effective user IDs of the calling process.
Supplying a value of -1 for either the real or effective user ID forces the system to leave that ID unchanged.
Unprivileged processes may only set the effective user ID to the real user ID, the effective user ID, or the saved set-user-ID.
Unprivileged processes may only set the real user ID to the real user ID or the effective user ID.
If the real user ID is set (i.e., ruid is not -1) or the effective user ID is set to a value not equal to the previous real user ID, the saved set-user-ID will be set to the new effective user ID.
Completely analogously, setregid() sets the real and effective group IDs of the calling process, and all of the above holds with “group” instead of “user”.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where setreuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from setreuid().
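The rules above, together with the warning about unchecked failures, suggest a pattern like the following minimal sketch for permanently dropping an elevated effective user ID. The helper name drop_to_real_uid() is hypothetical and not part of any library; the sketch only checks the real and effective IDs afterwards and relies on the rule stated in DESCRIPTION for the saved set-user-ID.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set the effective user ID to the caller's
   real user ID.  Returns 0 on success; on failure returns -1 with
   errno set. */
int
drop_to_real_uid(void)
{
    uid_t ruid = getuid();

    /* Because ruid is not -1, the saved set-user-ID is also set to
       the new effective user ID (see DESCRIPTION). */
    if (setreuid(ruid, ruid) == -1)
        return -1;

    /* Defensive check: never assume the change took effect. */
    if (getuid() != ruid || geteuid() != ruid) {
        errno = EPERM;
        return -1;
    }

    return 0;
}
```

A caller would typically treat a nonzero return as fatal and terminate rather than continue with unexpected credentials.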
ERRORS
EAGAIN
The call would change the caller’s real UID (i.e., ruid does not match the caller’s real UID), but there was a temporary failure allocating the necessary kernel data structures.
EAGAIN
ruid does not match the caller’s real UID and this call would bring the number of processes belonging to the real user ID ruid over the caller’s RLIMIT_NPROC resource limit. Since Linux 3.1, this error case no longer occurs (but robust applications should check for this error); see the description of EAGAIN in execve(2).
EINVAL
One or more of the target user or group IDs is not valid in this user namespace.
EPERM
The calling process is not privileged (on Linux, does not have the necessary capability in its user namespace: CAP_SETUID in the case of setreuid(), or CAP_SETGID in the case of setregid()) and a change other than (i) swapping the effective user (group) ID with the real user (group) ID, or (ii) setting one to the value of the other or (iii) setting the effective user (group) ID to the value of the saved set-user-ID (saved set-group-ID) was specified.
VERSIONS
POSIX.1 does not specify all of the UID changes that Linux permits for an unprivileged process. For setreuid(), the effective user ID can be made the same as the real user ID or the saved set-user-ID, and it is unspecified whether unprivileged processes may set the real user ID to the real user ID, the effective user ID, or the saved set-user-ID. For setregid(), the real group ID can be changed to the value of the saved set-group-ID, and the effective group ID can be changed to the value of the real group ID or the saved set-group-ID. The precise details of what ID changes are permitted vary across implementations.
POSIX.1 makes no specification about the effect of these calls on the saved set-user-ID and saved set-group-ID.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.3BSD (first appeared in 4.2BSD).
Setting the effective user (group) ID to the saved set-user-ID (saved set-group-ID) is possible since Linux 1.1.37 (1.1.38).
The original Linux setreuid() and setregid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added setreuid32() and setregid32(), supporting 32-bit IDs. The glibc setreuid() and setregid() wrapper functions transparently deal with the variations across kernel versions.
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including those for setreuid() and setregid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
SEE ALSO
getgid(2), getuid(2), seteuid(2), setgid(2), setresuid(2), setuid(2), capabilities(7), credentials(7), user_namespaces(7)
444 - Linux cli command ioctl_ns
NAME 🖥️ ioctl_ns 🖥️
ioctl() operations for Linux namespaces
DESCRIPTION
Discovering namespace relationships
The following ioctl(2) operations are provided to allow discovery of namespace relationships (see user_namespaces(7) and pid_namespaces(7)). The form of the calls is:
new_fd = ioctl(fd, op);
In each case, fd refers to a /proc/pid/ns/* file. Both operations return a new file descriptor on success.
NS_GET_USERNS (since Linux 4.9)
Returns a file descriptor that refers to the owning user namespace for the namespace referred to by fd.
NS_GET_PARENT (since Linux 4.9)
Returns a file descriptor that refers to the parent namespace of the namespace referred to by fd. This operation is valid only for hierarchical namespaces (i.e., PID and user namespaces). For user namespaces, NS_GET_PARENT is synonymous with NS_GET_USERNS.
The new file descriptor returned by these operations is opened with the O_RDONLY and O_CLOEXEC (close-on-exec; see fcntl(2)) flags.
By applying fstat(2) to the returned file descriptor, one obtains a stat structure whose st_dev (resident device) and st_ino (inode number) fields together identify the owning/parent namespace. This inode number can be matched with the inode number of another /proc/pid/ns/{pid,user} file to determine whether that is the owning/parent namespace.
Either of these ioctl(2) operations can fail with the following errors:
EPERM
The requested namespace is outside of the caller’s namespace scope. This error can occur if, for example, the owning user namespace is an ancestor of the caller’s current user namespace. It can also occur on attempts to obtain the parent of the initial user or PID namespace.
ENOTTY
The operation is not supported by this kernel version.
Additionally, the NS_GET_PARENT operation can fail with the following error:
EINVAL
fd refers to a nonhierarchical namespace.
See the EXAMPLES section for an example of the use of these operations.
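The st_dev/st_ino comparison described above can be sketched as two small helpers. The names same_namespace() and owned_by_userns() are hypothetical; the sketch assumes kernel headers that provide <linux/nsfs.h> (Linux 4.9 or later).

```c
#include <fcntl.h>
#include <linux/nsfs.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: return 1 if both file descriptors refer to
   the same namespace, 0 if they differ, -1 if fstat() fails. */
int
same_namespace(int fd1, int fd2)
{
    struct stat sb1, sb2;

    if (fstat(fd1, &sb1) == -1 || fstat(fd2, &sb2) == -1)
        return -1;

    /* Device and inode number together identify a namespace. */
    return sb1.st_dev == sb2.st_dev && sb1.st_ino == sb2.st_ino;
}

/* Hypothetical helper: return 1 if the namespace at ns_path is owned
   by the user namespace at userns_path, 0 if not, -1 on error
   (for example EPERM or ENOTTY from the ioctl). */
int
owned_by_userns(const char *ns_path, const char *userns_path)
{
    int ns_fd, userns_fd, owner_fd;
    int res = -1;

    ns_fd = open(ns_path, O_RDONLY);
    userns_fd = open(userns_path, O_RDONLY);
    owner_fd = (ns_fd == -1) ? -1 : ioctl(ns_fd, NS_GET_USERNS);

    if (owner_fd != -1 && userns_fd != -1)
        res = same_namespace(owner_fd, userns_fd);

    if (owner_fd != -1)
        close(owner_fd);
    if (userns_fd != -1)
        close(userns_fd);
    if (ns_fd != -1)
        close(ns_fd);

    return res;
}
```

For instance, owned_by_userns("/proc/self/ns/uts", "/proc/self/ns/user") asks whether the caller's UTS namespace belongs to the caller's own user namespace.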
Discovering the namespace type
The NS_GET_NSTYPE operation (available since Linux 4.11) can be used to discover the type of namespace referred to by the file descriptor fd:
nstype = ioctl(fd, NS_GET_NSTYPE);
fd refers to a /proc/pid/ns/* file.
The return value is one of the CLONE_NEW* values that can be specified to clone(2) or unshare(2) in order to create a namespace.
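As an illustration, the value returned by NS_GET_NSTYPE can be mapped to a short name. This is a sketch under stated assumptions: nstype_name() and ns_fd_type() are hypothetical helpers, and the CLONE_NEW* constants are taken from the kernel header <linux/sched.h> rather than from <sched.h>, so that no feature test macro is needed.

```c
#include <linux/nsfs.h>
#include <linux/sched.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Hypothetical helper: map a CLONE_NEW* value to a short name. */
const char *
nstype_name(int nstype)
{
    switch (nstype) {
    case CLONE_NEWNS:     return "mnt";
    case CLONE_NEWUTS:    return "uts";
    case CLONE_NEWIPC:    return "ipc";
    case CLONE_NEWUSER:   return "user";
    case CLONE_NEWPID:    return "pid";
    case CLONE_NEWNET:    return "net";
#ifdef CLONE_NEWCGROUP          /* not present in older headers */
    case CLONE_NEWCGROUP: return "cgroup";
#endif
#ifdef CLONE_NEWTIME            /* not present in older headers */
    case CLONE_NEWTIME:   return "time";
#endif
    default:              return "unknown";
    }
}

/* Hypothetical helper: return the type name of the namespace that
   fd refers to, or NULL if the ioctl fails (errno is then set). */
const char *
ns_fd_type(int fd)
{
    int nstype = ioctl(fd, NS_GET_NSTYPE);

    return (nstype == -1) ? NULL : nstype_name(nstype);
}
```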
Discovering the owner of a user namespace
The NS_GET_OWNER_UID operation (available since Linux 4.11) can be used to discover the owner user ID of a user namespace (i.e., the effective user ID of the process that created the user namespace). The form of the call is:
uid_t uid;
ioctl(fd, NS_GET_OWNER_UID, &uid);
fd refers to a /proc/pid/ns/user file.
The owner user ID is returned in the uid_t pointed to by the third argument.
This operation can fail with the following error:
EINVAL
fd does not refer to a user namespace.
ERRORS
Any of the above ioctl() operations can return the following errors:
ENOTTY
fd does not refer to a /proc/pid/ns/* file.
STANDARDS
Linux.
EXAMPLES
The example shown below uses the ioctl(2) operations described above to perform simple discovery of namespace relationships. The following shell sessions show various examples of the use of this program.
Trying to get the parent of the initial user namespace fails, since it has no parent:
$ ./ns_show /proc/self/ns/user p
The parent namespace is outside your namespace scope
Create a process running sleep(1) that resides in new user and UTS namespaces, and show that the new UTS namespace is associated with the new user namespace:
$ unshare -Uu sleep 1000 &
[1] 23235
$ ./ns_show /proc/23235/ns/uts u
Device/Inode of owning user namespace is: [0,3] / 4026532448
$ readlink /proc/23235/ns/user
user:[4026532448]
Then show that the parent of the new user namespace in the preceding example is the initial user namespace:
$ readlink /proc/self/ns/user
user:[4026531837]
$ ./ns_show /proc/23235/ns/user p
Device/Inode of parent namespace is: [0,3] / 4026531837
Start a shell in a new user namespace, and show that from within this shell, the parent user namespace can’t be discovered. Similarly, the UTS namespace (which is associated with the initial user namespace) can’t be discovered.
$ PS1="sh2$ " unshare -U bash
sh2$ ./ns_show /proc/self/ns/user p
The parent namespace is outside your namespace scope
sh2$ ./ns_show /proc/self/ns/uts u
The owning user namespace is outside your namespace scope
Program source
/* ns_show.c

   Licensed under the GNU General Public License v2 or later.
*/
#include <errno.h>
#include <fcntl.h>
#include <linux/nsfs.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int          fd, userns_fd, parent_fd;
    struct stat  sb;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s /proc/[pid]/ns/[file] [p|u]\n",
                argv[0]);
        fprintf(stderr, "\nDisplay the result of one or both "
                "of NS_GET_USERNS (u) or NS_GET_PARENT (p)\n"
                "for the specified /proc/[pid]/ns/[file]. If neither "
                "'p' nor 'u' is specified,\n"
                "NS_GET_USERNS is the default.\n");
        exit(EXIT_FAILURE);
    }

    /* Obtain a file descriptor for the 'ns' file specified
       in argv[1]. */

    fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    /* Obtain a file descriptor for the owning user namespace and
       then obtain and display the inode number of that namespace. */

    if (argc < 3 || strchr(argv[2], 'u')) {
        userns_fd = ioctl(fd, NS_GET_USERNS);

        if (userns_fd == -1) {
            if (errno == EPERM)
                printf("The owning user namespace is outside "
                       "your namespace scope\n");
            else
                perror("ioctl-NS_GET_USERNS");
            exit(EXIT_FAILURE);
        }

        if (fstat(userns_fd, &sb) == -1) {
            perror("fstat-userns");
            exit(EXIT_FAILURE);
        }
        printf("Device/Inode of owning user namespace is: "
               "[%x,%x] / %ju\n",
               major(sb.st_dev),
               minor(sb.st_dev),
               (uintmax_t) sb.st_ino);

        close(userns_fd);
    }

    /* Obtain a file descriptor for the parent namespace and
       then obtain and display the inode number of that namespace. */

    if (argc > 2 && strchr(argv[2], 'p')) {
        parent_fd = ioctl(fd, NS_GET_PARENT);

        if (parent_fd == -1) {
            if (errno == EINVAL)
                printf("Can't get parent namespace of a "
                       "nonhierarchical namespace\n");
            else if (errno == EPERM)
                printf("The parent namespace is outside "
                       "your namespace scope\n");
            else
                perror("ioctl-NS_GET_PARENT");
            exit(EXIT_FAILURE);
        }

        if (fstat(parent_fd, &sb) == -1) {
            perror("fstat-parentns");
            exit(EXIT_FAILURE);
        }
        printf("Device/Inode of parent namespace is: [%x,%x] / %ju\n",
               major(sb.st_dev),
               minor(sb.st_dev),
               (uintmax_t) sb.st_ino);

        close(parent_fd);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
fstat(2), ioctl(2), proc(5), namespaces(7)
445 - Linux cli command syscall
NAME 🖥️ syscall 🖥️
indirect system call
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(long number, ...);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
syscall():
Since glibc 2.19:
_DEFAULT_SOURCE
Before glibc 2.19:
_BSD_SOURCE || _SVID_SOURCE
DESCRIPTION
syscall() is a small library function that invokes the system call whose assembly language interface has the specified number with the specified arguments. Employing syscall() is useful, for example, when invoking a system call that has no wrapper function in the C library.
syscall() saves CPU registers before making the system call, restores the registers upon return from the system call, and stores any error returned by the system call in errno(3).
Symbolic constants for system call numbers can be found in the header file <sys/syscall.h>.
RETURN VALUE
The return value is defined by the system call being invoked. In general, a 0 return value indicates success. A -1 return value indicates an error, and an error number is stored in errno.
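The return and error convention can be illustrated with two thin wrappers. This is a sketch only; my_gettid() and try_syscall_close() are hypothetical names chosen for the example.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: return the caller's thread ID by invoking
   gettid(2) through syscall(). */
pid_t
my_gettid(void)
{
    return (pid_t) syscall(SYS_gettid);
}

/* Hypothetical wrapper: invoke close(2) through syscall().
   Returns 0 on success, or the errno value reported by the call. */
int
try_syscall_close(int fd)
{
    if (syscall(SYS_close, fd) == -1)
        return errno;

    return 0;
}
```

In the main thread of a process, my_gettid() returns the same value as getpid(), and try_syscall_close(-1) reports EBADF.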
ERRORS
ENOSYS
The requested system call number is not implemented.
Other errors are specific to the invoked system call.
NOTES
syscall() first appeared in 4BSD.
Architecture-specific requirements
Each architecture ABI has its own requirements on how system call arguments are passed to the kernel. For system calls that have a glibc wrapper (e.g., most system calls), glibc handles the details of copying arguments to the right registers in a manner suitable for the architecture. However, when using syscall() to make a system call, the caller might need to handle architecture-dependent details; this requirement is most commonly encountered on certain 32-bit architectures.
For example, on the ARM architecture Embedded ABI (EABI), a 64-bit value (e.g., long long) must be aligned to an even register pair. Thus, using syscall() instead of the wrapper provided by glibc, the readahead(2) system call would be invoked as follows on the ARM architecture with the EABI in little endian mode:
syscall(SYS_readahead, fd, 0,
(unsigned int) (offset & 0xFFFFFFFF),
(unsigned int) (offset >> 32),
count);
Since the offset argument is 64 bits, and the first argument (fd) is passed in r0, the caller must manually split and align the 64-bit value so that it is passed in the r2/r3 register pair. That means inserting a dummy value into r1 (the second argument of 0). Care also must be taken so that the split follows endian conventions (according to the C ABI for the platform).
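The splitting arithmetic on its own can be sketched portably as follows. The type and function names are hypothetical, and the sketch does not invoke any system call; it only shows how the two 32-bit halves are derived and recombined. Which half goes into which register of the pair still depends on the platform's endianness.

```c
#include <stdint.h>

/* Hypothetical helper type: the two 32-bit halves of a 64-bit value. */
struct u64_halves {
    uint32_t lo;    /* bits 0..31 */
    uint32_t hi;    /* bits 32..63 */
};

/* Split a 64-bit value into its low and high 32-bit halves. */
struct u64_halves
split_u64(uint64_t value)
{
    struct u64_halves h;

    h.lo = (uint32_t) (value & 0xFFFFFFFFu);
    h.hi = (uint32_t) (value >> 32);
    return h;
}

/* Reassemble the 64-bit value from its halves. */
uint64_t
join_u64(struct u64_halves h)
{
    return ((uint64_t) h.hi << 32) | h.lo;
}
```

Converting a signed off_t to uint64_t before splitting avoids relying on the behavior of right-shifting a negative value.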
Similar issues can occur on MIPS with the O32 ABI, on PowerPC and parisc with the 32-bit ABI, and on Xtensa.
Note that while the parisc C ABI also uses aligned register pairs, it uses a shim layer to hide the issue from user space.
The affected system calls are fadvise64_64(2), ftruncate64(2), posix_fadvise(2), pread64(2), pwrite64(2), readahead(2), sync_file_range(2), and truncate64(2).
This does not affect syscalls that manually split and assemble 64-bit values such as _llseek(2), preadv(2), preadv2(2), pwritev(2), and pwritev2(2). Welcome to the wonderful world of historical baggage.
Architecture calling conventions
Every architecture has its own way of invoking and passing arguments to the kernel. The details for various architectures are listed in the two tables below.
The first table lists the instruction used to transition to kernel mode (which might not be the fastest or best way to transition to the kernel, so you might have to refer to vdso(7)), the register used to indicate the system call number, the register(s) used to return the system call result, and the register used to signal an error.
Arch/ABI | Instruction | System call # | Ret val | Ret val2 | Error | Notes |
---|---|---|---|---|---|---|
alpha | callsys | v0 | v0 | a4 | a3 | 1, 6 |
arc | trap0 | r8 | r0 | - | - | |
arm/OABI | swi NR | - | r0 | - | - | 2 |
arm/EABI | swi 0x0 | r7 | r0 | r1 | - | |
arm64 | svc #0 | w8 | x0 | x1 | - | |
blackfin | excpt 0x0 | P0 | R0 | - | - | |
i386 | int $0x80 | eax | eax | edx | - | |
ia64 | break 0x100000 | r15 | r8 | r9 | r10 | 1, 6 |
loongarch | syscall 0 | a7 | a0 | - | - | |
m68k | trap #0 | d0 | d0 | - | - | |
microblaze | brki r14,8 | r12 | r3 | - | - | |
mips | syscall | v0 | v0 | v1 | a3 | 1, 6 |
nios2 | trap | r2 | r2 | - | r7 | |
parisc | ble 0x100(%sr2, %r0) | r20 | r28 | - | - | |
powerpc | sc | r0 | r3 | - | r0 | 1 |
powerpc64 | sc | r0 | r3 | - | cr0.SO | 1 |
riscv | ecall | a7 | a0 | a1 | - | |
s390 | svc 0 | r1 | r2 | r3 | - | 3 |
s390x | svc 0 | r1 | r2 | r3 | - | 3 |
superh | trapa #31 | r3 | r0 | r1 | - | 4, 6 |
sparc/32 | t 0x10 | g1 | o0 | o1 | psr/csr | 1, 6 |
sparc/64 | t 0x6d | g1 | o0 | o1 | psr/csr | 1, 6 |
tile | swint1 | R10 | R00 | - | R01 | 1 |
x86-64 | syscall | rax | rax | rdx | - | 5 |
x32 | syscall | rax | rax | rdx | - | 5 |
xtensa | syscall | a2 | a2 | - | - | |
Notes:
1. On a few architectures, a register is used as a boolean (0 indicating no error, and -1 indicating an error) to signal that the system call failed. The actual error value is still contained in the return register. On sparc, the carry bit (csr) in the processor status register (psr) is used instead of a full register. On powerpc64, the summary overflow bit (SO) in field 0 of the condition register (cr0) is used.
2. NR is the system call number.
3. For s390 and s390x, NR (the system call number) may be passed directly with svc NR if it is less than 256.
4. On SuperH, additional trap numbers are supported for historic reasons, but trapa #31 is the recommended “unified” ABI.
5. The x32 ABI shares its syscall table with the x86-64 ABI, but there are some nuances:
   - In order to indicate that a system call is called under the x32 ABI, an additional bit, __X32_SYSCALL_BIT, is bitwise ORed with the system call number. The ABI used by a process affects some process behaviors, including signal handling or system call restarting.
   - Since x32 has different sizes for long and pointer types, layouts of some (but not all; struct timeval or struct rlimit are 64-bit, for example) structures are different. In order to handle this, additional system calls are added to the system call table, starting from number 512 (without the __X32_SYSCALL_BIT). For example, __NR_readv is defined as 19 for the x86-64 ABI and as __X32_SYSCALL_BIT | 515 for the x32 ABI. Most of these additional system calls are actually identical to the system calls used for providing i386 compat. There are some notable exceptions, however, such as preadv2(2), which uses struct iovec entities with 4-byte pointers and sizes (“compat_iovec” in kernel terms), but passes an 8-byte pos argument in a single register and not two, as is done in every other ABI.
6. Some architectures (namely, Alpha, IA-64, MIPS, SuperH, sparc/32, and sparc/64) use an additional register (“Ret val2” in the above table) to pass back a second return value from the pipe(2) system call; Alpha uses this technique in the architecture-specific getxpid(2), getxuid(2), and getxgid(2) system calls as well. Other architectures do not use the second return value register in the system call interface, even if it is defined in the System V ABI.
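The x32 numbering scheme described in the notes above can be sketched with a few bit operations. This is an illustration under stated assumptions: the constant below mirrors the value of __X32_SYSCALL_BIT in the x86 kernel headers (<asm/unistd.h>) but is defined locally under a hypothetical name so that the sketch builds on any architecture, and the helper functions are likewise hypothetical.

```c
/* Assumption: same value as __X32_SYSCALL_BIT on x86; defined
   locally so that the example does not depend on x86 headers. */
#define EXAMPLE_X32_SYSCALL_BIT 0x40000000L

/* Return nonzero if nr carries the x32 marker bit. */
int
is_x32_syscall(long nr)
{
    return (nr & EXAMPLE_X32_SYSCALL_BIT) != 0;
}

/* Strip the marker bit, leaving the index into the syscall table. */
long
x32_table_index(long nr)
{
    return nr & ~EXAMPLE_X32_SYSCALL_BIT;
}

/* Build an x32 system call number from a table index. */
long
make_x32_syscall(long index)
{
    return index | EXAMPLE_X32_SYSCALL_BIT;
}
```

With the readv example from the notes, make_x32_syscall(515) yields 0x40000203, while the x86-64 number 19 does not carry the marker bit.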
The second table shows the registers used to pass the system call arguments.
Arch/ABI | arg1 | arg2 | arg3 | arg4 | arg5 | arg6 | arg7 | Notes |
---|---|---|---|---|---|---|---|---|
alpha | a0 | a1 | a2 | a3 | a4 | a5 | - | |
arc | r0 | r1 | r2 | r3 | r4 | r5 | - | |
arm/OABI | r0 | r1 | r2 | r3 | r4 | r5 | r6 | |
arm/EABI | r0 | r1 | r2 | r3 | r4 | r5 | r6 | |
arm64 | x0 | x1 | x2 | x3 | x4 | x5 | - | |
blackfin | R0 | R1 | R2 | R3 | R4 | R5 | - | |
i386 | ebx | ecx | edx | esi | edi | ebp | - | |
ia64 | out0 | out1 | out2 | out3 | out4 | out5 | - | |
loongarch | a0 | a1 | a2 | a3 | a4 | a5 | a6 | |
m68k | d1 | d2 | d3 | d4 | d5 | a0 | - | |
microblaze | r5 | r6 | r7 | r8 | r9 | r10 | - | |
mips/o32 | a0 | a1 | a2 | a3 | - | - | - | 1 |
mips/n32,64 | a0 | a1 | a2 | a3 | a4 | a5 | - | |
nios2 | r4 | r5 | r6 | r7 | r8 | r9 | - | |
parisc | r26 | r25 | r24 | r23 | r22 | r21 | - | |
powerpc | r3 | r4 | r5 | r6 | r7 | r8 | r9 | |
powerpc64 | r3 | r4 | r5 | r6 | r7 | r8 | - | |
riscv | a0 | a1 | a2 | a3 | a4 | a5 | - | |
s390 | r2 | r3 | r4 | r5 | r6 | r7 | - | |
s390x | r2 | r3 | r4 | r5 | r6 | r7 | - | |
superh | r4 | r5 | r6 | r7 | r0 | r1 | r2 | |
sparc/32 | o0 | o1 | o2 | o3 | o4 | o5 | - | |
sparc/64 | o0 | o1 | o2 | o3 | o4 | o5 | - | |
tile | R00 | R01 | R02 | R03 | R04 | R05 | - | |
x86-64 | rdi | rsi | rdx | r10 | r8 | r9 | - | |
x32 | rdi | rsi | rdx | r10 | r8 | r9 | - | |
xtensa | a6 | a3 | a4 | a5 | a8 | a9 | - | |
Notes:
1. The mips/o32 system call convention passes arguments 5 through 8 on the user stack.
Note that these tables don’t cover the entire calling convention: some architectures may indiscriminately clobber other registers not listed here.
EXAMPLES
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
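int
main(void)
{
    pid_t tid;

    /* Invoke gettid(2) and tgkill(2) by number via syscall();
       the process sends SIGHUP to its own (only) thread. */
    tid = syscall(SYS_gettid);
    syscall(SYS_tgkill, getpid(), tid, SIGHUP);
}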
SEE ALSO
_syscall(2), intro(2), syscalls(2), errno(3), vdso(7)
446 - Linux cli command getgroups32
NAME 🖥️ getgroups32 🖥️
get/set list of supplementary group IDs
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int getgroups(int size, gid_t list[]);
#include <grp.h>
int setgroups(size_t size, const gid_t *_Nullable list);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
setgroups():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
getgroups() returns the supplementary group IDs of the calling process in list. The argument size should be set to the maximum number of items that can be stored in the buffer pointed to by list. If the calling process is a member of more than size supplementary groups, then an error results.
It is unspecified whether the effective group ID of the calling process is included in the returned list. (Thus, an application should also call getegid(2) and add or remove the resulting value.)
If size is zero, list is not modified, but the total number of supplementary group IDs for the process is returned. This allows the caller to determine the size of a dynamically allocated list to be used in a further call to getgroups().
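The two-call pattern described above (query the count with a size of zero, then allocate and fetch) can be sketched as follows. The helper name get_supplementary_groups() is hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return a malloc()ed array holding the caller's
   supplementary group IDs and store their number in *count.
   Returns NULL on error; the caller must free() the result. */
gid_t *
get_supplementary_groups(int *count)
{
    int n;
    gid_t *list;

    /* A size of zero only queries the number of group IDs. */
    n = getgroups(0, NULL);
    if (n == -1)
        return NULL;

    /* Allocate at least one element to avoid a zero-size request. */
    list = malloc((n > 0 ? n : 1) * sizeof(*list));
    if (list == NULL)
        return NULL;

    /* The list may have changed since the first call; in that case
       getgroups() can fail with EINVAL and the caller may retry. */
    n = getgroups(n, list);
    if (n == -1) {
        free(list);
        return NULL;
    }

    *count = n;
    return list;
}
```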
setgroups() sets the supplementary group IDs for the calling process. Appropriate privileges are required (see the description of the EPERM error, below). The size argument specifies the number of supplementary group IDs in the buffer pointed to by list. A process can drop all of its supplementary groups with the call:
setgroups(0, NULL);
RETURN VALUE
On success, getgroups() returns the number of supplementary group IDs. On error, -1 is returned, and errno is set to indicate the error.
On success, setgroups() returns 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
list has an invalid address.
getgroups() can additionally fail with the following error:
EINVAL
size is less than the number of supplementary group IDs, but is not zero.
setgroups() can additionally fail with the following errors:
EINVAL
size is greater than NGROUPS_MAX (32 before Linux 2.6.4; 65536 since Linux 2.6.4).
ENOMEM
Out of memory.
EPERM
The calling process has insufficient privilege (the caller does not have the CAP_SETGID capability in the user namespace in which it resides).
EPERM (since Linux 3.19)
The use of setgroups() is denied in this user namespace. See the description of /proc/pid/setgroups in user_namespaces(7).
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including the one for setgroups()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
getgroups()
POSIX.1-2008.
setgroups()
None.
HISTORY
getgroups()
SVr4, 4.3BSD, POSIX.1-2001.
setgroups()
SVr4, 4.3BSD. Since setgroups() requires privilege, it is not covered by POSIX.1.
The original Linux getgroups() system call supported only 16-bit group IDs. Subsequently, Linux 2.4 added getgroups32(), supporting 32-bit IDs. The glibc getgroups() wrapper function transparently deals with the variation across kernel versions.
NOTES
A process can have up to NGROUPS_MAX supplementary group IDs in addition to the effective group ID. The constant NGROUPS_MAX is defined in <limits.h>. The set of supplementary group IDs is inherited from the parent process, and preserved across an execve(2).
The maximum number of supplementary group IDs can be found at run time using sysconf(3):
long ngroups_max;
ngroups_max = sysconf(_SC_NGROUPS_MAX);
The maximum return value of getgroups() cannot be larger than one more than this value. Since Linux 2.6.4, the maximum number of supplementary group IDs is also exposed via the Linux-specific read-only file, /proc/sys/kernel/ngroups_max.
SEE ALSO
getgid(2), setgid(2), getgrouplist(3), group_member(3), initgroups(3), capabilities(7), credentials(7)
447 - Linux cli command ftruncate
NAME 🖥️ ftruncate 🖥️
truncate a file to a specified length
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int truncate(const char *path, off_t length);
int ftruncate(int fd, off_t length);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
truncate():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.12: */ _POSIX_C_SOURCE >= 200809L
|| /* glibc <= 2.19: */ _BSD_SOURCE
ftruncate():
_XOPEN_SOURCE >= 500
|| /* Since glibc 2.3.5: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc <= 2.19: */ _BSD_SOURCE
DESCRIPTION
The truncate() and ftruncate() functions cause the regular file named by path or referenced by fd to be truncated to a size of precisely length bytes.
If the file previously was larger than this size, the extra data is lost. If the file previously was shorter, it is extended, and the extended part reads as null bytes ('\0').
The file offset is not changed.
If the size changed, then the st_ctime and st_mtime fields (respectively, time of last status change and time of last modification; see inode(7)) for the file are updated, and the set-user-ID and set-group-ID mode bits may be cleared.
With ftruncate(), the file must be open for writing; with truncate(), the file must be writable.
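The behavior described above (extension with null bytes, unchanged file offset, shrinking) can be observed with a small sketch on a scratch file. The helper names and the scratch path template under /tmp are hypothetical; the sketch assumes /tmp is writable and resides on a filesystem that permits extending files.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set the size of the file referred to by fd
   and return the size then reported by fstat(), or -1 on error. */
off_t
resize_file(int fd, off_t length)
{
    struct stat sb;

    if (ftruncate(fd, length) == -1)
        return -1;
    if (fstat(fd, &sb) == -1)
        return -1;

    return sb.st_size;
}

/* Hypothetical demonstration: returns 0 if the observed behavior
   matches the description, -1 otherwise. */
int
demo_truncate(void)
{
    char path[] = "/tmp/ftruncate-demo-XXXXXX";
    char buf[4] = { 'x', 'x', 'x', 'x' };
    int fd, ok;

    fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);               /* file vanishes once fd is closed */

    ok = write(fd, "abc", 3) == 3
         && resize_file(fd, 8) == 8             /* extend to 8 bytes */
         && lseek(fd, 0, SEEK_CUR) == 3         /* offset unchanged */
         && pread(fd, buf, 4, 4) == 4           /* read extended part */
         && buf[0] == '\0' && buf[1] == '\0'
         && buf[2] == '\0' && buf[3] == '\0'
         && resize_file(fd, 1) == 1;            /* shrink to 1 byte */

    close(fd);
    return ok ? 0 : -1;
}
```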
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
For truncate():
EACCES
Search permission is denied for a component of the path prefix, or the named file is not writable by the user. (See also path_resolution(7).)
EFAULT
The argument path points outside the process’s allocated address space.
EFBIG
The argument length is larger than the maximum file size. (XSI)
EINTR
While blocked waiting to complete, the call was interrupted by a signal handler; see fcntl(2) and signal(7).
EINVAL
The argument length is negative or larger than the maximum file size.
EIO
An I/O error occurred updating the inode.
EISDIR
The named file is a directory.
ELOOP
Too many symbolic links were encountered in translating the pathname.
ENAMETOOLONG
A component of a pathname exceeded 255 characters, or an entire pathname exceeded 1023 characters.
ENOENT
The named file does not exist.
ENOTDIR
A component of the path prefix is not a directory.
EPERM
The underlying filesystem does not support extending a file beyond its current size.
EPERM
The operation was prevented by a file seal; see fcntl(2).
EROFS
The named file resides on a read-only filesystem.
ETXTBSY
The file is an executable file that is being executed.
For ftruncate() the same errors apply, but instead of things that can be wrong with path, we now have things that can be wrong with the file descriptor, fd:
EBADF
fd is not a valid file descriptor.
EBADF or EINVAL
fd is not open for writing.
EINVAL
fd does not reference a regular file or a POSIX shared memory object.
EINVAL or EBADF
The file descriptor fd is not open for writing. POSIX permits, and portable applications should handle, either error for this case. (Linux produces EINVAL.)
VERSIONS
The details in DESCRIPTION are for XSI-compliant systems. For non-XSI-compliant systems, the POSIX standard allows two behaviors for ftruncate() when length exceeds the file length (note that truncate() is not specified at all in such an environment): either returning an error, or extending the file. Like most UNIX implementations, Linux follows the XSI requirement when dealing with native filesystems. However, some nonnative filesystems do not permit truncate() and ftruncate() to be used to extend a file beyond its current length: a notable example on Linux is VFAT.
On some 32-bit architectures, the calling signature for these system calls differs, for the reasons described in syscall(2).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, 4.4BSD, SVr4 (first appeared in 4.2BSD).
The original Linux truncate() and ftruncate() system calls were not designed to handle large file offsets. Consequently, Linux 2.4 added truncate64() and ftruncate64() system calls that handle large files. However, these details can be ignored by applications using glibc, whose wrapper functions transparently employ the more recent system calls where they are available.
NOTES
ftruncate() can also be used to set the size of a POSIX shared memory object; see shm_open(3).
BUGS
A header file bug in glibc 2.12 meant that the minimum value of _POSIX_C_SOURCE required to expose the declaration of ftruncate() was 200809L instead of 200112L. This has been fixed in later glibc versions.
SEE ALSO
truncate(1), open(2), stat(2), path_resolution(7)
448 - Linux cli command insl
NAME
insl - port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width (8 bits), the w-suffix functions word-width (16 bits), and the l-suffix functions long-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
You use ioperm(2) or alternatively iopl(2) to tell the kernel to allow the user space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
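The following is a hedged sketch of the ioperm(2)-then-inb() sequence described above. read_port_byte() is a hypothetical helper, and the architecture guard is an assumption added so that the code still compiles where <sys/io.h> is unavailable. Without CAP_SYS_RAWIO the ioperm(2) call is expected to fail with EPERM, which the helper reports instead of touching the port.

```c
#include <errno.h>

#if defined(__linux__) && (defined(__i386__) || defined(__x86_64__))
#include <sys/io.h>
#define HAVE_PORT_IO 1
#else
#define HAVE_PORT_IO 0
#endif

/* Hypothetical helper: read one byte from an I/O port.
   Returns 0 and stores the byte in *value, or returns -1 with errno
   set (for example EPERM without CAP_SYS_RAWIO, or ENOSYS when built
   for an architecture not covered by the guard above). */
static int
read_port_byte(unsigned short port, unsigned char *value)
{
#if HAVE_PORT_IO
    /* Ask the kernel for access to exactly one port. */
    if (ioperm(port, 1, 1) == -1)
        return -1;

    *value = inb(port);

    /* Give the permission back; a failure here is ignored. */
    (void) ioperm(port, 1, 0);
    return 0;
#else
    (void) port;
    (void) value;
    errno = ENOSYS;
    return -1;
#endif
}
```

Requesting access to a single port with ioperm(2) keeps the privilege narrower than raising the I/O privilege level with iopl(2).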
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
449 - Linux cli command shmat
NAME
shmat - System V shared memory operations
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/shm.h>
void *shmat(int shmid, const void *_Nullable shmaddr, int shmflg);
int shmdt(const void *shmaddr);
DESCRIPTION
shmat()
shmat() attaches the System V shared memory segment identified by shmid to the address space of the calling process. The attaching address is specified by shmaddr with one of the following criteria:
If shmaddr is NULL, the system chooses a suitable (unused) page-aligned address to attach the segment.
If shmaddr isn’t NULL and SHM_RND is specified in shmflg, the attach occurs at the address equal to shmaddr rounded down to the nearest multiple of SHMLBA.
Otherwise, shmaddr must be a page-aligned address at which the attach occurs.
In addition to SHM_RND, the following flags may be specified in the shmflg bit-mask argument:
SHM_EXEC (Linux-specific; since Linux 2.6.9)
Allow the contents of the segment to be executed. The caller must have execute permission on the segment.
SHM_RDONLY
Attach the segment for read-only access. The process must have read permission for the segment. If this flag is not specified, the segment is attached for read and write access, and the process must have read and write permission for the segment. There is no notion of a write-only shared memory segment.
SHM_REMAP (Linux-specific)
This flag specifies that the mapping of the segment should replace any existing mapping in the range starting at shmaddr and continuing for the size of the segment. (Normally, an EINVAL error would result if a mapping already exists in this address range.) In this case, shmaddr must not be NULL.
The brk(2) value of the calling process is not altered by the attach. The segment will automatically be detached at process exit. The same segment may be attached both read-only and read-write, and more than once, in the process’s address space.
A successful shmat() call updates the members of the shmid_ds structure (see shmctl(2)) associated with the shared memory segment as follows:
shm_atime is set to the current time.
shm_lpid is set to the process-ID of the calling process.
shm_nattch is incremented by one.
shmdt()
shmdt() detaches the shared memory segment located at the address specified by shmaddr from the address space of the calling process. The to-be-detached segment must be currently attached with shmaddr equal to the value returned by the attaching shmat() call.
On a successful shmdt() call, the system updates the members of the shmid_ds structure associated with the shared memory segment as follows:
shm_dtime is set to the current time.
shm_lpid is set to the process-ID of the calling process.
shm_nattch is decremented by one. If it becomes 0 and the segment is marked for deletion, the segment is deleted.
RETURN VALUE
On success, shmat() returns the address of the attached shared memory segment; on error, (void *) -1 is returned, and errno is set to indicate the error.
On success, shmdt() returns 0; on error -1 is returned, and errno is set to indicate the error.
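As an illustration of the return conventions above, the sketch below attaches one private segment twice (once read-write, once read-only), checks each result against (void *) -1, and then detaches and removes the segment. shm_roundtrip() is a hypothetical name and the 4096-byte size is an arbitrary assumption.

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical demo: returns 0 if a string written through a
   read-write attachment is visible through a second, read-only
   attachment of the same segment; returns -1 otherwise. */
static int
shm_roundtrip(void)
{
    int shmid, ok;
    char *rw;
    const char *ro;

    shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid == -1)
        return -1;

    /* NULL lets the system choose the address. Failure is reported
       as (void *) -1, not as NULL. */
    rw = shmat(shmid, NULL, 0);
    if (rw == (void *) -1) {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    ro = shmat(shmid, NULL, SHM_RDONLY);
    if (ro == (void *) -1) {
        shmdt(rw);
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    strcpy(rw, "hello");
    ok = strcmp(ro, "hello") == 0;

    /* Each attachment is detached with the address shmat() returned. */
    shmdt(ro);
    shmdt(rw);
    shmctl(shmid, IPC_RMID, NULL);

    return ok ? 0 : -1;
}
```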
ERRORS
shmat() can fail with one of the following errors:
EACCES
The calling process does not have the required permissions for the requested attach type, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EIDRM
shmid points to a removed identifier.
EINVAL
Invalid shmid value, unaligned (i.e., not page-aligned and SHM_RND was not specified) or invalid shmaddr value, or can’t attach segment at shmaddr, or SHM_REMAP was specified and shmaddr was NULL.
ENOMEM
Could not allocate memory for the descriptor or for the page tables.
shmdt() can fail with one of the following errors:
EINVAL
There is no shared memory segment attached at shmaddr; or, shmaddr is not aligned on a page boundary.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
In SVID 3 (or perhaps earlier), the type of the shmaddr argument was changed from char * into const void *, and the returned type of shmat() from char * into void *.
NOTES
After a fork(2), the child inherits the attached shared memory segments.
After an execve(2), all attached shared memory segments are detached from the process.
Upon _exit(2), all attached shared memory segments are detached from the process.
Using shmat() with shmaddr equal to NULL is the preferred, portable way of attaching a shared memory segment. Be aware that the shared memory segment attached in this way may be attached at different addresses in different processes. Therefore, any pointers maintained within the shared memory must be made relative (typically to the starting address of the segment), rather than absolute.
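One way to keep pointers relative is to store byte offsets from the start of the segment. The sketch below is an assumption about how an application might lay out its data, not something the manual page prescribes. struct node, node_at(), offset_of(), sum_list(), and relocation_demo() are hypothetical names, and two ordinary buffers stand in for two attachments at different addresses.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record kept inside a shared segment. Links are byte
   offsets from the segment start; offset 0 marks the end of the list. */
struct node {
    size_t next_off;
    int    value;
};

/* Turn an offset into a pointer that is valid for this attachment. */
static struct node *
node_at(void *base, size_t off)
{
    return off == 0 ? NULL : (struct node *) ((char *) base + off);
}

/* Turn a pointer inside the segment back into an offset. */
static size_t
offset_of(const void *base, const struct node *n)
{
    return (size_t) ((const char *) n - (const char *) base);
}

/* Walk the list starting at head_off and add up the values. */
static int
sum_list(void *base, size_t head_off)
{
    int sum = 0;

    for (struct node *n = node_at(base, head_off); n != NULL;
         n = node_at(base, n->next_off))
        sum += n->value;
    return sum;
}

/* Build a two-node list in one buffer, copy the bytes to a second
   buffer at a different address, and walk the copy. */
static int
relocation_demo(void)
{
    struct node a[3], b[3];

    memset(a, 0, sizeof(a));       /* slot 0 is unused (offset 0 = end) */
    a[1].value = 10;
    a[1].next_off = offset_of(a, &a[2]);
    a[2].value = 32;
    a[2].next_off = 0;

    memcpy(b, a, sizeof(a));
    return sum_list(b, offset_of(a, &a[1]));
}
```

Because only offsets are stored, the list remains valid in the copy even though every node now lives at a different absolute address.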
On Linux, it is possible to attach a shared memory segment even if it is already marked to be deleted. However, POSIX.1 does not specify this behavior and many other implementations do not support it.
The following system parameter affects shmat():
SHMLBA
Segment low boundary address multiple. When explicitly specifying an attach address in a call to shmat(), the caller should ensure that the address is a multiple of this value. This is necessary on some architectures, in order either to ensure good CPU cache performance or to ensure that different attaches of the same segment have consistent views within the CPU cache. SHMLBA is normally some multiple of the system page size. (On many Linux architectures, SHMLBA is the same as the system page size.)
The implementation places no intrinsic per-process limit on the number of shared memory segments (SHMSEG).
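When an explicit attach address is needed, a caller can perform in user space the same rounding that SHM_RND requests. The helper below is a hypothetical illustration of that arithmetic; it only computes an address and does not attach anything.

```c
#include <stdint.h>
#include <sys/shm.h>

/* Hypothetical helper: round addr down to the nearest multiple of
   SHMLBA, mirroring what SHM_RND asks shmat() to do. */
static uintptr_t
round_down_shmlba(uintptr_t addr)
{
    uintptr_t lba = (uintptr_t) SHMLBA;

    return addr - (addr % lba);
}
```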
EXAMPLES
The two programs shown below exchange a string using a shared memory segment. Further details about the programs are given below. First, we show a shell session demonstrating their use.
In one terminal window, we run the “reader” program, which creates a System V shared memory segment and a System V semaphore set. The program prints out the IDs of the created objects, and then waits for the semaphore to change value.
$ ./svshm_string_read
shmid = 1114194; semid = 15
In another terminal window, we run the “writer” program. The “writer” program takes three command-line arguments: the IDs of the shared memory segment and semaphore set created by the “reader”, and a string. It attaches the existing shared memory segment, copies the string to the shared memory, and modifies the semaphore value.
$ ./svshm_string_write 1114194 15 'Hello, world'
Returning to the terminal where the “reader” is running, we see that the program has ceased waiting on the semaphore and has printed the string that was copied into the shared memory segment by the writer:
Hello, world
Program source: svshm_string.h
The following header file is included by the “reader” and “writer” programs:
/* svshm_string.h
Licensed under GNU General Public License v2 or later.
*/
#ifndef SVSHM_STRING_H
#define SVSHM_STRING_H
#include <stdio.h>
#include <stdlib.h>
#include <sys/sem.h>
#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)
union semun {                   /* Used in calls to semctl() */
    int                 val;
    struct semid_ds     *buf;
    unsigned short      *array;
#if defined(__linux__)
    struct seminfo      *__buf;
#endif
};
#define MEM_SIZE 4096
#endif // include guard
Program source: svshm_string_read.c
The “reader” program creates a shared memory segment and a semaphore set containing one semaphore. It then attaches the shared memory object into its address space and initializes the semaphore value to 1. Finally, the program waits for the semaphore value to become 0, and afterwards prints the string that has been copied into the shared memory segment by the “writer”.
/* svshm_string_read.c
Licensed under GNU General Public License v2 or later.
*/
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include "svshm_string.h"
int
main(void)
{
    int            semid, shmid;
    char           *addr;
    union semun    arg, dummy;
    struct sembuf  sop;
    /* Create shared memory and semaphore set containing one
       semaphore. */
    shmid = shmget(IPC_PRIVATE, MEM_SIZE, IPC_CREAT | 0600);
    if (shmid == -1)
        errExit("shmget");
    semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1)
        errExit("semget");
    /* Attach shared memory into our address space. */
    addr = shmat(shmid, NULL, SHM_RDONLY);
    if (addr == (void *) -1)
        errExit("shmat");
    /* Initialize semaphore 0 in set with value 1. */
    arg.val = 1;
    if (semctl(semid, 0, SETVAL, arg) == -1)
        errExit("semctl");
    printf("shmid = %d; semid = %d\n", shmid, semid);
    /* Wait for semaphore value to become 0. */
    sop.sem_num = 0;
    sop.sem_op = 0;
    sop.sem_flg = 0;
    if (semop(semid, &sop, 1) == -1)
        errExit("semop");
    /* Print the string from shared memory. */
    printf("%s\n", addr);
    /* Remove shared memory and semaphore set. */
    if (shmctl(shmid, IPC_RMID, NULL) == -1)
        errExit("shmctl");
    if (semctl(semid, 0, IPC_RMID, dummy) == -1)
        errExit("semctl");
    exit(EXIT_SUCCESS);
}
Program source: svshm_string_write.c
The “writer” program takes three command-line arguments: the IDs of the shared memory segment and semaphore set that have already been created by the “reader”, and a string. It attaches the shared memory segment into its address space, copies the string into the shared memory, and then decrements the semaphore value to 0 in order to inform the “reader” that it can now examine the contents of the shared memory.
/* svshm_string_write.c
Licensed under GNU General Public License v2 or later.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include "svshm_string.h"
int
main(int argc, char *argv[])
{
    int            semid, shmid;
    char           *addr;
    size_t         len;
    struct sembuf  sop;
    if (argc != 4) {
        fprintf(stderr, "Usage: %s shmid semid string\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    len = strlen(argv[3]) + 1;  /* +1 to include trailing '\0' */
    if (len > MEM_SIZE) {
        fprintf(stderr, "String is too big!\n");
        exit(EXIT_FAILURE);
    }
    /* Get object IDs from command-line. */
    shmid = atoi(argv[1]);
    semid = atoi(argv[2]);
    /* Attach shared memory into our address space and copy string
       (including trailing null byte) into memory. */
    addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1)
        errExit("shmat");
    memcpy(addr, argv[3], len);
    /* Decrement semaphore to 0. */
    sop.sem_num = 0;
    sop.sem_op = -1;
    sop.sem_flg = 0;
    if (semop(semid, &sop, 1) == -1)
        errExit("semop");
    exit(EXIT_SUCCESS);
}
SEE ALSO
brk(2), mmap(2), shmctl(2), shmget(2), capabilities(7), shm_overview(7), sysvipc(7)
450 - Linux cli command semget
NAME
semget - get a System V semaphore set identifier
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/sem.h>
int semget(key_t key, int nsems, int semflg);
DESCRIPTION
The semget() system call returns the System V semaphore set identifier associated with the argument key. It may be used either to obtain the identifier of a previously created semaphore set (when semflg is zero and key does not have the value IPC_PRIVATE), or to create a new set.
A new set of nsems semaphores is created if key has the value IPC_PRIVATE or if no existing semaphore set is associated with key and IPC_CREAT is specified in semflg.
If semflg specifies both IPC_CREAT and IPC_EXCL and a semaphore set already exists for key, then semget() fails with errno set to EEXIST. (This is analogous to the effect of the combination O_CREAT | O_EXCL for open(2).)
Upon creation, the least significant 9 bits of the argument semflg define the permissions (for owner, group, and others) for the semaphore set. These bits have the same format, and the same meaning, as the mode argument of open(2) (though the execute permissions are not meaningful for semaphores, and write permissions mean permission to alter semaphore values).
When creating a new semaphore set, semget() initializes the set’s associated data structure, semid_ds (see semctl(2)), as follows:
sem_perm.cuid and sem_perm.uid are set to the effective user ID of the calling process.
sem_perm.cgid and sem_perm.gid are set to the effective group ID of the calling process.
The least significant 9 bits of sem_perm.mode are set to the least significant 9 bits of semflg.
sem_nsems is set to the value of nsems.
sem_otime is set to 0.
sem_ctime is set to the current time.
The argument nsems can be 0 (a don’t care) when a semaphore set is not being created. Otherwise, nsems must be greater than 0 and less than or equal to the maximum number of semaphores per semaphore set (SEMMSL).
If the semaphore set already exists, the permissions are verified.
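The interplay of IPC_CREAT and IPC_EXCL described above lends itself to a "create, or else open" pattern. The sketch below is one possible arrangement. open_or_create_semset() is a hypothetical helper, and the 0600 mode is an assumption.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Hypothetical helper: return the ID of the semaphore set for key,
   creating it with nsems semaphores and mode 0600 if it does not
   exist yet. *created is set to 1 when this call created the set
   and to 0 when an existing set was opened. Returns -1 on error. */
static int
open_or_create_semset(key_t key, int nsems, int *created)
{
    int semid;

    /* IPC_EXCL makes the call fail with EEXIST instead of silently
       returning an existing set. */
    semid = semget(key, nsems, IPC_CREAT | IPC_EXCL | 0600);
    if (semid != -1) {
        *created = 1;
        return semid;
    }
    if (errno != EEXIST)
        return -1;

    /* The set already exists: nsems is a "don't care" here. */
    *created = 0;
    return semget(key, 0, 0);
}
```

Knowing whether the call created the set matters because only the creator should initialize the semaphore values.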
RETURN VALUE
On success, semget() returns the semaphore set identifier (a nonnegative integer). On failure, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
A semaphore set exists for key, but the calling process does not have permission to access the set, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EEXIST
IPC_CREAT and IPC_EXCL were specified in semflg, but a semaphore set already exists for key.
EINVAL
nsems is less than 0 or greater than the limit on the number of semaphores per semaphore set (SEMMSL).
EINVAL
A semaphore set corresponding to key already exists, but nsems is larger than the number of semaphores in that set.
ENOENT
No semaphore set exists for key and semflg did not specify IPC_CREAT.
ENOMEM
A semaphore set has to be created but the system does not have enough memory for the new data structure.
ENOSPC
A semaphore set has to be created but the system limit for the maximum number of semaphore sets (SEMMNI), or the system wide maximum number of semaphores (SEMMNS), would be exceeded.
STANDARDS
POSIX.1-2008.
HISTORY
SVr4, POSIX.1-2001.
NOTES
IPC_PRIVATE isn’t a flag field but a key_t type. If this special value is used for key, the system call ignores all but the least significant 9 bits of semflg and creates a new semaphore set (on success).
Semaphore initialization
The values of the semaphores in a newly created set are indeterminate. (POSIX.1-2001 and POSIX.1-2008 are explicit on this point, although POSIX.1-2008 notes that a future version of the standard may require an implementation to initialize the semaphores to 0.) Although Linux, like many other implementations, initializes the semaphore values to 0, a portable application cannot rely on this: it should explicitly initialize the semaphores to the desired values.
Initialization can be done using the semctl(2) SETVAL or SETALL operation. Where multiple peers do not know who will be the first to initialize the set, checking for a nonzero sem_otime in the associated data structure retrieved by a semctl(2) IPC_STAT operation can be used to avoid races.
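The sem_otime technique can be sketched as follows. The creator sets the value to 0 with SETVAL and then raises it with semop(2), which is the step that makes sem_otime nonzero; other peers poll IPC_STAT until they see that. get_initialized_sem() is a hypothetical helper, and the ten one-second retries are an arbitrary assumption.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <unistd.h>

union semun {                   /* Must be defined by the caller */
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
};

/* Hypothetical helper: return the ID of a one-semaphore set for key
   whose value has been initialized to initval (>= 0), or -1. */
static int
get_initialized_sem(key_t key, int initval)
{
    union semun arg;
    struct semid_ds ds;
    struct sembuf sop;
    int semid;

    semid = semget(key, 1, IPC_CREAT | IPC_EXCL | 0600);
    if (semid != -1) {
        /* We are the creator: set a known value, then perform a
           semop() so that sem_otime becomes nonzero. */
        arg.val = 0;
        if (semctl(semid, 0, SETVAL, arg) == -1)
            return -1;
        sop.sem_num = 0;
        sop.sem_op = (short) initval;
        sop.sem_flg = 0;
        if (semop(semid, &sop, 1) == -1)
            return -1;
        return semid;
    }
    if (errno != EEXIST)
        return -1;

    /* Someone else created the set: wait until they initialized it. */
    semid = semget(key, 1, 0);
    if (semid == -1)
        return -1;
    arg.buf = &ds;
    for (int tries = 0; tries < 10; tries++) {
        if (semctl(semid, 0, IPC_STAT, arg) == -1)
            return -1;
        if (ds.sem_otime != 0)
            return semid;
        sleep(1);
    }
    errno = ETIMEDOUT;
    return -1;
}
```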
Semaphore limits
The following limits on semaphore set resources affect the semget() call:
SEMMNI
System-wide limit on the number of semaphore sets. Before Linux 3.19, the default value for this limit was 128. Since Linux 3.19, the default value is 32,000. On Linux, this limit can be read and modified via the fourth field of /proc/sys/kernel/sem.
SEMMSL
Maximum number of semaphores per semaphore ID. Before Linux 3.19, the default value for this limit was 250. Since Linux 3.19, the default value is 32,000. On Linux, this limit can be read and modified via the first field of /proc/sys/kernel/sem.
SEMMNS
System-wide limit on the number of semaphores: policy dependent (on Linux, this limit can be read and modified via the second field of /proc/sys/kernel/sem). Note that the number of semaphores system-wide is also limited by the product of SEMMSL and SEMMNI.
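The limits above can be read programmatically from /proc/sys/kernel/sem. The sketch below assumes the field order documented in proc(5): SEMMSL, SEMMNS, SEMOPM, SEMMNI. struct sem_limits and read_sem_limits() are hypothetical names.

```c
#include <stdio.h>

/* Hypothetical container for the four fields of /proc/sys/kernel/sem. */
struct sem_limits {
    long semmsl;    /* field 1: max semaphores per set */
    long semmns;    /* field 2: system-wide max semaphores */
    long semopm;    /* field 3: max operations per semop() call */
    long semmni;    /* field 4: system-wide max semaphore sets */
};

/* Read the current limits. Returns 0 on success, -1 on failure. */
static int
read_sem_limits(struct sem_limits *lim)
{
    FILE *fp = fopen("/proc/sys/kernel/sem", "r");
    int n;

    if (fp == NULL)
        return -1;

    n = fscanf(fp, "%ld %ld %ld %ld",
               &lim->semmsl, &lim->semmns, &lim->semopm, &lim->semmni);
    fclose(fp);

    return n == 4 ? 0 : -1;
}
```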
BUGS
The name choice IPC_PRIVATE was perhaps unfortunate; IPC_NEW would more clearly show its function.
EXAMPLES
The program shown below uses semget() to create a new semaphore set or retrieve the ID of an existing set. It generates the key for semget() using ftok(3). The first two command-line arguments are used as the pathname and proj_id arguments for ftok(3). The third command-line argument is an integer that specifies the nsems argument for semget(). Command-line options can be used to specify the IPC_CREAT (-c) and IPC_EXCL (-x) flags for the call to semget(). The usage of this program is demonstrated below.
We first create two files that will be used to generate keys using ftok(3), create two semaphore sets using those files, and then list the sets using ipcs(1):
$ touch mykey mykey2
$ ./t_semget -c mykey p 1
ID = 9
$ ./t_semget -c mykey2 p 2
ID = 10
$ ipcs -s
------ Semaphore Arrays --------
key semid owner perms nsems
0x7004136d 9 mtk 600 1
0x70041368 10 mtk 600 2
Next, we demonstrate that when semget() is given the same key (as generated by the same arguments to ftok(3)), it returns the ID of the already existing semaphore set:
$ ./t_semget -c mykey p 1
ID = 9
Finally, we demonstrate the kind of collision that can occur when ftok(3) is given different pathname arguments that have the same inode number:
$ ln mykey link
$ ls -i1 link mykey
2233197 link
2233197 mykey
$ ./t_semget link p 1 # Generates same key as 'mykey'
ID = 9
Program source
/* t_semget.c
Licensed under GNU General Public License v2 or later.
*/
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <unistd.h>
static void
usage(const char *pname)
{
    fprintf(stderr, "Usage: %s [-cx] pathname proj-id num-sems\n",
            pname);
    fprintf(stderr, "    -c           Use IPC_CREAT flag\n");
    fprintf(stderr, "    -x           Use IPC_EXCL flag\n");
    exit(EXIT_FAILURE);
}
int
main(int argc, char *argv[])
{
    int    semid, nsems, flags, opt;
    key_t  key;
    flags = 0;
    while ((opt = getopt(argc, argv, "cx")) != -1) {
        switch (opt) {
        case 'c': flags |= IPC_CREAT;   break;
        case 'x': flags |= IPC_EXCL;    break;
        default:  usage(argv[0]);
        }
    }
    if (argc != optind + 3)
        usage(argv[0]);
    key = ftok(argv[optind], argv[optind + 1][0]);
    if (key == -1) {
        perror("ftok");
        exit(EXIT_FAILURE);
    }
    nsems = atoi(argv[optind + 2]);
    semid = semget(key, nsems, flags | 0600);
    if (semid == -1) {
        perror("semget");
        exit(EXIT_FAILURE);
    }
    printf("ID = %d\n", semid);
    exit(EXIT_SUCCESS);
}
SEE ALSO
semctl(2), semop(2), ftok(3), capabilities(7), sem_overview(7), sysvipc(7)
451 - Linux cli command insb
NAME
insb - port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width (8 bits), the w-suffix functions word-width (16 bits), and the l-suffix functions long-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
You use ioperm(2) or alternatively iopl(2) to tell the kernel to allow the user space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
452 - Linux cli command sendto
NAME
sendto - send a message on a socket
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/socket.h>
ssize_t send(int sockfd, const void buf[.len], size_t len, int flags);
ssize_t sendto(int sockfd, const void buf[.len], size_t len, int flags,
const struct sockaddr *dest_addr, socklen_t addrlen);
ssize_t sendmsg(int sockfd, const struct msghdr *msg, int flags);
DESCRIPTION
The system calls send(), sendto(), and sendmsg() are used to transmit a message to another socket.
The send() call may be used only when the socket is in a connected state (so that the intended recipient is known). The only difference between send() and write(2) is the presence of flags. With a zero flags argument, send() is equivalent to write(2). Also, the following call
send(sockfd, buf, len, flags);
is equivalent to
sendto(sockfd, buf, len, flags, NULL, 0);
The argument sockfd is the file descriptor of the sending socket.
If sendto() is used on a connection-mode (SOCK_STREAM, SOCK_SEQPACKET) socket, the arguments dest_addr and addrlen are ignored (and the error EISCONN may be returned when they are not NULL and 0), and the error ENOTCONN is returned when the socket was not actually connected. Otherwise, the address of the target is given by dest_addr with addrlen specifying its size. For sendmsg(), the address of the target is given by msg.msg_name, with msg.msg_namelen specifying its size.
For send() and sendto(), the message is found in buf and has length len. For sendmsg(), the message is pointed to by the elements of the array msg.msg_iov. The sendmsg() call also allows sending ancillary data (also known as control information).
If the message is too long to pass atomically through the underlying protocol, the error EMSGSIZE is returned, and the message is not transmitted.
No indication of failure to deliver is implicit in a send(). Locally detected errors are indicated by a return value of -1.
When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in nonblocking I/O mode. In nonblocking mode it would fail with the error EAGAIN or EWOULDBLOCK in this case. The select(2) call may be used to determine when it is possible to send more data.
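For a stream socket, a caller usually wants the whole buffer sent regardless of whether the socket is blocking. The sketch below loops over send() and, on EAGAIN or EWOULDBLOCK, waits for writability. It uses poll(2) rather than the select(2) call mentioned above, purely as a matter of convenience. send_all() is a hypothetical helper.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper for stream sockets: keep calling send() until
   all len bytes are queued. Returns len on success, -1 on error. */
static ssize_t
send_all(int sockfd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = send(sockfd, p, left, 0);

        if (n >= 0) {               /* possibly a partial send */
            p += n;
            left -= (size_t) n;
            continue;
        }
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* Nonblocking socket with a full send buffer: wait
               until the kernel can accept more data. */
            struct pollfd pfd = { .fd = sockfd, .events = POLLOUT };

            if (poll(&pfd, 1, -1) == -1 && errno != EINTR)
                return -1;
            continue;
        }
        return -1;
    }
    return (ssize_t) len;
}
```

Both EAGAIN and EWOULDBLOCK are checked because POSIX does not require them to have the same value.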
The flags argument
The flags argument is the bitwise OR of zero or more of the following flags.
MSG_CONFIRM (since Linux 2.3.15)
Tell the link layer that forward progress happened: you got a successful reply from the other side. If the link layer doesn’t get this it will regularly reprobe the neighbor (e.g., via a unicast ARP). Valid only on SOCK_DGRAM and SOCK_RAW sockets and currently implemented only for IPv4 and IPv6. See arp(7) for details.
MSG_DONTROUTE
Don’t use a gateway to send out the packet, send to hosts only on directly connected networks. This is usually used only by diagnostic or routing programs. This is defined only for protocol families that route; packet sockets don’t.
MSG_DONTWAIT (since Linux 2.2)
Enables nonblocking operation; if the operation would block, EAGAIN or EWOULDBLOCK is returned. This provides similar behavior to setting the O_NONBLOCK flag (via the fcntl(2) F_SETFL operation), but differs in that MSG_DONTWAIT is a per-call option, whereas O_NONBLOCK is a setting on the open file description (see open(2)), which will affect all threads in the calling process as well as other processes that hold file descriptors referring to the same open file description.
MSG_EOR (since Linux 2.2)
Terminates a record (when this notion is supported, as for sockets of type SOCK_SEQPACKET).
MSG_MORE (since Linux 2.4.4)
The caller has more data to send. This flag is used with TCP sockets to obtain the same effect as the TCP_CORK socket option (see tcp(7)), with the difference that this flag can be set on a per-call basis.
Since Linux 2.6, this flag is also supported for UDP sockets, and informs the kernel to package all of the data sent in calls with this flag set into a single datagram which is transmitted only when a call is performed that does not specify this flag. (See also the UDP_CORK socket option described in udp(7).)
MSG_NOSIGNAL (since Linux 2.2)
Don’t generate a SIGPIPE signal if the peer on a stream-oriented socket has closed the connection. The EPIPE error is still returned. This provides similar behavior to using sigaction(2) to ignore SIGPIPE, but, whereas MSG_NOSIGNAL is a per-call feature, ignoring SIGPIPE sets a process attribute that affects all threads in the process.
MSG_OOB
Sends out-of-band data on sockets that support this notion (e.g., of type SOCK_STREAM); the underlying protocol must also support out-of-band data.
MSG_FASTOPEN (since Linux 3.7)
Attempts TCP Fast Open (RFC7413) and sends data in the SYN like a combination of connect(2) and write(2), by performing an implicit connect(2) operation. It blocks until the data is buffered and the handshake has completed. For a non-blocking socket, it returns the number of bytes buffered and sent in the SYN packet. If the cookie is not available locally, it returns EINPROGRESS, and sends a SYN with a Fast Open cookie request automatically. The caller needs to write the data again when the socket is connected. On errors, it sets the same errno as connect(2) if the handshake fails. This flag requires enabling TCP Fast Open client support on sysctl net.ipv4.tcp_fastopen.
Refer to TCP_FASTOPEN_CONNECT socket option in tcp(7) for an alternative approach.
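As a small illustration of combining per-call flags, the sketch below sends with MSG_DONTWAIT | MSG_NOSIGNAL and classifies the outcome. The enum and try_send() are hypothetical names introduced only for this example.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical result codes for try_send(). */
enum send_result {
    SENT_OK,            /* some or all bytes were queued */
    SENT_WOULD_BLOCK,   /* send buffer full; try again later */
    SENT_PEER_CLOSED,   /* EPIPE, reported without raising SIGPIPE */
    SENT_ERROR          /* any other error; inspect errno */
};

/* Attempt one send that neither blocks nor raises SIGPIPE.
   On SENT_OK, *sent holds the number of bytes accepted. */
static enum send_result
try_send(int sockfd, const void *buf, size_t len, ssize_t *sent)
{
    ssize_t n = send(sockfd, buf, len, MSG_DONTWAIT | MSG_NOSIGNAL);

    if (n >= 0) {
        *sent = n;
        return SENT_OK;
    }
    *sent = 0;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return SENT_WOULD_BLOCK;
    if (errno == EPIPE)
        return SENT_PEER_CLOSED;
    return SENT_ERROR;
}
```

Because both flags apply to this call only, other threads that share the socket or the process signal dispositions are unaffected.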
sendmsg()
The definition of the msghdr structure employed by sendmsg() is as follows:
struct msghdr {
void *msg_name; /* Optional address */
socklen_t msg_namelen; /* Size of address */
struct iovec *msg_iov; /* Scatter/gather array */
size_t msg_iovlen; /* # elements in msg_iov */
void *msg_control; /* Ancillary data, see below */
size_t msg_controllen; /* Ancillary data buffer len */
int msg_flags; /* Flags (unused) */
};
The msg_name field is used on an unconnected socket to specify the target address for a datagram. It points to a buffer containing the address; the msg_namelen field should be set to the size of the address. For a connected socket, these fields should be specified as NULL and 0, respectively.
The msg_iov and msg_iovlen fields specify scatter-gather locations, as for writev(2).
You may send control information (ancillary data) using the msg_control and msg_controllen members. The maximum control buffer length the kernel can process is limited per socket by the value in /proc/sys/net/core/optmem_max; see socket(7). For further information on the use of ancillary data in various socket domains, see unix(7) and ip(7).
The msg_flags field is ignored.
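As an illustration of the fields described above, the following sketch gathers two separate buffers into a single message on a connected socket. It is an assumption-laden example rather than text from the manual page: the helper name send_two_parts() is hypothetical, no ancillary data is sent, and msg_name is left NULL because the socket is assumed to be connected.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: transmit head and body as one message. */
static ssize_t send_two_parts(int sockfd,
                              const void *head, size_t head_len,
                              const void *body, size_t body_len)
{
    struct iovec iov[2];
    struct msghdr msg;

    /* iov_base is not const-qualified, so the casts are needed;
     * sendmsg() only reads from these buffers. */
    iov[0].iov_base = (void *)head;
    iov[0].iov_len = head_len;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = body_len;

    /* Zeroing sets msg_name = NULL, msg_namelen = 0 (connected socket)
     * and msg_control = NULL, msg_controllen = 0 (no ancillary data). */
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    return sendmsg(sockfd, &msg, MSG_NOSIGNAL);
}
```

On a datagram socket the two buffers arrive as one datagram; on a stream socket they are simply written back to back.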
RETURN VALUE
On success, these calls return the number of bytes sent. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
These are some standard errors generated by the socket layer. Additional errors may be generated and returned from the underlying protocol modules; see their respective manual pages.
EACCES
(For UNIX domain sockets, which are identified by pathname) Write permission is denied on the destination socket file, or search permission is denied for one of the directories in the path prefix. (See path_resolution(7).)
(For UDP sockets) An attempt was made to send to a network/broadcast address as though it was a unicast address.
EAGAIN or EWOULDBLOCK
The socket is marked nonblocking and the requested operation would block. POSIX.1-2001 allows either error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities.
EAGAIN
(Internet domain datagram sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7).
EALREADY
Another Fast Open is in progress.
EBADF
sockfd is not a valid open file descriptor.
ECONNRESET
Connection reset by peer.
EDESTADDRREQ
The socket is not connection-mode, and no peer address is set.
EFAULT
An invalid user space address was specified for an argument.
EINTR
A signal occurred before any data was transmitted; see signal(7).
EINVAL
Invalid argument passed.
EISCONN
The connection-mode socket was connected already but a recipient was specified. (Now either this error is returned, or the recipient specification is ignored.)
EMSGSIZE
The socket type requires that message be sent atomically, and the size of the message to be sent made this impossible.
ENOBUFS
The output queue for a network interface was full. This generally indicates that the interface has stopped sending, but may be caused by transient congestion. (Normally, this does not occur in Linux. Packets are just silently dropped when a device queue overflows.)
ENOMEM
No memory available.
ENOTCONN
The socket is not connected, and no target has been given.
ENOTSOCK
The file descriptor sockfd does not refer to a socket.
EOPNOTSUPP
Some bit in the flags argument is inappropriate for the socket type.
EPIPE
The local end has been shut down on a connection oriented socket. In this case, the process will also receive a SIGPIPE unless MSG_NOSIGNAL is set.
VERSIONS
According to POSIX.1-2001, the msg_controllen field of the msghdr structure should be typed as socklen_t, and the msg_iovlen field should be typed as int, but glibc currently types both as size_t.
STANDARDS
POSIX.1-2008.
MSG_CONFIRM is a Linux extension.
HISTORY
4.4BSD, SVr4, POSIX.1-2001 (these interfaces first appeared in 4.2BSD).
POSIX.1-2001 describes only the MSG_OOB and MSG_EOR flags. POSIX.1-2008 adds a specification of MSG_NOSIGNAL.
NOTES
See sendmmsg(2) for information about a Linux-specific system call that can be used to transmit multiple datagrams in a single call.
BUGS
Linux may return EPIPE instead of ENOTCONN.
EXAMPLES
An example of the use of sendto() is shown in getaddrinfo(3).
SEE ALSO
fcntl(2), getsockopt(2), recv(2), select(2), sendfile(2), sendmmsg(2), shutdown(2), socket(2), write(2), cmsg(3), ip(7), ipv6(7), socket(7), tcp(7), udp(7), unix(7)
453 - Linux cli command rename
NAME 🖥️ rename 🖥️
change the name or location of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <stdio.h>
int rename(const char *oldpath, const char *newpath);
#include <fcntl.h> /* Definition of AT_* constants */
#include <stdio.h>
int renameat(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath);
int renameat2(int olddirfd, const char *oldpath,
int newdirfd, const char *newpath, unsigned int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
renameat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
renameat2():
_GNU_SOURCE
DESCRIPTION
rename() renames a file, moving it between directories if required. Any other hard links to the file (as created using link(2)) are unaffected. Open file descriptors for oldpath are also unaffected.
Various restrictions determine whether or not the rename operation succeeds: see ERRORS below.
If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However, there will probably be a window in which both oldpath and newpath refer to the file being renamed.
If oldpath and newpath are existing hard links referring to the same file, then rename() does nothing, and returns a success status.
If newpath exists but the operation fails for some reason, rename() guarantees to leave an instance of newpath in place.
oldpath can specify a directory. In this case, newpath must either not exist, or it must specify an empty directory.
If oldpath refers to a symbolic link, the link is renamed; if newpath refers to a symbolic link, the link will be overwritten.
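The atomic-replacement guarantee described above is commonly used to update a file without ever exposing a half-written version. The sketch below is one way to do that, under stated assumptions: the helper name replace_file_atomically() is hypothetical, the temporary file is assumed to live on the same filesystem as the target (otherwise rename() fails with EXDEV), and the code does not call fsync(2), so it makes no durability promise across a crash.

```c
#include <stdio.h>

/* Hypothetical helper: write text to tmp_path, then rename it over path.
 * Readers of path see either the old content or the new content. */
static int replace_file_atomically(const char *path, const char *tmp_path,
                                   const char *text)
{
    FILE *fp = fopen(tmp_path, "w");

    if (fp == NULL)
        return -1;
    if (fputs(text, fp) == EOF) {
        fclose(fp);
        remove(tmp_path);       /* do not leave a partial temporary file */
        return -1;
    }
    if (fclose(fp) == EOF) {    /* flush errors surface here */
        remove(tmp_path);
        return -1;
    }
    if (rename(tmp_path, path) == -1) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}
```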
renameat()
The renameat() system call operates in exactly the same way as rename(), except for the differences described here.
If the pathname given in oldpath is relative, then it is interpreted relative to the directory referred to by the file descriptor olddirfd (rather than relative to the current working directory of the calling process, as is done by rename() for a relative pathname).
If oldpath is relative and olddirfd is the special value AT_FDCWD, then oldpath is interpreted relative to the current working directory of the calling process (like rename()).
If oldpath is absolute, then olddirfd is ignored.
The interpretation of newpath is as for oldpath, except that a relative pathname is interpreted relative to the directory referred to by the file descriptor newdirfd.
See openat(2) for an explanation of the need for renameat().
renameat2()
renameat2() has an additional flags argument. A renameat2() call with a zero flags argument is equivalent to renameat().
The flags argument is a bit mask consisting of zero or more of the following flags:
RENAME_EXCHANGE
Atomically exchange oldpath and newpath. Both pathnames must exist but may be of different types (e.g., one could be a non-empty directory and the other a symbolic link).
RENAME_NOREPLACE
Don’t overwrite newpath of the rename. Return an error if newpath already exists.
RENAME_NOREPLACE can’t be employed together with RENAME_EXCHANGE.
RENAME_NOREPLACE requires support from the underlying filesystem. Support for various filesystems was added as follows:
ext4 (Linux 3.15);
btrfs, tmpfs, and cifs (Linux 3.17);
xfs (Linux 4.0);
Support for many other filesystems was added in Linux 4.9, including ext2, minix, reiserfs, jfs, vfat, and bpf.
RENAME_WHITEOUT (since Linux 3.18)
This operation makes sense only for overlay/union filesystem implementations.
Specifying RENAME_WHITEOUT creates a “whiteout” object at the source of the rename at the same time as performing the rename. The whole operation is atomic, so that if the rename succeeds then the whiteout will also have been created.
A “whiteout” is an object that has special meaning in union/overlay filesystem constructs. In these constructs, multiple layers exist and only the top one is ever modified. A whiteout on an upper layer will effectively hide a matching file in the lower layer, making it appear as if the file didn’t exist.
When a file that exists on the lower layer is renamed, the file is first copied up (if not already on the upper layer) and then renamed on the upper, read-write layer. At the same time, the source file needs to be “whiteouted” (so that the version of the source file in the lower layer is rendered invisible). The whole operation needs to be done atomically.
When not part of a union/overlay, the whiteout appears as a character device with a {0,0} device number. (Note that other union/overlay implementations may employ different methods for storing whiteout entries; specifically, BSD union mount employs a separate inode type, DT_WHT, which, while supported by some filesystems available in Linux, such as CODA and XFS, is ignored by the kernel’s whiteout support code, as of Linux 4.19, at least.)
RENAME_WHITEOUT requires the same privileges as creating a device node (i.e., the CAP_MKNOD capability).
RENAME_WHITEOUT can’t be employed together with RENAME_EXCHANGE.
RENAME_WHITEOUT requires support from the underlying filesystem. Among the filesystems that support it are tmpfs (since Linux 3.18), ext4 (since Linux 3.18), XFS (since Linux 4.1), f2fs (since Linux 4.2), btrfs (since Linux 4.7), and ubifs (since Linux 4.9).
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Write permission is denied for the directory containing oldpath or newpath, or, search permission is denied for one of the directories in the path prefix of oldpath or newpath, or oldpath is a directory and does not allow write permission (needed to update the .. entry). (See also path_resolution(7).)
EBUSY
The rename fails because oldpath or newpath is a directory that is in use by some process (perhaps as current working directory, or as root directory, or because it was open for reading) or is in use by the system (for example as a mount point), while the system considers this an error. (Note that there is no requirement to return EBUSY in such cases; there is nothing wrong with doing the rename anyway. However, the system is allowed to return EBUSY if it cannot otherwise handle such situations.)
EDQUOT
The user’s quota of disk blocks on the filesystem has been exhausted.
EFAULT
oldpath or newpath points outside your accessible address space.
EINVAL
The new pathname contained a path prefix of the old, or, more generally, an attempt was made to make a directory a subdirectory of itself.
EISDIR
newpath is an existing directory, but oldpath is not a directory.
ELOOP
Too many symbolic links were encountered in resolving oldpath or newpath.
EMLINK
oldpath already has the maximum number of links to it, or it was a directory and the directory containing newpath has the maximum number of links.
ENAMETOOLONG
oldpath or newpath was too long.
ENOENT
The link named by oldpath does not exist; or, a directory component in newpath does not exist; or, oldpath or newpath is an empty string.
ENOMEM
Insufficient kernel memory was available.
ENOSPC
The device containing the file has no room for the new directory entry.
ENOTDIR
A component used as a directory in oldpath or newpath is not, in fact, a directory. Or, oldpath is a directory, and newpath exists but is not a directory.
ENOTEMPTY or EEXIST
newpath is a nonempty directory, that is, contains entries other than “.” and “..”.
EPERM or EACCES
The directory containing oldpath has the sticky bit (S_ISVTX) set and the process’s effective user ID is neither the user ID of the file to be deleted nor that of the directory containing it, and the process is not privileged (Linux: does not have the CAP_FOWNER capability); or newpath is an existing file and the directory containing it has the sticky bit set and the process’s effective user ID is neither the user ID of the file to be replaced nor that of the directory containing it, and the process is not privileged (Linux: does not have the CAP_FOWNER capability); or the filesystem containing oldpath does not support renaming of the type requested.
EROFS
The file is on a read-only filesystem.
EXDEV
oldpath and newpath are not on the same mounted filesystem. (Linux permits a filesystem to be mounted at multiple points, but rename() does not work across different mount points, even if the same filesystem is mounted on both.)
The following additional errors can occur for renameat() and renameat2():
EBADF
oldpath (newpath) is relative but olddirfd (newdirfd) is not a valid file descriptor.
ENOTDIR
oldpath is relative and olddirfd is a file descriptor referring to a file other than a directory; or similar for newpath and newdirfd.
The following additional errors can occur for renameat2():
EEXIST
flags contains RENAME_NOREPLACE and newpath already exists.
EINVAL
An invalid flag was specified in flags.
EINVAL
Both RENAME_NOREPLACE and RENAME_EXCHANGE were specified in flags.
EINVAL
Both RENAME_WHITEOUT and RENAME_EXCHANGE were specified in flags.
EINVAL
The filesystem does not support one of the flags in flags.
ENOENT
flags contains RENAME_EXCHANGE and newpath does not exist.
EPERM
RENAME_WHITEOUT was specified in flags, but the caller does not have the CAP_MKNOD capability.
STANDARDS
rename()
C11, POSIX.1-2008.
renameat()
POSIX.1-2008.
renameat2()
Linux.
HISTORY
rename()
4.3BSD, C89, POSIX.1-2001.
renameat()
Linux 2.6.16, glibc 2.4.
renameat2()
Linux 3.15, glibc 2.28.
glibc notes
On older kernels where renameat() is unavailable, the glibc wrapper function falls back to the use of rename(). When oldpath and newpath are relative pathnames, glibc constructs pathnames based on the symbolic links in /proc/self/fd that correspond to the olddirfd and newdirfd arguments.
BUGS
On NFS filesystems, you cannot assume that the file was not renamed just because the operation failed. If the server performs the rename operation and then crashes, the retransmitted RPC, which is processed when the server is up again, causes a failure. The application is expected to deal with this. See link(2) for a similar problem.
SEE ALSO
mv(1), rename(1), chmod(2), link(2), symlink(2), unlink(2), path_resolution(7), symlink(7)
454 - Linux cli command inotify_rm_watch
NAME 🖥️ inotify_rm_watch 🖥️
remove an existing watch from an inotify instance
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/inotify.h>
int inotify_rm_watch(int fd, int wd);
DESCRIPTION
inotify_rm_watch() removes the watch associated with the watch descriptor wd from the inotify instance associated with the file descriptor fd.
Removing a watch causes an IN_IGNORED event to be generated for this watch descriptor. (See inotify(7).)
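The interaction between watch removal and the IN_IGNORED event could be handled as in the following sketch. It is illustrative only: the helper name remove_watch_and_confirm() is hypothetical, and the 4096-byte buffer size is an arbitrary choice. The function removes the watch and then reads queued events until it finds the IN_IGNORED event for that watch descriptor, discarding any earlier events.

```c
#include <errno.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical helper: remove watch wd from inotify instance fd and
 * consume events up to and including its IN_IGNORED event.
 * Returns 0 once IN_IGNORED was seen, -1 on error. */
static int remove_watch_and_confirm(int fd, int wd)
{
    /* Buffer aligned for struct inotify_event, as read(2) fills it
     * with a sequence of variable-length event records. */
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));

    if (inotify_rm_watch(fd, wd) == -1)
        return -1;

    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        char *p;

        if (len == -1 && errno == EINTR)
            continue;
        if (len <= 0)
            return -1;          /* includes EAGAIN on a nonblocking fd */

        for (p = buf; p < buf + len; ) {
            const struct inotify_event *ev =
                (const struct inotify_event *)p;

            if (ev->wd == wd && (ev->mask & IN_IGNORED))
                return 0;
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
}
```

After this returns 0, the watch descriptor is no longer valid for fd, and a second inotify_rm_watch() call with the same wd fails with EINVAL.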
RETURN VALUE
On success, inotify_rm_watch() returns zero. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
fd is not a valid file descriptor.
EINVAL
The watch descriptor wd is not valid; or fd is not an inotify file descriptor.
STANDARDS
Linux.
HISTORY
Linux 2.6.13.
SEE ALSO
inotify_add_watch(2), inotify_init(2), inotify(7)
455 - Linux cli command outsl
NAME 🖥️ outsl 🖥️
port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width, the w-suffix functions word-width, and the l-suffix functions long-word (32-bit) width; the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
You use ioperm(2) or alternatively iopl(2) to tell the kernel to allow the user space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
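The permission step and the argument order can be sketched as follows. This example rests on several assumptions that are not in the manual page: the helper name write_debug_port() is hypothetical, port 0x80 (the traditional PC diagnostic port) is assumed to be harmless to write on the target machine, and <sys/io.h> is assumed to be available only on x86, so other architectures get a stub that fails with ENOSYS.

```c
#include <errno.h>

#if defined(__i386__) || defined(__x86_64__)
#include <sys/io.h>

/* Hypothetical helper: write one byte to I/O port 0x80.
 * Returns 0 on success, -1 with errno set if access is not granted. */
static int write_debug_port(unsigned char value)
{
    const unsigned short port = 0x80;

    /* Ask the kernel for access to this single port first; without
     * this, the outb() below would raise a segmentation fault. */
    if (ioperm(port, 1, 1) == -1)
        return -1;              /* e.g. EPERM for unprivileged callers */

    outb(value, port);          /* value first, port second */

    /* Drop the permission again. */
    if (ioperm(port, 1, 0) == -1)
        return -1;
    return 0;
}
#else
/* Port I/O as described here is not available on this architecture. */
static int write_debug_port(unsigned char value)
{
    (void)value;
    errno = ENOSYS;
    return -1;
}
#endif
```

Following the compilation note above, such code would normally be built with optimization enabled, for example with -O2.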
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
456 - Linux cli command acct
NAME 🖥️ acct 🖥️
switch process accounting on or off
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int acct(const char *_Nullable filename);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
acct():
Since glibc 2.21:
_DEFAULT_SOURCE
In glibc 2.19 and 2.20:
_DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
Up to and including glibc 2.19:
_BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
DESCRIPTION
The acct() system call enables or disables process accounting. If called with the name of an existing file as its argument, accounting is turned on, and records for each terminating process are appended to filename as it terminates. An argument of NULL causes accounting to be turned off.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Write permission is denied for the specified file, or search permission is denied for one of the directories in the path prefix of filename (see also path_resolution(7)), or filename is not a regular file.
EFAULT
filename points outside your accessible address space.
EIO
Error writing to the file filename.
EISDIR
filename is a directory.
ELOOP
Too many symbolic links were encountered in resolving filename.
ENAMETOOLONG
filename was too long.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOENT
The specified file does not exist.
ENOMEM
Out of memory.
ENOSYS
BSD process accounting was not enabled when the operating system kernel was compiled. The kernel configuration parameter controlling this feature is CONFIG_BSD_PROCESS_ACCT.
ENOTDIR
A component used as a directory in filename is not in fact a directory.
EPERM
The calling process has insufficient privilege to enable process accounting. On Linux, the CAP_SYS_PACCT capability is required.
EROFS
filename refers to a file on a read-only filesystem.
EUSERS
There are no more free file structures or we ran out of memory.
STANDARDS
None.
HISTORY
SVr4, 4.3BSD.
NOTES
No accounting is produced for programs running when a system crash occurs. In particular, nonterminating processes are never accounted for.
The structure of the records written to the accounting file is described in acct(5).
SEE ALSO
acct(5)
457 - Linux cli command isastream
NAME 🖥️ isastream 🖥️
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
458 - Linux cli command wait3
NAME 🖥️ wait3 🖥️
wait for process to change state, BSD style
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/wait.h>
pid_t wait3(int *_Nullable wstatus, int options,
struct rusage *_Nullable rusage);
pid_t wait4(pid_t pid, int *_Nullable wstatus, int options,
struct rusage *_Nullable rusage);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
wait3():
Since glibc 2.26:
_DEFAULT_SOURCE
|| (_XOPEN_SOURCE >= 500 &&
! (_POSIX_C_SOURCE >= 200112L
|| _XOPEN_SOURCE >= 600))
From glibc 2.19 to glibc 2.25:
_DEFAULT_SOURCE || _XOPEN_SOURCE >= 500
glibc 2.19 and earlier:
_BSD_SOURCE || _XOPEN_SOURCE >= 500
wait4():
Since glibc 2.19:
_DEFAULT_SOURCE
glibc 2.19 and earlier:
_BSD_SOURCE
DESCRIPTION
These functions are nonstandard; in new programs, the use of waitpid(2) or waitid(2) is preferable.
The wait3() and wait4() system calls are similar to waitpid(2), but additionally return resource usage information about the child in the structure pointed to by rusage.
Other than the use of the rusage argument, the following wait3() call:
wait3(wstatus, options, rusage);
is equivalent to:
waitpid(-1, wstatus, options);
Similarly, the following wait4() call:
wait4(pid, wstatus, options, rusage);
is equivalent to:
waitpid(pid, wstatus, options);
In other words, wait3() waits for any child, while wait4() can be used to select a specific child, or children, on which to wait. See wait(2) for further details.
If rusage is not NULL, the struct rusage to which it points will be filled with accounting information about the child. See getrusage(2) for details.
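A minimal sketch of wait4() with a non-NULL rusage argument follows. It is an illustration under assumptions, not text from the manual page: the helper name run_child_and_account() is hypothetical, and the child does nothing but exit, so the reported resource usage will be close to zero.

```c
#include <errno.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with the given code,
 * reap it with wait4(), and store its resource usage in *ru.
 * Returns the child's exit status, or -1 on error. */
static int run_child_and_account(int code, struct rusage *ru)
{
    int wstatus;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0)
        _exit(code);            /* child: terminate immediately */

    /* Wait for this specific child; retry if a signal interrupts. */
    while (wait4(child, &wstatus, 0, ru) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

After a successful call, fields such as ru->ru_utime and ru->ru_maxrss describe the reaped child, as documented in getrusage(2).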
RETURN VALUE
As for waitpid(2).
ERRORS
As for waitpid(2).
STANDARDS
None.
HISTORY
4.3BSD.
SUSv1 included a specification of wait3(); SUSv2 included wait3(), but marked it LEGACY; SUSv3 removed it.
Including <sys/time.h> is not required these days, but increases portability. (Indeed, <sys/resource.h> defines the rusage structure with fields of type struct timeval defined in <sys/time.h>.)
C library/kernel differences
On Linux, wait3() is a library function implemented on top of the wait4() system call.
SEE ALSO
fork(2), getrusage(2), sigaction(2), signal(2), wait(2), signal(7)
459 - Linux cli command sgetmask
NAME 🖥️ sgetmask 🖥️
manipulation of signal mask (obsolete)
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
[[deprecated]] long syscall(SYS_sgetmask, void);
[[deprecated]] long syscall(SYS_ssetmask, long newmask);
DESCRIPTION
These system calls are obsolete. Do not use them; use sigprocmask(2) instead.
sgetmask() returns the signal mask of the calling process.
ssetmask() sets the signal mask of the calling process to the value given in newmask. The previous signal mask is returned.
The signal masks dealt with by these two system calls are plain bit masks (unlike the sigset_t used by sigprocmask(2)); use sigmask(3) to create and inspect these masks.
RETURN VALUE
sgetmask() always successfully returns the signal mask. ssetmask() always succeeds, and returns the previous signal mask.
ERRORS
These system calls always succeed.
STANDARDS
Linux.
HISTORY
Since Linux 3.16, support for these system calls is optional, depending on whether the kernel was built with the CONFIG_SGETMASK_SYSCALL option.
NOTES
These system calls are unaware of signal numbers greater than 31 (i.e., real-time signals).
These system calls do not exist on x86-64.
It is not possible to block SIGSTOP or SIGKILL.
SEE ALSO
sigprocmask(2), signal(7)
460 - Linux cli command signal
NAME 🖥️ signal 🖥️
ANSI C signal handling
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
typedef void (*sighandler_t)(int);
sighandler_t signal(int signum, sighandler_t handler);
DESCRIPTION
WARNING: the behavior of signal() varies across UNIX versions, and has also varied historically across different versions of Linux. Avoid its use: use sigaction(2) instead. See Portability below.
signal() sets the disposition of the signal signum to handler, which is either SIG_IGN, SIG_DFL, or the address of a programmer-defined function (a “signal handler”).
If the signal signum is delivered to the process, then one of the following happens:
* If the disposition is set to SIG_IGN, then the signal is ignored.
* If the disposition is set to SIG_DFL, then the default action associated with the signal (see signal(7)) occurs.
* If the disposition is set to a function, then first either the disposition is reset to SIG_DFL, or the signal is blocked (see Portability below), and then handler is called with argument signum. If invocation of the handler caused the signal to be blocked, then the signal is unblocked upon return from the handler.
The signals SIGKILL and SIGSTOP cannot be caught or ignored.
RETURN VALUE
signal() returns the previous value of the signal handler. On failure, it returns SIG_ERR, and errno is set to indicate the error.
ERRORS
EINVAL
signum is invalid.
VERSIONS
The use of sighandler_t is a GNU extension, exposed if _GNU_SOURCE is defined; glibc also defines (the BSD-derived) sig_t if _BSD_SOURCE (glibc 2.19 and earlier) or _DEFAULT_SOURCE (glibc 2.19 and later) is defined. Without use of such a type, the declaration of signal() is somewhat harder to read:
void ( *signal(int signum, void (*handler)(int)) ) (int);
Portability
The only portable use of signal() is to set a signal’s disposition to SIG_DFL or SIG_IGN. The semantics when using signal() to establish a signal handler vary across systems (and POSIX.1 explicitly permits this variation); do not use it for this purpose.
POSIX.1 solved the portability mess by specifying sigaction(2), which provides explicit control of the semantics when a signal handler is invoked; use that interface instead of signal().
STANDARDS
C11, POSIX.1-2008.
HISTORY
C89, POSIX.1-2001.
In the original UNIX systems, when a handler that was established using signal() was invoked by the delivery of a signal, the disposition of the signal would be reset to SIG_DFL, and the system did not block delivery of further instances of the signal. This is equivalent to calling sigaction(2) with the following flags:
sa.sa_flags = SA_RESETHAND | SA_NODEFER;
System V also provides these semantics for signal(). This was bad because the signal might be delivered again before the handler had a chance to reestablish itself. Furthermore, rapid deliveries of the same signal could result in recursive invocations of the handler.
BSD improved on this situation, but unfortunately also changed the semantics of the existing signal() interface while doing so. On BSD, when a signal handler is invoked, the signal disposition is not reset, and further instances of the signal are blocked from being delivered while the handler is executing. Furthermore, certain blocking system calls are automatically restarted if interrupted by a signal handler (see signal(7)). The BSD semantics are equivalent to calling sigaction(2) with the following flags:
sa.sa_flags = SA_RESTART;
The situation on Linux is as follows:
The kernel’s signal() system call provides System V semantics.
By default, in glibc 2 and later, the signal() wrapper function does not invoke the kernel system call. Instead, it calls sigaction(2) using flags that supply BSD semantics. This default behavior is provided as long as a suitable feature test macro is defined: _BSD_SOURCE on glibc 2.19 and earlier or _DEFAULT_SOURCE in glibc 2.19 and later. (By default, these macros are defined; see feature_test_macros(7) for details.) If such a feature test macro is not defined, then signal() provides System V semantics.
NOTES
The effects of signal() in a multithreaded process are unspecified.
According to POSIX, the behavior of a process is undefined after it ignores a SIGFPE, SIGILL, or SIGSEGV signal that was not generated by kill(2) or raise(3). Integer division by zero has an undefined result. On some architectures it will generate a SIGFPE signal. (Dividing the most negative integer by -1 may also generate SIGFPE.) Ignoring this signal might lead to an endless loop.
See sigaction(2) for details on what happens when the disposition SIGCHLD is set to SIG_IGN.
See signal-safety(7) for a list of the async-signal-safe functions that can be safely called from inside a signal handler.
SEE ALSO
kill(1), alarm(2), kill(2), pause(2), sigaction(2), signalfd(2), sigpending(2), sigprocmask(2), sigsuspend(2), bsd_signal(3), killpg(3), raise(3), siginterrupt(3), sigqueue(3), sigsetops(3), sigvec(3), sysv_signal(3), signal(7)
461 - Linux cli command rt_sigsuspend
NAME 🖥️ rt_sigsuspend 🖥️
wait for a signal
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
int sigsuspend(const sigset_t *mask);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigsuspend():
_POSIX_C_SOURCE
DESCRIPTION
sigsuspend() temporarily replaces the signal mask of the calling thread with the mask given by mask and then suspends the thread until delivery of a signal whose action is to invoke a signal handler or to terminate a process.
If the signal terminates the process, then sigsuspend() does not return. If the signal is caught, then sigsuspend() returns after the signal handler returns, and the signal mask is restored to the state before the call to sigsuspend().
It is not possible to block SIGKILL or SIGSTOP; specifying these signals in mask has no effect on the thread’s signal mask.
RETURN VALUE
sigsuspend() always returns -1, with errno set to indicate the error (normally, EINTR).
ERRORS
EFAULT
mask points to memory which is not a valid part of the process address space.
EINTR
The call was interrupted by a signal; see signal(7).
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
C library/kernel differences
The original Linux system call was named sigsuspend(). However, with the addition of real-time signals in Linux 2.2, the fixed-size, 32-bit sigset_t type supported by that system call was no longer fit for purpose. Consequently, a new system call, rt_sigsuspend(), was added to support an enlarged sigset_t type. The new system call takes a second argument, size_t sigsetsize, which specifies the size in bytes of the signal set in mask. This argument is currently required to have the value sizeof(sigset_t) (or the error EINVAL results). The glibc sigsuspend() wrapper function hides these details from us, transparently calling rt_sigsuspend() when the kernel provides it.
NOTES
Normally, sigsuspend() is used in conjunction with sigprocmask(2) in order to prevent delivery of a signal during the execution of a critical code section. The caller first blocks the signals with sigprocmask(2). When the critical code has completed, the caller then waits for the signals by calling sigsuspend() with the signal mask that was returned by sigprocmask(2) (in the oldset argument).
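The pattern described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the names on_usr1(), got_usr1, and demo_sigsuspend() are hypothetical, SIGUSR1 is an arbitrary choice, and the "critical section" is reduced to a raise(3) call so that the signal is guaranteed to be pending when sigsuspend() is called.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_usr1 = 0;

static void on_usr1(int sig)
{
    (void)sig;
    got_usr1 = 1;
}

/*
 * Hypothetical demo: returns 0 if the signal stayed pending during the
 * critical section and was delivered inside sigsuspend(), -1 otherwise.
 */
static int demo_sigsuspend(void)
{
    struct sigaction sa;
    sigset_t block, oldset, waitmask;
    int rc, saved_errno, ran_early;

    got_usr1 = 0;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    /* Block SIGUSR1; oldset receives the previous signal mask. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &oldset) == -1)
        return -1;

    /* Critical section: the signal is generated but remains pending. */
    raise(SIGUSR1);
    ran_early = got_usr1;

    /*
     * Wait with the old mask. SIGUSR1 is removed from the copy in case
     * the caller already had it blocked, which would make the wait hang.
     */
    waitmask = oldset;
    sigdelset(&waitmask, SIGUSR1);
    rc = sigsuspend(&waitmask);
    saved_errno = errno;

    /* sigsuspend() restored the blocking mask; now restore the original. */
    sigprocmask(SIG_SETMASK, &oldset, NULL);

    if (ran_early || rc != -1 || saved_errno != EINTR || !got_usr1)
        return -1;
    return 0;
}
```

The mask replacement and the wait happen as one atomic step, which is what distinguishes this from unblocking with sigprocmask(2) and then calling pause(2).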
See sigsetops(3) for details on manipulating signal sets.
SEE ALSO
kill(2), pause(2), sigaction(2), signal(2), sigprocmask(2), sigwaitinfo(2), sigsetops(3), sigwait(3), signal(7)
462 - Linux cli command s390_runtime_instr
NAME 🖥️ s390_runtime_instr 🖥️
enable/disable s390 CPU run-time instrumentation
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <asm/runtime_instr.h> /* Definition of S390_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_s390_runtime_instr, int command, int signum);
Note: glibc provides no wrapper for s390_runtime_instr(), necessitating the use of syscall(2).
DESCRIPTION
The s390_runtime_instr() system call starts or stops CPU run-time instrumentation for the calling thread.
The command argument controls whether run-time instrumentation is started (S390_RUNTIME_INSTR_START, 1) or stopped (S390_RUNTIME_INSTR_STOP, 2) for the calling thread.
The signum argument specifies the number of a real-time signal. This argument was used to specify a signal number that should be delivered to the thread if the run-time instrumentation buffer was full or if the run-time-instrumentation-halted interrupt had occurred. This feature was never used, and in Linux 4.4 support for this feature was removed; thus, in current kernels, this argument is ignored.
RETURN VALUE
On success, s390_runtime_instr() returns 0 and enables the thread for run-time instrumentation by assigning the thread a default run-time instrumentation control block. The caller can then read and modify the control block and start the run-time instrumentation. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EINVAL
The value specified in command is not a valid command.
EINVAL
The value specified in signum is not a real-time signal number. From Linux 4.4 onwards, the signum argument has no effect, so that an invalid signal number will not result in an error.
ENOMEM
Allocating memory for the run-time instrumentation control block failed.
EOPNOTSUPP
The run-time instrumentation facility is not available.
STANDARDS
Linux on s390.
HISTORY
Linux 3.7. System z EC12.
NOTES
The asm/runtime_instr.h header file is available since Linux 4.16.
Starting with Linux 4.4, support for signalling was removed, as was the check whether signum is a valid real-time signal. For backwards compatibility with older kernels, it is recommended to pass a valid real-time signal number in signum and install a handler for that signal.
SEE ALSO
syscall(2), signal(7)
463 - Linux cli command setresgid
NAME 🖥️ setresgid 🖥️
set real, effective, and saved user or group ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
int setresuid(uid_t ruid, uid_t euid, uid_t suid);
int setresgid(gid_t rgid, gid_t egid, gid_t sgid);
DESCRIPTION
setresuid() sets the real user ID, the effective user ID, and the saved set-user-ID of the calling process.
An unprivileged process may change its real UID, effective UID, and saved set-user-ID, each to one of: the current real UID, the current effective UID, or the current saved set-user-ID.
A privileged process (on Linux, one having the CAP_SETUID capability) may set its real UID, effective UID, and saved set-user-ID to arbitrary values.
If one of the arguments equals -1, the corresponding value is not changed.
Regardless of what changes are made to the real UID, effective UID, and saved set-user-ID, the filesystem UID is always set to the same value as the (possibly new) effective UID.
Completely analogously, setresgid() sets the real GID, effective GID, and saved set-group-ID of the calling process (and always modifies the filesystem GID to be the same as the effective GID), with the same restrictions for unprivileged processes.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
Note: there are cases where setresuid() can fail even when the caller is UID 0; it is a grave security error to omit checking for a failure return from setresuid().
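A minimal sketch of the checking that the note above calls for. The helper name drop_to_uid() is hypothetical; it sets all three user IDs to one value and then confirms the result with getresuid(2), treating any mismatch as a failure.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: switch the real, effective, and saved
 * set-user-ID to uid and verify the change.
 * Returns 0 on success. Returns -1 on failure, with errno set by
 * setresuid() or getresuid(), or set to EPERM if the IDs did not end
 * up as requested.
 */
static int drop_to_uid(uid_t uid)
{
    uid_t r, e, s;

    if (setresuid(uid, uid, uid) == -1)
        return -1;            /* never ignore this failure */

    if (getresuid(&r, &e, &s) == -1)
        return -1;

    if (r != uid || e != uid || s != uid) {
        errno = EPERM;
        return -1;
    }
    return 0;
}
```

A caller that cannot continue safely with its old credentials would typically terminate when drop_to_uid() returns -1. Passing the caller's own real UID, as in drop_to_uid(getuid()), is permitted for an unprivileged process because the value matches one of its current IDs.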
ERRORS
EAGAIN
The call would change the caller’s real UID (i.e., ruid does not match the caller’s real UID), but there was a temporary failure allocating the necessary kernel data structures.
EAGAIN
ruid does not match the caller’s real UID and this call would bring the number of processes belonging to the real user ID ruid over the caller’s RLIMIT_NPROC resource limit. Since Linux 3.1, this error case no longer occurs (but robust applications should check for this error); see the description of EAGAIN in execve(2).
EINVAL
One or more of the target user or group IDs is not valid in this user namespace.
EPERM
The calling process is not privileged (did not have the necessary capability in its user namespace) and tried to change the IDs to values that are not permitted. For setresuid(), the necessary capability is CAP_SETUID; for setresgid(), it is CAP_SETGID.
VERSIONS
C library/kernel differences
At the kernel level, user IDs and group IDs are a per-thread attribute. However, POSIX requires that all threads in a process share the same credentials. The NPTL threading implementation handles the POSIX requirements by providing wrapper functions for the various system calls that change process UIDs and GIDs. These wrapper functions (including those for setresuid() and setresgid()) employ a signal-based technique to ensure that when one thread changes credentials, all of the other threads in the process also change their credentials. For details, see nptl(7).
STANDARDS
None.
HISTORY
Linux 2.1.44, glibc 2.3.2. HP-UX, FreeBSD.
The original Linux setresuid() and setresgid() system calls supported only 16-bit user and group IDs. Subsequently, Linux 2.4 added setresuid32() and setresgid32(), supporting 32-bit IDs. The glibc setresuid() and setresgid() wrapper functions transparently deal with the variations across kernel versions.
SEE ALSO
getresuid(2), getuid(2), setfsgid(2), setfsuid(2), setreuid(2), setuid(2), capabilities(7), credentials(7), user_namespaces(7)
464 - Linux cli command getitimer
NAME 🖥️ getitimer 🖥️
get or set value of an interval timer
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/time.h>
int getitimer(int which, struct itimerval *curr_value);
int setitimer(int which, const struct itimerval *restrict new_value,
struct itimerval *_Nullable restrict old_value);
DESCRIPTION
These system calls provide access to interval timers, that is, timers that initially expire at some point in the future, and (optionally) at regular intervals after that. When a timer expires, a signal is generated for the calling process, and the timer is reset to the specified interval (if the interval is nonzero).
Three types of timers (specified via the which argument) are provided, each of which counts against a different clock and generates a different signal on timer expiration:
ITIMER_REAL
This timer counts down in real (i.e., wall clock) time. At each expiration, a SIGALRM signal is generated.
ITIMER_VIRTUAL
This timer counts down against the user-mode CPU time consumed by the process. (The measurement includes CPU time consumed by all threads in the process.) At each expiration, a SIGVTALRM signal is generated.
ITIMER_PROF
This timer counts down against the total (i.e., both user and system) CPU time consumed by the process. (The measurement includes CPU time consumed by all threads in the process.) At each expiration, a SIGPROF signal is generated.
In conjunction with ITIMER_VIRTUAL, this timer can be used to profile user and system CPU time consumed by the process.
A process has only one of each of the three types of timers.
Timer values are defined by the following structures:
struct itimerval {
    struct timeval it_interval; /* Interval for periodic timer */
    struct timeval it_value;    /* Time until next expiration */
};

struct timeval {
    time_t      tv_sec;         /* seconds */
    suseconds_t tv_usec;        /* microseconds */
};
getitimer()
The function getitimer() places the current value of the timer specified by which in the buffer pointed to by curr_value.
The it_value substructure is populated with the amount of time remaining until the next expiration of the specified timer. This value changes as the timer counts down, and will be reset to it_interval when the timer expires. If both fields of it_value are zero, then this timer is currently disarmed (inactive).
The it_interval substructure is populated with the timer interval. If both fields of it_interval are zero, then this is a single-shot timer (i.e., it expires just once).
setitimer()
The function setitimer() arms or disarms the timer specified by which, by setting the timer to the value specified by new_value. If old_value is non-NULL, the buffer it points to is used to return the previous value of the timer (i.e., the same information that is returned by getitimer()).
If either field in new_value.it_value is nonzero, then the timer is armed to initially expire at the specified time. If both fields in new_value.it_value are zero, then the timer is disarmed.
The new_value.it_interval field specifies the new interval for the timer; if both of its subfields are zero, the timer is single-shot.
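The arming and disarming rules above can be illustrated with a short sketch. The names on_alarm(), ticks, and demo_itimer() are hypothetical, and the 20 ms and 10 ms values are arbitrary choices for the example.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t ticks = 0;

static void on_alarm(int sig)
{
    (void)sig;
    ticks++;
}

/*
 * Hypothetical demo: arm ITIMER_REAL to expire first after 20 ms and
 * then every 10 ms, wait until at least three SIGALRM signals have
 * been handled, then disarm the timer.
 * Returns the number of ticks observed, or -1 on error.
 */
static int demo_itimer(void)
{
    struct sigaction sa;
    struct itimerval tv, off;

    ticks = 0;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) == -1)
        return -1;

    tv.it_value.tv_sec = 0;          /* initial expiration: 20 ms */
    tv.it_value.tv_usec = 20000;
    tv.it_interval.tv_sec = 0;       /* nonzero interval: periodic */
    tv.it_interval.tv_usec = 10000;
    if (setitimer(ITIMER_REAL, &tv, NULL) == -1)
        return -1;

    /*
     * pause() returns after each handled signal. A tick that arrives
     * between the test and the call is not fatal here, because the
     * periodic timer produces another one shortly afterward.
     */
    while (ticks < 3)
        pause();

    memset(&off, 0, sizeof(off));    /* zero it_value disarms the timer */
    if (setitimer(ITIMER_REAL, &off, NULL) == -1)
        return -1;

    return (int)ticks;
}
```

Setting it_interval to zero in tv would instead give a single-shot timer, in which case the loop above would have to wait for one tick only.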
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
new_value, old_value, or curr_value is not a valid pointer.
EINVAL
which is not one of ITIMER_REAL, ITIMER_VIRTUAL, or ITIMER_PROF; or (since Linux 2.6.22) one of the tv_usec fields in the structure pointed to by new_value contains a value outside the range [0, 999999].
VERSIONS
The standards are silent on the meaning of the call:
setitimer(which, NULL, &old_value);
Many systems (Solaris, the BSDs, and perhaps others) treat this as equivalent to:
getitimer(which, &old_value);
In Linux, this is treated as being equivalent to a call in which the new_value fields are zero; that is, the timer is disabled. Don’t use this Linux misfeature: it is nonportable and unnecessary.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.4BSD (this call first appeared in 4.2BSD). POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending the use of the POSIX timers API (timer_gettime(2), timer_settime(2), etc.) instead.
NOTES
Timers will never expire before the requested time, but may expire some (short) time afterward, which depends on the system timer resolution and on the system load; see time(7). (But see BUGS below.) If the timer expires while the process is active (always true for ITIMER_VIRTUAL), the signal will be delivered immediately when generated.
A child created via fork(2) does not inherit its parent’s interval timers. Interval timers are preserved across an execve(2).
POSIX.1 leaves the interaction between setitimer() and the three interfaces alarm(2), sleep(3), and usleep(3) unspecified.
BUGS
The generation and delivery of a signal are distinct, and only one instance of each of the signals listed above may be pending for a process. Under very heavy loading, an ITIMER_REAL timer may expire before the signal from a previous expiration has been delivered. The second signal in such an event will be lost.
Before Linux 2.6.16, timer values are represented in jiffies. If a request is made to set a timer with a value whose jiffies representation exceeds MAX_SEC_IN_JIFFIES (defined in include/linux/jiffies.h), then the timer is silently truncated to this ceiling value. On Linux/i386 (where, since Linux 2.6.13, the default jiffy is 0.004 seconds), this means that the ceiling value for a timer is approximately 99.42 days. Since Linux 2.6.16, the kernel uses a different internal representation for times, and this ceiling is removed.
On certain systems (including i386), Linux kernels before Linux 2.6.12 have a bug which will produce premature timer expirations of up to one jiffy under some circumstances. This bug is fixed in Linux 2.6.12.
POSIX.1-2001 says that setitimer() should fail if a tv_usec value is specified that is outside of the range [0, 999999]. However, up to and including Linux 2.6.21, Linux does not give an error, but instead silently adjusts the corresponding seconds value for the timer. From Linux 2.6.22 onward, this nonconformance has been repaired: an improper tv_usec value results in an EINVAL error.
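The tv_usec range check can be observed with a small probe. This is only a sketch: the function name probe_bad_usec() is hypothetical, and ITIMER_VIRTUAL with a 1000-second value is chosen so that a kernel that accepts the request does not deliver a signal before the timer is disarmed again.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <string.h>
#include <sys/time.h>

/*
 * Hypothetical probe: request a timer whose tv_usec is out of range.
 * Returns the errno from setitimer() (EINVAL on Linux 2.6.22 and
 * later), or 0 if the request was accepted, in which case the timer
 * is disarmed before returning.
 */
static int probe_bad_usec(void)
{
    struct itimerval bad, off;

    memset(&bad, 0, sizeof(bad));
    bad.it_value.tv_sec = 1000;
    bad.it_value.tv_usec = 1000000;   /* one past the valid maximum */

    if (setitimer(ITIMER_VIRTUAL, &bad, NULL) == -1)
        return errno;

    memset(&off, 0, sizeof(off));
    setitimer(ITIMER_VIRTUAL, &off, NULL);
    return 0;
}
```

Portable code normalizes the value first, carrying whole seconds from tv_usec into tv_sec, rather than relying on either behavior.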
SEE ALSO
gettimeofday(2), sigaction(2), signal(2), timer_create(2), timerfd_create(2), time(7)
465 - Linux cli command alarm
NAME 🖥️ alarm 🖥️
set an alarm clock for delivery of a signal
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
unsigned int alarm(unsigned int seconds);
DESCRIPTION
alarm() arranges for a SIGALRM signal to be delivered to the calling process in seconds seconds.
If seconds is zero, any pending alarm is canceled.
In any event any previously set alarm() is canceled.
RETURN VALUE
alarm() returns the number of seconds remaining until any previously scheduled alarm was due to be delivered, or zero if there was no previously scheduled alarm.
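A brief sketch of how the return value behaves when alarms are rescheduled and canceled. The function name demo_alarm() is hypothetical. It first cancels any alarm the caller already had pending, which a real program may not want to do.

```c
#include <unistd.h>

/*
 * Hypothetical demo: schedule, reschedule, and cancel an alarm,
 * recording what each alarm() call returns. No SIGALRM is delivered,
 * because the alarm is canceled long before it is due.
 */
static void demo_alarm(unsigned int ret[3])
{
    alarm(0);             /* start clean: cancel any earlier alarm */

    ret[0] = alarm(60);   /* no alarm was pending, so this is 0 */
    ret[1] = alarm(30);   /* seconds left on the 60-second alarm */
    ret[2] = alarm(0);    /* seconds left on the 30-second alarm;
                             the alarm is canceled */
}
```

Since the calls follow each other immediately, ret[1] and ret[2] are expected to be at or just below 60 and 30 respectively; the exact rounding of a partially elapsed second is left to the implementation.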
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4, 4.3BSD.
NOTES
alarm() and setitimer(2) share the same timer; calls to one will interfere with use of the other.
Alarms created by alarm() are preserved across execve(2) and are not inherited by children created via fork(2).
sleep(3) may be implemented using SIGALRM; mixing calls to alarm() and sleep(3) is a bad idea.
Scheduling delays can, as ever, cause the execution of the process to be delayed by an arbitrary amount of time.
SEE ALSO
gettimeofday(2), pause(2), select(2), setitimer(2), sigaction(2), signal(2), timer_create(2), timerfd_create(2), sleep(3), time(7)
466 - Linux cli command perfmonctl
NAME 🖥️ perfmonctl 🖥️
interface to IA-64 performance monitoring unit
SYNOPSIS
#include <syscall.h>
#include <perfmon.h>
long perfmonctl(int fd, int cmd, void arg[.narg], int narg);
Note: There is no glibc wrapper for this system call; see HISTORY.
DESCRIPTION
The IA-64-specific perfmonctl() system call provides an interface to the PMU (performance monitoring unit). The PMU consists of PMD (performance monitoring data) registers and PMC (performance monitoring control) registers, which gather hardware statistics.
perfmonctl() applies the operation cmd to the input arguments specified by arg. The number of arguments is defined by narg. The fd argument specifies the perfmon context to operate on.
Supported values for cmd are:
PFM_CREATE_CONTEXT
perfmonctl(int fd, PFM_CREATE_CONTEXT, pfarg_context_t *ctxt, 1);
Set up a context.
The fd parameter is ignored. A new perfmon context is created as specified in ctxt and its file descriptor is returned in ctxt->ctx_fd.
The file descriptor can be used in subsequent calls to perfmonctl() and can be used to read event notifications (type pfm_msg_t) using read(2). The file descriptor is pollable using select(2), poll(2), and epoll(7).
The context can be destroyed by calling close(2) on the file descriptor.
PFM_WRITE_PMCS
perfmonctl(int fd, PFM_WRITE_PMCS, pfarg_reg_t *pmcs, n);
Set PMC registers.
PFM_WRITE_PMDS
perfmonctl(int fd, PFM_WRITE_PMDS, pfarg_reg_t *pmds, n);
Set PMD registers.
PFM_READ_PMDS
perfmonctl(int fd, PFM_READ_PMDS, pfarg_reg_t *pmds, n);
Read PMD registers.
PFM_START
perfmonctl(int fd, PFM_START, NULL, 0);
Start monitoring.
PFM_STOP
perfmonctl(int fd, PFM_STOP, NULL, 0);
Stop monitoring.
PFM_LOAD_CONTEXT
perfmonctl(int fd, PFM_LOAD_CONTEXT, pfarg_load_t *largs, 1);
Attach the context to a thread.
PFM_UNLOAD_CONTEXT
perfmonctl(int fd, PFM_UNLOAD_CONTEXT, NULL, 0);
Detach the context from a thread.
PFM_RESTART
perfmonctl(int fd, PFM_RESTART, NULL, 0);
Restart monitoring after receiving an overflow notification.
PFM_GET_FEATURES
perfmonctl(int fd, PFM_GET_FEATURES, pfarg_features_t *arg, 1);
PFM_DEBUG
perfmonctl(int fd, PFM_DEBUG, val, 0);
If val is nonzero, enable debugging mode, otherwise disable.
PFM_GET_PMC_RESET_VAL
perfmonctl(int fd, PFM_GET_PMC_RESET_VAL, pfarg_reg_t *req, n);
Reset PMC registers to default values.
RETURN VALUE
perfmonctl() returns zero when the operation is successful. On error, -1 is returned and errno is set to indicate the error.
STANDARDS
Linux on IA-64.
HISTORY
Added in Linux 2.4; removed in Linux 5.10.
This system call was broken for many years, and ultimately removed in Linux 5.10.
glibc does not provide a wrapper for this system call; on kernels where it exists, call it using syscall(2).
SEE ALSO
gprof(1)
The perfmon2 interface specification
467 - Linux cli command rt_sigpending
NAME 🖥️ rt_sigpending 🖥️
examine pending signals
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
int sigpending(sigset_t *set);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigpending():
_POSIX_C_SOURCE
DESCRIPTION
sigpending() returns the set of signals that are pending for delivery to the calling thread (i.e., the signals which have been raised while blocked). The mask of pending signals is returned in set.
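As an illustration, the following sketch raises a signal while it is blocked and then looks for it in the pending set. The names on_usr2() and demo_sigpending() are hypothetical, and SIGUSR2 is an arbitrary choice. A do-nothing handler is installed so that the signal is harmless when it is finally delivered.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static void on_usr2(int sig)
{
    (void)sig;   /* nothing to do; only prevents the default action */
}

/*
 * Hypothetical demo: returns 1 if SIGUSR2 is reported as pending after
 * being raised while blocked, 0 if it is not, and -1 on error.
 */
static int demo_sigpending(void)
{
    struct sigaction sa;
    sigset_t block, oldset, pending;
    int is_pending;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr2;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR2, &sa, NULL) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR2);
    if (sigprocmask(SIG_BLOCK, &block, &oldset) == -1)
        return -1;

    raise(SIGUSR2);                  /* generated, but blocked */

    if (sigpending(&pending) == -1) {
        sigprocmask(SIG_SETMASK, &oldset, NULL);
        return -1;
    }
    is_pending = sigismember(&pending, SIGUSR2);

    /* Restoring the old mask lets the pending signal reach on_usr2(). */
    sigprocmask(SIG_SETMASK, &oldset, NULL);
    return is_pending;
}
```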
RETURN VALUE
sigpending() returns 0 on success. On failure, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
set points to memory which is not a valid part of the process address space.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
C library/kernel differences
The original Linux system call was named sigpending(). However, with the addition of real-time signals in Linux 2.2, the fixed-size, 32-bit sigset_t argument supported by that system call was no longer fit for purpose. Consequently, a new system call, rt_sigpending(), was added to support an enlarged sigset_t type. The new system call takes a second argument, size_t sigsetsize, which specifies the size in bytes of the signal set in set. The glibc sigpending() wrapper function hides these details from us, transparently calling rt_sigpending() when the kernel provides it.
NOTES
See sigsetops(3) for details on manipulating signal sets.
If a signal is both blocked and has a disposition of “ignored”, it is not added to the mask of pending signals when generated.
The set of signals that is pending for a thread is the union of the set of signals that is pending for that thread and the set of signals that is pending for the process as a whole; see signal(7).
A child created via fork(2) initially has an empty pending signal set; the pending signal set is preserved across an execve(2).
BUGS
Up to and including glibc 2.2.1, there is a bug in the wrapper function for sigpending() which means that information about pending real-time signals is not correctly returned.
SEE ALSO
kill(2), sigaction(2), signal(2), sigprocmask(2), sigsuspend(2), sigsetops(3), signal(7)
468 - Linux cli command wait4
NAME 🖥️ wait4 🖥️
wait for process to change state, BSD style
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/wait.h>
pid_t wait3(int *_Nullable wstatus, int options,
struct rusage *_Nullable rusage);
pid_t wait4(pid_t pid, int *_Nullable wstatus, int options,
struct rusage *_Nullable rusage);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
wait3():
    Since glibc 2.26:
        _DEFAULT_SOURCE
            || (_XOPEN_SOURCE >= 500 &&
                ! (_POSIX_C_SOURCE >= 200112L
                   || _XOPEN_SOURCE >= 600))
    From glibc 2.19 to glibc 2.25:
        _DEFAULT_SOURCE || _XOPEN_SOURCE >= 500
    glibc 2.19 and earlier:
        _BSD_SOURCE || _XOPEN_SOURCE >= 500
wait4():
    Since glibc 2.19:
        _DEFAULT_SOURCE
    glibc 2.19 and earlier:
        _BSD_SOURCE
DESCRIPTION
These functions are nonstandard; in new programs, the use of waitpid(2) or waitid(2) is preferable.
The wait3() and wait4() system calls are similar to waitpid(2), but additionally return resource usage information about the child in the structure pointed to by rusage.
Other than the use of the rusage argument, the following wait3() call:
wait3(wstatus, options, rusage);
is equivalent to:
waitpid(-1, wstatus, options);
Similarly, the following wait4() call:
wait4(pid, wstatus, options, rusage);
is equivalent to:
waitpid(pid, wstatus, options);
In other words, wait3() waits for any child, while wait4() can be used to select a specific child, or children, on which to wait. See wait(2) for further details.
If rusage is not NULL, the struct rusage to which it points will be filled with accounting information about the child. See getrusage(2) for details.
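A minimal sketch of waiting for one specific child and collecting its resource usage in the same call. The function name run_child() is hypothetical; the child does nothing except exit with the given code.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: fork a child that exits with code, wait for that
 * child with wait4(), and pass its accounting data back through ru
 * (which may be NULL if the caller is not interested).
 * Returns the child's exit status, or -1 on error or abnormal exit.
 */
static int run_child(int code, struct rusage *ru)
{
    int wstatus;
    pid_t pid, w;

    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);                 /* child: terminate immediately */

    do {
        w = wait4(pid, &wstatus, 0, ru);
    } while (w == -1 && errno == EINTR);

    if (w == -1)
        return -1;
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```

After a successful call with a non-NULL ru, fields such as ru_utime, ru_stime, and ru_maxrss describe the terminated child; see getrusage(2) for their meaning.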
RETURN VALUE
As for waitpid(2).
ERRORS
As for waitpid(2).
STANDARDS
None.
HISTORY
4.3BSD.
SUSv1 included a specification of wait3(); SUSv2 included wait3(), but marked it LEGACY; SUSv3 removed it.
Including <sys/time.h> is not required these days, but increases portability. (Indeed, <sys/resource.h> defines the rusage structure with fields of type struct timeval defined in <sys/time.h>.)
C library/kernel differences
On Linux, wait3() is a library function implemented on top of the wait4() system call.
SEE ALSO
fork(2), getrusage(2), sigaction(2), signal(2), wait(2), signal(7)
469 - Linux cli command sigtimedwait
NAME 🖥️ sigtimedwait 🖥️
synchronously wait for queued signals
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <signal.h>
int sigwaitinfo(const sigset_t *restrict set,
siginfo_t *_Nullable restrict info);
int sigtimedwait(const sigset_t *restrict set,
siginfo_t *_Nullable restrict info,
const struct timespec *restrict timeout);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
sigwaitinfo(), sigtimedwait():
_POSIX_C_SOURCE >= 199309L
DESCRIPTION
sigwaitinfo() suspends execution of the calling thread until one of the signals in set is pending. (If one of the signals in set is already pending for the calling thread, sigwaitinfo() will return immediately.)
sigwaitinfo() removes the signal from the set of pending signals and returns the signal number as its function result. If the info argument is not NULL, then the buffer that it points to is used to return a structure of type siginfo_t (see sigaction(2)) containing information about the signal.
If multiple signals in set are pending for the caller, the signal that is retrieved by sigwaitinfo() is determined according to the usual ordering rules; see signal(7) for further details.
sigtimedwait() operates in exactly the same way as sigwaitinfo() except that it has an additional argument, timeout, which specifies the interval for which the thread is suspended waiting for a signal. (This interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the interval may overrun by a small amount.) This argument is a timespec(3) structure.
If both fields of this structure are specified as 0, a poll is performed: sigtimedwait() returns immediately, either with information about a signal that was pending for the caller, or with an error if none of the signals in set was pending.
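The zero-timeout poll described above can be sketched as follows. The names poll_signal() and demo_poll() are hypothetical, SIGUSR1 is an arbitrary choice, and the sketch assumes a single-threaded caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <time.h>

/*
 * Hypothetical helper: poll once for sig, which the caller must
 * already have blocked. Returns the signal number if it was pending,
 * 0 if nothing was pending (EAGAIN), or -1 on any other error.
 */
static int poll_signal(int sig)
{
    sigset_t set;
    struct timespec zero = { 0, 0 };   /* both fields 0: do not wait */
    int rc;

    sigemptyset(&set);
    sigaddset(&set, sig);
    rc = sigtimedwait(&set, NULL, &zero);
    if (rc == -1 && errno == EAGAIN)
        return 0;
    return rc;
}

/*
 * Hypothetical demo: block SIGUSR1, poll before and after raising it,
 * and store both results. Returns 0, or -1 if the mask could not be
 * changed.
 */
static int demo_poll(int *before, int *after)
{
    sigset_t block, oldset;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &oldset) == -1)
        return -1;

    *before = poll_signal(SIGUSR1);    /* nothing pending yet */
    raise(SIGUSR1);
    *after = poll_signal(SIGUSR1);     /* accepts the pending signal */

    /* The signal was consumed, so unblocking it has no further effect. */
    sigprocmask(SIG_SETMASK, &oldset, NULL);
    return 0;
}
```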
RETURN VALUE
On success, both sigwaitinfo() and sigtimedwait() return a signal number (i.e., a value greater than zero). On failure both calls return -1, with errno set to indicate the error.
ERRORS
EAGAIN
No signal in set became pending within the timeout period specified to sigtimedwait().
EINTR
The wait was interrupted by a signal handler; see signal(7). (This handler was for a signal other than one of those in set.)
EINVAL
timeout was invalid.
VERSIONS
C library/kernel differences
On Linux, sigwaitinfo() is a library function implemented on top of sigtimedwait().
The glibc wrapper functions for sigwaitinfo() and sigtimedwait() silently ignore attempts to wait for the two real-time signals that are used internally by the NPTL threading implementation. See nptl(7) for details.
The original Linux system call was named sigtimedwait(). However, with the addition of real-time signals in Linux 2.2, the fixed-size, 32-bit sigset_t type supported by that system call was no longer fit for purpose. Consequently, a new system call, rt_sigtimedwait(), was added to support an enlarged sigset_t type. The new system call takes a fourth argument, size_t sigsetsize, which specifies the size in bytes of the signal set in set. This argument is currently required to have the value sizeof(sigset_t) (or the error EINVAL results). The glibc sigtimedwait() wrapper function hides these details from us, transparently calling rt_sigtimedwait() when the kernel provides it.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001.
NOTES
In normal usage, the calling program blocks the signals in set via a prior call to sigprocmask(2) (so that the default disposition for these signals does not occur if they become pending between successive calls to sigwaitinfo() or sigtimedwait()) and does not establish handlers for these signals. In a multithreaded program, the signal should be blocked in all threads, in order to prevent the signal being treated according to its default disposition in a thread other than the one calling sigwaitinfo() or sigtimedwait().
The set of signals that is pending for a given thread is the union of the set of signals that is pending specifically for that thread and the set of signals that is pending for the process as a whole (see signal(7)).
Attempts to wait for SIGKILL and SIGSTOP are silently ignored.
If multiple threads of a process are blocked waiting for the same signal(s) in sigwaitinfo() or sigtimedwait(), then exactly one of the threads will actually receive the signal if it becomes pending for the process as a whole; which of the threads receives the signal is indeterminate.
sigwaitinfo() and sigtimedwait() can’t be used to receive signals that are synchronously generated, such as the SIGSEGV signal that results from accessing an invalid memory address or the SIGFPE signal that results from an arithmetic error. Such signals can be caught only via a signal handler.
POSIX leaves the meaning of a NULL value for the timeout argument of sigtimedwait() unspecified, permitting the possibility that this has the same meaning as a call to sigwaitinfo(), and indeed this is what is done on Linux.
SEE ALSO
kill(2), sigaction(2), signal(2), signalfd(2), sigpending(2), sigprocmask(2), sigqueue(3), sigsetops(3), sigwait(3), timespec(3), signal(7), time(7)
470 - Linux cli command madvise
NAME 🖥️ madvise 🖥️
give advice about use of memory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/mman.h>
int madvise(void addr[.length], size_t length, int advice);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
madvise():
Since glibc 2.19:
_DEFAULT_SOURCE
Up to and including glibc 2.19:
_BSD_SOURCE
DESCRIPTION
The madvise() system call is used to give advice or directions to the kernel about the address range beginning at address addr and with size length. madvise() operates only on whole pages; therefore, addr must be page-aligned. The value of length is rounded up to a multiple of the page size. In most cases, the goal of such advice is to improve system or application performance.
Initially, the system call supported a set of “conventional” advice values, which are also available on several other implementations. (Note, though, that madvise() is not specified in POSIX.) Subsequently, a number of Linux-specific advice values have been added.
Conventional advice values
The advice values listed below allow an application to tell the kernel how it expects to use some mapped or shared memory areas, so that the kernel can choose appropriate read-ahead and caching techniques. These advice values do not influence the semantics of the application (except in the case of MADV_DONTNEED), but may influence its performance. All of the advice values listed here have analogs in the POSIX-specified posix_madvise(3) function, and the values have the same meanings, with the exception of MADV_DONTNEED.
The advice is indicated in the advice argument, which is one of the following:
MADV_NORMAL
No special treatment. This is the default.
MADV_RANDOM
Expect page references in random order. (Hence, read ahead may be less useful than normally.)
MADV_SEQUENTIAL
Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)
MADV_WILLNEED
Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)
MADV_DONTNEED
Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)
After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings.
Note that, when applied to shared mappings, MADV_DONTNEED might not lead to immediate freeing of the pages in the range. The kernel is free to delay freeing the pages until an appropriate moment. The resident set size (RSS) of the calling process will, however, be reduced immediately.
MADV_DONTNEED cannot be applied to locked pages, or VM_PFNMAP pages. (Pages marked with the kernel-internal VM_PFNMAP flag are special memory areas that are not managed by the virtual memory subsystem. Such pages are typically created by device drivers that map the pages into user space.)
Support for Huge TLB pages was added in Linux v5.18. Addresses within a mapping backed by Huge TLB pages must be aligned to the underlying Huge TLB page size, and the range length is rounded up to a multiple of the underlying Huge TLB page size.
Linux-specific advice values
The following Linux-specific advice values have no counterparts in the POSIX-specified posix_madvise(3), and may or may not have counterparts in the madvise() interface available on other implementations. Note that some of these operations change the semantics of memory accesses.
MADV_REMOVE (since Linux 2.6.16)
Free up a given range of pages and its associated backing store. This is equivalent to punching a hole in the corresponding range of the backing store (see fallocate(2)). Subsequent accesses in the specified address range will see data with a value of zero.
The specified address range must be mapped shared and writable. This flag cannot be applied to locked pages, or VM_PFNMAP pages.
In the initial implementation, only tmpfs(5) supported MADV_REMOVE; but since Linux 3.5, any filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE. Filesystems which do not support MADV_REMOVE fail with the error EOPNOTSUPP.
Support for the Huge TLB filesystem was added in Linux v4.3.
MADV_DONTFORK (since Linux 2.6.16)
Do not make the pages in this range available to the child after a fork(2). This is useful to prevent copy-on-write semantics from changing the physical location of a page if the parent writes to it after a fork(2). (Such page relocations cause problems for hardware that DMAs into the page.)
MADV_DOFORK (since Linux 2.6.16)
Undo the effect of MADV_DONTFORK, restoring the default behavior, whereby a mapping is inherited across fork(2).
MADV_HWPOISON (since Linux 2.6.32)
Poison the pages in the range specified by addr and length and handle subsequent references to those pages like a hardware memory corruption. This operation is available only for privileged (CAP_SYS_ADMIN) processes. This operation may result in the calling process receiving a SIGBUS and the page being unmapped.
This feature is intended for testing of memory error-handling code; it is available only if the kernel was configured with CONFIG_MEMORY_FAILURE.
MADV_MERGEABLE (since Linux 2.6.32)
Enable Kernel Samepage Merging (KSM) for the pages in the range specified by addr and length. The kernel regularly scans those areas of user memory that have been marked as mergeable, looking for pages with identical content. These are replaced by a single write-protected page (which is automatically copied if a process later wants to update the content of the page). KSM merges only private anonymous pages (see mmap(2)).
The KSM feature is intended for applications that generate many instances of the same data (e.g., virtualization systems such as KVM). It can consume a lot of processing power; use with care. See the Linux kernel source file Documentation/admin-guide/mm/ksm.rst for more details.
The MADV_MERGEABLE and MADV_UNMERGEABLE operations are available only if the kernel was configured with CONFIG_KSM.
MADV_UNMERGEABLE (since Linux 2.6.32)
Undo the effect of an earlier MADV_MERGEABLE operation on the specified address range; KSM unmerges whatever pages it had merged in the address range specified by addr and length.
MADV_SOFT_OFFLINE (since Linux 2.6.33)
Soft offline the pages in the range specified by addr and length. The memory of each page in the specified range is preserved (i.e., when next accessed, the same content will be visible, but in a new physical page frame), and the original page is offlined (i.e., no longer used, and taken out of normal memory management). The effect of the MADV_SOFT_OFFLINE operation is invisible to (i.e., does not change the semantics of) the calling process.
This feature is intended for testing of memory error-handling code; it is available only if the kernel was configured with CONFIG_MEMORY_FAILURE.
MADV_HUGEPAGE (since Linux 2.6.38)
Enable Transparent Huge Pages (THP) for pages in the range specified by addr and length. The kernel will regularly scan the areas marked as huge page candidates to replace them with huge pages. The kernel will also allocate huge pages directly when the region is naturally aligned to the huge page size (see posix_memalign(3)).
This feature is primarily aimed at applications that use large mappings of data and access large regions of that memory at a time (e.g., virtualization systems such as QEMU). It can very easily waste memory (e.g., a 2 MB mapping that only ever accesses 1 byte will result in 2 MB of wired memory instead of one 4 KB page). See the Linux kernel source file Documentation/admin-guide/mm/transhuge.rst for more details.
Most common kernel configurations provide MADV_HUGEPAGE-style behavior by default, and thus MADV_HUGEPAGE is normally not necessary. It is mostly intended for embedded systems, where MADV_HUGEPAGE-style behavior may not be enabled by default in the kernel. On such systems, this flag can be used in order to selectively enable THP. Whenever MADV_HUGEPAGE is used, it should always be applied to regions of memory whose access pattern the developer knows in advance will not risk increasing the memory footprint of the application when transparent hugepages are enabled.
Since Linux 5.4, automatic scan of eligible areas and replacement by huge pages works with private anonymous pages (see mmap(2)), shmem pages, and file-backed pages. For all memory types, memory may only be replaced by huge pages on hugepage-aligned boundaries. For file-mapped memory, including tmpfs (see tmpfs(5)), the mapping must also be naturally hugepage-aligned within the file. Additionally, for file-backed, non-tmpfs memory, the file must not be open for write and the mapping must be executable.
The VMA must not be marked VM_NOHUGEPAGE, VM_HUGETLB, VM_IO, VM_DONTEXPAND, VM_MIXEDMAP, or VM_PFNMAP, nor can it be stack memory or backed by a DAX-enabled device (unless the DAX device is hot-plugged as System RAM). The process must also not have PR_SET_THP_DISABLE set (see prctl(2)).
The MADV_HUGEPAGE, MADV_NOHUGEPAGE, and MADV_COLLAPSE operations are available only if the kernel was configured with CONFIG_TRANSPARENT_HUGEPAGE and file/shmem memory is only supported if the kernel was configured with CONFIG_READ_ONLY_THP_FOR_FS.
MADV_NOHUGEPAGE (since Linux 2.6.38)
Ensures that memory in the address range specified by addr and length will not be backed by transparent hugepages.
MADV_COLLAPSE (since Linux 6.1)
Perform a best-effort synchronous collapse of the native pages mapped by the memory range into Transparent Huge Pages (THPs). MADV_COLLAPSE operates on the current state of memory of the calling process and makes no persistent changes or guarantees on how pages will be mapped, constructed, or faulted in the future.
MADV_COLLAPSE supports private anonymous pages (see mmap(2)), shmem pages, and file-backed pages. See MADV_HUGEPAGE for general information on memory requirements for THP. If the range provided spans multiple VMAs, the semantics of the collapse over each VMA is independent from the others. If collapse of a given huge page-aligned/sized region fails, the operation may continue to attempt collapsing the remainder of the specified memory. MADV_COLLAPSE will automatically clamp the provided range to be hugepage-aligned.
All non-resident pages covered by the range will first be swapped/faulted-in, before being copied onto a freshly allocated hugepage. If the native pages compose the same PTE-mapped hugepage, and are suitably aligned, allocation of a new hugepage may be elided and collapse may happen in-place. Unmapped pages will have their data directly initialized to 0 in the new hugepage. However, for every eligible hugepage-aligned/sized region to be collapsed, at least one page must currently be backed by physical memory.
MADV_COLLAPSE is independent of any sysfs (see sysfs(5)) setting under /sys/kernel/mm/transparent_hugepage, both in terms of determining THP eligibility, and allocation semantics. See Linux kernel source file Documentation/admin-guide/mm/transhuge.rst for more information. MADV_COLLAPSE also ignores huge= tmpfs mount when operating on tmpfs files. Allocation for the new hugepage may enter direct reclaim and/or compaction, regardless of VMA flags (though VM_NOHUGEPAGE is still respected).
When the system has multiple NUMA nodes, the hugepage will be allocated from the node providing the most native pages.
If all hugepage-sized/aligned regions covered by the provided range were either successfully collapsed, or were already PMD-mapped THPs, this operation will be deemed successful. Note that this doesn’t guarantee anything about other possible mappings of the memory. In the event multiple hugepage-aligned/sized areas fail to collapse, only the most-recently-failed code will be set in errno.
MADV_DONTDUMP (since Linux 3.4)
Exclude from a core dump those pages in the range specified by addr and length. This is useful in applications that have large areas of memory that are known not to be useful in a core dump. The effect of MADV_DONTDUMP takes precedence over the bit mask that is set via the /proc/pid/coredump_filter file (see core(5)).
MADV_DODUMP (since Linux 3.4)
Undo the effect of an earlier MADV_DONTDUMP.
MADV_FREE (since Linux 4.5)
The application no longer requires the pages in the range specified by addr and length. The kernel can thus free these pages, but the freeing could be delayed until memory pressure occurs. For each of the pages that has been marked to be freed but has not yet been freed, the free operation will be canceled if the caller writes into the page. After a successful MADV_FREE operation, any stale data (i.e., dirty, unwritten pages) will be lost when the kernel frees the pages. However, subsequent writes to pages in the range will succeed, and the kernel then cannot free those dirtied pages, so that the caller can always see just-written data. If there is no subsequent write, the kernel can free the pages at any time. Once pages in the range have been freed, the caller will see zero-fill-on-demand pages upon subsequent page references.
The MADV_FREE operation can be applied only to private anonymous pages (see mmap(2)). Before Linux 4.12, when freeing pages on a swapless system, the pages in the given range are freed instantly, regardless of memory pressure.
MADV_WIPEONFORK (since Linux 4.14)
Present the child process with zero-filled memory in this range after a fork(2). This is useful in forking servers in order to ensure that sensitive per-process data (for example, PRNG seeds, cryptographic secrets, and so on) is not handed to child processes.
The MADV_WIPEONFORK operation can be applied only to private anonymous pages (see mmap(2)).
Within the child created by fork(2), the MADV_WIPEONFORK setting remains in place on the specified address range. This setting is cleared during execve(2).
MADV_KEEPONFORK (since Linux 4.14)
Undo the effect of an earlier MADV_WIPEONFORK.
MADV_COLD (since Linux 5.4)
Deactivate a given range of pages. This will make the pages a more probable reclaim target should there be memory pressure. This is a nondestructive operation. The advice might be ignored for some pages in the range when it is not applicable.
MADV_PAGEOUT (since Linux 5.4)
Reclaim a given range of pages. This is done to free up memory occupied by these pages. If a page is anonymous, it will be swapped out. If a page is file-backed and dirty, it will be written back to the backing storage. The advice might be ignored for some pages in the range when it is not applicable.
MADV_POPULATE_READ (since Linux 5.14)
Populate (prefault) page tables readable, faulting in all pages in the range just as if manually reading from each page; however, avoid the actual memory access that would have been performed after handling the fault.
In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide errors, can be applied to (parts of) existing mappings and will always populate (prefault) page tables readable. One example use case is prefaulting a file mapping, reading all file content from disk; however, pages won’t be dirtied and consequently won’t have to be written back to disk when evicting the pages from memory.
Depending on the underlying mapping, map the shared zeropage, preallocate memory or read the underlying file; files with holes might or might not preallocate blocks. If populating fails, a SIGBUS signal is not generated; instead, an error is returned.
If MADV_POPULATE_READ succeeds, all page tables have been populated (prefaulted) readable once. If MADV_POPULATE_READ fails, some page tables might have been populated.
MADV_POPULATE_READ cannot be applied to mappings without read permissions and special mappings, for example, mappings marked with kernel-internal flags such as VM_PFNMAP or VM_IO, or secret memory regions created using memfd_secret(2).
Note that with MADV_POPULATE_READ, the process can be killed at any moment when the system runs out of memory.
MADV_POPULATE_WRITE (since Linux 5.14)
Populate (prefault) page tables writable, faulting in all pages in the range just as if manually writing to each page; however, avoid the actual memory access that would have been performed after handling the fault.
In contrast to MAP_POPULATE, MADV_POPULATE_WRITE does not hide errors, can be applied to (parts of) existing mappings and will always populate (prefault) page tables writable. One example use case is preallocating memory, breaking any CoW (Copy on Write).
Depending on the underlying mapping, preallocate memory or read the underlying file; files with holes will preallocate blocks. If populating fails, a SIGBUS signal is not generated; instead, an error is returned.
If MADV_POPULATE_WRITE succeeds, all page tables have been populated (prefaulted) writable once. If MADV_POPULATE_WRITE fails, some page tables might have been populated.
MADV_POPULATE_WRITE cannot be applied to mappings without write permissions and special mappings, for example, mappings marked with kernel-internal flags such as VM_PFNMAP or VM_IO, or secret memory regions created using memfd_secret(2).
Note that with MADV_POPULATE_WRITE, the process can be killed at any moment when the system runs out of memory.
RETURN VALUE
On success, madvise() returns zero. On error, it returns -1 and errno is set to indicate the error.
ERRORS
EACCES
advice is MADV_REMOVE, but the specified address range is not a shared writable mapping.
EAGAIN
A kernel resource was temporarily unavailable.
EBADF
The map exists, but the area maps something that isn’t a file.
EBUSY
(for MADV_COLLAPSE) Could not charge hugepage to cgroup: cgroup limit exceeded.
EFAULT
advice is MADV_POPULATE_READ or MADV_POPULATE_WRITE, and populating (prefaulting) page tables failed because a SIGBUS would have been generated on actual memory access and the reason is not a HW poisoned page (HW poisoned pages can, for example, be created using the MADV_HWPOISON flag described elsewhere in this page).
EINVAL
addr is not page-aligned or length is negative.
EINVAL
advice is not valid.
EINVAL
advice is MADV_COLD or MADV_PAGEOUT and the specified address range includes locked, Huge TLB pages, or VM_PFNMAP pages.
EINVAL
advice is MADV_DONTNEED or MADV_REMOVE and the specified address range includes locked, Huge TLB pages, or VM_PFNMAP pages.
EINVAL
advice is MADV_MERGEABLE or MADV_UNMERGEABLE, but the kernel was not configured with CONFIG_KSM.
EINVAL
advice is MADV_FREE or MADV_WIPEONFORK but the specified address range includes file, Huge TLB, MAP_SHARED, or VM_PFNMAP ranges.
EINVAL
advice is MADV_POPULATE_READ or MADV_POPULATE_WRITE, but the specified address range includes ranges with insufficient permissions or special mappings, for example, mappings marked with kernel-internal flags such as VM_IO or VM_PFNMAP, or secret memory regions created using memfd_secret(2).
EIO
(for MADV_WILLNEED) Paging in this area would exceed the process’s maximum resident set size.
ENOMEM
(for MADV_WILLNEED) Not enough memory: paging in failed.
ENOMEM
(for MADV_COLLAPSE) Not enough memory: could not allocate hugepage.
ENOMEM
Addresses in the specified range are not currently mapped, or are outside the address space of the process.
ENOMEM
advice is MADV_POPULATE_READ or MADV_POPULATE_WRITE, and populating (prefaulting) page tables failed because there was not enough memory.
EPERM
advice is MADV_HWPOISON, but the caller does not have the CAP_SYS_ADMIN capability.
EHWPOISON
advice is MADV_POPULATE_READ or MADV_POPULATE_WRITE, and populating (prefaulting) page tables failed because a HW poisoned page (HW poisoned pages can, for example, be created using the MADV_HWPOISON flag described elsewhere in this page) was encountered.
VERSIONS
Versions of this system call, implementing a wide variety of advice values, exist on many other implementations. Other implementations typically implement at least the values listed above under Conventional advice values, albeit with some variation in semantics.
POSIX.1-2001 describes posix_madvise(3) with constants POSIX_MADV_NORMAL, POSIX_MADV_RANDOM, POSIX_MADV_SEQUENTIAL, POSIX_MADV_WILLNEED, and POSIX_MADV_DONTNEED, and so on, with behavior close to the similarly named flags listed above.
Linux
The Linux implementation requires that the address addr be page-aligned, and allows length to be zero. If there are some parts of the specified address range that are not mapped, the Linux version of madvise() ignores them and applies the call to the rest (but returns ENOMEM from the system call, as it should).
madvise(0, 0, advice) will return zero iff advice is supported by the kernel and can be relied on to probe for support.
STANDARDS
None.
HISTORY
First appeared in 4.4BSD.
Since Linux 3.18, support for this system call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS configuration option.
SEE ALSO
getrlimit(2), memfd_secret(2), mincore(2), mmap(2), mprotect(2), msync(2), munmap(2), prctl(2), process_madvise(2), posix_madvise(3), core(5)
471 - Linux cli command putmsg
NAME 🖥️ putmsg 🖥️
unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
472 - Linux cli command open_howtype
NAME 🖥️ open_howtype 🖥️
how to open a pathname
LIBRARY
Linux kernel headers
SYNOPSIS
#include <linux/openat2.h>
struct open_how {
u64 flags; /* O_* flags */
u64 mode; /* Mode for O_{CREAT,TMPFILE} */
u64 resolve; /* RESOLVE_* flags */
/* ... */
};
DESCRIPTION
Specifies how a pathname should be opened.
The fields are as follows:
flags
This field specifies the file creation and file status flags to use when opening the file.
mode
This field specifies the mode for the new file.
resolve
This is a bit mask of flags that modify the way in which all components of a pathname will be resolved (see path_resolution(7) for background information).
VERSIONS
Extra fields may be appended to the structure, with a zero value in a new field resulting in the kernel behaving as though that extension field was not present. Therefore, a user must zero-fill this structure on initialization.
STANDARDS
Linux.
SEE ALSO
openat2(2)
473 - Linux cli command oldfstat
NAME 🖥️ oldfstat 🖥️
get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but, in the case of stat(), fstatat(), and lstat(), execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    struct stat sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    printf("ID of containing device: [%x,%x]\n",
           major(sb.st_dev), minor(sb.st_dev));

    printf("File type: ");

    switch (sb.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");     break;
    case S_IFCHR:  printf("character device\n"); break;
    case S_IFDIR:  printf("directory\n");        break;
    case S_IFIFO:  printf("FIFO/pipe\n");        break;
    case S_IFLNK:  printf("symlink\n");          break;
    case S_IFREG:  printf("regular file\n");     break;
    case S_IFSOCK: printf("socket\n");           break;
    default:       printf("unknown?\n");         break;
    }

    printf("I-node number: %ju\n", (uintmax_t) sb.st_ino);
    printf("Mode: %jo (octal)\n", (uintmax_t) sb.st_mode);
    printf("Link count: %ju\n", (uintmax_t) sb.st_nlink);
    printf("Ownership: UID=%ju GID=%ju\n",
           (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);
    printf("Preferred I/O block size: %jd bytes\n",
           (intmax_t) sb.st_blksize);
    printf("File size: %jd bytes\n", (intmax_t) sb.st_size);
    printf("Blocks allocated: %jd\n", (intmax_t) sb.st_blocks);

    printf("Last status change: %s", ctime(&sb.st_ctime));
    printf("Last file access: %s", ctime(&sb.st_atime));
    printf("Last file modification: %s", ctime(&sb.st_mtime));

    exit(EXIT_SUCCESS);
}
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
474 - Linux cli command setsid
NAME
setsid - creates a session and sets the process group ID
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
pid_t setsid(void);
DESCRIPTION
setsid() creates a new session if the calling process is not a process group leader. The calling process is the leader of the new session (i.e., its session ID is made the same as its process ID). The calling process also becomes the process group leader of a new process group in the session (i.e., its process group ID is made the same as its process ID).
The calling process will be the only process in the new process group and in the new session.
Initially, the new session has no controlling terminal. For details of how a session acquires a controlling terminal, see credentials(7).
RETURN VALUE
On success, the (new) session ID of the calling process is returned. On error, (pid_t) -1 is returned, and errno is set to indicate the error.
ERRORS
EPERM
The process group ID of any process equals the PID of the calling process. Thus, in particular, setsid() fails if the calling process is already a process group leader.
STANDARDS
POSIX.1-2008.
HISTORY
POSIX.1-2001, SVr4.
NOTES
A child created via fork(2) inherits its parent’s session ID. The session ID is preserved across an execve(2).
A process group leader is a process whose process group ID equals its PID. Disallowing a process group leader from calling setsid() prevents the possibility that a process group leader places itself in a new session while other processes in the process group remain in the original session; such a scenario would break the strict two-level hierarchy of sessions and process groups. In order to be sure that setsid() will succeed, call fork(2) and have the parent _exit(2), while the child (which by definition can’t be a process group leader) calls setsid().
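The fork-then-setsid() pattern just described can be sketched as follows. For the sake of a checkable example, the parent here waits for the child and reports the outcome rather than calling _exit(2) as a real daemonizing parent would; the function name child_becomes_session_leader() is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks, lets the child call setsid(), and reports
   whether the child became the leader of a new session.
   Returns 1 on success, 0 if setsid() did not behave as expected,
   and -1 if fork() or waitpid() failed. */
int
child_becomes_session_leader(void)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: its PID is new, so it cannot be a process group
           leader, and setsid() is expected to succeed. */
        pid_t sid = setsid();

        /* On success the new session ID equals the caller's PID. */
        _exit((sid != (pid_t) -1 && sid == getpid()) ? 0 : 1);
    }

    /* Parent: a daemonizing program would _exit() here instead. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```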
If a session has a controlling terminal, and the CLOCAL flag for that terminal is not set, and a terminal hangup occurs, then the session leader is sent a SIGHUP signal.
If a process that is a session leader terminates, then a SIGHUP signal is sent to each process in the foreground process group of the controlling terminal.
SEE ALSO
setsid(1), getsid(2), setpgid(2), setpgrp(2), tcgetsid(3), credentials(7), sched(7)
475 - Linux cli command timerfd_create
NAME
timerfd_create - timers that notify via file descriptors
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/timerfd.h>
int timerfd_create(int clockid, int flags);
int timerfd_settime(int fd, int flags,
const struct itimerspec *new_value,
struct itimerspec *_Nullable old_value);
int timerfd_gettime(int fd, struct itimerspec *curr_value);
DESCRIPTION
These system calls create and operate on a timer that delivers timer expiration notifications via a file descriptor. They provide an alternative to the use of setitimer(2) or timer_create(2), with the advantage that the file descriptor may be monitored by select(2), poll(2), and epoll(7).
The use of these three system calls is analogous to the use of timer_create(2), timer_settime(2), and timer_gettime(2). (There is no analog of timer_getoverrun(2), since that functionality is provided by read(2), as described below.)
timerfd_create()
timerfd_create() creates a new timer object, and returns a file descriptor that refers to that timer. The clockid argument specifies the clock that is used to mark the progress of the timer, and must be one of the following:
CLOCK_REALTIME
A settable system-wide real-time clock.
CLOCK_MONOTONIC
A nonsettable monotonically increasing clock that measures time from some unspecified point in the past that does not change after system startup.
CLOCK_BOOTTIME (since Linux 3.15)
Like CLOCK_MONOTONIC, this is a monotonically increasing clock. However, whereas the CLOCK_MONOTONIC clock does not measure the time while a system is suspended, the CLOCK_BOOTTIME clock does include the time during which the system is suspended. This is useful for applications that need to be suspend-aware. CLOCK_REALTIME is not suitable for such applications, since that clock is affected by discontinuous changes to the system clock.
CLOCK_REALTIME_ALARM (since Linux 3.11)
This clock is like CLOCK_REALTIME, but will wake the system if it is suspended. The caller must have the CAP_WAKE_ALARM capability in order to set a timer against this clock.
CLOCK_BOOTTIME_ALARM (since Linux 3.11)
This clock is like CLOCK_BOOTTIME, but will wake the system if it is suspended. The caller must have the CAP_WAKE_ALARM capability in order to set a timer against this clock.
See clock_getres(2) for some further details on the above clocks.
The current value of each of these clocks can be retrieved using clock_gettime(2).
Starting with Linux 2.6.27, the following values may be bitwise ORed in flags to change the behavior of timerfd_create():
TFD_NONBLOCK
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
TFD_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
In Linux versions up to and including 2.6.26, flags must be specified as zero.
timerfd_settime()
timerfd_settime() arms (starts) or disarms (stops) the timer referred to by the file descriptor fd.
The new_value argument specifies the initial expiration and interval for the timer. The itimerspec structure used for this argument is described in itimerspec(3type).
new_value.it_value specifies the initial expiration of the timer, in seconds and nanoseconds. Setting either field of new_value.it_value to a nonzero value arms the timer. Setting both fields of new_value.it_value to zero disarms the timer.
Setting one or both fields of new_value.it_interval to nonzero values specifies the period, in seconds and nanoseconds, for repeated timer expirations after the initial expiration. If both fields of new_value.it_interval are zero, the timer expires just once, at the time specified by new_value.it_value.
By default, the initial expiration time specified in new_value is interpreted relative to the current time on the timer’s clock at the time of the call (i.e., new_value.it_value specifies a time relative to the current value of the clock specified by clockid). An absolute timeout can be selected via the flags argument.
The flags argument is a bit mask that can include the following values:
TFD_TIMER_ABSTIME
Interpret new_value.it_value as an absolute value on the timer’s clock. The timer will expire when the value of the timer’s clock reaches the value specified in new_value.it_value.
TFD_TIMER_CANCEL_ON_SET
If this flag is specified along with TFD_TIMER_ABSTIME and the clock for this timer is CLOCK_REALTIME or CLOCK_REALTIME_ALARM, then mark this timer as cancelable if the real-time clock undergoes a discontinuous change (settimeofday(2), clock_settime(2), or similar). When such changes occur, a current or future read(2) from the file descriptor will fail with the error ECANCELED.
If the old_value argument is not NULL, then the itimerspec structure that it points to is used to return the setting of the timer that was current at the time of the call; see the description of timerfd_gettime() following.
timerfd_gettime()
timerfd_gettime() returns, in curr_value, an itimerspec structure that contains the current setting of the timer referred to by the file descriptor fd.
The it_value field returns the amount of time until the timer will next expire. If both fields of this structure are zero, then the timer is currently disarmed. This field always contains a relative value, regardless of whether the TFD_TIMER_ABSTIME flag was specified when setting the timer.
The it_interval field returns the interval of the timer. If both fields of this structure are zero, then the timer is set to expire just once, at the time specified by curr_value.it_value.
Operating on a timer file descriptor
The file descriptor returned by timerfd_create() supports the following additional operations:
read(2)
If the timer has already expired one or more times since its settings were last modified using timerfd_settime(), or since the last successful read(2), then the buffer given to read(2) returns an unsigned 8-byte integer (uint64_t) containing the number of expirations that have occurred. (The returned value is in host byte order, that is, the native byte order for integers on the host machine.)
If no timer expirations have occurred at the time of the read(2), then the call either blocks until the next timer expiration, or fails with the error EAGAIN if the file descriptor has been made nonblocking (via the use of the fcntl(2) F_SETFL operation to set the O_NONBLOCK flag).
A read(2) fails with the error EINVAL if the size of the supplied buffer is less than 8 bytes.
If the associated clock is either CLOCK_REALTIME or CLOCK_REALTIME_ALARM, the timer is absolute (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET was specified when calling timerfd_settime(), then read(2) fails with the error ECANCELED if the real-time clock undergoes a discontinuous change. (This allows the reading application to discover such discontinuous changes to the clock.)
If the associated clock is either CLOCK_REALTIME or CLOCK_REALTIME_ALARM, the timer is absolute (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET was not specified when calling timerfd_settime(), then a discontinuous negative change to the clock (e.g., clock_settime(2)) may cause read(2) to unblock, but return a value of 0 (i.e., no bytes read), if the clock change occurs after the time expired, but before the read(2) on the file descriptor.
poll(2)
select(2)
(and similar)
The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if one or more timer expirations have occurred.
The file descriptor also supports the other file-descriptor multiplexing APIs: pselect(2), ppoll(2), and epoll(7).
ioctl(2)
The following timerfd-specific command is supported:
TFD_IOC_SET_TICKS (since Linux 3.17)
Adjust the number of timer expirations that have occurred. The argument is a pointer to a nonzero 8-byte integer (uint64_t*) containing the new number of expirations. Once the number is set, any waiter on the timer is woken up. The only purpose of this command is to restore the expirations for the purpose of checkpoint/restore. This operation is available only if the kernel was configured with the CONFIG_CHECKPOINT_RESTORE option.
close(2)
When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same timer object have been closed, the timer is disarmed and its resources are freed by the kernel.
fork(2) semantics
After a fork(2), the child inherits a copy of the file descriptor created by timerfd_create(). The file descriptor refers to the same underlying timer object as the corresponding file descriptor in the parent, and read(2)s in the child will return information about expirations of the timer.
execve(2) semantics
A file descriptor created by timerfd_create() is preserved across execve(2), and continues to generate timer expirations if the timer was armed.
RETURN VALUE
On success, timerfd_create() returns a new file descriptor. On error, -1 is returned and errno is set to indicate the error.
timerfd_settime() and timerfd_gettime() return 0 on success; on error they return -1, and set errno to indicate the error.
ERRORS
timerfd_create() can fail with the following errors:
EINVAL
The clockid is not valid.
EINVAL
flags is invalid; or, in Linux 2.6.26 or earlier, flags is nonzero.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
Could not mount (internal) anonymous inode device.
ENOMEM
There was insufficient kernel memory to create the timer.
EPERM
clockid was CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM but the caller did not have the CAP_WAKE_ALARM capability.
timerfd_settime() and timerfd_gettime() can fail with the following errors:
EBADF
fd is not a valid file descriptor.
EFAULT
new_value, old_value, or curr_value is not a valid pointer.
EINVAL
fd is not a valid timerfd file descriptor.
timerfd_settime() can also fail with the following errors:
ECANCELED
See NOTES.
EINVAL
new_value is not properly initialized (one of the tv_nsec falls outside the range zero to 999,999,999).
EINVAL
flags is invalid.
STANDARDS
Linux.
HISTORY
Linux 2.6.25, glibc 2.8.
NOTES
Suppose the following scenario for a CLOCK_REALTIME or CLOCK_REALTIME_ALARM timer that was created with timerfd_create():
the timer has been started (timerfd_settime()) with the TFD_TIMER_ABSTIME and TFD_TIMER_CANCEL_ON_SET flags;
a discontinuous change (e.g., settimeofday(2)) is subsequently made to the CLOCK_REALTIME clock; and
the caller once more calls timerfd_settime() to rearm the timer (without first doing a read(2) on the file descriptor).
In this case the following occurs:
The timerfd_settime() returns -1 with errno set to ECANCELED. (This enables the caller to know that the previous timer was affected by a discontinuous change to the clock.)
The timer is successfully rearmed with the settings provided in the second timerfd_settime() call. (This was probably an implementation accident, but won’t be fixed now, in case there are applications that depend on this behaviour.)
BUGS
Currently, timerfd_create() supports fewer types of clock IDs than timer_create(2).
EXAMPLES
The following program creates a timer and then monitors its progress. The program accepts up to three command-line arguments. The first argument specifies the number of seconds for the initial expiration of the timer. The second argument specifies the interval for the timer, in seconds. The third argument specifies the number of times the program should allow the timer to expire before terminating. The second and third command-line arguments are optional.
The following shell session demonstrates the use of the program:
$ a.out 3 1 100
0.000: timer started
3.000: read: 1; total=1
4.000: read: 1; total=2
^Z # type control-Z to suspend the program
[1]+ Stopped ./timerfd3_demo 3 1 100
$ fg # Resume execution after a few seconds
a.out 3 1 100
9.660: read: 5; total=7
10.000: read: 1; total=8
11.000: read: 1; total=9
^C # type control-C to terminate the program
Program source
#include <err.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/timerfd.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

static void
print_elapsed_time(void)
{
    int secs, nsecs;
    static int first_call = 1;
    struct timespec curr;
    static struct timespec start;

    if (first_call) {
        first_call = 0;
        if (clock_gettime(CLOCK_MONOTONIC, &start) == -1)
            err(EXIT_FAILURE, "clock_gettime");
    }

    if (clock_gettime(CLOCK_MONOTONIC, &curr) == -1)
        err(EXIT_FAILURE, "clock_gettime");

    secs = curr.tv_sec - start.tv_sec;
    nsecs = curr.tv_nsec - start.tv_nsec;
    if (nsecs < 0) {
        secs--;
        nsecs += 1000000000;
    }
    printf("%d.%03d: ", secs, (nsecs + 500000) / 1000000);
}

int
main(int argc, char *argv[])
{
    int fd;
    ssize_t s;
    uint64_t exp, tot_exp, max_exp;
    struct timespec now;
    struct itimerspec new_value;

    if (argc != 2 && argc != 4) {
        fprintf(stderr, "%s init-secs [interval-secs max-exp]\n",
                argv[0]);
        exit(EXIT_FAILURE);
    }

    if (clock_gettime(CLOCK_REALTIME, &now) == -1)
        err(EXIT_FAILURE, "clock_gettime");

    /* Create a CLOCK_REALTIME absolute timer with initial
       expiration and interval as specified in command line. */

    new_value.it_value.tv_sec = now.tv_sec + atoi(argv[1]);
    new_value.it_value.tv_nsec = now.tv_nsec;
    if (argc == 2) {
        new_value.it_interval.tv_sec = 0;
        max_exp = 1;
    } else {
        new_value.it_interval.tv_sec = atoi(argv[2]);
        max_exp = atoi(argv[3]);
    }
    new_value.it_interval.tv_nsec = 0;

    fd = timerfd_create(CLOCK_REALTIME, 0);
    if (fd == -1)
        err(EXIT_FAILURE, "timerfd_create");

    if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1)
        err(EXIT_FAILURE, "timerfd_settime");

    print_elapsed_time();
    printf("timer started\n");

    for (tot_exp = 0; tot_exp < max_exp;) {
        s = read(fd, &exp, sizeof(uint64_t));
        if (s != sizeof(uint64_t))
            err(EXIT_FAILURE, "read");

        tot_exp += exp;
        print_elapsed_time();
        printf("read: %" PRIu64 "; total=%" PRIu64 "\n", exp, tot_exp);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
eventfd(2), poll(2), read(2), select(2), setitimer(2), signalfd(2), timer_create(2), timer_gettime(2), timer_settime(2), timespec(3), epoll(7), time(7)
476 - Linux cli command getcwd
NAME
getcwd - get current working directory
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
char *getcwd(char buf[.size], size_t size);
char *get_current_dir_name(void);
[[deprecated]] char *getwd(char buf[PATH_MAX]);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
get_current_dir_name():
_GNU_SOURCE
getwd():
Since glibc 2.12:
(_XOPEN_SOURCE >= 500) && ! (_POSIX_C_SOURCE >= 200809L)
|| /* glibc >= 2.19: */ _DEFAULT_SOURCE
|| /* glibc <= 2.19: */ _BSD_SOURCE
Before glibc 2.12:
_BSD_SOURCE || _XOPEN_SOURCE >= 500
DESCRIPTION
These functions return a null-terminated string containing an absolute pathname that is the current working directory of the calling process. The pathname is returned as the function result and via the argument buf, if present.
The getcwd() function copies an absolute pathname of the current working directory to the array pointed to by buf, which is of length size.
If the length of the absolute pathname of the current working directory, including the terminating null byte, exceeds size bytes, NULL is returned, and errno is set to ERANGE; an application should check for this error, and allocate a larger buffer if necessary.
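The "check for ERANGE and allocate a larger buffer" advice can be sketched as a retry loop. The helper name getcwd_grow() and the starting size of 64 bytes are arbitrary choices for this illustration.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: returns a malloc(3)ed string holding the current
   working directory, or NULL on error with errno set.  The caller
   should free(3) the result. */
char *
getcwd_grow(void)
{
    size_t size = 64;       /* deliberately small starting size */
    char *buf = NULL;

    for (;;) {
        char *tmp = realloc(buf, size);

        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;

        if (getcwd(buf, size) != NULL)
            return buf;

        if (errno != ERANGE) {      /* a real failure, not "too small" */
            int saved = errno;

            free(buf);
            errno = saved;
            return NULL;
        }

        size *= 2;                  /* buffer too small: retry larger */
    }
}
```

This avoids relying on the glibc extension of passing a NULL buf, at the cost of a few extra calls for long pathnames.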
As an extension to the POSIX.1-2001 standard, glibc’s getcwd() allocates the buffer dynamically using malloc(3) if buf is NULL. In this case, the allocated buffer has the length size unless size is zero, when buf is allocated as big as necessary. The caller should free(3) the returned buffer.
get_current_dir_name() will malloc(3) an array big enough to hold the absolute pathname of the current working directory. If the environment variable PWD is set, and its value is correct, then that value will be returned. The caller should free(3) the returned buffer.
getwd() does not malloc(3) any memory. The buf argument should be a pointer to an array at least PATH_MAX bytes long. If the length of the absolute pathname of the current working directory, including the terminating null byte, exceeds PATH_MAX bytes, NULL is returned, and errno is set to ENAMETOOLONG. (Note that on some systems, PATH_MAX may not be a compile-time constant; furthermore, its value may depend on the filesystem, see pathconf(3).) For portability and security reasons, use of getwd() is deprecated.
RETURN VALUE
On success, these functions return a pointer to a string containing the pathname of the current working directory. In the case of getcwd() and getwd() this is the same value as buf.
On failure, these functions return NULL, and errno is set to indicate the error. The contents of the array pointed to by buf are undefined on error.
ERRORS
EACCES
Permission to read or search a component of the filename was denied.
EFAULT
buf points to a bad address.
EINVAL
The size argument is zero and buf is not a null pointer.
EINVAL
getwd(): buf is NULL.
ENAMETOOLONG
getwd(): The size of the null-terminated absolute pathname string exceeds PATH_MAX bytes.
ENOENT
The current working directory has been unlinked.
ENOMEM
Out of memory.
ERANGE
The size argument is less than the length of the absolute pathname of the working directory, including the terminating null byte. You need to allocate a bigger array and try again.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
getcwd(), getwd() | Thread safety | MT-Safe |
get_current_dir_name() | Thread safety | MT-Safe env |
VERSIONS
POSIX.1-2001 leaves the behavior of getcwd() unspecified if buf is NULL.
POSIX.1-2001 does not define any errors for getwd().
C library/kernel differences
On Linux, the kernel provides a getcwd() system call, which the functions described in this page will use if possible. The system call takes the same arguments as the library function of the same name, but is limited to returning at most PATH_MAX bytes. (Before Linux 3.12, the limit on the size of the returned pathname was the system page size. On many architectures, PATH_MAX and the system page size are both 4096 bytes, but a few architectures have a larger page size.) If the length of the pathname of the current working directory exceeds this limit, then the system call fails with the error ENAMETOOLONG. In this case, the library functions fall back to a (slower) alternative implementation that returns the full pathname.
Following a change in Linux 2.6.36, the pathname returned by the getcwd() system call will be prefixed with the string “(unreachable)” if the current directory is not below the root directory of the current process (e.g., because the process set a new filesystem root using chroot(2) without changing its current directory into the new root). Such behavior can also be caused by an unprivileged user by changing the current directory into another mount namespace. When dealing with pathnames from untrusted sources, callers of the functions described in this page should consider checking whether the returned pathname starts with ‘/’ or ‘(’ to avoid misinterpreting an unreachable path as a relative pathname.
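The suggested check amounts to a one-line predicate. In this sketch the name cwd_is_absolute() is hypothetical, and the argument is assumed to be a string obtained from one of the functions on this page.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if `cwd` looks like a genuine absolute
   pathname, and 0 if it is NULL or begins with anything other than '/'
   (for example the "(unreachable)" prefix). */
int
cwd_is_absolute(const char *cwd)
{
    return cwd != NULL && cwd[0] == '/';
}
```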
STANDARDS
getcwd()
POSIX.1-2008.
get_current_dir_name()
GNU.
getwd()
None.
HISTORY
getcwd()
POSIX.1-2001.
getwd()
POSIX.1-2001, but marked LEGACY. Removed in POSIX.1-2008. Use getcwd() instead.
Under Linux, these functions make use of the getcwd() system call (available since Linux 2.1.92). On older systems they would query /proc/self/cwd. If both system call and proc filesystem are missing, a generic implementation is called. Only in that case can these calls fail under Linux with EACCES.
NOTES
These functions are often used to save the location of the current working directory for the purpose of returning to it later. Opening the current directory (".") and calling fchdir(2) to return is usually a faster and more reliable alternative when sufficiently many file descriptors are available, especially on platforms other than Linux.
BUGS
Since the Linux 2.6.36 change that added “(unreachable)” in the circumstances described above, the glibc implementation of getcwd() has failed to conform to POSIX and returned a relative pathname when the API contract requires an absolute pathname. With glibc 2.27 onwards this is corrected; calling getcwd() from such a pathname will now result in failure with ENOENT.
SEE ALSO
pwd(1), chdir(2), fchdir(2), open(2), unlink(2), free(3), malloc(3)
477 - Linux cli command inotify_init
NAME
inotify_init - initialize an inotify instance
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/inotify.h>
int inotify_init(void);
int inotify_init1(int flags);
DESCRIPTION
For an overview of the inotify API, see inotify(7).
inotify_init() initializes a new inotify instance and returns a file descriptor associated with a new inotify event queue.
If flags is 0, then inotify_init1() is the same as inotify_init(). The following values can be bitwise ORed in flags to obtain different behavior:
IN_NONBLOCK
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
IN_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
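To make the "saves extra calls to fcntl(2)" remark concrete, the sketch below places the single-call form next to the longer sequence it replaces. Both function names are hypothetical.

```c
#include <fcntl.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical helper: one system call sets both flags at creation. */
int
inotify_open_one_call(void)
{
    return inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
}

/* Hypothetical helper: the same end state reached with inotify_init()
   followed by fcntl(2).  Between the calls the descriptor briefly
   exists without FD_CLOEXEC set. */
int
inotify_open_with_fcntl(void)
{
    int fd, fl;

    fd = inotify_init();
    if (fd == -1)
        return -1;

    fl = fcntl(fd, F_GETFL);
    if (fl == -1 ||
        fcntl(fd, F_SETFL, fl | O_NONBLOCK) == -1 ||
        fcntl(fd, F_SETFD, FD_CLOEXEC) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```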
RETURN VALUE
On success, these system calls return a new file descriptor. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EINVAL
(inotify_init1()) An invalid value was specified in flags.
EMFILE
The user limit on the total number of inotify instances has been reached.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENOMEM
Insufficient kernel memory is available.
STANDARDS
Linux.
HISTORY
inotify_init()
Linux 2.6.13, glibc 2.4.
inotify_init1()
Linux 2.6.27, glibc 2.9.
SEE ALSO
inotify_add_watch(2), inotify_rm_watch(2), inotify(7)
478 - Linux cli command readlinkat
NAME
readlinkat - read value of a symbolic link
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
ssize_t readlink(const char *restrict pathname, char *restrict buf,
size_t bufsiz);
#include <fcntl.h> /* Definition of AT_* constants */
#include <unistd.h>
ssize_t readlinkat(int dirfd, const char *restrict pathname,
char *restrict buf, size_t bufsiz);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
readlink():
_XOPEN_SOURCE >= 500 || _POSIX_C_SOURCE >= 200112L
|| /* glibc <= 2.19: */ _BSD_SOURCE
readlinkat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
readlink() places the contents of the symbolic link pathname in the buffer buf, which has size bufsiz. readlink() does not append a terminating null byte to buf. It will (silently) truncate the contents (to a length of bufsiz characters), in case the buffer is too small to hold all of the contents.
readlinkat()
The readlinkat() system call operates in exactly the same way as readlink(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by readlink() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like readlink()).
If pathname is absolute, then dirfd is ignored.
Since Linux 2.6.39, pathname can be an empty string, in which case the call operates on the symbolic link referred to by dirfd (which should have been obtained using open(2) with the O_PATH and O_NOFOLLOW flags).
See openat(2) for an explanation of the need for readlinkat().
RETURN VALUE
On success, these calls return the number of bytes placed in buf. (If the returned value equals bufsiz, then truncation may have occurred.) On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for a component of the path prefix. (See also path_resolution(7).)
EBADF
(readlinkat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
buf extends outside the process’s allocated address space.
EINVAL
bufsiz is not positive.
EINVAL
The named file (i.e., the final filename component of pathname) is not a symbolic link.
EIO
An I/O error occurred while reading from the filesystem.
ELOOP
Too many symbolic links were encountered in translating the pathname.
ENAMETOOLONG
A pathname, or a component of a pathname, was too long.
ENOENT
The named file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(readlinkat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
STANDARDS
POSIX.1-2008.
HISTORY
readlink()
4.4BSD (first appeared in 4.2BSD), POSIX.1-2001, POSIX.1-2008.
readlinkat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
Up to and including glibc 2.4, the return type of readlink() was declared as int. Nowadays, the return type is declared as ssize_t, as (newly) required in POSIX.1-2001.
glibc
On older kernels where readlinkat() is unavailable, the glibc wrapper function falls back to the use of readlink(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
NOTES
Using a statically sized buffer might not provide enough room for the symbolic link contents. The required size for the buffer can be obtained from the stat.st_size value returned by a call to lstat(2) on the link. However, the number of bytes written by readlink() and readlinkat() should be checked to make sure that the size of the symbolic link did not increase between the calls. Dynamically allocating the buffer for readlink() and readlinkat() also addresses a common portability problem when using PATH_MAX for the buffer size, as this constant is not guaranteed to be defined per POSIX if the system does not have such limit.
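One way to avoid both a fixed-size buffer and the race between lstat(2) and readlink() is to retry with a larger buffer whenever the result fills the buffer completely. The sketch below assumes that approach; read_link_alloc() is a hypothetical helper name and the starting size of 64 bytes is arbitrary.

```c
#include <stdlib.h>     /* realloc(), free() */
#include <unistd.h>     /* readlink() */

/* Hypothetical helper: return a malloc()ed, null-terminated copy of the
   target of the symbolic link `path`, or NULL on error. The caller
   frees the result. */
char *
read_link_alloc(const char *path)
{
    size_t  size = 64;      /* Arbitrary starting size */
    char    *buf = NULL;

    for (;;) {
        char *tmp = realloc(buf, size);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;

        ssize_t n = readlink(path, buf, size);
        if (n == -1) {
            free(buf);
            return NULL;
        }
        if ((size_t) n < size) {    /* Fits, with room for the terminator */
            buf[n] = '\0';
            return buf;
        }
        size *= 2;                  /* Possibly truncated: retry larger */
    }
}
```

Because the loop only stops when the returned length is strictly smaller than the buffer, a link that grows between attempts is handled by the next iteration.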
EXAMPLES
The following program allocates the buffer needed by readlink() dynamically from the information provided by lstat(2), falling back to a buffer of size PATH_MAX in cases where lstat(2) reports a size of zero.
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
int
main(int argc, char *argv[])
{
    char         *buf;
    ssize_t      nbytes, bufsiz;
    struct stat  sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    /* Add one to the link size, so that we can determine whether
       the buffer returned by readlink() was truncated. */

    bufsiz = sb.st_size + 1;

    /* Some magic symlinks under (for example) /proc and /sys
       report 'st_size' as zero. In that case, take PATH_MAX as
       a "good enough" estimate. */

    if (sb.st_size == 0)
        bufsiz = PATH_MAX;

    buf = malloc(bufsiz);
    if (buf == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }

    nbytes = readlink(argv[1], buf, bufsiz);
    if (nbytes == -1) {
        perror("readlink");
        exit(EXIT_FAILURE);
    }

    /* Print only 'nbytes' of 'buf', as it doesn't contain a terminating
       null byte ('\0'). */
    printf("'%s' points to '%.*s'\n", argv[1], (int) nbytes, buf);

    /* If the return value was equal to the buffer size, then
       the link target was larger than expected (perhaps because the
       target was changed between the call to lstat() and the call to
       readlink()). Warn the user that the returned target may have
       been truncated. */

    if (nbytes == bufsiz)
        printf("(Returned buffer may have been truncated)\n");

    free(buf);
    exit(EXIT_SUCCESS);
}
SEE ALSO
readlink(1), lstat(2), stat(2), symlink(2), realpath(3), path_resolution(7), symlink(7)
479 - Linux cli command getdomainname
NAME
getdomainname - get/set NIS domain name
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int getdomainname(char *name, size_t len);
int setdomainname(const char *name, size_t len);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
getdomainname(), setdomainname():
Since glibc 2.21:
_DEFAULT_SOURCE
In glibc 2.19 and 2.20:
_DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
Up to and including glibc 2.19:
_BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
DESCRIPTION
These functions are used to access or to change the NIS domain name of the host system. More precisely, they operate on the NIS domain name associated with the calling process’s UTS namespace.
setdomainname() sets the domain name to the value given in the character array name. The len argument specifies the number of bytes in name. (Thus, name does not require a terminating null byte.)
getdomainname() returns the null-terminated domain name in the character array name, which has a length of len bytes. If the null-terminated domain name requires more than len bytes, getdomainname() returns the first len bytes (glibc) or gives an error (libc).
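A minimal sketch of reading the domain name follows. It assumes a 64-byte buffer, which matches the Linux length limit (including the terminating null byte) given under HISTORY; print_nis_domain() and DOMAIN_BUF_LEN are hypothetical names used only for illustration.

```c
#define _DEFAULT_SOURCE 1   /* Feature test macro required by glibc */
#include <stdio.h>
#include <unistd.h>

#define DOMAIN_BUF_LEN 64   /* Linux limit, including the null byte */

/* Hypothetical helper: print the NIS domain name of the caller's UTS
   namespace to `out`. Returns 0 on success, -1 on error. */
int
print_nis_domain(FILE *out)
{
    char name[DOMAIN_BUF_LEN];

    if (getdomainname(name, sizeof(name)) == -1) {
        perror("getdomainname");
        return -1;
    }

    /* Defensive: glibc may return only the first len bytes when the
       name does not fit, so force termination. */
    name[sizeof(name) - 1] = '\0';

    fprintf(out, "NIS domain: %s\n", name);
    return 0;
}
```

Setting the name with setdomainname() is not shown because it requires CAP_SYS_ADMIN and changes system state.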
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
setdomainname() can fail with the following errors:
EFAULT
name pointed outside of user address space.
EINVAL
len was negative or too large.
EPERM
The caller did not have the CAP_SYS_ADMIN capability in the user namespace associated with its UTS namespace (see namespaces(7)).
getdomainname() can fail with the following errors:
EINVAL
For getdomainname() under libc: name is NULL or name is longer than len bytes.
VERSIONS
On most Linux architectures (including x86), there is no getdomainname() system call; instead, glibc implements getdomainname() as a library function that returns a copy of the domainname field returned from a call to uname(2).
STANDARDS
None.
HISTORY
Since Linux 1.0, the limit on the length of a domain name, including the terminating null byte, is 64 bytes. In older kernels, it was 8 bytes.
SEE ALSO
gethostname(2), sethostname(2), uname(2), uts_namespaces(7)
480 - Linux cli command gtty
NAME
gtty - unimplemented system calls
SYNOPSIS
Unimplemented system calls.
DESCRIPTION
These system calls are not implemented in the Linux kernel.
RETURN VALUE
These system calls always return -1 and set errno to ENOSYS.
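The ENOSYS behavior can be checked from user space with syscall(2). The following is a rough sketch under stated assumptions: syscall_unimplemented() is a hypothetical helper, it passes no arguments, and it should therefore only be used with system call numbers that are harmless to invoke that way.

```c
#include <errno.h>
#include <sys/syscall.h>    /* SYS_* constants */
#include <unistd.h>         /* syscall() */

/* Hypothetical probe: invoke system call number `nr` with no arguments.
   Returns 1 if the call failed with ENOSYS, 0 otherwise. */
int
syscall_unimplemented(long nr)
{
    errno = 0;
    return syscall(nr) == -1 && errno == ENOSYS;
}
```

On an architecture whose headers define a number for one of these calls (for example SYS_tuxcall, where available), the probe would be expected to return 1, although a seccomp filter may substitute a different error. For an implemented call such as SYS_getpid it returns 0.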
NOTES
Note that ftime(3), profil(3), and ulimit(3) are implemented as library functions.
Some system calls, like alloc_hugepages(2), free_hugepages(2), ioperm(2), iopl(2), and vm86(2) exist only on certain architectures.
Some system calls, like ipc(2), create_module(2), init_module(2), and delete_module(2) exist only when the Linux kernel was built with support for them.
SEE ALSO
syscalls(2)
481 - Linux cli command ipc
NAME
ipc - System V IPC system calls
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/ipc.h> /* Definition of needed constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
int syscall(SYS_ipc, unsigned int call, int first,
unsigned long second, unsigned long third, void *ptr,
long fifth);
Note: glibc provides no wrapper for ipc(), necessitating the use of syscall(2).
DESCRIPTION
ipc() is a common kernel entry point for the System V IPC calls for messages, semaphores, and shared memory. call determines which IPC function to invoke; the other arguments are passed through to the appropriate call.
User-space programs should call the appropriate functions by their usual names. Only standard library implementors and kernel hackers need to know about ipc().
VERSIONS
On some architectures (for example, x86-64 and ARM) there is no ipc() system call; instead, msgctl(2), semctl(2), shmctl(2), and so on really are implemented as separate system calls.
STANDARDS
Linux.
SEE ALSO
msgctl(2), msgget(2), msgrcv(2), msgsnd(2), semctl(2), semget(2), semop(2), semtimedop(2), shmat(2), shmctl(2), shmdt(2), shmget(2), sysvipc(7)
482 - Linux cli command outb_p
NAME
outb_p - port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width, the w-suffix functions are word-width (16 bits), and the l-suffix functions are long-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
You use ioperm(2) or alternatively iopl(2) to tell the kernel to allow the user space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
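The permission step and the argument order can be sketched as below. This is an illustration only: write_port_byte() is a hypothetical helper, the x86 preprocessor guard is an assumption made so that the file still compiles on architectures without these functions, and the call fails unless the process is privileged enough for ioperm(2) to succeed.

```c
#include <errno.h>
#if defined(__i386__) || defined(__x86_64__)
#include <sys/io.h>     /* ioperm(), outb() */
#endif

/* Hypothetical helper: write one byte to an I/O port.
   Returns 0 on success, -1 on error with errno set. */
int
write_port_byte(unsigned short port, unsigned char value)
{
#if defined(__i386__) || defined(__x86_64__)
    /* Ask the kernel for access to exactly one port. Without this,
       the outb() below would raise a segmentation fault. */
    if (ioperm(port, 1, 1) == -1)
        return -1;

    outb(value, port);          /* Note the order: value first, port second */

    ioperm(port, 1, 0);         /* Drop the permission again */
    return 0;
#else
    (void) port;
    (void) value;
    errno = ENOSYS;             /* Assumption: no port I/O on this target */
    return -1;
#endif
}
```

Requesting access for a single port and releasing it afterwards keeps the window in which the process can touch hardware as small as possible.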
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)
483 - Linux cli command msgop
NAME
msgop - System V message queue operations
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/msg.h>
int msgsnd(int msqid, const void msgp[.msgsz], size_t msgsz,
int msgflg);
ssize_t msgrcv(int msqid, void msgp[.msgsz], size_t msgsz, long msgtyp,
int msgflg);
DESCRIPTION
The msgsnd() and msgrcv() system calls are used to send messages to, and receive messages from, a System V message queue. The calling process must have write permission on the message queue in order to send a message, and read permission to receive a message.
The msgp argument is a pointer to a caller-defined structure of the following general form:
struct msgbuf {
long mtype; /* message type, must be > 0 */
char mtext[1]; /* message data */
};
The mtext field is an array (or other structure) whose size is specified by msgsz, a nonnegative integer value. Messages of zero length (i.e., no mtext field) are permitted. The mtype field must have a strictly positive integer value. This value can be used by the receiving process for message selection (see the description of msgrcv() below).
msgsnd()
The msgsnd() system call appends a copy of the message pointed to by msgp to the message queue whose identifier is specified by msqid.
If sufficient space is available in the queue, msgsnd() succeeds immediately. The queue capacity is governed by the msg_qbytes field in the associated data structure for the message queue. During queue creation this field is initialized to MSGMNB bytes, but this limit can be modified using msgctl(2). A message queue is considered to be full if either of the following conditions is true:
Adding a new message to the queue would cause the total number of bytes in the queue to exceed the queue’s maximum size (the msg_qbytes field).
Adding another message to the queue would cause the total number of messages in the queue to exceed the queue’s maximum size (the msg_qbytes field). This check is necessary to prevent an unlimited number of zero-length messages being placed on the queue. Although such messages contain no data, they nevertheless consume (locked) kernel memory.
If insufficient space is available in the queue, then the default behavior of msgsnd() is to block until space becomes available. If IPC_NOWAIT is specified in msgflg, then the call instead fails with the error EAGAIN.
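The full-queue behavior with IPC_NOWAIT can be observed with a small experiment. The sketch below assumes a queue owned by the caller; fill_queue(), struct chunk_msg, and the 1024-byte payload size are hypothetical choices made for illustration. It first lowers msg_qbytes with msgctl(2) so that the queue fills after only a few messages.

```c
#include <errno.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define CHUNK 1024              /* Illustrative payload size */

struct chunk_msg {              /* Hypothetical message layout */
    long mtype;
    char mtext[CHUNK];
};

/* Set the queue's msg_qbytes to `limit` bytes, then send CHUNK-byte
   messages with IPC_NOWAIT until msgsnd() fails. Returns the number of
   messages queued, or -1 if the failure was anything other than EAGAIN.
   `limit` should not exceed MSGMNB unless the caller is privileged. */
int
fill_queue(int qid, unsigned long limit)
{
    struct msqid_ds   ds;
    struct chunk_msg  msg;
    int               sent = 0;

    if (msgctl(qid, IPC_STAT, &ds) == -1)
        return -1;
    ds.msg_qbytes = limit;
    if (msgctl(qid, IPC_SET, &ds) == -1)
        return -1;

    msg.mtype = 1;
    memset(msg.mtext, 'x', sizeof(msg.mtext));

    /* Each successful call adds CHUNK bytes to the queue. */
    while (msgsnd(qid, &msg, sizeof(msg.mtext), IPC_NOWAIT) == 0)
        sent++;

    return (errno == EAGAIN) ? sent : -1;
}
```

With a limit of 4096 bytes, four 1024-byte messages fit and the fifth msgsnd() fails with EAGAIN instead of blocking.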
A blocked msgsnd() call may also fail if:
the queue is removed, in which case the system call fails with errno set to EIDRM; or
a signal is caught, in which case the system call fails with errno set to EINTR; see signal(7). (msgsnd() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.)
Upon successful completion the message queue data structure is updated as follows:
msg_lspid is set to the process ID of the calling process.
msg_qnum is incremented by 1.
msg_stime is set to the current time.
msgrcv()
The msgrcv() system call removes a message from the queue specified by msqid and places it in the buffer pointed to by msgp.
The argument msgsz specifies the maximum size in bytes for the member mtext of the structure pointed to by the msgp argument. If the message text has length greater than msgsz, then the behavior depends on whether MSG_NOERROR is specified in msgflg. If MSG_NOERROR is specified, then the message text will be truncated (and the truncated part will be lost); if MSG_NOERROR is not specified, then the message isn’t removed from the queue and the system call fails returning -1 with errno set to E2BIG.
Unless MSG_COPY is specified in msgflg (see below), the msgtyp argument specifies the type of message requested, as follows:
If msgtyp is 0, then the first message in the queue is read.
If msgtyp is greater than 0, then the first message in the queue of type msgtyp is read, unless MSG_EXCEPT was specified in msgflg, in which case the first message in the queue of type not equal to msgtyp will be read.
If msgtyp is less than 0, then the first message in the queue with the lowest type less than or equal to the absolute value of msgtyp will be read.
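The three msgtyp cases above can be tried out with a pair of small helpers. This is a sketch, not part of the manual page: send_type(), recv_type(), and struct tiny_msg are hypothetical names, and both helpers use IPC_NOWAIT so that they never block.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct tiny_msg {               /* Hypothetical one-byte message */
    long mtype;
    char mtext[1];
};

/* Queue a one-byte message of the given type. Returns 0 or -1. */
int
send_type(int qid, long mtype)
{
    struct tiny_msg m = { .mtype = mtype, .mtext = { 0 } };

    return msgsnd(qid, &m, sizeof(m.mtext), IPC_NOWAIT);
}

/* Receive one message selected by `msgtyp` without blocking.
   Returns the type of the message received, or -1 with errno set
   (ENOMSG if no message matches). */
long
recv_type(int qid, long msgtyp)
{
    struct tiny_msg m;

    if (msgrcv(qid, &m, sizeof(m.mtext), msgtyp, IPC_NOWAIT) == -1)
        return -1;
    return m.mtype;
}
```

If messages of types 3, 1, and 2 are queued in that order, recv_type(qid, 0) yields 3 (the first message), a following recv_type(qid, -2) yields 1 (the lowest type not exceeding 2), and recv_type(qid, 5) fails with ENOMSG because no message of type 5 exists.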
The msgflg argument is a bit mask constructed by ORing together zero or more of the following flags:
IPC_NOWAIT
Return immediately if no message of the requested type is in the queue. The system call fails with errno set to ENOMSG.
MSG_COPY (since Linux 3.8)
Nondestructively fetch a copy of the message at the ordinal position in the queue specified by msgtyp (messages are considered to be numbered starting at 0).
This flag must be specified in conjunction with IPC_NOWAIT, with the result that, if there is no message available at the given position, the call fails immediately with the error ENOMSG. Because they alter the meaning of msgtyp in orthogonal ways, MSG_COPY and MSG_EXCEPT may not both be specified in msgflg.
The MSG_COPY flag was added for the implementation of the kernel checkpoint-restore facility and is available only if the kernel was built with the CONFIG_CHECKPOINT_RESTORE option.
MSG_EXCEPT
Used with msgtyp greater than 0 to read the first message in the queue with message type that differs from msgtyp.
MSG_NOERROR
To truncate the message text if longer than msgsz bytes.
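The effect of MSG_NOERROR on an oversized message can be sketched as follows. The names recv_small(), struct small_msg, and the 4-byte capacity SMALL are hypothetical and chosen only to make truncation easy to trigger.

```c
#include <errno.h>      /* E2BIG, ENOMSG for callers checking errno */
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define SMALL 4                 /* Illustrative receive capacity */

struct small_msg {              /* Hypothetical receive buffer */
    long mtype;
    char mtext[SMALL];
};

/* Receive the first queued message into a SMALL-byte buffer without
   blocking. `flags` may include MSG_NOERROR. Returns the number of
   bytes copied to `out`, or -1 with errno set: E2BIG when the message
   is longer than SMALL bytes and MSG_NOERROR was not given. */
ssize_t
recv_small(int qid, int flags, char out[SMALL])
{
    struct small_msg  m;
    ssize_t           n;

    n = msgrcv(qid, &m, sizeof(m.mtext), 0, flags | IPC_NOWAIT);
    if (n == -1)
        return -1;

    memcpy(out, m.mtext, (size_t) n);
    return n;
}
```

When the call fails with E2BIG the message stays on the queue, so a second attempt with MSG_NOERROR still finds it and returns its first SMALL bytes; the remainder of the text is lost.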
If no message of the requested type is available and IPC_NOWAIT isn’t specified in msgflg, the calling process is blocked until one of the following conditions occurs:
A message of the desired type is placed in the queue.
The message queue is removed from the system. In this case, the system call fails with errno set to EIDRM.
The calling process catches a signal. In this case, the system call fails with errno set to EINTR. (msgrcv() is never automatically restarted after being interrupted by a signal handler, regardless of the setting of the SA_RESTART flag when establishing a signal handler.)
Upon successful completion the message queue data structure is updated as follows:
msg_lrpid is set to the process ID of the calling process.
msg_qnum is decremented by 1.
msg_rtime is set to the current time.
RETURN VALUE
On success, msgsnd() returns 0 and msgrcv() returns the number of bytes actually copied into the mtext array. On failure, both functions return -1, and set errno to indicate the error.
ERRORS
msgsnd() can fail with the following errors:
EACCES
The calling process does not have write permission on the message queue, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EAGAIN
The message can’t be sent due to the msg_qbytes limit for the queue and IPC_NOWAIT was specified in msgflg.
EFAULT
The address pointed to by msgp isn’t accessible.
EIDRM
The message queue was removed.
EINTR
Sleeping on a full message queue condition, the process caught a signal.
EINVAL
Invalid msqid value, or nonpositive mtype value, or invalid msgsz value (less than 0 or greater than the system value MSGMAX).
ENOMEM
The system does not have enough memory to make a copy of the message pointed to by msgp.
msgrcv() can fail with the following errors:
E2BIG
The message text length is greater than msgsz and MSG_NOERROR isn’t specified in msgflg.
EACCES
The calling process does not have read permission on the message queue, and does not have the CAP_IPC_OWNER capability in the user namespace that governs its IPC namespace.
EFAULT
The address pointed to by msgp isn’t accessible.
EIDRM
While the process was sleeping to receive a message, the message queue was removed.
EINTR
While the process was sleeping to receive a message, the process caught a signal; see signal(7).
EINVAL
msqid was invalid, or msgsz was less than 0.
EINVAL (since Linux 3.14)
msgflg specified MSG_COPY, but not IPC_NOWAIT.
EINVAL (since Linux 3.14)
msgflg specified both MSG_COPY and MSG_EXCEPT.
ENOMSG
IPC_NOWAIT was specified in msgflg and no message of the requested type existed on the message queue.
ENOMSG
IPC_NOWAIT and MSG_COPY were specified in msgflg and the queue contains fewer than msgtyp messages.
ENOSYS (since Linux 3.8)
Both MSG_COPY and IPC_NOWAIT were specified in msgflg, and this kernel was configured without CONFIG_CHECKPOINT_RESTORE.
STANDARDS
POSIX.1-2008.
The MSG_EXCEPT and MSG_COPY flags are Linux-specific; their definitions can be obtained by defining the _GNU_SOURCE feature test macro.
HISTORY
POSIX.1-2001, SVr4.
The msgp argument is declared as struct msgbuf * in glibc 2.0 and 2.1. It is declared as void * in glibc 2.2 and later, as required by SUSv2 and SUSv3.
NOTES
The following limits on message queue resources affect the msgsnd() call:
MSGMAX
Maximum size of a message text, in bytes (default value: 8192 bytes). On Linux, this limit can be read and modified via /proc/sys/kernel/msgmax.
MSGMNB
Maximum number of bytes that can be held in a message queue (default value: 16384 bytes). On Linux, this limit can be read and modified via /proc/sys/kernel/msgmnb. A privileged process (Linux: a process with the CAP_SYS_RESOURCE capability) can increase the size of a message queue beyond MSGMNB using the msgctl(2) IPC_SET operation.
The implementation has no intrinsic system-wide limits on the number of message headers (MSGTQL) and the number of bytes in the message pool (MSGPOOL).
BUGS
In Linux 3.13 and earlier, if msgrcv() was called with the MSG_COPY flag, but without IPC_NOWAIT, and the message queue contained fewer than msgtyp messages, then the call would block until the next message was written to the queue. At that point, the call would return a copy of the message, regardless of whether that message was at the ordinal position msgtyp. This bug is fixed in Linux 3.14.
Specifying both MSG_COPY and MSG_EXCEPT in msgflg is a logical error (since these flags impose different interpretations on msgtyp). In Linux 3.13 and earlier, this error was not diagnosed by msgrcv(). This bug is fixed in Linux 3.14.
EXAMPLES
The program below demonstrates the use of msgsnd() and msgrcv().
The example program is first run with the -s option to send a message and then run again with the -r option to receive a message.
The following shell session shows a sample run of the program:
$ ./a.out -s
sent: a message at Wed Mar 4 16:25:45 2015
$ ./a.out -r
message received: a message at Wed Mar 4 16:25:45 2015
Program source
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <time.h>
#include <unistd.h>
struct msgbuf {
    long mtype;
    char mtext[80];
};

static void
usage(char *prog_name, char *msg)
{
    if (msg != NULL)
        fputs(msg, stderr);

    fprintf(stderr, "Usage: %s [options]\n", prog_name);
    fprintf(stderr, "Options are:\n");
    fprintf(stderr, "-s        send message using msgsnd()\n");
    fprintf(stderr, "-r        read message using msgrcv()\n");
    fprintf(stderr, "-t        message type (default is 1)\n");
    fprintf(stderr, "-k        message queue key (default is 1234)\n");
    exit(EXIT_FAILURE);
}

static void
send_msg(int qid, int msgtype)
{
    time_t         t;
    struct msgbuf  msg;

    msg.mtype = msgtype;

    time(&t);
    snprintf(msg.mtext, sizeof(msg.mtext), "a message at %s",
             ctime(&t));

    if (msgsnd(qid, &msg, sizeof(msg.mtext),
               IPC_NOWAIT) == -1)
    {
        perror("msgsnd error");
        exit(EXIT_FAILURE);
    }
    printf("sent: %s\n", msg.mtext);
}

static void
get_msg(int qid, int msgtype)
{
    struct msgbuf msg;

    if (msgrcv(qid, &msg, sizeof(msg.mtext), msgtype,
               MSG_NOERROR | IPC_NOWAIT) == -1) {
        if (errno != ENOMSG) {
            perror("msgrcv");
            exit(EXIT_FAILURE);
        }
        printf("No message available for msgrcv()\n");
    } else {
        printf("message received: %s\n", msg.mtext);
    }
}

int
main(int argc, char *argv[])
{
    int  qid, opt;
    int  mode = 0;               /* 1 = send, 2 = receive */
    int  msgtype = 1;
    int  msgkey = 1234;

    while ((opt = getopt(argc, argv, "srt:k:")) != -1) {
        switch (opt) {
        case 's':
            mode = 1;
            break;
        case 'r':
            mode = 2;
            break;
        case 't':
            msgtype = atoi(optarg);
            if (msgtype <= 0)
                usage(argv[0], "-t option must be greater than 0\n");
            break;
        case 'k':
            msgkey = atoi(optarg);
            break;
        default:
            usage(argv[0], "Unrecognized option\n");
        }
    }

    if (mode == 0)
        usage(argv[0], "must use either -s or -r option\n");

    qid = msgget(msgkey, IPC_CREAT | 0666);

    if (qid == -1) {
        perror("msgget");
        exit(EXIT_FAILURE);
    }

    if (mode == 2)
        get_msg(qid, msgtype);
    else
        send_msg(qid, msgtype);

    exit(EXIT_SUCCESS);
}
SEE ALSO
msgctl(2), msgget(2), capabilities(7), mq_overview(7), sysvipc(7)
484 - Linux cli command ioctl_fideduperange
NAME
ioctl_fideduperange - share some of the data of one file with another file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <linux/fs.h> /* Definition of FIDEDUPERANGE and
FILE_DEDUPE_* constants*/
#include <sys/ioctl.h>
int ioctl(int src_fd, FIDEDUPERANGE, struct file_dedupe_range *arg);
DESCRIPTION
If a filesystem supports sharing physical storage between multiple files, this ioctl(2) operation can be used to make some of the data in the src_fd file appear in the dest_fd file by sharing the underlying storage if the file data is identical (“deduplication”). Both files must reside within the same filesystem. This reduces storage consumption by allowing the filesystem to store one shared copy of the data. If a file write should occur to a shared region, the filesystem must ensure that the changes remain private to the file being written. This behavior is commonly referred to as “copy on write”.
This ioctl performs the “compare and share if identical” operation on up to src_length bytes from file descriptor src_fd at offset src_offset. This information is conveyed in a structure of the following form:
struct file_dedupe_range {
__u64 src_offset;
__u64 src_length;
__u16 dest_count;
__u16 reserved1;
__u32 reserved2;
struct file_dedupe_range_info info[0];
};
Deduplication is atomic with regards to concurrent writes, so no locks need to be taken to obtain a consistent deduplicated copy.
The fields reserved1 and reserved2 must be zero.
Destinations for the deduplication operation are conveyed in the array at the end of the structure. The number of destinations is given in dest_count, and the destination information is conveyed in the following form:
struct file_dedupe_range_info {
__s64 dest_fd;
__u64 dest_offset;
__u64 bytes_deduped;
__s32 status;
__u32 reserved;
};
Each deduplication operation targets src_length bytes in file descriptor dest_fd at offset dest_offset. The field reserved must be zero. During the call, src_fd must be open for reading and dest_fd must be open for writing. The combined size of the struct file_dedupe_range and the struct file_dedupe_range_info array must not exceed the system page size. The maximum size of src_length is filesystem dependent and is typically 16 MiB. This limit will be enforced silently by the filesystem. By convention, the storage used by src_fd is mapped into dest_fd and the previous contents in dest_fd are freed.
Upon successful completion of this ioctl, the number of bytes successfully deduplicated is returned in bytes_deduped and a status code for the deduplication operation is returned in status. If even a single byte in the range does not match, the deduplication operation request will be ignored and status set to FILE_DEDUPE_RANGE_DIFFERS. The status code is set to FILE_DEDUPE_RANGE_SAME for success, a negative error code in case of error, or FILE_DEDUPE_RANGE_DIFFERS if the data did not match.
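Putting the two structures together for a single destination might look like the sketch below. dedupe_one() is a hypothetical wrapper, not an existing API; it assumes kernel headers that provide FIDEDUPERANGE (Linux 4.5 or later) and reports the per-destination status and byte count through output parameters.

```c
#include <errno.h>
#include <linux/fs.h>       /* FIDEDUPERANGE, struct file_dedupe_range */
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Hypothetical wrapper for a request with exactly one destination.
   Returns the ioctl() result (0, or -1 with errno set). On 0, *status
   and *deduped carry the outcome for the destination. */
int
dedupe_one(int src_fd, uint64_t src_off, uint64_t len,
           int dest_fd, uint64_t dest_off,
           int32_t *status, uint64_t *deduped)
{
    struct file_dedupe_range  *req;
    int                       ret, saved_errno;

    /* One header plus one trailing info element. calloc() zeroes the
       reserved fields, which the kernel requires to be zero. */
    req = calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
    if (req == NULL)
        return -1;

    req->src_offset = src_off;
    req->src_length = len;
    req->dest_count = 1;
    req->info[0].dest_fd = dest_fd;
    req->info[0].dest_offset = dest_off;

    ret = ioctl(src_fd, FIDEDUPERANGE, req);
    saved_errno = errno;
    if (ret == 0) {
        *status = req->info[0].status;
        *deduped = req->info[0].bytes_deduped;
    }

    free(req);
    errno = saved_errno;    /* Preserve the ioctl() error for the caller */
    return ret;
}
```

A caller would compare *status with FILE_DEDUPE_RANGE_SAME and FILE_DEDUPE_RANGE_DIFFERS, and treat a negative value as an error code for that destination.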
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Possible errors include (but are not limited to) the following:
EBADF
src_fd is not open for reading; dest_fd is not open for writing or is open for append-only writes; or the filesystem which src_fd resides on does not support deduplication.
EINVAL
The filesystem does not support deduplicating the ranges of the given files. This error can also appear if either file descriptor represents a device, FIFO, or socket. Disk filesystems generally require the offset and length arguments to be aligned to the fundamental block size. Neither Btrfs nor XFS support overlapping deduplication ranges in the same file.
EISDIR
One of the files is a directory and the filesystem does not support shared regions in directories.
ENOMEM
The kernel was unable to allocate sufficient memory to perform the operation or dest_count is so large that the input argument description spans more than a single page of memory.
EOPNOTSUPP
This can appear if the filesystem does not support deduplicating either file descriptor, or if either file descriptor refers to special inodes.
EPERM
dest_fd is immutable.
ETXTBSY
One of the files is a swap file. Swap files cannot share storage.
EXDEV
dest_fd and src_fd are not on the same mounted filesystem.
VERSIONS
Some filesystems may limit the amount of data that can be deduplicated in a single call.
STANDARDS
Linux.
HISTORY
Linux 4.5.
It was previously known as BTRFS_IOC_FILE_EXTENT_SAME and was private to Btrfs.
NOTES
Because a copy-on-write operation requires the allocation of new storage, the fallocate(2) operation may unshare shared blocks to guarantee that subsequent writes will not fail because of lack of disk space.
SEE ALSO
ioctl(2)
485 - Linux cli command utimensat
NAME
utimensat - change file timestamps with nanosecond precision
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int utimensat(int dirfd, const char *pathname,
const struct timespec times[_Nullable 2], int flags);
int futimens(int fd, const struct timespec times[_Nullable 2]);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
utimensat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
futimens():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_GNU_SOURCE
DESCRIPTION
utimensat() and futimens() update the timestamps of a file with nanosecond precision. This contrasts with the historical utime(2) and utimes(2), which permit only second and microsecond precision, respectively, when setting file timestamps.
With utimensat() the file is specified via the pathname given in pathname. With futimens() the file whose timestamps are to be updated is specified via an open file descriptor, fd.
For both calls, the new file timestamps are specified in the array times: times[0] specifies the new “last access time” (atime); times[1] specifies the new “last modification time” (mtime). Each of the elements of times specifies a time as the number of seconds and nanoseconds since the Epoch, 1970-01-01 00:00:00 +0000 (UTC). This information is conveyed in a timespec(3) structure.
Updated file timestamps are set to the greatest value supported by the filesystem that is not greater than the specified time.
If the tv_nsec field of one of the timespec structures has the special value UTIME_NOW, then the corresponding file timestamp is set to the current time. If the tv_nsec field of one of the timespec structures has the special value UTIME_OMIT, then the corresponding file timestamp is left unchanged. In both of these cases, the value of the corresponding tv_sec field is ignored.
If times is NULL, then both timestamps are set to the current time.
The status change time (ctime) will be set to the current time, even if the other timestamps don’t actually change.
Permissions requirements
To set both file timestamps to the current time (i.e., times is NULL, or both tv_nsec fields specify UTIME_NOW), either:
1. the caller must have write access to the file;
2. the caller’s effective user ID must match the owner of the file; or
3. the caller must have appropriate privileges.
To make any change other than setting both timestamps to the current time (i.e., times is not NULL, and neither tv_nsec field is UTIME_NOW and neither tv_nsec field is UTIME_OMIT), either condition 2 or 3 above must apply.
If both tv_nsec fields are specified as UTIME_OMIT, then no file ownership or permission checks are performed, and the file timestamps are not modified, but other error conditions may still be detected.
utimensat() specifics
If pathname is relative, then by default it is interpreted relative to the directory referred to by the open file descriptor, dirfd (rather than relative to the current working directory of the calling process, as is done by utimes(2) for a relative pathname). See openat(2) for an explanation of why this can be useful.
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like utimes(2)).
If pathname is absolute, then dirfd is ignored.
The flags argument is a bit mask created by ORing together zero or more of the following values defined in <fcntl.h>:
AT_EMPTY_PATH (since Linux 5.8)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory. If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_SYMLINK_NOFOLLOW
If pathname specifies a symbolic link, then update the timestamps of the link, rather than the file to which it refers.
RETURN VALUE
On success, utimensat() and futimens() return 0. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EACCES
times is NULL, or both tv_nsec values are UTIME_NOW, and the effective user ID of the caller does not match the owner of the file, the caller does not have write access to the file, and the caller is not privileged (Linux: does not have either the CAP_FOWNER or the CAP_DAC_OVERRIDE capability).
EBADF
(futimens()) fd is not a valid file descriptor.
EBADF
(utimensat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
times pointed to an invalid address; or, dirfd was AT_FDCWD, and pathname is NULL or an invalid address.
EINVAL
Invalid value in flags.
EINVAL
Invalid value in one of the tv_nsec fields (value outside range [0, 999,999,999], and not UTIME_NOW or UTIME_OMIT); or an invalid value in one of the tv_sec fields.
EINVAL
pathname is NULL, dirfd is not AT_FDCWD, and flags contains AT_SYMLINK_NOFOLLOW.
ELOOP
(utimensat()) Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
(utimensat()) pathname is too long.
ENOENT
(utimensat()) A component of pathname does not refer to an existing directory or file, or pathname is an empty string.
ENOTDIR
(utimensat()) pathname is a relative pathname, but dirfd is neither AT_FDCWD nor a file descriptor referring to a directory; or, one of the prefix components of pathname is not a directory.
EPERM
The caller attempted to change one or both timestamps to a value other than the current time, or to change one of the timestamps to the current time while leaving the other timestamp unchanged, (i.e., times is not NULL, neither tv_nsec field is UTIME_NOW, and neither tv_nsec field is UTIME_OMIT) and either:
the caller’s effective user ID does not match the owner of the file, and the caller is not privileged (Linux: does not have the CAP_FOWNER capability); or,
the file is marked append-only or immutable (see chattr(1)).
EROFS
The file is on a read-only filesystem.
ESRCH
(utimensat()) Search permission is denied for one of the prefix components of pathname.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
utimensat(), futimens() | Thread safety | MT-Safe |
VERSIONS
C library/kernel ABI differences
On Linux, futimens() is a library function implemented on top of the utimensat() system call. To support this, the Linux utimensat() system call implements a nonstandard feature: if pathname is NULL, then the call modifies the timestamps of the file referred to by the file descriptor dirfd (which may refer to any type of file). Using this feature, the call futimens(fd, times) is implemented as:
utimensat(fd, NULL, times, 0);
Note, however, that the glibc wrapper for utimensat() disallows passing NULL as the value for pathname: the wrapper function returns the error EINVAL in this case.
STANDARDS
POSIX.1-2008.
HISTORY
utimensat()
Linux 2.6.22, glibc 2.6. POSIX.1-2008.
futimens()
glibc 2.6. POSIX.1-2008.
NOTES
utimensat() obsoletes futimesat(2).
On Linux, timestamps cannot be changed for a file marked immutable, and the only change permitted for files marked append-only is to set the timestamps to the current time. (This is consistent with the historical behavior of utime(2) and utimes(2) on Linux.)
If both tv_nsec fields are specified as UTIME_OMIT, then the Linux implementation of utimensat() succeeds even if the file referred to by dirfd and pathname does not exist.
BUGS
Several bugs afflict utimensat() and futimens() before Linux 2.6.26. These bugs are either nonconformances with the POSIX.1 draft specification or inconsistencies with historical Linux behavior.
POSIX.1 specifies that if one of the tv_nsec fields has the value UTIME_NOW or UTIME_OMIT, then the value of the corresponding tv_sec field should be ignored. Instead, the value of the tv_sec field is required to be 0 (or the error EINVAL results).
Various bugs mean that for the purposes of permission checking, the case where both tv_nsec fields are set to UTIME_NOW isn’t always treated the same as specifying times as NULL, and the case where one tv_nsec value is UTIME_NOW and the other is UTIME_OMIT isn’t treated the same as specifying times as a pointer to an array of structures containing arbitrary time values. As a result, in some cases: a) file timestamps can be updated by a process that shouldn’t have permission to perform updates; b) file timestamps can’t be updated by a process that should have permission to perform updates; and c) the wrong errno value is returned in case of an error.
POSIX.1 says that a process that has write access to the file can make a call with times as NULL, or with times pointing to an array of structures in which both tv_nsec fields are UTIME_NOW, in order to update both timestamps to the current time. However, futimens() instead checks whether the access mode of the file descriptor allows writing.
SEE ALSO
chattr(1), touch(1), futimesat(2), openat(2), stat(2), utimes(2), futimes(3), timespec(3), inode(7), path_resolution(7), symlink(7)
486 - Linux cli command query_module
NAME
query_module - query the kernel for various bits pertaining to modules
SYNOPSIS
#include <linux/module.h>
[[deprecated]] int query_module(const char *name, int which,
void buf[.bufsize], size_t bufsize,
size_t *ret);
DESCRIPTION
Note: This system call is present only before Linux 2.6.
query_module() requests information from the kernel about loadable modules. The returned information is placed in the buffer pointed to by buf. The caller must specify the size of buf in bufsize. The precise nature and format of the returned information depend on the operation specified by which. Some operations require name to identify a currently loaded module, some allow name to be NULL, indicating the kernel proper.
The following values can be specified for which:
0
Returns success, if the kernel supports query_module(). Used to probe for availability of the system call.
QM_MODULES
Returns the names of all loaded modules. The returned buffer consists of a sequence of null-terminated strings; ret is set to the number of modules.
QM_DEPS
Returns the names of all modules used by the indicated module. The returned buffer consists of a sequence of null-terminated strings; ret is set to the number of modules.
QM_REFS
Returns the names of all modules using the indicated module. This is the inverse of QM_DEPS. The returned buffer consists of a sequence of null-terminated strings; ret is set to the number of modules.
QM_SYMBOLS
Returns the symbols and values exported by the kernel or the indicated module. The returned buffer is an array of structures of the following form:
struct module_symbol {
unsigned long value;
unsigned long name;
};
followed by null-terminated strings. The value of name is the character offset of the string relative to the start of buf; ret is set to the number of symbols.
QM_INFO
Returns miscellaneous information about the indicated module. The output buffer format is:
struct module_info {
unsigned long address;
unsigned long size;
unsigned long flags;
};
where address is the kernel address at which the module resides, size is the size of the module in bytes, and flags is a mask of MOD_RUNNING, MOD_AUTOCLEAN, and so on, that indicates the current status of the module (see the Linux kernel source file include/linux/module.h). ret is set to the size of the module_info structure.
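The QM_SYMBOLS buffer layout described above (an array of structures followed by the strings, with name holding a byte offset from the start of buf) can be walked as in the sketch below. The helper qm_symbol_name() is hypothetical, and the structure is declared locally on the assumption that current kernel headers no longer provide it.

```c
#include <stddef.h>

/* Local copy of the layout given above (assumption: not available
 * from current <linux/module.h>). */
struct module_symbol {
    unsigned long value;
    unsigned long name;   /* byte offset of the name from start of buf */
};

/* Hypothetical helper: return the name of symbol number `i` in a
 * QM_SYMBOLS result buffer holding `nsyms` symbols, or NULL if the
 * index or the stored offset falls outside the buffer. */
const char *qm_symbol_name(const void *buf, size_t bufsize,
                           size_t nsyms, size_t i)
{
    const struct module_symbol *syms = buf;

    if (i >= nsyms || nsyms > bufsize / sizeof(*syms))
        return NULL;             /* index or symbol table out of range */
    if (syms[i].name >= bufsize)
        return NULL;             /* offset points past the buffer */

    return (const char *) buf + syms[i].name;
}
```

Here nsyms corresponds to the value that query_module() stores in ret for QM_SYMBOLS.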
RETURN VALUE
On success, zero is returned. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EFAULT
At least one of name, buf, or ret was outside the program’s accessible address space.
EINVAL
Invalid which; or name is NULL (indicating “the kernel”), but this is not permitted with the specified value of which.
ENOENT
No module by that name exists.
ENOSPC
The buffer size provided was too small. ret is set to the minimum size needed.
ENOSYS
query_module() is not supported in this version of the kernel (e.g., Linux 2.6 or later).
STANDARDS
Linux.
VERSIONS
Removed in Linux 2.6.
Some of the information that was formerly available via query_module() can be obtained from /proc/modules, /proc/kallsyms, and the files under the directory /sys/module.
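As a rough replacement for QM_MODULES on current kernels, a program can parse /proc/modules. The sketch below handles a single line and assumes that each line starts with the module name followed by the module size in bytes; the helper name parse_modules_line() is hypothetical, and the remaining fields of the line are ignored.

```c
#include <stdio.h>

/* Hypothetical helper: extract the module name and size from one line
 * of /proc/modules. Assumes the line begins with "<name> <size>".
 * Returns 0 on success, or -1 if the line does not match. */
int parse_modules_line(const char *line, char name[64],
                       unsigned long *size)
{
    /* %63s leaves room for the terminating null byte in name[64]. */
    if (sscanf(line, "%63s %lu", name, size) != 2)
        return -1;
    return 0;
}
```

A caller would typically open /proc/modules with fopen(3), read it line by line with fgets(3), and pass each line to the helper.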
The query_module() system call is not supported by glibc. No declaration is provided in glibc headers, but, through a quirk of history, glibc does export an ABI for this system call. Therefore, in order to employ this system call, it is sufficient to manually declare the interface in your code; alternatively, you can invoke the system call using syscall(2).
SEE ALSO
create_module(2), delete_module(2), get_kernel_syms(2), init_module(2), lsmod(8), modinfo(8)
487 - Linux cli command ugetrlimit
NAME
ugetrlimit - get/set resource limits
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/resource.h>
int getrlimit(int resource, struct rlimit *rlim);
int setrlimit(int resource, const struct rlimit *rlim);
int prlimit(pid_t pid, int resource,
const struct rlimit *_Nullable new_limit,
struct rlimit *_Nullable old_limit);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
prlimit():
_GNU_SOURCE
DESCRIPTION
The getrlimit() and setrlimit() system calls get and set resource limits. Each resource has an associated soft and hard limit, as defined by the rlimit structure:
struct rlimit {
rlim_t rlim_cur; /* Soft limit */
rlim_t rlim_max; /* Hard limit (ceiling for rlim_cur) */
};
The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may set only its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability in the initial user namespace) may make arbitrary changes to either limit value.
The value RLIM_INFINITY denotes no limit on a resource (both in the structure returned by getrlimit() and in the structure passed to setrlimit()).
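The relationship between the soft and hard limits can be sketched with a small helper that raises the soft limit of a resource to its hard limit, which an unprivileged process is allowed to do. The helper name raise_soft_to_hard() is hypothetical.

```c
#include <sys/resource.h>   /* getrlimit(), setrlimit(), struct rlimit */

/* Hypothetical helper: raise the soft limit of `resource` to the
 * current hard limit. The hard limit itself is left unchanged.
 * Returns 0 on success, or -1 with errno set. */
int raise_soft_to_hard(int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) == -1)
        return -1;

    rl.rlim_cur = rl.rlim_max;   /* soft limit may go up to the ceiling */

    return setrlimit(resource, &rl);
}
```

For example, raise_soft_to_hard(RLIMIT_CORE) would allow core dumps up to whatever size the hard limit permits.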
The resource argument must be one of:
RLIMIT_AS
This is the maximum size of the process’s virtual memory (address space). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), mmap(2), and mremap(2), which fail with the error ENOMEM upon exceeding this limit. In addition, automatic stack expansion fails (and generates a SIGSEGV that kills the process if no alternate stack has been made available via sigaltstack(2)). Since the value is a long, on machines with a 32-bit long either this limit is at most 2 GiB, or this resource is unlimited.
RLIMIT_CORE
This is the maximum size of a core file (see core(5)) in bytes that the process may dump. When 0, no core dump files are created. When nonzero, larger dumps are truncated to this size.
RLIMIT_CPU
This is a limit, in seconds, on the amount of CPU time that the process can consume. When the process reaches the soft limit, it is sent a SIGXCPU signal. The default action for this signal is to terminate the process. However, the signal can be caught, and the handler can return control to the main program. If the process continues to consume CPU time, it will be sent SIGXCPU once per second until the hard limit is reached, at which time it is sent SIGKILL. (This latter point describes Linux behavior. Implementations vary in how they treat processes which continue to consume CPU time after reaching the soft limit. Portable applications that need to catch this signal should perform an orderly termination upon first receipt of SIGXCPU.)
RLIMIT_DATA
This is the maximum size of the process’s data segment (initialized data, uninitialized data, and heap). The limit is specified in bytes, and is rounded down to the system page size. This limit affects calls to brk(2), sbrk(2), and (since Linux 4.7) mmap(2), which fail with the error ENOMEM upon encountering the soft limit of this resource.
RLIMIT_FSIZE
This is the maximum size in bytes of files that the process may create. Attempts to extend a file beyond this limit result in delivery of a SIGXFSZ signal. By default, this signal terminates a process, but a process can catch this signal instead, in which case the relevant system call (e.g., write(2), truncate(2)) fails with the error EFBIG.
RLIMIT_LOCKS (Linux 2.4.0 to Linux 2.4.24)
This is a limit on the combined number of flock(2) locks and fcntl(2) leases that this process may establish.
RLIMIT_MEMLOCK
This is the maximum number of bytes of memory that may be locked into RAM. This limit is in effect rounded down to the nearest multiple of the system page size. This limit affects mlock(2), mlockall(2), and the mmap(2) MAP_LOCKED operation. Since Linux 2.6.9, it also affects the shmctl(2) SHM_LOCK operation, where it sets a maximum on the total bytes in shared memory segments (see shmget(2)) that may be locked by the real user ID of the calling process. The shmctl(2) SHM_LOCK locks are accounted for separately from the per-process memory locks established by mlock(2), mlockall(2), and mmap(2) MAP_LOCKED; a process can lock bytes up to this limit in each of these two categories.
Before Linux 2.6.9, this limit controlled the amount of memory that could be locked by a privileged process. Since Linux 2.6.9, no limits are placed on the amount of memory that a privileged process may lock, and this limit instead governs the amount of memory that an unprivileged process may lock.
RLIMIT_MSGQUEUE (since Linux 2.6.8)
This is a limit on the number of bytes that can be allocated for POSIX message queues for the real user ID of the calling process. This limit is enforced for mq_open(3). Each message queue that the user creates counts (until it is removed) against this limit according to the formula:
Since Linux 3.5:
bytes = attr.mq_maxmsg * sizeof(struct msg_msg) +
MIN(attr.mq_maxmsg, MQ_PRIO_MAX) *
sizeof(struct posix_msg_tree_node)+
/* For overhead */
attr.mq_maxmsg * attr.mq_msgsize;
/* For message data */
Linux 3.4 and earlier:
bytes = attr.mq_maxmsg * sizeof(struct msg_msg *) +
/* For overhead */
attr.mq_maxmsg * attr.mq_msgsize;
/* For message data */
where attr is the mq_attr structure specified as the fourth argument to mq_open(3), and the msg_msg and posix_msg_tree_node structures are kernel-internal structures.
The “overhead” addend in the formula accounts for overhead bytes required by the implementation and ensures that the user cannot create an unlimited number of zero-length messages (such messages nevertheless each consume some system memory for bookkeeping overhead).
RLIMIT_NICE (since Linux 2.6.12, but see BUGS below)
This specifies a ceiling to which the process’s nice value can be raised using setpriority(2) or nice(2). The actual ceiling for the nice value is calculated as 20 - rlim_cur. The useful range for this limit is thus from 1 (corresponding to a nice value of 19) to 40 (corresponding to a nice value of -20). This unusual choice of range was necessary because negative numbers cannot be specified as resource limit values, since they typically have special meanings. For example, RLIM_INFINITY typically is the same as -1. For more detail on the nice value, see sched(7).
RLIMIT_NOFILE
This specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.)
Since Linux 4.5, this limit also defines the maximum number of file descriptors that an unprivileged process (one without the CAP_SYS_RESOURCE capability) may have “in flight” to other processes, by being passed across UNIX domain sockets. This limit applies to the sendmsg(2) system call. For further details, see unix(7).
RLIMIT_NPROC
This is a limit on the number of extant processes (or, more precisely on Linux, threads) for the real user ID of the calling process. So long as the current number of processes belonging to this process’s real user ID is greater than or equal to this limit, fork(2) fails with the error EAGAIN.
The RLIMIT_NPROC limit is not enforced for processes that have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability, or run with real user ID 0.
RLIMIT_RSS
This is a limit (in bytes) on the process’s resident set (the number of virtual pages resident in RAM). This limit has effect only in Linux 2.4.x, x < 30, and there affects only calls to madvise(2) specifying MADV_WILLNEED.
RLIMIT_RTPRIO (since Linux 2.6.12, but see BUGS)
This specifies a ceiling on the real-time priority that may be set for this process using sched_setscheduler(2) and sched_setparam(2).
For further details on real-time scheduling policies, see sched(7).
RLIMIT_RTTIME (since Linux 2.6.25)
This is a limit (in microseconds) on the amount of CPU time that a process scheduled under a real-time scheduling policy may consume without making a blocking system call. For the purpose of this limit, each time a process makes a blocking system call, the count of its consumed CPU time is reset to zero. The CPU time count is not reset if the process continues trying to use the CPU but is preempted, its time slice expires, or it calls sched_yield(2).
Upon reaching the soft limit, the process is sent a SIGXCPU signal. If the process catches or ignores this signal and continues consuming CPU time, then SIGXCPU will be generated once each second until the hard limit is reached, at which point the process is sent a SIGKILL signal.
The intended use of this limit is to stop a runaway real-time process from locking up the system.
For further details on real-time scheduling policies, see sched(7).
RLIMIT_SIGPENDING (since Linux 2.6.8)
This is a limit on the number of signals that may be queued for the real user ID of the calling process. Both standard and real-time signals are counted for the purpose of checking this limit. However, the limit is enforced only for sigqueue(3); it is always possible to use kill(2) to queue one instance of any of the signals that are not already queued to the process.
RLIMIT_STACK
This is the maximum size of the process stack, in bytes. Upon reaching this limit, a SIGSEGV signal is generated. To handle this signal, a process must employ an alternate signal stack (sigaltstack(2)).
Since Linux 2.6.23, this limit also determines the amount of space used for the process’s command-line arguments and environment variables; for details, see execve(2).
prlimit()
The Linux-specific prlimit() system call combines and extends the functionality of setrlimit() and getrlimit(). It can be used to both set and get the resource limits of an arbitrary process.
The resource argument has the same meaning as for setrlimit() and getrlimit().
If the new_limit argument is not NULL, then the rlimit structure to which it points is used to set new values for the soft and hard limits for resource. If the old_limit argument is not NULL, then a successful call to prlimit() places the previous soft and hard limits for resource in the rlimit structure pointed to by old_limit.
The pid argument specifies the ID of the process on which the call is to operate. If pid is 0, then the call applies to the calling process. To set or get the resources of a process other than itself, the caller must have the CAP_SYS_RESOURCE capability in the user namespace of the process whose resource limits are being changed, or the real, effective, and saved set user IDs of the target process must match the real user ID of the caller and the real, effective, and saved set group IDs of the target process must match the real group ID of the caller.
RETURN VALUE
On success, these system calls return 0. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EFAULT
A pointer argument points to a location outside the accessible address space.
EINVAL
The value specified in resource is not valid; or, for setrlimit() or prlimit(): rlim->rlim_cur was greater than rlim->rlim_max.
EPERM
An unprivileged process tried to raise the hard limit; the CAP_SYS_RESOURCE capability is required to do this.
EPERM
The caller tried to increase the hard RLIMIT_NOFILE limit above the maximum defined by /proc/sys/fs/nr_open (see proc(5)).
EPERM
(prlimit()) The calling process did not have permission to set limits for the process specified by pid.
ESRCH
Could not find a process with the ID specified in pid.
ATTRIBUTES
For an explanation of the terms used in this section, see attributes(7).
Interface | Attribute | Value |
getrlimit(), setrlimit(), prlimit() | Thread safety | MT-Safe |
STANDARDS
getrlimit()
setrlimit()
POSIX.1-2008.
prlimit()
Linux.
RLIMIT_MEMLOCK and RLIMIT_NPROC derive from BSD and are not specified in POSIX.1; they are present on the BSDs and Linux, but on few other implementations. RLIMIT_RSS derives from BSD and is not specified in POSIX.1; it is nevertheless present on most implementations. RLIMIT_MSGQUEUE, RLIMIT_NICE, RLIMIT_RTPRIO, RLIMIT_RTTIME, and RLIMIT_SIGPENDING are Linux-specific.
HISTORY
getrlimit()
setrlimit()
POSIX.1-2001, SVr4, 4.3BSD.
prlimit()
Linux 2.6.36, glibc 2.13.
NOTES
A child process created via fork(2) inherits its parent’s resource limits. Resource limits are preserved across execve(2).
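Inheritance across fork(2) can be checked with the sketch below. The helper child_inherits_limit() is hypothetical; it compares the limits seen by a forked child with those of the parent and reports the result through the child's exit status.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a forked child sees the same soft
 * and hard limits for `resource` as the parent, 0 if it does not, and
 * -1 on error. */
int child_inherits_limit(int resource)
{
    struct rlimit parent_rl, child_rl;
    int status;
    pid_t pid;

    if (getrlimit(resource, &parent_rl) == -1)
        return -1;

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: compare its own limits with the parent's copy. */
        int same = getrlimit(resource, &child_rl) == 0
                   && child_rl.rlim_cur == parent_rl.rlim_cur
                   && child_rl.rlim_max == parent_rl.rlim_max;
        _exit(same ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A common use of this inheritance is to call setrlimit() in the child between fork(2) and execve(2), so that only the new program runs under the tighter limits.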
Resource limits are per-process attributes that are shared by all of the threads in a process.
Lowering the soft limit for a resource below the process’s current consumption of that resource will succeed (but will prevent the process from further increasing its consumption of the resource).
One can set the resource limits of the shell using the built-in ulimit command (limit in csh(1)). The shell’s resource limits are inherited by the processes that it creates to execute commands.
Since Linux 2.6.24, the resource limits of any process can be inspected via /proc/pid/limits; see proc(5).
Ancient systems provided a vlimit() function with a similar purpose to setrlimit(). For backward compatibility, glibc also provides vlimit(). All new applications should be written using setrlimit().
C library/kernel ABI differences
Since glibc 2.13, the glibc getrlimit() and setrlimit() wrapper functions no longer invoke the corresponding system calls, but instead employ prlimit(), for the reasons described in BUGS.
The name of the glibc wrapper function is prlimit(); the underlying system call is prlimit64().
BUGS
In older Linux kernels, the SIGXCPU and SIGKILL signals delivered when a process encountered the soft and hard RLIMIT_CPU limits were delivered one (CPU) second later than they should have been. This was fixed in Linux 2.6.8.
In Linux 2.6.x kernels before Linux 2.6.17, a RLIMIT_CPU limit of 0 is wrongly treated as “no limit” (like RLIM_INFINITY). Since Linux 2.6.17, setting a limit of 0 does have an effect, but is actually treated as a limit of 1 second.
A kernel bug means that RLIMIT_RTPRIO does not work in Linux 2.6.12; the problem is fixed in Linux 2.6.13.
In Linux 2.6.12, there was an off-by-one mismatch between the priority ranges returned by getpriority(2) and RLIMIT_NICE. This had the effect that the actual ceiling for the nice value was calculated as 19 - rlim_cur. This was fixed in Linux 2.6.13.
Since Linux 2.6.12, if a process reaches its soft RLIMIT_CPU limit and has a handler installed for SIGXCPU, then, in addition to invoking the signal handler, the kernel increases the soft limit by one second. This behavior repeats if the process continues to consume CPU time, until the hard limit is reached, at which point the process is killed. Other implementations do not change the RLIMIT_CPU soft limit in this manner, and the Linux behavior is probably not standards conformant; portable applications should avoid relying on this Linux-specific behavior. The Linux-specific RLIMIT_RTTIME limit exhibits the same behavior when the soft limit is encountered.
Kernels before Linux 2.4.22 did not diagnose the error EINVAL for setrlimit() when rlim->rlim_cur was greater than rlim->rlim_max.
Linux doesn’t return an error when an attempt to set RLIMIT_CPU has failed, for compatibility reasons.
Representation of “large” resource limit values on 32-bit platforms
The glibc getrlimit() and setrlimit() wrapper functions use a 64-bit rlim_t data type, even on 32-bit platforms. However, the rlim_t data type used in the getrlimit() and setrlimit() system calls is a (32-bit) unsigned long. Furthermore, in Linux, the kernel represents resource limits on 32-bit platforms as unsigned long. However, a 32-bit data type is not wide enough. The most pertinent limit here is RLIMIT_FSIZE, which specifies the maximum size to which a file can grow: to be useful, this limit must be represented using a type that is as wide as the type used to represent file offsets (that is, as wide as a 64-bit off_t, assuming a program compiled with _FILE_OFFSET_BITS=64).
To work around this kernel limitation, if a program tried to set a resource limit to a value larger than can be represented in a 32-bit unsigned long, then the glibc setrlimit() wrapper function silently converted the limit value to RLIM_INFINITY. In other words, the requested resource limit setting was silently ignored.
Since glibc 2.13, glibc works around the limitations of the getrlimit() and setrlimit() system calls by implementing setrlimit() and getrlimit() as wrapper functions that call prlimit().
EXAMPLES
The program below demonstrates the use of prlimit().
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    pid_t          pid;
    struct rlimit  old, new;
    struct rlimit  *newp;

    if (!(argc == 2 || argc == 4)) {
        fprintf(stderr, "Usage: %s <pid> [<new-soft-limit> "
                "<new-hard-limit>]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    pid = atoi(argv[1]);        /* PID of target process */

    newp = NULL;
    if (argc == 4) {
        new.rlim_cur = atoi(argv[2]);
        new.rlim_max = atoi(argv[3]);
        newp = &new;
    }

    /* Set CPU time limit of target process; retrieve and display
       previous limit */

    if (prlimit(pid, RLIMIT_CPU, newp, &old) == -1)
        err(EXIT_FAILURE, "prlimit-1");
    printf("Previous limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);

    /* Retrieve and display new CPU time limit */

    if (prlimit(pid, RLIMIT_CPU, NULL, &old) == -1)
        err(EXIT_FAILURE, "prlimit-2");
    printf("New limits: soft=%jd; hard=%jd\n",
           (intmax_t) old.rlim_cur, (intmax_t) old.rlim_max);

    exit(EXIT_SUCCESS);
}
SEE ALSO
prlimit(1), dup(2), fcntl(2), fork(2), getrusage(2), mlock(2), mmap(2), open(2), quotactl(2), sbrk(2), shmctl(2), malloc(3), sigqueue(3), ulimit(3), core(5), capabilities(7), cgroups(7), credentials(7), signal(7)
488 - Linux cli command vhangup
NAME
vhangup - virtually hang up the current terminal
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <unistd.h>
int vhangup(void);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
vhangup():
Since glibc 2.21:
_DEFAULT_SOURCE
In glibc 2.19 and 2.20:
_DEFAULT_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
Up to and including glibc 2.19:
_BSD_SOURCE || (_XOPEN_SOURCE && _XOPEN_SOURCE < 500)
DESCRIPTION
vhangup() simulates a hangup on the current terminal. This call arranges for other users to have a "clean" terminal at login time.
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EPERM
The calling process has insufficient privilege to call vhangup(); the CAP_SYS_TTY_CONFIG capability is required.
STANDARDS
Linux.
SEE ALSO
init(1), capabilities(7)
489 - Linux cli command futimesat
NAME
futimesat - change timestamps of a file relative to a directory file descriptor
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/time.h>
[[deprecated]] int futimesat(int dirfd, const char *pathname,
const struct timeval times[2]);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
futimesat():
_GNU_SOURCE
DESCRIPTION
This system call is obsolete. Use utimensat(2) instead.
The futimesat() system call operates in exactly the same way as utimes(2), except for the differences described in this manual page.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by utimes(2) for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like utimes(2)).
If pathname is absolute, then dirfd is ignored. (See openat(2) for an explanation of why the dirfd argument is useful.)
RETURN VALUE
On success, futimesat() returns 0. On error, -1 is returned and errno is set to indicate the error.
ERRORS
The same errors that occur for utimes(2) can also occur for futimesat(). The following additional errors can occur for futimesat():
EBADF
pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
ENOTDIR
pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
VERSIONS
glibc
If pathname is NULL, then the glibc futimesat() wrapper function updates the times for the file referred to by dirfd.
STANDARDS
None.
HISTORY
Linux 2.6.16, glibc 2.4.
It was implemented from a specification that was proposed for POSIX.1, but that specification was replaced by the one for utimensat(2).
A similar system call exists on Solaris.
SEE ALSO
stat(2), utimensat(2), utimes(2), futimes(3), path_resolution(7)
490 - Linux cli command chmod
NAME
chmod - change permissions of a file
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int chmod(const char *pathname, mode_t mode);
int fchmod(int fd, mode_t mode);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fchmodat(int dirfd, const char *pathname, mode_t mode, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
fchmod():
Since glibc 2.24:
_POSIX_C_SOURCE >= 199309L
glibc 2.19 to glibc 2.23:
_POSIX_C_SOURCE
glibc 2.16 to glibc 2.19:
_BSD_SOURCE || _POSIX_C_SOURCE
glibc 2.12 to glibc 2.16:
_BSD_SOURCE || _XOPEN_SOURCE >= 500
|| _POSIX_C_SOURCE >= 200809L
glibc 2.11 and earlier:
_BSD_SOURCE || _XOPEN_SOURCE >= 500
fchmodat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
The chmod() and fchmod() system calls change a file’s mode bits. (The file mode consists of the file permission bits plus the set-user-ID, set-group-ID, and sticky bits.) These system calls differ only in how the file is specified:
chmod() changes the mode of the file whose pathname is given in pathname, which is dereferenced if it is a symbolic link.
fchmod() changes the mode of the file referred to by the open file descriptor fd.
The new file mode is specified in mode, which is a bit mask created by ORing together zero or more of the following:
S_ISUID (04000)
set-user-ID (set process effective user ID on execve(2))
S_ISGID (02000)
set-group-ID (set process effective group ID on execve(2); mandatory locking, as described in fcntl(2); take a new file’s group from parent directory, as described in chown(2) and mkdir(2))
S_ISVTX (01000)
sticky bit (restricted deletion flag, as described in unlink(2))
S_IRUSR (00400)
read by owner
S_IWUSR (00200)
write by owner
S_IXUSR (00100)
execute/search by owner (“search” applies for directories, and means that entries within the directory can be accessed)
S_IRGRP (00040)
read by group
S_IWGRP (00020)
write by group
S_IXGRP (00010)
execute/search by group
S_IROTH (00004)
read by others
S_IWOTH (00002)
write by others
S_IXOTH (00001)
execute/search by others
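The constants above are normally combined with bitwise OR. The following sketch builds the common mode 0644 and applies it with fchmod(); the macro MODE_RW_R_R and the helper create_world_readable() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* rw-r--r--: the owner may read and write; group and others may read. */
#define MODE_RW_R_R (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)

/* Create (or truncate) `path` and give it mode 0644.
   Returns 0 on success, or -1 on error. */
static int
create_world_readable(const char *path)
{
    int fd, ret;

    assert(path != NULL);

    fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;

    /* Unlike the mode given to open(2), the mode given to fchmod()
       is not filtered through the process umask. */
    ret = fchmod(fd, MODE_RW_R_R);
    close(fd);
    return ret;
}
```

The octal values listed next to each constant add up in the obvious way: 00400 + 00200 + 00040 + 00004 gives 0644.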
The effective UID of the calling process must match the owner of the file, or the process must be privileged (Linux: it must have the CAP_FOWNER capability).
If the calling process is not privileged (Linux: does not have the CAP_FSETID capability), and the group of the file does not match the effective group ID of the process or one of its supplementary group IDs, the S_ISGID bit will be turned off, but this will not cause an error to be returned.
As a security measure, depending on the filesystem, the set-user-ID and set-group-ID execution bits may be turned off if a file is written. (On Linux, this occurs if the writing process does not have the CAP_FSETID capability.) On some filesystems, only the superuser can set the sticky bit, which may have a special meaning. For the sticky bit, and for set-user-ID and set-group-ID bits on directories, see inode(7).
On NFS filesystems, restricting the permissions will immediately influence already open files, because the access control is done on the server, but open files are maintained by the client. Widening the permissions may be delayed for other clients if attribute caching is enabled on them.
fchmodat()
The fchmodat() system call operates in exactly the same way as chmod(), except for the differences described here.
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by chmod() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like chmod()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include the following flag:
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead operate on the link itself. This flag is not currently implemented.
See openat(2) for an explanation of the need for fchmodat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
Depending on the filesystem, errors other than those listed below can be returned.
The more general errors for chmod() are listed below:
EACCES
Search permission is denied on a component of the path prefix. (See also path_resolution(7).)
EBADF
(fchmod()) The file descriptor fd is not valid.
EBADF
(fchmodat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
pathname points outside your accessible address space.
EINVAL
(fchmodat()) Invalid flag specified in flags.
EIO
An I/O error occurred.
ELOOP
Too many symbolic links were encountered in resolving pathname.
ENAMETOOLONG
pathname is too long.
ENOENT
The file does not exist.
ENOMEM
Insufficient kernel memory was available.
ENOTDIR
A component of the path prefix is not a directory.
ENOTDIR
(fchmodat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
ENOTSUP
(fchmodat()) flags specified AT_SYMLINK_NOFOLLOW, which is not supported.
EPERM
The effective UID does not match the owner of the file, and the process is not privileged (Linux: it does not have the CAP_FOWNER capability).
EPERM
The file is marked immutable or append-only. (See ioctl_iflags(2).)
EROFS
The named file resides on a read-only filesystem.
VERSIONS
C library/kernel differences
The GNU C library fchmodat() wrapper function implements the POSIX-specified interface described in this page. This interface differs from the underlying Linux system call, which does not have a flags argument.
glibc notes
On older kernels where fchmodat() is unavailable, the glibc wrapper function falls back to the use of chmod(). When pathname is a relative pathname, glibc constructs a pathname based on the symbolic link in /proc/self/fd that corresponds to the dirfd argument.
STANDARDS
POSIX.1-2008.
HISTORY
chmod()
fchmod()
4.4BSD, SVr4, POSIX.1-2001.
fchmodat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
SEE ALSO
chmod(1), chown(2), execve(2), open(2), stat(2), inode(7), path_resolution(7), symlink(7)
491 - Linux cli command signalfd
NAME
signalfd - create a file descriptor for accepting signals
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/signalfd.h>
int signalfd(int fd, const sigset_t *mask, int flags);
DESCRIPTION
signalfd() creates a file descriptor that can be used to accept signals targeted at the caller. This provides an alternative to the use of a signal handler or sigwaitinfo(2), and has the advantage that the file descriptor may be monitored by select(2), poll(2), and epoll(7).
The mask argument specifies the set of signals that the caller wishes to accept via the file descriptor. This argument is a signal set whose contents can be initialized using the macros described in sigsetops(3). Normally, the set of signals to be received via the file descriptor should be blocked using sigprocmask(2), to prevent the signals being handled according to their default dispositions. It is not possible to receive SIGKILL or SIGSTOP signals via a signalfd file descriptor; these signals are silently ignored if specified in mask.
If the fd argument is -1, then the call creates a new file descriptor and associates the signal set specified in mask with that file descriptor. If fd is not -1, then it must specify a valid existing signalfd file descriptor, and mask is used to replace the signal set associated with that file descriptor.
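The two uses of the fd argument can be sketched as follows. The helper name signalfd_for_two() is hypothetical; it first creates a descriptor for one signal and then replaces the mask on that same descriptor so that a second signal is accepted as well. Blocking the signals with sigprocmask(2) is left to the caller.

```c
#include <assert.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Create a signalfd for `first`, then widen its mask in place so that
   it also accepts `second`. Returns the descriptor, or -1 on error. */
static int
signalfd_for_two(int first, int second)
{
    sigset_t mask;
    int sfd, same;

    sigemptyset(&mask);
    sigaddset(&mask, first);

    sfd = signalfd(-1, &mask, 0);       /* fd == -1: new descriptor */
    if (sfd == -1)
        return -1;

    sigaddset(&mask, second);
    same = signalfd(sfd, &mask, 0);     /* fd != -1: replace the mask */
    if (same == -1) {
        close(sfd);
        return -1;
    }

    assert(same == sfd);                /* the same descriptor comes back */
    return sfd;
}
```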
Starting with Linux 2.6.27, the following values may be bitwise ORed in flags to change the behavior of signalfd():
SFD_NONBLOCK
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
SFD_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
Up to Linux 2.6.26, the flags argument is unused, and must be specified as zero.
signalfd() returns a file descriptor that supports the following operations:
read(2)
If one or more of the signals specified in mask is pending for the process, then the buffer supplied to read(2) is used to return one or more signalfd_siginfo structures (see below) that describe the signals. The read(2) returns information for as many signals as are pending and will fit in the supplied buffer. The buffer must be at least sizeof(struct signalfd_siginfo) bytes. The return value of the read(2) is the total number of bytes read.
As a consequence of the read(2), the signals are consumed, so that they are no longer pending for the process (i.e., will not be caught by signal handlers, and cannot be accepted using sigwaitinfo(2)).
If none of the signals in mask is pending for the process, then the read(2) either blocks until one of the signals in mask is generated for the process, or fails with the error EAGAIN if the file descriptor has been made nonblocking.
poll(2)
select(2)
(and similar)
The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if one or more of the signals in mask is pending for the process.
The signalfd file descriptor also supports the other file-descriptor multiplexing APIs: pselect(2), ppoll(2), and epoll(7).
close(2)
When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same signalfd object have been closed, the resources for the object are freed by the kernel.
The signalfd_siginfo structure
The format of the signalfd_siginfo structure(s) returned by read(2)s from a signalfd file descriptor is as follows:
struct signalfd_siginfo {
uint32_t ssi_signo; /* Signal number */
int32_t ssi_errno; /* Error number (unused) */
int32_t ssi_code; /* Signal code */
uint32_t ssi_pid; /* PID of sender */
uint32_t ssi_uid; /* Real UID of sender */
int32_t ssi_fd; /* File descriptor (SIGIO) */
uint32_t ssi_tid; /* Kernel timer ID (POSIX timers) */
uint32_t ssi_band; /* Band event (SIGIO) */
uint32_t ssi_overrun; /* POSIX timer overrun count */
uint32_t ssi_trapno; /* Trap number that caused signal */
int32_t ssi_status; /* Exit status or signal (SIGCHLD) */
int32_t ssi_int; /* Integer sent by sigqueue(3) */
uint64_t ssi_ptr; /* Pointer sent by sigqueue(3) */
uint64_t ssi_utime; /* User CPU time consumed (SIGCHLD) */
uint64_t ssi_stime; /* System CPU time consumed
(SIGCHLD) */
uint64_t ssi_addr; /* Address that generated signal
(for hardware-generated signals) */
uint16_t ssi_addr_lsb; /* Least significant bit of address
(SIGBUS; since Linux 2.6.37) */
uint8_t pad[X]; /* Pad size to 128 bytes (allow for
additional fields in the future) */
};
Each of the fields in this structure is analogous to the similarly named field in the siginfo_t structure. The siginfo_t structure is described in sigaction(2). Not all fields in the returned signalfd_siginfo structure will be valid for a specific signal; the set of valid fields can be determined from the value returned in the ssi_code field. This field is the analog of the siginfo_t si_code field; see sigaction(2) for details.
fork(2) semantics
After a fork(2), the child inherits a copy of the signalfd file descriptor. A read(2) from the file descriptor in the child will return information about signals queued to the child.
Semantics of file descriptor passing
As with other file descriptors, signalfd file descriptors can be passed to another process via a UNIX domain socket (see unix(7)). In the receiving process, a read(2) from the received file descriptor will return information about signals queued to that process.
execve(2) semantics
Just like any other file descriptor, a signalfd file descriptor remains open across an execve(2), unless it has been marked for close-on-exec (see fcntl(2)). Any signals that were available for reading before the execve(2) remain available to the newly loaded program. (This is analogous to traditional signal semantics, where a blocked signal that is pending remains pending across an execve(2).)
Thread semantics
The semantics of signalfd file descriptors in a multithreaded program mirror the standard semantics for signals. In other words, when a thread reads from a signalfd file descriptor, it will read the signals that are directed to the thread itself and the signals that are directed to the process (i.e., the entire thread group). (A thread will not be able to read signals that are directed to other threads in the process.)
epoll(7) semantics
If a process adds (via epoll_ctl(2)) a signalfd file descriptor to an epoll(7) instance, then epoll_wait(2) returns events only for signals sent to that process. In particular, if the process then uses fork(2) to create a child process, then the child will be able to read(2) signals that are sent to it using the signalfd file descriptor, but epoll_wait(2) will not indicate that the signalfd file descriptor is ready. In this scenario, a possible workaround is that after the fork(2), the child process can close the signalfd file descriptor that it inherited from the parent process and then create another signalfd file descriptor and add it to the epoll instance. Alternatively, the parent and the child could delay creating their (separate) signalfd file descriptors and adding them to the epoll instance until after the call to fork(2).
RETURN VALUE
On success, signalfd() returns a signalfd file descriptor; this is either a new file descriptor (if fd was -1), or fd if fd was a valid signalfd file descriptor. On error, -1 is returned and errno is set to indicate the error.
ERRORS
EBADF
The fd file descriptor is not a valid file descriptor.
EINVAL
fd is not a valid signalfd file descriptor.
EINVAL
flags is invalid; or, in Linux 2.6.26 or earlier, flags is nonzero.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
Could not mount (internal) anonymous inode device.
ENOMEM
There was insufficient memory to create a new signalfd file descriptor.
VERSIONS
C library/kernel differences
The underlying Linux system call requires an additional argument, size_t sizemask, which specifies the size of the mask argument. The glibc signalfd() wrapper function does not include this argument, since it provides the required value for the underlying system call.
There are two underlying Linux system calls: signalfd() and the more recent signalfd4(). The former system call does not implement a flags argument. The latter system call implements the flags values described above. Starting with glibc 2.9, the signalfd() wrapper function will use signalfd4() where it is available.
STANDARDS
Linux.
HISTORY
signalfd()
Linux 2.6.22, glibc 2.8.
signalfd4()
Linux 2.6.27.
NOTES
A process can create multiple signalfd file descriptors. This makes it possible to accept different signals on different file descriptors. (This may be useful if monitoring the file descriptors using select(2), poll(2), or epoll(7): the arrival of different signals will make different file descriptors ready.) If a signal appears in the mask of more than one of the file descriptors, then occurrences of that signal can be read (once) from any one of the file descriptors.
Attempts to include SIGKILL and SIGSTOP in mask are silently ignored.
The signal mask employed by a signalfd file descriptor can be viewed via the entry for the corresponding file descriptor in the process’s /proc/pid/fdinfo directory. See proc(5) for further details.
Limitations
The signalfd mechanism can’t be used to receive signals that are synchronously generated, such as the SIGSEGV signal that results from accessing an invalid memory address or the SIGFPE signal that results from an arithmetic error. Such signals can be caught only via a signal handler.
As described above, in normal usage one blocks the signals that will be accepted via signalfd(). If spawning a child process to execute a helper program (that does not need the signalfd file descriptor), then, after the call to fork(2), you will normally want to unblock those signals before calling execve(2), so that the helper program can see any signals that it expects to see. Be aware, however, that this won’t be possible in the case of a helper program spawned behind the scenes by any library function that the program may call. In such cases, one must fall back to using a traditional signal handler that writes to a file descriptor monitored by select(2), poll(2), or epoll(7).
BUGS
Before Linux 2.6.25, the ssi_ptr and ssi_int fields are not filled in with the data accompanying a signal sent by sigqueue(3).
EXAMPLES
The program below accepts the signals SIGINT and SIGQUIT via a signalfd file descriptor. The program terminates after accepting a SIGQUIT signal. The following shell session demonstrates the use of the program:
$ ./signalfd_demo
^C # Control-C generates SIGINT
Got SIGINT
^C
Got SIGINT
^\ # Control-\ generates SIGQUIT
Got SIGQUIT
$
Program source
#include <err.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>
int
main(void)
{
    int                      sfd;
    ssize_t                  s;
    sigset_t                 mask;
    struct signalfd_siginfo  fdsi;
    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);
    sigaddset(&mask, SIGQUIT);
    /* Block signals so that they aren't handled
       according to their default dispositions. */
    if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
        err(EXIT_FAILURE, "sigprocmask");
    sfd = signalfd(-1, &mask, 0);
    if (sfd == -1)
        err(EXIT_FAILURE, "signalfd");
    for (;;) {
        s = read(sfd, &fdsi, sizeof(fdsi));
        if (s != sizeof(fdsi))
            err(EXIT_FAILURE, "read");
        if (fdsi.ssi_signo == SIGINT) {
            printf("Got SIGINT\n");
        } else if (fdsi.ssi_signo == SIGQUIT) {
            printf("Got SIGQUIT\n");
            exit(EXIT_SUCCESS);
        } else {
            printf("Read unexpected signal\n");
        }
    }
}
SEE ALSO
eventfd(2), poll(2), read(2), select(2), sigaction(2), sigprocmask(2), sigwaitinfo(2), timerfd_create(2), sigsetops(3), sigwait(3), epoll(7), signal(7)
492 - Linux cli command io_submit
NAME
io_submit - submit asynchronous I/O blocks for processing
LIBRARY
Standard C library (libc, -lc)
Alternatively, Asynchronous I/O library (libaio, -laio); see VERSIONS.
SYNOPSIS
#include <linux/aio_abi.h> /* Defines needed types */
int io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp);
Note: There is no glibc wrapper for this system call; see VERSIONS.
DESCRIPTION
Note: this page describes the raw Linux system call interface. The wrapper function provided by libaio uses a different type for the ctx_id argument. See VERSIONS.
The io_submit() system call queues nr I/O request blocks for processing in the AIO context ctx_id. The iocbpp argument should be an array of nr AIO control blocks, which will be submitted to context ctx_id.
The iocb (I/O control block) structure defined in linux/aio_abi.h defines the parameters that control the I/O operation.
#include <linux/aio_abi.h>
struct iocb {
__u64 aio_data;
__u32 PADDED(aio_key, aio_rw_flags);
__u16 aio_lio_opcode;
__s16 aio_reqprio;
__u32 aio_fildes;
__u64 aio_buf;
__u64 aio_nbytes;
__s64 aio_offset;
__u64 aio_reserved2;
__u32 aio_flags;
__u32 aio_resfd;
};
The fields of this structure are as follows:
aio_data
This data is copied into the data field of the io_event structure upon I/O completion (see io_getevents(2)).
aio_key
This is an internal field used by the kernel. Do not modify this field after an io_submit() call.
aio_rw_flags
This defines the R/W flags passed with the structure. The valid values are:
RWF_APPEND (since Linux 4.16)
Append data to the end of the file. See the description of the flag of the same name in pwritev2(2) as well as the description of O_APPEND in open(2). The aio_offset field is ignored. The file offset is not changed.
RWF_DSYNC (since Linux 4.13)
The write operation completes according to the requirements of synchronized I/O data integrity. See the description of the flag of the same name in pwritev2(2) as well as the description of O_DSYNC in open(2).
RWF_HIPRI (since Linux 4.13)
High-priority request; poll if possible.
RWF_NOWAIT (since Linux 4.14)
Don’t wait if the I/O will block for operations such as file block allocations, dirty page flush, mutex locks, or a congested block device inside the kernel. If any of these conditions are met, the control block is returned immediately with a return value of -EAGAIN in the res field of the io_event structure (see io_getevents(2)).
RWF_SYNC (since Linux 4.13)
The write operation completes according to the requirements of synchronized I/O file integrity. See the description of the flag of the same name in pwritev2(2) as well as the description of O_SYNC in open(2).
aio_lio_opcode
This defines the type of I/O to be performed by the iocb structure. The valid values are defined by the enum defined in linux/aio_abi.h:
enum {
IOCB_CMD_PREAD = 0,
IOCB_CMD_PWRITE = 1,
IOCB_CMD_FSYNC = 2,
IOCB_CMD_FDSYNC = 3,
IOCB_CMD_POLL = 5,
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,
};
aio_reqprio
This defines the request's priority.
aio_fildes
The file descriptor on which the I/O operation is to be performed.
aio_buf
This is the buffer used to transfer data for a read or write operation.
aio_nbytes
This is the size of the buffer pointed to by aio_buf.
aio_offset
This is the file offset at which the I/O operation is to be performed.
aio_flags
This is the set of flags associated with the iocb structure. The valid values are:
IOCB_FLAG_RESFD
Asynchronous I/O control must signal the file descriptor mentioned in aio_resfd upon completion.
IOCB_FLAG_IOPRIO (since Linux 4.18)
Interpret the aio_reqprio field as an IOPRIO_VALUE as defined by linux/ioprio.h.
aio_resfd
The file descriptor to signal in the event of asynchronous I/O completion.
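To show how the fields fit together, the sketch below submits a single IOCB_CMD_PWRITE request through the raw system calls and waits for its completion. The helper name aio_write_file() is hypothetical. The sketch relies on io_setup(2), io_getevents(2), and io_destroy(2), which are described on their own pages, and it assumes that the SYS_io_* system call numbers are available on the target architecture.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Create `path` and write len bytes from buf at offset 0 using one AIO
   request. Returns the number of bytes written, or -1 with errno set. */
static long
aio_write_file(const char *path, const void *buf, size_t len)
{
    aio_context_t ctx = 0;      /* io_setup(2) expects a zeroed context */
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;
    long ret = -1;
    int fd, saved;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    if (syscall(SYS_io_setup, 1U, &ctx) == -1) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    memset(&cb, 0, sizeof(cb));             /* unused fields must be zero */
    cb.aio_lio_opcode = IOCB_CMD_PWRITE;
    cb.aio_fildes = (uint32_t) fd;
    cb.aio_buf = (uint64_t) (uintptr_t) buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    if (syscall(SYS_io_submit, ctx, 1L, list) == 1 &&
        syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, NULL) == 1) {
        if (ev.res >= 0)
            ret = (long) ev.res;            /* bytes transferred */
        else
            errno = (int) -ev.res;          /* res is a negated errno */
    }

    saved = errno;
    syscall(SYS_io_destroy, ctx);
    close(fd);
    errno = saved;
    return ret;
}
```

Since the call is made through syscall(2), failures of io_submit() itself are reported as -1 with errno set, rather than as a negated error number.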
RETURN VALUE
On success, io_submit() returns the number of iocbs submitted (which may be less than nr, or 0 if nr is zero). For the failure return, see VERSIONS.
ERRORS
EAGAIN
Insufficient resources are available to queue any iocbs.
EBADF
The file descriptor specified in the first iocb is invalid.
EFAULT
One of the data structures points to invalid data.
EINVAL
The AIO context specified by ctx_id is invalid. nr is less than 0. The iocb at *iocbpp[0] is not properly initialized, the operation specified is invalid for the file descriptor in the iocb, or the value in the aio_reqprio field is invalid.
ENOSYS
io_submit() is not implemented on this architecture.
EPERM
The aio_reqprio field is set with the class IOPRIO_CLASS_RT, but the submitting context does not have the CAP_SYS_ADMIN capability.
VERSIONS
glibc does not provide a wrapper for this system call. You could invoke it using syscall(2). But instead, you probably want to use the io_submit() wrapper function provided by libaio.
Note that the libaio wrapper function uses a different type (io_context_t) for the ctx_id argument. Note also that the libaio wrapper does not follow the usual C library conventions for indicating errors: on error it returns a negated error number (the negative of one of the values listed in ERRORS). If the system call is invoked via syscall(2), then the return value follows the usual conventions for indicating an error: -1, with errno set to a (positive) value that indicates the error.
STANDARDS
Linux.
HISTORY
Linux 2.5.
SEE ALSO
io_cancel(2), io_destroy(2), io_getevents(2), io_setup(2), aio(7)
493 - Linux cli command mremap
NAME
mremap - remap a virtual memory address
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <sys/mman.h>
void *mremap(void old_address[.old_size], size_t old_size,
size_t new_size, int flags, ... /* void *new_address */);
DESCRIPTION
mremap() expands (or shrinks) an existing memory mapping, potentially moving it at the same time (controlled by the flags argument and the available virtual address space).
old_address is the old address of the virtual memory block that you want to expand (or shrink). Note that old_address has to be page aligned. old_size is the old size of the virtual memory block. new_size is the requested size of the virtual memory block after the resize. An optional fifth argument, new_address, may be provided; see the description of MREMAP_FIXED below.
If the value of old_size is zero, and old_address refers to a shareable mapping (see the description of MAP_SHARED in mmap(2)), then mremap() will create a new mapping of the same pages. new_size will be the size of the new mapping and the location of the new mapping may be specified with new_address; see the description of MREMAP_FIXED below. If a new mapping is requested via this method, then the MREMAP_MAYMOVE flag must also be specified.
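The old_size == 0 case can be sketched as follows. The helper name duplicate_shared_mapping() is hypothetical; the caller is assumed to pass the start address of a shareable mapping, for example one created with MAP_SHARED | MAP_ANONYMOUS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Create a second mapping of the pages behind the shareable mapping at
   `addr`. Returns the new address, or MAP_FAILED on error. */
static void *
duplicate_shared_mapping(void *addr, size_t size)
{
    assert(addr != NULL && size > 0);

    /* old_size == 0 requests a new mapping of the same pages;
       MREMAP_MAYMOVE must be specified in this mode. */
    return mremap(addr, 0, size, MREMAP_MAYMOVE);
}
```

After a successful call, both the original and the returned address refer to the same pages, so a store through one is visible through the other; each mapping has to be released with its own munmap(2) call.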
The flags bit-mask argument may be 0, or include the following flags:
MREMAP_MAYMOVE
By default, if there is not sufficient space to expand a mapping at its current location, then mremap() fails. If this flag is specified, then the kernel is permitted to relocate the mapping to a new virtual address, if necessary. If the mapping is relocated, then absolute pointers into the old mapping location become invalid (offsets relative to the starting address of the mapping should be employed).
MREMAP_FIXED (since Linux 2.3.31)
This flag serves a similar purpose to the MAP_FIXED flag of mmap(2). If this flag is specified, then mremap() accepts a fifth argument, void *new_address, which specifies a page-aligned address to which the mapping must be moved. Any previous mapping at the address range specified by new_address and new_size is unmapped.
If MREMAP_FIXED is specified, then MREMAP_MAYMOVE must also be specified.
MREMAP_DONTUNMAP (since Linux 5.7)
This flag, which must be used in conjunction with MREMAP_MAYMOVE, remaps a mapping to a new address but does not unmap the mapping at old_address.
The MREMAP_DONTUNMAP flag can be used only with private anonymous mappings (see the description of MAP_PRIVATE and MAP_ANONYMOUS in mmap(2)).
After completion, any access to the range specified by old_address and old_size will result in a page fault. The page fault will be handled by a userfaultfd(2) handler if the address is in a range previously registered with userfaultfd(2). Otherwise, the kernel allocates a zero-filled page to handle the fault.
The MREMAP_DONTUNMAP flag may be used to atomically move a mapping while leaving the source mapped. See NOTES for some possible applications of MREMAP_DONTUNMAP.
If the memory segment specified by old_address and old_size is locked (using mlock(2) or similar), then this lock is maintained when the segment is resized and/or relocated. As a consequence, the amount of memory locked by the process may change.
RETURN VALUE
On success mremap() returns a pointer to the new virtual memory area. On error, the value MAP_FAILED (that is, (void *) -1) is returned, and errno is set to indicate the error.
ERRORS
EAGAIN
The caller tried to expand a memory segment that is locked, but this was not possible without exceeding the RLIMIT_MEMLOCK resource limit.
EFAULT
Some address in the range old_address to old_address+old_size is an invalid virtual memory address for this process. You can also get EFAULT even if there exist mappings that cover the whole address space requested, but those mappings are of different types.
EINVAL
An invalid argument was given. Possible causes are:
old_address was not page aligned;
a value other than MREMAP_MAYMOVE or MREMAP_FIXED or MREMAP_DONTUNMAP was specified in flags;
new_size was zero;
new_size or new_address was invalid;
the new address range specified by new_address and new_size overlapped the old address range specified by old_address and old_size;
MREMAP_FIXED or MREMAP_DONTUNMAP was specified without also specifying MREMAP_MAYMOVE;
MREMAP_DONTUNMAP was specified, but one or more pages in the range specified by old_address and old_size were not private anonymous;
MREMAP_DONTUNMAP was specified and old_size was not equal to new_size;
old_size was zero and old_address does not refer to a shareable mapping (but see BUGS);
old_size was zero and the MREMAP_MAYMOVE flag was not specified.
ENOMEM
Not enough memory was available to complete the operation. Possible causes are:
The memory area cannot be expanded at the current virtual address, and the MREMAP_MAYMOVE flag is not set in flags. Or, there is not enough (virtual) memory available.
MREMAP_DONTUNMAP was used causing a new mapping to be created that would exceed the (virtual) memory available. Or, it would exceed the maximum number of allowed mappings.
STANDARDS
Linux.
HISTORY
Prior to glibc 2.4, glibc did not expose the definition of MREMAP_FIXED, and the prototype for mremap() did not allow for the new_address argument.
NOTES
mremap() changes the mapping between virtual addresses and memory pages. This can be used to implement a very efficient realloc(3).
In Linux, memory is divided into pages. A process has (one or) several linear virtual memory segments. Each virtual memory segment has one or more mappings to real memory pages (in the page table). Each virtual memory segment has its own protection (access rights), which may cause a segmentation violation (SIGSEGV) if the memory is accessed incorrectly (e.g., writing to a read-only segment). Accessing virtual memory outside of the segments will also cause a segmentation violation.
If mremap() is used to move or expand an area locked with mlock(2) or equivalent, the mremap() call will make a best effort to populate the new area but will not fail with ENOMEM if the area cannot be populated.
MREMAP_DONTUNMAP use cases
Possible applications for MREMAP_DONTUNMAP include:
Non-cooperative userfaultfd(2): an application can yank out a virtual address range using MREMAP_DONTUNMAP and then employ a userfaultfd(2) handler to handle the page faults that subsequently occur as other threads in the process touch pages in the yanked range.
Garbage collection: MREMAP_DONTUNMAP can be used in conjunction with userfaultfd(2) to implement garbage collection algorithms (e.g., in a Java virtual machine). Such an implementation can be cheaper (and simpler) than conventional garbage collection techniques that involve marking pages with protection PROT_NONE in conjunction with the use of a SIGSEGV handler to catch accesses to those pages.
BUGS
Before Linux 4.14, if old_size was zero and the mapping referred to by old_address was a private mapping (see the description of MAP_PRIVATE in mmap(2)), mremap() created a new private mapping unrelated to the original mapping. This behavior was unintended and probably unexpected in user-space applications (since the intention of mremap() is to create a new mapping based on the original mapping). Since Linux 4.14, mremap() fails with the error EINVAL in this scenario.
SEE ALSO
brk(2), getpagesize(2), getrlimit(2), mlock(2), mmap(2), sbrk(2), malloc(3), realloc(3)
Your favorite textbook on operating systems for more information on paged memory (e.g., Modern Operating Systems by Andrew S. Tanenbaum, Inside Linux by Randolph Bentson, The Design of the UNIX Operating System by Maurice J. Bach)
494 - Linux cli command oldstat
NAME 🖥️ oldstat 🖥️
get file status
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/stat.h>
int stat(const char *restrict pathname,
struct stat *restrict statbuf);
int fstat(int fd, struct stat *statbuf);
int lstat(const char *restrict pathname,
struct stat *restrict statbuf);
#include <fcntl.h> /* Definition of AT_* constants */
#include <sys/stat.h>
int fstatat(int dirfd, const char *restrict pathname,
struct stat *restrict statbuf, int flags);
Feature Test Macro Requirements for glibc (see feature_test_macros(7)):
lstat():
/* Since glibc 2.20 */ _DEFAULT_SOURCE
|| _XOPEN_SOURCE >= 500
|| /* Since glibc 2.10: */ _POSIX_C_SOURCE >= 200112L
|| /* glibc 2.19 and earlier */ _BSD_SOURCE
fstatat():
Since glibc 2.10:
_POSIX_C_SOURCE >= 200809L
Before glibc 2.10:
_ATFILE_SOURCE
DESCRIPTION
These functions return information about a file, in the buffer pointed to by statbuf. No permissions are required on the file itself, but, in the case of stat(), fstatat(), and lstat(), execute (search) permission is required on all of the directories in pathname that lead to the file.
stat() and fstatat() retrieve information about the file pointed to by pathname; the differences for fstatat() are described below.
lstat() is identical to stat(), except that if pathname is a symbolic link, then it returns information about the link itself, not the file that the link refers to.
fstat() is identical to stat(), except that the file about which information is to be retrieved is specified by the file descriptor fd.
The stat structure
All of these system calls return a stat structure (see stat(3type)).
Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call. For example, if st_mode or st_uid is changed by another process by calling chmod(2) or chown(2), stat() might return the old st_mode together with the new st_uid, or the old st_uid together with the new st_mode.
fstatat()
The fstatat() system call is a more general interface for accessing file information which can still provide exactly the behavior of each of stat(), lstat(), and fstat().
If the pathname given in pathname is relative, then it is interpreted relative to the directory referred to by the file descriptor dirfd (rather than relative to the current working directory of the calling process, as is done by stat() and lstat() for a relative pathname).
If pathname is relative and dirfd is the special value AT_FDCWD, then pathname is interpreted relative to the current working directory of the calling process (like stat() and lstat()).
If pathname is absolute, then dirfd is ignored.
flags can either be 0, or include one or more of the following flags ORed:
AT_EMPTY_PATH (since Linux 2.6.39)
If pathname is an empty string, operate on the file referred to by dirfd (which may have been obtained using the open(2) O_PATH flag). In this case, dirfd can refer to any type of file, not just a directory, and the behavior of fstatat() is similar to that of fstat(). If dirfd is AT_FDCWD, the call operates on the current working directory. This flag is Linux-specific; define _GNU_SOURCE to obtain its definition.
AT_NO_AUTOMOUNT (since Linux 2.6.38)
Don’t automount the terminal (“basename”) component of pathname. Since Linux 3.1 this flag is ignored. Since Linux 4.11 this flag is implied.
AT_SYMLINK_NOFOLLOW
If pathname is a symbolic link, do not dereference it: instead return information about the link itself, like lstat(). (By default, fstatat() dereferences symbolic links, like stat().)
See openat(2) for an explanation of the need for fstatat().
RETURN VALUE
On success, zero is returned. On error, -1 is returned, and errno is set to indicate the error.
ERRORS
EACCES
Search permission is denied for one of the directories in the path prefix of pathname. (See also path_resolution(7).)
EBADF
fd is not a valid open file descriptor.
EBADF
(fstatat()) pathname is relative but dirfd is neither AT_FDCWD nor a valid file descriptor.
EFAULT
Bad address.
EINVAL
(fstatat()) Invalid flag specified in flags.
ELOOP
Too many symbolic links encountered while traversing the path.
ENAMETOOLONG
pathname is too long.
ENOENT
A component of pathname does not exist or is a dangling symbolic link.
ENOENT
pathname is an empty string and AT_EMPTY_PATH was not specified in flags.
ENOMEM
Out of memory (i.e., kernel memory).
ENOTDIR
A component of the path prefix of pathname is not a directory.
ENOTDIR
(fstatat()) pathname is relative and dirfd is a file descriptor referring to a file other than a directory.
EOVERFLOW
pathname or fd refers to a file whose size, inode number, or number of blocks cannot be represented in, respectively, the types off_t, ino_t, or blkcnt_t. This error can occur when, for example, an application compiled on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat() on a file whose size exceeds (1<<31)-1 bytes.
STANDARDS
POSIX.1-2008.
HISTORY
stat()
fstat()
lstat()
SVr4, 4.3BSD, POSIX.1-2001.
fstatat()
POSIX.1-2008. Linux 2.6.16, glibc 2.4.
According to POSIX.1-2001, lstat() on a symbolic link need return valid information only in the st_size field and the file type of the st_mode field of the stat structure. POSIX.1-2008 tightens the specification, requiring lstat() to return valid information in all fields except the mode bits in st_mode.
Use of the st_blocks and st_blksize fields may be less portable. (They were introduced in BSD. The interpretation differs between systems, and possibly on a single system when NFS mounts are involved.)
C library/kernel differences
Over time, increases in the size of the stat structure have led to three successive versions of stat(): sys_stat() (slot __NR_oldstat), sys_newstat() (slot __NR_stat), and sys_stat64() (slot __NR_stat64) on 32-bit platforms such as i386. The first two versions were already present in Linux 1.0 (albeit with different names); the last was added in Linux 2.4. Similar remarks apply for fstat() and lstat().
The kernel-internal versions of the stat structure dealt with by the different versions are, respectively:
__old_kernel_stat
The original structure, with rather narrow fields, and no padding.
stat
Larger st_ino field and padding added to various parts of the structure to allow for future expansion.
stat64
Even larger st_ino field, larger st_uid and st_gid fields to accommodate the Linux-2.4 expansion of UIDs and GIDs to 32 bits, and various other enlarged fields and further padding in the structure. (Various padding bytes were eventually consumed in Linux 2.6, with the advent of 32-bit device IDs and nanosecond components for the timestamp fields.)
The glibc stat() wrapper function hides these details from applications, invoking the most recent version of the system call provided by the kernel, and repacking the returned information if required for old binaries.
On modern 64-bit systems, life is simpler: there is a single stat() system call and the kernel deals with a stat structure that contains fields of a sufficient size.
The underlying system call employed by the glibc fstatat() wrapper function is actually called fstatat64() or, on some architectures, newfstatat().
EXAMPLES
The following program calls lstat() and displays selected fields in the returned stat structure.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <time.h>

int
main(int argc, char *argv[])
{
    struct stat sb;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (lstat(argv[1], &sb) == -1) {
        perror("lstat");
        exit(EXIT_FAILURE);
    }

    printf("ID of containing device: [%x,%x]\n",
           major(sb.st_dev), minor(sb.st_dev));

    printf("File type: ");

    switch (sb.st_mode & S_IFMT) {
    case S_IFBLK:  printf("block device\n");      break;
    case S_IFCHR:  printf("character device\n");  break;
    case S_IFDIR:  printf("directory\n");         break;
    case S_IFIFO:  printf("FIFO/pipe\n");         break;
    case S_IFLNK:  printf("symlink\n");           break;
    case S_IFREG:  printf("regular file\n");      break;
    case S_IFSOCK: printf("socket\n");            break;
    default:       printf("unknown?\n");          break;
    }

    printf("I-node number: %ju\n", (uintmax_t) sb.st_ino);
    printf("Mode: %jo (octal)\n", (uintmax_t) sb.st_mode);
    printf("Link count: %ju\n", (uintmax_t) sb.st_nlink);
    printf("Ownership: UID=%ju GID=%ju\n",
           (uintmax_t) sb.st_uid, (uintmax_t) sb.st_gid);
    printf("Preferred I/O block size: %jd bytes\n",
           (intmax_t) sb.st_blksize);
    printf("File size: %jd bytes\n", (intmax_t) sb.st_size);
    printf("Blocks allocated: %jd\n", (intmax_t) sb.st_blocks);
    printf("Last status change: %s", ctime(&sb.st_ctime));
    printf("Last file access: %s", ctime(&sb.st_atime));
    printf("Last file modification: %s", ctime(&sb.st_mtime));

    exit(EXIT_SUCCESS);
}
SEE ALSO
ls(1), stat(1), access(2), chmod(2), chown(2), readlink(2), statx(2), utime(2), stat(3type), capabilities(7), inode(7), symlink(7)
495 - Linux cli command timerfd_settime
NAME 🖥️ timerfd_settime 🖥️
timers that notify via file descriptors
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/timerfd.h>
int timerfd_create(int clockid, int flags);
int timerfd_settime(int fd, int flags,
const struct itimerspec *new_value,
struct itimerspec *_Nullable old_value);
int timerfd_gettime(int fd, struct itimerspec *curr_value);
DESCRIPTION
These system calls create and operate on a timer that delivers timer expiration notifications via a file descriptor. They provide an alternative to the use of setitimer(2) or timer_create(2), with the advantage that the file descriptor may be monitored by select(2), poll(2), and epoll(7).
The use of these three system calls is analogous to the use of timer_create(2), timer_settime(2), and timer_gettime(2). (There is no analog of timer_getoverrun(2), since that functionality is provided by read(2), as described below.)
timerfd_create()
timerfd_create() creates a new timer object, and returns a file descriptor that refers to that timer. The clockid argument specifies the clock that is used to mark the progress of the timer, and must be one of the following:
CLOCK_REALTIME
A settable system-wide real-time clock.
CLOCK_MONOTONIC
A nonsettable monotonically increasing clock that measures time from some unspecified point in the past that does not change after system startup.
CLOCK_BOOTTIME (since Linux 3.15)
Like CLOCK_MONOTONIC, this is a monotonically increasing clock. However, whereas the CLOCK_MONOTONIC clock does not measure the time while a system is suspended, the CLOCK_BOOTTIME clock does include the time during which the system is suspended. This is useful for applications that need to be suspend-aware. CLOCK_REALTIME is not suitable for such applications, since that clock is affected by discontinuous changes to the system clock.
CLOCK_REALTIME_ALARM (since Linux 3.11)
This clock is like CLOCK_REALTIME, but will wake the system if it is suspended. The caller must have the CAP_WAKE_ALARM capability in order to set a timer against this clock.
CLOCK_BOOTTIME_ALARM (since Linux 3.11)
This clock is like CLOCK_BOOTTIME, but will wake the system if it is suspended. The caller must have the CAP_WAKE_ALARM capability in order to set a timer against this clock.
See clock_getres(2) for some further details on the above clocks.
The current value of each of these clocks can be retrieved using clock_gettime(2).
Starting with Linux 2.6.27, the following values may be bitwise ORed in flags to change the behavior of timerfd_create():
TFD_NONBLOCK
Set the O_NONBLOCK file status flag on the open file description (see open(2)) referred to by the new file descriptor. Using this flag saves extra calls to fcntl(2) to achieve the same result.
TFD_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
In Linux versions up to and including 2.6.26, flags must be specified as zero.
timerfd_settime()
timerfd_settime() arms (starts) or disarms (stops) the timer referred to by the file descriptor fd.
The new_value argument specifies the initial expiration and interval for the timer. The itimerspec structure used for this argument is described in itimerspec(3type).
new_value.it_value specifies the initial expiration of the timer, in seconds and nanoseconds. Setting either field of new_value.it_value to a nonzero value arms the timer. Setting both fields of new_value.it_value to zero disarms the timer.
Setting one or both fields of new_value.it_interval to nonzero values specifies the period, in seconds and nanoseconds, for repeated timer expirations after the initial expiration. If both fields of new_value.it_interval are zero, the timer expires just once, at the time specified by new_value.it_value.
By default, the initial expiration time specified in new_value is interpreted relative to the current time on the timer’s clock at the time of the call (i.e., new_value.it_value specifies a time relative to the current value of the clock specified by clockid). An absolute timeout can be selected via the flags argument.
The flags argument is a bit mask that can include the following values:
TFD_TIMER_ABSTIME
Interpret new_value.it_value as an absolute value on the timer’s clock. The timer will expire when the value of the timer’s clock reaches the value specified in new_value.it_value.
TFD_TIMER_CANCEL_ON_SET
If this flag is specified along with TFD_TIMER_ABSTIME and the clock for this timer is CLOCK_REALTIME or CLOCK_REALTIME_ALARM, then mark this timer as cancelable if the real-time clock undergoes a discontinuous change (settimeofday(2), clock_settime(2), or similar). When such changes occur, a current or future read(2) from the file descriptor will fail with the error ECANCELED.
If the old_value argument is not NULL, then the itimerspec structure that it points to is used to return the setting of the timer that was current at the time of the call; see the description of timerfd_gettime() following.
timerfd_gettime()
timerfd_gettime() returns, in curr_value, an itimerspec structure that contains the current setting of the timer referred to by the file descriptor fd.
The it_value field returns the amount of time until the timer will next expire. If both fields of this structure are zero, then the timer is currently disarmed. This field always contains a relative value, regardless of whether the TFD_TIMER_ABSTIME flag was specified when setting the timer.
The it_interval field returns the interval of the timer. If both fields of this structure are zero, then the timer is set to expire just once, at the time specified by curr_value.it_value.
Operating on a timer file descriptor
The file descriptor returned by timerfd_create() supports the following additional operations:
read(2)
If the timer has already expired one or more times since its settings were last modified using timerfd_settime(), or since the last successful read(2), then the buffer given to read(2) returns an unsigned 8-byte integer (uint64_t) containing the number of expirations that have occurred. (The returned value is in host byte order, that is, the native byte order for integers on the host machine.)
If no timer expirations have occurred at the time of the read(2), then the call either blocks until the next timer expiration, or fails with the error EAGAIN if the file descriptor has been made nonblocking (via the use of the fcntl(2) F_SETFL operation to set the O_NONBLOCK flag).
A read(2) fails with the error EINVAL if the size of the supplied buffer is less than 8 bytes.
If the associated clock is either CLOCK_REALTIME or CLOCK_REALTIME_ALARM, the timer is absolute (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET was specified when calling timerfd_settime(), then read(2) fails with the error ECANCELED if the real-time clock undergoes a discontinuous change. (This allows the reading application to discover such discontinuous changes to the clock.)
If the associated clock is either CLOCK_REALTIME or CLOCK_REALTIME_ALARM, the timer is absolute (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET was not specified when calling timerfd_settime(), then a discontinuous negative change to the clock (e.g., clock_settime(2)) may cause read(2) to unblock, but return a value of 0 (i.e., no bytes read), if the clock change occurs after the time expired, but before the read(2) on the file descriptor.
poll(2)
select(2)
(and similar)
The file descriptor is readable (the select(2) readfds argument; the poll(2) POLLIN flag) if one or more timer expirations have occurred.
The file descriptor also supports the other file-descriptor multiplexing APIs: pselect(2), ppoll(2), and epoll(7).
ioctl(2)
The following timerfd-specific command is supported:
TFD_IOC_SET_TICKS (since Linux 3.17)
Adjust the number of timer expirations that have occurred. The argument is a pointer to a nonzero 8-byte integer (uint64_t*) containing the new number of expirations. Once the number is set, any waiter on the timer is woken up. The only purpose of this command is to restore the expirations for the purpose of checkpoint/restore. This operation is available only if the kernel was configured with the CONFIG_CHECKPOINT_RESTORE option.
close(2)
When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same timer object have been closed, the timer is disarmed and its resources are freed by the kernel.
fork(2) semantics
After a fork(2), the child inherits a copy of the file descriptor created by timerfd_create(). The file descriptor refers to the same underlying timer object as the corresponding file descriptor in the parent, and read(2)s in the child will return information about expirations of the timer.
execve(2) semantics
A file descriptor created by timerfd_create() is preserved across execve(2), and continues to generate timer expirations if the timer was armed.
RETURN VALUE
On success, timerfd_create() returns a new file descriptor. On error, -1 is returned and errno is set to indicate the error.
timerfd_settime() and timerfd_gettime() return 0 on success; on error they return -1, and set errno to indicate the error.
ERRORS
timerfd_create() can fail with the following errors:
EINVAL
The clockid is not valid.
EINVAL
flags is invalid; or, in Linux 2.6.26 or earlier, flags is nonzero.
EMFILE
The per-process limit on the number of open file descriptors has been reached.
ENFILE
The system-wide limit on the total number of open files has been reached.
ENODEV
Could not mount (internal) anonymous inode device.
ENOMEM
There was insufficient kernel memory to create the timer.
EPERM
clockid was CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM but the caller did not have the CAP_WAKE_ALARM capability.
timerfd_settime() and timerfd_gettime() can fail with the following errors:
EBADF
fd is not a valid file descriptor.
EFAULT
new_value, old_value, or curr_value is not a valid pointer.
EINVAL
fd is not a valid timerfd file descriptor.
timerfd_settime() can also fail with the following errors:
ECANCELED
See NOTES.
EINVAL
new_value is not properly initialized (one of the tv_nsec falls outside the range zero to 999,999,999).
EINVAL
flags is invalid.
STANDARDS
Linux.
HISTORY
Linux 2.6.25, glibc 2.8.
NOTES
Suppose the following scenario for a CLOCK_REALTIME or CLOCK_REALTIME_ALARM timer that was created with timerfd_create():
The timer has been started (timerfd_settime()) with the TFD_TIMER_ABSTIME and TFD_TIMER_CANCEL_ON_SET flags;
a discontinuous change (e.g., settimeofday(2)) is subsequently made to the CLOCK_REALTIME clock; and
the caller once more calls timerfd_settime() to rearm the timer (without first doing a read(2) on the file descriptor).
In this case the following occurs:
The timerfd_settime() call returns -1 with errno set to ECANCELED. (This enables the caller to know that the previous timer was affected by a discontinuous change to the clock.)
The timer is successfully rearmed with the settings provided in the second timerfd_settime() call. (This was probably an implementation accident, but won’t be fixed now, in case there are applications that depend on this behaviour.)
BUGS
Currently, timerfd_create() supports fewer types of clock IDs than timer_create(2).
EXAMPLES
The following program creates a timer and then monitors its progress. The program accepts up to three command-line arguments. The first argument specifies the number of seconds for the initial expiration of the timer. The second argument specifies the interval for the timer, in seconds. The third argument specifies the number of times the program should allow the timer to expire before terminating. The second and third command-line arguments are optional.
The following shell session demonstrates the use of the program:
$ a.out 3 1 100
0.000: timer started
3.000: read: 1; total=1
4.000: read: 1; total=2
^Z # type control-Z to suspend the program
[1]+ Stopped ./timerfd3_demo 3 1 100
$ fg # Resume execution after a few seconds
a.out 3 1 100
9.660: read: 5; total=7
10.000: read: 1; total=8
11.000: read: 1; total=9
^C # type control-C to terminate the program
Program source
#include <err.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/timerfd.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

static void
print_elapsed_time(void)
{
    int                     secs, nsecs;
    static int              first_call = 1;
    struct timespec         curr;
    static struct timespec  start;

    if (first_call) {
        first_call = 0;
        if (clock_gettime(CLOCK_MONOTONIC, &start) == -1)
            err(EXIT_FAILURE, "clock_gettime");
    }

    if (clock_gettime(CLOCK_MONOTONIC, &curr) == -1)
        err(EXIT_FAILURE, "clock_gettime");

    secs = curr.tv_sec - start.tv_sec;
    nsecs = curr.tv_nsec - start.tv_nsec;
    if (nsecs < 0) {
        secs--;
        nsecs += 1000000000;
    }
    printf("%d.%03d: ", secs, (nsecs + 500000) / 1000000);
}

int
main(int argc, char *argv[])
{
    int                fd;
    ssize_t            s;
    uint64_t           exp, tot_exp, max_exp;
    struct timespec    now;
    struct itimerspec  new_value;

    if (argc != 2 && argc != 4) {
        fprintf(stderr, "%s init-secs [interval-secs max-exp]\n",
                argv[0]);
        exit(EXIT_FAILURE);
    }

    if (clock_gettime(CLOCK_REALTIME, &now) == -1)
        err(EXIT_FAILURE, "clock_gettime");

    /* Create a CLOCK_REALTIME absolute timer with initial
       expiration and interval as specified in command line. */

    new_value.it_value.tv_sec = now.tv_sec + atoi(argv[1]);
    new_value.it_value.tv_nsec = now.tv_nsec;
    if (argc == 2) {
        new_value.it_interval.tv_sec = 0;
        max_exp = 1;
    } else {
        new_value.it_interval.tv_sec = atoi(argv[2]);
        max_exp = atoi(argv[3]);
    }
    new_value.it_interval.tv_nsec = 0;

    fd = timerfd_create(CLOCK_REALTIME, 0);
    if (fd == -1)
        err(EXIT_FAILURE, "timerfd_create");

    if (timerfd_settime(fd, TFD_TIMER_ABSTIME, &new_value, NULL) == -1)
        err(EXIT_FAILURE, "timerfd_settime");

    print_elapsed_time();
    printf("timer started\n");

    for (tot_exp = 0; tot_exp < max_exp;) {
        s = read(fd, &exp, sizeof(uint64_t));
        if (s != sizeof(uint64_t))
            err(EXIT_FAILURE, "read");

        tot_exp += exp;
        print_elapsed_time();
        printf("read: %" PRIu64 "; total=%" PRIu64 "\n", exp, tot_exp);
    }

    exit(EXIT_SUCCESS);
}
SEE ALSO
eventfd(2), poll(2), read(2), select(2), setitimer(2), signalfd(2), timer_create(2), timer_gettime(2), timer_settime(2), timespec(3), epoll(7), time(7)
496 - Linux cli command outb
NAME 🖥️ outb 🖥️
port I/O
LIBRARY
Standard C library (libc, -lc)
SYNOPSIS
#include <sys/io.h>
unsigned char inb(unsigned short port);
unsigned char inb_p(unsigned short port);
unsigned short inw(unsigned short port);
unsigned short inw_p(unsigned short port);
unsigned int inl(unsigned short port);
unsigned int inl_p(unsigned short port);
void outb(unsigned char value, unsigned short port);
void outb_p(unsigned char value, unsigned short port);
void outw(unsigned short value, unsigned short port);
void outw_p(unsigned short value, unsigned short port);
void outl(unsigned int value, unsigned short port);
void outl_p(unsigned int value, unsigned short port);
void insb(unsigned short port, void addr[.count],
unsigned long count);
void insw(unsigned short port, void addr[.count],
unsigned long count);
void insl(unsigned short port, void addr[.count],
unsigned long count);
void outsb(unsigned short port, const void addr[.count],
unsigned long count);
void outsw(unsigned short port, const void addr[.count],
unsigned long count);
void outsl(unsigned short port, const void addr[.count],
unsigned long count);
DESCRIPTION
This family of functions is used to do low-level port input and output. The out* functions do port output, the in* functions do port input; the b-suffix functions are byte-width, the w-suffix functions word-width, and the l-suffix functions long-word-width (32 bits); the _p-suffix functions pause until the I/O completes.
They are primarily designed for internal kernel use, but can be used from user space.
You must compile with -O or -O2 or similar. The functions are defined as inline macros, and will not be substituted in without optimization enabled, causing unresolved references at link time.
Use ioperm(2), or alternatively iopl(2), to tell the kernel to allow the user-space application to access the I/O ports in question. Failure to do this will cause the application to receive a segmentation fault.
VERSIONS
outb() and friends are hardware-specific. The value argument is passed first and the port argument is passed second, which is the opposite order from most DOS implementations.
STANDARDS
None.
SEE ALSO
ioperm(2), iopl(2)