├── README.md └── mount-idmapped.c /README.md: -------------------------------------------------------------------------------- 1 | # Idmapped mounts 2 | 3 | This is a tiny tool to allow the creation of idmapped mounts. In order for this 4 | to work you need to be on a kernel with support for the `mount_setattr()` 5 | syscall, i.e. at least Linux 5.12. 6 | 7 | Note that this tool is not really meant to be production software. 8 | It was mainly written to allow users to test the patchset during the review 9 | process and in general to experiment with idmapped mounts. 10 | 11 | With util-linux v2.39 the functionality of `mount-idmapped` has been integrated 12 | into the `mount` utility via the `X-mount.idmap=` option for bind mounts. This 13 | uses the same syntax as `--map-mount=` option below. 14 | 15 | ``` 16 | mount-idmapped --map-mount= [--map-mount=] 17 | 18 | Create an idmapped mount of at 19 | Options: 20 | --map-mount= 21 | Specify an idmap for the mount in the format 22 | ::: 23 | The can be: 24 | "b" or "both" -> map both uids and gids 25 | "u" or "uid" -> map uids 26 | "g" or "gid" -> map gids 27 | For example, specifying: 28 | both:1000:1001:1 -> map uid and gid 1000 to uid and gid 1001 in and no other ids 29 | uid:20000:100000:1000 -> map uid 20000 to uid 100000, uid 20001 to uid 100001 [...] in 30 | Currently up to 340 separate idmappings may be specified. 31 | 32 | --map-mount=/proc//ns/user 33 | Specify a path to a user namespace whose idmap is to be used. 34 | 35 | --map-caller= 36 | Specify an idmap to be used for the caller, i.e. move the caller into a new user namespace 37 | with the requested mapping. 38 | 39 | --recursive 40 | Copy the whole mount tree from and apply the idmap to everyone at . 
41 | 42 | Examples: 43 | - Create an idmapped mount of /source on /target with both ('b') uids and gids mapped: 44 | mount-idmapped --map-mount=b:0:10000:10000 /source /target 45 | 46 | - Create an idmapped mount of /source on /target with uids ('u') and gids ('g') mapped separately: 47 | mount-idmapped --map-mount=u:0:10000:10000 --map-mount=g:0:20000:20000 /source /target 48 | 49 | - Create an idmapped mount of /source on /target with both ('b') uids and gids mapped and a user namespace 50 | with both ('b') uids and gids mapped: 51 | mount-idmapped --map-caller=b:0:10000:10000 --map-mount=b:0:10000:1000 /source /target 52 | 53 | - Create an idmapped mount of /source on /target with uids ('u') and gids ('g') mapped separately 54 | and a user namespace with both ('b') uids and gids mapped: 55 | mount-idmapped --map-caller=u:0:10000:10000 --map-mount=g:0:20000:20000 --map-mount=b:0:10000:1000 /source /target 56 | ``` 57 | 58 | The tool is based on the `mount_setattr()` syscall. A man page is currently up 59 | for review but it will likely take a while for it to show up in distros. So for 60 | the curious here it is: 61 | 62 | NAME 63 | ==== 64 | 65 | mount\_setattr - change properties of a mount or mount tree 66 | 67 | SYNOPSIS 68 | ======== 69 | 70 | 71 | 72 | #include <linux/fcntl.h> /* Definition of AT_* constants */ 73 | #include <linux/mount.h> /* Definition of MOUNT_ATTR_* constants */ 74 | #include <sys/syscall.h> /* Definition of SYS_* constants */ 75 | #include <unistd.h> 76 | 77 | int syscall(SYS_mount_setattr, int dirfd, const char *pathname, 78 | unsigned int flags, struct mount_attr *attr, size_t size); 79 | 80 | *Note*: glibc provides no wrapper for **mount\_setattr**(), 81 | necessitating the use of **syscall**(2). 82 | 83 | DESCRIPTION 84 | =========== 85 | 86 | The **mount\_setattr**() system call changes the mount properties of a 87 | mount or an entire mount tree. If *pathname* is a relative pathname, 88 | then it is interpreted relative to the directory referred to by the file 89 | descriptor *dirfd*.
If *dirfd* is the special value **AT\_FDCWD**, then 90 | *pathname* is interpreted relative to the current working directory of 91 | the calling process. If *pathname* is the empty string and 92 | **AT\_EMPTY\_PATH** is specified in *flags*, then the mount properties 93 | of the mount identified by *dirfd* are changed. (See **openat**(2) for 94 | an explanation of why the *dirfd* argument is useful.) 95 | 96 | The **mount\_setattr**() system call uses an extensible structure 97 | (*struct mount\_attr*) to allow for future extensions. Any non-flag 98 | extensions to **mount\_setattr**() will be implemented as new fields 99 | appended to this structure, with a zero value in a new field 100 | resulting in the kernel behaving as though that extension field was not 101 | present. Therefore, the caller *must* zero-fill this structure on 102 | initialization. See the \"Extensibility\" subsection under **NOTES** for 103 | more details. 104 | 105 | The *size* argument should usually be specified as *sizeof(struct 106 | mount\_attr)*. However, if the caller is using a kernel that supports an 107 | extended *struct mount\_attr*, but the caller does not intend to make 108 | use of these features, it is possible to pass the size of an earlier 109 | version of the structure together with the extended structure. This 110 | allows the kernel to not copy later parts of the structure that aren\'t 111 | used anyway. With each extension that changes the size of *struct 112 | mount\_attr*, the kernel will expose a definition of the form 113 | **MOUNT\_ATTR\_SIZE\_VER***number*. For example, the macro for the size 114 | of the initial version of *struct mount\_attr* is 115 | **MOUNT\_ATTR\_SIZE\_VER0**. 116 | 117 | The *flags* argument can be used to alter the pathname resolution 118 | behavior. The supported values are: 119 | 120 | **AT\_EMPTY\_PATH** 121 | 122 | - If *pathname* is the empty string, change the mount properties on 123 | *dirfd* itself.
124 | 125 | **AT\_RECURSIVE** 126 | 127 | - Change the mount properties of the entire mount tree. 128 | 129 | **AT\_SYMLINK\_NOFOLLOW** 130 | 131 | - Don\'t follow trailing symbolic links. 132 | 133 | **AT\_NO\_AUTOMOUNT** 134 | 135 | - Don\'t trigger automounts. 136 | 137 | The *attr* argument of **mount\_setattr**() is a structure of the 138 | following form: 139 | 140 | ```c 141 | struct mount_attr { 142 | __u64 attr_set; /* Mount properties to set */ 143 | __u64 attr_clr; /* Mount properties to clear */ 144 | __u64 propagation; /* Mount propagation type */ 145 | __u64 userns_fd; /* User namespace file descriptor */ 146 | }; 147 | ``` 148 | 149 | The *attr\_set* and *attr\_clr* members are used to specify the mount 150 | properties that are supposed to be set or cleared for a mount or mount 151 | tree. Flags set in *attr\_set* enable a property on a mount or mount 152 | tree, and flags set in *attr\_clr* remove a property from a mount or 153 | mount tree. 154 | 155 | When changing mount properties, the kernel will first clear the flags 156 | specified in the *attr\_clr* field, and then set the flags specified in 157 | the *attr\_set* field. For example, these settings: 158 | 159 | ```c 160 | struct mount_attr attr = { 161 | .attr_clr = MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODEV, 162 | .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID, 163 | }; 164 | ``` 165 | 166 | are equivalent to the following steps: 167 | 168 | ```c 169 | unsigned int current_mnt_flags = mnt->mnt_flags; 170 | 171 | /* 172 | * Clear all flags set in .attr_clr, 173 | * clearing MOUNT_ATTR_NOEXEC and MOUNT_ATTR_NODEV. 174 | */ 175 | current_mnt_flags &= ~attr->attr_clr; 176 | 177 | /* 178 | * Now set all flags set in .attr_set, 179 | * applying MOUNT_ATTR_RDONLY and MOUNT_ATTR_NOSUID. 
180 | */ 181 | current_mnt_flags |= attr->attr_set; 182 | 183 | mnt->mnt_flags = current_mnt_flags; 184 | ``` 185 | 186 | As a result of this change, the mount or mount tree (a) is read-only; 187 | (b) blocks the execution of set-user-ID and set-group-ID programs; (c) 188 | allows execution of programs; and (d) allows access to devices. 189 | 190 | Multiple changes with the same set of flags requested in *attr\_clr* and 191 | *attr\_set* are guaranteed to be idempotent after the changes have been 192 | applied. 193 | 194 | The following mount attributes can be specified in the *attr\_set* or 195 | *attr\_clr* fields: 196 | 197 | **MOUNT\_ATTR\_RDONLY** 198 | 199 | - If set in *attr\_set*, makes the mount read-only. If set in 200 | *attr\_clr*, removes the read-only setting if set on the mount. 201 | 202 | **MOUNT\_ATTR\_NOSUID** 203 | 204 | - If set in *attr\_set*, causes the mount not to honor the set-user-ID 205 | and set-group-ID mode bits and file capabilities when executing 206 | programs. If set in *attr\_clr*, clears the set-user-ID, 207 | set-group-ID, and file capability restriction if set on this mount. 208 | 209 | **MOUNT\_ATTR\_NODEV** 210 | 211 | - If set in *attr\_set*, prevents access to devices on this mount. If 212 | set in *attr\_clr*, removes the restriction that prevented accessing 213 | devices on this mount. 214 | 215 | **MOUNT\_ATTR\_NOEXEC** 216 | 217 | - If set in *attr\_set*, prevents executing programs on this mount. If 218 | set in *attr\_clr*, removes the restriction that prevented executing 219 | programs on this mount. 220 | 221 | **MOUNT\_ATTR\_NOSYMFOLLOW** 222 | 223 | - If set in *attr\_set*, prevents following symbolic links on this 224 | mount. If set in *attr\_clr*, removes the restriction that prevented 225 | following symbolic links on this mount. 226 | 227 | **MOUNT\_ATTR\_NODIRATIME** 228 | 229 | - If set in *attr\_set*, prevents updating access time for directories 230 | on this mount. 
If set in *attr\_clr*, removes the restriction that 231 | prevented updating access time for directories. Note that 232 | **MOUNT\_ATTR\_NODIRATIME** can be combined with other access-time 233 | settings and is implied by the noatime setting. All other 234 | access-time settings are mutually exclusive. 235 | 236 | **MOUNT\_ATTR\_\_ATIME** - changing access-time settings 237 | 238 | - The access-time values listed below are an enumeration that includes 239 | the value zero, expressed in the bits defined by the mask 240 | **MOUNT\_ATTR\_\_ATIME**. Even though these bits are an enumeration 241 | (in contrast to the other mount flags such as 242 | **MOUNT\_ATTR\_NOEXEC**), they are nonetheless passed in *attr\_set* 243 | and *attr\_clr* for consistency with **fsmount**(2), which 244 | introduced this behavior. 245 | 246 | Note that, since the access-time values are an enumeration rather 247 | than bit values, a caller wanting to transition to a different 248 | access-time setting cannot simply specify the access-time setting in 249 | *attr\_set*, but must also include **MOUNT\_ATTR\_\_ATIME** in the 250 | *attr\_clr* field. The kernel will verify that 251 | **MOUNT\_ATTR\_\_ATIME** isn\'t partially set in *attr\_clr* (i.e., 252 | all bits in the **MOUNT\_ATTR\_\_ATIME** bit field are either 253 | set or clear), and that *attr\_set* doesn\'t have any access-time 254 | bits set if **MOUNT\_ATTR\_\_ATIME** isn\'t set in *attr\_clr*. 255 | 256 | **MOUNT\_ATTR\_RELATIME** 257 | 258 | - When a file is accessed via this mount, update the file\'s last 259 | access time (atime) only if the current value of atime is less 260 | than or equal to the file\'s last modification time (mtime) or 261 | last status change time (ctime). 262 | 263 | To enable this access-time setting on a mount or mount tree, 264 | **MOUNT\_ATTR\_RELATIME** must be set in *attr\_set* and 265 | **MOUNT\_ATTR\_\_ATIME** must be set in the *attr\_clr* field.
266 | 267 | **MOUNT\_ATTR\_NOATIME** 268 | 269 | - Do not update access times for (all types of) files on this 270 | mount. 271 | 272 | To enable this access-time setting on a mount or mount tree, 273 | **MOUNT\_ATTR\_NOATIME** must be set in *attr\_set* and 274 | **MOUNT\_ATTR\_\_ATIME** must be set in the *attr\_clr* field. 275 | 276 | **MOUNT\_ATTR\_STRICTATIME** 277 | 278 | - Always update the last access time (atime) when files are 279 | accessed on this mount. 280 | 281 | To enable this access-time setting on a mount or mount tree, 282 | **MOUNT\_ATTR\_STRICTATIME** must be set in *attr\_set* and 283 | **MOUNT\_ATTR\_\_ATIME** must be set in the *attr\_clr* field. 284 | 285 | **MOUNT\_ATTR\_IDMAP** 286 | 287 | - If set in *attr\_set*, creates an ID-mapped mount. The ID mapping is 288 | taken from the user namespace specified in *userns\_fd* and attached 289 | to the mount. 290 | 291 | Since it is not supported to change the ID mapping of a mount after 292 | it has been ID mapped, it is invalid to specify 293 | **MOUNT\_ATTR\_IDMAP** in *attr\_clr*. 294 | 295 | For further details, see the subsection \"ID-mapped mounts\" under 296 | NOTES. 297 | 298 | The *propagation* field is used to specify the propagation type of the 299 | mount or mount tree. This field either has the value zero, meaning leave 300 | the propagation type unchanged, or it has one of the following values: 301 | 302 | **MS\_PRIVATE** 303 | 304 | - Turn all mounts into private mounts. 305 | 306 | **MS\_SHARED** 307 | 308 | - Turn all mounts into shared mounts. 309 | 310 | **MS\_SLAVE** 311 | 312 | - Turn all mounts into dependent mounts. 313 | 314 | **MS\_UNBINDABLE** 315 | 316 | - Turn all mounts into unbindable mounts. 317 | 318 | For further details on the above propagation types, see 319 | **mount\_namespaces**(7). 320 | 321 | RETURN VALUE 322 | ============ 323 | 324 | On success, **mount\_setattr**() returns zero. 
On error, -1 is returned 325 | and *errno* is set to indicate the cause of the error. 326 | 327 | ERRORS 328 | ====== 329 | 330 | **EBADF** 331 | 332 | - *pathname* is relative but *dirfd* is neither **AT\_FDCWD** nor a 333 | valid file descriptor. 334 | 335 | **EBADF** 336 | 337 | - *userns\_fd* is not a valid file descriptor. 338 | 339 | **EBUSY** 340 | 341 | - The caller tried to change the mount to **MOUNT\_ATTR\_RDONLY**, but 342 | the mount still holds files open for writing. 343 | 344 | **EINVAL** 345 | 346 | - The pathname specified via the *dirfd* and *pathname* arguments to 347 | **mount\_setattr**() isn\'t a mount point. 348 | 349 | **EINVAL** 350 | 351 | - An unsupported value was set in *flags*. 352 | 353 | **EINVAL** 354 | 355 | - An unsupported value was specified in the *attr\_set* field of 356 | *mount\_attr*. 357 | 358 | **EINVAL** 359 | 360 | - An unsupported value was specified in the *attr\_clr* field of 361 | *mount\_attr*. 362 | 363 | **EINVAL** 364 | 365 | - An unsupported value was specified in the *propagation* field of 366 | *mount\_attr*. 367 | 368 | **EINVAL** 369 | 370 | - More than one of **MS\_SHARED**, **MS\_SLAVE**, **MS\_PRIVATE**, or 371 | **MS\_UNBINDABLE** was set in the *propagation* field of 372 | *mount\_attr*. 373 | 374 | **EINVAL** 375 | 376 | - An access-time setting was specified in the *attr\_set* field 377 | without **MOUNT\_ATTR\_\_ATIME** being set in the *attr\_clr* field. 378 | 379 | **EINVAL** 380 | 381 | - **MOUNT\_ATTR\_IDMAP** was specified in *attr\_clr*. 382 | 383 | **EINVAL** 384 | 385 | - A file descriptor value was specified in *userns\_fd* which exceeds 386 | **INT\_MAX**. 387 | 388 | **EINVAL** 389 | 390 | - A valid file descriptor value was specified in *userns\_fd*, but the 391 | file descriptor did not refer to a user namespace. 392 | 393 | **EINVAL** 394 | 395 | - The underlying filesystem does not support ID-mapped mounts. 
396 | 397 | **EINVAL** 398 | 399 | - The mount that is to be ID mapped is not a detached mount; that is, 400 | the mount has not previously been visible in a mount namespace. 401 | 402 | **EINVAL** 403 | 404 | - A partial access-time setting was specified in *attr\_clr* instead 405 | of **MOUNT\_ATTR\_\_ATIME** being set. 406 | 407 | **EINVAL** 408 | 409 | - The mount is located outside the caller\'s mount namespace. 410 | 411 | **EINVAL** 412 | 413 | - The underlying filesystem has been mounted in a mount namespace that 414 | is owned by a noninitial user namespace. 415 | 416 | **ENOENT** 417 | 418 | - A pathname was empty or had a nonexistent component. 419 | 420 | **ENOMEM** 421 | 422 | - When changing mount propagation to **MS\_SHARED**, a new peer group 423 | ID needs to be allocated for all mounts without a peer group ID set. 424 | This allocation failed because there was not enough memory to 425 | allocate the relevant internal structures. 426 | 427 | **ENOSPC** 428 | 429 | - When changing mount propagation to **MS\_SHARED**, a new peer group 430 | ID needs to be allocated for all mounts without a peer group ID set. 431 | This allocation failed because the kernel has run out of IDs. 432 | 433 | **EPERM** 434 | 435 | - One of the mounts had at least one of **MOUNT\_ATTR\_NOATIME**, 436 | **MOUNT\_ATTR\_NODEV**, **MOUNT\_ATTR\_NODIRATIME**, 437 | **MOUNT\_ATTR\_NOEXEC**, **MOUNT\_ATTR\_NOSUID**, or 438 | **MOUNT\_ATTR\_RDONLY** set and the flag is locked. Mount attributes 439 | become locked on a mount if: 440 | 441 | - A new mount or mount tree is created causing mount propagation 442 | across user namespaces (i.e., propagation to a mount namespace 443 | owned by a different user namespace). The kernel will lock the 444 | aforementioned flags to prevent these sensitive properties from 445 | being altered. 446 | 447 | - A new mount and user namespace pair is created.
This happens for 448 | example when specifying **CLONE\_NEWUSER \| CLONE\_NEWNS** in 449 | **unshare**(2), **clone**(2), or **clone3**(2). The 450 | aforementioned flags become locked in the new mount namespace to 451 | prevent sensitive mount properties from being altered. Since the 452 | newly created mount namespace will be owned by the newly created 453 | user namespace, a calling process that is privileged in the new 454 | user namespace would---in the absence of such locking---be able 455 | to alter sensitive mount properties (e.g., to remount a mount 456 | that was marked read-only as read-write in the new mount 457 | namespace). 458 | 459 | **EPERM** 460 | 461 | - A valid file descriptor value was specified in *userns\_fd*, but the 462 | file descriptor refers to the initial user namespace. 463 | 464 | **EPERM** 465 | 466 | - An attempt was made to add an ID mapping to a mount that is already 467 | ID mapped. 468 | 469 | **EPERM** 470 | 471 | - The caller does not have **CAP\_SYS\_ADMIN** in the initial user 472 | namespace. 473 | 474 | VERSIONS 475 | ======== 476 | 477 | **mount\_setattr**() first appeared in Linux 5.12. 478 | 479 | CONFORMING TO 480 | ============= 481 | 482 | **mount\_setattr**() is Linux-specific. 483 | 484 | NOTES 485 | ===== 486 | 487 | ID-mapped mounts 488 | ---------------- 489 | 490 | Creating an ID-mapped mount makes it possible to change the ownership of 491 | all files located under a mount. Thus, ID-mapped mounts make it possible 492 | to change ownership in a temporary and localized way. It is a localized 493 | change because the ownership changes are visible only via a specific 494 | mount. All other users and locations where the filesystem is exposed are 495 | unaffected. It is a temporary change because the ownership changes are 496 | tied to the lifetime of the mount. 
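
To make the effect of an ID mapping concrete, the following is a minimal sketch of how an owner ID stored by the filesystem could be translated into the ID seen through an ID-mapped mount. It is a simplified user-space model, not the kernel's implementation. The names `struct id_range`, `map_id()` and `OVERFLOW_ID` are hypothetical and exist only for this illustration. The sketch assumes the `<from>:<to>:<range>` reading used by the `--map-mount=` examples in this README, and it assumes the default overflow ID of 65534 for IDs that no range covers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for this sketch; not a kernel or libc API. */
#define OVERFLOW_ID 65534u /* assumed default overflow uid/gid */

struct id_range {
	uint32_t from;  /* first ID as stored by the filesystem */
	uint32_t to;    /* first ID as seen through the mount */
	uint32_t count; /* number of consecutive IDs covered */
};

/*
 * Translate 'id' through the first range that covers it.
 * IDs outside every range are reported as OVERFLOW_ID.
 */
static uint32_t
map_id(const struct id_range *ranges, size_t nranges, uint32_t id)
{
	assert(ranges != NULL || nranges == 0);

	for (size_t i = 0; i < nranges; i++) {
		const struct id_range *r = &ranges[i];

		/* Compare offsets so that from + count cannot overflow. */
		if (id >= r->from && id - r->from < r->count)
			return r->to + (id - r->from);
	}

	return OVERFLOW_ID;
}
```

With a single range `{ 0, 10000, 10000 }`, which corresponds to `--map-mount=b:0:10000:10000`, a file owned by ID 0 on disk would be reported as owned by ID 10000 through the mount, while the on-disk ownership and every other mount of the same filesystem stay unchanged.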
497 | 498 | Whenever callers interact with the filesystem through an ID-mapped 499 | mount, the ID mapping of the mount will be applied to user and group IDs 500 | associated with filesystem objects. This encompasses the user and group 501 | IDs associated with inodes and also the following **xattr**(7) keys: 502 | 503 | - *security.capability*, whenever filesystem capabilities are stored 504 | or returned in the **VFS\_CAP\_REVISION\_3** format, which stores a 505 | root user ID alongside the capabilities (see **capabilities**(7)). 506 | 507 | - *system.posix\_acl\_access* and *system.posix\_acl\_default*, 508 | whenever user IDs or group IDs are stored in **ACL\_USER** or 509 | **ACL\_GROUP** entries. 510 | 511 | The following conditions must be met in order to create an ID-mapped 512 | mount: 513 | 514 | - The caller must have the **CAP\_SYS\_ADMIN** capability in the 515 | initial user namespace. 516 | 517 | - The filesystem must be mounted in a mount namespace that is owned by 518 | the initial user namespace. 519 | 520 | - The underlying filesystem must support ID-mapped mounts. Currently, 521 | the **xfs**(5), **ext4**(5), and **FAT** filesystems support 522 | ID-mapped mounts with more filesystems being actively worked on. 523 | 524 | - The mount must not already be ID-mapped. This also implies that the 525 | ID mapping of a mount cannot be altered. 526 | 527 | - The mount must be a detached mount; that is, it must have been 528 | created by calling **open\_tree**(2) with the **OPEN\_TREE\_CLONE** 529 | flag and it must not already have been visible in a mount namespace. 530 | (To put things another way: the mount must not have been attached to 531 | the filesystem hierarchy with a system call such as 532 | **move\_mount**(2).) 533 | 534 | ID mappings can be created for user IDs, group IDs, and project IDs. An 535 | ID mapping is essentially a mapping of a range of user or group IDs into 536 | another or the same range of user or group IDs. 
ID mappings are written 537 | to map files as three numbers separated by white space. The first two 538 | numbers specify the starting user or group ID in each of the two user 539 | namespaces. The third number specifies the range of the ID mapping. For 540 | example, a mapping for user IDs such as \"1000 1001 1\" would indicate 541 | that user ID 1000 in the caller\'s user namespace is mapped to user ID 542 | 1001 in its ancestor user namespace. Since the map range is 1, only user 543 | ID 1000 is mapped. 544 | 545 | It is possible to specify up to 340 ID mappings for each ID mapping 546 | type. If any user IDs or group IDs are not mapped, all files owned by 547 | that unmapped user or group ID will appear as being owned by the 548 | overflow user ID or overflow group ID respectively. 549 | 550 | Further details on setting up ID mappings can be found in 551 | **user\_namespaces**(7). 552 | 553 | In the common case, the user namespace passed in *userns\_fd* (together 554 | with **MOUNT\_ATTR\_IDMAP** in *attr\_set*) to create an ID-mapped mount 555 | will be the user namespace of a container. In other scenarios it will be 556 | a dedicated user namespace associated with a user\'s login session (as is 557 | the case for portable home directories in **systemd-homed.service**(8)). 558 | It is also perfectly fine to create a dedicated user namespace for the 559 | sake of ID mapping a mount. 560 | 561 | ID-mapped mounts can be useful in the following and a variety of other 562 | scenarios: 563 | 564 | - Sharing files or filesystems between multiple users or multiple 565 | machines, especially in complex scenarios. For example, ID-mapped 566 | mounts are used to implement portable home directories in 567 | **systemd-homed.service**(8), where they allow users to move their 568 | home directory to an external storage device and use it on multiple 569 | computers where they are assigned different user IDs and group IDs.
570 | This effectively makes it possible to assign random user IDs and 571 | group IDs at login time. 572 | 573 | - Sharing files or filesystems from the host with unprivileged 574 | containers. This allows a user to avoid having to change ownership 575 | permanently through **chown**(2). 576 | 577 | - ID mapping a container\'s root filesystem. Users don\'t need to 578 | change ownership permanently through **chown**(2). Especially for 579 | large root filesystems, using **chown**(2) can be prohibitively 580 | expensive. 581 | 582 | - Sharing files or filesystems between containers with non-overlapping 583 | ID mappings. 584 | 585 | - Implementing discretionary access (DAC) permission checking for 586 | filesystems lacking a concept of ownership. 587 | 588 | - Efficiently changing ownership on a per-mount basis. In contrast to 589 | **chown**(2), changing ownership of large sets of files is 590 | instantaneous with ID-mapped mounts. This is especially useful when 591 | ownership of an entire root filesystem of a virtual machine or 592 | container is to be changed as mentioned above. With ID-mapped 593 | mounts, a single **mount\_setattr**() system call will be sufficient 594 | to change the ownership of all files. 595 | 596 | - Taking the current ownership into account. ID mappings specify 597 | precisely what a user or group ID is supposed to be mapped to. This 598 | contrasts with the **chown**(2) system call which cannot by itself 599 | take the current ownership of the files it changes into account. It 600 | simply changes the ownership to the specified user ID and group ID. 601 | 602 | - Locally and temporarily restricted ownership changes. ID-mapped 603 | mounts make it possible to change ownership locally, restricting the 604 | ownership changes to specific mounts, and temporarily as the 605 | ownership changes only apply as long as the mount exists. 
By 606 | contrast, changing ownership via the **chown**(2) system call 607 | changes the ownership globally and permanently. 608 | 609 | Extensibility 610 | ------------- 611 | 612 | In order to allow for future extensibility, **mount\_setattr**() 613 | requires the user-space application to specify the size of the 614 | *mount\_attr* structure that it is passing. By providing this 615 | information, it is possible for **mount\_setattr**() to provide both 616 | forwards- and backwards-compatibility, with *size* acting as an implicit 617 | version number. (Because new extension fields will always be appended, 618 | the structure size will always increase.) This extensibility design is 619 | very similar to other system calls such as **sched\_setattr**(2), 620 | **perf\_event\_open**(2), **clone3**(2) and **openat2**(2). 621 | 622 | Let *usize* be the size of the structure as specified by the user-space 623 | application, and let *ksize* be the size of the structure which the 624 | kernel supports, then there are three cases to consider: 625 | 626 | - If *ksize* equals *usize*, then there is no version mismatch and 627 | *attr* can be used verbatim. 628 | 629 | - If *ksize* is larger than *usize*, then there are some extension 630 | fields that the kernel supports which the user-space application is 631 | unaware of. Because a zero value in any added extension field 632 | signifies a no-op, the kernel treats all of the extension fields not 633 | provided by the user-space application as having zero values. This 634 | provides backwards-compatibility. 635 | 636 | - If *ksize* is smaller than *usize*, then there are some extension 637 | fields which the user-space application is aware of but which the 638 | kernel does not support. Because any extension field must have its 639 | zero values signify a no-op, the kernel can safely ignore the 640 | unsupported extension fields if they are all zero.
If any 641 | unsupported extension fields are non-zero, then -1 is returned and 642 | *errno* is set to **E2BIG**. This provides forwards-compatibility. 643 | 644 | Because the definition of *struct mount\_attr* may change in the future 645 | (with new fields being added when system headers are updated), 646 | user-space applications should zero-fill *struct mount\_attr* to ensure 647 | that recompiling the program with new headers will not result in 648 | spurious errors at runtime. The simplest way is to use a designated 649 | initializer: 650 | 651 | ```c 652 | struct mount_attr attr = { 653 | .attr_set = MOUNT_ATTR_RDONLY, 654 | .attr_clr = MOUNT_ATTR_NODEV 655 | }; 656 | ``` 657 | 658 | Alternatively, the structure can be zero-filled using **memset**(3) or 659 | similar functions: 660 | 661 | ```c 662 | struct mount_attr attr; 663 | memset(&attr, 0, sizeof(attr)); 664 | attr.attr_set = MOUNT_ATTR_RDONLY; 665 | attr.attr_clr = MOUNT_ATTR_NODEV; 666 | ``` 667 | 668 | A user-space application that wishes to determine which extensions the 669 | running kernel supports can do so by conducting a binary search on 670 | *size* with a structure which has every byte nonzero (to find the 671 | largest value which doesn\'t produce an error of **E2BIG**). 672 | 673 | EXAMPLES 674 | ======== 675 | 676 | ```c 677 | /* 678 | * This program allows the caller to create a new detached mount 679 | * and set various properties on it. 
680 | */ 681 | #define _GNU_SOURCE 682 | #include <errno.h> 683 | #include <fcntl.h> 684 | #include <getopt.h> 685 | #include <linux/mount.h> 686 | #include <linux/types.h> 687 | #include <stdbool.h> 688 | #include <stdio.h> 689 | #include <stdlib.h> 690 | #include <string.h> 691 | #include <sys/syscall.h> 692 | #include <unistd.h> 693 | 694 | static inline int 695 | mount_setattr(int dirfd, const char *pathname, unsigned int flags, 696 | struct mount_attr *attr, size_t size) 697 | { 698 | return syscall(SYS_mount_setattr, dirfd, pathname, flags, 699 | attr, size); 700 | } 701 | 702 | static inline int 703 | open_tree(int dirfd, const char *filename, unsigned int flags) 704 | { 705 | return syscall(SYS_open_tree, dirfd, filename, flags); 706 | } 707 | 708 | static inline int 709 | move_mount(int from_dirfd, const char *from_pathname, 710 | int to_dirfd, const char *to_pathname, unsigned int flags) 711 | { 712 | return syscall(SYS_move_mount, from_dirfd, from_pathname, 713 | to_dirfd, to_pathname, flags); 714 | } 715 | 716 | static const struct option longopts[] = { 717 | {"map-mount", required_argument, NULL, 'a'}, 718 | {"recursive", no_argument, NULL, 'b'}, 719 | {"read-only", no_argument, NULL, 'c'}, 720 | {"block-setid", no_argument, NULL, 'd'}, 721 | {"block-devices", no_argument, NULL, 'e'}, 722 | {"block-exec", no_argument, NULL, 'f'}, 723 | {"no-access-time", no_argument, NULL, 'g'}, 724 | { NULL, 0, NULL, 0 }, 725 | }; 726 | 727 | #define exit_log(format, ...)
do \ 728 | { \ 729 | fprintf(stderr, format, ##__VA_ARGS__); \ 730 | exit(EXIT_FAILURE); \ 731 | } while (0) 732 | 733 | int 734 | main(int argc, char *argv[]) 735 | { 736 | struct mount_attr *attr = &(struct mount_attr){}; 737 | int fd_userns = -1; 738 | bool recursive = false; 739 | int index = 0; 740 | int ret; 741 | 742 | while ((ret = getopt_long_only(argc, argv, "", 743 | longopts, &index)) != -1) { 744 | switch (ret) { 745 | case 'a': 746 | fd_userns = open(optarg, O_RDONLY | O_CLOEXEC); 747 | if (fd_userns == -1) 748 | exit_log("%m - Failed to open %s\n", optarg); 749 | break; 750 | case 'b': 751 | recursive = true; 752 | break; 753 | case 'c': 754 | attr->attr_set |= MOUNT_ATTR_RDONLY; 755 | break; 756 | case 'd': 757 | attr->attr_set |= MOUNT_ATTR_NOSUID; 758 | break; 759 | case 'e': 760 | attr->attr_set |= MOUNT_ATTR_NODEV; 761 | break; 762 | case 'f': 763 | attr->attr_set |= MOUNT_ATTR_NOEXEC; 764 | break; 765 | case 'g': 766 | attr->attr_set |= MOUNT_ATTR_NOATIME; 767 | attr->attr_clr |= MOUNT_ATTR__ATIME; 768 | break; 769 | default: 770 | exit_log("Invalid argument specified\n"); 771 | } 772 | } 773 | 774 | if ((argc - optind) < 2) 775 | exit_log("Missing source or target mount point\n"); 776 | 777 | const char *source = argv[optind]; 778 | const char *target = argv[optind + 1]; 779 | 780 | /* In the following, -1 as the 'dirfd' argument ensures that 781 | open_tree() fails if 'source' is not an absolute pathname. */ 782 | 783 | int fd_tree = open_tree(-1, source, 784 | OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC | 785 | AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0)); 786 | if (fd_tree == -1) 787 | exit_log("%m - Failed to open %s\n", source); 788 | 789 | if (fd_userns >= 0) { 790 | attr->attr_set |= MOUNT_ATTR_IDMAP; 791 | attr->userns_fd = fd_userns; 792 | } 793 | 794 | ret = mount_setattr(fd_tree, "", 795 | AT_EMPTY_PATH | (recursive ?
AT_RECURSIVE : 0), 796 | attr, sizeof(struct mount_attr)); 797 | if (ret == -1) 798 | exit_log("%m - Failed to change mount attributes\n"); 799 | 800 | close(fd_userns); 801 | 802 | /* In the following, -1 as the 'to_dirfd' argument ensures that 803 | move_mount() fails if 'target' is not an absolute pathname. */ 804 | 805 | ret = move_mount(fd_tree, "", -1, target, 806 | MOVE_MOUNT_F_EMPTY_PATH); 807 | if (ret == -1) 808 | exit_log("%m - Failed to attach mount to %s\n", target); 809 | 810 | close(fd_tree); 811 | 812 | exit(EXIT_SUCCESS); 813 | } 814 | ``` 815 | 816 | SEE ALSO 817 | ======== 818 | 819 | **newuidmap**(1), **newgidmap**(1), **clone**(2), **mount**(2), 820 | **unshare**(2), **proc**(5), **mount\_namespaces**(7), 821 | **capabilities**(7), **user\_namespaces**(7), **xattr**(7) 822 | -------------------------------------------------------------------------------- /mount-idmapped.c: -------------------------------------------------------------------------------- 1 | /* SPDX-License-Identifier: LGPL-2.1+ */ 2 | 3 | #define _GNU_SOURCE 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | #include 21 | #include 22 | #include 23 | #include 24 | #include 25 | #include 26 | #include 27 | #include 28 | 29 | /* mount_setattr() */ 30 | #ifndef MOUNT_ATTR_RDONLY 31 | #define MOUNT_ATTR_RDONLY 0x00000001 32 | #endif 33 | 34 | #ifndef MOUNT_ATTR_NOSUID 35 | #define MOUNT_ATTR_NOSUID 0x00000002 36 | #endif 37 | 38 | #ifndef MOUNT_ATTR_NOEXEC 39 | #define MOUNT_ATTR_NOEXEC 0x00000008 40 | #endif 41 | 42 | #ifndef MOUNT_ATTR_NODIRATIME 43 | #define MOUNT_ATTR_NODIRATIME 0x00000080 44 | #endif 45 | 46 | #ifndef MOUNT_ATTR__ATIME 47 | #define MOUNT_ATTR__ATIME 0x00000070 48 | #endif 49 | 50 | #ifndef MOUNT_ATTR_RELATIME 51 | #define MOUNT_ATTR_RELATIME 0x00000000 52 | #endif 53 | 54 | #ifndef
MOUNT_ATTR_NOATIME 55 | #define MOUNT_ATTR_NOATIME 0x00000010 56 | #endif 57 | 58 | #ifndef MOUNT_ATTR_STRICTATIME 59 | #define MOUNT_ATTR_STRICTATIME 0x00000020 60 | #endif 61 | 62 | #ifndef MOUNT_ATTR_IDMAP 63 | #define MOUNT_ATTR_IDMAP 0x00100000 64 | #endif 65 | 66 | #ifndef AT_RECURSIVE 67 | #define AT_RECURSIVE 0x8000 68 | #endif 69 | 70 | #ifndef __NR_mount_setattr 71 | #if defined __alpha__ 72 | #define __NR_mount_setattr 552 73 | #elif defined _MIPS_SIM 74 | #if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */ 75 | #define __NR_mount_setattr (442 + 4000) 76 | #endif 77 | #if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */ 78 | #define __NR_mount_setattr (442 + 6000) 79 | #endif 80 | #if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */ 81 | #define __NR_mount_setattr (442 + 5000) 82 | #endif 83 | #elif defined __ia64__ 84 | #define __NR_mount_setattr (442 + 1024) 85 | #else 86 | #define __NR_mount_setattr 442 87 | #endif 88 | struct mount_attr { 89 | __u64 attr_set; 90 | __u64 attr_clr; 91 | __u64 propagation; 92 | __u64 userns_fd; 93 | }; 94 | #endif 95 | 96 | /* open_tree() */ 97 | #ifndef OPEN_TREE_CLONE 98 | #define OPEN_TREE_CLONE 1 99 | #endif 100 | 101 | #ifndef OPEN_TREE_CLOEXEC 102 | #define OPEN_TREE_CLOEXEC O_CLOEXEC 103 | #endif 104 | 105 | #ifndef __NR_open_tree 106 | #if defined __alpha__ 107 | #define __NR_open_tree 538 108 | #elif defined _MIPS_SIM 109 | #if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */ 110 | #define __NR_open_tree 4428 111 | #endif 112 | #if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */ 113 | #define __NR_open_tree 6428 114 | #endif 115 | #if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */ 116 | #define __NR_open_tree 5428 117 | #endif 118 | #elif defined __ia64__ 119 | #define __NR_open_tree (428 + 1024) 120 | #else 121 | #define __NR_open_tree 428 122 | #endif 123 | #endif 124 | 125 | /* move_mount() */ 126 | #ifndef MOVE_MOUNT_F_SYMLINKS 127 | #define MOVE_MOUNT_F_SYMLINKS 0x00000001 /* Follow symlinks on from path */ 128 | #endif 129 | 130 | #ifndef 
MOVE_MOUNT_F_AUTOMOUNTS 131 | #define MOVE_MOUNT_F_AUTOMOUNTS 0x00000002 /* Follow automounts on from path */ 132 | #endif 133 | 134 | #ifndef MOVE_MOUNT_F_EMPTY_PATH 135 | #define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */ 136 | #endif 137 | 138 | #ifndef MOVE_MOUNT_T_SYMLINKS 139 | #define MOVE_MOUNT_T_SYMLINKS 0x00000010 /* Follow symlinks on to path */ 140 | #endif 141 | 142 | #ifndef MOVE_MOUNT_T_AUTOMOUNTS 143 | #define MOVE_MOUNT_T_AUTOMOUNTS 0x00000020 /* Follow automounts on to path */ 144 | #endif 145 | 146 | #ifndef MOVE_MOUNT_T_EMPTY_PATH 147 | #define MOVE_MOUNT_T_EMPTY_PATH 0x00000040 /* Empty to path permitted */ 148 | #endif 149 | 150 | #ifndef MOVE_MOUNT__MASK 151 | #define MOVE_MOUNT__MASK 0x00000077 152 | #endif 153 | 154 | #ifndef __NR_move_mount 155 | #if defined __alpha__ 156 | #define __NR_move_mount 539 157 | #elif defined _MIPS_SIM 158 | #if _MIPS_SIM == _MIPS_SIM_ABI32 /* o32 */ 159 | #define __NR_move_mount 4429 160 | #endif 161 | #if _MIPS_SIM == _MIPS_SIM_NABI32 /* n32 */ 162 | #define __NR_move_mount 6429 163 | #endif 164 | #if _MIPS_SIM == _MIPS_SIM_ABI64 /* n64 */ 165 | #define __NR_move_mount 5429 166 | #endif 167 | #elif defined __ia64__ 168 | #define __NR_move_mount (428 + 1024) 169 | #else 170 | #define __NR_move_mount 429 171 | #endif 172 | #endif 173 | 174 | /* A few helpful macros. */ 175 | #define IDMAPLEN 4096 176 | 177 | #define STRLITERALLEN(x) (sizeof(""x"") - 1) 178 | 179 | #define INTTYPE_TO_STRLEN(type) \ 180 | (2 + (sizeof(type) <= 1 \ 181 | ? 3 \ 182 | : sizeof(type) <= 2 \ 183 | ? 5 \ 184 | : sizeof(type) <= 4 \ 185 | ? 10 \ 186 | : sizeof(type) <= 8 ? 20 : sizeof(int[-2 * (sizeof(type) > 8)]))) 187 | 188 | #define syserror(format, ...) \ 189 | ({ \ 190 | fprintf(stderr, format, ##__VA_ARGS__); \ 191 | (-errno); \ 192 | }) 193 | 194 | #define syserror_set(__ret__, format, ...) 
\ 195 | ({ \ 196 | typeof(__ret__) __internal_ret__ = (__ret__); \ 197 | errno = labs(__ret__); \ 198 | fprintf(stderr, format, ##__VA_ARGS__); \ 199 | __internal_ret__; \ 200 | }) 201 | 202 | #define call_cleaner(cleaner) __attribute__((__cleanup__(cleaner##_function))) 203 | 204 | #define free_disarm(ptr) \ 205 | ({ \ 206 | free(ptr); \ 207 | ptr = NULL; \ 208 | }) 209 | 210 | static inline void free_disarm_function(void *ptr) 211 | { 212 | free_disarm(*(void **)ptr); 213 | } 214 | #define __do_free call_cleaner(free_disarm) 215 | 216 | #define move_ptr(ptr) \ 217 | ({ \ 218 | typeof(ptr) __internal_ptr__ = (ptr); \ 219 | (ptr) = NULL; \ 220 | __internal_ptr__; \ 221 | }) 222 | 223 | #define define_cleanup_function(type, cleaner) \ 224 | static inline void cleaner##_function(type *ptr) \ 225 | { \ 226 | if (*ptr) \ 227 | cleaner(*ptr); \ 228 | } 229 | 230 | #define call_cleaner(cleaner) __attribute__((__cleanup__(cleaner##_function))) 231 | 232 | #define close_prot_errno_disarm(fd) \ 233 | if (fd >= 0) { \ 234 | int _e_ = errno; \ 235 | close(fd); \ 236 | errno = _e_; \ 237 | fd = -EBADF; \ 238 | } 239 | 240 | static inline void close_prot_errno_disarm_function(int *fd) 241 | { 242 | close_prot_errno_disarm(*fd); 243 | } 244 | #define __do_close call_cleaner(close_prot_errno_disarm) 245 | 246 | define_cleanup_function(FILE *, fclose); 247 | #define __do_fclose call_cleaner(fclose) 248 | 249 | define_cleanup_function(DIR *, closedir); 250 | #define __do_closedir call_cleaner(closedir) 251 | 252 | static inline int sys_mount_setattr(int dfd, const char *path, unsigned int flags, 253 | struct mount_attr *attr, size_t size) 254 | { 255 | return syscall(__NR_mount_setattr, dfd, path, flags, attr, size); 256 | } 257 | 258 | static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags) 259 | { 260 | return syscall(__NR_open_tree, dfd, filename, flags); 261 | } 262 | 263 | static inline int sys_move_mount(int from_dfd, const char *from_pathname, int 
to_dfd, 264 | const char *to_pathname, unsigned int flags) 265 | { 266 | return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname, flags); 267 | } 268 | 269 | static ssize_t write_nointr(int fd, const void *buf, size_t count) 270 | { 271 | ssize_t ret; 272 | 273 | do { 274 | ret = write(fd, buf, count); 275 | } while (ret < 0 && errno == EINTR); 276 | 277 | return ret; 278 | } 279 | 280 | static int write_file(const char *path, const void *buf, size_t count) 281 | { 282 | int fd; 283 | ssize_t ret; 284 | 285 | fd = open(path, O_WRONLY | O_CLOEXEC | O_NOCTTY | O_NOFOLLOW); 286 | if (fd < 0) 287 | return -errno; 288 | 289 | ret = write_nointr(fd, buf, count); 290 | close(fd); 291 | if (ret < 0 || (size_t)ret != count) 292 | return -1; 293 | 294 | return 0; 295 | } 296 | 297 | /* 298 | * Let's use the "standard stack limit" (i.e. glibc thread size default) for 299 | * stack sizes: 8MB. 300 | */ 301 | #define __STACK_SIZE (8 * 1024 * 1024) 302 | static pid_t do_clone(int (*fn)(void *), void *arg, int flags) 303 | { 304 | void *stack; 305 | 306 | stack = malloc(__STACK_SIZE); 307 | if (!stack) 308 | return -ENOMEM; 309 | 310 | #ifdef __ia64__ 311 | return __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, NULL); 312 | #else 313 | return clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, NULL); 314 | #endif 315 | } 316 | 317 | static int clone_cb(void *data) 318 | { 319 | return kill(getpid(), SIGSTOP); 320 | } 321 | 322 | struct list { 323 | void *elem; 324 | struct list *next; 325 | struct list *prev; 326 | }; 327 | 328 | #define list_for_each(__iterator, __list) \ 329 | for (__iterator = (__list)->next; __iterator != __list; __iterator = __iterator->next) 330 | 331 | static inline void list_init(struct list *list) 332 | { 333 | list->elem = NULL; 334 | list->next = list->prev = list; 335 | } 336 | 337 | static inline int list_empty(const struct list *list) 338 | { 339 | return list == list->next; 340 | } 341 | 342 | static inline void 
__list_add(struct list *new, struct list *prev, struct list *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

static inline void list_add_tail(struct list *head, struct list *list)
{
	__list_add(list, head->prev, head);
}

typedef enum idmap_type_t {
	ID_TYPE_UID,
	ID_TYPE_GID
} idmap_type_t;

struct id_map {
	idmap_type_t map_type;
	__u32 nsid;
	__u32 hostid;
	__u32 range;
};

static struct list active_map;

static int add_map_entry(__u32 id_host,
			 __u32 id_ns,
			 __u32 range,
			 idmap_type_t map_type)
{
	__do_free struct list *new_list = NULL;
	__do_free struct id_map *newmap = NULL;

	newmap = malloc(sizeof(*newmap));
	if (!newmap)
		return -ENOMEM;

	new_list = malloc(sizeof(struct list));
	if (!new_list)
		return -ENOMEM;

	*newmap = (struct id_map){
		.hostid = id_host,
		.nsid = id_ns,
		.range = range,
		.map_type = map_type,
	};

	new_list->elem = move_ptr(newmap);
	list_add_tail(&active_map, move_ptr(new_list));
	return 0;
}

static int parse_map(char *map)
{
	char types[2] = {'u', 'g'};
	int ret;
	__u32 id_host, id_ns, range;
	char which;

	if (!map)
		return -1;

	ret = sscanf(map, "%c:%u:%u:%u", &which, &id_ns, &id_host, &range);
	if (ret != 4)
		return -1;

	if (which != 'b' && which != 'u' && which != 'g')
		return -1;

	for (int i = 0; i < 2; i++) {
		idmap_type_t map_type;

		if (which != types[i] && which != 'b')
			continue;

		if (types[i] == 'u')
			map_type = ID_TYPE_UID;
		else
			map_type = ID_TYPE_GID;

		ret = add_map_entry(id_host, id_ns, range, map_type);
		if (ret < 0)
			return ret;
	}

	return 0;
}

static int write_id_mapping(idmap_type_t map_type, pid_t pid, const char *buf, size_t buf_size)
{
	__do_close int fd = -EBADF;
	int ret;
	char path[STRLITERALLEN("/proc/") + INTTYPE_TO_STRLEN(pid_t) +
		  STRLITERALLEN("/setgroups") + 1];

	if (geteuid() != 0 && map_type == ID_TYPE_GID) {
		__do_close int setgroups_fd = -EBADF;

		/* Bound the write by the real buffer size, not PATH_MAX. */
		ret = snprintf(path, sizeof(path), "/proc/%d/setgroups", pid);
		if (ret < 0 || (size_t)ret >= sizeof(path))
			return -E2BIG;

		setgroups_fd = open(path, O_WRONLY | O_CLOEXEC);
		if (setgroups_fd < 0 && errno != ENOENT)
			return syserror("Failed to open \"%s\"", path);

		if (setgroups_fd >= 0) {
			ret = write_nointr(setgroups_fd, "deny\n", STRLITERALLEN("deny\n"));
			if (ret != STRLITERALLEN("deny\n"))
				return syserror("Failed to write \"deny\" to \"/proc/%d/setgroups\"", pid);
		}
	}

	ret = snprintf(path, sizeof(path), "/proc/%d/%cid_map", pid, map_type == ID_TYPE_UID ? 'u' : 'g');
	if (ret < 0 || (size_t)ret >= sizeof(path))
		return -E2BIG;

	fd = open(path, O_WRONLY | O_CLOEXEC);
	if (fd < 0)
		return syserror("Failed to open \"%s\"", path);

	ret = write_nointr(fd, buf, buf_size);
	if (ret != buf_size)
		return syserror("Failed to write %cid mapping to \"%s\"",
				map_type == ID_TYPE_UID ?
				'u' : 'g', path);

	return 0;
}

static int map_ids(struct list *idmap, pid_t pid)
{
	int fill, left;
	char mapbuf[STRLITERALLEN("new@idmap") + STRLITERALLEN(" ") +
		    INTTYPE_TO_STRLEN(pid_t) + STRLITERALLEN(" ") + IDMAPLEN] = {};

	for (idmap_type_t map_type = ID_TYPE_UID; map_type <= ID_TYPE_GID; map_type++) {
		char u_or_g = (map_type == ID_TYPE_UID) ? 'u' : 'g';
		char *pos = mapbuf;
		/* Tracked per id type so an empty gid map is not written after a uid map. */
		bool had_entry = false;
		int ret;
		struct list *iterator;

		list_for_each(iterator, idmap) {
			struct id_map *map = iterator->elem;
			if (map->map_type != map_type)
				continue;

			had_entry = true;

			left = IDMAPLEN - (pos - mapbuf);
			fill = snprintf(pos, left, "%u %u %u\n", map->nsid, map->hostid, map->range);
			/*
			 * The kernel only takes <= 4k for writes to
			 * /proc//{g,u}id_map
			 */
			if (fill <= 0 || fill >= left)
				return syserror_set(-E2BIG, "Too many %cid mappings defined", u_or_g);

			pos += fill;
		}
		if (!had_entry)
			continue;

		ret = write_id_mapping(map_type, pid, mapbuf, pos - mapbuf);
		if (ret < 0)
			return syserror("Failed to write mapping: %s", mapbuf);

		memset(mapbuf, 0, sizeof(mapbuf));
	}

	return 0;
}

static int wait_for_pid(pid_t pid)
{
	int status, ret;

again:
	ret = waitpid(pid, &status, 0);
	if (ret < 0) {
		if (errno == EINTR)
			goto again;

		return -1;
	}

	if (ret != pid)
		goto again;

	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	return 0;
}

static int get_userns_fd(struct list *idmap)
{
	int ret;
	pid_t pid;
	char path_ns[STRLITERALLEN("/proc/") + INTTYPE_TO_STRLEN(pid_t) +
		     STRLITERALLEN("/ns/user") + 1];

	pid = do_clone(clone_cb, NULL,
		       CLONE_NEWUSER);
	if (pid < 0)
		return -errno;

	ret = map_ids(idmap, pid);
	if (ret >= 0) {
		ret = snprintf(path_ns, sizeof(path_ns), "/proc/%d/ns/user", pid);
		if (ret < 0 || (size_t)ret >= sizeof(path_ns))
			ret = -EIO;
		else
			ret = open(path_ns, O_RDONLY | O_CLOEXEC | O_NOCTTY);
	}

	/* Always reap the helper child, including when writing the idmap failed. */
	(void)kill(pid, SIGKILL);
	(void)wait_for_pid(pid);
	return ret;
}

static inline bool strnequal(const char *str, const char *eq, size_t len)
{
	return strncmp(str, eq, len) == 0;
}

static void usage(void)
{
	const char *text = "\
mount-idmapped --map-mount= \n\
\n\
Create an idmapped mount of at \n\
Options:\n\
  --map-mount=\n\
        Specify an idmap for the mount in the format\n\
        :::\n\
        The can be:\n\
            \"b\" or \"both\" -> map both uids and gids\n\
            \"u\" or \"uid\" -> map uids\n\
            \"g\" or \"gid\" -> map gids\n\
        For example, specifying:\n\
            both:1000:1001:1 -> map uid and gid 1000 to uid and gid 1001 in and no other ids\n\
            uid:20000:100000:1000 -> map uid 20000 to uid 100000, uid 20001 to uid 100001 [...] in \n\
        Currently up to 340 separate idmappings may be specified.\n\n\
  --map-mount=/proc//ns/user\n\
        Specify a path to a user namespace whose idmap is to be used.\n\n\
  --map-caller=\n\
        Specify an idmap to be used for the caller, i.e. move the caller into a new user namespace\n\
        with the requested mapping.\n\n\
  --recursive\n\
        Copy the whole mount tree from and apply the idmap to everyone at .\n\n\
Examples:\n\
  - Create an idmapped mount of /source on /target with both ('b') uids and gids mapped:\n\
    mount-idmapped --map-mount b:0:10000:10000 /source /target\n\n\
  - Create an idmapped mount of /source on /target with uids ('u') and gids ('g') mapped separately:\n\
    mount-idmapped --map-mount u:0:10000:10000 --map-mount g:0:20000:20000 /source /target\n\n\
  - Create an idmapped mount of /source on /target with both ('b') uids and gids mapped and a user namespace\n\
    with both ('b') uids and gids mapped:\n\
    mount-idmapped --map-caller b:0:10000:10000 --map-mount b:0:10000:1000 /source /target\n\n\
  - Create an idmapped mount of /source on /target with uids ('u') and gids ('g') mapped separately\n\
    and a user namespace with both ('b') uids and gids mapped:\n\
    mount-idmapped --map-caller u:0:10000:10000 --map-mount g:0:20000:20000 --map-mount b:0:10000:1000 /source /target\n\
";
	fprintf(stderr, "%s", text);
	_exit(EXIT_SUCCESS);
}

#define exit_usage(format, ...) \
	({ \
		fprintf(stderr, format, ##__VA_ARGS__); \
		usage(); \
	})

#define exit_log(format, ...) \
	({ \
		fprintf(stderr, format, ##__VA_ARGS__); \
		exit(EXIT_FAILURE); \
	})

static const struct option longopts[] = {
	{"map-mount",	required_argument,	0,	'a'},
	{"map-caller",	required_argument,	0,	'b'},
	{"help",	no_argument,		0,	'c'},
	{"recursive",	no_argument,		0,	'd'},
	{NULL,		0,			0,	0},
};

int main(int argc, char *argv[])
{
	int fd_userns = -EBADF;
	int index = 0;
	const char *caller_idmap = NULL, *source = NULL, *target = NULL;
	bool recursive = false;
	int fd_tree, new_argc, ret;
	char *const *new_argv;

	list_init(&active_map);
	while ((ret = getopt_long_only(argc, argv, "", longopts, &index)) != -1) {
		switch (ret) {
		case 'a':
			if (strnequal(optarg, "/proc/", STRLITERALLEN("/proc/"))) {
				fd_userns = open(optarg, O_RDONLY | O_CLOEXEC);
				if (fd_userns < 0)
					exit_log("%m - Failed to open user namespace path %s\n", optarg);
				break;
			}

			ret = parse_map(optarg);
			if (ret < 0)
				exit_log("Failed to parse idmaps for mount\n");
			break;
		case 'b':
			caller_idmap = optarg;
			break;
		case 'd':
			recursive = true;
			break;
		case 'c':
			/* fallthrough */
		default:
			usage();
		}
	}

	new_argv = &argv[optind];
	new_argc = argc - optind;
	if (new_argc < 2)
		exit_usage("Missing source or target mountpoint\n\n");
	source = new_argv[0];
	target = new_argv[1];

	/*
	 * The issue explained below is now fixed in mainline:
	 *
	 * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d3110f256d126b44d34c1f662310cd295877c447
	 *
	 * Make sure that your distro picks it up for your supported stable
	 * kernels.
	 *
	 * Note that all currently released kernels supporting open_tree() and
	 * move_mount() are buggy when source and target are identical and
	 * reside on a shared mount. Until my fix
	 * https://gitlab.com/brauner/linux/-/commit/6ada58d955aed4515689b2c609eb9d755792d82a
	 * is merged this bug can cause you to be unable to create new mounts.
	 *
	 * For example, whenever your "/" is mounted MS_SHARED (which it is on
	 * systemd systems) and you were to do mount-idmapped /mnt /mnt the
	 * following issue would apply to you:
	 *
	 * Creating a series of detached mounts, attaching them to the
	 * filesystem, and unmounting them can be used to trigger an integer
	 * overflow in ns->mounts causing the kernel to block any new mounts in
	 * count_mounts() and returning ENOSPC because it falsely assumes that
	 * the maximum number of mounts in the mount namespace has been
	 * reached, i.e. it thinks it can't fit the new mounts into the mount
	 * namespace anymore.
	 *
	 * The root cause of this is that detached mounts aren't handled
	 * correctly when source and target mount are identical and reside on a
	 * shared mount. This results in a broken mount tree where the detached
	 * source itself is propagated, something that propagation otherwise
	 * prevents for regular bind-mounts and new mounts. This ultimately
	 * leads to a miscalculation of the number of mounts in the mount
	 * namespace.
	 *
	 * Detached mounts created via open_tree(fd, path, OPEN_TREE_CLONE) are
	 * essentially like an unattached new mount, or an unattached
	 * bind-mount. They can then later on be attached to the filesystem via
	 * move_mount() which calls into attach_recursive_mount(). Part of
	 * attaching it to the filesystem is making sure that mounts get
	 * correctly propagated in case the destination mountpoint is
	 * MS_SHARED, i.e. is a shared mountpoint. This is done by calling into
	 * propagate_mnt() which walks the list of peers calling
	 * propagate_one() on each mount in this list making sure it receives
	 * the propagation event. The propagate_one() function thereby skips
	 * both new mounts and bind mounts to not propagate them "into
	 * themselves". Both are identified by checking whether the mount is
	 * already attached to any mount namespace in mnt->mnt_ns. This is what
	 * the IS_MNT_NEW() helper is responsible for.
	 *
	 * However, detached mounts have an anonymous mount namespace attached
	 * to them stashed in mnt->mnt_ns which means that IS_MNT_NEW() doesn't
	 * realize they need to be skipped causing the mount to propagate "into
	 * itself" breaking the mount table and causing a disconnect between
	 * the number of mounts recorded as being beneath or reachable from the
	 * target mountpoint and the number of mounts actually recorded/counted
	 * in ns->mounts ultimately causing an overflow which in turn prevents
	 * any new mounts via the ENOSPC issue.
	 */
	fd_tree = sys_open_tree(-EBADF, source,
				OPEN_TREE_CLONE |
				OPEN_TREE_CLOEXEC |
				AT_EMPTY_PATH |
				(recursive ?
				 AT_RECURSIVE : 0));
	if (fd_tree < 0)
		exit_log("%m - Failed to open %s\n", source);

	if (!list_empty(&active_map) || fd_userns != -EBADF) {
		struct mount_attr attr = {
			.attr_set = MOUNT_ATTR_IDMAP,
		};

		/*
		 * attr.userns_fd is an unsigned __u64, so a negative error
		 * value has to be checked on the int before it is stored.
		 */
		if (!list_empty(&active_map))
			fd_userns = get_userns_fd(&active_map);

		if (fd_userns < 0)
			exit_log("%m - Failed to create user namespace\n");

		attr.userns_fd = fd_userns;

		ret = sys_mount_setattr(fd_tree, "", AT_EMPTY_PATH | AT_RECURSIVE, &attr,
					sizeof(attr));
		if (ret < 0)
			exit_log("%m - Failed to change mount attributes\n");
		close(fd_userns);
	}

	ret = sys_move_mount(fd_tree, "", -EBADF, target, MOVE_MOUNT_F_EMPTY_PATH);
	if (ret < 0)
		exit_log("%m - Failed to attach mount to %s\n", target);
	close(fd_tree);

	if (caller_idmap) {
		execlp("lxc-usernsexec", "lxc-usernsexec", "-m", caller_idmap, "bash", (char *)NULL);
		exit_log("Note that moving the caller into a new user namespace requires \"lxc-usernsexec\" to be installed\n");
	}
	exit(EXIT_SUCCESS);
}
--------------------------------------------------------------------------------
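The help text in `usage()` lists both short and long type names ("b" or "both", "u" or "uid", "g" or "gid"), while `parse_map()` reads the type with `%c` and therefore appears to accept only the single-letter form. The following is a minimal sketch, not part of the tool, of a parser that would accept both spellings. `struct idmap_spec` and `parse_idmap_spec()` are hypothetical names introduced only for illustration; the field order (type, namespace id, host id, range) is taken from `parse_map()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type for illustration; not part of mount-idmapped.c. */
struct idmap_spec {
	bool map_uid;		/* apply the mapping to uids */
	bool map_gid;		/* apply the mapping to gids */
	unsigned int id_ns;	/* first id inside the namespace */
	unsigned int id_host;	/* first id on the host side */
	unsigned int range;	/* number of consecutive ids */
};

/*
 * Parse "type:id_ns:id_host:range", where type is one of "b"/"both",
 * "u"/"uid" or "g"/"gid". Returns 0 on success and -1 on malformed input;
 * *out is only written on success.
 */
static int parse_idmap_spec(const char *spec, struct idmap_spec *out)
{
	char type[8] = "";
	char trailing;
	unsigned int id_ns, id_host, range;
	bool map_uid, map_gid;
	int ret;

	assert(out != NULL);

	if (!spec)
		return -1;

	/*
	 * "%7[^:]" reads up to 7 characters that are not ':' (enough for
	 * "both"). The final "%c" only converts if something follows the
	 * range, so a return value of exactly 4 means a clean match.
	 */
	ret = sscanf(spec, "%7[^:]:%u:%u:%u%c", type, &id_ns, &id_host, &range, &trailing);
	if (ret != 4)
		return -1;

	if (strcmp(type, "b") == 0 || strcmp(type, "both") == 0) {
		map_uid = true;
		map_gid = true;
	} else if (strcmp(type, "u") == 0 || strcmp(type, "uid") == 0) {
		map_uid = true;
		map_gid = false;
	} else if (strcmp(type, "g") == 0 || strcmp(type, "gid") == 0) {
		map_uid = false;
		map_gid = true;
	} else {
		return -1;
	}

	*out = (struct idmap_spec){
		.map_uid = map_uid,
		.map_gid = map_gid,
		.id_ns = id_ns,
		.id_host = id_host,
		.range = range,
	};
	return 0;
}
```

With this sketch, `parse_idmap_spec("both:1000:1001:1", &spec)` and `parse_idmap_spec("b:1000:1001:1", &spec)` would fill `spec` identically. A caller could then invoke `add_map_entry()` once per enabled id type, which is what the loop in `parse_map()` does for the single-letter form.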