├── .gitignore
├── Makefile
├── README.md
└── main.c


/.gitignore:
--------------------------------------------------------------------------------
infilter
*.swp
*.o

--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
# Infilter Makefile

PREFIX?=/usr/local

all: infilter

infilter: main.c
	$(CC) -o $@ $<

install: all
	install -D --mode=755 infilter ${DESTDIR}${PREFIX}/bin/infilter

clean:
	rm -f infilter

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Infilter

Run *any* binary in *any* container.

```
make
./infilter $(docker inspect --format='{{.State.Pid}}' my-minimal-container) htop
```

## Presentation

``Infilter`` will run any binary executable in any running container without
patching. It is especially useful for minimalistic containers with no
monitoring/debugging tools, like Alpine- or BusyBox-based Docker images.

You may use it to:
- run ``/bin/zsh`` in a container with *no* shell
- run ``htop`` in any container
- run pre-existing collectors like scollector *inside* any container
- wrap ``sftp-server`` to enter a customer container on demand
- ...

``Infilter`` will happily run with any namespace-based Linux container, be it
Docker, LXC, runc, rkt or any other custom implementation. Please note that
it currently assumes the amd64 kernel ABI.

NOTE: ``Infilter`` will *not* work as expected if the target program relies on
support files being available at runtime. For example, it will not be able to
run a Python program like ``ctop``.

## Infilter vs...

### ``nsenter``, ``lxc-attach``, ``docker-exec``, ...

``nsenter``, ``lxc-attach`` and ``docker-exec`` work by first entering the target
container using the ``setns`` system call. Once done, they ``execve`` the
command of your choice. This implies that the command actually exists in the
target container.

Furthermore, in an infrastructure management system, it may be necessary to run a
command with only a subset of the namespaces (isolation domains) activated. This
creates a bridge between the Host and the container. Depending on your situation,
this may be a security threat, and hence not an option.

### Mounting a special volume with required tools

An alternative approach to the same functionality is to mount a volume with your
preferred tools. This works great when all required tools are well known in
advance and all required shared libraries are either fully trusted or
statically linked. If these tools rely on any library in the container, this is a
potential vulnerability, as a malicious user could manipulate a library to run
arbitrary code as a privileged user.

### Patching

Another alternative approach is to patch the desired tool with the appropriate
system calls. Patching implies maintenance, keeping up to date with security
fixes and careful tuning to make sure ``setns`` is called at the right time.

## How does it work?

Basically, ``Infilter`` starts the target program under ``ptrace``. It then waits for
the first syscall outside of ``ld``, the dynamic linker. At this stage, the program
is still running in the Host context and, based on the loaded libraries, ``Infilter``
is able to decide whether the target program will need terminfo resources.

terminfo requires special treatment as this support file is necessary to run most
terminal-based applications, like htop in the example above.

Either way, the target program is now *inside* the first syscall outside the dynamic linker.
This is when all 6 namespaces are attached. Once attached, the initial syscall is
released. The target program is now running *inside* the container.

If terminfo was not required, ``infilter`` will simply detach itself and let the
execution go.

If terminfo was potentially required, ``infilter`` will attempt to intercept
open/stat/access operations on terminfo files and proxy them to the host context
at runtime. As soon as one terminfo file has been successfully opened, ``infilter``
will detach itself and let the execution go.

For more details, please see the code. It is extensively documented.

## License

MIT

--------------------------------------------------------------------------------
/main.c:
--------------------------------------------------------------------------------
#define _GNU_SOURCE

#include <errno.h>

// open, stat
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

// fork, exec, free
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

// clone
#include <sched.h>

// parser
#include <string.h>

// syscall introspection
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>
#include <sys/syscall.h>

#define STAGE_HOST 1
#define STAGE_CONTAINER 2

const char* ld_path_prefix = "/lib/x86_64-linux-gnu/ld+";
struct namespace {
    const char* proc_name;
    int flag;
    int stage;
    int fd;
};

struct namespace namespaces[] = {
    {"pid",  CLONE_NEWPID,  STAGE_HOST,      -1}, // PID actually changes on fork, hence stage 1
    {"ipc",  CLONE_NEWIPC,  STAGE_HOST,      -1},
    {"net",  CLONE_NEWNET,  STAGE_HOST,      -1},
    {"uts",  CLONE_NEWUTS,  STAGE_HOST,      -1},
    {"mnt",  CLONE_NEWNS,   STAGE_CONTAINER, -1},
    {"user", CLONE_NEWUSER, STAGE_CONTAINER, -1},
    {NULL,   0,             0,                0},
};

// terminfo special case
char terminfo_suffix[255] = {0};
const char* terminfo_lib_fullpath = "/lib/x86_64-linux-gnu/libtinfo.so.5";
const char* terminfo_candidates_locations[] = {
    "/etc",
    "/lib",
    "/usr/lib",
    "/usr/share",
    NULL,
};

// path whitelisted for proxying. If ending with a '+', interpret as a prefix
const char* proxy_whitelist[] = {
    "/etc/terminfo",
    "/etc/terminfo/+",
    "/lib/terminfo",
    "/lib/terminfo/+",
    "/usr/share/terminfo",
    "/usr/share/terminfo/+",
    NULL,
};

// return 1 if prefix match. 0 otherwise
// if prefix does not end with '+', consider exact match only
// if the prefix is empty, consider not a match
int str_prefix_match(const char* prefix, const char* candidate) {
    size_t prefix_len;

    // Sanity
    if(prefix == NULL || candidate == NULL) {
        return 0;
    }

    prefix_len = strlen(prefix);
    if(!prefix_len) {
        return 0;
    }

    // prefix ?
    if(prefix[prefix_len-1] == '+') {
        if(strncmp(prefix, candidate, prefix_len-1) == 0) {
            return 1;
        }
    }

    // exact match ?
    if(strcmp(prefix, candidate) == 0) {
        return 1;
    }

    // no match
    return 0;
}

int proc_open_mem(pid_t pid) {
    char path[255];
    int fd;

    // open memory
    if(snprintf(path, sizeof(path), "/proc/%d/mem", pid) < 0) {
        perror("snprintf");
        return -1;
    }

    fd = open(path, O_RDWR);
    if(fd == -1) {
        perror("open mem");
        return -1;
    }

    return fd;
}

// Read len bytes at address src of the traced process into dst. 0 on success, -1 on error
int proc_read_data(int fd, void* src, void* dst, size_t len) {
    // read data
    if(lseek(fd, (long)src, SEEK_SET) == -1) {
        perror("seek mem");
        return -1;
    }

    if(read(fd, dst, len) == -1) {
        perror("read mem");
        return -1;
    }

    return 0;
}

// Write len bytes from src to address dst of the traced process. 0 on success, -1 on error
int proc_write_data(int fd, void* src, void* dst, size_t len) {
    // write data
    if(lseek(fd, (long)dst, SEEK_SET) == -1) {
        perror("seek mem");
        return -1;
    }

    if(write(fd, src, len) == -1) {
        perror("write mem");
        return -1;
    }

    return 0;
}

// Read arbitrary data from PID's memory. Ptrace PEEKTEXT is too slow/risky to use
int proc_read_string(int fd, char* src, char* dst, size_t len) {
    if(proc_read_data(fd, src, dst, len) == -1) {
        return -1;
    }
    dst[len-1] = '\0';
    return 0;
}

// Return 1 if addr lies inside the executable mapping of the dynamic linker
int in_ld(unsigned long long int addr, unsigned int pid) {
    static unsigned long long int begin_addr = 0;
    static unsigned long long int end_addr = 0;

    // load + parse /proc/<pid>/maps
    if(begin_addr == 0) {
        int index = 0;
        size_t len = 0;
        ssize_t read;
        FILE* fp;
        char* parsing = NULL;
        char* line = NULL;
        char* path = NULL;
        char* field = NULL;

        char* addrs = NULL;

        if(asprintf(&path, "/proc/%u/maps", pid) == -1) {
            perror("asprintf");
            exit(1);
        }

        fp = fopen(path, "r");
        if(!fp) {
            perror("fopen");
            exit(1);
        }

        while((read = getline(&line, &len, fp)) != -1) {
            parsing = line;
            for (index = 0; index<6; index++) {
                field = strtok(parsing, " \t\n");
                if (field == NULL)
                    break;

                // load addresses
                if(index == 0) {
                    addrs = field;
                    parsing = NULL;
                    continue;
                }

                // make sure this section is executable
                if(index == 1) {
                    if(strlen(field) < 3 || field[2] != 'x') {
                        break;
                    }
                }

                // is it our target ?
                if(index == 5) {
                    if(str_prefix_match(ld_path_prefix, field) != 1) {
                        addrs = NULL;
                    }
                }
            }

            // If last loop was full AND we have match --> exit loop
            if(addrs != NULL && index == 6) {
                break;
            }

        }

        // parse addr field, only if the scan stopped on a full, matching line
        if(addrs != NULL && index == 6) {
            parsing = addrs;
            for(index = 0; index<2; index++) {
                field = strtok(parsing, "-");
                if(field == NULL) {
                    break;
                }
                if(index == 0) {
                    begin_addr = strtoull(field, NULL, 16);
                } else if (index == 1) {
                    end_addr = strtoull(field, NULL, 16);
                }
                parsing = NULL;
            }
        }

        fclose(fp);
        free(line);
        free(path);
    }

    return addr >= begin_addr && addr < end_addr;
}

// Build (once) and return "terminfo/<first letter>/<TERM>", or NULL if TERM is unset
char* terminfo_build_suffix() {
    if(terminfo_suffix[0] != '\0') return terminfo_suffix;

    const char* termname;
    termname = getenv("TERM");
    if(!termname || termname[0] == '\0') {
        return NULL;
    }

    if(snprintf(terminfo_suffix, sizeof(terminfo_suffix), "terminfo/%c/%s", termname[0], termname) == -1) {
        perror("snprintf terminfo suffix");
        return NULL;
    }

    return terminfo_suffix;
}

// Return 1 if the path being opened by the child is the terminfo library
int terminfo_need(int mem_fd, char* base_address) {
    char path[255];

    if(proc_read_string(mem_fd, base_address, path, sizeof(path)) == -1) {
        return 0;
    }

    if(strcmp(terminfo_lib_fullpath, path) == 0) {
        return 1;
    }

    return 0;
}

// Open the host's terminfo description for the current TERM. -1 if not found
int terminfo_open() {
    char path[255];
    const char** candidate;
    char* suffix = terminfo_build_suffix();
    int fd;

    if(!suffix) {
        return -1;
    }

    for(candidate=terminfo_candidates_locations; *candidate; candidate++) {
        if(snprintf(path, sizeof(path), "%s/%s", *candidate, suffix) == -1) {
            continue;
        }
        if((fd = open(path, O_RDONLY)) != -1) {
            return fd;
        }
    }

    return -1;
}

// Return non-zero if path ends with the terminfo suffix of the current TERM
int terminfo_is_descfile(char* path) {
    char* suffix = terminfo_build_suffix();
    return (suffix && strlen(path) > strlen(suffix) && !strcmp(path + strlen(path) - strlen(suffix), suffix));
}

int proxy_is_ok(const char *pathname) {
    const char** candidate;

    for(candidate=proxy_whitelist; *candidate; candidate++) {
        if(str_prefix_match(*candidate, pathname) == 1) {
            return 1;
        }
    }

    return 0;
}

// run stat in parent, copy result to child
// returns a raw syscall result: 0 on success, -errno on failure
int proxy_stat(int mem_fd, const char *pathname, struct stat *buf) {
    struct stat src;

    if(stat(pathname, &src) == -1) {
        return -errno;
    }

    if(proc_write_data(mem_fd, &src, buf, sizeof(src)) == -1) {
        return -EFAULT;
    }

    return 0;
}

// run access in parent, copy result to child
// returns a raw syscall result: 0 on success, -errno on failure
int proxy_access(int mem_fd, const char *pathname, int mode) {
    if(access(pathname, mode) == -1) {
        return -errno;
    }
    return 0;
}

// Wait for syscall.
// If regs is not NULL, load registers states into regs
int wait_for_syscall(pid_t child, struct user_regs_struct* regs) {
    int status;
    while (1) {
        ptrace(PTRACE_SYSCALL, child, 0, 0);
        waitpid(child, &status, 0);
        if (WIFSTOPPED(status) && WSTOPSIG(status) & 0x80) {
            if(regs)
                ptrace(PTRACE_GETREGS, child, NULL, regs);
            return 0;
        }
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return -1;
    }
}

// inject syscall in the container
// convention: process is *entering* a syscall
// FIXME: only trivial syscalls, enough for our needs
int inject_syscall(pid_t pid, unsigned long long int nr, unsigned long long int rdi, unsigned long long int rsi) {
    struct user_regs_struct regs;
    long long int ret;

    // Grab current registers state
    if(ptrace(PTRACE_GETREGS, pid, NULL, &regs) == -1) {
        perror("ptrace");
        return -1;
    }

    // inject call, rewind rip so that the 'syscall' instruction runs again afterwards
    regs.orig_rax = nr;
    regs.rdi = rdi;
    regs.rsi = rsi;
    regs.rip -= 2;
    if(ptrace(PTRACE_SETREGS, pid, NULL, &regs) == -1) {
        perror("ptrace");
        return -1;
    }

    // get return value. rax is unsigned in user_regs_struct, errors are negative
    if(wait_for_syscall(pid, &regs) == -1) exit(1);
    ret = (long long int)regs.rax;

    // wait for next syscall *entry*, even on failure, to honor the convention
    if(wait_for_syscall(pid, NULL) == -1) exit(1);

    if(ret < 0) {
        errno = (int)-ret;
        return -1;
    }

    return 0;
}

// wrapper for injecting 'setns' syscall in the container
int inject_setns(pid_t pid, int fd, int nstype) {
    int ret = inject_syscall(pid, SYS_setns, fd, nstype);

    // setns fails with EINVAL when joining the user namespace we already belong to
    // (container without user namespace remapping): nothing to do in this case
    if(ret == -1 && nstype == CLONE_NEWUSER && errno == EINVAL) {
        return 0;
    }

    return ret;
}

// wrapper for injecting 'close' syscall in the container
int inject_close(pid_t pid, int fd) {
    return inject_syscall(pid, SYS_close, fd, 0);
}

pid_t create_child(char **argv) {
    pid_t child = fork();

    if(child == -1) {
        perror("fork");
        exit(1);
    }

    if(child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        kill(getpid(), SIGSTOP);
        execvp(*argv, argv);
        perror("exec");
        exit(1);
    }
    return child;
}

int main(int argc, char** argv) {
    // parent vars
    int mem_fd;
    int status;
    pid_t pid, child;
    char path[255];
    struct user_regs_struct regs;
    struct user_regs_struct regs_in;
    struct namespace* ns;

    // special/favor treatment: terminfo
    int terminfo_enabled = 0;
    int terminfo_fd = 0;

    /* Parse arguments */
    if(argc <= 2) {
        fprintf(stderr, "Usage: %s PID command [args...]\n", argv[0]);
        exit(1);
    }

    pid = atoi(argv[1]);
    if(!pid) {
        fprintf(stderr, "Invalid PID: %s\n", argv[1]);
        exit(1);
    }

    /* Open + leak fds */
    for(ns = namespaces; ns->proc_name; ns++) {
        // grab a fd to the ns
        if(snprintf(path, sizeof(path), "/proc/%d/ns/%s", pid, ns->proc_name) < 0) {
            perror("snprintf");
            exit(1);
        }

        ns->fd = open(path, O_RDONLY);
        if(ns->fd < 0) {
            perror("open");
            exit(1);
        }
    }
    terminfo_fd = terminfo_open();

    // Mount all stage 1 namespaces now + close fds
    for(ns = namespaces; ns->proc_name; ns++) {
        if(ns->stage != STAGE_HOST)
            continue;

        if(setns(ns->fd, ns->flag) == -1) {
            perror("setns");
            return 1;
        }

        close(ns->fd);
        ns->fd = -1;
    }

    // create child
    child = create_child(argv+2);

    // Close any leaked fd at this stage. The child keeps its own copies,
    // under the same fd numbers
    for(ns = namespaces; ns->proc_name; ns++) {
        close(ns->fd);
    }
    close(terminfo_fd);

    waitpid(child, &status, 0);
    ptrace(PTRACE_SETOPTIONS, child, 0, PTRACE_O_TRACESYSGOOD);

    // Wait until exec
    do {
        if(wait_for_syscall(child, NULL) == -1) break;
        if(wait_for_syscall(child, &regs) == -1) break;
    } while(regs.orig_rax != SYS_execve);

    // open child's memory
    mem_fd = proc_open_mem(child);
    if(mem_fd == -1) {
        return 1;
    }

    // wait until 1st syscall outside of ld, break on entry
    while(1) {
        if(wait_for_syscall(child, &regs_in) == -1) break;
        if(!in_ld(regs_in.rip, child)) break;

        // detect special treatments
        if(regs_in.orig_rax == SYS_open) {
            if(terminfo_need(mem_fd, (char*)regs_in.rdi)) terminfo_enabled=1;
        }
        if(wait_for_syscall(child, &regs_in) == -1) break;
    }

    // At this point, we are *inside* a syscall

    // Switch child namespaces + cleanup
    for(ns = namespaces; ns->proc_name; ns++) {
        if(ns->stage != STAGE_CONTAINER)
            continue;

        if(inject_setns(child, ns->fd, ns->flag) == -1) {
            perror("inject_setns");
            exit(1);
        }

        if(inject_close(child, ns->fd) == -1) {
            perror("inject_close");
            exit(1);
        }

    }

    // Set user to root in the container context, drop any supplementary groups
    // best effort calls (no checks)
    inject_syscall(child, SYS_setreuid, 0, 0);
    inject_syscall(child, SYS_setregid, 0, 0);
    inject_syscall(child, SYS_setgroups, 0, 0);

    // Restore original syscall
    if(ptrace(PTRACE_SETREGS, child, NULL, &regs_in) == -1) {
        perror("ptrace");
        exit(1);
    }
    if(wait_for_syscall(child, NULL) == -1) exit(1);

    // At this point, we are *outside* any syscall

    // In theory, we are good to go. BUT, we still need to handle special treatment cases
    // like terminfo (loves this one...)
    if(terminfo_enabled && terminfo_fd != -1) {
        int done = 0;
        int dirty = 0;
        // let's cheat on terminfo's access/open
        while(!done) {
            dirty = 0;

            // wait for syscall entry+exit
            if(wait_for_syscall(child, &regs_in) == -1) break;
            if(wait_for_syscall(child, &regs_in) == -1) break;

            // proxy stat
            if(regs_in.orig_rax == SYS_stat) {
                if(proc_read_string(mem_fd, (char*)regs_in.rdi, path, sizeof(path)) != -1 && proxy_is_ok(path)) {
                    regs_in.rax = proxy_stat(mem_fd, path, (struct stat*)regs_in.rsi);
                    dirty = 1;
                }
            }

            // proxy access
            if(regs_in.orig_rax == SYS_access) {
                if(proc_read_string(mem_fd, (char*)regs_in.rdi, path, sizeof(path)) != -1 && proxy_is_ok(path)) {
                    regs_in.rax = proxy_access(mem_fd, path, (int)regs_in.rsi);
                    dirty = 1;
                }
            }

            // proxy open
            if(regs_in.orig_rax == SYS_open) {
                if(proc_read_string(mem_fd, (char*)regs_in.rdi, path, sizeof(path)) != -1 && proxy_is_ok(path) && terminfo_is_descfile(path)) {
                    // for this one, we cheat: it's already open. Much easier than forwarding the fd or proxying *all* syscalls...
                    regs_in.rax = terminfo_fd;
                    dirty = 1;
                    done = 1;
                }
            }

            // commit
            if(dirty && ptrace(PTRACE_SETREGS, child, NULL, &regs_in) == -1) {
                perror("ptrace");
                exit(1);
            }

            // KEEP THIS: you'll need it when debugging
            /*fprintf(stderr, "%lld(%lld, %lld)=%lld\n",
                    regs_in.orig_rax,
                    regs_in.rdi,
                    regs_in.rsi,
                    regs_in.rax
            );*/
        }
        if(wait_for_syscall(child, NULL) == -1) exit(1);
    }

    // exit
    ptrace(PTRACE_DETACH, child, 0, 0);
    waitpid(child, &status, 0);

    if(WIFEXITED(status)) {
        return WEXITSTATUS(status);
    } else {
        return 1;
    }
}

--------------------------------------------------------------------------------
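
The `+` suffix convention used by `proxy_whitelist` in `main.c` can be tried in isolation. What follows is a minimal standalone sketch, not part of the repository: `whitelist_match`, `path_is_whitelisted` and `whitelist_self_check` are hypothetical names that mirror the roles of `str_prefix_match` and `proxy_is_ok`, and the whitelist entries are copied from the source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Entries ending with '+' are prefixes; all other entries must match exactly. */
static const char *const whitelist[] = {
    "/etc/terminfo",
    "/etc/terminfo/+",
    "/lib/terminfo",
    "/lib/terminfo/+",
    "/usr/share/terminfo",
    "/usr/share/terminfo/+",
    NULL,
};

/* Return 1 if candidate matches pattern, 0 otherwise (hypothetical helper). */
int whitelist_match(const char *pattern, const char *candidate) {
    size_t len;

    if (pattern == NULL || candidate == NULL)
        return 0;

    len = strlen(pattern);
    if (len == 0)
        return 0;

    /* trailing '+': compare everything before it as a prefix */
    if (pattern[len - 1] == '+')
        return strncmp(pattern, candidate, len - 1) == 0;

    return strcmp(pattern, candidate) == 0;
}

/* Return 1 if path matches at least one whitelist entry, 0 otherwise. */
int path_is_whitelisted(const char *path) {
    const char *const *entry;

    for (entry = whitelist; *entry != NULL; entry++) {
        if (whitelist_match(*entry, path))
            return 1;
    }
    return 0;
}

/* Usage example: call this from main() or from a test runner. */
void whitelist_self_check(void) {
    assert(path_is_whitelisted("/etc/terminfo"));
    assert(path_is_whitelisted("/usr/share/terminfo/x/xterm"));
    assert(!path_is_whitelisted("/etc/passwd"));
    assert(!path_is_whitelisted("/etc/terminfo-evil"));
    /* a plain prefix test does not normalize paths: ".." components still match */
    assert(path_is_whitelisted("/usr/share/terminfo/../../../etc/shadow"));
}
```

The last assertion shows a limit of this kind of check in the sketch: the comparison works on the raw string, so a caller that needs strict containment would have to normalize the path first.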