├── .gitignore ├── COPYING ├── Makefile ├── README.rst ├── scnpm.nim └── scnpm.service /.gitignore: -------------------------------------------------------------------------------- 1 | /scnpm 2 | /scnpm.cm* 3 | /*.o 4 | -------------------------------------------------------------------------------- /COPYING: -------------------------------------------------------------------------------- 1 | DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE 2 | Version 2, December 2004 3 | 4 | Copyright (C) 2012 Mike Kazantsev 5 | 6 | Everyone is permitted to copy and distribute verbatim or modified 7 | copies of this license document, and changing it is allowed as long 8 | as the name is changed. 9 | 10 | DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE 11 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 12 | 13 | 0. You just DO WHAT THE FUCK YOU WANT TO. 14 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | all: scnpm 2 | 3 | scnpm: scnpm.nim 4 | nim c -d:release -d:strip -d:lto_incremental --opt:size $< 5 | 6 | clean: 7 | rm -f scnpm 8 | 9 | .SUFFIXES: # to disable built-in rules for %.c and such 10 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | systemd cgroup (v2) nftables policy manager 2 | =========================================== 3 | 4 | .. 
contents:: 5 | :backlinks: none 6 | 7 | Repository URLs: 8 | 9 | - https://github.com/mk-fg/systemd-cgroup-nftables-policy-manager 10 | - https://codeberg.org/mk-fg/systemd-cgroup-nftables-policy-manager 11 | - https://fraggod.net/code/git/systemd-cgroup-nftables-policy-manager 12 | 13 | 14 | 15 | Description 16 | ----------- 17 | 18 | Small tool that adds and updates nftables_ cgroupv2 filtering rules for 19 | systemd_-managed per-unit cgroups (slices, services, scopes). 20 | 21 | "cgroupv2" is also often referred to as "unified cgroup hierarchy" (considered 22 | stable in linux since 2015). It works differently from the old cgroup implementation, 23 | and is the only one supported here. 24 | 25 | A similar capability has also been added to systemd versions 255+ (2023-12-06 and 26 | later) via the NFTSet= option in unit files (see `"man systemd.resource-control"`_), 27 | but its use is limited to system units (can't be used in ``~/.config/systemd/user`` 28 | session units). 29 | 30 | This tool is somewhat redundant with that functionality, but can still be useful 31 | for user session units, or if NFTSet= doesn't work for some purpose/reason. 32 | 33 | .. _nftables: https://nftables.org/ 34 | .. _systemd: https://systemd.io/ 35 | .. _"man systemd.resource-control": 36 | https://man.archlinux.org/man/systemd.resource-control.5 37 | 38 | 39 | Problem that it addresses 40 | ~~~~~~~~~~~~~~~~~~~~~~~~~ 41 | 42 | nftables supports "socket cgroupv2" matching in rules (linux-5.13+), 43 | similar to iptables' "-m cgroup --path ...", which can be used to add rules 44 | like this:: 45 | 46 | add rule inet filter output socket cgroupv2 level 5 \ 47 | "user.slice/user-1000.slice/user@1000.service/app.slice/myapp.service" accept 48 | 49 | (or in iptables: ``iptables -A OUTPUT -m cgroup --path ... 
-j ACCEPT``) 50 | 51 | But when trying to put this into /etc/nftables.conf, it will fail to load on boot 52 | (same as similar iptables rules), as that "myapp.service" cgroup with a long 53 | path does not exist yet. 54 | 55 | Both nftables and iptables rules are handled by kernel modules (nft_socket and 56 | xt_cgroup respectively) that - when looking at the packet - actually match a numeric 57 | cgroup ID resolved when the rule was added, and not the path string, and do not update those IDs dynamically when cgroups are created/removed in any way. 58 | 59 | This means that: 60 | 61 | - Firewall rules can't be added for not-yet-existing cgroups. 62 | 63 | Causes "Error: cgroupv2 path fails: No such file or directory" from "nft" 64 | command and "xt_cgroup: invalid path, errno=-2" error in dmesg for iptables. 65 | 66 | - If a cgroup gets removed and re-created, none of the existing rules will apply to it. 67 | 68 | This is because the new cgroup gets a new unique ID, which can't be present in any 69 | pre-existing netfilter tables, so none of the rules will match it. 70 | 71 | So basically such rules in a system-wide policy-config only work for cgroups 72 | that are created early on boot and never removed after that. 73 | 74 | This is not what happens with most systemd services and slices, restarting which 75 | will also re-create cgroups, and which are usually started way after system 76 | firewalls are initialized (and often can't be started on boot - e.g. user units). 77 | 78 | 79 | Solution: 80 | ~~~~~~~~~ 81 | 82 | Since this tool was written, the ``NFTSet=`` directive was added to systemd, 83 | which mostly addresses this for system units already - use that if possible, 84 | and see caveats section below for some of the potential shortcomings there. 85 | 86 | Monitor cgroup (or systemd unit) creation/removal events and (re-)apply any 87 | relevant rules to these dynamically. 
88 | 89 | This is `how "socket cgroupv2" matcher in nftables is intended to work`_:: 90 | 91 | Following the decoupled approach: If the cgroup is gone, the filtering 92 | policy would not match anymore. You only have to subscribe to events 93 | and perform an incremental updates to tear down the side of the 94 | filtering policy that you don't need anymore. If a new cgroup is 95 | created, you load the filtering policy for the new cgroup and then add 96 | processes to that cgroup. You only have to follow the right sequence 97 | to avoid problems. 98 | 99 | So that's pretty much what this simple tool does, subscribing to systemd unit 100 | start/stop events via journal (using libsystemd) and updating any relevant rules 101 | on events from there (using libnftables). 102 | 103 | .. _how "socket cgroupv2" matcher in nftables is intended to work: 104 | https://patchwork.ozlabs.org/project/netfilter-devel/patch/1479114761-19534-1-git-send-email-pablo@netfilter.org/#1511797 105 | 106 | 107 | Intended use-case: 108 | ~~~~~~~~~~~~~~~~~~ 109 | 110 | Defining system-wide policy to whitelist connections to/from specific systemd 111 | units (can be services/apps, slices of those, or ad-hoc scopes) in an easy and 112 | relatively foolproof way. 113 | 114 | I.e. if a desktop system is connected to some kind of "intranet" VPN, there's 115 | no reason for random complex and leaky apps like web browsers or games to be able 116 | to connect to anything there (think fetch() JS call from any site you visit), 117 | and that is trivial to block with a single firewall rule. 118 | 119 | This tool is intended to manage a whitelist of rules for systemd units on top, 120 | that should have access there, and hence are allowed to bypass such rule. 121 | 122 | Again, systemd has aforementioned NFTSet= option, as well as network filtering 123 | via eBPFs attached to cgroups (IPAddressAllow/Deny=, BPFProgram=, IPEgressFilterPath= 124 | and such), which can be used as an alternative to this tool. 
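The subscribe-and-reapply sequence from the Solution section above can be sketched roughly as follows. This is only a minimal illustration in Nim and not the tool's actual code: ``Reapplier``, ``NftRun`` and ``onUnitEvent`` are hypothetical names, the ``run`` callback stands in for libnftables so that the block has no external dependencies, and every rule is assumed to start with a "family table chain" prefix.

```nim
import std/[tables, strutils]

type
  # Hypothetical stand-in for libnftables: runs one nft command and returns
  # the handle of a newly added rule (the value is ignored for deletes).
  NftRun = proc (cmd: string): int
  Reapplier = object
    rules: Table[string, seq[string]]  # unit name -> "family table chain ..." rules
    handles: Table[string, int]        # rule -> handle from the last add
    run: NftRun

proc onUnitEvent(r: var Reapplier, unit: string) =
  ## Called on any state change of `unit`: remove stale rules, then add them
  ## again, so that they get resolved against the cgroup that exists right now.
  for rule in r.rules.getOrDefault(unit):
    # Assumption: the rule has at least three words - family, table, chain.
    let chain = rule.splitWhitespace()[0..2].join(" ")
    if rule in r.handles:
      discard r.run("delete rule " & chain & " handle " & $r.handles[rule])
    r.handles[rule] = r.run("add rule " & rule)
```

Keying handles by rule text keeps removal precise: only rules that this process added earlier get deleted, and units without any configured rules are simply ignored.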
125 | 126 | 127 | 128 | Build / Install 129 | --------------- 130 | 131 | This is a small Nim_ command-line app, can be built with any modern 132 | `Nim compiler`_, e.g. using included Makefile:: 133 | 134 | % make 135 | % ./scnpm --help 136 | Usage: ./scnpm [opts] [nft-configs ...] 137 | ... 138 | 139 | (or run ``nim c -d:release -d:strip -d:lto_incremental --opt:size scnpm.nim`` without make) 140 | 141 | That should produce ~150K binary, linked against libsystemd (for journal access) 142 | and libnftables (to re-apply cgroupv2 nftables rules), which can then be installed 143 | and copied between systems normally. 144 | Nim compiler is only needed to build the tool, not to run it. 145 | 146 | scnpm.service_ systemd unit file can be used to auto-start it on boot. 147 | 148 | Journal is used as an event source instead of more conventional dbus signals to 149 | be able to monitor state changes of units under all "systemd --user" instances 150 | as well as system ones, which are sent through multiple transient dbus brokers, 151 | so much more difficult to reliably track there. 152 | 153 | .. _Nim: https://nim-lang.org/ 154 | .. _Nim compiler: https://nim-lang.org/install_unix.html 155 | .. 
_scnpm.service: scnpm.service 156 | 157 | 158 | 159 | Usage 160 | ----- 161 | 162 | Tool is designed to parse special commented-out rules for it from the same 163 | nftables.conf as used with the rest of ruleset, for consistency 164 | (though of course they can be stored in any other file(s) as well):: 165 | 166 | ## Allow connections to smtp over vpn for system postfix.service 167 | # postfix.service :: add rule inet filter vpn.whitelist \ 168 | # socket cgroupv2 level 2 "system.slice/postfix.service" tcp dport 25 accept 169 | 170 | ## Allow connections to intranet mail for a scope unit running under "systemd --user" 171 | ## "systemd-run" can be used to easily start apps in custom scopes or slices 172 | # app-mail.scope :: add rule inet filter vpn.whitelist socket cgroupv2 level 5 \ 173 | # "user.slice/user-1000.slice/user@1000.service/app.slice/app-mail.scope" \ 174 | # ip daddr mail.intranet.local tcp dport {25, 143} accept 175 | 176 | ## Only allow whitelisted apps to connect over "my-vpn" iface 177 | add rule inet filter output oifname my-vpn jump vpn.whitelist 178 | add rule inet filter output oifname my-vpn drop 179 | 180 | If left uncommented, those "add rule" lines would normally make this config fail to apply on 181 | boot, as those service/scope/slice cgroups won't exist yet at that point in time. 182 | 183 | Script will parse those "unit-name :: rule" comments, and try to apply 184 | rules from them on start and whenever any kind of state-change happens to a unit 185 | with the name specified there. 186 | 187 | For example, when postfix.service is stopped/restarted with the config above, 188 | corresponding vpn.whitelist rule will be removed and re-added, allowing access 189 | to a new cgroup which systemd will create for it after restart. 
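To illustrate the comment format used in the config example above, here is a small Nim sketch that extracts a unit name and a rule from a single comment line. It is a simplified stand-in, not the parser from scnpm.nim (which uses regular expressions and also joins backslash-continued lines); ``UnitRule`` and ``parseUnitRule`` are hypothetical names.

```nim
import std/[strutils, options]

type UnitRule = tuple[unit, rule: string]

proc parseUnitRule(line: string): Option[UnitRule] =
  ## Hypothetical helper: parses one
  ## "# unit.type :: [add [rule]] family table chain ..." comment line
  ## and returns none for anything else. Continuation lines are not handled.
  if not line.startsWith("#"): return
  let parts = line.strip(trailing = false, chars = {'#'}).split(" :: ", maxsplit = 1)
  if parts.len != 2: return
  let unit = parts[0].strip()
  if unit.len == 0 or ' ' in unit or '.' notin unit: return
  var rule = parts[1].strip()
  # The "add" and "rule" keywords are optional and dropped here.
  if rule.startsWith("add "): rule = rule[4..^1].strip()
  if rule.startsWith("rule "): rule = rule[5..^1].strip()
  # Sanity check: "family table chain" prefix plus at least one more word.
  if rule.splitWhitespace().len < 4: return
  result = some((unit: unit, rule: rule))

when isMainModule:
  let line = "# postfix.service :: add rule inet filter vpn.whitelist tcp dport 25 accept"
  echo parseUnitRule(line)
```

Regular ``##`` comments and normal ``add rule`` lines contain no leading ``#`` followed by a ``unit :: rule`` pair, so they yield ``none`` and are left for nftables itself to handle.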
190 | 191 | To start it in verbose mode: ``./scnpm --flush --debug /etc/nftables.conf`` 192 | 193 | ``-f/--flush`` option will purge (flush) all chains mentioned in the rules 194 | that will be monitored/applied on tool start, so that leftover rules from any 195 | previous runs are removed, and can be replaced with more fine-grained manual 196 | removal if these are not dedicated chains used for such dynamic rules only. 197 | 198 | Running without ``-d/--debug`` should not normally produce any output, unless 199 | there are some (non-critical) warnings like unexpected mismatch or nftables error, 200 | code bugs or fatal errors. 201 | 202 | Starting the tool on boot should be scheduled after nftables.service, 203 | so that ``--flush`` option will be able to find all required chains, 204 | and will exit with an error otherwise. 205 | 206 | Multiple nftables rules linked to same systemd unit(s) are allowed. 207 | 208 | Changes in parsed config files are not auto-detected, and only applied by 209 | either sending SIGHUP or tool restart, which can be done manually after changes, 210 | configured in nftables.service (e.g. via PropagatesReloadTo= and/or BindsTo=) 211 | or systemd.path unit monitoring state of source configuration file(s); 212 | or - without signal - using ``-u/--reload-with-unit`` or ``-a/--reapply-with-unit`` 213 | opts, since this tool monitors systemd unit states anyway, and can spot when 214 | things restart there on its own. 215 | 216 | Syntax errors in nftables rules should produce warnings when these are applied on 217 | tool start or changes, so should be hard to miss, but maybe do check "nft list chain" 218 | or debug output when rules are supposed to be enabled after conf updates anyway. 
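The ``-f/--flush`` behavior described above boils down to collecting the distinct "family table chain" prefixes of all monitored rules and flushing each of those chains once. A minimal Nim sketch of that step, under the assumption that rules are already stripped of their "add rule" keywords (``flushCommands`` is a hypothetical name, and the sketch only builds command strings instead of calling libnftables):

```nim
import std/[strutils, sequtils]

proc flushCommands(rules: openArray[string]): seq[string] =
  ## Hypothetical helper: one "flush chain ..." command per distinct chain,
  ## in order of first appearance. Rules shorter than three words are skipped.
  var chains: seq[string]
  for rule in rules:
    let words = rule.splitWhitespace()
    if words.len < 3: continue
    let chain = words[0..2].join(" ")
    if chain notin chains: chains.add(chain)
  result = chains.mapIt("flush chain " & it)

when isMainModule:
  for cmd in flushCommands([
      "inet filter vpn.whitelist tcp dport 25 accept",
      "inet filter vpn.whitelist ip daddr 10.0.0.1 accept" ]):
    echo cmd
```

Since flushing removes every rule in a chain, this only makes sense when such chains are dedicated to dynamic rules, as noted in the description of the option.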
219 | 220 | To modify nftables rulesets, CAP_NET_ADMIN capability is required, which can be 221 | passed via AmbientCapabilities= in systemd service (or similar option in capsh) 222 | in addition to SupplementaryGroups=systemd-journal and netlink access to avoid 223 | running this as full root. 224 | 225 | 226 | 227 | Caveats and limitations 228 | ----------------------- 229 | 230 | - Due to "best-effort" nature of trying to apply rules when unit startup is 231 | detected, and an inherent race condition between systemd creating 232 | service/cgroup and rule being applied, I'd heavily recommend to always use 233 | allow-listing rules with this tool, which fail on the safe side. 234 | 235 | - I think "cgroupv2" in nftables rule must be the one where network socket was 236 | created, and not the one where systemd might move the process using it. 237 | 238 | So for incoming ssh connections for example, "sshd-session" process might 239 | end up in session-N.scope under user.slice, but nftables will only match it 240 | as belonging to sshd.service cgroup, so some rules might need to have different 241 | cgroup string in the rule than a name that triggers the rule to the left of it. 242 | 243 | Not 100% sure that's how it works or supposed to work, but have observed it earlier. 244 | 245 | - Use HUP signal, ``-u/--reload-with-unit`` (same as SIGHUP) or ``-a/--reapply-with-unit`` 246 | option to restore transient cgroup-specific rules after nftables restart 247 | or other firewall resets that'd remove those. 248 | 249 | 250 | 251 | Links 252 | ----- 253 | 254 | - `systemd.resource-control(5)`_ manpage that describes implementation of 255 | similar functionality there - lookup ``NFTSet=`` option. 256 | 257 | - `Systemd firewall integration suggestions (issue #7327)`_ - more comprehensive 258 | netfilter integration than NFTSet= option above, still at a proposal/suggestion 259 | stage at the moment (2025-04-10), neither accepted nor rejected. 
260 | 261 | - `helsinki-systems/nft_cgroupv2`_ - alternative third-party implementation of 262 | such matching in nftables. 263 | 264 | AFAICT it doesn't rely on cgroup id's and instead resolves these from cgroup 265 | path for every packet, which is probably not great wrt performance, but might 266 | be ok for most use-cases where conntrack filters-out traffic before these rules. 267 | 268 | Might conflict with current upstream nftables implementation due to "cgroupv2" 269 | keyword used there as well. 270 | 271 | - `Upstreamed "netfilter: nft_socket: add support for cgroupsv2" patch 272 | `_ 273 | for "cgroupv2" matching support in nftables (0.99+) on the linux kernel side (linux-5.13+). 274 | 275 | - `"netfilter: implement xt_cgroup cgroup2 path match" patch 276 | `_ 277 | from linux-4.5. 278 | 279 | - Earlier version of this tool was written in OCaml_, and can be last found in `commit 280 | 048a8128 `_. 281 | 282 | .. _systemd.resource-control(5): https://man.archlinux.org/man/systemd.resource-control.5 283 | .. _Systemd firewall integration suggestions (issue #7327): 284 | https://github.com/systemd/systemd/issues/7327 285 | .. _helsinki-systems/nft_cgroupv2: https://github.com/helsinki-systems/nft_cgroupv2/ 286 | .. _OCaml: https://ocaml.org/ 287 | -------------------------------------------------------------------------------- /scnpm.nim: -------------------------------------------------------------------------------- 1 | #? 
replace(sub = "\t", by = " ") 2 | # 3 | # Debug build/run: nim c -w=on --hints=on -r scnpm.nim -h 4 | # Final build: nim c -d:release -d:strip -d:lto_incremental --opt:size scnpm.nim 5 | # Usage info: ./scnpm -h 6 | 7 | import std/[ parseopt, os, posix, logging, sugar, 8 | strformat, strutils, sequtils, re, tables, times, monotimes ] 9 | 10 | template nfmt(v: untyped): string = ($v).insertSep # format integer with digit groups 11 | 12 | 13 | ### sd-journal api wrapper 14 | 15 | {.passl: "-lsystemd"} 16 | 17 | type 18 | sd_journal {.importc: "sd_journal*", header: "<systemd/sd-journal.h>".} = object 19 | sd_journal_msg {.importc: "const void*".} = cstring 20 | let 21 | SD_JOURNAL_LOCAL_ONLY {.importc, nodecl.}: cint 22 | SD_JOURNAL_NOP {.importc, nodecl.}: cint 23 | 24 | proc c_strerror(errnum: cint): cstring {.importc: "strerror", header: "<string.h>".} 25 | 26 | proc sd_journal_open(sdj: ptr sd_journal, flags: cint): cint {.importc, header: "<systemd/sd-journal.h>".} 27 | proc sd_journal_close(sdj: sd_journal) {.importc, header: "<systemd/sd-journal.h>".} 28 | proc sd_journal_get_fd(sdj: sd_journal): cint {.importc, header: "<systemd/sd-journal.h>".} 29 | proc sd_journal_seek_tail(sdj: sd_journal): cint {.importc, header: "<systemd/sd-journal.h>".} 30 | proc sd_journal_wait(sdj: sd_journal, timeout_us: uint64): cint {.importc, header: "<systemd/sd-journal.h>".} 31 | proc sd_journal_next(sdj: sd_journal): cint {.importc, header: "<systemd/sd-journal.h>".} 32 | proc sd_journal_previous(sdj: sd_journal): cint {.importc, header: "<systemd/sd-journal.h>".} 33 | proc sd_journal_get_data( sdj: sd_journal, field: cstring, 34 | msg: ptr sd_journal_msg, msg_len: ptr csize_t ): cint {.importc, header: "<systemd/sd-journal.h>".} 35 | proc sd_journal_add_match( sdj: sd_journal, 36 | s: cstring, s_len: csize_t ): cint {.importc, header: "<systemd/sd-journal.h>".} 37 | proc sd_journal_add_disjunction(sdj: sd_journal): cint {.importc, header: "<systemd/sd-journal.h>".} # OR 38 | proc sd_journal_add_conjunction(sdj: sd_journal): cint {.importc, header: "<systemd/sd-journal.h>".} # AND 39 | 40 | type 41 | Journal = object 42 | ctx: sd_journal 43 | closed: bool 44 | ret: cint 45 | c_msg: sd_journal_msg 46 | c_msg_len: csize_t 47 | field: string 48 | 
JournalError = object of CatchableError 49 | 50 | template sdj_err(call: untyped, args: varargs[untyped]): cint = 51 | if o.closed: o.ret = -EPIPE 52 | else: o.ret = when varargsLen(args) > 0: 53 | `sd journal call`(o.ctx, args) else: `sd journal call`(o.ctx) 54 | -o.ret 55 | template sdj(call: untyped, args: varargs[untyped]) = 56 | discard sdj_err(call, args) 57 | if o.ret < 0: raise newException( JournalError, 58 | "sd_journal_" & astToStr(call) & &" failed = {c_strerror(-o.ret)}" ) 59 | template sdj_ret(call: untyped, args: varargs[untyped]): cint = 60 | sdj(call, args) 61 | o.ret 62 | 63 | method init(o: var Journal) {.base.} = 64 | if sd_journal_open(o.ctx.addr, SD_JOURNAL_LOCAL_ONLY) < 0: 65 | raise newException(JournalError, "systemd journal open failed") 66 | o.closed = false 67 | sdj get_fd # to mimic journalctl.c - "means the first sd_journal_wait() will actually wait" 68 | sdj seek_tail 69 | sdj previous 70 | 71 | method close(o: var Journal) {.base.} = 72 | o.closed = true 73 | sd_journal_close(o.ctx) 74 | 75 | method setup_filters(o: var Journal) {.base.} = 76 | # systemd journal match-list uses CNF logic, e.g. "level=X && (unit=A || ... || tag=B || ...) && ..." 
77 | # online CNF calculator: https://www.dcode.fr/boolean-expressions-calculator 78 | # systemd does not support negation atm - https://github.com/systemd/systemd/pull/12592 79 | # Using journal_make_match_string in test-journal-match.c is a good way to make sense of this: 80 | # meson setup build && ninja -C build -v libsystemd && meson test -C build test-journal-match 81 | sdj(add_match, "SYSLOG_IDENTIFIER=systemd", 25) 82 | sdj add_conjunction 83 | sdj(add_match, "JOB_RESULT=done", 15) 84 | sdj add_disjunction 85 | sdj(add_match, "JOB_TYPE=start", 14) 86 | sdj add_conjunction 87 | sdj(add_match, "_SYSTEMD_UNIT=init.scope", 24) 88 | sdj add_disjunction 89 | sdj(add_match, "_SYSTEMD_USER_UNIT=init.scope", 29) 90 | 91 | method read_field(o: var Journal, key: string): bool {.base.} = 92 | result = sdj_err(get_data, key, o.c_msg.addr, o.c_msg_len.addr) != ENOENT 93 | if result: 94 | o.field = newString(o.c_msg_len) 95 | copyMem(o.field.cstring, o.c_msg, o.c_msg_len) 96 | o.field = o.field.substr(key.len + 1) 97 | 98 | method poll( 99 | o: var Journal, units: Table[string, seq[string]], 100 | us: uint64 = 3600_000_000'u64 ): string {.base.} = 101 | while true: 102 | while sdj_ret(next) > 0: 103 | if not o.read_field("UNIT"): 104 | if not o.read_field("USER_UNIT"): continue 105 | let relevant = units.hasKey(o.field) 106 | debug("Detected job-event for unit: ", o.field, " [has-rule=", relevant, "]") 107 | if relevant: return o.field 108 | if sdj_ret(wait, us) == SD_JOURNAL_NOP: 109 | debug("Journal-poll wait-timeout: us=", us.nfmt) 110 | return "" 111 | 112 | 113 | ### libnftables bindings 114 | 115 | {.passl: "-lnftables"} 116 | 117 | type nft_ctx {.importc: "nft_ctx*", header: "<nftables/libnftables.h>".} = distinct pointer 118 | let 119 | NFT_CTX_DEFAULT {.importc, nodecl.}: uint32 120 | NFT_CTX_OUTPUT_ECHO {.importc, nodecl.}: cuint 121 | NFT_CTX_OUTPUT_HANDLE {.importc, nodecl.}: cuint 122 | NFT_CTX_OUTPUT_NUMERIC_ALL {.importc, nodecl.}: cuint 123 | NFT_CTX_OUTPUT_TERSE {.importc, 
nodecl.}: cuint 124 | 125 | proc nft_ctx_new(flags: uint32): nft_ctx {.importc, header: "<nftables/libnftables.h>".} 126 | proc nft_ctx_free(ctx: nft_ctx) {.importc, header: "<nftables/libnftables.h>".} 127 | proc nft_ctx_buffer_output(ctx: nft_ctx): cint {.importc, header: "<nftables/libnftables.h>".} 128 | proc nft_ctx_buffer_error(ctx: nft_ctx): cint {.importc, header: "<nftables/libnftables.h>".} 129 | proc nft_ctx_output_set_flags(ctx: nft_ctx, flags: cuint) {.importc, header: "<nftables/libnftables.h>".} 130 | proc nft_run_cmd_from_buffer(ctx: nft_ctx, buff: cstring): cint {.importc, header: "<nftables/libnftables.h>".} 131 | proc nft_ctx_get_output_buffer(ctx: nft_ctx): cstring {.importc, header: "<nftables/libnftables.h>".} 132 | proc nft_ctx_get_error_buffer(ctx: nft_ctx): cstring {.importc, header: "<nftables/libnftables.h>".} 133 | 134 | type 135 | NFTables = object 136 | ctx: nft_ctx 137 | ret: cint 138 | NFTablesError = object of CatchableError 139 | NFTablesCmdFail = object of CatchableError 140 | NFTablesCmdNoCG = object of CatchableError 141 | 142 | let 143 | re_rule_prefix = re"^(\S+ +\S+ +\S+ +)" 144 | re_handle = re"^(add|insert) rule .* # handle (\d+)\n" 145 | re_add_skip = re"^Error: cgroupv2 path fails: No such file or directory\b" 146 | 147 | method init(o: var NFTables) {.base.} = 148 | o.ctx = nft_ctx_new(NFT_CTX_DEFAULT) 149 | if nft_ctx_buffer_output(o.ctx) != 0: 150 | raise newException(NFTablesError, "nftables set output buffering failed") 151 | if nft_ctx_buffer_error(o.ctx) != 0: 152 | raise newException(NFTablesError, "nftables set error buffering failed") 153 | nft_ctx_output_set_flags( o.ctx, 154 | NFT_CTX_OUTPUT_ECHO or NFT_CTX_OUTPUT_HANDLE or 155 | NFT_CTX_OUTPUT_NUMERIC_ALL or NFT_CTX_OUTPUT_TERSE ) 156 | 157 | method close(o: var NFTables) {.base.} = nft_ctx_free(o.ctx) 158 | 159 | method run(o: NFTables, commands: string): string {.base.} = 160 | let 161 | success = nft_run_cmd_from_buffer(o.ctx, commands.cstring) == 0 162 | buff = if success: nft_ctx_get_output_buffer(o.ctx) else: nft_ctx_get_error_buffer(o.ctx) 163 | result = ($buff).strip() 164 | if not success: 165 | if result =~ re_add_skip: raise 
newException(NFTablesCmdNoCG, result) 166 | raise newException(NFTablesCmdFail, result) 167 | 168 | method apply(o: NFTables, rule: string): int {.base.} = 169 | debug("nft apply :: ", rule) 170 | let s = o.run(rule) 171 | if s =~ re_handle: return matches[1].parseInt 172 | raise newException(NFTablesError, "BUG - failed to parse rule handle from nft output:\n" & s.strip()) 173 | 174 | method delete(o: NFTables, rule: string, handle: int) {.base.} = 175 | debug("nft delete :: ", handle, " :: ", rule) 176 | if rule =~ re_rule_prefix: 177 | discard o.run(&"delete rule {matches[0]} handle {handle}") 178 | else: raise newException(NFTablesError, "Failed to match table-prefix in a rule") 179 | 180 | method flush_rule_chains(o: NFTables, rules: openArray[string]) {.base.} = 181 | var chains = collect(newSeq): 182 | for rule in rules: 183 | if rule =~ re_rule_prefix: matches[0] 184 | chains = chains.deduplicate 185 | for chain in chains: debug("nft flush :: ", chain) 186 | discard o.run(chains.mapIt(&"flush chain {it}").join("\n")) 187 | 188 | 189 | ### main 190 | 191 | var # globals for noconv 192 | sq = Journal() 193 | nft = NFTables() 194 | reexec = false 195 | 196 | type ParseError = object of CatchableError 197 | 198 | proc parse_unit_rules(conf_list: seq[string]): Table[string, seq[string]] = 199 | result = initTable[string, seq[string]]() 200 | var 201 | re_start = re"^#+ *([^ ]+\.[a-zA-Z]+) +:: +(add +(rule +)?)?(.*)$" 202 | re_cont = re"^#+ (.*)$" 203 | cg = "" 204 | line_pre = "" 205 | for p in conf_list: 206 | debug("Parsing config-file: ", p) 207 | for line in readFile(p).splitLines: 208 | var line = line 209 | 210 | if line_pre != "": 211 | if line =~ re_cont: 212 | line_pre &= " " & matches[0].strip(chars={'\\'}).strip() 213 | if line.endsWith("\\"): continue 214 | line = line_pre 215 | cg = ""; line_pre = "" 216 | else: raise newException( ParseError, 217 | &"Broken rule continuation [ {cg} ]: '{line_pre}' + '{line}'" ) 218 | 219 | if line =~ re_start: 220 
| cg = matches[0] 221 | if line.endsWith("\\"): line_pre = line.strip(chars={'\\'}).strip() 222 | else: 223 | if not result.hasKey(cg): result[cg] = newSeq[string]() 224 | line = matches[3].strip() 225 | if not (line =~ re_rule_prefix): 226 | raise newException( ParseError, 227 | &"Failed to validate 'family table chain ...' rule [ {cg} ]: {line}" ) 228 | result[cg].add(line) 229 | 230 | proc main_help(err="") = 231 | proc print(s: string) = 232 | let dst = if err == "": stdout else: stderr 233 | write(dst, s); write(dst, "\n") 234 | let app = getAppFilename().lastPathPart 235 | if err != "": print &"ERROR: {err}" 236 | print &"\nUsage: {app} [opts] [nft-configs ...]" 237 | if err != "": print &"Run '{app} --help' for more information"; quit 1 238 | print dedent(&""" 239 | 240 | systemd cgroup (v2) nftables policy manager. 241 | Small tool that adds and updates nftables cgroupv2 filtering 242 | rules for systemd-managed per-unit cgroups (slices, services, scopes). 243 | 244 | -f / --flush 245 | Flush nft chain(s) used in all parsed cgroup-rules on start, to clean up leftovers 246 | from previous run(s), as otherwise only rules added at runtime get replaced/removed. 247 | 248 | -u / --reload-with-unit unit-name 249 | Reload and re-apply all rules (and do -f/--flush if enabled) on systemd unit state changes. 250 | Can be useful to pass something like nftables.service with this option, as restarting 251 | that usually flushes nft rulesets and can indicate changes in the dynamic rules there. 252 | Option can be used multiple times to act on events from any of the specified units. 253 | Same as restarting the tool, done via simple re-exec internally, runs on SIGHUP. 254 | 255 | -a / --reapply-with-unit unit-name 256 | Same as -u/--reload-with-unit, but does not reload the rules. 257 | 258 | -c / --cooldown milliseconds 259 | Min interval between applying rules for the same cgroup/unit (default=300ms). 
260 | If multiple events for same unit are detected, 261 | subsequent ones are queued to apply after this interval. 262 | 263 | -d / --debug -- Verbose operation mode. 264 | """) 265 | quit 0 266 | 267 | proc main(argv: seq[string]) = 268 | var 269 | opt_flush = false 270 | opt_debug = false 271 | opt_cooldown = initDuration(milliseconds=300) 272 | opt_reapply_units = newSeq[string]() 273 | opt_reexec_units = newSeq[string]() 274 | opt_nft_confs = newSeq[string]() 275 | 276 | block cli_parser: 277 | var opt_last = "" 278 | proc opt_fmt(opt: string): string = 279 | if opt.len == 1: &"-{opt}" else: &"--{opt}" 280 | proc opt_empty_check = 281 | if opt_last == "": return 282 | main_help &"{opt_fmt(opt_last)} option unrecognized or requires a value" 283 | proc opt_set(k: string, v: string) = 284 | if k in ["u", "reload-with-unit"]: opt_reexec_units.add(v) 285 | elif k in ["a", "reapply-with-unit"]: opt_reapply_units.add(v) 286 | elif k in ["c", "cooldown"]: opt_cooldown = initDuration(milliseconds=v.parseInt) 287 | else: main_help &"Unrecognized option [ {opt_fmt(k)} = {v} ]" 288 | 289 | for t, opt, val in getopt(argv): 290 | case t 291 | of cmdEnd: break 292 | of cmdShortOption, cmdLongOption: 293 | if opt in ["h", "help"]: main_help() 294 | elif opt in ["f", "flush"]: opt_flush = true 295 | elif opt in ["d", "debug"]: opt_debug = true 296 | elif val == "": opt_empty_check(); opt_last = opt 297 | else: opt_set(opt, val) 298 | of cmdArgument: 299 | if opt_last != "": opt_set(opt_last, opt); opt_last = "" 300 | else: opt_nft_confs.add(opt) 301 | opt_empty_check() 302 | 303 | var logger = newConsoleLogger( 304 | fmtStr="$levelid $datetime :: ", useStderr=true, 305 | levelThreshold=if opt_debug: lvlAll else: lvlInfo ) 306 | addHandler(logger) 307 | 308 | debug("Processing configuration...") 309 | var rules = parse_unit_rules(opt_nft_confs) 310 | for unit in opt_reexec_units: rules[unit] = @[] 311 | for unit in opt_reapply_units: rules[unit] = @[] # special "rules" to 
reapply/reexec 312 | 313 | debug("Parsed following cgroup rules (empty=reapply-rules):") 314 | for unit, rules in rules.pairs: 315 | for rule in rules: debug(" ", unit, " :: ", rule) 316 | 317 | debug("Initializing nftables/journal components...") 318 | nft.init() 319 | defer: nft.close() 320 | sq.init() # closed from signal handler to interrupt wait 321 | sq.setup_filters() 322 | 323 | onSignal(SIGINT, SIGTERM, SIGHUP): 324 | if sig == SIGHUP: 325 | debug("Got re-exec signal, restarting...") 326 | reexec = true 327 | else: debug("Got exit signal, shutting down...") 328 | sq.close() 329 | 330 | var 331 | ts_now: MonoTime 332 | ts_wake: MonoTime 333 | ts_wake_unit: string 334 | unit: string 335 | rule_queue = initTable[string, tuple[ts: MonoTime, apply: bool]]() 336 | rule_handles = initTable[string, int]() # rule -> handle 337 | reapply = false 338 | 339 | proc rules_queue_all() = 340 | debug("Rules schedule: all rules") 341 | ts_now = getMonoTime() 342 | for unit, rules in rules.pairs: 343 | if rules.len > 0: rule_queue[unit] = (ts: ts_now, apply: true) 344 | 345 | proc rules_apply(unit: string, rules: seq[string]) = 346 | for rule in rules: 347 | rule_handles.withValue(rule, n): 348 | try: nft.delete(rule, n[]) 349 | except NFTablesCmdFail: discard 350 | try: 351 | let n = nft.apply(rule) 352 | rule_handles[rule] = n 353 | debug("Rule added: unit=", unit, " handle=", n, " :: ", rule) 354 | except NFTablesCmdNoCG: 355 | debug("Rule skipped - no cgroup: unit=", unit, " :: ", rule) 356 | except NFTablesCmdFail as err: 357 | warn("Rule failed to apply: unit=", unit, " :: ", rule) 358 | for line in err.msg.strip.splitLines: warn(line.indent(2)) 359 | 360 | proc rules_flush() = 361 | nft.flush_rule_chains(rules.values.toSeq.concat) 362 | 363 | if opt_flush: 364 | debug("Flushing all affected nftables chains...") 365 | rules_flush() 366 | 367 | debug("Starting main loop...") 368 | rules_queue_all() # initial try-them-all after flush 369 | while not sq.closed: 370 | 
ts_now = getMonoTime(); ts_wake = ts_now; ts_wake_unit = "" 371 | 372 | for n, (unit, check) in rule_queue.pairs.toSeq: 373 | if ts_now >= check.ts: 374 | if check.apply: 375 | debug("Rules apply: unit=", unit) 376 | rule_queue[unit] = (ts: ts_now + opt_cooldown, apply: false) 377 | let rules = rules[unit] 378 | if rules.len == 0: # one of the -u/--re*-with-unit 379 | if unit in opt_reexec_units: 380 | debug("Rule for reload-with-unit event, restarting...") 381 | reexec = true; sq.close() 382 | else: 383 | debug("Rule for reapply-with-unit event") 384 | reapply = true 385 | break 386 | else: rules_apply(unit, rules) 387 | else: 388 | rule_queue.del(unit) 389 | debug("Rules cooldown expired: unit=", unit) 390 | elif check.ts < ts_wake or ts_wake == ts_now: 391 | ts_wake = check.ts; ts_wake_unit = unit 392 | 393 | if reapply: 394 | if opt_flush: rules_flush() 395 | rules_queue_all() 396 | reapply = false 397 | continue 398 | 399 | let delay = 400 | if ts_wake == ts_now: 3600_000_000'u64 401 | else: uint64((ts_wake - ts_now).inMicroseconds) 402 | if ts_wake_unit != "": 403 | debug("Rules cooldown wait: unit=", ts_wake_unit, " us=", delay.nfmt) 404 | try: unit = sq.poll(rules, delay) 405 | except JournalError: 406 | if sq.closed: break 407 | raise 408 | if unit == "": continue # timeout-wakeup 409 | 410 | ts_now = getMonoTime() 411 | rule_queue.withValue(unit, check): 412 | if check.ts > ts_now: 413 | debug( "Rules schedule delayed: unit=", unit, 414 | " us=", (check.ts - ts_now).inMicroseconds.nfmt ) 415 | check.apply = true; continue # apply after cooldown 416 | debug("Rules schedule now: unit=", unit) 417 | rule_queue[unit] = (ts: ts_now, apply: true) 418 | 419 | if reexec: 420 | debug("Restarting tool via exec...") 421 | let app = getAppFilename() 422 | discard execv(app.cstring, concat(@[app], argv).allocCStringArray) 423 | debug("Finished") 424 | 425 | when is_main_module: main(commandLineParams()) 426 | 
-------------------------------------------------------------------------------- /scnpm.service: -------------------------------------------------------------------------------- 1 | [Unit] 2 | After=nftables.service 3 | 4 | [Service] 5 | Type=exec 6 | ExecStart=scnpm --flush --reload-with-unit nftables.service /etc/nftables.conf 7 | ExecReload=kill -HUP $MAINPID 8 | 9 | DynamicUser=yes 10 | SupplementaryGroups=systemd-journal 11 | ProtectProc=noaccess 12 | ProcSubset=pid 13 | ProtectHome=yes 14 | PrivateDevices=yes 15 | 16 | CapabilityBoundingSet=CAP_NET_ADMIN 17 | AmbientCapabilities=CAP_NET_ADMIN 18 | RestrictAddressFamilies=AF_NETLINK 19 | SecureBits=noroot-locked 20 | SystemCallFilter=@system-service 21 | 22 | [Install] 23 | WantedBy=multi-user.target 24 | --------------------------------------------------------------------------------