├── .gitignore ├── core-dns-override ├── Dockerfile ├── README.md ├── build.json ├── config.json └── rootfs │ └── usr │ └── local │ └── bin │ ├── docker │ └── override.sh ├── repository.json └── telegraf ├── Dockerfile ├── README.md ├── build.json ├── config.json ├── example_setup.md ├── imgs ├── Screenshot_20211106_124408.png ├── Screenshot_20211106_124841.png ├── Screenshot_20211106_125449.png ├── Screenshot_20211106_145548.png ├── Screenshot_20211106_150804.png ├── Screenshot_20211108_195511.png ├── Screenshot_20211109_092307.png ├── Screenshot_20211109_092645.png └── Screenshot_20211109_093856.png └── rootfs └── usr └── local └── src └── entrypoint.sh /.gitignore: -------------------------------------------------------------------------------- 1 | *.swp 2 | *~ 3 | *bk 4 | -------------------------------------------------------------------------------- /core-dns-override/Dockerfile: -------------------------------------------------------------------------------- 1 | ARG BUILD_FROM=ghcr.io/hassio-addons/base/amd64:10.0.1 2 | # hadolint ignore=DL3006 3 | FROM ${BUILD_FROM} 4 | 5 | # Set shell 6 | SHELL ["/bin/bash", "-o", "pipefail", "-c"] 7 | 8 | # Add env 9 | ENV TERM="xterm-256color" 10 | 11 | 12 | # Setup base 13 | ARG BUILD_ARCH=amd64 14 | RUN apk add --no-cache docker 15 | 16 | RUN cp /usr/bin/docker /usr/local/bin/.undocked \ 17 | \ 18 | && find /usr/local \ 19 | \( -type d -a -name test -o -name tests -o -name '__pycache__' \) \ 20 | -o \( -type f -a -name '*.pyc' -o -name '*.pyo' \) \ 21 | -exec rm -rf '{}' + \ 22 | \ 23 | && rm -f -r \ 24 | /root/.cache \ 25 | /root/.cmake \ 26 | /tmp/* 27 | 28 | # Copy root filesystem 29 | COPY rootfs / 30 | 31 | # Build arguments 32 | ARG BUILD_ARCH 33 | ARG BUILD_DATE 34 | ARG BUILD_DESCRIPTION 35 | ARG BUILD_NAME 36 | ARG BUILD_REF 37 | ARG BUILD_REPOSITORY 38 | ARG BUILD_VERSION 39 | 40 | # Labels 41 | LABEL \ 42 | io.hass.name="${BUILD_NAME}" \ 43 | io.hass.description="${BUILD_DESCRIPTION}" \ 44 | 
io.hass.arch="${BUILD_ARCH}" \ 45 | io.hass.type="addon" \ 46 | io.hass.version=${BUILD_VERSION} \ 47 | maintainer="B Tasker" \ 48 | org.opencontainers.image.title="${BUILD_NAME}" \ 49 | org.opencontainers.image.description="${BUILD_DESCRIPTION}" \ 50 | org.opencontainers.image.vendor="BTasker Home Assistant Add-ons" \ 51 | org.opencontainers.image.authors="B Tasker" \ 52 | org.opencontainers.image.licenses="MIT" \ 53 | org.opencontainers.image.url="https://tobeconfirmed" \ 54 | org.opencontainers.image.source="https://github.com/${BUILD_REPOSITORY}" \ 55 | org.opencontainers.image.documentation="https://github.com/${BUILD_REPOSITORY}/blob/main/README.md" \ 56 | org.opencontainers.image.created=${BUILD_DATE} \ 57 | org.opencontainers.image.revision=${BUILD_REF} \ 58 | org.opencontainers.image.version=${BUILD_VERSION} 59 | 60 | CMD ["/usr/local/bin/override.sh"] 61 | -------------------------------------------------------------------------------- /core-dns-override/README.md: -------------------------------------------------------------------------------- 1 | HomeAssistant Core DNS Fix 2 | ============================ 3 | 4 | The following routes of installing HomeAssistant: 5 | 6 | - HomeAssistant OS 7 | - Supervised 8 | 9 | Both contain a significant, and [poorly documented flaw](https://github.com/home-assistant/home-assistant.io/issues/19511) in their DNS setup. 10 | 11 | When installing using these methods, a container `hassio_dns` is run, running a `coredns` install. 12 | 13 | Unfortunately, this configuration hardcodes a fallback of Cloudflare's DoT service: 14 | 15 | .:53 { 16 | log { 17 | class error 18 | } 19 | errors 20 | loop 21 | 22 | hosts /config/hosts { 23 | fallthrough 24 | } 25 | template ANY AAAA local.hass.io hassio { 26 | rcode NOERROR 27 | } 28 | mdns 29 | forward . dns://192.168.1.253 dns://127.0.0.1:5553 { 30 | except local.hass.io 31 | policy sequential 32 | health_check 1m 33 | } 34 | fallback REFUSED,SERVFAIL,NXDOMAIN . 
dns://127.0.0.1:5553 35 | cache 600 36 | } 37 | 38 | .:5553 { 39 | log { 40 | class error 41 | } 42 | errors 43 | 44 | forward . tls://1.1.1.1 tls://1.0.0.1 { 45 | tls_servername cloudflare-dns.com 46 | except local.hass.io 47 | health_check 5m 48 | } 49 | cache 600 50 | } 51 | 52 | There are a number of issues with this 53 | 54 | - If the DHCP/Owner configured DNS server responds with `REFUSED`, `SERVFAIL` or `NXDOMAIN` rather than being respected, the query will be retried via Cloudflare 55 | - The "fallback" (`127.0.0.1:5553`) is specified in the main pool, so will sometimes be used instead of the configured DNS 56 | - Behaviour has been observed where it then won't switch back to local DNS for later queries 57 | - Queries sent to cloudflare will be unable to resolve local names 58 | - Where queries are sent to CF, local DNS names may be leaked 59 | - HomeAssistant users have, in effect, been signed up to [this](https://developers.cloudflare.com/1.1.1.1/privacy/public-dns-resolver) without their knowledge 60 | - Health check probes will be sent to cloudflare every 5 minutes 61 | 62 | The latter is particularly problematic, because if healthchecks fail, they are retried at smaller and smaller intervals. So HomeAssistant users who have blocked `1.1.1.1:853` outbound will find their [HomeAssistant installation flinging packets at it](https://github.com/home-assistant/plugin-dns/pull/56#issuecomment-928967969), [another example](https://github.com/home-assistant/plugin-dns/issues/20#issuecomment-917354758). 
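As a rough aid, the two directives behind most of the issues above can be detected with a couple of `grep` calls. This is only a sketch: the function name `corefile_fallback_findings` is made up for illustration, and it assumes you have a local copy of the container's `/etc/corefile` to point it at.

```sh
# Report which of the problematic directives are present in a Corefile.
# $1: path to a local copy of the Corefile
# Prints one line per finding; returns 0 if anything was found, 1 otherwise.
corefile_fallback_findings() {
    _cf_status=1

    # The local fallback listed as an ordinary upstream in a forward statement
    if grep -Eq '^[[:space:]]*forward[[:space:]].*dns://127\.0\.0\.1:5553' "$1"; then
        echo "fallback listed in forward pool"
        _cf_status=0
    fi

    # The explicit fallback directive
    if grep -Eq '^[[:space:]]*fallback[[:space:]]' "$1"; then
        echo "fallback directive present"
        _cf_status=0
    fi

    return "$_cf_status"
}
```

Something like `corefile_fallback_findings ./corefile && echo "Cloudflare fallback is active"` could then be used after copying the file out of the `hassio_dns` container.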
63 | 64 | TL;DR 65 | 66 | - HomeAssistant sends queries via Cloudflare's DNS without the user's consent 67 | - HomeAssistant will therefore sometimes fail to resolve local names 68 | - If Cloudflare's DNS is blocked/unreachable, HomeAssistant's DNS plugin will send a flood of retries onto the network 69 | - There is no ability to disable this 70 | 71 | ---- 72 | 73 | ### Upstream Reports 74 | 75 | This has been reported upstream, and [Pull Requests have been made](https://github.com/home-assistant/plugin-dns/pull/56), but have been [roundly rejected](https://github.com/home-assistant/plugin-dns/pull/56#issuecomment-929700917). 76 | 77 | The proffered solution of running a container installation isn't *really* a solution, given it entirely ignores the reasons that people want HA-OS. 78 | 79 | Issues have been raised and [closed](https://github.com/home-assistant/supervisor/issues/1877) on some extremely spurious grounds. 80 | 81 | Unfortunately, despite this being a clear issue with HomeAssistant, it is not going to get addressed in the short-term: the devs are willing neither to do the work nor to accept PRs from people who have. 82 | 83 | ---- 84 | 85 | ### This Addon 86 | 87 | This add-on is a fix that shouldn't need to exist, and in its current state is *unbelievably* dirty, but should be less prone to silently reverting after upgrade/restart than manually editing files. 88 | 89 | This addon runs a privileged container (yeuch). But, it needs to be privileged so that it can communicate with the docker daemon. 
90 | 91 | This allows it to, once a minute (configurable): 92 | 93 | - Copy `/etc/corefile` from the `hassio_dns` container 94 | - Check whether the Cloudflare config is active 95 | - If it is, remove it, copy the new config up and force a restart of `coredns` 96 | 97 | The result is that the `coredns` config will then look more like 98 | 99 | ``` 100 | bash-5.1# cat /etc/corefile 101 | .:53 { 102 | log { 103 | class error 104 | } 105 | errors 106 | loop 107 | 108 | hosts /config/hosts { 109 | fallthrough 110 | } 111 | template ANY AAAA local.hass.io hassio { 112 | rcode NOERROR 113 | } 114 | mdns 115 | forward . dns://192.168.1.253 { 116 | except local.hass.io 117 | policy sequential 118 | 119 | } 120 | 121 | cache 600 122 | } 123 | 124 | .:5553 { 125 | log { 126 | class error 127 | } 128 | errors 129 | 130 | forward . tls://1.1.1.1 tls://1.0.0.1 { 131 | tls_servername cloudflare-dns.com 132 | except local.hass.io 133 | 134 | } 135 | cache 600 136 | } 137 | 138 | ``` 139 | 140 | (Although still present in the config, the `:5553` section will no longer be used by the resolver listening on `53`) 141 | 142 | ---- 143 | 144 | ### Installation 145 | 146 | To install my repo: 147 | 148 | - Log into HomeAssistant 149 | - Head to `Supervisor` -> `Add-On Store` 150 | - Click the overflow menu in the top right 151 | - Click `Repositories` 152 | - Paste `https://github.com/bentasker/HomeAssistantAddons/` and click `Add` 153 | - Click the overflow button and click `Reload` 154 | 155 | A new section should appear, if it doesn't hit `System` and then `Restart Supervisor` 156 | 157 | Click into the addon and click `Install` 158 | 159 | Once installed, click in and choose 160 | 161 | - Start on boot 162 | - Protection mode (turn it off) 163 | - Click Start 164 | 165 | ---- 166 | 167 | ### Notes 168 | 169 | The config section relating to port `5553` must be left in place. 
170 | 171 | It's declared as an ingress port for the DNS container, so if it isn't actively listening the check [here](https://github.com/home-assistant/supervisor/blob/main/supervisor/addons/addon.py#L479) will fail, and `supervisor` will restart the container. 172 | 173 | ---- 174 | 175 | ### Using a template 176 | 177 | The default behaviour of this add-on is to patch the existing config to remove references that lead to the fallback being used. 178 | 179 | However, for additional control, you can also provide a config file to be used. 180 | 181 | This should be added as `dns-override-template` in your `config` directory: 182 | 183 | ``` 184 | .:53 { 185 | log { 186 | class error 187 | } 188 | errors 189 | loop 190 | 191 | hosts /config/hosts { 192 | fallthrough 193 | } 194 | template ANY AAAA local.hass.io hassio { 195 | rcode NOERROR 196 | } 197 | mdns 198 | forward . dns://192.168.1.253 { 199 | except local.hass.io 200 | policy sequential 201 | } 202 | cache 600 203 | } 204 | 205 | .:5553 { 206 | log { 207 | class error 208 | } 209 | errors 210 | 211 | forward . dns://192.168.1.253 { 212 | except local.hass.io 213 | } 214 | cache 600 215 | } 216 | 217 | ``` 218 | Note that you must not remove the `:5553` server (see [#Notes](#notes)) - 219 | 220 | In the addon's configuration page, tick `use_dns_template` and restart the addon. 221 | 222 | Unfortunately, the template cannot be provided via the configuration page because [HomeAssistant's YAML handling appears to be broken](https://github.com/bentasker/HomeAssistantAddons/commit/a34fb242599c25458094bec3cddccb37f351c2a8). 
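Before ticking `use_dns_template`, it may be worth checking that the template still declares the `:5553` server. A minimal sketch follows; the function name `template_has_5553_server` is hypothetical and it only looks for the opening line of the server block, it does not validate the rest of the file.

```sh
# Succeed if the given Corefile template still declares a server on port 5553.
# $1: path to the template, e.g. /config/dns-override-template
template_has_5553_server() {
    grep -Eq '^[[:space:]]*\.:5553[[:space:]]*\{' "$1"
}
```

For example: `template_has_5553_server /config/dns-override-template || echo "template is missing the :5553 server"`.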
223 | 224 | If you've ticked the box and the config file doesn't exist, the default behaviour of patching existing config will be used, and an error will be logged 225 | 226 | ``` 227 | [12:19:19] ERROR: /config/dns-override-template does not exist - will patch existing file instead 228 | ``` 229 | 230 | Note: if you wish to switch back to using the patching method, after unticking the `use_dns_template` option you need to trigger a restart of the DNS container (or just restart the whole system). 231 | 232 | ---- 233 | 234 | ### Blocking Supervisor updates 235 | 236 | Supervisor autoupdates and sometimes breaks otherwise stable installs. 237 | 238 | There is currently no official way to prevent this, so the addon includes functionality to [prevent Supervisor updates](https://github.com/bentasker/HomeAssistantAddons/issues/1) by interfering with Supervisor's ability to check for new versions. 239 | 240 | To enable, tick `block_supervisor_updates` in the configuration and restart the add-on. 241 | 242 | This will cause it to update `/etc/hosts` in supervisor in order to blackhole the domain `version.home-assistant.io`. 243 | 244 | When you're ready to install updates, untick the box and restart the addon - it'll remove the blackhole and you should be able to proceed. 245 | 246 | 247 | ---- 248 | 249 | ## Loglines 250 | 251 | The following loglines may appear in your logs: 252 | 253 | ### Informational 254 | 255 | ``` 256 | INFO: Launched 257 | ``` 258 | Just the addon confirming it's started successfully 259 | 260 | 261 | ``` 262 | INFO: Changes detected - overwriting DNS Config 263 | ``` 264 | This is normally a routine entry - `coredns`'s config had changed away from the desired one, so was overridden. 265 | 266 | If you see lots of these in close succession, it suggests the DNS container might be crashing out and restarting, which will need further investigation. 
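If you want to put a number on "lots of these", one option is to count the entries in a saved copy of the add-on log. This is a sketch under assumptions: the function names are hypothetical, how you export the log to a file is left to you, and the default threshold of 5 is an arbitrary choice rather than anything the add-on defines.

```sh
# Count DNS config overwrites recorded in a saved copy of the add-on log.
# $1: path to the saved log file
count_dns_overwrites() {
    grep -c 'Changes detected - overwriting DNS Config' "$1"
}

# Succeed if the count exceeds a threshold.
# $2: threshold (defaults to 5, an arbitrary value)
dns_overwrites_excessive() {
    [ "$(count_dns_overwrites "$1")" -gt "${2:-5}" ]
}
```

With the default `interval` of 60, a count close to the number of minutes the log covers would mean the config is being reverted on almost every check.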
267 | 268 | 269 | ### Errors 270 | 271 | ``` 272 | ERROR: Unable to access docker 273 | ``` 274 | Logged at startup if it's not possible to communicate with Docker. 275 | 276 | Most likely cause is that you forgot to disable protection mode - disable it in the Addon's config page and then restart the addon. 277 | 278 | 279 | ``` 280 | ERROR: Did you forget to disable protection mode? 281 | ``` 282 | Logged as a result of not being able to communicate with docker. This will be logged periodically to improve the chances of you noticing it in your system logs. 283 | 284 | Disable protection mode and restart the add-on 285 | 286 | 287 | ``` 288 | ERROR: /config/dns-override-template does not exist - will patch existing file instead 289 | ``` 290 | You've enabled `use_dns_template` but the configuration file could not be found. SSH onto your HomeAssistant box and create the [config file](#using-a-template), or disable `use_dns_template`. 291 | 292 | In the meantime, the default patching mode will be used. 
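The default patching mode can be tried out in isolation. The sketch below wraps the same three `sed` substitutions that `override.sh` applies in a function reading from stdin; the function name `patch_corefile` is only for illustration and is not part of the add-on.

```sh
# Apply the add-on's three substitutions to a Corefile read from stdin:
#  - drop the local fallback from any forward statement
#  - drop the fallback directive
#  - drop health checks
# The `.:5553` server block itself is left untouched.
patch_corefile() {
    sed -e 's~dns://127.0.0.1:5553~~g' \
        -e 's~fallback REFUSED.*~~g' \
        -e 's~health_check .*~~'
}
```

For example, `patch_corefile < current > new` gives a preview of what would be pushed back into the DNS container.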
293 | -------------------------------------------------------------------------------- /core-dns-override/build.json: -------------------------------------------------------------------------------- 1 | { 2 | "build_from": { 3 | "aarch64": "ghcr.io/hassio-addons/base/aarch64:10.0.1", 4 | "amd64": "ghcr.io/hassio-addons/base/amd64:10.0.1", 5 | "armhf": "ghcr.io/hassio-addons/base/armhf:10.0.1", 6 | "armv7": "ghcr.io/hassio-addons/base/armv7:10.0.1", 7 | "i386": "ghcr.io/hassio-addons/base/i386:10.0.1" 8 | } 9 | } 10 | -------------------------------------------------------------------------------- /core-dns-override/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "Core DNS Override", 3 | "version": "0.1.1", 4 | "slug": "coredns-fix", 5 | "description": "Periodically ensure the DNS container isn't using Cloudflare", 6 | "url": "https://github.com/bentasker/HomeAssistantAddons/tree/master/core-dns-override", 7 | "advanced": true, 8 | "startup": "services", 9 | "ingress": false, 10 | "ingress_port": 0, 11 | "ingress_stream": true, 12 | "panel_icon": "mdi:console", 13 | "panel_title": "DNSFix", 14 | "homeassistant": "0.92.0b2", 15 | "arch": ["aarch64", "amd64", "armhf", "armv7", "i386"], 16 | "hassio_api": false, 17 | "hassio_role": "manager", 18 | "services": ["mysql:want", "mqtt:want"], 19 | "homeassistant_api": false, 20 | "host_network": false, 21 | "uart": false, 22 | "usb": false, 23 | "gpio": false, 24 | "audio": false, 25 | "apparmor": false, 26 | "host_dbus": false, 27 | "stdin": false, 28 | "docker_api": true, 29 | "privileged": ["NET_ADMIN", "SYS_ADMIN", "SYS_RAWIO", "SYS_TIME", "SYS_NICE"], 30 | "devices": [], 31 | "map": [ 32 | "config:rw" 33 | ], 34 | "journald": false, 35 | "options" : { 36 | "interval": 60, 37 | "log_changes": true, 38 | "dns_container" : "hassio_dns", 39 | "use_dns_template" : false, 40 | "block_supervisor_updates" : false, 41 | "supervisor_container" : "hassio_supervisor", 
42 | "block_domain" : "version.home-assistant.io", 43 | "block_dest": "127.0.0.12" 44 | }, 45 | "schema" : { 46 | "interval" : "int", 47 | "log_changes": "bool?", 48 | "dns_container" : "str", 49 | "use_dns_template" : "bool?", 50 | "block_supervisor_updates" : "bool?", 51 | "supervisor_container" : "str", 52 | "block_domain" : "str", 53 | "block_dest": "str" 54 | } 55 | } 56 | -------------------------------------------------------------------------------- /core-dns-override/rootfs/usr/local/bin/docker: -------------------------------------------------------------------------------- 1 | #!/usr/bin/with-contenv bashio 2 | # ============================================================================== 3 | # Home Assistant Community Add-on: SSH & Web Terminal 4 | # This script gives the user instructions on how to enable Docker access. 5 | # ============================================================================== 6 | bashio::log.yellow "PROTECTION MODE ENABLED!" 7 | bashio::log.yellow "" 8 | bashio::log.yellow "To be able to use this command, you'll need to disable" 9 | bashio::log.yellow "protection mode on this add-on. Without it, the add-on" 10 | bashio::log.yellow "is unable to access Docker." 11 | bashio::log.yellow "" 12 | bashio::log.yellow "Steps:" 13 | bashio::log.yellow " - Go to the Supervisor Panel." 14 | bashio::log.yellow " - Click on the SSH & Web Terminal add-on." 15 | bashio::log.yellow " - Set the 'Protection mode' switch to off." 16 | bashio::log.yellow " - Restart the add-on." 17 | bashio::log.yellow "" 18 | bashio::log.yellow "Access to Docker allows you to do really powerful things" 19 | bashio::log.yellow "including complete destruction of your system." 20 | bashio::log.yellow "Please, be sure you know what you are doing before enabling" 21 | bashio::log.yellow "this feature!" 
22 | bashio::log.yellow "" 23 | -------------------------------------------------------------------------------- /core-dns-override/rootfs/usr/local/bin/override.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/with-contenv bashio 2 | 3 | #CONTAINER_NAME=hassio_dns 4 | #INTERVAL=60 5 | 6 | CONFIG_PATH=/data/options.json 7 | 8 | function dump_curr_state(){ 9 | # Implemented for https://github.com/bentasker/HomeAssistantAddons/issues/2 10 | INSTANCE=$1 11 | CFILE=$2 12 | 13 | if [ `bashio::config log_changes` == "true" ] 14 | then 15 | bashio::log.info "Dumping current $CFILE" 16 | docker exec $INSTANCE cat $CFILE 17 | fi 18 | } 19 | 20 | 21 | 22 | function make_and_push(){ 23 | 24 | MODE=$1 25 | 26 | if [ "$MODE" == "last" ] 27 | then 28 | # Strip direction to the fallback 29 | # 30 | # Also, turn off health checks - it's unneeded traffic, this isn't a kubernetes cluster 31 | cat current | sed 's~dns://127.0.0.1:5553~~g' | sed 's~fallback REFUSED.*~~g' | sed 's~health_check .*~~' > new 32 | docker cp new $CONTAINER_NAME:/etc/corefile 33 | 34 | # take a copy as our "last" 35 | mv new last 36 | else 37 | docker cp $MODE $CONTAINER_NAME:/etc/corefile 38 | fi 39 | 40 | bashio::log.info "Changes pushed" 41 | dump_curr_state $CONTAINER_NAME /etc/corefile 42 | 43 | # Now restart coredns 44 | docker exec $CONTAINER_NAME pkill coredns 45 | } 46 | 47 | function fetch_and_check(){ 48 | docker cp $CONTAINER_NAME:/etc/corefile ./current 49 | 50 | if [ "$USE_TEMPLATE" == "true" ] 51 | then 52 | COMP_FILE="/config/dns-override-template" 53 | if [ ! -f "$COMP_FILE" ] 54 | then 55 | bashio::log.error "/config/dns-override-template does not exist - will patch existing file instead" 56 | COMP_FILE="last" 57 | fi 58 | else 59 | COMP_FILE="last" 60 | fi 61 | 62 | if [ -f $COMP_FILE ] 63 | then 64 | diff -s current $COMP_FILE > /dev/null 65 | if [ ! "$?" 
== "0" ] 66 | then 67 | # Files differ 68 | bashio::log.info "Changes detected - overwriting DNS Config" 69 | dump_curr_state $CONTAINER_NAME /etc/corefile 70 | make_and_push $COMP_FILE 71 | fi 72 | else 73 | # We don't have a copy of the last change 74 | make_and_push $COMP_FILE 75 | fi 76 | 77 | # Tidy up 78 | rm current 79 | } 80 | 81 | function check_supervisor_dns(){ 82 | # Prevent Supervisor from auto-updating 83 | # 84 | # https://github.com/bentasker/HomeAssistantAddons/issues/1 85 | # 86 | 87 | UPDATE_DOMAIN=`bashio::config block_domain` 88 | BLOCK_IP=`bashio::config block_dest` 89 | 90 | # Copy down /etc/hosts 91 | docker cp $SUPERVISOR:/etc/hosts ./hosts 92 | 93 | if [ `bashio::config block_supervisor_updates` == "false" ] 94 | then 95 | # Ensure that it's not configured 96 | grep -E "^([0-9,\.]+)[[:space:]]+$UPDATE_DOMAIN" hosts 2>&1 >/dev/null 97 | if [ "$?" == "0" ] 98 | then 99 | bashio::log.info "Removing update block from Supervisor /etc/hosts" 100 | grep -v -E "^([0-9,\.]+)[[:space:]]+$UPDATE_DOMAIN" hosts > hosts_new 101 | 102 | # We have to take the long way round 103 | # 104 | # Can't copy direct - docker borks because the file's in use 105 | # and can't grep from /etc/hosts direct into itself, as we'll end up 106 | # with an empty file 107 | docker cp hosts_new $SUPERVISOR:/etc/hosts.tmp 108 | docker exec $SUPERVISOR bash -c "cat /etc/hosts.tmp > /etc/hosts" 109 | docker exec $SUPERVISOR bash -c "rm -f /etc/hosts.tmp" 110 | rm hosts_new 111 | bashio::log.info "Changes pushed" 112 | dump_curr_state $SUPERVISOR /etc/hosts 113 | fi 114 | rm hosts 115 | return 116 | fi 117 | 118 | grep -E "^([0-9,\.]+)[[:space:]]+$UPDATE_DOMAIN" hosts 2>&1 >/dev/null 119 | if [ ! "$?" 
== "0" ] 120 | then 121 | # Update it 122 | bashio::log.info "Changes detected - overwriting Supervisor /etc/hosts" 123 | docker exec $SUPERVISOR bash -c "echo '$BLOCK_IP $UPDATE_DOMAIN' | tee -a /etc/hosts" 124 | bashio::log.info "Changes pushed" 125 | dump_curr_state $SUPERVISOR /etc/hosts 126 | fi 127 | rm hosts 128 | } 129 | 130 | 131 | # bashio uses set -e by default, we do not want that 132 | # we're specifically testing things that may fail 133 | set +e 134 | 135 | # Catch an easy oversight 136 | FAIL=0 137 | docker ps 2>&1 >/dev/null 138 | if [ ! "$?" == "0" ] 139 | then 140 | bashio::log.error "Unable to access docker" 141 | bashio::log.error "Did you forget to disable protection mode?" 142 | FAIL=1 143 | # We don't exit here, because supervisor would only restart us 144 | fi 145 | 146 | # 147 | # Get config 148 | INTERVAL="`bashio::config 'interval'`" 149 | CONTAINER_NAME=`bashio::config dns_container` 150 | USE_TEMPLATE=`bashio::config use_dns_template` 151 | SUPERVISOR=`bashio::config supervisor_container` 152 | 153 | bashio::log.info "Starting" 154 | dump_curr_state $SUPERVISOR /etc/hosts 155 | dump_curr_state $CONTAINER_NAME /etc/corefile 156 | 157 | bashio::log.info "Launched" 158 | while true 159 | do 160 | if [ "$FAIL" == "0" ] 161 | then 162 | fetch_and_check 163 | check_supervisor_dns 164 | else 165 | bashio::log.error "Did you forget to disable protection mode?" 
166 | fi 167 | sleep $INTERVAL 168 | done 169 | -------------------------------------------------------------------------------- /repository.json: -------------------------------------------------------------------------------- 1 | { 2 | "name" : "B Tasker HomeAssistant Addons", 3 | "url" : "https://github.com/bentasker/HomeAssistantAddons", 4 | "maintainer": "B Tasker" 5 | } 6 | -------------------------------------------------------------------------------- /telegraf/Dockerfile: -------------------------------------------------------------------------------- 1 | ARG BUILD_FROM=ghcr.io/hassio-addons/base/amd64:10.0.1 2 | # hadolint ignore=DL3006 3 | FROM ${BUILD_FROM} 4 | 5 | # Set shell 6 | SHELL ["/bin/bash", "-o", "pipefail", "-c"] 7 | 8 | # Setup base 9 | ARG BUILD_ARCH=amd64 10 | ARG TELEGRAF_VERSION=1.20.3 11 | 12 | # Build arguments 13 | ARG BUILD_ARCH 14 | ARG BUILD_DATE 15 | ARG BUILD_DESCRIPTION 16 | ARG BUILD_NAME 17 | ARG BUILD_REF 18 | ARG BUILD_REPOSITORY 19 | ARG BUILD_VERSION 20 | 21 | 22 | WORKDIR /usr/local/src 23 | # Todo - make this use the build arch (there's no armv7l in the repo) 24 | ADD https://dl.influxdata.com/telegraf/releases/telegraf-${TELEGRAF_VERSION}_linux_armhf.tar.gz ./ 25 | 26 | COPY rootfs / 27 | 28 | RUN tar -xzf telegraf-${TELEGRAF_VERSION}_linux_*.tar.gz \ 29 | && chmod +x telegraf*/usr/bin/telegraf \ 30 | && cp telegraf*/usr/bin/telegraf /usr/local/bin \ 31 | && rm -f telegraf-${TELEGRAF_VERSION}_linux_*.tar.gz 32 | 33 | # Labels 34 | LABEL \ 35 | io.hass.name="${BUILD_NAME}" \ 36 | io.hass.description="${BUILD_DESCRIPTION}" \ 37 | io.hass.arch="${BUILD_ARCH}" \ 38 | io.hass.type="addon" \ 39 | io.hass.version=${BUILD_VERSION} \ 40 | maintainer="B Tasker" \ 41 | org.opencontainers.image.title="${BUILD_NAME}" \ 42 | org.opencontainers.image.description="${BUILD_DESCRIPTION}" \ 43 | org.opencontainers.image.vendor="BTasker Home Assistant Add-ons" \ 44 | org.opencontainers.image.authors="B Tasker" \ 45 | 
org.opencontainers.image.licenses="MIT" \ 46 | org.opencontainers.image.url="https://tobeconfirmed" \ 47 | org.opencontainers.image.source="https://github.com/${BUILD_REPOSITORY}" \ 48 | org.opencontainers.image.documentation="https://github.com/${BUILD_REPOSITORY}/blob/main/README.md" \ 49 | org.opencontainers.image.created=${BUILD_DATE} \ 50 | org.opencontainers.image.revision=${BUILD_REF} \ 51 | org.opencontainers.image.version=${BUILD_VERSION} 52 | 53 | ENTRYPOINT ["/usr/local/src/entrypoint.sh"] 54 | -------------------------------------------------------------------------------- /telegraf/README.md: -------------------------------------------------------------------------------- 1 | HomeAssistant Telegraf Addon 2 | =============================== 3 | 4 | [Telegraf](https://github.com/influxdata/telegraf) allows you to capture metrics and write them out to InfluxDB so that you can monitor performance etc. 5 | 6 | This Add-on provides a Telegraf instance which you can configure in `/config/` (the addon doesn't ship with any config by default). 7 | 8 | See [example_setup.md](example_setup.md) for an example of collecting telemetry from the DNS container. 9 | 10 | ### Example configuration 11 | 12 | [agent] 13 | interval = "10s" 14 | round_interval = true 15 | metric_batch_size = 300 16 | metric_buffer_limit = 5000 17 | collection_jitter = "0s" 18 | flush_interval = "20s" 19 | flush_jitter = "0s" 20 | precision = "" 21 | debug = false 22 | quiet = false 23 | logfile = "" 24 | # Override the name, or you'll get the container's name 25 | hostname = "home-assistant" 26 | omit_hostname = false 27 | 28 | [[inputs.diskio]] 29 | [[inputs.mem]] 30 | [[inputs.net]] 31 | [[inputs.swap]] 32 | [[inputs.system]] 33 | [[inputs.cpu]] 34 | ## Whether to report per-cpu stats or not 35 | percpu = true 36 | ## Whether to report total system cpu stats or not 37 | totalcpu = true 38 | ## If true, collect raw CPU time metrics. 
39 | collect_cpu_time = false 40 | ## If true, compute and report the sum of all non-idle CPU states. 41 | report_active = false 42 | 43 | [[outputs.influxdb]] 44 | urls = [""] 45 | database = "" 46 | 47 | -------------------------------------------------------------------------------- /telegraf/build.json: -------------------------------------------------------------------------------- 1 | { 2 | "build_from": { 3 | "aarch64": "ghcr.io/hassio-addons/base/aarch64:10.0.1", 4 | "amd64": "ghcr.io/hassio-addons/base/amd64:10.0.1", 5 | "armhf": "ghcr.io/hassio-addons/base/armhf:10.0.1", 6 | "armv7": "ghcr.io/hassio-addons/base/armv7:10.0.1", 7 | "i386": "ghcr.io/hassio-addons/base/i386:10.0.1" 8 | } 9 | } 10 | -------------------------------------------------------------------------------- /telegraf/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "Telegraf DNS Metrics", 3 | "version": "0.1.1", 4 | "slug": "telegraf", 5 | "description": "Spin up a container running Telegraf to collect telemetry", 6 | "url": "https://github.com/bentasker/HomeAssistantAddons/tree/master/telegraf", 7 | "advanced": true, 8 | "startup": "services", 9 | "ingress": false, 10 | "ingress_port": 0, 11 | "ingress_stream": true, 12 | "panel_icon": "mdi:console", 13 | "panel_title": "Telegraf DNS", 14 | "homeassistant": "0.92.0b2", 15 | "arch": ["aarch64", "amd64", "armhf", "armv7", "i386"], 16 | "hassio_api": false, 17 | "hassio_role": "manager", 18 | "services": ["mysql:want", "mqtt:want"], 19 | "homeassistant_api": false, 20 | "host_network": false, 21 | "uart": false, 22 | "usb": false, 23 | "gpio": false, 24 | "audio": false, 25 | "apparmor": false, 26 | "host_dbus": false, 27 | "stdin": false, 28 | "docker_api": true, 29 | "privileged": ["NET_ADMIN", "SYS_ADMIN", "SYS_RAWIO", "SYS_TIME", "SYS_NICE"], 30 | "devices": [], 31 | "map": [ 32 | "config:rw" 33 | ], 34 | "journald": false, 35 | "options" : { 36 | }, 37 | "schema" : { 38 | } 
39 | } 40 | -------------------------------------------------------------------------------- /telegraf/example_setup.md: -------------------------------------------------------------------------------- 1 | ## Example setup - Monitoring CoreDNS 2 | 3 | Note: this has been reported upstream [plugin-dns#64](https://github.com/home-assistant/plugin-dns/issues/64) 4 | 5 | The reason I created this add-on is I wanted to be able to chart out the performance difference between hassio-DNS's default habit of using Cloudflare and not (using my override plugin made things feel faster, wanted to prove that was the case). 6 | 7 | So, using my [core-dns-override](https://github.com/bentasker/HomeAssistantAddons/tree/master/core-dns-override) plugin, I enabled the Prometheus endpoint in CoreDNS: 8 | 9 | .:53 { 10 | log { 11 | class error 12 | } 13 | errors 14 | loop 15 | 16 | hosts /config/hosts { 17 | fallthrough 18 | } 19 | template ANY AAAA local.hass.io hassio { 20 | rcode NOERROR 21 | } 22 | mdns 23 | forward . dns://192.168.1.253 dns://127.0.0.1:5553 { 24 | except local.hass.io 25 | policy sequential 26 | health_check 1m 27 | } 28 | fallback REFUSED,SERVFAIL,NXDOMAIN . dns://127.0.0.1:5553 29 | prometheus 0.0.0.0:9153 30 | cache 600 31 | } 32 | 33 | .:5553 { 34 | log { 35 | class error 36 | } 37 | errors 38 | 39 | forward . 
tls://1.1.1.1 tls://1.0.0.1 { 40 | tls_servername cloudflare-dns.com 41 | except local.hass.io 42 | health_check 5m 43 | } 44 | prometheus 0.0.0.0:9153 45 | cache 600 46 | } 47 | 48 | 49 | ### Telegraf config 50 | 51 | Then, I created my telegraf configuration in `/config/telegraf.conf` 52 | 53 | [agent] 54 | interval = "10s" 55 | round_interval = true 56 | metric_batch_size = 300 57 | metric_buffer_limit = 5000 58 | collection_jitter = "0s" 59 | flush_interval = "20s" 60 | flush_jitter = "0s" 61 | precision = "" 62 | debug = false 63 | quiet = false 64 | logfile = "" 65 | hostname = "home-assistant" 66 | omit_hostname = false 67 | 68 | [[inputs.diskio]] 69 | [[inputs.mem]] 70 | [[inputs.net]] 71 | [[inputs.swap]] 72 | [[inputs.system]] 73 | [[inputs.cpu]] 74 | ## Whether to report per-cpu stats or not 75 | percpu = true 76 | ## Whether to report total system cpu stats or not 77 | totalcpu = true 78 | ## If true, collect raw CPU time metrics. 79 | collect_cpu_time = false 80 | ## If true, compute and report the sum of all non-idle CPU states. 81 | report_active = false 82 | 83 | 84 | [[inputs.prometheus]] 85 | ## An array of urls to scrape metrics from. 
86 | urls = ["http://hassio_dns:9153/metrics"] 87 | 88 | [[inputs.docker]] 89 | endpoint = "unix:///var/run/docker.sock" 90 | timeout = "5s" 91 | 92 | [[outputs.influxdb]] 93 | urls = ["http://192.168.3.84:8086"] 94 | database = "home_assistant_performance" 95 | 96 | Within a few seconds, data started appearing in InfluxDB. The measurement we're most interested in is `coredns_dns_request_duration_seconds` - there's a tag for each of the server blocks: 97 | 98 | ![Measurement and tag](imgs/Screenshot_20211106_124408.png) 99 | 100 | We can then trivially graph out how many responses the fallback is sending a second 101 | 102 | ![Response rates](imgs/Screenshot_20211106_124841.png) 103 | (this is with a forced failure to ensure it's used) 104 | 105 | We can also graph out response times for that fallback, and see that it is *very* slow to fail queries (Cloudflare is blocked at the firewall in this graph) 106 | ![Query Response Times](imgs/Screenshot_20211106_125449.png) 107 | 108 | Although artificially blocked here, that clearly translates to very slow response times if there are issues between HomeAssistant and Cloudflare - essentially blocking execution of automations etc. 109 | 110 | With Cloudflare unblocked, we can see that it's still significantly slower than local 111 | 112 | ![Average response time](imgs/Screenshot_20211106_145548.png) 113 | 114 | CF's responses take around 500ms longer than the local DNS server - that's an additional half second lag when running scripts/automations which require an external name to be (re)resolved. 115 | 116 | A ping to `1.1.1.1` shows a RTT of 10 - 13ms, so the additional latency is probably attributable to the overheads of DoT. Although the user's observed latency may also be increased by the inter-coreDNS communication (there's a UDP communication between the server block on `:53` and the one on `:5553`), it's not included in the graphed statistic. 

Whatever the cause, where queries are passed to Cloudflare erroneously, latency is *25x* higher, even before the impact of passing local names to Cloudflare is taken into account.

### Impact of Healthchecks

Looking at healthcheck failure rates, we can see how the coredns configuration inadvertently contributes to the packet storms some users have complained of

![Healthcheck failure rates](imgs/Screenshot_20211106_150804.png)

We can see lots and lots of failures against `127.0.0.1:5553`, despite it not being supposed to be used as an actual upstream.

The reason is that the default HomeAssistant config contains this:

```
    forward . dns://192.168.1.253 dns://127.0.0.1:5553 {
        except local.hass.io
        policy sequential
        health_check 1m
    }
```

It's therefore considered an upstream and will receive a healthcheck every 1 minute. This _clearly_ wasn't desired by the devs because they set the fallback healthcheck interval at 5m:

```
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        tls_servername cloudflare-dns.com
        except local.hass.io
        health_check 5m
    }
```

The result is that when connectivity to Cloudflare fails, an unexpectedly large number of healthchecks fail.

----

### Impact of `fallback` section

This is actually worse than first thought.

With the firewall rejecting connections to Cloudflare (i.e. it will actively send a `RST` back), we see a spike to 6000pps from the DNS container, with the rate stabilising down to 3000 junk packets/sec

This is because, at startup, `coredns` sends a query for `.` to the fallback.

If Cloudflare were reachable, this'd result in 2 packets on the network - query and response.
However, when it isn't, we instead get this:

![Packet rate and queries](imgs/Screenshot_20211108_195511.png)

This only occurs when the `fallback` statement is present - the presence (or lack of) `127.0.0.1:5553` in the forward statement has no impact on the rate of the storm.

The `fallback` plugin uses the `proxy` plugin under the hood, and it appears that `proxy` will just cycle over its upstreams trying to elicit a response until it reaches its own timeout - see [Line 78 of proxy.go](https://github.com/coredns/proxy/blob/master/proxy.go#L78).

Technically the same behaviour occurs when the firewall is configured to drop rather than reject packets - it's just that the storm can't develop nearly as fast because of the time inherently required to hit timeouts (although some caution is needed - `coredns`'s `forward` plugin can dynamically adjust its timeouts, so I assume `proxy` can too).

However, this behaviour can first arise *after* startup too - any time that `127.0.0.1:5553` is tried.

For example, if we simulate a local DNS server issue by blocking packets from HomeAssistant:

    root@PIHRP1:~# iptables -I INPUT -s 192.168.3.16 -j DROP

And then send a single query:

    ➜  ~ dig -p 53 @172.30.32.3 testquery.bentasker.co.uk

This causes `coredns` to move on to the next host in its `forward` statement, triggering intermittent storms

![In-use storm](imgs/Screenshot_20211109_092645.png)

The initial "recoveries" there are exactly 30s long - that's `coredns`'s retry cycling back to the local host and needing to wait for it to time out before moving on. But then `coredns`'s dynamic timeouts appear to come into play, and the "recovery" gets shorter and shorter

At this point, the querying client has long since received a `SERVFAIL` but `coredns` continues to spam the network.
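
If graphs aren't available, a storm like this can also be confirmed from interface counters. The following is a minimal sketch, assuming a Linux host that exposes `/sys/class/net/<iface>/statistics/tx_packets`; the function names `pps` and `tx_pps` are made up for illustration, and the interface name has to be whichever one actually carries the DNS container's traffic.

```shell
#!/bin/bash
# Average packets/sec between two counter readings (integer division).
# Usage: pps <start_count> <end_count> <seconds>
pps() {
    echo $(( ($2 - $1) / $3 ))
}

# Sample an interface's transmit counter over a window and print the rate.
# Usage: tx_pps <iface> [seconds]   (window defaults to 10 seconds)
tx_pps() {
    local iface="$1" secs="${2:-10}" start end
    start=$(cat "/sys/class/net/${iface}/statistics/tx_packets")
    sleep "$secs"
    end=$(cat "/sys/class/net/${iface}/statistics/tx_packets")
    pps "$start" "$end" "$secs"
}

# Example (commented out; eth0 is a placeholder interface name):
# tx_pps eth0 10
```

Because the counter covers everything the interface sends, compare a reading taken during the storm against an idle baseline rather than reading the number in isolation.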

I'd assumed that we could make this worse too - if the local DNS server's process crashed we'd see resets instead.

    bash-5.1# pkill coredns # stop its retry attempts
    root@PIHRP1:~# iptables -I INPUT -s 192.168.3.16 -j REJECT
    ➜  ~ dig -p 53 @172.30.32.3 testquery2.bentasker.co.uk

However, this doesn't change the behaviour in the graphs. This is _presumably_ because the local server is being contacted using `UDP`, so will receive an ICMP unreachable from the firewall - it appears this doesn't trigger a retry.

This suggests that the fault (such as it is) lies in `coredns`'s handling of TCP connections, so switching the fallback to use UDP rather than DoT may also resolve the behaviour.

Quick config rewrite then:

    forward . dns://1.1.1.1 dns://1.0.0.1 {
        tls_servername cloudflare-dns.com
        except local.hass.io
        health_check 5m
    }

Quick test to verify it's working, and then blocked in firewall.

    bash-5.1# pkill coredns

No storm. I also ran a query via `5553` to verify that the query fails

![No storm](imgs/Screenshot_20211109_093856.png)

So, a "fix" of switching the fallback to UDP would also be valid

----

## Repro

OK, so pulling this all together, this is how to repro and verify the above.

### Metrics collection

Optional: you could also just run a packet capture if you don't want graphs

- Stand up an InfluxDB instance (or sign up for a free account at https://cloud2.influxdata.com)
- Create a telegraf config in `/config/telegraf.conf` (Config I used is [here](https://github.com/bentasker/HomeAssistantAddons/tree/master/telegraf))
- Install and start my [Telegraf addon](https://github.com/bentasker/HomeAssistantAddons/tree/master/telegraf)
- Exec into `hassio_dns` and edit `/etc/corefile` to add `prometheus 0.0.0.0:9153` to each of the server blocks
- Run `pkill coredns` on `hassio_dns` to force a config reload
- Stats should start appearing in InfluxDB (If you're using Chronograf, you can import [this](https://github.com/home-assistant/plugin-dns/files/7503425/HomeAssistant.DNS.json.gz) dashboard as a starting point - you'll probably need to edit the DB name if you're writing into a different one than me)


### Repro

On your network firewall, add two rules

- Dest: `1.0.0.1` Proto: any, dport: `853` `REJECT`
- Dest: `1.1.1.1` Proto: any, dport: `853` `REJECT`

Exec into `hassio_dns` and

- `cp /etc/corefile /root/`
- `pkill coredns` to trigger a restart

You should see thousands of packets hit the network. If you're using some other metrics+graphing solution, be aware that you may not see them in graphs for `eth0` (or whatever your main interface is) because the container uses an aliased interface.

Now exec into `hassio_dns` and edit `/etc/corefile` to remove the `fallback` line. Run `pkill coredns` to force a restart - you should not see significant packet rates on the network after this.

Restore the original config (`cp /root/corefile /etc/`) and then edit it to make the upstream use plain DNS

    forward .
    dns://1.1.1.1 dns://1.0.0.1 {
        tls_servername cloudflare-dns.com
        except local.hass.io
        health_check 5m
    }

On your firewall, add two more rules

- Dest: `1.0.0.1` Proto: any, dport: `53` `REJECT`
- Dest: `1.1.1.1` Proto: any, dport: `53` `REJECT`

`pkill coredns` should not elicit a packet storm.


--------------------------------------------------------------------------------
/telegraf/imgs/Screenshot_20211106_124408.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bentasker/HomeAssistantAddons/8bbf29d766421f743527c3a28f09e40168e67b5b/telegraf/imgs/Screenshot_20211106_124408.png
--------------------------------------------------------------------------------
/telegraf/imgs/Screenshot_20211106_124841.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bentasker/HomeAssistantAddons/8bbf29d766421f743527c3a28f09e40168e67b5b/telegraf/imgs/Screenshot_20211106_124841.png
--------------------------------------------------------------------------------
/telegraf/imgs/Screenshot_20211106_125449.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bentasker/HomeAssistantAddons/8bbf29d766421f743527c3a28f09e40168e67b5b/telegraf/imgs/Screenshot_20211106_125449.png
--------------------------------------------------------------------------------
/telegraf/imgs/Screenshot_20211106_145548.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bentasker/HomeAssistantAddons/8bbf29d766421f743527c3a28f09e40168e67b5b/telegraf/imgs/Screenshot_20211106_145548.png
--------------------------------------------------------------------------------
/telegraf/imgs/Screenshot_20211106_150804.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bentasker/HomeAssistantAddons/8bbf29d766421f743527c3a28f09e40168e67b5b/telegraf/imgs/Screenshot_20211106_150804.png
--------------------------------------------------------------------------------
/telegraf/imgs/Screenshot_20211108_195511.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bentasker/HomeAssistantAddons/8bbf29d766421f743527c3a28f09e40168e67b5b/telegraf/imgs/Screenshot_20211108_195511.png
--------------------------------------------------------------------------------
/telegraf/imgs/Screenshot_20211109_092307.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bentasker/HomeAssistantAddons/8bbf29d766421f743527c3a28f09e40168e67b5b/telegraf/imgs/Screenshot_20211109_092307.png
--------------------------------------------------------------------------------
/telegraf/imgs/Screenshot_20211109_092645.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bentasker/HomeAssistantAddons/8bbf29d766421f743527c3a28f09e40168e67b5b/telegraf/imgs/Screenshot_20211109_092645.png
--------------------------------------------------------------------------------
/telegraf/imgs/Screenshot_20211109_093856.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bentasker/HomeAssistantAddons/8bbf29d766421f743527c3a28f09e40168e67b5b/telegraf/imgs/Screenshot_20211109_093856.png
--------------------------------------------------------------------------------
/telegraf/rootfs/usr/local/src/entrypoint.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#
# Start telegraf

/usr/local/bin/telegraf --config /config/telegraf.conf

--------------------------------------------------------------------------------