├── LICENSE
├── README.md
├── visualise_domains.ipynb
└── mitm_test.ipynb

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 timsh

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
**Analyse the requests sent from your mobile device while using any app**

Read more about the project at [timsh.org](https://timsh.org/everyone-knows-your-location-part-2-try-it-yourself)

Fill in the form if you find something interesting in the requests - let's analyse all the apps from the [list](https://docs.google.com/spreadsheets/d/1fJbNT-kmfuWUlIpYr9sduvjZS1ggrmhydCzoDlqaMaA/) together!

[https://forms.gle/CE6y7XkpRNJkGqEeA](https://forms.gle/CE6y7XkpRNJkGqEeA)


---
**Follow this process**

1. **Install mitmproxy**
   https://docs.mitmproxy.org/stable/overview-installation/
   On macOS, simply use `brew install mitmproxy`.

   Be aware that some antivirus software flags mitmproxy as malware - it looks scary, but it's a false positive.

2. **Turn on developer mode on your iPhone / Android device**
   I did the whole experiment on iOS, so I won't include any Android-specific instructions (though they should be even easier).
   https://developer.apple.com/documentation/xcode/enabling-developer-mode-on-a-device

   You will need developer mode turned on for the next step.

3. **Configure mitmproxy and the iPhone to work together** (the terminal side of this step is recapped in the snippet after the sub-steps)
   1. Launch mitmproxy in the terminal.
      I prefer to use mitmweb because the interface is actually very helpful for initial discovery (and for understanding the scale of the RTB requests).

      Use this command to launch the proxy. By default it won't intercept traffic from your computer (or any other device) unless that device is manually configured to use the proxy:
      `mitmweb --listen-host 0.0.0.0 --listen-port 8080`

   2. Now run `ipconfig getifaddr en0` to find your computer's local IP address.
      By the way, your iPhone and computer must be on the same Wi-Fi network for any of this to work.

   3. Next, open the settings on the iPhone and set up a manual proxy with:
      `server` = the IP address you just found
      `port` = 8080

   4. On the iPhone, open a browser and go to mitm.it
      Further instructions are described at the links below. TL;DR: you need to install the certificate and enable full trust for it, so that TLS-encrypted traffic can be decrypted.

      https://jasdev.me/intercepting-ios-traffic
      https://support.apple.com/en-us/102390
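   A minimal recap of the terminal commands from sub-steps 1 and 2, assuming a Mac where `en0` is the Wi-Fi interface (adjust the interface name if yours differs):

   ```sh
   # Terminal 1: start mitmweb, listening on all interfaces on port 8080
   mitmweb --listen-host 0.0.0.0 --listen-port 8080

   # Terminal 2: print this computer's local IP - use it as the proxy "server" on the iPhone
   ipconfig getifaddr en0
   ```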
4. We're all set!
   Now you're able to intercept and decrypt all the traffic going through the iPhone.
   If you only want to record traffic coming from a specific app, close all apps, press "Clear flows" in mitmweb and then open the desired app.

5. Take any app from the [list](https://docs.google.com/spreadsheets/d/1fJbNT-kmfuWUlIpYr9sduvjZS1ggrmhydCzoDlqaMaA) (or any other app you like).

   To download it from the App Store, you might have to turn off the proxy on the iPhone, download the app, then turn the proxy back on and clear the flows.

6. Open the app and wait / click / play - you'll immediately see hundreds of requests flowing in mitmweb.

7. When you feel there's enough (you could even leave it open or play for an hour or so to collect more data), close the app and switch off the phone, then in mitmweb press File → Save all.

   This will give you a `flows` file - rename it to "appname.flow".

8. Open [mitm_test.ipynb](https://github.com/tim-sha256/analyse-ad-traffic/blob/main/mitm_test.ipynb) - either in a local Jupyter Notebook or in Google Colab, both work fine.
   Further instructions are included in the file itself.

9. Repeat steps 5-7 for as many apps as you need, just don't forget to clear the flows before each recording.
   When you're done, press `Ctrl+C` in the terminal to stop mitmproxy and turn off the proxy on the iPhone.
   If that's your main device, you also MUST turn off the certificate trust setting that you enabled before.

---


**Check the instructions in [visualise_domains.ipynb](https://github.com/tim-sha256/analyse-ad-traffic/blob/main/visualise_domains.ipynb) to create a visualisation of domain and subdomain frequency in the data - just like this one:**

[Example visualisation screenshot]

--------------------------------------------------------------------------------
/visualise_domains.ipynb:
--------------------------------------------------------------------------------
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Hey!\n",
        "This simple notebook turns a .csv from the previous one - \"mitm_test.ipynb\" - into an interactive domain visualisation.\n",
        "\n",
        "---\n",
        "\n",
        "If you want to use multiple csvs, add them to df_list and df_color_map (there is an example cell for this further down).\n",
        "If you only want to use 1 app, please edit df_list as well, with the correct name and punctuation.\n",
        "\n",
        "---\n",
        "1. Every circle is a domain or a subdomain.\n",
        "\n",
        "   The hierarchy is represented by the inclusion of circles in others.\n",
        "Example: o-sdk.ads.unity3d.com is represented by 3 circles: o-sdk inside ads inside unity3d (there is a small demo of this split right after the imports below).\n",
        "2. Colors represent the app (I analysed ~6) that the request corresponds to.\n",
        "\n",
        "   I used low opacities for better visibility, and it turns out that in my mix of colors and opacities, purple is the combination of all of them.\n",
        "3. Circle sizes, or masses, represent frequency: how often a given domain appeared in the request data.\n",
        "\n",
        "   See any insights?\n",
        "Unity rules the mobile game app traffic scene.\n",
        "For comparison, the g / doubleclick cluster is Google's ad network.\n",
        "\n",
        "---\n",
        "\n",
        "Please save and open the resulting .html file in your browser - it turns out it's very complicated to embed it inside the notebook."
      ],
      "metadata": {
        "id": "hsZTi8Jl7fBu"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3_42EXen5hyD"
      },
      "outputs": [],
      "source": [
        "!pip install tldextract pyvis\n",
        "\n",
        "import pandas as pd\n",
        "import tldextract\n",
        "from collections import defaultdict\n",
        "import networkx as nx\n",
        "from pyvis.network import Network"
      ]
    },
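    {
      "cell_type": "markdown",
      "source": [
        "A quick, optional sketch of the splitting rule described above (illustrative only - the pipeline below does the same thing internally via `extract_domain_and_subdomains`):"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "ext = tldextract.extract(\"https://o-sdk.ads.unity3d.com/some/path\")\n",
        "# domain -> outermost circle, reversed subdomain parts -> nested circles\n",
        "print([ext.domain] + list(reversed(ext.subdomain.split(\".\"))))  # ['unity3d', 'ads', 'o-sdk']"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },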
Colors represent the app (I analysed 6~) that the request corresponds to.\n", 38 | "\n", 39 | " I used low opacities for better visibility, and it turns out that in my mix of colors and their opacities purple is the combination of all of them.\n", 40 | "3. Circle sizes, or masses, represent the frequency: how often did this or that domain appear in the requests data.\n", 41 | "\n", 42 | " See any insights?\n", 43 | "Unity rules the mobile game app traffic scene.\n", 44 | "For comparison, the g / doubleclick thing is Google Ad Network.\n", 45 | "\n", 46 | "---\n", 47 | "\n", 48 | "please save and open the resulting .html file in your browser - turns out it's very complicated to insert it inside the notebook." 49 | ], 50 | "metadata": { 51 | "id": "hsZTi8Jl7fBu" 52 | } 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": { 58 | "id": "3_42EXen5hyD" 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "!pip install tldextract pyvis\n", 63 | "\n", 64 | "import pandas as pd\n", 65 | "import tldextract\n", 66 | "from collections import defaultdict\n", 67 | "import networkx as nx\n", 68 | "from pyvis.network import Network" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "source": [ 74 | "appname_df = pd.read_csv(\"appname.csv\") # upload the csv created in the \"mitm_test\" notebook" 75 | ], 76 | "metadata": { 77 | "id": "a0uflIkI6mhD" 78 | }, 79 | "execution_count": null, 80 | "outputs": [] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "source": [ 85 | "df_list = [(\"appname\", appname_df),\n", 86 | " (\"otherapp\", otherapp_df)\n", 87 | " # ...\n", 88 | " ]\n", 89 | "\n", 90 | "df_color_map = {\n", 91 | " \"appname\": \"#e6194B\" #, # red\n", 92 | " # \"otherapp\": \"#3cb44b\", # green\n", 93 | " # \"someother\": \"#4363d8\", # blue\n", 94 | " # \"...\": \"#f58231\", # orange\n", 95 | " # \"iii\": \"#911eb4\", # purple\n", 96 | " # \"ooo\": \"#42d4f4\"\n", 97 | "}\n" 98 | ], 99 | "metadata": { 100 | "id": "CSSVTwp-53gZ" 101 | }, 102 | "execution_count": null, 103 | "outputs": [] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "source": [ 108 | "def extract_domain_and_subdomains(url):\n", 109 | " ext = tldextract.extract(url)\n", 110 | " subdomains = ext.subdomain.split('.') if ext.subdomain else []\n", 111 | " subdomains.reverse()\n", 112 | " return [ext.domain] + subdomains\n", 113 | "\n", 114 | "def build_domain_tree(df_list):\n", 115 | " domain_tree = {}\n", 116 | " frequencies = defaultdict(int)\n", 117 | " domain_to_dfs = defaultdict(set)\n", 118 | "\n", 119 | " for df_name, df_obj in df_list:\n", 120 | " urls = df_obj['url'].dropna().tolist()\n", 121 | " for url in urls:\n", 122 | " parts = extract_domain_and_subdomains(url)\n", 123 | " for p in parts:\n", 124 | " frequencies[p] += 1\n", 125 | " domain_to_dfs[p].add(df_name)\n", 126 | " # Build nested tree structure\n", 127 | " current = domain_tree\n", 128 | " for p in parts:\n", 129 | " current = current.setdefault(p, {})\n", 130 | " return domain_tree, frequencies, domain_to_dfs\n", 131 | "\n", 132 | "def tree_to_graph(domain_tree):\n", 133 | " G = nx.DiGraph()\n", 134 | " def add_edges(d, parent=None):\n", 135 | " for node, subtree in d.items():\n", 136 | " if parent is not None:\n", 137 | " G.add_edge(parent, node)\n", 138 | " add_edges(subtree, node)\n", 139 | " add_edges(domain_tree)\n", 140 | " return G\n", 141 | "\n", 142 | "def hex_to_rgb(hex_color):\n", 143 | " hex_color = hex_color.strip('#')\n", 144 | " return tuple(int(hex_color[i:i+2], 16) for i in (0, 2, 4))\n", 145 | "\n", 146 | 
"def rgb_to_rgba(r, g, b, alpha=0.2):\n", 147 | " return f\"rgba({r},{g},{b},{alpha})\"\n", 148 | "\n", 149 | "def mix_colors(hex_colors, alpha=0.2):\n", 150 | " if not hex_colors:\n", 151 | " return \"rgba(153,153,153,0.2)\" # default gray\n", 152 | " rgbs = [hex_to_rgb(color) for color in hex_colors]\n", 153 | " avg_r = int(sum(c[0] for c in rgbs) / len(rgbs))\n", 154 | " avg_g = int(sum(c[1] for c in rgbs) / len(rgbs))\n", 155 | " avg_b = int(sum(c[2] for c in rgbs) / len(rgbs))\n", 156 | " return rgb_to_rgba(avg_r, avg_g, avg_b, alpha)\n", 157 | "\n", 158 | "def visualize_pyvis(G, frequencies, domain_to_dfs, df_color_map, html_file=\"combined_domains.html\"):\n", 159 | " net = Network(height=\"700px\", width=\"100%\", notebook=True, directed=True, cdn_resources='in_line')\n", 160 | "\n", 161 | " # set custom physics options using JSON configuration - this is some complex stuff\n", 162 | " net.set_options(\"\"\"\n", 163 | " var options = {\n", 164 | " \"physics\": {\n", 165 | " \"barnesHut\": {\n", 166 | " \"gravitationalConstant\": -1000,\n", 167 | " \"centralGravity\": 0.05,\n", 168 | " \"springLength\": 200,\n", 169 | " \"springConstant\": 0.04,\n", 170 | " \"damping\": 0.09,\n", 171 | " \"avoidOverlap\": 0.5\n", 172 | " },\n", 173 | " \"minVelocity\": 0.75\n", 174 | " }\n", 175 | " }\n", 176 | " \"\"\")\n", 177 | "\n", 178 | " for node in G.nodes():\n", 179 | " freq = frequencies[node]\n", 180 | " df_names = domain_to_dfs[node]\n", 181 | " hex_colors = [df_color_map[df_name] for df_name in df_names] if df_names else []\n", 182 | " node_color = mix_colors(hex_colors, alpha=0.2)\n", 183 | "\n", 184 | " font_size = max(11, int(freq * 1.5))\n", 185 | " node_size = freq * 8\n", 186 | "\n", 187 | " tooltip = (\n", 188 | " f\"{node}\"\n", 189 | " f\"Frequency: {freq}\"\n", 190 | " f\"Dataframes: {', '.join(df_names) if df_names else 'None'}\"\n", 191 | " )\n", 192 | "\n", 193 | " net.add_node(\n", 194 | " node,\n", 195 | " label=node,\n", 196 | " size=node_size,\n", 197 | " color=node_color,\n", 198 | " font={'size': font_size, 'color': '#222'},\n", 199 | " title=tooltip\n", 200 | " )\n", 201 | "\n", 202 | " for source, target in G.edges():\n", 203 | " net.add_edge(source, target)\n", 204 | "\n", 205 | " net.show(html_file)\n", 206 | " print(f\"visualization saved to {html_file}.\")" 207 | ], 208 | "metadata": { 209 | "id": "Nl_amuKc5lCn" 210 | }, 211 | "execution_count": null, 212 | "outputs": [] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "source": [ 217 | "domain_tree, frequencies, domain_to_dfs = build_domain_tree(df_list)\n", 218 | "G = tree_to_graph(domain_tree)\n", 219 | "\n", 220 | "visualize_pyvis(G, frequencies, domain_to_dfs, df_color_map, \"ad_domains.html\")" 221 | ], 222 | "metadata": { 223 | "id": "OQtP2Zv26UVv" 224 | }, 225 | "execution_count": null, 226 | "outputs": [] 227 | } 228 | ] 229 | } -------------------------------------------------------------------------------- /mitm_test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "# Hey there!\n", 21 | "Let's continue - now that we have the .flow file, we can parse and analyse it." 
    {
      "cell_type": "code",
      "source": [
        "domain_tree, frequencies, domain_to_dfs = build_domain_tree(df_list)\n",
        "G = tree_to_graph(domain_tree)\n",
        "\n",
        "visualize_pyvis(G, frequencies, domain_to_dfs, df_color_map, \"ad_domains.html\")"
      ],
      "metadata": {
        "id": "OQtP2Zv26UVv"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}
--------------------------------------------------------------------------------
/mitm_test.ipynb:
--------------------------------------------------------------------------------
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Hey there!\n",
        "Let's continue - now that we have the .flow file, we can parse and analyse it."
      ],
      "metadata": {
        "id": "mzsz3hLhCZv9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Run this if you don't have the modules installed\n",
        "# (csv, datetime, json and re are part of the standard library - no pip needed for those)\n",
        "!pip install mitmproxy pandas requests"
      ],
      "metadata": {
        "id": "3LBqe9nwWuGj"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import csv\n",
        "import re\n",
        "from datetime import datetime\n",
        "\n",
        "import pandas as pd\n",
        "import requests\n",
        "from mitmproxy import io"
      ],
      "metadata": {
        "id": "g8W6lB-NXKi6"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Put the filename that you chose before (e.g. appname.flow) here\n"
      ],
      "metadata": {
        "id": "NsM8ZddBCtDX"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "file = \"makemore.flow\"\n",
        "output = file.split(sep=\".\")[0] + \".csv\""
      ],
      "metadata": {
        "id": "SvW_pUZCw2Gc"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here we read the .flow file and turn it into a csv for further analysis"
      ],
      "metadata": {
        "id": "8t87DvWcDAnY"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "mB0YToKJWLX-"
      },
      "outputs": [],
      "source": [
        "def clean_bytes(data):\n",
        "    if not data:\n",
        "        return \"\"\n",
        "    try:\n",
        "        return data.decode(\"utf-8\", errors=\"replace\")\n",
        "    except Exception:\n",
        "        return str(data)\n",
        "\n",
        "\n",
        "with open(file, \"rb\") as logfile:\n",
        "    freader = io.FlowReader(logfile)\n",
        "\n",
        "    with open(output, \"w\", newline=\"\", encoding=\"utf-8\") as csvfile:\n",
        "        fieldnames = [\n",
        "            \"timestamp\", \"method\", \"url\",\n",
        "            \"full_request\", \"full_response\"\n",
        "        ]\n",
        "        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\n",
        "        writer.writeheader()\n",
        "\n",
        "        for flow in freader.stream():\n",
        "            req = flow.request\n",
        "            res = flow.response\n",
        "\n",
        "            # headers one per line, then a blank line, then the decoded body\n",
        "            # (prepend f\"{req.method} {req.path} HTTP/{req.http_version}\\r\\n\" if you also want the request line)\n",
        "            full_request = \"\".join(f\"{k}: {v}\\r\\n\" for k, v in req.headers.items())\n",
        "            full_request += \"\\r\\n\" + clean_bytes(req.content)\n",
        "\n",
        "            full_response = \"\"\n",
        "            if res:\n",
        "                # same layout as the request\n",
        "                # (prepend f\"HTTP/{res.http_version} {res.status_code} {res.reason}\\r\\n\" for the status line)\n",
        "                full_response += \"\".join(f\"{k}: {v}\\r\\n\" for k, v in res.headers.items())\n",
        "                full_response += \"\\r\\n\" + clean_bytes(res.content)\n",
        "\n",
        "            writer.writerow({\n",
        "                \"timestamp\": datetime.fromtimestamp(req.timestamp_start).isoformat(),\n",
        "                \"method\": req.method,\n",
        "                \"url\": req.pretty_url,\n",
        "                \"full_request\": full_request,\n",
        "                \"full_response\": full_response\n",
        "            })"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "df = pd.read_csv(output)\n",
        "# uncomment the \"df\" below to see a preview of the full table\n",
        "\n",
        "# df"
      ],
      "metadata": {
        "id": "GMU-yWLPW1FK"
      },
      "execution_count": null,
      "outputs": []
    },
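    {
      "cell_type": "markdown",
      "source": [
        "Optional: a quick look at which hosts appear most often before any keyword filtering - a small sketch that just splits the host out of each url."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# top 10 hosts by request count\n",
        "df['url'].dropna().str.split('/').str[2].value_counts().head(10)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },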
    {
      "cell_type": "markdown",
      "source": [
        "You'll need to put the keywords that will be used for filtering here.\n",
        "\n",
        "I put my city / country / postal code / IP address / lat + long. You can also look through the parameters and put in something specific you want to track (like screen brightness).\n",
        "\n",
        "Please note that these will be used to find exact whole-word matches (there's too much code and text out there for a substring search).\n",
        "\n",
        "---\n",
        "\n",
        "You can also uncomment the last row in the cell below to find out your IP, coordinates and so on (based on your IP).\n",
        "\n",
        "Please note that if you run it in Google Colab, it will display the data of a Google server somewhere in the world - in that case, simply open the \"url\" in your browser\n"
      ],
      "metadata": {
        "id": "OX5wfeT4Desq"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "keywords = [\"lat\", \"lon\", \"loc\", \"postal\",\n",
        "            \"Barcelona\"]\n",
        "\n",
        "url = 'https://ipinfo.io/json'\n",
        "# print(requests.get(url).text)"
      ],
      "metadata": {
        "id": "b21Wo2Ry59A3"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "This cell applies the filter to the table from before and adds new columns for the matches (a small demo of the matching behaviour follows it)."
      ],
      "metadata": {
        "id": "xgOEo4lKEhz3"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "pattern = r'\\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\\b'\n",
        "regex = re.compile(pattern)\n",
        "\n",
        "def extract_context(text, pattern, window=40):\n",
        "    if pd.isna(text):\n",
        "        return \"\"\n",
        "\n",
        "    matches = []\n",
        "    for match in pattern.finditer(text):\n",
        "        start = max(match.start() - window, 0)\n",
        "        end = match.end() + window\n",
        "        context = text[start:end].replace(\"\\n\", \" \").replace(\"\\r\", \"\")\n",
        "        matches.append(f\"...{context}...\")\n",
        "\n",
        "    return \" | \".join(matches)\n",
        "\n",
        "def extract_keywords(text, pattern):\n",
        "    if pd.isna(text):\n",
        "        return \"\"\n",
        "    return \" | \".join(set(match.group() for match in pattern.finditer(text)))\n",
        "\n",
        "df[\"matched_in_request\"] = df[\"full_request\"].apply(lambda x: extract_context(x, regex))\n",
        "df[\"matched_in_response\"] = df[\"full_response\"].apply(lambda x: extract_context(x, regex))\n",
        "df[\"reason_in_request\"] = df[\"full_request\"].apply(lambda x: extract_keywords(x, regex))\n",
        "df[\"reason_in_response\"] = df[\"full_response\"].apply(lambda x: extract_keywords(x, regex))\n",
        "\n",
        "df_filtered = df[(df[\"matched_in_request\"] != \"\") | (df[\"matched_in_response\"] != \"\")]"
      ],
      "metadata": {
        "id": "6X8hrHCsZsik"
      },
      "execution_count": null,
      "outputs": []
    },
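    {
      "cell_type": "markdown",
      "source": [
        "A quick illustration of the whole-word matching (safe to skip): `\\b...\\b` means \"lat\" matches as a standalone token, but not inside \"latitude\"."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "print(regex.findall(\"lat=41.39&latitude=41.39&loc=Barcelona\"))  # ['lat', 'loc', 'Barcelona']"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },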
"matched_in_response\t- +-40 symbols surrounding the match in response\n", 250 | "\n", 251 | "reason_in_request\n", 252 | "\n", 253 | "reason_in_response\n", 254 | "\n", 255 | "In the left part of the table you can also see the index of each row - you might need it later." 256 | ], 257 | "metadata": { 258 | "id": "r7yYDkU9EqC2" 259 | } 260 | }, 261 | { 262 | "cell_type": "code", 263 | "source": [ 264 | "df_filtered" 265 | ], 266 | "metadata": { 267 | "id": "7YV8UzcNYtXH" 268 | }, 269 | "execution_count": null, 270 | "outputs": [] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "source": [ 275 | "Now the automation stops and it's time for manual analysis.\n", 276 | "\n", 277 | "The data formats are too different to automate this (or too sophisticated for me), + you want to filter out the false positives.\n", 278 | "\n", 279 | "Good news: usually this table only has around 10 rows, so it's not hard to look through all of them.\n", 280 | "\n", 281 | "The cell simply prints out the entire value of a given row (index, eg 43) and column - use column names from the table above.\n", 282 | "\n", 283 | "When the value is too large to read trhough in plain text, I copy it to SublimeText and use the search in there." 284 | ], 285 | "metadata": { 286 | "id": "fHcZjqPa9Zvs" 287 | } 288 | }, 289 | { 290 | "cell_type": "code", 291 | "source": [ 292 | "df.loc[43, \"full_request\"]" 293 | ], 294 | "metadata": { 295 | "id": "UXzWoRQT3RTC" 296 | }, 297 | "execution_count": null, 298 | "outputs": [] 299 | } 300 | ] 301 | } --------------------------------------------------------------------------------