├── .gitignore ├── Creation_de_contours_a_partir_du_REU.ipynb ├── README.md ├── atelier.ipynb ├── bureaux_de_vote.html ├── cleaner.py ├── decoupage_parquet.py ├── display.py ├── generate_areas.py ├── generate_areas_geojson.py ├── geo.py ├── license.md ├── main.py ├── main_atelier.py ├── renovate.json ├── requirements.txt └── starting_kit_atelier.R /.gitignore: --------------------------------------------------------------------------------
1 | # Jupyter Notebook
2 | .ipynb_checkpoints
3 |
4 | # IPython
5 | profile_default/
6 | ipython_config.py
7 |
8 | # pyenv
9 | .python-version
10 |
11 | communes-*
12 | *.html
13 | *.csv
14 |
15 | __pycache__/ -------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------
1 | # bureau-vote
2 |
3 | This repository contains the joint work of the Etalab and data.gouv.fr teams, in close collaboration with INSEE, on the répertoire électoral unique (REU, the single electoral register). The goal of this work was to start from the [REU data](https://www.data.gouv.fr/fr/datasets/bureaux-de-vote-et-adresses-de-leurs-electeurs/) (addresses across France with their assigned bureau de vote) and derive contours for the bureaux de vote (polling stations) of France. Such data will make it possible in the future - as well as for all elections whose data are already public - to display election results at the finest granularity there is: that of the bureaux de vote.
4 |
5 | The chosen method is that of [Voronoi diagrams](https://fr.wikipedia.org/wiki/Diagramme_de_Vorono%C3%AF), which partition a plane containing points of interest (called seeds) into as many zones around those seeds, so that each zone encloses exactly one seed and consists of all the points closer to that seed than to any other. Other methods are possible, as are other choices within this method: the contours are not unique.
6 |
7 | ## Building the contours
8 |
9 | The Python notebook ``Creation_de_contours_a_partir_du_REU.ipynb`` contains all the information needed to regenerate the contours exactly as we published them. The prerequisites are:
10 | - ``python`` and ``jupyter notebook`` installed
11 | - all the packages listed in the `requirements.txt` file
12 |
13 | Then simply run the notebook from top to bottom to obtain the contours the same way we generated them. All the functions used are in this repo and can be improved: contributions are welcome!
14 |
15 | ## Prior work
16 |
17 | This repository also includes Python code to clean and geocode an extract (the Ariège département) of the raw address format of the Répertoire Electoral Unique, as well as code to display, on a base map, the publication standard retained by INSEE [a link to the documentation will be added here later].
18 |
19 | It is one of the working repositories for the open data publication of the addresses of the Répertoire Electoral Unique, and is not meant to be maintained once the file has been released.
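To make the Voronoi method described above concrete in isolation, here is a minimal, hypothetical sketch using `scipy.spatial.Voronoi`; note that the pipeline in `geo.py` actually relies on the `pytess` library, and the seed coordinates below are made up for illustration:

```
# Minimal Voronoi sketch - assumes scipy is available; geo.py uses pytess instead.
import numpy as np
from scipy.spatial import Voronoi

rng = np.random.default_rng(0)
# Hypothetical seeds: lon/lat of fictitious polling stations.
seeds = rng.uniform(low=[1.55, 42.95], high=[1.65, 43.00], size=(20, 2))

vor = Voronoi(seeds)
for point_idx, region_idx in enumerate(vor.point_region):
    region = vor.regions[region_idx]
    if region and -1 not in region:  # keep finite cells only
        cell = [tuple(vor.vertices[v]) for v in region]
        print(f"seed {seeds[point_idx]} -> cell with {len(cell)} vertices")
# Cells on the convex hull are unbounded (marked with -1), which is why the
# real pipeline clips every cell to commune boundaries afterwards.
```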
20 |
21 | ### Displaying the already-geocoded address file on a base map, for any département
22 |
23 | Put the source data files at the root of the repository, adapt the code if needed by setting both the path of the address file and the path of the commune contour file (in our case, communes-20220101.shp), create a Python 3.10 virtual environment (not required, but recommended), then run:
24 |
25 | ```
26 | python3.10 -m pip install -r requirements.txt
27 | python3.10 main_atelier.py
28 | ```
29 |
30 | ### Cleaning, geocoding and displaying the address file, plus unofficial contour experiments, for the Ariège département
31 |
32 | #### Required data
33 |
34 | - Get the source data
35 | - Get the commune contour data ([file used here](https://www.data.gouv.fr/fr/datasets/decoupage-administratif-communal-francais-issu-d-openstreetmap/))
36 |
37 | #### Deployment
38 |
39 | Put these data files at the root of the repository, adapt the code if needed by setting the path of the commune contour file (in our case, communes-20220101.shp), create a Python 3.10 virtual environment (not required, but recommended), then run:
40 |
41 | ```
42 | python3.10 -m pip install -r requirements.txt
43 | python3.10 main.py
44 | ``` -------------------------------------------------------------------------------- /atelier.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "id": "8cf893a0",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "%load_ext autoreload\n",
11 | "%autoreload 2"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": null,
17 | "id": "bb62bea5",
18 | "metadata": {},
19 | "outputs": [],
20 | "source": [
21 | "#!python3.10 -m pip install pyarrow"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "id": "b5622edd",
28 | "metadata": {},
29 | "outputs": [],
30 | "source": [
31 | "import os\n",
32 | "import pandas as pd\n",
33 | "import geopandas as gpd\n",
34 | "from display import *\n",
35 | "import re"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": null,
41 | "id": "2a1d507f",
42 | "metadata": {
43 | "scrolled": true
44 | },
45 | "outputs": [],
46 | "source": [
47 | "# path of the address file\n",
48 | "path = \"extrait_fichier_adresses_REU.parquet\"\n",
49 | "#os.listdir()"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "id": "f400b591",
55 | "metadata": {},
56 | "source": [
57 | "## Loading the address file, and a file with the shapes of communes\n",
58 | "##### Warning: these files are memory-consuming"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "id": "6a6b0625",
65 | "metadata": {},
66 | "outputs": [],
67 | "source": [
68 | "df = pd.read_parquet(path)\n",
69 | "df.head()"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "id": "31507876",
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | "df.describe()"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "id": "22724e39",
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "communes_france = gpd.read_file(\"communes-20220101.shp\")[[\"geometry\", \"insee\"]].dropna()"
90 | ]
91 | },
92 | {
93 |
"cell_type": "markdown", 94 | "id": "0b7056f3", 95 | "metadata": {}, 96 | "source": [ 97 | "### The code below creates an (unofficial) identifier of bureau de vote. We use it in this code mostly for displaying purpose" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "id": "d743febb", 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "\n", 108 | "def prepare_ids(df: pd.DataFrame) -> pd.DataFrame:\n", 109 | " \"\"\"\n", 110 | " Prepare not-official `id_bv` (integers) column, under the assumption there is less than 10000 bv per city\n", 111 | "\n", 112 | " Args:\n", 113 | " df (pd.DataFrame): a dataframe including columns \"Code_BV\" and \"result_citycode\"\n", 114 | "\n", 115 | " Returns:\n", 116 | " pd.DataFrame: a dataframe similar to the input, with a supplementary column \"id_bv\" (integers) unique for every bureau de vote\n", 117 | " \"\"\"\n", 118 | " assert (\"code_bv\" in df.columns) and (\n", 119 | " \"code_commune_ref\" in df.columns\n", 120 | " ), \"There is no identifiers for bureau de vote\"\n", 121 | " df_copy = df.copy()\n", 122 | "\n", 123 | " def prepare_id_bv(row):\n", 124 | " \"\"\"\n", 125 | " Combine the unique id of a city (citycode) and the number of the bureau de vote inside the city to compute a nationalwide id of bureau de vote\n", 126 | "\n", 127 | " Args:\n", 128 | " row (_type_): _description_\n", 129 | "\n", 130 | " Returns:\n", 131 | " id_bv: integer serving as unique id of a bureau de vote\n", 132 | " \"\"\"\n", 133 | " max_bv_per_city = 10000 # assuming there is always less than this number of bv in a city. This is important to grant the uniqueness of id_bv\n", 134 | " max_code_commune = 10**5\n", 135 | " try:\n", 136 | " code_bv = int(row[\"code_bv\"])\n", 137 | " except:\n", 138 | " # keep as Code_BV the first number found in the string (if there is one)\n", 139 | " found = re.search(r\"\\d+\", row[\"code_bv\"])\n", 140 | " if found:\n", 141 | " code_bv = int(found.group())\n", 142 | " else:\n", 143 | " code_bv = max_bv_per_city # this code will indicate parsing errors but won't raise exception\n", 144 | " try:\n", 145 | " code_commune = int(row[\"code_commune_ref\"])\n", 146 | " except:\n", 147 | " found = re.search(r\"\\d+\", row[\"code_commune_ref\"])\n", 148 | " if found:\n", 149 | " code_commune = int(found.group())\n", 150 | " else:\n", 151 | " code_commune = max_code_commune\n", 152 | " return max_bv_per_city * code_commune + code_bv\n", 153 | "\n", 154 | " df_copy[\"id_bv\"] = df_copy.apply(prepare_id_bv, axis=1)\n", 155 | " return df_copy" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "id": "90ff3390", 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "# add an unofficiel \"id_bv\" field id to recognize and to determine the color of id fields\n", 166 | "df_prepared = prepare_ids(df)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "id": "d06bb8e1", 172 | "metadata": {}, 173 | "source": [ 174 | "## Display an example, restricted to a fraction of a department" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "id": "c65943a6", 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "# Take the example of the departement 83: Le Var\n", 185 | "DEP = \"83\"" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "id": "67b24ed0", 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "communes_dep = 
communes_france[communes_france.insee.str.startswith(str(DEP))]\n",
196 | "communes_dep"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "id": "f1da4dfc",
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "# For display purposes, keep only a fraction of the addresses\n",
207 | "ratio = 0.1 # 0 <= ratio <= 1"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": null,
213 | "id": "48e49bd0",
214 | "metadata": {},
215 | "outputs": [],
216 | "source": [
217 | "df_dep = df_prepared[df_prepared.dep_bv==str(DEP)].sample(frac=ratio, random_state=0)\n"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": null,
223 | "id": "c489e703",
224 | "metadata": {
225 | "scrolled": true
226 | },
227 | "outputs": [],
228 | "source": [
229 | "r = display_addresses(addresses=df_dep, communes=communes_dep)\n",
230 | "r.to_html(f\"scatterplot_{DEP}_layer_ratio_{ratio}.html\")\n"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": null,
236 | "id": "f84bfa17",
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "r_voronoi = display_bureau_vote_shapes(addresses=df_dep, communes=communes_dep, mode=\"voronoi\")\n",
241 | "r_voronoi.to_html(f\"voronoi_{DEP}_layer_ratio_{ratio}.html\")"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": null,
247 | "id": "704c570b",
248 | "metadata": {},
249 | "outputs": [],
250 | "source": []
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": null,
255 | "id": "2b56c07c",
256 | "metadata": {},
257 | "outputs": [],
258 | "source": []
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "id": "3b89f39b",
264 | "metadata": {},
265 | "outputs": [],
266 | "source": []
267 | }
268 | ],
269 | "metadata": {
270 | "kernelspec": {
271 | "display_name": "Python 3 (ipykernel)",
272 | "language": "python",
273 | "name": "python3"
274 | },
275 | "language_info": {
276 | "codemirror_mode": {
277 | "name": "ipython",
278 | "version": 3
279 | },
280 | "file_extension": ".py",
281 | "mimetype": "text/x-python",
282 | "name": "python",
283 | "nbconvert_exporter": "python",
284 | "pygments_lexer": "ipython3",
285 | "version": "3.10.6"
286 | }
287 | },
288 | "nbformat": 4,
289 | "nbformat_minor": 5
290 | }
291 | -------------------------------------------------------------------------------- /bureaux_de_vote.html: --------------------------------------------------------------------------------
[static HTML export of the Ariège workflow (page title: "bureaux_de_vote"); the page markup was stripped in this dump, the recoverable notebook cells follow]
%load_ext autoreload
%autoreload 2
#!pip install pydeck

import cleaner
import display
import pandas as pd
import geo
import geopandas as gpd
import pydeck as pdk

pd.options.display.max_rows = 1000
pd.options.display.max_columns = 1000

df = pd.read_csv("Correspondance adresse_bureau de vote_Département de l'Ariège.csv", sep=";", dtype=str)
df = cleaner.clean_dataset(df)

# check that names preceded with a "chez" have been removed
df[['adr_complete', 'libelle_voie_clean', 'comp_adr_1_clean', 'comp_adr_2_clean', 'lieu-dit-clean']][df['adr_complete'].str.contains('chez', na=False)].head()

df.drop(columns=['libelle_voie_clean', 'comp_adr_1_clean', 'comp_adr_2_clean', 'lieu-dit-clean'], inplace=True)

#geocoded_df = geo.add_geoloc(df=df)
geocoded_df = pd.read_csv("concat_adr_bv_geocoded.csv", dtype=str)

Clean geocoded dataframe

geocoded_df = cleaner.clean_geocoded_types(geocoded_df)
geocoded_df = cleaner.clean_failed_geocoding(geocoded_df)
geocoded_df = cleaner.prepare_ids(geocoded_df)

# IMPORTANT: when two points share the same lat-lon position, keep only one
geocoded_df = geocoded_df.drop_duplicates(subset=["latitude", "longitude"])

Load shapes of communes

communes_france = gpd.read_file("communes-20220101.shp")[["geometry", "insee"]].dropna().\
    rename(columns={"insee": "result_citycode"})
communes_france["result_citycode"] = communes_france["result_citycode"].apply(lambda row: row.split(".")[0] if "." in row else row)

communes_ariege = communes_france[communes_france.result_citycode.str.startswith("09")]
del communes_france
communes_ariege.head()

Cartography with color by bureau de vote

r = display.display_addresses(addresses=geocoded_df, communes=communes_ariege)
r.to_html("scatterplot_layer.html")

Save GeoJSON (with 1 Point per voter address)

# geojson = geo.build_geojson_point(geocoded_df)
# geojson.to_file("bv_point.geojson", driver="GeoJSON")

Display convex Hull

# r_hulls = display.display_bureau_vote_shapes(addresses=geocoded_df, communes=communes_ariege, mode="convex")
# r_hulls.to_html("hull_layer.html")

Display Voronoi tessellation

r_voronoi = display.display_bureau_vote_shapes(addresses=geocoded_df, communes=communes_ariege, mode="voronoi")
r_voronoi.to_html("voronoi_layer.html")
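This file is a static nbconvert-style export of the workflow above. Assuming the source notebook is available (the repository does not include the .ipynb this particular export came from), a comparable export can be regenerated with a command such as:

```
jupyter nbconvert --to html <notebook>.ipynb --output bureaux_de_vote.html
```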
-------------------------------------------------------------------------------- /cleaner.py: --------------------------------------------------------------------------------
1 | """
2 | Various cleaning methods. Some can be applied directly to the input addresses table, while others must be applied to addresses
3 | that have previously been geocoded with the "geo" module.
4 | """
5 |
6 | import pandas as pd
7 | from difflib import SequenceMatcher
8 | import re
9 |
10 |
11 | def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
12 |     """
13 |     Lowercase the string fields and remove the names of persons from the dataset
14 |
15 |     Args:
16 |         df (pd.DataFrame): the raw dataframe read from the INSEE file
17 |
18 |     Returns:
19 |         pd.DataFrame: a dataframe without any names, where some column names and column contents have been normalized
20 |     """
21 |     df = df.rename(
22 |         columns={
23 |             "Numéro de voie": "num_voie",
24 |             "Type et libellé de voie": "libelle_voie",
25 |             "Complément d’adresse 1": "comp_adr_1",
26 |             "Complément d’adresse 2": "comp_adr_2",
27 |             "Lieu-dit ": "lieu-dit",
28 |             "Code commune\nRéférentiel": "Code communeRéférentiel",
29 |             "Libellé commune\nRéférentiel": "Libellé communeRéférentiel",
30 |         },
31 |         errors="ignore"
32 |     )
33 |     for col in ["num_voie", "libelle_voie", "comp_adr_1", "comp_adr_2", "lieu-dit"]:
34 |         try:
35 |             df[col] = df[col].str.lower()
36 |             # build the "_clean" columns with persons' names stripped out (see remove_names below);
37 |             # NB: this call is an assumed fix - the original line only lowercased the column again
38 |             df[f"{col}_clean"] = df[col].astype(str).apply(remove_names)
39 |         except:
40 |             continue
41 |     if "geo_adresse" not in df.columns:
42 |         df["geo_adresse"] = df.apply(lambda row: get_address(row), axis=1)
43 |     return df
44 |
45 |
46 | def remove_names(x: str) -> str:
47 |     """
48 |     This function is specific to the Ariège dataset. It normalizes the text, detects the presence of the word "chez" and removes the names following this word.
49 |     In particular the function assumes that:
50 |     (i) the name of a person is made of 2 tokens (compound words count as one), and may follow tokens like "m."/"madame"...
(ii) at most 2 persons are mentioned in one field, and their two names are then separated only by the word "et"
51 |
52 |     Args:
53 |         x (str): a string possibly containing names, following the conditions above
54 |
55 |     Returns:
56 |         str: a string where names have been removed
57 |     """
58 |     x = (
59 |         x.replace("(", "")
60 |         .replace(")", "")
61 |         .replace(".", "")
62 |         .replace(",", "")
63 |         .replace(";", "")
64 |         .replace("/", "")
65 |         .lower()
66 |     )
67 |     if "chez" in x:
68 |         adr = x.split("chez")[0]
69 |         chez = x.split("chez")[1]
70 |         to_parse = chez.split(" ")
71 |         if len(to_parse) > 1:
72 |             if to_parse[1] in [
73 |                 "m.",
74 |                 "m",
75 |                 "mr",
76 |                 "mme",
77 |                 "mlle",
78 |                 "monsieur",
79 |                 "madame",
80 |                 "mademoiselle",
81 |             ]:
82 |                 if len(to_parse) > 4:
83 |                     if to_parse[4] == "et":
84 |                         if len(to_parse) > 5 and to_parse[5] in [
85 |                             "m.",
86 |                             "m",
87 |                             "mr",
88 |                             "mme",
89 |                             "mlle",
90 |                             "monsieur",
91 |                             "madame",
92 |                             "mademoiselle",
93 |                         ]:
94 |                             adr = adr + " ".join(to_parse[8:])
95 |                         else:
96 |                             adr = adr + " ".join(to_parse[7:])
97 |                     else:
98 |                         adr = adr + " ".join(to_parse[4:])
99 |             else:
100 |                 if len(to_parse) > 3:
101 |                     if to_parse[3] == "et":
102 |                         adr = adr + " ".join(to_parse[4:])
103 |                 else:
104 |                     adr = adr + " ".join(to_parse[3:])
105 |         if adr == "nan":
106 |             return ""
107 |         else:
108 |             return adr
109 |     else:
110 |         if x == "nan":
111 |             return ""
112 |         else:
113 |             return x
114 |
115 |
116 | def clean_geocoded_types(df: pd.DataFrame) -> pd.DataFrame:
117 |     """
118 |     Clean some dtypes of the dataframe after the geocoding step
119 |
120 |     Args:
121 |         df (pd.DataFrame): the geocoded dataframe, where numeric columns are still stored as strings
122 |
123 |     Returns:
124 |         pd.DataFrame: the same dataframe with float coordinates and scores, restricted to rows where geocoding returned a label
125 |     """
126 |     geocoded_df = df.copy()
127 |     geocoded_df["latitude"] = geocoded_df["latitude"].astype(float)
128 |     geocoded_df["longitude"] = geocoded_df["longitude"].astype(float)
129 |     geocoded_df["result_score"] = geocoded_df["result_score"].astype(float)
130 |     geocoded_df = geocoded_df[geocoded_df["result_label"].notna()]
131 |     return geocoded_df
132 |
133 |
134 | def clean_failed_geocoding(df: pd.DataFrame) -> pd.DataFrame:
135 |     """
136 |     Remove failed geocodings (score below a threshold), as well as lines where the geocoded citycode differs from the voter's reference commune, and lines where the geocoded postcode is inconsistent with the postcode indicated in the INSEE file
137 |
138 |     Args:
139 |         df (pd.DataFrame): a dataframe where geocoding has already been performed with API-adresse
140 |
141 |     Returns:
142 |         pd.DataFrame: a cleaned subset of this dataframe
143 |     """
144 |     assert (
145 |         "result_score" in df.columns
146 |         and "result_postcode" in df.columns
147 |         and "CP" in df.columns
148 |         and "CP_BV" in df.columns
149 |         and "result_citycode" in df.columns
150 |         and "Code communeRéférentiel" in df.columns
151 |     ), "the dataframe does not include required columns for cleaning"
152 |     # the comparison is performed on column "result_postcode" (because there is no citycode in the INSEE input file) but other functions will only refer to "result_citycode" (because it is good practice to prefer this column)
153 |     return df[
154 |         (df.result_score > 0.5)
155 |         & (df.result_citycode == df["Code communeRéférentiel"])
156 |         & (df.result_postcode == df.CP)
157 |     ].dropna(subset=["CP", "result_citycode", "result_postcode"])
158 |
159 |
160 | def get_address(row) -> str:
161 |     """
162 |     Build a unique address string by combining several fields
163 |
164 |     Args:
165 |         row : a row of a pd.DataFrame
166 |
167 |     Returns:
168 |         str: the
address
169 |     """
170 |
171 |     def similar(a: str, b: str) -> float:  # return a measure of similarity
172 |         return SequenceMatcher(None, a, b).ratio()
173 |
174 |     address = ""
175 |
176 |     for col in ["num_voie_clean", "libelle_voie_clean", "comp_adr_1_clean", "comp_adr_2_clean"]:
177 |         try:
178 |             address += str(row[col]) + " "
179 |         except:
180 |             continue
181 |
182 |     if "lieu-dit-clean" not in row:
183 |         return address.strip()
184 |     elif (similar(str(address), str(row["lieu-dit-clean"]).lower()) > 0.7) | (
185 |         str(row["lieu-dit-clean"]).lower() == "nan"
186 |     ):
187 |         return address.strip()
188 |     else:
189 |         return (address + " " + str(row["lieu-dit-clean"]).lower()).strip()
190 |
191 |
192 | def prepare_ids(df: pd.DataFrame) -> pd.DataFrame:
193 |     """
194 |     Prepare the `id_bv` (integer) column
195 |
196 |     Args:
197 |         df (pd.DataFrame): a dataframe including columns "Code_BV" and "result_citycode"
198 |
199 |     Returns:
200 |         pd.DataFrame: a dataframe similar to the input, with a supplementary column "id_bv" (integers) unique for every bureau de vote
201 |     """
202 |     assert ("Code_BV" in df.columns) and (
203 |         "result_citycode" in df.columns
204 |     ), "There are no identifiers for the bureaux de vote"
205 |     df_copy = df.copy()
206 |
207 |     def prepare_id_bv(row):
208 |         """
209 |         Combine the unique id of a city (citycode) and the number of the bureau de vote inside the city to compute a nationwide id of the bureau de vote
210 |
211 |         Args:
212 |             row (pd.Series): a row with "Code_BV" and "result_citycode" fields
213 |
214 |         Returns:
215 |             id_bv: integer serving as unique id of a bureau de vote
216 |         """
217 |         max_bv_per_city = 1000  # assuming there are always fewer than this many bv in a city; this guarantees the uniqueness of id_bv
218 |         max_code_commune = 10**5
219 |         try:
220 |             code_bv = int(row["Code_BV"])
221 |         except:
222 |             # keep as Code_BV the first number found in the string (if there is one)
223 |             found = re.search(r"\d+", row["Code_BV"])
224 |             if found:
225 |                 code_bv = int(found.group())
226 |             else:
227 |                 code_bv = max_bv_per_city  # this code will indicate parsing errors but won't raise an exception
228 |         try:
229 |             code_commune = int(row["result_citycode"])
230 |         except:
231 |             found = re.search(r"\d+", row["result_citycode"])
232 |             if found:
233 |                 code_commune = int(found.group())
234 |             else:
235 |                 code_commune = max_code_commune
236 |         return max_bv_per_city * code_commune + code_bv
237 |
238 |     df_copy["id_bv"] = df_copy.apply(prepare_id_bv, axis=1)
239 |     return df_copy
240 | -------------------------------------------------------------------------------- /decoupage_parquet.py: --------------------------------------------------------------------------------
1 | import pandas as pd
2 | import os
3 |
4 | path_in = "./../work/table_adresses.parquet"
5 | df = pd.read_parquet(path_in)
6 | df['dep_bv'] = df['code_commune_ref'].apply(lambda s: s[:3] if s[:2]=='97' else s[:2])
7 |
8 | for k in df.dep_bv.unique():
9 |     print(k)
10 |     path_out = f"parquet/table_{k}.parquet"
11 |     if f"table_{k}.parquet" in os.listdir("parquet/"):
12 |         print('Already processed')
13 |     else:
14 |         df[df.dep_bv == k].to_parquet(path_out)
15 | -------------------------------------------------------------------------------- /display.py: --------------------------------------------------------------------------------
1 | """
2 | Methods to display the addresses of voters, the shapes of communes and the interpolated shapes of bureaux de vote
3 | """
4 |
5 | import pydeck as pdk
6 | import pandas as pd
7 | import numpy as np
8 | import geopandas as gpd
9 | 
import geo
10 | from typing import Dict, List
11 |
12 | def prepare_layer_communes(communes: gpd.GeoDataFrame, filled=True) -> pdk.Layer:
13 |     """
14 |     Get a layer with the shapes of the communes
15 |
16 |     Args:
17 |         communes (gpd.GeoDataFrame): the shapes of the communes, and a column with the citycode
18 |         filled (bool, optional): if True, fills the communes shapes with colours. Defaults to True.
19 |
20 |     Returns:
21 |         pdk.Layer: a pydeck Layer with the polygonal shapes of the communes
22 |     """
23 |     assert (
24 |         "result_citycode" in communes.columns or "insee" in communes.columns
25 |     ), "the code commune must be given, in order to associate deterministic colours to communes"
26 |     if "result_citycode" in communes.columns:
27 |         col = "result_citycode"
28 |     else:
29 |         col = "insee"
30 |     displayed = communes.copy()
31 |     # Corsica: citycodes like "2A004" contain letters, which would break the cast to int below;
32 |     # replace the letters with "0" (this only affects the derived, arbitrary display colours)
33 |     displayed[col] = displayed[col].str.replace(r"[abAB]", "0", regex=True)
34 |     displayed = displayed.astype({col: int})
35 |     displayed["color_r"] = 7 * displayed[col] % 255
36 |     displayed["color_g"] = 23 * displayed[col] % 255
37 |     displayed["color_b"] = 67 * displayed[col] % 255
38 |
39 |     coordinates = []
40 |     for _, row in displayed.iterrows():
41 |         try:
42 |             coord = [
43 |                 [
44 |                     list(x)
45 |                     for x in np.transpose(
46 |                         [
47 |                             list(row["geometry"].exterior.coords.xy[0]),
48 |                             list(row["geometry"].exterior.coords.xy[1]),
49 |                         ]
50 |                     )
51 |                 ]
52 |             ]
53 |             coordinates.append(coord)
54 |         except Exception as e:
55 |             print(e)
56 |             coordinates.append([])
57 |
58 |     displayed["coordinates"] = coordinates
59 |
60 |     return pdk.Layer(
61 |         "PolygonLayer",
62 |         pd.DataFrame(displayed),
63 |         pickable=False,
64 |         opacity=0.05,
65 |         stroked=True,
66 |         filled=filled,
67 |         radius_scale=6,
68 |         line_width_min_pixels=1,
69 |         get_polygon="coordinates",
70 |         get_fill_color=["color_r", "color_g", "color_b"],
71 |         get_line_color=[128, 128, 128],
72 |     )
73 |
74 |
75 | def prepare_layer_addresses(df: pd.DataFrame) -> pdk.Layer:
76 |     """
77 |     Put a table of addresses on a map
78 |
79 |     Args:
80 |         df (pd.DataFrame): must include columns 'Commune' (strings), 'adr_complete' (strings), 'result_score' (floats), 'result_label' (strings), 'latitude' (floats), 'longitude' (floats)
81 |
82 |     Returns:
83 |         pdk.Layer: every input address is shown as a point on the map
84 |     """
85 |     data = df.copy()
86 |     data["radius"] = 6
87 |     data["coordinates"] = np.array(df[["longitude", "latitude"]]).tolist()
88 |     # NB: 7, 23 and 67 are coprime with 255. That implies two voting places in the same city will have the same colors if and only if their id_bv modulo 255 are the same. Moreover, two successive voting places will have rather different colors.
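    # Hypothetical worked example (not taken from the data): id_bv = 9123004, i.e. commune 09123, bureau 4,
    # maps to RGB = (7*9123004 % 255, 23*9123004 % 255, 67*9123004 % 255) = (103, 47, 148).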
89 | data["id_bv_r"] = 7 * data["id_bv"] % 255 90 | data["id_bv_g"] = 23 * data["id_bv"] % 255 91 | data["id_bv_b"] = 67 * data["id_bv"] % 255 92 | data.drop(columns=["latitude", "longitude"], inplace=True, errors="ignore") 93 | # Define a layer to display on a map 94 | return pdk.Layer( 95 | "ScatterplotLayer", 96 | data, 97 | pickable=True, 98 | opacity=0.9, 99 | filled=True, 100 | radius_min_pixels=1, 101 | radius_max_pixels=6, 102 | line_width_min_pixels=2, 103 | get_position="coordinates", 104 | get_fill_color=["id_bv_r", "id_bv_g", "id_bv_b"], 105 | get_radius="radius", 106 | get_line_color=[0, 0, 0], 107 | ) 108 | 109 | 110 | def prepare_layer_polygons( 111 | geo_addresses: gpd.GeoDataFrame, 112 | communes: gpd.GeoDataFrame = gpd.GeoDataFrame(), 113 | mode="voronoi", 114 | ) -> pdk.Layer: 115 | """ 116 | Draw polygons around the addresses, so that addresses sharing the same bureau de vote are within the same polygon 117 | 118 | :warning: The geometries of the `geo_addresses` must be either MultiPoint (if we want convex hull) or Point (if we want Voronoi cells) 119 | 120 | Args: 121 | geo_addresses (gpd.GeoDataFrame): must include columns "id_bv" and "result_citycode". The geometries must be shapely Point (in the case of voronoi cells) or MultiPoint (in the case of convex hulls) 122 | communes (gpd.GeoDataFrame, optional): the shapes of communes, if available 123 | mode (str, optional): The way we want to compute polygons around the addresses : can be "convex" or "voronoi". Defaults to "voronoi". 124 | 125 | Returns: 126 | pdk.Layer: calculated bureau de vote shapes are figured with polygons on the map 127 | """ 128 | assert mode.lower() in [ 129 | "convex", 130 | "voronoi", 131 | ], "the implemented methods are voronoi cells or convex hulls" 132 | mode = mode.lower() 133 | 134 | coordinates = [] 135 | 136 | if mode == "convex": 137 | displayed = geo_addresses.copy() 138 | displayed["hulls"] = geo.convex_hull(displayed) 139 | for _, row in displayed.iterrows(): 140 | try: 141 | coord = [ 142 | [ 143 | list(x) 144 | for x in np.transpose( 145 | [ 146 | list(row["hulls"].exterior.coords.xy[0]), 147 | list(row["hulls"].exterior.coords.xy[1]), 148 | ] 149 | ) 150 | ] 151 | ] 152 | coordinates.append(coord) 153 | except Exception as e: 154 | # print(e) 155 | coordinates.append([]) 156 | pass 157 | displayed.drop(columns=["geometry", "hulls"], inplace=True) 158 | 159 | elif mode == "voronoi": 160 | hulls = geo.get_clipped_voronoi_shapes(geo_addresses, communes) 161 | id_bvs = [] 162 | for _, row in hulls.iterrows(): 163 | id_bvs.append(row["id_bv"]) 164 | try: 165 | coord = [ 166 | [ 167 | list(x) 168 | for x in np.transpose( 169 | [ 170 | list(row["geometry"].exterior.coords.xy[0]), 171 | list(row["geometry"].exterior.coords.xy[1]), 172 | ] 173 | ) 174 | ] 175 | ] 176 | coordinates.append(coord) 177 | except Exception as e: 178 | coordinates.append([]) 179 | pass 180 | 181 | displayed = pd.DataFrame(data={"coordinates": coordinates, "id_bv": id_bvs}) 182 | displayed["id_bv_r"] = 7 * displayed["id_bv"] % 255 183 | displayed["id_bv_g"] = 23 * displayed["id_bv"] % 255 184 | displayed["id_bv_b"] = 67 * displayed["id_bv"] % 255 185 | displayed["coordinates"] = coordinates 186 | # Define a layer to display on a map 187 | return pdk.Layer( 188 | "PolygonLayer", 189 | pd.DataFrame(displayed), 190 | pickable=False, 191 | opacity=0.2, 192 | stroked=False, 193 | filled=True, 194 | radius_scale=6, 195 | line_width_min_pixels=1, 196 | get_polygon="coordinates", 197 | get_fill_color=["id_bv_r", 
"id_bv_g", "id_bv_b"], 198 | get_line_color=[0, 0, 0], 199 | ) 200 | 201 | 202 | def prepare_tooltip(columns: List[str]) -> Dict: 203 | """ 204 | Prepare a tooltip indicating a specific subset of columns 205 | 206 | Args: 207 | columns (List[str]): a list of columns of the data 208 | 209 | Returns: 210 | Dict: _description_ 211 | """ 212 | legend = "" 213 | for col in ["id_bv", "result_score", "geo_score", "commune_bv", "geo_adresse", "result_label", "adr_complete", "Commune"]: 214 | if col in columns: 215 | legend += f"{col}: "+"{"+f"{col}"+"} \n" 216 | tooltip = { 217 | "text": legend 218 | } 219 | return tooltip 220 | 221 | 222 | def display_addresses( 223 | addresses: pd.DataFrame, communes: gpd.GeoDataFrame = gpd.GeoDataFrame() 224 | ) -> pdk.Deck: 225 | """ 226 | Display a map with one point per address 227 | 228 | Args: 229 | addresses (pd.DataFrame): _description_ 230 | communes (gpd.GeoDataFrame, optional): the shapes of communes, if available 231 | 232 | Returns: 233 | pdk.Deck: _description_ 234 | """ 235 | addresses_layer = prepare_layer_addresses(addresses) 236 | if len(communes): 237 | layers = [prepare_layer_communes(communes), addresses_layer] 238 | else: 239 | layers = [addresses_layer] 240 | 241 | # Set the viewport location 242 | view_state = pdk.ViewState( 243 | latitude=43.055403, longitude=1.470104, zoom=6, bearing=0, pitch=0 244 | ) 245 | 246 | # Render 247 | return pdk.Deck( 248 | map_style="light", 249 | layers=layers, 250 | initial_view_state=view_state, 251 | tooltip=prepare_tooltip(addresses.columns), 252 | ) 253 | 254 | 255 | def display_bureau_vote_shapes( 256 | addresses: pd.DataFrame, 257 | communes: gpd.GeoDataFrame = gpd.GeoDataFrame(), 258 | mode="voronoi", 259 | ) -> pdk.Deck: 260 | """ 261 | Display on the same map the addresses and the corresponding interpolated bureau de vote shapes 262 | 263 | Args: 264 | addresses (pd.DataFrame): must include columns 'Commune' (strings), 'adr_complete' (strings), 'result_score' (floats), 'result_label' (strings), 'latitude' (floats), 'longitude' (floats) 265 | communes (gpd.GeoDataFrame, optional): the shapes of communes, if available 266 | mode (str, optional): The way we want to compute polygons around the addresses : can be "convex" or "voronoi". Defaults to "voronoi". 
267 |
268 |     Returns:
269 |         pdk.Deck: a pydeck Deck with layers 'addresses' (one point per address), 'communes' (one shape per commune), 'polygons' (one shape per bureau de vote, within the commune)
270 |     """
271 |     assert mode.lower() in ["convex", "voronoi"]
272 |     mode = mode.lower()
273 |
274 |     if mode == "convex":
275 |         geojson = geo.build_geojson_multipoint(addresses)
276 |     elif mode == "voronoi":
277 |         geojson = geo.build_geojson_point(addresses)
278 |
279 |     geojson.drop_duplicates(subset=["geometry"], inplace=True)
280 |     polygons_layer = prepare_layer_polygons(geojson, mode=mode, communes=communes)
281 |
282 |     if len(communes):
283 |         communes_layers = prepare_layer_communes(communes, filled=False)
284 |         layers = [communes_layers, polygons_layer, prepare_layer_addresses(addresses)]
285 |     else:
286 |         layers = [
287 |             polygons_layer,
288 |             prepare_layer_addresses(addresses),
289 |         ]
290 |
291 |     # Set the viewport location
292 |     view_state = pdk.ViewState(
293 |         latitude=43.055403, longitude=1.470104, zoom=6, bearing=0, pitch=0
294 |     )
295 |     # Render
296 |     return pdk.Deck(
297 |         map_style="light",
298 |         layers=layers,
299 |         initial_view_state=view_state,
300 |         tooltip=prepare_tooltip(addresses.columns),
301 |     )
302 | -------------------------------------------------------------------------------- /generate_areas.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | import os
5 | import pandas as pd
6 | import geopandas as gpd
7 | from display import *
8 | import re
9 |
10 | # display just a departement/drom/com
11 | DEP_LIST = ["0"+str(i) for i in range(1,10)]+[str(i) for i in range(10,19)]+["2A","2B"]+[str(i) for i in range(21,96)] + [str(i) for i in range(971,977)]
12 | #DEP_LIST = ["01", "83"]
13 | COMPUTE_BV_BORDERS = False
14 | # path of the address file
15 |
16 | commune_shapes_path = "communes-20220101.shp"
17 | communes_france = gpd.read_file(commune_shapes_path)[["geometry", "insee"]].dropna()
18 |
19 | for DEP in DEP_LIST:
20 |     communes_dep = communes_france[communes_france.insee.str.startswith(str(DEP))]
21 |
22 |     addresses_path = f"parquet/table_{DEP}.parquet"
23 |
24 |     # for this département, choose the ratio of addresses you want to plot
25 |     RATIO = 0.4 # 0 <= RATIO <= 1
26 |
27 |     # ## Loading the address file, and a file with the shapes of communes
28 |     # ##### Warning: these files are heavy
29 |
30 |     df = pd.read_parquet(addresses_path)
31 |     # if id_brut_bv is not None, the condition below should always be True
32 |     if "id_bv" not in df.columns:
33 |         pat = re.compile(r"\d+")
34 |         df["id_bv"] = df["id_brut_bv"].apply(lambda row : int("".join(re.findall(pat, row))))
35 |
36 |
37 |     print(f"LOAD data in memory: {len(df)} rows")
38 |
39 |
40 |     # ### The code below creates an (unofficial) identifier of bureau de vote.
We use it in this code mostly for display purposes
41 |
42 |     # add the unofficial "id_bv" field, used to identify bureaux de vote and to pick their display colors
43 |
44 |
45 |     #df_dep = df[df.dep_bv==DEP].sample(frac=RATIO, random_state=0)
46 |     os.makedirs("html/dep", exist_ok=True)
47 |     os.makedirs("html/bv", exist_ok=True)
48 |
49 |     if COMPUTE_BV_BORDERS:
50 |         for raw_id_bv in df.id_brut_bv.unique():
51 |             df_bv = df[df.id_brut_bv==raw_id_bv]
52 |             r_bv = display_addresses(addresses=df_bv, communes=communes_dep)
53 |             r_bv.to_html(f"html/bv/scatterplot_bv_{raw_id_bv}.html")
54 |
55 |             r_voronoi_bv = display_bureau_vote_shapes(addresses=df_bv, communes=communes_dep, mode="voronoi")
56 |             r_voronoi_bv.to_html(f"html/bv/voronoi_bv_{raw_id_bv}.html")
57 |
58 |
59 |     df_dep = df.sample(frac=RATIO, random_state=0)
60 |
61 |     print("Going to display addresses")
62 |     r = display_addresses(addresses=df_dep, communes=communes_dep)
63 |     r.to_html(f"html/dep/scatterplot_{DEP}_layer_ratio_{RATIO}.html")
64 |
65 |     r_voronoi = display_bureau_vote_shapes(addresses=df_dep, communes=communes_dep, mode="voronoi")
66 |     r_voronoi.to_html(f"html/dep/voronoi_{DEP}_layer_ratio_{RATIO}.html")
67 |
68 |
69 |
70 | -------------------------------------------------------------------------------- /generate_areas_geojson.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | import os
5 | import pandas as pd
6 | import numpy as np
7 | import geopandas as gpd
8 | from shapely import Polygon
9 | from geo import build_geojson_point, get_clipped_voronoi_shapes
10 | pd.set_option('display.max_columns', None)
11 |
12 | DEP_LIST = [
13 |     "0"+str(i) for i in range(1, 10)
14 | ]+[
15 |     str(i) for i in range(10, 20)
16 | ]+["2A", "2B"]+[
17 |     str(i) for i in range(21, 96)
18 | ] + [
19 |     str(i) for i in range(971, 977)
20 | ]
21 | commune_shapes_path = "./../communes-5m.geojson"
22 | communes_france = gpd.read_file(commune_shapes_path)
23 | communes_france = communes_france.rename(
24 |     {'code': 'insee'}, axis=1
25 | )[['insee', 'geometry']]
26 |
27 | for DEP in DEP_LIST:
28 |     print(DEP)
29 |     if f"voronoi_contours_{DEP}.geojson" not in os.listdir("geojson/"):
30 |         communes_dep = communes_france[communes_france.insee.str.startswith(str(DEP))]
31 |         codes2drop = ('13055', '75056', '69123')
32 |         communes_dep = communes_dep.loc[~(communes_dep['insee'].str.startswith(codes2drop))]
33 |
34 |         addresses_path = f"parquet/table_{DEP}.parquet"
35 |
36 |         addresses_df = pd.read_parquet(addresses_path)
37 |         # The lines below create an (unofficial) identifier of bureau de vote
38 |         # We use it in this code mostly for display purposes
39 |         addresses_df['id_bv'] = addresses_df['id_brut_bv']
40 |         addresses_df['commune_bv'] = addresses_df['code_commune_ref']
41 |
42 |         print(f"LOAD dep {DEP} in memory: {len(addresses_df)} rows")
43 |         geo_addresses = build_geojson_point(addresses_df)
44 |         hulls = get_clipped_voronoi_shapes(geo_addresses, communes_dep)
45 |         id_bvs = []
46 |         coordinates = []
47 |         # the block below just aims at formatting
48 |         # the coordinates into a list of [x, y]
49 |         exceptions = []
50 |         for _, row in hulls.iterrows():
51 |             id_bvs.append(row["id_bv"])
52 |             try:
53 |                 coord = Polygon(
54 |                     [
55 |                         list(x)
56 |                         for x in np.transpose(
57 |                             [
58 |                                 list(row["geometry"].exterior.coords.xy[0]),
59 |                                 list(row["geometry"].exterior.coords.xy[1]),
60 |                             ]
61 |                         )
62 |                     ]
63 |                 )
64 |                 coordinates.append(coord)
65 |             except Exception as e:
66 |                 exceptions.append({
67 |                     'error':
e, 68 | 'row': row 69 | }) 70 | coordinates.append([]) 71 | pass 72 | 73 | voronoi_polygons = gpd.GeoDataFrame( 74 | pd.DataFrame(data={"coordinates": coordinates, "id_bv": id_bvs}), 75 | geometry='coordinates' 76 | ) 77 | # handling overlaps 78 | for main_idx in voronoi_polygons.index: 79 | for side_idx in voronoi_polygons.index: 80 | if main_idx != side_idx: 81 | if voronoi_polygons.loc[main_idx, 'coordinates'].contains(voronoi_polygons.loc[side_idx, 'coordinates']): 82 | voronoi_polygons.loc[main_idx, 'coordinates'] = voronoi_polygons.loc[main_idx, 'coordinates'].difference(voronoi_polygons.loc[side_idx, 'coordinates']) 83 | # grouping polygons into multipolygons for each BdV 84 | voronoi_polygons = voronoi_polygons.dissolve('id_bv').reset_index(names='id_bv').reset_index(names='id') 85 | # int id as requested for downstream processes 86 | voronoi_polygons['id'] = voronoi_polygons['id'].astype(int) 87 | with open(f"geojson/voronoi_contours_{DEP}.geojson", 'w') as f: 88 | f.write(voronoi_polygons.to_json()) 89 | else: 90 | print("Already processed") 91 | -------------------------------------------------------------------------------- /geo.py: -------------------------------------------------------------------------------- 1 | """ 2 | Utils methods to geocode addresses, and to compute polygonal shapes around the addresses 3 | """ 4 | import pandas as pd 5 | import os 6 | import numpy as np 7 | import geopandas as gpd 8 | import pytess 9 | from typing import List 10 | from shapely.geometry import Polygon, Point 11 | from shapely import make_valid 12 | import requests 13 | 14 | 15 | def add_geoloc(df: pd.DataFrame) -> pd.DataFrame: 16 | """ 17 | Locally save the raw base of addresses and call the API-adresse to geocode them (in particular: add coordinates and found city) 18 | 19 | Args: 20 | df (pd.DataFrame): a file with columns "geo_adresse" ((street number +) street type + street name/locality name), "Commune" (commune name), "CP" (postcode) 21 | 22 | Returns: 23 | pd.DataFrame: a dataframe with the input columns, and also latitudes, longitudes, result_postcode, result_citycode, etc. 
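    NB: the geocoding is delegated to the api-adresse.data.gouv.fr CSV endpoint (see the request in the body below); the added `result_*` columns follow that API's CSV output format.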
24 |     """
25 |     df.to_csv("concat_adr_bv.csv", index=False)
26 |     # os.system(
27 |     #     "curl -X POST -F data=@concat_adr_bv.csv -F columns=adr_complete -F columns=Commune -F postcode=CP https://api-adresse.data.gouv.fr/search/csv/ > concat_adr_bv_geocoded.csv"
28 |     # )
29 |     f = open('concat_adr_bv.csv', 'rb')
30 |     files = {'data': ('concat_adr_bv', f)}
31 |     payload = {'columns': ['geo_adresse', 'Commune'], 'postcode': 'CP'}
32 |     r = requests.post('https://api-adresse.data.gouv.fr/search/csv/', files=files, data=payload, stream=True)
33 |     with open('concat_adr_bv_geocoded.csv', 'wb') as fd:
34 |         for chunk in r.iter_content(chunk_size=1024):
35 |             fd.write(chunk)
36 |
37 |     geocoded = pd.read_csv("concat_adr_bv_geocoded.csv", dtype=str)
38 |     geocoded["latitude"] = geocoded["latitude"].astype(float)
39 |     geocoded["longitude"] = geocoded["longitude"].astype(float)
40 |     geocoded["result_score"] = geocoded["result_score"].astype(float)
41 |     geocoded = geocoded[geocoded["result_label"].notna()]
42 |     return geocoded
43 |
44 |
45 | def build_geojson_point(addresses: pd.DataFrame) -> gpd.GeoDataFrame:
46 |     """
47 |     Turn the dataframe with coordinates into a GeoDataFrame containing a Point object for each address
48 |     NB: when there are several addresses at the same point, the function keeps only one sample
49 |
50 |     Args:
51 |         addresses (pd.DataFrame): a dataframe that has already been processed with API-adresse, and that also contains ids for the bureaux de vote (function `cleaner.prepare_ids`)
52 |     Returns:
53 |         gpd.GeoDataFrame: includes columns: "geometry" (shapely Point), "result_citycode" (as string), "label" (commune name, as string) and "id_bv" (unique id we impose per bureau de vote, int)
54 |     """
55 |
56 |     geojson = {"type": "FeatureCollection", "features": []}
57 |     if "result_label" in addresses.columns:
58 |         label_col = "result_label"
59 |     else:
60 |         label_col = "commune_bv"
61 |     if "result_citycode" in addresses.columns:
62 |         code_col = "result_citycode"
63 |     else:
64 |         code_col = "code_commune_ref"
65 |     for _, row in addresses.iterrows():
66 |         if row[label_col]:
67 |             props = {
68 |                 "label": row[label_col],
69 |                 "id_bv": row["id_bv"],
70 |                 "result_citycode": row[code_col],
71 |             }
72 |             geojson["features"].append(
73 |                 {
74 |                     "type": "Feature",
75 |                     "geometry": {
76 |                         "type": "Point",
77 |                         "coordinates": [
78 |                             float(row["longitude"]),
79 |                             float(row["latitude"]),
80 |                         ],
81 |                     },
82 |                     "properties": props,
83 |                 }
84 |             )
85 |     gdf = gpd.GeoDataFrame.from_features(geojson)
86 |     # IMPORTANT: when there are several addresses at the same point, keep only one sample
87 |     return gdf.drop_duplicates(subset=["geometry"])
88 |
89 |
90 | def build_geojson_multipoint(addresses: pd.DataFrame) -> gpd.GeoDataFrame:
91 |     """
92 |     Turn the dataframe with coordinates into a GeoDataFrame containing a MultiPoint (list of points) object for each bureau de vote
93 |
94 |     Args:
95 |         addresses (pd.DataFrame): a dataframe that has already been processed with API-adresse, and that also contains ids for the bureaux de vote (function `cleaner.prepare_ids`)
96 |     Returns:
97 |         gpd.GeoDataFrame: includes columns: "geometry" (shapely MultiPoint), "result_citycode" (as string) and "id_bv" (unique id we impose per bureau de vote, int)
98 |     """
99 |
100 |     geojson = {"type": "FeatureCollection", "features": []}
101 |     assert (
102 |         "id_bv" in addresses.columns
103 |     ), "There is no identifier for the 'bureaux de vote' in this dataframe"
104 |
105 |     def get_coordinates_list(data: pd.DataFrame) -> np.array:
106 |         return
np.array(data[["longitude", "latitude"]]).tolist()
107 |
108 |     for id_bv, data in addresses.groupby("id_bv"):
109 |         cp = data.result_citycode.min()
110 |
111 |         geojson["features"].append(
112 |             {
113 |                 "type": "Feature",
114 |                 "geometry": {
115 |                     "type": "MultiPoint",
116 |                     "coordinates": get_coordinates_list(data),
117 |                 },
118 |                 "properties": {"id_bv": id_bv, "result_citycode": cp},
119 |             }
120 |         )
121 |     gdf = gpd.GeoDataFrame.from_features(geojson)
122 |     return gdf
123 |
124 |
125 | def convex_hull(gdf: gpd.GeoDataFrame) -> gpd.GeoSeries:
126 |     """
127 |     Compute the convex hulls of the input geometries
128 |
129 |     Args:
130 |         gdf (gpd.GeoDataFrame): the geometries to compute hulls for
131 |
132 |     Returns:
133 |         gpd.GeoSeries: each row is a Polygon, a Point or a LineString
134 |     """
135 |     return gpd.GeoSeries(gdf.geometry).convex_hull
136 |
137 |
138 | def clip_to_communes(
139 |     gdf: gpd.GeoDataFrame, communes: gpd.GeoDataFrame
140 | ) -> gpd.GeoDataFrame:
141 |     """
142 |     Clip the polygons of the input geodataframe to the boundaries of the specified communes.
143 |
144 |     Args:
145 |         gdf (gpd.GeoDataFrame): must include columns "geometry" and "result_citycode"
146 |         communes (gpd.GeoDataFrame): must include a geometry column and a citycode column ("result_citycode" or "insee")
147 |
148 |     Returns:
149 |         gpd.GeoDataFrame: the input GeoDataFrame, clipped to the input `communes` shapes
150 |     """
151 |     gdf_copy = gdf.copy()
152 |     multipolygons_communes_dict = {}
153 |     multipolygons_communes_list = list()
154 |     if "result_citycode" in communes.columns:
155 |         code_col = "result_citycode"
156 |     else:
157 |         code_col = "insee"
158 |     # precompute the MultiPolygon of each commune that is relevant for our input geodataframe
159 |     for cp in np.intersect1d(
160 |         communes[code_col].unique(), gdf["result_citycode"].unique()
161 |     ):
162 |         multipolygons_communes_dict[cp] = communes[
163 |             communes[code_col] == cp
164 |         ].geometry.unary_union
165 |     # align the precomputed MultiPolygons with the input GeoDataFrame `gdf`
166 |     for _, row in gdf_copy.iterrows():
167 |         cp = row["result_citycode"]
168 |         multipolygons_communes_list.append(multipolygons_communes_dict[cp])
169 |     to_intersect = gpd.GeoSeries(multipolygons_communes_list)
170 |     try:
171 |         gdf_copy.geometry = gdf_copy.geometry.intersection(
172 |             to_intersect, align=False
173 |         )
174 |     except:
175 |         # handling self-intersection cases
176 |         for k in range(len(gdf_copy)):
177 |             try:
178 |                 # for rows where the intersection works fine
179 |                 gdf_copy.loc[k:k,'geometry'] = gdf_copy.loc[k:k,'geometry'].intersection(
180 |                     to_intersect.loc[k:k],
181 |                     align=False
182 |                 )
183 |             except:
184 |                 # removing points that are too close together to resolve the Polygon
185 |                 try:
186 |                     gdf_copy.loc[k:k,'geometry'] = gpd.GeoSeries(
187 |                         gdf_copy.loc[k:k,'geometry'].values[0].simplify(tolerance=1)
188 |                     ).intersection(
189 |                         gpd.GeoSeries(to_intersect.loc[k:k].values[0].simplify(tolerance=1)),
190 |                         align=False
191 |                     )
192 |                 except:
193 |                     # use make_valid to restore the geometry
194 |                     gdf_copy.loc[k:k,'geometry'] = gpd.GeoSeries(
195 |                         make_valid(gdf_copy.loc[k:k,'geometry'].values[0].simplify(tolerance=1))
196 |                     ).intersection(
197 |                         gpd.GeoSeries(make_valid(to_intersect.loc[k:k].values[0].simplify(tolerance=1))),
198 |                         align=False
199 |                     )
200 |     return gdf_copy
201 |
202 |
203 | def polygon_union(
204 |     gdf: gpd.GeoDataFrame,
205 |     pivot_column: str = "id_bv",
206 |     columns: List[str] = ["result_citycode"],
207 | ) -> gpd.GeoDataFrame:
208 |     """
209 |     Assuming the geometry of the input GeoDataFrame consists of
polygons, make the union of these polygons given a pivot column.
210 |     Some columns of the input GeoDataFrame can be kept in the output, under the assumption that:
211 |     (i) for a given pivot value, and a given column of "columns", the value of the column on this pivot value stays constant
212 |
213 |     Args:
214 |         gdf (gpd.GeoDataFrame): must contain the column `pivot_column` and the ancillary columns `columns`
215 |         pivot_column (str): the column that must be used as pivot. Defaults to "id_bv".
216 |         columns (List[str], optional): The list of other columns (not `pivot_column` nor "geometry") to keep in the output. Defaults to ["result_citycode"].
217 |
218 |     Returns:
219 |         gpd.GeoDataFrame: consists of the geometry of the merged polygons (Polygon or MultiPolygon), `pivot_column` and the ancillary columns `columns`
220 |     """
221 |     geometries = list()
222 |     # "data" consists of the properties of the output GeoDataFrame
223 |     data = {pivot_column: []}
224 |     for column in columns:
225 |         data[column] = list()
226 |
227 |     for pivot in gdf[pivot_column].unique():
228 |         # WARNING: the 2 lines below assume that, for a given pivot value and a given column of "columns", the value of the column on this pivot value stays constant
229 |         # (in particular, this holds for the column "result_citycode" when the union is done on "id_bv")
230 |         for column in columns:
231 |             val = gdf[gdf[pivot_column] == pivot][column].min()
232 |             data[column].append(val)
233 |         s = gdf[gdf[pivot_column] == pivot].geometry
234 |         geometries.append(s.unary_union)
235 |         data[pivot_column].append(pivot)
236 |     return gpd.GeoDataFrame(geometry=geometries, data=data)
237 |
238 |
239 | def get_clipped_voronoi_shapes(
240 |     gdf: gpd.GeoDataFrame, communes: gpd.GeoDataFrame = gpd.GeoDataFrame()
241 | ) -> gpd.GeoDataFrame:
242 |     """
243 |     Compute Voronoi cells, clip them to the shapes of the communes, and merge the clipped cells that share the same "id_bv"
244 |
245 |     Args:
246 |         gdf (gpd.GeoDataFrame): must include "geometry", "result_citycode" (string) and "id_bv" (unique id we determine for each bureau de vote, int)
247 |         communes (gpd.GeoDataFrame, optional): the shapes of the communes. Defaults to gpd.GeoDataFrame().
248 |
249 |     Returns:
250 |         gpd.GeoDataFrame: the merged, clipped Voronoi cells, one or several rows per "id_bv"
251 |     """
252 |     hulls = voronoi_hull(gdf, communes)
253 |     if len(communes):
254 |         hulls = clip_to_communes(hulls, communes)
255 |     return connected_components_polygon_union(hulls)
256 |
257 |
258 | def connected_components_polygon_union(
259 |     gdf: gpd.GeoDataFrame,
260 |     pivot_column: str = "id_bv",
261 |     columns: List[str] = ["result_citycode"],
262 | ) -> gpd.GeoDataFrame:
263 |     """
264 |     Assuming the geometry of the input GeoDataFrame consists of polygons, return the connected components of the union of these polygons given a pivot column.
265 |     Some columns of the input GeoDataFrame can be kept in the output, under the assumption that:
266 |     (i) for a given pivot value, and a given column of "columns", the value of the column on this pivot value stays constant
267 |
268 |     Args:
269 |         gdf (gpd.GeoDataFrame): must contain the column `pivot_column` and the ancillary columns `columns`
270 |         pivot_column (str): the column that must be used as pivot. Defaults to "id_bv".
271 |         columns (List[str], optional): The list of other columns (not `pivot_column` nor "geometry") to keep in the output. Defaults to ["result_citycode"].
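    Example (hypothetical): if the clipped cells of one bureau form a MultiPolygon of two disjoint parts, the output contains two rows sharing that "id_bv", each carrying one Polygon.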
272 |
273 |     Returns:
274 |         gpd.GeoDataFrame: consists of the geometry of the merged connected components (which are necessarily Polygons), `pivot_column` and the ancillary columns `columns`
275 |     """
276 |     geometries = list()
277 |     # "data" consists of the properties of the output GeoDataFrame
278 |     data = {pivot_column: []}
279 |     for column in columns:
280 |         data[column] = list()
281 |
282 |     def save_columns_values(pivot):
283 |         data[pivot_column].append(pivot)
284 |         for column in columns:
285 |             val = gdf[gdf[pivot_column] == pivot][column].min()
286 |             data[column].append(val)
287 |
288 |     for pivot in gdf[pivot_column].unique():
289 |         # WARNING: the lines below assume that, for a given pivot value and a given column of "columns", the value of the column on this pivot value stays constant
290 |         # (in particular, this holds for the column "result_citycode" when the union is done on "id_bv")
291 |         s = gdf[
292 |             gdf[pivot_column] == pivot
293 |         ].geometry  # normally these shapes are Polygons, but they could be Points if there is only one geocoded voter in a bureau de vote
294 |         if len(s) == 1 and s.iloc[0].geom_type == "Point":
295 |             geometries.append(s)
296 |             save_columns_values(pivot)
297 |         else:
298 |             merged_shape = s.unary_union
299 |             if merged_shape is not None:
300 |                 if merged_shape.geom_type == "Polygon":
301 |                     geometries.append(merged_shape)
302 |                     save_columns_values(pivot)
303 |
304 |                 elif merged_shape.geom_type == "MultiPolygon":
305 |                     for _, row in (
306 |                         gpd.GeoDataFrame(geometry=[merged_shape])
307 |                         .explode(index_parts=False)
308 |                         .iterrows()
309 |                     ):
310 |                         geometries.append(row["geometry"])
311 |                         save_columns_values(pivot)
312 |     return gpd.GeoDataFrame(geometry=geometries, data=data)
313 |
314 |
315 | def voronoi_hull(gdf: gpd.GeoDataFrame, communes: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
316 |     """
317 |     Compute Voronoi cells around each of the input addresses, within an arbitrarily large bounding box (hence it is useful to clip the cells afterwards to limits relevant to our use cases).
318 |     It is based on the Voronoi method implemented in the pytess library
319 |
320 |     Args:
321 |         gdf (gpd.GeoDataFrame): must include "geometry", "result_citycode" (string) and "id_bv" (unique id we determine for each bureau de vote, int)
322 |         communes (gpd.GeoDataFrame): the shapes of the communes, used as fallback contours for communes with zero or one bureau de vote
323 |     Returns:
324 |         gpd.GeoDataFrame: includes "geometry", "result_citycode" and "id_bv"
325 |     """
326 |     assert (
327 |         "id_bv" in gdf.columns and "result_citycode" in gdf.columns
328 |     ), "Some necessary columns are missing"
329 |     gdf_copy = gdf.copy()
330 |
331 |     id_bvs, citycodes = [], []
332 |     polygons = []
333 |     gdf_copy.drop_duplicates(
334 |         subset=["geometry"], inplace=True
335 |     )  # delete duplicates of geolocated points
336 |     # make sure every commune is covered: some communes have no address in the input at all
337 |     for citycode in set(gdf_copy.result_citycode.unique()) | set(communes.insee.unique()):
338 |         gdf_city = gdf_copy[gdf_copy.result_citycode == citycode]
339 |         # rare case: no voter address at all in the commune
340 |         if len(gdf_city) == 0:
341 |             id_bvs.append(citycode+'_X')
342 |             citycodes.append(citycode)
343 |             polygons.append(communes.loc[communes['insee']==citycode, 'geometry'].values[0])
344 |         # a single BdV in the commune: its contour will be the commune's own contour
345 |         elif gdf_city['id_bv'].nunique() == 1:
346 |             id_bvs.append(gdf_city['id_bv'].values[0])
347 |             citycodes.append(citycode)
348 |             polygons.append(communes.loc[communes['insee']==citycode, 'geometry'].values[0])
349 |         # general case
350 |         elif len(gdf_city) >= 3:
351 | 
points_city, id_bvs_city = [], []
352 |             for k in gdf_city.index:
353 |                 try:
354 |                     points_city.append(
355 |                         (
356 |                             gdf_city.geometry[k].coords.xy[0][0],
357 |                             gdf_city.geometry[k].coords.xy[1][0],
358 |                         )
359 |                     )
360 |                     id_bvs_city.append(gdf_city.id_bv[k])
361 |                 except:
362 |                     pass
363 |
364 |             # the condition "if k" excludes the corners of the bounding box from the pytess.voronoi output
365 |             # the size of 'buffer_percent' defines the size of the virtual bounding box we compute the Voronoi diagram in
366 |             # pytess.voronoi returns a list of 2-tuples, with the first item in each tuple being the original input point (or None for each corner of the bounding box buffer), and the second item being the point's corresponding Voronoi polygon.
367 |
368 |             voronoi_city_dict = {
369 |                 k: v for (k, v) in pytess.voronoi(points_city, buffer_percent=1000) if k
370 |             }
371 |             polygons_city = []
372 |             if (
373 |                 type(points_city) == list
374 |             ):  # this list is supposed to be like [(lon, lat), (lon, lat), (lon, lat), ...]
375 |                 for point in points_city:
376 |                     try:
377 |                         polygons_city.append(Polygon(voronoi_city_dict[point]))
378 |                     except:
379 |                         polygons_city.append(None)
380 |                 id_bvs.extend(id_bvs_city)
381 |                 citycodes.extend([citycode] * len(id_bvs_city))
382 |                 polygons.extend(polygons_city)
383 |
384 |         # handling one known case: two points in one commune (due to bad geocoding), from two different BdV
385 |         # creating big triangles along the bisector of the two points, which will later be cropped to the commune's contours
386 |         elif len(gdf_city) == 2:
387 |             size = 10e6
388 |             middle_point = Point(
389 |                 (gdf_city['geometry'].values[0].coords.xy[0][0] + gdf_city['geometry'].values[1].coords.xy[0][0])/2,
390 |                 (gdf_city['geometry'].values[0].coords.xy[1][0] + gdf_city['geometry'].values[1].coords.xy[1][0])/2
391 |             )
392 |             for k in range(2):
393 |                 point2middle_vector = [
394 |                     middle_point.coords.xy[0][0] - gdf_city['geometry'].values[k].coords.xy[0][0],
395 |                     middle_point.coords.xy[1][0] - gdf_city['geometry'].values[k].coords.xy[1][0]
396 |                 ]
397 |                 orthogonal_vector = [
398 |                     -point2middle_vector[1],
399 |                     point2middle_vector[0]
400 |                 ]
401 |                 across_point = Point(
402 |                     gdf_city['geometry'].values[k].coords.xy[0][0] +
403 |                     size*point2middle_vector[0],
404 |                     gdf_city['geometry'].values[k].coords.xy[1][0] +
405 |                     size*point2middle_vector[1]
406 |                 )
407 |                 other_point1 = Point(
408 |                     middle_point.coords.xy[0][0] +
409 |                     size*orthogonal_vector[0],
410 |                     middle_point.coords.xy[1][0] +
411 |                     size*orthogonal_vector[1],
412 |                 )
413 |                 other_point2 = Point(
414 |                     middle_point.coords.xy[0][0] -
415 |                     size*orthogonal_vector[0],
416 |                     middle_point.coords.xy[1][0] -
417 |                     size*orthogonal_vector[1],
418 |                 )
419 |                 id_bvs.append(gdf_city['id_bv'].values[k])
420 |                 citycodes.append(citycode)
421 |                 polygons.append(Polygon([across_point, other_point1, other_point2]))
422 |
423 |     return gpd.GeoDataFrame(
424 |         geometry=polygons, data={"id_bv": id_bvs, "result_citycode": citycodes}
425 |     )
426 | -------------------------------------------------------------------------------- /license.md: --------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Etalab
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
--------------------------------------------------------------------------------
/license.md:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2022 Etalab
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | from cleaner import (
2 |     clean_dataset,
3 |     clean_failed_geocoding,
4 |     clean_geocoded_types,
5 |     prepare_ids
6 | )
7 | from display import (
8 |     display_addresses,
9 |     display_bureau_vote_shapes
10 | )
11 | import pandas as pd
12 | from geo import (
13 |     add_geoloc
14 | )
15 | import geopandas as gpd
16 | import pydeck as pdk
17 | import sys
18 | 
19 | if __name__ == '__main__':
20 |     df = pd.read_csv(sys.argv[1], sep=";", dtype=str)
21 |     print('### Dataset Loaded!')
22 |     df = clean_dataset(df)
23 |     # the intermediate *_clean columns are no longer needed (names preceded by a "chez" have been removed)
24 |     df.drop(columns=['libelle_voie_clean', 'comp_adr_1_clean', 'comp_adr_2_clean', 'lieu-dit-clean'], inplace=True)
25 |     print('### Dataset Cleaned!')
26 |     # comment out the next line to skip the geocoding step (it takes a few minutes to run)
27 |     geocoded_df = add_geoloc(df=df)
28 |     print('### Dataset geocoded!')
29 |     geocoded_df = pd.read_csv("concat_adr_bv_geocoded.csv", dtype=str)  # reload the geocoded file from disk, forcing string dtypes
30 |     # Clean the geocoded dataframe
31 |     geocoded_df = clean_geocoded_types(geocoded_df)
32 |     geocoded_df = clean_failed_geocoding(geocoded_df)
33 |     geocoded_df = prepare_ids(geocoded_df)
34 |     # IMPORTANT: when two addresses share the same lat-lon position, keep only one
35 |     geocoded_df = geocoded_df.drop_duplicates(subset=["latitude", "longitude"])
36 |     print('### Geocoded dataset Cleaned!')
37 |     # Load the commune shapes
38 |     communes_france = gpd.read_file("communes-20220101.shp")[["geometry", "insee"]].dropna().\
39 |         rename(columns={"insee": "result_citycode"})
40 |     communes_france["result_citycode"] = communes_france["result_citycode"].apply(lambda row: row.split(".")[0] if "." in row else row)  # strip float artefacts ("09122.0" -> "09122")
41 |     communes_ariege = communes_france[communes_france.result_citycode.str.startswith("09")]  # "09" = Ariège
42 |     del communes_france
43 |     print('### Shapes communes loaded!')
44 |     # Cartography with one colour per bureau de vote
45 |     r = display_addresses(addresses=geocoded_df, communes=communes_ariege)
46 |     r.to_html("scatterplot_layer.html")
47 |     print('### Page 1 HTML generated!')
48 |     # Save a GeoJSON (with 1 Point per voter address)
49 |     # geojson = geo.build_geojson_point(geocoded_df)
50 |     # geojson.to_file("bv_point.geojson", driver="GeoJSON")
51 |     # Display convex hulls
52 |     # r_hulls = display.display_bureau_vote_shapes(addresses=geocoded_df, communes=communes_ariege, mode="convex")
53 |     # r_hulls.to_html("hull_layer.html")
54 |     # Display the Voronoi tessellation
55 |     r_voronoi = display_bureau_vote_shapes(addresses=geocoded_df, communes=communes_ariege, mode="voronoi")
56 |     r_voronoi.to_html("voronoi_layer.html")
57 |     print('### Page 2 HTML generated!')
58 | 
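`main.py` is hard-wired to the Ariège through the `startswith("09")` filter above. A small, hypothetical parametrisation, mirroring the `DEP` constant that `main_atelier.py` below uses, would let the same pipeline run on any department (the CLI argument position is an assumption, not part of the repo):

```
import sys

import geopandas as gpd

# Hypothetical: take the department code as a second CLI argument, default to the Ariège.
DEP = sys.argv[2] if len(sys.argv) > 2 else "09"

communes_france = (
    gpd.read_file("communes-20220101.shp")[["geometry", "insee"]]
    .dropna()
    .rename(columns={"insee": "result_citycode"})
)
communes_dep = communes_france[communes_france.result_citycode.str.startswith(DEP)]
```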
--------------------------------------------------------------------------------
/main_atelier.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 | 
4 | import os
5 | import pandas as pd
6 | import geopandas as gpd
7 | from display import *
8 | import re
9 | 
10 | # paths of the address file and of the commune shapes
11 | addresses_path = "extrait_fichier_adresses_REU.parquet"
12 | commune_shapes_path = "communes-20220101.shp"
13 | 
14 | # choose an example department
15 | DEP = "83"
16 | # for this department, choose the ratio of addresses you want to plot
17 | RATIO = 0.1  # 0 <= RATIO <= 1
18 | 
19 | # ## Loading the address file, and a file with the shape of communes.
20 | # ##### Warning: these files are heavy
21 | 
22 | df = pd.read_parquet(addresses_path)
23 | communes_france = gpd.read_file(commune_shapes_path)[["geometry", "insee"]].dropna()
24 | 
25 | 
26 | # ### The code below creates an (unofficial) identifier of bureau de vote. We use it in this code mostly for display purposes
27 | 
28 | 
29 | def prepare_ids(df: pd.DataFrame) -> pd.DataFrame:
30 |     """
31 |     Prepare an unofficial `id_bv` (integer) column, under the assumption that there are fewer than 10000 bureaux de vote per city
32 | 
33 |     Args:
34 |         df (pd.DataFrame): a dataframe including columns "code_bv" and "code_commune_ref"
35 | 
36 |     Returns:
37 |         pd.DataFrame: a dataframe similar to the input, with a supplementary column "id_bv" (integers) unique for every bureau de vote
38 |     """
39 |     assert ("code_bv" in df.columns) and (
40 |         "code_commune_ref" in df.columns
41 |     ), "There are no identifiers for bureaux de vote"
42 |     df_copy = df.copy()
43 | 
44 |     def prepare_id_bv(row):
45 |         """
46 |         Combine the unique id of a city (citycode) and the number of the bureau de vote inside the city to compute a nationwide id of the bureau de vote
47 | 
48 |         Args:
49 |             row (pd.Series): a row with the fields "code_bv" and "code_commune_ref"
50 | 
51 |         Returns:
52 |             id_bv: integer serving as unique id of a bureau de vote
53 |         """
54 |         max_bv_per_city = 10000  # assuming there are always fewer than this many bv in a city; this guarantees the uniqueness of id_bv
55 |         max_code_commune = 10**5
56 |         try:
57 |             code_bv = int(row["code_bv"])
58 |         except (TypeError, ValueError):
59 |             # keep the first number found in the string (if any) as code_bv
60 |             found = re.search(r"\d+", row["code_bv"])
61 |             if found:
62 |                 code_bv = int(found.group())
63 |             else:
64 |                 code_bv = max_bv_per_city  # this sentinel flags parsing errors without raising an exception
65 |         try:
66 |             code_commune = int(row["code_commune_ref"])
67 |         except (TypeError, ValueError):
68 |             found = re.search(r"\d+", row["code_commune_ref"])
69 |             if found:
70 |                 code_commune = int(found.group())
71 |             else:
72 |                 code_commune = max_code_commune
73 |         return max_bv_per_city * code_commune + code_bv
74 | 
75 |     df_copy["id_bv"] = df_copy.apply(prepare_id_bv, axis=1)
76 |     return df_copy
77 | 
78 | 
79 | # add this unofficial "id_bv" field, used to tell bureaux apart and to choose their display colours
80 | df_prepared = prepare_ids(df)
81 | 
82 | communes_dep = communes_france[communes_france.insee.str.startswith(str(DEP))]
83 | 
84 | df_dep = df_prepared[df_prepared.dep_bv == DEP].sample(frac=RATIO, random_state=0)
85 | 
86 | 
87 | r = display_addresses(addresses=df_dep, communes=communes_dep)
88 | r.to_html(f"scatterplot_{DEP}_layer_ratio_{RATIO}.html")
89 | 
90 | r_voronoi = display_bureau_vote_shapes(addresses=df_dep, communes=communes_dep, mode="voronoi")
91 | r_voronoi.to_html(f"voronoi_{DEP}_layer_ratio_{RATIO}.html")
92 | 
93 | 
94 | 
95 | 
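To make the `id_bv` construction above concrete, here is a short worked example with made-up codes; the arithmetic is exactly that of `prepare_id_bv`:

```
max_bv_per_city = 10000

code_commune = int("09122")  # an INSEE commune code -> 9122
code_bv = int("3")           # a bureau number inside that commune -> 3

id_bv = max_bv_per_city * code_commune + code_bv
print(id_bv)  # 91220003: the last four digits encode the bureau, the rest the commune
```

Uniqueness holds as long as `code_bv` stays below 10000, which is the stated assumption; the value 10000 itself only ever appears as the parsing-error sentinel.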
--------------------------------------------------------------------------------
/renovate.json:
--------------------------------------------------------------------------------
1 | {
2 |   "$schema": "https://docs.renovatebot.com/renovate-schema.json",
3 |   "extends": [
4 |     "config:base"
5 |   ]
6 | }
7 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | geopandas==0.12.0
2 | pygeos==0.13
3 | numpy==1.22.3
4 | pandas==1.5.0
5 | pydeck==0.7.1
6 | Pytess==1.0.0
7 | requests==2.28.1
8 | pyarrow==10.0.1
9 | 
--------------------------------------------------------------------------------
/starting_kit_atelier.R:
--------------------------------------------------------------------------------
1 | #############################################################################
2 | #               Workshop on the REU addresses: starting kit                #
3 | #############################################################################
4 | 
5 | ################################
6 | # Imports
7 | ################################
8 | 
9 | ##### Packages
10 | 
11 | library(arrow)
12 | library(dplyr)
13 | library(data.table)
14 | library(magrittr)
15 | library(sf)
16 | library(ggplot2)
17 | library(viridis)
18 | 
19 | ##### Data
20 | 
21 | extrait_adressesREU <- arrow::read_parquet(
22 |   "extrait_fichier_adresses_REU.parquet"
23 | ) %>% setDT()
24 | 
25 | ################################
26 | # A few manipulations
27 | ################################
28 | 
29 | ##### Select a sample of the file
30 | 
31 | sample_REU <- extrait_adressesREU[sample(.N, 5e5)]
32 | 
33 | ##### Convert the Lambert coordinates from Geoloc to GPS
34 | 
35 | adressesREU_geoloc <- copy(extrait_adressesREU) %>%
36 |   select(X, Y) %>%
37 |   st_as_sf(
38 |     coords = c("X", "Y"),
39 |     crs = 2154,
40 |     na.fail = FALSE
41 |   ) %>%
42 |   st_transform(crs = 4326)
43 | 
44 | ##### Convert the BAN coordinates to GPS
45 | 
46 | adressesREU_BAN <- copy(extrait_adressesREU) %>%
47 |   select(latitude, longitude) %>%
48 |   st_as_sf(
49 |     coords = c("longitude", "latitude"),
50 |     crs = 4326,
51 |     na.fail = FALSE
52 |   )
53 | 
54 | ################################
55 | # Descriptive statistics and new fields
56 | ################################
57 | 
58 | ##### Inspect the quantiles of the BAN relevance score
59 | 
60 | quantiles_geo_score <- quantile(extrait_adressesREU$geo_score, seq(0, 1, 0.2),
61 |                                 na.rm = TRUE)
62 | 
63 | 
64 | ##### Generate intervals for the BAN quality score
65 | 
66 | extrait_adressesREU[, `:=`(categorie_geo_score = cut(
67 |   geo_score, 5, ordered_result = TRUE))]
68 | 
69 | ##### Generate more explicit quality labels for Geoloc
70 | 
71 | extrait_adressesREU[, `:=`(
72 |   label_QUALITE_XY = fcase(
73 |     QUALITE_XY == 11, "Voie Sûre, Numéro trouvé",
74 |     QUALITE_XY == 12, "Voie Sûre, Position aléatoire dans la voie",
75 |     QUALITE_XY == 21, "Voie probable, Numéro trouvé",
76 |     QUALITE_XY == 22, "Voie probable, Position aléatoire dans la voie",
77 |     QUALITE_XY == 33, "Voie inconnue, Position aléatoire dans la commune"
78 |   ) %>%
79 |     factor(
80 |       levels = c(
81 |         "Voie Sûre, Numéro trouvé",
82 |         "Voie probable, Numéro trouvé",
83 |         "Voie Sûre, Position aléatoire dans la voie",
84 |         "Voie probable, Position aléatoire dans la voie",
85 |         "Voie inconnue, Position aléatoire dans la commune"
86 |       ),
87 |       ordered = TRUE
88 |     )
89 | )
90 | ]
91 | 
92 | ##### Compute the distances between the positions returned by BAN and by Geoloc
93 | 
94 | extrait_adressesREU[, `:=`(
95 |   distance = st_distance(
96 |     x = adressesREU_geoloc,
97 |     y = adressesREU_BAN,
98 |     by_element = TRUE
99 |   )
100 | )]
101 | 
102 | ################################
103 | # Visualising the BAN / Geoloc differences
104 | ################################
105 | 
106 | ##### Compute the share of addresses for which the two sources return
107 | ##### locations within 100m, 200m, ... of each other,
108 | ##### broken down by the quality indicators
109 | 
110 | prop_normalisations_proches <- extrait_adressesREU[, .(
111 |   nb_adresses = .N,
112 |   # part_10moins = mean(distance <= units::set_units(10, m), na.rm = TRUE),
113 |   # part_20moins = mean(distance <= units::set_units(20, m), na.rm = TRUE),
114 |   # part_50moins = mean(distance <= units::set_units(50, m), na.rm = TRUE),
115 |   part_100moins = mean(distance <= units::set_units(100, m), na.rm = TRUE),
116 |   part_200moins = mean(distance <= units::set_units(200, m), na.rm = TRUE)
117 |   # part_500moins = mean(distance <= units::set_units(500, m), na.rm = TRUE),
118 |   # part_1000moins = mean(distance <= units::set_units(1000, m), na.rm = TRUE)
119 | ), by = .(label_QUALITE_XY, QUALITE_XY, categorie_geo_score)][
120 |   order(QUALITE_XY, categorie_geo_score)]
121 | 
122 | ##### Plot the proportions computed above
123 | 
124 | ggplot(prop_normalisations_proches[!is.na(QUALITE_XY) & !is.na(categorie_geo_score)]) +
125 |   geom_bar(
126 |     aes(
127 |       x = categorie_geo_score, y = part_100moins, fill = label_QUALITE_XY
128 |     ), position = "dodge", stat = "identity"
129 |   ) +
130 |   labs(
131 |     x = "Score de qualité BAN",
132 |     y = "Proportion de distance <100m",
133 |     fill = "Qualité de Geoloc"
134 |   ) +
135 |   scale_fill_viridis_d() +
136 |   scale_y_continuous(labels = scales::percent_format()) +
137 |   theme(legend.position = "bottom") +
138 |   guides(
139 |     fill = guide_legend(
140 |       title.hjust = 0.5,
141 |       title.position = "top",
142 |       nrow = 3
143 |     )
144 |   )
145 | 
146 | ################################
147 | # The contours
148 | ################################
149 | 
150 | 
--------------------------------------------------------------------------------
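The starting kit stops just before the contours themselves. As a pointer, here is a minimal geopandas sketch (in Python, like the rest of the repo) of the cropping step the Voronoi code above alludes to ("cropped later to the commune's contours"). The file name `voronoi_cells.geojson` and its columns are assumptions for illustration, not artefacts produced by this repo:

```
import geopandas as gpd

# Assumed inputs: Voronoi cells with columns ["geometry", "id_bv", "result_citycode"],
# plus the commune contours prepared as in main.py, both in EPSG:4326.
cells = gpd.read_file("voronoi_cells.geojson")
communes = (
    gpd.read_file("communes-20220101.shp")[["geometry", "insee"]]
    .dropna()
    .rename(columns={"insee": "result_citycode"})
)

# Pair every cell with the contour of its own commune, then intersect the two
# geometries so each cell (or bisector triangle) is clipped to the commune.
merged = cells.merge(communes, on="result_citycode", suffixes=("", "_commune"))
merged["geometry"] = [
    cell.intersection(contour)
    for cell, contour in zip(merged["geometry"], merged["geometry_commune"])
]
cropped = gpd.GeoDataFrame(
    merged.drop(columns="geometry_commune"), geometry="geometry", crs=cells.crs
)
cropped.to_file("voronoi_cells_cropped.geojson", driver="GeoJSON")
```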