├── .gitignore ├── Creation_de_contours_a_partir_du_REU.ipynb ├── README.md ├── atelier.ipynb ├── bureaux_de_vote.html ├── cleaner.py ├── decoupage_parquet.py ├── display.py ├── generate_areas.py ├── generate_areas_geojson.py ├── geo.py ├── license.md ├── main.py ├── main_atelier.py ├── renovate.json ├── requirements.txt └── starting_kit_atelier.R /.gitignore: --------------------------------------------------------------------------------
1 | # Jupyter Notebook
2 | .ipynb_checkpoints
3 |
4 | # IPython
5 | profile_default/
6 | ipython_config.py
7 |
8 | # pyenv
9 | .python-version
10 |
11 | communes-*
12 | *.html
13 | *.csv
14 |
15 | __pycache__/ -------------------------------------------------------------------------------- /README.md: --------------------------------------------------------------------------------
1 | # bureau-vote
2 |
3 | This repository contains the joint work of the Etalab and data.gouv.fr teams, in close collaboration with INSEE, on the répertoire électoral unique (REU, the single electoral register). The goal of this work was to start from the [REU data](https://www.data.gouv.fr/fr/datasets/bureaux-de-vote-et-adresses-de-leurs-electeurs/) (addresses across France with their assigned bureau de vote) and derive contours for the bureaux de vote (polling stations) of France. Such data will make it possible in the future - as well as for all elections whose data are already public - to display election results at the finest granularity there is: that of the bureaux de vote.
4 |
5 | The chosen method is that of [Voronoi diagrams](https://fr.wikipedia.org/wiki/Diagramme_de_Vorono%C3%AF), which partition a plane containing points of interest (called seeds) into as many zones around those seeds, so that each zone encloses exactly one seed and consists of all the points closer to that seed than to any other. Other methods are possible, as are other choices within this method: the contours are not unique.
6 |
7 | ## Building the contours
8 |
9 | The Python notebook ``Creation_de_contours_a_partir_du_REU.ipynb`` contains all the information needed to regenerate the contours exactly as we published them. The prerequisites are:
10 | - ``python`` and ``jupyter notebook`` installed
11 | - all the packages listed in the `requirements.txt` file
12 |
13 | Then simply run the notebook from top to bottom to obtain the contours the same way we generated them. All the functions used are in this repo and can be improved: contributions are welcome!
14 |
15 | ## Prior work
16 |
17 | This repository also includes Python code to clean and geocode an extract (the Ariège département) of the raw address format of the Répertoire Electoral Unique, as well as code to display, on a base map, the publication standard retained by INSEE [a link to the documentation will be added here later].
18 |
19 | It is one of the working repositories for the open data publication of the addresses of the Répertoire Electoral Unique, and is not meant to be maintained once the file has been released.
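To make the Voronoi method described above concrete in isolation, here is a minimal, hypothetical sketch using `scipy.spatial.Voronoi`; note that the pipeline in `geo.py` actually relies on the `pytess` library, and the seed coordinates below are made up for illustration:

```
# Minimal Voronoi sketch - assumes scipy is available; geo.py uses pytess instead.
import numpy as np
from scipy.spatial import Voronoi

rng = np.random.default_rng(0)
# Hypothetical seeds: lon/lat of fictitious polling stations.
seeds = rng.uniform(low=[1.55, 42.95], high=[1.65, 43.00], size=(20, 2))

vor = Voronoi(seeds)
for point_idx, region_idx in enumerate(vor.point_region):
    region = vor.regions[region_idx]
    if region and -1 not in region:  # keep finite cells only
        cell = [tuple(vor.vertices[v]) for v in region]
        print(f"seed {seeds[point_idx]} -> cell with {len(cell)} vertices")
# Cells on the convex hull are unbounded (marked with -1), which is why the
# real pipeline clips every cell to commune boundaries afterwards.
```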
20 |
21 | ### Displaying the already-geocoded address file on a base map, for any département
22 |
23 | Put the source data files at the root of the repository, adapt the code if needed by setting both the path of the address file and the path of the commune contour file (in our case, communes-20220101.shp), create a Python 3.10 virtual environment (not required, but recommended), then run:
24 |
25 | ```
26 | python3.10 -m pip install -r requirements.txt
27 | python3.10 main_atelier.py
28 | ```
29 |
30 | ### Cleaning, geocoding and displaying the address file, plus unofficial contour experiments, for the Ariège département
31 |
32 | #### Required data
33 |
34 | - Get the source data
35 | - Get the commune contour data ([file used here](https://www.data.gouv.fr/fr/datasets/decoupage-administratif-communal-francais-issu-d-openstreetmap/))
36 |
37 | #### Deployment
38 |
39 | Put these data files at the root of the repository, adapt the code if needed by setting the path of the commune contour file (in our case, communes-20220101.shp), create a Python 3.10 virtual environment (not required, but recommended), then run:
40 |
41 | ```
42 | python3.10 -m pip install -r requirements.txt
43 | python3.10 main.py
44 | ``` -------------------------------------------------------------------------------- /atelier.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "id": "8cf893a0",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "%load_ext autoreload\n",
11 | "%autoreload 2"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": null,
17 | "id": "bb62bea5",
18 | "metadata": {},
19 | "outputs": [],
20 | "source": [
21 | "#!python3.10 -m pip install pyarrow"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": null,
27 | "id": "b5622edd",
28 | "metadata": {},
29 | "outputs": [],
30 | "source": [
31 | "import os\n",
32 | "import pandas as pd\n",
33 | "import geopandas as gpd\n",
34 | "from display import *\n",
35 | "import re"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": null,
41 | "id": "2a1d507f",
42 | "metadata": {
43 | "scrolled": true
44 | },
45 | "outputs": [],
46 | "source": [
47 | "# path of the address file\n",
48 | "path = \"extrait_fichier_adresses_REU.parquet\"\n",
49 | "#os.listdir()"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "id": "f400b591",
55 | "metadata": {},
56 | "source": [
57 | "## Loading the address file, and a file with the shapes of communes\n",
58 | "##### Warning: these files are memory-consuming"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "id": "6a6b0625",
65 | "metadata": {},
66 | "outputs": [],
67 | "source": [
68 | "df = pd.read_parquet(path)\n",
69 | "df.head()"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": null,
75 | "id": "31507876",
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | "df.describe()"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "id": "22724e39",
86 | "metadata": {},
87 | "outputs": [],
88 | "source": [
89 | "communes_france = gpd.read_file(\"communes-20220101.shp\")[[\"geometry\", \"insee\"]].dropna()"
90 | ]
91 | },
92 | {
93 |
"cell_type": "markdown", 94 | "id": "0b7056f3", 95 | "metadata": {}, 96 | "source": [ 97 | "### The code below creates an (unofficial) identifier of bureau de vote. We use it in this code mostly for displaying purpose" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "id": "d743febb", 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "\n", 108 | "def prepare_ids(df: pd.DataFrame) -> pd.DataFrame:\n", 109 | " \"\"\"\n", 110 | " Prepare not-official `id_bv` (integers) column, under the assumption there is less than 10000 bv per city\n", 111 | "\n", 112 | " Args:\n", 113 | " df (pd.DataFrame): a dataframe including columns \"Code_BV\" and \"result_citycode\"\n", 114 | "\n", 115 | " Returns:\n", 116 | " pd.DataFrame: a dataframe similar to the input, with a supplementary column \"id_bv\" (integers) unique for every bureau de vote\n", 117 | " \"\"\"\n", 118 | " assert (\"code_bv\" in df.columns) and (\n", 119 | " \"code_commune_ref\" in df.columns\n", 120 | " ), \"There is no identifiers for bureau de vote\"\n", 121 | " df_copy = df.copy()\n", 122 | "\n", 123 | " def prepare_id_bv(row):\n", 124 | " \"\"\"\n", 125 | " Combine the unique id of a city (citycode) and the number of the bureau de vote inside the city to compute a nationalwide id of bureau de vote\n", 126 | "\n", 127 | " Args:\n", 128 | " row (_type_): _description_\n", 129 | "\n", 130 | " Returns:\n", 131 | " id_bv: integer serving as unique id of a bureau de vote\n", 132 | " \"\"\"\n", 133 | " max_bv_per_city = 10000 # assuming there is always less than this number of bv in a city. This is important to grant the uniqueness of id_bv\n", 134 | " max_code_commune = 10**5\n", 135 | " try:\n", 136 | " code_bv = int(row[\"code_bv\"])\n", 137 | " except:\n", 138 | " # keep as Code_BV the first number found in the string (if there is one)\n", 139 | " found = re.search(r\"\\d+\", row[\"code_bv\"])\n", 140 | " if found:\n", 141 | " code_bv = int(found.group())\n", 142 | " else:\n", 143 | " code_bv = max_bv_per_city # this code will indicate parsing errors but won't raise exception\n", 144 | " try:\n", 145 | " code_commune = int(row[\"code_commune_ref\"])\n", 146 | " except:\n", 147 | " found = re.search(r\"\\d+\", row[\"code_commune_ref\"])\n", 148 | " if found:\n", 149 | " code_commune = int(found.group())\n", 150 | " else:\n", 151 | " code_commune = max_code_commune\n", 152 | " return max_bv_per_city * code_commune + code_bv\n", 153 | "\n", 154 | " df_copy[\"id_bv\"] = df_copy.apply(prepare_id_bv, axis=1)\n", 155 | " return df_copy" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "id": "90ff3390", 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "# add an unofficiel \"id_bv\" field id to recognize and to determine the color of id fields\n", 166 | "df_prepared = prepare_ids(df)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "id": "d06bb8e1", 172 | "metadata": {}, 173 | "source": [ 174 | "## Display an example, restricted to a fraction of a department" 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "id": "c65943a6", 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "# Take the example of the departement 83: Le Var\n", 185 | "DEP = \"83\"" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "id": "67b24ed0", 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "communes_dep = 
communes_france[communes_france.insee.str.startswith(str(DEP))]\n",
196 | "communes_dep"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "id": "f1da4dfc",
203 | "metadata": {},
204 | "outputs": [],
205 | "source": [
206 | "# For display purposes, keep only a fraction of the addresses\n",
207 | "ratio = 0.1 # 0 <= ratio <= 1"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": null,
213 | "id": "48e49bd0",
214 | "metadata": {},
215 | "outputs": [],
216 | "source": [
217 | "df_dep = df_prepared[df_prepared.dep_bv==str(DEP)].sample(frac=ratio, random_state=0)\n"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": null,
223 | "id": "c489e703",
224 | "metadata": {
225 | "scrolled": true
226 | },
227 | "outputs": [],
228 | "source": [
229 | "r = display_addresses(addresses=df_dep, communes=communes_dep)\n",
230 | "r.to_html(f\"scatterplot_{DEP}_layer_ratio_{ratio}.html\")\n"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": null,
236 | "id": "f84bfa17",
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "r_voronoi = display_bureau_vote_shapes(addresses=df_dep, communes=communes_dep, mode=\"voronoi\")\n",
241 | "r_voronoi.to_html(f\"voronoi_{DEP}_layer_ratio_{ratio}.html\")"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": null,
247 | "id": "704c570b",
248 | "metadata": {},
249 | "outputs": [],
250 | "source": []
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": null,
255 | "id": "2b56c07c",
256 | "metadata": {},
257 | "outputs": [],
258 | "source": []
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "id": "3b89f39b",
264 | "metadata": {},
265 | "outputs": [],
266 | "source": []
267 | }
268 | ],
269 | "metadata": {
270 | "kernelspec": {
271 | "display_name": "Python 3 (ipykernel)",
272 | "language": "python",
273 | "name": "python3"
274 | },
275 | "language_info": {
276 | "codemirror_mode": {
277 | "name": "ipython",
278 | "version": 3
279 | },
280 | "file_extension": ".py",
281 | "mimetype": "text/x-python",
282 | "name": "python",
283 | "nbconvert_exporter": "python",
284 | "pygments_lexer": "ipython3",
285 | "version": "3.10.6"
286 | }
287 | },
288 | "nbformat": 4,
289 | "nbformat_minor": 5
290 | }
291 | -------------------------------------------------------------------------------- /bureaux_de_vote.html: --------------------------------------------------------------------------------
[static HTML export of the Ariège workflow (page title: "bureaux_de_vote"); the page markup was stripped in this dump, the recoverable notebook cells follow]
%load_ext autoreload
%autoreload 2
#!pip install pydeck

import cleaner
import display
import pandas as pd
import geo
import geopandas as gpd
import pydeck as pdk

pd.options.display.max_rows = 1000
pd.options.display.max_columns = 1000

df = pd.read_csv("Correspondance adresse_bureau de vote_Département de l'Ariège.csv", sep=";", dtype=str)
df = cleaner.clean_dataset(df)

# check that names preceded with a "chez" have been removed
df[['adr_complete', 'libelle_voie_clean', 'comp_adr_1_clean', 'comp_adr_2_clean', 'lieu-dit-clean']][df['adr_complete'].str.contains('chez', na=False)].head()

df.drop(columns=['libelle_voie_clean', 'comp_adr_1_clean', 'comp_adr_2_clean', 'lieu-dit-clean'], inplace=True)

#geocoded_df = geo.add_geoloc(df=df)
geocoded_df = pd.read_csv("concat_adr_bv_geocoded.csv", dtype=str)

Clean geocoded dataframe

geocoded_df = cleaner.clean_geocoded_types(geocoded_df)
geocoded_df = cleaner.clean_failed_geocoding(geocoded_df)
geocoded_df = cleaner.prepare_ids(geocoded_df)

# IMPORTANT: when two points share the same lat-lon position, keep only one
geocoded_df = geocoded_df.drop_duplicates(subset=["latitude", "longitude"])

Load shapes of communes

communes_france = gpd.read_file("communes-20220101.shp")[["geometry", "insee"]].dropna().\
    rename(columns={"insee": "result_citycode"})
communes_france["result_citycode"] = communes_france["result_citycode"].apply(lambda row: row.split(".")[0] if "." in row else row)

communes_ariege = communes_france[communes_france.result_citycode.str.startswith("09")]
del communes_france
communes_ariege.head()

Cartography with color by bureau de vote

r = display.display_addresses(addresses=geocoded_df, communes=communes_ariege)
r.to_html("scatterplot_layer.html")

Save GeoJSON (with 1 Point per voter address)

# geojson = geo.build_geojson_point(geocoded_df)
# geojson.to_file("bv_point.geojson", driver="GeoJSON")

Display convex Hull

# r_hulls = display.display_bureau_vote_shapes(addresses=geocoded_df, communes=communes_ariege, mode="convex")
# r_hulls.to_html("hull_layer.html")

Display Voronoi tessellation

r_voronoi = display.display_bureau_vote_shapes(addresses=geocoded_df, communes=communes_ariege, mode="voronoi")
r_voronoi.to_html("voronoi_layer.html")
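This file is a static nbconvert-style export of the workflow above. Assuming the source notebook is available (the repository does not include the .ipynb this particular export came from), a comparable export can be regenerated with a command such as:

```
jupyter nbconvert --to html <notebook>.ipynb --output bureaux_de_vote.html
```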
-------------------------------------------------------------------------------- /cleaner.py: --------------------------------------------------------------------------------
1 | """
2 | Various cleaning methods. Some can be applied directly to the input addresses table, while others must be applied to addresses
3 | that have previously been geocoded with the "geo" module.
4 | """
5 |
6 | import pandas as pd
7 | from difflib import SequenceMatcher
8 | import re
9 |
10 |
11 | def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
12 |     """
13 |     Lowercase the string fields and remove the names of persons from the dataset
14 |
15 |     Args:
16 |         df (pd.DataFrame): the raw dataframe read from the INSEE file
17 |
18 |     Returns:
19 |         pd.DataFrame: a dataframe without any names, where some column names and column contents have been normalized
20 |     """
21 |     df = df.rename(
22 |         columns={
23 |             "Numéro de voie": "num_voie",
24 |             "Type et libellé de voie": "libelle_voie",
25 |             "Complément d’adresse 1": "comp_adr_1",
26 |             "Complément d’adresse 2": "comp_adr_2",
27 |             "Lieu-dit ": "lieu-dit",
28 |             "Code commune\nRéférentiel": "Code communeRéférentiel",
29 |             "Libellé commune\nRéférentiel": "Libellé communeRéférentiel",
30 |         },
31 |         errors="ignore"
32 |     )
33 |     for col in ["num_voie", "libelle_voie", "comp_adr_1", "comp_adr_2", "lieu-dit"]:
34 |         try:
35 |             df[col] = df[col].str.lower()
36 |             # build the "_clean" columns with persons' names stripped out (see remove_names below);
37 |             # NB: this call is an assumed fix - the original line only lowercased the column again
38 |             df[f"{col}_clean"] = df[col].astype(str).apply(remove_names)
39 |         except:
40 |             continue
41 |     if "geo_adresse" not in df.columns:
42 |         df["geo_adresse"] = df.apply(lambda row: get_address(row), axis=1)
43 |     return df
44 |
45 |
46 | def remove_names(x: str) -> str:
47 |     """
48 |     This function is specific to the Ariège dataset. It normalizes the text, detects the presence of the word "chez" and removes the names following this word.
49 |     In particular the function assumes that:
50 |     (i) the name of a person is made of 2 tokens (compound words count as one), and may follow tokens like "m."/"madame"...
(ii) at most 2 persons are mentioned in one field, and their two names are then separated only by the word "et"
51 |
52 |     Args:
53 |         x (str): a string possibly containing names, following the conditions above
54 |
55 |     Returns:
56 |         str: a string where names have been removed
57 |     """
58 |     x = (
59 |         x.replace("(", "")
60 |         .replace(")", "")
61 |         .replace(".", "")
62 |         .replace(",", "")
63 |         .replace(";", "")
64 |         .replace("/", "")
65 |         .lower()
66 |     )
67 |     if "chez" in x:
68 |         adr = x.split("chez")[0]
69 |         chez = x.split("chez")[1]
70 |         to_parse = chez.split(" ")
71 |         if len(to_parse) > 1:
72 |             if to_parse[1] in [
73 |                 "m.",
74 |                 "m",
75 |                 "mr",
76 |                 "mme",
77 |                 "mlle",
78 |                 "monsieur",
79 |                 "madame",
80 |                 "mademoiselle",
81 |             ]:
82 |                 if len(to_parse) > 4:
83 |                     if to_parse[4] == "et":
84 |                         if len(to_parse) > 5 and to_parse[5] in [
85 |                             "m.",
86 |                             "m",
87 |                             "mr",
88 |                             "mme",
89 |                             "mlle",
90 |                             "monsieur",
91 |                             "madame",
92 |                             "mademoiselle",
93 |                         ]:
94 |                             adr = adr + " ".join(to_parse[8:])
95 |                         else:
96 |                             adr = adr + " ".join(to_parse[7:])
97 |                     else:
98 |                         adr = adr + " ".join(to_parse[4:])
99 |             else:
100 |                 if len(to_parse) > 3:
101 |                     if to_parse[3] == "et":
102 |                         adr = adr + " ".join(to_parse[4:])
103 |                 else:
104 |                     adr = adr + " ".join(to_parse[3:])
105 |         if adr == "nan":
106 |             return ""
107 |         else:
108 |             return adr
109 |     else:
110 |         if x == "nan":
111 |             return ""
112 |         else:
113 |             return x
114 |
115 |
116 | def clean_geocoded_types(df: pd.DataFrame) -> pd.DataFrame:
117 |     """
118 |     Clean some dtypes of the dataframe after the geocoding step
119 |
120 |     Args:
121 |         df (pd.DataFrame): the geocoded dataframe, where numeric columns are still stored as strings
122 |
123 |     Returns:
124 |         pd.DataFrame: the same dataframe with float coordinates and scores, restricted to rows where geocoding returned a label
125 |     """
126 |     geocoded_df = df.copy()
127 |     geocoded_df["latitude"] = geocoded_df["latitude"].astype(float)
128 |     geocoded_df["longitude"] = geocoded_df["longitude"].astype(float)
129 |     geocoded_df["result_score"] = geocoded_df["result_score"].astype(float)
130 |     geocoded_df = geocoded_df[geocoded_df["result_label"].notna()]
131 |     return geocoded_df
132 |
133 |
134 | def clean_failed_geocoding(df: pd.DataFrame) -> pd.DataFrame:
135 |     """
136 |     Remove failed geocodings (score below a threshold), as well as lines where the geocoded citycode differs from the voter's reference commune, and lines where the geocoded postcode is inconsistent with the postcode indicated in the INSEE file
137 |
138 |     Args:
139 |         df (pd.DataFrame): a dataframe where geocoding has already been performed with API-adresse
140 |
141 |     Returns:
142 |         pd.DataFrame: a cleaned subset of this dataframe
143 |     """
144 |     assert (
145 |         "result_score" in df.columns
146 |         and "result_postcode" in df.columns
147 |         and "CP" in df.columns
148 |         and "CP_BV" in df.columns
149 |         and "result_citycode" in df.columns
150 |         and "Code communeRéférentiel" in df.columns
151 |     ), "the dataframe does not include required columns for cleaning"
152 |     # the comparison is performed on column "result_postcode" (because there is no citycode in the INSEE input file) but other functions will only refer to "result_citycode" (because it is good practice to prefer this column)
153 |     return df[
154 |         (df.result_score > 0.5)
155 |         & (df.result_citycode == df["Code communeRéférentiel"])
156 |         & (df.result_postcode == df.CP)
157 |     ].dropna(subset=["CP", "result_citycode", "result_postcode"])
158 |
159 |
160 | def get_address(row) -> str:
161 |     """
162 |     Build a unique address string by combining several fields
163 |
164 |     Args:
165 |         row : a row of a pd.DataFrame
166 |
167 |     Returns:
168 |         str: the
address
169 |     """
170 |
171 |     def similar(a: str, b: str) -> float:  # return a measure of similarity
172 |         return SequenceMatcher(None, a, b).ratio()
173 |
174 |     address = ""
175 |
176 |     for col in ["num_voie_clean", "libelle_voie_clean", "comp_adr_1_clean", "comp_adr_2_clean"]:
177 |         try:
178 |             address += str(row[col]) + " "
179 |         except:
180 |             continue
181 |
182 |     if "lieu-dit-clean" not in row:
183 |         return address.strip()
184 |     elif (similar(str(address), str(row["lieu-dit-clean"]).lower()) > 0.7) | (
185 |         str(row["lieu-dit-clean"]).lower() == "nan"
186 |     ):
187 |         return address.strip()
188 |     else:
189 |         return (address + " " + str(row["lieu-dit-clean"]).lower()).strip()
190 |
191 |
192 | def prepare_ids(df: pd.DataFrame) -> pd.DataFrame:
193 |     """
194 |     Prepare the `id_bv` (integer) column
195 |
196 |     Args:
197 |         df (pd.DataFrame): a dataframe including columns "Code_BV" and "result_citycode"
198 |
199 |     Returns:
200 |         pd.DataFrame: a dataframe similar to the input, with a supplementary column "id_bv" (integers) unique for every bureau de vote
201 |     """
202 |     assert ("Code_BV" in df.columns) and (
203 |         "result_citycode" in df.columns
204 |     ), "There are no identifiers for the bureaux de vote"
205 |     df_copy = df.copy()
206 |
207 |     def prepare_id_bv(row):
208 |         """
209 |         Combine the unique id of a city (citycode) and the number of the bureau de vote inside the city to compute a nationwide id of the bureau de vote
210 |
211 |         Args:
212 |             row (pd.Series): a row with "Code_BV" and "result_citycode" fields
213 |
214 |         Returns:
215 |             id_bv: integer serving as unique id of a bureau de vote
216 |         """
217 |         max_bv_per_city = 1000  # assuming there are always fewer than this many bv in a city; this guarantees the uniqueness of id_bv
218 |         max_code_commune = 10**5
219 |         try:
220 |             code_bv = int(row["Code_BV"])
221 |         except:
222 |             # keep as Code_BV the first number found in the string (if there is one)
223 |             found = re.search(r"\d+", row["Code_BV"])
224 |             if found:
225 |                 code_bv = int(found.group())
226 |             else:
227 |                 code_bv = max_bv_per_city  # this code will indicate parsing errors but won't raise an exception
228 |         try:
229 |             code_commune = int(row["result_citycode"])
230 |         except:
231 |             found = re.search(r"\d+", row["result_citycode"])
232 |             if found:
233 |                 code_commune = int(found.group())
234 |             else:
235 |                 code_commune = max_code_commune
236 |         return max_bv_per_city * code_commune + code_bv
237 |
238 |     df_copy["id_bv"] = df_copy.apply(prepare_id_bv, axis=1)
239 |     return df_copy
240 | -------------------------------------------------------------------------------- /decoupage_parquet.py: --------------------------------------------------------------------------------
1 | import pandas as pd
2 | import os
3 |
4 | path_in = "./../work/table_adresses.parquet"
5 | df = pd.read_parquet(path_in)
6 | df['dep_bv'] = df['code_commune_ref'].apply(lambda s: s[:3] if s[:2]=='97' else s[:2])
7 |
8 | for k in df.dep_bv.unique():
9 |     print(k)
10 |     path_out = f"parquet/table_{k}.parquet"
11 |     if f"table_{k}.parquet" in os.listdir("parquet/"):
12 |         print('Already processed')
13 |     else:
14 |         df[df.dep_bv == k].to_parquet(path_out)
15 | -------------------------------------------------------------------------------- /display.py: --------------------------------------------------------------------------------
1 | """
2 | Methods to display the addresses of voters, the shapes of communes and the interpolated shapes of bureaux de vote
3 | """
4 |
5 | import pydeck as pdk
6 | import pandas as pd
7 | import numpy as np
8 | import geopandas as gpd
9 | 
import geo
10 | from typing import Dict, List
11 |
12 | def prepare_layer_communes(communes: gpd.GeoDataFrame, filled=True) -> pdk.Layer:
13 |     """
14 |     Get a layer with the shapes of the communes
15 |
16 |     Args:
17 |         communes (gpd.GeoDataFrame): the shapes of the communes, and a column with the citycode
18 |         filled (bool, optional): if True, fills the communes shapes with colours. Defaults to True.
19 |
20 |     Returns:
21 |         pdk.Layer: a pydeck Layer with the polygonal shapes of the communes
22 |     """
23 |     assert (
24 |         "result_citycode" in communes.columns or "insee" in communes.columns
25 |     ), "the code commune must be given, in order to associate deterministic colours to communes"
26 |     if "result_citycode" in communes.columns:
27 |         col = "result_citycode"
28 |     else:
29 |         col = "insee"
30 |     displayed = communes.copy()
31 |     # Corsica: citycodes like "2A004" contain letters, which would break the cast to int below;
32 |     # replace the letters with "0" (this only affects the derived, arbitrary display colours)
33 |     displayed[col] = displayed[col].str.replace(r"[abAB]", "0", regex=True)
34 |     displayed = displayed.astype({col: int})
35 |     displayed["color_r"] = 7 * displayed[col] % 255
36 |     displayed["color_g"] = 23 * displayed[col] % 255
37 |     displayed["color_b"] = 67 * displayed[col] % 255
38 |
39 |     coordinates = []
40 |     for _, row in displayed.iterrows():
41 |         try:
42 |             coord = [
43 |                 [
44 |                     list(x)
45 |                     for x in np.transpose(
46 |                         [
47 |                             list(row["geometry"].exterior.coords.xy[0]),
48 |                             list(row["geometry"].exterior.coords.xy[1]),
49 |                         ]
50 |                     )
51 |                 ]
52 |             ]
53 |             coordinates.append(coord)
54 |         except Exception as e:
55 |             print(e)
56 |             coordinates.append([])
57 |
58 |     displayed["coordinates"] = coordinates
59 |
60 |     return pdk.Layer(
61 |         "PolygonLayer",
62 |         pd.DataFrame(displayed),
63 |         pickable=False,
64 |         opacity=0.05,
65 |         stroked=True,
66 |         filled=filled,
67 |         radius_scale=6,
68 |         line_width_min_pixels=1,
69 |         get_polygon="coordinates",
70 |         get_fill_color=["color_r", "color_g", "color_b"],
71 |         get_line_color=[128, 128, 128],
72 |     )
73 |
74 |
75 | def prepare_layer_addresses(df: pd.DataFrame) -> pdk.Layer:
76 |     """
77 |     Put a table of addresses on a map
78 |
79 |     Args:
80 |         df (pd.DataFrame): must include columns 'Commune' (strings), 'adr_complete' (strings), 'result_score' (floats), 'result_label' (strings), 'latitude' (floats), 'longitude' (floats)
81 |
82 |     Returns:
83 |         pdk.Layer: every input address is shown as a point on the map
84 |     """
85 |     data = df.copy()
86 |     data["radius"] = 6
87 |     data["coordinates"] = np.array(df[["longitude", "latitude"]]).tolist()
88 |     # NB: 7, 23 and 67 are coprime with 255. That implies two voting places in the same city will have the same colors if and only if their id_bv modulo 255 are the same. Moreover, two successive voting places will have rather different colors.
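    # Hypothetical worked example (not taken from the data): id_bv = 9123004, i.e. commune 09123, bureau 4,
    # maps to RGB = (7*9123004 % 255, 23*9123004 % 255, 67*9123004 % 255) = (103, 47, 148).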
89 | data["id_bv_r"] = 7 * data["id_bv"] % 255 90 | data["id_bv_g"] = 23 * data["id_bv"] % 255 91 | data["id_bv_b"] = 67 * data["id_bv"] % 255 92 | data.drop(columns=["latitude", "longitude"], inplace=True, errors="ignore") 93 | # Define a layer to display on a map 94 | return pdk.Layer( 95 | "ScatterplotLayer", 96 | data, 97 | pickable=True, 98 | opacity=0.9, 99 | filled=True, 100 | radius_min_pixels=1, 101 | radius_max_pixels=6, 102 | line_width_min_pixels=2, 103 | get_position="coordinates", 104 | get_fill_color=["id_bv_r", "id_bv_g", "id_bv_b"], 105 | get_radius="radius", 106 | get_line_color=[0, 0, 0], 107 | ) 108 | 109 | 110 | def prepare_layer_polygons( 111 | geo_addresses: gpd.GeoDataFrame, 112 | communes: gpd.GeoDataFrame = gpd.GeoDataFrame(), 113 | mode="voronoi", 114 | ) -> pdk.Layer: 115 | """ 116 | Draw polygons around the addresses, so that addresses sharing the same bureau de vote are within the same polygon 117 | 118 | :warning: The geometries of the `geo_addresses` must be either MultiPoint (if we want convex hull) or Point (if we want Voronoi cells) 119 | 120 | Args: 121 | geo_addresses (gpd.GeoDataFrame): must include columns "id_bv" and "result_citycode". The geometries must be shapely Point (in the case of voronoi cells) or MultiPoint (in the case of convex hulls) 122 | communes (gpd.GeoDataFrame, optional): the shapes of communes, if available 123 | mode (str, optional): The way we want to compute polygons around the addresses : can be "convex" or "voronoi". Defaults to "voronoi". 124 | 125 | Returns: 126 | pdk.Layer: calculated bureau de vote shapes are figured with polygons on the map 127 | """ 128 | assert mode.lower() in [ 129 | "convex", 130 | "voronoi", 131 | ], "the implemented methods are voronoi cells or convex hulls" 132 | mode = mode.lower() 133 | 134 | coordinates = [] 135 | 136 | if mode == "convex": 137 | displayed = geo_addresses.copy() 138 | displayed["hulls"] = geo.convex_hull(displayed) 139 | for _, row in displayed.iterrows(): 140 | try: 141 | coord = [ 142 | [ 143 | list(x) 144 | for x in np.transpose( 145 | [ 146 | list(row["hulls"].exterior.coords.xy[0]), 147 | list(row["hulls"].exterior.coords.xy[1]), 148 | ] 149 | ) 150 | ] 151 | ] 152 | coordinates.append(coord) 153 | except Exception as e: 154 | # print(e) 155 | coordinates.append([]) 156 | pass 157 | displayed.drop(columns=["geometry", "hulls"], inplace=True) 158 | 159 | elif mode == "voronoi": 160 | hulls = geo.get_clipped_voronoi_shapes(geo_addresses, communes) 161 | id_bvs = [] 162 | for _, row in hulls.iterrows(): 163 | id_bvs.append(row["id_bv"]) 164 | try: 165 | coord = [ 166 | [ 167 | list(x) 168 | for x in np.transpose( 169 | [ 170 | list(row["geometry"].exterior.coords.xy[0]), 171 | list(row["geometry"].exterior.coords.xy[1]), 172 | ] 173 | ) 174 | ] 175 | ] 176 | coordinates.append(coord) 177 | except Exception as e: 178 | coordinates.append([]) 179 | pass 180 | 181 | displayed = pd.DataFrame(data={"coordinates": coordinates, "id_bv": id_bvs}) 182 | displayed["id_bv_r"] = 7 * displayed["id_bv"] % 255 183 | displayed["id_bv_g"] = 23 * displayed["id_bv"] % 255 184 | displayed["id_bv_b"] = 67 * displayed["id_bv"] % 255 185 | displayed["coordinates"] = coordinates 186 | # Define a layer to display on a map 187 | return pdk.Layer( 188 | "PolygonLayer", 189 | pd.DataFrame(displayed), 190 | pickable=False, 191 | opacity=0.2, 192 | stroked=False, 193 | filled=True, 194 | radius_scale=6, 195 | line_width_min_pixels=1, 196 | get_polygon="coordinates", 197 | get_fill_color=["id_bv_r", 
"id_bv_g", "id_bv_b"], 198 | get_line_color=[0, 0, 0], 199 | ) 200 | 201 | 202 | def prepare_tooltip(columns: List[str]) -> Dict: 203 | """ 204 | Prepare a tooltip indicating a specific subset of columns 205 | 206 | Args: 207 | columns (List[str]): a list of columns of the data 208 | 209 | Returns: 210 | Dict: _description_ 211 | """ 212 | legend = "" 213 | for col in ["id_bv", "result_score", "geo_score", "commune_bv", "geo_adresse", "result_label", "adr_complete", "Commune"]: 214 | if col in columns: 215 | legend += f"{col}: "+"{"+f"{col}"+"} \n" 216 | tooltip = { 217 | "text": legend 218 | } 219 | return tooltip 220 | 221 | 222 | def display_addresses( 223 | addresses: pd.DataFrame, communes: gpd.GeoDataFrame = gpd.GeoDataFrame() 224 | ) -> pdk.Deck: 225 | """ 226 | Display a map with one point per address 227 | 228 | Args: 229 | addresses (pd.DataFrame): _description_ 230 | communes (gpd.GeoDataFrame, optional): the shapes of communes, if available 231 | 232 | Returns: 233 | pdk.Deck: _description_ 234 | """ 235 | addresses_layer = prepare_layer_addresses(addresses) 236 | if len(communes): 237 | layers = [prepare_layer_communes(communes), addresses_layer] 238 | else: 239 | layers = [addresses_layer] 240 | 241 | # Set the viewport location 242 | view_state = pdk.ViewState( 243 | latitude=43.055403, longitude=1.470104, zoom=6, bearing=0, pitch=0 244 | ) 245 | 246 | # Render 247 | return pdk.Deck( 248 | map_style="light", 249 | layers=layers, 250 | initial_view_state=view_state, 251 | tooltip=prepare_tooltip(addresses.columns), 252 | ) 253 | 254 | 255 | def display_bureau_vote_shapes( 256 | addresses: pd.DataFrame, 257 | communes: gpd.GeoDataFrame = gpd.GeoDataFrame(), 258 | mode="voronoi", 259 | ) -> pdk.Deck: 260 | """ 261 | Display on the same map the addresses and the corresponding interpolated bureau de vote shapes 262 | 263 | Args: 264 | addresses (pd.DataFrame): must include columns 'Commune' (strings), 'adr_complete' (strings), 'result_score' (floats), 'result_label' (strings), 'latitude' (floats), 'longitude' (floats) 265 | communes (gpd.GeoDataFrame, optional): the shapes of communes, if available 266 | mode (str, optional): The way we want to compute polygons around the addresses : can be "convex" or "voronoi". Defaults to "voronoi". 
267 |
268 |     Returns:
269 |         pdk.Deck: a pydeck Deck with layers 'addresses' (one point per address), 'communes' (one shape per commune), 'polygons' (one shape per bureau de vote, within the commune)
270 |     """
271 |     assert mode.lower() in ["convex", "voronoi"]
272 |     mode = mode.lower()
273 |
274 |     if mode == "convex":
275 |         geojson = geo.build_geojson_multipoint(addresses)
276 |     elif mode == "voronoi":
277 |         geojson = geo.build_geojson_point(addresses)
278 |
279 |     geojson.drop_duplicates(subset=["geometry"], inplace=True)
280 |     polygons_layer = prepare_layer_polygons(geojson, mode=mode, communes=communes)
281 |
282 |     if len(communes):
283 |         communes_layers = prepare_layer_communes(communes, filled=False)
284 |         layers = [communes_layers, polygons_layer, prepare_layer_addresses(addresses)]
285 |     else:
286 |         layers = [
287 |             polygons_layer,
288 |             prepare_layer_addresses(addresses),
289 |         ]
290 |
291 |     # Set the viewport location
292 |     view_state = pdk.ViewState(
293 |         latitude=43.055403, longitude=1.470104, zoom=6, bearing=0, pitch=0
294 |     )
295 |     # Render
296 |     return pdk.Deck(
297 |         map_style="light",
298 |         layers=layers,
299 |         initial_view_state=view_state,
300 |         tooltip=prepare_tooltip(addresses.columns),
301 |     )
302 | -------------------------------------------------------------------------------- /generate_areas.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | import os
5 | import pandas as pd
6 | import geopandas as gpd
7 | from display import *
8 | import re
9 |
10 | # display just a departement/drom/com
11 | DEP_LIST = ["0"+str(i) for i in range(1,10)]+[str(i) for i in range(10,19)]+["2A","2B"]+[str(i) for i in range(21,96)] + [str(i) for i in range(971,977)]
12 | #DEP_LIST = ["01", "83"]
13 | COMPUTE_BV_BORDERS = False
14 | # path of the address file
15 |
16 | commune_shapes_path = "communes-20220101.shp"
17 | communes_france = gpd.read_file(commune_shapes_path)[["geometry", "insee"]].dropna()
18 |
19 | for DEP in DEP_LIST:
20 |     communes_dep = communes_france[communes_france.insee.str.startswith(str(DEP))]
21 |
22 |     addresses_path = f"parquet/table_{DEP}.parquet"
23 |
24 |     # for this département, choose the ratio of addresses you want to plot
25 |     RATIO = 0.4 # 0 <= RATIO <= 1
26 |
27 |     # ## Loading the address file, and a file with the shapes of communes
28 |     # ##### Warning: these files are heavy
29 |
30 |     df = pd.read_parquet(addresses_path)
31 |     # if id_brut_bv is not None, the condition below should always be True
32 |     if "id_bv" not in df.columns:
33 |         pat = re.compile(r"\d+")
34 |         df["id_bv"] = df["id_brut_bv"].apply(lambda row : int("".join(re.findall(pat, row))))
35 |
36 |
37 |     print(f"LOAD data in memory: {len(df)} rows")
38 |
39 |
40 |     # ### The code below creates an (unofficial) identifier of bureau de vote.
We use it in this code mostly for display purposes
41 |
42 |     # add the unofficial "id_bv" field, used to identify bureaux de vote and to pick their display colors
43 |
44 |
45 |     #df_dep = df[df.dep_bv==DEP].sample(frac=RATIO, random_state=0)
46 |     os.makedirs("html/dep", exist_ok=True)
47 |     os.makedirs("html/bv", exist_ok=True)
48 |
49 |     if COMPUTE_BV_BORDERS:
50 |         for raw_id_bv in df.id_brut_bv.unique():
51 |             df_bv = df[df.id_brut_bv==raw_id_bv]
52 |             r_bv = display_addresses(addresses=df_bv, communes=communes_dep)
53 |             r_bv.to_html(f"html/bv/scatterplot_bv_{raw_id_bv}.html")
54 |
55 |             r_voronoi_bv = display_bureau_vote_shapes(addresses=df_bv, communes=communes_dep, mode="voronoi")
56 |             r_voronoi_bv.to_html(f"html/bv/voronoi_bv_{raw_id_bv}.html")
57 |
58 |
59 |     df_dep = df.sample(frac=RATIO, random_state=0)
60 |
61 |     print("Going to display addresses")
62 |     r = display_addresses(addresses=df_dep, communes=communes_dep)
63 |     r.to_html(f"html/dep/scatterplot_{DEP}_layer_ratio_{RATIO}.html")
64 |
65 |     r_voronoi = display_bureau_vote_shapes(addresses=df_dep, communes=communes_dep, mode="voronoi")
66 |     r_voronoi.to_html(f"html/dep/voronoi_{DEP}_layer_ratio_{RATIO}.html")
67 |
68 |
69 |
70 | -------------------------------------------------------------------------------- /generate_areas_geojson.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | import os
5 | import pandas as pd
6 | import numpy as np
7 | import geopandas as gpd
8 | from shapely import Polygon
9 | from geo import build_geojson_point, get_clipped_voronoi_shapes
10 | pd.set_option('display.max_columns', None)
11 |
12 | DEP_LIST = [
13 |     "0"+str(i) for i in range(1, 10)
14 | ]+[
15 |     str(i) for i in range(10, 20)
16 | ]+["2A", "2B"]+[
17 |     str(i) for i in range(21, 96)
18 | ] + [
19 |     str(i) for i in range(971, 977)
20 | ]
21 | commune_shapes_path = "./../communes-5m.geojson"
22 | communes_france = gpd.read_file(commune_shapes_path)
23 | communes_france = communes_france.rename(
24 |     {'code': 'insee'}, axis=1
25 | )[['insee', 'geometry']]
26 |
27 | for DEP in DEP_LIST:
28 |     print(DEP)
29 |     if f"voronoi_contours_{DEP}.geojson" not in os.listdir("geojson/"):
30 |         communes_dep = communes_france[communes_france.insee.str.startswith(str(DEP))]
31 |         codes2drop = ('13055', '75056', '69123')
32 |         communes_dep = communes_dep.loc[~(communes_dep['insee'].str.startswith(codes2drop))]
33 |
34 |         addresses_path = f"parquet/table_{DEP}.parquet"
35 |
36 |         addresses_df = pd.read_parquet(addresses_path)
37 |         # The lines below create an (unofficial) identifier of bureau de vote
38 |         # We use it in this code mostly for display purposes
39 |         addresses_df['id_bv'] = addresses_df['id_brut_bv']
40 |         addresses_df['commune_bv'] = addresses_df['code_commune_ref']
41 |
42 |         print(f"LOAD dep {DEP} in memory: {len(addresses_df)} rows")
43 |         geo_addresses = build_geojson_point(addresses_df)
44 |         hulls = get_clipped_voronoi_shapes(geo_addresses, communes_dep)
45 |         id_bvs = []
46 |         coordinates = []
47 |         # the block below just aims at formatting
48 |         # the coordinates into a list of [x, y]
49 |         exceptions = []
50 |         for _, row in hulls.iterrows():
51 |             id_bvs.append(row["id_bv"])
52 |             try:
53 |                 coord = Polygon(
54 |                     [
55 |                         list(x)
56 |                         for x in np.transpose(
57 |                             [
58 |                                 list(row["geometry"].exterior.coords.xy[0]),
59 |                                 list(row["geometry"].exterior.coords.xy[1]),
60 |                             ]
61 |                         )
62 |                     ]
63 |                 )
64 |                 coordinates.append(coord)
65 |             except Exception as e:
66 |                 exceptions.append({
67 |                     'error':
e, 68 | 'row': row 69 | }) 70 | coordinates.append([]) 71 | pass 72 | 73 | voronoi_polygons = gpd.GeoDataFrame( 74 | pd.DataFrame(data={"coordinates": coordinates, "id_bv": id_bvs}), 75 | geometry='coordinates' 76 | ) 77 | # handling overlaps 78 | for main_idx in voronoi_polygons.index: 79 | for side_idx in voronoi_polygons.index: 80 | if main_idx != side_idx: 81 | if voronoi_polygons.loc[main_idx, 'coordinates'].contains(voronoi_polygons.loc[side_idx, 'coordinates']): 82 | voronoi_polygons.loc[main_idx, 'coordinates'] = voronoi_polygons.loc[main_idx, 'coordinates'].difference(voronoi_polygons.loc[side_idx, 'coordinates']) 83 | # grouping polygons into multipolygons for each BdV 84 | voronoi_polygons = voronoi_polygons.dissolve('id_bv').reset_index(names='id_bv').reset_index(names='id') 85 | # int id as requested for downstream processes 86 | voronoi_polygons['id'] = voronoi_polygons['id'].astype(int) 87 | with open(f"geojson/voronoi_contours_{DEP}.geojson", 'w') as f: 88 | f.write(voronoi_polygons.to_json()) 89 | else: 90 | print("Already processed") 91 | -------------------------------------------------------------------------------- /geo.py: -------------------------------------------------------------------------------- 1 | """ 2 | Utils methods to geocode addresses, and to compute polygonal shapes around the addresses 3 | """ 4 | import pandas as pd 5 | import os 6 | import numpy as np 7 | import geopandas as gpd 8 | import pytess 9 | from typing import List 10 | from shapely.geometry import Polygon, Point 11 | from shapely import make_valid 12 | import requests 13 | 14 | 15 | def add_geoloc(df: pd.DataFrame) -> pd.DataFrame: 16 | """ 17 | Locally save the raw base of addresses and call the API-adresse to geocode them (in particular: add coordinates and found city) 18 | 19 | Args: 20 | df (pd.DataFrame): a file with columns "geo_adresse" ((street number +) street type + street name/locality name), "Commune" (commune name), "CP" (postcode) 21 | 22 | Returns: 23 | pd.DataFrame: a dataframe with the input columns, and also latitudes, longitudes, result_postcode, result_citycode, etc. 
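    NB: the geocoding is delegated to the api-adresse.data.gouv.fr CSV endpoint (see the request in the body below); the added `result_*` columns follow that API's CSV output format.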
24 |     """
25 |     df.to_csv("concat_adr_bv.csv", index=False)
26 |     # os.system(
27 |     #     "curl -X POST -F data=@concat_adr_bv.csv -F columns=adr_complete -F columns=Commune -F postcode=CP https://api-adresse.data.gouv.fr/search/csv/ > concat_adr_bv_geocoded.csv"
28 |     # )
29 |     f = open('concat_adr_bv.csv', 'rb')
30 |     files = {'data': ('concat_adr_bv', f)}
31 |     payload = {'columns': ['geo_adresse', 'Commune'], 'postcode': 'CP'}
32 |     r = requests.post('https://api-adresse.data.gouv.fr/search/csv/', files=files, data=payload, stream=True)
33 |     with open('concat_adr_bv_geocoded.csv', 'wb') as fd:
34 |         for chunk in r.iter_content(chunk_size=1024):
35 |             fd.write(chunk)
36 |
37 |     geocoded = pd.read_csv("concat_adr_bv_geocoded.csv", dtype=str)
38 |     geocoded["latitude"] = geocoded["latitude"].astype(float)
39 |     geocoded["longitude"] = geocoded["longitude"].astype(float)
40 |     geocoded["result_score"] = geocoded["result_score"].astype(float)
41 |     geocoded = geocoded[geocoded["result_label"].notna()]
42 |     return geocoded
43 |
44 |
45 | def build_geojson_point(addresses: pd.DataFrame) -> gpd.GeoDataFrame:
46 |     """
47 |     Turn the dataframe with coordinates into a GeoDataFrame containing a Point object for each address
48 |     NB: when there are several addresses at the same point, the function keeps only one sample
49 |
50 |     Args:
51 |         addresses (pd.DataFrame): a dataframe that has already been processed with API-adresse, and that also contains ids for the bureaux de vote (function `cleaner.prepare_ids`)
52 |     Returns:
53 |         gpd.GeoDataFrame: includes columns: "geometry" (shapely Point), "result_citycode" (as string), "label" (commune name, as string) and "id_bv" (unique id we impose per bureau de vote, int)
54 |     """
55 |
56 |     geojson = {"type": "FeatureCollection", "features": []}
57 |     if "result_label" in addresses.columns:
58 |         label_col = "result_label"
59 |     else:
60 |         label_col = "commune_bv"
61 |     if "result_citycode" in addresses.columns:
62 |         code_col = "result_citycode"
63 |     else:
64 |         code_col = "code_commune_ref"
65 |     for _, row in addresses.iterrows():
66 |         if row[label_col]:
67 |             props = {
68 |                 "label": row[label_col],
69 |                 "id_bv": row["id_bv"],
70 |                 "result_citycode": row[code_col],
71 |             }
72 |             geojson["features"].append(
73 |                 {
74 |                     "type": "Feature",
75 |                     "geometry": {
76 |                         "type": "Point",
77 |                         "coordinates": [
78 |                             float(row["longitude"]),
79 |                             float(row["latitude"]),
80 |                         ],
81 |                     },
82 |                     "properties": props,
83 |                 }
84 |             )
85 |     gdf = gpd.GeoDataFrame.from_features(geojson)
86 |     # IMPORTANT: when there are several addresses at the same point, keep only one sample
87 |     return gdf.drop_duplicates(subset=["geometry"])
88 |
89 |
90 | def build_geojson_multipoint(addresses: pd.DataFrame) -> gpd.GeoDataFrame:
91 |     """
92 |     Turn the dataframe with coordinates into a GeoDataFrame containing a MultiPoint (list of points) object for each bureau de vote
93 |
94 |     Args:
95 |         addresses (pd.DataFrame): a dataframe that has already been processed with API-adresse, and that also contains ids for the bureaux de vote (function `cleaner.prepare_ids`)
96 |     Returns:
97 |         gpd.GeoDataFrame: includes columns: "geometry" (shapely MultiPoint), "result_citycode" (as string) and "id_bv" (unique id we impose per bureau de vote, int)
98 |     """
99 |
100 |     geojson = {"type": "FeatureCollection", "features": []}
101 |     assert (
102 |         "id_bv" in addresses.columns
103 |     ), "There is no identifier for the 'bureaux de vote' in this dataframe"
104 |
105 |     def get_coordinates_list(data: pd.DataFrame) -> np.array:
106 |         return
np.array(data[["longitude", "latitude"]]).tolist()
107 |
108 |     for id_bv, data in addresses.groupby("id_bv"):
109 |         cp = data.result_citycode.min()
110 |
111 |         geojson["features"].append(
112 |             {
113 |                 "type": "Feature",
114 |                 "geometry": {
115 |                     "type": "MultiPoint",
116 |                     "coordinates": get_coordinates_list(data),
117 |                 },
118 |                 "properties": {"id_bv": id_bv, "result_citycode": cp},
119 |             }
120 |         )
121 |     gdf = gpd.GeoDataFrame.from_features(geojson)
122 |     return gdf
123 |
124 |
125 | def convex_hull(gdf: gpd.GeoDataFrame) -> gpd.GeoSeries:
126 |     """
127 |     Compute the convex hulls of the input geometries
128 |
129 |     Args:
130 |         gdf (gpd.GeoDataFrame): the geometries to compute hulls for
131 |
132 |     Returns:
133 |         gpd.GeoSeries: each row is a Polygon, a Point or a LineString
134 |     """
135 |     return gpd.GeoSeries(gdf.geometry).convex_hull
136 |
137 |
138 | def clip_to_communes(
139 |     gdf: gpd.GeoDataFrame, communes: gpd.GeoDataFrame
140 | ) -> gpd.GeoDataFrame:
141 |     """
142 |     Clip the polygons of the input geodataframe to the boundaries of the specified communes.
143 |
144 |     Args:
145 |         gdf (gpd.GeoDataFrame): must include columns "geometry" and "result_citycode"
146 |         communes (gpd.GeoDataFrame): must include a geometry column and a citycode column ("result_citycode" or "insee")
147 |
148 |     Returns:
149 |         gpd.GeoDataFrame: the input GeoDataFrame, clipped to the input `communes` shapes
150 |     """
151 |     gdf_copy = gdf.copy()
152 |     multipolygons_communes_dict = {}
153 |     multipolygons_communes_list = list()
154 |     if "result_citycode" in communes.columns:
155 |         code_col = "result_citycode"
156 |     else:
157 |         code_col = "insee"
158 |     # precompute the MultiPolygon of each commune that is relevant for our input geodataframe
159 |     for cp in np.intersect1d(
160 |         communes[code_col].unique(), gdf["result_citycode"].unique()
161 |     ):
162 |         multipolygons_communes_dict[cp] = communes[
163 |             communes[code_col] == cp
164 |         ].geometry.unary_union
165 |     # align the precomputed MultiPolygons with the input GeoDataFrame `gdf`
166 |     for _, row in gdf_copy.iterrows():
167 |         cp = row["result_citycode"]
168 |         multipolygons_communes_list.append(multipolygons_communes_dict[cp])
169 |     to_intersect = gpd.GeoSeries(multipolygons_communes_list)
170 |     try:
171 |         gdf_copy.geometry = gdf_copy.geometry.intersection(
172 |             to_intersect, align=False
173 |         )
174 |     except:
175 |         # handling self-intersection cases
176 |         for k in range(len(gdf_copy)):
177 |             try:
178 |                 # for rows where the intersection works fine
179 |                 gdf_copy.loc[k:k,'geometry'] = gdf_copy.loc[k:k,'geometry'].intersection(
180 |                     to_intersect.loc[k:k],
181 |                     align=False
182 |                 )
183 |             except:
184 |                 # removing points that are too close together to resolve the Polygon
185 |                 try:
186 |                     gdf_copy.loc[k:k,'geometry'] = gpd.GeoSeries(
187 |                         gdf_copy.loc[k:k,'geometry'].values[0].simplify(tolerance=1)
188 |                     ).intersection(
189 |                         gpd.GeoSeries(to_intersect.loc[k:k].values[0].simplify(tolerance=1)),
190 |                         align=False
191 |                     )
192 |                 except:
193 |                     # use make_valid to restore the geometry
194 |                     gdf_copy.loc[k:k,'geometry'] = gpd.GeoSeries(
195 |                         make_valid(gdf_copy.loc[k:k,'geometry'].values[0].simplify(tolerance=1))
196 |                     ).intersection(
197 |                         gpd.GeoSeries(make_valid(to_intersect.loc[k:k].values[0].simplify(tolerance=1))),
198 |                         align=False
199 |                     )
200 |     return gdf_copy
201 |
202 |
203 | def polygon_union(
204 |     gdf: gpd.GeoDataFrame,
205 |     pivot_column: str = "id_bv",
206 |     columns: List[str] = ["result_citycode"],
207 | ) -> gpd.GeoDataFrame:
208 |     """
209 |     Assuming the geometry of the input GeoDataFrame consists of
polygons, make the union of these polygons given a pivot column.
210 |     Some columns of the input GeoDataFrame can be kept in the output, under the assumption that:
211 |     (i) for a given pivot value, and a given column of "columns", the value of the column on this pivot value stays constant
212 |
213 |     Args:
214 |         gdf (gpd.GeoDataFrame): must contain the column `pivot_column` and the ancillary columns `columns`
215 |         pivot_column (str): the column that must be used as pivot. Defaults to "id_bv".
216 |         columns (List[str], optional): The list of other columns (not `pivot_column` nor "geometry") to keep in the output. Defaults to ["result_citycode"].
217 |
218 |     Returns:
219 |         gpd.GeoDataFrame: consists of the geometry of the merged polygons (Polygon or MultiPolygon), `pivot_column` and the ancillary columns `columns`
220 |     """
221 |     geometries = list()
222 |     # "data" consists of the properties of the output GeoDataFrame
223 |     data = {pivot_column: []}
224 |     for column in columns:
225 |         data[column] = list()
226 |
227 |     for pivot in gdf[pivot_column].unique():
228 |         # WARNING: the 2 lines below assume that, for a given pivot value and a given column of "columns", the value of the column on this pivot value stays constant
229 |         # (in particular, this holds for the column "result_citycode" when the union is done on "id_bv")
230 |         for column in columns:
231 |             val = gdf[gdf[pivot_column] == pivot][column].min()
232 |             data[column].append(val)
233 |         s = gdf[gdf[pivot_column] == pivot].geometry
234 |         geometries.append(s.unary_union)
235 |         data[pivot_column].append(pivot)
236 |     return gpd.GeoDataFrame(geometry=geometries, data=data)
237 |
238 |
239 | def get_clipped_voronoi_shapes(
240 |     gdf: gpd.GeoDataFrame, communes: gpd.GeoDataFrame = gpd.GeoDataFrame()
241 | ) -> gpd.GeoDataFrame:
242 |     """
243 |     Compute Voronoi cells, clip them to the shapes of the communes, and merge the clipped cells that share the same "id_bv"
244 |
245 |     Args:
246 |         gdf (gpd.GeoDataFrame): must include "geometry", "result_citycode" (string) and "id_bv" (unique id we determine for each bureau de vote, int)
247 |         communes (gpd.GeoDataFrame, optional): the shapes of the communes. Defaults to gpd.GeoDataFrame().
248 |
249 |     Returns:
250 |         gpd.GeoDataFrame: the merged, clipped Voronoi cells, one or several rows per "id_bv"
251 |     """
252 |     hulls = voronoi_hull(gdf, communes)
253 |     if len(communes):
254 |         hulls = clip_to_communes(hulls, communes)
255 |     return connected_components_polygon_union(hulls)
256 |
257 |
258 | def connected_components_polygon_union(
259 |     gdf: gpd.GeoDataFrame,
260 |     pivot_column: str = "id_bv",
261 |     columns: List[str] = ["result_citycode"],
262 | ) -> gpd.GeoDataFrame:
263 |     """
264 |     Assuming the geometry of the input GeoDataFrame consists of polygons, return the connected components of the union of these polygons given a pivot column.
265 |     Some columns of the input GeoDataFrame can be kept in the output, under the assumption that:
266 |     (i) for a given pivot value, and a given column of "columns", the value of the column on this pivot value stays constant
267 |
268 |     Args:
269 |         gdf (gpd.GeoDataFrame): must contain the column `pivot_column` and the ancillary columns `columns`
270 |         pivot_column (str): the column that must be used as pivot. Defaults to "id_bv".
271 |         columns (List[str], optional): The list of other columns (not `pivot_column` nor "geometry") to keep in the output. Defaults to ["result_citycode"].
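    Example (hypothetical): if the clipped cells of one bureau form a MultiPolygon of two disjoint parts, the output contains two rows sharing that "id_bv", each carrying one Polygon.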
272 |
273 |     Returns:
274 |         gpd.GeoDataFrame: consists of the geometry of the merged connected components (which are necessarily Polygons), `pivot_column` and the ancillary columns `columns`
275 |     """
276 |     geometries = list()
277 |     # "data" consists of the properties of the output GeoDataFrame
278 |     data = {pivot_column: []}
279 |     for column in columns:
280 |         data[column] = list()
281 |
282 |     def save_columns_values(pivot):
283 |         data[pivot_column].append(pivot)
284 |         for column in columns:
285 |             val = gdf[gdf[pivot_column] == pivot][column].min()
286 |             data[column].append(val)
287 |
288 |     for pivot in gdf[pivot_column].unique():
289 |         # WARNING: the lines below assume that, for a given pivot value and a given column of "columns", the value of the column on this pivot value stays constant
290 |         # (in particular, this holds for the column "result_citycode" when the union is done on "id_bv")
291 |         s = gdf[
292 |             gdf[pivot_column] == pivot
293 |         ].geometry  # normally these shapes are Polygons, but they could be Points if there is only one geocoded voter in a bureau de vote
294 |         if len(s) == 1 and s.iloc[0].geom_type == "Point":
295 |             geometries.append(s)
296 |             save_columns_values(pivot)
297 |         else:
298 |             merged_shape = s.unary_union
299 |             if merged_shape is not None:
300 |                 if merged_shape.geom_type == "Polygon":
301 |                     geometries.append(merged_shape)
302 |                     save_columns_values(pivot)
303 |
304 |                 elif merged_shape.geom_type == "MultiPolygon":
305 |                     for _, row in (
306 |                         gpd.GeoDataFrame(geometry=[merged_shape])
307 |                         .explode(index_parts=False)
308 |                         .iterrows()
309 |                     ):
310 |                         geometries.append(row["geometry"])
311 |                         save_columns_values(pivot)
312 |     return gpd.GeoDataFrame(geometry=geometries, data=data)
313 |
314 |
315 | def voronoi_hull(gdf: gpd.GeoDataFrame, communes: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
316 |     """
317 |     Compute Voronoi cells around each of the input addresses, within an arbitrarily large bounding box (hence it is useful to clip the cells afterwards to limits relevant to our use cases).
318 |     It is based on the Voronoi method implemented in the pytess library
319 |
320 |     Args:
321 |         gdf (gpd.GeoDataFrame): must include "geometry", "result_citycode" (string) and "id_bv" (unique id we determine for each bureau de vote, int)
322 |         communes (gpd.GeoDataFrame): the shapes of the communes, used as fallback contours for communes with zero or one bureau de vote
323 |     Returns:
324 |         gpd.GeoDataFrame: includes "geometry", "result_citycode" and "id_bv"
325 |     """
326 |     assert (
327 |         "id_bv" in gdf.columns and "result_citycode" in gdf.columns
328 |     ), "Some necessary columns are missing"
329 |     gdf_copy = gdf.copy()
330 |
331 |     id_bvs, citycodes = [], []
332 |     polygons = []
333 |     gdf_copy.drop_duplicates(
334 |         subset=["geometry"], inplace=True
335 |     )  # delete duplicates of geolocated points
336 |     # make sure every commune is covered: some communes have no address in the input at all
337 |     for citycode in set(gdf_copy.result_citycode.unique()) | set(communes.insee.unique()):
338 |         gdf_city = gdf_copy[gdf_copy.result_citycode == citycode]
339 |         # rare case: no voter address at all in the commune
340 |         if len(gdf_city) == 0:
341 |             id_bvs.append(citycode+'_X')
342 |             citycodes.append(citycode)
343 |             polygons.append(communes.loc[communes['insee']==citycode, 'geometry'].values[0])
344 |         # a single BdV in the commune: its contour will be the commune's own contour
345 |         elif gdf_city['id_bv'].nunique() == 1:
346 |             id_bvs.append(gdf_city['id_bv'].values[0])
347 |             citycodes.append(citycode)
348 |             polygons.append(communes.loc[communes['insee']==citycode, 'geometry'].values[0])
349 |         # general case
350 |         elif len(gdf_city) >= 3:
351 | 
points_city, id_bvs_city = [], []
352 |             for k in gdf_city.index:
353 |                 try:
354 |                     points_city.append(
355 |                         (
356 |                             gdf_city.geometry[k].coords.xy[0][0],
357 |                             gdf_city.geometry[k].coords.xy[1][0],
358 |                         )
359 |                     )
360 |                     id_bvs_city.append(gdf_city.id_bv[k])
361 |                 except:
362 |                     pass
363 |
364 |             # the condition "if k" excludes the corners of the bounding box from the pytess.voronoi output
365 |             # the size of 'buffer_percent' defines the size of the virtual bounding box we compute the Voronoi diagram in
366 |             # pytess.voronoi returns a list of 2-tuples, with the first item in each tuple being the original input point (or None for each corner of the bounding box buffer), and the second item being the point's corresponding Voronoi polygon.
367 |
368 |             voronoi_city_dict = {
369 |                 k: v for (k, v) in pytess.voronoi(points_city, buffer_percent=1000) if k
370 |             }
371 |             polygons_city = []
372 |             if (
373 |                 type(points_city) == list
374 |             ):  # this list is supposed to be like [(lon, lat), (lon, lat), (lon, lat), ...]
375 |                 for point in points_city:
376 |                     try:
377 |                         polygons_city.append(Polygon(voronoi_city_dict[point]))
378 |                     except:
379 |                         polygons_city.append(None)
380 |                 id_bvs.extend(id_bvs_city)
381 |                 citycodes.extend([citycode] * len(id_bvs_city))
382 |                 polygons.extend(polygons_city)
383 |
384 |         # handling one known case: two points in one commune (due to bad geocoding), from two different BdV
385 |         # creating big triangles along the bisector of the two points, which will later be cropped to the commune's contours
386 |         elif len(gdf_city) == 2:
387 |             size = 10e6
388 |             middle_point = Point(
389 |                 (gdf_city['geometry'].values[0].coords.xy[0][0] + gdf_city['geometry'].values[1].coords.xy[0][0])/2,
390 |                 (gdf_city['geometry'].values[0].coords.xy[1][0] + gdf_city['geometry'].values[1].coords.xy[1][0])/2
391 |             )
392 |             for k in range(2):
393 |                 point2middle_vector = [
394 |                     middle_point.coords.xy[0][0] - gdf_city['geometry'].values[k].coords.xy[0][0],
395 |                     middle_point.coords.xy[1][0] - gdf_city['geometry'].values[k].coords.xy[1][0]
396 |                 ]
397 |                 orthogonal_vector = [
398 |                     -point2middle_vector[1],
399 |                     point2middle_vector[0]
400 |                 ]
401 |                 across_point = Point(
402 |                     gdf_city['geometry'].values[k].coords.xy[0][0] +
403 |                     size*point2middle_vector[0],
404 |                     gdf_city['geometry'].values[k].coords.xy[1][0] +
405 |                     size*point2middle_vector[1]
406 |                 )
407 |                 other_point1 = Point(
408 |                     middle_point.coords.xy[0][0] +
409 |                     size*orthogonal_vector[0],
410 |                     middle_point.coords.xy[1][0] +
411 |                     size*orthogonal_vector[1],
412 |                 )
413 |                 other_point2 = Point(
414 |                     middle_point.coords.xy[0][0] -
415 |                     size*orthogonal_vector[0],
416 |                     middle_point.coords.xy[1][0] -
417 |                     size*orthogonal_vector[1],
418 |                 )
419 |                 id_bvs.append(gdf_city['id_bv'].values[k])
420 |                 citycodes.append(citycode)
421 |                 polygons.append(Polygon([across_point, other_point1, other_point2]))
422 |
423 |     return gpd.GeoDataFrame(
424 |         geometry=polygons, data={"id_bv": id_bvs, "result_citycode": citycodes}
425 |     )
426 | -------------------------------------------------------------------------------- /license.md: --------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Etalab
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
--------------------------------------------------------------------------------
/license.md:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2022 Etalab
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | from cleaner import (
2 |     clean_dataset,
3 |     clean_failed_geocoding,
4 |     clean_geocoded_types,
5 |     prepare_ids
6 | )
7 | from display import (
8 |     display_addresses,
9 |     display_bureau_vote_shapes
10 | )
11 | import pandas as pd
12 | from geo import (
13 |     add_geoloc
14 | )
15 | import geopandas as gpd
16 | import pydeck as pdk
17 | import sys
18 | 
19 | if __name__ == '__main__':
20 |     df = pd.read_csv(sys.argv[1], sep=";", dtype=str)
21 |     print('### Dataset Loaded!')
22 |     df = clean_dataset(df)
23 |     # the intermediate *_clean columns are no longer needed (names preceded by a "chez" have been removed)
24 |     df.drop(columns=['libelle_voie_clean', 'comp_adr_1_clean', 'comp_adr_2_clean', 'lieu-dit-clean'], inplace=True)
25 |     print('### Dataset Cleaned!')
26 |     # comment out the next line to skip the geocoding step (it takes a few minutes to run)
27 |     geocoded_df = add_geoloc(df=df)
28 |     print('### Dataset geocoded!')
29 |     geocoded_df = pd.read_csv("concat_adr_bv_geocoded.csv", dtype=str)  # reload the geocoded file from disk, forcing string dtypes
30 |     # Clean the geocoded dataframe
31 |     geocoded_df = clean_geocoded_types(geocoded_df)
32 |     geocoded_df = clean_failed_geocoding(geocoded_df)
33 |     geocoded_df = prepare_ids(geocoded_df)
34 |     # IMPORTANT: when two addresses share the same lat-lon position, keep only one
35 |     geocoded_df = geocoded_df.drop_duplicates(subset=["latitude", "longitude"])
36 |     print('### Geocoded dataset Cleaned!')
37 |     # Load the commune shapes
38 |     communes_france = gpd.read_file("communes-20220101.shp")[["geometry", "insee"]].dropna().\
39 |         rename(columns={"insee": "result_citycode"})
40 |     communes_france["result_citycode"] = communes_france["result_citycode"].apply(lambda row: row.split(".")[0] if "." in row else row)  # strip float artefacts ("09122.0" -> "09122")
41 |     communes_ariege = communes_france[communes_france.result_citycode.str.startswith("09")]  # "09" = Ariège
42 |     del communes_france
43 |     print('### Shapes communes loaded!')
44 |     # Cartography with one colour per bureau de vote
45 |     r = display_addresses(addresses=geocoded_df, communes=communes_ariege)
46 |     r.to_html("scatterplot_layer.html")
47 |     print('### Page 1 HTML generated!')
48 |     # Save a GeoJSON (with 1 Point per voter address)
49 |     # geojson = geo.build_geojson_point(geocoded_df)
50 |     # geojson.to_file("bv_point.geojson", driver="GeoJSON")
51 |     # Display convex hulls
52 |     # r_hulls = display.display_bureau_vote_shapes(addresses=geocoded_df, communes=communes_ariege, mode="convex")
53 |     # r_hulls.to_html("hull_layer.html")
54 |     # Display the Voronoi tessellation
55 |     r_voronoi = display_bureau_vote_shapes(addresses=geocoded_df, communes=communes_ariege, mode="voronoi")
56 |     r_voronoi.to_html("voronoi_layer.html")
57 |     print('### Page 2 HTML generated!')
58 | 
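`main.py` is hard-wired to the Ariège through the `startswith("09")` filter above. A small, hypothetical parametrisation, mirroring the `DEP` constant that `main_atelier.py` below uses, would let the same pipeline run on any department (the CLI argument position is an assumption, not part of the repo):

```
import sys

import geopandas as gpd

# Hypothetical: take the department code as a second CLI argument, default to the Ariège.
DEP = sys.argv[2] if len(sys.argv) > 2 else "09"

communes_france = (
    gpd.read_file("communes-20220101.shp")[["geometry", "insee"]]
    .dropna()
    .rename(columns={"insee": "result_citycode"})
)
communes_dep = communes_france[communes_france.result_citycode.str.startswith(DEP)]
```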
--------------------------------------------------------------------------------
/main_atelier.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 | 
4 | import os
5 | import pandas as pd
6 | import geopandas as gpd
7 | from display import *
8 | import re
9 | 
10 | # paths of the address file and of the commune shapes
11 | addresses_path = "extrait_fichier_adresses_REU.parquet"
12 | commune_shapes_path = "communes-20220101.shp"
13 | 
14 | # choose an example department
15 | DEP = "83"
16 | # for this department, choose the ratio of addresses you want to plot
17 | RATIO = 0.1  # 0 <= RATIO <= 1
18 | 
19 | # ## Loading the address file, and a file with the shape of communes.
20 | # ##### Warning: these files are heavy
21 | 
22 | df = pd.read_parquet(addresses_path)
23 | communes_france = gpd.read_file(commune_shapes_path)[["geometry", "insee"]].dropna()
24 | 
25 | 
26 | # ### The code below creates an (unofficial) identifier of bureau de vote. We use it in this code mostly for display purposes
27 | 
28 | 
29 | def prepare_ids(df: pd.DataFrame) -> pd.DataFrame:
30 |     """
31 |     Prepare an unofficial `id_bv` (integer) column, under the assumption that there are fewer than 10000 bureaux de vote per city
32 | 
33 |     Args:
34 |         df (pd.DataFrame): a dataframe including columns "code_bv" and "code_commune_ref"
35 | 
36 |     Returns:
37 |         pd.DataFrame: a dataframe similar to the input, with a supplementary column "id_bv" (integers) unique for every bureau de vote
38 |     """
39 |     assert ("code_bv" in df.columns) and (
40 |         "code_commune_ref" in df.columns
41 |     ), "There are no identifiers for bureaux de vote"
42 |     df_copy = df.copy()
43 | 
44 |     def prepare_id_bv(row):
45 |         """
46 |         Combine the unique id of a city (citycode) and the number of the bureau de vote inside the city to compute a nationwide id of the bureau de vote
47 | 
48 |         Args:
49 |             row (pd.Series): a row with the fields "code_bv" and "code_commune_ref"
50 | 
51 |         Returns:
52 |             id_bv: integer serving as unique id of a bureau de vote
53 |         """
54 |         max_bv_per_city = 10000  # assuming there are always fewer than this many bv in a city; this guarantees the uniqueness of id_bv
55 |         max_code_commune = 10**5
56 |         try:
57 |             code_bv = int(row["code_bv"])
58 |         except (TypeError, ValueError):
59 |             # keep the first number found in the string (if any) as code_bv
60 |             found = re.search(r"\d+", row["code_bv"])
61 |             if found:
62 |                 code_bv = int(found.group())
63 |             else:
64 |                 code_bv = max_bv_per_city  # this sentinel flags parsing errors without raising an exception
65 |         try:
66 |             code_commune = int(row["code_commune_ref"])
67 |         except (TypeError, ValueError):
68 |             found = re.search(r"\d+", row["code_commune_ref"])
69 |             if found:
70 |                 code_commune = int(found.group())
71 |             else:
72 |                 code_commune = max_code_commune
73 |         return max_bv_per_city * code_commune + code_bv
74 | 
75 |     df_copy["id_bv"] = df_copy.apply(prepare_id_bv, axis=1)
76 |     return df_copy
77 | 
78 | 
79 | # add this unofficial "id_bv" field, used to tell bureaux apart and to choose their display colours
80 | df_prepared = prepare_ids(df)
81 | 
82 | communes_dep = communes_france[communes_france.insee.str.startswith(str(DEP))]
83 | 
84 | df_dep = df_prepared[df_prepared.dep_bv == DEP].sample(frac=RATIO, random_state=0)
85 | 
86 | 
87 | r = display_addresses(addresses=df_dep, communes=communes_dep)
88 | r.to_html(f"scatterplot_{DEP}_layer_ratio_{RATIO}.html")
89 | 
90 | r_voronoi = display_bureau_vote_shapes(addresses=df_dep, communes=communes_dep, mode="voronoi")
91 | r_voronoi.to_html(f"voronoi_{DEP}_layer_ratio_{RATIO}.html")
92 | 
93 | 
94 | 
95 | 
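To make the `id_bv` construction above concrete, here is a short worked example with made-up codes; the arithmetic is exactly that of `prepare_id_bv`:

```
max_bv_per_city = 10000

code_commune = int("09122")  # an INSEE commune code -> 9122
code_bv = int("3")           # a bureau number inside that commune -> 3

id_bv = max_bv_per_city * code_commune + code_bv
print(id_bv)  # 91220003: the last four digits encode the bureau, the rest the commune
```

Uniqueness holds as long as `code_bv` stays below 10000, which is the stated assumption; the value 10000 itself only ever appears as the parsing-error sentinel.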
--------------------------------------------------------------------------------
/renovate.json:
--------------------------------------------------------------------------------
1 | {
2 |   "$schema": "https://docs.renovatebot.com/renovate-schema.json",
3 |   "extends": [
4 |     "config:base"
5 |   ]
6 | }
7 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | geopandas==0.12.0
2 | pygeos==0.13
3 | numpy==1.22.3
4 | pandas==1.5.0
5 | pydeck==0.7.1
6 | Pytess==1.0.0
7 | requests==2.28.1
8 | pyarrow==10.0.1
9 | 
--------------------------------------------------------------------------------
/starting_kit_atelier.R:
--------------------------------------------------------------------------------
1 | #############################################################################
2 | #               Workshop on the REU addresses: starting kit                #
3 | #############################################################################
4 | 
5 | ################################
6 | # Imports
7 | ################################
8 | 
9 | ##### Packages
10 | 
11 | library(arrow)
12 | library(dplyr)
13 | library(data.table)
14 | library(magrittr)
15 | library(sf)
16 | library(ggplot2)
17 | library(viridis)
18 | 
19 | ##### Data
20 | 
21 | extrait_adressesREU <- arrow::read_parquet(
22 |   "extrait_fichier_adresses_REU.parquet"
23 | ) %>% setDT()
24 | 
25 | ################################
26 | # A few manipulations
27 | ################################
28 | 
29 | ##### Select a sample of the file
30 | 
31 | sample_REU <- extrait_adressesREU[sample(.N, 5e5)]
32 | 
33 | ##### Convert the Lambert coordinates from Geoloc to GPS
34 | 
35 | adressesREU_geoloc <- copy(extrait_adressesREU) %>%
36 |   select(X, Y) %>%
37 |   st_as_sf(
38 |     coords = c("X", "Y"),
39 |     crs = 2154,
40 |     na.fail = FALSE
41 |   ) %>%
42 |   st_transform(crs = 4326)
43 | 
44 | ##### Convert the BAN coordinates to GPS
45 | 
46 | adressesREU_BAN <- copy(extrait_adressesREU) %>%
47 |   select(latitude, longitude) %>%
48 |   st_as_sf(
49 |     coords = c("longitude", "latitude"),
50 |     crs = 4326,
51 |     na.fail = FALSE
52 |   )
53 | 
54 | ################################
55 | # Descriptive statistics and new fields
56 | ################################
57 | 
58 | ##### Inspect the quantiles of the BAN relevance score
59 | 
60 | quantiles_geo_score <- quantile(extrait_adressesREU$geo_score, seq(0, 1, 0.2),
61 |                                 na.rm = TRUE)
62 | 
63 | 
64 | ##### Generate intervals for the BAN quality score
65 | 
66 | extrait_adressesREU[, `:=`(categorie_geo_score = cut(
67 |   geo_score, 5, ordered_result = TRUE))]
68 | 
69 | ##### Generate more explicit quality labels for Geoloc
70 | 
71 | extrait_adressesREU[, `:=`(
72 |   label_QUALITE_XY = fcase(
73 |     QUALITE_XY == 11, "Voie Sûre, Numéro trouvé",
74 |     QUALITE_XY == 12, "Voie Sûre, Position aléatoire dans la voie",
75 |     QUALITE_XY == 21, "Voie probable, Numéro trouvé",
76 |     QUALITE_XY == 22, "Voie probable, Position aléatoire dans la voie",
77 |     QUALITE_XY == 33, "Voie inconnue, Position aléatoire dans la commune"
78 |   ) %>%
79 |     factor(
80 |       levels = c(
81 |         "Voie Sûre, Numéro trouvé",
82 |         "Voie probable, Numéro trouvé",
83 |         "Voie Sûre, Position aléatoire dans la voie",
84 |         "Voie probable, Position aléatoire dans la voie",
85 |         "Voie inconnue, Position aléatoire dans la commune"
86 |       ),
87 |       ordered = TRUE
88 |     )
89 | )
90 | ]
91 | 
92 | ##### Compute the distances between the positions returned by BAN and by Geoloc
93 | 
94 | extrait_adressesREU[, `:=`(
95 |   distance = st_distance(
96 |     x = adressesREU_geoloc,
97 |     y = adressesREU_BAN,
98 |     by_element = TRUE
99 |   )
100 | )]
101 | 
102 | ################################
103 | # Visualising the BAN / Geoloc differences
104 | ################################
105 | 
106 | ##### Compute the share of addresses for which the two sources return
107 | ##### locations within 100m, 200m, ... of each other,
108 | ##### broken down by the quality indicators
109 | 
110 | prop_normalisations_proches <- extrait_adressesREU[, .(
111 |   nb_adresses = .N,
112 |   # part_10moins = mean(distance <= units::set_units(10, m), na.rm = TRUE),
113 |   # part_20moins = mean(distance <= units::set_units(20, m), na.rm = TRUE),
114 |   # part_50moins = mean(distance <= units::set_units(50, m), na.rm = TRUE),
115 |   part_100moins = mean(distance <= units::set_units(100, m), na.rm = TRUE),
116 |   part_200moins = mean(distance <= units::set_units(200, m), na.rm = TRUE)
117 |   # part_500moins = mean(distance <= units::set_units(500, m), na.rm = TRUE),
118 |   # part_1000moins = mean(distance <= units::set_units(1000, m), na.rm = TRUE)
119 | ), by = .(label_QUALITE_XY, QUALITE_XY, categorie_geo_score)][
120 |   order(QUALITE_XY, categorie_geo_score)]
121 | 
122 | ##### Plot the proportions computed above
123 | 
124 | ggplot(prop_normalisations_proches[!is.na(QUALITE_XY) & !is.na(categorie_geo_score)]) +
125 |   geom_bar(
126 |     aes(
127 |       x = categorie_geo_score, y = part_100moins, fill = label_QUALITE_XY
128 |     ), position = "dodge", stat = "identity"
129 |   ) +
130 |   labs(
131 |     x = "Score de qualité BAN",
132 |     y = "Proportion de distance <100m",
133 |     fill = "Qualité de Geoloc"
134 |   ) +
135 |   scale_fill_viridis_d() +
136 |   scale_y_continuous(labels = scales::percent_format()) +
137 |   theme(legend.position = "bottom") +
138 |   guides(
139 |     fill = guide_legend(
140 |       title.hjust = 0.5,
141 |       title.position = "top",
142 |       nrow = 3
143 |     )
144 |   )
145 | 
146 | ################################
147 | # The contours
148 | ################################
149 | 
150 | 
--------------------------------------------------------------------------------
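The starting kit stops just before the contours themselves. As a pointer, here is a minimal geopandas sketch (in Python, like the rest of the repo) of the cropping step the Voronoi code above alludes to ("cropped later to the commune's contours"). The file name `voronoi_cells.geojson` and its columns are assumptions for illustration, not artefacts produced by this repo:

```
import geopandas as gpd

# Assumed inputs: Voronoi cells with columns ["geometry", "id_bv", "result_citycode"],
# plus the commune contours prepared as in main.py, both in EPSG:4326.
cells = gpd.read_file("voronoi_cells.geojson")
communes = (
    gpd.read_file("communes-20220101.shp")[["geometry", "insee"]]
    .dropna()
    .rename(columns={"insee": "result_citycode"})
)

# Pair every cell with the contour of its own commune, then intersect the two
# geometries so each cell (or bisector triangle) is clipped to the commune.
merged = cells.merge(communes, on="result_citycode", suffixes=("", "_commune"))
merged["geometry"] = [
    cell.intersection(contour)
    for cell, contour in zip(merged["geometry"], merged["geometry_commune"])
]
cropped = gpd.GeoDataFrame(
    merged.drop(columns="geometry_commune"), geometry="geometry", crs=cells.crs
)
cropped.to_file("voronoi_cells_cropped.geojson", driver="GeoJSON")
```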