├── .github └── ISSUE_TEMPLATE │ ├── bug_report.md │ └── feature_request.md ├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── MANIFEST.in ├── README.md ├── _config.yml ├── benchmark └── Benchmark.ipynb ├── csv_schema_inference ├── __init__.py └── csv_schema_inference.py ├── googled57bdb220576a44a.html ├── pyproject.toml └── setup.cfg /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Desktop (please complete the following information):** 27 | - OS: [e.g. iOS] 28 | - Browser [e.g. chrome, safari] 29 | - Version [e.g. 22] 30 | 31 | **Smartphone (please complete the following information):** 32 | - Device: [e.g. iPhone6] 33 | - OS: [e.g. iOS8.1] 34 | - Browser [e.g. stock browser, safari] 35 | - Version [e.g. 22] 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 39 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. 
I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, religion, or sexual identity 10 | and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 
14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the 26 | overall community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or 31 | advances of any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email 35 | address, without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 
55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement at 63 | . 64 | All complaints will be reviewed and investigated promptly and fairly. 65 | 66 | All community leaders are obligated to respect the privacy and security of the 67 | reporter of any incident. 68 | 69 | ## Enforcement Guidelines 70 | 71 | Community leaders will follow these Community Impact Guidelines in determining 72 | the consequences for any action they deem in violation of this Code of Conduct: 73 | 74 | ### 1. Correction 75 | 76 | **Community Impact**: Use of inappropriate language or other behavior deemed 77 | unprofessional or unwelcome in the community. 78 | 79 | **Consequence**: A private, written warning from community leaders, providing 80 | clarity around the nature of the violation and an explanation of why the 81 | behavior was inappropriate. A public apology may be requested. 82 | 83 | ### 2. Warning 84 | 85 | **Community Impact**: A violation through a single incident or series 86 | of actions. 87 | 88 | **Consequence**: A warning with consequences for continued behavior. No 89 | interaction with the people involved, including unsolicited interaction with 90 | those enforcing the Code of Conduct, for a specified period of time. This 91 | includes avoiding interactions in community spaces as well as external channels 92 | like social media. Violating these terms may lead to a temporary or 93 | permanent ban. 94 | 95 | ### 3. Temporary Ban 96 | 97 | **Community Impact**: A serious violation of community standards, including 98 | sustained inappropriate behavior. 
99 | 100 | **Consequence**: A temporary ban from any sort of interaction or public 101 | communication with the community for a specified period of time. No public or 102 | private interaction with the people involved, including unsolicited interaction 103 | with those enforcing the Code of Conduct, is allowed during this period. 104 | Violating these terms may lead to a permanent ban. 105 | 106 | ### 4. Permanent Ban 107 | 108 | **Community Impact**: Demonstrating a pattern of violation of community 109 | standards, including sustained inappropriate behavior, harassment of an 110 | individual, or aggression toward or disparagement of classes of individuals. 111 | 112 | **Consequence**: A permanent ban from any sort of public interaction within 113 | the community. 114 | 115 | ## Attribution 116 | 117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 118 | version 2.0, available at 119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. 120 | 121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct 122 | enforcement ladder](https://github.com/mozilla/diversity). 123 | 124 | [homepage]: https://www.contributor-covenant.org 125 | 126 | For answers to common questions about this code of conduct, see the FAQ at 127 | https://www.contributor-covenant.org/faq. Translations are available at 128 | https://www.contributor-covenant.org/translations. 
129 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | CONTRIBUTING.md 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Ramses Alexander Coraspe Valdez 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | global-include *.* -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # **Csv Schema Inference** 2 | A tool to automatically infer column data types in .csv files 3 | 4 | ### Check the article here: Building a Schema Inference Data Pipeline for Large CSV files 5 | 6 |


11 | 12 | 13 |
14 | 15 | ## **Installing csv-schema-inference** 🔧 16 | 17 |
18 | 19 |
20 | 21 | ``` python 22 | pip install csv-schema-inference 23 | ``` 24 | 25 |
26 | 27 | Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/ 28 | Collecting csv-schema-inference 29 | Downloading csv_schema_inference-0.0.9-py3-none-any.whl (7.3 kB) 30 | Installing collected packages: csv-schema-inference 31 | Successfully installed csv-schema-inference-0.0.9 32 | 33 |
34 | 35 |
36 | 37 |
38 | 39 | ## **Importing csv-schema-inference library** ⚡ 40 | 41 |
42 | 43 |
44 | 45 | ``` python 46 | from csv_schema_inference import csv_schema_inference 47 | ``` 48 | 49 |
50 | 51 |
52 | 53 | ## **Setting csv-schema-inference configuration** ✍ 54 | 55 |
56 | 57 |
58 | 59 | ``` python 60 | 61 | # If the inferred data type is INTEGER but FLOAT values were also found in the column, the result is promoted to FLOAT 62 | conditions = {"INTEGER":"FLOAT"} 63 | 64 | csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size = 200000, acc = 0.8, seed=2, header=True, sep=",", conditions = conditions) 65 | pathfile = "/content/file__500k.csv" 66 | ``` 67 | 68 |
69 | 70 |
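
The `conditions` mapping can be read as a promotion rule. The following is a minimal sketch of that rule, assuming it is applied once per column after the candidate types have been collected. `apply_conditions` is a hypothetical helper written for illustration and is not part of the library's API.

```python
def apply_conditions(inferred, types_found, conditions):
    """Promote `inferred` to conditions[inferred] when that type was also observed.

    Hypothetical helper: the library's internal logic may differ.
    """
    promoted = conditions.get(inferred)
    if promoted is not None and promoted in types_found:
        return promoted
    return inferred


conditions = {"INTEGER": "FLOAT"}

# INTEGER is the inferred type, but FLOAT values were also seen in the column.
mixed = apply_conditions("INTEGER", {"INTEGER", "FLOAT"}, conditions)

# Only INTEGER values were seen, so nothing is promoted.
pure = apply_conditions("INTEGER", {"INTEGER"}, conditions)

print(mixed, pure)
```

Under this reading, a column such as `test_column`, which contains mostly integers plus some floats, would be reported as FLOAT.
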
71 | 72 | ## **Run inference** 🏃 73 | 74 |
75 | 76 |
77 | 78 | ``` python 79 | aprox_schema = csv_infer.run_inference(pathfile) 80 | ``` 81 | 82 |
83 | 84 |
85 | 86 | ## **Showing the approximate data type inference for each column** 🔍 87 | 88 |
89 | 90 |
91 | 92 | ``` python 93 | csv_infer.pretty(aprox_schema) 94 | ``` 95 | 96 |
97 | 98 | 0 99 | name 100 | id 101 | type 102 | INTEGER 103 | nullable 104 | False 105 | 1 106 | name 107 | full_name 108 | type 109 | STRING 110 | nullable 111 | True 112 | 2 113 | name 114 | age 115 | type 116 | INTEGER 117 | nullable 118 | False 119 | 3 120 | name 121 | city 122 | type 123 | STRING 124 | nullable 125 | True 126 | 4 127 | name 128 | weight 129 | type 130 | FLOAT 131 | nullable 132 | False 133 | 5 134 | name 135 | height 136 | type 137 | FLOAT 138 | nullable 139 | False 140 | 6 141 | name 142 | isActive 143 | type 144 | BOOLEAN 145 | nullable 146 | False 147 | 7 148 | name 149 | col_int1 150 | type 151 | INTEGER 152 | nullable 153 | False 154 | 8 155 | name 156 | col_int2 157 | type 158 | INTEGER 159 | nullable 160 | False 161 | 9 162 | name 163 | col_int3 164 | type 165 | INTEGER 166 | nullable 167 | False 168 | 10 169 | name 170 | col_float1 171 | type 172 | FLOAT 173 | nullable 174 | False 175 | 11 176 | name 177 | col_float2 178 | type 179 | FLOAT 180 | nullable 181 | False 182 | 12 183 | name 184 | col_float3 185 | type 186 | FLOAT 187 | nullable 188 | False 189 | 13 190 | name 191 | col_float4 192 | type 193 | FLOAT 194 | nullable 195 | False 196 | 14 197 | name 198 | col_float5 199 | type 200 | FLOAT 201 | nullable 202 | False 203 | 15 204 | name 205 | col_float6 206 | type 207 | FLOAT 208 | nullable 209 | False 210 | 16 211 | name 212 | col_float7 213 | type 214 | FLOAT 215 | nullable 216 | False 217 | 17 218 | name 219 | col_float8 220 | type 221 | FLOAT 222 | nullable 223 | False 224 | 18 225 | name 226 | col_float9 227 | type 228 | FLOAT 229 | nullable 230 | False 231 | 19 232 | name 233 | col_float10 234 | type 235 | FLOAT 236 | nullable 237 | False 238 | 20 239 | name 240 | test_column 241 | type 242 | FLOAT 243 | nullable 244 | False 245 | 246 |
247 | 248 |
249 | 250 |
251 | 252 | ## **Checking schema values for specific columns** ✔ 253 | 254 |
255 | 256 |
257 | 258 | ``` python 259 | result = csv_infer.get_schema_columns(columns = {"test_column"}) 260 | csv_infer.pretty(result) 261 | ``` 262 | 263 |
264 | 265 | 20 266 | _name 267 | test_column 268 | types_found 269 | INTEGER 270 | cnt 271 | 406130 272 | FLOAT 273 | cnt 274 | 50964 275 | nullable 276 | False 277 | type 278 | FLOAT 279 | 280 |
281 | 282 |
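
To make the `types_found` counts easier to interpret, here is a rough sketch of how per-type counts for one column could be tallied. It is an illustration only: `classify` and `tally_types` are hypothetical names, the classification order (BOOLEAN, INTEGER, FLOAT, STRING) and the treatment of empty strings as nulls are assumptions, and the library's actual detection rules may differ.

```python
from collections import Counter


def classify(value):
    """Return a type label for one raw CSV field, or None for an empty field."""
    text = value.strip()
    if text == "":
        return None  # assumption: an empty field counts as a null
    if text.lower() in ("true", "false"):
        return "BOOLEAN"
    try:
        int(text)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(text)
        return "FLOAT"
    except ValueError:
        return "STRING"


def tally_types(values):
    """Count how often each type label occurs in a column of raw strings."""
    counts = Counter()
    nullable = False
    for value in values:
        label = classify(value)
        if label is None:
            nullable = True
        else:
            counts[label] += 1
    return {"types_found": dict(counts), "nullable": nullable}


sample = ["1", "2", "3.5", "4", "7.25"]
summary = tally_types(sample)
print(summary)
```
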
283 | 284 |
285 | 286 | ## **Explore all possible data types for a specific column** ✅ 287 | 288 |
289 | 290 |
291 | 292 | ``` python 293 | result = csv_infer.explore_schema_column(column = "test_column") 294 | csv_infer.pretty(result) 295 | ``` 296 | 297 |
298 | 299 | 20 300 | name 301 | test_column 302 | types_found 303 | INTEGER 304 | 88.85043339006856 305 | FLOAT 306 | 11.149566609931437 307 | nullable 308 | False 309 | 310 |
311 | 312 |
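
The percentages reported for `test_column` appear to be the share of each type among the counted values: the counts shown earlier (406130 INTEGER and 50964 FLOAT) give the same figures. A small sketch of that arithmetic, where `type_percentages` is a hypothetical helper and not a library function:

```python
def type_percentages(counts):
    """Convert per-type counts into percentages of the total counted values."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * cnt / total for name, cnt in counts.items()}


# Counts taken from the get_schema_columns output for test_column.
counts = {"INTEGER": 406130, "FLOAT": 50964}
percentages = type_percentages(counts)
print(percentages)
```
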
313 | 314 | ## Benchmark 315 | The tests were run on 9 .csv files with 21 columns each and different sizes and record counts. Each process (shuffling and inferring) was executed 5 times per file, and the average time was calculated. 316 | 317 | - file__20m.csv: 20 million records 318 | - file__15m.csv: 15 million records 319 | - file__12m.csv: 12 million records 320 | - file__10m.csv: 10 million records 321 | - And so on, down to file__1m.csv 322 | 323 | If you want to know more about the shuffling process, you can check this other repository: A tool to automatically Shuffle lines in .csv files. The shuffling process helps us to: 324 | 325 | 1. Increase the probability of finding all the data types present in a single column. 326 | 2. Avoid iterating over the entire dataset. 327 | 3. Avoid biases that come from the way the data is ordered, which may be part of its organic behavior and which we cannot rule out without knowing how the file was built. 328 | 329 |
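
As an illustration of the averaging described in the benchmark section, the sketch below computes the mean shuffle and inferring times for one file. The five runs are the `file__2m.csv` values recorded in `benchmark/Benchmark.ipynb`; `average_runs` is a hypothetical helper, and the notebook itself may aggregate the data differently (for example with pandas).

```python
from statistics import mean

# The five recorded runs for file__2m.csv, in seconds.
runs_2m = [
    {"shuffle_time": 5.56659010, "inferring_time": 9.77755900},
    {"shuffle_time": 5.37670290, "inferring_time": 9.85879350},
    {"shuffle_time": 5.50792340, "inferring_time": 9.83664550},
    {"shuffle_time": 5.77451570, "inferring_time": 9.72117910},
    {"shuffle_time": 5.56910340, "inferring_time": 9.84671710},
]


def average_runs(runs):
    """Average each timing column over all runs of a single file."""
    return {
        "shuffle_time": mean(r["shuffle_time"] for r in runs),
        "inferring_time": mean(r["inferring_time"] for r in runs),
    }


averages = average_runs(runs_2m)
print(averages)
```
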


334 | 335 | ## Contributing and Feedback 336 | Any ideas or feedback about this repository? Help me improve it. 337 | 338 | ## Authors 339 | - Created by Ramses Alexander Coraspe Valdez 340 | - Created in 2022 341 | 342 | ## License 343 | This project is licensed under the terms of the MIT License. 344 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-cayman 2 | title: "A tool to automatically infer column data types in .csv files" 3 | description: "A parallel implementation of Schema inference using python" 4 | author: "Ramses Alexander Coraspe Valdez" 5 | -------------------------------------------------------------------------------- /benchmark/Benchmark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Benchmark.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [] 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | } 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "code", 21 | "source": [ 22 | "import pandas as pd" 23 | ], 24 | "metadata": { 25 | "id": "NrcJWz22npeq" 26 | }, 27 | "execution_count": 99, 28 | "outputs": [] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 100, 33 | "metadata": { 34 | "id": "XyKNLam6m4Vv" 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "benchmark_data = [ \n", 39 | " {'filename': 'file__20m.csv', 'file_size': 3100848423, 'shuffle_time': 73.00043550, 'inferring_time': 111.56356970},\n", 40 | " {'filename': 'file__20m.csv', 'file_size': 3100848423, 'shuffle_time': 72.97213800, 'inferring_time': 115.25191430},\n", 41 | " {'filename': 'file__20m.csv', 'file_size': 3100848423, 'shuffle_time': 82.32063370, 'inferring_time':
116.76299740},\n", 42 | " {'filename': 'file__20m.csv', 'file_size': 3100848423, 'shuffle_time': 77.67622630, 'inferring_time': 114.59385790},\n", 43 | " {'filename': 'file__20m.csv', 'file_size': 3100848423, 'shuffle_time': 73.26938180, 'inferring_time': 112.55643420},\n", 44 | " {'filename': 'file__15m.csv', 'file_size': 2322887546, 'shuffle_time': 55.82634800, 'inferring_time': 74.62251340},\n", 45 | " {'filename': 'file__15m.csv', 'file_size': 2322887546, 'shuffle_time': 42.93429800, 'inferring_time': 71.26189710},\n", 46 | " {'filename': 'file__15m.csv', 'file_size': 2322887546, 'shuffle_time': 42.87042450, 'inferring_time': 69.13962730},\n", 47 | " {'filename': 'file__15m.csv', 'file_size': 2322887546, 'shuffle_time': 42.14651010, 'inferring_time': 71.23978310},\n", 48 | " {'filename': 'file__15m.csv', 'file_size': 2322887546, 'shuffle_time': 42.21968120, 'inferring_time': 69.67053280},\n", 49 | " {'filename': 'file__12m.csv', 'file_size': 1856118441, 'shuffle_time': 32.48983010, 'inferring_time': 58.08111770},\n", 50 | " {'filename': 'file__12m.csv', 'file_size': 1856118441, 'shuffle_time': 34.64318280, 'inferring_time': 57.98930810},\n", 51 | " {'filename': 'file__12m.csv', 'file_size': 1856118441, 'shuffle_time': 34.85442540, 'inferring_time': 57.71942010},\n", 52 | " {'filename': 'file__12m.csv', 'file_size': 1856118441, 'shuffle_time': 33.38362710, 'inferring_time': 59.86055910},\n", 53 | " {'filename': 'file__12m.csv', 'file_size': 1856118441, 'shuffle_time': 32.79728820, 'inferring_time': 57.41156370},\n", 54 | " {'filename': 'file__10m.csv', 'file_size': 1544899668, 'shuffle_time': 28.28831460, 'inferring_time': 53.78283170},\n", 55 | " {'filename': 'file__10m.csv', 'file_size': 1544899668, 'shuffle_time': 30.25130520, 'inferring_time': 51.21287500},\n", 56 | " {'filename': 'file__10m.csv', 'file_size': 1544899668, 'shuffle_time': 29.83213370, 'inferring_time': 53.01958860},\n", 57 | " {'filename': 'file__10m.csv', 'file_size': 1544899668, 
'shuffle_time': 30.21982290, 'inferring_time': 51.81474830},\n", 58 | " {'filename': 'file__10m.csv', 'file_size': 1544899668, 'shuffle_time': 29.52344140, 'inferring_time': 58.40408200},\n", 59 | " {'filename': 'file__8m.csv', 'file_size': 1235682644, 'shuffle_time': 22.60465530, 'inferring_time': 44.68717590},\n", 60 | " {'filename': 'file__8m.csv', 'file_size': 1235682644, 'shuffle_time': 23.84743100, 'inferring_time': 42.68867510},\n", 61 | " {'filename': 'file__8m.csv', 'file_size': 1235682644, 'shuffle_time': 22.94851320, 'inferring_time': 46.96807710},\n", 62 | " {'filename': 'file__8m.csv', 'file_size': 1235682644, 'shuffle_time': 22.77527450, 'inferring_time': 42.62858490},\n", 63 | " {'filename': 'file__8m.csv', 'file_size': 1235682644, 'shuffle_time': 22.20869720, 'inferring_time': 42.98606580},\n", 64 | " {'filename': 'file__6m.csv', 'file_size': 926480055, 'shuffle_time': 15.88705860, 'inferring_time': 28.34111610},\n", 65 | " {'filename': 'file__6m.csv', 'file_size': 926480055, 'shuffle_time': 17.08761300, 'inferring_time': 29.42147060},\n", 66 | " {'filename': 'file__6m.csv', 'file_size': 926480055, 'shuffle_time': 16.48110200, 'inferring_time': 29.21088670},\n", 67 | " {'filename': 'file__6m.csv', 'file_size': 926480055, 'shuffle_time': 17.10600270, 'inferring_time': 28.82191680},\n", 68 | " {'filename': 'file__6m.csv', 'file_size': 926480055, 'shuffle_time': 17.17415740, 'inferring_time': 29.26859480},\n", 69 | " {'filename': 'file__4m.csv', 'file_size': 617284424, 'shuffle_time': 11.47866530, 'inferring_time': 19.30165580},\n", 70 | " {'filename': 'file__4m.csv', 'file_size': 617284424, 'shuffle_time': 12.42761710, 'inferring_time': 19.83578670},\n", 71 | " {'filename': 'file__4m.csv', 'file_size': 617284424, 'shuffle_time': 11.44712670, 'inferring_time': 21.38865030},\n", 72 | " {'filename': 'file__4m.csv', 'file_size': 617284424, 'shuffle_time': 11.67422640, 'inferring_time': 23.90071370},\n", 73 | " {'filename': 'file__4m.csv', 'file_size': 
617284424, 'shuffle_time': 11.75241490, 'inferring_time': 23.17653020},\n", 74 | " {'filename': 'file__2m.csv', 'file_size': 308078962, 'shuffle_time': 5.56659010, 'inferring_time': 9.77755900},\n", 75 | " {'filename': 'file__2m.csv', 'file_size': 308078962, 'shuffle_time': 5.37670290, 'inferring_time': 9.85879350},\n", 76 | " {'filename': 'file__2m.csv', 'file_size': 308078962, 'shuffle_time': 5.50792340, 'inferring_time': 9.83664550},\n", 77 | " {'filename': 'file__2m.csv', 'file_size': 308078962, 'shuffle_time': 5.77451570, 'inferring_time': 9.72117910},\n", 78 | " {'filename': 'file__2m.csv', 'file_size': 308078962, 'shuffle_time': 5.56910340, 'inferring_time': 9.84671710},\n", 79 | " {'filename': 'file__1m.csv', 'file_size': 153491820, 'shuffle_time': 2.42946810, 'inferring_time': 4.65625420},\n", 80 | " {'filename': 'file__1m.csv', 'file_size': 153491820, 'shuffle_time': 2.38822270, 'inferring_time': 5.17744930},\n", 81 | " {'filename': 'file__1m.csv', 'file_size': 153491820, 'shuffle_time': 2.74428740, 'inferring_time': 4.82960490},\n", 82 | " {'filename': 'file__1m.csv', 'file_size': 153491820, 'shuffle_time': 2.58021890, 'inferring_time': 5.17412620},\n", 83 | " {'filename': 'file__1m.csv', 'file_size': 153491820, 'shuffle_time': 2.67854850, 'inferring_time': 5.08991410}\n", 84 | "]" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "source": [ 90 | "df = pd.DataFrame(benchmark_data)\n", 91 | "df" 92 | ], 93 | "metadata": { 94 | "colab": { 95 | "base_uri": "https://localhost:8080/", 96 | "height": 1000 97 | }, 98 | "id": "R9ABJG7qnohL", 99 | "outputId": "fb7f4f3c-6542-44cb-9d1c-4ce44d9537bb" 100 | }, 101 | "execution_count": 102, 102 | "outputs": [ 103 | { 104 | "output_type": "execute_result", 105 | "data": { 106 | "text/plain": [ 107 | " filename file_size shuffle_time inferring_time\n", 108 | "0 file__20m.csv 3100848423 73.000435 111.563570\n", 109 | "1 file__20m.csv 3100848423 72.972138 115.251914\n", 110 | "2 file__20m.csv 3100848423 82.320634 
116.762997\n", 111 | "3 file__20m.csv 3100848423 77.676226 114.593858\n", 112 | "4 file__20m.csv 3100848423 73.269382 112.556434\n", 113 | "5 file__15m.csv 2322887546 55.826348 74.622513\n", 114 | "6 file__15m.csv 2322887546 42.934298 71.261897\n", 115 | "7 file__15m.csv 2322887546 42.870424 69.139627\n", 116 | "8 file__15m.csv 2322887546 42.146510 71.239783\n", 117 | "9 file__15m.csv 2322887546 42.219681 69.670533\n", 118 | "10 file__12m.csv 1856118441 32.489830 58.081118\n", 119 | "11 file__12m.csv 1856118441 34.643183 57.989308\n", 120 | "12 file__12m.csv 1856118441 34.854425 57.719420\n", 121 | "13 file__12m.csv 1856118441 33.383627 59.860559\n", 122 | "14 file__12m.csv 1856118441 32.797288 57.411564\n", 123 | "15 file__10m.csv 1544899668 28.288315 53.782832\n", 124 | "16 file__10m.csv 1544899668 30.251305 51.212875\n", 125 | "17 file__10m.csv 1544899668 29.832134 53.019589\n", 126 | "18 file__10m.csv 1544899668 30.219823 51.814748\n", 127 | "19 file__10m.csv 1544899668 29.523441 58.404082\n", 128 | "20 file__8m.csv 1235682644 22.604655 44.687176\n", 129 | "21 file__8m.csv 1235682644 23.847431 42.688675\n", 130 | "22 file__8m.csv 1235682644 22.948513 46.968077\n", 131 | "23 file__8m.csv 1235682644 22.775274 42.628585\n", 132 | "24 file__8m.csv 1235682644 22.208697 42.986066\n", 133 | "25 file__6m.csv 926480055 15.887059 28.341116\n", 134 | "26 file__6m.csv 926480055 17.087613 29.421471\n", 135 | "27 file__6m.csv 926480055 16.481102 29.210887\n", 136 | "28 file__6m.csv 926480055 17.106003 28.821917\n", 137 | "29 file__6m.csv 926480055 17.174157 29.268595\n", 138 | "30 file__4m.csv 617284424 11.478665 19.301656\n", 139 | "31 file__4m.csv 617284424 12.427617 19.835787\n", 140 | "32 file__4m.csv 617284424 11.447127 21.388650\n", 141 | "33 file__4m.csv 617284424 11.674226 23.900714\n", 142 | "34 file__4m.csv 617284424 11.752415 23.176530\n", 143 | "35 file__2m.csv 308078962 5.566590 9.777559\n", 144 | "36 file__2m.csv 308078962 5.376703 9.858794\n", 145 | "37 
file__2m.csv 308078962 5.507923 9.836645\n", 146 | "38 file__2m.csv 308078962 5.774516 9.721179\n", 147 | "39 file__2m.csv 308078962 5.569103 9.846717\n", 148 | "40 file__1m.csv 153491820 2.429468 4.656254\n", 149 | "41 file__1m.csv 153491820 2.388223 5.177449\n", 150 | "42 file__1m.csv 153491820 2.744287 4.829605\n", 151 | "43 file__1m.csv 153491820 2.580219 5.174126\n", 152 | "44 file__1m.csv 153491820 2.678549 5.089914" 153 | ], 154 | "text/html": [ 155 | "\n", 156 | "
\n", 157 | "
\n", 158 | "
\n", 159 | "\n", 172 | "\n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " 
\n", 499 | "
         filename   file_size  shuffle_time  inferring_time
0   file__20m.csv  3100848423     73.000435      111.563570
1   file__20m.csv  3100848423     72.972138      115.251914
2   file__20m.csv  3100848423     82.320634      116.762997
3   file__20m.csv  3100848423     77.676226      114.593858
4   file__20m.csv  3100848423     73.269382      112.556434
5   file__15m.csv  2322887546     55.826348       74.622513
6   file__15m.csv  2322887546     42.934298       71.261897
7   file__15m.csv  2322887546     42.870424       69.139627
8   file__15m.csv  2322887546     42.146510       71.239783
9   file__15m.csv  2322887546     42.219681       69.670533
10  file__12m.csv  1856118441     32.489830       58.081118
11  file__12m.csv  1856118441     34.643183       57.989308
12  file__12m.csv  1856118441     34.854425       57.719420
13  file__12m.csv  1856118441     33.383627       59.860559
14  file__12m.csv  1856118441     32.797288       57.411564
15  file__10m.csv  1544899668     28.288315       53.782832
16  file__10m.csv  1544899668     30.251305       51.212875
17  file__10m.csv  1544899668     29.832134       53.019589
18  file__10m.csv  1544899668     30.219823       51.814748
19  file__10m.csv  1544899668     29.523441       58.404082
20   file__8m.csv  1235682644     22.604655       44.687176
21   file__8m.csv  1235682644     23.847431       42.688675
22   file__8m.csv  1235682644     22.948513       46.968077
23   file__8m.csv  1235682644     22.775274       42.628585
24   file__8m.csv  1235682644     22.208697       42.986066
25   file__6m.csv   926480055     15.887059       28.341116
26   file__6m.csv   926480055     17.087613       29.421471
27   file__6m.csv   926480055     16.481102       29.210887
28   file__6m.csv   926480055     17.106003       28.821917
29   file__6m.csv   926480055     17.174157       29.268595
30   file__4m.csv   617284424     11.478665       19.301656
31   file__4m.csv   617284424     12.427617       19.835787
32   file__4m.csv   617284424     11.447127       21.388650
33   file__4m.csv   617284424     11.674226       23.900714
34   file__4m.csv   617284424     11.752415       23.176530
35   file__2m.csv   308078962      5.566590        9.777559
36   file__2m.csv   308078962      5.376703        9.858794
37   file__2m.csv   308078962      5.507923        9.836645
38   file__2m.csv   308078962      5.774516        9.721179
39   file__2m.csv   308078962      5.569103        9.846717
40   file__1m.csv   153491820      2.429468        4.656254
41   file__1m.csv   153491820      2.388223        5.177449
42   file__1m.csv   153491820      2.744287        4.829605
43   file__1m.csv   153491820      2.580219        5.174126
44   file__1m.csv   153491820      2.678549        5.089914
\n", 500 | "
\n", 501 | " \n", 511 | " \n", 512 | " \n", 549 | "\n", 550 | " \n", 574 | "
\n", 575 | "
\n", 576 | " " 577 | ] 578 | }, 579 | "metadata": {}, 580 | "execution_count": 102 581 | } 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "source": [ 587 | "df['file_size'] = round((df['file_size'] / 1e+9), 3)" 588 | ], 589 | "metadata": { 590 | "id": "VBxKzFLjuyfL" 591 | }, 592 | "execution_count": 103, 593 | "outputs": [] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "source": [ 598 | "df = df.groupby(['file_size', 'filename'], sort = False)['shuffle_time', 'inferring_time'].mean()\n", 599 | "df" 600 | ], 601 | "metadata": { 602 | "colab": { 603 | "base_uri": "https://localhost:8080/", 604 | "height": 398 605 | }, 606 | "id": "fDBxYlEbn3TB", 607 | "outputId": "98af830a-86f4-435f-ac3e-91ec5f41b0da" 608 | }, 609 | "execution_count": 105, 610 | "outputs": [ 611 | { 612 | "output_type": "stream", 613 | "name": "stderr", 614 | "text": [ 615 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.\n", 616 | " \"\"\"Entry point for launching an IPython kernel.\n" 617 | ] 618 | }, 619 | { 620 | "output_type": "execute_result", 621 | "data": { 622 | "text/plain": [ 623 | " shuffle_time inferring_time\n", 624 | "file_size filename \n", 625 | "3.101 file__20m.csv 75.847763 114.145755\n", 626 | "2.323 file__15m.csv 45.199452 71.186871\n", 627 | "1.856 file__12m.csv 33.633671 58.212394\n", 628 | "1.545 file__10m.csv 29.623004 53.646825\n", 629 | "1.236 file__8m.csv 22.876914 43.991716\n", 630 | "0.926 file__6m.csv 16.747187 29.012797\n", 631 | "0.617 file__4m.csv 11.756010 21.520667\n", 632 | "0.308 file__2m.csv 5.558967 9.808179\n", 633 | "0.153 file__1m.csv 2.564149 4.985470" 634 | ], 635 | "text/html": [ 636 | "\n", 637 | "
\n", 638 | "
\n", 639 | "
\n", 724 | "
                         shuffle_time  inferring_time
file_size filename
3.101     file__20m.csv     75.847763      114.145755
2.323     file__15m.csv     45.199452       71.186871
1.856     file__12m.csv     33.633671       58.212394
1.545     file__10m.csv     29.623004       53.646825
1.236     file__8m.csv      22.876914       43.991716
0.926     file__6m.csv      16.747187       29.012797
0.617     file__4m.csv      11.756010       21.520667
0.308     file__2m.csv       5.558967        9.808179
0.153     file__1m.csv       2.564149        4.985470
\n", 725 | "
\n", 726 | " \n", 736 | " \n", 737 | " \n", 774 | "\n", 775 | " \n", 799 | "
\n", 800 | "
\n", 801 | " " 802 | ] 803 | }, 804 | "metadata": {}, 805 | "execution_count": 105 806 | } 807 | ] 808 | }, 809 | { 810 | "cell_type": "code", 811 | "source": [ 812 | "df.reset_index(inplace=True)" 813 | ], 814 | "metadata": { 815 | "id": "zh6NHO3csESF" 816 | }, 817 | "execution_count": 106, 818 | "outputs": [] 819 | }, 820 | { 821 | "cell_type": "code", 822 | "source": [ 823 | "import matplotlib.pyplot as plt\n", 824 | "import seaborn as sns\n", 825 | "\n", 826 | "plt.figure(figsize=(15,4))\n", 827 | "\n", 828 | "sns.set(style='white')\n", 829 | "\n", 830 | "df.set_index(['file_size', 'filename']).plot(kind='bar', stacked=True, color=sns.set_palette(\"colorblind\"))\n", 831 | "\n", 832 | "plt.title('Time taken by Shuffle Time & Inferring Time', fontsize=12)\n", 833 | "\n", 834 | "plt.xlabel('Files Sizes (Gigabytes)')\n", 835 | "plt.ylabel('Time (Seconds)')\n", 836 | "\n", 837 | "plt.xticks(rotation=90)" 838 | ], 839 | "metadata": { 840 | "colab": { 841 | "base_uri": "https://localhost:8080/", 842 | "height": 464 843 | }, 844 | "id": "ikFD4xeysAVV", 845 | "outputId": "60dbaf2d-252f-4f2d-b2a2-ec9caa86909b" 846 | }, 847 | "execution_count": 107, 848 | "outputs": [ 849 | { 850 | "output_type": "execute_result", 851 | "data": { 852 | "text/plain": [ 853 | "(array([0, 1, 2, 3, 4, 5, 6, 7, 8]),\n", 854 | " )" 855 | ] 856 | }, 857 | "metadata": {}, 858 | "execution_count": 107 859 | }, 860 | { 861 | "output_type": "display_data", 862 | "data": { 863 | "text/plain": [ 864 | "
" 865 | ] 866 | }, 867 | "metadata": {} 868 | }, 869 | { 870 | "output_type": "display_data", 871 | "data": { 872 | "text/plain": [ 873 | "
" 874 | ], 875 | "image/png": "\n" 876 | }, 877 | "metadata": {} 878 | } 879 | ] 880 | } 881 | ] 882 | }
--------------------------------------------------------------------------------
/csv_schema_inference/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Wittline/csv-schema-inference/8121193f82b02984c811b2cc794c539e28b5e5ef/csv_schema_inference/__init__.py
--------------------------------------------------------------------------------
/csv_schema_inference/csv_schema_inference.py:
--------------------------------------------------------------------------------
import mmap
import os
import multiprocessing as mp
import datetime as dt
import operator


class DetectType:

    # Raw values that mark a column as nullable.
    NULL_VALUES = {"", "na", "NA", "null", "NULL"}
    BOOLEAN_VALUES = {"true", "false", "TRUE", "FALSE", "True", "False"}

    def __init__(self, max_length, sep):
        self.max_length = max_length
        self.sep = sep

    def __get_local_type(self, value):
        # Anything float() rejects is a STRING candidate.
        try:
            number = float(value)
        except ValueError:
            return "STRING"

        if number.is_integer():
            return "INTEGER"
        else:
            return "FLOAT"

    def __get_date_type(self, value):

        if "T" in value:
            # Candidate timestamp: <date>T<time>
            segments = value.split("T")
            try:
                if len(segments) == 2:
                    valid_date = False
                    d_elements = segments[0].split("-")
                    if len(d_elements) == 3 and len(d_elements[0]) in {2, 4} and \
                            len(d_elements[1]) == 2 and len(d_elements[2]) == 2:
                        # dt.date raises ValueError for impossible dates
                        dt.date(*(int(e) for e in d_elements))
                        valid_date = True
                    t_elements = segments[1].split(":")
                    valid_time = False
                    if len(t_elements) in (2, 3):
                        # Fixed: the minutes field must be two characters long,
                        # like the hours and seconds fields (the original only
                        # tested that it was non-empty).
                        valid_time = (len(t_elements[0]) == 2 and 0 <= int(t_elements[0]) < 24 and
                                      len(t_elements[1]) == 2 and 0 <= int(t_elements[1]) < 60)
                        if len(t_elements) == 3:
                            valid_time = (valid_time and len(t_elements[2]) == 2 and
                                          0 <= int(t_elements[2]) < 60)
                    if valid_time and valid_date:
                        return "TIMESTAMP"
            except ValueError:
                return "STRING"

        elif "-" in value:
            # Candidate date: YYYY-MM-DD or YY-MM-DD
            segments = value.split("-")
            try:
                if len(segments) == 3 and len(segments[0]) in {2, 4} and \
                        len(segments[1]) == 2 and len(segments[2]) == 2:
                    dt.date(*(int(e) for e in segments))
                    return "DATE"
            except ValueError:
                return "STRING"
        else:
            # Candidate time: HH:MM or HH:MM:SS
            try:
                segments = value.split(":")
                if len(segments) in {2, 3}:
                    # Fixed: same two-character check on the minutes field.
                    valid = (len(segments[0]) == 2 and 0 <= int(segments[0]) < 24 and
                             len(segments[1]) == 2 and 0 <= int(segments[1]) < 60)
                    if len(segments) == 3:
                        valid = (valid and len(segments[2]) == 2 and
                                 0 <= int(segments[2]) < 60)
                    if valid:
                        return "TIME"
            except ValueError:
                return "STRING"

        return "STRING"

    def __infer_value_type(self, value, index, schema, values_type):

        # Fixed: flag nullability on every occurrence. The values_type cache is
        # shared by all columns, so flagging only on a cache miss marked just
        # the first column in which a null marker appeared.
        if value in self.NULL_VALUES:
            schema[index]["nullable"] = True

        # Classify each distinct value once and cache the result.
        if value not in values_type:
            local_type = self.__get_local_type(value)

            if local_type == "STRING":
                if value in self.NULL_VALUES:
                    _type = "STRING"
                elif value in self.BOOLEAN_VALUES:
                    _type = "BOOLEAN"
                elif len(value) < 21:
                    _type = self.__get_date_type(value)
                else:
                    _type = local_type
            else:
                _type = local_type

            values_type[value] = _type

        # Count the occurrence for this column.
        types_found = schema[index]["types_found"]
        if values_type[value] not in types_found:
            types_found[values_type[value]] = {"cnt": 1}
        else:
            types_found[values_type[value]]["cnt"] += 1

    def execute(self, records, schema):
        values_type = {}
        for record in records:
            values = record.rstrip().split(self.sep)
            for index, value in enumerate(values):
                self.__infer_value_type(value[0:self.max_length], index, schema, values_type)


class Parallel:

    def __init__(self):
        pass

    def execute(self, records, x, obj, d_schema):
        # Runs in a worker process on a pickled copy of d_schema.
        obj.execute(records, d_schema)
        return d_schema

    def parallel(self, records, obj, d_schema):

        # Leave two cores free when possible.
        cpus = (mp.cpu_count() - 2)

        if cpus <= 0:
            cpus = mp.cpu_count()

        chunk_size = len(records) / cpus

        if chunk_size < 1:
            # Fixed: keep at least one worker; mp.Pool rejects processes=0.
            cpus = max(1, int(chunk_size * 10))
            chunk_size = 1
        else:
            chunk_size = round(chunk_size)

        pool = mp.Pool(processes=cpus)

        results = [pool.apply_async(self.execute, args=(records[x:x+chunk_size], x, obj, d_schema))
                   for x in range(0, len(records), chunk_size)]
        pool.close()
        pool.join()

        return [p.get() for p in results]


class CsvSchemaInference:

    # conditions defaults to None instead of a shared mutable {}.
    def __init__(self, portion=0.5, max_length=1000, batch_size=250000, acc=0.7,
                 seed=1, header=True, sep=";", conditions=None):
        self.portion = portion
        self.seed = seed
        self.header = header
        self.sep = sep
        self.accuracy = acc
        self.__schema = {}
        self.max_length = max_length
        self.data_types = {"STRING", "INTEGER", "FLOAT", "DATETIME", "DATE", "TIME", "TIMESTAMP", "BOOLEAN"}
        self.batch_size = batch_size

        if conditions is None:
            conditions = {}

        if isinstance(conditions, dict):
            for k, v in conditions.items():
                if k not in self.data_types or v not in self.data_types:
                    raise ValueError('Keys and values in conditions must be valid data types')

        self.conditions = conditions

    def __set_header(self, header):

        header = header.rstrip().split(self.sep)
        for i in range(0, len(header)):
            self.__schema[i] = {
                "_name": header[i].replace('"', ''),
                "types_found": {},
                "nullable": False,
                "type": ""
            }

    def __estimate_count(self, filename, reader):
        # Estimate the line count from the average line length in the first 8 KiB.
        buffer = reader.read(1 << 13)
        file_size = os.path.getsize(filename)
        newlines = buffer.count(b'\n')
        if newlines == 0:
            # Fixed: avoid dividing by zero; treat the file as a single line.
            return 1
        return file_size // (len(buffer) // newlines)

    def __merge_schemas(self, schemas):

        for c_inx in self.__schema:

            for s_inx in range(0, len(schemas)):

                _v = schemas[s_inx][c_inx]

                if _v['nullable']:
                    self.__schema[c_inx]['nullable'] = True

                for k in _v['types_found']:

                    if k not in self.__schema[c_inx]['types_found'].keys():
                        self.__schema[c_inx]['types_found'][k] = {
                            "cnt": _v['types_found'][k]['cnt']
                        }
                    else:
                        self.__schema[c_inx]['types_found'][k]['cnt'] += _v['types_found'][k]['cnt']

    def check_condition(self, _types, acc):

        try:
            # Most frequent type among those reaching the accuracy threshold;
            # max() raises ValueError when no type qualifies.
            _type = max({k: v for k, v in _types.items() if v >= (acc * 100)}.items(),
                        key=operator.itemgetter(1))[0]

            if _type in self.conditions:
                if self.conditions[_type] in _types:
                    _type = self.conditions[_type]

        except ValueError:

            if "STRING" in _types or len(_types) > 2:
                _type = "STRING"
            else:
                if {"INTEGER", "FLOAT"}.issubset(_types):
                    _type = "FLOAT"
                else:
                    _type = "STRING"

        return _type

    def __approximate_types(self, acc=0.5):

        result = {}
        for c in self.__schema:
            _types = {}
            t = 0
            for v in self.__schema[c]['types_found']:
                t += self.__schema[c]['types_found'][v]['cnt']
                if v not in _types.keys():
                    _types[v] = self.__schema[c]['types_found'][v]['cnt']
                else:
                    _types[v] += self.__schema[c]['types_found'][v]['cnt']

            # Convert counts to percentages of the sampled values.
            for ft in _types:
                _types[ft] = (_types[ft] * 100) / t

            _type = self.check_condition(_types, acc)

            self.__schema[c]['type'] = _type

            result[c] = {
                "name": self.__schema[c]['_name'],
                "type": _type,
                "nullable": self.__schema[c]['nullable']
            }

        return result

    def pretty(self, d, ind=0):

        for k, v in d.items():
            print('\t' * ind + str(k))
            if isinstance(v, dict):
                self.pretty(v, ind+1)
            else:
                print('\t' * (ind+1) + str(v))

    def get_schema_columns(self, columns=None):

        # columns defaults to None instead of a shared mutable {}.
        if columns is None:
            columns = {}

        result = {}

        for c in self.__schema:
            if self.__schema[c]["_name"] in columns:
                result[c] = {
                    "_name": self.__schema[c]["_name"],
                    "types_found": self.__schema[c]["types_found"],
                    "nullable": self.__schema[c]["nullable"],
                    "type": self.__schema[c]["type"]
                }

        return result

    def explore_schema_column(self, column):

        result = {}

        for c in self.__schema:

            if column == self.__schema[c]['_name']:

                _types = {}
                t = 0
                for v in self.__schema[c]['types_found']:
                    t += self.__schema[c]['types_found'][v]['cnt']

                    if v not in _types.keys():
                        _types[v] = self.__schema[c]['types_found'][v]['cnt']
                    else:
                        _types[v] += self.__schema[c]['types_found'][v]['cnt']

                for ft in _types:
                    _types[ft] = (_types[ft] * 100) / t

                result[c] = {
                    "name": self.__schema[c]['_name'],
                    "types_found": _types,
                    "nullable": self.__schema[c]['nullable']
                }

                break

        return result

    def run_inference(self, filename):

        with open(filename, mode="r", encoding="ISO-8859-1") as file_obj:

            with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as map_file:

                less_header = 0

                if self.header:
                    less_header = 1

                no_lines = self.__estimate_count(filename, map_file) - less_header
                portion = int(no_lines * self.portion)
                map_file.seek(0)

                if self.header:
                    self.__set_header(map_file.readline().decode("ISO-8859-1"))

                lines = []
                schemas = []
                batch_count = 0

                dtype = DetectType(self.max_length, self.sep)

                while batch_count < portion:

                    line = map_file.readline()
                    if not line:
                        # Fixed: stop at end of file; the line count is only an
                        # estimate and may exceed the real number of lines.
                        break

                    batch_count += 1
                    lines.append(line.decode("ISO-8859-1"))

                    if batch_count % self.batch_size == 0:

                        prl = Parallel()
                        schemas_result = prl.parallel(records=lines, obj=dtype, d_schema=self.__schema)

                        for schema in schemas_result:
                            schemas.append(schema)

                        lines = []

                # Process the final, partially filled batch.
                if len(lines) > 0:

                    prl = Parallel()
                    schemas_result = prl.parallel(records=lines, obj=dtype, d_schema=self.__schema)

                    for schema in schemas_result:
                        schemas.append(schema)

                del lines
                del batch_count

        # Joining schemas results
        self.__merge_schemas(schemas)

        # Approximate data types
        return self.__approximate_types(acc=self.accuracy)
--------------------------------------------------------------------------------
/googled57bdb220576a44a.html:
--------------------------------------------------------------------------------
google-site-verification: googled57bdb220576a44a.html
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
[build-system]
requires = [
    "setuptools>=42",
    "wheel"
]
build-backend = "setuptools.build_meta"
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
[metadata]
name = csv-schema-inference
version = 0.0.9
author = Ramses Alexander Coraspe Valdez
author_email = contacto@wittline.com
description = A tool to automatically infer columns data types in .csv files
long_description = file: README.md
long_description_content_type = text/markdown
url = https://github.com/Wittline/csv-schema-inference
classifiers =
    Programming Language :: Python :: 3
    License :: OSI Approved :: MIT License
    Operating System :: OS Independent

[options]
packages = find:
python_requires = >=3.7
include_package_data = False
--------------------------------------------------------------------------------
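The final step of the inference, choosing one type per column from the per-type counts, can be hard to follow inside the class. The sketch below restates that voting rule as a standalone function so it can be tried without a CSV file. It is a simplified illustration rather than the package's API: the name `resolve_column_type` and its signature are hypothetical, and only the threshold, override and fallback rules are taken from `check_condition` and `__approximate_types` above.

```python
import operator


def resolve_column_type(type_counts, acc=0.7, conditions=None):
    """Pick a column type from {type_name: count} (hypothetical helper)."""
    conditions = conditions or {}
    total = sum(type_counts.values())
    # Share of each detected type, as a percentage of the sampled values.
    shares = {t: (n * 100) / total for t, n in type_counts.items()}

    # Types whose share reaches the accuracy threshold.
    candidates = {t: s for t, s in shares.items() if s >= acc * 100}
    if candidates:
        winner = max(candidates.items(), key=operator.itemgetter(1))[0]
        # Optional override, applied only if the replacement type was also seen.
        override = conditions.get(winner)
        if override in shares:
            winner = override
        return winner

    # No type is dominant: fall back to FLOAT for a pure INTEGER/FLOAT mix,
    # and to STRING otherwise.
    if "STRING" in shares or len(shares) > 2:
        return "STRING"
    if {"INTEGER", "FLOAT"}.issubset(shares):
        return "FLOAT"
    return "STRING"


if __name__ == "__main__":
    print(resolve_column_type({"INTEGER": 80, "FLOAT": 20}))
    print(resolve_column_type({"INTEGER": 80, "FLOAT": 20},
                              conditions={"INTEGER": "FLOAT"}))
    print(resolve_column_type({"INTEGER": 50, "FLOAT": 50}))
```

The `conditions` mapping lets a caller widen a dominant type, for example treating a mostly-integer column as FLOAT as soon as any float value was sampled.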