├── .github
│   └── ISSUE_TEMPLATE
│       ├── bug_report.md
│       └── feature_request.md
├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── README.md
├── _config.yml
├── benchmark
│   └── Benchmark.ipynb
├── csv_schema_inference
│   ├── __init__.py
│   └── csv_schema_inference.py
├── googled57bdb220576a44a.html
├── pyproject.toml
└── setup.cfg
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report
3 | about: Create a report to help us improve
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Describe the bug**
11 | A clear and concise description of what the bug is.
12 |
13 | **To Reproduce**
14 | Steps to reproduce the behavior:
15 | 1. Go to '...'
16 | 2. Click on '....'
17 | 3. Scroll down to '....'
18 | 4. See error
19 |
20 | **Expected behavior**
21 | A clear and concise description of what you expected to happen.
22 |
23 | **Screenshots**
24 | If applicable, add screenshots to help explain your problem.
25 |
26 | **Desktop (please complete the following information):**
27 | - OS: [e.g. iOS]
28 | - Browser [e.g. chrome, safari]
29 | - Version [e.g. 22]
30 |
31 | **Smartphone (please complete the following information):**
32 | - Device: [e.g. iPhone6]
33 | - OS: [e.g. iOS8.1]
34 | - Browser [e.g. stock browser, safari]
35 | - Version [e.g. 22]
36 |
37 | **Additional context**
38 | Add any other context about the problem here.
39 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Feature request
3 | about: Suggest an idea for this project
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Is your feature request related to a problem? Please describe.**
11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
12 |
13 | **Describe the solution you'd like**
14 | A clear and concise description of what you want to happen.
15 |
16 | **Describe alternatives you've considered**
17 | A clear and concise description of any alternative solutions or features you've considered.
18 |
19 | **Additional context**
20 | Add any other context or screenshots about the feature request here.
21 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | pip-wheel-metadata/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | *.py,cover
51 | .hypothesis/
52 | .pytest_cache/
53 |
54 | # Translations
55 | *.mo
56 | *.pot
57 |
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 |
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 |
68 | # Scrapy stuff:
69 | .scrapy
70 |
71 | # Sphinx documentation
72 | docs/_build/
73 |
74 | # PyBuilder
75 | target/
76 |
77 | # Jupyter Notebook
78 | .ipynb_checkpoints
79 |
80 | # IPython
81 | profile_default/
82 | ipython_config.py
83 |
84 | # pyenv
85 | .python-version
86 |
87 | # pipenv
88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
91 | # install all needed dependencies.
92 | #Pipfile.lock
93 |
94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
95 | __pypackages__/
96 |
97 | # Celery stuff
98 | celerybeat-schedule
99 | celerybeat.pid
100 |
101 | # SageMath parsed files
102 | *.sage.py
103 |
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 |
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 |
117 | # Rope project settings
118 | .ropeproject
119 |
120 | # mkdocs documentation
121 | /site
122 |
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 |
128 | # Pyre type checker
129 | .pyre/
130 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | We as members, contributors, and leaders pledge to make participation in our
6 | community a harassment-free experience for everyone, regardless of age, body
7 | size, visible or invisible disability, ethnicity, sex characteristics, gender
8 | identity and expression, level of experience, education, socio-economic status,
9 | nationality, personal appearance, race, religion, or sexual identity
10 | and orientation.
11 |
12 | We pledge to act and interact in ways that contribute to an open, welcoming,
13 | diverse, inclusive, and healthy community.
14 |
15 | ## Our Standards
16 |
17 | Examples of behavior that contributes to a positive environment for our
18 | community include:
19 |
20 | * Demonstrating empathy and kindness toward other people
21 | * Being respectful of differing opinions, viewpoints, and experiences
22 | * Giving and gracefully accepting constructive feedback
23 | * Accepting responsibility and apologizing to those affected by our mistakes,
24 | and learning from the experience
25 | * Focusing on what is best not just for us as individuals, but for the
26 | overall community
27 |
28 | Examples of unacceptable behavior include:
29 |
30 | * The use of sexualized language or imagery, and sexual attention or
31 | advances of any kind
32 | * Trolling, insulting or derogatory comments, and personal or political attacks
33 | * Public or private harassment
34 | * Publishing others' private information, such as a physical or email
35 | address, without their explicit permission
36 | * Other conduct which could reasonably be considered inappropriate in a
37 | professional setting
38 |
39 | ## Enforcement Responsibilities
40 |
41 | Community leaders are responsible for clarifying and enforcing our standards of
42 | acceptable behavior and will take appropriate and fair corrective action in
43 | response to any behavior that they deem inappropriate, threatening, offensive,
44 | or harmful.
45 |
46 | Community leaders have the right and responsibility to remove, edit, or reject
47 | comments, commits, code, wiki edits, issues, and other contributions that are
48 | not aligned to this Code of Conduct, and will communicate reasons for moderation
49 | decisions when appropriate.
50 |
51 | ## Scope
52 |
53 | This Code of Conduct applies within all community spaces, and also applies when
54 | an individual is officially representing the community in public spaces.
55 | Examples of representing our community include using an official e-mail address,
56 | posting via an official social media account, or acting as an appointed
57 | representative at an online or offline event.
58 |
59 | ## Enforcement
60 |
61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
62 | reported to the community leaders responsible for enforcement at
63 | .
64 | All complaints will be reviewed and investigated promptly and fairly.
65 |
66 | All community leaders are obligated to respect the privacy and security of the
67 | reporter of any incident.
68 |
69 | ## Enforcement Guidelines
70 |
71 | Community leaders will follow these Community Impact Guidelines in determining
72 | the consequences for any action they deem in violation of this Code of Conduct:
73 |
74 | ### 1. Correction
75 |
76 | **Community Impact**: Use of inappropriate language or other behavior deemed
77 | unprofessional or unwelcome in the community.
78 |
79 | **Consequence**: A private, written warning from community leaders, providing
80 | clarity around the nature of the violation and an explanation of why the
81 | behavior was inappropriate. A public apology may be requested.
82 |
83 | ### 2. Warning
84 |
85 | **Community Impact**: A violation through a single incident or series
86 | of actions.
87 |
88 | **Consequence**: A warning with consequences for continued behavior. No
89 | interaction with the people involved, including unsolicited interaction with
90 | those enforcing the Code of Conduct, for a specified period of time. This
91 | includes avoiding interactions in community spaces as well as external channels
92 | like social media. Violating these terms may lead to a temporary or
93 | permanent ban.
94 |
95 | ### 3. Temporary Ban
96 |
97 | **Community Impact**: A serious violation of community standards, including
98 | sustained inappropriate behavior.
99 |
100 | **Consequence**: A temporary ban from any sort of interaction or public
101 | communication with the community for a specified period of time. No public or
102 | private interaction with the people involved, including unsolicited interaction
103 | with those enforcing the Code of Conduct, is allowed during this period.
104 | Violating these terms may lead to a permanent ban.
105 |
106 | ### 4. Permanent Ban
107 |
108 | **Community Impact**: Demonstrating a pattern of violation of community
109 | standards, including sustained inappropriate behavior, harassment of an
110 | individual, or aggression toward or disparagement of classes of individuals.
111 |
112 | **Consequence**: A permanent ban from any sort of public interaction within
113 | the community.
114 |
115 | ## Attribution
116 |
117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage],
118 | version 2.0, available at
119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
120 |
121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct
122 | enforcement ladder](https://github.com/mozilla/diversity).
123 |
124 | [homepage]: https://www.contributor-covenant.org
125 |
126 | For answers to common questions about this code of conduct, see the FAQ at
127 | https://www.contributor-covenant.org/faq. Translations are available at
128 | https://www.contributor-covenant.org/translations.
129 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | CONTRIBUTING.md
2 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Ramses Alexander Coraspe Valdez
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | global-include *.*
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # **Csv Schema Inference**
2 | A tool to automatically infer column data types in .csv files
3 |
4 | ### Check the article here: Building a Schema Inference Data Pipeline for Large CSV files
5 |
6 |
7 |
10 |
11 |
12 |
13 |
14 |
15 | ## **Installing csv-schema-inference** 🔧
16 |
17 |
18 |
19 |
20 |
21 | ``` shell
22 | pip install csv-schema-inference
23 | ```
24 |
25 |
26 |
27 | Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
28 | Collecting csv-schema-inference
29 | Downloading csv_schema_inference-0.0.9-py3-none-any.whl (7.3 kB)
30 | Installing collected packages: csv-schema-inference
31 | Successfully installed csv-schema-inference-0.0.9
32 |
33 |
34 |
35 |
36 |
37 |
38 |
39 | ## **Importing csv-schema-inference library** ⚡
40 |
41 |
42 |
43 |
44 |
45 | ``` python
46 | from csv_schema_inference import csv_schema_inference
47 | ```
48 |
49 |
50 |
51 |
52 |
53 | ## **Setting csv-schema-inference configuration** ✍
54 |
55 |
56 |
57 |
58 |
59 | ``` python
60 |
61 | # If the inferred data type is INTEGER and FLOAT values are also present in the results, the result will be FLOAT
62 | conditions = {"INTEGER":"FLOAT"}
63 |
64 | csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size = 200000, acc = 0.8, seed=2, header=True, sep=",", conditions = conditions)
65 | pathfile = "/content/file__500k.csv"
66 | ```
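
One way to read the `conditions` mapping is as a promotion rule applied after the types in a column have been counted. The sketch below is only an illustration of that reading, not the library's internal code: `resolve_type` is a hypothetical helper, and the counts are the ones reported for `test_column` later in this README.

```python
def resolve_type(types_found, conditions):
    """Return the dominant type, promoted when a mapped type was also observed.

    types_found: dict mapping a type name to the number of sampled values of that type.
    conditions: dict mapping a type to the type that overrides it, e.g. {"INTEGER": "FLOAT"}.
    Hypothetical helper: it illustrates the rule described in the comment above,
    it is not taken from csv_schema_inference.
    """
    # The most frequently observed type is the starting candidate.
    dominant = max(types_found, key=types_found.get)
    # Promote it only if a rule exists and the target type was actually seen.
    override = conditions.get(dominant)
    if override is not None and types_found.get(override, 0) > 0:
        return override
    return dominant


counts = {"INTEGER": 406130, "FLOAT": 50964}
print(resolve_type(counts, {"INTEGER": "FLOAT"}))  # FLOAT
print(resolve_type(counts, {}))                    # INTEGER
```

With the rule in place the column resolves to FLOAT even though most sampled values are integers, which matches the `type` shown for `test_column`.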
67 |
68 |
69 |
70 |
71 |
72 | ## **Run inference** 🏃
73 |
74 |
75 |
76 |
77 |
78 | ``` python
79 | aprox_schema = csv_infer.run_inference(pathfile)
80 | ```
81 |
82 |
83 |
84 |
85 |
86 | ## **Showing the approximate data type inference for each column** 🔍
87 |
88 |
89 |
90 |
91 |
92 | ``` python
93 | csv_infer.pretty(aprox_schema)
94 | ```
95 |
96 |
97 |
98 | 0
99 | name
100 | id
101 | type
102 | INTEGER
103 | nullable
104 | False
105 | 1
106 | name
107 | full_name
108 | type
109 | STRING
110 | nullable
111 | True
112 | 2
113 | name
114 | age
115 | type
116 | INTEGER
117 | nullable
118 | False
119 | 3
120 | name
121 | city
122 | type
123 | STRING
124 | nullable
125 | True
126 | 4
127 | name
128 | weight
129 | type
130 | FLOAT
131 | nullable
132 | False
133 | 5
134 | name
135 | height
136 | type
137 | FLOAT
138 | nullable
139 | False
140 | 6
141 | name
142 | isActive
143 | type
144 | BOOLEAN
145 | nullable
146 | False
147 | 7
148 | name
149 | col_int1
150 | type
151 | INTEGER
152 | nullable
153 | False
154 | 8
155 | name
156 | col_int2
157 | type
158 | INTEGER
159 | nullable
160 | False
161 | 9
162 | name
163 | col_int3
164 | type
165 | INTEGER
166 | nullable
167 | False
168 | 10
169 | name
170 | col_float1
171 | type
172 | FLOAT
173 | nullable
174 | False
175 | 11
176 | name
177 | col_float2
178 | type
179 | FLOAT
180 | nullable
181 | False
182 | 12
183 | name
184 | col_float3
185 | type
186 | FLOAT
187 | nullable
188 | False
189 | 13
190 | name
191 | col_float4
192 | type
193 | FLOAT
194 | nullable
195 | False
196 | 14
197 | name
198 | col_float5
199 | type
200 | FLOAT
201 | nullable
202 | False
203 | 15
204 | name
205 | col_float6
206 | type
207 | FLOAT
208 | nullable
209 | False
210 | 16
211 | name
212 | col_float7
213 | type
214 | FLOAT
215 | nullable
216 | False
217 | 17
218 | name
219 | col_float8
220 | type
221 | FLOAT
222 | nullable
223 | False
224 | 18
225 | name
226 | col_float9
227 | type
228 | FLOAT
229 | nullable
230 | False
231 | 19
232 | name
233 | col_float10
234 | type
235 | FLOAT
236 | nullable
237 | False
238 | 20
239 | name
240 | test_column
241 | type
242 | FLOAT
243 | nullable
244 | False
245 |
246 |
247 |
248 |
249 |
250 |
251 |
252 | ## **Checking schema values for specific columns** ✔
253 |
254 |
255 |
256 |
257 |
258 | ``` python
259 | result = csv_infer.get_schema_columns(columns = {"test_column"})
260 | csv_infer.pretty(result)
261 | ```
262 |
263 |
264 |
265 | 20
266 | _name
267 | test_column
268 | types_found
269 | INTEGER
270 | cnt
271 | 406130
272 | FLOAT
273 | cnt
274 | 50964
275 | nullable
276 | False
277 | type
278 | FLOAT
279 |
280 |
281 |
282 |
283 |
284 |
285 |
286 | ## **Explore all possible data types for a specific column** ✅
287 |
288 |
289 |
290 |
291 |
292 | ``` python
293 | result = csv_infer.explore_schema_column(column = "test_column")
294 | csv_infer.pretty(result)
295 | ```
296 |
297 |
298 |
299 | 20
300 | name
301 | test_column
302 | types_found
303 | INTEGER
304 | 88.85043339006856
305 | FLOAT
306 | 11.149566609931437
307 | nullable
308 | False
309 |
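
The percentages above appear to be each type's share of the counted values reported by `get_schema_columns`. That is an observation drawn from the two outputs, not a statement about the library's internals. A minimal sketch of the arithmetic, where `to_percentages` is a hypothetical helper:

```python
def to_percentages(type_counts):
    """Convert per-type counts into percentages of the total number of counted values."""
    total = sum(type_counts.values())
    return {name: 100.0 * cnt / total for name, cnt in type_counts.items()}


# Counts shown for test_column in the previous section.
counts = {"INTEGER": 406130, "FLOAT": 50964}
for name, pct in to_percentages(counts).items():
    print(f"{name}: {pct:.4f}%")
# INTEGER: 88.8504%
# FLOAT: 11.1496%
```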
310 |
311 |
312 |
313 |
314 | ## Benchmark
315 | The tests used 9 .csv files with 21 columns each and different sizes and record counts. For each file, the shuffle time and the inferring time were averaged over 5 executions.
316 |
317 | - file__20m.csv: 20 million records
318 | - file__15m.csv: 15 million records
319 | - file__12m.csv: 12 million records
320 | - file__10m.csv: 10 million records
321 | - And so on...
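
The averaging described above can be sketched with the standard library alone. The five measurements are the `file__1m.csv` rows recorded in `benchmark/Benchmark.ipynb`. `average_by_file` is a hypothetical helper, and the times are assumed to be in seconds.

```python
from collections import defaultdict
from statistics import mean

# The five file__1m.csv runs from benchmark/Benchmark.ipynb.
runs = [
    {"filename": "file__1m.csv", "shuffle_time": 2.42946810, "inferring_time": 4.65625420},
    {"filename": "file__1m.csv", "shuffle_time": 2.38822270, "inferring_time": 5.17744930},
    {"filename": "file__1m.csv", "shuffle_time": 2.74428740, "inferring_time": 4.82960490},
    {"filename": "file__1m.csv", "shuffle_time": 2.58021890, "inferring_time": 5.17412620},
    {"filename": "file__1m.csv", "shuffle_time": 2.67854850, "inferring_time": 5.08991410},
]


def average_by_file(rows):
    """Group runs by filename and average each timing column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["filename"]].append(row)
    return {
        name: {
            "shuffle_time": mean(r["shuffle_time"] for r in group),
            "inferring_time": mean(r["inferring_time"] for r in group),
        }
        for name, group in grouped.items()
    }


for name, avg in average_by_file(runs).items():
    print(f"{name}: shuffle {avg['shuffle_time']:.2f}, inferring {avg['inferring_time']:.2f}")
# file__1m.csv: shuffle 2.56, inferring 4.99
```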
322 |
323 | To learn more about the shuffling process, see this other repository: A tool to automatically Shuffle lines in .csv files. Shuffling helps us to:
324 |
325 | 1. Increase the probability of finding all the data types present in a single column.
326 | 2. Avoid iterating over the entire dataset.
327 | 3. Avoid biases in the data that may be part of its organic behavior and that we cannot anticipate without knowing how the data was constructed.
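
The first point can be illustrated with a toy column in which every FLOAT value sits at the end of the file. This is a sketch under stated assumptions, not the library's sampling code: `classify` and `types_in_sample` are hypothetical helpers, and the 50% portion and the seed are arbitrary choices for the demonstration.

```python
import random


def classify(value):
    """Tiny type detector: INTEGER if int() parses the text, otherwise FLOAT."""
    try:
        int(value)
        return "INTEGER"
    except ValueError:
        return "FLOAT"


def types_in_sample(values, portion):
    """Return the set of types found in the leading `portion` of the values."""
    cutoff = int(len(values) * portion)
    return {classify(v) for v in values[:cutoff]}


# 900 integers followed by 100 floats: reading only the head never reaches a float.
column = [str(i) for i in range(900)] + [f"{i}.5" for i in range(100)]
head_types = types_in_sample(column, portion=0.5)

# Shuffle a copy with a fixed seed, then read the same portion.
shuffled = column[:]
random.Random(2).shuffle(shuffled)
shuffled_types = types_in_sample(shuffled, portion=0.5)

print(sorted(head_types))      # ['INTEGER']
print(sorted(shuffled_types))  # ['FLOAT', 'INTEGER']
```

Reading the first half of the sorted column reports only INTEGER, while the same portion of the shuffled column also finds FLOAT without scanning every row.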
328 |
329 |
330 |
333 |
334 |
335 | ## Contributing and Feedback
336 | Any ideas or feedback about this repository? Help me improve it.
337 |
338 | ## Authors
339 | - Created by Ramses Alexander Coraspe Valdez
340 | - Created in 2022
341 |
342 | ## License
343 | This project is licensed under the terms of the MIT License.
344 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-cayman
2 | title: "A tool to automatically infer column data types in .csv files"
3 | description: "A parallel implementation of schema inference using Python"
4 | author: "Ramses Alexander Coraspe Valdez"
5 |
--------------------------------------------------------------------------------
/benchmark/Benchmark.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Benchmark.ipynb",
7 | "provenance": [],
8 | "collapsed_sections": []
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | },
14 | "language_info": {
15 | "name": "python"
16 | }
17 | },
18 | "cells": [
19 | {
20 | "cell_type": "code",
21 | "source": [
22 | "import pandas as pd"
23 | ],
24 | "metadata": {
25 | "id": "NrcJWz22npeq"
26 | },
27 | "execution_count": 99,
28 | "outputs": []
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 100,
33 | "metadata": {
34 | "id": "XyKNLam6m4Vv"
35 | },
36 | "outputs": [],
37 | "source": [
38 | "benchmark_data = [ \n",
39 | " {'filename': 'file__20m.csv', 'file_size': 3100848423, 'shuffle_time': 73.00043550, 'inferring_time': 111.56356970},\n",
40 | " {'filename': 'file__20m.csv', 'file_size': 3100848423, 'shuffle_time': 72.97213800, 'inferring_time': 115.25191430},\n",
41 | " {'filename': 'file__20m.csv', 'file_size': 3100848423, 'shuffle_time': 82.32063370, 'inferring_time': 116.76299740},\n",
42 | " {'filename': 'file__20m.csv', 'file_size': 3100848423, 'shuffle_time': 77.67622630, 'inferring_time': 114.59385790},\n",
43 | " {'filename': 'file__20m.csv', 'file_size': 3100848423, 'shuffle_time': 73.26938180, 'inferring_time': 112.55643420},\n",
44 | " {'filename': 'file__15m.csv', 'file_size': 2322887546, 'shuffle_time': 55.82634800, 'inferring_time': 74.62251340},\n",
45 | " {'filename': 'file__15m.csv', 'file_size': 2322887546, 'shuffle_time': 42.93429800, 'inferring_time': 71.26189710},\n",
46 | " {'filename': 'file__15m.csv', 'file_size': 2322887546, 'shuffle_time': 42.87042450, 'inferring_time': 69.13962730},\n",
47 | " {'filename': 'file__15m.csv', 'file_size': 2322887546, 'shuffle_time': 42.14651010, 'inferring_time': 71.23978310},\n",
48 | " {'filename': 'file__15m.csv', 'file_size': 2322887546, 'shuffle_time': 42.21968120, 'inferring_time': 69.67053280},\n",
49 | " {'filename': 'file__12m.csv', 'file_size': 1856118441, 'shuffle_time': 32.48983010, 'inferring_time': 58.08111770},\n",
50 | " {'filename': 'file__12m.csv', 'file_size': 1856118441, 'shuffle_time': 34.64318280, 'inferring_time': 57.98930810},\n",
51 | " {'filename': 'file__12m.csv', 'file_size': 1856118441, 'shuffle_time': 34.85442540, 'inferring_time': 57.71942010},\n",
52 | " {'filename': 'file__12m.csv', 'file_size': 1856118441, 'shuffle_time': 33.38362710, 'inferring_time': 59.86055910},\n",
53 | " {'filename': 'file__12m.csv', 'file_size': 1856118441, 'shuffle_time': 32.79728820, 'inferring_time': 57.41156370},\n",
54 | " {'filename': 'file__10m.csv', 'file_size': 1544899668, 'shuffle_time': 28.28831460, 'inferring_time': 53.78283170},\n",
55 | " {'filename': 'file__10m.csv', 'file_size': 1544899668, 'shuffle_time': 30.25130520, 'inferring_time': 51.21287500},\n",
56 | " {'filename': 'file__10m.csv', 'file_size': 1544899668, 'shuffle_time': 29.83213370, 'inferring_time': 53.01958860},\n",
57 | " {'filename': 'file__10m.csv', 'file_size': 1544899668, 'shuffle_time': 30.21982290, 'inferring_time': 51.81474830},\n",
58 | " {'filename': 'file__10m.csv', 'file_size': 1544899668, 'shuffle_time': 29.52344140, 'inferring_time': 58.40408200},\n",
59 | " {'filename': 'file__8m.csv', 'file_size': 1235682644, 'shuffle_time': 22.60465530, 'inferring_time': 44.68717590},\n",
60 | " {'filename': 'file__8m.csv', 'file_size': 1235682644, 'shuffle_time': 23.84743100, 'inferring_time': 42.68867510},\n",
61 | " {'filename': 'file__8m.csv', 'file_size': 1235682644, 'shuffle_time': 22.94851320, 'inferring_time': 46.96807710},\n",
62 | " {'filename': 'file__8m.csv', 'file_size': 1235682644, 'shuffle_time': 22.77527450, 'inferring_time': 42.62858490},\n",
63 | " {'filename': 'file__8m.csv', 'file_size': 1235682644, 'shuffle_time': 22.20869720, 'inferring_time': 42.98606580},\n",
64 | " {'filename': 'file__6m.csv', 'file_size': 926480055, 'shuffle_time': 15.88705860, 'inferring_time': 28.34111610},\n",
65 | " {'filename': 'file__6m.csv', 'file_size': 926480055, 'shuffle_time': 17.08761300, 'inferring_time': 29.42147060},\n",
66 | " {'filename': 'file__6m.csv', 'file_size': 926480055, 'shuffle_time': 16.48110200, 'inferring_time': 29.21088670},\n",
67 | " {'filename': 'file__6m.csv', 'file_size': 926480055, 'shuffle_time': 17.10600270, 'inferring_time': 28.82191680},\n",
68 | " {'filename': 'file__6m.csv', 'file_size': 926480055, 'shuffle_time': 17.17415740, 'inferring_time': 29.26859480},\n",
69 | " {'filename': 'file__4m.csv', 'file_size': 617284424, 'shuffle_time': 11.47866530, 'inferring_time': 19.30165580},\n",
70 | " {'filename': 'file__4m.csv', 'file_size': 617284424, 'shuffle_time': 12.42761710, 'inferring_time': 19.83578670},\n",
71 | " {'filename': 'file__4m.csv', 'file_size': 617284424, 'shuffle_time': 11.44712670, 'inferring_time': 21.38865030},\n",
72 | " {'filename': 'file__4m.csv', 'file_size': 617284424, 'shuffle_time': 11.67422640, 'inferring_time': 23.90071370},\n",
73 | " {'filename': 'file__4m.csv', 'file_size': 617284424, 'shuffle_time': 11.75241490, 'inferring_time': 23.17653020},\n",
74 | " {'filename': 'file__2m.csv', 'file_size': 308078962, 'shuffle_time': 5.56659010, 'inferring_time': 9.77755900},\n",
75 | " {'filename': 'file__2m.csv', 'file_size': 308078962, 'shuffle_time': 5.37670290, 'inferring_time': 9.85879350},\n",
76 | " {'filename': 'file__2m.csv', 'file_size': 308078962, 'shuffle_time': 5.50792340, 'inferring_time': 9.83664550},\n",
77 | " {'filename': 'file__2m.csv', 'file_size': 308078962, 'shuffle_time': 5.77451570, 'inferring_time': 9.72117910},\n",
78 | " {'filename': 'file__2m.csv', 'file_size': 308078962, 'shuffle_time': 5.56910340, 'inferring_time': 9.84671710},\n",
79 | " {'filename': 'file__1m.csv', 'file_size': 153491820, 'shuffle_time': 2.42946810, 'inferring_time': 4.65625420},\n",
80 | " {'filename': 'file__1m.csv', 'file_size': 153491820, 'shuffle_time': 2.38822270, 'inferring_time': 5.17744930},\n",
81 | " {'filename': 'file__1m.csv', 'file_size': 153491820, 'shuffle_time': 2.74428740, 'inferring_time': 4.82960490},\n",
82 | " {'filename': 'file__1m.csv', 'file_size': 153491820, 'shuffle_time': 2.58021890, 'inferring_time': 5.17412620},\n",
83 | " {'filename': 'file__1m.csv', 'file_size': 153491820, 'shuffle_time': 2.67854850, 'inferring_time': 5.08991410}\n",
84 | "]"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "source": [
90 | "df = pd.DataFrame(benchmark_data)\n",
91 | "df"
92 | ],
93 | "metadata": {
94 | "colab": {
95 | "base_uri": "https://localhost:8080/",
96 | "height": 1000
97 | },
98 | "id": "R9ABJG7qnohL",
99 | "outputId": "fb7f4f3c-6542-44cb-9d1c-4ce44d9537bb"
100 | },
101 | "execution_count": 102,
102 | "outputs": [
103 | {
104 | "output_type": "execute_result",
105 | "data": {
106 | "text/plain": [
107 | " filename file_size shuffle_time inferring_time\n",
108 | "0 file__20m.csv 3100848423 73.000435 111.563570\n",
109 | "1 file__20m.csv 3100848423 72.972138 115.251914\n",
110 | "2 file__20m.csv 3100848423 82.320634 116.762997\n",
111 | "3 file__20m.csv 3100848423 77.676226 114.593858\n",
112 | "4 file__20m.csv 3100848423 73.269382 112.556434\n",
113 | "5 file__15m.csv 2322887546 55.826348 74.622513\n",
114 | "6 file__15m.csv 2322887546 42.934298 71.261897\n",
115 | "7 file__15m.csv 2322887546 42.870424 69.139627\n",
116 | "8 file__15m.csv 2322887546 42.146510 71.239783\n",
117 | "9 file__15m.csv 2322887546 42.219681 69.670533\n",
118 | "10 file__12m.csv 1856118441 32.489830 58.081118\n",
119 | "11 file__12m.csv 1856118441 34.643183 57.989308\n",
120 | "12 file__12m.csv 1856118441 34.854425 57.719420\n",
121 | "13 file__12m.csv 1856118441 33.383627 59.860559\n",
122 | "14 file__12m.csv 1856118441 32.797288 57.411564\n",
123 | "15 file__10m.csv 1544899668 28.288315 53.782832\n",
124 | "16 file__10m.csv 1544899668 30.251305 51.212875\n",
125 | "17 file__10m.csv 1544899668 29.832134 53.019589\n",
126 | "18 file__10m.csv 1544899668 30.219823 51.814748\n",
127 | "19 file__10m.csv 1544899668 29.523441 58.404082\n",
128 | "20 file__8m.csv 1235682644 22.604655 44.687176\n",
129 | "21 file__8m.csv 1235682644 23.847431 42.688675\n",
130 | "22 file__8m.csv 1235682644 22.948513 46.968077\n",
131 | "23 file__8m.csv 1235682644 22.775274 42.628585\n",
132 | "24 file__8m.csv 1235682644 22.208697 42.986066\n",
133 | "25 file__6m.csv 926480055 15.887059 28.341116\n",
134 | "26 file__6m.csv 926480055 17.087613 29.421471\n",
135 | "27 file__6m.csv 926480055 16.481102 29.210887\n",
136 | "28 file__6m.csv 926480055 17.106003 28.821917\n",
137 | "29 file__6m.csv 926480055 17.174157 29.268595\n",
138 | "30 file__4m.csv 617284424 11.478665 19.301656\n",
139 | "31 file__4m.csv 617284424 12.427617 19.835787\n",
140 | "32 file__4m.csv 617284424 11.447127 21.388650\n",
141 | "33 file__4m.csv 617284424 11.674226 23.900714\n",
142 | "34 file__4m.csv 617284424 11.752415 23.176530\n",
143 | "35 file__2m.csv 308078962 5.566590 9.777559\n",
144 | "36 file__2m.csv 308078962 5.376703 9.858794\n",
145 | "37 file__2m.csv 308078962 5.507923 9.836645\n",
146 | "38 file__2m.csv 308078962 5.774516 9.721179\n",
147 | "39 file__2m.csv 308078962 5.569103 9.846717\n",
148 | "40 file__1m.csv 153491820 2.429468 4.656254\n",
149 | "41 file__1m.csv 153491820 2.388223 5.177449\n",
150 | "42 file__1m.csv 153491820 2.744287 4.829605\n",
151 | "43 file__1m.csv 153491820 2.580219 5.174126\n",
152 | "44 file__1m.csv 153491820 2.678549 5.089914"
153 | ],
154 | "text/html": [
155 | "\n",
156 | " \n",
157 | "\n",
158 | "\n",
159 | "\n",
172 | "\n",
173 | " \n",
174 | " \n",
175 | " \n",
176 | " filename \n",
177 | " file_size \n",
178 | " shuffle_time \n",
179 | " inferring_time \n",
180 | " \n",
181 | " \n",
182 | " \n",
183 | " \n",
184 | " 0 \n",
185 | " file__20m.csv \n",
186 | " 3100848423 \n",
187 | " 73.000435 \n",
188 | " 111.563570 \n",
189 | " \n",
190 | " \n",
191 | " 1 \n",
192 | " file__20m.csv \n",
193 | " 3100848423 \n",
194 | " 72.972138 \n",
195 | " 115.251914 \n",
196 | " \n",
197 | " \n",
198 | " 2 \n",
199 | " file__20m.csv \n",
200 | " 3100848423 \n",
201 | " 82.320634 \n",
202 | " 116.762997 \n",
203 | " \n",
204 | " \n",
205 | " 3 \n",
206 | " file__20m.csv \n",
207 | " 3100848423 \n",
208 | " 77.676226 \n",
209 | " 114.593858 \n",
210 | " \n",
211 | " \n",
212 | " 4 \n",
213 | " file__20m.csv \n",
214 | " 3100848423 \n",
215 | " 73.269382 \n",
216 | " 112.556434 \n",
217 | " \n",
218 | " \n",
219 | " 5 \n",
220 | " file__15m.csv \n",
221 | " 2322887546 \n",
222 | " 55.826348 \n",
223 | " 74.622513 \n",
224 | " \n",
225 | " \n",
226 | " 6 \n",
227 | " file__15m.csv \n",
228 | " 2322887546 \n",
229 | " 42.934298 \n",
230 | " 71.261897 \n",
231 | " \n",
232 | " \n",
233 | " 7 \n",
234 | " file__15m.csv \n",
235 | " 2322887546 \n",
236 | " 42.870424 \n",
237 | " 69.139627 \n",
238 | " \n",
239 | " \n",
240 | " 8 \n",
241 | " file__15m.csv \n",
242 | " 2322887546 \n",
243 | " 42.146510 \n",
244 | " 71.239783 \n",
245 | " \n",
246 | " \n",
247 | " 9 \n",
248 | " file__15m.csv \n",
249 | " 2322887546 \n",
250 | " 42.219681 \n",
251 | " 69.670533 \n",
252 | " \n",
253 | " \n",
254 | " 10 \n",
255 | " file__12m.csv \n",
256 | " 1856118441 \n",
257 | " 32.489830 \n",
258 | " 58.081118 \n",
259 | " \n",
260 | " \n",
261 | " 11 \n",
262 | " file__12m.csv \n",
263 | " 1856118441 \n",
264 | " 34.643183 \n",
265 | " 57.989308 \n",
266 | " \n",
267 | " \n",
268 | " 12 \n",
269 | " file__12m.csv \n",
270 | " 1856118441 \n",
271 | " 34.854425 \n",
272 | " 57.719420 \n",
273 | " \n",
274 | " \n",
275 | " 13 \n",
276 | " file__12m.csv \n",
277 | " 1856118441 \n",
278 | " 33.383627 \n",
279 | " 59.860559 \n",
280 | " \n",
281 | " \n",
282 | " 14 \n",
283 | " file__12m.csv \n",
284 | " 1856118441 \n",
285 | " 32.797288 \n",
286 | " 57.411564 \n",
287 | " \n",
288 | " \n",
289 | " 15 \n",
290 | " file__10m.csv \n",
291 | " 1544899668 \n",
292 | " 28.288315 \n",
293 | " 53.782832 \n",
294 | " \n",
295 | " \n",
296 | " 16 \n",
297 | " file__10m.csv \n",
298 | " 1544899668 \n",
299 | " 30.251305 \n",
300 | " 51.212875 \n",
301 | " \n",
302 | " \n",
303 | " 17 \n",
304 | " file__10m.csv \n",
305 | " 1544899668 \n",
306 | " 29.832134 \n",
307 | " 53.019589 \n",
308 | " \n",
309 | " \n",
310 | " 18 \n",
311 | " file__10m.csv \n",
312 | " 1544899668 \n",
313 | " 30.219823 \n",
314 | " 51.814748 \n",
315 | " \n",
316 | " \n",
317 | " 19 \n",
318 | " file__10m.csv \n",
319 | " 1544899668 \n",
320 | " 29.523441 \n",
321 | " 58.404082 \n",
322 | " \n",
323 | " \n",
324 | " 20 \n",
325 | " file__8m.csv \n",
326 | " 1235682644 \n",
327 | " 22.604655 \n",
328 | " 44.687176 \n",
329 | " \n",
330 | " \n",
331 | " 21 \n",
332 | " file__8m.csv \n",
333 | " 1235682644 \n",
334 | " 23.847431 \n",
335 | " 42.688675 \n",
336 | " \n",
337 | " \n",
338 | " 22 \n",
339 | " file__8m.csv \n",
340 | " 1235682644 \n",
341 | " 22.948513 \n",
342 | " 46.968077 \n",
343 | " \n",
344 | " \n",
345 | " 23 \n",
346 | " file__8m.csv \n",
347 | " 1235682644 \n",
348 | " 22.775274 \n",
349 | " 42.628585 \n",
350 | " \n",
351 | " \n",
352 | " 24 \n",
353 | " file__8m.csv \n",
354 | " 1235682644 \n",
355 | " 22.208697 \n",
356 | " 42.986066 \n",
357 | " \n",
358 | " \n",
359 | " 25 \n",
360 | " file__6m.csv \n",
361 | " 926480055 \n",
362 | " 15.887059 \n",
363 | " 28.341116 \n",
364 | " \n",
365 | " \n",
366 | " 26 \n",
367 | " file__6m.csv \n",
368 | " 926480055 \n",
369 | " 17.087613 \n",
370 | " 29.421471 \n",
371 | " \n",
372 | " \n",
373 | " 27 \n",
374 | " file__6m.csv \n",
375 | " 926480055 \n",
376 | " 16.481102 \n",
377 | " 29.210887 \n",
378 | " \n",
379 | " \n",
380 | " 28 \n",
381 | " file__6m.csv \n",
382 | " 926480055 \n",
383 | " 17.106003 \n",
384 | " 28.821917 \n",
385 | " \n",
386 | " \n",
387 | " 29 \n",
388 | " file__6m.csv \n",
389 | " 926480055 \n",
390 | " 17.174157 \n",
391 | " 29.268595 \n",
392 | " \n",
393 | " \n",
394 | " 30 \n",
395 | " file__4m.csv \n",
396 | " 617284424 \n",
397 | " 11.478665 \n",
398 | " 19.301656 \n",
399 | " \n",
400 | " \n",
401 | " 31 \n",
402 | " file__4m.csv \n",
403 | " 617284424 \n",
404 | " 12.427617 \n",
405 | " 19.835787 \n",
406 | " \n",
407 | " \n",
408 | " 32 \n",
409 | " file__4m.csv \n",
410 | " 617284424 \n",
411 | " 11.447127 \n",
412 | " 21.388650 \n",
413 | " \n",
414 | " \n",
415 | " 33 \n",
416 | " file__4m.csv \n",
417 | " 617284424 \n",
418 | " 11.674226 \n",
419 | " 23.900714 \n",
420 | " \n",
421 | " \n",
422 | " 34 \n",
423 | " file__4m.csv \n",
424 | " 617284424 \n",
425 | " 11.752415 \n",
426 | " 23.176530 \n",
427 | " \n",
428 | " \n",
429 | " 35 \n",
430 | " file__2m.csv \n",
431 | " 308078962 \n",
432 | " 5.566590 \n",
433 | " 9.777559 \n",
434 | " \n",
435 | " \n",
436 | " 36 \n",
437 | " file__2m.csv \n",
438 | " 308078962 \n",
439 | " 5.376703 \n",
440 | " 9.858794 \n",
441 | " \n",
442 | " \n",
443 | " 37 \n",
444 | " file__2m.csv \n",
445 | " 308078962 \n",
446 | " 5.507923 \n",
447 | " 9.836645 \n",
448 | " \n",
449 | " \n",
450 | " 38 \n",
451 | " file__2m.csv \n",
452 | " 308078962 \n",
453 | " 5.774516 \n",
454 | " 9.721179 \n",
455 | " \n",
456 | " \n",
457 | " 39 \n",
458 | " file__2m.csv \n",
459 | " 308078962 \n",
460 | " 5.569103 \n",
461 | " 9.846717 \n",
462 | " \n",
463 | " \n",
464 | " 40 \n",
465 | " file__1m.csv \n",
466 | " 153491820 \n",
467 | " 2.429468 \n",
468 | " 4.656254 \n",
469 | " \n",
470 | " \n",
471 | " 41 \n",
472 | " file__1m.csv \n",
473 | " 153491820 \n",
474 | " 2.388223 \n",
475 | " 5.177449 \n",
476 | " \n",
477 | " \n",
478 | " 42 \n",
479 | " file__1m.csv \n",
480 | " 153491820 \n",
481 | " 2.744287 \n",
482 | " 4.829605 \n",
483 | " \n",
484 | " \n",
485 | " 43 \n",
486 | " file__1m.csv \n",
487 | " 153491820 \n",
488 | " 2.580219 \n",
489 | " 5.174126 \n",
490 | " \n",
491 | " \n",
492 | " 44 \n",
493 | " file__1m.csv \n",
494 | " 153491820 \n",
495 | " 2.678549 \n",
496 | " 5.089914 \n",
497 | " \n",
498 | " \n",
576 | " "
577 | ]
578 | },
579 | "metadata": {},
580 | "execution_count": 102
581 | }
582 | ]
583 | },
584 | {
585 | "cell_type": "code",
586 | "source": [
587 | "df['file_size'] = round((df['file_size'] / 1e+9), 3)"
588 | ],
589 | "metadata": {
590 | "id": "VBxKzFLjuyfL"
591 | },
592 | "execution_count": 103,
593 | "outputs": []
594 | },
595 | {
596 | "cell_type": "code",
597 | "source": [
598 | "df = df.groupby(['file_size', 'filename'], sort = False)[['shuffle_time', 'inferring_time']].mean()\n",
599 | "df"
600 | ],
601 | "metadata": {
602 | "colab": {
603 | "base_uri": "https://localhost:8080/",
604 | "height": 398
605 | },
606 | "id": "fDBxYlEbn3TB",
607 | "outputId": "98af830a-86f4-435f-ac3e-91ec5f41b0da"
608 | },
609 | "execution_count": 105,
610 | "outputs": [
611 | {
612 | "output_type": "stream",
613 | "name": "stderr",
614 | "text": [
615 | "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.\n",
616 | " \"\"\"Entry point for launching an IPython kernel.\n"
617 | ]
618 | },
619 | {
620 | "output_type": "execute_result",
621 | "data": {
622 | "text/plain": [
623 | " shuffle_time inferring_time\n",
624 | "file_size filename \n",
625 | "3.101 file__20m.csv 75.847763 114.145755\n",
626 | "2.323 file__15m.csv 45.199452 71.186871\n",
627 | "1.856 file__12m.csv 33.633671 58.212394\n",
628 | "1.545 file__10m.csv 29.623004 53.646825\n",
629 | "1.236 file__8m.csv 22.876914 43.991716\n",
630 | "0.926 file__6m.csv 16.747187 29.012797\n",
631 | "0.617 file__4m.csv 11.756010 21.520667\n",
632 | "0.308 file__2m.csv 5.558967 9.808179\n",
633 | "0.153 file__1m.csv 2.564149 4.985470"
634 | ],
635 | "text/html": [
639 | "<div>\n",
653 | "<table border=\"1\" class=\"dataframe\">\n",
654 | "  <thead>\n",
655 | "    <tr style=\"text-align: right;\">\n",
656 | "      <th></th>\n",
657 | "      <th></th>\n",
658 | "      <th>shuffle_time</th>\n",
659 | "      <th>inferring_time</th>\n",
660 | "    </tr>\n",
661 | "    <tr>\n",
662 | "      <th>file_size</th>\n",
663 | "      <th>filename</th>\n",
664 | "      <th></th>\n",
665 | "      <th></th>\n",
666 | "    </tr>\n",
667 | "  </thead>\n",
668 | "  <tbody>\n",
669 | "    <tr>\n",
670 | "      <th>3.101</th>\n",
671 | "      <th>file__20m.csv</th>\n",
672 | "      <td>75.847763</td>\n",
673 | "      <td>114.145755</td>\n",
674 | "    </tr>\n",
675 | "    <tr>\n",
676 | "      <th>2.323</th>\n",
677 | "      <th>file__15m.csv</th>\n",
678 | "      <td>45.199452</td>\n",
679 | "      <td>71.186871</td>\n",
680 | "    </tr>\n",
681 | "    <tr>\n",
682 | "      <th>1.856</th>\n",
683 | "      <th>file__12m.csv</th>\n",
684 | "      <td>33.633671</td>\n",
685 | "      <td>58.212394</td>\n",
686 | "    </tr>\n",
687 | "    <tr>\n",
688 | "      <th>1.545</th>\n",
689 | "      <th>file__10m.csv</th>\n",
690 | "      <td>29.623004</td>\n",
691 | "      <td>53.646825</td>\n",
692 | "    </tr>\n",
693 | "    <tr>\n",
694 | "      <th>1.236</th>\n",
695 | "      <th>file__8m.csv</th>\n",
696 | "      <td>22.876914</td>\n",
697 | "      <td>43.991716</td>\n",
698 | "    </tr>\n",
699 | "    <tr>\n",
700 | "      <th>0.926</th>\n",
701 | "      <th>file__6m.csv</th>\n",
702 | "      <td>16.747187</td>\n",
703 | "      <td>29.012797</td>\n",
704 | "    </tr>\n",
705 | "    <tr>\n",
706 | "      <th>0.617</th>\n",
707 | "      <th>file__4m.csv</th>\n",
708 | "      <td>11.756010</td>\n",
709 | "      <td>21.520667</td>\n",
710 | "    </tr>\n",
711 | "    <tr>\n",
712 | "      <th>0.308</th>\n",
713 | "      <th>file__2m.csv</th>\n",
714 | "      <td>5.558967</td>\n",
715 | "      <td>9.808179</td>\n",
716 | "    </tr>\n",
717 | "    <tr>\n",
718 | "      <th>0.153</th>\n",
719 | "      <th>file__1m.csv</th>\n",
720 | "      <td>2.564149</td>\n",
721 | "      <td>4.985470</td>\n",
722 | "    </tr>\n",
723 | "  </tbody>\n",
724 | "</table>\n",
725 | "</div>"
802 | ]
803 | },
804 | "metadata": {},
805 | "execution_count": 105
806 | }
807 | ]
808 | },
809 | {
810 | "cell_type": "code",
811 | "source": [
812 | "df.reset_index(inplace=True)"
813 | ],
814 | "metadata": {
815 | "id": "zh6NHO3csESF"
816 | },
817 | "execution_count": 106,
818 | "outputs": []
819 | },
820 | {
821 | "cell_type": "code",
822 | "source": [
823 | "import matplotlib.pyplot as plt\n",
824 | "import seaborn as sns\n",
825 | "\n",
826 | "plt.figure(figsize=(15,4))\n",
827 | "\n",
828 | "sns.set(style='white')\n",
829 | "\n",
830 | "df.set_index(['file_size', 'filename']).plot(kind='bar', stacked=True, color=sns.color_palette(\"colorblind\", 2))\n",
831 | "\n",
832 | "plt.title('Shuffle Time and Inferring Time by File Size', fontsize=12)\n",
833 | "\n",
834 | "plt.xlabel('File Size (Gigabytes)')\n",
835 | "plt.ylabel('Time (Seconds)')\n",
836 | "\n",
837 | "plt.xticks(rotation=90)"
838 | ],
839 | "metadata": {
840 | "colab": {
841 | "base_uri": "https://localhost:8080/",
842 | "height": 464
843 | },
844 | "id": "ikFD4xeysAVV",
845 | "outputId": "60dbaf2d-252f-4f2d-b2a2-ec9caa86909b"
846 | },
847 | "execution_count": 107,
848 | "outputs": [
849 | {
850 | "output_type": "execute_result",
851 | "data": {
852 | "text/plain": [
853 | "(array([0, 1, 2, 3, 4, 5, 6, 7, 8]),\n",
854 | " )"
855 | ]
856 | },
857 | "metadata": {},
858 | "execution_count": 107
859 | },
860 | {
861 | "output_type": "display_data",
862 | "data": {
863 | "text/plain": [
864 | ""
865 | ]
866 | },
867 | "metadata": {}
868 | },
869 | {
870 | "output_type": "display_data",
871 | "data": {
872 | "text/plain": [
873 | ""
874 | ],
875 | "image/png": "\n"
876 | },
877 | "metadata": {}
878 | }
879 | ]
880 | }
881 | ]
882 | }
--------------------------------------------------------------------------------
/csv_schema_inference/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Wittline/csv-schema-inference/8121193f82b02984c811b2cc794c539e28b5e5ef/csv_schema_inference/__init__.py
--------------------------------------------------------------------------------
/csv_schema_inference/csv_schema_inference.py:
--------------------------------------------------------------------------------
1 | import mmap
2 | import os
3 | import multiprocessing as mp
4 | import datetime as dt
5 | import operator
6 |
7 |
8 |
9 | class DetectType:
10 |
11 |     def __init__(self, max_length, sep):
12 |         self.max_length = max_length
13 |         self.sep = sep
14 |
15 |     def __get_local_type(self, value):
16 |         try:
17 |             number = float(value)
18 |         except ValueError:
19 |             return "STRING"
20 |
21 |         if number.is_integer():
22 |             return "INTEGER"
23 |         else:
24 |             return "FLOAT"
25 |
26 |
27 |     def __get_date_type(self, value):
28 |         # Classify a short non-numeric string as TIMESTAMP, DATE, TIME or STRING.
29 |
30 |         if "T" in value:
31 |             segments = value.split("T")
32 |             try:
33 |
34 |                 if len(segments) == 2:
35 |                     valid_date = False
36 |                     d_elements = segments[0].split("-")
37 |                     if len(d_elements) == 3 and len(d_elements[0]) in {2, 4} and \
38 |                             len(d_elements[1]) == 2 and len(d_elements[2]) == 2:
39 |                         dt.date(*(int(e) for e in d_elements))
40 |                         valid_date = True
41 |                     t_elements = segments[1].split(":")
42 |                     valid_time = False
43 |                     if len(t_elements) in (2, 3):
44 |                         valid_time = (len(t_elements[0]) == 2 and 0 <= int(t_elements[0]) < 24 and
45 |                                       len(t_elements[1]) == 2 and 0 <= int(t_elements[1]) < 60)
46 |                         if len(t_elements) == 3:
47 |                             valid_time = (valid_time and len(t_elements[2]) == 2 and
48 |                                           0 <= int(t_elements[2]) < 60)
49 |                     if valid_time and valid_date:
50 |                         return "TIMESTAMP"
51 |
52 |             except ValueError:
53 |                 return "STRING"
54 |
55 |         elif "-" in value:
56 |
57 |             segments = value.split("-")
58 |             try:
59 |
60 |                 if len(segments) == 3 and len(segments[0]) in {2, 4} and \
61 |                         len(segments[1]) == 2 and len(segments[2]) == 2:
62 |
63 |                     dt.date(*(int(e) for e in segments))
64 |                     return "DATE"
65 |             except ValueError:
66 |                 return "STRING"
67 |         else:
68 |
69 |             try:
70 |                 segments = value.split(":")
71 |                 if len(segments) in {2, 3}:
72 |                     valid = (len(segments[0]) == 2 and 0 <= int(segments[0]) < 24 and
73 |                              len(segments[1]) == 2 and 0 <= int(segments[1]) < 60)
74 |                     if len(segments) == 3:
75 |                         valid = (valid and len(segments[2]) == 2 and
76 |                                  0 <= int(segments[2]) < 60)
77 |                     if valid:
78 |                         return "TIME"
79 |             except ValueError:
80 |                 return "STRING"
81 |
82 |
83 |         return "STRING"
84 |
85 |
86 |     def __infer_value_type(self, value, index, schema, values_type):
87 |
88 |         if value not in values_type:
89 |
90 |             local_type = self.__get_local_type(value)
91 |
92 |             if local_type == 'STRING':
93 |
94 |                 if value in {"", "na", "NA", "null", "NULL"}:
95 |                     _type = "STRING"
96 |                 elif value in {"true", "false", "TRUE", "FALSE", "True", "False"}:
97 |                     _type = "BOOLEAN"
98 |                 elif len(value) < 21:
99 |                     _type = self.__get_date_type(value)
100 |                 else:
101 |                     _type = local_type
102 |             else:
103 |                 _type = local_type
104 |
105 |             # Cache the result so a repeated value is classified only once per chunk.
106 |             values_type[value] = _type
107 |
108 |         # Mark the column nullable on every null-like value, including cached ones.
109 |         if value in {"", "na", "NA", "null", "NULL"}:
110 |             schema[index]["nullable"] = True
111 |
112 |         _type = values_type[value]
113 |         if _type not in schema[index]["types_found"]:
114 |             schema[index]["types_found"][_type] = {"cnt": 1}
115 |         else:
116 |             schema[index]["types_found"][_type]["cnt"] += 1
117 |
118 |
119 |     def execute(self, records, schema):
120 |         values_type = {}
121 |         for record in records:
122 |             values = record.rstrip().split(self.sep)
123 |             for index, value in enumerate(values):
124 |                 self.__infer_value_type(value[0:self.max_length], index, schema, values_type)
125 |
126 |
127 | class Parallel:
128 |
129 |     def __init__(self):
130 |         pass
131 |
132 |
133 |     def execute(self, records, x, obj, d_schema):
134 |         obj.execute(records, d_schema)
135 |         return d_schema
136 |
137 |
138 |     def parallel(self, records, obj, d_schema):
139 |
140 |         # Use all but two cores; fall back to every core on small machines.
141 |
142 |         cpus = (mp.cpu_count() - 2)
143 |
144 |         if cpus <= 0:
145 |             cpus = mp.cpu_count()
146 |
147 |         chunk_size = len(records) / cpus
148 |
149 |         if chunk_size < 1:
150 |             cpus = max(1, len(records))  # fewer records than workers: one record per worker
151 |             chunk_size = 1
152 |         else:
153 |             chunk_size = round(chunk_size)
154 |
155 |
156 |
157 |         pool = mp.Pool(processes=cpus)
158 |
159 |         results = [pool.apply_async(self.execute, args=(records[x:x+chunk_size], x, obj, d_schema)) for x in range(0, len(records), chunk_size)]
160 |         pool.close()
161 |         pool.join()
162 |
163 |         return [p.get() for p in results]
164 |
165 |
166 | class CsvSchemaInference:
167 |
168 |     def __init__(self, portion = 0.5, max_length = 1000, batch_size = 250000, acc = 0.7, seed= 1, header= True, sep=";", conditions = {}):
169 |         self.portion = portion
170 |         self.seed = seed
171 |         self.header = header
172 |         self.sep = sep
173 |         self.accuracy = acc
174 |         self.__schema = {}
175 |         self.max_length = max_length
176 |         self.data_types = {"STRING", "INTEGER", "FLOAT", "DATETIME", "DATE", "TIME", "TIMESTAMP", "BOOLEAN"}
177 |         self.batch_size = batch_size
178 |
179 |         if isinstance(conditions,dict):
180 |
181 |             if conditions:
182 |                 for k, v in conditions.items():
183 |                     if k not in self.data_types or v not in self.data_types:
184 |                         raise ValueError('Keys and values in conditions must be valid data types')
185 |
186 |
187 |             self.conditions = conditions
188 |
189 |
190 |
191 |
192 |     def __set_header(self, header):
193 |
194 |         header = header.rstrip().split(self.sep)
195 |         for i in range(0, len(header)):
196 |             self.__schema[i] = {
197 |                 "_name": header[i].replace('"', ''),
198 |                 "types_found":{
199 |                 },
200 |                 "nullable":False,
201 |                 "type":""
202 |             }
203 |
204 |
205 |     def __estimate_count(self, filename, reader):
206 |         buffer = reader.read(1<<13)
207 |         file_size = os.path.getsize(filename)
208 |         return file_size // max(1, len(buffer) // max(1, buffer.count(b'\n')))  # max() guards against ZeroDivisionError
209 |
210 |
211 |     def __merge_schemas(self, schemas):
212 |
213 |         for c_inx in self.__schema:
214 |
215 |             for s_inx in range(0, len(schemas)):
216 |
217 |                 _v = schemas[s_inx][c_inx]
218 |
219 |                 if _v['nullable']:
220 |                     self.__schema[c_inx]['nullable'] = True
221 |
222 |
223 |                 for k in _v['types_found']:
224 |
225 |                     if k not in self.__schema[c_inx]['types_found'].keys():
226 |
227 |                         self.__schema[c_inx]['types_found'][k] = {
228 |                             "cnt": _v['types_found'][k]['cnt']
229 |                         }
230 |                     else:
231 |                         self.__schema[c_inx]['types_found'][k]['cnt'] += _v['types_found'][k]['cnt']
232 |
233 |
234 |
235 |     def check_condition(self, _types, acc):
236 |
237 |         try:
238 |             _type = max({k: v for k, v in _types.items() if v >= (acc * 100)}.items(),
239 |                         key=operator.itemgetter(1))[0]
240 |
241 |             if _type in self.conditions:
242 |                 if self.conditions[_type] in _types:
243 |                     _type = self.conditions[_type]
244 |
245 |         except ValueError:
246 |
247 |             if "STRING" in _types or len(_types) > 2:
248 |                 _type = "STRING"
249 |             else:
250 |                 if {"INTEGER", "FLOAT"}.issubset(_types):
251 |                     _type = "FLOAT"
252 |                 else:
253 |                     _type = "STRING"
254 |
255 |         return _type
256 |
257 |
258 |
259 |
260 |
261 |     def __approximate_types(self, acc = 0.5):
262 |
263 |         result = {}
264 |         for c in self.__schema:
265 |             _types = {}
266 |             t = 0
267 |             for v in self.__schema[c]['types_found']:
268 |                 t += self.__schema[c]['types_found'][v]['cnt']
269 |                 if v not in _types.keys():
270 |                     _types[v] = self.__schema[c]['types_found'][v]['cnt']
271 |                 else:
272 |                     _types[v] += self.__schema[c]['types_found'][v]['cnt']
273 |
274 |             for ft in _types:
275 |                 _types[ft] = (_types[ft] * 100) / t
276 |
277 |
278 |             _type = self.check_condition(_types, acc)
279 |
280 |
281 |             self.__schema[c]['type'] = _type
282 |
283 |             result[c] = {
284 |                 "name": self.__schema[c]['_name'],
285 |                 "type": _type,
286 |                 "nullable": self.__schema[c]['nullable']
287 |             }
288 |
289 |         return result
290 |
291 |
292 |     def pretty(self, d, ind=0):
293 |
294 |         for k, v in d.items():
295 |             print('\t' * ind + str(k))
296 |             if isinstance(v, dict):
297 |                 self.pretty(v, ind+1)
298 |             else:
299 |                 print('\t' * (ind+1) + str(v))
300 |
301 |
302 |     def get_schema_columns(self, columns = {}):
303 |
304 |
305 |         result = {}
306 |
307 |         for c in self.__schema:
308 |             if self.__schema[c]["_name"] in columns:
309 |                 result[c] = {
310 |                     "_name": self.__schema[c]["_name"],
311 |                     "types_found":self.__schema[c]["types_found"],
312 |                     "nullable":self.__schema[c]["nullable"],
313 |                     "type":self.__schema[c]["type"]
314 |                 }
315 |
316 |         return result
317 |
318 |
319 |     def explore_schema_column(self, column):
320 |
321 |         result = {}
322 |
323 |         for c in self.__schema:
324 |
325 |             if column == self.__schema[c]['_name']:
326 |
327 |                 _types = {}
328 |                 t = 0
329 |                 for v in self.__schema[c]['types_found']:
330 |                     t += self.__schema[c]['types_found'][v]['cnt']
331 |
332 |                     if v not in _types.keys():
333 |                         _types[v] = self.__schema[c]['types_found'][v]['cnt']
334 |                     else:
335 |                         _types[v] += self.__schema[c]['types_found'][v]['cnt']
336 |
337 |                 for ft in _types:
338 |                     _types[ft] = (_types[ft] * 100) / t
339 |
340 |                 result[c] = {
341 |                     "name" : self.__schema[c]['_name'],
342 |                     "types_found": _types,
343 |                     "nullable": self.__schema[c]['nullable']
344 |                 }
345 |
346 |                 break
347 |
348 |         return result
349 |
350 |
351 |
352 |     def run_inference(self, filename):
353 |
354 |         with open(filename, mode="r", encoding = "ISO-8859-1") as file_obj:
355 |
356 |             with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as map_file:
357 |
358 |                 less_header = 0
359 |
360 |                 if self.header:
361 |                     less_header = 1
362 |
363 |                 no_lines = self.__estimate_count(filename, map_file) - less_header
364 |                 portion = int(no_lines * self.portion)
365 |                 map_file.seek(0)
366 |
367 |                 if self.header:
368 |                     self.__set_header(map_file.readline().decode("ISO-8859-1"))
369 |
370 |                 lines = []
371 |                 schemas = []
372 |                 batch_count = 0
373 |
374 |                 dtype = DetectType(self.max_length, self.sep)
375 |
376 |
377 |                 while batch_count < portion:
378 |                     raw = map_file.readline()
379 |                     if not raw:
380 |                         break  # the row count is only an estimate, so stop at end of file
381 |                     batch_count += 1
382 |                     lines.append(raw.decode("ISO-8859-1"))
383 |                     if batch_count % self.batch_size == 0:
384 |                         prl = Parallel()
385 |                         schemas_result = prl.parallel(records = lines, obj=dtype, d_schema = self.__schema)
386 |
387 |                         for schema in schemas_result:
388 |                             schemas.append(schema)
389 |
390 |                         lines = []
391 |
392 |                 if len(lines) > 0:
393 |
394 |                     prl = Parallel()
395 |                     schemas_result = prl.parallel(records = lines,obj=dtype, d_schema = self.__schema)
396 |
397 |                     for schema in schemas_result:
398 |                         schemas.append(schema)
399 |
400 |                 del lines
401 |                 del batch_count
402 |
403 |
404 |                 #Joining schemas results
405 |                 self.__merge_schemas(schemas)
406 |
407 |                 #Approximate data types
408 |                 return self.__approximate_types(acc = self.accuracy)
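
The threshold rule in `check_condition` above can be tried in isolation. The following is a minimal standalone sketch, assuming the per-type percentages for a column have already been computed; `pick_type` is a hypothetical helper name used only for illustration and is not part of the package.

```python
import operator


def pick_type(percentages, acc=0.7, conditions=None):
    """Choose a column type from a {type_name: percentage} mapping.

    Hypothetical helper mirroring the decision rule of check_condition;
    it is an illustration, not the package API.
    """
    conditions = conditions or {}

    # Types whose share of the sampled values reaches the accuracy threshold.
    candidates = {k: v for k, v in percentages.items() if v >= acc * 100}

    if candidates:
        chosen = max(candidates.items(), key=operator.itemgetter(1))[0]
        # A condition such as {"INTEGER": "FLOAT"} replaces the winner
        # when the alternative type was also observed in the column.
        if chosen in conditions and conditions[chosen] in percentages:
            chosen = conditions[chosen]
        return chosen

    # No dominant type: a pure INTEGER/FLOAT mix widens to FLOAT,
    # every other mix falls back to STRING.
    if "STRING" in percentages or len(percentages) > 2:
        return "STRING"
    if {"INTEGER", "FLOAT"}.issubset(percentages):
        return "FLOAT"
    return "STRING"


if __name__ == "__main__":
    print(pick_type({"INTEGER": 80.0, "FLOAT": 20.0}))
    print(pick_type({"INTEGER": 80.0, "FLOAT": 20.0}, conditions={"INTEGER": "FLOAT"}))
    print(pick_type({"INTEGER": 50.0, "FLOAT": 50.0}))
```

With the default threshold of 0.7, a column that is 80% integers resolves to `INTEGER` unless a condition promotes it, while an even integer/float split has no dominant type and widens to `FLOAT`.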
--------------------------------------------------------------------------------
/googled57bdb220576a44a.html:
--------------------------------------------------------------------------------
1 | google-site-verification: googled57bdb220576a44a.html
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = [
3 | "setuptools>=42",
4 | "wheel"
5 | ]
6 | build-backend = "setuptools.build_meta"
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | name = csv-schema-inference
3 | version = 0.0.9
4 | author = Ramses Alexander Coraspe Valdez
5 | author_email = contacto@wittline.com
6 | description = A tool to automatically infer columns data types in .csv files
7 | long_description = file: README.md
8 | long_description_content_type = text/markdown
9 | url = https://github.com/Wittline/csv-schema-inference
10 | classifiers =
11 | Programming Language :: Python :: 3
12 | License :: OSI Approved :: MIT License
13 | Operating System :: OS Independent
14 |
15 | [options]
16 | packages = find:
17 | python_requires = >=3.7
18 | include_package_data = False
--------------------------------------------------------------------------------