├── .gitignore ├── CHANGELOG.md ├── LICENSE.txt ├── README.md ├── examples ├── basic-usage.ipynb └── sample-datasets.ipynb ├── pydataset ├── __init__.py ├── datasets_handler.py ├── dump_data.py ├── locate_datasets.py ├── resources.tar.gz ├── support.py └── utils │ ├── __init__.py │ └── html2text.py ├── setup.cfg └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .coverage.* 42 | .cache 43 | nosetests.xml 44 | coverage.xml 45 | *,cover 46 | .hypothesis/ 47 | 48 | # Translations 49 | *.mo 50 | *.pot 51 | 52 | # Django stuff: 53 | *.log 54 | 55 | # Sphinx documentation 56 | docs/_build/ 57 | 58 | # PyBuilder 59 | target/ 60 | 61 | #Ipython Notebook 62 | .ipynb_checkpoints 63 | 64 | # local 65 | clean.sh 66 | todo/ 67 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | ### Changelog 2 | 3 | **0.2.0** 4 | 5 | - Add search dataset by name similarity. 6 | 7 | Example: 8 | 9 | ```python 10 | >>> data('heat') 11 | Did you mean: 12 | Wheat, heart, Heating, Yeast, eidat, badhealth, deaths, agefat, hla, heptathlon, azt 13 | ``` 14 | 15 | **0.1.1** 16 | 17 | - Fix: add support to Windows and fix filepaths, issue #1 18 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Aziz Alto 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## PyDataset 2 | [![PyPI version](https://badge.fury.io/py/pydataset.svg)](http://badge.fury.io/py/pydataset) 3 | 4 | Provides instant access to many datasets right from Python (in pandas DataFrame structure). 5 | 6 | ### What? 7 | 8 | The idea is simple. There are various datasets available out there, but they are scattered in different places over the web. 9 | Is there a quick way (in Python) to access them instantly without going through the hassle of searching, downloading, and reading ... etc? 10 | PyDataset tries to address that question :) 11 | 12 | 13 | ### Usage: 14 | 15 | Start with importing `data()`: 16 | ```python 17 | from pydataset import data 18 | ``` 19 | - To load a dataset: 20 | ```python 21 | titanic = data('titanic') 22 | ``` 23 | - To display the documentation of a dataset: 24 | ```python 25 | data('titanic', show_doc=True) 26 | ``` 27 | - To see the available datasets: 28 | ```python 29 | data() 30 | ``` 31 | 32 | That's it. 33 | See more [examples](examples). 34 | 35 | 36 | ### Why? 37 | 38 | In `R`, there is a very easy and immediate way to access multiple statistical datasets, 39 | in almost no effort. All it takes is one line ` > data(dataset_name)`. 40 | This makes the life easier for quick prototyping and testing. 41 | Well, I am jealous that Python does not have a similar functionality. 42 | Thus, the aim of `pydataset` is to fill that gap. 43 | 44 | Currently, `pydataset` has about 757 (mostly numerical-based) datasets, that are based on `RDatasets`. 45 | In the future, I plan to scale it to include a larger set of datasets. 46 | For example, 47 | 1) include textual data for NLP-related tasks, and 48 | 2) allow adding a new dataset to the in-module repository. 49 | 50 | 51 | ### Installation: 52 | 53 | `$ pip install pydataset` 54 | 55 | #### Uninstall: 56 | 57 | - `$ pip uninstall pydataset` 58 | - `$ rm -rf $HOME/.pydataset` 59 | 60 | ### Changelog 61 | 62 | **0.2.0** 63 | 64 | - Add search dataset by name similarity. 65 | - Example: 66 | 67 | ```python 68 | >>> data('heat') 69 | Did you mean: 70 | Wheat, heart, Heating, Yeast, eidat, badhealth, deaths, agefat, hla, heptathlon, azt 71 | ``` 72 | 73 | **0.1.1** 74 | 75 | - Fix: add support to Windows and fix filepaths, issue #1 76 | 77 | ### Dependency: 78 | - pandas 79 | 80 | ### Miscellaneous: 81 | 82 | - Tested on OSX and Linux (debian). 83 | - Supports both Python 2 (2.7.11) and Python 3 (3.5.1). 84 | 85 | 86 | #### TODO: 87 | - add textual datasets (e.g. NLTK stuff). 88 | - add samples generators. 89 | 90 | 91 | #### Thanks to: 92 | 93 | - [RDatasets](https://github.com/vincentarelbundock/Rdatasets): R's datasets collection. 
94 | -------------------------------------------------------------------------------- /examples/sample-datasets.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Samples of the available datasets" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "from pydataset import data" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": { 25 | "collapsed": false 26 | }, 27 | "outputs": [ 28 | { 29 | "data": { 30 | "text/html": [ 31 | "
" 141 | ], 142 | "text/plain": [ 143 | " crim zn indus chas nox rm age dis rad tax ptratio \\\n", 144 | "1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 \n", 145 | "2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 \n", 146 | "3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 \n", 147 | "4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 \n", 148 | "5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 \n", 149 | "\n", 150 | " black lstat medv \n", 151 | "1 396.90 4.98 24.0 \n", 152 | "2 396.90 9.14 21.6 \n", 153 | "3 392.83 4.03 34.7 \n", 154 | "4 394.63 2.94 33.4 \n", 155 | "5 396.90 5.33 36.2 " 156 | ] 157 | }, 158 | "execution_count": 2, 159 | "metadata": {}, 160 | "output_type": "execute_result" 161 | } 162 | ], 163 | "source": [ 164 | "boston = data('Boston')\n", 165 | "boston.head()" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 3, 171 | "metadata": { 172 | "collapsed": false 173 | }, 174 | "outputs": [ 175 | { 176 | "data": { 177 | "text/html": [ 178 | "
" 270 | ], 271 | "text/plain": [ 272 | " dur size waterd gasres operator p vardp p97 varp97 \\\n", 273 | "1 86 235 126 1140 2576 2.1834 1.8700 2.0480 3.298 \n", 274 | "2 227 105 91 0 16000 1.3894 2.4000 2.0047 4.622 \n", 275 | "3 17 70 76 0 584 0.9321 0.0070 0.9076 0.178 \n", 276 | "4 12 96 85 0 16175 0.9893 0.0070 0.8993 0.150 \n", 277 | "5 99 70 140 0 2445 2.2432 1.9576 2.0662 3.258 \n", 278 | "\n", 279 | " p98 varp98 \n", 280 | "1 2.2091 3.905 \n", 281 | "2 2.0542 4.818 \n", 282 | "3 0.9056 0.179 \n", 283 | "4 0.8939 0.155 \n", 284 | "5 2.2089 3.833 " 285 | ] 286 | }, 287 | "execution_count": 3, 288 | "metadata": {}, 289 | "output_type": "execute_result" 290 | } 291 | ], 292 | "source": [ 293 | "oil = data('Oil')\n", 294 | "oil.head()" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 4, 300 | "metadata": { 301 | "collapsed": false 302 | }, 303 | "outputs": [ 304 | { 305 | "data": { 306 | "text/html": [ 307 | "
" 369 | ], 370 | "text/plain": [ 371 | " airline year cost output pf lf\n", 372 | "1 1 1 1140640 0.952757 106650 0.534487\n", 373 | "2 1 2 1215690 0.986757 110307 0.532328\n", 374 | "3 1 3 1309570 1.091980 110574 0.547736\n", 375 | "4 1 4 1511530 1.175780 121974 0.540846\n", 376 | "5 1 5 1676730 1.160170 196606 0.591167" 377 | ] 378 | }, 379 | "execution_count": 4, 380 | "metadata": {}, 381 | "output_type": "execute_result" 382 | } 383 | ], 384 | "source": [ 385 | "air = data('Airline')\n", 386 | "air.head()" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 6, 392 | "metadata": { 393 | "collapsed": false 394 | }, 395 | "outputs": [ 396 | { 397 | "data": { 398 | "text/html": [ 399 | "
" 497 | ], 498 | "text/plain": [ 499 | " price lotsize bedrooms bathrms stories driveway recroom fullbase gashw \\\n", 500 | "1 42000 5850 3 1 2 yes no yes no \n", 501 | "2 38500 4000 2 1 1 yes no no no \n", 502 | "3 49500 3060 3 1 1 yes no no no \n", 503 | "4 60500 6650 3 1 2 yes yes no no \n", 504 | "5 61000 6360 2 1 1 yes no no no \n", 505 | "\n", 506 | " airco garagepl prefarea \n", 507 | "1 no 1 no \n", 508 | "2 no 0 no \n", 509 | "3 no 0 no \n", 510 | "4 no 0 no \n", 511 | "5 no 0 no " 512 | ] 513 | }, 514 | "execution_count": 6, 515 | "metadata": {}, 516 | "output_type": "execute_result" 517 | } 518 | ], 519 | "source": [ 520 | "housing = data('Housing')\n", 521 | "housing.head()" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": 9, 527 | "metadata": { 528 | "collapsed": false 529 | }, 530 | "outputs": [ 531 | { 532 | "data": { 533 | "text/html": [ 534 | "
" 620 | ], 621 | "text/plain": [ 622 | " title \\\n", 623 | "1 Asian-Pacific Economic Literature \n", 624 | "2 South African Journal of Economic History \n", 625 | "3 Computational Economics \n", 626 | "4 MOCT-MOST Economic Policy in Transitional Economics \n", 627 | "5 Journal of Socio-Economics \n", 628 | "\n", 629 | " pub society libprice pages charpp citestot date1 \\\n", 630 | "1 Blackwell no 123 440 3822 21 1986 \n", 631 | "2 So Afr ec history assn no 20 309 1782 22 1986 \n", 632 | "3 Kluwer no 443 567 2924 22 1987 \n", 633 | "4 Kluwer no 276 520 3234 22 1991 \n", 634 | "5 Elsevier no 295 791 3024 24 1972 \n", 635 | "\n", 636 | " oclc field \n", 637 | "1 14 General \n", 638 | "2 59 Ec History \n", 639 | "3 17 Specialized \n", 640 | "4 2 Area Studies \n", 641 | "5 96 Interdisciplinary " 642 | ] 643 | }, 644 | "execution_count": 9, 645 | "metadata": {}, 646 | "output_type": "execute_result" 647 | } 648 | ], 649 | "source": [ 650 | "housing = data('Journals')\n", 651 | "housing.head()" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": 10, 657 | "metadata": { 658 | "collapsed": false 659 | }, 660 | "outputs": [ 661 | { 662 | "data": { 663 | "text/html": [ 664 | "
" 744 | ], 745 | "text/plain": [ 746 | " occupation region nkids nkids2 nadults lnx stobacco salcohol \\\n", 747 | "1 bluecol flanders 1 0 2 14.19054 0 0.000000 \n", 748 | "2 inactself flanders 0 0 3 13.90857 0 0.002285 \n", 749 | "3 whitecol flanders 0 0 1 13.97461 0 0.012875 \n", 750 | "4 bluecol flanders 1 0 2 13.76281 0 0.005907 \n", 751 | "5 inactself flanders 2 0 1 13.80800 0 0.021981 \n", 752 | "\n", 753 | " age \n", 754 | "1 2 \n", 755 | "2 3 \n", 756 | "3 2 \n", 757 | "4 2 \n", 758 | "5 2 " 759 | ] 760 | }, 761 | "execution_count": 10, 762 | "metadata": {}, 763 | "output_type": "execute_result" 764 | } 765 | ], 766 | "source": [ 767 | "housing = data('Tobacco')\n", 768 | "housing.head()" 769 | ] 770 | }, 771 | { 772 | "cell_type": "markdown", 773 | "metadata": {}, 774 | "source": [ 775 | "If you are not sure what's the dataset name or whether it exists or not, you can try something close:" 776 | ] 777 | }, 778 | { 779 | "cell_type": "code", 780 | "execution_count": 11, 781 | "metadata": { 782 | "collapsed": false 783 | }, 784 | "outputs": [ 785 | { 786 | "name": "stdout", 787 | "output_type": "stream", 788 | "text": [ 789 | "Did you mean:\n", 790 | "anscombe, Anscombe, income, acme, newcomb, cancer, OME, voteincome, cane, sanction, brambles\n" 791 | ] 792 | } 793 | ], 794 | "source": [ 795 | "data('ancombe')" 796 | ] 797 | } 798 | ], 799 | "metadata": { 800 | "kernelspec": { 801 | "display_name": "Python 3", 802 | "language": "python", 803 | "name": "python3" 804 | }, 805 | "language_info": { 806 | "codemirror_mode": { 807 | "name": "ipython", 808 | "version": 3 809 | }, 810 | "file_extension": ".py", 811 | "mimetype": "text/x-python", 812 | "name": "python", 813 | "nbconvert_exporter": "python", 814 | "pygments_lexer": "ipython3", 815 | "version": "3.5.1" 816 | } 817 | }, 818 | "nbformat": 4, 819 | "nbformat_minor": 0 820 | } 821 | -------------------------------------------------------------------------------- /pydataset/__init__.py: -------------------------------------------------------------------------------- 1 | # __init__.py 2 | # main interface to pydataset module 3 | 4 | from .datasets_handler import __print_item_docs, __read_csv, __datasets_desc 5 | from .support import find_similar 6 | 7 | 8 | def data(item=None, show_doc=False): 9 | """loads a datasaet (from in-modules datasets) in a dataframe data structure. 10 | 11 | Args: 12 | item (str) : name of the dataset to load. 13 | show_doc (bool) : to show the dataset's documentation. 14 | 15 | Examples: 16 | 17 | >>> iris = data('iris') 18 | 19 | 20 | >>> data('titanic', show_doc=True) 21 | : returns the dataset's documentation. 22 | 23 | >>> data() 24 | : like help(), returns a dataframe [Item, Title] 25 | for a list of the available datasets. 
26 | """ 27 | 28 | if item: 29 | try: 30 | if show_doc: 31 | __print_item_docs(item) 32 | return 33 | 34 | df = __read_csv(item) 35 | return df 36 | except KeyError: 37 | find_similar(item) 38 | else: 39 | return __datasets_desc() 40 | 41 | 42 | if __name__ == '__main__': 43 | # Numerical data 44 | rain = data('rain') 45 | print(rain) 46 | -------------------------------------------------------------------------------- /pydataset/datasets_handler.py: -------------------------------------------------------------------------------- 1 | # datasets_handler.py 2 | # dataset handling file 3 | 4 | import pandas as pd 5 | from .utils import html2text 6 | from .locate_datasets import __items_dict, __docs_dict, __get_data_folder_path 7 | 8 | items = __items_dict() 9 | docs = __docs_dict() 10 | 11 | # make dataframe layout (of __datasets_desc()) terminal-friendly 12 | pd.set_option('display.max_rows', 170) 13 | pd.set_option('display.max_colwidth', 90) 14 | # for terminal, auto-detect 15 | pd.set_option('display.width', None) 16 | 17 | 18 | # HELPER 19 | 20 | def __filter_doc(raw): 21 | note = "PyDataset Documentation (adopted from R Documentation. " \ 22 | "The displayed examples are in R)" 23 | txt = raw.replace('R Documentation', note) 24 | return txt 25 | 26 | 27 | def __read_docs(path): 28 | # raw html 29 | html = open(path, 'r').read() 30 | # html handler 31 | h = html2text.HTML2Text() 32 | h.ignore_links = True 33 | h.ignore_images = True 34 | txt = h.handle(html) 35 | 36 | return txt 37 | 38 | 39 | # MAIN 40 | 41 | def __get_csv_path(item): 42 | """return the full path of the item's csv file""" 43 | return items[item] 44 | 45 | 46 | def __read_csv(item): 47 | path = __get_csv_path(item) 48 | df = pd.read_csv(path, index_col=0) 49 | # display 'optional' log msg "loaded: Titanic " 50 | # print('loaded: {} {}'.format(item, type(df))) 51 | return df 52 | 53 | 54 | def __get_doc_path(item): 55 | return docs[item] 56 | 57 | 58 | def __print_item_docs(item): 59 | path = __get_doc_path(item) 60 | doc = __read_docs(path) # html format 61 | txt = __filter_doc(doc) # edit R related txt 62 | print(txt) 63 | 64 | 65 | def __datasets_desc(): 66 | """return a df of the available datasets with description""" 67 | datasets = __get_data_folder_path() + 'datasets.csv' 68 | df = pd.read_csv(datasets) 69 | df = df[['Item', 'Title']] 70 | df.columns = ['dataset_id', 'title'] 71 | # print('a list of the available datasets:') 72 | return df 73 | -------------------------------------------------------------------------------- /pydataset/dump_data.py: -------------------------------------------------------------------------------- 1 | # dump_data.py 2 | # initialize PYDATASET_HOME, and 3 | # dump pydataset/resources.tar.gz into $HOME/.pydataset/ 4 | 5 | import tarfile 6 | from os import path as os_path 7 | from os import mkdir as os_mkdir 8 | from os.path import join as path_join 9 | 10 | 11 | def __setup_db(): 12 | 13 | homedir = os_path.expanduser('~') 14 | PYDATASET_HOME = path_join(homedir, '.pydataset/') 15 | 16 | if not os_path.exists(PYDATASET_HOME): 17 | # create $HOME/.pydataset/ 18 | os_mkdir(PYDATASET_HOME) 19 | print('initiated datasets repo at: {}'.format(PYDATASET_HOME)) 20 | 21 | # copy the resources.tar.gz from the module files. 22 | 23 | # # There should be a better way ? read from a URL ? 
24 | import pydataset 25 | filename = path_join(pydataset.__path__[0], 'resources.tar.gz') 26 | tar = tarfile.open(filename, mode='r|gz') 27 | 28 | # # reading 'resources.tar.gz' from a URL 29 | # try: 30 | # from urllib.request import urlopen # py3 31 | # except ImportError: 32 | # from urllib import urlopen # py2 33 | # import tarfile 34 | # 35 | # targz_url = 'https://example.com/resources.tar.gz' 36 | # httpstrem = urlopen(targz_url) 37 | # tar = tarfile.open(fileobj=httpstrem, mode="r|gz") 38 | 39 | # extract 'resources.tar.gz' into PYDATASET_HOME 40 | # print('extracting resources.tar.gz ... from {}'.format(targz_url)) 41 | tar.extractall(path=PYDATASET_HOME) 42 | # print('done.') 43 | tar.close() 44 | -------------------------------------------------------------------------------- /pydataset/locate_datasets.py: -------------------------------------------------------------------------------- 1 | 2 | # locate_datasets.py 3 | # locate datasets file paths 4 | 5 | from os import path as os_path 6 | from os import walk as os_walk 7 | from os.path import join as path_join 8 | from .dump_data import __setup_db 9 | 10 | 11 | def __get_data_folder_path(): 12 | # read rdata folder's path from $HOME 13 | homedir = os_path.expanduser('~') 14 | # initiate database datafile 15 | dpath = path_join(homedir, '.pydataset/resources/rdata/') 16 | if os_path.exists(dpath): 17 | return dpath 18 | else: 19 | # create PYDATASET_HOME and folders 20 | __setup_db() 21 | return __get_data_folder_path() 22 | 23 | data_path = __get_data_folder_path() 24 | 25 | 26 | # scan data and documentation folders to build a dictionary (e.g. 27 | # {item:path} ) for each 28 | 29 | items = {} 30 | docs = {} 31 | for dirname, dirnames, filenames in os_walk(data_path): 32 | 33 | # store item name and path to all csv files. 34 | for fname in filenames: 35 | if fname.endswith('.csv') and not fname.startswith('.'): 36 | # e.g. pydataset-package/rdata/csv/boot/acme.csv 37 | item_path = path_join(dirname, fname) 38 | # e.g acme.csv 39 | item_file = os_path.split(item_path)[1] 40 | # e.g. acme 41 | item = item_file.replace('.csv', '') 42 | # store item and its path 43 | items[item] = item_path 44 | 45 | # store item name and path to all html files. 46 | for fname in filenames: 47 | if fname.endswith('.html') and not fname.startswith('.'): 48 | item_path = path_join(dirname, fname) 49 | item_file = os_path.split(item_path)[1] 50 | item = item_file.replace('.html', '') 51 | docs[item] = item_path 52 | 53 | 54 | def __items_dict(): 55 | return items 56 | 57 | 58 | def __docs_dict(): 59 | return docs 60 | -------------------------------------------------------------------------------- /pydataset/resources.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iamaziz/PyDataset/789c0ca7587b86343f636b132dcf1f475ee6b90b/pydataset/resources.tar.gz -------------------------------------------------------------------------------- /pydataset/support.py: -------------------------------------------------------------------------------- 1 | 2 | from difflib import SequenceMatcher as SM 3 | from collections import Counter 4 | from .locate_datasets import __items_dict 5 | 6 | 7 | DATASET_IDS = list(__items_dict().keys()) 8 | ERROR = ('Not valid dataset name and no similar found! 
' 9 | 'Try: data() to see available.') 10 | 11 | 12 | def similarity(w1, w2, threshold=0.5): 13 | """compare two strings 'words', and 14 | return ratio of smiliarity, be it larger than the threshold, 15 | or 0 otherwise. 16 | 17 | NOTE: if the result more like junk, increase the threshold value. 18 | """ 19 | ratio = SM(None, str(w1).lower(), str(w2).lower()).ratio() 20 | return ratio if ratio > threshold else 0 21 | 22 | 23 | def search_similar(s1, dlist=DATASET_IDS, MAX_SIMILARS=10): 24 | """Returns the top MAX_SIMILARS [(dataset_id : smilarity_ratio)] to s1""" 25 | 26 | similars = {s2: similarity(s1, s2) 27 | for s2 in dlist 28 | if similarity(s1, s2)} 29 | 30 | # a list of tuples [(similar_word, ratio) .. ] 31 | top_match = Counter(similars).most_common(MAX_SIMILARS+1) 32 | 33 | return top_match 34 | 35 | 36 | def find_similar(query): 37 | 38 | result = search_similar(query) 39 | 40 | if result: 41 | top_words, ratios = zip(*result) 42 | 43 | print('Did you mean:') 44 | print(', '.join(t for t in top_words)) 45 | # print(', '.join('{:.1f}'.format(r*100) for r in ratios)) 46 | 47 | else: 48 | raise Exception(ERROR) 49 | 50 | 51 | if __name__ == '__main__': 52 | 53 | s = 'ansc' 54 | find_similar(s) 55 | -------------------------------------------------------------------------------- /pydataset/utils/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | -------------------------------------------------------------------------------- /pydataset/utils/html2text.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """html2text: Turn HTML into equivalent Markdown-structured text.""" 3 | __version__ = "3.200.3" 4 | __author__ = "Aaron Swartz (me@aaronsw.com)" 5 | __copyright__ = "(C) 2004-2008 Aaron Swartz. GNU GPL 3." 6 | __contributors__ = ["Martin 'Joey' Schulze", "Ricardo Reyes", "Kevin Jay North"] 7 | 8 | # TODO: 9 | # Support decoded entities with unifiable. 10 | 11 | try: 12 | True 13 | except NameError: 14 | setattr(__builtins__, 'True', 1) 15 | setattr(__builtins__, 'False', 0) 16 | 17 | def has_key(x, y): 18 | if hasattr(x, 'has_key'): return x.has_key(y) 19 | else: return y in x 20 | 21 | try: 22 | import htmlentitydefs 23 | import urlparse 24 | import HTMLParser 25 | except ImportError: #Python3 26 | import html.entities as htmlentitydefs 27 | import urllib.parse as urlparse 28 | import html.parser as HTMLParser 29 | try: #Python3 30 | import urllib.request as urllib 31 | except: 32 | import urllib 33 | import optparse, re, sys, codecs, types 34 | 35 | try: from textwrap import wrap 36 | except: pass 37 | 38 | # Use Unicode characters instead of their ascii psuedo-replacements 39 | UNICODE_SNOB = 0 40 | 41 | # Escape all special characters. Output is less readable, but avoids corner case formatting issues. 42 | ESCAPE_SNOB = 0 43 | 44 | # Put the links after each paragraph instead of at the end. 45 | LINKS_EACH_PARAGRAPH = 0 46 | 47 | # Wrap long lines at position. 0 for no wrapping. (Requires Python 2.3.) 48 | BODY_WIDTH = 78 49 | 50 | # Don't show internal links (href="#local-anchor") -- corresponding link targets 51 | # won't be visible in the plain text file anyway. 
52 | SKIP_INTERNAL_LINKS = True 53 | 54 | # Use inline, rather than reference, formatting for images and links 55 | INLINE_LINKS = True 56 | 57 | # Number of pixels Google indents nested lists 58 | GOOGLE_LIST_INDENT = 36 59 | 60 | IGNORE_ANCHORS = False 61 | IGNORE_IMAGES = False 62 | IGNORE_EMPHASIS = False 63 | 64 | ### Entity Nonsense ### 65 | 66 | def name2cp(k): 67 | if k == 'apos': return ord("'") 68 | if hasattr(htmlentitydefs, "name2codepoint"): # requires Python 2.3 69 | return htmlentitydefs.name2codepoint[k] 70 | else: 71 | k = htmlentitydefs.entitydefs[k] 72 | if k.startswith("&#") and k.endswith(";"): return int(k[2:-1]) # not in latin-1 73 | return ord(codecs.latin_1_decode(k)[0]) 74 | 75 | unifiable = {'rsquo':"'", 'lsquo':"'", 'rdquo':'"', 'ldquo':'"', 76 | 'copy':'(C)', 'mdash':'--', 'nbsp':' ', 'rarr':'->', 'larr':'<-', 'middot':'*', 77 | 'ndash':'-', 'oelig':'oe', 'aelig':'ae', 78 | 'agrave':'a', 'aacute':'a', 'acirc':'a', 'atilde':'a', 'auml':'a', 'aring':'a', 79 | 'egrave':'e', 'eacute':'e', 'ecirc':'e', 'euml':'e', 80 | 'igrave':'i', 'iacute':'i', 'icirc':'i', 'iuml':'i', 81 | 'ograve':'o', 'oacute':'o', 'ocirc':'o', 'otilde':'o', 'ouml':'o', 82 | 'ugrave':'u', 'uacute':'u', 'ucirc':'u', 'uuml':'u', 83 | 'lrm':'', 'rlm':''} 84 | 85 | unifiable_n = {} 86 | 87 | for k in unifiable.keys(): 88 | unifiable_n[name2cp(k)] = unifiable[k] 89 | 90 | ### End Entity Nonsense ### 91 | 92 | def onlywhite(line): 93 | """Return true if the line does only consist of whitespace characters.""" 94 | for c in line: 95 | if c is not ' ' and c is not ' ': 96 | return c is ' ' 97 | return line 98 | 99 | def hn(tag): 100 | if tag[0] == 'h' and len(tag) == 2: 101 | try: 102 | n = int(tag[1]) 103 | if n in range(1, 10): return n 104 | except ValueError: return 0 105 | 106 | def dumb_property_dict(style): 107 | """returns a hash of css attributes""" 108 | return dict([(x.strip(), y.strip()) for x, y in [z.split(':', 1) for z in style.split(';') if ':' in z]]); 109 | 110 | def dumb_css_parser(data): 111 | """returns a hash of css selectors, each of which contains a hash of css attributes""" 112 | # remove @import sentences 113 | data += ';' 114 | importIndex = data.find('@import') 115 | while importIndex != -1: 116 | data = data[0:importIndex] + data[data.find(';', importIndex) + 1:] 117 | importIndex = data.find('@import') 118 | 119 | # parse the css. reverted from dictionary compehension in order to support older pythons 120 | elements = [x.split('{') for x in data.split('}') if '{' in x.strip()] 121 | try: 122 | elements = dict([(a.strip(), dumb_property_dict(b)) for a, b in elements]) 123 | except ValueError: 124 | elements = {} # not that important 125 | 126 | return elements 127 | 128 | def element_style(attrs, style_def, parent_style): 129 | """returns a hash of the 'final' style attributes of the element""" 130 | style = parent_style.copy() 131 | if 'class' in attrs: 132 | for css_class in attrs['class'].split(): 133 | css_style = style_def['.' 
+ css_class] 134 | style.update(css_style) 135 | if 'style' in attrs: 136 | immediate_style = dumb_property_dict(attrs['style']) 137 | style.update(immediate_style) 138 | return style 139 | 140 | def google_list_style(style): 141 | """finds out whether this is an ordered or unordered list""" 142 | if 'list-style-type' in style: 143 | list_style = style['list-style-type'] 144 | if list_style in ['disc', 'circle', 'square', 'none']: 145 | return 'ul' 146 | return 'ol' 147 | 148 | def google_has_height(style): 149 | """check if the style of the element has the 'height' attribute explicitly defined""" 150 | if 'height' in style: 151 | return True 152 | return False 153 | 154 | def google_text_emphasis(style): 155 | """return a list of all emphasis modifiers of the element""" 156 | emphasis = [] 157 | if 'text-decoration' in style: 158 | emphasis.append(style['text-decoration']) 159 | if 'font-style' in style: 160 | emphasis.append(style['font-style']) 161 | if 'font-weight' in style: 162 | emphasis.append(style['font-weight']) 163 | return emphasis 164 | 165 | def google_fixed_width_font(style): 166 | """check if the css of the current element defines a fixed width font""" 167 | font_family = '' 168 | if 'font-family' in style: 169 | font_family = style['font-family'] 170 | if 'Courier New' == font_family or 'Consolas' == font_family: 171 | return True 172 | return False 173 | 174 | def list_numbering_start(attrs): 175 | """extract numbering from list element attributes""" 176 | if 'start' in attrs: 177 | return int(attrs['start']) - 1 178 | else: 179 | return 0 180 | 181 | class HTML2Text(HTMLParser.HTMLParser): 182 | def __init__(self, out=None, baseurl=''): 183 | HTMLParser.HTMLParser.__init__(self) 184 | 185 | # Config options 186 | self.unicode_snob = UNICODE_SNOB 187 | self.escape_snob = ESCAPE_SNOB 188 | self.links_each_paragraph = LINKS_EACH_PARAGRAPH 189 | self.body_width = BODY_WIDTH 190 | self.skip_internal_links = SKIP_INTERNAL_LINKS 191 | self.inline_links = INLINE_LINKS 192 | self.google_list_indent = GOOGLE_LIST_INDENT 193 | self.ignore_links = IGNORE_ANCHORS 194 | self.ignore_images = IGNORE_IMAGES 195 | self.ignore_emphasis = IGNORE_EMPHASIS 196 | self.google_doc = False 197 | self.ul_item_mark = '*' 198 | self.emphasis_mark = '_' 199 | self.strong_mark = '**' 200 | 201 | if out is None: 202 | self.out = self.outtextf 203 | else: 204 | self.out = out 205 | 206 | self.outtextlist = [] # empty list to store output characters before they are "joined" 207 | 208 | try: 209 | self.outtext = unicode() 210 | except NameError: # Python3 211 | self.outtext = str() 212 | 213 | self.quiet = 0 214 | self.p_p = 0 # number of newline character to print before next output 215 | self.outcount = 0 216 | self.start = 1 217 | self.space = 0 218 | self.a = [] 219 | self.astack = [] 220 | self.maybe_automatic_link = None 221 | self.absolute_url_matcher = re.compile(r'^[a-zA-Z+]+://') 222 | self.acount = 0 223 | self.list = [] 224 | self.blockquote = 0 225 | self.pre = 0 226 | self.startpre = 0 227 | self.code = False 228 | self.br_toggle = '' 229 | self.lastWasNL = 0 230 | self.lastWasList = False 231 | self.style = 0 232 | self.style_def = {} 233 | self.tag_stack = [] 234 | self.emphasis = 0 235 | self.drop_white_space = 0 236 | self.inheader = False 237 | self.abbr_title = None # current abbreviation definition 238 | self.abbr_data = None # last inner HTML (for abbr being defined) 239 | self.abbr_list = {} # stack of abbreviations to write later 240 | self.baseurl = baseurl 241 | 242 | try: del 
unifiable_n[name2cp('nbsp')] 243 | except KeyError: pass 244 | unifiable['nbsp'] = ' _place_holder;' 245 | 246 | 247 | def feed(self, data): 248 | data = data.replace("", "") 249 | HTMLParser.HTMLParser.feed(self, data) 250 | 251 | def handle(self, data): 252 | self.feed(data) 253 | self.feed("") 254 | return self.optwrap(self.close()) 255 | 256 | def outtextf(self, s): 257 | self.outtextlist.append(s) 258 | if s: self.lastWasNL = s[-1] == '\n' 259 | 260 | def close(self): 261 | HTMLParser.HTMLParser.close(self) 262 | 263 | self.pbr() 264 | self.o('', 0, 'end') 265 | 266 | self.outtext = self.outtext.join(self.outtextlist) 267 | if self.unicode_snob: 268 | nbsp = unichr(name2cp('nbsp')) 269 | else: 270 | nbsp = u' ' 271 | self.outtext = self.outtext.replace(u' _place_holder;', nbsp) 272 | 273 | return self.outtext 274 | 275 | def handle_charref(self, c): 276 | self.o(self.charref(c), 1) 277 | 278 | def handle_entityref(self, c): 279 | self.o(self.entityref(c), 1) 280 | 281 | def handle_starttag(self, tag, attrs): 282 | self.handle_tag(tag, attrs, 1) 283 | 284 | def handle_endtag(self, tag): 285 | self.handle_tag(tag, None, 0) 286 | 287 | def previousIndex(self, attrs): 288 | """ returns the index of certain set of attributes (of a link) in the 289 | self.a list 290 | 291 | If the set of attributes is not found, returns None 292 | """ 293 | if not has_key(attrs, 'href'): return None 294 | 295 | i = -1 296 | for a in self.a: 297 | i += 1 298 | match = 0 299 | 300 | if has_key(a, 'href') and a['href'] == attrs['href']: 301 | if has_key(a, 'title') or has_key(attrs, 'title'): 302 | if (has_key(a, 'title') and has_key(attrs, 'title') and 303 | a['title'] == attrs['title']): 304 | match = True 305 | else: 306 | match = True 307 | 308 | if match: return i 309 | 310 | def drop_last(self, nLetters): 311 | if not self.quiet: 312 | self.outtext = self.outtext[:-nLetters] 313 | 314 | def handle_emphasis(self, start, tag_style, parent_style): 315 | """handles various text emphases""" 316 | tag_emphasis = google_text_emphasis(tag_style) 317 | parent_emphasis = google_text_emphasis(parent_style) 318 | 319 | # handle Google's text emphasis 320 | strikethrough = 'line-through' in tag_emphasis and self.hide_strikethrough 321 | bold = 'bold' in tag_emphasis and not 'bold' in parent_emphasis 322 | italic = 'italic' in tag_emphasis and not 'italic' in parent_emphasis 323 | fixed = google_fixed_width_font(tag_style) and not \ 324 | google_fixed_width_font(parent_style) and not self.pre 325 | 326 | if start: 327 | # crossed-out text must be handled before other attributes 328 | # in order not to output qualifiers unnecessarily 329 | if bold or italic or fixed: 330 | self.emphasis += 1 331 | if strikethrough: 332 | self.quiet += 1 333 | if italic: 334 | self.o(self.emphasis_mark) 335 | self.drop_white_space += 1 336 | if bold: 337 | self.o(self.strong_mark) 338 | self.drop_white_space += 1 339 | if fixed: 340 | self.o('`') 341 | self.drop_white_space += 1 342 | self.code = True 343 | else: 344 | if bold or italic or fixed: 345 | # there must not be whitespace before closing emphasis mark 346 | self.emphasis -= 1 347 | self.space = 0 348 | self.outtext = self.outtext.rstrip() 349 | if fixed: 350 | if self.drop_white_space: 351 | # empty emphasis, drop it 352 | self.drop_last(1) 353 | self.drop_white_space -= 1 354 | else: 355 | self.o('`') 356 | self.code = False 357 | if bold: 358 | if self.drop_white_space: 359 | # empty emphasis, drop it 360 | self.drop_last(2) 361 | self.drop_white_space -= 1 362 | else: 363 | 
self.o(self.strong_mark) 364 | if italic: 365 | if self.drop_white_space: 366 | # empty emphasis, drop it 367 | self.drop_last(1) 368 | self.drop_white_space -= 1 369 | else: 370 | self.o(self.emphasis_mark) 371 | # space is only allowed after *all* emphasis marks 372 | if (bold or italic) and not self.emphasis: 373 | self.o(" ") 374 | if strikethrough: 375 | self.quiet -= 1 376 | 377 | def handle_tag(self, tag, attrs, start): 378 | #attrs = fixattrs(attrs) 379 | if attrs is None: 380 | attrs = {} 381 | else: 382 | attrs = dict(attrs) 383 | 384 | if self.google_doc: 385 | # the attrs parameter is empty for a closing tag. in addition, we 386 | # need the attributes of the parent nodes in order to get a 387 | # complete style description for the current element. we assume 388 | # that google docs export well formed html. 389 | parent_style = {} 390 | if start: 391 | if self.tag_stack: 392 | parent_style = self.tag_stack[-1][2] 393 | tag_style = element_style(attrs, self.style_def, parent_style) 394 | self.tag_stack.append((tag, attrs, tag_style)) 395 | else: 396 | dummy, attrs, tag_style = self.tag_stack.pop() 397 | if self.tag_stack: 398 | parent_style = self.tag_stack[-1][2] 399 | 400 | if hn(tag): 401 | self.p() 402 | if start: 403 | self.inheader = True 404 | self.o(hn(tag)*"#" + ' ') 405 | else: 406 | self.inheader = False 407 | return # prevent redundant emphasis marks on headers 408 | 409 | if tag in ['p', 'div']: 410 | if self.google_doc: 411 | if start and google_has_height(tag_style): 412 | self.p() 413 | else: 414 | self.soft_br() 415 | else: 416 | self.p() 417 | 418 | if tag == "br" and start: self.o(" \n") 419 | 420 | if tag == "hr" and start: 421 | self.p() 422 | self.o("* * *") 423 | self.p() 424 | 425 | if tag in ["head", "style", 'script']: 426 | if start: self.quiet += 1 427 | else: self.quiet -= 1 428 | 429 | if tag == "style": 430 | if start: self.style += 1 431 | else: self.style -= 1 432 | 433 | if tag in ["body"]: 434 | self.quiet = 0 # sites like 9rules.com never close 435 | 436 | if tag == "blockquote": 437 | if start: 438 | self.p(); self.o('> ', 0, 1); self.start = 1 439 | self.blockquote += 1 440 | else: 441 | self.blockquote -= 1 442 | self.p() 443 | 444 | if tag in ['em', 'i', 'u'] and not self.ignore_emphasis: self.o(self.emphasis_mark) 445 | if tag in ['strong', 'b'] and not self.ignore_emphasis: self.o(self.strong_mark) 446 | if tag in ['del', 'strike', 's']: 447 | if start: 448 | self.o("<"+tag+">") 449 | else: 450 | self.o("") 451 | 452 | if self.google_doc: 453 | if not self.inheader: 454 | # handle some font attributes, but leave headers clean 455 | self.handle_emphasis(start, tag_style, parent_style) 456 | 457 | if tag in ["code", "tt"] and not self.pre: self.o('`') #TODO: `` `this` `` 458 | if tag == "abbr": 459 | if start: 460 | self.abbr_title = None 461 | self.abbr_data = '' 462 | if has_key(attrs, 'title'): 463 | self.abbr_title = attrs['title'] 464 | else: 465 | if self.abbr_title != None: 466 | self.abbr_list[self.abbr_data] = self.abbr_title 467 | self.abbr_title = None 468 | self.abbr_data = '' 469 | 470 | if tag == "a" and not self.ignore_links: 471 | if start: 472 | if has_key(attrs, 'href') and not (self.skip_internal_links and attrs['href'].startswith('#')): 473 | self.astack.append(attrs) 474 | self.maybe_automatic_link = attrs['href'] 475 | else: 476 | self.astack.append(None) 477 | else: 478 | if self.astack: 479 | a = self.astack.pop() 480 | if self.maybe_automatic_link: 481 | self.maybe_automatic_link = None 482 | elif a: 483 | if 
self.inline_links: 484 | self.o("](" + escape_md(a['href']) + ")") 485 | else: 486 | i = self.previousIndex(a) 487 | if i is not None: 488 | a = self.a[i] 489 | else: 490 | self.acount += 1 491 | a['count'] = self.acount 492 | a['outcount'] = self.outcount 493 | self.a.append(a) 494 | self.o("][" + str(a['count']) + "]") 495 | 496 | if tag == "img" and start and not self.ignore_images: 497 | if has_key(attrs, 'src'): 498 | attrs['href'] = attrs['src'] 499 | alt = attrs.get('alt', '') 500 | self.o("![" + escape_md(alt) + "]") 501 | 502 | if self.inline_links: 503 | self.o("(" + escape_md(attrs['href']) + ")") 504 | else: 505 | i = self.previousIndex(attrs) 506 | if i is not None: 507 | attrs = self.a[i] 508 | else: 509 | self.acount += 1 510 | attrs['count'] = self.acount 511 | attrs['outcount'] = self.outcount 512 | self.a.append(attrs) 513 | self.o("[" + str(attrs['count']) + "]") 514 | 515 | if tag == 'dl' and start: self.p() 516 | if tag == 'dt' and not start: self.pbr() 517 | if tag == 'dd' and start: self.o(' ') 518 | if tag == 'dd' and not start: self.pbr() 519 | 520 | if tag in ["ol", "ul"]: 521 | # Google Docs create sub lists as top level lists 522 | if (not self.list) and (not self.lastWasList): 523 | self.p() 524 | if start: 525 | if self.google_doc: 526 | list_style = google_list_style(tag_style) 527 | else: 528 | list_style = tag 529 | numbering_start = list_numbering_start(attrs) 530 | self.list.append({'name':list_style, 'num':numbering_start}) 531 | else: 532 | if self.list: self.list.pop() 533 | self.lastWasList = True 534 | else: 535 | self.lastWasList = False 536 | 537 | if tag == 'li': 538 | self.pbr() 539 | if start: 540 | if self.list: li = self.list[-1] 541 | else: li = {'name':'ul', 'num':0} 542 | if self.google_doc: 543 | nest_count = self.google_nest_count(tag_style) 544 | else: 545 | nest_count = len(self.list) 546 | self.o(" " * nest_count) #TODO: line up
  1. s > 9 correctly. 547 | if li['name'] == "ul": self.o(self.ul_item_mark + " ") 548 | elif li['name'] == "ol": 549 | li['num'] += 1 550 | self.o(str(li['num'])+". ") 551 | self.start = 1 552 | 553 | if tag in ["table", "tr"] and start: self.p() 554 | if tag == 'td': self.pbr() 555 | 556 | if tag == "pre": 557 | if start: 558 | self.startpre = 1 559 | self.pre = 1 560 | else: 561 | self.pre = 0 562 | self.p() 563 | 564 | def pbr(self): 565 | if self.p_p == 0: 566 | self.p_p = 1 567 | 568 | def p(self): 569 | self.p_p = 2 570 | 571 | def soft_br(self): 572 | self.pbr() 573 | self.br_toggle = ' ' 574 | 575 | def o(self, data, puredata=0, force=0): 576 | if self.abbr_data is not None: 577 | self.abbr_data += data 578 | 579 | if not self.quiet: 580 | if self.google_doc: 581 | # prevent white space immediately after 'begin emphasis' marks ('**' and '_') 582 | lstripped_data = data.lstrip() 583 | if self.drop_white_space and not (self.pre or self.code): 584 | data = lstripped_data 585 | if lstripped_data != '': 586 | self.drop_white_space = 0 587 | 588 | if puredata and not self.pre: 589 | data = re.sub('\s+', ' ', data) 590 | if data and data[0] == ' ': 591 | self.space = 1 592 | data = data[1:] 593 | if not data and not force: return 594 | 595 | if self.startpre: 596 | #self.out(" :") #TODO: not output when already one there 597 | if not data.startswith("\n"): #
    <pre>stuff...
    598 |                     data = "\n" + data
    599 | 
    600 |             bq = (">" * self.blockquote)
    601 |             if not (force and data and data[0] == ">") and self.blockquote: bq += " "
    602 | 
    603 |             if self.pre:
    604 |                 if not self.list:
    605 |                     bq += "    "
    606 |                 #else: list content is already partially indented
    607 |                 # for i in xrange(len(self.list)): # no python 3
    608 |                 for i in range(len(self.list)):
    609 |                     bq += "    "
    610 |                 data = data.replace("\n", "\n"+bq)
    611 | 
    612 |             if self.startpre:
    613 |                 self.startpre = 0
    614 |                 if self.list:
    615 |                     data = data.lstrip("\n") # use existing initial indentation
    616 | 
    617 |             if self.start:
    618 |                 self.space = 0
    619 |                 self.p_p = 0
    620 |                 self.start = 0
    621 | 
    622 |             if force == 'end':
    623 |                 # It's the end.
    624 |                 self.p_p = 0
    625 |                 self.out("\n")
    626 |                 self.space = 0
    627 | 
    628 |             if self.p_p:
    629 |                 self.out((self.br_toggle+'\n'+bq)*self.p_p)
    630 |                 self.space = 0
    631 |                 self.br_toggle = ''
    632 | 
    633 |             if self.space:
    634 |                 if not self.lastWasNL: self.out(' ')
    635 |                 self.space = 0
    636 | 
    637 |             if self.a and ((self.p_p == 2 and self.links_each_paragraph) or force == "end"):
    638 |                 if force == "end": self.out("\n")
    639 | 
    640 |                 newa = []
    641 |                 for link in self.a:
    642 |                     if self.outcount > link['outcount']:
    643 |                         self.out("   ["+ str(link['count']) +"]: " + urlparse.urljoin(self.baseurl, link['href']))
    644 |                         if has_key(link, 'title'): self.out(" ("+link['title']+")")
    645 |                         self.out("\n")
    646 |                     else:
    647 |                         newa.append(link)
    648 | 
    649 |                 if self.a != newa: self.out("\n") # Don't need an extra line when nothing was done.
    650 | 
    651 |                 self.a = newa
    652 | 
    653 |             if self.abbr_list and force == "end":
    654 |                 for abbr, definition in self.abbr_list.items():
    655 |                     self.out("  *[" + abbr + "]: " + definition + "\n")
    656 | 
    657 |             self.p_p = 0
    658 |             self.out(data)
    659 |             self.outcount += 1
    660 | 
    661 |     def handle_data(self, data):
    662 |         if r'\/script>' in data: self.quiet -= 1
    663 | 
    664 |         if self.style:
    665 |             self.style_def.update(dumb_css_parser(data))
    666 | 
    667 |         if self.maybe_automatic_link is not None:
    668 |             href = self.maybe_automatic_link
    669 |             if href == data and self.absolute_url_matcher.match(href):
    670 |                 self.o("<" + data + ">")
    671 |                 return
    672 |             else:
    673 |                 self.o("[")
    674 |                 self.maybe_automatic_link = None
    675 | 
    676 |         if not self.code and not self.pre:
    677 |             data = escape_md_section(data, snob=self.escape_snob)
    678 |         self.o(data, 1)
    679 | 
    680 |     def unknown_decl(self, data): pass
    681 | 
    682 |     def charref(self, name):
    683 |         if name[0] in ['x','X']:
    684 |             c = int(name[1:], 16)
    685 |         else:
    686 |             c = int(name)
    687 | 
    688 |         if not self.unicode_snob and c in unifiable_n.keys():
    689 |             return unifiable_n[c]
    690 |         else:
    691 |             try:
    692 |                 return unichr(c)
    693 |             except NameError: #Python3
    694 |                 return chr(c)
    695 | 
    696 |     def entityref(self, c):
    697 |         if not self.unicode_snob and c in unifiable.keys():
    698 |             return unifiable[c]
    699 |         else:
    700 |             try: name2cp(c)
    701 |             except KeyError: return "&" + c + ';'
    702 |             else:
    703 |                 try:
    704 |                     return unichr(name2cp(c))
    705 |                 except NameError: #Python3
    706 |                     return chr(name2cp(c))
    707 | 
    708 |     def replaceEntities(self, s):
    709 |         s = s.group(1)
    710 |         if s[0] == "#":
    711 |             return self.charref(s[1:])
    712 |         else: return self.entityref(s)
    713 | 
    714 |     r_unescape = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")
    715 |     def unescape(self, s):
    716 |         return self.r_unescape.sub(self.replaceEntities, s)
    717 | 
    718 |     def google_nest_count(self, style):
    719 |         """calculate the nesting count of google doc lists"""
    720 |         nest_count = 0
    721 |         if 'margin-left' in style:
    722 |             nest_count = int(style['margin-left'][:-2]) // self.google_list_indent  # floor division keeps the count an int on Python 3
    723 |         return nest_count
    724 | 
    725 | 
    726 |     def optwrap(self, text):
    727 |         """Wrap all paragraphs in the provided text."""
    728 |         if not self.body_width:
    729 |             return text
    730 | 
    731 |         assert wrap, "Requires Python 2.3."
    732 |         result = ''
    733 |         newlines = 0
    734 |         for para in text.split("\n"):
    735 |             if len(para) > 0:
    736 |                 if not skipwrap(para):
    737 |                     result += "\n".join(wrap(para, self.body_width))
    738 |                     if para.endswith('  '):
    739 |                         result += "  \n"
    740 |                         newlines = 1
    741 |                     else:
    742 |                         result += "\n\n"
    743 |                         newlines = 2
    744 |                 else:
    745 |                     if not onlywhite(para):
    746 |                         result += para + "\n"
    747 |                         newlines = 1
    748 |             else:
    749 |                 if newlines < 2:
    750 |                     result += "\n"
    751 |                     newlines += 1
    752 |         return result
    753 | 
    754 | ordered_list_matcher = re.compile(r'\d+\.\s')
    755 | unordered_list_matcher = re.compile(r'[-\*\+]\s')
    756 | md_chars_matcher = re.compile(r"([\\\[\]\(\)])")
    757 | md_chars_matcher_all = re.compile(r"([`\*_{}\[\]\(\)#!])")
    758 | md_dot_matcher = re.compile(r"""
    759 |     ^             # start of line
    760 |     (\s*\d+)      # optional whitespace and a number
    761 |     (\.)          # dot
    762 |     (?=\s)        # lookahead assert whitespace
    763 |     """, re.MULTILINE | re.VERBOSE)
    764 | md_plus_matcher = re.compile(r"""
    765 |     ^
    766 |     (\s*)
    767 |     (\+)
    768 |     (?=\s)
    769 |     """, flags=re.MULTILINE | re.VERBOSE)
    770 | md_dash_matcher = re.compile(r"""
    771 |     ^
    772 |     (\s*)
    773 |     (-)
    774 |     (?=\s|\-)     # followed by whitespace (bullet list, or spaced out hr)
    775 |                   # or another dash (header or hr)
    776 |     """, flags=re.MULTILINE | re.VERBOSE)
    777 | slash_chars = r'\`*_{}[]()#+-.!'
    778 | md_backslash_matcher = re.compile(r'''
    779 |     (\\)          # match one slash
    780 |     (?=[%s])      # followed by a char that requires escaping
    781 |     ''' % re.escape(slash_chars),
    782 |     flags=re.VERBOSE)
    783 | 
    784 | def skipwrap(para):
    785 |     # If the text begins with four spaces or one tab, it's a code block; don't wrap
    786 |     if para[0:4] == '    ' or para[0:1] == '\t':
    787 |         return True
    788 |     # If the text begins with only two "--", possibly preceded by whitespace, that's
    789 |     # an emdash; so wrap.
    790 |     stripped = para.lstrip()
    791 |     if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
    792 |         return False
    793 |     # I'm not sure what this is for; I thought it was to detect lists, but there's
    794 |     # a <br> -inside- case in one of the tests that also depends upon it.
    795 |     if stripped[0:1] == '-' or stripped[0:1] == '*':
    796 |         return True
    797 |     # If the text begins with a single -, *, or +, followed by a space, or an integer,
    798 |     # followed by a ., followed by a space (in either case optionally preceeded by
    799 |     # whitespace), it's a list; don't wrap.
    800 |     if ordered_list_matcher.match(stripped) or unordered_list_matcher.match(stripped):
    801 |         return True
    802 |     return False
    803 | 
    804 | def wrapwrite(text):
    805 |     text = text.encode('utf-8')
    806 |     try: #Python3
    807 |         sys.stdout.buffer.write(text)
    808 |     except AttributeError:
    809 |         sys.stdout.write(text)
    810 | 
    811 | def html2text(html, baseurl=''):
    812 |     h = HTML2Text(baseurl=baseurl)
    813 |     return h.handle(html)
    814 | 
    815 | def unescape(s, unicode_snob=False):
    816 |     h = HTML2Text()
    817 |     h.unicode_snob = unicode_snob
    818 |     return h.unescape(s)
    819 | 
    820 | def escape_md(text):
    821 |     """Escapes markdown-sensitive characters within other markdown constructs."""
    822 |     return md_chars_matcher.sub(r"\\\1", text)
    823 | 
    824 | def escape_md_section(text, snob=False):
    825 |     """Escapes markdown-sensitive characters across whole document sections."""
    826 |     text = md_backslash_matcher.sub(r"\\\1", text)
    827 |     if snob:
    828 |         text = md_chars_matcher_all.sub(r"\\\1", text)
    829 |     text = md_dot_matcher.sub(r"\1\\\2", text)
    830 |     text = md_plus_matcher.sub(r"\1\\\2", text)
    831 |     text = md_dash_matcher.sub(r"\1\\\2", text)
    832 |     return text
    833 | 
    834 | 
    835 | def main():
    836 |     baseurl = ''
    837 | 
    838 |     p = optparse.OptionParser('%prog [(filename|url) [encoding]]',
    839 |                               version='%prog ' + __version__)
    840 |     p.add_option("--ignore-emphasis", dest="ignore_emphasis", action="store_true",
    841 |         default=IGNORE_EMPHASIS, help="don't include any formatting for emphasis")
    842 |     p.add_option("--ignore-links", dest="ignore_links", action="store_true",
    843 |         default=IGNORE_ANCHORS, help="don't include any formatting for links")
    844 |     p.add_option("--ignore-images", dest="ignore_images", action="store_true",
    845 |         default=IGNORE_IMAGES, help="don't include any formatting for images")
    846 |     p.add_option("-g", "--google-doc", action="store_true", dest="google_doc",
    847 |         default=False, help="convert an html-exported Google Document")
    848 |     p.add_option("-d", "--dash-unordered-list", action="store_true", dest="ul_style_dash",
    849 |         default=False, help="use a dash rather than a star for unordered list items")
    850 |     p.add_option("-e", "--asterisk-emphasis", action="store_true", dest="em_style_asterisk",
    851 |         default=False, help="use an asterisk rather than an underscore for emphasized text")
    852 |     p.add_option("-b", "--body-width", dest="body_width", action="store", type="int",
    853 |         default=BODY_WIDTH, help="number of characters per output line, 0 for no wrap")
    854 |     p.add_option("-i", "--google-list-indent", dest="list_indent", action="store", type="int",
    855 |         default=GOOGLE_LIST_INDENT, help="number of pixels Google indents nested lists")
    856 |     p.add_option("-s", "--hide-strikethrough", action="store_true", dest="hide_strikethrough",
    857 |         default=False, help="hide strike-through text. only relevant when -g is specified as well")
    858 |     p.add_option("--escape-all", action="store_true", dest="escape_snob",
    859 |         default=False, help="Escape all special characters. Output is less readable, but avoids corner case formatting issues.")
    860 |     (options, args) = p.parse_args()
    861 | 
    862 |     # process input
    863 |     encoding = "utf-8"
    864 |     if len(args) > 0:
    865 |         file_ = args[0]
    866 |         if len(args) == 2:
    867 |             encoding = args[1]
    868 |         if len(args) > 2:
    869 |             p.error('Too many arguments')
    870 | 
    871 |         if file_.startswith('http://') or file_.startswith('https://'):
    872 |             baseurl = file_
    873 |             j = urllib.urlopen(baseurl)
    874 |             data = j.read()
    875 |             if encoding is None:
    876 |                 try:
    877 |                     from feedparser import _getCharacterEncoding as enc
    878 |                 except ImportError:
    879 |                     enc = lambda x, y: ('utf-8', 1)
    880 |                 encoding = enc(j.headers, data)[0]
    881 |                 if encoding == 'us-ascii':
    882 |                     encoding = 'utf-8'
    883 |         else:
    884 |             data = open(file_, 'rb').read()
    885 |             if encoding is None:
    886 |                 try:
    887 |                     from chardet import detect
    888 |                 except ImportError:
    889 |                     detect = lambda x: {'encoding': 'utf-8'}
    890 |                 encoding = detect(data)['encoding']
    891 |     else:
    892 |         data = sys.stdin.read()
    893 | 
    894 |     data = data.decode(encoding)
    895 |     h = HTML2Text(baseurl=baseurl)
    896 |     # handle options
    897 |     if options.ul_style_dash: h.ul_item_mark = '-'
    898 |     if options.em_style_asterisk:
    899 |         h.emphasis_mark = '*'
    900 |         h.strong_mark = '__'
    901 | 
    902 |     h.body_width = options.body_width
    903 |     h.list_indent = options.list_indent
    904 |     h.ignore_emphasis = options.ignore_emphasis
    905 |     h.ignore_links = options.ignore_links
    906 |     h.ignore_images = options.ignore_images
    907 |     h.google_doc = options.google_doc
    908 |     h.hide_strikethrough = options.hide_strikethrough
    909 |     h.escape_snob = options.escape_snob
    910 | 
    911 |     wrapwrite(h.handle(data))
    912 | 
    913 | 
    914 | if __name__ == "__main__":
    915 |     main()
    916 | 
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
    1 | [metadata]
    2 | description-file = README.md
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
    1 | #!/usr/bin/env python
    2 | 
    3 | # Author: Aziz Alto
    4 | # email: iamaziz.alto@gmail.com
    5 | 
    6 | try:
    7 |     from setuptools import setup
    8 | except ImportError:
    9 |     from distutils.core import setup
    10 | 
    11 | 
    12 | setup(
    13 |     name='pydataset',
    14 |     description=("Provides instant access to many popular datasets right from "
    15 |                  "Python (in dataframe structure)."),
    16 |     author='Aziz Alto',
    17 |     url='https://github.com/iamaziz/PyDataset',
    18 |     download_url='https://github.com/iamaziz/PyDataset/tarball/0.2.0',
    19 |     license = 'MIT',
    20 |     author_email='iamaziz.alto@gmail.com',
    21 |     version='0.2.0',
    22 |     install_requires=['pandas'],
    23 |     packages=['pydataset', 'pydataset.utils'],
    24 |     package_data={'pydataset': ['*.gz', 'resources.tar.gz']}
    25 | )
    26 | 
--------------------------------------------------------------------------------