s > 9 correctly.
547 | if li['name'] == "ul": self.o(self.ul_item_mark + " ")
548 | elif li['name'] == "ol":
549 | li['num'] += 1
550 | self.o(str(li['num'])+". ")
551 | self.start = 1
552 |
553 | if tag in ["table", "tr"] and start: self.p()
554 | if tag == 'td': self.pbr()
555 |
556 | if tag == "pre":
557 | if start:
558 | self.startpre = 1
559 | self.pre = 1
560 | else:
561 | self.pre = 0
562 | self.p()
563 |
564 | def pbr(self):
565 | if self.p_p == 0:
566 | self.p_p = 1
567 |
568 | def p(self):
569 | self.p_p = 2
570 |
571 | def soft_br(self):
572 | self.pbr()
573 | self.br_toggle = ' '
574 |
    def o(self, data, puredata=0, force=0):
        """Append *data* to the output, honouring pending breaks/prefixes.

        puredata=1 collapses whitespace runs (outside <pre>).
        force may be 1 (emit even when data is empty) or the string
        'end' (final flush: terminate output, dump pending links and
        abbreviation definitions).
        """
        # Accumulate text while inside an <abbr> element.
        if self.abbr_data is not None:
            self.abbr_data += data

        if not self.quiet:
            if self.google_doc:
                # prevent white space immediately after 'begin emphasis' marks ('**' and '_')
                lstripped_data = data.lstrip()
                if self.drop_white_space and not (self.pre or self.code):
                    data = lstripped_data
                if lstripped_data != '':
                    self.drop_white_space = 0

            if puredata and not self.pre:
                # Collapse whitespace runs; remember a stripped leading
                # space in self.space so it can be emitted later.
                data = re.sub('\s+', ' ', data)
                if data and data[0] == ' ':
                    self.space = 1
                    data = data[1:]
            if not data and not force: return

            if self.startpre:
                #self.out(" :") #TODO: not output when already one there
                if not data.startswith("\n"): # stuff...
                    data = "\n" + data

            # Blockquote prefix prepended to every emitted line.
            bq = (">" * self.blockquote)
            if not (force and data and data[0] == ">") and self.blockquote: bq += " "

            if self.pre:
                # <pre> content is indented 4 spaces, plus 4 per list level.
                if not self.list:
                    bq += "    "
                #else: list content is already partially indented
                # for i in xrange(len(self.list)): # no python 3
                for i in range(len(self.list)):
                    bq += "    "
                data = data.replace("\n", "\n"+bq)

            if self.startpre:
                self.startpre = 0
                if self.list:
                    data = data.lstrip("\n") # use existing initial indentation

            if self.start:
                # First real output: drop any pending space/paragraph breaks.
                self.space = 0
                self.p_p = 0
                self.start = 0

            if force == 'end':
                # It's the end.
                self.p_p = 0
                self.out("\n")
                self.space = 0

            if self.p_p:
                # Emit the pending line/paragraph breaks, each carrying the
                # blockquote prefix (and an optional soft-break space).
                self.out((self.br_toggle+'\n'+bq)*self.p_p)
                self.space = 0
                self.br_toggle = ''

            if self.space:
                if not self.lastWasNL: self.out(' ')
                self.space = 0

            if self.a and ((self.p_p == 2 and self.links_each_paragraph) or force == "end"):
                # Flush reference-style link definitions that have already
                # appeared in the emitted output.
                if force == "end": self.out("\n")

                newa = []
                for link in self.a:
                    if self.outcount > link['outcount']:
                        self.out("   ["+ str(link['count']) +"]: " + urlparse.urljoin(self.baseurl, link['href']))
                        if has_key(link, 'title'): self.out(" ("+link['title']+")")
                        self.out("\n")
                    else:
                        # Not referenced yet; keep it for a later flush.
                        newa.append(link)

                if self.a != newa: self.out("\n") # Don't need an extra line when nothing was done.

                self.a = newa

            if self.abbr_list and force == "end":
                # Emit abbreviation definitions at the very end.
                for abbr, definition in self.abbr_list.items():
                    self.out("  *[" + abbr + "]: " + definition + "\n")

            self.p_p = 0
            self.out(data)
            self.outcount += 1
660 |
661 | def handle_data(self, data):
662 | if r'\/script>' in data: self.quiet -= 1
663 |
664 | if self.style:
665 | self.style_def.update(dumb_css_parser(data))
666 |
667 | if not self.maybe_automatic_link is None:
668 | href = self.maybe_automatic_link
669 | if href == data and self.absolute_url_matcher.match(href):
670 | self.o("<" + data + ">")
671 | return
672 | else:
673 | self.o("[")
674 | self.maybe_automatic_link = None
675 |
676 | if not self.code and not self.pre:
677 | data = escape_md_section(data, snob=self.escape_snob)
678 | self.o(data, 1)
679 |
680 | def unknown_decl(self, data): pass
681 |
682 | def charref(self, name):
683 | if name[0] in ['x','X']:
684 | c = int(name[1:], 16)
685 | else:
686 | c = int(name)
687 |
688 | if not self.unicode_snob and c in unifiable_n.keys():
689 | return unifiable_n[c]
690 | else:
691 | try:
692 | return unichr(c)
693 | except NameError: #Python3
694 | return chr(c)
695 |
696 | def entityref(self, c):
697 | if not self.unicode_snob and c in unifiable.keys():
698 | return unifiable[c]
699 | else:
700 | try: name2cp(c)
701 | except KeyError: return "&" + c + ';'
702 | else:
703 | try:
704 | return unichr(name2cp(c))
705 | except NameError: #Python3
706 | return chr(name2cp(c))
707 |
708 | def replaceEntities(self, s):
709 | s = s.group(1)
710 | if s[0] == "#":
711 | return self.charref(s[1:])
712 | else: return self.entityref(s)
713 |
    # Matches one entity reference: numeric ("&#65;", "&#x41;") or named
    # ("&amp;"), capturing the body without the surrounding "&" and ";".
    r_unescape = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")
    def unescape(self, s):
        """Replace all entity references in *s* with their expansions."""
        return self.r_unescape.sub(self.replaceEntities, s)
717 |
718 | def google_nest_count(self, style):
719 | """calculate the nesting count of google doc lists"""
720 | nest_count = 0
721 | if 'margin-left' in style:
722 | nest_count = int(style['margin-left'][:-2]) / self.google_list_indent
723 | return nest_count
724 |
725 |
    def optwrap(self, text):
        """Wrap all paragraphs in the provided text."""
        if not self.body_width:
            # Wrapping disabled (body_width == 0).
            return text

        assert wrap, "Requires Python 2.3."
        result = ''
        newlines = 0
        for para in text.split("\n"):
            if len(para) > 0:
                if not skipwrap(para):
                    result += "\n".join(wrap(para, self.body_width))
                    if para.endswith('  '):
                        # Preserve a markdown hard line break (trailing spaces).
                        result += "  \n"
                        newlines = 1
                    else:
                        result += "\n\n"
                        newlines = 2
                else:
                    # Unwrappable content (code, lists, rules): pass through,
                    # dropping whitespace-only lines.
                    if not onlywhite(para):
                        result += para + "\n"
                        newlines = 1
            else:
                # Collapse runs of blank lines to at most two newlines.
                if newlines < 2:
                    result += "\n"
                    newlines += 1
        return result
753 |
# Matches "1. "-style ordered-list item prefixes.
ordered_list_matcher = re.compile(r'\d+\.\s')
# Matches "- ", "* ", "+ " unordered-list item prefixes.
unordered_list_matcher = re.compile(r'[-\*\+]\s')
# Characters escaped inside other markdown constructs.
md_chars_matcher = re.compile(r"([\\\[\]\(\)])")
# Full escape set, used when escape_snob is enabled.
md_chars_matcher_all = re.compile(r"([`\*_{}\[\]\(\)#!])")
md_dot_matcher = re.compile(r"""
    ^             # start of line
    (\s*\d+)      # optional whitespace and a number
    (\.)          # dot
    (?=\s)        # lookahead assert whitespace
    """, re.MULTILINE | re.VERBOSE)
md_plus_matcher = re.compile(r"""
    ^
    (\s*)
    (\+)
    (?=\s)
    """, flags=re.MULTILINE | re.VERBOSE)
md_dash_matcher = re.compile(r"""
    ^
    (\s*)
    (-)
    (?=\s|\-)     # followed by whitespace (bullet list, or spaced out hr)
                  # or another dash (header or hr)
    """, flags=re.MULTILINE | re.VERBOSE)
slash_chars = r'\`*_{}[]()#+-.!'
md_backslash_matcher = re.compile(r'''
    (\\)          # match one slash
    (?=[%s])      # followed by a char that requires escaping
    ''' % re.escape(slash_chars),
    flags=re.VERBOSE)

def skipwrap(para):
    """Return True when *para* must not be re-wrapped (code blocks,
    list items, rules); False when normal paragraph wrapping is safe.

    Accepts the empty string (returns False) instead of raising
    IndexError as the previous bare para[0] check did.
    """
    # If the text begins with four spaces or one tab, it's a code block;
    # don't wrap. startswith() is safe on short or empty strings.
    if para.startswith('    ') or para.startswith('\t'):
        return True
    # If the text begins with only two "--" (not a "---" rule), possibly
    # preceded by whitespace, that's an emdash; so wrap.
    stripped = para.lstrip()
    if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
        return False
    # I'm not sure what this is for; I thought it was to detect lists,
    # but there's a "-inside-" case in one of the tests that also
    # depends upon it.
    if stripped[0:1] == '-' or stripped[0:1] == '*':
        return True
    # If the text begins with a single -, *, or +, followed by a space,
    # or an integer, followed by a ., followed by a space (in either case
    # optionally preceded by whitespace), it's a list; don't wrap.
    return bool(ordered_list_matcher.match(stripped)
                or unordered_list_matcher.match(stripped))
803 |
def wrapwrite(text):
    """Write *text* to stdout as UTF-8 encoded bytes.

    Works on both Python 3 (via the binary buffer attribute) and
    Python 2 (where stdout accepts bytes directly).
    """
    payload = text.encode('utf-8')
    try:
        stream = sys.stdout.buffer   # Python 3
    except AttributeError:
        stream = sys.stdout          # Python 2
    stream.write(payload)
810 |
def html2text(html, baseurl=''):
    """Convenience wrapper: convert an HTML string to markdown text."""
    converter = HTML2Text(baseurl=baseurl)
    return converter.handle(html)
814 |
def unescape(s, unicode_snob=False):
    """Module-level helper: expand entity references in *s* using a
    throwaway HTML2Text instance."""
    converter = HTML2Text()
    converter.unicode_snob = unicode_snob
    return converter.unescape(s)
819 |
def escape_md(text):
    """Backslash-escape markdown-sensitive characters (backslash and
    brackets/parens) within other markdown constructs."""
    escaped = md_chars_matcher.sub(r"\\\1", text)
    return escaped
823 |
def escape_md_section(text, snob=False):
    """Backslash-escape markdown-sensitive characters across whole
    document sections (literal backslashes, then line-leading list and
    rule markers)."""
    text = md_backslash_matcher.sub(r"\\\1", text)
    if snob:
        # Aggressive mode: escape the full markdown special set.
        text = md_chars_matcher_all.sub(r"\\\1", text)
    # Escape line-leading "1.", "+" and "-" so they don't become lists/rules.
    for line_matcher in (md_dot_matcher, md_plus_matcher, md_dash_matcher):
        text = line_matcher.sub(r"\1\\\2", text)
    return text
833 |
834 |
def main():
    """Command-line entry point.

    Parses options, reads HTML from a file, a URL or stdin, converts it
    with HTML2Text and writes the markdown result to stdout.
    """
    baseurl = ''

    p = optparse.OptionParser('%prog [(filename|url) [encoding]]',
                              version='%prog ' + __version__)
    p.add_option("--ignore-emphasis", dest="ignore_emphasis", action="store_true",
                 default=IGNORE_EMPHASIS, help="don't include any formatting for emphasis")
    p.add_option("--ignore-links", dest="ignore_links", action="store_true",
                 default=IGNORE_ANCHORS, help="don't include any formatting for links")
    p.add_option("--ignore-images", dest="ignore_images", action="store_true",
                 default=IGNORE_IMAGES, help="don't include any formatting for images")
    p.add_option("-g", "--google-doc", action="store_true", dest="google_doc",
                 default=False, help="convert an html-exported Google Document")
    p.add_option("-d", "--dash-unordered-list", action="store_true", dest="ul_style_dash",
                 default=False, help="use a dash rather than a star for unordered list items")
    p.add_option("-e", "--asterisk-emphasis", action="store_true", dest="em_style_asterisk",
                 default=False, help="use an asterisk rather than an underscore for emphasized text")
    p.add_option("-b", "--body-width", dest="body_width", action="store", type="int",
                 default=BODY_WIDTH, help="number of characters per output line, 0 for no wrap")
    p.add_option("-i", "--google-list-indent", dest="list_indent", action="store", type="int",
                 default=GOOGLE_LIST_INDENT, help="number of pixels Google indents nested lists")
    p.add_option("-s", "--hide-strikethrough", action="store_true", dest="hide_strikethrough",
                 default=False, help="hide strike-through text. only relevant when -g is specified as well")
    p.add_option("--escape-all", action="store_true", dest="escape_snob",
                 default=False, help="Escape all special characters. Output is less readable, but avoids corner case formatting issues.")
    (options, args) = p.parse_args()

    # process input
    encoding = "utf-8"
    if len(args) > 0:
        file_ = args[0]
        if len(args) == 2:
            encoding = args[1]
        if len(args) > 2:
            p.error('Too many arguments')

        if file_.startswith('http://') or file_.startswith('https://'):
            # Input is a URL: fetch it and use it as the link base.
            baseurl = file_
            j = urllib.urlopen(baseurl)
            data = j.read()
            # NOTE(review): encoding is initialised to "utf-8" above, so this
            # None check looks unreachable -- confirm before relying on the
            # feedparser-based charset detection below.
            if encoding is None:
                try:
                    from feedparser import _getCharacterEncoding as enc
                except ImportError:
                    enc = lambda x, y: ('utf-8', 1)
                encoding = enc(j.headers, data)[0]
                if encoding == 'us-ascii':
                    encoding = 'utf-8'
        else:
            data = open(file_, 'rb').read()
            # NOTE(review): same apparently-unreachable None check as above;
            # chardet detection is never triggered with the current default.
            if encoding is None:
                try:
                    from chardet import detect
                except ImportError:
                    detect = lambda x: {'encoding': 'utf-8'}
                encoding = detect(data)['encoding']
    else:
        # No file/URL argument: read raw HTML from stdin.
        data = sys.stdin.read()

    data = data.decode(encoding)
    h = HTML2Text(baseurl=baseurl)
    # handle options
    if options.ul_style_dash: h.ul_item_mark = '-'
    if options.em_style_asterisk:
        h.emphasis_mark = '*'
        h.strong_mark = '__'

    h.body_width = options.body_width
    h.list_indent = options.list_indent
    h.ignore_emphasis = options.ignore_emphasis
    h.ignore_links = options.ignore_links
    h.ignore_images = options.ignore_images
    h.google_doc = options.google_doc
    h.hide_strikethrough = options.hide_strikethrough
    h.escape_snob = options.escape_snob

    wrapwrite(h.handle(data))
912 |
913 |
# Script entry: convert the given file/URL (or stdin) and print markdown.
if __name__ == "__main__":
    main()
916 |
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | description-file = README.md
3 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

# Author: Aziz Alto
# email: iamaziz.alto@gmail.com

# Prefer setuptools; fall back to distutils on minimal installations.
try:
    from setuptools import setup
except ImportError:
    from distutils.core import setup


# Package metadata for PyDataset 0.2.0 (note: download_url tarball tag
# matches the version field below).
setup(
    name='pydataset',
    description=("Provides instant access to many popular datasets right from "
                 "Python (in dataframe structure)."),
    author='Aziz Alto',
    url='https://github.com/iamaziz/PyDataset',
    download_url='https://github.com/iamaziz/PyDataset/tarball/0.2.0',
    license = 'MIT',
    author_email='iamaziz.alto@gmail.com',
    version='0.2.0',
    install_requires=['pandas'],
    packages=['pydataset', 'pydataset.utils'],
    package_data={'pydataset': ['*.gz', 'resources.tar.gz']}
)
--------------------------------------------------------------------------------