├── LICENSE.txt ├── MANIFEST.in ├── README.rst ├── build-dist.bat ├── build-doc.bat ├── create_doctree.py ├── example ├── Example.ipynb ├── test.csv └── train.csv ├── hccEncoding ├── EncoderForClassification.py ├── EncoderForRegression.py ├── __init__.py └── __init__.pyc ├── install-linux.sh ├── install-macos.sh ├── install-win.bat ├── pypi-register.bat ├── pypi-upload.bat ├── release-history.rst ├── requirements.txt ├── setup.py ├── source ├── conf.py └── index.rst └── view-doc.bat
-------------------------------------------------------------------------------- /LICENSE.txt: --------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | 
3 | Copyright 2017 Ruobing Wang
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
-------------------------------------------------------------------------------- /MANIFEST.in: --------------------------------------------------------------------------------
1 | include README.rst LICENSE.txt requirements.txt release-history.rst
2 | recursive-include hccEncoding *.*
-------------------------------------------------------------------------------- /README.rst: --------------------------------------------------------------------------------
1 | .. image:: https://travis-ci.org/Robin888/hccEncoding-project.svg?branch=master
2 | 
3 | .. image:: https://img.shields.io/pypi/v/hccEncoding.svg
4 | 
5 | .. image:: https://img.shields.io/pypi/l/hccEncoding.svg
6 | 
7 | .. image:: https://img.shields.io/pypi/pyversions/hccEncoding.svg
8 | 
9 | 
10 | Welcome to hccEncoding Documentation
11 | ===============================================================================
12 | This is an example project for demonstration purposes.
13 | 
14 | 
15 | **Quick Links**
16 | -------------------------------------------------------------------------------
17 | - `GitHub Homepage `_
18 | - `Online Documentation `_
19 | - `PyPI download `_
20 | - `Install `_
21 | - `Issue submit and feature request `_
22 | - `API reference and source code `_
23 | - `Tutorial `_
24 | 
25 | .. _install:
26 | 
27 | Install
28 | -------------------------------------------------------------------------------
29 | 
30 | ``hccEncoding`` is released on PyPI, so all you need is:
31 | 
32 | .. code-block:: console
33 | 
34 | 	$ pip install hccEncoding
35 | 
36 | To upgrade to the latest version:
37 | 
38 | .. code-block:: console
39 | 
40 | 	$ pip install --upgrade hccEncoding
41 | 
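42 | Quick Example
43 | -------------------------------------------------------------------------------
44 | 
45 | A minimal sketch of encoding one high-cardinality column. The DataFrames and column names here (``train``, ``test``, ``price``, ``city``) are illustrative, not part of the package; any pandas DataFrames with a numeric target column and a categorical column will do:
46 | 
47 | .. code-block:: python
48 | 
49 | 	from hccEncoding.EncoderForRegression import BayesEncoding
50 | 
51 | 	# returns copies of train/test with a new smoothed target-mean column 'bayes_city'
52 | 	train_enc, test_enc = BayesEncoding(train, test, 'price', 'city')
53 | 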
-------------------------------------------------------------------------------- /build-dist.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | python setup.py sdist
3 | python setup.py bdist_wheel --universal
4 | pause
-------------------------------------------------------------------------------- /build-doc.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | cd hccEncoding
3 | python zzz_manual_install.py
4 | cd ..
5 | python create_doctree.py
6 | make html
-------------------------------------------------------------------------------- /create_doctree.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | 
4 | import docfly
5 | 
6 | # Uncomment this if you follow Sanhe's Sphinx Doc Style Guide
7 | #--- Manually Made Doc ---
8 | # doc = docfly.DocTree("source")
9 | # doc.fly(table_of_content_header="Table of Content (目录)")
10 | 
11 | #--- Api Reference Doc ---
12 | package_name = "hccEncoding"
13 | 
14 | doc = docfly.ApiReferenceDoc(
15 | package_name,
16 | dst="source",
17 | ignore=[
18 | "%s.packages" % package_name,
19 | "%s.zzz_manual_install.py" % package_name,
20 | ]
21 | )
22 | doc.fly()
-------------------------------------------------------------------------------- /example/Example.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial for hccEncoding\n",
8 | "\n",
9 | "\n",
10 | "This notebook shows how to use the hcc-encoding package for classification and regression problems, using the dataset from the Kaggle competition 'Prudential Life Insurance Assessment'.\n",
11 | "\n",
12 | "In hcc-encoding, the basic principle is to map individual values of a high-cardinality categorical independent attribute to an estimate of the probability or the expected value of the dependent attribute. However, simply replacing the high-cardinality categorical values with raw target statistics often results in information leakage."
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "dataset download: https://www.kaggle.com/c/prudential-life-insurance-assessment"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "In this dataset, you are provided with over a hundred variables describing attributes of life insurance applicants. \n",
27 | "The task is to predict the \"Response\" variable for each Id in the test set. \n",
28 | "\"Response\" is an ordinal measure of risk that has 8 levels, which means the problem can be treated as either a regression problem or a classification problem (classifying into 8 classes)"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": 1,
34 | "metadata": {
35 | "collapsed": true
36 | },
37 | "outputs": [],
38 | "source": [
39 | "# load raw data\n",
40 | "import pandas as pd\n",
41 | "train=pd.read_csv('train.csv')\n",
42 | "test=pd.read_csv('test.csv')"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 2,
48 | "metadata": {
49 | "collapsed": false
50 | },
51 | "outputs": [
52 | {
53 | "data": {
54 | "text/html": [
55 | "
(HTML table output omitted; see the text/plain rendering below)"
208 | ],
209 | "text/plain": [
210 | " Id Product_Info_1 Product_Info_2 Product_Info_3 Product_Info_4 \\\n",
211 | "0 2 1 D3 10 0.076923 \n",
212 | "1 5 1 A1 26 0.076923 \n",
213 | "2 6 1 E1 26 0.076923 \n",
214 | "3 7 1 D4 10 0.487179 \n",
215 | "4 8 1 D2 26 0.230769 \n",
216 | "\n",
217 | " Product_Info_5 Product_Info_6 Product_Info_7 Ins_Age Ht \\\n",
218 | "0 2 1 1 0.641791 0.581818 \n",
219 | "1 2 3 1 0.059701 0.600000 \n",
220 | "2 2 3 1 0.029851 0.745455 \n",
221 | "3 2 3 1 0.164179 0.672727 \n",
222 | "4 2 3 1 0.417910 0.654545 \n",
223 | "\n",
224 | " ... Medical_Keyword_40 Medical_Keyword_41 Medical_Keyword_42 \\\n",
225 | "0 ... 0 0 0 \n",
226 | "1 ... 0 0 0 \n",
227 | "2 ... 0 0 0 \n",
228 | "3 ... 0 0 0 \n",
229 | "4 ... 0 0 0 \n",
230 | "\n",
231 | " Medical_Keyword_43 Medical_Keyword_44 Medical_Keyword_45 \\\n",
232 | "0 0 0 0 \n",
233 | "1 0 0 0 \n",
234 | "2 0 0 0 \n",
235 | "3 0 0 0 \n",
236 | "4 0 0 0 \n",
237 | "\n",
238 | " Medical_Keyword_46 Medical_Keyword_47 Medical_Keyword_48 Response \n",
239 | "0 0 0 0 8 \n",
240 | "1 0 0 0 4 \n",
241 | "2 0 0 0 8 \n",
242 | "3 0 0 0 8 \n",
243 | "4 0 0 0 8 \n",
244 | "\n",
245 | "[5 rows x 128 columns]"
246 | ]
247 | },
248 | "execution_count": 2,
249 | "metadata": {},
250 | "output_type": "execute_result"
251 | }
252 | ],
253 | "source": [
254 | "train.head()"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": 3,
260 | "metadata": {
261 | "collapsed": false
262 | },
263 | "outputs": [
264 | {
265 | "data": {
266 | "text/plain": [
267 | "19"
268 | ]
269 | },
270 | "execution_count": 3,
271 | "metadata": {},
272 | "output_type": "execute_result"
273 | }
274 | ],
275 | "source": [
276 | "len(train['Product_Info_2'].unique())"
277 | ]
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "metadata": {},
282 | "source": [
283 | "It can be seen that the feature 'Product_Info_2' can be treated as a high-cardinality feature. To show how to use hcc-encoding more clearly, we drop most of the irrelevant features:"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": 4,
289 | "metadata": {
290 | "collapsed": false
291 | },
292 | "outputs": [],
293 | "source": [
294 | "train=train[['Id','Response','Product_Info_2']]\n",
295 | "test=test[['Id','Product_Info_2']]"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": 5,
301 | "metadata": {
302 | "collapsed": false
303 | },
304 | "outputs": [
305 | {
306 | "data": {
307 | "text/html": [
308 | "
(HTML table output omitted; see the text/plain rendering below)
" 352 | ], 353 | "text/plain": [ 354 | " Id Response Product_Info_2\n", 355 | "0 2 8 D3\n", 356 | "1 5 4 A1\n", 357 | "2 6 8 E1\n", 358 | "3 7 8 D4\n", 359 | "4 8 8 D2" 360 | ] 361 | }, 362 | "execution_count": 5, 363 | "metadata": {}, 364 | "output_type": "execute_result" 365 | } 366 | ], 367 | "source": [ 368 | "train.head()" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "# Part 1. Encoding for classification problems" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 6, 381 | "metadata": { 382 | "collapsed": false 383 | }, 384 | "outputs": [], 385 | "source": [ 386 | "from hccEncoding.EncoderForClassification import BayesEncoding,BayesEncodingKfold,LOOEncoding,LOOEncodingKfold\n", 387 | "\n", 388 | "train_BayesEncoding,test_BayesEncoding=BayesEncoding(train,test,'Response','Product_Info_2')\n", 389 | "train_LOOEncoding,test_LOOEncoding=LOOEncoding(train,test,'Response','Product_Info_2')\n" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 7, 395 | "metadata": { 396 | "collapsed": false 397 | }, 398 | "outputs": [ 399 | { 400 | "data": { 401 | "text/html": [ 402 | "
(HTML table output omitted; see the text/plain rendering below)"
494 | ],
495 | "text/plain": [
496 | " Id Response Product_Info_2 bayes_Product_Info_2_1 \\\n",
497 | "0 2 8 D3 0.100520 \n",
498 | "1 5 4 A1 0.055488 \n",
499 | "2 6 8 E1 0.076636 \n",
500 | "3 7 8 D4 0.064025 \n",
501 | "4 8 8 D2 0.119590 \n",
502 | "\n",
503 | " bayes_Product_Info_2_2 bayes_Product_Info_2_3 bayes_Product_Info_2_4 \\\n",
504 | "0 0.116242 0.016612 0.029456 \n",
505 | "1 0.100343 0.021790 0.028626 \n",
506 | "2 0.078272 0.012823 0.029673 \n",
507 | "3 0.065807 0.007659 0.011436 \n",
508 | "4 0.147281 0.017612 0.024653 \n",
509 | "\n",
510 | " bayes_Product_Info_2_5 bayes_Product_Info_2_6 bayes_Product_Info_2_7 \\\n",
511 | "0 0.087625 0.227483 0.145368 \n",
512 | "1 0.083012 0.148888 0.115147 \n",
513 | "2 0.067797 0.162853 0.150906 \n",
514 | "3 0.075465 0.166885 0.135976 \n",
515 | "4 0.067270 0.247489 0.145203 \n",
516 | "\n",
517 | " bayes_Product_Info_2_8 \n",
518 | "0 0.274486 \n",
519 | "1 0.445322 \n",
520 | "2 0.425601 \n",
521 | "3 0.477254 \n",
522 | "4 0.233645 "
523 | ]
524 | },
525 | "execution_count": 7,
526 | "metadata": {},
527 | "output_type": "execute_result"
528 | }
529 | ],
530 | "source": [
531 | "train_BayesEncoding.head()"
532 | ]
533 | },
534 | {
535 | "cell_type": "markdown",
536 | "metadata": {},
537 | "source": [
538 | "Note: In BayesEncoding for classification problems, the new features produced by encoding are the estimated probabilities of each class; each generated feature is named 'bayes_(original feature name)_(class label)', e.g. 'bayes_Product_Info_2_8'"
539 | ]
540 | },
541 | {
542 | "cell_type": "code",
543 | "execution_count": 8,
544 | "metadata": {
545 | "collapsed": false
546 | },
547 | "outputs": [
548 | {
549 | "data": {
550 | "text/html": [
551 | "
(HTML table output omitted; see the text/plain rendering below)
" 601 | ], 602 | "text/plain": [ 603 | " Id Response Product_Info_2 loo_Product_Info_2\n", 604 | "0 2 8 D3 5.568211\n", 605 | "1 5 4 A1 6.176825\n", 606 | "2 6 8 E1 6.171721\n", 607 | "3 7 8 D4 6.373370\n", 608 | "4 8 8 D2 5.260191" 609 | ] 610 | }, 611 | "execution_count": 8, 612 | "metadata": {}, 613 | "output_type": "execute_result" 614 | } 615 | ], 616 | "source": [ 617 | "train_LOOEncoding.head()" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 9, 623 | "metadata": { 624 | "collapsed": false 625 | }, 626 | "outputs": [], 627 | "source": [ 628 | "train_BayesEncodingKfold,test_BayesEncodingKfold=BayesEncodingKfold(train,test,'Response','Product_Info_2',fold=5)\n", 629 | "train_LOOEncodingKfold,test_LOOEncodingKfold=LOOEncodingKfold(train,test,'Response','Product_Info_2',fold=5)\n" 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": 10, 635 | "metadata": { 636 | "collapsed": false 637 | }, 638 | "outputs": [ 639 | { 640 | "data": { 641 | "text/html": [ 642 | "
(HTML table output omitted; see the text/plain rendering below)
" 734 | ], 735 | "text/plain": [ 736 | " Id Response Product_Info_2 bayes_Product_Info_2_1 \\\n", 737 | "0 2 8 D3 0.103135 \n", 738 | "1 5 4 A1 0.054170 \n", 739 | "2 6 8 E1 0.072549 \n", 740 | "3 7 8 D4 0.063238 \n", 741 | "4 8 8 D2 0.114566 \n", 742 | "\n", 743 | " bayes_Product_Info_2_2 bayes_Product_Info_2_3 bayes_Product_Info_2_4 \\\n", 744 | "0 0.116970 0.016271 0.029247 \n", 745 | "1 0.094039 0.023956 0.030422 \n", 746 | "2 0.079161 0.013412 0.028990 \n", 747 | "3 0.064417 0.007972 0.011082 \n", 748 | "4 0.146584 0.018327 0.024603 \n", 749 | "\n", 750 | " bayes_Product_Info_2_5 bayes_Product_Info_2_6 bayes_Product_Info_2_7 \\\n", 751 | "0 0.086499 0.229355 0.147193 \n", 752 | "1 0.087120 0.148125 0.115648 \n", 753 | "2 0.063683 0.161461 0.150065 \n", 754 | "3 0.076236 0.165874 0.131380 \n", 755 | "4 0.068299 0.245421 0.148178 \n", 756 | "\n", 757 | " bayes_Product_Info_2_8 \n", 758 | "0 0.268195 \n", 759 | "1 0.443238 \n", 760 | "2 0.433177 \n", 761 | "3 0.478303 \n", 762 | "4 0.233748 " 763 | ] 764 | }, 765 | "execution_count": 10, 766 | "metadata": {}, 767 | "output_type": "execute_result" 768 | } 769 | ], 770 | "source": [ 771 | "train_BayesEncodingKfold.head()" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 11, 777 | "metadata": { 778 | "collapsed": false 779 | }, 780 | "outputs": [ 781 | { 782 | "data": { 783 | "text/html": [ 784 | "
(HTML table output omitted; see the text/plain rendering below)"
833 | ],
834 | "text/plain": [
835 | " Id Response Product_Info_2 loo_Product_Info_2\n",
836 | "0 2 8 D3 5.510350\n",
837 | "1 5 4 A1 6.147088\n",
838 | "2 6 8 E1 6.165965\n",
839 | "3 7 8 D4 6.389432\n",
840 | "4 8 8 D2 5.280445"
841 | ]
842 | },
843 | "execution_count": 11,
844 | "metadata": {},
845 | "output_type": "execute_result"
846 | }
847 | ],
848 | "source": [
849 | "train_LOOEncodingKfold.head()"
850 | ]
851 | },
852 | {
853 | "cell_type": "markdown",
854 | "metadata": {},
855 | "source": [
856 | "Note: the difference between BayesEncoding and BayesEncodingKfold (and likewise between LOOEncoding and LOOEncodingKfold) lies in how the train dataset is encoded. In BayesEncoding (or LOOEncoding), the train dataset is encoded using statistics of the full train dataset. In BayesEncodingKfold (or LOOEncodingKfold), each part of the train dataset is encoded using statistics from only the remaining folds. For example, when fold=5, BayesEncodingKfold (or LOOEncodingKfold) uses 80% of the train dataset to encode the remaining 20%. This further reduces the risk of information leakage; the downside is that less information from the train dataset is used. "
857 | ]
858 | },
859 | {
860 | "cell_type": "markdown",
861 | "metadata": {},
862 | "source": [
863 | "# Part 2. Encoding for Regression problems"
864 | ]
865 | },
866 | {
867 | "cell_type": "code",
868 | "execution_count": 12,
869 | "metadata": {
870 | "collapsed": true
871 | },
872 | "outputs": [],
873 | "source": [
874 | "from hccEncoding.EncoderForRegression import BayesEncoding,BayesEncodingKfold,LOOEncoding,LOOEncodingKfold\n",
875 | "\n",
876 | "train_BayesEncoding,test_BayesEncoding=BayesEncoding(train,test,'Response','Product_Info_2')\n",
877 | "train_LOOEncoding,test_LOOEncoding=LOOEncoding(train,test,'Response','Product_Info_2')"
878 | ]
879 | },
880 | {
881 | "cell_type": "code",
882 | "execution_count": 13,
883 | "metadata": {
884 | "collapsed": false
885 | },
886 | "outputs": [
887 | {
888 | "data": {
889 | "text/html": [
890 | "
(HTML table output omitted; see the text/plain rendering below)
" 941 | ], 942 | "text/plain": [ 943 | " Id Response Product_Info_2 bayes_Product_Info_2\n", 944 | "0 2 8 D3 5.554615\n", 945 | "1 5 4 A1 6.212693\n", 946 | "2 6 8 E1 6.226905\n", 947 | "3 7 8 D4 6.337270\n", 948 | "4 8 8 D2 5.261874" 949 | ] 950 | }, 951 | "execution_count": 13, 952 | "metadata": {}, 953 | "output_type": "execute_result" 954 | } 955 | ], 956 | "source": [ 957 | "train_BayesEncoding.head()" 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": 14, 963 | "metadata": { 964 | "collapsed": false 965 | }, 966 | "outputs": [ 967 | { 968 | "data": { 969 | "text/html": [ 970 | "
(HTML table output omitted; see the text/plain rendering below)
" 1020 | ], 1021 | "text/plain": [ 1022 | " Id Response Product_Info_2 loo_Product_Info_2\n", 1023 | "0 2 8 D3 5.553309\n", 1024 | "1 5 4 A1 6.042045\n", 1025 | "2 6 8 E1 6.065838\n", 1026 | "3 7 8 D4 6.423081\n", 1027 | "4 8 8 D2 5.302704" 1028 | ] 1029 | }, 1030 | "execution_count": 14, 1031 | "metadata": {}, 1032 | "output_type": "execute_result" 1033 | } 1034 | ], 1035 | "source": [ 1036 | "train_LOOEncoding.head()" 1037 | ] 1038 | }, 1039 | { 1040 | "cell_type": "code", 1041 | "execution_count": 15, 1042 | "metadata": { 1043 | "collapsed": true 1044 | }, 1045 | "outputs": [], 1046 | "source": [ 1047 | "train_BayesEncodingKfold,test_BayesEncodingKfold=BayesEncodingKfold(train,test,'Response','Product_Info_2',fold=5)\n", 1048 | "train_LOOEncodingKfold,test_LOOEncodingKfold=LOOEncodingKfold(train,test,'Response','Product_Info_2',fold=5)" 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "code", 1053 | "execution_count": 16, 1054 | "metadata": { 1055 | "collapsed": false 1056 | }, 1057 | "outputs": [ 1058 | { 1059 | "data": { 1060 | "text/html": [ 1061 | "
(HTML table output omitted; see the text/plain rendering below)
" 1111 | ], 1112 | "text/plain": [ 1113 | " Id Response Product_Info_2 bayes_Product_Info_2\n", 1114 | "0 2 8 D3 5.528438\n", 1115 | "1 5 4 A1 6.217039\n", 1116 | "2 6 8 E1 6.175437\n", 1117 | "3 7 8 D4 6.322527\n", 1118 | "4 8 8 D2 5.288024" 1119 | ] 1120 | }, 1121 | "execution_count": 16, 1122 | "metadata": {}, 1123 | "output_type": "execute_result" 1124 | } 1125 | ], 1126 | "source": [ 1127 | "train_BayesEncodingKfold.head()" 1128 | ] 1129 | }, 1130 | { 1131 | "cell_type": "code", 1132 | "execution_count": 17, 1133 | "metadata": { 1134 | "collapsed": false 1135 | }, 1136 | "outputs": [ 1137 | { 1138 | "data": { 1139 | "text/html": [ 1140 | "
(HTML table output omitted; see the text/plain rendering below)"
1189 | ],
1190 | "text/plain": [
1191 | " Id Response Product_Info_2 loo_Product_Info_2\n",
1192 | "0 2 8 D3 5.511082\n",
1193 | "1 5 4 A1 6.147173\n",
1194 | "2 6 8 E1 6.164431\n",
1195 | "3 7 8 D4 6.390441\n",
1196 | "4 8 8 D2 5.280106"
1197 | ]
1198 | },
1199 | "execution_count": 17,
1200 | "metadata": {},
1201 | "output_type": "execute_result"
1202 | }
1203 | ],
1204 | "source": [
1205 | "train_LOOEncodingKfold.head()"
1206 | ]
1207 | },
1208 | {
1209 | "cell_type": "markdown",
1210 | "metadata": {},
1211 | "source": [
1212 | "For a more detailed explanation of the parameters, please check the online documentation: http://hccencoding-project.readthedocs.io/en/latest/"
1213 | ]
1214 | }
1215 | ],
1216 | "metadata": {
1217 | "anaconda-cloud": {},
1218 | "kernelspec": {
1219 | "display_name": "Python [default]",
1220 | "language": "python",
1221 | "name": "python2"
1222 | },
1223 | "language_info": {
1224 | "codemirror_mode": {
1225 | "name": "ipython",
1226 | "version": 2
1227 | },
1228 | "file_extension": ".py",
1229 | "mimetype": "text/x-python",
1230 | "name": "python",
1231 | "nbconvert_exporter": "python",
1232 | "pygments_lexer": "ipython2",
1233 | "version": "2.7.12"
1234 | }
1235 | },
1236 | "nbformat": 4,
1237 | "nbformat_minor": 1
1238 | }
1239 | 
-------------------------------------------------------------------------------- /hccEncoding/EncoderForClassification.py: --------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sun Mar 12 01:36:23 2017
4 | 
5 | @author: Ruobing
6 | """
7 | from sklearn.model_selection import StratifiedKFold
8 | import numpy as np
9 | import pandas as pd
10 | ###paper: A preprocessing scheme for high cardinality categorical attributes in classification and prediction problems
11 | def BayesEncoding(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False):
12 | entry=pd.get_dummies(train[target])
13 | classes=list(entry)
14 | entry[feature]=train[feature]
15 | entry=entry.groupby(feature).agg('sum').reset_index()
16 | entry['total_count']=entry.sum(axis=1)
17 | for e in classes:
18 | newname='bayes_'+feature+'_'+str(e)
19 | prior=sum(entry[e])/sum(entry['total_count'])
20 | posterior=entry[e]/entry['total_count']
21 | B=1/(1+np.exp(-1*(entry['total_count']-k)/f)) # smoothing weight: 0.5 when a category has k rows; f sets the transition speed
22 | entry[newname]=B*posterior+(1-B)*prior # blend of in-category posterior and global prior
23 | 
24 | newfeature=['bayes_'+feature+'_'+str(e) for e in classes]
25 | newlist=[feature]+ newfeature
26 | useful=entry[newlist]
27 | train=train.join(useful.set_index(feature),on=feature)
28 | test=test.join(useful.set_index(feature),on=feature)
29 | 
30 | for e in classes:
31 | newname='bayes_'+feature+'_'+str(e)
32 | test[newname]=test[newname].fillna(sum(entry[e])/sum(entry['total_count']))
33 | if noise: # Add uniform noise.
Not mentioned in original paper
34 | train[newname]=train[newname]*np.random.uniform(1 - noise, 1 + noise, len(train[newname]))
35 | test[newname]=test[newname]*np.random.uniform(1 - noise, 1 + noise, len(test[newname]))
36 | 
37 | 
38 | if drop_origin_feature==True:
39 | train=train.drop(feature,1)
40 | test=test.drop(feature,1)
41 | 
42 | return train,test
43 | 
44 | def BayesEncodingKfold(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False,fold=5):
45 | train_no_use,test_useful=BayesEncoding(train,test,target,feature,k,f,noise,drop_origin_feature)
46 | skf = StratifiedKFold(fold)
47 | alltrain_min=[]
48 | for train_id, test_id in skf.split(train,np.zeros(len(train))):
49 | train_maj,train_min=BayesEncoding(train.iloc[train_id], train.iloc[test_id],target,feature,k,f,noise,drop_origin_feature)
50 | alltrain_min.append(train_min)
51 | train_useful=pd.concat(alltrain_min,0)
52 | return train_useful,test_useful
53 | 
54 | 
55 | def LOOEncoding(train,test,target,feature,noise=0.01,drop_origin_feature=False):
56 | cs = train.groupby(by=[feature])[target].sum()
57 | cc = train[feature].value_counts()
58 | boolean = (cc == 1) # categories seen only once (guard against division by zero below)
59 | index = boolean[boolean == True].index.values
60 | cc.loc[boolean] += 1
61 | cs.loc[index] *= 2
62 | train = train.join(cs.rename('sum'), on=[feature])
63 | train = train.join(cc.rename('count'), on=[feature])
64 | newname='loo_'+feature
65 | train[newname] = (train['sum']-train[target])/(train['count'] - 1)
66 | if noise: train[newname]= train[newname]*np.random.uniform(1 - noise, 1 + noise, len(train[newname])) # Add uniform noise. Not mentioned in original paper
67 | del train['sum'], train['count']
68 | cstest=train.groupby(by=[feature])[newname].mean()
69 | test=test.join(cstest.rename(newname),on=[feature])
70 | if drop_origin_feature==True:
71 | train=train.drop(feature,1)
72 | test=test.drop(feature,1)
73 | return train,test
74 | 
75 | def LOOEncodingKfold(train,test,target,feature,noise=0.01,drop_origin_feature=False,fold=5):
76 | train_no_use,test_useful=LOOEncoding(train,test,target,feature,noise,drop_origin_feature)
77 | skf = StratifiedKFold(fold)
78 | alltrain_min=[]
79 | for train_id, test_id in skf.split(train,np.zeros(len(train))):
80 | train_maj,train_min=LOOEncoding(train.iloc[train_id], train.iloc[test_id],target,feature,noise,drop_origin_feature)
81 | alltrain_min.append(train_min)
82 | train_useful=pd.concat(alltrain_min,0)
83 | newname='loo_'+feature
84 | train_useful[newname]=train_useful[newname].fillna(train_useful[newname].median())
85 | test_useful[newname]=test_useful[newname].fillna(test_useful[newname].median())
86 | return train_useful,test_useful
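87 | 
88 | 
89 | # Illustrative usage sketch (not part of the original module); it assumes a train/test
90 | # pair with target 'Response' and high-cardinality feature 'Product_Info_2', as in the
91 | # example notebook. Kept commented out so importing this module stays side-effect free.
92 | # if __name__ == '__main__':
93 | #     train = pd.read_csv('train.csv')
94 | #     test = pd.read_csv('test.csv')
95 | #     train_enc, test_enc = BayesEncoding(train, test, 'Response', 'Product_Info_2')
96 | #     # one 'bayes_Product_Info_2_<class>' probability column is added per target class
97 | #     print([c for c in train_enc.columns if c.startswith('bayes_')])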
-------------------------------------------------------------------------------- /hccEncoding/EncoderForRegression.py: --------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sun Mar 12 01:36:23 2017
4 | 
5 | @author: Ruobing
6 | """
7 | from sklearn.model_selection import StratifiedKFold
8 | import numpy as np
9 | import pandas as pd
10 | ###paper: A preprocessing scheme for high cardinality categorical attributes in classification and prediction problems
11 | def BayesEncoding(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False):
12 | entry=pd.DataFrame(train[[target,feature]],columns=[target,feature])
13 | library=entry.groupby(feature).agg('mean').rename(columns={target:'sampleMean'}).reset_index()
14 | count=entry.groupby(feature).agg('count').rename(columns={target:'total_count'}).reset_index()
15 | library['total_count']=count['total_count']
16 | library['prior']=train[target].mean()
17 | 
18 | B=1/(1+np.exp(-1*(library['total_count']-k)/f))
19 | newname='bayes_'+feature
20 | library[newname]=B*library['sampleMean']+(1-B)*library['prior']
21 | 
22 | newlist=[feature]+ [newname]
23 | useful=library[newlist]
24 | train=train.join(useful.set_index(feature),on=feature)
25 | test=test.join(useful.set_index(feature),on=feature)
26 | test[newname]=test[newname].fillna(train[target].mean())
27 | 
28 | if noise: # Add normal noise. Not mentioned in original paper
29 | train[newname]=train[newname]*np.random.normal(1, noise, len(train[newname]))
30 | test[newname]=test[newname]*np.random.normal(1 , noise, len(test[newname]))
31 | 
32 | 
33 | if drop_origin_feature==True:
34 | train=train.drop(feature,1)
35 | test=test.drop(feature,1)
36 | return train,test
37 | 
38 | def BayesEncodingKfold(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False,fold=5):
39 | train_no_use,test_useful=BayesEncoding(train,test,target,feature,k,f,noise,drop_origin_feature)
40 | skf = StratifiedKFold(fold)
41 | alltrain_min=[]
42 | for train_id, test_id in skf.split(train,np.zeros(len(train))):
43 | train_maj,train_min=BayesEncoding(train.iloc[train_id], train.iloc[test_id],target,feature,k,f,noise,drop_origin_feature)
44 | alltrain_min.append(train_min)
45 | train_useful=pd.concat(alltrain_min,0)
46 | return train_useful,test_useful
47 | 
48 | 
49 | def LOOEncoding(train,test,target,feature,noise=0.01,drop_origin_feature=False):
50 | cs = train.groupby(by=[feature])[target].sum()
51 | cc = train[feature].value_counts()
52 | boolean = (cc == 1)
53 | index = boolean[boolean == True].index.values
54 | cc.loc[boolean] += 1
55 | cs.loc[index] *= 2
56 | train = train.join(cs.rename('sum'), on=[feature])
57 | train = train.join(cc.rename('count'), on=[feature])
58 | newname='loo_'+feature
59 | train[newname] = (train['sum']-train[target])/(train['count'] - 1)
60 | if noise: train[newname]= train[newname]*np.random.normal(1,noise,len(train[newname])) # Add normal noise.
Not mentioned in original paper
61 | del train['sum'], train['count']
62 | cstest=train.groupby(by=[feature])[newname].mean()
63 | test=test.join(cstest.rename(newname),on=[feature])
64 | if drop_origin_feature==True:
65 | train=train.drop(feature,1)
66 | test=test.drop(feature,1)
67 | return train,test
68 | 
69 | def LOOEncodingKfold(train,test,target,feature,noise=0.01,drop_origin_feature=False,fold=5):
70 | train_no_use,test_useful=LOOEncoding(train,test,target,feature,noise,drop_origin_feature)
71 | skf = StratifiedKFold(fold)
72 | alltrain_min=[]
73 | for train_id, test_id in skf.split(train,np.zeros(len(train))):
74 | train_maj,train_min=LOOEncoding(train.iloc[train_id], train.iloc[test_id],target,feature,noise,drop_origin_feature)
75 | alltrain_min.append(train_min)
76 | train_useful=pd.concat(alltrain_min,0)
77 | newname='loo_'+feature
78 | train_useful[newname]=train_useful[newname].fillna(train_useful[newname].mean())
79 | test_useful[newname]=test_useful[newname].fillna(test_useful[newname].mean())
80 | return train_useful,test_useful
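81 | 
82 | 
83 | # Illustrative usage sketch (not part of the original module); it assumes 'train.csv'
84 | # and 'test.csv' from the example folder, with target 'Response' and categorical
85 | # feature 'Product_Info_2'. Kept commented out so importing stays side-effect free.
86 | # if __name__ == '__main__':
87 | #     train = pd.read_csv('train.csv')
88 | #     test = pd.read_csv('test.csv')
89 | #     train_enc, test_enc = BayesEncoding(train, test, 'Response', 'Product_Info_2', k=5, f=1)
90 | #     print(train_enc['bayes_Product_Info_2'].head())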
-------------------------------------------------------------------------------- /hccEncoding/__init__.py: --------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sat Apr 01 10:03:49 2017
4 | 
5 | @author: Ruobing Wang
6 | """
7 | 
8 | 
9 | 
10 | __version__ = "0.0.1"
11 | __short_description__ = "hcc-encoding"
12 | __license__ = "MIT"
-------------------------------------------------------------------------------- /hccEncoding/__init__.pyc: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/Robin888/hccEncoding-project/2412e5216b4904421721b43c803f417a4b3b7b61/hccEncoding/__init__.pyc
-------------------------------------------------------------------------------- /install-linux.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 | cd "$(dirname "$0")"
3 | cd hccEncoding
4 | python zzz_manual_install.py
-------------------------------------------------------------------------------- /install-macos.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 | cd "$(dirname "$0")"
3 | cd hccEncoding
4 | python zzz_manual_install.py
-------------------------------------------------------------------------------- /install-win.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | cd hccEncoding
3 | python zzz_manual_install.py
-------------------------------------------------------------------------------- /pypi-register.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | python setup.py register -r pypi
-------------------------------------------------------------------------------- /pypi-upload.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | CHOICE /C YN /M "upload to pypi, Y to continue, N to cancel"
3 | IF ERRORLEVEL==2 goto end
4 | IF ERRORLEVEL==1 goto upload
5 | 
6 | :upload
7 | python setup.py sdist upload -r pypi
8 | goto end
9 | 
10 | :end
11 | pause
-------------------------------------------------------------------------------- /release-history.rst: --------------------------------------------------------------------------------
1 | Release and Version History
2 | ===========================
3 | 
4 | 0.0.2 (TODO)
5 | ~~~~~~~~~~~~~~~~~~
6 | **Features and Improvements**
7 | 
8 | **Minor Improvements**
9 | 
10 | **Bugfixes**
11 | 
12 | **Miscellaneous**
13 | 
14 | 
15 | 0.0.1 (2016-01-01)
16 | ~~~~~~~~~~~~~~~~~~
17 | - First release
-------------------------------------------------------------------------------- /requirements.txt: --------------------------------------------------------------------------------
1 | numpy >= 1.9.0
2 | pandas>= 0.19.0
3 | scikit-learn >= 0.18.0
-------------------------------------------------------------------------------- /setup.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | 
4 | from __future__ import print_function
5 | from setuptools import setup, find_packages
6 | from datetime import date
7 | import os
8 | 
9 | #--- Define project dependent variable ---
10 | # Your package name
11 | NAME = "hccEncoding"
12 | # Your GitHub user name
13 | GITHUB_USERNAME = "Robin888" # your GitHub account name
14 | 
15 | 
16 | #--- Automatically generate setup parameters ---
17 | try:
18 | SHORT_DESCRIPTION = __import__(NAME).__short_description__ # GitHub Short Description
19 | except:
20 | print("'__short_description__' not found in '%s.__init__.py'!" % NAME)
21 | SHORT_DESCRIPTION = "No short description!"
22 | 
23 | try:
24 | LONG_DESCRIPTION = open("README.rst", "rb").read().decode("utf-8")
25 | except:
26 | LONG_DESCRIPTION = "No long description!"
27 | 
28 | VERSION = __import__(NAME).__version__
29 | AUTHOR = "Ruobing Wang"
30 | AUTHOR_EMAIL = "wangruobing@gmail.com"
31 | MAINTAINER = "Ruobing Wang"
32 | MAINTAINER_EMAIL = "wangruobing@gmail.com"
33 | 
34 | # Include all sub packages in package directory
35 | PACKAGES = [NAME] + ["%s.%s" % (NAME, i) for i in find_packages(NAME)]
36 | # Include everything in package directory
37 | INCLUDE_PACKAGE_DATA = True
38 | PACKAGE_DATA = {
39 | "": ["*.*"],
40 | }
41 | 
42 | # The project directory name is the GitHub repository name
43 | repository_name = os.path.basename(os.getcwd())
44 | # Project Url
45 | URL = "https://github.com/{0}/{1}".format(GITHUB_USERNAME, repository_name)
46 | # Use today's date as the GitHub release tag
47 | github_release_tag = str(date.today())
48 | # Source code download url
49 | DOWNLOAD_URL = "https://github.com/{0}/{1}/tarball/{2}".format(
50 | GITHUB_USERNAME, repository_name, github_release_tag)
51 | 
52 | try:
53 | LICENSE = __import__(NAME).__license__
54 | except:
55 | print("'__license__' not found in '%s.__init__.py'!" 
% NAME)
56 | LICENSE = ""
57 | 
58 | PLATFORMS = ["Windows", "MacOS", "Unix"]
59 | CLASSIFIERS = [
60 | "Development Status :: 4 - Beta",
61 | "Intended Audience :: Developers",
62 | "License :: OSI Approved :: MIT License",
63 | "Natural Language :: English",
64 | "Operating System :: Microsoft :: Windows",
65 | "Operating System :: MacOS",
66 | "Operating System :: Unix",
67 | "Programming Language :: Python",
68 | "Programming Language :: Python :: 2.7",
69 | "Programming Language :: Python :: 3.3",
70 | "Programming Language :: Python :: 3.4",
71 | "Programming Language :: Python :: 3.5",
72 | ]
73 | 
74 | try:
75 | f = open("requirements.txt", "rb")
76 | REQUIRES = [i.strip() for i in f.read().decode("utf-8").split("\n")]
77 | except:
78 | print("'requirements.txt' not found!")
79 | REQUIRES = list()
80 | 
81 | setup(
82 | name=NAME,
83 | description=SHORT_DESCRIPTION,
84 | long_description=LONG_DESCRIPTION,
85 | version=VERSION,
86 | author=AUTHOR,
87 | author_email=AUTHOR_EMAIL,
88 | maintainer=MAINTAINER,
89 | maintainer_email=MAINTAINER_EMAIL,
90 | packages=PACKAGES,
91 | include_package_data=INCLUDE_PACKAGE_DATA,
92 | package_data=PACKAGE_DATA,
93 | url=URL,
94 | download_url=DOWNLOAD_URL,
95 | classifiers=CLASSIFIERS,
96 | platforms=PLATFORMS,
97 | license=LICENSE,
98 | install_requires=REQUIRES,
99 | )
100 | 
101 | """
102 | Appendix
103 | --------
104 | ::
105 | 
106 | Frequently used classifiers list = [
107 | "Development Status :: 1 - Planning",
108 | "Development Status :: 2 - Pre-Alpha",
109 | "Development Status :: 3 - Alpha",
110 | "Development Status :: 4 - Beta",
111 | "Development Status :: 5 - Production/Stable",
112 | "Development Status :: 6 - Mature",
113 | "Development Status :: 7 - Inactive",
114 | 
115 | "Intended Audience :: Customer Service",
116 | "Intended Audience :: Developers",
117 | "Intended Audience :: Education",
118 | "Intended Audience :: End Users/Desktop",
119 | "Intended Audience :: Financial and Insurance Industry",
120 | "Intended Audience :: Healthcare Industry",
121 | "Intended Audience :: Information Technology",
122 | "Intended Audience :: Legal Industry",
123 | "Intended Audience :: Manufacturing",
124 | "Intended Audience :: Other Audience",
125 | "Intended Audience :: Religion",
126 | "Intended Audience :: Science/Research",
127 | "Intended Audience :: System Administrators",
128 | "Intended Audience :: Telecommunications Industry",
129 | 
130 | "License :: OSI Approved :: BSD License",
131 | "License :: OSI Approved :: MIT License",
132 | "License :: OSI Approved :: Apache Software License",
133 | "License :: OSI Approved :: GNU General Public License (GPL)",
134 | "License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)",
135 | 
136 | "Natural Language :: English",
137 | "Natural Language :: Chinese (Simplified)",
138 | 
139 | "Operating System :: Microsoft :: Windows",
140 | "Operating System :: MacOS",
141 | "Operating System :: Unix",
142 | 
143 | "Programming Language :: Python",
144 | "Programming Language :: Python :: 2",
145 | "Programming Language :: Python :: 2.7",
146 | "Programming Language :: Python :: 2 :: Only",
147 | "Programming Language :: Python :: 3",
148 | "Programming Language :: Python :: 3.3",
149 | "Programming Language :: Python :: 3.4",
150 | "Programming Language :: Python :: 3 :: Only",
151 | ]
152 | """
-------------------------------------------------------------------------------- /source/conf.py: --------------------------------------------------------------------------------
1 | # -*- coding: 
utf-8 -*- 2 | # 3 | # hccEncoding documentation build configuration file, created by 4 | # sphinx-quickstart on Mon Apr 03 12:53:54 2017. 5 | # 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | # If extensions (or modules to document with autodoc) are in another directory, 16 | # add these directories to sys.path here. If the directory is relative to the 17 | # documentation root, use os.path.abspath to make it absolute, like shown here. 18 | # 19 | # import os 20 | # import sys 21 | # sys.path.insert(0, os.path.abspath('.')) 22 | 23 | # -- General configuration ------------------------------------------------ 24 | 25 | # If your documentation needs a minimal Sphinx version, state it here. 26 | # 27 | # needs_sphinx = '1.0' 28 | 29 | # Add any Sphinx extension module names here, as strings. They can be 30 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 31 | # ones. 32 | extensions = [] 33 | 34 | # Add any paths that contain templates here, relative to this directory. 35 | templates_path = ['_templates'] 36 | 37 | # The suffix(es) of source filenames. 38 | # You can specify multiple suffix as a list of string: 39 | # 40 | # source_suffix = ['.rst', '.md'] 41 | source_suffix = '.rst' 42 | 43 | # The encoding of source files. 44 | # 45 | # source_encoding = 'utf-8-sig' 46 | 47 | # The master toctree document. 48 | master_doc = 'index' 49 | 50 | # General information about the project. 51 | project = u'hccEncoding' 52 | copyright = u'2017, Ruobing Wang' 53 | author = u'Ruobing Wang' 54 | 55 | # The version info for the project you're documenting, acts as replacement for 56 | # |version| and |release|, also used in various other places throughout the 57 | # built documents. 58 | # 59 | # The short X.Y version. 60 | version = u'0.0.1' 61 | # The full version, including alpha/beta/rc tags. 62 | release = u'0.0.1' 63 | 64 | # The language for content autogenerated by Sphinx. Refer to documentation 65 | # for a list of supported languages. 66 | # 67 | # This is also used if you do content translation via gettext catalogs. 68 | # Usually you set "language" from the command line for these cases. 69 | language = None 70 | 71 | # There are two options for replacing |today|: either, you set today to some 72 | # non-false value, then it is used: 73 | # 74 | # today = '' 75 | # 76 | # Else, today_fmt is used as the format for a strftime call. 77 | # 78 | # today_fmt = '%B %d, %Y' 79 | 80 | # List of patterns, relative to source directory, that match files and 81 | # directories to ignore when looking for source files. 82 | # This patterns also effect to html_static_path and html_extra_path 83 | exclude_patterns = [] 84 | 85 | # The reST default role (used for this markup: `text`) to use for all 86 | # documents. 87 | # 88 | # default_role = None 89 | 90 | # If true, '()' will be appended to :func: etc. cross-reference text. 91 | # 92 | # add_function_parentheses = True 93 | 94 | # If true, the current module name will be prepended to all description 95 | # unit titles (such as .. function::). 96 | # 97 | # add_module_names = True 98 | 99 | # If true, sectionauthor and moduleauthor directives will be shown in the 100 | # output. They are ignored by default. 
101 | # 102 | # show_authors = False 103 | 104 | # The name of the Pygments (syntax highlighting) style to use. 105 | pygments_style = 'sphinx' 106 | 107 | # A list of ignored prefixes for module index sorting. 108 | # modindex_common_prefix = [] 109 | 110 | # If true, keep warnings as "system message" paragraphs in the built documents. 111 | # keep_warnings = False 112 | 113 | # If true, `todo` and `todoList` produce output, else they produce nothing. 114 | todo_include_todos = False 115 | 116 | 117 | # -- Options for HTML output ---------------------------------------------- 118 | 119 | # The theme to use for HTML and HTML Help pages. See the documentation for 120 | # a list of builtin themes. 121 | # 122 | html_theme = 'alabaster' 123 | 124 | # Theme options are theme-specific and customize the look and feel of a theme 125 | # further. For a list of options available for each theme, see the 126 | # documentation. 127 | # 128 | # html_theme_options = {} 129 | 130 | # Add any paths that contain custom themes here, relative to this directory. 131 | # html_theme_path = [] 132 | 133 | # The name for this set of Sphinx documents. 134 | # " v documentation" by default. 135 | # 136 | # html_title = u'hccEncoding v0.0.1' 137 | 138 | # A shorter title for the navigation bar. Default is the same as html_title. 139 | # 140 | # html_short_title = None 141 | 142 | # The name of an image file (relative to this directory) to place at the top 143 | # of the sidebar. 144 | # 145 | # html_logo = None 146 | 147 | # The name of an image file (relative to this directory) to use as a favicon of 148 | # the docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 149 | # pixels large. 150 | # 151 | # html_favicon = None 152 | 153 | # Add any paths that contain custom static files (such as style sheets) here, 154 | # relative to this directory. They are copied after the builtin static files, 155 | # so a file named "default.css" will overwrite the builtin "default.css". 156 | html_static_path = ['_static'] 157 | 158 | # Add any extra paths that contain custom files (such as robots.txt or 159 | # .htaccess) here, relative to this directory. These files are copied 160 | # directly to the root of the documentation. 161 | # 162 | # html_extra_path = [] 163 | 164 | # If not None, a 'Last updated on:' timestamp is inserted at every page 165 | # bottom, using the given strftime format. 166 | # The empty string is equivalent to '%b %d, %Y'. 167 | # 168 | # html_last_updated_fmt = None 169 | 170 | # If true, SmartyPants will be used to convert quotes and dashes to 171 | # typographically correct entities. 172 | # 173 | # html_use_smartypants = True 174 | 175 | # Custom sidebar templates, maps document names to template names. 176 | # 177 | # html_sidebars = {} 178 | 179 | # Additional templates that should be rendered to pages, maps page names to 180 | # template names. 181 | # 182 | # html_additional_pages = {} 183 | 184 | # If false, no module index is generated. 185 | # 186 | # html_domain_indices = True 187 | 188 | # If false, no index is generated. 189 | # 190 | # html_use_index = True 191 | 192 | # If true, the index is split into individual pages for each letter. 193 | # 194 | # html_split_index = False 195 | 196 | # If true, links to the reST sources are added to the pages. 197 | # 198 | # html_show_sourcelink = True 199 | 200 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 201 | # 202 | # html_show_sphinx = True 203 | 204 | # If true, "(C) Copyright ..." 
is shown in the HTML footer. Default is True. 205 | # 206 | # html_show_copyright = True 207 | 208 | # If true, an OpenSearch description file will be output, and all pages will 209 | # contain a tag referring to it. The value of this option must be the 210 | # base URL from which the finished HTML is served. 211 | # 212 | # html_use_opensearch = '' 213 | 214 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 215 | # html_file_suffix = None 216 | 217 | # Language to be used for generating the HTML full-text search index. 218 | # Sphinx supports the following languages: 219 | # 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja' 220 | # 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr', 'zh' 221 | # 222 | # html_search_language = 'en' 223 | 224 | # A dictionary with options for the search language support, empty by default. 225 | # 'ja' uses this config value. 226 | # 'zh' user can custom change `jieba` dictionary path. 227 | # 228 | # html_search_options = {'type': 'default'} 229 | 230 | # The name of a javascript file (relative to the configuration directory) that 231 | # implements a search results scorer. If empty, the default will be used. 232 | # 233 | # html_search_scorer = 'scorer.js' 234 | 235 | # Output file base name for HTML help builder. 236 | htmlhelp_basename = 'hccEncodingdoc' 237 | 238 | # -- Options for LaTeX output --------------------------------------------- 239 | 240 | latex_elements = { 241 | # The paper size ('letterpaper' or 'a4paper'). 242 | # 243 | # 'papersize': 'letterpaper', 244 | 245 | # The font size ('10pt', '11pt' or '12pt'). 246 | # 247 | # 'pointsize': '10pt', 248 | 249 | # Additional stuff for the LaTeX preamble. 250 | # 251 | # 'preamble': '', 252 | 253 | # Latex figure (float) alignment 254 | # 255 | # 'figure_align': 'htbp', 256 | } 257 | 258 | # Grouping the document tree into LaTeX files. List of tuples 259 | # (source start file, target name, title, 260 | # author, documentclass [howto, manual, or own class]). 261 | latex_documents = [ 262 | (master_doc, 'hccEncoding.tex', u'hccEncoding Documentation', 263 | u'Ruobing Wang', 'manual'), 264 | ] 265 | 266 | # The name of an image file (relative to this directory) to place at the top of 267 | # the title page. 268 | # 269 | # latex_logo = None 270 | 271 | # For "manual" documents, if this is true, then toplevel headings are parts, 272 | # not chapters. 273 | # 274 | # latex_use_parts = False 275 | 276 | # If true, show page references after internal links. 277 | # 278 | # latex_show_pagerefs = False 279 | 280 | # If true, show URL addresses after external links. 281 | # 282 | # latex_show_urls = False 283 | 284 | # Documents to append as an appendix to all manuals. 285 | # 286 | # latex_appendices = [] 287 | 288 | # It false, will not define \strong, \code, itleref, \crossref ... but only 289 | # \sphinxstrong, ..., \sphinxtitleref, ... To help avoid clash with user added 290 | # packages. 291 | # 292 | # latex_keep_old_macro_names = True 293 | 294 | # If false, no module index is generated. 295 | # 296 | # latex_domain_indices = True 297 | 298 | 299 | # -- Options for manual page output --------------------------------------- 300 | 301 | # One entry per manual page. List of tuples 302 | # (source start file, name, description, authors, manual section). 303 | man_pages = [ 304 | (master_doc, 'hccencoding', u'hccEncoding Documentation', 305 | [author], 1) 306 | ] 307 | 308 | # If true, show URL addresses after external links. 
309 | #
310 | # man_show_urls = False
311 | 
312 | 
313 | # -- Options for Texinfo output -------------------------------------------
314 | 
315 | # Grouping the document tree into Texinfo files. List of tuples
316 | # (source start file, target name, title, author,
317 | # dir menu entry, description, category)
318 | texinfo_documents = [
319 | (master_doc, 'hccEncoding', u'hccEncoding Documentation',
320 | author, 'hccEncoding', 'One line description of project.',
321 | 'Miscellaneous'),
322 | ]
323 | 
324 | # Documents to append as an appendix to all manuals.
325 | #
326 | # texinfo_appendices = []
327 | 
328 | # If false, no module index is generated.
329 | #
330 | # texinfo_domain_indices = True
331 | 
332 | # How to display URL addresses: 'footnote', 'no', or 'inline'.
333 | #
334 | # texinfo_show_urls = 'footnote'
335 | 
336 | # If true, do not generate a @detailmenu in the "Top" node's menu.
337 | #
338 | # texinfo_no_detailmenu = False
339 | 
-------------------------------------------------------------------------------- /source/index.rst: --------------------------------------------------------------------------------
1 | .. hcc-encoding documentation master file, created by
2 | sphinx-quickstart on Sun Apr 02 21:31:44 2017.
3 | You can adapt this file completely to your liking, but it should at least
4 | contain the root `toctree` directive.
5 | 
6 | .. image:: https://travis-ci.org/Robin888/hccEncoding-project.svg?branch=master
7 | 
8 | .. image:: https://img.shields.io/pypi/v/hccEncoding.svg
9 | 
10 | .. image:: https://img.shields.io/pypi/l/hccEncoding.svg
11 | 
12 | .. image:: https://img.shields.io/pypi/pyversions/hccEncoding.svg
13 | 
14 | 
15 | 
16 | 
17 | Welcome to hccEncoding's documentation!
18 | ========================================
19 | 
20 | Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes.
21 | 
22 | This Python package implements preprocessing strategies for high-cardinality categorical data that allow this class of attributes to be used in predictive models. Currently there are two major methods, which are based on Daniele Micci-Barreca's empirical Bayes method [ref1] and Owen Zhang's leave-one-out encoding [ref2].
23 | 
24 | .. toctree::
25 | :maxdepth: 2
26 | 
27 | Functions
28 | ------------------
29 | * BayesEncoding(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False)
30 | * BayesEncodingKfold(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False,fold=5)
31 | * LOOEncoding(train,test,target,feature,noise=0.01,drop_origin_feature=False)
32 | * LOOEncodingKfold(train,test,target,feature,noise=0.01,drop_origin_feature=False,fold=5)
33 | 
34 | **Please see the example for a detailed explanation**
35 | - `Example `_
36 | 
37 | 
38 | General Parameters
39 | ------------------
40 | * train
41 | - train dataset, datatype: pandas dataframe
42 | * test
43 | - test dataset, datatype: pandas dataframe
44 | * target
45 | - name of the target to predict, datatype: string
46 | * feature
47 | - name of the feature to be encoded, datatype: string
48 | * k [default=5]
49 | - parameter for BayesEncoding and BayesEncodingKfold; the sample size at which the cell's posterior probability and the prior probability receive equal weight, i.e. the midpoint of the transition from prior to posterior, datatype: int
50 | * f [default=1]
51 | - parameter for BayesEncoding and BayesEncodingKfold; controls how quickly the weight shifts from the prior to the posterior as the size of the group increases (see the worked example below), datatype: int
52 | * noise [default=0.01]
53 | - multiplicative noise added after encoding. For classification problems, uniform noise drawn from [1-noise, 1+noise] is applied; for regression problems, normal noise drawn from N(1, noise) is applied, datatype: double
54 | * drop_origin_feature [default=False]
55 | - whether to drop the original feature, datatype: boolean
56 | * fold [default=5]
57 | - parameter for LOOEncodingKfold and BayesEncodingKfold; the number of folds the train dataset will be split into, datatype: int
58 | 
59 | 
60 | References
61 | ==================
62 | * ref1: Daniele Micci-Barreca. 2001. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. SIGKDD Explor. Newsl. 3, 1 (July 2001), 27-32.
63 | * ref2: `https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions `_
64 | 
65 | Indices and tables
66 | ==================
67 | 
68 | * :ref:`genindex`
69 | * :ref:`modindex`
70 | * :ref:`search`
71 | 
72 | 
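73 | Worked Example: the k and f Parameters
74 | ======================================
75 | 
76 | A quick numerical sketch (illustrative only) of the smoothing weight ``B = 1/(1+exp(-(n-k)/f))`` used by BayesEncoding, where ``n`` is the number of rows observed for a category:
77 | 
78 | .. code-block:: python
79 | 
80 | 	import numpy as np
81 | 
82 | 	k, f = 5, 1  # the default values
83 | 	for n in [1, 5, 10]:
84 | 	    B = 1 / (1 + np.exp(-(n - k) / f))
85 | 	    # n=1 -> B~0.02 (mostly prior), n=5 -> B=0.5, n=10 -> B~0.99 (mostly posterior)
86 | 	    print(n, B)
87 | 
88 | The encoded value is ``B * posterior + (1 - B) * prior``, so rare categories shrink toward the global prior while frequent categories keep their own target statistics.
89 | 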
-------------------------------------------------------------------------------- /view-doc.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | cd build
3 | cd html
4 | index.html
--------------------------------------------------------------------------------