├── LICENSE.txt ├── MANIFEST.in ├── README.rst ├── build-dist.bat ├── build-doc.bat ├── create_doctree.py ├── example ├── Example.ipynb ├── test.csv └── train.csv ├── hccEncoding ├── EncoderForClassification.py ├── EncoderForRegression.py ├── __init__.py └── __init__.pyc ├── install-linux.sh ├── install-macos.sh ├── install-win.bat ├── pypi-register.bat ├── pypi-upload.bat ├── release-history.rst ├── requirements.txt ├── setup.py ├── source ├── conf.py └── index.rst └── view-doc.bat
-------------------------------------------------------------------------------- /LICENSE.txt: --------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | 
3 | Copyright 2017 Ruobing Wang
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
-------------------------------------------------------------------------------- /MANIFEST.in: --------------------------------------------------------------------------------
1 | include README.rst LICENSE.txt requirements.txt release-history.rst
2 | recursive-include hccEncoding *.*
-------------------------------------------------------------------------------- /README.rst: --------------------------------------------------------------------------------
1 | .. image:: https://travis-ci.org/Robin888/hccEncoding-project.svg?branch=master
2 | 
3 | .. image:: https://img.shields.io/pypi/v/hccEncoding.svg
4 | 
5 | .. image:: https://img.shields.io/pypi/l/hccEncoding.svg
6 | 
7 | .. image:: https://img.shields.io/pypi/pyversions/hccEncoding.svg
8 | 
9 | 
10 | Welcome to hccEncoding Documentation
11 | ===============================================================================
12 | This is an example project for demonstration purposes.
13 | 
14 | 
15 | **Quick Links**
16 | -------------------------------------------------------------------------------
17 | - `GitHub Homepage `_
18 | - `Online Documentation `_
19 | - `PyPI download `_
20 | - `Install `_
21 | - `Issue submit and feature request `_
22 | - `API reference and source code `_
23 | - `Tutorial `_
24 | 
25 | .. _install:
26 | 
27 | Install
28 | -------------------------------------------------------------------------------
29 | 
30 | ``hccEncoding`` is released on PyPI, so all you need is:
31 | 
32 | .. code-block:: console
33 | 
34 | 	$ pip install hccEncoding
35 | 
36 | To upgrade to the latest version:
37 | 
38 | .. code-block:: console
39 | 
40 | 	$ pip install --upgrade hccEncoding
41 | 
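42 | Quick Example
43 | -------------------------------------------------------------------------------
44 | 
45 | A minimal sketch of encoding one high-cardinality column. The DataFrames and column names here (``train``, ``test``, ``price``, ``city``) are illustrative, not part of the package; any pandas DataFrames with a numeric target column and a categorical column will do:
46 | 
47 | .. code-block:: python
48 | 
49 | 	from hccEncoding.EncoderForRegression import BayesEncoding
50 | 
51 | 	# returns copies of train/test with a new smoothed target-mean column 'bayes_city'
52 | 	train_enc, test_enc = BayesEncoding(train, test, 'price', 'city')
53 | 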
-------------------------------------------------------------------------------- /build-dist.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | python setup.py sdist
3 | python setup.py bdist_wheel --universal
4 | pause
-------------------------------------------------------------------------------- /build-doc.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | cd hccEncoding
3 | python zzz_manual_install.py
4 | cd ..
5 | python create_doctree.py
6 | make html
-------------------------------------------------------------------------------- /create_doctree.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | 
4 | import docfly
5 | 
6 | # Uncomment this if you follow Sanhe's Sphinx Doc Style Guide
7 | #--- Manually Made Doc ---
8 | # doc = docfly.DocTree("source")
9 | # doc.fly(table_of_content_header="Table of Content (目录)")
10 | 
11 | #--- Api Reference Doc ---
12 | package_name = "hccEncoding"
13 | 
14 | doc = docfly.ApiReferenceDoc(
15 | package_name,
16 | dst="source",
17 | ignore=[
18 | "%s.packages" % package_name,
19 | "%s.zzz_manual_install.py" % package_name,
20 | ]
21 | )
22 | doc.fly()
-------------------------------------------------------------------------------- /example/Example.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Tutorial for hccEncoding\n",
8 | "\n",
9 | "\n",
10 | "This notebook shows how to use the hcc-encoding package for classification and regression problems, using the dataset from the Kaggle competition 'Prudential Life Insurance Assessment'.\n",
11 | "\n",
12 | "In hcc-encoding, the basic principle is to map individual values of a high-cardinality categorical independent attribute to an estimate of the probability or the expected value of the dependent attribute. However, simply replacing the high-cardinality categorical values with raw target statistics often results in information leakage."
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "dataset download: https://www.kaggle.com/c/prudential-life-insurance-assessment"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "In this dataset, you are provided with over a hundred variables describing attributes of life insurance applicants. \n",
27 | "The task is to predict the \"Response\" variable for each Id in the test set. \n",
28 | "\"Response\" is an ordinal measure of risk that has 8 levels, which means the problem can be treated as either a regression problem or a classification problem (classifying into 8 classes)"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": 1,
34 | "metadata": {
35 | "collapsed": true
36 | },
37 | "outputs": [],
38 | "source": [
39 | "# load raw data\n",
40 | "import pandas as pd\n",
41 | "train=pd.read_csv('train.csv')\n",
42 | "test=pd.read_csv('test.csv')"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 2,
48 | "metadata": {
49 | "collapsed": false
50 | },
51 | "outputs": [
52 | {
53 | "data": {
54 | "text/html": [
55 | "
(HTML table output omitted; see the text/plain rendering below)"
208 | ],
209 | "text/plain": [
210 | " Id Product_Info_1 Product_Info_2 Product_Info_3 Product_Info_4 \\\n",
211 | "0 2 1 D3 10 0.076923 \n",
212 | "1 5 1 A1 26 0.076923 \n",
213 | "2 6 1 E1 26 0.076923 \n",
214 | "3 7 1 D4 10 0.487179 \n",
215 | "4 8 1 D2 26 0.230769 \n",
216 | "\n",
217 | " Product_Info_5 Product_Info_6 Product_Info_7 Ins_Age Ht \\\n",
218 | "0 2 1 1 0.641791 0.581818 \n",
219 | "1 2 3 1 0.059701 0.600000 \n",
220 | "2 2 3 1 0.029851 0.745455 \n",
221 | "3 2 3 1 0.164179 0.672727 \n",
222 | "4 2 3 1 0.417910 0.654545 \n",
223 | "\n",
224 | " ... Medical_Keyword_40 Medical_Keyword_41 Medical_Keyword_42 \\\n",
225 | "0 ... 0 0 0 \n",
226 | "1 ... 0 0 0 \n",
227 | "2 ... 0 0 0 \n",
228 | "3 ... 0 0 0 \n",
229 | "4 ... 0 0 0 \n",
230 | "\n",
231 | " Medical_Keyword_43 Medical_Keyword_44 Medical_Keyword_45 \\\n",
232 | "0 0 0 0 \n",
233 | "1 0 0 0 \n",
234 | "2 0 0 0 \n",
235 | "3 0 0 0 \n",
236 | "4 0 0 0 \n",
237 | "\n",
238 | " Medical_Keyword_46 Medical_Keyword_47 Medical_Keyword_48 Response \n",
239 | "0 0 0 0 8 \n",
240 | "1 0 0 0 4 \n",
241 | "2 0 0 0 8 \n",
242 | "3 0 0 0 8 \n",
243 | "4 0 0 0 8 \n",
244 | "\n",
245 | "[5 rows x 128 columns]"
246 | ]
247 | },
248 | "execution_count": 2,
249 | "metadata": {},
250 | "output_type": "execute_result"
251 | }
252 | ],
253 | "source": [
254 | "train.head()"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": 3,
260 | "metadata": {
261 | "collapsed": false
262 | },
263 | "outputs": [
264 | {
265 | "data": {
266 | "text/plain": [
267 | "19"
268 | ]
269 | },
270 | "execution_count": 3,
271 | "metadata": {},
272 | "output_type": "execute_result"
273 | }
274 | ],
275 | "source": [
276 | "len(train['Product_Info_2'].unique())"
277 | ]
278 | },
279 | {
280 | "cell_type": "markdown",
281 | "metadata": {},
282 | "source": [
283 | "It can be seen that the feature 'Product_Info_2' can be treated as a high-cardinality feature. To show how to use hcc-encoding more clearly, we drop most of the irrelevant features:"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": 4,
289 | "metadata": {
290 | "collapsed": false
291 | },
292 | "outputs": [],
293 | "source": [
294 | "train=train[['Id','Response','Product_Info_2']]\n",
295 | "test=test[['Id','Product_Info_2']]"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": 5,
301 | "metadata": {
302 | "collapsed": false
303 | },
304 | "outputs": [
305 | {
306 | "data": {
307 | "text/html": [
308 | "
(HTML table output omitted; see the text/plain rendering below)
" 352 | ], 353 | "text/plain": [ 354 | " Id Response Product_Info_2\n", 355 | "0 2 8 D3\n", 356 | "1 5 4 A1\n", 357 | "2 6 8 E1\n", 358 | "3 7 8 D4\n", 359 | "4 8 8 D2" 360 | ] 361 | }, 362 | "execution_count": 5, 363 | "metadata": {}, 364 | "output_type": "execute_result" 365 | } 366 | ], 367 | "source": [ 368 | "train.head()" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "# Part 1. Encoding for classification problems" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 6, 381 | "metadata": { 382 | "collapsed": false 383 | }, 384 | "outputs": [], 385 | "source": [ 386 | "from hccEncoding.EncoderForClassification import BayesEncoding,BayesEncodingKfold,LOOEncoding,LOOEncodingKfold\n", 387 | "\n", 388 | "train_BayesEncoding,test_BayesEncoding=BayesEncoding(train,test,'Response','Product_Info_2')\n", 389 | "train_LOOEncoding,test_LOOEncoding=LOOEncoding(train,test,'Response','Product_Info_2')\n" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 7, 395 | "metadata": { 396 | "collapsed": false 397 | }, 398 | "outputs": [ 399 | { 400 | "data": { 401 | "text/html": [ 402 | "
(HTML table output omitted; see the text/plain rendering below)"
494 | ],
495 | "text/plain": [
496 | " Id Response Product_Info_2 bayes_Product_Info_2_1 \\\n",
497 | "0 2 8 D3 0.100520 \n",
498 | "1 5 4 A1 0.055488 \n",
499 | "2 6 8 E1 0.076636 \n",
500 | "3 7 8 D4 0.064025 \n",
501 | "4 8 8 D2 0.119590 \n",
502 | "\n",
503 | " bayes_Product_Info_2_2 bayes_Product_Info_2_3 bayes_Product_Info_2_4 \\\n",
504 | "0 0.116242 0.016612 0.029456 \n",
505 | "1 0.100343 0.021790 0.028626 \n",
506 | "2 0.078272 0.012823 0.029673 \n",
507 | "3 0.065807 0.007659 0.011436 \n",
508 | "4 0.147281 0.017612 0.024653 \n",
509 | "\n",
510 | " bayes_Product_Info_2_5 bayes_Product_Info_2_6 bayes_Product_Info_2_7 \\\n",
511 | "0 0.087625 0.227483 0.145368 \n",
512 | "1 0.083012 0.148888 0.115147 \n",
513 | "2 0.067797 0.162853 0.150906 \n",
514 | "3 0.075465 0.166885 0.135976 \n",
515 | "4 0.067270 0.247489 0.145203 \n",
516 | "\n",
517 | " bayes_Product_Info_2_8 \n",
518 | "0 0.274486 \n",
519 | "1 0.445322 \n",
520 | "2 0.425601 \n",
521 | "3 0.477254 \n",
522 | "4 0.233645 "
523 | ]
524 | },
525 | "execution_count": 7,
526 | "metadata": {},
527 | "output_type": "execute_result"
528 | }
529 | ],
530 | "source": [
531 | "train_BayesEncoding.head()"
532 | ]
533 | },
534 | {
535 | "cell_type": "markdown",
536 | "metadata": {},
537 | "source": [
538 | "Note: In BayesEncoding for classification problems, the new features produced by encoding are the estimated probabilities of each class; each generated feature is named 'bayes_(original feature name)_(class label)', e.g. 'bayes_Product_Info_2_8'"
539 | ]
540 | },
541 | {
542 | "cell_type": "code",
543 | "execution_count": 8,
544 | "metadata": {
545 | "collapsed": false
546 | },
547 | "outputs": [
548 | {
549 | "data": {
550 | "text/html": [
551 | "
(HTML table output omitted; see the text/plain rendering below)
" 601 | ], 602 | "text/plain": [ 603 | " Id Response Product_Info_2 loo_Product_Info_2\n", 604 | "0 2 8 D3 5.568211\n", 605 | "1 5 4 A1 6.176825\n", 606 | "2 6 8 E1 6.171721\n", 607 | "3 7 8 D4 6.373370\n", 608 | "4 8 8 D2 5.260191" 609 | ] 610 | }, 611 | "execution_count": 8, 612 | "metadata": {}, 613 | "output_type": "execute_result" 614 | } 615 | ], 616 | "source": [ 617 | "train_LOOEncoding.head()" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 9, 623 | "metadata": { 624 | "collapsed": false 625 | }, 626 | "outputs": [], 627 | "source": [ 628 | "train_BayesEncodingKfold,test_BayesEncodingKfold=BayesEncodingKfold(train,test,'Response','Product_Info_2',fold=5)\n", 629 | "train_LOOEncodingKfold,test_LOOEncodingKfold=LOOEncodingKfold(train,test,'Response','Product_Info_2',fold=5)\n" 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": 10, 635 | "metadata": { 636 | "collapsed": false 637 | }, 638 | "outputs": [ 639 | { 640 | "data": { 641 | "text/html": [ 642 | "
(HTML table output omitted; see the text/plain rendering below)
" 734 | ], 735 | "text/plain": [ 736 | " Id Response Product_Info_2 bayes_Product_Info_2_1 \\\n", 737 | "0 2 8 D3 0.103135 \n", 738 | "1 5 4 A1 0.054170 \n", 739 | "2 6 8 E1 0.072549 \n", 740 | "3 7 8 D4 0.063238 \n", 741 | "4 8 8 D2 0.114566 \n", 742 | "\n", 743 | " bayes_Product_Info_2_2 bayes_Product_Info_2_3 bayes_Product_Info_2_4 \\\n", 744 | "0 0.116970 0.016271 0.029247 \n", 745 | "1 0.094039 0.023956 0.030422 \n", 746 | "2 0.079161 0.013412 0.028990 \n", 747 | "3 0.064417 0.007972 0.011082 \n", 748 | "4 0.146584 0.018327 0.024603 \n", 749 | "\n", 750 | " bayes_Product_Info_2_5 bayes_Product_Info_2_6 bayes_Product_Info_2_7 \\\n", 751 | "0 0.086499 0.229355 0.147193 \n", 752 | "1 0.087120 0.148125 0.115648 \n", 753 | "2 0.063683 0.161461 0.150065 \n", 754 | "3 0.076236 0.165874 0.131380 \n", 755 | "4 0.068299 0.245421 0.148178 \n", 756 | "\n", 757 | " bayes_Product_Info_2_8 \n", 758 | "0 0.268195 \n", 759 | "1 0.443238 \n", 760 | "2 0.433177 \n", 761 | "3 0.478303 \n", 762 | "4 0.233748 " 763 | ] 764 | }, 765 | "execution_count": 10, 766 | "metadata": {}, 767 | "output_type": "execute_result" 768 | } 769 | ], 770 | "source": [ 771 | "train_BayesEncodingKfold.head()" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 11, 777 | "metadata": { 778 | "collapsed": false 779 | }, 780 | "outputs": [ 781 | { 782 | "data": { 783 | "text/html": [ 784 | "
(HTML table output omitted; see the text/plain rendering below)"
833 | ],
834 | "text/plain": [
835 | " Id Response Product_Info_2 loo_Product_Info_2\n",
836 | "0 2 8 D3 5.510350\n",
837 | "1 5 4 A1 6.147088\n",
838 | "2 6 8 E1 6.165965\n",
839 | "3 7 8 D4 6.389432\n",
840 | "4 8 8 D2 5.280445"
841 | ]
842 | },
843 | "execution_count": 11,
844 | "metadata": {},
845 | "output_type": "execute_result"
846 | }
847 | ],
848 | "source": [
849 | "train_LOOEncodingKfold.head()"
850 | ]
851 | },
852 | {
853 | "cell_type": "markdown",
854 | "metadata": {},
855 | "source": [
856 | "Note: the difference between BayesEncoding and BayesEncodingKfold (and likewise between LOOEncoding and LOOEncodingKfold) lies in how the train dataset is encoded. In BayesEncoding (or LOOEncoding), the train dataset is encoded using statistics of the full train dataset. In BayesEncodingKfold (or LOOEncodingKfold), each part of the train dataset is encoded using statistics from only the remaining folds. For example, when fold=5, BayesEncodingKfold (or LOOEncodingKfold) uses 80% of the train dataset to encode the remaining 20%. This further reduces the risk of information leakage; the downside is that less information from the train dataset is used. "
857 | ]
858 | },
859 | {
860 | "cell_type": "markdown",
861 | "metadata": {},
862 | "source": [
863 | "# Part 2. Encoding for Regression problems"
864 | ]
865 | },
866 | {
867 | "cell_type": "code",
868 | "execution_count": 12,
869 | "metadata": {
870 | "collapsed": true
871 | },
872 | "outputs": [],
873 | "source": [
874 | "from hccEncoding.EncoderForRegression import BayesEncoding,BayesEncodingKfold,LOOEncoding,LOOEncodingKfold\n",
875 | "\n",
876 | "train_BayesEncoding,test_BayesEncoding=BayesEncoding(train,test,'Response','Product_Info_2')\n",
877 | "train_LOOEncoding,test_LOOEncoding=LOOEncoding(train,test,'Response','Product_Info_2')"
878 | ]
879 | },
880 | {
881 | "cell_type": "code",
882 | "execution_count": 13,
883 | "metadata": {
884 | "collapsed": false
885 | },
886 | "outputs": [
887 | {
888 | "data": {
889 | "text/html": [
890 | "
(HTML table output omitted; see the text/plain rendering below)
" 941 | ], 942 | "text/plain": [ 943 | " Id Response Product_Info_2 bayes_Product_Info_2\n", 944 | "0 2 8 D3 5.554615\n", 945 | "1 5 4 A1 6.212693\n", 946 | "2 6 8 E1 6.226905\n", 947 | "3 7 8 D4 6.337270\n", 948 | "4 8 8 D2 5.261874" 949 | ] 950 | }, 951 | "execution_count": 13, 952 | "metadata": {}, 953 | "output_type": "execute_result" 954 | } 955 | ], 956 | "source": [ 957 | "train_BayesEncoding.head()" 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": 14, 963 | "metadata": { 964 | "collapsed": false 965 | }, 966 | "outputs": [ 967 | { 968 | "data": { 969 | "text/html": [ 970 | "
(HTML table output omitted; see the text/plain rendering below)
" 1020 | ], 1021 | "text/plain": [ 1022 | " Id Response Product_Info_2 loo_Product_Info_2\n", 1023 | "0 2 8 D3 5.553309\n", 1024 | "1 5 4 A1 6.042045\n", 1025 | "2 6 8 E1 6.065838\n", 1026 | "3 7 8 D4 6.423081\n", 1027 | "4 8 8 D2 5.302704" 1028 | ] 1029 | }, 1030 | "execution_count": 14, 1031 | "metadata": {}, 1032 | "output_type": "execute_result" 1033 | } 1034 | ], 1035 | "source": [ 1036 | "train_LOOEncoding.head()" 1037 | ] 1038 | }, 1039 | { 1040 | "cell_type": "code", 1041 | "execution_count": 15, 1042 | "metadata": { 1043 | "collapsed": true 1044 | }, 1045 | "outputs": [], 1046 | "source": [ 1047 | "train_BayesEncodingKfold,test_BayesEncodingKfold=BayesEncodingKfold(train,test,'Response','Product_Info_2',fold=5)\n", 1048 | "train_LOOEncodingKfold,test_LOOEncodingKfold=LOOEncodingKfold(train,test,'Response','Product_Info_2',fold=5)" 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "code", 1053 | "execution_count": 16, 1054 | "metadata": { 1055 | "collapsed": false 1056 | }, 1057 | "outputs": [ 1058 | { 1059 | "data": { 1060 | "text/html": [ 1061 | "
(HTML table output omitted; see the text/plain rendering below)
" 1111 | ], 1112 | "text/plain": [ 1113 | " Id Response Product_Info_2 bayes_Product_Info_2\n", 1114 | "0 2 8 D3 5.528438\n", 1115 | "1 5 4 A1 6.217039\n", 1116 | "2 6 8 E1 6.175437\n", 1117 | "3 7 8 D4 6.322527\n", 1118 | "4 8 8 D2 5.288024" 1119 | ] 1120 | }, 1121 | "execution_count": 16, 1122 | "metadata": {}, 1123 | "output_type": "execute_result" 1124 | } 1125 | ], 1126 | "source": [ 1127 | "train_BayesEncodingKfold.head()" 1128 | ] 1129 | }, 1130 | { 1131 | "cell_type": "code", 1132 | "execution_count": 17, 1133 | "metadata": { 1134 | "collapsed": false 1135 | }, 1136 | "outputs": [ 1137 | { 1138 | "data": { 1139 | "text/html": [ 1140 | "
(HTML table output omitted; see the text/plain rendering below)"
1189 | ],
1190 | "text/plain": [
1191 | " Id Response Product_Info_2 loo_Product_Info_2\n",
1192 | "0 2 8 D3 5.511082\n",
1193 | "1 5 4 A1 6.147173\n",
1194 | "2 6 8 E1 6.164431\n",
1195 | "3 7 8 D4 6.390441\n",
1196 | "4 8 8 D2 5.280106"
1197 | ]
1198 | },
1199 | "execution_count": 17,
1200 | "metadata": {},
1201 | "output_type": "execute_result"
1202 | }
1203 | ],
1204 | "source": [
1205 | "train_LOOEncodingKfold.head()"
1206 | ]
1207 | },
1208 | {
1209 | "cell_type": "markdown",
1210 | "metadata": {},
1211 | "source": [
1212 | "For a more detailed explanation of the parameters, please check the online documentation: http://hccencoding-project.readthedocs.io/en/latest/"
1213 | ]
1214 | }
1215 | ],
1216 | "metadata": {
1217 | "anaconda-cloud": {},
1218 | "kernelspec": {
1219 | "display_name": "Python [default]",
1220 | "language": "python",
1221 | "name": "python2"
1222 | },
1223 | "language_info": {
1224 | "codemirror_mode": {
1225 | "name": "ipython",
1226 | "version": 2
1227 | },
1228 | "file_extension": ".py",
1229 | "mimetype": "text/x-python",
1230 | "name": "python",
1231 | "nbconvert_exporter": "python",
1232 | "pygments_lexer": "ipython2",
1233 | "version": "2.7.12"
1234 | }
1235 | },
1236 | "nbformat": 4,
1237 | "nbformat_minor": 1
1238 | }
1239 | 
-------------------------------------------------------------------------------- /hccEncoding/EncoderForClassification.py: --------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sun Mar 12 01:36:23 2017
4 | 
5 | @author: Ruobing
6 | """
7 | from sklearn.model_selection import StratifiedKFold
8 | import numpy as np
9 | import pandas as pd
10 | ###paper: A preprocessing scheme for high cardinality categorical attributes in classification and prediction problems
11 | def BayesEncoding(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False):
12 | entry=pd.get_dummies(train[target])
13 | classes=list(entry)
14 | entry[feature]=train[feature]
15 | entry=entry.groupby(feature).agg('sum').reset_index()
16 | entry['total_count']=entry.sum(axis=1)
17 | for e in classes:
18 | newname='bayes_'+feature+'_'+str(e)
19 | prior=sum(entry[e])/sum(entry['total_count'])
20 | posterior=entry[e]/entry['total_count']
21 | B=1/(1+np.exp(-1*(entry['total_count']-k)/f)) # smoothing weight: 0.5 when a category has k rows; f sets the transition speed
22 | entry[newname]=B*posterior+(1-B)*prior # blend of in-category posterior and global prior
23 | 
24 | newfeature=['bayes_'+feature+'_'+str(e) for e in classes]
25 | newlist=[feature]+ newfeature
26 | useful=entry[newlist]
27 | train=train.join(useful.set_index(feature),on=feature)
28 | test=test.join(useful.set_index(feature),on=feature)
29 | 
30 | for e in classes:
31 | newname='bayes_'+feature+'_'+str(e)
32 | test[newname]=test[newname].fillna(sum(entry[e])/sum(entry['total_count']))
33 | if noise: # Add uniform noise.
Not mentioned in original paper
34 | train[newname]=train[newname]*np.random.uniform(1 - noise, 1 + noise, len(train[newname]))
35 | test[newname]=test[newname]*np.random.uniform(1 - noise, 1 + noise, len(test[newname]))
36 | 
37 | 
38 | if drop_origin_feature==True:
39 | train=train.drop(feature,1)
40 | test=test.drop(feature,1)
41 | 
42 | return train,test
43 | 
44 | def BayesEncodingKfold(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False,fold=5):
45 | train_no_use,test_useful=BayesEncoding(train,test,target,feature,k,f,noise,drop_origin_feature)
46 | skf = StratifiedKFold(fold)
47 | alltrain_min=[]
48 | for train_id, test_id in skf.split(train,np.zeros(len(train))):
49 | train_maj,train_min=BayesEncoding(train.iloc[train_id], train.iloc[test_id],target,feature,k,f,noise,drop_origin_feature)
50 | alltrain_min.append(train_min)
51 | train_useful=pd.concat(alltrain_min,0)
52 | return train_useful,test_useful
53 | 
54 | 
55 | def LOOEncoding(train,test,target,feature,noise=0.01,drop_origin_feature=False):
56 | cs = train.groupby(by=[feature])[target].sum()
57 | cc = train[feature].value_counts()
58 | boolean = (cc == 1) # categories seen only once (guard against division by zero below)
59 | index = boolean[boolean == True].index.values
60 | cc.loc[boolean] += 1
61 | cs.loc[index] *= 2
62 | train = train.join(cs.rename('sum'), on=[feature])
63 | train = train.join(cc.rename('count'), on=[feature])
64 | newname='loo_'+feature
65 | train[newname] = (train['sum']-train[target])/(train['count'] - 1)
66 | if noise: train[newname]= train[newname]*np.random.uniform(1 - noise, 1 + noise, len(train[newname])) # Add uniform noise. Not mentioned in original paper
67 | del train['sum'], train['count']
68 | cstest=train.groupby(by=[feature])[newname].mean()
69 | test=test.join(cstest.rename(newname),on=[feature])
70 | if drop_origin_feature==True:
71 | train=train.drop(feature,1)
72 | test=test.drop(feature,1)
73 | return train,test
74 | 
75 | def LOOEncodingKfold(train,test,target,feature,noise=0.01,drop_origin_feature=False,fold=5):
76 | train_no_use,test_useful=LOOEncoding(train,test,target,feature,noise,drop_origin_feature)
77 | skf = StratifiedKFold(fold)
78 | alltrain_min=[]
79 | for train_id, test_id in skf.split(train,np.zeros(len(train))):
80 | train_maj,train_min=LOOEncoding(train.iloc[train_id], train.iloc[test_id],target,feature,noise,drop_origin_feature)
81 | alltrain_min.append(train_min)
82 | train_useful=pd.concat(alltrain_min,0)
83 | newname='loo_'+feature
84 | train_useful[newname]=train_useful[newname].fillna(train_useful[newname].median())
85 | test_useful[newname]=test_useful[newname].fillna(test_useful[newname].median())
86 | return train_useful,test_useful
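87 | 
88 | 
89 | # Illustrative usage sketch (not part of the original module); it assumes a train/test
90 | # pair with target 'Response' and high-cardinality feature 'Product_Info_2', as in the
91 | # example notebook. Kept commented out so importing this module stays side-effect free.
92 | # if __name__ == '__main__':
93 | #     train = pd.read_csv('train.csv')
94 | #     test = pd.read_csv('test.csv')
95 | #     train_enc, test_enc = BayesEncoding(train, test, 'Response', 'Product_Info_2')
96 | #     # one 'bayes_Product_Info_2_<class>' probability column is added per target class
97 | #     print([c for c in train_enc.columns if c.startswith('bayes_')])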
-------------------------------------------------------------------------------- /hccEncoding/EncoderForRegression.py: --------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sun Mar 12 01:36:23 2017
4 | 
5 | @author: Ruobing
6 | """
7 | from sklearn.model_selection import StratifiedKFold
8 | import numpy as np
9 | import pandas as pd
10 | ###paper: A preprocessing scheme for high cardinality categorical attributes in classification and prediction problems
11 | def BayesEncoding(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False):
12 | entry=pd.DataFrame(train[[target,feature]],columns=[target,feature])
13 | library=entry.groupby(feature).agg('mean').rename(columns={target:'sampleMean'}).reset_index()
14 | count=entry.groupby(feature).agg('count').rename(columns={target:'total_count'}).reset_index()
15 | library['total_count']=count['total_count']
16 | library['prior']=train[target].mean()
17 | 
18 | B=1/(1+np.exp(-1*(library['total_count']-k)/f))
19 | newname='bayes_'+feature
20 | library[newname]=B*library['sampleMean']+(1-B)*library['prior']
21 | 
22 | newlist=[feature]+ [newname]
23 | useful=library[newlist]
24 | train=train.join(useful.set_index(feature),on=feature)
25 | test=test.join(useful.set_index(feature),on=feature)
26 | test[newname]=test[newname].fillna(train[target].mean())
27 | 
28 | if noise: # Add normal noise. Not mentioned in original paper
29 | train[newname]=train[newname]*np.random.normal(1, noise, len(train[newname]))
30 | test[newname]=test[newname]*np.random.normal(1 , noise, len(test[newname]))
31 | 
32 | 
33 | if drop_origin_feature==True:
34 | train=train.drop(feature,1)
35 | test=test.drop(feature,1)
36 | return train,test
37 | 
38 | def BayesEncodingKfold(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False,fold=5):
39 | train_no_use,test_useful=BayesEncoding(train,test,target,feature,k,f,noise,drop_origin_feature)
40 | skf = StratifiedKFold(fold)
41 | alltrain_min=[]
42 | for train_id, test_id in skf.split(train,np.zeros(len(train))):
43 | train_maj,train_min=BayesEncoding(train.iloc[train_id], train.iloc[test_id],target,feature,k,f,noise,drop_origin_feature)
44 | alltrain_min.append(train_min)
45 | train_useful=pd.concat(alltrain_min,0)
46 | return train_useful,test_useful
47 | 
48 | 
49 | def LOOEncoding(train,test,target,feature,noise=0.01,drop_origin_feature=False):
50 | cs = train.groupby(by=[feature])[target].sum()
51 | cc = train[feature].value_counts()
52 | boolean = (cc == 1)
53 | index = boolean[boolean == True].index.values
54 | cc.loc[boolean] += 1
55 | cs.loc[index] *= 2
56 | train = train.join(cs.rename('sum'), on=[feature])
57 | train = train.join(cc.rename('count'), on=[feature])
58 | newname='loo_'+feature
59 | train[newname] = (train['sum']-train[target])/(train['count'] - 1)
60 | if noise: train[newname]= train[newname]*np.random.normal(1,noise,len(train[newname])) # Add normal noise.
Not mentioned in original paper
61 | del train['sum'], train['count']
62 | cstest=train.groupby(by=[feature])[newname].mean()
63 | test=test.join(cstest.rename(newname),on=[feature])
64 | if drop_origin_feature==True:
65 | train=train.drop(feature,1)
66 | test=test.drop(feature,1)
67 | return train,test
68 | 
69 | def LOOEncodingKfold(train,test,target,feature,noise=0.01,drop_origin_feature=False,fold=5):
70 | train_no_use,test_useful=LOOEncoding(train,test,target,feature,noise,drop_origin_feature)
71 | skf = StratifiedKFold(fold)
72 | alltrain_min=[]
73 | for train_id, test_id in skf.split(train,np.zeros(len(train))):
74 | train_maj,train_min=LOOEncoding(train.iloc[train_id], train.iloc[test_id],target,feature,noise,drop_origin_feature)
75 | alltrain_min.append(train_min)
76 | train_useful=pd.concat(alltrain_min,0)
77 | newname='loo_'+feature
78 | train_useful[newname]=train_useful[newname].fillna(train_useful[newname].mean())
79 | test_useful[newname]=test_useful[newname].fillna(test_useful[newname].mean())
80 | return train_useful,test_useful
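81 | 
82 | 
83 | # Illustrative usage sketch (not part of the original module); it assumes 'train.csv'
84 | # and 'test.csv' from the example folder, with target 'Response' and categorical
85 | # feature 'Product_Info_2'. Kept commented out so importing stays side-effect free.
86 | # if __name__ == '__main__':
87 | #     train = pd.read_csv('train.csv')
88 | #     test = pd.read_csv('test.csv')
89 | #     train_enc, test_enc = BayesEncoding(train, test, 'Response', 'Product_Info_2', k=5, f=1)
90 | #     print(train_enc['bayes_Product_Info_2'].head())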
-------------------------------------------------------------------------------- /hccEncoding/__init__.py: --------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sat Apr 01 10:03:49 2017
4 | 
5 | @author: Ruobing Wang
6 | """
7 | 
8 | 
9 | 
10 | __version__ = "0.0.1"
11 | __short_description__ = "hcc-encoding"
12 | __license__ = "MIT"
-------------------------------------------------------------------------------- /hccEncoding/__init__.pyc: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/Robin888/hccEncoding-project/2412e5216b4904421721b43c803f417a4b3b7b61/hccEncoding/__init__.pyc
-------------------------------------------------------------------------------- /install-linux.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 | cd "$(dirname "$0")"
3 | cd hccEncoding
4 | python zzz_manual_install.py
-------------------------------------------------------------------------------- /install-macos.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 | cd "$(dirname "$0")"
3 | cd hccEncoding
4 | python zzz_manual_install.py
-------------------------------------------------------------------------------- /install-win.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | cd hccEncoding
3 | python zzz_manual_install.py
-------------------------------------------------------------------------------- /pypi-register.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | python setup.py register -r pypi
-------------------------------------------------------------------------------- /pypi-upload.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | CHOICE /C YN /M "upload to pypi, Y to continue, N to cancel"
3 | IF ERRORLEVEL==2 goto end
4 | IF ERRORLEVEL==1 goto upload
5 | 
6 | :upload
7 | python setup.py sdist upload -r pypi
8 | goto end
9 | 
10 | :end
11 | pause
-------------------------------------------------------------------------------- /release-history.rst: --------------------------------------------------------------------------------
1 | Release and Version History
2 | ===========================
3 | 
4 | 0.0.2 (TODO)
5 | ~~~~~~~~~~~~~~~~~~
6 | **Features and Improvements**
7 | 
8 | **Minor Improvements**
9 | 
10 | **Bugfixes**
11 | 
12 | **Miscellaneous**
13 | 
14 | 
15 | 0.0.1 (2016-01-01)
16 | ~~~~~~~~~~~~~~~~~~
17 | - First release
-------------------------------------------------------------------------------- /requirements.txt: --------------------------------------------------------------------------------
1 | numpy >= 1.9.0
2 | pandas>= 0.19.0
3 | scikit-learn >= 0.18.0
-------------------------------------------------------------------------------- /setup.py: --------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | 
4 | from __future__ import print_function
5 | from setuptools import setup, find_packages
6 | from datetime import date
7 | import os
8 | 
9 | #--- Define project dependent variable ---
10 | # Your package name
11 | NAME = "hccEncoding"
12 | # Your GitHub user name
13 | GITHUB_USERNAME = "Robin888" # your GitHub account name
14 | 
15 | 
16 | #--- Automatically generate setup parameters ---
17 | try:
18 | SHORT_DESCRIPTION = __import__(NAME).__short_description__ # GitHub Short Description
19 | except:
20 | print("'__short_description__' not found in '%s.__init__.py'!" % NAME)
21 | SHORT_DESCRIPTION = "No short description!"
22 | 
23 | try:
24 | LONG_DESCRIPTION = open("README.rst", "rb").read().decode("utf-8")
25 | except:
26 | LONG_DESCRIPTION = "No long description!"
27 | 
28 | VERSION = __import__(NAME).__version__
29 | AUTHOR = "Ruobing Wang"
30 | AUTHOR_EMAIL = "wangruobing@gmail.com"
31 | MAINTAINER = "Ruobing Wang"
32 | MAINTAINER_EMAIL = "wangruobing@gmail.com"
33 | 
34 | # Include all sub packages in package directory
35 | PACKAGES = [NAME] + ["%s.%s" % (NAME, i) for i in find_packages(NAME)]
36 | # Include everything in package directory
37 | INCLUDE_PACKAGE_DATA = True
38 | PACKAGE_DATA = {
39 | "": ["*.*"],
40 | }
41 | 
42 | # The project directory name is the GitHub repository name
43 | repository_name = os.path.basename(os.getcwd())
44 | # Project Url
45 | URL = "https://github.com/{0}/{1}".format(GITHUB_USERNAME, repository_name)
46 | # Use today's date as the GitHub release tag
47 | github_release_tag = str(date.today())
48 | # Source code download url
49 | DOWNLOAD_URL = "https://github.com/{0}/{1}/tarball/{2}".format(
50 | GITHUB_USERNAME, repository_name, github_release_tag)
51 | 
52 | try:
53 | LICENSE = __import__(NAME).__license__
54 | except:
55 | print("'__license__' not found in '%s.__init__.py'!" 
% NAME)
56 | LICENSE = ""
57 | 
58 | PLATFORMS = ["Windows", "MacOS", "Unix"]
59 | CLASSIFIERS = [
60 | "Development Status :: 4 - Beta",
61 | "Intended Audience :: Developers",
62 | "License :: OSI Approved :: MIT License",
63 | "Natural Language :: English",
64 | "Operating System :: Microsoft :: Windows",
65 | "Operating System :: MacOS",
66 | "Operating System :: Unix",
67 | "Programming Language :: Python",
68 | "Programming Language :: Python :: 2.7",
69 | "Programming Language :: Python :: 3.3",
70 | "Programming Language :: Python :: 3.4",
71 | "Programming Language :: Python :: 3.5",
72 | ]
73 | 
74 | try:
75 | f = open("requirements.txt", "rb")
76 | REQUIRES = [i.strip() for i in f.read().decode("utf-8").split("\n")]
77 | except:
78 | print("'requirements.txt' not found!")
79 | REQUIRES = list()
80 | 
81 | setup(
82 | name=NAME,
83 | description=SHORT_DESCRIPTION,
84 | long_description=LONG_DESCRIPTION,
85 | version=VERSION,
86 | author=AUTHOR,
87 | author_email=AUTHOR_EMAIL,
88 | maintainer=MAINTAINER,
89 | maintainer_email=MAINTAINER_EMAIL,
90 | packages=PACKAGES,
91 | include_package_data=INCLUDE_PACKAGE_DATA,
92 | package_data=PACKAGE_DATA,
93 | url=URL,
94 | download_url=DOWNLOAD_URL,
95 | classifiers=CLASSIFIERS,
96 | platforms=PLATFORMS,
97 | license=LICENSE,
98 | install_requires=REQUIRES,
99 | )
100 | 
101 | """
102 | Appendix
103 | --------
104 | ::
105 | 
106 | Frequently used classifiers list = [
107 | "Development Status :: 1 - Planning",
108 | "Development Status :: 2 - Pre-Alpha",
109 | "Development Status :: 3 - Alpha",
110 | "Development Status :: 4 - Beta",
111 | "Development Status :: 5 - Production/Stable",
112 | "Development Status :: 6 - Mature",
113 | "Development Status :: 7 - Inactive",
114 | 
115 | "Intended Audience :: Customer Service",
116 | "Intended Audience :: Developers",
117 | "Intended Audience :: Education",
118 | "Intended Audience :: End Users/Desktop",
119 | "Intended Audience :: Financial and Insurance Industry",
120 | "Intended Audience :: Healthcare Industry",
121 | "Intended Audience :: Information Technology",
122 | "Intended Audience :: Legal Industry",
123 | "Intended Audience :: Manufacturing",
124 | "Intended Audience :: Other Audience",
125 | "Intended Audience :: Religion",
126 | "Intended Audience :: Science/Research",
127 | "Intended Audience :: System Administrators",
128 | "Intended Audience :: Telecommunications Industry",
129 | 
130 | "License :: OSI Approved :: BSD License",
131 | "License :: OSI Approved :: MIT License",
132 | "License :: OSI Approved :: Apache Software License",
133 | "License :: OSI Approved :: GNU General Public License (GPL)",
134 | "License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)",
135 | 
136 | "Natural Language :: English",
137 | "Natural Language :: Chinese (Simplified)",
138 | 
139 | "Operating System :: Microsoft :: Windows",
140 | "Operating System :: MacOS",
141 | "Operating System :: Unix",
142 | 
143 | "Programming Language :: Python",
144 | "Programming Language :: Python :: 2",
145 | "Programming Language :: Python :: 2.7",
146 | "Programming Language :: Python :: 2 :: Only",
147 | "Programming Language :: Python :: 3",
148 | "Programming Language :: Python :: 3.3",
149 | "Programming Language :: Python :: 3.4",
150 | "Programming Language :: Python :: 3 :: Only",
151 | ]
152 | """
-------------------------------------------------------------------------------- /source/conf.py: --------------------------------------------------------------------------------
1 | # -*- coding: 
utf-8 -*- 2 | # 3 | # hccEncoding documentation build configuration file, created by 4 | # sphinx-quickstart on Mon Apr 03 12:53:54 2017. 5 | # 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | # If extensions (or modules to document with autodoc) are in another directory, 16 | # add these directories to sys.path here. If the directory is relative to the 17 | # documentation root, use os.path.abspath to make it absolute, like shown here. 18 | # 19 | # import os 20 | # import sys 21 | # sys.path.insert(0, os.path.abspath('.')) 22 | 23 | # -- General configuration ------------------------------------------------ 24 | 25 | # If your documentation needs a minimal Sphinx version, state it here. 26 | # 27 | # needs_sphinx = '1.0' 28 | 29 | # Add any Sphinx extension module names here, as strings. They can be 30 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 31 | # ones. 32 | extensions = [] 33 | 34 | # Add any paths that contain templates here, relative to this directory. 35 | templates_path = ['_templates'] 36 | 37 | # The suffix(es) of source filenames. 38 | # You can specify multiple suffix as a list of string: 39 | # 40 | # source_suffix = ['.rst', '.md'] 41 | source_suffix = '.rst' 42 | 43 | # The encoding of source files. 44 | # 45 | # source_encoding = 'utf-8-sig' 46 | 47 | # The master toctree document. 48 | master_doc = 'index' 49 | 50 | # General information about the project. 51 | project = u'hccEncoding' 52 | copyright = u'2017, Ruobing Wang' 53 | author = u'Ruobing Wang' 54 | 55 | # The version info for the project you're documenting, acts as replacement for 56 | # |version| and |release|, also used in various other places throughout the 57 | # built documents. 58 | # 59 | # The short X.Y version. 60 | version = u'0.0.1' 61 | # The full version, including alpha/beta/rc tags. 62 | release = u'0.0.1' 63 | 64 | # The language for content autogenerated by Sphinx. Refer to documentation 65 | # for a list of supported languages. 66 | # 67 | # This is also used if you do content translation via gettext catalogs. 68 | # Usually you set "language" from the command line for these cases. 69 | language = None 70 | 71 | # There are two options for replacing |today|: either, you set today to some 72 | # non-false value, then it is used: 73 | # 74 | # today = '' 75 | # 76 | # Else, today_fmt is used as the format for a strftime call. 77 | # 78 | # today_fmt = '%B %d, %Y' 79 | 80 | # List of patterns, relative to source directory, that match files and 81 | # directories to ignore when looking for source files. 82 | # This patterns also effect to html_static_path and html_extra_path 83 | exclude_patterns = [] 84 | 85 | # The reST default role (used for this markup: `text`) to use for all 86 | # documents. 87 | # 88 | # default_role = None 89 | 90 | # If true, '()' will be appended to :func: etc. cross-reference text. 91 | # 92 | # add_function_parentheses = True 93 | 94 | # If true, the current module name will be prepended to all description 95 | # unit titles (such as .. function::). 96 | # 97 | # add_module_names = True 98 | 99 | # If true, sectionauthor and moduleauthor directives will be shown in the 100 | # output. They are ignored by default. 
101 | # 102 | # show_authors = False 103 | 104 | # The name of the Pygments (syntax highlighting) style to use. 105 | pygments_style = 'sphinx' 106 | 107 | # A list of ignored prefixes for module index sorting. 108 | # modindex_common_prefix = [] 109 | 110 | # If true, keep warnings as "system message" paragraphs in the built documents. 111 | # keep_warnings = False 112 | 113 | # If true, `todo` and `todoList` produce output, else they produce nothing. 114 | todo_include_todos = False 115 | 116 | 117 | # -- Options for HTML output ---------------------------------------------- 118 | 119 | # The theme to use for HTML and HTML Help pages. See the documentation for 120 | # a list of builtin themes. 121 | # 122 | html_theme = 'alabaster' 123 | 124 | # Theme options are theme-specific and customize the look and feel of a theme 125 | # further. For a list of options available for each theme, see the 126 | # documentation. 127 | # 128 | # html_theme_options = {} 129 | 130 | # Add any paths that contain custom themes here, relative to this directory. 131 | # html_theme_path = [] 132 | 133 | # The name for this set of Sphinx documents. 134 | # " v documentation" by default. 135 | # 136 | # html_title = u'hccEncoding v0.0.1' 137 | 138 | # A shorter title for the navigation bar. Default is the same as html_title. 139 | # 140 | # html_short_title = None 141 | 142 | # The name of an image file (relative to this directory) to place at the top 143 | # of the sidebar. 144 | # 145 | # html_logo = None 146 | 147 | # The name of an image file (relative to this directory) to use as a favicon of 148 | # the docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 149 | # pixels large. 150 | # 151 | # html_favicon = None 152 | 153 | # Add any paths that contain custom static files (such as style sheets) here, 154 | # relative to this directory. They are copied after the builtin static files, 155 | # so a file named "default.css" will overwrite the builtin "default.css". 156 | html_static_path = ['_static'] 157 | 158 | # Add any extra paths that contain custom files (such as robots.txt or 159 | # .htaccess) here, relative to this directory. These files are copied 160 | # directly to the root of the documentation. 161 | # 162 | # html_extra_path = [] 163 | 164 | # If not None, a 'Last updated on:' timestamp is inserted at every page 165 | # bottom, using the given strftime format. 166 | # The empty string is equivalent to '%b %d, %Y'. 167 | # 168 | # html_last_updated_fmt = None 169 | 170 | # If true, SmartyPants will be used to convert quotes and dashes to 171 | # typographically correct entities. 172 | # 173 | # html_use_smartypants = True 174 | 175 | # Custom sidebar templates, maps document names to template names. 176 | # 177 | # html_sidebars = {} 178 | 179 | # Additional templates that should be rendered to pages, maps page names to 180 | # template names. 181 | # 182 | # html_additional_pages = {} 183 | 184 | # If false, no module index is generated. 185 | # 186 | # html_domain_indices = True 187 | 188 | # If false, no index is generated. 189 | # 190 | # html_use_index = True 191 | 192 | # If true, the index is split into individual pages for each letter. 193 | # 194 | # html_split_index = False 195 | 196 | # If true, links to the reST sources are added to the pages. 197 | # 198 | # html_show_sourcelink = True 199 | 200 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 201 | # 202 | # html_show_sphinx = True 203 | 204 | # If true, "(C) Copyright ..." 
is shown in the HTML footer. Default is True. 205 | # 206 | # html_show_copyright = True 207 | 208 | # If true, an OpenSearch description file will be output, and all pages will 209 | # contain a tag referring to it. The value of this option must be the 210 | # base URL from which the finished HTML is served. 211 | # 212 | # html_use_opensearch = '' 213 | 214 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 215 | # html_file_suffix = None 216 | 217 | # Language to be used for generating the HTML full-text search index. 218 | # Sphinx supports the following languages: 219 | # 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja' 220 | # 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr', 'zh' 221 | # 222 | # html_search_language = 'en' 223 | 224 | # A dictionary with options for the search language support, empty by default. 225 | # 'ja' uses this config value. 226 | # 'zh' user can custom change `jieba` dictionary path. 227 | # 228 | # html_search_options = {'type': 'default'} 229 | 230 | # The name of a javascript file (relative to the configuration directory) that 231 | # implements a search results scorer. If empty, the default will be used. 232 | # 233 | # html_search_scorer = 'scorer.js' 234 | 235 | # Output file base name for HTML help builder. 236 | htmlhelp_basename = 'hccEncodingdoc' 237 | 238 | # -- Options for LaTeX output --------------------------------------------- 239 | 240 | latex_elements = { 241 | # The paper size ('letterpaper' or 'a4paper'). 242 | # 243 | # 'papersize': 'letterpaper', 244 | 245 | # The font size ('10pt', '11pt' or '12pt'). 246 | # 247 | # 'pointsize': '10pt', 248 | 249 | # Additional stuff for the LaTeX preamble. 250 | # 251 | # 'preamble': '', 252 | 253 | # Latex figure (float) alignment 254 | # 255 | # 'figure_align': 'htbp', 256 | } 257 | 258 | # Grouping the document tree into LaTeX files. List of tuples 259 | # (source start file, target name, title, 260 | # author, documentclass [howto, manual, or own class]). 261 | latex_documents = [ 262 | (master_doc, 'hccEncoding.tex', u'hccEncoding Documentation', 263 | u'Ruobing Wang', 'manual'), 264 | ] 265 | 266 | # The name of an image file (relative to this directory) to place at the top of 267 | # the title page. 268 | # 269 | # latex_logo = None 270 | 271 | # For "manual" documents, if this is true, then toplevel headings are parts, 272 | # not chapters. 273 | # 274 | # latex_use_parts = False 275 | 276 | # If true, show page references after internal links. 277 | # 278 | # latex_show_pagerefs = False 279 | 280 | # If true, show URL addresses after external links. 281 | # 282 | # latex_show_urls = False 283 | 284 | # Documents to append as an appendix to all manuals. 285 | # 286 | # latex_appendices = [] 287 | 288 | # It false, will not define \strong, \code, itleref, \crossref ... but only 289 | # \sphinxstrong, ..., \sphinxtitleref, ... To help avoid clash with user added 290 | # packages. 291 | # 292 | # latex_keep_old_macro_names = True 293 | 294 | # If false, no module index is generated. 295 | # 296 | # latex_domain_indices = True 297 | 298 | 299 | # -- Options for manual page output --------------------------------------- 300 | 301 | # One entry per manual page. List of tuples 302 | # (source start file, name, description, authors, manual section). 303 | man_pages = [ 304 | (master_doc, 'hccencoding', u'hccEncoding Documentation', 305 | [author], 1) 306 | ] 307 | 308 | # If true, show URL addresses after external links. 
309 | #
310 | # man_show_urls = False
311 | 
312 | 
313 | # -- Options for Texinfo output -------------------------------------------
314 | 
315 | # Grouping the document tree into Texinfo files. List of tuples
316 | # (source start file, target name, title, author,
317 | # dir menu entry, description, category)
318 | texinfo_documents = [
319 | (master_doc, 'hccEncoding', u'hccEncoding Documentation',
320 | author, 'hccEncoding', 'One line description of project.',
321 | 'Miscellaneous'),
322 | ]
323 | 
324 | # Documents to append as an appendix to all manuals.
325 | #
326 | # texinfo_appendices = []
327 | 
328 | # If false, no module index is generated.
329 | #
330 | # texinfo_domain_indices = True
331 | 
332 | # How to display URL addresses: 'footnote', 'no', or 'inline'.
333 | #
334 | # texinfo_show_urls = 'footnote'
335 | 
336 | # If true, do not generate a @detailmenu in the "Top" node's menu.
337 | #
338 | # texinfo_no_detailmenu = False
339 | 
-------------------------------------------------------------------------------- /source/index.rst: --------------------------------------------------------------------------------
1 | .. hcc-encoding documentation master file, created by
2 | sphinx-quickstart on Sun Apr 02 21:31:44 2017.
3 | You can adapt this file completely to your liking, but it should at least
4 | contain the root `toctree` directive.
5 | 
6 | .. image:: https://travis-ci.org/Robin888/hccEncoding-project.svg?branch=master
7 | 
8 | .. image:: https://img.shields.io/pypi/v/hccEncoding.svg
9 | 
10 | .. image:: https://img.shields.io/pypi/l/hccEncoding.svg
11 | 
12 | .. image:: https://img.shields.io/pypi/pyversions/hccEncoding.svg
13 | 
14 | 
15 | 
16 | 
17 | Welcome to hccEncoding's documentation!
18 | ========================================
19 | 
20 | Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes.
21 | 
22 | This Python package implements preprocessing strategies for high-cardinality categorical data that allow this class of attributes to be used in predictive models. Currently there are two major methods, which are based on Daniele Micci-Barreca's empirical Bayes method [ref1] and Owen Zhang's leave-one-out encoding [ref2].
23 | 
24 | .. toctree::
25 | :maxdepth: 2
26 | 
27 | Functions
28 | ------------------
29 | * BayesEncoding(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False)
30 | * BayesEncodingKfold(train,test,target,feature,k=5,f=1,noise=0.01,drop_origin_feature=False,fold=5)
31 | * LOOEncoding(train,test,target,feature,noise=0.01,drop_origin_feature=False)
32 | * LOOEncodingKfold(train,test,target,feature,noise=0.01,drop_origin_feature=False,fold=5)
33 | 
34 | **Please see the example for a detailed explanation**
35 | - `Example `_
36 | 
37 | 
38 | General Parameters
39 | ------------------
40 | * train
41 | - train dataset, datatype: pandas dataframe
42 | * test
43 | - test dataset, datatype: pandas dataframe
44 | * target
45 | - name of the target to predict, datatype: string
46 | * feature
47 | - name of the feature to be encoded, datatype: string
48 | * k [default=5]
49 | - parameter for BayesEncoding and BayesEncodingKfold; the sample size at which the cell's posterior probability and the prior probability receive equal weight, i.e. the midpoint of the transition from prior to posterior, datatype: int
50 | * f [default=1]
51 | - parameter for BayesEncoding and BayesEncodingKfold; controls how quickly the weight shifts from the prior to the posterior as the size of the group increases (see the worked example below), datatype: int
52 | * noise [default=0.01]
53 | - multiplicative noise added after encoding. For classification problems, uniform noise drawn from [1-noise, 1+noise] is applied; for regression problems, normal noise drawn from N(1, noise) is applied, datatype: double
54 | * drop_origin_feature [default=False]
55 | - whether to drop the original feature, datatype: boolean
56 | * fold [default=5]
57 | - parameter for LOOEncodingKfold and BayesEncodingKfold; the number of folds the train dataset will be split into, datatype: int
58 | 
59 | 
60 | References
61 | ==================
62 | * ref1: Daniele Micci-Barreca. 2001. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. SIGKDD Explor. Newsl. 3, 1 (July 2001), 27-32.
63 | * ref2: `https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions `_
64 | 
65 | Indices and tables
66 | ==================
67 | 
68 | * :ref:`genindex`
69 | * :ref:`modindex`
70 | * :ref:`search`
71 | 
72 | 
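73 | Worked Example: the k and f Parameters
74 | ======================================
75 | 
76 | A quick numerical sketch (illustrative only) of the smoothing weight ``B = 1/(1+exp(-(n-k)/f))`` used by BayesEncoding, where ``n`` is the number of rows observed for a category:
77 | 
78 | .. code-block:: python
79 | 
80 | 	import numpy as np
81 | 
82 | 	k, f = 5, 1  # the default values
83 | 	for n in [1, 5, 10]:
84 | 	    B = 1 / (1 + np.exp(-(n - k) / f))
85 | 	    # n=1 -> B~0.02 (mostly prior), n=5 -> B=0.5, n=10 -> B~0.99 (mostly posterior)
86 | 	    print(n, B)
87 | 
88 | The encoded value is ``B * posterior + (1 - B) * prior``, so rare categories shrink toward the global prior while frequent categories keep their own target statistics.
89 | 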
-------------------------------------------------------------------------------- /view-doc.bat: --------------------------------------------------------------------------------
1 | pushd "%~dp0"
2 | cd build
3 | cd html
4 | index.html
--------------------------------------------------------------------------------