├── pythonSGD
│   ├── submissionPython22Sep2014_pm.csv
│   ├── test.py
│   ├── esting.txt
│   ├── SGD_py.v11.suo
│   ├── SGD_py.sln
│   ├── logReg_test.py
│   ├── SGD_py.pyproj
│   ├── logReg_click.py
│   ├── py_lh_22Sep2014_2.py
│   ├── logReg.py
│   ├── py_lh4_22Sep2014.py
│   └── py_lh_20Sep2014.py
├── combineDatasets.R
├── .DS_Store
├── tests
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_lr_model.py
│   └── test_vw_converter.py
├── esting.txt
├── test.py
├── forum
│   ├── model
│   │   └── click.model.vw
│   ├── click.model.final2.vw
│   ├── vm_to_kaggle.py
│   ├── vm_to_kaggle.py~
│   ├── vm_command~
│   ├── vm_command
│   └── csv_to_vm.py
├── vowpal wabbit
│   ├── model
│   │   └── click.model.vw
│   ├── click.model.final2.vw
│   ├── Last model_23Sep2014.txt
│   ├── vm_to_kaggle.py
│   ├── vm_to_kaggle.py~
│   ├── vm_command~
│   ├── vm_command
│   ├── ending solution.txt
│   └── csv_to_vm.py
├── requirements-r.txt
├── requirements.txt
├── README.md
├── pytest.ini
├── LICENSE
├── logReg_test.py
├── r_sdg.R
├── .gitignore
├── SDG_21_Sep_2014.R
├── kaggle.py
├── testSet.txt
├── gbm.R
├── py_lh.py
├── py_lh2.py
├── py_lh3.py
├── py_lh4.py
├── logReg_click.py
├── logReg.py
├── py_lh_20Sep2014.py
├── config.yaml
├── .github
│   └── workflows
│       └── ci.yml
├── gbm_modernized.R
├── csv_to_vw_modernized.py
├── scripts
│   └── download_data.py
├── py_lh4_modernized.py
├── logReg_modernized.py
└── README_NEW.md
/pythonSGD/submissionPython22Sep2014_pm.csv:
--------------------------------------------------------------------------------
1 | Id,Predicted
2 |
--------------------------------------------------------------------------------
/combineDatasets.R:
--------------------------------------------------------------------------------
1 | setwd("I:\\data")
2 | train_int <- read.csv('int/train_num.csv')
--------------------------------------------------------------------------------
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/.DS_Store
--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 | Test suite for CTR Prediction project.
3 | """
4 |
5 | __version__ = "2.0"
6 |
--------------------------------------------------------------------------------
/esting.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/esting.txt
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 | alpha = opts['alpha']
--------------------------------------------------------------------------------
/pythonSGD/test.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 | alpha = opts['alpha']
--------------------------------------------------------------------------------
/pythonSGD/esting.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/pythonSGD/esting.txt
--------------------------------------------------------------------------------
/pythonSGD/SGD_py.v11.suo:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/pythonSGD/SGD_py.v11.suo
--------------------------------------------------------------------------------
/forum/model/click.model.vw:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/forum/model/click.model.vw
--------------------------------------------------------------------------------
/forum/click.model.final2.vw:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/forum/click.model.final2.vw
--------------------------------------------------------------------------------
/vowpal wabbit/model/click.model.vw:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/vowpal wabbit/model/click.model.vw
--------------------------------------------------------------------------------
/vowpal wabbit/click.model.final2.vw:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/vowpal wabbit/click.model.final2.vw
--------------------------------------------------------------------------------
/requirements-r.txt:
--------------------------------------------------------------------------------
1 | # R Package Dependencies
2 | # Install with: Rscript -e "install.packages(c('data.table', 'caret', 'gbm'))"
3 |
4 | data.table>=1.14.0
5 | caret>=6.0-90
6 | gbm>=2.1.8
7 |
--------------------------------------------------------------------------------
/vowpal wabbit/Last model_23Sep2014.txt:
--------------------------------------------------------------------------------
1 | vw click.train.vw -f click.model.vw --bfgs --passes 20 --holdout_after 40000000 -b 26 --cache_file click.train.vw.cache -l 0.145 --holdout_period 10
2 |
3 | vw click.train.vw -f click.model.vw -q:: --holdout_period 5 --noconstant --hash all --loss_function logistic -b 28 --save_per_pass --bfgs --termination 0.001 --passes 10 -l 0.1 --cache_file click.train.vw.cache
4 |
5 |
6 | --feature_mask
--------------------------------------------------------------------------------
/forum/vm_to_kaggle.py:
--------------------------------------------------------------------------------
1 | import math
2 |
3 | def zygmoid(x):
4 | #I know it's a common Sigmoid feature, but that's why I probably found
5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py
6 | return 1 / (1 + math.exp(-x))
7 |
8 | with open("kaggle.click.submission.csv","wb") as outfile:
9 | outfile.write("Id,Predicted\n")
10 | for line in open("click.preds3.txt"):
11 | row = line.strip().split(" ")
12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0]))))
13 |
--------------------------------------------------------------------------------
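The converter above is Python 2 code (it opens the output file in "wb" mode and writes str objects). A minimal Python 3 sketch of the same conversion, assuming the same file names and that click.preds3.txt holds one "score id" pair per line as produced by vw -p when the examples carry 'id tags:

import math

def sigmoid(x):
    # map a raw VW score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

with open("kaggle.click.submission.csv", "w") as outfile:
    outfile.write("Id,Predicted\n")
    with open("click.preds3.txt") as preds:
        for line in preds:
            score, row_id = line.strip().split(" ")
            outfile.write("%s,%f\n" % (row_id, sigmoid(float(score))))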
/forum/vm_to_kaggle.py~:
--------------------------------------------------------------------------------
1 | import math
2 |
3 | def zygmoid(x):
4 | #I know it's a common Sigmoid feature, but that's why I probably found
5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py
6 | return 1 / (1 + math.exp(-x))
7 |
8 | with open("kaggle.click.submission.csv","wb") as outfile:
9 | outfile.write("Id,Predicted\n")
10 | for line in open("click.preds2.txt"):
11 | row = line.strip().split(" ")
12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0]))))
13 |
--------------------------------------------------------------------------------
/vowpal wabbit/vm_to_kaggle.py:
--------------------------------------------------------------------------------
1 | import math
2 |
3 | def zygmoid(x):
4 | #I know it's a common Sigmoid feature, but that's why I probably found
5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py
6 | return 1 / (1 + math.exp(-x))
7 |
8 | with open("kaggle.click.submission.csv","wb") as outfile:
9 | outfile.write("Id,Predicted\n")
10 | for line in open("click.preds3.txt"):
11 | row = line.strip().split(" ")
12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0]))))
13 |
--------------------------------------------------------------------------------
/vowpal wabbit/vm_to_kaggle.py~:
--------------------------------------------------------------------------------
1 | import math
2 |
3 | def zygmoid(x):
4 | #I know it's a common Sigmoid feature, but that's why I probably found
5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py
6 | return 1 / (1 + math.exp(-x))
7 |
8 | with open("kaggle.click.submission.csv","wb") as outfile:
9 | outfile.write("Id,Predicted\n")
10 | for line in open("click.preds2.txt"):
11 | row = line.strip().split(" ")
12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0]))))
13 |
--------------------------------------------------------------------------------
/forum/vm_command~:
--------------------------------------------------------------------------------
1 | Training VW:
2 | ./vw click.train.vw -f click.model.vw --loss_function logistic
3 |
4 |
5 | Testing VW:
6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt
7 |
8 |
9 | Training VW2:
10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic
11 |
12 |
13 | Training VW3:
14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw --loss_function logistic
15 |
16 | parameters:
17 | -b bits
18 | -l rate
19 | --power_t p
20 |
21 |
--------------------------------------------------------------------------------
/vowpal wabbit/vm_command~:
--------------------------------------------------------------------------------
1 | Training VW:
2 | ./vw click.train.vw -f click.model.vw --loss_function logistic
3 |
4 |
5 | Testing VW:
6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt
7 |
8 |
9 | Training VW2:
10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic
11 |
12 |
13 | Training VW3:
14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw --loss_function logistic
15 |
16 | parameters:
17 | -b bits
18 | -l rate
19 | --power_t p
20 |
21 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | # Core dependencies
2 | numpy>=1.24.0,<2.0.0
3 | matplotlib>=3.7.0,<4.0.0
4 | pandas>=2.0.0,<3.0.0
5 | scipy>=1.10.0,<2.0.0
6 |
7 | # Configuration management
8 | pyyaml>=6.0,<7.0
9 |
10 | # Development dependencies
11 | pytest>=7.4.0,<8.0.0
12 | pytest-cov>=4.1.0,<5.0.0
13 | black>=23.0.0,<24.0.0
14 | flake8>=6.0.0,<7.0.0
15 | pylint>=2.17.0,<3.0.0
16 |
17 | # Optional: Advanced ML libraries
18 | # scikit-learn>=1.3.0,<2.0.0
19 | # xgboost>=2.0.0,<3.0.0
20 |
21 | # Optional: Vowpal Wabbit Python bindings
22 | # vowpalwabbit>=9.0.0,<10.0.0
23 |
--------------------------------------------------------------------------------
/forum/vm_command:
--------------------------------------------------------------------------------
1 | Training VW:
2 | ./vw click.train.vw -f click.model.vw --loss_function logistic
3 |
4 |
5 | Testing VW:
6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt
7 |
8 |
9 | Training VW2:
10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic
11 |
12 |
13 | Training VW3:
14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw --loss_function logistic
15 |
16 | parameters:
17 | -b bits
18 | -l rate
19 | --power_t p
20 | --passes
21 | -c
22 | --holdout_off
23 |
--------------------------------------------------------------------------------
/vowpal wabbit/vm_command:
--------------------------------------------------------------------------------
1 | Training VW:
2 | ./vw click.train.vw -f click.model.vw --loss_function logistic
3 |
4 |
5 | Testing VW:
6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt
7 |
8 |
9 | Training VW2:
10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic
11 |
12 |
13 | Training VW3:
14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw --loss_function logistic
15 |
16 | parameters:
17 | -b bits
18 | -l rate
19 | --power_t p
20 | --passes
21 | -c
22 | --holdout_off
23 |
--------------------------------------------------------------------------------
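For reference, a minimal Python sketch of driving the same training and testing commands from a script (illustrative only; it assumes the vw binary is on PATH and the click.train.vw / click.test.vw files listed above already exist):

import subprocess

# train: 28-bit feature space, learning rate 10, cached passes, logistic loss (the "Training VW2" recipe above)
subprocess.run(
    ["vw", "click.train.vw", "-b", "28", "-l", "10", "-c", "--passes", "25",
     "--holdout_off", "-f", "click.model.vw", "--loss_function", "logistic"],
    check=True,
)

# test: load the saved model (-i), disable learning (-t), write raw predictions to click.preds.txt
subprocess.run(
    ["vw", "click.test.vw", "-t", "-i", "click.model.vw", "-p", "click.preds.txt"],
    check=True,
)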
/pythonSGD/SGD_py.sln:
--------------------------------------------------------------------------------
1 |
2 | Microsoft Visual Studio Solution File, Format Version 12.00
3 | # Visual Studio 2012
4 | Project("{888888A0-9F3D-457C-B088-3A5042F75D52}") = "SGD_py", "SGD_py.pyproj", "{82875642-D6FA-4F5C-81E6-B89A93C1F5FF}"
5 | EndProject
6 | Global
7 | GlobalSection(SolutionConfigurationPlatforms) = preSolution
8 | Debug|Any CPU = Debug|Any CPU
9 | Release|Any CPU = Release|Any CPU
10 | EndGlobalSection
11 | GlobalSection(ProjectConfigurationPlatforms) = postSolution
12 | {82875642-D6FA-4F5C-81E6-B89A93C1F5FF}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
13 | {82875642-D6FA-4F5C-81E6-B89A93C1F5FF}.Release|Any CPU.ActiveCfg = Release|Any CPU
14 | EndGlobalSection
15 | GlobalSection(SolutionProperties) = preSolution
16 | HideSolutionNode = FALSE
17 | EndGlobalSection
18 | EndGlobal
19 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Predict-click-through-rates-on-display-ads
2 | ==========================================
3 |
4 | Display advertising is a billion dollar effort and one of the central uses of machine learning on the Internet. However, its data and methods are usually kept under lock and key. In this research competition, CriteoLabs is sharing a week’s worth of data for you to develop models predicting ad click-through rate (CTR). Given a user and the page he is visiting, what is the probability that he will click on a given ad? The goal of this challenge is to benchmark the most accurate ML algorithms for CTR estimation. All winning models will be released under an open source license. As a participant, you are given a chance to access the traffic logs from Criteo that include various undisclosed features along with the click labels.
5 |
--------------------------------------------------------------------------------
/vowpal wabbit/ending solution.txt:
--------------------------------------------------------------------------------
1 | vw train_nw.vw -f data/model.vw --loss_function logistic -b 25 -l .15 -c --passes 5 -q cc -q ii -q ci --holdout_off --cubic iii --decay_learning_rate .8
2 |
3 | vw --holdout_off --cache_file data/train_cat_int.cache --loss_function logistic -b 29 --passes 6 -l 0.01 --nn 60 --power_t 0 -f data/nn60_l001_p6.mod
4 |
5 | vw -d train.vw -c -b 28 --link=logistic --loss_function logistic --passes 2 --holdout_off --ngram c3 --ngram n2 --skips n1 --ngram f2 --skips f1 --l2 7.12091e-09 -l 0.240971683207491 --initial_t 1.53478225382649 --decay_learning_rate 0.267332
6 |
7 | vw click4.TT.train.vw -k -c -f click.neu13.model.vw --loss_function logistic --passes 20 -l 0.15 -b 25 --nn 35 --holdout_period 50 --early_terminate 1
8 |
9 | ./vowpalwabbit/vw ../xtrain4.vw -c -k -l 0.1 -b 29 --loss_function logistic -q cc -q ii -q ci --holdout_off -f xmodel.vw
--------------------------------------------------------------------------------
/pytest.ini:
--------------------------------------------------------------------------------
1 | [pytest]
2 | # Pytest configuration for CTR Prediction project
3 |
4 | # Test discovery patterns
5 | python_files = test_*.py
6 | python_classes = Test*
7 | python_functions = test_*
8 |
9 | # Test paths
10 | testpaths = tests
11 |
12 | # Options
13 | addopts =
14 | --verbose
15 | --strict-markers
16 | --tb=short
17 | --disable-warnings
18 | # Coverage options (uncomment when ready)
19 | # --cov=.
20 | # --cov-report=html
21 | # --cov-report=term-missing
22 | # --cov-fail-under=80
23 |
24 | # Markers
25 | markers =
26 | slow: marks tests as slow (deselect with '-m "not slow"')
27 | integration: marks tests as integration tests
28 | unit: marks tests as unit tests
29 | requires_data: marks tests that require actual data files
30 |
31 | # Logging
32 | log_cli = true
33 | log_cli_level = INFO
34 | log_cli_format = %(asctime)s [%(levelname)s] %(message)s
35 | log_cli_date_format = %Y-%m-%d %H:%M:%S
36 |
37 | # Ignore patterns
38 | norecursedirs = .git .tox dist build *.egg venv env
39 |
40 | # Timeout (in seconds) for individual tests
41 | # timeout = 300
42 |
--------------------------------------------------------------------------------
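With this configuration in place, the markers declared above can be used to select subsets of the suite; for example (pytest from requirements.txt, plus pytest-cov if the commented coverage options are enabled):

    pytest                      # full suite under tests/, verbose, short tracebacks
    pytest -m "not slow"        # everything except tests marked slow
    pytest -m unit              # only tests marked unit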
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Tianxiang Liu
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/logReg_test.py:
--------------------------------------------------------------------------------
1 | from logReg import *  # assumed: trainLogRegres, testLogRegres and showLogRegres live in the neighbouring logReg.py
2 | from numpy import *
3 | import matplotlib.pyplot as plt
4 | import time
5 |
6 | def loadData():
7 | train_x = []
8 | train_y = []
9 | fileIn = open('E:/Python/Machine Learning in Action/testSet.txt')
10 | for line in fileIn.readlines():
11 | lineArr = line.strip().split()
12 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])])
13 | train_y.append(float(lineArr[2]))
14 | return mat(train_x), mat(train_y).transpose()
15 |
16 |
17 | ## step 1: load data
18 | print("step 1: load data...")
19 | train_x, train_y = loadData()
20 | test_x = train_x; test_y = train_y
21 | 
22 | ## step 2: training...
23 | print("step 2: training...")
24 | opts = {'alpha': 0.01, 'maxIter': 20, 'optimizeType': 'smoothStocGradDescent'}
25 | optimalWeights = trainLogRegres(train_x, train_y, opts)
26 | 
27 | ## step 3: testing
28 | print("step 3: testing...")
29 | accuracy = testLogRegres(optimalWeights, test_x, test_y)
30 | 
31 | ## step 4: show the result
32 | print("step 4: show the result...")
33 | print('The classify accuracy is: %.3f%%' % (accuracy * 100))
34 | showLogRegres(optimalWeights, train_x, train_y)
--------------------------------------------------------------------------------
/pythonSGD/logReg_test.py:
--------------------------------------------------------------------------------
1 | from logReg import *  # assumed: trainLogRegres, testLogRegres and showLogRegres live in the neighbouring logReg.py
2 | from numpy import *
3 | import matplotlib.pyplot as plt
4 | import time
5 |
6 | def loadData():
7 | train_x = []
8 | train_y = []
9 | fileIn = open('E:/Python/Machine Learning in Action/testSet.txt')
10 | for line in fileIn.readlines():
11 | lineArr = line.strip().split()
12 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])])
13 | train_y.append(float(lineArr[2]))
14 | return mat(train_x), mat(train_y).transpose()
15 |
16 |
17 | ## step 1: load data
18 | print("step 1: load data...")
19 | train_x, train_y = loadData()
20 | test_x = train_x; test_y = train_y
21 | 
22 | ## step 2: training...
23 | print("step 2: training...")
24 | opts = {'alpha': 0.01, 'maxIter': 20, 'optimizeType': 'smoothStocGradDescent'}
25 | optimalWeights = trainLogRegres(train_x, train_y, opts)
26 | 
27 | ## step 3: testing
28 | print("step 3: testing...")
29 | accuracy = testLogRegres(optimalWeights, test_x, test_y)
30 | 
31 | ## step 4: show the result
32 | print("step 4: show the result...")
33 | print('The classify accuracy is: %.3f%%' % (accuracy * 100))
34 | showLogRegres(optimalWeights, train_x, train_y)
--------------------------------------------------------------------------------
/tests/conftest.py:
--------------------------------------------------------------------------------
1 | """
2 | Pytest configuration and fixtures.
3 | """
4 |
5 | import pytest
6 | import numpy as np
7 | from pathlib import Path
8 | import tempfile
9 | import csv
10 |
11 |
12 | @pytest.fixture
13 | def sample_csv_data():
14 | """Generate sample CSV data for testing."""
15 | return [
16 | {'Id': '1', 'Label': '0', 'I1': '5', 'I2': '10', 'C1': 'abc123', 'C2': 'def456'},
17 | {'Id': '2', 'Label': '1', 'I1': '3', 'I2': '', 'C1': 'abc123', 'C2': 'xyz789'},
18 | {'Id': '3', 'Label': '0', 'I1': '', 'I2': '20', 'C1': 'ghi012', 'C2': 'def456'},
19 | ]
20 |
21 |
22 | @pytest.fixture
23 | def temp_csv_file(sample_csv_data, tmp_path):
24 | """Create a temporary CSV file with sample data."""
25 | csv_file = tmp_path / "test_data.csv"
26 |
27 | fieldnames = ['Id', 'Label', 'I1', 'I2', 'C1', 'C2']
28 |
29 | with open(csv_file, 'w', newline='') as f:
30 | writer = csv.DictWriter(f, fieldnames=fieldnames)
31 | writer.writeheader()
32 | writer.writerows(sample_csv_data)
33 |
34 | return csv_file
35 |
36 |
37 | @pytest.fixture
38 | def sample_train_data():
39 | """Generate sample training data (X, y)."""
40 | np.random.seed(42)
41 | X = np.random.randn(100, 5)
42 | y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)
43 | return X, y
44 |
45 |
46 | @pytest.fixture
47 | def temp_dir():
48 | """Create a temporary directory for testing."""
49 | with tempfile.TemporaryDirectory() as tmpdir:
50 | yield Path(tmpdir)
51 |
--------------------------------------------------------------------------------
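A minimal sketch of how these fixtures might be consumed (illustrative only; the actual suites live in tests/test_lr_model.py and tests/test_vw_converter.py, whose contents are not reproduced here):

import csv
import numpy as np


def test_sample_train_data_shapes(sample_train_data):
    # the fixture returns 100 samples with 5 features and a binary {0, 1} target
    X, y = sample_train_data
    assert X.shape == (100, 5)
    assert y.shape == (100, 1)
    assert set(np.unique(y)) <= {0.0, 1.0}


def test_temp_csv_file_roundtrip(temp_csv_file, sample_csv_data):
    # the temporary CSV should contain exactly the rows the data fixture generated
    with open(temp_csv_file, newline="") as f:
        rows = list(csv.DictReader(f))
    assert rows == sample_csv_data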
/r_sdg.R:
--------------------------------------------------------------------------------
1 | setwd('/Users/ivan/Work_directory/Predict-click-through-rates-on-display-ads/')
2 | train <- 'testSet.txt'
3 | test <- 'test.csv'
4 | D <- 2^20
5 | alpha <- .1
6 | w <- rep.int(0, D)
7 | n <- rep.int(0, D)
8 | loss <- 0.
9 | col_num <- 3
10 |
11 | # test logloss of predictions and true values
12 | logloss <- function(p,y) {
13 | p <- max(min(p, 1 - 10^-13), 10^-13)
14 | res <- ifelse(y==1, -log(p), -log(1 - p))
15 | res
16 | }
17 |
18 | # extract one record from database
19 | get_data <- function(row, data){
20 | r <- read.table(data, skip=row-1, nrows=1, sep='\t', row.names=row)
21 | r
22 | }
23 |
24 | # get possibilities of records
25 | get_p <- function(x, w){
26 | wTx <- 0
27 | for (i2 in x) {
28 | wTx <- wTx + w[i2] * 1.}
29 | p <- 1/(1 + exp(-max(min(wTx, 20), -20)))
30 | p
31 | }
32 |
33 | # update the weights according to results
34 | update_w <- function(w, n, x, p, y){
35 |   for (i in x){
36 |     w[i] <- w[i] - ((p - y) * alpha / (sqrt(n[i]) + 1))
37 |     n[i] <- n[i] + 1
38 |   }
39 |   # return both the updated weights and the updated counts
40 |   list(w = w, n = n)
41 | }
42 |
43 | # main steps for modeling (get_x below must already be defined when this loop runs)
44 | for (i in 1:46000000){
45 |   row <- get_data(i, train)
46 |   y <- row[2]
47 |   x <- get_x(row[-c(1,2)], D)  # drop Id and Label, hash the remaining features
48 |   p <- get_p(x, w)
49 |   loss <- loss + logloss(p,y)
50 |   if (i %% 1000000 == 0){
51 |     print(paste0('logloss: ', loss/i, '. rows processed: ', i))
52 |   }
53 |   updates <- update_w(w, n, x, p, y)
54 |   w <- updates$w
55 |   n <- updates$n
56 | }
57 |
58 |
59 | get_x <- function(row, D){
60 |   x <- c(1)  # index 1 is the bias term (R vectors are 1-based)
61 |   for (j in row){
62 |     x <- c(x, as.integer(j) %% D + 1)
63 |   }
64 |   x
65 | }
66 |
67 |
68 |
69 |
70 |
71 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Python
2 | *.py[cod]
3 | *$py.class
4 | *.so
5 | .Python
6 | build/
7 | develop-eggs/
8 | dist/
9 | downloads/
10 | eggs/
11 | .eggs/
12 | lib/
13 | lib64/
14 | parts/
15 | sdist/
16 | var/
17 | wheels/
18 | share/python-wheels/
19 | *.egg-info/
20 | .installed.cfg
21 | *.egg
22 | MANIFEST
23 | __pycache__/
24 | *.pyc
25 |
26 | # Virtual environments
27 | venv/
28 | env/
29 | ENV/
30 | env.bak/
31 | venv.bak/
32 | .venv/
33 |
34 | # IDE
35 | .vscode/
36 | .idea/
37 | *.swp
38 | *.swo
39 | *~
40 | .DS_Store
41 | *.suo
42 | *.user
43 | *.userosscache
44 | *.sln.docstates
45 |
46 | # Visual Studio
47 | *.pyproj
48 | *.sln
49 | *.suo
50 | *.user
51 | *.userosscache
52 | .vs/
53 |
54 | # Jupyter Notebook
55 | .ipynb_checkpoints
56 | *.ipynb
57 |
58 | # R
59 | .Rhistory
60 | .Rapp.history
61 | .RData
62 | .Ruserdata
63 | *.Rproj
64 | .Rproj.user/
65 |
66 | # Data files (large datasets)
67 | *.csv
68 | !**/sample*.csv
69 | *.tsv
70 | *.dat
71 | *.txt
72 | !requirements.txt
73 | !requirements-r.txt
74 | !README.txt
75 | !LICENSE.txt
76 |
77 | # Model files (can be large)
78 | *.model
79 | *.vw
80 | *.pkl
81 | *.h5
82 | *.joblib
83 |
84 | # Logs
85 | *.log
86 | logs/
87 |
88 | # Output/Submission files
89 | submission*.csv
90 | *submit*.csv
91 | output/
92 | results/
93 |
94 | # Temporary files
95 | tmp/
96 | temp/
97 | *.tmp
98 | *.bak
99 | *~
100 | *.cache
101 |
102 | # OS
103 | .DS_Store
104 | Thumbs.db
105 | ehthumbs.db
106 |
107 | # Backup files
108 | *.orig
109 | *~
110 |
111 | # Claude settings (keep local only)
112 | .claude/settings.local.json
113 |
--------------------------------------------------------------------------------
/SDG_21_Sep_2014.R:
--------------------------------------------------------------------------------
1 | setwd("C:\\Users\\Ivan.Liuyanfeng\\Desktop\\Data_Mining_Work_Space\\Predict-click-through-rates-on-display-ads\\local")
2 | # basic parameters
3 | con <- file('train.csv','r')
4 | D <- 2^27
5 | alpha <- .145
6 |
7 | # logit loss calculation
8 | logloss <- function(p,y){
9 | epsilon <- 10 ^ -15
10 | p <- max(min(p, 1-epsilon), epsilon)
11 | ll <- y*log(p) + (1-y)*log((1-p))
12 | ll <- ll * -1/1
13 | ll
14 | }
15 |
16 | # prediction
17 | get_p<- function(x,w){
18 | wTx <- 0
19 | for (i in 1:length(x)){
20 | wTx <- wTx + w[i] * 1
21 | }
22 |     sigmoid <- 1/(1+exp(-max(min(wTx, 20), -20)))
23 | sigmoid
24 | }
25 |
26 | # update weights
27 | update_w <-function (w, n, x, p, y){
28 | for (i in 1:length(x)){
29 | lr <- alpha / (sqrt(n[i])+1)
30 | gradient <- (p-y)
31 | w[i] <- w[i] - gradient * lr
32 | n[i] <- n[i] + 1
33 | }
34 |     list(w = w, n = n)
35 | }
36 |
37 | # basic parameters: one weight / counter per raw feature column read below (columns 3:40)
38 | # w <- rep(0, D)
39 | w <- rep(1, 38)
40 | # n <- rep(0, D)
41 | n <- rep(0, 38)
42 | loss <- 0
43 | ##################start modeling - parameters setup ########################
44 | train_label <- readLines(con, n=1)
45 | for (i in 1:10) {
46 | row <- readLines(con,n=1)
47 | train_row <- strsplit(row, ',')
48 | y <- as.integer(train_row[[1]][2])
49 | x <- c()
50 | for (k in 3:40){
51 | x <- c(x, train_row[[1]][k])
52 | }
53 | ################# Modeling #################################################
54 | p <- get_p(x,w)
55 | loss <- loss + logloss(p,y)
56 | if (i %% 100000 == 0){
57 | print(loss)
58 | }
59 | upd <- update_w(w,n,x,p,y)
60 |     w <- upd$w
61 |     n <- upd$n
62 | print(loss)
63 | break
64 | }
65 |
--------------------------------------------------------------------------------
/pythonSGD/SGD_py.pyproj:
--------------------------------------------------------------------------------
[Visual Studio Python Tools project file; the XML markup was stripped when this dump was generated. Recoverable values: default configuration Debug, schema version 2.0, project GUID {82875642-d6fa-4f5c-81e6-b89a93c1f5ff}, startup file logReg.py, project type GUID {888888a0-9f3d-457c-b088-3a5042f75d52}, launcher "Standard Python launcher", ToolsVersion 10.0, and an import of $(MSBuildExtensionsPath32)\Microsoft\VisualStudio\v$(VisualStudioVersion)\Python Tools\Microsoft.PythonTools.targets.]
--------------------------------------------------------------------------------
/kaggle.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 |
3 | def loadDataSet():
4 | dataMat = []; labelMat = []
5 | fr = open('train.csv')
6 | for line in fr.readlines():
7 | singleArr=[1.0]
8 | lineArr = line.strip().split()
9 | for i in range(39):
10 | singleArr.append(float(lineArr[i+2]))
11 | dataMat.append(singleArr)
12 | labelMat.append(int(lineArr[1]))
13 | return dataMat,labelMat
14 |
15 | def sigmoid(inX):
16 | return 1.0/(1+exp(-inX))
17 |
18 | def gradAscent(dataMatIn, classLabels):
19 | dataMatrix = mat(dataMatIn) #convert to NumPy matrix
20 | labelMat = mat(classLabels).transpose() #convert to NumPy matrix
21 | m,n = shape(dataMatrix)
22 | alpha = 0.001
23 | maxCycles = 500
24 | weights = ones((n,1))
25 | for k in range(maxCycles): #heavy on matrix operations
26 | h = sigmoid(dataMatrix*weights) #matrix mult
27 | error = (labelMat - h) #vector subtraction
28 | weights = weights + alpha * dataMatrix.transpose()* error #matrix mult
29 | return weights
30 |
31 | def plotBestFit(weights):
32 | import matplotlib.pyplot as plt
33 | dataMat,labelMat=loadDataSet()
34 | dataArr = array(dataMat)
35 | n = shape(dataArr)[0]
36 | xcord1 = []; ycord1 = []
37 | xcord2 = []; ycord2 = []
38 | for i in range(n):
39 | if int(labelMat[i])== 1:
40 | xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
41 | else:
42 | xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
43 | fig = plt.figure()
44 | ax = fig.add_subplot(111)
45 | ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
46 | ax.scatter(xcord2, ycord2, s=30, c='green')
47 | x = arange(-3.0, 3.0, 0.1)
48 | y = (-weights[0]-weights[1]*x)/weights[2]
49 | ax.plot(x, y)
50 | plt.xlabel('X1'); plt.ylabel('X2');
51 | plt.show()
--------------------------------------------------------------------------------
/forum/csv_to_vm.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 |
3 | ########################################################
4 | # __Author__: Triskelion #
5 | # Kaggle competition "Display Advertising Challenge": #
6 | # http://www.kaggle.com/c/criteo-display-ad-challenge/ #
7 | # Credit: Zygmunt Zając #
8 | ########################################################
9 |
10 | from datetime import datetime
11 | from csv import DictReader
12 |
13 | def csv_to_vw(loc_csv, loc_output, train=True):
14 | """
15 | Munges a CSV file (loc_csv) to a VW file (loc_output). Set "train"
16 | to False when munging a test set.
17 | TODO: Too slow for a daily cron job. Try optimize, Pandas or Go.
18 | """
19 | start = datetime.now()
20 | print("\nTurning %s into %s. Is_train_set? %s"%(loc_csv,loc_output,train))
21 |
22 | with open(loc_output,"wb") as outfile:
23 | for e, row in enumerate( DictReader(open(loc_csv)) ):
24 |
25 | #Creating the features
26 | numerical_features = ""
27 | categorical_features = ""
28 | for k,v in row.items():
29 | if k not in ["Label","Id"]:
30 | if "I" in k: # numerical feature, example: I5
31 | if len(str(v)) > 0: #check for empty values
32 | numerical_features += " %s:%s" % (k,v)
33 | if "C" in k: # categorical feature, example: C2
34 | if len(str(v)) > 0:
35 | categorical_features += " %s" % v
36 |
37 | #Creating the labels
38 | if train: #we care about labels
39 | if row['Label'] == "1":
40 | label = 1
41 | else:
42 | label = -1 #we set negative label to -1
43 | outfile.write( "%s '%s |i%s |c%s\n" % (label,row['Id'],numerical_features,categorical_features) )
44 |
45 | else: #we dont care about labels
46 | outfile.write( "1 '%s |i%s |c%s\n" % (row['Id'],numerical_features,categorical_features) )
47 |
48 | #Reporting progress
49 | if e % 1000000 == 0:
50 | print("%s\t%s"%(e, str(datetime.now() - start)))
51 |
52 | print("\n %s Task execution time:\n\t%s"%(e, str(datetime.now() - start)))
53 |
54 | #csv_to_vw("d:\\Downloads\\train\\train.csv", "c:\\click.train.vw",train=True)
55 | #csv_to_vw("d:\\Downloads\\test\\test.csv", "d:\\click.test.vw",train=False)
--------------------------------------------------------------------------------
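The converter above targets Python 2 (binary-mode writes of str). A minimal Python 3 sketch of the same munging logic, assuming the Criteo column naming (Id, Label, I1..I13, C1..C26); for a training row it emits a line like 1 '10000001 |i I1:5 I2:10 |c 68fd1e64 80e26c9b (made-up values):

from csv import DictReader

def csv_to_vw(loc_csv, loc_output, train=True):
    # Python 3 port of the converter above: same namespaces (i / c), same labels
    with open(loc_output, "w") as outfile, open(loc_csv) as infile:
        for row in DictReader(infile):
            numerical, categorical = "", ""
            for k, v in row.items():
                if k in ("Label", "Id") or not v:
                    continue
                if k.startswith("I"):       # numerical feature, e.g. I5
                    numerical += " %s:%s" % (k, v)
                elif k.startswith("C"):     # categorical feature, e.g. C2
                    categorical += " %s" % v
            if train:
                label = 1 if row["Label"] == "1" else -1
            else:
                label = 1   # dummy label for the test set
            outfile.write("%s '%s |i%s |c%s\n" % (label, row["Id"], numerical, categorical))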
/vowpal wabbit/csv_to_vm.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 |
3 | ########################################################
4 | # __Author__: Triskelion #
5 | # Kaggle competition "Display Advertising Challenge": #
6 | # http://www.kaggle.com/c/criteo-display-ad-challenge/ #
7 | # Credit: Zygmunt Zając #
8 | ########################################################
9 |
10 | from datetime import datetime
11 | from csv import DictReader
12 |
13 | def csv_to_vw(loc_csv, loc_output, train=True):
14 | """
15 | Munges a CSV file (loc_csv) to a VW file (loc_output). Set "train"
16 | to False when munging a test set.
17 | TODO: Too slow for a daily cron job. Try optimize, Pandas or Go.
18 | """
19 | start = datetime.now()
20 | print("\nTurning %s into %s. Is_train_set? %s"%(loc_csv,loc_output,train))
21 |
22 | with open(loc_output,"wb") as outfile:
23 | for e, row in enumerate( DictReader(open(loc_csv)) ):
24 |
25 | #Creating the features
26 | numerical_features = ""
27 | categorical_features = ""
28 | for k,v in row.items():
29 | if k not in ["Label","Id"]:
30 | if "I" in k: # numerical feature, example: I5
31 | if len(str(v)) > 0: #check for empty values
32 | numerical_features += " %s:%s" % (k,v)
33 | if "C" in k: # categorical feature, example: C2
34 | if len(str(v)) > 0:
35 | categorical_features += " %s" % v
36 |
37 | #Creating the labels
38 | if train: #we care about labels
39 | if row['Label'] == "1":
40 | label = 1
41 | else:
42 | label = -1 #we set negative label to -1
43 | outfile.write( "%s '%s |i%s |c%s\n" % (label,row['Id'],numerical_features,categorical_features) )
44 |
45 | else: #we dont care about labels
46 | outfile.write( "1 '%s |i%s |c%s\n" % (row['Id'],numerical_features,categorical_features) )
47 |
48 | #Reporting progress
49 | if e % 1000000 == 0:
50 | print("%s\t%s"%(e, str(datetime.now() - start)))
51 |
52 | print("\n %s Task execution time:\n\t%s"%(e, str(datetime.now() - start)))
53 |
54 | #csv_to_vw("d:\\Downloads\\train\\train.csv", "c:\\click.train.vw",train=True)
55 | #csv_to_vw("d:\\Downloads\\test\\test.csv", "d:\\click.test.vw",train=False)
--------------------------------------------------------------------------------
/testSet.txt:
--------------------------------------------------------------------------------
1 | -0.017612 14.053064 0
2 | -1.395634 4.662541 1
3 | -0.752157 6.538620 0
4 | -1.322371 7.152853 0
5 | 0.423363 11.054677 0
6 | 0.406704 7.067335 1
7 | 0.667394 12.741452 0
8 | -2.460150 6.866805 1
9 | 0.569411 9.548755 0
10 | -0.026632 10.427743 0
11 | 0.850433 6.920334 1
12 | 1.347183 13.175500 0
13 | 1.176813 3.167020 1
14 | -1.781871 9.097953 0
15 | -0.566606 5.749003 1
16 | 0.931635 1.589505 1
17 | -0.024205 6.151823 1
18 | -0.036453 2.690988 1
19 | -0.196949 0.444165 1
20 | 1.014459 5.754399 1
21 | 1.985298 3.230619 1
22 | -1.693453 -0.557540 1
23 | -0.576525 11.778922 0
24 | -0.346811 -1.678730 1
25 | -2.124484 2.672471 1
26 | 1.217916 9.597015 0
27 | -0.733928 9.098687 0
28 | -3.642001 -1.618087 1
29 | 0.315985 3.523953 1
30 | 1.416614 9.619232 0
31 | -0.386323 3.989286 1
32 | 0.556921 8.294984 1
33 | 1.224863 11.587360 0
34 | -1.347803 -2.406051 1
35 | 1.196604 4.951851 1
36 | 0.275221 9.543647 0
37 | 0.470575 9.332488 0
38 | -1.889567 9.542662 0
39 | -1.527893 12.150579 0
40 | -1.185247 11.309318 0
41 | -0.445678 3.297303 1
42 | 1.042222 6.105155 1
43 | -0.618787 10.320986 0
44 | 1.152083 0.548467 1
45 | 0.828534 2.676045 1
46 | -1.237728 10.549033 0
47 | -0.683565 -2.166125 1
48 | 0.229456 5.921938 1
49 | -0.959885 11.555336 0
50 | 0.492911 10.993324 0
51 | 0.184992 8.721488 0
52 | -0.355715 10.325976 0
53 | -0.397822 8.058397 0
54 | 0.824839 13.730343 0
55 | 1.507278 5.027866 1
56 | 0.099671 6.835839 1
57 | -0.344008 10.717485 0
58 | 1.785928 7.718645 1
59 | -0.918801 11.560217 0
60 | -0.364009 4.747300 1
61 | -0.841722 4.119083 1
62 | 0.490426 1.960539 1
63 | -0.007194 9.075792 0
64 | 0.356107 12.447863 0
65 | 0.342578 12.281162 0
66 | -0.810823 -1.466018 1
67 | 2.530777 6.476801 1
68 | 1.296683 11.607559 0
69 | 0.475487 12.040035 0
70 | -0.783277 11.009725 0
71 | 0.074798 11.023650 0
72 | -1.337472 0.468339 1
73 | -0.102781 13.763651 0
74 | -0.147324 2.874846 1
75 | 0.518389 9.887035 0
76 | 1.015399 7.571882 0
77 | -1.658086 -0.027255 1
78 | 1.319944 2.171228 1
79 | 2.056216 5.019981 1
80 | -0.851633 4.375691 1
81 | -1.510047 6.061992 0
82 | -1.076637 -3.181888 1
83 | 1.821096 10.283990 0
84 | 3.010150 8.401766 1
85 | -1.099458 1.688274 1
86 | -0.834872 -1.733869 1
87 | -0.846637 3.849075 1
88 | 1.400102 12.628781 0
89 | 1.752842 5.468166 1
90 | 0.078557 0.059736 1
91 | 0.089392 -0.715300 1
92 | 1.825662 12.693808 0
93 | 0.197445 9.744638 0
94 | 0.126117 0.922311 1
95 | -0.679797 1.220530 1
96 | 0.677983 2.556666 1
97 | 0.761349 10.693862 0
98 | -2.168791 0.143632 1
99 | 1.388610 9.341997 0
100 | 0.317029 14.739025 0
101 |
--------------------------------------------------------------------------------
/gbm.R:
--------------------------------------------------------------------------------
1 | setwd("I:\\data")
2 | library(data.table)
3 | # train <- fread("train.csv",select=c(1:15))
4 | # head(train)
5 | # write.table(train,"train_num.csv",sep=",", row.names=F, col.names=T)
6 | # str(train)
7 | # rm("train")
8 | # ?read.csv
9 | #
10 | # library(caret)
11 | # sum(is.na(train))
12 | # mean(is.na(train))
13 | # summary(train)
14 | # gc()
15 | # train <- fread("train_num.csv")
16 | # train <- na.omit(train)
17 | # index <- is.na(train)
18 | # table(index)
19 | # train <- train[-index,]
20 | # write.table(train,"train_num_na.csv",sep=",", row.names=F, col.names=T)
21 | # rm(train)
22 |
23 | # data cleansing
24 | #################
25 | train <- read.csv("train_num_na.csv")
26 | train <- train[,-1]
27 | head(train)
28 | train[which(train$Label==1),1] <- "Yes"
29 | train[which(train$Label==0),1] <- "No"
30 | train$Label <- as.factor(train$Label)
31 | write.table(train,"train_num_na_yesno.csv",sep=",", row.names=F, col.names=T)
32 |
33 |
34 | #covariate creation
35 | # nearZeroVar(train,saveMetrics = T)
36 | # train.pca <- preProcess(train[,2:14], method='pca', pcaComp=2)
37 |
38 | # model
39 | # library(doParallel)
40 | library(caret)
41 | # cl <- makePSOCKcluster(4)
42 | # registerDoParallel(cl)
43 | # fit1 <- train(Label~., method="rf",data=train)
44 |
45 | Grid <- expand.grid(n.trees=c(500),interaction.depth=c(22),shrinkage=.1)
46 | fitControl <- trainControl(method="none", allowParallel=T, classProbs=T)
47 | fit2 <- train(Label~., method="gbm", data=train, trControl=fitControl,
48 | verbose=T,tuneGrid=Grid, metric="ROC")
49 | pred2 <- predict(fit2, train)
50 | confusionMatrix(pred2, train$Label)
51 | rm(pred2)
52 |
53 | # fit3 <- train(Label~., method="glmnet",family="binomial",classProbs=T, data=train,verbose=T)
54 | # fit3 <- train(Label~., method="glmnet",classProbs=T, data=train,verbose=T)
55 |
56 | # fit 4
57 | ctrl <- trainControl(method = "cv",
58 | number=2,
59 | classProbs = TRUE,
60 | allowParallel = TRUE,
61 | summaryFunction = twoClassSummary)
62 |
63 | set.seed(888)
64 | rfFit <- train(Label~.,
65 | data=train,
66 | method = "rf",
67 | # tuneGrid = expand.grid(.mtry = 4),
68 | ntrees=500,
69 | importance = TRUE,
70 | metric = "ROC",
71 | trControl = ctrl)
72 |
73 |
74 | pred <- predict.train(rfFit, newdata = test, type = "prob")
75 |
76 |
77 | # stopCluster(cl)
78 |
79 | # load test data
80 | # test <- fread("test.csv", select=c(1:14))
81 | # write.table(test,"test_num.csv",sep=",", row.names=F, col.names=T)
82 | # test <- read.csv("test_num.csv")
83 | test <- read.csv("test_num_impute.csv")
84 | # test data imputation
85 | # pre<-preProcess(test, method='medianImpute')
86 | # test_impute <- predict(pre, test)
87 |
88 | # predict
89 | gc()
90 | # pred1 <- predict(fit1, test)
91 | pred2 <- predict(fit2,type="prob", test)
92 | head(test)
93 | pred2 <- plogis(pred2)
94 | # pred3 <- predict(fit3, test)
95 | # ensembling-models
96 | # data(pred1,pred2,pred3,train)
97 | # combFit<-train(Label~.,method="gam", train)
98 |
99 | # output
100 | fit2.submit <- data.frame(test$Id, pred2$Yes)  # pred2 holds the predicted class probabilities, not a column of test
101 | colnames(fit2.submit)<- c("Id","Predicted")
102 | write.table(fit2.submit,"submit1_gbm_num_nona_impute.csv", row.names=F, sep=',')
103 |
--------------------------------------------------------------------------------
/py_lh.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 |
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 20     # number of weights used for learning
12 | alpha = .2 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | p = max(min(p, 1. - 10e-12), 10e-12)
25 | return -log(p) if y == 1. else -log(1. - p)
26 |
27 |
28 | # B. Apply hash trick of the original csv row
29 | # for simplicity, we treat both integer and categorical features as categorical
30 | # INPUT:
31 | #     csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
32 | # D: the max index that we can hash to
33 | # OUTPUT:
34 | # x: a list of indices that its value is 1
35 | def get_x(csv_row, D):
36 | x = [0] # 0 is the index of the bias term
37 | for key, value in csv_row.items():
38 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
39 | x.append(index)
40 | return x # x contains indices of features that have a value of 1
41 |
42 |
43 | # C. Get probability estimation on x
44 | # INPUT:
45 | # x: features
46 | # w: weights
47 | # OUTPUT:
48 | # probability of p(y = 1 | x; w)
49 | def get_p(x, w):
50 | wTx = 0.
51 | for i in x: # do wTx
52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid
54 |
55 |
56 | # D. Update given model
57 | # INPUT:
58 | # w: weights
59 | # n: a counter that counts the number of times we encounter a feature
60 | # this is used for adaptive learning rate
61 | # x: feature
62 | # p: prediction of our model
63 | # y: answer
64 | # OUTPUT:
65 | # w: updated model
66 | # n: updated count
67 | def update_w(w, n, x, p, y):
68 | for i in x:
69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
70 | # (p - y) * x[i] is the current gradient
71 | # note that in our case, if i in x then x[i] = 1
72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
73 | n[i] += 1.
74 |
75 | return w, n
76 |
77 |
78 | # training and testing #######################################################
79 |
80 | # initialize our model
81 | w = [0.] * D # weights
82 | n = [0.] * D # number of times we've encountered a feature
83 |
84 | # start training a logistic regression model using one-pass sgd
85 | loss = 0.
86 | for t, row in enumerate(DictReader(open(train))):
87 | y = 1. if row['Label'] == '1' else 0.
88 |
89 | del row['Label'] # can't let the model peek the answer
90 | del row['Id'] # we don't need the Id
91 |
92 | # main training procedure
93 | # step 1, get the hashed features
94 | x = get_x(row, D)
95 |
96 | # step 2, get prediction
97 | p = get_p(x, w)
98 |
99 | # for progress validation, useless for learning our model
100 | loss += logloss(p, y)
101 | if t % 1000000 == 0 and t > 1:
102 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
103 | datetime.now(), t, loss/t))
104 |
105 | # step 3, update model with answer
106 | w, n = update_w(w, n, x, p, y)
107 |
108 | # testing (build kaggle's submission file)
109 | with open('submissionPython.csv', 'w') as submission:
110 | submission.write('Id,Predicted\n')
111 | for t, row in enumerate(DictReader(open(test))):
112 | Id = row['Id']
113 | del row['Id']
114 | x = get_x(row, D)
115 | p = get_p(x, w)
116 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
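A small worked example of the hashing trick used by get_x and get_p above (illustrative; the row values are made up): every non-empty field is turned into an index by interpreting value + column-number as a hexadecimal string and taking it modulo D, so each row touches only a handful of the D weights.

# a minimal, self-contained check of the hashing scheme from get_x/get_p above
from math import exp

D = 2 ** 20
row = {'I1': '357', 'C2': '68fd1e64'}        # hypothetical csv row (Label and Id already removed)

x = [0]                                      # index 0 is the bias term
for key, value in row.items():
    x.append(int(value + key[1:], 16) % D)   # e.g. int('68fd1e642', 16) % D

w = [0.] * D
wTx = sum(w[i] for i in x)                   # every hashed feature has value 1
p = 1. / (1. + exp(-max(min(wTx, 20.), -20.)))
print(x, p)                                  # with all-zero weights p == 0.5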
/py_lh2.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 |
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 26     # number of weights used for learning
12 | alpha = .1 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | p = max(min(p, 1. - 10e-12), 10e-12)
25 | return -log(p) if y == 1. else -log(1. - p)
26 |
27 |
28 | # B. Apply hash trick of the original csv row
29 | # for simplicity, we treat both integer and categorical features as categorical
30 | # INPUT:
31 | #     csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
32 | # D: the max index that we can hash to
33 | # OUTPUT:
34 | # x: a list of indices that its value is 1
35 | def get_x(csv_row, D):
36 | x = [0] # 0 is the index of the bias term
37 | for key, value in csv_row.items():
38 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
39 | x.append(index)
40 | return x # x contains indices of features that have a value of 1
41 |
42 |
43 | # C. Get probability estimation on x
44 | # INPUT:
45 | # x: features
46 | # w: weights
47 | # OUTPUT:
48 | # probability of p(y = 1 | x; w)
49 | def get_p(x, w):
50 | wTx = 0.
51 | for i in x: # do wTx
52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid
54 |
55 |
56 | # D. Update given model
57 | # INPUT:
58 | # w: weights
59 | # n: a counter that counts the number of times we encounter a feature
60 | # this is used for adaptive learning rate
61 | # x: feature
62 | # p: prediction of our model
63 | # y: answer
64 | # OUTPUT:
65 | # w: updated model
66 | # n: updated count
67 | def update_w(w, n, x, p, y):
68 | for i in x:
69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
70 | # (p - y) * x[i] is the current gradient
71 | # note that in our case, if i in x then x[i] = 1
72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
73 | n[i] += 1.
74 |
75 | return w, n
76 |
77 |
78 | # training and testing #######################################################
79 |
80 | # initialize our model
81 | w = [0.] * D # weights
82 | n = [0.] * D # number of times we've encountered a feature
83 |
84 | # start training a logistic regression model using one-pass sgd
85 | loss = 0.
86 | for t, row in enumerate(DictReader(open(train))):
87 | y = 1. if row['Label'] == '1' else 0.
88 |
89 | del row['Label'] # can't let the model peek the answer
90 | del row['Id'] # we don't need the Id
91 |
92 | # main training procedure
93 | # step 1, get the hashed features
94 | x = get_x(row, D)
95 |
96 | # step 2, get prediction
97 | p = get_p(x, w)
98 |
99 | # for progress validation, useless for learning our model
100 | loss += logloss(p, y)
101 | if t % 1000000 == 0 and t > 1:
102 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
103 | datetime.now(), t, loss/t))
104 |
105 | # step 3, update model with answer
106 | w, n = update_w(w, n, x, p, y)
107 |
108 | # testing (build kaggle's submission file)
109 | with open('submissionPython2.csv', 'w') as submission:
110 | submission.write('Id,Predicted\n')
111 | for t, row in enumerate(DictReader(open(test))):
112 | Id = row['Id']
113 | del row['Id']
114 | x = get_x(row, D)
115 | p = get_p(x, w)
116 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
/py_lh3.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 |
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 25     # number of weights used for learning
12 | alpha = .15 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | p = max(min(p, 1. - 10e-12), 10e-12)
25 | return -log(p) if y == 1. else -log(1. - p)
26 |
27 |
28 | # B. Apply hash trick of the original csv row
29 | # for simplicity, we treat both integer and categorical features as categorical
30 | # INPUT:
31 | #     csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
32 | # D: the max index that we can hash to
33 | # OUTPUT:
34 | # x: a list of indices that its value is 1
35 | def get_x(csv_row, D):
36 | x = [0] # 0 is the index of the bias term
37 | for key, value in csv_row.items():
38 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
39 | x.append(index)
40 | return x # x contains indices of features that have a value of 1
41 |
42 |
43 | # C. Get probability estimation on x
44 | # INPUT:
45 | # x: features
46 | # w: weights
47 | # OUTPUT:
48 | # probability of p(y = 1 | x; w)
49 | def get_p(x, w):
50 | wTx = 0.
51 | for i in x: # do wTx
52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid
54 |
55 |
56 | # D. Update given model
57 | # INPUT:
58 | # w: weights
59 | # n: a counter that counts the number of times we encounter a feature
60 | # this is used for adaptive learning rate
61 | # x: feature
62 | # p: prediction of our model
63 | # y: answer
64 | # OUTPUT:
65 | # w: updated model
66 | # n: updated count
67 | def update_w(w, n, x, p, y):
68 | for i in x:
69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
70 | # (p - y) * x[i] is the current gradient
71 | # note that in our case, if i in x then x[i] = 1
72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
73 | n[i] += 1.
74 |
75 | return w, n
76 |
77 |
78 | # training and testing #######################################################
79 |
80 | # initialize our model
81 | w = [0.] * D # weights
82 | n = [0.] * D # number of times we've encountered a feature
83 |
84 | # start training a logistic regression model using one-pass sgd
85 | loss = 0.
86 | for t, row in enumerate(DictReader(open(train))):
87 | y = 1. if row['Label'] == '1' else 0.
88 |
89 | del row['Label'] # can't let the model peek the answer
90 | del row['Id'] # we don't need the Id
91 |
92 | # main training procedure
93 | # step 1, get the hashed features
94 | x = get_x(row, D)
95 |
96 | # step 2, get prediction
97 | p = get_p(x, w)
98 |
99 | # for progress validation, useless for learning our model
100 | loss += logloss(p, y)
101 | if t % 1000000 == 0 and t > 1:
102 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
103 | datetime.now(), t, loss/t))
104 |
105 | # step 3, update model with answer
106 | w, n = update_w(w, n, x, p, y)
107 |
108 | # testing (build kaggle's submission file)
109 | with open('submissionPython4.csv', 'w') as submission:
110 | submission.write('Id,Predicted\n')
111 | for t, row in enumerate(DictReader(open(test))):
112 | Id = row['Id']
113 | del row['Id']
114 | x = get_x(row, D)
115 | p = get_p(x, w)
116 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
/py_lh4.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 |
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 27     # number of weights used for learning
12 | alpha = .145 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | p = max(min(p, 1. - 10e-12), 10e-12)
25 | return -log(p) if y == 1. else -log(1. - p)
26 |
27 |
28 | # B. Apply hash trick of the original csv row
29 | # for simplicity, we treat both integer and categorical features as categorical
30 | # INPUT:
31 | #     csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
32 | # D: the max index that we can hash to
33 | # OUTPUT:
34 | # x: a list of indices that its value is 1
35 | def get_x(csv_row, D):
36 | x = [0] # 0 is the index of the bias term
37 | for key, value in csv_row.items():
38 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
39 | x.append(index)
40 | return x # x contains indices of features that have a value of 1
41 |
42 |
43 | # C. Get probability estimation on x
44 | # INPUT:
45 | # x: features
46 | # w: weights
47 | # OUTPUT:
48 | # probability of p(y = 1 | x; w)
49 | def get_p(x, w):
50 | wTx = 0.
51 | for i in x: # do wTx
52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid
54 |
55 |
56 | # D. Update given model
57 | # INPUT:
58 | # w: weights
59 | # n: a counter that counts the number of times we encounter a feature
60 | # this is used for adaptive learning rate
61 | # x: feature
62 | # p: prediction of our model
63 | # y: answer
64 | # OUTPUT:
65 | # w: updated model
66 | # n: updated count
67 | def update_w(w, n, x, p, y):
68 | for i in x:
69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
70 | # (p - y) * x[i] is the current gradient
71 | # note that in our case, if i in x then x[i] = 1
72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
73 | n[i] += 1.
74 |
75 | return w, n
76 |
77 |
78 | # training and testing #######################################################
79 |
80 | # initialize our model
81 | w = [0.] * D # weights
82 | n = [0.] * D # number of times we've encountered a feature
83 |
84 | # start training a logistic regression model using one-pass sgd
85 | loss = 0.
86 | for t, row in enumerate(DictReader(open(train))):
87 | y = 1. if row['Label'] == '1' else 0.
88 |
89 | del row['Label'] # can't let the model peek the answer
90 | del row['Id'] # we don't need the Id
91 |
92 | # main training procedure
93 | # step 1, get the hashed features
94 | x = get_x(row, D)
95 |
96 | # step 2, get prediction
97 | p = get_p(x, w)
98 |
99 | # for progress validation, useless for learning our model
100 | loss += logloss(p, y)
101 | if t % 1000000 == 0 and t > 1:
102 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
103 | datetime.now(), t, loss/t))
104 |
105 | # step 3, update model with answer
106 | w, n = update_w(w, n, x, p, y)
107 |
108 | # testing (build kaggle's submission file)
109 | with open('submissionPython5.csv', 'w') as submission:
110 | submission.write('Id,Predicted\n')
111 | for t, row in enumerate(DictReader(open(test))):
112 | Id = row['Id']
113 | del row['Id']
114 | x = get_x(row, D)
115 | p = get_p(x, w)
116 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
/logReg_click.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 |
6 | # calculate the sigmoid function
7 | def sigmoid(inX):
8 | return 1.0 / (1 + exp(-inX))
9 |
10 | # train a logistic regression model using a selectable optimization algorithm
11 | # input: train_x is a mat datatype, each row stands for one sample
12 | #        train_y is mat datatype too, each row is the corresponding label
13 | #        opts holds the optimization options: step size (alpha), maximum number of iterations, and the optimizer type
14 | def trainLogRegres(train_x, train_y, opts):
15 | # calculate training time
16 | startTime = time.time()
17 |
18 | numSamples, numFeatures = shape(train_x)
19 | alpha = opts['alpha']; maxIter = opts['maxIter']
20 | weights = ones((numFeatures, 1))
21 |
22 | # optimize through the chosen gradient descent algorithm
23 | for k in range(maxIter):
24 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorithm
25 | output = sigmoid(train_x * weights)
26 | error = train_y - output
27 | weights = weights + alpha * train_x.transpose() * error
28 | elif opts['optimizeType'] == 'stocGradDescent': # stochastic gradient descent
29 | for i in range(numSamples):
30 | output = sigmoid(train_x[i, :] * weights)
31 | error = train_y[i, 0] - output
32 | weights = weights + alpha * train_x[i, :].transpose() * error
33 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent
34 | # randomly select samples to optimize for reducing cycle fluctuations
35 | dataIndex = list(range(numSamples))
36 | for i in range(numSamples):
37 | alpha = 4.0 / (1.0 + k + i) + 0.01
38 | randIndex = int(random.uniform(0, len(dataIndex)))
39 | output = sigmoid(train_x[randIndex, :] * weights)
40 | error = train_y[randIndex, 0] - output
41 | weights = weights + alpha * train_x[randIndex, :].transpose() * error
42 | del dataIndex[randIndex] # during one iteration, delete the already-optimized sample
43 | else:
44 | raise ValueError('Unsupported optimizeType: %s' % opts['optimizeType'])
45 |
46 |
47 | print ('Congratulations, training complete! Took %fs!' % (time.time() - startTime))
48 | return weights
49 |
50 | # test your trained Logistic Regression model given test set
51 | def testLogRegres(weights, test_x, test_y):
52 | numSamples, numFeatures = shape(test_x)
53 | matchCount = 0
54 | for i in range(numSamples):
55 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
56 | if predict == bool(test_y[i, 0]):
57 | matchCount += 1
58 | accuracy = float(matchCount) / numSamples
59 | return accuracy
60 |
61 | def loadData():
62 | train_x = []
63 | train_y = []
64 | fileIn = open('train.csv')
65 | for line in fileIn.readlines():
66 | lineArr = line.strip().split()
67 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])])
68 | train_y.append(float(lineArr[2]))
69 | return mat(train_x), mat(train_y).transpose()
70 |
71 | ###############################################################################
72 | ## step 1: load data
73 | print ("step 1: load data...")
74 | train_x, train_y = loadData()
75 | test_x = train_x; test_y = train_y
76 |
77 | ## step 2: training...
78 | print ("step 2: training...")
79 | opts = {'alpha': 0.01, 'maxIter': 2 ** 27, 'optimizeType': 'smoothStocGradDescent'}
80 | optimalWeights = trainLogRegres(train_x, train_y, opts)
81 |
82 | ## step 3: testing
83 | print ("step 3: testing...")
84 | accuracy = testLogRegres(optimalWeights, test_x, test_y)
85 |
86 | ## step 4: show the result
87 | print ("step 4: show the result...")
88 | print ('The classification accuracy is: %.3f%%' % (accuracy * 100))
--------------------------------------------------------------------------------
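Note: logReg_click.py defines trainLogRegres and testLogRegres but its driver expects a whitespace-separated train.csv and a maxIter of 2 ** 27, which is impractical for a quick check. A quick smoke test on synthetic 2-D data; the blob data, seed, and small maxIter are illustrative only, and it should be run in the same session as the functions above (importing the module would execute its file-reading main block):

import numpy as np
from numpy import mat

np.random.seed(0)
pos = np.random.randn(50, 2) + 2.0                          # class-1 blob
neg = np.random.randn(50, 2) - 2.0                          # class-0 blob
X = np.hstack((np.ones((100, 1)), np.vstack((pos, neg))))   # bias column first, as loadData() does
y = np.vstack((np.ones((50, 1)), np.zeros((50, 1))))

train_x, train_y = mat(X), mat(y)
opts = {'alpha': 0.01, 'maxIter': 200, 'optimizeType': 'gradDescent'}
weights = trainLogRegres(train_x, train_y, opts)
print('accuracy: %.3f' % testLogRegres(weights, train_x, train_y))
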
/pythonSGD/logReg_click.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 |
6 | # calculate the sigmoid function
7 | def sigmoid(inX):
8 | return 1.0 / (1 + exp(-inX))
9 |
10 | # train a logistic regression model using a selectable optimization algorithm
11 | # input: train_x is a mat datatype, each row stands for one sample
12 | # train_y is a mat datatype too, each row is the corresponding label
13 | # opts holds the optimization options, including the step size and the maximum number of iterations
14 | def trainLogRegres(train_x, train_y, opts):
15 | # calculate training time
16 | startTime = time.time()
17 |
18 | numSamples, numFeatures = shape(train_x)
19 | alpha = opts['alpha']; maxIter = opts['maxIter']
20 | weights = ones((numFeatures, 1))
21 |
22 | # optimize through the chosen gradient descent algorithm
23 | for k in range(maxIter):
24 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorithm
25 | output = sigmoid(train_x * weights)
26 | error = train_y - output
27 | weights = weights + alpha * train_x.transpose() * error
28 | elif opts['optimizeType'] == 'stocGradDescent': # stochastic gradient descent
29 | for i in range(numSamples):
30 | output = sigmoid(train_x[i, :] * weights)
31 | error = train_y[i, 0] - output
32 | weights = weights + alpha * train_x[i, :].transpose() * error
33 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent
34 | # randomly select samples to optimize for reducing cycle fluctuations
35 | dataIndex = list(range(numSamples))
36 | for i in range(numSamples):
37 | alpha = 4.0 / (1.0 + k + i) + 0.01
38 | randIndex = int(random.uniform(0, len(dataIndex)))
39 | output = sigmoid(train_x[randIndex, :] * weights)
40 | error = train_y[randIndex, 0] - output
41 | weights = weights + alpha * train_x[randIndex, :].transpose() * error
42 | del dataIndex[randIndex] # during one iteration, delete the already-optimized sample
43 | else:
44 | raise ValueError('Unsupported optimizeType: %s' % opts['optimizeType'])
45 |
46 |
47 | print ('Congratulations, training complete! Took %fs!' % (time.time() - startTime))
48 | return weights
49 |
50 | # test your trained Logistic Regression model given test set
51 | def testLogRegres(weights, test_x, test_y):
52 | numSamples, numFeatures = shape(test_x)
53 | matchCount = 0
54 | for i in range(numSamples):
55 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
56 | if predict == bool(test_y[i, 0]):
57 | matchCount += 1
58 | accuracy = float(matchCount) / numSamples
59 | return accuracy
60 |
61 | def loadData():
62 | train_x = []
63 | train_y = []
64 | fileIn = open('train.csv')
65 | for line in fileIn.readlines():
66 | lineArr = line.strip().split()
67 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])])
68 | train_y.append(float(lineArr[2]))
69 | return mat(train_x), mat(train_y).transpose()
70 |
71 | ###############################################################################
72 | ## step 1: load data
73 | print ("step 1: load data...")
74 | train_x, train_y = loadData()
75 | test_x = train_x; test_y = train_y
76 |
77 | ## step 2: training...
78 | print ("step 2: training...")
79 | opts = {'alpha': 0.01, 'maxIter': 2 ** 27, 'optimizeType': 'smoothStocGradDescent'}
80 | optimalWeights = trainLogRegres(train_x, train_y, opts)
81 |
82 | ## step 3: testing
83 | print ("step 3: testing...")
84 | accuracy = testLogRegres(optimalWeights, test_x, test_y)
85 |
86 | ## step 4: show the result
87 | print ("step 4: show the result...")
88 | print ('The classification accuracy is: %.3f%%' % (accuracy * 100))
--------------------------------------------------------------------------------
/pythonSGD/py_lh_22Sep2014_2.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 |
5 | # parameters #################################################################
6 |
7 | train = 'train.csv' # path to training file
8 | test = 'test.csv' # path to testing file
9 |
10 | D = 2 ** 28 # number of weights used for learning
11 | alpha = .145 # learning rate for sgd optimization
12 |
13 |
14 | # function definitions #######################################################
15 |
16 | # A. Bounded logloss
17 | # INPUT:
18 | # p: our prediction
19 | # y: real answer
20 | # OUTPUT
21 | # logarithmic loss of p given y
22 | def logloss(p, y):
23 | p = max(min(p, 1. - 10e-12), 10e-12)
24 | return -log(p) if y == 1. else -log(1. - p)
25 |
26 | # B. Apply hash trick of the original csv row
27 | # for simplicity, we treat both integer and categorical features as categorical
28 | # INPUT:
29 | # csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
30 | # D: the number of hash buckets; indices range from 0 to D-1
31 | # OUTPUT:
32 | # x: a list of indices whose feature value is 1
33 | def get_x(csv_row, D):
34 | x = [0] # 0 is the index of the bias term
35 | for key, value in csv_row.items():
36 | index = int(value + key[1:], 32) % D # weakest hash ever ;)
37 | x.append(index)
38 | return x # x contains indices of features that have a value of 1
39 |
40 |
41 | # C. Get probability estimation on x
42 | # INPUT:
43 | # x: features
44 | # w: weights
45 | # OUTPUT:
46 | # probability of p(y = 1 | x; w)
47 | def get_p(x, w):
48 | wTx = 0.
49 | for i in x: # do wTx
50 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
51 | return 1. / (1. + exp(-max(min(wTx, 10.), -10.))) # bounded sigmoid
52 |
53 |
54 | # D. Update given model
55 | # INPUT:
56 | # w: weights
57 | # n: a counter that counts the number of times we encounter a feature
58 | # this is used for adaptive learning rate
59 | # x: feature
60 | # p: prediction of our model
61 | # y: answer
62 | # OUTPUT:
63 | # w: updated model
64 | # n: updated count
65 | def update_w(w, n, x, p, y):
66 | for i in x:
67 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
68 | # (p - y) * x[i] is the current gradient
69 | # note that in our case, if i in x then x[i] = 1
70 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
71 | n[i] += 1.
72 | return w, n
73 |
74 |
75 | # training and testing #######################################################
76 |
77 | # initialize our model
78 | w = [0.] * D # weights
79 | n = [0.] * D # number of times we've encountered a feature
80 |
82 | # start training a logistic regression model using one-pass sgd
82 | loss = 0.
83 | for t, row in enumerate(DictReader(open(train))):
84 | y = 1. if row['Label'] == '1' else 0.
85 |
86 | del row['Label'] # can't let the model peek the answer
87 | del row['Id'] # we don't need the Id
88 |
89 | # main training procedure
90 | # step 1, get the hashed features
91 | x = get_x(row, D)
92 |
93 | # step 2, get prediction
94 | p = get_p(x, w)
95 |
96 | # for progress validation, useless for learning our model
97 | loss += logloss(p, y)
98 | if t % 100000 == 0 and t > 1:
99 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
100 | datetime.now(), t, loss/t))
101 |
102 | # step 3, update model with answer
103 | if t <= 40000000:
104 | w, n = update_w(w, n, x, p, y)
105 |
106 | # testing (build kaggle's submission file)
107 | with open('submissionPython_22Sep2014.csv', 'w') as submission:
108 | submission.write('Id,Predicted\n')
109 | for t, row in enumerate(DictReader(open(test))):
110 | Id = row['Id']
111 | del row['Id']
112 | x = get_x(row, D)
113 | p = get_p(x, w)
114 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
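Note: this variant (and pythonSGD/py_lh_20Sep2014.py further down) stops calling update_w after row 40,000,000 while still accumulating logloss, so the tail of the single pass can serve as a rough streaming holdout. A generic sketch of that pattern; the names stream, predict and update stand in for the row iterator and the get_p / update_w style helpers and are not the repository's code:

from math import log

def bounded_logloss(p, y, eps=1e-15):
    p = max(min(p, 1. - eps), eps)
    return -log(p) if y == 1. else -log(1. - p)

def one_pass(stream, predict, update, cutoff):
    """Single pass over (x, y) pairs: update weights up to `cutoff`, then only measure."""
    train_loss = train_n = holdout_loss = holdout_n = 0
    for t, (x, y) in enumerate(stream):
        p = predict(x)
        if t <= cutoff:
            train_loss += bounded_logloss(p, y)
            train_n += 1
            update(x, p, y)              # weights only move before the cutoff
        else:
            holdout_loss += bounded_logloss(p, y)
            holdout_n += 1
    return train_loss / max(train_n, 1), holdout_loss / max(holdout_n, 1)
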
/logReg.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 |
6 | # calculate the sigmoid function
7 | def sigmoid(inX):
8 | return 1.0 / (1 + exp(-inX))
9 |
10 |
11 | # train a logistic regression model using a selectable optimization algorithm
12 | # input: train_x is a mat datatype, each row stands for one sample
13 | # train_y is a mat datatype too, each row is the corresponding label
14 | # opts holds the optimization options, including the step size and the maximum number of iterations
15 | def trainLogRegres(train_x, train_y, opts):
16 | # calculate training time
17 | startTime = time.time()
18 |
19 | numSamples, numFeatures = shape(train_x)
20 | alpha = opts['alpha']; maxIter = opts['maxIter']
21 | weights = ones((numFeatures, 1))
22 |
23 | # optimize through the chosen gradient descent algorithm
24 | for k in range(maxIter):
25 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorithm
26 | output = sigmoid(train_x * weights)
27 | error = train_y - output
28 | weights = weights + alpha * train_x.transpose() * error
29 | elif opts['optimizeType'] == 'stocGradDescent': # stochastic gradient descent
30 | for i in range(numSamples):
31 | output = sigmoid(train_x[i, :] * weights)
32 | error = train_y[i, 0] - output
33 | weights = weights + alpha * train_x[i, :].transpose() * error
34 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent
35 | # randomly select samples to optimize for reducing cycle fluctuations
36 | dataIndex = list(range(numSamples))
37 | for i in range(numSamples):
38 | alpha = 4.0 / (1.0 + k + i) + 0.01
39 | randIndex = int(random.uniform(0, len(dataIndex)))
40 | output = sigmoid(train_x[randIndex, :] * weights)
41 | error = train_y[randIndex, 0] - output
42 | weights = weights + alpha * train_x[randIndex, :].transpose() * error
43 | del dataIndex[randIndex] # during one iteration, delete the already-optimized sample
44 | else:
45 | raise ValueError('Unsupported optimizeType: %s' % opts['optimizeType'])
46 |
47 |
48 | print('Congratulations, training complete! Took %fs!' % (time.time() - startTime))
49 | return weights
50 |
51 |
52 | # test your trained Logistic Regression model given test set
53 | def testLogRegres(weights, test_x, test_y):
54 | numSamples, numFeatures = shape(test_x)
55 | matchCount = 0
56 | for i in range(numSamples):
57 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
58 | if predict == bool(test_y[i, 0]):
59 | matchCount += 1
60 | accuracy = float(matchCount) / numSamples
61 | return accuracy
62 |
63 |
64 | # plot your trained logistic regression model; only available for 2-D data
65 | def showLogRegres(weights, train_x, train_y):
66 | # notice: train_x and train_y are mat datatypes
67 | numSamples, numFeatures = shape(train_x)
68 | if numFeatures != 3:
69 | print "Sorry! I can not draw because the dimension of your data is not 2!"
70 | return 1
71 |
72 | # draw all samples
73 | for i in range(numSamples):
74 | if int(train_y[i, 0]) == 0:
75 | plt.plot(train_x[i, 1], train_x[i, 2], 'or')
76 | elif int(train_y[i, 0]) == 1:
77 | plt.plot(train_x[i, 1], train_x[i, 2], 'ob')
78 |
79 | # draw the classify line
80 | min_x = min(train_x[:, 1])[0, 0]
81 | max_x = max(train_x[:, 1])[0, 0]
82 | weights = weights.getA() # convert mat to array
83 | y_min_x = float(-weights[0] - weights[1] * min_x) / weights[2]
84 | y_max_x = float(-weights[0] - weights[1] * max_x) / weights[2]
85 | plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g')
86 | plt.xlabel('X1'); plt.ylabel('X2')
87 | plt.show()
88 |
89 |
--------------------------------------------------------------------------------
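Note: showLogRegres draws the decision boundary by solving w0 + w1*x1 + w2*x2 = 0 for x2 at the two extremes of x1 (its y_min_x / y_max_x lines). A tiny numeric check of that formula with made-up weights:

# Made-up 2-D weights: bias w0 plus coefficients w1 and w2.
w0, w1, w2 = -1.0, 2.0, 4.0

def boundary_x2(x1):
    # On the decision boundary the linear score is zero: w0 + w1*x1 + w2*x2 = 0
    return (-w0 - w1 * x1) / w2

for x1 in (-3.0, 3.0):                   # the two x1 extremes showLogRegres would use
    print(x1, boundary_x2(x1))           # points where the predicted probability is 0.5
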
/pythonSGD/logReg.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 |
6 | # calculate the sigmoid function
7 | def sigmoid(inX):
8 | return 1.0 / (1 + exp(-inX))
9 |
10 |
11 | # train a logistic regression model using a selectable optimization algorithm
12 | # input: train_x is a mat datatype, each row stands for one sample
13 | # train_y is a mat datatype too, each row is the corresponding label
14 | # opts holds the optimization options, including the step size and the maximum number of iterations
15 | def trainLogRegres(train_x, train_y, opts):
16 | # calculate training time
17 | startTime = time.time()
18 |
19 | numSamples, numFeatures = shape(train_x)
20 | alpha = opts['alpha']; maxIter = opts['maxIter']
21 | weights = ones((numFeatures, 1))
22 |
23 | # optimize through the chosen gradient descent algorithm
24 | for k in range(maxIter):
25 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorithm
26 | output = sigmoid(train_x * weights)
27 | error = train_y - output
28 | weights = weights + alpha * train_x.transpose() * error
29 | elif opts['optimizeType'] == 'stocGradDescent': # stochastic gradient descent
30 | for i in range(numSamples):
31 | output = sigmoid(train_x[i, :] * weights)
32 | error = train_y[i, 0] - output
33 | weights = weights + alpha * train_x[i, :].transpose() * error
34 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent
35 | # randomly select samples to optimize for reducing cycle fluctuations
36 | dataIndex = list(range(numSamples))
37 | for i in range(numSamples):
38 | alpha = 4.0 / (1.0 + k + i) + 0.01
39 | randIndex = int(random.uniform(0, len(dataIndex)))
40 | output = sigmoid(train_x[randIndex, :] * weights)
41 | error = train_y[randIndex, 0] - output
42 | weights = weights + alpha * train_x[randIndex, :].transpose() * error
43 | del dataIndex[randIndex] # during one iteration, delete the already-optimized sample
44 | else:
45 | raise ValueError('Unsupported optimizeType: %s' % opts['optimizeType'])
46 |
47 |
48 | print('Congratulations, training complete! Took %fs!' % (time.time() - startTime))
49 | return weights
50 |
51 |
52 | # test your trained Logistic Regression model given test set
53 | def testLogRegres(weights, test_x, test_y):
54 | numSamples, numFeatures = shape(test_x)
55 | matchCount = 0
56 | for i in range(numSamples):
57 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
58 | if predict == bool(test_y[i, 0]):
59 | matchCount += 1
60 | accuracy = float(matchCount) / numSamples
61 | return accuracy
62 |
63 |
64 | # plot your trained logistic regression model; only available for 2-D data
65 | def showLogRegres(weights, train_x, train_y):
66 | # notice: train_x and train_y are mat datatypes
67 | numSamples, numFeatures = shape(train_x)
68 | if numFeatures != 3:
69 | print "Sorry! I can not draw because the dimension of your data is not 2!"
70 | return 1
71 |
72 | # draw all samples
73 | for i in range(numSamples):
74 | if int(train_y[i, 0]) == 0:
75 | plt.plot(train_x[i, 1], train_x[i, 2], 'or')
76 | elif int(train_y[i, 0]) == 1:
77 | plt.plot(train_x[i, 1], train_x[i, 2], 'ob')
78 |
79 | # draw the classify line
80 | min_x = min(train_x[:, 1])[0, 0]
81 | max_x = max(train_x[:, 1])[0, 0]
82 | weights = weights.getA() # convert mat to array
83 | y_min_x = float(-weights[0] - weights[1] * min_x) / weights[2]
84 | y_max_x = float(-weights[0] - weights[1] * max_x) / weights[2]
85 | plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g')
86 | plt.xlabel('X1'); plt.ylabel('X2')
87 | plt.show()
88 |
89 |
--------------------------------------------------------------------------------
/pythonSGD/py_lh4_22Sep2014.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 | import mmh3
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 28 # number of weights used for learning
12 | alpha = .145 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | p = max(min(p, 1. - 10e-15), 10e-15)
25 | return -log(p) if y == 1. else -log(1. - p)
26 |
27 |
28 | # B. Apply hash trick of the original csv row
29 | # for simplicity, we treat both integer and categorical features as categorical
30 | # INPUT:
31 | # csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
32 | # D: the number of hash buckets; indices range from 0 to D-1
33 | # OUTPUT:
34 | # x: a list of indices whose feature value is 1
35 | def get_x(csv_row, D):
36 | x = [0] # 0 is the index of the bias term
37 | for key, value in csv_row.items():
38 | index = mmh3.hash(value + key[1:], 42) % D # MurmurHash3 of the feature string, fixed seed
39 | x.append(index)
40 | return x # x contains indices of features that have a value of 1
41 |
42 |
43 | # C. Get probability estimation on x
44 | # INPUT:
45 | # x: features
46 | # w: weights
47 | # OUTPUT:
48 | # probability of p(y = 1 | x; w)
49 | def get_p(x, w, ld):
50 | ld = ld + 0.001
51 | wTx = 0.
52 | for i in x: # do wTx
53 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
54 | return (1.+ld) / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid
55 |
56 |
57 | # D. Update given model
58 | # INPUT:
59 | # w: weights
60 | # n: a counter that counts the number of times we encounter a feature
61 | # this is used for adaptive learning rate
62 | # x: feature
63 | # p: prediction of our model
64 | # y: answer
65 | # OUTPUT:
66 | # w: updated model
67 | # n: updated count
68 | def update_w(w, n, x, p, y):
69 | for i in x:
70 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
71 | # (p - y) * x[i] is the current gradient
72 | # note that in our case, if i in x then x[i] = 1
73 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
74 | n[i] += 1.
75 |
76 | return w, n
77 |
78 |
79 | # training and testing #######################################################
80 |
81 | # initialize our model
82 | w = [0.] * D # weights
83 | n = [0.] * D # number of times we've encountered a feature
84 |
85 | # start training a logistic regression model using one-pass sgd
86 | loss = 0.
87 | ld = 0.001
88 | for t, row in enumerate(DictReader(open(train))):
89 |
90 | y = 1. if row['Label'] == '1' else 0.
91 |
92 | del row['Label'] # can't let the model peek the answer
93 | del row['Id'] # we don't need the Id
94 |
95 | # main training procedure
96 | # step 1, get the hashed features
97 | x = get_x(row, D)
98 |
99 | # step 2, get prediction
100 | p = get_p(x, w, ld)
101 |
102 | # for progress validation, useless for learning our model
103 | loss += logloss(p, y)
104 | if t % 1000000 == 0 and t > 1:
105 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
106 | datetime.now(), t, loss/t))
107 |
108 | # step 3, update model with answer
109 | w, n = update_w(w, n, x, p, y)
110 |
111 | # testing (build kaggle's submission file)
112 | with open('submissionPython22Sep2014_pm.csv', 'w') as submission:
113 | submission.write('Id,Predicted\n')
114 | for t, row in enumerate(DictReader(open(test))):
115 | Id = row['Id']
116 | del row['Id']
117 | x = get_x(row, D)
118 | p = get_p(x, w, ld)
119 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
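Note: py_lh4_22Sep2014.py swaps the int(value + key[1:], 16) trick for MurmurHash3 via the third-party mmh3 package (pip install mmh3). A minimal sketch of how that hashing maps a raw field to a weight index; the sample values below are made up:

import mmh3                      # third-party package: pip install mmh3

D = 2 ** 28                      # number of weight slots, as in the script above
SEED = 42

def feature_index(key, value):
    # mmh3.hash returns a signed 32-bit integer; Python's % maps it into [0, D)
    return mmh3.hash(value + key[1:], SEED) % D

print(feature_index('C1', '8cf07265'))   # made-up categorical value
print(feature_index('I3', '35'))         # integer features are hashed as strings too
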
/py_lh_20Sep2014.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 | import scipy as sp
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 27 # number of weights used for learning
12 | alpha = .15 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | # p and y are scalars here, so clip p away from 0 and 1
25 | # and return the bounded logarithmic loss directly
26 | epsilon = 1e-15
27 | p = max(min(p, 1. - epsilon), epsilon)
28 | ll = -log(p) if y == 1. else -log(1. - p)
29 | return ll
30 |
31 | # B. Apply hash trick of the original csv row
32 | # for simplicity, we treat both integer and categorical features as categorical
33 | # INPUT:
34 | # csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
35 | # D: the number of hash buckets; indices range from 0 to D-1
36 | # OUTPUT:
37 | # x: a list of indices whose feature value is 1
38 | def get_x(csv_row, D):
39 | x = [0] # 0 is the index of the bias term
40 | for key, value in csv_row.items():
41 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
42 | x.append(index)
43 | return x # x contains indices of features that have a value of 1
44 |
45 |
46 | # C. Get probability estimation on x
47 | # INPUT:
48 | # x: features
49 | # w: weights
50 | # OUTPUT:
51 | # probability of p(y = 1 | x; w)
52 | def get_p(x, w):
53 | wTx = 0.
54 | for i in x: # do wTx
55 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
56 | return 1. / (1. + exp(-max(min(wTx, 10.), -10.))) # bounded sigmoid
57 |
58 |
59 | # D. Update given model
60 | # INPUT:
61 | # w: weights
62 | # n: a counter that counts the number of times we encounter a feature
63 | # this is used for adaptive learning rate
64 | # x: feature
65 | # p: prediction of our model
66 | # y: answer
67 | # OUTPUT:
68 | # w: updated model
69 | # n: updated count
70 | def update_w(w, n, x, p, y):
71 | for i in x:
72 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
73 | # (p - y) * x[i] is the current gradient
74 | # note that in our case, if i in x then x[i] = 1
75 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
76 | n[i] += 1.
77 |
78 | return w, n
79 |
80 |
81 | # training and testing #######################################################
82 |
83 | # initialize our model
84 | w = [0.] * D # weights
85 | n = [0.] * D # number of times we've encountered a feature
86 |
88 | # start training a logistic regression model using one-pass sgd
88 | loss = 0.
89 | for t, row in enumerate(DictReader(open(train))):
90 | y = 1. if row['Label'] == '1' else 0.
91 |
92 | del row['Label'] # can't let the model peek the answer
93 | del row['Id'] # we don't need the Id
94 |
95 | # main training procedure
96 | # step 1, get the hashed features
97 | x = get_x(row, D)
98 |
99 | # step 2, get prediction
100 | p = get_p(x, w)
101 |
102 | # for progress validation, useless for learning our model
103 | loss += logloss(p, y)
104 | if t % 1000000 == 0 and t > 1:
105 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
106 | datetime.now(), t, loss/t))
107 |
108 | # step 3, update model with answer
109 | w, n = update_w(w, n, x, p, y)
110 |
111 | # testing (build kaggle's submission file)
112 | with open('submissionPython_20Sep2014_2.csv', 'w') as submission:
113 | submission.write('Id,Predicted\n')
114 | for t, row in enumerate(DictReader(open(test))):
115 | Id = row['Id']
116 | del row['Id']
117 | x = get_x(row, D)
118 | p = get_p(x, w)
119 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
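Note: every get_p variant in these scripts clamps wTx to a fixed range before the sigmoid because math.exp overflows for very large arguments. A small standalone check of why that bound is there:

from math import exp

def bounded_sigmoid(z, bound=20.):
    z = max(min(z, bound), -bound)   # math.exp overflows once its argument passes ~709
    return 1. / (1. + exp(-z))

print(bounded_sigmoid(5.))       # ordinary case
print(bounded_sigmoid(1000.))    # clamped to +20: ~0.999999998 instead of an OverflowError
print(bounded_sigmoid(-1000.))   # clamped to -20: ~2.1e-09
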
/pythonSGD/py_lh_20Sep2014.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 | import scipy as sp
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 27 # number of weights used for learning
12 | alpha = .145 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | epsilon = 1e-15
25 | p = max(min(p, 1. - epsilon), epsilon)
26 | # p and y are scalars, so compute the bounded logloss with math.log
27 | ll = -log(p) if y == 1. else -log(1. - p)
28 | return ll
29 |
30 | # B. Apply hash trick of the original csv row
31 | # for simplicity, we treat both integer and categorical features as categorical
32 | # INPUT:
33 | # csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
34 | # D: the number of hash buckets; indices range from 0 to D-1
35 | # OUTPUT:
36 | # x: a list of indices whose feature value is 1
37 | def get_x(csv_row, D):
38 | x = [0] # 0 is the index of the bias term
39 | for key, value in csv_row.items():
40 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
41 | x.append(index)
42 | return x # x contains indices of features that have a value of 1
43 |
44 |
45 | # C. Get probability estimation on x
46 | # INPUT:
47 | # x: features
48 | # w: weights
49 | # OUTPUT:
50 | # probability of p(y = 1 | x; w)
51 | def get_p(x, w):
52 | wTx = 0.
53 | for i in x: # do wTx
54 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
55 | return 1. / (1. + exp(-max(min(wTx, 10.), -10.))) # bounded sigmoid
56 |
57 |
58 | # D. Update given model
59 | # INPUT:
60 | # w: weights
61 | # n: a counter that counts the number of times we encounter a feature
62 | # this is used for adaptive learning rate
63 | # x: feature
64 | # p: prediction of our model
65 | # y: answer
66 | # OUTPUT:
67 | # w: updated model
68 | # n: updated count
69 | def update_w(w, n, x, p, y):
70 | for i in x:
71 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
72 | # (p - y) * x[i] is the current gradient
73 | # note that in our case, if i in x then x[i] = 1
74 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
75 | n[i] += 1.
76 | return w, n
77 |
78 |
79 | # training and testing #######################################################
80 |
81 | # initialize our model
82 | w = [0.] * D # weights
83 | n = [0.] * D # number of times we've encountered a feature
84 |
86 | # start training a logistic regression model using one-pass sgd
86 | loss = 0.
87 | for t, row in enumerate(DictReader(open(train))):
88 | y = 1. if row['Label'] == '1' else 0.
89 |
90 | del row['Label'] # can't let the model peek the answer
91 | del row['Id'] # we don't need the Id
92 |
93 | # main training procedure
94 | # step 1, get the hashed features
95 | x = get_x(row, D)
96 |
97 | # step 2, get prediction
98 | p = get_p(x, w)
99 |
100 | # for progress validation, useless for learning our model
101 | loss += logloss(p, y)
102 | if t % 100000 == 0 and t > 1:
103 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
104 | datetime.now(), t, loss/t))
105 |
106 | # step 3, update model with answer
107 | if t <= 40000000:
108 | w, n = update_w(w, n, x, p, y)
109 |
110 | # testing (build kaggle's submission file)
111 | with open('submissionPython_21Sep2014.csv', 'w') as submission:
112 | submission.write('Id,Predicted\n')
113 | for t, row in enumerate(DictReader(open(test))):
114 | Id = row['Id']
115 | del row['Id']
116 | x = get_x(row, D)
117 | p = get_p(x, w)
118 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
1 | # Configuration file for CTR Prediction Project
2 | # =============================================
3 |
4 | # Project metadata
5 | project:
6 | name: "CTR Prediction"
7 | version: "2.0"
8 | description: "Click-Through Rate prediction for display advertising"
9 |
10 | # Data paths
11 | data:
12 | root_dir: "./data"
13 | train_file: "train.csv"
14 | test_file: "test.csv"
15 | sample_train_file: "train_sample.csv" # Optional: smaller dataset for testing
16 |
17 | # Processed data
18 | train_vw: "train.vw"
19 | test_vw: "test.vw"
20 |
21 | # Validation split (if doing train/val split)
22 | validation_split: 0.2
23 | random_seed: 42
24 |
25 | # Output paths
26 | output:
27 | root_dir: "./output"
28 | submission_file: "submission.csv"
29 | log_file: "training.log"
30 | model_dir: "./models"
31 |
32 | # Logistic Regression with SGD configuration
33 | logistic_regression:
34 | # Feature hashing
35 | dimension: 134217728 # 2^27 = 134,217,728 features
36 |
37 | # Optimization
38 | learning_rate: 0.145
39 | adaptive_learning: true
40 |
41 | # Training
42 | max_passes: 1
43 | log_interval: 1000000 # Log every 1M samples
44 |
45 | # Numerical stability
46 | epsilon: 1.0e-12
47 | sigmoid_bound: 20.0 # Bound input to sigmoid to [-20, 20]
48 |
49 | # Gradient Boosting Machine (GBM) configuration
50 | gbm:
51 | # Model hyperparameters
52 | n_trees: 500
53 | interaction_depth: 22
54 | shrinkage: 0.1
55 | min_samples_split: 10
56 |
57 | # Training
58 | cv_folds: 2
59 | metric: "ROC"
60 | verbose: true
61 |
62 | # Computation
63 | n_jobs: -1 # Use all CPU cores
64 | random_state: 888
65 |
66 | # Vowpal Wabbit configuration
67 | vowpal_wabbit:
68 | # Training options
69 | loss_function: "logistic"
70 | learning_rate: 0.5
71 | l1_lambda: 0.0
72 | l2_lambda: 0.0
73 |
74 | # Optimization
75 | passes: 3
76 | cache_file: "./data/train.cache"
77 |
78 | # Feature engineering
79 | quadratic: "" # e.g., "ii" for quadratic interactions in namespace i
80 | cubic: "" # e.g., "iii" for cubic interactions
81 |
82 | # Hashing
83 | bit_precision: 27 # 2^27 features
84 |
85 | # Output
86 | model_file: "./models/click.model.vw"
87 | predictions_file: "./output/vw_predictions.txt"
88 |
89 | # Data preprocessing
90 | preprocessing:
91 | # Missing value handling
92 | numerical_imputation: "median" # Options: mean, median, zero
93 | categorical_imputation: "mode" # Options: mode, unknown
94 |
95 | # Feature engineering
96 | create_interactions: false
97 | polynomial_features: false
98 | polynomial_degree: 2
99 |
100 | # Scaling (usually not needed for tree-based methods)
101 | scale_features: false
102 | scaler_type: "standard" # Options: standard, minmax, robust
103 |
104 | # Model evaluation
105 | evaluation:
106 | metrics:
107 | - "log_loss"
108 | - "auc_roc"
109 | - "accuracy"
110 | - "precision"
111 | - "recall"
112 |
113 | # Validation
114 | use_validation: false
115 | validation_size: 0.2
116 |
117 | # Logging configuration
118 | logging:
119 | level: "INFO" # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
120 | format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
121 | log_to_file: true
122 | log_to_console: true
123 |
124 | # Experiment tracking
125 | experiment:
126 | track_experiments: false
127 | experiment_name: "baseline"
128 | tags:
129 | - "logistic_regression"
130 | - "hash_trick"
131 |
132 | # Resource constraints
133 | resources:
134 | # Memory limits (in GB)
135 | max_memory_gb: 16
136 |
137 | # CPU
138 | n_processes: 4
139 |
140 | # GPU (if available)
141 | use_gpu: false
142 | gpu_device: 0
143 |
144 | # Reproducibility
145 | reproducibility:
146 | random_seed: 42
147 | deterministic: true
148 |
149 | # Development settings
150 | development:
151 | # Use smaller sample for quick testing
152 | use_sample: false
153 | sample_size: 1000000 # First 1M rows
154 |
155 | # Debug mode
156 | debug: false
157 | profile_code: false
158 |
159 | # Testing
160 | run_tests: true
161 | test_coverage_threshold: 0.8
162 |
--------------------------------------------------------------------------------
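Note: a minimal sketch of reading a few of the values defined in config.yaml from a Python entry point with PyYAML (pip install pyyaml); whether the modernized scripts already do this is not shown in this excerpt, so treat this purely as an example of the key layout above:

import yaml                                  # PyYAML: pip install pyyaml
from pathlib import Path

config = yaml.safe_load(Path("config.yaml").read_text())

dimension = config["logistic_regression"]["dimension"]          # 134217728, i.e. 2 ** 27
learning_rate = config["logistic_regression"]["learning_rate"]  # 0.145
train_csv = Path(config["data"]["root_dir"]) / config["data"]["train_file"]

print(dimension, learning_rate, train_csv)
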
/.github/workflows/ci.yml:
--------------------------------------------------------------------------------
1 | name: CI
2 |
3 | on:
4 | push:
5 | branches: [ master, dev-claude ]
6 | pull_request:
7 | branches: [ master ]
8 |
9 | jobs:
10 | test-python:
11 | name: Test Python ${{ matrix.python-version }}
12 | runs-on: ubuntu-latest
13 |
14 | strategy:
15 | matrix:
16 | python-version: ['3.8', '3.9', '3.10', '3.11']
17 |
18 | steps:
19 | - name: Checkout code
20 | uses: actions/checkout@v3
21 |
22 | - name: Set up Python ${{ matrix.python-version }}
23 | uses: actions/setup-python@v4
24 | with:
25 | python-version: ${{ matrix.python-version }}
26 |
27 | - name: Cache pip packages
28 | uses: actions/cache@v3
29 | with:
30 | path: ~/.cache/pip
31 | key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
32 | restore-keys: |
33 | ${{ runner.os }}-pip-
34 |
35 | - name: Install dependencies
36 | run: |
37 | python -m pip install --upgrade pip
38 | pip install -r requirements.txt
39 |
40 | - name: Lint with flake8
41 | run: |
42 | pip install flake8
43 | # Stop the build if there are Python syntax errors or undefined names
44 | flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
45 | # Exit-zero treats all errors as warnings
46 | flake8 . --count --exit-zero --max-complexity=10 --max-line-length=100 --statistics
47 |
48 | - name: Check code formatting with black
49 | run: |
50 | pip install black
51 | black --check .
52 |
53 | - name: Run tests with pytest
54 | run: |
55 | pytest tests/ --verbose
56 |
57 | - name: Generate coverage report
58 | if: matrix.python-version == '3.10'
59 | run: |
60 | pytest tests/ --cov=. --cov-report=xml --cov-report=html
61 |
62 | - name: Upload coverage to Codecov
63 | if: matrix.python-version == '3.10'
64 | uses: codecov/codecov-action@v3
65 | with:
66 | files: ./coverage.xml
67 | flags: unittests
68 | name: codecov-umbrella
69 |
70 | test-r:
71 | name: Test R Scripts
72 | runs-on: ubuntu-latest
73 |
74 | steps:
75 | - name: Checkout code
76 | uses: actions/checkout@v3
77 |
78 | - name: Set up R
79 | uses: r-lib/actions/setup-r@v2
80 | with:
81 | r-version: '4.2.0'
82 |
83 | - name: Install R dependencies
84 | run: |
85 | Rscript -e "install.packages(c('data.table', 'caret', 'gbm'), dependencies=TRUE, repos='https://cloud.r-project.org')"
86 |
87 | - name: Check R script syntax
88 | run: |
89 | Rscript -e "source('gbm_modernized.R', echo=TRUE)" || true
90 |
91 | lint:
92 | name: Code Quality Checks
93 | runs-on: ubuntu-latest
94 |
95 | steps:
96 | - name: Checkout code
97 | uses: actions/checkout@v3
98 |
99 | - name: Set up Python
100 | uses: actions/setup-python@v4
101 | with:
102 | python-version: '3.10'
103 |
104 | - name: Install linting tools
105 | run: |
106 | python -m pip install --upgrade pip
107 | pip install flake8 pylint black isort
108 |
109 | - name: Run flake8
110 | run: |
111 | flake8 py_lh4_modernized.py logReg_modernized.py csv_to_vw_modernized.py --max-line-length=100
112 |
113 | - name: Run pylint
114 | run: |
115 | pylint py_lh4_modernized.py logReg_modernized.py csv_to_vw_modernized.py --max-line-length=100 || true
116 |
117 | - name: Check import sorting
118 | run: |
119 | isort --check-only --diff .
120 |
121 | security:
122 | name: Security Scan
123 | runs-on: ubuntu-latest
124 |
125 | steps:
126 | - name: Checkout code
127 | uses: actions/checkout@v3
128 |
129 | - name: Set up Python
130 | uses: actions/setup-python@v4
131 | with:
132 | python-version: '3.10'
133 |
134 | - name: Install safety
135 | run: |
136 | python -m pip install --upgrade pip
137 | pip install safety
138 |
139 | - name: Run safety check
140 | run: |
141 | pip install -r requirements.txt
142 | safety check || true
143 |
144 | - name: Run bandit security linter
145 | run: |
146 | pip install bandit
147 | bandit -r . -f json -o bandit-report.json || true
148 |
149 | - name: Upload bandit report
150 | if: always()
151 | uses: actions/upload-artifact@v3
152 | with:
153 | name: bandit-security-report
154 | path: bandit-report.json
155 |
--------------------------------------------------------------------------------
/tests/test_lr_model.py:
--------------------------------------------------------------------------------
1 | """
2 | Tests for Logistic Regression model (py_lh4_modernized.py).
3 | """
4 |
5 | import pytest
6 | import numpy as np
7 | from pathlib import Path
8 | import sys
9 |
10 | # Add parent directory to path
11 | sys.path.insert(0, str(Path(__file__).parent.parent))
12 |
13 | from py_lh4_modernized import CTRPredictor
14 |
15 |
16 | class TestCTRPredictor:
17 | """Test suite for CTRPredictor class."""
18 |
19 | def test_initialization(self):
20 | """Test model initialization."""
21 | model = CTRPredictor(dimension=1000, learning_rate=0.1)
22 |
23 | assert model.D == 1000
24 | assert model.alpha == 0.1
25 | assert len(model.w) == 1000
26 | assert len(model.n) == 1000
27 | assert all(w == 0.0 for w in model.w)
28 | assert all(n == 0.0 for n in model.n)
29 |
30 | def test_logloss_positive_label(self):
31 | """Test logloss calculation for positive label."""
32 | loss = CTRPredictor.logloss(0.8, 1.0)
33 | expected = -np.log(0.8)
34 | assert np.isclose(loss, expected)
35 |
36 | def test_logloss_negative_label(self):
37 | """Test logloss calculation for negative label."""
38 | loss = CTRPredictor.logloss(0.2, 0.0)
39 | expected = -np.log(0.8)
40 | assert np.isclose(loss, expected)
41 |
42 | def test_logloss_boundary_values(self):
43 | """Test logloss with boundary values (close to 0 or 1)."""
44 | # Should not raise error or return inf
45 | loss1 = CTRPredictor.logloss(0.99999, 1.0)
46 | loss2 = CTRPredictor.logloss(0.00001, 0.0)
47 |
48 | assert not np.isinf(loss1)
49 | assert not np.isinf(loss2)
50 | assert loss1 > 0
51 | assert loss2 > 0
52 |
53 | def test_get_features_basic(self):
54 | """Test feature hashing."""
55 | model = CTRPredictor(dimension=1000)
56 |
57 | csv_row = {'I1': '5', 'I2': '10', 'C1': 'abc'}
58 | features = model.get_features(csv_row)
59 |
60 | # Should include bias term (index 0)
61 | assert 0 in features
62 | assert len(features) > 1
63 | # All indices should be within dimension
64 | assert all(0 <= idx < model.D for idx in features)
65 |
66 | def test_get_features_empty_values(self):
67 | """Test feature hashing with empty values."""
68 | model = CTRPredictor(dimension=1000)
69 |
70 | csv_row = {'I1': '5', 'I2': '', 'C1': 'abc'}
71 | features = model.get_features(csv_row)
72 |
73 | # Should only have bias and non-empty features
74 | assert 0 in features
75 | assert len(features) >= 1
76 |
77 | def test_predict_probability_range(self):
78 | """Test that predictions are in valid probability range."""
79 | model = CTRPredictor(dimension=100)
80 |
81 | # Random features
82 | features = [0, 5, 10, 25]
83 |
84 | prob = model.predict_probability(features)
85 |
86 | # Should be between 0 and 1
87 | assert 0.0 <= prob <= 1.0
88 |
89 | def test_predict_probability_with_weights(self):
90 | """Test prediction with non-zero weights."""
91 | model = CTRPredictor(dimension=100)
92 |
93 | # Set some weights
94 | model.w[0] = 1.0
95 | model.w[5] = 2.0
96 | model.w[10] = -1.5
97 |
98 | features = [0, 5, 10]
99 | prob = model.predict_probability(features)
100 |
101 | # Expected: sigmoid(1.0 + 2.0 - 1.5) = sigmoid(1.5)
102 | expected = 1.0 / (1.0 + np.exp(-1.5))
103 |
104 | assert np.isclose(prob, expected)
105 |
106 | def test_update_weights(self):
107 | """Test weight update."""
108 | model = CTRPredictor(dimension=100, learning_rate=0.1)
109 |
110 | features = [0, 5, 10]
111 | p = 0.7 # Prediction
112 | y = 1.0 # True label
113 |
114 | # Store initial weights
115 | initial_weights = [model.w[i] for i in features]
116 |
117 | # Update weights
118 | model.update_weights(features, p, y)
119 |
120 | # Weights should have changed
121 | for i, idx in enumerate(features):
122 | assert model.w[idx] != initial_weights[i]
123 |
124 | # Counters should have incremented
125 | for idx in features:
126 | assert model.n[idx] == 1.0
127 |
128 | def test_train_file_not_found(self):
129 | """Test training with non-existent file."""
130 | model = CTRPredictor(dimension=100)
131 |
132 | with pytest.raises(FileNotFoundError):
133 | model.train(Path("nonexistent_file.csv"))
134 |
135 | def test_predict_file_not_found(self):
136 | """Test prediction with non-existent file."""
137 | model = CTRPredictor(dimension=100)
138 |
139 | with pytest.raises(FileNotFoundError):
140 | model.predict(
141 | Path("nonexistent_test.csv"),
142 | Path("output.csv")
143 | )
144 |
145 | def test_train_with_temp_file(self, temp_csv_file):
146 | """Test training with a small temporary file."""
147 | model = CTRPredictor(dimension=100, learning_rate=0.1)
148 |
149 | # Should not raise error
150 | model.train(temp_csv_file)
151 |
152 | # Weights should have been updated
153 | assert any(w != 0.0 for w in model.w)
154 |
155 | def test_numerical_stability(self):
156 | """Test that extreme values don't cause overflow."""
157 | model = CTRPredictor(dimension=100)
158 |
159 | # Set extreme weights
160 | model.w[5] = 1000.0
161 | model.w[10] = -1000.0
162 |
163 | features = [5, 10]
164 | prob = model.predict_probability(features)
165 |
166 | # Should not be nan or inf
167 | assert not np.isnan(prob)
168 | assert not np.isinf(prob)
169 | assert 0.0 <= prob <= 1.0
170 |
171 |
172 | class TestEdgeCases:
173 | """Test edge cases and error handling."""
174 |
175 | def test_zero_dimension(self):
176 | """Test with zero dimension (should work but be useless)."""
177 | with pytest.raises(Exception):
178 | model = CTRPredictor(dimension=0)
179 |
180 | def test_negative_learning_rate(self):
181 | """Test with negative learning rate."""
182 | # Should still initialize (may behave oddly in training)
183 | model = CTRPredictor(dimension=100, learning_rate=-0.1)
184 | assert model.alpha == -0.1
185 |
186 | def test_very_large_dimension(self):
187 | """Test memory handling with large dimension."""
188 | # This might be slow or fail on memory-constrained systems
189 | # Using smaller value for testing
190 | try:
191 | model = CTRPredictor(dimension=10_000_000)
192 | assert len(model.w) == 10_000_000
193 | except MemoryError:
194 | pytest.skip("Insufficient memory for this test")
195 |
--------------------------------------------------------------------------------
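Note: several tests above (and in tests/test_vw_converter.py below) take a temp_csv_file fixture that is not reproduced in this excerpt; it is presumably provided by tests/conftest.py. A plausible minimal sketch of such a fixture, consistent with how the tests use it (three data rows, Id/Label plus I* and C* columns) but not necessarily identical to the real one:

import pytest

@pytest.fixture
def temp_csv_file(tmp_path):
    """Write a tiny Criteo-style CSV (three data rows) and return its path."""
    csv_path = tmp_path / "sample_train.csv"
    csv_path.write_text(
        "Id,Label,I1,I2,C1,C2\n"
        "1,1,5,10,abc,def\n"
        "2,0,3,,xyz,\n"
        "3,1,7,2,abc,ghi\n"
    )
    return csv_path
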
/gbm_modernized.R:
--------------------------------------------------------------------------------
1 | # Gradient Boosting Machine (GBM) for CTR Prediction
2 | # Modernized version with configurable paths and better structure
3 |
4 | # ============================================================================
5 | # Configuration
6 | # ============================================================================
7 |
8 | # Use environment variable or default to current directory
9 | DATA_DIR <- Sys.getenv("DATA_DIR", default = "./data")
10 | OUTPUT_DIR <- Sys.getenv("OUTPUT_DIR", default = "./output")
11 |
12 | # File paths (relative to DATA_DIR)
13 | TRAIN_FILE <- file.path(DATA_DIR, "train_num_na_yesno.csv")
14 | TEST_FILE <- file.path(DATA_DIR, "test_num_impute.csv")
15 | OUTPUT_FILE <- file.path(OUTPUT_DIR, "submit_gbm_num_nona_impute.csv")
16 |
17 | # Model hyperparameters
18 | N_TREES <- 500
19 | INTERACTION_DEPTH <- 22
20 | SHRINKAGE <- 0.1
21 | N_CV_FOLDS <- 2
22 | SEED <- 888
23 |
24 | # ============================================================================
25 | # Setup
26 | # ============================================================================
27 |
28 | cat("==========================================================\n")
29 | cat("GBM CTR Prediction Model\n")
30 | cat("==========================================================\n")
31 | cat(sprintf("Data directory: %s\n", DATA_DIR))
32 | cat(sprintf("Output directory: %s\n", OUTPUT_DIR))
33 | cat(sprintf("Training file: %s\n", TRAIN_FILE))
34 | cat(sprintf("Test file: %s\n", TEST_FILE))
35 | cat("==========================================================\n\n")
36 |
37 | # Load required libraries
38 | required_packages <- c("data.table", "caret", "gbm")
39 |
40 | for (pkg in required_packages) {
41 | if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
42 | cat(sprintf("Installing package: %s\n", pkg))
43 | install.packages(pkg, dependencies = TRUE)
44 | library(pkg, character.only = TRUE)
45 | }
46 | }
47 |
48 | # Create output directory if it doesn't exist
49 | if (!dir.exists(OUTPUT_DIR)) {
50 | dir.create(OUTPUT_DIR, recursive = TRUE)
51 | cat(sprintf("Created output directory: %s\n", OUTPUT_DIR))
52 | }
53 |
54 | # ============================================================================
55 | # Data Loading and Preparation
56 | # ============================================================================
57 |
58 | cat("\n[1/5] Loading training data...\n")
59 |
60 | if (!file.exists(TRAIN_FILE)) {
61 | stop(sprintf("Training file not found: %s\nPlease prepare the data first.", TRAIN_FILE))
62 | }
63 |
64 | train <- read.csv(TRAIN_FILE, stringsAsFactors = TRUE)
65 |
66 | # Remove ID column if present
67 | if ("Id" %in% colnames(train) || "id" %in% colnames(train)) {
68 | train <- train[, !(colnames(train) %in% c("Id", "id"))]
69 | }
70 |
71 | cat(sprintf(" Loaded %d samples with %d features\n", nrow(train), ncol(train) - 1))
72 | cat(sprintf(" Target variable: Label\n"))
73 | cat(sprintf(" Class distribution:\n"))
74 | print(table(train$Label))
75 |
76 | # ============================================================================
77 | # Model Training
78 | # ============================================================================
79 |
80 | cat("\n[2/5] Training GBM model...\n")
81 | cat(sprintf(" Trees: %d\n", N_TREES))
82 | cat(sprintf(" Interaction depth: %d\n", INTERACTION_DEPTH))
83 | cat(sprintf(" Shrinkage: %.3f\n", SHRINKAGE))
84 | cat(sprintf(" CV folds: %d\n", N_CV_FOLDS))
85 |
86 | # Set seed for reproducibility
87 | set.seed(SEED)
88 |
89 | # Define training grid
90 | gbm_grid <- expand.grid(
91 | n.trees = N_TREES,
92 | interaction.depth = INTERACTION_DEPTH,
93 | shrinkage = SHRINKAGE,
94 | n.minobsinnode = 10
95 | )
96 |
97 | # Define training control
98 | fit_control <- trainControl(
99 | method = "cv",
100 | number = N_CV_FOLDS,
101 | classProbs = TRUE,
102 | summaryFunction = twoClassSummary,
103 | allowParallel = TRUE
104 | )
105 |
106 | # Train model
107 | start_time <- Sys.time()
108 |
109 | gbm_fit <- train(
110 | Label ~ .,
111 | data = train,
112 | method = "gbm",
113 | trControl = fit_control,
114 | tuneGrid = gbm_grid,
115 | metric = "ROC",
116 | verbose = TRUE
117 | )
118 |
119 | end_time <- Sys.time()
120 | training_time <- difftime(end_time, start_time, units = "mins")
121 |
122 | cat(sprintf("\nTraining completed in %.2f minutes\n", training_time))
123 |
124 | # ============================================================================
125 | # Model Evaluation
126 | # ============================================================================
127 |
128 | cat("\n[3/5] Evaluating model on training data...\n")
129 |
130 | train_pred <- predict(gbm_fit, train)
131 | conf_matrix <- confusionMatrix(train_pred, train$Label)
132 |
133 | print(conf_matrix)
134 |
135 | # ============================================================================
136 | # Load Test Data and Generate Predictions
137 | # ============================================================================
138 |
139 | cat("\n[4/5] Loading test data and generating predictions...\n")
140 |
141 | if (!file.exists(TEST_FILE)) {
142 | warning(sprintf("Test file not found: %s\nSkipping predictions.", TEST_FILE))
143 | } else {
144 | test <- read.csv(TEST_FILE, stringsAsFactors = TRUE)
145 |
146 | # Keep ID column for submission
147 | test_id <- test$Id
148 |
149 | cat(sprintf(" Loaded %d test samples\n", nrow(test)))
150 |
151 | # Generate predictions (probabilities)
152 | test_pred_prob <- predict(gbm_fit, test, type = "prob")
153 |
154 | # Create submission dataframe
155 | submission <- data.frame(
156 | Id = test_id,
157 | Predicted = test_pred_prob[, "Yes"] # Probability of positive class
158 | )
159 |
160 | # ============================================================================
161 | # Save Predictions
162 | # ============================================================================
163 |
164 | cat("\n[5/5] Saving predictions...\n")
165 |
166 | write.csv(
167 | submission,
168 | OUTPUT_FILE,
169 | row.names = FALSE,
170 | quote = FALSE
171 | )
172 |
173 | cat(sprintf(" Predictions saved to: %s\n", OUTPUT_FILE))
174 | cat(sprintf(" Submission format: %d rows, 2 columns (Id, Predicted)\n", nrow(submission)))
175 | }
176 |
177 | # ============================================================================
178 | # Summary
179 | # ============================================================================
180 |
181 | cat("\n==========================================================\n")
182 | cat("Model Training Summary\n")
183 | cat("==========================================================\n")
184 | cat(sprintf("Training samples: %d\n", nrow(train)))
185 | cat(sprintf("Features: %d\n", ncol(train) - 1))
186 | cat(sprintf("Training time: %.2f minutes\n", training_time))
187 | cat(sprintf("Training accuracy: %.4f\n", conf_matrix$overall['Accuracy']))
188 | cat(sprintf("Model saved: gbm_fit\n"))
189 | cat("==========================================================\n")
190 | cat("\nDone!\n")
191 |
192 | # ============================================================================
193 | # Optional: Save model object
194 | # ============================================================================
195 |
196 | # Uncomment to save the trained model for later use
197 | # saveRDS(gbm_fit, file.path(OUTPUT_DIR, "gbm_model.rds"))
198 | # cat(sprintf("Model object saved to: %s\n", file.path(OUTPUT_DIR, "gbm_model.rds")))
199 |
200 | # To load the model later:
201 | # gbm_fit <- readRDS(file.path(OUTPUT_DIR, "gbm_model.rds"))
202 |
--------------------------------------------------------------------------------
/tests/test_vw_converter.py:
--------------------------------------------------------------------------------
1 | """
2 | Tests for CSV to Vowpal Wabbit converter (csv_to_vw_modernized.py).
3 | """
4 |
5 | import pytest
6 | from pathlib import Path
7 | import sys
8 |
9 | # Add parent directory to path
10 | sys.path.insert(0, str(Path(__file__).parent.parent))
11 |
12 | from csv_to_vw_modernized import _convert_row_to_vw, csv_to_vw
13 |
14 |
15 | class TestConvertRowToVW:
16 | """Test suite for row conversion function."""
17 |
18 | def test_convert_train_row_basic(self):
19 | """Test converting a basic training row."""
20 | row = {
21 | 'Id': '123',
22 | 'Label': '1',
23 | 'I1': '5',
24 | 'I2': '10',
25 | 'C1': 'abc',
26 | 'C2': 'def'
27 | }
28 |
29 | vw_line = _convert_row_to_vw(row, is_train=True)
30 |
31 | # Should start with label (1) and tag ('123)
32 | assert vw_line.startswith("1 '123")
33 |
34 | # Should contain numerical namespace
35 | assert '|i' in vw_line
36 |
37 | # Should contain categorical namespace
38 | assert '|c' in vw_line
39 |
40 | # Should contain feature values
41 | assert 'I1:5' in vw_line
42 | assert 'I2:10' in vw_line
43 | assert 'abc' in vw_line
44 | assert 'def' in vw_line
45 |
46 | def test_convert_train_row_negative_label(self):
47 | """Test converting row with negative label."""
48 | row = {
49 | 'Id': '456',
50 | 'Label': '0',
51 | 'I1': '3',
52 | 'C1': 'xyz'
53 | }
54 |
55 | vw_line = _convert_row_to_vw(row, is_train=True)
56 |
57 | # Label should be -1 for VW format
58 | assert vw_line.startswith("-1 '456")
59 |
60 | def test_convert_test_row(self):
61 | """Test converting a test row (no label)."""
62 | row = {
63 | 'Id': '789',
64 | 'I1': '7',
65 | 'C1': 'test'
66 | }
67 |
68 | vw_line = _convert_row_to_vw(row, is_train=False)
69 |
70 | # Test rows get dummy label 1
71 | assert vw_line.startswith("1 '789")
72 | assert 'I1:7' in vw_line
73 | assert 'test' in vw_line
74 |
75 | def test_convert_row_with_missing_values(self):
76 | """Test converting row with missing values."""
77 | row = {
78 | 'Id': '111',
79 | 'Label': '1',
80 | 'I1': '5',
81 | 'I2': '', # Missing
82 | 'C1': 'abc',
83 | 'C2': '' # Missing
84 | }
85 |
86 | vw_line = _convert_row_to_vw(row, is_train=True)
87 |
88 | # Should include non-empty features
89 | assert 'I1:5' in vw_line
90 | assert 'abc' in vw_line
91 |
92 | # Should not include empty features
93 | assert 'I2:' not in vw_line
94 |
95 | def test_convert_row_with_spaces(self):
96 | """Test converting row with whitespace."""
97 | row = {
98 | 'Id': '222',
99 | 'Label': '0',
100 | 'I1': ' 5 ', # With spaces
101 | 'C1': ' abc '
102 | }
103 |
104 | vw_line = _convert_row_to_vw(row, is_train=True)
105 |
106 | # Spaces should be handled (features present)
107 | assert '|i' in vw_line
108 | assert '|c' in vw_line
109 |
110 | def test_convert_row_empty_namespaces(self):
111 | """Test converting row with all values empty in a namespace."""
112 | row = {
113 | 'Id': '333',
114 | 'Label': '1',
115 | 'I1': '',
116 | 'I2': '',
117 | 'C1': 'abc'
118 | }
119 |
120 | vw_line = _convert_row_to_vw(row, is_train=True)
121 |
122 | # Should still have namespace markers
123 | assert '|i' in vw_line
124 | assert '|c' in vw_line
125 |
126 |
127 | class TestCSVToVWFunction:
128 | """Test suite for full CSV to VW conversion."""
129 |
130 | def test_csv_to_vw_file_not_found(self, tmp_path):
131 | """Test with non-existent input file."""
132 | csv_path = tmp_path / "nonexistent.csv"
133 | output_path = tmp_path / "output.vw"
134 |
135 | with pytest.raises(FileNotFoundError):
136 | csv_to_vw(csv_path, output_path)
137 |
138 | def test_csv_to_vw_basic(self, temp_csv_file, tmp_path):
139 | """Test basic CSV to VW conversion."""
140 | output_path = tmp_path / "output.vw"
141 |
142 | # Should not raise error
143 | csv_to_vw(temp_csv_file, output_path, is_train=True)
144 |
145 | # Output file should exist
146 | assert output_path.exists()
147 |
148 | # Check output content
149 | with open(output_path, 'r') as f:
150 | lines = f.readlines()
151 |
152 | # Should have same number of lines as CSV (minus header)
153 | assert len(lines) == 3 # 3 data rows from fixture
154 |
155 | # Each line should be VW format
156 | for line in lines:
157 | assert '|i' in line
158 | assert '|c' in line
159 |
160 | def test_csv_to_vw_test_mode(self, temp_csv_file, tmp_path):
161 | """Test CSV to VW conversion in test mode."""
162 | output_path = tmp_path / "output.vw"
163 |
164 | csv_to_vw(temp_csv_file, output_path, is_train=False)
165 |
166 | # Output file should exist
167 | assert output_path.exists()
168 |
169 | with open(output_path, 'r') as f:
170 | lines = f.readlines()
171 |
172 | # In test mode, all labels should be 1
173 | for line in lines:
174 | assert line.startswith('1 ')
175 |
176 | def test_csv_to_vw_creates_output_file(self, temp_csv_file, tmp_path):
177 | """Test that output file is created correctly."""
178 | output_path = tmp_path / "nested" / "dir" / "output.vw"
179 |
180 | # Parent directories don't exist
181 | assert not output_path.parent.exists()
182 |
183 | # This should fail since we don't create parent dirs
184 | # (could be enhanced to create them)
185 | with pytest.raises(FileNotFoundError):
186 | csv_to_vw(temp_csv_file, output_path)
187 |
188 |
189 | class TestEdgeCases:
190 | """Test edge cases and error handling."""
191 |
192 | def test_malformed_csv(self, tmp_path):
193 | """Test handling of malformed CSV."""
194 | csv_file = tmp_path / "malformed.csv"
195 |
196 | # Create malformed CSV
197 | with open(csv_file, 'w') as f:
198 | f.write("Id,Label,I1\n")
199 | f.write("1,0,5\n")
200 | f.write("2,1\n") # Missing column
201 |
202 | output_file = tmp_path / "output.vw"
203 |
204 | # Should handle gracefully
205 | csv_to_vw(csv_file, output_file, is_train=True)
206 |
207 | # Output should still be created
208 | assert output_file.exists()
209 |
210 | def test_empty_csv(self, tmp_path):
211 | """Test handling of empty CSV."""
212 | csv_file = tmp_path / "empty.csv"
213 |
214 | # Create empty CSV (header only)
215 | with open(csv_file, 'w') as f:
216 | f.write("Id,Label,I1,C1\n")
217 |
218 | output_file = tmp_path / "output.vw"
219 |
220 | # Should not raise error
221 | csv_to_vw(csv_file, output_file, is_train=True)
222 |
223 | # Output should be empty
224 | with open(output_file, 'r') as f:
225 | lines = f.readlines()
226 |
227 | assert len(lines) == 0
228 |
--------------------------------------------------------------------------------
/csv_to_vw_modernized.py:
--------------------------------------------------------------------------------
1 | """
2 | CSV to Vowpal Wabbit Format Converter
3 |
4 | Converts Criteo CTR dataset from CSV format to Vowpal Wabbit format.
5 |
6 | Original credit:
7 | - __Author__: Triskelion
8 | - Credit: Zygmunt Zając
9 |
10 | Modernized with:
11 | - Python 3 compatibility
12 | - Error handling and logging
13 | - Progress reporting
14 | - Configurable paths
15 | - Better performance
16 | """
17 |
18 | import logging
19 | import sys
20 | from datetime import datetime
21 | from csv import DictReader
22 | from pathlib import Path
23 | from typing import Optional
24 |
25 | # Configure logging
26 | logging.basicConfig(
27 | level=logging.INFO,
28 | format='%(asctime)s - %(levelname)s - %(message)s'
29 | )
30 | logger = logging.getLogger(__name__)
31 |
32 |
33 | def csv_to_vw(
34 | csv_path: Path,
35 | output_path: Path,
36 | is_train: bool = True,
37 | report_interval: int = 1_000_000
38 | ) -> None:
39 | """
40 | Convert CSV file to Vowpal Wabbit format.
41 |
42 | Vowpal Wabbit format:
43 | [label] ['tag] |namespace features
44 |
45 | Example train:
46 | 1 'id123 |i I1:5 I2:10 |c C1 C2 C3
47 |
48 | Example test:
49 | 1 'id456 |i I1:3 |c C5 C6
50 |
51 | Args:
52 | csv_path: Path to input CSV file
53 | output_path: Path to output VW file
54 | is_train: Whether this is training data (includes labels)
55 | report_interval: How often to log progress
56 |
57 | Raises:
58 | FileNotFoundError: If CSV file doesn't exist
59 | ValueError: If CSV format is invalid
60 | """
61 | if not csv_path.exists():
62 | raise FileNotFoundError(f"CSV file not found: {csv_path}")
63 |
64 | start_time = datetime.now()
65 |
66 | logger.info("=" * 80)
67 | logger.info(f"Converting CSV to Vowpal Wabbit format")
68 | logger.info(f" Input: {csv_path}")
69 | logger.info(f" Output: {output_path}")
70 | logger.info(f" Mode: {'Training' if is_train else 'Testing'}")
71 | logger.info("=" * 80)
72 |
73 | row_count = 0
74 |
75 | try:
76 | with open(csv_path, 'r', encoding='utf-8') as csv_file, \
77 | open(output_path, 'w', encoding='utf-8') as vw_file:
78 |
79 | reader = DictReader(csv_file)
80 |
81 | # Validate required fields
82 | if reader.fieldnames:
83 | if 'Id' not in reader.fieldnames:
84 | raise ValueError("CSV must contain 'Id' column")
85 | if is_train and 'Label' not in reader.fieldnames:
86 | raise ValueError("Training CSV must contain 'Label' column")
87 |
88 | for row_count, row in enumerate(reader, start=1):
89 | try:
90 | # Create VW format line
91 | vw_line = _convert_row_to_vw(row, is_train)
92 | vw_file.write(vw_line + '\n')
93 |
94 | except Exception as e:
95 | logger.warning(f"Error processing row {row_count}: {e}")
96 | continue
97 |
98 | # Report progress
99 | if row_count % report_interval == 0:
100 | elapsed = datetime.now() - start_time
101 | rate = row_count / elapsed.total_seconds()
102 | logger.info(
103 | f"Processed {row_count:,} rows | "
104 | f"Elapsed: {elapsed} | "
105 | f"Rate: {rate:.0f} rows/sec"
106 | )
107 |
108 | except Exception as e:
109 | logger.error(f"Error during conversion: {e}")
110 | raise
111 |
112 | elapsed = datetime.now() - start_time
113 | logger.info("=" * 80)
114 | logger.info(f"Conversion completed!")
115 | logger.info(f" Rows processed: {row_count:,}")
116 | logger.info(f" Total time: {elapsed}")
117 |     logger.info(f"  Average rate: {row_count / max(elapsed.total_seconds(), 1e-9):.0f} rows/sec")
118 | logger.info("=" * 80)
119 |
120 |
121 | def _convert_row_to_vw(row: dict, is_train: bool) -> str:
122 | """
123 | Convert a single CSV row to Vowpal Wabbit format.
124 |
125 | Args:
126 | row: Dictionary from CSV DictReader
127 | is_train: Whether this is training data
128 |
129 | Returns:
130 | VW formatted string (without newline)
131 | """
132 | # Extract label and ID
133 | row_id = row.get('Id', 'unknown')
134 |
135 | if is_train:
136 | # VW uses 1 for positive, -1 for negative
137 | label = 1 if row.get('Label') == '1' else -1
138 | else:
139 | # Test data: use dummy label 1
140 | label = 1
141 |
142 | # Separate numerical and categorical features
143 | numerical_features = []
144 | categorical_features = []
145 |
146 | for key, value in row.items():
147 | # Skip label and ID
148 | if key in ['Label', 'Id']:
149 | continue
150 |
151 |         # Trim whitespace and skip empty values so VW tokens stay well-formed
152 |         value = (value or '').strip()
153 |         if not value:
154 |             continue
155 | # Numerical features (start with 'I')
156 | if key.startswith('I'):
157 | numerical_features.append(f"{key}:{value}")
158 |
159 | # Categorical features (start with 'C')
160 | elif key.startswith('C'):
161 | # For categorical, just use the value (no key:value format)
162 | categorical_features.append(value)
163 |
164 | # Build VW format line
165 | # Format: label 'tag |namespace1 features1 |namespace2 features2
166 | numerical_str = ' '.join(numerical_features) if numerical_features else ''
167 | categorical_str = ' '.join(categorical_features) if categorical_features else ''
168 |
169 | vw_line = f"{label} '{row_id} |i {numerical_str} |c {categorical_str}"
170 |
171 | return vw_line
172 |
173 |
174 | def main():
175 | """Main execution function."""
176 | import argparse
177 |
178 | parser = argparse.ArgumentParser(
179 | description='Convert Criteo CSV to Vowpal Wabbit format'
180 | )
181 | parser.add_argument(
182 | 'input',
183 | type=str,
184 | help='Input CSV file path'
185 | )
186 | parser.add_argument(
187 | 'output',
188 | type=str,
189 | help='Output VW file path'
190 | )
191 | parser.add_argument(
192 | '--test',
193 | action='store_true',
194 | help='Convert test data (no labels)'
195 | )
196 | parser.add_argument(
197 | '--interval',
198 | type=int,
199 | default=1_000_000,
200 | help='Progress report interval (default: 1,000,000)'
201 | )
202 |
203 | args = parser.parse_args()
204 |
205 | # Convert paths
206 | csv_path = Path(args.input)
207 | output_path = Path(args.output)
208 |
209 | # Check if input exists
210 | if not csv_path.exists():
211 | logger.error(f"Input file not found: {csv_path}")
212 | sys.exit(1)
213 |
214 | # Warn if output exists
215 | if output_path.exists():
216 | logger.warning(f"Output file exists and will be overwritten: {output_path}")
217 |
218 | try:
219 | # Perform conversion
220 | csv_to_vw(
221 | csv_path=csv_path,
222 | output_path=output_path,
223 | is_train=not args.test,
224 | report_interval=args.interval
225 | )
226 |
227 | logger.info("Success!")
228 |
229 | except Exception as e:
230 | logger.error(f"Conversion failed: {e}")
231 | sys.exit(1)
232 |
233 |
234 | if __name__ == '__main__':
235 | # Example usage (uncomment to run):
236 | # csv_to_vw(
237 | # csv_path=Path('train.csv'),
238 | # output_path=Path('click.train.vw'),
239 | # is_train=True
240 | # )
241 | # csv_to_vw(
242 | # csv_path=Path('test.csv'),
243 | # output_path=Path('click.test.vw'),
244 | # is_train=False
245 | # )
246 |
247 | main()
248 |
--------------------------------------------------------------------------------
/scripts/download_data.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | """
3 | Data Download Helper for Criteo CTR Dataset
4 |
5 | This script provides instructions and helpers for downloading the Criteo
6 | Display Advertising Challenge dataset from Kaggle.
7 |
8 | Note: Kaggle API credentials are required for automated download.
9 | """
10 |
11 | import logging
12 | import sys
13 | from pathlib import Path
14 | import subprocess
15 |
16 | # Configure logging
17 | logging.basicConfig(
18 | level=logging.INFO,
19 | format='%(asctime)s - %(levelname)s - %(message)s'
20 | )
21 | logger = logging.getLogger(__name__)
22 |
23 |
24 | def check_kaggle_api() -> bool:
25 | """
26 | Check if Kaggle API is installed and configured.
27 |
28 | Returns:
29 | True if Kaggle API is available, False otherwise
30 | """
31 | try:
32 | import kaggle
33 | return True
34 |     except (ImportError, OSError):  # kaggle may raise OSError at import if credentials are missing
35 | return False
36 |
37 |
38 | def print_manual_instructions():
39 | """Print manual download instructions."""
40 | logger.info("=" * 80)
41 | logger.info("Manual Download Instructions")
42 | logger.info("=" * 80)
43 | print("""
44 | To download the Criteo CTR dataset manually:
45 |
46 | 1. Visit the Kaggle competition page:
47 | https://www.kaggle.com/c/criteo-display-ad-challenge/data
48 |
49 | 2. Accept the competition rules (you must have a Kaggle account)
50 |
51 | 3. Download the following files:
52 | - train.csv.gz (~11 GB compressed, ~40 GB uncompressed)
53 | - test.csv.gz (~2 GB compressed, ~6 GB uncompressed)
54 |
55 | 4. Extract the files:
56 | gunzip train.csv.gz
57 | gunzip test.csv.gz
58 |
59 | 5. Move the files to the data/ directory:
60 | mv train.csv data/
61 | mv test.csv data/
62 |
63 | 6. Verify the files:
64 | python scripts/verify_data.py
65 | """)
66 |
67 |
68 | def print_kaggle_api_setup():
69 | """Print Kaggle API setup instructions."""
70 | logger.info("=" * 80)
71 | logger.info("Kaggle API Setup Instructions")
72 | logger.info("=" * 80)
73 | print("""
74 | To use the Kaggle API for automated downloads:
75 |
76 | 1. Install the Kaggle API:
77 | pip install kaggle
78 |
79 | 2. Get your Kaggle API credentials:
80 | a. Go to https://www.kaggle.com/account
81 | b. Scroll to "API" section
82 | c. Click "Create New API Token"
83 | d. This downloads kaggle.json
84 |
85 | 3. Place kaggle.json in the correct location:
86 | # Linux/macOS
87 | mkdir -p ~/.kaggle
88 | mv ~/Downloads/kaggle.json ~/.kaggle/
89 | chmod 600 ~/.kaggle/kaggle.json
90 |
91 | # Windows
92 | mkdir %USERPROFILE%\\.kaggle
93 | move %USERPROFILE%\\Downloads\\kaggle.json %USERPROFILE%\\.kaggle\\
94 |
95 | 4. Run this script again to download data automatically
96 | """)
97 |
98 |
99 | def download_with_kaggle_api(data_dir: Path) -> bool:
100 | """
101 | Download dataset using Kaggle API.
102 |
103 | Args:
104 | data_dir: Directory to save data
105 |
106 | Returns:
107 | True if successful, False otherwise
108 | """
109 | try:
110 | logger.info("Downloading data using Kaggle API...")
111 | logger.info("This may take a while (files are ~13 GB compressed)...")
112 |
113 | # Import here to handle case where it's not installed
114 | import kaggle
115 |
116 | # Create data directory
117 | data_dir.mkdir(parents=True, exist_ok=True)
118 |
119 | # Download dataset files
120 | logger.info("Downloading train.csv.gz...")
121 | kaggle.api.competition_download_file(
122 | 'criteo-display-ad-challenge',
123 | 'train.csv.gz',
124 | path=str(data_dir)
125 | )
126 |
127 | logger.info("Downloading test.csv.gz...")
128 | kaggle.api.competition_download_file(
129 | 'criteo-display-ad-challenge',
130 | 'test.csv.gz',
131 | path=str(data_dir)
132 | )
133 |
134 | logger.info("Download complete!")
135 | logger.info("Extracting files...")
136 |
137 | # Extract files
138 | import gzip
139 | import shutil
140 |
141 | # Extract train.csv
142 | train_gz = data_dir / 'train.csv.gz'
143 | train_csv = data_dir / 'train.csv'
144 |
145 | if train_gz.exists():
146 | logger.info("Extracting train.csv.gz...")
147 | with gzip.open(train_gz, 'rb') as f_in:
148 | with open(train_csv, 'wb') as f_out:
149 | shutil.copyfileobj(f_in, f_out)
150 | logger.info(f"Extracted to {train_csv}")
151 |
152 | # Extract test.csv
153 | test_gz = data_dir / 'test.csv.gz'
154 | test_csv = data_dir / 'test.csv'
155 |
156 | if test_gz.exists():
157 | logger.info("Extracting test.csv.gz...")
158 | with gzip.open(test_gz, 'rb') as f_in:
159 | with open(test_csv, 'wb') as f_out:
160 | shutil.copyfileobj(f_in, f_out)
161 | logger.info(f"Extracted to {test_csv}")
162 |
163 | logger.info("=" * 80)
164 | logger.info("Dataset downloaded and extracted successfully!")
165 | logger.info(f" Train: {train_csv}")
166 | logger.info(f" Test: {test_csv}")
167 | logger.info("=" * 80)
168 |
169 | return True
170 |
171 | except Exception as e:
172 | logger.error(f"Error downloading data: {e}")
173 | return False
174 |
175 |
176 | def create_sample_data(data_dir: Path, sample_size: int = 1_000_000):
177 | """
178 | Create a smaller sample dataset for testing.
179 |
180 | Args:
181 | data_dir: Directory containing data
182 | sample_size: Number of rows to include in sample
183 | """
184 | train_file = data_dir / 'train.csv'
185 | sample_file = data_dir / 'train_sample.csv'
186 |
187 | if not train_file.exists():
188 | logger.error(f"Train file not found: {train_file}")
189 | return
190 |
191 | logger.info(f"Creating sample dataset with {sample_size:,} rows...")
192 |
193 | try:
194 | # Use head command for efficiency
195 | result = subprocess.run(
196 | ['head', '-n', str(sample_size + 1), str(train_file)],
197 | capture_output=True,
198 | text=True,
199 | check=True
200 | )
201 |
202 | with open(sample_file, 'w') as f:
203 | f.write(result.stdout)
204 |
205 | logger.info(f"Sample dataset created: {sample_file}")
206 |
207 | except Exception as e:
208 | logger.error(f"Error creating sample: {e}")
209 |
210 |
211 | def verify_data(data_dir: Path):
212 | """
213 | Verify that data files exist and have reasonable size.
214 |
215 | Args:
216 | data_dir: Directory containing data
217 | """
218 | logger.info("Verifying data files...")
219 |
220 | train_file = data_dir / 'train.csv'
221 | test_file = data_dir / 'test.csv'
222 |
223 | if train_file.exists():
224 | size_gb = train_file.stat().st_size / (1024 ** 3)
225 | logger.info(f"✓ train.csv exists ({size_gb:.2f} GB)")
226 |
227 | # Expected: ~40 GB
228 | if size_gb < 30 or size_gb > 50:
229 | logger.warning(
230 | f"Warning: train.csv size ({size_gb:.2f} GB) outside expected range (30-50 GB)"
231 | )
232 | else:
233 | logger.error("✗ train.csv not found")
234 |
235 | if test_file.exists():
236 | size_gb = test_file.stat().st_size / (1024 ** 3)
237 | logger.info(f"✓ test.csv exists ({size_gb:.2f} GB)")
238 |
239 | # Expected: ~6 GB
240 | if size_gb < 4 or size_gb > 8:
241 | logger.warning(
242 | f"Warning: test.csv size ({size_gb:.2f} GB) outside expected range (4-8 GB)"
243 | )
244 | else:
245 | logger.error("✗ test.csv not found")
246 |
247 |
248 | def main():
249 | """Main execution function."""
250 | import argparse
251 |
252 | parser = argparse.ArgumentParser(
253 | description='Download Criteo CTR dataset'
254 | )
255 | parser.add_argument(
256 | '--data-dir',
257 | type=str,
258 | default='./data',
259 | help='Directory to save data (default: ./data)'
260 | )
261 | parser.add_argument(
262 | '--sample',
263 | action='store_true',
264 | help='Create a sample dataset after downloading'
265 | )
266 | parser.add_argument(
267 | '--sample-size',
268 | type=int,
269 | default=1_000_000,
270 | help='Sample size in rows (default: 1,000,000)'
271 | )
272 | parser.add_argument(
273 | '--verify-only',
274 | action='store_true',
275 | help='Only verify existing data files'
276 | )
277 |
278 | args = parser.parse_args()
279 | data_dir = Path(args.data_dir)
280 |
281 | # Create data directory
282 | data_dir.mkdir(parents=True, exist_ok=True)
283 |
284 | logger.info("=" * 80)
285 | logger.info("Criteo CTR Dataset Downloader")
286 | logger.info("=" * 80)
287 |
288 | # Verify only mode
289 | if args.verify_only:
290 | verify_data(data_dir)
291 | return
292 |
293 | # Check if Kaggle API is available
294 | has_kaggle = check_kaggle_api()
295 |
296 | if has_kaggle:
297 | logger.info("Kaggle API detected!")
298 | response = input("Download data using Kaggle API? (y/n): ")
299 |
300 | if response.lower() == 'y':
301 | success = download_with_kaggle_api(data_dir)
302 |
303 | if success:
304 | if args.sample:
305 | create_sample_data(data_dir, args.sample_size)
306 | return
307 | else:
308 | logger.warning("Kaggle API not installed or not configured")
309 | print_kaggle_api_setup()
310 | print()
311 |
312 | # Fall back to manual instructions
313 | print_manual_instructions()
314 |
315 | # Verify if data already exists
316 | logger.info("\nChecking for existing data files...")
317 | verify_data(data_dir)
318 |
319 |
320 | if __name__ == '__main__':
321 | main()
322 |
--------------------------------------------------------------------------------
/py_lh4_modernized.py:
--------------------------------------------------------------------------------
1 | """
2 | Logistic Regression with Stochastic Gradient Descent (SGD) for CTR Prediction
3 |
4 | This module implements a memory-efficient logistic regression model using:
5 | - Hash trick for feature engineering (2^27 dimensional space)
6 | - Adaptive learning rate for SGD optimization
7 | - Bounded numerical operations for stability
8 |
9 | Modernized from original py_lh4.py with:
10 | - Python 3 compatibility
11 | - Error handling and logging
12 | - Input validation
13 | - Configurable parameters
14 | - Type hints
15 | """
16 |
17 | import logging
18 | import sys
19 | from csv import DictReader
20 | from math import exp, log, sqrt
21 | from pathlib import Path
22 | from typing import Dict, List
23 |
24 | # Configure logging
25 | logging.basicConfig(
26 | level=logging.INFO,
27 | format='%(asctime)s - %(levelname)s - %(message)s',
28 | handlers=[
29 | logging.StreamHandler(sys.stdout),
30 | logging.FileHandler('training.log')
31 | ]
32 | )
33 | logger = logging.getLogger(__name__)
34 |
35 |
36 | class CTRPredictor:
37 | """Click-Through Rate predictor using Logistic Regression with SGD."""
38 |
39 | def __init__(
40 | self,
41 | dimension: int = 2**27,
42 | learning_rate: float = 0.145,
43 | log_interval: int = 1_000_000
44 | ):
45 | """
46 | Initialize the CTR predictor.
47 |
48 | Args:
49 | dimension: Number of features for hash trick (default: 2^27 = ~134M)
50 | learning_rate: Alpha parameter for SGD (default: 0.145)
51 | log_interval: How often to log progress during training
52 | """
53 | self.D = dimension
54 | self.alpha = learning_rate
55 | self.log_interval = log_interval
56 |
57 | # Initialize model weights and feature counters
58 | self.w: List[float] = [0.0] * self.D
59 | self.n: List[float] = [0.0] * self.D
60 |
61 | logger.info("Initialized CTR Predictor:")
62 | logger.info(f" Dimension: {self.D:,}")
63 | logger.info(f" Learning rate: {self.alpha}")
64 | logger.info(f" Memory usage: ~{(self.D * 16) / (1024**3):.2f} GB")
65 |
66 | @staticmethod
67 | def logloss(p: float, y: float) -> float:
68 | """
69 | Calculate bounded logarithmic loss.
70 |
71 | Args:
72 | p: Predicted probability (0 to 1)
73 | y: True label (0 or 1)
74 |
75 | Returns:
76 | Logarithmic loss value
77 | """
78 | # Bound prediction to prevent log(0)
79 | epsilon = 1e-12
80 | p = max(min(p, 1.0 - epsilon), epsilon)
81 |
82 | if y == 1.0:
83 | return -log(p)
84 | else:
85 | return -log(1.0 - p)
86 |
87 | def get_features(self, csv_row: Dict[str, str]) -> List[int]:
88 | """
89 | Apply hash trick to convert CSV row to feature indices.
90 |
91 | Treats both integer and categorical features as categorical,
92 | using a simple hash function to map to feature space.
93 |
94 | Args:
95 | csv_row: Dictionary from CSV DictReader
96 |
97 | Returns:
98 | List of feature indices where value is 1
99 | """
100 | x = [0] # Index 0 is the bias term
101 |
102 | for key, value in csv_row.items():
103 | if not value: # Skip empty values
104 | continue
105 |
106 | try:
107 | # Simple hash: concatenate value and feature name, convert to hex
108 | # This is intentionally simple for speed (though weak as a hash)
109 | hash_input = value + key[1:]
110 | index = int(hash_input, 16) % self.D
111 | x.append(index)
112 | except (ValueError, IndexError) as e:
113 | logger.warning(f"Skipping invalid feature {key}={value}: {e}")
114 | continue
115 |
116 | return x
117 |
118 | def predict_probability(self, x: List[int]) -> float:
119 | """
120 | Calculate probability P(y=1|x) using logistic sigmoid.
121 |
122 | Args:
123 | x: List of feature indices
124 |
125 | Returns:
126 | Predicted probability between 0 and 1
127 | """
128 | # Calculate w^T * x
129 | wTx = sum(self.w[i] for i in x)
130 |
131 | # Apply bounded sigmoid to prevent overflow
132 | # Bound to [-20, 20] before exp
133 | wTx_bounded = max(min(wTx, 20.0), -20.0)
134 |
135 | return 1.0 / (1.0 + exp(-wTx_bounded))
136 |
137 | def update_weights(
138 | self,
139 | x: List[int],
140 | p: float,
141 | y: float
142 | ) -> None:
143 | """
144 | Update model weights using SGD with adaptive learning rate.
145 |
146 | Args:
147 | x: Feature indices
148 | p: Predicted probability
149 | y: True label (0 or 1)
150 | """
151 | for i in x:
152 | # Adaptive learning rate: alpha / (sqrt(n) + 1)
153 | # This decreases learning rate for frequently seen features
154 | adaptive_lr = self.alpha / (sqrt(self.n[i]) + 1.0)
155 |
156 | # Gradient: (p - y) * x[i], where x[i] = 1 for indices in x
157 | gradient = p - y
158 |
159 | # Update weight
160 | self.w[i] -= gradient * adaptive_lr
161 |
162 | # Increment feature counter
163 | self.n[i] += 1.0
164 |
165 | def train(self, train_path: Path) -> None:
166 | """
167 | Train the model on training data using online SGD.
168 |
169 | Args:
170 | train_path: Path to training CSV file
171 |
172 | Raises:
173 | FileNotFoundError: If training file doesn't exist
174 | ValueError: If file format is invalid
175 | """
176 | if not train_path.exists():
177 | raise FileNotFoundError(f"Training file not found: {train_path}")
178 |
179 | logger.info(f"Starting training from: {train_path}")
180 | logger.info("=" * 80)
181 |
182 | cumulative_loss = 0.0
183 | sample_count = 0
184 |
185 | try:
186 | with open(train_path, 'r', encoding='utf-8') as f:
187 | reader = DictReader(f)
188 |
189 | # Validate required columns
190 | if reader.fieldnames and 'Label' not in reader.fieldnames:
191 | raise ValueError("Training file must contain 'Label' column")
192 |
193 | for t, row in enumerate(reader, start=1):
194 | # Parse label
195 | try:
196 | y = 1.0 if row['Label'] == '1' else 0.0
197 | except KeyError:
198 | logger.error(f"Row {t} missing 'Label' column")
199 | continue
200 |
201 | # Remove label and ID from features
202 | row.pop('Label', None)
203 | row.pop('Id', None)
204 |
205 | # Get hashed features
206 | x = self.get_features(row)
207 |
208 | # Get prediction
209 | p = self.predict_probability(x)
210 |
211 | # Calculate loss for monitoring
212 | loss = self.logloss(p, y)
213 | cumulative_loss += loss
214 | sample_count = t
215 |
216 | # Log progress
217 | if t % self.log_interval == 0:
218 | avg_loss = cumulative_loss / t
219 | logger.info(
220 | f"Processed: {t:,} samples | "
221 | f"Avg logloss: {avg_loss:.6f} | "
222 | f"Current loss: {loss:.6f}"
223 | )
224 |
225 | # Update model
226 | self.update_weights(x, p, y)
227 |
228 | except Exception as e:
229 | logger.error(f"Error during training: {e}")
230 | raise
231 |
232 | logger.info("=" * 80)
233 | logger.info("Training completed!")
234 | logger.info(f" Total samples: {sample_count:,}")
235 | logger.info(f" Final avg logloss: {cumulative_loss / sample_count:.6f}")
236 |
237 | def predict(self, test_path: Path, output_path: Path) -> None:
238 | """
239 | Generate predictions for test data and save to CSV.
240 |
241 | Args:
242 | test_path: Path to test CSV file
243 | output_path: Path to save predictions
244 |
245 | Raises:
246 | FileNotFoundError: If test file doesn't exist
247 | """
248 | if not test_path.exists():
249 | raise FileNotFoundError(f"Test file not found: {test_path}")
250 |
251 | logger.info(f"Generating predictions from: {test_path}")
252 | logger.info(f"Saving to: {output_path}")
253 |
254 | prediction_count = 0
255 |
256 | try:
257 | with open(test_path, 'r', encoding='utf-8') as f_in, \
258 | open(output_path, 'w', encoding='utf-8') as f_out:
259 |
260 | # Write header
261 | f_out.write('Id,Predicted\n')
262 |
263 | reader = DictReader(f_in)
264 |
265 | for t, row in enumerate(reader, start=1):
266 | # Get ID
267 | row_id = row.get('Id', str(t))
268 | row.pop('Id', None)
269 |
270 | # Get features and predict
271 | x = self.get_features(row)
272 | p = self.predict_probability(x)
273 |
274 | # Write prediction
275 | f_out.write(f'{row_id},{p:.10f}\n')
276 | prediction_count = t
277 |
278 | # Log progress
279 | if t % self.log_interval == 0:
280 | logger.info(f"Generated {t:,} predictions")
281 |
282 | except Exception as e:
283 | logger.error(f"Error during prediction: {e}")
284 | raise
285 |
286 | logger.info(f"Prediction completed! Generated {prediction_count:,} predictions")
287 |
288 |
289 | def main():
290 | """Main execution function."""
291 | # Configuration
292 | TRAIN_FILE = Path('train.csv')
293 | TEST_FILE = Path('test.csv')
294 | OUTPUT_FILE = Path('submission.csv')
295 |
296 | # Model hyperparameters
297 | DIMENSION = 2 ** 27 # ~134M features
298 | LEARNING_RATE = 0.145
299 |
300 | try:
301 | # Initialize model
302 | model = CTRPredictor(
303 | dimension=DIMENSION,
304 | learning_rate=LEARNING_RATE
305 | )
306 |
307 | # Train model
308 | model.train(TRAIN_FILE)
309 |
310 | # Generate predictions
311 | model.predict(TEST_FILE, OUTPUT_FILE)
312 |
313 | logger.info("All tasks completed successfully!")
314 |
315 | except FileNotFoundError as e:
316 | logger.error(f"File error: {e}")
317 | logger.error("Please ensure train.csv and test.csv are in the current directory")
318 | logger.error("Download from: https://www.kaggle.com/c/criteo-display-ad-challenge/data")
319 | sys.exit(1)
320 | except Exception as e:
321 | logger.error(f"Unexpected error: {e}")
322 | sys.exit(1)
323 |
324 |
325 | if __name__ == '__main__':
326 | main()
327 |
--------------------------------------------------------------------------------
/logReg_modernized.py:
--------------------------------------------------------------------------------
1 | """
2 | Logistic Regression Training and Testing Module
3 |
4 | Implements logistic regression with multiple optimization algorithms:
5 | - Gradient Descent (GD)
6 | - Stochastic Gradient Descent (SGD)
7 | - Smooth Stochastic Gradient Descent (Smooth SGD)
8 |
9 | Modernized from original logReg.py with:
10 | - Python 3 compatibility
11 | - Explicit imports (no wildcards)
12 | - Type hints
13 | - Error handling
14 | - Better documentation
15 | """
16 |
17 | import logging
18 | import time
19 | from typing import Dict, Any, Tuple
20 |
21 | import numpy as np
22 | import matplotlib.pyplot as plt
23 |
24 | # Configure logging
25 | logging.basicConfig(
26 | level=logging.INFO,
27 | format='%(asctime)s - %(levelname)s - %(message)s'
28 | )
29 | logger = logging.getLogger(__name__)
30 |
31 |
32 | def sigmoid(x: np.ndarray) -> np.ndarray:
33 | """
34 | Calculate the sigmoid (logistic) function.
35 |
36 | Args:
37 | x: Input array or scalar
38 |
39 | Returns:
40 | Sigmoid of input, bounded between 0 and 1
41 | """
42 | return 1.0 / (1.0 + np.exp(-x))
43 |
44 |
45 | def train_logistic_regression(
46 | train_x: np.ndarray,
47 | train_y: np.ndarray,
48 | opts: Dict[str, Any]
49 | ) -> np.ndarray:
50 | """
51 | Train a logistic regression model using specified optimization algorithm.
52 |
53 | Args:
54 | train_x: Training features, shape (num_samples, num_features).
55 | Should include bias term (column of ones) if needed.
56 | train_y: Training labels, shape (num_samples, 1)
57 | opts: Dictionary with training options:
58 | - 'alpha': Learning rate (float)
59 | - 'maxIter': Maximum iterations (int)
60 | - 'optimizeType': Optimization algorithm (str)
61 | Options: 'gradDescent', 'stocGradDescent', 'smoothStocGradDescent'
62 |
63 | Returns:
64 | weights: Trained weight vector, shape (num_features, 1)
65 |
66 | Raises:
67 | ValueError: If optimize type is not recognized
68 | TypeError: If input arrays have wrong shape
69 | """
70 | # Validate inputs
71 | if train_x.shape[0] != train_y.shape[0]:
72 | raise ValueError(
73 | f"Sample count mismatch: train_x has {train_x.shape[0]} samples, "
74 | f"train_y has {train_y.shape[0]} samples"
75 | )
76 |
77 | start_time = time.time()
78 |
79 | num_samples, num_features = train_x.shape
80 | alpha = opts.get('alpha', 0.01)
81 | max_iter = opts.get('maxIter', 1000)
82 | optimize_type = opts.get('optimizeType', 'gradDescent')
83 |
84 | logger.info(f"Training Logistic Regression:")
85 | logger.info(f" Samples: {num_samples}")
86 | logger.info(f" Features: {num_features}")
87 | logger.info(f" Algorithm: {optimize_type}")
88 | logger.info(f" Learning rate: {alpha}")
89 | logger.info(f" Max iterations: {max_iter}")
90 |
91 | # Initialize weights
92 | weights = np.ones((num_features, 1))
93 |
94 | # Optimize using selected algorithm
95 | if optimize_type == 'gradDescent':
96 | weights = _gradient_descent(train_x, train_y, weights, alpha, max_iter)
97 |
98 | elif optimize_type == 'stocGradDescent':
99 | weights = _stochastic_gradient_descent(
100 | train_x, train_y, weights, alpha, max_iter
101 | )
102 |
103 | elif optimize_type == 'smoothStocGradDescent':
104 | weights = _smooth_stochastic_gradient_descent(
105 | train_x, train_y, weights, alpha, max_iter
106 | )
107 |
108 | else:
109 | raise ValueError(
110 | f"Unsupported optimize type: {optimize_type}. "
111 | f"Must be 'gradDescent', 'stocGradDescent', or 'smoothStocGradDescent'"
112 | )
113 |
114 | elapsed = time.time() - start_time
115 | logger.info(f"Training completed in {elapsed:.2f} seconds")
116 |
117 | return weights
118 |
119 |
120 | def _gradient_descent(
121 | train_x: np.ndarray,
122 | train_y: np.ndarray,
123 | weights: np.ndarray,
124 | alpha: float,
125 | max_iter: int
126 | ) -> np.ndarray:
127 | """
128 | Batch gradient descent optimization.
129 |
130 | Updates weights using all samples in each iteration.
131 | """
132 | for k in range(max_iter):
133 | # Forward pass
134 | output = sigmoid(train_x @ weights)
135 |
136 | # Calculate error
137 | error = train_y - output
138 |
139 | # Update weights using all samples
140 | weights = weights + alpha * (train_x.T @ error)
141 |
142 | # Log progress
143 | if (k + 1) % 100 == 0:
144 | loss = np.mean(-train_y * np.log(output + 1e-10) -
145 | (1 - train_y) * np.log(1 - output + 1e-10))
146 | logger.info(f"Iteration {k+1}/{max_iter}, Loss: {loss:.6f}")
147 |
148 | return weights
149 |
150 |
151 | def _stochastic_gradient_descent(
152 | train_x: np.ndarray,
153 | train_y: np.ndarray,
154 | weights: np.ndarray,
155 | alpha: float,
156 | max_iter: int
157 | ) -> np.ndarray:
158 | """
159 | Stochastic gradient descent optimization.
160 |
161 | Updates weights using one sample at a time.
162 | """
163 | num_samples = train_x.shape[0]
164 |
165 | for k in range(max_iter):
166 | for i in range(num_samples):
167 | # Get single sample
168 | x_i = train_x[i:i+1, :].T # Shape: (num_features, 1)
169 | y_i = train_y[i, 0]
170 |
171 | # Forward pass
172 | output = sigmoid((x_i.T @ weights)[0, 0])
173 |
174 | # Calculate error
175 | error = y_i - output
176 |
177 | # Update weights
178 | weights = weights + alpha * x_i * error
179 |
180 | # Log progress
181 | if (k + 1) % 100 == 0:
182 | output_all = sigmoid(train_x @ weights)
183 | loss = np.mean(-train_y * np.log(output_all + 1e-10) -
184 | (1 - train_y) * np.log(1 - output_all + 1e-10))
185 | logger.info(f"Iteration {k+1}/{max_iter}, Loss: {loss:.6f}")
186 |
187 | return weights
188 |
189 |
190 | def _smooth_stochastic_gradient_descent(
191 | train_x: np.ndarray,
192 | train_y: np.ndarray,
193 | weights: np.ndarray,
194 | alpha: float,
195 | max_iter: int
196 | ) -> np.ndarray:
197 | """
198 | Smooth stochastic gradient descent optimization.
199 |
200 | Uses random sample selection and adaptive learning rate to reduce oscillations.
201 | """
202 | num_samples = train_x.shape[0]
203 |
204 | for k in range(max_iter):
205 | # Create random order of samples
206 | indices = list(range(num_samples))
207 | np.random.shuffle(indices)
208 |
209 | for i, idx in enumerate(indices):
210 | # Adaptive learning rate that decreases over time
211 | adaptive_alpha = 4.0 / (1.0 + k + i) + 0.01
212 |
213 | # Get single sample
214 | x_i = train_x[idx:idx+1, :].T # Shape: (num_features, 1)
215 | y_i = train_y[idx, 0]
216 |
217 | # Forward pass
218 | output = sigmoid((x_i.T @ weights)[0, 0])
219 |
220 | # Calculate error
221 | error = y_i - output
222 |
223 | # Update weights with adaptive learning rate
224 | weights = weights + adaptive_alpha * x_i * error
225 |
226 | # Log progress
227 | if (k + 1) % 100 == 0:
228 | output_all = sigmoid(train_x @ weights)
229 | loss = np.mean(-train_y * np.log(output_all + 1e-10) -
230 | (1 - train_y) * np.log(1 - output_all + 1e-10))
231 | logger.info(f"Iteration {k+1}/{max_iter}, Loss: {loss:.6f}")
232 |
233 | return weights
234 |
235 |
236 | def test_logistic_regression(
237 | weights: np.ndarray,
238 | test_x: np.ndarray,
239 | test_y: np.ndarray
240 | ) -> float:
241 | """
242 | Test trained logistic regression model and calculate accuracy.
243 |
244 | Args:
245 | weights: Trained weight vector, shape (num_features, 1)
246 | test_x: Test features, shape (num_samples, num_features)
247 | test_y: Test labels, shape (num_samples, 1)
248 |
249 | Returns:
250 | accuracy: Proportion of correct predictions (0 to 1)
251 | """
252 | num_samples = test_x.shape[0]
253 | match_count = 0
254 |
255 | for i in range(num_samples):
256 | # Get prediction probability
257 | x_i = test_x[i:i+1, :]
258 | prob = sigmoid((x_i @ weights)[0, 0])
259 |
260 | # Convert to binary prediction (threshold at 0.5)
261 | predict = prob > 0.5
262 |
263 | # Check if correct
264 | if predict == bool(test_y[i, 0]):
265 | match_count += 1
266 |
267 | accuracy = match_count / num_samples
268 |
269 | logger.info(f"Test Results:")
270 | logger.info(f" Correct: {match_count}/{num_samples}")
271 | logger.info(f" Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
272 |
273 | return accuracy
274 |
275 |
276 | def visualize_logistic_regression(
277 | weights: np.ndarray,
278 | train_x: np.ndarray,
279 | train_y: np.ndarray
280 | ) -> None:
281 | """
282 | Visualize the trained logistic regression decision boundary.
283 |
284 | Note: Only works with 2D data (3 features including bias).
285 |
286 | Args:
287 | weights: Trained weight vector
288 | train_x: Training features including bias column
289 | train_y: Training labels
290 |
291 | Raises:
292 | ValueError: If data is not 2D
293 | """
294 | num_samples, num_features = train_x.shape
295 |
296 | if num_features != 3:
297 | raise ValueError(
298 | f"Visualization only supports 2D data (3 features with bias). "
299 | f"Got {num_features} features."
300 | )
301 |
302 | logger.info("Generating visualization...")
303 |
304 |     # Plot each class with a single call so both classes appear in the legend
305 |     class0_mask = train_y[:, 0] == 0
306 |     plt.plot(train_x[class0_mask, 1], train_x[class0_mask, 2],
307 |              'or', label='Class 0')
308 |     plt.plot(train_x[~class0_mask, 1], train_x[~class0_mask, 2],
309 |              'ob', label='Class 1')
310 |
311 | # Draw decision boundary
312 | # Line equation: w0 + w1*x1 + w2*x2 = 0
313 | # Solve for x2: x2 = -(w0 + w1*x1) / w2
314 | min_x = np.min(train_x[:, 1])
315 | max_x = np.max(train_x[:, 1])
316 |
317 | w = weights.flatten()
318 | y_min = -(w[0] + w[1] * min_x) / w[2]
319 | y_max = -(w[0] + w[1] * max_x) / w[2]
320 |
321 | plt.plot([min_x, max_x], [y_min, y_max], '-g', linewidth=2, label='Decision Boundary')
322 |
323 | plt.xlabel('Feature 1')
324 | plt.ylabel('Feature 2')
325 | plt.title('Logistic Regression Decision Boundary')
326 | plt.legend()
327 | plt.grid(True, alpha=0.3)
328 | plt.show()
329 |
330 |
331 | # Example usage
332 | if __name__ == '__main__':
333 | # Generate sample data
334 | np.random.seed(42)
335 |
336 | # Create synthetic 2D dataset
337 | num_samples = 100
338 |
339 | # Class 0: centered at (2, 2)
340 | class0 = np.random.randn(num_samples // 2, 2) + np.array([2, 2])
341 |
342 | # Class 1: centered at (5, 5)
343 | class1 = np.random.randn(num_samples // 2, 2) + np.array([5, 5])
344 |
345 | # Combine data
346 | X = np.vstack([class0, class1])
347 | y = np.vstack([np.zeros((num_samples // 2, 1)), np.ones((num_samples // 2, 1))])
348 |
349 | # Add bias term
350 | X_with_bias = np.hstack([np.ones((num_samples, 1)), X])
351 |
352 | # Training options
353 | options = {
354 | 'alpha': 0.01,
355 | 'maxIter': 500,
356 | 'optimizeType': 'smoothStocGradDescent'
357 | }
358 |
359 | # Train model
360 | trained_weights = train_logistic_regression(X_with_bias, y, options)
361 |
362 | # Test model
363 | accuracy = test_logistic_regression(trained_weights, X_with_bias, y)
364 |
365 | # Visualize (optional - uncomment to show plot)
366 | # visualize_logistic_regression(trained_weights, X_with_bias, y)
367 |
--------------------------------------------------------------------------------
/README_NEW.md:
--------------------------------------------------------------------------------
1 | # Predict Click-Through Rates on Display Ads
2 |
3 | A machine learning project for predicting Click-Through Rates (CTR) in display advertising using the Criteo dataset. This repository implements multiple algorithms including Logistic Regression with SGD, Gradient Boosting Machines, and Vowpal Wabbit models.
4 |
5 | [](https://opensource.org/licenses/MIT)
6 | 
7 | 
8 |
9 | ---
10 |
11 | ## Table of Contents
12 |
13 | - [Overview](#overview)
14 | - [Project Status](#project-status)
15 | - [Features](#features)
16 | - [Installation](#installation)
17 | - [Quick Start](#quick-start)
18 | - [Data](#data)
19 | - [Models](#models)
20 | - [Project Structure](#project-structure)
21 | - [Usage](#usage)
22 | - [Development](#development)
23 | - [Contributing](#contributing)
24 | - [License](#license)
25 | - [Acknowledgments](#acknowledgments)
26 |
27 | ---
28 |
29 | ## Overview
30 |
31 | Display advertising is a billion-dollar industry and one of the central applications of machine learning on the Internet. This project was developed for the **Criteo Display Advertising Challenge**, where the goal is to predict the probability that a user will click on a given ad (CTR).
32 |
33 | ### The Challenge
34 |
35 | Given:
36 | - User information
37 | - Page context
38 | - Ad features (39 anonymized features: 13 numerical, 26 categorical)
39 |
40 | Predict:
41 | - Probability of click (binary classification)
42 |
43 | ### Evaluation Metric
44 |
45 | - **Log Loss (Logarithmic Loss)**: Lower is better
46 |
47 | ---
48 |
49 | ## Project Status
50 |
51 | **Version:** 2.0 (Modernized)
52 |
53 | **Status:**
54 | - ✅ Python 3 migration complete
55 | - ✅ Core algorithms refactored with modern best practices
56 | - ✅ Documentation updated
57 | - ✅ Error handling and logging added
58 | - 🚧 Unit tests in progress
59 | - 🚧 CI/CD pipeline in progress
60 |
61 | **Legacy Code:**
62 | - Original Python 2 implementations preserved under their original filenames (modern versions carry the `_modernized` suffix)
63 | - See `docs/MIGRATION_GUIDE.md` for differences
64 |
65 | ---
66 |
67 | ## Features
68 |
69 | ### Algorithms Implemented
70 |
71 | 1. **Logistic Regression with SGD** (`py_lh4_modernized.py`)
72 | - Hash trick for feature engineering (2^27 dimensional space)
73 | - Adaptive learning rate
74 | - Memory-efficient online learning
75 | - Bounded numerical operations for stability
76 |
77 | 2. **Gradient Boosting Machine** (`gbm_modernized.R`)
78 | - Tree-based ensemble method
79 | - Cross-validation with ROC optimization
80 | - Configurable hyperparameters
81 |
82 | 3. **Vowpal Wabbit** (`csv_to_vw_modernized.py`)
83 | - Fast linear learner
84 | - CSV to VW format converter
85 | - Scalable to massive datasets
86 |
87 | ### Modern Features
88 |
89 | - ✨ Python 3.8+ compatibility
90 | - 🔒 Input validation and error handling
91 | - 📊 Comprehensive logging
92 | - ⚙️ Configurable parameters
93 | - 📝 Type hints and documentation
94 | - 🧪 Unit test infrastructure
95 | - 🐳 Docker support (coming soon)
96 |
97 | ---
98 |
99 | ## Installation
100 |
101 | ### Prerequisites
102 |
103 | - Python 3.8 or higher
104 | - R 4.0 or higher (for R models)
105 | - Vowpal Wabbit (for VW models)
106 | - Git
107 |
108 | ### Clone Repository
109 |
110 | ```bash
111 | git clone https://github.com/yourusername/Predict-click-through-rates-on-display-ads.git
112 | cd Predict-click-through-rates-on-display-ads
113 | ```
114 |
115 | ### Python Setup
116 |
117 | ```bash
118 | # Create virtual environment
119 | python -m venv venv
120 |
121 | # Activate virtual environment
122 | # On macOS/Linux:
123 | source venv/bin/activate
124 | # On Windows:
125 | # venv\Scripts\activate
126 |
127 | # Install dependencies
128 | pip install -r requirements.txt
129 | ```
130 |
131 | ### R Setup
132 |
133 | ```bash
134 | # Install R packages
135 | Rscript -e "install.packages(c('data.table', 'caret', 'gbm'), dependencies=TRUE)"
136 | ```
137 |
138 | ### Vowpal Wabbit Setup
139 |
140 | ```bash
141 | # macOS
142 | brew install vowpal-wabbit
143 |
144 | # Ubuntu/Debian
145 | sudo apt-get install vowpal-wabbit
146 |
147 | # From source (all platforms)
148 | git clone https://github.com/VowpalWabbit/vowpal_wabbit.git
149 | cd vowpal_wabbit
150 | make
151 | ```
152 |
153 | ---
154 |
155 | ## Quick Start
156 |
157 | ### 1. Download Data
158 |
159 | ```bash
160 | # Create data directory
161 | mkdir -p data
162 |
163 | # Download from Kaggle
164 | # Visit: https://www.kaggle.com/c/criteo-display-ad-challenge/data
165 | # Place train.csv and test.csv in data/ directory
166 | ```
167 |
168 | Or use the download script:
169 |
170 | ```bash
171 | python scripts/download_data.py
172 | ```
173 |
174 | ### 2. Train a Model
175 |
176 | **Logistic Regression (Python):**
177 |
178 | ```bash
179 | python py_lh4_modernized.py
180 | ```
181 |
182 | **Gradient Boosting Machine (R):**
183 |
184 | ```bash
185 | DATA_DIR=./data OUTPUT_DIR=./output Rscript gbm_modernized.R
186 | ```
187 |
188 | **Vowpal Wabbit:**
189 |
190 | ```bash
191 | # Convert CSV to VW format
192 | python csv_to_vw_modernized.py data/train.csv data/train.vw
193 | python csv_to_vw_modernized.py data/test.csv data/test.vw --test
194 |
195 | # Train model
196 | vw data/train.vw --loss_function logistic -f models/click.model --passes 3 --cache_file data/train.cache
197 |
198 | # Generate predictions
199 | vw data/test.vw -i models/click.model -t -p predictions.txt
200 | ```
201 |
202 | ### 3. Generate Submission
203 |
204 | ```bash
205 | # Predictions are automatically saved to submission.csv (Python)
206 | # or specified output directory (R)
207 | ```
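
For the Vowpal Wabbit route there is one extra step: `predictions.txt` typically holds raw scores (unless `--link logistic` was passed at prediction time), so they need to be mapped through a sigmoid and paired with the test IDs. A minimal sketch, assuming the IDs in `data/test.csv` appear in the same order VW processed the rows:

```python
import csv
from math import exp

with open('predictions.txt') as preds, \
     open('data/test.csv', newline='') as test, \
     open('submission.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['Id', 'Predicted'])
    for raw, row in zip(preds, csv.DictReader(test)):
        score = float(raw.split()[0])        # VW may append the example tag after the score
        prob = 1.0 / (1.0 + exp(-score))     # raw score -> click probability
        writer.writerow([row['Id'], f'{prob:.10f}'])
```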
208 |
209 | ---
210 |
211 | ## Data
212 |
213 | ### Dataset Information
214 |
215 | - **Source:** [Criteo Labs](https://www.kaggle.com/c/criteo-display-ad-challenge/data)
216 | - **Size:**
217 | - Training: ~45 million samples
218 | - Test: ~6 million samples
219 | - **Features:** 39 anonymized features
220 | - 13 numerical features (`I1-I13`)
221 | - 26 categorical features (`C1-C26`)
222 | - **Target:** Binary (0 = no click, 1 = click)
223 |
224 | ### Data Format
225 |
226 | **CSV Format:**
227 |
228 | ```
229 | Id,Label,I1,I2,...,I13,C1,C2,...,C26
230 | 1,0,1,5,...,45,68fd1e,80e26c,...,a458ea
231 | 2,1,2,,,7,68fd1e,80e26c,...,b458ea
232 | ```
233 |
234 | **Vowpal Wabbit Format:**
235 |
236 | ```
237 | 1 'id1 |i I1:1 I2:5 I13:45 |c 68fd1e 80e26c a458ea
238 | -1 'id2 |i I1:2 I13:7 |c 68fd1e 80e26c b458ea
239 | ```
240 |
241 | ### Data Preprocessing
242 |
243 | Missing values are common in this dataset:
244 |
245 | ```python
246 | # Option 1: Use models that handle missing values (GBM, Random Forest)
247 | # Option 2: Impute missing values
248 | from sklearn.impute import SimpleImputer
249 | imputer = SimpleImputer(strategy='median')
250 | X_imputed = imputer.fit_transform(X)
251 | ```
252 |
253 | ---
254 |
255 | ## Models
256 |
257 | ### 1. Logistic Regression with SGD
258 |
259 | **File:** `py_lh4_modernized.py`
260 |
261 | **Features:**
262 | - Hash trick for feature engineering (see the sketch below)
263 | - Adaptive learning rate: α/(√n + 1)
264 | - Bounded sigmoid and log loss for numerical stability
265 | - Memory-efficient: processes one sample at a time
266 |
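The first two ideas above fit in a few lines. This is an illustrative sketch only (it uses Python's built-in `hash` and a small feature space); the actual implementation in `py_lh4_modernized.py` uses a hex-based hash over the raw values and `D = 2**27`:

```python
from math import sqrt

D = 2 ** 20                 # small space for illustration; the project uses 2**27
w = [0.0] * D               # weights
n = [0.0] * D               # per-feature update counts
alpha = 0.145               # base learning rate

def hash_feature(name: str, value: str) -> int:
    """Map a raw feature such as C1=68fd1e to an index in [0, D)."""
    return hash(f"{name}={value}") % D

# One SGD step for a single hashed sample x with label y and prediction p
x = [0, hash_feature("C1", "68fd1e"), hash_feature("I1", "5")]  # index 0 is the bias
p, y = 0.2, 1.0
for i in x:
    w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.0)  # adaptive rate: alpha / (sqrt(n_i) + 1)
    n[i] += 1.0
```
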
267 | **Hyperparameters:**
268 | ```python
269 | DIMENSION = 2**27 # ~134M features
270 | LEARNING_RATE = 0.145 # Alpha for SGD
271 | ```
272 |
273 | **Usage:**
274 | ```python
275 | from py_lh4_modernized import CTRPredictor
276 |
277 | model = CTRPredictor(dimension=2**27, learning_rate=0.145)
278 | model.train(Path('data/train.csv'))
279 | model.predict(Path('data/test.csv'), Path('submission.csv'))
280 | ```
281 |
282 | **Expected Performance:**
283 | - Training time: 30-60 minutes (45M samples)
284 | - Memory: ~2GB
285 | - Log loss: ~0.44-0.46
286 |
287 | ### 2. Gradient Boosting Machine (GBM)
288 |
289 | **File:** `gbm_modernized.R`
290 |
291 | **Features:**
292 | - Tree-based ensemble method
293 | - Cross-validation with ROC optimization
294 | - Handles missing values automatically
295 |
296 | **Hyperparameters:**
297 | ```r
298 | N_TREES <- 500
299 | INTERACTION_DEPTH <- 22
300 | SHRINKAGE <- 0.1
301 | ```
302 |
303 | **Usage:**
304 | ```bash
305 | DATA_DIR=./data OUTPUT_DIR=./output Rscript gbm_modernized.R
306 | ```
307 |
308 | **Expected Performance:**
309 | - Training time: 2-4 hours
310 | - Memory: ~8-16GB
311 | - Log loss: ~0.43-0.45
312 |
313 | ### 3. Vowpal Wabbit
314 |
315 | **Files:** `csv_to_vw_modernized.py`, VW commands
316 |
317 | **Features:**
318 | - Extremely fast linear learner
319 | - Scalable to billions of samples
320 | - Online learning capable
321 |
322 | **Usage:**
323 | ```bash
324 | # Convert format
325 | python csv_to_vw_modernized.py data/train.csv data/train.vw
326 |
327 | # Train with multiple passes
328 | vw data/train.vw \
329 | --loss_function logistic \
330 | --passes 3 \
331 | --cache_file data/train.cache \
332 | -f models/click.model
333 |
334 | # Predict
335 | vw data/test.vw \
336 | -i models/click.model \
337 | -t \
338 | -p predictions.txt
339 | ```
340 |
341 | **Expected Performance:**
342 | - Training time: 10-20 minutes
343 | - Memory: ~2-4GB
344 | - Log loss: ~0.44-0.46
345 |
346 | ---
347 |
348 | ## Project Structure
349 |
350 | ```
351 | Predict-click-through-rates-on-display-ads/
352 | │
353 | ├── README.md # This file
354 | ├── LICENSE # MIT License
355 | ├── requirements.txt # Python dependencies
356 | ├── requirements-r.txt # R dependencies
357 | ├── .gitignore # Git ignore rules
358 | │
359 | ├── data/ # Data directory (not in repo)
360 | │ ├── train.csv # Training data (download separately)
361 | │ ├── test.csv # Test data (download separately)
362 | │ ├── train.vw # VW format training data
363 | │ └── test.vw # VW format test data
364 | │
365 | ├── models/ # Trained models
366 | │ ├── click.model.vw # Vowpal Wabbit model
367 | │ └── gbm_model.rds # R GBM model
368 | │
369 | ├── output/ # Output directory
370 | │ ├── submission.csv # Predictions for submission
371 | │ └── training.log # Training logs
372 | │
373 | ├── Modern Implementations:
374 | │ ├── py_lh4_modernized.py # LR with SGD (Python 3)
375 | │ ├── logReg_modernized.py # LR module (Python 3)
376 | │ ├── csv_to_vw_modernized.py # CSV to VW converter (Python 3)
377 | │ └── gbm_modernized.R # GBM (R, modern)
378 | │
379 | ├── Legacy Code (Original):
380 | │ ├── py_lh4.py # Original LR implementation
381 | │ ├── logReg.py # Original LR module
382 | │ ├── gbm.R # Original GBM script
383 | │ └── [other legacy files]
384 | │
385 | ├── tests/ # Unit tests
386 | │ ├── test_lr_model.py
387 | │ ├── test_data_loading.py
388 | │ └── test_vw_converter.py
389 | │
390 | ├── scripts/ # Utility scripts
391 | │ ├── download_data.py # Data download helper
392 | │ └── evaluate_model.py # Model evaluation
393 | │
394 | └── docs/ # Additional documentation
395 | ├── MIGRATION_GUIDE.md # Python 2 to 3 migration notes
396 | ├── PERFORMANCE.md # Performance benchmarks
397 | └── API.md # API documentation
398 | ```
399 |
400 | ---
401 |
402 | ## Usage
403 |
404 | ### Configuration
405 |
406 | **Python Models:**
407 |
408 | Edit configuration at the top of each script:
409 |
410 | ```python
411 | # In py_lh4_modernized.py
412 | TRAIN_FILE = Path('data/train.csv')
413 | TEST_FILE = Path('data/test.csv')
414 | OUTPUT_FILE = Path('submission.csv')
415 | DIMENSION = 2 ** 27
416 | LEARNING_RATE = 0.145
417 | ```
418 |
419 | **R Models:**
420 |
421 | Use environment variables:
422 |
423 | ```bash
424 | export DATA_DIR=/path/to/data
425 | export OUTPUT_DIR=/path/to/output
426 | Rscript gbm_modernized.R
427 | ```
428 |
429 | ### Training Options
430 |
431 | **Full Training:**
432 | ```bash
433 | python py_lh4_modernized.py
434 | ```
435 |
436 | **Sample Training (first 1M rows):**
437 | ```bash
438 | head -n 1000001 data/train.csv > data/train_sample.csv
439 | # Update script to use train_sample.csv
440 | python py_lh4_modernized.py
441 | ```
442 |
443 | ### Monitoring Training
444 |
445 | Training progress is logged to both console and `training.log`:
446 |
447 | ```bash
448 | # Watch log in real-time
449 | tail -f training.log
450 | ```
451 |
452 | ### Model Evaluation
453 |
454 | ```bash
455 | python scripts/evaluate_model.py \
456 | --predictions submission.csv \
457 | --ground_truth data/train_with_labels.csv
458 | ```
459 |
460 | ---
461 |
462 | ## Development
463 |
464 | ### Setting Up Development Environment
465 |
466 | ```bash
467 | # Install development dependencies
468 | pip install -r requirements.txt
469 |
470 | # Install pre-commit hooks (coming soon)
471 | pre-commit install
472 |
473 | # Run code formatters
474 | black .
475 | ```
476 |
477 | ### Running Tests
478 |
479 | ```bash
480 | # Run all tests
481 | pytest
482 |
483 | # Run with coverage
484 | pytest --cov=. --cov-report=html
485 |
486 | # Run specific test
487 | pytest tests/test_lr_model.py
488 | ```
489 |
490 | ### Code Style
491 |
492 | This project follows:
493 | - **Python:** PEP 8, enforced with `black` and `flake8`
494 | - **R:** Tidyverse style guide
495 | - **Documentation:** Google-style docstrings
496 |
497 | ---
498 |
499 | ## Contributing
500 |
501 | Contributions are welcome! Please:
502 |
503 | 1. Fork the repository
504 | 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
505 | 3. Commit your changes (`git commit -m 'Add amazing feature'`)
506 | 4. Push to the branch (`git push origin feature/amazing-feature`)
507 | 5. Open a Pull Request
508 |
509 | ### Areas for Contribution
510 |
511 | - [ ] Deep learning models (Neural Networks, LSTM)
512 | - [ ] Feature engineering improvements
513 | - [ ] Hyperparameter optimization
514 | - [ ] Ensemble methods
515 | - [ ] Docker containerization
516 | - [ ] Web API for predictions
517 | - [ ] Performance optimizations
518 |
519 | ---
520 |
521 | ## License
522 |
523 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
524 |
525 | Copyright (c) 2019-2025 Tianxiang Liu
526 |
527 | ---
528 |
529 | ## Acknowledgments
530 |
531 | ### Dataset
532 | - **Criteo Labs** for providing the dataset
533 | - Kaggle for hosting the competition
534 |
535 | ### References
536 | - Original Vowpal Wabbit converter by Triskelion and Zygmunt Zając
537 | - Criteo Display Advertising Challenge: https://www.kaggle.com/c/criteo-display-ad-challenge
538 |
539 | ### Papers
540 | - Rendle, S. (2010). "Factorization Machines"
541 | - McMahan, H. B. et al. (2013). "Ad Click Prediction: a View from the Trenches"
542 | - He, X. et al. (2014). "Practical Lessons from Predicting Clicks on Ads at Facebook"
543 |
544 | ---
545 |
546 | ## Contact
547 |
548 | For questions or issues, please:
549 | - Open an issue on GitHub
550 | - Contact: [your-email@example.com]
551 |
552 | ---
553 |
554 | ## Changelog
555 |
556 | ### Version 2.0 (2025)
557 | - ✨ Migrated to Python 3.8+
558 | - ✨ Added comprehensive error handling and logging
559 | - ✨ Refactored code with type hints and documentation
560 | - ✨ Added configuration management
561 | - ✨ Improved README with detailed instructions
562 | - ✨ Added unit test infrastructure
563 |
564 | ### Version 1.0 (2014)
565 | - Initial implementation
566 | - Python 2 codebase
567 | - Multiple algorithm implementations
568 |
569 | ---
570 |
571 | **Happy Machine Learning! 🚀**
572 |
--------------------------------------------------------------------------------