├── pythonSGD
│   ├── submissionPython22Sep2014_pm.csv
│   ├── test.py
│   ├── esting.txt
│   ├── SGD_py.v11.suo
│   ├── SGD_py.sln
│   ├── logReg_test.py
│   ├── SGD_py.pyproj
│   ├── logReg_click.py
│   ├── py_lh_22Sep2014_2.py
│   ├── logReg.py
│   ├── py_lh4_22Sep2014.py
│   └── py_lh_20Sep2014.py
├── combineDatasets.R
├── .DS_Store
├── tests
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_lr_model.py
│   └── test_vw_converter.py
├── esting.txt
├── test.py
├── forum
│   ├── model
│   │   └── click.model.vw
│   ├── click.model.final2.vw
│   ├── vm_to_kaggle.py
│   ├── vm_to_kaggle.py~
│   ├── vm_command~
│   ├── vm_command
│   └── csv_to_vm.py
├── vowpal wabbit
│   ├── model
│   │   └── click.model.vw
│   ├── click.model.final2.vw
│   ├── Last model_23Sep2014.txt
│   ├── vm_to_kaggle.py
│   ├── vm_to_kaggle.py~
│   ├── vm_command~
│   ├── vm_command
│   ├── ending solution.txt
│   └── csv_to_vm.py
├── requirements-r.txt
├── requirements.txt
├── README.md
├── pytest.ini
├── LICENSE
├── logReg_test.py
├── r_sdg.R
├── .gitignore
├── SDG_21_Sep_2014.R
├── kaggle.py
├── testSet.txt
├── gbm.R
├── py_lh.py
├── py_lh2.py
├── py_lh3.py
├── py_lh4.py
├── logReg_click.py
├── logReg.py
├── py_lh_20Sep2014.py
├── config.yaml
├── .github
│   └── workflows
│       └── ci.yml
├── gbm_modernized.R
├── csv_to_vw_modernized.py
├── scripts
│   └── download_data.py
├── py_lh4_modernized.py
├── logReg_modernized.py
└── README_NEW.md
/pythonSGD/submissionPython22Sep2014_pm.csv:
--------------------------------------------------------------------------------
1 | Id,Predicted
2 |
--------------------------------------------------------------------------------
/combineDatasets.R:
--------------------------------------------------------------------------------
1 | setwd("I:\\data")
2 | train_int <- read.csv('int/train_num.csv')
--------------------------------------------------------------------------------
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/.DS_Store
--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 | Test suite for CTR Prediction project.
3 | """
4 |
5 | __version__ = "2.0"
6 |
--------------------------------------------------------------------------------
/esting.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/esting.txt
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 | alpha = opts['alpha']
--------------------------------------------------------------------------------
/pythonSGD/test.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 | alpha = opts['alpha']
--------------------------------------------------------------------------------
/pythonSGD/esting.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/pythonSGD/esting.txt
--------------------------------------------------------------------------------
/pythonSGD/SGD_py.v11.suo:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/pythonSGD/SGD_py.v11.suo
--------------------------------------------------------------------------------
/forum/model/click.model.vw:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/forum/model/click.model.vw
--------------------------------------------------------------------------------
/forum/click.model.final2.vw:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/forum/click.model.final2.vw
--------------------------------------------------------------------------------
/vowpal wabbit/model/click.model.vw:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/vowpal wabbit/model/click.model.vw
--------------------------------------------------------------------------------
/vowpal wabbit/click.model.final2.vw:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/vowpal wabbit/click.model.final2.vw
--------------------------------------------------------------------------------
/requirements-r.txt:
--------------------------------------------------------------------------------
1 | # R Package Dependencies
2 | # Install with: Rscript -e "install.packages(c('data.table', 'caret', 'gbm'))"
3 |
4 | data.table>=1.14.0
5 | caret>=6.0-90
6 | gbm>=2.1.8
7 |
--------------------------------------------------------------------------------
/vowpal wabbit/Last model_23Sep2014.txt:
--------------------------------------------------------------------------------
1 | vw click.train.vw -f click.model.vw --bfgs --passes 20 --holdout_after 40000000 -b 26 --cache_file click.train.vw.cache -l 0.145 --holdout_period 10
2 |
3 | vw click.train.vw -f click.model.vw -q:: --holdout_period 5 --noconstant --hash all --loss_function logistic -b 28 --save_per_pass --bfgs --termination 0.001 --passes 10 -l 0.1 --cache_file click.train.vw.cache
4 |
5 |
6 | --feature_mask
--------------------------------------------------------------------------------
/forum/vm_to_kaggle.py:
--------------------------------------------------------------------------------
1 | import math
2 |
3 | def zygmoid(x):
4 | #I know it's a common Sigmoid feature, but that's why I probably found
5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py
6 | return 1 / (1 + math.exp(-x))
7 |
8 | with open("kaggle.click.submission.csv","wb") as outfile:
9 | outfile.write("Id,Predicted\n")
10 | for line in open("click.preds3.txt"):
11 | row = line.strip().split(" ")
12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0]))))
13 |
--------------------------------------------------------------------------------
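The converter above is Python 2 code (it opens the output file in "wb" mode and writes str objects). A minimal Python 3 sketch of the same conversion, assuming the same file names and that click.preds3.txt holds one "score id" pair per line as produced by vw -p when the examples carry 'id tags:

import math

def sigmoid(x):
    # map a raw VW score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

with open("kaggle.click.submission.csv", "w") as outfile:
    outfile.write("Id,Predicted\n")
    with open("click.preds3.txt") as preds:
        for line in preds:
            score, row_id = line.strip().split(" ")
            outfile.write("%s,%f\n" % (row_id, sigmoid(float(score))))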
/forum/vm_to_kaggle.py~:
--------------------------------------------------------------------------------
1 | import math
2 |
3 | def zygmoid(x):
4 | #I know it's a common Sigmoid feature, but that's why I probably found
5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py
6 | return 1 / (1 + math.exp(-x))
7 |
8 | with open("kaggle.click.submission.csv","wb") as outfile:
9 | outfile.write("Id,Predicted\n")
10 | for line in open("click.preds2.txt"):
11 | row = line.strip().split(" ")
12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0]))))
13 |
--------------------------------------------------------------------------------
/vowpal wabbit/vm_to_kaggle.py:
--------------------------------------------------------------------------------
1 | import math
2 |
3 | def zygmoid(x):
4 | #I know it's a common Sigmoid feature, but that's why I probably found
5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py
6 | return 1 / (1 + math.exp(-x))
7 |
8 | with open("kaggle.click.submission.csv","wb") as outfile:
9 | outfile.write("Id,Predicted\n")
10 | for line in open("click.preds3.txt"):
11 | row = line.strip().split(" ")
12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0]))))
13 |
--------------------------------------------------------------------------------
/vowpal wabbit/vm_to_kaggle.py~:
--------------------------------------------------------------------------------
1 | import math
2 |
3 | def zygmoid(x):
4 | #I know it's a common Sigmoid feature, but that's why I probably found
5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py
6 | return 1 / (1 + math.exp(-x))
7 |
8 | with open("kaggle.click.submission.csv","wb") as outfile:
9 | outfile.write("Id,Predicted\n")
10 | for line in open("click.preds2.txt"):
11 | row = line.strip().split(" ")
12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0]))))
13 |
--------------------------------------------------------------------------------
/forum/vm_command~:
--------------------------------------------------------------------------------
1 | Training VW:
2 | ./vw click.train.vw -f click.model.vw --loss_function logistic
3 |
4 |
5 | Testing VW:
6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt
7 |
8 |
9 | Training VW2:
10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic
11 |
12 |
13 | Training VW3:
14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw --loss_function logistic
15 |
16 | parameters:
17 | -b bits
18 | -l rate
19 | --power_t p
20 |
21 |
--------------------------------------------------------------------------------
/vowpal wabbit/vm_command~:
--------------------------------------------------------------------------------
1 | Training VW:
2 | ./vw click.train.vw -f click.model.vw --loss_function logistic
3 |
4 |
5 | Testing VW:
6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt
7 |
8 |
9 | Training VW2:
10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic
11 |
12 |
13 | Training VW3:
14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw --loss_function logistic
15 |
16 | parameters:
17 | -b bits
18 | -l rate
19 | --power_t p
20 |
21 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | # Core dependencies
2 | numpy>=1.24.0,<2.0.0
3 | matplotlib>=3.7.0,<4.0.0
4 | pandas>=2.0.0,<3.0.0
5 | scipy>=1.10.0,<2.0.0
6 |
7 | # Configuration management
8 | pyyaml>=6.0,<7.0
9 |
10 | # Development dependencies
11 | pytest>=7.4.0,<8.0.0
12 | pytest-cov>=4.1.0,<5.0.0
13 | black>=23.0.0,<24.0.0
14 | flake8>=6.0.0,<7.0.0
15 | pylint>=2.17.0,<3.0.0
16 |
17 | # Optional: Advanced ML libraries
18 | # scikit-learn>=1.3.0,<2.0.0
19 | # xgboost>=2.0.0,<3.0.0
20 |
21 | # Optional: Vowpal Wabbit Python bindings
22 | # vowpalwabbit>=9.0.0,<10.0.0
23 |
--------------------------------------------------------------------------------
/forum/vm_command:
--------------------------------------------------------------------------------
1 | Training VW:
2 | ./vw click.train.vw -f click.model.vw --loss_function logistic
3 |
4 |
5 | Testing VW:
6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt
7 |
8 |
9 | Training VW2:
10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic
11 |
12 |
13 | Training VW3:
14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw --loss_function logistic
15 |
16 | parameters:
17 | -b bits
18 | -l rate
19 | --power_t p
20 | --passes
21 | -c
22 | --holdout_off
23 |
--------------------------------------------------------------------------------
/vowpal wabbit/vm_command:
--------------------------------------------------------------------------------
1 | Training VW:
2 | ./vw click.train.vw -f click.model.vw --loss_function logistic
3 |
4 |
5 | Testing VW:
6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt
7 |
8 |
9 | Training VW2:
10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic
11 |
12 |
13 | Training VW3:
14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw --loss_function logistic
15 |
16 | parameters:
17 | -b bits
18 | -l rate
19 | --power_t p
20 | --passes
21 | -c
22 | --holdout_off
23 |
--------------------------------------------------------------------------------
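For reference, a minimal Python sketch of driving the same training and testing commands from a script (illustrative only; it assumes the vw binary is on PATH and the click.train.vw / click.test.vw files listed above already exist):

import subprocess

# train: 28-bit feature space, learning rate 10, cached passes, logistic loss (the "Training VW2" recipe above)
subprocess.run(
    ["vw", "click.train.vw", "-b", "28", "-l", "10", "-c", "--passes", "25",
     "--holdout_off", "-f", "click.model.vw", "--loss_function", "logistic"],
    check=True,
)

# test: load the saved model (-i), disable learning (-t), write raw predictions to click.preds.txt
subprocess.run(
    ["vw", "click.test.vw", "-t", "-i", "click.model.vw", "-p", "click.preds.txt"],
    check=True,
)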
/pythonSGD/SGD_py.sln:
--------------------------------------------------------------------------------
1 |
2 | Microsoft Visual Studio Solution File, Format Version 12.00
3 | # Visual Studio 2012
4 | Project("{888888A0-9F3D-457C-B088-3A5042F75D52}") = "SGD_py", "SGD_py.pyproj", "{82875642-D6FA-4F5C-81E6-B89A93C1F5FF}"
5 | EndProject
6 | Global
7 | GlobalSection(SolutionConfigurationPlatforms) = preSolution
8 | Debug|Any CPU = Debug|Any CPU
9 | Release|Any CPU = Release|Any CPU
10 | EndGlobalSection
11 | GlobalSection(ProjectConfigurationPlatforms) = postSolution
12 | {82875642-D6FA-4F5C-81E6-B89A93C1F5FF}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
13 | {82875642-D6FA-4F5C-81E6-B89A93C1F5FF}.Release|Any CPU.ActiveCfg = Release|Any CPU
14 | EndGlobalSection
15 | GlobalSection(SolutionProperties) = preSolution
16 | HideSolutionNode = FALSE
17 | EndGlobalSection
18 | EndGlobal
19 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Predict-click-through-rates-on-display-ads
2 | ==========================================
3 |
4 | Display advertising is a billion dollar effort and one of the central uses of machine learning on the Internet. However, its data and methods are usually kept under lock and key. In this research competition, CriteoLabs is sharing a week’s worth of data for you to develop models predicting ad click-through rate (CTR). Given a user and the page he is visiting, what is the probability that he will click on a given ad? The goal of this challenge is to benchmark the most accurate ML algorithms for CTR estimation. All winning models will be released under an open source license. As a participant, you are given a chance to access the traffic logs from Criteo that include various undisclosed features along with the click labels.
5 |
--------------------------------------------------------------------------------
/vowpal wabbit/ending solution.txt:
--------------------------------------------------------------------------------
1 | vw train_nw.vw -f data/model.vw --loss_function logistic -b 25 -l .15 -c --passes 5 -q cc -q ii -q ci --holdout_off --cubic iii --decay_learning_rate .8
2 |
3 | vw --holdout_off --cache_file data/train_cat_int.cache --loss_function logistic -b 29 --passes 6 -l 0.01 --nn 60 --power_t 0 -f data/nn60_l001_p6.mod
4 |
5 | vw -d train.vw -c -b 28 --link=logistic --loss_function logistic --passes 2 --holdout_off --ngram c3 --ngram n2 --skips n1 --ngram f2 --skips f1 --l2 7.12091e-09 -l 0.240971683207491 --initial_t 1.53478225382649 --decay_learning_rate 0.267332
6 |
7 | vw click4.TT.train.vw -k -c -f click.neu13.model.vw --loss_function logistic --passes 20 -l 0.15 -b 25 --nn 35 --holdout_period 50 --early_terminate 1
8 |
9 | ./vowpalwabbit/vw ../xtrain4.vw -c -k -l 0.1 -b 29 --loss_function logistic -q cc -q ii -q ci --holdout_off -f xmodel.vw
--------------------------------------------------------------------------------
/pytest.ini:
--------------------------------------------------------------------------------
1 | [pytest]
2 | # Pytest configuration for CTR Prediction project
3 |
4 | # Test discovery patterns
5 | python_files = test_*.py
6 | python_classes = Test*
7 | python_functions = test_*
8 |
9 | # Test paths
10 | testpaths = tests
11 |
12 | # Options
13 | addopts =
14 | --verbose
15 | --strict-markers
16 | --tb=short
17 | --disable-warnings
18 | # Coverage options (uncomment when ready)
19 | # --cov=.
20 | # --cov-report=html
21 | # --cov-report=term-missing
22 | # --cov-fail-under=80
23 |
24 | # Markers
25 | markers =
26 | slow: marks tests as slow (deselect with '-m "not slow"')
27 | integration: marks tests as integration tests
28 | unit: marks tests as unit tests
29 | requires_data: marks tests that require actual data files
30 |
31 | # Logging
32 | log_cli = true
33 | log_cli_level = INFO
34 | log_cli_format = %(asctime)s [%(levelname)s] %(message)s
35 | log_cli_date_format = %Y-%m-%d %H:%M:%S
36 |
37 | # Ignore patterns
38 | norecursedirs = .git .tox dist build *.egg venv env
39 |
40 | # Timeout (in seconds) for individual tests
41 | # timeout = 300
42 |
--------------------------------------------------------------------------------
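With this configuration in place, the markers declared above can be used to select subsets of the suite; for example (pytest from requirements.txt, plus pytest-cov if the commented coverage options are enabled):

    pytest                      # full suite under tests/, verbose, short tracebacks
    pytest -m "not slow"        # everything except tests marked slow
    pytest -m unit              # only tests marked unit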
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Tianxiang Liu
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/logReg_test.py:
--------------------------------------------------------------------------------
1 | from logReg import *  # assumed: trainLogRegres, testLogRegres and showLogRegres live in the neighbouring logReg.py
2 | from numpy import *
3 | import matplotlib.pyplot as plt
4 | import time
5 |
6 | def loadData():
7 | train_x = []
8 | train_y = []
9 | fileIn = open('E:/Python/Machine Learning in Action/testSet.txt')
10 | for line in fileIn.readlines():
11 | lineArr = line.strip().split()
12 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])])
13 | train_y.append(float(lineArr[2]))
14 | return mat(train_x), mat(train_y).transpose()
15 |
16 |
17 | ## step 1: load data
18 | print("step 1: load data...")
19 | train_x, train_y = loadData()
20 | test_x = train_x; test_y = train_y
21 | 
22 | ## step 2: training...
23 | print("step 2: training...")
24 | opts = {'alpha': 0.01, 'maxIter': 20, 'optimizeType': 'smoothStocGradDescent'}
25 | optimalWeights = trainLogRegres(train_x, train_y, opts)
26 | 
27 | ## step 3: testing
28 | print("step 3: testing...")
29 | accuracy = testLogRegres(optimalWeights, test_x, test_y)
30 | 
31 | ## step 4: show the result
32 | print("step 4: show the result...")
33 | print('The classify accuracy is: %.3f%%' % (accuracy * 100))
34 | showLogRegres(optimalWeights, train_x, train_y)
--------------------------------------------------------------------------------
/pythonSGD/logReg_test.py:
--------------------------------------------------------------------------------
1 | from logReg import *  # assumed: trainLogRegres, testLogRegres and showLogRegres live in the neighbouring logReg.py
2 | from numpy import *
3 | import matplotlib.pyplot as plt
4 | import time
5 |
6 | def loadData():
7 | train_x = []
8 | train_y = []
9 | fileIn = open('E:/Python/Machine Learning in Action/testSet.txt')
10 | for line in fileIn.readlines():
11 | lineArr = line.strip().split()
12 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])])
13 | train_y.append(float(lineArr[2]))
14 | return mat(train_x), mat(train_y).transpose()
15 |
16 |
17 | ## step 1: load data
18 | print("step 1: load data...")
19 | train_x, train_y = loadData()
20 | test_x = train_x; test_y = train_y
21 | 
22 | ## step 2: training...
23 | print("step 2: training...")
24 | opts = {'alpha': 0.01, 'maxIter': 20, 'optimizeType': 'smoothStocGradDescent'}
25 | optimalWeights = trainLogRegres(train_x, train_y, opts)
26 | 
27 | ## step 3: testing
28 | print("step 3: testing...")
29 | accuracy = testLogRegres(optimalWeights, test_x, test_y)
30 | 
31 | ## step 4: show the result
32 | print("step 4: show the result...")
33 | print('The classify accuracy is: %.3f%%' % (accuracy * 100))
34 | showLogRegres(optimalWeights, train_x, train_y)
--------------------------------------------------------------------------------
/tests/conftest.py:
--------------------------------------------------------------------------------
1 | """
2 | Pytest configuration and fixtures.
3 | """
4 |
5 | import pytest
6 | import numpy as np
7 | from pathlib import Path
8 | import tempfile
9 | import csv
10 |
11 |
12 | @pytest.fixture
13 | def sample_csv_data():
14 | """Generate sample CSV data for testing."""
15 | return [
16 | {'Id': '1', 'Label': '0', 'I1': '5', 'I2': '10', 'C1': 'abc123', 'C2': 'def456'},
17 | {'Id': '2', 'Label': '1', 'I1': '3', 'I2': '', 'C1': 'abc123', 'C2': 'xyz789'},
18 | {'Id': '3', 'Label': '0', 'I1': '', 'I2': '20', 'C1': 'ghi012', 'C2': 'def456'},
19 | ]
20 |
21 |
22 | @pytest.fixture
23 | def temp_csv_file(sample_csv_data, tmp_path):
24 | """Create a temporary CSV file with sample data."""
25 | csv_file = tmp_path / "test_data.csv"
26 |
27 | fieldnames = ['Id', 'Label', 'I1', 'I2', 'C1', 'C2']
28 |
29 | with open(csv_file, 'w', newline='') as f:
30 | writer = csv.DictWriter(f, fieldnames=fieldnames)
31 | writer.writeheader()
32 | writer.writerows(sample_csv_data)
33 |
34 | return csv_file
35 |
36 |
37 | @pytest.fixture
38 | def sample_train_data():
39 | """Generate sample training data (X, y)."""
40 | np.random.seed(42)
41 | X = np.random.randn(100, 5)
42 | y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)
43 | return X, y
44 |
45 |
46 | @pytest.fixture
47 | def temp_dir():
48 | """Create a temporary directory for testing."""
49 | with tempfile.TemporaryDirectory() as tmpdir:
50 | yield Path(tmpdir)
51 |
--------------------------------------------------------------------------------
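A minimal sketch of how these fixtures might be consumed (illustrative only; the actual suites live in tests/test_lr_model.py and tests/test_vw_converter.py, whose contents are not reproduced here):

import csv
import numpy as np


def test_sample_train_data_shapes(sample_train_data):
    # the fixture returns 100 samples with 5 features and a binary {0, 1} target
    X, y = sample_train_data
    assert X.shape == (100, 5)
    assert y.shape == (100, 1)
    assert set(np.unique(y)) <= {0.0, 1.0}


def test_temp_csv_file_roundtrip(temp_csv_file, sample_csv_data):
    # the temporary CSV should contain exactly the rows the data fixture generated
    with open(temp_csv_file, newline="") as f:
        rows = list(csv.DictReader(f))
    assert rows == sample_csv_data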
/r_sdg.R:
--------------------------------------------------------------------------------
1 | setwd('/Users/ivan/Work_directory/Predict-click-through-rates-on-display-ads/')
2 | train <- 'testSet.txt'
3 | test <- 'test.csv'
4 | D <- 2^20
5 | alpha <- .1
6 | w <- rep.int(0, D)
7 | n <- rep.int(0, D)
8 | loss <- 0.
9 | col_num <- 3
10 |
11 | # test logloss of predictions and true values
12 | logloss <- function(p,y) {
13 | p <- max(min(p, 1 - 10^-13), 10^-13)
14 | res <- ifelse(y==1, -log(p), -log(1 - p))
15 | res
16 | }
17 |
18 | # extract one record from database
19 | get_data <- function(row, data){
20 | r <- read.table(data, skip=row-1, nrows=1, sep='\t', row.names=row)
21 | r
22 | }
23 |
24 | # get possibilities of records
25 | get_p <- function(x, w){
26 | wTx <- 0
27 | for (i2 in x) {
28 | wTx <- wTx + w[i2] * 1.}
29 | p <- 1/(1 + exp(-max(min(wTx, 20), -20)))
30 | p
31 | }
32 |
33 | # update the weights according to results
34 | update_w <- function(w, n, x, p, y){
35 |   for (i in x){
36 |     w[i] <- w[i] - ((p - y) * alpha / (sqrt(n[i]) + 1))
37 |     n[i] <- n[i] + 1
38 |   }
39 |   # return both the updated weights and the updated counts
40 |   list(w = w, n = n)
41 | }
42 |
43 | # main steps for modeling (get_x below must already be defined when this loop runs)
44 | for (i in 1:46000000){
45 |   row <- get_data(i, train)
46 |   y <- row[2]
47 |   x <- get_x(row[-c(1,2)], D)  # drop Id and Label, hash the remaining features
48 |   p <- get_p(x, w)
49 |   loss <- loss + logloss(p,y)
50 |   if (i %% 1000000 == 0){
51 |     print(paste0('logloss: ', loss/i, '. rows processed: ', i))
52 |   }
53 |   updates <- update_w(w, n, x, p, y)
54 |   w <- updates$w
55 |   n <- updates$n
56 | }
57 |
58 |
59 | get_x <- function(row, D){
60 |   x <- c(1)  # index 1 is the bias term (R vectors are 1-based)
61 |   for (j in row){
62 |     x <- c(x, as.integer(j) %% D + 1)
63 |   }
64 |   x
65 | }
66 |
67 |
68 |
69 |
70 |
71 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Python
2 | *.py[cod]
3 | *$py.class
4 | *.so
5 | .Python
6 | build/
7 | develop-eggs/
8 | dist/
9 | downloads/
10 | eggs/
11 | .eggs/
12 | lib/
13 | lib64/
14 | parts/
15 | sdist/
16 | var/
17 | wheels/
18 | share/python-wheels/
19 | *.egg-info/
20 | .installed.cfg
21 | *.egg
22 | MANIFEST
23 | __pycache__/
24 | *.pyc
25 |
26 | # Virtual environments
27 | venv/
28 | env/
29 | ENV/
30 | env.bak/
31 | venv.bak/
32 | .venv/
33 |
34 | # IDE
35 | .vscode/
36 | .idea/
37 | *.swp
38 | *.swo
39 | *~
40 | .DS_Store
41 | *.suo
42 | *.user
43 | *.userosscache
44 | *.sln.docstates
45 |
46 | # Visual Studio
47 | *.pyproj
48 | *.sln
49 | *.suo
50 | *.user
51 | *.userosscache
52 | .vs/
53 |
54 | # Jupyter Notebook
55 | .ipynb_checkpoints
56 | *.ipynb
57 |
58 | # R
59 | .Rhistory
60 | .Rapp.history
61 | .RData
62 | .Ruserdata
63 | *.Rproj
64 | .Rproj.user/
65 |
66 | # Data files (large datasets)
67 | *.csv
68 | !**/sample*.csv
69 | *.tsv
70 | *.dat
71 | *.txt
72 | !requirements.txt
73 | !requirements-r.txt
74 | !README.txt
75 | !LICENSE.txt
76 |
77 | # Model files (can be large)
78 | *.model
79 | *.vw
80 | *.pkl
81 | *.h5
82 | *.joblib
83 |
84 | # Logs
85 | *.log
86 | logs/
87 |
88 | # Output/Submission files
89 | submission*.csv
90 | *submit*.csv
91 | output/
92 | results/
93 |
94 | # Temporary files
95 | tmp/
96 | temp/
97 | *.tmp
98 | *.bak
99 | *~
100 | *.cache
101 |
102 | # OS
103 | .DS_Store
104 | Thumbs.db
105 | ehthumbs.db
106 |
107 | # Backup files
108 | *.orig
109 | *~
110 |
111 | # Claude settings (keep local only)
112 | .claude/settings.local.json
113 |
--------------------------------------------------------------------------------
/SDG_21_Sep_2014.R:
--------------------------------------------------------------------------------
1 | setwd("C:\\Users\\Ivan.Liuyanfeng\\Desktop\\Data_Mining_Work_Space\\Predict-click-through-rates-on-display-ads\\local")
2 | # basic parameters
3 | con <- file('train.csv','r')
4 | D <- 2^27
5 | alpha <- .145
6 |
7 | # logit loss calculation
8 | logloss <- function(p,y){
9 | epsilon <- 10 ^ -15
10 | p <- max(min(p, 1-epsilon), epsilon)
11 | ll <- y*log(p) + (1-y)*log((1-p))
12 | ll <- ll * -1/1
13 | ll
14 | }
15 |
16 | # prediction
17 | get_p<- function(x,w){
18 | wTx <- 0
19 | for (i in 1:length(x)){
20 | wTx <- wTx + w[i] * 1
21 | }
22 |     sigmoid <- 1/(1+exp(-max(min(wTx, 20), -20)))
23 | sigmoid
24 | }
25 |
26 | # update weights
27 | update_w <-function (w, n, x, p, y){
28 | for (i in 1:length(x)){
29 | lr <- alpha / (sqrt(n[i])+1)
30 | gradient <- (p-y)
31 | w[i] <- w[i] - gradient * lr
32 | n[i] <- n[i] + 1
33 | }
34 |     list(w = w, n = n)
35 | }
36 |
37 | # basic parameters: one weight / counter per raw feature column read below (columns 3:40)
38 | # w <- rep(0, D)
39 | w <- rep(1, 38)
40 | # n <- rep(0, D)
41 | n <- rep(0, 38)
42 | loss <- 0
43 | ##################start modeling - parameters setup ########################
44 | train_label <- readLines(con, n=1)
45 | for (i in 1:10) {
46 | row <- readLines(con,n=1)
47 | train_row <- strsplit(row, ',')
48 | y <- as.integer(train_row[[1]][2])
49 | x <- c()
50 | for (k in 3:40){
51 | x <- c(x, train_row[[1]][k])
52 | }
53 | ################# Modeling #################################################
54 | p <- get_p(x,w)
55 | loss <- loss + logloss(p,y)
56 | if (i %% 100000 == 0){
57 | print(loss)
58 | }
59 | upd <- update_w(w,n,x,p,y)
60 |     w <- upd$w
61 |     n <- upd$n
62 | print(loss)
63 | break
64 | }
65 |
--------------------------------------------------------------------------------
/pythonSGD/SGD_py.pyproj:
--------------------------------------------------------------------------------
[Visual Studio Python Tools project file; the XML markup was stripped when this dump was generated. Recoverable values: default configuration Debug, schema version 2.0, project GUID {82875642-d6fa-4f5c-81e6-b89a93c1f5ff}, startup file logReg.py, project type GUID {888888a0-9f3d-457c-b088-3a5042f75d52}, launcher "Standard Python launcher", ToolsVersion 10.0, and an import of $(MSBuildExtensionsPath32)\Microsoft\VisualStudio\v$(VisualStudioVersion)\Python Tools\Microsoft.PythonTools.targets.]
--------------------------------------------------------------------------------
/kaggle.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 |
3 | def loadDataSet():
4 | dataMat = []; labelMat = []
5 | fr = open('train.csv')
6 | for line in fr.readlines():
7 | singleArr=[1.0]
8 | lineArr = line.strip().split()
9 | for i in range(39):
10 | singleArr.append(float(lineArr[i+2]))
11 | dataMat.append(singleArr)
12 | labelMat.append(int(lineArr[1]))
13 | return dataMat,labelMat
14 |
15 | def sigmoid(inX):
16 | return 1.0/(1+exp(-inX))
17 |
18 | def gradAscent(dataMatIn, classLabels):
19 | dataMatrix = mat(dataMatIn) #convert to NumPy matrix
20 | labelMat = mat(classLabels).transpose() #convert to NumPy matrix
21 | m,n = shape(dataMatrix)
22 | alpha = 0.001
23 | maxCycles = 500
24 | weights = ones((n,1))
25 | for k in range(maxCycles): #heavy on matrix operations
26 | h = sigmoid(dataMatrix*weights) #matrix mult
27 | error = (labelMat - h) #vector subtraction
28 | weights = weights + alpha * dataMatrix.transpose()* error #matrix mult
29 | return weights
30 |
31 | def plotBestFit(weights):
32 | import matplotlib.pyplot as plt
33 | dataMat,labelMat=loadDataSet()
34 | dataArr = array(dataMat)
35 | n = shape(dataArr)[0]
36 | xcord1 = []; ycord1 = []
37 | xcord2 = []; ycord2 = []
38 | for i in range(n):
39 | if int(labelMat[i])== 1:
40 | xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
41 | else:
42 | xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
43 | fig = plt.figure()
44 | ax = fig.add_subplot(111)
45 | ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
46 | ax.scatter(xcord2, ycord2, s=30, c='green')
47 | x = arange(-3.0, 3.0, 0.1)
48 | y = (-weights[0]-weights[1]*x)/weights[2]
49 | ax.plot(x, y)
50 | plt.xlabel('X1'); plt.ylabel('X2');
51 | plt.show()
--------------------------------------------------------------------------------
/forum/csv_to_vm.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 |
3 | ########################################################
4 | # __Author__: Triskelion #
5 | # Kaggle competition "Display Advertising Challenge": #
6 | # http://www.kaggle.com/c/criteo-display-ad-challenge/ #
7 | # Credit: Zygmunt Zając #
8 | ########################################################
9 |
10 | from datetime import datetime
11 | from csv import DictReader
12 |
13 | def csv_to_vw(loc_csv, loc_output, train=True):
14 | """
15 | Munges a CSV file (loc_csv) to a VW file (loc_output). Set "train"
16 | to False when munging a test set.
17 | TODO: Too slow for a daily cron job. Try optimize, Pandas or Go.
18 | """
19 | start = datetime.now()
20 | print("\nTurning %s into %s. Is_train_set? %s"%(loc_csv,loc_output,train))
21 |
22 | with open(loc_output,"wb") as outfile:
23 | for e, row in enumerate( DictReader(open(loc_csv)) ):
24 |
25 | #Creating the features
26 | numerical_features = ""
27 | categorical_features = ""
28 | for k,v in row.items():
29 | if k not in ["Label","Id"]:
30 | if "I" in k: # numerical feature, example: I5
31 | if len(str(v)) > 0: #check for empty values
32 | numerical_features += " %s:%s" % (k,v)
33 | if "C" in k: # categorical feature, example: C2
34 | if len(str(v)) > 0:
35 | categorical_features += " %s" % v
36 |
37 | #Creating the labels
38 | if train: #we care about labels
39 | if row['Label'] == "1":
40 | label = 1
41 | else:
42 | label = -1 #we set negative label to -1
43 | outfile.write( "%s '%s |i%s |c%s\n" % (label,row['Id'],numerical_features,categorical_features) )
44 |
45 | else: #we dont care about labels
46 | outfile.write( "1 '%s |i%s |c%s\n" % (row['Id'],numerical_features,categorical_features) )
47 |
48 | #Reporting progress
49 | if e % 1000000 == 0:
50 | print("%s\t%s"%(e, str(datetime.now() - start)))
51 |
52 | print("\n %s Task execution time:\n\t%s"%(e, str(datetime.now() - start)))
53 |
54 | #csv_to_vw("d:\\Downloads\\train\\train.csv", "c:\\click.train.vw",train=True)
55 | #csv_to_vw("d:\\Downloads\\test\\test.csv", "d:\\click.test.vw",train=False)
--------------------------------------------------------------------------------
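The converter above targets Python 2 (binary-mode writes of str). A minimal Python 3 sketch of the same munging logic, assuming the Criteo column naming (Id, Label, I1..I13, C1..C26); for a training row it emits a line like 1 '10000001 |i I1:5 I2:10 |c 68fd1e64 80e26c9b (made-up values):

from csv import DictReader

def csv_to_vw(loc_csv, loc_output, train=True):
    # Python 3 port of the converter above: same namespaces (i / c), same labels
    with open(loc_output, "w") as outfile, open(loc_csv) as infile:
        for row in DictReader(infile):
            numerical, categorical = "", ""
            for k, v in row.items():
                if k in ("Label", "Id") or not v:
                    continue
                if k.startswith("I"):       # numerical feature, e.g. I5
                    numerical += " %s:%s" % (k, v)
                elif k.startswith("C"):     # categorical feature, e.g. C2
                    categorical += " %s" % v
            if train:
                label = 1 if row["Label"] == "1" else -1
            else:
                label = 1   # dummy label for the test set
            outfile.write("%s '%s |i%s |c%s\n" % (label, row["Id"], numerical, categorical))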
/vowpal wabbit/csv_to_vm.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 |
3 | ########################################################
4 | # __Author__: Triskelion #
5 | # Kaggle competition "Display Advertising Challenge": #
6 | # http://www.kaggle.com/c/criteo-display-ad-challenge/ #
7 | # Credit: Zygmunt Zając #
8 | ########################################################
9 |
10 | from datetime import datetime
11 | from csv import DictReader
12 |
13 | def csv_to_vw(loc_csv, loc_output, train=True):
14 | """
15 | Munges a CSV file (loc_csv) to a VW file (loc_output). Set "train"
16 | to False when munging a test set.
17 | TODO: Too slow for a daily cron job. Try optimize, Pandas or Go.
18 | """
19 | start = datetime.now()
20 | print("\nTurning %s into %s. Is_train_set? %s"%(loc_csv,loc_output,train))
21 |
22 | with open(loc_output,"wb") as outfile:
23 | for e, row in enumerate( DictReader(open(loc_csv)) ):
24 |
25 | #Creating the features
26 | numerical_features = ""
27 | categorical_features = ""
28 | for k,v in row.items():
29 | if k not in ["Label","Id"]:
30 | if "I" in k: # numerical feature, example: I5
31 | if len(str(v)) > 0: #check for empty values
32 | numerical_features += " %s:%s" % (k,v)
33 | if "C" in k: # categorical feature, example: C2
34 | if len(str(v)) > 0:
35 | categorical_features += " %s" % v
36 |
37 | #Creating the labels
38 | if train: #we care about labels
39 | if row['Label'] == "1":
40 | label = 1
41 | else:
42 | label = -1 #we set negative label to -1
43 | outfile.write( "%s '%s |i%s |c%s\n" % (label,row['Id'],numerical_features,categorical_features) )
44 |
45 | else: #we dont care about labels
46 | outfile.write( "1 '%s |i%s |c%s\n" % (row['Id'],numerical_features,categorical_features) )
47 |
48 | #Reporting progress
49 | if e % 1000000 == 0:
50 | print("%s\t%s"%(e, str(datetime.now() - start)))
51 |
52 | print("\n %s Task execution time:\n\t%s"%(e, str(datetime.now() - start)))
53 |
54 | #csv_to_vw("d:\\Downloads\\train\\train.csv", "c:\\click.train.vw",train=True)
55 | #csv_to_vw("d:\\Downloads\\test\\test.csv", "d:\\click.test.vw",train=False)
--------------------------------------------------------------------------------
/testSet.txt:
--------------------------------------------------------------------------------
1 | -0.017612 14.053064 0
2 | -1.395634 4.662541 1
3 | -0.752157 6.538620 0
4 | -1.322371 7.152853 0
5 | 0.423363 11.054677 0
6 | 0.406704 7.067335 1
7 | 0.667394 12.741452 0
8 | -2.460150 6.866805 1
9 | 0.569411 9.548755 0
10 | -0.026632 10.427743 0
11 | 0.850433 6.920334 1
12 | 1.347183 13.175500 0
13 | 1.176813 3.167020 1
14 | -1.781871 9.097953 0
15 | -0.566606 5.749003 1
16 | 0.931635 1.589505 1
17 | -0.024205 6.151823 1
18 | -0.036453 2.690988 1
19 | -0.196949 0.444165 1
20 | 1.014459 5.754399 1
21 | 1.985298 3.230619 1
22 | -1.693453 -0.557540 1
23 | -0.576525 11.778922 0
24 | -0.346811 -1.678730 1
25 | -2.124484 2.672471 1
26 | 1.217916 9.597015 0
27 | -0.733928 9.098687 0
28 | -3.642001 -1.618087 1
29 | 0.315985 3.523953 1
30 | 1.416614 9.619232 0
31 | -0.386323 3.989286 1
32 | 0.556921 8.294984 1
33 | 1.224863 11.587360 0
34 | -1.347803 -2.406051 1
35 | 1.196604 4.951851 1
36 | 0.275221 9.543647 0
37 | 0.470575 9.332488 0
38 | -1.889567 9.542662 0
39 | -1.527893 12.150579 0
40 | -1.185247 11.309318 0
41 | -0.445678 3.297303 1
42 | 1.042222 6.105155 1
43 | -0.618787 10.320986 0
44 | 1.152083 0.548467 1
45 | 0.828534 2.676045 1
46 | -1.237728 10.549033 0
47 | -0.683565 -2.166125 1
48 | 0.229456 5.921938 1
49 | -0.959885 11.555336 0
50 | 0.492911 10.993324 0
51 | 0.184992 8.721488 0
52 | -0.355715 10.325976 0
53 | -0.397822 8.058397 0
54 | 0.824839 13.730343 0
55 | 1.507278 5.027866 1
56 | 0.099671 6.835839 1
57 | -0.344008 10.717485 0
58 | 1.785928 7.718645 1
59 | -0.918801 11.560217 0
60 | -0.364009 4.747300 1
61 | -0.841722 4.119083 1
62 | 0.490426 1.960539 1
63 | -0.007194 9.075792 0
64 | 0.356107 12.447863 0
65 | 0.342578 12.281162 0
66 | -0.810823 -1.466018 1
67 | 2.530777 6.476801 1
68 | 1.296683 11.607559 0
69 | 0.475487 12.040035 0
70 | -0.783277 11.009725 0
71 | 0.074798 11.023650 0
72 | -1.337472 0.468339 1
73 | -0.102781 13.763651 0
74 | -0.147324 2.874846 1
75 | 0.518389 9.887035 0
76 | 1.015399 7.571882 0
77 | -1.658086 -0.027255 1
78 | 1.319944 2.171228 1
79 | 2.056216 5.019981 1
80 | -0.851633 4.375691 1
81 | -1.510047 6.061992 0
82 | -1.076637 -3.181888 1
83 | 1.821096 10.283990 0
84 | 3.010150 8.401766 1
85 | -1.099458 1.688274 1
86 | -0.834872 -1.733869 1
87 | -0.846637 3.849075 1
88 | 1.400102 12.628781 0
89 | 1.752842 5.468166 1
90 | 0.078557 0.059736 1
91 | 0.089392 -0.715300 1
92 | 1.825662 12.693808 0
93 | 0.197445 9.744638 0
94 | 0.126117 0.922311 1
95 | -0.679797 1.220530 1
96 | 0.677983 2.556666 1
97 | 0.761349 10.693862 0
98 | -2.168791 0.143632 1
99 | 1.388610 9.341997 0
100 | 0.317029 14.739025 0
101 |
--------------------------------------------------------------------------------
/gbm.R:
--------------------------------------------------------------------------------
1 | setwd("I:\\data")
2 | library(data.table)
3 | # train <- fread("train.csv",select=c(1:15))
4 | # head(train)
5 | # write.table(train,"train_num.csv",sep=",", row.names=F, col.names=T)
6 | # str(train)
7 | # rm("train")
8 | # ?read.csv
9 | #
10 | # library(caret)
11 | # sum(is.na(train))
12 | # mean(is.na(train))
13 | # summary(train)
14 | # gc()
15 | # train <- fread("train_num.csv")
16 | # train <- na.omit(train)
17 | # index <- is.na(train)
18 | # table(index)
19 | # train <- train[-index,]
20 | # write.table(train,"train_num_na.csv",sep=",", row.names=F, col.names=T)
21 | # rm(train)
22 |
23 | # data cleansing
24 | #################
25 | train <- read.csv("train_num_na.csv")
26 | train <- train[,-1]
27 | head(train)
28 | train[which(train$Label==1),1] <- "Yes"
29 | train[which(train$Label==0),1] <- "No"
30 | train$Label <- as.factor(train$Label)
31 | write.table(train,"train_num_na_yesno.csv",sep=",", row.names=F, col.names=T)
32 |
33 |
34 | #covariate creation
35 | # nearZeroVar(train,saveMetrics = T)
36 | # train.pca <- preProcess(train[,2:14], method='pca', pcaComp=2)
37 |
38 | # model
39 | # library(doParallel)
40 | library(caret)
41 | # cl <- makePSOCKcluster(4)
42 | # registerDoParallel(cl)
43 | # fit1 <- train(Label~., method="rf",data=train)
44 |
45 | Grid <- expand.grid(n.trees=c(500),interaction.depth=c(22),shrinkage=.1)
46 | fitControl <- trainControl(method="none", allowParallel=T, classProbs=T)
47 | fit2 <- train(Label~., method="gbm", data=train, trControl=fitControl,
48 | verbose=T,tuneGrid=Grid, metric="ROC")
49 | pred2 <- predict(fit2, train)
50 | confusionMatrix(pred2, train$Label)
51 | rm(pred2)
52 |
53 | # fit3 <- train(Label~., method="glmnet",family="binomial",classProbs=T, data=train,verbose=T)
54 | # fit3 <- train(Label~., method="glmnet",classProbs=T, data=train,verbose=T)
55 |
56 | # fit 4
57 | ctrl <- trainControl(method = "cv",
58 | number=2,
59 | classProbs = TRUE,
60 | allowParallel = TRUE,
61 | summaryFunction = twoClassSummary)
62 |
63 | set.seed(888)
64 | rfFit <- train(Label~.,
65 | data=train,
66 | method = "rf",
67 | # tuneGrid = expand.grid(.mtry = 4),
68 | ntrees=500,
69 | importance = TRUE,
70 | metric = "ROC",
71 | trControl = ctrl)
72 |
73 |
74 | pred <- predict.train(rfFit, newdata = test, type = "prob")
75 |
76 |
77 | # stopCluster(cl)
78 |
79 | # load test data
80 | # test <- fread("test.csv", select=c(1:14))
81 | # write.table(test,"test_num.csv",sep=",", row.names=F, col.names=T)
82 | # test <- read.csv("test_num.csv")
83 | test <- read.csv("test_num_impute.csv")
84 | # test data imputation
85 | # pre<-preProcess(test, method='medianImpute')
86 | # test_impute <- predict(pre, test)
87 |
88 | # predict
89 | gc()
90 | # pred1 <- predict(fit1, test)
91 | pred2 <- predict(fit2,type="prob", test)
92 | head(test)
93 | pred2 <- plogis(pred2)
94 | # pred3 <- predict(fit3, test)
95 | # ensembling-models
96 | # data(pred1,pred2,pred3,train)
97 | # combFit<-train(Label~.,method="gam", train)
98 |
99 | # output
100 | fit2.submit <- data.frame(test$Id, pred2$Yes)  # pred2 holds the predicted class probabilities, not a column of test
101 | colnames(fit2.submit)<- c("Id","Predicted")
102 | write.table(fit2.submit,"submit1_gbm_num_nona_impute.csv", row.names=F, sep=',')
103 |
--------------------------------------------------------------------------------
/py_lh.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 |
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 20     # number of weights used for learning
12 | alpha = .2 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | p = max(min(p, 1. - 10e-12), 10e-12)
25 | return -log(p) if y == 1. else -log(1. - p)
26 |
27 |
28 | # B. Apply hash trick of the original csv row
29 | # for simplicity, we treat both integer and categorical features as categorical
30 | # INPUT:
31 | #     csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
32 | # D: the max index that we can hash to
33 | # OUTPUT:
34 | # x: a list of indices that its value is 1
35 | def get_x(csv_row, D):
36 | x = [0] # 0 is the index of the bias term
37 | for key, value in csv_row.items():
38 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
39 | x.append(index)
40 | return x # x contains indices of features that have a value of 1
41 |
42 |
43 | # C. Get probability estimation on x
44 | # INPUT:
45 | # x: features
46 | # w: weights
47 | # OUTPUT:
48 | # probability of p(y = 1 | x; w)
49 | def get_p(x, w):
50 | wTx = 0.
51 | for i in x: # do wTx
52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid
54 |
55 |
56 | # D. Update given model
57 | # INPUT:
58 | # w: weights
59 | # n: a counter that counts the number of times we encounter a feature
60 | # this is used for adaptive learning rate
61 | # x: feature
62 | # p: prediction of our model
63 | # y: answer
64 | # OUTPUT:
65 | # w: updated model
66 | # n: updated count
67 | def update_w(w, n, x, p, y):
68 | for i in x:
69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
70 | # (p - y) * x[i] is the current gradient
71 | # note that in our case, if i in x then x[i] = 1
72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
73 | n[i] += 1.
74 |
75 | return w, n
76 |
77 |
78 | # training and testing #######################################################
79 |
80 | # initialize our model
81 | w = [0.] * D # weights
82 | n = [0.] * D # number of times we've encountered a feature
83 |
84 | # start training a logistic regression model using one-pass sgd
85 | loss = 0.
86 | for t, row in enumerate(DictReader(open(train))):
87 | y = 1. if row['Label'] == '1' else 0.
88 |
89 | del row['Label'] # can't let the model peek the answer
90 | del row['Id'] # we don't need the Id
91 |
92 | # main training procedure
93 | # step 1, get the hashed features
94 | x = get_x(row, D)
95 |
96 | # step 2, get prediction
97 | p = get_p(x, w)
98 |
99 | # for progress validation, useless for learning our model
100 | loss += logloss(p, y)
101 | if t % 1000000 == 0 and t > 1:
102 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
103 | datetime.now(), t, loss/t))
104 |
105 | # step 3, update model with answer
106 | w, n = update_w(w, n, x, p, y)
107 |
108 | # testing (build kaggle's submission file)
109 | with open('submissionPython.csv', 'w') as submission:
110 | submission.write('Id,Predicted\n')
111 | for t, row in enumerate(DictReader(open(test))):
112 | Id = row['Id']
113 | del row['Id']
114 | x = get_x(row, D)
115 | p = get_p(x, w)
116 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
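A small worked example of the hashing trick used by get_x and get_p above (illustrative; the row values are made up): every non-empty field is turned into an index by interpreting value + column-number as a hexadecimal string and taking it modulo D, so each row touches only a handful of the D weights.

# a minimal, self-contained check of the hashing scheme from get_x/get_p above
from math import exp

D = 2 ** 20
row = {'I1': '357', 'C2': '68fd1e64'}        # hypothetical csv row (Label and Id already removed)

x = [0]                                      # index 0 is the bias term
for key, value in row.items():
    x.append(int(value + key[1:], 16) % D)   # e.g. int('68fd1e642', 16) % D

w = [0.] * D
wTx = sum(w[i] for i in x)                   # every hashed feature has value 1
p = 1. / (1. + exp(-max(min(wTx, 20.), -20.)))
print(x, p)                                  # with all-zero weights p == 0.5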
/py_lh2.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 |
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 26     # number of weights used for learning
12 | alpha = .1 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | p = max(min(p, 1. - 10e-12), 10e-12)
25 | return -log(p) if y == 1. else -log(1. - p)
26 |
27 |
28 | # B. Apply hash trick of the original csv row
29 | # for simplicity, we treat both integer and categorical features as categorical
30 | # INPUT:
31 | #     csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
32 | # D: the max index that we can hash to
33 | # OUTPUT:
34 | # x: a list of indices that its value is 1
35 | def get_x(csv_row, D):
36 | x = [0] # 0 is the index of the bias term
37 | for key, value in csv_row.items():
38 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
39 | x.append(index)
40 | return x # x contains indices of features that have a value of 1
41 |
42 |
43 | # C. Get probability estimation on x
44 | # INPUT:
45 | # x: features
46 | # w: weights
47 | # OUTPUT:
48 | # probability of p(y = 1 | x; w)
49 | def get_p(x, w):
50 | wTx = 0.
51 | for i in x: # do wTx
52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid
54 |
55 |
56 | # D. Update given model
57 | # INPUT:
58 | # w: weights
59 | # n: a counter that counts the number of times we encounter a feature
60 | # this is used for adaptive learning rate
61 | # x: feature
62 | # p: prediction of our model
63 | # y: answer
64 | # OUTPUT:
65 | # w: updated model
66 | # n: updated count
67 | def update_w(w, n, x, p, y):
68 | for i in x:
69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
70 | # (p - y) * x[i] is the current gradient
71 | # note that in our case, if i in x then x[i] = 1
72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
73 | n[i] += 1.
74 |
75 | return w, n
76 |
77 |
78 | # training and testing #######################################################
79 |
80 | # initialize our model
81 | w = [0.] * D # weights
82 | n = [0.] * D # number of times we've encountered a feature
83 |
84 | # start training a logistic regression model using one-pass sgd
85 | loss = 0.
86 | for t, row in enumerate(DictReader(open(train))):
87 | y = 1. if row['Label'] == '1' else 0.
88 |
89 | del row['Label'] # can't let the model peek the answer
90 | del row['Id'] # we don't need the Id
91 |
92 | # main training procedure
93 | # step 1, get the hashed features
94 | x = get_x(row, D)
95 |
96 | # step 2, get prediction
97 | p = get_p(x, w)
98 |
99 | # for progress validation, useless for learning our model
100 | loss += logloss(p, y)
101 | if t % 1000000 == 0 and t > 1:
102 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
103 | datetime.now(), t, loss/t))
104 |
105 | # step 3, update model with answer
106 | w, n = update_w(w, n, x, p, y)
107 |
108 | # testing (build kaggle's submission file)
109 | with open('submissionPython2.csv', 'w') as submission:
110 | submission.write('Id,Predicted\n')
111 | for t, row in enumerate(DictReader(open(test))):
112 | Id = row['Id']
113 | del row['Id']
114 | x = get_x(row, D)
115 | p = get_p(x, w)
116 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
/py_lh3.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 |
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 25     # number of weights used for learning
12 | alpha = .15 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | p = max(min(p, 1. - 10e-12), 10e-12)
25 | return -log(p) if y == 1. else -log(1. - p)
26 |
27 |
28 | # B. Apply hash trick of the original csv row
29 | # for simplicity, we treat both integer and categorical features as categorical
30 | # INPUT:
31 | #     csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
32 | # D: the max index that we can hash to
33 | # OUTPUT:
34 | # x: a list of indices that its value is 1
35 | def get_x(csv_row, D):
36 | x = [0] # 0 is the index of the bias term
37 | for key, value in csv_row.items():
38 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
39 | x.append(index)
40 | return x # x contains indices of features that have a value of 1
41 |
42 |
43 | # C. Get probability estimation on x
44 | # INPUT:
45 | # x: features
46 | # w: weights
47 | # OUTPUT:
48 | # probability of p(y = 1 | x; w)
49 | def get_p(x, w):
50 | wTx = 0.
51 | for i in x: # do wTx
52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid
54 |
55 |
56 | # D. Update given model
57 | # INPUT:
58 | # w: weights
59 | # n: a counter that counts the number of times we encounter a feature
60 | # this is used for adaptive learning rate
61 | # x: feature
62 | # p: prediction of our model
63 | # y: answer
64 | # OUTPUT:
65 | # w: updated model
66 | # n: updated count
67 | def update_w(w, n, x, p, y):
68 | for i in x:
69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
70 | # (p - y) * x[i] is the current gradient
71 | # note that in our case, if i in x then x[i] = 1
72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
73 | n[i] += 1.
74 |
75 | return w, n
76 |
77 |
78 | # training and testing #######################################################
79 |
80 | # initialize our model
81 | w = [0.] * D # weights
82 | n = [0.] * D # number of times we've encountered a feature
83 |
84 | # start training a logistic regression model using one-pass sgd
85 | loss = 0.
86 | for t, row in enumerate(DictReader(open(train))):
87 | y = 1. if row['Label'] == '1' else 0.
88 |
89 | del row['Label'] # can't let the model peek the answer
90 | del row['Id'] # we don't need the Id
91 |
92 | # main training procedure
93 | # step 1, get the hashed features
94 | x = get_x(row, D)
95 |
96 | # step 2, get prediction
97 | p = get_p(x, w)
98 |
99 | # for progress validation, useless for learning our model
100 | loss += logloss(p, y)
101 | if t % 1000000 == 0 and t > 1:
102 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
103 | datetime.now(), t, loss/t))
104 |
105 | # step 3, update model with answer
106 | w, n = update_w(w, n, x, p, y)
107 |
108 | # testing (build kaggle's submission file)
109 | with open('submissionPython4.csv', 'w') as submission:
110 | submission.write('Id,Predicted\n')
111 | for t, row in enumerate(DictReader(open(test))):
112 | Id = row['Id']
113 | del row['Id']
114 | x = get_x(row, D)
115 | p = get_p(x, w)
116 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
/py_lh4.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 |
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 27     # number of weights used for learning
12 | alpha = .145 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | p = max(min(p, 1. - 10e-12), 10e-12)
25 | return -log(p) if y == 1. else -log(1. - p)
26 |
27 |
28 | # B. Apply hash trick of the original csv row
29 | # for simplicity, we treat both integer and categorical features as categorical
30 | # INPUT:
31 | #     csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
32 | # D: the max index that we can hash to
33 | # OUTPUT:
34 | # x: a list of indices that its value is 1
35 | def get_x(csv_row, D):
36 | x = [0] # 0 is the index of the bias term
37 | for key, value in csv_row.items():
38 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
39 | x.append(index)
40 | return x # x contains indices of features that have a value of 1
41 |
42 |
43 | # C. Get probability estimation on x
44 | # INPUT:
45 | # x: features
46 | # w: weights
47 | # OUTPUT:
48 | # probability of p(y = 1 | x; w)
49 | def get_p(x, w):
50 | wTx = 0.
51 | for i in x: # do wTx
52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid
54 |
55 |
56 | # D. Update given model
57 | # INPUT:
58 | # w: weights
59 | # n: a counter that counts the number of times we encounter a feature
60 | # this is used for adaptive learning rate
61 | # x: feature
62 | # p: prediction of our model
63 | # y: answer
64 | # OUTPUT:
65 | # w: updated model
66 | # n: updated count
67 | def update_w(w, n, x, p, y):
68 | for i in x:
69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
70 | # (p - y) * x[i] is the current gradient
71 | # note that in our case, if i in x then x[i] = 1
72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
73 | n[i] += 1.
74 |
75 | return w, n
76 |
77 |
78 | # training and testing #######################################################
79 |
80 | # initialize our model
81 | w = [0.] * D # weights
82 | n = [0.] * D # number of times we've encountered a feature
83 |
84 | # start training a logistic regression model using one-pass sgd
85 | loss = 0.
86 | for t, row in enumerate(DictReader(open(train))):
87 | y = 1. if row['Label'] == '1' else 0.
88 |
89 | del row['Label'] # can't let the model peek the answer
90 | del row['Id'] # we don't need the Id
91 |
92 | # main training procedure
93 | # step 1, get the hashed features
94 | x = get_x(row, D)
95 |
96 | # step 2, get prediction
97 | p = get_p(x, w)
98 |
99 | # for progress validation, useless for learning our model
100 | loss += logloss(p, y)
101 | if t % 1000000 == 0 and t > 1:
102 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
103 | datetime.now(), t, loss/t))
104 |
105 | # step 3, update model with answer
106 | w, n = update_w(w, n, x, p, y)
107 |
108 | # testing (build kaggle's submission file)
109 | with open('submissionPython5.csv', 'w') as submission:
110 | submission.write('Id,Predicted\n')
111 | for t, row in enumerate(DictReader(open(test))):
112 | Id = row['Id']
113 | del row['Id']
114 | x = get_x(row, D)
115 | p = get_p(x, w)
116 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
/logReg_click.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 |
6 | # calculate the sigmoid function
7 | def sigmoid(inX):
8 | return 1.0 / (1 + exp(-inX))
9 |
10 | # train a logistic regression model using a selectable optimization algorithm
11 | # input: train_x is a mat datatype, each row stands for one sample
12 | #        train_y is mat datatype too, each row is the corresponding label
13 | #        opts holds the optimization options: step size (alpha), maximum number of iterations, and the optimizer type
14 | def trainLogRegres(train_x, train_y, opts):
15 | # calculate training time
16 | startTime = time.time()
17 |
18 | numSamples, numFeatures = shape(train_x)
19 | alpha = opts['alpha']; maxIter = opts['maxIter']
20 | weights = ones((numFeatures, 1))
21 |
22 | # optimize through the chosen gradient descent algorithm
23 | for k in range(maxIter):
24 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorithm
25 | output = sigmoid(train_x * weights)
26 | error = train_y - output
27 | weights = weights + alpha * train_x.transpose() * error
28 | elif opts['optimizeType'] == 'stocGradDescent': # stochastic gradient descent
29 | for i in range(numSamples):
30 | output = sigmoid(train_x[i, :] * weights)
31 | error = train_y[i, 0] - output
32 | weights = weights + alpha * train_x[i, :].transpose() * error
33 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent
34 | # randomly select samples to optimize for reducing cycle fluctuations
35 | dataIndex = list(range(numSamples))
36 | for i in range(numSamples):
37 | alpha = 4.0 / (1.0 + k + i) + 0.01
38 | randIndex = int(random.uniform(0, len(dataIndex)))
39 | output = sigmoid(train_x[randIndex, :] * weights)
40 | error = train_y[randIndex, 0] - output
41 | weights = weights + alpha * train_x[randIndex, :].transpose() * error
42 | del dataIndex[randIndex] # during one iteration, delete the already-optimized sample
43 | else:
44 | raise ValueError('Unsupported optimizeType: %s' % opts['optimizeType'])
45 |
46 |
47 | print ('Congratulations, training complete! Took %fs!' % (time.time() - startTime))
48 | return weights
49 |
50 | # test your trained Logistic Regression model given test set
51 | def testLogRegres(weights, test_x, test_y):
52 | numSamples, numFeatures = shape(test_x)
53 | matchCount = 0
54 | for i in range(numSamples):
55 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
56 | if predict == bool(test_y[i, 0]):
57 | matchCount += 1
58 | accuracy = float(matchCount) / numSamples
59 | return accuracy
60 |
61 | def loadData():
62 | train_x = []
63 | train_y = []
64 | fileIn = open('train.csv')
65 | for line in fileIn.readlines():
66 | lineArr = line.strip().split()
67 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])])
68 | train_y.append(float(lineArr[2]))
69 | return mat(train_x), mat(train_y).transpose()
70 |
71 | ###############################################################################
72 | ## step 1: load data
73 | print ("step 1: load data...")
74 | train_x, train_y = loadData()
75 | test_x = train_x; test_y = train_y
76 |
77 | ## step 2: training...
78 | print ("step 2: training...")
79 | opts = {'alpha': 0.01, 'maxIter': 2 ** 27, 'optimizeType': 'smoothStocGradDescent'}
80 | optimalWeights = trainLogRegres(train_x, train_y, opts)
81 |
82 | ## step 3: testing
83 | print ("step 3: testing...")
84 | accuracy = testLogRegres(optimalWeights, test_x, test_y)
85 |
86 | ## step 4: show the result
87 | print ("step 4: show the result...")
88 | print ('The classification accuracy is: %.3f%%' % (accuracy * 100))
--------------------------------------------------------------------------------
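Note: logReg_click.py defines trainLogRegres and testLogRegres but its driver expects a whitespace-separated train.csv and a maxIter of 2 ** 27, which is impractical for a quick check. A quick smoke test on synthetic 2-D data; the blob data, seed, and small maxIter are illustrative only, and it should be run in the same session as the functions above (importing the module would execute its file-reading main block):

import numpy as np
from numpy import mat

np.random.seed(0)
pos = np.random.randn(50, 2) + 2.0                          # class-1 blob
neg = np.random.randn(50, 2) - 2.0                          # class-0 blob
X = np.hstack((np.ones((100, 1)), np.vstack((pos, neg))))   # bias column first, as loadData() does
y = np.vstack((np.ones((50, 1)), np.zeros((50, 1))))

train_x, train_y = mat(X), mat(y)
opts = {'alpha': 0.01, 'maxIter': 200, 'optimizeType': 'gradDescent'}
weights = trainLogRegres(train_x, train_y, opts)
print('accuracy: %.3f' % testLogRegres(weights, train_x, train_y))
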
/pythonSGD/logReg_click.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 |
6 | # calculate the sigmoid function
7 | def sigmoid(inX):
8 | return 1.0 / (1 + exp(-inX))
9 |
10 | # train a logistic regression model using a selectable optimization algorithm
11 | # input: train_x is a mat datatype, each row stands for one sample
12 | # train_y is a mat datatype too, each row is the corresponding label
13 | # opts holds the optimization options, including the step size and the maximum number of iterations
14 | def trainLogRegres(train_x, train_y, opts):
15 | # calculate training time
16 | startTime = time.time()
17 |
18 | numSamples, numFeatures = shape(train_x)
19 | alpha = opts['alpha']; maxIter = opts['maxIter']
20 | weights = ones((numFeatures, 1))
21 |
22 | # optimize through the chosen gradient descent algorithm
23 | for k in range(maxIter):
24 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorithm
25 | output = sigmoid(train_x * weights)
26 | error = train_y - output
27 | weights = weights + alpha * train_x.transpose() * error
28 | elif opts['optimizeType'] == 'stocGradDescent': # stochastic gradient descent
29 | for i in range(numSamples):
30 | output = sigmoid(train_x[i, :] * weights)
31 | error = train_y[i, 0] - output
32 | weights = weights + alpha * train_x[i, :].transpose() * error
33 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent
34 | # randomly select samples to optimize for reducing cycle fluctuations
35 | dataIndex = list(range(numSamples))
36 | for i in range(numSamples):
37 | alpha = 4.0 / (1.0 + k + i) + 0.01
38 | randIndex = int(random.uniform(0, len(dataIndex)))
39 | output = sigmoid(train_x[randIndex, :] * weights)
40 | error = train_y[randIndex, 0] - output
41 | weights = weights + alpha * train_x[randIndex, :].transpose() * error
42 | del dataIndex[randIndex] # during one iteration, delete the already-optimized sample
43 | else:
44 | raise ValueError('Unsupported optimizeType: %s' % opts['optimizeType'])
45 |
46 |
47 | print ('Congratulations, training complete! Took %fs!' % (time.time() - startTime))
48 | return weights
49 |
50 | # test your trained Logistic Regression model given test set
51 | def testLogRegres(weights, test_x, test_y):
52 | numSamples, numFeatures = shape(test_x)
53 | matchCount = 0
54 | for i in range(numSamples):
55 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
56 | if predict == bool(test_y[i, 0]):
57 | matchCount += 1
58 | accuracy = float(matchCount) / numSamples
59 | return accuracy
60 |
61 | def loadData():
62 | train_x = []
63 | train_y = []
64 | fileIn = open('train.csv')
65 | for line in fileIn.readlines():
66 | lineArr = line.strip().split()
67 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])])
68 | train_y.append(float(lineArr[2]))
69 | return mat(train_x), mat(train_y).transpose()
70 |
71 | ###############################################################################
72 | ## step 1: load data
73 | print ("step 1: load data...")
74 | train_x, train_y = loadData()
75 | test_x = train_x; test_y = train_y
76 |
77 | ## step 2: training...
78 | print ("step 2: training...")
79 | opts = {'alpha': 0.01, 'maxIter': 2 ** 27, 'optimizeType': 'smoothStocGradDescent'}
80 | optimalWeights = trainLogRegres(train_x, train_y, opts)
81 |
82 | ## step 3: testing
83 | print ("step 3: testing...")
84 | accuracy = testLogRegres(optimalWeights, test_x, test_y)
85 |
86 | ## step 4: show the result
87 | print ("step 4: show the result...")
88 | print ('The classification accuracy is: %.3f%%' % (accuracy * 100))
--------------------------------------------------------------------------------
/pythonSGD/py_lh_22Sep2014_2.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 |
5 | # parameters #################################################################
6 |
7 | train = 'train.csv' # path to training file
8 | test = 'test.csv' # path to testing file
9 |
10 | D = 2 ** 28 # number of weights used for learning
11 | alpha = .145 # learning rate for sgd optimization
12 |
13 |
14 | # function definitions #######################################################
15 |
16 | # A. Bounded logloss
17 | # INPUT:
18 | # p: our prediction
19 | # y: real answer
20 | # OUTPUT
21 | # logarithmic loss of p given y
22 | def logloss(p, y):
23 | p = max(min(p, 1. - 10e-12), 10e-12)
24 | return -log(p) if y == 1. else -log(1. - p)
25 |
26 | # B. Apply hash trick of the original csv row
27 | # for simplicity, we treat both integer and categorical features as categorical
28 | # INPUT:
29 | # csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
30 | # D: the number of hash buckets; indices range from 0 to D-1
31 | # OUTPUT:
32 | # x: a list of indices whose feature value is 1
33 | def get_x(csv_row, D):
34 | x = [0] # 0 is the index of the bias term
35 | for key, value in csv_row.items():
36 | index = int(value + key[1:], 32) % D # weakest hash ever ;)
37 | x.append(index)
38 | return x # x contains indices of features that have a value of 1
39 |
40 |
41 | # C. Get probability estimation on x
42 | # INPUT:
43 | # x: features
44 | # w: weights
45 | # OUTPUT:
46 | # probability of p(y = 1 | x; w)
47 | def get_p(x, w):
48 | wTx = 0.
49 | for i in x: # do wTx
50 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
51 | return 1. / (1. + exp(-max(min(wTx, 10.), -10.))) # bounded sigmoid
52 |
53 |
54 | # D. Update given model
55 | # INPUT:
56 | # w: weights
57 | # n: a counter that counts the number of times we encounter a feature
58 | # this is used for adaptive learning rate
59 | # x: feature
60 | # p: prediction of our model
61 | # y: answer
62 | # OUTPUT:
63 | # w: updated model
64 | # n: updated count
65 | def update_w(w, n, x, p, y):
66 | for i in x:
67 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
68 | # (p - y) * x[i] is the current gradient
69 | # note that in our case, if i in x then x[i] = 1
70 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
71 | n[i] += 1.
72 | return w, n
73 |
74 |
75 | # training and testing #######################################################
76 |
77 | # initialize our model
78 | w = [0.] * D # weights
79 | n = [0.] * D # number of times we've encountered a feature
80 |
82 | # start training a logistic regression model using one-pass sgd
82 | loss = 0.
83 | for t, row in enumerate(DictReader(open(train))):
84 | y = 1. if row['Label'] == '1' else 0.
85 |
86 | del row['Label'] # can't let the model peek the answer
87 | del row['Id'] # we don't need the Id
88 |
89 | # main training procedure
90 | # step 1, get the hashed features
91 | x = get_x(row, D)
92 |
93 | # step 2, get prediction
94 | p = get_p(x, w)
95 |
96 | # for progress validation, useless for learning our model
97 | loss += logloss(p, y)
98 | if t % 100000 == 0 and t > 1:
99 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
100 | datetime.now(), t, loss/t))
101 |
102 | # step 3, update model with answer
103 | if t <= 40000000:
104 | w, n = update_w(w, n, x, p, y)
105 |
106 | # testing (build kaggle's submission file)
107 | with open('submissionPython_22Sep2014.csv', 'w') as submission:
108 | submission.write('Id,Predicted\n')
109 | for t, row in enumerate(DictReader(open(test))):
110 | Id = row['Id']
111 | del row['Id']
112 | x = get_x(row, D)
113 | p = get_p(x, w)
114 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
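Note: this variant (and pythonSGD/py_lh_20Sep2014.py further down) stops calling update_w after row 40,000,000 while still accumulating logloss, so the tail of the single pass can serve as a rough streaming holdout. A generic sketch of that pattern; the names stream, predict and update stand in for the row iterator and the get_p / update_w style helpers and are not the repository's code:

from math import log

def bounded_logloss(p, y, eps=1e-15):
    p = max(min(p, 1. - eps), eps)
    return -log(p) if y == 1. else -log(1. - p)

def one_pass(stream, predict, update, cutoff):
    """Single pass over (x, y) pairs: update weights up to `cutoff`, then only measure."""
    train_loss = train_n = holdout_loss = holdout_n = 0
    for t, (x, y) in enumerate(stream):
        p = predict(x)
        if t <= cutoff:
            train_loss += bounded_logloss(p, y)
            train_n += 1
            update(x, p, y)              # weights only move before the cutoff
        else:
            holdout_loss += bounded_logloss(p, y)
            holdout_n += 1
    return train_loss / max(train_n, 1), holdout_loss / max(holdout_n, 1)
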
/logReg.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 |
6 | # calculate the sigmoid function
7 | def sigmoid(inX):
8 | return 1.0 / (1 + exp(-inX))
9 |
10 |
11 | # train a logistic regression model using a selectable optimization algorithm
12 | # input: train_x is a mat datatype, each row stands for one sample
13 | # train_y is a mat datatype too, each row is the corresponding label
14 | # opts holds the optimization options, including the step size and the maximum number of iterations
15 | def trainLogRegres(train_x, train_y, opts):
16 | # calculate training time
17 | startTime = time.time()
18 |
19 | numSamples, numFeatures = shape(train_x)
20 | alpha = opts['alpha']; maxIter = opts['maxIter']
21 | weights = ones((numFeatures, 1))
22 |
23 | # optimize through the chosen gradient descent algorithm
24 | for k in range(maxIter):
25 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorithm
26 | output = sigmoid(train_x * weights)
27 | error = train_y - output
28 | weights = weights + alpha * train_x.transpose() * error
29 | elif opts['optimizeType'] == 'stocGradDescent': # stochastic gradient descent
30 | for i in range(numSamples):
31 | output = sigmoid(train_x[i, :] * weights)
32 | error = train_y[i, 0] - output
33 | weights = weights + alpha * train_x[i, :].transpose() * error
34 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent
35 | # randomly select samples to optimize for reducing cycle fluctuations
36 | dataIndex = list(range(numSamples))
37 | for i in range(numSamples):
38 | alpha = 4.0 / (1.0 + k + i) + 0.01
39 | randIndex = int(random.uniform(0, len(dataIndex)))
40 | output = sigmoid(train_x[randIndex, :] * weights)
41 | error = train_y[randIndex, 0] - output
42 | weights = weights + alpha * train_x[randIndex, :].transpose() * error
43 | del dataIndex[randIndex] # during one iteration, delete the already-optimized sample
44 | else:
45 | raise ValueError('Unsupported optimizeType: %s' % opts['optimizeType'])
46 |
47 |
48 | print('Congratulations, training complete! Took %fs!' % (time.time() - startTime))
49 | return weights
50 |
51 |
52 | # test your trained Logistic Regression model given test set
53 | def testLogRegres(weights, test_x, test_y):
54 | numSamples, numFeatures = shape(test_x)
55 | matchCount = 0
56 | for i in range(numSamples):
57 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
58 | if predict == bool(test_y[i, 0]):
59 | matchCount += 1
60 | accuracy = float(matchCount) / numSamples
61 | return accuracy
62 |
63 |
64 | # plot your trained logistic regression model; only available for 2-D data
65 | def showLogRegres(weights, train_x, train_y):
66 | # notice: train_x and train_y are mat datatypes
67 | numSamples, numFeatures = shape(train_x)
68 | if numFeatures != 3:
69 | print "Sorry! I can not draw because the dimension of your data is not 2!"
70 | return 1
71 |
72 | # draw all samples
73 | for i in range(numSamples):
74 | if int(train_y[i, 0]) == 0:
75 | plt.plot(train_x[i, 1], train_x[i, 2], 'or')
76 | elif int(train_y[i, 0]) == 1:
77 | plt.plot(train_x[i, 1], train_x[i, 2], 'ob')
78 |
79 | # draw the classify line
80 | min_x = min(train_x[:, 1])[0, 0]
81 | max_x = max(train_x[:, 1])[0, 0]
82 | weights = weights.getA() # convert mat to array
83 | y_min_x = float(-weights[0] - weights[1] * min_x) / weights[2]
84 | y_max_x = float(-weights[0] - weights[1] * max_x) / weights[2]
85 | plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g')
86 | plt.xlabel('X1'); plt.ylabel('X2')
87 | plt.show()
88 |
89 |
--------------------------------------------------------------------------------
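Note: showLogRegres draws the decision boundary by solving w0 + w1*x1 + w2*x2 = 0 for x2 at the two extremes of x1 (its y_min_x / y_max_x lines). A tiny numeric check of that formula with made-up weights:

# Made-up 2-D weights: bias w0 plus coefficients w1 and w2.
w0, w1, w2 = -1.0, 2.0, 4.0

def boundary_x2(x1):
    # On the decision boundary the linear score is zero: w0 + w1*x1 + w2*x2 = 0
    return (-w0 - w1 * x1) / w2

for x1 in (-3.0, 3.0):                   # the two x1 extremes showLogRegres would use
    print(x1, boundary_x2(x1))           # points where the predicted probability is 0.5
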
/pythonSGD/logReg.py:
--------------------------------------------------------------------------------
1 | from numpy import *
2 | import matplotlib.pyplot as plt
3 | import time
4 |
5 |
6 | # calculate the sigmoid function
7 | def sigmoid(inX):
8 | return 1.0 / (1 + exp(-inX))
9 |
10 |
11 | # train a logistic regression model using a selectable optimization algorithm
12 | # input: train_x is a mat datatype, each row stands for one sample
13 | # train_y is a mat datatype too, each row is the corresponding label
14 | # opts holds the optimization options, including the step size and the maximum number of iterations
15 | def trainLogRegres(train_x, train_y, opts):
16 | # calculate training time
17 | startTime = time.time()
18 |
19 | numSamples, numFeatures = shape(train_x)
20 | alpha = opts['alpha']; maxIter = opts['maxIter']
21 | weights = ones((numFeatures, 1))
22 |
23 | # optimize through the chosen gradient descent algorithm
24 | for k in range(maxIter):
25 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorithm
26 | output = sigmoid(train_x * weights)
27 | error = train_y - output
28 | weights = weights + alpha * train_x.transpose() * error
29 | elif opts['optimizeType'] == 'stocGradDescent': # stochastic gradient descent
30 | for i in range(numSamples):
31 | output = sigmoid(train_x[i, :] * weights)
32 | error = train_y[i, 0] - output
33 | weights = weights + alpha * train_x[i, :].transpose() * error
34 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent
35 | # randomly select samples to optimize for reducing cycle fluctuations
36 | dataIndex = list(range(numSamples))
37 | for i in range(numSamples):
38 | alpha = 4.0 / (1.0 + k + i) + 0.01
39 | randIndex = int(random.uniform(0, len(dataIndex)))
40 | output = sigmoid(train_x[randIndex, :] * weights)
41 | error = train_y[randIndex, 0] - output
42 | weights = weights + alpha * train_x[randIndex, :].transpose() * error
43 | del dataIndex[randIndex] # during one iteration, delete the already-optimized sample
44 | else:
45 | raise ValueError('Unsupported optimizeType: %s' % opts['optimizeType'])
46 |
47 |
48 | print('Congratulations, training complete! Took %fs!' % (time.time() - startTime))
49 | return weights
50 |
51 |
52 | # test your trained Logistic Regression model given test set
53 | def testLogRegres(weights, test_x, test_y):
54 | numSamples, numFeatures = shape(test_x)
55 | matchCount = 0
56 | for i in range(numSamples):
57 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
58 | if predict == bool(test_y[i, 0]):
59 | matchCount += 1
60 | accuracy = float(matchCount) / numSamples
61 | return accuracy
62 |
63 |
64 | # plot your trained logistic regression model; only available for 2-D data
65 | def showLogRegres(weights, train_x, train_y):
66 | # notice: train_x and train_y are mat datatypes
67 | numSamples, numFeatures = shape(train_x)
68 | if numFeatures != 3:
69 | print "Sorry! I can not draw because the dimension of your data is not 2!"
70 | return 1
71 |
72 | # draw all samples
73 | for i in range(numSamples):
74 | if int(train_y[i, 0]) == 0:
75 | plt.plot(train_x[i, 1], train_x[i, 2], 'or')
76 | elif int(train_y[i, 0]) == 1:
77 | plt.plot(train_x[i, 1], train_x[i, 2], 'ob')
78 |
79 | # draw the classify line
80 | min_x = min(train_x[:, 1])[0, 0]
81 | max_x = max(train_x[:, 1])[0, 0]
82 | weights = weights.getA() # convert mat to array
83 | y_min_x = float(-weights[0] - weights[1] * min_x) / weights[2]
84 | y_max_x = float(-weights[0] - weights[1] * max_x) / weights[2]
85 | plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g')
86 | plt.xlabel('X1'); plt.ylabel('X2')
87 | plt.show()
88 |
89 |
--------------------------------------------------------------------------------
/pythonSGD/py_lh4_22Sep2014.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 | import mmh3
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 28 # number of weights used for learning
12 | alpha = .145 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | p = max(min(p, 1. - 10e-15), 10e-15)
25 | return -log(p) if y == 1. else -log(1. - p)
26 |
27 |
28 | # B. Apply hash trick of the original csv row
29 | # for simplicity, we treat both integer and categorical features as categorical
30 | # INPUT:
31 | # csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
32 | # D: the number of hash buckets; indices range from 0 to D-1
33 | # OUTPUT:
34 | # x: a list of indices whose feature value is 1
35 | def get_x(csv_row, D):
36 | x = [0] # 0 is the index of the bias term
37 | for key, value in csv_row.items():
38 | index = mmh3.hash(value + key[1:], 42) % D # MurmurHash3 of the feature string, fixed seed
39 | x.append(index)
40 | return x # x contains indices of features that have a value of 1
41 |
42 |
43 | # C. Get probability estimation on x
44 | # INPUT:
45 | # x: features
46 | # w: weights
47 | # OUTPUT:
48 | # probability of p(y = 1 | x; w)
49 | def get_p(x, w, ld):
50 | ld = ld + 0.001
51 | wTx = 0.
52 | for i in x: # do wTx
53 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
54 | return (1.+ld) / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid
55 |
56 |
57 | # D. Update given model
58 | # INPUT:
59 | # w: weights
60 | # n: a counter that counts the number of times we encounter a feature
61 | # this is used for adaptive learning rate
62 | # x: feature
63 | # p: prediction of our model
64 | # y: answer
65 | # OUTPUT:
66 | # w: updated model
67 | # n: updated count
68 | def update_w(w, n, x, p, y):
69 | for i in x:
70 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
71 | # (p - y) * x[i] is the current gradient
72 | # note that in our case, if i in x then x[i] = 1
73 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
74 | n[i] += 1.
75 |
76 | return w, n
77 |
78 |
79 | # training and testing #######################################################
80 |
81 | # initialize our model
82 | w = [0.] * D # weights
83 | n = [0.] * D # number of times we've encountered a feature
84 |
85 | # start training a logistic regression model using one-pass sgd
86 | loss = 0.
87 | ld = 0.001
88 | for t, row in enumerate(DictReader(open(train))):
89 |
90 | y = 1. if row['Label'] == '1' else 0.
91 |
92 | del row['Label'] # can't let the model peek the answer
93 | del row['Id'] # we don't need the Id
94 |
95 | # main training procedure
96 | # step 1, get the hashed features
97 | x = get_x(row, D)
98 |
99 | # step 2, get prediction
100 | p = get_p(x, w, ld)
101 |
102 | # for progress validation, useless for learning our model
103 | loss += logloss(p, y)
104 | if t % 1000000 == 0 and t > 1:
105 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
106 | datetime.now(), t, loss/t))
107 |
108 | # step 3, update model with answer
109 | w, n = update_w(w, n, x, p, y)
110 |
111 | # testing (build kaggle's submission file)
112 | with open('submissionPython22Sep2014_pm.csv', 'w') as submission:
113 | submission.write('Id,Predicted\n')
114 | for t, row in enumerate(DictReader(open(test))):
115 | Id = row['Id']
116 | del row['Id']
117 | x = get_x(row, D)
118 | p = get_p(x, w, ld)
119 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
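Note: py_lh4_22Sep2014.py swaps the int(value + key[1:], 16) trick for MurmurHash3 via the third-party mmh3 package (pip install mmh3). A minimal sketch of how that hashing maps a raw field to a weight index; the sample values below are made up:

import mmh3                      # third-party package: pip install mmh3

D = 2 ** 28                      # number of weight slots, as in the script above
SEED = 42

def feature_index(key, value):
    # mmh3.hash returns a signed 32-bit integer; Python's % maps it into [0, D)
    return mmh3.hash(value + key[1:], SEED) % D

print(feature_index('C1', '8cf07265'))   # made-up categorical value
print(feature_index('I3', '35'))         # integer features are hashed as strings too
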
/py_lh_20Sep2014.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 | import scipy as sp
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 27 # number of weights used for learning
12 | alpha = .15 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | # p and y are scalars here, so clip p away from 0 and 1
25 | # and return the bounded logarithmic loss directly
26 | epsilon = 1e-15
27 | p = max(min(p, 1. - epsilon), epsilon)
28 | ll = -log(p) if y == 1. else -log(1. - p)
29 | return ll
30 |
31 | # B. Apply hash trick of the original csv row
32 | # for simplicity, we treat both integer and categorical features as categorical
33 | # INPUT:
34 | # csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
35 | # D: the number of hash buckets; indices range from 0 to D-1
36 | # OUTPUT:
37 | # x: a list of indices whose feature value is 1
38 | def get_x(csv_row, D):
39 | x = [0] # 0 is the index of the bias term
40 | for key, value in csv_row.items():
41 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
42 | x.append(index)
43 | return x # x contains indices of features that have a value of 1
44 |
45 |
46 | # C. Get probability estimation on x
47 | # INPUT:
48 | # x: features
49 | # w: weights
50 | # OUTPUT:
51 | # probability of p(y = 1 | x; w)
52 | def get_p(x, w):
53 | wTx = 0.
54 | for i in x: # do wTx
55 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
56 | return 1. / (1. + exp(-max(min(wTx, 10.), -10.))) # bounded sigmoid
57 |
58 |
59 | # D. Update given model
60 | # INPUT:
61 | # w: weights
62 | # n: a counter that counts the number of times we encounter a feature
63 | # this is used for adaptive learning rate
64 | # x: feature
65 | # p: prediction of our model
66 | # y: answer
67 | # OUTPUT:
68 | # w: updated model
69 | # n: updated count
70 | def update_w(w, n, x, p, y):
71 | for i in x:
72 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
73 | # (p - y) * x[i] is the current gradient
74 | # note that in our case, if i in x then x[i] = 1
75 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
76 | n[i] += 1.
77 |
78 | return w, n
79 |
80 |
81 | # training and testing #######################################################
82 |
83 | # initialize our model
84 | w = [0.] * D # weights
85 | n = [0.] * D # number of times we've encountered a feature
86 |
88 | # start training a logistic regression model using one-pass sgd
88 | loss = 0.
89 | for t, row in enumerate(DictReader(open(train))):
90 | y = 1. if row['Label'] == '1' else 0.
91 |
92 | del row['Label'] # can't let the model peek the answer
93 | del row['Id'] # we don't need the Id
94 |
95 | # main training procedure
96 | # step 1, get the hashed features
97 | x = get_x(row, D)
98 |
99 | # step 2, get prediction
100 | p = get_p(x, w)
101 |
102 | # for progress validation, useless for learning our model
103 | loss += logloss(p, y)
104 | if t % 1000000 == 0 and t > 1:
105 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
106 | datetime.now(), t, loss/t))
107 |
108 | # step 3, update model with answer
109 | w, n = update_w(w, n, x, p, y)
110 |
111 | # testing (build kaggle's submission file)
112 | with open('submissionPython_20Sep2014_2.csv', 'w') as submission:
113 | submission.write('Id,Predicted\n')
114 | for t, row in enumerate(DictReader(open(test))):
115 | Id = row['Id']
116 | del row['Id']
117 | x = get_x(row, D)
118 | p = get_p(x, w)
119 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
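Note: every get_p variant in these scripts clamps wTx to a fixed range before the sigmoid because math.exp overflows for very large arguments. A small standalone check of why that bound is there:

from math import exp

def bounded_sigmoid(z, bound=20.):
    z = max(min(z, bound), -bound)   # math.exp overflows once its argument passes ~709
    return 1. / (1. + exp(-z))

print(bounded_sigmoid(5.))       # ordinary case
print(bounded_sigmoid(1000.))    # clamped to +20: ~0.999999998 instead of an OverflowError
print(bounded_sigmoid(-1000.))   # clamped to -20: ~2.1e-09
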
/pythonSGD/py_lh_20Sep2014.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from csv import DictReader
3 | from math import exp, log, sqrt
4 | import scipy as sp
5 |
6 | # parameters #################################################################
7 |
8 | train = 'train.csv' # path to training file
9 | test = 'test.csv' # path to testing file
10 |
11 | D = 2 ** 27 # number of weights used for learning
12 | alpha = .145 # learning rate for sgd optimization
13 |
14 |
15 | # function definitions #######################################################
16 |
17 | # A. Bounded logloss
18 | # INPUT:
19 | # p: our prediction
20 | # y: real answer
21 | # OUTPUT
22 | # logarithmic loss of p given y
23 | def logloss(p, y):
24 | epsilon = 1e-15
25 | p = max(min(p, 1. - epsilon), epsilon)
26 | # p and y are scalars, so compute the bounded logloss with math.log
27 | ll = -log(p) if y == 1. else -log(1. - p)
28 | return ll
29 |
30 | # B. Apply hash trick of the original csv row
31 | # for simplicity, we treat both integer and categorical features as categorical
32 | # INPUT:
33 | # csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...}
34 | # D: the number of hash buckets; indices range from 0 to D-1
35 | # OUTPUT:
36 | # x: a list of indices whose feature value is 1
37 | def get_x(csv_row, D):
38 | x = [0] # 0 is the index of the bias term
39 | for key, value in csv_row.items():
40 | index = int(value + key[1:], 16) % D # weakest hash ever ;)
41 | x.append(index)
42 | return x # x contains indices of features that have a value of 1
43 |
44 |
45 | # C. Get probability estimation on x
46 | # INPUT:
47 | # x: features
48 | # w: weights
49 | # OUTPUT:
50 | # probability of p(y = 1 | x; w)
51 | def get_p(x, w):
52 | wTx = 0.
53 | for i in x: # do wTx
54 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1.
55 | return 1. / (1. + exp(-max(min(wTx, 10.), -10.))) # bounded sigmoid
56 |
57 |
58 | # D. Update given model
59 | # INPUT:
60 | # w: weights
61 | # n: a counter that counts the number of times we encounter a feature
62 | # this is used for adaptive learning rate
63 | # x: feature
64 | # p: prediction of our model
65 | # y: answer
66 | # OUTPUT:
67 | # w: updated model
68 | # n: updated count
69 | def update_w(w, n, x, p, y):
70 | for i in x:
71 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic
72 | # (p - y) * x[i] is the current gradient
73 | # note that in our case, if i in x then x[i] = 1
74 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
75 | n[i] += 1.
76 | return w, n
77 |
78 |
79 | # training and testing #######################################################
80 |
81 | # initialize our model
82 | w = [0.] * D # weights
83 | n = [0.] * D # number of times we've encountered a feature
84 |
86 | # start training a logistic regression model using one-pass sgd
86 | loss = 0.
87 | for t, row in enumerate(DictReader(open(train))):
88 | y = 1. if row['Label'] == '1' else 0.
89 |
90 | del row['Label'] # can't let the model peek the answer
91 | del row['Id'] # we don't need the Id
92 |
93 | # main training procedure
94 | # step 1, get the hashed features
95 | x = get_x(row, D)
96 |
97 | # step 2, get prediction
98 | p = get_p(x, w)
99 |
100 | # for progress validation, useless for learning our model
101 | loss += logloss(p, y)
102 | if t % 100000 == 0 and t > 1:
103 | print('%s\tencountered: %d\tcurrent logloss: %f' % (
104 | datetime.now(), t, loss/t))
105 |
106 | # step 3, update model with answer
107 | if t <= 40000000:
108 | w, n = update_w(w, n, x, p, y)
109 |
110 | # testing (build kaggle's submission file)
111 | with open('submissionPython_21Sep2014.csv', 'w') as submission:
112 | submission.write('Id,Predicted\n')
113 | for t, row in enumerate(DictReader(open(test))):
114 | Id = row['Id']
115 | del row['Id']
116 | x = get_x(row, D)
117 | p = get_p(x, w)
118 | submission.write('%s,%f\n' % (Id, p))
--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
1 | # Configuration file for CTR Prediction Project
2 | # =============================================
3 |
4 | # Project metadata
5 | project:
6 | name: "CTR Prediction"
7 | version: "2.0"
8 | description: "Click-Through Rate prediction for display advertising"
9 |
10 | # Data paths
11 | data:
12 | root_dir: "./data"
13 | train_file: "train.csv"
14 | test_file: "test.csv"
15 | sample_train_file: "train_sample.csv" # Optional: smaller dataset for testing
16 |
17 | # Processed data
18 | train_vw: "train.vw"
19 | test_vw: "test.vw"
20 |
21 | # Validation split (if doing train/val split)
22 | validation_split: 0.2
23 | random_seed: 42
24 |
25 | # Output paths
26 | output:
27 | root_dir: "./output"
28 | submission_file: "submission.csv"
29 | log_file: "training.log"
30 | model_dir: "./models"
31 |
32 | # Logistic Regression with SGD configuration
33 | logistic_regression:
34 | # Feature hashing
35 | dimension: 134217728 # 2^27 = 134,217,728 features
36 |
37 | # Optimization
38 | learning_rate: 0.145
39 | adaptive_learning: true
40 |
41 | # Training
42 | max_passes: 1
43 | log_interval: 1000000 # Log every 1M samples
44 |
45 | # Numerical stability
46 | epsilon: 1.0e-12
47 | sigmoid_bound: 20.0 # Bound input to sigmoid to [-20, 20]
48 |
49 | # Gradient Boosting Machine (GBM) configuration
50 | gbm:
51 | # Model hyperparameters
52 | n_trees: 500
53 | interaction_depth: 22
54 | shrinkage: 0.1
55 | min_samples_split: 10
56 |
57 | # Training
58 | cv_folds: 2
59 | metric: "ROC"
60 | verbose: true
61 |
62 | # Computation
63 | n_jobs: -1 # Use all CPU cores
64 | random_state: 888
65 |
66 | # Vowpal Wabbit configuration
67 | vowpal_wabbit:
68 | # Training options
69 | loss_function: "logistic"
70 | learning_rate: 0.5
71 | l1_lambda: 0.0
72 | l2_lambda: 0.0
73 |
74 | # Optimization
75 | passes: 3
76 | cache_file: "./data/train.cache"
77 |
78 | # Feature engineering
79 | quadratic: "" # e.g., "ii" for quadratic interactions in namespace i
80 | cubic: "" # e.g., "iii" for cubic interactions
81 |
82 | # Hashing
83 | bit_precision: 27 # 2^27 features
84 |
85 | # Output
86 | model_file: "./models/click.model.vw"
87 | predictions_file: "./output/vw_predictions.txt"
88 |
89 | # Data preprocessing
90 | preprocessing:
91 | # Missing value handling
92 | numerical_imputation: "median" # Options: mean, median, zero
93 | categorical_imputation: "mode" # Options: mode, unknown
94 |
95 | # Feature engineering
96 | create_interactions: false
97 | polynomial_features: false
98 | polynomial_degree: 2
99 |
100 | # Scaling (usually not needed for tree-based methods)
101 | scale_features: false
102 | scaler_type: "standard" # Options: standard, minmax, robust
103 |
104 | # Model evaluation
105 | evaluation:
106 | metrics:
107 | - "log_loss"
108 | - "auc_roc"
109 | - "accuracy"
110 | - "precision"
111 | - "recall"
112 |
113 | # Validation
114 | use_validation: false
115 | validation_size: 0.2
116 |
117 | # Logging configuration
118 | logging:
119 | level: "INFO" # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
120 | format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
121 | log_to_file: true
122 | log_to_console: true
123 |
124 | # Experiment tracking
125 | experiment:
126 | track_experiments: false
127 | experiment_name: "baseline"
128 | tags:
129 | - "logistic_regression"
130 | - "hash_trick"
131 |
132 | # Resource constraints
133 | resources:
134 | # Memory limits (in GB)
135 | max_memory_gb: 16
136 |
137 | # CPU
138 | n_processes: 4
139 |
140 | # GPU (if available)
141 | use_gpu: false
142 | gpu_device: 0
143 |
144 | # Reproducibility
145 | reproducibility:
146 | random_seed: 42
147 | deterministic: true
148 |
149 | # Development settings
150 | development:
151 | # Use smaller sample for quick testing
152 | use_sample: false
153 | sample_size: 1000000 # First 1M rows
154 |
155 | # Debug mode
156 | debug: false
157 | profile_code: false
158 |
159 | # Testing
160 | run_tests: true
161 | test_coverage_threshold: 0.8
162 |
--------------------------------------------------------------------------------
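Note: a minimal sketch of reading a few of the values defined in config.yaml from a Python entry point with PyYAML (pip install pyyaml); whether the modernized scripts already do this is not shown in this excerpt, so treat this purely as an example of the key layout above:

import yaml                                  # PyYAML: pip install pyyaml
from pathlib import Path

config = yaml.safe_load(Path("config.yaml").read_text())

dimension = config["logistic_regression"]["dimension"]          # 134217728, i.e. 2 ** 27
learning_rate = config["logistic_regression"]["learning_rate"]  # 0.145
train_csv = Path(config["data"]["root_dir"]) / config["data"]["train_file"]

print(dimension, learning_rate, train_csv)
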
/.github/workflows/ci.yml:
--------------------------------------------------------------------------------
1 | name: CI
2 |
3 | on:
4 | push:
5 | branches: [ master, dev-claude ]
6 | pull_request:
7 | branches: [ master ]
8 |
9 | jobs:
10 | test-python:
11 | name: Test Python ${{ matrix.python-version }}
12 | runs-on: ubuntu-latest
13 |
14 | strategy:
15 | matrix:
16 | python-version: ['3.8', '3.9', '3.10', '3.11']
17 |
18 | steps:
19 | - name: Checkout code
20 | uses: actions/checkout@v3
21 |
22 | - name: Set up Python ${{ matrix.python-version }}
23 | uses: actions/setup-python@v4
24 | with:
25 | python-version: ${{ matrix.python-version }}
26 |
27 | - name: Cache pip packages
28 | uses: actions/cache@v3
29 | with:
30 | path: ~/.cache/pip
31 | key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
32 | restore-keys: |
33 | ${{ runner.os }}-pip-
34 |
35 | - name: Install dependencies
36 | run: |
37 | python -m pip install --upgrade pip
38 | pip install -r requirements.txt
39 |
40 | - name: Lint with flake8
41 | run: |
42 | pip install flake8
43 | # Stop the build if there are Python syntax errors or undefined names
44 | flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
45 | # Exit-zero treats all errors as warnings
46 | flake8 . --count --exit-zero --max-complexity=10 --max-line-length=100 --statistics
47 |
48 | - name: Check code formatting with black
49 | run: |
50 | pip install black
51 | black --check .
52 |
53 | - name: Run tests with pytest
54 | run: |
55 | pytest tests/ --verbose
56 |
57 | - name: Generate coverage report
58 | if: matrix.python-version == '3.10'
59 | run: |
60 | pytest tests/ --cov=. --cov-report=xml --cov-report=html
61 |
62 | - name: Upload coverage to Codecov
63 | if: matrix.python-version == '3.10'
64 | uses: codecov/codecov-action@v3
65 | with:
66 | files: ./coverage.xml
67 | flags: unittests
68 | name: codecov-umbrella
69 |
70 | test-r:
71 | name: Test R Scripts
72 | runs-on: ubuntu-latest
73 |
74 | steps:
75 | - name: Checkout code
76 | uses: actions/checkout@v3
77 |
78 | - name: Set up R
79 | uses: r-lib/actions/setup-r@v2
80 | with:
81 | r-version: '4.2.0'
82 |
83 | - name: Install R dependencies
84 | run: |
85 | Rscript -e "install.packages(c('data.table', 'caret', 'gbm'), dependencies=TRUE, repos='https://cloud.r-project.org')"
86 |
87 | - name: Check R script syntax
88 | run: |
89 | Rscript -e "source('gbm_modernized.R', echo=TRUE)" || true
90 |
91 | lint:
92 | name: Code Quality Checks
93 | runs-on: ubuntu-latest
94 |
95 | steps:
96 | - name: Checkout code
97 | uses: actions/checkout@v3
98 |
99 | - name: Set up Python
100 | uses: actions/setup-python@v4
101 | with:
102 | python-version: '3.10'
103 |
104 | - name: Install linting tools
105 | run: |
106 | python -m pip install --upgrade pip
107 | pip install flake8 pylint black isort
108 |
109 | - name: Run flake8
110 | run: |
111 | flake8 py_lh4_modernized.py logReg_modernized.py csv_to_vw_modernized.py --max-line-length=100
112 |
113 | - name: Run pylint
114 | run: |
115 | pylint py_lh4_modernized.py logReg_modernized.py csv_to_vw_modernized.py --max-line-length=100 || true
116 |
117 | - name: Check import sorting
118 | run: |
119 | isort --check-only --diff .
120 |
121 | security:
122 | name: Security Scan
123 | runs-on: ubuntu-latest
124 |
125 | steps:
126 | - name: Checkout code
127 | uses: actions/checkout@v3
128 |
129 | - name: Set up Python
130 | uses: actions/setup-python@v4
131 | with:
132 | python-version: '3.10'
133 |
134 | - name: Install safety
135 | run: |
136 | python -m pip install --upgrade pip
137 | pip install safety
138 |
139 | - name: Run safety check
140 | run: |
141 | pip install -r requirements.txt
142 | safety check || true
143 |
144 | - name: Run bandit security linter
145 | run: |
146 | pip install bandit
147 | bandit -r . -f json -o bandit-report.json || true
148 |
149 | - name: Upload bandit report
150 | if: always()
151 | uses: actions/upload-artifact@v3
152 | with:
153 | name: bandit-security-report
154 | path: bandit-report.json
155 |
--------------------------------------------------------------------------------
/tests/test_lr_model.py:
--------------------------------------------------------------------------------
1 | """
2 | Tests for Logistic Regression model (py_lh4_modernized.py).
3 | """
4 |
5 | import pytest
6 | import numpy as np
7 | from pathlib import Path
8 | import sys
9 |
10 | # Add parent directory to path
11 | sys.path.insert(0, str(Path(__file__).parent.parent))
12 |
13 | from py_lh4_modernized import CTRPredictor
14 |
15 |
16 | class TestCTRPredictor:
17 | """Test suite for CTRPredictor class."""
18 |
19 | def test_initialization(self):
20 | """Test model initialization."""
21 | model = CTRPredictor(dimension=1000, learning_rate=0.1)
22 |
23 | assert model.D == 1000
24 | assert model.alpha == 0.1
25 | assert len(model.w) == 1000
26 | assert len(model.n) == 1000
27 | assert all(w == 0.0 for w in model.w)
28 | assert all(n == 0.0 for n in model.n)
29 |
30 | def test_logloss_positive_label(self):
31 | """Test logloss calculation for positive label."""
32 | loss = CTRPredictor.logloss(0.8, 1.0)
33 | expected = -np.log(0.8)
34 | assert np.isclose(loss, expected)
35 |
36 | def test_logloss_negative_label(self):
37 | """Test logloss calculation for negative label."""
38 | loss = CTRPredictor.logloss(0.2, 0.0)
39 | expected = -np.log(0.8)
40 | assert np.isclose(loss, expected)
41 |
42 | def test_logloss_boundary_values(self):
43 | """Test logloss with boundary values (close to 0 or 1)."""
44 | # Should not raise error or return inf
45 | loss1 = CTRPredictor.logloss(0.99999, 1.0)
46 | loss2 = CTRPredictor.logloss(0.00001, 0.0)
47 |
48 | assert not np.isinf(loss1)
49 | assert not np.isinf(loss2)
50 | assert loss1 > 0
51 | assert loss2 > 0
52 |
53 | def test_get_features_basic(self):
54 | """Test feature hashing."""
55 | model = CTRPredictor(dimension=1000)
56 |
57 | csv_row = {'I1': '5', 'I2': '10', 'C1': 'abc'}
58 | features = model.get_features(csv_row)
59 |
60 | # Should include bias term (index 0)
61 | assert 0 in features
62 | assert len(features) > 1
63 | # All indices should be within dimension
64 | assert all(0 <= idx < model.D for idx in features)
65 |
66 | def test_get_features_empty_values(self):
67 | """Test feature hashing with empty values."""
68 | model = CTRPredictor(dimension=1000)
69 |
70 | csv_row = {'I1': '5', 'I2': '', 'C1': 'abc'}
71 | features = model.get_features(csv_row)
72 |
73 | # Should only have bias and non-empty features
74 | assert 0 in features
75 | assert len(features) >= 1
76 |
77 | def test_predict_probability_range(self):
78 | """Test that predictions are in valid probability range."""
79 | model = CTRPredictor(dimension=100)
80 |
81 | # Random features
82 | features = [0, 5, 10, 25]
83 |
84 | prob = model.predict_probability(features)
85 |
86 | # Should be between 0 and 1
87 | assert 0.0 <= prob <= 1.0
88 |
89 | def test_predict_probability_with_weights(self):
90 | """Test prediction with non-zero weights."""
91 | model = CTRPredictor(dimension=100)
92 |
93 | # Set some weights
94 | model.w[0] = 1.0
95 | model.w[5] = 2.0
96 | model.w[10] = -1.5
97 |
98 | features = [0, 5, 10]
99 | prob = model.predict_probability(features)
100 |
101 | # Expected: sigmoid(1.0 + 2.0 - 1.5) = sigmoid(1.5)
102 | expected = 1.0 / (1.0 + np.exp(-1.5))
103 |
104 | assert np.isclose(prob, expected)
105 |
106 | def test_update_weights(self):
107 | """Test weight update."""
108 | model = CTRPredictor(dimension=100, learning_rate=0.1)
109 |
110 | features = [0, 5, 10]
111 | p = 0.7 # Prediction
112 | y = 1.0 # True label
113 |
114 | # Store initial weights
115 | initial_weights = [model.w[i] for i in features]
116 |
117 | # Update weights
118 | model.update_weights(features, p, y)
119 |
120 | # Weights should have changed
121 | for i, idx in enumerate(features):
122 | assert model.w[idx] != initial_weights[i]
123 |
124 | # Counters should have incremented
125 | for idx in features:
126 | assert model.n[idx] == 1.0
127 |
128 | def test_train_file_not_found(self):
129 | """Test training with non-existent file."""
130 | model = CTRPredictor(dimension=100)
131 |
132 | with pytest.raises(FileNotFoundError):
133 | model.train(Path("nonexistent_file.csv"))
134 |
135 | def test_predict_file_not_found(self):
136 | """Test prediction with non-existent file."""
137 | model = CTRPredictor(dimension=100)
138 |
139 | with pytest.raises(FileNotFoundError):
140 | model.predict(
141 | Path("nonexistent_test.csv"),
142 | Path("output.csv")
143 | )
144 |
145 | def test_train_with_temp_file(self, temp_csv_file):
146 | """Test training with a small temporary file."""
147 | model = CTRPredictor(dimension=100, learning_rate=0.1)
148 |
149 | # Should not raise error
150 | model.train(temp_csv_file)
151 |
152 | # Weights should have been updated
153 | assert any(w != 0.0 for w in model.w)
154 |
155 | def test_numerical_stability(self):
156 | """Test that extreme values don't cause overflow."""
157 | model = CTRPredictor(dimension=100)
158 |
159 | # Set extreme weights
160 | model.w[5] = 1000.0
161 | model.w[10] = -1000.0
162 |
163 | features = [5, 10]
164 | prob = model.predict_probability(features)
165 |
166 | # Should not be nan or inf
167 | assert not np.isnan(prob)
168 | assert not np.isinf(prob)
169 | assert 0.0 <= prob <= 1.0
170 |
171 |
172 | class TestEdgeCases:
173 | """Test edge cases and error handling."""
174 |
175 | def test_zero_dimension(self):
176 | """Test with zero dimension (should work but be useless)."""
177 | with pytest.raises(Exception):
178 | model = CTRPredictor(dimension=0)
179 |
180 | def test_negative_learning_rate(self):
181 | """Test with negative learning rate."""
182 | # Should still initialize (may behave oddly in training)
183 | model = CTRPredictor(dimension=100, learning_rate=-0.1)
184 | assert model.alpha == -0.1
185 |
186 | def test_very_large_dimension(self):
187 | """Test memory handling with large dimension."""
188 | # This might be slow or fail on memory-constrained systems
189 | # Using smaller value for testing
190 | try:
191 | model = CTRPredictor(dimension=10_000_000)
192 | assert len(model.w) == 10_000_000
193 | except MemoryError:
194 | pytest.skip("Insufficient memory for this test")
195 |
--------------------------------------------------------------------------------
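Note: several tests above (and in tests/test_vw_converter.py below) take a temp_csv_file fixture that is not reproduced in this excerpt; it is presumably provided by tests/conftest.py. A plausible minimal sketch of such a fixture, consistent with how the tests use it (three data rows, Id/Label plus I* and C* columns) but not necessarily identical to the real one:

import pytest

@pytest.fixture
def temp_csv_file(tmp_path):
    """Write a tiny Criteo-style CSV (three data rows) and return its path."""
    csv_path = tmp_path / "sample_train.csv"
    csv_path.write_text(
        "Id,Label,I1,I2,C1,C2\n"
        "1,1,5,10,abc,def\n"
        "2,0,3,,xyz,\n"
        "3,1,7,2,abc,ghi\n"
    )
    return csv_path
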
/gbm_modernized.R:
--------------------------------------------------------------------------------
1 | # Gradient Boosting Machine (GBM) for CTR Prediction
2 | # Modernized version with configurable paths and better structure
3 |
4 | # ============================================================================
5 | # Configuration
6 | # ============================================================================
7 |
8 | # Use environment variable or default to current directory
9 | DATA_DIR <- Sys.getenv("DATA_DIR", default = "./data")
10 | OUTPUT_DIR <- Sys.getenv("OUTPUT_DIR", default = "./output")
11 |
12 | # File paths (relative to DATA_DIR)
13 | TRAIN_FILE <- file.path(DATA_DIR, "train_num_na_yesno.csv")
14 | TEST_FILE <- file.path(DATA_DIR, "test_num_impute.csv")
15 | OUTPUT_FILE <- file.path(OUTPUT_DIR, "submit_gbm_num_nona_impute.csv")
16 |
17 | # Model hyperparameters
18 | N_TREES <- 500
19 | INTERACTION_DEPTH <- 22
20 | SHRINKAGE <- 0.1
21 | N_CV_FOLDS <- 2
22 | SEED <- 888
23 |
24 | # ============================================================================
25 | # Setup
26 | # ============================================================================
27 |
28 | cat("==========================================================\n")
29 | cat("GBM CTR Prediction Model\n")
30 | cat("==========================================================\n")
31 | cat(sprintf("Data directory: %s\n", DATA_DIR))
32 | cat(sprintf("Output directory: %s\n", OUTPUT_DIR))
33 | cat(sprintf("Training file: %s\n", TRAIN_FILE))
34 | cat(sprintf("Test file: %s\n", TEST_FILE))
35 | cat("==========================================================\n\n")
36 |
37 | # Load required libraries
38 | required_packages <- c("data.table", "caret", "gbm")
39 |
40 | for (pkg in required_packages) {
41 | if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
42 | cat(sprintf("Installing package: %s\n", pkg))
43 | install.packages(pkg, dependencies = TRUE)
44 | library(pkg, character.only = TRUE)
45 | }
46 | }
47 |
48 | # Create output directory if it doesn't exist
49 | if (!dir.exists(OUTPUT_DIR)) {
50 | dir.create(OUTPUT_DIR, recursive = TRUE)
51 | cat(sprintf("Created output directory: %s\n", OUTPUT_DIR))
52 | }
53 |
54 | # ============================================================================
55 | # Data Loading and Preparation
56 | # ============================================================================
57 |
58 | cat("\n[1/5] Loading training data...\n")
59 |
60 | if (!file.exists(TRAIN_FILE)) {
61 | stop(sprintf("Training file not found: %s\nPlease prepare the data first.", TRAIN_FILE))
62 | }
63 |
64 | train <- read.csv(TRAIN_FILE, stringsAsFactors = TRUE)
65 |
66 | # Remove ID column if present
67 | if ("Id" %in% colnames(train) || "id" %in% colnames(train)) {
68 | train <- train[, !(colnames(train) %in% c("Id", "id"))]
69 | }
70 |
71 | cat(sprintf(" Loaded %d samples with %d features\n", nrow(train), ncol(train) - 1))
72 | cat(sprintf(" Target variable: Label\n"))
73 | cat(sprintf(" Class distribution:\n"))
74 | print(table(train$Label))
75 |
76 | # ============================================================================
77 | # Model Training
78 | # ============================================================================
79 |
80 | cat("\n[2/5] Training GBM model...\n")
81 | cat(sprintf(" Trees: %d\n", N_TREES))
82 | cat(sprintf(" Interaction depth: %d\n", INTERACTION_DEPTH))
83 | cat(sprintf(" Shrinkage: %.3f\n", SHRINKAGE))
84 | cat(sprintf(" CV folds: %d\n", N_CV_FOLDS))
85 |
86 | # Set seed for reproducibility
87 | set.seed(SEED)
88 |
89 | # Define training grid
90 | gbm_grid <- expand.grid(
91 | n.trees = N_TREES,
92 | interaction.depth = INTERACTION_DEPTH,
93 | shrinkage = SHRINKAGE,
94 | n.minobsinnode = 10
95 | )
96 |
97 | # Define training control
98 | fit_control <- trainControl(
99 | method = "cv",
100 | number = N_CV_FOLDS,
101 | classProbs = TRUE,
102 | summaryFunction = twoClassSummary,
103 | allowParallel = TRUE
104 | )
105 |
106 | # Train model
107 | start_time <- Sys.time()
108 |
109 | gbm_fit <- train(
110 | Label ~ .,
111 | data = train,
112 | method = "gbm",
113 | trControl = fit_control,
114 | tuneGrid = gbm_grid,
115 | metric = "ROC",
116 | verbose = TRUE
117 | )
118 |
119 | end_time <- Sys.time()
120 | training_time <- difftime(end_time, start_time, units = "mins")
121 |
122 | cat(sprintf("\nTraining completed in %.2f minutes\n", training_time))
123 |
124 | # ============================================================================
125 | # Model Evaluation
126 | # ============================================================================
127 |
128 | cat("\n[3/5] Evaluating model on training data...\n")
129 |
130 | train_pred <- predict(gbm_fit, train)
131 | conf_matrix <- confusionMatrix(train_pred, train$Label)
132 |
133 | print(conf_matrix)
134 |
135 | # ============================================================================
136 | # Load Test Data and Generate Predictions
137 | # ============================================================================
138 |
139 | cat("\n[4/5] Loading test data and generating predictions...\n")
140 |
141 | if (!file.exists(TEST_FILE)) {
142 | warning(sprintf("Test file not found: %s\nSkipping predictions.", TEST_FILE))
143 | } else {
144 | test <- read.csv(TEST_FILE, stringsAsFactors = TRUE)
145 |
146 | # Keep ID column for submission
147 | test_id <- test$Id
148 |
149 | cat(sprintf(" Loaded %d test samples\n", nrow(test)))
150 |
151 | # Generate predictions (probabilities)
152 | test_pred_prob <- predict(gbm_fit, test, type = "prob")
153 |
154 | # Create submission dataframe
155 | submission <- data.frame(
156 | Id = test_id,
157 | Predicted = test_pred_prob[, "Yes"] # Probability of positive class
158 | )
159 |
160 | # ============================================================================
161 | # Save Predictions
162 | # ============================================================================
163 |
164 | cat("\n[5/5] Saving predictions...\n")
165 |
166 | write.csv(
167 | submission,
168 | OUTPUT_FILE,
169 | row.names = FALSE,
170 | quote = FALSE
171 | )
172 |
173 | cat(sprintf(" Predictions saved to: %s\n", OUTPUT_FILE))
174 | cat(sprintf(" Submission format: %d rows, 2 columns (Id, Predicted)\n", nrow(submission)))
175 | }
176 |
177 | # ============================================================================
178 | # Summary
179 | # ============================================================================
180 |
181 | cat("\n==========================================================\n")
182 | cat("Model Training Summary\n")
183 | cat("==========================================================\n")
184 | cat(sprintf("Training samples: %d\n", nrow(train)))
185 | cat(sprintf("Features: %d\n", ncol(train) - 1))
186 | cat(sprintf("Training time: %.2f minutes\n", training_time))
187 | cat(sprintf("Training accuracy: %.4f\n", conf_matrix$overall['Accuracy']))
188 | cat(sprintf("Model saved: gbm_fit\n"))
189 | cat("==========================================================\n")
190 | cat("\nDone!\n")
191 |
192 | # ============================================================================
193 | # Optional: Save model object
194 | # ============================================================================
195 |
196 | # Uncomment to save the trained model for later use
197 | # saveRDS(gbm_fit, file.path(OUTPUT_DIR, "gbm_model.rds"))
198 | # cat(sprintf("Model object saved to: %s\n", file.path(OUTPUT_DIR, "gbm_model.rds")))
199 |
200 | # To load the model later:
201 | # gbm_fit <- readRDS(file.path(OUTPUT_DIR, "gbm_model.rds"))
202 |
--------------------------------------------------------------------------------
/tests/test_vw_converter.py:
--------------------------------------------------------------------------------
1 | """
2 | Tests for CSV to Vowpal Wabbit converter (csv_to_vw_modernized.py).
3 | """
4 |
5 | import pytest
6 | from pathlib import Path
7 | import sys
8 |
9 | # Add parent directory to path
10 | sys.path.insert(0, str(Path(__file__).parent.parent))
11 |
12 | from csv_to_vw_modernized import _convert_row_to_vw, csv_to_vw
13 |
14 |
15 | class TestConvertRowToVW:
16 | """Test suite for row conversion function."""
17 |
18 | def test_convert_train_row_basic(self):
19 | """Test converting a basic training row."""
20 | row = {
21 | 'Id': '123',
22 | 'Label': '1',
23 | 'I1': '5',
24 | 'I2': '10',
25 | 'C1': 'abc',
26 | 'C2': 'def'
27 | }
28 |
29 | vw_line = _convert_row_to_vw(row, is_train=True)
30 |
31 | # Should start with label (1) and tag ('123)
32 | assert vw_line.startswith("1 '123")
33 |
34 | # Should contain numerical namespace
35 | assert '|i' in vw_line
36 |
37 | # Should contain categorical namespace
38 | assert '|c' in vw_line
39 |
40 | # Should contain feature values
41 | assert 'I1:5' in vw_line
42 | assert 'I2:10' in vw_line
43 | assert 'abc' in vw_line
44 | assert 'def' in vw_line
45 |
46 | def test_convert_train_row_negative_label(self):
47 | """Test converting row with negative label."""
48 | row = {
49 | 'Id': '456',
50 | 'Label': '0',
51 | 'I1': '3',
52 | 'C1': 'xyz'
53 | }
54 |
55 | vw_line = _convert_row_to_vw(row, is_train=True)
56 |
57 | # Label should be -1 for VW format
58 | assert vw_line.startswith("-1 '456")
59 |
60 | def test_convert_test_row(self):
61 | """Test converting a test row (no label)."""
62 | row = {
63 | 'Id': '789',
64 | 'I1': '7',
65 | 'C1': 'test'
66 | }
67 |
68 | vw_line = _convert_row_to_vw(row, is_train=False)
69 |
70 | # Test rows get dummy label 1
71 | assert vw_line.startswith("1 '789")
72 | assert 'I1:7' in vw_line
73 | assert 'test' in vw_line
74 |
75 | def test_convert_row_with_missing_values(self):
76 | """Test converting row with missing values."""
77 | row = {
78 | 'Id': '111',
79 | 'Label': '1',
80 | 'I1': '5',
81 | 'I2': '', # Missing
82 | 'C1': 'abc',
83 | 'C2': '' # Missing
84 | }
85 |
86 | vw_line = _convert_row_to_vw(row, is_train=True)
87 |
88 | # Should include non-empty features
89 | assert 'I1:5' in vw_line
90 | assert 'abc' in vw_line
91 |
92 | # Should not include empty features
93 | assert 'I2:' not in vw_line
94 |
95 | def test_convert_row_with_spaces(self):
96 | """Test converting row with whitespace."""
97 | row = {
98 | 'Id': '222',
99 | 'Label': '0',
100 | 'I1': ' 5 ', # With spaces
101 | 'C1': ' abc '
102 | }
103 |
104 | vw_line = _convert_row_to_vw(row, is_train=True)
105 |
106 | # Spaces should be handled (features present)
107 | assert '|i' in vw_line
108 | assert '|c' in vw_line
109 |
110 | def test_convert_row_empty_namespaces(self):
111 | """Test converting row with all values empty in a namespace."""
112 | row = {
113 | 'Id': '333',
114 | 'Label': '1',
115 | 'I1': '',
116 | 'I2': '',
117 | 'C1': 'abc'
118 | }
119 |
120 | vw_line = _convert_row_to_vw(row, is_train=True)
121 |
122 | # Should still have namespace markers
123 | assert '|i' in vw_line
124 | assert '|c' in vw_line
125 |
126 |
127 | class TestCSVToVWFunction:
128 | """Test suite for full CSV to VW conversion."""
129 |
130 | def test_csv_to_vw_file_not_found(self, tmp_path):
131 | """Test with non-existent input file."""
132 | csv_path = tmp_path / "nonexistent.csv"
133 | output_path = tmp_path / "output.vw"
134 |
135 | with pytest.raises(FileNotFoundError):
136 | csv_to_vw(csv_path, output_path)
137 |
138 | def test_csv_to_vw_basic(self, temp_csv_file, tmp_path):
139 | """Test basic CSV to VW conversion."""
140 | output_path = tmp_path / "output.vw"
141 |
142 | # Should not raise error
143 | csv_to_vw(temp_csv_file, output_path, is_train=True)
144 |
145 | # Output file should exist
146 | assert output_path.exists()
147 |
148 | # Check output content
149 | with open(output_path, 'r') as f:
150 | lines = f.readlines()
151 |
152 | # Should have same number of lines as CSV (minus header)
153 | assert len(lines) == 3 # 3 data rows from fixture
154 |
155 | # Each line should be VW format
156 | for line in lines:
157 | assert '|i' in line
158 | assert '|c' in line
159 |
160 | def test_csv_to_vw_test_mode(self, temp_csv_file, tmp_path):
161 | """Test CSV to VW conversion in test mode."""
162 | output_path = tmp_path / "output.vw"
163 |
164 | csv_to_vw(temp_csv_file, output_path, is_train=False)
165 |
166 | # Output file should exist
167 | assert output_path.exists()
168 |
169 | with open(output_path, 'r') as f:
170 | lines = f.readlines()
171 |
172 | # In test mode, all labels should be 1
173 | for line in lines:
174 | assert line.startswith('1 ')
175 |
176 | def test_csv_to_vw_creates_output_file(self, temp_csv_file, tmp_path):
177 | """Test that output file is created correctly."""
178 | output_path = tmp_path / "nested" / "dir" / "output.vw"
179 |
180 | # Parent directories don't exist
181 | assert not output_path.parent.exists()
182 |
183 | # This should fail since we don't create parent dirs
184 | # (could be enhanced to create them)
185 | with pytest.raises(FileNotFoundError):
186 | csv_to_vw(temp_csv_file, output_path)
187 |
188 |
189 | class TestEdgeCases:
190 | """Test edge cases and error handling."""
191 |
192 | def test_malformed_csv(self, tmp_path):
193 | """Test handling of malformed CSV."""
194 | csv_file = tmp_path / "malformed.csv"
195 |
196 | # Create malformed CSV
197 | with open(csv_file, 'w') as f:
198 | f.write("Id,Label,I1\n")
199 | f.write("1,0,5\n")
200 | f.write("2,1\n") # Missing column
201 |
202 | output_file = tmp_path / "output.vw"
203 |
204 | # Should handle gracefully
205 | csv_to_vw(csv_file, output_file, is_train=True)
206 |
207 | # Output should still be created
208 | assert output_file.exists()
209 |
210 | def test_empty_csv(self, tmp_path):
211 | """Test handling of empty CSV."""
212 | csv_file = tmp_path / "empty.csv"
213 |
214 | # Create empty CSV (header only)
215 | with open(csv_file, 'w') as f:
216 | f.write("Id,Label,I1,C1\n")
217 |
218 | output_file = tmp_path / "output.vw"
219 |
220 | # Should not raise error
221 | csv_to_vw(csv_file, output_file, is_train=True)
222 |
223 | # Output should be empty
224 | with open(output_file, 'r') as f:
225 | lines = f.readlines()
226 |
227 | assert len(lines) == 0
228 |
--------------------------------------------------------------------------------
/csv_to_vw_modernized.py:
--------------------------------------------------------------------------------
1 | """
2 | CSV to Vowpal Wabbit Format Converter
3 |
4 | Converts Criteo CTR dataset from CSV format to Vowpal Wabbit format.
5 |
6 | Original credit:
7 | - __Author__: Triskelion
8 | - Credit: Zygmunt Zając
9 |
10 | Modernized with:
11 | - Python 3 compatibility
12 | - Error handling and logging
13 | - Progress reporting
14 | - Configurable paths
15 | - Better performance
16 | """
17 |
18 | import logging
19 | import sys
20 | from datetime import datetime
21 | from csv import DictReader
22 | from pathlib import Path
23 | from typing import Optional
24 |
25 | # Configure logging
26 | logging.basicConfig(
27 | level=logging.INFO,
28 | format='%(asctime)s - %(levelname)s - %(message)s'
29 | )
30 | logger = logging.getLogger(__name__)
31 |
32 |
33 | def csv_to_vw(
34 | csv_path: Path,
35 | output_path: Path,
36 | is_train: bool = True,
37 | report_interval: int = 1_000_000
38 | ) -> None:
39 | """
40 | Convert CSV file to Vowpal Wabbit format.
41 |
42 | Vowpal Wabbit format:
43 | [label] ['tag] |namespace features
44 |
45 | Example train:
46 | 1 'id123 |i I1:5 I2:10 |c C1 C2 C3
47 |
48 | Example test:
49 | 1 'id456 |i I1:3 |c C5 C6
50 |
51 | Args:
52 | csv_path: Path to input CSV file
53 | output_path: Path to output VW file
54 | is_train: Whether this is training data (includes labels)
55 | report_interval: How often to log progress
56 |
57 | Raises:
58 | FileNotFoundError: If CSV file doesn't exist
59 | ValueError: If CSV format is invalid
60 | """
61 | if not csv_path.exists():
62 | raise FileNotFoundError(f"CSV file not found: {csv_path}")
63 |
64 | start_time = datetime.now()
65 |
66 | logger.info("=" * 80)
67 | logger.info(f"Converting CSV to Vowpal Wabbit format")
68 | logger.info(f" Input: {csv_path}")
69 | logger.info(f" Output: {output_path}")
70 | logger.info(f" Mode: {'Training' if is_train else 'Testing'}")
71 | logger.info("=" * 80)
72 |
73 | row_count = 0
74 |
75 | try:
76 | with open(csv_path, 'r', encoding='utf-8') as csv_file, \
77 | open(output_path, 'w', encoding='utf-8') as vw_file:
78 |
79 | reader = DictReader(csv_file)
80 |
81 | # Validate required fields
82 | if reader.fieldnames:
83 | if 'Id' not in reader.fieldnames:
84 | raise ValueError("CSV must contain 'Id' column")
85 | if is_train and 'Label' not in reader.fieldnames:
86 | raise ValueError("Training CSV must contain 'Label' column")
87 |
88 | for row_count, row in enumerate(reader, start=1):
89 | try:
90 | # Create VW format line
91 | vw_line = _convert_row_to_vw(row, is_train)
92 | vw_file.write(vw_line + '\n')
93 |
94 | except Exception as e:
95 | logger.warning(f"Error processing row {row_count}: {e}")
96 | continue
97 |
98 | # Report progress
99 | if row_count % report_interval == 0:
100 | elapsed = datetime.now() - start_time
101 | rate = row_count / elapsed.total_seconds()
102 | logger.info(
103 | f"Processed {row_count:,} rows | "
104 | f"Elapsed: {elapsed} | "
105 | f"Rate: {rate:.0f} rows/sec"
106 | )
107 |
108 | except Exception as e:
109 | logger.error(f"Error during conversion: {e}")
110 | raise
111 |
112 | elapsed = datetime.now() - start_time
113 | logger.info("=" * 80)
114 | logger.info(f"Conversion completed!")
115 | logger.info(f" Rows processed: {row_count:,}")
116 | logger.info(f" Total time: {elapsed}")
117 |     logger.info(f"  Average rate: {row_count / max(elapsed.total_seconds(), 1e-9):.0f} rows/sec")
118 | logger.info("=" * 80)
119 |
120 |
121 | def _convert_row_to_vw(row: dict, is_train: bool) -> str:
122 | """
123 | Convert a single CSV row to Vowpal Wabbit format.
124 |
125 | Args:
126 | row: Dictionary from CSV DictReader
127 | is_train: Whether this is training data
128 |
129 | Returns:
130 | VW formatted string (without newline)
131 | """
132 | # Extract label and ID
133 | row_id = row.get('Id', 'unknown')
134 |
135 | if is_train:
136 | # VW uses 1 for positive, -1 for negative
137 | label = 1 if row.get('Label') == '1' else -1
138 | else:
139 | # Test data: use dummy label 1
140 | label = 1
141 |
142 | # Separate numerical and categorical features
143 | numerical_features = []
144 | categorical_features = []
145 |
146 | for key, value in row.items():
147 | # Skip label and ID
148 | if key in ['Label', 'Id']:
149 | continue
150 |
151 |         # Trim whitespace and skip empty values so VW tokens stay well-formed
152 |         value = (value or '').strip()
153 |         if not value:
154 |             continue
155 | # Numerical features (start with 'I')
156 | if key.startswith('I'):
157 | numerical_features.append(f"{key}:{value}")
158 |
159 | # Categorical features (start with 'C')
160 | elif key.startswith('C'):
161 | # For categorical, just use the value (no key:value format)
162 | categorical_features.append(value)
163 |
164 | # Build VW format line
165 | # Format: label 'tag |namespace1 features1 |namespace2 features2
166 | numerical_str = ' '.join(numerical_features) if numerical_features else ''
167 | categorical_str = ' '.join(categorical_features) if categorical_features else ''
168 |
169 | vw_line = f"{label} '{row_id} |i {numerical_str} |c {categorical_str}"
170 |
171 | return vw_line
172 |
173 |
174 | def main():
175 | """Main execution function."""
176 | import argparse
177 |
178 | parser = argparse.ArgumentParser(
179 | description='Convert Criteo CSV to Vowpal Wabbit format'
180 | )
181 | parser.add_argument(
182 | 'input',
183 | type=str,
184 | help='Input CSV file path'
185 | )
186 | parser.add_argument(
187 | 'output',
188 | type=str,
189 | help='Output VW file path'
190 | )
191 | parser.add_argument(
192 | '--test',
193 | action='store_true',
194 | help='Convert test data (no labels)'
195 | )
196 | parser.add_argument(
197 | '--interval',
198 | type=int,
199 | default=1_000_000,
200 | help='Progress report interval (default: 1,000,000)'
201 | )
202 |
203 | args = parser.parse_args()
204 |
205 | # Convert paths
206 | csv_path = Path(args.input)
207 | output_path = Path(args.output)
208 |
209 | # Check if input exists
210 | if not csv_path.exists():
211 | logger.error(f"Input file not found: {csv_path}")
212 | sys.exit(1)
213 |
214 | # Warn if output exists
215 | if output_path.exists():
216 | logger.warning(f"Output file exists and will be overwritten: {output_path}")
217 |
218 | try:
219 | # Perform conversion
220 | csv_to_vw(
221 | csv_path=csv_path,
222 | output_path=output_path,
223 | is_train=not args.test,
224 | report_interval=args.interval
225 | )
226 |
227 | logger.info("Success!")
228 |
229 | except Exception as e:
230 | logger.error(f"Conversion failed: {e}")
231 | sys.exit(1)
232 |
233 |
234 | if __name__ == '__main__':
235 | # Example usage (uncomment to run):
236 | # csv_to_vw(
237 | # csv_path=Path('train.csv'),
238 | # output_path=Path('click.train.vw'),
239 | # is_train=True
240 | # )
241 | # csv_to_vw(
242 | # csv_path=Path('test.csv'),
243 | # output_path=Path('click.test.vw'),
244 | # is_train=False
245 | # )
246 |
247 | main()
248 |
--------------------------------------------------------------------------------
/scripts/download_data.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | """
3 | Data Download Helper for Criteo CTR Dataset
4 |
5 | This script provides instructions and helpers for downloading the Criteo
6 | Display Advertising Challenge dataset from Kaggle.
7 |
8 | Note: Kaggle API credentials are required for automated download.
9 | """
10 |
11 | import logging
12 | import sys
13 | from pathlib import Path
14 | import subprocess
15 |
16 | # Configure logging
17 | logging.basicConfig(
18 | level=logging.INFO,
19 | format='%(asctime)s - %(levelname)s - %(message)s'
20 | )
21 | logger = logging.getLogger(__name__)
22 |
23 |
24 | def check_kaggle_api() -> bool:
25 | """
26 | Check if Kaggle API is installed and configured.
27 |
28 | Returns:
29 | True if Kaggle API is available, False otherwise
30 | """
31 | try:
32 | import kaggle
33 | return True
34 |     except (ImportError, OSError):  # kaggle may raise OSError at import if credentials are missing
35 | return False
36 |
37 |
38 | def print_manual_instructions():
39 | """Print manual download instructions."""
40 | logger.info("=" * 80)
41 | logger.info("Manual Download Instructions")
42 | logger.info("=" * 80)
43 | print("""
44 | To download the Criteo CTR dataset manually:
45 |
46 | 1. Visit the Kaggle competition page:
47 | https://www.kaggle.com/c/criteo-display-ad-challenge/data
48 |
49 | 2. Accept the competition rules (you must have a Kaggle account)
50 |
51 | 3. Download the following files:
52 | - train.csv.gz (~11 GB compressed, ~40 GB uncompressed)
53 | - test.csv.gz (~2 GB compressed, ~6 GB uncompressed)
54 |
55 | 4. Extract the files:
56 | gunzip train.csv.gz
57 | gunzip test.csv.gz
58 |
59 | 5. Move the files to the data/ directory:
60 | mv train.csv data/
61 | mv test.csv data/
62 |
63 | 6. Verify the files:
64 | python scripts/verify_data.py
65 | """)
66 |
67 |
68 | def print_kaggle_api_setup():
69 | """Print Kaggle API setup instructions."""
70 | logger.info("=" * 80)
71 | logger.info("Kaggle API Setup Instructions")
72 | logger.info("=" * 80)
73 | print("""
74 | To use the Kaggle API for automated downloads:
75 |
76 | 1. Install the Kaggle API:
77 | pip install kaggle
78 |
79 | 2. Get your Kaggle API credentials:
80 | a. Go to https://www.kaggle.com/account
81 | b. Scroll to "API" section
82 | c. Click "Create New API Token"
83 | d. This downloads kaggle.json
84 |
85 | 3. Place kaggle.json in the correct location:
86 | # Linux/macOS
87 | mkdir -p ~/.kaggle
88 | mv ~/Downloads/kaggle.json ~/.kaggle/
89 | chmod 600 ~/.kaggle/kaggle.json
90 |
91 | # Windows
92 | mkdir %USERPROFILE%\\.kaggle
93 | move %USERPROFILE%\\Downloads\\kaggle.json %USERPROFILE%\\.kaggle\\
94 |
95 | 4. Run this script again to download data automatically
96 | """)
97 |
98 |
99 | def download_with_kaggle_api(data_dir: Path) -> bool:
100 | """
101 | Download dataset using Kaggle API.
102 |
103 | Args:
104 | data_dir: Directory to save data
105 |
106 | Returns:
107 | True if successful, False otherwise
108 | """
109 | try:
110 | logger.info("Downloading data using Kaggle API...")
111 | logger.info("This may take a while (files are ~13 GB compressed)...")
112 |
113 | # Import here to handle case where it's not installed
114 | import kaggle
115 |
116 | # Create data directory
117 | data_dir.mkdir(parents=True, exist_ok=True)
118 |
119 | # Download dataset files
120 | logger.info("Downloading train.csv.gz...")
121 | kaggle.api.competition_download_file(
122 | 'criteo-display-ad-challenge',
123 | 'train.csv.gz',
124 | path=str(data_dir)
125 | )
126 |
127 | logger.info("Downloading test.csv.gz...")
128 | kaggle.api.competition_download_file(
129 | 'criteo-display-ad-challenge',
130 | 'test.csv.gz',
131 | path=str(data_dir)
132 | )
133 |
134 | logger.info("Download complete!")
135 | logger.info("Extracting files...")
136 |
137 | # Extract files
138 | import gzip
139 | import shutil
140 |
141 | # Extract train.csv
142 | train_gz = data_dir / 'train.csv.gz'
143 | train_csv = data_dir / 'train.csv'
144 |
145 | if train_gz.exists():
146 | logger.info("Extracting train.csv.gz...")
147 | with gzip.open(train_gz, 'rb') as f_in:
148 | with open(train_csv, 'wb') as f_out:
149 | shutil.copyfileobj(f_in, f_out)
150 | logger.info(f"Extracted to {train_csv}")
151 |
152 | # Extract test.csv
153 | test_gz = data_dir / 'test.csv.gz'
154 | test_csv = data_dir / 'test.csv'
155 |
156 | if test_gz.exists():
157 | logger.info("Extracting test.csv.gz...")
158 | with gzip.open(test_gz, 'rb') as f_in:
159 | with open(test_csv, 'wb') as f_out:
160 | shutil.copyfileobj(f_in, f_out)
161 | logger.info(f"Extracted to {test_csv}")
162 |
163 | logger.info("=" * 80)
164 | logger.info("Dataset downloaded and extracted successfully!")
165 | logger.info(f" Train: {train_csv}")
166 | logger.info(f" Test: {test_csv}")
167 | logger.info("=" * 80)
168 |
169 | return True
170 |
171 | except Exception as e:
172 | logger.error(f"Error downloading data: {e}")
173 | return False
174 |
175 |
176 | def create_sample_data(data_dir: Path, sample_size: int = 1_000_000):
177 | """
178 | Create a smaller sample dataset for testing.
179 |
180 | Args:
181 | data_dir: Directory containing data
182 | sample_size: Number of rows to include in sample
183 | """
184 | train_file = data_dir / 'train.csv'
185 | sample_file = data_dir / 'train_sample.csv'
186 |
187 | if not train_file.exists():
188 | logger.error(f"Train file not found: {train_file}")
189 | return
190 |
191 | logger.info(f"Creating sample dataset with {sample_size:,} rows...")
192 |
193 | try:
194 | # Use head command for efficiency
195 | result = subprocess.run(
196 | ['head', '-n', str(sample_size + 1), str(train_file)],
197 | capture_output=True,
198 | text=True,
199 | check=True
200 | )
201 |
202 | with open(sample_file, 'w') as f:
203 | f.write(result.stdout)
204 |
205 | logger.info(f"Sample dataset created: {sample_file}")
206 |
207 | except Exception as e:
208 | logger.error(f"Error creating sample: {e}")
209 |
210 |
211 | def verify_data(data_dir: Path):
212 | """
213 | Verify that data files exist and have reasonable size.
214 |
215 | Args:
216 | data_dir: Directory containing data
217 | """
218 | logger.info("Verifying data files...")
219 |
220 | train_file = data_dir / 'train.csv'
221 | test_file = data_dir / 'test.csv'
222 |
223 | if train_file.exists():
224 | size_gb = train_file.stat().st_size / (1024 ** 3)
225 | logger.info(f"✓ train.csv exists ({size_gb:.2f} GB)")
226 |
227 | # Expected: ~40 GB
228 | if size_gb < 30 or size_gb > 50:
229 | logger.warning(
230 | f"Warning: train.csv size ({size_gb:.2f} GB) outside expected range (30-50 GB)"
231 | )
232 | else:
233 | logger.error("✗ train.csv not found")
234 |
235 | if test_file.exists():
236 | size_gb = test_file.stat().st_size / (1024 ** 3)
237 | logger.info(f"✓ test.csv exists ({size_gb:.2f} GB)")
238 |
239 | # Expected: ~6 GB
240 | if size_gb < 4 or size_gb > 8:
241 | logger.warning(
242 | f"Warning: test.csv size ({size_gb:.2f} GB) outside expected range (4-8 GB)"
243 | )
244 | else:
245 | logger.error("✗ test.csv not found")
246 |
247 |
248 | def main():
249 | """Main execution function."""
250 | import argparse
251 |
252 | parser = argparse.ArgumentParser(
253 | description='Download Criteo CTR dataset'
254 | )
255 | parser.add_argument(
256 | '--data-dir',
257 | type=str,
258 | default='./data',
259 | help='Directory to save data (default: ./data)'
260 | )
261 | parser.add_argument(
262 | '--sample',
263 | action='store_true',
264 | help='Create a sample dataset after downloading'
265 | )
266 | parser.add_argument(
267 | '--sample-size',
268 | type=int,
269 | default=1_000_000,
270 | help='Sample size in rows (default: 1,000,000)'
271 | )
272 | parser.add_argument(
273 | '--verify-only',
274 | action='store_true',
275 | help='Only verify existing data files'
276 | )
277 |
278 | args = parser.parse_args()
279 | data_dir = Path(args.data_dir)
280 |
281 | # Create data directory
282 | data_dir.mkdir(parents=True, exist_ok=True)
283 |
284 | logger.info("=" * 80)
285 | logger.info("Criteo CTR Dataset Downloader")
286 | logger.info("=" * 80)
287 |
288 | # Verify only mode
289 | if args.verify_only:
290 | verify_data(data_dir)
291 | return
292 |
293 | # Check if Kaggle API is available
294 | has_kaggle = check_kaggle_api()
295 |
296 | if has_kaggle:
297 | logger.info("Kaggle API detected!")
298 | response = input("Download data using Kaggle API? (y/n): ")
299 |
300 | if response.lower() == 'y':
301 | success = download_with_kaggle_api(data_dir)
302 |
303 | if success:
304 | if args.sample:
305 | create_sample_data(data_dir, args.sample_size)
306 | return
307 | else:
308 | logger.warning("Kaggle API not installed or not configured")
309 | print_kaggle_api_setup()
310 | print()
311 |
312 | # Fall back to manual instructions
313 | print_manual_instructions()
314 |
315 | # Verify if data already exists
316 | logger.info("\nChecking for existing data files...")
317 | verify_data(data_dir)
318 |
319 |
320 | if __name__ == '__main__':
321 | main()
322 |
--------------------------------------------------------------------------------
/py_lh4_modernized.py:
--------------------------------------------------------------------------------
1 | """
2 | Logistic Regression with Stochastic Gradient Descent (SGD) for CTR Prediction
3 |
4 | This module implements a memory-efficient logistic regression model using:
5 | - Hash trick for feature engineering (2^27 dimensional space)
6 | - Adaptive learning rate for SGD optimization
7 | - Bounded numerical operations for stability
8 |
9 | Modernized from original py_lh4.py with:
10 | - Python 3 compatibility
11 | - Error handling and logging
12 | - Input validation
13 | - Configurable parameters
14 | - Type hints
15 | """
16 |
17 | import logging
18 | import sys
19 | from csv import DictReader
20 | from math import exp, log, sqrt
21 | from pathlib import Path
22 | from typing import Dict, List
23 |
24 | # Configure logging
25 | logging.basicConfig(
26 | level=logging.INFO,
27 | format='%(asctime)s - %(levelname)s - %(message)s',
28 | handlers=[
29 | logging.StreamHandler(sys.stdout),
30 | logging.FileHandler('training.log')
31 | ]
32 | )
33 | logger = logging.getLogger(__name__)
34 |
35 |
36 | class CTRPredictor:
37 | """Click-Through Rate predictor using Logistic Regression with SGD."""
38 |
39 | def __init__(
40 | self,
41 | dimension: int = 2**27,
42 | learning_rate: float = 0.145,
43 | log_interval: int = 1_000_000
44 | ):
45 | """
46 | Initialize the CTR predictor.
47 |
48 | Args:
49 | dimension: Number of features for hash trick (default: 2^27 = ~134M)
50 | learning_rate: Alpha parameter for SGD (default: 0.145)
51 | log_interval: How often to log progress during training
52 | """
53 | self.D = dimension
54 | self.alpha = learning_rate
55 | self.log_interval = log_interval
56 |
57 | # Initialize model weights and feature counters
58 | self.w: List[float] = [0.0] * self.D
59 | self.n: List[float] = [0.0] * self.D
60 |
61 | logger.info("Initialized CTR Predictor:")
62 | logger.info(f" Dimension: {self.D:,}")
63 | logger.info(f" Learning rate: {self.alpha}")
64 | logger.info(f" Memory usage: ~{(self.D * 16) / (1024**3):.2f} GB")
65 |
66 | @staticmethod
67 | def logloss(p: float, y: float) -> float:
68 | """
69 | Calculate bounded logarithmic loss.
70 |
71 | Args:
72 | p: Predicted probability (0 to 1)
73 | y: True label (0 or 1)
74 |
75 | Returns:
76 | Logarithmic loss value
77 | """
78 | # Bound prediction to prevent log(0)
79 | epsilon = 1e-12
80 | p = max(min(p, 1.0 - epsilon), epsilon)
81 |
82 | if y == 1.0:
83 | return -log(p)
84 | else:
85 | return -log(1.0 - p)
86 |
87 | def get_features(self, csv_row: Dict[str, str]) -> List[int]:
88 | """
89 | Apply hash trick to convert CSV row to feature indices.
90 |
91 | Treats both integer and categorical features as categorical,
92 | using a simple hash function to map to feature space.
93 |
94 | Args:
95 | csv_row: Dictionary from CSV DictReader
96 |
97 | Returns:
98 | List of feature indices where value is 1
99 | """
100 | x = [0] # Index 0 is the bias term
101 |
102 | for key, value in csv_row.items():
103 | if not value: # Skip empty values
104 | continue
105 |
106 | try:
107 | # Simple hash: concatenate value and feature name, convert to hex
108 | # This is intentionally simple for speed (though weak as a hash)
109 | hash_input = value + key[1:]
110 | index = int(hash_input, 16) % self.D
111 | x.append(index)
112 | except (ValueError, IndexError) as e:
113 | logger.warning(f"Skipping invalid feature {key}={value}: {e}")
114 | continue
115 |
116 | return x
117 |
118 | def predict_probability(self, x: List[int]) -> float:
119 | """
120 | Calculate probability P(y=1|x) using logistic sigmoid.
121 |
122 | Args:
123 | x: List of feature indices
124 |
125 | Returns:
126 | Predicted probability between 0 and 1
127 | """
128 | # Calculate w^T * x
129 | wTx = sum(self.w[i] for i in x)
130 |
131 | # Apply bounded sigmoid to prevent overflow
132 | # Bound to [-20, 20] before exp
133 | wTx_bounded = max(min(wTx, 20.0), -20.0)
134 |
135 | return 1.0 / (1.0 + exp(-wTx_bounded))
136 |
137 | def update_weights(
138 | self,
139 | x: List[int],
140 | p: float,
141 | y: float
142 | ) -> None:
143 | """
144 | Update model weights using SGD with adaptive learning rate.
145 |
146 | Args:
147 | x: Feature indices
148 | p: Predicted probability
149 | y: True label (0 or 1)
150 | """
151 | for i in x:
152 | # Adaptive learning rate: alpha / (sqrt(n) + 1)
153 | # This decreases learning rate for frequently seen features
154 | adaptive_lr = self.alpha / (sqrt(self.n[i]) + 1.0)
155 |
156 | # Gradient: (p - y) * x[i], where x[i] = 1 for indices in x
157 | gradient = p - y
158 |
159 | # Update weight
160 | self.w[i] -= gradient * adaptive_lr
161 |
162 | # Increment feature counter
163 | self.n[i] += 1.0
164 |
165 | def train(self, train_path: Path) -> None:
166 | """
167 | Train the model on training data using online SGD.
168 |
169 | Args:
170 | train_path: Path to training CSV file
171 |
172 | Raises:
173 | FileNotFoundError: If training file doesn't exist
174 | ValueError: If file format is invalid
175 | """
176 | if not train_path.exists():
177 | raise FileNotFoundError(f"Training file not found: {train_path}")
178 |
179 | logger.info(f"Starting training from: {train_path}")
180 | logger.info("=" * 80)
181 |
182 | cumulative_loss = 0.0
183 | sample_count = 0
184 |
185 | try:
186 | with open(train_path, 'r', encoding='utf-8') as f:
187 | reader = DictReader(f)
188 |
189 | # Validate required columns
190 | if reader.fieldnames and 'Label' not in reader.fieldnames:
191 | raise ValueError("Training file must contain 'Label' column")
192 |
193 | for t, row in enumerate(reader, start=1):
194 | # Parse label
195 | try:
196 | y = 1.0 if row['Label'] == '1' else 0.0
197 | except KeyError:
198 | logger.error(f"Row {t} missing 'Label' column")
199 | continue
200 |
201 | # Remove label and ID from features
202 | row.pop('Label', None)
203 | row.pop('Id', None)
204 |
205 | # Get hashed features
206 | x = self.get_features(row)
207 |
208 | # Get prediction
209 | p = self.predict_probability(x)
210 |
211 | # Calculate loss for monitoring
212 | loss = self.logloss(p, y)
213 | cumulative_loss += loss
214 | sample_count = t
215 |
216 | # Log progress
217 | if t % self.log_interval == 0:
218 | avg_loss = cumulative_loss / t
219 | logger.info(
220 | f"Processed: {t:,} samples | "
221 | f"Avg logloss: {avg_loss:.6f} | "
222 | f"Current loss: {loss:.6f}"
223 | )
224 |
225 | # Update model
226 | self.update_weights(x, p, y)
227 |
228 | except Exception as e:
229 | logger.error(f"Error during training: {e}")
230 | raise
231 |
232 | logger.info("=" * 80)
233 | logger.info("Training completed!")
234 | logger.info(f" Total samples: {sample_count:,}")
235 | logger.info(f" Final avg logloss: {cumulative_loss / sample_count:.6f}")
236 |
237 | def predict(self, test_path: Path, output_path: Path) -> None:
238 | """
239 | Generate predictions for test data and save to CSV.
240 |
241 | Args:
242 | test_path: Path to test CSV file
243 | output_path: Path to save predictions
244 |
245 | Raises:
246 | FileNotFoundError: If test file doesn't exist
247 | """
248 | if not test_path.exists():
249 | raise FileNotFoundError(f"Test file not found: {test_path}")
250 |
251 | logger.info(f"Generating predictions from: {test_path}")
252 | logger.info(f"Saving to: {output_path}")
253 |
254 | prediction_count = 0
255 |
256 | try:
257 | with open(test_path, 'r', encoding='utf-8') as f_in, \
258 | open(output_path, 'w', encoding='utf-8') as f_out:
259 |
260 | # Write header
261 | f_out.write('Id,Predicted\n')
262 |
263 | reader = DictReader(f_in)
264 |
265 | for t, row in enumerate(reader, start=1):
266 | # Get ID
267 | row_id = row.get('Id', str(t))
268 | row.pop('Id', None)
269 |
270 | # Get features and predict
271 | x = self.get_features(row)
272 | p = self.predict_probability(x)
273 |
274 | # Write prediction
275 | f_out.write(f'{row_id},{p:.10f}\n')
276 | prediction_count = t
277 |
278 | # Log progress
279 | if t % self.log_interval == 0:
280 | logger.info(f"Generated {t:,} predictions")
281 |
282 | except Exception as e:
283 | logger.error(f"Error during prediction: {e}")
284 | raise
285 |
286 | logger.info(f"Prediction completed! Generated {prediction_count:,} predictions")
287 |
288 |
289 | def main():
290 | """Main execution function."""
291 | # Configuration
292 | TRAIN_FILE = Path('train.csv')
293 | TEST_FILE = Path('test.csv')
294 | OUTPUT_FILE = Path('submission.csv')
295 |
296 | # Model hyperparameters
297 | DIMENSION = 2 ** 27 # ~134M features
298 | LEARNING_RATE = 0.145
299 |
300 | try:
301 | # Initialize model
302 | model = CTRPredictor(
303 | dimension=DIMENSION,
304 | learning_rate=LEARNING_RATE
305 | )
306 |
307 | # Train model
308 | model.train(TRAIN_FILE)
309 |
310 | # Generate predictions
311 | model.predict(TEST_FILE, OUTPUT_FILE)
312 |
313 | logger.info("All tasks completed successfully!")
314 |
315 | except FileNotFoundError as e:
316 | logger.error(f"File error: {e}")
317 | logger.error("Please ensure train.csv and test.csv are in the current directory")
318 | logger.error("Download from: https://www.kaggle.com/c/criteo-display-ad-challenge/data")
319 | sys.exit(1)
320 | except Exception as e:
321 | logger.error(f"Unexpected error: {e}")
322 | sys.exit(1)
323 |
324 |
325 | if __name__ == '__main__':
326 | main()
327 |
--------------------------------------------------------------------------------
/logReg_modernized.py:
--------------------------------------------------------------------------------
1 | """
2 | Logistic Regression Training and Testing Module
3 |
4 | Implements logistic regression with multiple optimization algorithms:
5 | - Gradient Descent (GD)
6 | - Stochastic Gradient Descent (SGD)
7 | - Smooth Stochastic Gradient Descent (Smooth SGD)
8 |
9 | Modernized from original logReg.py with:
10 | - Python 3 compatibility
11 | - Explicit imports (no wildcards)
12 | - Type hints
13 | - Error handling
14 | - Better documentation
15 | """
16 |
17 | import logging
18 | import time
19 | from typing import Dict, Any, Tuple
20 |
21 | import numpy as np
22 | import matplotlib.pyplot as plt
23 |
24 | # Configure logging
25 | logging.basicConfig(
26 | level=logging.INFO,
27 | format='%(asctime)s - %(levelname)s - %(message)s'
28 | )
29 | logger = logging.getLogger(__name__)
30 |
31 |
32 | def sigmoid(x: np.ndarray) -> np.ndarray:
33 | """
34 | Calculate the sigmoid (logistic) function.
35 |
36 | Args:
37 | x: Input array or scalar
38 |
39 | Returns:
40 | Sigmoid of input, bounded between 0 and 1
41 | """
42 | return 1.0 / (1.0 + np.exp(-x))
43 |
44 |
45 | def train_logistic_regression(
46 | train_x: np.ndarray,
47 | train_y: np.ndarray,
48 | opts: Dict[str, Any]
49 | ) -> np.ndarray:
50 | """
51 | Train a logistic regression model using specified optimization algorithm.
52 |
53 | Args:
54 | train_x: Training features, shape (num_samples, num_features).
55 | Should include bias term (column of ones) if needed.
56 | train_y: Training labels, shape (num_samples, 1)
57 | opts: Dictionary with training options:
58 | - 'alpha': Learning rate (float)
59 | - 'maxIter': Maximum iterations (int)
60 | - 'optimizeType': Optimization algorithm (str)
61 | Options: 'gradDescent', 'stocGradDescent', 'smoothStocGradDescent'
62 |
63 | Returns:
64 | weights: Trained weight vector, shape (num_features, 1)
65 |
66 | Raises:
67 | ValueError: If optimize type is not recognized
68 | TypeError: If input arrays have wrong shape
69 | """
70 | # Validate inputs
71 | if train_x.shape[0] != train_y.shape[0]:
72 | raise ValueError(
73 | f"Sample count mismatch: train_x has {train_x.shape[0]} samples, "
74 | f"train_y has {train_y.shape[0]} samples"
75 | )
76 |
77 | start_time = time.time()
78 |
79 | num_samples, num_features = train_x.shape
80 | alpha = opts.get('alpha', 0.01)
81 | max_iter = opts.get('maxIter', 1000)
82 | optimize_type = opts.get('optimizeType', 'gradDescent')
83 |
84 | logger.info(f"Training Logistic Regression:")
85 | logger.info(f" Samples: {num_samples}")
86 | logger.info(f" Features: {num_features}")
87 | logger.info(f" Algorithm: {optimize_type}")
88 | logger.info(f" Learning rate: {alpha}")
89 | logger.info(f" Max iterations: {max_iter}")
90 |
91 | # Initialize weights
92 | weights = np.ones((num_features, 1))
93 |
94 | # Optimize using selected algorithm
95 | if optimize_type == 'gradDescent':
96 | weights = _gradient_descent(train_x, train_y, weights, alpha, max_iter)
97 |
98 | elif optimize_type == 'stocGradDescent':
99 | weights = _stochastic_gradient_descent(
100 | train_x, train_y, weights, alpha, max_iter
101 | )
102 |
103 | elif optimize_type == 'smoothStocGradDescent':
104 | weights = _smooth_stochastic_gradient_descent(
105 | train_x, train_y, weights, alpha, max_iter
106 | )
107 |
108 | else:
109 | raise ValueError(
110 | f"Unsupported optimize type: {optimize_type}. "
111 | f"Must be 'gradDescent', 'stocGradDescent', or 'smoothStocGradDescent'"
112 | )
113 |
114 | elapsed = time.time() - start_time
115 | logger.info(f"Training completed in {elapsed:.2f} seconds")
116 |
117 | return weights
118 |
119 |
120 | def _gradient_descent(
121 | train_x: np.ndarray,
122 | train_y: np.ndarray,
123 | weights: np.ndarray,
124 | alpha: float,
125 | max_iter: int
126 | ) -> np.ndarray:
127 | """
128 | Batch gradient descent optimization.
129 |
130 | Updates weights using all samples in each iteration.
131 | """
132 | for k in range(max_iter):
133 | # Forward pass
134 | output = sigmoid(train_x @ weights)
135 |
136 | # Calculate error
137 | error = train_y - output
138 |
139 | # Update weights using all samples
140 | weights = weights + alpha * (train_x.T @ error)
141 |
142 | # Log progress
143 | if (k + 1) % 100 == 0:
144 | loss = np.mean(-train_y * np.log(output + 1e-10) -
145 | (1 - train_y) * np.log(1 - output + 1e-10))
146 | logger.info(f"Iteration {k+1}/{max_iter}, Loss: {loss:.6f}")
147 |
148 | return weights
149 |
150 |
151 | def _stochastic_gradient_descent(
152 | train_x: np.ndarray,
153 | train_y: np.ndarray,
154 | weights: np.ndarray,
155 | alpha: float,
156 | max_iter: int
157 | ) -> np.ndarray:
158 | """
159 | Stochastic gradient descent optimization.
160 |
161 | Updates weights using one sample at a time.
162 | """
163 | num_samples = train_x.shape[0]
164 |
165 | for k in range(max_iter):
166 | for i in range(num_samples):
167 | # Get single sample
168 | x_i = train_x[i:i+1, :].T # Shape: (num_features, 1)
169 | y_i = train_y[i, 0]
170 |
171 | # Forward pass
172 | output = sigmoid((x_i.T @ weights)[0, 0])
173 |
174 | # Calculate error
175 | error = y_i - output
176 |
177 | # Update weights
178 | weights = weights + alpha * x_i * error
179 |
180 | # Log progress
181 | if (k + 1) % 100 == 0:
182 | output_all = sigmoid(train_x @ weights)
183 | loss = np.mean(-train_y * np.log(output_all + 1e-10) -
184 | (1 - train_y) * np.log(1 - output_all + 1e-10))
185 | logger.info(f"Iteration {k+1}/{max_iter}, Loss: {loss:.6f}")
186 |
187 | return weights
188 |
189 |
190 | def _smooth_stochastic_gradient_descent(
191 | train_x: np.ndarray,
192 | train_y: np.ndarray,
193 | weights: np.ndarray,
194 | alpha: float,
195 | max_iter: int
196 | ) -> np.ndarray:
197 | """
198 | Smooth stochastic gradient descent optimization.
199 |
200 | Uses random sample selection and adaptive learning rate to reduce oscillations.
201 | """
202 | num_samples = train_x.shape[0]
203 |
204 | for k in range(max_iter):
205 | # Create random order of samples
206 | indices = list(range(num_samples))
207 | np.random.shuffle(indices)
208 |
209 | for i, idx in enumerate(indices):
210 | # Adaptive learning rate that decreases over time
211 | adaptive_alpha = 4.0 / (1.0 + k + i) + 0.01
212 |
213 | # Get single sample
214 | x_i = train_x[idx:idx+1, :].T # Shape: (num_features, 1)
215 | y_i = train_y[idx, 0]
216 |
217 | # Forward pass
218 | output = sigmoid((x_i.T @ weights)[0, 0])
219 |
220 | # Calculate error
221 | error = y_i - output
222 |
223 | # Update weights with adaptive learning rate
224 | weights = weights + adaptive_alpha * x_i * error
225 |
226 | # Log progress
227 | if (k + 1) % 100 == 0:
228 | output_all = sigmoid(train_x @ weights)
229 | loss = np.mean(-train_y * np.log(output_all + 1e-10) -
230 | (1 - train_y) * np.log(1 - output_all + 1e-10))
231 | logger.info(f"Iteration {k+1}/{max_iter}, Loss: {loss:.6f}")
232 |
233 | return weights
234 |
235 |
236 | def test_logistic_regression(
237 | weights: np.ndarray,
238 | test_x: np.ndarray,
239 | test_y: np.ndarray
240 | ) -> float:
241 | """
242 | Test trained logistic regression model and calculate accuracy.
243 |
244 | Args:
245 | weights: Trained weight vector, shape (num_features, 1)
246 | test_x: Test features, shape (num_samples, num_features)
247 | test_y: Test labels, shape (num_samples, 1)
248 |
249 | Returns:
250 | accuracy: Proportion of correct predictions (0 to 1)
251 | """
252 | num_samples = test_x.shape[0]
253 | match_count = 0
254 |
255 | for i in range(num_samples):
256 | # Get prediction probability
257 | x_i = test_x[i:i+1, :]
258 | prob = sigmoid((x_i @ weights)[0, 0])
259 |
260 | # Convert to binary prediction (threshold at 0.5)
261 | predict = prob > 0.5
262 |
263 | # Check if correct
264 | if predict == bool(test_y[i, 0]):
265 | match_count += 1
266 |
267 | accuracy = match_count / num_samples
268 |
269 | logger.info(f"Test Results:")
270 | logger.info(f" Correct: {match_count}/{num_samples}")
271 | logger.info(f" Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
272 |
273 | return accuracy
274 |
275 |
276 | def visualize_logistic_regression(
277 | weights: np.ndarray,
278 | train_x: np.ndarray,
279 | train_y: np.ndarray
280 | ) -> None:
281 | """
282 | Visualize the trained logistic regression decision boundary.
283 |
284 | Note: Only works with 2D data (3 features including bias).
285 |
286 | Args:
287 | weights: Trained weight vector
288 | train_x: Training features including bias column
289 | train_y: Training labels
290 |
291 | Raises:
292 | ValueError: If data is not 2D
293 | """
294 | num_samples, num_features = train_x.shape
295 |
296 | if num_features != 3:
297 | raise ValueError(
298 | f"Visualization only supports 2D data (3 features with bias). "
299 | f"Got {num_features} features."
300 | )
301 |
302 | logger.info("Generating visualization...")
303 |
304 |     # Plot each class with a single call so both classes appear in the legend
305 |     class0_mask = train_y[:, 0] == 0
306 |     plt.plot(train_x[class0_mask, 1], train_x[class0_mask, 2],
307 |              'or', label='Class 0')
308 |     plt.plot(train_x[~class0_mask, 1], train_x[~class0_mask, 2],
309 |              'ob', label='Class 1')
310 |
311 | # Draw decision boundary
312 | # Line equation: w0 + w1*x1 + w2*x2 = 0
313 | # Solve for x2: x2 = -(w0 + w1*x1) / w2
314 | min_x = np.min(train_x[:, 1])
315 | max_x = np.max(train_x[:, 1])
316 |
317 | w = weights.flatten()
318 | y_min = -(w[0] + w[1] * min_x) / w[2]
319 | y_max = -(w[0] + w[1] * max_x) / w[2]
320 |
321 | plt.plot([min_x, max_x], [y_min, y_max], '-g', linewidth=2, label='Decision Boundary')
322 |
323 | plt.xlabel('Feature 1')
324 | plt.ylabel('Feature 2')
325 | plt.title('Logistic Regression Decision Boundary')
326 | plt.legend()
327 | plt.grid(True, alpha=0.3)
328 | plt.show()
329 |
330 |
331 | # Example usage
332 | if __name__ == '__main__':
333 | # Generate sample data
334 | np.random.seed(42)
335 |
336 | # Create synthetic 2D dataset
337 | num_samples = 100
338 |
339 | # Class 0: centered at (2, 2)
340 | class0 = np.random.randn(num_samples // 2, 2) + np.array([2, 2])
341 |
342 | # Class 1: centered at (5, 5)
343 | class1 = np.random.randn(num_samples // 2, 2) + np.array([5, 5])
344 |
345 | # Combine data
346 | X = np.vstack([class0, class1])
347 | y = np.vstack([np.zeros((num_samples // 2, 1)), np.ones((num_samples // 2, 1))])
348 |
349 | # Add bias term
350 | X_with_bias = np.hstack([np.ones((num_samples, 1)), X])
351 |
352 | # Training options
353 | options = {
354 | 'alpha': 0.01,
355 | 'maxIter': 500,
356 | 'optimizeType': 'smoothStocGradDescent'
357 | }
358 |
359 | # Train model
360 | trained_weights = train_logistic_regression(X_with_bias, y, options)
361 |
362 | # Test model
363 | accuracy = test_logistic_regression(trained_weights, X_with_bias, y)
364 |
365 | # Visualize (optional - uncomment to show plot)
366 | # visualize_logistic_regression(trained_weights, X_with_bias, y)
367 |
--------------------------------------------------------------------------------
/README_NEW.md:
--------------------------------------------------------------------------------
1 | # Predict Click-Through Rates on Display Ads
2 |
3 | A machine learning project for predicting Click-Through Rates (CTR) in display advertising using the Criteo dataset. This repository implements multiple algorithms including Logistic Regression with SGD, Gradient Boosting Machines, and Vowpal Wabbit models.
4 |
5 | [](https://opensource.org/licenses/MIT)
6 | 
7 | 
8 |
9 | ---
10 |
11 | ## Table of Contents
12 |
13 | - [Overview](#overview)
14 | - [Project Status](#project-status)
15 | - [Features](#features)
16 | - [Installation](#installation)
17 | - [Quick Start](#quick-start)
18 | - [Data](#data)
19 | - [Models](#models)
20 | - [Project Structure](#project-structure)
21 | - [Usage](#usage)
22 | - [Development](#development)
23 | - [Contributing](#contributing)
24 | - [License](#license)
25 | - [Acknowledgments](#acknowledgments)
26 |
27 | ---
28 |
29 | ## Overview
30 |
31 | Display advertising is a billion-dollar industry and one of the central applications of machine learning on the Internet. This project was developed for the **Criteo Display Advertising Challenge**, where the goal is to predict the probability that a user will click on a given ad (CTR).
32 |
33 | ### The Challenge
34 |
35 | Given:
36 | - User information
37 | - Page context
38 | - Ad features (39 anonymized features: 13 numerical, 26 categorical)
39 |
40 | Predict:
41 | - Probability of click (binary classification)
42 |
43 | ### Evaluation Metric
44 |
45 | - **Log Loss (Logarithmic Loss)**: Lower is better
46 |
47 | ---
48 |
49 | ## Project Status
50 |
51 | **Version:** 2.0 (Modernized)
52 |
53 | **Status:**
54 | - ✅ Python 3 migration complete
55 | - ✅ Core algorithms refactored with modern best practices
56 | - ✅ Documentation updated
57 | - ✅ Error handling and logging added
58 | - 🚧 Unit tests in progress
59 | - 🚧 CI/CD pipeline in progress
60 |
61 | **Legacy Code:**
62 | - Original Python 2 implementations preserved under their original filenames (modern versions carry the `_modernized` suffix)
63 | - See `docs/MIGRATION_GUIDE.md` for differences
64 |
65 | ---
66 |
67 | ## Features
68 |
69 | ### Algorithms Implemented
70 |
71 | 1. **Logistic Regression with SGD** (`py_lh4_modernized.py`)
72 | - Hash trick for feature engineering (2^27 dimensional space)
73 | - Adaptive learning rate
74 | - Memory-efficient online learning
75 | - Bounded numerical operations for stability
76 |
77 | 2. **Gradient Boosting Machine** (`gbm_modernized.R`)
78 | - Tree-based ensemble method
79 | - Cross-validation with ROC optimization
80 | - Configurable hyperparameters
81 |
82 | 3. **Vowpal Wabbit** (`csv_to_vw_modernized.py`)
83 | - Fast linear learner
84 | - CSV to VW format converter
85 | - Scalable to massive datasets
86 |
87 | ### Modern Features
88 |
89 | - ✨ Python 3.8+ compatibility
90 | - 🔒 Input validation and error handling
91 | - 📊 Comprehensive logging
92 | - ⚙️ Configurable parameters
93 | - 📝 Type hints and documentation
94 | - 🧪 Unit test infrastructure
95 | - 🐳 Docker support (coming soon)
96 |
97 | ---
98 |
99 | ## Installation
100 |
101 | ### Prerequisites
102 |
103 | - Python 3.8 or higher
104 | - R 4.0 or higher (for R models)
105 | - Vowpal Wabbit (for VW models)
106 | - Git
107 |
108 | ### Clone Repository
109 |
110 | ```bash
111 | git clone https://github.com/yourusername/Predict-click-through-rates-on-display-ads.git
112 | cd Predict-click-through-rates-on-display-ads
113 | ```
114 |
115 | ### Python Setup
116 |
117 | ```bash
118 | # Create virtual environment
119 | python -m venv venv
120 |
121 | # Activate virtual environment
122 | # On macOS/Linux:
123 | source venv/bin/activate
124 | # On Windows:
125 | # venv\Scripts\activate
126 |
127 | # Install dependencies
128 | pip install -r requirements.txt
129 | ```
130 |
131 | ### R Setup
132 |
133 | ```bash
134 | # Install R packages
135 | Rscript -e "install.packages(c('data.table', 'caret', 'gbm'), dependencies=TRUE)"
136 | ```
137 |
138 | ### Vowpal Wabbit Setup
139 |
140 | ```bash
141 | # macOS
142 | brew install vowpal-wabbit
143 |
144 | # Ubuntu/Debian
145 | sudo apt-get install vowpal-wabbit
146 |
147 | # From source (all platforms)
148 | git clone https://github.com/VowpalWabbit/vowpal_wabbit.git
149 | cd vowpal_wabbit
150 | make
151 | ```
152 |
153 | ---
154 |
155 | ## Quick Start
156 |
157 | ### 1. Download Data
158 |
159 | ```bash
160 | # Create data directory
161 | mkdir -p data
162 |
163 | # Download from Kaggle
164 | # Visit: https://www.kaggle.com/c/criteo-display-ad-challenge/data
165 | # Place train.csv and test.csv in data/ directory
166 | ```
167 |
168 | Or use the download script:
169 |
170 | ```bash
171 | python scripts/download_data.py
172 | ```
173 |
174 | ### 2. Train a Model
175 |
176 | **Logistic Regression (Python):**
177 |
178 | ```bash
179 | python py_lh4_modernized.py
180 | ```
181 |
182 | **Gradient Boosting Machine (R):**
183 |
184 | ```bash
185 | DATA_DIR=./data OUTPUT_DIR=./output Rscript gbm_modernized.R
186 | ```
187 |
188 | **Vowpal Wabbit:**
189 |
190 | ```bash
191 | # Convert CSV to VW format
192 | python csv_to_vw_modernized.py data/train.csv data/train.vw
193 | python csv_to_vw_modernized.py data/test.csv data/test.vw --test
194 |
195 | # Train model
196 | vw data/train.vw --loss_function logistic -f models/click.model --passes 3 --cache_file data/train.cache
197 |
198 | # Generate predictions
199 | vw data/test.vw -i models/click.model -t -p predictions.txt
200 | ```
201 |
202 | ### 3. Generate Submission
203 |
204 | ```bash
205 | # Predictions are automatically saved to submission.csv (Python)
206 | # or specified output directory (R)
207 | ```
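
For the Vowpal Wabbit route there is one extra step: `predictions.txt` typically holds raw scores (unless `--link logistic` was passed at prediction time), so they need to be mapped through a sigmoid and paired with the test IDs. A minimal sketch, assuming the IDs in `data/test.csv` appear in the same order VW processed the rows:

```python
import csv
from math import exp

with open('predictions.txt') as preds, \
     open('data/test.csv', newline='') as test, \
     open('submission.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['Id', 'Predicted'])
    for raw, row in zip(preds, csv.DictReader(test)):
        score = float(raw.split()[0])        # VW may append the example tag after the score
        prob = 1.0 / (1.0 + exp(-score))     # raw score -> click probability
        writer.writerow([row['Id'], f'{prob:.10f}'])
```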
208 |
209 | ---
210 |
211 | ## Data
212 |
213 | ### Dataset Information
214 |
215 | - **Source:** [Criteo Labs](https://www.kaggle.com/c/criteo-display-ad-challenge/data)
216 | - **Size:**
217 | - Training: ~45 million samples
218 | - Test: ~6 million samples
219 | - **Features:** 39 anonymized features
220 | - 13 numerical features (`I1-I13`)
221 | - 26 categorical features (`C1-C26`)
222 | - **Target:** Binary (0 = no click, 1 = click)
223 |
224 | ### Data Format
225 |
226 | **CSV Format:**
227 |
228 | ```
229 | Id,Label,I1,I2,...,I13,C1,C2,...,C26
230 | 1,0,1,5,...,45,68fd1e,80e26c,...,a458ea
231 | 2,1,2,,,7,68fd1e,80e26c,...,b458ea
232 | ```
233 |
234 | **Vowpal Wabbit Format:**
235 |
236 | ```
237 | 1 'id1 |i I1:1 I2:5 I13:45 |c 68fd1e 80e26c a458ea
238 | -1 'id2 |i I1:2 I13:7 |c 68fd1e 80e26c b458ea
239 | ```
240 |
241 | ### Data Preprocessing
242 |
243 | Missing values are common in this dataset:
244 |
245 | ```python
246 | # Option 1: Use models that handle missing values (GBM, Random Forest)
247 | # Option 2: Impute missing values
248 | from sklearn.impute import SimpleImputer
249 | imputer = SimpleImputer(strategy='median')
250 | X_imputed = imputer.fit_transform(X)
251 | ```
252 |
253 | ---
254 |
255 | ## Models
256 |
257 | ### 1. Logistic Regression with SGD
258 |
259 | **File:** `py_lh4_modernized.py`
260 |
261 | **Features:**
262 | - Hash trick for feature engineering (see the sketch below)
263 | - Adaptive learning rate: α/(√n + 1)
264 | - Bounded sigmoid and log loss for numerical stability
265 | - Memory-efficient: processes one sample at a time
266 |
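The first two ideas above fit in a few lines. This is an illustrative sketch only (it uses Python's built-in `hash` and a small feature space); the actual implementation in `py_lh4_modernized.py` uses a hex-based hash over the raw values and `D = 2**27`:

```python
from math import sqrt

D = 2 ** 20                 # small space for illustration; the project uses 2**27
w = [0.0] * D               # weights
n = [0.0] * D               # per-feature update counts
alpha = 0.145               # base learning rate

def hash_feature(name: str, value: str) -> int:
    """Map a raw feature such as C1=68fd1e to an index in [0, D)."""
    return hash(f"{name}={value}") % D

# One SGD step for a single hashed sample x with label y and prediction p
x = [0, hash_feature("C1", "68fd1e"), hash_feature("I1", "5")]  # index 0 is the bias
p, y = 0.2, 1.0
for i in x:
    w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.0)  # adaptive rate: alpha / (sqrt(n_i) + 1)
    n[i] += 1.0
```
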
267 | **Hyperparameters:**
268 | ```python
269 | DIMENSION = 2**27 # ~134M features
270 | LEARNING_RATE = 0.145 # Alpha for SGD
271 | ```
272 |
273 | **Usage:**
274 | ```python
275 | from py_lh4_modernized import CTRPredictor
276 |
277 | model = CTRPredictor(dimension=2**27, learning_rate=0.145)
278 | model.train(Path('data/train.csv'))
279 | model.predict(Path('data/test.csv'), Path('submission.csv'))
280 | ```
281 |
282 | **Expected Performance:**
283 | - Training time: 30-60 minutes (45M samples)
284 | - Memory: ~2GB
285 | - Log loss: ~0.44-0.46
286 |
287 | ### 2. Gradient Boosting Machine (GBM)
288 |
289 | **File:** `gbm_modernized.R`
290 |
291 | **Features:**
292 | - Tree-based ensemble method
293 | - Cross-validation with ROC optimization
294 | - Handles missing values automatically
295 |
296 | **Hyperparameters:**
297 | ```r
298 | N_TREES <- 500
299 | INTERACTION_DEPTH <- 22
300 | SHRINKAGE <- 0.1
301 | ```
302 |
303 | **Usage:**
304 | ```bash
305 | DATA_DIR=./data OUTPUT_DIR=./output Rscript gbm_modernized.R
306 | ```
307 |
308 | **Expected Performance:**
309 | - Training time: 2-4 hours
310 | - Memory: ~8-16GB
311 | - Log loss: ~0.43-0.45
312 |
313 | ### 3. Vowpal Wabbit
314 |
315 | **Files:** `csv_to_vw_modernized.py`, VW commands
316 |
317 | **Features:**
318 | - Extremely fast linear learner
319 | - Scalable to billions of samples
320 | - Online learning capable
321 |
322 | **Usage:**
323 | ```bash
324 | # Convert format
325 | python csv_to_vw_modernized.py data/train.csv data/train.vw
326 |
327 | # Train with multiple passes
328 | vw data/train.vw \
329 | --loss_function logistic \
330 | --passes 3 \
331 | --cache_file data/train.cache \
332 | -f models/click.model
333 |
334 | # Predict
335 | vw data/test.vw \
336 | -i models/click.model \
337 | -t \
338 | -p predictions.txt
339 | ```
340 |
341 | **Expected Performance:**
342 | - Training time: 10-20 minutes
343 | - Memory: ~2-4GB
344 | - Log loss: ~0.44-0.46
345 |
346 | ---
347 |
348 | ## Project Structure
349 |
350 | ```
351 | Predict-click-through-rates-on-display-ads/
352 | │
353 | ├── README.md # This file
354 | ├── LICENSE # MIT License
355 | ├── requirements.txt # Python dependencies
356 | ├── requirements-r.txt # R dependencies
357 | ├── .gitignore # Git ignore rules
358 | │
359 | ├── data/ # Data directory (not in repo)
360 | │ ├── train.csv # Training data (download separately)
361 | │ ├── test.csv # Test data (download separately)
362 | │ ├── train.vw # VW format training data
363 | │ └── test.vw # VW format test data
364 | │
365 | ├── models/ # Trained models
366 | │ ├── click.model.vw # Vowpal Wabbit model
367 | │ └── gbm_model.rds # R GBM model
368 | │
369 | ├── output/ # Output directory
370 | │ ├── submission.csv # Predictions for submission
371 | │ └── training.log # Training logs
372 | │
373 | ├── Modern Implementations:
374 | │ ├── py_lh4_modernized.py # LR with SGD (Python 3)
375 | │ ├── logReg_modernized.py # LR module (Python 3)
376 | │ ├── csv_to_vw_modernized.py # CSV to VW converter (Python 3)
377 | │ └── gbm_modernized.R # GBM (R, modern)
378 | │
379 | ├── Legacy Code (Original):
380 | │ ├── py_lh4.py # Original LR implementation
381 | │ ├── logReg.py # Original LR module
382 | │ ├── gbm.R # Original GBM script
383 | │ └── [other legacy files]
384 | │
385 | ├── tests/ # Unit tests
386 | │ ├── test_lr_model.py
387 | │ ├── test_data_loading.py
388 | │ └── test_vw_converter.py
389 | │
390 | ├── scripts/ # Utility scripts
391 | │ ├── download_data.py # Data download helper
392 | │ └── evaluate_model.py # Model evaluation
393 | │
394 | └── docs/ # Additional documentation
395 | ├── MIGRATION_GUIDE.md # Python 2 to 3 migration notes
396 | ├── PERFORMANCE.md # Performance benchmarks
397 | └── API.md # API documentation
398 | ```
399 |
400 | ---
401 |
402 | ## Usage
403 |
404 | ### Configuration
405 |
406 | **Python Models:**
407 |
408 | Edit configuration at the top of each script:
409 |
410 | ```python
411 | # In py_lh4_modernized.py
412 | TRAIN_FILE = Path('data/train.csv')
413 | TEST_FILE = Path('data/test.csv')
414 | OUTPUT_FILE = Path('submission.csv')
415 | DIMENSION = 2 ** 27
416 | LEARNING_RATE = 0.145
417 | ```
418 |
419 | **R Models:**
420 |
421 | Use environment variables:
422 |
423 | ```bash
424 | export DATA_DIR=/path/to/data
425 | export OUTPUT_DIR=/path/to/output
426 | Rscript gbm_modernized.R
427 | ```
428 |
429 | ### Training Options
430 |
431 | **Full Training:**
432 | ```bash
433 | python py_lh4_modernized.py
434 | ```
435 |
436 | **Sample Training (first 1M rows):**
437 | ```bash
438 | head -n 1000001 data/train.csv > data/train_sample.csv
439 | # Update script to use train_sample.csv
440 | python py_lh4_modernized.py
441 | ```
442 |
443 | ### Monitoring Training
444 |
445 | Training progress is logged to both console and `training.log`:
446 |
447 | ```bash
448 | # Watch log in real-time
449 | tail -f training.log
450 | ```
451 |
452 | ### Model Evaluation
453 |
454 | ```bash
455 | python scripts/evaluate_model.py \
456 | --predictions submission.csv \
457 | --ground_truth data/train_with_labels.csv
458 | ```
459 |
460 | ---
461 |
462 | ## Development
463 |
464 | ### Setting Up Development Environment
465 |
466 | ```bash
467 | # Install development dependencies
468 | pip install -r requirements.txt
469 |
470 | # Install pre-commit hooks (coming soon)
471 | pre-commit install
472 |
473 | # Run code formatters
474 | black .
475 | ```
476 |
477 | ### Running Tests
478 |
479 | ```bash
480 | # Run all tests
481 | pytest
482 |
483 | # Run with coverage
484 | pytest --cov=. --cov-report=html
485 |
486 | # Run specific test
487 | pytest tests/test_lr_model.py
488 | ```
489 |
490 | ### Code Style
491 |
492 | This project follows:
493 | - **Python:** PEP 8, enforced with `black` and `flake8`
494 | - **R:** Tidyverse style guide
495 | - **Documentation:** Google-style docstrings
496 |
497 | ---
498 |
499 | ## Contributing
500 |
501 | Contributions are welcome! Please:
502 |
503 | 1. Fork the repository
504 | 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
505 | 3. Commit your changes (`git commit -m 'Add amazing feature'`)
506 | 4. Push to the branch (`git push origin feature/amazing-feature`)
507 | 5. Open a Pull Request
508 |
509 | ### Areas for Contribution
510 |
511 | - [ ] Deep learning models (Neural Networks, LSTM)
512 | - [ ] Feature engineering improvements
513 | - [ ] Hyperparameter optimization
514 | - [ ] Ensemble methods
515 | - [ ] Docker containerization
516 | - [ ] Web API for predictions
517 | - [ ] Performance optimizations
518 |
519 | ---
520 |
521 | ## License
522 |
523 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
524 |
525 | Copyright (c) 2019-2025 Tianxiang Liu
526 |
527 | ---
528 |
529 | ## Acknowledgments
530 |
531 | ### Dataset
532 | - **Criteo Labs** for providing the dataset
533 | - Kaggle for hosting the competition
534 |
535 | ### References
536 | - Original Vowpal Wabbit converter by Triskelion and Zygmunt Zając
537 | - Criteo Display Advertising Challenge: https://www.kaggle.com/c/criteo-display-ad-challenge
538 |
539 | ### Papers
540 | - Rendle, S. (2010). "Factorization Machines"
541 | - McMahan, H. B. et al. (2013). "Ad Click Prediction: a View from the Trenches"
542 | - He, X. et al. (2014). "Practical Lessons from Predicting Clicks on Ads at Facebook"
543 |
544 | ---
545 |
546 | ## Contact
547 |
548 | For questions or issues, please:
549 | - Open an issue on GitHub
550 | - Contact: [your-email@example.com]
551 |
552 | ---
553 |
554 | ## Changelog
555 |
556 | ### Version 2.0 (2025)
557 | - ✨ Migrated to Python 3.8+
558 | - ✨ Added comprehensive error handling and logging
559 | - ✨ Refactored code with type hints and documentation
560 | - ✨ Added configuration management
561 | - ✨ Improved README with detailed instructions
562 | - ✨ Added unit test infrastructure
563 |
564 | ### Version 1.0 (2014)
565 | - Initial implementation
566 | - Python 2 codebase
567 | - Multiple algorithm implementations
568 |
569 | ---
570 |
571 | **Happy Machine Learning! 🚀**
572 |
--------------------------------------------------------------------------------