├── pythonSGD
│   ├── submissionPython22Sep2014_pm.csv
│   ├── test.py
│   ├── esting.txt
│   ├── SGD_py.v11.suo
│   ├── SGD_py.sln
│   ├── logReg_test.py
│   ├── SGD_py.pyproj
│   ├── logReg_click.py
│   ├── py_lh_22Sep2014_2.py
│   ├── logReg.py
│   ├── py_lh4_22Sep2014.py
│   └── py_lh_20Sep2014.py
├── combineDatasets.R
├── .DS_Store
├── tests
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_lr_model.py
│   └── test_vw_converter.py
├── esting.txt
├── test.py
├── forum
│   ├── model
│   │   └── click.model.vw
│   ├── click.model.final2.vw
│   ├── vm_to_kaggle.py
│   ├── vm_to_kaggle.py~
│   ├── vm_command~
│   ├── vm_command
│   └── csv_to_vm.py
├── vowpal wabbit
│   ├── model
│   │   └── click.model.vw
│   ├── click.model.final2.vw
│   ├── Last model_23Sep2014.txt
│   ├── vm_to_kaggle.py
│   ├── vm_to_kaggle.py~
│   ├── vm_command~
│   ├── vm_command
│   ├── ending solution.txt
│   └── csv_to_vm.py
├── requirements-r.txt
├── requirements.txt
├── README.md
├── pytest.ini
├── LICENSE
├── logReg_test.py
├── r_sdg.R
├── .gitignore
├── SDG_21_Sep_2014.R
├── kaggle.py
├── testSet.txt
├── gbm.R
├── py_lh.py
├── py_lh2.py
├── py_lh3.py
├── py_lh4.py
├── logReg_click.py
├── logReg.py
├── py_lh_20Sep2014.py
├── config.yaml
├── .github
│   └── workflows
│       └── ci.yml
├── gbm_modernized.R
├── csv_to_vw_modernized.py
├── scripts
│   └── download_data.py
├── py_lh4_modernized.py
├── logReg_modernized.py
└── README_NEW.md
/pythonSGD/submissionPython22Sep2014_pm.csv: -------------------------------------------------------------------------------- 1 | Id,Predicted 2 | -------------------------------------------------------------------------------- /combineDatasets.R: -------------------------------------------------------------------------------- 1 | setwd("I:\\data") 2 | train_int <- read.csv('int/train_num.csv') -------------------------------------------------------------------------------- /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/.DS_Store -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | Test suite for CTR Prediction project. 
3 | """ 4 | 5 | __version__ = "2.0" 6 | -------------------------------------------------------------------------------- /esting.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/esting.txt -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | from numpy import * 2 | import matplotlib.pyplot as plt 3 | import time 4 | 5 | alpha = opts['alpha'] -------------------------------------------------------------------------------- /pythonSGD/test.py: -------------------------------------------------------------------------------- 1 | from numpy import * 2 | import matplotlib.pyplot as plt 3 | import time 4 | 5 | alpha = opts['alpha'] -------------------------------------------------------------------------------- /pythonSGD/esting.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/pythonSGD/esting.txt -------------------------------------------------------------------------------- /pythonSGD/SGD_py.v11.suo: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/pythonSGD/SGD_py.v11.suo -------------------------------------------------------------------------------- /forum/model/click.model.vw: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/forum/model/click.model.vw -------------------------------------------------------------------------------- /forum/click.model.final2.vw: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/forum/click.model.final2.vw -------------------------------------------------------------------------------- /vowpal wabbit/model/click.model.vw: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/vowpal wabbit/model/click.model.vw -------------------------------------------------------------------------------- /vowpal wabbit/click.model.final2.vw: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ivanliu1989/Predict-click-through-rates-on-display-ads/HEAD/vowpal wabbit/click.model.final2.vw -------------------------------------------------------------------------------- /requirements-r.txt: -------------------------------------------------------------------------------- 1 | # R Package Dependencies 2 | # Install with: Rscript -e "install.packages(c('data.table', 'caret', 'gbm'))" 3 | 4 | data.table>=1.14.0 5 | caret>=6.0-90 6 | gbm>=2.1.8 7 | -------------------------------------------------------------------------------- /vowpal wabbit/Last model_23Sep2014.txt: -------------------------------------------------------------------------------- 1 | vw click.train.vw -f click.model.vw --bfgs --passes 20 --holdout_after 40000000 -b 26 --cache_file click.train.vw.cache -l 0.145 --holdout_period 10 2 | 3 | vw click.train.vw -f click.model.vw -q:: 
--holdout_period 5 --noconstant --hash all --loss_function logistic -b 28 --save_per_pass --bfgs --termination 0.001 --passes 10 -l 0.1 --cache_file click.train.vw.cache 4 | 5 | 6 | --feature_mask -------------------------------------------------------------------------------- /forum/vm_to_kaggle.py: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | def zygmoid(x): 4 | #I know it's a common Sigmoid feature, but that's why I probably found 5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py 6 | return 1 / (1 + math.exp(-x)) 7 | 8 | with open("kaggle.click.submission.csv","wb") as outfile: 9 | outfile.write("Id,Predicted\n") 10 | for line in open("click.preds3.txt"): 11 | row = line.strip().split(" ") 12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0])))) 13 | -------------------------------------------------------------------------------- /forum/vm_to_kaggle.py~: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | def zygmoid(x): 4 | #I know it's a common Sigmoid feature, but that's why I probably found 5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py 6 | return 1 / (1 + math.exp(-x)) 7 | 8 | with open("kaggle.click.submission.csv","wb") as outfile: 9 | outfile.write("Id,Predicted\n") 10 | for line in open("click.preds2.txt"): 11 | row = line.strip().split(" ") 12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0])))) 13 | -------------------------------------------------------------------------------- /vowpal wabbit/vm_to_kaggle.py: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | def zygmoid(x): 4 | #I know it's a common Sigmoid feature, but that's why I probably found 5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py 6 | return 1 / (1 + math.exp(-x)) 7 | 8 | with open("kaggle.click.submission.csv","wb") as outfile: 9 | outfile.write("Id,Predicted\n") 10 | for line in open("click.preds3.txt"): 11 | row = line.strip().split(" ") 12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0])))) 13 | -------------------------------------------------------------------------------- /vowpal wabbit/vm_to_kaggle.py~: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | def zygmoid(x): 4 | #I know it's a common Sigmoid feature, but that's why I probably found 5 | #it on FastML too: https://github.com/zygmuntz/kaggle-stackoverflow/blob/master/sigmoid_mc.py 6 | return 1 / (1 + math.exp(-x)) 7 | 8 | with open("kaggle.click.submission.csv","wb") as outfile: 9 | outfile.write("Id,Predicted\n") 10 | for line in open("click.preds2.txt"): 11 | row = line.strip().split(" ") 12 | outfile.write("%s,%f\n"%(row[1],zygmoid(float(row[0])))) 13 | -------------------------------------------------------------------------------- /forum/vm_command~: -------------------------------------------------------------------------------- 1 | Training VW: 2 | ./vw click.train.vw -f click.model.vw --loss_function logistic 3 | 4 | 5 | Testing VW: 6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt 7 | 8 | 9 | Training VW2: 10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic 11 | 12 | 13 | Training VM3: 14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw 
--loss_function logistic 15 | 16 | parameters: 17 | -b bits 18 | -l rate 19 | --power_t p 20 | 21 | -------------------------------------------------------------------------------- /vowpal wabbit/vm_command~: -------------------------------------------------------------------------------- 1 | Training VW: 2 | ./vw click.train.vw -f click.model.vw --loss_function logistic 3 | 4 | 5 | Testing VW: 6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt 7 | 8 | 9 | Training VW2: 10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic 11 | 12 | 13 | Training VM3: 14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw --loss_function logistic 15 | 16 | parameters: 17 | -b bits 18 | -l rate 19 | --power_t p 20 | 21 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # Core dependencies 2 | numpy>=1.24.0,<2.0.0 3 | matplotlib>=3.7.0,<4.0.0 4 | pandas>=2.0.0,<3.0.0 5 | scipy>=1.10.0,<2.0.0 6 | 7 | # Configuration management 8 | pyyaml>=6.0,<7.0 9 | 10 | # Development dependencies 11 | pytest>=7.4.0,<8.0.0 12 | pytest-cov>=4.1.0,<5.0.0 13 | black>=23.0.0,<24.0.0 14 | flake8>=6.0.0,<7.0.0 15 | pylint>=2.17.0,<3.0.0 16 | 17 | # Optional: Advanced ML libraries 18 | # scikit-learn>=1.3.0,<2.0.0 19 | # xgboost>=2.0.0,<3.0.0 20 | 21 | # Optional: Vowpal Wabbit Python bindings 22 | # vowpalwabbit>=9.0.0,<10.0.0 23 | -------------------------------------------------------------------------------- /forum/vm_command: -------------------------------------------------------------------------------- 1 | Training VW: 2 | ./vw click.train.vw -f click.model.vw --loss_function logistic 3 | 4 | 5 | Testing VW: 6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt 7 | 8 | 9 | Training VW2: 10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic 11 | 12 | 13 | Training VM3: 14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw --loss_function logistic 15 | 16 | parameters: 17 | -b bits 18 | -l rate 19 | --power_t p 20 | --passes 21 | -c 22 | --holdout_off 23 | -------------------------------------------------------------------------------- /vowpal wabbit/vm_command: -------------------------------------------------------------------------------- 1 | Training VW: 2 | ./vw click.train.vw -f click.model.vw --loss_function logistic 3 | 4 | 5 | Testing VW: 6 | ./vw click.test.vw -t -i click.model.vw -p click.preds.txt 7 | 8 | 9 | Training VW2: 10 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off -f click.model.vw --loss_function logistic 11 | 12 | 13 | Training VM3: 14 | ./vw click.train.vw -b 28 -l 10 -c --passes 25 --holdout_off --power_t=1 -f click.model.vw --loss_function logistic 15 | 16 | parameters: 17 | -b bits 18 | -l rate 19 | --power_t p 20 | --passes 21 | -c 22 | --holdout_off 23 | -------------------------------------------------------------------------------- /pythonSGD/SGD_py.sln: -------------------------------------------------------------------------------- 1 | 2 | Microsoft Visual Studio Solution File, Format Version 12.00 3 | # Visual Studio 2012 4 | Project("{888888A0-9F3D-457C-B088-3A5042F75D52}") = "SGD_py", "SGD_py.pyproj", "{82875642-D6FA-4F5C-81E6-B89A93C1F5FF}" 5 | EndProject 6 | Global 7 | GlobalSection(SolutionConfigurationPlatforms) = preSolution 8 
| Debug|Any CPU = Debug|Any CPU 9 | Release|Any CPU = Release|Any CPU 10 | EndGlobalSection 11 | GlobalSection(ProjectConfigurationPlatforms) = postSolution 12 | {82875642-D6FA-4F5C-81E6-B89A93C1F5FF}.Debug|Any CPU.ActiveCfg = Debug|Any CPU 13 | {82875642-D6FA-4F5C-81E6-B89A93C1F5FF}.Release|Any CPU.ActiveCfg = Release|Any CPU 14 | EndGlobalSection 15 | GlobalSection(SolutionProperties) = preSolution 16 | HideSolutionNode = FALSE 17 | EndGlobalSection 18 | EndGlobal 19 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Predict-click-through-rates-on-display-ads 2 | ========================================== 3 | 4 | Display advertising is a billion dollar effort and one of the central uses of machine learning on the Internet. However, its data and methods are usually kept under lock and key. In this research competition, CriteoLabs is sharing a week’s worth of data for you to develop models predicting ad click-through rate (CTR). Given a user and the page he is visiting, what is the probability that he will click on a given ad? The goal of this challenge is to benchmark the most accurate ML algorithms for CTR estimation. All winning models will be released under an open source license. As a participant, you are given a chance to access the traffic logs from Criteo that include various undisclosed features along with the click labels. 5 | -------------------------------------------------------------------------------- /vowpal wabbit/ending solution.txt: -------------------------------------------------------------------------------- 1 | vw train_nw.vw -f data/model.vw --loss_function logistic -b 25 -l .15 -c --passes 5 -q cc -q ii -q ci --holdout_off --cubic iii --decay_learning_rate .8 2 | 3 | vw --holdout_off --cache_file data/train_cat_int.cache --loss_function logistic -b 29 --passes 6 -l 0.01 --nn 60 --power_t 0 -f data/nn60_l001_p6.mod 4 | 5 | vw -d train.vw -c -b 28 --link=logistic --loss_function logistic --passes 2 --holdout_off --ngram c3 --ngram n2 --skips n1 --ngram f2 --skips f1 --l2 7.12091e-09 -l 0.240971683207491 --initial_t 1.53478225382649 --decay_learning_rate 0.267332 6 | 7 | vw click4.TT.train.vw -k -c -f click.neu13.model.vw --loss_function logistic --passes 20 -l 0.15 -b 25 --nn 35 --holdout_period 50 --early_terminate 1 8 | 9 | ./vowpalwabbit/vw ../xtrain4.vw -c -k -l 0.1 -b 29 --loss_function logistic -q cc -q ii -q ci --holdout_off -f xmodel.vw -------------------------------------------------------------------------------- /pytest.ini: -------------------------------------------------------------------------------- 1 | [pytest] 2 | # Pytest configuration for CTR Prediction project 3 | 4 | # Test discovery patterns 5 | python_files = test_*.py 6 | python_classes = Test* 7 | python_functions = test_* 8 | 9 | # Test paths 10 | testpaths = tests 11 | 12 | # Options 13 | addopts = 14 | --verbose 15 | --strict-markers 16 | --tb=short 17 | --disable-warnings 18 | # Coverage options (uncomment when ready) 19 | # --cov=. 
20 | # --cov-report=html 21 | # --cov-report=term-missing 22 | # --cov-fail-under=80 23 | 24 | # Markers 25 | markers = 26 | slow: marks tests as slow (deselect with '-m "not slow"') 27 | integration: marks tests as integration tests 28 | unit: marks tests as unit tests 29 | requires_data: marks tests that require actual data files 30 | 31 | # Logging 32 | log_cli = true 33 | log_cli_level = INFO 34 | log_cli_format = %(asctime)s [%(levelname)s] %(message)s 35 | log_cli_date_format = %Y-%m-%d %H:%M:%S 36 | 37 | # Ignore patterns 38 | norecursedirs = .git .tox dist build *.egg venv env 39 | 40 | # Timeout (in seconds) for individual tests 41 | # timeout = 300 42 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Tianxiang Liu 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /logReg_test.py: -------------------------------------------------------------------------------- 1 | 2 | from numpy import * 3 | import matplotlib.pyplot as plt 4 | import time 5 | 6 | def loadData(): 7 | train_x = [] 8 | train_y = [] 9 | fileIn = open('E:/Python/Machine Learning in Action/testSet.txt') 10 | for line in fileIn.readlines(): 11 | lineArr = line.strip().split() 12 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])]) 13 | train_y.append(float(lineArr[2])) 14 | return mat(train_x), mat(train_y).transpose() 15 | 16 | 17 | ## step 1: load data 18 | print "step 1: load data..." 19 | train_x, train_y = loadData() 20 | test_x = train_x; test_y = train_y 21 | 22 | ## step 2: training... 23 | print "step 2: training..." 24 | opts = {'alpha': 0.01, 'maxIter': 20, 'optimizeType': 'smoothStocGradDescent'} 25 | optimalWeights = trainLogRegres(train_x, train_y, opts) 26 | 27 | ## step 3: testing 28 | print "step 3: testing..." 29 | accuracy = testLogRegres(optimalWeights, test_x, test_y) 30 | 31 | ## step 4: show the result 32 | print "step 4: show the result..." 
33 | print 'The classify accuracy is: %.3f%%' % (accuracy * 100) 34 | showLogRegres(optimalWeights, train_x, train_y) -------------------------------------------------------------------------------- /pythonSGD/logReg_test.py: -------------------------------------------------------------------------------- 1 | 2 | from numpy import * 3 | import matplotlib.pyplot as plt 4 | import time 5 | 6 | def loadData(): 7 | train_x = [] 8 | train_y = [] 9 | fileIn = open('E:/Python/Machine Learning in Action/testSet.txt') 10 | for line in fileIn.readlines(): 11 | lineArr = line.strip().split() 12 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])]) 13 | train_y.append(float(lineArr[2])) 14 | return mat(train_x), mat(train_y).transpose() 15 | 16 | 17 | ## step 1: load data 18 | print "step 1: load data..." 19 | train_x, train_y = loadData() 20 | test_x = train_x; test_y = train_y 21 | 22 | ## step 2: training... 23 | print "step 2: training..." 24 | opts = {'alpha': 0.01, 'maxIter': 20, 'optimizeType': 'smoothStocGradDescent'} 25 | optimalWeights = trainLogRegres(train_x, train_y, opts) 26 | 27 | ## step 3: testing 28 | print "step 3: testing..." 29 | accuracy = testLogRegres(optimalWeights, test_x, test_y) 30 | 31 | ## step 4: show the result 32 | print "step 4: show the result..." 33 | print 'The classify accuracy is: %.3f%%' % (accuracy * 100) 34 | showLogRegres(optimalWeights, train_x, train_y) -------------------------------------------------------------------------------- /tests/conftest.py: -------------------------------------------------------------------------------- 1 | """ 2 | Pytest configuration and fixtures. 3 | """ 4 | 5 | import pytest 6 | import numpy as np 7 | from pathlib import Path 8 | import tempfile 9 | import csv 10 | 11 | 12 | @pytest.fixture 13 | def sample_csv_data(): 14 | """Generate sample CSV data for testing.""" 15 | return [ 16 | {'Id': '1', 'Label': '0', 'I1': '5', 'I2': '10', 'C1': 'abc123', 'C2': 'def456'}, 17 | {'Id': '2', 'Label': '1', 'I1': '3', 'I2': '', 'C1': 'abc123', 'C2': 'xyz789'}, 18 | {'Id': '3', 'Label': '0', 'I1': '', 'I2': '20', 'C1': 'ghi012', 'C2': 'def456'}, 19 | ] 20 | 21 | 22 | @pytest.fixture 23 | def temp_csv_file(sample_csv_data, tmp_path): 24 | """Create a temporary CSV file with sample data.""" 25 | csv_file = tmp_path / "test_data.csv" 26 | 27 | fieldnames = ['Id', 'Label', 'I1', 'I2', 'C1', 'C2'] 28 | 29 | with open(csv_file, 'w', newline='') as f: 30 | writer = csv.DictWriter(f, fieldnames=fieldnames) 31 | writer.writeheader() 32 | writer.writerows(sample_csv_data) 33 | 34 | return csv_file 35 | 36 | 37 | @pytest.fixture 38 | def sample_train_data(): 39 | """Generate sample training data (X, y).""" 40 | np.random.seed(42) 41 | X = np.random.randn(100, 5) 42 | y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1) 43 | return X, y 44 | 45 | 46 | @pytest.fixture 47 | def temp_dir(): 48 | """Create a temporary directory for testing.""" 49 | with tempfile.TemporaryDirectory() as tmpdir: 50 | yield Path(tmpdir) 51 | -------------------------------------------------------------------------------- /r_sdg.R: -------------------------------------------------------------------------------- 1 | setwd('/Users/ivan/Work_directory/Predict-click-through-rates-on-display-ads/') 2 | train <- 'testSet.txt' 3 | test <- 'test.csv' 4 | D <- 2^20 5 | alpha <- .1 6 | w <- rep.int(0, D) 7 | n <- rep.int(0, D) 8 | loss <- 0. 
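# r_sdg.R sketches the same streaming approach as the py_lh*.py scripts: logistic regression
# trained by one-pass SGD, with raw feature values hashed into D buckets and a per-weight
# adaptive learning rate alpha / (sqrt(n[i]) + 1).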
9 | col_num <- 3 10 | 11 | # test logloss of predictions and true values 12 | logloss <- function(p,y) { 13 | p <- max(min(p, 1 - 10^-13), 10^-13) 14 | res <- ifelse(y==1, -log(p), -log(1 - p)) 15 | res 16 | } 17 | 18 | # extract one record from database 19 | get_data <- function(row, data){ 20 | r <- read.table(data, skip=row-1, nrows=1, sep='\t', row.names=row) 21 | r 22 | } 23 | 24 | # get possibilities of records 25 | get_p <- function(x, w){ 26 | wTx <- 0 27 | for (i2 in x) { 28 | wTx <- wTx + w[i2] * 1.} 29 | p <- 1/(1 + exp(-max(min(wTx, 20), -20))) 30 | p 31 | } 32 | 33 | # update the weights according to results 34 | update_w <- function(w, n, x, p, y){ 35 | for (i in x){ 36 | w[i] <- w[i] - ((p - y) * alpha / (sqrt(n[i]) + 1)) 37 | n[i] <- n[i] + 1 38 | } 39 | list(w = w, n = n) 40 | } 41 | 42 | # hash raw feature values into indices in 1..D (index 1 is kept for the bias term) 43 | get_x <- function(row, D){ 44 | x <- c(1) 45 | for (j in row){ 46 | index <- (as.integer(j) %% D) + 1 47 | x <- c(x, index) 48 | } 49 | x 50 | } 51 | 52 | # main steps for modeling 53 | for (i in 1:46000000) { 54 | row <- get_data(i, train) 55 | y <- row[[2]] 56 | row <- row[-c(1,2)] 57 | x <- get_x(row, D) 58 | p <- get_p(x, w) 59 | loss <- loss + logloss(p,y) 60 | if (i %% 1000000 == 0) { 61 | cat('logloss:', loss/i, '- rows processed:', i, '\n') 62 | } 63 | updates <- update_w(w, n, x, p, y) 64 | w <- updates$w 65 | n <- updates$n 66 | } 67 | 68 | 69 | 70 | 71 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Python 2 | *.py[cod] 3 | *$py.class 4 | *.so 5 | .Python 6 | build/ 7 | develop-eggs/ 8 | dist/ 9 | downloads/ 10 | eggs/ 11 | .eggs/ 12 | lib/ 13 | lib64/ 14 | parts/ 15 | sdist/ 16 | var/ 17 | wheels/ 18 | share/python-wheels/ 19 | *.egg-info/ 20 | .installed.cfg 21 | *.egg 22 | MANIFEST 23 | __pycache__/ 24 | *.pyc 25 | 26 | # Virtual environments 27 | venv/ 28 | env/ 29 | ENV/ 30 | env.bak/ 31 | venv.bak/ 32 | .venv/ 33 | 34 | # IDE 35 | .vscode/ 36 | .idea/ 37 | *.swp 38 | *.swo 39 | *~ 40 | .DS_Store 41 | *.suo 42 | *.user 43 | *.userosscache 44 | *.sln.docstates 45 | 46 | # Visual Studio 47 | *.pyproj 48 | *.sln 49 | *.suo 50 | *.user 51 | *.userosscache 52 | .vs/ 53 | 54 | # Jupyter Notebook 55 | .ipynb_checkpoints 56 | *.ipynb 57 | 58 | # R 59 | .Rhistory 60 | .Rapp.history 61 | .RData 62 | .Ruserdata 63 | *.Rproj 64 | .Rproj.user/ 65 | 66 | # Data files (large datasets) 67 | *.csv 68 | !**/sample*.csv 69 | *.tsv 70 | *.dat 71 | *.txt 72 | !requirements.txt 73 | !requirements-r.txt 74 | !README.txt 75 | !LICENSE.txt 76 | 77 | # Model files (can be large) 78 | *.model 79 | *.vw 80 | *.pkl 81 | *.h5 82 | *.joblib 83 | 84 | # Logs 85 | *.log 86 | logs/ 87 | 88 | # Output/Submission files 89 | submission*.csv 90 | *submit*.csv 91 | output/ 92 | results/ 93 | 94 | # Temporary files 95 | tmp/ 96 | temp/ 97 | *.tmp 98 | *.bak 99 | *~ 100 | *.cache 101 | 102 | # OS 103 | .DS_Store 104 | Thumbs.db 105 | ehthumbs.db 106 | 107 | # Backup files 108 | *.orig 109 | *~ 110 | 111 | # Claude settings (keep local only) 112 | .claude/settings.local.json 113 | -------------------------------------------------------------------------------- /SDG_21_Sep_2014.R: -------------------------------------------------------------------------------- 1 | setwd("C:\\Users\\Ivan.Liuyanfeng\\Desktop\\Data_Mining_Work_Space\\Predict-click-through-rates-on-display-ads\\local") 2 | # basic parameters 3 | con <- file('train.csv','r') 4 | D <- 2^27 5 | alpha <- .145 6 | 7 | # logit loss calculation 8 | logloss <- function(p,y){ 9
epsilon <- 10 ^ -15 10 | p <- max(min(p, 1-epsilon), epsilon) 11 | ll <- y*log(p) + (1-y)*log((1-p)) 12 | ll <- ll * -1/1 13 | ll 14 | } 15 | 16 | # prediction 17 | get_p<- function(x,w){ 18 | wTx <- 0 19 | for (i in 1:length(x)){ 20 | wTx <- wTx + w[i] * 1 21 | } 22 | sigmoid <- 1/(1+exp(-max(min(wTx,20)-20))) 23 | sigmoid 24 | } 25 | 26 | # update weights 27 | update_w <-function (w, n, x, p, y){ 28 | for (i in 1:length(x)){ 29 | lr <- alpha / (sqrt(n[i])+1) 30 | gradient <- (p-y) 31 | w[i] <- w[i] - gradient * lr 32 | n[i] <- n[i] + 1 33 | } 34 | c(w,n) 35 | } 36 | 37 | # basic parameters 38 | # w <- rep(0, D) 39 | w <- rep(1, length(x)) 40 | # n <- rep(0, D) 41 | n <- rep(0, length(x)) 42 | loss <- 0 43 | ##################start modeling - parameters setup ######################## 44 | train_label <- readLines(con, n=1) 45 | for (i in 1:10) { 46 | row <- readLines(con,n=1) 47 | train_row <- strsplit(row, ',') 48 | y <- as.integer(train_row[[1]][2]) 49 | x <- c() 50 | for (k in 3:40){ 51 | x <- c(x, train_row[[1]][k]) 52 | } 53 | ################# Modeling ################################################# 54 | p <- get_p(x,w) 55 | loss <- loss + logloss(p,y) 56 | if (i %% 100000 == 0){ 57 | print(loss) 58 | } 59 | upd <- update_w(w,n,x,p,y) 60 | w <- upd[1] 61 | n <- upd[2] 62 | print(loss) 63 | break 64 | } 65 | -------------------------------------------------------------------------------- /pythonSGD/SGD_py.pyproj: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Debug 5 | 2.0 6 | {82875642-d6fa-4f5c-81e6-b89a93c1f5ff} 7 | 8 | logReg.py 9 | 10 | . 11 | . 12 | {888888a0-9f3d-457c-b088-3a5042f75d52} 13 | Standard Python launcher 14 | 15 | 16 | 17 | 18 | 19 | 20 | 10.0 21 | $(MSBuildExtensionsPath32)\Microsoft\VisualStudio\v$(VisualStudioVersion)\Python Tools\Microsoft.PythonTools.targets 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | -------------------------------------------------------------------------------- /kaggle.py: -------------------------------------------------------------------------------- 1 | from numpy import * 2 | 3 | def loadDataSet(): 4 | dataMat = []; labelMat = [] 5 | fr = open('train.csv') 6 | for line in fr.readlines(): 7 | singleArr=[1.0] 8 | lineArr = line.strip().split() 9 | for i in range(39): 10 | singleArr.append(float(lineArr[i+2])) 11 | dataMat.append(singleArr) 12 | labelMat.append(int(lineArr[1])) 13 | return dataMat,labelMat 14 | 15 | def sigmoid(inX): 16 | return 1.0/(1+exp(-inX)) 17 | 18 | def gradAscent(dataMatIn, classLabels): 19 | dataMatrix = mat(dataMatIn) #convert to NumPy matrix 20 | labelMat = mat(classLabels).transpose() #convert to NumPy matrix 21 | m,n = shape(dataMatrix) 22 | alpha = 0.001 23 | maxCycles = 500 24 | weights = ones((n,1)) 25 | for k in range(maxCycles): #heavy on matrix operations 26 | h = sigmoid(dataMatrix*weights) #matrix mult 27 | error = (labelMat - h) #vector subtraction 28 | weights = weights + alpha * dataMatrix.transpose()* error #matrix mult 29 | return weights 30 | 31 | def plotBestFit(weights): 32 | import matplotlib.pyplot as plt 33 | dataMat,labelMat=loadDataSet() 34 | dataArr = array(dataMat) 35 | n = shape(dataArr)[0] 36 | xcord1 = []; ycord1 = [] 37 | xcord2 = []; ycord2 = [] 38 | for i in range(n): 39 | if int(labelMat[i])== 1: 40 | xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2]) 41 | else: 42 | xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2]) 43 | fig = plt.figure() 44 | ax = fig.add_subplot(111) 45 | 
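    # the two scatter calls below draw class-1 samples as red squares and class-0 samples as
    # green dots; the plotted line y = (-w0 - w1*x)/w2 is the decision boundary
    # w0 + w1*x1 + w2*x2 = 0, i.e. where the fitted sigmoid equals 0.5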
ax.scatter(xcord1, ycord1, s=30, c='red', marker='s') 46 | ax.scatter(xcord2, ycord2, s=30, c='green') 47 | x = arange(-3.0, 3.0, 0.1) 48 | y = (-weights[0]-weights[1]*x)/weights[2] 49 | ax.plot(x, y) 50 | plt.xlabel('X1'); plt.ylabel('X2'); 51 | plt.show() -------------------------------------------------------------------------------- /forum/csv_to_vm.py: -------------------------------------------------------------------------------- 1 | # -*- coding: UTF-8 -*- 2 | 3 | ######################################################## 4 | # __Author__: Triskelion # 5 | # Kaggle competition "Display Advertising Challenge": # 6 | # http://www.kaggle.com/c/criteo-display-ad-challenge/ # 7 | # Credit: Zygmunt Zając # 8 | ######################################################## 9 | 10 | from datetime import datetime 11 | from csv import DictReader 12 | 13 | def csv_to_vw(loc_csv, loc_output, train=True): 14 | """ 15 | Munges a CSV file (loc_csv) to a VW file (loc_output). Set "train" 16 | to False when munging a test set. 17 | TODO: Too slow for a daily cron job. Try optimize, Pandas or Go. 18 | """ 19 | start = datetime.now() 20 | print("\nTurning %s into %s. Is_train_set? %s"%(loc_csv,loc_output,train)) 21 | 22 | with open(loc_output,"wb") as outfile: 23 | for e, row in enumerate( DictReader(open(loc_csv)) ): 24 | 25 | #Creating the features 26 | numerical_features = "" 27 | categorical_features = "" 28 | for k,v in row.items(): 29 | if k not in ["Label","Id"]: 30 | if "I" in k: # numerical feature, example: I5 31 | if len(str(v)) > 0: #check for empty values 32 | numerical_features += " %s:%s" % (k,v) 33 | if "C" in k: # categorical feature, example: C2 34 | if len(str(v)) > 0: 35 | categorical_features += " %s" % v 36 | 37 | #Creating the labels 38 | if train: #we care about labels 39 | if row['Label'] == "1": 40 | label = 1 41 | else: 42 | label = -1 #we set negative label to -1 43 | outfile.write( "%s '%s |i%s |c%s\n" % (label,row['Id'],numerical_features,categorical_features) ) 44 | 45 | else: #we dont care about labels 46 | outfile.write( "1 '%s |i%s |c%s\n" % (row['Id'],numerical_features,categorical_features) ) 47 | 48 | #Reporting progress 49 | if e % 1000000 == 0: 50 | print("%s\t%s"%(e, str(datetime.now() - start))) 51 | 52 | print("\n %s Task execution time:\n\t%s"%(e, str(datetime.now() - start))) 53 | 54 | #csv_to_vw("d:\\Downloads\\train\\train.csv", "c:\\click.train.vw",train=True) 55 | #csv_to_vw("d:\\Downloads\\test\\test.csv", "d:\\click.test.vw",train=False) -------------------------------------------------------------------------------- /vowpal wabbit/csv_to_vm.py: -------------------------------------------------------------------------------- 1 | # -*- coding: UTF-8 -*- 2 | 3 | ######################################################## 4 | # __Author__: Triskelion # 5 | # Kaggle competition "Display Advertising Challenge": # 6 | # http://www.kaggle.com/c/criteo-display-ad-challenge/ # 7 | # Credit: Zygmunt Zając # 8 | ######################################################## 9 | 10 | from datetime import datetime 11 | from csv import DictReader 12 | 13 | def csv_to_vw(loc_csv, loc_output, train=True): 14 | """ 15 | Munges a CSV file (loc_csv) to a VW file (loc_output). Set "train" 16 | to False when munging a test set. 17 | TODO: Too slow for a daily cron job. Try optimize, Pandas or Go. 18 | """ 19 | start = datetime.now() 20 | print("\nTurning %s into %s. Is_train_set? 
%s"%(loc_csv,loc_output,train)) 21 | 22 | with open(loc_output,"wb") as outfile: 23 | for e, row in enumerate( DictReader(open(loc_csv)) ): 24 | 25 | #Creating the features 26 | numerical_features = "" 27 | categorical_features = "" 28 | for k,v in row.items(): 29 | if k not in ["Label","Id"]: 30 | if "I" in k: # numerical feature, example: I5 31 | if len(str(v)) > 0: #check for empty values 32 | numerical_features += " %s:%s" % (k,v) 33 | if "C" in k: # categorical feature, example: C2 34 | if len(str(v)) > 0: 35 | categorical_features += " %s" % v 36 | 37 | #Creating the labels 38 | if train: #we care about labels 39 | if row['Label'] == "1": 40 | label = 1 41 | else: 42 | label = -1 #we set negative label to -1 43 | outfile.write( "%s '%s |i%s |c%s\n" % (label,row['Id'],numerical_features,categorical_features) ) 44 | 45 | else: #we dont care about labels 46 | outfile.write( "1 '%s |i%s |c%s\n" % (row['Id'],numerical_features,categorical_features) ) 47 | 48 | #Reporting progress 49 | if e % 1000000 == 0: 50 | print("%s\t%s"%(e, str(datetime.now() - start))) 51 | 52 | print("\n %s Task execution time:\n\t%s"%(e, str(datetime.now() - start))) 53 | 54 | #csv_to_vw("d:\\Downloads\\train\\train.csv", "c:\\click.train.vw",train=True) 55 | #csv_to_vw("d:\\Downloads\\test\\test.csv", "d:\\click.test.vw",train=False) -------------------------------------------------------------------------------- /testSet.txt: -------------------------------------------------------------------------------- 1 | -0.017612 14.053064 0 2 | -1.395634 4.662541 1 3 | -0.752157 6.538620 0 4 | -1.322371 7.152853 0 5 | 0.423363 11.054677 0 6 | 0.406704 7.067335 1 7 | 0.667394 12.741452 0 8 | -2.460150 6.866805 1 9 | 0.569411 9.548755 0 10 | -0.026632 10.427743 0 11 | 0.850433 6.920334 1 12 | 1.347183 13.175500 0 13 | 1.176813 3.167020 1 14 | -1.781871 9.097953 0 15 | -0.566606 5.749003 1 16 | 0.931635 1.589505 1 17 | -0.024205 6.151823 1 18 | -0.036453 2.690988 1 19 | -0.196949 0.444165 1 20 | 1.014459 5.754399 1 21 | 1.985298 3.230619 1 22 | -1.693453 -0.557540 1 23 | -0.576525 11.778922 0 24 | -0.346811 -1.678730 1 25 | -2.124484 2.672471 1 26 | 1.217916 9.597015 0 27 | -0.733928 9.098687 0 28 | -3.642001 -1.618087 1 29 | 0.315985 3.523953 1 30 | 1.416614 9.619232 0 31 | -0.386323 3.989286 1 32 | 0.556921 8.294984 1 33 | 1.224863 11.587360 0 34 | -1.347803 -2.406051 1 35 | 1.196604 4.951851 1 36 | 0.275221 9.543647 0 37 | 0.470575 9.332488 0 38 | -1.889567 9.542662 0 39 | -1.527893 12.150579 0 40 | -1.185247 11.309318 0 41 | -0.445678 3.297303 1 42 | 1.042222 6.105155 1 43 | -0.618787 10.320986 0 44 | 1.152083 0.548467 1 45 | 0.828534 2.676045 1 46 | -1.237728 10.549033 0 47 | -0.683565 -2.166125 1 48 | 0.229456 5.921938 1 49 | -0.959885 11.555336 0 50 | 0.492911 10.993324 0 51 | 0.184992 8.721488 0 52 | -0.355715 10.325976 0 53 | -0.397822 8.058397 0 54 | 0.824839 13.730343 0 55 | 1.507278 5.027866 1 56 | 0.099671 6.835839 1 57 | -0.344008 10.717485 0 58 | 1.785928 7.718645 1 59 | -0.918801 11.560217 0 60 | -0.364009 4.747300 1 61 | -0.841722 4.119083 1 62 | 0.490426 1.960539 1 63 | -0.007194 9.075792 0 64 | 0.356107 12.447863 0 65 | 0.342578 12.281162 0 66 | -0.810823 -1.466018 1 67 | 2.530777 6.476801 1 68 | 1.296683 11.607559 0 69 | 0.475487 12.040035 0 70 | -0.783277 11.009725 0 71 | 0.074798 11.023650 0 72 | -1.337472 0.468339 1 73 | -0.102781 13.763651 0 74 | -0.147324 2.874846 1 75 | 0.518389 9.887035 0 76 | 1.015399 7.571882 0 77 | -1.658086 -0.027255 1 78 | 1.319944 2.171228 1 79 | 2.056216 5.019981 1 80 | 
-0.851633 4.375691 1 81 | -1.510047 6.061992 0 82 | -1.076637 -3.181888 1 83 | 1.821096 10.283990 0 84 | 3.010150 8.401766 1 85 | -1.099458 1.688274 1 86 | -0.834872 -1.733869 1 87 | -0.846637 3.849075 1 88 | 1.400102 12.628781 0 89 | 1.752842 5.468166 1 90 | 0.078557 0.059736 1 91 | 0.089392 -0.715300 1 92 | 1.825662 12.693808 0 93 | 0.197445 9.744638 0 94 | 0.126117 0.922311 1 95 | -0.679797 1.220530 1 96 | 0.677983 2.556666 1 97 | 0.761349 10.693862 0 98 | -2.168791 0.143632 1 99 | 1.388610 9.341997 0 100 | 0.317029 14.739025 0 101 | -------------------------------------------------------------------------------- /gbm.R: -------------------------------------------------------------------------------- 1 | setwd("I:\\data") 2 | library(data.table) 3 | # train <- fread("train.csv",select=c(1:15)) 4 | # head(train) 5 | # write.table(train,"train_num.csv",sep=",", row.names=F, col.names=T) 6 | # str(train) 7 | # rm("train") 8 | # ?read.csv 9 | # 10 | # library(caret) 11 | # sum(is.na(train)) 12 | # mean(is.na(train)) 13 | # summary(train) 14 | # gc() 15 | # train <- fread("train_num.csv") 16 | # train <- na.omit(train) 17 | # index <- is.na(train) 18 | # table(index) 19 | # train <- train[-index,] 20 | # write.table(train,"train_num_na.csv",sep=",", row.names=F, col.names=T) 21 | # rm(train) 22 | 23 | # data cleansing 24 | ################# 25 | train <- read.csv("train_num_na.csv") 26 | train <- train[,-1] 27 | head(train) 28 | train[which(train$Label==1),1] <- "Yes" 29 | train[which(train$Label==0),1] <- "No" 30 | train$Label <- as.factor(train$Label) 31 | write.table(train,"train_num_na_yesno.csv",sep=",", row.names=F, col.names=T) 32 | 33 | 34 | #covariate creation 35 | # nearZeroVar(train,saveMetrics = T) 36 | # train.pca <- preProcess(train[,2:14], method='pca', pcaComp=2) 37 | 38 | # model 39 | # library(doParallel) 40 | library(caret) 41 | # cl <- makePSOCKcluster(4) 42 | # registerDoParallel(cl) 43 | # fit1 <- train(Label~., method="rf",data=train) 44 | 45 | Grid <- expand.grid(n.trees=c(500),interaction.depth=c(22),shrinkage=.1) 46 | fitControl <- trainControl(method="none", allowParallel=T, classProbs=T) 47 | fit2 <- train(Label~., method="gbm", data=train, trControl=fitControl, 48 | verbose=T,tuneGrid=Grid, metric="ROC") 49 | pred2 <- predict(fit2, train) 50 | confusionMatrix(pred2, train$Label) 51 | rm(pred2) 52 | 53 | # fit3 <- train(Label~., method="glmnet",family="binomial",classProbs=T, data=train,verbose=T) 54 | # fit3 <- train(Label~., method="glmnet",classProbs=T, data=train,verbose=T) 55 | 56 | # fit 4 57 | ctrl <- trainControl(method = "cv", 58 | number=2, 59 | classProbs = TRUE, 60 | allowParallel = TRUE, 61 | summaryFunction = twoClassSummary) 62 | 63 | set.seed(888) 64 | rfFit <- train(Label~., 65 | data=train, 66 | method = "rf", 67 | # tuneGrid = expand.grid(.mtry = 4), 68 | ntrees=500, 69 | importance = TRUE, 70 | metric = "ROC", 71 | trControl = ctrl) 72 | 73 | 74 | pred <- predict.train(rfFit, newdata = test, type = "prob") 75 | 76 | 77 | # stopCluster(cl) 78 | 79 | # load test data 80 | # test <- fread("test.csv", select=c(1:14)) 81 | # write.table(test,"test_num.csv",sep=",", row.names=F, col.names=T) 82 | # test <- read.csv("test_num.csv") 83 | test <- read.csv("test_num_impute.csv") 84 | # test data imputation 85 | # pre<-preProcess(test, method='medianImpute') 86 | # test_impute <- predict(pre, test) 87 | 88 | # predict 89 | gc() 90 | # pred1 <- predict(fit1, test) 91 | pred2 <- predict(fit2,type="prob", test) 92 | head(test) 93 | pred2 <- plogis(pred2) 94 
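# caret's predict(type = "prob") returns one probability column per class level ("No"/"Yes"
# here after the Label recoding), so a minimal sketch of building the submission directly from
# that output, assuming the positive-class column is named "Yes":
# fit2.submit <- data.frame(Id = test$Id, Predicted = pred2[, "Yes"])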
| # pred3 <- predict(fit3, test) 95 | # ensembling-models 96 | # data(pred1,pred2,pred3,train) 97 | # combFit<-train(Label~.,method="gam", train) 98 | 99 | # output 100 | fit2.submit <- data.frame(test$Id, test$pred2) 101 | colnames(fit2.submit)<- c("Id","Predicted") 102 | write.table(fit2.submit,"submit1_gbm_num_nona_impute.csv", row.names=F, sep=',') 103 | -------------------------------------------------------------------------------- /py_lh.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from csv import DictReader 3 | from math import exp, log, sqrt 4 | 5 | 6 | # parameters ################################################################# 7 | 8 | train = 'train.csv' # path to training file 9 | test = 'test.csv' # path to testing file 10 | 11 | D = 2 ** 20 # number of weights use for learning 12 | alpha = .2 # learning rate for sgd optimization 13 | 14 | 15 | # function definitions ####################################################### 16 | 17 | # A. Bounded logloss 18 | # INPUT: 19 | # p: our prediction 20 | # y: real answer 21 | # OUTPUT 22 | # logarithmic loss of p given y 23 | def logloss(p, y): 24 | p = max(min(p, 1. - 10e-12), 10e-12) 25 | return -log(p) if y == 1. else -log(1. - p) 26 | 27 | 28 | # B. Apply hash trick of the original csv row 29 | # for simplicity, we treat both integer and categorical features as categorical 30 | # INPUT: 31 | # csv_row: a csv dictionary, ex: {'Lable': '1', 'I1': '357', 'I2': '', ...} 32 | # D: the max index that we can hash to 33 | # OUTPUT: 34 | # x: a list of indices that its value is 1 35 | def get_x(csv_row, D): 36 | x = [0] # 0 is the index of the bias term 37 | for key, value in csv_row.items(): 38 | index = int(value + key[1:], 16) % D # weakest hash ever ;) 39 | x.append(index) 40 | return x # x contains indices of features that have a value of 1 41 | 42 | 43 | # C. Get probability estimation on x 44 | # INPUT: 45 | # x: features 46 | # w: weights 47 | # OUTPUT: 48 | # probability of p(y = 1 | x; w) 49 | def get_p(x, w): 50 | wTx = 0. 51 | for i in x: # do wTx 52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1. 53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid 54 | 55 | 56 | # D. Update given model 57 | # INPUT: 58 | # w: weights 59 | # n: a counter that counts the number of times we encounter a feature 60 | # this is used for adaptive learning rate 61 | # x: feature 62 | # p: prediction of our model 63 | # y: answer 64 | # OUTPUT: 65 | # w: updated model 66 | # n: updated count 67 | def update_w(w, n, x, p, y): 68 | for i in x: 69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic 70 | # (p - y) * x[i] is the current gradient 71 | # note that in our case, if i in x then x[i] = 1 72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.) 73 | n[i] += 1. 74 | 75 | return w, n 76 | 77 | 78 | # training and testing ####################################################### 79 | 80 | # initialize our model 81 | w = [0.] * D # weights 82 | n = [0.] * D # number of times we've encountered a feature 83 | 84 | # start training a logistic regression model using on pass sgd 85 | loss = 0. 86 | for t, row in enumerate(DictReader(open(train))): 87 | y = 1. if row['Label'] == '1' else 0. 
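    # hashing-trick illustration (hypothetical row): {'I1': '357', 'C1': 'ab12'} is turned by
    # get_x into indices int('357' + '1', 16) % D and int('ab12' + '1', 16) % D, plus index 0
    # for the bias term, so x is just a short list of "active" weight indices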
88 | 89 | del row['Label'] # can't let the model peek the answer 90 | del row['Id'] # we don't need the Id 91 | 92 | # main training procedure 93 | # step 1, get the hashed features 94 | x = get_x(row, D) 95 | 96 | # step 2, get prediction 97 | p = get_p(x, w) 98 | 99 | # for progress validation, useless for learning our model 100 | loss += logloss(p, y) 101 | if t % 1000000 == 0 and t > 1: 102 | print('%s\tencountered: %d\tcurrent logloss: %f' % ( 103 | datetime.now(), t, loss/t)) 104 | 105 | # step 3, update model with answer 106 | w, n = update_w(w, n, x, p, y) 107 | 108 | # testing (build kaggle's submission file) 109 | with open('submissionPython.csv', 'w') as submission: 110 | submission.write('Id,Predicted\n') 111 | for t, row in enumerate(DictReader(open(test))): 112 | Id = row['Id'] 113 | del row['Id'] 114 | x = get_x(row, D) 115 | p = get_p(x, w) 116 | submission.write('%s,%f\n' % (Id, p)) -------------------------------------------------------------------------------- /py_lh2.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from csv import DictReader 3 | from math import exp, log, sqrt 4 | 5 | 6 | # parameters ################################################################# 7 | 8 | train = 'train.csv' # path to training file 9 | test = 'test.csv' # path to testing file 10 | 11 | D = 2 ** 26 # number of weights use for learning 12 | alpha = .1 # learning rate for sgd optimization 13 | 14 | 15 | # function definitions ####################################################### 16 | 17 | # A. Bounded logloss 18 | # INPUT: 19 | # p: our prediction 20 | # y: real answer 21 | # OUTPUT 22 | # logarithmic loss of p given y 23 | def logloss(p, y): 24 | p = max(min(p, 1. - 10e-12), 10e-12) 25 | return -log(p) if y == 1. else -log(1. - p) 26 | 27 | 28 | # B. Apply hash trick of the original csv row 29 | # for simplicity, we treat both integer and categorical features as categorical 30 | # INPUT: 31 | # csv_row: a csv dictionary, ex: {'Lable': '1', 'I1': '357', 'I2': '', ...} 32 | # D: the max index that we can hash to 33 | # OUTPUT: 34 | # x: a list of indices that its value is 1 35 | def get_x(csv_row, D): 36 | x = [0] # 0 is the index of the bias term 37 | for key, value in csv_row.items(): 38 | index = int(value + key[1:], 16) % D # weakest hash ever ;) 39 | x.append(index) 40 | return x # x contains indices of features that have a value of 1 41 | 42 | 43 | # C. Get probability estimation on x 44 | # INPUT: 45 | # x: features 46 | # w: weights 47 | # OUTPUT: 48 | # probability of p(y = 1 | x; w) 49 | def get_p(x, w): 50 | wTx = 0. 51 | for i in x: # do wTx 52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1. 53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid 54 | 55 | 56 | # D. Update given model 57 | # INPUT: 58 | # w: weights 59 | # n: a counter that counts the number of times we encounter a feature 60 | # this is used for adaptive learning rate 61 | # x: feature 62 | # p: prediction of our model 63 | # y: answer 64 | # OUTPUT: 65 | # w: updated model 66 | # n: updated count 67 | def update_w(w, n, x, p, y): 68 | for i in x: 69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic 70 | # (p - y) * x[i] is the current gradient 71 | # note that in our case, if i in x then x[i] = 1 72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.) 73 | n[i] += 1. 
74 | 75 | return w, n 76 | 77 | 78 | # training and testing ####################################################### 79 | 80 | # initialize our model 81 | w = [0.] * D # weights 82 | n = [0.] * D # number of times we've encountered a feature 83 | 84 | # start training a logistic regression model using on pass sgd 85 | loss = 0. 86 | for t, row in enumerate(DictReader(open(train))): 87 | y = 1. if row['Label'] == '1' else 0. 88 | 89 | del row['Label'] # can't let the model peek the answer 90 | del row['Id'] # we don't need the Id 91 | 92 | # main training procedure 93 | # step 1, get the hashed features 94 | x = get_x(row, D) 95 | 96 | # step 2, get prediction 97 | p = get_p(x, w) 98 | 99 | # for progress validation, useless for learning our model 100 | loss += logloss(p, y) 101 | if t % 1000000 == 0 and t > 1: 102 | print('%s\tencountered: %d\tcurrent logloss: %f' % ( 103 | datetime.now(), t, loss/t)) 104 | 105 | # step 3, update model with answer 106 | w, n = update_w(w, n, x, p, y) 107 | 108 | # testing (build kaggle's submission file) 109 | with open('submissionPython2.csv', 'w') as submission: 110 | submission.write('Id,Predicted\n') 111 | for t, row in enumerate(DictReader(open(test))): 112 | Id = row['Id'] 113 | del row['Id'] 114 | x = get_x(row, D) 115 | p = get_p(x, w) 116 | submission.write('%s,%f\n' % (Id, p)) -------------------------------------------------------------------------------- /py_lh3.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from csv import DictReader 3 | from math import exp, log, sqrt 4 | 5 | 6 | # parameters ################################################################# 7 | 8 | train = 'train.csv' # path to training file 9 | test = 'test.csv' # path to testing file 10 | 11 | D = 2 ** 25 # number of weights use for learning 12 | alpha = .15 # learning rate for sgd optimization 13 | 14 | 15 | # function definitions ####################################################### 16 | 17 | # A. Bounded logloss 18 | # INPUT: 19 | # p: our prediction 20 | # y: real answer 21 | # OUTPUT 22 | # logarithmic loss of p given y 23 | def logloss(p, y): 24 | p = max(min(p, 1. - 10e-12), 10e-12) 25 | return -log(p) if y == 1. else -log(1. - p) 26 | 27 | 28 | # B. Apply hash trick of the original csv row 29 | # for simplicity, we treat both integer and categorical features as categorical 30 | # INPUT: 31 | # csv_row: a csv dictionary, ex: {'Lable': '1', 'I1': '357', 'I2': '', ...} 32 | # D: the max index that we can hash to 33 | # OUTPUT: 34 | # x: a list of indices that its value is 1 35 | def get_x(csv_row, D): 36 | x = [0] # 0 is the index of the bias term 37 | for key, value in csv_row.items(): 38 | index = int(value + key[1:], 16) % D # weakest hash ever ;) 39 | x.append(index) 40 | return x # x contains indices of features that have a value of 1 41 | 42 | 43 | # C. Get probability estimation on x 44 | # INPUT: 45 | # x: features 46 | # w: weights 47 | # OUTPUT: 48 | # probability of p(y = 1 | x; w) 49 | def get_p(x, w): 50 | wTx = 0. 51 | for i in x: # do wTx 52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1. 53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid 54 | 55 | 56 | # D. 
Update given model 57 | # INPUT: 58 | # w: weights 59 | # n: a counter that counts the number of times we encounter a feature 60 | # this is used for adaptive learning rate 61 | # x: feature 62 | # p: prediction of our model 63 | # y: answer 64 | # OUTPUT: 65 | # w: updated model 66 | # n: updated count 67 | def update_w(w, n, x, p, y): 68 | for i in x: 69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic 70 | # (p - y) * x[i] is the current gradient 71 | # note that in our case, if i in x then x[i] = 1 72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.) 73 | n[i] += 1. 74 | 75 | return w, n 76 | 77 | 78 | # training and testing ####################################################### 79 | 80 | # initialize our model 81 | w = [0.] * D # weights 82 | n = [0.] * D # number of times we've encountered a feature 83 | 84 | # start training a logistic regression model using on pass sgd 85 | loss = 0. 86 | for t, row in enumerate(DictReader(open(train))): 87 | y = 1. if row['Label'] == '1' else 0. 88 | 89 | del row['Label'] # can't let the model peek the answer 90 | del row['Id'] # we don't need the Id 91 | 92 | # main training procedure 93 | # step 1, get the hashed features 94 | x = get_x(row, D) 95 | 96 | # step 2, get prediction 97 | p = get_p(x, w) 98 | 99 | # for progress validation, useless for learning our model 100 | loss += logloss(p, y) 101 | if t % 1000000 == 0 and t > 1: 102 | print('%s\tencountered: %d\tcurrent logloss: %f' % ( 103 | datetime.now(), t, loss/t)) 104 | 105 | # step 3, update model with answer 106 | w, n = update_w(w, n, x, p, y) 107 | 108 | # testing (build kaggle's submission file) 109 | with open('submissionPython4.csv', 'w') as submission: 110 | submission.write('Id,Predicted\n') 111 | for t, row in enumerate(DictReader(open(test))): 112 | Id = row['Id'] 113 | del row['Id'] 114 | x = get_x(row, D) 115 | p = get_p(x, w) 116 | submission.write('%s,%f\n' % (Id, p)) -------------------------------------------------------------------------------- /py_lh4.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from csv import DictReader 3 | from math import exp, log, sqrt 4 | 5 | 6 | # parameters ################################################################# 7 | 8 | train = 'train.csv' # path to training file 9 | test = 'test.csv' # path to testing file 10 | 11 | D = 2 ** 27 # number of weights use for learning 12 | alpha = .145 # learning rate for sgd optimization 13 | 14 | 15 | # function definitions ####################################################### 16 | 17 | # A. Bounded logloss 18 | # INPUT: 19 | # p: our prediction 20 | # y: real answer 21 | # OUTPUT 22 | # logarithmic loss of p given y 23 | def logloss(p, y): 24 | p = max(min(p, 1. - 10e-12), 10e-12) 25 | return -log(p) if y == 1. else -log(1. - p) 26 | 27 | 28 | # B. Apply hash trick of the original csv row 29 | # for simplicity, we treat both integer and categorical features as categorical 30 | # INPUT: 31 | # csv_row: a csv dictionary, ex: {'Lable': '1', 'I1': '357', 'I2': '', ...} 32 | # D: the max index that we can hash to 33 | # OUTPUT: 34 | # x: a list of indices that its value is 1 35 | def get_x(csv_row, D): 36 | x = [0] # 0 is the index of the bias term 37 | for key, value in csv_row.items(): 38 | index = int(value + key[1:], 16) % D # weakest hash ever ;) 39 | x.append(index) 40 | return x # x contains indices of features that have a value of 1 41 | 42 | 43 | # C. 
Get probability estimation on x 44 | # INPUT: 45 | # x: features 46 | # w: weights 47 | # OUTPUT: 48 | # probability of p(y = 1 | x; w) 49 | def get_p(x, w): 50 | wTx = 0. 51 | for i in x: # do wTx 52 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1. 53 | return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid 54 | 55 | 56 | # D. Update given model 57 | # INPUT: 58 | # w: weights 59 | # n: a counter that counts the number of times we encounter a feature 60 | # this is used for adaptive learning rate 61 | # x: feature 62 | # p: prediction of our model 63 | # y: answer 64 | # OUTPUT: 65 | # w: updated model 66 | # n: updated count 67 | def update_w(w, n, x, p, y): 68 | for i in x: 69 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic 70 | # (p - y) * x[i] is the current gradient 71 | # note that in our case, if i in x then x[i] = 1 72 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.) 73 | n[i] += 1. 74 | 75 | return w, n 76 | 77 | 78 | # training and testing ####################################################### 79 | 80 | # initialize our model 81 | w = [0.] * D # weights 82 | n = [0.] * D # number of times we've encountered a feature 83 | 84 | # start training a logistic regression model using one-pass SGD 85 | loss = 0. 86 | for t, row in enumerate(DictReader(open(train))): 87 | y = 1. if row['Label'] == '1' else 0. 88 | 89 | del row['Label'] # can't let the model peek the answer 90 | del row['Id'] # we don't need the Id 91 | 92 | # main training procedure 93 | # step 1, get the hashed features 94 | x = get_x(row, D) 95 | 96 | # step 2, get prediction 97 | p = get_p(x, w) 98 | 99 | # for progress validation, useless for learning our model 100 | loss += logloss(p, y) 101 | if t % 1000000 == 0 and t > 1: 102 | print('%s\tencountered: %d\tcurrent logloss: %f' % ( 103 | datetime.now(), t, loss/t)) 104 | 105 | # step 3, update model with answer 106 | w, n = update_w(w, n, x, p, y) 107 | 108 | # testing (build kaggle's submission file) 109 | with open('submissionPython5.csv', 'w') as submission: 110 | submission.write('Id,Predicted\n') 111 | for t, row in enumerate(DictReader(open(test))): 112 | Id = row['Id'] 113 | del row['Id'] 114 | x = get_x(row, D) 115 | p = get_p(x, w) 116 | submission.write('%s,%f\n' % (Id, p)) -------------------------------------------------------------------------------- /logReg_click.py: -------------------------------------------------------------------------------- 1 | from numpy import * 2 | import matplotlib.pyplot as plt 3 | import time 4 | 5 | 6 | # calculate the sigmoid function 7 | def sigmoid(inX): 8 | return 1.0 / (1 + exp(-inX)) 9 | 10 | # train a logistic regression model using a selectable optimization algorithm 11 | # input: train_x is a mat datatype, each row stands for one sample 12 | # train_y is a mat datatype too, each row is the corresponding label 13 | # opts holds the optimizer options, including the step size and the maximum number of iterations 14 | def trainLogRegres(train_x, train_y, opts): 15 | # calculate training time 16 | startTime = time.time() 17 | 18 | numSamples, numFeatures = shape(train_x) 19 | alpha = opts['alpha']; maxIter = opts['maxIter'] 20 | weights = ones((numFeatures, 1)) 21 | 22 | # optimize through a gradient descent algorithm 23 | for k in range(maxIter): 24 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorithm 25 | output = sigmoid(train_x * weights) 26 | error = train_y - output 27 | weights = weights + alpha * train_x.transpose() * error 28 | elif opts['optimizeType'] == 'stocGradDescent': #
stochastic gradient descent 29 | for i in range(numSamples): 30 | output = sigmoid(train_x[i, :] * weights) 31 | error = train_y[i, 0] - output 32 | weights = weights + alpha * train_x[i, :].transpose() * error 33 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent 34 | # randomly select samples to optimize for reducing cycle fluctuations 35 | dataIndex = list(range(numSamples)) # a list, so the sampled index can be removed below 36 | for i in range(numSamples): 37 | alpha = 4.0 / (1.0 + k + i) + 0.01 38 | randIndex = int(random.uniform(0, len(dataIndex))) 39 | output = sigmoid(train_x[randIndex, :] * weights) 40 | error = train_y[randIndex, 0] - output 41 | weights = weights + alpha * train_x[randIndex, :].transpose() * error 42 | del(dataIndex[randIndex]) # during one iteration, delete the optimized sample 43 | else: 44 | raise NameError('Unsupported optimization method type!') 45 | 46 | 47 | print ('Congratulations, training complete! Took %fs!' % (time.time() - startTime)) 48 | return weights 49 | 50 | # test your trained Logistic Regression model given test set 51 | def testLogRegres(weights, test_x, test_y): 52 | numSamples, numFeatures = shape(test_x) 53 | matchCount = 0 54 | for i in range(numSamples): 55 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5 56 | if predict == bool(test_y[i, 0]): 57 | matchCount += 1 58 | accuracy = float(matchCount) / numSamples 59 | return accuracy 60 | 61 | def loadData(): 62 | train_x = [] 63 | train_y = [] 64 | fileIn = open('train.csv') 65 | for line in fileIn.readlines(): 66 | lineArr = line.strip().split() 67 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])]) 68 | train_y.append(float(lineArr[2])) 69 | return mat(train_x), mat(train_y).transpose() 70 | 71 | ############################################################################### 72 | ## step 1: load data 73 | print ("step 1: load data...") 74 | train_x, train_y = loadData() 75 | test_x = train_x; test_y = train_y 76 | 77 | ## step 2: training...
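# A minimal sanity-check sketch for trainLogRegres before the full run (toy_x, toy_y and the option values below are illustrative, not part of this project's data):
#   toy_x = mat([[1.0, 0.1, 0.2], [1.0, 0.8, 0.9]])   # bias column plus two features per sample
#   toy_y = mat([[0.0], [1.0]])                        # one label per sample
#   toy_w = trainLogRegres(toy_x, toy_y, {'alpha': 0.01, 'maxIter': 200, 'optimizeType': 'gradDescent'})
#   print(testLogRegres(toy_w, toy_x, toy_y))          # prints the accuracy on the toy set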
78 | print ("step 2: training...") 79 | opts = {'alpha': 0.01, 'maxIter': 2 ** 27, 'optimizeType': 'smoothStocGradDescent'} 80 | optimalWeights = trainLogRegres(train_x, train_y, opts) 81 | 82 | ## step 3: testing 83 | print ("step 3: testing...") 84 | accuracy = testLogRegres(optimalWeights, test_x, test_y) 85 | 86 | ## step 4: show the result 87 | print ("step 4: show the result...") 88 | print ('The classify accuracy is: %.3f%%' % (accuracy * 100)) -------------------------------------------------------------------------------- /pythonSGD/logReg_click.py: -------------------------------------------------------------------------------- 1 | from numpy import * 2 | import matplotlib.pyplot as plt 3 | import time 4 | 5 | 6 | # calculate the sigmoid function 7 | def sigmoid(inX): 8 | return 1.0 / (1 + exp(-inX)) 9 | 10 | # train a logistic regression model using some optional optimize algorithm 11 | # input: train_x is a mat datatype, each row stands for one sample 12 | # train_y is mat datatype too, each row is the corresponding label 13 | # opts is optimize option include step and maximum number of iterations 14 | def trainLogRegres(train_x, train_y, opts): 15 | # calculate training time 16 | startTime = time.time() 17 | 18 | numSamples, numFeatures = shape(train_x) 19 | alpha = opts['alpha']; maxIter = opts['maxIter'] 20 | weights = ones((numFeatures, 1)) 21 | 22 | # optimize through gradient descent algorilthm 23 | for k in range(maxIter): 24 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorilthm 25 | output = sigmoid(train_x * weights) 26 | error = train_y - output 27 | weights = weights + alpha * train_x.transpose() * error 28 | elif opts['optimizeType'] == 'stocGradDescent': # stochastic gradient descent 29 | for i in range(numSamples): 30 | output = sigmoid(train_x[i, :] * weights) 31 | error = train_y[i, 0] - output 32 | weights = weights + alpha * train_x[i, :].transpose() * error 33 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent 34 | # randomly select samples to optimize for reducing cycle fluctuations 35 | dataIndex = range(numSamples) 36 | for i in range(numSamples): 37 | alpha = 4.0 / (1.0 + k + i) + 0.01 38 | randIndex = int(random.uniform(0, len(dataIndex))) 39 | output = sigmoid(train_x[randIndex, :] * weights) 40 | error = train_y[randIndex, 0] - output 41 | weights = weights + alpha * train_x[randIndex, :].transpose() * error 42 | del(dataIndex[randIndex]) # during one interation, delete the optimized sample 43 | else: 44 | raise NameError('Not support optimize method type!') 45 | 46 | 47 | print ('Congratulations, training complete! Took %fs!' 
% (time.time() - startTime)) 48 | return weights 49 | 50 | # test your trained Logistic Regression model given test set 51 | def testLogRegres(weights, test_x, test_y): 52 | numSamples, numFeatures = shape(test_x) 53 | matchCount = 0 54 | for i in xrange(numSamples): 55 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5 56 | if predict == bool(test_y[i, 0]): 57 | matchCount += 1 58 | accuracy = float(matchCount) / numSamples 59 | return accuracy 60 | 61 | def loadData(): 62 | train_x = [] 63 | train_y = [] 64 | fileIn = open('train.csv') 65 | for line in fileIn.readlines(): 66 | lineArr = line.strip().split() 67 | train_x.append([1.0, float(lineArr[0]), float(lineArr[1])]) 68 | train_y.append(float(lineArr[2])) 69 | return mat(train_x), mat(train_y).transpose() 70 | 71 | ############################################################################### 72 | ## step 1: load data 73 | print ("step 1: load data...") 74 | train_x, train_y = loadData() 75 | test_x = train_x; test_y = train_y 76 | 77 | ## step 2: training... 78 | print ("step 2: training...") 79 | opts = {'alpha': 0.01, 'maxIter': 2 ** 27, 'optimizeType': 'smoothStocGradDescent'} 80 | optimalWeights = trainLogRegres(train_x, train_y, opts) 81 | 82 | ## step 3: testing 83 | print ("step 3: testing...") 84 | accuracy = testLogRegres(optimalWeights, test_x, test_y) 85 | 86 | ## step 4: show the result 87 | print ("step 4: show the result...") 88 | print ('The classify accuracy is: %.3f%%' % (accuracy * 100)) -------------------------------------------------------------------------------- /pythonSGD/py_lh_22Sep2014_2.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from csv import DictReader 3 | from math import exp, log, sqrt 4 | 5 | # parameters ################################################################# 6 | 7 | train = 'train.csv' # path to training file 8 | test = 'test.csv' # path to testing file 9 | 10 | D = 2 ** 28 # number of weights use for learning 11 | alpha = .145 # learning rate for sgd optimization 12 | 13 | 14 | # function definitions ####################################################### 15 | 16 | # A. Bounded logloss 17 | # INPUT: 18 | # p: our prediction 19 | # y: real answer 20 | # OUTPUT 21 | # logarithmic loss of p given y 22 | def logloss(p, y): 23 | p = max(min(p, 1. - 10e-12), 10e-12) 24 | return -log(p) if y == 1. else -log(1. - p) 25 | 26 | # B. Apply hash trick of the original csv row 27 | # for simplicity, we treat both integer and categorical features as categorical 28 | # INPUT: 29 | # csv_row: a csv dictionary, ex: {'Lable': '1', 'I1': '357', 'I2': '', ...} 30 | # D: the max index that we can hash to 31 | # OUTPUT: 32 | # x: a list of indices that its value is 1 33 | def get_x(csv_row, D): 34 | x = [0] # 0 is the index of the bias term 35 | for key, value in csv_row.items(): 36 | index = int(value + key[1:], 32) % D # weakest hash ever ;) 37 | x.append(index) 38 | return x # x contains indices of features that have a value of 1 39 | 40 | 41 | # C. Get probability estimation on x 42 | # INPUT: 43 | # x: features 44 | # w: weights 45 | # OUTPUT: 46 | # probability of p(y = 1 | x; w) 47 | def get_p(x, w): 48 | wTx = 0. 49 | for i in x: # do wTx 50 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1. 51 | return 1. / (1. + exp(-max(min(wTx, 10.), -10.))) # bounded sigmoid 52 | 53 | 54 | # D. 
Update given model 55 | # INPUT: 56 | # w: weights 57 | # n: a counter that counts the number of times we encounter a feature 58 | # this is used for adaptive learning rate 59 | # x: feature 60 | # p: prediction of our model 61 | # y: answer 62 | # OUTPUT: 63 | # w: updated model 64 | # n: updated count 65 | def update_w(w, n, x, p, y): 66 | for i in x: 67 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic 68 | # (p - y) * x[i] is the current gradient 69 | # note that in our case, if i in x then x[i] = 1 70 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.) 71 | n[i] += 1. 72 | return w, n 73 | 74 | 75 | # training and testing ####################################################### 76 | 77 | # initialize our model 78 | w = [0.] * D # weights 79 | n = [0.] * D # number of times we've encountered a feature 80 | 81 | # start training a logistic regression model using on pass sgd 82 | loss = 0. 83 | for t, row in enumerate(DictReader(open(train))): 84 | y = 1. if row['Label'] == '1' else 0. 85 | 86 | del row['Label'] # can't let the model peek the answer 87 | del row['Id'] # we don't need the Id 88 | 89 | # main training procedure 90 | # step 1, get the hashed features 91 | x = get_x(row, D) 92 | 93 | # step 2, get prediction 94 | p = get_p(x, w) 95 | 96 | # for progress validation, useless for learning our model 97 | loss += logloss(p, y) 98 | if t % 100000 == 0 and t > 1: 99 | print('%s\tencountered: %d\tcurrent logloss: %f' % ( 100 | datetime.now(), t, loss/t)) 101 | 102 | # step 3, update model with answer 103 | if t <= 40000000: 104 | w, n = update_w(w, n, x, p, y) 105 | 106 | # testing (build kaggle's submission file) 107 | with open('submissionPython_22Sep2014.csv', 'w') as submission: 108 | submission.write('Id,Predicted\n') 109 | for t, row in enumerate(DictReader(open(test))): 110 | Id = row['Id'] 111 | del row['Id'] 112 | x = get_x(row, D) 113 | p = get_p(x, w) 114 | submission.write('%s,%f\n' % (Id, p)) -------------------------------------------------------------------------------- /logReg.py: -------------------------------------------------------------------------------- 1 | from numpy import * 2 | import matplotlib.pyplot as plt 3 | import time 4 | 5 | 6 | # calculate the sigmoid function 7 | def sigmoid(inX): 8 | return 1.0 / (1 + exp(-inX)) 9 | 10 | 11 | # train a logistic regression model using some optional optimize algorithm 12 | # input: train_x is a mat datatype, each row stands for one sample 13 | # train_y is mat datatype too, each row is the corresponding label 14 | # opts is optimize option include step and maximum number of iterations 15 | def trainLogRegres(train_x, train_y, opts): 16 | # calculate training time 17 | startTime = time.time() 18 | 19 | numSamples, numFeatures = shape(train_x) 20 | alpha = opts['alpha']; maxIter = opts['maxIter'] 21 | weights = ones((numFeatures, 1)) 22 | 23 | # optimize through gradient descent algorilthm 24 | for k in range(maxIter): 25 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorilthm 26 | output = sigmoid(train_x * weights) 27 | error = train_y - output 28 | weights = weights + alpha * train_x.transpose() * error 29 | elif opts['optimizeType'] == 'stocGradDescent': # stochastic gradient descent 30 | for i in range(numSamples): 31 | output = sigmoid(train_x[i, :] * weights) 32 | error = train_y[i, 0] - output 33 | weights = weights + alpha * train_x[i, :].transpose() * error 34 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent 35 | # randomly select 
samples to optimize for reducing cycle fluctuations 36 | dataIndex = range(numSamples) 37 | for i in range(numSamples): 38 | alpha = 4.0 / (1.0 + k + i) + 0.01 39 | randIndex = int(random.uniform(0, len(dataIndex))) 40 | output = sigmoid(train_x[randIndex, :] * weights) 41 | error = train_y[randIndex, 0] - output 42 | weights = weights + alpha * train_x[randIndex, :].transpose() * error 43 | del(dataIndex[randIndex]) # during one interation, delete the optimized sample 44 | else: 45 | raise NameError('Not support optimize method type!') 46 | 47 | 48 | print 'Congratulations, training complete! Took %fs!' % (time.time() - startTime) 49 | return weights 50 | 51 | 52 | # test your trained Logistic Regression model given test set 53 | def testLogRegres(weights, test_x, test_y): 54 | numSamples, numFeatures = shape(test_x) 55 | matchCount = 0 56 | for i in xrange(numSamples): 57 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5 58 | if predict == bool(test_y[i, 0]): 59 | matchCount += 1 60 | accuracy = float(matchCount) / numSamples 61 | return accuracy 62 | 63 | 64 | # show your trained logistic regression model only available with 2-D data 65 | def showLogRegres(weights, train_x, train_y): 66 | # notice: train_x and train_y is mat datatype 67 | numSamples, numFeatures = shape(train_x) 68 | if numFeatures != 3: 69 | print "Sorry! I can not draw because the dimension of your data is not 2!" 70 | return 1 71 | 72 | # draw all samples 73 | for i in xrange(numSamples): 74 | if int(train_y[i, 0]) == 0: 75 | plt.plot(train_x[i, 1], train_x[i, 2], 'or') 76 | elif int(train_y[i, 0]) == 1: 77 | plt.plot(train_x[i, 1], train_x[i, 2], 'ob') 78 | 79 | # draw the classify line 80 | min_x = min(train_x[:, 1])[0, 0] 81 | max_x = max(train_x[:, 1])[0, 0] 82 | weights = weights.getA() # convert mat to array 83 | y_min_x = float(-weights[0] - weights[1] * min_x) / weights[2] 84 | y_max_x = float(-weights[0] - weights[1] * max_x) / weights[2] 85 | plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g') 86 | plt.xlabel('X1'); plt.ylabel('X2') 87 | plt.show() 88 | 89 | -------------------------------------------------------------------------------- /pythonSGD/logReg.py: -------------------------------------------------------------------------------- 1 | from numpy import * 2 | import matplotlib.pyplot as plt 3 | import time 4 | 5 | 6 | # calculate the sigmoid function 7 | def sigmoid(inX): 8 | return 1.0 / (1 + exp(-inX)) 9 | 10 | 11 | # train a logistic regression model using some optional optimize algorithm 12 | # input: train_x is a mat datatype, each row stands for one sample 13 | # train_y is mat datatype too, each row is the corresponding label 14 | # opts is optimize option include step and maximum number of iterations 15 | def trainLogRegres(train_x, train_y, opts): 16 | # calculate training time 17 | startTime = time.time() 18 | 19 | numSamples, numFeatures = shape(train_x) 20 | alpha = opts['alpha']; maxIter = opts['maxIter'] 21 | weights = ones((numFeatures, 1)) 22 | 23 | # optimize through gradient descent algorilthm 24 | for k in range(maxIter): 25 | if opts['optimizeType'] == 'gradDescent': # gradient descent algorilthm 26 | output = sigmoid(train_x * weights) 27 | error = train_y - output 28 | weights = weights + alpha * train_x.transpose() * error 29 | elif opts['optimizeType'] == 'stocGradDescent': # stochastic gradient descent 30 | for i in range(numSamples): 31 | output = sigmoid(train_x[i, :] * weights) 32 | error = train_y[i, 0] - output 33 | weights = weights + alpha * train_x[i, 
:].transpose() * error 34 | elif opts['optimizeType'] == 'smoothStocGradDescent': # smooth stochastic gradient descent 35 | # randomly select samples to optimize for reducing cycle fluctuations 36 | dataIndex = range(numSamples) 37 | for i in range(numSamples): 38 | alpha = 4.0 / (1.0 + k + i) + 0.01 39 | randIndex = int(random.uniform(0, len(dataIndex))) 40 | output = sigmoid(train_x[randIndex, :] * weights) 41 | error = train_y[randIndex, 0] - output 42 | weights = weights + alpha * train_x[randIndex, :].transpose() * error 43 | del(dataIndex[randIndex]) # during one interation, delete the optimized sample 44 | else: 45 | raise NameError('Not support optimize method type!') 46 | 47 | 48 | print 'Congratulations, training complete! Took %fs!' % (time.time() - startTime) 49 | return weights 50 | 51 | 52 | # test your trained Logistic Regression model given test set 53 | def testLogRegres(weights, test_x, test_y): 54 | numSamples, numFeatures = shape(test_x) 55 | matchCount = 0 56 | for i in xrange(numSamples): 57 | predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5 58 | if predict == bool(test_y[i, 0]): 59 | matchCount += 1 60 | accuracy = float(matchCount) / numSamples 61 | return accuracy 62 | 63 | 64 | # show your trained logistic regression model only available with 2-D data 65 | def showLogRegres(weights, train_x, train_y): 66 | # notice: train_x and train_y is mat datatype 67 | numSamples, numFeatures = shape(train_x) 68 | if numFeatures != 3: 69 | print "Sorry! I can not draw because the dimension of your data is not 2!" 70 | return 1 71 | 72 | # draw all samples 73 | for i in xrange(numSamples): 74 | if int(train_y[i, 0]) == 0: 75 | plt.plot(train_x[i, 1], train_x[i, 2], 'or') 76 | elif int(train_y[i, 0]) == 1: 77 | plt.plot(train_x[i, 1], train_x[i, 2], 'ob') 78 | 79 | # draw the classify line 80 | min_x = min(train_x[:, 1])[0, 0] 81 | max_x = max(train_x[:, 1])[0, 0] 82 | weights = weights.getA() # convert mat to array 83 | y_min_x = float(-weights[0] - weights[1] * min_x) / weights[2] 84 | y_max_x = float(-weights[0] - weights[1] * max_x) / weights[2] 85 | plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g') 86 | plt.xlabel('X1'); plt.ylabel('X2') 87 | plt.show() 88 | 89 | -------------------------------------------------------------------------------- /pythonSGD/py_lh4_22Sep2014.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from csv import DictReader 3 | from math import exp, log, sqrt 4 | import mmh3 5 | 6 | # parameters ################################################################# 7 | 8 | train = 'train.csv' # path to training file 9 | test = 'test.csv' # path to testing file 10 | 11 | D = 2 ** 28 # number of weights use for learning 12 | alpha = .145 # learning rate for sgd optimization 13 | 14 | 15 | # function definitions ####################################################### 16 | 17 | # A. Bounded logloss 18 | # INPUT: 19 | # p: our prediction 20 | # y: real answer 21 | # OUTPUT 22 | # logarithmic loss of p given y 23 | def logloss(p, y): 24 | p = max(min(p, 1. - 10e-15), 10e-15) 25 | return -log(p) if y == 1. else -log(1. - p) 26 | 27 | 28 | # B. 
Apply hash trick of the original csv row 29 | # for simplicity, we treat both integer and categorical features as categorical 30 | # INPUT: 31 | # csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...} 32 | # D: the max index that we can hash to 33 | # OUTPUT: 34 | # x: a list of indices whose value is 1 35 | def get_x(csv_row, D): 36 | x = [0] # 0 is the index of the bias term 37 | for key, value in csv_row.items(): 38 | index = mmh3.hash((value + key[1:]),42) % D # MurmurHash3 with seed 42 39 | x.append(index) 40 | return x # x contains indices of features that have a value of 1 41 | 42 | 43 | # C. Get probability estimation on x 44 | # INPUT: 45 | # x: features 46 | # w: weights 47 | # OUTPUT: 48 | # probability of p(y = 1 | x; w) 49 | def get_p(x, w, ld): 50 | ld = ld + 0.001 51 | wTx = 0. 52 | for i in x: # do wTx 53 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1. 54 | return (1.+ld) / (1. + exp(-max(min(wTx, 20.), -20.))) # bounded sigmoid 55 | 56 | 57 | # D. Update given model 58 | # INPUT: 59 | # w: weights 60 | # n: a counter that counts the number of times we encounter a feature 61 | # this is used for adaptive learning rate 62 | # x: feature 63 | # p: prediction of our model 64 | # y: answer 65 | # OUTPUT: 66 | # w: updated model 67 | # n: updated count 68 | def update_w(w, n, x, p, y): 69 | for i in x: 70 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic 71 | # (p - y) * x[i] is the current gradient 72 | # note that in our case, if i in x then x[i] = 1 73 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.) 74 | n[i] += 1. 75 | 76 | return w, n 77 | 78 | 79 | # training and testing ####################################################### 80 | 81 | # initialize our model 82 | w = [0.] * D # weights 83 | n = [0.] * D # number of times we've encountered a feature 84 | 85 | # start training a logistic regression model using one-pass SGD 86 | loss = 0. 87 | ld = 0.001 88 | for t, row in enumerate(DictReader(open(train))): 89 | 90 | y = 1. if row['Label'] == '1' else 0.
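# For intuition (illustrative row, not taken from the data): a row such as {'I1': '5', 'C1': '68fd1e64'} is turned by get_x into x = [0, h1, h2], where each h = mmh3.hash(value + key[1:], 42) % D, i.e. one hashed weight index per field plus index 0 for the bias term.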
91 | 92 | del row['Label'] # can't let the model peek the answer 93 | del row['Id'] # we don't need the Id 94 | 95 | # main training procedure 96 | # step 1, get the hashed features 97 | x = get_x(row, D) 98 | 99 | # step 2, get prediction 100 | p = get_p(x, w, ld) 101 | 102 | # for progress validation, useless for learning our model 103 | loss += logloss(p, y) 104 | if t % 1000000 == 0 and t > 1: 105 | print('%s\tencountered: %d\tcurrent logloss: %f' % ( 106 | datetime.now(), t, loss/t)) 107 | 108 | # step 3, update model with answer 109 | w, n = update_w(w, n, x, p, y) 110 | 111 | # testing (build kaggle's submission file) 112 | with open('submissionPython22Sep2014_pm.csv', 'w') as submission: 113 | submission.write('Id,Predicted\n') 114 | for t, row in enumerate(DictReader(open(test))): 115 | Id = row['Id'] 116 | del row['Id'] 117 | x = get_x(row, D) 118 | p = get_p(x, w, ld) 119 | submission.write('%s,%f\n' % (Id, p)) -------------------------------------------------------------------------------- /py_lh_20Sep2014.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from csv import DictReader 3 | from math import exp, log, sqrt 4 | import scipy as sp 5 | 6 | # parameters ################################################################# 7 | 8 | train = 'train.csv' # path to training file 9 | test = 'test.csv' # path to testing file 10 | 11 | D = 2 ** 27 # number of weights used for learning 12 | alpha = .15 # learning rate for sgd optimization 13 | 14 | 15 | # function definitions ####################################################### 16 | 17 | # A. Bounded logloss 18 | # INPUT: 19 | # p: our prediction 20 | # y: real answer 21 | # OUTPUT 22 | # logarithmic loss of p given y 23 | def logloss(p, y): 24 | epsilon = 1e-15 25 | p = sp.maximum(epsilon, p) 26 | p = sp.minimum(1-epsilon, p) 27 | ll = y*sp.log(p) + sp.subtract(1,y)*sp.log(sp.subtract(1,p)) # p and y are scalars here 28 | ll = ll * -1.0 29 | return ll 30 | 31 | # B. Apply hash trick of the original csv row 32 | # for simplicity, we treat both integer and categorical features as categorical 33 | # INPUT: 34 | # csv_row: a csv dictionary, ex: {'Label': '1', 'I1': '357', 'I2': '', ...} 35 | # D: the max index that we can hash to 36 | # OUTPUT: 37 | # x: a list of indices whose value is 1 38 | def get_x(csv_row, D): 39 | x = [0] # 0 is the index of the bias term 40 | for key, value in csv_row.items(): 41 | index = int(value + key[1:], 16) % D # weakest hash ever ;) 42 | x.append(index) 43 | return x # x contains indices of features that have a value of 1 44 | 45 | 46 | # C. Get probability estimation on x 47 | # INPUT: 48 | # x: features 49 | # w: weights 50 | # OUTPUT: 51 | # probability of p(y = 1 | x; w) 52 | def get_p(x, w): 53 | wTx = 0. 54 | for i in x: # do wTx 55 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1. 56 | return 1. / (1. + exp(-max(min(wTx, 10.), -10.))) # bounded sigmoid 57 | 58 | 59 | # D.
Update given model 60 | # INPUT: 61 | # w: weights 62 | # n: a counter that counts the number of times we encounter a feature 63 | # this is used for adaptive learning rate 64 | # x: feature 65 | # p: prediction of our model 66 | # y: answer 67 | # OUTPUT: 68 | # w: updated model 69 | # n: updated count 70 | def update_w(w, n, x, p, y): 71 | for i in x: 72 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic 73 | # (p - y) * x[i] is the current gradient 74 | # note that in our case, if i in x then x[i] = 1 75 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.) 76 | n[i] += 1. 77 | 78 | return w, n 79 | 80 | 81 | # training and testing ####################################################### 82 | 83 | # initialize our model 84 | w = [0.] * D # weights 85 | n = [0.] * D # number of times we've encountered a feature 86 | 87 | # start training a logistic regression model using on pass sgd 88 | loss = 0. 89 | for t, row in enumerate(DictReader(open(train))): 90 | y = 1. if row['Label'] == '1' else 0. 91 | 92 | del row['Label'] # can't let the model peek the answer 93 | del row['Id'] # we don't need the Id 94 | 95 | # main training procedure 96 | # step 1, get the hashed features 97 | x = get_x(row, D) 98 | 99 | # step 2, get prediction 100 | p = get_p(x, w) 101 | 102 | # for progress validation, useless for learning our model 103 | loss += logloss(p, y) 104 | if t % 1000000 == 0 and t > 1: 105 | print('%s\tencountered: %d\tcurrent logloss: %f' % ( 106 | datetime.now(), t, loss/t)) 107 | 108 | # step 3, update model with answer 109 | w, n = update_w(w, n, x, p, y) 110 | 111 | # testing (build kaggle's submission file) 112 | with open('submissionPython_20Sep2014_2.csv', 'w') as submission: 113 | submission.write('Id,Predicted\n') 114 | for t, row in enumerate(DictReader(open(test))): 115 | Id = row['Id'] 116 | del row['Id'] 117 | x = get_x(row, D) 118 | p = get_p(x, w) 119 | submission.write('%s,%f\n' % (Id, p)) -------------------------------------------------------------------------------- /pythonSGD/py_lh_20Sep2014.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from csv import DictReader 3 | from math import exp, log, sqrt 4 | import scipy as sp 5 | 6 | # parameters ################################################################# 7 | 8 | train = 'train.csv' # path to training file 9 | test = 'test.csv' # path to testing file 10 | 11 | D = 2 ** 27 # number of weights use for learning 12 | alpha = .145 # learning rate for sgd optimization 13 | 14 | 15 | # function definitions ####################################################### 16 | 17 | # A. Bounded logloss 18 | # INPUT: 19 | # p: our prediction 20 | # y: real answer 21 | # OUTPUT 22 | # logarithmic loss of p given y 23 | def logloss(p, y): 24 | epsilon = 1e-15 25 | p = max(min(p, 1. - epsilon), epsilon) 26 | ll = y*sp.log(p) + sp.subtract(1,y)*sp.log(sp.subtract(1,p)) 27 | ll = ll * -1.0/1 28 | return ll 29 | 30 | # B. 
Apply hash trick of the original csv row 31 | # for simplicity, we treat both integer and categorical features as categorical 32 | # INPUT: 33 | # csv_row: a csv dictionary, ex: {'Lable': '1', 'I1': '357', 'I2': '', ...} 34 | # D: the max index that we can hash to 35 | # OUTPUT: 36 | # x: a list of indices that its value is 1 37 | def get_x(csv_row, D): 38 | x = [0] # 0 is the index of the bias term 39 | for key, value in csv_row.items(): 40 | index = int(value + key[1:], 16) % D # weakest hash ever ;) 41 | x.append(index) 42 | return x # x contains indices of features that have a value of 1 43 | 44 | 45 | # C. Get probability estimation on x 46 | # INPUT: 47 | # x: features 48 | # w: weights 49 | # OUTPUT: 50 | # probability of p(y = 1 | x; w) 51 | def get_p(x, w): 52 | wTx = 0. 53 | for i in x: # do wTx 54 | wTx += w[i] * 1. # w[i] * x[i], but if i in x we got x[i] = 1. 55 | return 1. / (1. + exp(-max(min(wTx, 10.), -10.))) # bounded sigmoid 56 | 57 | 58 | # D. Update given model 59 | # INPUT: 60 | # w: weights 61 | # n: a counter that counts the number of times we encounter a feature 62 | # this is used for adaptive learning rate 63 | # x: feature 64 | # p: prediction of our model 65 | # y: answer 66 | # OUTPUT: 67 | # w: updated model 68 | # n: updated count 69 | def update_w(w, n, x, p, y): 70 | for i in x: 71 | # alpha / (sqrt(n) + 1) is the adaptive learning rate heuristic 72 | # (p - y) * x[i] is the current gradient 73 | # note that in our case, if i in x then x[i] = 1 74 | w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.) 75 | n[i] += 1. 76 | return w, n 77 | 78 | 79 | # training and testing ####################################################### 80 | 81 | # initialize our model 82 | w = [0.] * D # weights 83 | n = [0.] * D # number of times we've encountered a feature 84 | 85 | # start training a logistic regression model using on pass sgd 86 | loss = 0. 87 | for t, row in enumerate(DictReader(open(train))): 88 | y = 1. if row['Label'] == '1' else 0. 
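# Step-size intuition for update_w below (arithmetic sketch): each weight i moves by (p - y) * alpha / (sqrt(n[i]) + 1.), so with alpha = .145 a feature seen for the first time is updated with step .145, while after 100 occurrences the step shrinks to roughly .145 / 11, about 0.013; frequent features therefore get smaller, more stable updates.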
89 | 90 | del row['Label'] # can't let the model peek the answer 91 | del row['Id'] # we don't need the Id 92 | 93 | # main training procedure 94 | # step 1, get the hashed features 95 | x = get_x(row, D) 96 | 97 | # step 2, get prediction 98 | p = get_p(x, w) 99 | 100 | # for progress validation, useless for learning our model 101 | loss += logloss(p, y) 102 | if t % 100000 == 0 and t > 1: 103 | print('%s\tencountered: %d\tcurrent logloss: %f' % ( 104 | datetime.now(), t, loss/t)) 105 | 106 | # step 3, update model with answer 107 | if t <= 40000000: 108 | w, n = update_w(w, n, x, p, y) 109 | 110 | # testing (build kaggle's submission file) 111 | with open('submissionPython_21Sep2014.csv', 'w') as submission: 112 | submission.write('Id,Predicted\n') 113 | for t, row in enumerate(DictReader(open(test))): 114 | Id = row['Id'] 115 | del row['Id'] 116 | x = get_x(row, D) 117 | p = get_p(x, w) 118 | submission.write('%s,%f\n' % (Id, p)) -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | # Configuration file for CTR Prediction Project 2 | # ============================================= 3 | 4 | # Project metadata 5 | project: 6 | name: "CTR Prediction" 7 | version: "2.0" 8 | description: "Click-Through Rate prediction for display advertising" 9 | 10 | # Data paths 11 | data: 12 | root_dir: "./data" 13 | train_file: "train.csv" 14 | test_file: "test.csv" 15 | sample_train_file: "train_sample.csv" # Optional: smaller dataset for testing 16 | 17 | # Processed data 18 | train_vw: "train.vw" 19 | test_vw: "test.vw" 20 | 21 | # Validation split (if doing train/val split) 22 | validation_split: 0.2 23 | random_seed: 42 24 | 25 | # Output paths 26 | output: 27 | root_dir: "./output" 28 | submission_file: "submission.csv" 29 | log_file: "training.log" 30 | model_dir: "./models" 31 | 32 | # Logistic Regression with SGD configuration 33 | logistic_regression: 34 | # Feature hashing 35 | dimension: 134217728 # 2^27 = 134,217,728 features 36 | 37 | # Optimization 38 | learning_rate: 0.145 39 | adaptive_learning: true 40 | 41 | # Training 42 | max_passes: 1 43 | log_interval: 1000000 # Log every 1M samples 44 | 45 | # Numerical stability 46 | epsilon: 1.0e-12 47 | sigmoid_bound: 20.0 # Bound input to sigmoid to [-20, 20] 48 | 49 | # Gradient Boosting Machine (GBM) configuration 50 | gbm: 51 | # Model hyperparameters 52 | n_trees: 500 53 | interaction_depth: 22 54 | shrinkage: 0.1 55 | min_samples_split: 10 56 | 57 | # Training 58 | cv_folds: 2 59 | metric: "ROC" 60 | verbose: true 61 | 62 | # Computation 63 | n_jobs: -1 # Use all CPU cores 64 | random_state: 888 65 | 66 | # Vowpal Wabbit configuration 67 | vowpal_wabbit: 68 | # Training options 69 | loss_function: "logistic" 70 | learning_rate: 0.5 71 | l1_lambda: 0.0 72 | l2_lambda: 0.0 73 | 74 | # Optimization 75 | passes: 3 76 | cache_file: "./data/train.cache" 77 | 78 | # Feature engineering 79 | quadratic: "" # e.g., "ii" for quadratic interactions in namespace i 80 | cubic: "" # e.g., "iii" for cubic interactions 81 | 82 | # Hashing 83 | bit_precision: 27 # 2^27 features 84 | 85 | # Output 86 | model_file: "./models/click.model.vw" 87 | predictions_file: "./output/vw_predictions.txt" 88 | 89 | # Data preprocessing 90 | preprocessing: 91 | # Missing value handling 92 | numerical_imputation: "median" # Options: mean, median, zero 93 | categorical_imputation: "mode" # Options: mode, unknown 94 | 95 | # Feature engineering 96 | 
create_interactions: false 97 | polynomial_features: false 98 | polynomial_degree: 2 99 | 100 | # Scaling (usually not needed for tree-based methods) 101 | scale_features: false 102 | scaler_type: "standard" # Options: standard, minmax, robust 103 | 104 | # Model evaluation 105 | evaluation: 106 | metrics: 107 | - "log_loss" 108 | - "auc_roc" 109 | - "accuracy" 110 | - "precision" 111 | - "recall" 112 | 113 | # Validation 114 | use_validation: false 115 | validation_size: 0.2 116 | 117 | # Logging configuration 118 | logging: 119 | level: "INFO" # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL 120 | format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s" 121 | log_to_file: true 122 | log_to_console: true 123 | 124 | # Experiment tracking 125 | experiment: 126 | track_experiments: false 127 | experiment_name: "baseline" 128 | tags: 129 | - "logistic_regression" 130 | - "hash_trick" 131 | 132 | # Resource constraints 133 | resources: 134 | # Memory limits (in GB) 135 | max_memory_gb: 16 136 | 137 | # CPU 138 | n_processes: 4 139 | 140 | # GPU (if available) 141 | use_gpu: false 142 | gpu_device: 0 143 | 144 | # Reproducibility 145 | reproducibility: 146 | random_seed: 42 147 | deterministic: true 148 | 149 | # Development settings 150 | development: 151 | # Use smaller sample for quick testing 152 | use_sample: false 153 | sample_size: 1000000 # First 1M rows 154 | 155 | # Debug mode 156 | debug: false 157 | profile_code: false 158 | 159 | # Testing 160 | run_tests: true 161 | test_coverage_threshold: 0.8 162 | -------------------------------------------------------------------------------- /.github/workflows/ci.yml: -------------------------------------------------------------------------------- 1 | name: CI 2 | 3 | on: 4 | push: 5 | branches: [ master, dev-claude ] 6 | pull_request: 7 | branches: [ master ] 8 | 9 | jobs: 10 | test-python: 11 | name: Test Python ${{ matrix.python-version }} 12 | runs-on: ubuntu-latest 13 | 14 | strategy: 15 | matrix: 16 | python-version: ['3.8', '3.9', '3.10', '3.11'] 17 | 18 | steps: 19 | - name: Checkout code 20 | uses: actions/checkout@v3 21 | 22 | - name: Set up Python ${{ matrix.python-version }} 23 | uses: actions/setup-python@v4 24 | with: 25 | python-version: ${{ matrix.python-version }} 26 | 27 | - name: Cache pip packages 28 | uses: actions/cache@v3 29 | with: 30 | path: ~/.cache/pip 31 | key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }} 32 | restore-keys: | 33 | ${{ runner.os }}-pip- 34 | 35 | - name: Install dependencies 36 | run: | 37 | python -m pip install --upgrade pip 38 | pip install -r requirements.txt 39 | 40 | - name: Lint with flake8 41 | run: | 42 | pip install flake8 43 | # Stop the build if there are Python syntax errors or undefined names 44 | flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics 45 | # Exit-zero treats all errors as warnings 46 | flake8 . --count --exit-zero --max-complexity=10 --max-line-length=100 --statistics 47 | 48 | - name: Check code formatting with black 49 | run: | 50 | pip install black 51 | black --check . 52 | 53 | - name: Run tests with pytest 54 | run: | 55 | pytest tests/ --verbose 56 | 57 | - name: Generate coverage report 58 | if: matrix.python-version == '3.10' 59 | run: | 60 | pytest tests/ --cov=. 
--cov-report=xml --cov-report=html 61 | 62 | - name: Upload coverage to Codecov 63 | if: matrix.python-version == '3.10' 64 | uses: codecov/codecov-action@v3 65 | with: 66 | files: ./coverage.xml 67 | flags: unittests 68 | name: codecov-umbrella 69 | 70 | test-r: 71 | name: Test R Scripts 72 | runs-on: ubuntu-latest 73 | 74 | steps: 75 | - name: Checkout code 76 | uses: actions/checkout@v3 77 | 78 | - name: Set up R 79 | uses: r-lib/actions/setup-r@v2 80 | with: 81 | r-version: '4.2.0' 82 | 83 | - name: Install R dependencies 84 | run: | 85 | Rscript -e "install.packages(c('data.table', 'caret', 'gbm'), dependencies=TRUE, repos='https://cloud.r-project.org')" 86 | 87 | - name: Check R script syntax 88 | run: | 89 | Rscript -e "source('gbm_modernized.R', echo=TRUE)" || true 90 | 91 | lint: 92 | name: Code Quality Checks 93 | runs-on: ubuntu-latest 94 | 95 | steps: 96 | - name: Checkout code 97 | uses: actions/checkout@v3 98 | 99 | - name: Set up Python 100 | uses: actions/setup-python@v4 101 | with: 102 | python-version: '3.10' 103 | 104 | - name: Install linting tools 105 | run: | 106 | python -m pip install --upgrade pip 107 | pip install flake8 pylint black isort 108 | 109 | - name: Run flake8 110 | run: | 111 | flake8 py_lh4_modernized.py logReg_modernized.py csv_to_vw_modernized.py --max-line-length=100 112 | 113 | - name: Run pylint 114 | run: | 115 | pylint py_lh4_modernized.py logReg_modernized.py csv_to_vw_modernized.py --max-line-length=100 || true 116 | 117 | - name: Check import sorting 118 | run: | 119 | isort --check-only --diff . 120 | 121 | security: 122 | name: Security Scan 123 | runs-on: ubuntu-latest 124 | 125 | steps: 126 | - name: Checkout code 127 | uses: actions/checkout@v3 128 | 129 | - name: Set up Python 130 | uses: actions/setup-python@v4 131 | with: 132 | python-version: '3.10' 133 | 134 | - name: Install safety 135 | run: | 136 | python -m pip install --upgrade pip 137 | pip install safety 138 | 139 | - name: Run safety check 140 | run: | 141 | pip install -r requirements.txt 142 | safety check || true 143 | 144 | - name: Run bandit security linter 145 | run: | 146 | pip install bandit 147 | bandit -r . -f json -o bandit-report.json || true 148 | 149 | - name: Upload bandit report 150 | if: always() 151 | uses: actions/upload-artifact@v3 152 | with: 153 | name: bandit-security-report 154 | path: bandit-report.json 155 | -------------------------------------------------------------------------------- /tests/test_lr_model.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tests for Logistic Regression model (py_lh4_modernized.py). 
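A minimal usage sketch of the interface exercised by these tests (the dimension,
learning rate and row values are illustrative):

    model = CTRPredictor(dimension=1000, learning_rate=0.1)
    x = model.get_features({'I1': '5', 'C1': 'abc'})
    p = model.predict_probability(x)      # a probability in [0.0, 1.0]
    model.update_weights(x, p, 1.0)       # 1.0 is the true label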
3 | """ 4 | 5 | import pytest 6 | import numpy as np 7 | from pathlib import Path 8 | import sys 9 | 10 | # Add parent directory to path 11 | sys.path.insert(0, str(Path(__file__).parent.parent)) 12 | 13 | from py_lh4_modernized import CTRPredictor 14 | 15 | 16 | class TestCTRPredictor: 17 | """Test suite for CTRPredictor class.""" 18 | 19 | def test_initialization(self): 20 | """Test model initialization.""" 21 | model = CTRPredictor(dimension=1000, learning_rate=0.1) 22 | 23 | assert model.D == 1000 24 | assert model.alpha == 0.1 25 | assert len(model.w) == 1000 26 | assert len(model.n) == 1000 27 | assert all(w == 0.0 for w in model.w) 28 | assert all(n == 0.0 for n in model.n) 29 | 30 | def test_logloss_positive_label(self): 31 | """Test logloss calculation for positive label.""" 32 | loss = CTRPredictor.logloss(0.8, 1.0) 33 | expected = -np.log(0.8) 34 | assert np.isclose(loss, expected) 35 | 36 | def test_logloss_negative_label(self): 37 | """Test logloss calculation for negative label.""" 38 | loss = CTRPredictor.logloss(0.2, 0.0) 39 | expected = -np.log(0.8) 40 | assert np.isclose(loss, expected) 41 | 42 | def test_logloss_boundary_values(self): 43 | """Test logloss with boundary values (close to 0 or 1).""" 44 | # Should not raise error or return inf 45 | loss1 = CTRPredictor.logloss(0.99999, 1.0) 46 | loss2 = CTRPredictor.logloss(0.00001, 0.0) 47 | 48 | assert not np.isinf(loss1) 49 | assert not np.isinf(loss2) 50 | assert loss1 > 0 51 | assert loss2 > 0 52 | 53 | def test_get_features_basic(self): 54 | """Test feature hashing.""" 55 | model = CTRPredictor(dimension=1000) 56 | 57 | csv_row = {'I1': '5', 'I2': '10', 'C1': 'abc'} 58 | features = model.get_features(csv_row) 59 | 60 | # Should include bias term (index 0) 61 | assert 0 in features 62 | assert len(features) > 1 63 | # All indices should be within dimension 64 | assert all(0 <= idx < model.D for idx in features) 65 | 66 | def test_get_features_empty_values(self): 67 | """Test feature hashing with empty values.""" 68 | model = CTRPredictor(dimension=1000) 69 | 70 | csv_row = {'I1': '5', 'I2': '', 'C1': 'abc'} 71 | features = model.get_features(csv_row) 72 | 73 | # Should only have bias and non-empty features 74 | assert 0 in features 75 | assert len(features) >= 1 76 | 77 | def test_predict_probability_range(self): 78 | """Test that predictions are in valid probability range.""" 79 | model = CTRPredictor(dimension=100) 80 | 81 | # Random features 82 | features = [0, 5, 10, 25] 83 | 84 | prob = model.predict_probability(features) 85 | 86 | # Should be between 0 and 1 87 | assert 0.0 <= prob <= 1.0 88 | 89 | def test_predict_probability_with_weights(self): 90 | """Test prediction with non-zero weights.""" 91 | model = CTRPredictor(dimension=100) 92 | 93 | # Set some weights 94 | model.w[0] = 1.0 95 | model.w[5] = 2.0 96 | model.w[10] = -1.5 97 | 98 | features = [0, 5, 10] 99 | prob = model.predict_probability(features) 100 | 101 | # Expected: sigmoid(1.0 + 2.0 - 1.5) = sigmoid(1.5) 102 | expected = 1.0 / (1.0 + np.exp(-1.5)) 103 | 104 | assert np.isclose(prob, expected) 105 | 106 | def test_update_weights(self): 107 | """Test weight update.""" 108 | model = CTRPredictor(dimension=100, learning_rate=0.1) 109 | 110 | features = [0, 5, 10] 111 | p = 0.7 # Prediction 112 | y = 1.0 # True label 113 | 114 | # Store initial weights 115 | initial_weights = [model.w[i] for i in features] 116 | 117 | # Update weights 118 | model.update_weights(features, p, y) 119 | 120 | # Weights should have changed 121 | for i, idx in 
enumerate(features): 122 | assert model.w[idx] != initial_weights[i] 123 | 124 | # Counters should have incremented 125 | for idx in features: 126 | assert model.n[idx] == 1.0 127 | 128 | def test_train_file_not_found(self): 129 | """Test training with non-existent file.""" 130 | model = CTRPredictor(dimension=100) 131 | 132 | with pytest.raises(FileNotFoundError): 133 | model.train(Path("nonexistent_file.csv")) 134 | 135 | def test_predict_file_not_found(self): 136 | """Test prediction with non-existent file.""" 137 | model = CTRPredictor(dimension=100) 138 | 139 | with pytest.raises(FileNotFoundError): 140 | model.predict( 141 | Path("nonexistent_test.csv"), 142 | Path("output.csv") 143 | ) 144 | 145 | def test_train_with_temp_file(self, temp_csv_file): 146 | """Test training with a small temporary file.""" 147 | model = CTRPredictor(dimension=100, learning_rate=0.1) 148 | 149 | # Should not raise error 150 | model.train(temp_csv_file) 151 | 152 | # Weights should have been updated 153 | assert any(w != 0.0 for w in model.w) 154 | 155 | def test_numerical_stability(self): 156 | """Test that extreme values don't cause overflow.""" 157 | model = CTRPredictor(dimension=100) 158 | 159 | # Set extreme weights 160 | model.w[5] = 1000.0 161 | model.w[10] = -1000.0 162 | 163 | features = [5, 10] 164 | prob = model.predict_probability(features) 165 | 166 | # Should not be nan or inf 167 | assert not np.isnan(prob) 168 | assert not np.isinf(prob) 169 | assert 0.0 <= prob <= 1.0 170 | 171 | 172 | class TestEdgeCases: 173 | """Test edge cases and error handling.""" 174 | 175 | def test_zero_dimension(self): 176 | """Test with zero dimension (should work but be useless).""" 177 | with pytest.raises(Exception): 178 | model = CTRPredictor(dimension=0) 179 | 180 | def test_negative_learning_rate(self): 181 | """Test with negative learning rate.""" 182 | # Should still initialize (may behave oddly in training) 183 | model = CTRPredictor(dimension=100, learning_rate=-0.1) 184 | assert model.alpha == -0.1 185 | 186 | def test_very_large_dimension(self): 187 | """Test memory handling with large dimension.""" 188 | # This might be slow or fail on memory-constrained systems 189 | # Using smaller value for testing 190 | try: 191 | model = CTRPredictor(dimension=10_000_000) 192 | assert len(model.w) == 10_000_000 193 | except MemoryError: 194 | pytest.skip("Insufficient memory for this test") 195 | -------------------------------------------------------------------------------- /gbm_modernized.R: -------------------------------------------------------------------------------- 1 | # Gradient Boosting Machine (GBM) for CTR Prediction 2 | # Modernized version with configurable paths and better structure 3 | 4 | # ============================================================================ 5 | # Configuration 6 | # ============================================================================ 7 | 8 | # Use environment variable or default to current directory 9 | DATA_DIR <- Sys.getenv("DATA_DIR", default = "./data") 10 | OUTPUT_DIR <- Sys.getenv("OUTPUT_DIR", default = "./output") 11 | 12 | # File paths (relative to DATA_DIR) 13 | TRAIN_FILE <- file.path(DATA_DIR, "train_num_na_yesno.csv") 14 | TEST_FILE <- file.path(DATA_DIR, "test_num_impute.csv") 15 | OUTPUT_FILE <- file.path(OUTPUT_DIR, "submit_gbm_num_nona_impute.csv") 16 | 17 | # Model hyperparameters 18 | N_TREES <- 500 19 | INTERACTION_DEPTH <- 22 20 | SHRINKAGE <- 0.1 21 | N_CV_FOLDS <- 2 22 | SEED <- 888 23 | 24 | # 
============================================================================ 25 | # Setup 26 | # ============================================================================ 27 | 28 | cat("==========================================================\n") 29 | cat("GBM CTR Prediction Model\n") 30 | cat("==========================================================\n") 31 | cat(sprintf("Data directory: %s\n", DATA_DIR)) 32 | cat(sprintf("Output directory: %s\n", OUTPUT_DIR)) 33 | cat(sprintf("Training file: %s\n", TRAIN_FILE)) 34 | cat(sprintf("Test file: %s\n", TEST_FILE)) 35 | cat("==========================================================\n\n") 36 | 37 | # Load required libraries 38 | required_packages <- c("data.table", "caret", "gbm") 39 | 40 | for (pkg in required_packages) { 41 | if (!require(pkg, character.only = TRUE, quietly = TRUE)) { 42 | cat(sprintf("Installing package: %s\n", pkg)) 43 | install.packages(pkg, dependencies = TRUE) 44 | library(pkg, character.only = TRUE) 45 | } 46 | } 47 | 48 | # Create output directory if it doesn't exist 49 | if (!dir.exists(OUTPUT_DIR)) { 50 | dir.create(OUTPUT_DIR, recursive = TRUE) 51 | cat(sprintf("Created output directory: %s\n", OUTPUT_DIR)) 52 | } 53 | 54 | # ============================================================================ 55 | # Data Loading and Preparation 56 | # ============================================================================ 57 | 58 | cat("\n[1/5] Loading training data...\n") 59 | 60 | if (!file.exists(TRAIN_FILE)) { 61 | stop(sprintf("Training file not found: %s\nPlease prepare the data first.", TRAIN_FILE)) 62 | } 63 | 64 | train <- read.csv(TRAIN_FILE, stringsAsFactors = TRUE) 65 | 66 | # Remove ID column if present 67 | if ("Id" %in% colnames(train) || "id" %in% colnames(train)) { 68 | train <- train[, !(colnames(train) %in% c("Id", "id"))] 69 | } 70 | 71 | cat(sprintf(" Loaded %d samples with %d features\n", nrow(train), ncol(train) - 1)) 72 | cat(sprintf(" Target variable: Label\n")) 73 | cat(sprintf(" Class distribution:\n")) 74 | print(table(train$Label)) 75 | 76 | # ============================================================================ 77 | # Model Training 78 | # ============================================================================ 79 | 80 | cat("\n[2/5] Training GBM model...\n") 81 | cat(sprintf(" Trees: %d\n", N_TREES)) 82 | cat(sprintf(" Interaction depth: %d\n", INTERACTION_DEPTH)) 83 | cat(sprintf(" Shrinkage: %.3f\n", SHRINKAGE)) 84 | cat(sprintf(" CV folds: %d\n", N_CV_FOLDS)) 85 | 86 | # Set seed for reproducibility 87 | set.seed(SEED) 88 | 89 | # Define training grid 90 | gbm_grid <- expand.grid( 91 | n.trees = N_TREES, 92 | interaction.depth = INTERACTION_DEPTH, 93 | shrinkage = SHRINKAGE, 94 | n.minobsinnode = 10 95 | ) 96 | 97 | # Define training control 98 | fit_control <- trainControl( 99 | method = "cv", 100 | number = N_CV_FOLDS, 101 | classProbs = TRUE, 102 | summaryFunction = twoClassSummary, 103 | allowParallel = TRUE 104 | ) 105 | 106 | # Train model 107 | start_time <- Sys.time() 108 | 109 | gbm_fit <- train( 110 | Label ~ ., 111 | data = train, 112 | method = "gbm", 113 | trControl = fit_control, 114 | tuneGrid = gbm_grid, 115 | metric = "ROC", 116 | verbose = TRUE 117 | ) 118 | 119 | end_time <- Sys.time() 120 | training_time <- difftime(end_time, start_time, units = "mins") 121 | 122 | cat(sprintf("\nTraining completed in %.2f minutes\n", training_time)) 123 | 124 | # ============================================================================ 125 | # Model Evaluation 
126 | # ============================================================================ 127 | 128 | cat("\n[3/5] Evaluating model on training data...\n") 129 | 130 | train_pred <- predict(gbm_fit, train) 131 | conf_matrix <- confusionMatrix(train_pred, train$Label) 132 | 133 | print(conf_matrix) 134 | 135 | # ============================================================================ 136 | # Load Test Data and Generate Predictions 137 | # ============================================================================ 138 | 139 | cat("\n[4/5] Loading test data and generating predictions...\n") 140 | 141 | if (!file.exists(TEST_FILE)) { 142 | warning(sprintf("Test file not found: %s\nSkipping predictions.", TEST_FILE)) 143 | } else { 144 | test <- read.csv(TEST_FILE, stringsAsFactors = TRUE) 145 | 146 | # Keep ID column for submission 147 | test_id <- test$Id 148 | 149 | cat(sprintf(" Loaded %d test samples\n", nrow(test))) 150 | 151 | # Generate predictions (probabilities) 152 | test_pred_prob <- predict(gbm_fit, test, type = "prob") 153 | 154 | # Create submission dataframe 155 | submission <- data.frame( 156 | Id = test_id, 157 | Predicted = test_pred_prob[, "Yes"] # Probability of positive class 158 | ) 159 | 160 | # ============================================================================ 161 | # Save Predictions 162 | # ============================================================================ 163 | 164 | cat("\n[5/5] Saving predictions...\n") 165 | 166 | write.csv( 167 | submission, 168 | OUTPUT_FILE, 169 | row.names = FALSE, 170 | quote = FALSE 171 | ) 172 | 173 | cat(sprintf(" Predictions saved to: %s\n", OUTPUT_FILE)) 174 | cat(sprintf(" Submission format: %d rows, 2 columns (Id, Predicted)\n", nrow(submission))) 175 | } 176 | 177 | # ============================================================================ 178 | # Summary 179 | # ============================================================================ 180 | 181 | cat("\n==========================================================\n") 182 | cat("Model Training Summary\n") 183 | cat("==========================================================\n") 184 | cat(sprintf("Training samples: %d\n", nrow(train))) 185 | cat(sprintf("Features: %d\n", ncol(train) - 1)) 186 | cat(sprintf("Training time: %.2f minutes\n", training_time)) 187 | cat(sprintf("Training accuracy: %.4f\n", conf_matrix$overall['Accuracy'])) 188 | cat(sprintf("Model saved: gbm_fit\n")) 189 | cat("==========================================================\n") 190 | cat("\nDone!\n") 191 | 192 | # ============================================================================ 193 | # Optional: Save model object 194 | # ============================================================================ 195 | 196 | # Uncomment to save the trained model for later use 197 | # saveRDS(gbm_fit, file.path(OUTPUT_DIR, "gbm_model.rds")) 198 | # cat(sprintf("Model object saved to: %s\n", file.path(OUTPUT_DIR, "gbm_model.rds"))) 199 | 200 | # To load the model later: 201 | # gbm_fit <- readRDS(file.path(OUTPUT_DIR, "gbm_model.rds")) 202 | -------------------------------------------------------------------------------- /tests/test_vw_converter.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tests for CSV to Vowpal Wabbit converter (csv_to_vw_modernized.py). 
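Conversion sketch exercised by these tests (paths and values are illustrative):

    csv_to_vw(Path('train.csv'), Path('click.train.vw'), is_train=True)

Each CSV row becomes one VW line, e.g. "1 '123 |i I1:5 |c abc".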
3 | """ 4 | 5 | import pytest 6 | from pathlib import Path 7 | import sys 8 | 9 | # Add parent directory to path 10 | sys.path.insert(0, str(Path(__file__).parent.parent)) 11 | 12 | from csv_to_vw_modernized import _convert_row_to_vw, csv_to_vw 13 | 14 | 15 | class TestConvertRowToVW: 16 | """Test suite for row conversion function.""" 17 | 18 | def test_convert_train_row_basic(self): 19 | """Test converting a basic training row.""" 20 | row = { 21 | 'Id': '123', 22 | 'Label': '1', 23 | 'I1': '5', 24 | 'I2': '10', 25 | 'C1': 'abc', 26 | 'C2': 'def' 27 | } 28 | 29 | vw_line = _convert_row_to_vw(row, is_train=True) 30 | 31 | # Should start with label (1) and tag ('123) 32 | assert vw_line.startswith("1 '123") 33 | 34 | # Should contain numerical namespace 35 | assert '|i' in vw_line 36 | 37 | # Should contain categorical namespace 38 | assert '|c' in vw_line 39 | 40 | # Should contain feature values 41 | assert 'I1:5' in vw_line 42 | assert 'I2:10' in vw_line 43 | assert 'abc' in vw_line 44 | assert 'def' in vw_line 45 | 46 | def test_convert_train_row_negative_label(self): 47 | """Test converting row with negative label.""" 48 | row = { 49 | 'Id': '456', 50 | 'Label': '0', 51 | 'I1': '3', 52 | 'C1': 'xyz' 53 | } 54 | 55 | vw_line = _convert_row_to_vw(row, is_train=True) 56 | 57 | # Label should be -1 for VW format 58 | assert vw_line.startswith("-1 '456") 59 | 60 | def test_convert_test_row(self): 61 | """Test converting a test row (no label).""" 62 | row = { 63 | 'Id': '789', 64 | 'I1': '7', 65 | 'C1': 'test' 66 | } 67 | 68 | vw_line = _convert_row_to_vw(row, is_train=False) 69 | 70 | # Test rows get dummy label 1 71 | assert vw_line.startswith("1 '789") 72 | assert 'I1:7' in vw_line 73 | assert 'test' in vw_line 74 | 75 | def test_convert_row_with_missing_values(self): 76 | """Test converting row with missing values.""" 77 | row = { 78 | 'Id': '111', 79 | 'Label': '1', 80 | 'I1': '5', 81 | 'I2': '', # Missing 82 | 'C1': 'abc', 83 | 'C2': '' # Missing 84 | } 85 | 86 | vw_line = _convert_row_to_vw(row, is_train=True) 87 | 88 | # Should include non-empty features 89 | assert 'I1:5' in vw_line 90 | assert 'abc' in vw_line 91 | 92 | # Should not include empty features 93 | assert 'I2:' not in vw_line 94 | 95 | def test_convert_row_with_spaces(self): 96 | """Test converting row with whitespace.""" 97 | row = { 98 | 'Id': '222', 99 | 'Label': '0', 100 | 'I1': ' 5 ', # With spaces 101 | 'C1': ' abc ' 102 | } 103 | 104 | vw_line = _convert_row_to_vw(row, is_train=True) 105 | 106 | # Spaces should be handled (features present) 107 | assert '|i' in vw_line 108 | assert '|c' in vw_line 109 | 110 | def test_convert_row_empty_namespaces(self): 111 | """Test converting row with all values empty in a namespace.""" 112 | row = { 113 | 'Id': '333', 114 | 'Label': '1', 115 | 'I1': '', 116 | 'I2': '', 117 | 'C1': 'abc' 118 | } 119 | 120 | vw_line = _convert_row_to_vw(row, is_train=True) 121 | 122 | # Should still have namespace markers 123 | assert '|i' in vw_line 124 | assert '|c' in vw_line 125 | 126 | 127 | class TestCSVToVWFunction: 128 | """Test suite for full CSV to VW conversion.""" 129 | 130 | def test_csv_to_vw_file_not_found(self, tmp_path): 131 | """Test with non-existent input file.""" 132 | csv_path = tmp_path / "nonexistent.csv" 133 | output_path = tmp_path / "output.vw" 134 | 135 | with pytest.raises(FileNotFoundError): 136 | csv_to_vw(csv_path, output_path) 137 | 138 | def test_csv_to_vw_basic(self, temp_csv_file, tmp_path): 139 | """Test basic CSV to VW conversion.""" 140 | output_path = 
tmp_path / "output.vw" 141 | 142 | # Should not raise error 143 | csv_to_vw(temp_csv_file, output_path, is_train=True) 144 | 145 | # Output file should exist 146 | assert output_path.exists() 147 | 148 | # Check output content 149 | with open(output_path, 'r') as f: 150 | lines = f.readlines() 151 | 152 | # Should have same number of lines as CSV (minus header) 153 | assert len(lines) == 3 # 3 data rows from fixture 154 | 155 | # Each line should be VW format 156 | for line in lines: 157 | assert '|i' in line 158 | assert '|c' in line 159 | 160 | def test_csv_to_vw_test_mode(self, temp_csv_file, tmp_path): 161 | """Test CSV to VW conversion in test mode.""" 162 | output_path = tmp_path / "output.vw" 163 | 164 | csv_to_vw(temp_csv_file, output_path, is_train=False) 165 | 166 | # Output file should exist 167 | assert output_path.exists() 168 | 169 | with open(output_path, 'r') as f: 170 | lines = f.readlines() 171 | 172 | # In test mode, all labels should be 1 173 | for line in lines: 174 | assert line.startswith('1 ') 175 | 176 | def test_csv_to_vw_creates_output_file(self, temp_csv_file, tmp_path): 177 | """Test that output file is created correctly.""" 178 | output_path = tmp_path / "nested" / "dir" / "output.vw" 179 | 180 | # Parent directories don't exist 181 | assert not output_path.parent.exists() 182 | 183 | # This should fail since we don't create parent dirs 184 | # (could be enhanced to create them) 185 | with pytest.raises(FileNotFoundError): 186 | csv_to_vw(temp_csv_file, output_path) 187 | 188 | 189 | class TestEdgeCases: 190 | """Test edge cases and error handling.""" 191 | 192 | def test_malformed_csv(self, tmp_path): 193 | """Test handling of malformed CSV.""" 194 | csv_file = tmp_path / "malformed.csv" 195 | 196 | # Create malformed CSV 197 | with open(csv_file, 'w') as f: 198 | f.write("Id,Label,I1\n") 199 | f.write("1,0,5\n") 200 | f.write("2,1\n") # Missing column 201 | 202 | output_file = tmp_path / "output.vw" 203 | 204 | # Should handle gracefully 205 | csv_to_vw(csv_file, output_file, is_train=True) 206 | 207 | # Output should still be created 208 | assert output_file.exists() 209 | 210 | def test_empty_csv(self, tmp_path): 211 | """Test handling of empty CSV.""" 212 | csv_file = tmp_path / "empty.csv" 213 | 214 | # Create empty CSV (header only) 215 | with open(csv_file, 'w') as f: 216 | f.write("Id,Label,I1,C1\n") 217 | 218 | output_file = tmp_path / "output.vw" 219 | 220 | # Should not raise error 221 | csv_to_vw(csv_file, output_file, is_train=True) 222 | 223 | # Output should be empty 224 | with open(output_file, 'r') as f: 225 | lines = f.readlines() 226 | 227 | assert len(lines) == 0 228 | -------------------------------------------------------------------------------- /csv_to_vw_modernized.py: -------------------------------------------------------------------------------- 1 | """ 2 | CSV to Vowpal Wabbit Format Converter 3 | 4 | Converts Criteo CTR dataset from CSV format to Vowpal Wabbit format. 
5 | 6 | Original credit: 7 | - __Author__: Triskelion 8 | - Credit: Zygmunt Zając 9 | 10 | Modernized with: 11 | - Python 3 compatibility 12 | - Error handling and logging 13 | - Progress reporting 14 | - Configurable paths 15 | - Better performance 16 | """ 17 | 18 | import logging 19 | import sys 20 | from datetime import datetime 21 | from csv import DictReader 22 | from pathlib import Path 23 | from typing import Optional 24 | 25 | # Configure logging 26 | logging.basicConfig( 27 | level=logging.INFO, 28 | format='%(asctime)s - %(levelname)s - %(message)s' 29 | ) 30 | logger = logging.getLogger(__name__) 31 | 32 | 33 | def csv_to_vw( 34 | csv_path: Path, 35 | output_path: Path, 36 | is_train: bool = True, 37 | report_interval: int = 1_000_000 38 | ) -> None: 39 | """ 40 | Convert CSV file to Vowpal Wabbit format. 41 | 42 | Vowpal Wabbit format: 43 | [label] ['tag] |namespace features 44 | 45 | Example train: 46 | 1 'id123 |i I1:5 I2:10 |c C1 C2 C3 47 | 48 | Example test: 49 | 1 'id456 |i I1:3 |c C5 C6 50 | 51 | Args: 52 | csv_path: Path to input CSV file 53 | output_path: Path to output VW file 54 | is_train: Whether this is training data (includes labels) 55 | report_interval: How often to log progress 56 | 57 | Raises: 58 | FileNotFoundError: If CSV file doesn't exist 59 | ValueError: If CSV format is invalid 60 | """ 61 | if not csv_path.exists(): 62 | raise FileNotFoundError(f"CSV file not found: {csv_path}") 63 | 64 | start_time = datetime.now() 65 | 66 | logger.info("=" * 80) 67 | logger.info(f"Converting CSV to Vowpal Wabbit format") 68 | logger.info(f" Input: {csv_path}") 69 | logger.info(f" Output: {output_path}") 70 | logger.info(f" Mode: {'Training' if is_train else 'Testing'}") 71 | logger.info("=" * 80) 72 | 73 | row_count = 0 74 | 75 | try: 76 | with open(csv_path, 'r', encoding='utf-8') as csv_file, \ 77 | open(output_path, 'w', encoding='utf-8') as vw_file: 78 | 79 | reader = DictReader(csv_file) 80 | 81 | # Validate required fields 82 | if reader.fieldnames: 83 | if 'Id' not in reader.fieldnames: 84 | raise ValueError("CSV must contain 'Id' column") 85 | if is_train and 'Label' not in reader.fieldnames: 86 | raise ValueError("Training CSV must contain 'Label' column") 87 | 88 | for row_count, row in enumerate(reader, start=1): 89 | try: 90 | # Create VW format line 91 | vw_line = _convert_row_to_vw(row, is_train) 92 | vw_file.write(vw_line + '\n') 93 | 94 | except Exception as e: 95 | logger.warning(f"Error processing row {row_count}: {e}") 96 | continue 97 | 98 | # Report progress 99 | if row_count % report_interval == 0: 100 | elapsed = datetime.now() - start_time 101 | rate = row_count / elapsed.total_seconds() 102 | logger.info( 103 | f"Processed {row_count:,} rows | " 104 | f"Elapsed: {elapsed} | " 105 | f"Rate: {rate:.0f} rows/sec" 106 | ) 107 | 108 | except Exception as e: 109 | logger.error(f"Error during conversion: {e}") 110 | raise 111 | 112 | elapsed = datetime.now() - start_time 113 | logger.info("=" * 80) 114 | logger.info(f"Conversion completed!") 115 | logger.info(f" Rows processed: {row_count:,}") 116 | logger.info(f" Total time: {elapsed}") 117 | logger.info(f" Average rate: {row_count / elapsed.total_seconds():.0f} rows/sec") 118 | logger.info("=" * 80) 119 | 120 | 121 | def _convert_row_to_vw(row: dict, is_train: bool) -> str: 122 | """ 123 | Convert a single CSV row to Vowpal Wabbit format. 
124 | 125 | Args: 126 | row: Dictionary from CSV DictReader 127 | is_train: Whether this is training data 128 | 129 | Returns: 130 | VW formatted string (without newline) 131 | """ 132 | # Extract label and ID 133 | row_id = row.get('Id', 'unknown') 134 | 135 | if is_train: 136 | # VW uses 1 for positive, -1 for negative 137 | label = 1 if row.get('Label') == '1' else -1 138 | else: 139 | # Test data: use dummy label 1 140 | label = 1 141 | 142 | # Separate numerical and categorical features 143 | numerical_features = [] 144 | categorical_features = [] 145 | 146 | for key, value in row.items(): 147 | # Skip label and ID 148 | if key in ['Label', 'Id']: 149 | continue 150 | 151 | # Skip empty values 152 | if not value or value.strip() == '': 153 | continue 154 | 155 | # Numerical features (start with 'I') 156 | if key.startswith('I'): 157 | numerical_features.append(f"{key}:{value}") 158 | 159 | # Categorical features (start with 'C') 160 | elif key.startswith('C'): 161 | # For categorical, just use the value (no key:value format) 162 | categorical_features.append(value) 163 | 164 | # Build VW format line 165 | # Format: label 'tag |namespace1 features1 |namespace2 features2 166 | numerical_str = ' '.join(numerical_features) if numerical_features else '' 167 | categorical_str = ' '.join(categorical_features) if categorical_features else '' 168 | 169 | vw_line = f"{label} '{row_id} |i {numerical_str} |c {categorical_str}" 170 | 171 | return vw_line 172 | 173 | 174 | def main(): 175 | """Main execution function.""" 176 | import argparse 177 | 178 | parser = argparse.ArgumentParser( 179 | description='Convert Criteo CSV to Vowpal Wabbit format' 180 | ) 181 | parser.add_argument( 182 | 'input', 183 | type=str, 184 | help='Input CSV file path' 185 | ) 186 | parser.add_argument( 187 | 'output', 188 | type=str, 189 | help='Output VW file path' 190 | ) 191 | parser.add_argument( 192 | '--test', 193 | action='store_true', 194 | help='Convert test data (no labels)' 195 | ) 196 | parser.add_argument( 197 | '--interval', 198 | type=int, 199 | default=1_000_000, 200 | help='Progress report interval (default: 1,000,000)' 201 | ) 202 | 203 | args = parser.parse_args() 204 | 205 | # Convert paths 206 | csv_path = Path(args.input) 207 | output_path = Path(args.output) 208 | 209 | # Check if input exists 210 | if not csv_path.exists(): 211 | logger.error(f"Input file not found: {csv_path}") 212 | sys.exit(1) 213 | 214 | # Warn if output exists 215 | if output_path.exists(): 216 | logger.warning(f"Output file exists and will be overwritten: {output_path}") 217 | 218 | try: 219 | # Perform conversion 220 | csv_to_vw( 221 | csv_path=csv_path, 222 | output_path=output_path, 223 | is_train=not args.test, 224 | report_interval=args.interval 225 | ) 226 | 227 | logger.info("Success!") 228 | 229 | except Exception as e: 230 | logger.error(f"Conversion failed: {e}") 231 | sys.exit(1) 232 | 233 | 234 | if __name__ == '__main__': 235 | # Example usage (uncomment to run): 236 | # csv_to_vw( 237 | # csv_path=Path('train.csv'), 238 | # output_path=Path('click.train.vw'), 239 | # is_train=True 240 | # ) 241 | # csv_to_vw( 242 | # csv_path=Path('test.csv'), 243 | # output_path=Path('click.test.vw'), 244 | # is_train=False 245 | # ) 246 | 247 | main() 248 | -------------------------------------------------------------------------------- /scripts/download_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | """ 3 | Data Download Helper for Criteo CTR Dataset 
4 | 5 | This script provides instructions and helpers for downloading the Criteo 6 | Display Advertising Challenge dataset from Kaggle. 7 | 8 | Note: Kaggle API credentials are required for automated download. 9 | """ 10 | 11 | import logging 12 | import sys 13 | from pathlib import Path 14 | import subprocess 15 | 16 | # Configure logging 17 | logging.basicConfig( 18 | level=logging.INFO, 19 | format='%(asctime)s - %(levelname)s - %(message)s' 20 | ) 21 | logger = logging.getLogger(__name__) 22 | 23 | 24 | def check_kaggle_api() -> bool: 25 | """ 26 | Check if Kaggle API is installed and configured. 27 | 28 | Returns: 29 | True if Kaggle API is available, False otherwise 30 | """ 31 | try: 32 | import kaggle 33 | return True 34 | except ImportError: 35 | return False 36 | 37 | 38 | def print_manual_instructions(): 39 | """Print manual download instructions.""" 40 | logger.info("=" * 80) 41 | logger.info("Manual Download Instructions") 42 | logger.info("=" * 80) 43 | print(""" 44 | To download the Criteo CTR dataset manually: 45 | 46 | 1. Visit the Kaggle competition page: 47 | https://www.kaggle.com/c/criteo-display-ad-challenge/data 48 | 49 | 2. Accept the competition rules (you must have a Kaggle account) 50 | 51 | 3. Download the following files: 52 | - train.csv.gz (~11 GB compressed, ~40 GB uncompressed) 53 | - test.csv.gz (~2 GB compressed, ~6 GB uncompressed) 54 | 55 | 4. Extract the files: 56 | gunzip train.csv.gz 57 | gunzip test.csv.gz 58 | 59 | 5. Move the files to the data/ directory: 60 | mv train.csv data/ 61 | mv test.csv data/ 62 | 63 | 6. Verify the files: 64 | python scripts/verify_data.py 65 | """) 66 | 67 | 68 | def print_kaggle_api_setup(): 69 | """Print Kaggle API setup instructions.""" 70 | logger.info("=" * 80) 71 | logger.info("Kaggle API Setup Instructions") 72 | logger.info("=" * 80) 73 | print(""" 74 | To use the Kaggle API for automated downloads: 75 | 76 | 1. Install the Kaggle API: 77 | pip install kaggle 78 | 79 | 2. Get your Kaggle API credentials: 80 | a. Go to https://www.kaggle.com/account 81 | b. Scroll to "API" section 82 | c. Click "Create New API Token" 83 | d. This downloads kaggle.json 84 | 85 | 3. Place kaggle.json in the correct location: 86 | # Linux/macOS 87 | mkdir -p ~/.kaggle 88 | mv ~/Downloads/kaggle.json ~/.kaggle/ 89 | chmod 600 ~/.kaggle/kaggle.json 90 | 91 | # Windows 92 | mkdir %USERPROFILE%\\.kaggle 93 | move %USERPROFILE%\\Downloads\\kaggle.json %USERPROFILE%\\.kaggle\\ 94 | 95 | 4. Run this script again to download data automatically 96 | """) 97 | 98 | 99 | def download_with_kaggle_api(data_dir: Path) -> bool: 100 | """ 101 | Download dataset using Kaggle API. 
102 | 103 | Args: 104 | data_dir: Directory to save data 105 | 106 | Returns: 107 | True if successful, False otherwise 108 | """ 109 | try: 110 | logger.info("Downloading data using Kaggle API...") 111 | logger.info("This may take a while (files are ~13 GB compressed)...") 112 | 113 | # Import here to handle case where it's not installed 114 | import kaggle 115 | 116 | # Create data directory 117 | data_dir.mkdir(parents=True, exist_ok=True) 118 | 119 | # Download dataset files 120 | logger.info("Downloading train.csv.gz...") 121 | kaggle.api.competition_download_file( 122 | 'criteo-display-ad-challenge', 123 | 'train.csv.gz', 124 | path=str(data_dir) 125 | ) 126 | 127 | logger.info("Downloading test.csv.gz...") 128 | kaggle.api.competition_download_file( 129 | 'criteo-display-ad-challenge', 130 | 'test.csv.gz', 131 | path=str(data_dir) 132 | ) 133 | 134 | logger.info("Download complete!") 135 | logger.info("Extracting files...") 136 | 137 | # Extract files 138 | import gzip 139 | import shutil 140 | 141 | # Extract train.csv 142 | train_gz = data_dir / 'train.csv.gz' 143 | train_csv = data_dir / 'train.csv' 144 | 145 | if train_gz.exists(): 146 | logger.info("Extracting train.csv.gz...") 147 | with gzip.open(train_gz, 'rb') as f_in: 148 | with open(train_csv, 'wb') as f_out: 149 | shutil.copyfileobj(f_in, f_out) 150 | logger.info(f"Extracted to {train_csv}") 151 | 152 | # Extract test.csv 153 | test_gz = data_dir / 'test.csv.gz' 154 | test_csv = data_dir / 'test.csv' 155 | 156 | if test_gz.exists(): 157 | logger.info("Extracting test.csv.gz...") 158 | with gzip.open(test_gz, 'rb') as f_in: 159 | with open(test_csv, 'wb') as f_out: 160 | shutil.copyfileobj(f_in, f_out) 161 | logger.info(f"Extracted to {test_csv}") 162 | 163 | logger.info("=" * 80) 164 | logger.info("Dataset downloaded and extracted successfully!") 165 | logger.info(f" Train: {train_csv}") 166 | logger.info(f" Test: {test_csv}") 167 | logger.info("=" * 80) 168 | 169 | return True 170 | 171 | except Exception as e: 172 | logger.error(f"Error downloading data: {e}") 173 | return False 174 | 175 | 176 | def create_sample_data(data_dir: Path, sample_size: int = 1_000_000): 177 | """ 178 | Create a smaller sample dataset for testing. 179 | 180 | Args: 181 | data_dir: Directory containing data 182 | sample_size: Number of rows to include in sample 183 | """ 184 | train_file = data_dir / 'train.csv' 185 | sample_file = data_dir / 'train_sample.csv' 186 | 187 | if not train_file.exists(): 188 | logger.error(f"Train file not found: {train_file}") 189 | return 190 | 191 | logger.info(f"Creating sample dataset with {sample_size:,} rows...") 192 | 193 | try: 194 | # Use head command for efficiency 195 | result = subprocess.run( 196 | ['head', '-n', str(sample_size + 1), str(train_file)], 197 | capture_output=True, 198 | text=True, 199 | check=True 200 | ) 201 | 202 | with open(sample_file, 'w') as f: 203 | f.write(result.stdout) 204 | 205 | logger.info(f"Sample dataset created: {sample_file}") 206 | 207 | except Exception as e: 208 | logger.error(f"Error creating sample: {e}") 209 | 210 | 211 | def verify_data(data_dir: Path): 212 | """ 213 | Verify that data files exist and have reasonable size. 
214 | 215 | Args: 216 | data_dir: Directory containing data 217 | """ 218 | logger.info("Verifying data files...") 219 | 220 | train_file = data_dir / 'train.csv' 221 | test_file = data_dir / 'test.csv' 222 | 223 | if train_file.exists(): 224 | size_gb = train_file.stat().st_size / (1024 ** 3) 225 | logger.info(f"✓ train.csv exists ({size_gb:.2f} GB)") 226 | 227 | # Expected: ~40 GB 228 | if size_gb < 30 or size_gb > 50: 229 | logger.warning( 230 | f"Warning: train.csv size ({size_gb:.2f} GB) outside expected range (30-50 GB)" 231 | ) 232 | else: 233 | logger.error("✗ train.csv not found") 234 | 235 | if test_file.exists(): 236 | size_gb = test_file.stat().st_size / (1024 ** 3) 237 | logger.info(f"✓ test.csv exists ({size_gb:.2f} GB)") 238 | 239 | # Expected: ~6 GB 240 | if size_gb < 4 or size_gb > 8: 241 | logger.warning( 242 | f"Warning: test.csv size ({size_gb:.2f} GB) outside expected range (4-8 GB)" 243 | ) 244 | else: 245 | logger.error("✗ test.csv not found") 246 | 247 | 248 | def main(): 249 | """Main execution function.""" 250 | import argparse 251 | 252 | parser = argparse.ArgumentParser( 253 | description='Download Criteo CTR dataset' 254 | ) 255 | parser.add_argument( 256 | '--data-dir', 257 | type=str, 258 | default='./data', 259 | help='Directory to save data (default: ./data)' 260 | ) 261 | parser.add_argument( 262 | '--sample', 263 | action='store_true', 264 | help='Create a sample dataset after downloading' 265 | ) 266 | parser.add_argument( 267 | '--sample-size', 268 | type=int, 269 | default=1_000_000, 270 | help='Sample size in rows (default: 1,000,000)' 271 | ) 272 | parser.add_argument( 273 | '--verify-only', 274 | action='store_true', 275 | help='Only verify existing data files' 276 | ) 277 | 278 | args = parser.parse_args() 279 | data_dir = Path(args.data_dir) 280 | 281 | # Create data directory 282 | data_dir.mkdir(parents=True, exist_ok=True) 283 | 284 | logger.info("=" * 80) 285 | logger.info("Criteo CTR Dataset Downloader") 286 | logger.info("=" * 80) 287 | 288 | # Verify only mode 289 | if args.verify_only: 290 | verify_data(data_dir) 291 | return 292 | 293 | # Check if Kaggle API is available 294 | has_kaggle = check_kaggle_api() 295 | 296 | if has_kaggle: 297 | logger.info("Kaggle API detected!") 298 | response = input("Download data using Kaggle API? 
(y/n): ") 299 | 300 | if response.lower() == 'y': 301 | success = download_with_kaggle_api(data_dir) 302 | 303 | if success: 304 | if args.sample: 305 | create_sample_data(data_dir, args.sample_size) 306 | return 307 | else: 308 | logger.warning("Kaggle API not installed or not configured") 309 | print_kaggle_api_setup() 310 | print() 311 | 312 | # Fall back to manual instructions 313 | print_manual_instructions() 314 | 315 | # Verify if data already exists 316 | logger.info("\nChecking for existing data files...") 317 | verify_data(data_dir) 318 | 319 | 320 | if __name__ == '__main__': 321 | main() 322 | -------------------------------------------------------------------------------- /py_lh4_modernized.py: -------------------------------------------------------------------------------- 1 | """ 2 | Logistic Regression with Stochastic Gradient Descent (SGD) for CTR Prediction 3 | 4 | This module implements a memory-efficient logistic regression model using: 5 | - Hash trick for feature engineering (2^27 dimensional space) 6 | - Adaptive learning rate for SGD optimization 7 | - Bounded numerical operations for stability 8 | 9 | Modernized from original py_lh4.py with: 10 | - Python 3 compatibility 11 | - Error handling and logging 12 | - Input validation 13 | - Configurable parameters 14 | - Type hints 15 | """ 16 | 17 | import logging 18 | import sys 19 | from csv import DictReader 20 | from math import exp, log, sqrt 21 | from pathlib import Path 22 | from typing import Dict, List 23 | 24 | # Configure logging 25 | logging.basicConfig( 26 | level=logging.INFO, 27 | format='%(asctime)s - %(levelname)s - %(message)s', 28 | handlers=[ 29 | logging.StreamHandler(sys.stdout), 30 | logging.FileHandler('training.log') 31 | ] 32 | ) 33 | logger = logging.getLogger(__name__) 34 | 35 | 36 | class CTRPredictor: 37 | """Click-Through Rate predictor using Logistic Regression with SGD.""" 38 | 39 | def __init__( 40 | self, 41 | dimension: int = 2**27, 42 | learning_rate: float = 0.145, 43 | log_interval: int = 1_000_000 44 | ): 45 | """ 46 | Initialize the CTR predictor. 47 | 48 | Args: 49 | dimension: Number of features for hash trick (default: 2^27 = ~134M) 50 | learning_rate: Alpha parameter for SGD (default: 0.145) 51 | log_interval: How often to log progress during training 52 | """ 53 | self.D = dimension 54 | self.alpha = learning_rate 55 | self.log_interval = log_interval 56 | 57 | # Initialize model weights and feature counters 58 | self.w: List[float] = [0.0] * self.D 59 | self.n: List[float] = [0.0] * self.D 60 | 61 | logger.info("Initialized CTR Predictor:") 62 | logger.info(f" Dimension: {self.D:,}") 63 | logger.info(f" Learning rate: {self.alpha}") 64 | logger.info(f" Memory usage: ~{(self.D * 16) / (1024**3):.2f} GB") 65 | 66 | @staticmethod 67 | def logloss(p: float, y: float) -> float: 68 | """ 69 | Calculate bounded logarithmic loss. 70 | 71 | Args: 72 | p: Predicted probability (0 to 1) 73 | y: True label (0 or 1) 74 | 75 | Returns: 76 | Logarithmic loss value 77 | """ 78 | # Bound prediction to prevent log(0) 79 | epsilon = 1e-12 80 | p = max(min(p, 1.0 - epsilon), epsilon) 81 | 82 | if y == 1.0: 83 | return -log(p) 84 | else: 85 | return -log(1.0 - p) 86 | 87 | def get_features(self, csv_row: Dict[str, str]) -> List[int]: 88 | """ 89 | Apply hash trick to convert CSV row to feature indices. 90 | 91 | Treats both integer and categorical features as categorical, 92 | using a simple hash function to map to feature space. 
93 | 94 | Args: 95 | csv_row: Dictionary from CSV DictReader 96 | 97 | Returns: 98 | List of feature indices where value is 1 99 | """ 100 | x = [0] # Index 0 is the bias term 101 | 102 | for key, value in csv_row.items(): 103 | if not value: # Skip empty values 104 | continue 105 | 106 | try: 107 | # Simple hash: concatenate value and feature name, convert to hex 108 | # This is intentionally simple for speed (though weak as a hash) 109 | hash_input = value + key[1:] 110 | index = int(hash_input, 16) % self.D 111 | x.append(index) 112 | except (ValueError, IndexError) as e: 113 | logger.warning(f"Skipping invalid feature {key}={value}: {e}") 114 | continue 115 | 116 | return x 117 | 118 | def predict_probability(self, x: List[int]) -> float: 119 | """ 120 | Calculate probability P(y=1|x) using logistic sigmoid. 121 | 122 | Args: 123 | x: List of feature indices 124 | 125 | Returns: 126 | Predicted probability between 0 and 1 127 | """ 128 | # Calculate w^T * x 129 | wTx = sum(self.w[i] for i in x) 130 | 131 | # Apply bounded sigmoid to prevent overflow 132 | # Bound to [-20, 20] before exp 133 | wTx_bounded = max(min(wTx, 20.0), -20.0) 134 | 135 | return 1.0 / (1.0 + exp(-wTx_bounded)) 136 | 137 | def update_weights( 138 | self, 139 | x: List[int], 140 | p: float, 141 | y: float 142 | ) -> None: 143 | """ 144 | Update model weights using SGD with adaptive learning rate. 145 | 146 | Args: 147 | x: Feature indices 148 | p: Predicted probability 149 | y: True label (0 or 1) 150 | """ 151 | for i in x: 152 | # Adaptive learning rate: alpha / (sqrt(n) + 1) 153 | # This decreases learning rate for frequently seen features 154 | adaptive_lr = self.alpha / (sqrt(self.n[i]) + 1.0) 155 | 156 | # Gradient: (p - y) * x[i], where x[i] = 1 for indices in x 157 | gradient = p - y 158 | 159 | # Update weight 160 | self.w[i] -= gradient * adaptive_lr 161 | 162 | # Increment feature counter 163 | self.n[i] += 1.0 164 | 165 | def train(self, train_path: Path) -> None: 166 | """ 167 | Train the model on training data using online SGD. 
168 | 169 | Args: 170 | train_path: Path to training CSV file 171 | 172 | Raises: 173 | FileNotFoundError: If training file doesn't exist 174 | ValueError: If file format is invalid 175 | """ 176 | if not train_path.exists(): 177 | raise FileNotFoundError(f"Training file not found: {train_path}") 178 | 179 | logger.info(f"Starting training from: {train_path}") 180 | logger.info("=" * 80) 181 | 182 | cumulative_loss = 0.0 183 | sample_count = 0 184 | 185 | try: 186 | with open(train_path, 'r', encoding='utf-8') as f: 187 | reader = DictReader(f) 188 | 189 | # Validate required columns 190 | if reader.fieldnames and 'Label' not in reader.fieldnames: 191 | raise ValueError("Training file must contain 'Label' column") 192 | 193 | for t, row in enumerate(reader, start=1): 194 | # Parse label 195 | try: 196 | y = 1.0 if row['Label'] == '1' else 0.0 197 | except KeyError: 198 | logger.error(f"Row {t} missing 'Label' column") 199 | continue 200 | 201 | # Remove label and ID from features 202 | row.pop('Label', None) 203 | row.pop('Id', None) 204 | 205 | # Get hashed features 206 | x = self.get_features(row) 207 | 208 | # Get prediction 209 | p = self.predict_probability(x) 210 | 211 | # Calculate loss for monitoring 212 | loss = self.logloss(p, y) 213 | cumulative_loss += loss 214 | sample_count = t 215 | 216 | # Log progress 217 | if t % self.log_interval == 0: 218 | avg_loss = cumulative_loss / t 219 | logger.info( 220 | f"Processed: {t:,} samples | " 221 | f"Avg logloss: {avg_loss:.6f} | " 222 | f"Current loss: {loss:.6f}" 223 | ) 224 | 225 | # Update model 226 | self.update_weights(x, p, y) 227 | 228 | except Exception as e: 229 | logger.error(f"Error during training: {e}") 230 | raise 231 | 232 | logger.info("=" * 80) 233 | logger.info("Training completed!") 234 | logger.info(f" Total samples: {sample_count:,}") 235 | logger.info(f" Final avg logloss: {cumulative_loss / sample_count:.6f}") 236 | 237 | def predict(self, test_path: Path, output_path: Path) -> None: 238 | """ 239 | Generate predictions for test data and save to CSV. 240 | 241 | Args: 242 | test_path: Path to test CSV file 243 | output_path: Path to save predictions 244 | 245 | Raises: 246 | FileNotFoundError: If test file doesn't exist 247 | """ 248 | if not test_path.exists(): 249 | raise FileNotFoundError(f"Test file not found: {test_path}") 250 | 251 | logger.info(f"Generating predictions from: {test_path}") 252 | logger.info(f"Saving to: {output_path}") 253 | 254 | prediction_count = 0 255 | 256 | try: 257 | with open(test_path, 'r', encoding='utf-8') as f_in, \ 258 | open(output_path, 'w', encoding='utf-8') as f_out: 259 | 260 | # Write header 261 | f_out.write('Id,Predicted\n') 262 | 263 | reader = DictReader(f_in) 264 | 265 | for t, row in enumerate(reader, start=1): 266 | # Get ID 267 | row_id = row.get('Id', str(t)) 268 | row.pop('Id', None) 269 | 270 | # Get features and predict 271 | x = self.get_features(row) 272 | p = self.predict_probability(x) 273 | 274 | # Write prediction 275 | f_out.write(f'{row_id},{p:.10f}\n') 276 | prediction_count = t 277 | 278 | # Log progress 279 | if t % self.log_interval == 0: 280 | logger.info(f"Generated {t:,} predictions") 281 | 282 | except Exception as e: 283 | logger.error(f"Error during prediction: {e}") 284 | raise 285 | 286 | logger.info(f"Prediction completed! 
Generated {prediction_count:,} predictions") 287 | 288 | 289 | def main(): 290 | """Main execution function.""" 291 | # Configuration 292 | TRAIN_FILE = Path('train.csv') 293 | TEST_FILE = Path('test.csv') 294 | OUTPUT_FILE = Path('submission.csv') 295 | 296 | # Model hyperparameters 297 | DIMENSION = 2 ** 27 # ~134M features 298 | LEARNING_RATE = 0.145 299 | 300 | try: 301 | # Initialize model 302 | model = CTRPredictor( 303 | dimension=DIMENSION, 304 | learning_rate=LEARNING_RATE 305 | ) 306 | 307 | # Train model 308 | model.train(TRAIN_FILE) 309 | 310 | # Generate predictions 311 | model.predict(TEST_FILE, OUTPUT_FILE) 312 | 313 | logger.info("All tasks completed successfully!") 314 | 315 | except FileNotFoundError as e: 316 | logger.error(f"File error: {e}") 317 | logger.error("Please ensure train.csv and test.csv are in the current directory") 318 | logger.error("Download from: https://www.kaggle.com/c/criteo-display-ad-challenge/data") 319 | sys.exit(1) 320 | except Exception as e: 321 | logger.error(f"Unexpected error: {e}") 322 | sys.exit(1) 323 | 324 | 325 | if __name__ == '__main__': 326 | main() 327 | -------------------------------------------------------------------------------- /logReg_modernized.py: -------------------------------------------------------------------------------- 1 | """ 2 | Logistic Regression Training and Testing Module 3 | 4 | Implements logistic regression with multiple optimization algorithms: 5 | - Gradient Descent (GD) 6 | - Stochastic Gradient Descent (SGD) 7 | - Smooth Stochastic Gradient Descent (Smooth SGD) 8 | 9 | Modernized from original logReg.py with: 10 | - Python 3 compatibility 11 | - Explicit imports (no wildcards) 12 | - Type hints 13 | - Error handling 14 | - Better documentation 15 | """ 16 | 17 | import logging 18 | import time 19 | from typing import Dict, Any, Tuple 20 | 21 | import numpy as np 22 | import matplotlib.pyplot as plt 23 | 24 | # Configure logging 25 | logging.basicConfig( 26 | level=logging.INFO, 27 | format='%(asctime)s - %(levelname)s - %(message)s' 28 | ) 29 | logger = logging.getLogger(__name__) 30 | 31 | 32 | def sigmoid(x: np.ndarray) -> np.ndarray: 33 | """ 34 | Calculate the sigmoid (logistic) function. 35 | 36 | Args: 37 | x: Input array or scalar 38 | 39 | Returns: 40 | Sigmoid of input, bounded between 0 and 1 41 | """ 42 | return 1.0 / (1.0 + np.exp(-x)) 43 | 44 | 45 | def train_logistic_regression( 46 | train_x: np.ndarray, 47 | train_y: np.ndarray, 48 | opts: Dict[str, Any] 49 | ) -> np.ndarray: 50 | """ 51 | Train a logistic regression model using specified optimization algorithm. 52 | 53 | Args: 54 | train_x: Training features, shape (num_samples, num_features). 55 | Should include bias term (column of ones) if needed. 
56 | train_y: Training labels, shape (num_samples, 1) 57 | opts: Dictionary with training options: 58 | - 'alpha': Learning rate (float) 59 | - 'maxIter': Maximum iterations (int) 60 | - 'optimizeType': Optimization algorithm (str) 61 | Options: 'gradDescent', 'stocGradDescent', 'smoothStocGradDescent' 62 | 63 | Returns: 64 | weights: Trained weight vector, shape (num_features, 1) 65 | 66 | Raises: 67 | ValueError: If optimize type is not recognized 68 | TypeError: If input arrays have wrong shape 69 | """ 70 | # Validate inputs 71 | if train_x.shape[0] != train_y.shape[0]: 72 | raise ValueError( 73 | f"Sample count mismatch: train_x has {train_x.shape[0]} samples, " 74 | f"train_y has {train_y.shape[0]} samples" 75 | ) 76 | 77 | start_time = time.time() 78 | 79 | num_samples, num_features = train_x.shape 80 | alpha = opts.get('alpha', 0.01) 81 | max_iter = opts.get('maxIter', 1000) 82 | optimize_type = opts.get('optimizeType', 'gradDescent') 83 | 84 | logger.info(f"Training Logistic Regression:") 85 | logger.info(f" Samples: {num_samples}") 86 | logger.info(f" Features: {num_features}") 87 | logger.info(f" Algorithm: {optimize_type}") 88 | logger.info(f" Learning rate: {alpha}") 89 | logger.info(f" Max iterations: {max_iter}") 90 | 91 | # Initialize weights 92 | weights = np.ones((num_features, 1)) 93 | 94 | # Optimize using selected algorithm 95 | if optimize_type == 'gradDescent': 96 | weights = _gradient_descent(train_x, train_y, weights, alpha, max_iter) 97 | 98 | elif optimize_type == 'stocGradDescent': 99 | weights = _stochastic_gradient_descent( 100 | train_x, train_y, weights, alpha, max_iter 101 | ) 102 | 103 | elif optimize_type == 'smoothStocGradDescent': 104 | weights = _smooth_stochastic_gradient_descent( 105 | train_x, train_y, weights, alpha, max_iter 106 | ) 107 | 108 | else: 109 | raise ValueError( 110 | f"Unsupported optimize type: {optimize_type}. " 111 | f"Must be 'gradDescent', 'stocGradDescent', or 'smoothStocGradDescent'" 112 | ) 113 | 114 | elapsed = time.time() - start_time 115 | logger.info(f"Training completed in {elapsed:.2f} seconds") 116 | 117 | return weights 118 | 119 | 120 | def _gradient_descent( 121 | train_x: np.ndarray, 122 | train_y: np.ndarray, 123 | weights: np.ndarray, 124 | alpha: float, 125 | max_iter: int 126 | ) -> np.ndarray: 127 | """ 128 | Batch gradient descent optimization. 129 | 130 | Updates weights using all samples in each iteration. 131 | """ 132 | for k in range(max_iter): 133 | # Forward pass 134 | output = sigmoid(train_x @ weights) 135 | 136 | # Calculate error 137 | error = train_y - output 138 | 139 | # Update weights using all samples 140 | weights = weights + alpha * (train_x.T @ error) 141 | 142 | # Log progress 143 | if (k + 1) % 100 == 0: 144 | loss = np.mean(-train_y * np.log(output + 1e-10) - 145 | (1 - train_y) * np.log(1 - output + 1e-10)) 146 | logger.info(f"Iteration {k+1}/{max_iter}, Loss: {loss:.6f}") 147 | 148 | return weights 149 | 150 | 151 | def _stochastic_gradient_descent( 152 | train_x: np.ndarray, 153 | train_y: np.ndarray, 154 | weights: np.ndarray, 155 | alpha: float, 156 | max_iter: int 157 | ) -> np.ndarray: 158 | """ 159 | Stochastic gradient descent optimization. 160 | 161 | Updates weights using one sample at a time. 
162 | """ 163 | num_samples = train_x.shape[0] 164 | 165 | for k in range(max_iter): 166 | for i in range(num_samples): 167 | # Get single sample 168 | x_i = train_x[i:i+1, :].T # Shape: (num_features, 1) 169 | y_i = train_y[i, 0] 170 | 171 | # Forward pass 172 | output = sigmoid((x_i.T @ weights)[0, 0]) 173 | 174 | # Calculate error 175 | error = y_i - output 176 | 177 | # Update weights 178 | weights = weights + alpha * x_i * error 179 | 180 | # Log progress 181 | if (k + 1) % 100 == 0: 182 | output_all = sigmoid(train_x @ weights) 183 | loss = np.mean(-train_y * np.log(output_all + 1e-10) - 184 | (1 - train_y) * np.log(1 - output_all + 1e-10)) 185 | logger.info(f"Iteration {k+1}/{max_iter}, Loss: {loss:.6f}") 186 | 187 | return weights 188 | 189 | 190 | def _smooth_stochastic_gradient_descent( 191 | train_x: np.ndarray, 192 | train_y: np.ndarray, 193 | weights: np.ndarray, 194 | alpha: float, 195 | max_iter: int 196 | ) -> np.ndarray: 197 | """ 198 | Smooth stochastic gradient descent optimization. 199 | 200 | Uses random sample selection and adaptive learning rate to reduce oscillations. 201 | """ 202 | num_samples = train_x.shape[0] 203 | 204 | for k in range(max_iter): 205 | # Create random order of samples 206 | indices = list(range(num_samples)) 207 | np.random.shuffle(indices) 208 | 209 | for i, idx in enumerate(indices): 210 | # Adaptive learning rate that decreases over time 211 | adaptive_alpha = 4.0 / (1.0 + k + i) + 0.01 212 | 213 | # Get single sample 214 | x_i = train_x[idx:idx+1, :].T # Shape: (num_features, 1) 215 | y_i = train_y[idx, 0] 216 | 217 | # Forward pass 218 | output = sigmoid((x_i.T @ weights)[0, 0]) 219 | 220 | # Calculate error 221 | error = y_i - output 222 | 223 | # Update weights with adaptive learning rate 224 | weights = weights + adaptive_alpha * x_i * error 225 | 226 | # Log progress 227 | if (k + 1) % 100 == 0: 228 | output_all = sigmoid(train_x @ weights) 229 | loss = np.mean(-train_y * np.log(output_all + 1e-10) - 230 | (1 - train_y) * np.log(1 - output_all + 1e-10)) 231 | logger.info(f"Iteration {k+1}/{max_iter}, Loss: {loss:.6f}") 232 | 233 | return weights 234 | 235 | 236 | def test_logistic_regression( 237 | weights: np.ndarray, 238 | test_x: np.ndarray, 239 | test_y: np.ndarray 240 | ) -> float: 241 | """ 242 | Test trained logistic regression model and calculate accuracy. 243 | 244 | Args: 245 | weights: Trained weight vector, shape (num_features, 1) 246 | test_x: Test features, shape (num_samples, num_features) 247 | test_y: Test labels, shape (num_samples, 1) 248 | 249 | Returns: 250 | accuracy: Proportion of correct predictions (0 to 1) 251 | """ 252 | num_samples = test_x.shape[0] 253 | match_count = 0 254 | 255 | for i in range(num_samples): 256 | # Get prediction probability 257 | x_i = test_x[i:i+1, :] 258 | prob = sigmoid((x_i @ weights)[0, 0]) 259 | 260 | # Convert to binary prediction (threshold at 0.5) 261 | predict = prob > 0.5 262 | 263 | # Check if correct 264 | if predict == bool(test_y[i, 0]): 265 | match_count += 1 266 | 267 | accuracy = match_count / num_samples 268 | 269 | logger.info(f"Test Results:") 270 | logger.info(f" Correct: {match_count}/{num_samples}") 271 | logger.info(f" Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)") 272 | 273 | return accuracy 274 | 275 | 276 | def visualize_logistic_regression( 277 | weights: np.ndarray, 278 | train_x: np.ndarray, 279 | train_y: np.ndarray 280 | ) -> None: 281 | """ 282 | Visualize the trained logistic regression decision boundary. 
283 | 284 | Note: Only works with 2D data (3 features including bias). 285 | 286 | Args: 287 | weights: Trained weight vector 288 | train_x: Training features including bias column 289 | train_y: Training labels 290 | 291 | Raises: 292 | ValueError: If data is not 2D 293 | """ 294 | num_samples, num_features = train_x.shape 295 | 296 | if num_features != 3: 297 | raise ValueError( 298 | f"Visualization only supports 2D data (3 features with bias). " 299 | f"Got {num_features} features." 300 | ) 301 | 302 | logger.info("Generating visualization...") 303 | 304 | # Plot samples 305 | for i in range(num_samples): 306 | if int(train_y[i, 0]) == 0: 307 | plt.plot(train_x[i, 1], train_x[i, 2], 'or', label='Class 0' if i == 0 else '') 308 | else: 309 | plt.plot(train_x[i, 1], train_x[i, 2], 'ob', label='Class 1' if i == 0 else '') 310 | 311 | # Draw decision boundary 312 | # Line equation: w0 + w1*x1 + w2*x2 = 0 313 | # Solve for x2: x2 = -(w0 + w1*x1) / w2 314 | min_x = np.min(train_x[:, 1]) 315 | max_x = np.max(train_x[:, 1]) 316 | 317 | w = weights.flatten() 318 | y_min = -(w[0] + w[1] * min_x) / w[2] 319 | y_max = -(w[0] + w[1] * max_x) / w[2] 320 | 321 | plt.plot([min_x, max_x], [y_min, y_max], '-g', linewidth=2, label='Decision Boundary') 322 | 323 | plt.xlabel('Feature 1') 324 | plt.ylabel('Feature 2') 325 | plt.title('Logistic Regression Decision Boundary') 326 | plt.legend() 327 | plt.grid(True, alpha=0.3) 328 | plt.show() 329 | 330 | 331 | # Example usage 332 | if __name__ == '__main__': 333 | # Generate sample data 334 | np.random.seed(42) 335 | 336 | # Create synthetic 2D dataset 337 | num_samples = 100 338 | 339 | # Class 0: centered at (2, 2) 340 | class0 = np.random.randn(num_samples // 2, 2) + np.array([2, 2]) 341 | 342 | # Class 1: centered at (5, 5) 343 | class1 = np.random.randn(num_samples // 2, 2) + np.array([5, 5]) 344 | 345 | # Combine data 346 | X = np.vstack([class0, class1]) 347 | y = np.vstack([np.zeros((num_samples // 2, 1)), np.ones((num_samples // 2, 1))]) 348 | 349 | # Add bias term 350 | X_with_bias = np.hstack([np.ones((num_samples, 1)), X]) 351 | 352 | # Training options 353 | options = { 354 | 'alpha': 0.01, 355 | 'maxIter': 500, 356 | 'optimizeType': 'smoothStocGradDescent' 357 | } 358 | 359 | # Train model 360 | trained_weights = train_logistic_regression(X_with_bias, y, options) 361 | 362 | # Test model 363 | accuracy = test_logistic_regression(trained_weights, X_with_bias, y) 364 | 365 | # Visualize (optional - uncomment to show plot) 366 | # visualize_logistic_regression(trained_weights, X_with_bias, y) 367 | -------------------------------------------------------------------------------- /README_NEW.md: -------------------------------------------------------------------------------- 1 | # Predict Click-Through Rates on Display Ads 2 | 3 | A machine learning project for predicting Click-Through Rates (CTR) in display advertising using the Criteo dataset. This repository implements multiple algorithms including Logistic Regression with SGD, Gradient Boosting Machines, and Vowpal Wabbit models. 
4 | 5 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 6 | ![Python](https://img.shields.io/badge/python-3.8+-blue.svg) 7 | ![R](https://img.shields.io/badge/R-4.0+-blue.svg) 8 | 9 | --- 10 | 11 | ## Table of Contents 12 | 13 | - [Overview](#overview) 14 | - [Project Status](#project-status) 15 | - [Features](#features) 16 | - [Installation](#installation) 17 | - [Quick Start](#quick-start) 18 | - [Data](#data) 19 | - [Models](#models) 20 | - [Project Structure](#project-structure) 21 | - [Usage](#usage) 22 | - [Development](#development) 23 | - [Contributing](#contributing) 24 | - [License](#license) 25 | - [Acknowledgments](#acknowledgments) 26 | 27 | --- 28 | 29 | ## Overview 30 | 31 | Display advertising is a billion-dollar industry and one of the central applications of machine learning on the Internet. This project was developed for the **Criteo Display Advertising Challenge**, where the goal is to predict the probability that a user will click on a given ad (CTR). 32 | 33 | ### The Challenge 34 | 35 | Given: 36 | - User information 37 | - Page context 38 | - Ad features (39 anonymized features: 13 numerical, 26 categorical) 39 | 40 | Predict: 41 | - Probability of click (binary classification) 42 | 43 | ### Evaluation Metric 44 | 45 | - **Log Loss (Logarithmic Loss)**: Lower is better 46 | 47 | --- 48 | 49 | ## Project Status 50 | 51 | **Version:** 2.0 (Modernized) 52 | 53 | **Status:** 54 | - ✅ Python 3 migration complete 55 | - ✅ Core algorithms refactored with modern best practices 56 | - ✅ Documentation updated 57 | - ✅ Error handling and logging added 58 | - 🚧 Unit tests in progress 59 | - 🚧 CI/CD pipeline in progress 60 | 61 | **Legacy Code:** 62 | - Original Python 2 implementations preserved with `_legacy` suffix 63 | - See migration guide below for differences 64 | 65 | --- 66 | 67 | ## Features 68 | 69 | ### Algorithms Implemented 70 | 71 | 1. **Logistic Regression with SGD** (`py_lh4_modernized.py`) 72 | - Hash trick for feature engineering (2^27 dimensional space) 73 | - Adaptive learning rate 74 | - Memory-efficient online learning 75 | - Bounded numerical operations for stability 76 | 77 | 2. **Gradient Boosting Machine** (`gbm_modernized.R`) 78 | - Tree-based ensemble method 79 | - Cross-validation with ROC optimization 80 | - Configurable hyperparameters 81 | 82 | 3. 
**Vowpal Wabbit** (`csv_to_vw_modernized.py`) 83 | - Fast linear learner 84 | - CSV to VW format converter 85 | - Scalable to massive datasets 86 | 87 | ### Modern Features 88 | 89 | - ✨ Python 3.8+ compatibility 90 | - 🔒 Input validation and error handling 91 | - 📊 Comprehensive logging 92 | - ⚙️ Configurable parameters 93 | - 📝 Type hints and documentation 94 | - 🧪 Unit test infrastructure 95 | - 🐳 Docker support (coming soon) 96 | 97 | --- 98 | 99 | ## Installation 100 | 101 | ### Prerequisites 102 | 103 | - Python 3.8 or higher 104 | - R 4.0 or higher (for R models) 105 | - Vowpal Wabbit (for VW models) 106 | - Git 107 | 108 | ### Clone Repository 109 | 110 | ```bash 111 | git clone https://github.com/yourusername/Predict-click-through-rates-on-display-ads.git 112 | cd Predict-click-through-rates-on-display-ads 113 | ``` 114 | 115 | ### Python Setup 116 | 117 | ```bash 118 | # Create virtual environment 119 | python -m venv venv 120 | 121 | # Activate virtual environment 122 | # On macOS/Linux: 123 | source venv/bin/activate 124 | # On Windows: 125 | # venv\Scripts\activate 126 | 127 | # Install dependencies 128 | pip install -r requirements.txt 129 | ``` 130 | 131 | ### R Setup 132 | 133 | ```bash 134 | # Install R packages 135 | Rscript -e "install.packages(c('data.table', 'caret', 'gbm'), dependencies=TRUE)" 136 | ``` 137 | 138 | ### Vowpal Wabbit Setup 139 | 140 | ```bash 141 | # macOS 142 | brew install vowpal-wabbit 143 | 144 | # Ubuntu/Debian 145 | sudo apt-get install vowpal-wabbit 146 | 147 | # From source (all platforms) 148 | git clone https://github.com/VowpalWabbit/vowpal_wabbit.git 149 | cd vowpal_wabbit 150 | make 151 | ``` 152 | 153 | --- 154 | 155 | ## Quick Start 156 | 157 | ### 1. Download Data 158 | 159 | ```bash 160 | # Create data directory 161 | mkdir -p data 162 | 163 | # Download from Kaggle 164 | # Visit: https://www.kaggle.com/c/criteo-display-ad-challenge/data 165 | # Place train.csv and test.csv in data/ directory 166 | ``` 167 | 168 | Or use the download script: 169 | 170 | ```bash 171 | python scripts/download_data.py 172 | ``` 173 | 174 | ### 2. Train a Model 175 | 176 | **Logistic Regression (Python):** 177 | 178 | ```bash 179 | python py_lh4_modernized.py 180 | ``` 181 | 182 | **Gradient Boosting Machine (R):** 183 | 184 | ```bash 185 | DATA_DIR=./data OUTPUT_DIR=./output Rscript gbm_modernized.R 186 | ``` 187 | 188 | **Vowpal Wabbit:** 189 | 190 | ```bash 191 | # Convert CSV to VW format 192 | python csv_to_vw_modernized.py data/train.csv data/train.vw 193 | python csv_to_vw_modernized.py data/test.csv data/test.vw --test 194 | 195 | # Train model 196 | vw data/train.vw -f models/click.model --passes 3 --cache_file data/train.cache 197 | 198 | # Generate predictions 199 | vw data/test.vw -i models/click.model -t -p predictions.txt 200 | ``` 201 | 202 | ### 3. 
Generate Submission 203 | 204 | ```bash 205 | # Predictions are automatically saved to submission.csv (Python) 206 | # or specified output directory (R) 207 | ``` 208 | 209 | --- 210 | 211 | ## Data 212 | 213 | ### Dataset Information 214 | 215 | - **Source:** [Criteo Labs](https://www.kaggle.com/c/criteo-display-ad-challenge/data) 216 | - **Size:** 217 | - Training: ~45 million samples 218 | - Test: ~6 million samples 219 | - **Features:** 39 anonymized features 220 | - 13 numerical features (`I1-I13`) 221 | - 26 categorical features (`C1-C26`) 222 | - **Target:** Binary (0 = no click, 1 = click) 223 | 224 | ### Data Format 225 | 226 | **CSV Format:** 227 | 228 | ``` 229 | Id,Label,I1,I2,...,I13,C1,C2,...,C26 230 | 1,0,1,5,...,45,68fd1e,80e26c,...,a458ea 231 | 2,1,2,,,7,68fd1e,80e26c,...,b458ea 232 | ``` 233 | 234 | **Vowpal Wabbit Format:** 235 | 236 | ``` 237 | 1 'id1 |i I1:1 I2:5 I13:45 |c 68fd1e 80e26c a458ea 238 | -1 'id2 |i I1:2 I13:7 |c 68fd1e 80e26c b458ea 239 | ``` 240 | 241 | ### Data Preprocessing 242 | 243 | Missing values are common in this dataset: 244 | 245 | ```python 246 | # Option 1: Use models that handle missing values (GBM, Random Forest) 247 | # Option 2: Impute missing values 248 | from sklearn.impute import SimpleImputer 249 | imputer = SimpleImputer(strategy='median') 250 | X_imputed = imputer.fit_transform(X) 251 | ``` 252 | 253 | --- 254 | 255 | ## Models 256 | 257 | ### 1. Logistic Regression with SGD 258 | 259 | **File:** `py_lh4_modernized.py` 260 | 261 | **Features:** 262 | - Hash trick for feature engineering 263 | - Adaptive learning rate: α/(√n + 1) 264 | - Bounded sigmoid and log loss for numerical stability 265 | - Memory-efficient: processes one sample at a time 266 | 267 | **Hyperparameters:** 268 | ```python 269 | DIMENSION = 2**27 # ~134M features 270 | LEARNING_RATE = 0.145 # Alpha for SGD 271 | ``` 272 | 273 | **Usage:** 274 | ```python 275 | from py_lh4_modernized import CTRPredictor 276 | 277 | model = CTRPredictor(dimension=2**27, learning_rate=0.145) 278 | model.train(Path('data/train.csv')) 279 | model.predict(Path('data/test.csv'), Path('submission.csv')) 280 | ``` 281 | 282 | **Expected Performance:** 283 | - Training time: 30-60 minutes (45M samples) 284 | - Memory: ~2GB 285 | - Log loss: ~0.44-0.46 286 | 287 | ### 2. Gradient Boosting Machine (GBM) 288 | 289 | **File:** `gbm_modernized.R` 290 | 291 | **Features:** 292 | - Tree-based ensemble method 293 | - Cross-validation with ROC optimization 294 | - Handles missing values automatically 295 | 296 | **Hyperparameters:** 297 | ```r 298 | N_TREES <- 500 299 | INTERACTION_DEPTH <- 22 300 | SHRINKAGE <- 0.1 301 | ``` 302 | 303 | **Usage:** 304 | ```bash 305 | DATA_DIR=./data OUTPUT_DIR=./output Rscript gbm_modernized.R 306 | ``` 307 | 308 | **Expected Performance:** 309 | - Training time: 2-4 hours 310 | - Memory: ~8-16GB 311 | - Log loss: ~0.43-0.45 312 | 313 | ### 3. 
Vowpal Wabbit 314 | 315 | **Files:** `csv_to_vw_modernized.py`, VW commands 316 | 317 | **Features:** 318 | - Extremely fast linear learner 319 | - Scalable to billions of samples 320 | - Online learning capable 321 | 322 | **Usage:** 323 | ```bash 324 | # Convert format 325 | python csv_to_vw_modernized.py data/train.csv data/train.vw 326 | 327 | # Train with multiple passes 328 | vw data/train.vw \ 329 | --loss_function logistic \ 330 | --passes 3 \ 331 | --cache_file data/train.cache \ 332 | -f models/click.model 333 | 334 | # Predict 335 | vw data/test.vw \ 336 | -i models/click.model \ 337 | -t \ 338 | -p predictions.txt 339 | ``` 340 | 341 | **Expected Performance:** 342 | - Training time: 10-20 minutes 343 | - Memory: ~2-4GB 344 | - Log loss: ~0.44-0.46 345 | 346 | --- 347 | 348 | ## Project Structure 349 | 350 | ``` 351 | Predict-click-through-rates-on-display-ads/ 352 | │ 353 | ├── README.md # This file 354 | ├── LICENSE # MIT License 355 | ├── requirements.txt # Python dependencies 356 | ├── requirements-r.txt # R dependencies 357 | ├── .gitignore # Git ignore rules 358 | │ 359 | ├── data/ # Data directory (not in repo) 360 | │ ├── train.csv # Training data (download separately) 361 | │ ├── test.csv # Test data (download separately) 362 | │ ├── train.vw # VW format training data 363 | │ └── test.vw # VW format test data 364 | │ 365 | ├── models/ # Trained models 366 | │ ├── click.model.vw # Vowpal Wabbit model 367 | │ └── gbm_model.rds # R GBM model 368 | │ 369 | ├── output/ # Output directory 370 | │ ├── submission.csv # Predictions for submission 371 | │ └── training.log # Training logs 372 | │ 373 | ├── Modern Implementations: 374 | │ ├── py_lh4_modernized.py # LR with SGD (Python 3) 375 | │ ├── logReg_modernized.py # LR module (Python 3) 376 | │ ├── csv_to_vw_modernized.py # CSV to VW converter (Python 3) 377 | │ └── gbm_modernized.R # GBM (R, modern) 378 | │ 379 | ├── Legacy Code (Original): 380 | │ ├── py_lh4.py # Original LR implementation 381 | │ ├── logReg.py # Original LR module 382 | │ ├── gbm.R # Original GBM script 383 | │ └── [other legacy files] 384 | │ 385 | ├── tests/ # Unit tests 386 | │ ├── test_lr_model.py 387 | │ ├── test_data_loading.py 388 | │ └── test_vw_converter.py 389 | │ 390 | ├── scripts/ # Utility scripts 391 | │ ├── download_data.py # Data download helper 392 | │ └── evaluate_model.py # Model evaluation 393 | │ 394 | └── docs/ # Additional documentation 395 | ├── MIGRATION_GUIDE.md # Python 2 to 3 migration notes 396 | ├── PERFORMANCE.md # Performance benchmarks 397 | └── API.md # API documentation 398 | ``` 399 | 400 | --- 401 | 402 | ## Usage 403 | 404 | ### Configuration 405 | 406 | **Python Models:** 407 | 408 | Edit configuration at the top of each script: 409 | 410 | ```python 411 | # In py_lh4_modernized.py 412 | TRAIN_FILE = Path('data/train.csv') 413 | TEST_FILE = Path('data/test.csv') 414 | OUTPUT_FILE = Path('submission.csv') 415 | DIMENSION = 2 ** 27 416 | LEARNING_RATE = 0.145 417 | ``` 418 | 419 | **R Models:** 420 | 421 | Use environment variables: 422 | 423 | ```bash 424 | export DATA_DIR=/path/to/data 425 | export OUTPUT_DIR=/path/to/output 426 | Rscript gbm_modernized.R 427 | ``` 428 | 429 | ### Training Options 430 | 431 | **Full Training:** 432 | ```bash 433 | python py_lh4_modernized.py 434 | ``` 435 | 436 | **Sample Training (first 1M rows):** 437 | ```bash 438 | head -n 1000001 data/train.csv > data/train_sample.csv 439 | # Update script to use train_sample.csv 440 | python py_lh4_modernized.py 441 | ``` 442 | 443 | ### 
Monitoring Training 444 | 445 | Training progress is logged to both console and `training.log`: 446 | 447 | ```bash 448 | # Watch log in real-time 449 | tail -f training.log 450 | ``` 451 | 452 | ### Model Evaluation 453 | 454 | ```bash 455 | python scripts/evaluate_model.py \ 456 | --predictions submission.csv \ 457 | --ground_truth data/train_with_labels.csv 458 | ``` 459 | 460 | --- 461 | 462 | ## Development 463 | 464 | ### Setting Up Development Environment 465 | 466 | ```bash 467 | # Install development dependencies 468 | pip install -r requirements.txt 469 | 470 | # Install pre-commit hooks (coming soon) 471 | pre-commit install 472 | 473 | # Run code formatters 474 | black . 475 | ``` 476 | 477 | ### Running Tests 478 | 479 | ```bash 480 | # Run all tests 481 | pytest 482 | 483 | # Run with coverage 484 | pytest --cov=. --cov-report=html 485 | 486 | # Run specific test 487 | pytest tests/test_lr_model.py 488 | ``` 489 | 490 | ### Code Style 491 | 492 | This project follows: 493 | - **Python:** PEP 8, enforced with `black` and `flake8` 494 | - **R:** Tidyverse style guide 495 | - **Documentation:** Google-style docstrings 496 | 497 | --- 498 | 499 | ## Contributing 500 | 501 | Contributions are welcome! Please: 502 | 503 | 1. Fork the repository 504 | 2. Create a feature branch (`git checkout -b feature/amazing-feature`) 505 | 3. Commit your changes (`git commit -m 'Add amazing feature'`) 506 | 4. Push to the branch (`git push origin feature/amazing-feature`) 507 | 5. Open a Pull Request 508 | 509 | ### Areas for Contribution 510 | 511 | - [ ] Deep learning models (Neural Networks, LSTM) 512 | - [ ] Feature engineering improvements 513 | - [ ] Hyperparameter optimization 514 | - [ ] Ensemble methods 515 | - [ ] Docker containerization 516 | - [ ] Web API for predictions 517 | - [ ] Performance optimizations 518 | 519 | --- 520 | 521 | ## License 522 | 523 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. 524 | 525 | Copyright (c) 2019-2025 Tianxiang Liu 526 | 527 | --- 528 | 529 | ## Acknowledgments 530 | 531 | ### Dataset 532 | - **Criteo Labs** for providing the dataset 533 | - Kaggle for hosting the competition 534 | 535 | ### References 536 | - Original Vowpal Wabbit converter by Triskelion and Zygmunt Zając 537 | - Criteo Display Advertising Challenge: https://www.kaggle.com/c/criteo-display-ad-challenge 538 | 539 | ### Papers 540 | - Rendle, S. (2010). "Factorization Machines" 541 | - McMahan, H. B. et al. (2013). "Ad Click Prediction: a View from the Trenches" 542 | - He, X. et al. (2014). "Practical Lessons from Predicting Clicks on Ads at Facebook" 543 | 544 | --- 545 | 546 | ## Contact 547 | 548 | For questions or issues, please: 549 | - Open an issue on GitHub 550 | - Contact: [your-email@example.com] 551 | 552 | --- 553 | 554 | ## Changelog 555 | 556 | ### Version 2.0 (2025) 557 | - ✨ Migrated to Python 3.8+ 558 | - ✨ Added comprehensive error handling and logging 559 | - ✨ Refactored code with type hints and documentation 560 | - ✨ Added configuration management 561 | - ✨ Improved README with detailed instructions 562 | - ✨ Added unit test infrastructure 563 | 564 | ### Version 1.0 (2014) 565 | - Initial implementation 566 | - Python 2 codebase 567 | - Multiple algorithm implementations 568 | 569 | --- 570 | 571 | **Happy Machine Learning! 🚀** 572 | --------------------------------------------------------------------------------
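The `scripts/evaluate_model.py` helper referenced in the Usage and Project Structure sections is not included in this snapshot. A minimal sketch of what such a scorer could look like is shown below; it assumes a predictions CSV with `Id,Predicted` columns (as produced by `py_lh4_modernized.py`) and a ground-truth CSV with `Id,Label` columns, and computes the bounded log loss used as the competition metric. The file name, flags, and helper functions here are illustrative assumptions, not part of the original repository.

```python
"""Minimal log-loss scorer (hypothetical sketch of scripts/evaluate_model.py)."""
import argparse
import csv
from math import log


def load_column(path: str, column: str) -> dict:
    """Read one numeric column of a CSV into a dict keyed by Id."""
    with open(path, newline='') as f:
        return {row['Id']: float(row[column]) for row in csv.DictReader(f)}


def log_loss(y_true: dict, y_pred: dict, eps: float = 1e-12) -> float:
    """Average bounded logarithmic loss over all Ids present in y_true."""
    total = 0.0
    for row_id, y in y_true.items():
        p = min(max(y_pred[row_id], eps), 1.0 - eps)  # bound to avoid log(0)
        total += -(y * log(p) + (1.0 - y) * log(1.0 - p))
    return total / len(y_true)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Score a submission with log loss')
    parser.add_argument('--predictions', required=True, help='CSV with Id,Predicted columns')
    parser.add_argument('--ground_truth', required=True, help='CSV with Id,Label columns')
    args = parser.parse_args()

    preds = load_column(args.predictions, 'Predicted')
    labels = load_column(args.ground_truth, 'Label')
    print(f'Log loss: {log_loss(labels, preds):.6f}')
```

Invoked as in the Model Evaluation section (`python scripts/evaluate_model.py --predictions submission.csv --ground_truth data/train_with_labels.csv`), it would print a single log-loss figure comparable to the expected-performance ranges quoted for each model.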