├── README.md
├── comparisonCode
    ├── ADPCP.m
    ├── MVE.m
    ├── compareCovEstimators.m
    ├── compareGenomicData.m
    ├── compareMeanEstimators.m
    ├── geoMedianGaussianMean.m
    ├── norm_nuc.m
    ├── pruneGaussianCov.m
    ├── pruneGaussianMean.m
    ├── ransacGaussianMean.m
    ├── ransacMVE.m
    ├── shrinkage.m
    ├── specThresh.m
    └── testGeoMedian.m
├── filterCode
    ├── filterGaussianCov.m
    ├── filterGaussianCovTuned.m
    ├── filterGaussianMean.m
    ├── findMaxPoly.m
    ├── flatten.m
    ├── mahalanobis.m
    └── sharpen.m
├── genomicData
    └── colors.txt
└── testCode
    ├── testGaussianCov.m
    ├── testGaussianMean.m
    └── testGenomicData.m


/README.md:
--------------------------------------------------------------------------------
 1 | # Being Robust (in High Dimensions) Can Be Practical
 2 | A MATLAB implementation of [Being Robust (in High Dimensions) Can Be Practical](https://arxiv.org/abs/1703.00893) from ICML 2017.
 3 | 
 4 | Prerequisites 
 5 | ===
 6 | This project requires installation of the following packages:
 7 | * [Fast SVD and PCA](https://www.mathworks.com/matlabcentral/fileexchange/47132-fast-svd-and-pca): Used for faster computation of SVD for some large matrices.
 8 | * [Novembre_etal_2008_misc](https://github.com/NovembreLab/Novembre_etal_2008_misc): A repository containing data from [Genes Mirror Geography in Europe](https://www.nature.com/articles/nature07331). Used for our semi-synthetic experiments. Extract the following three files to `robust-filter/genomicData`: 
 9 |     * `POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eigs`
10 |     * `POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eval`
11 |     * `POPRESID_Color.txt`
12 | 
13 | Optional
14 | ---
15 | The following packages are optional, and are only used for comparison of our algorithm with alternative estimators:
16 | * [AgnosticMeanAndCovarianceCode](https://github.com/kal2000/AgnosticMeanAndCovarianceCode): An implementation of the algorithms from the paper [Agnostic Estimation of Mean and Covariance](https://arxiv.org/abs/1604.06968) from FOCS 2016, authored by [Kevin A. Lai](https://www.cc.gatech.edu/~klai9/), [Anup B. Rao](https://sites.google.com/site/anupraob/), and [Santosh Vempala](https://www.cc.gatech.edu/~vempala/). Used to compare the statistical accuracy of our algorithms against theirs.
17 | * [CVX](http://cvxr.com/cvx/): A MATLAB library for convex optimization. Used for implementing methods from [Robust Principal Component Analysis?](https://dl.acm.org/citation.cfm?id=1970395) and [Robust PCA via Outlier Pursuit](https://arxiv.org/abs/1010.4237).
18 | 
19 | Explanation of Files
20 | ===
21 | This repository contains several algorithms. Below, we identify the files that implement our algorithms and those that are only used for comparison with alternatives.
22 | 
23 | Our Filter algorithms' files
24 | ---
25 | The following are files which are used in our algorithms' implementations, and can all be found in the `filterCode` subdirectory.
26 | * `filterGaussianMean.m`: Our algorithm for estimating the mean of a Gaussian
27 | * `filterGaussianCov.m`: Our algorithm for estimating the covariance of a Gaussian
28 |     * `findMaxPoly.m`: Finds the structured degree-two polynomial which is maximized by the data
29 |     * `flatten.m` and `sharpen.m`: Convert between matrix and vector representations
30 | * `filterGaussianCovTuned.m`: A version of our covariance estimation algorithm which is tuned to select hyperparameters 
31 | * `mahalanobis.m`: Compute Mahalanobis rescaling of a matrix
32 | 
33 | Test files, for demonstrating performance of our estimators for mean and covariance. These can be found in the `testCode` subdirectory:
34 | * `testGaussianMean.m`: Tests our mean estimation algorithm
35 | * `testGaussianCov.m`: Tests our covariance estimation algorithm
36 | * `testGenomicData.m`: Tests our covariance estimation algorithm on a semi-synthetic genomic dataset
37 | 
38 | Other algorithms' files
39 | ---
40 | The following are files which are used for comparison with other competing algorithms, and can be found in the `comparisonCode` subdirectory:
41 | 
42 | Algorithms for estimating the mean of a Gaussian:
43 | * `ransacGaussianMean.m`: A RANSAC-based method
44 | * `geoMedianGaussianMean.m`: Geometric median
45 | * `pruneGaussianMean.m`: Coordinate-wise median, followed by naive pruning of distant points
46 | 
47 | Algorithms for estimating the covariance of a Gaussian:
48 | * `pruneGaussianCov.m`: Naive pruning of distant points
49 | * `ransacMVE.m`: A RANSAC-based method
50 |     * `MVE.m`: Approximating the MVE for a small dataset
51 | * `ADPCP.m`: Principal Component Pursuit by Alternating Directions, from [Robust Principal Component Analysis?](https://dl.acm.org/citation.cfm?id=1970395)
52 |     * `specThresh.m` and `shrinkage.m`: Singular value thresholding and shrinkage operators
53 |     * `norm_nuc.m`: Compute the nuclear norm of a matrix
54 | 
55 | Comparison files, for evaluating and comparing the performance of estimators for mean and covariance:
56 | * `testGeoMedian.m`: Demonstrates that the geometric median incurs an O(sqrt(d)) loss in accuracy
57 | * `compareMeanEstimators.m`: Compares mean estimation algorithms
58 | * `compareCovEstimators.m`: Compares covariance estimation algorithms
59 | * `compareGenomicData.m`: Compares covariance estimation algorithms on a semi-synthetic genomic dataset
60 | 
61 | Reproducibility
62 | ===
63 | Figures in the paper can be approximately reproduced by running the following scripts, which are in the `comparisonCode` subdirectory:
64 | * Figure 1: `compareMeanEstimators.m`
65 | * Figure 2: `compareCovEstimators.m`
66 | * Figures 3 and 4: `compareGenomicData.m`
67 | 
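As a rough sketch of how one of these scripts might be launched: the comparison scripts read data through relative paths such as `../genomicData/...` and call functions that live in `filterCode`, so the snippet below assumes MATLAB's current folder is the repository root and that the prerequisite packages (plus the optional ones, which the comparison scripts call for the LRV and CVX baselines) are already on the MATLAB path.

```matlab
% Assumption: the current folder is the repository root.
cd comparisonCode          % the scripts use paths relative to this folder
addpath('../filterCode');  % exposes filterGaussianMean, filterGaussianCovTuned, etc.
compareMeanEstimators      % Figure 1; swap in compareCovEstimators or compareGenomicData
```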
68 | Reference
69 | ===
70 | This repository is an implementation of our paper [Being Robust (in High Dimensions) Can Be Practical](https://arxiv.org/abs/1703.00893) from ICML 2017, authored by [Ilias Diakonikolas](http://www.iliasdiakonikolas.org/), [Gautam Kamath](http://www.gautamkamath.com/), [Daniel M. Kane](https://cseweb.ucsd.edu/~dakane/), [Jerry Li](http://www.mit.edu/~jerryzli/), [Ankur Moitra](http://people.csail.mit.edu/moitra/), and [Alistair Stewart](http://www.alistair-stewart.com/).
71 | 
72 | See also our original theory paper, [Robust Estimators in High Dimensions without the Computational Intractability](https://arxiv.org/abs/1604.06443), which appeared in FOCS 2016.
73 | 
74 | If you use our code or paper, we ask that you please cite:
75 | ```
76 | @inproceedings{DiakonikolasKKLMS17,
77 |   author        = {Diakonikolas, Ilias and Kamath, Gautam and Kane, Daniel M. and Li, Jerry and Moitra, Ankur and Stewart, Alistair},
78 |   title         = {Being Robust (in High Dimensions) Can Be Practical},
79 |   booktitle     = {Proceedings of the 34th International Conference on Machine Learning},
80 |   series        = {ICML '17},
81 |   year          = {2017},
82 |   pages         = {999--1008},
83 |   publisher     = {JMLR, Inc.}
84 | }
85 | ```
86 | 


--------------------------------------------------------------------------------
/comparisonCode/ADPCP.m:
--------------------------------------------------------------------------------
 1 | function [ L, S ] = ADPCP( M, lambda )
 2 | %Principal Component Pursuit by Alternating Directions
 3 | %Based on Algorithm 1 of "Robust Principal Component Analysis?" by
 4 | %Candes, Li, Ma, and Wright
 5 |     [m, n] = size(M);
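 6 |     % Candes et al. suggest mu = m*n / (4*||M||_1) with the entrywise l1 norm; the nuclear norm is used here instead
 7 |     mu = m * n / (4 * norm_nuc(M));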
 8 |     
 9 |     L = zeros(m, n);
10 |     S = zeros(m, n);
11 |     Y = zeros(m, n);
12 |     
13 |     
14 |     delta = 1e-7;
15 | 
16 |     while norm(M - L - S, 'fro') > delta * norm(M, 'fro')
17 |         L = specThresh(M - S + Y / mu, 1 / mu);
18 |         S = shrinkage(M - L + Y / mu, lambda / mu);
19 |         Y = Y + mu * (M - L - S);
20 |     end
21 | end


--------------------------------------------------------------------------------
/comparisonCode/MVE.m:
--------------------------------------------------------------------------------
 1 | function [ estCov, vol ] = MVE( samp, eps )
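 2 | %Approximate the MVE of samp: scale the empirical covariance so that the ellipsoid covers about a (1 - eps) fraction of the points
 3 |     [N, d] = size(samp);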
 4 |     empCov = samp' * samp / N;
 5 |     empCovInv = empCov^(-1);
 6 |     mDists = zeros(N, 1);
 7 |     for j = 1:N
 8 |         mDists(j) = samp(j, :) * empCovInv * samp(j, :)';
 9 |     end
10 |     mDistsSorted = sort(mDists);
11 |     lambda = mDistsSorted(N - floor(eps * N));
12 |     estCov = lambda * empCov;
13 |     vol = det(empCov) * lambda^d;
14 | end
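
The per-sample loop in `MVE.m` evaluates the squared Mahalanobis norm `x * inv(C) * x'` one row at a time. A vectorized equivalent is sketched below on synthetic data; the variable names `loopDists`, `vecDists`, and `maxGap` are illustrative and not part of the repository, and the final line only checks that the two forms agree.

```matlab
% Sketch: squared Mahalanobis norms for all rows at once (synthetic data).
rng(0);                              % fixed seed so the example is repeatable
samp = randn(50, 3);                 % 50 samples in 3 dimensions
[N, ~] = size(samp);
empCov = samp' * samp / N;           % second-moment matrix, as in MVE.m

% Loop form, one row at a time
loopDists = zeros(N, 1);
for j = 1:N
    loopDists(j) = samp(j, :) / empCov * samp(j, :)';
end

% Vectorized form: row-wise dot product of (samp / empCov) with samp
vecDists = sum((samp / empCov) .* samp, 2);

maxGap = max(abs(loopDists - vecDists));   % expected to be at rounding-error level
```

Using `samp / empCov` solves a linear system instead of forming the inverse explicitly, which is usually the more numerically stable choice.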


--------------------------------------------------------------------------------
/comparisonCode/compareCovEstimators.m:
--------------------------------------------------------------------------------
  1 | clear;
  2 | eps = 0.05;
  3 | tau = 0.1;
  4 | 
  5 | sampErr = [];
  6 | noisyEmpErr = [];
  7 | ransacErr = [];
  8 | filterErr = [];
  9 | pruneErr = [];
 10 | LRVErr = [];
 11 | ds = 10:10:100;
 12 | 
 13 | %Set to 0 for isotropic covariance, generates left half of Figure 2
 14 | %Set to 1 for spiked covariance, generates right half of Figure 2
 15 | spikedCovariance = 1;
 16 | 
 17 | if spikedCovariance
 18 |     spike = 100;
 19 | else
 20 |     spike = 1;
 21 | end
 22 | 
 23 | 
 24 | for d = ds
 25 |     N =  0.5*d / eps^2;
 26 |     fprintf('Training with dimension = %d, number of samples = %d \n', d, round(N, 0))
 27 |     sumEmpErr = 0;
 28 |     sumNoisyEmpErr = 0;
 29 |     sumFilterErr = 0;
 30 |     sumMVEErr = 0;
 31 |     sumLRVErr = 0;
 32 |     sumPruneErr = 0;
 33 | 
 34 |     covar = eye(d);
 35 |     if spikedCovariance
 36 |         covar(1,1) = spike;
 37 |     end
 38 | 
 39 |     X =  mvnrnd(zeros(1,d), covar, round((1-eps)*N)); 
 40 |     if spikedCovariance
 41 |         U1 = orth(randn(d, d));
 42 |         Y = [ 0.5 * randi([-1 1], round(eps *N), d / 2) 0.8 * randi([-2 2], round(eps *N), d / 2 - 1) randi([-spike spike], round(eps *N), 1)] * U1; 
 43 |     else
 44 |         Y = zeros(round(eps * N), d);
 45 |     end
 46 |     Z = [X; Y];
 47 | 
 48 |     fprintf('Sampling Error w/o noise...');
 49 |     empCov = cov(X);
 50 |     sumEmpErr = sumEmpErr + norm(mahalanobis(empCov, covar) - eye(d), 'fro');
 51 |     fprintf('done\n')
 52 | 
 53 |     fprintf('Sampling Error with noise...');
 54 |     empCov = cov(Z);
 55 |     sumNoisyEmpErr = sumNoisyEmpErr + norm(mahalanobis(empCov, covar) - eye(d), 'fro');
 56 |     fprintf('done\n')
 57 | 
 58 |     fprintf('Filter...')
 59 |     [ourCov, filterPoints, ~] = filterGaussianCovTuned(Z, zeros(size(Z)),  eps, tau, false);
 60 |     sumFilterErr = sumFilterErr + norm(mahalanobis(ourCov, covar) - eye(d), 'fro');
 61 |     fprintf('done\n')
 62 | 
 63 |     fprintf('Prune...')
 64 |     [pruneCov, prunePoints, ~] = pruneGaussianCov(Z, zeros(size(Z)),  tau, false);
 65 |     sumPruneErr = sumPruneErr + norm(mahalanobis(pruneCov, covar) - eye(d), 'fro');
 66 |     fprintf('done\n')
 67 | 
 68 | 
 69 |     fprintf('RANSAC...')
 70 |     sumMVEErr = sumMVEErr + norm(mahalanobis(ransacMVE(Z, eps, ceil(d / eps), 1000),  covar)  -  eye(d), 'fro');
 71 |     fprintf('done\n')
 72 | 
 73 |     fprintf('LRV...')
 74 |     [~, lrvCov, ~] = agnosticCovarianceGeneral(Z, eps);
 75 |     sumLRVErr = sumLRVErr + norm(mahalanobis(lrvCov, covar) - eye(d), 'fro') ;
 76 |     fprintf('done\n')
 77 | 
 78 | 
 79 |     sampErr = [sampErr sumEmpErr];
 80 |     noisyEmpErr = [noisyEmpErr sumNoisyEmpErr];
 81 |     ransacErr = [ransacErr sumMVEErr];
 82 |     filterErr = [filterErr sumFilterErr];
 83 |     pruneErr = [pruneErr sumPruneErr]; 
 84 |     LRVErr = [LRVErr sumLRVErr];
 85 | end
 86 | 
 87 | noisyEmpErr = noisyEmpErr - sampErr;
 88 | pruneErr = pruneErr - sampErr;
 89 | ransacErr = ransacErr - sampErr;
 90 | LRVErr = LRVErr - sampErr;
 91 | filterErr = filterErr - sampErr;
 92 | 
 93 | figure(1);
 94 | plot(ds, noisyEmpErr, '-gx',  ds,  ransacErr, ds, LRVErr, ds, pruneErr, ds, filterErr, '-.b', 'LineWidth', 2)
 95 | xlabel('Dimension')
 96 | ylabel('Excess Frobenius error')
 97 | legend('Sampling Error (with corruption)', 'RANSAC', 'LRV', 'Prune', 'Filter')
 98 | 
 99 | figure(2);
100 | plot(ds, LRVErr, ds, filterErr, '-.b', 'LineWidth', 2)
101 | xlabel('Dimension')
102 | ylabel('Excess Frobenius error')
103 | legend('LRV', 'Filter')


--------------------------------------------------------------------------------
/comparisonCode/compareGenomicData.m:
--------------------------------------------------------------------------------
  1 | %Generates Figures 3 and 4
  2 | clear;
  3 | eps = 0.10;
  4 | tau = 0.1;
  5 | d = 20;
  6 | lambda = 0.2;
  7 | subsampSize = 100;
  8 | 
  9 | %Process mappings from color names to numbers
 10 | rawColor = importdata('../genomicData/colors.txt');
 11 | keys = rawColor.rowheaders;
 12 | vals = mat2cell(rawColor.data, ones(size(rawColor.data, 1), 1));
 13 | colorMap = containers.Map(keys, vals);
 14 | 
 15 | %Process data points
 16 | fid = fopen('../genomicData/POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eigs');
 17 | A = textscan(fid,'%f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %s', 'HeaderLines',1);
 18 | fclose(fid);
 19 | V = [A{3} A{4} A{5} A{6} A{7} A{8} A{9} A{10} A{11} A{12} A{13} A{14} A{15} A{16} A{17} A{18} A{19} A{20} A{21} A{22}]; 
 20 | 
 21 | eigs = importdata('../genomicData/POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eval');
 22 | eigs = eigs(1:d);
 23 | data = zeros(size(V,1), d);
 24 | for i = 1:size(V,1)
 25 |     data(i,:) = V(i,:).*eigs';
 26 | end
 27 | 
 28 | %Convert data point color names to numbers
 29 | colorsFid = fopen('../genomicData/POPRESID_Color.txt');
 30 | colorsStrings = textscan(colorsFid, '%d %s');
 31 | fclose(colorsFid);
 32 | colorsStrings = colorsStrings{2};
 33 | dataColors = zeros(length(colorsStrings),3);
 34 | for i = 1:length(colorsStrings)
 35 |     dataColors(i, :) = colorMap(colorsStrings{i})/255;
 36 | end
 37 | %Generate noisy points
 38 | N = size(V,1)/(1-eps);
 39 | noise = (1/24)*[randi([0, 2], round(eps*N), d/2) randi([2, 3], round(eps*N), d/2)];
 40 | 
 41 | %Generate noisy point colors
 42 | noiseColors = zeros(round(eps*N), 3);
 43 | 
 44 | %Combine data and noise after applying the same random rotation to both
 45 | randRot = orth(randn(d, d));
 46 | data = data * randRot;
 47 | noise = noise * randRot;
 48 | D = [data; noise];
 49 | C = [dataColors; noiseColors];
 50 | subsampD = datasample(D, subsampSize);
 51 | 
 52 | %Generate original data's projection
 53 | [dataU, ~, ~] = svd(cov(data));
 54 | 
 55 | %Generate noised data's projection
 56 | [noisedU, ~, ~] = svd(cov(D));
 57 | 
 58 | %Generate filter output and projection
 59 | [filterM, filtered, filteredMetadata] = filterGaussianCovTuned(D, C, eps, tau, 0);
 60 | [filterU, ~, ~] = svd(filterM);
 61 | 
 62 | %Generate filter output and projection (no noise)
 63 | [filterNoNoiseM, filteredNoNoise, filterNoNoiseMetadata] = filterGaussianCovTuned(data, dataColors, eps, tau, 0);
 64 | [filterNoNoiseU, ~, ~] = svd(filterNoNoiseM);
 65 | 
 66 | %Generate LRV projection
 67 | [~, lrvM, ~] = agnosticCovarianceGeneral(D, eps);
 68 | [lrvU, ~, ~] = svd(lrvM);
 69 | 
 70 | %Generate Pruning projection
 71 | [pruneM, ~, ~] = pruneGaussianCov(D, C, tau, 0);
 72 | [pruneU, ~, ~] = svd(pruneM);
 73 | 
 74 | %Generate RANSAC projection
 75 | [ransacM] = ransacMVE(D, eps, ceil(d / eps), 1000);
 76 | [ransacU, ~, ~] = svd(ransacM);
 77 | 
 78 | %Generate CLMW SDP projection
 79 | cvx_begin quiet
 80 |     variable clmwsdpL(subsampSize, d) 
 81 |     variable S(subsampSize, d) 
 82 |     dual variable Q
 83 |     minimize norm_nuc(clmwsdpL) + lambda * sum(sum(abs(S)))
 84 |     subject to 
 85 |         clmwsdpL + S == subsampD;
 86 | cvx_end
 87 | clmwsdpM = clmwsdpL' * clmwsdpL;
 88 | [clmwsdpU, ~, ~] = svd(clmwsdpM);
 89 | 
 90 | %Generate CLMW AM projection
 91 | [clmwamL, ~] = ADPCP(D, lambda);
 92 | clmwamM = clmwamL' * clmwamL;
 93 | [clmwamU, ~, ~] = svd(clmwamM);
 94 | 
 95 | %Generate XCS SDP projection
 96 | cvx_begin quiet
 97 |     variable xcsL(subsampSize, d) 
 98 |     variable S2(subsampSize, d) 
 99 |     dual variable Q
100 |     minimize norm_nuc(xcsL) + lambda * sum(norms(S2, 2, 2))
101 |     subject to 
102 |         xcsL + S2 == subsampD;
103 | cvx_end
104 | xcsM = xcsL' * xcsL;
105 | [xcsU, ~, ~] = svd(xcsM);
106 | 
107 | %Generate plot for original data
108 | fig = figure(1);
109 | clf;
110 | scatter(data*dataU(:,1), data*dataU(:,2), [], dataColors);
111 | hold on;
112 | scatter(noise*dataU(:,1), noise*dataU(:,2), [], noiseColors);
113 | axis([-0.25 0.35 -0.15 0.2])
114 | title('Original Data')
115 | set(fig, 'Position', [100, 100, 640, 480]);
116 | 
117 | %Generate plot for noised data
118 | fig = figure(2);
119 | clf
120 | scatter(data*noisedU(:,1), data*noisedU(:,2), [], dataColors);
121 | hold on;
122 | scatter(noise*noisedU(:,1), noise*noisedU(:,2), [], noiseColors);
123 | axis([-0.25 0.35 -0.15 0.2])
124 | title('Empirical Noised Singular Vectors')
125 | set(fig, 'Position', [100, 100, 640, 480]);
126 | 
127 | %Generate plot for filter's output
128 | fig = figure(3);
129 | clf
130 | scatter(filtered*filterU(:,1),filtered*filterU(:,2), [], filteredMetadata);
131 | axis([-0.25 0.35 -0.15 0.2])
132 | title('Filter Output');
133 | set(fig, 'Position', [100, 100, 640, 480]);
134 | 
135 | %Generate plot for filter's output without noise
136 | fig = figure(4);
137 | clf
138 | scatter(filteredNoNoise*filterNoNoiseU(:,1), filteredNoNoise*filterNoNoiseU(:,2), [], filterNoNoiseMetadata);
139 | axis([-0.25 0.35 -0.15 0.2])
140 | title('Filter Output (no noise)');
141 | set(fig, 'Position', [100, 100, 640, 480]);
142 | 
143 | %Generate plot for the filter's projection
144 | fig = figure(5);
145 | clf
146 | scatter(data*filterU(:,1), data*filterU(:,2), [], dataColors);
147 | hold on;
148 | scatter(noise*filterU(:,1), noise*filterU(:,2), [], noiseColors);
149 | axis([-0.25 0.35 -0.15 0.2])
150 | title('Filter Projection')
151 | set(fig, 'Position', [100, 100, 640, 480]);
152 | 
153 | %Generate plot for LRV's projection
154 | fig = figure(6);
155 | clf
156 | scatter(data*lrvU(:,1), data*lrvU(:,2), [], dataColors);
157 | hold on;
158 | scatter(noise*lrvU(:,1), noise*lrvU(:,2), [], noiseColors);
159 | axis([-0.25 0.35 -0.15 0.2])
160 | title('LRV Projection')
161 | set(fig, 'Position', [100, 100, 640, 480]);
162 | 
163 | %Generate plot for pruned data's projection
164 | fig = figure(7);
165 | clf
166 | scatter(data*pruneU(:,1), data*pruneU(:,2), [], dataColors);
167 | hold on;
168 | scatter(noise*pruneU(:,1), noise*pruneU(:,2), [], noiseColors);
169 | axis([-0.25 0.35 -0.15 0.2])
170 | title('Pruning Projection')
171 | set(fig, 'Position', [100, 100, 640, 480]);
172 | 
173 | %Generate plot for RANSAC's projection
174 | fig = figure(8);
175 | clf
176 | scatter(data*ransacU(:,1), data*ransacU(:,2), [], dataColors);
177 | hold on;
178 | scatter(noise*ransacU(:,1), noise*ransacU(:,2), [], noiseColors);
179 | axis([-0.25 0.35 -0.15 0.2])
180 | title('RANSAC Projection')
181 | set(fig, 'Position', [100, 100, 640, 480]);
182 | 
183 | %Generate plot for CLMW SDP projection
184 | fig = figure(9);
185 | clf
186 | scatter(data*clmwsdpU(:,1), data*clmwsdpU(:,2), [], dataColors);
187 | hold on;
188 | scatter(noise*clmwsdpU(:,1), noise*clmwsdpU(:,2), [], noiseColors);
189 | axis([-0.25 0.35 -0.15 0.2])
190 | title('CLMW SDP Projection')
191 | set(fig, 'Position', [100, 100, 640, 480]);
192 | 
193 | %Generate plot for CLMW AM projection
194 | fig = figure(10);
195 | clf
196 | scatter(data*clmwamU(:,1), data*clmwamU(:,2), [], dataColors);
197 | hold on;
198 | scatter(noise*clmwamU(:,1), noise*clmwamU(:,2), [], noiseColors);
199 | axis([-0.25 0.35 -0.15 0.2])
200 | title('CLMW ADMM Projection')
201 | set(fig, 'Position', [100, 100, 640, 480]);
202 | 
203 | %Generate plot for XCS projection
204 | fig = figure(11);
205 | clf
206 | scatter(data*xcsU(:,1), data*xcsU(:,2), [], dataColors);
207 | hold on;
208 | scatter(noise*xcsU(:,1), noise*xcsU(:,2), [], noiseColors);
209 | axis([-0.25 0.35 -0.15 0.2])
210 | title('XCS Projection')
211 | set(fig, 'Position', [100, 100, 640, 480]);
212 | 


--------------------------------------------------------------------------------
/comparisonCode/compareMeanEstimators.m:
--------------------------------------------------------------------------------
 1 | %Generates Figure 1 from paper
 2 | clear
 3 | eps = 0.1;
 4 | tau = 0.1;
 5 | cher = 2.5;
 6 | 
 7 | filterErr = [];
 8 | medianErr = [];
 9 | ransacErr = [];
10 | LRVErr = [];
11 | sampErr = [];
12 | noisySampErr = []; 
13 | prunedErr = [];
14 | ds = 100:50:400;
15 | 
16 | for d = ds
17 |     N = 10*floor(d/eps^2);
18 |     fprintf('Training with dimension = %d, number of samples = %d \n', d, round(N, 0))
19 |     sumFilterErr = 0;
20 |     sumMedErr = 0;
21 |     sumRansacErr = 0;
22 |     sumLRVErr = 0;
23 |     sumSampErr = 0;
24 |     sumNoisySampErr = 0;
25 |     sumPrunedErr = 0;
26 | 
27 |     X =  mvnrnd(zeros(1,d), eye(d), round((1-eps)*N)) + ones(round((1-eps)*N), d);
28 | 
29 |     fprintf('Sampling Error w/o noise...');
30 |     sumSampErr = sumSampErr + norm(mean(X) - ones(1,d));
31 |     fprintf('done\n')
32 | 
33 |     Y1 = randi([0 1], round(0.5*eps*N), d); 
34 |     Y2 = [12*ones(round(0.5*eps*N),1), -2 * ones(round(0.5*eps*N), 1), zeros(round(0.5 * eps * N), d-2)];
35 |     X = [X; Y1; Y2];
36 | 
37 |     fprintf('Sampling Error with noise...');
38 |     sumNoisySampErr = sumNoisySampErr + norm(mean(X) - ones(1,d));
39 |     fprintf('done\n')
40 |     
41 |     
42 |     fprintf('Pruning...');
43 |     [prunedMean, ~] = pruneGaussianMean(X, eps);
44 |     sumPrunedErr = sumPrunedErr + norm(prunedMean - ones(1, d));
45 |     fprintf('done\n')
46 |     
47 |     fprintf('Median...')
48 |     gm = geoMedianGaussianMean(X);
49 |     sumMedErr = sumMedErr + norm(gm - ones(1, d));
50 |     fprintf('done\n')
51 |     
52 |     fprintf('Ransac...')
53 |     sumRansacErr = sumRansacErr + norm(ransacGaussianMean(X, eps, tau) - ones(1, d));
54 |     fprintf('done\n')
55 | 
56 |     fprintf('LRV...')
57 |     sumLRVErr = sumLRVErr + norm(agnosticMeanGeneral(X, eps) - ones(1,d));
58 |     fprintf('done\n')
59 | 
60 |     fprintf('Filter...')
61 |     sumFilterErr = sumFilterErr + norm(filterGaussianMean(X, eps, tau, cher) - ones(1, d));
62 |     fprintf('done\n')
63 | 
64 |     medianErr = [medianErr sumMedErr];
65 |     ransacErr = [ransacErr sumRansacErr];
66 |     filterErr = [filterErr sumFilterErr];
67 |     LRVErr = [LRVErr sumLRVErr];
68 |     sampErr = [sampErr sumSampErr];
69 |     noisySampErr = [noisySampErr sumNoisySampErr];
70 |     prunedErr = [prunedErr sumPrunedErr];
71 | end
72 | 
73 | noisySampErr = noisySampErr - sampErr;
74 | prunedErr = prunedErr - sampErr;
75 | medianErr = medianErr - sampErr;
76 | ransacErr = ransacErr - sampErr;
77 | LRVErr = LRVErr - sampErr;
78 | filterErr = filterErr - sampErr;
79 | 
80 | plot(ds, noisySampErr, ds, prunedErr, ds, medianErr, '-ro', ds, ransacErr, ds, LRVErr, ds, filterErr, '-.b', 'LineWidth', 2)
81 | xlabel('Dimension')
82 | ylabel('Excess L2 error')
83 | legend('Sampling Error (with noise)', 'Naive Pruning', 'Geometric Median', 'RANSAC', 'LRV', 'Filter')


--------------------------------------------------------------------------------
/comparisonCode/geoMedianGaussianMean.m:
--------------------------------------------------------------------------------
 1 | function [ estMean ] = geoMedianGaussianMean(data)
 2 | %Approximately compute the geometric median using Weiszfeld's algorithm
 3 |     numIters = 100;
 4 |     [N, ~] = size(data);
 5 |     curEstimate = mean(data);
 6 |     for i = 1:numIters
 7 |         num = 0;
 8 |         den = 0;
 9 |         for j = 1:N
10 |             distToEstimate = norm(data(j,:) - curEstimate);
11 |             num = num + data(j,:)/distToEstimate;
12 |             den = den + 1/distToEstimate;
13 |         end
14 |         curEstimate = num/den;
15 |     end
16 |     estMean = curEstimate;
17 | end
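
The inner loop of `geoMedianGaussianMean.m` can also be written in vectorized form. The sketch below runs the same Weiszfeld update on a tiny synthetic dataset; `minDist` is an assumed safeguard against dividing by zero when the estimate lands exactly on a data point and is not part of the original function. The subtraction `data - curEstimate` relies on implicit expansion (MATLAB R2016b or later).

```matlab
% Sketch: vectorized Weiszfeld iterations on a tiny synthetic dataset.
% minDist is an assumed safeguard that is not part of geoMedianGaussianMean.m.
data = [0 0; 1 0; 0 1; 1 1; 10 10];   % four clustered points and one outlier
numIters = 100;
minDist = 1e-12;                       % floor to avoid dividing by zero

curEstimate = mean(data, 1);           % same starting point as the original
for it = 1:numIters
    % Euclidean distance from every row to the current estimate
    dists = sqrt(sum((data - curEstimate).^2, 2));
    w = 1 ./ max(dists, minDist);      % inverse-distance weights
    curEstimate = (w' * data) / sum(w);
end

% Sum of distances, the objective that the geometric median minimizes
objective = @(c) sum(sqrt(sum((data - c).^2, 2)));
objAtMean = objective(mean(data, 1));
objAtMedian = objective(curEstimate);
```

Comparing `objAtMedian` with `objAtMean` gives a quick sanity check that the iterations moved away from the outlier-sensitive mean.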


--------------------------------------------------------------------------------
/comparisonCode/norm_nuc.m:
--------------------------------------------------------------------------------
1 | function [ nucNorm ] = norm_nuc( A )
2 | %norm_nuc Compute the nuclear norm (or Schatten-1 norm) of a matrix A
3 |     [~, S, ~] = svdecon(A); % svdecon comes from the Fast SVD and PCA prerequisite
4 |     nucNorm = trace(abs(S)); % sum of the singular values
5 | end


--------------------------------------------------------------------------------
/comparisonCode/pruneGaussianCov.m:
--------------------------------------------------------------------------------
 1 | function [ estCov, filteredPoints, filteredMetadata] = pruneGaussianCov( data, metadata, tau, debug )
 2 | % Repeatedly remove points whose Mahalanobis norm under the empirical covariance is too large
 3 |     [N, d] = size(data);
 4 |     C = 0.2;
 5 |     empCov = data' * data / N;
 6 |     empCovInv = empCov^(-1); % invert once, outside the loop
 7 |     restart = false;
 8 |     remove = [];
 9 |     for i = 1:N
10 |         if mod(i, 10000) == 0 && debug 
11 |             fprintf('Initial pruning iteration %d\n',i);
12 |         end
13 |         x = data(i, :);
14 |         if x * empCovInv * x' > C * d * log(N / tau)
15 |             remove = [i remove];
16 |             restart = true;
17 |         end
18 |     end
19 |     if restart
20 |         if debug
21 |             fprintf('Pruning %d points\n', length(remove)); 
22 |         end
23 |         data(remove, :) = [];
24 |         metadata(remove, :) = [];
25 |         [estCov, filteredPoints, filteredMetadata] = pruneGaussianCov(data, metadata, tau, debug);
26 |         return;
27 |     end
28 |     estCov = empCov;
29 |     filteredPoints = data;
30 |     filteredMetadata = metadata;
31 |     if debug
32 |         fprintf('Remaining points: %d\n', size(filteredPoints, 1));
33 |     end
34 | end


--------------------------------------------------------------------------------
/comparisonCode/pruneGaussianMean.m:
--------------------------------------------------------------------------------
 1 | function [ estMean, filteredPoints] = pruneGaussianMean(data, eps)
 2 | %Use a naive method to prune points which are "obviously far" from mean
 3 |     [N, d] = size(data);
 4 |     coordMed = zeros(1,d);
 5 |     filteredPoints = zeros(size(data));
 6 |     numFilteredPoints = 0;
 7 |     
 8 |     for i = 1:d
 9 |         coordMed(i) = median(data(:,i));
10 |     end
11 |     
12 |     for i = 1:N
13 |         if norm(data(i,:) - coordMed) < 0.33*sqrt(d*log(N/eps))
14 |             numFilteredPoints = numFilteredPoints + 1;
15 |             filteredPoints(numFilteredPoints,:) = data(i,:);
16 |         end
17 |     end
18 |     filteredPoints = filteredPoints(1:numFilteredPoints,:);
19 |     estMean = mean(filteredPoints);
20 | end


--------------------------------------------------------------------------------
/comparisonCode/ransacGaussianMean.m:
--------------------------------------------------------------------------------
 1 | function [ estMean ] = ransacGaussianMean( data, eps, tau)
 2 | %Use RANSAC to estimate the mean of Gaussian data
 3 |     [N, d] = size(data);
 4 |     empiricalMean = mean(data);
 5 |     ransacN = ceil(2*(d*log(4) + log(2/tau))/eps^2);
 6 |     if ransacN > N
 7 |         estMean = empiricalMean;
 8 |         return;
 9 |     end
10 | 
11 |     numIters = 100;
12 |     thresh = d + 2*(sqrt(d * log(N/tau)) + log(N/tau)) + eps^2*(log(1/eps))^2;
13 | 
14 |     %As a baseline, see how many inliers the mean has
15 |     bestInliers = 0;
16 |     bestMean = empiricalMean;
17 |     for j = 1:N
18 |         if norm(empiricalMean - data(j,:))^2 <= thresh
19 |             bestInliers = bestInliers + 1;
20 |         end
21 |     end
22 | 
23 |     for i = 1:numIters
24 |         ransacData = data(randsample(1:N, ransacN),:);
25 |         ransacMean = mean(ransacData);
26 |         curInliers = 0;
27 |         for j = 1:N
28 |             if norm(ransacMean - data(j,:))^2 <= thresh
29 |                 curInliers = curInliers + 1;
30 |             end
31 |         end
32 |         if curInliers > bestInliers
33 |             bestMean = ransacMean;
34 |             bestInliers = curInliers;
35 |         end
36 |     end
37 | 
38 |     estMean = bestMean;
39 | end


--------------------------------------------------------------------------------
/comparisonCode/ransacMVE.m:
--------------------------------------------------------------------------------
 1 | function [ estCov ] = ransacMVE( data, eps, sampSize, iter )
 2 | % Use RANSAC to estimate the Minimum Volume Ellipsoid (MVE) as per 
 3 | % http://webmining.spd.louisville.edu/wp-content/uploads/2014/05/A-Brief-Overview-of-Robust-Statistics.pdf
 4 | 
 5 |     [~, d] = size(data);
 6 |     
 7 |     estCov = zeros(d, d);
 8 |     bestVol = Inf;
 9 |     
10 |     for i = 1:iter
11 |         if mod(i, 10000) == 0
12 |             fprintf('ransac iteration i = %d\n', i)
13 |         end
14 |         samp = datasample(data, sampSize);
15 |         [currentCov, currentVol] = MVE(samp, eps);
16 |         if currentVol < bestVol
17 |             estCov = currentCov;
18 |             bestVol = currentVol; % remember the smallest volume seen so far
19 |         end
20 |     end
21 |     estCov = estCov / chi2inv(1 - eps, d);
22 |     
23 | end


--------------------------------------------------------------------------------
/comparisonCode/shrinkage.m:
--------------------------------------------------------------------------------
1 | function [ M ] = shrinkage( M, tau )
2 | %Shrinkage operator, as defined in CandesLMW'11
3 |     [ilist,jlist,vlist] = find(M);
4 |     for i = 1:length(ilist)
5 |         v = vlist(i);
6 |         M(ilist(i), jlist(i)) = sign(v) * max(abs(v) - tau, 0);
7 |     end  
8 | end
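
The loop in `shrinkage.m` applies soft-thresholding to each nonzero entry. A loop-free version of the same operator is sketched below on a small example matrix; the variable name `shrunk` is illustrative.

```matlab
% Sketch: entrywise soft-thresholding without an explicit loop.
M = [3 -1; 0.5 -4];
tau = 1;
shrunk = sign(M) .* max(abs(M) - tau, 0);   % entries with |v| <= tau become 0
```

Zero entries stay zero in both forms because `sign(0)` is 0, so skipping them with `find`, as `shrinkage.m` does, yields the same result.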


--------------------------------------------------------------------------------
/comparisonCode/specThresh.m:
--------------------------------------------------------------------------------
1 | function [ Mthres ] = specThresh( M, tau )
2 | %Singular Value Thresholding operator, as defined in CandesLMW'11
3 |     [U, S, V] = svd(M);
4 |     Mthres = U * shrinkage(S, tau) * V';
5 | end
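
`specThresh.m` calls the full `svd(M)`, which for a tall `m`-by-`n` matrix builds an `m`-by-`m` factor `U`. The sketch below shows the same singular value thresholding with MATLAB's economy-size decomposition on synthetic data; the matrix size, the threshold, and the names `s` and `sShrunk` are illustrative choices, not values taken from the repository.

```matlab
% Sketch: singular value thresholding with the economy-size SVD.
rng(1);                          % fixed seed so the example is repeatable
M = randn(200, 5);               % tall synthetic matrix
tau = 13;                        % illustrative threshold

[U, S, V] = svd(M, 'econ');      % U is 200x5 rather than 200x200
s = diag(S);                     % singular values, in decreasing order
sShrunk = max(s - tau, 0);       % soft-threshold the singular values
Mthres = U * diag(sShrunk) * V';
```

The product `U * diag(sShrunk) * V'` matches what the full decomposition would give, since the extra columns of the full `U` only ever multiply zero rows of `S`.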


--------------------------------------------------------------------------------
/comparisonCode/testGeoMedian.m:
--------------------------------------------------------------------------------
 1 | %Demonstrates that the geometric median incurs a sqrt(d) loss in accuracy
 2 | eps = 0.5;
 3 | dims = floor(linspace(1,700,10));
 4 | err = [];
 5 | for d = dims
 6 |     N = ceil(10*(d/eps^2));
 7 |     fprintf('Training with dimension = %d, number of samples = %d \n', d, round(N, 0))
 8 |     sumErr = 0;
 9 |     for i = 1:3
10 |         X =  mvnrnd(zeros(1,d), eye(d), round((1-eps)*N)) + ones(round((1-eps)*N), d); 
11 |         Y = zeros(round(eps * N), d);
12 |         X = [X; Y];
13 |         sumErr = sumErr + norm(geoMedianGaussianMean(X) - ones(1, d));
14 |     end
15 |     err = [err sumErr / 3];
16 | end
17 | plot(dims, err)


--------------------------------------------------------------------------------
/filterCode/filterGaussianCov.m:
--------------------------------------------------------------------------------
  1 | function [ estCov, filteredPoints, filteredMetadata] = filterGaussianCov( data, metadata, eps, tau, expConst, debug )
  2 |     MAX_COND = 10000;
  3 | 
  4 |     [N, d] = size(data);
  5 |     threshold = eps*(log(1/eps))^2;
  6 |     C1 = 0.4; 
  7 |     C2 = 0;
  8 |     
  9 |     empCov = data' * data / N;
 10 |     condition = cond(empCov);
 11 |     
 12 |     % Our empirical covariance is too ill-conditioned--we probably threw
 13 |     % away too many points, so let's just throw everything away and restart
 14 |     if condition > MAX_COND || N < d
 15 |         if debug
 16 |             fprintf('Ill conditioned %d %d %d\n', condition, N, d);
 17 |         end
 18 |         estCov = zeros(d);
 19 |         filteredPoints = [];
 20 |         filteredMetadata = [];
 21 |         return
 22 |     end
 23 |         
 24 |     
 25 |     empCovInv = empCov^(-1);
 26 |     % Remove any points whose Mahalanobis norm under the empirical
 27 |     % covariance is very large, then restart on the remaining points
 28 |     restart = false;
 29 |     remove = [];
 30 |     for i = 1:N
 31 |         if mod(i, 10000) == 0 && debug 
 32 |             fprintf('Initial pruning iteration %d\n',i);
 33 |         end
 34 |         x = data(i, :);
 35 |         if x * empCovInv * x' > C1 * d * log(N / tau)
 36 |             remove = [i remove];
 37 |             restart = true;
 38 |         end
 39 |     end
 40 |     if restart
 41 |         data(remove, :) = [];
 42 |         metadata(remove, :) = [];
 43 |         [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConst, debug);
 44 |         return
 45 |     end
 46 |         
 47 |     if debug
 48 |         fprintf('After first for loop\n')
 49 |     end
 50 |     
 51 |     
 52 |     rotData = data * empCovInv^(1/2);
 53 |     if debug
 54 |         fprintf('before findMaxPoly\n');
 55 |     end
 56 |     [~, M, lambda] = findMaxPoly(rotData);
 57 |     if debug
 58 |         fprintf('after findMaxPoly\n');
 59 |     end
 60 |     
 61 |     if debug
 62 |         fprintf('lambda = %d, threshold = %d\n', lambda, 1 + C2 * threshold);
 63 |         if d < 12
 64 |             svd(M)
 65 |         end
 66 |     end
 67 |     
 68 |     %If top eigenvalue isn't too large, the dataset is OK
 69 |     %Otherwise, find points which violate the tail bound
 70 |     if lambda < 1 + C2 * threshold
 71 |         estCov = empCov;
 72 |         filteredPoints = data;
 73 |         filteredMetadata = metadata;
 74 |     else
 75 |         projectedData = zeros(N, 1);
 76 |         for i = 1:N
 77 |             projectedData(i) = rotData(i, :) * M * rotData(i, :)';
 78 |         end
 79 |                 
 80 |         med = mean(projectedData);
 81 | 		
 82 |         if debug
 83 |             fprintf('mean %d, trace %d, variance %d , eigenvalue %d \n', med, trace(M), var(projectedData), lambda);
 84 |             fprintf('N %d \n',N);
 85 |         end
 86 | 		
 87 |         indices = abs(projectedData - med);
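 88 |         sortedProjectedData = sortrows([indices data], 1);    % sort on the distance column only
 89 |         sortedMetadata = sortrows([indices metadata], 1);     % same key column keeps metadata rows aligned with data rows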
 90 |         
 91 |         for i = 1:N
 92 |             T = sortedProjectedData(i,1);
 93 |             rhs = N * (12 * exp(-expConst*T) + 3 * eps/(d*log(N/tau))^2);
 94 |             if  debug && ((N - i) > rhs || mod(N - i, 1000) == 10)
 95 |                 fprintf('N - i =  %d \n', N - i);
 96 |                 fprintf('other thing =  %d \n', rhs);
 97 |                 fprintf('T = %d\n', T);
 98 |                 fprintf('first term RHS %d\n', 12 * N * exp(-expConst*T));
 99 |                 fprintf('second term RHS %d\n', 3 * N * eps/(d*log(N/tau))^2);
100 |             end
101 |             if (N - i) > rhs
102 |                 break
103 |             end
104 |         end
105 |         if i == 1 || i == N
106 |             if debug
107 |                 fprintf('No data points removed, i = %d\n', i);
108 |             end
109 |             estCov = empCov;
110 |             filteredPoints = data;
111 |             filteredMetadata = metadata;
112 |             return;
113 |         else 
114 |             [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(sortedProjectedData(1:i, 2:end), sortedMetadata(1:i,2:end), eps, tau, expConst, debug);
115 |         end
116 |     end
117 | end


--------------------------------------------------------------------------------
/filterCode/filterGaussianCovTuned.m:
--------------------------------------------------------------------------------
 1 | function [ estCov, filteredPoints, filteredMetadata] = filterGaussianCovTuned( data, metadata, eps, tau, debug )
 2 |     expConstUB = inf;
 3 |     expConstLB = 0;
 4 |     expConstCand = 2;
 5 |     N = size(data, 1);    
 6 |     targetNoise = eps*N;
 7 |     
 8 |     if debug
 9 |         fprintf('Number of points: %d\n', N);
10 |         fprintf('Target number of noise points is %d\n', targetNoise)
11 |         fprintf('Trying to filter with constant %d\n\n\n', expConstCand);
12 |     end
13 |     
14 |     
15 |     [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConstCand, debug);
16 |     outN = size(filteredPoints, 1);
17 |     
18 |     if debug
19 |         fprintf('Number of points output by filter is %d\n', outN);
20 |     end
21 |     
22 |     %Find upper and lower bounds for removing the right number of points
23 |     if outN < (N - 1.5*targetNoise)
24 |         while outN < (N - 1.5*targetNoise)
25 |             expConstUB = expConstCand;
26 |             expConstCand = expConstCand / 2;
27 |             
28 |             if debug
29 |                 fprintf('Trying to filter with constant %d\n\n\n', expConstCand);
30 |             end
31 |             [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConstCand, debug);
32 |             outN = size(filteredPoints, 1);
33 |             
34 |             if debug
35 |                 fprintf('Number of points output by filter is %d\n', outN);
36 |             end
37 |         end
38 |         expConstLB = expConstCand;
39 |     elseif outN > (N - 0.5*targetNoise)
40 |         while outN > (N - 0.5*targetNoise)
41 |             expConstLB = expConstCand;
42 |             expConstCand = expConstCand * 2;
43 |             if debug
44 |                 fprintf('Trying to filter with constant %d\n\n\n', expConstCand);
45 |             end
46 |             [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConstCand, debug);
47 |             outN = size(filteredPoints, 1);
48 |             if debug
49 |                 fprintf('Number of points output by filter is %d\n', outN);
50 |             end
51 |         end
52 |         expConstUB = expConstCand;
53 |     end
54 |     if outN > (N - 1.5*targetNoise) && outN < (N - 0.5*targetNoise)
55 |         if debug
56 |             fprintf('Succeeded with constant %d\n', expConstCand);
57 |         end
58 |         return;
59 |     end
60 |     
61 |     %Binary search within the interval to remove the right number of points
62 |     numTrials = 0;
63 |     while 1
64 |         numTrials = numTrials + 1;
65 |         if numTrials >= 10
66 |             if debug
67 |                 fprintf('Quitting due to too many trials\n');
68 |             end
69 |             [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConstUB, debug);
70 |             return;
71 |         end
72 |         
73 |         expConstCand = (expConstUB + expConstLB)/2;
74 |         if debug
75 |             fprintf('Trying to filter with constant %d\n\n\n', expConstCand);
76 |         end
77 |         [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConstCand, debug);
78 |         outN = size(filteredPoints, 1);
79 |         
80 |         if debug
81 |             fprintf('Number of points output by filter is %d\n', outN);
82 |         end
83 |         
84 |         if outN < (N - 1.5*targetNoise)
85 |            expConstUB = expConstCand; 
86 |         elseif outN > (N - 0.5*targetNoise)
87 |             expConstLB = expConstCand;
88 |         else
89 |             if debug
90 |                 fprintf('Succeeded with constant %d\n', expConstCand);
91 |             end
92 |             return;
93 |         end
94 |     end
95 | end


--------------------------------------------------------------------------------
/filterCode/filterGaussianMean.m:
--------------------------------------------------------------------------------
 1 | function [ estMean ] = filterGaussianMean(data, eps, tau, cher)
 2 | %filterGaussianMean Estimate the mean of an eps-corrupted sample from an identity-covariance Gaussian using the filter algorithm
 3 | %Suggested value of cher = 2.5
 4 |     [N, d] = size(data);
 5 |     empiricalMean = mean(data);
 6 |     threshold = eps*log(1/eps);
 7 |     centeredData = bsxfun(@minus, data, empiricalMean)/sqrt(N);
 8 | 
 9 |     [U, S, ~] = svdsecon(centeredData', 1);
10 | 
11 |     lambda = S(1,1)^2;
12 |     v = U(:,1);
13 | 
14 |     %If the largest eigenvalue is about right, just return
15 |     if lambda < 1 + 3 * threshold
16 |        estMean = empiricalMean;
17 |     %Otherwise, project in direction of v and filter
18 |     else
19 |         delta = 2*eps;
20 |         projectedData1 = data * v;
21 |         med = median(projectedData1);
22 | 
23 |         projectedData = [abs(data*v - med) data];
24 | 
25 |         sortedProjectedData = sortrows(projectedData);
26 |         for i = 1:N
27 |             T = sortedProjectedData(i,1) - delta;
28 |             if (N - i) > cher * N * (erfc(T / sqrt(2)) / 2 + eps/(d*log(d*eps/tau)))
29 |                 break
30 |             end
31 |         end
32 |         if i == 1 || i == N
33 |             estMean = empiricalMean;
34 |         else 
35 |             estMean = filterGaussianMean(sortedProjectedData(1:i, 2:end), eps, tau, cher);
36 |         end
37 |     end
38 | end


--------------------------------------------------------------------------------
/filterCode/findMaxPoly.m:
--------------------------------------------------------------------------------
 1 | function [ c, M, lambda ] = findMaxPoly(data)
 2 | % Given a dataset (rows assumed already whitened by the empirical
 3 | % covariance, as done in filterGaussianCov), outputs the polynomial
 4 | %           p*(x) = x^T M x - c
 5 | % with unit variance
 6 | 
 7 |     [N, d] = size(data);
 8 |     v = flatten(eye(d));
 9 |     
10 |     dataKron = zeros(N, d * d);
11 |     for i = 1:N
12 |          dataKron(i, :) = kron(data(i, :), data(i, :));
13 |     end
14 |     
15 |     empFourth = dataKron' * dataKron / N - v' * v;
16 |     [U, S, ~] = svdsecon(empFourth, 1);
17 |     
18 |     lambda = S(1, 1) / 2;
19 |     Mflat = U(:, 1);
20 | 
21 |     M = sharpen(Mflat, d) / sqrt(2);
22 |     c = trace(M) / sqrt(2);
23 | 	
24 | end
25 | 


--------------------------------------------------------------------------------
/filterCode/flatten.m:
--------------------------------------------------------------------------------
 1 | function [ v ] = flatten( mat )
 2 | % Given a d by d matrix mat, convert to a 1 by d^2 row vector v in row-major order (same as reshape(mat.', 1, []))
 3 |     [d, ~] = size(mat);
 4 |     v = zeros(1, d * d);
 5 |     for i = 1:d
 6 |         for j = 1:d
 7 |             v((i - 1) * d + j) = mat(i, j);
 8 |         end
 9 |     end
10 | end


--------------------------------------------------------------------------------
/filterCode/mahalanobis.m:
--------------------------------------------------------------------------------
1 | function [Sscaled] = mahalanobis(Shat, S)
2 | %Computes Mahalanobis rescaling of Shat with respect to S
3 |     R = sqrtm(S);           % compute the matrix square root once
4 |     Sscaled = R \ Shat / R; % same as inv(R)*Shat*inv(R), without explicit inverses
5 | end


--------------------------------------------------------------------------------
/filterCode/sharpen.m:
--------------------------------------------------------------------------------
1 | function M = sharpen(v,d)
2 | % Given a d^2 dimensional vector v in row-major order, convert to a d by d matrix M (inverse of flatten)
3 |     M = zeros(d, d);
4 |     for i = 1:d
5 |         for j = 1:d
6 |             M(i, j) = v((i-1)*d + j);
7 |         end
8 |     end
9 | end


--------------------------------------------------------------------------------
/genomicData/colors.txt:
--------------------------------------------------------------------------------
 1 | gold1	255	215	0
 2 | orange	255	165	0
 3 | tan3	205	133	63
 4 | yellow	255	255	0
 5 | yellowgreen	154	205	50
 6 | gold2	238	201	0
 7 | gold3	205	173	0
 8 | aquamarine4	69	139	116
 9 | aquamarine2	118	238	198
10 | aquamarine3	102	205	170
11 | darkolivegreen2	188	238	104
12 | green2	0	238	0
13 | green3	0	205	0
14 | green4	0	139	0
15 | green	0	255	0
16 | pink	255	192	203
17 | red	255	0	0
18 | lightpink3	205	140	149
19 | sienna	160	82	45
20 | dodgerblue1	30	144	255
21 | darkseagreen4	105	139	105
22 | darkseagreen	143	188	143
23 | dodgerblue3	24	116	205
24 | cadetblue1	152	245	255
25 | deepskyblue2	0	178	238
26 | deepskyblue4	0	104	139
27 | cadetblue	95	158	168
28 | cadetblue3	122	197	205
29 | cadetblue2	142	229	238
30 | deepskyblue1	0	191	255
31 | blue	0	0	255
32 | darkorchid3	154	50	205
33 | violetred1	255	62	150
34 | violetred3	205	50	120
35 | palevioletred2	238	121	159
36 | orangered1	255	69	0
37 | springgreen4	0	139	69
38 | 


--------------------------------------------------------------------------------
/testCode/testGaussianCov.m:
--------------------------------------------------------------------------------
 1 | clear;
 2 | eps = 0.05;
 3 | tau = 0.1;
 4 | 
 5 | sampErr = [];
 6 | noisyEmpErr = [];
 7 | filterErr = [];
 8 | ds = 10:10:50;
 9 | 
10 | %Set to 0 for isotropic covariance
11 | %Set to 1 for spiked covariance
12 | spikedCovariance = 1;
13 | 
14 | if spikedCovariance
15 |     spike = 100;
16 | else
17 |     spike = 1;
18 | end
19 | 
20 | 
21 | for d = ds
22 |     N =  0.5*d / eps^2;
23 |     fprintf('Training with dimension = %d, number of samples = %d \n', d, round(N, 0))
24 |     sumEmpErr = 0;
25 |     sumNoisyEmpErr = 0;
26 |     sumFilterErr = 0;
27 | 
28 |     covar = eye(d);
29 |     if spikedCovariance
30 |         covar(1,1) = spike;
31 |     end
32 | 
33 |     X =  mvnrnd(zeros(1,d), covar, round((1-eps)*N)); 
34 |     if spikedCovariance
35 |         U1 = orth(randn(d, d));
36 |         Y = [ 0.5 * randi([-1 1], round(eps *N), d / 2) 0.8 * randi([-2 2], round(eps *N), d / 2 - 1) randi([-spike spike], round(eps *N), 1)] * U1; 
37 |     else
38 |         Y = zeros(round(eps * N), d);
39 |     end
40 |     Z = [X; Y];
41 | 
42 |     fprintf('Sampling error w/o noise...')
43 |     empCov = cov(X);
44 |     sumEmpErr = sumEmpErr + norm(mahalanobis(empCov, covar) - eye(d), 'fro');
45 |     fprintf('done\n')
46 | 
47 |     
48 |     fprintf('Sampling error with noise...')
49 |     empCov = cov(Z);
50 |     sumNoisyEmpErr = sumNoisyEmpErr + norm(mahalanobis(empCov, covar) - eye(d), 'fro');
51 |     fprintf('done\n')
52 |     
53 |     
54 |     fprintf('Filter...')
55 |     [ourCov, filterPoints, ~] = filterGaussianCovTuned(Z, zeros(size(Z)),  eps, tau, false);
56 |     sumFilterErr = sumFilterErr + norm(mahalanobis(ourCov, covar) - eye(d), 'fro');
57 |     fprintf('done\n')
58 |     
59 |     sampErr = [sampErr sumEmpErr];
60 |     noisyEmpErr = [noisyEmpErr sumNoisyEmpErr];
61 |     filterErr = [filterErr sumFilterErr];
62 | end
63 | 
64 | noisyEmpErr = noisyEmpErr - sampErr;
65 | filterErr = filterErr - sampErr;
66 | 
67 | figure(1);
68 | plot(ds, noisyEmpErr, '-gx',  ds, filterErr, '-.b', 'LineWidth', 2)
69 | xlabel('Dimension')
70 | ylabel('Excess Frobenius error')
71 | legend('Sampling Error (with corruption)', 'Filter')


--------------------------------------------------------------------------------
/testCode/testGaussianMean.m:
--------------------------------------------------------------------------------
 1 | clear;
 2 | eps = 0.1;
 3 | tau = 0.1;
 4 | cher = 2.5;
 5 | 
 6 | filterErr = [];
 7 | sampErr = [];
 8 | noisySampErr = []; 
 9 | ds = 100:50:400;
10 | 
11 | for d = ds
12 |     N = 10*floor(d/eps^2);
13 |     fprintf('Training with dimension = %d, number of samples = %d \n', d, round(N, 0))
14 |     sumFilterErr = 0;
15 |     sumSampErr = 0;
16 |     sumNoisySampErr = 0;
17 | 
18 |     X =  mvnrnd(zeros(1,d), eye(d), round((1-eps)*N)) + ones(round((1-eps)*N), d);
19 | 
20 |     fprintf('Sampling Error w/o noise');
21 |     sumSampErr = sumSampErr + norm(mean(X) - ones(1,d));
22 |     fprintf('...done\n')
23 |     
24 |     Y1 = randi([0 1], round(0.5*eps*N), d); 
25 |     Y2 = [12*ones(round(0.5*eps*N),1), -2 * ones(round(0.5*eps*N), 1), zeros(round(0.5 * eps * N), d-2)];
26 |     X = [X; Y1; Y2];
27 | 
28 |     fprintf('Sampling Error with noise');
29 |     sumNoisySampErr = sumNoisySampErr + norm(mean(X) - ones(1,d));
30 |     fprintf('...done\n')
31 |     
32 |     fprintf('Filter')
33 |     sumFilterErr = sumFilterErr + norm(filterGaussianMean(X, eps, tau, cher) - ones(1, d));
34 |     fprintf('...done\n')
35 |     
36 |     filterErr = [filterErr sumFilterErr];
37 |     sampErr = [sampErr sumSampErr];
38 |     noisySampErr = [noisySampErr sumNoisySampErr];
39 | end
40 | 
41 | noisySampErr = noisySampErr - sampErr;
42 | filterErr = filterErr - sampErr;
43 | 
44 | plot(ds, noisySampErr, ds, filterErr, '-.b', 'LineWidth', 2)
45 | xlabel('Dimension')
46 | ylabel('Excess L2 error')
47 | legend('Sampling Error (with noise)', 'Filter')
48 | 
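49 | % Sketch (an addition): print the excess errors plotted above, one row per
50 | % dimension, for runs without a display. The loop variable k is illustrative.
51 | for k = 1:numel(ds)
52 |     fprintf('d = %d: noisy mean %.4f, filter %.4f\n', ds(k), noisySampErr(k), filterErr(k));
53 | end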


--------------------------------------------------------------------------------
/testCode/testGenomicData.m:
--------------------------------------------------------------------------------
 1 | clear;
 2 | eps = 0.10;
 3 | tau = 0.1;
 4 | d = 20;
 5 | lambda = 0.2;
 6 | subsampSize = 100;
 7 | 
 8 | %Process mappings from color names to numbers
 9 | rawColor = importdata('../genomicData/colors.txt');
10 | keys = rawColor.rowheaders;
11 | vals = mat2cell(rawColor.data, ones(37,1));
12 | colorMap = containers.Map(keys, vals);
13 | 
14 | %Process data points
15 | fid = fopen('../genomicData/POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eigs');
16 | A = textscan(fid,'%f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %s', 'HeaderLines',1);
17 | fclose(fid);
18 | V = [A{3} A{4} A{5} A{6} A{7} A{8} A{9} A{10} A{11} A{12} A{13} A{14} A{15} A{16} A{17} A{18} A{19} A{20} A{21} A{22}]; 
19 | 
20 | eigVals = importdata('../genomicData/POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eval');
21 | eigVals = eigVals(1:d);
22 | data = zeros(size(V,1), d);
23 | for i = 1:size(V,1)
24 |     data(i,:) = V(i,:).*eigVals';
25 | end
26 | 
27 | %Convert data point color names to numbers
28 | colorsFid = fopen('../genomicData/POPRESID_Color.txt');
29 | colorsStrings = textscan(colorsFid, '%d %s');
30 | colorsStrings = colorsStrings{2};
31 | dataColors = zeros(length(colorsStrings),3);
32 | for i = 1:length(colorsStrings)
33 |     dataColors(i, :) = colorMap(colorsStrings{i})/255;
34 | end
35 | 
36 | %Generate noisy points
37 | N = size(V,1)/(1-eps);
38 | noise = (1/24)*[randi([0, 2], round(eps*N), d/2) randi([2, 3], round(eps*N), d/2)];
39 | 
40 | %Generate noisy point colors
41 | noiseColors = zeros(round(eps*N), 3);
42 | 
43 | %Apply a random orthogonal transformation, then combine data and noise
44 | randRot = orth(randn(d, d));
45 | data = data * randRot;
46 | noise = noise * randRot;
47 | D = [data; noise];
48 | C = [dataColors; noiseColors];
49 | subsampD = datasample(D, subsampSize);
50 | 
51 | %Generate original data's projection
52 | [dataU, ~, ~] = svd(cov(data));
53 | 
54 | %Generate noised data's projection
55 | [noisedU, ~, ~] = svd(cov(D));
56 | 
57 | %Generate filter output and projection
58 | [filterM, filtered, filteredMetadata] = filterGaussianCovTuned(D, C, eps, tau, 0);
59 | [filterU, ~, ~] = svd(filterM);
60 | 
61 | %Generate plot for original data
62 | fig = figure(1);
63 | clf;
64 | scatter(data*dataU(:,1), data*dataU(:,2), [], dataColors);
65 | hold on;
66 | scatter(noise*dataU(:,1), noise*dataU(:,2), [], noiseColors);
67 | axis([-0.25 0.35 -0.15 0.2])
68 | title('Original Data')
69 | set(fig, 'Position', [100, 100, 640, 480]);
70 | 
71 | %Generate plot for noised data
72 | fig = figure(2);
73 | clf
74 | scatter(data*noisedU(:,1), data*noisedU(:,2), [], dataColors);
75 | hold on;
76 | scatter(noise*noisedU(:,1), noise*noisedU(:,2), [], noiseColors);
77 | axis([-0.25 0.35 -0.15 0.2])
78 | title('Empirical Noised Singular Vectors')
79 | set(fig, 'Position', [100, 100, 640, 480]);
80 | 
81 | %Generate plot for filter's output
82 | fig = figure(3);
83 | clf
84 | scatter(filtered*filterU(:,1),filtered*filterU(:,2), [], filteredMetadata);
85 | axis([-0.25 0.35 -0.15 0.2])
86 | title('Filter Output');
87 | set(fig, 'Position', [100, 100, 640, 480]);
88 | 
89 | %Generate plot for the filter's projection
90 | fig = figure(4);
91 | clf
92 | scatter(data*filterU(:,1), data*filterU(:,2), [], dataColors);
93 | hold on;
94 | scatter(noise*filterU(:,1), noise*filterU(:,2), [], noiseColors);
95 | axis([-0.25 0.35 -0.15 0.2])
96 | title('Filter Projection')
97 | set(fig, 'Position', [100, 100, 640, 480]);


--------------------------------------------------------------------------------