├── README.md ├── comparisonCode ├── ADPCP.m ├── MVE.m ├── compareCovEstimators.m ├── compareGenomicData.m ├── compareMeanEstimators.m ├── geoMedianGaussianMean.m ├── norm_nuc.m ├── pruneGaussianCov.m ├── pruneGaussianMean.m ├── ransacGaussianMean.m ├── ransacMVE.m ├── shrinkage.m ├── specThresh.m └── testGeoMedian.m ├── filterCode ├── filterGaussianCov.m ├── filterGaussianCovTuned.m ├── filterGaussianMean.m ├── findMaxPoly.m ├── flatten.m ├── mahalanobis.m └── sharpen.m ├── genomicData └── colors.txt └── testCode ├── testGaussianCov.m ├── testGaussianMean.m └── testGenomicData.m /README.md: -------------------------------------------------------------------------------- 1 | # Being Robust (in High Dimensions) Can Be Practical 2 | A MATLAB implementation of [Being Robust (in High Dimensions) Can Be Practical](https://arxiv.org/abs/1703.00893) from ICML 2017. 3 | 4 | Prerequisites 5 | === 6 | This project requires installation of the following packages: 7 | * [Fast SVD and PCA](https://www.mathworks.com/matlabcentral/fileexchange/47132-fast-svd-and-pca): Used for faster computation of SVD for some large matrices. 8 | * [Novembre_etal_2008_misc](https://github.com/NovembreLab/Novembre_etal_2008_misc): A repository containing data from [Genes Mirror Geography in Europe](https://www.nature.com/articles/nature07331). Used for our semi-synthetic experiments. 
Extract the following three files to `robust-filter/genomicData`: 9 | * `POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eigs` 10 | * `POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eval` 11 | * `POPRESID_Color.txt` 12 | 13 | Optional 14 | --- 15 | The following packages are optional, and are only used for comparison of our algorithm with alternative estimators: 16 | * [AgnosticMeanAndCovarianceCode](https://github.com/kal2000/AgnosticMeanAndCovarianceCode): An implementation of the algorithms from the paper [Agnostic Estimation of Mean and Covariance](https://arxiv.org/abs/1604.06968) from FOCS 2016, authored by [Kevin A. Lai](https://www.cc.gatech.edu/~klai9/), [Anup B. Rao](https://sites.google.com/site/anupraob/), and [Santosh Vempala](https://www.cc.gatech.edu/~vempala/). Used to compare the statistical accuracy of our algorithms against theirs. 17 | * [CVX](http://cvxr.com/cvx/): A MATLAB library for convex optimization. Used for implementing methods from [Robust Principal Component Analysis?](https://dl.acm.org/citation.cfm?id=1970395) and [Robust PCA via Outlier Pursuit](https://arxiv.org/abs/1010.4237). 18 | 19 | Explanation of Files 20 | === 21 | This repository contains several algorithms -- we identify files relevant to our algorithm, and those which are only used for comparison with alternatives. 22 | 23 | Our Filter algorithms' files 24 | --- 25 | The following are files which are used in our algorithms' implementations, and can all be found in the `filterCode` subdirectory. 
26 | * `filterGaussianMean.m`: Our algorithm for estimating the mean of a Gaussian 27 | * `filterGaussianCov.m`: Our algorithm for estimating the covariance of a Gaussian 28 | * `findMaxPoly.m`: Finds the structured degree-two polynomial which is maximized by the data 29 | * `flatten.m` and `sharpen.m`: Convert between matrix and vector representations 30 | * `filterGaussianCovTuned.m`: A version of our covariance estimation algorithm which is tuned to select hyperparameters 31 | * `mahalanobis.m`: Compute Mahalanobis rescaling of a matrix 32 | 33 | Test files, for demonstrating performance of our estimators for mean and covariance. These can be found in the `testCode` subdirectory: 34 | * `testGaussianMean.m`: Tests our mean estimation algorithm 35 | * `testGaussianCov.m`: Tests our covariance estimation algorithm 36 | * `testGenomicData.m`: Tests our covariance estimation algorithm on a semi-synthetic genome dataset 37 | 38 | Other algorithms' files 39 | --- 40 | The following are files which are used for comparison with other competing algorithms, and can be found in the `comparisonCode` subdirectory: 41 | 42 | Algorithms for estimating the mean of a Gaussian: 43 | * `ransacGaussianMean.m`: A RANSAC-based method 44 | * `geoMedianGaussianMean.m`: Geometric median 45 | * `pruneGaussianMean.m`: Coordinate-wise median, followed by naive pruning of distant points 46 | 47 | Algorithms for estimating the covariance of a Gaussian: 48 | * `pruneGaussianCov.m`: Naive pruning of distant points 49 | * `ransacMVE.m`: A RANSAC-based method 50 | * `MVE.m`: Approximates the Minimum Volume Ellipsoid (MVE) for a small dataset 51 | * `ADPCP.m`: Principal Component Pursuit by Alternating Directions, from [Robust Principal Component Analysis?](https://dl.acm.org/citation.cfm?id=1970395) 52 | * `specThresh.m` and `shrinkage.m`: Singular value thresholding and shrinkage operators 53 | * `norm_nuc.m`: Compute the nuclear norm of a matrix 54 | 55 | Comparison files, for evaluating and comparing the performance of 
estimators for mean and covariance: 56 | * `testGeoMedian.m`: Demonstrates that the geometric median incurs an O(sqrt(d)) loss in accuracy 57 | * `compareMeanEstimators.m`: Compares mean estimation algorithms 58 | * `compareCovEstimators.m`: Compares covariance estimation algorithms 59 | * `compareGenomicData.m`: Compares covariance estimation algorithms on a semi-synthetic genome dataset 60 | 61 | Reproducibility 62 | === 63 | Figures in the paper can be approximately reproduced by running the following scripts, which are in the `comparisonCode` subdirectory: 64 | * Figure 1: `compareMeanEstimators.m` 65 | * Figure 2: `compareCovEstimators.m` 66 | * Figures 3 and 4: `compareGenomicData.m` 67 | 68 | Reference 69 | === 70 | This repository is an implementation of our paper [Being Robust (in High Dimensions) Can Be Practical](https://arxiv.org/abs/1703.00893) from ICML 2017, authored by [Ilias Diakonikolas](http://www.iliasdiakonikolas.org/), [Gautam Kamath](http://www.gautamkamath.com/), [Daniel M. Kane](https://cseweb.ucsd.edu/~dakane/), [Jerry Li](http://www.mit.edu/~jerryzli/), [Ankur Moitra](http://people.csail.mit.edu/moitra/), and [Alistair Stewart](http://www.alistair-stewart.com/). 71 | 72 | See also our original theory paper, [Robust Estimators in High Dimensions without the Computational Intractability](https://arxiv.org/abs/1604.06443), which appeared in FOCS 2016. 73 | 74 | If you use our code or paper, we ask that you please cite: 75 | ``` 76 | @inproceedings{DiakonikolasKKLMS17, 77 | author = {Diakonikolas, Ilias and Kamath, Gautam and Kane, Daniel M. 
and Li, Jerry and Moitra, Ankur and Stewart, Alistair}, 78 | title = {Being Robust (in High Dimensions) Can Be Practical}, 79 | booktitle = {Proceedings of the 34th International Conference on Machine Learning}, 80 | series = {ICML '17}, 81 | year = {2017}, 82 | pages = {999--1008}, 83 | publisher = {JMLR, Inc.} 84 | } 85 | ``` 86 | -------------------------------------------------------------------------------- /comparisonCode/ADPCP.m: -------------------------------------------------------------------------------- 1 | function [ L, S ] = ADPCP( M, lambda ) 2 | %Principal Component Pursuit by Alternating Directions 3 | %Based on Algorithm 1 of "Robust Principal Component Analysis?" by 4 | %Candes, Li, Ma, and Wright 5 | [m, n] = size(M); 6 | 7 | mu = m * n / (4 * norm_nuc(M)); 8 | 9 | L = zeros(m, n); 10 | S = zeros(m, n); 11 | Y = zeros(m, n); 12 | 13 | 14 | delta = 1e-7; 15 | 16 | while norm(M - L - S, 'fro') > delta * norm(M, 'fro') 17 | L = specThresh(M - S + Y / mu, 1 / mu); 18 | S = shrinkage(M - L + Y / mu, lambda / mu); 19 | Y = Y + mu * (M - L - S); 20 | end 21 | end -------------------------------------------------------------------------------- /comparisonCode/MVE.m: -------------------------------------------------------------------------------- 1 | function [ estCov, vol ] = MVE( samp, eps ) 2 | [N, d] = size(samp); 3 | 4 | empCov = samp' * samp / N; 5 | empCovInv = empCov^(-1); 6 | mDists = zeros(N, 1); 7 | for j = 1:N 8 | mDists(j) = samp(j, :) * empCovInv * samp(j, :)'; 9 | end 10 | mDistsSorted = sort(mDists); 11 | lambda = mDistsSorted(N - floor(eps * N)); 12 | estCov = lambda * empCov; 13 | vol = det(empCov) * lambda^d; 14 | end -------------------------------------------------------------------------------- /comparisonCode/compareCovEstimators.m: -------------------------------------------------------------------------------- 1 | clear; 2 | eps = 0.05; 3 | tau = 0.1; 4 | 5 | sampErr = []; 6 | noisyEmpErr = []; 7 | ransacErr = []; 8 | filterErr 
= []; 9 | pruneErr = []; 10 | LRVErr = []; 11 | ds = 10:10:100; 12 | 13 | %Set to 0 for isotropic covariance, generates left half of Figure 2 14 | %Set to 1 for spiked covariance, generates right half of Figure 2 15 | spikedCovariance = 1; 16 | 17 | if spikedCovariance 18 | spike = 100; 19 | else 20 | spike = 1; 21 | end 22 | 23 | 24 | for d = ds 25 | N = 0.5*d / eps^2; 26 | fprintf('Training with dimension = %d, number of samples = %d \n', d, round(N, 0)) 27 | sumEmpErr = 0; 28 | sumNoisyEmpErr = 0; 29 | sumFilterErr = 0; 30 | sumMVEErr = 0; 31 | sumLRVErr = 0; 32 | sumPruneErr = 0; 33 | 34 | covar = eye(d); 35 | if spikedCovariance 36 | covar(1,1) = spike; 37 | end 38 | 39 | X = mvnrnd(zeros(1,d), covar, round((1-eps)*N)); 40 | if spikedCovariance 41 | U1 = orth(randn(d, d)); 42 | Y = [ 0.5 * randi([-1 1], round(eps *N), d / 2) 0.8 * randi([-2 2], round(eps *N), d / 2 - 1) randi([-spike spike], round(eps *N), 1)] * U1; 43 | else 44 | Y = zeros(round(eps * N), d); 45 | end 46 | Z = [X; Y]; 47 | 48 | fprintf('Sampling Error w/o noise...'); 49 | empCov = cov(X); 50 | sumEmpErr = sumEmpErr + norm(mahalanobis(empCov, covar) - eye(d), 'fro'); 51 | fprintf('done\n') 52 | 53 | fprintf('Sampling Error with noise...'); 54 | empCov = cov(Z); 55 | sumNoisyEmpErr = sumNoisyEmpErr + norm(mahalanobis(empCov, covar) - eye(d), 'fro'); 56 | fprintf('done\n') 57 | 58 | fprintf('Filter...') 59 | [ourCov, filterPoints, ~] = filterGaussianCovTuned(Z, zeros(size(Z)), eps, tau, false); 60 | sumFilterErr = sumFilterErr + norm(mahalanobis(ourCov, covar) - eye(d), 'fro'); 61 | fprintf('done\n') 62 | 63 | fprintf('Prune...') 64 | [pruneCov, prunePoints, ~] = pruneGaussianCov(Z, zeros(size(Z)), tau, false); 65 | sumPruneErr = sumPruneErr + norm(mahalanobis(pruneCov, covar) - eye(d), 'fro'); 66 | fprintf('done\n') 67 | 68 | 69 | fprintf('RANSAC...') 70 | sumMVEErr = sumMVEErr + norm(mahalanobis(ransacMVE(Z, eps, ceil(d / eps), 1000), covar) - eye(d), 'fro'); 71 | fprintf('done\n') 72 | 73 | 
fprintf('LRV...') 74 | [~, lrvCov, ~] = agnosticCovarianceGeneral(Z, eps); 75 | sumLRVErr = sumLRVErr + norm(mahalanobis(lrvCov, covar) - eye(d), 'fro') ; 76 | fprintf('done\n') 77 | 78 | 79 | sampErr = [sampErr sumEmpErr]; 80 | noisyEmpErr = [noisyEmpErr sumNoisyEmpErr]; 81 | ransacErr = [ransacErr sumMVEErr]; 82 | filterErr = [filterErr sumFilterErr]; 83 | pruneErr = [pruneErr sumPruneErr]; 84 | LRVErr = [LRVErr sumLRVErr]; 85 | end 86 | 87 | noisyEmpErr = noisyEmpErr - sampErr; 88 | pruneErr = pruneErr - sampErr; 89 | ransacErr = ransacErr - sampErr; 90 | LRVErr = LRVErr - sampErr; 91 | filterErr = filterErr - sampErr; 92 | 93 | figure(1); 94 | plot(ds, noisyEmpErr, '-gx', ds, ransacErr, ds, LRVErr, ds, pruneErr, ds, filterErr, '-.b', 'LineWidth', 2) 95 | xlabel('Dimension') 96 | ylabel('Excess Frobenius error') 97 | legend('Sampling Error (with corruption)', 'RANSAC', 'LRV', 'Prune', 'Filter') 98 | 99 | figure(2); 100 | plot(ds, LRVErr, ds, filterErr, '-.b', 'LineWidth', 2) 101 | xlabel('Dimension') 102 | ylabel('Excess Frobenius error') 103 | legend('LRV', 'Filter') -------------------------------------------------------------------------------- /comparisonCode/compareGenomicData.m: -------------------------------------------------------------------------------- 1 | %Generates Figures 3 and 4 2 | clear; 3 | eps = 0.10; 4 | tau = 0.1; 5 | d = 20; 6 | lambda = 0.2; 7 | subsampSize = 100; 8 | 9 | %Process mappings from color names to numbers 10 | rawColor = importdata('../genomicData/colors.txt'); 11 | keys = rawColor.rowheaders; 12 | vals = mat2cell(rawColor.data, ones(37,1)); 13 | colorMap = containers.Map(keys, vals); 14 | 15 | %Process data points 16 | fid = fopen('../genomicData/POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eigs'); 17 | A = textscan(fid,'%f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %s', 'HeaderLines',1); 18 | fclose(fid); 19 | V = [A{3} A{4} A{5} A{6} A{7} A{8} A{9} A{10} A{11} A{12} A{13} A{14} A{15} A{16} A{17} 
A{18} A{19} A{20} A{21} A{22}]; 20 | 21 | eigs = importdata('../genomicData/POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eval'); 22 | eigs = eigs(1:d); 23 | data = zeros(size(V,1), d); 24 | for i = 1:size(V,1) 25 | data(i,:) = V(i,:).*eigs'; 26 | end 27 | 28 | %Convert data point color names to numbers 29 | colorsFid = fopen('../genomicData/POPRESID_Color.txt'); 30 | colorsStrings = textscan(colorsFid, '%d %s'); 31 | colorsStrings = colorsStrings{2}; 32 | dataColors = zeros(length(colorsStrings),3); 33 | for i = 1:length(colorsStrings) 34 | dataColors(i, :) = colorMap(colorsStrings{i})/255; 35 | end 36 | 37 | %Generate noisy points 38 | N = size(V,1)/(1-eps); 39 | noise = (1/24)*[randi([0, 2], round(eps*N), d/2) randi([2, 3], round(eps*N), d/2)]; 40 | 41 | %Generate noisy point colors 42 | noiseColors = zeros(round(eps*N), 3); 43 | 44 | %Combine data and noise 45 | randRot = orth(randn(d, d)); 46 | data = data * randRot; 47 | noise = noise * randRot; 48 | D = [data; noise]; 49 | C = [dataColors; noiseColors]; 50 | subsampD = datasample(D, subsampSize); 51 | 52 | %Generate original data's projection 53 | [dataU, ~, ~] = svd(cov(data)); 54 | 55 | %Generate noised data's projection 56 | [noisedU, ~, ~] = svd(cov(D)); 57 | 58 | %Generate filter output and projection 59 | [filterM, filtered, filteredMetadata] = filterGaussianCovTuned(D, C, eps, tau, 0); 60 | [filterU, ~, ~] = svd(filterM); 61 | 62 | %Generate filter output and projection (no noise) 63 | [filterNoNoiseM, filteredNoNoise, filterNoNoiseMetadata] = filterGaussianCovTuned(data, dataColors, eps, tau, 0); 64 | [filterNoNoiseU, ~, ~] = svd(filterNoNoiseM); 65 | 66 | %Generate LRV projection 67 | [~, lrvM, ~] = agnosticCovarianceGeneral(D, eps); 68 | [lrvU, ~, ~] = svd(lrvM); 69 | 70 | %Generate Pruning projection 71 | [pruneM, ~, ~] = pruneGaussianCov(D, C, tau, 0); 72 | [pruneU, ~, ~] = svd(pruneM); 73 | 74 | %Generate RANSAC projection 75 | [ransacM] = ransacMVE(D, eps, ceil(d / eps), 1000); 76 | 
[ransacU, ~, ~] = svd(ransacM); 77 | 78 | %Generate CLMW SDP projection 79 | cvx_begin quiet 80 | variable clmwsdpL(subsampSize, d) 81 | variable S(subsampSize, d) 82 | dual variable Q 83 | minimize norm_nuc(clmwsdpL) + lambda * sum(sum(abs(S))) 84 | subject to 85 | clmwsdpL + S == subsampD; 86 | cvx_end 87 | clmwsdpM = clmwsdpL' * clmwsdpL; 88 | [clmwsdpU, ~, ~] = svd(clmwsdpM); 89 | 90 | %Generate CLMW AM projection 91 | [clmwamL, ~] = ADPCP(D, lambda); 92 | clmwamM = clmwamL' * clmwamL; 93 | [clmwamU, ~, ~] = svd(clmwamM); 94 | 95 | %Generate XCS SDP projection 96 | cvx_begin quiet 97 | variable xcsL(subsampSize, d) 98 | variable S2(subsampSize, d) 99 | dual variable Q 100 | minimize norm_nuc(xcsL) + lambda * sum(norms(S2, 2, 2)) 101 | subject to 102 | xcsL + S2 == subsampD; 103 | cvx_end 104 | xcsM = xcsL' * xcsL; 105 | [xcsU, ~, ~] = svd(xcsM); 106 | 107 | %Generate plot for original data 108 | fig = figure(1); 109 | clf; 110 | scatter(data*dataU(:,1), data*dataU(:,2), [], dataColors); 111 | hold on; 112 | scatter(noise*dataU(:,1), noise*dataU(:,2), [], noiseColors); 113 | axis([-0.25 0.35 -0.15 0.2]) 114 | title('Original Data') 115 | set(fig, 'Position', [100, 100, 640, 480]); 116 | 117 | %Generate plot for noised data 118 | fig = figure(2); 119 | clf 120 | scatter(data*noisedU(:,1), data*noisedU(:,2), [], dataColors); 121 | hold on; 122 | scatter(noise*noisedU(:,1), noise*noisedU(:,2), [], noiseColors); 123 | axis([-0.25 0.35 -0.15 0.2]) 124 | title('Empirical Noised Singular Vectors') 125 | set(fig, 'Position', [100, 100, 640, 480]); 126 | 127 | %Generate plot for filter's output 128 | fig = figure(3); 129 | clf 130 | scatter(filtered*filterU(:,1),filtered*filterU(:,2), [], filteredMetadata); 131 | axis([-0.25 0.35 -0.15 0.2]) 132 | title('Filter Output'); 133 | set(fig, 'Position', [100, 100, 640, 480]); 134 | 135 | %Generate plot for filter's output without noise 136 | fig = figure(4); 137 | clf 138 | scatter(filteredNoNoise*filterNoNoiseU(:,1), 
filteredNoNoise*filterNoNoiseU(:,2), [], filterNoNoiseMetadata); 139 | axis([-0.25 0.35 -0.15 0.2]) 140 | title('Filter Output (no noise)'); 141 | set(fig, 'Position', [100, 100, 640, 480]); 142 | 143 | %Generate plot for the filter's projection 144 | fig = figure(5); 145 | clf 146 | scatter(data*filterU(:,1), data*filterU(:,2), [], dataColors); 147 | hold on; 148 | scatter(noise*filterU(:,1), noise*filterU(:,2), [], noiseColors); 149 | axis([-0.25 0.35 -0.15 0.2]) 150 | title('Filter Projection') 151 | set(fig, 'Position', [100, 100, 640, 480]); 152 | 153 | %Generate plot for LRV's projection 154 | fig = figure(6); 155 | clf 156 | scatter(data*lrvU(:,1), data*lrvU(:,2), [], dataColors); 157 | hold on; 158 | scatter(noise*lrvU(:,1), noise*lrvU(:,2), [], noiseColors); 159 | axis([-0.25 0.35 -0.15 0.2]) 160 | title('LRV Projection') 161 | set(fig, 'Position', [100, 100, 640, 480]); 162 | 163 | %Generate plot for pruned data's projection 164 | fig = figure(7); 165 | clf 166 | scatter(data*pruneU(:,1), data*pruneU(:,2), [], dataColors); 167 | hold on; 168 | scatter(noise*pruneU(:,1), noise*pruneU(:,2), [], noiseColors); 169 | axis([-0.25 0.35 -0.15 0.2]) 170 | title('Pruning Projection') 171 | set(fig, 'Position', [100, 100, 640, 480]); 172 | 173 | %Generate plot for RANSAC's projection 174 | fig = figure(8); 175 | clf 176 | scatter(data*ransacU(:,1), data*ransacU(:,2), [], dataColors); 177 | hold on; 178 | scatter(noise*ransacU(:,1), noise*ransacU(:,2), [], noiseColors); 179 | axis([-0.25 0.35 -0.15 0.2]) 180 | title('RANSAC Projection') 181 | set(fig, 'Position', [100, 100, 640, 480]); 182 | 183 | %Generate plot for CLMW SDP projection 184 | fig = figure(9); 185 | clf 186 | scatter(data*clmwsdpU(:,1), data*clmwsdpU(:,2), [], dataColors); 187 | hold on; 188 | scatter(noise*clmwsdpU(:,1), noise*clmwsdpU(:,2), [], noiseColors); 189 | axis([-0.25 0.35 -0.15 0.2]) 190 | title('CLMW SDP Projection') 191 | set(fig, 'Position', [100, 100, 640, 480]); 192 | 193 | %Generate 
plot for CLMW AM projection 194 | fig = figure(10); 195 | clf 196 | scatter(data*clmwamU(:,1), data*clmwamU(:,2), [], dataColors); 197 | hold on; 198 | scatter(noise*clmwamU(:,1), noise*clmwamU(:,2), [], noiseColors); 199 | axis([-0.25 0.35 -0.15 0.2]) 200 | title('CLMW ADMM Projection') 201 | set(fig, 'Position', [100, 100, 640, 480]); 202 | 203 | %Generate plot for XCS projection 204 | fig = figure(11); 205 | clf 206 | scatter(data*xcsU(:,1), data*xcsU(:,2), [], dataColors); 207 | hold on; 208 | scatter(noise*xcsU(:,1), noise*xcsU(:,2), [], noiseColors); 209 | axis([-0.25 0.35 -0.15 0.2]) 210 | title('XCS Projection') 211 | set(fig, 'Position', [100, 100, 640, 480]); 212 | -------------------------------------------------------------------------------- /comparisonCode/compareMeanEstimators.m: -------------------------------------------------------------------------------- 1 | %Generates Figure 1 from paper 2 | clear 3 | eps = 0.1; 4 | tau = 0.1; 5 | cher = 2.5; 6 | 7 | filterErr = []; 8 | medianErr = []; 9 | ransacErr = []; 10 | LRVErr = []; 11 | sampErr = []; 12 | noisySampErr = []; 13 | prunedErr = []; 14 | ds = 100:50:400; 15 | 16 | for d = ds 17 | N = 10*floor(d/eps^2); 18 | fprintf('Training with dimension = %d, number of samples = %d \n', d, round(N, 0)) 19 | sumFilterErr = 0; 20 | sumMedErr = 0; 21 | sumRansacErr = 0; 22 | sumLRVErr = 0; 23 | sumSampErr = 0; 24 | sumNoisySampErr = 0; 25 | sumPrunedErr = 0; 26 | 27 | X = mvnrnd(zeros(1,d), eye(d), round((1-eps)*N)) + ones(round((1-eps)*N), d); 28 | 29 | fprintf('Sampling Error w/o noise...'); 30 | sumSampErr = sumSampErr + norm(mean(X) - ones(1,d)); 31 | fprintf('done\n') 32 | 33 | Y1 = randi([0 1], round(0.5*eps*N), d); 34 | Y2 = [12*ones(round(0.5*eps*N),1), -2 * ones(round(0.5*eps*N), 1), zeros(round(0.5 * eps * N), d-2)]; 35 | X = [X; Y1; Y2]; 36 | 37 | fprintf('Sampling Error with noise...'); 38 | sumNoisySampErr = sumNoisySampErr + norm(mean(X) - ones(1,d)); 39 | fprintf('done\n') 40 | 41 | 42 | 
fprintf('Pruning...'); 43 | [prunedMean, ~] = pruneGaussianMean(X, eps); 44 | sumPrunedErr = sumPrunedErr + norm(prunedMean - ones(1, d)); 45 | fprintf('done\n') 46 | 47 | fprintf('Median...') 48 | gm = geoMedianGaussianMean(X); 49 | sumMedErr = sumMedErr + norm(gm - ones(1, d)); 50 | fprintf('done\n') 51 | 52 | fprintf('Ransac...') 53 | sumRansacErr = sumRansacErr + norm(ransacGaussianMean(X, eps, tau) - ones(1, d)); 54 | fprintf('done\n') 55 | 56 | fprintf('LRV...') 57 | sumLRVErr = sumLRVErr + norm(agnosticMeanGeneral(X, eps) - ones(1,d)); 58 | fprintf('done\n') 59 | 60 | fprintf('Filter...') 61 | sumFilterErr = sumFilterErr + norm(filterGaussianMean(X, eps, tau, cher) - ones(1, d)); 62 | fprintf('done\n') 63 | 64 | medianErr = [medianErr sumMedErr]; 65 | ransacErr = [ransacErr sumRansacErr]; 66 | filterErr = [filterErr sumFilterErr]; 67 | LRVErr = [LRVErr sumLRVErr]; 68 | sampErr = [sampErr sumSampErr]; 69 | noisySampErr = [noisySampErr sumNoisySampErr]; 70 | prunedErr = [prunedErr sumPrunedErr]; 71 | end 72 | 73 | noisySampErr = noisySampErr - sampErr; 74 | prunedErr = prunedErr - sampErr; 75 | medianErr = medianErr - sampErr; 76 | ransacErr = ransacErr - sampErr; 77 | LRVErr = LRVErr - sampErr; 78 | filterErr = filterErr - sampErr; 79 | 80 | plot(ds, noisySampErr, ds, prunedErr, ds, medianErr, '-ro', ds, ransacErr, ds, LRVErr, ds, filterErr, '-.b', 'LineWidth', 2) 81 | xlabel('Dimension') 82 | ylabel('Excess L2 error') 83 | legend('Sampling Error (with noise)', 'Naive Pruning', 'Geometric Median', 'RANSAC', 'LRV', 'Filter') -------------------------------------------------------------------------------- /comparisonCode/geoMedianGaussianMean.m: -------------------------------------------------------------------------------- 1 | function [ estMean ] = geoMedianGaussianMean(data) 2 | %Approximately compute the geometric median using Weiszfeld's algorithm 3 | numIters = 100; 4 | [N, ~] = size(data); 5 | curEstimate = mean(data); 6 | for i = 1:numIters 7 | num = 
0; 8 | den = 0; 9 | for j = 1:N 10 | distToEstimate = norm(data(j,:) - curEstimate); 11 | num = num + data(j,:)/distToEstimate; 12 | den = den + 1/distToEstimate; 13 | end 14 | curEstimate = num/den; 15 | end 16 | estMean = curEstimate; 17 | end -------------------------------------------------------------------------------- /comparisonCode/norm_nuc.m: -------------------------------------------------------------------------------- 1 | function [ norm ] = norm_nuc( A ) 2 | %norm_nuc Compute the nuclear norm (or Schatten-1 norm) of a matrix A 3 | [~, S, ~] = svdecon(A); 4 | norm = trace(abs(S)); 5 | end -------------------------------------------------------------------------------- /comparisonCode/pruneGaussianCov.m: -------------------------------------------------------------------------------- 1 | function [ estCov, filteredPoints, filteredMetadata] = pruneGaussianCov( data, metadata, tau, debug ) 2 | % Remove any points which are sufficiently far away from the empirical covariance 3 | [N, d] = size(data); 4 | C = 0.2; 5 | 6 | empCov = data' * data / N; 7 | restart = false; 8 | remove = []; 9 | for i = 1:N 10 | if mod(i, 10000) == 0 && debug 11 | fprintf('Initial pruning iteration %d\n',i); 12 | end 13 | x = data(i, :); 14 | if x * empCov^(-1) * x' > C * d * log(N / tau) 15 | remove = [i remove]; 16 | restart = true; 17 | end 18 | end 19 | if restart 20 | if debug 21 | fprintf('Pruning %d points\n', length(remove)); 22 | end 23 | data(remove, :) = []; 24 | metadata(remove, :) = []; 25 | [estCov, filteredPoints, filteredMetadata] = pruneGaussianCov(data, metadata, tau, debug); 26 | return; 27 | end 28 | estCov = empCov; 29 | filteredPoints = data; 30 | filteredMetadata = metadata; 31 | if debug 32 | fprintf('Remaining points: %d\n', size(filteredPoints, 1)); 33 | end 34 | end -------------------------------------------------------------------------------- /comparisonCode/pruneGaussianMean.m: 
-------------------------------------------------------------------------------- 1 | function [ estMean, filteredPoints] = pruneGaussianMean(data, eps) 2 | %Use a naive method to prune points which are "obviously far" from mean 3 | [N, d] = size(data); 4 | coordMed = zeros(1,d); 5 | filteredPoints = zeros(size(data)); 6 | numFilteredPoints = 0; 7 | 8 | for i = 1:d 9 | coordMed(i) = median(data(:,i)); 10 | end 11 | 12 | for i = 1:N 13 | if norm(data(i,:) - coordMed) < 0.33*sqrt(d*log(N/eps)) 14 | numFilteredPoints = numFilteredPoints + 1; 15 | filteredPoints(numFilteredPoints,:) = data(i,:); 16 | end 17 | end 18 | filteredPoints = filteredPoints(1:numFilteredPoints,:); 19 | estMean = mean(filteredPoints); 20 | end -------------------------------------------------------------------------------- /comparisonCode/ransacGaussianMean.m: -------------------------------------------------------------------------------- 1 | function [ estMean ] = ransacGaussianMean( data, eps, tau) 2 | %Use RANSAC to estimate the mean of Gaussian data 3 | [N, d] = size(data); 4 | empiricalMean = mean(data); 5 | ransacN = ceil(2*(d*log(4) + log(2/tau))/eps^2); 6 | if ransacN > N 7 | estMean = empiricalMean; 8 | return; 9 | end 10 | 11 | numIters = 100; 12 | thresh = d + 2*(sqrt(d * log(N/tau)) + log(N/tau)) + eps^2*(log(1/eps))^2; 13 | 14 | %As a baseline, see how many inliers the mean has 15 | bestInliers = 0; 16 | bestMean = empiricalMean; 17 | for j = 1:N 18 | if norm(empiricalMean - data(j,:))^2 <= thresh 19 | bestInliers = bestInliers + 1; 20 | end 21 | end 22 | 23 | for i = 1:numIters 24 | ransacData = data(randsample(1:N, ransacN),:); 25 | ransacMean = mean(ransacData); 26 | curInliers = 0; 27 | for j = 1:N 28 | if norm(ransacMean - data(j,:))^2 <= thresh 29 | curInliers = curInliers + 1; 30 | end 31 | end 32 | if curInliers > bestInliers 33 | bestMean = ransacMean; 34 | bestInliers = curInliers; 35 | end 36 | end 37 | 38 | estMean = bestMean; 39 | end 
-------------------------------------------------------------------------------- /comparisonCode/ransacMVE.m: -------------------------------------------------------------------------------- 1 | function [ estCov ] = ransacMVE( data, eps, sampSize, iter ) 2 | % Use RANSAC to estimate the Minimum Volume Ellipsoid (MVE) as per 3 | % http://webmining.spd.louisville.edu/wp-content/uploads/2014/05/A-Brief-Overview-of-Robust-Statistics.pdf 4 | 5 | [~, d] = size(data); 6 | 7 | estCov = zeros(d, d); 8 | bestVol = Inf; 9 | 10 | for i = 1:iter 11 | if mod(i, 10000) == 0 12 | fprintf('ransac iteration i = %d\n', i) 13 | end 14 | samp = datasample(data, sampSize); 15 | [currentCov, currentVol] = MVE(samp, eps); 16 | if currentVol < bestVol 17 | estCov = currentCov; 18 | bestVol = currentVol; 19 | end 20 | end 21 | 22 | estCov = estCov / chi2inv(1 - eps, d); 23 | 24 | end -------------------------------------------------------------------------------- /comparisonCode/shrinkage.m: -------------------------------------------------------------------------------- 1 | function [ M ] = shrinkage( M, tau ) 2 | %Shrinkage operator, as defined in CandesLMW'11 3 | [ilist,jlist,vlist] = find(M); 4 | for i = 1:length(ilist) 5 | v = vlist(i); 6 | M(ilist(i), jlist(i)) = sign(v) * max(abs(v) - tau, 0); 7 | end 8 | end -------------------------------------------------------------------------------- /comparisonCode/specThresh.m: -------------------------------------------------------------------------------- 1 | function [ Mthres ] = specThresh( M, tau ) 2 | %Singular Value Thresholding operator, as defined in CandesLMW'11 3 | [U, S, V] = svd(M); 4 | Mthres = U * shrinkage(S, tau) * V'; 5 | end -------------------------------------------------------------------------------- /comparisonCode/testGeoMedian.m: -------------------------------------------------------------------------------- 1 | %Demonstrates that the geometric median incurs a sqrt(d) loss in accuracy 2 | eps = 0.5; 3 | dims = floor(linspace(1,700,10)); 4 | 
err = []; 5 | for d = dims 6 | N = ceil(10*(d/eps^2)); 7 | fprintf('Training with dimension = %d, number of samples = %d \n', d, round(N, 0)) 8 | sumErr = 0; 9 | for i = 1:3 10 | X = mvnrnd(zeros(1,d), eye(d), round((1-eps)*N)) + ones(round((1-eps)*N), d); 11 | Y = zeros(round(eps * N), d); 12 | X = [X; Y]; 13 | sumErr = sumErr + norm(geoMedianGaussianMean(X) - ones(1, d)); 14 | end 15 | err = [err sumErr / 3]; 16 | end 17 | plot(dims, err) -------------------------------------------------------------------------------- /filterCode/filterGaussianCov.m: -------------------------------------------------------------------------------- 1 | function [ estCov, filteredPoints, filteredMetadata] = filterGaussianCov( data, metadata, eps, tau, expConst, debug ) 2 | MAX_COND = 10000; 3 | 4 | [N, d] = size(data); 5 | threshold = eps*(log(1/eps))^2; 6 | C1 = 0.4; 7 | C2 = 0; 8 | 9 | empCov = data' * data / N; 10 | condition = cond(empCov); 11 | 12 | % Our empirical covariance is too ill-conditioned--we probably threw 13 | % away too many points, so let's just throw everything away and restart 14 | if condition > MAX_COND || N < d 15 | if debug 16 | fprintf('Ill conditioned %d %d %d\n', condition, N, d); 17 | end 18 | estCov = zeros(d); 19 | filteredPoints = []; 20 | filteredMetadata = []; 21 | return 22 | end 23 | 24 | 25 | empCovInv = empCov^(-1); 26 | % Remove any points which are very far away from the empirical 27 | % covariance 28 | restart = false; 29 | remove = []; 30 | for i = 1:N 31 | if mod(i, 10000) == 0 && debug 32 | fprintf('Initial pruning iteration %d\n',i); 33 | end 34 | x = data(i, :); 35 | if x * empCovInv * x' > C1 * d * log(N / tau) 36 | remove = [i remove]; 37 | restart = true; 38 | end 39 | end 40 | if restart 41 | data(remove, :) = []; 42 | metadata(remove, :) = []; 43 | [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConst, debug); 44 | return 45 | end 46 | 47 | if debug 48 | fprintf('After first for loop\n') 
49 | end 50 | 51 | 52 | rotData = data * empCovInv^(1/2); 53 | if debug 54 | fprintf('before findMaxPoly\n'); 55 | end 56 | [~, M, lambda] = findMaxPoly(rotData); 57 | if debug 58 | fprintf('after findMaxPoly\n'); 59 | end 60 | 61 | if debug 62 | fprintf('lambda = %d, threshold = %d\n', lambda, 1 + C2 * threshold); 63 | if d < 12 64 | svd(M) 65 | end 66 | end 67 | 68 | %If top eigenvalue isn't too large, the dataset is OK 69 | %Otherwise, find points which violate the tail bound 70 | if lambda < 1 + C2 * threshold 71 | estCov = empCov; 72 | filteredPoints = data; 73 | filteredMetadata = metadata; 74 | else 75 | projectedData = zeros(N, 1); 76 | for i = 1:N 77 | projectedData(i) = rotData(i, :) * M * rotData(i, :)'; 78 | end 79 | 80 | med = mean(projectedData); 81 | 82 | if debug 83 | fprintf('mean %d, trace %d, variance %d , eigenvalue %d \n', med, trace(M), var(projectedData), lambda); 84 | fprintf('N %d \n',N); 85 | end 86 | 87 | indices = abs(projectedData - med); 88 | sortedProjectedData = sortrows([indices data]); 89 | sortedMetadata = sortrows([indices metadata]); 90 | 91 | for i = 1:N 92 | T = sortedProjectedData(i,1); 93 | rhs = N * (12 * exp(-expConst*T) + 3 * eps/(d*log(N/tau))^2); 94 | if debug && ((N - i) > rhs || mod(N - i, 1000) == 10) 95 | fprintf('N - i = %d \n', N - i); 96 | fprintf('other thing = %d \n', rhs); 97 | fprintf('T = %d\n', T); 98 | fprintf('first term RHS %d\n', N * 12 * exp(-expConst*T)); 99 | fprintf('second term RHS %d\n', 3 * N * eps/(d*log(N/tau))^2); 100 | end 101 | if (N - i) > rhs 102 | break 103 | end 104 | end 105 | if i == 1 || i == N 106 | if debug 107 | fprintf('No data points removed, i = %d\n', i); 108 | end 109 | estCov = empCov; 110 | filteredPoints = data; 111 | filteredMetadata = metadata; 112 | return; 113 | else 114 | [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(sortedProjectedData(1:i, 2:end), sortedMetadata(1:i,2:end), eps, tau, expConst, debug); 115 | end 116 | end 117 | end 
-------------------------------------------------------------------------------- /filterCode/filterGaussianCovTuned.m: -------------------------------------------------------------------------------- 1 | function [ estCov, filteredPoints, filteredMetadata] = filterGaussianCovTuned( data, metadata, eps, tau, debug ) 2 | expConstUB = inf; 3 | expConstLB = 0; 4 | expConstCand = 2; 5 | N = size(data, 1); 6 | targetNoise = eps*N; 7 | 8 | if debug 9 | fprintf('Number of points: %d\n', N); 10 | fprintf('Target number of noise points is %d\n', targetNoise) 11 | fprintf('Trying to filter with constant %d\n\n\n', expConstCand); 12 | end 13 | 14 | 15 | [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConstCand, debug); 16 | outN = size(filteredPoints, 1); 17 | 18 | if debug 19 | fprintf('Number of points output by filter is %d\n', outN); 20 | end 21 | 22 | %Find upper and lower bounds for removing the right number of points 23 | if outN < (N - 1.5*targetNoise) 24 | while outN < (N - 1.5*targetNoise) 25 | expConstUB = expConstCand; 26 | expConstCand = expConstCand / 2; 27 | 28 | if debug 29 | fprintf('Trying to filter with constant %d\n\n\n', expConstCand); 30 | end 31 | [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConstCand, debug); 32 | outN = size(filteredPoints, 1); 33 | 34 | if debug 35 | fprintf('Number of points output by filter is %d\n', outN); 36 | end 37 | end 38 | expConstLB = expConstCand; 39 | elseif outN > (N - 0.5*targetNoise) 40 | while outN > (N - 0.5*targetNoise) 41 | expConstLB = expConstCand; 42 | expConstCand = expConstCand * 2; 43 | if debug 44 | fprintf('Trying to filter with constant %d\n\n\n', expConstCand); 45 | end 46 | [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConstCand, debug); 47 | outN = size(filteredPoints, 1); 48 | if debug 49 | fprintf('Number of points output by filter is %d\n', outN); 50 | end 51 | end 
52 | expConstUB = expConstCand; 53 | end 54 | if outN > (N - 1.5*targetNoise) && outN < (N - 0.5*targetNoise) 55 | if debug 56 | fprintf('Succeeded with constant %d\n', expConstCand); 57 | end 58 | return; 59 | end 60 | 61 | %Binary search within the interval to remove the right number of points 62 | numTrials = 0; 63 | while 1 64 | numTrials = numTrials + 1; 65 | if numTrials >= 10 66 | if debug 67 | fprintf('Quitting due to too many trials\n'); 68 | end 69 | [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConstUB, debug); 70 | return; 71 | end 72 | 73 | expConstCand = (expConstUB + expConstLB)/2; 74 | if debug 75 | fprintf('Trying to filter with constant %d\n\n\n', expConstCand); 76 | end 77 | [estCov, filteredPoints, filteredMetadata] = filterGaussianCov(data, metadata, eps, tau, expConstCand, debug); 78 | outN = size(filteredPoints, 1); 79 | 80 | if debug 81 | fprintf('Number of points output by filter is %d\n', outN); 82 | end 83 | 84 | if outN < (N - 1.5*targetNoise) 85 | expConstUB = expConstCand; 86 | elseif outN > (N - 0.5*targetNoise) 87 | expConstLB = expConstCand; 88 | else 89 | if debug 90 | fprintf('Succeeded with constant %d\n', expConstCand); 91 | end 92 | return; 93 | end 94 | end 95 | end -------------------------------------------------------------------------------- /filterCode/filterGaussianMean.m: -------------------------------------------------------------------------------- 1 | function [ estMean ] = filterGaussianMean(data, eps, tau, cher) 2 | %filterGaussianMean Run the filter algorithm on a Gaussian 3 | %Suggested value of cher = 2.5 4 | [N, d] = size(data); 5 | empiricalMean = mean(data); 6 | threshold = eps*log(1/eps); 7 | centeredData = bsxfun(@minus, data, empiricalMean)/sqrt(N); 8 | 9 | [U, S, ~] = svdsecon(centeredData', 1); 10 | 11 | lambda = S(1,1)^2; 12 | v = U(:,1); 13 | 14 | %If the largest eigenvalue is about right, just return 15 | if lambda < 1 + 3 * threshold 16 | estMean = 
empiricalMean; 17 | %Otherwise, project in direction of v and filter 18 | else 19 | delta = 2*eps; 20 | projectedData1 = data * v; 21 | med = median(projectedData1); 22 | 23 | projectedData = [abs(data*v - med) data]; 24 | 25 | sortedProjectedData = sortrows(projectedData); 26 | for i = 1:N 27 | T = sortedProjectedData(i,1) - delta; 28 | if (N - i) > cher * N * (erfc(T / sqrt(2)) / 2 + eps/(d*log(d*eps/tau))) 29 | break 30 | end 31 | end 32 | if i == 1 || i == N 33 | estMean = empiricalMean; 34 | else 35 | estMean = filterGaussianMean(sortedProjectedData(1:i, 2:end), eps, tau, cher); 36 | end 37 | end 38 | end -------------------------------------------------------------------------------- /filterCode/findMaxPoly.m: -------------------------------------------------------------------------------- 1 | function [ c, M, lambda ] = findMaxPoly(data) 2 | % Given a dataset and an empirical covariance matrix, outputs the 3 | % polynomial 4 | % p*(x) = x^T M x - c 5 | % with unit variance 6 | 7 | [N, d] = size(data); 8 | v = flatten(eye(d)); 9 | 10 | dataKron = zeros(N, d * d); 11 | for i = 1:N 12 | dataKron(i, :) = kron(data(i, :), data(i, :)); 13 | end 14 | 15 | empFourth = dataKron' * dataKron / N - v' * v; 16 | [U, S, ~] = svdsecon(empFourth, 1); 17 | 18 | lambda = S(1, 1) / 2; 19 | Mflat = U(:, 1); 20 | 21 | M = sharpen(Mflat, d) / sqrt(2); 22 | c = trace(M) / sqrt(2); 23 | 24 | end 25 | -------------------------------------------------------------------------------- /filterCode/flatten.m: -------------------------------------------------------------------------------- 1 | function [ v ] = flatten( mat ) 2 | % Given a d by d matrix M, convert to a d^2 dimensional vector v 3 | [d, ~] = size(mat); 4 | v = zeros(1, d * d); 5 | for i = 1:d 6 | for j = 1:d 7 | v((i - 1) * d + j) = mat(i, j); 8 | end 9 | end 10 | end -------------------------------------------------------------------------------- /filterCode/mahalanobis.m: 
-------------------------------------------------------------------------------- 1 | function [Sscaled] = mahalanobis(Shat, S) 2 | %Computes Mahalanobis rescaling of Shat with respect to S 3 | Sscaled = inv(sqrtm(S))*Shat*inv(sqrtm(S)); 4 | end -------------------------------------------------------------------------------- /filterCode/sharpen.m: -------------------------------------------------------------------------------- 1 | function M = sharpen(v,d) 2 | % Given a d^2 dimensional vector v, convert to a d by d matrix M 3 | M = zeros(d, d); 4 | for i = 1:d 5 | for j = 1:d 6 | M(i, j) = v((i-1)*d + j); 7 | end 8 | end 9 | end -------------------------------------------------------------------------------- /genomicData/colors.txt: -------------------------------------------------------------------------------- 1 | gold1 255 215 0 2 | orange 255 165 0 3 | tan3 205 133 63 4 | yellow 255 255 0 5 | yellowgreen 154 205 50 6 | gold2 238 201 0 7 | gold3 205 173 0 8 | aquamarine4 69 139 116 9 | aquamarine2 118 238 198 10 | aquamarine3 102 205 170 11 | darkolivegreen2 188 238 104 12 | green2 0 238 0 13 | green3 0 205 0 14 | green4 0 139 0 15 | green 0 255 0 16 | pink 255 192 203 17 | red 255 0 0 18 | lightpink3 205 140 149 19 | sienna 160 82 45 20 | dodgerblue1 30 144 255 21 | darkseagreen4 105 139 105 22 | darkseagreen 143 188 143 23 | dodgerblue3 24 116 205 24 | cadetblue1 152 245 255 25 | deepskyblue2 0 178 238 26 | deepskyblue4 0 104 139 27 | cadetblue 95 158 168 28 | cadetblue3 122 197 205 29 | cadetblue2 142 229 238 30 | deepskyblue1 0 191 255 31 | blue 0 0 255 32 | darkorchid3 154 50 205 33 | violetred1 255 62 150 34 | violetred3 205 50 120 35 | palevioletred2 238 121 159 36 | orangered1 255 69 0 37 | springgreen4 0 139 69 38 | -------------------------------------------------------------------------------- /testCode/testGaussianCov.m: -------------------------------------------------------------------------------- 1 | clear; 2 | eps = 0.05; 3 | tau = 0.1; 4 | 5 | 
sampErr = []; 6 | noisyEmpErr = []; 7 | filterErr = []; 8 | ds = 10:10:50; 9 | 10 | %Set to 0 for isotropic covariance 11 | %Set to 1 for spiked covariance 12 | spikedCovariance = 1; 13 | 14 | if spikedCovariance 15 | spike = 100; 16 | else 17 | spike = 1; 18 | end 19 | 20 | 21 | for d = ds 22 | N = 0.5*d / eps^2; 23 | fprintf('Training with dimension = %d, number of samples = %d \n', d, round(N, 0)) 24 | sumEmpErr = 0; 25 | sumNoisyEmpErr = 0; 26 | sumFilterErr = 0; 27 | 28 | covar = eye(d); 29 | if spikedCovariance 30 | covar(1,1) = spike; 31 | end 32 | 33 | X = mvnrnd(zeros(1,d), covar, round((1-eps)*N)); 34 | if spikedCovariance 35 | U1 = orth(randn(d, d)); 36 | Y = [ 0.5 * randi([-1 1], round(eps *N), d / 2) 0.8 * randi([-2 2], round(eps *N), d / 2 - 1) randi([-spike spike], round(eps *N), 1)] * U1; 37 | else 38 | Y = zeros(round(eps * N), d); 39 | end 40 | Z = [X; Y]; 41 | 42 | fprintf('Sampling error w/o noise...') 43 | empCov = cov(X); 44 | sumEmpErr = sumEmpErr + norm(mahalanobis(empCov, covar) - eye(d), 'fro'); 45 | fprintf('done\n') 46 | 47 | 48 | fprintf('Sampling error with noise...') 49 | empCov = cov(Z); 50 | sumNoisyEmpErr = sumNoisyEmpErr + norm(mahalanobis(empCov, covar) - eye(d), 'fro'); 51 | fprintf('done\n') 52 | 53 | 54 | fprintf('Filter...') 55 | [ourCov, filterPoints, ~] = filterGaussianCovTuned(Z, zeros(size(Z)), eps, tau, false); 56 | sumFilterErr = sumFilterErr + norm(mahalanobis(ourCov, covar) - eye(d), 'fro'); 57 | fprintf('done\n') 58 | 59 | sampErr = [sampErr sumEmpErr]; 60 | noisyEmpErr = [noisyEmpErr sumNoisyEmpErr]; 61 | filterErr = [filterErr sumFilterErr]; 62 | end 63 | 64 | noisyEmpErr = noisyEmpErr - sampErr; 65 | filterErr = filterErr - sampErr; 66 | 67 | figure(1); 68 | plot(ds, noisyEmpErr, '-gx', ds, filterErr, '-.b', 'LineWidth', 2) 69 | xlabel('Dimension') 70 | ylabel('Excess Frobenius error') 71 | legend('Sampling Error (with corruption)', 'Filter') 
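`testGaussianCov.m` scores each covariance estimate by rescaling it with the true covariance and measuring the Frobenius distance to the identity. The sketch below isolates that metric. It inlines the rescaling from `mahalanobis.m` using `sqrtm` and matrix division instead of calling the helper, and the sample size `n`, the dimension, and the variable names are assumptions chosen for illustration.

```matlab
% Sketch: Mahalanobis-rescaled Frobenius error of a covariance estimate.
rng(1);                                  % fixed seed so the run is repeatable
d = 4; n = 5000;                         % hypothetical dimension and sample size
covar = eye(d); covar(1, 1) = 100;       % spiked covariance, as in the test script

R = sqrtm(covar);                        % symmetric square root of the true covariance
X = randn(n, d) * R;                     % samples whose covariance is covar
empCov = cov(X);                         % empirical covariance of clean samples

% R \ A / R equals inv(R) * A * inv(R) without forming the inverse explicitly
scaled = R \ empCov / R;
err = norm(scaled - eye(d), 'fro');

% Sanity check: the true covariance rescales to the identity
exactErr = norm(R \ covar / R - eye(d), 'fro');
fprintf('sampling error = %g, error of the true covariance = %g\n', err, exactErr);
```

Because the rescaling whitens the estimate, a large spike in one direction does not dominate the error, which is presumably why the test script uses this metric rather than a plain Frobenius difference.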
-------------------------------------------------------------------------------- /testCode/testGaussianMean.m: -------------------------------------------------------------------------------- 1 | clear; 2 | eps = 0.1; 3 | tau = 0.1; 4 | cher = 2.5; 5 | 6 | filterErr = []; 7 | sampErr = []; 8 | noisySampErr = []; 9 | ds = 100:50:400; 10 | 11 | for d = ds 12 | N = 10*floor(d/eps^2); 13 | fprintf('Training with dimension = %d, number of samples = %d \n', d, round(N, 0)) 14 | sumFilterErr = 0; 15 | sumSampErr = 0; 16 | sumNoisySampErr = 0; 17 | 18 | X = mvnrnd(zeros(1,d), eye(d), round((1-eps)*N)) + ones(round((1-eps)*N), d); 19 | 20 | fprintf('Sampling Error w/o noise...'); 21 | sumSampErr = sumSampErr + norm(mean(X) - ones(1,d)); 22 | fprintf('...done\n') 23 | 24 | Y1 = randi([0 1], round(0.5*eps*N), d); 25 | Y2 = [12*ones(round(0.5*eps*N),1), -2 * ones(round(0.5*eps*N), 1), zeros(round(0.5 * eps * N), d-2)]; 26 | X = [X; Y1; Y2]; 27 | 28 | fprintf('Sampling Error with noise'); 29 | sumNoisySampErr = sumNoisySampErr + norm(mean(X) - ones(1,d)); 30 | fprintf('...done\n') 31 | 32 | fprintf('Filter') 33 | sumFilterErr = sumFilterErr + norm(filterGaussianMean(X, eps, tau, cher) - ones(1, d)); 34 | fprintf('...done\n') 35 | 36 | filterErr = [filterErr sumFilterErr]; 37 | sampErr = [sampErr sumSampErr]; 38 | noisySampErr = [noisySampErr sumNoisySampErr]; 39 | end 40 | 41 | noisySampErr = noisySampErr - sampErr; 42 | filterErr = filterErr - sampErr; 43 | 44 | plot(ds, noisySampErr, ds, filterErr, '-.b', 'LineWidth', 2) 45 | xlabel('Dimension') 46 | ylabel('Excess L2 error') 47 | legend('Sampling Error (with noise)', 'Filter') 48 | -------------------------------------------------------------------------------- /testCode/testGenomicData.m: -------------------------------------------------------------------------------- 1 | clear; 2 | eps = 0.10; 3 | tau = 0.1; 4 | d = 20; 5 | lambda = 0.2; 6 | subsampSize = 100; 7 | 8 | %Process mappings from color names to numbers 9 | 
rawColor = importdata('../genomicData/colors.txt'); 10 | keys = rawColor.rowheaders; 11 | vals = mat2cell(rawColor.data, ones(37,1)); 12 | colorMap = containers.Map(keys, vals); 13 | 14 | %Process data points 15 | fid = fopen('../genomicData/POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eigs'); 16 | A = textscan(fid,'%f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f %s', 'HeaderLines',1); 17 | fclose(fid); 18 | V = [A{3} A{4} A{5} A{6} A{7} A{8} A{9} A{10} A{11} A{12} A{13} A{14} A{15} A{16} A{17} A{18} A{19} A{20} A{21} A{22}]; 19 | 20 | eigs = importdata('../genomicData/POPRES_08_24_01.EuroThinFinal.LD_0.8.exLD.out0-PCA.eval'); 21 | eigs = eigs(1:d); 22 | data = zeros(size(V,1), d); 23 | for i = 1:size(V,1) 24 | data(i,:) = V(i,:).*eigs'; 25 | end 26 | 27 | %Convert data point color names to numbers 28 | colorsFid = fopen('../genomicData/POPRESID_Color.txt'); 29 | colorsStrings = textscan(colorsFid, '%d %s'); 30 | colorsStrings = colorsStrings{2}; 31 | dataColors = zeros(length(colorsStrings),3); 32 | for i = 1:length(colorsStrings) 33 | dataColors(i, :) = colorMap(colorsStrings{i})/255; 34 | end 35 | 36 | %Generate noisy points 37 | N = size(V,1)/(1-eps); 38 | noise = (1/24)*[randi([0, 2], round(eps*N), d/2) randi([2, 3], round(eps*N), d/2)]; 39 | 40 | %Generate noisy point colors 41 | noiseColors = zeros(round(eps*N), 3); 42 | 43 | %Combine data and noise, 44 | randRot = orth(randn(d, d)); 45 | data = data * randRot; 46 | noise = noise * randRot; 47 | D = [data; noise]; 48 | C = [dataColors; noiseColors]; 49 | subsampD = datasample(D, subsampSize); 50 | 51 | %Generate original data's projection 52 | [dataU, ~, ~] = svd(cov(data)); 53 | 54 | %Generate noised data's projection 55 | [noisedU, ~, ~] = svd(cov(D)); 56 | 57 | %Generate filter output and projection 58 | [filterM, filtered, filteredMetadata] = filterGaussianCovTuned(D, C, eps, tau, 0); 59 | [filterU, ~, ~] = svd(filterM); 60 | 61 | %Generate plot for original data 62 | fig = 
figure(1); 63 | clf; 64 | scatter(data*dataU(:,1), data*dataU(:,2), [], dataColors); 65 | hold on; 66 | scatter(noise*dataU(:,1), noise*dataU(:,2), [], noiseColors); 67 | axis([-0.25 0.35 -0.15 0.2]) 68 | title('Original Data') 69 | set(fig, 'Position', [100, 100, 640, 480]); 70 | 71 | %Generate plot for noised data 72 | fig = figure(2); 73 | clf 74 | scatter(data*noisedU(:,1), data*noisedU(:,2), [], dataColors); 75 | hold on; 76 | scatter(noise*noisedU(:,1), noise*noisedU(:,2), [], noiseColors); 77 | axis([-0.25 0.35 -0.15 0.2]) 78 | title('Empirical Noised Singular Vectors') 79 | set(fig, 'Position', [100, 100, 640, 480]); 80 | 81 | %Generate plot for filter's output 82 | fig = figure(3); 83 | clf 84 | scatter(filtered*filterU(:,1),filtered*filterU(:,2), [], filteredMetadata); 85 | axis([-0.25 0.35 -0.15 0.2]) 86 | title('Filter Output'); 87 | set(fig, 'Position', [100, 100, 640, 480]); 88 | 89 | %Generate plot for the filter's projection 90 | fig = figure(4); 91 | clf 92 | scatter(data*filterU(:,1), data*filterU(:,2), [], dataColors); 93 | hold on; 94 | scatter(noise*filterU(:,1), noise*filterU(:,2), [], noiseColors); 95 | axis([-0.25 0.35 -0.15 0.2]) 96 | title('Filter Projection') 97 | set(fig, 'Position', [100, 100, 640, 480]); --------------------------------------------------------------------------------
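`testGenomicData.m` scales each principal-component column by its eigenvalue inside a row loop. As a hedged sketch, the same scaling can be written without the loop using `bsxfun`. The matrix `V` and the vector `evals` below are small synthetic stand-ins, not the POPRES data, and the vector is named `evals` here (an assumption of this example) to avoid shadowing MATLAB's built-in `eigs` function.

```matlab
% Sketch: scale every column of V by the matching eigenvalue (illustrative only).
V = magic(4);                 % synthetic stand-in for the principal-component matrix
evals = [4; 3; 2; 1];         % synthetic stand-in for the leading eigenvalues

% Loop form, mirroring the structure of the test script
scaledLoop = zeros(size(V));
for i = 1:size(V, 1)
    scaledLoop(i, :) = V(i, :) .* evals';
end

% Loop-free form: multiply each row of V elementwise by the row vector evals'
scaledVec = bsxfun(@times, V, evals');

fprintf('forms agree: %d\n', isequal(scaledLoop, scaledVec));
```

`bsxfun` is used rather than implicit expansion so the sketch also runs on older MATLAB releases, matching the `bsxfun` call already present in `filterGaussianMean.m`.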