├── .gitignore ├── LICENSE ├── README.md ├── independence_testing ├── CorrTestObject.py ├── ExampleData.txt ├── ExampleHSIC.py ├── ExperimentsHSICBlock.py ├── ExperimentsHSICPermutation.py ├── ExperimentsHSICSpectral.py ├── HSICBlockTestObject.py ├── HSICPermutationTestObject.py ├── HSICSpectralTestObject.py ├── HSICTestObject.py ├── SimDataGen.py ├── SubHSICTestObject.py ├── TestExperiment.py ├── TestObject.py └── __init__.py ├── kerpy ├── BagKernel.py ├── BrownianKernel.py ├── GaussianBagKernel.py ├── GaussianKernel.py ├── HypercubeKernel.py ├── Kernel.py ├── LinearBagKernel.py ├── LinearKernel.py ├── MaternKernel.py ├── PolynomialKernel.py ├── ProductKernel.py ├── SumKernel.py └── __init__.py ├── setup.py ├── tools ├── GenericTests.py ├── ProcessingObject.py ├── UnitTests.py ├── __init__.py ├── read_and_plot_test_results.py └── read_test_results.py └── weak_conditional_independence_testing ├── BH_prewhiten.csv ├── Ozone_prewhiten.csv ├── PCalg_twostep_flags.py ├── SimDataGen.py ├── SyntheticDim_KRESIT.py ├── Synthetic_DAGexample.csv ├── TwoStepCondTestObject.py ├── WHO_KRESITvsRESIT.py ├── WHO_dataset.csv └── __init__.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.project 3 | *.pydevproject 4 | *~ 5 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2016 oxmlcs 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | * 7 | 1. Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 2. Redistributions in binary form must reproduce the above copyright notice, 10 | this list of conditions and the following disclaimer in the documentation 11 | and/or other materials provided with the distribution. 12 | * 13 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 14 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 15 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 16 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR 17 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 18 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 19 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 20 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 21 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 22 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 23 | * 24 | The views and conclusions contained in the software and documentation are those 25 | of the authors and should not be interpreted as representing official policies, 26 | either expressed or implied, of the author. 27 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # kerpy 2 | python code framework for kernel methods in hypothesis testing. 
3 | Some of the code for kernel computation was adapted from https://github.com/karlnapf/kameleon-mcmc 4 | 5 | To set it up as a package, run in a terminal: 6 | 7 | ``python setup.py develop`` 8 | 9 | 10 | ### independence_testing 11 | 12 | Code for HSIC-based large-scale independence tests. The methods are described in: 13 | 14 | Q. Zhang, S. Filippi, A. Gretton, and D. Sejdinovic, __Large-Scale Kernel Methods for Independence Testing__, _Statistics and Computing_, to appear, 2017. [url](http://link.springer.com/article/10.1007%2Fs11222-016-9721-7) 15 | 16 | For an example use of the code, demonstrating how to run an HSIC-based large-scale independence test on either simulated data or data loaded from a file, see ExampleHSIC.py. 17 | 18 | To reproduce results from the paper, see ExperimentsHSICPermutation.py, ExperimentsHSICSpectral.py and ExperimentsHSICBlock.py. 19 | 20 | 21 | 22 | ### weak_conditional_independence_testing 23 | 24 | Code for feature-to-feature regression for a two-step conditional independence test (i.e. testing for weak conditional independence). The methods are described in: 25 | 26 | Q. Zhang, S. Filippi, S. Flaxman, and D. Sejdinovic, __Feature-to-Feature Regression for a Two-Step Conditional Independence Test__, UAI, 2017. 27 | 28 | 29 | To reproduce results from the paper, see WHO_KRESITvsRESIT.py, SyntheticDim_KRESIT.py and PCalg_twostep_flags.py (the node names correspond to the variables in the Synthetic_DAGexample.csv file; to run it on the Boston Housing data (BH_prewhiten.csv) or the Ozone data (Ozone_prewhiten.csv), simply change the label names in the script). -------------------------------------------------------------------------------- /independence_testing/CorrTestObject.py: -------------------------------------------------------------------------------- 1 | from TestObject import TestObject 2 | from numpy import shape, zeros 3 | from scipy.stats import pearsonr 4 | import time 5 | from numpy.random import permutation 6 | 7 | 8 | class CorrTestObject(TestObject): 9 | def __init__(self, num_samples, data_generator, streaming=False, freeze_data=False,num_shuffles=1000): 10 | TestObject.__init__(self,self.__class__.__name__,streaming=streaming, freeze_data=freeze_data) 11 | self.num_samples = num_samples #We have same number of samples from X and Y in independence testing 12 | self.data_generator = data_generator 13 | self.num_shuffles = num_shuffles 14 | 15 | 16 | def generate_data(self): 17 | self.data_x, self.data_y = self.data_generator(self.num_samples) 18 | return self.data_x, self.data_y 19 | 20 | 21 | def SubCorr_statistic(self,data_x=None,data_y=None): 22 | if data_x is None: 23 | data_x=self.data_x 24 | if data_y is None: 25 | data_y=self.data_y 26 | dx = shape(data_x)[1] 27 | stats_value = zeros(dx) 28 | for dd in range(dx): 29 | stats_value[dd] = pearsonr(data_x[:,[dd]],data_y)[0]**2 30 | SubCorr = sum(stats_value)/float(dx) 31 | return SubCorr 32 | 33 | 34 | def compute_pvalue_with_time_tracking(self,data_x = None, data_y = None): 35 | if data_x is None and data_y is None: 36 | if not self.streaming and not self.freeze_data: 37 | start = time.clock() 38 | self.generate_data() 39 | data_generating_time = time.clock()-start 40 | data_x = self.data_x 41 | data_y = self.data_y 42 | else: 43 | data_generating_time = 0. 44 | else: 45 | data_generating_time = 0.
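# The permutation test below: the rows of Y are shuffled num_shuffles times and
# the SubCorr statistic is recomputed on each shuffle; the p-value is the
# fraction of permuted statistics exceeding the observed one, i.e. a Monte
# Carlo approximation of the null distribution under independence.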
46 | print 'data generating time passed: ', data_generating_time 47 | SubCorr_statistic = self.SubCorr_statistic(data_x=data_x,data_y=data_y) 48 | null_samples=zeros(self.num_shuffles) 49 | for jj in range(self.num_shuffles): 50 | pp = permutation(self.num_samples) 51 | yy = data_y[pp,:] #permute the local data_y (self.data_y may be unset when data is passed in) 52 | null_samples[jj]=self.SubCorr_statistic(data_x = data_x, data_y = yy) 53 | pvalue = ( sum( null_samples > SubCorr_statistic ) ) / float( self.num_shuffles ) 54 | return pvalue, data_generating_time 55 | 56 | 57 | 58 | -------------------------------------------------------------------------------- /independence_testing/ExampleHSIC.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Example script for running large-scale independence tests with HSIC 3 | https://github.com/oxmlcs/kerpy 4 | ''' 5 | 6 | 7 | #adding relevant folder to your pythonpath 8 | import os, sys 9 | BASE_DIR = os.path.join( os.path.dirname( __file__ ), '..' ) 10 | sys.path.append(BASE_DIR) 11 | 12 | 13 | from kerpy.GaussianKernel import GaussianKernel 14 | from SimDataGen import SimDataGen 15 | from HSICTestObject import HSICTestObject 16 | from numpy import shape,savetxt,loadtxt,transpose,reshape,concatenate 17 | from independence_testing.HSICSpectralTestObject import HSICSpectralTestObject 18 | from independence_testing.HSICBlockTestObject import HSICBlockTestObject 19 | 20 | ''' 21 | Given a data set data_x and data_y of paired observations, 22 | we wish to test the hypothesis of independence between the two. 23 | If dealing with vectorial data, data_x and data_y should be 2d-numpy arrays of shape (n,dim), 24 | where n is the number of observations and dim is the dimension of these observations 25 | --- note: one-dimensional observations should also be in a 2d-numpy array format (n,1) 26 | ''' 27 | 28 | 29 | 30 | #here we simulate a dataset of size 'num_samples' in the correct format 31 | num_samples = 10000 32 | data_x, data_y = SimDataGen.LargeScale(num_samples, dimension=20) 33 | #SimDataGen.py contains more examples of data generating functions 34 | 35 | 36 | ''' 37 | # Alternatively, we can load a dataset from a file as follows: 38 | #-- here file is assumed to be a num_samples by (dimx+dimy) table 39 | data = loadtxt('ExampleData.txt') 40 | num_samples,D = shape(data) 41 | #assume that x corresponds to all but the last column in the file 42 | data_x = data[:,:(D-1)] 43 | #and that y is just the last column 44 | data_y = data[:,D-1] 45 | #need to ensure data_y is a 2d array 46 | data_y=reshape(data_y,(num_samples,1)) 47 | ''' 48 | 49 | 50 | print "shape of data_x:", shape(data_x) 51 | print "shape of data_y:", shape(data_y) 52 | 53 | ''' 54 | First, we need to specify the kernels for X and Y. We will use Gaussian kernels -- the default value of the width parameter is 1.0; 55 | the widths can either be kept fixed or set by the median heuristic based on the data when running a test. 56 | ''' 57 | kernelX=GaussianKernel() 58 | kernelY=GaussianKernel() 59 | 60 | 61 | 62 | 63 | ''' 64 | HSICSpectralTestObject/HSICPermutationTestObject: 65 | ================================================= 66 | num_samples: Integer value -- the number of data samples 67 | data_generator: If we use simulated data, which function to use to generate data for repeated tests to investigate power; 68 | Examples are given in SimDataGen.py, e.g. data_generator = SimDataGen.LargeScale; 69 | Default value is None (if only a single test will be run).
70 | 71 | kernelX, kernelY: The kernel functions to use for X and Y respectively. (Examples are included in kerpy folder) 72 | E.g. kernelX = GaussianKernel(); alternatively, for a kernel with fixed width: kernelY = GaussianKernel(float(1.5)) 73 | kernelX_use_median, 74 | kernelY_use_median: "True" or "False" -- if median heuristic should be used to select the kernel bandwidth. 75 | 76 | rff: "True" or "False" -- if random Fourier Features should be used. 77 | num_rfx, num_rfy: Even integer values -- the number of random features for X and Y respectively. 78 | 79 | induce_set: "True" or "False" -- if Nystrom method should be used. 80 | num_inducex, num_inducey: Integer values -- the number of inducing variables for X and Y respectively. 81 | 82 | num_nullsims: An integer value -- the number of simulations from the null distribution for spectral approach. 83 | num_shuffles: An integer value -- the number of shuffles for permutation approach. 84 | unbiased: "True" or "False" -- if unbiased HSIC test statistics is preferred. 85 | 86 | 87 | HSICBlockTestObject: 88 | ==================== 89 | blocksize: Integer value -- the size of each block. 90 | nullvarmethod: "permutation", "direct" or "across" -- the method of estimating the null variance. 91 | Refer to the paper for more details of each. 92 | ''' 93 | 94 | 95 | #example usage of HSIC spectral test with random Fourier feature approximation 96 | myspectralobject = HSICSpectralTestObject(num_samples, kernelX=kernelX, kernelY=kernelY, 97 | kernelX_use_median=True, kernelY_use_median=True, 98 | rff=True, num_rfx=20, num_rfy=20, num_nullsims=1000) 99 | pvalue = myspectralobject.compute_pvalue(data_x, data_y) 100 | 101 | print "Spectral test p-value:", pvalue 102 | 103 | #example usage of HSIC block test: 104 | myblockobject = HSICBlockTestObject(num_samples, kernelX=kernelX, kernelY=kernelY, 105 | kernelX_use_median=True, kernelY_use_median=True, 106 | blocksize=50, nullvarmethod='permutation') 107 | pvalue = myblockobject.compute_pvalue(data_x, data_y) 108 | 109 | print "Block test p-value:", pvalue -------------------------------------------------------------------------------- /independence_testing/ExperimentsHSICBlock.py: -------------------------------------------------------------------------------- 1 | ''' 2 | adding relevant folder to your pythonpath 3 | ''' 4 | import os, sys 5 | BASE_DIR = os.path.join( os.path.dirname( __file__ ), '..' ) 6 | sys.path.append(BASE_DIR) 7 | 8 | from kerpy.GaussianKernel import GaussianKernel 9 | from HSICTestObject import HSICTestObject 10 | from HSICBlockTestObject import HSICBlockTestObject 11 | from TestExperiment import TestExperiment 12 | from SimDataGen import SimDataGen 13 | from tools.ProcessingObject import ProcessingObject 14 | 15 | #example use: python ExperimentsHSICBlock.py 500 --dimX 3 --kernelX_use_median --kernelY_use_median --blocksize 10 16 | 17 | data_generating_function = SimDataGen.LargeScale 18 | data_generating_function_null = SimDataGen.turn_into_null(SimDataGen.LargeScale) 19 | args = ProcessingObject.parse_arguments() 20 | 21 | '''unpack the arguments needed:''' 22 | num_samples=args.num_samples 23 | hypothesis=args.hypothesis 24 | dimX = args.dimX 25 | kernelX_use_median = args.kernelX_use_median 26 | kernelY_use_median = args.kernelY_use_median 27 | blocksize = args.blocksize 28 | #currently, we are using the same blocksize for both X and Y 29 | 30 | # A temporary set up for the kernels: 31 | kernelX = GaussianKernel(1.) 32 | kernelY = GaussianKernel(1.) 
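# Note: the width 1.0 above is only a placeholder -- when the --kernelX_use_median /
# --kernelY_use_median flags are passed, HSICBlockTestObject re-estimates the
# bandwidths from the data before testing, along the lines of:
#   sigmax = kernelX.get_sigma_median_heuristic(data_x)
#   kernelX.set_width(float(sigmax))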
33 | 34 | if hypothesis=="alter": 35 | data_generator=lambda num_samples: data_generating_function(num_samples,dimension=dimX) 36 | elif hypothesis=="null": 37 | data_generator=lambda num_samples: data_generating_function_null(num_samples,dimension=dimX) 38 | else: 39 | raise NotImplementedError() 40 | 41 | 42 | test_object=HSICBlockTestObject(num_samples, data_generator, kernelX, kernelY, 43 | kernelX_use_median=kernelX_use_median,kernelY_use_median=kernelY_use_median, 44 | nullvarmethod='permutation', 45 | blocksize=blocksize) 46 | 47 | name = os.path.basename(__file__).rstrip('.py')+'_LS_'+hypothesis+'_d_'+str(dimX)+'_B_'+str(blocksize)+'_n_'+str(num_samples) 48 | 49 | param={'name': name,\ 50 | 'kernelX': kernelX,\ 51 | 'kernelY': kernelY,\ 52 | 'blocksize':blocksize,\ 53 | 'data_generator': data_generator.__name__,\ 54 | 'hypothesis':hypothesis,\ 55 | 'num_samples': num_samples} 56 | 57 | 58 | experiment=TestExperiment(name, param, test_object) 59 | 60 | numTrials = 100 61 | alpha=0.05 62 | experiment.run_test_trials(numTrials, alpha=alpha) 63 | -------------------------------------------------------------------------------- /independence_testing/ExperimentsHSICPermutation.py: -------------------------------------------------------------------------------- 1 | ''' 2 | adding relevant folder to your pythonpath 3 | ''' 4 | import os, sys 5 | BASE_DIR = os.path.join( os.path.dirname( __file__ ), '..' ) 6 | sys.path.append(BASE_DIR) 7 | 8 | from kerpy.GaussianKernel import GaussianKernel 9 | from HSICTestObject import HSICTestObject 10 | from TestExperiment import TestExperiment 11 | from SimDataGen import SimDataGen 12 | from tools.ProcessingObject import ProcessingObject 13 | from HSICPermutationTestObject import HSICPermutationTestObject 14 | 15 | #example use: python ExperimentsHSICPermutation.py 500 --dimX 3 --hypothesis null --rff --num_rfx 50 --num_rfy 50 16 | 17 | data_generating_function = SimDataGen.VaryDimension 18 | data_generating_function_null = SimDataGen.turn_into_null(SimDataGen.VaryDimension) 19 | args = ProcessingObject.parse_arguments() 20 | 21 | '''unpack the arguments needed:''' 22 | num_samples=args.num_samples 23 | hypothesis=args.hypothesis 24 | dimX = args.dimX 25 | kernelX_use_median = args.kernelX_use_median 26 | kernelY_use_median = args.kernelY_use_median 27 | rff=args.rff 28 | num_rfx = args.num_rfx 29 | num_rfy = args.num_rfy 30 | induce_set = args.induce_set 31 | num_inducex = args.num_inducex 32 | num_inducey = args.num_inducey 33 | num_shuffles = args.num_shuffles 34 | 35 | 36 | # A temporary set up for the kernels: 37 | kernelX = GaussianKernel(1.) 38 | kernelY = GaussianKernel(1.) 
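# Under --hypothesis null (handled below), turn_into_null wraps the data
# generator so that the returned Y sample is randomly permuted relative to X;
# this destroys any dependence while preserving both marginals, i.e. the data
# are drawn from a distribution satisfying H0.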
39 | 40 | 41 | if hypothesis=="alter": 42 | data_generator=lambda num_samples: data_generating_function(num_samples,dimension=dimX) 43 | elif hypothesis=="null": 44 | data_generator=lambda num_samples: data_generating_function_null(num_samples,dimension=dimX) 45 | else: 46 | raise NotImplementedError() 47 | 48 | 49 | 50 | 51 | test_object=HSICPermutationTestObject(num_samples, data_generator, kernelX, kernelY, 52 | kernelX_use_median=kernelX_use_median,kernelY_use_median=kernelY_use_median, 53 | num_rfx=num_rfx,num_rfy=num_rfy, unbiased=False, rff=rff, 54 | induce_set=induce_set,num_inducex=num_inducex,num_inducey=num_inducey, 55 | num_shuffles=num_shuffles) 56 | 57 | 58 | if rff: 59 | name = os.path.basename(__file__).rstrip('.py')+'_VD_'+hypothesis+'_d_'+str(dimX)+\ 60 | '_shuffles_'+str(num_shuffles)+'_rff_'+str(num_rfx)+str(num_rfy)+'_n_'+str(num_samples) 61 | elif induce_set: 62 | name = os.path.basename(__file__).rstrip('.py')+'_VD_'+hypothesis+'_d_'+str(dimX)+\ 63 | '_shuffles_'+str(num_shuffles)+'_induce_'+str(num_inducex)+str(num_inducey)+'_n_'+str(num_samples) 64 | else: 65 | name = os.path.basename(__file__).rstrip('.py')+'_VD_'+hypothesis+'_d_'+str(dimX)+\ 66 | '_shuffles_'+str(num_shuffles)+'_n_'+str(num_samples) 67 | 68 | 69 | 70 | param={'name': name,\ 71 | 'dim': dimX,\ 72 | 'kernelX': kernelX,\ 73 | 'kernelY': kernelY,\ 74 | 'num_rfx': num_rfx,\ 75 | 'num_rfy': num_rfy,\ 76 | 'num_inducex': num_inducex,\ 77 | 'num_inducey': num_inducey,\ 78 | 'data_generator': data_generator.__name__,\ 79 | 'hypothesis':hypothesis,\ 80 | 'num_samples': num_samples} 81 | 82 | 83 | experiment=TestExperiment(name, param, test_object) 84 | 85 | numTrials = 100 86 | alpha=0.05 87 | experiment.run_test_trials(numTrials, alpha=alpha) -------------------------------------------------------------------------------- /independence_testing/ExperimentsHSICSpectral.py: -------------------------------------------------------------------------------- 1 | ''' 2 | adding relevant folder to your pythonpath 3 | ''' 4 | import os, sys 5 | BASE_DIR = os.path.join( os.path.dirname( __file__ ), '..' ) 6 | #print BASE_DIR 7 | sys.path.append(BASE_DIR) 8 | #print sys.path.append(BASE_DIR) 9 | 10 | from kerpy.GaussianKernel import GaussianKernel 11 | from kerpy.BrownianKernel import BrownianKernel 12 | from HSICTestObject import HSICTestObject 13 | from HSICSpectralTestObject import HSICSpectralTestObject 14 | from TestExperiment import TestExperiment 15 | from SimDataGen import SimDataGen 16 | from tools.ProcessingObject import ProcessingObject 17 | 18 | #example usage: python ExperimentsHSICSpectral.py 500 --dimX 4 --hypothesis null --rff --num_rfx 50 --num_rfy 50 19 | # the above says that 500 samples; null hypothesis; rff True; 50 random Fourier features for X and Y. 
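# With --rff, the n x n Gram matrices are replaced by random Fourier feature
# expansions (see Kernel.rff_expand), reducing memory from O(n^2) to O(n*m) for
# m random features; e.g. phix = kernelX.rff_expand(data_x) has shape (n, num_rfx)
# for even num_rfx, since cosine and sine features are concatenated.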
20 | 21 | 22 | data_generating_function = SimDataGen.VaryDimension 23 | data_generating_function_null = SimDataGen.turn_into_null(SimDataGen.VaryDimension) 24 | #data_generating_function = SimDataGen.LargeScale 25 | #data_generating_function_null = SimDataGen.turn_into_null(SimDataGen.LargeScale) 26 | args = ProcessingObject.parse_arguments() 27 | 28 | '''unpack the arguments needed:''' 29 | num_samples=args.num_samples 30 | hypothesis=args.hypothesis 31 | dimX = args.dimX 32 | kernelX_use_median = args.kernelX_use_median 33 | kernelY_use_median = args.kernelY_use_median 34 | rff=args.rff 35 | num_rfx = args.num_rfx 36 | num_rfy = args.num_rfy 37 | induce_set = args.induce_set 38 | num_inducex = args.num_inducex 39 | num_inducey = args.num_inducey 40 | 41 | 42 | # This will only be a temporary set up for the kernels when we use median heuristics. 43 | #kernelX = GaussianKernel(1.) 44 | #kernelY = GaussianKernel(1.) 45 | 46 | # Brownian kernel with H = 0.5 equivalently alpha = 1.0 47 | kernelX = BrownianKernel(1.) 48 | kernelY = BrownianKernel(1.) 49 | 50 | if hypothesis=="alter": 51 | data_generator=lambda num_samples: data_generating_function(num_samples,dimension=dimX) 52 | elif hypothesis=="null": 53 | data_generator=lambda num_samples: data_generating_function_null(num_samples,dimension=dimX) 54 | else: 55 | raise NotImplementedError() 56 | 57 | 58 | test_object=HSICSpectralTestObject(num_samples, data_generator, kernelX, kernelY, 59 | kernelX_use_median=kernelX_use_median,kernelY_use_median=kernelY_use_median, 60 | rff = rff, num_rfx=num_rfx,num_rfy=num_rfy, unbiased=False, 61 | induce_set = induce_set, num_inducex = num_inducex, num_inducey = num_inducey) 62 | 63 | 64 | # file name of the results 65 | if rff: 66 | name = os.path.basename(__file__).rstrip('.py')+'Bsine'+'_'+hypothesis+'_rff_'+str(num_rfx)+\ 67 | str(num_rfy)+'_d_'+str(dimX)+'_n_'+str(num_samples) 68 | elif induce_set: 69 | name = os.path.basename(__file__).rstrip('.py')+'Bsine'+'_'+hypothesis+'_induce_'+str(num_inducex)+\ 70 | str(num_inducey)+'_d_'+str(dimX)+'_n_'+str(num_samples) 71 | else: 72 | name = os.path.basename(__file__).rstrip('.py')+'Bsine'+'_'+hypothesis+'_d_'+str(dimX)+'_n_'+str(num_samples) 73 | 74 | param={'name': name,\ 75 | 'dim': dimX,\ 76 | 'kernelX': kernelX,\ 77 | 'kernelY': kernelY,\ 78 | 'num_rfx': num_rfx,\ 79 | 'num_rfy': num_rfy,\ 80 | 'num_inducex': num_inducex,\ 81 | 'num_inducey': num_inducey,\ 82 | 'data_generator': data_generator.__name__,\ 83 | 'hypothesis':hypothesis,\ 84 | 'num_samples': num_samples} 85 | 86 | experiment=TestExperiment(name, param, test_object) 87 | 88 | numTrials = 100 89 | alpha=0.05 90 | experiment.run_test_trials(numTrials, alpha=alpha) 91 | 92 | -------------------------------------------------------------------------------- /independence_testing/HSICBlockTestObject.py: -------------------------------------------------------------------------------- 1 | from TestObject import TestObject 2 | from HSICTestObject import HSICTestObject 3 | from numpy import mean, sum, zeros, var, sqrt 4 | from scipy.stats import norm 5 | import time 6 | 7 | class HSICBlockTestObject(HSICTestObject): 8 | def __init__(self,num_samples, data_generator=None, kernelX=None, kernelY=None, 9 | kernelX_use_median=False,kernelY_use_median=False, 10 | rff=False, num_rfx=None, num_rfy=None, 11 | blocksize=50, streaming=False, nullvarmethod='permutation', freeze_data=False): 12 | HSICTestObject.__init__(self, num_samples, data_generator=data_generator, kernelX=kernelX, kernelY=kernelY, 13 | 
kernelX_use_median=kernelX_use_median,kernelY_use_median=kernelY_use_median, 14 | rff=rff, streaming=streaming, num_rfx=num_rfx, num_rfy=num_rfy, 15 | freeze_data=freeze_data) 16 | self.blocksize = blocksize 17 | #self.blocksizeY = blocksizeY 18 | self.nullvarmethod = nullvarmethod 19 | 20 | def compute_pvalue_with_time_tracking(self,data_x=None,data_y=None): 21 | if data_x is None and data_y is None: 22 | if not self.streaming and not self.freeze_data: 23 | start = time.clock() 24 | self.generate_data() 25 | data_generating_time = time.clock()-start 26 | data_x = self.data_x 27 | data_y = self.data_y 28 | else: 29 | data_generating_time = 0. 30 | else: 31 | data_generating_time = 0. 32 | #print 'Total block data generating time passed: ', data_generating_time 33 | if self.kernelX_use_median: 34 | sigmax = self.kernelX.get_sigma_median_heuristic(data_x) 35 | self.kernelX.set_width(float(sigmax)) 36 | if self.kernelY_use_median: 37 | sigmay = self.kernelY.get_sigma_median_heuristic(data_y) 38 | self.kernelY.set_width(float(sigmay)) 39 | num_blocks = int(( self.num_samples ) // self.blocksize) 40 | block_statistics = zeros(num_blocks) 41 | null_samples = zeros(num_blocks) 42 | null_varx = zeros(num_blocks) 43 | null_vary = zeros(num_blocks) 44 | for bb in range(num_blocks): 45 | if self.streaming: 46 | data_xb, data_yb = self.data_generator(self.blocksize, self.blocksize) 47 | else: 48 | data_xb = data_x[(bb*self.blocksize):((bb+1)*self.blocksize)] 49 | data_yb = data_y[(bb*self.blocksize):((bb+1)*self.blocksize)] 50 | if self.nullvarmethod == 'permutation': 51 | block_statistics[bb], null_samples[bb], _, _, _, _, _ = \ 52 | self.HSICmethod(data_x=data_xb, data_y=data_yb, unbiased=True, num_shuffles=1, estimate_nullvar=False,isBlockHSIC=True) 53 | elif self.nullvarmethod == 'direct': 54 | block_statistics[bb], _, null_varx[bb], null_vary[bb], _, _, _ = \ 55 | self.HSICmethod(data_x=data_xb, data_y=data_yb, unbiased=True, num_shuffles=0, estimate_nullvar=True,isBlockHSIC=True) 56 | elif self.nullvarmethod == 'across': 57 | block_statistics[bb], _, _, _, _, _, _ = \ 58 | self.HSICmethod(data_x=data_xb, data_y=data_yb, unbiased=True, num_shuffles=0, estimate_nullvar=False,isBlockHSIC=True) 59 | else: 60 | raise NotImplementedError() 61 | BTest_Statistic = sum(block_statistics) / float(num_blocks) 62 | #print BTest_Statistic 63 | if self.nullvarmethod == 'permutation': 64 | BTest_NullVar = self.blocksize**2*var(null_samples) 65 | elif self.nullvarmethod == 'direct': 66 | overall_varx = mean(null_varx) 67 | overall_vary = mean(null_vary) 68 | BTest_NullVar = 2.*overall_varx*overall_vary 69 | elif self.nullvarmethod == 'across': 70 | BTest_NullVar = var(block_statistics) 71 | #print BTest_NullVar 72 | Z_score = sqrt(self.num_samples*self.blocksize)*BTest_Statistic / sqrt(BTest_NullVar) 73 | #print Z_score 74 | pvalue = norm.sf(Z_score) 75 | return pvalue, data_generating_time 76 | 77 | 78 | -------------------------------------------------------------------------------- /independence_testing/HSICPermutationTestObject.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Created on 17 Nov 2015 3 | 4 | @author: qinyi 5 | ''' 6 | from HSICTestObject import HSICTestObject 7 | import time 8 | 9 | 10 | class HSICPermutationTestObject(HSICTestObject): 11 | 12 | def __init__(self, num_samples, data_generator=None, kernelX=None, kernelY=None, kernelX_use_median=False, 13 | kernelY_use_median=False, num_rfx=None, num_rfy=None, rff=False, 14 | induce_set=False, 
num_inducex = None, num_inducey = None, num_shuffles=1000, unbiased=True): 15 | HSICTestObject.__init__(self, num_samples, data_generator=data_generator, kernelX=kernelX, kernelY=kernelY, 16 | kernelX_use_median=kernelX_use_median,kernelY_use_median=kernelY_use_median, 17 | num_rfx=num_rfx, num_rfy=num_rfy, rff=rff,induce_set=induce_set, 18 | num_inducex = num_inducex, num_inducey = num_inducey) 19 | self.num_shuffles = num_shuffles 20 | self.unbiased = unbiased 21 | 22 | 23 | def compute_pvalue_with_time_tracking(self,data_x=None,data_y=None): 24 | if data_x is None and data_y is None: 25 | if not self.streaming and not self.freeze_data: 26 | start = time.clock() 27 | self.generate_data() 28 | data_generating_time = time.clock()-start 29 | data_x = self.data_x 30 | data_y = self.data_y 31 | else: 32 | data_generating_time = 0. 33 | else: 34 | data_generating_time = 0. 35 | print 'Permutation data generating time passed: ', data_generating_time 36 | hsic_statistic, null_samples, _, _, _, _, _ = self.HSICmethod(unbiased=self.unbiased,num_shuffles=self.num_shuffles, 37 | data_x = data_x, data_y = data_y) 38 | pvalue = ( 1 + sum( null_samples > hsic_statistic ) ) / float( 1 + self.num_shuffles ) 39 | 40 | return pvalue, data_generating_time -------------------------------------------------------------------------------- /independence_testing/HSICSpectralTestObject.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Created on 15 Nov 2015 3 | 4 | @author: qinyi 5 | ''' 6 | from HSICTestObject import HSICTestObject 7 | import numpy as np 8 | import time 9 | 10 | class HSICSpectralTestObject(HSICTestObject): 11 | 12 | def __init__(self, num_samples, data_generator=None, 13 | kernelX=None, kernelY=None, kernelX_use_median=False,kernelY_use_median=False, 14 | rff=False,num_rfx=None,num_rfy=None,induce_set=False, num_inducex = None, num_inducey = None, 15 | num_nullsims=1000, unbiased=False): 16 | HSICTestObject.__init__(self, num_samples, data_generator=data_generator, kernelX=kernelX, kernelY=kernelY, 17 | kernelX_use_median=kernelX_use_median,kernelY_use_median=kernelY_use_median, 18 | num_rfx=num_rfx, num_rfy=num_rfy, rff=rff, 19 | induce_set=induce_set, num_inducex = num_inducex, num_inducey = num_inducey) 20 | self.num_nullsims = num_nullsims 21 | self.unbiased = unbiased 22 | 23 | 24 | def get_null_samples_with_spectral_approach(self,Mx,My): 25 | lambdax, lambday = self.get_spectrum_on_data(Mx,My) 26 | Dx=len(lambdax) 27 | Dy=len(lambday) 28 | null_samples=np.zeros(self.num_nullsims) 29 | for jj in range(self.num_nullsims): 30 | zz=np.random.randn(Dx,Dy)**2 31 | if self.unbiased: 32 | zz = zz - 1 33 | null_samples[jj]=np.dot(lambdax.T,np.dot(zz,lambday)) 34 | return null_samples 35 | 36 | def compute_pvalue_with_time_tracking(self,data_x=None,data_y=None): 37 | if data_x is None and data_y is None: 38 | if not self.streaming and not self.freeze_data: 39 | start = time.clock() 40 | self.generate_data() 41 | data_generating_time = time.clock()-start 42 | data_x = self.data_x 43 | data_y = self.data_y 44 | else: 45 | data_generating_time = 0. 46 | else: 47 | data_generating_time = 0. 
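# Spectral approach: get_null_samples_with_spectral_approach (above) simulates
# the null as sum_{p,q} lambdax_p * lambday_q * z_pq^2 with z_pq ~ N(0,1)
# (z_pq^2 - 1 in the unbiased case), where lambdax, lambday are the spectra of
# the centred Gram matrices (or of the feature covariances under rff/Nystrom);
# the observed statistic is rescaled by num_samples below to match this scale.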
48 | #print 'data generating time passed: ', data_generating_time 49 | hsic_statistic, _, _, _, Mx, My, _ = self.HSICmethod(unbiased=self.unbiased,data_x = data_x, data_y = data_y) 50 | null_samples = self.get_null_samples_with_spectral_approach(Mx, My) 51 | pvalue = ( 1+ sum( null_samples > self.num_samples*hsic_statistic ) ) / float( 1 + self.num_nullsims ) 52 | return pvalue, data_generating_time 53 | -------------------------------------------------------------------------------- /independence_testing/HSICTestObject.py: -------------------------------------------------------------------------------- 1 | from numpy import shape, fill_diagonal, zeros, mean, sqrt,identity,dot,diag 2 | from numpy.random import permutation, randn 3 | from independence_testing.TestObject import TestObject 4 | import numpy as np 5 | from abc import abstractmethod 6 | from kerpy.Kernel import Kernel 7 | import time 8 | from scipy.linalg import sqrtm,inv 9 | from numpy.linalg import eigh,svd 10 | 11 | 12 | 13 | class HSICTestObject(TestObject): 14 | def __init__(self, num_samples, data_generator=None, kernelX=None, kernelY=None, kernelZ = None, 15 | kernelX_use_median=False,kernelY_use_median=False,kernelZ_use_median=False, 16 | rff=False, num_rfx=None, num_rfy=None, induce_set=False, 17 | num_inducex = None, num_inducey = None, 18 | streaming=False, freeze_data=False): 19 | TestObject.__init__(self,self.__class__.__name__,streaming=streaming, freeze_data=freeze_data) 20 | self.num_samples = num_samples #We have same number of samples from X and Y in independence testing 21 | self.data_generator = data_generator 22 | self.kernelX = kernelX 23 | self.kernelY = kernelY 24 | self.kernelZ = kernelZ 25 | self.kernelX_use_median = kernelX_use_median #indicate if median heuristic for Gaussian Kernel should be used 26 | self.kernelY_use_median = kernelY_use_median 27 | self.kernelZ_use_median = kernelZ_use_median 28 | self.rff = rff 29 | self.num_rfx = num_rfx 30 | self.num_rfy = num_rfy 31 | self.induce_set = induce_set 32 | self.num_inducex = num_inducex 33 | self.num_inducey = num_inducey 34 | if self.rff|self.induce_set: 35 | self.HSICmethod = self.HSIC_with_shuffles_rff 36 | else: 37 | self.HSICmethod = self.HSIC_with_shuffles 38 | 39 | def generate_data(self,isConditionalTesting = False): 40 | if not isConditionalTesting: 41 | self.data_x, self.data_y = self.data_generator(self.num_samples) 42 | return self.data_x, self.data_y 43 | else: 44 | self.data_x, self.data_y, self.data_z = self.data_generator(self.num_samples) 45 | return self.data_x, self.data_y, self.data_z 46 | ''' for our SimDataGen examples, one argument suffice''' 47 | 48 | 49 | @staticmethod 50 | def HSIC_U_statistic(Kx,Ky): 51 | m = shape(Kx)[0] 52 | fill_diagonal(Kx,0.) 53 | fill_diagonal(Ky,0.) 
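# Unbiased HSIC U-statistic (cf. Song et al., 2012), computed from the
# diagonal-zeroed matrices (done above):
#   HSIC_u = [ tr(KxKy) + (1'Kx1)(1'Ky1)/((m-1)(m-2)) - (2/(m-2)) 1'KxKy1 ] / (m(m-3))
# which is exactly what the three terms below implement.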
54 | K = np.dot(Kx,Ky) 55 | first_term = np.trace(K)/float(m*(m-3.)) 56 | second_term = np.sum(Kx)*np.sum(Ky)/float(m*(m-3.)*(m-1.)*(m-2.)) 57 | third_term = 2.*np.sum(K)/float(m*(m-3.)*(m-2.)) 58 | return first_term+second_term-third_term 59 | 60 | 61 | @staticmethod 62 | def HSIC_V_statistic(Kx,Ky): 63 | Kxc=Kernel.center_kernel_matrix(Kx) 64 | Kyc=Kernel.center_kernel_matrix(Ky) 65 | return np.sum(Kxc*Kyc) 66 | 67 | 68 | @staticmethod 69 | def HSIC_V_statistic_rff(phix,phiy): 70 | m=shape(phix)[0] 71 | phix_c=phix-mean(phix,axis=0) 72 | phiy_c=phiy-mean(phiy,axis=0) 73 | featCov=(phix_c.T).dot(phiy_c)/float(m) 74 | return np.linalg.norm(featCov)**2 75 | 76 | 77 | # generalise distance correlation ---- a kernel interpretation 78 | @staticmethod 79 | def dCor_HSIC_statistic(Kx,Ky,unbiased=False): 80 | if unbiased: 81 | first_term = HSICTestObject.HSIC_U_statistic(Kx,Ky) 82 | second_term = HSICTestObject.HSIC_U_statistic(Kx,Kx)*HSICTestObject.HSIC_U_statistic(Ky,Ky) 83 | dCor = first_term/float(sqrt(second_term)) 84 | else: 85 | first_term = HSICTestObject.HSIC_V_statistic(Kx,Ky) 86 | second_term = HSICTestObject.HSIC_V_statistic(Kx,Kx)*HSICTestObject.HSIC_V_statistic(Ky,Ky) 87 | dCor = first_term/float(sqrt(second_term)) 88 | return dCor 89 | 90 | 91 | # approximated dCor using rff/Nystrom 92 | @staticmethod 93 | def dCor_HSIC_statistic_rff(phix,phiy): 94 | first_term = HSICTestObject.HSIC_V_statistic_rff(phix,phiy) 95 | second_term = HSICTestObject.HSIC_V_statistic_rff(phix,phix)*HSICTestObject.HSIC_V_statistic_rff(phiy,phiy) 96 | approx_dCor = first_term/float(sqrt(second_term)) 97 | return approx_dCor 98 | 99 | 100 | def SubdCor_HSIC_statistic(self,data_x=None,data_y=None,unbiased=True): 101 | if data_x is None: 102 | data_x=self.data_x 103 | if data_y is None: 104 | data_y=self.data_y 105 | dx = shape(data_x)[1] 106 | stats_value = zeros(dx) 107 | for dd in range(dx): 108 | Kx, Ky = self.compute_kernel_matrix_on_data(data_x[:,[dd]], data_y) 109 | stats_value[dd] = HSICTestObject.dCor_HSIC_statistic(Kx, Ky, unbiased) 110 | SubdCor = sum(stats_value)/float(dx) 111 | return SubdCor 112 | 113 | 114 | def SubHSIC_statistic(self,data_x=None,data_y=None,unbiased=True): 115 | if data_x is None: 116 | data_x=self.data_x 117 | if data_y is None: 118 | data_y=self.data_y 119 | dx = shape(data_x)[1] 120 | stats_value = zeros(dx) 121 | for dd in range(dx): 122 | Kx, Ky = self.compute_kernel_matrix_on_data(data_x[:,[dd]], data_y) 123 | if unbiased: 124 | stats_value[dd] = HSICTestObject.HSIC_U_statistic(Kx, Ky) 125 | else: 126 | stats_value[dd] = HSICTestObject.HSIC_V_statistic(Kx, Ky) 127 | SubHSIC = sum(stats_value)/float(dx) 128 | return SubHSIC 129 | 130 | 131 | def HSIC_with_shuffles(self,data_x=None,data_y=None,unbiased=True,num_shuffles=0, 132 | estimate_nullvar=False,isBlockHSIC=False): 133 | start = time.clock() 134 | if data_x is None: 135 | data_x=self.data_x 136 | if data_y is None: 137 | data_y=self.data_y 138 | time_passed = time.clock()-start 139 | if isBlockHSIC: 140 | Kx, Ky = self.compute_kernel_matrix_on_dataB(data_x,data_y) 141 | else: 142 | Kx, Ky = self.compute_kernel_matrix_on_data(data_x,data_y) 143 | ny=shape(data_y)[0] 144 | if unbiased: 145 | test_statistic = HSICTestObject.HSIC_U_statistic(Kx,Ky) 146 | else: 147 | test_statistic = HSICTestObject.HSIC_V_statistic(Kx,Ky) 148 | null_samples=zeros(num_shuffles) 149 | for jj in range(num_shuffles): 150 | pp = permutation(ny) 151 | Kpp = Ky[pp,:][:,pp] 152 | if unbiased: 153 | 
null_samples[jj]=HSICTestObject.HSIC_U_statistic(Kx,Kpp) 154 | else: 155 | null_samples[jj]=HSICTestObject.HSIC_V_statistic(Kx,Kpp) 156 | if estimate_nullvar: 157 | nullvarx, nullvary = self.unbiased_HSnorm_estimate_of_centred_operator(Kx,Ky) 158 | nullvarx = 2.* nullvarx 159 | nullvary = 2.* nullvary 160 | else: 161 | nullvarx, nullvary = None, None 162 | return test_statistic,null_samples,nullvarx,nullvary,Kx, Ky, time_passed 163 | 164 | 165 | 166 | def HSIC_with_shuffles_rff(self,data_x=None,data_y=None, 167 | unbiased=True,num_shuffles=0,estimate_nullvar=False): 168 | start = time.clock() 169 | if data_x is None: 170 | data_x=self.data_x 171 | if data_y is None: 172 | data_y=self.data_y 173 | time_passed = time.clock()-start 174 | if self.rff: 175 | phix, phiy = self.compute_rff_on_data(data_x,data_y) 176 | else: 177 | phix, phiy = self.compute_induced_kernel_matrix_on_data(data_x,data_y) 178 | ny=shape(data_y)[0] 179 | if unbiased: 180 | test_statistic = HSICTestObject.HSIC_U_statistic_rff(phix,phiy) 181 | else: 182 | test_statistic = HSICTestObject.HSIC_V_statistic_rff(phix,phiy) 183 | null_samples=zeros(num_shuffles) 184 | for jj in range(num_shuffles): 185 | pp = permutation(ny) 186 | if unbiased: 187 | null_samples[jj]=HSICTestObject.HSIC_U_statistic_rff(phix,phiy[pp]) 188 | else: 189 | null_samples[jj]=HSICTestObject.HSIC_V_statistic_rff(phix,phiy[pp]) 190 | if estimate_nullvar: 191 | raise NotImplementedError() 192 | else: 193 | nullvarx, nullvary = None, None 194 | return test_statistic, null_samples, nullvarx, nullvary,phix, phiy, time_passed 195 | 196 | 197 | def get_spectrum_on_data(self, Mx, My): 198 | '''Mx and My are Kx Ky when rff =False 199 | Mx and My are phix, phiy when rff =True''' 200 | if self.rff|self.induce_set: 201 | Cx = np.cov(Mx.T) 202 | Cy = np.cov(My.T) 203 | lambdax=np.linalg.eigvalsh(Cx) 204 | lambday=np.linalg.eigvalsh(Cy) 205 | else: 206 | Kxc = Kernel.center_kernel_matrix(Mx) 207 | Kyc = Kernel.center_kernel_matrix(My) 208 | lambdax=np.linalg.eigvalsh(Kxc) 209 | lambday=np.linalg.eigvalsh(Kyc) 210 | return lambdax,lambday 211 | 212 | 213 | @abstractmethod 214 | def compute_kernel_matrix_on_data(self,data_x,data_y): 215 | if self.kernelX_use_median: 216 | sigmax = self.kernelX.get_sigma_median_heuristic(data_x) 217 | self.kernelX.set_width(float(sigmax)) 218 | if self.kernelY_use_median: 219 | sigmay = self.kernelY.get_sigma_median_heuristic(data_y) 220 | self.kernelY.set_width(float(sigmay)) 221 | Kx=self.kernelX.kernel(data_x) 222 | Ky=self.kernelY.kernel(data_y) 223 | return Kx, Ky 224 | 225 | 226 | @abstractmethod 227 | def compute_kernel_matrix_on_dataB(self,data_x,data_y): 228 | Kx=self.kernelX.kernel(data_x) 229 | Ky=self.kernelY.kernel(data_y) 230 | return Kx, Ky 231 | 232 | 233 | 234 | @abstractmethod 235 | def compute_kernel_matrix_on_data_CI(self,data_x,data_y,data_z): 236 | if self.kernelX_use_median: 237 | sigmax = self.kernelX.get_sigma_median_heuristic(data_x) 238 | self.kernelX.set_width(float(sigmax)) 239 | if self.kernelY_use_median: 240 | sigmay = self.kernelY.get_sigma_median_heuristic(data_y) 241 | self.kernelY.set_width(float(sigmay)) 242 | if self.kernelZ_use_median: 243 | sigmaz = self.kernelZ.get_sigma_median_heuristic(data_z) 244 | self.kernelZ.set_width(float(sigmaz)) 245 | Kx=self.kernelX.kernel(data_x) 246 | Ky=self.kernelY.kernel(data_y) 247 | Kz=self.kernelZ.kernel(data_z) 248 | return Kx, Ky,Kz 249 | 250 | 251 | 252 | 253 | def unbiased_HSnorm_estimate_of_centred_operator(self,Kx,Ky): 254 | '''returns an unbiased estimate 
of 2*Sum_p Sum_q lambda^2_p theta^2_q 255 | where lambda and theta are the eigenvalues of the centered matrices for X and Y respectively''' 256 | varx = HSICTestObject.HSIC_U_statistic(Kx,Kx) 257 | vary = HSICTestObject.HSIC_U_statistic(Ky,Ky) 258 | return varx,vary 259 | 260 | 261 | @abstractmethod 262 | def compute_rff_on_data(self,data_x,data_y): 263 | self.kernelX.rff_generate(self.num_rfx,dim=shape(data_x)[1]) 264 | self.kernelY.rff_generate(self.num_rfy,dim=shape(data_y)[1]) 265 | if self.kernelX_use_median: 266 | sigmax = self.kernelX.get_sigma_median_heuristic(data_x) 267 | self.kernelX.set_width(float(sigmax)) 268 | if self.kernelY_use_median: 269 | sigmay = self.kernelY.get_sigma_median_heuristic(data_y) 270 | self.kernelY.set_width(float(sigmay)) 271 | phix = self.kernelX.rff_expand(data_x) 272 | phiy = self.kernelY.rff_expand(data_y) 273 | return phix, phiy 274 | 275 | 276 | @abstractmethod 277 | def compute_induced_kernel_matrix_on_data(self,data_x,data_y): 278 | '''Z follows the same distribution as X; W follows that of Y. 279 | The current data generating methods we use 280 | generate X and Y at the same time. ''' 281 | size_induced_set = max(self.num_inducex,self.num_inducey) 282 | #print "size_induce_set", size_induced_set 283 | if self.data_generator is None: 284 | subsample_idx = np.random.randint(self.num_samples, size=size_induced_set) 285 | self.data_z = data_x[subsample_idx,:] 286 | self.data_w = data_y[subsample_idx,:] 287 | else: 288 | self.data_z, self.data_w = self.data_generator(size_induced_set) 289 | self.data_z[[range(self.num_inducex)],:] 290 | self.data_w[[range(self.num_inducey)],:] 291 | #print 'Induce Set' 292 | if self.kernelX_use_median: 293 | sigmax = self.kernelX.get_sigma_median_heuristic(data_x) 294 | self.kernelX.set_width(float(sigmax)) 295 | if self.kernelY_use_median: 296 | sigmay = self.kernelY.get_sigma_median_heuristic(data_y) 297 | self.kernelY.set_width(float(sigmay)) 298 | Kxz = self.kernelX.kernel(data_x,self.data_z) 299 | Kzz = self.kernelX.kernel(self.data_z) 300 | #R = inv(sqrtm(Kzz)) 301 | R = inv(sqrtm(Kzz + np.eye(np.shape(Kzz)[0])*10**(-6))) 302 | phix = Kxz.dot(R) 303 | Kyw = self.kernelY.kernel(data_y,self.data_w) 304 | Kww = self.kernelY.kernel(self.data_w) 305 | #S = inv(sqrtm(Kww)) 306 | S = inv(sqrtm(Kww + np.eye(np.shape(Kww)[0])*10**(-6))) 307 | phiy = Kyw.dot(S) 308 | return phix, phiy 309 | 310 | 311 | def compute_pvalue(self,data_x=None,data_y=None): 312 | pvalue,_=self.compute_pvalue_with_time_tracking(data_x,data_y) 313 | return pvalue 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | -------------------------------------------------------------------------------- /independence_testing/SimDataGen.py: -------------------------------------------------------------------------------- 1 | from numpy.random import uniform, permutation, multivariate_normal,normal 2 | from numpy import pi, prod, empty, sin, cos, asscalar, shape,zeros, identity,arange,sign,sum,sqrt,transpose, tanh,sinh 3 | import numpy as np 4 | 5 | 6 | 7 | class SimDataGen(object): 8 | def __init__(self): 9 | pass 10 | 11 | 12 | @staticmethod 13 | def LargeScale(num_samples, dimension=4): 14 | ''' dimension takes large even numbers, e.g. 
50, 100 ''' 15 | Xmean = zeros(dimension) 16 | Xcov = identity(dimension) 17 | data_x = multivariate_normal(Xmean, Xcov, num_samples) 18 | dd = dimension/2 19 | Zmean = zeros(dd+1) 20 | Zcov = identity(dd+1) 21 | Z = multivariate_normal(Zmean, Zcov, num_samples) 22 | first_term = sqrt(2./dimension)*sum(sign(data_x[:,arange(0,dimension,2)]* data_x[:,arange(1,dimension,2)])*abs(Z[:,range(dd)]),axis=1,keepdims=True) 23 | second_term = Z[:,[dd]] #take the last dimension of Z 24 | data_y = first_term + second_term 25 | return data_x, data_y 26 | 27 | 28 | @staticmethod 29 | def VaryDimension(num_samples, dimension = 5): 30 | Xmean = zeros(dimension) 31 | Xcov = identity(dimension) 32 | data_x = multivariate_normal(Xmean, Xcov, num_samples) 33 | data_z = transpose([normal(0,1,num_samples)]) 34 | data_y = 20*sin(4*pi*(data_x[:,[0]]**2 + data_x[:,[1]]**2)) + data_z 35 | return data_x,data_y 36 | 37 | 38 | @staticmethod 39 | def SimpleLn(num_samples, dimension = 5): 40 | Xmean = zeros(dimension) 41 | Xcov = identity(dimension) 42 | data_x = multivariate_normal(Xmean, Xcov, num_samples) 43 | data_z = transpose([normal(0,1,num_samples)]) 44 | data_y = data_x[:,[0]] + data_z 45 | return data_x, data_y 46 | 47 | 48 | @staticmethod 49 | def turn_into_null(fn): 50 | def null_fn(*args, **kwargs): 51 | dataX,dataY=fn(*args, **kwargs) 52 | num_samples=shape(dataX)[0] 53 | pp = permutation(num_samples) 54 | return dataX,dataY[pp] 55 | return null_fn 56 | -------------------------------------------------------------------------------- /independence_testing/SubHSICTestObject.py: -------------------------------------------------------------------------------- 1 | from HSICTestObject import HSICTestObject 2 | from numpy import zeros 3 | import time 4 | from numpy.random import permutation 5 | 6 | class SubHSICTestObject(HSICTestObject): 7 | 8 | def __init__(self, num_samples, data_generator=None, kernelX=None, kernelY=None, kernelX_use_median=False, 9 | kernelY_use_median=False, num_rfx=None, num_rfy=None, rff=False, num_shuffles=1000, unbiased=True): 10 | HSICTestObject.__init__(self, num_samples, data_generator=data_generator, kernelX=kernelX, kernelY=kernelY, 11 | kernelX_use_median=kernelX_use_median,kernelY_use_median=kernelY_use_median, 12 | num_rfx=num_rfx, num_rfy=num_rfy, rff=rff) 13 | self.num_samples = num_samples 14 | self.num_shuffles = num_shuffles 15 | self.unbiased = unbiased 16 | 17 | 18 | def compute_pvalue_with_time_tracking(self,data_x=None,data_y=None): 19 | if data_x is None and data_y is None: 20 | if not self.streaming and not self.freeze_data: 21 | start = time.clock() 22 | self.generate_data() 23 | data_generating_time = time.clock()-start 24 | data_x = self.data_x 25 | data_y = self.data_y 26 | else: 27 | data_generating_time = 0. 28 | else: 29 | data_generating_time = 0. 
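# SubHSIC (see SubHSIC_statistic in HSICTestObject) averages the HSIC statistic
# between each single coordinate of X and the full Y; the loop below builds its
# permutation null by shuffling the rows of Y.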
30 | #print 'data generating time passed: ', data_generating_time 31 | SubHSIC_statistic = self.SubHSIC_statistic(unbiased=self.unbiased,data_x=data_x, data_y = data_y) 32 | null_samples=zeros(self.num_shuffles) 33 | for jj in range(self.num_shuffles): 34 | pp = permutation(self.num_samples) 35 | yy = data_y[pp,:] #permute the local data_y (self.data_y may be unset when data is passed in) 36 | null_samples[jj]=self.SubHSIC_statistic(data_x = data_x, data_y = yy, unbiased = self.unbiased) 37 | pvalue = ( sum( null_samples > SubHSIC_statistic ) ) / float( self.num_shuffles ) 38 | return pvalue, data_generating_time -------------------------------------------------------------------------------- /independence_testing/TestExperiment.py: -------------------------------------------------------------------------------- 1 | """ 2 | This program is free software; you can redistribute it and/or modify 3 | it under the terms of the GNU General Public License as published by 4 | the Free Software Foundation; either version 3 of the License, or 5 | (at your option) any later version. 6 | 7 | Written (W) 2013 Dino Sejdinovic 8 | """ 9 | from numpy import arange,zeros,mean 10 | import os 11 | from pickle import load, dump 12 | import time 13 | 14 | class TestExperiment(object): 15 | def __init__(self, name, param, test_object): 16 | self.name=name 17 | self.param=param 18 | self.test_object=test_object 19 | self.folder="results/res_"+self.name 20 | 21 | def compute_pvalue(self): 22 | return self.test_object.compute_pvalue() 23 | 24 | def compute_pvalue_with_time_tracking(self): 25 | return self.test_object.compute_pvalue_with_time_tracking() 26 | 27 | def perform_test(self, alpha): 28 | return self.test_object.perform_test(alpha) 29 | 30 | def run_test_trials(self, numTrials, alpha=0.05): 31 | completedTrials = 0 32 | counter_init = 0 33 | save_filename = self.folder+"/results.bin" 34 | average_time_init=0.0 35 | pvalues_init=list() 36 | if not os.path.exists(self.folder): 37 | os.mkdir(self.folder) 38 | elif os.path.exists(save_filename): 39 | load_f = open(save_filename,"r") 40 | [counter_init, completedTrials, _, average_time_init,pvalues_init] = load(load_f) 41 | load_f.close() 42 | print "Found %d completed trials" % completedTrials 43 | if completedTrials >= numTrials: 44 | print "Exiting" 45 | return 0 46 | else: 47 | print "Continuing" 48 | counter = counter_init 49 | pvalues = pvalues_init 50 | times_passed=zeros(numTrials-completedTrials) 51 | for trial in arange(completedTrials,numTrials): 52 | start=time.clock() 53 | print "Trial %d" % trial 54 | pvalue, data_generating_time = self.compute_pvalue_with_time_tracking() 55 | counter += pvalue<alpha 24 | if (kerpar<0) or (kerpar>=2): 25 | raise ValueError("incorrect parameter value") 26 | self.alpha=kerpar 27 | 28 | 29 | def kernel(self, X, Y=None): 30 | 31 | GenericTests.check_type(X,'X',np.ndarray,2) 32 | # if X=Y, use more efficient pdist call which exploits symmetry 33 | normX=reshape(np.linalg.norm(X,axis=1),(len(X),1)) 34 | if Y is None: 35 | dists = squareform(pdist(X, 'euclidean')) 36 | normY=normX.T 37 | else: 38 | GenericTests.check_type(Y,'Y',np.ndarray,2) 39 | assert(shape(X)[1]==shape(Y)[1]) 40 | normY=reshape(np.linalg.norm(Y,axis=1),(1,len(Y))) 41 | dists = cdist(X, Y, 'euclidean') 42 | K=0.5*(normX**self.alpha+normY**self.alpha-dists**self.alpha) 43 | return K 44 | 45 | def gradient(self, x, Y): 46 | raise NotImplementedError() 47 | 48 | if __name__ == '__main__': 49 | from tools.UnitTests import UnitTests 50 | UnitTests.UnitTestDefaultKernel(BrownianKernel) 51 |
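A side note on the Brownian kernel above: with alpha = 1 (equivalently H = 0.5), HSIC computed with this kernel coincides, up to scaling, with distance covariance (Sejdinovic et al., 2013), which is why ExperimentsHSICSpectral.py pairs it with that setting. A minimal usage sketch of the kernel with the spectral test follows -- the toy data and parameter values are illustrative assumptions, not part of the repository:

```python
# minimal sketch (Python 2, matching the codebase); toy values for illustration only
from numpy.random import randn
from kerpy.BrownianKernel import BrownianKernel
from independence_testing.HSICSpectralTestObject import HSICSpectralTestObject

num_samples = 500
data_x = randn(num_samples, 3)
data_y = data_x[:, [0]] + 0.5 * randn(num_samples, 1)  # Y depends on the first coordinate of X

test = HSICSpectralTestObject(num_samples, kernelX=BrownianKernel(1.), kernelY=BrownianKernel(1.),
                              num_nullsims=1000)
print "p-value:", test.compute_pvalue(data_x, data_y)  # should typically be small here
```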
-------------------------------------------------------------------------------- /kerpy/GaussianBagKernel.py: -------------------------------------------------------------------------------- 1 | from kerpy.BagKernel import BagKernel 2 | from abc import abstractmethod 3 | from numpy import exp, zeros, dot, cos, sin, concatenate, sqrt, mean, median 4 | from numpy.random.mtrand import randn 5 | import numpy as np 6 | 7 | class GaussianBagKernel(BagKernel): 8 | def __init__(self,data_kernel,sigma=1.0): 9 | BagKernel.__init__(self,data_kernel) 10 | self.width=sigma 11 | 12 | def __str__(self): 13 | s=self.__class__.__name__+ "[" 14 | s += "width="+ str(self.width) 15 | s += ", " + BagKernel.__str__(self) 16 | s += "]" 17 | return s 18 | 19 | def rff_generate(self,mbags=20,mdata=20,dim=1): 20 | ''' 21 | mbags:: number of random features for bag kernel 22 | mdata:: number of random features for data kernel 23 | dim:: data dimensionality 24 | ''' 25 | self.data_kernel.rff_generate(mdata,dim=dim) 26 | self.rff_num=mbags 27 | self.unit_rff_freq=randn(mbags/2,mdata) 28 | self.rff_freq=self.unit_rff_freq/self.width 29 | 30 | def rff_expand(self,bagX): 31 | if self.rff_freq is None: 32 | raise ValueError("rff_freq has not been set. use rff_generate first") 33 | nx=len(bagX) 34 | featuremeans=zeros((nx,self.data_kernel.rff_num)) 35 | for ii in range(nx): 36 | featuremeans[ii]=mean(self.data_kernel.rff_expand(bagX[ii]),axis=0) 37 | xdotw=dot(featuremeans,(self.rff_freq).T) 38 | return sqrt(2./self.rff_num)*concatenate( ( cos(xdotw),sin(xdotw) ) , axis=1 ) 39 | 40 | 41 | def compute_BagKernel_value(self,bag1,bag2): 42 | return exp(-0.5 * self.data_kernel.estimateMMD(bag1,bag2) / self.width ** 2) 43 | 44 | def get_sigma_median_heuristic(self,X): 45 | nx=np.shape(X)[0] 46 | if nx>200: 47 | X=X[np.random.permutation(nx)[:200]] 48 | n=min(nx,200) 49 | D=zeros((n,n)) 50 | for ii in range(n): 51 | zi = X[ii] 52 | for jj in range(ii+1,n): 53 | zj = X[jj] 54 | D[ii,jj]=sqrt(self.data_kernel.estimateMMD(zi,zj)) 55 | D=self.symmetrize(D) 56 | median_dist=median(D[D>0]) 57 | sigma=median_dist/sqrt(2.) 
58 | return sigma 59 | 60 | if __name__ == '__main__': 61 | from tools.UnitTests import UnitTests 62 | UnitTests.UnitTestBagKernel(GaussianBagKernel) -------------------------------------------------------------------------------- /kerpy/GaussianKernel.py: -------------------------------------------------------------------------------- 1 | from kerpy.Kernel import Kernel 2 | from numpy import exp, shape, reshape, sqrt, median 3 | from numpy.random import permutation,randn 4 | from scipy.spatial.distance import squareform, pdist, cdist 5 | import warnings 6 | from tools.GenericTests import GenericTests 7 | import numpy as np 8 | 9 | class GaussianKernel(Kernel): 10 | def __init__(self, sigma=1.0, is_sparse = False): 11 | Kernel.__init__(self) 12 | self.width = sigma 13 | self.is_sparse = is_sparse 14 | 15 | def __str__(self): 16 | s=self.__class__.__name__+ "[" 17 | s += "width="+ str(self.width) 18 | s += "]" 19 | return s 20 | 21 | def kernel(self, X, Y=None): 22 | """ 23 | Computes the standard Gaussian kernel k(x,y)=exp(-0.5* ||x-y||**2 / sigma**2) 24 | 25 | X - 2d numpy.ndarray, first set of samples: 26 | number of rows: number of samples 27 | number of columns: dimensionality 28 | Y - 2d numpy.ndarray, second set of samples, can be None in which case its replaced by X 29 | """ 30 | if self.is_sparse: 31 | X = X.todense() 32 | Y = Y.todense() 33 | GenericTests.check_type(X, 'X',np.ndarray) 34 | assert(len(shape(X))==2) 35 | 36 | # if X=Y, use more efficient pdist call which exploits symmetry 37 | if Y is None: 38 | sq_dists = squareform(pdist(X, 'sqeuclidean')) 39 | else: 40 | GenericTests.check_type(Y, 'Y',np.ndarray) 41 | assert(len(shape(Y))==2) 42 | assert(shape(X)[1]==shape(Y)[1]) 43 | sq_dists = cdist(X, Y, 'sqeuclidean') 44 | 45 | K = exp(-0.5 * (sq_dists) / self.width ** 2) 46 | return K 47 | 48 | 49 | def gradient(self, x, Y): 50 | """ 51 | Computes the gradient of the Gaussian kernel wrt. to the left argument, i.e. 52 | k(x,y)=exp(-0.5* ||x-y||**2 / sigma**2), which is 53 | \nabla_x k(x,y)=1.0/sigma**2 k(x,y)(y-x) 54 | Given a set of row vectors Y, this computes the 55 | gradient for every pair (x,y) for y in Y. 56 | """ 57 | if self.is_sparse: 58 | x = x.todense() 59 | Y = Y.todense() 60 | assert(len(shape(x))==1) 61 | assert(len(shape(Y))==2) 62 | assert(len(x)==shape(Y)[1]) 63 | 64 | x_2d=reshape(x, (1, len(x))) 65 | k = self.kernel(x_2d, Y) 66 | differences = Y - x 67 | G = (1.0 / self.width ** 2) * (k.T * differences) 68 | return G 69 | 70 | 71 | def rff_generate(self,m,dim=1): 72 | self.rff_num=m 73 | self.unit_rff_freq=randn(int(m/2),dim) 74 | self.rff_freq=self.unit_rff_freq/self.width 75 | 76 | @staticmethod 77 | def get_sigma_median_heuristic(X, is_sparse = False): 78 | if is_sparse: 79 | X = X.todense() 80 | n=shape(X)[0] 81 | if n>1000: 82 | X=X[permutation(n)[:1000],:] 83 | dists=squareform(pdist(X, 'euclidean')) 84 | median_dist=median(dists[dists>0]) 85 | sigma=median_dist/sqrt(2.) 
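# with sigma = median_dist/sqrt(2.), the exponent -0.5*d^2/sigma^2 equals
# -(d/median_dist)^2, so the kernel value at the median inter-point distance is exp(-1)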
86 | return sigma 87 | -------------------------------------------------------------------------------- /kerpy/HypercubeKernel.py: -------------------------------------------------------------------------------- 1 | from numpy import tanh 2 | import numpy 3 | from scipy.spatial.distance import squareform, pdist, cdist 4 | 5 | from kerpy.Kernel import Kernel 6 | 7 | 8 | class HypercubeKernel(Kernel): 9 | def __init__(self, gamma): 10 | Kernel.__init__(self) 11 | 12 | if type(gamma) is not float: 13 | raise TypeError("Gamma must be float") 14 | 15 | self.gamma = gamma 16 | 17 | def __str__(self): 18 | s = self.__class__.__name__ + "=[" 19 | s += "gamma=" + str(self.gamma) 20 | s += ", " + Kernel.__str__(self) 21 | s += "]" 22 | return s 23 | 24 | def kernel(self, X, Y=None): 25 | """ 26 | Computes the hypercube kernel k(x,y)=tanh(gamma)^d(x,y), where d is the 27 | Hamming distance between x and y 28 | 29 | X - 2d numpy.bool8 array, samples on right hand side 30 | Y - 2d numpy.bool8 array, samples on left hand side. 31 | Can be None in which case it is replaced by X 32 | """ 33 | 34 | if not type(X) is numpy.ndarray: 35 | raise TypeError("X must be numpy array") 36 | 37 | if not len(X.shape) == 2: 38 | raise ValueError("X must be 2D numpy array") 39 | 40 | if not X.dtype == numpy.bool8: 41 | raise ValueError("X must be boolean numpy array") 42 | 43 | if not Y is None: 44 | if not type(Y) is numpy.ndarray: 45 | raise TypeError("Y must be None or numpy array") 46 | 47 | if not len(Y.shape) == 2: 48 | raise ValueError("Y must be None or 2D numpy array") 49 | 50 | if not Y.dtype == numpy.bool8: 51 | raise ValueError("Y must be boolean numpy array") 52 | 53 | if not X.shape[1] == Y.shape[1]: 54 | raise ValueError("X and Y must have same dimension if Y is not None") 55 | 56 | # un-normalise normalised hamming distance in both cases 57 | if Y is None: 58 | K = tanh(self.gamma) ** squareform(pdist(X, 'hamming') * X.shape[1]) 59 | else: 60 | K = tanh(self.gamma) ** (cdist(X, Y, 'hamming') * X.shape[1]) 61 | 62 | return K 63 | 64 | def gradient(self, x, Y): 65 | """ 66 | Computes the gradient of the hypercube kernel wrt.
to the left argument 67 | 68 | x - single sample on right hand side (1D vector) 69 | Y - samples on left hand side (2D matrix) 70 | """ 71 | pass 72 | 73 | -------------------------------------------------------------------------------- /kerpy/Kernel.py: -------------------------------------------------------------------------------- 1 | from abc import abstractmethod 2 | from numpy import eye, concatenate, zeros, shape, mean, reshape, arange, exp, outer,\ 3 | linalg, dot, cos, sin, sqrt, inf 4 | from numpy.random import permutation 5 | from numpy.lib.index_tricks import fill_diagonal 6 | from matplotlib.pyplot import imshow,show 7 | import numpy as np 8 | import matplotlib.pyplot as plt 9 | import matplotlib.cm as cm 10 | import warnings 11 | from tools.GenericTests import GenericTests 12 | 13 | 14 | 15 | 16 | class Kernel(object): 17 | def __init__(self): 18 | self.rff_num=None 19 | self.rff_freq=None 20 | pass 21 | 22 | def __str__(self): 23 | s="" 24 | return s 25 | 26 | @abstractmethod 27 | def kernel(self, X, Y=None): 28 | raise NotImplementedError() 29 | 30 | @abstractmethod 31 | def set_kerpar(self,kerpar): 32 | self.set_width(kerpar) 33 | 34 | @abstractmethod 35 | def set_width(self, width): 36 | if hasattr(self, 'width'): 37 | warnmsg="\nChanging kernel width from "+str(self.width)+" to "+str(width) 38 | #warnings.warn(warnmsg) ---need to add verbose argument to show these warning messages 39 | if self.rff_freq is not None: 40 | warnmsg="\nrff frequencies found. rescaling to width " +str(width) 41 | #warnings.warn(warnmsg) 42 | self.rff_freq=self.unit_rff_freq/width 43 | self.width=width 44 | else: 45 | raise ValueError("Senseless: kernel has no 'width' attribute!") 46 | 47 | @abstractmethod 48 | def rff_generate(self,m,dim=1): 49 | raise NotImplementedError() 50 | 51 | @abstractmethod 52 | def rff_expand(self,X): 53 | if self.rff_freq is None: 54 | raise ValueError("rff_freq has not been set. use rff_generate first") 55 | """ 56 | Computes the random Fourier features for the input dataset X 57 | for a set of frequencies in rff_freq. 
58 | This set of frequencies has to be precomputed 59 | X - 2d numpy.ndarray, first set of samples: 60 | number of rows: number of samples 61 | number of columns: dimensionality 62 | """ 63 | GenericTests.check_type(X, 'X',np.ndarray) 64 | xdotw=dot(X,(self.rff_freq).T) 65 | return sqrt(2./self.rff_num)*np.concatenate( ( cos(xdotw),sin(xdotw) ) , axis=1 ) 66 | 67 | @abstractmethod 68 | def gradient(self, x, Y): 69 | 70 | # ensure this in every implementation 71 | assert(len(shape(x))==1) 72 | assert(len(shape(Y))==2) 73 | assert(len(x)==shape(Y)[1]) 74 | 75 | raise NotImplementedError() 76 | 77 | @staticmethod 78 | def centering_matrix(n): 79 | """ 80 | Returns the centering matrix eye(n) - 1.0 / n 81 | """ 82 | return eye(n) - 1.0 / n 83 | 84 | @staticmethod 85 | def center_kernel_matrix(K): 86 | """ 87 | Centers the kernel matrix via a centering matrix H=I-1/n and returns HKH 88 | """ 89 | n = shape(K)[0] 90 | H = eye(n) - 1.0 / n 91 | return 1.0 / n * H.dot(K.dot(H)) 92 | 93 | 94 | @abstractmethod 95 | def show_kernel_matrix(self,X,Y=None): 96 | K=self.kernel(X,Y) 97 | imshow(K, interpolation="nearest") 98 | show() 99 | 100 | @abstractmethod 101 | def svc(self,X,y,lmbda=1.0,Xtst=None,ytst=None): 102 | from sklearn import svm 103 | svc=svm.SVC(kernel=self.kernel,C=lmbda) 104 | svc.fit(X,y) 105 | if Xtst is None: 106 | return svc 107 | else: 108 | ypre=svc.predict(Xtst) 109 | if ytst is None: 110 | return svc,ypre 111 | else: 112 | return svc,ypre,1-svc.score(Xtst,ytst) 113 | 114 | @abstractmethod 115 | def svc_rff(self,X,y,lmbda=1.0,Xtst=None,ytst=None): 116 | from sklearn import svm 117 | phi=self.rff_expand(X) 118 | svc=svm.LinearSVC(C=lmbda,dual=True) 119 | svc.fit(phi,y) 120 | if Xtst is None: 121 | return svc 122 | else: 123 | phitst=self.rff_expand(Xtst) 124 | ypre=svc.predict(phitst) 125 | if ytst is None: 126 | return svc,ypre 127 | else: 128 | return svc,ypre,1-svc.score(phitst,ytst) 129 | 130 | @abstractmethod 131 | def ridge_regress(self,X,y,lmbda=0.01,Xtst=None,ytst=None): 132 | K=self.kernel(X) 133 | n=shape(K)[0] 134 | aa=linalg.solve(K+lmbda*eye(n),y) 135 | if Xtst is None: 136 | return aa 137 | else: 138 | ypre=dot(aa.T,self.kernel(X,Xtst)).T 139 | if ytst is None: 140 | return aa,ypre 141 | else: 142 | return aa,ypre,(linalg.norm(ytst-ypre)**2)/np.shape(ytst)[0] 143 | 144 | @abstractmethod 145 | def ridge_regress_rff(self,X,y,lmbda=0.01,Xtst=None,ytst=None): 146 | # if self.rff_freq is None: 147 | # warnings.warn("\nrff_freq has not been set!\nGenerating new random frequencies (m=100 by default)") 148 | # self.rff_generate(100,dim=shape(X)[1]) 149 | # print shape(X)[1] 150 | phi=self.rff_expand(X) 151 | bb=linalg.solve(dot(phi.T,phi)+lmbda*eye(self.rff_num),dot(phi.T,y)) 152 | if Xtst is None: 153 | return bb 154 | else: 155 | phitst=self.rff_expand(Xtst) 156 | ypre=dot(phitst,bb) 157 | if ytst is None: 158 | return bb,ypre 159 | else: 160 | return bb,ypre,(linalg.norm(ytst-ypre)**2)/np.shape(ytst)[0] 161 | 162 | @abstractmethod 163 | def xvalidate( self,X,y, method = 'ridge_regress', \ 164 | regpar_grid=(1+arange(25))/200.0, \ 165 | kerpar_grid=exp(-13+arange(25)), \ 166 | numFolds = 10, verbose = False, visualise = False): 167 | from sklearn import cross_validation 168 | which_method = getattr(self,method) 169 | n=len(X) 170 | kf=cross_validation.KFold(n,n_folds=numFolds) 171 | xvalerr=zeros((len(regpar_grid),len(kerpar_grid))) 172 | width_idx=0 173 | for width in kerpar_grid: 174 | try: 175 | self.set_kerpar(width) 176 | except ValueError: 177 | 
xvalerr[:,width_idx]=inf 178 | warnings.warn("...invalid kernel parameter value in cross-validation. ignoring\n") 179 | width_idx+=1 180 | continue 181 | else: 182 | lmbda_idx=0 183 | for lmbda in regpar_grid: 184 | fold = 0 185 | prederr = zeros(numFolds) 186 | for train_index, test_index in kf: 187 | if type(X)==list: 188 | #could use slicing to speed this up when X is a list 189 | #currently uses sklearn cross_validation framework which returns indices as arrays 190 | #so simple list comprehension below 191 | X_train = [X[i] for i in train_index] 192 | X_test = [X[i] for i in test_index] 193 | else: 194 | X_train, X_test = X[train_index], X[test_index] 195 | if type(y)==list: 196 | y_train = [y[i] for i in train_index] 197 | y_test = [y[i] for i in test_index] 198 | else: 199 | y_train, y_test = y[train_index], y[test_index] 200 | _,_,prederr[fold]=which_method(X_train,y_train,lmbda=lmbda,Xtst=X_test,ytst=y_test) 201 | fold+=1 202 | xvalerr[lmbda_idx,width_idx]=mean(prederr) 203 | if verbose: 204 | print("kerpar:"+str(width)+", regpar:"+str(lmbda)) 205 | print(" cross-validated loss:"+str(xvalerr[lmbda_idx,width_idx])) 206 | lmbda_idx+=1 207 | width_idx+=1 208 | min_idx = np.unravel_index(np.argmin(xvalerr),shape(xvalerr)) 209 | if visualise: 210 | plt.imshow(xvalerr, interpolation='none', 211 | origin='lower', 212 | cmap=cm.pink) 213 | #extent=(regpar_grid[0],regpar_grid[-1],kerpar_grid[0],kerpar_grid[-1])) 214 | plt.colorbar() 215 | plt.title("cross-validated loss") 216 | plt.ylabel("regularisation parameter") 217 | plt.xlabel("kernel parameter") 218 | show() 219 | return regpar_grid[min_idx[0]],kerpar_grid[min_idx[1]] 220 | 221 | @abstractmethod 222 | def estimateMMD(self,sample1,sample2,unbiased=False): 223 | """ 224 | Compute the MMD between two samples 225 | """ 226 | K11 = self.kernel(sample1) 227 | K22 = self.kernel(sample2) 228 | K12 = self.kernel(sample1,sample2) 229 | if unbiased: 230 | fill_diagonal(K11,0.0) 231 | fill_diagonal(K22,0.0) 232 | n=float(shape(K11)[0]) 233 | m=float(shape(K22)[0]) 234 | return sum(sum(K11))/(pow(n,2)-n) + sum(sum(K22))/(pow(m,2)-m) - 2*mean(K12[:]) 235 | else: 236 | return mean(K11[:])+mean(K22[:])-2*mean(K12[:]) 237 | 238 | 239 | 240 | @abstractmethod 241 | def estimateMMD_rff(self,sample1,sample2,unbiased=False): 242 | # if self.rff_freq is None: 243 | # warnings.warn("\nrff_freq has not been set!\nGenerating new random frequencies (m=100 by default)") 244 | # self.rff_generate(100,dim=shape(sample1)[1]) 245 | phi1=self.rff_expand(sample1) 246 | phi2=self.rff_expand(sample2) 247 | featuremean1=mean(phi1,axis=0) 248 | featuremean2=mean(phi2,axis=0) 249 | if unbiased: 250 | nx=shape(phi1)[0] 251 | ny=shape(phi2)[0] 252 | first_term=nx/(nx-1.0)*( dot(featuremean1,featuremean1) \ 253 | -mean(linalg.norm(phi1,axis=1)**2)/nx ) 254 | second_term=ny/(ny-1.0)*( dot(featuremean2,featuremean2) \ 255 | -mean(linalg.norm(phi2,axis=1)**2)/ny ) 256 | third_term=-2*dot(featuremean1,featuremean2) 257 | return first_term+second_term+third_term 258 | else: 259 | return linalg.norm(featuremean1-featuremean2)**2 260 | -------------------------------------------------------------------------------- /kerpy/LinearBagKernel.py: -------------------------------------------------------------------------------- 1 | from kerpy.BagKernel import BagKernel 2 | import numpy as np 3 | from tools.GenericTests import GenericTests 4 | from kerpy.GaussianKernel import GaussianKernel 5 | from abc import abstractmethod 6 | 7 | class LinearBagKernel(BagKernel): 8 | def 
__init__(self,data_kernel): 9 | BagKernel.__init__(self,data_kernel) 10 | 11 | def __str__(self): 12 | s=self.__class__.__name__+ "[" 13 | s += "" + BagKernel.__str__(self) 14 | s += "]" 15 | return s 16 | 17 | def rff_generate(self,mdata=20,dim=1): 18 | ''' 19 | mdata:: number of random features for data kernel 20 | dim:: data dimensionality 21 | ''' 22 | self.data_kernel.rff_generate(mdata,dim=dim) 23 | self.rff_num=mdata 24 | 25 | def rff_expand(self,bagX): 26 | nx=len(bagX) 27 | featuremeans=np.zeros((nx,self.data_kernel.rff_num)) 28 | for ii in range(nx): 29 | featuremeans[ii]=np.mean(self.data_kernel.rff_expand(bagX[ii]),axis=0) 30 | return featuremeans 31 | 32 | def compute_BagKernel_value(self,bag1,bag2): 33 | innerK=self.data_kernel.kernel(bag1,bag2) 34 | return np.mean(innerK[:]) 35 | 36 | 37 | if __name__ == '__main__': 38 | from tools.UnitTests import UnitTests 39 | UnitTests.UnitTestBagKernel(LinearBagKernel) -------------------------------------------------------------------------------- /kerpy/LinearKernel.py: -------------------------------------------------------------------------------- 1 | from kerpy.Kernel import Kernel 2 | 3 | class LinearKernel(Kernel): 4 | def __init__(self, is_sparse = False): 5 | Kernel.__init__(self) 6 | self.is_sparse = is_sparse 7 | 8 | def __str__(self): 9 | s=self.__class__.__name__+ "=[" 10 | s += "" + Kernel.__str__(self) 11 | s += "]" 12 | return s 13 | 14 | def kernel(self, X, Y=None): 15 | """ 16 | Computes the linear kernel k(x,y)=x^T y for the given data 17 | X - samples on right hand side 18 | Y - samples on left hand side, can be None in which case it's replaced by X 19 | """ 20 | 21 | if Y is None: 22 | Y = X 23 | if self.is_sparse: 24 | return X.dot(Y.T).todense() 25 | else: 26 | return X.dot(Y.T) 27 | 28 | def gradient(self, x, Y, args_equal=False): 29 | """ 30 | Computes the gradient of the linear kernel k(x,y)=x^T y wrt. the left argument, i.e. \nabla_x k(x,y)=y 31 | x - single sample on right hand side 32 | Y - samples on left hand side 33 | """ 34 | return Y 35 | -------------------------------------------------------------------------------- /kerpy/MaternKernel.py: -------------------------------------------------------------------------------- 1 | from matplotlib.pyplot import show, imshow 2 | from numpy import exp, shape, sqrt, reshape 3 | import numpy as np 4 | from scipy.spatial.distance import squareform, pdist, cdist 5 | 6 | 7 | from kerpy.Kernel import Kernel 8 | from tools.GenericTests import GenericTests 9 | 10 | 11 | class MaternKernel(Kernel): 12 | def __init__(self, width=1.0, nu=1.5, sigma=1.0): 13 | Kernel.__init__(self) 14 | GenericTests.check_type(width,'width',float) 15 | GenericTests.check_type(nu,'nu',float) 16 | GenericTests.check_type(sigma,'sigma',float) 17 | 18 | self.width = width 19 | self.nu = nu 20 | self.sigma = sigma 21 | 22 | def __str__(self): 23 | s=self.__class__.__name__+ "[" 24 | s += "width="+ str(self.width) 25 | s += ", nu="+ str(self.nu) 26 | s += ", sigma="+ str(self.sigma) 27 | s += "]" 28 | return s 29 | 30 | def kernel(self, X, Y=None): 31 | 32 | GenericTests.check_type(X,'X',np.ndarray,2) 33 | # if X=Y, use more efficient pdist call which exploits symmetry 34 | if Y is None: 35 | dists = squareform(pdist(X, 'euclidean')) 36 | else: 37 | GenericTests.check_type(Y,'Y',np.ndarray,2) 38 | assert(shape(X)[1]==shape(Y)[1]) 39 | dists = cdist(X, Y, 'euclidean') 40 | if self.nu==0.5: 41 | #for nu=1/2, Matern class corresponds to Ornstein-Uhlenbeck Process 42 | K = (self.sigma**2.)
* exp( -dists / self.width ) 43 | elif self.nu==1.5: 44 | K = (self.sigma**2.) * (1+ sqrt(3.)*dists / self.width) * exp( -sqrt(3.)*dists / self.width ) 45 | elif self.nu==2.5: 46 | K = (self.sigma**2.) * (1+ sqrt(5.)*dists / self.width + 5.0*(dists**2.) / (3.0*self.width**2.) ) * exp( -sqrt(5.)*dists / self.width ) 47 | else: 48 | raise NotImplementedError() 49 | return K 50 | 51 | def rff_generate(self,m,dim=1): 52 | self.rff_num=m 53 | assert(dim==1) 54 | ##currently works only for dim=1 55 | ##need to check how student spectral density generalizes to multivariate case 56 | assert(self.sigma==1.0) 57 | ##the scale parameter should be one 58 | if self.nu==0.5 or self.nu==1.5 or self.nu==2.5: 59 | df = self.nu*2 60 | self.unit_rff_freq=np.random.standard_t(df,size=(int(m/2),dim)) 61 | self.rff_freq=self.unit_rff_freq/self.width 62 | else: 63 | raise NotImplementedError() 64 | 65 | def gradient(self, x, Y): 66 | assert(len(shape(x))==1) 67 | assert(len(shape(Y))==2) 68 | assert(len(x)==shape(Y)[1]) 69 | 70 | if self.nu==1.5 or self.nu==2.5: 71 | x_2d=reshape(x, (1, len(x))) 72 | lower_order_width = self.width * sqrt(2*(self.nu-1)) / sqrt(2*self.nu) 73 | lower_order_kernel = MaternKernel(lower_order_width,self.nu-1,self.sigma) 74 | k = lower_order_kernel.kernel(x_2d, Y) 75 | differences = Y - x 76 | G = ( 1.0 / lower_order_width ** 2 ) * (k.T * differences) 77 | return G 78 | else: 79 | raise NotImplementedError() 80 | 81 | if __name__ == '__main__': 82 | from tools.UnitTests import UnitTests 83 | UnitTests.UnitTestDefaultKernel(MaternKernel) 84 | kernel=MaternKernel(width=2.0) 85 | x=np.random.rand(10,1) 86 | y=np.random.rand(15,1) 87 | K=kernel.kernel(x,y) 88 | kernel.rff_generate(50000) 89 | phix=kernel.rff_expand(x) 90 | phiy=kernel.rff_expand(y) 91 | Khat=phix.dot(phiy.T) 92 | print(np.linalg.norm(K-Khat)) 93 | -------------------------------------------------------------------------------- /kerpy/PolynomialKernel.py: -------------------------------------------------------------------------------- 1 | from numpy import array 2 | 3 | from kerpy.Kernel import Kernel 4 | 5 | 6 | class PolynomialKernel(Kernel): 7 | def __init__(self, degree,theta=1.0): 8 | Kernel.__init__(self) 9 | self.degree = degree 10 | self.theta = theta 11 | 12 | def __str__(self): 13 | s=self.__class__.__name__+ "=[" 14 | s += "degree="+ str(self.degree) 15 | s += ", " + Kernel.__str__(self) 16 | s += "]" 17 | return s 18 | 19 | def kernel(self, X, Y=None): 20 | """ 21 | Computes the polynomial kernel k(x,y)=(theta + x^T y)^degree for the given data 22 | X - samples on right hand side 23 | Y - samples on left hand side, can be None in which case it's replaced by X 24 | """ 25 | if Y is None: 26 | Y = X 27 | 28 | return pow(self.theta+X.dot(Y.T), self.degree) 29 | 30 | def gradient(self, x, Y): 31 | """ 32 | Computes the gradient of the Polynomial kernel wrt. the left argument, i.e.
\nabla_x k(x,y)=\nabla_x (theta+x^Ty)^d = d(theta+x^Ty)^(d-1) y 34 | 35 | x - single sample on right hand side (1D vector) 36 | Y - samples on left hand side (2D matrix) 37 | """ 38 | assert(len(x.shape)==1) 39 | assert(len(Y.shape)==2) 40 | assert(len(x)==Y.shape[1]) 41 | 42 | return self.degree*pow(self.theta+x.dot(Y.T), self.degree-1)*Y 43 | -------------------------------------------------------------------------------- /kerpy/ProductKernel.py: -------------------------------------------------------------------------------- 1 | from kerpy.Kernel import Kernel 2 | import numpy as np 3 | 4 | 5 | class ProductKernel(Kernel): 6 | def __init__(self, list_of_kernels): 7 | Kernel.__init__(self) 8 | self.list_of_kernels = list_of_kernels 9 | 10 | def __str__(self): 11 | s=self.__class__.__name__+ "=[" 12 | s += ", " + Kernel.__str__(self) 13 | s += "]" 14 | return s 15 | 16 | def kernel(self, X, Y=None): 17 | return np.prod([individual_kernel.kernel(X,Y) for individual_kernel in self.list_of_kernels],0) -------------------------------------------------------------------------------- /kerpy/SumKernel.py: -------------------------------------------------------------------------------- 1 | from kerpy.Kernel import Kernel 2 | import numpy as np 3 | 4 | 5 | class SumKernel(Kernel): 6 | def __init__(self, list_of_kernels): 7 | Kernel.__init__(self) 8 | self.list_of_kernels = list_of_kernels 9 | 10 | def __str__(self): 11 | s=self.__class__.__name__+ "=[" 12 | s += ", " + Kernel.__str__(self) 13 | s += "]" 14 | return s 15 | 16 | def kernel(self, X, Y=None): 17 | return np.sum([individual_kernel.kernel(X,Y) for individual_kernel in self.list_of_kernels],0) -------------------------------------------------------------------------------- /kerpy/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxcsml/kerpy/50b175961d13e0e1f625aa987ae41cb98bfe4d84/kerpy/__init__.py -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | setup( 4 | name='kerpy', 5 | version='0.1.0', 6 | url='https://github.com/oxmlcs/kerpy', 7 | packages=find_packages(), 8 | description='Code for kernel methods', 9 | license='MIT' 10 | ) 11 | -------------------------------------------------------------------------------- /tools/GenericTests.py: -------------------------------------------------------------------------------- 1 | class GenericTests(): 2 | @staticmethod 3 | def check_type(varvalue, varname, vartype, required_shapelen=None): 4 | if not type(varvalue) is vartype: 5 | raise TypeError("Variable " + varname + " must be of type " + vartype.__name__ + \ 6 | ".
Given is " + str(type(varvalue))) 7 | if not required_shapelen is None: 8 | if not len(varvalue.shape) is required_shapelen: 9 | raise ValueError("Variable " + varname + " must be " + str(required_shapelen) + "-dimensional") 10 | return 0 11 | -------------------------------------------------------------------------------- /tools/ProcessingObject.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | ''' 3 | Class containing some helper functions, i.e., argument parsing 4 | ''' 5 | 6 | from kerpy.LinearKernel import LinearKernel 7 | from kerpy.GaussianKernel import GaussianKernel 8 | 9 | class ProcessingObject(object): 10 | def __init__(self): 11 | ''' 12 | Constructor 13 | ''' 14 | @staticmethod 15 | def parse_arguments(): 16 | parser = argparse.ArgumentParser() 17 | parser.add_argument("num_samples", type=int,\ 18 | help="total # of samples") 19 | parser.add_argument("--num_rfx", type=int,\ 20 | help="number of random features of the data X", 21 | default=30) 22 | parser.add_argument("--num_rfy", type=int,\ 23 | help="number of random features of the data Y", 24 | default=30) 25 | parser.add_argument("--num_inducex", type=int,\ 26 | help="number of inducing variables of the data X", 27 | default=30) 28 | parser.add_argument("--num_inducey", type=int,\ 29 | help="number of inducing variables of the data Y", 30 | default=30) 31 | parser.add_argument("--num_shuffles", type=int,\ 32 | help="number of shuffles", 33 | default=800) 34 | parser.add_argument("--blocksize", type=int,\ 35 | help="# of samples per block (includes X and Y) when using a block-based test", 36 | default=20) 37 | parser.add_argument("--dimX", type=int,\ 38 | help="dimensionality of the data X", 39 | default=3) 40 | parser.add_argument("--dimZ", type=int,\ 41 | help="dimensionality of the data Z (i.e. the conditioning variable)", 42 | default=7) 43 | parser.add_argument("--kernelX", const = LinearKernel(), default = GaussianKernel(1.), \ 44 | action='store_const', \ 45 | help="Linear kernel (Default GaussianKernel(1.))?") 46 | parser.add_argument("--kernelY", const = LinearKernel(), default = GaussianKernel(1.), \ 47 | action='store_const', \ 48 | help="Linear kernel (Default GaussianKernel(1.))?") 49 | parser.add_argument("--kernelX_use_median", action="store_true",\ 50 | help="should median heuristic be used for X?", 51 | default=False) 52 | parser.add_argument("--kernelY_use_median", action="store_true",\ 53 | help="should median heuristic be used for Y?", 54 | default=False) 55 | 56 | parser.add_argument("--kernelRxz", const = GaussianKernel(1.), default = LinearKernel(), \ 57 | action='store_const', \ 58 | help="Gaussian kernel(1.) 
(Default LinearKernel)?") 59 | parser.add_argument("--kernelRyz", const = GaussianKernel(1.), default = LinearKernel(), \ 60 | action='store_const', \ 61 | help="Gaussian kernel(1.) (Default LinearKernel)?") 62 | parser.add_argument("--kernelRxz_use_median", action="store_true",\ 63 | help="should median heuristic be used for residuals Rxz?", 64 | default=False) 65 | parser.add_argument("--kernelRyz_use_median", action="store_true",\ 66 | help="should median heuristic be used for residuals Ryz?", 67 | default=False) 68 | 69 | parser.add_argument("--RESIT_type", action="store_true",\ 70 | help="Conditional Testing using RESIT?",\ 71 | default=False) 72 | parser.add_argument("--optimise_lambda_only", action="store_false",\ 73 | help="Optimise lambdas only?",\ 74 | default=True) 75 | parser.add_argument("--grid_search", action="store_false",\ 76 | help="Optimise hyperparameters through grid search?",\ 77 | default=True) 78 | parser.add_argument("--GD_optimise", action="store_true",\ 79 | help="Optimise hyperparameters through gradient descent?",\ 80 | default=False) 81 | 82 | parser.add_argument("--results_filename",type=str,\ 83 | help = "name of the file to save results?",\ 84 | default = "testing") 85 | parser.add_argument("--figure_filename",type=str,\ 86 | help = "name of the file to save the causal graph?",\ 87 | default = "testing") 88 | parser.add_argument("--data_filename",type=str,\ 89 | help = "name of the file to load data from?",\ 90 | default = "testing") 91 | 92 | #parser.add_argument("--dimY", type=int,\ 93 | # help="dimensionality of the data Y", 94 | # default=3) 95 | parser.add_argument("--hypothesis", type=str,\ 96 | help="is null or alternative true in this experiment? [null, alter]",\ 97 | default="alter") 98 | parser.add_argument("--nullvarmethod", type=str,\ 99 | help="how to estimate asymptotic variance under null? [direct, permutation, across]?",\ 100 | default="direct") 101 | parser.add_argument("--streaming", action="store_true",\ 102 | help="should data be streamed (rather than all loaded into memory)?",\ 103 | default=False) 104 | parser.add_argument("--rff", action="store_true",\ 105 | help="should random features be used?",\ 106 | default=False) 107 | parser.add_argument("--induce_set", action="store_true",\ 108 | help="should inducing variables be used?",\ 109 | default=False) 110 | args = parser.parse_args() 111 | return args -------------------------------------------------------------------------------- /tools/UnitTests.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from kerpy.GaussianKernel import GaussianKernel 3 | 4 | class UnitTests(): 5 | 6 | @staticmethod 7 | def UnitTestDefaultKernel(which_kernel): 8 | dim = 5 9 | nx = 20 10 | X=np.random.randn(nx,dim) 11 | kernel = which_kernel() 12 | kernel.show_kernel_matrix(X) 13 | print '...successfully visualised kernel matrix.' 14 | response_y=X[:,1]**2+np.random.randn(nx) 15 | kernel.ridge_regress(X,response_y) 16 | print '...successfully ran ridge regression.'
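# Usage note for UnitTestDefaultKernel: any Kernel subclass whose constructor has default arguments can be smoke-tested this way, as the kernel modules do from their own __main__ blocks, e.g. in MaternKernel.py:
#   from tools.UnitTests import UnitTests
#   UnitTests.UnitTestDefaultKernel(MaternKernel)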
17 | 18 | @staticmethod 19 | def UnitTestBagKernel(which_bag_kernel): 20 | num_bagsX = 20 21 | num_bagsY = 30 22 | shift = 2.0 23 | dim = 3 24 | bagsize = 50 25 | qvar = 0.6 26 | baglistx = list() 27 | baglisty = list() 28 | for _ in range(num_bagsX): 29 | muX = np.sqrt(qvar) * np.random.randn(1, dim) 30 | baglistx.append(muX + np.sqrt(1 - qvar) * np.random.randn(bagsize, dim)) 31 | for _ in range(num_bagsY): 32 | muY = np.sqrt(qvar) * np.random.randn(1, dim) 33 | muY[:, 0] = muY[:, 0] + shift 34 | baglisty.append(muY + np.sqrt(1 - qvar) * np.random.randn(bagsize, dim)) 35 | data_kernel = GaussianKernel(1.0) 36 | bag_kernel = which_bag_kernel(data_kernel) 37 | bag_kernel.show_kernel_matrix(baglistx + baglisty) 38 | print '...successfully visualised kernel matrix on bags.' 39 | bag_kernel.rff_generate(dim=dim) 40 | bagmmd = bag_kernel.estimateMMD_rff(baglistx, baglisty) 41 | print '...successfully computed rff mmd on bags; value: ', bagmmd 42 | response_y=np.random.randn(num_bagsX) 43 | bag_kernel.ridge_regress_rff(baglistx,response_y) 44 | print '...successfully ran rff ridge regression on bags.' 45 | print 'unit test ran for ', bag_kernel.__str__() 46 | -------------------------------------------------------------------------------- /tools/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxcsml/kerpy/50b175961d13e0e1f625aa987ae41cb98bfe4d84/tools/__init__.py -------------------------------------------------------------------------------- /tools/read_and_plot_test_results.py: -------------------------------------------------------------------------------- 1 | from matplotlib.pyplot import figure, gca, errorbar, grid, show, legend, xlabel, \ 2 | ylabel, ylim 3 | from numpy import sqrt, argsort, asarray 4 | import os 5 | from pickle import load 6 | import sys 7 | 8 | import matplotlib as mpl 9 | 10 | 11 | mpl.rcParams['text.usetex']=True 12 | mpl.rcParams['text.latex.unicode']=True 13 | 14 | 15 | load_filename = "results.bin" 16 | #print sys.argv[1:] 17 | folders_greps = [x.replace('results/','') for x in sys.argv[1:]] 18 | #print folders_greps 19 | #if len(sys.argv)==1: 20 | # sys.argv[1:] = raw_input('Read null or alter: ').split() 21 | which_case = 'alter' 22 | 23 | lsdirs = os.listdir('results/') 24 | print lsdirs 25 | figure() 26 | ax=gca() 27 | legend_str=list() 28 | for folders_grep in folders_greps: 29 | counter=list() 30 | numTrials=list() 31 | rate=list() 32 | stder=list() 33 | num_samples=list() 34 | ii=0 35 | for lsdir in lsdirs: 36 | if os.path.isdir('results/'+lsdir) and lsdir.startswith(folders_grep): 37 | os.chdir('results/'+lsdir) 38 | load_f = open(load_filename,"r") 39 | [counter_current, numTrials_current, param,_,_] = load(load_f) 40 | print 'reading ' + lsdir + ' -found ' + str(numTrials_current) + ' trials' 41 | num_samples.append( param['num_samplesX'] ) 42 | numTrials.append(numTrials_current) 43 | counter.append(counter_current) 44 | rate.append( counter[ii]/float(numTrials[ii]) ) 45 | stder.append( 1.96*sqrt( rate[ii]*(1-rate[ii]) / float(numTrials[ii]) ) ) 46 | print "Rejection rate: %.3f +- %.3f (%d / %d)" % (rate[ii], stder[ii], counter[ii], numTrials[ii]) 47 | os.chdir('../..') 48 | ii+=1 49 | #stat_test sizes may not be ordered 50 | legend_str.append(param['name'].split('_')[0]) 51 | #legend_str.append(param['name']) 52 | order = argsort(num_samples) 53 | 54 | 55 | errorbar(asarray(num_samples)[order],\ 56 | asarray(rate)[order],\ 57 | yerr=asarray(stder)[order]) 58 | 59 | 
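# Note on the error bars drawn above: stder is the half-width of a 95% normal-approximation (Wald) confidence interval for the rejection rate, i.e. 1.96 * sqrt( rate*(1-rate) / numTrials ).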
legend(legend_str,loc=4) 60 | xlabel("number of samples",fontsize=12) 61 | if which_case=='null': 62 | ylim([0,0.12]) 63 | ax.set_yticks([0.01, 0.03, 0.05, 0.07, 0.09, 0.11]) 64 | ylabel("rejection rate (Type I error)",fontsize=12) 65 | elif which_case=='alter': 66 | #ylabel("rejection rate (1-Type II error)",fontsize=12) 67 | ylabel("rejection rate",fontsize=12) 68 | ax.set_xscale("log",basex=2) 69 | grid() 70 | show() 71 | -------------------------------------------------------------------------------- /tools/read_test_results.py: -------------------------------------------------------------------------------- 1 | import os,sys 2 | BASE_DIR = os.path.join( os.path.dirname( __file__ ), '..' ) 3 | sys.path.append(BASE_DIR) 4 | 5 | from numpy import sqrt 6 | import os 7 | from pickle import load 8 | import sys 9 | 10 | 11 | os.chdir(sys.argv[1]) 12 | load_filename = "results.bin" 13 | load_f = open(load_filename,"r") 14 | [counter, numTrials, param, average_time, pvalues] = load(load_f) 15 | load_f.close() 16 | 17 | rate = counter/float(numTrials) 18 | stder = 1.96*sqrt( rate*(1-rate) / float(numTrials) ) 19 | '''this stder is symmetrical in terms of rate''' 20 | 21 | print "Parameters:" 22 | for keys,values in param.items(): 23 | print(keys) 24 | print(values) 25 | print "Rejection rate: %.3f +- %.3f (%d / %d)" % (rate, stder, counter, numTrials) 26 | print "Average test time: %.5f sec" % average_time 27 | os.chdir('..') 28 | #Minor: need to change the above for Gaussian Kernel Median Heuristic 29 | 30 | -------------------------------------------------------------------------------- /weak_conditional_independence_testing/Ozone_prewhiten.csv: -------------------------------------------------------------------------------- 1 | "","Ozone","Temp","InvHt","Pres","Vis","Hgt","Hum","InvTmp","Wind" 2 | "1",-0.093145,-0.097772,0.11934,-0.054445,0.79649,-0.003748,-0.10658,-0.05296,-0.1364 3 | "2",0.017124,-0.019049,-0.20739,-0.13161,-0.26912,-0.070705,-0.043278,0.025266,-0.60218 4 | "3",-0.016349,0.26796,-0.0054871,0.16532,-0.33646,0.094017,-0.00088725,0.026799,-0.61121 5 | "4",0.045921,-0.29085,-0.06006,0.10718,-0.32905,-0.10617,0.28116,-0.021376,-0.14299 6 | "5",-0.049687,-0.0092208,0.16081,-0.16119,-0.23511,0.081014,-0.26058,-0.047845,0.79053 7 | "6",-0.053726,0.24389,-0.24056,-0.13393,0.91119,0.10203,-0.20197,0.11335,-0.63392 8 | "7",0.017516,-0.17027,0.01662,0.20012,-0.22723,-0.10208,0.26964,-0.037665,-0.63734 9 | "8",0.065979,-0.088185,-0.056292,-0.01744,-0.081539,-0.079742,0.083908,-0.032508,-0.65003 10 | "9",-0.092817,0.12984,0.44579,-0.11491,-0.058096,0.11828,-0.21664,-0.074408,1.7013 11 | "10",0.040506,-0.020616,-0.35255,0.20127,0.2584,-0.088947,0.095655,0.066674,-0.67176 12 | "11",-0.0058906,-0.041487,-0.2795,-0.14027,-0.81997,0.0086807,-0.00039355,0.046488,0.73755 13 | "12",-0.039191,-0.017412,0.4783,-0.00093938,0.33832,-0.030532,-0.086534,-0.13117,0.73096 14 | "13",-0.063422,0.009324,-0.40386,-0.012208,0.52485,0.020466,-0.033456,0.14501,-0.6847 15 | "14",0.073353,0.0066984,0.26671,-0.13545,0.080308,0.091766,-0.041759,-0.032971,-1.1585 16 | "15",-0.092215,0.031637,0.31154,-0.03473,0.26975,0.051543,-0.19031,-0.027275,0.24467 17 | "16",0.15615,0.18368,-0.29293,0.28441,0.13441,0.033471,0.12352,6.2296e-05,-0.23528 18 | "17",-0.074148,-0.1671,-0.33887,-0.19295,-0.96298,-0.16004,0.29601,0.095371,0.22343 19 | "18",-0.061865,-0.014602,0.3029,-0.06044,-0.0019414,-0.010194,-0.24038,-0.060201,-0.25578 20 | "19",-0.00168,-0.055533,0.098344,-0.091167,0.67863,0.081056,-0.085277,-0.020209,-0.74305 21 
| "20",-0.030806,-0.040905,0.16597,-0.038903,0.21668,0.010897,-0.1314,-0.0058436,-0.28215 22 | "21",-0.01705,0.29397,-0.10401,0.11788,-0.024424,0.011319,0.093732,0.023869,0.16825 23 | "22",-0.035197,-0.18858,0.2303,0.116,0.15788,-0.099767,0.096,-0.087188,0.15311 24 | "23",0.085781,-0.046949,-0.29105,-0.040372,-0.44213,-0.057153,0.076567,0.029869,-0.80677 25 | "24",-0.050139,0.0060219,-0.11092,-0.14669,-0.22533,0.09815,-0.14795,0.049668,0.099892 26 | "25",-0.10213,0.044801,0.060901,0.041121,0.29486,0.021649,-0.080151,-0.021589,0.059365 27 | "26",0.12932,0.0027747,-0.07598,0.042492,0.11341,-0.037602,0.015296,-0.043092,-0.90687 28 | "27",0.13751,0.040503,-0.031541,-0.20589,-0.51025,-0.051945,0.20176,0.062748,-1.3958 29 | "28",-0.23697,-0.028659,-0.09363,0.046094,0.49778,0.13725,-0.45266,0.049707,0.021279 30 | "29",0.13907,0.12582,-0.18156,-0.11365,-0.50027,0.074295,0.24876,0.041351,1.4025 31 | "30",0.076123,0.098252,0.1105,0.18489,-0.36071,-0.0042622,-0.02582,-0.032836,-0.98719 32 | "31",-0.16406,-0.14923,0.1686,0.16194,-0.06017,0.092023,0.087884,-0.042206,-1.0162 33 | "32",-0.014058,-0.089128,0.24406,-0.070679,0.56292,-0.1635,-0.037193,-0.10009,0.36646 34 | "33",0.0021554,-0.097518,-0.41413,0.013574,-1.0666,-0.24585,-0.050606,0.093285,0.80758 35 | "34",-0.050037,-0.00040655,0.29045,-0.019683,1.1317,0.0058602,0.034025,-0.06687,1.7184 36 | "35",-0.0011941,0.1087,-0.07885,-0.17815,-0.15424,0.23978,0.048918,0.067657,0.75704 37 | "36",-0.034264,-0.091675,0.11895,0.21793,0.62888,-0.19389,0.0086469,-0.03978,2.628 38 | "37",-0.032957,-0.028242,0.067776,-0.0084405,0.15234,-0.087069,0.16099,-0.049816,2.1571 39 | "38",0.0025582,0.00082486,0.074259,-0.15714,0.26035,0.171,-0.26202,0.036384,-1.1395 40 | "39",0.079801,0.041585,0.1839,-0.039508,0.49467,-0.017501,-0.082929,-0.036024,-0.18771 41 | "40",8.718e-05,0.18752,-0.27632,0.0059564,-0.032088,0.074521,0.26786,0.12188,-0.65691 42 | "41",-0.047864,-0.14743,0.23937,0.12019,-0.73364,-0.02637,-0.090461,-0.1152,-0.18185 43 | "42",-0.12172,-0.10099,0.17588,0.035541,0.58835,-0.181,-0.017993,-0.042117,-0.18697 44 | "43",-0.014571,0.1002,-0.27851,0.020932,0.1202,0.077506,0.0092277,0.041196,-0.1794 45 | "44",0.059791,-0.11546,0.18181,-0.12482,0.34948,0.062502,-0.028147,0.0078783,-0.1899 46 | "45",0.12983,0.1778,-0.3087,0.13924,-1.0052,0.053434,0.20653,0.05401,-1.1327 47 | "46",-0.055625,-0.13049,0.19026,-0.03122,0.57716,-0.062229,-0.14956,-0.055632,1.2089 48 | "47",-0.022346,-0.038294,0.07451,-0.12161,0.31238,-0.022082,-0.048036,-0.023586,-1.1542 49 | "48",-0.04233,0.045813,0.06368,-0.12719,0.25664,0.078002,-0.045707,0.025562,2.6133 50 | "49",-0.065902,0.044751,0.081771,0.086045,0.32132,0.0022497,-0.22112,0.027664,0.7319 51 | "50",-0.063293,-0.11601,0.25574,0.075749,0.60925,-0.14511,0.22617,-0.12263,-0.2026 52 | "51",-0.034857,0.15814,-0.222,0.038138,-0.3117,0.054804,0.038702,0.053264,0.2793 53 | "52",-0.012506,-0.032088,-0.029836,-0.036051,0.11993,0.031241,-0.01442,0.020946,-1.1312 54 | "53",0.31635,-0.002521,-0.058239,-0.060752,-0.18149,0.013613,-0.076768,0.022002,-1.1307 55 | "54",0.11234,0.090624,-0.25568,-0.16512,-0.3205,0.081899,0.010071,0.089579,-1.1298 56 | "55",-0.16637,-0.066545,0.1638,0.21111,-0.32659,-0.0030047,0.0051735,-0.065605,0.74801 57 | "56",-0.10841,0.07696,0.12813,0.1074,-0.55055,0.024292,0.15949,0.021917,2.1544 58 | "57",-0.010117,-0.075732,0.069652,-0.046715,1.0779,0.039787,-0.11294,-0.05303,0.74337 59 | "58",-0.02972,-0.1729,0.082002,0.053623,0.1007,-0.31841,-0.019465,-0.038831,2.6468 60 | 
"59",-0.045912,0.013326,0.081975,-0.11551,-0.085929,0.01908,0.059133,-0.020284,-1.079 61 | "60",0.011591,0.0065198,0.069409,-0.067275,0.0021883,0.030953,-0.25321,0.0093028,-1.0599 62 | "61",0.022422,0.10419,0.12596,0.098027,0.068421,0.16254,0.19234,0.02869,0.84518 63 | "62",0.024822,0.0076408,-0.12315,-0.046011,-0.28591,-0.0096443,-0.095046,0.014651,-0.080528 64 | "63",0.028466,-0.050638,0.12864,-0.0065388,-0.046696,-0.12146,-0.021722,-0.0509,-0.065879 65 | "64",-0.071334,0.040425,0.093503,0.056515,0.3449,-0.098401,0.049572,0.0032188,-0.051475 66 | "65",-0.16142,-0.11173,0.29657,0.27503,-0.6028,0.010187,0.019179,-0.1353,1.8629 67 | "66",0.024503,-0.015429,-0.26908,-0.2965,0.77435,0.012226,0.10182,0.10086,-2.3648 68 | "67",0.080117,0.094225,-0.10823,-0.031723,0.82721,0.099077,-0.27023,0.043306,-0.46013 69 | "68",-0.041704,-0.10259,-0.037483,0.14036,0.01783,-0.053029,0.19051,-0.026712,-0.9181 70 | "69",0.16638,0.050003,-0.057319,-0.10713,-0.77975,0.035471,-0.0070737,-0.013039,0.034951 71 | "70",-0.35066,-0.0092071,-0.052745,-0.073539,-0.14683,0.055523,-0.13755,0.058131,-1.3758 72 | "71",0.47948,0.2394,-0.10328,-0.089471,0.021991,0.035277,-0.0090086,0.031265,-0.90199 73 | "72",-0.068766,0.071209,0.079,0.29574,-0.60461,-0.051623,0.22307,-0.052136,-0.43742 74 | "73",-0.15203,-0.42815,-0.072298,0.10745,0.20125,-0.072412,0.051643,-0.069018,-0.44597 75 | "74",0.045964,0.04773,-0.060592,-0.28393,-0.24805,0.036297,-0.17559,0.028971,0.01664 76 | "75",-0.060333,0.030715,-0.069216,-0.036279,0.89885,-0.0039374,-0.11585,0.028432,0.005654 77 | "76",-0.014923,0.20888,-0.078439,-0.1095,-0.088595,0.030597,-0.14499,0.092448,-0.47649 78 | "77",-0.014596,-0.013589,-0.052194,0.25603,0.54936,-0.026025,0.23072,-0.064101,-0.016075 79 | "78",0.13876,-0.21229,-0.22521,-0.024601,-0.53925,0.028651,0.21482,0.01793,-0.5153 80 | "79",-0.11017,0.25857,0.27456,-0.20265,-0.15744,0.098819,-0.3388,-0.020753,-1.0053 81 | "80",0.085521,-0.15495,0.27692,0.14405,0.1402,-0.15868,0.072422,-0.048075,0.39087 82 | "81",-0.11865,-0.21702,-0.24543,0.14007,-0.010956,-0.052428,-0.1016,0.021031,-0.57365 83 | "82",-0.026367,0.167,-0.22674,-0.10969,0.40872,0.030826,0.21113,0.007197,0.81857 84 | "83",0.099509,0.054983,0.48251,-0.18133,-0.53541,0.12493,-0.11581,-0.034897,0.33472 85 | "84",0.25097,0.37542,-0.40587,-0.03867,1.4893,0.42857,-0.1479,0.33351,-1.1017 86 | "85",-0.24833,-0.38537,0.11079,0.18531,-0.77878,-0.51743,0.053463,-0.27943,0.31543 87 | "86",0.048214,0.31249,-0.26587,-0.090829,-0.58256,0.10126,0.10305,0.086241,-0.17599 88 | "87",-0.069842,-0.281,0.2757,0.19774,0.23944,0.12812,0.0031289,-0.085679,0.27173 89 | "88",-0.05822,-0.062762,0.0856,-0.12041,0.058442,-0.35607,0.076165,-0.049382,0.72384 90 | "89",-0.11426,-0.15402,0.078237,0.10504,-0.15337,0.030235,-0.054349,-0.040479,-0.23653 91 | "90",0.035341,0.043391,0.20195,-0.0080218,0.52716,-0.026813,-0.069955,-0.010474,0.21558 92 | "91",0.21773,0.28573,-0.33044,-0.091896,0.49828,0.24704,0.051797,0.13183,0.19971 93 | "92",-0.22643,-0.31745,0.22521,0.13559,0.3441,-0.20982,0.029553,-0.1318,0.65866 94 | "93",0.075865,0.2237,0.28803,-0.14259,-0.47663,0.049918,-0.045934,0.014562,0.64718 95 | "94",0.069631,0.014134,-0.52716,0.083878,-0.15791,0.070876,-0.04692,0.11623,0.16651 96 | "95",-0.020627,0.0059018,0.2738,0.0089682,0.38834,-0.018585,0.14204,-0.06389,0.63082 97 | "96",-0.043266,-0.03921,0.096835,0.1064,-0.083609,0.086126,0.039782,-0.017706,-0.78996 98 | "97",-0.078703,-0.16264,0.18146,-0.11855,-0.7139,-0.19301,-0.13234,-0.086961,-0.3232 99 | 
"98",0.11007,0.17518,-0.26995,-0.17498,1.1216,-0.10793,-0.059387,0.14697,0.60787 100 | "99",-0.10509,-0.013195,0.18142,0.40487,-0.83912,0.29492,0.2513,-0.092083,-0.34273 101 | "100",-0.072208,-0.24302,0.096515,-0.13686,0.94605,-0.38796,-0.16891,-0.034112,2.482 102 | "101",-0.03857,0.079007,0.27095,-0.032518,0.16794,0.16518,-0.12899,-0.048616,0.13208 103 | "102",-0.009673,-0.1007,-0.20877,0.07521,-0.21482,-0.042125,0.14613,0.0043484,0.13648 104 | "103",-0.015564,0.14738,-0.10084,-0.11126,-0.17577,0.067726,-0.072171,0.060275,-0.80363 105 | "104",0.27272,0.023141,-0.095037,-0.086661,0.12798,0.12277,-0.01659,0.052855,-0.7968 106 | "105",0.10727,0.12882,-0.20073,0.020061,-0.36533,-0.013395,0.090172,0.031087,-0.33077 107 | "106",-0.27811,-0.14606,0.19235,0.20702,0.17821,-0.13339,-0.06837,-0.095174,0.13745 108 | "107",-0.1363,0.025627,0.18962,-0.065819,0.004597,-0.025737,-0.019695,-0.031304,-0.33541 109 | "108",0.21446,0.23089,-0.13172,-0.041417,0.13053,0.11772,0.098928,0.072549,0.60519 110 | "109",0.14252,-0.22157,-0.081383,0.17454,-0.41553,0.040126,0.12091,0.035412,0.60739 111 | "110",-0.012659,-0.044224,-0.27672,-0.25266,-0.37885,0.022023,-0.16527,0.058801,-0.33858 112 | "111",-0.22011,0.028115,0.30199,0.092061,0.29017,-0.13587,0.0038974,-0.10976,1.5556 113 | "112",-0.16034,-0.1283,0.2919,0.085114,0.26126,-0.075465,-0.069364,-0.12937,0.14209 114 | "113",0.14171,0.028544,-0.22311,0.024762,-0.14993,0.0054149,0.083447,0.04066,-0.31612 115 | "114",-0.022637,0.091403,-0.084924,-0.11547,-0.1293,0.12291,-0.047067,0.096333,-0.78215 116 | "115",0.42051,0.10346,-0.11233,-0.055349,-0.022063,0.047044,-0.050401,0.043787,-0.77824 117 | "116",-0.1556,-0.025769,0.086764,0.072457,0.18352,-0.027061,0.1097,-0.034489,1.1081 118 | "117",6.3438e-05,0.11728,-0.25179,0.029124,-0.046895,-0.029639,-0.0054905,0.030866,-0.77092 119 | "118",-0.32598,-0.38485,0.40178,0.18662,-0.039672,0.011423,0.072927,-0.12761,0.18018 120 | "119",0.070156,0.098291,-0.15512,-0.14732,0.16113,-0.11842,-0.070327,0.02839,1.1208 121 | "120",-0.00096457,0.08132,-0.19229,-0.029754,0.088827,0.073647,0.023315,0.014157,0.18457 122 | "121",0.1507,-0.024158,0.053973,0.0099844,-0.21279,-0.03171,0.033068,0.022262,-1.2199 123 | "122",-0.14412,0.045205,-0.030515,0.07492,-0.0073608,-0.046674,0.025365,-0.018576,0.67379 124 | "123",0.058549,-0.054412,-0.03565,0.021603,0.0073847,0.026153,-0.022628,-0.037163,0.20337 125 | "124",0.2897,0.033797,-0.10011,-0.096226,-0.1287,0.059913,0.0066319,0.073549,-1.2011 126 | "125",-0.17385,0.11107,-0.055595,-0.10602,-0.20483,0.067093,0.1066,0.040909,0.21655 127 | "126",0.33715,-0.064654,-0.053199,0.12789,-0.22108,-0.003913,0.01663,0.024045,-0.72868 128 | "127",-0.2685,0.042126,0.038965,0.0017252,-0.084989,-0.019464,-0.08282,-0.047408,0.21338 129 | "128",0.23153,0.016357,-0.028403,0.034063,-0.15146,0.05728,0.068069,-0.0077306,0.68258 130 | "129",-0.049287,0.058082,-0.15434,-0.015044,-0.06034,-0.034135,0.042461,0.058663,-1.204 131 | "130",-0.052534,-0.08612,0.11421,0.035174,-0.10175,-0.051681,-0.036196,-0.041309,0.68527 132 | "131",-0.14403,-0.02851,0.031018,-0.015835,0.14606,-0.0049731,0.012834,-0.017133,1.1589 133 | "132",0.1608,0.055533,-0.082125,0.022227,-0.11462,0.039258,-0.046572,0.022777,-0.24776 134 | "133",-0.16049,-0.11446,0.057265,0.062626,-0.12326,-0.080834,-0.015444,-0.070646,1.6503 135 | "134",-0.32425,-0.076972,0.20349,0.071753,0.56977,-0.015292,0.054639,-0.02537,-0.22408 136 | "135",0.59117,0.19889,-0.13427,-0.19257,0.029151,0.053167,-0.044696,0.086025,-1.1554 137 | 
"136",0.17697,0.09588,-0.11125,0.064482,-0.45767,0.068228,0.071516,0.050011,0.74215 138 | "137",-0.31726,-0.11,0.22428,0.095345,-0.14771,0.027109,0.025512,-0.049299,-1.1459 139 | "138",-0.18165,-0.0028762,-0.24712,0.0042936,0.016767,-0.12977,0.004065,0.010222,0.2793 140 | "139",0.0072111,-0.047464,0.22005,0.097062,0.12214,-0.039559,-0.0631,-0.089615,-1.131 141 | "140",-0.39475,-0.21603,0.44358,0.24456,-0.074809,-0.095287,0.22561,-0.18593,-0.65813 142 | "141",0.47902,0.19895,-0.55482,-0.39939,0.1772,0.18851,-0.22745,0.24444,-0.65813 143 | "142",0.020357,0.028639,-0.10355,0.0019812,-0.21775,0.042304,0.17532,0.038137,1.2265 144 | "143",0.11467,0.040319,0.091645,0.098576,-0.11356,-0.035415,-0.042529,-0.015461,0.28174 145 | "144",0.0079812,-0.064984,0.13813,0.0028263,0.0090637,-0.047188,-0.043902,-0.0096151,0.27905 146 | "145",0.035329,0.048074,-0.14307,-0.046537,-0.065672,0.030135,0.052995,-0.0010438,0.28052 147 | "146",-0.056598,0.075065,-0.073427,0.045831,0.047391,0.10355,-0.06582,0.0085755,0.75826 148 | "147",0.0073665,-0.01288,0.071091,0.089519,-0.0087504,-0.035791,0.040256,0.016631,1.2392 149 | "148",-0.0086907,0.053241,-0.13084,-0.10824,-0.071374,-0.0094341,0.01752,0.04414,0.30811 150 | "149",-0.033821,-0.079305,0.18589,0.115,0.13794,-0.026484,-0.091725,-0.067564,-1.091 151 | "150",-0.28331,-0.18248,0.097457,-0.0021163,0.020842,-0.18659,0.16747,-0.069927,1.7511 152 | "151",0.00058725,0.019091,0.27498,0.017701,0.037535,0.037797,-0.045377,-0.014449,-0.11471 153 | "152",0.28564,0.092049,-0.26046,-0.068938,0.088558,0.032435,-0.066728,0.046582,-1.0385 154 | "153",-0.20543,-0.086951,-0.03666,0.20309,0.25506,0.037413,0.034052,-0.03411,-0.084922 155 | "154",0.19424,0.001652,-0.052603,-0.25642,-0.20619,-0.018514,0.041227,0.039084,-0.54192 156 | "155",0.30124,0.15963,-0.077759,-0.15517,-0.2595,0.08274,0.0032884,0.084968,0.41504 157 | "156",-0.20731,-0.014599,0.0060314,0.23262,-0.35628,0.015054,-0.011195,-0.037504,-0.9911 158 | "157",-0.3157,-0.11831,-0.052501,0.066179,0.11428,-0.070359,-0.019451,-0.043833,0.4336 159 | "158",0.1794,0.12113,-0.020284,-0.031237,-0.2062,0.058582,0.00077908,0.02289,-0.50261 160 | "159",0.21346,0.010033,-0.055179,-0.04213,0.0072498,0.059599,0.1082,0.033519,-0.96937 161 | "160",0.032367,-0.0014093,-0.017387,0.040014,0.020451,-0.059513,-0.088387,-0.01872,-0.96839 162 | "161",-0.19319,0.11062,-0.035166,0.11706,0.19962,0.012826,0.06483,-0.02988,-0.50237 163 | "162",0.15771,-0.30618,-0.047085,0.010454,-0.10153,-0.12658,0.10819,-0.027589,-0.036094 164 | "163",-0.14837,0.10855,-0.042375,-0.1126,0.073293,0.081992,-0.16807,0.043933,-0.51555 165 | "164",-0.071706,0.054188,-0.084002,-0.12586,0.08731,0.025385,-0.063798,0.049141,0.89205 166 | "165",0.30607,-0.068869,-0.027968,0.12281,-0.037187,0.013563,0.12013,-0.0095146,-0.52776 167 | "166",-0.031635,0.07083,-0.07233,0.013688,0.038615,-0.026717,-0.088374,-0.025996,0.40845 168 | "167",-0.41137,-0.0013381,-0.044983,-0.073828,-0.32075,0.034871,-0.054491,0.015691,-0.075401 169 | "168",0.4143,0.11359,-0.053327,-0.0037349,0.14283,0.0093257,-0.036331,0.043399,0.86544 170 | "169",0.19027,0.11927,-0.040888,0.023225,0.052194,0.023666,0.16065,0.031205,-0.55412 171 | "170",-0.2176,-0.16605,-0.11601,0.052377,0.1452,-0.020955,-0.052353,-0.025254,-0.084434 172 | "171",-0.022593,-0.023557,0.014519,-0.012698,-0.076637,-0.018117,0.010903,0.0047052,-0.090293 173 | "172",-0.066493,-0.12035,0.083354,0.049014,-0.034916,-0.043339,-0.018307,-0.060135,1.3188 174 | "173",-0.027031,-0.017964,0.071901,-0.010153,0.21072,0.030909,-0.0081673,-0.0152,0.3772 175 | 
"174",-0.011037,0.072128,-0.10612,0.0042904,0.010093,-0.0035332,0.061084,0.037326,-1.0358 176 | "175",0.024231,0.095528,-0.00025697,0.017931,-0.26025,0.016839,-0.060971,-0.011784,0.3772 177 | "176",0.1938,-0.062474,-0.044421,-0.050006,-0.31825,0.026055,0.16031,0.030888,0.3811 178 | "177",0.15025,0.066093,-0.11998,-0.018544,0.13826,-0.014884,-0.090513,0.023929,-0.089073 179 | "178",0.086337,0.053309,-0.096829,0.030305,0.11259,0.033362,0.045715,0.010694,-0.55901 180 | "179",-0.017983,2.8592e-05,0.062208,0.023027,-0.15842,0.040078,0.013099,0.01489,-0.093711 181 | "180",-0.22323,-0.064395,-0.0017049,0.033358,-0.19753,-0.011075,-0.016659,-0.032875,-1.0328 182 | "181",0.054028,0.016408,-0.06186,0.024102,-0.050534,-0.052642,-0.041754,0.015243,-1.0402 183 | "182",-0.087054,-0.052874,0.10839,-0.016792,-0.020483,0.040722,0.0080406,-0.025067,1.7807 184 | "183",0.038272,0.070319,-0.0071286,0.0207,-0.075977,-0.0069046,0.093667,0.022892,0.35816 185 | "184",0.12874,-0.041203,-0.035232,-0.010326,-0.10686,-0.032969,-0.032857,-0.0029486,1.2948 186 | "185",-0.067764,0.035266,0.041625,0.046302,0.10423,0.019299,0.047898,-0.015122,-0.58342 187 | "186",-0.077525,-0.099673,0.045383,0.028594,0.11053,-0.0099649,-0.023772,0.0085014,-0.10714 188 | "187",0.054655,0.04579,-0.01333,-0.02519,0.048041,-0.018276,0.010251,-0.0090937,0.83737 189 | "188",0.085671,0.020403,-0.23623,-0.014336,-0.14982,0.035972,-0.04992,0.032006,0.36987 190 | "189",0.12428,0.065966,0.083817,0.038744,-0.2281,0.011406,0.12375,8.2936e-05,-0.56682 191 | "190",-0.0564,-0.036311,-0.035842,-0.018556,-0.22656,-0.044047,0.033695,-0.018667,-1.0199 192 | "191",0.27659,-0.042435,-0.049934,0.027903,-0.018264,0.055986,-0.013108,0.0083337,-1.0285 193 | "192",-0.22475,0.046335,-0.053801,-0.015718,-0.0015916,0.0075134,-0.030156,0.030427,-0.5573 194 | "193",-0.27146,0.03457,-0.059619,-0.034781,-0.040421,-0.015296,0.035086,0.018341,-0.56438 195 | "194",0.32806,-0.056279,-0.022788,0.017369,-0.070502,0.03693,0.051696,-0.036952,0.36426 196 | "195",0.051115,0.098583,-0.093222,0.028848,-0.13967,0.013283,0.013317,0.037988,1.2997 197 | "196",0.074378,0.039033,-0.022785,-0.14262,-0.23613,0.19145,0.15183,0.062211,0.34473 198 | "197",-0.050446,-0.03013,-0.0050457,0.13735,0.14189,-0.18695,-0.15021,-0.055161,0.34473 199 | "198",-0.14989,-0.027823,0.17111,0.039204,0.42063,-0.055798,-0.06285,-0.0050313,-1.0687 200 | "199",0.035069,-0.072495,-0.0029292,0.036206,-0.015441,0.010512,0.052112,-0.030674,-0.60613 201 | "200",-0.086632,0.027597,-0.08112,-0.002847,-0.044476,-0.024216,-0.0014006,0.0055538,0.76242 202 | "201",0.23726,0.041765,-0.057578,-0.081019,-0.089543,0.073064,0.0051224,0.020204,-0.19454 203 | "202",0.023645,0.06501,-0.095912,0.04753,-0.15896,0.021164,0.11056,0.025038,-0.19332 204 | "203",0.11574,0.13473,-0.11571,0.062571,-0.39175,0.079501,-0.078711,0.027457,0.26953 205 | "204",-0.097359,-0.15583,0.10227,0.0024972,0.11829,-0.049679,-0.034179,-0.011632,0.73873 206 | "205",-0.1778,-0.17774,0.13514,0.096419,0.27067,-0.11269,0.093745,-0.042014,-0.2109 207 | "206",-0.048281,0.12057,0.10525,-0.067999,0.099759,0.075175,0.032007,-0.031335,0.73409 208 | "207",-0.071175,0.0064823,-0.040806,-0.0084468,-0.036263,-0.032402,-0.055188,0.05731,0.73556 209 | "208",0.14962,-0.082511,0.11819,0.073376,0.2763,0.037734,-0.061654,-0.074048,-0.20431 210 | "209",-0.15086,-0.022102,0.25183,-0.080558,0.82253,-0.038157,0.051877,-0.0078553,0.74728 211 | "210",0.078846,0.084221,-0.18055,0.026746,-1.139,-0.063594,0.11422,0.024767,-0.18526 212 | 
"211",0.10967,0.041458,-0.093058,-0.0073796,0.97372,0.087611,-0.12063,0.029649,-0.6508 213 | "212",-0.19463,-0.11632,-0.008644,0.13276,-0.44089,-0.17027,0.01626,-0.034964,-1.1217 214 | "213",0.082612,0.0106,-0.079266,-0.14145,1.1817,0.12512,0.048123,0.015658,-1.119 215 | "214",0.024147,0.10327,-0.039438,0.0054777,-0.8964,0.041171,-0.013409,0.015985,0.75826 216 | "215",0.0074531,0.050156,-0.055253,0.05789,0.039068,0.0022203,0.021957,0.0023085,-0.65642 217 | "216",-0.045667,-0.22473,-0.040122,0.045965,0.19488,-0.042365,0.017425,0.0055079,0.74752 218 | "217",0.139,0.10463,-0.031845,-0.059959,-0.34498,0.011066,-0.019713,-0.029912,0.27466 219 | "218",0.15007,0.014956,-0.026255,0.055928,-0.17352,0.016071,0.074291,0.0077853,-0.66985 220 | "219",-0.22287,0.040657,-0.086995,-0.040602,-0.346,0.00067163,-0.0047946,0.045073,-1.1491 221 | "220",0.54465,0.0077165,-0.066304,-0.10281,0.16216,0.057155,-0.043073,0.03082,-0.21603 222 | "221",0.1205,0.059203,-0.21975,-0.0252,-0.31799,0.24869,-0.26414,0.032091,1.1891 223 | "222",-0.3788,0.059203,0.21486,0.14279,0.18603,-0.22431,0.23926,-0.033115,-0.22799 224 | "223",-0.016749,0.020547,-0.018772,0.034367,0.0048496,-0.0025987,0.084331,-0.032014,1.1757 225 | "224",-0.26083,-0.15512,-0.10404,-0.11607,-0.18879,-0.0070812,0.0026974,0.039645,0.70309 226 | "225",0.54776,0.070047,-0.16875,0.0238,-0.16603,0.022764,-0.036964,0.013599,-0.23605 227 | "226",-0.39254,0.015774,0.12024,0.072981,-0.15909,-0.0089479,0.067887,-0.006697,0.23511 228 | "227",-0.15999,-0.12147,0.1791,-0.014624,0.31428,-0.10351,-0.09186,-0.080389,0.71285 229 | "228",0.4139,0.072698,-0.16298,-0.037084,-0.12392,0.010528,0.13968,0.046802,0.2478 230 | "229",-0.041799,0.13703,-0.15241,-0.042697,0.037053,0.086294,-0.069327,0.068827,-0.69182 231 | "230",-0.38618,-0.25762,0.32607,0.11497,0.36616,-0.14072,0.12402,-0.10165,0.7402 232 | "231",-0.051293,-0.0016189,-0.16424,-0.056127,-0.36193,0.13217,-0.075597,-0.006942,-0.67082 233 | "232",0.56395,0.10883,-0.081997,-0.02039,-0.031338,-0.022246,0.029877,0.057172,-0.1943 234 | "233",0.11425,0.24021,-0.11803,-0.1332,-0.29118,0.052002,0.043799,0.1035,-0.19112 235 | "234",-0.40785,-0.25206,0.2598,0.17248,-0.11945,-0.037953,-0.0088329,-0.11192,0.75509 236 | "235",-0.13313,-0.091614,0.098877,0.006484,0.1829,-0.048092,0.0077638,-0.040673,-1.1159 237 | "236",0.36583,0.10196,-0.11675,-0.035816,0.17556,0.065007,-0.010511,0.032106,-0.17135 238 | "237",-0.32058,0.012495,0.08044,0.014666,-0.081935,0.0013745,0.038746,0.012557,-0.17159 239 | "238",0.36167,-0.01087,0.084809,-0.0032732,-0.097133,-0.038988,0.030003,0.0057152,0.29712 240 | "239",-0.093667,0.098273,-0.012228,0.038095,-0.33382,0.022492,-0.01036,0.012675,0.29419 241 | "240",-0.018504,-0.011049,-0.13692,-0.081914,-0.017178,0.023952,0.061792,0.047031,-0.16939 242 | "241",-0.15086,-0.045086,0.20862,0.069319,0.059315,-0.027023,-0.042001,-0.042879,1.7199 243 | "242",-0.0949,-0.054072,0.3069,0.031686,-0.01443,0.033953,0.06258,-0.11746,1.2563 244 | "243",0.11452,0.064404,-0.46015,-0.011818,0.14832,0.015584,0.014199,0.14159,-0.14693 245 | "244",0.16582,-0.041268,0.017138,-0.050532,0.017573,0.090848,-0.11053,-0.00065559,-0.601 246 | "245",-0.114,0.031921,0.18058,0.04107,-0.4558,-0.12233,0.26109,-0.030915,-0.11788 247 | "246",-0.067724,0.02641,0.074603,-0.040155,0.073397,-0.036299,-0.18657,0.0055985,0.84469 248 | "247",-0.0026152,-0.010453,0.074192,0.050778,0.29955,-0.064035,0.067748,-0.026156,-1.0275 249 | "248",-0.16085,-0.088249,0.17647,0.059634,0.24833,-0.00047829,-0.018329,-0.066726,0.39819 250 | 
"249",-0.051674,-0.10589,0.02929,-0.022063,0.46829,-0.028424,-0.021482,-0.00062112,0.41406 251 | "250",-0.016394,0.14011,-0.16981,-0.018909,-0.32881,0.049897,0.021791,0.047075,-0.51799 252 | "251",0.17401,-0.019399,-0.084646,-0.022836,-0.29937,0.04557,0.023592,0.0059248,-0.97205 253 | "252",0.36467,0.083335,-0.019526,0.062976,-0.10341,0.020401,0.075497,0.0034605,-0.02169 254 | "253",-0.070413,-0.036305,-0.31464,-0.087458,-0.49151,0.0061481,0.19781,0.10709,-0.4821 255 | "254",-0.074349,-0.031422,0.32412,-0.047205,-0.27531,0.0090986,-0.2644,-0.087694,-0.0094827 256 | "255",-0.2545,-0.020537,0.33088,-0.083388,0.92431,0.047534,-0.26326,-0.039133,0.46582 257 | "256",0.18776,0.10859,-0.32115,0.010519,0.61612,-0.018806,0.21201,0.071759,-0.94495 258 | "257",-0.0033965,-0.022145,-0.017056,0.19592,-0.70082,-0.027367,0.03629,-0.056718,0.0080954 259 | "258",-0.45912,-0.041158,-0.097938,-0.15465,-0.20969,-0.0282,-0.012497,0.036435,1.4267 260 | "259",0.50511,0.044673,-0.10411,-0.0182,-0.27125,0.068133,0.099088,0.050503,0.026162 261 | "260",0.047726,0.029053,-0.068453,0.008342,-0.11783,0.0026184,0.012789,0.020535,0.032265 262 | "261",-0.023153,-0.081133,0.16105,0.080755,-0.034568,-0.058579,-0.069121,-0.06912,0.98629 263 | "262",-0.22125,0.030715,-0.054802,-0.063769,0.11595,0.077463,0.046978,0.0010916,-0.41521 264 | "263",0.42389,0.055792,-0.16059,0.025503,-0.37953,0.019352,-0.048425,0.084764,0.070595 265 | "264",-0.28386,-0.042529,0.35851,0.086497,0.15938,-0.11397,0.11731,-0.14339,-0.38835 266 | "265",0.12019,0.10047,-0.14058,-0.072838,-0.14703,0.090205,0.070023,0.070922,0.096474 267 | "266",-0.12658,-0.18079,-0.10444,-0.094339,-0.06102,-0.0025377,-0.22158,0.057634,1.0522 268 | "267",0.12991,0.12297,0.0051709,0.058201,-0.092228,-0.014833,0.11649,-0.0022981,1.5424 269 | "268",-0.089515,-0.04297,0.16391,0.0096216,0.087725,-0.018372,0.041012,-0.058355,0.62329 270 | "269",-0.019826,-0.034307,0.19455,-0.03894,-0.14979,-0.12668,-0.050941,-0.040651,-0.28972 271 | "270",-0.12894,0.11546,-0.092026,0.06311,-0.12528,0.032733,0.070217,0.013045,-0.25627 272 | "271",0.272,-0.16368,-0.23964,0.14789,-0.3038,0.10974,0.14371,0.07504,0.23808 273 | "272",-0.10921,0.0072871,0.22079,-0.2254,-0.70086,0.004859,-0.11297,-0.039428,-0.67396 274 | "273",-0.14503,0.0044299,0.093568,0.014065,1.2022,-0.07583,-0.082957,-0.031797,0.29179 275 | "274",0.03095,0.022057,0.20501,-0.070907,0.19325,0.034417,-0.208,-0.016119,0.79444 276 | "275",-0.016758,-0.0082656,-0.11252,0.10253,-0.0020724,0.066699,0.16264,0.017587,1.2978 277 | "276",0.013314,-0.0025452,0.014567,0.0098332,-0.24246,-0.15887,0.10918,-0.024553,-0.084152 278 | "277",0.10674,-0.007806,-0.14628,-0.002126,-0.19819,0.050277,-0.077257,0.015087,-1.9307 279 | "278",-0.080942,0.050111,-0.22336,-0.08408,-0.045152,0.10074,-0.081081,0.085518,-0.46204 280 | "279",-0.19157,0.013011,0.34057,-0.018478,1.0307,-0.013762,-0.11434,-0.079808,1.4748 281 | "280",0.24859,-0.0046347,-0.19416,-0.035305,-0.61395,0.011963,0.027938,0.040835,-1.7637 282 | "281",-0.043136,0.13414,-0.16129,-0.042349,-0.20394,0.039056,-0.11003,0.050516,-0.77426 283 | "282",0.23736,0.01591,-0.15654,0.045237,-0.66499,0.043845,0.14208,0.061601,-0.75644 284 | "283",-0.17065,-0.03634,0.25062,0.0423,0.63389,0.12817,0.03545,-0.10998,0.20467 285 | "284",-0.10657,-0.2292,0.081017,-0.057086,-0.45896,-0.40631,-0.04428,-0.025772,2.5797 286 | "285",0.074756,0.11295,0.061995,-0.016791,1.0018,0.07949,-0.07758,0.015085,-0.22743 287 | "286",-0.15455,-0.11805,0.060069,0.2933,0.23998,-0.063857,0.13575,-0.067081,-0.19618 288 | 
"287",0.050112,0.010499,0.061719,-0.18855,-0.64197,0.066902,0.094277,0.0033993,0.7725 289 | "288",-0.13041,0.047412,0.07853,-0.068586,0.064618,0.035666,-0.16487,0.06552,-1.5542 290 | "289",0.20403,0.00087583,0.22847,-0.058083,0.092581,0.11442,-0.11177,-0.061231,-1.5303 291 | "290",-0.13241,0.026923,-0.091586,-0.021352,0.73004,0.028285,0.0094657,0.048629,0.84647 292 | "291",-0.097641,0.022108,-0.17756,0.052396,-0.69491,-0.086691,-0.069864,0.027722,-0.080697 293 | "292",0.063215,0.035532,-0.048418,0.03851,-0.10408,0.013582,0.096259,-0.022045,-1.4749 294 | "293",0.21725,-0.016689,-0.021008,-0.0086342,-0.11976,0.0028875,0.052811,-0.006761,-0.52719 295 | "294",-0.034757,0.0010778,0.0085438,-0.023232,-0.044877,0.0056652,0.0021977,-0.0056995,-0.51621 296 | "295",-0.053666,-0.046077,-0.11463,-0.049031,-0.23914,-0.053978,0.074306,0.02976,-0.50766 297 | "296",-0.14907,0.066652,-0.085055,-0.057671,-0.28518,0.025008,0.021725,0.033393,0.90385 298 | "297",0.3038,0.034041,-0.16484,-0.046054,-0.17734,0.065326,0.0080677,0.065943,0.91215 299 | "298",-0.15234,0.20957,0.24776,0.45885,-0.1965,0.03148,0.12015,-0.082162,1.3994 300 | "299",-0.057099,-0.28464,0.096772,-0.34477,0.33367,-0.18564,-0.18615,-0.059934,2.3605 301 | "300",-0.019564,-0.12024,0.20693,-0.014824,-0.4113,0.067257,-0.040944,-0.060996,-1.3814 302 | "301",-0.013198,-0.014805,-0.15658,-0.043344,0.74991,-0.021347,-0.028246,0.036733,0.52693 303 | "302",-0.012325,0.10916,-0.1898,0.075107,-0.25209,0.038662,-0.027443,0.075619,-0.38877 304 | "303",-0.013425,-0.011177,0.30181,-0.14,0.1933,-0.049092,-0.030905,-0.05822,-0.3624 305 | "304",-0.020935,-0.03335,0.14684,-0.051743,0.19232,0.11767,-0.071514,-0.015952,0.1383 306 | "305",-0.065139,0.090524,-0.35549,-0.058154,0.37343,0.049607,-0.13227,0.10915,-0.3143 307 | "306",0.05018,0.030188,0.16027,0.27579,-0.48735,-0.0068029,0.21441,-0.087373,-1.2354 308 | "307",0.024792,-0.042476,-0.19246,-0.078913,-0.79014,-0.20139,0.092538,0.026952,-1.2151 309 | "308",0.0059682,-0.073672,0.10088,-0.093357,1.0324,0.069603,-0.16733,-0.020836,0.68416 310 | "309",-0.076763,-0.05053,0.12652,-0.046089,-0.075128,0.10523,-0.098951,0.01127,0.22423 311 | "310",-0.046847,0.3267,-0.35787,0.0051628,1.0942,0.027106,-0.059636,0.10873,-1.1775 312 | "311",0.086092,-0.25373,0.50174,0.18451,-0.69412,-0.1683,0.23886,-0.22634,-1.1707 313 | "312",-0.031945,-0.053342,-0.55981,-0.16232,-0.44007,0.023112,-0.11267,0.18176,-1.1682 314 | "313",0.0076973,0.0092993,0.29462,-0.048858,-0.24324,0.073991,-0.065383,-0.061372,1.1892 315 | "314",-0.009595,0.047249,0.088352,-0.010746,0.28619,0.0097922,-0.030359,0.020547,0.23815 316 | "315",-0.023292,0.0047937,0.084135,0.038604,0.22097,0.0033434,-0.027298,-0.031705,-1.1751 317 | "316",-0.08319,-0.013582,0.25238,0.022875,-0.5426,-0.076918,-0.027123,-0.088257,-1.1795 318 | "317",0.073905,0.055926,-0.47253,-0.06175,0.5664,0.033011,-0.028175,0.19265,0.23131 319 | "318",-0.058242,-0.059163,0.25896,-0.099935,-0.19364,0.15685,-0.040447,-0.079392,-1.2041 320 | "319",-0.003054,0.07566,0.14351,0.029178,0.13182,-0.076301,-0.1805,-0.011347,0.67268 321 | "320",-0.042718,-0.10104,-0.11066,0.011411,0.24715,-0.10646,0.18459,0.00086902,-0.28452 322 | "321",0.037291,0.080563,0.1225,0.076622,-0.42025,-0.018403,-0.011781,-0.023858,-1.2405 323 | "322",0.049444,-0.028949,0.069449,-0.031985,-0.39013,0.066776,0.12938,0.003615,0.15124 324 | "323",-0.08208,-0.04161,0.085806,-0.049641,-0.0090254,-0.085952,-0.17188,-0.02249,1.0796 325 | "324",0.002423,0.060926,0.28635,-0.044334,1.198,0.095215,-0.18815,-0.034615,0.58821 326 | 
"325",-0.11066,-0.067903,-0.4118,0.020679,-0.3876,0.020985,0.17771,0.089724,0.58064 327 | "326",0.12667,0.019861,0.14487,0.018152,-0.40734,-0.10187,-0.026358,-0.079644,0.57478 328 | "327",-0.083937,0.0083425,-0.40077,0.024828,0.63676,0.088036,0.10813,0.15111,0.0819 329 | "328",-0.027496,0.026971,0.37383,-0.20551,-0.31644,0.015756,-0.30385,-0.078564,0.076529 330 | "329",0.090614,0.091512,-0.07486,0.15776,-0.42342,0.050158,0.051346,0.0018103,0.076529 331 | "330",-0.13797,-0.15824,0.19323,-0.021542,0.06887,-0.15722,0.17069,-0.060901,0.54915 332 | -------------------------------------------------------------------------------- /weak_conditional_independence_testing/PCalg_twostep_flags.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Modified PC algorithm code (original code: https://github.com/keiichishima/pcalg) 3 | by incorporating KRESIT as the conditional independence test. 4 | 5 | Original code: A graph generator based on the PC algorithm [Kalisch2007]. 6 | [Kalisch2007] Markus Kalisch and Peter Bhlmann. Estimating 7 | high-dimensional directed acyclic graphs with the pc-algorithm. In The 8 | Journal of Machine Learning Research, Vol. 8, pp. 613-636, 2007. 9 | License: BSD 10 | 11 | Example run in terminal: 12 | 1) KRESIT: 13 | $ python PCalg_twostep_flags.py 400 --dimZ 4 --data_filename "Synthetic_DAGexample" 14 | --kernelX_use_median --kernelY_use_median 15 | --results_filename "test_run_KRESIT" --figure_filename "test_graph_KRESIT" 16 | 17 | (It takes the first 400 samples and the first 4 dimensions from the Synthetic_DAGexample.csv 18 | and run KRESIT with Gaussian kernel median Heuristic on the variables X and Y. The kernel on Z 19 | is set by default to be Gaussian median Heuristic. The regularisation parameters is 20 | set by default to use grid search. The resulting CPDAG is saved "test_graph_KRESIT.pdf".) 21 | 22 | 2) RESIT: 23 | $ python PCalg_twostep_flags.py 400 --dimZ 4 --data_filename "Synthetic_DAGexample" 24 | --kernelX --kernelY 25 | --kernelRxz --kernelRyz 26 | --kernelRxz_use_median --kernelRyz_use_median 27 | --RESIT_type 28 | --result_filename "test_run_RESIT" --figure_filename "test_graph_RESIT" 29 | 30 | (It takes the first 400 samples and the first 4 dimensions from the Synthetic_DAGexample.csv 31 | and run RESIT. The kernels on X and Y are set to be linear. The kernels on the residuals Rxz and 32 | Ryz are Gaussian with median Heuristic bandwidth.The regularisation parameters is 33 | set by default to use grid search. The resulting CPDAG is saved "test_graph_RESIT.pdf".) 34 | 35 | ''' 36 | from __future__ import print_function 37 | 38 | # Remote_running 39 | #import matplotlib 40 | #matplotlib.use('Agg') 41 | 42 | import os, sys 43 | BASE_DIR = os.path.join( os.path.dirname( __file__ ), '..' ) 44 | sys.path.append(BASE_DIR) 45 | 46 | 47 | from itertools import combinations, permutations 48 | import logging 49 | 50 | import networkx as nx 51 | import cPickle as pickle 52 | from pickle import load, dump 53 | 54 | from kerpy.GaussianKernel import GaussianKernel 55 | from kerpy.LinearKernel import LinearKernel 56 | from TwoStepCondTestObject import TwoStepCondTestObject 57 | from independence_testing.HSICSpectralTestObject import HSICSpectralTestObject 58 | import numpy as np 59 | from numpy import arange 60 | 61 | _logger = logging.getLogger(__name__) 62 | 63 | 64 | 65 | 66 | 67 | 68 | def _create_complete_graph(node_ids): 69 | """Create a complete graph from the list of node ids. 
70 | Args: 71 | node_ids: a list of node ids 72 | Returns: 73 | An undirected graph (as a networkx.Graph) 74 | """ 75 | g = nx.Graph() 76 | g.add_nodes_from(node_ids) 77 | for (i, j) in combinations(node_ids, 2): 78 | g.add_edge(i, j) 79 | pass 80 | return g 81 | 82 | 83 | 84 | 85 | def estimate_skeleton( data_matrix, alpha, **kwargs): 86 | # originally first argument is indep_test_func 87 | # now this version uses HSIC Spectral Test for independence 88 | # and KRESIT for conditional independence. 89 | """Estimate a skeleton graph from the statistical information. 90 | Args: 91 | indep_test_func: the function name for a conditional 92 | independence test. 93 | data_matrix: data (as a numpy array). 94 | alpha: the significance level. 95 | kwargs: 96 | 'max_reach': maximum value of l (see the code). The 97 | value depends on the underlying distribution. 98 | 'method': if 'stable' given, use stable-PC algorithm 99 | (see [Colombo2014]). 100 | other parameters may be passed depending on the 101 | indep_test_func()s. 102 | Returns: 103 | g: a skeleton graph (as a networkx.Graph). 104 | sep_set: a separation set (as a 2D-array of set()). 105 | [Colombo2014] Diego Colombo and Marloes H Maathuis. Order-independent 106 | constraint-based causal structure learning. In The Journal of Machine 107 | Learning Research, Vol. 15, pp. 3741-3782, 2014. 108 | """ 109 | 110 | def method_stable(kwargs): 111 | return ('method' in kwargs) and kwargs['method'] == "stable" 112 | 113 | node_ids = range(data_matrix.shape[1]) 114 | g = _create_complete_graph(node_ids) 115 | 116 | node_size = data_matrix.shape[1] 117 | sep_set = [[set() for i in range(node_size)] for j in range(node_size)] 118 | 119 | 120 | X_idx_list_init = [] 121 | Y_idx_list_init = [] 122 | Z_idx_list_init = [] 123 | pval_list_init = [] 124 | 125 | l = 0 126 | completed_xy_idx_init = 0 127 | completed_z_idx_init = 0 128 | remove_edges_current = [] 129 | 130 | 131 | results_filename = kwargs['results_filename'] 132 | myfolder = "pcalg_results/" 133 | save_filename = myfolder + results_filename + ".bin" 134 | #print("save_filename:", save_filename) 135 | #sys.exit(1) 136 | if not os.path.exists(myfolder): 137 | os.mkdir(myfolder) 138 | elif os.path.exists(save_filename): 139 | load_f = open(save_filename,"r") 140 | [X_idx_list_init, Y_idx_list_init, Z_idx_list_init, pval_list_init, \ 141 | l, completed_xy_idx_init, completed_z_idx_init,\ 142 | remove_edges_current, g] = load(load_f) 143 | load_f.close() 144 | print("Found existing results") 145 | 146 | X_idx_list = X_idx_list_init 147 | Y_idx_list = Y_idx_list_init 148 | Z_idx_list = Z_idx_list_init 149 | pval_list = pval_list_init 150 | completed_xy_idx = completed_xy_idx_init 151 | completed_z_idx = completed_z_idx_init 152 | 153 | 154 | 155 | while True: 156 | cont = False 157 | remove_edges = remove_edges_current 158 | perm_iteration_list = list(permutations(node_ids,2)) 159 | length_iteration_list = len(perm_iteration_list) 160 | 161 | for ij in arange(completed_xy_idx, length_iteration_list): 162 | (i,j) = perm_iteration_list[ij] 163 | adj_i = g.neighbors(i) 164 | if j not in adj_i: 165 | continue 166 | else: 167 | adj_i.remove(j) 168 | pass 169 | if len(adj_i) >= l: 170 | _logger.debug('testing %s and %s' % (i,j)) 171 | _logger.debug('neighbors of %s are %s' % (i, str(adj_i))) 172 | if len(adj_i) < l: 173 | continue 174 | 175 | cc = list(combinations(adj_i, l)) 176 | length_cc = len(cc) 177 | 178 | 179 | for kk in arange(completed_z_idx, length_cc): 180 | k = cc[kk] 181 | _logger.debug('indep
prob of %s and %s with subset %s' 182 | % (i, j, str(k))) 183 | if l == 0: # independence testing 184 | print("independence testing", (i,j)) 185 | data_x = data_matrix[:,[i]] 186 | data_y = data_matrix[:,[j]] 187 | 188 | num_samples = np.shape(data_matrix)[0] 189 | kernelX_hsic = GaussianKernel(1.) 190 | kernelY_hsic = GaussianKernel(1.) 191 | kernelX_use_median_hsic = True 192 | kernelY_use_median_hsic = True 193 | 194 | myspectraltestobj = HSICSpectralTestObject(num_samples, None, kernelX_hsic, kernelY_hsic, 195 | kernelX_use_median = kernelX_use_median_hsic, 196 | kernelY_use_median = kernelY_use_median_hsic, 197 | num_nullsims=1000, unbiased=False) 198 | 199 | p_val, _ = myspectraltestobj.compute_pvalue_with_time_tracking(data_x,data_y) 200 | 201 | 202 | X_idx_list.append((i)) 203 | Y_idx_list.append((j)) 204 | Z_idx_list.append((0)) 205 | pval_list.append((p_val)) 206 | 207 | 208 | else: # conditional independence testing 209 | print("conditional independence testing",(i,j,k)) 210 | data_x = data_matrix[:,[i]] 211 | data_y = data_matrix[:,[j]] 212 | data_z = data_matrix[:,k] 213 | 214 | num_samples = np.shape(data_matrix)[0] 215 | #kernelX = GaussianKernel(1.) 216 | #kernelY = GaussianKernel(1.) 217 | #kernelX_use_median = True 218 | #kernelY_use_median = True 219 | #kernelX = LinearKernel() 220 | #kernelY = LinearKernel() 221 | kernelX = kwargs['kernelX'] 222 | kernelY = kwargs['kernelY'] 223 | kernelZ = GaussianKernel(1.) 224 | kernelX_use_median = kwargs['kernelX_use_median'] 225 | kernelY_use_median = kwargs['kernelY_use_median'] 226 | kernelRxz = kwargs['kernelRxz'] 227 | kernelRyz = kwargs['kernelRyz'] 228 | kernelRxz_use_median = kwargs['kernelRxz_use_median'] 229 | kernelRyz_use_median = kwargs['kernelRyz_use_median'] 230 | RESIT_type = kwargs['RESIT_type'] 231 | optimise_lambda_only = kwargs['optimise_lambda_only'] 232 | grid_search = kwargs['grid_search'] 233 | GD_optimise = kwargs['GD_optimise'] 234 | 235 | 236 | 237 | num_lambdaval = 30 238 | lambda_val = 10**np.linspace(-6,1, num=num_lambdaval) 239 | z_bandwidth = None 240 | #num_bandwidth = 20 241 | #z_bandwidth = 10**np.linspace(-5,1,num = num_bandwidth) 242 | 243 | mytestobj = TwoStepCondTestObject(num_samples, None, 244 | kernelX, kernelY, kernelZ, 245 | kernelX_use_median=kernelX_use_median, 246 | kernelY_use_median=kernelY_use_median, 247 | kernelZ_use_median=True, 248 | kernelRxz = kernelRxz, kernelRyz = kernelRyz, 249 | kernelRxz_use_median = kernelRxz_use_median, 250 | kernelRyz_use_median = kernelRyz_use_median, 251 | RESIT_type = RESIT_type, 252 | num_shuffles=800, 253 | lambda_val=lambda_val,lambda_X = None, lambda_Y = None, 254 | optimise_lambda_only = optimise_lambda_only, 255 | sigmasq_vals = z_bandwidth ,sigmasq_xz = 1., sigmasq_yz = 1., 256 | K_folds=5, grid_search = grid_search, 257 | GD_optimise=GD_optimise, learning_rate=0.1,max_iter=300, 258 | initial_lambda_x=0.5,initial_lambda_y=0.5, initial_sigmasq = 1) 259 | 260 | 261 | 262 | p_val, _ = mytestobj.compute_pvalue(data_x, data_y, data_z) 263 | 264 | X_idx_list.append((i)) 265 | Y_idx_list.append((j)) 266 | Z_idx_list.append(k) 267 | pval_list.append((p_val)) 268 | 269 | 270 | 271 | 272 | completed_z_idx = kk + 1 273 | 274 | save_f = open(save_filename,"w") 275 | dump([X_idx_list, Y_idx_list, Z_idx_list, pval_list, l, completed_xy_idx, completed_z_idx,\ 276 | remove_edges, g], save_f) 277 | save_f.close() 278 | 279 | 280 | _logger.debug('p_val is %s' % str(p_val)) 281 | if p_val > alpha: 282 | if g.has_edge(i, j): 283 | _logger.debug('p: remove edge 
(%s, %s)' % (i, j)) 284 | if method_stable(kwargs): 285 | remove_edges.append((i, j)) 286 | else: 287 | g.remove_edge(i, j) 288 | pass 289 | sep_set[i][j] |= set(k) 290 | sep_set[j][i] |= set(k) 291 | break 292 | pass 293 | completed_z_idx = 0 294 | completed_xy_idx = ij + 1 295 | cont = True 296 | pass 297 | pass 298 | 299 | 300 | 301 | l += 1 302 | completed_xy_idx = 0 303 | if method_stable(kwargs): 304 | g.remove_edges_from(remove_edges) 305 | if cont is False: 306 | break 307 | if ('max_reach' in kwargs) and (l > kwargs['max_reach']): 308 | break 309 | 310 | 311 | 312 | save_f = open(save_filename,"w") 313 | dump([X_idx_list, Y_idx_list, Z_idx_list, pval_list, l, completed_xy_idx, completed_z_idx,\ 314 | remove_edges, g], save_f) 315 | save_f.close() 316 | 317 | pass 318 | 319 | return (g, sep_set) 320 | 321 | 322 | 323 | 324 | def estimate_cpdag(skel_graph, sep_set): 325 | """Estimate a CPDAG from the skeleton graph and separation sets 326 | returned by the estimate_skeleton() function. 327 | Args: 328 | skel_graph: A skeleton graph (an undirected networkx.Graph). 329 | sep_set: An 2D-array of separation set. 330 | The contents look like something like below. 331 | sep_set[i][j] = set([k, l, m]) 332 | Returns: 333 | An estimated DAG. 334 | """ 335 | dag = skel_graph.to_directed() 336 | node_ids = skel_graph.nodes() 337 | for (i, j) in combinations(node_ids, 2): 338 | adj_i = set(dag.successors(i)) 339 | if j in adj_i: 340 | continue 341 | adj_j = set(dag.successors(j)) 342 | if i in adj_j: 343 | continue 344 | common_k = adj_i & adj_j 345 | for k in common_k: 346 | if k not in sep_set[i][j]: 347 | if dag.has_edge(k, i): 348 | _logger.debug('S: remove edge (%s, %s)' % (k, i)) 349 | dag.remove_edge(k, i) 350 | pass 351 | if dag.has_edge(k, j): 352 | _logger.debug('S: remove edge (%s, %s)' % (k, j)) 353 | dag.remove_edge(k, j) 354 | pass 355 | pass 356 | pass 357 | pass 358 | 359 | def _has_both_edges(dag, i, j): 360 | return dag.has_edge(i, j) and dag.has_edge(j, i) 361 | 362 | def _has_any_edge(dag, i, j): 363 | return dag.has_edge(i, j) or dag.has_edge(j, i) 364 | 365 | def _has_one_edge(dag, i, j): 366 | return ((dag.has_edge(i, j) and (not dag.has_edge(j, i))) or 367 | (not dag.has_edge(i, j)) and dag.has_edge(j, i)) 368 | 369 | def _has_no_edge(dag, i, j): 370 | return (not dag.has_edge(i, j)) and (not dag.has_edge(j, i)) 371 | 372 | # For all the combination of nodes i and j, apply the following 373 | # rules. 374 | for (i, j) in combinations(node_ids, 2): 375 | # Rule 1: Orient i-j into i->j whenever there is an arrow k->i 376 | # such that k and j are nonadjacent. 377 | # 378 | # Check if i-j. 379 | if _has_both_edges(dag, i, j): 380 | # Look all the predecessors of i. 381 | for k in dag.predecessors(i): 382 | # Skip if there is an arrow i->k. 383 | if dag.has_edge(i, k): 384 | continue 385 | # Skip if k and j are adjacent. 386 | if _has_any_edge(dag, k, j): 387 | continue 388 | # Make i-j into i->j 389 | _logger.debug('R1: remove edge (%s, %s)' % (j, i)) 390 | dag.remove_edge(j, i) 391 | break 392 | pass 393 | 394 | # Rule 2: Orient i-j into i->j whenever there is a chain 395 | # i->k->j. 396 | # 397 | # Check if i-j. 398 | if _has_both_edges(dag, i, j): 399 | # Find nodes k where k is i->k. 400 | succs_i = set() 401 | for k in dag.successors(i): 402 | if not dag.has_edge(k, i): 403 | succs_i.add(k) 404 | pass 405 | pass 406 | # Find nodes j where j is k->j. 
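            # (i.e. collect the predecessors k of j whose edge k->j is strictly
            #  directed, skipping undirected k-j edges; each such k is a
            #  candidate middle node of a chain i->k->j)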
407 | preds_j = set() 408 | for k in dag.predecessors(j): 409 | if not dag.has_edge(j, k): 410 | preds_j.add(k) 411 | pass 412 | pass 413 | # Check if there is any node k where i->k->j. 414 | if len(succs_i & preds_j) > 0: 415 | # Make i-j into i->j 416 | _logger.debug('R2: remove edge (%s, %s)' % (j, i)) 417 | dag.remove_edge(j, i) 418 | pass 419 | pass 420 | 421 | # Rule 3: Orient i-j into i->j whenever there are two chains 422 | # i-k->j and i-l->j such that k and l are nonadjacent. 423 | # 424 | # Check if i-j. 425 | if _has_both_edges(dag, i, j): 426 | # Find nodes k where i-k. 427 | adj_i = set() 428 | for k in dag.successors(i): 429 | if dag.has_edge(k, i): 430 | adj_i.add(k) 431 | pass 432 | pass 433 | # For all the pairs of nodes in adj_i, 434 | for (k, l) in combinations(adj_i, 2): 435 | # Skip if k and l are adjacent. 436 | if _has_any_edge(dag, k, l): 437 | continue 438 | # Skip if not k->j. 439 | if dag.has_edge(j, k) or (not dag.has_edge(k, j)): 440 | continue 441 | # Skip if not l->j. 442 | if dag.has_edge(j, l) or (not dag.has_edge(l, j)): 443 | continue 444 | # Make i-j into i->j. 445 | _logger.debug('R3: remove edge (%s, %s)' % (j, i)) 446 | dag.remove_edge(j, i) 447 | break 448 | pass 449 | 450 | # Rule 4: Orient i-j into i->j whenever there are two chains 451 | # i-k->l and k->l->j such that k and j are nonadjacent. 452 | # 453 | # However, this rule is not necessary when the PC-algorithm 454 | # is used to estimate a DAG. 455 | pass 456 | 457 | return dag 458 | 459 | 460 | 461 | 462 | def run():#if __name__ == '__main__': 463 | import networkx as nx 464 | import pandas as pd 465 | import matplotlib.pyplot as plt 466 | from SimDataGen import SimDataGen 467 | 468 | from tools.ProcessingObject import ProcessingObject 469 | args = ProcessingObject.parse_arguments() 470 | num_samples = args.num_samples 471 | kernelX = args.kernelX #Default: GaussianKernel(1.) 472 | kernelY = args.kernelY #Default: GaussianKernel(1.)
473 | kernelX_use_median = args.kernelX_use_median #Default: False 474 | kernelY_use_median = args.kernelY_use_median #Default: False 475 | kernelRxz = args.kernelRxz #Default: LinearKernel 476 | kernelRyz = args.kernelRyz #Default: LinearKernel 477 | kernelRxz_use_median = args.kernelRxz_use_median #Default: False 478 | kernelRyz_use_median = args.kernelRyz_use_median #Default: False 479 | RESIT_type = args.RESIT_type #Default: False 480 | optimise_lambda_only = args.optimise_lambda_only #Default: True 481 | grid_search = args.grid_search #Default: True 482 | GD_optimise = args.GD_optimise #Default: False 483 | results_filename = args.results_filename 484 | figure_filename = args.figure_filename 485 | data_filename = args.data_filename 486 | num_var = args.dimZ 487 | 488 | 489 | 490 | datafile = data_filename + ".csv" 491 | data = np.loadtxt(datafile,delimiter = ',') 492 | dm = data[range(num_samples),:][:, range(num_var)] 493 | 494 | 495 | (g, sep_set) = estimate_skeleton(data_matrix=dm, alpha=0.05, 496 | kernelX = kernelX, kernelY = kernelY, 497 | kernelX_use_median = kernelX_use_median, 498 | kernelY_use_median = kernelY_use_median, 499 | kernelRxz = kernelRxz, kernelRyz = kernelRyz, 500 | kernelRxz_use_median = kernelRxz_use_median, 501 | kernelRyz_use_median = kernelRyz_use_median, 502 | RESIT_type = RESIT_type, 503 | results_filename = results_filename, 504 | optimise_lambda_only = optimise_lambda_only, 505 | grid_search = grid_search, 506 | GD_optimise = GD_optimise) 507 | 508 | 509 | g = estimate_cpdag(skel_graph=g, sep_set=sep_set) 510 | 511 | 512 | if num_var == 7: 513 | labels={} 514 | labels[0]=r'$X$' 515 | labels[1]=r'$Y$' 516 | labels[2]=r'$Z$' 517 | labels[3]=r'$A$' 518 | labels[4]=r'$B$' 519 | labels[5]=r'$Cx$' 520 | labels[6]=r'$Cy$' 521 | elif num_var == 6: 522 | labels={} 523 | labels[0] = r'$X$' 524 | labels[1] = r'$Y$' 525 | labels[2] = r'$Z$' 526 | labels[3] = r'$A$' 527 | labels[4] = r'$Cx$' 528 | labels[5] = r'$Cy$' 529 | elif num_var == 5: 530 | labels={} 531 | labels[0] = r'$X$' 532 | labels[1] = r'$Y$' 533 | labels[2] = r'$Z$' 534 | labels[3] = r'$A$' 535 | labels[4] = r'$B$' 536 | elif num_var == 4: 537 | labels={} 538 | labels[0] = r'$X$' 539 | labels[1] = r'$Y$' 540 | labels[2] = r'$Z$' 541 | labels[3] = r'$A$' 542 | else: 543 | raise NotImplementedError 544 | 545 | nx.draw_networkx(g, pos=nx.spring_layout(g), labels=labels, with_labels=True) 546 | figure_name = figure_filename + ".pdf" 547 | plt.savefig(figure_name) 548 | #plt.show() 549 | 550 | 551 | if __name__ == '__main__': 552 | run() -------------------------------------------------------------------------------- /weak_conditional_independence_testing/SimDataGen.py: -------------------------------------------------------------------------------- 1 | from numpy.random import uniform, permutation, multivariate_normal,normal 2 | from numpy import pi, prod, empty, sin, cos, asscalar, shape,zeros, identity,arange,sign,sum,sqrt,transpose, tanh,sinh 3 | import numpy as np 4 | 5 | 6 | 7 | class SimDataGen(object): 8 | def __init__(self): 9 | pass 10 | 11 | 12 | @staticmethod 13 | def null_model(num_samples, dimension = 1, rho=0): 14 | data_z = np.reshape(uniform(0,5,num_samples*dimension),(num_samples,dimension)) 15 | coin_flip_x = np.random.choice([0,1],replace=True,size=num_samples) 16 | coin_flip_y = np.random.choice([0,1],replace=True,size=num_samples) 17 | mean_noise = [0,0] 18 | cov_noise = [[1,0],[0,1]] 19 | noise_x, noise_y = multivariate_normal(mean_noise, cov_noise, num_samples).T 20 | data_x = 
zeros(num_samples) 21 | data_x[coin_flip_x == 0,] = 1.7*data_z[coin_flip_x == 0,0] 22 | data_x[coin_flip_x == 1,] = -1.7*data_z[coin_flip_x == 1,0] 23 | data_x = data_x + noise_x 24 | data_y = zeros(num_samples) 25 | data_y[coin_flip_y == 0,] = (data_z[coin_flip_y == 0,0]-2.7)**2 26 | data_y[coin_flip_y == 1,] = -(data_z[coin_flip_y == 1,0]-2.7)**2+13 27 | data_y = data_y + noise_y 28 | data_x = np.reshape(data_x, (num_samples,1)) 29 | data_y = np.reshape(data_y, (num_samples,1)) 30 | return data_x, data_y, data_z 31 | 32 | 33 | @staticmethod 34 | def alternative_model(num_samples,dimension = 1, rho=0.15): 35 | data_z = np.reshape(uniform(0,5,num_samples*dimension),(num_samples,dimension)) 36 | rr = uniform(0,1, num_samples) 37 | idx_rr = np.where(rr < rho) 38 | coin_flip_x = np.random.choice([0,1],replace=True,size=num_samples) 39 | coin_flip_y = np.random.choice([0,1],replace=True,size=num_samples) 40 | coin_flip_y[idx_rr] = coin_flip_x[idx_rr] 41 | mean_noise = [0,0] 42 | cov_noise = [[1,0],[0,1]] 43 | noise_x, noise_y = multivariate_normal(mean_noise, cov_noise, num_samples).T 44 | data_x = zeros(num_samples) 45 | data_x[coin_flip_x == 0] = 1.7*data_z[coin_flip_x == 0,0] 46 | data_x[coin_flip_x == 1] = -1.7*data_z[coin_flip_x == 1,0] 47 | data_x = data_x + noise_x 48 | data_y = zeros(num_samples) 49 | data_y[coin_flip_y == 0] = (data_z[coin_flip_y == 0,0]-2.7)**2 50 | data_y[coin_flip_y == 1] = -(data_z[coin_flip_y == 1,0]-2.7)**2+13 51 | data_y = data_y + noise_y 52 | data_x = np.reshape(data_x, (num_samples,1)) 53 | data_y = np.reshape(data_y, (num_samples,1)) 54 | return data_x, data_y, data_z 55 | 56 | 57 | 58 | @staticmethod 59 | def DAG_simulation_version1(num_samples): 60 | dimension = 1 61 | rho = 0 62 | data_z = np.reshape(uniform(0,5,num_samples*dimension),(num_samples,dimension)) 63 | rr = uniform(0,1, num_samples) 64 | idx_rr = np.where(rr < rho) 65 | coin_flip_x = np.random.choice([0,1],replace=True,size=num_samples) 66 | coin_flip_y = np.random.choice([0,1],replace=True,size=num_samples) 67 | coin_flip_y[idx_rr] = coin_flip_x[idx_rr] 68 | mean_noise = [0,0] 69 | cov_noise = [[1,0],[0,1]] 70 | noise_x, noise_y = multivariate_normal(mean_noise, cov_noise, num_samples).T 71 | data_x = zeros(num_samples) 72 | data_x[coin_flip_x == 0] = 1.7*data_z[coin_flip_x == 0,0] 73 | data_x[coin_flip_x == 1] = -1.7*data_z[coin_flip_x == 1,0] 74 | data_x = data_x + noise_x 75 | data_y = zeros(num_samples) 76 | data_y[coin_flip_y == 0] = (data_z[coin_flip_y == 0,0]-2.7)**2 77 | data_y[coin_flip_y == 1] = -(data_z[coin_flip_y == 1,0]-2.7)**2+13 78 | data_y = data_y + noise_y 79 | data_x = np.reshape(data_x, (num_samples,1)) 80 | data_y = np.reshape(data_y, (num_samples,1)) 81 | coin_x = np.reshape(coin_flip_x, (num_samples,1)) 82 | coin_y = np.reshape(coin_flip_y, (num_samples,1)) 83 | 84 | noise_A, noise_B = multivariate_normal(mean_noise, cov_noise, num_samples).T 85 | noise_A = np.reshape(noise_A, (num_samples,1)) 86 | noise_B = np.reshape(noise_B, (num_samples,1)) 87 | 88 | data_A = (data_y-5)**2/float(3) + 5 + noise_A 89 | #data_A = (data_y-5)**2/float(11)+ 5 +noise_A 90 | #data_B = 5*np.tanh(data_y) + noise_B # tanh version 91 | #data_B = 5*np.sin(data_y) + noise_B # sine version 92 | data_B = 5.5*np.tanh(data_y) + noise_B 93 | 94 | data_matrix = np.concatenate((data_x,data_y,data_z,data_A,data_B,coin_x, coin_y),axis=1) 95 | return data_matrix 96 | # data_x, data_y, data_z, data_A, data_B, coin_x, coin_y, 97 | 
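# A minimal usage sketch of the generators above (assuming this module is
# importable as SimDataGen, as the scripts in this folder do): each model
# returns data_x and data_y of shape (num_samples, 1) and data_z of shape
# (num_samples, dimension), while DAG_simulation_version1 returns a single
# (num_samples, 7) data matrix with columns (X, Y, Z, A, B, coin_x, coin_y).
#
#   from SimDataGen import SimDataGen
#   data_x, data_y, data_z = SimDataGen.null_model(400, dimension=4)
#   dm = SimDataGen.DAG_simulation_version1(400)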
-------------------------------------------------------------------------------- /weak_conditional_independence_testing/SyntheticDim_KRESIT.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Example run in terminal: 3 | 1) KRESIT: 4 | $ python SyntheticDim_KRESIT.py 40 --dimZ 1 5 | --kernelX_use_median --kernelY_use_median 6 | 7 | (Simulates 100 sets of 40 samples with a 1-dimensional conditioning set from the null_model 8 | and runs KRESIT with the Gaussian kernel median heuristic on the variables X and Y. The kernel on Z 9 | is set by default to be Gaussian with the median heuristic. The regularisation parameters are 10 | set by default to use grid search.) 11 | 12 | 2) RESIT: 13 | $ python SyntheticDim_KRESIT.py 40 --dimZ 1 14 | --kernelX --kernelY 15 | --kernelRxz --kernelRyz 16 | --kernelRxz_use_median --kernelRyz_use_median 17 | --RESIT_type 18 | 19 | (Simulates 100 sets of 40 samples with a 1-dimensional conditioning set from the null_model 20 | and runs RESIT. The kernels on X and Y are set to be linear. The kernels on the residuals Rxz and 21 | Ryz are Gaussian with median heuristic bandwidths. The regularisation parameters are 22 | set by default to use grid search.) 23 | 24 | ''' 25 | 26 | import os, sys 27 | BASE_DIR = os.path.join( os.path.dirname( __file__ ), '..' ) 28 | sys.path.append(BASE_DIR) 29 | 30 | from kerpy.GaussianKernel import GaussianKernel 31 | from kerpy.LinearKernel import LinearKernel 32 | from TwoStepCondTestObject import TwoStepCondTestObject 33 | from independence_testing.TestExperiment import TestExperiment 34 | from SimDataGen import SimDataGen 35 | import numpy as np 36 | from tools.ProcessingObject import ProcessingObject 37 | 38 | 39 | 40 | data_generating_function = SimDataGen.null_model 41 | #data_generating_function = SimDataGen.alternative_model 42 | args = ProcessingObject.parse_arguments() 43 | 44 | '''unpack the arguments needed:''' 45 | num_samples = args.num_samples 46 | dimZ = args.dimZ # Integer dimension of the conditioning set. 47 | kernelX = args.kernelX #Default: GaussianKernel(1.) 48 | kernelY = args.kernelY #Default: GaussianKernel(1.) 49 | kernelX_use_median = args.kernelX_use_median #Default: False 50 | kernelY_use_median = args.kernelY_use_median #Default: False 51 | kernelRxz = args.kernelRxz #Default: LinearKernel 52 | kernelRyz = args.kernelRyz #Default: LinearKernel 53 | kernelRxz_use_median = args.kernelRxz_use_median #Default: False 54 | kernelRyz_use_median = args.kernelRyz_use_median #Default: False 55 | RESIT_type = args.RESIT_type #Default: False 56 | optimise_lambda_only = args.optimise_lambda_only #Default: True 57 | grid_search = args.grid_search #Default: True 58 | GD_optimise = args.GD_optimise #Default: False 59 | 60 | 61 | data_generator=lambda num_samples: data_generating_function(num_samples,dimension=dimZ) 62 | 63 | 64 | num_lambdaval = 30 65 | lambda_val = 10**np.linspace(-6,-1, num=num_lambdaval) 66 | #num_bandwidth = 20 67 | #z_bandwidth = 10**np.linspace(-5,1,num = num_bandwidth) 68 | z_bandwidth = None 69 | kernelZ = GaussianKernel(1.)
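# (Note on the settings above: lambda_val is a log-spaced grid of 30
#  regularisation values from 1e-6 to 1e-1, and leaving z_bandwidth = None
#  while passing kernelZ_use_median=True to the test object below means the
#  Gaussian bandwidth on Z comes from the median heuristic rather than a
#  grid search.)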
70 | 71 | 72 | test_object = TwoStepCondTestObject(num_samples, data_generator, 73 | kernelX, kernelY, kernelZ, 74 | kernelX_use_median=kernelX_use_median, 75 | kernelY_use_median=kernelY_use_median, 76 | kernelZ_use_median=True, 77 | kernelRxz = kernelRxz, kernelRyz = kernelRyz, 78 | kernelRxz_use_median = kernelRxz_use_median, 79 | kernelRyz_use_median = kernelRyz_use_median, 80 | RESIT_type = RESIT_type, 81 | num_shuffles=800, 82 | lambda_val=lambda_val,lambda_X = None, lambda_Y = None, 83 | optimise_lambda_only = optimise_lambda_only, 84 | sigmasq_vals = z_bandwidth ,sigmasq_xz = 1., sigmasq_yz = 1., 85 | K_folds=5, grid_search = grid_search, 86 | GD_optimise=GD_optimise, learning_rate=0.1,max_iter=300, 87 | initial_lambda_x=0.5,initial_lambda_y=0.5, initial_sigmasq = 1) 88 | 89 | 90 | # file name of the results 91 | name = os.path.basename(__file__).rstrip('.py')+'_d_'+str(dimZ)+'_n_'+str(num_samples) 92 | 93 | param={'name': name,\ 94 | 'dim_conditioning_set': dimZ,\ 95 | 'kernelX': kernelX,\ 96 | 'kernelY': kernelY,\ 97 | 'kernelZ': kernelZ,\ 98 | 'RESIT_type': RESIT_type,\ 99 | 'optimise_lambda_only': optimise_lambda_only,\ 100 | 'grid_search': grid_search, \ 101 | 'data_generator': str(data_generating_function),\ 102 | 'num_samples': num_samples} 103 | 104 | experiment=TestExperiment(name, param, test_object) 105 | 106 | numTrials = 100 107 | alpha=0.05 108 | experiment.run_test_trials(numTrials, alpha=alpha) 109 | 110 | 111 | -------------------------------------------------------------------------------- /weak_conditional_independence_testing/TwoStepCondTestObject.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Created on 21 Jun 2017 3 | 4 | 5 | Combine HSICConditionalTestObject.py (KRESIT) 6 | and RESITCondTestObject.py (RESIT & LRESIT) 7 | in one framework with the following parameter optimisation options: 8 | - Fix step gradient descent on lambda 9 | - Fix step gradient descent on sigmasq and lambda 10 | - grid search on lambda 11 | - grid search on sigmasq and lambda 12 | 13 | Such parameter optimisation is only for the regression part. 14 | 15 | ''' 16 | 17 | import os, sys 18 | BASE_DIR = os.path.join( os.path.dirname( __file__ ), '..' 
) 19 | sys.path.append(BASE_DIR) 20 | 21 | 22 | from independence_testing.HSICTestObject import HSICTestObject 23 | from kerpy.Kernel import Kernel 24 | from kerpy.LinearKernel import LinearKernel 25 | from kerpy.GaussianKernel import GaussianKernel 26 | from scipy.spatial.distance import squareform, pdist, cdist 27 | from sklearn.model_selection import KFold 28 | from scipy.linalg import solve 29 | 30 | 31 | import numpy as np 32 | from numpy.random import permutation 33 | from numpy import trace,eye, sqrt, median,exp,log 34 | from scipy.linalg import cholesky, cho_solve 35 | import time 36 | 37 | 38 | class TwoStepCondTestObject(HSICTestObject): 39 | 40 | def __init__(self, num_samples, data_generator, 41 | kernelX, kernelY, kernelZ, kernelX_use_median=False, 42 | kernelY_use_median=False, kernelZ_use_median=False, 43 | kernelRxz = LinearKernel(), kernelRyz = LinearKernel(), 44 | kernelRxz_use_median=False, kernelRyz_use_median=False, 45 | RESIT_type = False, 46 | num_shuffles=800, 47 | lambda_val=[0.5,1,5,10],lambda_X = None, lambda_Y = None, 48 | optimise_lambda_only = False, 49 | sigmasq_vals = [1,2,3] ,sigmasq_xz = 1., sigmasq_yz = 1., 50 | K_folds=5, grid_search = False, 51 | GD_optimise=True, learning_rate=0.001,max_iter=3000, 52 | initial_lambda_x=0.5,initial_lambda_y=0.5, initial_sigmasq = 1): 53 | HSICTestObject.__init__(self, num_samples, data_generator, kernelX, kernelY, kernelZ, 54 | kernelX_use_median=kernelX_use_median, kernelY_use_median=kernelY_use_median, 55 | kernelZ_use_median=kernelZ_use_median) 56 | 57 | self.kernelRxz = kernelRxz 58 | self.kernelRyz = kernelRyz 59 | self.kernelRxz_use_median = kernelRxz_use_median 60 | self.kernelRyz_use_median = kernelRyz_use_median 61 | self.RESIT_type = RESIT_type 62 | self.num_shuffles = num_shuffles 63 | self.lambda_val = lambda_val 64 | self.lambda_X = lambda_X 65 | self.lambda_Y = lambda_Y 66 | self.optimise_lambda_only = optimise_lambda_only 67 | self.sigmasq_vals = sigmasq_vals 68 | self.sigmasq_xz = sigmasq_xz 69 | self.sigmasq_yz = sigmasq_yz 70 | self.K_folds = K_folds 71 | self.GD_optimise = GD_optimise 72 | self.learning_rate = learning_rate 73 | self.grid_search = grid_search 74 | self.initial_lambda_x = initial_lambda_x 75 | self.initial_lambda_y = initial_lambda_y 76 | self.initial_sigmasq = initial_sigmasq 77 | self.max_iter = max_iter 78 | 79 | 80 | 81 | 82 | # Pre-compute the kernel matrices needed for the total cv error and its gradient 83 | def compute_matrices_for_gradient_totalcverr(self, train_x, train_y, train_z): 84 | if self.kernelX_use_median: 85 | sigmax = self.kernelX.get_sigma_median_heuristic(train_x) 86 | self.kernelX.set_width(float(sigmax)) 87 | if self.kernelY_use_median: 88 | sigmay = self.kernelY.get_sigma_median_heuristic(train_y) 89 | self.kernelY.set_width(float(sigmay)) 90 | kf = KFold( n_splits=self.K_folds) 91 | matrix_results = [[[None] for _ in range(self.K_folds)]for _ in range(8)] 92 | # xx=[[None]*10]*6 will give the same id to xx[0][0] and xx[1][0] etc. as 93 | # this command simply copied [None] many times. But the above gives different ids. 
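        # For instance, the difference can be checked directly:
        #   xx = [[None]*10]*6               ->  xx[0] is xx[1]  is True  (one shared inner list)
        #   yy = [[None] for _ in range(6)]  ->  yy[0] is yy[1]  is False (distinct inner lists)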
94 | count = 0 95 | for train_index, test_index in kf.split(np.ones((self.num_samples,1))): 96 | X_tr, X_tst = train_x[train_index], train_x[test_index] 97 | Y_tr, Y_tst = train_y[train_index], train_y[test_index] 98 | Z_tr, Z_tst = train_z[train_index], train_z[test_index] 99 | matrix_results[0][count] = self.kernelX.kernel(X_tst, X_tr) #Kx_tst_tr 100 | matrix_results[1][count] = self.kernelX.kernel(X_tr, X_tr) #Kx_tr_tr 101 | matrix_results[2][count] = self.kernelX.kernel(X_tst, X_tst) #Kx_tst_tst 102 | matrix_results[3][count] = self.kernelY.kernel(Y_tst, Y_tr) #Ky_tst_tr 103 | matrix_results[4][count] = self.kernelY.kernel(Y_tr, Y_tr) #Ky_tr_tr 104 | matrix_results[5][count] = self.kernelY.kernel(Y_tst,Y_tst) #Ky_tst_tst 105 | matrix_results[6][count] = cdist(Z_tst, Z_tr, 'sqeuclidean') #D_tst_tr: square distance matrix 106 | matrix_results[7][count] = cdist(Z_tr, Z_tr, 'sqeuclidean') #D_tr_tr: square distance matrix 107 | count = count + 1 108 | return matrix_results 109 | 110 | 111 | 112 | 113 | 114 | # compute the gradient of the total cverror with respect to lambda 115 | def compute_gradient_totalcverr_wrt_lambda(self,matrix_results,lambda_val,sigmasq_z): 116 | # 0: K_tst_tr; 1: K_tr_tr; 2: D_tst_tr; 3: D_tr_tr 117 | num_sample_cv = self.num_samples 118 | ttl_num_folds = np.shape(matrix_results)[1] 119 | gradient_cverr_per_fold = np.zeros(ttl_num_folds) 120 | for jj in range(ttl_num_folds): 121 | uu = np.shape(matrix_results[3][jj])[0] # number of training samples 122 | M_tst_tr = exp(matrix_results[2][jj]*(-0.5)*sigmasq_z**(-1)) 123 | M_tr_tr = exp(matrix_results[3][jj]*(-0.5)*sigmasq_z**(-1)) 124 | lower_ZZ = cholesky(M_tr_tr+ lambda_val*eye(uu), lower=True) 125 | ZZ = cho_solve((lower_ZZ,True),eye(uu)) 126 | first_term = matrix_results[0][jj].dot(ZZ.dot(ZZ.dot(M_tst_tr.T))) 127 | second_term = M_tst_tr.dot(ZZ.dot(ZZ.dot( 128 | matrix_results[1][jj].dot(ZZ.dot(M_tst_tr.T))))) 129 | gradient_cverr_per_fold[jj] = trace(first_term-second_term) 130 | return 2*sum(gradient_cverr_per_fold)/float(num_sample_cv) 131 | 132 | 133 | # lambda = exp(eta) 134 | def compute_gradient_totalcverr_wrt_eta(self,matrix_results,lambda_val,sigmasq_z): 135 | # 0: K_tst_tr; 1: K_tr_tr; 2: D_tst_tr; 3: D_tr_tr 136 | #eta = log(lambda_val) 137 | #gamma = log(sigmasq_z) 138 | num_sample_cv = self.num_samples 139 | ttl_num_folds = np.shape(matrix_results)[1] 140 | gradient_cverr_per_fold = np.zeros(ttl_num_folds) 141 | for jj in range(ttl_num_folds): 142 | uu = np.shape(matrix_results[3][jj])[0] # number of training samples 143 | M_tst_tr = exp(matrix_results[2][jj]*(-0.5)*sigmasq_z**(-1)) 144 | M_tr_tr = exp(matrix_results[3][jj]*(-0.5)*sigmasq_z**(-1)) 145 | lower_ZZ = cholesky(M_tr_tr+ lambda_val*eye(uu), lower=True) 146 | ZZ = cho_solve((lower_ZZ,True),eye(uu)) 147 | EE = lambda_val*eye(uu) 148 | first_term = matrix_results[0][jj].dot(ZZ.dot(EE.dot(ZZ.dot(M_tst_tr.T)))) 149 | second_term = first_term.T 150 | third_term = -M_tst_tr.dot(ZZ.dot(EE.dot(ZZ.dot( 151 | matrix_results[1][jj].dot(ZZ.dot(M_tst_tr.T)))))) 152 | fourth_term = -M_tst_tr.dot(ZZ.dot( 153 | matrix_results[1][jj].dot(ZZ.dot(EE.dot(ZZ.dot(M_tst_tr.T)))))) 154 | gradient_cverr_per_fold[jj] = trace(first_term + second_term + third_term + fourth_term) 155 | return sum(gradient_cverr_per_fold)/float(num_sample_cv) 156 | 157 | 158 | 159 | 160 | 161 | # compute the gradient of the total cverror with respect to sigma_z squared 162 | def compute_gradient_totalcverr_wrt_sqsigma(self,matrix_results,lambda_val,sigmasq_z): 163 | # 0: K_tst_tr; 1: K_tr_tr; 2: D_tst_tr; 3: D_tr_tr 164 | num_sample_cv = self.num_samples 165 | ttl_num_folds = np.shape(matrix_results)[1] 166 | gradient_cverr_per_fold = np.zeros(ttl_num_folds) 167 | for jj in range(ttl_num_folds): 168 | uu = np.shape(matrix_results[3][jj])[0] 169 | log_M_tr_tst = matrix_results[2][jj].T*(-0.5)*sigmasq_z**(-1) 170 | M_tr_tst = exp(log_M_tr_tst) 171 | log_M_tr_tr = matrix_results[3][jj]*(-0.5)*sigmasq_z**(-1) 172 | M_tr_tr = exp(log_M_tr_tr) 173 | lower_ZZ = cholesky(M_tr_tr+ lambda_val*eye(uu), lower=True) 174 | ZZ = cho_solve((lower_ZZ,True),eye(uu)) 175 | term_1 = matrix_results[0][jj].dot(ZZ.dot((M_tr_tr*sigmasq_z**(-1)*(-log_M_tr_tr)).dot(ZZ.dot(M_tr_tst)))) 176 | term_2 = -matrix_results[0][jj].dot(ZZ.dot(M_tr_tst*(-log_M_tr_tst*sigmasq_z**(-1)))) 177 | term_3 = (sigmasq_z**(-1)*(M_tr_tst.T)*(-log_M_tr_tst.T)).dot(ZZ.dot(matrix_results[1][jj].dot(ZZ.dot(M_tr_tst)))) 178 | term_4 = -(M_tr_tst.T).dot(ZZ.dot((M_tr_tr*sigmasq_z**(-1)*(-log_M_tr_tr)).dot(ZZ.dot(matrix_results[1][jj].dot( 179 | ZZ.dot(M_tr_tst)))))) 180 | term_5 = -(M_tr_tst.T).dot(ZZ.dot(matrix_results[1][jj].dot(ZZ.dot((M_tr_tr*sigmasq_z**(-1)*(-log_M_tr_tr)).dot( 181 | ZZ.dot(M_tr_tst)))))) 182 | term_6 = (M_tr_tst.T).dot(ZZ.dot(matrix_results[1][jj].dot(ZZ.dot(M_tr_tst*sigmasq_z**(-1)*(-log_M_tr_tst))))) 183 | gradient_cverr_per_fold[jj] = trace(2*term_1 + 2*term_2 + term_3 + term_4 + term_5 + term_6) 184 | return sum(gradient_cverr_per_fold)/float(num_sample_cv) 185 | 186 | 187 | 188 | 189 | def compute_gradient_totalcverr_wrt_gamma(self,matrix_results,lambda_val,sigmasq_z): 190 | # 0: K_tst_tr; 1: K_tr_tr; 2: D_tst_tr; 3: D_tr_tr 191 | #eta = log(lambda_val) 192 | #gamma = log(sigmasq_z) 193 | num_sample_cv = self.num_samples 194 | ttl_num_folds = np.shape(matrix_results)[1] 195 | gradient_cverr_per_fold = np.zeros(ttl_num_folds) 196 | for jj in range(ttl_num_folds): 197 | uu = np.shape(matrix_results[3][jj])[0] 198 | log_M_tr_tst = matrix_results[2][jj].T*(-0.5)*sigmasq_z**(-1) 199 | M_tr_tst = exp(log_M_tr_tst) 200 | log_M_tr_tr = matrix_results[3][jj]*(-0.5)*sigmasq_z**(-1) 201 | M_tr_tr = exp(log_M_tr_tr) 202 | lower_ZZ = cholesky(M_tr_tr+ lambda_val*eye(uu), lower=True) 203 | ZZ = cho_solve((lower_ZZ,True),eye(uu)) 204 | term_1 = matrix_results[0][jj].dot(ZZ.dot((M_tr_tr*(-log_M_tr_tr)).dot(ZZ.dot(M_tr_tst)))) 205 | term_2 = -matrix_results[0][jj].dot(ZZ.dot(M_tr_tst*(-log_M_tr_tst))) 206 | term_3 = (M_tr_tst.T*(-log_M_tr_tst).T).dot(ZZ.dot(matrix_results[1][jj].dot(ZZ.dot(M_tr_tst)))) 207 | term_4 = -(M_tr_tst.T).dot(ZZ.dot((M_tr_tr*(-log_M_tr_tr)).dot(ZZ.dot(matrix_results[1][jj].dot( 208 | ZZ.dot(M_tr_tst)))))) 209 | term_5 = -(M_tr_tst.T).dot(ZZ.dot(matrix_results[1][jj].dot(ZZ.dot((M_tr_tr*(-log_M_tr_tr)).dot( 210 | ZZ.dot(M_tr_tst)))))) 211 | term_6 = (M_tr_tst.T).dot(ZZ.dot(matrix_results[1][jj].dot(ZZ.dot(M_tr_tst*(-log_M_tr_tst))))) 212 | gradient_cverr_per_fold[jj] = trace(2*term_1 + 2*term_2 + term_3 + term_4 + term_5 + term_6) 213 | return sum(gradient_cverr_per_fold)/float(num_sample_cv) 214 | 215 | 216 | 217 | # compute the total cverror 218 | def compute_totalcverr(self,matrix_results,lambda_val,sigmasq_z): 219 | # 0: K_tst_tr; 1: K_tr_tr; 2: K_tst_tst; 3: D_tst_tr; 4: D_tr_tr 220 | num_sample_cv = self.num_samples 221 | ttl_num_folds = np.shape(matrix_results)[1] 222 | cverr_per_fold = np.zeros(ttl_num_folds) 223 | for jj in range(ttl_num_folds): 224 | uu = np.shape(matrix_results[4][jj])[0] # number of training samples 225 | M_tst_tr = exp(matrix_results[3][jj]*(-0.5)*sigmasq_z**(-1)) 226 | M_tr_tr = exp(matrix_results[4][jj]*(-0.5)*sigmasq_z**(-1)) 227 | lower_ZZ = cholesky(M_tr_tr+ lambda_val*eye(uu), lower=True) 228 | ZZ = cho_solve((lower_ZZ,True),eye(uu)) 229 | first_term = matrix_results[2][jj] 230 | second_term = - matrix_results[0][jj].dot(ZZ.dot(M_tst_tr.T)) 231 | third_term = np.transpose(second_term) 232 | fourth_term = M_tst_tr.dot(ZZ.dot( 233 | matrix_results[1][jj].dot(ZZ.dot(M_tst_tr.T)))) 234 | cverr_per_fold[jj] = trace(first_term + second_term + third_term + fourth_term) 235 | return sum(cverr_per_fold)/float(num_sample_cv) 236 | 237 | 238 | 239 | 240 | def compute_GD_lambda_sigmasq_for_TotalCVerr_with_fix_step_logspace(self, matrix_results,initial_lambda, initial_sigmasq): 241 | EE = log(initial_lambda) # initialisation of the lambda value 242 | GG = log(initial_sigmasq) # initialisation of the sigma square value for z 243 | count = 0 244 | log_lambda_path = [EE] 245 | log_sigma_square_path = [GG] 246 | Error_path = [self.compute_totalcverr(matrix_results,lambda_val = exp(EE),sigmasq_z=exp(GG))] 247 | d_part_matrix_results = [matrix_results[ii] for ii in [0,1,3,4]] 248 | Grad_EE = self.compute_gradient_totalcverr_wrt_eta(d_part_matrix_results, exp(EE), exp(GG)) 249 | Grad_GG = self.compute_gradient_totalcverr_wrt_gamma(d_part_matrix_results, exp(EE), exp(GG)) 250 | while (sum(np.array([abs(Grad_EE),abs(Grad_GG)]) >= 0.00001) == 2 and count < self.max_iter): 251 | Grad_EE_old = self.compute_gradient_totalcverr_wrt_eta(d_part_matrix_results, exp(EE), exp(GG)) 252 | EE = EE - self.learning_rate*Grad_EE_old 253 | Grad_EE = self.compute_gradient_totalcverr_wrt_eta(d_part_matrix_results, exp(EE), exp(GG)) 254 | log_lambda_path = np.concatenate((log_lambda_path,[EE])) 255 | Error_path = np.concatenate((Error_path,[self.compute_totalcverr(matrix_results,lambda_val = exp(EE), sigmasq_z=exp(GG))])) 256 | 257 | if sum(np.array([abs(Grad_EE),abs(Grad_GG)]) >= 0.00001) == 2 and count < self.max_iter: 258 | Grad_GG_old = self.compute_gradient_totalcverr_wrt_gamma(d_part_matrix_results, exp(EE), exp(GG)) 259 | GG = GG - self.learning_rate*Grad_GG_old 260 | Grad_GG = self.compute_gradient_totalcverr_wrt_gamma(d_part_matrix_results, exp(EE), exp(GG)) 261 | log_sigma_square_path = np.concatenate((log_sigma_square_path,[GG])) 262 | Error_path = np.concatenate((Error_path,[self.compute_totalcverr(matrix_results,lambda_val = exp(EE), sigmasq_z=exp(GG))])) 263 | 264 | else: 265 | break 266 | count = count+1 267 | return log_lambda_path[count], log_lambda_path, log_sigma_square_path[count], log_sigma_square_path,Error_path 268 | 269 | 270 | 271 | def compute_GD_lambda_for_TotalCVerr_with_fix_step_logspace(self, matrix_results,initial_lambda, sigmasq_z): 272 | EE = log(initial_lambda) # initialisation of the lambda value 273 | count = 0 274 | log_lambda_path = [EE] 275 | Error_path = [self.compute_totalcverr(matrix_results,lambda_val = exp(EE),sigmasq_z=sigmasq_z)] 276 | d_part_matrix_results = [matrix_results[ii] for ii in [0,1,3,4]] 277 | Grad_EE = self.compute_gradient_totalcverr_wrt_eta(d_part_matrix_results, exp(EE), sigmasq_z) 278 | while (abs(Grad_EE) >= 0.00001 and count < self.max_iter): 279 | Grad_EE_old = Grad_EE 280 | EE = EE - self.learning_rate*Grad_EE_old 281 | Grad_EE = self.compute_gradient_totalcverr_wrt_eta(d_part_matrix_results, exp(EE), sigmasq_z) 282 | log_lambda_path = np.concatenate((log_lambda_path,[EE])) 283 | Error_path =
np.concatenate((Error_path,[self.compute_totalcverr(matrix_results,lambda_val = exp(EE), sigmasq_z=sigmasq_z)])) 284 | count = count+1 285 | 286 | return log_lambda_path[count], log_lambda_path, Error_path 287 | 288 | 289 | 290 | def compute_lambda_sigmasq_through_grid_search(self, matrix_results_x, matrix_results_y, lambda_vals, sigmasq_vals): 291 | # 0: K_tst_tr; 1: K_tr_tr; 2: K_tst_tst; 3: D_tst_tr; 4: D_tr_tr 292 | #print "parameter opt via grid search" 293 | num_of_lambdas = np.shape(lambda_vals)[0] 294 | num_of_sigmasq = np.shape(sigmasq_vals)[0] 295 | total_cverr_matrix_x = np.reshape(np.zeros(num_of_sigmasq*num_of_lambdas), (num_of_sigmasq,num_of_lambdas)) 296 | total_cverr_matrix_y = np.reshape(np.zeros(num_of_sigmasq*num_of_lambdas), (num_of_sigmasq,num_of_lambdas)) 297 | for ss in range(num_of_sigmasq): 298 | for ll in range(num_of_lambdas): 299 | #print "Bandwidth numb; Lambdaval numb:", (ss,ll) 300 | total_cverr_matrix_x[ss,ll] = self.compute_totalcverr(matrix_results_x, lambda_vals[ll], sigmasq_vals[ss]) 301 | total_cverr_matrix_y[ss,ll] = self.compute_totalcverr(matrix_results_y, lambda_vals[ll], sigmasq_vals[ss]) 302 | x_sigmasq_idx, x_lambda_idx = np.where(total_cverr_matrix_x == np.min(total_cverr_matrix_x)) 303 | y_sigmasq_idx, y_lambda_idx = np.where(total_cverr_matrix_y == np.min(total_cverr_matrix_y)) 304 | if np.shape(x_sigmasq_idx)[0] > 1: 305 | x_sigmasq = sigmasq_vals[x_sigmasq_idx[0]] 306 | x_lambda = lambda_vals[x_lambda_idx[0]] 307 | else: 308 | x_sigmasq = sigmasq_vals[x_sigmasq_idx] 309 | x_lambda = lambda_vals[x_lambda_idx] 310 | if np.shape(y_sigmasq_idx[0]) > 1: 311 | y_sigmasq = sigmasq_vals[y_sigmasq_idx[0]] 312 | y_lambda = lambda_vals[y_lambda_idx[0]] 313 | else: 314 | y_sigmasq = sigmasq_vals[y_sigmasq_idx] 315 | y_lambda = lambda_vals[y_lambda_idx] 316 | return x_sigmasq, x_lambda, y_sigmasq, y_lambda, total_cverr_matrix_x,total_cverr_matrix_y 317 | 318 | 319 | 320 | 321 | 322 | def compute_lambda_through_grid_search(self, matrix_results_x, lambda_vals, sigmasq_xz): 323 | # 0: K_tst_tr; 1: K_tr_tr; 2: K_tst_tst; 3: D_tst_tr; 4: D_tr_tr 324 | #print "lambda parameter opt via grid search" 325 | num_of_lambdas = np.shape(lambda_vals)[0] 326 | total_cverr_matrix_x = np.reshape(np.zeros(num_of_lambdas), (num_of_lambdas,1)) 327 | for ll in range(num_of_lambdas): 328 | total_cverr_matrix_x[ll,0] = self.compute_totalcverr(matrix_results_x, lambda_vals[ll], sigmasq_xz) 329 | 330 | x_lambda_idx = np.where(total_cverr_matrix_x == np.min(total_cverr_matrix_x)) 331 | if np.shape(x_lambda_idx)[0] > 1: 332 | x_lambda = lambda_vals[x_lambda_idx[0]] 333 | else: 334 | x_lambda = lambda_vals[x_lambda_idx] 335 | return x_lambda, total_cverr_matrix_x 336 | 337 | 338 | 339 | 340 | 341 | 342 | def compute_test_statistics_and_others(self, data_x, data_y, data_z): 343 | if self.grid_search or self.GD_optimise: 344 | matrix_results = self.compute_matrices_for_gradient_totalcverr(data_x,data_y,data_z) 345 | matrix_results_x = [matrix_results[ii] for ii in [0,1,2,6,7]] 346 | matrix_results_y = [matrix_results[ii] for ii in [3,4,5,6,7]] 347 | if self.GD_optimise: # Gradient descent with fixed learning rate 348 | if self.optimise_lambda_only: 349 | if self.kernelZ_use_median: 350 | sigmaz = self.kernelZ.get_sigma_median_heuristic(data_z) 351 | self.kernelZ.set_width(float(sigmaz)) 352 | self.sigmasq_xz = self.sigmasq_yz = sigmaz**2 353 | #print "Gradient Descent Optimisation in log space, fixed step for lambda X" 354 | log_lambda_X, log_lambda_pathx, X_CVerror = 
self.compute_GD_lambda_for_TotalCVerr_with_fix_step_logspace(matrix_results_x, 355 | self.initial_lambda_x, self.sigmasq_xz) 356 | self.lambda_X = exp(log_lambda_X) 357 | #print X_CVerror 358 | #print "Gradient Descent Optimisation in log space, fixed step for lambda Y" 359 | log_lambda_Y, log_lambda_pathy, Y_CVerror = self.compute_GD_lambda_for_TotalCVerr_with_fix_step_logspace(matrix_results_y, 360 | self.initial_lambda_y, self.sigmasq_yz) 361 | self.lambda_Y = exp(log_lambda_Y) 362 | #print Y_CVerror 363 | else: 364 | #print "Gradient Descent Optimisation in log space, fixed step for lambda X and sigma XZ" 365 | log_lambda_X, _, log_sigmasq_xz, _, X_CVerror = self.compute_GD_lambda_sigmasq_for_TotalCVerr_with_fix_step_logspace(matrix_results_x, 366 | initial_lambda=self.initial_lambda_x, initial_sigmasq=self.initial_sigmasq) 367 | self.lambda_X = exp(log_lambda_X) 368 | self.sigmasq_xz = exp(log_sigmasq_xz) 369 | #print X_CVerror 370 | #print "Gradient Descent Optimisation in log space, fixed step for lambda Y and sigma YZ" 371 | log_lambda_Y, _, log_sigmasq_yz, _, Y_CVerror = self.compute_GD_lambda_sigmasq_for_TotalCVerr_with_fix_step_logspace(matrix_results_y, 372 | initial_lambda=self.initial_lambda_y, initial_sigmasq=self.initial_sigmasq) 373 | self.lambda_Y = exp(log_lambda_Y) 374 | self.sigmasq_yz = exp(log_sigmasq_yz) 375 | #print Y_CVerror 376 | 377 | elif self.grid_search: 378 | if self.optimise_lambda_only: 379 | if self.kernelZ_use_median: 380 | sigmaz = self.kernelZ.get_sigma_median_heuristic(data_z) 381 | self.kernelZ.set_width(float(sigmaz)) 382 | self.sigmasq_xz = self.sigmasq_yz = sigmaz**2 383 | #print "Grid Search Optimisation in log space for lambda X" 384 | self.lambda_X, X_CVerror = self.compute_lambda_through_grid_search(matrix_results_x, self.lambda_val,self.sigmasq_xz) 385 | #print X_CVerror 386 | #print "Grid Search Optimisation in log space for lambda Y" 387 | self.lambda_Y, Y_CVerror = self.compute_lambda_through_grid_search(matrix_results_y, self.lambda_val,self.sigmasq_yz) 388 | #print Y_CVerror 389 | else: 390 | self.sigmasq_xz, self.lambda_X, self.sigmasq_yz, self.lambda_Y, X_CVerror, Y_CVerror = \ 391 | self.compute_lambda_sigmasq_through_grid_search(matrix_results_x, matrix_results_y, self.lambda_val, self.sigmasq_vals) 392 | #print X_CVerror 393 | #print Y_CVerror 394 | else: 395 | raise NotImplementedError 396 | 397 | else: 398 | if self.lambda_X == None: 399 | self.lambda_X = self.lambda_val[0] 400 | if self.lambda_Y == None: 401 | self.lambda_Y = self.lambda_val[0] 402 | if self.sigmasq_xz == None: 403 | sigmaz = self.kernelZ.get_sigma_median_heuristic(data_z) 404 | self.kernelZ.set_width(float(sigmaz)) 405 | self.sigmasq_xz = sigmaz**2 406 | if self.sigmasq_yz == None: 407 | sigmaz = self.kernelZ.get_sigma_median_heuristic(data_z) 408 | self.kernelZ.set_width(float(sigmaz)) 409 | self.sigmasq_yz = sigmaz**2 410 | X_CVerror = 0 411 | Y_CVerror = 0 412 | 413 | #print "lambda value for (X,Y)", (self.lambda_X,self.lambda_Y) 414 | #print "sigma squared for (XZ, YZ)", (self.sigmasq_xz, self.sigmasq_yz) 415 | test_size = self.num_samples 416 | if not self.RESIT_type: 417 | test_Kx, test_Ky, _ = self.compute_kernel_matrix_on_data_CI(data_x, data_y, data_z) 418 | D_z = cdist(data_z, data_z, 'sqeuclidean') 419 | test_Kzx = exp(D_z*(-0.5)*self.sigmasq_xz**(-1)) 420 | test_Kzy = exp(D_z*(-0.5)*self.sigmasq_yz**(-1)) 421 | weight_xz = solve(test_Kzx/float(self.lambda_X)+np.identity(test_size),np.identity(test_size)) 422 | weight_yz = 
solve(test_Kzy/float(self.lambda_Y)+np.identity(test_size),np.identity(test_size)) 423 | K_epsilon_x = weight_xz.dot(test_Kx.dot(weight_xz)) 424 | K_epsilon_y = weight_yz.dot(test_Ky.dot(weight_yz)) 425 | else: 426 | #print "RESIT Computation" 427 | if self.kernelZ_use_median: 428 | sigmaz = self.kernelZ.get_sigma_median_heuristic(data_z) 429 | self.kernelZ.set_width(float(sigmaz)) 430 | test_Kz = self.kernelZ.kernel(data_z) 431 | weight_xz = solve(test_Kz/float(self.lambda_X)+np.identity(test_size),np.identity(test_size)) 432 | weight_yz = solve(test_Kz/float(self.lambda_Y)+np.identity(test_size),np.identity(test_size)) 433 | residual_xz = weight_xz.dot(data_x) 434 | residual_yz = weight_yz.dot(data_y) 435 | if self.kernelRxz_use_median: 436 | sigmaRxz = self.kernelRxz.get_sigma_median_heuristic(residual_xz) 437 | self.kernelRxz.set_width(float(sigmaRxz)) 438 | if self.kernelRyz_use_median: 439 | sigmaRyz = self.kernelRyz.get_sigma_median_heuristic(residual_yz) 440 | self.kernelRyz.set_width(float(sigmaRyz)) 441 | K_epsilon_x = self.kernelRxz.kernel(residual_xz) 442 | K_epsilon_y = self.kernelRyz.kernel(residual_yz) 443 | 444 | hsic_statistic = self.HSIC_V_statistic(K_epsilon_x, K_epsilon_y) 445 | #print "HSIC Statistics", hsic_statistic 446 | return hsic_statistic, K_epsilon_x, K_epsilon_y, X_CVerror, Y_CVerror 447 | 448 | 449 | 450 | 451 | 452 | def compute_null_samples_and_pvalue(self,data_x=None,data_y=None,data_z=None): 453 | ''' data_x,data_y, data_z are the given data that we wish to test 454 | the conditional independence given data_z. 455 | > each data set has the number of samples = number of rows 456 | > the bandwidth for training set and test set will be different (as we will calculate as soon as data comes in) 457 | ''' 458 | if data_x is None and data_y is None and data_z is None: 459 | if not self.streaming and not self.freeze_data: 460 | start = time.clock() 461 | self.generate_data(isConditionalTesting=True) 462 | data_generating_time = time.clock()-start 463 | data_x = self.data_x 464 | data_y = self.data_y 465 | data_z = self.data_z 466 | #print "dimension of data:", np.shape(data_x) 467 | else: 468 | data_generating_time = 0. 469 | 470 | else: 471 | data_generating_time = 0. 
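        # (The block below is the permutation null: each shuffle applies one
        #  random permutation to both the rows and the columns of K_epsilon_y,
        #  and the p-value uses the standard +1 correction,
        #  p = (1 + #{null samples > observed statistic}) / (1 + num_shuffles).)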
472 | #print 'Data generating time passed: ', data_generating_time 473 | hsic_statistic, K_epsilon_x, K_epsilon_y, X_CVerror, Y_CVerror = self.compute_test_statistics_and_others(data_x, data_y, data_z) 474 | if self.num_shuffles != 0: 475 | ny = np.shape(K_epsilon_y)[0] 476 | null_samples = np.zeros(self.num_shuffles) 477 | for jj in range(self.num_shuffles): 478 | pp = permutation(ny) 479 | Kp = K_epsilon_y[pp,:][:,pp] 480 | null_samples[jj] = self.HSIC_V_statistic(K_epsilon_x, Kp) 481 | pvalue = ( sum( null_samples > hsic_statistic ) + 1) / float( self.num_shuffles + 1) 482 | #print "P-value:", pvalue 483 | else: 484 | pvalue = None 485 | null_samples = 0 486 | #print "Not interested in P-value" 487 | return null_samples, hsic_statistic, pvalue, X_CVerror, Y_CVerror,data_generating_time 488 | 489 | 490 | 491 | def compute_pvalue_with_time_tracking(self, data_x = None, data_y = None, data_z = None): 492 | if self.lambda_X is not None and self.lambda_Y is not None: 493 | self.GD_optimise = False 494 | self.grid_search = False 495 | self.lambda_val = [1] 496 | _, _, pvalue, _, _, data_generating_time = self.compute_null_samples_and_pvalue(data_x = data_x, 497 | data_y = data_y, data_z = data_z) 498 | return pvalue, data_generating_time 499 | 500 | 501 | 502 | 503 | -------------------------------------------------------------------------------- /weak_conditional_independence_testing/WHO_KRESITvsRESIT.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Comparing RESIT and KRESIT on WHO data. 3 | To run the script in terminal: 4 | $ python WHO_KRESITvsRESIT.py 5 | The p-values will be printed and a data plot will be shown. 6 | 7 | ''' 8 | 9 | import os, sys 10 | BASE_DIR = os.path.join( os.path.dirname( __file__ ), '..' 
) 11 | sys.path.append(BASE_DIR) 12 | 13 | import os 14 | import pandas as pd 15 | import numpy as np 16 | from numpy.random import normal 17 | from scipy.stats import t 18 | from scipy import stats, linalg 19 | import matplotlib.pyplot as plt 20 | from pylab import rcParams 21 | 22 | from kerpy.GaussianKernel import GaussianKernel 23 | from kerpy.LinearKernel import LinearKernel 24 | from TwoStepCondTestObject import TwoStepCondTestObject 25 | # Import data 26 | data = pd.DataFrame.from_csv("WHO_dataset.csv") #WHO_dataset.csv is the same as WHO1 original.csv 27 | Gross_national_income = np.reshape(data.iloc[:,1],(202,1)) 28 | Expenditure_on_health = np.reshape(data.iloc[:,5],(202,1)) 29 | 30 | # Remove missing values 31 | data_yz = np.concatenate((Gross_national_income,Expenditure_on_health),axis=1) 32 | data_yz = data_yz[~np.isnan(data_yz).any(axis=1)] 33 | data_z = data_yz[:,[0]] #(178,1) 34 | data_y = data_yz[:,[1]] 35 | 36 | # log transform data z to make the concentrated data more spread out 37 | data_z = np.log(data_z) 38 | 39 | 40 | # range of values for grid search 41 | num_lambdaval = 30 42 | lambda_val = 10**np.linspace(-6,-1, num=num_lambdaval) 43 | z_bandwidth = None 44 | #num_bandwidth = 20 45 | #z_bandwidth = 10**np.linspace(-5,1,num = num_bandwidth) 46 | 47 | # some parameter settings 48 | num_samples = np.shape(data_z)[0] 49 | data_generator=None 50 | num_trials = 1 51 | pvals_KRESIT = np.reshape(np.zeros(num_trials),(num_trials,1)) 52 | pvals_RESIT = np.reshape(np.zeros(num_trials),(num_trials,1)) 53 | 54 | 55 | 56 | # computing Type I error (Null model is true) 57 | for jj in xrange(num_trials): 58 | #print "number of trial:", jj 59 | 60 | data_x = np.reshape(np.zeros(num_samples),(num_samples,1)) 61 | noise_x = np.reshape(normal(0,1,np.shape(data_z)[0]),(np.shape(data_z)[0],1)) 62 | coin_flip_x = np.random.choice([0,1],replace=True,size=num_samples) 63 | data_x[coin_flip_x == 0] = (data_z[coin_flip_x == 0]-10)**2 64 | data_x[coin_flip_x == 1] = -(data_z[coin_flip_x == 1]-10)**2+35 65 | data_x = data_x + noise_x 66 | 67 | 68 | # KRESIT: 69 | kernelX = GaussianKernel(1.) 70 | kernelY = GaussianKernel(1.) 71 | kernelZ = GaussianKernel(1.) 
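# (The widths of 1. above are placeholders: since kernelX/Y/Z_use_median are
#  all set to True in the KRESIT test object below, each Gaussian bandwidth is
#  replaced by its median heuristic value computed from the data before the
#  test is run.)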
72 | mytestobject = TwoStepCondTestObject(num_samples, None, 73 | kernelX, kernelY, kernelZ, 74 | kernelX_use_median=True, 75 | kernelY_use_median=True, 76 | kernelZ_use_median=True, 77 | kernelRxz = LinearKernel(), kernelRyz = LinearKernel(), 78 | kernelRxz_use_median = False, 79 | kernelRyz_use_median = False, 80 | RESIT_type = False, 81 | num_shuffles=800, 82 | lambda_val=lambda_val,lambda_X = None, lambda_Y = None, 83 | optimise_lambda_only = True, 84 | sigmasq_vals = z_bandwidth ,sigmasq_xz = 1., sigmasq_yz = 1., 85 | K_folds=5, grid_search = True, 86 | GD_optimise=False) 87 | 88 | pvals_KRESIT[jj,], _ = mytestobject.compute_pvalue(data_x,data_y,data_z) 89 | 90 | 91 | # RESIT: 92 | kernelX = LinearKernel() 93 | kernelY = LinearKernel() 94 | mytestobject_RESIT = TwoStepCondTestObject(num_samples, None, 95 | kernelX, kernelY, kernelZ, 96 | kernelX_use_median=False, 97 | kernelY_use_median=False, 98 | kernelZ_use_median=True, 99 | kernelRxz = GaussianKernel(1.), kernelRyz = GaussianKernel(1.), 100 | kernelRxz_use_median = True, 101 | kernelRyz_use_median = True, 102 | RESIT_type = True, 103 | num_shuffles=800, 104 | lambda_val=lambda_val,lambda_X = None, lambda_Y = None, 105 | optimise_lambda_only = True, 106 | sigmasq_vals = z_bandwidth ,sigmasq_xz = 1., sigmasq_yz = 1., 107 | K_folds=5, grid_search = True, 108 | GD_optimise=False) 109 | 110 | pvals_RESIT[jj,], _ = mytestobject_RESIT.compute_pvalue(data_x,data_y,data_z) 111 | 112 | 113 | 114 | #np.savetxt("WHO_KRESIT_rejection_rate.csv", pvals_KRESIT, delimiter=",") 115 | #np.savetxt("WHO_RESIT_rejection_rate.csv", pvals_RESIT, delimiter=",") 116 | 117 | 118 | if num_trials > 1: 119 | print "Type I error (KRESIT):", np.shape(filter(lambda x: x<0.05 ,pvals_KRESIT))[0]/float(num_trials) 120 | print "Type I error (RESIT):", np.shape(filter(lambda x: x<0.05 ,pvals_RESIT))[0]/float(num_trials) 121 | elif num_trials == 1: 122 | print "pval our approach:", pvals_KRESIT 123 | print "pval RESIT:", pvals_RESIT 124 | 125 | 126 | # Plot of the data 127 | rcParams['figure.figsize'] = 9, 4.7 128 | plt.figure(1) 129 | plt.subplot(121) 130 | plt.plot(data_z, data_y,'.') 131 | plt.ylabel('Y = Expenditure on health per cap') 132 | plt.xlabel('Z = log(Gross national income per cap)') 133 | 134 | plt.subplot(122) 135 | plt.plot(data_z,data_x,'.') 136 | plt.ylabel('X ') 137 | plt.xlabel('Z') 138 | plt.show() 139 | 140 | #plot_name = "WHO_" + "logZ_"+ "quadraticX" + ".pdf" 141 | #plt.savefig(plot_name, format='pdf') 142 | #plt.show() 143 | 144 | 145 | 146 | -------------------------------------------------------------------------------- /weak_conditional_independence_testing/WHO_dataset.csv: -------------------------------------------------------------------------------- 1 | Country,CountryID,Gross national income per cap,Government expenditure on health per cap,Adult female obesity (%),Income per person,Per capita total expenditure on health (US$) Afghanistan,1,,8,,874,23 Albania,2,6000,127,,5369,174 Algeria,3,5940,146,,6011,123 Andorra,4,,2054,,,2815 Angola,5,3890,61,,3533,71 Antigua and Barbuda,6,15130,439,,14579,517 Argentina,7,11670,758,,11063,551 Armenia,8,4950,112,15.5,3903,99 Australia,9,33940,2097,,32798,3316 Austria,10,36040,2729,,34108,3864 Azerbaijan,11,5430,67,,4648,86 Bahamas,12,,775,,23021,1311 Bahrain,13,34310,669,,27236,810 Bangladesh,14,1230,26,,1268,13 Barbados,15,15150,722,,15837,785 Belarus,16,9700,428,,8541,244 Belgium,17,33860,2264,13.4,32077,3565 Belize,18,7080,254,,7290,229 Benin,19,1250,25,6.1,1390,29 Bermuda,20,,,,69916.79, 
Bhutan,21,4000,73,,3694,65 Bolivia,22,3810,128,15.1,3618,79 Bosnia and Herzegovina,23,6780,454,25.2,6506,258 Botswana,24,11730,487,,12057,378 Brazil,25,8700,367,13.1,8596,426 Brunei Darussalam,26,49900,314,,47465,543 Bulgaria,27,10270,443,,9353,283 Burkina Faso,28,1130,50,2.4,1140,27 Burundi,29,320,4,,420.08,4 Cambodia,30,1550,43,1.2,1453,30 Cameroon,31,2060,23,8.2,1995,51 Canada,32,36280,2585,13.9,35078,3912 Cape Verde,33,2590,227,,2831,129 Central African Republic,34,690,20,,675,14 Chad,35,1170,14,1.5,1749,22 Chile,36,11300,367,25,12262,473 China,37,4660,144,3.4,4091,90 Colombia,38,6130,534,16.6,6306,217 Comoros,39,1140,19,,1063,16 "Congo, Dem. Rep.",40,270,7,,264,6 "Congo, Rep.",41,2420,13,7.5,3621,42 Cook Islands,42,,518,65.7,9000,427 Costa Rica,43,9220,565,,8661,353 Cote d'Ivoire,44,1580,15,,1575,35 Croatia,45,13850,869,22.7,13232,722 Cuba,46,,329,,7407.24,355 Cyprus,47,25060,759,11.8,24473,1483 Czech Republic,48,20920,1309,16.3,20281,943 Denmark,49,36190,2812,9.1,33626,4828 Djibouti,50,2180,75,,1964,62 Dominica,51,7870,311,,8576,297 Dominican Republic,52,5550,140,,5173,223 Ecuador,53,6810,130,,6533,166 Egypt,54,4940,129,46.6,5049,93 El Salvador,55,5610,227,,5403,191 Equatorial Guinea,56,16620,219,,11999,274 Eritrea,57,680,10,1.6,685,10 Estonia,58,18090,734,14.9,16654,620 Ethiopia,59,630,13,0.7,591,7 Fiji,60,4450,199,26.4,4209,149 Finland,61,33170,1940,13.5,30469,2994 France,62,32240,2833,,29644,4056 French Polynesia,63,,,,26016.11, Gabon,64,11180,198,8.2,12742,267 Gambia,65,1110,33,,726,13 Georgia,66,3880,76,,3505,147 Germany,67,32680,2548,12.3,30496,3669 Ghana,68,1240,36,8.1,1225,35 Greece,69,30870,1317,18.2,25520,2733 Grenada,70,8770,387,,9128,346 Guatemala,71,5120,98,,4897,144 Guinea,72,1130,14,3,946,20 Guinea-Bissau,73,460,10,,569,13 Guyana,74,3410,223,,3232,67 Haiti,75,1070,65,6.3,1175,42 Honduras,76,3420,116,18.8,3266,99 "Hong Kong, China",77,,,,35680, Hungary,78,16970,978,18.2,17014,853 Iceland,79,33740,2758,12.3,35630,4962 India,80,2460,21,2.8,2126,39 Indonesia,81,3310,44,3.6,3234,34 Iran (Islamic Republic of),82,9800,406,19.2,10692,247 Iraq,83,,90,38.2,3200,67 Ireland,84,34730,2413,12,38058,3888 Israel,85,23840,1477,,23846,1618 Italy,86,28970,2022,8.9,27750,2845 Jamaica,87,7050,127,,7132,180 Japan,88,32840,2067,3.3,30290,2690 Jordan,89,4820,257,16.2,4294,246 Kazakhstan,90,8700,214,,8699,189 Kenya,91,1470,51,6.3,1359,29 Kiribati,92,6230,268,,3377,117 "Korea, Dem. 
Rep.",93,,42,,1596.68,0 "Korea, Rep.",94,22990,819,,21342,1187 Kuwait,95,48310,422,,44947,796 Kyrgyzstan,96,1790,55,,1728,34 Lao People's Democratic Republic,97,1740,18,1.6,1811,22 Latvia,98,14840,615,19.5,13218,533 Lebanon,99,9600,285,,10212,468 Lesotho,100,1810,88,16.1,1415,49 Liberia,101,260,25,,383,10 Libyan Arab Jamahiriya,102,11630,189,,10804,255 Lithuania,103,14550,728,19.2,14085,545 Luxembourg,104,60870,5233,,70014,6610 "Macao, China",105,,,,37256, Macedonia,106,7850,446,,7393,245 Madagascar,107,870,21,1,988,9 Malawi,108,690,51,2.4,691,20 Malaysia,109,12160,226,18.8,11466,255 Maldives,110,4740,742,,4017,306 Mali,111,1000,34,3.7,1027,30 Malta,112,20990,1419,21.3,20410,1295 Marshall Islands,113,8040,589,,,298 Mauritania,114,1970,31,16.7,1691,19 Mauritius,115,10640,292,,10155,223 Mexico,116,11990,327,28.1,11317,500 Micronesia (Federated States of),117,6070,444,,,266 Moldova,118,2660,107,18.2,2362,68 Monaco,119,,5309,,,6343 Mongolia,120,2810,124,12.5,2643,53 Montenegro,121,8930,93,,,306 Morocco,122,3860,98,11,3547,95 Mozambique,123,660,39,3.9,743,17 Myanmar,124,510,7,,831,4 Namibia,125,4770,218,,4547,167 Nauru,126,,444,60.5,2500,605 Nepal,127,1010,24,0.9,1081,17 Netherlands,128,37940,2768,,34724,3784 Netherlands Antilles,129,,,,22700.41, New Caledonia,130,,,,31942.83, New Zealand,131,25750,1905,23.2,24554,2420 Nicaragua,132,2720,137,18,2611,76 Niger,133,630,14,3.2,613,10 Nigeria,134,1410,15,5.8,1892,32 Niue,135,,294,,,1045 Norway,136,50070,3780,5.9,47551,6267 Oman,137,19740,321,,20334,325 Pakistan,138,2410,8,,2396,16 Palau,139,14340,1003,,,835 Panama,140,8690,495,,8399,380 Papua New Guinea,141,1630,111,,1747,29 Paraguay,142,4040,131,,3900,117 Peru,143,6490,171,13.1,6466,145 Philippines,144,3430,88,,2932,45 Poland,145,14250,636,19.9,13573,556 Portugal,146,19960,1494,,20006,1830 Puerto Rico,147,,,,19725.45, Qatar,148,,1115,,68696,2753 Romania,149,10150,433,9.5,9374,315 Russia,150,12740,404,,11861,369 Rwanda,151,730,134,1.3,813,32 Saint Kitts and Nevis,152,12440,403,,13677,569 Saint Lucia,153,8500,237,,9279,339 Saint Vincent and the Grenadines,154,6220,289,,6752,233 Samoa,155,5090,188,66.3,4872,120 San Marino,156,,2765,,,3591 Sao Tome and Principe,157,1490,120,,1460,58 Saudi Arabia,158,22300,468,,21220,491 Senegal,159,1560,23,7.2,1676,40 Serbia,160,9320,373,,,247 Seychelles,161,14360,602,35.2,14202,573 Sierra Leone,162,610,20,,790,9 Singapore,163,43300,413,7.3,41479,1035 Slovakia,164,17060,913,15,15881,718 Slovenia,165,23970,1507,13.8,23004,1599 Solomon Islands,166,1850,99,,1712,34 Somalia,167,,8,,932.96,8 South Africa,168,8900,364,,8477,456 Spain,169,28200,1732,13.5,27270,2263 Sri Lanka,170,3730,105,,3481,60 Sudan,171,1780,23,,2249,38 Suriname,172,7720,151,,7234,254 Swaziland,173,4700,219,,4384,138 Sweden,174,34310,2533,9.5,31995,3870 Switzerland,175,40840,2598,7.5,35520,5878 Syria,176,4110,52,,4059,66 Taiwan,177,,,,26069, Tajikistan,178,1560,16,,1413,21 Tanzania,179,980,27,4.4,1018,18 Thailand,180,7440,223,,6869,113 Timor-Leste,181,5100,150,,,52 Togo,182,770,20,,888,19 Tonga,183,5470,218,74.9,5135,121 Trinidad and Tobago,184,16800,438,,15352,568 Tunisia,185,6490,214,,6461,159 Turkey,186,8410,461,22.7,7786,406 Turkmenistan,187,3990,172,10.3,4247,161 Tuvalu,188,,189,,,281 Uganda,189,880,39,4.1,991,25 Ukraine,190,6110,298,11.3,5583,159 United Arab Emirates,191,31190,491,,33487,982 United Kingdom,192,33650,2434,23,31580,3361 United States of America,193,44070,3074,33.2,41674,6714 Uruguay,194,9940,430,,9266,476 Uzbekistan,195,2190,89,7.1,1975,30 Vanuatu,196,3480,90,25.2,3477,68 
Venezuela,197,10970,196,,9876,332 Vietnam,198,2310,86,,2142,46 West Bank and Gaza,199,,,,3542, Yemen,200,2090,38,,2276,40 Zambia,201,1140,29,3,1175,49 Zimbabwe,202,,77,19.4,538,36 -------------------------------------------------------------------------------- /weak_conditional_independence_testing/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxcsml/kerpy/50b175961d13e0e1f625aa987ae41cb98bfe4d84/weak_conditional_independence_testing/__init__.py --------------------------------------------------------------------------------