├── MANIFEST.in ├── requirements.txt ├── doc ├── data.npz ├── fit.png ├── misc.md └── bces-examples.ipynb ├── setup.cfg ├── .gitignore ├── bces ├── __init__.py └── bces.py ├── setup.py ├── LICENSE ├── tests └── test_bces.py └── README.md /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include README.md 2 | include LICENSE -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | scipy 3 | tqdm 4 | numdifftools 5 | nmmn -------------------------------------------------------------------------------- /doc/data.npz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rsnemmen/BCES/HEAD/doc/data.npz -------------------------------------------------------------------------------- /doc/fit.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rsnemmen/BCES/HEAD/doc/fit.png -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | description-file = README.md 3 | license_files=LICENSE -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.py[co] 2 | 3 | # Packages 4 | *.egg 5 | *.egg-info 6 | dist 7 | build 8 | eggs 9 | parts 10 | bin 11 | var 12 | sdist 13 | develop-eggs 14 | .installed.cfg 15 | .idea -------------------------------------------------------------------------------- /bces/__init__.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Python module for performing robust linear regression on (X,Y) data points where 4 | both X and Y have 
measurement errors. 5 | 6 | The fitting method is the bivariate correlated errors and intrinsic scatter 7 | (BCES) and follows the description given in Akritas, M. G., & Bershady, M. A. 8 | Astrophysical Journal, 1996, 470, 706. 9 | 10 | Author: Rodrigo Nemmen, https://rodrigonemmen.com 11 | """ 12 | 13 | 14 | 15 | 16 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | with open('README.md') as f: 4 | readme = f.read() 5 | 6 | with open('LICENSE') as f: 7 | license = f.read() 8 | 9 | setup( 10 | name='bces', 11 | version='1.5.1', 12 | description='Python module for performing linear regression for data with measurement errors and intrinsic scatter', 13 | long_description=readme, 14 | long_description_content_type='text/markdown', 15 | author='Rodrigo Nemmen', 16 | author_email='rodrigo.nemmen@iag.usp.br', 17 | url='https://github.com/rsnemmen/BCES', 18 | download_url = 'https://github.com/rsnemmen/BCES/archive/1.5.1.tar.gz', 19 | license=license, 20 | keywords = ['statistics', 'fitting', 'linear-regression','machine-learning'], 21 | packages=find_packages(exclude=('tests', 'docs')) 22 | ) -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Rodrigo Nemmen 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above 
copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /doc/misc.md: -------------------------------------------------------------------------------- 1 | Other information 2 | ================== 3 | 4 | The BCES Python package is inspired by the much faster Fortran routine [originally written by Akritas et al.](http://www.astro.wisc.edu/%7Emab/archive/stats/stats.html). I wrote it because I wanted something more portable and easier to use, at the cost of speed. 5 | 6 | For a general tutorial on how to, and how not to, perform linear regression, [please read this paper: Hogg, D. et al. 2010, arXiv:1008.4686](http://labs.adsabs.harvard.edu/adsabs/abs/2010arXiv1008.4686H/). In particular, *please refrain from using the bisector method*. 7 | 8 | If you want to plot confidence bands for your fits, have a look at the [`nmmn` package](https://github.com/rsnemmen/nmmn) (in particular, modules `nmmn.plots.fitconf` and `stats`). 9 | 10 | 11 | # Bayesian linear regression 12 | 13 | There are several Bayesian approaches to linear regression that can be more powerful than BCES; some of them are described below. Their advantage is that they give you posterior distributions on the parameters. Some of them also handle censored data (i.e. upper limits), which BCES does not. 
14 | 15 | **A Gibbs Sampler for Multivariate Linear Regression:** 16 | [R code](https://github.com/abmantz/lrgs), [arXiv:1509.00908](http://arxiv.org/abs/1509.00908). 17 | Linear regression in the fairly general case of errors in X and Y, which may be correlated, plus intrinsic scatter. The prior distribution of covariates is modeled by a flexible mixture of Gaussians. This is an extension of the very nice work by Brandon Kelly [(Kelly, B. 2007, ApJ)](http://labs.adsabs.harvard.edu/adsabs/abs/2007ApJ...665.1489K/). 18 | 19 | **LIRA: A Bayesian approach to linear regression in astronomy:** [R code](https://github.com/msereno/lira), [arXiv:1509.05778](http://arxiv.org/abs/1509.05778) 20 | Bayesian hierarchical modelling of data with heteroscedastic and possibly correlated measurement errors and intrinsic scatter. The method fully accounts for time evolution. The slope, the normalization, and the intrinsic scatter of the relation can evolve with the redshift. The intrinsic distribution of the independent variable is approximated using a mixture of Gaussian distributions whose means and standard deviations depend on time. The method can address scatter in the measured independent variable (a kind of Eddington bias), selection effects in the response variable (Malmquist bias), and departure from linearity in the form of a knee. 21 | 22 | **AstroML: Machine Learning and Data Mining for Astronomy.** 23 | [Python example](http://www.astroml.org/book_figures/chapter8/fig_total_least_squares.html) of a linear fit to data with correlated errors in x and y using AstroML. In the literature, this is often referred to as total least squares or errors-in-variables fitting. 24 | 25 | -------------------------------------------------------------------------------- /tests/test_bces.py: -------------------------------------------------------------------------------- 1 | """ 2 | Basic unit testing functionality. 
3 | """ 4 | import bces.bces as BCES 5 | import numpy as np 6 | 7 | # reads test dataset 8 | data=np.load('doc/data.npz') 9 | xdata=data['x'] 10 | ydata=data['y'] 11 | errx=data['errx'] 12 | erry=data['erry'] 13 | covdata=data['cov'] 14 | 15 | # Correct fit parameters expected for dataset. The parameters in 16 | # ans_pars are such that y=Ax+B: 17 | # 18 | # ans_pars = [ A(y|x), B(y|x), 19 | # A(x|y), B(x|y), 20 | # A(ort), B(ort) ] 21 | # 22 | # i.e., each pair contains the expected result for one of the BCES 23 | # regression methods. 24 | # 25 | ans_pars=np.array([ [0.57955173, 17.88855826], 26 | [0.26053751, 32.70952271], 27 | [0.50709256, 21.25491222] ]) 28 | # covariance matrix 29 | ans_cov=np.array([[ 5.85029731e-04, -2.72808055e-02], 30 | [-2.72808055e-02, 1.27299029e+00]]) 31 | 32 | def test_yx(): 33 | """ 34 | Test BCES Y|X fit without bootstrapping. 35 | """ 36 | # fit 37 | a,b,erra,errb,covab=BCES.bces(xdata,errx,ydata,erry,covdata) 38 | 39 | np.testing.assert_array_almost_equal([ans_pars[0,0],ans_pars[0,1]], np.array([a[0],b[0]])) 40 | 41 | def test_xy(): 42 | """ 43 | Test BCES X|Y fit without bootstrapping. 44 | """ 45 | # fit 46 | a,b,erra,errb,covab=BCES.bces(xdata,errx,ydata,erry,covdata) 47 | 48 | np.testing.assert_array_almost_equal([ans_pars[1,0],ans_pars[1,1]], np.array([a[1],b[1]])) 49 | 50 | def test_ort(): 51 | """ 52 | Test BCES orthogonal fit without bootstrapping. 53 | """ 54 | # fit 55 | a,b,erra,errb,covab=BCES.bces(xdata,errx,ydata,erry,covdata) 56 | 57 | np.testing.assert_array_almost_equal([ans_pars[2,0],ans_pars[2,1]], np.array([a[3],b[3]])) 58 | 59 | def test_yxboot(): 60 | """ 61 | Test BCES Y|X fit with bootstrapping. 
62 | """ 63 | # fit 64 | a,b,erra,errb,covab=BCES.bcesp(xdata,errx,ydata,erry,covdata) 65 | 66 | # check if the regression parameters match within 1 decimal 67 | np.testing.assert_array_almost_equal([ans_pars[0,0],ans_pars[0,1]], np.array([a[0],b[0]]),1) 68 | 69 | def test_ortboot(): 70 | """ 71 | Test BCES orthogonal fit with bootstrapping. 72 | """ 73 | # fit 74 | a,b,erra,errb,covab=BCES.bcesp(xdata,errx,ydata,erry,covdata) 75 | 76 | # check if the regression parameters match within 1 decimal 77 | np.testing.assert_array_almost_equal([ans_pars[2,0],ans_pars[2,1]], np.array([a[3],b[3]]),1) 78 | 79 | def test_bootstrap(): 80 | """ 81 | Test if bootstrap is working correctly. 82 | """ 83 | import scipy.stats 84 | 85 | # number of bootstrap samples 86 | nboot=5000 87 | 88 | # bootstrapping procedure 89 | ts=[] # test statistic 90 | for i in range(nboot): 91 | xsim=BCES.bootstrap(xdata) 92 | tsim,asim,psim=scipy.stats.anderson_ksamp([xdata,xsim]) 93 | ts.append(tsim) 94 | 95 | ts=np.array(ts) 96 | 97 | # is the simulated (bootstrapped) dataset consistent with the 98 | # original one? 99 | assert np.median(ts)>> a,b,aerr,berr,covab=bces(x,xerr,y,yerr,cov) 22 | 23 | Output: 24 | 25 | - a,b : best-fit parameters a,b of the linear regression 26 | - aerr,berr : the standard deviations in a,b 27 | - covab : the covariance between a and b (e.g. for plotting confidence bands) 28 | 29 | Arguments: 30 | 31 | - x,y : data 32 | - xerr,yerr: measurement errors affecting x and y 33 | - cov : covariance between the measurement errors 34 | (all are arrays) 35 | """ 36 | # Arrays holding the code main results for each method: 37 | # Elements: 0-Y|X, 1-X|Y, 2-bisector, 3-orthogonal 38 | a,b,avar,bvar,covarxiz,covar_ba=np.zeros(4),np.zeros(4),np.zeros(4),np.zeros(4),np.zeros(4),np.zeros(4) 39 | # Lists holding the xi and zeta arrays for each method above 40 | xi,zeta=[],[] 41 | 42 | # Calculate sigma's for datapoints using length of conf. 
intervals 43 | sig11var = np.mean( y1err**2 ) 44 | sig22var = np.mean( y2err**2 ) 45 | sig12var = np.mean( cerr ) 46 | 47 | # Covariance of Y1 (X) and Y2 (Y) 48 | covar_y1y2 = np.mean( (y1-y1.mean())*(y2-y2.mean()) ) 49 | 50 | # Compute the regression slopes 51 | a[0] = (covar_y1y2 - sig12var)/(y1.var() - sig11var) # Y|X 52 | a[1] = (y2.var() - sig22var)/(covar_y1y2 - sig12var) # X|Y 53 | a[2] = ( a[0]*a[1] - 1.0 + np.sqrt((1.0 + a[0]**2)*(1.0 + a[1]**2)) ) / (a[0]+a[1]) # bisector 54 | if covar_y1y2<0: 55 | sign = -1. 56 | else: 57 | sign = 1. 58 | a[3] = 0.5*((a[1]-(1./a[0])) + sign*np.sqrt(4.+(a[1]-(1./a[0]))**2)) # orthogonal 59 | 60 | # Compute intercepts 61 | for i in range(4): 62 | b[i]=y2.mean()-a[i]*y1.mean() 63 | 64 | # Set up variables to calculate standard deviations of slope/intercept 65 | xi.append( ( (y1-y1.mean()) * (y2-a[0]*y1-b[0]) + a[0]*y1err**2 ) / (y1.var()-sig11var) ) # Y|X 66 | xi.append( ( (y2-y2.mean()) * (y2-a[1]*y1-b[1]) - y2err**2 ) / covar_y1y2 ) # X|Y 67 | xi.append( xi[0] * (1.+a[1]**2)*a[2] / ((a[0]+a[1])*np.sqrt((1.+a[0]**2)*(1.+a[1]**2))) + xi[1] * (1.+a[0]**2)*a[2] / ((a[0]+a[1])*np.sqrt((1.+a[0]**2)*(1.+a[1]**2))) ) # bisector 68 | xi.append( xi[0] * a[3]/(a[0]**2*np.sqrt(4.+(a[1]-1./a[0])**2)) + xi[1]*a[3]/np.sqrt(4.+(a[1]-1./a[0])**2) ) # orthogonal 69 | for i in range(4): 70 | zeta.append( y2 - a[i]*y1 - y1.mean()*xi[i] ) 71 | 72 | for i in range(4): 73 | # Calculate variance for all a and b 74 | avar[i]=xi[i].var()/xi[i].size 75 | bvar[i]=zeta[i].var()/zeta[i].size 76 | 77 | # Sample covariance obtained from xi and zeta (paragraph after equation 15 in AB96) 78 | covarxiz[i]=np.mean( (xi[i]-xi[i].mean()) * (zeta[i]-zeta[i].mean()) ) 79 | 80 | # Covariance between a and b (equation after eq. 15 in AB96) 81 | covar_ab=covarxiz/y1.size 82 | 83 | return a,b,np.sqrt(avar),np.sqrt(bvar),covar_ab 84 | 85 | 86 | 87 | 88 | def bootstrap(v): 89 | """ 90 | Constructs Monte Carlo simulated data set using the 91 | Bootstrap algorithm. 
92 | 93 | Usage: 94 | 95 | >>> bootstrap(x) 96 | 97 | where x is either an array or a list of arrays. If it is a 98 | list, the code returns the corresponding list of bootstrapped 99 | arrays assuming that the same position in these arrays maps the 100 | same "physical" object. 101 | """ 102 | if isinstance(v, list): 103 | vboot=[] # list of bootstrapped arrays 104 | n=v[0].size 105 | iran=scipy.stats.randint.rvs(0,n,size=n) # Array of random indexes 106 | for x in v: vboot.append(x[iran]) 107 | else: # if v is an array, not a list of arrays 108 | n=v.size 109 | iran=scipy.stats.randint.rvs(0,n,size=n) # Array of random indexes 110 | vboot=v[iran] 111 | 112 | return vboot 113 | 114 | 115 | 116 | 117 | 118 | 119 | def bcesboot(y1,y1err,y2,y2err,cerr,nsim=10000): 120 | """ 121 | Does the BCES with bootstrapping. 122 | 123 | Usage: 124 | 125 | >>> a,b,aerr,berr,covab=bcesboot(x,xerr,y,yerr,cov,nsim) 126 | 127 | :param x,y: data 128 | :param xerr,yerr: measurement errors affecting x and y 129 | :param cov: covariance between the measurement errors (all are arrays) 130 | :param nsim: number of Monte Carlo simulations (bootstraps) 131 | 132 | :returns: a,b -- best-fit parameters a,b of the linear regression 133 | :returns: aerr,berr -- the standard deviations in a,b 134 | :returns: covab -- the covariance between a and b (e.g. for plotting confidence bands) 135 | 136 | .. note:: this method is not nearly as fast as bces_regress.f. Needs to be optimized. Maybe adapt the Fortran routine using f2py? 137 | """ 138 | import tqdm 139 | 140 | print("Bootstrapping progress:") 141 | 142 | """ 143 | My convention for storing the results of the bces code below as 144 | matrices for later processing is as follows: 145 | 146 | simulation-method y|x x|y bisector orthogonal 147 | sim0 ... 148 | Am = sim1 ... 149 | sim2 ... 150 | sim3 ... 151 | """ 152 | for i in tqdm.tqdm(range(nsim)): 153 | # This is needed for small datasets. With a dataset of e.g. 
4 points, 154 | # bootstrapping can generate a mock array with 4 repeated points. This 155 | # will cause an error in the linear regression. 156 | allEquals=True 157 | while allEquals: 158 | [y1sim,y1errsim,y2sim,y2errsim,cerrsim]=bootstrap([y1,y1err,y2,y2err,cerr]) 159 | 160 | allEquals=allEqual(y1sim) 161 | 162 | asim,bsim,errasim,errbsim,covabsim=bces(y1sim,y1errsim,y2sim,y2errsim,cerrsim) 163 | 164 | if i==0: 165 | # Initialize the matrices 166 | am,bm=asim.copy(),bsim.copy() 167 | else: 168 | am=np.vstack((am,asim)) 169 | bm=np.vstack((bm,bsim)) 170 | 171 | 172 | if np.isnan(am).any(): 173 | am,bm=checkNan(am,bm) 174 | 175 | # Bootstrapping results 176 | a=np.array([ am[:,0].mean(),am[:,1].mean(),am[:,2].mean(),am[:,3].mean() ]) 177 | b=np.array([ bm[:,0].mean(),bm[:,1].mean(),bm[:,2].mean(),bm[:,3].mean() ]) 178 | nsim=am.shape[0] # rows actually kept; checkNan may have dropped failed fits 179 | # Error from unbiased sample variances 180 | erra,errb,covab=np.zeros(4),np.zeros(4),np.zeros(4) 181 | for i in range(4): 182 | erra[i]=np.sqrt( 1./(nsim-1) * ( np.sum(am[:,i]**2)-nsim*(am[:,i].mean())**2 )) 183 | errb[i]=np.sqrt( 1./(nsim-1) * ( np.sum(bm[:,i]**2)-nsim*(bm[:,i].mean())**2 )) 184 | covab[i]=1./(nsim-1) * ( np.sum(am[:,i]*bm[:,i])-nsim*am[:,i].mean()*bm[:,i].mean() ) 185 | 186 | return a,b,erra,errb,covab 187 | 188 | 189 | 190 | def checkNan(am,bm): 191 | """ 192 | Sometimes, if the dataset is very small, the regression parameters in 193 | some instances of the bootstrapped sample may have NaNs i.e. failed 194 | regression (I need to investigate this in more detail). 195 | 196 | This method checks to see if there are NaNs in the bootstrapped 197 | fits and removes them from the final sample. 198 | """ 199 | import nmmn.lsd 200 | 201 | idel=nmmn.lsd.findnan(am[:,2]) 202 | print("Bootstrapping error: regression failed in",np.size(idel),"instances. They were removed.") 203 | 204 | return np.delete(am,idel,0),np.delete(bm,idel,0) 205 | 206 | 207 | 208 | def allEqual(x): 209 | """ 210 | Check if all elements in an array are equal. 
211 | Returns True if they are all the same. 212 | """ 213 | from itertools import groupby 214 | 215 | g = groupby(x) 216 | 217 | return next(g, True) and not next(g, False) 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | # Methods which make use of parallelization 226 | # =========================================== 227 | 228 | 229 | def ab(x): 230 | """ 231 | This method is the big bottleneck of the parallel BCES code. That's the 232 | reason why I put these calculations in a separate method, in order to 233 | distribute this among the cores. In the original BCES method, this is 234 | inside the main routine. 235 | 236 | Argument: 237 | [y1,y1err,y2,y2err,cerr,nsim] 238 | where nsim is the number of bootstrapping trials sent to each core. 239 | 240 | :returns: am,bm : the matrices with slope and intercept where each row corresponds to a bootstrap trial and each column maps a different BCES method (y|x, x|y, bisector, orthogonal). 241 | 242 | Be very careful and do not use lambda functions when calling this 243 | method and passing it to multiprocessing or ipython.parallel! 244 | I spent >2 hours figuring out why the code was not working until I 245 | realized the reason was the use of lambda functions. 246 | """ 247 | y1,y1err,y2,y2err,cerr,nsim=x[0],x[1],x[2],x[3],x[4],x[5] 248 | 249 | for i in range(int(nsim)): 250 | # This is needed for small datasets. With datasets of 4 points or fewer, 251 | # bootstrapping can generate a mock array with 4 repeated points. This 252 | # will cause an error in the linear regression. 
253 | allEquals=True 254 | while allEquals: 255 | [y1sim,y1errsim,y2sim,y2errsim,cerrsim]=bootstrap([y1,y1err,y2,y2err,cerr]) 256 | 257 | allEquals=allEqual(y1sim) 258 | 259 | asim,bsim,errasim,errbsim,covabsim=bces(y1sim,y1errsim,y2sim,y2errsim,cerrsim) 260 | 261 | if i==0: 262 | # Initialize the matrixes 263 | am,bm=asim.copy(),bsim.copy() 264 | else: 265 | am=np.vstack((am,asim)) 266 | bm=np.vstack((bm,bsim)) 267 | 268 | return am,bm 269 | 270 | 271 | 272 | 273 | 274 | def bcesp(y1,y1err,y2,y2err,cerr,nsim=10000): 275 | """ 276 | Parallel implementation of the BCES with bootstrapping. 277 | Divide the bootstraps equally among the threads (cores) of 278 | the machine. It will automatically detect the number of 279 | cores available. 280 | 281 | Usage: 282 | 283 | >>> a,b,aerr,berr,covab=bcesp(x,xerr,y,yerr,cov,nsim) 284 | 285 | :param x,y: data 286 | :param xerr,yerr: measurement errors affecting x and y 287 | :param cov: covariance between the measurement errors (all are arrays) 288 | :param nsim: number of Monte Carlo simulations (bootstraps) 289 | 290 | :returns: a,b - best-fit parameters a,b of the linear regression 291 | :returns: aerr,berr - the standard deviations in a,b 292 | :returns: covab - the covariance between a and b (e.g. for plotting confidence bands) 293 | 294 | .. seealso:: Check out ~/work/projects/playground/parallel python/bcesp.py for the original, testing, code. I deleted some line from there to make the "production" version. 295 | 296 | * v1 Mar 2012: serial version ported from bces_regress.f. Added covariance output. 297 | * v2 May 3rd 2012: parallel version ported from nemmen.bcesboot. 298 | 299 | .. codeauthor: Rodrigo Nemmen 300 | """ 301 | import time # for benchmarking 302 | import multiprocessing 303 | 304 | print("BCES,", nsim,"trials... ") 305 | tic=time.time() 306 | 307 | # Find out number of cores available 308 | ncores=multiprocessing.cpu_count() 309 | # We will divide the processing into how many parts? 
310 | n=2*ncores 311 | 312 | """ 313 | Must create lists that will be distributed among the many 314 | cores with structure 315 | core1 <- [y1,y1err,y2,y2err,cerr,nsim/n] 316 | core2 <- [y1,y1err,y2,y2err,cerr,nsim/n] 317 | etc... 318 | """ 319 | pargs=[] # this is a list of lists! 320 | for i in range(n): 321 | pargs.append([y1,y1err,y2,y2err,cerr,nsim/n]) 322 | 323 | # Initializes the parallel engine 324 | pool = multiprocessing.Pool(processes=ncores) # multiprocessing package 325 | 326 | """ 327 | Each core processes ab(input) 328 | return matrices Am,Bm with the results of nsim/n 329 | presult[i][0] = Am with nsim/n rows 330 | presult[i][1] = Bm with nsim/n rows 331 | """ 332 | presult=pool.map(ab, pargs) # multiprocessing 333 | pool.close() # close the parallel engine 334 | 335 | # vstack the matrices processed from all cores 336 | i=0 337 | for m in presult: 338 | if i==0: 339 | # Initialize the matrices 340 | am,bm=m[0].copy(),m[1].copy() 341 | else: 342 | am=np.vstack((am,m[0])) 343 | bm=np.vstack((bm,m[1])) 344 | i=i+1 345 | 346 | if np.isnan(am).any(): 347 | am,bm=checkNan(am,bm) 348 | 349 | # Computes the bootstrapping results on the stacked matrices 350 | a=np.array([ am[:,0].mean(),am[:,1].mean(),am[:,2].mean(),am[:,3].mean() ]) 351 | b=np.array([ bm[:,0].mean(),bm[:,1].mean(),bm[:,2].mean(),bm[:,3].mean() ]) 352 | nsim=am.shape[0] # rows actually stacked; int(nsim/n) rounding in ab() and NaN removal can change the count 353 | # Error from unbiased sample variances 354 | erra,errb,covab=np.zeros(4),np.zeros(4),np.zeros(4) 355 | for i in range(4): 356 | erra[i]=np.sqrt( 1./(nsim-1) * ( np.sum(am[:,i]**2)-nsim*(am[:,i].mean())**2 )) 357 | errb[i]=np.sqrt( 1./(nsim-1) * ( np.sum(bm[:,i]**2)-nsim*(bm[:,i].mean())**2 )) 358 | covab[i]=1./(nsim-1) * ( np.sum(am[:,i]*bm[:,i])-nsim*am[:,i].mean()*bm[:,i].mean() ) 359 | 360 | print("%f s" % (time.time() - tic)) 361 | 362 | return a,b,erra,errb,covab 363 | 364 | 365 | 366 | -------------------------------------------------------------------------------- /doc/bces-examples.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Examples of how to use the BCES fitting code\n", 8 | "===================\n", 9 | "\n", 10 | "BCES python module [available on Github](https://github.com/rsnemmen/BCES)." 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "%matplotlib inline\n", 20 | "import numpy as np\n", 21 | "import matplotlib.pyplot as plt" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "name": "stdout", 31 | "output_type": "stream", 32 | "text": [ 33 | "/Users/nemmen/Dropbox/codes/python/bces/doc\n" 34 | ] 35 | } 36 | ], 37 | "source": [ 38 | "cd '/Users/nemmen/Dropbox/codes/python/bces/doc'" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 3, 44 | "metadata": { 45 | "scrolled": true 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "import bces.bces as BCES" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "# Example 1\n", 57 | "\n", 58 | "In this example, the data contains uncertainties on both $x$ and $y$; no correlation between uncertainties. These are real astronomical data for blazars from [this paper](http://science.sciencemag.org/content/338/6113/1445.full). " 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 5, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "data=np.load('data.npz')\n", 68 | "xdata=data['x']\n", 69 | "ydata=data['y']\n", 70 | "errx=data['errx']\n", 71 | "erry=data['erry']\n", 72 | "cov=data['cov']" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "The regression line is $y = Ax + B$. `covab` is the resulting covariance matrix which can be used to draw confidence regions." 
80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 6, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "# number of bootstrapping trials\n", 89 | "nboot=10000" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 7, 95 | "metadata": {}, 96 | "outputs": [ 97 | { 98 | "name": "stdout", 99 | "output_type": "stream", 100 | "text": [ 101 | "BCES, 10000 trials... \n", 102 | "1.401265 s\n", 103 | "CPU times: user 34.4 ms, sys: 60.3 ms, total: 94.7 ms\n", 104 | "Wall time: 1.42 s\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "%%time\n", 110 | "# Performs the BCES fit in parallel\n", 111 | "a,b,erra,errb,covab=BCES.bcesp(xdata,errx,ydata,erry,cov,nboot)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "## Selecting the fitting method\n", 119 | "\n", 120 | "Select the desired BCES method by setting the variable `bcesMethod`. The available methods are:\n", 121 | "\n", 122 | "| Value | Method | Description |\n", 123 | "|---|---| --- |\n", 124 | "| 0 | $y|x$ | Assumes $x$ as the independent variable |\n", 125 | "| 1 | $x|y$ | Assumes $y$ as the independent variable |\n", 126 | "| 2 | bisector | Line that bisects the $y|x$ and $x|y$ lines. *Do not use this method*, cf. [Hogg, D. et al. 2010, arXiv:1008.4686](http://labs.adsabs.harvard.edu/adsabs/abs/2010arXiv1008.4686H/). |\n", 127 | "| 3 | orthogonal | Orthogonal least squares: line that minimizes orthogonal distances. Should be used when it is not clear which variable should be treated as the independent one |\n", 128 | "\n", 129 | "As usual, please read the [original BCES paper](http://labs.adsabs.harvard.edu/adsabs/abs/1996ApJ...470..706A/) to understand what these different lines mean." 
130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 8, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "bcesMethod=0" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 10, 144 | "metadata": { 145 | "scrolled": true 146 | }, 147 | "outputs": [ 148 | { 149 | "data": { 150 | "text/plain": [ 151 | "Text(0, 0.5, '$y$')" 152 | ] 153 | }, 154 | "execution_count": 10, 155 | "metadata": {}, 156 | "output_type": "execute_result" 157 | }, 158 | { 159 | "data": { 160 | "image/png": "\n", 161 | "text/plain": [ 162 | "
" 163 | ] 164 | }, 165 | "metadata": {}, 166 | "output_type": "display_data" 167 | } 168 | ], 169 | "source": [ 170 | "plt.errorbar(xdata,ydata,xerr=errx,yerr=erry,fmt='o')\n", 171 | "x=np.linspace(xdata.min(),xdata.max())\n", 172 | "plt.plot(x,a[bcesMethod]*x+b[bcesMethod],'-k',label=\"BCES $y|x$\")\n", 173 | "\n", 174 | "plt.legend()\n", 175 | "plt.xlabel('$x$')\n", 176 | "plt.ylabel('$y$')" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "## Confidence band\n", 184 | "\n", 185 | "Suppose you want to include in the plot a visual estimate of the uncertainty on the fit. This is called the confidence band. For example, the $3\\sigma$ confidence interval is 99.7% sure to contain the best-fit regression line. Note that this is *not* the same as saying it will contain 99.7% of the data points. For more information, [check this out](http://www.jerrydallal.com/LHSP/slr.htm).\n", 186 | "\n", 187 | "In order to plot the confidence band, you will need to [install the `nmmn` package](https://github.com/rsnemmen/nmmn#installation) and another dependency: \n", 188 | "\n", 189 | " pip install nmmn numdifftools\n", 190 | "\n", 191 | "After installing the package, follow the instructions below to plot the confidence band of your fit.\n", 192 | "\n", 193 | "First we define convenient arrays that encapsulate the fit parameters and their uncertainties—including the covariance." 
194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 11, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "# array with best-fit parameters\n", 203 | "fitm=np.array([ a[bcesMethod],b[bcesMethod] ])\t\n", 204 | "# covariance matrix of parameter uncertainties\n", 205 | "covm=np.array([ (erra[bcesMethod]**2,covab[bcesMethod]), (covab[bcesMethod],errb[bcesMethod]**2) ])\t\n", 206 | "\n", 207 | "# convenient function for a line\n", 208 | "def func(x): return x[1]*x[0]+x[2]" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "Now we estimate the $3\\sigma$ confidence band using one of the methods in the `nmmn.stats` module. If you want the $1\\sigma$ band instead, just change the 7th argument of `confbandnl` to `0.68`." 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 12, 221 | "metadata": { 222 | "scrolled": true 223 | }, 224 | "outputs": [], 225 | "source": [ 226 | "import nmmn.stats\n", 227 | "\n", 228 | "# Gets lower and upper bounds on the confidence band \n", 229 | "lcb,ucb,x=nmmn.stats.confbandnl(xdata,ydata,func,fitm,covm,2,0.997,x)" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "Finally, the plot with the confidence band displayed in orange. Even at $3\\sigma$, it is still very narrow for this dataset." 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 13, 242 | "metadata": {}, 243 | "outputs": [ 244 | { 245 | "data": { 246 | "text/plain": [ 247 | "Text(0.5, 1.0, 'Data, fit and confidence band')" 248 | ] 249 | }, 250 | "execution_count": 13, 251 | "metadata": {}, 252 | "output_type": "execute_result" 253 | }, 254 | { 255 | "data": { 256 | "image/png": "\n", 257 | "text/plain": [ 258 | "
" 259 | ] 260 | }, 261 | "metadata": {}, 262 | "output_type": "display_data" 263 | } 264 | ], 265 | "source": [ 266 | "plt.errorbar(xdata,ydata,xerr=errx,yerr=erry,fmt='o')\n", 267 | "plt.plot(x,a[bcesMethod]*x+b[bcesMethod],'-k',label=\"BCES $y|x$\")\n", 268 | "plt.fill_between(x, lcb, ucb, alpha=0.3, facecolor='orange')\n", 269 | "\n", 270 | "plt.legend(loc='best')\n", 271 | "plt.xlabel('$x$')\n", 272 | "plt.ylabel('$y$')\n", 273 | "plt.title(\"Data, fit and confidence band\")" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "# Example 2\n", 281 | "\n", 282 | "Fake data with random uncertainties in $x$ and $y$" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "Prepares fake data" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 14, 295 | "metadata": {}, 296 | "outputs": [], 297 | "source": [ 298 | "x=np.arange(1,20)\n", 299 | "y=3*x + 4\n", 300 | "\n", 301 | "xer=np.sqrt((x- np.random.normal(x))**2)\n", 302 | "yer=np.sqrt((y- np.random.normal(y))**2)\n", 303 | "\n", 304 | "y=np.random.normal(y)\n", 305 | "x=np.random.normal(x)" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 15, 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [ 314 | "# simple linear regression\n", 315 | "(aa,bb)=np.polyfit(x,y,deg=1)\n", 316 | "yfit=x*aa+bb" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": 16, 322 | "metadata": { 323 | "scrolled": true 324 | }, 325 | "outputs": [ 326 | { 327 | "name": "stdout", 328 | "output_type": "stream", 329 | "text": [ 330 | "BCES, 10000 trials... 
\n", 331 | "1.311906 s\n" 332 | ] 333 | } 334 | ], 335 | "source": [ 336 | "# BCES fit\n", 337 | "cov=np.zeros(len(x)) # no correlation between error measurements\n", 338 | "nboot=10000 # number of bootstrapping trials\n", 339 | "a,b,aerr,berr,covab=BCES.bcesp(x,xer,y,yer,cov,nboot)" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "The integer selects the BCES method used for plotting (3-orthogonal, 0-y|x, 1-x|y, *don't use the bisector*)" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 17, 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "bcesMethod=3\n", 356 | "ybces=a[bcesMethod]*x+b[bcesMethod]" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 18, 362 | "metadata": { 363 | "scrolled": false 364 | }, 365 | "outputs": [ 366 | { 367 | "data": { 368 | "text/plain": [ 369 | "Text(0, 0.5, '$y$')" 370 | ] 371 | }, 372 | "execution_count": 18, 373 | "metadata": {}, 374 | "output_type": "execute_result" 375 | }, 376 | { 377 | "data": { 378 | "image/png": "\n", 379 | "text/plain": [ 380 | "
" 381 | ] 382 | }, 383 | "metadata": {}, 384 | "output_type": "display_data" 385 | } 386 | ], 387 | "source": [ 388 | "plt.errorbar(x,y,xer,yer,fmt='o',ls='None')\n", 389 | "plt.plot(x,yfit,label='Simple regression')\n", 390 | "plt.plot(x,ybces,label='BCES orthogonal')\n", 391 | "\n", 392 | "plt.legend()\n", 393 | "plt.xlabel('$x$')\n", 394 | "plt.ylabel('$y$')" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "## Confidence band\n", 402 | "\n", 403 | "Again, make sure you install the [`nmmn`](https://github.com/rsnemmen/nmmn#installation) package before proceeding." 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 21, 409 | "metadata": {}, 410 | "outputs": [], 411 | "source": [ 412 | "# array with best-fit parameters\n", 413 | "fitm=np.array([ a[bcesMethod],b[bcesMethod] ])\t\n", 414 | "# covariance matrix of parameter uncertainties\n", 415 | "covm=np.array([ (aerr[bcesMethod]**2,covab[bcesMethod]), (covab[bcesMethod],berr[bcesMethod]**2) ])\t" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "Now we estimate the $2\\sigma$ confidence band using one of the methods in the `nmmn.stats` module." 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 22, 428 | "metadata": { 429 | "scrolled": true 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "# Gets lower and upper bounds on the confidence band \n", 434 | "lcb,ucb,xcb=nmmn.stats.confbandnl(x,y,func,fitm,covm,2,0.954,x)" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "Finally, the plot where the confidence band is displayed in orange. As you can see, it is very narrow." 
442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "execution_count": 23, 447 | "metadata": {}, 448 | "outputs": [ 449 | { 450 | "data": { 451 | "text/plain": [ 452 | "Text(0.5, 1.0, 'Data, fit and confidence band')" 453 | ] 454 | }, 455 | "execution_count": 23, 456 | "metadata": {}, 457 | "output_type": "execute_result" 458 | }, 459 | { 460 | "data": { 461 | "image/png": "\n", 462 | "text/plain": [ 463 | "
" 464 | ] 465 | }, 466 | "metadata": {}, 467 | "output_type": "display_data" 468 | } 469 | ], 470 | "source": [ 471 | "plt.errorbar(x,y,xerr=xer,yerr=yer,fmt='o')\n", 472 | "plt.plot(xcb,a[bcesMethod]*xcb+b[bcesMethod],'-k',label=\"BCES orthogonal\")\n", 473 | "plt.fill_between(xcb, lcb, ucb, alpha=0.3, facecolor='orange')\n", 474 | "\n", 475 | "plt.legend(loc='best')\n", 476 | "plt.xlabel('$x$')\n", 477 | "plt.ylabel('$y$')\n", 478 | "plt.title(\"Data, fit and confidence band\")" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [] 487 | } 488 | ], 489 | "metadata": { 490 | "anaconda-cloud": {}, 491 | "kernelspec": { 492 | "display_name": "Python 3 (ipykernel)", 493 | "language": "python", 494 | "name": "python3" 495 | }, 496 | "language_info": { 497 | "codemirror_mode": { 498 | "name": "ipython", 499 | "version": 3 500 | }, 501 | "file_extension": ".py", 502 | "mimetype": "text/x-python", 503 | "name": "python", 504 | "nbconvert_exporter": "python", 505 | "pygments_lexer": "ipython3", 506 | "version": "3.9.13" 507 | } 508 | }, 509 | "nbformat": 4, 510 | "nbformat_minor": 1 511 | } 512 | --------------------------------------------------------------------------------