├── README
├── march_cps_codebook.pdf
├── more_examples
│   ├── Python
│   │   ├── marchCPS_2010.txt
│   │   └── pandas_examples.py
│   └── R
│       └── weird_formula.r
├── original
│   ├── Python
│   │   ├── marchCPS_2010.txt
│   │   └── py_stats_analysis.py
│   ├── R
│   │   ├── marchCPS_2010.txt
│   │   └── r_stats_analysis.R
│   └── summary.txt
└── revised
    ├── Python
    │   ├── marchCPS_2010.txt
    │   └── py_stats_analysis.py
    ├── R
    │   ├── marchCPS_2010.txt
    │   └── r_stats_analysis.R
    └── summary.txt

--------------------------------------------------------------------------------
/README:
--------------------------------------------------------------------------------
1 | This contains a comparison of R and Python for a simple OLS analysis
2 | of a dataset.
3 | 
4 | The original comparison is in the folder original.
5 | 
6 | A revised comparison, using comments from the scipy users list,
7 | is in the folder revised.
8 | 
9 | This revised comparison has simpler R and Python code and uses pandas
10 | for part of the analysis. (However, it is pandas 0.4, which is
11 | currently in development.)
12 | 
--------------------------------------------------------------------------------
/march_cps_codebook.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chrisjordansquire/r_vs_py/e99b73048f007aec0d3b9790a186a9c3338d9936/march_cps_codebook.pdf
--------------------------------------------------------------------------------
/more_examples/Python/pandas_examples.py:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | import numpy as np
4 | import scikits.statsmodels.api as sm
5 | import matplotlib.pyplot as plt
6 | import scipy.stats as sps
7 | import pandas as pa
8 | 
9 | 
10 | 
11 | """
12 | Load in the data as in the other files, but then we stick it
13 | into a pandas DataFrame. We then use the new pandas 0.4 to
14 | generate some simple summary statistics to understand the data.
15 | 
16 | The major new feature in pandas 0.4 is the multi-index, and the associated
17 | functionality to manipulate it. In a nutshell:
18 | 
19 | *The multi-index is a tuple of objects to index on. This is convenient
20 | for categorical or binned variables (e.g. race, education level,
21 | age range, weight range, etc.).
22 | *These multi-indexes can be used to index both rows and columns.
23 | *The pivot command lets you take a column of data and convert it
24 | to an index in a multi-index.
25 | *The stack/unstack commands let you re-organize indices between
26 | indexing rows or columns.
27 | *The groupby functionality plays well with the new multi-indexes,
28 | exactly as one would hope. In particular, you can also
29 | use groupby on these new multi-indexes.
30 | 
31 | This new multi-index functionality is very nice for replicating the
32 | sort of split-apply-combine functionality that can be accomplished
33 | in SQL or using the plyr package in R.
34 | 
35 | Some potential gotchas I ran into while creating this were:
36 | 
37 | *Thus far, groupby can't respect any non-lexicographic ordering
38 | on grouping variables. So when grouping on ordinal variables
39 | you must ensure the lexicographic ordering is the same as
40 | the natural ordering if you want groupby to respect it.
41 | *If you want the number of items in each group you must use
42 | apply instead of aggregate. aggregate will again return
43 | a DataFrame, whereas you just want a Series. This is true
44 | in general if your desired aggregation would yield the same
45 | value across all columns.
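As a small illustration of these ideas, here is a sketch with a toy
frame (hypothetical data, not the CPS file; the calls mirror the ones
used below on the real data):

    toy = pa.DataFrame({'sex': ['M', 'F', 'M', 'F'],
                        'educ': ['1:HS', '1:HS', '2:BA', '2:BA'],
                        'wt': [1., 2., 3., 4.]})
    g = toy.groupby(['educ', 'sex'])
    g['wt'].sum()            # Series with an (educ, sex) multi-index
    g['wt'].sum().unstack()  # same numbers; rows educ, columns sex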
46 | 
47 | In the following I freely make use of awful, awful names simply
48 | because in the interpreter they're easy to remember and type.
49 | That's a not unreasonable style for interactive use. A b in front
50 | of a variable I create denotes binned, and a g denotes
51 | a groupby.
52 | """
53 | 
54 | # Load and clean the data, removing a few people that don't seem
55 | # to fit most reasonable models.
56 | 
57 | dat = np.genfromtxt('marchCPS_2010.txt', names=True, dtype=None,
58 |                     missing_values='NA')
59 | # Remove everyone without hourly wage data
60 | indhr = ~np.isnan(dat['hrwage'])
61 | hrdat = dat[indhr]
62 | hrdat = np.delete(hrdat, [407, 515, 852, 1197])
63 | 
64 | 
65 | ###
66 | # Create a dataframe and start the exploratory analysis
67 | ###
68 | 
69 | d = pa.DataFrame(hrdat)
70 | 
71 | # The names corresponding to the numerical codes
72 | sex_names = {1:'Male', 2:'Female'}
73 | race_names = {1:'White', 2:'Black', 3:'First Nation', 4:'Asian', 6:'Mixed'}
74 | ptft_names = {1:'1-Part time', 2:'2-Full time'}
75 | occ_names = {1:'Management', 2:'Professional', 3:'Service', 4:'Sales',
76 |              5:'Office Support', 7:'Construction',
77 |              8:'Maintenance', 9:'Production', 10:'Transportation'}
78 | b_age_names = {0:'18-24', 1:'25-29', 2:'30-34', 3:'35-39', 4:'40-44',
79 |                5:'45-49', 6:'50-54', 7:'55-59', 8:'60+'}
80 | 
81 | # Function to create 5-ish year bins for ages
82 | def time_bucket(age):
83 |     return np.digitize(age, [25,30,35,40,45,50,55,60])
84 | 
85 | # Function to bucket similar education levels.
86 | # Leading integers are added to force groupby to use the correct
87 | # ordering, as these variables have an order.
88 | def educ_bucket(educ):
89 |     if educ<39:
90 |         return '1:<HS'
91 |     elif educ==39:
92 |         return '2:HS'
93 |     elif educ==40:
94 |         return '3:Some College'
95 |     elif educ==41 or educ==42:
96 |         return '4:Assoc'
97 |     elif educ==43:
98 |         return '5:College'
99 |     elif educ>43:
100 |         return '6:Grad/Prof'
101 | 
102 | # Replace the numerical codes with the corresponding string names
103 | d['sex'] = d['sex'].apply(sex_names.get)
104 | d['race'] = d['race'].apply(race_names.get)
105 | d['PTFT'] = d['PTFT'].apply(ptft_names.get)
106 | d['occ'] = d['occ'].apply(occ_names.get)
107 | 
108 | # b for binned
109 | d['bage'] = d['age'].apply(time_bucket)
110 | d['bage'] = d['bage'].apply(b_age_names.get)
111 | d['beduc'] = d['educ'].apply(educ_bucket)
112 | 
113 | # Get rid of the long, obscure name for the weights
114 | d['wt'] = d['A_ERNLWT']
115 | del d['A_ERNLWT']
116 | 
117 | # age, sex groupby
118 | gas = d.groupby(['bage', 'sex'])
119 | # age, ptft, sex groupby
120 | gaps = d.groupby(['bage', 'PTFT', 'sex'])
121 | # occ, sex groupby
122 | gos = d.groupby(['occ', 'sex'])
123 | # educ, sex groupby
124 | ges = d.groupby(['beduc', 'sex'])
125 | 
126 | def mf_ratio(df):
127 |     df['ratio'] = df['Female']/df['Male']
128 |     return df
129 | 
130 | def wt_avg(group):
131 |     return np.average(group['hrwage'], weights = group['wt'])
132 | 
133 | # I suspect the following scheme leaves something to be desired numerically.
134 | # It was just translated verbatim from Wikipedia's formula for the weighted
135 | # sample variance.
136 | def wt_std(group):
137 |     mean = np.average(group['hrwage'], weights = group['wt'])
138 |     sos = np.average((group['hrwage'] - mean)**2, weights=group['wt'])
139 |     wt_sum = group['wt'].sum()
140 |     wt_sos = (group['wt']**2).sum()
141 |     return np.sqrt((wt_sum)/(wt_sum**2 - wt_sos) * sos)
142 | 
143 | # Poke around each of the groupby's.
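# (A note on the repeated unstacks used below: a three-key groupby such as
# gaps yields a Series indexed by (bage, PTFT, sex). unstack(level=1) moves
# the PTFT level out to the columns, and the trailing unstack() then moves
# sex out to the columns as well, leaving rows indexed by bage alone.)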
144 | 
145 | # How many observations, unweighted, for each combination
146 | print gas.apply(len).unstack()
147 | print gaps.apply(len).unstack(level=1).unstack()
148 | print gos.apply(len).unstack()
149 | print ges.apply(len).unstack()
150 | 
151 | # How many observations according to the weights
152 | print gas['wt'].sum().unstack()
153 | print gaps['wt'].sum().unstack(level=1).unstack()
154 | print gos['wt'].sum().unstack()
155 | print ges['wt'].sum().unstack()
156 | 
157 | # First look at the weighted mean for each
158 | print gas.apply(wt_avg).unstack()
159 | print gaps.apply(wt_avg).unstack(level=1).unstack()
160 | print gos.apply(wt_avg).unstack()
161 | print ges.apply(wt_avg).unstack()
162 | 
163 | # Then look at the weighted sample standard errors.
164 | # (Turns out that because of the weights they're all tiny,
165 | # so there's no need to think about the confidence intervals.
166 | # But it would be simple to create a function that returned confidence
167 | # intervals as a tuple instead of a scalar point estimate.)
168 | print gas.apply(wt_std).unstack()
169 | print gaps.apply(wt_std).unstack(level=1).unstack()
170 | print gos.apply(wt_std).unstack()
171 | print ges.apply(wt_std).unstack()
172 | 
173 | # Take a closer look at how the ratios change.
174 | # Dealing with the multi-index is slightly ungainly, and can probably
175 | # be improved.
176 | print mf_ratio(gas.apply(wt_avg).unstack())
177 | print mf_ratio(gaps.apply(wt_avg).unstack()).stack().unstack(level=1).unstack()
178 | print mf_ratio(gos.apply(wt_avg).unstack())
179 | print mf_ratio(ges.apply(wt_avg).unstack())
180 | 
181 | 
182 | 
183 | 
--------------------------------------------------------------------------------
/more_examples/R/weird_formula.r:
--------------------------------------------------------------------------------
1 | # This is the code for an example that popped up on the statsmodels
2 | # list a year or so ago, showing a bizarre example of how things
3 | # can work when R is creating model matrices for the user.
4 | 
5 | # (At least that's what the discussion thread was about. I'm not
6 | # sure if the real issue is the model matrix or how the anova
7 | # command works.)
8 | 
9 | # n is the sample size
10 | # d1 is the number of categories in x
11 | # d2 is the number of categories in y
12 | # They're left arbitrary just to show this works for any d1, d2
13 | n<-200
14 | d1<-3
15 | d2<-5
16 | 
17 | x<-sample(1:d1, n, replace=T)
18 | y<-sample(1:d2, n, replace=T)
19 | 
20 | x<-factor(x)
21 | y<-factor(y)
22 | 
23 | X<-model.matrix(~x*y)
24 | k<-dim(X)[2]
25 | 
26 | # Generate some output values with an
27 | # arbitrary model
28 | out<- X %*% 1:k + rnorm(n, sd=2)
29 | 
30 | # Fit two different models, with m1 nested in m2
31 | # in the sense that y*x is y+x+y:x
32 | m1<-lm(out~y+y:x)
33 | m2<-lm(out~y*x)
34 | 
35 | # Now look at the difference in sum of squares between them
36 | anova(m1,m2)
37 | 
38 | # The models are the "same" in the sense that they have the
39 | # same fitted values
40 | 
41 | # However, the models are not literally the same.
42 | # The model matrices are not the same:
43 | all(model.matrix(m1) == model.matrix(m2))
44 | # And the interaction coefficients are not the same:
45 | (coef(m1)-coef(m2))/coef(m1)
--------------------------------------------------------------------------------
/original/Python/py_stats_analysis.py:
--------------------------------------------------------------------------------
1 | """
2 | This is an experiment to see how easily an analysis
3 | done in R could be done in Python. The R code is in a
4 | companion file. The comments are cut and pasted from
5 | the R code, and similar variable names are used as
6 | much as possible to make comparisons easier.
7 | """
8 | 
9 | #An example analysis of a fairly simple dataset.
10 | #The data is from the 2010 CPS March supplement.
11 | #(Also called the Annual Social and Economic Supplement.)
12 | #The goal is examining the difference in hourly pay
13 | #between males and females. The column A_ERNLWT is
14 | #survey weights, PTFT is part-time/full-time status,
15 | #educ is education level, ind is what industry the
16 | #person works in, and occ is what occupation they work in.
17 | #A full description of most of the variables, sometimes with
18 | #slightly different names, can be found at
19 | #http://www.census.gov/apsd/techdoc/cps/cpsmar10.pdf
20 | 
21 | 
22 | 
23 | import numpy as np
24 | import scikits.statsmodels.api as sm
25 | import matplotlib.pyplot as plt
26 | import scipy.stats as sps
27 | 
28 | dat = np.genfromtxt('marchCPS_2010.txt', names=True, dtype=None,
29 |                     missing_values='NA')
30 | 
31 | print dat.shape
32 | print dat.dtype
33 | print len(dat.dtype)
34 | 
35 | 
36 | ####Actual analysis code
37 | 
38 | #Remove everyone that doesn't have hourly wage data
39 | 
40 | 
41 | 
42 | indhr = ~np.isnan(dat['hrwage'])
43 | hrdat = dat[indhr]
44 | 
45 | 
46 | #Remove several people who just didn't fit the models (using the
47 | #standard model checking techniques below)
48 | 
49 | 
50 | hrdat = np.delete(hrdat, [407, 515, 852, 1197])
51 | 
52 | 
53 | indf = np.flatnonzero(dat['sex'] == 2)
54 | indm = np.flatnonzero(dat['sex'] == 1)
55 | 
56 | print len(indf)
57 | print len(indm)
58 | 
59 | #With each of these models, I typically run some
60 | #commands to look more at the models, like summary(),
61 | #anova for the model on its own or between two models to see
62 | #how much additional explanatory power you get with the added
63 | #variables, and plots to look at residuals, qqplots, and histograms
64 | #of residuals. Currently you can't do anova or lowess in Python,
65 | #and the qqplots are annoying to make.
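#(A sketch of how an anova-style comparison of the nested models below
# could be hand-rolled; it assumes the fitted results objects expose
# .ssr and .df_resid, and would be run after results1 and results2 exist:
#
#    df_diff = results1.df_resid - results2.df_resid
#    f_stat = ((results1.ssr - results2.ssr)/df_diff) / \
#             (results2.ssr/results2.df_resid)
#    p_val = sps.f.sf(f_stat, df_diff, results2.df_resid)
#)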
66 | 
67 | 
68 | #Initial model, only look at log(hrwage)~sex
69 | X1 = hrdat['sex']==2
70 | X1 = sm.add_constant(X1, prepend=True)
71 | model1 = sm.WLS(np.log(hrdat['hrwage']), X1, weights = hrdat['A_ERNLWT'])
72 | results1 = model1.fit()
73 | 
74 | print results1.summary()
75 | 
76 | 
77 | 
78 | 
79 | #More complicated model, log(hrwage)~sex+educ+age+PTFT
80 | n = len(hrdat)
81 | logwage = np.log(hrdat['hrwage'])
82 | w = hrdat['A_ERNLWT']
83 | 
84 | X2 = np.hstack((sm.categorical(hrdat['sex'])[:,2:],
85 |                 sm.categorical(hrdat['educ'])[:,2:],
86 |                 hrdat['age'].reshape(n,1),
87 |                 sm.categorical(hrdat['PTFT'])[:,2:]))
88 | 
89 | X2 = sm.add_constant(X2, prepend=True)
90 | model2 = sm.WLS(logwage, X2, weights = w)
91 | results2 = model2.fit()
92 | 
93 | print results2.summary()
94 | 
95 | 
96 | 
97 | #Now include ind and occ (industry and occupation codings)
98 | X2_5 = np.hstack((sm.categorical(hrdat['sex'])[:,2:],
99 |                   sm.categorical(hrdat['educ'])[:,2:],
100 |                  hrdat['age'].reshape(n,1),
101 |                  (hrdat['age']**2).reshape(n,1),
102 |                  sm.categorical(hrdat['PTFT'])[:,2:],
103 |                  sm.categorical(hrdat['ind'])[:,2:],
104 |                  sm.categorical(hrdat['occ'])[:,2:]))
105 | X2_5 = sm.add_constant(X2_5, prepend=True)
106 | model2_5 = sm.WLS(logwage, X2_5, weights=w)
107 | results2_5 = model2_5.fit()
108 | 
109 | print results2_5.summary()
110 | 
111 | 
112 | #Residual diagnostics for model2_5
113 | 
114 | plt.subplot(1,3,1)
115 | plt.hist(results2_5.resid, 30, normed=1, facecolor='green', alpha=0.75)
116 | plt.title('Histogram of Residuals')
117 | plt.ylabel('Probability')
118 | plt.grid(True)
119 | 
120 | plt.subplot(1,3,2)
121 | plt.plot(results2_5.fittedvalues, results2_5.resid, 'bo')
122 | plt.title('Residuals vs. Fitted Values')
123 | plt.xlabel('Fitted Values')
124 | plt.ylabel('Residuals')
125 | plt.axhline(lw=3, color = 'r')
126 | plt.grid(True)
127 | 
128 | plt.subplot(1,3,3)
129 | normal_rv_q = sps.norm.ppf((np.arange(1,n+1) - 0.5)/n)  # avoids ppf(1) = inf
130 | ordered_resid = np.copy(results2_5.resid)
131 | ordered_resid.sort()
132 | plt.plot(normal_rv_q, ordered_resid, 'ro')
133 | plt.title('Normal qq Plot')
134 | plt.xlabel('Theoretical Quantiles')
135 | plt.ylabel('Sample Quantiles')
136 | plt.grid(True)
137 | 
138 | 
139 | #Used mean replacement for all the people whose hours varied,
140 | #where the mean of 22 was the weighted mean over all part-time workers
141 | #w/o missing data.
142 | #Done just as a sanity check on the above models
143 | 
144 | 
145 | 
146 | #Throw in everything and the kitchen sink for reality check
147 | indic = np.flatnonzero(hrdat['PMHRUSLT']<=0)
148 | tmp = np.copy(hrdat['PMHRUSLT'])
149 | tmp[indic] = 22
150 | 
151 | X3 = np.hstack((sm.categorical(hrdat['sex'])[:,2:],
152 |                 sm.categorical(hrdat['educ'])[:,2:],
153 |                 sm.categorical(hrdat['PTFT'])[:,2:],
154 |                 hrdat['age'].reshape(n,1),
155 |                 (hrdat['age']**2).reshape(n,1),
156 |                 sm.categorical(hrdat['marstat'])[:,2:],
157 |                 sm.categorical(hrdat['GEDIV'])[:,2:],
158 |                 sm.categorical(hrdat['race'])[:,2:],
159 |                 sm.categorical(hrdat['hispanic'])[:,2:],
160 |                 tmp.reshape(n,1),
161 |                 sm.categorical(hrdat['disabled'])[:,2:]))
162 | 
163 | X3 = sm.add_constant(X3, prepend=True)
164 | model3 = sm.WLS(logwage, X3, weights=w)
165 | results3 = model3.fit()
166 | 
167 | print results3.summary()
168 | 
169 | 
170 | #These models bin the workers by age group: <=30, 31-40, 41-50, >50.
171 | #This was done to see what the difference between males and
172 | #females was in each bin. This was done as a sanity check for the
173 | #later exploratory analysis that fit lowess curves across age to
174 | #males and females separately. I wanted to make sure that the trends
175 | #observed were real. (Where the gap was smaller for younger workers,
176 | #expanded for middle age workers, and then contracted again.)
177 | 
178 | 
179 | lt30 = (hrdat['age']<=30)
180 | btw30_40 = (np.logical_and(30<hrdat['age'], hrdat['age']<=40))
--------------------------------------------------------------------------------
/original/R/r_stats_analysis.R:
--------------------------------------------------------------------------------
59 | indic<-as.numeric(hrdat$PMHRUSLT > 0)
60 | 
61 | 
62 | #Used mean replacement for all the people whose hours varied,
63 | #where the mean of 22 was the weighted mean over all part-time workers
64 | #w/o missing data.
65 | #Done just as a sanity check on the above models
66 | 
67 | model3<-lm(I(log(hrwage))~ as.factor(sex)+as.factor(educ)+as.factor(PTFT)+age +I(age^2)+ as.factor(marstat) +
68 |            as.factor(GEDIV)+ as.factor(race) +
69 |            as.factor(hispanic)+I(PMHRUSLT*indic+22*(1-indic))+as.factor(disabled), data = hrdat, weights = A_ERNLWT)
70 | 
71 | 
72 | 
73 | #These models bin the workers by age group: <=30, 31-40, 41-50, >50.
74 | #This was done to see what the difference between males and
75 | #females was in each bin. This was done as a sanity check for the
76 | #later exploratory analysis that fit lowess curves across age to
77 | #males and females separately. I wanted to make sure that the trends
78 | #observed were real. (Where the gap was smaller for younger workers,
79 | #expanded for middle age workers, and then contracted again.)
80 | 
81 | model7.1<-lm(I(log(hrwage))~as.factor(sex)+as.factor(educ)+as.factor(PTFT)+age+I(age^2), data = hrdat[hrdat$age<=30,] , weights=A_ERNLWT )
82 | 
83 | model7.2<-lm(I(log(hrwage))~as.factor(sex)+as.factor(educ)+as.factor(PTFT)+age+I(age^2), data = hrdat[hrdat$age<=40 & hrdat$age>30,] , weights=A_ERNLWT )
84 | 
85 | model7.3<-lm(I(log(hrwage))~as.factor(sex)+as.factor(educ)+as.factor(PTFT)+age+I(age^2), data = hrdat[hrdat$age<=50 & hrdat$age>40,] , weights=A_ERNLWT )
86 | 
87 | model7.4<-lm(I(log(hrwage))~as.factor(sex)+as.factor(educ)+as.factor(PTFT)+age+I(age^2), data = hrdat[hrdat$age>50,] , weights=A_ERNLWT )
88 | 
89 | 
90 | 
91 | ####
92 | 
93 | 
94 | #This analysis switches gears and focuses on occupation.
95 | #The males and females are broken down by occupation and the
96 | #weighted means of their hourly wages are compared.
97 | #(The weights used are again the survey weights.)
98 | #The matrix wocc stores all of that; use
99 | #xtable to format the tables in LaTeX.
100 | 
101 | lab<-sort(unique(hrdat$occ))
102 | occrow<-length(unique(hrdat$occ))
103 | wocc<-matrix(rep(0,4*occrow),nrow=occrow, ncol=4)
104 | for(i in 1:length(lab)){
105 |   tmp<-which(hrdat$sex==1 & hrdat$occ==lab[i])
106 | 
107 |   if(length(tmp)>0){
108 |     wocc[i,1]<-sum(hrdat$A_ERNLWT[tmp])
109 |     wocc[i,2]<-weighted.mean(hrdat$hrwage[tmp], hrdat$A_ERNLWT[tmp])
110 |   }
111 |   tmp<-which(hrdat$sex==2 & hrdat$occ==lab[i])
112 |   if(length(tmp)>0){
113 |     wocc[i,3]<-sum(hrdat$A_ERNLWT[tmp])
114 |     wocc[i,4]<-weighted.mean(hrdat$hrwage[tmp], hrdat$A_ERNLWT[tmp])
115 |   }
116 | }
117 | 
118 | #These were just some quick computations to make sure wocc had
119 | #the values I thought it had. Nothing more embarrassing than
120 | #bad statistics because you didn't double-check your output.
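#(For reference in the checks below: column 1 of wocc is the male
# survey-weight total and column 2 the male weighted mean wage;
# columns 3 and 4 are the same quantities for females. The rows
# follow sort(unique(hrdat$occ)).)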
121 | 
122 | males <- hrdat$sex == 1
123 | females <- hrdat$sex == 2
124 | 
125 | mean(hrdat$hrwage[males])
126 | weighted.mean(hrdat$hrwage[males], hrdat$A_ERNLWT[males])
127 | sum(wocc[,1]*wocc[,2])/sum(wocc[,1])
128 | weighted.mean(hrdat$hrwage[females],hrdat$A_ERNLWT[females])
129 | sum(wocc[-6,1]*wocc[-6,4])/sum(wocc[-6,1])
130 | 
131 | 
132 | 
133 | 
134 | #Plotting hourly wage versus age and fitting a lowess curve to it.
135 | #Done separately for both males and females. Just to get a feel for
136 | #what their salary trajectories look like. Though it's not totally
137 | #clear from the picture, the gap is smaller proportionally at younger
138 | #ages than at older ones. (This can be seen as above by binning the ages,
139 | #or via a separate analysis by fitting an interaction term between
140 | #sex and age.)
141 | 
142 | plot(lowess(hrdat$age[indm], (hrdat$hrwage)[indm]), col='blue', ylim=c(7,25), xlab="Age", ylab = "Hourly Wage", main="Hourly Wage vs. Age", type='l', cex.main=2, cex.lab=1.5, lwd=2)
143 | lines(lowess(hrdat$age[indf], (hrdat$hrwage)[indf]), col='deeppink')
144 | legend('topleft', c("Males", "Females"), cex=1.25, col=c("blue", "deeppink"), lty=1, lwd=2)
145 | 
146 | 
147 | 
148 | 
149 | 
150 | 
--------------------------------------------------------------------------------
/original/summary.txt:
--------------------------------------------------------------------------------
1 | A simple comparison of R and Python (statsmodels) for doing some
2 | exploratory data analysis and fitting of simple OLS models.
3 | 
4 | This is a rather condensed version of an analysis I did in a
5 | statistics course. It uses CPS data to look at wage differences
6 | according to gender, examining first a baseline simple model
7 | and then successively more complex models.
8 | 
9 | For the OLS models R is much easier to use than Python. In
10 | Python the design matrices must be constructed explicitly,
11 | which is both painstaking and annoying for complicated
12 | models. Furthermore, the resulting coefficient fits are
13 | unlabelled. Python also doesn't have an anova command like
14 | R's to compare nested models.
15 | 
16 | There is also some simple exploratory, non-model based analysis.
17 | In this area R clearly dominates. Python currently doesn't have
18 | one of the main graphical exploratory tools I used, lowess, and
19 | it doesn't print the contents of matrices in a user-friendly
20 | manner.
21 | 
22 | Both sets of code are best used when cut and pasted into their
23 | respective GUIs: in R's case the standard R GUI, and in Python's
24 | case the IPython qtconsole. Non-GUI interpreters are generally
25 | much more frustrating to use for interactive data analysis, which
26 | is how these analyses were generated and explored.
27 | 
28 | In fact, using Python without the IPython qtconsole is practically
29 | impossible for this sort of cut and paste, interactive analysis.
30 | The shell IPython doesn't allow it because it automatically adds
31 | whitespace on multiline bits of code, breaking pre-formatted code's
32 | alignment. Cutting and pasting works for the standard Python shell,
33 | but then you lose all the advantages of IPython.
34 | 
35 | Other issues I ran into were:
36 | 
37 | *Inability to easily add new columns and rows to data. (However,
38 | this will hopefully be fixed with pandas, and additionally it
39 | wasn't an issue for me in this analysis.)
40 | 
41 | *Python doesn't pretty print its arrays in any way. This makes it
42 | much more difficult to inspect output from interactive data
43 | analysis if the user was just shoving results/summary statistics
44 | into a single array.
45 | 
46 | Even if you use the %precision magic in IPython, that only
47 | helps so much because numpy prints numpy arrays by row instead
48 | of by column. Not being able to name rows and columns
49 | after array creation is also an impediment to analysis.
50 | 
51 | *No matplotlib keyboard shortcuts for closing windows. This makes it
52 | annoying to open a plot and then have to move your fingers away from
53 | the keyboard to close the plot.
54 | 
--------------------------------------------------------------------------------
/revised/Python/py_stats_analysis.py:
--------------------------------------------------------------------------------
1 | """
2 | This is an experiment to see how easily an analysis
3 | done in R could be done in Python. The R code is in a
4 | companion file. The comments are cut and pasted from
5 | the R code, and similar variable names are used as
6 | much as possible to make comparisons easier.
7 | """
8 | 
9 | #An example analysis of a fairly simple dataset.
10 | #The data is from the 2010 CPS March supplement.
11 | #(Also called the Annual Social and Economic Supplement.)
12 | #The goal is examining the difference in hourly pay
13 | #between males and females. The column A_ERNLWT is
14 | #survey weights, PTFT is part-time/full-time status,
15 | #educ is education level, ind is what industry the
16 | #person works in, and occ is what occupation they work in.
17 | #A full description of most of the variables, sometimes with
18 | #slightly different names, can be found at
19 | #http://www.census.gov/apsd/techdoc/cps/cpsmar10.pdf
20 | 
21 | 
22 | 
23 | import numpy as np
24 | import scikits.statsmodels.api as sm
25 | import matplotlib.pyplot as plt
26 | import scipy.stats as sps
27 | 
28 | dat = np.genfromtxt('marchCPS_2010.txt', names=True, dtype=None,
29 |                     missing_values='NA')
30 | 
31 | print dat.shape
32 | print dat.dtype
33 | print len(dat.dtype)
34 | 
35 | 
36 | ####Actual analysis code
37 | 
38 | #Remove everyone that doesn't have hourly wage data
39 | 
40 | 
41 | 
42 | indhr = ~np.isnan(dat['hrwage'])
43 | hrdat = dat[indhr]
44 | 
45 | 
46 | #Remove several people who just didn't fit the models (using the
47 | #standard model checking techniques below)
48 | 
49 | 
50 | hrdat = np.delete(hrdat, [407, 515, 852, 1197])
51 | 
52 | 
53 | indf = np.flatnonzero(dat['sex'] == 2)
54 | indm = np.flatnonzero(dat['sex'] == 1)
55 | 
56 | print len(indf)
57 | print len(indm)
58 | 
59 | #With each of these models, I typically run some
60 | #commands to look more at the models, like summary(),
61 | #anova for the model on its own or between two models to see
62 | #how much additional explanatory power you get with the added
63 | #variables, and plots to look at residuals, qqplots, and histograms
64 | #of residuals. Currently you can't do anova or lowess in Python,
65 | #and the qqplots are annoying to make.
66 | 
67 | 
68 | #Initial model, only look at log(hrwage)~sex
69 | X1 = hrdat['sex']==2
70 | X1 = sm.add_constant(X1, prepend=True)
71 | model1 = sm.WLS(np.log(hrdat['hrwage']), X1, weights = hrdat['A_ERNLWT'])
72 | results1 = model1.fit()
73 | 
74 | print results1.summary()
75 | 
76 | 
77 | #Pre-defining model matrix components for more complicated models.
78 | #dat_mat is short for DATa model MATrices
79 | n = len(hrdat)
80 | dat_mat = {}
81 | dat_names = {}
82 | factor_vars = ['sex', 'educ', 'PTFT', 'ind', 'occ', 'marstat',
83 |                'GEDIV', 'race', 'hispanic', 'disabled']
84 | for name in factor_vars:
85 |     dat_mat[name],dat_names[name] = sm.categorical(hrdat[name],
86 |                                                    dictnames=True)
87 |     dat_mat[name] = dat_mat[name][:,2:]
88 | dat_mat['age'] = hrdat['age'].reshape(n,1)
89 | dat_mat['age^2'] = (hrdat['age']**2).reshape(n,1)
90 | dat_mat['const'] = np.ones((n,1))
91 | dat_names['age'] = ['age']
92 | dat_names['age^2'] = ['age^2']
93 | dat_names['const'] = ['const']
94 | 
95 | for name in factor_vars:
96 |     fact_names = sorted(dat_names[name].values())[1:]
97 |     dat_names[name] = [''.join([name, str(val)]) for
98 |                        val in fact_names]
99 | 
100 | #Helper functions to spit out the design matrix and names
101 | 
102 | def get_mat(var_list):
103 |     terms = map(dat_mat.get, var_list)
104 |     mat = np.hstack(tuple(terms))
105 | 
106 |     return mat
107 | 
108 | def get_names(var_list):
109 |     names = []
110 |     map(names.extend, map(dat_names.get, var_list))
111 | 
112 |     return names
113 | 
114 | #More complicated model, log(hrwage)~sex+educ+age+PTFT
115 | logwage = np.log(hrdat['hrwage'])
116 | w = hrdat['A_ERNLWT']
117 | 
118 | m2_vars = ['const', 'sex', 'educ', 'age', 'PTFT']
119 | 
120 | X2 = get_mat(m2_vars)
121 | m2_names = get_names(m2_vars)
122 | 
123 | model2 = sm.WLS(logwage, X2, weights = w)
124 | results2 = model2.fit()
125 | 
126 | print results2.summary(xname=m2_names)
127 | 
128 | 
129 | 
130 | #Now include ind and occ (industry and occupation codings)
131 | 
132 | m2_5_vars = list(m2_vars)
133 | m2_5_vars.extend(['ind','occ'])
134 | m2_5_vars.insert(4, 'age^2')
135 | 
136 | X2_5 = get_mat(m2_5_vars)
137 | m2_5_names = get_names(m2_5_vars)
138 | 
139 | model2_5 = sm.WLS(logwage, X2_5, weights=w)
140 | results2_5 = model2_5.fit()
141 | 
142 | print results2_5.summary(xname=m2_5_names)
143 | 
144 | 
145 | #Residual diagnostics for model2_5
146 | 
147 | plt.subplot(1,3,1)
148 | plt.hist(results2_5.resid, 30, normed=1, facecolor='green', alpha=0.75)
149 | plt.title('Histogram of Residuals')
150 | plt.ylabel('Probability')
151 | plt.grid(True)
152 | 
153 | plt.subplot(1,3,2)
154 | plt.plot(results2_5.fittedvalues, results2_5.resid, 'bo')
155 | plt.title('Residuals vs. Fitted Values')
156 | plt.xlabel('Fitted Values')
157 | plt.ylabel('Residuals')
158 | plt.axhline(lw=3, color = 'r')
159 | plt.grid(True)
160 | 
161 | plt.subplot(1,3,3)
162 | normal_rv_q = sps.norm.ppf((np.arange(1,n+1) - 0.5)/n)  # avoids ppf(1) = inf
163 | ordered_resid = np.copy(results2_5.resid)
164 | ordered_resid.sort()
165 | plt.plot(normal_rv_q, ordered_resid, 'ro')
166 | plt.title('Normal qq Plot')
167 | plt.xlabel('Theoretical Quantiles')
168 | plt.ylabel('Sample Quantiles')
169 | plt.grid(True)
170 | 
171 | 
172 | #Used mean replacement for all the people whose hours varied,
173 | #where the mean of 22 was the weighted mean over all part-time workers
174 | #w/o missing data.
175 | #Done just as a sanity check on the above models
176 | 
177 | 
178 | 
179 | #Throw in everything and the kitchen sink for reality check
180 | indic = np.flatnonzero(hrdat['PMHRUSLT']<=0)
181 | tmp = np.copy(hrdat['PMHRUSLT'])
182 | tmp[indic] = 22
183 | tmp = tmp.reshape(n,1)
184 | 
185 | m3_vars = ['const','sex', 'educ', 'PTFT', 'age', 'age^2',
186 |            'marstat', 'GEDIV', 'race', 'hispanic',
187 |            'tmp', 'disabled']
188 | dat_mat['tmp'] = tmp
189 | dat_names['tmp'] = ['Varying Hours']
190 | 
191 | X3 = get_mat(m3_vars)
192 | m3_names = get_names(m3_vars)
193 | 
194 | model3 = sm.WLS(logwage, X3, weights=w)
195 | results3 = model3.fit()
196 | 
197 | print results3.summary(xname=m3_names)
198 | 
199 | 
200 | #These models bin the workers by age group: <=30, 31-40, 41-50, >50.
201 | #This was done to see what the difference between males and
202 | #females was in each bin. This was done as a sanity check for the
203 | #later exploratory analysis that fit lowess curves across age to
204 | #males and females separately. I wanted to make sure that the trends
205 | #observed were real. (Where the gap was smaller for younger workers,
206 | #expanded for middle age workers, and then contracted again.)
207 | 
208 | 
209 | lt30 = (hrdat['age']<=30)
210 | btw30_40 = (np.logical_and(30<hrdat['age'], hrdat['age']<=40))
--------------------------------------------------------------------------------
/revised/R/r_stats_analysis.R:
--------------------------------------------------------------------------------
75 | indic<-as.numeric(hrdat$PMHRUSLT > 0)
76 | 
77 | 
78 | #Used mean replacement for all the people whose hours varied,
79 | #where the mean of 22 was the weighted mean over all part-time workers
80 | #w/o missing data.
81 | #Done just as a sanity check on the above models
82 | 
83 | model3<-lm(log(hrwage)~ sex+educ+PTFT+age +I(age^2)+marstat+GEDIV+race+
84 |            hispanic+I(PMHRUSLT*indic+22*(1-indic))+disabled,
85 |            data = hrdat, weights = A_ERNLWT)
86 | 
87 | 
88 | 
89 | #These models bin the workers by age group: <=30, 31-40, 41-50, >50.
90 | #This was done to see what the difference between males and
91 | #females was in each bin. This was done as a sanity check for the
92 | #later exploratory analysis that fit lowess curves across age to
93 | #males and females separately. I wanted to make sure that the trends
94 | #observed were real. (Where the gap was smaller for younger workers,
95 | #expanded for middle age workers, and then contracted again.)
96 | 
97 | model7.1<-lm(log(hrwage)~sex+educ+PTFT+age+I(age^2),
98 |              data = subset(hrdat, age<=30), weights=A_ERNLWT )
99 | 
100 | model7.2<-lm(log(hrwage)~sex+educ+PTFT+age+I(age^2),
101 |              data = subset(hrdat, age>30 & age<=40), weights=A_ERNLWT )
102 | 
103 | model7.3<-lm(log(hrwage)~sex+educ+PTFT+age+I(age^2),
104 |              data = subset(hrdat, age>40 & age<=50), weights=A_ERNLWT )
105 | 
106 | model7.4<-lm(log(hrwage)~sex+educ+PTFT+age+I(age^2),
107 |              data = subset(hrdat, age>50), weights=A_ERNLWT )
108 | 
109 | 
110 | ####
111 | 
112 | 
113 | #This analysis switches gears and focuses on occupation.
114 | #The males and females are broken down by occupation and the
115 | #weighted means of their hourly wages are compared.
116 | #(The weights used are again the survey weights.)
117 | #The matrix wocc stores all of that; use
118 | #xtable to format the tables in LaTeX.
119 | 
120 | sum_stat<-function(x){
121 |   tmp1 <- sum(x$A_ERNLWT)
122 |   tmp2 <- weighted.mean(x$hrwage, x$A_ERNLWT)
123 |   c(survey.wt=tmp1, avr.hr.wage=tmp2)
124 | }
125 | 
126 | split.by <-list(hrdat$sex, hrdat$occ)
127 | wocc<-split(hrdat, split.by)
128 | 
129 | wocc <-lapply(wocc, sum_stat)
130 | wocc<-do.call(rbind, wocc)
131 | 
132 | wocc<-split(data.frame(wocc), rep(levels(hrdat$sex), 9))
133 | 
134 | #Plotting hourly wage versus age and fitting a lowess curve to it.
135 | #Done separately for both males and females. Just to get a feel for
136 | #what their salary trajectories look like. Though it's not totally
137 | #clear from the picture, the gap is smaller proportionally at younger
138 | #ages than at older ones. (This can be seen as above by binning the ages,
139 | #or via a separate analysis by fitting an interaction term between
140 | #sex and age.)
141 | 
142 | plot(lowess(hrdat$age[indm], (hrdat$hrwage)[indm]), col='blue', ylim=c(7,25),
143 |      xlab="Age", ylab = "Hourly Wage", main="Hourly Wage vs. Age", type='l',
144 |      cex.main=2, cex.lab=1.5, lwd=2)
145 | lines(lowess(hrdat$age[indf], (hrdat$hrwage)[indf]), col='deeppink')
146 | legend('topleft', c("Males", "Females"), cex=1.25, col=c("blue", "deeppink"),
147 |        lty=1, lwd=2)
148 | 
149 | 
150 | 
151 | 
152 | 
153 | 
--------------------------------------------------------------------------------
/revised/summary.txt:
--------------------------------------------------------------------------------
1 | A simple comparison of R and Python (statsmodels) for doing some
2 | exploratory data analysis and fitting of simple OLS models.
3 | 
4 | See summary.txt in the folder for the original analysis first.
5 | This document only gives corrections/suggestions/changes
6 | relative to the original analysis.
7 | 
8 | -------------------------------------------------------------
9 | Compared to the original .r and .py files, in this revised version:
10 | -The R code was cleaned up because I realized I didn't need to use
11 | as.factor if I made the relevant variables into factors
12 | -The Python code was cleaned up by computing the 'sub-design matrices'
13 | associated with each factor variable beforehand and stashing
14 | them in a dictionary
15 | -Names were added to the variables in the regression by creating them
16 | from the calls to sm.categorical and stashing them in a dictionary
17 | 
18 | Notably, the helper functions and stashing of the pieces of design matrices
19 | simplified the calls for model fitting, but they didn't noticeably shorten
20 | the code. They also required a small increase in complexity. (In terms of the
21 | data structures and function calls used to create the list of names and
22 | the design matrices.)
23 | --------------------------------------------------------------
24 | 
25 | Comments:
26 | 
27 | ***Pasting without autoindent can be done in the shell IPython and not
28 | just the IPython qtconsole. The relevant commands are paste and cpaste;
29 | I found cpaste to be more what I was looking for, but mileage
30 | may vary. (See the example below.)
31 | 
32 | (Though the shell IPython still has the limitation that when hitting
33 | the up arrow key you cannot recall multi-line pasted inputs as a
34 | single block. You can do that in the IPython qtconsole.)
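
As a quick illustration (a hypothetical session; the prompt details may
differ slightly between IPython versions), %cpaste reads pasted lines
verbatim until a line containing only '--':

    In [1]: %cpaste
    :for i in range(3):
    :    print i
    :--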
35 | 
36 | ***There are some options for pretty printing a matrix. One is to adjust
37 | settings in np.set_printoptions as needed, and the other is to use
38 | sm.iolib.SimpleTable from statsmodels. (See the sketch at the end.)
39 | 
40 | (In the future the addition of pandas to statsmodels should alleviate
41 | this as well.)
42 | 
43 | ***The user can add names to the variables in a statsmodels regression
44 | by using the xname option on a regression results object. However, the
45 | user must supply the names. In particular, this means keeping track
46 | of the levels for a categorical variable.
47 | 
48 | ***One can simplify the construction of design matrices by, at the
49 | beginning of the analysis, creating a dictionary that associates
50 | each variable's name with its part of the design matrix.
51 | Then a helper function can call np.hstack to combine
52 | these sub-pieces of the design matrix into a whole design matrix.
53 | 
54 | A similar strategy can be used to keep track of the names associated with
55 | each variable. (In particular the names for each level of a categorical
56 | variable.)
57 | 
58 | However, that method doesn't allow subsetting of the data. The problem
59 | is that for some subsets of the data not all levels of a categorical
60 | variable will be present. If you used the same matrices and names as
61 | for the full dataset you'd have columns of all 0's in the design
62 | matrix or (if those columns were eliminated) too many names.
63 | 
64 | Creating a function for generating names and design matrices that
65 | can take a subset of the data is more involved. (And I did not implement
66 | it.) In particular, it would have to generate both the names and the
67 | design matrix together, as well as keep track of which variables were
68 | categorical (and hence needed sm.categorical called) and which were not.
69 | 
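A sketch of the two pretty-printing options mentioned above (the array
and labels are hypothetical, made up just for illustration):

    np.set_printoptions(precision=3, suppress=True)
    arr = np.array([[10.12345, 0.00012], [3.14159, 2.71828]])
    print arr   # three decimals, no scientific notation

    print sm.iolib.SimpleTable(arr, headers=['wt', 'wage'],
                               stubs=['Male', 'Female'])
--------------------------------------------------------------------------------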