```
├── README.md
└── Tests
    ├── Mann-Whitney (-Wilcoxon).md
    ├── Various tests to describe one by one.md
    └── common_procedures.md
```

/README.md:

Testing hypotheses through statistical models opens a universe of new possibilities. Learn how to improve your daily work with this approach.

It's not a new idea. My old Polish statistics books (~2007-2012) already presented ANOVA and the t-test as special cases of the general linear model. That was the first time I realized that every parametric test (and numerous non-parametric ones) is an inferential procedure **applied on top** of some model. Later I found this approach in other books too.

![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/bd74461d-f695-434d-b755-f481fadb89ff)

**Then it turned out that many non-parametric tests can also be accurately replicated with "classic" parametric or semi-parametric models!**

Sometimes the relationship is easy to find (like ANOVA vs. the assessment of the main and interaction effects of the general linear model = the reduction in residual variance when comparing nested models), and sometimes it's less direct (like Wald's z-test for proportions or the Breslow-Day test vs. logistic regression). Sometimes the properties connecting non-parametric tests and parametric models are truly surprising and breathtaking (like the Mann-Whitney test and ordinal logistic regression)!

You can find excellent resources on the Internet, like [Common statistical tests are linear models (or: how to teach stats)](https://lindeloev.github.io/tests-as-linear/), but I wanted to summarize my own experience with model-based testing, used daily in my work (biostatistics in clinical trials).

So far I have examined Wald's and Rao's z-tests for 2+ proportions (+ ANOVA-like), Cochran-Mantel-Haenszel (CMH), Breslow-Day, Cochran-Armitage, McNemar, Cochran's Q, Friedman, Mann-Whitney (-Wilcoxon), and Kruskal-Wallis, replicated with logistic regression.

But this extends much further, beyond just classic testing! General and generalized linear models fit via Generalized Least Squares, Generalized Estimating Equations, and Generalized Linear Mixed Models, (mixed-effect) quantile regression, Cox proportional-hazards regression, Accelerated Failure Time models, Tobit regression, Simplex regression, and dozens of other models will be at your disposal to test hypotheses in a way you were probably never told about.

This document is incomplete and dynamic. I will update it over time, so stay tuned.

You will find descriptions of the individual tests in the [Tests subdirectory](https://github.com/adrianolszewski/model-based-testing-hypotheses/tree/main/Tests). So far I have started describing the [Mann-Whitney-Wilcoxon vs. Ordinal Logistic Regression](https://github.com/adrianolszewski/model-based-testing-hypotheses/blob/main/Tests/Mann-Whitney%20(-Wilcoxon).md). A lot remains to be done: https://github.com/adrianolszewski/model-based-testing-hypotheses/blob/main/Tests/Various%20tests%20to%20describe%20one%20by%20one.md

## Just one question: WHY?
Well, classic tests are "simple" and fast. But a simple method is for simple scenarios.
A more advanced inferential analysis often goes FAR beyond what these tests can do.
For our considerations it's important to say that **by applying the Likelihood Ratio, Rao's, or Wald's testing procedure to an appropriate model, you will (mostly) be able to reproduce what the classic "plain" tests do, but also to go far beyond that** (find below a section dedicated to these 3 methods).

So what?

Think about it. **By using a more general model that can handle several "problems" in the data, you can bypass various assumptions of the classic parametric tests RATHER than switching to non-parametric methods** (which often have complex interpretations and entirely change your null hypotheses!).

Namely, you can address:
- **mean-variance dependency and general heteroscedasticity**? How about the _Generalized Linear Model (GLM)_, _Generalized Least Squares (GLS)_ or _Generalized Estimating Equations (GEE)_ estimation?
- **dependency in residuals** (repeated observations, clustered data)? How about GLS, GEE and _mixed models_ (LMM, GLMM)?
- **non-normal residuals**? Maybe the GLM or GEE (which don't make distributional assumptions about the residuals)? Oh, OK, GEE is semi-parametric, but it still retains your null hypotheses about conditional means!
- **skewed conditional distributions**? (related to the above) Again, GLM: gamma or beta regression, maybe Poisson or negative binomial? Or fractional logit?
- **categorical data**? Again, the GLM (with extensions) offers you ordinal logistic regression (the proportional-odds model), partial ordinal regression, and multinomial logistic regression
- **relationships (hierarchy) in the response**? How about nested logistic regression?
- **survival data**? I guess you know that the log-rank test is equivalent to using Cox regression with a 2-level categorical predictor.

And that's not all! _Quantile regression_ (especially combined with random effects) can handle even more! And yes, it's a non-parametric method, BUT at least a well-interpretable one (which cannot be said about Mann-Whitney (-Wilcoxon), for instance).

Let's summarize the key facts:

1. Ordinary tests won't allow you to control for covariates. Goodbye, more advanced analyses. Models will do it naturally (they were born for that).
2. Most classic non-parametric tests cannot handle more complex m × n-way designs. Want to test some outcome across multiple groups rendered by `sex * visit * treatment_arm`? Forget it! You will likely need to run multiple simple tests, and they won't be able to detect inter-variable relationships.
3. Most of the classic tests won't be able to test interactions between multiple factors (a few modern ones, like the ATS (ANOVA-Type Statistic) or WTS (Wald-Type Statistic), can do this, but only in a limited scope: a 1-level interaction between just 2 factors).
4. Classic tests won't allow you to test and compare simple effects via contrasts, e.g.: "Visit: 3, Arm: Treatment | Male vs Female effect" vs. "Visit: 3, Arm: Control | Male vs Female effect". **For model-based testing it's a piece of cake.**
5. You may simply NOT KNOW which test to use! Believe me or not, there are 850+ (EIGHT HUNDRED FIFTY!) statistical tests and counting. A colleague of mine has been counting them for years (with my little support). With a model, you don't care about all these "versions": just provide the formula, set some parameters, and test the hypotheses you need. Need a test for trend? Just use an ordinal factor for your time variable (see the sketch after this list).
6. You will obtain standard errors and confidence intervals for these comparisons.
7. Want to test some hypotheses jointly? The model-based approach followed by Wald's testing (sometimes also the LRT) will do the job.
8. Want to effectively adjust for multiple testing using a parametric, exact method employing the estimated effects and covariances through the multivariate t distribution ("MVT")? This is far better than Bonferroni :-) But forget this flexibility when using plain tests!
9. Need to run a type-2 or type-3 AN(C)OVA-type analysis for binary data? Or counts? Or anything else? Use an appropriate model capable of handling such data and run the necessary testing on top of this model! Depending on the kind of model and the link transformation used, the interpretation will be more complex, but, at the end of the day, you will have a way to assess the impact of your predictors on the response.
10. The assumptions for AN(C)OVA fail on your data? As above: fit an appropriate model addressing these issues and follow it with an appropriate testing procedure!

Fair enough?
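As promised in point 5, here is a minimal sketch of a model-based test for trend (simulated data, hypothetical `visit` variable). In R an ordered factor gets polynomial contrasts by default, so the `.L` coefficient is the linear-trend component:

```r
# Minimal sketch of a model-based test for trend (simulated data).
# An ordered factor gets polynomial contrasts by default in R,
# so the "visit.L" coefficient tests the linear trend across visits.
set.seed(1)
d <- data.frame(visit = ordered(rep(1:4, each = 25)),
                y     = rnorm(100, mean = rep(c(10, 11, 12, 14), each = 25)))
m <- lm(y ~ visit, data = d)
summary(m)$coefficients["visit.L", ]   # Wald t-test of the linear trend
anova(lm(y ~ 1, data = d), m)          # omnibus F-test of the visit effect
```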
### A longer by-the-way about the ANOVA/ANCOVA-like analysis

The AN(C)OVA you learned from textbooks is a special case of a much more general procedure: _the assessment of the main and interaction effects of a model containing categorical predictor variables_. This can be done in two ways:
1. by jointly testing the model coefficients (Wald's approach)
2. by comparing nested models (Wilks' Likelihood Ratio testing), one of which contains the predictor of interest and the other does not. This actually assesses the reduction in residual variance between the models. In the Generalized Linear Model, the deviance is assessed instead.

For the general linear model (under the hood of the "textbook AN(C)OVA" and the t-test) the two approaches are equivalent. The whole procedure reduces to just comparing means across the groups rendered by the levels of your categorical predictors (adjusted for numerical covariates, if present) through appropriate contrasts (functions of means). If you test the contrasts related to a certain predictor jointly, you obtain the "global" test for the main effect of this predictor. It extends to interactions as well. And that's all. In exactly the same way you can obtain the main and interaction effects of any model you want - only the circumstances may differ (e.g. the LRT is not available for non-likelihood models, so you need to stick with Wald's).

Or, put differently, let's consider a simple model with just one 3-level categorical predictor. You can test the impact of this variable in 2 ways:
- the estimated model coefficients are shifts (differences) between the corresponding group means and the intercept-related mean, so you can test these coefficients jointly for the effect of the whole variable. That's Wald's approach.

- if the variable of interest has a statistically significant impact on the response (the means), then removing it from the model will worsen the fit (now "throwing" everything onto the overall mean - the intercept), so the residual variance will increase. You can test this change in variance by comparing the two nested models. This is the "omnibus" F-test you likely know from ANOVA textbooks.

- in case of the GLM, we replace the residual variance with the deviance, directly related to the likelihood ratio. It's still based on model comparison - this time through the ratio of the models' likelihoods. If the likelihood ratio differs from 1, it means that removing the variable of interest worsened the fit. That's Wilks' LRT approach, employing the $\chi^2$ distribution.
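A quick illustration of this equivalence for the general linear model (a minimal sketch on simulated data; it assumes the `car` package is installed): the omnibus F-test from comparing nested models is the same thing as the joint test of the group coefficients.

```r
# For the general linear model, the nested-model F-test and the joint test
# of the group coefficients coincide (simulated data).
set.seed(42)
d  <- data.frame(g = factor(rep(c("A", "B", "C"), each = 30)),
                 y = rnorm(90, mean = rep(c(0, 0.5, 1), each = 30)))
m1 <- lm(y ~ g, data = d)    # full model
m0 <- lm(y ~ 1, data = d)    # reduced model, predictor removed
anova(m0, m1)                # reduction in residual variance (omnibus F-test)
car::Anova(m1)               # main effect of g - the same F and p-value
```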
**That's exactly how R's anova(), car::Anova(), emmeans::joint_tests() and a few other methods work.**
Other statistical packages do it too.
See? **The model-based approach to ANOVA is not just "some observation made in some books" - it IS the common approach to it!**

PS: let's recall, by the way, that $\chi^2$ is the limiting distribution for F under infinite denominator degrees of freedom:
if $X \sim F(n_1, n_2)$, the limiting distribution of $n_1 X$ as $n_2 \rightarrow \infty$ is the $\chi^2$ distribution with $n_1$ degrees of freedom.
https://www.math.wm.edu/~leemis/chart/UDR/PDFs/FChisquare.pdf

You will find the $\chi^2$ distribution in at least two key places:
- when comparing models via the LRT: under H0, as the sample size $n \rightarrow \infty$, the test statistic $\lambda_{\text{LR}}$ is asymptotically $\chi^2$-distributed with degrees of freedom equal to the difference in dimensionality of $\Theta$ and $\Theta_0$.

- when you perform Wald's testing of multiple parameters: the distribution of the test statistic converges asymptotically to $\chi^2$.

You'll also meet the name _Wald_ in 3 contexts, when Wald's approach is applied to:
- a single-parameter inference (simple H0), e.g. when doing simple comparisons (contrasts), with the degrees of freedom known. Then you will see statistical packages reporting "Wald t". If you look at the formula, it looks much like a one-sample t-test (the numerator is a difference, the denominator its standard error), but it's a different creature, so don't confuse the two. It's rather a "t ratio". It's t-distributed only under a normal sampling distribution of the parameter.

- a single-parameter inference, but the degrees of freedom aren't known (or are hard to guess) and are assumed infinite. Then you will see statistical packages reporting "Wald z". You may also find "Wald chi2", but that's the same thing: asymptotically $z^2 = \chi^2$.

- a multi-parameter (like in ANOVA), asymptotic (under infinite degrees of freedom) inference employing a composite H0. Then you will see statistical packages reporting "Wald chi2" or simply "chi2". You may also find something like "Score chi2", which refers to Rao's approach to testing.

## What will you present in this repository?

We will do 3 things.

1. We will assess how well the classic tests can be replicated with models. OK, I hear you: why, if tests are faster and rarely fail? Because sometimes there may be no test in your statistical package that fits your needs.

2. We will extend the classic tests, for example to more than 1 factor, with interactions. That's exactly where models enter the scene! Want to compare the % of successes between males and females? Smokers and non-smokers? Check their interactions? In other words, do you want "ANOVA" for the binary-response case? OK, that's exactly where we will go.

3. We will run some basic analyses of contrasts by employing several models and estimation methods and the awesome `emmeans` package. So, we will try GLM, GEE-GLM, and GLS (MMRM) to compare % of successes, counts, ratios, maybe concentrations (via log-linked gamma regression) via contrasts.

Those are my plans. But first I need to cover the basic tests.
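Meanwhile, to preview plan 3 (and the "simple effects via contrasts" claim from the list earlier), here is a minimal sketch with `emmeans` on simulated data; the variable names (`arm`, `sex`, `age`) are hypothetical and `emmeans` is assumed to be installed:

```r
# Minimal sketch of covariate-adjusted contrasts via emmeans
# (simulated data; hypothetical variable names).
library(emmeans)
set.seed(7)
d <- data.frame(arm = factor(rep(c("Control", "Treatment"), each = 40)),
                sex = factor(rep(c("F", "M"), times = 40)),
                age = rnorm(80, 50, 8))
d$y <- with(d, 1 + 0.5 * (arm == "Treatment") + 0.3 * (sex == "M") +
               0.02 * age + rnorm(80))
m  <- lm(y ~ arm * sex + age, data = d)
em <- emmeans(m, ~ sex | arm)   # LS-means of sex within each arm
pairs(em)                       # "M vs F" simple effects, per arm
contrast(em, interaction = "pairwise", by = NULL)  # difference-in-differences
```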
## But the model-based testing is SLOW!
Let's be honest: model-based testing can be (and IS) **SLOWER** than running a plain good old test, especially if you perform it under the multiple-imputation approach to missing data (where you repeat the same analysis on each of dozens of imputed datasets and then pool the results), and especially if you also employ Likelihood Ratio testing (LRT).

But if you have just a few tens to hundreds of observations, I wouldn't be much concerned.

**But - well - did you expect a free lunch?
As I said in the beginning: simple tests do simple jobs and cannot do anything more. If you want more, you have to pay for it. It's as simple as that.**

----
## Not only is it slow. It may also fail to converge!
Yes, models may fail, and sometimes they do.

And while tests may fail computationally too, it happens incomparably more rarely than with models.
Just recall how many times you have seen messages like "failed to converge", "did not converge", "negative variance estimate" or similar, and how often a plain test failed? Well, that's it.

Models are non-trivial procedures and may computationally fail under "edge" conditions. For example, if you fit a logistic regression with all responses equal to TRUE or FALSE, then - well... it's rather obvious that it will fail, right?

Let's take the ordinal logistic regression (aka proportional-odds model), used to replicate the Mann-Whitney (-Wilcoxon) test.
Let me show you a trivial case: we will compare two samples from normal distributions: N(mean=0, SD=1) vs. N(mean=5, SD=1).
What can be simpler than that?

/ PS: yes, YOU CAN use the ordinal logistic regression for continuous numerical data. You will just have dozens of intercepts :)
Read these excellent resources by Prof. Frank Harrell: [If you like the Wilcoxon test you must like the proportional odds model](https://www.fharrell.com/post/wpo/), [Equivalence of Wilcoxon Statistic and Proportional Odds Model](https://www.fharrell.com/post/powilcoxon/), and (the best repository on this topic I've ever seen!) [Resources for Ordinal Regression Models](https://www.fharrell.com/post/rpo/) /

``` r
> set.seed(1000)
> stack(
+    data.frame(arm1 = rnorm(50, mean=0),
+               arm2 = rnorm(50, mean=5))) %>%
+   mutate(values_ord = ordered(values), ind = factor(ind)) %>%
+   rms::orm(values ~ ind, data=.)
Did not converge in 12 iterations
Unable to fit model using “orm.fit”
Error in switch(x$family, logistic = "Logistic (Proportional Odds)", probit = "Probit",  :
  EXPR must be a length 1 vector
```
As you can see, the model failed to converge.
The Mann-Whitney (-Wilcoxon) test did well, however:
``` r
> set.seed(1000)
> stack(
+    data.frame(arm1 = rnorm(50, mean=0.0),
+               arm2 = rnorm(50, mean=5))) %>%
+   mutate(values_ord = ordered(values), ind = factor(ind)) %>%
+   wilcox.test(values ~ ind, data=., conf.int = TRUE)

	Wilcoxon rank sum test with continuity correction

data:  values by ind
W = 0, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -5.772126 -4.977111
sample estimates:
difference in location
              -5.34864
```
Since we used normal distributions, let's also check the classic t-test:
``` r
> set.seed(1000)
> stack(
+    data.frame(arm1 = rnorm(50, mean=0.01),
+               arm2 = rnorm(50, mean=5))) %>%
+   mutate(values_ord = ordered(values), ind = factor(ind)) %>%
+   t.test(values ~ ind, data=., var.equal = TRUE)

	Two Sample t-test

data:  values by ind
t = -26.803, df = 98, p-value < 2.2e-16
alternative hypothesis: true difference in means between group arm1 and group arm2 is not equal to 0
95 percent confidence interval:
 -5.735385 -4.944653
sample estimates:
mean in group arm1 mean in group arm2
        -0.1486302          5.1913886
```
As expected. In this parametric case the Wilcoxon and t-tests are naturally consistent.
Sometimes a model that failed can converge if we "perturb" the data a little. In our case let's add just 0.01 to the mean.
``` r
> set.seed(1000)
> stack(
+    data.frame(arm1 = rnorm(50, mean=0.01),
+               arm2 = rnorm(50, mean=5))) %>%
+   mutate(values_ord = ordered(values), ind = factor(ind)) %>%
+   rms::orm(values ~ ind, data=.)
Logistic (Proportional Odds) Ordinal Regression Model

rms::orm(formula = values ~ ind, data = .)

                      Model Likelihood          Discrimination    Rank Discrim.
                            Ratio Test                 Indexes          Indexes
Obs           100    LR chi2     133.10    R2       0.736         rho    0.865
Distinct Y    100    d.f.             1    R2(1,100)    0.733
Median Y 2.677509    Pr(> chi2) <0.0001    R2(1,100)    0.733
max |deriv| 0.0009    Score chi2   99.69    |Pr(Y>=median)-0.5| 0.487
                      Pr(> chi2) <0.0001

         Coef   S.E.   Wald Z Pr(>|Z|)
ind=arm2 9.1781 1.7356 5.29   <0.0001
```

And sometimes it does not work at all, and then you cannot proceed with this model. Try OLS on the rank-transformed response instead (see the sketch below). It should work and be consistent with Mann-Whitney (-Wilcoxon).
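A minimal sketch of that fallback, i.e. the rank-transform approach (Conover & Iman), on the same simulated data as above:

```r
# OLS on the rank-transformed response as a fallback when the
# proportional-odds model fails to converge (same simulated data as above).
set.seed(1000)
d <- stack(data.frame(arm1 = rnorm(50, mean = 0),
                      arm2 = rnorm(50, mean = 5)))
summary(lm(rank(values) ~ ind, data = d))$coefficients
# The p-value for "indarm2" approximates wilcox.test(values ~ ind, data = d)
```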
----
## Wald's, Wilks' Likelihood Ratio, and Rao's testing. Why should I care?

Oh my, so many related topics! But nobody said it would be a piece of cake.

I'm not going to write a textbook of statistical methods, so if you're curious about the Wilks' Likelihood Ratio (LR), Wald's and Rao's testing, google these terms.
You can start from:
- [Likelihood-ratio test or z-test?](https://stats.stackexchange.com/questions/48206/likelihood-ratio-test-or-z-test)
- [FAQ: How are the likelihood ratio, Wald, and Lagrange multiplier (score) tests different and/or similar?](https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faqhow-are-the-likelihood-ratio-wald-and-lagrange-multiplier-score-tests-different-andor-similar/)
- [Wald vs likelihood ratio test](https://thestatsgeek.com/2014/02/08/wald-vs-likelihood-ratio-test/)
- [Score, Wald, and Likelihood Ratio](https://myweb.uiowa.edu/pbreheny/7110/f21/notes/10-25.pdf)

Cannot live without a solid dose of math? Look here:
- [Mathematical Statistics — Rigorous Derivations and Analysis of the Wald Test, Score Test, and Likelihood Ratio Test. Derivations of the Classic Trinity of Inferential Tests with Full Computational Simulation](https://towardsdatascience.com/mathematical-statistics-a-rigorous-derivation-and-analysis-of-the-wald-test-score-test-and-6262bed53c55)
- [Handouts for Lecture Stat 461-561 Wald, Rao and Likelihood Ratio Tests](https://www.cs.ubc.ca/~arnaud/stat461/lecture_stat461_WaldRaoLRtests_handouts_2008.pdf)
- [STAT 713 MATHEMATICAL STATISTICS II - Lecture Notes](https://people.stat.sc.edu/Tebbs/stat713/s18notes.pdf)

Briefly, if you imagine the IDEALIZED log-likelihood curve for some model parameter:
![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/d85738a7-f5a8-40d4-9463-5433c3f386d7)

then
- **Wald's** is about the difference between the true value of some parameter $p_0$ (e.g. the difference between groups) and the parameter $\hat{p}$ estimated from the data, placed on the horizontal axis (called the parameter space).
- **Wilks' LRT** is about the difference between two log-likelihoods (or the ratio of two likelihoods, hence the name "likelihood ratio testing") of a "full" model with the parameter of interest and a reduced model without it. This difference is placed on the vertical axis.
- **Rao's score** is about the slope of the tangent line (Rao's score) to the log-likelihood curve.

OK, now let's make some notes:

1. Wald's, Wilks' and Rao's approaches are just different ways to test hypotheses about some model parameter of interest. They offer 3 different ways to answer the question: _does constraining the parameters of interest to zero (leaving the corresponding predictor variables out of the model) reduce the fit of the model?_

2. In our considerations we'll primarily employ two of these methods: Wald's test and Wilks' Likelihood Ratio Test (LRT). We'll closely examine how they behave in various scenarios.

3. Under the null hypothesis they provide asymptotically equivalent results. Asymptotically means "at infinite sample size". In real-world scenarios they will always differ, but unless you hit an "edge case", they will be rather consistent (at the end of the day, they assess the same thing). Still, for the test statistics Wald ≥ LRT ≥ Rao or, equivalently, $p_{\text{Wald}} \le p_{\text{LRT}} \le p_{\text{Rao}}$, even if you don't notice it.

4. They may **noticeably** diverge in "edge cases", where the log-likelihood curve of a model parameter deviates from a parabolic shape. If you read any of the first 3 links, you will know what I mean.
5. Wald's test is extremely useful for testing hypotheses in the following cases:
   a. any kind of comparisons between group means and their combinations - via contrasts - **which is essential in experimental research employing planned comparisons**
   b. AN[C]OVA-type joint testing (of model coefficients) to assess the main and interaction effects in non-likelihood models, where it's the only option (_"why is it the best? because it's the only one available"_), like GEE-estimated GLMs (accounting for dependency and heteroscedasticity) or quantile regression. So, despite the criticism it receives (in small samples!), **it's important enough to be widespread in all statistical packages, and that's why you should know it**.

6. Wald's is also faster than the LRT, as it needs only a single model fit. The LRT, on the contrary, needs at least 2 fits to compare 2 nested models: with and without the term you want to assess (you can also compare other settings, like covariance structures). With an increasing number of variables to assess, the number of compared models increases as well.

I hear you: _Yes, but computers nowadays are faster and faster, minimizing the importance of this very issue_. OK, but nature abhors a vacuum - whenever something drops, something else appears:
- analyses employing **multiple imputation of missing data** (like MICE). In such a setting the inference is repeated on each of the imputed datasets (20-50 and more of them) and then the results are pooled. The amount of time needed may not be acceptable, especially if you have more analyses to complete.
- in small-sample inference, especially in repeated-observation (longitudinal) studies, **computationally demanding adjustments to the degrees of freedom** (like **Satterthwaite** and **Kenward-Roger** (DoF + covariance)) are in common use. They are standard in fields like mine (clinical trials). They can slow down the calculations several-fold!
- **employing complex covariance structures in dependent-observation models** (repeated, clustered) requires many parameters (co-variances) to be estimated, which noticeably increases the fitting time!
- MLE, IWLS and similar methods are iterative, so let's hope the implementation you use is optimal, or it will slow down the LRT inference as well.

And now: ALL four scenarios often **occur together**. In my work it's a daily reality to work with models accounting for repeated observations, employing small-sample adjustments to the degrees of freedom, under the multiple-imputation setting.

**LRT inference in this case can be really painful. If the sample size exceeds just 15-20 observations, Wald's approach, despite all the criticism, can win against the complexity of the LRT.**

7. The LRT is found to be more conservative than Wald's approach in small samples, because of the relationship between the values of the statistics obtained with the three approaches: Wald ≥ LR ≥ Rao (see [Ranking of Wald, LR and score statistic in the normal linear regression model](https://stats.stackexchange.com/questions/449494/ranking-of-wald-lr-and-score-statistic-in-the-normal-linear-regression-model)). It means that the LRT is less likely to reject a null hypothesis when it's true (i.e., the LR has a lower Type-I error rate). This conservativeness can be advantageous when you want to avoid making false-positive errors, such as in clinical trials or other critical applications.
In contrast, Wald's test can be more liberal, leading to a higher likelihood of false positives. That's why it's typically advised to select the LRT over Wald's - as long as you are free to choose.

Let's summarize it with: "_When the sample size is small to moderate, the Wald test is the least reliable of the three tests. We should not trust it for such a small n as in this example (n = 10). Likelihood-ratio inference and score-test based inference are better in terms of actual error probabilities coming close to matching nominal levels. A marked divergence in the values of the three statistics indicates that the distribution of the ML estimator may be far from normality. In that case, small-sample methods are more appropriate than large-sample methods._" (Agresti, A. (2007). An introduction to categorical data analysis (2nd ed.). John Wiley & Sons.)

and

"_In conclusion, although the likelihood ratio approach has clear statistical advantages, computationally the Wald interval/test is far easier. In practice, provided the sample size is not too small, and the Wald intervals are constructed on an appropriate scale, **they will usually be reasonable** (hence their use in statistical software packages). In small samples however the likelihood ratio approach may be preferred._" ([Wald vs likelihood ratio test](https://thestatsgeek.com/2014/02/08/wald-vs-likelihood-ratio-test/))

8. Wald's may not perform well when sample sizes are small or when the distribution of the parameter estimates deviates from normality. In other words, the Wald test assumes that the log-likelihood curve (or surface) is quadratic. The LRT does not, and thus is more robust to non-normality of the sampling distribution of the parameter of interest.

I quickly wrote a simulation using the general linear model with a single 3-level categorical predictor (just for fun), with the response sampled from the normal distribution and parameters chosen empirically so that the p-values span several orders of magnitude under increasing sample size. The statements made in points 7 and 8 are reflected in the results. Under normality of the parameter sampling (natural in this scenario), **for N > 10-20 the differences between the methods are negligible, so there's no need to blame Wald's and praise the LRT**.

```r
library(dplyr)    # for the pipes and mutate()
library(ggplot2)

set.seed(1000)
do.call(rbind,
        lapply(c(5, 10, 50, 100, 200), function(N) {
          print(N)
          do.call(rbind,
                  lapply(1:100, function(i) {
                    tmp_data <- data.frame(val = c(rnorm(N, mean=0.9), rnorm(N, mean=1), rnorm(N, mean=1.3)),
                                           gr  = c(rep("A", N), rep("B", N), rep("C", N)))
                    m1 <- lm(val ~ gr, data=tmp_data)
                    m2 <- lm(val ~ 1,  data=tmp_data)
                    data.frame(Wald = lmtest::waldtest(m1, m2)$`Pr(>F)`[2],
                               LRT  = lmtest::lrtest (m1, m2)$`Pr(>Chisq)`[2])
                  })) %>% cbind(N=N)
        })) %>%
  mutate(N = factor(N),
         "Wald to LRT" = Wald / LRT) -> results

library(gghalves)
library(patchwork)
(results %>%
    ggplot(aes(x=Wald, y=LRT, col=N)) +
    geom_point() +
    geom_line() +
    theme_bw() +
    scale_y_log10() +
    scale_x_log10() +
    geom_abline(slope=1, intercept = 0) +
    labs(title="Wilks' LRT vs. Wald's p-values; General Linear Model")) +
  (results %>%
     ggplot(aes(x=N, y=`Wald to LRT`)) +
     geom_half_boxplot(nudge = 0.1, outlier.color = NA) +
     geom_half_violin(side = "r", trim = TRUE, nudge = .1, scale = "width") +
     geom_point() +
     scale_y_continuous(breaks=c(0,0.5,1,1.5,2,3,4,5, max(round(results$`Wald to LRT`))), trans="log1p") +
     geom_hline(yintercept=1, col="green") +
     theme_bw() +
     labs(title="Wald : LRT ratio of p-values; General Linear Model"))
```
![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/833e505d-2aa9-4a88-bc12-d193a5add1b9)

PS: Something weird here - Wald is MORE conservative in this scenario! The p-values are larger than for the LRT, which means a lower test statistic. I need to investigate it!

If we move away from the quadratic log-likelihood (typically in binomial-distribution problems), we observe discrepancies, as anticipated by the theory. But where the differences matter, they occur far below any common significance level. So again, Wald's is not that bad here!

```r
set.seed(1000)
do.call(rbind,
        lapply(c(5, 10, 50, 100, 200), function(N) {
          print(N)
          do.call(rbind,
                  lapply(1:100, function(i) {

                    tmp_data <- data.frame(val = c(sample(x = 0:1, size=N, prob=c(5,5), replace=TRUE),
                                                   sample(x = 0:1, size=N, prob=c(6,4), replace=TRUE),
                                                   sample(x = 0:1, size=N, prob=c(5,5), replace=TRUE)),
                                           gr  = c(rep("A", N), rep("B", N), rep("C", N)))

                    m1 <- glm(val ~ gr, family = binomial(link="logit"), data=tmp_data)
                    m2 <- glm(val ~ 1,  family = binomial(link="logit"), data=tmp_data)

                    data.frame(Wald = lmtest::waldtest(m1, m2)$`Pr(>F)`[2],
                               LRT  = lmtest::lrtest (m1, m2)$`Pr(>Chisq)`[2])

                  })) %>% cbind(N=N)
        })) %>%
  mutate(N = factor(N),
         "Wald to LRT" = Wald / LRT,
         Magnitude = abs(Wald - LRT)) -> results

results %>%
  group_by(N) %>%
  summarize(median_magn = median(Magnitude)) -> diff_magnit

(results %>%
    ggplot(aes(x=Wald, y=LRT, col=N)) +
    geom_point() +
    geom_line() +
    theme_bw() +
    scale_y_log10() +
    scale_x_log10() +
    geom_abline(slope=1, intercept = 0) +
    labs(title="Wilks' LRT vs. Wald's p-values; Generalized Linear Model")) +
  (results %>%
     ggplot(aes(x=N, y=`Wald to LRT`)) +
     geom_half_boxplot(nudge = 0.1, outlier.color = NA) +
     geom_half_violin(side = "r", trim = TRUE, nudge = .1, scale = "width") +
     geom_point() +
     scale_y_continuous(breaks=c(0,0.5,1,1.5,2,3,4,5, max(round(results$`Wald to LRT`))), trans="log1p") +
     geom_hline(yintercept=1, col="green") +
     theme_bw() +
     labs(title="Wald : LRT ratio of p-values; Generalized Linear Model") +
     geom_text(data=diff_magnit, aes(y=-0.1, x=N, label=sprintf("Me.Δ=%.4f", median_magn)), size=3))
```

![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/c4a39713-9b7e-4952-9a32-e98bdfe08217)

9. Sometimes Wald's testing fails to calculate (e.g. the estimation of the covariance fails), while the likelihood is still obtainable, and then the LRT is the only method that works. Who said the world is simple? And sometimes the LRT is not available, as mentioned above. Happy are those who have both at their disposal.
10. The LRT allows one to perform AN[C]OVA-type analyses (which require careful specification of the model terms!) but may not be available in your software for more complex analyses of contrasts (e.g. planned covariate-adjusted comparisons), so you'll have to specify the nested models by hand. At the same time, Wald's approach takes full advantage of the estimated parameters and their covariance matrix, **which means that "the sky is the only limit" when testing contrasts of any complexity**. With the LRT it will be harder to obtain the appropriate parametrization, but it's still doable if you carefully set the contrasts! (TODO: show example with `glmglrt`)

11. **Wald's (and Rao's!) inference is not invariant to transformations (re-parametrizations)**: it relies on derivatives, which are sensitive to the scale. If you use Wald's to calculate a p-value or confidence interval on two different scales, e.g. probability vs. logit transformed back to the probability scale, you will get different results. Often they are consistent, but discrepancies may occur at the boundary of significance, and then you're in trouble. By the way, Wald's on the logit scale will return sensible results, while Wald's applied on the probability scale may yield negative probabilities (e.g. -0.12) or probabilities exceeding 100% (e.g. 1.14). This is very important when employing LS-means (EM-means) on the probability-regrid scale(!).
On the contrary, the **LRT IS transformation invariant** and will have exactly the same value regardless of any monotonic transformation of the parameter.

/ Side note: Wald's assumes a normally distributed sampling distribution of the parameter, which briefly means **symmetry**. Data truncated to 0-1 will never be symmetric; that's why you may easily obtain confidence-interval bounds below 0 or above 1. See? I believe that's a manifestation of the _Hauck-Donner effect in binomial models_, occurring when the estimated parameter is close to the boundary of the parameter space (here: close to 0 or 1, at a very small sample size).
```r
> binom::binom.asymp(x=1, n=5)
      method x n mean     lower    upper
1 asymptotic 1 5  0.2 -0.150609 0.550609    # -0.15 ?
> binom::binom.asymp(x=4, n=5)
      method x n mean    lower    upper
1 asymptotic 4 5  0.8 0.449391 1.150609    # 1.15 ?
```
/

12. There are other problematic areas for Wald's testing, like mixed models (ironically, numerous statistical packages offer only (or mostly) this kind of inference, due to the big complexity of the alternatives!). For example, read: [Paper 5088-2020: A Warning about Wald Tests](https://support.sas.com/resources/papers/proceedings20/5088-2020.pdf)

## You said that different methods of testing (Wald's, Rao's, Wilks' LRT) may yield slightly different results?

**Yes. But serious differences mostly happen under the ALTERNATIVE hypothesis and at low p-values.**
Let's consider a 2-sample test comparing means (or any other quantity of interest).
A corresponding model (general linear, with treatment contrast coding) will have just the response (dependent) variable and a single 2-level categorical predictor whose levels form the groups we want to compare: `response ~ group`.

The fitted model will have two estimated parameters: the intercept coefficient (representing the 1st group mean, at the reference level of the group indicator) and the coefficient representing the shift (in means or anything else) between the 2nd group and the intercept. Its value added to the intercept gives us the 2nd group mean.
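A minimal sketch of this parametrization on simulated data: the intercept is the reference-group mean, and the second coefficient is the shift, so testing the shift against zero is the two-group comparison.

```r
# Treatment-coded two-group model: intercept = mean of group A,
# "groupB" coefficient = shift of group B relative to A (simulated data).
set.seed(123)
d <- data.frame(group    = factor(rep(c("A", "B"), each = 30)),
                response = rnorm(60, mean = rep(c(10, 12), each = 30)))
coef(lm(response ~ group, data = d))  # (Intercept) = mean(A); groupB = mean(B) - mean(A)
tapply(d$response, d$group, mean)     # the group means, for comparison
```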
Now, imagine the log-likelihood curve for the "shift" parameter. Under the null hypothesis the difference between the means is 0, and so is the parameter. This generalizes directly to comparing multiple groups through a predictor variable with multiple levels (forming the groups).

Now, if removing this predictor from the model (effectively: if the parameter is ~zero) doesn't change the fit, then - well - this predictor is "useless", as it didn't affect the response. Which means that the groups distinguished by the levels of this predictor are mutually similar: the respective "shifts" are zero, and all means are similar enough to say they come from the same population. Which means that H0 cannot be rejected.

![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/8c3bd8ed-df10-4da1-bfde-d96574c24da5)

If you operate under H0, then regardless of its shape (ideally quadratic, but here let's agree on a concave one with a single maximum):
- **Wald's:** the difference between groups = 0, so the distance in the parameter space approaches 0.
- **Wilks' LRT:** the difference between the two log-likelihoods approaches 0 (so the likelihood ratio approaches 1) - no gain in fit.
- **Rao's score:** the slope of the tangent line (Rao's score) approaches zero as well, which reflects the fact that under H0 the derivative (score) of the log-likelihood function with respect to the parameter should approach zero.

The phrase "approaches 0" means that we operate in its neighbourhood, where (in case of a concave shape) changes in all 3 measures are relatively small and close to each other (they follow each other). This explains why you will mostly obtain similar results under the null hypothesis, and why the methods start diverging as you move away from this "safe area" towards the alternative hypothesis.

This is a simple graphical explanation, but formal proofs can be found in many places, e.g.: [Score Statistics for Current Status Data: Comparisons with Likelihood Ratio and Wald Statistics](https://sites.stat.washington.edu/jaw/JAW-papers/NR/jaw-banerjee-05IJB.pdf) or [Mathematical Statistics — Rigorous Derivations and Analysis of the Wald Test, Score Test, and Likelihood Ratio Test](https://towardsdatascience.com/mathematical-statistics-a-rigorous-derivation-and-analysis-of-the-wald-test-score-test-and-6262bed53c55)

In simpler words, under "_no difference_" we are approaching the maximum of this curve, which must be reflected from either perspective: "horizontal" (Wald), "vertical" (Wilks' LRT), and "slope" (Rao).

This is ideally observed if the distribution of the parameter is normal, which is necessary for Wald's approach to work properly.
But remember that normality never holds exactly, as it's only an idealized, theoretical construct! And because normality is always approximate (at finite sample sizes), the Wald and LRT p-values may be **very close** to each other, but never exactly equal (even if you cannot distinguish them in the "good case").

/ BTW: That's why you have seen, many times, the "t" (if the degrees of freedom are easy/possible to derive) or "z" (otherwise; at infinite degrees of freedom) statistic reported for the coefficients of most regression models and when testing contrasts. That's because the sampling distribution of these parameters is theoretically assumed to be normal (via the Central Limit Theorem). But that's not always the case. For less "obvious" models, like quantile regression, such an approximation may be incorrect (or limited only to "good cases"), and then resampling techniques are typically employed (e.g. bootstrap, jackknife). /
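For illustration, a minimal sketch of such resampling-based inference, assuming the `quantreg` package: its `summary.rq()` offers a bootstrap standard-error method instead of relying on asymptotic normality.

```r
# Bootstrap-based inference for median regression (assumes the quantreg
# package); se = "boot" resamples rather than assuming asymptotic normality.
library(quantreg)
set.seed(5)
d <- data.frame(group = factor(rep(c("A", "B"), each = 40)),
                y     = rexp(80, rate = rep(c(1, 0.5), each = 40)))
fit <- rq(y ~ group, tau = 0.5, data = d)  # median regression
summary(fit, se = "boot", R = 500)         # bootstrapped SEs and p-values
```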
**But the more the log-likelihood curve deviates from the quadratic shape, the more they will diverge from each other, even under the null hypothesis!**

And the shape may diverge from the desired one also through common transformations, e.g. from the logit scale to the probability scale, when employing the logistic regression family.

The already mentioned, excellent article [Wald vs likelihood ratio test](https://thestatsgeek.com/2014/02/08/wald-vs-likelihood-ratio-test/) by Prof. Jonathan Bartlett shows additional examples that speak louder than words!

![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/4b89d9d8-c571-4ddf-a137-8a3f1f8889e0)
(cited with Prof. Bartlett's permission).

**But what will happen under the alternative hypothesis?**
Briefly: the 3 measures will increase. Whether this happens in a more or less consistent manner (you remember the relationship: Wald ≥ LRT ≥ Rao) depends on the log-likelihood curve. You may observe good agreement in the beginning (near H0) and then quick divergence.

**The good news is that the discrepancies will rather occur at small p-values, typically below common significance levels, so for us, users of the tests, the differences may be barely noticeable**, if we are lucky. But read the next section...

Let me show you an example. We will compare data sampled from two normal distributions in four settings:
1. Both groups are sampled from the same N(50, 20). We operate under the null hypothesis. Practically full overlap of the empirical distributions.
```r
set.seed(1000)
x1 <- rnorm(100, 50, 20)
x2 <- rnorm(100, 50, 20)
effectsize::p_overlap(x1, x2)
Overlap |       95% CI
----------------------
0.96    | [0.86, 1.00]
```

2. One group is sampled to have a smaller mean, with variances unchanged: N(50, 20) vs. N(35, 20). We operate under the alternative hypothesis, but the data mostly overlap (this is important for the ordinal logistic regression, as the log-likelihood should be more or less symmetric)
```r
# Exemplary sampling
set.seed(1000)
x1 <- rnorm(100, 50, 20)
x2 <- rnorm(100, 35, 20)

effectsize::p_overlap(x1, x2)
Overlap |       95% CI
----------------------
0.73    | [0.62, 0.84]

> tidy(t.test(x1, x2, var.equal = TRUE))
# A tibble: 1 × 10
  estimate estimate1 estimate2 statistic    p.value parameter conf.low conf.high method            alternative
      13.3      50.3      37.0      4.92 0.00000179       198     7.98      18.7 Two Sample t-test two.sided
```

3. Like in case 2, but now the variance is reduced: N(50, 10) vs. N(35, 10). Now the data overlap less (bigger separation). This will distort the log-likelihood curve.
```r
# Exemplary sampling
set.seed(1000)
x1 <- rnorm(100, 50, 10)
x2 <- rnorm(100, 35, 10)
effectsize::p_overlap(x1, x2)
Overlap |       95% CI
----------------------
0.46    | [0.37, 0.56]
```

4. Like in case 3, with further reduced variance: N(50, 5) vs. N(35, 5).
```r
set.seed(1000)
x1 <- rnorm(100, 50, 5)
x2 <- rnorm(100, 35, 5)
effectsize::p_overlap(x1, x2)
Overlap |       95% CI
----------------------
0.13    | [0.08, 0.19]
```

And now let's have a look at the p-values obtained from the ordinal logistic regression (as if we applied the Mann-Whitney test to normally distributed data), employing the three methods:

```r
compare_3methods_distr(n_group = 50, samples=50,
            distr_pair_spec = list("N(50,20) vs. N(50,20)" =
                        c(arm_1_distr = "rnorm(n_group, mean=50, sd=20)",
                          arm_2_distr = "rnorm(n_group, mean=50, sd=20)"),
                        "N(50,20) vs. N(35,20)" =
                        c(arm_1_distr = "rnorm(n_group, mean=50, sd=20)",
                          arm_2_distr = "rnorm(n_group, mean=35, sd=20)",
                          log = "TRUE"),
                        "N(50,10) vs. N(35,10)" =
                        c(arm_1_distr = "rnorm(n_group, mean=50, sd=10)",
                          arm_2_distr = "rnorm(n_group, mean=35, sd=10)",
                          log = "TRUE"),
                        "N(50,5) vs. N(35,5)" =
                        c(arm_1_distr = "rnorm(n_group, mean=50, sd=5)",
                          arm_2_distr = "rnorm(n_group, mean=35, sd=5)",
                          log = "TRUE")))
```
![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/8c684e6e-cb79-4c79-80db-0579044a36e6)

As expected, the methods gave very similar p-values (look especially at the 1st, top-left plot for H0). Even if differences occurred, they were too small to be observed. That's because the p-values spanned several orders of magnitude! And we could end here: the situation is perfect!
1. ideal agreement under H0 (where we worry the most about wrong rejections; type-1 error)
2. potential discrepancies under H1 at such low magnitudes that we can totally ignore them (in this case).

Let's log-transform the axes to have a better view:
OK, now we can observe the differences. No surprise they were barely noticeable at these magnitudes!
Generally we can observe that the lower the p-value, the bigger the discrepancy. But why?

That's simple: because we move farther and farther from H0. With a concave curve, at some point far from the maximum, small changes in the argument (X) result in big changes in the value (Y). Here I have no idea what the shape of the log-likelihood curve is, which can further amplify the problem. So let's leave it.

![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/62977972-f69e-49ba-9431-d331edfda0f4)

On this occasion, let's also have a look at how the p-values differ with respect to the Mann-Whitney (-Wilcoxon) test.
It's interesting to observe the opposite directions in which the p-values diverge.

/ PS: Remember that the way the p-value is calculated in Wilcoxon depends on the implementation! There can be a normal approximation or an exact method, and ties may be unhandled or handled via different methods (Wilcoxon, Pratt). /

```r
compare_3methods_distr_model_vs_test(n_group = 50, samples=50,
            distr_pair_spec = list("N(50,20) vs. N(50,20)" =
                        c(arm_1_distr = "rnorm(n_group, mean=50, sd=20)",
                          arm_2_distr = "rnorm(n_group, mean=50, sd=20)"),
                        "N(50,20) vs. N(35,20)" =
                        c(arm_1_distr = "rnorm(n_group, mean=50, sd=20)",
                          arm_2_distr = "rnorm(n_group, mean=35, sd=20)",
                          log = "TRUE"),
                        "N(50,10) vs. N(35,10)" =
                        c(arm_1_distr = "rnorm(n_group, mean=50, sd=10)",
                          arm_2_distr = "rnorm(n_group, mean=35, sd=10)",
                          log = "TRUE"),
                        "N(50,5) vs. N(35,5)" =
                        c(arm_1_distr = "rnorm(n_group, mean=50, sd=5)",
                          arm_2_distr = "rnorm(n_group, mean=35, sd=5)",
                          log = "TRUE")))
```
![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/51718316-4a2e-41d6-a951-249617404a23)

## Even worse! You just said that different methods of testing can yield OPPOSITE results!

Well, yes, Wald's and the LRT **may yield opposite results**, e.g. Wald's p-value = 0.8 vs. LRT p-value < 0.001.

But that doesn't happen very often - only in "**edge cases**". The problem is that you may NOT know when you're in an "edge case".
That's why doing statistical analysis is not just a brainless "click-and-go" task!

If you suspect something is wrong - I mean, that the results of some analysis do not reflect what you observe in the data - validate it with another analysis.

This is summarized in the article I already cited ([Wald vs likelihood ratio test](https://thestatsgeek.com/2014/02/08/wald-vs-likelihood-ratio-test/)): "Further, a situation in which the Wald approach completely fails while the likelihood ratio approach is still (often) reasonable is when testing whether a parameter lies on the boundary of its parameter space."

If you anticipate that something bad may happen, VALIDATE your calculations using a different method. It doesn't have to bring the SAME results (it's a different method), but at least you will be able to assess whether they are approximately consistent.

If, for instance, you run the ordinal logistic regression over Likert data and obtain Wald's p-value = 0.3 and LRT p-value = 0.0003, then you can also simplify it for a while and check with the Mann-Whitney (-Wilcoxon) test, or even run OLS with a ranked response (it's a very good approximation).

**And always - always start with the EDA (exploratory data analysis), plot your data, know it. Otherwise you won't even suspect what could go wrong.**

**Take-home messages #1:**
1. A model can fail to converge if the data are "not nice" :-)
2. Wald's approach may fail while the LRT does well.
3. A test may still work in this case (like the LRT).
4. Wald's may differ from the LRT under the alternative hypothesis.

By the way! You may ask: _but can't we just use a better implementation that is able to complete the estimation?_

Good question! Let's find a different implementation that converges in this problematic case. There's one: MASS::polr().
Again, we start with N(0, 1) vs. N(5, 1):
```r
> set.seed(1000)
> stack(
+    data.frame(arm1 = rnorm(50, mean=0.0),
+               arm2 = rnorm(50, mean=5))) %>%
+   mutate(values_ord = ordered(values), ind = factor(ind)) %>%
+   MASS::polr(values_ord ~ ind, data=., Hess = TRUE) %>% summary() %>% coef() %>% as.data.frame() %>% slice(1)
           Value Std. Error  t value
indarm2 21.30274   19.75443 1.078378

# p-value
> 2*pnorm(q=1.078378, lower.tail=FALSE)
[1] 0.2808651
```
Good! This one converged! But... wait a minute! What?! Is this a joke?! Why such a big p-value?!

**Because the p-value is simply wrong. We are exactly in the "edge case" and Wald's approach evidently failed.**

And what about the LRT? Is this any better?
```r
> set.seed(1000)
> stack(
+    data.frame(arm1 = rnorm(50, mean=0.0),
+               arm2 = rnorm(50, mean=5))) %>%
+   mutate(values_ord = ordered(values), ind = factor(ind)) -> ord_data

> model_full      <- MASS::polr(values_ord ~ ind, data=ord_data, Hess = TRUE)
> model_intercept <- MASS::polr(values_ord ~ 1,   data=ord_data, Hess = TRUE)

> anova(model_full, model_intercept)
Likelihood ratio tests of ordinal regression models

Response: values_ord
  Model Resid. df Resid. Dev   Test    Df LR stat. Pr(Chi)
1     1         1   921.0340
2   ind         0   782.4094 1 vs 2     1 138.6246       0
```
Yes, the LRT did better. This result is consistent with the Mann-Whitney (-Wilcoxon) test.

Now let's again increase the mean in one group by 0.01. Not only does the model converge properly, but Wald's and the LRT are also in good agreement!
``` r
> set.seed(1000)
> stack(
+    data.frame(arm1 = rnorm(50, mean=0.01),
+               arm2 = rnorm(50, mean=5))) %>%
+   mutate(values_ord = ordered(values), ind = factor(ind)) %>%
+   MASS::polr(values_ord ~ ind, data=., Hess = TRUE) %>% summary() %>% coef() %>% as.data.frame() %>% slice(1)
           Value Std. Error  t value
indarm2 9.179891   1.735655 5.289008

# p-value
> 2*pnorm(q=5.289008, lower.tail=FALSE)
[1] 1.229815e-07

# Let's confirm with the previous method rms::orm()
# -------------------------------------------------
> set.seed(1000)
> stack(
+    data.frame(arm1 = rnorm(50, mean=0.01),
+               arm2 = rnorm(50, mean=5))) %>%
+   mutate(values_ord = ordered(values), ind = factor(ind)) %>%
+   rms::orm(values ~ ind, data=.)
Logistic (Proportional Odds) Ordinal Regression Model

rms::orm(formula = values ~ ind, data = .)

                      Model Likelihood          Discrimination    Rank Discrim.
                            Ratio Test                 Indexes          Indexes
Obs           100    LR chi2     133.10    R2       0.736         rho    0.865
Distinct Y    100    d.f.             1    R2(1,100)    0.733
Median Y 2.677509    Pr(> chi2) <0.0001    R2(1,100)    0.733
max |deriv| 0.0009    Score chi2   99.69    |Pr(Y>=median)-0.5| 0.487
                      Pr(> chi2) <0.0001

         Coef   S.E.   Wald Z Pr(>|Z|)
ind=arm2 9.1781 1.7356 5.29   <0.0001
```
**And now that's fine.**

So, as I said: even if a model *actually converges*, the estimates may be unreliable. Always remember to check your outcomes.

**Take-home message #2:**

Having multiple options, always choose the method that alerts you that something went wrong, rather than the method that silently pretends nothing wrong happened and happily continues. **It's always better to have NO result than to have a WRONG result.**

**You. Have. Been. Warned.**

## Any words of comfort...?

Yes :)

First, for several tests, reproducing them closely with a model requires just one specific approach, so you don't have to overthink the problem.
You can find a few examples of perfect agreement in this gist (temporarily; I will move it to this repo soon): https://github.com/adrianolszewski/Logistic-regression-is-regression/blob/main/Testing%20hypotheses%20about%20proportions%20using%20logistic%20regression.md

If you use R:
- prop.test() (a z-test with pooled standard errors) will exactly match the Rao ANOVA over logistic regression (sketched below)
- the classic z-test for 2 proportions with non-pooled standard errors will exactly match Wald's test applied to the AME (average marginal effect) of a logistic regression
- Cochran-Mantel-Haenszel will exactly match Wald's test applied to conditional logistic regression
- McNemar and Cochran's Q will be quite close to a GEE-estimated logistic regression (mostly with exchangeable covariance) tested with Wald's
- Breslow-Day will agree with the Rao ANOVA over logistic regression.

And then you don't have to split hairs. Remember also that as soon as your model contains multiple categorical predictors and numerical covariates, you cannot compare it with most classic tests (which are limited to just one categorical predictor).
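A minimal sketch of the first equivalence listed above (simulated counts): the chi-squared statistic of `prop.test()` without the continuity correction matches the Rao score test from the logistic-regression ANOVA.

```r
# prop.test (no continuity correction) vs. the Rao score test over logistic
# regression - illustrating the equivalence claimed above (simulated counts).
success <- c(30, 45); total <- c(100, 100)
prop.test(success, total, correct = FALSE)   # X-squared = score statistic

d <- data.frame(y  = c(rep(1, 30), rep(0, 70), rep(1, 45), rep(0, 55)),
                gr = factor(rep(c("A", "B"), each = 100)))
anova(glm(y ~ gr, family = binomial(), data = d), test = "Rao")
```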
## I give up! Too much information! Briefly: is Wald just bad and LRT just awesome?

No. At small samples (N ~ 10-30) the LRT will be less biased than Wald's. Above that, they will do approximately equally well (barring edge cases and problematic transformations). Wald's will be easier to work with and much faster. That's it.

More precisely:

**LRT - advantages:**
1) At small samples you rather won't test complex hypotheses and contrasts, but rather the effect of (a) single variable(s), e.g. "treatment arm". The LRT for such a comparison will be available in your statistical package and will likely be better than Wald's
2) Whatever scale you choose (e.g. in binomial models: log-odds or probability) - the LRT will do equally well (unlike Wald's)
3) In simple scenarios you may not notice the increased testing time due to the multiple fits.
4) It may be available even when Wald's fails to calculate.
5) It's possible to analyze complex contrasts (including difference-in-differences) with the LRT as well as with Wald's. It won't be as simple as with the `emmeans` package, and you will need to construct the contrasts carefully, selecting the appropriate model coefficients.

LRT - disadvantages or no clear benefit:
0) When you work with a small N, like N < 20, then ANY analysis you do may be unreliable. If you believe that "tools and tricks" will manage, you are cheating yourself. Think twice, then pick the LRT.
1) If you need to test contrasts accounting for appropriate covariance(s) and degrees-of-freedom adjustments, like in a repeated-data setting, the LRT rather won't be available in your statistical software.
2) For testing more complex contrasts it won't be as easy as with, say, the `emmeans` package, where you have all LS-means (EM-means) listed nicely, so you just pick the right ones. Model coefficients will depend on the parametrization, and for the default `treatment contrast` (dummy) coding you will need to be very careful about the reference levels represented by the model coefficients! Also, only a very few models are supported in R packages (mostly GLM, Cox), while GLS (MMRM) requires extra work...
3) If you need to test something over a non-likelihood model (e.g. GEE-estimated), the LRT will not be available.
696 | 4) If you work in a computationally demanding setting (multiple imputation + Kenward-Roger + unstructured covariance + multiple variables), you may prefer not to use it due to an unacceptably long calculation time.
697 | 5) Support in statistical packages may be seriously limited. Lots of the necessary work may be up to you.
698 | 6) No test for LS-means (EM-means). Recall that testing model coefficients is NOT the same as testing LS-means in general. Moreover, you may want to test at a CERTAIN value of the numerical covariate.
699 | 
700 | Wald's - advantages:
701 | 1) As fast as fitting the model a single time. Matters when run under multiple imputation.
702 | 2) OK if you pick the appropriate scale (e.g. log-odds rather than probability; note, the probability scale CAN BE USED TOO, just with care - confirm it with the analysis on log-odds)
703 | 3) In models assuming conditional normality, and at N>20-50 (depending on the model) - not much worse than the LRT. I guess you have enough data if you start playing with complex contrasts?
704 | 4) The farther from the null hypothesis, the less you care about the discrepancies (occurring typically at low p-values). Who cares if Wald p=0.0000000034 and LRT p=0.000000084? Don't make it absurd!
705 | 5) You can compare any contrasts you want, accounting for covariances, degrees-of-freedom small-sample adjustments, robust estimators of covariance, any kind of covariance structure, etc.
706 | 6) Related to 5 - you can use the multivariate-t distribution ("mvt") based exact adjustment for multiple comparisons
707 | 7) Available for non-likelihood models (GEE-estimated GLM; and such models usually NEED more data!)
708 | 8) Available in all statistical packages.
709 | 9) Allows for testing LS-means (EM-means), also at CERTAIN values of the numerical covariate(s).
710 | 
711 | Wald's - disadvantages:
712 | 1) At N<20-50 it may be anti-conservative, worse than the LRT
713 | 2) May fail to calculate
714 | 3) May give different estimates when used on different scales (e.g. log-odds vs. probability). You can calculate it on both and - if consistent - report the one you prefer. Otherwise report the one you find more relevant.
715 | 4) In edge cases it can yield very different results - typically that's a sign of a small sample or a genuinely extreme scenario. In such a case don't blindly trust either of them, but prefer the LRT.
716 | 
717 | There is no one-size-fits-all solution in statistics.
718 | 
719 | ----
720 | ## Interpretation
721 | For some models the equivalence is strict, I mean - it can be shown by formulas (I was shown this in the past, yet I won't take up the challenge of replicating it :-) ).
722 | For others it's a close approximation, coming from different approaches (thus, a different set of calculations), testing different (yet related!) hypotheses, and sometimes even from rounding.
723 | When you take a simple model like `response ~ categorical_predictor`, then, using the _treatment coding_ (default in R), the interpretation comes out of the box:
724 | - the value of the intercept is the (conditional) mean, or median, or any other function of the data in the group rendered by the reference level of your predictor (e.g. non-smokers). Note that the mean can be either "raw"
725 | or transformed via the link function in the case of the GLM (Generalized Linear Model).
726 | - the value of the 2nd model coefficient (assigned to the 2nd level of the predictor) is the difference between the 2nd group and the intercept (1st) group.
727 | - ... this extends to multiple predictors with >2 levels and their interactions. A minimal sketch of the two-group case follows below.
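A minimal sketch (simulated data, my own illustration) of the two-group case under the default treatment coding:

```r
# Treatment coding: the intercept is the mean of the reference group,
# the 2nd coefficient is the difference between the 2nd group and the reference
set.seed(1)
d <- data.frame(y = c(rnorm(30, mean = 10), rnorm(30, mean = 12)),
                g = factor(rep(c("ref", "trt"), each = 30)))

coef(lm(y ~ g, data = d))   # (Intercept) = mean(ref); gtrt = mean(trt) - mean(ref)
tapply(d$y, d$g, mean)      # the group means, for comparison
```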
The calculations complicate quickly (especially in the presence of interactions), but the idea remains the same.
728 | 
729 | Whether it is directly and easily interpretable depends on the formulation of your model.
730 | 
731 | For the general linear model (say, just the linear regression) the interpretation will be easy: the intercept stands for the mean in the reference group,
732 | and the value of the "beta" (assigned to the 2nd level of the predictor) represents the difference in means between the two groups.
733 | Similarly for the quantile regression, yet now you get differences in medians, first, third or any other quantile you specified.
734 | 
735 | The situation complicates if you applied a transformation of the dependent variable (response), or in the case of the Generalized Linear Models (GLM), which instead transform the E(Y|X=x),
736 | like the logistic, ordinal logistic, beta and gamma regressions (+ a few more). First, the interpretation of the "conditional mean" depends
737 | on the conditional distribution of the data (e.g. for the Bernoulli distribution in the logistic regression, the E(Y|X=x) is the probability of success). Second, it depends on the link function used, where - depending on the transformation - differences may turn into ratios, and (say) probabilities turn into log-odds. No surprise that the interpretation of the model coefficients will complicate (a bit).
738 | 
739 | And this poses another issue. Let's take the logistic regression. While all these terms: probabilities, odds, log-odds, odds ratios, and the tested hypotheses are mutually related through well-known formulas, performing inference
740 | on different scales may yield slightly different results. Well, at the end of the day, you test *different* hypotheses (though related ones, so the results cannot be wildly discrepant!) and you operate on different scales.
741 | And this will complicate EVEN MORE if you ignore some terms (average over them), because it DOES matter whether you average on the original (logit) or the transformed (probability) scale.
742 | A change on the scale of the log-odds is not the same as a change on the probability scale.
743 | [Why is the p value by average marginal effects different than the p value of the coefficients?](https://stats.stackexchange.com/questions/464115/why-is-the-p-value-by-average-marginal-effects-different-than-the-p-value-of-the)
744 | 
745 | With the logistic regression you can compare outcomes on both the logit and probability scales. The former comes out of the box, the latter requires either calculating the AME (average marginal effect) or employing the LS-means on the appropriate scale.
746 | [Transformations and link functions in emmeans](https://cran.r-project.org/web/packages/emmeans/vignettes/transformations.html)
747 | 
748 | / If you're curious about the AME, especially in terms of the logistic regression, visit: [Usage Note 22604: Marginal effect estimation for predictors in logistic and probit models](http://support.sas.com/kb/22/604.html) | [A Beginner's Guide to Marginal Effects](https://library.virginia.edu/data/articles/a-beginners-guide-to-marginal-effects) |
749 | [How can I use the margins command to understand multiple interactions in logistic regression?
Stata FAQ](https://stats.oarc.ucla.edu/stata/faq/how-can-i-use-the-margins-command-to-understand-multiple-interactions-in-logistic-regression-stata/) |
750 | [Week 13: Interpreting Model Results: Marginal Effects and the margins Command](https://clas.ucdenver.edu/marcelo-perraillon/sites/default/files/attached-files/week_13_margins.pdf) | [Stata 18 - Help for margins](https://www.stata.com/help.cgi?margins) |
751 | [An introduction to margins in R](https://cran.r-project.org/web/packages/margins/vignettes/Introduction.html)
752 | 
753 | If you are curious about the LS-means (EM-means), that's another story - just use Google.
754 | Ideally add "SAS" to your queries, as the commercial packages (SAS, Stata, SPSS, NCSS) have the best documentation I've ever seen (sorry, R).
755 | [Getting started with emmeans](https://aosmith.rbind.io/2019/03/25/getting-started-with-emmeans/)
756 | [Shared Concepts and Topics - LSMEANS Statement](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_introcom_sect040.htm)
757 | [Least-Squares Means: The R Package lsmeans](https://www.jstatsoft.org/article/view/v069i01) /
758 | 
759 | ----
760 | 
761 | ## Conclusions
762 | So you can see with your own eyes that model-based testing has LOTS of advantages. But sometimes you will need just a simple, classic test that runs FAST. Especially if you have lots of tests to do under multiple-imputation conditions, with lengthy data, and you are approaching a deadline :-)
763 | 
764 | **So the choice really depends on your needs. I only want to show you that this is doable and how well it performs.**
765 | 
--------------------------------------------------------------------------------
/Tests/Mann-Whitney (-Wilcoxon).md:
--------------------------------------------------------------------------------
1 | # Mann-Whitney (-Wilcoxon)
2 | 
3 | Still incomplete!
4 | 
5 | ## Table of contents
6 | 1. Description of the test
7 |    Start with
8 |    [Equivalence of Wilcoxon Statistic and Proportional Odds Model](https://www.fharrell.com/post/powilcoxon/) | [Resources for Ordinal Regression Models](https://www.fharrell.com/post/rpo/) | [If You Like the Wilcoxon Test You Must Like the Proportional Odds Model](https://www.fharrell.com/post/wpo/)
9 | 
10 |    https://github.com/adrianolszewski/model-based-testing-hypotheses/blob/main/Tests/Various%20tests%20to%20describe%20one%20by%20one.md#mww
11 | 
12 |    For simulations and literature: https://gist.github.com/adrianolszewski/2cec75678e1183e4703589bfd22fa8b2
13 | 
14 | 2. Model-based implementation via Ordinal Logistic Regression (aka Proportional-Odds model) - categorical ordinal data
15 | 3. Model-based implementation via Ordinal Logistic Regression (aka Proportional-Odds model) - numerical data
16 | 4. References
17 | 5. [Simulations - how well the model-based approach replicates the classic test](#simulations)
18 | 
19 | Notes: I will use the classic implementation without continuity correction and without exact calculation of p-values. If you feel lost among all the variations on this topic, read this post: [[R] The three routines in R that calculate the wilcoxon signed-rank test give different p-values... which is correct?](https://stat.ethz.ch/pipermail/r-help/2011-April/274931.html)
20 | 
21 | ----
22 | 
23 | ## Description of the test
24 | This is a non-parametric test of H0: stochastic equality vs. H1: stochastic superiority (aka dominance).
25 | 
26 | Surprised that I did not write "for comparing medians"?
Well, I did NOT, because it does NOT - in general, unless additional, strong distributional assumptions are met:
27 | 1. **data in both compared samples are IID**. It means: they have the same dispersion (scale parameter, more or less the variance) and a similar shape (e.g. if skewed, then in the same direction). Think about it. Stochastic superiority can be induced by 1) a shift in locations, 2) unequal variances, and 3) differences in the shapes of the distributions. It's like saying that the Mann-Whitney (-Wilcoxon) test is sensitive to ALL THREE factors. If you want to test for the shift in location, you have to address the two other differences. In other words, if both dispersions are similar, and if both shapes are similar as well, then what remains is the possible difference in locations, right?
28 | 
29 | And if your data meet this assumption, then Mann-Whitney gives you the _Hodges–Lehmann estimator of the pseudo-median of differences_ between the compared groups.
30 | WHAT!? _Pseudo-median difference_? Not "difference in medians"?
31 | 
32 | No. But **if the distribution of differences between the groups is symmetric, then it reduces to the median difference.**
33 | 
34 | Quick and dirty check (you can use bigger numbers for better agreement, but it may take a whole lot of time!):
35 | ```r
36 | set.seed(1000)
37 | x1 <- rnorm(1000000)
38 | x2 <- rnorm(1000000, mean=10)
39 | sprintf("mean(diff)=%f, diff(means)=%f, median(diff)=%f, diff(medians)=%f, ps-median(diff)=%f",
40 |         mean(x1 - x2), mean(x1) - mean(x2), median(x1 - x2), median(x1) - median(x2),
41 |         wilcox.test(x1, x2, conf.int = TRUE, exact = FALSE, adjust = FALSE)$estimate)
42 | 
43 | [1] "mean(diff)=-9.999591, diff(means)=-9.999591, median(diff)=-10.000615, diff(medians)=-9.998330, ps-median(diff)=-9.999202"
44 | ```
45 | 
46 | 2. **data in both compared samples are symmetric around their medians**. Only then does it reduce to the difference in medians.
47 | 
48 | In my other repository, [Mann-Whitney fails as a test of medians in general](https://github.com/adrianolszewski/Mann-Whitney-is-not-about-medians-in-general), you will find numerous resources (mostly free, some available via Google Books) to learn about this problem, and simulations proving that treating this test as a test of medians can really be dangerous.
49 | 
50 | If your teacher, statistical reviewer or boss "forces" you to use Mann-Whitney (-Wilcoxon) as a test of medians without checking the 2 assumptions, show them the literature I recommend (full of formulas and examples) and the results of my simulations. If this does not convince them, quit or resign (if you can), or at least put "On teacher's/client's demand" somewhere in your work, to make yourself less guilty of harming research... Yes, seriously - it's easy to reject the null hypothesis under exactly equal medians with this test (which invalidates it as a test of medians) - and the type-1 error rate can go far beyond 5%, 10% and even 20%.
51 | 
52 | **Take-home message:**
53 | If you want to compare quantiles (including medians), use the quantile regression (I will show you how elsewhere), use the Brown-Mood test of medians, or make sure the above assumptions hold in your case. And always visually assess your data.
54 | 
55 | BTW: you will find articles claiming that the "Mood test of medians should be ditched". And what do the authors say? That this test lacks power, so they propose Mann-Whitney, because it has greater power. But they don't say A WORD about the assumptions.
So indeed, Mann-Whitney (-Wilcoxon) is MORE powerful, but at the cost of being sensitive not only to locations, but also to dispersions and shapes.
56 | 
57 | If you are fine with that, if it's your actual goal - then OK, Mann-Whitney (-Wilcoxon) will do its job. But only then. If you want to compare medians, as I wrote, choose the quantile regression (ideally). It is designed to deal with quantiles, so no better-suited method will ever exist.
58 | 
59 | ----
60 | 
61 | ## Simulations
62 | Let's check how good the approximation is by performing some simulations. Nothing will give us a better feeling of the situation than that.
63 | 
64 | ### Methodology
65 | I set up a loop in which data were sampled (according to some patterns) and the p-values from both the ordinary test and its model counterpart were collected.
66 | 
67 | Then I display these p-values in a series of graphs:
68 | 1. Test vs. Model, to observe how well they follow each other and to spot discrepancies
69 | 2. Ratios of p-values - the closer to 1 the better (this works well regardless of the magnitude of the p-values)
70 | 3. Raw differences in p-values. These differences do depend on the magnitude of the p-values, and that's exactly what I wanted to see. At very small p-values (p<0.001) I would not care about differences like p=0.00032 vs. p=0.00017, because both results are far from the typical significance levels (0.1, 0.05, 0.01 and also 0.001).
71 | For practical purposes I would care if they were discrepant near 0.05.
72 | 4. Comparison of the % of cases where Test > Model vs. Test < Model, with additional descriptive statistics of the p-values. Because raw differences depend on the magnitude, it's difficult to observe them on a common scale. A log transformation doesn't help there (0, negative values).
73 | So this approach gives me some feeling for the situation.
74 | 5. How many times both methods disagreed on the rejection of the null hypothesis, and what the actual p-values were.
75 | 
76 | I repeated the simulations for several sample sizes:
77 | - n_group = 20 - a typical small data size; smaller samples don't make much sense in reliable research,
78 | - n_group = 30 - the number so many statisticians find "magical". Of course it's not, but it constitutes a reasonable "lower limit of data"
79 | - n_group = 50 - still small data, but in a safer area
80 | - n_group = 100 - safe area (optionally)
81 | 
82 | Actually, the type-1 error and power are totally off topic in this simulation - we only check how well the methods follow each other - but having the opportunity, we will look at them too. We only have to remember that this is a long-run property, and at 100 replications it will be "just close" rather than "nominal", especially at smaller samples. It's normal to have, say, 7/100 (=0.07), 12/200 (=0.06), to finally reach 15/300 (=0.05).
83 | 
84 | Whenever possible I will provide both the Wald's and LRT results.
85 | 
86 | ---
87 | ### Simulations!
88 | #### The data
89 | For the ordinal logistic regression I made a dataset consisting of patients' responses to a question about the intensity of physical pain they feel, part of the ODI (Oswestry Disability Index). Patients are split into two groups: those taking a new investigated medicine vs. those taking an active control (the existing standard of care).
90 | So the model is `ODIPain ~ Arm` (Arm stands for the treatment arm).
91 | 
92 | The response was recorded on a 6-level Likert ordinal item (don't confuse this with a Likert scale, where responses to items are summed together, giving a numerical outcome).
93 | I programmed it so that low responses dominate in one group and high responses in the other.
94 | Example:
95 | ```r
96 | set.seed(1000)
97 | lapply(1:100, function(i) {
98 |   stack(
99 |     data.frame(arm1 = sample(0:5, size=100, replace=TRUE, prob = c(20, 10, 5, 2, 2, 2)),
100 |                arm2 = sample(0:5, size=100, replace=TRUE, prob = rev(c(20, 10, 5, 2, 2, 2))))) %>%
101 |     mutate(values_ord = ordered(values), ind = factor(ind))
102 | }) -> data
103 | ```
104 | Although the data size will vary during the simulations, I will visualize only this one case, to save space. The stacked bar chart clearly shows that "dark" elements dominate in one group, while "bright" ones dominate in the other. The density plots confirm the opposite nature of the responses across both arms.
105 | 
106 | Our comparisons will gain more power with increasing sample size, which is an ideal situation, as we want to observe both high and low p-values.
107 | 
108 | ``` r
109 | lapply(1:100, function(i) {
110 |   print(i)
111 |   data[[i]] %>%
112 |     ggplot() +
113 |     geom_bar(aes(x=ind, group=values, fill=values), position = "fill", show.legend = FALSE) +
114 |     theme_void()
115 | }) -> plots1
116 | 
117 | lapply(1:100, function(i) {
118 |   print(i)
119 |   data[[i]] %>%
120 |     ggplot() +
121 |     geom_density(aes(x=values, group=ind, fill=ind, col=ind), show.legend = FALSE, alpha=0.6, adjust = 1.5)+
122 |     xlab(NULL) +ylab(NULL) +
123 |     theme_void()
124 | }) -> plots2
125 | 
126 | (patchwork::wrap_plots(plots1, ncol = 10, nrow = 10) |
127 |    patchwork::wrap_plots(plots2, ncol = 10, nrow = 10)) +
128 |   patchwork::plot_annotation(title = "Patient-reported ODI pain scores across both treatment arms")
129 | ```
130 | ![obraz](https://github.com/adrianolszewski/Logistic-regression-is-regression/assets/95669100/a75158cb-c47e-44d8-8e33-acb98c6534a2)
131 | 
132 | #### Results
133 | ##### Under H0 - same probabilities of obtaining each score in both groups = same ordering of observations
134 | ###### 20 observations per group
135 | ``` r
136 | simulate_wilcox_olr(samples = 100, n_group = 20, set = 0:5,
137 |                     arm_1_prob = c(20, 10, 5, 2, 2, 2),
138 |                     arm_2_prob = c(20, 10, 5, 2, 2, 2),
139 |                     which_p = "Wald") %>%
140 |   plot_differences_between_methods()
141 | ```
142 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/e072f1a3-e3d3-48bb-b1ce-b754ab48a77c)
143 | 
144 | OK, let's make some notes:
145 | 1. Look at the leftmost plot. The p-values from both methods are well aligned to the line of slope = 1, though some bias is visible towards the Test p-values.
146 | 2. It means that the test was more conservative than the model (higher p-values).
147 | 3. The fact that practically all p-values coming from the test were larger than the p-values obtained from the model is confirmed also by the bar plot (centre-bottom).
148 | 4. When we look at the area below the 0.05 significance level, we notice 5-6 observations. Remembering that we "work under the null", it means that both methods exceeded the 5% significance level, reaching 9% in this simulation. At this sample size that's fine for exploratory analyses.
149 | 5. There was just a single "contradiction", at a very small difference of p-values: 0.049 vs 0.051.
150 | 
151 | With increased group size the situation will only get better.
The type-1 error is well maintained in all cases (remembering what I said above about small samples and the long-run nature of this property).
152 | 
153 | ###### 30 observations per group
154 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/0643dff4-a2f8-4ac0-9988-809f25b698e0)
155 | 
156 | ###### 50 observations per group
157 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/dd8f83d8-f804-4b8f-89a8-33260ad20c1d)
158 | 
159 | ###### 100 observations per group
160 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/4514867b-627c-43bf-865b-3cbf3d6fe34a)
161 | 
162 | ##### Under H1 - in one group the probabilities of obtaining each score are reversed (as in the figure explaining the data above)
163 | ###### 20 observations per group
164 | ``` r
165 | simulate_wilcox_olr(samples = 100, n_group = 20, set = 0:5,
166 |                     arm_1_prob = rev(c(20, 10, 5, 2, 2, 2)),
167 |                     arm_2_prob = c(20, 10, 5, 2, 2, 2),
168 |                     which_p = "Wald") %>%
169 |   plot_differences_between_methods(log_axes = TRUE)
170 | ```
171 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/b8664220-96d1-4348-a73d-c4172f75a29c)
172 | 
173 | For Wald's we can notice the already-described issue: some noticeable discrepancies occur. This happens at this sample size; nothing unusual.
174 | In general - at this data size - it's not that bad!
175 | 
176 | But you can read in many textbooks that at lower sample sizes Wald's may be unreliable (and indeed it is), so the LRT or Rao score is preferred.
177 | Well, sometimes you just NEED Wald's, but if you can ask for the LRT or Rao, do it. Let's have a look:
178 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/e6b4e3e2-d4f8-4dc9-b0ca-99c941b1a824)
179 | Indeed, it did better! Now it deviates the other way, but definitely in the safe direction - towards lower p-values, farther from the common significance levels.
180 | 
181 | And Rao's score did even better than the LRT, resulting in smaller differences (and ratios):
182 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/cbac1d16-54dc-4cb7-8a8f-ee60769fbac1)
183 | 
184 | ###### 30 observations per group
185 | Just 10 more observations make a big difference, and this time Wald's does the job!
186 | I'll skip the other approaches, as they give results consistent with the above. We will compare them later.
187 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/bcca869a-330a-4cea-8632-1210575524a4)
188 | 
189 | ###### 50 observations per group
190 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/564d2d47-98bf-4b3b-a969-050b710bb526)
191 | 
192 | ##### Under H0 vs H1 - comparison of three estimation methods
193 | Let's get a feeling for how the 3 methods behave. A sketch of how to run such a comparison follows below.
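The comparison can be assembled with the `simulate_wilcox_olr()` helper defined in `common_procedures.md`; a minimal sketch (my own illustration), reusing the H1 settings from above:

```r
# Run the same scenario three times, switching only the testing method
results <- lapply(c("Wald", "LRT", "Rao"), function(m)
  simulate_wilcox_olr(samples = 100, n_group = 20, set = 0:5,
                      arm_1_prob = rev(c(20, 10, 5, 2, 2, 2)),
                      arm_2_prob = c(20, 10, 5, 2, 2, 2),
                      which_p = m))
```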
194 | ###### 20 observations per group
195 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/c4a96f36-2b38-4cc0-bc03-2fc9ef1e4da0)
196 | 
197 | ###### 30 observations per group
198 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/98ed9c48-dca4-40c6-b10a-b49fc4fca5d2)
199 | 
200 | ###### 50 observations per group
201 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/a1de3100-d82c-44d4-8f85-75e909c2b42d)
202 | 
203 | ###### 100 observations per group
204 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/aac5871e-f7b2-460e-82fa-96a941a4e4b1)
205 | 
206 | ##### Again under H1 - another, less extreme setting
207 | ###### 20 observations per group
208 | ``` r
209 | simulate_wilcox_olr(samples = 100, n_group = 20, set = 0:5,
210 |                     arm_1_prob = rev(c(20, 10, 20, 5, 5, 5)),
211 |                     arm_2_prob = c(20, 10, 20, 5, 5, 5),
212 |                     which_p = "Wald") %>%
213 |   plot_differences_between_methods(log_axes = TRUE)
214 | ```
215 | If the differences between the compared groups are not that extreme, the log-likelihood curve becomes not only concave but also approaches a quadratic shape, so the
216 | differences between the 3 methods and the test (itself a kind of Rao score test) become milder.
217 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/690fd701-c717-4e97-a062-8f51c356b9dc)
218 | 
219 | ###### 30 observations per group
220 | As the sample size increases, the pattern remains - the discrepancy occurs, but - naturally - at smaller and smaller p-values, where it's completely harmless. So let's skip checking at bigger samples, to save time.
221 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/20ce5c77-6b4a-463f-b3d0-e3025207ecc8)
222 | 
223 | ##### Under H0 vs H1 - comparison of three estimation methods
224 | Again, let's get a feeling for how the 3 methods behave. As the differences are less extreme, we expect the methods to give somewhat closer results.
225 | But don't expect a spectacular improvement.
226 | ###### 20 observations per group
227 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/cae95c76-ec7d-4dea-9935-bb21410e376f)
228 | 
229 | ###### 30 observations per group
230 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/4e9481e0-dfcd-4076-b0a6-6060d8ae2441)
231 | 
232 | ###### 50 observations per group
233 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/36429657-32e5-4fd3-b76d-b5f92f9c2b2e)
234 | 
235 | ###### 100 observations per group
236 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/640d7a24-a14f-4bd3-a752-3314801a4ddf)
237 | 
238 | Well, I must say that despite the inevitable divergence, the improvement actually happened. It's always nice to see the theory agreeing with observation :)
239 | 
240 | ----
241 | 
242 | #### Comparing numerical data
243 | So far we were comparing the Wilcoxon test and the ordinal logistic regression model applied to categorical ordinal data.
244 | The agreement was very close, so the selection depends exclusively on the concrete scenario.
245 | 
246 | The situation complicates when we want to extend the comparisons to numeric continuous data.
247 | The ordinal logistic regression can easily handle numerical data, treating the values like classes and estimating as many intercepts as there are unique values.
248 | 
249 | But it won't always go smoothly. For some distributions the procedure may fail to converge under the alternative hypothesis, regardless of the implementation. Shifting the mean in one group just a little (e.g. by 0.01) may solve the problem while not affecting the interpretation of the test.
250 | 
251 | Why that is so is explained in the readme.md file. Briefly: the proportional-odds model assesses the so-called stochastic superiority, which is the probability that, for a random pair of observations {a1 ∈ A1, a2 ∈ A2} sampled from the 2 groups, a1 > a2. Now, if you have two numeric variables with non-overlapping empirical distributions, this relation will always hold, so the model won't be able to converge.
252 | 
253 | Notice, this will almost never happen with Likert items - both variables share the same set of values (e.g. 0..5), so they will overlap anyway, except in a few marginal cases where all observations in one group are lower than in the other group, e.g. A = {0,0,0,1,1,1,2,2}, B = {3,3,3,4,4,4,5,5}.
254 | 
255 | **Warning: Better choose an implementation that raises errors rather than one that returns unreliable estimates(!)**
256 | 
257 | For other distributions it works very well. So let's explore a few examples.
258 | 
259 | We only need to change the simulation function a bit:
260 | ```r
261 | simulate_wilcox_olr_distr <- function(samples, n_group, arm_1_distr, arm_2_distr, title, which_p = "Wald") {
262 |   set.seed(1000)
263 |   # NB: which_p is accepted only for compatibility with the calls below; this simplified variant reports the Wald-type p-value from joint_tests(). The full Wald/LRT/Rao version is in common_procedures.md.
264 |   data.frame(
265 |     do.call(
266 |       rbind,
267 |       lapply(1:samples, function(i) {
268 |         print(i)
269 |         stack(
270 |           data.frame(arm1 = eval(parse(text = gsub("n_group", n_group, arm_1_distr))),
271 |                      arm2 = eval(parse(text = gsub("n_group", n_group, arm_2_distr))))) %>%
272 |           mutate(values_ord = ordered(values), ind = factor(ind)) -> data
273 | 
274 |         c(iter = i,
275 |           n_group = n_group,
276 |           Test = tidy(wilcox.test(values~ind, data=data, exact = FALSE, adjust = FALSE))$p.value,
277 |           Model = tryCatch(joint_tests(rms::orm(values_ord ~ ind, data))$p.value,
278 |                            error = function(e) {cli::cli_alert_danger("Failed to converge; skip"); NA}))
279 |       }))) -> result
280 | 
281 |   attr(result, "properties") <- list(under_null = identical(arm_1_distr, arm_2_distr),
282 |                                      title = title,
283 |                                      samples = samples,
284 |                                      n_group = n_group)
285 |   return(result)
286 | }
287 | ```
288 | ##### Under H0:
289 | ###### Standard normal distribution. N(0, 1) vs. N(0, 1); n_group = 50 (to save time; for bigger values it's even better)
290 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/5481b23d-b715-4e16-b26b-06710dbe49c9)
291 | 
292 | ###### Beta right-skewed β(1, 10) vs. β(1, 10); n_group = 50
293 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/01301273-2567-4b20-af4b-25639a4d7bbf)
294 | 
295 | ###### Beta left-skewed β(10, 1) vs. β(10, 1); n_group = 50
296 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/94bd9a3e-cc6c-4f36-81c1-ef24605e6415)
297 | 
298 | ###### Beta U-shaped β(0.5, 0.5) vs. β(0.5, 0.5); n_group = 100
299 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/b64ebbf8-4e92-440b-9a49-7f7afd238552)
300 | 
301 | ###### Gamma right-skewed Γ(3, 1) vs.
Γ(3, 1); n_group = 50
302 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/a401d6c7-921b-4c4e-9dcb-f1fe29548438)
303 | 
304 | **Remarks:**
305 | 1. OK, as long as we test similar distributions (under the null), the agreement is perfect regardless of the distribution (Mann-Whitney is a non-parametric test).
306 | 2. We can observe that the bias (the test giving a little bigger p-values) is common to all cases, but at this magnitude it does no harm. It's just observable.
307 | 
308 | ##### Under H1 (various cases)
309 | ###### Standard normal distribution. N(5, 1) vs. N(0, 1); n_group = 20
310 | ```r
311 | simulate_wilcox_olr_distr(samples = 100, n_group = 20,
312 |                           arm_1_distr = "rnorm(n_group, 5, 1)",
313 |                           arm_2_distr = "rnorm(n_group, 0, 1)",
314 |                           title = "N(5,1) vs N(0,1)",
315 |                           which_p = "Wald") %>%
316 |   filter(complete.cases(.)) %>%
317 |   plot_differences_between_methods(log_axes = TRUE)
318 | ```
319 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/d8a15792-3e38-4317-812e-2712750fe5c2)
320 | 
321 | Well, a disaster! It "converged"(?) for just a few cases and resulted in weird p-values. All the same p-values!
322 | 
323 | OK, what about Rao?
324 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/0eba5c40-246d-4da7-94c6-e006f67fed79)
325 | 
326 | Well, let's try a bigger sample. But I suspect it will just throw an error...
327 | 
328 | ###### Standard normal distribution. N(5, 1) vs. N(0, 1); n_group = 100
329 | ```r
330 | > simulate_wilcox_olr_distr(samples = 100, n_group = 100,
331 | +                           arm_1_distr = "rnorm(n_group, 5, 1)",
332 | +                           arm_2_distr = "rnorm(n_group, 0, 1)",
333 | +                           title = "N(5,1) vs N(0,1)",
334 | +                           which_p = "Wald") %>%
335 | +   filter(complete.cases(.)) %>%
336 | +   plot_differences_between_methods(log_axes = TRUE)
337 | [1] 1
338 | Did not converge in 12 iterations
339 | Unable to fit model using “orm.fit”
340 | [1] 2
341 | Did not converge in 12 iterations
342 | Unable to fit model using “orm.fit”
343 | [1] 3
344 | Did not converge in 12 iterations
345 | Unable to fit model using “orm.fit”
346 | ```
347 | As I expected - this time the model simply failed to fit.
348 | 
349 | There may be other implementations that are "silent", but they won't work a miracle either:
350 | 
351 | ```r
352 | > set.seed(1000)
353 | > 
354 | > stack(data.frame(arm1 = rnorm(n = 20, mean=5, sd = 1),
355 | +                  arm2 = rnorm(n = 20, mean=0, sd = 1))) %>%
356 | +   mutate(values_ord = ordered(values), ind = factor(ind)) %>%
357 | +   MASS::polr(values_ord ~ ind , data = ., Hess = TRUE)
358 | Call:
359 | MASS::polr(formula = values_ord ~ ind, data = ., Hess = TRUE)
360 | 
361 | Coefficients:
362 |   indarm2 
363 | -19.78993 
364 | 
365 | Intercepts:
366 | -1.78384413571217|-1.766198610056    -1.766198610056|-1.34835905258526  -1.34835905258526|-1.22701600631211 
367 |                    -22.732515115                       -21.986947341                       -21.524205629 
368 | [... CUT ...]
369 | 
370 | Residual Deviance: 239.6635 
371 | AIC: 319.6635 
372 | Warning: did not converge as iteration limit reached
373 | ```
374 | 
375 | It will just not work for this data, sorry.
376 | 
377 | I investigated this issue a lot, and it seems to be a matter of the overlap in the data. In other words - as long as the empirical distributions overlap, it will work; otherwise it fails. This is the ordinal analogue of complete separation in binary logistic regression: with no overlap, the likelihood keeps increasing as the group coefficient grows without bound, so the estimate diverges.
378 | 
379 | Let me show you:
380 | * N(50, 20) vs. N(35, 20); n_group = 100
381 | 
382 | For such data we get a big overlap of the distributions, even if their means differ statistically significantly.
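As a side check (my addition, not in the original text), we can estimate the stochastic superiority probability P(a1 > a2) discussed above directly from such samples:

```r
set.seed(1000)
x1 <- rnorm(100, 50, 20)
x2 <- rnorm(100, 35, 20)
# empirical P(a1 > a2): the fraction of all cross-group pairs won by arm 1
mean(outer(x1, x2, ">"))
```
As long as this probability stays away from 0 and 1, the distributions overlap and the model can be fitted.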
383 | Let's examine one of the _Common Language Effect Size_ measures, namely the **% of overlap** ([Interpreting Cohen's d Effect Size](https://rpsychologist.com/cohend/)):
384 | ```r
385 | set.seed(1000)
386 | x1 <- rnorm(100, 50, 20)
387 | x2 <- rnorm(100, 35, 20)
388 | effectsize::p_overlap(x1, x2)
389 | Overlap |       95% CI
390 | ----------------------
391 | 0.73    | [0.62, 0.84]
392 | ```
393 | Let's run the tests:
394 | ``` r
395 | simulate_wilcox_olr_distr(samples = 100, n_group = 100,
396 |                           arm_1_distr = "rnorm(n_group, 50, 20)",
397 |                           arm_2_distr = "rnorm(n_group, 35, 20)",
398 |                           title = "N(50,20) vs N(35,20)",
399 |                           which_p = "Wald") %>%
400 |   filter(complete.cases(.)) %>%
401 |   plot_differences_between_methods(log_axes = TRUE)
402 | ```
403 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/62ef5e23-b928-4b6d-aee2-24c9b65ddf1b)
404 | 
405 | Perfect agreement, even under the alternative hypothesis (for the normal distribution we get a perfectly parabolic log-likelihood curve).
406 | Now let's decrease the variance - separating the distributions (less overlap).
407 | 
408 | * N(50, 10) vs. N(35, 10); n_group = 100
409 | Now the overlap decreases:
410 | ```r
411 | set.seed(1000)
412 | x1 <- rnorm(100, 50, 10)
413 | x2 <- rnorm(100, 35, 10)
414 | effectsize::p_overlap(x1, x2)
415 | Overlap |       95% CI
416 | ----------------------
417 | 0.46    | [0.37, 0.56]
418 | ```
419 | ![obraz](https://github.com/adrianolszewski/model-based-testing-hypotheses/assets/95669100/4036752c-ae81-4a58-bd13-aef78d5f6794)
420 | 
421 | OK, something begins to happen! We notice some deviation from the line appearing at the smallest p-values.
422 | 
423 | * N(50, 5) vs. N(35, 5); n_group = 100
424 | ```r
425 | set.seed(1000)
426 | x1 <- rnorm(100, 50, 5)
427 | x2 <- rnorm(100, 35, 5)
428 | effectsize::p_overlap(x1, x2)
429 | Overlap |       95% CI
430 | ----------------------
431 | 0.13    | [0.08, 0.19]
432 | ```
433 | 
434 | TO CLEAN UP AND ADD additional notes
--------------------------------------------------------------------------------
/Tests/Various tests to describe one by one.md:
--------------------------------------------------------------------------------
1 | This is a temporary file that I will split into one file per test.
2 | For now it exists only so that I don't forget anything.
3 | 
4 | -----
5 | Let me show you how the logistic regression (with a few extensions) can be used to test hypotheses about fractions (%) of successes, replacing the classic "test for proportions".
6 | Namely, it can replicate the results of:
7 | 
8 | 1. [the Wald's (normal approximation) **z test for 2 proportions with non-pooled standard errors**](#wald_2prop_z) (common in clinical trials) via LS-means on the prediction scale or the AME (average marginal effect)
9 | 2. [the Rao's score (normal appr.) **z test for 2 proportions with pooled standard errors**](#rao_2prop_z) (just what the `prop.test()` does in R)
10 | 3. the **z test for multiple (2+) proportions**
11 | 4. **ANOVA-like** (joint) test for multiple categorical predictors (n-way ANOVA). Also (n-way) ANCOVA if you employ numerical covariates.
12 | 5. [the **Cochran-Mantel-Haenszel (CMH) test for stratified/matched data**](#cmh) via _conditional logistic regression_
13 | 6. [the **Breslow-Day test for odds ratios**](#breslow-day) through Rao's ANOVA --> the interaction term
14 | 7. [the **Cochran-Armitage test for trend in ordered proportions**](#armitage-trend)
15 | 8. 
[the **McNemar and Cochran Q** tests of paired proportions](#mcnemar) via GEE estimation (Generalized Estimating Equations with compound symmetry)
16 | 9. [the **Friedman test**](#mcnemar) - as above
17 | 10. [the **Mann-Whitney-Wilcoxon and Kruskal-Wallis**](#mww) via Ordinal Logistic Regression (and the paired Wilcoxon via GEE)
18 | 
19 | Actually, the model-based approach to testing hypotheses is not anything new, and lots of other tests can be replicated with the general linear model via Ordinary Least Squares (OLS) and Generalized Least Squares (GLS) estimation, and with generalized linear models (GLM) via both Maximum-Likelihood estimation (MLE) and the semi-parametric Generalized Estimating Equations (GEE). Let's add to this also the conditional approach via Mixed-Effect models (both general and generalized). And let's not forget about the Quantile Regression (with mixed effects), robust regression models, survival models (Cox, AFT, Andersen-Gill, frailty models) and dozens of others!
20 | 
21 | All those models, followed by Likelihood Ratio testing (LRT) or Wald's testing of model coefficients, especially combined with LS-means (EM-means), will give you an incredibly flexible testing framework.
22 | 
23 | This time we will look at the Logistic Regression, part of the Generalized Linear Model family - the binomial regression with logit link. We will also employ certain extensions and generalizations to achieve concrete effects.
24 | 
25 | ---
26 | 
27 | We will use 4 data sets (defined at the bottom of this file):
28 | * unpaired 2-group data
29 | 
30 | ``` r
31 | > head(unpaired_data)
32 |      sex response    trt
33 | 1 female        0 active
34 | 2 female        0 active
35 | 3 female        0 active
36 | 4 female        0 active
37 | 5 female        0 active
38 | 6 female        0 active
39 | > tail(unpaired_data)
40 |      sex response     trt
41 | 101 male        1 placebo
42 | 102 male        1 placebo
43 | 103 male        1 placebo
44 | 104 male        1 placebo
45 | 105 male        1 placebo
46 | 106 male        1 placebo
47 | ```
48 | 
49 | * paired 2-group data
50 | 
51 | ``` r
52 | > head(paired_data)
53 |   ID Time Treatment Response
54 | 1  1  Pre   placebo        0
55 | 2  1 Post   placebo        1
56 | 3  2  Pre   placebo        0
57 | 4  2 Post   placebo        0
58 | 5  3  Pre   placebo        0
59 | 6  3 Post   placebo        0
60 | > 
61 | > tail(paired_data)
62 |    ID Time Treatment Response
63 | 35 18  Pre    active        0
64 | 36 18 Post    active        1
65 | 37 19  Pre    active        0
66 | 38 19 Post    active        0
67 | 39 20  Pre    active        0
68 | 40 20 Post    active        0
69 | ```
70 | 
71 | * ordered data
72 | ``` r
73 | > head(ordered_paired_data)
74 |   ID Time Response TimeUnord
75 | 1  1   T1        0        T1
76 | 2  2   T1        0        T1
77 | 3  3   T1        0        T1
78 | 4  4   T1        0        T1
79 | 5  5   T1        0        T1
80 | 6  6   T1        0        T1
81 | > tail(ordered_paired_data)
82 |    ID Time Response TimeUnord
83 | 25  5   T3        1        T3
84 | 26  6   T3        1        T3
85 | 27  7   T3        1        T3
86 | 28  8   T3        0        T3
87 | 29  9   T3        1        T3
88 | 30 10   T3        1        T3
89 | ```
90 | 
91 | * unpaired 2-group ordinal data (Pain score of the ODI (Oswestry Disability Index) questionnaire; 6-item Likert data):
92 | https://www.lni.wa.gov/forms-publications/F252-130-000.pdf
93 | ``` r
94 | > head(ordinal_data)
95 |                  ODIPain Arm Age_centered
96 | 1      [2] Moderate pain   B     -6.15315
97 | 2            [0] No pain   B     12.84685
98 | 3      [1] Very mild pain  A     -9.15315
99 | 4      [2] Moderate pain   B     14.84685
100 | 5 [3] Fairly severe pain   A     12.84685
101 | 6      [2] Moderate pain   B      2.84685
102 | > tail(ordinal_data)
103 |                 ODIPain Arm Age_centered
104 | 106   [2] Moderate pain   A   -15.153153
105 | 107   [2] Moderate pain   B   -11.153153
106 | 108   [2] Moderate pain   A    -4.153153
107 | 109 [4] Very severe pain   B    -0.153153
108 | 110  [1] Very mild pain   B    -4.153153
109 | 111  [1] Very mild pain   B    -7.153153
110 | ```
111 | 
112 | ---
113 | Loading the necessary packages:
114 | ```{r}
115 | library(emmeans)
116 | library(broom)
117 | library(survival)
118 | library(marginaleffects)
119 | library(geepack)
120 | ```
121 | 
122 | Defining an auxiliary function (to validate the results):
123 | ``` r
124 | wald_z_test <- function(table) {
125 |   p1 <- prop.table(table, 1)[1, 1]
126 |   p2 <- prop.table(table, 1)[2, 1]
127 |   n1 <- rowSums(table)[1]
128 |   n2 <- rowSums(table)[2]
129 |   se_p1 <- sqrt(p1 * (1 - p1) / n1)
130 |   se_p2 <- sqrt(p2 * (1 - p2) / n2)
131 |   se_diff <- sqrt(se_p1^2 + se_p2^2)
132 |   z <- (p1 - p2) / se_diff
133 |   p <- 2 * (1 - pnorm(abs(z)))
134 |   return(data.frame(estimate = p1 - p2, z = z, se = se_diff, p.value = p, row.names = NULL))
135 | }
136 | ```
137 | 
138 | ---
139 | 
140 | # Wald's z test for 2 proportions (non-pooled SE)
141 | 
142 | We want to reproduce this result:
143 | ``` r
144 | > wald_z_test(xtabs(~ trt + response,data = unpaired_data))
145 |    estimate        z         se     p.value
146 | 1 0.2737968 3.047457 0.08984435 0.002307865
147 | ```
148 | 
149 | We will use this logistic regression (LR) model:
150 | ``` r
151 | > summary(lr_model <- glm(response ~ trt , data = unpaired_data, family = binomial(link = "logit")))
152 | 
153 | Call:
154 | glm(formula = response ~ trt, family = binomial(link = "logit"), 
155 |     data = unpaired_data)
156 | 
157 | Deviance Residuals: 
158 |     Min      1Q  Median      3Q     Max  
159 | -1.7011 -1.1620  0.7325  1.0778  1.1929  
160 | 
161 | Coefficients:
162 |             Estimate Std. Error z value Pr(>|z|)   
163 | (Intercept) -0.03637    0.26972  -0.135  0.89274   
164 | trtplacebo   1.21502    0.42629   2.850  0.00437 **
165 | ---
166 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
167 | 
168 | (Dispersion parameter for binomial family taken to be 1)
169 | 
170 |     Null deviance: 140.50 on 105 degrees of freedom
171 | Residual deviance: 131.88 on 104 degrees of freedom
172 | AIC: 135.88
173 | 
174 | Number of Fisher Scoring iterations: 4
175 | ```
176 | 
177 | ## Wald's z test via LS-means on the re-gridded scale (probability scale)
178 | ``` r
179 | > pairs(emmeans(lr_model, regrid="response", specs = ~ trt))
180 |  contrast         estimate     SE  df z.ratio p.value
181 |  active - placebo   -0.274 0.0898 Inf  -3.047  0.0023
182 | ```
183 | Let's look closer at the results:
184 | | Outcome | LS-means | raw z test | comment |
185 | |-----------|----------|------------|---------|
186 | | estimate | -0.2737968 | 0.2737968 | 👍; swap factor levels to change the sign, or ignore |
187 | | SE | 0.08984432 | 0.08984435 | agreement by 7 dec. digits 👍 |
188 | | statistic | -3.047458 | 3.047457 | sign - as above; agreement by 5 dec. digits 👍 |
189 | | p-value | 0.002307857 | 0.002307865 | agreement by 7 dec. digits 👍 |
190 | 
191 | Excellent agreement!
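As a bonus (my addition, not part of the original comparison), the same LS-means pipeline can also give a Wald confidence interval for the risk difference, mirroring the CI of the classic z test:

```r
# Wald CI for the difference in proportions, from the same emmeans machinery
confint(pairs(emmeans(lr_model, regrid = "response", specs = ~ trt)))
```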
192 | 
193 | ## Wald's z test via AME (average marginal effect)
194 | ``` r
195 | > marginaleffects::avg_slopes(lr_model)
196 | 
197 |  Term          Contrast Estimate Std. Error    z Pr(>|z|)   S  2.5 % 97.5 %
198 |   trt placebo - active     0.274     0.0898 3.05  0.00231 8.8 0.0977   0.45
199 | 
200 | Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
201 | ```
202 | Let's look closer at the results:
203 | | Outcome | AME | raw z test | comment |
204 | |-----------|----------|------------|------|
205 | | estimate | 0.2737968 | 0.2737968 | 👍 |
206 | | SE | 0.08984433 | 0.08984435 | 👍 |
207 | | statistic | 3.047458 | 3.047457 | agreement by 5 dec. digits 👍 |
208 | | p-value | 0.002307859 | 0.002307865 | agreement by 6 dec. digits 👍 |
209 | 
210 | Perfect agreement!
211 | 
212 | ---
213 | 
214 | # Rao score z test for 2 proportions (pooled SE)
215 | 
216 | We want to reproduce this result:
217 | ``` r
218 | > prop.test(xtabs(~ trt + response,data=unpaired_data), correct = FALSE)
219 | 
220 | 	2-sample test for equality of proportions without continuity correction
221 | 
222 | data:  xtabs(~trt + response, data = unpaired_data)
223 | X-squared = 8.4429, df = 1, p-value = 0.003665
224 | alternative hypothesis: two.sided
225 | 95 percent confidence interval:
226 |  0.09770511 0.44988848
227 | sample estimates:
228 |    prop 1    prop 2 
229 | 0.5090909 0.2352941 
230 | ```
231 | 
232 | We will use the same logistic regression (LR) model as previously.
233 | 
234 | ## Rao score z test via ANOVA with the Rao test
235 | ``` r
236 | > anova(glm(response ~ trt , data = unpaired_data, family = binomial(link = "logit")), test = "Rao")
237 | Analysis of Deviance Table
238 | 
239 | Model: binomial, link: logit
240 | 
241 | Response: response
242 | 
243 | Terms added sequentially (first to last)
244 | 
245 |      Df Deviance Resid. Df Resid. Dev    Rao Pr(>Chi)   
246 | NULL                   105     140.50                   
247 | trt   1   8.6257       104     131.88 8.4429 0.003665 **
248 | ---
249 | Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
250 | ```
251 | Let's look closer at the results:
252 | | Outcome | ANOVA + Rao test | prop.test() | comment |
253 | |-----------|----------|------------|---------|
254 | | statistic | 8.442898 | 8.442897 | agreement by 5 dec. digits 👍 |
255 | | p-value | 0.003664718 | 0.003664719 | agreement by 8 dec. digits 👍 |
256 | 
257 | Perfect agreement!
258 | 
259 | ---
260 | 
261 | # Breslow-Day test for odds ratios via ANOVA with Rao test
262 | 
263 | We want to reproduce this result for treatment and sex:
264 | ``` r
265 | > BreslowDayTest(xtabs(~ trt +response + sex, data=unpaired_data), correct = TRUE)
266 | 
267 | 	Breslow-Day Test on Homogeneity of Odds Ratios (with Tarone correction)
268 | 
269 | data:  xtabs(~trt + response + sex, data = unpaired_data)
270 | X-squared = 1.4905, df = 1, p-value = 0.2221
271 | ```
272 | This time we add sex to the model and will look at the interaction term:
273 | 
274 | ``` r
275 | > as.data.frame(anova(glm(response ~ trt * sex , data = unpaired_data, family = binomial(link = "logit")), test="Rao")[4, ])
276 |         Df Deviance Resid. Df Resid. Dev      Rao  Pr(>Chi)
277 | trt:sex  1 1.498573       102   130.0512 1.496552 0.2212027
278 | ```
279 | Let's look closer at the results:
280 | | Outcome | ANOVA + Rao test | Breslow-Day | comment |
281 | |-----------|----------|------------|---------|
282 | | statistic | 1.496552 | 1.490537 | agreement by 2 dec. digits 👍 |
283 | | p-value | 0.2212027 | 0.2221331 | agreement by 2 dec. digits 👍 |
284 | 
285 | Good agreement!
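A natural variation (my addition) is to test the same interaction term with the LRT instead of the Rao score; the result should be of a similar magnitude:

```r
# LRT for the trt:sex interaction, analogous to the Rao version above
as.data.frame(anova(glm(response ~ trt * sex, data = unpaired_data,
                        family = binomial(link = "logit")), test = "LRT"))[4, ]
```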
286 | 
287 | ---
288 | 
289 | # (Cochran-) Mantel-Haenszel via conditional logistic regression
290 | 
291 | We want to reproduce this result for sex strata:
292 | ``` r
293 | > mantelhaen.test(unpaired_data$response, unpaired_data$trt, unpaired_data$sex, exact = F, correct = F)
294 | 
295 | 	Mantel-Haenszel chi-squared test without continuity correction
296 | 
297 | data:  unpaired_data$response and unpaired_data$trt and unpaired_data$sex
298 | Mantel-Haenszel X-squared = 8.3052, df = 1, p-value = 0.003953
299 | alternative hypothesis: true common odds ratio is not equal to 1
300 | 95 percent confidence interval:
301 |  1.445613 7.593375
302 | sample estimates:
303 | common odds ratio 
304 |          3.313168 
305 | ```
306 | And through the model:
307 | ``` r
308 | > summary(clogit(response~trt + strata(sex),data=unpaired_data))$sctest
309 |       test         df     pvalue 
310 | 8.30516934 1.00000000 0.00395324 
311 | ```
312 | Let's look closer at the results:
313 | | Outcome | Cond. LR | CMH | comment |
314 | |-----------|----------|------------|---------|
315 | | statistic | 8.30516934 | 8.305169 | 👍 |
316 | | p-value | 0.00395324 | 0.00395324 | 👍 |
317 | 
318 | Ideal agreement!
319 | 
320 | ---
321 | 
322 | # McNemar's, Cochran Q, Friedman tests via GEE-estimated LR
323 | We want to reproduce these results:
324 | ``` r
325 | > mcnemar.test(x=paired_data[paired_data$Time == "Pre", "Response"], y=paired_data[paired_data$Time == "Post", "Response"], correct = F)
326 | 
327 | 	McNemar's Chi-squared test
328 | 
329 | data:  paired_data[paired_data$Time == "Pre", "Response"] and paired_data[paired_data$Time == "Post", "Response"]
330 | McNemar's chi-squared = 10.286, df = 1, p-value = 0.001341
331 | 
332 | # or this one
333 | 
334 | > paired_data %>% rstatix::friedman_test(Response ~ Time |ID)
335 | # A tibble: 1 × 6
336 |   .y.          n statistic    df       p method       
337 | * 
338 | 1 Response    20      10.3     1 0.00134 Friedman test
339 | 
340 | # or this one
341 | 
342 | > RVAideMemoire::cochran.qtest(Response ~ Time | ID,data=paired_data)
343 | 
344 | 	Cochran's Q test
345 | 
346 | data:  Response by Time, block = ID
347 | Q = 10.2857, df = 1, p-value = 0.001341
348 | alternative hypothesis: true difference in probabilities is not equal to 0
349 | sample estimates:
350 | proba in group Post  proba in group Pre 
351 |                 0.7                 0.1 
352 | ```
353 | 
354 | Through the GEE-estimated model:
355 | ``` r
356 | > summary(geepack::geeglm(Response ~ Time, id = ID,data=paired_data, family = binomial(), corstr = "exchangeable"))
357 | 
358 | Call:
359 | geepack::geeglm(formula = Response ~ Time, family = binomial(), 
360 |     data = paired_data, id = ID, corstr = "exchangeable")
361 | 
362 |  Coefficients:
363 |             Estimate Std.err   Wald Pr(>|W|)   
364 | (Intercept)   0.8473  0.4880  3.015  0.08249 . 
365 | TimePre      -3.0445  0.9484 10.305  0.00133 **
366 | ---
367 | Signif.
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
368 | 
369 | Correlation structure = exchangeable 
370 | Estimated Scale Parameters:
371 | 
372 |             Estimate Std.err
373 | (Intercept)        1  0.7215
374 |   Link = identity 
375 | 
376 | Estimated Correlation Parameters:
377 |       Estimate Std.err
378 | alpha  -0.1455  0.2819
379 | Number of clusters:   20  Maximum cluster size: 2 
380 | 
381 | # or in a more compact form:
382 | > coef(summary(geepack::geeglm(Response ~ Time, id = ID,data=paired_data, family = binomial(), corstr = "exchangeable")))[2,]
383 |         Estimate Std.err  Wald Pr(>|W|)
384 | TimePre   -3.045  0.9484 10.31 0.001327
385 | ```
386 | 
387 | Let's look closer at the results:
388 | | Outcome | GEE LR | Tests | comment |
389 | |-----------|----------|------------|---------|
390 | | statistic | 10.31 | 10.2857 | agreement by 1 dec. digit 👍 |
391 | | p-value | 0.001327 | 0.001341 | agreement by 4 dec. digits 👍 |
392 | 
393 | Acceptable agreement!
394 | 
395 | ---
396 | 
397 | # Cochran-Armitage test for trend via GLM + ANOVA LRT (Likelihood Ratio Test)
398 | We want to reproduce this result:
399 | ``` r
400 | > DescTools::CochranArmitageTest(xtabs(~Response + Time,data=ordered_paired_data))
401 | 
402 | 	Cochran-Armitage test for trend
403 | 
404 | data:  xtabs(~Response + Time, data = ordered_paired_data)
405 | Z = -3.6, dim = 3, p-value = 0.0003
406 | alternative hypothesis: two.sided
407 | 
408 | # or this one
409 | 
410 | > rstatix::prop_trend_test(xtabs(~Response + Time,data=ordered_paired_data))
411 | # A tibble: 1 × 6
412 |       n statistic        p p.signif    df method               
413 | * 
414 | 1    30      12.9 0.000336 ***          1 Chi-square trend test
415 | ```
416 | 
417 | Through the GLM model:
418 | ``` r
419 | > as.data.frame(anova(glm(Response ~ Time,data=ordered_paired_data, family = binomial()), test="LRT"))[2,]
420 |      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
421 | Time  2    14.99        27      26.46 0.0005553
422 | ```
423 | 
424 | Let's look closer at the results:
425 | | Outcome | GLM + LRT ANOVA | Test | comment |
426 | |-----------|----------|------------|---------|
427 | | statistic | 14.99 | 12.86 | same order of magnitude |
428 | | p-value | 0.0005553 | 0.000336 | agreement by 3 dec. digits 👍 |
429 | 
430 | Reasonable agreement. (Maybe I'll find a better one.)
431 | 
432 | ---
433 | 
434 | # Mann-Whitney (-Wilcoxon) test of stochastic equivalence (vs. stochastic superiority / dominance)
435 | **Note:** This test DOES NOT TEST MEDIANS in general, unless strong distributional assumptions hold:
436 | 1) IID samples (same dispersion/variance & same shape - if skewed, then in the same direction)
437 | 2) Both samples symmetric around their medians.
438 | For detailed explanations, read my gist with a rich list of literature (mostly freely accessible) and examples: https://gist.github.com/adrianolszewski/2cec75678e1183e4703589bfd22fa8b2
439 | 
440 | We want to reproduce this result:
441 | ``` r
442 | > (wtest <- wilcox.test(as.numeric(ODIPain) ~ Arm, data = ordinal_data, exact = FALSE, correct = FALSE))
443 | 
444 | 	Wilcoxon rank sum test
445 | 
446 | data:  as.numeric(ODIPain) by Arm
447 | W = 1472, p-value = 0.68
448 | alternative hypothesis: true location shift is not equal to 0
449 | 
450 | > wtest$p.value
451 | [1] 0.679575
452 | ```
453 | By using the proportional-odds model (ordinal logistic regression) we obtain:
454 | ``` r
455 | > coef(summary(m <- MASS::polr(ODIPain ~ Arm , data = ordinal_data, Hess=T)))
456 |                                                   Value Std.
Error   t value
457 | ArmB                                           0.141709   0.341471  0.414995
458 | [0] No pain|[1] Very mild pain                -1.444439   0.299213 -4.827458
459 | [1] Very mild pain|[2] Moderate pain          -0.273260   0.259784 -1.051875
460 | [2] Moderate pain|[3] Fairly severe pain       1.361363   0.291704  4.666935
461 | [3] Fairly severe pain|[4] Very severe pain    2.093502   0.345203  6.064551
462 | [4] Very severe pain|[5] Worst imaginable pain 4.072209   0.736078  5.532306
463 | 
464 | > pairs(emmeans(m, specs = ~Arm))
465 |  contrast estimate    SE  df z.ratio p.value
466 |  A - B      -0.142 0.341 Inf  -0.415  0.6781
467 | 
468 | # or
469 | > (mtest <- joint_tests(m))
470 |  model term df1 df2 F.ratio p.value
471 |  Arm          1 Inf   0.172  0.6781
472 | 
473 | mtest$p.value
474 | [1] 0.678146
475 | ```
476 | 
477 | This time the two outputs (model vs. test) look very different, but give very close p-values!
478 | It's not a coincidence.
479 | You can find detailed explanations and the necessary formulas here: [Equivalence of Wilcoxon Statistic and Proportional Odds Model](https://www.fharrell.com/post/powilcoxon/) | [Resources for Ordinal Regression Models](https://www.fharrell.com/post/rpo/) | [If You Like the Wilcoxon Test You Must Like the Proportional Odds Model](https://www.fharrell.com/post/wpo/)
480 | 
481 | So, like Prof. Harrell, we will also check the concordance index:
482 | ``` r
483 | # From the Wilcoxon statistic
484 | > (bind_cols(tidy(wilcox.test(as.numeric(ODIPain) ~ Arm, data = ordinal_data, exact = FALSE, correct = FALSE)),
485 |              ordinal_data %>%
486 |                group_by(Arm) %>%
487 |                summarize(n=n()) %>%
488 |                summarize("n1*n2" = prod(n))) %>%
489 |     mutate(c = statistic / `n1*n2`) -> concor)
490 | 
491 | # A tibble: 1 × 6
492 |   statistic p.value method                 alternative `n1*n2`     c
493 | 
494 | 1     1472.   0.680 Wilcoxon rank sum test two.sided      3080 0.478
495 | 
496 | > concor$c
497 | 0.478084
498 | 
499 | # From the odds ratio taken from the model:
500 | > (OR <- 1/exp((coef(summary(m)))[1,1]))
501 | [1] 0.867874
502 | 
503 | # and finally
504 | > (c_mod <- OR^0.66 / (1 + OR ^ 0.66))
505 | 0.476635
506 | 
507 | # So we are off by:
508 | > sprintf("%.2f%%",100*(concor$c - c_mod) / concor$c)
509 | [1] "0.30%"
510 | 
511 | # Isn't this IMPRESSIVE?
512 | ```
513 | 
514 | Let's collect the results:
515 | | Outcome | OLR | Wilcox | comment |
516 | |-----------|----------|------------|---------|
517 | | concordance | 0.478084 | 0.476635 | agreement by 2 dec. digits 👍 |
518 | | p-value | 0.679575 | 0.678146 | agreement by 2 dec. digits 👍 |
519 | 
520 | Very good agreement!
521 | 
522 | Later, in a separate gist, I will show you, through simulation, that this equivalence holds very well!
523 | 
524 | **Think about the consequences. This way we obtain the Mann-Whitney (-Wilcoxon) test adjusted for covariates.**
525 | By the way, this is another interesting example where the result of a completely non-parametric test can be obtained via a parametric method.
526 | 
527 | ---
528 | * _EM-means_ (estimated marginal means) is another name for the _LS-means_ (least-squares means) well known in experimental research.
529 | It's a model-based predicted (estimated) mean. If you remember the definition of regression (NO, not the Machine Learning one...),
530 | then you know that regression gives you some function of the data conditional on the predictor.
531 | For the linear regression it's E(Y|X=x), for the GLM it is link(E(Y|X=x)), for the quantile regression it's median(Y|X=x).
532 | And since the predictor exclusively consists of categorical variables, they form sub-groups in which the (conditional) 533 | means are calculated. If we include also numerical covariates into the model, the predictions will account for it, giving us so-called "covariate-adjusted means". 534 | 535 | ---- 536 | The datasets for your own experiments: 537 | 538 | ``` r 539 | unpaired_data <- structure(list(sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 540 | 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 541 | 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 542 | 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 543 | 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 544 | 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 545 | 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 546 | 2L, 2L, 2L), levels = c("female", "male"), class = "factor"), 547 | response = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 548 | 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 549 | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 550 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 551 | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 552 | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), trt = structure(c(1L, 553 | 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 554 | 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 555 | 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 556 | 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 557 | 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 558 | 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 559 | 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L 560 | ), levels = c("active", "placebo"), class = "factor")), row.names = c(NA, 561 | -106L), class = "data.frame") 562 | 563 | paired_data <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 564 | 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 12L, 12L, 565 | 13L, 13L, 14L, 14L, 15L, 15L, 16L, 16L, 17L, 17L, 18L, 18L, 19L, 566 | 19L, 20L, 20L), Time = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 567 | 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 568 | 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 569 | 1L), levels = c("Post", "Pre"), class = "factor"), Treatment = structure(c(2L, 570 | 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 571 | 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 572 | 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = c("active", "placebo"), class = "factor"), 573 | Response = c(0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 574 | 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 575 | 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L)), row.names = c(NA, 576 | -40L), class = "data.frame") 577 | 578 | ordered_paired_data <- structure(list(ID = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 579 | 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 580 | 4L, 5L, 6L, 7L, 8L, 9L, 10L), levels = c("1", "2", "3", "4", 581 | "5", "6", "7", "8", "9", "10"), class = "factor"), Time = structure(c(1L, 582 | 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 583 | 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), levels = c("T1", 584 | "T2", "T3"), class = c("ordered", "factor")), Response = c(0L, 585 | 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 586 | 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
0L, 1L, 1L), TimeUnord = structure(c(1L, 587 | 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 588 | 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), levels = c("T1", 589 | "T2", "T3"), class = "factor")), row.names = c(NA, -30L), class = "data.frame") 590 | 591 | ordinal_data <- structure(list(ODIPain = structure(c(3L, 1L, 2L, 3L, 4L, 3L, 592 | 4L, 2L, 3L, 5L, 2L, 5L, 5L, 6L, 2L, 3L, 1L, 2L, 3L, 3L, 1L, 3L, 593 | 3L, 2L, 2L, 5L, 5L, 2L, 5L, 3L, 5L, 1L, 3L, 3L, 3L, 1L, 5L, 3L, 594 | 5L, 1L, 1L, 2L, 1L, 2L, 3L, 2L, 3L, 1L, 2L, 1L, 2L, 4L, 6L, 4L, 595 | 3L, 3L, 3L, 3L, 1L, 4L, 5L, 4L, 3L, 3L, 1L, 3L, 1L, 4L, 3L, 3L, 596 | 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 2L, 2L, 1L, 3L, 4L, 4L, 3L, 597 | 2L, 2L, 2L, 2L, 2L, 1L, 1L, 3L, 1L, 3L, 1L, 3L, 4L, 4L, 3L, 3L, 598 | 1L, 2L, 3L, 3L, 3L, 3L, 5L, 2L, 2L), levels = c("[0] No pain", 599 | "[1] Very mild pain", "[2] Moderate pain", "[3] Fairly severe pain", 600 | "[4] Very severe pain", "[5] Worst imaginable pain"), class = c("ordered", 601 | "factor")), Arm = structure(c(2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 602 | 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 603 | 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 604 | 2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 605 | 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 606 | 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 607 | 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 608 | 1L, 1L, 2L, 1L, 2L, 2L, 2L), levels = c("A", "B"), class = "factor"), 609 | Age_centered = c(-6.15315315315316, 12.8468468468468, -9.15315315315316, 610 | 14.8468468468468, 12.8468468468468, 2.84684684684684, -10.1531531531532, 611 | -18.1531531531532, -1.15315315315316, 8.84684684684684, -17.1531531531532, 612 | 13.8468468468468, 9.84684684684684, 17.8468468468468, -19.1531531531532, 613 | -7.15315315315316, -10.1531531531532, -19.1531531531532, 614 | -7.15315315315316, 0.846846846846844, -17.1531531531532, 615 | 5.84684684684684, -25.1531531531532, -1.15315315315316, -15.1531531531532, 616 | 4.84684684684684, 1.84684684684684, 12.8468468468468, -11.1531531531532, 617 | 5.84684684684684, -6.15315315315316, -0.153153153153156, 618 | 20.8468468468468, 5.84684684684684, -0.153153153153156, 12.8468468468468, 619 | -19.1531531531532, -11.1531531531532, 1.84684684684684, 0.846846846846844, 620 | -21.1531531531532, 9.84684684684684, 15.8468468468468, 14.8468468468468, 621 | -12.1531531531532, -11.1531531531532, -9.15315315315316, 622 | 5.84684684684684, -4.15315315315316, 12.8468468468468, 1.84684684684684, 623 | -7.15315315315316, -3.15315315315316, 7.84684684684684, 0.846846846846844, 624 | -4.15315315315316, 5.84684684684684, -0.153153153153156, 625 | 1.84684684684684, -7.15315315315316, 1.84684684684684, -9.15315315315316, 626 | 6.84684684684684, 9.84684684684684, 17.8468468468468, 5.84684684684684, 627 | 9.84684684684684, -10.1531531531532, -5.15315315315316, 18.8468468468468, 628 | 21.8468468468468, -0.153153153153156, 2.84684684684684, -8.15315315315316, 629 | -5.15315315315316, 5.84684684684684, 2.84684684684684, -15.1531531531532, 630 | 2.84684684684684, 25.8468468468468, -11.1531531531532, 27.8468468468468, 631 | 2.84684684684684, 20.8468468468468, -0.153153153153156, -2.15315315315316, 632 | 12.8468468468468, -0.153153153153156, 0.846846846846844, 633 | 11.8468468468468, -8.15315315315316, 3.84684684684684, 22.8468468468468, 634 | 5.84684684684684, 12.8468468468468, 4.84684684684684, 11.8468468468468, 635 | 
-5.15315315315316, -17.1531531531532, -7.15315315315316,
636 | -16.1531531531532, 0.846846846846844, -13.1531531531532,
637 | -13.1531531531532, -19.1531531531532, -15.1531531531532,
638 | -11.1531531531532, -4.15315315315316, -0.153153153153156,
639 | -4.15315315315316, -7.15315315315316)), row.names = c(NA,
640 | -111L), class = "data.frame")
641 | ```
642 | 
--------------------------------------------------------------------------------
/Tests/common_procedures.md:
--------------------------------------------------------------------------------
1 | # Function sampling Likert items {0-5} with a predefined probability of each item's occurrence and applying both the Mann-Whitney (-Wilcoxon) test and the ordinal logistic regression (proportional-odds model)
2 | 
3 | I use rms::orm() as it returns both the LRT and the Rao score p-values. The Wald p-value is calculated by hand to save time. (The code also assumes dplyr and broom are loaded.)
4 | ```r
5 | simulate_wilcox_olr <- function(samples, n_group, set, arm_1_prob, arm_2_prob, which_p) {
6 |   set.seed(1000)
7 | 
8 |   mixed_hyp <- is.null(arm_1_prob) || is.null(arm_2_prob)  # no probabilities given -> draw them anew per iteration
9 | 
10 |   data.frame(
11 |     do.call(
12 |       rbind,
13 |       lapply(1:samples, function(i) {
14 |         print(i)
15 | 
16 |         if(mixed_hyp) {
17 |           arm_1_prob <- sample(x = 1:10, size = 6, replace = FALSE)
18 |           arm_2_prob <- sample(x = 1:10, size = 6, replace = FALSE)
19 |         }
20 | 
21 |         stack(
22 |           data.frame(arm1 = sample(set, size=n_group, replace=TRUE, prob = arm_1_prob),
23 |                      arm2 = sample(set, size=n_group, replace=TRUE, prob = arm_2_prob))) %>%
24 |           mutate(values_ord = ordered(values), ind = factor(ind)) -> data
25 | 
26 |         # alternative: m_p <- joint_tests(MASS::polr(values_ord ~ ind, data = data, Hess=TRUE))$p.value
27 |         m <- rms::orm(values_ord ~ ind , data = data)
28 | 
29 |         if(!m$fail)
30 |           case_when(which_p == "Wald" ~ as.numeric(2*pnorm(abs(m$coefficients["ind=arm2"] / sqrt(m$var["ind=arm2","ind=arm2"])), lower.tail = FALSE)),
31 |                     which_p == "LRT" ~ as.numeric(m$stats["P"]),
32 |                     which_p == "Rao" ~ as.numeric(m$stats["Score P"])) -> Model_p
33 |         else
34 |           Model_p <- NA
35 | 
36 |         c(iter = i,
37 |           n_group = n_group,
38 |           Test = tidy(wilcox.test(values~ind, data=data, exact = FALSE, correct = FALSE))$p.value,
39 |           Model = Model_p)
40 |       }))) -> result
41 | 
42 |   attr(result, "properties") <- list(arm_1_prob = arm_1_prob,
43 |                                      arm_2_prob = arm_2_prob,
44 |                                      under_null = ifelse(mixed_hyp, NA, identical(arm_1_prob, arm_2_prob)),
45 |                                      samples = samples,
46 |                                      n_group = n_group,
47 |                                      title = "",
48 |                                      method = which_p)
49 |   return(result)
50 | }
51 | ```
52 | 
53 | ----
54 | 
55 | # Function sampling from specified distributions and applying both the Mann-Whitney (-Wilcoxon) test and the ordinal logistic regression (proportional-odds model)
56 | ```r
57 | 
58 | simulate_wilcox_olr_distr <- function(samples, n_group, arm_1_distr, arm_2_distr, title, which_p) {
59 |   set.seed(1000)
60 | 
61 |   data.frame(
62 |     do.call(
63 |       rbind,
64 |       lapply(1:samples, function(i) {
65 |         print(i)
66 |         stack(
67 |           data.frame(arm1 = eval(parse(text = gsub("n_group", n_group, arm_1_distr))),  # e.g. "rnorm(n_group, mean=50, sd=20)"
68 |                      arm2 = eval(parse(text = gsub("n_group", n_group, arm_2_distr))))) %>%
69 |           mutate(values_ord = ordered(values), ind = factor(ind)) -> data
70 | 
71 |         m <- rms::orm(values_ord ~ ind , data = data)
72 | 
73 |         if(!m$fail)
74 |           case_when(which_p == "Wald" ~ as.numeric(2*pnorm(abs(m$coefficients["ind=arm2"] / sqrt(m$var["ind=arm2","ind=arm2"])), lower.tail = FALSE)),
75 |                     which_p == "LRT" ~ as.numeric(m$stats["P"]),
76 |                     which_p == "Rao" ~ as.numeric(m$stats["Score P"])) -> Model_p
77 |         else
78 |           Model_p <- NA
79 | 
80 |         c(iter = i,
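          # (each row: the iteration number, the group size, and the two p-values being compared)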
81 |           n_group = n_group,
82 |           Test = tidy(wilcox.test(values~ind, data=data, exact = FALSE, correct = FALSE))$p.value,
83 |           Model = Model_p)
84 |       }))) -> result
85 | 
86 |   attr(result, "properties") <- list(arm_1_prob = arm_1_distr,
87 |                                      arm_2_prob = arm_2_distr,
88 |                                      under_null = identical(arm_1_distr, arm_2_distr),
89 |                                      samples = samples,
90 |                                      n_group = n_group,
91 |                                      title = title,
92 |                                      method = which_p)
93 |   return(result)
94 | }
95 | ```
96 | 
97 | ----
98 | # Comparing the MWW test and the PO model for Likert data under H0 and H1 (H_alt)
99 | ```r
100 | compare_likert_h0_h1_3methods <- function(n_group, log_transform=TRUE,
101 |                                           arm_prob = c(20, 10, 5, 2, 2, 2)) {
102 | 
103 |   cbind(method="Wald", simulate_wilcox_olr(samples = 100, n_group = n_group, set = 0:5,
104 |                                            arm_1_prob = arm_prob,
105 |                                            arm_2_prob = arm_prob, which_p = "Wald")) %>%
106 |     bind_rows(cbind(method="Wilks' LRT", simulate_wilcox_olr(samples = 100, n_group = n_group, set = 0:5,
107 |                                                              arm_1_prob = arm_prob,
108 |                                                              arm_2_prob = arm_prob, which_p = "LRT")),
109 |               cbind(method="Rao's score", simulate_wilcox_olr(samples = 100, n_group = n_group, set = 0:5,
110 |                                                               arm_1_prob = arm_prob,
111 |                                                               arm_2_prob = arm_prob, which_p = "Rao"))) -> cmp_h0
112 | 
113 | 
114 |   ph0 <- cmp_h0 %>%
115 |     ggplot() +
116 |     geom_point(aes(x=Test, y = Model, col=method)) +
117 | 
118 |     geom_abline(slope = 1, intercept = 0, col="green") +
119 | 
120 |     { if(TRUE)   # the significance threshold is always marked on the H0 panel
121 |       list(geom_hline(yintercept = 0.05, linetype="dashed", color="red"),
122 |            geom_vline(xintercept = 0.05, linetype="dashed", color="red"))
123 |     } +
124 | 
125 |     { if(FALSE)  # log10 axes are never used on the H0 panel
126 |       list(scale_y_continuous(trans='log10'),
127 |            scale_x_continuous(trans='log10'))
128 |     } +
129 | 
130 |     xlab("Test p-values") +
131 |     ylab("Model p-values") +
132 |     theme_bw() +
133 |     labs(title = "P-values from Test vs. Model. Comparison of LRT vs. Rao vs. Wald",
134 |          subtitle = sprintf("N=%d samples, group size=%d obs., %s",
135 |                             100, n_group, "under null hypothesis"),
136 |          caption = ifelse(FALSE, "Both axes log10 transformed", "")) +
137 |     theme(plot.caption = element_text(color = "red", face="italic"))
138 | 
139 | 
140 |   cbind(method="Wald", simulate_wilcox_olr(samples = 100, n_group = n_group, set = 0:5,
141 |                                            arm_1_prob = rev(arm_prob),
142 |                                            arm_2_prob = arm_prob, which_p = "Wald")) %>%
143 |     bind_rows(cbind(method="Wilks' LRT", simulate_wilcox_olr(samples = 100, n_group = n_group, set = 0:5,
144 |                                                              arm_1_prob = rev(arm_prob),
145 |                                                              arm_2_prob = arm_prob, which_p = "LRT")),
146 |               cbind(method="Rao's score", simulate_wilcox_olr(samples = 100, n_group = n_group, set = 0:5,
147 |                                                               arm_1_prob = rev(arm_prob),
148 |                                                               arm_2_prob = arm_prob, which_p = "Rao"))) -> cmp_h1
149 | 
150 | 
151 |   ph1 <- cmp_h1 %>%
152 |     ggplot() +
153 |     geom_point(aes(x=Test, y = Model, col=method)) +
154 | 
155 |     geom_abline(slope = 1, intercept = 0, col="green") +
156 | 
157 |     { if(FALSE)  # no significance lines on the H1 panel
158 |       list(geom_hline(yintercept = 0.05, linetype="dashed", color="red"),
159 |            geom_vline(xintercept = 0.05, linetype="dashed", color="red"))
160 |     } +
161 | 
162 |     { if(log_transform)
163 |       list(scale_y_continuous(trans='log10'),
164 |            scale_x_continuous(trans='log10'))
165 |     } +
166 | 
167 |     xlab("Test p-values") +
168 |     ylab("Model p-values") +
169 |     theme_bw() +
170 |     labs(title = "P-values from Test vs. Model. Comparison of LRT vs. Rao vs.
Wald", 171 | subtitle = sprintf("N=%d samples, group size=%d obs., %s", 172 | 100, n_group, "under alternative hypothesis"), 173 | caption = ifelse(log_transform, "Both axes log10 transformed", "")) + 174 | theme(plot.caption = element_text(color = "red", face="italic")) 175 | 176 | (ph0 + ph1 & theme(legend.position = "top")) + plot_layout(guides = "collect") 177 | } 178 | ``` 179 | 180 | ---- 181 | 182 | # Comparing test vs. model for the 3 methods 183 | ```r 184 | 185 | compare_3methods_distr_model_vs_test <- function(n_group, samples=100, 186 | distr_pair_spec = list("N(50,20) vs. N(50,20)" = 187 | c(arm_1_distr = "rnorm(n_group, mean=50, sd=20)", 188 | arm_2_distr = "rnorm(n_group, mean=50, sd=20)", 189 | log = "TRUE"))) { 190 | 191 | lapply(names(distr_pair_spec), function(dps_name) { 192 | dps <- distr_pair_spec[[dps_name]] 193 | 194 | log_axes <- as.logical(dps["log"]) 195 | log_axes <- ifelse(is.na(log_axes), TRUE, log_axes) 196 | log_axes <- (dps["arm_1_distr"] != dps["arm_2_distr"]) & log_axes 197 | 198 | bind_rows( 199 | bind_cols(method="Wald", simulate_wilcox_olr_distr(samples = samples, 200 | n_group = n_group, 201 | arm_1_distr = dps["arm_1_distr"], 202 | arm_2_distr = dps["arm_2_distr"], 203 | title = dps_name, 204 | which_p = "Wald")), 205 | bind_cols(method="Wilks' LRT", simulate_wilcox_olr_distr(samples = samples, 206 | n_group = n_group, 207 | arm_1_distr = dps["arm_1_distr"], 208 | arm_2_distr = dps["arm_2_distr"], 209 | title = dps_name, 210 | which_p = "LRT")), 211 | bind_cols(method="Rao score", simulate_wilcox_olr_distr(samples = samples, 212 | n_group = n_group, 213 | arm_1_distr = dps["arm_1_distr"], 214 | arm_2_distr = dps["arm_2_distr"], 215 | title = dps_name, 216 | which_p = "Rao"))) %>% 217 | ggplot() + 218 | 219 | { if(dps["arm_1_distr"] == dps["arm_2_distr"]) 220 | list(geom_hline(yintercept = 0.05, linetype="dashed", color="red"), 221 | geom_vline(xintercept = 0.05, linetype="dashed", color="red")) 222 | } + 223 | 224 | { if(log_axes) 225 | list(scale_y_continuous(trans='log10'), 226 | scale_x_continuous(trans='log10')) 227 | } + 228 | 229 | geom_point(aes(x=Test, y = Model, col=method)) + 230 | geom_abline(slope = 1, intercept = 0, col="green") + 231 | 232 | xlab("Test p-values") + 233 | ylab("Model p-values") + 234 | theme_bw() + 235 | labs(title = "P-values from Test vs. Model. Comparison of LRT vs. Rao vs. Wald", 236 | subtitle = sprintf("N=%d samples, group size=%d obs. | comparison: %s", 237 | samples, n_group, dps_name), 238 | caption = ifelse(log_axes, "Both axes log10 transformed", "")) + 239 | theme(plot.caption = element_text(color = "red", face="italic")) 240 | }) -> plots 241 | 242 | (patchwork::wrap_plots(plot = plots, ncol = 2) & theme(legend.position = "top")) + plot_layout(guides = "collect") 243 | } 244 | ``` 245 | 246 | ---- 247 | 248 | # Comparing p-values from model only - 3 approaches 249 | ```r 250 | compare_3methods_distr <- function(n_group, samples=100, 251 | distr_pair_spec = list("N(50,20) vs. 
N(50,20)" = 252 | c(arm_1_distr = "rnorm(n_group, mean=50, sd=20)", 253 | arm_2_distr = "rnorm(n_group, mean=50, sd=20)", 254 | log = "TRUE"))) { 255 | 256 | lapply(names(distr_pair_spec), function(dps_name) { 257 | dps <- distr_pair_spec[[dps_name]] 258 | 259 | log_axes <- as.logical(dps["log"]) 260 | log_axes <- ifelse(is.na(log_axes), TRUE, log_axes) 261 | log_axes <- (dps["arm_1_distr"] != dps["arm_2_distr"]) & log_axes 262 | 263 | bind_rows( 264 | bind_cols(method="Wald", simulate_wilcox_olr_distr(samples = samples, 265 | n_group = n_group, 266 | arm_1_distr = dps["arm_1_distr"], 267 | arm_2_distr = dps["arm_2_distr"], 268 | title = dps_name, 269 | which_p = "Wald")), 270 | bind_cols(method="Wilks' LRT", simulate_wilcox_olr_distr(samples = samples, 271 | n_group = n_group, 272 | arm_1_distr = dps["arm_1_distr"], 273 | arm_2_distr = dps["arm_2_distr"], 274 | title = dps_name, 275 | which_p = "LRT")), 276 | bind_cols(method="Rao score", simulate_wilcox_olr_distr(samples = samples, 277 | n_group = n_group, 278 | arm_1_distr = dps["arm_1_distr"], 279 | arm_2_distr = dps["arm_2_distr"], 280 | title = dps_name, 281 | which_p = "Rao"))) %>% 282 | ggplot() + 283 | 284 | { if(dps["arm_1_distr"] == dps["arm_2_distr"]) 285 | list(geom_hline(yintercept = 0.05, linetype="dashed", color="red")) 286 | } + 287 | 288 | { if(log_axes) 289 | list(scale_y_continuous(trans='log10'), 290 | scale_x_continuous(trans='log10')) 291 | } + 292 | 293 | geom_point(aes(x=iter, y = Model, col=method)) + 294 | 295 | xlab("Test p-values") + 296 | ylab("Model p-values") + 297 | theme_bw() + 298 | labs(title = "P-values: Wilks LRT vs. Rao vs. Wald | Ordinal Logistic Regression", 299 | subtitle = sprintf("N=%d samples, group size=%d obs. | comparison: %s", 300 | samples, n_group, dps_name), 301 | caption = ifelse(log_axes, "Both axes log10 transformed", "")) + 302 | theme(plot.caption = element_text(color = "red", face="italic")) 303 | }) -> plots 304 | 305 | (patchwork::wrap_plots(plot = plots, ncol = 2) & theme(legend.position = "top")) + plot_layout(guides = "collect") 306 | } 307 | ``` 308 | 309 | ---- 310 | # Printing Test vs. 
Model comparison for a single simulation result
311 | ``` r
312 | plot_differences_between_methods <- function(results, sign_level = 0.05, log_axes=FALSE) {
313 | 
314 |   properties <- attr(results, "properties")
315 |   samples <- properties$samples
316 |   n_group <- properties$n_group
317 |   under_null <- properties$under_null
318 |   title <- properties$title
319 |   hypothesis <- case_when(is.na(under_null) ~ "Mixed hypotheses",
320 |                           under_null ~ "Under H0",
321 |                           .default = "Under H1")
322 |   method <- properties$method
323 | 
324 |   results %>%
325 |     mutate(diff = Test - Model,
326 |            diff_sign = case_when(diff == 0 ~ "Test = Model", diff > 0 ~ "Test > Model", diff < 0 ~ "Test < Model"),
327 |            ratio = Test / Model,
328 |            discr = (Test <= sign_level & Model > sign_level) | (Test > sign_level & Model <= sign_level),  # TRUE when test and model disagree on rejection
329 |            discr_descr = ifelse(discr, sprintf("T: %.3f ; M: %.3f", Test, Model), NA)) -> results
330 | 
331 |   (p_rej_discrep <- results %>%
332 |      ggplot(aes(x=iter, y = discr, label=ifelse(discr, discr_descr, NA), col=discr)) +
333 |      geom_point(show.legend = FALSE) +
334 |      ggrepel::geom_label_repel(min.segment.length = 0, seed = 100,
335 |                                box.padding = .4, segment.linetype = 6,
336 |                                direction = "both",
337 |                                nudge_y = .1,
338 |                                max.overlaps = 50, size=2.5, show.legend = FALSE) +
339 |      scale_color_manual(values=c("TRUE"="red2", "FALSE"="green2")) +
340 |      scale_y_discrete(labels = c('In agreement','Discrepant')) +
341 |      theme_bw() +
342 |      labs(title = "Discrepancy in rejection: Test vs Model",
343 |           subtitle = sprintf("N=%d samples, group size=%d observations; sign. level=%.3f", samples, n_group, sign_level)) +
344 |      ylab(NULL)
345 |   )
346 | 
347 |   (p_pval_ratios <- results %>%
348 |      ggplot() +
349 |      geom_point(aes(x=iter, y = ratio)) +
350 |      geom_hline(yintercept = 1, col="green") +
351 |      theme_bw() +
352 |      scale_y_continuous(trans='log10') +
353 |      labs(title = "Ratio of p-values: Test vs Model",
354 |           subtitle = sprintf("N=%d samples, group size=%d observations", samples, n_group),
355 |           caption = "Y axis log10 transformed") +
356 |      ylab("p-value Test / p-value Model") +
357 |      theme(plot.caption = element_text(color = "red", face="italic"))
358 |   )
359 | 
360 |   (p_pval_diffs <- results %>%
361 |      ggplot() +
362 |      geom_point(aes(x=iter, y = diff)) +
363 |      geom_hline(yintercept = 0, col="green") +
364 |      theme_bw() +
365 |      labs(title = "Raw differences in p-values: Test vs Model",
366 |           subtitle = sprintf("N=%d samples, group size=%d observations", samples, n_group)) +
367 |      ylab("p-value Test - p-value Model")
368 |   )
369 | 
370 |   (p_pval_rel <- results %>%
371 |      group_by(diff_sign) %>%
372 |      summarize(n=n(),
373 |                q=list(setNames(quantile(diff), nm=c("Min", "Q1", "Med", "Q3", "Max"))),
374 |                .groups = "drop_last") %>%
375 |      mutate(p=n/sum(n)) %>%
376 |      unnest_wider(q) %>%
377 | 
378 |      ggplot() +
379 |      geom_bar(aes(x = diff_sign, y=p), stat="identity") +
380 |      scale_y_continuous(labels=scales::percent) +
381 |      theme_bw() +
382 |      xlab("Relationship") +
383 |      labs(title = "Relationship between p-values: Test vs Model",
384 |           subtitle = sprintf("N=%d samples, group size=%d observations", samples, n_group)) +
385 |      geom_label(aes(x = diff_sign, y=.5,
386 |                     label=sprintf("Quartiles of differences\nMin=%.3f\nQ1=%.3f\nMed=%.3f\nQ3=%.3f\nMax=%.3f", Min, Q1, Med, Q3, Max)),
387 |                 size=3.5) +
388 |      ylab(NULL)
389 |   )
390 | 
391 |   (p_pval_vs <- results %>%
392 |      ggplot() +
393 |      geom_point(aes(x=Test, y = Model)) +
394 | 
395 |      geom_abline(slope = 1, intercept = 0, col="green") +
396 | 
397 |      { if(isTRUE(under_null))  # skipped for mixed hypotheses, where under_null is NA
398 | 
list(geom_hline(yintercept = sign_level, linetype="dashed", color="red"), 399 | geom_vline(xintercept = sign_level, linetype="dashed", color="red"), 400 | geom_label(aes(x=0.75, y=0.25, 401 | label = sprintf("Rejections of H0 at α=%.3f\nModel: %d\nTest: %d", 402 | sign_level, 403 | sum(Model <= sign_level), 404 | sum(Test <= sign_level))))) 405 | } + 406 | 407 | { if(log_axes) 408 | list(scale_y_continuous(trans='log10'), 409 | scale_x_continuous(trans='log10')) 410 | } + 411 | 412 | xlab("Test p-values") + 413 | ylab("Model p-values") + 414 | theme_bw() + 415 | labs(title = sprintf("P-values [%s]: Test vs Model", method), 416 | subtitle = sprintf("N=%d samples, group size=%d obs., %s %s", 417 | samples, n_group, 418 | hypothesis, title), 419 | caption = ifelse(log_axes, "Both axes log10 transformed", "")) + 420 | theme(plot.caption = element_text(color = "red", face="italic")) 421 | ) 422 | 423 | (p_pval_vs | (p_rej_discrep + p_pval_ratios + 424 | p_pval_diffs + p_pval_rel + 425 | patchwork::plot_layout(ncol = 2, nrow = 2))) + 426 | plot_layout(widths = c(1, 2)) 427 | } 428 | ``` 429 | --------------------------------------------------------------------------------
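A minimal usage sketch tying these pieces together (the parameter values are illustrative only; dplyr, broom, tidyr, ggplot2, ggrepel, scales, rms and patchwork are assumed to be loaded):

```r
# 100 simulated Likert data sets with 50 observations per arm and identical
# item probabilities in both arms (i.e. under H0); model p-values via Wald
res <- simulate_wilcox_olr(samples = 100, n_group = 50, set = 0:5,
                           arm_1_prob = c(20, 10, 5, 2, 2, 2),
                           arm_2_prob = c(20, 10, 5, 2, 2, 2),
                           which_p = "Wald")

# the combined diagnostic view: scatter of p-values, discrepancies in
# rejection, ratios and raw differences
plot_differences_between_methods(res, sign_level = 0.05)
```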