├── Chapter01_Audience_aim_motivation.html ├── Chapter01_Audience_aim_motivation.ipynb ├── Chapter03_Why_python.html ├── Chapter03_Why_python.ipynb ├── Chapter04_Introduction_to_the_IPython_Notebook-2.html ├── Chapter04_Introduction_to_the_IPython_Notebook.html ├── Chapter04_Introduction_to_the_IPython_Notebook.ipynb ├── Chapter04_Introduction_to_the_IPython_Notebook.ipynbdnpp7x ├── Chapter04_Introduction_to_the_IPython_Notebook_II.ipynb ├── Chapter05_A_closer_look_at_our_data.html ├── Chapter05_A_closer_look_at_our_data.ipynb ├── Chapter08_Research_types.html ├── Chapter08_Research_types.ipynb ├── Chapter09_Data_types.html ├── Chapter09_Data_types.ipynb ├── Chapter10_Introducing_pandas.html ├── Chapter10_Introducing_pandas.ipynb ├── Chapter11_Measures_of_central_tendency_dispersion.html ├── Chapter11_Measures_of_central_tendency_dispersion.ipynb ├── Chapter12_The_connection_between_probability_and_area.html ├── Chapter12_The_connection_between_probability_and_area.ipynb ├── Chapter13_The_central_limit_theorem.html ├── Chapter13_The_central_limit_theorem.ipynb ├── Chapter13_The_central_limit_theorem.ipynb3wgaui ├── Chapter14_Z_and_t_distributions.html ├── Chapter14_Z_and_t_distributions.ipynb ├── Chapter15_Hypotheses.html ├── Chapter15_Hypotheses.ipynb ├── Chapter16_Confidence_intervals.html ├── Chapter16_Confidence_intervals.ipynb ├── Chapter17_Parametric_non_tests.html ├── Chapter17_Parametric_non_tests.ipynb ├── Chapter18_Comparing_means.html ├── Chapter18_Comparing_means.ipynb ├── Chapter19_Comparing_categorical_data.html ├── Chapter19_Comparing_categorical_data.ipynb ├── Chapter20_Linear_regression.html ├── Chapter20_Linear_regression.ipynb ├── Chapter21_Sensitivity_specificity_risks_rates_odds.html ├── Chapter21_Sensitivity_specificity_risks_rates_odds.ipynb ├── IPython_Notebook_Toolbar.png ├── Launcher.png ├── MOOC_Mock.csv ├── README.md ├── Regression.csv ├── Toolbar.png ├── odds_ratio.csv ├── p_graph.png ├── style.css └── wcc_crp.xlsx 
/Chapter01_Audience_aim_motivation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Setting up a fancy stylesheet" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "data": { 19 | "text/html": [ 20 | "\n", 21 | "\n", 22 | "\n", 23 | "\n", 24 | "\n", 25 | "\n", 26 | "\n", 109 | "\n" 110 | ], 111 | "text/plain": [ 112 | "" 113 | ] 114 | }, 115 | "execution_count": 1, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "from IPython.core.display import HTML\n", 122 | "css_file = 'style.css'\n", 123 | "HTML(open(css_file, 'r').read())" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "# Audience and aim" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "## General information" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "

This massive open online course (MOOC) was conceived during a call for proposals made by the executive of the University of Cape Town (UCT) in 2013. At that time, the continent of Africa was devoid of any tertiary education content on the large, international MOOC platforms. As the leading-ranked university on the continent, UCT perhaps saw it as its duty to contribute to this field.

\n", 145 | "

Moreover, Groote Schuur Hospital (GSH) is home to the clinical services of the Faculty of Health Sciences at UCT. Even after nearly 50 years, it is still well known for the first human heart transplant, and was therefore inevitably going to play its role.

\n", 146 | "

This course was therefore proposed and designed in the Department of Surgery at GSH. As recipient of the 2014 Open Education Consortium Individual Educator of the Year award for the development of open educational resources, I personally felt both compelled and hugely excited to bring this course to life.

" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "## Audience" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "

In this second decade of the twenty-first century, technology has awarded us the gift of new forms of teaching and learning. An outstanding example is the MOOC, in which you are currently partaking. The first term, massive, in itself presents a massive challenge. In the traditional classroom, teacher and student are bound by many common ideals, mindsets, and goals, all well ring-fenced within an individual facility.

\n", 161 | "

In contrast to this, the MOOC has to be designed to fulfill many needs and be applicable in many scenarios. Herein, though, can also lie a strength. In the previous lecture, reference was made to the fact that both novice researchers (in a first-world setting) and potential researchers (in under-resourced areas) could benefit from this course. This lecture elaborates on the intended audience.

\n", 162 | "

It is a well-established fact that a large proportion of healthcare research is conducted in the first world. Resources there are, to some extent, widely available. These resources include funding, which forms part of a self-perpetuating and well-maintained ecosystem: more research ensures continued funding, which in turn allows for the continuation of research.

\n", 163 | "

With this abundance of research come experienced researchers. These individuals and groups are well placed to establish a hierarchy for the training of new researchers. Time, effort, and money are all available, not only to teach newcomers, but also to market the idea of conducting research.

\n", 164 | "

More financial resources also bring more personnel. Healthcare provision is understood to stand on three pillars: service delivery, teaching, and research. With a larger pool of human resources, more time is available to individuals to concentrate on research and to induct newcomers into the field.

\n", 165 | "

However, with this excellence in research comes competition in the workplace. Those who wish to embark on healthcare research have a lot to live up to. The expectations are indeed enormous and the bar is set extremely high. A subset of the intended audience for this course, then, is those who intend to get a head-start in this process. No doubt most of them will have resources available at their home institutions, but as with most online courses, the intent is not to replace, but to augment and provide yet another learning and teaching opportunity.

\n", 166 | "

The healthcare research spectrum is wide indeed, and towards the opposite end lies the 'under-resourced' setting. In many cases little or no research activity is noted. Experienced individuals are scarce, and many of them are lost to better-equipped facilities and countries.

\n", 167 | "

Monetary resources are much more difficult to come by in these areas. This reduces the personnel available to work in healthcare, taking those involved away from research to concentrate on service delivery and, to a lesser extent, teaching. Perhaps underappreciated is the fact that no culture exists to market research under these circumstances. This point has been raised twice now, with the explicit aim of emphasizing the need to sell the prospect of becoming involved in research. In many instances service delivery becomes the only way for individuals to attain job security and therefore an income.

\n", 168 | "

All of this is compounded by the burden of disease prevalent in many of these settings. Long delays ensue and the natural progression of disease is seen to a much larger extent than in the first world. Not only do patients present late in the course of their disease, but the disease profile itself does not reflect that of the first world. The latter is engaged in the research of diseases of relative importance to it, and hence the emphasis is not always placed on diseases that affect a large proportion of humanity.

\n", 169 | "

Lastly, addressing specifically the individual for whom this course was developed, consideration must be given to the modern viewpoint of interdisciplinary and integrated healthcare. A great many individuals and groups are involved in modern healthcare delivery, irrespective of the setting. From laboratory scientist to healthcare manager; from medical specialist to assistant nurse; and from pharmacist to physiotherapist, every role-player will find some benefit in this course.

" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "## Aim" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "

The aim of this course is quite simply to provide a structural foundation for the first steps in healthcare research: a foundation structured so that it is independent of where in the audience spectrum a participant might be, and of the caregiver group he or she may represent.

\n", 184 | "

In order to contribute effectively to the effort of healthcare research, this structural foundation requires some essential ingredients. These range from the obvious (at least a basic understanding of statistics) to more practical issues.

\n", 185 | "

To commence any research, a research question must exist. This involves the identification of a problem that requires a solution or, at least, an answer. It is important to understand how to identify these questions. An essential step in formulating a proper question is the need to identify other research that has been conducted in this field. The ability to write a literature review is a skill that all healthcare providers should possess.

\n", 186 | "

From the research questions that arise, the need to gather data evolves. This is not a trivial task and requires deep thought. Not only are the logistics important, but the actual data to be collected must contain within it the solution to the research question. Data should be collectable in an accurate and reproducible way, whilst simultaneously not being duplicated.

\n", 187 | "

Collecting data in the setting of healthcare very often involves personal patient information and it is absolutely essential to understand the ethical ways of obtaining, storing, and using this data.

\n", 188 | "

The actual vehicle of data capturing must be understood. In this modern era, many devices and software tools allow not only for accurate data capturing, but also for the analysis of the data. It is important to have at least some understanding of the tools available, and to be comfortable in the use of some of them.

\n", 189 | "

Once the data is analyzed it becomes important to share the results with peers and the wider community. A skill essential in this regard is the writing of a manuscript for presentation or submission either in the form of an abstract or a full-text article.

\n", 190 | "

This course, then, aims to take you through this process.

" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "# Motivation for this lecture series" 198 | ] 199 | }, 200 | { 201 | "cell_type": "markdown", 202 | "metadata": {}, 203 | "source": [ 204 | "

Welcome to this series on conducting healthcare research.

\n", 205 | "

Whilst the main study material for this course is provided in the form of video lectures, it is my aim to provide you, the participants, with as many educational resources as possible. It is clear that some learners prefer watching video lectures, finding maximum pedagogical benefit from this type of resource, rich in visual and auditory input, whilst others benefit most from the written word. For most of us, personal preference lies somewhere in between. I trust, then, that this text will be of some use.

\n", 206 | "

To further enrich the learning experience, the text is provided within an active coding environment. You could thus choose to simply stick to the written word, or you can partake in the actual analysis of data using code.

\n", 207 | "Although we can in no way compare to research performed in the hard sciences, such as particle physics, we have made huge strides through the introduction of evidence-based research, clinical management, and teaching.

" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "## Healthcare as a science" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "

Research in the health sciences is taking the first steps towards becoming a true scientific endeavor, such as physics. Many of us are old enough to, with some fondness, remember the days of the preeminent statesman-like professor, on whose words we hung. Eminence-based medicine has a rich and prolific history, indeed.

\n", 222 | "

As previously mentioned, when considering physics, it is important to take cognizance of the long lead-time it has over healthcare research. The strides made in the former field began at the start of the twentieth century, when Rutherford’s postgraduate students first noted evidence for the nuclear form of the atom. Bohr gave us the structure of electron orbits based on the work of Planck and Einstein, from which grew our fundamental understanding (or lack thereof) of the quantum world.

\n", 223 | "

Not only does physics have a much longer research history, which has led to a deep and wonderful understanding of the inner workings of our universe, but it also benefits from the size of the data available to work with. We need only consider the work done at the Large Hadron Collider in Switzerland. The sheer volume of data is so large that most of it has to be discarded. Even so, physicists have to make use of distributed computing, with interconnected computers all over the world, to store and evaluate the data.

\n", 224 | "

Contrast this with the sample sizes seen, on average, in a typical healthcare research paper. We are bound by an arbitrary value of two standard deviations as a measure of significance, whilst our learned colleagues in physics can set the bar orders of magnitude higher.
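As a rough illustration (my own sketch, not part of the lecture, and requiring Python 3.8+ for `statistics.NormalDist`), the two thresholds translate into very different two-sided *p* values:

```python
from statistics import NormalDist  # standard library, Python 3.8+

def two_sided_p(sigma):
    """Two-sided tail probability of a standard normal beyond ±sigma."""
    return 2 * (1 - NormalDist().cdf(sigma))

print(two_sided_p(2))  # about 0.0455, the usual healthcare threshold
print(two_sided_p(5))  # about 5.7e-07, the particle-physics 'five sigma' bar
```

The function name `two_sided_p` is mine; the point is simply that a five-sigma criterion is tens of thousands of times stricter than two sigma.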

\n", 225 | "

Turning full circle, we need to remember that our understanding of disease is still very primitive. We design studies by trying to isolate a characteristic to use in comparing various groups. It is a serious matter to consider that these characteristics are seldom ‘singular’. If we knew more, we would probably discover a large and rich causal structure underlying our characteristic of choice, and if we could design studies to investigate these, our patient care would improve.

\n", 226 | "

Consider the following scenario: a researcher looks into breast carcinoma and groups patients according to stage. An intervention is shown to produce a significant difference between a set of groups, constructed from tumor size and nodal and distal spread. In the analysis, statistical outliers will invariably be noted: patients whose reactions to the intervention stand in contrast to those of their group. Alas, they are often lost in our analysis.

\n", 227 | "

Yet we now know enough to understand that tumor size and disease spread have a deeper causal layer, involving genetics as an example. The rich interaction between genetic mutations in tumor genes and the patient’s defense system, with the resulting sensitivity to chemotherapeutic agents, is much more important to investigate. For this, though, we need adequate sample sizes, something physicists will always have over us.

\n", 228 | "

Akin to their efforts, however, we are metaphorically digging deeper. In order for us to continue this process, it is important to widen the research capabilities of those involved in healthcare. This can be achieved in a variety of ways. Considering those who are to take up positions in healthcare in well-resourced environments, it is perhaps prudent to provide courses such as this to those who wish to get a head-start in the process before being formally introduced to the topic during their early education. For others, perhaps in more austere circumstances, the luxury of well-funded research units with experienced colleagues is not available. Workers in these circumstances, though, are usually blessed with riches in available clinical data. They might be in the position to solve many of our outstanding research questions if provided with the tools to commence their own research.

\n", 229 | "

One of the major steps towards improving our science, then, is to enable all healthcare workers to understand data analysis. This is the true motivation for this course. It is my sincerest wish that someone out there will find value in this course as they embark on a career in healthcare research that will benefit us all.

" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": { 236 | "collapsed": false 237 | }, 238 | "outputs": [], 239 | "source": [] 240 | } 241 | ], 242 | "metadata": { 243 | "kernelspec": { 244 | "display_name": "Python 3", 245 | "language": "python", 246 | "name": "python3" 247 | }, 248 | "language_info": { 249 | "codemirror_mode": { 250 | "name": "ipython", 251 | "version": 3 252 | }, 253 | "file_extension": ".py", 254 | "mimetype": "text/x-python", 255 | "name": "python", 256 | "nbconvert_exporter": "python", 257 | "pygments_lexer": "ipython3", 258 | "version": "3.4.3" 259 | } 260 | }, 261 | "nbformat": 4, 262 | "nbformat_minor": 0 263 | } 264 | -------------------------------------------------------------------------------- /Chapter04_Introduction_to_the_IPython_Notebook_II.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "from warnings import filterwarnings\n", 12 | "from IPython.core.display import HTML\n", 13 | "from IPython.display import Image\n", 14 | "\n", 15 | "filterwarnings('ignore')" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "# IPython Notebook" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "

Fernando Pérez, Brian E. Granger, IPython: A System for Interactive Scientific Computing, Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: http://ipython.org

" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "## Learning more" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "

You can learn a lot more about *IPython* from their website: http://www.ipython.org

" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 2, 49 | "metadata": { 50 | "collapsed": true 51 | }, 52 | "outputs": [ 53 | { 54 | "data": { 55 | "text/html": [ 56 | "" 57 | ], 58 | "text/plain": [ 59 | "" 60 | ] 61 | }, 62 | "execution_count": 2, 63 | "metadata": {}, 64 | "output_type": "execute_result" 65 | } 66 | ], 67 | "source": [ 68 | "HTML('')" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "### Toolbar" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "### Text" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "### A fancy calculator" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 3, 100 | "metadata": { 101 | "collapsed": false 102 | }, 103 | "outputs": [], 104 | "source": [ 105 | "a = 7\n", 106 | "#a = 7" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "+ Very exiting indeed!" 
114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "+ Let's heat things up a bit and repeat the process with another computer variable" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 4, 126 | "metadata": { 127 | "collapsed": false 128 | }, 129 | "outputs": [], 130 | "source": [ 131 | "b = 4\n", 132 | "#b = 4" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "+ I can get to the value in the bucket at any time\n", 140 | "+ Here are two ways to do it" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 39, 146 | "metadata": { 147 | "collapsed": false 148 | }, 149 | "outputs": [ 150 | { 151 | "data": { 152 | "text/plain": [ 153 | "7" 154 | ] 155 | }, 156 | "execution_count": 39, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "a # Simply refer to the computer variable name" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 40, 168 | "metadata": { 169 | "collapsed": false 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "\n", 174 | "#print(a) # Use the print command" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "+ Now, let's do some math\n", 182 | "+ What is *74* ?" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 41, 188 | "metadata": { 189 | "collapsed": false 190 | }, 191 | "outputs": [], 192 | "source": [ 193 | "\n", 194 | "#a ** b # Unlike most other computer languages the code for power is not ^, but **" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "+ Here are some others\n", 202 | "+ Can you work it out?" 
203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 42, 208 | "metadata": { 209 | "collapsed": false 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "\n", 214 | "#a + b # Simple addition" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 43, 220 | "metadata": { 221 | "collapsed": false 222 | }, 223 | "outputs": [], 224 | "source": [ 225 | "\n", 226 | "#a - b # Simple subtraction" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 44, 232 | "metadata": { 233 | "collapsed": false 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "\n", 238 | "#a * b # Simple multiplication" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 45, 244 | "metadata": { 245 | "collapsed": false 246 | }, 247 | "outputs": [], 248 | "source": [ 249 | "\n", 250 | "#a / b # Simple subtraction" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 46, 256 | "metadata": { 257 | "collapsed": false 258 | }, 259 | "outputs": [], 260 | "source": [ 261 | "\n", 262 | "#a % b # The modulus (or remainder), i.e. what is left after doing 7 / 4" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 47, 268 | "metadata": { 269 | "collapsed": false 270 | }, 271 | "outputs": [], 272 | "source": [ 273 | "\n", 274 | "#(a + b) / 2 # Forcing the order of arithemtic operation" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "+ Now for the reason why I insist on using the name *computer variable*. \n", 282 | "+ Can you guess what's going to happen here?" 
283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 48, 288 | "metadata": { 289 | "collapsed": false 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "\n", 294 | "#a = a + 5" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "+ For computer variables, the right hand side of the equation is evaluated first and then the new value is entererd into the referenced bucket" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 49, 307 | "metadata": { 308 | "collapsed": false 309 | }, 310 | "outputs": [], 311 | "source": [ 312 | "\n", 313 | "#print(a)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "### Importing our first library" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "+ Herewith the beauty of python™!\n", 328 | "+ We can extend the language by simply importing code someoene else have developed!" 
329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 50, 334 | "metadata": { 335 | "collapsed": false 336 | }, 337 | "outputs": [], 338 | "source": [ 339 | "\n", 340 | "#import math" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 51, 346 | "metadata": { 347 | "collapsed": false 348 | }, 349 | "outputs": [], 350 | "source": [ 351 | "\n", 352 | "#math.pi" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "+ Let's ask the *math* module to calculate the *sine* of *π*" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 52, 365 | "metadata": { 366 | "collapsed": false 367 | }, 368 | "outputs": [], 369 | "source": [ 370 | "\n", 371 | "#math.sin(math.pi) # Because pi is restricted to a few decimal places, the aritmetic will look\n", 372 | "# a bit funny" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": 19, 378 | "metadata": { 379 | "collapsed": false 380 | }, 381 | "outputs": [], 382 | "source": [ 383 | "\n", 384 | "#from math import cos, pi # Negating the need to use the module name" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": 20, 390 | "metadata": { 391 | "collapsed": false 392 | }, 393 | "outputs": [], 394 | "source": [ 395 | "\n", 396 | "#cos(pi)" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": 21, 402 | "metadata": { 403 | "collapsed": false 404 | }, 405 | "outputs": [], 406 | "source": [ 407 | "\n", 408 | "#import numpy as np # Importing a module and using an abbreviation" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 22, 414 | "metadata": { 415 | "collapsed": false 416 | }, 417 | "outputs": [], 418 | "source": [ 419 | "\n", 420 | "#np.cos(np.pi)" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": {}, 426 | "source": [ 427 | "### Using strings" 428 | ] 429 | }, 430 | { 431 | "cell_type": 
"markdown", 432 | "metadata": {}, 433 | "source": [ 434 | "+ A computer variable can contained other things too!" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 23, 440 | "metadata": { 441 | "collapsed": false 442 | }, 443 | "outputs": [], 444 | "source": [ 445 | "\n", 446 | "#a = 'OK, here we go! Hello, world!' # Adding a string value to a computer variable" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": 24, 452 | "metadata": { 453 | "collapsed": false 454 | }, 455 | "outputs": [], 456 | "source": [ 457 | "\n", 458 | "#a" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": 25, 464 | "metadata": { 465 | "collapsed": false 466 | }, 467 | "outputs": [], 468 | "source": [ 469 | "\n", 470 | "#print(a)" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": 26, 476 | "metadata": { 477 | "collapsed": false 478 | }, 479 | "outputs": [], 480 | "source": [ 481 | "\n", 482 | "#b = 'This is python!'" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": 27, 488 | "metadata": { 489 | "collapsed": false 490 | }, 491 | "outputs": [], 492 | "source": [ 493 | "\n", 494 | "#a + ' ' + b # Adding a space between the strings " 495 | ] 496 | }, 497 | { 498 | "cell_type": "code", 499 | "execution_count": 28, 500 | "metadata": { 501 | "collapsed": false 502 | }, 503 | "outputs": [], 504 | "source": [ 505 | "\n", 506 | "#print(a, b) # No space required with this method" 507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "execution_count": 29, 512 | "metadata": { 513 | "collapsed": false 514 | }, 515 | "outputs": [], 516 | "source": [ 517 | "\n", 518 | "#a * 2 # Surprise!" 
519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 30, 524 | "metadata": { 525 | "collapsed": false 526 | }, 527 | "outputs": [], 528 | "source": [ 529 | "\n", 530 | "#a = [12, 13, 18] # Creating an array" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": 31, 536 | "metadata": { 537 | "collapsed": false 538 | }, 539 | "outputs": [], 540 | "source": [ 541 | "\n", 542 | "#a" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 32, 548 | "metadata": { 549 | "collapsed": false 550 | }, 551 | "outputs": [], 552 | "source": [ 553 | "\n", 554 | "#a * 2 # An array is not a matrix" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 33, 560 | "metadata": { 561 | "collapsed": false 562 | }, 563 | "outputs": [], 564 | "source": [ 565 | "\n", 566 | "#a = [2, 3, 'hello', b] # A lot of different things in my array" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 34, 572 | "metadata": { 573 | "collapsed": false 574 | }, 575 | "outputs": [], 576 | "source": [ 577 | "\n", 578 | "#a" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "### Boolean values" 586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "execution_count": 35, 591 | "metadata": { 592 | "collapsed": false 593 | }, 594 | "outputs": [], 595 | "source": [ 596 | "\n", 597 | "#a == b # The double equal sign asks a question: Is the value in the computer variable, a, equal\n", 598 | "# to the value in computer variable b?" 599 | ] 600 | }, 601 | { 602 | "cell_type": "code", 603 | "execution_count": 36, 604 | "metadata": { 605 | "collapsed": false 606 | }, 607 | "outputs": [], 608 | "source": [ 609 | "\n", 610 | "#a != b # The question asked here is: Are they different?" 
611 | ] 612 | }, 613 | { 614 | "cell_type": "code", 615 | "execution_count": 37, 616 | "metadata": { 617 | "collapsed": false 618 | }, 619 | "outputs": [], 620 | "source": [ 621 | "\n", 622 | "#a = 'I love math!' # Simply a law of nature! Accept it now!!!" 623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": 38, 628 | "metadata": { 629 | "collapsed": false 630 | }, 631 | "outputs": [], 632 | "source": [ 633 | "\n", 634 | "#a == 'I love math!'# I told you to accept it!" 635 | ] 636 | } 637 | ], 638 | "metadata": { 639 | "kernelspec": { 640 | "display_name": "Python 3", 641 | "language": "python", 642 | "name": "python3" 643 | }, 644 | "language_info": { 645 | "codemirror_mode": { 646 | "name": "ipython", 647 | "version": 3 648 | }, 649 | "file_extension": ".py", 650 | "mimetype": "text/x-python", 651 | "name": "python", 652 | "nbconvert_exporter": "python", 653 | "pygments_lexer": "ipython3", 654 | "version": "3.4.3" 655 | } 656 | }, 657 | "nbformat": 4, 658 | "nbformat_minor": 0 659 | } 660 | -------------------------------------------------------------------------------- /Chapter08_Research_types.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Setting up a fancy stylesheet" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "data": { 19 | "text/html": [ 20 | "\n", 21 | "\n", 22 | "\n", 23 | "\n", 24 | "\n", 25 | "\n", 26 | "\n", 109 | "\n" 110 | ], 111 | "text/plain": [ 112 | "" 113 | ] 114 | }, 115 | "execution_count": 1, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "from IPython.core.display import HTML\n", 122 | "css_file = 'style.css'\n", 123 | "HTML(open(css_file, 'r').read())" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | 
"metadata": {}, 129 | "source": [ 130 | "# Research types" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "## Introduction" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "

This lecture is devoted to the topic of study design. One of the most important factors, both in reading and evaluating published literature and in designing your own research, is a thorough understanding of the variety in study types. Indeed, the choice of study type has a direct bearing on the findings and their importance and relevance.

\n", 145 | "

As with most scientific endeavors, the classification of study type is varied and different schemes exist to classify designs. One logical approach is to divide designs into two main groups: *observational* and *experimental*. In observational studies, subjects or variables are merely observed, while in experimental studies an intervention is performed. From this basic difference a scheme to differentiate study designs follows.

\n", 146 | "

I'll conclude with two study types that fall outside of the chosen classification system: *meta-analysis* and *reviews*.\n", 147 | "

" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "## Classification" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "

Observational studies consist of four main types: *case-series*, *case-control*, *cross-sectional*, and *cohort* studies.

\n", 162 | "

Experimental studies are much easier to identify, as they directly state the form of intervention, whether it be the administration of a drug or the performance of a procedure. In the case where experimental studies involve humans as test subjects, the term *clinical trial* is more commonly used. Experimental trials are further divided depending on the type of control that is used for comparison: *trials with independent concurrent controls*, *trials with self-controls* (dependent groups), *trials with external controls*, and *uncontrolled trials*.

" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## Case series" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "

This is perhaps the simplest of all study types and reports a simple descriptive account of a characteristic observed in a group of subjects. A good example would be a *clinical audit*, merely measuring various variables of a particular subject group over a defined period.

\n", 177 | "

No research hypothesis exists and these studies are usually used to identify interesting observations for more detailed research or future planning.

\n", 178 | "

Note that the actual data collection (for this and other types) can be performed either in *retrospect*, where we take the files and hunt for values for chosen variables, or in *prospect*, where we decide beforehand what variables we want documented and collect these during the management of every patient.

" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "## Case-control series" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "

A case-control study starts with the presence (or absence) of an outcome and looks backwards in time, comparing subjects with and without the particular outcome in an attempt to find risk factors or characteristics that differ between the groups. It is important to note that the two groups are identified at the start of the study and data points on various variables are then collected retrospectively.

\n", 193 | "

Do not confuse this with the terms prospective and retrospective. Let me explain. Nothing prevents you from setting up a database for any particular group or disease or whatever in your department or unit. You may have no plans yet for any research. What you are looking for is a set of variables that you want completed for every patient. This type of data collection is done prospectively. On the other hand, you might not have had this forethought and simply go to the free-form notes in the patient files. This data collection is done retrospectively.

\n", 194 | "

Back to case-control series. A good example would be to look at a group of patients who have had a particular procedure, with two varying outcomes. The fact that both groups had the same intervention differentiates this from an experimental study. Individuals with one outcome are separated from those with another outcome and data is collected on decided variables. The measurements are compared to look for any differences.

\n", 195 | "

The main disadvantage of this type of study is a concept termed confounding. It refers to the fact that a specific causative agent or parameter is difficult to identify. A good illustrative example would be that of the intake of meat in patients who suffer a myocardial infarction. It might be found that these patients have a high level of meat intake, but this is no proof that the meat was the causative agent. Meat-eaters might also have a high incidence of smoking, drinking, poor exercise regimes, and many others.

\n", 196 | "

In a *nested case-control study*, cases and controls are taken from a prospective cohort study. The cases develop a certain effect and the controls don’t.

" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "## Cross-sectional series" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "

Many types of studies can be classified as cross-sectional and these include surveys, epidemiological studies, and prevalence studies. The identifying factor common to these heterogeneous groups is that an observation was made at a particular point in time, as opposed to over a period of time. To complicate matters ever so slightly, this type of study can be incorporated in other designs. It can, for example, be used as part of case-control or cohort studies.

\n", 211 | "

A common use for cross-sectional studies is investigating correlation, whether between cause and effect or accuracy of diagnostic techniques, to name but two examples. As with case-control studies, cross-sectional studies also suffer from confounding, and when cause and effect are under investigation, it is often impossible to prove that the presumed cause really is the cause and indeed came before the effect.

" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "## Cohort series" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "

Here, it is important to define a *cohort* as a group of individuals (with a common trait such as a disease or risk factor) who remain part of that group over an extended time period. Subjects with the specific characteristic are identified at the start of the study and followed, with measurement of various variables, in a prospective manner. The time period does not necessarily have to be that long. Some prospective follow-ups after an intervention can last for a few days, for instance after a surgical procedure; this is termed outcome assessment.

\n", 226 | "

Cohort and case-control studies are differentiated by their direction of inquiry, with the former always collecting data from a starting point, moving forward in time, and the latter starting from a point in time and collecting data points that were noted before that point.

" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "## Trials with independent concurrent controls" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "

Moving on to experimental trials, this first type includes trials that group patients who have and have not had an intervention. These are usually termed the experimental and the control groups. An attempt is made to have no other differences between the groups other than the intervention. This is usually impossible, with common differences in age, gender, body habitus, and many others present. Subgroups can then usually be formed so as to keep the differences to a non-clinically significant level.

\n", 241 | "

What separates this type of study from other experimental studies is the fact that data points are collected on both groups at the same time, i.e., they run concurrently.

\n", 242 | "

When both the subject and the observer are unaware of which group the subject belongs to, the term *double-blinded* is used. If only the subject is unaware, it is termed a *blinded* study.

\n", 243 | "

The pinnacle of clinical research is usually seen to be the randomized controlled trial, which is grouped under this heading of trials with independent concurrent controls, with the double-blinded form being the gold standard. It is often the strongest form of evidence to prove causation.

" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "## Trials with self-controls" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "

Here, the test subjects form their own controls. These studies are similar to cohort studies, except for the fact that an intervention took place.

\n", 258 | "

The most elegant subtype here is the cross-over study. Two groups are formed, each with its own intervention; most commonly one group will receive a placebo. A period where no intervention takes place follows, after which the interventions are restarted, but interchanged between the individuals in the two groups.

" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "## Trials with external controls" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "

An appropriate example would be the use of previously published groups as controls for comparison against a current intervention group. These are termed *historical controls*. They are commonly used in oncological research where an effective treatment does not yet exist.

\n", 273 | "

The risk lies in the choice of the control group, a choice that may bias the outcome of the statistical analysis.

" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "metadata": {}, 279 | "source": [ 280 | "## Uncontrolled trials" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "

In these studies an intervention takes place, but there are no controls. The hypothesis is that there will be varying outcomes and reasons for these can be elucidated from the data. No attempt is made to evaluate the intervention itself, as it is not being compared to either a placebo or an alternative form of intervention.

" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "## Meta-analysis" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "

A *meta-analysis* uses pre-existing research and combines their results to permit an overall conclusion. It is especially helpful if the individual studies that make up the analysis are of inadequate size or power.

" 302 | ] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "## Reviews" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "

As in the case of a meta-analysis, various papers are included based on a particular factor. The discussion around this amalgamation, though, remains qualitative, and no quantitative assessment or summary of findings is included.

" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": null, 321 | "metadata": { 322 | "collapsed": false 323 | }, 324 | "outputs": [], 325 | "source": [] 326 | } 327 | ], 328 | "metadata": { 329 | "kernelspec": { 330 | "display_name": "Python 3", 331 | "language": "python", 332 | "name": "python3" 333 | }, 334 | "language_info": { 335 | "codemirror_mode": { 336 | "name": "ipython", 337 | "version": 3 338 | }, 339 | "file_extension": ".py", 340 | "mimetype": "text/x-python", 341 | "name": "python", 342 | "nbconvert_exporter": "python", 343 | "pygments_lexer": "ipython3", 344 | "version": "3.4.3" 345 | } 346 | }, 347 | "nbformat": 4, 348 | "nbformat_minor": 0 349 | } 350 | -------------------------------------------------------------------------------- /Chapter09_Data_types.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Setting up a fancy stylesheet" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "data": { 19 | "text/html": [ 20 | "\n", 21 | "\n", 22 | "\n", 23 | "\n", 24 | "\n", 25 | "\n", 26 | "\n", 109 | "\n" 110 | ], 111 | "text/plain": [ 112 | "" 113 | ] 114 | }, 115 | "execution_count": 1, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "from IPython.core.display import HTML\n", 122 | "css_file = 'style.css'\n", 123 | "HTML(open(css_file, 'r').read())" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "# Data types" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "## Introduction" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "

In order to conduct a study, information is captured in a spreadsheet or in a database. Examples of this include demographic data such as patient gender and age, or specific medical laboratory values such as leukocyte count. Each of these is referred to as a variable. Variables can be divided into two main groups, called *categorical* and *numerical*, and the start of this chapter deals with this form of classification. It greatly influences the type of statistical analysis that can be performed.

\n", 145 | "

Later in the chapter, I introduce another classification system. Again, there are two main types, named *discrete* and *continuous*.

" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "## Categorical and numerical data types" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "

*Categorical data* is data that either represents words or, if numbers, carries no mathematical value. Categorical data is further divided into *nominal* and *ordinal* data types.

\n", 160 | "

A good example of the nominal categorical data type is disease names. Disease names can be grouped and counted, but no mathematical procedure can be performed on the names themselves.

\n", 161 | "

Ordinal categorical data types also represent words, but in this instance, they can be ranked. The Likert scale, used for surveys, is a good example. The subjects completing the survey may be asked to rate their opinion of a statement by selecting from a list either: strongly disagree, disagree, neither agree nor disagree, agree, or finally, strongly agree. There is definitely some order to this categorical type. It can even be expressed as numbers, for instance subjects can be asked to rate how much they agree with a statement, with the choices being one to ten. It should be clear that although numbers can be used, no mathematical operation can be performed with this data. It would be pointless to suggest that a subject who chooses four agrees twice as much as one who chooses two. It is incorrect to calculate the mean or standard deviation from this data!

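The point about ordinal data can be shown in a short sketch. The Likert responses below are hypothetical, coded 1 (strongly disagree) through 5 (strongly agree): order-based summaries such as the median and mode respect the ranking, whereas a mean would wrongly treat the codes as true quantities.

```python
from statistics import median, mode

# Hypothetical Likert responses, coded 1 (strongly disagree) to 5 (strongly agree)
responses = [4, 5, 3, 4, 2, 5, 4, 3, 4, 1]

# The codes can be ranked, so the median and mode are meaningful summaries
print(median(responses))  # -> 4.0
print(mode(responses))    # -> 4
```

A mean of these codes would compute without complaint, which is exactly why this mistake is so easy to make.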
\n", 162 | "

Numerical data is likewise divided into two sub-types. They are *interval* and *ratio*. As the names imply, both of these consist of numeric values only. Degrees Celsius is a good example of interval numeric data. There is a clear and consistent difference between values, such as 37.1 and 36.5 deg C. It is distinguished from the ratio numerical data type, though, in that no true zero exists. It might be true that zero deg C exists, but it is not a true zero. Only the Kelvin temperature scale has a true zero. It is thus wrong to suggest that 20 deg C is twice as warm as 10 deg C! \n", 163 | "Variables with a true zero, such as leukocyte count, blood pressure, and many others, are ratio numeric data types."

Another way to look at data variables is to make the distinction between discrete and continuous types. Data variables are said to be discrete when the actual values are not sub-divisible. A smallest unit of increment exists and cannot be divided any further. It is useful to introduce the concept of discrete data types by way of examples. The simplest non-medical example of a discrete data type is the flip of a coin. The values can only be heads or tails. Another famous example is the rolling of a die. We will take a look at the probabilities associated with the outcomes of the rolling of dice in a later chapter.

\n", 178 | "

With only two outcomes, the sample space of the flip of a coin is said to be *binomial*, a type of discrete variable. Examples from medical research would include the presence of a complication. If a patient has undergone a procedure or received some form of treatment, a complication of that procedure or treatment may or may not occur. Only one of two outcomes is possible, hence the distribution of the possible outcomes, called the frequency distribution, is binomial. Please be careful, though. It is not the fact that the values are integers that makes them discrete. Some variables with decimals can also be discrete, as long as they have a finite smallest increment between them that is no longer divisible. Truth be told, it is sometimes somewhat arbitrary!

\n", 179 | "

For continuous data types there are no gaps in between values. This might take a while to get used to. Consider leukocyte count, which, when expressed in cells per liter, has a range in the order of *10^9*. Now, although each cell is a whole and cannot be subdivided and still leave a complete cell, the sheer numbers that are involved in clinical practice make “the gaps” almost infinitely divisible, and leukocyte count is considered to be continuous.\n", 180 | "Even money can be difficult to come to terms with as a continuous variable. A single cent cannot be further divided. It must be understood, though, that when financial institutions work with large sums of money and have to calculate percentages of interest, fractions of cents are involved. They cannot simply disappear. Eventually a lot of money will disappear and that’s not good. Therefore, money is also a continuous variable.

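The discrete, binomial case described above can be sketched with a quick simulation (a made-up coin, using only the standard library): no matter how many flips we take, only two outcomes ever occur, and the proportion of heads settles near one half.

```python
from random import choice, seed

seed(0)  # make the simulation reproducible

# A coin flip is a discrete (binomial) variable: only two outcomes exist
flips = [choice(['heads', 'tails']) for _ in range(1000)]

print(set(flips))                   # only the two possible values appear
print(flips.count('heads') / 1000)  # close to 0.5 in the long run
```

No value "between" heads and tails is ever produced, which is precisely what makes the variable discrete.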
" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": { 187 | "collapsed": false 188 | }, 189 | "outputs": [], 190 | "source": [] 191 | } 192 | ], 193 | "metadata": { 194 | "kernelspec": { 195 | "display_name": "Python 3", 196 | "language": "python", 197 | "name": "python3" 198 | }, 199 | "language_info": { 200 | "codemirror_mode": { 201 | "name": "ipython", 202 | "version": 3 203 | }, 204 | "file_extension": ".py", 205 | "mimetype": "text/x-python", 206 | "name": "python", 207 | "nbconvert_exporter": "python", 208 | "pygments_lexer": "ipython3", 209 | "version": "3.4.3" 210 | } 211 | }, 212 | "nbformat": 4, 213 | "nbformat_minor": 0 214 | } 215 | -------------------------------------------------------------------------------- /Chapter11_Measures_of_central_tendency_dispersion.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Setting up a fancy stylesheet" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "data": { 19 | "text/html": [ 20 | "\n", 21 | "\n", 22 | "\n", 23 | "\n", 24 | "\n", 25 | "\n", 26 | "\n", 109 | "\n" 110 | ], 111 | "text/plain": [ 112 | "" 113 | ] 114 | }, 115 | "execution_count": 1, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "from IPython.core.display import HTML\n", 122 | "css_file = 'style.css'\n", 123 | "HTML(open(css_file, 'r').read())" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "Setting up the required python ™ environment" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 2, 136 | "metadata": { 137 | "collapsed": false 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "import pandas as pd\n", 142 | "import 
numpy as np\n", 143 | "import matplotlib.pyplot as plt\n", 144 | "import seaborn as sns\n", 145 | "from warnings import filterwarnings\n", 146 | "#from scipy import mean\n", 147 | "#from scipy.stats import chi2_contingency, ranksums, bayes_mvs, ttest_ind, ranksums\n", 148 | "#import scikits.bootstrap as bs\n", 149 | "\n", 150 | "%matplotlib inline\n", 151 | "sns.set_style('whitegrid')\n", 152 | "sns.set_context('paper', font_scale = 2.0, rc = {'lines.linewidth': 1.5, 'figure.figsize' : (10, 8)})\n", 153 | "filterwarnings('ignore')" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "# Measures of central tendency and dispersion" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "## Introduction" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "In this notebook we will take a look at the most common statistical tools
\n", 175 | "* Measures of central tendency\n", 176 | " - Mean (average)\n", 177 | " - Median\n", 178 | " - Mode\n", 179 | "* Measures of dispersion\n", 180 | " - Range (minimum and maximum)\n", 181 | " - Standard deviation\n", 182 | " - Variance\n", 183 | " - Quantiles and percentiles" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "## Importing our data" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 3, 196 | "metadata": { 197 | "collapsed": false 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "data = pd.read_csv('MOOC_Mock.csv')" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "## Measures of central tendency" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "The various forms of central tendency are methods that try to represent a group of values with a single value, representative of the whole group. In order for such a number to be representative, it must reflect some tendency in that group. For the three methods here, namely, mean, median, and mode, the attempt is to represent some central value in the group. " 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "### Mean" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "The *mean* value of a group of values, or the *average* value, is a very simple concept and can easily be calculated by adding all the values in the group and dividing that sum by the number of values in the group. \n", 230 | "\n", 231 | "It is child’s play to consider the mean of four and six, since four plus six is ten and dividing ten by two (since the group consists of two values), is five.\n", 232 | "\n", 233 | "The mean is an excellent way to represent all your data points for a single variable in one single value. 
In this regard it is most useful when the variable is a ratio type numerical variable. Examples would be mean white cell count, systolic blood pressure, age, and many, many more." 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "### Median" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "*Median* is another way to represent central tendency and it searches for a value for which half of the numbers are smaller than and the other half are larger than the calculated value.\n", 248 | "\n", 249 | "The immediate question that arises is: \"That's fine for an odd number of values, but what happens if there are an even number of values?\". That's easy to fix. Simply take the average of the middle two values. Here is an example:\n", 250 | "\n", 251 | "Values: *9, 11, 12, 13, 14, 16*. There are six values. Clearly the middle two would be *12* and *13*. Their mean being *12.5*. The median would be *12.5*, since half of the values are less than *12.5* in value and half are more than *12.5* in value.\n", 252 | "\n", 253 | "There are two common uses for median as opposed to mean:\n", 254 | "* If some of your data points are out of keeping with the rest. Say you have values of *10, 11, 9, 14, 9, 10, 30, 43*. Clearly the values of *30* and *43* are way outside of the rest. Calculating a mean would be a misrepresentation of the data if you think about it.\n", 255 | "* The second use is more subtle. Imagine calculating the Modified Alvarado Score (MAS) for patients with suspected appendicitis. These data points are integers and do not represent continuous variables. Would it be correct to represent all the MAS-values as a mean? That would seem ever so slightly odd. *Median* would be much better here. We are dealing with ordinal categorical data points after all."
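The two examples above can be checked directly in Python. This sketch uses the standard library's statistics module rather than pandas, just to keep it self-contained:

```python
from statistics import mean, median

# Even number of values: the median is the mean of the middle two (12 and 13)
values = [9, 11, 12, 13, 14, 16]
print(median(values))  # -> 12.5

# With the outliers 30 and 43 present, the mean is pulled well above the bulk of the data
skewed = [10, 11, 9, 14, 9, 10, 30, 43]
print(mean(skewed))    # the mean is 17, a misrepresentation of the typical value
print(median(skewed))  # -> 10.5
```

Note how the median of the skewed set stays among the "ordinary" values while the mean does not.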
256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "### Mode" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "The last form of central tendency is the *mode*. In medical statistics it is most often used for non-numeric values, such as disease names.\n", 270 | "\n", 271 | "It simply represents the value that appears most in a data set." 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "## Measures of dispersion" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "Now that we can represent our set of data points for a variable as a single value, the next step is to give some indication of the spread of data.\n", 286 | "\n", 287 | "In this sense, data values can bunch up close to the average, or values can be further distributed. In a simple example, both the sets of three values *9, 10, 11* and *2, 10, 18* have the same mean, *10*, but the data values are much further spread in the second set." 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "### Range" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "The range is the simplest way to describe the spread in the data and merely refers to the smallest and largest values in the data set. With these values stated, the range is the largest value minus the smallest value.\n", 302 | "\n", 303 | "It is quite useful when describing the age of your sample population."
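As a minimal sketch (the ages below are made up for illustration), the range follows directly from the minimum and maximum:

```python
# Hypothetical ages of a sample population
ages = [22, 35, 41, 29, 57, 63, 30]

smallest, largest = min(ages), max(ages)
print(smallest, largest)   # -> 22 63
print(largest - smallest)  # the range: 41
```

In a report you would typically state both the extremes and their difference, e.g. "ages ranged from 22 to 63 years (range 41)".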
304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "metadata": {}, 309 | "source": [ 310 | "### Variance and standard deviation" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "The size of the dispersion or spread is expressed as the *variance* or, by taking the square root of the variance, as the *standard deviation*.\n", 318 | "\n", 319 | "It should be clear that the standard deviation is merely the square root of the variance. It is simpler to explain the standard deviation. Imagine all the data values in a data set are represented by dots on a straight line, i.e. the familiar x-axis from graphs at school. A dot can also be placed on this line representing the mean value. Now the distance between each point and the mean is taken and then averaged, so as to get an average distance for how far all the points are from the mean.\n", 320 | "\n", 321 | "It is vitally important to have the following concepts well and truly understood. Some values will be below (lower than) the mean and some will be above (higher than) the mean. 
The standard deviation represents the average distance all the lower points are away from the mean and the average distance all the higher points are away, by a single value!\n", 322 | "\n", 323 | "Quick note: the equations for population and sample standard deviation are slightly different.\n", 324 | "\n", 325 | "Here is a quick look at what we have described up until now:" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 4, 331 | "metadata": { 332 | "collapsed": false 333 | }, 334 | "outputs": [ 335 | { 336 | "data": { 337 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAnAAAAHoCAYAAADJztIQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X10VNXZ/vFrkkAIYVCQgJaQEKRKhCoJig5SFgGlWLuK\n/SlRoo+QKCkvTwkoIgoURHnRKhBBkhgJaENVwIVQa9sHAgIV1DZKFZoQafMyiIIGG5kQnLzM7w+a\nqWMmgzDjJDt8P2uxFrPn7H3u3DkJFzNnzrG4XC6XAAAAYIyQli4AAAAA54YABwAAYBgCHAAAgGEI\ncAAAAIYhwAEAABiGAAcAAGCYgAW4DRs2aNSoUbrmmmt01113af/+/T63Lykp0fjx45WQkKCkpCTl\n5uY22cZut2vKlClKTEyUzWbTrFmzdOLEiUCVDAAAYKSABLjNmzdrwYIFGjNmjFauXCmr1ar77rtP\nR44c8bp9ZWWlUlNTFRoaqszMTCUnJ2vFihXKy8tzb1NVVaWUlBSdOHFCy5cv16OPPqr33ntP06dP\nD0TJAAAAxgrzdwGXy6WVK1fqzjvv1NSpUyVJQ4YM0ejRo7Vu3TrNnTu3yZz169eroaFBWVlZCg8P\n17Bhw+R0OpWTk6Px48crNDRUa9eulSTl5eWpY8eOkqROnTrp8ccfV2VlpS655BJ/SwcAADCS36/A\nlZeX6+jRoxoxYoR7LCwsTMOHD9eePXu8ztm7d69sNpvCw8PdYyNHjlRVVZU++ugjSdL27dv1s5/9\nzB3eJCkpKUk7duwgvAEAgAua3wGurKxMkhQbG+sxHh0dLbvdLm936iovL1dMTIzHWK9evdzrOZ1O\nlZaWqmfPnnriiSc0ePBgDRw4UA8++KC++uorf0sGAAAwmt8BzuFwSJIiIyM9xiMjI9XQ0KBTp055\nneNt+8bnvvrqK9XX1ys7O1uffPKJVqxYoXnz5mnv3r168MEH/S0ZAADAaAE5B06SLBaL1+dDQppm\nRJfL1ez2FotF9fX1kiSr1arnnnvOvUanTp2UkZGhDz/8UFdffbW/pQMAABjJ7wBntVolSdXV1era\ntat7vLq6WqGhoYqIiPA6p7q62mOs8bHVanWf92az2TwC4JAhQyRJH3/8cbMBrrCw0I+vBgAAILgG\nDRp0znP8DnCN577Z7Xb3eWyNj+Pi4pqdU1FR4TFmt9slSXFxcbJarbr44ovldDo9tqmtrZXU/Kt9\njc6nETh/RUVFio+Pb+kyLij0PPjoefDR8+Cj58F3vi88+X0OXO/evXXZZZdp27Zt7rHa2lq99dZb\nuuGGG7zOsdls2rdvn2pqatxj27dvV5cuXdwHzo033qhdu3bp9OnT7m127dolSUpISPC3bAAAAGP5\nHeAsFosmTpyoV155RcuXL
9euXbs0ZcoUVVVVacKECZKkiooKjzszpKSkqLa2Vunp6dq5c6eysrKU\nm5ur9PR0hYWdeVFwypQpcjgcmjhxonbv3q1XXnlFixcv1q233trsK3sAAAAXgoDciSElJUWzZs3S\n1q1blZGRIYfDoTVr1ig6OlqStHr1ao0bN869fVRUlNauXau6ujplZGRo48aNmjFjhlJTU93bXH75\n5crPz1doaKimTZumVatW6Y477tDSpUsDUTIAAICxLC5vF2ozWGFhIefABRnnTAQfPQ8+eh589Dz4\n6HnwnW9uCdjN7AEAABAcBDgAAADDEOAAAAAMQ4ADAAAwDAEOAADAMAQ4AAAAwxDgAAAADEOAAwAA\nMAwBDgAAwDAEOAAAAMMQ4AAAAAxDgAMAADAMAQ4AAMAwBDgAAADDEOAAAAAMQ4ADAAAwDAEOAADA\nMAQ4AAAAwxDgAAAADEOAAwAAMAwBDgAAwDAEOAAAAMMQ4AAAAAxDgAMAADAMAQ4AAMAwBDgAAADD\nEOAAAAAMQ4ADAAAwDAEOAADAMAQ4AAAAwxDgAAAADEOAAwAAMAwBDgAAwDAEOAAAAMMQ4AAAAAxD\ngAMAADAMAQ4AAMAwBDgAAADDEOAAAAAMQ4ADAAAwDAEOAADAMAQ4AAAAwxDgAAAADEOAAwAAMAwB\nDgAAwDAEOAAAAMMQ4AAAAAxDgAMAADAMAQ4AAMAwBDgAAADDEOAAAAAMQ4ADAAAwDAEOAADAMAQ4\nAAAAwxDgAAAADEOAAwAAMAwBDgAAwDAEOAAAAMMQ4AAAAAxDgAMAADAMAQ4AAMAwBDgAAADDBCzA\nbdiwQaNGjdI111yju+66S/v37/e5fUlJicaPH6+EhAQlJSUpNzfX5/aPPPKIRowYEahyAQAAjBWQ\nALd582YtWLBAY8aM0cqVK2W1WnXffffpyJEjXrevrKxUamqqQkNDlZmZqeTkZK1YsUJ5eXlet//L\nX/6izZs3y2KxBKJcAAAAo4X5u4DL5dLKlSt15513aurUqZKkIUOGaPTo0Vq3bp3mzp3bZM769evV\n0NCgrKwshYeHa9iwYXI6ncrJydG9996rsLD/llVdXa1f//rX6tGjh7+lAgAAtAl+vwJXXl6uo0eP\nery9GRYWpuHDh2vPnj1e5+zdu1c2m03h4eHusZEjR6qqqkoHDhzw2PaZZ55RTEyMfvKTn8jlcvlb\nLgAAgPH8DnBlZWWSpNjYWI/x6Oho2e12r6GrvLxcMTExHmO9evXyWE+S/va3v2nz5s16/PHHCW8A\nAAD/4XeAczgckqTIyEiP8cjISDU0NOjUqVNe53jb/pvrff3115ozZ46mTp3qDncAAAAI0Dlwkpr9\ngEFISNOM6HK5mt2+cXzlypWKjIxUWlraOddUVFR0znNw/k6fPk3Pg4yeBx89Dz56Hnz03Bx+Bzir\n1SrpzIcNunbt6h6vrq5WaGioIiIivM6prq72GGt8bLVadeDAAb300kvKz89XQ0ODGhoa3EGxvr5e\noaGhPmuKj4/362vCuSkqKqLnQUbPg4+eBx89Dz56HnyFhYXnNc/vANd47pvdbvd4q9NutysuLq7Z\nORUVFR5jdrtdkhQXF6edO3fK6XQqOTm5ydz+/ftr6dKluu222/wtHQAAwEh+B7jevXvrsssu07Zt\n2zRkyBBJUm1trd566y0lJSV5nWOz2fTqq6+qpqbG/Qrd9u3b1aVLF8XHx6tHjx5NLtqbl5en9957\nT9nZ2erZs6e/ZQMAABjL7wBnsVg0ceJEPf744+rcubMSExOVn5+vqqoqTZgwQZJUUVGhEydOaODA\ngZKklJQU5efnKz09XWlpaSouLlZubq5mzpypsLAwde/eXd27d/fYT9euXdWuXTv179/f35IBAACM\nFpA7MaSkpGjWrFnaunWrMjIy5HA4tGbNGkVHR0uSVq9erXHjxrm3j4qK0tq1a1VXV6eMjAx
t3LhR\nM2bMUGpqarP7sFgs3IkBAABAksXVxi6wVlhYqEGDBrV0GRcUTnoNPnoefPQ8+Oh58NHz4Dvf3BKw\nm9kDAAAgOAhwAAAAhiHAAQAAGIYABwAAYBgCHAAAgGEIcAAAAIYhwAEAABiGAAcAAGAYAhwAAIBh\nCHAAAACGIcABAAAYhgAHAABgGAIcAACAYQhwAAAAhiHAAQAAGIYABwAAYBgCHAAAgGEIcAAAAIYh\nwAEAABiGAAcAAGAYAhwAAIBhCHAAAACGIcABAAAYhgAHAABgGAIcAACAYQhwAAAAhiHAAQAAGIYA\nBwAAYBgCHAAAgGEIcAAAAIYhwAEAABiGAAcAAGAYAhwAAIBhCHAAAACGIcABAAAYhgAHAABgGAIc\nAACAYQhwAAAAhiHAAQAAGIYABwAAYBgCHAAAgGEIcAAAAIYhwAEAABiGAAcAAGAYAhwAAIBhCHAA\nAACGIcABAAAYhgAHAABgGAIcAACAYQhwAAAAhiHAAQAAGIYABwAAYBgCHAAAgGEIcAAAAIYhwAEA\nABiGAAcAAGAYAhwAAIBhCHAAAACGIcABAAAYhgAHAABgGAIcAACAYQIW4DZs2KBRo0bpmmuu0V13\n3aX9+/f73L6kpETjx49XQkKCkpKSlJub22SbnTt3auzYsUpMTNSIESP0xBNPqLq6OlAlAwAAGCkg\nAW7z5s1asGCBxowZo5UrV8pqteq+++7TkSNHvG5fWVmp1NRUhYaGKjMzU8nJyVqxYoXy8vLc2+zb\nt0+TJ0/WFVdcoVWrVmny5Ml688039cADDwSiZAAAAGOF+buAy+XSypUrdeedd2rq1KmSpCFDhmj0\n6NFat26d5s6d22TO+vXr1dDQoKysLIWHh2vYsGFyOp3KycnR+PHjFRoaqrVr1+raa6/VokWL3POs\nVqumT5+uf/7zn7r88sv9LR0AAMBIfr8CV15erqNHj2rEiBHusbCwMA0fPlx79uzxOmfv3r2y2WwK\nDw93j40cOVJVVVX66KOPJEkDBw5USkqKx7zevXtLUrOv7AEAAFwI/A5wZWVlkqTY2FiP8ejoaNnt\ndrlcriZzysvLFRMT4zHWq1cvj/WmTJmin/70px7b7Ny5U5LUp08ff8sGAAAwlt8BzuFwSJIiIyM9\nxiMjI9XQ0KBTp055neNt+2+u923FxcV6/vnnNWrUKHfYAwAAuBD5HeAaX2GzWCzedxDSdBcul6vZ\n7b2NFxcXKy0tTZdeeqkef/xxP6oFAAAwn98fYrBarZKk6upqde3a1T1eXV2t0NBQRUREeJ3z7cuB\nND5uXK/Ru+++q6lTpyoqKkrr1q3TRRdddNaaioqKzvnrwPk7ffo0PQ8yeh589Dz46Hnw0XNz+B3g\nGs99s9vtHm9t2u12xcXFNTunoqLCY8xut0uSx5yCggJNnz5dP/zhD/XCCy94BERf4uPjz+lrgH+K\nioroeZDR8+Cj58FHz4OPngdfYWHhec3z+y3U3r1767LLLtO2bdvcY7W1tXrrrbd0ww03eJ1js9m0\nb98+1dTUuMe2b9+uLl26uA+cDz/8UNOnT9c111yj3/72t985vAEAALR1fr8CZ7FYNHHiRD3++OPq\n3LmzEhMTlZ+fr6qqKk2YMEGSVFFRoRMnTmjgwIGSpJSUFOXn5ys9PV1paWkqLi5Wbm6uZs6cqbCw\nMyXNnTtX7dq1U3p6uj7++GOPfcbFxX2nt1IBAADaIr8DnHQmkH399dd66aWX9OKLLyo+Pl5r1qxR\ndHS0JGn16tXasmWL+331qKgorV27VosWLVJGRoa6deumGTNmKDU1VdKZ67yVlJTIYrEoPT3dY18W\ni0WZmZkaNWpUIEoHAAAwjsXl7UJtBissLNSgQYNauowLCudMBB89Dz56Hnz0PPjoefCdb24J2M3s\nAQAAEBwEOAAAAMMQ4AAAAAxDgAMAADAMAQ4AAMAwBDg
AAADDEOAAAAAMQ4ADAAAwDAEOAADAMAQ4\nAAAAwxDgAAAADEOAAwAAMAwBDgAAwDAEOAAAAMMQ4AAAAAxDgAMAADAMAQ4AAMAwBDgAAADDEOAA\nAAAMQ4ADAAAwDAEOAADAMAQ4AAAAwxDgAAAADEOAAwAAMAwBDgAAwDAEOAAAAMMQ4AAAAAxDgAMA\nADAMAQ4AAMAwBDgAAADDEOAAAAAMQ4ADAAAwDAEOAADAMAQ4AAAAwxDgAAAADEOAAwAAMAwBDgAA\nwDAEOAAAAMMQ4AAAAAxDgAMAADAMAQ4AAMAwBDgAAADDEOAAAAAMQ4ADAAAwDAEOAADAMAQ4AAAA\nwxDgAAAADEOAAwAAMAwBDgAAwDAEOAAAAMMQ4AAAAAxDgAMAADAMAQ4AAMAwBDgAAADDEOAAAAAM\nQ4AD0DZt2SJt3drSVQDA9yKspQsAgICrrpamTTvz95EjpcjIlq0HAAKMV+AAtD2PPSZVVJz5s3Bh\nS1cDAAFHgAPQthw4IC1f/t/Hy5adGQOANoS3UOH2ry//pdzCXG38x0YdPXlUkvQD6w809qqxmjho\novp06WP8Wq0NfQowl0uaNEmqq/vvWF2dNHmytHu3ZLG0XG0BEOzvcaD2x3GOltKWjxeLy+VytXQR\ngVRYWKhBgwa1dBlGqa2v1bQ/TlNOYY5c8n44WGTRLwf9Us/e8qzahbbzeK6oqEjx8fEBWSuQdbVm\nrbXnxluzRrr//uafS0s776W/2fNgC/b3OFD7ay3Hub+5vW39K+lbSx7ngWTS78XzzS0BewVuw4YN\neuGFF3Ts2DHFx8dr9uzZGjhwYLPbl5SUaNGiRfrwww918cUXKyUlRRMnTvTY5m9/+5uefPJJffzx\nx+rRo4fS09N1++23B6pk/Me418bptaLXfG7jkkvZhdn6/NTn2pS8KShrtQ9rJynrP3+aW0vK/s8f\n95gBv2xba8+NVlkpPfxw88/PmiWNGSNdcknwagqQYH+PA7U/jnO0lAvheAnIOXCbN2/WggULNGbM\nGK1cuVJWq1X33Xefjhw54nX7yspKpaamKjQ0VJmZmUpOTtaKFSuUl5fn3uaf//yn7r//fsXExGjV\nqlUaPny45syZoz//+c+BKBk68z+UOzbccdaD/JteK3pNd2y4Q7X1td/7Wm1Ra+15m/DQQ2dCXHMq\nK8+EOIME+3scqP1xnKOlXEjHi98BzuVyaeXKlbrzzjs1depUDRs2TFlZWerSpYvWrVvndc769evV\n0NCgrKwsDRs2TJMnT1Z6erpycnJUX18vSXr++efVq1cvPfPMMxo6dKgeeeQR/fznP9dzzz3nb8n4\nj2l/nHZOB3mj14peU8afMlrdWiZoDX3ytpbx/vIXqZnfNx7WrpXefvt7LydQgv09DtT+OM7RUi6k\n48XvAFdeXq6jR49qxIgR7rGwsDANHz5ce/bs8Tpn7969stlsCg8Pd4+NHDlSVVVV+uijj9zbDB8+\n3GPeyJEjVVJSos8//9zfsi94//ryX8opzDnv+dl/y1bpl6WSJLvDHrC1/K1Lknut1qa19tx4tbVn\nPqTwXd479/Yhh1YqkMdLMPfHcY6WEuyfmZbmd4ArKyuTJMXGxnqMR0dHy263y9tnJMrLyxUTE+Mx\n1qtXL/d6p06d0ueff+5zG/gntzC32RM7vwuXXMp9P1eStPFfGwO2lr91SXKv1dq01p4bb/nyc7tM\nyLcvM9JKBfJ4Ceb+OM7RUoL9M9PS/A5wDodDkhT5rSudR0ZGqqGhQadOnfI6x9v2jc/5WvOb+8T5\n2/iPjX6vseHgBknSn4/4f15i41qBrKu1aa09N1p5+ZmL9p6rxgv9tmLB/lkI1P44ztFS2vK/H94E\n5Bw4SbI08zntkJCmu3C5XM1ub7FYzmtNnJvG6+EEYo3Pa/x/S7txrUDW1dq01p4bbdo0yct/Es+q\nulr61a8CX08ABft
nIVD74zhHS2nL/3544/dlRKxWqySpurpaXbt2dY9XV1crNDRUERERXudUV1d7\njDU+tlqt6tSpk8fYt7dpfL45RUVF5/hVXHgCcfk/l8sVsF43rtXa6gqk1va1tdY+nYvokydlPc+5\nJ0+e1JFz+PpPnz4d1H4F+3gJ1P4CIdDHub9M/zk5F8E+zgOptf2O/b75HeAaz32z2+3uc9QaH8fF\nxTU7p+Jbb1/Y7XZJUlxcnCIjIxUVFeUe87aNL23hIoTft56de+qfX/7T7zXi4+MVFRElu8N+9gnf\nYa1A1tXatNaeG23dOumqq868onYuIiNlXbdO8d86z9aXYF/gNNg/C4Han6RWd5z7V82F9W+KyRfy\nNfXfj8LCwvOa5/d7kb1799Zll12mbdu2ucdqa2v11ltv6YYbbvA6x2azad++faqpqXGPbd++XV26\ndHE3zmazaceOHWpoaPDY5oorrvB4pQ/nZ+xVY/1eI7l/siTpJ9E/CdhagayrtWmtPTdaTIz061+f\n+7z588/MbcWC/bMQqP1xnKOltOV/P7zxO8BZLBZNnDhRr7zyipYvX65du3ZpypQpqqqq0oQJEyRJ\nFRUV2r9/v3tOSkqKamtrlZ6erp07dyorK0u5ublKT09XWNiZFwXT0tJUWlqqjIwM7dq1S0uWLNHv\nf/97/e///q+/JUPSxEETZdH531/GIosmJp65c8bYPmMDtpa/dUlyr9XatNaeG2/GDGnAgO++/YAB\nZ+a0coE8XoK5P45ztJRg/8y0tIB8GiAlJUWzZs3S1q1blZGRIYfDoTVr1ig6OlqStHr1ao0bN869\nfVRUlNauXau6ujplZGRo48aNmjFjhlJTU93b9OvXT9nZ2bLb7frVr36lXbt2aenSpRo1alQgSr7g\n9enSR78c9Mvznj/p2kmK63LmrexenXoFbC1/65LkXqu1aa09N167dlJW1ne74aXFImVnS2EBu4vg\n9yaQx0sw98dxjpYS7J+ZlsbN7C9gtfW13+l+cd92x1V36Hf/73fum/8WFRWp7xV9A7KWR113nt+9\n6VrzEd1ae94mpKWdudPC2bZZs+a8lm+Jc4MCdbwEe3+t7TjnZvbfncnnwEnB/5kJhPPNLVyP4wLW\nLrSdNiVv0u3xt3/nObfH366NYzc2Oci/j7Xaotba8zbhN7/xfaP6Sy6RnnoqePUEQLC/x4HaH8c5\nWsqFdLwQ4KCXb39Zk6+d7PPcAYssmnztZL18+8tBW8tZV6vJb0yRZUGItMDi9Y9lQYgmvzFFzrpa\nuVzm/E+5tfbcaJdcIj35ZPPPP/WU74DXigX7exyo/XGco6VcCMcLb6HCrfTLUuW+n6sNBze4L2b4\nA+sPlNw/WRMTJzZ7boC3l9zPd61A1mWC1tpzY7lc0o9/3PSG9UOHSrt3f7fz5JrRGt5aCvb3OFD7\n4zg3R2s4zgPJhOPlfHMLAQ5+a2s/8Cag5z4cOCAlJPz3hvXt2knvv39un1T1gp4HHz0PPnoefJwD\nBwBS08uEnOtlRgDAAAQ4AG1P44V6Y2PP/B0A2pjWfzEkADhXkZHSs8+eOeetY8eWrgYAAo4AB6Bt\nGjOmpSsAgO8Nb6ECAAAYhgAHAABgGAIcAACAYQhwAAAAhiHAAQAAGIYABwAAYBgCHAAAgGEIcAAA\nAIYhwAEAABiGAAcAAGAYAhwAAIBhCHAAAACGIcABAAAYhgAHAABgGAIcAACAYQhwAAAAhiHAAQAA\nGIYABwAAYBgCHAAAgGEIcAAAAIYhwAEAABiGAAcAAGAYAhwAAIBhCHAAAACGIcABAAAYhgAHAABg\nGAIcAACAYQhwAAAAhiHAAQAAGIYABwAAYBgCHAAAgGEIcAAAAIYhwAEAABiGAAcAAGAYAhwAAIBh\nCHAAAACGIcABAAAYhgAHAABgGAIcAACAYQhwAAAAhiHAAQAAGIYABwAAYBgCHAAAg
GEIcAAAAIYh\nwAEAABiGAAcAAGAYAhwAAIBhCHAAAACGIcABAAAYhgAHAABgGAIcAACAYQhwAAAAhiHAAQAAGIYA\nBwAAYJiABLiSkhKNHz9eCQkJSkpKUm5u7lnnOJ1OLV68WEOHDlViYqKmTZum48ePe2zz6aef6sEH\nH9SPf/xjDR48WKmpqfrHP/4RiJIBAACM5XeAq6ysVGpqqkJDQ5WZmank5GStWLFCeXl5PufNnz9f\nW7Zs0cyZM7VkyRIdOnRI6enpamhokCSdPn1aaWlpOnTokObMmaPf/OY3slgsuvvuu3XkyBF/ywYA\nADBWmL8LrF+/Xg0NDcrKylJ4eLiGDRsmp9OpnJwc3XvvvQoLa7qLiooKbdmyRc8884xuueUWSVK/\nfv00evRoFRQU6Oabb9bOnTtVWlqqbdu2qVevXpKk66+/XklJSXr55Zf10EMP+Vs6AACAkfx+BW7v\n3r2y2WwKDw93j40cOVJVVVU6cOCA1znvvPOOJCkpKck9Fhsbq759+2rPnj2SpIsuukjjx493hzdJ\n6tChgy699FJ98skn/pYNAABgLL8DXHl5uWJiYjzGGkNXWVmZ1zmlpaWKiopShw4dmswrLS2VJA0Z\nMkSPPPKIx/N2u10ff/yx+vTp42/ZAAAAxvL5FmpdXZ3Ky8ubfb5bt25yOByKjIz0GG987HA4vM6r\nrq5Wx44dm4x37NhRn332mdc5TqdTc+bMUYcOHXTXXXf5KhsAAKBN8xngPvvsM916661en7NYLJo9\ne7ZcLpcsFkuz23jja05ISNMXBZ1Op6ZPn673339fmZmZ6t69u6+yVVRU5PN5BNbp06fpeZDR8+Cj\n58FHz4OPnpvDZ4CLjo5WcXGxzwWys7NVXV3tMdb42Gq1ep3TqVOnJnMa5317zsmTJzVlyhR98MEH\nWrp0qUaOHOmzHkmKj48/6zYInKKiInoeZPQ8+Oh58NHz4KPnwVdYWHhe8/w+By42NlYVFRUeY3a7\nXZIUFxfndU7v3r31xRdfyOl0eowfOXLEY86JEyd0991366OPPtLKlSv1s5/9zN9yAQAAjOd3gLPZ\nbNq3b59qamrcY9u3b1eXLl2aTfE2m0319fUqKChwj5WVlenw4cOy2WySpNraWv3yl7/UJ598ohde\neMHjE6sAAAAXMr+vA5eSkqL8/Hylp6crLS1NxcXFys3N1cyZM93XgHM4HDp8+LBiYmLUtWtXxcTE\naPTo0Zo3b54cDoesVquWLVumfv366aabbpJ05vpyH330kSZOnKiwsDDt37/fvc+LLrqo2Vf3AAAA\n2jq/A1xUVJTWrl2rRYsWKSMjQ926ddOMGTOUmprq3ubgwYMaP368li5dqttuu02StGTJEi1ZskRP\nP/20GhoaNGTIEM2dO9f94YaCggJZLBbl5uY2uTXX8OHDlZ2d7W/pAAAARvI7wEnSgAED9PLLLzf7\n/PXXX9/kwxARERFauHChFi5c6HXOb3/720CUBgAA0OYE5Gb2AAAACB4CHAAAgGEIcAAAAIYhwAEA\nABiGAAcAAGAYAhwAAIBhCHAAAACGIcABAAAYhgAHAABgGAIcAACAYQhwAAAAhiHAAQAAGIYABwAA\nYBgCHAAAgGEIcAAAAIYhwAEAABiGAAcAAGAYAhwAAIBhCHAAAACGIcABAAAYhgAHAABgGAIcAACA\nYQhwAAAAhiHAAQAAGIYABwAAYBgCHAAAgGEIcAAAAIYhwAEAABiGAAcAAGAYAhwAAIBhCHAAAACG\nIcABAAAYhgAHAABgGAIcAACAYQhwAAAAhiHAAQAAGIYABwAAYBgCHAAAgGEIcAAAAIYhwAEAABiG\nAAcAAGDJX69kAAASNklEQVQYAhwAAIBhCHAAAACGIcABAAAYhgAHAABgGAIcAACAYQhwAAAAhiHA\nAQAAGIYABwAAYBgCHAAAgGEIcAAAAIYhwAEAA
BiGAAcAAGAYAhwAAIBhCHAAAACGIcABAAAYhgAH\nAABgGAIcAACAYQhwAAAAhiHAAQAAGIYABwAAYJiABLiSkhKNHz9eCQkJSkpKUm5u7lnnOJ1OLV68\nWEOHDlViYqKmTZum48ePN7v9u+++q379+umvf/1rIEoGAAAwlt8BrrKyUqmpqQoNDVVmZqaSk5O1\nYsUK5eXl+Zw3f/58bdmyRTNnztSSJUt06NAhpaenq6Ghocm2p0+f1ty5c2WxWPwtFwAAwHhh/i6w\nfv16NTQ0KCsrS+Hh4Ro2bJicTqdycnJ07733Kiys6S4qKiq0ZcsWPfPMM7rlllskSf369dPo0aNV\nUFCgm2++2WP7FStWyOl0yuVy+VsuAACA8fx+BW7v3r2y2WwKDw93j40cOVJVVVU6cOCA1znvvPOO\nJCkpKck9Fhsbq759+2rPnj0e2/7973/Xq6++qocfftjfUgEAANoEvwNceXm5YmJiPMZ69eolSSor\nK/M6p7S0VFFRUerQoUOTeaWlpe7HTqdTc+bM0aRJk9SnTx9/SwUAAGgTfL6FWldXp/Ly8maf79at\nmxwOhyIjIz3GGx87HA6v86qrq9WxY8cm4x07dtRnn33mfpyVlaWwsDDdf//9Kikp8VUqAADABcNn\ngPvss8906623en3OYrFo9uzZcrlczX64oLlxX3NCQs68KFhcXKy8vDzl5+crNDTUV5kAAAAXFJ8B\nLjo6WsXFxT4XyM7OVnV1tcdY42Or1ep1TqdOnZrMaZxntVrV0NCgOXPmaOzYsbrqqqtUV1en+vp6\nSXL/3VeoKyoq8lkzAuv06dP0PMjoefDR8+Cj58FHz83h96dQY2NjVVFR4TFmt9slSXFxcV7n9O7d\nW1988YWcTqfat2/vHj9y5Iiuu+46ffrppzp48KAOHjyo/Px8j7mpqakaPHiwXnrppWZrio+PP98v\nB+ehqKiIngcZPQ8+eh589Dz46HnwFRYWntc8vwOczWbTq6++qpqaGkVEREiStm/fri5dujR7ENhs\nNtXX16ugoMB9GZGysjIdPnxY06ZNU/fu3bVp0yaPt1lLS0s1c+ZMLVy4UIMHD/a3bAAAAGP5HeBS\nUlKUn5+v9PR0paWlqbi4WLm5uZo5c6b7GnAOh0OHDx9WTEyMunbtqpiYGI0ePVrz5s2Tw+GQ1WrV\nsmXL1K9fP910002yWCwaMGCAx34az42Li4tT7969/S0bAADAWH5fRiQqKkpr165VXV2dMjIytHHj\nRs2YMUOpqanubQ4ePKi77rpLu3fvdo8tWbJEP/3pT/X0009r3rx5io+P1/PPP+/zbgvciQEAACAA\nr8BJ0oABA/Tyyy83+/z111/f5MMQERERWrhwoRYuXPid9hEfH8+JlQAAAArQzewBAAAQPAQ4AAAA\nwxDgAAAADEOAAwAAMAwBDgAAwDAEOAAAAMMQ4AAAAAxDgAMAADAMAQ4AAMAwBDgAAADDEOAAAAAM\nQ4ADAAAwDAEOAADAMAQ4AAAAwxDgAAAADEOAAwAAMAwBDgAAwDAEOAAAAMMQ4AAAAAxDgAMAADAM\nAQ4AAMAwBDgAAADDEOAAAAAMQ4ADAAAwDAEOAADAMAQ4AAAAwxDgAAAADEOAAwAAMAwBDgAAwDAE\nOAAAAMMQ4AAAAAxDgAMAADAMAQ4AAMAwBDgAAADDEOAAAAAMQ4ADAAAwDAEOAADAMAQ4AAAAwxDg\nAAAADEOAAwAAMAwBDgAAwDAEOAAAAMMQ4AAAAAxDgAMAADAMAQ4AAMAwBDgAAADDEOAAAAAMQ4AD\nAAAwDAEOAADAMAQ4AAAAwxDgAAAADEOAAwAAMAwBDgAAwDAEOAAAAMMQ4AAAAAxDgAMAADAMAQ4A\nAMAwBDgAAADDEOAAAAAMQ4ADAAAwDAEOAADAMAQ4AAAAwwQkwJWUlGj8+PFKSEhQUlKScnNzzzrH\n6XRq8eLFG
jp0qBITEzVt2jQdP37cY5u6ujo9++yzGj58uAYOHKixY8dq3759gSgZAADAWH4HuMrK\nSqWmpio0NFSZmZlKTk7WihUrlJeX53Pe/PnztWXLFs2cOVNLlizRoUOHlJ6eroaGBvc2TzzxhNat\nW6dJkyZp9erV6tGjhyZNmqR//etf/pYNAABgrDB/F1i/fr0aGhqUlZWl8PBwDRs2TE6nUzk5Obr3\n3nsVFtZ0FxUVFdqyZYueeeYZ3XLLLZKkfv36afTo0SooKNDNN9+ssrIyvfrqq8rMzNSoUaMkSddd\nd53GjBmjd955R3369PG3dAAAACP5/Qrc3r17ZbPZFB4e7h4bOXKkqqqqdODAAa9z3nnnHUlSUlKS\neyw2NlZ9+/bVnj17JEkFBQW6+OKL3eFNktq1a6c333xTKSkp/pYNAABgLL8DXHl5uWJiYjzGevXq\nJUkqKyvzOqe0tFRRUVHq0KGDx3h0dLR7zqFDhxQXF6c//elPuuWWW9S/f3/ddttt+utf/+pvyQAA\nAEbz+RZqXV2dysvLm32+W7ducjgcioyM9BhvfOxwOLzOq66uVseOHZuMR0ZG6tixY5KkEydOqLy8\nXEuWLNEDDzygSy65RHl5eZo4caL+8Ic/qGfPnr6/MgAAgDbKZ4D77LPPdOutt3p9zmKxaPbs2XK5\nXLJYLM1u442vOSEhZ14UrKurU2VlpfLz83XttddKkgYNGqSbb75ZL7zwgubPn99s3YWFhc0+h+8H\nPQ8+eh589Dz46Hnw0XMz+Axw0dHRKi4u9rlAdna2qqurPcYaH1utVq9zOnXq1GRO47zGOR07dlRE\nRIQ7vElSRESEBg4cqJKSkmbrGTRokM96AQAATOf3OXCxsbGqqKjwGLPb7ZKkuLg4r3N69+6tL774\nQk6n02P8yJEj7jmxsbGqr6/3uKyIJNXW1rpfpQMAALgQ+Z2EbDab9u3bp5qaGvfY9u3b1aVLF8XH\nxzc7p76+XgUFBe6xsrIyHT58WDabTZI0dOhQOZ1O7dixw73NV199pQ8++EAJCQn+lg0AAGAsv68D\nl5KSovz8fKWnpystLU3FxcXKzc3VzJkz3deAczgcOnz4sGJiYtS1a1fFxMRo9OjRmjdvnhwOh6xW\nq5YtW6Z+/frppptukiTdeOONstlsmjNnjr788kt1795dOTk5slgsuvfee/0tGwAAwFgWl8vl8neR\nAwcOaNGiRTp48KC6deumlJQU3X///e7n3333XY0fP15Lly7VbbfdJkmqqanRkiVL9Oc//1kNDQ0a\nMmSI5s6dq6ioKPe8U6dOadmyZfrjH/+oU6dOKSEhQY8++qj69u3rb8kAAADGCkiAAwAAQPC0iU8D\nVFdXa+HChRoyZIgSExN13333nfXTszg/BQUFSkxMbDKelZWl4cOHa+DAgUpLS+N+td+D5novSYsX\nL9akSZOCXFHb5a3Xp0+f1vLly3XzzTcrISFBv/jFL/Tmm2+2UIVtj7eenzx5UgsWLNDQoUOVmJio\nKVOmuD8kB//5+p0inbkeq81m06pVq4JYVdvmrecHDhxQv379mvx56qmnfK7l9zlwrcG0adP0/vvv\na9q0abryyiu1detW3X333dq0aVOzn4TFuXv//ff10EMPNRlftWqVcnNz9dBDD+kHP/iBsrKyNGHC\nBL355pvq1KlTC1Ta9jTXe0nKz8/XSy+9pOHDhwe3qDaquV4vWLBABQUFmj59uvr06aOCggI98MAD\nslgs7ns64/w01/MHH3xQRUVFmjVrli666CJlZWXpf/7nf/TGG2/wu8VPvn6nNFq0aJG+/PLLIFXU\n9jXX8+LiYkVEROjFF1/0GO/evbvP9YwPcAcOHNDbb7+thQsXKjk5WZI0ZMgQlZWVKTMzUytWrGjh\nCs3ndDr14osv6tlnn1XHjh1VW1vrfs7hcGjNmjX61a9+pXvuuUeSdO211yo
pKUmbNm3ShAkTWqjq\ntsFX7ysrK/Wb3/xGW7dubfaai/juztbr119/XYsWLdLtt98u6cyn6e12u/Ly8ghw58lXzw8fPqzd\nu3dr1apV7g+3/fCHP9SIESO0Y8cO/fznP2+pso3mq+fftGPHDr399tse9znH+Tlbzw8dOqQrr7xS\nV1999Tmta/xbqI33Th06dKjHeEJCgv7yl7+0QEVtz+7du5Wbm6uHH35Y99xzj7552uTf//531dTU\naMSIEe6xzp0767rrrtOePXtaotw2xVfvs7Oz9cEHH2jNmjXq169fC1bZNvjq9alTpzRu3Lgmv2d6\n9+6tI0eOBLvUNsNXz2NiYrRhwwYNGzbMPdZ4ZYPmQgfOzlfPG508eVKPPfaYZs+erfbt27dAlW3L\n2Xp+6NAhXXHFFee8rvEB7tJLL5UkHT161GP8k08+kcPh0FdffdUSZbUpP/rRj7Rjxw73K2zf1Big\nY2JiPMajo6NVWloajPLaNF+9T0lJ0R//+Ef3tRPhH1+97tWrl+bPn68ePXq4x+rr67V7925dfvnl\nwSyzTfHV8/bt2+vqq69W+/btVV9fr8OHD+vRRx9Vt27d3K/I4dz56nmjJ598Un379nVfNQL+OVvP\nS0pK9Omnn+q2227TgAEDNGrUKL3++utnXdf4t1Cvvvpq9e7dW4899piWLFmimJgYvfnmm9q9e7cs\nFotqamrUuXPnli7TaN/8R+vbHA6H2rdv7/6fcaPIyEivt0vDufHVe87vDCxfvfbm2WefVWlpqR5+\n+OHvqaK277v2fO7cudq8ebNCQkK0ePFiXXTRRd9zZW3X2Xq+b98+/eEPf9Abb7wRpIraPl89P3bs\nmP7973+roqJCDzzwgDp37qw33nhDs2fPliSfIdr4ANe+fXutWrVKDz74oO644w5JZ94+vf/++7Vq\n1Sp16NChhSts21wulywWi9fnmhsHTPf8888rJydHaWlpfHgkCMaNG6df/OIX2rZtm2bPnq3a2lqN\nHTu2pctqc2pqajRv3jxlZGSoZ8+eLV3OBeHiiy9WXl6errjiCnXr1k3SmfNrjx8/rueee65tBzhJ\n6tu3r7Zs2aJjx46prq5OPXv21KpVqxQSEsLJ3d8zq9Uqp9Op+vp6hYaGuserq6t55RNtjsvl0tKl\nS/Xiiy/q7rvv1qxZs1q6pAtC48ndgwcP1rFjx5STk0OA+x4sX75cnTt3VkpKiurq6tzjDQ0Nqqur\na/JOC/wXHh6uIUOGNBkfOnSo9uzZo5qaGkVERHida/w5cKdPn9brr7+u48ePq0ePHu7/NTSeFMiN\n779fsbGxcrlcTU7kPnLkCG/xoU1paGjQrFmz9OKLL2rSpEmaN29eS5fUptntdm3atKnJeL9+/XT8\n+PEWqKjt2759u/7xj3/o6quv1oABAzRgwACdPHlSq1ev1o9+9KOWLq9NKi0t1e9+9zs5nU6P8a+/\n/lodOnRoNrxJbSDAhYaG6rHHHvO4oKbdbteuXbt4ayMIEhISFB4erm3btrnHqqqq9N5773FyPdqU\npUuX6ve//71mz56t6dOnt3Q5bV5paanmzp2rd9991z3mcrm0d+9eXXnllS1YWduVnZ2t1157zf1n\n06ZN6tixo5KTk72Gafjv2LFjWrhwoXbv3u0ec7lc+r//+z9de+21Puca/3pou3btdMcddygrK0td\nu3ZVZGSknn76aXXr1k2pqaktXV6bFxkZqXvuuUeZmZkKCQlRbGyssrOz1blzZ/c5iYDpDh48qJde\nekk33nijEhIStH//fvdzISEh53z9JpzdjTfeqIEDB+qRRx7R9OnTdfHFF2vTpk3av3+/cnNzW7q8\nNsnbpSxCQkLUvXt39e/fvwUqavsGDx6sQYMGaf78+aqqqlK3bt20YcMGffzxx3r55Zd9zjU+wEnS\nzJkzZbFY9NRTT8npdOqGG25wX7kbgWW
xWJp8OOGBBx5QSEiI8vLyVF1drcTERD311FNcKT3AvPUe\n349v93rnzp2SpL179+rtt9/22LZjx456//33g1pfW/TtnoeGhio7O1vLli3T008/raqqKg0YMEB5\neXkaPHhwC1badnyX3yn8zgmsb/c8JCREq1ev1rJly/Tss8/q3//+t/r376+8vDxdddVVvtfiZvYA\nAABmMf4cOAAAgAsNAQ4AAMAwBDgAAADDEOAAAAAMQ4ADAAAwDAEOAADAMAQ4AAAAwxDgAAAADPP/\nAaapAaCtwT5XAAAAAElFTkSuQmCC\n", 338 | "text/plain": [ 339 | "" 340 | ] 341 | }, 342 | "metadata": {}, 343 | "output_type": "display_data" 344 | } 345 | ], 346 | "source": [ 347 | "WCC = [10.1, 12.4, 13.1, 14.6, 9.9, 10.3, 11.1, 12.9, 10.9, 12.7]\n", 348 | "y = 10 * [0]\n", 349 | "\n", 350 | "meanWCC = np.mean(WCC)\n", 351 | "stdWCC = np.std(WCC)\n", 352 | "\n", 353 | "plt.figure()\n", 354 | "plt.plot(WCC, y, 'go', markersize = 18)\n", 355 | "plt.plot(meanWCC, 0, 'rd', markersize = 18)\n", 356 | "plt.plot(meanWCC - stdWCC, 0, 'bs', markersize = 14)\n", 357 | "plt.plot(meanWCC + stdWCC, 0, 'bs', markersize = 16)\n", 358 | "#plt.plot(np.median(WCC), 0, 'bd')\n", 359 | "plt.show();" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "### Quartiles and percentiles" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "Another form of grouping relates to *quartiles* and *percentiles*. As the name suggest, *quartiles* divide the group of values into four equal quarters and it is possible to calculate what values in the group represents these cut-offs. These watershed values are named zero to four. The zeroth value is the same as the minimum value and the fourth quartile value is the same as the maximum value. The first quartile represent the upper edge of the bottom quarter of values. It simply builds from here. The second quartile value represents to upper edge of the second from bottom quarter of values. Note, though, that two quarters make a half, and as would be suspected, the second quartile value would be the top edge of the bottom half of all the values, in other words, the *median*. 
The third quartile value would then represent the upper mark of the next quarter set of values, or equivalently the median of only the top half of all the values.\n", 374 | "\n", 375 | "Analogous to the quartile functions are the *percentile* functions. They return a value from the group of values based on an argument given to the function. This argument can range from zero to one, representing 0% up to 100%. It divides the group of values into what are in effect 100 equal groups. Note that this encompasses the quartile values; for instance, the first quartile corresponds to an argument of 0.25, that is, the 25th percentile." 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "## Calculating some measures of central tendency and dispersion" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": 5, 388 | "metadata": { 389 | "collapsed": false 390 | }, 391 | "outputs": [ 392 | { 393 | "data": { 394 | "text/html": [ 395 | "
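As a concrete sketch of the quartile arithmetic described above, using the WCC values plotted earlier in this chapter. Note that NumPy's *percentile* function takes its argument on a 0 to 100 scale, while the pandas *quantile* method takes the zero-to-one form described here:

```python
import numpy as np

wcc = [10.1, 12.4, 13.1, 14.6, 9.9, 10.3, 11.1, 12.9, 10.9, 12.7]

# The five quartile cut-offs: the 0th, 25th, 50th, 75th and 100th percentiles
q0, q1, q2, q3, q4 = np.percentile(wcc, [0, 25, 50, 75, 100])

# The zeroth and fourth quartile values are simply the minimum and maximum,
# and the second quartile value is the median
assert q0 == min(wcc)
assert q4 == max(wcc)
assert abs(q2 - np.median(wcc)) < 1e-9
```

The equivalent pandas form would be `pd.Series(wcc).quantile(0.25)` for the first quartile, with the argument on the zero-to-one scale.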
\n", 396 | "\n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | "
File  Age  Gender  Delay  Stay  ICU  RVD  CD4  HR  Temp  CRP  WCC  HB  Rupture  Histo  Comp  MASS
0  1  38  Female  3  6  No  No  NaN  97  35.2  NaN  10.49  10.4  No  Yes  Yes  5
1  2  32  Male  6  10  No  Yes  57  109  38.8  45.3  7.08  19.8  No  No  Yes  8
2  3  19  Female  1  16  No  No  NaN  120  36.3  10.7  13.00  8.7  No  No  No  3
\n", 482 | "
" 483 | ], 484 | "text/plain": [ 485 | " File Age Gender Delay Stay ICU RVD CD4 HR Temp CRP WCC HB \\\n", 486 | "0 1 38 Female 3 6 No No NaN 97 35.2 NaN 10.49 10.4 \n", 487 | "1 2 32 Male 6 10 No Yes 57 109 38.8 45.3 7.08 19.8 \n", 488 | "2 3 19 Female 1 16 No No NaN 120 36.3 10.7 13.00 8.7 \n", 489 | "\n", 490 | " Rupture Histo Comp MASS \n", 491 | "0 No Yes Yes 5 \n", 492 | "1 No No Yes 8 \n", 493 | "2 No No No 3 " 494 | ] 495 | }, 496 | "execution_count": 5, 497 | "metadata": {}, 498 | "output_type": "execute_result" 499 | } 500 | ], 501 | "source": [ 502 | "data.head(3) #Looking at the header columns and first 3 rows (default = 5)" 503 | ] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "metadata": {}, 508 | "source": [ 509 | "### Using the *.describe* function" 510 | ] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "metadata": {}, 515 | "source": [ 516 | "Identifying a column and adding the *.describe* function is a quick way to get a glimpse at our data." 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": 6, 522 | "metadata": { 523 | "collapsed": false 524 | }, 525 | "outputs": [ 526 | { 527 | "data": { 528 | "text/plain": [ 529 | "count 150.000000\n", 530 | "mean 30.733333\n", 531 | "std 10.920498\n", 532 | "min 18.000000\n", 533 | "25% 22.000000\n", 534 | "50% 28.000000\n", 535 | "75% 35.000000\n", 536 | "max 67.000000\n", 537 | "Name: Age, dtype: float64" 538 | ] 539 | }, 540 | "execution_count": 6, 541 | "metadata": {}, 542 | "output_type": "execute_result" 543 | } 544 | ], 545 | "source": [ 546 | "data['Age'].describe() #Identifying the Age column and calling the describe function lists the following" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": {}, 552 | "source": [ 553 | "We note that there are a *150* data points in the **Age** columns, with amean of *20.9* years. 
The *std* stands for standard deviation, the *min* for the minimum value and the *max* for the maximum value.\n", 554 | "\n", 555 | "The percentages are the quartiles, with the 50% mark being the *median*.\n", 556 | "\n", 557 | "We could now report that the mean age of our sample group was 30.7 years with a range from 18 to 67 years." 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": 7, 563 | "metadata": { 564 | "collapsed": false 565 | }, 566 | "outputs": [ 567 | { 568 | "data": { 569 | "text/plain": [ 570 | "count 150.000000\n", 571 | "mean 3.206667\n", 572 | "std 2.241454\n", 573 | "min 0.000000\n", 574 | "25% 1.000000\n", 575 | "50% 3.000000\n", 576 | "75% 5.000000\n", 577 | "max 7.000000\n", 578 | "Name: Delay, dtype: float64" 579 | ] 580 | }, 581 | "execution_count": 7, 582 | "metadata": {}, 583 | "output_type": "execute_result" 584 | } 585 | ], 586 | "source": [ 587 | "data['Delay'].describe()" 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": 8, 593 | "metadata": { 594 | "collapsed": false 595 | }, 596 | "outputs": [ 597 | { 598 | "data": { 599 | "text/plain": [ 600 | "count 150.000000\n", 601 | "mean 10.886667\n", 602 | "std 5.413799\n", 603 | "min 2.000000\n", 604 | "25% 6.000000\n", 605 | "50% 11.000000\n", 606 | "75% 15.000000\n", 607 | "max 21.000000\n", 608 | "Name: Stay, dtype: float64" 609 | ] 610 | }, 611 | "execution_count": 8, 612 | "metadata": {}, 613 | "output_type": "execute_result" 614 | } 615 | ], 616 | "source": [ 617 | "data['Stay'].describe()" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": null, 623 | "metadata": { 624 | "collapsed": false 625 | }, 626 | "outputs": [], 627 | "source": [] 628 | } 629 | ], 630 | "metadata": { 631 | "kernelspec": { 632 | "display_name": "Python 3", 633 | "language": "python", 634 | "name": "python3" 635 | }, 636 | "language_info": { 637 | "codemirror_mode": { 638 | "name": "ipython", 639 | "version": 3 640 | }, 641 | 
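Each figure that *.describe* reports can be reproduced with the individual pandas methods. A minimal sketch on made-up ages (not the chapter's dataset):

```python
import pandas as pd

# Hypothetical age values, used only to illustrate the describe() output
ages = pd.Series([23, 25, 31, 34, 46, 58, 61], name='Age')

summary = ages.describe()

# describe() bundles the count, mean, std, min, the three quartiles and max
assert summary['count'] == ages.count()
assert abs(summary['mean'] - ages.mean()) < 1e-9
assert abs(summary['50%'] - ages.median()) < 1e-9
assert summary['min'] == ages.min()
assert summary['max'] == ages.max()
```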
"file_extension": ".py", 642 | "mimetype": "text/x-python", 643 | "name": "python", 644 | "nbconvert_exporter": "python", 645 | "pygments_lexer": "ipython3", 646 | "version": "3.4.3" 647 | } 648 | }, 649 | "nbformat": 4, 650 | "nbformat_minor": 0 651 | } 652 | -------------------------------------------------------------------------------- /Chapter13_The_central_limit_theorem.ipynb3wgaui: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/juanklopper/Python_for_Medical_Statistics/b9f62c416f016c4850e02b6da74392e2a307f1f8/Chapter13_The_central_limit_theorem.ipynb3wgaui -------------------------------------------------------------------------------- /Chapter16_Confidence_intervals.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Setting up a fancy stylesheet" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "data": { 19 | "text/html": [ 20 | "\n", 21 | "\n", 22 | "\n", 23 | "\n", 24 | "\n", 25 | "\n", 26 | "\n", 109 | "\n" 110 | ], 111 | "text/plain": [ 112 | "" 113 | ] 114 | }, 115 | "execution_count": 1, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "from IPython.core.display import HTML\n", 122 | "css_file = 'style.css'\n", 123 | "HTML(open(css_file, 'r').read())" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "Setting up the required python ™ environment" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 2, 136 | "metadata": { 137 | "collapsed": false 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "#import numpy as np\n", 142 | "import pandas as pd\n", 143 | "from scipy.stats import bayes_mvs\n", 144 | "#from math import 
factorial\n", 145 | "import scikits.bootstrap as bs\n", 146 | "import matplotlib.pyplot as plt\n", 147 | "import seaborn as sns\n", 148 | "from warnings import filterwarnings\n", 149 | "sns.set_style('white')\n", 150 | "sns.set_context('paper', font_scale = 2.0, rc = {'lines.linewidth': 1.5, 'figure.figsize' : (10, 8)})\n", 151 | "%matplotlib inline\n", 152 | "filterwarnings('ignore')" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "Importing the mock dataset and checking if it imported correctly" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 3, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [ 169 | { 170 | "data": { 171 | "text/html": [ 172 | "
\n", 173 | "\n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | "
File  Age  Gender  Delay  Stay  ICU  RVD  CD4  HR  Temp  CRP  WCC  HB  Rupture  Histo  Comp  MASS
0  1  38  Female  3  6  No  No  NaN  97  35.2  NaN  10.49  10.4  No  Yes  Yes  5
1  2  32  Male  6  10  No  Yes  57  109  38.8  45.3  7.08  19.8  No  No  Yes  8
2  3  19  Female  1  16  No  No  NaN  120  36.3  10.7  13.00  8.7  No  No  No  3
\n", 259 | "
" 260 | ], 261 | "text/plain": [ 262 | " File Age Gender Delay Stay ICU RVD CD4 HR Temp CRP WCC HB \\\n", 263 | "0 1 38 Female 3 6 No No NaN 97 35.2 NaN 10.49 10.4 \n", 264 | "1 2 32 Male 6 10 No Yes 57 109 38.8 45.3 7.08 19.8 \n", 265 | "2 3 19 Female 1 16 No No NaN 120 36.3 10.7 13.00 8.7 \n", 266 | "\n", 267 | " Rupture Histo Comp MASS \n", 268 | "0 No Yes Yes 5 \n", 269 | "1 No No Yes 8 \n", 270 | "2 No No No 3 " 271 | ] 272 | }, 273 | "execution_count": 3, 274 | "metadata": {}, 275 | "output_type": "execute_result" 276 | } 277 | ], 278 | "source": [ 279 | "data = pd.read_csv('MOOC_Mock.csv')\n", 280 | "data.head(3) #Looking at the first 3 rows" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "# Confidence intervals" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "## Introduction" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "

When conducting research by scrutinizing a sample set of individuals taken from a larger population, we create statistics. We then use these statistics to infer parameter values for the greater population. These inferences can be used to guide patient treatment and management, on the assumption that our results are applicable to that larger (patient) population.

\n", 302 | "

However, there is no way that we can say with absolute confidence that our statistic is exactly what would be found in the general population. Instead, we can place a chosen level of confidence around our test statistics, and infer that the real (larger population) parameters lie within that confidence interval.

" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": {}, 308 | "source": [ 309 | "## Confidence interval around mean of a sample variable set" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "

Most textbooks begin the chapter on confidence intervals (CIs) with an example where the population mean and standard deviation are known. This is almost never the case in healthcare research, so let's have a look at what is commonly observed.

\n", 317 | "

Most of the time we are stuck with a set of variable values from our sample set and we can calculate the mean, standard deviation and standard error of that set.

\n", 318 | "

In our example dataset we might look at the white cell count (WCC) on admission. We could divide our sample patients into two groups, say those with and without retroviral disease.

\n", 319 | "

In the code below we make two new DataFrames, with each DataFrame object assigned to a new Python variable.

" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 4, 325 | "metadata": { 326 | "collapsed": false 327 | }, 328 | "outputs": [], 329 | "source": [ 330 | "RVD_neg = data[data['RVD'] == 'No'] # Only including the rows in the original DataFrame that have a Yes in the RVD column\n", 331 | "RVD_neg_appx = RVD_neg[RVD_neg['Histo'] == 'Yes'] # Extracting only those who actually had appendicitis\n", 332 | "RVD_pos = data[data['RVD'] == 'Yes'] # Doing the same but with No in the RVD column\n", 333 | "RVD_pos_appx = RVD_pos[RVD_pos['Histo'] == 'Yes'] # Extracting only those who actually had appendicitis" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "

Let's look at the first rows in each new DataFrame using the *.head()* function. Leaving the argument blank will result in a default of *5* rows.

" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 5, 346 | "metadata": { 347 | "collapsed": false 348 | }, 349 | "outputs": [ 350 | { 351 | "data": { 352 | "text/html": [ 353 | "
\n", 354 | "\n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | "
File  Age  Gender  Delay  Stay  ICU  RVD  CD4  HR  Temp  CRP  WCC  HB  Rupture  Histo  Comp  MASS
0  1  38  Female  3  6  No  No  NaN  97  35.2  NaN  10.49  10.4  No  Yes  Yes  5
7  8  19  Male  2  13  No  No  NaN  102  38.5  224.6  8.99  15.5  No  Yes  Yes  5
8  9  46  Male  2  9  No  No  NaN  107  36.2  385.1  10.26  15.5  No  Yes  No  1
9  10  19  Male  7  12  Yes  No  NaN  105  36.6  123  22.19  13.7  No  Yes  Yes  5
10  11  33  Male  2  14  No  No  NaN  104  36.8  NaN  16.20  12.7  No  Yes  No  9
\n", 480 | "
" 481 | ], 482 | "text/plain": [ 483 | " File Age Gender Delay Stay ICU RVD CD4 HR Temp CRP WCC \\\n", 484 | "0 1 38 Female 3 6 No No NaN 97 35.2 NaN 10.49 \n", 485 | "7 8 19 Male 2 13 No No NaN 102 38.5 224.6 8.99 \n", 486 | "8 9 46 Male 2 9 No No NaN 107 36.2 385.1 10.26 \n", 487 | "9 10 19 Male 7 12 Yes No NaN 105 36.6 123 22.19 \n", 488 | "10 11 33 Male 2 14 No No NaN 104 36.8 NaN 16.20 \n", 489 | "\n", 490 | " HB Rupture Histo Comp MASS \n", 491 | "0 10.4 No Yes Yes 5 \n", 492 | "7 15.5 No Yes Yes 5 \n", 493 | "8 15.5 No Yes No 1 \n", 494 | "9 13.7 No Yes Yes 5 \n", 495 | "10 12.7 No Yes No 9 " 496 | ] 497 | }, 498 | "execution_count": 5, 499 | "metadata": {}, 500 | "output_type": "execute_result" 501 | } 502 | ], 503 | "source": [ 504 | "RVD_neg_appx.head()" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": 6, 510 | "metadata": { 511 | "collapsed": false 512 | }, 513 | "outputs": [ 514 | { 515 | "data": { 516 | "text/html": [ 517 | "
\n", 518 | "\n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | "
FileAgeGenderDelayStayICURVDCD4HRTempCRPWCCHBRuptureHistoCompMASS
4528Female33NoYes49111537.151.621.9813.4NoYesNo7
6736Male614NoYes2039537.5NaN16.2117.7YesYesYes7
192033Female119NoYes20312337.546.58.0010.4NoYesYes8
313239Male210NoYes10210036.820820.4312.9YesYesNo1
323333Male021NoYes26610039.7194.17.3110.7YesYesYes3
\n", 644 | "
" 645 | ], 646 | "text/plain": [ 647 | " File Age Gender Delay Stay ICU RVD CD4 HR Temp CRP WCC \\\n", 648 | "4 5 28 Female 3 3 No Yes 491 115 37.1 51.6 21.98 \n", 649 | "6 7 36 Male 6 14 No Yes 203 95 37.5 NaN 16.21 \n", 650 | "19 20 33 Female 1 19 No Yes 203 123 37.5 46.5 8.00 \n", 651 | "31 32 39 Male 2 10 No Yes 102 100 36.8 208 20.43 \n", 652 | "32 33 33 Male 0 21 No Yes 266 100 39.7 194.1 7.31 \n", 653 | "\n", 654 | " HB Rupture Histo Comp MASS \n", 655 | "4 13.4 No Yes No 7 \n", 656 | "6 17.7 Yes Yes Yes 7 \n", 657 | "19 10.4 No Yes Yes 8 \n", 658 | "31 12.9 Yes Yes No 1 \n", 659 | "32 10.7 Yes Yes Yes 3 " 660 | ] 661 | }, 662 | "execution_count": 6, 663 | "metadata": {}, 664 | "output_type": "execute_result" 665 | } 666 | ], 667 | "source": [ 668 | "RVD_pos_appx.head()" 669 | ] 670 | }, 671 | { 672 | "cell_type": "markdown", 673 | "metadata": {}, 674 | "source": [ 675 | "Note how the two **RVD** columns contain only *No* and *Yes* respectively." 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "

The easiest way to calculate the confidence intervals is to use the *scipy.stats* function *bayes_mvs*. It actually returns nine values: three rows of three values each. The first row gives the mean, followed by the lower and upper bounds of the stated confidence interval.

\n", 683 | "

The second row does the same but for the variance. The last row calculates the values for the standard deviation.
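The three rows can be captured directly as three named tuples, each holding a point estimate and its (lower, upper) pair. A minimal sketch with made-up values (not the study data):

```python
from scipy.stats import bayes_mvs

# Illustrative values only -- not the study data
values = [10.4, 12.1, 14.3, 9.8, 15.5, 13.2, 11.7, 16.0, 12.9, 14.8]

# Each result is a named tuple with .statistic (the point estimate)
# and .minmax (the lower and upper CI bounds)
mean_res, var_res, std_res = bayes_mvs(values, 0.95)

lower, upper = mean_res.minmax  # the 95% CI for the mean
```

Unpacking like this avoids indexing into the raw three-tuple by position.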

\n", 684 | "

Let's first describe our two DataFrames and then calculate the CIs.

" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": 7, 690 | "metadata": { 691 | "collapsed": false 692 | }, 693 | "outputs": [ 694 | { 695 | "data": { 696 | "text/plain": [ 697 | "count 78.000000\n", 698 | "mean 14.044103\n", 699 | "std 4.347939\n", 700 | "min 6.610000\n", 701 | "25% 11.270000\n", 702 | "50% 13.655000\n", 703 | "75% 16.305000\n", 704 | "max 26.400000\n", 705 | "Name: WCC, dtype: float64" 706 | ] 707 | }, 708 | "execution_count": 7, 709 | "metadata": {}, 710 | "output_type": "execute_result" 711 | } 712 | ], 713 | "source": [ 714 | "RVD_neg_appx['WCC'].describe() # Describing the WCC column in the RVD_neg DataFrame" 715 | ] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "execution_count": 8, 720 | "metadata": { 721 | "collapsed": false 722 | }, 723 | "outputs": [ 724 | { 725 | "data": { 726 | "text/plain": [ 727 | "count 40.000000\n", 728 | "mean 15.773250\n", 729 | "std 4.934148\n", 730 | "min 2.950000\n", 731 | "25% 12.475000\n", 732 | "50% 15.575000\n", 733 | "75% 19.920000\n", 734 | "max 24.890000\n", 735 | "Name: WCC, dtype: float64" 736 | ] 737 | }, 738 | "execution_count": 8, 739 | "metadata": {}, 740 | "output_type": "execute_result" 741 | } 742 | ], 743 | "source": [ 744 | "RVD_pos_appx['WCC'].describe() # Describing the WCC column in the RVD_pos DataFrame" 745 | ] 746 | }, 747 | { 748 | "cell_type": "markdown", 749 | "metadata": {}, 750 | "source": [ 751 | "

Now for the CIs. The *bayes_mvs()* function takes two arguments: the first is the values themselves and the second is the required confidence level. In this example I want the 95% CIs. I also make sure to discard all missing values with *.dropna()*.

" 752 | ] 753 | }, 754 | { 755 | "cell_type": "code", 756 | "execution_count": 9, 757 | "metadata": { 758 | "collapsed": false 759 | }, 760 | "outputs": [ 761 | { 762 | "data": { 763 | "text/plain": [ 764 | "(Mean(statistic=14.044102564102564, minmax=(13.063793818374444, 15.024411309830684)),\n", 765 | " Variance(statistic=19.408694495726504, minmax=(14.110883384251707, 26.648893927036038)),\n", 766 | " Std_dev(statistic=4.3908697719337093, minmax=(3.7564455785025959, 5.1622566700074142)))" 767 | ] 768 | }, 769 | "execution_count": 9, 770 | "metadata": {}, 771 | "output_type": "execute_result" 772 | } 773 | ], 774 | "source": [ 775 | "bayes_mvs(RVD_neg_appx['WCC'].dropna(), 0.95)" 776 | ] 777 | }, 778 | { 779 | "cell_type": "code", 780 | "execution_count": 10, 781 | "metadata": { 782 | "collapsed": false 783 | }, 784 | "outputs": [ 785 | { 786 | "data": { 787 | "text/plain": [ 788 | "(Mean(statistic=15.773250000000001, minmax=(14.195232891628422, 17.35126710837158)),\n", 789 | " Variance(statistic=25.661807500000002, minmax=(16.336646621396056, 40.140096800828552)),\n", 790 | " Std_dev(statistic=5.0316399443801929, minmax=(4.041861776631662, 6.33562126399839)))" 791 | ] 792 | }, 793 | "execution_count": 10, 794 | "metadata": {}, 795 | "output_type": "execute_result" 796 | } 797 | ], 798 | "source": [ 799 | "bayes_mvs(RVD_pos_appx['WCC'].dropna(), 0.95)" 800 | ] 801 | }, 802 | { 803 | "cell_type": "markdown", 804 | "metadata": {}, 805 | "source": [ 806 | "

Let's put down in words what we have found from the calculations above. We could state in a paper that there were 78 patients in the RVD negative group with acute appendicitis. They had a mean WCC of 14.04 on admission, with a 95% confidence interval of 13.06 to 15.02. There were 40 RVD positive patients with an average admission WCC of 15.77 (95% CI 14.20 to 17.35). (Note the mathematical rounding).

\n", 807 | "

However, what do we mean by this 95% CI? By our statistical analysis, we found a mean WCC of 15.77 in RVD-positive patients. If we want to infer from this to the larger population, we have to say that, according to our analysis, the true population admission WCC for all RVD-positive patients lies between two values. We can state those two values (lower and upper) with a set level of confidence. Here, we are 95% confident that the true value in the larger population lies between 14.20 and 17.35.

" 808 | ] 809 | }, 810 | { 811 | "cell_type": "markdown", 812 | "metadata": {}, 813 | "source": [ 814 | "

Let's drop our confidence level to 80%. What do you expect to happen? The lower and upper values will move closer to the mean, but we will be less confident that our *prediction* (inference) is correct. Let's see:

" 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": 11, 820 | "metadata": { 821 | "collapsed": false 822 | }, 823 | "outputs": [ 824 | { 825 | "data": { 826 | "text/plain": [ 827 | "(Mean(statistic=15.773250000000001, minmax=(14.756206821106424, 16.790293178893577)),\n", 828 | " Variance(statistic=25.661807500000002, minmax=(18.742423588894685, 33.674780516226335)),\n", 829 | " Std_dev(statistic=5.0316399443801929, minmax=(4.329252081929936, 5.8029975457711798)))" 830 | ] 831 | }, 832 | "execution_count": 11, 833 | "metadata": {}, 834 | "output_type": "execute_result" 835 | } 836 | ], 837 | "source": [ 838 | "bayes_mvs(RVD_pos_appx['WCC'], 0.8)" 839 | ] 840 | }, 841 | { 842 | "cell_type": "markdown", 843 | "metadata": {}, 844 | "source": [ 845 | "Indeed, the values are now in a much narrower range of 14.76 to 16.79. We are much less confident about this, though!" 846 | ] 847 | }, 848 | { 849 | "cell_type": "markdown", 850 | "metadata": {}, 851 | "source": [ 852 | "## What lies behind these calculations" 853 | ] 854 | }, 855 | { 856 | "cell_type": "markdown", 857 | "metadata": {}, 858 | "source": [ 859 | "

Let's start with the equation:

\n", 860 | "$$ \\overline { x } \\pm t{ s }_{ SE } $$\n", 861 | "

Here *t* is the t-statistic. It is calculated (or read from a table) and depends on the desired confidence level (strictly, the alpha value) and the degrees of freedom.

\n", 862 | "

The *sSE* is the standard error of the mean or...

\n", 863 | "$$ \\frac { s }{ \\sqrt { n } } $$\n", 864 | "

...the standard deviation over the square root of the sample size.

\n", 865 | "

A t-distribution is determined by the degrees of freedom. It looks much like a normal distribution, though its exact shape differs for each value of this parameter. When we state a confidence interval, we work out what middle area under the curve is covered by that confidence, i.e. we cover 0.95 of the total area and calculate where on the x-axis these cut-offs fall (the t-statistic). We multiply this by the standard error of the mean, and subtract it from, as well as add it to, the mean to get the lower and upper bounds of the required CI. Neat!
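As a check on this reasoning, the interval can be computed by hand with *scipy.stats.t*; the sample below is made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical sample -- not the study data
x = np.array([14.2, 15.8, 13.1, 16.4, 15.0, 14.7, 16.9, 13.8])

n = x.size
mean = x.mean()
se = x.std(ddof=1) / np.sqrt(n)     # standard error of the mean

# t-statistic cutting off 2.5% in each tail, with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)

lower = mean - t_crit * se
upper = mean + t_crit * se
```

These bounds should agree with what `stats.t.interval(0.95, n - 1, loc=mean, scale=se)` returns.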

" 866 | ] 867 | }, 868 | { 869 | "cell_type": "markdown", 870 | "metadata": {}, 871 | "source": [ 872 | "## Bootstrapping a confidence interval" 873 | ] 874 | }, 875 | { 876 | "cell_type": "markdown", 877 | "metadata": {}, 878 | "source": [ 879 | "

There is another method to calculate the CI: bootstrapping, or resampling. We are interested in inferring population parameters from statistics. We would like to say that the real population parameter lies between such and such a value, and that we are confident about this *guess* at a given level.

\n", 880 | "

The only knowledge we have about the population, though, is our sample set; we have nothing else. What we can do is draw random samples from our sample set, each resample containing exactly as many values as the original. So that the resamples are not all identical, every value drawn is *thrown back* into the pool and has a chance of being drawn again, and again. We can repeat this thousands of times; with the dawn of the age of fast computers, this is no problem. Now we have as many resamples as we want. The hope is that they are a good reflection of the population. From them we can form a distribution curve and calculate the CIs.
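The resampling scheme just described can be sketched directly with *numpy*; the sample values here are illustrative, and the generator is seeded so the run is repeatable:

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded so the result is repeatable

# Hypothetical sample -- not the study data
sample = np.array([15.8, 12.5, 19.9, 14.1, 16.2, 21.0, 13.3, 17.7, 15.0, 18.4])

# Draw 10000 resamples, each the same size as the original, with replacement
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10000)
])

# Percentile CI: cut 2.5% off each tail of the bootstrap distribution
lower, upper = np.percentile(boot_means, [2.5, 97.5])
```

This is the percentile method; `scikits.bootstrap`, used below, does essentially the same work (plus refinements) for us.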

" 881 | ] 882 | }, 883 | { 884 | "cell_type": "code", 885 | "execution_count": 12, 886 | "metadata": { 887 | "collapsed": false 888 | }, 889 | "outputs": [ 890 | { 891 | "data": { 892 | "text/plain": [ 893 | "array([ 14.23725, 17.23825])" 894 | ] 895 | }, 896 | "execution_count": 12, 897 | "metadata": {}, 898 | "output_type": "execute_result" 899 | } 900 | ], 901 | "source": [ 902 | "bs.ci(RVD_pos_appx['WCC'].dropna(), alpha = 0.05, n_samples = 10000)\n", 903 | "# Note that we state the alpha value directly and here I use 10000 resamples" 904 | ] 905 | }, 906 | { 907 | "cell_type": "markdown", 908 | "metadata": {}, 909 | "source": [ 910 | "

I can now give the CIs as the values in the array above. Note that they will be slightly different every time you run it!

" 911 | ] 912 | }, 913 | { 914 | "cell_type": "code", 915 | "execution_count": null, 916 | "metadata": { 917 | "collapsed": false 918 | }, 919 | "outputs": [], 920 | "source": [] 921 | } 922 | ], 923 | "metadata": { 924 | "kernelspec": { 925 | "display_name": "Python 3", 926 | "language": "python", 927 | "name": "python3" 928 | }, 929 | "language_info": { 930 | "codemirror_mode": { 931 | "name": "ipython", 932 | "version": 3 933 | }, 934 | "file_extension": ".py", 935 | "mimetype": "text/x-python", 936 | "name": "python", 937 | "nbconvert_exporter": "python", 938 | "pygments_lexer": "ipython3", 939 | "version": "3.4.3" 940 | } 941 | }, 942 | "nbformat": 4, 943 | "nbformat_minor": 0 944 | } 945 | -------------------------------------------------------------------------------- /Chapter19_Comparing_categorical_data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Setting up a fancy stylesheet" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "data": { 19 | "text/html": [ 20 | "\n", 21 | "\n", 22 | "\n", 23 | "\n", 24 | "\n", 25 | "\n", 26 | "\n", 109 | "\n" 110 | ], 111 | "text/plain": [ 112 | "" 113 | ] 114 | }, 115 | "execution_count": 1, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "from IPython.core.display import HTML\n", 122 | "css_file = 'style.css'\n", 123 | "HTML(open(css_file, 'r').read())" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "Setting up the required python ™ environment" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 2, 136 | "metadata": { 137 | "collapsed": false 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "import numpy as np\n", 142 | "import pandas as pd\n", 
143 | "from scipy import stats\n", 144 | "from scipy.stats import bayes_mvs\n", 145 | "from math import factorial\n", 146 | "import scikits.bootstrap as bs\n", 147 | "import matplotlib.pyplot as plt\n", 148 | "import seaborn as sns\n", 149 | "from warnings import filterwarnings\n", 150 | "%matplotlib inline\n", 151 | "sns.set_style('whitegrid')\n", 152 | "sns.set_context('paper', font_scale = 2.0, rc = {'lines.linewidth': 1.5, 'figure.figsize' : (10, 8)})\n", 153 | "filterwarnings('ignore')" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "# Comparing categorical data" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "## Introduction" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "

In the previous chapter we created two groups. These groups were categorical in nature. Group A comprised those without histological evidence of appendicitis and group B those with such evidence. Furthermore, these two groups are nominal: there is no order to *with* and *without* appendicitis.

\n", 175 | "

The variable we examined was white cell count, which represented continuous data of the ratio numerical type. Now let's construct two groups, once again of the nominal categorical type, but this time the data variable is also categorical.

" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "## Importing and examining the dataset" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 3, 188 | "metadata": { 189 | "collapsed": false 190 | }, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/html": [ 195 | "
\n", 196 | "\n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | "
FileAgeGenderDelayStayICURVDCD4HRTempCRPWCCHBRuptureHistoCompMASS
0138Female36NoNoNaN9735.2NaN10.4910.4NoYesYes5
1232Male610NoYes5710938.845.37.0819.8NoNoYes8
2319Female116NoNoNaN12036.310.713.008.7NoNoNo3
3420Female29NoYesNaN12035.777.84.458.8NoNoNo0
4528Female33NoYes49111537.151.621.9813.4NoYesNo7
\n", 322 | "
" 323 | ], 324 | "text/plain": [ 325 | " File Age Gender Delay Stay ICU RVD CD4 HR Temp CRP WCC HB \\\n", 326 | "0 1 38 Female 3 6 No No NaN 97 35.2 NaN 10.49 10.4 \n", 327 | "1 2 32 Male 6 10 No Yes 57 109 38.8 45.3 7.08 19.8 \n", 328 | "2 3 19 Female 1 16 No No NaN 120 36.3 10.7 13.00 8.7 \n", 329 | "3 4 20 Female 2 9 No Yes NaN 120 35.7 77.8 4.45 8.8 \n", 330 | "4 5 28 Female 3 3 No Yes 491 115 37.1 51.6 21.98 13.4 \n", 331 | "\n", 332 | " Rupture Histo Comp MASS \n", 333 | "0 No Yes Yes 5 \n", 334 | "1 No No Yes 8 \n", 335 | "2 No No No 3 \n", 336 | "3 No No No 0 \n", 337 | "4 No Yes No 7 " 338 | ] 339 | }, 340 | "execution_count": 3, 341 | "metadata": {}, 342 | "output_type": "execute_result" 343 | } 344 | ], 345 | "source": [ 346 | "data = pd.read_csv('MOOC_Mock.csv')\n", 347 | "data.head()" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "## The chi-squared (*Χ2*) test" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": {}, 360 | "source": [ 361 | "

For this example I want to know if there is a difference in the incidence of histologically proven appendicitis between those with and without retroviral disease (RVD).

\n", 362 | "

Think about it for a moment. I will have two groups: Group A without appendicitis and Group B with appendicitis. Within each of these I will have two groups: Group I without RVD and Group II with RVD. From this I can create a little table called a *contingency table*. In this example the table will have two rows and two columns. For a Χ*2* test, the table can have more rows and more columns.

\n", 363 | "

In order to do this test, we will have to get the values to fill our contingency table manually.

" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 4, 369 | "metadata": { 370 | "collapsed": false 371 | }, 372 | "outputs": [ 373 | { 374 | "data": { 375 | "text/plain": [ 376 | "Histo RVD\n", 377 | "No No 16\n", 378 | " Yes 14\n", 379 | "Yes No 80\n", 380 | " Yes 40\n", 381 | "dtype: int64" 382 | ] 383 | }, 384 | "execution_count": 4, 385 | "metadata": {}, 386 | "output_type": "execute_result" 387 | } 388 | ], 389 | "source": [ 390 | "histo_group = data.groupby(data['Histo'])\n", 391 | "histo_group['RVD'].value_counts()" 392 | ] 393 | }, 394 | { 395 | "cell_type": "markdown", 396 | "metadata": {}, 397 | "source": [ 398 | "

Here I have made use of the powerful *.groupby()* function. It splits my DataFrame into parts according to the values found in the specified column (here I chose **Histo**). I attached the new (split) DataFrame to the computer variable name *histo_group*.

\n", 399 | "

I then asked the software to give me the value counts for the **RVD** column of this new DataFrame. Note how the results show that the DataFrame is split by **Histo** and then give a breakdown of the values found in the requested column. It found the values *Yes* and *No* and counted how many of each occurred.
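As an aside, *pandas* can build the same counts as a ready-made table with *pd.crosstab()*. A sketch on a small made-up frame with the same two column names (not the study data):

```python
import pandas as pd

# Made-up rows -- not the study data
df = pd.DataFrame({'Histo': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
                   'RVD':   ['No', 'Yes', 'No', 'No', 'Yes', 'Yes']})

# Rows are Histo values, columns are RVD values, cells are counts
table = pd.crosstab(df['Histo'], df['RVD'])
print(table)
```

The resulting table can be passed straight to a chi-squared test, saving the manual transcription step below.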

" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": {}, 405 | "source": [ 406 | "### Creating the contingency table (matrix)" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "

Now we have to construct our little two-by-two matrix. Here I will use *numpy*. Note the use of square brackets.

" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 5, 419 | "metadata": { 420 | "collapsed": false 421 | }, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/plain": [ 426 | "array([[16, 14],\n", 427 | " [80, 40]])" 428 | ] 429 | }, 430 | "execution_count": 5, 431 | "metadata": {}, 432 | "output_type": "execute_result" 433 | } 434 | ], 435 | "source": [ 436 | "histo_RVD_observed = np.array([[16, 14], [80, 40]])\n", 437 | "histo_RVD_observed" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "

From the *scipy.stats* library I am going to use the *chi2_contingency()* function. It takes a single argument (my observed table above) and returns four values, hence my use of four computer variable names before the equal sign. They are:\n", 445 | "* The Χ*2* value\n", 446 | "* The *p*-value\n", 447 | "* The degrees of freedom\n", 448 | "* The expected table"

Let's print each of these to the screen.

" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 7, 472 | "metadata": { 473 | "collapsed": false 474 | }, 475 | "outputs": [ 476 | { 477 | "data": { 478 | "text/plain": [ 479 | "1.3183593750000002" 480 | ] 481 | }, 482 | "execution_count": 7, 483 | "metadata": {}, 484 | "output_type": "execute_result" 485 | } 486 | ], 487 | "source": [ 488 | "chi_val" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 8, 494 | "metadata": { 495 | "collapsed": false 496 | }, 497 | "outputs": [ 498 | { 499 | "data": { 500 | "text/plain": [ 501 | "0.25088670393543944" 502 | ] 503 | }, 504 | "execution_count": 8, 505 | "metadata": {}, 506 | "output_type": "execute_result" 507 | } 508 | ], 509 | "source": [ 510 | "p" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": 9, 516 | "metadata": { 517 | "collapsed": false 518 | }, 519 | "outputs": [ 520 | { 521 | "data": { 522 | "text/plain": [ 523 | "1" 524 | ] 525 | }, 526 | "execution_count": 9, 527 | "metadata": {}, 528 | "output_type": "execute_result" 529 | } 530 | ], 531 | "source": [ 532 | "df" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": 10, 538 | "metadata": { 539 | "collapsed": false 540 | }, 541 | "outputs": [ 542 | { 543 | "data": { 544 | "text/plain": [ 545 | "array([[ 19.2, 10.8],\n", 546 | " [ 76.8, 43.2]])" 547 | ] 548 | }, 549 | "execution_count": 10, 550 | "metadata": {}, 551 | "output_type": "execute_result" 552 | } 553 | ], 554 | "source": [ 555 | "expected" 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "

So, our *p*-value was more than *0.05*, which means we could not demonstrate a statistically significant difference in the rate of RVD between those with and without appendicitis.

\n", 563 | "

The expected table is quite interesting. It shows the counts we would have expected in each cell, given the row and column totals, if there were no association between the two variables.
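The expected table can be reproduced by hand from the row and column totals of the observed table; this sketch matches the output above:

```python
import numpy as np

# The observed table from this chapter
observed = np.array([[16, 14],
                     [80, 40]])

row_totals = observed.sum(axis=1, keepdims=True)   # 30 and 120
col_totals = observed.sum(axis=0, keepdims=True)   # 96 and 54

# Expected count for each cell = row total * column total / grand total
expected = row_totals * col_totals / observed.sum()
print(expected)   # 19.2, 10.8, 76.8, 43.2
```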

" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": null, 569 | "metadata": { 570 | "collapsed": false 571 | }, 572 | "outputs": [], 573 | "source": [] 574 | } 575 | ], 576 | "metadata": { 577 | "kernelspec": { 578 | "display_name": "Python 3", 579 | "language": "python", 580 | "name": "python3" 581 | }, 582 | "language_info": { 583 | "codemirror_mode": { 584 | "name": "ipython", 585 | "version": 3 586 | }, 587 | "file_extension": ".py", 588 | "mimetype": "text/x-python", 589 | "name": "python", 590 | "nbconvert_exporter": "python", 591 | "pygments_lexer": "ipython3", 592 | "version": "3.4.3" 593 | } 594 | }, 595 | "nbformat": 4, 596 | "nbformat_minor": 0 597 | } 598 | -------------------------------------------------------------------------------- /Chapter21_Sensitivity_specificity_risks_rates_odds.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Setting up a fancy stylesheet" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": { 14 | "collapsed": false 15 | }, 16 | "outputs": [ 17 | { 18 | "data": { 19 | "text/html": [ 20 | "\n", 21 | "\n", 22 | "\n", 23 | "\n", 24 | "\n", 25 | "\n", 26 | "\n", 109 | "\n" 110 | ], 111 | "text/plain": [ 112 | "" 113 | ] 114 | }, 115 | "execution_count": 1, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "from IPython.core.display import HTML\n", 122 | "css_file = 'style.css'\n", 123 | "HTML(open(css_file, 'r').read())" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "Setting up the required python ™ environment" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 2, 136 | "metadata": { 137 | "collapsed": false 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "#import numpy as np\n", 142 | "import 
pandas as pd\n", 143 | "#from scipy.stats import bayes_mvs\n", 144 | "#from math import factorial\n", 145 | "#import scikits.bootstrap as bs\n", 146 | "#import matplotlib.pyplot as plt\n", 147 | "#import seaborn as sns\n", 148 | "from warnings import filterwarnings\n", 149 | "\n", 150 | "#%matplotlib inline\n", 151 | "filterwarnings('ignore')" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "

When considering the use of special investigations or even clinical findings, it is important to understand how accurate they are and how this impacts their use and the clinical decisions that depend on them.

\n", 159 | "

Whereas sensitivity and specificity are independent of the prevalence of a condition, predictive values are not: prevalence profoundly affects the interpretation of their reported values. Fortunately, there are equations that can translate reported predictive values to a relevant prevalence.

\n", 160 | "

Predictive values make use of a form of Bayesian statistics. Bayesian inference makes use of past knowledge (as gained from research) to predict, in essence, the future.
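This use of past knowledge can be sketched with Bayes' rule for the positive predictive value: the probability of disease given a positive test, at a stated prevalence. The sensitivity, specificity and prevalences below are illustrative numbers, not from any study:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Bayes' rule: P(disease | positive test) at a given prevalence."""
    true_pos = sensitivity * prevalence                # P(positive and diseased)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(positive and healthy)
    return true_pos / (true_pos + false_pos)

# The same test performs very differently at different prevalences
rare = positive_predictive_value(0.90, 0.91, prevalence=0.01)
common = positive_predictive_value(0.90, 0.91, prevalence=0.30)
```

At a 1% prevalence most positives come from the large healthy group, so the PPV is low; at 30% it is far higher, even though the test itself has not changed.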

\n", 161 | "In this chapter I will look into the calculations and use of sensitivity, specificity, predictive values, incidence, cumulative risk, prevalence and some ratios." 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "# Sensitivity and specificity" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "## Introduction" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "

Let's start with short explanations of sensitivity and specificity in the context of healthcare. *Sensitivity* expresses the probability that a test will return a positive result given the existence of a (disease) condition. In contrast, *specificity* expresses the probability that the same test will return a negative result given the absence of that disease condition.

" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "## Calculating sensitivity and specifcity" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "### A simple example" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "

Consider a research study in which, for instance, 1000 volunteers are divided into two groups. In group one, a specific disease is absolutely shown to be present. A good example of this would be tissue taken for histological examination and proved to be diseased. This histological examination will be taken as the gold standard. In group two, this disease is absolutely known to be absent.

\n", 204 | "

Study participants are then investigated with a new test. Results for this test are either *positive* or *negative*. Four results follow from this.\n", 205 | "* Group with the disease\n", 206 | " * A positive result\n", 207 | " * A negative result\n", 208 | "* Group without the disease\n", 209 | " * A positive result\n", 210 | " * A negative result\n", 211 | " \n", 212 | "

We use specific words to describe these results. A positive result in a participant with the disease is termed a *true positive*. A negative result in a participant with the disease is a *false negative*.

\n", 213 | "

A negative result in a participant without the disease is a *true negative*, and a positive result in this group would be a *false positive*.

\n", 214 | "

*Sensitivity* is then a simple ratio between the number of true positives and the total number of participants with the disease. *Specificity* is the ratio between the number of true negatives and the total number of participants without the disease.

\n", 215 | "

Sensitivity and specificity should be used and understood from the point of view of the decision to use a test in the work-up of a condition. If a test has a high sensitivity, it will correctly identify a high percentage of patients who have the disease. Likewise, if a test has a high specificity, it will correctly identify a large percentage of patients who do not have that disease.

\n", 216 | "

Note also how crucial it is to have a gold standard with which to divide the groups.

" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "

In this simple example of 1000 participants, let's suppose that 10 individuals have the disease and an overwhelming 990 do not. Of the 10 with the disease, nine test positive (true positives) and one tests negative (a false negative). Of the 990 without the disease, 90 test positive (false positives) and 900 test negative (true negatives).

\n", 224 | "

Sensitivity (*Sn*) is a simple fraction with the true positives (*PT*) in the numerator and the sum of the true positives (*PT*) and the false negatives (*FN*) (in other words, all 10 with the disease) in the denominator.

\n", 225 | "

Specificity (*Sp*) has true negative (*NT*) in the numerator and the sum of the true negative (*NT*) and false positive (*PF*) (sum of all without the disease) in the denominator.

" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "$$ { S }_{ N }=\\frac { { P }_{ T } }{ { P }_{ T }+{ F }_{ N } } $$" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "$$ {{S}_{P} = \\frac {{N}_{T}}{{{N}_{T}} + {{P}_{F}}}} $$" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "

Our code could look something like this:

" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": 3, 252 | "metadata": { 253 | "collapsed": false 254 | }, 255 | "outputs": [], 256 | "source": [ 257 | "true_positive = 9 # Create a computer variable called true_positive and give it the value 9\n", 258 | "false_negative = 1 # True and False are python code words, so don't use these as computer\n", 259 | "# variable names\n", 260 | "false_positive = 90\n", 261 | "true_negative = 900\n", 262 | "\n", 263 | "sensitivity = true_positive / (true_positive + false_negative) # A simple equation\n", 264 | "specificity = true_negative / (true_negative + false_positive) # Note to add denominator values\n", 265 | "# in brackets to make the algebra (order of arithmetic execution) work out" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 4, 271 | "metadata": { 272 | "collapsed": false 273 | }, 274 | "outputs": [ 275 | { 276 | "name": "stdout", 277 | "output_type": "stream", 278 | "text": [ 279 | "The sensitivity is: 90.0 %\n" 280 | ] 281 | } 282 | ], 283 | "source": [ 284 | "print('The sensitivity is: ', (sensitivity * 100), '%')\n", 285 | "# Using the python 3.x print statement\n", 286 | "# Note the use of quotation marks for literal text\n", 287 | "# Multiplying the values with 100 to get percentage" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 5, 293 | "metadata": { 294 | "collapsed": false 295 | }, 296 | "outputs": [ 297 | { 298 | "name": "stdout", 299 | "output_type": "stream", 300 | "text": [ 301 | "The specificity is: 90.9090909090909 %\n" 302 | ] 303 | } 304 | ], 305 | "source": [ 306 | "print('The specificity is: ', (specificity * 100), '%')" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "

We could now say that this new test will pick up 90% of those with the disease with a positive result, and about 91% of those without the disease with a negative result.

" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "### Example using our mock data" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "

We'll do another example importing our mock data.

\n", 328 | "

I won't use the powerful pandas *.groupby()* method here; instead I'll laboriously construct new *DataFrame* objects, to make it easy to follow along. (This assumes *pandas* has already been imported as *pd*.)

" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 6, 334 | "metadata": { 335 | "collapsed": false 336 | }, 337 | "outputs": [ 338 | { 339 | "data": { 340 | "text/html": [ 341 | "
\n", 342 | "\n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | "
FileAgeGenderDelayStayICURVDCD4HRTempCRPWCCHBRuptureHistoCompMASS
0138Female36NoNoNaN9735.2NaN10.4910.4NoYesYes5
1232Male610NoYes5710938.845.37.0819.8NoNoYes8
2319Female116NoNoNaN12036.310.713.008.7NoNoNo3
\n", 428 | "
" 429 | ], 430 | "text/plain": [ 431 | " File Age Gender Delay Stay ICU RVD CD4 HR Temp CRP WCC HB \\\n", 432 | "0 1 38 Female 3 6 No No NaN 97 35.2 NaN 10.49 10.4 \n", 433 | "1 2 32 Male 6 10 No Yes 57 109 38.8 45.3 7.08 19.8 \n", 434 | "2 3 19 Female 1 16 No No NaN 120 36.3 10.7 13.00 8.7 \n", 435 | "\n", 436 | " Rupture Histo Comp MASS \n", 437 | "0 No Yes Yes 5 \n", 438 | "1 No No Yes 8 \n", 439 | "2 No No No 3 " 440 | ] 441 | }, 442 | "execution_count": 6, 443 | "metadata": {}, 444 | "output_type": "execute_result" 445 | } 446 | ], 447 | "source": [ 448 | "data = pd.read_csv('MOOC_Mock.csv')\n", 449 | "# Imporitng the csv file and attaching it to the computer variable (object) arbitrarily called data\n", 450 | "data.head(3) # Inspecting the first 3 rows" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "

We will divide our data set into those with and without histologically proven appendicitis. The histological evaluation is our gold standard. We will now look at using a white cell count (WCC) of 12 or more as a positive result.

" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 7, 463 | "metadata": { 464 | "collapsed": false 465 | }, 466 | "outputs": [], 467 | "source": [ 468 | "appen_pos = data[data['Histo'] == 'Yes'] # A new DataFrame including only those with\n", 469 | "# histologically proven appendicitis\n", 470 | "appen_neg = data[data['Histo'] == 'No']\n", 471 | "\n", 472 | "tp_all = appen_pos[appen_pos['WCC'] >= 12] # A new DataFrame object from the appen_pos\n", 473 | "# DataFrame with a WCC of 12 or more as a positive result, hence true positive\n", 474 | "fn_all = appen_pos[appen_pos['WCC'] < 12]\n", 475 | "\n", 476 | "tn_all = appen_neg[appen_neg['WCC'] < 12]\n", 477 | "fp_all = appen_neg[appen_neg['WCC'] >= 12]" 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": 8, 483 | "metadata": { 484 | "collapsed": false 485 | }, 486 | "outputs": [ 487 | { 488 | "data": { 489 | "text/plain": [ 490 | "count 87.000000\n", 491 | "mean 16.507471\n", 492 | "std 3.729035\n", 493 | "min 12.000000\n", 494 | "25% 13.605000\n", 495 | "50% 15.650000\n", 496 | "75% 17.720000\n", 497 | "max 26.400000\n", 498 | "Name: WCC, dtype: float64" 499 | ] 500 | }, 501 | "execution_count": 8, 502 | "metadata": {}, 503 | "output_type": "execute_result" 504 | } 505 | ], 506 | "source": [ 507 | "tp_all['WCC'].dropna().describe() # Just interested in the counts here" 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": 9, 513 | "metadata": { 514 | "collapsed": false 515 | }, 516 | "outputs": [ 517 | { 518 | "data": { 519 | "text/plain": [ 520 | "count 31.000000\n", 521 | "mean 9.361935\n", 522 | "std 2.010896\n", 523 | "min 2.950000\n", 524 | "25% 8.195000\n", 525 | "50% 9.390000\n", 526 | "75% 10.650000\n", 527 | "max 11.990000\n", 528 | "Name: WCC, dtype: float64" 529 | ] 530 | }, 531 | "execution_count": 9, 532 | "metadata": {}, 533 | "output_type": "execute_result" 534 | } 535 | ], 536 | "source": [ 537 | 
"fn_all['WCC'].dropna().describe()" 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": 10, 543 | "metadata": { 544 | "collapsed": false 545 | }, 546 | "outputs": [ 547 | { 548 | "data": { 549 | "text/plain": [ 550 | "count 18.000000\n", 551 | "mean 8.375000\n", 552 | "std 2.481184\n", 553 | "min 4.220000\n", 554 | "25% 6.795000\n", 555 | "50% 8.155000\n", 556 | "75% 10.542500\n", 557 | "max 11.700000\n", 558 | "Name: WCC, dtype: float64" 559 | ] 560 | }, 561 | "execution_count": 10, 562 | "metadata": {}, 563 | "output_type": "execute_result" 564 | } 565 | ], 566 | "source": [ 567 | "tn_all['WCC'].dropna().describe()" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": 11, 573 | "metadata": { 574 | "collapsed": false 575 | }, 576 | "outputs": [ 577 | { 578 | "data": { 579 | "text/plain": [ 580 | "count 11.000000\n", 581 | "mean 16.730909\n", 582 | "std 3.835021\n", 583 | "min 13.000000\n", 584 | "25% 14.155000\n", 585 | "50% 15.970000\n", 586 | "75% 17.780000\n", 587 | "max 26.190000\n", 588 | "Name: WCC, dtype: float64" 589 | ] 590 | }, 591 | "execution_count": 11, 592 | "metadata": {}, 593 | "output_type": "execute_result" 594 | } 595 | ], 596 | "source": [ 597 | "fp_all['WCC'].dropna().describe()" 598 | ] 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": 12, 603 | "metadata": { 604 | "collapsed": false 605 | }, 606 | "outputs": [], 607 | "source": [ 608 | "tp = 87\n", 609 | "fn = 31\n", 610 | "tn = 18\n", 611 | "fp = 11\n", 612 | "\n", 613 | "sens = tp / (tp + fn)\n", 614 | "spec = tn / (tn + fp)" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 13, 620 | "metadata": { 621 | "collapsed": false 622 | }, 623 | "outputs": [ 624 | { 625 | "name": "stdout", 626 | "output_type": "stream", 627 | "text": [ 628 | "The sensitivity of a raised WCC in appendicitis is: 73.72881355932203 %\n" 629 | ] 630 | } 631 | ], 632 | "source": [ 633 | "print('The sensitivity of a raised WCC 
in appendicitis is: ', (sens * 100), '%')" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": 14, 639 | "metadata": { 640 | "collapsed": false 641 | }, 642 | "outputs": [ 643 | { 644 | "name": "stdout", 645 | "output_type": "stream", 646 | "text": [ 647 | "The specificity of a raised WCC in appendicitis is: 62.06896551724138 %\n" 648 | ] 649 | } 650 | ], 651 | "source": [ 652 | "print('The specificity of a raised WCC in appendicitis is: ', (spec * 100), '%')" 653 | ] 654 | }, 655 | { 656 | "cell_type": "markdown", 657 | "metadata": {}, 658 | "source": [ 659 | "

There we have it. A sensitivity and specificity value for a raised WCC in our data set.

" 660 | ] 661 | }, 662 | { 663 | "cell_type": "markdown", 664 | "metadata": {}, 665 | "source": [ 666 | "# Predicitive values" 667 | ] 668 | }, 669 | { 670 | "cell_type": "markdown", 671 | "metadata": {}, 672 | "source": [ 673 | "## Introduction" 674 | ] 675 | }, 676 | { 677 | "cell_type": "markdown", 678 | "metadata": {}, 679 | "source": [ 680 | "

Predictive values make use of a form of Bayesian statistics. Bayesian inference uses past knowledge (as gained from research) to predict, in essence, the future.

" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": {}, 686 | "source": [ 687 | "

In contrast to sensitivity and specificity, predictive values should be viewed from the perspective that the test has already been performed. The values can be expressed either as a *positive* predictive value, the probability that a patient has the disease given a positive test result, or as a *negative* predictive value, the probability that the patient does not have the disease given a negative test result.

\n", 688 | "

Predictive values should be viewed only once a clear indication of the prevalence of the disease in the particular study is given, as this greatly influences the result. In order for predictive values to be used properly, they have to be converted to reflect the prevalence of a disease in the patients seen by a practicing clinician, in his or her setting.

" 689 | ] 690 | }, 691 | { 692 | "cell_type": "markdown", 693 | "metadata": {}, 694 | "source": [ 695 | "## Calculating some values" 696 | ] 697 | }, 698 | { 699 | "cell_type": "markdown", 700 | "metadata": {}, 701 | "source": [ 702 | "

Considering the first research study example above, in which (very importantly for our purposes here) participants were recruited randomly and thus accurately reflect the true prevalence of the disease (so that the predictive values will be accurate), calculating the predictive values is quite simple.

\n", 703 | "

The positive predictive value (*PP*) is the ratio between the true positives and the sum of the true positives and false positives. It expresses the probability that a patient will have the disease in the case of a positive test result.

\n", 704 | "

The negative predictive value (*PN*) is the ratio between the true negatives and the sum of the true negatives and false negatives. It expresses the probability that a patient does not have the disease, in the case of a negative test result.

" 705 | ] 706 | }, 707 | { 708 | "cell_type": "markdown", 709 | "metadata": {}, 710 | "source": [ 711 | "$$ {P}_{P} = \\frac {{P}_{T}}{{{P}_{T}} + {{F}_{P}}} $$" 712 | ] 713 | }, 714 | { 715 | "cell_type": "markdown", 716 | "metadata": {}, 717 | "source": [ 718 | "$$ {P}_{N} = \\frac {{N}_{T}}{{{N}_{T}} + {{F}_{N}}} $$" 719 | ] 720 | }, 721 | { 722 | "cell_type": "markdown", 723 | "metadata": {}, 724 | "source": [ 725 | "

So, for our first example above we have:

" 726 | ] 727 | }, 728 | { 729 | "cell_type": "code", 730 | "execution_count": 15, 731 | "metadata": { 732 | "collapsed": false 733 | }, 734 | "outputs": [], 735 | "source": [ 736 | "ppv = true_positive / (true_positive + false_positive) # Positive predicitive value\n", 737 | "npv = true_negative / (true_negative + false_negative) # Negative predicitive value" 738 | ] 739 | }, 740 | { 741 | "cell_type": "code", 742 | "execution_count": 16, 743 | "metadata": { 744 | "collapsed": false 745 | }, 746 | "outputs": [ 747 | { 748 | "name": "stdout", 749 | "output_type": "stream", 750 | "text": [ 751 | "The positive predicitive value is 9.090909090909092 %\n", 752 | "The negative predicitive value is 99.88901220865705 %\n" 753 | ] 754 | } 755 | ], 756 | "source": [ 757 | "print('The positive predicitive value is ', (ppv * 100), '%')\n", 758 | "print('The negative predicitive value is ', (npv * 100), '%')" 759 | ] 760 | }, 761 | { 762 | "cell_type": "markdown", 763 | "metadata": {}, 764 | "source": [ 765 | "

Just look how low the positive predictive value is. The chance of actually having the disease when the test result comes back positive is only 9%. It should be clear that this is highly dependent on the prevalence of the disease in our study population. Here we only had 10 out of 1000 patients with the disease (a 1% prevalence).

\n", 766 | "

If your research sample does not accurately reflect the population, the predictive values can be highly skewed and not representative of the population at large.

" 767 | ] 768 | }, 769 | { 770 | "cell_type": "markdown", 771 | "metadata": {}, 772 | "source": [ 773 | "

For instance, in our appendicitis study, with a WCC of less than 12 as a negative result, the negative predictive value would only be:

" 774 | ] 775 | }, 776 | { 777 | "cell_type": "code", 778 | "execution_count": 17, 779 | "metadata": { 780 | "collapsed": false 781 | }, 782 | "outputs": [ 783 | { 784 | "name": "stdout", 785 | "output_type": "stream", 786 | "text": [ 787 | "The negative predicitive value of a normal WCC (the chance of not having appendicitis given a negative result is 36.734693877551024 %\n" 788 | ] 789 | } 790 | ], 791 | "source": [ 792 | "print('The negative predicitive value of a normal WCC (the chance of not having appendicitis given a negative result is', (tn * 100 / (tn + fn)), '%')" 793 | ] 794 | }, 795 | { 796 | "cell_type": "markdown", 797 | "metadata": {}, 798 | "source": [ 799 | "

That makes no sense! In our sample, far more participants had appendicitis than not, so the apparent prevalence is much higher than in any real population, dragging the negative predictive value down.

" 800 | ] 801 | }, 802 | { 803 | "cell_type": "code", 804 | "execution_count": 18, 805 | "metadata": { 806 | "collapsed": false 807 | }, 808 | "outputs": [ 809 | { 810 | "name": "stdout", 811 | "output_type": "stream", 812 | "text": [ 813 | "The positive predicitive value is : 88.77551020408163 %\n" 814 | ] 815 | } 816 | ], 817 | "source": [ 818 | "print('The positive predicitive value is :', (tp * 100 / (tp + fp)), '%')" 819 | ] 820 | }, 821 | { 822 | "cell_type": "markdown", 823 | "metadata": {}, 824 | "source": [ 825 | "## Correcting a positive predictive value (PP)" 826 | ] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": {}, 831 | "source": [ 832 | "

In order to convert positive predictive values to a population group with a different prevalence, first consider the pre-test odds (*OP*). That is the ratio between having the disease and not having the disease in the group under consideration. This is simply done by dividing the prevalence (*Pv*) by one minus the prevalence, with the prevalence being a simple fraction of those with the disease (*d*) over the total number in the study (*n*).

" 833 | ] 834 | }, 835 | { 836 | "cell_type": "markdown", 837 | "metadata": {}, 838 | "source": [ 839 | "$$ {{P}_{v} = \\frac {d}{n}} $$" 840 | ] 841 | }, 842 | { 843 | "cell_type": "markdown", 844 | "metadata": {}, 845 | "source": [ 846 | "$$ {{O}_{P} = \\frac {{P}_{v}}{1 - {{P}_{v}}}} $$" 847 | ] 848 | }, 849 | { 850 | "cell_type": "markdown", 851 | "metadata": {}, 852 | "source": [ 853 | "

This result has to be multiplied by the likelihood ratio (*RL*). This is the sensitivity divided by 1 minus the specificity.

" 854 | ] 855 | }, 856 | { 857 | "cell_type": "markdown", 858 | "metadata": {}, 859 | "source": [ 860 | "$$ {{R}_{L} = \frac {{S}_{N}}{1 - {{S}_{P}}}} $$" 861 | ] 862 | }, 863 | { 864 | "cell_type": "markdown", 865 | "metadata": {}, 866 | "source": [ 867 | "

Now we get the post-test odds (*OO*) by multiplying the pre-test odds and the likelihood ratio:

" 868 | ] 869 | }, 870 | { 871 | "cell_type": "markdown", 872 | "metadata": {}, 873 | "source": [ 874 | "$$ {{O}_{O} = {{{O}_{P}} \\times {{R}_{L}}}} $$" 875 | ] 876 | }, 877 | { 878 | "cell_type": "markdown", 879 | "metadata": {}, 880 | "source": [ 881 | "

Now we calculate the new positive predictive value (*PN*), which is the post-test odds divided by one plus the post-test odds:

" 882 | ] 883 | }, 884 | { 885 | "cell_type": "markdown", 886 | "metadata": {}, 887 | "source": [ 888 | "$$ {{P}_{N} = {\frac {{O}_{O}}{1 + {O}_{O}}}} $$" 889 | ] 890 | }, 891 | { 892 | "cell_type": "markdown", 893 | "metadata": {}, 894 | "source": [ 895 | "

Remember that in our test example above the positive predictive value was 9.1%. Let's do that again, but consider a population in which the prevalence of the disease is only 0.5% (0.005).
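For reference, the whole chain can be written out self-contained. This is only a sketch: it plugs in the first example's test characteristics (sensitivity 0.9, specificity 900/990) and uses the likelihood ratio exactly as defined in the text (sensitivity over one minus specificity); the variable names are my own.

```python
# Sketch: converting a positive predictive value to a population with a
# different (assumed) prevalence, using the first example's test
prevalence = 0.005        # assumed population prevalence of 0.5%
sensitivity = 0.9         # first example: 9 of 10 with the disease test positive
specificity = 900 / 990   # first example: 900 of 990 without the disease test negative

pre_test_odds = prevalence / (1 - prevalence)
likelihood_ratio = sensitivity / (1 - specificity)  # as defined in the text
post_test_odds = pre_test_odds * likelihood_ratio
new_ppv = post_test_odds / (1 + post_test_odds)

print('New positive predictive value:', round(new_ppv * 100, 1), '%')
```

With these inputs the new positive predictive value works out to roughly 4.7%; the exact figure depends entirely on which sensitivity and specificity are plugged in.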

" 896 | ] 897 | }, 898 | { 899 | "cell_type": "code", 900 | "execution_count": 19, 901 | "metadata": { 902 | "collapsed": false 903 | }, 904 | "outputs": [ 905 | { 906 | "name": "stdout", 907 | "output_type": "stream", 908 | "text": [ 909 | "The positive predicitive value in a population with a prevalence of only 0.5% is: 3.5726018396846264 %\n" 910 | ] 911 | } 912 | ], 913 | "source": [ 914 | "pre_test_odds = 0.005 / (1 - 0.005)\n", 915 | "likelihood_ratio = sens / (1 - sensitivity) # Remember sensitivity was our computer variable for sensitivity\n", 916 | "# in the test example above\n", 917 | "post_test_odds = pre_test_odds * likelihood_ratio\n", 918 | "new_PPV = post_test_odds / (1 + post_test_odds)\n", 919 | "\n", 920 | "print('The positive predicitive value in a population with a prevalence of only 0.5% is: ', new_PPV * 100, '%')" 921 | ] 922 | }, 923 | { 924 | "cell_type": "markdown", 925 | "metadata": {}, 926 | "source": [ 927 | "# Incidence, cumulative risk, and prevalence
Difference in rates and absolute difference in risk" 928 | ] 929 | }, 930 | { 931 | "cell_type": "markdown", 932 | "metadata": {}, 933 | "source": [ 934 | "## Introduction" 935 | ] 936 | }, 937 | { 938 | "cell_type": "markdown", 939 | "metadata": {}, 940 | "source": [ 941 | "

While major parts of these lectures are devoted to statistical analysis and comparisons between groups, it is important to consider and measure the absolute and relative frequencies with which parameters in a study present. In this section I will commence with a description of the various methods of expressing the frequency of a parameter before moving on to comparing frequencies to each other in the form of differences.

\n", 942 | "

There are three main measures of frequency: incidence (incidence rate), cumulative risk (cumulative incidence) and finally prevalence. There are two main differences, namely a difference in rates and a difference in cumulative risk.

" 943 | ] 944 | }, 945 | { 946 | "cell_type": "markdown", 947 | "metadata": {}, 948 | "source": [ 949 | "## The incidence rate" 950 | ] 951 | }, 952 | { 953 | "cell_type": "markdown", 954 | "metadata": {}, 955 | "source": [ 956 | "

The word rate suggests time involvement and that is indeed the case. It measures the magnitude of occurrence over time and as such can only be used in studies that follow patients over time, such as cohort studies and randomized trials.

\n", 957 | "

The incidence rate (*RI*) is usually expressed in terms of events per person per year. This suggests a simple equation: division of the number of events by the person-years of follow-up, with the events as *en*, the sample size as *n* and the average follow-up time as *t* with a bar (line) over it.

" 958 | ] 959 | }, 960 | { 961 | "cell_type": "markdown", 962 | "metadata": {}, 963 | "source": [ 964 | "$$ {{R}_{I} = \\frac {{e}_{n}}{n \\times \\overline{t}}} $$" 965 | ] 966 | }, 967 | { 968 | "cell_type": "markdown", 969 | "metadata": {}, 970 | "source": [ 971 | "

So five events occurring in 100 patients, followed up for one year on average, would give an incidence rate of 5 events per 100 patients per year.
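As a minimal sketch (variable names are my own), the equation above applied to these numbers:

```python
# Incidence rate: events divided by person-years (sample size times mean follow-up)
events = 5
n = 100
mean_follow_up_years = 1.0

incidence_rate = events / (n * mean_follow_up_years)  # events per person-year
print(incidence_rate * 100, 'events per 100 patients per year')  # 5.0
```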

" 972 | ] 973 | }, 974 | { 975 | "cell_type": "markdown", 976 | "metadata": {}, 977 | "source": [ 978 | "## Cumulative risk" 979 | ] 980 | }, 981 | { 982 | "cell_type": "markdown", 983 | "metadata": {}, 984 | "source": [ 985 | "

Cumulative risk is also known as cumulative incidence (*IC*). It expresses the number of events as a percentage of a larger total. This equation makes use of the number of events per person involved in the study. It is less powerful than the incidence rate: because time is not included in the calculation, it can change dramatically with shortening or extending of the study period. Following a given sample over a longer period, for instance, might result in more events occurring. Likewise, a shorter follow-up may falsely show a smaller cumulative risk, as not enough time might have passed for events to occur.
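The time sensitivity just described is easy to demonstrate with hypothetical counts (a sketch; the numbers are my own):

```python
# Cumulative incidence: events per person enrolled, ignoring follow-up time
n = 100
events_after_1_year = 5    # hypothetical count after one year
events_after_2_years = 10  # more events accrue if the same cohort is watched longer

ci_short = events_after_1_year / n   # 5%
ci_long = events_after_2_years / n   # 10% -- doubles with the longer window

# The incidence rate is unchanged: 5 events per 100 person-years in both cases
rate_short = events_after_1_year / (n * 1)
rate_long = events_after_2_years / (n * 2)
print(ci_short, ci_long, rate_short, rate_long)
```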

" 986 | ] 987 | }, 988 | { 989 | "cell_type": "markdown", 990 | "metadata": {}, 991 | "source": [ 992 | "$$ {{I}_{C} = \\frac {{e}_{n}}{n}} $$" 993 | ] 994 | }, 995 | { 996 | "cell_type": "markdown", 997 | "metadata": {}, 998 | "source": [ 999 | "## Prevalence" 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "markdown", 1004 | "metadata": {}, 1005 | "source": [ 1006 | "

Prevalence gives the percentage of occurrence of an event at any point in time. The calculation is a simple ratio of events to the number of people involved (which, bar the context, is calculated in the same way as cumulative risk). Multiplying the ratio by 100 expresses the result as a percentage.

" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "markdown", 1011 | "metadata": {}, 1012 | "source": [ 1013 | "## Difference in rates" 1014 | ] 1015 | }, 1016 | { 1017 | "cell_type": "markdown", 1018 | "metadata": {}, 1019 | "source": [ 1020 | "

Calculating the difference in the rates between two events is the most accurate method of expressing outcome. It is simply a subtraction of one rate from another. If the incidence rate of a complication using drug A is 2.1 events per 100 people per year and the incidence rate of the same complication while taking drug B is 4.5 events per 100 people per year, then the difference is simply a subtraction of one value from the other, here resulting in a difference rate of 2.4 events per 100 people per year.

" 1021 | ] 1022 | }, 1023 | { 1024 | "cell_type": "markdown", 1025 | "metadata": {}, 1026 | "source": [ 1027 | "

The difference rate can also be used to calculate the very useful and often-quoted value, the **number needed to treat**. The number needed to treat is the reciprocal of the difference rate. Depending on the specific circumstances, it represents the number of patients who need to be treated for one additional event to be prevented (or, for a harmful exposure, to occur).
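A sketch of both calculations, using the drug A and drug B rates from the paragraph above (the treat-for-one-year interpretation of the number needed to treat is my assumption):

```python
# Difference in incidence rates, per 100 people per year
rate_a = 2.1   # complication rate on drug A
rate_b = 4.5   # complication rate on drug B

rate_difference = rate_b - rate_a   # 2.4 events per 100 people per year

# Number needed to treat: reciprocal of the rate difference per person
nnt = 1 / (rate_difference / 100)
print('Rate difference:', round(rate_difference, 1), 'per 100 person-years')
print('Number needed to treat:', round(nnt, 1))  # about 41.7 patient-years
```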

" 1028 | ] 1029 | }, 1030 | { 1031 | "cell_type": "markdown", 1032 | "metadata": {}, 1033 | "source": [ 1034 | "## Difference in cumulative risk" 1035 | ] 1036 | }, 1037 | { 1038 | "cell_type": "markdown", 1039 | "metadata": {}, 1040 | "source": [ 1041 | "

As with the difference rate, the difference in cumulative risk is merely a subtraction of one cumulative risk from another, resulting in a value expressed in the same units as the original cumulative risks. As an example, if the cumulative risk of a particular adverse event using drug A is 3.8% and the cumulative risk for the same adverse event using drug B is 1%, then drug B gives a 2.8% decrease in the cumulative risk for that adverse event.

" 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "markdown", 1046 | "metadata": {}, 1047 | "source": [ 1048 | "# Rate, risk and odds ratios" 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "markdown", 1053 | "metadata": {}, 1054 | "source": [ 1055 | "## Rate ratios" 1056 | ] 1057 | }, 1058 | { 1059 | "cell_type": "markdown", 1060 | "metadata": {}, 1061 | "source": [ 1062 | "

In the prior sections that dealt with differences, I performed simple subtraction. Ratios imply *division* and the rate ratio is the ratio of incidence rates, i.e. dividing one incidence rate by another. These divisions result in dimensionless values that require interpretation.\n", 1063 | "The *hazards ratio* is a special kind of rate ratio and looks at instantaneous incidence rates, calculated using Cox regression.

\n", 1064 | "

The rate ratio can take one of three sets of values, the first being exactly *1.0*. This would require the incidence rates of two events to be equal. It is called the null value and is interpreted as no difference. A value of less than *1.0* would imply a protective effect as the causative agent in the numerator lowers the rate and a value of more than *1.0* is seen as harmful and indicates an increase in risk.

\n", 1065 | "

The rate ratio is usually expressed in a very particular manner in the literature and requires some explanation. Suppose that drug A causes 2.1 events per 100 patients per year and drug B causes 4.5 events per 100 patients per year. Dividing 2.1 by 4.5 results in 0.47. Drug A can therefore be seen as having a protective effect, inasmuch as the value is less than *1.0*. By subtracting 0.47 from *1.0*, the resultant 0.53 can be expressed as a percentage, 53%. This is the value that is usually quoted in the literature, by stating that drug A reduces the rate of the particular adverse event by 53%.

\n", 1066 | "

If the rate ratio is more than *1.0*, *1.0* is subtracted from the rate ratio. As an example, the rate ratio might be 2.4. Subtracting *1.0* from 2.4 leaves 1.4. This would be expressed as an increase in the incidence rate of the event by 140%.
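A sketch of the arithmetic just described, carried at full precision (drug names as above):

```python
# Rate ratio and its usual percentage interpretation
rate_a = 2.1   # events per 100 patients per year on drug A
rate_b = 4.5   # events per 100 patients per year on drug B

rate_ratio = rate_a / rate_b        # about 0.47: protective, below 1.0
reduction = (1 - rate_ratio) * 100  # about a 53% reduction in the rate

# A rate ratio above 1.0 is read the other way around
harmful_ratio = 2.4
increase = (harmful_ratio - 1.0) * 100  # a 140% increase in the rate
print(round(rate_ratio, 2), round(reduction), round(increase))
```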

" 1067 | ] 1068 | }, 1069 | { 1070 | "cell_type": "markdown", 1071 | "metadata": {}, 1072 | "source": [ 1073 | "## Risk ratio" 1074 | ] 1075 | }, 1076 | { 1077 | "cell_type": "markdown", 1078 | "metadata": {}, 1079 | "source": [ 1080 | "

The risk ratio can be expressed as the division of two cumulative risk values or two prevalence values. As with the rate ratio the resultant value can be subtracted from *1.0* (or *1.0* subtracted from it) and multiplied by 100%, so as to express the value as a decrease or an increase in risk or prevalence.

" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "markdown", 1085 | "metadata": {}, 1086 | "source": [ 1087 | "## Odds ratio" 1088 | ] 1089 | }, 1090 | { 1091 | "cell_type": "markdown", 1092 | "metadata": {}, 1093 | "source": [ 1094 | "

Odds is indeed a very peculiar term and must always be interpreted with caution and in view of how common or rare an event is. Whereas rates and risks express the probability, or likelihood, of occurrence (chance), the odds are just that: the odds of an occurrence.

\n", 1095 | "

It is useful for two reasons. Firstly, when dealing with case-control studies, it is the only valid measure of relative risk. Remember that in a case-control study the numbers of cases and controls are chosen beforehand. This choice directly determines the apparent rates and risks, and is set by the study design. Secondly, it is usually calculated by logistic regression (and then expressed as an adjusted odds ratio, as opposed to an unadjusted odds ratio). Using logistic regression can compensate for confounding.

\n", 1096 | "

Odds are the ratio of the probability of an event occurring divided by the probability of the event not occurring. If the probability of an event occurring is a half (0.5), then the probability of it not occurring is 0.5. This gives odds of 1. Let's do another example. If the probability of an event occurring is 0.75, then the probability of it not occurring is 0.25 (1.0 - 0.75), giving odds of 3 to 1, or simply 3.

\n", 1097 | "

For rare events, arbitrarily chosen as equal to or less than 10%, risks and odds are close in value, but when events are more common, the odds are much higher than the risk.
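A small sketch makes the 10% rule concrete (the helper function is my own):

```python
# Odds: probability of the event over probability of no event
def odds(p):
    return p / (1 - p)

print(odds(0.5))    # 1.0 -- even odds
print(odds(0.75))   # 3.0 -- i.e. 3 to 1

# Rare events: risk and odds nearly agree; common events: they diverge
print(odds(0.10))   # about 0.11, close to the 10% risk
print(odds(0.60))   # about 1.5, far from the 60% risk
```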

" 1098 | ] 1099 | }, 1100 | { 1101 | "cell_type": "markdown", 1102 | "metadata": {}, 1103 | "source": [ 1104 | "### An example" 1105 | ] 1106 | }, 1107 | { 1108 | "cell_type": "markdown", 1109 | "metadata": {}, 1110 | "source": [ 1111 | "

The concept of odds is best explained by an example. Consider two parameters. The first is categorical with three values: low, moderate, and high. The second is expressed as the prevalence of some event among the patients in each group of the first parameter.

\n", 1112 | "

Let's say that for the low group the prevalence of parameter 2 is 0.9%, for the moderate group it is 2.3% and for the high group it is 3.0%.

\n", 1113 | "

The risk ratio in the low group is the reference: (0.9% / 0.9%) = *1.0*. For the moderate group it is (2.3% / 0.9%) = 2.56, and for the high group it is (3.0% / 0.9%) = 3.33.

\n", 1114 | "

Now, let's do the odds ratios. Remember, an odds is the ratio of the probability of something occurring to the probability of it not occurring. So for the low group it will be 0.9% divided by 99.1% (note I am expressing values in percent here), which is 0.0091. This is our reference. The odds of the event in the moderate group are 2.3% for and therefore 97.7% against. The odds for this group are then 2.3% / 97.7%, which is 0.0235. Dividing this by our reference of 0.0091 gives an odds ratio of 2.58. For the high group we have odds for the event of 3% and thus 97% against, giving odds of (3% / 97%) = 0.031. This divided by our reference (0.0091) gives an odds ratio of 3.41.

" 1115 | ] 1116 | }, 1117 | { 1118 | "cell_type": "markdown", 1119 | "metadata": {}, 1120 | "source": [ 1121 | "

Again, we interpret this as being above or below 1.00 and convert to a percentage. So, 2.58 becomes (2.58 - 1.00) = 1.58, which is 158%. The odds ratio of 3.41 becomes 241%. That is an increase in the odds of the event in the high group over the low group of 241%.
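The same example can be sketched at full precision (the small differences from the hand-rounded 2.58 and 158% above are rounding only; the helper function is my own):

```python
# Odds ratios for the three-group example above, relative to the low group
prevalence = {'low': 0.009, 'moderate': 0.023, 'high': 0.030}

def odds(p):
    return p / (1 - p)

reference = odds(prevalence['low'])   # about 0.0091

for group in ('moderate', 'high'):
    odds_ratio = odds(prevalence[group]) / reference
    increase = (odds_ratio - 1) * 100  # percentage increase over the low group
    print(group, round(odds_ratio, 2), round(increase), '%')
```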

" 1122 | ] 1123 | }, 1124 | { 1125 | "cell_type": "markdown", 1126 | "metadata": {}, 1127 | "source": [ 1128 | "

The 10% rule applies here too. Be very careful when reading the literature and note when parameters are expressed as risk ratios versus odds ratios. One more example will make it crystal clear.

" 1129 | ] 1130 | }, 1131 | { 1132 | "cell_type": "code", 1133 | "execution_count": 20, 1134 | "metadata": { 1135 | "collapsed": false 1136 | }, 1137 | "outputs": [], 1138 | "source": [ 1139 | "data_odds_ratio = pd.read_csv('Odds_ratio.csv')" 1140 | ] 1141 | }, 1142 | { 1143 | "cell_type": "code", 1144 | "execution_count": 21, 1145 | "metadata": { 1146 | "collapsed": false 1147 | }, 1148 | "outputs": [ 1149 | { 1150 | "data": { 1151 | "text/html": [ 1152 | "
\n", 1153 | "\n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | "
Param_1preval_param_2risk_ratiomeaning_risk_ratioodds_param_2odds_ratiomeaning_odds_ratio
0low0.2refNaN0.25NaNNaN
1moderate0.632001.506500
2high0.94.53509.00363500
\n", 1199 | "
" 1200 | ], 1201 | "text/plain": [ 1202 | " Param_1 preval_param_2 risk_ratio meaning_risk_ratio odds_param_2 \\\n", 1203 | "0 low 0.2 ref NaN 0.25 \n", 1204 | "1 moderate 0.6 3 200 1.50 \n", 1205 | "2 high 0.9 4.5 350 9.00 \n", 1206 | "\n", 1207 | " odds_ratio meaning_odds_ratio \n", 1208 | "0 NaN NaN \n", 1209 | "1 6 500 \n", 1210 | "2 36 3500 " 1211 | ] 1212 | }, 1213 | "execution_count": 21, 1214 | "metadata": {}, 1215 | "output_type": "execute_result" 1216 | } 1217 | ], 1218 | "source": [ 1219 | "data_odds_ratio.head()" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "markdown", 1224 | "metadata": {}, 1225 | "source": [ 1226 | "

I've imported another *csv* spreadsheet file called *odds_ratio.csv* and show its only three rows. Here we have a much higher prevalence of the event (parameter 2) in the moderate and high groups. Note that for the high group there is a 350% increase in the risk ratio, but a 3500% increase in the odds ratio. There are examples in the literature where risk and odds values are used interchangeably, and without reading carefully, your impression of the results can be badly skewed.

" 1227 | ] 1228 | }, 1229 | { 1230 | "cell_type": "markdown", 1231 | "metadata": {}, 1232 | "source": [ 1233 | "

For a prevalence of up to about 10%, risk and odds ratios are very similar. As prevalence rises, though, they become very different!
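This last point can be demonstrated directly. In the hypothetical scenario below (my own illustration, not the chapter's data), an exposed group always has double the prevalence of the baseline group, so the risk ratio is fixed at 2.0; the odds ratio tracks it closely at low prevalence and runs away from it as prevalence climbs:

```python
# How the odds ratio drifts away from a fixed risk ratio as prevalence grows.
def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

for baseline in [0.01, 0.05, 0.10, 0.20, 0.40]:
    exposed = 2 * baseline          # risk ratio is 2.0 throughout
    rr = exposed / baseline
    or_ = odds(exposed) / odds(baseline)
    print(f"baseline {baseline:4.0%}: risk ratio {rr:.1f}, odds ratio {or_:.2f}")
```

As the baseline prevalence moves from 1% to 40%, the odds ratio climbs from about 2.02 to 6.00 while the risk ratio stays at 2.0.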

" 1234 | ] 1235 | }, 1236 | { 1237 | "cell_type": "code", 1238 | "execution_count": null, 1239 | "metadata": { 1240 | "collapsed": false 1241 | }, 1242 | "outputs": [], 1243 | "source": [] 1244 | } 1245 | ], 1246 | "metadata": { 1247 | "kernelspec": { 1248 | "display_name": "Python 3", 1249 | "language": "python", 1250 | "name": "python3" 1251 | }, 1252 | "language_info": { 1253 | "codemirror_mode": { 1254 | "name": "ipython", 1255 | "version": 3 1256 | }, 1257 | "file_extension": ".py", 1258 | "mimetype": "text/x-python", 1259 | "name": "python", 1260 | "nbconvert_exporter": "python", 1261 | "pygments_lexer": "ipython3", 1262 | "version": "3.4.3" 1263 | } 1264 | }, 1265 | "nbformat": 4, 1266 | "nbformat_minor": 0 1267 | } 1268 | -------------------------------------------------------------------------------- /IPython_Notebook_Toolbar.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/juanklopper/Python_for_Medical_Statistics/b9f62c416f016c4850e02b6da74392e2a307f1f8/IPython_Notebook_Toolbar.png -------------------------------------------------------------------------------- /Launcher.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/juanklopper/Python_for_Medical_Statistics/b9f62c416f016c4850e02b6da74392e2a307f1f8/Launcher.png -------------------------------------------------------------------------------- /MOOC_Mock.csv: -------------------------------------------------------------------------------- 1 | File,Age,Gender,Delay,Stay,ICU,RVD,CD4,HR,Temp,CRP,WCC,HB,Rupture,Histo,Comp,MASS 2 | 1,38,Female,3,6,No,No,NA,97,35.2,NA,10.49,10.4,No,Yes,Yes,5 3 | 2,32,Male,6,10,No,Yes,57,109,38.8,45.3,7.08,19.8,No,No,Yes,8 4 | 3,19,Female,1,16,No,No,NA,120,36.3,10.7,13,8.7,No,No,No,3 5 | 4,20,Female,2,9,No,Yes,NA,120,35.7,77.8,4.45,8.8,No,No,No,0 6 | 5,28,Female,3,3,No,Yes,491,115,37.1,51.6,21.98,13.4,No,Yes,No,7 7 | 
6,28,Female,2,13,No,Yes,312,133,38.3,92.5,10.34,9.9,No,No,Yes,6 8 | 7,36,Male,6,14,No,Yes,203,95,37.5,NA,16.21,17.7,Yes,Yes,Yes,7 9 | 8,19,Male,2,13,No,No,NA,102,38.5,224.6,8.99,15.5,No,Yes,Yes,5 10 | 9,46,Male,2,9,No,No,NA,107,36.2,385.1,10.26,15.5,No,Yes,No,1 11 | 10,19,Male,7,12,Yes,No,NA,105,36.6,123,22.19,13.7,No,Yes,Yes,5 12 | 11,33,Male,2,14,No,No,NA,104,36.8,NA,16.2,12.7,No,Yes,No,9 13 | 12,27,Female,6,7,No,No,NA,129,38.3,10.7,13.6,17.6,No,No,Yes,6 14 | 13,22,Male,1,11,No,No,NA,87,37.1,122,6.76,13.8,No,Yes,Yes,2 15 | 14,25,Male,1,6,No,No,NA,85,36.7,NA,13.2,17.8,No,Yes,No,9 16 | 15,19,Female,1,2,No,No,NA,100,36.6,6.6,8.39,16.2,No,Yes,Yes,4 17 | 16,35,Male,4,15,No,No,NA,108,39.3,14,10.6,20.2,No,Yes,Yes,8 18 | 17,25,Male,2,14,No,No,NA,130,35.9,5.4,13.96,19.2,No,Yes,No,4 19 | 18,23,Female,3,18,No,No,NA,98,37,92.5,12,10.8,No,Yes,No,2 20 | 19,64,Male,0,6,No,No,NA,129,35.2,33,15.1,17.9,No,Yes,Yes,7 21 | 20,33,Female,1,19,No,Yes,203,123,37.5,46.5,8,10.4,No,Yes,Yes,8 22 | 21,55,Male,5,10,No,No,NA,87,35.8,NA,NA,NA,No,No,Yes,3 23 | 22,22,Male,4,7,No,No,NA,100,38.5,91.8,19.24,12.8,No,No,Yes,6 24 | 23,33,Female,5,19,No,No,NA,98,37,178.3,26.19,10.7,No,No,Yes,0 25 | 24,43,Male,7,11,No,Yes,202,95,36.1,34,20,12,No,No,Yes,1 26 | 25,23,Male,5,5,No,No,NA,116,38.8,NA,8.9,18.1,No,Yes,No,2 27 | 26,23,Male,1,11,No,No,NA,91,38.3,167.5,14.81,16.6,No,Yes,Yes,0 28 | 27,32,Male,1,17,No,No,NA,103,37.9,285,12.66,20,Yes,Yes,Yes,6 29 | 28,43,Female,3,2,No,No,NA,116,37.9,NA,18.9,11.1,No,Yes,Yes,5 30 | 29,20,Female,6,9,No,Yes,290,82,38.1,Yes,10.1,13.6,No,No,No,8 31 | 30,20,Male,5,11,No,No,NA,80,37.2,23,9.92,19.9,No,Yes,No,5 32 | 31,20,Female,6,20,No,No,NA,103,37.2,NA,12.6,15.6,Yes,Yes,Yes,1 33 | 32,39,Male,2,10,No,Yes,102,100,36.8,208,20.43,12.9,Yes,Yes,No,1 34 | 33,33,Male,0,21,No,Yes,266,100,39.7,194.1,7.31,10.7,Yes,Yes,Yes,3 35 | 34,51,Female,1,17,Yes,No,NA,121,37.6,51,15.8,15.1,Yes,Yes,Yes,6 36 | 35,35,Male,1,12,No,No,NA,84,36.6,Yes,18.63,15.1,Yes,Yes,Yes,5 37 | 
36,26,Male,3,15,No,No,NA,131,36.6,44.9,16.6,11.3,Yes,Yes,No,0 38 | 37,30,Male,4,18,No,No,NA,129,36.5,8.3,10.32,18.7,Yes,Yes,Yes,7 39 | 38,19,Male,0,10,No,Yes,300,94,36.4,111,15.1,16.6,Yes,Yes,No,4 40 | 39,20,Male,3,17,No,No,NA,140,38,Yes,12.6,10.1,Yes,Yes,Yes,8 41 | 40,21,Male,3,10,No,No,NA,108,36.9,38.5,10.61,14,Yes,No,No,1 42 | 41,22,Male,4,9,No,No,NA,80,38.6,79.5,17.2,13.5,Yes,Yes,No,1 43 | 42,23,Female,1,5,No,Yes,400,103,39.8,11,23.55,15.3,Yes,Yes,No,5 44 | 43,30,Female,4,2,Yes,No,NA,138,39.3,22,7.59,19.9,Yes,Yes,Yes,4 45 | 44,27,Female,6,14,No,Yes,1129,93,37.2,102.6,8.51,11.4,Yes,No,Yes,0 46 | 45,23,Male,1,9,No,No,NA,118,37.7,280.7,21.16,10,Yes,Yes,Yes,5 47 | 46,23,Male,6,4,No,No,NA,107,36.5,13.9,22.5,17.9,Yes,Yes,No,5 48 | 47,20,Male,5,21,No,Yes,400,112,37.1,NA,24.89,18.2,Yes,Yes,No,2 49 | 48,46,Female,4,9,No,Yes,500,87,37.5,12,15.21,8.7,Yes,No,Yes,7 50 | 49,43,Female,4,6,No,Yes,600,111,36.5,12.8,13.14,17.2,Yes,Yes,No,2 51 | 50,19,Male,6,19,No,Yes,700,88,37.2,95.1,20.28,10.2,Yes,Yes,No,3 52 | 51,20,Female,6,12,No,Yes,800,102,37.4,20.5,13.59,19.4,Yes,Yes,Yes,6 53 | 52,25,Female,5,17,No,Yes,900,93,38,222,19.8,19.1,Yes,Yes,No,1 54 | 53,64,Female,0,15,No,Yes,1000,130,38.5,NA,17.5,13.9,Yes,Yes,Yes,6 55 | 54,26,Female,7,10,No,Yes,303,122,37.5,32,11.5,11.2,Yes,Yes,No,5 56 | 55,27,Female,3,15,No,Yes,405,90,35.9,NA,20.7,10.1,No,Yes,Yes,6 57 | 56,23,Male,1,5,No,Yes,NA,108,35.9,NA,15.65,13,No,Yes,No,9 58 | 57,29,Male,3,9,No,Yes,NA,93,36.2,123,11.99,18.8,No,Yes,No,1 59 | 58,28,Female,7,19,No,Yes,232,95,36.8,21.7,16.38,11.1,No,Yes,Yes,6 60 | 59,29,Male,7,3,No,Yes,NA,122,35.1,Yes,16.2,15.1,No,No,Yes,7 61 | 60,25,Female,3,14,Yes,Yes,NA,85,37.1,295.8,10.94,11.2,Yes,No,Yes,8 62 | 61,37,Male,0,20,No,Yes,654,115,37.1,Yes,7.21,9.9,No,No,Yes,4 63 | 62,47,Male,0,12,No,Yes,NA,82,37.7,NA,16.32,11.7,No,No,No,9 64 | 63,21,Female,5,14,No,Yes,NA,101,36.4,123,12.5,17.3,Yes,Yes,No,8 65 | 64,28,Male,3,11,No,Yes,234,99,39,38.7,15.48,13.7,No,Yes,No,1 66 | 
65,34,Female,3,9,No,Yes,NA,135,35.3,15.7,13.78,17.8,No,No,No,5 67 | 66,27,Male,2,4,No,Yes,NA,138,36.8,89.7,15.18,13.4,No,Yes,Yes,2 68 | 67,35,Male,2,5,No,Yes,754,82,36.1,25.3,12.14,16,No,Yes,Yes,6 69 | 68,35,Male,1,6,No,Yes,231,139,36.2,123,7.8,17.4,No,No,Yes,6 70 | 69,45,Female,3,13,No,Yes,1011,96,37,398.9,17.35,11,No,Yes,No,9 71 | 70,42,Male,3,13,No,Yes,234,117,35.9,9.9,12.15,17.2,No,Yes,No,5 72 | 71,20,Male,3,5,No,No,NA,100,37.7,132,14.23,12.5,No,Yes,No,6 73 | 72,22,Male,4,3,No,No,NA,136,36.9,NA,15.1,17.5,No,Yes,No,7 74 | 73,24,Male,0,17,No,No,NA,113,36.6,123,11.35,13.4,No,No,No,9 75 | 74,43,Male,3,9,No,No,NA,130,37.7,NA,26.4,11.2,No,Yes,Yes,6 76 | 75,22,Male,1,21,No,No,NA,92,37.4,NA,11.8,19.9,No,Yes,Yes,6 77 | 76,23,Male,6,17,No,No,NA,128,39.1,314.1,22.1,14.5,Yes,Yes,No,1 78 | 77,24,Female,6,7,No,No,NA,138,38,123,15.86,10,No,Yes,Yes,6 79 | 78,21,Male,3,14,No,No,NA,101,36.2,Yes,4.22,17.1,No,No,No,4 80 | 79,21,Female,0,4,No,No,NA,112,39.5,NA,16.1,16.2,Yes,Yes,Yes,7 81 | 80,48,Female,3,21,No,No,NA,110,36.4,123,9.39,18.2,No,Yes,No,3 82 | 81,28,Female,7,15,No,No,NA,82,36.5,44,8.67,8.2,No,Yes,No,8 83 | 82,22,Male,0,2,No,No,NA,102,37,66,13.1,14.6,Yes,Yes,Yes,8 84 | 83,23,Male,2,6,No,No,NA,99,37.2,77,12.2,16.2,No,Yes,Yes,8 85 | 84,27,Male,1,3,No,No,NA,86,36.7,88,6.61,15.2,No,Yes,No,2 86 | 85,34,Male,5,13,No,No,NA,94,37.2,99,10.7,18.8,Yes,Yes,No,2 87 | 86,33,Male,4,5,No,No,NA,121,37.6,69.1,14.32,19.5,Yes,Yes,No,3 88 | 87,32,Male,6,19,No,Yes,177,108,39.9,244.7,15.65,9.1,Yes,Yes,No,6 89 | 88,20,Male,5,6,No,No,NA,99,36.3,NA,8.97,18.5,Yes,Yes,No,3 90 | 89,20,Male,0,14,No,No,NA,109,36.2,Yes,12.2,9.5,Yes,Yes,Yes,9 91 | 90,24,Female,1,13,No,No,NA,81,38.3,NA,10.1,17.4,Yes,No,Yes,8 92 | 91,21,Male,7,2,No,No,NA,135,38.5,NA,24.14,10.8,Yes,Yes,No,1 93 | 92,26,Male,3,17,No,No,NA,134,37.6,34,13.69,10.4,Yes,Yes,Yes,7 94 | 93,20,Male,7,2,No,No,NA,95,35.7,Yes,10.08,8,No,Yes,No,5 95 | 94,33,Female,6,4,No,No,NA,138,36.7,NA,14.62,9,No,Yes,No,9 96 | 
95,30,Male,4,16,No,No,NA,127,35.2,Yes,16.93,15,No,Yes,Yes,9 97 | 96,44,Male,0,7,No,No,NA,140,37.3,NA,9.26,18.8,No,Yes,No,5 98 | 97,36,Male,6,13,No,Yes,399,86,39.3,NA,20.86,15.6,No,Yes,Yes,6 99 | 98,38,Male,1,17,No,Yes,NA,140,36.4,44,10.17,11.9,No,Yes,Yes,7 100 | 99,24,Female,4,21,No,No,NA,125,36.9,NA,4.72,17.3,No,No,No,7 101 | 100,26,Male,2,5,No,Yes,NA,82,38.1,304.5,7.7,17.5,No,No,No,1 102 | 101,45,Male,0,3,No,No,NA,110,36.6,78,14.53,10.1,No,No,Yes,8 103 | 102,31,Male,6,2,No,Yes,46,133,36.1,104.6,2.95,8.6,No,Yes,No,2 104 | 103,21,Male,2,12,No,No,NA,92,35.7,234.2,16.34,14,Yes,Yes,No,3 105 | 104,36,Female,4,6,No,Yes,26,107,37.1,253.4,12.4,14.5,Yes,Yes,No,3 106 | 105,44,Female,3,15,No,No,NA,84,38.4,114.8,26.19,12.9,Yes,Yes,No,3 107 | 106,33,Male,1,4,No,No,NA,140,37.8,350,12.16,12.9,Yes,Yes,No,6 108 | 107,24,Female,4,4,No,No,NA,108,37.1,NA,6.31,10,No,No,No,3 109 | 108,43,Female,7,13,No,No,NA,129,36.9,25.8,6.7,13.9,No,No,Yes,8 110 | 109,22,Male,5,13,No,No,NA,125,37,129.1,11.7,10.9,No,No,No,0 111 | 110,21,Male,5,15,No,No,NA,126,36.4,61.5,15.1,17.2,No,Yes,No,5 112 | 111,39,Female,7,12,Yes,Yes,12,132,35.3,135.8,8.9,11.1,Yes,Yes,No,7 113 | 112,28,Male,2,13,No,No,NA,125,38.3,NA,13.62,NA,No,Yes,No,5 114 | 113,32,Male,0,14,No,Yes,NA,126,38.5,NA,12.9,NA,Yes,Yes,No,2 115 | 114,18,Female,4,16,No,Yes,152,90,38.6,245.4,14.57,NA,Yes,Yes,Yes,9 116 | 115,36,Male,7,5,No,No,NA,93,37.7,NA,13.1,11.4,No,Yes,No,7 117 | 116,25,Male,7,18,No,No,NA,136,37,56,11.13,11.2,No,Yes,No,4 118 | 117,54,Male,2,9,No,No,NA,96,38.5,67.7,12.63,19.9,No,Yes,No,5 119 | 118,33,Male,1,19,No,No,NA,89,38.7,289.9,16.51,19.9,Yes,Yes,Yes,4 120 | 119,61,Male,7,8,No,No,NA,116,37.9,NA,11.69,11.3,Yes,Yes,Yes,0 121 | 120,20,Male,3,18,No,Yes,207,87,38,NA,21.7,18,Yes,Yes,No,5 122 | 121,19,Male,0,8,No,No,NA,108,36.5,286.6,16.8,20.3,No,Yes,Yes,5 123 | 122,26,Male,6,8,No,No,NA,129,38.3,38.2,16.48,NA,No,Yes,Yes,5 124 | 123,19,Female,5,4,No,No,NA,100,35.8,NA,NA,NA,No,Yes,Yes,5 125 | 
124,22,Female,2,7,No,No,NA,116,35.5,57.7,14.97,9.1,No,Yes,Yes,6 126 | 125,19,Male,1,9,No,No,NA,116,38.5,NA,10.91,7.9,No,No,No,7 127 | 126,18,Male,0,2,No,No,NA,103,37.3,277.9,11.97,16.1,No,Yes,Yes,5 128 | 127,23,Female,2,10,No,No,NA,105,39.5,NA,13.77,13.2,No,Yes,No,6 129 | 128,28,Male,3,21,No,No,NA,122,38.2,66.6,12.33,11,No,Yes,No,1 130 | 129,27,Female,6,16,No,No,NA,119,37.6,NA,15.45,NA,No,Yes,No,4 131 | 130,29,Male,1,5,Yes,No,NA,129,37.5,297.7,7,16.4,Yes,Yes,Yes,7 132 | 131,21,Male,2,6,No,No,NA,94,36.6,NA,16.64,14.7,Yes,Yes,No,5 133 | 132,31,Female,3,11,No,No,NA,134,36.1,162.5,14.5,18.2,Yes,Yes,No,1 134 | 133,24,Female,2,17,No,No,NA,104,38.6,NA,23.22,15.5,Yes,Yes,No,7 135 | 134,30,Male,6,16,No,No,NA,98,36.2,NA,14.1,8,No,Yes,No,8 136 | 135,31,Female,2,18,Yes,No,NA,133,39.2,290.9,17.51,20.9,No,Yes,No,5 137 | 136,31,Male,2,6,Yes,Yes,44,108,38.6,NA,12.69,12.3,No,Yes,No,6 138 | 137,28,Male,7,10,No,Yes,77,90,38.5,NA,22.7,19,No,Yes,Yes,8 139 | 138,45,Female,2,16,Yes,Yes,101,108,37.6,287.6,17.8,21.3,No,Yes,No,8 140 | 139,45,Male,6,9,Yes,Yes,300,127,37.1,39.2,17.48,NA,Yes,Yes,No,3 141 | 140,33,Female,0,3,No,No,NA,81,39.6,NA,NA,NA,Yes,Yes,No,1 142 | 141,28,Female,1,13,No,No,NA,102,37.2,58.7,15.97,10.1,Yes,No,Yes,6 143 | 142,67,Male,1,14,Yes,No,NA,108,40.2,NA,11.91,8.9,Yes,Yes,Yes,7 144 | 143,55,Male,4,13,No,No,NA,136,39.6,278.9,12.97,17.1,No,Yes,No,3 145 | 144,55,Female,0,9,No,No,NA,97,39.5,NA,14.77,14.2,No,Yes,Yes,8 146 | 145,25,Female,0,12,No,No,NA,109,38.6,67.6,13.33,12,No,Yes,Yes,1 147 | 146,31,Male,0,14,No,No,NA,120,38.1,NA,16.45,NA,No,Yes,Yes,7 148 | 147,45,Female,1,7,Yes,No,NA,120,40.6,298.7,8,17.4,No,Yes,Yes,4 149 | 148,54,Male,5,11,Yes,Yes,NA,115,38.2,NA,17.64,15.7,Yes,Yes,No,1 150 | 149,29,Male,4,6,No,Yes,45,133,41.2,163.5,15.5,19.2,Yes,Yes,Yes,9 151 | 150,55,Female,5,2,Yes,Yes,78,95,40.6,NA,24.22,16.5,Yes,Yes,No,4 152 | -------------------------------------------------------------------------------- /README.md: 
-------------------------------------------------------------------------------- 1 | # Python_for_Medical_Statistics 2 | YouTube course using IPython and Jupyter notebooks to do statistical analysis. 3 | -------------------------------------------------------------------------------- /Regression.csv: -------------------------------------------------------------------------------- 1 | PCT,CRP 2 | 2.6,32.6 3 | 9.8,88.2 4 | 4.9,44.6 5 | 7.7,84.2 6 | 2.3,14.3 7 | 9.7,106.8 8 | ,97.4 9 | 9.8,91.9 10 | 3.4,31.7 11 | 3.5,25 12 | 7,64.1 13 | 7.8,69.3 14 | 8.2,84.7 15 | 9.4, 16 | 5.4,45 17 | 8.1,77.9 18 | 5.5,55.8 19 | 4.7,51.4 20 | 5,59.3 21 | 8.6,92.3 22 | 5.9,54.5 23 | ,59 24 | 5.7,52.7 25 | 8.9,87 26 | 6.4,73.5 27 | 5.6,53.2 28 | 2.5,17.3 29 | 6.4,55.7 30 | 9.6,94 31 | -------------------------------------------------------------------------------- /Toolbar.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/juanklopper/Python_for_Medical_Statistics/b9f62c416f016c4850e02b6da74392e2a307f1f8/Toolbar.png -------------------------------------------------------------------------------- /odds_ratio.csv: -------------------------------------------------------------------------------- 1 | Param_1,preval_param_2,risk_ratio,meaning_risk_ratio,odds_param_2,odds_ratio,meaning_odds_ratio 2 | low,0.2,ref,,0.25,, 3 | moderate,0.6,3,200,1.5,6,500 4 | high,0.9,4.5,350,9,36,3500 5 | -------------------------------------------------------------------------------- /p_graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/juanklopper/Python_for_Medical_Statistics/b9f62c416f016c4850e02b6da74392e2a307f1f8/p_graph.png -------------------------------------------------------------------------------- /style.css: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 90 | 91 | 
-------------------------------------------------------------------------------- /wcc_crp.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/juanklopper/Python_for_Medical_Statistics/b9f62c416f016c4850e02b6da74392e2a307f1f8/wcc_crp.xlsx --------------------------------------------------------------------------------