├── .gitignore ├── 01. Permutation and Combinations.ipynb ├── 02. Probability and Rules of Probability.ipynb ├── 03. Bayes Theorem.ipynb ├── 04. Variables and Data types.ipynb ├── 05. Probability distributions.ipynb ├── 06. Measures of Central Tendency.ipynb ├── 07. Measures of Variability.ipynb ├── 08. Central Limit Theorem.ipynb ├── 09. Sampling and Sampling errors.ipynb ├── 10. Hypothesis Testing.ipynb ├── 11. Parametric Tests.ipynb ├── 12. Z-test.ipynb ├── 13. t-test.ipynb ├── 14. ANOVA - Analysis of Variance.ipynb ├── 15. Chi-Square Test for Independence and Goodness of Fit.ipynb ├── 16. Effect Size and Statistical Power.ipynb ├── 17.Statistical tests (Summarized).ipynb ├── LICENSE ├── README.md └── data ├── a1.png ├── cdf_ppf.png ├── csd.png ├── dpd.png ├── f_dist.png ├── gaussian-distribution.png ├── house-prices-advanced-regression-techniques └── data_description.txt ├── hypothesis.png ├── image-30.gif ├── kurtosis.png ├── lln.png ├── mean-median-mode.png ├── p1.png ├── p2.png ├── p3.png ├── p4.png ├── poi.png ├── snd.png ├── t_dist.png └── ud.png /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | # checkpoints 132 | .ipynb_checkpoints/ 133 | 134 | # data files (csv, txt) 135 | *.csv 136 | *.txt 137 | -------------------------------------------------------------------------------- /01. Permutation and Combinations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "nWAP42l0C3en" 7 | }, 8 | "source": [ 9 | "# Permutation and Combination" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "id": "wO-hsVKwC3er" 16 | }, 17 | "source": [ 18 | "The concepts of **Permutations** and **Combinations** pertain to different methods of arranging elements within a given set of objects. The primary distinction between them is the importance of order in the arrangement.\r\n", 19 | "\r\n", 20 | "A **permutation** considers the specific order in which elements are arranged. For example, the sequences \"ABC,\" \"BCA,\" and \"CAB\" are all distinct permutations of the same set of elements. \r\n", 21 | "\r\n", 22 | "In contrast, a **combination** treats the arrangement of elements as unordered. Therefore, combinations such as \"ABC,\" \"BCA,\" and \"CAB\" are all considered to be the same grouping of three characters.\r\n", 23 | "\r\n", 24 | "Let's explore some examples to better understand these concepts.\r\n" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "metadata": { 31 | "id": "VQcNkxNOC3es" 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import itertools\n", 36 | "\n", 37 | "character_set = {'A', 'B', 'C'} \n", 38 | "permutations_taking_two_elements = itertools.permutations(character_set, 2)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": { 45 | "colab": { 46 | "base_uri": "https://localhost:8080/" 47 | }, 48 | "id": "fKTmATJwE4Rx", 49 | "outputId": "cccad988-a0d0-4900-8790-235d79bc48dd" 50 | }, 51 | "outputs": [ 52 | { 53 | "name": "stdout", 54 | "output_type": "stream", 55 | "text": [ 56 | "('B', 'C')\n", 57 | "('B', 'A')\n", 58 | "('C', 'B')\n", 59 | "('C', 'A')\n", 60 | "('A', 'B')\n", 61 | "('A', 'C')\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "for i in permutations_taking_two_elements:\n", 67 | " print(i)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 3, 73 | "metadata": { 74 | "colab": { 75 | "base_uri": "https://localhost:8080/" 76 | }, 77 | "id": "TO6cDX02FQnJ", 78 | "outputId": "6cf57180-1338-460b-c044-554fc22e5e9f" 79 | }, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "('B', 'C', 'A')\n", 86 | "('B', 'A', 'C')\n", 87 | "('C', 'B', 'A')\n", 88 | "('C', 'A', 'B')\n", 89 | "('A', 'B', 'C')\n", 90 | "('A', 'C', 'B')\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "\n", 96 | "permutations_taking_three_elements = itertools.permutations(character_set, 3)\n", 97 | "for i in 
permutations_taking_three_elements:\n", 98 | " print(i)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "From the above example, we observe that there are a total of 6 possible arrangements from a set containing 3 letters, taking 3 at a time. To calculate this value mathematically, we use the formula:\n", 106 | "\n", 107 | "$$ ^nP_r = \\frac{n!}{(n-r)!} $$\n", 108 | "\n", 109 | "where: \n", 110 | "- $ n $ is the number of elements in the set. \n", 111 | "- $ r $ is the number of elements taken together.\n", 112 | "\n", 113 | "**Example:**\n", 114 | "\n", 115 | "**Permutation taking 2 elements together from a set of 3 elements:**\n", 116 | "\n", 117 | "$$ ^3P_2 = \\frac{3!}{(3-2)!} = \\frac{3!}{1!} = \\frac{6}{1} = 6 $$\n", 118 | "\n", 119 | "**Permutation taking 3 elements together from a set of 3 elements:**\n", 120 | "\n", 121 | "$$ ^3P_3 = \\frac{3!}{(3-3)!} = \\frac{3!}{0!} = \\frac{6}{1} = 6 $$\n", 122 | "\n", 123 | "Now, let's look at combinations." 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "The formula for calculating the number of combinations from $ n $ elements taken $ r $ at a time is given by:\n", 131 | "$$^n C_r = \\frac{n!}{(n-r)! r!}$$" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 8, 137 | "metadata": { 138 | "colab": { 139 | "base_uri": "https://localhost:8080/" 140 | }, 141 | "id": "vjmrhsluC3ey", 142 | "outputId": "1202466c-0d9d-4418-e6b8-45b90dbadba3" 143 | }, 144 | "outputs": [ 145 | { 146 | "name": "stdout", 147 | "output_type": "stream", 148 | "text": [ 149 | "('B', 'C')\n", 150 | "('B', 'A')\n", 151 | "('C', 'A')\n" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "combination_taking_two_elements = itertools.combinations(character_set, 2)\n", 157 | "for i in combination_taking_two_elements:\n", 158 | " print(i)" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 9, 164 | "metadata": { 165 | "colab": { 166 | "base_uri": "https://localhost:8080/" 167 | }, 168 | "id": "SYO9HIj8C3ew", 169 | "outputId": "3d3bf569-3411-41d0-a9a9-55d503168d2a" 170 | }, 171 | "outputs": [ 172 | { 173 | "name": "stdout", 174 | "output_type": "stream", 175 | "text": [ 176 | "('B', 'C', 'A')\n" 177 | ] 178 | } 179 | ], 180 | "source": [ 181 | "combination_taking_three_elements = itertools.combinations(character_set, 3)\n", 182 | "for i in combination_taking_three_elements: \n", 183 | " print(i)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": { 189 | "id": "eqjVrwFyC3ez" 190 | }, 191 | "source": [ 192 | "In addition to these, itertools provides two more functions:\r\n", 193 | "\r\n", 194 | "- **Combinations with Replacement**: This function generates all possible combinations of $ r $ elements from a given iterable, allowing elements to be selected multiple times. It's useful when repetitions are allowed in the selection process.\r\n", 195 | "\r\n", 196 | "- **Product**: This function computes the Cartesian product of input iterables. It generates all possible combinations where each element from one iterable is combined with every element from other iterables. It's beneficial for creating all possible combinations of multiple sets of elements." 
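A quick numerical check of the two formulas above — a minimal sketch (an editorial addition, not part of the original notebook) using Python's standard `math` module, whose `math.perm` and `math.comb` functions are available from Python 3.8 onward:

```python
import math

n = 3  # size of the character set {'A', 'B', 'C'}

# Permutations: order matters -> n! / (n - r)!
print(math.perm(n, 2))  # 6, matching 3P2 computed above
print(math.perm(n, 3))  # 6, matching 3P3 computed above

# Combinations: order does not matter -> n! / ((n - r)! * r!)
print(math.comb(n, 2))  # 3, matching the three pairs printed above
print(math.comb(n, 3))  # 1, matching the single triple printed above
```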
197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 6, 202 | "metadata": { 203 | "colab": { 204 | "base_uri": "https://localhost:8080/" 205 | }, 206 | "id": "1czZRTC_C3e0", 207 | "outputId": "83164b64-70d0-41c0-a244-cdefd295680f" 208 | }, 209 | "outputs": [ 210 | { 211 | "name": "stdout", 212 | "output_type": "stream", 213 | "text": [ 214 | "('B', 'B')\n", 215 | "('B', 'C')\n", 216 | "('B', 'A')\n", 217 | "('C', 'C')\n", 218 | "('C', 'A')\n", 219 | "('A', 'A')\n" 220 | ] 221 | } 222 | ], 223 | "source": [ 224 | "combination_taking_two_elements_with_replacement = itertools.combinations_with_replacement(character_set, 2)\n", 225 | "for i in combination_taking_two_elements_with_replacement:\n", 226 | " print(i)" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 7, 232 | "metadata": { 233 | "colab": { 234 | "base_uri": "https://localhost:8080/" 235 | }, 236 | "id": "QKeNbHWjC3e0", 237 | "outputId": "8363c139-daf4-4784-ff7f-2f72d3580287" 238 | }, 239 | "outputs": [ 240 | { 241 | "name": "stdout", 242 | "output_type": "stream", 243 | "text": [ 244 | "('B', 'B')\n", 245 | "('B', 'C')\n", 246 | "('B', 'A')\n", 247 | "('C', 'B')\n", 248 | "('C', 'C')\n", 249 | "('C', 'A')\n", 250 | "('A', 'B')\n", 251 | "('A', 'C')\n", 252 | "('A', 'A')\n" 253 | ] 254 | } 255 | ], 256 | "source": [ 257 | "product_taking_two_elements_with_replacement = itertools.product(character_set, repeat=2)\n", 258 | "for i in product_taking_two_elements_with_replacement:\n", 259 | " print(i)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": { 266 | "id": "U5K1Bqt2C3e1" 267 | }, 268 | "outputs": [], 269 | "source": [] 270 | } 271 | ], 272 | "metadata": { 273 | "colab": { 274 | "collapsed_sections": [], 275 | "name": "01. Permutation and Combinations.ipynb", 276 | "provenance": [] 277 | }, 278 | "kernelspec": { 279 | "display_name": "Python 3 (ipykernel)", 280 | "language": "python", 281 | "name": "python3" 282 | }, 283 | "language_info": { 284 | "codemirror_mode": { 285 | "name": "ipython", 286 | "version": 3 287 | }, 288 | "file_extension": ".py", 289 | "mimetype": "text/x-python", 290 | "name": "python", 291 | "nbconvert_exporter": "python", 292 | "pygments_lexer": "ipython3", 293 | "version": "3.11.7" 294 | } 295 | }, 296 | "nbformat": 4, 297 | "nbformat_minor": 4 298 | } 299 | -------------------------------------------------------------------------------- /02. Probability and Rules of Probability.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "75tuxa_BKLEX" 7 | }, 8 | "source": [ 9 | "# **Probability and Rules of Probability**\n", 10 | "___" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "id": "8xI_XGaIKLEc" 17 | }, 18 | "source": [ 19 | "### Understanding Fundamental Concepts in Probability\r\n", 20 | "\r\n", 21 | "Before delving into the intricacies of probability, it's essential to grasp some fundamental terms and definitions associated with it.\r\n", 22 | "\r\n", 23 | "#### Random Experiment:\r\n", 24 | "A random experiment is characterized by its unpredictable outcomes when repeated under identical conditions. 
Examples include rolling a die or tossing an unbiased coin.\r\n", 25 | "\r\n", 26 | "#### Outcome:\r\n", 27 | "An outcome refers to the result obtained from a single trial of an experiment.\r\n", 28 | "\r\n", 29 | "#### Sample Space:\r\n", 30 | "The sample space represents a comprehensive list encompassing all potential outcomes of an experiment. For instance, in the case of tossing a coin, the sample space would be $\\{Heads, Tails\\}$, while for rolling a die, it would consist of $\\{1, 2, 3, 4, 5, 6\\}$.\r\n", 31 | "\r\n", 32 | "#### Event:\r\n", 33 | "An event denotes a subset of the sample space and can comprise either a single outcome or a combination of outcomes. For instance, obtaining at least two heads in a row when a coin is tossed four times constitutes an event. Another example could involve getting heads on a coin and rolling a six on a die simultaneously.\r\n", 34 | "\r\n", 35 | "### Probability:\r\n", 36 | "\r\n", 37 | "Probability serves as a quantifiable measure of the likelihood of an event occurring.\r\n", 38 | "\r\n", 39 | "**Note:** Events cannot be predicted with absolute certainty. Probability allows us to assess the likelihood of an event happening, ranging between 0 and 1. A probability of \"Zero\" signifies that the event is impossible, while a value of \"One\" indicates certainty.\r\n", 40 | "\r\n", 41 | "The probability of an event $A$, denoted as $P(A)$, is calculated using the formula:\r\n", 42 | "\r\n", 43 | "$$ P(A) = \\frac {n(A)}{n(S)} $$\r\n", 44 | "\r\n", 45 | "where: \r\n", 46 | "- $P(A)$ represents the probability of event $A$ occurring. \r\n", 47 | "- $n(A)$ denotes the number of favorable outcomes for event $A$. \r\n", 48 | "- $n(S)$ signifies the total number of possible outcomes.\r\n", 49 | "\r\n", 50 | "**Example:** \r\n", 51 | "The probability of rolling a number less than or equal to 2 when tossing a die is $\\frac{2}{6} = \\frac{1}{3}$.\r\n" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### **Rules of Probability**\n", 59 | "\n", 60 | "Understanding the rules governing probability is crucial for accurate analysis and interpretation.\n", 61 | "\n", 62 | "+ The probability of an event can range anywhere from 0 to 1: \n", 63 | " $0 \\leq P(A) \\leq 1.$ \n", 64 | " This signifies that probabilities lie within the range of certainty from impossible (0) to certain (1).\n", 65 | "\n", 66 | "+ The probabilities of an event and its complement add up to 1: \n", 67 | " $P(A) + P(\\overline{A}) = 1.$ \n", 68 | " This rule highlights that the combined probability of an event occurring and not occurring is always equal to 1.\n", 69 | "\n", 70 | "+ Complementary Rule - Probability of event A not happening: \n", 71 | " $P(\\overline{A})=1-P(A).$ \n", 72 | " It indicates that the probability of an event not occurring is equal to 1 minus the probability of the event occurring. \n", 73 | "\n", 74 | "+ Addition Rule (A and B are not necessarily disjoint) - Probability of A happening or B happening: \n", 75 | " $P(A\\cup B)=P(A)+P(B)-P(A\\cap B).$ \n", 76 | " This rule calculates the probability of either event A or event B happening, accounting for the overlap if they are not mutually exclusive. \n", 77 | "\n", 78 | "+ Addition Rule (A and B are disjoint) - Probability of A happening or B happening: \n", 79 | " $P(A\\cup B)=P(A)+P(B).$ \n", 80 | " This rule simplifies the addition of probabilities when events A and B are mutually exclusive. 
\n", 81 | " \n", 82 | "+ Multiplication Rule - Chain Rule: \n", 83 | " $P(A\\cap B)=P(A)*P(B|A)=P(B)*P(A|B).$ \n", 84 | " This rule computes the joint probability of events A and B occurring, taking into account the conditional probabilities. \n", 85 | "\n", 86 | "+ If A and B are independent events, then: \n", 87 | " $P(A\\cap B)=P(A)*P(B).$ \n", 88 | " This implies that the occurrence of one event does not affect the probability of the other event. \n", 89 | "\n", 90 | "+ $P(A\\setminus B)=P(A)-P(A\\cap B).$ \n", 91 | " This rule calculates the probability of event A happening excluding the outcomes also included in event B. \n", 92 | "\n", 93 | "+ $If A\\subset B, \\text{then}\\ P(A)\\leq P(B).$ \n", 94 | " This indicates that the probability of a subset event A is always less than or equal to the probability of the superset event B. \n", 95 | "\n", 96 | "+ $P(\\emptyset)=0. $ \n", 97 | " The probability of the empty set is always zero. " 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": { 103 | "id": "3h0VgeSGKLEe" 104 | }, 105 | "source": [ 106 | "## **Conditional Probability**\n", 107 | "\n", 108 | "Conditional probability of event **A** given event **B** is the probability that **A** occurs given that **B** has occurred.\n", 109 | "\n", 110 | "$$P(A|B)=\\frac{P(A\\cap B)}{P(B)}\\,.$$\n", 111 | "\n", 112 | "Let's illustrate this with an example:\n", 113 | "\n", 114 | "Suppose we roll a fair die, and let event A be the outcome being an odd number (i.e., A={1,3,5}), and event B be the outcome being less than or equal to 3 (i.e., B={1,2,3}). What is the probability of A given B, $P(A|B)$?\n", 115 | "\n", 116 | "$$P(B) = \\frac{3}{6} \\quad , \\quad P(A \\cap B) = \\frac{2}{6}$$ \n", 117 | "\n", 118 | "$$P(A|B) = \\frac{2}{3}$$\n", 119 | "\n", 120 | "\n", 121 | "## **Law of Large Numbers**\n", 122 | "\n", 123 | "The law of large numbers asserts that as the sample size increases, the average or mean of the sample values will converge towards the expected value.\n", 124 | "\n", 125 | "This principle can be exemplified through a basic scenario of flipping a coin. With a coin having equal chances of landing heads or tails, the expected probability of it landing heads is 1/2 or 0.5 over an infinite number of flips.\n", 126 | "\n", 127 | "However, if we only flip the coin 10 times, we may observe a deviation from the expected value. For instance, the coin might land heads only 3 times out of the 10 flips, which doesn't align closely with the expected probability of 0.5. This discrepancy is due to the relatively small sample size.\n", 128 | "\n", 129 | "As the number of flips increases, say to 20 or 30 times, we would expect the proportion of heads to gradually approach 0.5. For instance, after 20 flips, we might see 9 heads, and after 30 flips, we might observe 22 heads. With a larger sample size, the observed proportion of heads tends to converge towards the expected value of 0.5.\n", 130 | "\n", 131 | "\n", 132 | "
" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [] 141 | } 142 | ], 143 | "metadata": { 144 | "colab": { 145 | "name": "02. Probability and Rules of Probability.ipynb", 146 | "provenance": [] 147 | }, 148 | "kernelspec": { 149 | "display_name": "Python 3 (ipykernel)", 150 | "language": "python", 151 | "name": "python3" 152 | }, 153 | "language_info": { 154 | "codemirror_mode": { 155 | "name": "ipython", 156 | "version": 3 157 | }, 158 | "file_extension": ".py", 159 | "mimetype": "text/x-python", 160 | "name": "python", 161 | "nbconvert_exporter": "python", 162 | "pygments_lexer": "ipython3", 163 | "version": "3.11.7" 164 | } 165 | }, 166 | "nbformat": 4, 167 | "nbformat_minor": 4 168 | } 169 | -------------------------------------------------------------------------------- /03. Bayes Theorem.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "UuH9nzMnSFda" 7 | }, 8 | "source": [ 9 | "# Bayes theorem" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## **Bayes Theorem**\n", 17 | "\n", 18 | "Bayes' theorem is a mathematical principle used to calculate the conditional probability of an event given some evidence related to that event. It establishes a relationship between the probability of an event and prior knowledge of conditions associated with it. As evidence accumulates, the probability of the event can be determined more accurately.\n", 19 | "\n", 20 | "$$ P(A|B) = \\frac{P(B|A)P(A)}{P(B)} $$\n", 21 | "\n", 22 | "- $P(A|B)$, also known as the posterior probability, represents the probability of the hypothesis being true given the available data.\n", 23 | " \n", 24 | "- $P(B|A)$ is the probability of obtaining the evidence given the hypothesis.\n", 25 | " \n", 26 | "- $P(A)$ is the prior probability, representing the probability of the hypothesis being true before any data is considered.\n", 27 | " \n", 28 | "- $P(B)$ is the general probability of occurrence of the evidence, without any hypothesis, also known as the normalizing constant.\n", 29 | "\n", 30 | "**Example: Fire and Smoke**\n", 31 | "\n", 32 | "Suppose we want to find the probability of a fire given that there is smoke:\n", 33 | "\n", 34 | "$$P(Fire|Smoke) =\\frac {P(Smoke|Fire) * P(Fire)}{P(Smoke)}$$\n", 35 | "\n", 36 | "Here,\n", 37 | "- $P(Fire)$ is the Prior\n", 38 | "- $P(Smoke|Fire)$ is the Likelihood\n", 39 | "- $P(Smoke)$ is the evidence" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "**Example:**\n", 47 | "\n", 48 | "Let's consider a scenario where an individual tests positive for an illness. This particular illness affects approximately 1.2% of the population at any given time. 
The diagnostic test for this illness has an accuracy of 85% for individuals who actually have the illness and 97% for those who do not.\n", 49 | "\n", 50 | "Now, let's define the events involved:\n", 51 | "\n", 52 | "- $A$: The individual has the illness, also known as the hypothesis.\n", 53 | "- $\\overline{A}$: The individual does not have the illness.\n", 54 | "- $B$: The individual tests positive for the illness, also referred to as the evidence.\n", 55 | "- $P(A|B)$: The probability that the individual has the illness given a positive test result, known as the posterior probability, which is what we aim to calculate.\n", 56 | "- $P(B|A)$: The probability that the individual tests positive given that they have the illness, which is 0.85 according to the test's accuracy.\n", 57 | "- $P(A)$: The prior probability or the likelihood of the individual having the illness without any evidence, which is 0.012 based on the prevalence of the illness in the population.\n", 58 | "- $P(B)$: The probability that the individual tests positive for the illness. This can be computed in two ways:\n", 59 | "\n", 60 | " - True Positive (individual has the illness and tests positive): $P(B|A)*P(A)=0.85*0.012=0.0102.$\n", 61 | " - False Positive (individual does not have the illness but tests positive due to test inaccuracy): $P(B|\\overline{A})*P(\\overline{A})=(1-0.97)*(1-0.012)=0.02964.$\n", 62 | " \n", 63 | " Here, $P(B|\\overline{A})$ represents the probability of a positive test result for an individual who does not have the illness, indicating the test's inaccuracy for those without the illness.\n", 64 | " \n", 65 | " Additionally, $P(\\overline{A})$ denotes the probability that the individual does not have the illness, which is derived from the complement of the illness prevalence.\n", 66 | " \n", 67 | " Hence, $P(B)$, the denominator in Bayes' theorem, is the sum of these two probabilities:\n", 68 | " \n", 69 | " $P(B)= (P(B|A)*P(A)) + (P(B|\\overline{A})*P(\\overline{A}))=0.0102+0.02964=0.03984$.\n", 70 | " \n", 71 | " We can now compute the final answer using Bayes' theorem formula:\n", 72 | " \n", 73 | " $P(A|B)=P(B|A)*P(A)/P(B) =0.85*0.012 / 0.03984 = 0.256$.\n", 74 | " \n", 75 | " Thus, even with a positive medical test, the individual only has a 25.6% chance of actually suffering from the illness.\n" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [] 84 | } 85 | ], 86 | "metadata": { 87 | "colab": { 88 | "collapsed_sections": [], 89 | "name": "03. Bayes Theorem.ipynb", 90 | "provenance": [] 91 | }, 92 | "kernelspec": { 93 | "display_name": "Python 3 (ipykernel)", 94 | "language": "python", 95 | "name": "python3" 96 | }, 97 | "language_info": { 98 | "codemirror_mode": { 99 | "name": "ipython", 100 | "version": 3 101 | }, 102 | "file_extension": ".py", 103 | "mimetype": "text/x-python", 104 | "name": "python", 105 | "nbconvert_exporter": "python", 106 | "pygments_lexer": "ipython3", 107 | "version": "3.11.7" 108 | } 109 | }, 110 | "nbformat": 4, 111 | "nbformat_minor": 4 112 | } 113 | -------------------------------------------------------------------------------- /04. 
Variables and Data types.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "4affdc6c", 6 | "metadata": {}, 7 | "source": [ 8 | "# Variables and Data types\n", 9 | "___" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "id": "65063c08-be77-4aab-8e03-dc04d613ad83", 15 | "metadata": {}, 16 | "source": [ 17 | "## Understanding Variables\n", 18 | "\n", 19 | "In statistical studies, variables denote characteristics of the subjects under analysis. Selecting appropriate variables is pivotal in designing successful experiments, as they help anticipate outcomes.\n", 20 | "\n", 21 | "For instance, when predicting house prices, variables like the number of bedrooms, location, age, amenities nearby, and presence of a garage or pool are considered. These factors, aiding in price prediction, are termed variables.\n", 22 | "\n", 23 | "## Independent and Dependent Variables\n", 24 | "\n", 25 | "**Independent Variable**: Also known as explanatory or predictor variables, these are factors manipulated in an experiment to observe their impact on outcomes. They represent causes and are not influenced by other study variables.\n", 26 | "\n", 27 | "**Dependent Variable**: Referred to as response or outcome variables, these are observed results of an experiment. They represent effects and their values depend on changes made to the independent variable.\n", 28 | "\n", 29 | "## Types of Data\n", 30 | "\n", 31 | "Data, crucial for understanding relationships between variables, making predictions, and supporting decision-making, comes in various types. To accurately analyze and interpret data, it's essential to comprehend these types:\n", 32 | "\n", 33 | "**Quantitative Data**: Deals with quantities and measurements and can be either continuous or discrete. Continuous data refers to uninterrupted values along a scale, like distance and time. Discrete data, however, refers to specific values, like the number of students in a class or the outcome of rolling a die.\n", 34 | "\n", 35 | "**Categorical Data**: Represents groupings and is further categorized into nominal, ordinal, and binary types. Nominal data assigns values to categories without any inherent order, such as people's names and colors. Ordinal data, in contrast, assigns values with an order, like rating levels and grades. Binary data, the simplest form, has only two possible values, such as heads or tails in a coin flip, or yes or no.\n", 36 | "\n", 37 | "\n", 38 | "## Measurement Scales\n", 39 | "\n", 40 | "Measurement scales, also referred to as levels of measurement, elucidate how precisely variables are recorded in scientific research. Here, a variable denotes any attribute capable of assuming different values in a dataset (e.g., height, test scores).\n", 41 | "\n", 42 | "There exist four measurement scales:\n", 43 | "\n", 44 | "- **Nominal**: Data can only be categorized.\n", 45 | "- **Ordinal**: Data can be categorized and ordered.\n", 46 | "- **Interval**: Data can be categorized, ordered, and equally spaced.\n", 47 | "- **Ratio**: Data can be categorized, ordered, equally spaced, and possess a true zero point.\n", 48 | "\n", 49 | "The level of measurement for a variable profoundly influences the types of analyses feasible on it. 
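To make the nominal-versus-ordinal distinction concrete, here is a minimal sketch (an editorial addition, not part of the original notebook) using pandas categoricals; the example values are invented:

```python
import pandas as pd

# Nominal scale: categories without an inherent order
colors = pd.Series(["red", "blue", "red", "green"], dtype="category")
print(colors.value_counts())       # counting is the natural summary

# Ordinal scale: ordered categories, so ranking comparisons are meaningful
sizes = pd.Series(
    ["medium", "small", "large", "medium"],
    dtype=pd.CategoricalDtype(["small", "medium", "large"], ordered=True),
)
print(sizes.min(), sizes.max())    # 'small' 'large' -- order is well defined
print((sizes > "small").tolist())  # elementwise order comparisons work
```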
Ranging from nominal (low) to ratio (high), measurement scales vary in complexity and precision.\n" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "id": "e2708acb-cebe-434b-bb9e-9a873250ab3e", 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [] 59 | } 60 | ], 61 | "metadata": { 62 | "kernelspec": { 63 | "display_name": "Python 3 (ipykernel)", 64 | "language": "python", 65 | "name": "python3" 66 | }, 67 | "language_info": { 68 | "codemirror_mode": { 69 | "name": "ipython", 70 | "version": 3 71 | }, 72 | "file_extension": ".py", 73 | "mimetype": "text/x-python", 74 | "name": "python", 75 | "nbconvert_exporter": "python", 76 | "pygments_lexer": "ipython3", 77 | "version": "3.11.7" 78 | } 79 | }, 80 | "nbformat": 4, 81 | "nbformat_minor": 5 82 | } 83 | -------------------------------------------------------------------------------- /06. Measures of Central Tendency.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction to Measures of Central Tendency\n", 8 | "\n", 9 | "## Overview\n", 10 | "\n", 11 | "Measures of central tendency are statistical metrics that describe the central point of a dataset. They provide a summary that represents a typical value within the dataset. Key measures of central tendency include the mean, median, mode, percentile, and quartile.\n", 12 | "\n", 13 | "### Mean\n", 14 | "\n", 15 | "The mean is the arithmetic average of a dataset, calculated by summing all values and dividing by the number of values.\n", 16 | "\n", 17 | "**Properties of the Mean:**\n", 18 | "\n", 19 | "- The sum of deviations of the items from their arithmetic mean is always zero, i.e., $\\sum (x - \\overline{x}) = 0$.\n", 20 | "- The sum of the squared deviations from the arithmetic mean (A.M.) 
is minimized compared to deviations from any other value.\n", 21 | "- Replacing each item in the series with the mean results in a sum equal to the sum of the original items.\n", 22 | "- The mean is affected by every value in the dataset.\n", 23 | "- It is a calculated value and not dependent on the position within the series.\n", 24 | "- It is sensitive to extreme values (outliers).\n", 25 | "- The mean cannot typically be identified by inspection.\n", 26 | "- In some cases, the mean may not represent an actual value within the dataset (e.g., an average of 10.7 patients admitted per day).\n", 27 | "- The arithmetic mean is not suitable for extremely asymmetrical distributions.\n", 28 | "\n", 29 | "### Median\n", 30 | "\n", 31 | "The median is the middle value in an ordered dataset, representing the 50th percentile.\n", 32 | "\n", 33 | "**Properties of the Median:**\n", 34 | "\n", 35 | "- The median is not influenced by all data values.\n", 36 | "- It is determined by its position in the dataset and not by individual values.\n", 37 | "- The distance from the median to all other values is minimized compared to any other point.\n", 38 | "- Every dataset has a single median.\n", 39 | "- The median cannot be algebraically manipulated or combined.\n", 40 | "- It remains stable in grouped data procedures.\n", 41 | "- It is not applicable to qualitative data.\n", 42 | "- The data must be ordered for median calculation.\n", 43 | "- The median is suitable for ratio, interval, and ordinal scales.\n", 44 | "- Outliers and skewed data have less impact on the median.\n", 45 | "- The median is a better measure than the mean in skewed distributions.\n", 46 | "\n", 47 | "### Mode\n", 48 | "\n", 49 | "The mode is the most frequently occurring value in a dataset with discrete values.\n", 50 | "\n", 51 | "**Properties of the Mode:**\n", 52 | "\n", 53 | "- The mode is useful when the most typical case is desired.\n", 54 | "- It can be used with nominal or categorical data, such as religious preference, gender, or political affiliation.\n", 55 | "- The mode may not be unique; a dataset can have more than one mode or none at all.\n", 56 | "\n", 57 | "### Percentile\n", 58 | "\n", 59 | "A percentile indicates the percentage of values in a dataset that fall below a particular value. The median is the 50th percentile.\n", 60 | "\n", 61 | "### Quartile\n", 62 | "\n", 63 | "A quartile divides an ordered dataset into four equal parts. \n", 64 | "\n", 65 | "- $Q_1$ (first quartile) corresponds to the 25th percentile.\n", 66 | "- $Q_2$ corresponds to the median.\n", 67 | "- $Q_3$ corresponds to the 75th percentile.\n" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [] 76 | } 77 | ], 78 | "metadata": { 79 | "kernelspec": { 80 | "display_name": "Python 3 (ipykernel)", 81 | "language": "python", 82 | "name": "python3" 83 | }, 84 | "language_info": { 85 | "codemirror_mode": { 86 | "name": "ipython", 87 | "version": 3 88 | }, 89 | "file_extension": ".py", 90 | "mimetype": "text/x-python", 91 | "name": "python", 92 | "nbconvert_exporter": "python", 93 | "pygments_lexer": "ipython3", 94 | "version": "3.11.7" 95 | } 96 | }, 97 | "nbformat": 4, 98 | "nbformat_minor": 4 99 | } 100 | -------------------------------------------------------------------------------- /07. 
Measures of Variability.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Measures of Variability" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Measures of dispersion provide a quantitative assessment of the spread within a distribution. They indicate whether the values are clustered around a central point or dispersed across a range. The following are the most commonly used measures of dispersion:\n", 15 | "\n", 16 | "**Range:** The range represents the difference between the highest and lowest values in a dataset.\n", 17 | "\n", 18 | "**Interquartile Range (IQR):** The IQR measures the difference between the third quartile ($Q_3$) and the first quartile ($Q_1$). It is less affected by extreme values, focusing on the middle portion of the dataset. This makes the IQR particularly useful for skewed distributions with outliers. The IQR is calculated as: \n", 19 | "$$IQR = Q_3 - Q_1$$\n", 20 | "\n", 21 | "**Variance:** Variance quantifies the extent to which the values in a dataset deviate from the mean. It provides an indication of whether the mean is a representative measure of central tendency. A small variance suggests that the mean is a good representation of the dataset. The formula for variance is:\n", 22 | "\n", 23 | "$$\\sigma^2 = \\frac{\\sum (x-\\mu)^2}{N}$$\n", 24 | " \n", 25 | "Where $\\mu$ is the mean, and $N$ is the number of values in the dataset.\n", 26 | "\n", 27 | "**Sample Variance** is given by:\n", 28 | "\n", 29 | "$$S^2 = \\frac{\\sum (x-\\overline x)^2}{n-1}$$\n", 30 | "\n", 31 | "Where $\\overline x$ is the sample mean, and $n$ is the number of values in the sample.\n", 32 | "\n", 33 | "**Standard deviation:** This measure is calculated by taking the square root of the variance. Since the variance is not in the same units as the original data (it involves squaring the differences), taking the square root brings the standard deviation back to the same units as the data. For example, in a dataset measuring average rainfall in centimeters, the variance would be in $cm^2$, which isn't interpretable. However, the standard deviation, expressed in $cm$, provides a meaningful indication of the average deviation of rainfall in centimeters.\n", 34 | "\n", 35 | "**Skewness:** This measures the degree of asymmetry of a distribution\n", 36 | "\n", 37 | "
\n", 38 | "\n", 39 | "**Positive Skewness:** A positively skewed distribution is characterized by numerous outliers in the upper region, or right tail. It is termed \"skewed right\" due to its relatively elongated upper (right) tail.\n", 40 | "\n", 41 | "**Negative Skewness:** Conversely, a negatively skewed distribution exhibits a disproportionate number of outliers within its lower (left) tail. Such a distribution is referred to as \"skewed left\" owing to its extended lower tail.\n", 42 | "\n", 43 | "**Kurtosis:** Kurtosis serves as a measure indicating the curvature, peakiness, or flatness of a given distribution of data.\n", 44 | "\n", 45 | "
" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 1, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "import pandas as pd\n", 55 | "data = pd.Series([19,23,19,18,25,16,17,19,15,23,21,23,21,11,6])" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 2, 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "data": { 65 | "text/plain": [ 66 | "count 15.000000\n", 67 | "mean 18.400000\n", 68 | "std 4.997142\n", 69 | "min 6.000000\n", 70 | "25% 16.500000\n", 71 | "50% 19.000000\n", 72 | "75% 22.000000\n", 73 | "max 25.000000\n", 74 | "dtype: float64" 75 | ] 76 | }, 77 | "execution_count": 2, 78 | "metadata": {}, 79 | "output_type": "execute_result" 80 | } 81 | ], 82 | "source": [ 83 | "data.describe()" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 3, 89 | "metadata": {}, 90 | "outputs": [ 91 | { 92 | "data": { 93 | "text/plain": [ 94 | "0 19\n", 95 | "1 23\n", 96 | "dtype: int64" 97 | ] 98 | }, 99 | "execution_count": 3, 100 | "metadata": {}, 101 | "output_type": "execute_result" 102 | } 103 | ], 104 | "source": [ 105 | "data.mode()" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "The values 19 and 23 are the most frequently occurring values" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 4, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "data": { 122 | "text/plain": [ 123 | "19.0" 124 | ] 125 | }, 126 | "execution_count": 4, 127 | "metadata": {}, 128 | "output_type": "execute_result" 129 | } 130 | ], 131 | "source": [ 132 | "data.median()" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 5, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "19" 144 | ] 145 | }, 146 | "execution_count": 5, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "range_data = max(data)-min(data)\n", 153 | "range_data" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 6, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/plain": [ 164 | "4.99714204034952" 165 | ] 166 | }, 167 | "execution_count": 6, 168 | "metadata": {}, 169 | "output_type": "execute_result" 170 | } 171 | ], 172 | "source": [ 173 | "data.std()" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 7, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/plain": [ 184 | "24.97142857142857" 185 | ] 186 | }, 187 | "execution_count": 7, 188 | "metadata": {}, 189 | "output_type": "execute_result" 190 | } 191 | ], 192 | "source": [ 193 | "data.var()" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 8, 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "data": { 203 | "text/plain": [ 204 | "(-1.038344732097918, 0.6995494033062934)" 205 | ] 206 | }, 207 | "execution_count": 8, 208 | "metadata": {}, 209 | "output_type": "execute_result" 210 | } 211 | ], 212 | "source": [ 213 | "from scipy.stats import skew, kurtosis\n", 214 | "\n", 215 | "skew(data), kurtosis(data)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "**Points to note:** \n", 223 | "1. The mean value is affected by outliers (extreme values). Whenever there are outliers in a dataset, it is better to use the median.\n", 224 | "2. The standard deviation and variance are closely tied to the mean. 
Thus, if there are outliers, standard deviation and variance may not be representative measures either.\n", 225 | "3. The mode is generally used for discrete data since there can be more than one modal value for continuous data." 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [] 234 | } 235 | ], 236 | "metadata": { 237 | "kernelspec": { 238 | "display_name": "Python 3 (ipykernel)", 239 | "language": "python", 240 | "name": "python3" 241 | }, 242 | "language_info": { 243 | "codemirror_mode": { 244 | "name": "ipython", 245 | "version": 3 246 | }, 247 | "file_extension": ".py", 248 | "mimetype": "text/x-python", 249 | "name": "python", 250 | "nbconvert_exporter": "python", 251 | "pygments_lexer": "ipython3", 252 | "version": "3.11.7" 253 | } 254 | }, 255 | "nbformat": 4, 256 | "nbformat_minor": 4 257 | } 258 | -------------------------------------------------------------------------------- /08. Central Limit Theorem.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Central Limit Theorem" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## The Central Limit Theorem\n", 15 | "\n", 16 | "The Central Limit Theorem (CLT) posits that the distribution of *sample means* drawn from a population will approximate a normal distribution, irrespective of the shape of the population distribution, provided the sample size is sufficiently large (typically n > 30). Even for populations that are already normally distributed, the theorem remains valid for smaller sample sizes.\n", 17 | "\n", 18 | "## Estimating the Population Mean\n", 19 | "\n", 20 | "While the sample mean serves as an estimate for the population mean, it's essential to recognize that the standard deviation of the sampling distribution ($\\sigma_{\\overline{x}}$) differs from the population standard deviation ($\\sigma$).\n", 21 | "\n", 22 | "## The Standard Error\n", 23 | "\n", 24 | "The standard deviation of the sampling distribution ($\\sigma_{\\overline{x}}$) is termed the **standard error** and is linked to the population standard deviation by the formula:\n", 25 | "\n", 26 | "$$\\sigma_{\\overline{x}} = \\frac{\\sigma}{\\sqrt{n}}$$\n", 27 | "\n", 28 | "Here, $\\sigma$ represents the population standard deviation, and $n$ denotes the sample size. \n", 29 | "\n", 30 | "As the sample size increases, the standard error diminishes toward 0, and the sample mean ($\\overline{x}$) converges towards the population mean ($\\mu$).\n" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "## Estimates and Confidence Intervals\n", 38 | "\n", 39 | "### Point Estimate\n", 40 | "\n", 41 | "A point estimate is a single statistic calculated from a sample used to estimate an unknown population parameter. For instance, the sample mean can serve as a point estimate for the population mean.\n", 42 | "\n", 43 | "### Interval Estimate\n", 44 | "\n", 45 | "An interval estimate is a range of values believed to encompass the true population parameter. It represents the margin of error in estimating the population parameter.\n", 46 | "\n", 47 | "### Confidence Interval\n", 48 | "\n", 49 | "A confidence interval is a range of values within which the population mean is presumed to lie. 
It can be calculated as follows:\n", 50 | "\n", 51 | "#### When Population Standard Deviation is Known\n", 52 | "\n", 53 | "For a random sample of size $n$ with mean $\\overline{x}$, taken from a population with standard deviation $\\sigma$ and mean $\\mu$, the confidence interval for the population mean is:\n", 54 | "\n", 55 | "$$\\overline{x} - \\frac{z\\sigma}{\\sqrt{n}} \\leq \\mu \\leq \\overline{x} + \\frac{z\\sigma}{\\sqrt{n}}$$\n", 56 | "\n", 57 | "#### When Population Standard Deviation is Unknown\n", 58 | "\n", 59 | "In cases where the population standard deviation is unknown, the sample standard deviation ($s$) substitutes $\\sigma$ in calculating the confidence interval:\n", 60 | "\n", 61 | "$$\\overline{x} - \\frac{zs}{\\sqrt{n}} \\leq \\mu \\leq \\overline{x} + \\frac{zs}{\\sqrt{n}}$$\n", 62 | "\n", 63 | "Here: \n", 64 | " \n", 65 | "$\\overline{x}$: Sample mean. \n", 66 | "$n$: Sample size. \n", 67 | "$\\mu$: Population mean (the parameter we are estimating). \n", 68 | "$\\sigma$: Population standard deviation (known when calculating the confidence interval with the formula that includes $\\sigma$). \n", 69 | "$s$: Sample standard deviation (used as an estimate of the population standard deviation when it's unknown). \n", 70 | "$z$: The critical value from the standard normal distribution corresponding to the desired confidence level. It is determined based on the chosen confidence level (e.g., 95% confidence level corresponds to a z-score of approximately 1.96). This value is used to calculate the margin of error.\n", 71 | "\n", 72 | "### Example\n", 73 | "\n", 74 | "Suppose we have grades of 10 students drawn from a population, and we aim to ascertain the 95% confidence interval for the population mean.\n" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 1, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "data": { 84 | "text/plain": [ 85 | "(3.1110006165952773, 3.668999383404722)" 86 | ] 87 | }, 88 | "execution_count": 1, 89 | "metadata": {}, 90 | "output_type": "execute_result" 91 | } 92 | ], 93 | "source": [ 94 | "import numpy as np\n", 95 | "import scipy.stats as stats\n", 96 | "from scipy.stats import t\n", 97 | "\n", 98 | "grades = np.array([3.1,2.9,3.2,3.4,3.7,3.9,3.9,2.8,3.4,3.6])\n", 99 | "\n", 100 | "stats.t.interval(0.95, len(grades)-1, loc=np.mean(grades), scale=stats.sem(grades))" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "The arguments inside t.interval function are 95% confidence interval, degrees of freedom (n-1), sample mean and the standard error calculated by stats.sem function." 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [] 116 | } 117 | ], 118 | "metadata": { 119 | "kernelspec": { 120 | "display_name": "Python 3 (ipykernel)", 121 | "language": "python", 122 | "name": "python3" 123 | }, 124 | "language_info": { 125 | "codemirror_mode": { 126 | "name": "ipython", 127 | "version": 3 128 | }, 129 | "file_extension": ".py", 130 | "mimetype": "text/x-python", 131 | "name": "python", 132 | "nbconvert_exporter": "python", 133 | "pygments_lexer": "ipython3", 134 | "version": "3.11.7" 135 | } 136 | }, 137 | "nbformat": 4, 138 | "nbformat_minor": 4 139 | } 140 | -------------------------------------------------------------------------------- /09. 
Sampling and Sampling errors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Sampling\n", 8 | "\n", 9 | "Sampling serves as a fundamental technique for acquiring insights into a population by gathering data from a representative subset, rather than assessing every individual within the population. It presents a pragmatic approach when exhaustive data collection proves impractical. However, it is imperative that the sample mirrors the population's characteristics accurately.\n", 10 | "\n", 11 | "### Probability Sampling\n", 12 | "\n", 13 | "Probability sampling ensures that every member of the population possesses an equal chance of selection, thereby facilitating the creation of a sample that faithfully mirrors the population. Several commonly employed probability sampling techniques include:\n", 14 | "\n", 15 | "1. **Simple random sampling**: Subjects are selected entirely at random, without bias or preference, ensuring each has an equal probability of inclusion.\n", 16 | "\n", 17 | "2. **Stratified random sampling**: The population undergoes division into non-overlapping groups, from which subjects are randomly chosen. This method ensures representation across all relevant categories or strata.\n", 18 | "\n", 19 | "3. **Systematic random sampling**: Subjects are chosen at regular intervals, offering simplicity in execution but potentially risking representativeness if the interval choice is inappropriate.\n", 20 | "\n", 21 | "4. **Cluster sampling**: The population divides into non-overlapping clusters, from which a subset is randomly selected. This method offers convenience and cost-effectiveness.\n", 22 | "\n", 23 | "Advantages of Probability Sampling\n", 24 | "- Mitigation of Sample Bias\n", 25 | "- Representation of Diverse Population Characteristics\n", 26 | "- Generation of Accurate Sample Representations\n", 27 | "\n", 28 | "### Non-Probability Sampling\n", 29 | "\n", 30 | "Non-probability sampling deviates from the principle of equal probability of selection, thus increasing the likelihood of acquiring a non-representative sample. Commonly utilized non-probability sampling techniques include:\n", 31 | "\n", 32 | "1. **Convenience sampling**: Conveniently accessible subjects form the sample, often leading to ease of implementation but potential representativeness issues.\n", 33 | "\n", 34 | "2. **Judgmental or purposive sampling**: Selection is based on predefined criteria, aligning with the study's objectives, albeit potentially biasing the sample.\n", 35 | "\n", 36 | "3. **Quota sampling**: Quotas ensure the sample reflects significant population characteristics, albeit without the assurance of equal probability of selection.\n", 37 | "\n", 38 | "4. **Snowball sampling**: Initial subjects refer additional participants, commonly employed in populations with low visibility.\n", 39 | "\n", 40 | "**Note**: Non-probability sampling typically yields less reliable and less generalizable results compared to probability sampling methodologies.\n" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "## Types of Errors in Sampling\n", 48 | "\n", 49 | "When making inferences about a population based on a sample, it's possible to encounter various types of errors. 
These errors can be grouped into the following categories:\n", 50 | "\n", 51 | "- **Sampling Error**: The difference between the sample estimate for the population and the actual population estimate\n", 52 | "- **Coverage Error**: Occurs when the population is not adequately represented and some groups are excluded\n", 53 | "- **Nonresponse Error**: Occurs when we fail to include nonresponsive subjects who meet the criteria of the study, but are excluded because they do not answer the survey questions.\n", 54 | "- **Measurement Error**: Occurs when the correct parameters are not measured due to flaws in the measurement method or tool used.\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [] 63 | } 64 | ], 65 | "metadata": { 66 | "kernelspec": { 67 | "display_name": "Python 3 (ipykernel)", 68 | "language": "python", 69 | "name": "python3" 70 | }, 71 | "language_info": { 72 | "codemirror_mode": { 73 | "name": "ipython", 74 | "version": 3 75 | }, 76 | "file_extension": ".py", 77 | "mimetype": "text/x-python", 78 | "name": "python", 79 | "nbconvert_exporter": "python", 80 | "pygments_lexer": "ipython3", 81 | "version": "3.11.7" 82 | } 83 | }, 84 | "nbformat": 4, 85 | "nbformat_minor": 4 86 | } 87 | -------------------------------------------------------------------------------- /10. Hypothesis Testing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Hypothesis Testing\n", 8 | "\n", 9 | "Before delving into the intricacies of hypothesis testing, it's imperative to grasp some foundational concepts.\n", 10 | "\n", 11 | "## Population Parameters vs. Sample Statistics\n", 12 | "\n", 13 | "A **Parameter** denotes a characteristic of an entire population, such as the population mean. As it's typically impractical to measure the entire population, the true value of the parameter often eludes us. Commonly used parameters in statistics, like the population mean and standard deviation, are symbolized by Greek letters such as $\\mu$ (mu) and $\\sigma$ (sigma).\n", 14 | "\n", 15 | "Conversely, a **Statistic** represents a characteristic calculated from a sample. For instance, computing the mean and standard deviation of a sample yields sample statistics. In statistical parlance, sample parameters are denoted by Latin letters.\n", 16 | "\n", 17 | "**Inferential statistics** entails leveraging sample statistics to draw inferences about a population. This involves using sample statistics to estimate population parameters. To ensure validity, representative sampling techniques like random sampling are pivotal for obtaining unbiased estimates. Unbiased estimates are considered accurate on average, whereas biased estimates systematically deviate from the truth.\n", 18 | "\n", 19 | "## Parametric vs Nonparametric Analysis\n", 20 | "\n", 21 | "**Parametric statistics** assumes that the sample data stems from populations describable by probability distributions with fixed parameters. 
Consequently, parametric analysis reigns as the predominant statistical method.\n", 22 | "\n", 23 | "In contrast, **nonparametric tests** refrain from assuming any specific probability distribution for the underlying data.\n", 24 | "\n", 25 | "## Significance Level (Alpha)\n", 26 | "\n", 27 | "The **significance level** serves as a yardstick dictating the requisite strength of evidence from sample data to infer the presence of an effect in the population. Also known as alpha ($\\alpha$), it's a predetermined threshold established prior to the study. The significance level delineates the evidence threshold needed to reject the null hypothesis in favor of the alternative hypothesis.\n", 28 | "\n", 29 | "Reflecting the probability of erroneously rejecting the null hypothesis when true, the significance level quantifies the risk of asserting an effect's existence when none prevails. Lower significance levels signify heightened evidentiary thresholds, demanding more robust evidence before null hypothesis rejection. For instance, a significance level of 0.05 implies a 5% chance of committing a false positive error—declaring an effect's existence in its absence.\n", 30 | "\n", 31 | "## P-Values\n", 32 | "\n", 33 | "**P-values** gauge the strength of evidence against the null hypothesis furnished by sample data. A P-value below the established significance level denotes statistical significance.\n", 34 | "\n", 35 | "The P-value represents the probability of observing an effect in the sample data as extreme, or even more so, than the one observed if the null hypothesis held true. Essentially, it quantifies the extent to which the sample data contravenes the null hypothesis. Lower P-values denote more compelling evidence against the null hypothesis.\n", 36 | "\n", 37 | "When a P-value falls at or below the significance level, the null hypothesis is discarded, and the results are deemed statistically significant. This implies that the sample data furnishes adequate evidence to endorse the alternative hypothesis positing the effect's presence in the population.\n", 38 | "\n", 39 | "Conversely, when a P-value exceeds the significance level, the sample data fails to supply sufficient evidence for effect existence, prompting null hypothesis retention.\n", 40 | "\n", 41 | "Statistically, these verdicts translate as follows:\n", 42 | "\n", 43 | "+ Reject the null hypothesis when the P-value equals or falls below the significance level.\n", 44 | "+ Retain the null hypothesis when the P-value exceeds the significance level.\n", 45 | "\n", 46 | "\n", 47 | "## Hypothesis Testing\n", 48 | "\n", 49 | "**Hypothesis testing** is a statistical technique that evaluates the evidence of two opposing statements (hypotheses) about a population based on sample data. These hypotheses are known as the null hypothesis and the alternative hypothesis.\n", 50 | "\n", 51 | "The objective of hypothesis testing is to assess the sample statistic and its corresponding sampling error to determine which of the two hypotheses is more strongly supported by the data. If the null hypothesis can be rejected, it means that the results are statistically significant and the alternative hypothesis is favored, suggesting that an effect exists in the population.\n", 52 | "\n", 53 | "It is important to note that failing to reject the null hypothesis does not necessarily mean that the null hypothesis is true, nor does rejecting the null hypothesis necessarily imply that the alternative hypothesis is true. 
The results of a hypothesis test are only a suggestion or indication about the population, not a conclusive proof of either hypothesis.\n", 54 | "\n", 55 | "The null hypothesis is the theory that there is no effect (i.e., the effect size is equal to zero). It is commonly represented by $H_0$.\n", 56 | "\n", 57 | "The alternative hypothesis is the opposite theory, stating that the population parameter does not equal the value specified in the null hypothesis (i.e., there is a non-zero effect). It is usually represented by $H_1$ or $H_A$.\n", 58 | "\n", 59 | "The steps involved in hypothesis testing are as follows:\n", 60 | "\n", 61 | "1. State the null and alternative hypotheses.\n", 62 | "2. Specify the significance level and calculate the critical value of the test statistic.\n", 63 | "3. Choose the appropriate test based on factors such as the number of samples, population distribution, statistic being tested, sample size, and knowledge of the population standard deviation.\n", 64 | "4. Calculate the relevant test statistic (z-statistic, t-statistic, chi-square statistic, or f-statistic) or p-value.\n", 65 | "5. Compare the calculated test statistic with the critical test statistic or the p-value with the significance level.\n", 66 | "    - If using the test statistic:\n", 67 | "        - Reject the null hypothesis if the calculated test statistic is greater than the critical test statistic (upper-tail test).\n", 68 | "        - Reject the null hypothesis if the calculated test statistic is less than the critical test statistic, which is negative in a lower-tail test.\n", 69 | "    - If using the p-value:\n", 70 | "        - Reject the null hypothesis if the p-value is less than the significance level.\n", 71 | "6. Draw a conclusion based on the comparison made in step 5.\n", 72 | "\n", 73 | "## Confidence Interval\n", 74 | "\n", 75 | "A **confidence interval** can be calculated for various parameters such as population mean, population proportion, difference of population means, and difference of population proportions, among others. To construct a confidence interval, one needs to have a sample statistic that estimates the population parameter of interest and a measure of variability or standard error for that statistic. The confidence interval is calculated by adding to and subtracting from the sample statistic a margin of error (the standard error multiplied by an appropriate critical value). The result is a range of values that contains the true population parameter with a specified level of confidence.\n", 76 | "\n", 77 | "The key concept behind a confidence interval is that if we repeated our sampling process many times, the true population parameter would fall within the confidence interval for the specified percentage of these samples. For example, if we have a 95% confidence interval for a population mean, we can say that if we repeated our sampling process 100 times, the true population mean would fall within the calculated confidence intervals about 95 times out of 100.\n", 78 | "\n", 79 | "In conclusion, confidence intervals provide a range of plausible values for a population parameter based on the observed sample data and the level of confidence specified by the researcher. The confidence interval reflects the precision of the estimate, with wider intervals indicating less precision and narrower intervals indicating more precision.\n", 80 | "\n", 81 | "## Sampling Error\n", 82 | "\n", 83 | "A **sampling error** refers to the discrepancy between a population parameter and a sample statistic.
In a study, the sampling error represents the difference between the mean calculated from a sample and the actual mean of the population. Despite using a random sample selection process, sampling errors are still a possibility as the sample may not perfectly reflect the population with regards to numerical values such as means and standard deviations. To improve the accuracy of generalizing findings from a sample to a population, it's essential to minimize the sampling error. One way to do this is to increase the sample size.\n" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [] 92 | } 93 | ], 94 | "metadata": { 95 | "kernelspec": { 96 | "display_name": "Python 3 (ipykernel)", 97 | "language": "python", 98 | "name": "python3" 99 | }, 100 | "language_info": { 101 | "codemirror_mode": { 102 | "name": "ipython", 103 | "version": 3 104 | }, 105 | "file_extension": ".py", 106 | "mimetype": "text/x-python", 107 | "name": "python", 108 | "nbconvert_exporter": "python", 109 | "pygments_lexer": "ipython3", 110 | "version": "3.11.7" 111 | } 112 | }, 113 | "nbformat": 4, 114 | "nbformat_minor": 4 115 | } 116 | -------------------------------------------------------------------------------- /11. Parametric Tests.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Parametric Tests\n", 8 | "\n", 9 | "Parametric tests are statistical tools predicated on the assumption that the data adheres to a normal distribution. They facilitate inferences about population parameters based on sampled data.\n", 10 | "\n", 11 | "## One-Sample Test\n", 12 | "\n", 13 | "A one-sample test is employed when there's a single population of interest, and a solitary sample is extracted from it. It evaluates whether there's a notable discrepancy between the sample values and the population parameter.\n", 14 | "\n", 15 | "## Two-Sample Test\n", 16 | "\n", 17 | "The two-sample test enters the picture when samples are drawn from two distinct populations. It gauges whether the population parameters diverge significantly based on the sample statistics.\n", 18 | "\n", 19 | "## Critical Test Statistic\n", 20 | "\n", 21 | "The critical test statistic denotes the threshold value of the sample test statistic pivotal in discerning whether to embrace or repudiate the null hypothesis.\n", 22 | "\n", 23 | "## Region of Rejection\n", 24 | "\n", 25 | "The region of rejection delineates the spectrum of values wherein the null hypothesis is discarded. Conversely, the region of acceptance encompasses values where the null hypothesis holds sway.\n", 26 | "\n", 27 | "## Types of Tests\n", 28 | "\n", 29 | "Several types of tests are at our disposal:\n", 30 | "\n", 31 | "+ **Z-tests**: Apt for ample sample sizes (n ≥ 30) with a known population standard deviation.\n", 32 | "+ **T-tests**: Tailored for modest sample sizes (n < 30) with an unknown population standard deviation.\n", 33 | "+ **F-tests**: Tasked with comparing values across more than two variables.\n", 34 | "+ **Chi-square**: Devised for the comparison of categorical data.\n", 35 | "\n", 36 | "## One-Tail Test (Directional Test)\n", 37 | "\n", 38 | "A one-tail test enters the fray when probing for a change in the mean, armed with the knowledge of the change's direction. 
\n", 39 | "\n", 40 | "Two iterations of the one-tail test exist:\n", 41 | "\n", 42 | "+ **Upper one-tail**: The region of rejection resides on the right tail. It's invoked when scrutinizing whether the mean score has surged.\n", 43 | "+ **Lower one-tail**: The region of rejection graces the left tail. It's enlisted when assessing if the mean score has plummeted.\n", 44 | "\n", 45 | "## Two-Tail Test (Non-Directional Test)\n", 46 | "\n", 47 | "The two-tail test is deployed when scrutinizing a change in the mean sans knowledge of the direction. The region of rejection spans both tails of the distribution.\n", 48 | "\n", 49 | "## The P-value\n", 50 | "\n", 51 | "The p-value is the linchpin in deciding whether to embrace or eschew the null hypothesis. It's computed based on the sample data and juxtaposed with a significance level, usually 0.05. \n", 52 | "+ If p < 0.05, it intimates that the sample data is improbable to stem from randomness and doesn't mirror the population adequately. In such instances, the null hypothesis is jettisoned. \n", 53 | "+ If p > 0.05, it implies a heightened likelihood that the sample inadequately represents the population, prompting the null hypothesis's retention\n", 54 | "\n", 55 | "\n", 56 | "\n", 57 | "
" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 1, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "Calculating p given z: p = 0.95\n", 70 | "Calculating z given p: z = 1.6448536269514722\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "from scipy.stats import norm\n", 76 | "\n", 77 | "z = 1.6448536269514722\n", 78 | "p = 0.95\n", 79 | "\n", 80 | "print(\"Calculating p given z: p = \", norm.cdf(z))\n", 81 | "print(\"Calculating z given p: z = \", norm.ppf(p))" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "# Choosing a statistical test?\n", 89 | "\n", 90 | "Below is a simple diagram which shows how to choose a test depending on different data types.\n", 91 | "\n", 92 | "\n", 93 | "
\n" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [] 102 | } 103 | ], 104 | "metadata": { 105 | "kernelspec": { 106 | "display_name": "Python 3 (ipykernel)", 107 | "language": "python", 108 | "name": "python3" 109 | }, 110 | "language_info": { 111 | "codemirror_mode": { 112 | "name": "ipython", 113 | "version": 3 114 | }, 115 | "file_extension": ".py", 116 | "mimetype": "text/x-python", 117 | "name": "python", 118 | "nbconvert_exporter": "python", 119 | "pygments_lexer": "ipython3", 120 | "version": "3.11.7" 121 | } 122 | }, 123 | "nbformat": 4, 124 | "nbformat_minor": 4 125 | } 126 | -------------------------------------------------------------------------------- /12. Z-test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "7865f8f8-b588-40d3-8838-b4aed0b33c8b", 6 | "metadata": {}, 7 | "source": [ 8 | "# Z-Test\n", 9 | "\n", 10 | "## One-Sample Z-Test\n", 11 | "\n", 12 | "A one-sample Z-test is utilized to evaluate if the mean of a single sample differs from a known or hypothesized population mean. Several criteria must be fulfilled for a one-sample Z-test:\n", 13 | "\n", 14 | "- The population from which the sample is drawn follows a normal distribution.\n", 15 | "- The sample size exceeds 30.\n", 16 | "- Only one sample is obtained.\n", 17 | "- The hypothesis concerns the population mean.\n", 18 | "- The population standard deviation is known.\n", 19 | "\n", 20 | "The test statistic is computed using the formula:\n", 21 | "\n", 22 | "$$ z = \\frac {(\\overline x - \\mu)}{\\frac{\\sigma}{\\sqrt n}}$$\n", 23 | "\n", 24 | "where $x$ denotes the sample mean, $\\mu$ represents the population mean, $\\sigma$ stands for the population standard deviation, and $n$ is the sample size.\n", 25 | "\n", 26 | "## One-Sample Z-Test: One-Tail\n", 27 | "\n", 28 | "Suppose we have a pizza delivery shop with a historical average delivery time of 45 minutes and a standard deviation of 5 minutes. However, due to recent customer complaints, the shop decides to analyze the delivery time of the last 40 orders, revealing an average delivery time of 48 minutes. We aim to ascertain if the new mean significantly exceeds the population mean.\n", 29 | "\n", 30 | "The null hypothesis ($H_0$) posits that the mean delivery time equals 45 minutes: $\\mu = 45$. The alternative hypothesis ($H_1$) suggests that the mean delivery time surpasses 45 minutes: $\\mu > 45$. Let's adopt a significance level of $\\alpha = 0.05$. In this scenario, the region of rejection will be situated on the right tail." 
31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "id": "f793d2d7", 37 | "metadata": {}, 38 | "outputs": [ 39 | { 40 | "name": "stdout", 41 | "output_type": "stream", 42 | "text": [ 43 | "3.7947331922020555\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "z = (48-45)/(5/(40)**0.5)\n", 49 | "print(z)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 2, 55 | "id": "d4c5f52e", 56 | "metadata": {}, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | "7.390115516725526e-05\n" 63 | ] 64 | } 65 | ], 66 | "source": [ 67 | "import scipy.stats as stats\n", 68 | "p_value = 1 - stats.norm.cdf(z) # cumulative distribution function\n", 69 | "print(p_value)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "id": "6de1de5f-bf69-49d6-bb50-b81aa5b3806b", 75 | "metadata": {}, 76 | "source": [ 77 | "Since the p-value is less than $\alpha$, we reject the null hypothesis. At the 0.05 level, the average delivery time has increased significantly compared to the historical population average." 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "id": "005271aa-ffa0-4d34-8d67-3df03d2bb55e", 83 | "metadata": {}, 84 | "source": [ 85 | "## One-Sample Z-Test: Two-Tail\n", 86 | "\n", 87 | "Suppose we aim to investigate whether a drug has an impact on IQ. In this scenario, we opt for a two-tail test because we're interested in determining whether the drug affects IQ, regardless of whether it has a positive or negative effect.\n", 88 | "\n", 89 | "Given a significance level of $\alpha = 0.05$, our rejection regions are 0.025 on both the right and left tails.\n", 90 | "\n", 91 | "Assuming our population mean $\mu = 100$ and population standard deviation $\sigma = 15$, we conduct a study involving a sample of 100 subjects. Upon analysis, we discover that the mean IQ of the sample is 96." 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 3, 97 | "id": "d3823b16", 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "statistic: 2.6667\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "z = abs(96-100)/(15/(100**0.5)) # |sample mean - population mean| / standard error\n", 110 | "print(\"statistic: \", round(z, 4))" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 4, 116 | "id": "24e9f7e6", 117 | "metadata": {}, 118 | "outputs": [ 119 | { 120 | "name": "stdout", 121 | "output_type": "stream", 122 | "text": [ 123 | "Critical: 1.96\n" 124 | ] 125 | } 126 | ], 127 | "source": [ 128 | "import scipy.stats as stats\n", 129 | "critical = stats.norm.ppf(1-0.025) # percent point function, the inverse of the CDF\n", 130 | "print(\"Critical:\", round(critical, 4))" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "id": "d9863311-a166-4e95-8aed-e9069f79b69b", 136 | "metadata": {}, 137 | "source": [ 138 | "Since the magnitude of our test statistic (2.6667) exceeds the critical value (1.96), we conclude that the drug has a significant influence on IQ values at a criterion level of $\alpha = 0.05$."
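As a cross-check on the critical-value comparison above, the same two-tail decision can be reached through the p-value. Below is a minimal sketch (not part of the original notebook) reusing the numbers from the IQ example:

```python
import scipy.stats as stats

# Two-tail p-value for the IQ example: |z| = 2.6667
z = abs(96 - 100) / (15 / (100 ** 0.5))
p_value = 2 * (1 - stats.norm.cdf(z))  # area in both tails beyond |z|
print(round(p_value, 4))  # ~0.0077, below alpha = 0.05, so we again reject H0
```

Both routes agree: rejecting when |z| exceeds the critical value is equivalent to rejecting when the two-tail p-value falls below the significance level.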
139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "id": "0d572e80-ee20-4009-90d3-535f3c1b84e3", 144 | "metadata": {}, 145 | "source": [ 146 | "## Two-Sample Z-Test\n", 147 | "\n", 148 | "A two-sample z-test is similar to a one-sample z-test, with the main differences being:\n", 149 | "\n", 150 | "- There are two groups/populations under consideration, and we draw one sample from each population.\n", 151 | "- Both population distributions are assumed to be normal.\n", 152 | "- Both population standard deviations are known.\n", 153 | "- The formula for calculating the test statistic is:\n", 154 | "\n", 155 | "$$z = \frac{\overline{x}_1 - \overline{x}_2} {\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$\n", 156 | "\n", 157 | "An organization manufactures LED bulbs in two production units, A and B. The quality control team believes that the quality of production at unit A is better than that of B. Quality is measured by how long a bulb works. The team takes samples from both units to test this. The mean lives of LED bulbs at units A and B are 1001.34 and 810.47, respectively. The sample sizes are 40 and 44. The population variances are known: $\sigma_A^2 = 48127$ and $\sigma_B^2 = 59173$.\n", 158 | "\n", 159 | "Conduct the appropriate test, at a 5% significance level, to verify the claim of the quality control team.\n", 160 | "\n", 161 | "**Null hypothesis:** $H_0: \mu_A ≤ \mu_B$ \n", 162 | "**Alternate hypothesis:** $H_1: \mu_A > \mu_B$\n", 163 | "\n", 164 | "Let's fix the level of significance at $\alpha = 0.05$." 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 5, 170 | "id": "3373712a", 171 | "metadata": {}, 172 | "outputs": [ 173 | { 174 | "name": "stdout", 175 | "output_type": "stream", 176 | "text": [ 177 | "3.781260568723408\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "z = (1001.34-810.47)/(48127/40+59173/44)**0.5\n", 183 | "print(z)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 6, 189 | "id": "451853c0", 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "7.801812433294586e-05" 196 | ] 197 | }, 198 | "execution_count": 6, 199 | "metadata": {}, 200 | "output_type": "execute_result" 201 | } 202 | ], 203 | "source": [ 204 | "import scipy.stats as stats\n", 205 | "p_value = 1 - stats.norm.cdf(z)\n", 206 | "p_value" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "id": "91602877-326c-4234-b084-e921956a04f1", 212 | "metadata": {}, 213 | "source": [ 214 | "Since the p-value (0.000078) is less than $\alpha$ (0.05), we reject the null hypothesis. The LED bulbs produced at unit A have a significantly longer life than those at unit B, at a 5% level." 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "id": "6a69211a-90c3-4156-bc49-a1ae0479609f", 220 | "metadata": {}, 221 | "source": [ 222 | "## Hypothesis Tests with Proportions\n", 223 | "\n", 224 | "Proportion tests are utilized with nominal data and are effective for comparing percentages or proportions. For instance, a survey collecting responses from a department in an organization might claim that 85% of people in the organization are satisfied with its policies. Historically, the satisfaction rate has been 82%. Here, we compare a percentage or proportion taken from the sample with a percentage/proportion from the population.
The following are some characteristics of the sampling distribution of proportions:\n", 225 | "\n", 226 | "- The sampling distribution of the proportions taken from the sample is approximately normal.\n", 227 | "- The mean of this sampling distribution ($\\overline{p}$) equals the population proportion ($p$).\n", 228 | "- Calculating the test statistic: The following equation gives the $z$-value:\n", 229 | "\n", 230 | "$$ z = \\frac{\\overline{p} - p}{\\sqrt{\\frac{p(1-p)}{n}}} $$\n", 231 | "\n", 232 | "Where $\\overline{p}$ is the sample proportion, $p$ is the population proportion, and $n$ is the sample size." 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "id": "df341e54-8cbf-40ad-8317-5b8a59f2e79e", 238 | "metadata": {}, 239 | "source": [ 240 | "## One-Sample Proportion Z-Test\n", 241 | "\n", 242 | "It is known that 40% of the total customers are satisfied with the services provided by a mobile service center. The customer service department of this center decides to conduct a survey for assessing the current customer satisfaction rate. It surveys 100 of its customers and finds that only 30 out of the 100 customers are satisfied with its services. Conduct a hypothesis test at a 5% significance level to determine if the percentage of satisfied customers has reduced from the initial satisfaction level (40%).\n", 243 | "\n", 244 | "**Null Hypothesis:** $H_0: p = 0.4$ \n", 245 | "**Alternate Hypothesis:** $H_1: p < 0.4$\n", 246 | "\n", 247 | "The < sign indicates a lower-tail test.\n", 248 | "\n", 249 | "Let's fix the level of significance at $\\alpha = 0.05$." 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 7, 255 | "id": "088856d4", 256 | "metadata": {}, 257 | "outputs": [ 258 | { 259 | "data": { 260 | "text/plain": [ 261 | "-2.041241452319316" 262 | ] 263 | }, 264 | "execution_count": 7, 265 | "metadata": {}, 266 | "output_type": "execute_result" 267 | } 268 | ], 269 | "source": [ 270 | "z=(0.3-0.4)/((0.4)*(1-0.4)/100)**0.5\n", 271 | "z" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 8, 277 | "id": "f369ef9c", 278 | "metadata": {}, 279 | "outputs": [ 280 | { 281 | "data": { 282 | "text/plain": [ 283 | "0.02061341666858179" 284 | ] 285 | }, 286 | "execution_count": 8, 287 | "metadata": {}, 288 | "output_type": "execute_result" 289 | } 290 | ], 291 | "source": [ 292 | "import scipy.stats as stats\n", 293 | "\n", 294 | "p=stats.norm.cdf(z)\n", 295 | "p" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "id": "7eb16578-2f61-4d3f-b7cc-88ee9bfd0435", 301 | "metadata": {}, 302 | "source": [ 303 | "p-value (0.02) < 0.05. We reject the null hypothesis. At a 5% significance level, the percentage of customers satisfied with the service center’s services has reduced." 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "id": "f627ba1c-9a49-4b8d-8cd4-09a4c07be810", 309 | "metadata": {}, 310 | "source": [ 311 | "## Two-Sample Proportion Z-Test\n", 312 | "\n", 313 | "Here, we compare proportions taken from two independent samples belonging to two different populations. The following equation gives the formula for the critical test statistic:\n", 314 | "\n", 315 | "$$ z = \\frac {(\\overline{p}_1 - \\overline{p}_2)}{\\sqrt{\\frac{p_c(1-p_c)}{N_1} + \\frac{p_c(1-p_c)}{N_2}}}$$\n", 316 | "\n", 317 | "In the preceding formula, $\\overline{p}_1$ is the proportion from the first sample, and $\\overline{p}_2$ is the proportion from the second sample. 
$N_1$ is the sample size of the first sample, and $N_2$ is the sample size of the second sample. $p_c$ is the pooled proportion.\n", 318 | "\n", 319 | "$$\overline{p}_1 = \frac{x_1}{N_1} ; \overline{p}_2 = \frac {x_2}{N_2} ; p_c = \frac {x_1 + x_2}{N_1 + N_2}$$\n", 320 | "\n", 321 | "In the preceding formula, $x_1$ is the number of successes in the first sample, and $x_2$ is the number of successes in the second sample." 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "id": "59ed178e-be77-4330-9d99-eb33778a0c59", 327 | "metadata": {}, 328 | "source": [ 329 | "## Investigation of Passenger Compliance with Child Safety Guidelines\n", 330 | "\n", 331 | "A ride-sharing company is investigating complaints by its drivers regarding passenger compliance with child safety guidelines, specifically concerning the use of child seats and seat belts. Surveys were independently conducted in two major cities, A and B, to gather data on passenger compliance. The company aims to determine if there is a difference in the proportion of passengers conforming to child safety guidelines between the two cities. The data for the two cities is summarized in the following table:\n", 332 | "\n", 333 | "| | City A | City B |\n", 334 | "|-----------------|---------|--------|\n", 335 | "| Total surveyed | 200 | 230 |\n", 336 | "| No. compliant | 110 | 106 |\n", 337 | "\n", 338 | "The company seeks to evaluate if the proportion of compliant passengers differs significantly between City A and City B." 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "id": "f3d2af18-91a7-407c-8d46-8d28ae426ce1", 344 | "metadata": {}, 345 | "source": [ 346 | "## Hypotheses for Two-Sample Proportion Test\n", 347 | "\n", 348 | "For the two-sample proportion test comparing compliance rates between City A and City B:\n", 349 | "\n", 350 | "- Null hypothesis: $H_0: p_A = p_B$\n", 351 | "- Alternative hypothesis: $H_1: p_A \neq p_B$\n", 352 | "\n", 353 | "This constitutes a two-tail test because the region of rejection could be located on either side.\n", 354 | "\n", 355 | "The significance level $\alpha$ is set at 0.05, resulting in an area of 0.025 on both sides."
356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 9, 361 | "id": "a5054442", 362 | "metadata": {}, 363 | "outputs": [ 364 | { 365 | "data": { 366 | "text/plain": [ 367 | "1.8437643201697864" 368 | ] 369 | }, 370 | "execution_count": 9, 371 | "metadata": {}, 372 | "output_type": "execute_result" 373 | } 374 | ], 375 | "source": [ 376 | "x1,n1,x2,n2=110,200,106,230\n", 377 | "p1=x1/n1\n", 378 | "p2=x2/n2\n", 379 | "pc=(x1+x2)/(n1+n2)\n", 380 | "z_statistic=(p1-p2)/(((pc*(1-pc)/n1)+(pc*(1-pc)/n2))**0.5)\n", 381 | "z_statistic" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 10, 387 | "id": "03472547", 388 | "metadata": {}, 389 | "outputs": [ 390 | { 391 | "data": { 392 | "text/plain": [ 393 | "1.959963984540054" 394 | ] 395 | }, 396 | "execution_count": 10, 397 | "metadata": {}, 398 | "output_type": "execute_result" 399 | } 400 | ], 401 | "source": [ 402 | "critical = stats.norm.ppf(1-0.025)\n", 403 | "critical" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 11, 409 | "id": "18ac0c29", 410 | "metadata": {}, 411 | "outputs": [ 412 | { 413 | "data": { 414 | "text/plain": [ 415 | "0.0653" 416 | ] 417 | }, 418 | "execution_count": 11, 419 | "metadata": {}, 420 | "output_type": "execute_result" 421 | } 422 | ], 423 | "source": [ 424 | "p_value = 2*(1-stats.norm.cdf(z_statistic)) # two-tail p-value; use z_statistic, not a stale z from an earlier cell\n", 425 | "round(p_value, 4)" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "id": "615b4141", 431 | "metadata": {}, 432 | "source": [ 433 | "## Conclusion of Two-Sample Proportion Test\r\n", 434 | "\r\n", 435 | "Based on the statistical analysis:\r\n", 436 | "\r\n", 437 | "- Since the test statistic (1.8438) is less than the critical value (1.96), and, equivalently, the p-value (0.0653) is greater than 0.05, we fail to reject the null hypothesis.\r\n", 438 | "- Therefore, there is no significant difference between the proportion of passengers in these cities complying with child safety norms, at a 5% significance level.\r\n" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "id": "f1ca0ba9", 445 | "metadata": {}, 446 | "outputs": [], 447 | "source": [] 448 | } 449 | ], 450 | "metadata": { 451 | "kernelspec": { 452 | "display_name": "Python 3 (ipykernel)", 453 | "language": "python", 454 | "name": "python3" 455 | }, 456 | "language_info": { 457 | "codemirror_mode": { 458 | "name": "ipython", 459 | "version": 3 460 | }, 461 | "file_extension": ".py", 462 | "mimetype": "text/x-python", 463 | "name": "python", 464 | "nbconvert_exporter": "python", 465 | "pygments_lexer": "ipython3", 466 | "version": "3.11.7" 467 | } 468 | }, 469 | "nbformat": 4, 470 | "nbformat_minor": 5 471 | } 472 | -------------------------------------------------------------------------------- /13. t-test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "04c5cfaf-d5f0-4375-a8ff-4f1527b3a04e", 6 | "metadata": {}, 7 | "source": [ 8 | "# T-Test\n", 9 | "\n", 10 | "In cases where the standard deviation of the population is not known and the sample size is small, the T-distribution is used.
This distribution is also known as the \"Student's T distribution\".\n", 11 | "\n", 12 | "The following are the key features of the T-distribution:\n", 13 | "\n", 14 | "+ It has a shape that is similar to a normal distribution but is slightly flatter, with heavier tails.\n", 15 | "+ The sample size is typically small, usually less than 30.\n", 16 | "+ The T-distribution takes into account the concept of degrees of freedom. These are the number of values in a statistical calculation that are free to vary independently. For example, if we have three numbers $x$, $y$, and $z$ and know that the mean is 5, we can conclude that the sum of the numbers must be $5 \times 3 = 15$. We have the freedom to choose any value for $x$ and $y$, but not $z$. $z$ must be chosen so that the numbers add up to 15 and the mean remains at 5. Despite having three numbers, we only have the freedom to choose two of them, meaning we have two degrees of freedom.\n", 17 | "+ As the sample size decreases, the degrees of freedom decrease, and the population parameter can be predicted with less certainty from the sample parameter. The degrees of freedom (df) in the T-distribution is equal to the number of samples minus 1, or $df = n - 1$.\n", 18 | "\n", 19 | "
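To make the heavier-tails point concrete, here is a small sketch (not part of the original notebook) comparing two-tail 5% critical values of the t-distribution with that of the standard normal; as the degrees of freedom grow, the t critical value shrinks toward 1.96:

```python
from scipy.stats import norm, t

# Two-tail 5% critical values: the t-distribution needs a larger cutoff
# than the normal at small df, and converges to it as df increases.
for df in [2, 5, 10, 30, 100]:
    print(f"df = {df:>3}: t critical = {t.ppf(0.975, df):.4f}")
print(f"normal:     z critical = {norm.ppf(0.975):.4f}")
```

With 2 degrees of freedom the cutoff is above 4.3; by 100 degrees of freedom it is within about 0.03 of the normal value.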
\n", 20 | "\n", 21 | "The formula for the critical test statistic in a one-sample t-test is given by the following equation: \n", 22 | "\n", 23 | "$$t = \\frac{\\overline{x} - \\mu}{\\frac{s}{\\sqrt{n}}}$$\n", 24 | "\n", 25 | "where $\\overline{x}$ is the sample mean, $\\mu$ is the population mean, $s$ is the sample standard deviation, and $n$ is the sample size.\n", 26 | "\n", 27 | "## One-Sample T-Test\n", 28 | "\n", 29 | "A one-sample t-test is similar to a one-sample z-test, with the following differences:\n", 30 | "\n", 31 | "1. The size of the sample is small ($< 30$).\n", 32 | "2. The population standard deviation is not known; we use the sample standard deviation ($s$) to calculate the standard error.\n", 33 | "3. The critical statistic here is the t-statistic, given by the following formula:\n", 34 | "\n", 35 | "$$t = \\frac{\\overline{x} - \\mu}{\\frac{s}{\\sqrt{n}}}$$\n" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "id": "2ce587eb-9dda-4779-a004-bea595c98abf", 41 | "metadata": {}, 42 | "source": [ 43 | "A coaching institute, preparing students for an exam, has 200 students, and the average score of the students in the practice tests is 80. It takes a sample of nine students and records their scores; it seems that the average score has now increased. These are the scores of these nine students: 80, 87, 80, 75, 79, 78, 89, 84, 88. Conduct a hypothesis test at a 5% significance level to verify if there is a significant increase in the average score.\n", 44 | "\n", 45 | "## Hypotheses\n", 46 | "\n", 47 | "- Null hypothesis ($H_0$): $\\mu = 80$\n", 48 | "- Alternative hypothesis ($H_1$): $\\mu > 80$\n" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 1, 54 | "id": "92b6154c", 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "TtestResult(statistic=1.348399724926488, pvalue=0.21445866072113726, df=8)" 61 | ] 62 | }, 63 | "execution_count": 1, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "import numpy as np\n", 70 | "import scipy.stats as stats\n", 71 | "\n", 72 | "sample = np.array([80,87,80,75,79,78,89,84,88])\n", 73 | "\n", 74 | "stats.ttest_1samp(sample,80)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "id": "3602c604", 80 | "metadata": {}, 81 | "source": [ 82 | "Since the p-value is greater than 0.05, we fail to reject the null hypothesis. Hence, we cannot conclude that the average score of students has changed." 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "id": "acf202cc", 88 | "metadata": {}, 89 | "source": [ 90 | "## Two-sample t-test \n", 91 | "\n", 92 | "A two-sample t-test is used when we take samples from two populations, where both the sample sizes are less than 30, and both the population standard deviations are unknown. Formula:\n", 93 | "\n", 94 | "$$t = \\frac{\\overline x_1 - \\overline x_2}{\\sqrt{S_p^2(\\frac{1}{n_1}+\\frac{1}{n_2})}}$$\n", 95 | "\n", 96 | "Where $x_1$ and $x_2$ are the sample means \n", 97 | "\n", 98 | "The degrees of freedom: $df=n_1 + n_2 − 2$ \n", 99 | "\n", 100 | "The pooled variance $S_p^2 = \\frac{(n_1 -1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$ \n", 101 | "\n", 102 | "A coaching institute has centers in two different cities. 
It takes a sample of nine students from each center and records their\n", 103 | "scores, which are as follows: \n", 104 | "\n", 105 | "|Center A:| 80, 87, 80, 75, 79, 78, 89, 84, 88|\n", 106 | "|---------|-----------------------------------|\n", 107 | "|Center B:| 81, 74, 70, 73, 76, 73, 81, 82, 84| \n", 108 | " \n", 109 | "Conduct a hypothesis test at a 5% significance level, and verify if there is a significant difference in the average scores of the\n", 110 | "students in these two centers.\n", 111 | "\n", 112 | "$H_0:\mu_1 = \mu_2$ \n", 113 | "$H_1:\mu_1 \neq \mu_2$" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 2, 119 | "id": "7b4b5010", 120 | "metadata": {}, 121 | "outputs": [ 122 | { 123 | "data": { 124 | "text/plain": [ 125 | "TtestResult(statistic=2.1892354788555664, pvalue=0.04374951024120649, df=16.0)" 126 | ] 127 | }, 128 | "execution_count": 2, 129 | "metadata": {}, 130 | "output_type": "execute_result" 131 | } 132 | ], 133 | "source": [ 134 | "a = np.array([80,87,80,75,79,78,89,84,88])\n", 135 | "b = np.array([81,74,70,73,76,73,81,82,84])\n", 136 | "\n", 137 | "stats.ttest_ind(a,b)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "id": "88a502b7", 143 | "metadata": {}, 144 | "source": [ 145 | "We can conclude that there is a significant difference in the average scores of students in the two centers of the coaching\n", 146 | "institute, since the p-value is less than 0.05." 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "id": "f498ce2d", 152 | "metadata": {}, 153 | "source": [ 154 | "## Two-sample t-test for paired samples \n", 155 | "\n", 156 | "This test is used to compare population means from samples that are dependent on each other, that is, sample values are measured twice using the same test group.\n", 157 | "\n", 158 | "+ A measurement taken at two different times (e.g., pre-test and post-test score with an intervention administered between the two time points)\n", 159 | "+ A measurement taken under two different conditions (e.g., completing a test under a \"control\" condition and an \"experimental\" condition)\n", 160 | "\n", 161 | "This equation gives the critical value of the test statistic for a paired two-sample t-test:\n", 162 | "\n", 163 | "$$t = \frac{\overline d}{s/\sqrt{n}}$$\n", 164 | "\n", 165 | "Where $\overline d$ is the average of the difference between the elements of the two samples. Both\n", 166 | "the samples have the same size, $n$. \n", 167 | "\n", 168 | "Standard deviation of the differences between the elements of the two samples, S = $\sqrt{\frac{\sum d^2 -((\sum d)^2/ n)}{n -1}}$\n", 169 | "\n", 170 | "The coaching institute is conducting a special program to improve the performance of the students. The scores of the same set of students are compared before and after the special program. Conduct a hypothesis test at a 5% significance level to verify if the scores have improved because of this program. Here, the hypotheses are $H_0: \mu_d = 0$ and $H_1: \mu_d < 0$, where $d$ denotes the before-minus-after difference in scores."
171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 3, 176 | "id": "9055c5e8", 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "data": { 181 | "text/plain": [ 182 | "TtestResult(statistic=-2.4473735525455615, pvalue=0.040100656419513776, df=8)" 183 | ] 184 | }, 185 | "execution_count": 3, 186 | "metadata": {}, 187 | "output_type": "execute_result" 188 | } 189 | ], 190 | "source": [ 191 | "a = np.array([80,87,80,75,79,78,89,84,88])\n", 192 | "b = np.array([81,89,83,81,79,82,90,82,90])\n", 193 | "\n", 194 | "stats.ttest_rel(a,b)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "id": "aa94fab0", 200 | "metadata": {}, 201 | "source": [ 202 | "We can conclude, at a 5% significance level, that the average score has improved after the\n", 203 | "special program was conducted since the p-value is less than 0.05" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "id": "95a06911", 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [] 213 | } 214 | ], 215 | "metadata": { 216 | "kernelspec": { 217 | "display_name": "Python 3 (ipykernel)", 218 | "language": "python", 219 | "name": "python3" 220 | }, 221 | "language_info": { 222 | "codemirror_mode": { 223 | "name": "ipython", 224 | "version": 3 225 | }, 226 | "file_extension": ".py", 227 | "mimetype": "text/x-python", 228 | "name": "python", 229 | "nbconvert_exporter": "python", 230 | "pygments_lexer": "ipython3", 231 | "version": "3.11.7" 232 | } 233 | }, 234 | "nbformat": 4, 235 | "nbformat_minor": 5 236 | } 237 | -------------------------------------------------------------------------------- /14. ANOVA - Analysis of Variance.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "0cba2aff", 6 | "metadata": {}, 7 | "source": [ 8 | "# ANOVA" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "64083e01-6eaa-4477-a3a2-48b34d0579f1", 14 | "metadata": {}, 15 | "source": [ 16 | "# Analysis of Variance (ANOVA)\n", 17 | "\n", 18 | "ANOVA (Analysis of Variance) is a statistical method used for comparing the means of multiple populations. Previously, we have considered only a single population or at most two populations. A one-way ANOVA uses one independent variable, while a two-way ANOVA uses two independent variables. The statistical distribution used in ANOVA is the F-distribution, whose characteristics are as follows:\n", 19 | "\n", 20 | "1. The F-distribution has a single tail (toward the right) and contains only positive values.\n", 21 | "\n", 22 | "
\n", 23 | "\n", 24 | "2. The F-statistic, which is the critical statistic in ANOVA, is the ratio of variation between the sample means to the variation within the samples. The formula is as follows:\n", 25 | " $$F = \\frac{\\text{variation between sample means}}{\\text{variation within the samples}}$$\n", 26 | "\n", 27 | "3. The different populations are referred to as treatments.\n", 28 | "4. A high value of the F-statistic implies that the variation between samples is considerable compared to the variation within the samples. In other words, the populations or treatments from which the samples are drawn are actually different from one another.\n", 29 | "5. Random variations between treatments are more likely to occur when the variation within the sample is considerable.\n", 30 | "\n", 31 | "Use a one-way ANOVA when you have collected data about one categorical independent variable and one quantitative dependent variable. The independent variable should have at least three levels (i.e., at least three different groups or categories).\n", 32 | "\n", 33 | "ANOVA tells you if the dependent variable changes according to the level of the independent variable. For example:\n", 34 | "\n", 35 | "+ Your independent variable is social media use, and you assign groups to low, medium, and high levels of social media use to find out if there is a difference in hours of sleep per night.\n", 36 | "+ Your independent variable is the brand of soda, and you collect data on Coke, Pepsi, Sprite, and Fanta to find out if there is a difference in the price per 100ml.\n", 37 | "\n", 38 | "ANOVA determines whether the groups created by the levels of the independent variable are statistically different by calculating whether the means of the treatment levels are different from the overall mean of the dependent variable. If any of the group means is significantly different from the overall mean, then the null hypothesis is rejected.\n", 39 | "\n", 40 | "ANOVA uses the F-test for statistical significance. This allows for the comparison of multiple means at once, as the error is calculated for the whole set of comparisons rather than for each individual two-way comparison (which would happen with a t-test).\n", 41 | "\n", 42 | "The F-test compares the variance in each group mean from the overall group variance. If the variance within groups is smaller than the variance between groups, the F-test will find a higher F-value, and therefore a higher likelihood that the difference observed is real and not due to chance.\n", 43 | "\n", 44 | "The assumptions of the ANOVA test are the same as the general assumptions for any parametric test:\n", 45 | "\n", 46 | "+ **Independence of observations:** The data were collected using statistically valid methods, and there are no hidden relationships among observations. If your data fail to meet this assumption because you have a confounding variable that you need to control for statistically, use an ANOVA with blocking variables.\n", 47 | "+ **Normally distributed response variable:** The values of the dependent variable follow a normal distribution.\n", 48 | "+ **Homogeneity of variance:** The variation within each group being compared is similar for every group. 
If the variances are different among the groups, then ANOVA probably isn’t the right fit for the data.\n" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "6c2285ee", 54 | "metadata": {}, 55 | "source": [ 56 | "## One-Way-ANOVA\n", 57 | "\n", 58 | "A few agricultural research scientists have planted a new variety of cotton called “AB\n", 59 | "cotton.” They have used three different fertilizers – A, B, and C – for three separate\n", 60 | "plots of this variety. The researchers want to find out if the yield varies with the type of\n", 61 | "fertilizer used. Yields in bushels per acre are mentioned in the below table. Conduct an\n", 62 | "ANOVA test at a 5% level of significance to see if the researchers can conclude that there\n", 63 | "is a difference in yields.\n", 64 | "\n", 65 | "| Fertilizer A | Fertilizer B | Fertilizer C |\n", 66 | "|--------------|--------------|--------------|\n", 67 | "| 40 | 45 | 55 |\n", 68 | "| 30 | 35 | 40 |\n", 69 | "| 35 | 55 | 30 |\n", 70 | "| 45 | 25 | 20 |\n", 71 | "\n", 72 | "Null hypothesis: $H_0 : \mu_1 = \mu_2 = \mu_3$ \n", 73 | "Alternative hypothesis: $H_1$: not all of $\mu_1, \mu_2, \mu_3$ are equal\n", 74 | "\n", 75 | "The level of significance: $\alpha = 0.05$" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 1, 81 | "id": "15dcbdf0", 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "data": { 86 | "text/plain": [ 87 | "F_onewayResult(statistic=0.10144927536231883, pvalue=0.9045455407589628)" 88 | ] 89 | }, 90 | "execution_count": 1, 91 | "metadata": {}, 92 | "output_type": "execute_result" 93 | } 94 | ], 95 | "source": [ 96 | "import scipy.stats as stats\n", 97 | "\n", 98 | "a=[40,30,35,45]\n", 99 | "b=[45,35,55,25]\n", 100 | "c=[55,40,30,20]\n", 101 | "\n", 102 | "stats.f_oneway(a,b,c)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "id": "908afb47", 108 | "metadata": {}, 109 | "source": [ 110 | "Since the calculated p-value (0.904) > 0.05, we fail to reject the null hypothesis. There is no significant difference between the three treatments, at a 5% significance level." 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "id": "cb92bfa2", 116 | "metadata": {}, 117 | "source": [ 118 | "## Two-way-ANOVA \n", 119 | "\n", 120 | "A botanist wants to know whether or not plant growth is influenced by sunlight exposure and watering frequency. She plants 30 seeds and lets them grow for two months under different conditions for sunlight exposure and watering frequency. After two months, she records the height of each plant, in inches." 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 2, 126 | "id": "872f28ac", 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "import numpy as np\n", 131 | "import pandas as pd\n", 132 | "\n", 133 | "#create data\n", 134 | "df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),\n", 135 | "                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),\n", 136 | "                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,\n", 137 | "                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,\n", 138 | "                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 3, 144 | "id": "5e132098", 145 | "metadata": {}, 146 | "outputs": [ 147 | { 148 | "data": { 149 | "text/html": [ 150 | "
\n", 151 | "\n", 164 | "\n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | "
watersunheight
0dailylow6
1dailylow6
2dailylow6
3dailylow5
4dailylow6
5dailymed5
6dailymed5
7dailymed6
8dailymed4
9dailymed5
\n", 236 | "
" 237 | ], 238 | "text/plain": [ 239 | " water sun height\n", 240 | "0 daily low 6\n", 241 | "1 daily low 6\n", 242 | "2 daily low 6\n", 243 | "3 daily low 5\n", 244 | "4 daily low 6\n", 245 | "5 daily med 5\n", 246 | "6 daily med 5\n", 247 | "7 daily med 6\n", 248 | "8 daily med 4\n", 249 | "9 daily med 5" 250 | ] 251 | }, 252 | "execution_count": 3, 253 | "metadata": {}, 254 | "output_type": "execute_result" 255 | } 256 | ], 257 | "source": [ 258 | "df[:10]" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 4, 264 | "id": "ef799971", 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "data": { 269 | "text/html": [ 270 | "
\n", 271 | "\n", 284 | "\n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | "
sum_sqdfFPR(>F)
C(water)8.5333331.016.00000.000527
C(sun)24.8666672.023.31250.000002
C(water):C(sun)2.4666672.02.31250.120667
Residual12.80000024.0NaNNaN
\n", 325 | "
" 326 | ], 327 | "text/plain": [ 328 | " sum_sq df F PR(>F)\n", 329 | "C(water) 8.533333 1.0 16.0000 0.000527\n", 330 | "C(sun) 24.866667 2.0 23.3125 0.000002\n", 331 | "C(water):C(sun) 2.466667 2.0 2.3125 0.120667\n", 332 | "Residual 12.800000 24.0 NaN NaN" 333 | ] 334 | }, 335 | "execution_count": 4, 336 | "metadata": {}, 337 | "output_type": "execute_result" 338 | } 339 | ], 340 | "source": [ 341 | "import statsmodels.api as sm\n", 342 | "from statsmodels.formula.api import ols\n", 343 | "\n", 344 | "#perform two-way ANOVA\n", 345 | "model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()\n", 346 | "sm.stats.anova_lm(model, typ=2)" 347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "id": "15f1611b-ebbb-426f-8b17-1437a5def208", 352 | "metadata": {}, 353 | "source": [ 354 | "# Analysis of Variance (ANOVA) Results\n", 355 | "\n", 356 | "We can see the following p-values for each of the factors in the table:\n", 357 | "\n", 358 | "- **Water:** p-value = 0.000527 \n", 359 | "- **Sun:** p-value = 0.0000002 \n", 360 | "- **Water * Sun:** p-value = 0.120667 \n", 361 | "\n", 362 | "Since the p-values for water and sun are both less than 0.05, this means that both factors have a statistically significant effect on plant height.\n", 363 | "\n", 364 | "And since the p-value for the interaction effect (0.120667) is not less than 0.05, this tells us that there is no significant interaction effect between sunlight exposure and watering frequency." 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "id": "ab33cbc5", 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [] 374 | } 375 | ], 376 | "metadata": { 377 | "kernelspec": { 378 | "display_name": "Python 3 (ipykernel)", 379 | "language": "python", 380 | "name": "python3" 381 | }, 382 | "language_info": { 383 | "codemirror_mode": { 384 | "name": "ipython", 385 | "version": 3 386 | }, 387 | "file_extension": ".py", 388 | "mimetype": "text/x-python", 389 | "name": "python", 390 | "nbconvert_exporter": "python", 391 | "pygments_lexer": "ipython3", 392 | "version": "3.11.7" 393 | } 394 | }, 395 | "nbformat": 4, 396 | "nbformat_minor": 5 397 | } 398 | -------------------------------------------------------------------------------- /15. Chi-Square Test for Independence and Goodness of Fit.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "512274d1-097e-4da4-b7a6-44af0d3230a5", 6 | "metadata": {}, 7 | "source": [ 8 | "# Chi-square test for Independence\n", 9 | "\n", 10 | "The chi-square test is a nonparametric test for testing the association between two variables. A non-parametric test is one that does not make any assumption about the distribution of the population from which the sample is drawn.\n", 11 | "\n", 12 | "The following are some of the characteristics of the chi-square test.\n", 13 | "+ The chi-square test of association is used to test if the frequency of occurrence of one categorical variable is significantly associated with that of another categorical variable.\n", 14 | "\n", 15 | " The chi-square test statistic is given by: \n", 16 | "\n", 17 | " $$\\chi^2 = \\sum\\frac {(f_o -f_e)^2}{f_e}$$\n", 18 | "\n", 19 | " where, $f_o$ denotes the observed frequencies, $f_e$ denotes the expected frequencies, and $\\chi$ is the test statistic. 
\n", 20 | " Using the chi-square test of association, we can assess if the differences between the frequencies are statistically significant.\n", 21 | "\n", 22 | "+ A contingency table is a table with frequencies of the variable listed under separate columns. The formula for the degrees of freedom in the chi-square test is given by: *df=(r-1)(c-1)*, where *df* is the number of degrees of freedom, r is the number of rows in the contingency table, and c is the number of columns in the contingency table.\n", 23 | "\n", 24 | "\n", 25 | "\n", 26 | "+ The chi-square test compares the observed values of a set of variables with their expected values. It determines if the differences between the observed values and expected values are due to random chance (like a sampling error), or if these differences are statistically significant. If there are only small differences between the observed and expected values, it may be due to an error in sampling. If there are substantial differences between the two, it may indicate an association between the variables.\n", 27 | "\n", 28 | "
\n", 29 | "\n", 30 | "+ The shape of the chi-square distribution for different values of k (degrees of freedom) When the degrees of freedom are few, it looks like an F-distribution. It has only one tail (toward the right). As the degrees of freedom increase, it looks like a normal curve. Also, the increase in the degrees of freedom indicates that the difference between the observed values and expected values could be meaningful and not just due to a sampling error. " 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "e5d38703", 36 | "metadata": {}, 37 | "source": [ 38 | "**Example:**\n", 39 | "\n", 40 | "Suppose there is a city of 1,000,000 residents with four neighborhoods: A, B, C, and D. A random sample of 650 residents of the city is taken and their occupation is recorded as \"white collar\", \"blue collar\", or \"no collar\". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification.\n", 41 | "The data are tabulated as:\n", 42 | "\n", 43 | "| OBSERVED | A | B | C | D | Row Total |\n", 44 | "|:------------:|-----|-----|-----|-----|-----------|\n", 45 | "| White Collar | 90 | 60 | 104 | 95 | 349 |\n", 46 | "| Blue Collar | 30 | 50 | 51 | 20 | 151 |\n", 47 | "| No Collar | 30 | 40 | 45 | 35 | 150 |\n", 48 | "| Column Total | 150 | 150 | 200 | 150 | 650 |\n", 49 | "\n", 50 | "\n", 51 | "+ **Null hypothesis:** $H_0$: Occupation and Neighbourhood of Residence are not related. \n", 52 | "\n", 53 | "+ **Alternative hypothesis**: $H_1$: Occupation and Neighbourhood of Residence are related. \n", 54 | "\n", 55 | "+ **Number of variables:** Two categorical variables (Occupation and Neighbourhood)\n", 56 | "\n", 57 | "+ What we are testing: Testing for an association between Occupation and Neighbourhood.\n", 58 | "\n", 59 | "+ We conduct a chi-square test of association based on the preceding characteristics.\n", 60 | "\n", 61 | "+ Fix the level of significance: α=0.05\n", 62 | "\n", 63 | "Make an **expected** value table from the totals\n", 64 | "\n", 65 | "For each entry calcuate : $$\\frac{(row\\ total * column\\ total)}{overall\\ total}$$\n", 66 | "\n", 67 | "Example: For A neighbourhood 150 * (349/650) must be the expected White collar Job.\n", 68 | "\n", 69 | "| EXPECTED | A | B | C | D |\n", 70 | "|:------------:|-------|-------|--------|-------|\n", 71 | "| White Collar | 80.54 | 80.54 | 107.38 | 80.54 |\n", 72 | "| Blue Collar | 34.85 | 34.85 | 46.46 | 34.85 |\n", 73 | "| No Collar | 34.62 | 34.62 | 46.15 | 34.62 |\n", 74 | "\n", 75 | "Each of the value in the Expected Value table is 5 or higher. 
may proceed with the chi-square test.\n", 76 | "\n", 77 | "Calculate: $$\chi^2 = \sum\frac {(f_o -f_e)^2}{f_e}$$\n", 78 | "\n", 79 | "$$\chi^2\ statistic\ \approx\ 24.6$$\n", 80 | "\n", 81 | "Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom is\n", 82 | "\n", 83 | "*dof = (number of rows-1)(number of columns-1) = (3-1)(4-1) = 6*\n", 84 | "\n", 85 | "From the chi-square distribution table, the p-value is less than 0.0005." 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 1, 91 | "id": "18ff36a2", 92 | "metadata": {}, 93 | "outputs": [ 94 | { 95 | "name": "stdout", 96 | "output_type": "stream", 97 | "text": [ 98 | "Chi-Square Statistic: 24.5712028585826\n", 99 | "p-value: 0.0004098425861096696\n", 100 | "degrees of freedom: 6\n", 101 | "Expected Value: \n", 102 | " [[ 80.53846154 80.53846154 107.38461538 80.53846154]\n", 103 | " [ 34.84615385 34.84615385 46.46153846 34.84615385]\n", 104 | " [ 34.61538462 34.61538462 46.15384615 34.61538462]]\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "import scipy.stats as stats\n", 110 | "import numpy as np\n", 111 | "\n", 112 | "observations = np.array([[90,60,104,95],[30,50,51,20],[30,40,45,35]])\n", 113 | "chi2stat, pval, dof, expvalue = stats.chi2_contingency(observations)\n", 114 | "\n", 115 | "print('Chi-Square Statistic: ', chi2stat)\n", 116 | "print('p-value: ', pval)\n", 117 | "print('degrees of freedom: ', dof)\n", 118 | "print('Expected Value: \\n', expvalue)\n" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "id": "0ef31bfe", 124 | "metadata": {}, 125 | "source": [ 126 | "The p-value turns out to be 0.0004 < 0.05. Therefore, we reject the null hypothesis.\n", 127 | "There is a significant association between the Occupation and Neighbourhood of Residence, at a 5%\n", 128 | "significance level." 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "id": "c0dd9fe9", 134 | "metadata": {}, 135 | "source": [ 136 | "### **Chi-Square Goodness of Fit Test:** \n", 137 | "\n", 138 | "A Chi-Square goodness of fit test can be used in a wide variety of settings. Here are a few examples:\n", 139 | "\n", 140 | "+ We want to know if a die is fair, so we roll it 50 times and record the number of times it lands on each number.\n", 141 | "+ We want to know if an equal number of people come into a shop each day of the week, so we count the number of people who come in each day during a random week.\n", 142 | "\n", 143 | "It is performed in a similar way.\n", 144 | "\n", 145 | "A shop owner claims that an equal number of customers come into his shop each weekday. To test this hypothesis, an independent researcher records the number of customers that come into the shop on a given week and finds the following:\n", 146 | "\n", 147 | "| Day | Customers |\n", 148 | "|:---------:|-----------|\n", 149 | "| Monday | 50 |\n", 150 | "| Tuesday | 60 |\n", 151 | "| Wednesday | 40 |\n", 152 | "| Thursday | 47 |\n", 153 | "| Friday | 53 |\n", 154 | "\n", 155 | "$H_0$: An equal number of customers come into the shop each day. \n", 156 | "$H_1$: The number of customers coming into the shop is not equal across the days.\n", 157 | "\n", 158 | "There were a total of 250 customers that came into the shop during the week.
Thus, if we expect an equal number to come in each day, then the expected value $E$ for each day would be 50.\n", 159 | "\n", 160 | "$Monday: (50-50)^2 / 50 = 0$ \n", 161 | "$Tuesday: (60-50)^2 / 50 = 2$ \n", 162 | "$Wednesday: (40-50)^2 / 50 = 2$ \n", 163 | "$Thursday: (47-50)^2 / 50 = 0.18$ \n", 164 | "$Friday: (53-50)^2 / 50 = 0.18$ \n", 165 | "\n", 166 | "$\chi^2 = \sum \frac{(Obs-Exp)^2}{Exp} = 0 + 2 + 2 + 0.18 + 0.18 = 4.36$\n", 167 | "\n", 168 | "The p-value associated with $\chi^2$ = 4.36 and degrees of freedom n-1 = 5-1 = 4 is **0.359472.**\n", 169 | "\n", 170 | "Since this p-value is not less than 0.05, we fail to reject the null hypothesis. This means we do not have sufficient evidence to say that the true distribution of customers is different from the distribution that the shop owner claimed." 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "id": "dee06726", 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [] 180 | } 181 | ], 182 | "metadata": { 183 | "kernelspec": { 184 | "display_name": "Python 3 (ipykernel)", 185 | "language": "python", 186 | "name": "python3" 187 | }, 188 | "language_info": { 189 | "codemirror_mode": { 190 | "name": "ipython", 191 | "version": 3 192 | }, 193 | "file_extension": ".py", 194 | "mimetype": "text/x-python", 195 | "name": "python", 196 | "nbconvert_exporter": "python", 197 | "pygments_lexer": "ipython3", 198 | "version": "3.11.7" 199 | } 200 | }, 201 | "nbformat": 4, 202 | "nbformat_minor": 5 203 | } 204 | -------------------------------------------------------------------------------- /16. Effect Size and Statistical Power.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "b9e53dce", 6 | "metadata": {}, 7 | "source": [ 8 | "## Effect Size\n", 9 | "\n", 10 | "\n", 11 | "Quantifying the difference between two groups can be achieved by using an effect size. A p-value provides information about the statistical significance of the difference between the groups, but it doesn't give an insight into the magnitude of the difference. Larger sample sizes often result in a higher likelihood of finding a statistically significant difference, even if the real-world effect is small. Hence, it's crucial to consider effect sizes in addition to p-values, as they provide a clearer picture of the true difference between the groups and are more valuable in practical applications.\n", 12 | "\n", 13 | "There are different measures for effect sizes. The most common effect sizes are Cohen's d and Pearson's r. \n", 14 | "\n", 15 | "Cohen's d measures the size of the difference between two groups while Pearson's r measures the strength of the relationship between two variables.\n", 16 | "\n", 17 | "### Cohen's d - Standardized Mean Difference\n", 18 | "Cohen's d is designed for comparing two groups. It takes the difference between two means and expresses it in standard deviation units. It tells you how many standard deviations lie between the two means.\n", 19 | "\n", 20 | "$$ d =\frac{ \overline x_1 - \overline x_2 }{S}$$\n", 21 | "\n", 22 | "where $\overline x_1$ and $\overline x_2$ are the means of group 1 and group 2, respectively,
 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "id": "dee06726", 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [] 180 | } 181 | ], 182 | "metadata": { 183 | "kernelspec": { 184 | "display_name": "Python 3 (ipykernel)", 185 | "language": "python", 186 | "name": "python3" 187 | }, 188 | "language_info": { 189 | "codemirror_mode": { 190 | "name": "ipython", 191 | "version": 3 192 | }, 193 | "file_extension": ".py", 194 | "mimetype": "text/x-python", 195 | "name": "python", 196 | "nbconvert_exporter": "python", 197 | "pygments_lexer": "ipython3", 198 | "version": "3.11.7" 199 | } 200 | }, 201 | "nbformat": 4, 202 | "nbformat_minor": 5 203 | } 204 | -------------------------------------------------------------------------------- /16. Effect Size and Statistical Power.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "b9e53dce", 6 | "metadata": {}, 7 | "source": [ 8 | "## Effect Size\n", 9 | "\n", 10 | "\n", 11 | "Quantifying the difference between two groups can be achieved by using an effect size. A p-value provides information about the statistical significance of the difference between the groups, but it doesn't give insight into the magnitude of the difference. Larger sample sizes often result in a higher likelihood of finding a statistically significant difference, even if the real-world effect is small. Hence, it's crucial to consider effect sizes in addition to p-values, as they provide a clearer picture of the true difference between the groups and are more valuable in practical applications.\n", 12 | "\n", 13 | "There are different measures for effect sizes. The most common effect sizes are Cohen's d and Pearson's r. \n", 14 | "\n", 15 | "Cohen's d measures the size of the difference between two groups, while Pearson's r measures the strength of the relationship between two variables.\n", 16 | "\n", 17 | "### Cohen's d - Standardized Mean Difference\n", 18 | "Cohen's d is designed for comparing two groups. It takes the difference between two means and expresses it in standard deviation units. It tells you how many standard deviations lie between the two means.\n", 19 | "\n", 20 | "$$ d =\\frac{ \\overline x_1 - \\overline x_2 }{S}$$\n", 21 | "\n", 22 | "where $\\overline x_1$ and $\\overline x_2$ are the means of group 1 and group 2, respectively, and $S$ is the standard deviation.\n", 23 | "\n", 24 | "The choice of standard deviation in the equation depends on your research design.\n", 25 | "We can use:\n", 26 | "+ a pooled standard deviation based on data from both groups,\n", 27 | "+ the standard deviation of a control group, or\n", 28 | "+ the standard deviation from the pretest or posttest data.\n", 29 | "\n", 30 | "### Pearson's r - Correlation Coefficient\n", 31 | "\n", 32 | "Pearson's $r$, or the correlation coefficient, measures the extent of a linear relationship between two variables.\n", 33 | "\n", 34 | "The formula is rather complex, so it’s best to use statistical software to calculate Pearson's r accurately from the raw data.\n", 35 | "\n", 36 | "$$ r_{xy} = \\frac{n\\sum x_i y_i -\\sum x_i \\sum y_i}{\\sqrt{n\\sum x_i^2-(\\sum x_i)^2}\\sqrt{n\\sum y_i^2-(\\sum y_i)^2}}$$\n", 37 | "\n", 38 | "The main idea of the formula is to compute how much of the variability of one variable is determined by the variability of the other variable. Pearson's r is a standardized, unit-free scale for measuring correlations between variables, so you can directly compare the strengths of all correlations with each other.\n", 39 | "\n", 40 | "### Interpreting Values\n", 41 | "\n", 42 | "+ Cohen's $d$ can take on any value between 0 and infinity. In general, the greater the Cohen's d, the larger the effect size.\n", 43 | "+ Pearson's $r$ ranges between -1 and 1. The closer the value is to 0, the smaller the effect size. A value closer to -1 or 1 indicates a higher effect size.\n", 44 | "\n", 45 | "A general rule of thumb to quantify whether an effect size is small, medium or large:\n", 46 | "\n", 47 | "**Cohen’s D:**\n", 48 | "\n", 49 | "+ A d of 0.2 or smaller is considered to be a small effect size.\n", 50 | "+ A d of 0.5 is considered to be a medium effect size.\n", 51 | "+ A d of 0.8 or larger is considered to be a large effect size.\n", 52 | "\n", 53 | "\n", 54 | "**Pearson Correlation Coefficient:**\n", 55 | "\n", 56 | "+ An absolute value of r around 0.1 is considered a low effect size.\n", 57 | "+ An absolute value of r around 0.3 is considered a medium effect size.\n", 58 | "+ An absolute value of r greater than 0.5 is considered to be a large effect size." 59 | ] 60 | },
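 { "cell_type": "markdown", "id": "a9b8c7d6", "metadata": {}, "source": [ "To make these formulas concrete, here is a minimal sketch computing both measures on small made-up samples (the numbers are invented purely for illustration; Cohen's d uses a pooled standard deviation, and Pearson's r comes from `scipy.stats.pearsonr`):" ] }, { "cell_type": "code", "execution_count": null, "id": "b1c2d3e4", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy import stats\n", "\n", "# Cohen's d with a pooled standard deviation (made-up test scores)\n", "group1 = np.array([23, 25, 28, 30, 32, 35])\n", "group2 = np.array([18, 20, 22, 25, 26, 27])\n", "n1, n2 = len(group1), len(group2)\n", "pooled_sd = np.sqrt(((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2))\n", "d = (group1.mean() - group2.mean()) / pooled_sd\n", "\n", "# Pearson's r for two paired variables (also made up)\n", "x = np.array([1, 2, 3, 4, 5, 6])\n", "y = np.array([2, 1, 4, 3, 7, 5])\n", "r, r_pval = stats.pearsonr(x, y)\n", "\n", "print(f\"Cohen's d: {d:.2f}\")  # ~1.4 for these samples: a large effect by the rule of thumb\n", "print(f\"Pearson's r: {r:.2f}\")  # ~0.8: a strong positive correlation\n" ] },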
 61 | { 62 | "cell_type": "markdown", 63 | "id": "f42c2010-37de-4c7f-b02d-7ddfa104ca9e", 64 | "metadata": {}, 65 | "source": [ 66 | "# Statistical Power\n", 67 | "\n", 68 | "Statistical power, or sensitivity, is the likelihood of a significance test detecting an effect when there actually is one. In other words, power is the probability that we will correctly reject the null hypothesis.\n", 69 | "\n", 70 | "Let's look at an example to understand this concept. Suppose we have two distributions with minimal overlap, as shown in the first picture below. If we collect a small set of samples from both the green and red distributions and compare their means using hypothesis testing, we might get a small p-value, say 0.0004. This would cause us to correctly reject the null hypothesis that both sample sets came from the same distribution. In other words, if the null hypothesis claims that all the data points came from a single distribution, we would reject that claim.\n", 71 | "\n", 72 | "If we keep repeating this experiment multiple times, there's a high probability that each statistical test will correctly give us a small p-value. In other words, there is a high probability that the null hypothesis that all the data came from the same distribution will be correctly rejected.\n", 73 | "\n", 74 | "However, occasionally, we might get a trial like in the second picture below, where the two sample sets appear to come from the same distribution due to overlapping sample points, resulting in a high p-value, like 0.08. This means that even though we know that the data came from two different distributions, we cannot correctly reject the null hypothesis that all the data came from the same distribution. Still, since these two distributions are far apart and have very little overlap, such trials are rare, and the probability of correctly rejecting the null hypothesis is high. Thus, power, being the probability that we will correctly reject the null hypothesis, is high in this example.\n", 75 | "\n", 76 | "In summary, when distributions have minimal overlap, the statistical power is high, meaning there is a high likelihood of correctly rejecting the null hypothesis.\n", 77 | "\n", 78 | "\n", 79 | "(figure: two distributions, green and red, with very little overlap)
 \n", 80 | "(figure: an occasional trial in which the sample points drawn from the two distributions overlap)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "id": "9b078af3-f059-44bb-be03-bfe088958ac1", 86 | "metadata": {}, 87 | "source": [ 88 | "# Statistical Power: Overlapping Distributions\n", 89 | "\n", 90 | "Now, let's consider a different scenario where we have a large overlap in the distributions, as shown in the first picture below. Most of the time, when we compare the means of samples from these two distributions, we get a high p-value and fail to reject the null hypothesis that the data comes from the same distribution.\n", 91 | "\n", 92 | "However, occasionally, when the sample data points are from the far extremes of the distributions, as shown in the second picture below, we get a small p-value and can correctly reject the null hypothesis that the data comes from the same distribution. Due to the overlap, the probability of correctly rejecting the null hypothesis is low, meaning we have relatively low power.\n", 93 | "\n", 94 | "The good news is that we can increase the power by increasing the number of samples we collect. Power analysis will tell us how many measurements we need to collect to achieve a good amount of power.\n", 95 | "\n", 96 | "In summary, when distributions have a large overlap, the statistical power is low, meaning there is a low likelihood of correctly rejecting the null hypothesis. By increasing the sample size, we can improve the power of our test.\n", 97 | "\n", 98 | "(figure: two distributions with a large overlap)
 \n", 99 | "\n", 100 | "(figure: a rare trial whose sample points come from the far extremes of the two overlapping distributions)
 \n", 101 | "\n", 102 | "Before we learn how to do a power analysis, let's understand in detail why we need to perform one. " 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "id": "aa70ccc3-5b6a-4efb-9acf-459ea869ffc8", 108 | "metadata": {}, 109 | "source": [ 110 | "### Need for Power Analysis\n", 111 | "\n", 112 | "In hypothesis testing, we start with a null hypothesis of no effect and an alternative hypothesis of a true effect. The goal is to collect enough data from a sample to statistically test whether we can reasonably reject the null hypothesis in favor of the alternative hypothesis. In doing so, there's always a risk of making one of two decision errors when interpreting study results:\n", 113 | "\n", 114 | "- **Type I error**: Rejecting the null hypothesis of no effect when it is actually true.\n", 115 | "- **Type II error**: Not rejecting the null hypothesis of no effect when it is actually false.\n", 116 | "\n", 117 | "Power is the probability of avoiding a Type II error. The higher the statistical power of a test, the lower the risk of making a Type II error. Power is usually set at 80%. This means that if there are true effects to be found in 100 different studies tested with 80% power, only about 80 of the 100 statistical tests will actually detect them. If we don't ensure sufficient power, our study may not be able to detect a true effect at all. This means that resources like time and money are wasted, and it may even be unethical to collect data from participants.\n", 118 | "\n", 119 | "On the flip side, too much power means our tests are highly sensitive to true effects, including very small ones. This may lead to finding statistically significant results with very little usefulness in the real world. To balance these pros and cons of low versus high statistical power, we should use a **Power Analysis** to set an appropriate level.\n" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "id": "515995e2-6617-4ef0-83ea-b5733548d321", 125 | "metadata": {}, 126 | "source": [ 127 | "# Power Analysis\n", 128 | "\n", 129 | "Power is mainly influenced by sample size, effect size, and significance level. A power analysis can be used to determine the necessary sample size for a study. Having enough statistical power is necessary to draw accurate conclusions about a population using sample data.\n", 130 | "\n", 131 | "Power is affected by several factors, but two main factors are:\n", 132 | "\n", 133 | "- **Overlap:** How much overlap there is between the two distributions we want to distinguish in our study.\n", 134 | "- **Sample Size:** The number of samples we collect from each group.\n", 135 | "\n", 136 | "If we want power to be 80% and there is very little overlap, a small sample size will suffice. However, if the overlap between the two distributions is greater, we need a larger sample size to achieve 80% power.\n", 137 | "\n", 138 | "To understand the relationship between overlap and sample size, we need to realize that when we do a statistical test, we usually compare sample means rather than individual measurements. 
So let's see what happens when we calculate means with different sample sizes.\n", 139 | "\n", 140 | "- If the sample size is small, the estimated means vary a lot from sample to sample, so it is hard to be confident that any single estimated mean is a good estimate of the population mean, and the estimated means of the two distributions overlap.\n", 141 | "- But if the sample size is large, the estimated means are so close to the population mean that they no longer overlap. This suggests a high probability that we correctly reject the null hypothesis that both samples came from the same distribution. With a large sample size, we can achieve high power. Additionally, by the central limit theorem, these results apply regardless of the shape of the underlying distribution.\n", 142 | "\n", 143 | "A power analysis consists of four main components. If you know or have estimates for any three of these, you can calculate the fourth component:\n", 144 | "\n", 145 | "- **Statistical Power:** The likelihood that a test will detect an effect of a certain size if there is one, usually set at 80% or higher.\n", 146 | "- **Sample Size:** The minimum number of observations needed to observe an effect of a certain size with a given power level.\n", 147 | "- **Significance Level (alpha):** The maximum risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.\n", 148 | "- **Expected Effect Size:** A standardized measure of how large a difference you expect between the two distributions, combining their means and standard deviations; it is commonly captured by an effect size such as Cohen's d, and there are many different ways to capture it.\n", 149 | "\n", 150 | "Before starting a study, we can use a power analysis to calculate the minimum sample size for a desired power level and significance level, along with an expected effect size. Traditionally, the significance level is set to 5% and the desired power level to 80%. That means we only need to figure out an expected effect size to calculate a sample size from a power analysis.\n", 151 | "\n", 152 | "The `stats.power` module of the statsmodels package in Python contains the required functions for carrying out power analysis for the most commonly used statistical tests such as the t-test, normal-based tests, F-tests, and the Chi-square goodness-of-fit test. 
Its `solve_power` function takes three of the four components mentioned above as input parameters and solves for the fourth (for example, the required sample size).\n" 153 | ] 154 | },
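 { "cell_type": "markdown", "id": "c5d6e7f8", "metadata": {}, "source": [ "For instance, here is a minimal sketch of a sample-size calculation for a two-sample t-test, assuming a medium expected effect size (Cohen's d = 0.5), the conventional 5% significance level, and 80% power:" ] }, { "cell_type": "code", "execution_count": null, "id": "d9e0f1a2", "metadata": {}, "outputs": [], "source": [ "from statsmodels.stats.power import TTestIndPower\n", "\n", "# Leave out nobs1 so that solve_power solves for the sample size\n", "analysis = TTestIndPower()\n", "sample_size = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)\n", "print(f'Required sample size per group: {sample_size:.1f}')  # ~64 per group\n" ] },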

 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "id": "a469340a-90ad-4850-85a4-276d9586f9c4", 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [] 162 | } 163 | ], 164 | "metadata": { 165 | "kernelspec": { 166 | "display_name": "Python 3 (ipykernel)", 167 | "language": "python", 168 | "name": "python3" 169 | }, 170 | "language_info": { 171 | "codemirror_mode": { 172 | "name": "ipython", 173 | "version": 3 174 | }, 175 | "file_extension": ".py", 176 | "mimetype": "text/x-python", 177 | "name": "python", 178 | "nbconvert_exporter": "python", 179 | "pygments_lexer": "ipython3", 180 | "version": "3.11.7" 181 | } 182 | }, 183 | "nbformat": 4, 184 | "nbformat_minor": 5 185 | } 186 | -------------------------------------------------------------------------------- /17.Statistical tests (Summarized).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "_uuid": "96f599ec12693ae56a55e0c1884b3f6ebc6bc825", 7 | "id": "HrWEQcATI3em", 8 | "toc": true 9 | }, 10 | "source": [ 11 | "Table of Contents

\n", 12 | "
" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "# Exploring the Ames, Iowa House Price Dataset\n", 20 | "\n", 21 | "Imagine that we're looking to move to Ames, Iowa, and have a budget of $120,000 to purchase a house. Unfortunately, we're not familiar with the local real estate market. However, the City Hall has a valuable resource, the House Price Dataset, which contains information on approximately 1500 homes in the city, including attributes such as Sale Price, Living Area, and Garage Type. Accessing the full dataset would be too expensive, but the City Hall is offering a generous offer: they'll provide free samples of up to 25 observations, or up to 100 observations for a small fee. This is a great opportunity for us to learn more about the real estate market and see what kind of house we can get for our budget. By analyzing the data, we'll be able to use statistical tests to gain insights into the market. While this notebook won't go into the theory behind the tests, it will provide an overview of which tests to use depending on the situation and how to use them.\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "_uuid": "685178bd8439921232c40ae2bb4218401deb360f", 29 | "id": "iNhM9iVfI3ff" 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "import pandas as pd\n", 34 | "pd.set_option('max_colwidth', 200)\n", 35 | "pd.set_option('display.float_format', lambda x: '%.3f' % x)\n", 36 | "from statsmodels.stats.weightstats import *\n", 37 | "import scipy.stats" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "id": "skFhcwVYLFdH" 45 | }, 46 | "outputs": [ 47 | { 48 | "ename": "FileNotFoundError", 49 | "evalue": "[Errno 2] No such file or directory: 'data/house-prices-advanced-regression-techniques/train.csv'", 50 | "output_type": "error", 51 | "traceback": [ 52 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 53 | "\u001b[1;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", 54 | "Cell \u001b[1;32mIn[2], line 3\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;66;03m#this is the entire dataset, but we'll only be able to use to extract samples from it.\u001b[39;00m\n\u001b[0;32m 2\u001b[0m FILE_PATH \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mdata/house-prices-advanced-regression-techniques/train.csv\u001b[39m\u001b[38;5;124m'\u001b[39m\n\u001b[1;32m----> 3\u001b[0m city_hall_dataset \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(FILE_PATH)\n", 55 | "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:948\u001b[0m, in \u001b[0;36mread_csv\u001b[1;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)\u001b[0m\n\u001b[0;32m 935\u001b[0m kwds_defaults \u001b[38;5;241m=\u001b[39m _refine_defaults_read(\n\u001b[0;32m 936\u001b[0m dialect,\n\u001b[0;32m 937\u001b[0m 
delimiter,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 944\u001b[0m dtype_backend\u001b[38;5;241m=\u001b[39mdtype_backend,\n\u001b[0;32m 945\u001b[0m )\n\u001b[0;32m 946\u001b[0m kwds\u001b[38;5;241m.\u001b[39mupdate(kwds_defaults)\n\u001b[1;32m--> 948\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m _read(filepath_or_buffer, kwds)\n", 56 | "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:611\u001b[0m, in \u001b[0;36m_read\u001b[1;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[0;32m 608\u001b[0m _validate_names(kwds\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnames\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m))\n\u001b[0;32m 610\u001b[0m \u001b[38;5;66;03m# Create the parser.\u001b[39;00m\n\u001b[1;32m--> 611\u001b[0m parser \u001b[38;5;241m=\u001b[39m TextFileReader(filepath_or_buffer, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwds)\n\u001b[0;32m 613\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m chunksize \u001b[38;5;129;01mor\u001b[39;00m iterator:\n\u001b[0;32m 614\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m parser\n", 57 | "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1448\u001b[0m, in \u001b[0;36mTextFileReader.__init__\u001b[1;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[0;32m 1445\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mhas_index_names\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m kwds[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mhas_index_names\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[0;32m 1447\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles: IOHandles \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m-> 1448\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_make_engine(f, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mengine)\n", 58 | "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1705\u001b[0m, in \u001b[0;36mTextFileReader._make_engine\u001b[1;34m(self, f, engine)\u001b[0m\n\u001b[0;32m 1703\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m mode:\n\u001b[0;32m 1704\u001b[0m mode \u001b[38;5;241m+\u001b[39m\u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m-> 1705\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles \u001b[38;5;241m=\u001b[39m get_handle(\n\u001b[0;32m 1706\u001b[0m f,\n\u001b[0;32m 1707\u001b[0m mode,\n\u001b[0;32m 1708\u001b[0m encoding\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mencoding\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[0;32m 1709\u001b[0m compression\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcompression\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[0;32m 1710\u001b[0m 
memory_map\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmemory_map\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mFalse\u001b[39;00m),\n\u001b[0;32m 1711\u001b[0m is_text\u001b[38;5;241m=\u001b[39mis_text,\n\u001b[0;32m 1712\u001b[0m errors\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mencoding_errors\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstrict\u001b[39m\u001b[38;5;124m\"\u001b[39m),\n\u001b[0;32m 1713\u001b[0m storage_options\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstorage_options\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[0;32m 1714\u001b[0m )\n\u001b[0;32m 1715\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[0;32m 1716\u001b[0m f \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles\u001b[38;5;241m.\u001b[39mhandle\n", 59 | "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\io\\common.py:863\u001b[0m, in \u001b[0;36mget_handle\u001b[1;34m(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)\u001b[0m\n\u001b[0;32m 858\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(handle, \u001b[38;5;28mstr\u001b[39m):\n\u001b[0;32m 859\u001b[0m \u001b[38;5;66;03m# Check whether the filename is to be opened in binary mode.\u001b[39;00m\n\u001b[0;32m 860\u001b[0m \u001b[38;5;66;03m# Binary mode does not support 'encoding' and 'newline'.\u001b[39;00m\n\u001b[0;32m 861\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m ioargs\u001b[38;5;241m.\u001b[39mencoding \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m ioargs\u001b[38;5;241m.\u001b[39mmode:\n\u001b[0;32m 862\u001b[0m \u001b[38;5;66;03m# Encoding\u001b[39;00m\n\u001b[1;32m--> 863\u001b[0m handle \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mopen\u001b[39m(\n\u001b[0;32m 864\u001b[0m handle,\n\u001b[0;32m 865\u001b[0m ioargs\u001b[38;5;241m.\u001b[39mmode,\n\u001b[0;32m 866\u001b[0m encoding\u001b[38;5;241m=\u001b[39mioargs\u001b[38;5;241m.\u001b[39mencoding,\n\u001b[0;32m 867\u001b[0m errors\u001b[38;5;241m=\u001b[39merrors,\n\u001b[0;32m 868\u001b[0m newline\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m 869\u001b[0m )\n\u001b[0;32m 870\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 871\u001b[0m \u001b[38;5;66;03m# Binary mode\u001b[39;00m\n\u001b[0;32m 872\u001b[0m handle \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mopen\u001b[39m(handle, ioargs\u001b[38;5;241m.\u001b[39mmode)\n", 60 | "\u001b[1;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'data/house-prices-advanced-regression-techniques/train.csv'" 61 | ] 62 | } 63 | ], 64 | "source": [ 65 | "#this is the entire dataset, but we'll only be able to use to extract samples from it.\n", 66 | "FILE_PATH = 
'data/house-prices-advanced-regression-techniques/train.csv'\n", 67 | "city_hall_dataset = pd.read_csv(FILE_PATH)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": { 73 | "_uuid": "95f4ee7982692fb0a348652a2bea034fcfaae251", 74 | "id": "LxaQeafsI3fi" 75 | }, 76 | "source": [ 77 | "# Introduction" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": { 83 | "_uuid": "c1606d0ab5c78dcc772f9d1e91daad2fa1ed5933", 84 | "id": "4ZEVAOUxI3fj" 85 | }, 86 | "source": [ 87 | "# Theory" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": { 93 | "_uuid": "36671356d539ae67d27d451ae71f3c595e139948", 94 | "id": "7A4lcRIRI3fk" 95 | }, 96 | "source": [ 97 | "What we will be trying to do in this tutorial is draw inferences about the whole population of houses based only on the samples at our disposal.
\n", 98 | "This is what statistical tests do, but one must know a few principles before using them." 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": { 104 | "_uuid": "698cfbf57cd980f9e0f558eb2e1206a601742dd3", 105 | "id": "fmT5a95OI3fl" 106 | }, 107 | "source": [ 108 | "## The process" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": { 114 | "_uuid": "03287c60f935631f22c13a14d86e233c2fcd47d7", 115 | "id": "PMc5uv87I3fm" 116 | }, 117 | "source": [ 118 | "The basic process of statistical tests is the following : \n", 119 | "- Stating a Null Hypothesis (most often : \"the two values are not different\")\n", 120 | "- Stating an Alternative Hypothesis (most often : \"the two values are different\")\n", 121 | "- Defining an alpha value, which we use here as a confidence level (most often : 95%); 1-alpha is then the significance level. The higher alpha is, the harder it will be to validate the Alternative Hypothesis, but the more confident we will be if we do validate it.\n", 122 | "- Depending on the data at our disposal, we choose the relevant test (Z-test, T-test, etc... More on that later)\n", 123 | "- The test computes a score, which corresponds to a p-value.\n", 124 | "- If the p-value is below 1-alpha (0.05 if alpha is 95%), we can accept the Alternative Hypothesis (or \"reject the Null Hypothesis\"). If it is above, we'll have to stick with the Null Hypothesis (or \"fail to reject the Null Hypothesis\").\n", 125 | "\n", 126 | "\n", 127 | "There's a built-in function for most statistical tests out there.
\n", 128 | "Let's also build our own function to summarize all the information.
\n", 129 | "All tests we will conduct from now on are based on alpha = 95%." 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": { 136 | "_uuid": "781566a3f1a8b9465fad0818f57c86d9be026780", 137 | "id": "rMy2j0kGI3fo" 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "def results(p):\n", 142 | "    if p['p_value'] < 0.05:\n", "        p['hypothesis_accepted'] = 'alternative'\n", 143 | "    else:\n", "        p['hypothesis_accepted'] = 'null'\n", 144 | "\n", 145 | "    df = pd.DataFrame(p, index=[''])\n", 146 | "    cols = ['value1', 'value2', 'score', 'p_value', 'hypothesis_accepted']\n", 147 | "    return df[cols]" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": { 153 | "_uuid": "3c1916b3ab847149daaffab1a7394fadbbfaee01", 154 | "id": "qNYYzNPbI3fp" 155 | }, 156 | "source": [ 157 | "## Two-tailed and One-tailed" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": { 163 | "_uuid": "72ab88dc706a91a822fd829611149a8bccd567ae", 164 | "id": "hs0yrjpeI3fq" 165 | }, 166 | "source": [ 167 | "Two-tailed tests are used to show two values are just \"different\".
\n", 168 | "One-tailed tests are used to show that one value is either \"larger\" or \"smaller\" than another.

\n", 169 | "This has an influence on the p-value : for one-tailed tests, the p-value has to be divided by 2.
\n", 170 | "
\n", 171 | "Most of the functions we'll use (those from the statsmodels.stats.weightstats module) do that by themselves if we input the right parameters.
\n", 172 | "We'll have to do it on our own with functions from the scipy module." 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": { 178 | "_uuid": "927f0ccf354bdc79be8f15126137812533168a3f", 179 | "id": "83YOCHsUI3fr" 180 | }, 181 | "source": [ 182 | "## Types of tests" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": { 188 | "_uuid": "eb131c0758f17dd039df56627c87536436cf7c73", 189 | "id": "fbXJNsiUI3fs" 190 | }, 191 | "source": [ 192 | "There are different types of tests, here are the ones we will cover : \n", 193 | "- T-tests. Used for small sample sizes (n<30), and when population's standard deviation is unknown.\n", 194 | "- Z-tests. Used for large sample sizes (n=>30), and when population's standard deviation is known.\n", 195 | "- F-tests. Used for comparing values of more than two variables.\n", 196 | "- Chi-square. Used for comparing categorical data." 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": { 202 | "_uuid": "c21d8691d27399446d7d7e707f369da61708c5c1", 203 | "id": "5EdR4B5fI3fs" 204 | }, 205 | "source": [ 206 | "## Normal distribution" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": { 212 | "_uuid": "21bce516b7e3e7b14e45dd3b37c4f3a0d9e9fb72", 213 | "id": "xTG4xsVXI3ft" 214 | }, 215 | "source": [ 216 | "Also, most tests - parametric tests - require a population that is normally distributed.
\n", 217 | "It is not the case for SalePrice - which we'll use for most tests - but we can fix this by log-transforming the variable.
\n", 218 | "Note that to go back to our original scale and understand values vs. our \\$120 000, we'll have to exponentiate the values back." 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": { 225 | "_uuid": "5e7f1cf9faba52350ec923638e9e26b9356764d7", 226 | "colab": { 227 | "base_uri": "https://localhost:8080/" 228 | }, 229 | "id": "XOOw7OlgI3fu", 230 | "outputId": "434155bc-55fd-454b-d9a2-c99dbbe5c315" 231 | }, 232 | "outputs": [], 233 | "source": [ 234 | "import numpy as np\n", 235 | "city_hall_dataset['SalePrice'] = np.log1p(city_hall_dataset['SalePrice'])\n", 236 | "logged_budget = np.log1p(120000) #logged $120 000 is 11.695\n", 237 | "logged_budget" 238 | ] 239 | },
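 { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sketch (this check is not part of the original walkthrough): if we wanted to verify how close to normal the log-transformed SalePrice is, a Shapiro-Wilk test could be run on it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import scipy.stats\n", "\n", "# The null hypothesis of Shapiro-Wilk is that the data is normally distributed,\n", "# so a p-value above 0.05 would mean we cannot reject normality.\n", "stat, pval = scipy.stats.shapiro(city_hall_dataset['SalePrice'])\n", "print(f'Shapiro-Wilk statistic: {stat:.3f}, p-value: {pval:.5f}')" ] },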
 240 | { 241 | "cell_type": "markdown", 242 | "metadata": { 243 | "_uuid": "6ea5ef2e8f638517b5381aa54400a541d79caeca", 244 | "id": "4zUY8cgFI3fz" 245 | }, 246 | "source": [ 247 | "# Practice" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": { 253 | "_uuid": "6b12d43e42e70f1142a22cdb6561ab58992887e2", 254 | "id": "6nYyX_XYI3f0" 255 | }, 256 | "source": [ 257 | "So let's say we are ready to dive into the data, but not ready to pay the small fee for the large sample size.\n", 258 | "We'll be starting with the free samples of 25 observations." 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": { 265 | "_uuid": "d49290ec36454bdc5276e6636f965ac0db9f048e", 266 | "id": "HcaLu-G6I3f0" 267 | }, 268 | "outputs": [], 269 | "source": [ 270 | "sample = city_hall_dataset.sample(n=25)\n", 271 | "p = {} #dictionary we'll use to store information and results" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": { 277 | "_uuid": "86dcbf3838f5a01829c1dd07ff295d7a323f9b38", 278 | "id": "xUCHHT1PI3f1" 279 | }, 280 | "source": [ 281 | "## One sample T-test | Two-tailed | Means" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": { 287 | "_uuid": "057bf391b91193515585af60d1b2e6ecf61d9457", 288 | "id": "j-ne4sQdI3f1" 289 | }, 290 | "source": [ 291 | "So the first question we want to ask is : how do our $120 000 compare to the average Ames house SalePrice?
\n", 292 | "In other words, is 120 000 (11.7 logged) any different from the mean SalePrice of the population?
\n", 293 | "To know that from a 25-observation sample, we need to use a One Sample T-Test." 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": { 299 | "_uuid": "9a5b43a56ec242192988e4638bdaf947c4b9feca", 300 | "id": "g5OClmfSI3f2" 301 | }, 302 | "source": [ 303 | "Null Hypothesis : Mean SalePrice = 11.695
\n", 304 | "Alternative Hypothesis : Mean SalePrice ≠ 11.695" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": { 311 | "_uuid": "b114d786b7657bb44b18cd10f3a75b59e8b76e4d", 312 | "colab": { 313 | "base_uri": "https://localhost:8080/", 314 | "height": 80 315 | }, 316 | "id": "S2niNvkBI3f3", 317 | "outputId": "994270a3-6585-436d-9f14-99dfe11e5c6f" 318 | }, 319 | "outputs": [], 320 | "source": [ 321 | "p['value1'], p['value2'] = sample['SalePrice'].mean(), logged_budget\n", 322 | "p['score'], p['p_value'] = stats.ttest_1samp(sample['SalePrice'], popmean=logged_budget)\n", 323 | "results(p)" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": { 329 | "_uuid": "cdbffabed49531b9a2a94290ecf3ad3df7ce1500", 330 | "id": "W5_dIYzhI3f4" 331 | }, 332 | "source": [ 333 | "So we know our initial budget is significantly different from the mean SalePrice.
\n", 334 | "From the table above, it unfortunately seems lower.
" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": { 340 | "_uuid": "9826a1893fc914f63b0c8fb013f54f21a8cb6660", 341 | "id": "R9S7Wyn3I3f5" 342 | }, 343 | "source": [ 344 | "## One sample T-test | One-tailed | Means" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": { 350 | "_uuid": "16f3f4267ad9b0656259a33796891ba99efa4e7e", 351 | "id": "dogNu0GwI3f5" 352 | }, 353 | "source": [ 354 | "Let's make sure our budget is lower by running a one-tailed test.
\n", 355 | "The question now is : is 120 000 (11.695 logged) lower than the mean SalePrice of the population?
" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": { 361 | "_uuid": "c4aa88ea6758d7beced1dd4fe235a4afc5aae480", 362 | "id": "otYidl0LI3f6" 363 | }, 364 | "source": [ 365 | "Null Hypothesis : Mean SalePrice <= 11.695
\n", 366 | "Alternative Hypothesis : Mean SalePrice > 11.695" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": { 373 | "_uuid": "f5a2b2f062c437304675fb7f97cc75be030f7cd8", 374 | "colab": { 375 | "base_uri": "https://localhost:8080/", 376 | "height": 80 377 | }, 378 | "id": "KqfQhhk2I3f7", 379 | "outputId": "96d56b97-7dd3-4cc5-edf0-722874dd5e98" 380 | }, 381 | "outputs": [], 382 | "source": [ 383 | "p['value1'], p['value2'] = sample['SalePrice'].mean(), logged_budget\n", 384 | "p['score'], p['p_value'] = stats.ttest_1samp(sample['SalePrice'], popmean=logged_budget)\n", 385 | "p['p_value'] = p['p_value']/2 #this scipy function returns a two-sided p-value, so for a one-tailed test we halve it ourselves (valid here since the sample mean lies on the alternative's side)\n", 386 | "results(p)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": { 392 | "_uuid": "4ca6a6194d74b0cb91c7fff614b6ff85543c354f", 393 | "id": "9cKbKEuDI3gC" 394 | }, 395 | "source": [ 396 | "Unfortunately it is!
\n", 397 | "At the 95% confidence level, we conclude that our starting budget won't let us buy a house at the average Ames price." 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": { 403 | "_uuid": "ba55013530a031bf8fce56db44c036affafbe18c", 404 | "id": "2NeUq916I3gD" 405 | }, 406 | "source": [ 407 | "## Two sample T-test | Two-tailed | Means" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": { 413 | "_uuid": "e0c4dd0f363ec00e2fa7dc1e49cccc303c5d7e1c", 414 | "id": "cQsNuRapI3gE" 415 | }, 416 | "source": [ 417 | "Now that our expectations are lowered, we realize something important :
\n", 418 | "The entire dataset probably contains some big houses fitted for entire families as well as small houses for fewer inhabitants.
\n", 419 | "Prices are probably really different in-between the two types.
\n", 419 | "Prices are probably really different between the two types.

\n", 421 | "What if we could ask the City Hall to give us a sample for big houses, and a sample for smaller houses?
\n", 422 | "We first could see if there is a significant difference in prices.
\n", 422 | "We could first see if there is a significant difference in prices.

\n", 424 | "We do ask the City Hall, and because they understand it is also for the sake of this tutorial, they accept.
\n", 425 | "They say they'll split the dataset in two, based on the surface area of the houses.
\n", 426 | "They will give us a sample from the top 50\\% houses in terms of surface, and another sample from the bottom 50\\%." 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "metadata": { 433 | "_uuid": "0ba8742a710640d9e15b7bb828d707b74c76fa9d", 434 | "id": "luaaidekI3gG" 435 | }, 436 | "outputs": [], 437 | "source": [ 438 | "smaller_houses = city_hall_dataset.sort_values('GrLivArea')[:730].sample(n=25)\n", 439 | "larger_houses = city_hall_dataset.sort_values('GrLivArea')[730:].sample(n=25)" 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": { 445 | "_uuid": "f62d12ae118d7c88daf2fa31cdab454425ee480c", 446 | "id": "dDIdVvbjI3gH" 447 | }, 448 | "source": [ 449 | "Now we first want to know if the two samples, extracted from two different populations, have significant differences in their average SalePrice." 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": { 455 | "_uuid": "e91a0db49dfd68637e52575a901d6688f2f52fc0", 456 | "id": "BiaE7H9WI3gI" 457 | }, 458 | "source": [ 459 | "Null Hypothesis : SalePrice of smaller houses = SalePrice of larger houses
\n", 460 | "Alternative Hypothesis : SalePrice of smaller houses ≠ SalePrice of larger houses
" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": { 467 | "_uuid": "063769aa739f591274cfb0c9ff3ca02cb3f47929", 468 | "colab": { 469 | "base_uri": "https://localhost:8080/", 470 | "height": 80 471 | }, 472 | "id": "60IhVpAjI3gJ", 473 | "outputId": "fd3651ca-4b3c-44c5-d97d-8ba31b8bf057" 474 | }, 475 | "outputs": [], 476 | "source": [ 477 | "p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()\n", 478 | "p['score'], p['p_value'], p['df'] = ttest_ind(smaller_houses['SalePrice'], larger_houses['SalePrice'])\n", 479 | "results(p)" 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": { 485 | "_uuid": "ca2691aed68ebaf14f3d45e8500955b564891c94", 486 | "id": "h-LdfFQxI3gL" 487 | }, 488 | "source": [ 489 | "As expected, the two samples show some significant differences in SalePrice." 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": { 495 | "_uuid": "ac0291a3dd5e3d592b6d41d475f2e9fafa2d2d7a", 496 | "id": "gDGMr5yJI3gM" 497 | }, 498 | "source": [ 499 | "## Two sample T-test | One-tailed | Means" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": { 505 | "_uuid": "c3e80a2b082da6f8ee1181e7c5ad286388136d3b", 506 | "id": "bNyuvn-QI3gM" 507 | }, 508 | "source": [ 509 | "Obviously, larger houses have a higher SalePrice.
\n", 510 | "Let's confirm this with a one-tailed test." 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": { 516 | "_uuid": "4047d017f6fbb8420868b5a3bf1f5c484511ec9e", 517 | "id": "k8KJb7LKI3gN" 518 | }, 519 | "source": [ 520 | "Null Hypothesis : SalePrice of smaller houses >= SalePrice of larger houses
\n", 521 | "Alternative Hypothesis : SalePrice of smaller houses < SalePrice of larger houses
" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": null, 527 | "metadata": { 528 | "_uuid": "27b5247cac5eb55e0635a066fcac0d9c33785567", 529 | "colab": { 530 | "base_uri": "https://localhost:8080/", 531 | "height": 80 532 | }, 533 | "id": "UvYpzz2eI3gO", 534 | "outputId": "6eb1b4bd-90bb-480b-bd32-023d5dfbd24e" 535 | }, 536 | "outputs": [], 537 | "source": [ 538 | "p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()\n", 539 | "p['score'], p['p_value'], p['df'] = ttest_ind(smaller_houses['SalePrice'], larger_houses['SalePrice'], alternative='smaller')\n", 540 | "results(p)" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": { 546 | "_uuid": "12e9ac13b4eadb488545075d14613af3195be463", 547 | "id": "eeUoSgZ7I3gQ" 548 | }, 549 | "source": [ 550 | "Still as expected, SalePrice is significantly higher for larger houses." 551 | ] 552 | }, 553 | { 554 | "cell_type": "markdown", 555 | "metadata": { 556 | "_uuid": "d269a6e87bfb36f6c2a58242ece5b1fb6a071992", 557 | "id": "563ptKoxI3gR" 558 | }, 559 | "source": [ 560 | "## Two sample Z-test | One-tailed | Means" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": { 566 | "_uuid": "fac51d383c6d11ba5ed9bc8d48009f44d4226641", 567 | "id": "A5e1a_rzI3gS" 568 | }, 569 | "source": [ 570 | "Now that the City Hall has already splitted the population in two, why not ask them for larger samples?
\n", 570 | "Now that the City Hall has already split the population in two, why not ask them for larger samples?
\n", 595 | "Alternative Hypothesis : SalePrice of smaller houses < SalePrice of larger houses
" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": { 602 | "_uuid": "0378fa5d6d7df2e9f17a953f889e104bec2e0790", 603 | "colab": { 604 | "base_uri": "https://localhost:8080/", 605 | "height": 80 606 | }, 607 | "id": "K0ImQ7MsI3gV", 608 | "outputId": "bc5e303a-d6de-466a-d18f-54a0cbc91ed3" 609 | }, 610 | "outputs": [], 611 | "source": [ 612 | "p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()\n", 613 | "p['score'], p['p_value'] = ztest(smaller_houses['SalePrice'], larger_houses['SalePrice'], alternative='smaller')\n", 614 | "results(p)" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "metadata": { 620 | "_uuid": "877f43f345c8b9b283f35f5fd5df218bf0e1a9ad", 621 | "id": "nR7ao1OxI3gY" 622 | }, 623 | "source": [ 624 | "Higher sample sizes show the same results : SalePrice is significantely higher for larger houses." 625 | ] 626 | }, 627 | { 628 | "cell_type": "markdown", 629 | "metadata": { 630 | "_uuid": "cc97a4fb0630306449deeb931cce876a5c404b09", 631 | "id": "-0BEcuhAI3gZ" 632 | }, 633 | "source": [ 634 | "## Two sample Z-test | One-tailed | Proportions" 635 | ] 636 | }, 637 | { 638 | "cell_type": "markdown", 639 | "metadata": { 640 | "_uuid": "1c5fd4e5578687f0cd6defbdcb0f47505c1dbca5", 641 | "id": "uzi9hsr9I3gZ" 642 | }, 643 | "source": [ 644 | "Instead of means, we can also run tests on proportions.
\n", 645 | "Is the proportion of houses over \\$120 000 higher in the larger-houses population than in the smaller-houses population?" 646 | ] 647 | }, 648 | { 649 | "cell_type": "markdown", 650 | "metadata": { 651 | "_uuid": "08e590d9599111e0ed46441107fe9a568cbc85f8", 652 | "id": "lRi8Vz_II3ga" 653 | }, 654 | "source": [ 655 | "Null Hypothesis : Proportion of smaller houses with SalePrice over 11.695 >= Proportion of larger houses with SalePrice over 11.695
\n", 656 | "Alternative Hypothesis : Proportion of smaller houses with SalePrice over 11.695 < Proportion of larger houses with SalePrice over 11.695" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": null, 662 | "metadata": { 663 | "_uuid": "6908a85072828db67c52eea41d821caf0b94cc56", 664 | "colab": { 665 | "base_uri": "https://localhost:8080/", 666 | "height": 80 667 | }, 668 | "id": "HAr4460WI3gb", 669 | "outputId": "e8750c30-e44e-41e4-e9f2-2cae12d284a0" 670 | }, 671 | "outputs": [], 672 | "source": [ 673 | "from statsmodels.stats.proportion import *\n", 674 | "A1 = len(smaller_houses[smaller_houses.SalePrice>logged_budget])\n", 675 | "B1 = len(smaller_houses)\n", 676 | "A2 = len(larger_houses[larger_houses.SalePrice>logged_budget])\n", 677 | "B2 = len(larger_houses)\n", 678 | "p['value1'], p['value2'] = A1/B1, A2/B2\n", 679 | "p['score'], p['p_value'] = proportions_ztest([A1, A2], [B1, B2], alternative='smaller')\n", 680 | "results(p)" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": { 686 | "_uuid": "09942c4f972edef68fd0ade7bf6f1fccdf608357", 687 | "id": "fu7um-8kI3gc" 688 | }, 689 | "source": [ 690 | "Logically, the test shows that the larger houses population has a higher ratio of houses sold over \\\\$120 000 vs. the smaller houses population." 691 | ] 692 | }, 693 | { 694 | "cell_type": "markdown", 695 | "metadata": { 696 | "_uuid": "1fb6a257bfc1e2aead70cc81c728b6991b96ee2a", 697 | "id": "7erhQfijI3gd" 698 | }, 699 | "source": [ 700 | "## One sample Z-test | One-tailed | Means" 701 | ] 702 | }, 703 | { 704 | "cell_type": "markdown", 705 | "metadata": { 706 | "_uuid": "751e2b017d2e9d0982d16827f7c2074bde519c2f", 707 | "id": "h1Ry7yFxI3ge" 708 | }, 709 | "source": [ 710 | "So now let's see how our \\$120 000 (11.7 logged) are doing against smaller houses only, based on the 100-observation sample." 711 | ] 712 | }, 713 | { 714 | "cell_type": "markdown", 715 | "metadata": { 716 | "_uuid": "3478bf655881918eff42936df3d5f1d0fcfc8717", 717 | "id": "xzH2utgfI3ge" 718 | }, 719 | "source": [ 720 | "Null Hypothesis : Mean SalePrice of smaller houses <= 11.695
\n", 721 | "Alternative Hypothesis : Mean SalePrice of smaller houses > 11.695
" 722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": null, 727 | "metadata": { 728 | "_uuid": "29042a1ab9f5558a694ba2b537f204d648c9229f", 729 | "colab": { 730 | "base_uri": "https://localhost:8080/", 731 | "height": 80 732 | }, 733 | "id": "FaLeLaj4I3gf", 734 | "outputId": "622aee32-78c1-47fe-879f-2ecdedc566ce" 735 | }, 736 | "outputs": [], 737 | "source": [ 738 | "p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), logged_budget\n", 739 | "p['score'], p['p_value'] = ztest(smaller_houses['SalePrice'], value=logged_budget, alternative='larger')\n", 740 | "results(p)" 741 | ] 742 | }, 743 | { 744 | "cell_type": "markdown", 745 | "metadata": { 746 | "_uuid": "40b18c3dbdf3f268901b4aa3b3cd2542b5614ed6", 747 | "id": "WUoHvGY9I3gh" 748 | }, 749 | "source": [ 750 | "That's quite depressing : our \\\\$120 000 do not even beat the average price of smaller houses." 751 | ] 752 | }, 753 | { 754 | "cell_type": "markdown", 755 | "metadata": { 756 | "_uuid": "95f48ae385792cde702067b43d6865a228bada65", 757 | "id": "48EoE0JTI3gh" 758 | }, 759 | "source": [ 760 | "## One sample Z-test | One-tailed | Proportions" 761 | ] 762 | }, 763 | { 764 | "cell_type": "markdown", 765 | "metadata": { 766 | "_uuid": "11e0cda689b265e2fb1f69a6440bb720ae030d01", 767 | "id": "6PAXNuOaI3gi" 768 | }, 769 | "source": [ 770 | "Our \\$120 000 do not seem too far from the average SalePrice of small houses though.
\n", 771 | "Let's see if at least 25\\% of houses have a SalePrice within our budget." 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": { 777 | "_uuid": "1f856f2a6117ada10a5dc32962d7cf438fe93beb", 778 | "id": "GQgwSKUUI3gj" 779 | }, 780 | "source": [ 781 | "Null Hypothesis : Proportion of smaller houses with SalePrice under 11.695 <= 25%
\n", 782 | "Alternative Hypothesis : Proportion of smaller houses with SalePrice under 11.695 > 25%" 783 | ] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "execution_count": null, 788 | "metadata": { 789 | "_uuid": "d9bcda8bd0df8421f071c5fe0d38f07abde33549", 790 | "colab": { 791 | "base_uri": "https://localhost:8080/", 792 | "height": 80 793 | }, 794 | "id": "s1gH0FbmI3gj", 795 | "outputId": "f5c5da9c-98c6-4678-8627-711b9055b243" 796 | }, 797 | "outputs": [], 798 | "source": [ 799 | "from statsmodels.stats.proportion import *\n", 800 | "A = len(smaller_houses[smaller_houses.SalePrice<logged_budget])\n", "B = len(smaller_houses)\n", "p['value1'], p['value2'] = A/B, 0.25\n", "p['score'], p['p_value'] = proportions_ztest(A, B, value=0.25, alternative='larger')\n", "results(p)" 815 | ] 816 | }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good news : the alternative is accepted, so at least 25% of the smaller houses fall within our budget." ] }, 817 | { 818 | "cell_type": "markdown", 819 | "metadata": { 820 | "_uuid": "31d1706ffc709d4c08d5c44becd79bb8abdd1f7a", 821 | "id": "_vrgtTttI3gl" 822 | }, 823 | "source": [ 824 | "## F-test (ANOVA)" 825 | ] 826 | }, 827 | { 828 | "cell_type": "markdown", 829 | "metadata": { 830 | "_uuid": "e4755b430cf6a781cf06af48eca1ec2bc34826e6", 831 | "id": "JFGjPpT5I3gl" 832 | }, 833 | "source": [ 834 | "The House Price Dataset has a MSZoning variable, which identifies the general zoning classification of the house.
\n", 835 | "For instance, it lets you know if the house is situated in a residential or a commercial zone.

\n", 836 | "We'll therefore try to find out whether there is a significant difference in SalePrice based on the zoning.
\n", 837 | "And then know where we will be more likely to live with our budget.
\n", 838 | "Based on the 100-observation sample of smaller houses, let's first have an overview of mean SalePrice by zone." 839 | ] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "execution_count": null, 844 | "metadata": { 845 | "_uuid": "5ad3bbf7148ee3ab852e1e86fe31a722a097fcf7", 846 | "colab": { 847 | "base_uri": "https://localhost:8080/", 848 | "height": 235 849 | }, 850 | "id": "Tw-dZRxXI3gm", 851 | "outputId": "50c47d05-1ec7-4a14-9888-210d9b99a413" 852 | }, 853 | "outputs": [], 854 | "source": [ 855 | "replacement = {'FV': \"Floating Village Residential\", 'C (all)': \"Commercial\", 'RH': \"Residential High Density\",\n", 856 | "               'RL': \"Residential Low Density\", 'RM': \"Residential Medium Density\"}\n", 857 | "smaller_houses['MSZoning_FullName'] = smaller_houses['MSZoning'].replace(replacement)\n", 858 | "mean_price_by_zone = smaller_houses.groupby('MSZoning_FullName')['SalePrice'].mean().to_frame()\n", 859 | "mean_price_by_zone" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": { 865 | "_uuid": "9718fdf6ca87e8eb195849a32370fd1e2fdc50ca", 866 | "id": "EI0TRdT5I3gn" 867 | }, 868 | "source": [ 869 | "To know if there is a significant difference between these values, we run an ANOVA test (because there are more than two values to compare).
\n", 870 | "The test won't be able to tell us which attributes are different from the others, but at least we'll know if there is a difference or not." 871 | ] 872 | }, 873 | { 874 | "cell_type": "markdown", 875 | "metadata": { 876 | "_uuid": "de6cfc37bd2dd0834227c069156c45c904a464ab", 877 | "id": "xGCd7qABI3gn" 878 | }, 879 | "source": [ 880 | "Null Hypothesis : No difference between SalePrice means
\n", 881 | "Alternative Hypothesis : Difference between SalePrice means
" 882 | ] 883 | }, 884 | { 885 | "cell_type": "code", 886 | "execution_count": null, 887 | "metadata": { 888 | "_uuid": "336c3d9d49d2b2ceaf514e2fbd3a4356e1787ba8", 889 | "colab": { 890 | "base_uri": "https://localhost:8080/", 891 | "height": 80 892 | }, 893 | "id": "DZiD5a4xI3go", 894 | "outputId": "a4a4807c-b568-483b-873b-899e62f8436a" 895 | }, 896 | "outputs": [], 897 | "source": [ 898 | "sh = smaller_houses.copy()\n", 899 | "p['score'], p['p_value'] = stats.f_oneway(sh.loc[sh.MSZoning=='FV', 'SalePrice'], \n", 900 | " sh.loc[sh.MSZoning=='C (all)', 'SalePrice'],\n", 901 | " sh.loc[sh.MSZoning=='RH', 'SalePrice'],\n", 902 | " sh.loc[sh.MSZoning=='RL', 'SalePrice'],\n", 903 | " sh.loc[sh.MSZoning=='RM', 'SalePrice'],)\n", 904 | "results(p)[['score', 'p_value', 'hypothesis_accepted']]" 905 | ] 906 | }, 907 | { 908 | "cell_type": "markdown", 909 | "metadata": { 910 | "_uuid": "a95d783ef0a7c8c2561d1e7e35fe076504bdaf75", 911 | "id": "JhKowYxmI3gp" 912 | }, 913 | "source": [ 914 | "There is a difference between SalePrices based on where the house is located.
\n", 915 | "Looking at the Average SalePrice by zone, Commercial Zones and Residential High Density zones seem to be the most affordable for our budget." 916 | ] 917 | },
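 { "cell_type": "markdown", "metadata": {}, "source": [ "As a possible follow-up (a sketch, not part of the original walkthrough), a Tukey HSD post-hoc test from statsmodels can show which pairs of zones actually differ:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from statsmodels.stats.multicomp import pairwise_tukeyhsd\n", "\n", "# Pairwise comparison of mean SalePrice between all zoning classes\n", "tukey = pairwise_tukeyhsd(endog=sh['SalePrice'], groups=sh['MSZoning'], alpha=0.05)\n", "print(tukey)" ] },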
 918 | { 919 | "cell_type": "markdown", 920 | "metadata": { 921 | "_uuid": "b1764c17eba1681a1b3e11e1409595539036b087", 922 | "id": "0eOVOjqfI3gp" 923 | }, 924 | "source": [ 925 | "## Chi-square test" 926 | ] 927 | }, 928 | { 929 | "cell_type": "markdown", 930 | "metadata": { 931 | "_uuid": "18bbaeabf46aba49d823ebe5689d5b73f78a857d", 932 | "id": "0Wt1bb7kI3gq" 933 | }, 934 | "source": [ 935 | "One last question we'll address : can we get a garage? If yes, what type of garage?\n", 936 | "If not, then we won't bother saving up for a car, and we'll try to get a house next to Public Transportation.
\n", 937 | "The dataset contains a categorical variable, GarageType, that will help us answer the question.
\n", 938 | "
" 939 | ] 940 | }, 941 | { 942 | "cell_type": "code", 943 | "execution_count": null, 944 | "metadata": { 945 | "_uuid": "c143db864b4a24d0587b00099194f21457e22c29", 946 | "colab": { 947 | "base_uri": "https://localhost:8080/", 948 | "height": 204 949 | }, 950 | "id": "i2PuMmXTI3gq", 951 | "outputId": "2811f9f3-d88a-41ec-d59b-b2538a4c40ba" 952 | }, 953 | "outputs": [], 954 | "source": [ 955 | "smaller_houses['GarageType'].fillna('No Garage', inplace=True)\n", 956 | "smaller_houses['GarageType'].value_counts().to_frame()" 957 | ] 958 | }, 959 | { 960 | "cell_type": "markdown", 961 | "metadata": { 962 | "_uuid": "48d6143222896b8c463ab8a88465a647796f188c", 963 | "id": "0zv1CZjgI3gr" 964 | }, 965 | "source": [ 966 | "We know we can get a house in at least the bottom 25% of smaller houses.
\n", 967 | "We would ideally like to know if the distribution of Garage Types among these 25% is different from that in the three other quarters.
\n", 968 | "We are now friends with the City Hall, so we can ask them one last favor :
\n", 969 | "Split the smaller houses population in 4 based on surface, and give us a sample of each quarter.
\n", 970 | "Because we are working here with categorical data, we'll run a Chi-Square test." 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": null, 976 | "metadata": { 977 | "_uuid": "a0462a0bd18442f27a88c5b0e267122e76ab34b7", 978 | "colab": { 979 | "base_uri": "https://localhost:8080/", 980 | "height": 142 981 | }, 982 | "id": "vBmXrldbI3gr", 983 | "outputId": "56ad970c-fe14-42da-bd74-c4ca45d8a955" 984 | }, 985 | "outputs": [], 986 | "source": [ 987 | "city_hall_dataset['GarageType'].fillna('No Garage', inplace=True)\n", 988 | "sample1 = city_hall_dataset.sort_values('GrLivArea')[:183].sample(n=100)\n", 989 | "sample2 = city_hall_dataset.sort_values('GrLivArea')[183:366].sample(n=100)\n", 990 | "sample3 = city_hall_dataset.sort_values('GrLivArea')[366:549].sample(n=100)\n", 991 | "sample4 = city_hall_dataset.sort_values('GrLivArea')[549:730].sample(n=100)\n", 992 | "dff = pd.concat([\n", 993 | "    sample1['GarageType'].value_counts().to_frame(),\n", 994 | "    sample2['GarageType'].value_counts().to_frame(), \n", 995 | "    sample3['GarageType'].value_counts().to_frame(), \n", 996 | "    sample4['GarageType'].value_counts().to_frame()], \n", 997 | "    axis=1, sort=False)\n", 998 | "dff.columns = ['Sample1 (smallest houses)', 'Sample2', 'Sample3', 'Sample4 (largest houses)']\n", 999 | "dff = dff[:3] #chi-square tests do not work when the table contains zeros, so we keep only the most frequent categories\n", 1000 | "dff " 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "markdown", 1005 | "metadata": { 1006 | "_uuid": "6858dda979f309254b4b04ec93d3d2a4fc4a22a9", 1007 | "id": "C1nqnJXUI3gs" 1008 | }, 1009 | "source": [ 1010 | "Null Hypothesis : No difference between the GarageType distributions
\n", 1011 | "Alternative Hypothesis : Difference between the GarageType distributions" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "code", 1016 | "execution_count": null, 1017 | "metadata": { 1018 | "_uuid": "59e533a57eebfb818a880a00c5eb7b668ddfedf7", 1019 | "colab": { 1020 | "base_uri": "https://localhost:8080/", 1021 | "height": 80 1022 | }, 1023 | "id": "aRLUI6B6I3gt", 1024 | "outputId": "188aca50-8077-4402-c7ef-629e6864b641" 1025 | }, 1026 | "outputs": [], 1027 | "source": [ 1028 | "p['score'], p['p_value'], p['dof'], p['contingency'] = stats.chi2_contingency(dff)\n", 1029 | "p.pop('contingency')\n", 1030 | "results(p)[['score', 'p_value', 'hypothesis_accepted']]" 1031 | ] 1032 | }, 1033 | { 1034 | "cell_type": "markdown", 1035 | "metadata": { 1036 | "_uuid": "d8acefcc9a561092b50a4dc9e056ced0dd9706a2", 1037 | "id": "F3TAg9RsI3gt" 1038 | }, 1039 | "source": [ 1040 | "Clearly there's a difference in GarageType distribution according to the size of the houses.
\n", 1041 | "The sample that concerns us, Sample1, has the highest proportion of \"No Garage\" and \"Detached Garage\".
\n", 1042 | "We'll probably have to stick with Public Transportation." 1043 | ] 1044 | }, 1045 | { 1046 | "cell_type": "markdown", 1047 | "metadata": { 1048 | "_uuid": "d9aa71cb9e37ec58ac0ec437edd8b054eaafbca3", 1049 | "id": "F7OWGJmxI3gu" 1050 | }, 1051 | "source": [ 1052 | "# Conclusion" 1053 | ] 1054 | }, 1055 | { 1056 | "cell_type": "markdown", 1057 | "metadata": { 1058 | "_uuid": "4932d573a2a4d8fdd4d7f842c2d4218c60e4afcf", 1059 | "id": "nvxdHBqTI3gu" 1060 | }, 1061 | "source": [ 1062 | "We probably won't have a great house, but at least, we learned about statistical tests." 1063 | ] 1064 | } 1065 | ], 1066 | "metadata": { 1067 | "colab": { 1068 | "collapsed_sections": [ 1069 | "5EdR4B5fI3fs", 1070 | "xUCHHT1PI3f1", 1071 | "R9S7Wyn3I3f5", 1072 | "2NeUq916I3gD", 1073 | "gDGMr5yJI3gM", 1074 | "563ptKoxI3gR", 1075 | "-0BEcuhAI3gZ", 1076 | "7erhQfijI3gd", 1077 | "48EoE0JTI3gh", 1078 | "_vrgtTttI3gl", 1079 | "0eOVOjqfI3gp" 1080 | ], 1081 | "name": "Statistical-Tests-Explained.ipynb", 1082 | "provenance": [] 1083 | }, 1084 | "kernelspec": { 1085 | "display_name": "Python 3 (ipykernel)", 1086 | "language": "python", 1087 | "name": "python3" 1088 | }, 1089 | "language_info": { 1090 | "codemirror_mode": { 1091 | "name": "ipython", 1092 | "version": 3 1093 | }, 1094 | "file_extension": ".py", 1095 | "mimetype": "text/x-python", 1096 | "name": "python", 1097 | "nbconvert_exporter": "python", 1098 | "pygments_lexer": "ipython3", 1099 | "version": "3.11.7" 1100 | }, 1101 | "toc": { 1102 | "base_numbering": 1, 1103 | "nav_menu": {}, 1104 | "number_sections": true, 1105 | "sideBar": true, 1106 | "skip_h1_title": false, 1107 | "title_cell": "Table of Contents", 1108 | "title_sidebar": "Contents", 1109 | "toc_cell": true, 1110 | "toc_position": {}, 1111 | "toc_section_display": true, 1112 | "toc_window_display": false 1113 | } 1114 | }, 1115 | "nbformat": 4, 1116 | "nbformat_minor": 4 1117 | } 1118 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Aman Roland 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Statistics-and-Probability-For-Data-Science 2 | 3 | In this repository, I will delve into the fundamental concepts of statistics and probability through the use of the Python programming language. The following topics will be covered: 4 | 5 | 1. Permutations and Combinations 6 | 2. The basics of probability, including conditional probability and the law of large numbers 7 | 3. Bayes' theorem and its applications 8 | 4. Probability distributions, including binomial, uniform, geometric, Poisson, and normal distributions 9 | 5. Measures of central tendency and variability, as well as skewness and kurtosis 10 | 6. The central limit theorem, estimation, and confidence intervals 11 | 7. Sampling methods and errors 12 | 8. Hypothesis testing, significance levels, P-values, and confidence intervals 13 | 9. Parametric tests, including z-tests (one-tailed and two-tailed) for means and proportions, and t-tests (paired t-test) 14 | 10. Analysis of variance (ANOVA) 15 | 11. Nonparametric tests, such as chi-square tests 16 | 12. Effect size, correlation, power, and power analysis 17 | -------------------------------------------------------------------------------- /data/a1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/a1.png -------------------------------------------------------------------------------- /data/cdf_ppf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/cdf_ppf.png -------------------------------------------------------------------------------- /data/csd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/csd.png -------------------------------------------------------------------------------- /data/dpd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/dpd.png -------------------------------------------------------------------------------- /data/f_dist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/f_dist.png -------------------------------------------------------------------------------- /data/gaussian-distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/gaussian-distribution.png -------------------------------------------------------------------------------- /data/house-prices-advanced-regression-techniques/data_description.txt: -------------------------------------------------------------------------------- 1 | MSSubClass: Identifies the type of dwelling involved in the
sale. 2 | 3 | 20 1-STORY 1946 & NEWER ALL STYLES 4 | 30 1-STORY 1945 & OLDER 5 | 40 1-STORY W/FINISHED ATTIC ALL AGES 6 | 45 1-1/2 STORY - UNFINISHED ALL AGES 7 | 50 1-1/2 STORY FINISHED ALL AGES 8 | 60 2-STORY 1946 & NEWER 9 | 70 2-STORY 1945 & OLDER 10 | 75 2-1/2 STORY ALL AGES 11 | 80 SPLIT OR MULTI-LEVEL 12 | 85 SPLIT FOYER 13 | 90 DUPLEX - ALL STYLES AND AGES 14 | 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 15 | 150 1-1/2 STORY PUD - ALL AGES 16 | 160 2-STORY PUD - 1946 & NEWER 17 | 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER 18 | 190 2 FAMILY CONVERSION - ALL STYLES AND AGES 19 | 20 | MSZoning: Identifies the general zoning classification of the sale. 21 | 22 | A Agriculture 23 | C Commercial 24 | FV Floating Village Residential 25 | I Industrial 26 | RH Residential High Density 27 | RL Residential Low Density 28 | RP Residential Low Density Park 29 | RM Residential Medium Density 30 | 31 | LotFrontage: Linear feet of street connected to property 32 | 33 | LotArea: Lot size in square feet 34 | 35 | Street: Type of road access to property 36 | 37 | Grvl Gravel 38 | Pave Paved 39 | 40 | Alley: Type of alley access to property 41 | 42 | Grvl Gravel 43 | Pave Paved 44 | NA No alley access 45 | 46 | LotShape: General shape of property 47 | 48 | Reg Regular 49 | IR1 Slightly irregular 50 | IR2 Moderately Irregular 51 | IR3 Irregular 52 | 53 | LandContour: Flatness of the property 54 | 55 | Lvl Near Flat/Level 56 | Bnk Banked - Quick and significant rise from street grade to building 57 | HLS Hillside - Significant slope from side to side 58 | Low Depression 59 | 60 | Utilities: Type of utilities available 61 | 62 | AllPub All public Utilities (E,G,W,& S) 63 | NoSewr Electricity, Gas, and Water (Septic Tank) 64 | NoSeWa Electricity and Gas Only 65 | ELO Electricity only 66 | 67 | LotConfig: Lot configuration 68 | 69 | Inside Inside lot 70 | Corner Corner lot 71 | CulDSac Cul-de-sac 72 | FR2 Frontage on 2 sides of property 73 | FR3 Frontage on 3 sides of property 74 | 75 | LandSlope: Slope of property 76 | 77 | Gtl Gentle slope 78 | Mod Moderate Slope 79 | Sev Severe Slope 80 | 81 | Neighborhood: Physical locations within Ames city limits 82 | 83 | Blmngtn Bloomington Heights 84 | Blueste Bluestem 85 | BrDale Briardale 86 | BrkSide Brookside 87 | ClearCr Clear Creek 88 | CollgCr College Creek 89 | Crawfor Crawford 90 | Edwards Edwards 91 | Gilbert Gilbert 92 | IDOTRR Iowa DOT and Rail Road 93 | MeadowV Meadow Village 94 | Mitchel Mitchell 95 | Names North Ames 96 | NoRidge Northridge 97 | NPkVill Northpark Villa 98 | NridgHt Northridge Heights 99 | NWAmes Northwest Ames 100 | OldTown Old Town 101 | SWISU South & West of Iowa State University 102 | Sawyer Sawyer 103 | SawyerW Sawyer West 104 | Somerst Somerset 105 | StoneBr Stone Brook 106 | Timber Timberland 107 | Veenker Veenker 108 | 109 | Condition1: Proximity to various conditions 110 | 111 | Artery Adjacent to arterial street 112 | Feedr Adjacent to feeder street 113 | Norm Normal 114 | RRNn Within 200' of North-South Railroad 115 | RRAn Adjacent to North-South Railroad 116 | PosN Near positive off-site feature--park, greenbelt, etc. 
117 | PosA Adjacent to positive off-site feature 118 | RRNe Within 200' of East-West Railroad 119 | RRAe Adjacent to East-West Railroad 120 | 121 | Condition2: Proximity to various conditions (if more than one is present) 122 | 123 | Artery Adjacent to arterial street 124 | Feedr Adjacent to feeder street 125 | Norm Normal 126 | RRNn Within 200' of North-South Railroad 127 | RRAn Adjacent to North-South Railroad 128 | PosN Near positive off-site feature--park, greenbelt, etc. 129 | PosA Adjacent to positive off-site feature 130 | RRNe Within 200' of East-West Railroad 131 | RRAe Adjacent to East-West Railroad 132 | 133 | BldgType: Type of dwelling 134 | 135 | 1Fam Single-family Detached 136 | 2FmCon Two-family Conversion; originally built as one-family dwelling 137 | Duplx Duplex 138 | TwnhsE Townhouse End Unit 139 | TwnhsI Townhouse Inside Unit 140 | 141 | HouseStyle: Style of dwelling 142 | 143 | 1Story One story 144 | 1.5Fin One and one-half story: 2nd level finished 145 | 1.5Unf One and one-half story: 2nd level unfinished 146 | 2Story Two story 147 | 2.5Fin Two and one-half story: 2nd level finished 148 | 2.5Unf Two and one-half story: 2nd level unfinished 149 | SFoyer Split Foyer 150 | SLvl Split Level 151 | 152 | OverallQual: Rates the overall material and finish of the house 153 | 154 | 10 Very Excellent 155 | 9 Excellent 156 | 8 Very Good 157 | 7 Good 158 | 6 Above Average 159 | 5 Average 160 | 4 Below Average 161 | 3 Fair 162 | 2 Poor 163 | 1 Very Poor 164 | 165 | OverallCond: Rates the overall condition of the house 166 | 167 | 10 Very Excellent 168 | 9 Excellent 169 | 8 Very Good 170 | 7 Good 171 | 6 Above Average 172 | 5 Average 173 | 4 Below Average 174 | 3 Fair 175 | 2 Poor 176 | 1 Very Poor 177 | 178 | YearBuilt: Original construction date 179 | 180 | YearRemodAdd: Remodel date (same as construction date if no remodeling or additions) 181 | 182 | RoofStyle: Type of roof 183 | 184 | Flat Flat 185 | Gable Gable 186 | Gambrel Gambrel (Barn) 187 | Hip Hip 188 | Mansard Mansard 189 | Shed Shed 190 | 191 | RoofMatl: Roof material 192 | 193 | ClyTile Clay or Tile 194 | CompShg Standard (Composite) Shingle 195 | Membran Membrane 196 | Metal Metal 197 | Roll Roll 198 | Tar&Grv Gravel & Tar 199 | WdShake Wood Shakes 200 | WdShngl Wood Shingles 201 | 202 | Exterior1st: Exterior covering on house 203 | 204 | AsbShng Asbestos Shingles 205 | AsphShn Asphalt Shingles 206 | BrkComm Brick Common 207 | BrkFace Brick Face 208 | CBlock Cinder Block 209 | CemntBd Cement Board 210 | HdBoard Hard Board 211 | ImStucc Imitation Stucco 212 | MetalSd Metal Siding 213 | Other Other 214 | Plywood Plywood 215 | PreCast PreCast 216 | Stone Stone 217 | Stucco Stucco 218 | VinylSd Vinyl Siding 219 | Wd Sdng Wood Siding 220 | WdShing Wood Shingles 221 | 222 | Exterior2nd: Exterior covering on house (if more than one material) 223 | 224 | AsbShng Asbestos Shingles 225 | AsphShn Asphalt Shingles 226 | BrkComm Brick Common 227 | BrkFace Brick Face 228 | CBlock Cinder Block 229 | CemntBd Cement Board 230 | HdBoard Hard Board 231 | ImStucc Imitation Stucco 232 | MetalSd Metal Siding 233 | Other Other 234 | Plywood Plywood 235 | PreCast PreCast 236 | Stone Stone 237 | Stucco Stucco 238 | VinylSd Vinyl Siding 239 | Wd Sdng Wood Siding 240 | WdShing Wood Shingles 241 | 242 | MasVnrType: Masonry veneer type 243 | 244 | BrkCmn Brick Common 245 | BrkFace Brick Face 246 | CBlock Cinder Block 247 | None None 248 | Stone Stone 249 | 250 | MasVnrArea: Masonry veneer area in square feet 251 | 252 | ExterQual: Evaluates the
quality of the material on the exterior 253 | 254 | Ex Excellent 255 | Gd Good 256 | TA Average/Typical 257 | Fa Fair 258 | Po Poor 259 | 260 | ExterCond: Evaluates the present condition of the material on the exterior 261 | 262 | Ex Excellent 263 | Gd Good 264 | TA Average/Typical 265 | Fa Fair 266 | Po Poor 267 | 268 | Foundation: Type of foundation 269 | 270 | BrkTil Brick & Tile 271 | CBlock Cinder Block 272 | PConc Poured Concrete 273 | Slab Slab 274 | Stone Stone 275 | Wood Wood 276 | 277 | BsmtQual: Evaluates the height of the basement 278 | 279 | Ex Excellent (100+ inches) 280 | Gd Good (90-99 inches) 281 | TA Typical (80-89 inches) 282 | Fa Fair (70-79 inches) 283 | Po Poor (<70 inches) 284 | NA No Basement 285 | 286 | BsmtCond: Evaluates the general condition of the basement 287 | 288 | Ex Excellent 289 | Gd Good 290 | TA Typical - slight dampness allowed 291 | Fa Fair - dampness or some cracking or settling 292 | Po Poor - Severe cracking, settling, or wetness 293 | NA No Basement 294 | 295 | BsmtExposure: Refers to walkout or garden level walls 296 | 297 | Gd Good Exposure 298 | Av Average Exposure (split levels or foyers typically score average or above) 299 | Mn Minimum Exposure 300 | No No Exposure 301 | NA No Basement 302 | 303 | BsmtFinType1: Rating of basement finished area 304 | 305 | GLQ Good Living Quarters 306 | ALQ Average Living Quarters 307 | BLQ Below Average Living Quarters 308 | Rec Average Rec Room 309 | LwQ Low Quality 310 | Unf Unfinished 311 | NA No Basement 312 | 313 | BsmtFinSF1: Type 1 finished square feet 314 | 315 | BsmtFinType2: Rating of basement finished area (if multiple types) 316 | 317 | GLQ Good Living Quarters 318 | ALQ Average Living Quarters 319 | BLQ Below Average Living Quarters 320 | Rec Average Rec Room 321 | LwQ Low Quality 322 | Unf Unfinished 323 | NA No Basement 324 | 325 | BsmtFinSF2: Type 2 finished square feet 326 | 327 | BsmtUnfSF: Unfinished square feet of basement area 328 | 329 | TotalBsmtSF: Total square feet of basement area 330 | 331 | Heating: Type of heating 332 | 333 | Floor Floor Furnace 334 | GasA Gas forced warm air furnace 335 | GasW Gas hot water or steam heat 336 | Grav Gravity furnace 337 | OthW Hot water or steam heat other than gas 338 | Wall Wall furnace 339 | 340 | HeatingQC: Heating quality and condition 341 | 342 | Ex Excellent 343 | Gd Good 344 | TA Average/Typical 345 | Fa Fair 346 | Po Poor 347 | 348 | CentralAir: Central air conditioning 349 | 350 | N No 351 | Y Yes 352 | 353 | Electrical: Electrical system 354 | 355 | SBrkr Standard Circuit Breakers & Romex 356 | FuseA Fuse Box over 60 AMP and all Romex wiring (Average) 357 | FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair) 358 | FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor) 359 | Mix Mixed 360 | 361 | 1stFlrSF: First Floor square feet 362 | 363 | 2ndFlrSF: Second floor square feet 364 | 365 | LowQualFinSF: Low quality finished square feet (all floors) 366 | 367 | GrLivArea: Above grade (ground) living area square feet 368 | 369 | BsmtFullBath: Basement full bathrooms 370 | 371 | BsmtHalfBath: Basement half bathrooms 372 | 373 | FullBath: Full bathrooms above grade 374 | 375 | HalfBath: Half baths above grade 376 | 377 | Bedroom: Bedrooms above grade (does NOT include basement bedrooms) 378 | 379 | Kitchen: Kitchens above grade 380 | 381 | KitchenQual: Kitchen quality 382 | 383 | Ex Excellent 384 | Gd Good 385 | TA Typical/Average 386 | Fa Fair 387 | Po Poor 388 | 389 | TotRmsAbvGrd: Total rooms above grade (does not include bathrooms) 390 | 
391 | Functional: Home functionality (Assume typical unless deductions are warranted) 392 | 393 | Typ Typical Functionality 394 | Min1 Minor Deductions 1 395 | Min2 Minor Deductions 2 396 | Mod Moderate Deductions 397 | Maj1 Major Deductions 1 398 | Maj2 Major Deductions 2 399 | Sev Severely Damaged 400 | Sal Salvage only 401 | 402 | Fireplaces: Number of fireplaces 403 | 404 | FireplaceQu: Fireplace quality 405 | 406 | Ex Excellent - Exceptional Masonry Fireplace 407 | Gd Good - Masonry Fireplace in main level 408 | TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement 409 | Fa Fair - Prefabricated Fireplace in basement 410 | Po Poor - Ben Franklin Stove 411 | NA No Fireplace 412 | 413 | GarageType: Garage location 414 | 415 | 2Types More than one type of garage 416 | Attchd Attached to home 417 | Basment Basement Garage 418 | BuiltIn Built-In (Garage part of house - typically has room above garage) 419 | CarPort Car Port 420 | Detchd Detached from home 421 | NA No Garage 422 | 423 | GarageYrBlt: Year garage was built 424 | 425 | GarageFinish: Interior finish of the garage 426 | 427 | Fin Finished 428 | RFn Rough Finished 429 | Unf Unfinished 430 | NA No Garage 431 | 432 | GarageCars: Size of garage in car capacity 433 | 434 | GarageArea: Size of garage in square feet 435 | 436 | GarageQual: Garage quality 437 | 438 | Ex Excellent 439 | Gd Good 440 | TA Typical/Average 441 | Fa Fair 442 | Po Poor 443 | NA No Garage 444 | 445 | GarageCond: Garage condition 446 | 447 | Ex Excellent 448 | Gd Good 449 | TA Typical/Average 450 | Fa Fair 451 | Po Poor 452 | NA No Garage 453 | 454 | PavedDrive: Paved driveway 455 | 456 | Y Paved 457 | P Partial Pavement 458 | N Dirt/Gravel 459 | 460 | WoodDeckSF: Wood deck area in square feet 461 | 462 | OpenPorchSF: Open porch area in square feet 463 | 464 | EnclosedPorch: Enclosed porch area in square feet 465 | 466 | 3SsnPorch: Three season porch area in square feet 467 | 468 | ScreenPorch: Screen porch area in square feet 469 | 470 | PoolArea: Pool area in square feet 471 | 472 | PoolQC: Pool quality 473 | 474 | Ex Excellent 475 | Gd Good 476 | TA Average/Typical 477 | Fa Fair 478 | NA No Pool 479 | 480 | Fence: Fence quality 481 | 482 | GdPrv Good Privacy 483 | MnPrv Minimum Privacy 484 | GdWo Good Wood 485 | MnWw Minimum Wood/Wire 486 | NA No Fence 487 | 488 | MiscFeature: Miscellaneous feature not covered in other categories 489 | 490 | Elev Elevator 491 | Gar2 2nd Garage (if not described in garage section) 492 | Othr Other 493 | Shed Shed (over 100 SF) 494 | TenC Tennis Court 495 | NA None 496 | 497 | MiscVal: $Value of miscellaneous feature 498 | 499 | MoSold: Month Sold (MM) 500 | 501 | YrSold: Year Sold (YYYY) 502 | 503 | SaleType: Type of sale 504 | 505 | WD Warranty Deed - Conventional 506 | CWD Warranty Deed - Cash 507 | VWD Warranty Deed - VA Loan 508 | New Home just constructed and sold 509 | COD Court Officer Deed/Estate 510 | Con Contract 15% Down payment regular terms 511 | ConLw Contract Low Down payment and low interest 512 | ConLI Contract Low Interest 513 | ConLD Contract Low Down 514 | Oth Other 515 | 516 | SaleCondition: Condition of sale 517 | 518 | Normal Normal Sale 519 | Abnorml Abnormal Sale - trade, foreclosure, short sale 520 | AdjLand Adjoining Land Purchase 521 | Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit 522 | Family Sale between family members 523 | Partial Home was not completed when last assessed (associated with New Homes) 524 | 
-------------------------------------------------------------------------------- /data/hypothesis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/hypothesis.png -------------------------------------------------------------------------------- /data/image-30.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/image-30.gif -------------------------------------------------------------------------------- /data/kurtosis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/kurtosis.png -------------------------------------------------------------------------------- /data/lln.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/lln.png -------------------------------------------------------------------------------- /data/mean-median-mode.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/mean-median-mode.png -------------------------------------------------------------------------------- /data/p1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/p1.png -------------------------------------------------------------------------------- /data/p2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/p2.png -------------------------------------------------------------------------------- /data/p3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/p3.png -------------------------------------------------------------------------------- /data/p4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/p4.png -------------------------------------------------------------------------------- /data/poi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/poi.png -------------------------------------------------------------------------------- /data/snd.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/snd.png -------------------------------------------------------------------------------- /data/t_dist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/t_dist.png -------------------------------------------------------------------------------- /data/ud.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/ud.png --------------------------------------------------------------------------------