├── .gitignore ├── 01. Permutation and Combinations.ipynb ├── 02. Probability and Rules of Probability.ipynb ├── 03. Bayes Theorem.ipynb ├── 04. Variables and Data types.ipynb ├── 05. Probability distributions.ipynb ├── 06. Measures of Central Tendency.ipynb ├── 07. Measures of Variability.ipynb ├── 08. Central Limit Theorem.ipynb ├── 09. Sampling and Sampling errors.ipynb ├── 10. Hypothesis Testing.ipynb ├── 11. Parametric Tests.ipynb ├── 12. Z-test.ipynb ├── 13. t-test.ipynb ├── 14. ANOVA - Analysis of Variance.ipynb ├── 15. Chi-Square Test for Independence and Goodness of Fit.ipynb ├── 16. Effect Size and Statistical Power.ipynb ├── 17.Statistical tests (Summarized).ipynb ├── LICENSE ├── README.md └── data ├── a1.png ├── cdf_ppf.png ├── csd.png ├── dpd.png ├── f_dist.png ├── gaussian-distribution.png ├── house-prices-advanced-regression-techniques └── data_description.txt ├── hypothesis.png ├── image-30.gif ├── kurtosis.png ├── lln.png ├── mean-median-mode.png ├── p1.png ├── p2.png ├── p3.png ├── p4.png ├── poi.png ├── snd.png ├── t_dist.png └── ud.png /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | # checkpoints 132 | .ipynb_checkpoints/ 133 | 134 | # data files (csv, txt) 135 | *.csv 136 | *.txt 137 | -------------------------------------------------------------------------------- /01. Permutation and Combinations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "nWAP42l0C3en" 7 | }, 8 | "source": [ 9 | "# Permutation and Combination" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "id": "wO-hsVKwC3er" 16 | }, 17 | "source": [ 18 | "The concepts of **Permutations** and **Combinations** pertain to different methods of arranging elements within a given set of objects. The primary distinction between them is the importance of order in the arrangement.\r\n", 19 | "\r\n", 20 | "A **permutation** considers the specific order in which elements are arranged. For example, the sequences \"ABC,\" \"BCA,\" and \"CAB\" are all distinct permutations of the same set of elements. \r\n", 21 | "\r\n", 22 | "In contrast, a **combination** treats the arrangement of elements as unordered. Therefore, combinations such as \"ABC,\" \"BCA,\" and \"CAB\" are all considered to be the same grouping of three characters.\r\n", 23 | "\r\n", 24 | "Let's explore some examples to better understand these concepts.\r\n" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "metadata": { 31 | "id": "VQcNkxNOC3es" 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "import itertools\n", 36 | "\n", 37 | "character_set = {'A', 'B', 'C'} \n", 38 | "permutations_taking_two_elements = itertools.permutations(character_set, 2)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": { 45 | "colab": { 46 | "base_uri": "https://localhost:8080/" 47 | }, 48 | "id": "fKTmATJwE4Rx", 49 | "outputId": "cccad988-a0d0-4900-8790-235d79bc48dd" 50 | }, 51 | "outputs": [ 52 | { 53 | "name": "stdout", 54 | "output_type": "stream", 55 | "text": [ 56 | "('B', 'C')\n", 57 | "('B', 'A')\n", 58 | "('C', 'B')\n", 59 | "('C', 'A')\n", 60 | "('A', 'B')\n", 61 | "('A', 'C')\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "for i in permutations_taking_two_elements:\n", 67 | " print(i)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 3, 73 | "metadata": { 74 | "colab": { 75 | "base_uri": "https://localhost:8080/" 76 | }, 77 | "id": "TO6cDX02FQnJ", 78 | "outputId": "6cf57180-1338-460b-c044-554fc22e5e9f" 79 | }, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "('B', 'C', 'A')\n", 86 | "('B', 'A', 'C')\n", 87 | "('C', 'B', 'A')\n", 88 | "('C', 'A', 'B')\n", 89 | "('A', 'B', 'C')\n", 90 | "('A', 'C', 'B')\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "\n", 96 | "permutations_taking_three_elements = itertools.permutations(character_set, 3)\n", 97 | "for i in 
permutations_taking_three_elements:\n", 98 | " print(i)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "From the above example, we observe that there are a total of 6 possible arrangements from a set containing 3 letters, taking 3 at a time. To calculate this value mathematically, we use the formula:\n", 106 | "\n", 107 | "$$ ^nP_r = \\frac{n!}{(n-r)!} $$\n", 108 | "\n", 109 | "where: \n", 110 | "- $ n $ is the number of elements in the set. \n", 111 | "- $ r $ is the number of elements taken together.\n", 112 | "\n", 113 | "**Example:**\n", 114 | "\n", 115 | "**Permutation taking 2 elements together from a set of 3 elements:**\n", 116 | "\n", 117 | "$$ ^3P_2 = \\frac{3!}{(3-2)!} = \\frac{3!}{1!} = \\frac{6}{1} = 6 $$\n", 118 | "\n", 119 | "**Permutation taking 3 elements together from a set of 3 elements:**\n", 120 | "\n", 121 | "$$ ^3P_3 = \\frac{3!}{(3-3)!} = \\frac{3!}{0!} = \\frac{6}{1} = 6 $$\n", 122 | "\n", 123 | "Now, let's look at combinations." 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "The formula for calculating the number of combinations from $ n $ elements taken $ r $ at a time is given by:\n", 131 | "$$^n C_r = \\frac{n!}{(n-r)! r!}$$" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 8, 137 | "metadata": { 138 | "colab": { 139 | "base_uri": "https://localhost:8080/" 140 | }, 141 | "id": "vjmrhsluC3ey", 142 | "outputId": "1202466c-0d9d-4418-e6b8-45b90dbadba3" 143 | }, 144 | "outputs": [ 145 | { 146 | "name": "stdout", 147 | "output_type": "stream", 148 | "text": [ 149 | "('B', 'C')\n", 150 | "('B', 'A')\n", 151 | "('C', 'A')\n" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "combination_taking_two_elements = itertools.combinations(character_set, 2)\n", 157 | "for i in combination_taking_two_elements:\n", 158 | " print(i)" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 9, 164 | "metadata": { 165 | "colab": { 166 | "base_uri": "https://localhost:8080/" 167 | }, 168 | "id": "SYO9HIj8C3ew", 169 | "outputId": "3d3bf569-3411-41d0-a9a9-55d503168d2a" 170 | }, 171 | "outputs": [ 172 | { 173 | "name": "stdout", 174 | "output_type": "stream", 175 | "text": [ 176 | "('B', 'C', 'A')\n" 177 | ] 178 | } 179 | ], 180 | "source": [ 181 | "combination_taking_three_elements = itertools.combinations(character_set, 3)\n", 182 | "for i in combination_taking_three_elements: \n", 183 | " print(i)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": { 189 | "id": "eqjVrwFyC3ez" 190 | }, 191 | "source": [ 192 | "In addition to these, itertools provides two more functions:\r\n", 193 | "\r\n", 194 | "- **Combinations with Replacement**: This function generates all possible combinations of $ r $ elements from a given iterable, allowing elements to be selected multiple times. It's useful when repetitions are allowed in the selection process.\r\n", 195 | "\r\n", 196 | "- **Product**: This function computes the Cartesian product of input iterables. It generates all possible combinations where each element from one iterable is combined with every element from other iterables. It's beneficial for creating all possible combinations of multiple sets of elements." 
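A quick numerical check of the two formulas above — a minimal sketch (an editorial addition, not part of the original notebook) using Python's standard `math` module, whose `math.perm` and `math.comb` functions are available from Python 3.8 onward:

```python
import math

n = 3  # size of the character set {'A', 'B', 'C'}

# Permutations: order matters -> n! / (n - r)!
print(math.perm(n, 2))  # 6, matching 3P2 computed above
print(math.perm(n, 3))  # 6, matching 3P3 computed above

# Combinations: order does not matter -> n! / ((n - r)! * r!)
print(math.comb(n, 2))  # 3, matching the three pairs printed above
print(math.comb(n, 3))  # 1, matching the single triple printed above
```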
197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 6, 202 | "metadata": { 203 | "colab": { 204 | "base_uri": "https://localhost:8080/" 205 | }, 206 | "id": "1czZRTC_C3e0", 207 | "outputId": "83164b64-70d0-41c0-a244-cdefd295680f" 208 | }, 209 | "outputs": [ 210 | { 211 | "name": "stdout", 212 | "output_type": "stream", 213 | "text": [ 214 | "('B', 'B')\n", 215 | "('B', 'C')\n", 216 | "('B', 'A')\n", 217 | "('C', 'C')\n", 218 | "('C', 'A')\n", 219 | "('A', 'A')\n" 220 | ] 221 | } 222 | ], 223 | "source": [ 224 | "combination_taking_two_elements_with_replacement = itertools.combinations_with_replacement(character_set, 2)\n", 225 | "for i in combination_taking_two_elements_with_replacement:\n", 226 | " print(i)" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 7, 232 | "metadata": { 233 | "colab": { 234 | "base_uri": "https://localhost:8080/" 235 | }, 236 | "id": "QKeNbHWjC3e0", 237 | "outputId": "8363c139-daf4-4784-ff7f-2f72d3580287" 238 | }, 239 | "outputs": [ 240 | { 241 | "name": "stdout", 242 | "output_type": "stream", 243 | "text": [ 244 | "('B', 'B')\n", 245 | "('B', 'C')\n", 246 | "('B', 'A')\n", 247 | "('C', 'B')\n", 248 | "('C', 'C')\n", 249 | "('C', 'A')\n", 250 | "('A', 'B')\n", 251 | "('A', 'C')\n", 252 | "('A', 'A')\n" 253 | ] 254 | } 255 | ], 256 | "source": [ 257 | "product_taking_two_elements_with_replacement = itertools.product(character_set, repeat=2)\n", 258 | "for i in product_taking_two_elements_with_replacement:\n", 259 | " print(i)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": { 266 | "id": "U5K1Bqt2C3e1" 267 | }, 268 | "outputs": [], 269 | "source": [] 270 | } 271 | ], 272 | "metadata": { 273 | "colab": { 274 | "collapsed_sections": [], 275 | "name": "01. Permutation and Combinations.ipynb", 276 | "provenance": [] 277 | }, 278 | "kernelspec": { 279 | "display_name": "Python 3 (ipykernel)", 280 | "language": "python", 281 | "name": "python3" 282 | }, 283 | "language_info": { 284 | "codemirror_mode": { 285 | "name": "ipython", 286 | "version": 3 287 | }, 288 | "file_extension": ".py", 289 | "mimetype": "text/x-python", 290 | "name": "python", 291 | "nbconvert_exporter": "python", 292 | "pygments_lexer": "ipython3", 293 | "version": "3.11.7" 294 | } 295 | }, 296 | "nbformat": 4, 297 | "nbformat_minor": 4 298 | } 299 | -------------------------------------------------------------------------------- /02. Probability and Rules of Probability.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "75tuxa_BKLEX" 7 | }, 8 | "source": [ 9 | "# **Probability and Rules of Probability**\n", 10 | "___" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "id": "8xI_XGaIKLEc" 17 | }, 18 | "source": [ 19 | "### Understanding Fundamental Concepts in Probability\r\n", 20 | "\r\n", 21 | "Before delving into the intricacies of probability, it's essential to grasp some fundamental terms and definitions associated with it.\r\n", 22 | "\r\n", 23 | "#### Random Experiment:\r\n", 24 | "A random experiment is characterized by its unpredictable outcomes when repeated under identical conditions. 
Examples include rolling a die or tossing an unbiased coin.\r\n", 25 | "\r\n", 26 | "#### Outcome:\r\n", 27 | "An outcome refers to the result obtained from a single trial of an experiment.\r\n", 28 | "\r\n", 29 | "#### Sample Space:\r\n", 30 | "The sample space represents a comprehensive list encompassing all potential outcomes of an experiment. For instance, in the case of tossing a coin, the sample space would be $\\{Heads, Tails\\}$, while for rolling a die, it would consist of $\\{1, 2, 3, 4, 5, 6\\}$.\r\n", 31 | "\r\n", 32 | "#### Event:\r\n", 33 | "An event denotes a subset of the sample space and can comprise either a single outcome or a combination of outcomes. For instance, obtaining at least two heads in a row when a coin is tossed four times constitutes an event. Another example could involve getting heads on a coin and rolling a six on a die simultaneously.\r\n", 34 | "\r\n", 35 | "### Probability:\r\n", 36 | "\r\n", 37 | "Probability serves as a quantifiable measure of the likelihood of an event occurring.\r\n", 38 | "\r\n", 39 | "**Note:** Events cannot be predicted with absolute certainty. Probability allows us to assess the likelihood of an event happening, ranging between 0 and 1. A probability of \"Zero\" signifies that the event is impossible, while a value of \"One\" indicates certainty.\r\n", 40 | "\r\n", 41 | "The probability of an event $A$, denoted as $P(A)$, is calculated using the formula:\r\n", 42 | "\r\n", 43 | "$$ P(A) = \\frac {n(A)}{n(S)} $$\r\n", 44 | "\r\n", 45 | "where: \r\n", 46 | "- $P(A)$ represents the probability of event $A$ occurring. \r\n", 47 | "- $n(A)$ denotes the number of favorable outcomes for event $A$. \r\n", 48 | "- $n(S)$ signifies the total number of possible outcomes.\r\n", 49 | "\r\n", 50 | "**Example:** \r\n", 51 | "The probability of rolling a number less than or equal to 2 when tossing a die is $\\frac{2}{6} = \\frac{1}{3}$.\r\n" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### **Rules of Probability**\n", 59 | "\n", 60 | "Understanding the rules governing probability is crucial for accurate analysis and interpretation.\n", 61 | "\n", 62 | "+ The probability of an event can range anywhere from 0 to 1: \n", 63 | " $0 \\leq P(A) \\leq 1.$ \n", 64 | " This signifies that probabilities lie within the range of certainty from impossible (0) to certain (1).\n", 65 | "\n", 66 | "+ The probabilities of an event and its complement add up to 1: \n", 67 | " $P(A) + P(\\overline{A}) = 1.$ \n", 68 | " This rule highlights that the combined probability of an event occurring and not occurring is always equal to 1.\n", 69 | "\n", 70 | "+ Complementary Rule - Probability of event A not happening: \n", 71 | " $P(\\overline{A})=1-P(A).$ \n", 72 | " It indicates that the probability of an event not occurring is equal to 1 minus the probability of the event occurring. \n", 73 | "\n", 74 | "+ Addition Rule (A and B are not necessarily disjoint) - Probability of A happening or B happening: \n", 75 | " $P(A\\cup B)=P(A)+P(B)-P(A\\cap B).$ \n", 76 | " This rule calculates the probability of either event A or event B happening, accounting for the overlap if they are not mutually exclusive. \n", 77 | "\n", 78 | "+ Addition Rule (A and B are disjoint) - Probability of A happening or B happening: \n", 79 | " $P(A\\cup B)=P(A)+P(B).$ \n", 80 | " This rule simplifies the addition of probabilities when events A and B are mutually exclusive. 
\n", 81 | " \n", 82 | "+ Multiplication Rule - Chain Rule: \n", 83 | " $P(A\\cap B)=P(A)*P(B|A)=P(B)*P(A|B).$ \n", 84 | " This rule computes the joint probability of events A and B occurring, taking into account the conditional probabilities. \n", 85 | "\n", 86 | "+ If A and B are independent events, then: \n", 87 | " $P(A\\cap B)=P(A)*P(B).$ \n", 88 | " This implies that the occurrence of one event does not affect the probability of the other event. \n", 89 | "\n", 90 | "+ $P(A\\setminus B)=P(A)-P(A\\cap B).$ \n", 91 | " This rule calculates the probability of event A happening excluding the outcomes also included in event B. \n", 92 | "\n", 93 | "+ $If A\\subset B, \\text{then}\\ P(A)\\leq P(B).$ \n", 94 | " This indicates that the probability of a subset event A is always less than or equal to the probability of the superset event B. \n", 95 | "\n", 96 | "+ $P(\\emptyset)=0. $ \n", 97 | " The probability of the empty set is always zero. " 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": { 103 | "id": "3h0VgeSGKLEe" 104 | }, 105 | "source": [ 106 | "## **Conditional Probability**\n", 107 | "\n", 108 | "Conditional probability of event **A** given event **B** is the probability that **A** occurs given that **B** has occurred.\n", 109 | "\n", 110 | "$$P(A|B)=\\frac{P(A\\cap B)}{P(B)}\\,.$$\n", 111 | "\n", 112 | "Let's illustrate this with an example:\n", 113 | "\n", 114 | "Suppose we roll a fair die, and let event A be the outcome being an odd number (i.e., A={1,3,5}), and event B be the outcome being less than or equal to 3 (i.e., B={1,2,3}). What is the probability of A given B, $P(A|B)$?\n", 115 | "\n", 116 | "$$P(B) = \\frac{3}{6} \\quad , \\quad P(A \\cap B) = \\frac{2}{6}$$ \n", 117 | "\n", 118 | "$$P(A|B) = \\frac{2}{3}$$\n", 119 | "\n", 120 | "\n", 121 | "## **Law of Large Numbers**\n", 122 | "\n", 123 | "The law of large numbers asserts that as the sample size increases, the average or mean of the sample values will converge towards the expected value.\n", 124 | "\n", 125 | "This principle can be exemplified through a basic scenario of flipping a coin. With a coin having equal chances of landing heads or tails, the expected probability of it landing heads is 1/2 or 0.5 over an infinite number of flips.\n", 126 | "\n", 127 | "However, if we only flip the coin 10 times, we may observe a deviation from the expected value. For instance, the coin might land heads only 3 times out of the 10 flips, which doesn't align closely with the expected probability of 0.5. This discrepancy is due to the relatively small sample size.\n", 128 | "\n", 129 | "As the number of flips increases, say to 20 or 30 times, we would expect the proportion of heads to gradually approach 0.5. For instance, after 20 flips, we might see 9 heads, and after 30 flips, we might observe 22 heads. With a larger sample size, the observed proportion of heads tends to converge towards the expected value of 0.5.\n", 130 | "\n", 131 | "\n", 132 | "
" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [] 141 | } 142 | ], 143 | "metadata": { 144 | "colab": { 145 | "name": "02. Probability and Rules of Probability.ipynb", 146 | "provenance": [] 147 | }, 148 | "kernelspec": { 149 | "display_name": "Python 3 (ipykernel)", 150 | "language": "python", 151 | "name": "python3" 152 | }, 153 | "language_info": { 154 | "codemirror_mode": { 155 | "name": "ipython", 156 | "version": 3 157 | }, 158 | "file_extension": ".py", 159 | "mimetype": "text/x-python", 160 | "name": "python", 161 | "nbconvert_exporter": "python", 162 | "pygments_lexer": "ipython3", 163 | "version": "3.11.7" 164 | } 165 | }, 166 | "nbformat": 4, 167 | "nbformat_minor": 4 168 | } 169 | -------------------------------------------------------------------------------- /03. Bayes Theorem.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "UuH9nzMnSFda" 7 | }, 8 | "source": [ 9 | "# Bayes theorem" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## **Bayes Theorem**\n", 17 | "\n", 18 | "Bayes' theorem is a mathematical principle used to calculate the conditional probability of an event given some evidence related to that event. It establishes a relationship between the probability of an event and prior knowledge of conditions associated with it. As evidence accumulates, the probability of the event can be determined more accurately.\n", 19 | "\n", 20 | "$$ P(A|B) = \\frac{P(B|A)P(A)}{P(B)} $$\n", 21 | "\n", 22 | "- $P(A|B)$, also known as the posterior probability, represents the probability of the hypothesis being true given the available data.\n", 23 | " \n", 24 | "- $P(B|A)$ is the probability of obtaining the evidence given the hypothesis.\n", 25 | " \n", 26 | "- $P(A)$ is the prior probability, representing the probability of the hypothesis being true before any data is considered.\n", 27 | " \n", 28 | "- $P(B)$ is the general probability of occurrence of the evidence, without any hypothesis, also known as the normalizing constant.\n", 29 | "\n", 30 | "**Example: Fire and Smoke**\n", 31 | "\n", 32 | "Suppose we want to find the probability of a fire given that there is smoke:\n", 33 | "\n", 34 | "$$P(Fire|Smoke) =\\frac {P(Smoke|Fire) * P(Fire)}{P(Smoke)}$$\n", 35 | "\n", 36 | "Here,\n", 37 | "- $P(Fire)$ is the Prior\n", 38 | "- $P(Smoke|Fire)$ is the Likelihood\n", 39 | "- $P(Smoke)$ is the evidence" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "**Example:**\n", 47 | "\n", 48 | "Let's consider a scenario where an individual tests positive for an illness. This particular illness affects approximately 1.2% of the population at any given time. 
The diagnostic test for this illness has an accuracy of 85% for individuals who actually have the illness and 97% for those who do not.\n", 49 | "\n", 50 | "Now, let's define the events involved:\n", 51 | "\n", 52 | "- $A$: The individual has the illness, also known as the hypothesis.\n", 53 | "- $\\overline{A}$: The individual does not have the illness.\n", 54 | "- $B$: The individual tests positive for the illness, also referred to as the evidence.\n", 55 | "- $P(A|B)$: The probability that the individual has the illness given a positive test result, known as the posterior probability, which is what we aim to calculate.\n", 56 | "- $P(B|A)$: The probability that the individual tests positive given that they have the illness, which is 0.85 according to the test's accuracy.\n", 57 | "- $P(A)$: The prior probability or the likelihood of the individual having the illness without any evidence, which is 0.012 based on the prevalence of the illness in the population.\n", 58 | "- $P(B)$: The probability that the individual tests positive for the illness. This can be computed in two ways:\n", 59 | "\n", 60 | " - True Positive (individual has the illness and tests positive): $P(B|A)*P(A)=0.85*0.012=0.0102.$\n", 61 | " - False Positive (individual does not have the illness but tests positive due to test inaccuracy): $P(B|\\overline{A})*P(\\overline{A})=(1-0.97)*(1-0.012)=0.02964.$\n", 62 | " \n", 63 | " Here, $P(B|\\overline{A})$ represents the probability of a positive test result for an individual who does not have the illness, indicating the test's inaccuracy for those without the illness.\n", 64 | " \n", 65 | " Additionally, $P(\\overline{A})$ denotes the probability that the individual does not have the illness, which is derived from the complement of the illness prevalence.\n", 66 | " \n", 67 | " Hence, $P(B)$, the denominator in Bayes' theorem, is the sum of these two probabilities:\n", 68 | " \n", 69 | " $P(B)= (P(B|A)*P(A)) + (P(B|\\overline{A})*P(\\overline{A}))=0.0102+0.02964=0.03984$.\n", 70 | " \n", 71 | " We can now compute the final answer using Bayes' theorem formula:\n", 72 | " \n", 73 | " $P(A|B)=P(B|A)*P(A)/P(B) =0.85*0.012 / 0.03984 = 0.256$.\n", 74 | " \n", 75 | " Thus, even with a positive medical test, the individual only has a 25.6% chance of actually suffering from the illness.\n" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [] 84 | } 85 | ], 86 | "metadata": { 87 | "colab": { 88 | "collapsed_sections": [], 89 | "name": "03. Bayes Theorem.ipynb", 90 | "provenance": [] 91 | }, 92 | "kernelspec": { 93 | "display_name": "Python 3 (ipykernel)", 94 | "language": "python", 95 | "name": "python3" 96 | }, 97 | "language_info": { 98 | "codemirror_mode": { 99 | "name": "ipython", 100 | "version": 3 101 | }, 102 | "file_extension": ".py", 103 | "mimetype": "text/x-python", 104 | "name": "python", 105 | "nbconvert_exporter": "python", 106 | "pygments_lexer": "ipython3", 107 | "version": "3.11.7" 108 | } 109 | }, 110 | "nbformat": 4, 111 | "nbformat_minor": 4 112 | } 113 | -------------------------------------------------------------------------------- /04. 
Variables and Data types.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "4affdc6c", 6 | "metadata": {}, 7 | "source": [ 8 | "# Variables and Data types\n", 9 | "___" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "id": "65063c08-be77-4aab-8e03-dc04d613ad83", 15 | "metadata": {}, 16 | "source": [ 17 | "## Understanding Variables\n", 18 | "\n", 19 | "In statistical studies, variables denote characteristics of the subjects under analysis. Selecting appropriate variables is pivotal in designing successful experiments, as they help anticipate outcomes.\n", 20 | "\n", 21 | "For instance, when predicting house prices, variables like the number of bedrooms, location, age, amenities nearby, and presence of a garage or pool are considered. These factors, aiding in price prediction, are termed variables.\n", 22 | "\n", 23 | "## Independent and Dependent Variables\n", 24 | "\n", 25 | "**Independent Variable**: Also known as explanatory or predictor variables, these are factors manipulated in an experiment to observe their impact on outcomes. They represent causes and are not influenced by other study variables.\n", 26 | "\n", 27 | "**Dependent Variable**: Referred to as response or outcome variables, these are observed results of an experiment. They represent effects and their values depend on changes made to the independent variable.\n", 28 | "\n", 29 | "## Types of Data\n", 30 | "\n", 31 | "Data, crucial for understanding relationships between variables, making predictions, and supporting decision-making, comes in various types. To accurately analyze and interpret data, it's essential to comprehend these types:\n", 32 | "\n", 33 | "**Quantitative Data**: Deals with quantities and measurements and can be either continuous or discrete. Continuous data refers to uninterrupted values along a scale, like distance and time. Discrete data, however, refers to specific values, like the number of students in a class or the outcome of rolling a die.\n", 34 | "\n", 35 | "**Categorical Data**: Represents groupings and is further categorized into nominal, ordinal, and binary types. Nominal data assigns values to categories without any inherent order, such as people's names and colors. Ordinal data, in contrast, assigns values with an order, like rating levels and grades. Binary data, the simplest form, has only two possible values, such as heads or tails in a coin flip, or yes or no.\n", 36 | "\n", 37 | "\n", 38 | "## Measurement Scales\n", 39 | "\n", 40 | "Measurement scales, also referred to as levels of measurement, elucidate how precisely variables are recorded in scientific research. Here, a variable denotes any attribute capable of assuming different values in a dataset (e.g., height, test scores).\n", 41 | "\n", 42 | "There exist four measurement scales:\n", 43 | "\n", 44 | "- **Nominal**: Data can only be categorized.\n", 45 | "- **Ordinal**: Data can be categorized and ordered.\n", 46 | "- **Interval**: Data can be categorized, ordered, and equally spaced.\n", 47 | "- **Ratio**: Data can be categorized, ordered, equally spaced, and possess a true zero point.\n", 48 | "\n", 49 | "The level of measurement for a variable profoundly influences the types of analyses feasible on it. 
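To make the nominal-versus-ordinal distinction concrete, here is a minimal sketch (an editorial addition, not part of the original notebook) using pandas categoricals; the example values are invented:

```python
import pandas as pd

# Nominal scale: categories without an inherent order
colors = pd.Series(["red", "blue", "red", "green"], dtype="category")
print(colors.value_counts())       # counting is the natural summary

# Ordinal scale: ordered categories, so ranking comparisons are meaningful
sizes = pd.Series(
    ["medium", "small", "large", "medium"],
    dtype=pd.CategoricalDtype(["small", "medium", "large"], ordered=True),
)
print(sizes.min(), sizes.max())    # 'small' 'large' -- order is well defined
print((sizes > "small").tolist())  # elementwise order comparisons work
```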
Ranging from nominal (low) to ratio (high), measurement scales vary in complexity and precision.\n" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "id": "e2708acb-cebe-434b-bb9e-9a873250ab3e", 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [] 59 | } 60 | ], 61 | "metadata": { 62 | "kernelspec": { 63 | "display_name": "Python 3 (ipykernel)", 64 | "language": "python", 65 | "name": "python3" 66 | }, 67 | "language_info": { 68 | "codemirror_mode": { 69 | "name": "ipython", 70 | "version": 3 71 | }, 72 | "file_extension": ".py", 73 | "mimetype": "text/x-python", 74 | "name": "python", 75 | "nbconvert_exporter": "python", 76 | "pygments_lexer": "ipython3", 77 | "version": "3.11.7" 78 | } 79 | }, 80 | "nbformat": 4, 81 | "nbformat_minor": 5 82 | } 83 | -------------------------------------------------------------------------------- /06. Measures of Central Tendency.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction to Measures of Central Tendency\n", 8 | "\n", 9 | "## Overview\n", 10 | "\n", 11 | "Measures of central tendency are statistical metrics that describe the central point of a dataset. They provide a summary that represents a typical value within the dataset. Key measures of central tendency include the mean, median, mode, percentile, and quartile.\n", 12 | "\n", 13 | "### Mean\n", 14 | "\n", 15 | "The mean is the arithmetic average of a dataset, calculated by summing all values and dividing by the number of values.\n", 16 | "\n", 17 | "**Properties of the Mean:**\n", 18 | "\n", 19 | "- The sum of deviations of the items from their arithmetic mean is always zero, i.e., $\\sum (x - \\overline{x}) = 0$.\n", 20 | "- The sum of the squared deviations from the arithmetic mean (A.M.) 
is minimized compared to deviations from any other value.\n", 21 | "- Replacing each item in the series with the mean results in a sum equal to the sum of the original items.\n", 22 | "- The mean is affected by every value in the dataset.\n", 23 | "- It is a calculated value and not dependent on the position within the series.\n", 24 | "- It is sensitive to extreme values (outliers).\n", 25 | "- The mean cannot typically be identified by inspection.\n", 26 | "- In some cases, the mean may not represent an actual value within the dataset (e.g., an average of 10.7 patients admitted per day).\n", 27 | "- The arithmetic mean is not suitable for extremely asymmetrical distributions.\n", 28 | "\n", 29 | "### Median\n", 30 | "\n", 31 | "The median is the middle value in an ordered dataset, representing the 50th percentile.\n", 32 | "\n", 33 | "**Properties of the Median:**\n", 34 | "\n", 35 | "- The median is not influenced by all data values.\n", 36 | "- It is determined by its position in the dataset and not by individual values.\n", 37 | "- The distance from the median to all other values is minimized compared to any other point.\n", 38 | "- Every dataset has a single median.\n", 39 | "- The median cannot be algebraically manipulated or combined.\n", 40 | "- It remains stable in grouped data procedures.\n", 41 | "- It is not applicable to qualitative data.\n", 42 | "- The data must be ordered for median calculation.\n", 43 | "- The median is suitable for ratio, interval, and ordinal scales.\n", 44 | "- Outliers and skewed data have less impact on the median.\n", 45 | "- The median is a better measure than the mean in skewed distributions.\n", 46 | "\n", 47 | "### Mode\n", 48 | "\n", 49 | "The mode is the most frequently occurring value in a dataset with discrete values.\n", 50 | "\n", 51 | "**Properties of the Mode:**\n", 52 | "\n", 53 | "- The mode is useful when the most typical case is desired.\n", 54 | "- It can be used with nominal or categorical data, such as religious preference, gender, or political affiliation.\n", 55 | "- The mode may not be unique; a dataset can have more than one mode or none at all.\n", 56 | "\n", 57 | "### Percentile\n", 58 | "\n", 59 | "A percentile indicates the percentage of values in a dataset that fall below a particular value. The median is the 50th percentile.\n", 60 | "\n", 61 | "### Quartile\n", 62 | "\n", 63 | "A quartile divides an ordered dataset into four equal parts. \n", 64 | "\n", 65 | "- $Q_1$ (first quartile) corresponds to the 25th percentile.\n", 66 | "- $Q_2$ corresponds to the median.\n", 67 | "- $Q_3$ corresponds to the 75th percentile.\n" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [] 76 | } 77 | ], 78 | "metadata": { 79 | "kernelspec": { 80 | "display_name": "Python 3 (ipykernel)", 81 | "language": "python", 82 | "name": "python3" 83 | }, 84 | "language_info": { 85 | "codemirror_mode": { 86 | "name": "ipython", 87 | "version": 3 88 | }, 89 | "file_extension": ".py", 90 | "mimetype": "text/x-python", 91 | "name": "python", 92 | "nbconvert_exporter": "python", 93 | "pygments_lexer": "ipython3", 94 | "version": "3.11.7" 95 | } 96 | }, 97 | "nbformat": 4, 98 | "nbformat_minor": 4 99 | } 100 | -------------------------------------------------------------------------------- /07. 
Measures of Variability.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Measures of Variability" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Measures of dispersion provide a quantitative assessment of the spread within a distribution. They indicate whether the values are clustered around a central point or dispersed across a range. The following are the most commonly used measures of dispersion:\n", 15 | "\n", 16 | "**Range:** The range represents the difference between the highest and lowest values in a dataset.\n", 17 | "\n", 18 | "**Interquartile Range (IQR):** The IQR measures the difference between the third quartile ($Q_3$) and the first quartile ($Q_1$). It is less affected by extreme values, focusing on the middle portion of the dataset. This makes the IQR particularly useful for skewed distributions with outliers. The IQR is calculated as: \n", 19 | "$$IQR = Q_3 - Q_1$$\n", 20 | "\n", 21 | "**Variance:** Variance quantifies the extent to which the values in a dataset deviate from the mean. It provides an indication of whether the mean is a representative measure of central tendency. A small variance suggests that the mean is a good representation of the dataset. The formula for variance is:\n", 22 | "\n", 23 | "$$\\sigma^2 = \\frac{\\sum (x-\\mu)^2}{N}$$\n", 24 | " \n", 25 | "Where $\\mu$ is the mean, and $N$ is the number of values in the dataset.\n", 26 | "\n", 27 | "**Sample Variance** is given by:\n", 28 | "\n", 29 | "$$S^2 = \\frac{\\sum (x-\\overline x)^2}{n-1}$$\n", 30 | "\n", 31 | "Where $\\overline x$ is the sample mean, and $n$ is the number of values in the sample.\n", 32 | "\n", 33 | "**Standard deviation:** This measure is calculated by taking the square root of the variance. Since the variance is not in the same units as the original data (it involves squaring the differences), taking the square root brings the standard deviation back to the same units as the data. For example, in a dataset measuring average rainfall in centimeters, the variance would be in $cm^2$, which isn't interpretable. However, the standard deviation, expressed in $cm$, provides a meaningful indication of the average deviation of rainfall in centimeters.\n", 34 | "\n", 35 | "**Skewness:** This measures the degree of asymmetry of a distribution\n", 36 | "\n", 37 | "
\n", 38 | "\n", 39 | "**Positive Skewness:** A positively skewed distribution is characterized by numerous outliers in the upper region, or right tail. It is termed \"skewed right\" due to its relatively elongated upper (right) tail.\n", 40 | "\n", 41 | "**Negative Skewness:** Conversely, a negatively skewed distribution exhibits a disproportionate number of outliers within its lower (left) tail. Such a distribution is referred to as \"skewed left\" owing to its extended lower tail.\n", 42 | "\n", 43 | "**Kurtosis:** Kurtosis serves as a measure indicating the curvature, peakiness, or flatness of a given distribution of data.\n", 44 | "\n", 45 | "
" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 1, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "import pandas as pd\n", 55 | "data = pd.Series([19,23,19,18,25,16,17,19,15,23,21,23,21,11,6])" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 2, 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "data": { 65 | "text/plain": [ 66 | "count 15.000000\n", 67 | "mean 18.400000\n", 68 | "std 4.997142\n", 69 | "min 6.000000\n", 70 | "25% 16.500000\n", 71 | "50% 19.000000\n", 72 | "75% 22.000000\n", 73 | "max 25.000000\n", 74 | "dtype: float64" 75 | ] 76 | }, 77 | "execution_count": 2, 78 | "metadata": {}, 79 | "output_type": "execute_result" 80 | } 81 | ], 82 | "source": [ 83 | "data.describe()" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 3, 89 | "metadata": {}, 90 | "outputs": [ 91 | { 92 | "data": { 93 | "text/plain": [ 94 | "0 19\n", 95 | "1 23\n", 96 | "dtype: int64" 97 | ] 98 | }, 99 | "execution_count": 3, 100 | "metadata": {}, 101 | "output_type": "execute_result" 102 | } 103 | ], 104 | "source": [ 105 | "data.mode()" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "The values 19 and 23 are the most frequently occurring values" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 4, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "data": { 122 | "text/plain": [ 123 | "19.0" 124 | ] 125 | }, 126 | "execution_count": 4, 127 | "metadata": {}, 128 | "output_type": "execute_result" 129 | } 130 | ], 131 | "source": [ 132 | "data.median()" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 5, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "19" 144 | ] 145 | }, 146 | "execution_count": 5, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "range_data = max(data)-min(data)\n", 153 | "range_data" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 6, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/plain": [ 164 | "4.99714204034952" 165 | ] 166 | }, 167 | "execution_count": 6, 168 | "metadata": {}, 169 | "output_type": "execute_result" 170 | } 171 | ], 172 | "source": [ 173 | "data.std()" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 7, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/plain": [ 184 | "24.97142857142857" 185 | ] 186 | }, 187 | "execution_count": 7, 188 | "metadata": {}, 189 | "output_type": "execute_result" 190 | } 191 | ], 192 | "source": [ 193 | "data.var()" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 8, 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "data": { 203 | "text/plain": [ 204 | "(-1.038344732097918, 0.6995494033062934)" 205 | ] 206 | }, 207 | "execution_count": 8, 208 | "metadata": {}, 209 | "output_type": "execute_result" 210 | } 211 | ], 212 | "source": [ 213 | "from scipy.stats import skew, kurtosis\n", 214 | "\n", 215 | "skew(data), kurtosis(data)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "**Points to note:** \n", 223 | "1. The mean value is affected by outliers (extreme values). Whenever there are outliers in a dataset, it is better to use the median.\n", 224 | "2. The standard deviation and variance are closely tied to the mean. 
Thus, if there are outliers, standard deviation and variance may not be representative measures either.\n", 225 | "3. The mode is generally used for discrete data since there can be more than one modal value for continuous data." 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [] 234 | } 235 | ], 236 | "metadata": { 237 | "kernelspec": { 238 | "display_name": "Python 3 (ipykernel)", 239 | "language": "python", 240 | "name": "python3" 241 | }, 242 | "language_info": { 243 | "codemirror_mode": { 244 | "name": "ipython", 245 | "version": 3 246 | }, 247 | "file_extension": ".py", 248 | "mimetype": "text/x-python", 249 | "name": "python", 250 | "nbconvert_exporter": "python", 251 | "pygments_lexer": "ipython3", 252 | "version": "3.11.7" 253 | } 254 | }, 255 | "nbformat": 4, 256 | "nbformat_minor": 4 257 | } 258 | -------------------------------------------------------------------------------- /08. Central Limit Theorem.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Central Limit Theorem" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## The Central Limit Theorem\n", 15 | "\n", 16 | "The Central Limit Theorem (CLT) posits that the distribution of *sample means* drawn from a population will approximate a normal distribution, irrespective of the shape of the population distribution, provided the sample size is sufficiently large (typically n > 30). Even for populations that are already normally distributed, the theorem remains valid for smaller sample sizes.\n", 17 | "\n", 18 | "## Estimating the Population Mean\n", 19 | "\n", 20 | "While the sample mean serves as an estimate for the population mean, it's essential to recognize that the standard deviation of the sampling distribution ($\\sigma_{\\overline{x}}$) differs from the population standard deviation ($\\sigma$).\n", 21 | "\n", 22 | "## The Standard Error\n", 23 | "\n", 24 | "The standard deviation of the sampling distribution ($\\sigma_{\\overline{x}}$) is termed the **standard error** and is linked to the population standard deviation by the formula:\n", 25 | "\n", 26 | "$$\\sigma_{\\overline{x}} = \\frac{\\sigma}{\\sqrt{n}}$$\n", 27 | "\n", 28 | "Here, $\\sigma$ represents the population standard deviation, and $n$ denotes the sample size. \n", 29 | "\n", 30 | "As the sample size increases, the standard error diminishes toward 0, and the sample mean ($\\overline{x}$) converges towards the population mean ($\\mu$).\n" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "## Estimates and Confidence Intervals\n", 38 | "\n", 39 | "### Point Estimate\n", 40 | "\n", 41 | "A point estimate is a single statistic calculated from a sample used to estimate an unknown population parameter. For instance, the sample mean can serve as a point estimate for the population mean.\n", 42 | "\n", 43 | "### Interval Estimate\n", 44 | "\n", 45 | "An interval estimate is a range of values believed to encompass the true population parameter. It represents the margin of error in estimating the population parameter.\n", 46 | "\n", 47 | "### Confidence Interval\n", 48 | "\n", 49 | "A confidence interval is a range of values within which the population mean is presumed to lie. 
It can be calculated as follows:\n", 50 | "\n", 51 | "#### When Population Standard Deviation is Known\n", 52 | "\n", 53 | "For a random sample of size $n$ with mean $\\overline{x}$, taken from a population with standard deviation $\\sigma$ and mean $\\mu$, the confidence interval for the population mean is:\n", 54 | "\n", 55 | "$$\\overline{x} - \\frac{z\\sigma}{\\sqrt{n}} \\leq \\mu \\leq \\overline{x} + \\frac{z\\sigma}{\\sqrt{n}}$$\n", 56 | "\n", 57 | "#### When Population Standard Deviation is Unknown\n", 58 | "\n", 59 | "In cases where the population standard deviation is unknown, the sample standard deviation ($s$) substitutes $\\sigma$ in calculating the confidence interval:\n", 60 | "\n", 61 | "$$\\overline{x} - \\frac{zs}{\\sqrt{n}} \\leq \\mu \\leq \\overline{x} + \\frac{zs}{\\sqrt{n}}$$\n", 62 | "\n", 63 | "Here: \n", 64 | " \n", 65 | "$\\overline{x}$: Sample mean. \n", 66 | "$n$: Sample size. \n", 67 | "$\\mu$: Population mean (the parameter we are estimating). \n", 68 | "$\\sigma$: Population standard deviation (known when calculating the confidence interval with the formula that includes $\\sigma$). \n", 69 | "$s$: Sample standard deviation (used as an estimate of the population standard deviation when it's unknown). \n", 70 | "$z$: The critical value from the standard normal distribution corresponding to the desired confidence level. It is determined based on the chosen confidence level (e.g., 95% confidence level corresponds to a z-score of approximately 1.96). This value is used to calculate the margin of error.\n", 71 | "\n", 72 | "### Example\n", 73 | "\n", 74 | "Suppose we have grades of 10 students drawn from a population, and we aim to ascertain the 95% confidence interval for the population mean.\n" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 1, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "data": { 84 | "text/plain": [ 85 | "(3.1110006165952773, 3.668999383404722)" 86 | ] 87 | }, 88 | "execution_count": 1, 89 | "metadata": {}, 90 | "output_type": "execute_result" 91 | } 92 | ], 93 | "source": [ 94 | "import numpy as np\n", 95 | "import scipy.stats as stats\n", 96 | "from scipy.stats import t\n", 97 | "\n", 98 | "grades = np.array([3.1,2.9,3.2,3.4,3.7,3.9,3.9,2.8,3.4,3.6])\n", 99 | "\n", 100 | "stats.t.interval(0.95, len(grades)-1, loc=np.mean(grades), scale=stats.sem(grades))" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "The arguments inside t.interval function are 95% confidence interval, degrees of freedom (n-1), sample mean and the standard error calculated by stats.sem function." 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [] 116 | } 117 | ], 118 | "metadata": { 119 | "kernelspec": { 120 | "display_name": "Python 3 (ipykernel)", 121 | "language": "python", 122 | "name": "python3" 123 | }, 124 | "language_info": { 125 | "codemirror_mode": { 126 | "name": "ipython", 127 | "version": 3 128 | }, 129 | "file_extension": ".py", 130 | "mimetype": "text/x-python", 131 | "name": "python", 132 | "nbconvert_exporter": "python", 133 | "pygments_lexer": "ipython3", 134 | "version": "3.11.7" 135 | } 136 | }, 137 | "nbformat": 4, 138 | "nbformat_minor": 4 139 | } 140 | -------------------------------------------------------------------------------- /09. 
Sampling and Sampling errors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Sampling\n", 8 | "\n", 9 | "Sampling serves as a fundamental technique for acquiring insights into a population by gathering data from a representative subset, rather than assessing every individual within the population. It presents a pragmatic approach when exhaustive data collection proves impractical. However, it is imperative that the sample mirrors the population's characteristics accurately.\n", 10 | "\n", 11 | "### Probability Sampling\n", 12 | "\n", 13 | "Probability sampling ensures that every member of the population possesses an equal chance of selection, thereby facilitating the creation of a sample that faithfully mirrors the population. Several commonly employed probability sampling techniques include:\n", 14 | "\n", 15 | "1. **Simple random sampling**: Subjects are selected entirely at random, without bias or preference, ensuring each has an equal probability of inclusion.\n", 16 | "\n", 17 | "2. **Stratified random sampling**: The population undergoes division into non-overlapping groups, from which subjects are randomly chosen. This method ensures representation across all relevant categories or strata.\n", 18 | "\n", 19 | "3. **Systematic random sampling**: Subjects are chosen at regular intervals, offering simplicity in execution but potentially risking representativeness if the interval choice is inappropriate.\n", 20 | "\n", 21 | "4. **Cluster sampling**: The population divides into non-overlapping clusters, from which a subset is randomly selected. This method offers convenience and cost-effectiveness.\n", 22 | "\n", 23 | "Advantages of Probability Sampling\n", 24 | "- Mitigation of Sample Bias\n", 25 | "- Representation of Diverse Population Characteristics\n", 26 | "- Generation of Accurate Sample Representations\n", 27 | "\n", 28 | "### Non-Probability Sampling\n", 29 | "\n", 30 | "Non-probability sampling deviates from the principle of equal probability of selection, thus increasing the likelihood of acquiring a non-representative sample. Commonly utilized non-probability sampling techniques include:\n", 31 | "\n", 32 | "1. **Convenience sampling**: Conveniently accessible subjects form the sample, often leading to ease of implementation but potential representativeness issues.\n", 33 | "\n", 34 | "2. **Judgmental or purposive sampling**: Selection is based on predefined criteria, aligning with the study's objectives, albeit potentially biasing the sample.\n", 35 | "\n", 36 | "3. **Quota sampling**: Quotas ensure the sample reflects significant population characteristics, albeit without the assurance of equal probability of selection.\n", 37 | "\n", 38 | "4. **Snowball sampling**: Initial subjects refer additional participants, commonly employed in populations with low visibility.\n", 39 | "\n", 40 | "**Note**: Non-probability sampling typically yields less reliable and less generalizable results compared to probability sampling methodologies.\n" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "## Types of Errors in Sampling\n", 48 | "\n", 49 | "When making inferences about a population based on a sample, it's possible to encounter various types of errors. 
These errors can be grouped into the following categories:\n", 50 | "\n", 51 | "- **Sampling Error**: The difference between the sample estimate for the population and the actual population estimate\n", 52 | "- **Coverage Error**: Occurs when the population is not adequately represented and some groups are excluded\n", 53 | "- **Nonresponse Error**: Occurs when we fail to include nonresponsive subjects who meet the criteria of the study, but are excluded because they do not answer the survey questions.\n", 54 | "- **Measurement Error**: Occurs when the correct parameters are not measured due to flaws in the measurement method or tool used.\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [] 63 | } 64 | ], 65 | "metadata": { 66 | "kernelspec": { 67 | "display_name": "Python 3 (ipykernel)", 68 | "language": "python", 69 | "name": "python3" 70 | }, 71 | "language_info": { 72 | "codemirror_mode": { 73 | "name": "ipython", 74 | "version": 3 75 | }, 76 | "file_extension": ".py", 77 | "mimetype": "text/x-python", 78 | "name": "python", 79 | "nbconvert_exporter": "python", 80 | "pygments_lexer": "ipython3", 81 | "version": "3.11.7" 82 | } 83 | }, 84 | "nbformat": 4, 85 | "nbformat_minor": 4 86 | } 87 | -------------------------------------------------------------------------------- /10. Hypothesis Testing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Hypothesis Testing\n", 8 | "\n", 9 | "Before delving into the intricacies of hypothesis testing, it's imperative to grasp some foundational concepts.\n", 10 | "\n", 11 | "## Population Parameters vs. Sample Statistics\n", 12 | "\n", 13 | "A **Parameter** denotes a characteristic of an entire population, such as the population mean. As it's typically impractical to measure the entire population, the true value of the parameter often eludes us. Commonly used parameters in statistics, like the population mean and standard deviation, are symbolized by Greek letters such as $\\mu$ (mu) and $\\sigma$ (sigma).\n", 14 | "\n", 15 | "Conversely, a **Statistic** represents a characteristic calculated from a sample. For instance, computing the mean and standard deviation of a sample yields sample statistics. In statistical parlance, sample parameters are denoted by Latin letters.\n", 16 | "\n", 17 | "**Inferential statistics** entails leveraging sample statistics to draw inferences about a population. This involves using sample statistics to estimate population parameters. To ensure validity, representative sampling techniques like random sampling are pivotal for obtaining unbiased estimates. Unbiased estimates are considered accurate on average, whereas biased estimates systematically deviate from the truth.\n", 18 | "\n", 19 | "## Parametric vs Nonparametric Analysis\n", 20 | "\n", 21 | "**Parametric statistics** assumes that the sample data stems from populations describable by probability distributions with fixed parameters. 
Consequently, parametric analysis reigns as the predominant statistical method.\n", 22 | "\n", 23 | "In contrast, **nonparametric tests** refrain from assuming any specific probability distribution for the underlying data.\n", 24 | "\n", 25 | "## Significance Level (Alpha)\n", 26 | "\n", 27 | "The **significance level** serves as a yardstick dictating the requisite strength of evidence from sample data to infer the presence of an effect in the population. Also known as alpha ($\\alpha$), it's a predetermined threshold established prior to the study. The significance level delineates the evidence threshold needed to reject the null hypothesis in favor of the alternative hypothesis.\n", 28 | "\n", 29 | "Reflecting the probability of erroneously rejecting the null hypothesis when true, the significance level quantifies the risk of asserting an effect's existence when none prevails. Lower significance levels signify heightened evidentiary thresholds, demanding more robust evidence before null hypothesis rejection. For instance, a significance level of 0.05 implies a 5% chance of committing a false positive error—declaring an effect's existence in its absence.\n", 30 | "\n", 31 | "## P-Values\n", 32 | "\n", 33 | "**P-values** gauge the strength of evidence against the null hypothesis furnished by sample data. A P-value below the established significance level denotes statistical significance.\n", 34 | "\n", 35 | "The P-value represents the probability of observing an effect in the sample data as extreme, or even more so, than the one observed if the null hypothesis held true. Essentially, it quantifies the extent to which the sample data contravenes the null hypothesis. Lower P-values denote more compelling evidence against the null hypothesis.\n", 36 | "\n", 37 | "When a P-value falls at or below the significance level, the null hypothesis is discarded, and the results are deemed statistically significant. This implies that the sample data furnishes adequate evidence to endorse the alternative hypothesis positing the effect's presence in the population.\n", 38 | "\n", 39 | "Conversely, when a P-value exceeds the significance level, the sample data fails to supply sufficient evidence for effect existence, prompting null hypothesis retention.\n", 40 | "\n", 41 | "Statistically, these verdicts translate as follows:\n", 42 | "\n", 43 | "+ Reject the null hypothesis when the P-value equals or falls below the significance level.\n", 44 | "+ Retain the null hypothesis when the P-value exceeds the significance level.\n", 45 | "\n", 46 | "\n", 47 | "## Hypothesis Testing\n", 48 | "\n", 49 | "**Hypothesis testing** is a statistical technique that evaluates the evidence of two opposing statements (hypotheses) about a population based on sample data. These hypotheses are known as the null hypothesis and the alternative hypothesis.\n", 50 | "\n", 51 | "The objective of hypothesis testing is to assess the sample statistic and its corresponding sampling error to determine which of the two hypotheses is more strongly supported by the data. If the null hypothesis can be rejected, it means that the results are statistically significant and the alternative hypothesis is favored, suggesting that an effect exists in the population.\n", 52 | "\n", 53 | "It is important to note that failing to reject the null hypothesis does not necessarily mean that the null hypothesis is true, nor does rejecting the null hypothesis necessarily imply that the alternative hypothesis is true. 
The results of a hypothesis test are only a suggestion or indication about the population, not a conclusive proof of either hypothesis.\n", 54 | "\n", 55 | "The null hypothesis is the theory that there is no effect (i.e., the effect size is equal to zero). It is commonly represented by $H_0$.\n", 56 | "\n", 57 | "The alternative hypothesis is the opposite theory, stating that the population parameter does not equal the value specified in the null hypothesis (i.e., there is a non-zero effect). It is usually represented by $H_1$ or $H_A$.\n", 58 | "\n", 59 | "The steps involved in hypothesis testing are as follows:\n", 60 | "\n", 61 | "1. State the null and alternative hypotheses.\n", 62 | "2. Specify the significance level and calculate the critical value of the test statistic.\n", 63 | "3. Choose the appropriate test based on factors such as the number of samples, population distribution, statistic being tested, sample size, and knowledge of the population standard deviation.\n", 64 | "4. Calculate the relevant test statistic (z-statistic, t-statistic, chi-square statistic, or f-statistic) or p-value.\n", 65 | "5. Compare the calculated test statistic with the critical test statistic or the p-value with the significance level.\n", 66 | "    - If using the test statistic:\n", 67 | "        - Reject the null hypothesis if the calculated test statistic is greater than the critical test statistic (upper-tail test).\n", 68 | "        - Reject the null hypothesis if the calculated test statistic is less than the critical test statistic, which is negative in a lower-tail test.\n", 69 | "    - If using the p-value:\n", 70 | "        - Reject the null hypothesis if the p-value is less than the significance level.\n", 71 | "6. Draw a conclusion based on the comparison made in step 5.\n", 72 | "\n", 73 | "## Confidence Interval\n", 74 | "\n", 75 | "A **confidence interval** can be calculated for various parameters such as population mean, population proportion, difference of population means, and difference of population proportions, among others. To construct a confidence interval, one needs to have a sample statistic that estimates the population parameter of interest and a measure of variability or standard error for that statistic. The confidence interval is calculated by adding to and subtracting from the sample statistic a margin of error (the standard error multiplied by an appropriate critical value). The result is a range of values that contains the true population parameter with a specified level of confidence.\n", 76 | "\n", 77 | "The key concept behind a confidence interval is that if we repeated our sampling process many times, the true population parameter would fall within the confidence interval for the specified percentage of these samples. For example, if we have a 95% confidence interval for a population mean, we can say that if we repeated our sampling process 100 times, the true population mean would fall within the calculated confidence intervals about 95 times out of 100.\n", 78 | "\n", 79 | "In conclusion, confidence intervals provide a range of plausible values for a population parameter based on the observed sample data and the level of confidence specified by the researcher. The confidence interval reflects the precision of the estimate, with wider intervals indicating less precision and narrower intervals indicating more precision.\n", 80 | "\n", 81 | "## Sampling Error\n", 82 | "\n", 83 | "A **sampling error** refers to the discrepancy between a population parameter and a sample statistic.
In a study, the sampling error represents the difference between the mean calculated from a sample and the actual mean of the population. Despite using a random sample selection process, sampling errors are still a possibility as the sample may not perfectly reflect the population with regards to numerical values such as means and standard deviations. To improve the accuracy of generalizing findings from a sample to a population, it's essential to minimize the sampling error. One way to do this is to increase the sample size.\n" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [] 92 | } 93 | ], 94 | "metadata": { 95 | "kernelspec": { 96 | "display_name": "Python 3 (ipykernel)", 97 | "language": "python", 98 | "name": "python3" 99 | }, 100 | "language_info": { 101 | "codemirror_mode": { 102 | "name": "ipython", 103 | "version": 3 104 | }, 105 | "file_extension": ".py", 106 | "mimetype": "text/x-python", 107 | "name": "python", 108 | "nbconvert_exporter": "python", 109 | "pygments_lexer": "ipython3", 110 | "version": "3.11.7" 111 | } 112 | }, 113 | "nbformat": 4, 114 | "nbformat_minor": 4 115 | } 116 | -------------------------------------------------------------------------------- /11. Parametric Tests.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Parametric Tests\n", 8 | "\n", 9 | "Parametric tests are statistical tools predicated on the assumption that the data adheres to a normal distribution. They facilitate inferences about population parameters based on sampled data.\n", 10 | "\n", 11 | "## One-Sample Test\n", 12 | "\n", 13 | "A one-sample test is employed when there's a single population of interest, and a solitary sample is extracted from it. It evaluates whether there's a notable discrepancy between the sample values and the population parameter.\n", 14 | "\n", 15 | "## Two-Sample Test\n", 16 | "\n", 17 | "The two-sample test enters the picture when samples are drawn from two distinct populations. It gauges whether the population parameters diverge significantly based on the sample statistics.\n", 18 | "\n", 19 | "## Critical Test Statistic\n", 20 | "\n", 21 | "The critical test statistic denotes the threshold value of the sample test statistic pivotal in discerning whether to embrace or repudiate the null hypothesis.\n", 22 | "\n", 23 | "## Region of Rejection\n", 24 | "\n", 25 | "The region of rejection delineates the spectrum of values wherein the null hypothesis is discarded. Conversely, the region of acceptance encompasses values where the null hypothesis holds sway.\n", 26 | "\n", 27 | "## Types of Tests\n", 28 | "\n", 29 | "Several types of tests are at our disposal:\n", 30 | "\n", 31 | "+ **Z-tests**: Apt for ample sample sizes (n ≥ 30) with a known population standard deviation.\n", 32 | "+ **T-tests**: Tailored for modest sample sizes (n < 30) with an unknown population standard deviation.\n", 33 | "+ **F-tests**: Tasked with comparing values across more than two variables.\n", 34 | "+ **Chi-square**: Devised for the comparison of categorical data.\n", 35 | "\n", 36 | "## One-Tail Test (Directional Test)\n", 37 | "\n", 38 | "A one-tail test enters the fray when probing for a change in the mean, armed with the knowledge of the change's direction. 
\n", 39 | "\n", 40 | "Two iterations of the one-tail test exist:\n", 41 | "\n", 42 | "+ **Upper one-tail**: The region of rejection resides on the right tail. It's invoked when scrutinizing whether the mean score has surged.\n", 43 | "+ **Lower one-tail**: The region of rejection graces the left tail. It's enlisted when assessing if the mean score has plummeted.\n", 44 | "\n", 45 | "## Two-Tail Test (Non-Directional Test)\n", 46 | "\n", 47 | "The two-tail test is deployed when scrutinizing a change in the mean sans knowledge of the direction. The region of rejection spans both tails of the distribution.\n", 48 | "\n", 49 | "## The P-value\n", 50 | "\n", 51 | "The p-value is the linchpin in deciding whether to embrace or eschew the null hypothesis. It's computed based on the sample data and juxtaposed with a significance level, usually 0.05. \n", 52 | "+ If p < 0.05, it intimates that the sample data is improbable to stem from randomness and doesn't mirror the population adequately. In such instances, the null hypothesis is jettisoned. \n", 53 | "+ If p > 0.05, it implies a heightened likelihood that the sample inadequately represents the population, prompting the null hypothesis's retention\n", 54 | "\n", 55 | "\n", 56 | "\n", 57 | "
" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 1, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "Calculating p given z: p = 0.95\n", 70 | "Calculating z given p: z = 1.6448536269514722\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "from scipy.stats import norm\n", 76 | "\n", 77 | "z = 1.6448536269514722\n", 78 | "p = 0.95\n", 79 | "\n", 80 | "print(\"Calculating p given z: p = \", norm.cdf(z))\n", 81 | "print(\"Calculating z given p: z = \", norm.ppf(p))" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "# Choosing a statistical test?\n", 89 | "\n", 90 | "Below is a simple diagram which shows how to choose a test depending on different data types.\n", 91 | "\n", 92 | "\n", 93 | "
\n" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [] 102 | } 103 | ], 104 | "metadata": { 105 | "kernelspec": { 106 | "display_name": "Python 3 (ipykernel)", 107 | "language": "python", 108 | "name": "python3" 109 | }, 110 | "language_info": { 111 | "codemirror_mode": { 112 | "name": "ipython", 113 | "version": 3 114 | }, 115 | "file_extension": ".py", 116 | "mimetype": "text/x-python", 117 | "name": "python", 118 | "nbconvert_exporter": "python", 119 | "pygments_lexer": "ipython3", 120 | "version": "3.11.7" 121 | } 122 | }, 123 | "nbformat": 4, 124 | "nbformat_minor": 4 125 | } 126 | -------------------------------------------------------------------------------- /12. Z-test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "7865f8f8-b588-40d3-8838-b4aed0b33c8b", 6 | "metadata": {}, 7 | "source": [ 8 | "# Z-Test\n", 9 | "\n", 10 | "## One-Sample Z-Test\n", 11 | "\n", 12 | "A one-sample Z-test is utilized to evaluate if the mean of a single sample differs from a known or hypothesized population mean. Several criteria must be fulfilled for a one-sample Z-test:\n", 13 | "\n", 14 | "- The population from which the sample is drawn follows a normal distribution.\n", 15 | "- The sample size exceeds 30.\n", 16 | "- Only one sample is obtained.\n", 17 | "- The hypothesis concerns the population mean.\n", 18 | "- The population standard deviation is known.\n", 19 | "\n", 20 | "The test statistic is computed using the formula:\n", 21 | "\n", 22 | "$$ z = \\frac {(\\overline x - \\mu)}{\\frac{\\sigma}{\\sqrt n}}$$\n", 23 | "\n", 24 | "where $x$ denotes the sample mean, $\\mu$ represents the population mean, $\\sigma$ stands for the population standard deviation, and $n$ is the sample size.\n", 25 | "\n", 26 | "## One-Sample Z-Test: One-Tail\n", 27 | "\n", 28 | "Suppose we have a pizza delivery shop with a historical average delivery time of 45 minutes and a standard deviation of 5 minutes. However, due to recent customer complaints, the shop decides to analyze the delivery time of the last 40 orders, revealing an average delivery time of 48 minutes. We aim to ascertain if the new mean significantly exceeds the population mean.\n", 29 | "\n", 30 | "The null hypothesis ($H_0$) posits that the mean delivery time equals 45 minutes: $\\mu = 45$. The alternative hypothesis ($H_1$) suggests that the mean delivery time surpasses 45 minutes: $\\mu > 45$. Let's adopt a significance level of $\\alpha = 0.05$. In this scenario, the region of rejection will be situated on the right tail." 
31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "id": "f793d2d7", 37 | "metadata": {}, 38 | "outputs": [ 39 | { 40 | "name": "stdout", 41 | "output_type": "stream", 42 | "text": [ 43 | "3.7947331922020555\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "z = (48-45)/(5/(40)**0.5)\n", 49 | "print(z)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 2, 55 | "id": "d4c5f52e", 56 | "metadata": {}, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | "7.390115516725526e-05\n" 63 | ] 64 | } 65 | ], 66 | "source": [ 67 | "import scipy.stats as stats\n", 68 | "p_value = 1 - stats.norm.cdf(z) # cumulative distribution function\n", 69 | "print(p_value)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "id": "6de1de5f-bf69-49d6-bb50-b81aa5b3806b", 75 | "metadata": {}, 76 | "source": [ 77 | "Since the p-value is less than $\alpha$, we reject the null hypothesis. At the 0.05 level, the average delivery time has increased significantly compared to the historical population average." 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "id": "005271aa-ffa0-4d34-8d67-3df03d2bb55e", 83 | "metadata": {}, 84 | "source": [ 85 | "## One-Sample Z-Test: Two-Tail\n", 86 | "\n", 87 | "Suppose we aim to investigate whether a drug has an impact on IQ. In this scenario, we opt for a two-tail test because we're interested in determining whether the drug affects IQ, regardless of whether it has a positive or negative effect.\n", 88 | "\n", 89 | "Given a significance level of $\alpha = 0.05$, our rejection regions are 0.025 on both the right and left tails.\n", 90 | "\n", 91 | "Assuming our population mean $\mu = 100$ and population standard deviation $\sigma = 15$, we conduct a study involving a sample of 100 subjects. Upon analysis, we discover that the mean IQ of the sample is 96." 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 3, 97 | "id": "d3823b16", 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "statistic: 2.6667\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "z = abs(96-100)/(15/(100**0.5)) # |sample mean - population mean| / standard error\n", 110 | "print(\"statistic: \", round(z, 4))" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 4, 116 | "id": "24e9f7e6", 117 | "metadata": {}, 118 | "outputs": [ 119 | { 120 | "name": "stdout", 121 | "output_type": "stream", 122 | "text": [ 123 | "Critical: 1.96\n" 124 | ] 125 | } 126 | ], 127 | "source": [ 128 | "import scipy.stats as stats\n", 129 | "critical = stats.norm.ppf(1-0.025) # percent point function, the inverse of the CDF\n", 130 | "print(\"Critical:\", round(critical, 4))" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "id": "d9863311-a166-4e95-8aed-e9069f79b69b", 136 | "metadata": {}, 137 | "source": [ 138 | "Since the magnitude of our test statistic (2.6667) exceeds the critical value (1.96), we conclude that the drug has a significant influence on IQ values at a criterion level of $\alpha = 0.05$."
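As a cross-check on the critical-value comparison above, the same two-tail decision can be reached through the p-value. Below is a minimal sketch (not part of the original notebook) reusing the numbers from the IQ example:

```python
import scipy.stats as stats

# Two-tail p-value for the IQ example: |z| = 2.6667
z = abs(96 - 100) / (15 / (100 ** 0.5))
p_value = 2 * (1 - stats.norm.cdf(z))  # area in both tails beyond |z|
print(round(p_value, 4))  # ~0.0077, below alpha = 0.05, so we again reject H0
```

Both routes agree: rejecting when |z| exceeds the critical value is equivalent to rejecting when the two-tail p-value falls below the significance level.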
139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "id": "0d572e80-ee20-4009-90d3-535f3c1b84e3", 144 | "metadata": {}, 145 | "source": [ 146 | "## Two-Sample Z-Test\n", 147 | "\n", 148 | "A two-sample z-test is similar to a one-sample z-test, with the main differences being:\n", 149 | "\n", 150 | "- There are two groups/populations under consideration, and we draw one sample from each population.\n", 151 | "- Both population distributions are assumed to be normal.\n", 152 | "- Both population standard deviations are known.\n", 153 | "- The formula for calculating the test statistic is:\n", 154 | "\n", 155 | "$$z = \frac{\overline{x}_1 - \overline{x}_2} {\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$\n", 156 | "\n", 157 | "An organization manufactures LED bulbs in two production units, A and B. The quality control team believes that the quality of production at unit A is better than that of B. Quality is measured by how long a bulb works. The team takes samples from both units to test this. The mean lives of LED bulbs at units A and B are 1001.34 and 810.47, respectively. The sample sizes are 40 and 44. The population variances are known: $\sigma_A^2 = 48127$ and $\sigma_B^2 = 59173$.\n", 158 | "\n", 159 | "Conduct the appropriate test, at a 5% significance level, to verify the claim of the quality control team.\n", 160 | "\n", 161 | "**Null hypothesis:** $H_0: \mu_A ≤ \mu_B$ \n", 162 | "**Alternate hypothesis:** $H_1: \mu_A > \mu_B$\n", 163 | "\n", 164 | "Let's fix the level of significance at $\alpha = 0.05$." 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 5, 170 | "id": "3373712a", 171 | "metadata": {}, 172 | "outputs": [ 173 | { 174 | "name": "stdout", 175 | "output_type": "stream", 176 | "text": [ 177 | "3.781260568723408\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "z = (1001.34-810.47)/(48127/40+59173/44)**0.5\n", 183 | "print(z)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 6, 189 | "id": "451853c0", 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "7.801812433294586e-05" 196 | ] 197 | }, 198 | "execution_count": 6, 199 | "metadata": {}, 200 | "output_type": "execute_result" 201 | } 202 | ], 203 | "source": [ 204 | "import scipy.stats as stats\n", 205 | "p_value = 1 - stats.norm.cdf(z)\n", 206 | "p_value" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "id": "91602877-326c-4234-b084-e921956a04f1", 212 | "metadata": {}, 213 | "source": [ 214 | "Since the p-value (0.000078) is less than $\alpha$ (0.05), we reject the null hypothesis. The LED bulbs produced at unit A have a significantly longer life than those at unit B, at a 5% level." 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "id": "6a69211a-90c3-4156-bc49-a1ae0479609f", 220 | "metadata": {}, 221 | "source": [ 222 | "## Hypothesis Tests with Proportions\n", 223 | "\n", 224 | "Proportion tests are utilized with nominal data and are effective for comparing percentages or proportions. For instance, a survey collecting responses from a department in an organization might claim that 85% of people in the organization are satisfied with its policies. Historically, the satisfaction rate has been 82%. Here, we compare a percentage or proportion taken from the sample with a percentage/proportion from the population.
The following are some characteristics of the sampling distribution of proportions:\n", 225 | "\n", 226 | "- The sampling distribution of the proportions taken from the sample is approximately normal.\n", 227 | "- The mean of this sampling distribution ($\\overline{p}$) equals the population proportion ($p$).\n", 228 | "- Calculating the test statistic: The following equation gives the $z$-value:\n", 229 | "\n", 230 | "$$ z = \\frac{\\overline{p} - p}{\\sqrt{\\frac{p(1-p)}{n}}} $$\n", 231 | "\n", 232 | "Where $\\overline{p}$ is the sample proportion, $p$ is the population proportion, and $n$ is the sample size." 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "id": "df341e54-8cbf-40ad-8317-5b8a59f2e79e", 238 | "metadata": {}, 239 | "source": [ 240 | "## One-Sample Proportion Z-Test\n", 241 | "\n", 242 | "It is known that 40% of the total customers are satisfied with the services provided by a mobile service center. The customer service department of this center decides to conduct a survey for assessing the current customer satisfaction rate. It surveys 100 of its customers and finds that only 30 out of the 100 customers are satisfied with its services. Conduct a hypothesis test at a 5% significance level to determine if the percentage of satisfied customers has reduced from the initial satisfaction level (40%).\n", 243 | "\n", 244 | "**Null Hypothesis:** $H_0: p = 0.4$ \n", 245 | "**Alternate Hypothesis:** $H_1: p < 0.4$\n", 246 | "\n", 247 | "The < sign indicates a lower-tail test.\n", 248 | "\n", 249 | "Let's fix the level of significance at $\\alpha = 0.05$." 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 7, 255 | "id": "088856d4", 256 | "metadata": {}, 257 | "outputs": [ 258 | { 259 | "data": { 260 | "text/plain": [ 261 | "-2.041241452319316" 262 | ] 263 | }, 264 | "execution_count": 7, 265 | "metadata": {}, 266 | "output_type": "execute_result" 267 | } 268 | ], 269 | "source": [ 270 | "z=(0.3-0.4)/((0.4)*(1-0.4)/100)**0.5\n", 271 | "z" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": 8, 277 | "id": "f369ef9c", 278 | "metadata": {}, 279 | "outputs": [ 280 | { 281 | "data": { 282 | "text/plain": [ 283 | "0.02061341666858179" 284 | ] 285 | }, 286 | "execution_count": 8, 287 | "metadata": {}, 288 | "output_type": "execute_result" 289 | } 290 | ], 291 | "source": [ 292 | "import scipy.stats as stats\n", 293 | "\n", 294 | "p=stats.norm.cdf(z)\n", 295 | "p" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "id": "7eb16578-2f61-4d3f-b7cc-88ee9bfd0435", 301 | "metadata": {}, 302 | "source": [ 303 | "p-value (0.02) < 0.05. We reject the null hypothesis. At a 5% significance level, the percentage of customers satisfied with the service center’s services has reduced." 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "id": "f627ba1c-9a49-4b8d-8cd4-09a4c07be810", 309 | "metadata": {}, 310 | "source": [ 311 | "## Two-Sample Proportion Z-Test\n", 312 | "\n", 313 | "Here, we compare proportions taken from two independent samples belonging to two different populations. The following equation gives the formula for the critical test statistic:\n", 314 | "\n", 315 | "$$ z = \\frac {(\\overline{p}_1 - \\overline{p}_2)}{\\sqrt{\\frac{p_c(1-p_c)}{N_1} + \\frac{p_c(1-p_c)}{N_2}}}$$\n", 316 | "\n", 317 | "In the preceding formula, $\\overline{p}_1$ is the proportion from the first sample, and $\\overline{p}_2$ is the proportion from the second sample. 
$N_1$ is the sample size of the first sample, and $N_2$ is the sample size of the second sample. $p_c$ is the pooled proportion.\n", 318 | "\n", 319 | "$$\overline{p}_1 = \frac{x_1}{N_1} ; \overline{p}_2 = \frac {x_2}{N_2} ; p_c = \frac {x_1 + x_2}{N_1 + N_2}$$\n", 320 | "\n", 321 | "In the preceding formula, $x_1$ is the number of successes in the first sample, and $x_2$ is the number of successes in the second sample." 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "id": "59ed178e-be77-4330-9d99-eb33778a0c59", 327 | "metadata": {}, 328 | "source": [ 329 | "## Investigation of Passenger Compliance with Child Safety Guidelines\n", 330 | "\n", 331 | "A ride-sharing company is investigating complaints by its drivers regarding passenger compliance with child safety guidelines, specifically concerning the use of child seats and seat belts. Surveys were independently conducted in two major cities, A and B, to gather data on passenger compliance. The company aims to determine if there is a difference in the proportion of passengers conforming to child safety guidelines between the two cities. The data for the two cities is summarized in the following table:\n", 332 | "\n", 333 | "| | City A | City B |\n", 334 | "|-----------------|---------|--------|\n", 335 | "| Total surveyed | 200 | 230 |\n", 336 | "| No. compliant | 110 | 106 |\n", 337 | "\n", 338 | "The company seeks to evaluate if the proportion of compliant passengers differs significantly between City A and City B." 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "id": "f3d2af18-91a7-407c-8d46-8d28ae426ce1", 344 | "metadata": {}, 345 | "source": [ 346 | "## Hypotheses for Two-Sample Proportion Test\n", 347 | "\n", 348 | "For the two-sample proportion test comparing compliance rates between City A and City B:\n", 349 | "\n", 350 | "- Null hypothesis: $H_0: p_A = p_B$\n", 351 | "- Alternative hypothesis: $H_1: p_A \neq p_B$\n", 352 | "\n", 353 | "This constitutes a two-tail test because the region of rejection could be located on either side.\n", 354 | "\n", 355 | "The significance level $\alpha$ is set at 0.05, resulting in an area of 0.025 on both sides."
356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 9, 361 | "id": "a5054442", 362 | "metadata": {}, 363 | "outputs": [ 364 | { 365 | "data": { 366 | "text/plain": [ 367 | "1.8437643201697864" 368 | ] 369 | }, 370 | "execution_count": 9, 371 | "metadata": {}, 372 | "output_type": "execute_result" 373 | } 374 | ], 375 | "source": [ 376 | "x1,n1,x2,n2=110,200,106,230\n", 377 | "p1=x1/n1\n", 378 | "p2=x2/n2\n", 379 | "pc=(x1+x2)/(n1+n2)\n", 380 | "z_statistic=(p1-p2)/(((pc*(1-pc)/n1)+(pc*(1-pc)/n2))**0.5)\n", 381 | "z_statistic" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 10, 387 | "id": "03472547", 388 | "metadata": {}, 389 | "outputs": [ 390 | { 391 | "data": { 392 | "text/plain": [ 393 | "1.959963984540054" 394 | ] 395 | }, 396 | "execution_count": 10, 397 | "metadata": {}, 398 | "output_type": "execute_result" 399 | } 400 | ], 401 | "source": [ 402 | "critical = stats.norm.ppf(1-0.025)\n", 403 | "critical" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 11, 409 | "id": "18ac0c29", 410 | "metadata": {}, 411 | "outputs": [ 412 | { 413 | "data": { 414 | "text/plain": [ 415 | "0.0653" 416 | ] 417 | }, 418 | "execution_count": 11, 419 | "metadata": {}, 420 | "output_type": "execute_result" 421 | } 422 | ], 423 | "source": [ 424 | "p_value = 2*(1-stats.norm.cdf(z_statistic)) # two-tail p-value; use z_statistic, not a stale z from an earlier cell\n", 425 | "round(p_value, 4)" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "id": "615b4141", 431 | "metadata": {}, 432 | "source": [ 433 | "## Conclusion of Two-Sample Proportion Test\r\n", 434 | "\r\n", 435 | "Based on the statistical analysis:\r\n", 436 | "\r\n", 437 | "- Since the test statistic (1.8438) is less than the critical value (1.96), and, equivalently, the p-value (0.0653) is greater than 0.05, we fail to reject the null hypothesis.\r\n", 438 | "- Therefore, there is no significant difference between the proportion of passengers in these cities complying with child safety norms, at a 5% significance level.\r\n" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": null, 444 | "id": "f1ca0ba9", 445 | "metadata": {}, 446 | "outputs": [], 447 | "source": [] 448 | } 449 | ], 450 | "metadata": { 451 | "kernelspec": { 452 | "display_name": "Python 3 (ipykernel)", 453 | "language": "python", 454 | "name": "python3" 455 | }, 456 | "language_info": { 457 | "codemirror_mode": { 458 | "name": "ipython", 459 | "version": 3 460 | }, 461 | "file_extension": ".py", 462 | "mimetype": "text/x-python", 463 | "name": "python", 464 | "nbconvert_exporter": "python", 465 | "pygments_lexer": "ipython3", 466 | "version": "3.11.7" 467 | } 468 | }, 469 | "nbformat": 4, 470 | "nbformat_minor": 5 471 | } 472 | -------------------------------------------------------------------------------- /13. t-test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "04c5cfaf-d5f0-4375-a8ff-4f1527b3a04e", 6 | "metadata": {}, 7 | "source": [ 8 | "# T-Test\n", 9 | "\n", 10 | "In cases where the standard deviation of the population is not known and the sample size is small, the T-distribution is used.
This distribution is also known as the \"Student's T distribution\".\n", 11 | "\n", 12 | "The following are the key features of the T-distribution:\n", 13 | "\n", 14 | "+ It has a shape that is similar to a normal distribution but is slightly flatter, with heavier tails.\n", 15 | "+ The sample size is typically small, usually less than 30.\n", 16 | "+ The T-distribution takes into account the concept of degrees of freedom. These are the number of values in a statistical calculation that are free to vary independently. For example, if we have three numbers $x$, $y$, and $z$ and know that the mean is 5, we can conclude that the sum of the numbers must be $5 \times 3 = 15$. We have the freedom to choose any value for $x$ and $y$, but not $z$. $z$ must be chosen so that the numbers add up to 15 and the mean remains at 5. Despite having three numbers, we only have the freedom to choose two of them, meaning we have two degrees of freedom.\n", 17 | "+ As the sample size decreases, the degrees of freedom decrease, and the population parameter can be predicted with less certainty from the sample parameter. The degrees of freedom (df) in the T-distribution is equal to the number of samples minus 1, or $df = n - 1$.\n", 18 | "\n", 19 | "
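To make the heavier-tails point concrete, here is a small sketch (not part of the original notebook) comparing two-tail 5% critical values of the t-distribution with that of the standard normal; as the degrees of freedom grow, the t critical value shrinks toward 1.96:

```python
from scipy.stats import norm, t

# Two-tail 5% critical values: the t-distribution needs a larger cutoff
# than the normal at small df, and converges to it as df increases.
for df in [2, 5, 10, 30, 100]:
    print(f"df = {df:>3}: t critical = {t.ppf(0.975, df):.4f}")
print(f"normal:     z critical = {norm.ppf(0.975):.4f}")
```

With 2 degrees of freedom the cutoff is above 4.3; by 100 degrees of freedom it is within about 0.03 of the normal value.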
\n", 20 | "\n", 21 | "The formula for the critical test statistic in a one-sample t-test is given by the following equation: \n", 22 | "\n", 23 | "$$t = \\frac{\\overline{x} - \\mu}{\\frac{s}{\\sqrt{n}}}$$\n", 24 | "\n", 25 | "where $\\overline{x}$ is the sample mean, $\\mu$ is the population mean, $s$ is the sample standard deviation, and $n$ is the sample size.\n", 26 | "\n", 27 | "## One-Sample T-Test\n", 28 | "\n", 29 | "A one-sample t-test is similar to a one-sample z-test, with the following differences:\n", 30 | "\n", 31 | "1. The size of the sample is small ($< 30$).\n", 32 | "2. The population standard deviation is not known; we use the sample standard deviation ($s$) to calculate the standard error.\n", 33 | "3. The critical statistic here is the t-statistic, given by the following formula:\n", 34 | "\n", 35 | "$$t = \\frac{\\overline{x} - \\mu}{\\frac{s}{\\sqrt{n}}}$$\n" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "id": "2ce587eb-9dda-4779-a004-bea595c98abf", 41 | "metadata": {}, 42 | "source": [ 43 | "A coaching institute, preparing students for an exam, has 200 students, and the average score of the students in the practice tests is 80. It takes a sample of nine students and records their scores; it seems that the average score has now increased. These are the scores of these nine students: 80, 87, 80, 75, 79, 78, 89, 84, 88. Conduct a hypothesis test at a 5% significance level to verify if there is a significant increase in the average score.\n", 44 | "\n", 45 | "## Hypotheses\n", 46 | "\n", 47 | "- Null hypothesis ($H_0$): $\\mu = 80$\n", 48 | "- Alternative hypothesis ($H_1$): $\\mu > 80$\n" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 1, 54 | "id": "92b6154c", 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "TtestResult(statistic=1.348399724926488, pvalue=0.21445866072113726, df=8)" 61 | ] 62 | }, 63 | "execution_count": 1, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "import numpy as np\n", 70 | "import scipy.stats as stats\n", 71 | "\n", 72 | "sample = np.array([80,87,80,75,79,78,89,84,88])\n", 73 | "\n", 74 | "stats.ttest_1samp(sample,80)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "id": "3602c604", 80 | "metadata": {}, 81 | "source": [ 82 | "Since the p-value is greater than 0.05, we fail to reject the null hypothesis. Hence, we cannot conclude that the average score of students has changed." 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "id": "acf202cc", 88 | "metadata": {}, 89 | "source": [ 90 | "## Two-sample t-test \n", 91 | "\n", 92 | "A two-sample t-test is used when we take samples from two populations, where both the sample sizes are less than 30, and both the population standard deviations are unknown. Formula:\n", 93 | "\n", 94 | "$$t = \\frac{\\overline x_1 - \\overline x_2}{\\sqrt{S_p^2(\\frac{1}{n_1}+\\frac{1}{n_2})}}$$\n", 95 | "\n", 96 | "Where $x_1$ and $x_2$ are the sample means \n", 97 | "\n", 98 | "The degrees of freedom: $df=n_1 + n_2 − 2$ \n", 99 | "\n", 100 | "The pooled variance $S_p^2 = \\frac{(n_1 -1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$ \n", 101 | "\n", 102 | "A coaching institute has centers in two different cities. 
It takes a sample of nine students from each center and records their\n", 103 | "scores, which are as follows: \n", 104 | "\n", 105 | "|Center A:| 80, 87, 80, 75, 79, 78, 89, 84, 88|\n", 106 | "|---------|-----------------------------------|\n", 107 | "|Center B:| 81, 74, 70, 73, 76, 73, 81, 82, 84| \n", 108 | " \n", 109 | "Conduct a hypothesis test at a 5% significance level, and verify if there is a significant difference in the average scores of the\n", 110 | "students in these two centers.\n", 111 | "\n", 112 | "$H_0:\mu_1 = \mu_2$ \n", 113 | "$H_1:\mu_1 \neq \mu_2$" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 2, 119 | "id": "7b4b5010", 120 | "metadata": {}, 121 | "outputs": [ 122 | { 123 | "data": { 124 | "text/plain": [ 125 | "TtestResult(statistic=2.1892354788555664, pvalue=0.04374951024120649, df=16.0)" 126 | ] 127 | }, 128 | "execution_count": 2, 129 | "metadata": {}, 130 | "output_type": "execute_result" 131 | } 132 | ], 133 | "source": [ 134 | "a = np.array([80,87,80,75,79,78,89,84,88])\n", 135 | "b = np.array([81,74,70,73,76,73,81,82,84])\n", 136 | "\n", 137 | "stats.ttest_ind(a,b)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "id": "88a502b7", 143 | "metadata": {}, 144 | "source": [ 145 | "We can conclude that there is a significant difference in the average scores of students in the two centers of the coaching\n", 146 | "institute, since the p-value is less than 0.05." 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "id": "f498ce2d", 152 | "metadata": {}, 153 | "source": [ 154 | "## Two-sample t-test for paired samples \n", 155 | "\n", 156 | "This test is used to compare population means from samples that are dependent on each other, that is, sample values are measured twice using the same test group.\n", 157 | "\n", 158 | "+ A measurement taken at two different times (e.g., pre-test and post-test score with an intervention administered between the two time points)\n", 159 | "+ A measurement taken under two different conditions (e.g., completing a test under a \"control\" condition and an \"experimental\" condition)\n", 160 | "\n", 161 | "This equation gives the critical value of the test statistic for a paired two-sample t-test:\n", 162 | "\n", 163 | "$$t = \frac{\overline d}{s/\sqrt{n}}$$\n", 164 | "\n", 165 | "Where $\overline d$ is the average of the difference between the elements of the two samples. Both\n", 166 | "the samples have the same size, $n$. \n", 167 | "\n", 168 | "Standard deviation of the differences between the elements of the two samples, S = $\sqrt{\frac{\sum d^2 -((\sum d)^2/ n)}{n -1}}$\n", 169 | "\n", 170 | "The coaching institute is conducting a special program to improve the performance of the students. The scores of the same set of students are compared before and after the special program. Conduct a hypothesis test at a 5% significance level to verify if the scores have improved because of this program. Here, the hypotheses are $H_0: \mu_d = 0$ and $H_1: \mu_d < 0$, where $d$ denotes the before-minus-after difference in scores."
171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 3, 176 | "id": "9055c5e8", 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "data": { 181 | "text/plain": [ 182 | "TtestResult(statistic=-2.4473735525455615, pvalue=0.040100656419513776, df=8)" 183 | ] 184 | }, 185 | "execution_count": 3, 186 | "metadata": {}, 187 | "output_type": "execute_result" 188 | } 189 | ], 190 | "source": [ 191 | "a = np.array([80,87,80,75,79,78,89,84,88])\n", 192 | "b = np.array([81,89,83,81,79,82,90,82,90])\n", 193 | "\n", 194 | "stats.ttest_rel(a,b)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "id": "aa94fab0", 200 | "metadata": {}, 201 | "source": [ 202 | "We can conclude, at a 5% significance level, that the average score has improved after the\n", 203 | "special program was conducted since the p-value is less than 0.05" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "id": "95a06911", 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [] 213 | } 214 | ], 215 | "metadata": { 216 | "kernelspec": { 217 | "display_name": "Python 3 (ipykernel)", 218 | "language": "python", 219 | "name": "python3" 220 | }, 221 | "language_info": { 222 | "codemirror_mode": { 223 | "name": "ipython", 224 | "version": 3 225 | }, 226 | "file_extension": ".py", 227 | "mimetype": "text/x-python", 228 | "name": "python", 229 | "nbconvert_exporter": "python", 230 | "pygments_lexer": "ipython3", 231 | "version": "3.11.7" 232 | } 233 | }, 234 | "nbformat": 4, 235 | "nbformat_minor": 5 236 | } 237 | -------------------------------------------------------------------------------- /14. ANOVA - Analysis of Variance.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "0cba2aff", 6 | "metadata": {}, 7 | "source": [ 8 | "# ANOVA" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "id": "64083e01-6eaa-4477-a3a2-48b34d0579f1", 14 | "metadata": {}, 15 | "source": [ 16 | "# Analysis of Variance (ANOVA)\n", 17 | "\n", 18 | "ANOVA (Analysis of Variance) is a statistical method used for comparing the means of multiple populations. Previously, we have considered only a single population or at most two populations. A one-way ANOVA uses one independent variable, while a two-way ANOVA uses two independent variables. The statistical distribution used in ANOVA is the F-distribution, whose characteristics are as follows:\n", 19 | "\n", 20 | "1. The F-distribution has a single tail (toward the right) and contains only positive values.\n", 21 | "\n", 22 | "
\n", 23 | "\n", 24 | "2. The F-statistic, which is the critical statistic in ANOVA, is the ratio of variation between the sample means to the variation within the samples. The formula is as follows:\n", 25 | " $$F = \\frac{\\text{variation between sample means}}{\\text{variation within the samples}}$$\n", 26 | "\n", 27 | "3. The different populations are referred to as treatments.\n", 28 | "4. A high value of the F-statistic implies that the variation between samples is considerable compared to the variation within the samples. In other words, the populations or treatments from which the samples are drawn are actually different from one another.\n", 29 | "5. Random variations between treatments are more likely to occur when the variation within the sample is considerable.\n", 30 | "\n", 31 | "Use a one-way ANOVA when you have collected data about one categorical independent variable and one quantitative dependent variable. The independent variable should have at least three levels (i.e., at least three different groups or categories).\n", 32 | "\n", 33 | "ANOVA tells you if the dependent variable changes according to the level of the independent variable. For example:\n", 34 | "\n", 35 | "+ Your independent variable is social media use, and you assign groups to low, medium, and high levels of social media use to find out if there is a difference in hours of sleep per night.\n", 36 | "+ Your independent variable is the brand of soda, and you collect data on Coke, Pepsi, Sprite, and Fanta to find out if there is a difference in the price per 100ml.\n", 37 | "\n", 38 | "ANOVA determines whether the groups created by the levels of the independent variable are statistically different by calculating whether the means of the treatment levels are different from the overall mean of the dependent variable. If any of the group means is significantly different from the overall mean, then the null hypothesis is rejected.\n", 39 | "\n", 40 | "ANOVA uses the F-test for statistical significance. This allows for the comparison of multiple means at once, as the error is calculated for the whole set of comparisons rather than for each individual two-way comparison (which would happen with a t-test).\n", 41 | "\n", 42 | "The F-test compares the variance in each group mean from the overall group variance. If the variance within groups is smaller than the variance between groups, the F-test will find a higher F-value, and therefore a higher likelihood that the difference observed is real and not due to chance.\n", 43 | "\n", 44 | "The assumptions of the ANOVA test are the same as the general assumptions for any parametric test:\n", 45 | "\n", 46 | "+ **Independence of observations:** The data were collected using statistically valid methods, and there are no hidden relationships among observations. If your data fail to meet this assumption because you have a confounding variable that you need to control for statistically, use an ANOVA with blocking variables.\n", 47 | "+ **Normally distributed response variable:** The values of the dependent variable follow a normal distribution.\n", 48 | "+ **Homogeneity of variance:** The variation within each group being compared is similar for every group. 
If the variances are different among the groups, then ANOVA probably isn’t the right fit for the data.\n" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "6c2285ee", 54 | "metadata": {}, 55 | "source": [ 56 | "## One-Way-ANOVA\n", 57 | "\n", 58 | "A few agricultural research scientists have planted a new variety of cotton called “AB\n", 59 | "cotton.” They have used three different fertilizers – A, B, and C – for three separate\n", 60 | "plots of this variety. The researchers want to find out if the yield varies with the type of\n", 61 | "fertilizer used. Yields in bushels per acre are mentioned in the below table. Conduct an\n", 62 | "ANOVA test at a 5% level of significance to see if the researchers can conclude that there\n", 63 | "is a difference in yields.\n", 64 | "\n", 65 | "| Fertilizer A | Fertilizer B | Fertilizer C |\n", 66 | "|--------------|--------------|--------------|\n", 67 | "| 40 | 45 | 55 |\n", 68 | "| 30 | 35 | 40 |\n", 69 | "| 35 | 55 | 30 |\n", 70 | "| 45 | 25 | 20 |\n", 71 | "\n", 72 | "Null hypothesis: $H_0 : \mu_1 = \mu_2 = \mu_3$ \n", 73 | "Alternative hypothesis: $H_1$: not all of $\mu_1, \mu_2, \mu_3$ are equal\n", 74 | "\n", 75 | "The level of significance: $\alpha = 0.05$" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 1, 81 | "id": "15dcbdf0", 82 | "metadata": {}, 83 | "outputs": [ 84 | { 85 | "data": { 86 | "text/plain": [ 87 | "F_onewayResult(statistic=0.10144927536231883, pvalue=0.9045455407589628)" 88 | ] 89 | }, 90 | "execution_count": 1, 91 | "metadata": {}, 92 | "output_type": "execute_result" 93 | } 94 | ], 95 | "source": [ 96 | "import scipy.stats as stats\n", 97 | "\n", 98 | "a=[40,30,35,45]\n", 99 | "b=[45,35,55,25]\n", 100 | "c=[55,40,30,20]\n", 101 | "\n", 102 | "stats.f_oneway(a,b,c)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "id": "908afb47", 108 | "metadata": {}, 109 | "source": [ 110 | "Since the calculated p-value (0.904) > 0.05, we fail to reject the null hypothesis. There is no significant difference between the three treatments, at a 5% significance level." 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "id": "cb92bfa2", 116 | "metadata": {}, 117 | "source": [ 118 | "## Two-way-ANOVA \n", 119 | "\n", 120 | "A botanist wants to know whether or not plant growth is influenced by sunlight exposure and watering frequency. She plants 30 seeds and lets them grow for two months under different conditions for sunlight exposure and watering frequency. After two months, she records the height of each plant, in inches." 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 2, 126 | "id": "872f28ac", 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "import numpy as np\n", 131 | "import pandas as pd\n", 132 | "\n", 133 | "#create data\n", 134 | "df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),\n", 135 | "                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),\n", 136 | "                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,\n", 137 | "                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,\n", 138 | "                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 3, 144 | "id": "5e132098", 145 | "metadata": {}, 146 | "outputs": [ 147 | { 148 | "data": { 149 | "text/html": [ 150 | "
\n", 151 | "\n", 164 | "\n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | "
watersunheight
0dailylow6
1dailylow6
2dailylow6
3dailylow5
4dailylow6
5dailymed5
6dailymed5
7dailymed6
8dailymed4
9dailymed5
\n", 236 | "
" 237 | ], 238 | "text/plain": [ 239 | " water sun height\n", 240 | "0 daily low 6\n", 241 | "1 daily low 6\n", 242 | "2 daily low 6\n", 243 | "3 daily low 5\n", 244 | "4 daily low 6\n", 245 | "5 daily med 5\n", 246 | "6 daily med 5\n", 247 | "7 daily med 6\n", 248 | "8 daily med 4\n", 249 | "9 daily med 5" 250 | ] 251 | }, 252 | "execution_count": 3, 253 | "metadata": {}, 254 | "output_type": "execute_result" 255 | } 256 | ], 257 | "source": [ 258 | "df[:10]" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 4, 264 | "id": "ef799971", 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "data": { 269 | "text/html": [ 270 | "
\n", 271 | "\n", 284 | "\n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | "
sum_sqdfFPR(>F)
C(water)8.5333331.016.00000.000527
C(sun)24.8666672.023.31250.000002
C(water):C(sun)2.4666672.02.31250.120667
Residual12.80000024.0NaNNaN
\n", 325 | "
" 326 | ], 327 | "text/plain": [ 328 | " sum_sq df F PR(>F)\n", 329 | "C(water) 8.533333 1.0 16.0000 0.000527\n", 330 | "C(sun) 24.866667 2.0 23.3125 0.000002\n", 331 | "C(water):C(sun) 2.466667 2.0 2.3125 0.120667\n", 332 | "Residual 12.800000 24.0 NaN NaN" 333 | ] 334 | }, 335 | "execution_count": 4, 336 | "metadata": {}, 337 | "output_type": "execute_result" 338 | } 339 | ], 340 | "source": [ 341 | "import statsmodels.api as sm\n", 342 | "from statsmodels.formula.api import ols\n", 343 | "\n", 344 | "#perform two-way ANOVA\n", 345 | "model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()\n", 346 | "sm.stats.anova_lm(model, typ=2)" 347 | ] 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "id": "15f1611b-ebbb-426f-8b17-1437a5def208", 352 | "metadata": {}, 353 | "source": [ 354 | "# Analysis of Variance (ANOVA) Results\n", 355 | "\n", 356 | "We can see the following p-values for each of the factors in the table:\n", 357 | "\n", 358 | "- **Water:** p-value = 0.000527 \n", 359 | "- **Sun:** p-value = 0.0000002 \n", 360 | "- **Water * Sun:** p-value = 0.120667 \n", 361 | "\n", 362 | "Since the p-values for water and sun are both less than 0.05, this means that both factors have a statistically significant effect on plant height.\n", 363 | "\n", 364 | "And since the p-value for the interaction effect (0.120667) is not less than 0.05, this tells us that there is no significant interaction effect between sunlight exposure and watering frequency." 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "id": "ab33cbc5", 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [] 374 | } 375 | ], 376 | "metadata": { 377 | "kernelspec": { 378 | "display_name": "Python 3 (ipykernel)", 379 | "language": "python", 380 | "name": "python3" 381 | }, 382 | "language_info": { 383 | "codemirror_mode": { 384 | "name": "ipython", 385 | "version": 3 386 | }, 387 | "file_extension": ".py", 388 | "mimetype": "text/x-python", 389 | "name": "python", 390 | "nbconvert_exporter": "python", 391 | "pygments_lexer": "ipython3", 392 | "version": "3.11.7" 393 | } 394 | }, 395 | "nbformat": 4, 396 | "nbformat_minor": 5 397 | } 398 | -------------------------------------------------------------------------------- /15. Chi-Square Test for Independence and Goodness of Fit.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "512274d1-097e-4da4-b7a6-44af0d3230a5", 6 | "metadata": {}, 7 | "source": [ 8 | "# Chi-square test for Independence\n", 9 | "\n", 10 | "The chi-square test is a nonparametric test for testing the association between two variables. A non-parametric test is one that does not make any assumption about the distribution of the population from which the sample is drawn.\n", 11 | "\n", 12 | "The following are some of the characteristics of the chi-square test.\n", 13 | "+ The chi-square test of association is used to test if the frequency of occurrence of one categorical variable is significantly associated with that of another categorical variable.\n", 14 | "\n", 15 | " The chi-square test statistic is given by: \n", 16 | "\n", 17 | " $$\\chi^2 = \\sum\\frac {(f_o -f_e)^2}{f_e}$$\n", 18 | "\n", 19 | " where, $f_o$ denotes the observed frequencies, $f_e$ denotes the expected frequencies, and $\\chi$ is the test statistic. 
\n", 20 | " Using the chi-square test of association, we can assess if the differences between the frequencies are statistically significant.\n", 21 | "\n", 22 | "+ A contingency table is a table with frequencies of the variable listed under separate columns. The formula for the degrees of freedom in the chi-square test is given by: *df=(r-1)(c-1)*, where *df* is the number of degrees of freedom, r is the number of rows in the contingency table, and c is the number of columns in the contingency table.\n", 23 | "\n", 24 | "\n", 25 | "\n", 26 | "+ The chi-square test compares the observed values of a set of variables with their expected values. It determines if the differences between the observed values and expected values are due to random chance (like a sampling error), or if these differences are statistically significant. If there are only small differences between the observed and expected values, it may be due to an error in sampling. If there are substantial differences between the two, it may indicate an association between the variables.\n", 27 | "\n", 28 | "
\n", 29 | "\n", 30 | "+ The shape of the chi-square distribution for different values of k (degrees of freedom) When the degrees of freedom are few, it looks like an F-distribution. It has only one tail (toward the right). As the degrees of freedom increase, it looks like a normal curve. Also, the increase in the degrees of freedom indicates that the difference between the observed values and expected values could be meaningful and not just due to a sampling error. " 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "id": "e5d38703", 36 | "metadata": {}, 37 | "source": [ 38 | "**Example:**\n", 39 | "\n", 40 | "Suppose there is a city of 1,000,000 residents with four neighborhoods: A, B, C, and D. A random sample of 650 residents of the city is taken and their occupation is recorded as \"white collar\", \"blue collar\", or \"no collar\". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification.\n", 41 | "The data are tabulated as:\n", 42 | "\n", 43 | "| OBSERVED | A | B | C | D | Row Total |\n", 44 | "|:------------:|-----|-----|-----|-----|-----------|\n", 45 | "| White Collar | 90 | 60 | 104 | 95 | 349 |\n", 46 | "| Blue Collar | 30 | 50 | 51 | 20 | 151 |\n", 47 | "| No Collar | 30 | 40 | 45 | 35 | 150 |\n", 48 | "| Column Total | 150 | 150 | 200 | 150 | 650 |\n", 49 | "\n", 50 | "\n", 51 | "+ **Null hypothesis:** $H_0$: Occupation and Neighbourhood of Residence are not related. \n", 52 | "\n", 53 | "+ **Alternative hypothesis**: $H_1$: Occupation and Neighbourhood of Residence are related. \n", 54 | "\n", 55 | "+ **Number of variables:** Two categorical variables (Occupation and Neighbourhood)\n", 56 | "\n", 57 | "+ What we are testing: Testing for an association between Occupation and Neighbourhood.\n", 58 | "\n", 59 | "+ We conduct a chi-square test of association based on the preceding characteristics.\n", 60 | "\n", 61 | "+ Fix the level of significance: α=0.05\n", 62 | "\n", 63 | "Make an **expected** value table from the totals\n", 64 | "\n", 65 | "For each entry calcuate : $$\\frac{(row\\ total * column\\ total)}{overall\\ total}$$\n", 66 | "\n", 67 | "Example: For A neighbourhood 150 * (349/650) must be the expected White collar Job.\n", 68 | "\n", 69 | "| EXPECTED | A | B | C | D |\n", 70 | "|:------------:|-------|-------|--------|-------|\n", 71 | "| White Collar | 80.54 | 80.54 | 107.38 | 80.54 |\n", 72 | "| Blue Collar | 34.85 | 34.85 | 46.46 | 34.85 |\n", 73 | "| No Collar | 34.62 | 34.62 | 46.15 | 34.62 |\n", 74 | "\n", 75 | "Each of the value in the Expected Value table is 5 or higher. 
may proceed with the chi-square test.\n", 76 | "\n", 77 | "Calculate: $$\chi^2 = \sum\frac {(f_o -f_e)^2}{f_e}$$\n", 78 | "\n", 79 | "$$\chi^2\ statistic\ \approx\ 24.6$$\n", 80 | "\n", 81 | "Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom is\n", 82 | "\n", 83 | "*dof = (number of rows-1)(number of columns-1) = (3-1)(4-1) = 6*\n", 84 | "\n", 85 | "From the chi-square distribution table, the p-value is less than 0.0005." 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 1, 91 | "id": "18ff36a2", 92 | "metadata": {}, 93 | "outputs": [ 94 | { 95 | "name": "stdout", 96 | "output_type": "stream", 97 | "text": [ 98 | "Chi-Square Statistic: 24.5712028585826\n", 99 | "p-value: 0.0004098425861096696\n", 100 | "degrees of freedom: 6\n", 101 | "Expected Value: \n", 102 | " [[ 80.53846154 80.53846154 107.38461538 80.53846154]\n", 103 | " [ 34.84615385 34.84615385 46.46153846 34.84615385]\n", 104 | " [ 34.61538462 34.61538462 46.15384615 34.61538462]]\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "import scipy.stats as stats\n", 110 | "import numpy as np\n", 111 | "\n", 112 | "observations = np.array([[90,60,104,95],[30,50,51,20],[30,40,45,35]])\n", 113 | "chi2stat, pval, dof, expvalue = stats.chi2_contingency(observations)\n", 114 | "\n", 115 | "print('Chi-Square Statistic: ', chi2stat)\n", 116 | "print('p-value: ', pval)\n", 117 | "print('degrees of freedom: ', dof)\n", 118 | "print('Expected Value: \\n', expvalue)\n" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "id": "0ef31bfe", 124 | "metadata": {}, 125 | "source": [ 126 | "The p-value turns out to be 0.0004 < 0.05. Therefore, we reject the null hypothesis.\n", 127 | "There is a significant association between the Occupation and Neighbourhood of Residence, at a 5%\n", 128 | "significance level." 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "id": "c0dd9fe9", 134 | "metadata": {}, 135 | "source": [ 136 | "### **Chi-Square Goodness of Fit Test:** \n", 137 | "\n", 138 | "A Chi-Square goodness of fit test can be used in a wide variety of settings. Here are a few examples:\n", 139 | "\n", 140 | "+ We want to know if a die is fair, so we roll it 50 times and record the number of times it lands on each number.\n", 141 | "+ We want to know if an equal number of people come into a shop each day of the week, so we count the number of people who come in each day during a random week.\n", 142 | "\n", 143 | "It is performed in a similar way.\n", 144 | "\n", 145 | "A shop owner claims that an equal number of customers come into his shop each weekday. To test this hypothesis, an independent researcher records the number of customers that come into the shop on a given week and finds the following:\n", 146 | "\n", 147 | "| Day | Customers |\n", 148 | "|:---------:|-----------|\n", 149 | "| Monday | 50 |\n", 150 | "| Tuesday | 60 |\n", 151 | "| Wednesday | 40 |\n", 152 | "| Thursday | 47 |\n", 153 | "| Friday | 53 |\n", 154 | "\n", 155 | "$H_0$: An equal number of customers come into the shop each day. \n", 156 | "$H_1$: The number of customers coming into the shop is not equal across the days.\n", 157 | "\n", 158 | "There were a total of 250 customers that came into the shop during the week.
Thus, if we expect an equal number to come in each day, then the expected value $E$ for each day would be 50.\n", 159 | "\n", 160 | "$Monday: (50-50)^2 / 50 = 0$ \n", 161 | "$Tuesday: (60-50)^2 / 50 = 2$ \n", 162 | "$Wednesday: (40-50)^2 / 50 = 2$ \n", 163 | "$Thursday: (47-50)^2 / 50 = 0.18$ \n", 164 | "$Friday: (53-50)^2 / 50 = 0.18$ \n", 165 | "\n", 166 | "$\chi^2 = \sum \frac{(Obs-Exp)^2}{Exp} = 0 + 2 + 2 + 0.18 + 0.18 = 4.36$\n", 167 | "\n", 168 | "The p-value associated with $\chi^2$ = 4.36 and degrees of freedom n-1 = 5-1 = 4 is **0.359472.**\n", 169 | "\n", 170 | "Since this p-value is not less than 0.05, we fail to reject the null hypothesis. This means we do not have sufficient evidence to say that the true distribution of customers is different from the distribution that the shop owner claimed." 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "id": "dee06726", 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [] 180 | } 181 | ], 182 | "metadata": { 183 | "kernelspec": { 184 | "display_name": "Python 3 (ipykernel)", 185 | "language": "python", 186 | "name": "python3" 187 | }, 188 | "language_info": { 189 | "codemirror_mode": { 190 | "name": "ipython", 191 | "version": 3 192 | }, 193 | "file_extension": ".py", 194 | "mimetype": "text/x-python", 195 | "name": "python", 196 | "nbconvert_exporter": "python", 197 | "pygments_lexer": "ipython3", 198 | "version": "3.11.7" 199 | } 200 | }, 201 | "nbformat": 4, 202 | "nbformat_minor": 5 203 | } 204 | -------------------------------------------------------------------------------- /16. Effect Size and Statistical Power.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "b9e53dce", 6 | "metadata": {}, 7 | "source": [ 8 | "## Effect Size\n", 9 | "\n", 10 | "\n", 11 | "Quantifying the difference between two groups can be achieved by using an effect size. A p-value provides information about the statistical significance of the difference between the groups, but it doesn't give an insight into the magnitude of the difference. Larger sample sizes often result in a higher likelihood of finding a statistically significant difference, even if the real-world effect is small. Hence, it's crucial to consider effect sizes in addition to p-values, as they provide a clearer picture of the true difference between the groups and are more valuable in practical applications.\n", 12 | "\n", 13 | "There are different measures for effect sizes. The most common effect sizes are Cohen's d and Pearson's r. \n", 14 | "\n", 15 | "Cohen's d measures the size of the difference between two groups while Pearson's r measures the strength of the relationship between two variables.\n", 16 | "\n", 17 | "### Cohen's d - Standardized Mean Difference\n", 18 | "Cohen's d is designed for comparing two groups. It takes the difference between two means and expresses it in standard deviation units. It tells you how many standard deviations lie between the two means.\n", 19 | "\n", 20 | "$$ d =\frac{ \overline x_1 - \overline x_2 }{S}$$\n", 21 | "\n", 22 | "where $\overline x_1$ and $\overline x_2$ are the means of group 1 and group 2, respectively,
 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "id": "dee06726", 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [] 180 | } 181 | ], 182 | "metadata": { 183 | "kernelspec": { 184 | "display_name": "Python 3 (ipykernel)", 185 | "language": "python", 186 | "name": "python3" 187 | }, 188 | "language_info": { 189 | "codemirror_mode": { 190 | "name": "ipython", 191 | "version": 3 192 | }, 193 | "file_extension": ".py", 194 | "mimetype": "text/x-python", 195 | "name": "python", 196 | "nbconvert_exporter": "python", 197 | "pygments_lexer": "ipython3", 198 | "version": "3.11.7" 199 | } 200 | }, 201 | "nbformat": 4, 202 | "nbformat_minor": 5 203 | } 204 | -------------------------------------------------------------------------------- /16. Effect Size and Statistical Power.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "b9e53dce", 6 | "metadata": {}, 7 | "source": [ 8 | "## Effect Size\n", 9 | "\n", 10 | "\n", 11 | "Quantifying the difference between two groups can be achieved by using an effect size. A p-value provides information about the statistical significance of the difference between the groups, but it doesn't give insight into the magnitude of the difference. Larger sample sizes often result in a higher likelihood of finding a statistically significant difference, even if the real-world effect is small. Hence, it's crucial to consider effect sizes in addition to p-values, as they provide a clearer picture of the true difference between the groups and are more valuable in practical applications.\n", 12 | "\n", 13 | "There are different measures for effect sizes. The most common effect sizes are Cohen's d and Pearson's r. \n", 14 | "\n", 15 | "Cohen's d measures the size of the difference between two groups, while Pearson's r measures the strength of the relationship between two variables.\n", 16 | "\n", 17 | "### Cohen's d - Standardized Mean Difference\n", 18 | "Cohen's d is designed for comparing two groups. It takes the difference between two means and expresses it in standard deviation units. It tells you how many standard deviations lie between the two means.\n", 19 | "\n", 20 | "$$ d =\\frac{ \\overline x_1 - \\overline x_2 }{S}$$\n", 21 | "\n", 22 | "where $\\overline x_1$ and $\\overline x_2$ are the means of group 1 and group 2, respectively, and $S$ is the standard deviation.\n", 23 | "\n", 24 | "The choice of standard deviation in the equation depends on your research design.\n", 25 | "We can use:\n", 26 | "+ a pooled standard deviation based on data from both groups,\n", 27 | "+ the standard deviation of a control group, or\n", 28 | "+ the standard deviation from the pretest or posttest data.\n", 29 | "\n", 30 | "### Pearson's r - Correlation Coefficient\n", 31 | "\n", 32 | "Pearson's $r$, or the correlation coefficient, measures the extent of a linear relationship between two variables.\n", 33 | "\n", 34 | "The formula is rather complex, so it’s best to use statistical software to calculate Pearson's r accurately from the raw data.\n", 35 | "\n", 36 | "$$ r_{xy} = \\frac{n\\sum x_i y_i -\\sum x_i \\sum y_i}{\\sqrt{n\\sum x_i^2-(\\sum x_i)^2}\\sqrt{n\\sum y_i^2-(\\sum y_i)^2}}$$\n", 37 | "\n", 38 | "The main idea of the formula is to compute how much of the variability of one variable is determined by the variability of the other variable. Pearson's r is a standardized, unit-free scale for measuring correlations between variables, so you can directly compare the strengths of all correlations with each other.\n", 39 | "\n", 40 | "### Interpreting Values\n", 41 | "\n", 42 | "+ Cohen's $d$ can take on any value between 0 and infinity. In general, the greater the Cohen's d, the larger the effect size.\n", 43 | "+ Pearson's $r$ ranges between -1 and 1. The closer the value is to 0, the smaller the effect size. A value closer to -1 or 1 indicates a higher effect size.\n", 44 | "\n", 45 | "A general rule of thumb to quantify whether an effect size is small, medium or large:\n", 46 | "\n", 47 | "**Cohen’s D:**\n", 48 | "\n", 49 | "+ A d of 0.2 or smaller is considered to be a small effect size.\n", 50 | "+ A d of 0.5 is considered to be a medium effect size.\n", 51 | "+ A d of 0.8 or larger is considered to be a large effect size.\n", 52 | "\n", 53 | "\n", 54 | "**Pearson Correlation Coefficient:**\n", 55 | "\n", 56 | "+ An absolute value of r around 0.1 is considered a low effect size.\n", 57 | "+ An absolute value of r around 0.3 is considered a medium effect size.\n", 58 | "+ An absolute value of r greater than 0.5 is considered to be a large effect size." 59 | ] 60 | },
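 { "cell_type": "markdown", "id": "a9b8c7d6", "metadata": {}, "source": [ "To make these formulas concrete, here is a minimal sketch computing both measures on small made-up samples (the numbers are invented purely for illustration; Cohen's d uses a pooled standard deviation, and Pearson's r comes from `scipy.stats.pearsonr`):" ] }, { "cell_type": "code", "execution_count": null, "id": "b1c2d3e4", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy import stats\n", "\n", "# Cohen's d with a pooled standard deviation (made-up test scores)\n", "group1 = np.array([23, 25, 28, 30, 32, 35])\n", "group2 = np.array([18, 20, 22, 25, 26, 27])\n", "n1, n2 = len(group1), len(group2)\n", "pooled_sd = np.sqrt(((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2))\n", "d = (group1.mean() - group2.mean()) / pooled_sd\n", "\n", "# Pearson's r for two paired variables (also made up)\n", "x = np.array([1, 2, 3, 4, 5, 6])\n", "y = np.array([2, 1, 4, 3, 7, 5])\n", "r, r_pval = stats.pearsonr(x, y)\n", "\n", "print(f\"Cohen's d: {d:.2f}\")  # ~1.4 for these samples: a large effect by the rule of thumb\n", "print(f\"Pearson's r: {r:.2f}\")  # ~0.8: a strong positive correlation\n" ] },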
 61 | { 62 | "cell_type": "markdown", 63 | "id": "f42c2010-37de-4c7f-b02d-7ddfa104ca9e", 64 | "metadata": {}, 65 | "source": [ 66 | "# Statistical Power\n", 67 | "\n", 68 | "Statistical power, or sensitivity, is the likelihood of a significance test detecting an effect when there actually is one. In other words, power is the probability that we will correctly reject the null hypothesis.\n", 69 | "\n", 70 | "Let's look at an example to understand this concept. Suppose we have two distributions with minimal overlap, as shown in the first picture below. If we collect a small set of samples from both the green and red distributions and compare their means using hypothesis testing, we might get a small p-value, say 0.0004. This would cause us to correctly reject the null hypothesis that both sample sets came from the same distribution. In other words, if the null hypothesis claims that all the data points came from a single distribution, we would reject that claim.\n", 71 | "\n", 72 | "If we keep repeating this experiment multiple times, there's a high probability that each statistical test will correctly give us a small p-value. In other words, there is a high probability that the null hypothesis that all the data came from the same distribution will be correctly rejected.\n", 73 | "\n", 74 | "However, occasionally, we might get a trial like in the second picture below, where the two sample sets appear to come from the same distribution due to overlapping sample points, resulting in a high p-value, like 0.08. This means that even though we know that the data came from two different distributions, we cannot correctly reject the null hypothesis that all the data came from the same distribution. Still, since these two distributions are far apart and have very little overlap, such trials are rare, and the probability of correctly rejecting the null hypothesis is high. Thus, power, being the probability that we will correctly reject the null hypothesis, is high in this example.\n", 75 | "\n", 76 | "In summary, when distributions have minimal overlap, the statistical power is high, meaning there is a high likelihood of correctly rejecting the null hypothesis.\n", 77 | "\n", 78 | "\n", 79 | "(figure: two distributions, green and red, with very little overlap)
 \n", 80 | "(figure: an occasional trial in which the sample points drawn from the two distributions overlap)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "id": "9b078af3-f059-44bb-be03-bfe088958ac1", 86 | "metadata": {}, 87 | "source": [ 88 | "# Statistical Power: Overlapping Distributions\n", 89 | "\n", 90 | "Now, let's consider a different scenario where we have a large overlap in the distributions, as shown in the first picture below. Most of the time, when we compare the means of samples from these two distributions, we get a high p-value and fail to reject the null hypothesis that the data comes from the same distribution.\n", 91 | "\n", 92 | "However, occasionally, when the sample data points are from the far extremes of the distributions, as shown in the second picture below, we get a small p-value and can correctly reject the null hypothesis that the data comes from the same distribution. Due to the overlap, the probability of correctly rejecting the null hypothesis is low, meaning we have relatively low power.\n", 93 | "\n", 94 | "The good news is that we can increase the power by increasing the number of samples we collect. Power analysis will tell us how many measurements we need to collect to achieve a good amount of power.\n", 95 | "\n", 96 | "In summary, when distributions have a large overlap, the statistical power is low, meaning there is a low likelihood of correctly rejecting the null hypothesis. By increasing the sample size, we can improve the power of our test.\n", 97 | "\n", 98 | "(figure: two distributions with a large overlap)
 \n", 99 | "\n", 100 | "(figure: a rare trial whose sample points come from the far extremes of the two overlapping distributions)
 \n", 101 | "\n", 102 | "Before we learn how to do a power analysis, let's understand in detail why we need to perform one. " 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "id": "aa70ccc3-5b6a-4efb-9acf-459ea869ffc8", 108 | "metadata": {}, 109 | "source": [ 110 | "### Need for Power Analysis\n", 111 | "\n", 112 | "In hypothesis testing, we start with a null hypothesis of no effect and an alternative hypothesis of a true effect. The goal is to collect enough data from a sample to statistically test whether we can reasonably reject the null hypothesis in favor of the alternative hypothesis. In doing so, there's always a risk of making one of two decision errors when interpreting study results:\n", 113 | "\n", 114 | "- **Type I error**: Rejecting the null hypothesis of no effect when it is actually true.\n", 115 | "- **Type II error**: Not rejecting the null hypothesis of no effect when it is actually false.\n", 116 | "\n", 117 | "Power is the probability of avoiding a Type II error. The higher the statistical power of a test, the lower the risk of making a Type II error. Power is usually set at 80%. This means that if there are true effects to be found in 100 different studies tested with 80% power, only about 80 of the 100 statistical tests will actually detect them. If we don't ensure sufficient power, our study may not be able to detect a true effect at all. This means that resources like time and money are wasted, and it may even be unethical to collect data from participants.\n", 118 | "\n", 119 | "On the flip side, too much power means our tests are highly sensitive to true effects, including very small ones. This may lead to finding statistically significant results with very little usefulness in the real world. To balance these pros and cons of low versus high statistical power, we should use a **Power Analysis** to set an appropriate level.\n" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "id": "515995e2-6617-4ef0-83ea-b5733548d321", 125 | "metadata": {}, 126 | "source": [ 127 | "# Power Analysis\n", 128 | "\n", 129 | "Power is mainly influenced by sample size, effect size, and significance level. A power analysis can be used to determine the necessary sample size for a study. Having enough statistical power is necessary to draw accurate conclusions about a population using sample data.\n", 130 | "\n", 131 | "Power is affected by several factors, but two main factors are:\n", 132 | "\n", 133 | "- **Overlap:** How much overlap there is between the two distributions we want to distinguish in our study.\n", 134 | "- **Sample Size:** The number of samples we collect from each group.\n", 135 | "\n", 136 | "If we want power to be 80% and there is very little overlap, a small sample size will suffice. However, if the overlap between the two distributions is greater, we need a larger sample size to achieve 80% power.\n", 137 | "\n", 138 | "To understand the relationship between overlap and sample size, we need to realize that when we do a statistical test, we usually compare sample means rather than individual measurements. 
So let's see what happens when we calculate means with different sample sizes.\n", 139 | "\n", 140 | "- If the sample size is small, the estimated means vary a lot from sample to sample, so it is hard to be confident that any single estimated mean is a good estimate of the population mean, and the estimated means of the two distributions overlap.\n", 141 | "- But if the sample size is large, the estimated means are so close to the population mean that they no longer overlap. This suggests a high probability that we correctly reject the null hypothesis that both samples came from the same distribution. With a large sample size, we can achieve high power. Additionally, by the central limit theorem, these results apply regardless of the shape of the underlying distribution.\n", 142 | "\n", 143 | "A power analysis consists of four main components. If you know or have estimates for any three of these, you can calculate the fourth component:\n", 144 | "\n", 145 | "- **Statistical Power:** The likelihood that a test will detect an effect of a certain size if there is one, usually set at 80% or higher.\n", 146 | "- **Sample Size:** The minimum number of observations needed to observe an effect of a certain size with a given power level.\n", 147 | "- **Significance Level (alpha):** The maximum risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.\n", 148 | "- **Expected Effect Size:** A standardized measure of how large a difference you expect between the two distributions, combining their means and standard deviations; it is commonly captured by an effect size such as Cohen's d, and there are many different ways to capture it.\n", 149 | "\n", 150 | "Before starting a study, we can use a power analysis to calculate the minimum sample size for a desired power level and significance level, along with an expected effect size. Traditionally, the significance level is set to 5% and the desired power level to 80%. That means we only need to figure out an expected effect size to calculate a sample size from a power analysis.\n", 151 | "\n", 152 | "The `stats.power` module of the statsmodels package in Python contains the required functions for carrying out power analysis for the most commonly used statistical tests such as the t-test, normal-based tests, F-tests, and the Chi-square goodness-of-fit test. 
Its `solve_power` function takes three of the four components mentioned above as input parameters and solves for the fourth (for example, the required sample size).\n" 153 | ] 154 | },
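 { "cell_type": "markdown", "id": "c5d6e7f8", "metadata": {}, "source": [ "For instance, here is a minimal sketch of a sample-size calculation for a two-sample t-test, assuming a medium expected effect size (Cohen's d = 0.5), the conventional 5% significance level, and 80% power:" ] }, { "cell_type": "code", "execution_count": null, "id": "d9e0f1a2", "metadata": {}, "outputs": [], "source": [ "from statsmodels.stats.power import TTestIndPower\n", "\n", "# Leave out nobs1 so that solve_power solves for the sample size\n", "analysis = TTestIndPower()\n", "sample_size = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)\n", "print(f'Required sample size per group: {sample_size:.1f}')  # ~64 per group\n" ] },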

 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "id": "a469340a-90ad-4850-85a4-276d9586f9c4", 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [] 162 | } 163 | ], 164 | "metadata": { 165 | "kernelspec": { 166 | "display_name": "Python 3 (ipykernel)", 167 | "language": "python", 168 | "name": "python3" 169 | }, 170 | "language_info": { 171 | "codemirror_mode": { 172 | "name": "ipython", 173 | "version": 3 174 | }, 175 | "file_extension": ".py", 176 | "mimetype": "text/x-python", 177 | "name": "python", 178 | "nbconvert_exporter": "python", 179 | "pygments_lexer": "ipython3", 180 | "version": "3.11.7" 181 | } 182 | }, 183 | "nbformat": 4, 184 | "nbformat_minor": 5 185 | } 186 | -------------------------------------------------------------------------------- /17.Statistical tests (Summarized).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "_uuid": "96f599ec12693ae56a55e0c1884b3f6ebc6bc825", 7 | "id": "HrWEQcATI3em", 8 | "toc": true 9 | }, 10 | "source": [ 11 | "Table of Contents

\n", 12 | "
" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "# Exploring the Ames, Iowa House Price Dataset\n", 20 | "\n", 21 | "Imagine that we're looking to move to Ames, Iowa, and have a budget of $120,000 to purchase a house. Unfortunately, we're not familiar with the local real estate market. However, the City Hall has a valuable resource, the House Price Dataset, which contains information on approximately 1500 homes in the city, including attributes such as Sale Price, Living Area, and Garage Type. Accessing the full dataset would be too expensive, but the City Hall is offering a generous offer: they'll provide free samples of up to 25 observations, or up to 100 observations for a small fee. This is a great opportunity for us to learn more about the real estate market and see what kind of house we can get for our budget. By analyzing the data, we'll be able to use statistical tests to gain insights into the market. While this notebook won't go into the theory behind the tests, it will provide an overview of which tests to use depending on the situation and how to use them.\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "_uuid": "685178bd8439921232c40ae2bb4218401deb360f", 29 | "id": "iNhM9iVfI3ff" 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "import pandas as pd\n", 34 | "pd.set_option('max_colwidth', 200)\n", 35 | "pd.set_option('display.float_format', lambda x: '%.3f' % x)\n", 36 | "from statsmodels.stats.weightstats import *\n", 37 | "import scipy.stats" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "id": "skFhcwVYLFdH" 45 | }, 46 | "outputs": [ 47 | { 48 | "ename": "FileNotFoundError", 49 | "evalue": "[Errno 2] No such file or directory: 'data/house-prices-advanced-regression-techniques/train.csv'", 50 | "output_type": "error", 51 | "traceback": [ 52 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 53 | "\u001b[1;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", 54 | "Cell \u001b[1;32mIn[2], line 3\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;66;03m#this is the entire dataset, but we'll only be able to use to extract samples from it.\u001b[39;00m\n\u001b[0;32m 2\u001b[0m FILE_PATH \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mdata/house-prices-advanced-regression-techniques/train.csv\u001b[39m\u001b[38;5;124m'\u001b[39m\n\u001b[1;32m----> 3\u001b[0m city_hall_dataset \u001b[38;5;241m=\u001b[39m pd\u001b[38;5;241m.\u001b[39mread_csv(FILE_PATH)\n", 55 | "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:948\u001b[0m, in \u001b[0;36mread_csv\u001b[1;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)\u001b[0m\n\u001b[0;32m 935\u001b[0m kwds_defaults \u001b[38;5;241m=\u001b[39m _refine_defaults_read(\n\u001b[0;32m 936\u001b[0m dialect,\n\u001b[0;32m 937\u001b[0m 
delimiter,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 944\u001b[0m dtype_backend\u001b[38;5;241m=\u001b[39mdtype_backend,\n\u001b[0;32m 945\u001b[0m )\n\u001b[0;32m 946\u001b[0m kwds\u001b[38;5;241m.\u001b[39mupdate(kwds_defaults)\n\u001b[1;32m--> 948\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m _read(filepath_or_buffer, kwds)\n", 56 | "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:611\u001b[0m, in \u001b[0;36m_read\u001b[1;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[0;32m 608\u001b[0m _validate_names(kwds\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnames\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m))\n\u001b[0;32m 610\u001b[0m \u001b[38;5;66;03m# Create the parser.\u001b[39;00m\n\u001b[1;32m--> 611\u001b[0m parser \u001b[38;5;241m=\u001b[39m TextFileReader(filepath_or_buffer, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwds)\n\u001b[0;32m 613\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m chunksize \u001b[38;5;129;01mor\u001b[39;00m iterator:\n\u001b[0;32m 614\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m parser\n", 57 | "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1448\u001b[0m, in \u001b[0;36mTextFileReader.__init__\u001b[1;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[0;32m 1445\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mhas_index_names\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m kwds[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mhas_index_names\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n\u001b[0;32m 1447\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles: IOHandles \u001b[38;5;241m|\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[1;32m-> 1448\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_make_engine(f, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mengine)\n", 58 | "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1705\u001b[0m, in \u001b[0;36mTextFileReader._make_engine\u001b[1;34m(self, f, engine)\u001b[0m\n\u001b[0;32m 1703\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m mode:\n\u001b[0;32m 1704\u001b[0m mode \u001b[38;5;241m+\u001b[39m\u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m-> 1705\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles \u001b[38;5;241m=\u001b[39m get_handle(\n\u001b[0;32m 1706\u001b[0m f,\n\u001b[0;32m 1707\u001b[0m mode,\n\u001b[0;32m 1708\u001b[0m encoding\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mencoding\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[0;32m 1709\u001b[0m compression\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcompression\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[0;32m 1710\u001b[0m 
memory_map\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmemory_map\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mFalse\u001b[39;00m),\n\u001b[0;32m 1711\u001b[0m is_text\u001b[38;5;241m=\u001b[39mis_text,\n\u001b[0;32m 1712\u001b[0m errors\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mencoding_errors\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstrict\u001b[39m\u001b[38;5;124m\"\u001b[39m),\n\u001b[0;32m 1713\u001b[0m storage_options\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39moptions\u001b[38;5;241m.\u001b[39mget(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mstorage_options\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m),\n\u001b[0;32m 1714\u001b[0m )\n\u001b[0;32m 1715\u001b[0m \u001b[38;5;28;01massert\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[0;32m 1716\u001b[0m f \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mhandles\u001b[38;5;241m.\u001b[39mhandle\n", 59 | "File \u001b[1;32m~\\anaconda3\\Lib\\site-packages\\pandas\\io\\common.py:863\u001b[0m, in \u001b[0;36mget_handle\u001b[1;34m(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)\u001b[0m\n\u001b[0;32m 858\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(handle, \u001b[38;5;28mstr\u001b[39m):\n\u001b[0;32m 859\u001b[0m \u001b[38;5;66;03m# Check whether the filename is to be opened in binary mode.\u001b[39;00m\n\u001b[0;32m 860\u001b[0m \u001b[38;5;66;03m# Binary mode does not support 'encoding' and 'newline'.\u001b[39;00m\n\u001b[0;32m 861\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m ioargs\u001b[38;5;241m.\u001b[39mencoding \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mb\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m ioargs\u001b[38;5;241m.\u001b[39mmode:\n\u001b[0;32m 862\u001b[0m \u001b[38;5;66;03m# Encoding\u001b[39;00m\n\u001b[1;32m--> 863\u001b[0m handle \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mopen\u001b[39m(\n\u001b[0;32m 864\u001b[0m handle,\n\u001b[0;32m 865\u001b[0m ioargs\u001b[38;5;241m.\u001b[39mmode,\n\u001b[0;32m 866\u001b[0m encoding\u001b[38;5;241m=\u001b[39mioargs\u001b[38;5;241m.\u001b[39mencoding,\n\u001b[0;32m 867\u001b[0m errors\u001b[38;5;241m=\u001b[39merrors,\n\u001b[0;32m 868\u001b[0m newline\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m 869\u001b[0m )\n\u001b[0;32m 870\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 871\u001b[0m \u001b[38;5;66;03m# Binary mode\u001b[39;00m\n\u001b[0;32m 872\u001b[0m handle \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mopen\u001b[39m(handle, ioargs\u001b[38;5;241m.\u001b[39mmode)\n", 60 | "\u001b[1;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'data/house-prices-advanced-regression-techniques/train.csv'" 61 | ] 62 | } 63 | ], 64 | "source": [ 65 | "#this is the entire dataset, but we'll only be able to use to extract samples from it.\n", 66 | "FILE_PATH = 
'data/house-prices-advanced-regression-techniques/train.csv'\n", 67 | "city_hall_dataset = pd.read_csv(FILE_PATH)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": { 73 | "_uuid": "95f4ee7982692fb0a348652a2bea034fcfaae251", 74 | "id": "LxaQeafsI3fi" 75 | }, 76 | "source": [ 77 | "# Introduction" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": { 83 | "_uuid": "c1606d0ab5c78dcc772f9d1e91daad2fa1ed5933", 84 | "id": "4ZEVAOUxI3fj" 85 | }, 86 | "source": [ 87 | "# Theory" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": { 93 | "_uuid": "36671356d539ae67d27d451ae71f3c595e139948", 94 | "id": "7A4lcRIRI3fk" 95 | }, 96 | "source": [ 97 | "What we will be trying to do in this tutorial is draw inferences about the whole population of houses based only on the samples at our disposal.
\n", 98 | "This is what statistical tests do, but one must know a few principles before using them." 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": { 104 | "_uuid": "698cfbf57cd980f9e0f558eb2e1206a601742dd3", 105 | "id": "fmT5a95OI3fl" 106 | }, 107 | "source": [ 108 | "## The process" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": { 114 | "_uuid": "03287c60f935631f22c13a14d86e233c2fcd47d7", 115 | "id": "PMc5uv87I3fm" 116 | }, 117 | "source": [ 118 | "The basic process of statistical tests is the following : \n", 119 | "- Stating a Null Hypothesis (most often : \"the two values are not different\")\n", 120 | "- Stating an Alternative Hypothesis (most often : \"the two values are different\")\n", 121 | "- Defining an alpha value, which we use here as a confidence level (most often : 95%); 1-alpha is then the significance level. The higher alpha is, the harder it will be to validate the Alternative Hypothesis, but the more confident we will be if we do validate it.\n", 122 | "- Depending on the data at our disposal, we choose the relevant test (Z-test, T-test, etc... More on that later)\n", 123 | "- The test computes a score, which corresponds to a p-value.\n", 124 | "- If the p-value is below 1-alpha (0.05 if alpha is 95%), we can accept the Alternative Hypothesis (or \"reject the Null Hypothesis\"). If it is above, we'll have to stick with the Null Hypothesis (or \"fail to reject the Null Hypothesis\").\n", 125 | "\n", 126 | "\n", 127 | "There's a built-in function for most statistical tests out there.
\n", 128 | "Let's also build our own function to summarize all the information.
\n", 129 | "All tests we will conduct from now on are based on alpha = 95%." 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": { 136 | "_uuid": "781566a3f1a8b9465fad0818f57c86d9be026780", 137 | "id": "rMy2j0kGI3fo" 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "def results(p):\n", 142 | "    if p['p_value'] < 0.05:\n", "        p['hypothesis_accepted'] = 'alternative'\n", 143 | "    else:\n", "        p['hypothesis_accepted'] = 'null'\n", 144 | "\n", 145 | "    df = pd.DataFrame(p, index=[''])\n", 146 | "    cols = ['value1', 'value2', 'score', 'p_value', 'hypothesis_accepted']\n", 147 | "    return df[cols]" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": { 153 | "_uuid": "3c1916b3ab847149daaffab1a7394fadbbfaee01", 154 | "id": "qNYYzNPbI3fp" 155 | }, 156 | "source": [ 157 | "## Two-tailed and One-tailed" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": { 163 | "_uuid": "72ab88dc706a91a822fd829611149a8bccd567ae", 164 | "id": "hs0yrjpeI3fq" 165 | }, 166 | "source": [ 167 | "Two-tailed tests are used to show two values are just \"different\".
\n", 168 | "One-tailed tests are used to show that one value is either \"larger\" or \"smaller\" than another.

\n", 169 | "This has an influence on the p-value : for one-tailed tests, the p-value has to be divided by 2.
\n", 170 | "
\n", 171 | "Most of the functions we'll use (those from the statsmodels.stats.weightstats module) do that by themselves if we input the right parameters.
\n", 172 | "We'll have to do it on our own with functions from the scipy module." 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": { 178 | "_uuid": "927f0ccf354bdc79be8f15126137812533168a3f", 179 | "id": "83YOCHsUI3fr" 180 | }, 181 | "source": [ 182 | "## Types of tests" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": { 188 | "_uuid": "eb131c0758f17dd039df56627c87536436cf7c73", 189 | "id": "fbXJNsiUI3fs" 190 | }, 191 | "source": [ 192 | "There are different types of tests, here are the ones we will cover : \n", 193 | "- T-tests. Used for small sample sizes (n<30), and when population's standard deviation is unknown.\n", 194 | "- Z-tests. Used for large sample sizes (n=>30), and when population's standard deviation is known.\n", 195 | "- F-tests. Used for comparing values of more than two variables.\n", 196 | "- Chi-square. Used for comparing categorical data." 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": { 202 | "_uuid": "c21d8691d27399446d7d7e707f369da61708c5c1", 203 | "id": "5EdR4B5fI3fs" 204 | }, 205 | "source": [ 206 | "## Normal distribution" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": { 212 | "_uuid": "21bce516b7e3e7b14e45dd3b37c4f3a0d9e9fb72", 213 | "id": "xTG4xsVXI3ft" 214 | }, 215 | "source": [ 216 | "Also, most tests - parametric tests - require a population that is normally distributed.
\n", 217 | "It is not the case for SalePrice - which we'll use for most tests - but we can fix this by log-transforming the variable.
\n", 218 | "Note that to go back to our original scale and understand values vs. our \\$120 000, we'll have to exponentiate the values back." 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": { 225 | "_uuid": "5e7f1cf9faba52350ec923638e9e26b9356764d7", 226 | "colab": { 227 | "base_uri": "https://localhost:8080/" 228 | }, 229 | "id": "XOOw7OlgI3fu", 230 | "outputId": "434155bc-55fd-454b-d9a2-c99dbbe5c315" 231 | }, 232 | "outputs": [], 233 | "source": [ 234 | "import numpy as np\n", 235 | "city_hall_dataset['SalePrice'] = np.log1p(city_hall_dataset['SalePrice'])\n", 236 | "logged_budget = np.log1p(120000) #logged $120 000 is 11.695\n", 237 | "logged_budget" 238 | ] 239 | },
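 { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sketch (this check is not part of the original walkthrough): if we wanted to verify how close to normal the log-transformed SalePrice is, a Shapiro-Wilk test could be run on it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import scipy.stats\n", "\n", "# The null hypothesis of Shapiro-Wilk is that the data is normally distributed,\n", "# so a p-value above 0.05 would mean we cannot reject normality.\n", "stat, pval = scipy.stats.shapiro(city_hall_dataset['SalePrice'])\n", "print(f'Shapiro-Wilk statistic: {stat:.3f}, p-value: {pval:.5f}')" ] },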
 240 | { 241 | "cell_type": "markdown", 242 | "metadata": { 243 | "_uuid": "6ea5ef2e8f638517b5381aa54400a541d79caeca", 244 | "id": "4zUY8cgFI3fz" 245 | }, 246 | "source": [ 247 | "# Practice" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": { 253 | "_uuid": "6b12d43e42e70f1142a22cdb6561ab58992887e2", 254 | "id": "6nYyX_XYI3f0" 255 | }, 256 | "source": [ 257 | "So let's say we are ready to dive into the data, but not ready to pay the small fee for the large sample size.\n", 258 | "We'll be starting with the free samples of 25 observations." 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": { 265 | "_uuid": "d49290ec36454bdc5276e6636f965ac0db9f048e", 266 | "id": "HcaLu-G6I3f0" 267 | }, 268 | "outputs": [], 269 | "source": [ 270 | "sample = city_hall_dataset.sample(n=25)\n", 271 | "p = {} #dictionary we'll use to store information and results" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": { 277 | "_uuid": "86dcbf3838f5a01829c1dd07ff295d7a323f9b38", 278 | "id": "xUCHHT1PI3f1" 279 | }, 280 | "source": [ 281 | "## One sample T-test | Two-tailed | Means" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": { 287 | "_uuid": "057bf391b91193515585af60d1b2e6ecf61d9457", 288 | "id": "j-ne4sQdI3f1" 289 | }, 290 | "source": [ 291 | "So the first question we want to ask is : how do our $120 000 compare to the average Ames house SalePrice?
\n", 292 | "In other words, is 120 000 (11.7 logged) any different from the mean SalePrice of the population?
\n", 293 | "To know that from a 25-observation sample, we need to use a One Sample T-Test." 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": { 299 | "_uuid": "9a5b43a56ec242192988e4638bdaf947c4b9feca", 300 | "id": "g5OClmfSI3f2" 301 | }, 302 | "source": [ 303 | "Null Hypothesis : Mean SalePrice = 11.695
\n", 304 | "Alternative Hypothesis : Mean SalePrice ≠ 11.695" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": { 311 | "_uuid": "b114d786b7657bb44b18cd10f3a75b59e8b76e4d", 312 | "colab": { 313 | "base_uri": "https://localhost:8080/", 314 | "height": 80 315 | }, 316 | "id": "S2niNvkBI3f3", 317 | "outputId": "994270a3-6585-436d-9f14-99dfe11e5c6f" 318 | }, 319 | "outputs": [], 320 | "source": [ 321 | "p['value1'], p['value2'] = sample['SalePrice'].mean(), logged_budget\n", 322 | "p['score'], p['p_value'] = stats.ttest_1samp(sample['SalePrice'], popmean=logged_budget)\n", 323 | "results(p)" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": { 329 | "_uuid": "cdbffabed49531b9a2a94290ecf3ad3df7ce1500", 330 | "id": "W5_dIYzhI3f4" 331 | }, 332 | "source": [ 333 | "So we know our initial budget is significantly different from the mean SalePrice.
\n", 334 | "From the table above, it unfortunately seems lower.
" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": { 340 | "_uuid": "9826a1893fc914f63b0c8fb013f54f21a8cb6660", 341 | "id": "R9S7Wyn3I3f5" 342 | }, 343 | "source": [ 344 | "## One sample T-test | One-tailed | Means" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": { 350 | "_uuid": "16f3f4267ad9b0656259a33796891ba99efa4e7e", 351 | "id": "dogNu0GwI3f5" 352 | }, 353 | "source": [ 354 | "Let's make sure our budget is lower by running a one-tailed test.
\n", 355 | "The question now is : is 120 000 (11.695 logged) lower than the mean SalePrice of the population?
" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": { 361 | "_uuid": "c4aa88ea6758d7beced1dd4fe235a4afc5aae480", 362 | "id": "otYidl0LI3f6" 363 | }, 364 | "source": [ 365 | "Null Hypothesis : Mean SalePrice <= 11.695
\n", 366 | "Alternative Hypothesis : Mean SalePrice > 11.695" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": { 373 | "_uuid": "f5a2b2f062c437304675fb7f97cc75be030f7cd8", 374 | "colab": { 375 | "base_uri": "https://localhost:8080/", 376 | "height": 80 377 | }, 378 | "id": "KqfQhhk2I3f7", 379 | "outputId": "96d56b97-7dd3-4cc5-edf0-722874dd5e98" 380 | }, 381 | "outputs": [], 382 | "source": [ 383 | "p['value1'], p['value2'] = sample['SalePrice'].mean(), logged_budget\n", 384 | "p['score'], p['p_value'] = stats.ttest_1samp(sample['SalePrice'], popmean=logged_budget)\n", 385 | "p['p_value'] = p['p_value']/2 #this scipy function returns a two-sided p-value, so for a one-tailed test we halve it ourselves (valid here since the sample mean lies on the alternative's side)\n", 386 | "results(p)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": { 392 | "_uuid": "4ca6a6194d74b0cb91c7fff614b6ff85543c354f", 393 | "id": "9cKbKEuDI3gC" 394 | }, 395 | "source": [ 396 | "Unfortunately it is!
\n", 397 | "At the 95% confidence level, we conclude that our starting budget won't let us buy a house at the average Ames price." 398 | ] 399 | }, 400 | { 401 | "cell_type": "markdown", 402 | "metadata": { 403 | "_uuid": "ba55013530a031bf8fce56db44c036affafbe18c", 404 | "id": "2NeUq916I3gD" 405 | }, 406 | "source": [ 407 | "## Two sample T-test | Two-tailed | Means" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": { 413 | "_uuid": "e0c4dd0f363ec00e2fa7dc1e49cccc303c5d7e1c", 414 | "id": "cQsNuRapI3gE" 415 | }, 416 | "source": [ 417 | "Now that our expectations are lowered, we realize something important :
\n", 418 | "The entire dataset probably contains some big houses fitted for entire families as well as small houses for fewer inhabitants.
\n", 419 | "Prices are probably really different in-between the two types.
\n", 419 | "Prices are probably really different between the two types.

\n", 421 | "What if we could ask the City Hall to give us a sample for big houses, and a sample for smaller houses?
\n", 422 | "We first could see if there is a significant difference in prices.
\n", 422 | "We could first see if there is a significant difference in prices.

\n", 424 | "We do ask the City Hall, and because they understand it is also for the sake of this tutorial, they accept.
\n", 425 | "They say they'll split the dataset in two, based on the surface area of the houses.
\n", 426 | "They will give us a sample from the top 50\\% houses in terms of surface, and another sample from the bottom 50\\%." 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "metadata": { 433 | "_uuid": "0ba8742a710640d9e15b7bb828d707b74c76fa9d", 434 | "id": "luaaidekI3gG" 435 | }, 436 | "outputs": [], 437 | "source": [ 438 | "smaller_houses = city_hall_dataset.sort_values('GrLivArea')[:730].sample(n=25)\n", 439 | "larger_houses = city_hall_dataset.sort_values('GrLivArea')[730:].sample(n=25)" 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": { 445 | "_uuid": "f62d12ae118d7c88daf2fa31cdab454425ee480c", 446 | "id": "dDIdVvbjI3gH" 447 | }, 448 | "source": [ 449 | "Now we first want to know if the two samples, extracted from two different populations, have significant differences in their average SalePrice." 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": { 455 | "_uuid": "e91a0db49dfd68637e52575a901d6688f2f52fc0", 456 | "id": "BiaE7H9WI3gI" 457 | }, 458 | "source": [ 459 | "Null Hypothesis : SalePrice of smaller houses = SalePrice of larger houses
\n", 460 | "Alternative Hypothesis : SalePrice of smaller houses ≠ SalePrice of larger houses
" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": { 467 | "_uuid": "063769aa739f591274cfb0c9ff3ca02cb3f47929", 468 | "colab": { 469 | "base_uri": "https://localhost:8080/", 470 | "height": 80 471 | }, 472 | "id": "60IhVpAjI3gJ", 473 | "outputId": "fd3651ca-4b3c-44c5-d97d-8ba31b8bf057" 474 | }, 475 | "outputs": [], 476 | "source": [ 477 | "p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()\n", 478 | "p['score'], p['p_value'], p['df'] = ttest_ind(smaller_houses['SalePrice'], larger_houses['SalePrice'])\n", 479 | "results(p)" 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": { 485 | "_uuid": "ca2691aed68ebaf14f3d45e8500955b564891c94", 486 | "id": "h-LdfFQxI3gL" 487 | }, 488 | "source": [ 489 | "As expected, the two samples show some significant differences in SalePrice." 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": { 495 | "_uuid": "ac0291a3dd5e3d592b6d41d475f2e9fafa2d2d7a", 496 | "id": "gDGMr5yJI3gM" 497 | }, 498 | "source": [ 499 | "## Two sample T-test | One-tailed | Means" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": { 505 | "_uuid": "c3e80a2b082da6f8ee1181e7c5ad286388136d3b", 506 | "id": "bNyuvn-QI3gM" 507 | }, 508 | "source": [ 509 | "Obviously, larger houses have a higher SalePrice.
\n", 510 | "Let's confirm this with a one-tailed test." 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": { 516 | "_uuid": "4047d017f6fbb8420868b5a3bf1f5c484511ec9e", 517 | "id": "k8KJb7LKI3gN" 518 | }, 519 | "source": [ 520 | "Null Hypothesis : SalePrice of smaller houses >= SalePrice of larger houses
\n", 521 | "Alternative Hypothesis : SalePrice of smaller houses < SalePrice of larger houses
" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": null, 527 | "metadata": { 528 | "_uuid": "27b5247cac5eb55e0635a066fcac0d9c33785567", 529 | "colab": { 530 | "base_uri": "https://localhost:8080/", 531 | "height": 80 532 | }, 533 | "id": "UvYpzz2eI3gO", 534 | "outputId": "6eb1b4bd-90bb-480b-bd32-023d5dfbd24e" 535 | }, 536 | "outputs": [], 537 | "source": [ 538 | "p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()\n", 539 | "p['score'], p['p_value'], p['df'] = ttest_ind(smaller_houses['SalePrice'], larger_houses['SalePrice'], alternative='smaller')\n", 540 | "results(p)" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": { 546 | "_uuid": "12e9ac13b4eadb488545075d14613af3195be463", 547 | "id": "eeUoSgZ7I3gQ" 548 | }, 549 | "source": [ 550 | "Still as expected, SalePrice is significantly higher for larger houses." 551 | ] 552 | }, 553 | { 554 | "cell_type": "markdown", 555 | "metadata": { 556 | "_uuid": "d269a6e87bfb36f6c2a58242ece5b1fb6a071992", 557 | "id": "563ptKoxI3gR" 558 | }, 559 | "source": [ 560 | "## Two sample Z-test | One-tailed | Means" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": { 566 | "_uuid": "fac51d383c6d11ba5ed9bc8d48009f44d4226641", 567 | "id": "A5e1a_rzI3gS" 568 | }, 569 | "source": [ 570 | "Now that the City Hall has already splitted the population in two, why not ask them for larger samples?
\n", 570 | "Now that the City Hall has already split the population in two, why not ask them for larger samples?
\n", 595 | "Alternative Hypothesis : SalePrice of smaller houses < SalePrice of larger houses
" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": { 602 | "_uuid": "0378fa5d6d7df2e9f17a953f889e104bec2e0790", 603 | "colab": { 604 | "base_uri": "https://localhost:8080/", 605 | "height": 80 606 | }, 607 | "id": "K0ImQ7MsI3gV", 608 | "outputId": "bc5e303a-d6de-466a-d18f-54a0cbc91ed3" 609 | }, 610 | "outputs": [], 611 | "source": [ 612 | "p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()\n", 613 | "p['score'], p['p_value'] = ztest(smaller_houses['SalePrice'], larger_houses['SalePrice'], alternative='smaller')\n", 614 | "results(p)" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "metadata": { 620 | "_uuid": "877f43f345c8b9b283f35f5fd5df218bf0e1a9ad", 621 | "id": "nR7ao1OxI3gY" 622 | }, 623 | "source": [ 624 | "Higher sample sizes show the same results : SalePrice is significantely higher for larger houses." 625 | ] 626 | }, 627 | { 628 | "cell_type": "markdown", 629 | "metadata": { 630 | "_uuid": "cc97a4fb0630306449deeb931cce876a5c404b09", 631 | "id": "-0BEcuhAI3gZ" 632 | }, 633 | "source": [ 634 | "## Two sample Z-test | One-tailed | Proportions" 635 | ] 636 | }, 637 | { 638 | "cell_type": "markdown", 639 | "metadata": { 640 | "_uuid": "1c5fd4e5578687f0cd6defbdcb0f47505c1dbca5", 641 | "id": "uzi9hsr9I3gZ" 642 | }, 643 | "source": [ 644 | "Instead of means, we can also run tests on proportions.
\n", 645 | "Is the proportion of houses over \\$120 000 higher in the larger-houses population than in the smaller-houses population?" 646 | ] 647 | }, 648 | { 649 | "cell_type": "markdown", 650 | "metadata": { 651 | "_uuid": "08e590d9599111e0ed46441107fe9a568cbc85f8", 652 | "id": "lRi8Vz_II3ga" 653 | }, 654 | "source": [ 655 | "Null Hypothesis : Proportion of smaller houses with SalePrice over 11.695 >= Proportion of larger houses with SalePrice over 11.695
\n", 656 | "Alternative Hypothesis : Proportion of smaller houses with SalePrice over 11.695 < Proportion of larger houses with SalePrice over 11.695" 657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": null, 662 | "metadata": { 663 | "_uuid": "6908a85072828db67c52eea41d821caf0b94cc56", 664 | "colab": { 665 | "base_uri": "https://localhost:8080/", 666 | "height": 80 667 | }, 668 | "id": "HAr4460WI3gb", 669 | "outputId": "e8750c30-e44e-41e4-e9f2-2cae12d284a0" 670 | }, 671 | "outputs": [], 672 | "source": [ 673 | "from statsmodels.stats.proportion import *\n", 674 | "A1 = len(smaller_houses[smaller_houses.SalePrice>logged_budget])\n", 675 | "B1 = len(smaller_houses)\n", 676 | "A2 = len(larger_houses[larger_houses.SalePrice>logged_budget])\n", 677 | "B2 = len(larger_houses)\n", 678 | "p['value1'], p['value2'] = A1/B1, A2/B2\n", 679 | "p['score'], p['p_value'] = proportions_ztest([A1, A2], [B1, B2], alternative='smaller')\n", 680 | "results(p)" 681 | ] 682 | }, 683 | { 684 | "cell_type": "markdown", 685 | "metadata": { 686 | "_uuid": "09942c4f972edef68fd0ade7bf6f1fccdf608357", 687 | "id": "fu7um-8kI3gc" 688 | }, 689 | "source": [ 690 | "Logically, the test shows that the larger houses population has a higher ratio of houses sold over \\\\$120 000 vs. the smaller houses population." 691 | ] 692 | }, 693 | { 694 | "cell_type": "markdown", 695 | "metadata": { 696 | "_uuid": "1fb6a257bfc1e2aead70cc81c728b6991b96ee2a", 697 | "id": "7erhQfijI3gd" 698 | }, 699 | "source": [ 700 | "## One sample Z-test | One-tailed | Means" 701 | ] 702 | }, 703 | { 704 | "cell_type": "markdown", 705 | "metadata": { 706 | "_uuid": "751e2b017d2e9d0982d16827f7c2074bde519c2f", 707 | "id": "h1Ry7yFxI3ge" 708 | }, 709 | "source": [ 710 | "So now let's see how our \\$120 000 (11.7 logged) are doing against smaller houses only, based on the 100-observation sample." 711 | ] 712 | }, 713 | { 714 | "cell_type": "markdown", 715 | "metadata": { 716 | "_uuid": "3478bf655881918eff42936df3d5f1d0fcfc8717", 717 | "id": "xzH2utgfI3ge" 718 | }, 719 | "source": [ 720 | "Null Hypothesis : Mean SalePrice of smaller houses <= 11.695
\n", 721 | "Alternative Hypothesis : Mean SalePrice of smaller houses > 11.695
" 722 | ] 723 | }, 724 | { 725 | "cell_type": "code", 726 | "execution_count": null, 727 | "metadata": { 728 | "_uuid": "29042a1ab9f5558a694ba2b537f204d648c9229f", 729 | "colab": { 730 | "base_uri": "https://localhost:8080/", 731 | "height": 80 732 | }, 733 | "id": "FaLeLaj4I3gf", 734 | "outputId": "622aee32-78c1-47fe-879f-2ecdedc566ce" 735 | }, 736 | "outputs": [], 737 | "source": [ 738 | "p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), logged_budget\n", 739 | "p['score'], p['p_value'] = ztest(smaller_houses['SalePrice'], value=logged_budget, alternative='larger')\n", 740 | "results(p)" 741 | ] 742 | }, 743 | { 744 | "cell_type": "markdown", 745 | "metadata": { 746 | "_uuid": "40b18c3dbdf3f268901b4aa3b3cd2542b5614ed6", 747 | "id": "WUoHvGY9I3gh" 748 | }, 749 | "source": [ 750 | "That's quite depressing : our \\\\$120 000 do not even beat the average price of smaller houses." 751 | ] 752 | }, 753 | { 754 | "cell_type": "markdown", 755 | "metadata": { 756 | "_uuid": "95f48ae385792cde702067b43d6865a228bada65", 757 | "id": "48EoE0JTI3gh" 758 | }, 759 | "source": [ 760 | "## One sample Z-test | One-tailed | Proportions" 761 | ] 762 | }, 763 | { 764 | "cell_type": "markdown", 765 | "metadata": { 766 | "_uuid": "11e0cda689b265e2fb1f69a6440bb720ae030d01", 767 | "id": "6PAXNuOaI3gi" 768 | }, 769 | "source": [ 770 | "Our \\$120 000 do not seem too far from the average SalePrice of small houses though.
\n", 771 | "Let's see if at least 25\\% of houses have a SalePrice within our budget." 772 | ] 773 | }, 774 | { 775 | "cell_type": "markdown", 776 | "metadata": { 777 | "_uuid": "1f856f2a6117ada10a5dc32962d7cf438fe93beb", 778 | "id": "GQgwSKUUI3gj" 779 | }, 780 | "source": [ 781 | "Null Hypothesis : Proportion of smaller houses with SalePrice under 11.695 <= 25%
\n", 782 | "Alternative Hypothesis : Proportion of smaller houses with SalePrice under 11.695 > 25%" 783 | ] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "execution_count": null, 788 | "metadata": { 789 | "_uuid": "d9bcda8bd0df8421f071c5fe0d38f07abde33549", 790 | "colab": { 791 | "base_uri": "https://localhost:8080/", 792 | "height": 80 793 | }, 794 | "id": "s1gH0FbmI3gj", 795 | "outputId": "f5c5da9c-98c6-4678-8627-711b9055b243" 796 | }, 797 | "outputs": [], 798 | "source": [ 799 | "from statsmodels.stats.proportion import *\n", 800 | "A = len(smaller_houses[smaller_houses.SalePrice<logged_budget])\n", "B = len(smaller_houses)\n", "p['value1'], p['value2'] = A/B, 0.25\n", "p['score'], p['p_value'] = proportions_ztest(A, B, value=0.25, alternative='larger')\n", "results(p)" 815 | ] 816 | }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good news : the alternative is accepted, so at least 25% of the smaller houses fall within our budget." ] }, 817 | { 818 | "cell_type": "markdown", 819 | "metadata": { 820 | "_uuid": "31d1706ffc709d4c08d5c44becd79bb8abdd1f7a", 821 | "id": "_vrgtTttI3gl" 822 | }, 823 | "source": [ 824 | "## F-test (ANOVA)" 825 | ] 826 | }, 827 | { 828 | "cell_type": "markdown", 829 | "metadata": { 830 | "_uuid": "e4755b430cf6a781cf06af48eca1ec2bc34826e6", 831 | "id": "JFGjPpT5I3gl" 832 | }, 833 | "source": [ 834 | "The House Price Dataset has a MSZoning variable, which identifies the general zoning classification of the house.
\n", 835 | "For instance, it lets you know if the house is situated in a residential or a commercial zone.

\n", 836 | "We'll therefore try to find out whether there is a significant difference in SalePrice based on the zoning.
\n", 837 | "And then know where we will be more likely to live with our budget.
\n", 838 | "Based on the 100-observation sample of smaller houses, let's first have an overview of mean SalePrice by zone." 839 | ] 840 | }, 841 | { 842 | "cell_type": "code", 843 | "execution_count": null, 844 | "metadata": { 845 | "_uuid": "5ad3bbf7148ee3ab852e1e86fe31a722a097fcf7", 846 | "colab": { 847 | "base_uri": "https://localhost:8080/", 848 | "height": 235 849 | }, 850 | "id": "Tw-dZRxXI3gm", 851 | "outputId": "50c47d05-1ec7-4a14-9888-210d9b99a413" 852 | }, 853 | "outputs": [], 854 | "source": [ 855 | "replacement = {'FV': \"Floating Village Residential\", 'C (all)': \"Commercial\", 'RH': \"Residential High Density\",\n", 856 | "               'RL': \"Residential Low Density\", 'RM': \"Residential Medium Density\"}\n", 857 | "smaller_houses['MSZoning_FullName'] = smaller_houses['MSZoning'].replace(replacement)\n", 858 | "mean_price_by_zone = smaller_houses.groupby('MSZoning_FullName')['SalePrice'].mean().to_frame()\n", 859 | "mean_price_by_zone" 860 | ] 861 | }, 862 | { 863 | "cell_type": "markdown", 864 | "metadata": { 865 | "_uuid": "9718fdf6ca87e8eb195849a32370fd1e2fdc50ca", 866 | "id": "EI0TRdT5I3gn" 867 | }, 868 | "source": [ 869 | "To know if there is a significant difference between these values, we run an ANOVA test (because there are more than two values to compare).
\n", 870 | "The test won't be able to tell us which attributes are different from the others, but at least we'll know if there is a difference or not." 871 | ] 872 | }, 873 | { 874 | "cell_type": "markdown", 875 | "metadata": { 876 | "_uuid": "de6cfc37bd2dd0834227c069156c45c904a464ab", 877 | "id": "xGCd7qABI3gn" 878 | }, 879 | "source": [ 880 | "Null Hypothesis : No difference between SalePrice means
\n", 881 | "Alternative Hypothesis : Difference between SalePrice means
" 882 | ] 883 | }, 884 | { 885 | "cell_type": "code", 886 | "execution_count": null, 887 | "metadata": { 888 | "_uuid": "336c3d9d49d2b2ceaf514e2fbd3a4356e1787ba8", 889 | "colab": { 890 | "base_uri": "https://localhost:8080/", 891 | "height": 80 892 | }, 893 | "id": "DZiD5a4xI3go", 894 | "outputId": "a4a4807c-b568-483b-873b-899e62f8436a" 895 | }, 896 | "outputs": [], 897 | "source": [ 898 | "sh = smaller_houses.copy()\n", 899 | "p['score'], p['p_value'] = stats.f_oneway(sh.loc[sh.MSZoning=='FV', 'SalePrice'], \n", 900 | " sh.loc[sh.MSZoning=='C (all)', 'SalePrice'],\n", 901 | " sh.loc[sh.MSZoning=='RH', 'SalePrice'],\n", 902 | " sh.loc[sh.MSZoning=='RL', 'SalePrice'],\n", 903 | " sh.loc[sh.MSZoning=='RM', 'SalePrice'],)\n", 904 | "results(p)[['score', 'p_value', 'hypothesis_accepted']]" 905 | ] 906 | }, 907 | { 908 | "cell_type": "markdown", 909 | "metadata": { 910 | "_uuid": "a95d783ef0a7c8c2561d1e7e35fe076504bdaf75", 911 | "id": "JhKowYxmI3gp" 912 | }, 913 | "source": [ 914 | "There is a difference between SalePrices based on where the house is located.
\n", 915 | "Looking at the Average SalePrice by zone, Commercial Zones and Residential High Density zones seem to be the most affordable for our budget." 916 | ] 917 | },
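 { "cell_type": "markdown", "metadata": {}, "source": [ "As a possible follow-up (a sketch, not part of the original walkthrough), a Tukey HSD post-hoc test from statsmodels can show which pairs of zones actually differ:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from statsmodels.stats.multicomp import pairwise_tukeyhsd\n", "\n", "# Pairwise comparison of mean SalePrice between all zoning classes\n", "tukey = pairwise_tukeyhsd(endog=sh['SalePrice'], groups=sh['MSZoning'], alpha=0.05)\n", "print(tukey)" ] },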
 918 | { 919 | "cell_type": "markdown", 920 | "metadata": { 921 | "_uuid": "b1764c17eba1681a1b3e11e1409595539036b087", 922 | "id": "0eOVOjqfI3gp" 923 | }, 924 | "source": [ 925 | "## Chi-square test" 926 | ] 927 | }, 928 | { 929 | "cell_type": "markdown", 930 | "metadata": { 931 | "_uuid": "18bbaeabf46aba49d823ebe5689d5b73f78a857d", 932 | "id": "0Wt1bb7kI3gq" 933 | }, 934 | "source": [ 935 | "One last question we'll address : can we get a garage? If yes, what type of garage?\n", 936 | "If not, then we won't bother saving up for a car, and we'll try to get a house next to Public Transportation.
\n", 937 | "The dataset contains a categorical variable, GarageType, that will help us answer the question.
\n", 938 | "
" 939 | ] 940 | }, 941 | { 942 | "cell_type": "code", 943 | "execution_count": null, 944 | "metadata": { 945 | "_uuid": "c143db864b4a24d0587b00099194f21457e22c29", 946 | "colab": { 947 | "base_uri": "https://localhost:8080/", 948 | "height": 204 949 | }, 950 | "id": "i2PuMmXTI3gq", 951 | "outputId": "2811f9f3-d88a-41ec-d59b-b2538a4c40ba" 952 | }, 953 | "outputs": [], 954 | "source": [ 955 | "smaller_houses['GarageType'].fillna('No Garage', inplace=True)\n", 956 | "smaller_houses['GarageType'].value_counts().to_frame()" 957 | ] 958 | }, 959 | { 960 | "cell_type": "markdown", 961 | "metadata": { 962 | "_uuid": "48d6143222896b8c463ab8a88465a647796f188c", 963 | "id": "0zv1CZjgI3gr" 964 | }, 965 | "source": [ 966 | "We know we can get a house in at least the bottom 25% of smaller houses.
\n", 967 | "We would ideally like to know if the distribution of Garage Types among these 25% is different from that in the three other quarters.
\n", 968 | "We are now friends with the City Hall, so we can ask them one last favor :
\n", 969 | "Split the smaller houses population in 4 based on surface, and give us a sample of each quarter.
\n", 970 | "Because we are working here with categorical data, we'll run a Chi-Square test." 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": null, 976 | "metadata": { 977 | "_uuid": "a0462a0bd18442f27a88c5b0e267122e76ab34b7", 978 | "colab": { 979 | "base_uri": "https://localhost:8080/", 980 | "height": 142 981 | }, 982 | "id": "vBmXrldbI3gr", 983 | "outputId": "56ad970c-fe14-42da-bd74-c4ca45d8a955" 984 | }, 985 | "outputs": [], 986 | "source": [ 987 | "city_hall_dataset['GarageType'].fillna('No Garage', inplace=True)\n", 988 | "sample1 = city_hall_dataset.sort_values('GrLivArea')[:183].sample(n=100)\n", 989 | "sample2 = city_hall_dataset.sort_values('GrLivArea')[183:366].sample(n=100)\n", 990 | "sample3 = city_hall_dataset.sort_values('GrLivArea')[366:549].sample(n=100)\n", 991 | "sample4 = city_hall_dataset.sort_values('GrLivArea')[549:730].sample(n=100)\n", 992 | "dff = pd.concat([\n", 993 | "    sample1['GarageType'].value_counts().to_frame(),\n", 994 | "    sample2['GarageType'].value_counts().to_frame(), \n", 995 | "    sample3['GarageType'].value_counts().to_frame(), \n", 996 | "    sample4['GarageType'].value_counts().to_frame()], \n", 997 | "    axis=1, sort=False)\n", 998 | "dff.columns = ['Sample1 (smallest houses)', 'Sample2', 'Sample3', 'Sample4 (largest houses)']\n", 999 | "dff = dff[:3] #chi-square tests do not work when the table contains zeros, so we keep only the most frequent categories\n", 1000 | "dff " 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "markdown", 1005 | "metadata": { 1006 | "_uuid": "6858dda979f309254b4b04ec93d3d2a4fc4a22a9", 1007 | "id": "C1nqnJXUI3gs" 1008 | }, 1009 | "source": [ 1010 | "Null Hypothesis : No difference between the GarageType distributions
\n", 1011 | "Alternative Hypothesis : Difference between the GarageType distributions" 1012 | ] 1013 | }, 1014 | { 1015 | "cell_type": "code", 1016 | "execution_count": null, 1017 | "metadata": { 1018 | "_uuid": "59e533a57eebfb818a880a00c5eb7b668ddfedf7", 1019 | "colab": { 1020 | "base_uri": "https://localhost:8080/", 1021 | "height": 80 1022 | }, 1023 | "id": "aRLUI6B6I3gt", 1024 | "outputId": "188aca50-8077-4402-c7ef-629e6864b641" 1025 | }, 1026 | "outputs": [], 1027 | "source": [ 1028 | "p['score'], p['p_value'], p['dof'], p['contingency'] = stats.chi2_contingency(dff)\n", 1029 | "p.pop('contingency')\n", 1030 | "results(p)[['score', 'p_value', 'hypothesis_accepted']]" 1031 | ] 1032 | }, 1033 | { 1034 | "cell_type": "markdown", 1035 | "metadata": { 1036 | "_uuid": "d8acefcc9a561092b50a4dc9e056ced0dd9706a2", 1037 | "id": "F3TAg9RsI3gt" 1038 | }, 1039 | "source": [ 1040 | "Clearly there's a difference in GarageType distribution according to the size of the houses.
\n", 1041 | "The sample that concerns us, Sample1, has the highest proportion of \"No Garage\" and \"Detached Garage\".
\n", 1042 | "We'll probably have to stick with Public Transportation." 1043 | ] 1044 | }, 1045 | { 1046 | "cell_type": "markdown", 1047 | "metadata": { 1048 | "_uuid": "d9aa71cb9e37ec58ac0ec437edd8b054eaafbca3", 1049 | "id": "F7OWGJmxI3gu" 1050 | }, 1051 | "source": [ 1052 | "# Conclusion" 1053 | ] 1054 | }, 1055 | { 1056 | "cell_type": "markdown", 1057 | "metadata": { 1058 | "_uuid": "4932d573a2a4d8fdd4d7f842c2d4218c60e4afcf", 1059 | "id": "nvxdHBqTI3gu" 1060 | }, 1061 | "source": [ 1062 | "We probably won't have a great house, but at least, we learned about statistical tests." 1063 | ] 1064 | } 1065 | ], 1066 | "metadata": { 1067 | "colab": { 1068 | "collapsed_sections": [ 1069 | "5EdR4B5fI3fs", 1070 | "xUCHHT1PI3f1", 1071 | "R9S7Wyn3I3f5", 1072 | "2NeUq916I3gD", 1073 | "gDGMr5yJI3gM", 1074 | "563ptKoxI3gR", 1075 | "-0BEcuhAI3gZ", 1076 | "7erhQfijI3gd", 1077 | "48EoE0JTI3gh", 1078 | "_vrgtTttI3gl", 1079 | "0eOVOjqfI3gp" 1080 | ], 1081 | "name": "Statistical-Tests-Explained.ipynb", 1082 | "provenance": [] 1083 | }, 1084 | "kernelspec": { 1085 | "display_name": "Python 3 (ipykernel)", 1086 | "language": "python", 1087 | "name": "python3" 1088 | }, 1089 | "language_info": { 1090 | "codemirror_mode": { 1091 | "name": "ipython", 1092 | "version": 3 1093 | }, 1094 | "file_extension": ".py", 1095 | "mimetype": "text/x-python", 1096 | "name": "python", 1097 | "nbconvert_exporter": "python", 1098 | "pygments_lexer": "ipython3", 1099 | "version": "3.11.7" 1100 | }, 1101 | "toc": { 1102 | "base_numbering": 1, 1103 | "nav_menu": {}, 1104 | "number_sections": true, 1105 | "sideBar": true, 1106 | "skip_h1_title": false, 1107 | "title_cell": "Table of Contents", 1108 | "title_sidebar": "Contents", 1109 | "toc_cell": true, 1110 | "toc_position": {}, 1111 | "toc_section_display": true, 1112 | "toc_window_display": false 1113 | } 1114 | }, 1115 | "nbformat": 4, 1116 | "nbformat_minor": 4 1117 | } 1118 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Aman Roland 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Statistics-and-Probability-For-Data-Science 2 | 3 | In this repository, I will delve into the fundamental concepts of statistics and probability through the use of the Python programming language. The following topics will be covered: 4 | 5 | 1. Permutations and Combinations 6 | 2. The basics of probability, including conditional probability and the law of large numbers 7 | 3. Bayes' theorem and its applications 8 | 4. Probability distributions, including binomial, uniform, geometric, Poisson, and normal distributions 9 | 5. Measures of central tendency and variability, as well as skewness and kurtosis 10 | 6. The central limit theorem, estimation, and confidence intervals 11 | 7. Sampling methods and errors 12 | 8. Hypothesis testing, significance levels, P-values, and confidence intervals 13 | 9. Parametric tests, including z-tests (one-tailed and two-tailed) for means and proportions, and t-tests (paired t-test) 14 | 10. Analysis of variance (ANOVA) 15 | 11. Nonparametric tests, such as chi-square tests 16 | 12. Effect size, correlation, power, and power analysis 17 | -------------------------------------------------------------------------------- /data/a1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/a1.png -------------------------------------------------------------------------------- /data/cdf_ppf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/cdf_ppf.png -------------------------------------------------------------------------------- /data/csd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/csd.png -------------------------------------------------------------------------------- /data/dpd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/dpd.png -------------------------------------------------------------------------------- /data/f_dist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/f_dist.png -------------------------------------------------------------------------------- /data/gaussian-distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/gaussian-distribution.png -------------------------------------------------------------------------------- /data/house-prices-advanced-regression-techniques/data_description.txt: -------------------------------------------------------------------------------- 1 | MSSubClass: Identifies the type of dwelling involved in the
sale. 2 | 3 | 20 1-STORY 1946 & NEWER ALL STYLES 4 | 30 1-STORY 1945 & OLDER 5 | 40 1-STORY W/FINISHED ATTIC ALL AGES 6 | 45 1-1/2 STORY - UNFINISHED ALL AGES 7 | 50 1-1/2 STORY FINISHED ALL AGES 8 | 60 2-STORY 1946 & NEWER 9 | 70 2-STORY 1945 & OLDER 10 | 75 2-1/2 STORY ALL AGES 11 | 80 SPLIT OR MULTI-LEVEL 12 | 85 SPLIT FOYER 13 | 90 DUPLEX - ALL STYLES AND AGES 14 | 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER 15 | 150 1-1/2 STORY PUD - ALL AGES 16 | 160 2-STORY PUD - 1946 & NEWER 17 | 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER 18 | 190 2 FAMILY CONVERSION - ALL STYLES AND AGES 19 | 20 | MSZoning: Identifies the general zoning classification of the sale. 21 | 22 | A Agriculture 23 | C Commercial 24 | FV Floating Village Residential 25 | I Industrial 26 | RH Residential High Density 27 | RL Residential Low Density 28 | RP Residential Low Density Park 29 | RM Residential Medium Density 30 | 31 | LotFrontage: Linear feet of street connected to property 32 | 33 | LotArea: Lot size in square feet 34 | 35 | Street: Type of road access to property 36 | 37 | Grvl Gravel 38 | Pave Paved 39 | 40 | Alley: Type of alley access to property 41 | 42 | Grvl Gravel 43 | Pave Paved 44 | NA No alley access 45 | 46 | LotShape: General shape of property 47 | 48 | Reg Regular 49 | IR1 Slightly irregular 50 | IR2 Moderately Irregular 51 | IR3 Irregular 52 | 53 | LandContour: Flatness of the property 54 | 55 | Lvl Near Flat/Level 56 | Bnk Banked - Quick and significant rise from street grade to building 57 | HLS Hillside - Significant slope from side to side 58 | Low Depression 59 | 60 | Utilities: Type of utilities available 61 | 62 | AllPub All public Utilities (E,G,W,& S) 63 | NoSewr Electricity, Gas, and Water (Septic Tank) 64 | NoSeWa Electricity and Gas Only 65 | ELO Electricity only 66 | 67 | LotConfig: Lot configuration 68 | 69 | Inside Inside lot 70 | Corner Corner lot 71 | CulDSac Cul-de-sac 72 | FR2 Frontage on 2 sides of property 73 | FR3 Frontage on 3 sides of property 74 | 75 | LandSlope: Slope of property 76 | 77 | Gtl Gentle slope 78 | Mod Moderate Slope 79 | Sev Severe Slope 80 | 81 | Neighborhood: Physical locations within Ames city limits 82 | 83 | Blmngtn Bloomington Heights 84 | Blueste Bluestem 85 | BrDale Briardale 86 | BrkSide Brookside 87 | ClearCr Clear Creek 88 | CollgCr College Creek 89 | Crawfor Crawford 90 | Edwards Edwards 91 | Gilbert Gilbert 92 | IDOTRR Iowa DOT and Rail Road 93 | MeadowV Meadow Village 94 | Mitchel Mitchell 95 | Names North Ames 96 | NoRidge Northridge 97 | NPkVill Northpark Villa 98 | NridgHt Northridge Heights 99 | NWAmes Northwest Ames 100 | OldTown Old Town 101 | SWISU South & West of Iowa State University 102 | Sawyer Sawyer 103 | SawyerW Sawyer West 104 | Somerst Somerset 105 | StoneBr Stone Brook 106 | Timber Timberland 107 | Veenker Veenker 108 | 109 | Condition1: Proximity to various conditions 110 | 111 | Artery Adjacent to arterial street 112 | Feedr Adjacent to feeder street 113 | Norm Normal 114 | RRNn Within 200' of North-South Railroad 115 | RRAn Adjacent to North-South Railroad 116 | PosN Near positive off-site feature--park, greenbelt, etc. 
117 | PosA Adjacent to positive off-site feature 118 | RRNe Within 200' of East-West Railroad 119 | RRAe Adjacent to East-West Railroad 120 | 121 | Condition2: Proximity to various conditions (if more than one is present) 122 | 123 | Artery Adjacent to arterial street 124 | Feedr Adjacent to feeder street 125 | Norm Normal 126 | RRNn Within 200' of North-South Railroad 127 | RRAn Adjacent to North-South Railroad 128 | PosN Near positive off-site feature--park, greenbelt, etc. 129 | PosA Adjacent to positive off-site feature 130 | RRNe Within 200' of East-West Railroad 131 | RRAe Adjacent to East-West Railroad 132 | 133 | BldgType: Type of dwelling 134 | 135 | 1Fam Single-family Detached 136 | 2FmCon Two-family Conversion; originally built as one-family dwelling 137 | Duplx Duplex 138 | TwnhsE Townhouse End Unit 139 | TwnhsI Townhouse Inside Unit 140 | 141 | HouseStyle: Style of dwelling 142 | 143 | 1Story One story 144 | 1.5Fin One and one-half story: 2nd level finished 145 | 1.5Unf One and one-half story: 2nd level unfinished 146 | 2Story Two story 147 | 2.5Fin Two and one-half story: 2nd level finished 148 | 2.5Unf Two and one-half story: 2nd level unfinished 149 | SFoyer Split Foyer 150 | SLvl Split Level 151 | 152 | OverallQual: Rates the overall material and finish of the house 153 | 154 | 10 Very Excellent 155 | 9 Excellent 156 | 8 Very Good 157 | 7 Good 158 | 6 Above Average 159 | 5 Average 160 | 4 Below Average 161 | 3 Fair 162 | 2 Poor 163 | 1 Very Poor 164 | 165 | OverallCond: Rates the overall condition of the house 166 | 167 | 10 Very Excellent 168 | 9 Excellent 169 | 8 Very Good 170 | 7 Good 171 | 6 Above Average 172 | 5 Average 173 | 4 Below Average 174 | 3 Fair 175 | 2 Poor 176 | 1 Very Poor 177 | 178 | YearBuilt: Original construction date 179 | 180 | YearRemodAdd: Remodel date (same as construction date if no remodeling or additions) 181 | 182 | RoofStyle: Type of roof 183 | 184 | Flat Flat 185 | Gable Gable 186 | Gambrel Gambrel (Barn) 187 | Hip Hip 188 | Mansard Mansard 189 | Shed Shed 190 | 191 | RoofMatl: Roof material 192 | 193 | ClyTile Clay or Tile 194 | CompShg Standard (Composite) Shingle 195 | Membran Membrane 196 | Metal Metal 197 | Roll Roll 198 | Tar&Grv Gravel & Tar 199 | WdShake Wood Shakes 200 | WdShngl Wood Shingles 201 | 202 | Exterior1st: Exterior covering on house 203 | 204 | AsbShng Asbestos Shingles 205 | AsphShn Asphalt Shingles 206 | BrkComm Brick Common 207 | BrkFace Brick Face 208 | CBlock Cinder Block 209 | CemntBd Cement Board 210 | HdBoard Hard Board 211 | ImStucc Imitation Stucco 212 | MetalSd Metal Siding 213 | Other Other 214 | Plywood Plywood 215 | PreCast PreCast 216 | Stone Stone 217 | Stucco Stucco 218 | VinylSd Vinyl Siding 219 | Wd Sdng Wood Siding 220 | WdShing Wood Shingles 221 | 222 | Exterior2nd: Exterior covering on house (if more than one material) 223 | 224 | AsbShng Asbestos Shingles 225 | AsphShn Asphalt Shingles 226 | BrkComm Brick Common 227 | BrkFace Brick Face 228 | CBlock Cinder Block 229 | CemntBd Cement Board 230 | HdBoard Hard Board 231 | ImStucc Imitation Stucco 232 | MetalSd Metal Siding 233 | Other Other 234 | Plywood Plywood 235 | PreCast PreCast 236 | Stone Stone 237 | Stucco Stucco 238 | VinylSd Vinyl Siding 239 | Wd Sdng Wood Siding 240 | WdShing Wood Shingles 241 | 242 | MasVnrType: Masonry veneer type 243 | 244 | BrkCmn Brick Common 245 | BrkFace Brick Face 246 | CBlock Cinder Block 247 | None None 248 | Stone Stone 249 | 250 | MasVnrArea: Masonry veneer area in square feet 251 | 252 | ExterQual: Evaluates the
quality of the material on the exterior 253 | 254 | Ex Excellent 255 | Gd Good 256 | TA Average/Typical 257 | Fa Fair 258 | Po Poor 259 | 260 | ExterCond: Evaluates the present condition of the material on the exterior 261 | 262 | Ex Excellent 263 | Gd Good 264 | TA Average/Typical 265 | Fa Fair 266 | Po Poor 267 | 268 | Foundation: Type of foundation 269 | 270 | BrkTil Brick & Tile 271 | CBlock Cinder Block 272 | PConc Poured Concrete 273 | Slab Slab 274 | Stone Stone 275 | Wood Wood 276 | 277 | BsmtQual: Evaluates the height of the basement 278 | 279 | Ex Excellent (100+ inches) 280 | Gd Good (90-99 inches) 281 | TA Typical (80-89 inches) 282 | Fa Fair (70-79 inches) 283 | Po Poor (<70 inches) 284 | NA No Basement 285 | 286 | BsmtCond: Evaluates the general condition of the basement 287 | 288 | Ex Excellent 289 | Gd Good 290 | TA Typical - slight dampness allowed 291 | Fa Fair - dampness or some cracking or settling 292 | Po Poor - Severe cracking, settling, or wetness 293 | NA No Basement 294 | 295 | BsmtExposure: Refers to walkout or garden level walls 296 | 297 | Gd Good Exposure 298 | Av Average Exposure (split levels or foyers typically score average or above) 299 | Mn Minimum Exposure 300 | No No Exposure 301 | NA No Basement 302 | 303 | BsmtFinType1: Rating of basement finished area 304 | 305 | GLQ Good Living Quarters 306 | ALQ Average Living Quarters 307 | BLQ Below Average Living Quarters 308 | Rec Average Rec Room 309 | LwQ Low Quality 310 | Unf Unfinished 311 | NA No Basement 312 | 313 | BsmtFinSF1: Type 1 finished square feet 314 | 315 | BsmtFinType2: Rating of basement finished area (if multiple types) 316 | 317 | GLQ Good Living Quarters 318 | ALQ Average Living Quarters 319 | BLQ Below Average Living Quarters 320 | Rec Average Rec Room 321 | LwQ Low Quality 322 | Unf Unfinished 323 | NA No Basement 324 | 325 | BsmtFinSF2: Type 2 finished square feet 326 | 327 | BsmtUnfSF: Unfinished square feet of basement area 328 | 329 | TotalBsmtSF: Total square feet of basement area 330 | 331 | Heating: Type of heating 332 | 333 | Floor Floor Furnace 334 | GasA Gas forced warm air furnace 335 | GasW Gas hot water or steam heat 336 | Grav Gravity furnace 337 | OthW Hot water or steam heat other than gas 338 | Wall Wall furnace 339 | 340 | HeatingQC: Heating quality and condition 341 | 342 | Ex Excellent 343 | Gd Good 344 | TA Average/Typical 345 | Fa Fair 346 | Po Poor 347 | 348 | CentralAir: Central air conditioning 349 | 350 | N No 351 | Y Yes 352 | 353 | Electrical: Electrical system 354 | 355 | SBrkr Standard Circuit Breakers & Romex 356 | FuseA Fuse Box over 60 AMP and all Romex wiring (Average) 357 | FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair) 358 | FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor) 359 | Mix Mixed 360 | 361 | 1stFlrSF: First Floor square feet 362 | 363 | 2ndFlrSF: Second floor square feet 364 | 365 | LowQualFinSF: Low quality finished square feet (all floors) 366 | 367 | GrLivArea: Above grade (ground) living area square feet 368 | 369 | BsmtFullBath: Basement full bathrooms 370 | 371 | BsmtHalfBath: Basement half bathrooms 372 | 373 | FullBath: Full bathrooms above grade 374 | 375 | HalfBath: Half baths above grade 376 | 377 | Bedroom: Bedrooms above grade (does NOT include basement bedrooms) 378 | 379 | Kitchen: Kitchens above grade 380 | 381 | KitchenQual: Kitchen quality 382 | 383 | Ex Excellent 384 | Gd Good 385 | TA Typical/Average 386 | Fa Fair 387 | Po Poor 388 | 389 | TotRmsAbvGrd: Total rooms above grade (does not include bathrooms) 390 | 
391 | Functional: Home functionality (Assume typical unless deductions are warranted) 392 | 393 | Typ Typical Functionality 394 | Min1 Minor Deductions 1 395 | Min2 Minor Deductions 2 396 | Mod Moderate Deductions 397 | Maj1 Major Deductions 1 398 | Maj2 Major Deductions 2 399 | Sev Severely Damaged 400 | Sal Salvage only 401 | 402 | Fireplaces: Number of fireplaces 403 | 404 | FireplaceQu: Fireplace quality 405 | 406 | Ex Excellent - Exceptional Masonry Fireplace 407 | Gd Good - Masonry Fireplace in main level 408 | TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement 409 | Fa Fair - Prefabricated Fireplace in basement 410 | Po Poor - Ben Franklin Stove 411 | NA No Fireplace 412 | 413 | GarageType: Garage location 414 | 415 | 2Types More than one type of garage 416 | Attchd Attached to home 417 | Basment Basement Garage 418 | BuiltIn Built-In (Garage part of house - typically has room above garage) 419 | CarPort Car Port 420 | Detchd Detached from home 421 | NA No Garage 422 | 423 | GarageYrBlt: Year garage was built 424 | 425 | GarageFinish: Interior finish of the garage 426 | 427 | Fin Finished 428 | RFn Rough Finished 429 | Unf Unfinished 430 | NA No Garage 431 | 432 | GarageCars: Size of garage in car capacity 433 | 434 | GarageArea: Size of garage in square feet 435 | 436 | GarageQual: Garage quality 437 | 438 | Ex Excellent 439 | Gd Good 440 | TA Typical/Average 441 | Fa Fair 442 | Po Poor 443 | NA No Garage 444 | 445 | GarageCond: Garage condition 446 | 447 | Ex Excellent 448 | Gd Good 449 | TA Typical/Average 450 | Fa Fair 451 | Po Poor 452 | NA No Garage 453 | 454 | PavedDrive: Paved driveway 455 | 456 | Y Paved 457 | P Partial Pavement 458 | N Dirt/Gravel 459 | 460 | WoodDeckSF: Wood deck area in square feet 461 | 462 | OpenPorchSF: Open porch area in square feet 463 | 464 | EnclosedPorch: Enclosed porch area in square feet 465 | 466 | 3SsnPorch: Three season porch area in square feet 467 | 468 | ScreenPorch: Screen porch area in square feet 469 | 470 | PoolArea: Pool area in square feet 471 | 472 | PoolQC: Pool quality 473 | 474 | Ex Excellent 475 | Gd Good 476 | TA Average/Typical 477 | Fa Fair 478 | NA No Pool 479 | 480 | Fence: Fence quality 481 | 482 | GdPrv Good Privacy 483 | MnPrv Minimum Privacy 484 | GdWo Good Wood 485 | MnWw Minimum Wood/Wire 486 | NA No Fence 487 | 488 | MiscFeature: Miscellaneous feature not covered in other categories 489 | 490 | Elev Elevator 491 | Gar2 2nd Garage (if not described in garage section) 492 | Othr Other 493 | Shed Shed (over 100 SF) 494 | TenC Tennis Court 495 | NA None 496 | 497 | MiscVal: $Value of miscellaneous feature 498 | 499 | MoSold: Month Sold (MM) 500 | 501 | YrSold: Year Sold (YYYY) 502 | 503 | SaleType: Type of sale 504 | 505 | WD Warranty Deed - Conventional 506 | CWD Warranty Deed - Cash 507 | VWD Warranty Deed - VA Loan 508 | New Home just constructed and sold 509 | COD Court Officer Deed/Estate 510 | Con Contract 15% Down payment regular terms 511 | ConLw Contract Low Down payment and low interest 512 | ConLI Contract Low Interest 513 | ConLD Contract Low Down 514 | Oth Other 515 | 516 | SaleCondition: Condition of sale 517 | 518 | Normal Normal Sale 519 | Abnorml Abnormal Sale - trade, foreclosure, short sale 520 | AdjLand Adjoining Land Purchase 521 | Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit 522 | Family Sale between family members 523 | Partial Home was not completed when last assessed (associated with New Homes) 524 | 
-------------------------------------------------------------------------------- /data/hypothesis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/hypothesis.png -------------------------------------------------------------------------------- /data/image-30.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/image-30.gif -------------------------------------------------------------------------------- /data/kurtosis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/kurtosis.png -------------------------------------------------------------------------------- /data/lln.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/lln.png -------------------------------------------------------------------------------- /data/mean-median-mode.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/mean-median-mode.png -------------------------------------------------------------------------------- /data/p1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/p1.png -------------------------------------------------------------------------------- /data/p2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/p2.png -------------------------------------------------------------------------------- /data/p3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/p3.png -------------------------------------------------------------------------------- /data/p4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/p4.png -------------------------------------------------------------------------------- /data/poi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/poi.png -------------------------------------------------------------------------------- /data/snd.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/snd.png -------------------------------------------------------------------------------- /data/t_dist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/t_dist.png -------------------------------------------------------------------------------- /data/ud.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CodeRabbitHub/Statistics-and-Probability-For-Data-Science/d5a9e7ce443a20e81538e222333ab41ca8b38f0d/data/ud.png --------------------------------------------------------------------------------