├── Lectures
│   ├── 04 Hypothesis Testing
│   │   ├── 02 Z Test.ipynb
│   │   ├── 03 Decision Rule.ipynb
│   │   ├── 01 Hypothesis Testing.ipynb
│   │   ├── 04 Why Hypothesis Testing.ipynb
│   │   ├── 06 Level of Significance.ipynb
│   │   ├── 07 Influence of Sample Size.ipynb
│   │   └── 05 One Tailed and Two Tailed Tests.ipynb
│   ├── 07 Chi-Square Test
│   │   ├── 06 p-values.ipynb
│   │   ├── 01 EPO Experiment.ipynb
│   │   ├── 02 Statistical Hypothesis.ipynb
│   │   ├── 08 Estimating Effect Size.ipynb
│   │   ├── 09 Importance of Replication.ipynb
│   │   ├── 10 Reports in the Literature.ipynb
│   │   ├── 03 Sampling Distribution of X_2 - X_2.ipynb
│   │   └── 07 Statistically Significant Results.ipynb
│   ├── 06 t Tests
│   │   ├── 01 Sampling Distribution of t.ipynb
│   │   └── 03 Estimating the Standard Error.ipynb
│   ├── 03 Introduction to Inferential Statistics
│   │   ├── 02 Probability.ipynb
│   │   └── 01 Population and Sample.ipynb
│   ├── 05 Confidence Intervals and Estimation
│   │   ├── 01 Confidence Interval.ipynb
│   │   ├── 02 Level of Confidence.ipynb
│   │   ├── 03 Effect of Sample Size.ipynb
│   │   └── 04 Hypothesis Tests or Confidence Intervals?.ipynb
│   ├── 08 Choosing the Right Statistical Test
│   │   ├── 01 Sampling Distribution of D.ipynb
│   │   └── 04 Estimating Effect Size.ipynb
│   ├── .DS_Store
│   ├── 01 Introduction
│   │   ├── .DS_Store
│   │   ├── images
│   │   │   ├── 1.png
│   │   │   ├── 2.png
│   │   │   ├── 3.png
│   │   │   ├── banner.png
│   │   │   ├── types-of-data.png
│   │   │   ├── qualitative-data.png
│   │   │   ├── quantitative-data.png
│   │   │   ├── sample-population.png
│   │   │   ├── survey-experiment.png
│   │   │   ├── confounding-variable.png
│   │   │   ├── data-science-ai-ml-dl.png
│   │   │   ├── levels-of-measurement.png
│   │   │   ├── alternative-hypothesis.png
│   │   │   ├── descriptive-statistics.png
│   │   │   ├── inferential-statistics.png
│   │   │   └── observation-experiment.png
│   │   ├── 03 Types of Variables.ipynb
│   │   ├── 02 Types of Data.ipynb
│   │   └── 01 What is statistics?.ipynb
│   ├── images
│   │   └── exercise-banner.gif
│   └── 02 Descriptive Statistics
│       ├── .DS_Store
│       ├── images
│       │   ├── iqr.png
│       │   ├── .DS_Store
│       │   ├── banner.png
│       │   ├── error.png
│       │   ├── range.jpeg
│       │   ├── range.png
│       │   ├── variance.png
│       │   ├── z-table.jpg
│       │   ├── z-table.webp
│       │   ├── cdf-pdf-pmf.png
│       │   ├── probability.png
│       │   ├── skewed-dist.png
│       │   ├── z-table-full.jpeg
│       │   ├── 68-95-99.7-rule.png
│       │   ├── cdf-pdf-pmf-2.jpeg
│       │   ├── normal-dist-cdf.png
│       │   ├── different-variance.png
│       │   ├── standard-deviation.png
│       │   ├── y-intercept-slope.png
│       │   ├── interquartile-range.png
│       │   ├── multiple-regression.png
│       │   ├── normal-distribution.png
│       │   ├── standard-normal-dist.png
│       │   ├── types-of-correlation.png
│       │   ├── measures-of-variability.png
│       │   ├── regression-correlation.png
│       │   ├── sample-population-mean.png
│       │   ├── sample-population-range.png
│       │   ├── mean-mode-median-summary.png
│       │   ├── regression-toward-the-mean.png
│       │   ├── sample-population-mean-2.png
│       │   ├── sample-population-variance.png
│       │   ├── standard-error-of-estimate.png
│       │   ├── control-groups-in-research.webp
│       │   ├── mean-estimation-from-sample.png
│       │   └── regression-to-the-mean-height.png
│       ├── 07 Regression Toward the Mean.ipynb
│       ├── 03 Normal Distribution.ipynb
│       └── 01 Describing Data with Averages.ipynb
├── .DS_Store
├── images
│   ├── banner.png
│   ├── pytopia-course.png
│   └── alternative-hypothesis.png
├── .gitignore
└── README.md

/Lectures/04 Hypothesis Testing/02 Z Test.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/07 Chi-Square Test/06 p-values.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/04 Hypothesis Testing/03 Decision Rule.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/07 Chi-Square Test/01 EPO Experiment.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/04 Hypothesis Testing/01 Hypothesis Testing.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/06 t Tests/01 Sampling Distribution of t.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/06 t Tests/03 Estimating the Standard Error.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/04 Hypothesis Testing/04 Why Hypothesis Testing.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/04 Hypothesis Testing/06 Level of Significance.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/07 Chi-Square Test/02 Statistical Hypothesis.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/07 Chi-Square Test/08 Estimating Effect Size.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/07 Chi-Square Test/09 Importance of Replication.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/07 Chi-Square Test/10 Reports in the Literature.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/04 Hypothesis Testing/07 Influence of Sample Size.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/.DS_Store
--------------------------------------------------------------------------------
/Lectures/03 Introduction to Inferential Statistics/02 Probability.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/04 Hypothesis Testing/05 One Tailed and Two Tailed Tests.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/07 Chi-Square Test/03 Sampling Distribution of X_2 - X_2.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/07 Chi-Square Test/07 Statistically Significant Results.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/05 Confidence Intervals and Estimation/01 Confidence Interval.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/05 Confidence Intervals and Estimation/02 Level of Confidence.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/05 Confidence Intervals and Estimation/03 Effect of Sample Size.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/08 Choosing the Right Statistical Test/01 Sampling Distribution of D.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Lectures/08 Choosing the Right Statistical Test/04 Estimating Effect Size.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/images/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/images/banner.png
--------------------------------------------------------------------------------
/Lectures/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/.DS_Store
--------------------------------------------------------------------------------
/Lectures/05 Confidence Intervals and Estimation/04 Hypothesis Tests or Confidence Intervals?.ipynb:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/images/pytopia-course.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/images/pytopia-course.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/.DS_Store
--------------------------------------------------------------------------------
/images/alternative-hypothesis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/images/alternative-hypothesis.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/1.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/2.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/3.png
--------------------------------------------------------------------------------
/Lectures/images/exercise-banner.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/images/exercise-banner.gif
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/banner.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/.DS_Store
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/types-of-data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/types-of-data.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/iqr.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/iqr.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/qualitative-data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/qualitative-data.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/.DS_Store
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/banner.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/error.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/error.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/range.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/range.jpeg
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/range.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/range.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/quantitative-data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/quantitative-data.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/sample-population.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/sample-population.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/survey-experiment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/survey-experiment.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/variance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/variance.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/z-table.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/z-table.jpg
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/z-table.webp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/z-table.webp
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/confounding-variable.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/confounding-variable.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/data-science-ai-ml-dl.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/data-science-ai-ml-dl.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/levels-of-measurement.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/levels-of-measurement.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/cdf-pdf-pmf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/cdf-pdf-pmf.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/probability.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/probability.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/skewed-dist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/skewed-dist.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/alternative-hypothesis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/alternative-hypothesis.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/descriptive-statistics.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/descriptive-statistics.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/inferential-statistics.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/inferential-statistics.png
--------------------------------------------------------------------------------
/Lectures/01 Introduction/images/observation-experiment.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/01 Introduction/images/observation-experiment.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/z-table-full.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/z-table-full.jpeg
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/68-95-99.7-rule.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/68-95-99.7-rule.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/cdf-pdf-pmf-2.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/cdf-pdf-pmf-2.jpeg
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/normal-dist-cdf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/normal-dist-cdf.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/different-variance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/different-variance.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/standard-deviation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/standard-deviation.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/y-intercept-slope.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/y-intercept-slope.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/interquartile-range.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/interquartile-range.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/multiple-regression.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/multiple-regression.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/normal-distribution.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/normal-distribution.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/standard-normal-dist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/standard-normal-dist.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/types-of-correlation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/types-of-correlation.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/measures-of-variability.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/measures-of-variability.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/regression-correlation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/regression-correlation.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/sample-population-mean.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/sample-population-mean.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/sample-population-range.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/sample-population-range.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/mean-mode-median-summary.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/mean-mode-median-summary.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/regression-toward-the-mean.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/regression-toward-the-mean.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/sample-population-mean-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/sample-population-mean-2.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/sample-population-variance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/sample-population-variance.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/standard-error-of-estimate.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/standard-error-of-estimate.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/control-groups-in-research.webp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/control-groups-in-research.webp
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/mean-estimation-from-sample.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/mean-estimation-from-sample.png
--------------------------------------------------------------------------------
/Lectures/02 Descriptive Statistics/images/regression-to-the-mean-height.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pytopia/Statistics/HEAD/Lectures/02 Descriptive Statistics/images/regression-to-the-mean-height.png
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
160 | #.idea/ 161 | 162 | # custom 163 | *test.ipynb 164 | *.DS_Store -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ![GitHub last commit](https://img.shields.io/github/last-commit/pytopia/statistics) 4 | ![GitHub repo size](https://img.shields.io/github/repo-size/pytopia/statistics) 5 | ![GitHub code size in bytes](https://img.shields.io/github/languages/code-size/pytopia/statistics) 6 | ![GitHub Repo stars](https://img.shields.io/github/stars/pytopia/statistics) 7 | ![GitHub top language](https://img.shields.io/github/languages/top/pytopia/statistics) 8 | [![Website](https://img.shields.io/badge/Visit-Website-blue)](https://www.pytopia.ai) 9 | [![Telegram](https://img.shields.io/badge/Join-Telegram-blue)](https://t.me/pytopia_ai) 10 | [![Instagram](https://img.shields.io/badge/Follow-Instagram-red)](https://instagram.com/pytopia.ai) 11 | [![YouTube](https://img.shields.io/badge/Subscribe-YouTube-red)](https://www.youtube.com/@pytopia) 12 | [![LinkedIn](https://img.shields.io/badge/Follow-LinkedIn-blue)](https://linkedin.com/company/pytopia) 13 | [![Twitter](https://img.shields.io/badge/Follow-Twitter-blue)](https://twitter.com/pytopia_ai) 14 | 15 | Welcome to the Statistics for Machine Learning course repository! This course is specifically designed to provide you with a solid foundation in statistical concepts and techniques that are essential for success in the field of Machine Learning. Whether you're a beginner in Machine Learning or an experienced practitioner looking to strengthen your statistical knowledge, this course has something to offer. 
16 | 17 | ## 🎯 Course Objectives 18 | 19 | By the end of this course, you will: 20 | 21 | - Understand the fundamental concepts and principles of statistics 22 | - Learn how to apply statistical techniques to real-world Machine Learning problems 23 | - Gain proficiency in descriptive and inferential statistics 24 | - Master hypothesis testing and estimation methods 25 | - Acquire hands-on experience in conducting various types of t-tests 26 | - Discover the relevance and importance of statistics in Machine Learning 27 | 28 | ## 📚 Course Contents 29 | 30 | The course is divided into the following chapters: 31 | 32 | 1. Introduction 33 | 2. Descriptive Statistics 34 | 3. Inferential Statistics 35 | 4. Hypothesis Testing 36 | 5. Estimation 37 | 6. t Test For One Sample 38 | 7. t Test for Two Independent Samples 39 | 8. t Test for Two Related Samples 40 | 41 | Each chapter includes a combination of theoretical explanations, practical examples, and hands-on exercises to reinforce your understanding of the concepts and their applications in Machine Learning. 42 | 43 | ## ✅ Prerequisites 44 | 45 | To get the most out of this course, you should have: 46 | 47 | - Basic knowledge of mathematics (algebra and calculus) 48 | - Familiarity with programming (preferably in Python) 49 | - Enthusiasm to learn and explore the fascinating intersection of statistics and Machine Learning! 50 | 51 | # 📚 Learn with Us! 52 | We also offer a [course on these contents](https://www.pytopia.ai/courses/statistics) where learners can interact with peers and instructors, ask questions, and participate in online coding sessions. By registering for the course, you also gain access to our dedicated Telegram group where you can connect with other learners and share your progress. Enroll now and start learning! 
Here are some useful links: 53 | 54 | - [Statistics Course](https://www.pytopia.ai/courses/statistics) 55 | - [Pytopia Public Telegram Group](https://t.me/pytopia_ai) 56 | - [Pytopia Website](https://www.pytopia.ai/) 57 | 58 | [](https://www.pytopia.ai/courses/statistics) 59 | 60 | ## 🚀 Getting Started 61 | 62 | To get started with the course, follow these steps: 63 | 64 | 1. Clone this repository to your local machine using the following command: 65 | ``` 66 | git clone https://github.com/your-username/statistics-for-ml-course.git 67 | ``` 68 | 69 | 2. Navigate to the cloned repository: 70 | ``` 71 | cd statistics-for-ml-course 72 | ``` 73 | 74 | 3. Set up the required dependencies and environment by following the instructions in the `setup.md` file. 75 | 76 | 4. Start exploring the course materials, beginning with the first chapter. 77 | 78 | Throughout the course, you will discover how statistical concepts and techniques are applied in various stages of the Machine Learning pipeline, from data preprocessing and feature selection to model evaluation and hyperparameter tuning. By the end of this course, you will have a strong grasp of the statistical foundations necessary to excel in Machine Learning and tackle real-world problems with confidence. 79 | 80 | # 📞 Contact Information 81 | 82 | Feel free to reach out to us! 
83 | 84 | - 🌐 Website: [pytopia.ai](https://www.pytopia.ai) 85 | - 💬 Telegram: [pytopia_ai](https://t.me/pytopia_ai) 86 | - 🎥 YouTube: [pytopia](https://www.youtube.com/c/pytopia) 87 | - 📸 Instagram: [pytopia.ai](https://www.instagram.com/pytopia.ai) 88 | - 🎓 LinkedIn: [pytopia](https://www.linkedin.com/in/pytopia) 89 | - 🐦 Twitter: [pytopia_ai](https://twitter.com/pytopia_ai) 90 | - 📧 Email: [pytopia.ai@gmail.com](mailto:pytopia.ai@gmail.com) 91 | -------------------------------------------------------------------------------- /Lectures/02 Descriptive Statistics/07 Regression Toward the Mean.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Regression Toward the Mean" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "In the previous lectures, we explored the concepts of regression and learned how to evaluate the accuracy of regression predictions. We discussed the least squares method for finding the best-fitting regression line and the importance of the standard error of estimate and the squared correlation coefficient in assessing the precision of our predictions.\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "In this lecture, we will dive into a fascinating phenomenon known as **regression toward the mean**. 
This concept has important implications for understanding and interpreting changes in data over time, particularly when dealing with **extreme values** or **repeated measurements**.\n" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "Regression toward the mean is a statistical phenomenon that occurs when extreme values or measurements tend to be followed by values that are closer to the average or mean of the entire dataset. This phenomenon can often lead to misinterpretations of data and incorrect conclusions if not properly understood and accounted for.\n" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "Throughout this lecture, we will:\n", 43 | "\n", 44 | "1. Define and explain the concept of regression toward the mean\n", 45 | "2. Explore examples of regression toward the mean in various fields, such as education, sports, and healthcare\n", 46 | "3. Discuss the implications of regression toward the mean in research and decision-making\n", 47 | "4. Learn how to identify and avoid common pitfalls associated with regression toward the mean\n" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "By the end of this lecture, you will have a solid understanding of regression toward the mean and its significance in data analysis and interpretation. This knowledge will help you make more informed decisions and avoid drawing incorrect conclusions based on extreme values or measurements.\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Let's begin by diving deeper into the definition and explanation of regression toward the mean." 
62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "**Table of contents** \n", 69 | "- [Definition and Explanation of Regression Toward the Mean](#toc1_) \n", 70 | " - [(Optional) Mathematical Explanation of the Phenomenon](#toc1_1_) \n", 71 | "- [Examples of Regression Toward the Mean](#toc2_) \n", 72 | " - [Education: Student Performance and Test Scores](#toc2_1_) \n", 73 | " - [Sports: Athlete Performance and Team Rankings](#toc2_2_) \n", 74 | " - [Healthcare: Blood Pressure Measurements and Treatment Effects](#toc2_3_) \n", 75 | " - [Other Real-World Examples](#toc2_4_) \n", 76 | "- [Implications of Regression Toward the Mean](#toc3_) \n", 77 | "- [Conclusion and Key Takeaways](#toc4_) \n", 78 | "\n", 79 | "\n", 86 | "" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "## [Definition and Explanation of Regression Toward the Mean](#toc0_)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "Regression toward the mean is a statistical phenomenon that occurs when extreme values or measurements in a dataset tend to be followed by values that are closer to the mean of the entire dataset. In other words, if an initial observation or measurement is extremely high or low, the subsequent observation or measurement is more likely to be closer to the average." 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "This phenomenon is not due to any specific cause or intervention but rather a result of natural variability and the inherent tendency of data to cluster around the mean. 
It is important to note that regression toward the mean occurs in both directions: extremely high values tend to be followed by lower values, while extremely low values tend to be followed by higher values.\n" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "This phenomenon was first observed by Sir Francis Galton, a renowned statistician who made significant contributions to the field, including the introduction of concepts such as correlation, standard deviation, and percentiles. Galton devoted much of his life to studying variation in human populations, particularly in the context of heredity." 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "In his investigation of the relationship between the heights of parents and their children, Galton plotted the heights of 930 adult children against the mean height of their parents. He discovered that the data did not follow the expected trend line, but instead, the children's heights tended to be closer to the average height than their parents' heights. This finding, known as regression to the mean, suggests that extreme observations are likely to be followed by less extreme ones closer to the true mean.\n" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "For example, if a parent's height is 6 feet, the child's height is likely to be closer to the average height of the population (around 5'9\") than to 6 feet. On the other hand, if the parent's height is 5 feet, the child's height is also likely to be closer to the average height than to 5 feet. This phenomenon is known as regression toward the mean."
136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "Galton's work highlights the importance of considering regression to the mean in statistical analyses, as failing to do so can lead to misleading conclusions. For example, when estimating the impact of speed cameras on reducing fatal road accidents, initial analyses suggested that they saved an average of 100 lives per year. However, further analysis that accounted for regression to the mean found that 50% of the decline in accidents would have occurred regardless of the installation of speed cameras.\n" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "Regression to the mean remains a crucial statistical phenomenon that should not be neglected in data analysis. It serves as a reminder that extreme observations are often followed by less extreme ones, and that careful consideration must be given to the interpretation of data to avoid drawing incorrect conclusions." 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "Understanding the concept of regression toward the mean is crucial for interpreting changes in data over time and avoiding misinterpretations or incorrect conclusions based on extreme values or measurements. In the next section, we will explore real-world examples of regression toward the mean in various fields.\n" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "### [(Optional) Mathematical Explanation of the Phenomenon](#toc0_)\n" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "To understand the mathematical basis of regression toward the mean, let's consider a dataset with a mean of $\\mu$ and a standard deviation of $\\sigma$. 
If we select an observation $x_1$ that is far from the mean (either much higher or much lower), the expected value of the next observation $x_2$ can be expressed as:\n", 178 | "\n", 179 | "$E(x_2 | x_1) = \mu + \rho \frac{\sigma_2}{\sigma_1}(x_1 - \mu)$\n", 180 | "\n", 181 | "where:\n", 182 | "- $E(x_2 | x_1)$ is the expected value of $x_2$ given the value of $x_1$\n", 183 | "- $\rho$ is the correlation coefficient between $x_1$ and $x_2$\n", 184 | "- $\sigma_1$ and $\sigma_2$ are the standard deviations of $x_1$ and $x_2$, respectively\n" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "If the correlation between $x_1$ and $x_2$ is less than 1 (which is usually the case in real-world datasets) and the two measurements have comparable spreads ($\sigma_2 \approx \sigma_1$), the term $\rho \frac{\sigma_2}{\sigma_1}$ will be less than 1. As a result, the expected value of $x_2$ will be closer to the mean $\mu$ than $x_1$ is.\n" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "## [Examples of Regression Toward the Mean](#toc0_)" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "Regression toward the mean can be observed in various fields, including education, sports, healthcare, and many other real-world situations. In this section, we will explore examples of how this phenomenon manifests in different contexts.\n" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "### [Education: Student Performance and Test Scores](#toc0_)\n" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "In educational settings, regression toward the mean can be observed when examining student performance and test scores over time. 
Consider a scenario where students take two similar tests:\n", 220 | "\n", 221 | "- Students who perform exceptionally well on the first test (above the mean) are more likely to have lower scores on the second test, closer to the mean.\n", 222 | "- Students who perform poorly on the first test (below the mean) are more likely to have higher scores on the second test, closer to the mean.\n", 223 | "\n", 224 | "This phenomenon can lead to misinterpretations, such as attributing improvement in low-performing students to an intervention or teaching method, when in reality, the improvement may be due to regression toward the mean.\n" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "### [Sports: Athlete Performance and Team Rankings](#toc0_)\n" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "Regression toward the mean is common in sports, particularly when considering athlete performance and team rankings over multiple seasons or competitions.\n", 239 | "\n", 240 | "- An athlete who performs exceptionally well in one season (above their average) is more likely to have a performance closer to their average in the following season.\n", 241 | "- A team that ranks very high or very low in one season is more likely to have a ranking closer to the middle of the pack in the next season.\n", 242 | "\n", 243 | "This phenomenon can lead to overestimating the impact of training or coaching changes on athlete or team performance, when the changes observed may be due to regression toward the mean.\n" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "### [Healthcare: Blood Pressure Measurements and Treatment Effects](#toc0_)\n" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "In healthcare, regression toward the mean can be observed in various physiological measurements, 
such as blood pressure readings.\n", 258 | "\n", 259 | "- Patients with extremely high blood pressure readings in one visit are more likely to have lower readings in subsequent visits, closer to their average blood pressure.\n", 260 | "- Patients with extremely low blood pressure readings in one visit are more likely to have higher readings in subsequent visits, closer to their average blood pressure.\n", 261 | "\n", 262 | "This phenomenon can lead to overestimating the effectiveness of treatments or interventions, as the observed changes in blood pressure may be partially due to regression toward the mean rather than the treatment itself.\n" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "### [Other Real-World Examples](#toc0_)\n" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "Regression toward the mean can be observed in numerous other real-world situations, such as:\n", 277 | "\n", 278 | "- **Stock market performance**: Companies with extremely high or low stock returns in one period are more likely to have returns closer to the market average in the following period.\n", 279 | "- **Customer satisfaction surveys**: Customers who provide extremely positive or negative feedback in one survey are more likely to provide feedback closer to the average in subsequent surveys.\n", 280 | "- **Employee performance evaluations**: Employees with exceptionally high or low performance ratings in one evaluation period are more likely to have ratings closer to the average in the next evaluation period.\n" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "In all these examples, it is essential to recognize the potential influence of regression toward the mean when interpreting changes or differences in data over time. 
Failing to account for this phenomenon can lead to incorrect conclusions and misguided decision-making.\n" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "In the next section, we will discuss the implications of regression toward the mean and how it can lead to misinterpretations and incorrect conclusions if not properly understood and accounted for." 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "## [Implications of Regression Toward the Mean](#toc0_)" 302 | ] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "Regression toward the mean can have significant implications for data analysis, interpretation, and decision-making. If not properly understood and accounted for, this phenomenon can lead to various issues:\n", 309 | "\n", 310 | "- **Misinterpretation of data and incorrect conclusions**\n", 311 | " - Failing to consider regression toward the mean when observing changes or differences in data over time can lead to misinterpretations and incorrect conclusions.\n", 312 | " - Example: Attributing improvement in low-performing students solely to an intervention, when some improvement may be due to regression toward the mean.\n", 313 | "\n", 314 | "- **Overestimating the effectiveness of interventions or treatments**\n", 315 | " - Interventions targeted at individuals with extreme initial values may appear more effective than they actually are, due to regression toward the mean.\n", 316 | " - To avoid this issue, use appropriate research designs (e.g., randomized controlled trials) and compare outcomes of the intervention group to a control group.\n", 317 | "\n", 318 | "- **Underestimating the impact of extreme values or measurements**\n", 319 | " - Extreme values may be dismissed as outliers or attributed to random chance, when they may represent genuine and meaningful deviations from the norm.\n", 320 | " - 
Carefully examine extreme values and consider their potential causes and implications, as they may provide valuable insights or opportunities.\n" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "Understanding the implications of regression toward the mean is crucial for accurate data analysis, interpretation, and decision-making. By recognizing the potential pitfalls, researchers and decision-makers can take steps to account for this phenomenon and draw more reliable conclusions from their data.\n" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "## [Conclusion and Key Takeaways](#toc0_)" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "In this lecture, we have explored the concept of regression toward the mean, its implications, and strategies for dealing with this phenomenon in various contexts. Let's recap the main points covered:\n", 342 | "\n", 343 | "- Regression toward the mean is a statistical phenomenon where extreme values or measurements tend to be followed by values closer to the mean of the entire dataset.\n", 344 | "- This phenomenon can be observed in various fields, such as education, sports, healthcare, and other real-world situations.\n", 345 | "- Failing to account for regression toward the mean can lead to misinterpretations, incorrect conclusions, and flawed decision-making.\n", 346 | "- The implications of regression toward the mean include overestimating the effectiveness of interventions or treatments, underestimating the impact of extreme values, and misinterpreting data.\n", 347 | "- To identify and avoid pitfalls associated with regression toward the mean, researchers and decision-makers should use control groups or comparison data, consider the role of chance and random variation, and employ appropriate statistical methods.\n" 348 | ] 349 | }, 350 | { 351 | "cell_type": 
"markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "Understanding regression toward the mean is crucial for accurate data analysis and interpretation. By recognizing this phenomenon and its potential implications, researchers and decision-makers can draw more reliable conclusions and make better-informed decisions based on the available evidence.\n" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": {}, 360 | "source": [ 361 | "In practice, this understanding can lead to:\n", 362 | "- More accurate evaluations of interventions and treatments\n", 363 | "- Better allocation of resources based on genuine effects rather than statistical artifacts\n", 364 | "- More effective identification of meaningful deviations from the norm\n", 365 | "- Improved decision-making in various domains, from education to healthcare to business\n" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "As you continue your journey in data analysis and interpretation, keep the concept of regression toward the mean in mind. Be cautious when interpreting changes or differences in data over time, and always consider alternative explanations for observed patterns.\n" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "Encourage yourself to apply this knowledge in your future research and decision-making. By doing so, you will be better equipped to navigate the complexities of data analysis and make more informed and reliable conclusions.\n" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "Remember, understanding regression toward the mean is not just a theoretical exercise – it has real-world implications that can impact the effectiveness of interventions, the allocation of resources, and the overall quality of decision-making. 
By mastering this concept and applying it in practice, you can contribute to more accurate and meaningful insights in your field of interest." 387 | ] 388 | } 389 | ], 390 | "metadata": { 391 | "kernelspec": { 392 | "display_name": "py310", 393 | "language": "python", 394 | "name": "python3" 395 | }, 396 | "language_info": { 397 | "codemirror_mode": { 398 | "name": "ipython", 399 | "version": 3 400 | }, 401 | "file_extension": ".py", 402 | "mimetype": "text/x-python", 403 | "name": "python", 404 | "nbconvert_exporter": "python", 405 | "pygments_lexer": "ipython3", 406 | "version": "3.10.12" 407 | } 408 | }, 409 | "nbformat": 4, 410 | "nbformat_minor": 2 411 | } 412 | -------------------------------------------------------------------------------- /Lectures/01 Introduction/03 Types of Variables.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Types of Variables" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "In the world of statistics and data analysis, understanding the concept of variables is crucial. A **variable** is a characteristic or property that can take on different values. 
For example, when studying a group of people, variables could include age, height, weight, gender, or income.\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "Variables are essential because they help us:\n", 29 | "- Organize and categorize data\n", 30 | "- Identify patterns and relationships\n", 31 | "- Make predictions and draw conclusions\n", 32 | "- Communicate findings effectively\n" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "To work with variables effectively, it's important to understand the different types of variables and their properties. This knowledge will guide you in choosing the appropriate statistical methods and techniques for analyzing your data.\n" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "In this lecture, we'll explore the main types of variables:\n", 47 | "1. Qualitative and Quantitative Variables\n", 48 | "2. Discrete and Continuous Variables\n", 49 | "3. Independent and Dependent Variables\n", 50 | "4. Confounding Variables" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "We'll also discuss observational studies and the concept of confounding variables.\n" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "By the end of this lecture, you'll have a solid foundation in understanding the different types of variables and their roles in statistical analysis. This knowledge will empower you to work with data more effectively and make informed decisions based on your findings.\n" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "Let's dive in! 
🌟" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "**Table of contents** \n", 79 | "- [Qualitative and Quantitative Variables](#toc1_) \n", 80 | " - [Qualitative (Categorical) Variables](#toc1_1_) \n", 81 | " - [Quantitative (Numerical) Variables](#toc1_2_) \n", 82 | "- [Discrete and Continuous Variables](#toc2_) \n", 83 | " - [Discrete Variables](#toc2_1_) \n", 84 | " - [Continuous Variables](#toc2_2_) \n", 85 | " - [Approximate Numbers and Rounding Off](#toc2_3_) \n", 86 | "- [Independent and Dependent Variables](#toc3_) \n", 87 | " - [Independent Variables](#toc3_1_) \n", 88 | " - [Dependent Variables](#toc3_2_) \n", 89 | " - [Identifying Independent and Dependent Variables](#toc3_3_) \n", 90 | "- [Observational Studies and Confounding Variables](#toc4_) \n", 91 | " - [Observational Studies](#toc4_1_) \n", 92 | " - [Confounding Variables](#toc4_2_) \n", 93 | "- [Summary](#toc5_) \n", 94 | "\n", 95 | "\n", 102 | "" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "## [Qualitative and Quantitative Variables](#toc0_)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "As we discussed in the previous lecture on types of data, similar to data, variables can be classified into two main types: qualitative and quantitative. Let's explore each type in more detail." 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "### [Qualitative (Categorical) Variables](#toc0_)\n" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "Qualitative variables, also known as categorical variables, represent characteristics or attributes that cannot be quantified numerically. These variables are often expressed in words or labels." 
138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "Qualitative variables can be further divided into two subcategories:\n", 145 | "\n", 146 | "1. **Nominal Variables**: Nominal variables have no inherent order or ranking. Examples include:\n", 147 | " - Gender (male, female, non-binary)\n", 148 | " - Eye color (blue, brown, green)\n", 149 | " - Marital status (single, married, divorced)\n", 150 | "\n", 151 | "2. **Ordinal Variables**: Ordinal variables have a natural order or ranking, but the differences between values are not necessarily equal. Examples include:\n", 152 | " - Education level (high school, bachelor's, master's, doctorate)\n", 153 | " - Income bracket (low, medium, high)\n", 154 | " - Likert scale responses (strongly disagree, disagree, neutral, agree, strongly agree)\n" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "### [Quantitative (Numerical) Variables](#toc0_)\n" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "Quantitative variables, also known as numerical variables, represent characteristics that can be measured and expressed numerically. These variables can be further divided into two subcategories:\n", 169 | "\n", 170 | "1. **Discrete Variables**: Discrete variables can only take on specific, separate values, often integers. Examples include:\n", 171 | " - Number of siblings (0, 1, 2, 3, ...)\n", 172 | " - Number of cars owned (0, 1, 2, ...)\n", 173 | " - Number of students in a class (25, 26, 27, ...)\n", 174 | "\n", 175 | "2. **Continuous Variables**: Continuous variables can take on any value within a specific range, including fractional or decimal values. 
Examples include:\n", 176 | " - Height (1.65 m, 1.78 m, 1.82 m, ...)\n", 177 | " - Weight (65.3 kg, 72.1 kg, 80.5 kg, ...)\n", 178 | " - Time (2.5 seconds, 3.8 seconds, 4.2 seconds, ...)\n" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "Here's a simple example in Python to illustrate the difference between discrete and continuous variables:\n" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 1, 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [ 194 | "# Discrete variable: Number of siblings\n", 195 | "siblings = [0, 1, 2, 3, 1, 2, 0, 3, 2, 1]\n", 196 | "\n", 197 | "# Continuous variable: Height (in meters)\n", 198 | "height = [1.65, 1.78, 1.82, 1.60, 1.75, 1.68, 1.84, 1.72, 1.80, 1.77]" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "In the next section, we'll explore discrete and continuous variables in more detail." 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "## [Discrete and Continuous Variables](#toc0_)" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "Let's take a closer look at the two types of quantitative variables: discrete and continuous variables.\n" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "### [Discrete Variables](#toc0_)\n" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "- **Definition**: Discrete variables can only take on specific, separate values, often integers or whole numbers. 
These variables usually involve counting.\n", 241 | "\n", 242 | "- **Examples**:\n", 243 | " - Number of pets owned (0, 1, 2, 3, ...)\n", 244 | " - Number of students absent in a class (0, 1, 2, ...)\n", 245 | " - Number of cars in a parking lot (10, 11, 12, ...)\n", 246 | "\n", 247 | "- **Characteristics of Discrete Variables**:\n", 248 | " - Values are distinct and separate\n", 249 | " - Often represented by integers or whole numbers\n", 250 | " - Gaps exist between values (e.g., you can't have 1.5 pets)\n" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "### [Continuous Variables](#toc0_)\n" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "- **Definition**: Continuous variables can take on any value within a specific range, including fractional or decimal values. These variables usually involve measuring.\n", 265 | "\n", 266 | "- **Examples**:\n", 267 | " - Height (1.65 m, 1.78 m, 1.82 m, ...)\n", 268 | " - Weight (65.3 kg, 72.1 kg, 80.5 kg, ...)\n", 269 | " - Time taken to complete a task (2.5 seconds, 3.8 seconds, 4.2 seconds, ...)\n", 270 | "\n", 271 | "- **Characteristics of Continuous Variables**:\n", 272 | " - Values can take on any number within a range\n", 273 | " - Often represented by real numbers (including fractions and decimals)\n", 274 | " - No gaps exist between values (e.g., height can be 1.75 m or 1.76 m)\n" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "In the next section, we'll explore the concepts of independent and dependent variables in the context of experiments and observational studies." 
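Before moving on, the "gaps" versus "no gaps" contrast can be checked directly in code (with made-up values): the midpoint of two valid continuous values is itself a valid value, while the midpoint of two counts is not.

```python
# Made-up values illustrating the gap / no-gap distinction.
pets_a, pets_b = 1, 2            # discrete: number of pets (a count)
height_a, height_b = 1.75, 1.76  # continuous: height in meters

# The midpoint of two valid heights is itself a valid height.
midpoint_height = (height_a + height_b) / 2
print(midpoint_height)  # roughly 1.755: a perfectly possible height

# The midpoint of two counts falls in a gap between valid values.
midpoint_pets = (pets_a + pets_b) / 2
print(midpoint_pets)    # 1.5 pets is not a possible value
```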
282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "## [Independent and Dependent Variables](#toc0_)" 289 | ] 290 | }, 291 | { 292 | "cell_type": "markdown", 293 | "metadata": {}, 294 | "source": [ 295 | "When conducting experiments or observational studies, it's essential to understand the roles of independent and dependent variables.\n" 296 | ] 297 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "### [Independent Variables](#toc0_)\n" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "- **Definition**: An independent variable is a variable that is manipulated or controlled by the investigator in an experiment. 
It is believed to have an effect on the dependent variable.\n", 331 | "\n", 332 | "- **Role in Experiments**:\n", 333 | " - The investigator deliberately changes or manipulates the independent variable to observe its effect on the dependent variable.\n", 334 | " - Different levels or conditions of the independent variable are assigned to different groups of subjects.\n", 335 | "\n", 336 | "- **Manipulated by the Investigator**:\n", 337 | " - The investigator has control over the independent variable and can decide which subjects receive which levels or conditions.\n", 338 | " - Example: In a study on the effect of sleep duration on memory, the investigator might assign participants to either a 6-hour sleep group or an 8-hour sleep group (independent variable).\n" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": {}, 344 | "source": [ 345 | "### [Dependent Variables](#toc0_)\n" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "- **Definition**: A dependent variable is a variable that is measured, counted, or recorded by the investigator in an experiment. 
It is believed to be affected by the independent variable.\n" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "- **Role in Experiments**:\n", 360 | " - The dependent variable is the outcome or response that the investigator measures to determine the effect of the independent variable.\n", 361 | " - Changes in the dependent variable are presumed to be caused by the manipulation of the independent variable.\n", 362 | "\n", 363 | "- **Measured, Counted, or Recorded by the Investigator**:\n", 364 | " - The investigator observes and records the values of the dependent variable for each subject or group in the experiment.\n", 365 | " - Example: In the sleep duration study, the investigator might measure the participants' memory performance (dependent variable) using a memory test.\n" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "### [Identifying Independent and Dependent Variables](#toc0_)\n" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "To identify the independent and dependent variables in a study, ask yourself:\n", 380 | "- What is being manipulated or changed by the investigator? (Independent variable)\n", 381 | "- What is being measured or observed as a result of the manipulation? 
(Dependent variable)\n" 382 | ] 383 | }, 384 | { 385 | "cell_type": "markdown", 386 | "metadata": {}, 387 | "source": [ 388 | "Understanding the roles of independent and dependent variables is crucial for designing and interpreting experiments and observational studies.\n" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "## [Observational Studies and Confounding Variables](#toc0_)" 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "In addition to experiments, researchers often conduct observational studies to investigate relationships between variables. However, observational studies have limitations and can be affected by confounding variables.\n" 403 | ] 404 | }, 405 | { 406 | "cell_type": "markdown", 407 | "metadata": {}, 408 | "source": [ 409 | "" 410 | ] 411 | }, 412 | { 413 | "cell_type": "markdown", 414 | "metadata": {}, 415 | "source": [ 416 | "### [Observational Studies](#toc0_)\n" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "- **Definition**: An observational study is a type of study where the investigator observes and measures variables without manipulating them. 
The investigator does not control or assign the independent variable.\n", 424 | "\n", 425 | "- **Purpose**:\n", 426 | " - To examine relationships between variables as they naturally occur.\n", 427 | " - To generate hypotheses for future experimental research.\n", 428 | "\n", 429 | "- **Limitations in Determining Cause-Effect Relationships**:\n", 430 | " - Observational studies cannot definitively establish cause-effect relationships because the investigator does not manipulate the independent variable.\n", 431 | " - Other factors (confounding variables) may influence the relationship between the variables, making it difficult to determine the true cause of the observed effects.\n" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "### [Confounding Variables](#toc0_)\n" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "- **Definition**: A confounding variable is an extraneous variable that is related to both the independent and dependent variables in a study. 
It can influence the outcome of the study and make it difficult to interpret the results accurately.\n", 446 | "\n", 447 | "- **Impact on Study Interpretation**:\n", 448 | " - Confounding variables can lead to misleading conclusions about the relationship between the independent and dependent variables.\n", 449 | " - They can create the appearance of a relationship between variables when none exists, or they can mask a true relationship.\n", 450 | "\n", 451 | "- **Avoiding Confounding Variables**:\n", 452 | " - *Random Assignment*: In experiments, randomly assigning subjects to different levels of the independent variable helps to distribute potential confounding variables evenly across groups, minimizing their impact.\n", 453 | " - *Standardization*: Keeping all other variables constant across groups (except for the independent variable) helps to control for potential confounding variables.\n" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "Here's an example of how confounding variables can affect the interpretation of a study:\n", 461 | "\n", 462 | "> Suppose an observational study finds a positive correlation between ice cream sales and drowning incidents. It might be tempting to conclude that eating ice cream causes drowning. However, a confounding variable, such as hot weather, could be responsible for both increased ice cream sales and more people swimming (leading to more drowning incidents). 
In this case, the hot weather is the confounding variable that influences both the independent variable (ice cream sales) and the dependent variable (drowning incidents).\n" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": {}, 468 | "source": [ 469 | "" 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": {}, 475 | "source": [ 476 | "Hot weather (confounding variable) influences both ice cream sales (independent variable) and drowning incidents (dependent variable), creating a spurious relationship between the two variables.\n" 477 | ] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "Understanding the limitations of observational studies and the impact of confounding variables is crucial for accurately interpreting research findings and drawing appropriate conclusions." 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": {}, 489 | "source": [ 490 | "## [Summary](#toc0_)\n" 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": {}, 496 | "source": [ 497 | "In this lecture, we explored the different types of variables and their roles in statistical analysis and research. Let's recap the key points:\n", 498 | "\n", 499 | "- Variables are characteristics or properties that can take on different values, and they are essential for organizing data, identifying patterns, and making informed decisions.\n", 500 | "\n", 501 | "- Variables can be classified into two main categories:\n", 502 | " - **Qualitative (Categorical) Variables**: Variables that represent characteristics or attributes that cannot be quantified numerically. They can be further divided into nominal (no inherent order) and ordinal (natural order, but differences not necessarily equal) variables.\n", 503 | " - **Quantitative (Numerical) Variables**: Variables that represent characteristics that can be measured and expressed numerically. 
They can be further divided into discrete (specific, separate values) and continuous (any value within a range) variables.\n", 504 | "\n", 505 | "- When working with continuous variables, it's important to consider the level of precision required and be aware that the values are often approximations due to rounding off.\n", 506 | "\n", 507 | "- In experiments and observational studies, variables can be classified as:\n", 508 | " - **Independent Variables**: Variables that are manipulated or controlled by the investigator, believed to have an effect on the dependent variable.\n", 509 | " - **Dependent Variables**: Variables that are measured, counted, or recorded by the investigator, believed to be affected by the independent variable.\n", 510 | "\n", 511 | "- Observational studies are used to examine relationships between variables as they naturally occur, but they have limitations in determining cause-effect relationships due to the presence of confounding variables.\n", 512 | "\n", 513 | "- **Confounding Variables**: Extraneous variables that are related to both the independent and dependent variables, which can influence the outcome of a study and lead to misleading conclusions.\n", 514 | "\n", 515 | "- To minimize the impact of confounding variables, researchers can use random assignment (in experiments) and standardization (keeping other variables constant).\n" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": {}, 521 | "source": [ 522 | "Understanding the different types of variables and their roles is crucial for effective data analysis and interpretation. 
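The last point can be illustrated with a minimal sketch of random assignment, assuming a hypothetical pool of 100 participants and age as the potential confounding variable:

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical pool of 100 participants; age is a potential confounder.
participants = [{"id": i, "age": random.randint(18, 65)} for i in range(100)]

# Random assignment: shuffle the pool, then split it in half.
random.shuffle(participants)
treatment = participants[:50]
control = participants[50:]

def mean_age(group):
    """Average age of a group of participants."""
    return sum(p["age"] for p in group) / len(group)

# The two group means should come out close, so any later difference in
# outcomes is unlikely to be explained by age.
print(f"treatment mean age: {mean_age(treatment):.1f}")
print(f"control mean age:   {mean_age(control):.1f}")
```

Because the shuffle ignores age entirely, the confounder is distributed across both groups by chance rather than by any systematic rule.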
By correctly identifying and classifying variables, researchers can:\n", 523 | "- Choose appropriate statistical methods and techniques\n", 524 | "- Design experiments and observational studies effectively\n", 525 | "- Control for confounding variables\n", 526 | "- Accurately interpret research findings and draw valid conclusions\n" 527 | ] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "Mastering the concepts of variable types empowers researchers and data analysts to make informed decisions, uncover meaningful insights, and communicate their findings effectively.\n" 534 | ] 535 | } 536 | ], 537 | "metadata": { 538 | "kernelspec": { 539 | "display_name": "py310", 540 | "language": "python", 541 | "name": "python3" 542 | }, 543 | "language_info": { 544 | "codemirror_mode": { 545 | "name": "ipython", 546 | "version": 3 547 | }, 548 | "file_extension": ".py", 549 | "mimetype": "text/x-python", 550 | "name": "python", 551 | "nbconvert_exporter": "python", 552 | "pygments_lexer": "ipython3", 553 | "version": "3.10.12" 554 | } 555 | }, 556 | "nbformat": 4, 557 | "nbformat_minor": 2 558 | } 559 | -------------------------------------------------------------------------------- /Lectures/03 Introduction to Inferential Statistics/01 Population and Sample.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Populations, Samples, and Study Design" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Welcome to the first lecture of Chapter 1: Introduction to Inferential Statistics! 
In this lecture, we will lay the groundwork for understanding inferential statistics by exploring the concepts of populations, samples, and study design.\n" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "Inferential statistics is a branch of statistics that allows us to make generalizations about a population based on a sample of data. In other words, it enables us to draw conclusions about a larger group (the population) by analyzing a smaller subset of that group (the sample). This is particularly useful when it is impractical or impossible to collect data from an entire population.\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "For example, suppose we want to know the average height of all adults in a country. It would be incredibly time-consuming and expensive to measure every single adult. Instead, we can use inferential statistics to estimate the average height by measuring a representative sample of adults and then generalizing our findings to the entire population.\n" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "The accuracy of our generalizations depends on several factors, including the size and representativeness of our sample, as well as the design of our study. 
In inferential statistics, we use probability theory to quantify the uncertainty associated with our estimates and to make informed decisions based on our data.\n" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "Some key concepts in inferential statistics include:\n", 43 | "\n", 44 | "- **Hypothesis testing**: A process of evaluating claims or hypotheses about a population based on sample data.\n", 45 | "- **Confidence intervals**: A range of values that is likely to contain the true population parameter with a certain level of confidence.\n", 46 | "- **p-values**: The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.\n" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "Throughout this course, we will explore these concepts in more detail and learn how to apply them to real-world problems.\n" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "In this lecture, we will cover the following topics:\n", 61 | "\n", 62 | "1. **Populations and Samples**\n", 63 | " - Defining populations\n", 64 | " - Defining samples\n", 65 | " - Importance of representative samples\n", 66 | "\n", 67 | "2. **Random Sampling**\n", 68 | " - Simple random sampling\n", 69 | " - Stratified random sampling\n", 70 | " - Cluster sampling\n", 71 | " - Systematic sampling\n", 72 | "\n", 73 | "3. **Random Assignment**\n", 74 | " - Importance of random assignment in experiments\n", 75 | " - Distinguishing between random sampling and random assignment\n", 76 | "\n", 77 | "4. 
**Surveys and Experiments**\n", 78 | " - Observational studies and surveys\n", 79 | " - Experiments and causality\n" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "By the end of this lecture, you will have a solid understanding of the key concepts and terminology related to populations, samples, and study design. This knowledge will serve as a foundation for the rest of the course, where we will delve deeper into inferential statistics and learn how to apply these concepts to real-world data.\n" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "Let's begin by defining populations and samples, and exploring why it's crucial to have a representative sample when making inferences about a population." 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "**Table of contents** \n", 101 | "- [Populations and Samples](#toc1_) \n", 102 | "- [Random Sampling](#toc2_) \n", 103 | " - [A. Probability Sampling](#toc2_1_) \n", 104 | " - [B. 
Non-Probability Sampling](#toc2_2_) \n", 105 | "- [Random Assignment](#toc3_) \n", 106 | " - [Importance of Random Assignment in Experiments](#toc3_1_) \n", 107 | " - [Distinguishing Between Random Sampling and Random Assignment](#toc3_2_) \n", 108 | "- [Observational Studies, Experiments, and Inferential Statistics](#toc4_) \n", 109 | " - [Observational Studies and Surveys](#toc4_1_) \n", 110 | " - [Experiments](#toc4_2_) \n", 111 | " - [Linking to Inferential Statistics](#toc4_3_) \n", 112 | "\n", 113 | "\n", 120 | "" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "## [Populations and Samples](#toc0_)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "In this section, we'll define populations and samples, and discuss the importance of having a representative sample when making inferences about a population.\n" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "A **population** is the entire group of individuals, objects, or events that we are interested in studying. It is the complete set of elements that share a common characteristic. 
For example, if we want to study the average height of all students in a university, the population would be all students enrolled in that university.\n" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "In statistical notation, we often use the following symbols when discussing populations:\n", 149 | "\n", 150 | "- $N$: The size of the population (total number of elements)\n", 151 | "- $\\mu$ (mu): The population mean (average value of a characteristic in the population)\n", 152 | "- $\\sigma$ (sigma): The population standard deviation (measure of variability in the population)\n" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "A **sample** is a subset of the population that is selected for study. It is a portion of the population that we use to make inferences about the entire population. For example, if we want to estimate the average height of all students in a university, we might select a sample of 100 students to measure their heights.\n" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "In statistical notation, we often use the following symbols when discussing samples:\n", 167 | "\n", 168 | "- $n$: The size of the sample (number of elements in the sample)\n", 169 | "- $\\bar{x}$ (x-bar): The sample mean (average value of a characteristic in the sample)\n", 170 | "- $s$: The sample standard deviation (measure of variability in the sample)\n" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "When making inferences about a population based on a sample, it is crucial that the sample is **representative** of the population. 
A representative sample accurately reflects the characteristics of the population, such as its diversity and proportions.\n" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "If a sample is not representative, it can lead to **biased** results and inaccurate conclusions about the population. Some common types of bias include:\n", 185 | "\n", 186 | "- **Selection bias**: When the sample is not randomly selected, leading to an overrepresentation or underrepresentation of certain groups.\n", 187 | "- **Non-response bias**: When individuals who are selected for the sample do not respond or participate, leading to a sample that may not be representative of the population.\n", 188 | "- **Voluntary response bias**: When individuals who voluntarily participate in a study are systematically different from those who do not participate.\n" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "To minimize bias and ensure a representative sample, researchers often use **random sampling** techniques, which we will discuss in the next section.\n" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "In summary, a population is the entire group of interest, while a sample is a subset of that population used to make inferences. It is essential to have a representative sample to accurately draw conclusions about the population and minimize bias in our results." 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "## [Random Sampling](#toc0_)" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "Random sampling is a technique used to select a sample from a population in such a way that each element of the population has an equal chance of being included in the sample. 
This helps ensure that the sample is representative of the population, minimizing bias and allowing for accurate inferences. Sampling methods more broadly fall into two categories: probability sampling, which includes the random sampling techniques just described, and non-probability sampling, which does not rely on random selection.\n" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "### [A. Probability Sampling](#toc0_)\n" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "Probability sampling involves selecting elements from a population based on a known probability. This allows researchers to calculate the likelihood of each element being included in the sample and to make statistical inferences about the population. Some common types of probability sampling include:\n", 231 | "\n", 232 | "1. **Simple Random Sampling (SRS)**:\n", 233 | " - In SRS, each element in the population has an equal chance of being selected for the sample.\n", 234 | " - Elements are selected independently of each other.\n", 235 | " - Example: Using a random number generator to select 50 students from a list of all students in a university.\n", 236 | "\n", 237 | "2. **Systematic Sampling**:\n", 238 | " - Elements are selected from a population at regular intervals after a random starting point.\n", 239 | " - The sampling interval is calculated as $k = \\frac{N}{n}$, where $N$ is the population size and $n$ is the desired sample size.\n", 240 | " - Example: Selecting every 10th student from an alphabetical list of all students in a university.\n", 241 | "\n", 242 | "3. 
**Stratified Sampling**:\n", 243 | " - The population is divided into mutually exclusive and exhaustive subgroups (strata) based on a specific characteristic.\n", 244 | " - A simple random sample is then taken from each stratum.\n", 245 | " - This ensures that each subgroup is proportionately represented in the sample.\n", 246 | " - Example: Dividing a university's student population by major and selecting a random sample from each major.\n", 247 | "\n", 248 | "4. **Cluster Sampling**:\n", 249 | " - The population is divided into naturally occurring groups (clusters), such as schools or city blocks.\n", 250 | " - A random sample of clusters is selected, and all elements within the selected clusters are included in the sample.\n", 251 | " - This is useful when a complete list of the population is not available or when elements are geographically dispersed.\n", 252 | " - Example: Randomly selecting 10 classrooms from a school district and including all students in those classrooms in the sample.\n", 253 | "\n", 254 | "5. **Multi-stage Sampling**:\n", 255 | " - This involves a combination of two or more probability sampling techniques.\n", 256 | " - It is often used when the population is large and geographically dispersed.\n", 257 | " - Example: First using cluster sampling to select schools within a district, then using stratified sampling to select students within each school based on grade level.\n" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "### [B. Non-Probability Sampling](#toc0_)\n" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "Non-probability sampling involves selecting elements from a population based on non-random criteria, such as convenience or the researcher's judgment. This type of sampling does not allow for statistical inferences about the population, as the sample may not be representative. 
Some common types of non-probability sampling include:\n", 272 | "\n", 273 | "1. **Convenience Sampling**:\n", 274 | " - Elements are selected based on their ease of accessibility or convenience to the researcher.\n", 275 | " - This method is quick and inexpensive but may lead to biased results.\n", 276 | " - Example: Surveying students in a cafeteria because they are readily available.\n", 277 | "\n", 278 | "2. **Purposive Sampling**:\n", 279 | " - Elements are selected based on the researcher's judgment or specific criteria.\n", 280 | " - This method is useful when the researcher needs a sample with specific characteristics.\n", 281 | " - Types of purposive sampling include:\n", 282 | " a. **Judgmental Sampling**: The researcher selects elements based on their expertise or judgment.\n", 283 | " b. **Snowball Sampling**: Participants are asked to recommend other individuals who meet the study criteria.\n", 284 | " c. **Quota Sampling**: The researcher selects elements based on predetermined quotas for specific subgroups.\n", 285 | "\n", 286 | "3. **Voluntary Response Sampling**:\n", 287 | " - Participants self-select into the sample by responding to an open invitation.\n", 288 | " - This method is prone to self-selection bias, as those who volunteer may differ from those who do not.\n", 289 | " - Example: An online survey shared on social media, where anyone can choose to participate.\n" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "In summary, probability sampling techniques, such as simple random sampling, systematic sampling, stratified sampling, cluster sampling, and multi-stage sampling, are preferred when making statistical inferences about a population. Non-probability sampling techniques, such as convenience sampling, purposive sampling, and voluntary response sampling, may be used in certain situations but do not allow for generalizations to the entire population due to potential biases." 
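As a rough sketch of the probability-sampling techniques above (the population of 500 students and their majors are hypothetical), Python's standard `random` module is enough to implement simple random, systematic, and stratified sampling:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: 500 students, each tagged with a major.
population = [{"id": i, "major": random.choice(["math", "biology", "history"])}
              for i in range(500)]

# 1. Simple random sampling: every student is equally likely to be chosen.
srs = random.sample(population, 50)

# 2. Systematic sampling: every k-th student after a random starting point.
k = len(population) // 50        # sampling interval k = N / n
start = random.randrange(k)
systematic = population[start::k]

# 3. Stratified sampling: a simple random sample from each major (stratum).
strata = {}
for student in population:
    strata.setdefault(student["major"], []).append(student)
stratified = [s for group in strata.values()
              for s in random.sample(group, 10)]

print(len(srs), len(systematic), len(stratified))
```

Note that `random.sample` draws without replacement, matching the requirement that SRS selects distinct elements, and that the stratified sample here uses equal-size strata draws for simplicity rather than proportional allocation.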
297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "## [Random Assignment](#toc0_)" 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "metadata": {}, 309 | "source": [ 310 | "Random assignment is a crucial aspect of experimental design that involves randomly allocating participants to different treatment groups or conditions. This helps ensure that any differences observed between the groups can be attributed to the treatment itself, rather than pre-existing differences among participants.\n" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "### [Importance of Random Assignment in Experiments](#toc0_)\n" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": {}, 323 | "source": [ 324 | "Random assignment plays a vital role in establishing cause-and-effect relationships in experiments. By randomly assigning participants to different groups, researchers can minimize the impact of confounding variables and potential biases. Some key benefits of random assignment include:\n", 325 | "\n", 326 | "1. **Minimizing confounding variables**: Random assignment helps distribute potential confounding variables (such as age, gender, or prior knowledge) evenly across treatment groups. This ensures that these variables do not systematically influence the results of the experiment.\n", 327 | "\n", 328 | "2. **Reducing bias**: Random assignment helps prevent researcher bias in assigning participants to groups. It eliminates the possibility of the researcher consciously or unconsciously placing certain participants in a particular group based on their characteristics or expected outcomes.\n", 329 | "\n", 330 | "3. **Enhancing internal validity**: By minimizing the impact of confounding variables and reducing bias, random assignment strengthens the internal validity of an experiment. 
This means that any observed differences between the groups can be more confidently attributed to the treatment itself.\n", 331 | "\n", 332 | "4. **Enabling causal inferences**: Random assignment is a key component in establishing cause-and-effect relationships. When combined with a well-designed experiment, random assignment allows researchers to make causal inferences about the impact of the treatment on the dependent variable.\n" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "### [Distinguishing Between Random Sampling and Random Assignment](#toc0_)\n" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "Although random sampling and random assignment both involve the use of randomness, they serve different purposes and are used in different contexts.\n", 347 | "\n", 348 | "- **Random sampling** is used to select a sample from a population. It ensures that each element in the population has an equal chance of being included in the sample, which helps make the sample representative of the population. Random sampling is primarily used in observational studies and surveys to make generalizations about the population.\n", 349 | "\n", 350 | "- **Random assignment** is used to allocate participants to different treatment groups or conditions within an experiment. It ensures that each participant has an equal chance of being assigned to any of the groups, which helps minimize the impact of confounding variables and potential biases. 
Random assignment is primarily used in experimental studies to establish cause-and-effect relationships.\n" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "To better understand the difference, consider the following example:\n", 358 | "\n", 359 | "- A researcher wants to study the effectiveness of a new teaching method on student performance.\n", 360 | "- First, the researcher uses random sampling to select a sample of students from the population of all students in a school district. This ensures that the sample is representative of the population.\n", 361 | "- Then, the researcher uses random assignment to allocate the sampled students to either the treatment group (which receives the new teaching method) or the control group (which receives the standard teaching method). This ensures that any differences in student performance between the groups can be attributed to the teaching method itself.\n" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "In summary, random assignment is essential for establishing cause-and-effect relationships in experiments by minimizing the impact of confounding variables and reducing bias. It is important to distinguish between random sampling, which is used to select a representative sample from a population, and random assignment, which is used to allocate participants to different treatment groups within an experiment." 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "## [Observational Studies, Experiments, and Inferential Statistics](#toc0_)" 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "In the previous chapter on descriptive statistics, we discussed the differences between observational studies and experiments. 
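The two-step teaching-method example above can be sketched directly: random sampling selects the students, then random assignment splits them into groups. The district size and group sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random sampling: draw a representative sample from the population
district = np.arange(5000)                     # hypothetical student IDs
sample = rng.choice(district, size=60, replace=False)

# Random assignment: shuffle the sampled students and split into groups
shuffled = rng.permutation(sample)
treatment, control = shuffled[:30], shuffled[30:]

print(len(treatment), len(control))            # 30 30
```

The same `rng` drives both steps, but they serve different purposes: the first call controls who is studied, the second controls which condition each studied participant receives.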
Let's briefly review these concepts and explore how they relate to inferential statistics.\n" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "### [Observational Studies and Surveys](#toc0_)\n", 390 | "\n", 391 | "- Observational studies, including surveys, involve collecting data on variables without manipulation.\n", 392 | "- These studies are useful for identifying associations between variables and generalizing findings to larger populations when combined with random sampling.\n", 393 | "- However, observational studies cannot establish cause-and-effect relationships due to the presence of potential confounding variables.\n" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "### [Experiments](#toc0_)\n", 401 | "\n", 402 | "- Experiments involve the manipulation of one or more independent variables to observe their effect on a dependent variable while controlling for other potential confounding variables.\n", 403 | "- Random assignment is crucial in experiments to minimize the impact of confounding variables and potential biases.\n", 404 | "- Well-designed experiments with random assignment and control groups can establish cause-and-effect relationships between variables.\n" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "### [Linking to Inferential Statistics](#toc0_)\n", 412 | "\n", 413 | "- Inferential statistics allows us to make generalizations about a population based on a sample of data.\n", 414 | "- When conducting observational studies or surveys, inferential statistics can be used to estimate population parameters (such as means or proportions) and to test hypotheses about the relationships between variables.\n", 415 | "- In experiments, inferential statistics can be used to determine whether the observed differences between treatment groups are statistically significant and to estimate the 
size of the treatment effect.\n", 416 | "- The concepts of random sampling and random assignment, discussed in the previous sections, are essential for making valid inferences from both observational studies and experiments.\n" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "By understanding the differences between observational studies and experiments, and how they relate to inferential statistics, researchers can make informed decisions about the appropriate study design and statistical analyses for their research questions." 424 | ] 425 | } 426 | ], 427 | "metadata": { 428 | "language_info": { 429 | "name": "python" 430 | } 431 | }, 432 | "nbformat": 4, 433 | "nbformat_minor": 2 434 | } 435 | -------------------------------------------------------------------------------- /Lectures/01 Introduction/02 Types of Data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Types of Data" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "In the field of statistics, data is the foundation upon which all analyses and conclusions are built. Understanding the different types of data is crucial for selecting appropriate statistical methods, interpreting results accurately, and making informed decisions based on the available information.\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "It is essential to understand the different types of data for several reasons:\n", 29 | "\n", 30 | "1. **Choosing Appropriate Statistical Methods**: Different types of data require different statistical approaches. 
By understanding the nature of your data, you can select the most suitable methods for analysis, such as descriptive statistics, hypothesis testing, or regression analysis.\n", 31 | "\n", 32 | "2. **Interpreting Results Accurately**: The type of data you are working with influences how you interpret the results of your analysis. For example, the mean is an appropriate measure of central tendency for quantitative data, while the mode is more suitable for qualitative data.\n", 33 | "\n", 34 | "3. **Avoiding Common Pitfalls**: Misidentifying the type of data can lead to incorrect analyses and misleading conclusions. By understanding the characteristics of each data type, you can avoid common pitfalls and ensure the validity of your results.\n", 35 | "\n", 36 | "4. **Communicating Findings Effectively**: Knowing the type of data you are dealing with helps you communicate your findings clearly and accurately to others. This is particularly important when presenting results to stakeholders or collaborating with colleagues from different fields.\n" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "In this lecture, we will explore the two main categories of data: **qualitative (categorical)** data and **quantitative (numerical)** data. We will define and provide examples of each type, and discuss how to analyze them effectively.\n" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "We will also dive into the two subtypes of quantitative data: discrete and continuous data. 
Understanding the differences between these subtypes is essential for selecting appropriate statistical methods and graphical representations.\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "Furthermore, we will discuss the levels of measurement, which describe the nature of the data and the relationships between values. The four levels of measurement are nominal, ordinal, interval, and ratio.\n" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "By the end of this lecture, you will have a solid understanding of the different types of data and their importance in statistical analysis. This knowledge will serve as a foundation for further learning and application of statistical concepts in various fields." 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "**Table of contents** \n", 79 | "- [Qualitative (Categorical) Data](#toc1_) \n", 80 | " - [Nominal Data](#toc1_1_) \n", 81 | " - [Ordinal Data](#toc1_2_) \n", 82 | "- [Quantitative (Numerical) Data](#toc2_) \n", 83 | " - [Discrete Data](#toc2_1_) \n", 84 | " - [Continuous Data](#toc2_2_) \n", 85 | "- [Differences Between Qualitative and Quantitative Data](#toc3_) \n", 86 | " - [Data Collection Methods](#toc3_1_) \n", 87 | " - [Data Analysis Techniques](#toc3_2_) \n", 88 | " - [Graphical Representations](#toc3_3_) \n", 89 | "- [Levels of Measurement](#toc4_) \n", 90 | " - [Nominal Level](#toc4_1_) \n", 91 | " - [Ordinal Level](#toc4_2_) \n", 92 | " - [Interval Level](#toc4_3_) \n", 93 | " - [Ratio Level](#toc4_4_) \n", 94 | "\n", 95 | "\n", 102 | "" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "## [Qualitative (Categorical) Data](#toc0_)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | 
"source": [ 116 | "Qualitative data, also known as categorical data, represents characteristics or attributes that cannot be measured numerically. This type of data is typically used to describe qualities or categories.\n" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "Qualitative data is non-numerical and describes characteristics or categories. Some key characteristics of qualitative data include:\n", 124 | "\n", 125 | "1. **Descriptive**: Qualitative data describes qualities or attributes, such as colors, types, or opinions.\n", 126 | "2. **Non-numerical**: Qualitative data cannot be measured or expressed using numbers.\n", 127 | "3. **Categories**: Qualitative data is often organized into distinct categories or groups.\n" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "Qualitative data can be further divided into two subtypes: nominal data and ordinal data.\n" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "### [Nominal Data](#toc0_)\n" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Nominal data is a type of qualitative data where the categories have no inherent order or ranking. The categories are mutually exclusive and exhaustive, meaning that each data point can only belong to one category, and all possible categories are included.\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "Examples of nominal data include:\n", 163 | "- Eye color (blue, brown, green)\n", 164 | "- Marital status (single, married, divorced)\n", 165 | "- Brand preferences (Coke, Pepsi, Dr. 
Pepper)\n" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "When analyzing nominal data, you can use the following techniques:\n", 173 | "\n", 174 | "1. **Frequency Distribution**: Count the number of observations in each category to create a frequency distribution table or graph, such as a bar chart or pie chart.\n", 175 | "2. **Mode**: Determine the most frequently occurring category or categories in the dataset.\n", 176 | "3. **Chi-Square Test**: Use a chi-square test to determine if there is a significant association between two nominal variables.\n" 177 | ] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "metadata": {}, 182 | "source": [ 183 | "### [Ordinal Data](#toc0_)\n" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "Ordinal data is a type of qualitative data where the categories have a natural order or ranking. However, the differences between categories are not necessarily equal or measurable.\n" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "Examples of ordinal data include:\n", 198 | "- Educational attainment (high school, bachelor's degree, master's degree, doctorate)\n", 199 | "- Survey responses (strongly disagree, disagree, neutral, agree, strongly agree)\n", 200 | "- Economic status (low, medium, high)\n" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "When analyzing ordinal data, you can use the following techniques in addition to those used for nominal data:\n", 208 | "\n", 209 | "1. **Median**: Calculate the middle value in the ordered dataset to determine the median.\n", 210 | "2. **Percentiles**: Determine the percentage of observations below or above a specific value using percentiles.\n", 211 | "3. 
**Spearman's Rank Correlation**: Use Spearman's rank correlation to measure the strength and direction of the relationship between two ordinal variables.\n" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "Understanding the differences between nominal and ordinal data is crucial for selecting appropriate statistical methods and accurately interpreting the results of your analysis." 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "metadata": {}, 224 | "source": [ 225 | "## [Quantitative (Numerical) Data](#toc0_)" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "Quantitative data, also known as numerical data, represents measurements or quantities that can be expressed using numbers. This type of data is used to describe measurable characteristics or attributes.\n" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "Quantitative data is numerical and represents measurable quantities or values. Some key characteristics of quantitative data include:\n", 240 | "\n", 241 | "1. **Numerical**: Quantitative data is expressed using numbers and can be used in mathematical operations.\n", 242 | "2. **Measurable**: Quantitative data represents measurable quantities or values, such as height, weight, or temperature.\n", 243 | "3. 
**Continuous or Discrete**: Quantitative data can be either continuous (having an infinite number of possible values within a range) or discrete (having a finite or countable number of possible values).\n" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "Quantitative data can be further divided into two subtypes: discrete data and continuous data.\n" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "### [Discrete Data](#toc0_)\n" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "Discrete data is a type of quantitative data that has a finite or countable number of possible values. Discrete data often represents whole numbers or counts.\n" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "Examples of discrete data include:\n", 279 | "- Number of children in a family (0, 1, 2, 3, etc.)\n", 280 | "- Number of cars sold per day at a dealership\n", 281 | "- Number of students in a classroom\n" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "When analyzing discrete data, you can use the following techniques:\n", 289 | "\n", 290 | "1. **Frequency Distribution**: Create a frequency distribution table or graph, such as a bar chart or histogram, to visualize the distribution of the data.\n", 291 | "2. **Measures of Central Tendency**: Calculate the mean (average), median (middle value), and mode (most frequent value) to describe the center of the data distribution.\n", 292 | "3. 
**Measures of Dispersion**: Calculate the range (difference between the maximum and minimum values), variance, and standard deviation to describe the spread of the data.\n" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "### [Continuous Data](#toc0_)\n" 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "Continuous data is a type of quantitative data that has an infinite number of possible values within a specific range. Continuous data often represents measurements or values that can be fractional.\n" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "Examples of continuous data include:\n", 314 | "- Height of individuals in a population\n", 315 | "- Time taken to complete a task\n", 316 | "- Temperature readings throughout the day\n" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "When analyzing continuous data, you can use the following techniques in addition to those used for discrete data:\n", 324 | "\n", 325 | "1. **Histograms**: Create a histogram to visualize the distribution of continuous data by dividing the data into intervals or bins.\n", 326 | "2. **Density Plots**: Use density plots to represent the probability density function of the continuous data.\n", 327 | "3. **Measures of Central Tendency and Dispersion**: Calculate the mean, median, mode, range, variance, and standard deviation to describe the center and spread of the continuous data distribution.\n", 328 | "4. 
**Correlation and Regression**: Use correlation and regression analysis to examine the relationship between two or more continuous variables.\n" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | "Understanding the differences between discrete and continuous data is essential for selecting appropriate statistical methods, creating meaningful visualizations, and interpreting the results of your analysis accurately." 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "metadata": {}, 341 | "source": [ 342 | "## [Differences Between Qualitative and Quantitative Data](#toc0_)" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "Qualitative and quantitative data differ in their nature, collection methods, analysis techniques, and graphical representations. Understanding these differences is crucial for effectively collecting, analyzing, and interpreting data in various fields.\n" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "### [Data Collection Methods](#toc0_)\n" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "1. **Qualitative Data**:\n", 364 | " - Qualitative data is typically collected through methods that allow for open-ended responses and detailed descriptions.\n", 365 | " - Common data collection methods include interviews, focus groups, observations, and open-ended survey questions.\n", 366 | " - These methods allow participants to express their thoughts, opinions, and experiences in their own words.\n", 367 | "\n", 368 | "2. 
**Quantitative Data**:\n", 369 | " - Quantitative data is collected through structured methods that yield numerical or measurable responses.\n", 370 | " - Common data collection methods include closed-ended surveys, experiments, and systematic observations.\n", 371 | " - These methods often involve predetermined response options or scales, ensuring that the data can be easily quantified and analyzed.\n" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "### [Data Analysis Techniques](#toc0_)\n" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "metadata": {}, 384 | "source": [ 385 | "1. **Qualitative Data**:\n", 386 | " - Qualitative data analysis focuses on identifying themes, patterns, and relationships within the data.\n", 387 | " - Common analysis techniques include content analysis, thematic analysis, and narrative analysis.\n", 388 | " - These techniques involve coding and categorizing the data, allowing researchers to draw meaningful conclusions and insights.\n", 389 | "\n", 390 | "2. **Quantitative Data**:\n", 391 | " - Quantitative data analysis involves using statistical methods to describe, summarize, and draw inferences from the data.\n", 392 | " - Common analysis techniques include descriptive statistics (e.g., mean, median, standard deviation), inferential statistics (e.g., t-tests, ANOVA, regression), and hypothesis testing.\n", 393 | " - These techniques allow researchers to identify significant relationships, differences, and trends within the data.\n" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "### [Graphical Representations](#toc0_)\n" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "1. 
**Qualitative Data**:\n", 408 | " - Graphical representations of qualitative data focus on visualizing categories, themes, or relationships.\n", 409 | " - Common graphical representations include word clouds, concept maps, and tree diagrams.\n", 410 | " - These visualizations help to communicate the key findings and insights from the qualitative analysis.\n", 411 | "\n", 412 | "2. **Quantitative Data**:\n", 413 | " - Graphical representations of quantitative data focus on displaying the distribution, central tendency, and variability of the data.\n", 414 | " - Common graphical representations include bar charts, histograms, scatter plots, and box plots.\n", 415 | " - These visualizations help to summarize and communicate the key features and relationships within the quantitative data.\n" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "It is important to note that some research projects may involve collecting and analyzing both qualitative and quantitative data, known as mixed-methods research. This approach allows researchers to gain a more comprehensive understanding of the topic by leveraging the strengths of both data types.\n" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": {}, 428 | "source": [ 429 | "By understanding the differences between qualitative and quantitative data in terms of collection methods, analysis techniques, and graphical representations, researchers can make informed decisions when designing studies, analyzing data, and communicating their findings effectively." 430 | ] 431 | }, 432 | { 433 | "cell_type": "markdown", 434 | "metadata": {}, 435 | "source": [ 436 | "## [Levels of Measurement](#toc0_)" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "Levels of measurement, also known as scales of measurement, describe the nature of the data and the relationships between values. 
Understanding the level of measurement is essential for selecting appropriate statistical methods and interpreting the results accurately. There are four levels of measurement: nominal, ordinal, interval, and ratio.\n" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "### [Nominal Level](#toc0_)\n" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "- Nominal level data is the lowest level of measurement and represents categories or labels with no inherent order or numerical value.\n", 465 | "- Examples of nominal level data include gender (male, female), marital status (single, married, divorced), and eye color (blue, brown, green).\n", 466 | "- Nominal level data can be counted and described using frequencies and percentages.\n", 467 | "- Appropriate measures of central tendency for nominal data include the mode (most frequent category).\n", 468 | "- Statistical tests suitable for nominal data include chi-square tests and Fisher's exact test.\n" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "### [Ordinal Level](#toc0_)\n" 476 | ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": {}, 481 | "source": [ 482 | "- Ordinal level data represents categories with a natural order or ranking, but the differences between categories are not necessarily equal or measurable.\n", 483 | "- Examples of ordinal level data include educational attainment (high school, bachelor's, master's, doctorate), survey responses (strongly disagree, disagree, neutral, agree, strongly agree), and economic status (low, medium, high).\n", 484 | "- Ordinal level data can be counted, described using frequencies and percentages, and ranked.\n", 485 | "- Appropriate measures of central tendency for ordinal data include the 
median (middle value) and mode.\n", 486 | "- Statistical tests suitable for ordinal data include Spearman's rank correlation, Kendall's tau, and Mann-Whitney U test.\n" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "### [Interval Level](#toc0_)\n" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "- Interval level data represents numerical values where the differences between values are meaningful and consistent, but there is no true zero point.\n", 501 | "- Examples of interval level data include temperature measured in Celsius or Fahrenheit, dates on a calendar, and IQ scores.\n", 502 | "- Interval level data can be added and subtracted meaningfully, but multiplication and division are not appropriate.\n", 503 | "- Appropriate measures of central tendency for interval data include the mean (average), median, and mode.\n", 504 | "- Statistical tests suitable for interval data include t-tests, ANOVA, and Pearson's correlation coefficient.\n" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": {}, 510 | "source": [ 511 | "### [Ratio Level](#toc0_)\n" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "- Ratio level data represents numerical values where the differences between values are meaningful, consistent, and there is a true zero point.\n", 519 | "- Examples of ratio level data include height, weight, age, and income.\n", 520 | "- Ratio level data can be added, subtracted, multiplied, and divided meaningfully.\n", 521 | "- Appropriate measures of central tendency for ratio data include the mean, median, and mode.\n", 522 | "- Ratio data supports all statistical tests applicable to interval data, as well as statistics that require a true zero point, such as the geometric mean and the coefficient of variation.\n" 523 | ] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": {}, 528 | "source": [ 529 | "It is 
important to note that the level of measurement determines the appropriate statistical methods and tests that can be used. Using statistical methods designed for a higher level of measurement on data with a lower level of measurement can lead to inaccurate or misleading results.\n" 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": {}, 535 | "source": [ 536 | "By understanding the levels of measurement and their properties, researchers can make informed decisions when collecting data, selecting statistical methods, and interpreting the results of their analyses." 537 | ] 538 | } 539 | ], 540 | "metadata": { 541 | "language_info": { 542 | "name": "python" 543 | } 544 | }, 545 | "nbformat": 4, 546 | "nbformat_minor": 2 547 | } 548 | -------------------------------------------------------------------------------- /Lectures/02 Descriptive Statistics/03 Normal Distribution.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":["# Normal Distribution"]},{"cell_type":"markdown","metadata":{},"source":["Welcome to the lecture on Normal Distributions! In this Jupyter Notebook, we will explore one of the most important probability distributions in statistics: the **Normal Distribution**, also known as the **Gaussian Distribution**. \n"]},{"cell_type":"markdown","metadata":{"vscode":{"languageId":"plaintext"}},"source":["> The normal distribution is also known as the Gaussian distribution after the German mathematician and physicist **Carl Friedrich Gauss**, who studied and applied it extensively in the early 19th century (the distribution itself was first described by Abraham de Moivre in 1733).\n","> \n","> Gauss used the normal distribution to analyze astronomical data, particularly in the context of errors in measurements. 
He showed that errors in astronomical observations followed a bell-shaped curve, which later became known as the Gaussian curve or the normal distribution.\n","> \n","> Gauss's work on the normal distribution was further developed by other mathematicians, such as Pierre-Simon Laplace and Adolphe Quetelet, who applied it to various fields, including social sciences and biology.\n","> \n","> Due to Gauss's significant contributions to the development and application of this probability distribution, it is often referred to as the Gaussian distribution in his honor. However, both terms – normal distribution and Gaussian distribution – are used interchangeably in statistics and probability theory."]},{"cell_type":"markdown","metadata":{},"source":["Before we dive into the Normal Distribution, let's briefly review some fundamental concepts in probability theory. Probability is a measure of the likelihood that an event will occur. It is expressed as a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty. The sum of probabilities for all possible outcomes in a given scenario is always equal to 1.\n"]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":["A **probability distribution** is a function that describes the likelihood of different outcomes in a random experiment. It assigns a probability to each possible outcome. 
There are two main types of probability distributions:\n","- Discrete\n","- Continuous\n","\n","Discrete probability distributions deal with random variables that can only take on specific, countable values, while continuous probability distributions, like the Normal Distribution, deal with random variables that can take on any value within a specified range.\n"]},{"cell_type":"markdown","metadata":{},"source":["The Normal Distribution is a **continuous probability distribution** that is symmetrical about its mean, with data near the mean being more frequent in occurrence than data far from the mean. This distribution is widely used in various fields, including natural and social sciences, because many real-world phenomena can be approximated by the Normal Distribution.\n"]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":["Throughout this lecture, we will cover the following topics:\n","\n","1. Properties of a Normal Distribution\n","2. Standard Normal Distribution\n","3. Probability Density Function (PDF) and Cumulative Distribution Function (CDF)\n","4. Empirical Rule (68-95-99.7 Rule)\n","5. Applications of Normal Distributions\n"]},{"cell_type":"markdown","metadata":{},"source":["By the end of this lecture, you will have a solid understanding of Normal Distributions and their applications in real-world scenarios. 
Let's dive in!\n"]},{"cell_type":"markdown","metadata":{},"source":["**Table of contents** \n","- [Properties of a Normal Distribution](#toc1_) \n","- [Standard Normal Distribution](#toc2_) \n"," - [Standardizing Normal Distributions](#toc2_1_) \n"," - [Properties of the Standard Normal Distribution](#toc2_2_) \n"," - [Z-tables and Probability Calculations](#toc2_3_) \n","- [Probability Density Function (PDF) and Cumulative Distribution Function (CDF)](#toc3_) \n"," - [Probability Density Function (PDF)](#toc3_1_) \n"," - [Cumulative Distribution Function (CDF)](#toc3_2_) \n","- [Empirical Rule (68-95-99.7 Rule)](#toc4_) \n","- [Applications of Normal Distributions](#toc5_) \n","- [Exercise: Normal Distribution Properties and Applications](#toc6_) \n"," - [Solution](#toc6_1_) \n","\n","\n",""]},{"cell_type":"markdown","metadata":{},"source":["## [Properties of a Normal Distribution](#toc0_)"]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":["A Normal Distribution is characterized by several key properties that distinguish it from other probability distributions. Understanding these properties is essential for working with Normal Distributions and applying them to real-world problems.\n","\n","1. **Bell-shaped curve**: The Normal Distribution is represented by a symmetric, bell-shaped curve known as the \"Gaussian curve\" or \"bell curve.\" The peak of the curve represents the mean (μ) of the distribution, and the curve is symmetric about this mean.\n","\n","2. **Mean, median, and mode**: In a Normal Distribution, the mean, median, and mode are all equal. This is a result of the distribution's symmetry.\n","\n","3. **Symmetry**: The Normal Distribution is symmetric about its mean. This means that the left and right halves of the distribution are mirror images of each other.\n","\n","4. 
**Asymptotes**: The tails of the Normal Distribution curve approach the x-axis but never touch it. These tails extend infinitely in both directions, meaning that the range of the Normal Distribution is from negative infinity to positive infinity.\n","\n","5. **Area under the curve**: The total area under the Normal Distribution curve is equal to 1. This property allows us to calculate probabilities by finding the area under the curve between specific points.\n","\n","6. **Parametric distribution**: The Normal Distribution is a parametric distribution, which means it is fully described by its parameters: the mean (μ) and the standard deviation (σ). The mean determines the location of the center of the distribution, while the standard deviation determines the width and height of the curve.\n","\n"," - The mathematical formula for the Normal Distribution is:\n","\n"," $f(x) = \\frac{1}{\\sigma\\sqrt{2\\pi}}e^{-\\frac{1}{2}(\\frac{x-\\mu}{\\sigma})^2}$\n","\n"," where:\n"," - $f(x)$ is the probability density function (PDF)\n"," - $\\mu$ is the mean\n"," - $\\sigma$ is the standard deviation\n"," - $\\pi$ is the mathematical constant pi (approximately 3.14159)\n"," - $e$ is the mathematical constant e (approximately 2.71828)\n","\n","7. 
**Empirical rule**: The Empirical Rule, also known as the 68-95-99.7 Rule, states that for a Normal Distribution:\n"," - Approximately 68% of the data falls within one standard deviation of the mean (μ ± σ).\n"," - Approximately 95% of the data falls within two standard deviations of the mean (μ ± 2σ).\n"," - Approximately 99.7% of the data falls within three standard deviations of the mean (μ ± 3σ)."]},{"cell_type":"markdown","metadata":{},"source":["These properties make the Normal Distribution a valuable tool for modeling and analyzing real-world phenomena, as many natural processes and measurements tend to follow a Normal Distribution."]},{"cell_type":"markdown","metadata":{},"source":["## [Standard Normal Distribution](#toc0_)"]},{"cell_type":"markdown","metadata":{},"source":["The Standard Normal Distribution, also known as the Z-distribution, is a special case of the Normal Distribution with a mean of 0 and a standard deviation of 1. It is denoted as:\n","\n","$Z \\sim N(0, 1)$\n","\n","The Standard Normal Distribution is essential because it allows us to compare and standardize data from different Normal Distributions. By transforming data from any Normal Distribution into the Standard Normal Distribution, we can calculate probabilities, quantiles, and other statistical measures using a single, standardized scale."]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":["### [Standardizing Normal Distributions](#toc0_)\n"]},{"cell_type":"markdown","metadata":{},"source":["To convert a random variable X from a Normal Distribution with mean μ and standard deviation σ to a Standard Normal Distribution, we use the following formula:\n","\n","$Z = \\frac{X - \\mu}{\\sigma}$\n","\n","This process is called standardization or Z-score normalization. 
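As a minimal sketch in plain Python (the scores and parameters below are invented for illustration):

```python
def z_score(x, mu, sigma):
    """Number of standard deviations a raw score x lies from the mean mu."""
    return (x - mu) / sigma

# Illustrative exam scores with mean 70 and standard deviation 8
print(z_score(86, 70, 8))   # 2.0  -> two standard deviations above the mean
print(z_score(62, 70, 8))   # -1.0 -> one standard deviation below the mean
```
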
The resulting Z-score represents the number of standard deviations an observation is away from the mean.\n"]},{"cell_type":"markdown","metadata":{},"source":["For example, suppose we have a Normal Distribution with a mean of 100 and a standard deviation of 15. If we observe a value of 115, we can calculate its Z-score as follows:\n","\n","$Z = \\frac{115 - 100}{15} = 1$\n","\n","This means that the observation of 115 is 1 standard deviation above the mean.\n"]},{"cell_type":"markdown","metadata":{},"source":["### [Properties of the Standard Normal Distribution](#toc0_)\n"]},{"cell_type":"markdown","metadata":{},"source":["1. **Mean**: The mean of the Standard Normal Distribution is always 0.\n"," - Mean (μ) = 0\n"," - Standard deviation (σ) = 1\n"," - Probability density function (PDF):\n","\n"," $f(z) = \\frac{1}{\\sqrt{2\\pi}}e^{-\\frac{1}{2}z^2}$\n","\n"," where:\n"," - $f(z)$ is the probability density function (PDF)\n"," - $z$ is the standard score (Z-score)\n"," - $\\pi$ is the mathematical constant pi (approximately 3.14159)\n"," - $e$ is the mathematical constant e (approximately 2.71828)\n","\n","2. **Standard Deviation**: The standard deviation of the Standard Normal Distribution is always 1.\n","\n","3. **Symmetry**: The Standard Normal Distribution is symmetric about its mean (0).\n","\n","4. **Area under the curve**: The total area under the Standard Normal Distribution curve is equal to 1.\n","\n","5. **Probability calculations**: Because the Standard Normal Distribution is standardized, we can use pre-calculated tables or statistical software to find the probability of observing a value less than, greater than, or between specific Z-scores.\n","\n","6. 
**Converting between Z-scores and raw scores**: To convert a raw score (x) from a Normal Distribution to a Z-score, use the formula:\n"," - $z = \\frac{x - \\mu}{\\sigma}$\n"," - To convert a Z-score back to a raw score (x) in a Normal Distribution, use the formula:\n"," $x = \\mu + z\\sigma$"]},{"cell_type":"markdown","metadata":{},"source":["### [Z-tables and Probability Calculations](#toc0_)\n"]},{"cell_type":"markdown","metadata":{},"source":["Z-tables, also known as Standard Normal tables, are used to find the probability of observing a value less than or greater than a given Z-score. These tables list the cumulative probabilities for various Z-scores.\n"]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":["For example, to find the probability of observing a Z-score less than 1.5, we would look up the value corresponding to 1.5 in the Z-table. The table will give us the probability of observing a value less than 1.5 in a Standard Normal Distribution.\n"]},{"cell_type":"markdown","metadata":{},"source":["Most statistical software packages, such as Python's SciPy library or R's base functions, provide built-in functions to calculate probabilities and quantiles for the Standard Normal Distribution, eliminating the need for manual table lookups.\n"]},{"cell_type":"markdown","metadata":{},"source":["In the next section, we will discuss the Probability Density Function (PDF) and Cumulative Distribution Function (CDF) of the Normal Distribution, which are essential for understanding the properties and applications of the Normal Distribution."]},{"cell_type":"markdown","metadata":{},"source":["## [Probability Density Function (PDF) and Cumulative Distribution Function (CDF)](#toc0_)"]},{"cell_type":"markdown","metadata":{},"source":["To fully understand the Normal Distribution, it is essential to familiarize ourselves with two key concepts: the Probability Density 
Function (PDF) and the Cumulative Distribution Function (CDF). These functions help us calculate probabilities and quantiles for the Normal Distribution.\n"]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":["### [Probability Density Function (PDF)](#toc0_)\n"]},{"cell_type":"markdown","metadata":{},"source":["The Probability Density Function (PDF) of a continuous random variable X is a function that describes the relative likelihood of X taking on a specific value. For the Normal Distribution, the PDF is given by:\n","\n","$f(x) = \\frac{1}{\\sigma\\sqrt{2\\pi}}e^{-\\frac{1}{2}(\\frac{x-\\mu}{\\sigma})^2}$\n","\n","where:\n","- $\\mu$ is the mean of the distribution\n","- $\\sigma$ is the standard deviation of the distribution\n","- $\\pi$ is the mathematical constant pi (approximately 3.14159)\n","- $e$ is the mathematical constant e (approximately 2.71828)\n"]},{"cell_type":"markdown","metadata":{},"source":["The PDF has the following properties:\n","\n","1. The total area under the PDF curve is equal to 1.\n","2. The PDF is non-negative everywhere, i.e., $f(x) \\geq 0$ for all x.\n","3. The probability of observing a value between a and b is given by the area under the PDF curve between a and b.\n"]},{"cell_type":"markdown","metadata":{},"source":["It is important to note that the PDF does not directly give us the probability of observing a specific value. Instead, it gives us the relative likelihood of observing a value in a given range.\n"]},{"cell_type":"markdown","metadata":{},"source":["### [Cumulative Distribution Function (CDF)](#toc0_)\n"]},{"cell_type":"markdown","metadata":{},"source":["The Cumulative Distribution Function (CDF) of a random variable X is a function that gives the probability of observing a value less than or equal to a given value x. 
For the Normal Distribution, the CDF is given by:\n","\n","$F(x) = \\int_{-\\infty}^{x} \\frac{1}{\\sigma\\sqrt{2\\pi}}e^{-\\frac{1}{2}(\\frac{t-\\mu}{\\sigma})^2} dt$\n","\n","where:\n","- $\\mu$ is the mean of the distribution\n","- $\\sigma$ is the standard deviation of the distribution\n","- $\\pi$ is the mathematical constant pi (approximately 3.14159)\n","- $e$ is the mathematical constant e (approximately 2.71828)\n","- $t$ is a dummy variable of integration\n"]},{"cell_type":"markdown","metadata":{},"source":["The CDF has the following properties:\n","\n","1. The CDF is a non-decreasing function, i.e., if $a < b$, then $F(a) \\leq F(b)$.\n","2. The CDF is bounded between 0 and 1, i.e., $0 \\leq F(x) \\leq 1$ for all x.\n","3. As x approaches negative infinity, the CDF approaches 0, and as x approaches positive infinity, the CDF approaches 1.\n"]},{"cell_type":"markdown","metadata":{},"source":["The CDF is particularly useful for calculating probabilities and quantiles. To find the probability of observing a value less than or equal to x, we simply evaluate the CDF at x. 
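In practice this integral is rarely evaluated by hand; SciPy's `scipy.stats.norm` exposes the CDF directly (SciPy is assumed to be installed, and the mean and standard deviation below are illustrative):

```python
from scipy.stats import norm

mu, sigma = 100, 15                                      # illustrative parameters
p_below = norm.cdf(115, loc=mu, scale=sigma)             # P(X <= 115), i.e. F(115)
p_between = p_below - norm.cdf(85, loc=mu, scale=sigma)  # F(115) - F(85)

print(round(p_below, 4))     # 0.8413 (one standard deviation above the mean)
print(round(p_between, 4))   # 0.6827 (within one standard deviation of the mean)
```
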
To find the probability of observing a value between a and b, we calculate $F(b) - F(a)$.\n"]},{"cell_type":"markdown","metadata":{},"source":["In practice, we often use statistical software or pre-calculated tables (such as the Z-table for the Standard Normal Distribution) to evaluate the CDF and calculate probabilities.\n"]},{"cell_type":"markdown","metadata":{},"source":["In the next section, we will discuss the Empirical Rule (68-95-99.7 Rule), which provides a quick way to estimate the probability of observing values within certain ranges of the mean in a Normal Distribution."]},{"cell_type":"markdown","metadata":{},"source":["## [Empirical Rule (68-95-99.7 Rule)](#toc0_)"]},{"cell_type":"markdown","metadata":{},"source":["The Empirical Rule, also known as the 68-95-99.7 Rule or the Three Sigma Rule, is a quick and easy way to estimate the probability of observing values within certain ranges of the mean in a Normal Distribution. This rule is based on the properties of the Standard Normal Distribution and the fact that the Normal Distribution is symmetric about its mean.\n"]},{"cell_type":"markdown","metadata":{},"source":["The Empirical Rule states that for a Normal Distribution:\n","\n","1. Approximately 68% of the data falls within 1 standard deviation of the mean, i.e., within the range $(\\mu - \\sigma, \\mu + \\sigma)$.\n","2. Approximately 95% of the data falls within 2 standard deviations of the mean, i.e., within the range $(\\mu - 2\\sigma, \\mu + 2\\sigma)$.\n","3. Approximately 99.7% of the data falls within 3 standard deviations of the mean, i.e., within the range $(\\mu - 3\\sigma, \\mu + 3\\sigma)$.\n"]},{"cell_type":"markdown","metadata":{},"source":["Here's a visual representation of the Empirical Rule:\n"]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":["To use the Empirical Rule, follow these steps:\n","\n","1. 
Identify the mean (μ) and standard deviation (σ) of the Normal Distribution.\n","2. Determine the range of interest, i.e., within 1, 2, or 3 standard deviations of the mean.\n","3. Use the corresponding percentage from the Empirical Rule to estimate the probability of observing a value within that range.\n"]},{"cell_type":"markdown","metadata":{},"source":["For example, suppose we have a Normal Distribution with a mean of 100 and a standard deviation of 10. To estimate the probability of observing a value between 80 and 120, we first note that 80 is 2 standard deviations below the mean, and 120 is 2 standard deviations above the mean. Using the Empirical Rule, we know that approximately 95% of the data falls within 2 standard deviations of the mean. Therefore, the probability of observing a value between 80 and 120 is approximately 0.95 or 95%.\n"]},{"cell_type":"markdown","metadata":{},"source":["It is important to note that the Empirical Rule provides an approximation and is most accurate for distributions that are nearly normal. For exact probabilities or more complex problems, it is better to use the Probability Density Function (PDF), Cumulative Distribution Function (CDF), or statistical software.\n"]},{"cell_type":"markdown","metadata":{},"source":["In the next section, we will explore some applications of the Normal Distribution in various fields."]},{"cell_type":"markdown","metadata":{},"source":["## [Applications of Normal Distributions](#toc0_)"]},{"cell_type":"markdown","metadata":{},"source":["The Normal Distribution is widely used in various fields due to its many applications. Some of the most common applications include:\n","\n","1. 
**Natural and Social Sciences**:\n"," - In biology, the Normal Distribution can model the distribution of various physical characteristics, such as height, weight, or blood pressure, in a population.\n"," - In psychology, the Normal Distribution is used to model the distribution of intelligence quotient (IQ) scores or personality traits.\n"," - In physics, the Normal Distribution is used to model the distribution of measurement errors or the velocities of particles in a gas.\n","\n","2. **Quality Control and Manufacturing**:\n"," - The Normal Distribution is used to model the variation in product dimensions, weights, or other quality characteristics.\n"," - By setting acceptable limits based on the properties of the Normal Distribution (e.g., within 2 standard deviations of the mean), manufacturers can ensure that their products meet quality standards.\n"," - The Six Sigma methodology, which aims to minimize defects and improve quality, relies heavily on the properties of the Normal Distribution.\n","\n","3. **Financial Markets and Economics**:\n"," - In finance, the Normal Distribution is often used to model the returns of financial assets, such as stocks or bonds, over short time periods.\n"," - The Black-Scholes model, which is used for pricing options, assumes that the underlying asset's returns follow a Normal Distribution.\n"," - In economics, the Normal Distribution can be used to model the distribution of income or other economic variables within a population.\n","\n","4. **Hypothesis Testing and Confidence Intervals**:\n"," - Many statistical tests, such as the t-test or the Z-test, assume that the data follows a Normal Distribution.\n"," - The properties of the Normal Distribution are used to construct confidence intervals for population parameters, such as the mean or the proportion.\n","\n","5. 
**Machine Learning and Data Science**:\n"," - Many machine learning algorithms, such as Linear Regression or Gaussian Naive Bayes, assume that the input features or the errors follow a Normal Distribution.\n"," - In data preprocessing, the Normal Distribution is used to standardize or normalize features, which can improve the performance of some machine learning models.\n","\n","6. **Environmental Sciences and Climatology**:\n"," - The Normal Distribution can be used to model the distribution of temperature, precipitation, or other environmental variables over time or space.\n"," - Climate models often assume that certain variables, such as the concentration of greenhouse gases, follow a Normal Distribution.\n","\n","7. **Telecommunications and Signal Processing**:\n"," - In signal processing, the Normal Distribution is used to model the distribution of noise in communication channels.\n"," - The properties of the Normal Distribution are used to design filters and estimate the signal-to-noise ratio in telecommunication systems.\n"]},{"cell_type":"markdown","metadata":{},"source":["These are just a few examples of the many applications of the Normal Distribution. Its versatility and well-understood properties make it a valuable tool in numerous fields, from the natural and social sciences to engineering and finance.\n"]},{"cell_type":"markdown","metadata":{},"source":["It is important to note that while the Normal Distribution is widely applicable, it is not always the most appropriate model for every situation. 
Researchers and practitioners should always consider the underlying assumptions and the nature of their data before applying the Normal Distribution or any other statistical model."]},{"cell_type":"markdown","metadata":{},"source":[""]},{"cell_type":"markdown","metadata":{},"source":["## [Exercise: Normal Distribution Properties and Applications](#toc0_)"]},{"cell_type":"markdown","metadata":{},"source":["In this exercise, you will apply your knowledge of Normal Distributions to solve various problems. Use the following information to answer the questions below:\n","\n","A company manufactures light bulbs with a mean life of 1000 hours and a standard deviation of 100 hours. The lifespan of these light bulbs follows a Normal Distribution.\n","\n","1. What is the probability that a randomly selected light bulb will last between 900 and 1100 hours? Use the Empirical Rule to solve this problem.\n","\n","2. Calculate the Z-scores for the following light bulb lifespans:\n"," a. 1200 hours\n"," b. 850 hours\n","\n","3. The company wants to identify the top 5% longest-lasting light bulbs for a premium product line. What is the minimum lifespan (in hours) a light bulb must have to be included in this top 5%? Use the Z-score table to solve this problem.\n","\n","4. Suppose the company decides to offer a warranty for light bulbs that last less than 800 hours. What percentage of light bulbs will be covered under this warranty? Use the Empirical Rule to estimate this value.\n","\n","5. The company plans to introduce a new line of energy-efficient light bulbs with a mean life of 1200 hours. The standard deviation is expected to be 20% less than the current light bulbs. Calculate the probability that a randomly selected energy-efficient light bulb will last between 1100 and 1300 hours. 
Use the Z-score table to solve this problem.\n"]},{"cell_type":"markdown","metadata":{},"source":["> *Hint: For questions 3 and 5, you can use the Z-score table to find the appropriate Z-score and then convert it back to the original scale using the mean and standard deviation.*"]},{"cell_type":"markdown","metadata":{},"source":["### [Solution](#toc0_)\n"]},{"cell_type":"markdown","metadata":{},"source":["1. Using the Empirical Rule, we know that approximately 68% of the data falls within one standard deviation of the mean (μ ± σ). In this case, one standard deviation is 100 hours, so the range is 900 to 1100 hours. Therefore, the probability that a randomly selected light bulb will last between 900 and 1100 hours is approximately 0.68 or 68%.\n","\n","2. To calculate the Z-scores, we use the formula: $Z = \\frac{x - \\mu}{\\sigma}$\n"," a. For 1200 hours: $Z = \\frac{1200 - 1000}{100} = 2$\n"," b. For 850 hours: $Z = \\frac{850 - 1000}{100} = -1.5$\n","\n","3. To find the minimum lifespan for the top 5% longest-lasting light bulbs, we need to find the Z-score that corresponds to the 95th percentile (100% - 5% = 95%). Using the Z-score table, we find that the Z-score for the 95th percentile is approximately 1.645. Now, we can convert this Z-score back to the original scale using the formula: $x = \\mu + Z\\sigma$\n","\n"," $x = 1000 + 1.645 \\times 100 = 1164.5$\n","\n"," Therefore, the minimum lifespan for a light bulb to be included in the top 5% is approximately 1164.5 hours.\n","\n","4. Using the Empirical Rule, we know that approximately 95% of the data falls within two standard deviations of the mean (μ ± 2σ). This means that about 2.5% of the data falls below two standard deviations from the mean. Two standard deviations below the mean is: $1000 - 2 \\times 100 = 800$ hours. Therefore, approximately 2.5% of the light bulbs will be covered under the warranty for lasting less than 800 hours.\n","\n","5. 
For the new line of energy-efficient light bulbs, the mean is 1200 hours, and the standard deviation is $100 \\times 0.8 = 80$ hours. To find the probability that a light bulb will last between 1100 and 1300 hours, we first calculate the Z-scores for these values:\n","\n"," $Z_{1100} = \\frac{1100 - 1200}{80} = -1.25$\n"," $Z_{1300} = \\frac{1300 - 1200}{80} = 1.25$\n","\n"," Using the Z-score table, we find the cumulative probabilities for these Z-scores:\n"," \n"," $P(Z < -1.25) = 0.1056$\n"," $P(Z < 1.25) = 0.8944$\n","\n"," The probability that a light bulb will last between 1100 and 1300 hours is the difference between these cumulative probabilities:\n","\n"," $P(1100 < x < 1300) = 0.8944 - 0.1056 = 0.7888$\n","\n"," Therefore, the probability that a randomly selected energy-efficient light bulb will last between 1100 and 1300 hours is approximately 0.7888 or 78.88%."]}],"metadata":{"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":2} 2 | -------------------------------------------------------------------------------- /Lectures/02 Descriptive Statistics/01 Describing Data with Averages.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# Describing Data with Averages" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "In this section, we will explore the concept of central tendency and its importance in understanding and summarizing data. We will also introduce the three main measures of central tendency: mode, median, and mean.\n" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "**Central tendency** refers to the concept of identifying a single value that represents the \"center\" or \"middle\" of a dataset. 
It provides a way to describe the typical or central value around which the data points tend to cluster.\n" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "Measuring central tendency is crucial for several reasons:\n", 36 | "\n", 37 | "1. **Summarizing data**: Central tendency measures allow us to summarize a large dataset with a single representative value, making it easier to understand and communicate the overall characteristics of the data.\n", 38 | "\n", 39 | "2. **Comparing datasets**: By calculating the central tendency of different datasets, we can compare them and determine which dataset has higher or lower values on average.\n", 40 | "\n", 41 | "3. **Identifying patterns**: Central tendency measures help us identify patterns or trends in the data, such as whether the data points are consistently high, low, or clustered around a specific value.\n", 42 | "\n", 43 | "4. **Making decisions**: In many fields, such as business, healthcare, and social sciences, central tendency measures are used to make informed decisions based on the typical or average values of the data.\n" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "There are three primary measures of central tendency: mode, median, and mean. Each measure has its own characteristics and is suitable for different types of data and situations.\n", 51 | "\n", 52 | "1. **Mode**: The mode is the value that appears most frequently in a dataset. It is particularly useful for categorical or discrete data.\n", 53 | "\n", 54 | "2. **Median**: The median is the middle value when the data points are arranged in ascending or descending order. It is less sensitive to extreme values (outliers) compared to the mean.\n", 55 | "\n", 56 | "3. **Mean**: The mean, also known as the average, is calculated by summing up all the values in a dataset and dividing by the total number of values. 
It is the most commonly used measure of central tendency for continuous data.\n" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "In everyday language, the terms \"average\" and \"mean\" are often used interchangeably. However, in a mathematical context, \"mean\" has a more specific definition, while \"average\" can refer to several measures of central tendency.\n", 71 | "\n", 72 | "**Mean:**\n", 73 | "- The mean is a specific measure of central tendency, calculated by summing up all the values in a dataset and dividing by the number of values.\n", 74 | "- It is the most commonly used measure when people refer to the \"average.\"\n", 75 | "\n", 76 | "**Average:**\n", 77 | "- In a general sense, an average is a single value that represents the typical or central value in a set of data.\n", 78 | "- It can refer to different measures of central tendency, such as the mean, median, or mode, depending on the context.\n", 79 | "- The term \"average\" is often used informally to describe a typical or representative value.\n", 80 | "\n", 81 | "In summary, while \"mean\" refers to a specific mathematical calculation, \"average\" is a more general term that can encompass various measures of central tendency, including the mean. In most cases, when people use the term \"average,\" they are referring to the arithmetic mean." 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "In the following sections, we will explore each of these measures in more detail, including their definitions, calculations, and when to use them." 
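As a quick preview before the detailed sections, Python's built-in `statistics` module computes all three measures (the sample below is invented for illustration):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]       # illustrative sample of six values

print(statistics.mode(data))      # 3   -> most frequent value
print(statistics.median(data))    # 4.0 -> average of the two middle values (3 and 5)
print(statistics.mean(data))      # 5   -> sum 30 divided by 6 values
```
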
89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "**Table of contents** \n", 96 | "- [Mode](#toc1_) \n", 97 | " - [Procedure for determining the mode](#toc1_1_) \n", 98 | " - [Examples of datasets with one mode, multiple modes, or no mode](#toc1_2_) \n", 99 | " - [Advantages and disadvantages of using the mode](#toc1_3_) \n", 100 | " - [Mode for qualitative data](#toc1_4_) \n", 101 | "- [Median](#toc2_) \n", 102 | " - [Procedure for finding the median](#toc2_1_) \n", 103 | " - [Median for qualitative and ranked data](#toc2_2_) \n", 104 | " - [Advantages and disadvantages of using the median](#toc2_3_) \n", 105 | "- [Mean](#toc3_) \n", 106 | " - [Sample mean formula](#toc3_1_) \n", 107 | " - [Population mean formula](#toc3_2_) \n", 108 | " - [Mean as the balance point of a distribution](#toc3_3_) \n", 109 | " - [Sensitivity of the mean to extreme scores (outliers)](#toc3_4_) \n", 110 | " - [Advantages and disadvantages of using the mean](#toc3_5_) \n", 111 | "- [Comparing Mode, Median, and Mean](#toc4_) \n", 112 | " - [Differences Between Mean and Median in Skewed Distributions](#toc4_1_) \n", 113 | " - [Situations Where each Measure is Most Appropriate](#toc4_2_) \n", 114 | "- [Exercise: Measures of Central Tendency](#toc5_) \n", 115 | " - [Solution](#toc5_1_) \n", 116 | "\n", 117 | "\n", 124 | "" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "## [Mode](#toc0_)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "The mode is the value that appears most frequently in a dataset. 
In other words, it is the value with the highest frequency count.\n" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "### [Procedure for determining the mode](#toc0_)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "To find the mode of a dataset, follow these steps:\n", 153 | "1. *Organize the data*: Arrange the values in the dataset in a systematic order (e.g., from least to greatest or by categories).\n", 154 | "2. *Count the frequency*: Count how many times each value appears in the dataset.\n", 155 | "3. *Identify the value with the highest frequency*: The value or values with the highest frequency count are the mode(s) of the dataset.\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "**Example 1**: Find the mode of the following dataset: 5, 2, 8, 5, 1, 5, 3, 2, 5.\n", 163 | "\n", 164 | "*Step 1*: Organize the data in ascending order: 1, 2, 2, 3, 5, 5, 5, 5, 8.\n", 165 | "*Step 2*: Count the frequency of each value:\n", 166 | "- 1 appears once\n", 167 | "- 2 appears twice\n", 168 | "- 3 appears once\n", 169 | "- 5 appears four times\n", 170 | "- 8 appears once\n", 171 | "*Step 3*: Identify the value with the highest frequency: The mode is 5, as it appears most frequently (four times).\n" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "**Example 2**: Find the mode of the following dataset: apple, banana, orange, grape, apple, banana, orange, grape, kiwi.\n", 179 | "\n", 180 | "*Step 1*: Organize the data by categories: apple, apple, banana, banana, grape, grape, kiwi, orange, orange.\n", 181 | "*Step 2*: Count the frequency of each value:\n", 182 | "- apple appears twice\n", 183 | "- banana appears twice\n", 184 | "- grape appears twice\n", 185 | "- kiwi appears once\n", 186 | "- orange appears twice\n", 187 | "*Step 3*: Identify the value with 
the highest frequency: The dataset has four modes: apple, banana, grape, and orange, as they all appear twice.\n" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "### [Examples of datasets with one mode, multiple modes, or no mode](#toc0_)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "metadata": {}, 200 | "source": [ 201 | "Datasets can have one mode (unimodal), multiple modes (bimodal or multimodal), or no mode at all.\n", 202 | "\n", 203 | "1. *Unimodal dataset*: Consider the following dataset of student grades: 75, 80, 85, 90, 90, 90, 95. In this dataset, the mode is 90 because it appears most frequently (three times).\n", 204 | "\n", 205 | "2. *Bimodal dataset*: Consider the following dataset of car colors: red, blue, green, blue, red, yellow, red, blue. In this dataset, there are two modes: red and blue, as they both appear three times.\n", 206 | "\n", 207 | "3. *Multimodal dataset*: Consider the following dataset of favorite fruits: apple, banana, orange, grape, apple, banana, orange, grape, kiwi. In this dataset, there are four modes: apple, banana, orange, and grape, as they all appear twice.\n", 208 | "\n", 209 | "4. *Dataset with no mode*: Consider the following dataset of house numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9. 
In this dataset, there is no mode because each value appears only once.\n" 210 | ] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "metadata": {}, 215 | "source": [ 216 | "### [Advantages and disadvantages of using the mode](#toc0_)" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "**Advantages:**\n", 224 | "- The mode is easy to understand and calculate.\n", 225 | "- It is useful for categorical or discrete data.\n", 226 | "- The mode is not affected by extreme values (outliers).\n", 227 | "\n", 228 | "**Disadvantages:**\n", 229 | "- The mode may not exist or may not be unique (i.e., there can be no mode or multiple modes).\n", 230 | "- It does not take into account the magnitude of the values, only their frequencies.\n", 231 | "- The mode may not be a good representative of the dataset if the frequencies are evenly distributed.\n" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "## [Median](#toc0_)" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "The median is the middle value in a dataset when the values are arranged in ascending or descending order. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.\n" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "### [Procedure for finding the median](#toc0_)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "The procedure for finding the median depends on whether the total number of values in the dataset is odd or even.\n", 260 | "\n", 261 | "When the total number of scores is odd:\n", 262 | "1. Arrange the values in ascending or descending order.\n", 263 | "2. 
Identify the middle position using the formula $\\frac{n+1}{2}$, where $n$ is the total number of values.\n", 264 | "3. The median is the value at the middle position.\n", 265 | "\n", 266 | "**Example**: Find the median of the following dataset: 12, 7, 3, 9, 15.\n", 267 | "\n", 268 | "*Step 1*: Arrange the values in ascending order: 3, 7, 9, 12, 15.\n", 269 | "*Step 2*: Identify the middle position: $\\frac{5+1}{2} = 3$. The middle position is 3.\n", 270 | "*Step 3*: The median is the value at the 3rd position, which is 9.\n" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "When the total number of scores is even:\n", 278 | "1. Arrange the values in ascending or descending order.\n", 279 | "2. Identify the two middle positions using the formula $\\frac{n}{2}$ and $\\frac{n}{2}+1$, where $n$ is the total number of values.\n", 280 | "3. Calculate the average of the values at the two middle positions to find the median.\n", 281 | "\n", 282 | "**Example**: Find the median of the following dataset: 4, 7, 2, 9, 3, 8.\n", 283 | "\n", 284 | "*Step 1*: Arrange the values in ascending order: 2, 3, 4, 7, 8, 9.\n", 285 | "*Step 2*: Identify the two middle positions: $\\frac{6}{2} = 3$ and $\\frac{6}{2}+1 = 4$. The two middle positions are 3 and 4.\n", 286 | "*Step 3*: Calculate the average of the values at the 3rd and 4th positions: $\\frac{4+7}{2} = 5.5$. 
The median is 5.5.\n" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "metadata": {}, 292 | "source": [ 293 | "### [Median for qualitative and ranked data](#toc0_)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "When dealing with qualitative or ranked data, the median can be found by assigning ranks to the categories or values and then applying the same procedure as for quantitative data.\n", 301 | "\n", 302 | "**Example**: Find the median of the following dataset of ranked preferences: A, C, B, A, D, B, A.\n", 303 | "\n", 304 | "*Step 1*: Assign ranks to the categories: A = 1, B = 2, C = 3, D = 4.\n", 305 | "*Step 2*: Arrange the ranks in ascending order: 1, 1, 1, 2, 2, 3, 4.\n", 306 | "*Step 3*: Identify the middle position: $\\frac{7+1}{2} = 4$. The middle position is 4.\n", 307 | "*Step 4*: The median is the value at the 4th position, which is 2 (corresponding to category B).\n" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "### [Advantages and disadvantages of using the median](#toc0_)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "**Advantages:**\n", 322 | "- The median is less affected by extreme values (outliers) than the mean.\n", 323 | "- It can be used with ordinal data (ranked data) and interval/ratio data.\n", 324 | "- The median is unique and always exists for any dataset.\n", 325 | "\n", 326 | "**Disadvantages:**\n", 327 | "- The median does not take into account the actual values of the data points, only their positions.\n", 328 | "- It may not be a good representative of the dataset if the data is highly skewed or has a large range.\n", 329 | "- Calculating the median for large datasets can be time-consuming if the data is not already sorted." 
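The mode and median procedures above can be cross-checked with a few lines of Python. This is an illustrative sketch using only the standard library (`collections.Counter` and the `statistics` module); it is not part of the original lecture, and reuses the datasets from the earlier examples:

```python
from collections import Counter
import statistics

# Example 1 from the mode section
data = [5, 2, 8, 5, 1, 5, 3, 2, 5]

# Steps 1-2: organize the data and count the frequency of each value
counts = Counter(sorted(data))

# Step 3: every value tied for the highest frequency is a mode
highest = max(counts.values())
modes = [value for value, freq in counts.items() if freq == highest]
print(modes)  # [5]

# Median of an even-sized dataset: statistics.median sorts the values
# and averages the two middle ones
print(statistics.median([4, 7, 2, 9, 3, 8]))  # 5.5
```

Note that collecting *all* values tied for the highest count handles bimodal and multimodal datasets as well, whereas `statistics.mode` returns only a single value.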
330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "## [Mean](#toc0_)" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "The mean, also known as the arithmetic average, is the sum of all values in a dataset divided by the total number of values. It represents the central tendency of the data and is the most commonly used measure of central tendency.\n" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "The formula for calculating the mean depends on whether the data is from a sample or a population.\n" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": {}, 370 | "source": [ 371 | "### [Sample mean formula](#toc0_)" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "For a sample, the mean is denoted by $\\bar{x}$ (read as \"x-bar\") and is calculated using the following formula:\n", 379 | "\n", 380 | "$\\bar{x} = \\frac{\\sum_{i=1}^{n} x_i}{n}$\n", 381 | "\n", 382 | "where $x_i$ represents each individual value in the dataset, $n$ is the total number of values in the sample, and $\\sum$ (sigma) denotes the sum of all values.\n" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "### [Population mean formula](#toc0_)" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "For a population, the mean is denoted by $\\mu$ (read as \"mu\") and is calculated using the following formula:\n", 397 | "\n", 398 | "$\\mu = \\frac{\\sum_{i=1}^{N} x_i}{N}$\n", 399 | "\n", 400 | "where $x_i$ represents each individual value in the dataset, $N$ is the 
total number of values in the population, and $\\sum$ (sigma) denotes the sum of all values.\n" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "**Example**: Calculate the mean of the following dataset: 4, 7, 2, 9, 3, 8.\n", 408 | "\n", 409 | "*Step 1*: Add up all the values in the dataset: 4 + 7 + 2 + 9 + 3 + 8 = 33.\n", 410 | "*Step 2*: Divide the sum by the total number of values: $\\frac{33}{6} = 5.5$.\n", 411 | "*Step 3*: The mean of the dataset is 5.5.\n" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "### [Mean as the balance point of a distribution](#toc0_)" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": {}, 424 | "source": [ 425 | "The mean can be thought of as the balance point of a distribution. If you were to place the values of the dataset on a number line, the mean would be the point at which the line would balance, with the sum of the distances from the mean to each value on one side equal to the sum of the distances from the mean to each value on the other side.\n" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [ 432 | "### [Sensitivity of the mean to extreme scores (outliers)](#toc0_)" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "One important characteristic of the mean is its sensitivity to extreme values or outliers. 
Because the mean takes into account the actual values of each data point, a single extremely high or low value can significantly influence the mean, pulling it towards the extreme value.\n", 440 | "\n", 441 | "**Example**: Consider the following two datasets:\n", 442 | "- Dataset A: 4, 7, 2, 9, 3, 8\n", 443 | "- Dataset B: 4, 7, 2, 9, 3, 50\n" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "The mean of Dataset A is 5.5, while the mean of Dataset B is 12.5. The single extreme value of 50 in Dataset B pulls the mean upward, making it less representative of the majority of the data.\n" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "### [Advantages and disadvantages of using the mean](#toc0_)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "markdown", 462 | "metadata": {}, 463 | "source": [ 464 | "**Advantages:**\n", 465 | "- The mean takes into account the actual values of each data point, providing a more precise measure of central tendency.\n", 466 | "- It is useful for interval/ratio data and can be used in further statistical analyses.\n", 467 | "- The mean is sensitive to changes in any value within the dataset.\n", 468 | "\n", 469 | "**Disadvantages:**\n", 470 | "- The mean is strongly influenced by extreme values (outliers), which can make it less representative of the dataset.\n", 471 | "- It may not be an appropriate measure for highly skewed distributions or datasets with a large range.\n", 472 | "- The mean cannot be calculated for categorical or ordinal data." 
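The pull of a single extreme value on the mean is easy to verify numerically. The sketch below (illustrative, using Python's standard `statistics` module) recomputes both datasets and also shows that the median, which depends only on position, barely notices the outlier:

```python
import statistics

dataset_a = [4, 7, 2, 9, 3, 8]
dataset_b = [4, 7, 2, 9, 3, 50]  # same data, but 8 replaced by the outlier 50

# the single extreme value pulls the mean up sharply...
print(statistics.mean(dataset_a))  # 5.5
print(statistics.mean(dataset_b))  # 12.5

# ...while the median is unchanged
print(statistics.median(dataset_a))  # 5.5
print(statistics.median(dataset_b))  # 5.5
```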
473 | ] 474 | }, 475 | { 476 | "cell_type": "markdown", 477 | "metadata": {}, 478 | "source": [ 479 | "## [Comparing Mode, Median, and Mean](#toc0_)" 480 | ] 481 | }, 482 | { 483 | "cell_type": "markdown", 484 | "metadata": {}, 485 | "source": [ 486 | "The mode, median, and mean are all measures of central tendency, but they each have unique properties and are appropriate for different types of data and distributions.\n", 487 | "\n", 488 | "- *Mode*: The mode is the most frequently occurring value in a dataset. It is the only measure of central tendency that can be used with nominal data. However, it may not exist or may not be unique for some datasets.\n", 489 | "\n", 490 | "- *Median*: The median is the middle value when the dataset is ordered. It is less affected by outliers than the mean and can be used with ordinal data. However, it does not take into account the actual values of the data points.\n", 491 | "\n", 492 | "- *Mean*: The mean is the sum of all values divided by the total number of values. It is the most commonly used measure of central tendency and takes into account the actual values of each data point. However, it is sensitive to outliers and can only be used with interval or ratio data.\n" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "### [Differences Between Mean and Median in Skewed Distributions](#toc0_)" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "" 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "metadata": { 512 | "vscode": { 513 | "languageId": "plaintext" 514 | } 515 | }, 516 | "source": [ 517 | "A skewed distribution is a distribution that is asymmetrical, with the tail of the distribution extending to one side. 
In a skewed distribution, the mean and median can differ significantly, providing insights into the nature of the data.\n", 518 | "\n", 519 | "- *Positively skewed distributions*: In a positively skewed distribution, the tail of the distribution extends to the right, with a few high-value outliers. The bulk of the data is concentrated on the left side of the distribution. In this case, the mean will be greater than the median, as the outliers pull the mean towards the right.\n", 520 | "\n", 521 | "**Example**: Consider the following dataset of incomes (in thousands): 20, 25, 30, 35, 40, 50, 100. This dataset is positively skewed, with a long tail extending to the right. The mean income is \\$42.86, while the median income is \\$35. The high-value outlier (100) pulls the mean higher than the median.\n", 522 | "\n", 523 | "- *Negatively skewed distributions*: In a negatively skewed distribution, the tail of the distribution extends to the left, with a few low-value outliers. The bulk of the data is concentrated on the right side of the distribution. In this case, the mean will be less than the median, as the outliers pull the mean towards the left.\n", 524 | "\n", 525 | "**Example**: Consider the following dataset of exam scores: 50, 60, 70, 75, 80, 85, 90, 95, 95. This dataset is negatively skewed, with a long tail extending to the left. The mean score is 77.78, while the median score is 80. The low-value outliers (50 and 60) pull the mean lower than the median." 526 | ] 527 | }, 539 | { 540 | "cell_type": "markdown", 541 | "metadata": {}, 542 | "source": [ 543 | "### [Situations Where each Measure is Most Appropriate](#toc0_)" 544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "metadata": {}, 549 | "source": [ 550 | "- *Mode*: The mode is best used when dealing with nominal or categorical data, or when you want to identify the most common value in a dataset.\n", 551 | "\n", 552 | "- *Median*: The median is best used when the data contains outliers, when the data is ordinal, or when the distribution is skewed.\n", 553 | "\n", 554 | "- *Mean*: The mean is best used when the data is interval or ratio, when the distribution is symmetrical or normal, and when you want to take into account the actual values of each data point.\n" 555 | ] 556 | }, 557 | { 558 | "cell_type": "markdown", 559 | "metadata": {}, 560 | "source": [ 561 | "In inferential statistics, the mean plays a crucial role due to its mathematical properties and its relationship to the normal distribution. Many statistical tests and confidence intervals are based on the sample mean, making it a fundamental concept in statistical inference. The Central Limit Theorem, which states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, further reinforces the importance of the mean in inferential statistics. You will learn more about these concepts in later lectures."
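The two skewness examples above can be verified directly; comparing the mean and median is a quick diagnostic for the direction of skew. This is an illustrative check (not part of the original lecture) using the standard `statistics` module:

```python
import statistics

incomes = [20, 25, 30, 35, 40, 50, 100]        # positively skewed (in thousands)
scores = [50, 60, 70, 75, 80, 85, 90, 95, 95]  # negatively skewed exam scores

# mean > median suggests a right (positive) skew
print(round(statistics.mean(incomes), 2), statistics.median(incomes))  # 42.86 35

# mean < median suggests a left (negative) skew
print(round(statistics.mean(scores), 2), statistics.median(scores))   # 77.78 80
```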
562 | ] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": {}, 567 | "source": [ 568 | "" 569 | ] 570 | }, 571 | { 572 | "cell_type": "markdown", 573 | "metadata": {}, 574 | "source": [ 575 | "## [Exercise: Measures of Central Tendency](#toc0_)" 576 | ] 577 | }, 578 | { 579 | "cell_type": "markdown", 580 | "metadata": {}, 581 | "source": [ 582 | "Consider the following dataset representing the number of hours students spent studying for an exam:\n", 583 | "\n", 584 | "2, 5, 6, 3, 4, 5, 7, 2, 6, 4, 3, 5, 4, 6, 5, 8, 4, 3, 5, 4\n", 585 | "\n", 586 | "1. Determine the mode of the dataset.\n", 587 | "2. Calculate the median of the dataset.\n", 588 | "3. Compute the mean of the dataset.\n", 589 | "4. Suppose an outlier value of 20 hours is added to the dataset. Recalculate the mode, median, and mean. Which measure of central tendency is least affected by the presence of the outlier?\n" 590 | ] 591 | }, 592 | { 593 | "cell_type": "markdown", 594 | "metadata": {}, 595 | "source": [ 596 | "### [Solution](#toc0_)\n" 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "metadata": {}, 602 | "source": [ 603 | "1. To determine the mode, we count the frequency of each value in the dataset:\n", 604 | " - 2 appears 2 times\n", 605 | " - 3 appears 3 times\n", 606 | " - 4 appears 5 times\n", 607 | " - 5 appears 5 times\n", 608 | " - 6 appears 3 times\n", 609 | " - 7 appears 1 time\n", 610 | " - 8 appears 1 time\n", 611 | "\n", 612 | " The values 4 and 5 both appear most frequently (5 times each), so the dataset has two modes: 4 and 5.\n", 613 | "\n", 614 | "2. To calculate the median, we first arrange the data in ascending order:\n", 615 | " 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 8\n", 616 | "\n", 617 | " With an even number of values (20), the median is the average of the two middle values (10th and 11th values):\n", 618 | " - Median = (4 + 5) ÷ 2 = 4.5\n", 619 | "\n", 620 | "3. 
To compute the mean, we add up all the values and divide by the number of values:\n", 621 | "    - Mean = (2 + 5 + 6 + 3 + 4 + 5 + 7 + 2 + 6 + 4 + 3 + 5 + 4 + 6 + 5 + 8 + 4 + 3 + 5 + 4) ÷ 20 = 91 ÷ 20 = 4.55\n", 622 | "\n", 623 | "4. Adding the outlier value of 20 to the dataset:\n", 624 | "   2, 5, 6, 3, 4, 5, 7, 2, 6, 4, 3, 5, 4, 6, 5, 8, 4, 3, 5, 4, 20\n", 625 | "\n", 626 | "    - New Mode: The mode remains 4 and 5, as they still appear most frequently (5 times each).\n", 627 | "    - New Median: 5 (with 21 values, the median is the 11th value in the ordered dataset)\n", 628 | "    - New Mean: (91 + 20) ÷ 21 ≈ 5.29\n", 629 | "\n", 630 | "   The mode is least affected by the presence of the outlier, as it only considers the most frequent values. The median is only slightly affected (moving from 4.5 to 5), while the mean is most influenced by the outlier." 631 | ] 632 | } 633 | ], 634 | "metadata": { 635 | "kernelspec": { 636 | "display_name": "py310", 637 | "language": "python", 638 | "name": "python3" 639 | }, 640 | "language_info": { 641 | "codemirror_mode": { 642 | "name": "ipython", 643 | "version": 3 644 | }, 645 | "file_extension": ".py", 646 | "mimetype": "text/x-python", 647 | "name": "python", 648 | "nbconvert_exporter": "python", 649 | "pygments_lexer": "ipython3", 650 | "version": "3.10.12" 651 | } 652 | }, 653 | "nbformat": 4, 654 | "nbformat_minor": 2 655 | } 656 | -------------------------------------------------------------------------------- /Lectures/01 Introduction/01 What is statistics?.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "vscode": { 7 | "languageId": "plaintext" 8 | } 9 | }, 10 | "source": [ 11 | "" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": { 17 | "vscode": { 18 | "languageId": "plaintext" 19 | } 20 | }, 21 | "source": [ 22 | "# What is Statistics?"
23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": { 28 | "vscode": { 29 | "languageId": "plaintext" 30 | } 31 | }, 32 | "source": [ 33 | "**Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, and presentation of data**. It involves methods for gathering, organizing, and drawing conclusions from data to help us make informed decisions in the face of uncertainty.\n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "Statistics plays a crucial role in numerous fields, including:\n", 41 | "\n", 42 | "1. **Business and Economics**: Companies use statistics to analyze market trends, make sales forecasts, and optimize their operations for maximum profitability.\n", 43 | "\n", 44 | "2. **Medicine and Public Health**: Medical researchers rely on statistical methods to test the effectiveness of new treatments, analyze the spread of diseases, and identify risk factors for various health conditions.\n", 45 | "\n", 46 | "3. **Social Sciences**: Psychologists, sociologists, and political scientists use statistics to study human behavior, analyze survey data, and test hypotheses about social phenomena.\n", 47 | "\n", 48 | "4. **Natural Sciences**: Biologists, chemists, and physicists use statistics to analyze experimental data, test scientific theories, and make predictions about natural processes.\n", 49 | "\n", 50 | "5. **Engineering**: Engineers use statistics to assess the reliability of systems, control the quality of manufacturing processes, and optimize the design of products.\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "Some real-world applications of statistics include:\n", 58 | "\n", 59 | "1. **Quality Control**: Manufacturing companies use statistical methods to monitor the quality of their products and identify sources of defects.\n", 60 | "\n", 61 | "2. 
**Political Polling**: Statisticians design and analyze opinion polls to gauge public sentiment on various issues and predict election outcomes.\n", 62 | "\n", 63 | "3. **Sports Analytics**: Sports teams use statistics to evaluate player performance, develop game strategies, and make data-driven decisions for team management.\n", 64 | "\n", 65 | "4. **Insurance**: Insurance companies use statistical models to assess risk, determine premiums, and manage claims.\n", 66 | "\n", 67 | "5. **Weather Forecasting**: Meteorologists use statistical methods to analyze historical weather data and make predictions about future weather patterns.\n" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "As you can see, statistics is a versatile and essential tool in many aspects of our lives. By understanding statistical concepts and methods, we can make better-informed decisions and solve complex problems in a wide range of fields." 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": { 80 | "vscode": { 81 | "languageId": "plaintext" 82 | } 83 | }, 84 | "source": [ 85 | "Algorithms, artificial intelligence, machine learning, deep learning, data science, math, visualization, and statistics are all interconnected fields that play crucial roles in the realm of data analysis and decision-making. At the core, algorithms provide the foundation for processing and analyzing data efficiently. Artificial intelligence encompasses techniques that enable machines to exhibit intelligent behavior, with machine learning being a subset of AI that focuses on algorithms that improve automatically through experience. Deep learning, in turn, is a subfield of machine learning that utilizes neural networks with multiple layers to learn hierarchical representations of data. Data science is an interdisciplinary field that combines various techniques, including machine learning, to extract insights and knowledge from data. 
Math and statistics provide the underlying theoretical frameworks and tools for quantifying uncertainty, making inferences, and building predictive models. Visualization complements these fields by enabling the effective communication and interpretation of data and results. The following chart illustrates the relationships and overlaps between these domains:" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": { 98 | "vscode": { 99 | "languageId": "plaintext" 100 | } 101 | }, 102 | "source": [ 103 | "As shown in the figure, these fields are closely intertwined, with each one building upon and complementing the others. Understanding the connections and leveraging the synergies between these disciplines is crucial for solving complex problems and making data-driven decisions in various domains, ranging from business and healthcare to science and engineering." 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "**Table of contents** \n", 111 | "- [Branches of Statistics](#toc1_) \n", 112 | " - [Descriptive Statistics](#toc1_1_) \n", 113 | " - [Inferential Statistics](#toc1_2_) \n", 114 | "- [Descriptive Statistics](#toc2_) \n", 115 | " - [Organizing and Summarizing Data](#toc2_1_) \n", 116 | " - [Measures of Central Tendency](#toc2_2_) \n", 117 | " - [Measures of Dispersion](#toc2_3_) \n", 118 | " - [Graphical Representations](#toc2_4_) \n", 119 | "- [Inferential Statistics](#toc3_) \n", 120 | " - [Sample vs. Population](#toc3_1_) \n", 121 | " - [Survey vs. 
Experiment](#toc3_2_) \n", 122 | " - [Making Predictions and Generalizations About Populations Based on Sample Data](#toc3_3_) \n", 123 | " - [Hypothesis Testing](#toc3_4_) \n", 124 | " - [Confidence Intervals](#toc3_5_) \n", 125 | " - [Regression Analysis](#toc3_6_) \n", 126 | "- [Importance of Statistics in Decision Making](#toc4_) \n", 127 | "\n", 128 | "\n", 135 | "" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "## [Branches of Statistics](#toc0_)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "Statistics can be broadly divided into two main branches: descriptive statistics and inferential statistics. Each branch serves a specific purpose and employs different methods to analyze and interpret data.\n" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": { 155 | "vscode": { 156 | "languageId": "plaintext" 157 | } 158 | }, 159 | "source": [ 160 | "**Descriptive statistics** focuses on summarizing and describing the main features of a dataset, while inferential statistics involves making inferences and drawing conclusions about a population based on a sample of data." 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "When we have a **population**, which refers to the entire group of individuals or objects of interest, collecting data from every member of the population is often impractical or impossible. In such cases, we rely on sampling, which involves selecting a subset of the population that is representative of the whole. The data collected from this sample is then used to calculate descriptive statistics, such as measures of central tendency (mean, median, mode) and measures of dispersion (variance, standard deviation). 
These descriptive statistics provide a concise summary of the sample data and help us understand its main characteristics.\n" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "However, the ultimate goal is often to **make inferences about the larger population based on the sample data**. This is where inferential statistics comes into play. By using probability theory and statistical models, inferential statistics allows us to estimate population parameters, test hypotheses, and make predictions with a certain level of confidence. For example, we can use inferential statistics to determine if there is a significant difference between two groups, assess the relationship between variables, or predict future outcomes based on historical data." 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "" 182 | ] 183 | }, 184 | { 185 | "cell_type": "markdown", 186 | "metadata": {}, 187 | "source": [ 188 | "### [Descriptive Statistics](#toc0_)\n" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "Descriptive statistics involves methods for organizing, summarizing, and presenting data in a meaningful way. The purpose of descriptive statistics is to provide a clear and concise summary of the main features of a dataset, such as its central tendency, variability, and distribution.\n" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "Some common examples of descriptive statistics include:\n", 203 | "\n", 204 | "1. **Measures of Central Tendency**: These measures describe the typical or central value in a dataset, such as the mean (average), median (middle value), and mode (most frequent value).\n", 205 | "\n", 206 | "2. 
**Measures of Dispersion**: These measures describe the spread or variability of data, such as the range (difference between the maximum and minimum values), variance, and standard deviation.\n", 207 | "\n", 208 | "3. **Frequency Distributions**: These tables or graphs show how often each value or group of values occurs in a dataset, such as histograms or bar charts.\n", 209 | "\n", 210 | "4. **Percentiles and Quartiles**: These measures divide a dataset into equal parts, such as the median (50th percentile) or the first and third quartiles (25th and 75th percentiles).\n" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "### [Inferential Statistics](#toc0_)\n" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "Inferential statistics involves methods for making predictions, generalizations, or decisions about a population based on a sample of data. The purpose of inferential statistics is to use sample data to draw conclusions about a larger population with a certain level of confidence.\n" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "Some common examples of inferential statistics include:\n", 232 | "\n", 233 | "1. **Hypothesis Testing**: This is a method for determining whether a claim or hypothesis about a population is likely to be true based on sample evidence. Examples include t-tests, ANOVA, and chi-square tests.\n", 234 | "\n", 235 | "2. **Confidence Intervals**: These are ranges of values that are likely to contain the true population parameter with a certain level of confidence, such as a 95% confidence interval for the mean.\n", 236 | "\n", 237 | "3. **Regression Analysis**: This is a method for modeling the relationship between a dependent variable and one or more independent variables, such as linear regression or logistic regression.\n", 238 | "\n", 239 | "4. 
**Sampling**: This involves techniques for selecting a representative subset of a population to study, such as simple random sampling or stratified sampling.\n" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "In summary, descriptive statistics helps us to organize and summarize data, while inferential statistics allows us to make predictions and draw conclusions about populations based on sample data. Both branches of statistics are essential for making data-driven decisions in various fields." 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "## [Descriptive Statistics](#toc0_)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "Descriptive statistics is a branch of statistics that focuses on organizing, summarizing, and presenting data in a meaningful way. It provides tools to describe the main features of a dataset, such as its central tendency, variability, and distribution.\n" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "" 268 | ] 269 | }, 270 | { 271 | "cell_type": "markdown", 272 | "metadata": {}, 273 | "source": [ 274 | "### [Organizing and Summarizing Data](#toc0_)\n" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "The first step in descriptive statistics is to organize and summarize the data. This can be done using various methods, such as:\n", 282 | "\n", 283 | "1. **Frequency Distributions**: These tables or graphs show how often each value or group of values occurs in a dataset. They can be used to identify the most common values or categories in a dataset.\n", 284 | "\n", 285 | "2. **Contingency Tables**: These tables display the relationship between two or more categorical variables, such as gender and political affiliation. 
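A frequency distribution like the one just described can be tabulated directly with Python's standard library; a minimal sketch on made-up survey responses:

```python
from collections import Counter

# Made-up survey responses (illustrative only)
responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]

# Counter builds the frequency distribution: value -> count
freq = Counter(responses)
most_common_answer, count = freq.most_common(1)[0]
```

The resulting counts can feed straight into a bar chart or histogram.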
They can be used to examine the association between variables.\n", 286 | "\n", 287 | "3. **Cross-Tabulation**: This is a method for summarizing data from two or more variables in a single table, allowing for the examination of relationships between the variables.\n" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "### [Measures of Central Tendency](#toc0_)\n" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "Measures of central tendency describe the typical or central value in a dataset. The three main measures of central tendency are:\n", 302 | "\n", 303 | "1. **Mean**: The arithmetic average of a set of values, calculated by summing all the values and dividing by the number of values.\n", 304 | "\n", 305 | "2. **Median**: The middle value in a dataset when the values are arranged in order from least to greatest. If there is an even number of values, the median is the average of the two middle values.\n", 306 | "\n", 307 | "3. **Mode**: The most frequently occurring value or values in a dataset.\n" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "### [Measures of Variability](#toc0_)\n" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "Measures of dispersion describe the spread or variability of data. Some common measures of dispersion include:\n", 322 | "\n", 323 | "1. **Range**: The difference between the maximum and minimum values in a dataset.\n", 324 | "\n", 325 | "2. **Variance**: The average of the squared deviations from the mean, measuring how far the data points are spread out from the mean.\n", 326 | "\n", 327 | "3. **Standard Deviation**: The square root of the variance, providing a measure of dispersion in the same units as the original data.\n", 328 | "\n", 329 | "4. 
**Interquartile Range (IQR)**: The range of the middle 50% of the data, calculated as the difference between the third quartile (Q3) and the first quartile (Q1).\n" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "### [Graphical Representations](#toc0_)\n" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "Graphical representations are visual tools used to display and communicate data effectively. Some common graphical representations in descriptive statistics include:\n", 344 | "\n", 345 | "1. **Bar Charts**: These graphs use rectangular bars to represent the frequency or proportion of categorical data.\n", 346 | "\n", 347 | "2. **Histograms**: Similar to bar charts, histograms display the frequency distribution of continuous data, with the area of each bar representing the frequency of values within a specific range.\n", 348 | "\n", 349 | "3. **Pie Charts**: These circular charts display the proportion of each category in a dataset, with each slice representing a category.\n", 350 | "\n", 351 | "4. **Box Plots**: Also known as box-and-whisker plots, these graphs display the distribution of a dataset based on its quartiles, median, and outliers.\n" 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "By using these tools and methods, descriptive statistics helps us to better understand and communicate the main features of a dataset, laying the foundation for further statistical analysis and decision-making." 
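The measures of central tendency and dispersion above can all be computed with Python's built-in `statistics` module; a minimal sketch on a small made-up sample:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # small made-up sample

mean = statistics.mean(data)      # arithmetic average -> 5.0
median = statistics.median(data)  # middle value -> 4.5
mode = statistics.mode(data)      # most frequent value -> 4
rng = max(data) - min(data)       # range -> 7

var = statistics.pvariance(data)  # population variance -> 4.0
std = statistics.pstdev(data)     # population standard deviation -> 2.0

# Quartiles via the default 'exclusive' method; IQR = Q3 - Q1
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
```

Note that `pvariance`/`pstdev` treat the data as a whole population; use `variance`/`stdev` when the data is a sample from a larger population.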
359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "## [Inferential Statistics](#toc0_)" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": { 371 | "vscode": { 372 | "languageId": "plaintext" 373 | } 374 | }, 375 | "source": [ 376 | "Inferential statistics is a branch of statistics that involves making predictions, generalizations, or decisions about a population based on a sample of data. It allows us to use sample data to draw conclusions about a larger population with a certain level of confidence." 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "### [Sample vs. Population](#toc0_)\n" 391 | ] 392 | }, 393 | { 394 | "cell_type": "markdown", 395 | "metadata": {}, 396 | "source": [ 397 | "1. **Population**: A population is the entire group of individuals, objects, or events that we are interested in studying. It is the complete set of elements that share a common characteristic.\n", 398 | "\n", 399 | "2. **Sample**: A sample is a subset of the population that is selected for study. It is a representative group drawn from the population, and the information gathered from the sample is used to make inferences about the entire population.\n" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": {}, 405 | "source": [ 406 | "### [Survey vs. Experiment](#toc0_)\n" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "1. **Survey**: A survey is a method of collecting data by asking questions to a sample of individuals. Surveys are often used to gather information about opinions, attitudes, behaviors, or characteristics of a population. They can be conducted through various means, such as questionnaires, interviews, or online polls.\n", 414 | "\n", 415 | "2. 
**Experiment**: An experiment is a controlled study in which the researcher manipulates one or more variables (independent variables) to observe their effect on another variable (dependent variable). Experiments are designed to establish cause-and-effect relationships between variables by controlling for potential confounding factors.\n" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": {}, 421 | "source": [ 422 | "" 423 | ] 424 | }, 425 | { 426 | "cell_type": "markdown", 427 | "metadata": { 428 | "vscode": { 429 | "languageId": "plaintext" 430 | } 431 | }, 432 | "source": [ 433 | "There are several other types of studies that researchers can conduct, depending on their research questions, objectives, and available resources. Some additional types of studies include:\n", 434 | "\n", 435 | "1. **Observational studies**: These studies involve collecting data on a sample or population without directly manipulating any variables or assigning treatments to participants. The researcher simply observes and records what happens naturally.\n", 436 | "\n", 437 | "2. **Meta-analysis**: A systematic review that combines the results of multiple independent studies on the same topic to provide a more comprehensive and statistically powerful analysis.\n", 438 | "\n", 439 | "3. **Systematic review**: A structured, comprehensive review of existing literature on a specific topic, which follows a strict protocol to identify, select, and critically appraise relevant studies.\n", 440 | "\n", 441 | "4. **Qualitative studies**: These studies aim to explore and understand people's experiences, perceptions, and behaviors through non-numerical data, such as interviews, focus groups, or observations.\n", 442 | "\n", 443 | "5. **Mixed-methods studies**: These studies combine both quantitative and qualitative research methods to gain a more comprehensive understanding of a research problem.\n", 444 | "\n", 445 | "6. 
**Longitudinal studies**: These studies involve repeated observations of the same variables over an extended period to examine changes or development over time.\n", 446 | "\n", 447 | "7. **Cross-sectional studies**: These studies collect data from a sample at a single point in time to examine the relationship between variables.\n", 448 | "\n", 449 | "8. **Ecological studies**: These studies compare populations or groups, rather than individuals, to examine the relationship between exposure and outcome variables at the population level.\n", 450 | "\n", 451 | "9. **Quasi-experimental studies**: These studies share some characteristics with true experiments but lack random assignment of participants to treatment groups. They are often used when random assignment is not feasible or ethical.\n", 452 | "\n", 453 | "10. **Case studies**: These studies involve an in-depth, detailed examination of a single case, individual, or small group to explore a specific phenomenon or issue.\n", 454 | "\n", 455 | "11. **Action research**: This type of research aims to solve practical problems and improve practices through a collaborative, iterative process involving both researchers and participants." 456 | ] 457 | }, 458 | { 459 | "cell_type": "markdown", 460 | "metadata": {}, 461 | "source": [ 462 | "### [Making Predictions and Generalizations About Populations Based on Sample Data](#toc0_)\n" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": {}, 468 | "source": [ 469 | "The main goal of inferential statistics is to use sample data to make inferences about a larger population. This is done by:\n", 470 | "\n", 471 | "1. **Sampling**: Selecting a representative subset of the population to study. The sample should be chosen randomly and be large enough to accurately represent the population.\n", 472 | "\n", 473 | "2. 
**Estimation**: Using sample statistics, such as the mean or proportion, to estimate the corresponding population parameters.\n", 474 | "\n", 475 | "3. **Generalization**: Drawing conclusions about the population based on the sample data, while accounting for the uncertainty introduced by sampling variability.\n" 476 | ] 477 | }, 478 | { 479 | "cell_type": "markdown", 480 | "metadata": {}, 481 | "source": [ 482 | "### [Hypothesis Testing](#toc0_)\n" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "Hypothesis testing is a method for determining whether a claim or hypothesis about a population is likely to be true based on sample evidence. The process involves:\n", 490 | "\n", 491 | "1. **Null Hypothesis (H0)**: A statement that assumes no effect or difference between populations or variables.\n", 492 | "\n", 493 | "2. **Alternative Hypothesis (Ha)**: A statement that contradicts the null hypothesis and represents the claim or effect being tested.\n", 494 | "\n", 495 | "3. **Test Statistic**: A value calculated from the sample data that is used to determine whether to reject or fail to reject the null hypothesis.\n", 496 | "\n", 497 | "4. **P-value**: The probability of obtaining a test statistic as extreme as the one observed, assuming the null hypothesis is true. 
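To make the pieces above concrete, here is a hedged sketch (made-up numbers and a normal approximation, not the course's own example) of testing whether a coin is fair after observing 60 heads in 100 flips:

```python
import math

# Observed data: 60 heads in 100 flips; H0: p = 0.5 (fair coin)
n, heads = 100, 60
p0 = 0.5

se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
z = (heads / n - p0) / se           # test statistic: z = 2.0

# Two-tailed p-value from the standard normal CDF (via math.erf)
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
# p_value is about 0.0455, so H0 is rejected at the 5% level
```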
A small p-value (typically < 0.05) means the observed result would be unlikely if the null hypothesis were true, and is taken as evidence against it.\n" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": {}, 510 | "source": [ 511 | "Common hypothesis tests include t-tests, ANOVA, and chi-square tests.\n" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "### [Confidence Intervals](#toc0_)\n" 519 | ] 520 | }, 521 | { 522 | "cell_type": "markdown", 523 | "metadata": { 524 | "vscode": { 525 | "languageId": "plaintext" 526 | } 527 | }, 528 | "source": [ 529 | "A hypothesis test merely indicates whether an effect is present. A confidence interval is more informative since it indicates, with a known degree of confidence, the range of possible effects." 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": {}, 535 | "source": [ 536 | "Confidence intervals are ranges of values that are likely to contain the true population parameter with a certain level of confidence, such as 95%. They provide a way to estimate the precision of sample estimates and quantify the uncertainty associated with inferential conclusions.\n" 537 | ] 538 | }, 539 | { 540 | "cell_type": "markdown", 541 | "metadata": {}, 542 | "source": [ 543 | "Confidence intervals are constructed using the sample statistic, the standard error (a measure of sampling variability), and a critical value from a probability distribution (e.g., the t-distribution or the normal distribution).\n" 544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "metadata": {}, 549 | "source": [ 550 | "### [Regression Analysis](#toc0_)\n" 551 | ] 552 | }, 553 | { 554 | "cell_type": "markdown", 555 | "metadata": {}, 556 | "source": [ 557 | "Regression analysis is a method for modeling the relationship between a **dependent variable** and one or more **independent variables**. 
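The construction just described (sample statistic ± critical value × standard error) can be sketched in a few lines; the data and the t critical value (df = 7, 95% confidence, read from a t-table) are illustrative assumptions:

```python
import math
import statistics

# Made-up measurements; goal: a 95% confidence interval for the mean
sample = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
n = len(sample)

xbar = statistics.mean(sample)                # sample statistic
se = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error
t_crit = 2.365                                # t critical value (df = 7, 95%)

ci = (xbar - t_crit * se, xbar + t_crit * se)
```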
It helps to understand how changes in the independent variables are associated with changes in the dependent variable.\n" 558 | ] 559 | }, 560 | { 561 | "cell_type": "markdown", 562 | "metadata": {}, 563 | "source": [ 564 | "Common types of regression analysis include:\n", 565 | "\n", 566 | "1. **Linear Regression**: Models the relationship between a continuous dependent variable and one or more independent variables using a linear equation.\n", 567 | "\n", 568 | "2. **Logistic Regression**: Models the relationship between a binary dependent variable (e.g., success/failure) and one or more independent variables, estimating the probability of an event occurring.\n", 569 | "\n", 570 | "3. **Multiple Regression**: Models the relationship between a dependent variable and two or more independent variables, allowing for the examination of the unique effect of each independent variable while controlling for the others.\n" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": {}, 576 | "source": [ 577 | "" 578 | ] 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "metadata": {}, 583 | "source": [ 584 | "" 585 | ] 586 | }, 587 | { 588 | "cell_type": "markdown", 589 | "metadata": {}, 590 | "source": [ 591 | "" 592 | ] 593 | }, 594 | { 595 | "cell_type": "markdown", 596 | "metadata": {}, 597 | "source": [ 598 | "Inferential statistics allows us to make data-driven decisions and draw conclusions about populations based on sample data, while accounting for the inherent uncertainty in the process. By using hypothesis testing, confidence intervals, and regression analysis, we can make informed judgments and predictions in various fields, from business and economics to medicine and social sciences." 
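As an illustration of the simplest case above, here is a from-scratch ordinary least-squares fit of a line to made-up points (no library assumed):

```python
# Made-up (x, y) points; fit y = a + b*x by ordinary least squares
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope = sum of co-deviations / sum of squared x-deviations
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

y_hat = a + b * 6  # prediction for a new x value
```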
599 | ] 600 | }, 601 | { 602 | "cell_type": "markdown", 603 | "metadata": {}, 604 | "source": [ 605 | "## [Importance of Statistics in Decision Making](#toc0_)" 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "In today's data-driven world, statistics play a crucial role in decision-making processes across various industries. By using statistical methods to collect, analyze, and interpret data, organizations can make informed decisions based on objective evidence rather than intuition or guesswork.\n" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "**Data-driven decision making** involves using data and statistical analysis to guide strategic and operational decisions. This approach offers several benefits, including:\n", 620 | "\n", 621 | "1. **Objectivity**: Statistical methods provide an unbiased and objective way to analyze data, reducing the influence of personal opinions or biases in decision-making.\n", 622 | "\n", 623 | "2. **Accuracy**: By using statistical techniques to analyze large datasets, organizations can identify patterns, trends, and relationships that may not be apparent through casual observation, leading to more accurate decisions.\n", 624 | "\n", 625 | "3. **Efficiency**: Statistical analysis can help organizations quickly process and interpret large amounts of data, enabling faster and more efficient decision-making.\n", 626 | "\n", 627 | "4. **Risk Reduction**: By using statistical methods to quantify uncertainty and assess risk, organizations can make decisions that minimize potential losses and maximize potential gains.\n", 628 | "\n", 629 | "5. 
**Continuous Improvement**: Data-driven decision making allows organizations to monitor the effectiveness of their decisions over time and make adjustments as needed based on new data and insights.\n" 630 | ] 631 | }, 632 | { 633 | "cell_type": "markdown", 634 | "metadata": {}, 635 | "source": [ 636 | "Statistics are applied in numerous industries to drive decision-making and optimize outcomes. Some examples include:\n", 637 | "\n", 638 | "1. **Healthcare**: Healthcare providers use statistical methods to analyze patient data, assess treatment effectiveness, and identify risk factors for diseases. This information is used to make decisions about resource allocation, treatment protocols, and public health interventions.\n", 639 | "\n", 640 | "2. **Finance**: Financial institutions use statistical models to assess credit risk, detect fraud, and optimize investment portfolios. Statistical analysis helps these organizations make data-driven decisions about lending, investing, and risk management.\n", 641 | "\n", 642 | "3. **Marketing**: Marketers use statistical techniques to analyze customer data, segment markets, and measure the effectiveness of advertising campaigns. This information is used to make decisions about product development, pricing, and promotional strategies.\n", 643 | "\n", 644 | "4. **Manufacturing**: Manufacturers use statistical process control methods to monitor the quality of their products and identify sources of variation in their production processes. This information is used to make decisions about process improvements, quality control, and resource allocation.\n", 645 | "\n", 646 | "5. **Sports**: Sports teams and organizations use statistical analysis to evaluate player performance, develop game strategies, and make decisions about player acquisition and team management. This data-driven approach has revolutionized the way many sports are played and managed.\n", 647 | "\n", 648 | "6. 
**Government**: Government agencies use statistical methods to analyze demographic data, assess the effectiveness of public policies, and allocate resources. This information is used to make decisions about public services, infrastructure investments, and regulatory policies.\n" 649 | ] 650 | }, 651 | { 652 | "cell_type": "markdown", 653 | "metadata": {}, 654 | "source": [ 655 | "By leveraging the power of statistics in decision making, organizations across various industries can make more informed, data-driven decisions that lead to better outcomes and a competitive edge in their respective markets." 656 | ] 657 | } 658 | ], 659 | "metadata": { 660 | "language_info": { 661 | "name": "python" 662 | } 663 | }, 664 | "nbformat": 4, 665 | "nbformat_minor": 2 666 | } 667 | --------------------------------------------------------------------------------