├── .gitignore
├── Dummy Classifier Notebook.ipynb
├── Gaussian_Mixture_Models.ipynb
├── Hypothesis Testing
│   ├── Hypo_Testing.ipynb
│   ├── Hypothesis Testing.md
│   ├── PlantGrowth.csv
│   └── blood_pressure.csv
├── LICENSE
├── README.md
├── WIP
│   ├── Exponential_Smoothing.ipynb
│   └── Monte Carlo Simulation.ipynb
├── data
│   ├── Breast-Cancer.csv
│   ├── HorseKicks.txt
│   ├── Housefly_wing_lengths.txt
│   └── food_outlet_data.csv
├── ideas.md
├── images
│   ├── Conditional.png
│   ├── JDTable.png
│   ├── Joint.png
│   ├── Marginal.png
│   ├── Marginal2.png
│   ├── OneDirectional.png
│   ├── OnettwoTailed.png
│   ├── Table.png
│   ├── TwoDirectional.png
│   └── TypeIandTypeIIError.png
└── notebooks
    ├── 1-Way ANOVA.ipynb
    ├── ARIMA.ipynb
    ├── Baye's Theorem Notebook.ipynb
    ├── Binary Classification-Logistic Regression.ipynb
    ├── Central Limit Theorem.ipynb
    ├── Correlation with Example
    │   ├── Correlation.ipynb
    │   ├── Movie Recommendation using Correlation.ipynb
    │   └── README.md
    ├── Data_Summary_Notebook#12.ipynb
    ├── Decision Tree.ipynb
    ├── Dummy Classifier Notebook.ipynb
    ├── Frequency Distribution.ipynb
    ├── Frequency_Distribution.ipynb
    ├── Heteroscedasticty.ipynb
    ├── Hypothesis Testing.ipynb
    ├── JointProbabilityDistribution.ipynb
    ├── KMeans_Clustering.ipynb
    ├── K_Nearest_Neighbours.ipynb
    ├── LNN.ipynb
    ├── Linear_Discriminant_Analysis.ipynb
    ├── Markov_chains.ipynb
    ├── MonteCarlo.ipynb
    ├── Multilinear-Regression.ipynb
    ├── PhiK Correlation
    │   ├── PhiK.ipynb
    │   ├── data_description.txt
    │   └── dataset.csv
    ├── Precision&Recall.ipynb
    ├── Principal-Component-Analysis.ipynb
    ├── Probability_Distributions_All.ipynb
    ├── RFR using GridsearchCV.ipynb
    ├── Statistical_&_Probability_Notebook_Part_1.ipynb
    ├── Statistical_&_Probability_Notebook_Part_2.ipynb
    ├── Support_Vector_Machine.ipynb
    ├── Time_Series.ipynb
    ├── Time_Series_Visualization.ipynb
    ├── agriculture_yield_rice
    ├── autocorrelation.ipynb
    ├── bias_variance_notebook.ipynb
    ├── biden_speech.txt
    ├── data_summary_breast_cancer.ipynb
    ├── intro-numpy-pandas-matplotlib.ipynb
    └── maximum-likelihood-estimation.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints/
2 |
--------------------------------------------------------------------------------
/Dummy Classifier Notebook.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Dummy Classifier "
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## What is a `DummyClassifier`?\n",
15 | "\n",
16 | "DummyClassifier is a classifier that makes predictions using simple rules, which can be\n",
17 | "useful as a baseline for comparison against actual classifiers, especially with imbalanced classes(where the class distribution is not equal or close to equal, and is instead biased or skewed).\n",
18 | "\n",
19 | "A dummy classifier is basically a classifier which doesn’t even look at the training data while classification, but follows just a rule of thumb or strategy that we instruct it to use while classifying. It is done by including the strategy we want in the strategy parameter of the `DummyClassifier`.The main notion behind using a dummy classifier is that a classifier which is based on an analytic approach to do better than random guessing approach.\n"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "## Strategies used in Dummy Classifier\n",
27 | "\n",
28 | "The scikit-learn `DummyClassifier` class implements several strategies for random guessing classifiers. \n",
29 | "The strategies are as follows:\n",
30 | "\n",
31 | "- stratified : This strategy generates the prediction using the training set's class distribution\n",
32 | "- most_frequent : This always predicts the most frequent label in training set.\n",
33 | "- prior : This predicts the class that maximises the class prior.\n",
34 | "- uniform : This generates predictions uniformly at random\n",
35 | "- constant : Always predicts a constant label which is user defined. This is specificaly usefull for metrics that evaluate a non-majority class."
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | " ## Explaination through Implementation\n",
43 | " \n",
44 | "The dummy classifier gives measure of \"baseline\" performance--i.e. the success rate one should expect to achieve even if simply guessing.\n",
45 | "\n",
46 | "If one wishes to determine whether a given object possesses or does not possess a certain property. After analyzing a large number of the objects it is found that 90% contain the target property, then guessing that every future instance of the object possesses the target property gives a 90% likelihood of guessing correctly. Structuring these guesses is equivalent to using the `most_frequent` method in dummy clasifier\n",
47 | "\n",
48 | "Because many machine learning tasks attempt to increase the success rate of (e.g.) classification tasks, evaluating the baseline success rate can afford a floor value for the minimal value one's classifier should out-perform. \n",
49 | "\n",
50 | "If one trains a dummy classifier with the `stratified` parameter using the data discussed above, that classifier will predict that there is a 90% probability that each object it encounters possesses the target property. This is different from training a dummy classifier with the `most_frequent` parameter, as the latter would guess that all future objects possess the target property. Here's some code to illustrate:"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": 1,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "import numpy as np \n",
60 | "import pandas as pd \n",
61 | "import matplotlib.pyplot as plt \n",
62 | "import seaborn as sns "
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 2,
68 | "metadata": {},
69 | "outputs": [
70 | {
71 | "data": {
72 | "text/html": [
73 | "
"
264 | ]
265 | },
266 | "metadata": {
267 | "needs_background": "light"
268 | },
269 | "output_type": "display_data"
270 | }
271 | ],
272 | "source": [
273 | "ax = sns.stripplot(strategies, test_scores); \n",
274 | "ax.set(xlabel ='Strategy', ylabel ='Test Score') \n",
275 | "plt.show() "
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "Checking the performance of `RandomForestClassifier` on the data"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": 10,
288 | "metadata": {},
289 | "outputs": [
290 | {
291 | "data": {
292 | "text/plain": [
293 | "0.776595744680851"
294 | ]
295 | },
296 | "execution_count": 10,
297 | "metadata": {},
298 | "output_type": "execute_result"
299 | }
300 | ],
301 | "source": [
302 | "from sklearn.ensemble import RandomForestClassifier\n",
303 | "from sklearn.metrics import accuracy_score\n",
304 | "ans=RandomForestClassifier()\n",
305 | "ans.fit(X_train,y_train)\n",
306 | "prediction=ans.predict(X_test)\n",
307 | "accuracy_score(y_test,prediction)"
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "On comparing the scores of the KNN classifier with the dummy classifier, we come to the conclusion that the KNN classifier is, in fact, a good classifier for the given data."
315 | ]
316 | },
317 | {
318 | "cell_type": "markdown",
319 | "metadata": {},
320 | "source": [
321 | "## Imbalanced Class and `Dummy Classifier`\n",
322 | "\n",
323 | "A major motivation for Dummy Classifier is F-score, when the positive class is in minority (i.e. imbalanced classes). This classifier is used for sanity test of actual classifier. Actually, dummy classifier completely ignores the input data. In case of 'most frequent' method, it checks the occurrence of most frequent label."
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": 11,
329 | "metadata": {},
330 | "outputs": [
331 | {
332 | "name": "stdout",
333 | "output_type": "stream",
334 | "text": [
335 | "0 178\n",
336 | "1 182\n",
337 | "2 177\n",
338 | "3 183\n",
339 | "4 181\n",
340 | "5 182\n",
341 | "6 181\n",
342 | "7 179\n",
343 | "8 174\n",
344 | "9 180\n"
345 | ]
346 | }
347 | ],
348 | "source": [
349 | "from sklearn.datasets import load_digits\n",
350 | "\n",
351 | "dataset = load_digits()\n",
352 | "X, y = dataset.data, dataset.target\n",
353 | "\n",
354 | "for class_name, class_count in zip(dataset.target_names, np.bincount(dataset.target)):\n",
355 | " print(class_name,class_count)"
356 | ]
357 | },
358 | {
359 | "cell_type": "code",
360 | "execution_count": 12,
361 | "metadata": {},
362 | "outputs": [
363 | {
364 | "name": "stdout",
365 | "output_type": "stream",
366 | "text": [
367 | "Original labels:\t [1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]\n",
368 | "New binary labels:\t [1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]\n"
369 | ]
370 | }
371 | ],
372 | "source": [
373 | "y_imbalanced = y.copy()\n",
374 | "y_imbalanced[y_imbalanced != 1] = 0\n",
375 | "\n",
376 | "print('Original labels:\\t', y[1:20])\n",
377 | "print('New binary labels:\\t', y_imbalanced[1:20])"
378 | ]
379 | },
380 | {
381 | "cell_type": "code",
382 | "execution_count": 14,
383 | "metadata": {},
384 | "outputs": [
385 | {
386 | "data": {
387 | "text/plain": [
388 | "array([1615, 182], dtype=int64)"
389 | ]
390 | },
391 | "execution_count": 14,
392 | "metadata": {},
393 | "output_type": "execute_result"
394 | }
395 | ],
396 | "source": [
397 | "np.bincount(y_imbalanced)"
398 | ]
399 | },
400 | {
401 | "cell_type": "markdown",
402 | "metadata": {},
403 | "source": [
404 | "We can observe that in the above data array one class is more frequent than other which shows it is an imbalanced class"
405 | ]
406 | },
407 | {
408 | "cell_type": "code",
409 | "execution_count": 20,
410 | "metadata": {},
411 | "outputs": [
412 | {
413 | "data": {
414 | "text/plain": [
415 | "0.5466666666666666"
416 | ]
417 | },
418 | "execution_count": 20,
419 | "metadata": {},
420 | "output_type": "execute_result"
421 | }
422 | ],
423 | "source": [
424 | "X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y_imbalanced, random_state=0)\n",
425 | "\n",
426 | "# Accuracy of Support Vector Machine classifier\n",
427 | "from sklearn.naive_bayes import GaussianNB\n",
428 | "gnb = GaussianNB()\n",
429 | "y_pred = gnb.fit(X_train1, y_train1)\n",
430 | "gnb.score(X_test1, y_test1)"
431 | ]
432 | },
433 | {
434 | "cell_type": "markdown",
435 | "metadata": {},
436 | "source": [
437 | "Here on using Naive Bayes Classifier we get a score of 0.55 , We know this is not a good score and we can use other classifiers and fit the model and check their score. "
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": 24,
443 | "metadata": {},
444 | "outputs": [
445 | {
446 | "data": {
447 | "text/plain": [
448 | "0.9088888888888889"
449 | ]
450 | },
451 | "execution_count": 24,
452 | "metadata": {},
453 | "output_type": "execute_result"
454 | }
455 | ],
456 | "source": [
457 | "from sklearn.ensemble import RandomForestClassifier\n",
458 | "\n",
459 | "clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train1, y_train1)\n",
460 | "clf.score(X_test1,y_test1)"
461 | ]
462 | },
463 | {
464 | "cell_type": "markdown",
465 | "metadata": {},
466 | "source": [
467 | "On Using RandomForestClassifier we get a score of 0.908 which is a great score and also much better than what Naive Bayes Classifier performed . "
468 | ]
469 | },
470 | {
471 | "cell_type": "markdown",
472 | "metadata": {},
473 | "source": [
474 | "## Using dummy classifier as baseline"
475 | ]
476 | },
477 | {
478 | "cell_type": "code",
479 | "execution_count": 25,
480 | "metadata": {},
481 | "outputs": [
482 | {
483 | "data": {
484 | "text/plain": [
485 | "0.9044444444444445"
486 | ]
487 | },
488 | "execution_count": 25,
489 | "metadata": {},
490 | "output_type": "execute_result"
491 | }
492 | ],
493 | "source": [
494 | "from sklearn.dummy import DummyClassifier\n",
495 | "dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train1, y_train1)\n",
496 | "y_dummy_predictions = dummy_majority.predict(X_test)\n",
497 | "dummy_majority.score(X_test1, y_test1)"
498 | ]
499 | },
500 | {
501 | "cell_type": "markdown",
502 | "metadata": {},
503 | "source": [
504 | "We observe that the RandomForest classsifier score is not much compared to dummy classifier which also has a score of more than .90 . Which shows that RandomForest is not a right fit for the model despite the good score.\n",
505 | "This makes us realise that we need a better model which scores better ."
506 | ]
507 | },
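{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal added sketch (not part of the original notebook): the F-score motivation mentioned\n",
"# above can be made concrete. Accuracy hides the imbalance, while recall/F1 on the minority class\n",
"# expose how weak the 'most_frequent' dummy baseline really is.\n",
"# Assumes X_test1, y_test1 and dummy_majority from the cells above.\n",
"from sklearn.metrics import f1_score, recall_score\n",
"\n",
"dummy_preds = dummy_majority.predict(X_test1)\n",
"print('Dummy recall on class 1:', recall_score(y_test1, dummy_preds))\n",
"print('Dummy F1 on class 1    :', f1_score(y_test1, dummy_preds))\n"
]
},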
508 | {
509 | "cell_type": "code",
510 | "execution_count": 27,
511 | "metadata": {},
512 | "outputs": [
513 | {
514 | "data": {
515 | "text/plain": [
516 | "0.9955555555555555"
517 | ]
518 | },
519 | "execution_count": 27,
520 | "metadata": {},
521 | "output_type": "execute_result"
522 | }
523 | ],
524 | "source": [
525 | "from sklearn.svm import SVC\n",
526 | "svm = SVC(kernel='rbf', C=1).fit(X_train1, y_train1)\n",
527 | "svm.score(X_test1, y_test1)"
528 | ]
529 | },
530 | {
531 | "cell_type": "markdown",
532 | "metadata": {},
533 | "source": [
534 | "On using SVM classifier using RBF kernel for the model,gives a whoping score of 0.99 which is a good score as well as it performce better than dummy classifier which is our baseline. "
535 | ]
536 | },
537 | {
538 | "cell_type": "markdown",
539 | "metadata": {},
540 | "source": [
541 | "*Thus, Dummy Classifier works as a baseline and gives an idea of the performance of the model on dataset*\n"
542 | ]
543 | },
544 | {
545 | "cell_type": "code",
546 | "execution_count": null,
547 | "metadata": {},
548 | "outputs": [],
549 | "source": []
550 | }
551 | ],
552 | "metadata": {
553 | "kernelspec": {
554 | "display_name": "Python 3",
555 | "language": "python",
556 | "name": "python3"
557 | },
558 | "language_info": {
559 | "codemirror_mode": {
560 | "name": "ipython",
561 | "version": 3
562 | },
563 | "file_extension": ".py",
564 | "mimetype": "text/x-python",
565 | "name": "python",
566 | "nbconvert_exporter": "python",
567 | "pygments_lexer": "ipython3",
568 | "version": "3.7.6"
569 | }
570 | },
571 | "nbformat": 4,
572 | "nbformat_minor": 4
573 | }
574 |
--------------------------------------------------------------------------------
/Hypothesis Testing/Hypo_Testing.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Hypo_Testing.ipynb",
7 | "provenance": [],
8 | "toc_visible": true
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | },
14 | "language_info": {
15 | "name": "python"
16 | }
17 | },
18 | "cells": [
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "id": "4uItjVLu5mXc"
23 | },
24 | "source": [
25 | "# T-test\n",
26 | "\n",
27 | "A t-test is a type of inferential statistic which is used to determine if there\n",
28 | "is a significant difference between the means of two groups which may be related in certain features.\n",
29 | "\n",
30 | "T-test has 2 types : 1. one sampled t-test 2. two-sampled t-test.\n",
31 | "One sample t-test : The One Sample t Test determines whether the\n",
32 | "sample mean is statistically different from a known or hypothesised population mean. The One Sample t Test is a parametric test."
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "metadata": {
38 | "colab": {
39 | "base_uri": "https://localhost:8080/"
40 | },
41 | "id": "y90dateE5d9E",
42 | "outputId": "69e603b5-689b-47d9-c61a-cc1d3f3b3279"
43 | },
44 | "source": [
45 | "#-----------------------------------------T-test---------------------------------#\n",
46 | "\n",
47 | "\n",
48 | "from scipy.stats import ttest_1samp\n",
49 | "import numpy as np\n",
50 | "\n",
51 | "#10 ages and you are checking whether avg age is 30 or not.\n",
52 | "#H0: The average age is 30\n",
53 | "#H1: The average age is not 30.\n",
54 | "ages = np.array([32,34,29,29,22,39,38,37,38,36,30,26,22,22])\n",
55 | "print(ages)\n",
56 | "#mean of the age \n",
57 | "ages_mean = np.mean(ages)\n",
58 | "print(ages_mean)\n",
59 | "#One Sample t-test\n",
60 | "tset, pval = ttest_1samp(ages, 30)\n",
61 | "print('p-values',pval)\n",
62 | "if pval < 0.05: # alpha value is 0.05 or 5%\n",
63 | " print(\" we are rejecting null hypothesis\")\n",
64 | "else:\n",
65 | " print(\"we are accepting null hypothesis\")\n"
66 | ],
67 | "execution_count": 1,
68 | "outputs": [
69 | {
70 | "output_type": "stream",
71 | "text": [
72 | "[32 34 29 29 22 39 38 37 38 36 30 26 22 22]\n",
73 | "31.0\n",
74 | "p-values 0.5605155888171379\n",
75 | "we are accepting null hypothesis\n"
76 | ],
77 | "name": "stdout"
78 | }
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {
84 | "id": "Lie2acfC5lrP"
85 | },
86 | "source": [
87 | "# Z-Test\n",
88 | "\n",
89 | "Z test is used if:\n",
90 | "Your sample size is greater than 30. Otherwise, use a t test.\n",
91 | "Data points should be independent from each other. In other words, one data point isn’t related or doesn’t affect another data point.\n",
92 | "Your data should be normally distributed. However, for large sample sizes (over 30) this doesn’t always matter.\n",
93 | "Your data should be randomly selected from a population, where each item has an equal chance of being selected.\n",
94 | "Sample sizes should be equal if at all possible."
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "metadata": {
100 | "colab": {
101 | "base_uri": "https://localhost:8080/"
102 | },
103 | "id": "uTfU9vxz6Dhc",
104 | "outputId": "0da843b1-50b7-4a44-9ad0-d37f9e7479cd"
105 | },
106 | "source": [
107 | "#-----------One Sample Z-test-----------#\n",
108 | "import pandas as pd\n",
109 | "from scipy import stats\n",
110 | "from statsmodels.stats import weightstats as stests\n",
111 | "df = pd.read_csv(\"blood_pressure.csv\")\n",
112 | "\n",
113 | "ztest ,pval = stests.ztest(df['bp_before'], x2=None, value=156)\n",
114 | "print('One-sample z-test')\n",
115 | "print(float(pval))\n",
116 | "if pval<0.05:\n",
117 | " print(\"reject null hypothesis\")\n",
118 | "else:\n",
119 | " print(\"accept null hypothesis\")\n",
120 | "\n",
121 | "#-----------Two Sample Z-test-----------#\n",
122 | "#Two-sample Z test- Just check two independent data groups and decide whether sample mean of two group is equal or not.\n",
123 | "#H0 : mean of two group is 0\n",
124 | "#H1 : mean of two group is not 0\n",
125 | "#Example : we are checking in blood data after blood and before blood data.\n",
126 | "\n",
127 | "ztest ,pval1 = stests.ztest(df['bp_before'], x2=df['bp_after'], value=0,alternative='two-sided')\n",
128 | "print('Two-sample z-test')\n",
129 | "print(float(pval1))\n",
130 | "if pval1<0.05:\n",
131 | " print(\"reject null hypothesis\")\n",
132 | "else:\n",
133 | " print(\"accept null hypothesis\")"
134 | ],
135 | "execution_count": 2,
136 | "outputs": [
137 | {
138 | "output_type": "stream",
139 | "text": [
140 | "One-sample z-test\n",
141 | "0.6651614730255063\n",
142 | "accept null hypothesis\n",
143 | "Two-sample z-test\n",
144 | "0.002162306611369422\n",
145 | "reject null hypothesis\n"
146 | ],
147 | "name": "stdout"
148 | },
149 | {
150 | "output_type": "stream",
151 | "text": [
152 | "/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.\n",
153 | " import pandas.util.testing as tm\n"
154 | ],
155 | "name": "stderr"
156 | }
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {
162 | "id": "O0AqvlnM6aLN"
163 | },
164 | "source": [
165 | "# ANOVA (F-TEST) :- \n",
166 | "The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time.\n",
167 | "For example, if we wanted to test whether voter age differs based on some categorical variable like race, we have to compare the means of each level or group the variable. \n",
168 | "The analysis of variance or ANOVA is a statistical inference test that lets you compare multiple groups at the same time.\n"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "metadata": {
174 | "colab": {
175 | "base_uri": "https://localhost:8080/"
176 | },
177 | "id": "uZRF_2qM6Qv_",
178 | "outputId": "64439f9f-a116-497d-90e6-d43e0bd552dd"
179 | },
180 | "source": [
181 | "\n",
182 | "#------------------------One Way F-test(Anova)------------------------# \n",
183 | "#To tell whether two or more groups are similar or not based on their mean similarity and f-score.\n",
184 | "#Example : there are 3 different category of plant and their weight and need to check whether all 3 group are similar or not.\n",
185 | "import pandas as pd\n",
186 | "from scipy import stats\n",
187 | "from statsmodels.stats import weightstats as stests\n",
188 | "print('One-way Anova')\n",
189 | "df_anova = pd.read_csv('PlantGrowth.csv')\n",
190 | "df_anova = df_anova[['weight','group']]\n",
191 | "grps = pd.unique(df_anova.group.values)\n",
192 | "d_data = {grp:df_anova['weight'][df_anova.group == grp] for grp in grps}\n",
193 | " \n",
194 | "F, p = stats.f_oneway(d_data['ctrl'], d_data['trt1'], d_data['trt2'])\n",
195 | "print(\"p-value for significance is: \", p)\n",
196 | "if p<0.05:\n",
197 | " print(\"reject null hypothesis\")\n",
198 | "else:\n",
199 | " print(\"accept null hypothesis\")\n",
200 | "\n",
201 | "#------------------------------------Two Way F-test-----------------------------------# \n",
202 | "#Two way F-test is extension of 1-way f-test, it is used when we have 2 independent variable and 2+ groups.\n",
203 | "#2-way F-test does not tell which variable is dominant. If we need to check individual significance then Post-hoc testing need to be performed.\n",
204 | "\n",
205 | "#e.g: Grand mean crop yield (the mean crop yield not by any sub-group), as well the mean crop yield by each factor, \n",
206 | "# as well as by the factors grouped together.\n",
207 | "import statsmodels.api as sm\n",
208 | "from statsmodels.formula.api import ols\n",
209 | "print('Two-way ANova')\n",
210 | "df_anova2 = pd.read_csv(\"https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master/crop_yield.csv\")\n",
211 | "model = ols('Yield ~ C(Fert)*C(Water)', df_anova2).fit()\n",
212 | "print(f\"Overall model F({model.df_model: .0f},{model.df_resid: .0f}) = {model.fvalue: .3f}, p = {model.f_pvalue: .4f}\")\n",
213 | "res = sm.stats.anova_lm(model, typ= 2)\n",
214 | "print(res)"
215 | ],
216 | "execution_count": 3,
217 | "outputs": [
218 | {
219 | "output_type": "stream",
220 | "text": [
221 | "One-way Anova\n",
222 | "p-value for significance is: 0.0159099583256229\n",
223 | "reject null hypothesis\n",
224 | "Two-way ANova\n",
225 | "Overall model F( 3, 16) = 4.112, p = 0.0243\n",
226 | " sum_sq df F PR(>F)\n",
227 | "C(Fert) 69.192 1.0 5.766000 0.028847\n",
228 | "C(Water) 63.368 1.0 5.280667 0.035386\n",
229 | "C(Fert):C(Water) 15.488 1.0 1.290667 0.272656\n",
230 | "Residual 192.000 16.0 NaN NaN\n"
231 | ],
232 | "name": "stdout"
233 | }
234 | ]
235 | }
236 | ]
237 | }
--------------------------------------------------------------------------------
/Hypothesis Testing/Hypothesis Testing.md:
--------------------------------------------------------------------------------
1 | # What is Hypothesis Testing?
2 |
3 | Hypothesis testing is a statistical method that is used in making statistical decisions using experimental data.
4 | It evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.
5 |
6 | ## Important Parameters:
7 |
8 | **Null Hypothesis** :
9 | 1) A statement about a population parameter.
10 | 2) Contains: '=', '<=', '>='
11 | 3) We assume the null hypothesis to be true, and then decide whether to reject it in favour of the alternative hypothesis.
12 |
13 | **Alternative Hypothesis** :
14 | 1) A statement that directly contradicts the null hypothesis.
15 | 2) Contains : 'Not equal to', '>', '<'
16 |
17 | **Level Of Significance**:
18 | The degree of significance at which we accept or reject the null hypothesis. (Usually taken as 5%, which means you should be 95% confident of getting a similar result in each sample.)
19 |
20 | **Type I error:**
21 | When we reject the null hypothesis, although that hypothesis was true. (alpha)
22 |
23 | **Type II error :**
24 | When we accept the null hypothesis but it is false. (Beta)
25 |
26 | **One tailed test :-**
27 | A test of a statistical hypothesis, where the region of rejection is on only one side of the sampling distribution, is called a one-tailed test.
28 |
29 | A box has ≥ 40 chocolates.
30 |
31 | **Two-tailed test :-**
32 | A two-tailed test is a statistical test in which the critical area of a distribution is two-sided and tests whether a sample is greater than or less than a certain range of values. If the sample being tested falls into either of the critical areas, the alternative hypothesis is accepted instead of the null hypothesis.
33 |
34 | A box != 40 chocolates.
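
A quick sketch of the difference in code (illustrative only, with made-up numbers; the `alternative` argument needs SciPy >= 1.6):

```python
from scipy.stats import ttest_1samp

counts = [41, 39, 42, 40, 38, 43, 41, 40]   # hypothetical chocolate counts per box
# One-tailed: is the mean number of chocolates greater than 40?
print(ttest_1samp(counts, 40, alternative='greater'))
# Two-tailed: is the mean number of chocolates different from 40?
print(ttest_1samp(counts, 40, alternative='two-sided'))
```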
35 |
36 | **P-value**
37 | P-value or the calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (H0) of a study question is true.
38 | If your P value is less than the chosen significance level then you reject the null hypothesis i.e. accept that your sample gives reasonable evidence to support the alternative hypothesis.
39 |
40 | Example: You have a coin and you don’t know whether it is fair or tricky, so let’s set up the null and alternative hypotheses:
41 | H0 : The coin is a fair coin.
42 | H1 : The coin is a tricky coin.
43 | Confidence level : 95%
44 | alpha = 5% or 0.05
45 | 
46 | Now let’s toss the coin and calculate the p-value (probability value).
47 | Toss the coin a 1st time and the result is tails: p-value = 50% (as heads and tails have equal probability).
48 | Toss the coin a 2nd time and the result is tails again: now p-value = 50/2 = 25%.
49 | Similarly, after 6 consecutive tails the p-value is about 1.5% (0.5^6 ≈ 1.56%). We set our significance level at 5%, and here the p-value has fallen below that level (p-value < alpha), i.e. our null hypothesis does not hold up. So we reject the null hypothesis and conclude that the coin is, in fact, a tricky coin.
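
The coin arithmetic above can be reproduced with a short snippet (illustrative only; uses `scipy.stats.binomtest`, available in SciPy >= 1.7):

```python
from scipy.stats import binomtest

alpha = 0.05                                   # 5% significance level
# Probability of seeing 6 or more tails in 6 tosses of a fair coin (one-tailed)
result = binomtest(k=6, n=6, p=0.5, alternative='greater')
print(result.pvalue)                           # ~0.0156, i.e. about 1.5%
if result.pvalue < alpha:
    print("reject the null hypothesis: the coin looks tricky")
else:
    print("fail to reject the null hypothesis")
```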
50 |
51 | Some of the widely used hypothesis tests and codes are discussed.
52 |
53 |
54 |
55 |
--------------------------------------------------------------------------------
/Hypothesis Testing/PlantGrowth.csv:
--------------------------------------------------------------------------------
1 | "","weight","group"
2 | "1",4.17,"ctrl"
3 | "2",5.58,"ctrl"
4 | "3",5.18,"ctrl"
5 | "4",6.11,"ctrl"
6 | "5",4.5,"ctrl"
7 | "6",4.61,"ctrl"
8 | "7",5.17,"ctrl"
9 | "8",4.53,"ctrl"
10 | "9",5.33,"ctrl"
11 | "10",5.14,"ctrl"
12 | "11",4.81,"trt1"
13 | "12",4.17,"trt1"
14 | "13",4.41,"trt1"
15 | "14",3.59,"trt1"
16 | "15",5.87,"trt1"
17 | "16",3.83,"trt1"
18 | "17",6.03,"trt1"
19 | "18",4.89,"trt1"
20 | "19",4.32,"trt1"
21 | "20",4.69,"trt1"
22 | "21",6.31,"trt2"
23 | "22",5.12,"trt2"
24 | "23",5.54,"trt2"
25 | "24",5.5,"trt2"
26 | "25",5.37,"trt2"
27 | "26",5.29,"trt2"
28 | "27",4.92,"trt2"
29 | "28",6.15,"trt2"
30 | "29",5.8,"trt2"
31 | "30",5.26,"trt2"
32 |
--------------------------------------------------------------------------------
/Hypothesis Testing/blood_pressure.csv:
--------------------------------------------------------------------------------
1 | patient,sex,agegrp,bp_before,bp_after
2 | 1,Male,30-45,143,153
3 | 2,Male,30-45,163,170
4 | 3,Male,30-45,153,168
5 | 4,Male,30-45,153,142
6 | 5,Male,30-45,146,141
7 | 6,Male,30-45,150,147
8 | 7,Male,30-45,148,133
9 | 8,Male,30-45,153,141
10 | 9,Male,30-45,153,131
11 | 10,Male,30-45,158,125
12 | 11,Male,30-45,149,164
13 | 12,Male,30-45,173,159
14 | 13,Male,30-45,165,135
15 | 14,Male,30-45,145,159
16 | 15,Male,30-45,143,153
17 | 16,Male,30-45,152,126
18 | 17,Male,30-45,141,162
19 | 18,Male,30-45,176,134
20 | 19,Male,30-45,143,136
21 | 20,Male,30-45,162,150
22 | 21,Male,46-59,149,168
23 | 22,Male,46-59,156,155
24 | 23,Male,46-59,151,136
25 | 24,Male,46-59,159,132
26 | 25,Male,46-59,164,160
27 | 26,Male,46-59,154,160
28 | 27,Male,46-59,152,136
29 | 28,Male,46-59,142,183
30 | 29,Male,46-59,162,152
31 | 30,Male,46-59,155,162
32 | 31,Male,46-59,175,151
33 | 32,Male,46-59,184,139
34 | 33,Male,46-59,167,175
35 | 34,Male,46-59,148,184
36 | 35,Male,46-59,170,151
37 | 36,Male,46-59,159,171
38 | 37,Male,46-59,149,157
39 | 38,Male,46-59,140,159
40 | 39,Male,46-59,185,140
41 | 40,Male,46-59,160,174
42 | 41,Male,60+,157,167
43 | 42,Male,60+,158,158
44 | 43,Male,60+,162,168
45 | 44,Male,60+,160,159
46 | 45,Male,60+,180,153
47 | 46,Male,60+,155,164
48 | 47,Male,60+,172,169
49 | 48,Male,60+,157,148
50 | 49,Male,60+,171,185
51 | 50,Male,60+,170,163
52 | 51,Male,60+,175,146
53 | 52,Male,60+,175,160
54 | 53,Male,60+,172,175
55 | 54,Male,60+,173,163
56 | 55,Male,60+,170,185
57 | 56,Male,60+,164,146
58 | 57,Male,60+,147,176
59 | 58,Male,60+,154,147
60 | 59,Male,60+,172,161
61 | 60,Male,60+,162,164
62 | 61,Female,30-45,152,149
63 | 62,Female,30-45,147,142
64 | 63,Female,30-45,144,146
65 | 64,Female,30-45,144,138
66 | 65,Female,30-45,158,131
67 | 66,Female,30-45,147,145
68 | 67,Female,30-45,154,134
69 | 68,Female,30-45,151,135
70 | 69,Female,30-45,149,131
71 | 70,Female,30-45,138,135
72 | 71,Female,30-45,162,133
73 | 72,Female,30-45,157,135
74 | 73,Female,30-45,141,168
75 | 74,Female,30-45,167,144
76 | 75,Female,30-45,147,147
77 | 76,Female,30-45,143,151
78 | 77,Female,30-45,142,149
79 | 78,Female,30-45,166,147
80 | 79,Female,30-45,147,149
81 | 80,Female,30-45,142,135
82 | 81,Female,46-59,157,127
83 | 82,Female,46-59,170,150
84 | 83,Female,46-59,150,138
85 | 84,Female,46-59,150,147
86 | 85,Female,46-59,167,157
87 | 86,Female,46-59,154,146
88 | 87,Female,46-59,143,148
89 | 88,Female,46-59,157,136
90 | 89,Female,46-59,149,146
91 | 90,Female,46-59,161,132
92 | 91,Female,46-59,142,145
93 | 92,Female,46-59,162,132
94 | 93,Female,46-59,144,157
95 | 94,Female,46-59,142,140
96 | 95,Female,46-59,159,137
97 | 96,Female,46-59,140,154
98 | 97,Female,46-59,144,169
99 | 98,Female,46-59,142,145
100 | 99,Female,46-59,145,137
101 | 100,Female,46-59,145,143
102 | 101,Female,60+,168,178
103 | 102,Female,60+,142,141
104 | 103,Female,60+,147,149
105 | 104,Female,60+,148,148
106 | 105,Female,60+,162,138
107 | 106,Female,60+,170,143
108 | 107,Female,60+,173,167
109 | 108,Female,60+,151,158
110 | 109,Female,60+,155,152
111 | 110,Female,60+,163,154
112 | 111,Female,60+,183,161
113 | 112,Female,60+,159,143
114 | 113,Female,60+,148,159
115 | 114,Female,60+,151,177
116 | 115,Female,60+,165,142
117 | 116,Female,60+,152,152
118 | 117,Female,60+,161,152
119 | 118,Female,60+,165,174
120 | 119,Female,60+,149,151
121 | 120,Female,60+,185,163
122 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 Pankhuri Saxena
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Possible project for Kharagpur Winter of Code 2020
2 |
3 | # Statistics and Econometrics for Data Science
4 | 
5 | 
6 | 
7 | 
8 | 
9 |
10 | ## Table of Contents
11 | 1. How are the topics even related to ML?
12 | 2. What will the project entail?
13 | 3. How to start with the project?
14 | 4. What are the prerequisites for the project?
15 | 5. What can you contribute to the project?
16 | 6. Expectations from the project
17 | 7. How much is ML and how much is statistics/econometrics?
18 | 8. Who to contact?
19 |
20 |
21 |
22 | ## How are the topics even related to ML?
23 | Often while building models in ML we become too concerned with accuracy and forget whether
24 | the model does what we initially set out to do. Statistics and Econometrics help in
25 | building better models and understanding the data. They can help in better feature engineering,
26 | and a better understanding of the assumptions which can help in ultimately building better models.
27 | Running linear regression sounds easy, but what if someone asks you what assumptions you made
28 | while running the model? If your answer is "Umm..." then you are on track to understanding
29 | what these topics can contribute to ML (if you didn't already know).
30 |
31 | Due to certain limitations, for the time being, we are concerned with only Linear Regression.
32 | This is just a very small subset of ML but let's start with tiny steps to progress.
33 |
34 |
35 |
36 | ## What will the project entail?
37 | The project aims to have a series of notebooks that will help in understanding the basic topics.
38 | The notebooks could be used to get a broad overview of the topic or to quickly revise the topic.
39 | The notebooks can be helpful in the following ways:
40 | - You are participating in a competition and you want to run some quick checks on the data/model
41 | - You are sitting for internship/placement and need to revise some topics fast
42 | - You want some code snippet for a certain test and how to interpret the test results.
43 |
44 |
45 |
46 | ## How to start with the project?
47 | 1. Install Jupyter Notebook, recommended installing with [Anaconda](https://www.anaconda.com/products/individual)
48 | 2. Learn how to use Jupyter Notebook, and python libraries NumPy, pandas, and matplotlib
49 | 3. Clone this repo and make a new branch
50 | 4. Each ipynb file should be able to stand independently so you should be able to open it using Jupyter Notebook
51 |
52 |
53 |
54 | ## What are the prerequisites for the project?
55 | - Basic knowledge of at least one programming language (preferably Python)
56 | - Basic knowledge of probability (class 12 level)
57 | - Desire to learn statistics
58 |
59 |
60 |
61 | ## What can you contribute to the project?
62 | Easy: Make some changes to the existing graphs or explanation to make them look better,
63 | add new ideas to 'ideas.md', check if existing notebooks make sense
64 |
65 | Intermediate: Start with a new notebook of your own
66 |
67 | Advanced: Make a series of notebooks or explain a complicated/advanced topic
68 |
69 |
70 |
71 | ## Expectations from the project
72 | There will be a variety of issues, some easy to get you started and some harder to make you
73 | significantly contribute. But I'll set down the minimum expected work that you should do to
74 | pass. By mid-evals, you should have at least one new notebook and by end-evals, you should have
75 | at least three new notebooks ready. Each notebook should have some introduction to the topic,
76 | mathematical proofs if required, the code to implement that topic from scratch, and any ready-made
77 | library code, if available.
78 |
79 | The notebooks referred to here are Jupyter Notebooks.
80 |
81 |
82 |
83 | ## How much is ML and how much is statistics/econometrics?
84 | Well, your learning from this will be less towards ML. These topics are to provide support to ML
85 | and do not replace the importance of doing a course/project purely based on machine learning.
86 |
87 |
88 |
89 | ## Who to contact?
90 | The project was started by PetalsOnWind (Pankhuri Saxena, a fourth-year Economics student at IIT KGP).
91 | She can be reached at pankhurisaxena[dot]iitkgp[at]gmail[dot]com.
92 |
93 | ## Contributors:
94 |
95 | ### Credit goes to these people: ✨
96 |
97 |
--------------------------------------------------------------------------------
/notebooks/Linear_Discriminant_Analysis.ipynb:
--------------------------------------------------------------------------------
67 | "- LDA fails to find the lower-dimensional space if the dimensions are much higher than\n",
68 | "  the number of samples in the data matrix.\n",
69 | "- LDA produces at most C-1 feature projections: if the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features.\n",
70 | "- LDA is a parametric method since it assumes unimodal Gaussian likelihoods: if the distributions are significantly non-Gaussian, the LDA projections will not be able to preserve any complex structure of the data that may be needed for classification.\n",
71 | "- LDA will fail when the discriminatory information is not in the mean, but rather in the variance of the data.\n",
"\n",
"### Differences between PCA and LDA:\n",
"- PCA is an unsupervised algorithm while LDA is a supervised algorithm.\n",
82 | "- The goal of PCA is to maximize variation in the given dataset, while LDA focuses on\n",
83 | "  maximizing separability among known categories.\n",
84 | "- LDA performs better on multi-class classification tasks than PCA. However, PCA performs better when the sample size is comparatively small. An example would be comparisons between classification accuracies used in image classification.\n",
85 | "\n",
86 | " "
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "### Following are the extensions of LDA in case we need to use non-linear discriminant analysis:\n",
94 | "
\n",
95 | "
Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or covariance when there are multiple input variables).
\n",
96 | "
Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs is used such as splines.
\n",
97 | "
Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA.
Face Recognition: In the field of Computer Vision, face recognition is a very popular application in which each face is represented by a very large number of pixel values. Linear discriminant analysis (LDA) is used here to reduce the number of features to a more manageable number before the process of classification. Each of the new dimensions generated is a linear combination of pixel values, which form a template. The linear combinations obtained using Fisher’s linear discriminant are called Fisher faces.
\n",
108 | "
Medical: In this field, Linear discriminant analysis (LDA) is used to classify the patient disease state as mild, moderate or severe based upon the patient various parameters and the medical treatment he is going through. This helps the doctors to intensify or reduce the pace of their treatment.
\n",
109 | "
Customer Identification: Suppose we want to identify the type of customers which are most likely to buy a particular product in a shopping mall. By doing a simple question and answers survey, we can gather all the features of the customers. Here, Linear discriminant analysis will help us to identify and select the features which can describe the characteristics of the group of customers that are most likely to buy that particular product in the shopping mall.
\n",
110 | "
"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": 1,
116 | "metadata": {},
117 | "outputs": [],
118 | "source": [
119 | "# Import necessary modules\n",
120 | "import numpy as np\n",
121 | "import pandas as np\n",
122 | "from sklearn.datasets import make_classification\n",
123 | "from sklearn.model_selection import cross_val_score\n",
124 | "from sklearn.model_selection import RepeatedStratifiedKFold\n",
125 | "from sklearn.pipeline import Pipeline\n",
126 | "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n",
127 | "from sklearn.naive_bayes import GaussianNB"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 2,
133 | "metadata": {},
134 | "outputs": [],
135 | "source": [
136 | "# Generating data for our problem\n",
137 | "X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7, n_classes=10)"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 3,
143 | "metadata": {},
144 | "outputs": [
145 | {
146 | "name": "stdout",
147 | "output_type": "stream",
148 | "text": [
149 | "[[ 2.3548775 -1.69674567 1.6193882 ... -3.33390362 2.45147541\n",
150 | " -1.23455205]\n",
151 | " [ 2.0204277 -1.62734821 -2.27697377 ... -0.28274722 -7.28166465\n",
152 | " -0.91070347]\n",
153 | " [-1.02400669 1.01276423 1.05505825 ... 3.83923974 -1.63530582\n",
154 | " 3.96050914]\n",
155 | " ...\n",
156 | " [-0.36448581 -0.2996303 2.21875138 ... -1.11303373 3.67576043\n",
157 | " -1.44164572]\n",
158 | " [ 0.05614772 1.87270289 -2.63165761 ... -3.07434527 2.31606352\n",
159 | " 1.65068838]\n",
160 | " [ 1.09853247 1.61067335 2.7977282 ... -1.62233539 14.09727916\n",
161 | " 2.27215759]]\n"
162 | ]
163 | }
164 | ],
165 | "source": [
166 | "print(X)"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": 4,
172 | "metadata": {},
173 | "outputs": [
174 | {
175 | "name": "stdout",
176 | "output_type": "stream",
177 | "text": [
178 | "[9 4 4 1 2 7 9 4 6 9 1 3 9 5 6 4 9 2 0 4 4 8 0 9 3 8 6 0 0 7 8 3 8 5 5 9 7\n",
179 | " 1 3 1 8 7 6 7 4 6 5 6 2 8 3 1 7 0 7 0 4 5 1 6 6 8 3 3 3 2 5 8 0 5 6 2 7 1\n",
180 | " 3 8 7 2 8 0 6 8 2 9 9 8 2 2 5 6 9 4 6 1 4 9 0 9 7 8 7 2 0 8 8 1 7 9 8 1 6\n",
181 | " 3 9 2 5 5 3 9 1 1 2 0 8 0 7 2 0 5 0 1 8 0 2 2 1 3 0 2 5 9 3 8 8 7 7 0 4 3\n",
182 | " 4 0 8 5 3 7 4 4 5 5 0 4 5 1 1 4 3 5 2 6 4 2 1 6 6 9 5 3 7 0 1 5 9 5 4 7 3\n",
183 | " 9 0 0 1 9 5 2 2 7 4 0 1 2 6 4 3 7 6 8 8 3 0 8 3 0 5 5 1 7 8 6 8 4 1 1 3 1\n",
184 | " 9 9 3 2 8 1 8 1 7 6 1 1 7 6 5 3 4 1 6 5 2 8 6 5 9 0 6 9 6 2 3 4 8 3 8 4 8\n",
185 | " 1 0 4 0 6 3 8 4 6 9 2 9 2 7 5 1 6 3 0 6 9 3 7 1 5 5 0 9 4 8 9 2 8 2 9 3 2\n",
186 | " 3 5 1 8 0 0 6 5 1 3 2 8 1 8 6 7 3 2 5 9 6 2 3 4 2 1 5 4 2 9 5 1 7 1 6 0 2\n",
187 | " 8 6 1 8 7 8 0 3 0 7 1 0 4 1 4 2 0 8 2 7 9 7 3 5 1 5 1 4 9 0 4 9 5 0 8 9 1\n",
188 | " 2 9 2 8 4 7 9 7 8 4 9 1 7 8 3 7 3 1 9 6 2 9 4 6 8 1 1 5 6 3 0 3 4 8 7 5 6\n",
189 | " 9 9 6 4 8 2 6 2 7 0 6 8 0 7 0 1 5 7 3 2 2 3 5 2 1 3 6 9 5 4 3 6 7 9 2 4 2\n",
190 | " 5 0 2 7 4 5 9 1 3 1 8 6 3 1 1 3 3 7 6 6 5 5 8 7 8 9 5 0 7 4 6 3 9 4 7 4 3\n",
191 | " 5 7 6 7 6 7 9 7 7 7 7 5 7 5 1 6 3 2 6 5 1 0 6 0 1 5 8 9 6 6 3 6 3 6 0 0 8\n",
192 | " 9 7 6 4 6 8 3 3 5 2 6 3 3 9 2 8 9 2 5 8 6 1 4 4 6 0 9 6 4 3 4 4 2 0 7 3 3\n",
193 | " 4 9 0 5 3 6 4 8 3 5 2 5 8 2 1 5 4 2 3 7 8 0 1 4 0 6 8 2 7 4 8 1 4 3 5 0 3\n",
194 | " 8 3 1 9 9 6 0 8 0 7 1 9 2 7 8 6 0 2 3 8 8 8 2 9 0 3 1 4 3 9 9 2 5 0 3 4 1\n",
195 | " 3 6 6 2 6 2 2 6 5 4 2 6 3 2 7 2 3 3 3 2 2 2 9 7 9 0 9 0 5 3 6 0 3 8 2 6 3\n",
196 | " 7 5 0 2 4 8 9 9 4 2 8 3 6 9 6 7 1 0 4 4 4 1 7 9 6 4 9 7 1 8 0 1 8 9 7 5 4\n",
197 | " 8 3 5 6 6 8 1 2 2 3 0 0 0 9 8 0 3 8 7 9 5 4 6 6 0 1 5 5 1 6 4 7 1 2 0 3 4\n",
198 | " 0 4 0 7 5 7 0 3 8 3 0 9 7 5 6 2 8 5 2 5 3 7 9 1 0 2 2 1 1 9 2 9 2 8 0 4 5\n",
199 | " 4 0 1 6 6 5 2 5 0 1 7 6 5 0 0 3 4 2 1 6 6 5 4 3 3 4 9 4 2 3 5 1 4 5 1 7 8\n",
200 | " 7 0 6 9 5 5 9 2 9 8 1 7 0 1 9 9 9 3 2 5 5 6 2 1 7 4 0 3 5 7 7 7 1 2 2 8 9\n",
201 | " 1 7 3 9 0 2 1 6 1 4 3 6 6 0 1 3 2 8 4 0 7 4 7 9 8 7 1 6 0 1 4 2 3 5 9 5 7\n",
202 | " 8 2 0 9 0 0 1 0 6 3 1 9 6 8 2 2 8 9 7 3 4 9 7 4 0 5 4 4 1 7 2 8 4 6 1 8 8\n",
203 | " 3 4 7 5 7 0 5 8 4 5 8 5 9 6 7 1 5 1 6 9 2 1 9 7 2 4 0 7 3 7 5 4 5 7 8 3 5\n",
204 | " 2 9 4 0 5 4 9 6 9 5 4 2 1 7 2 3 4 1 7 4 4 8 3 3 7 4 5 4 8 0 7 9 2 7 8 6 8\n",
205 | " 6]\n"
206 | ]
207 | }
208 | ],
209 | "source": [
210 | "print(y)"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": 5,
216 | "metadata": {},
217 | "outputs": [
218 | {
219 | "name": "stdout",
220 | "output_type": "stream",
221 | "text": [
222 | "(1000, 20)\n"
223 | ]
224 | }
225 | ],
226 | "source": [
227 | "# Shape of inout data\n",
228 | "print(X.shape)"
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": 6,
234 | "metadata": {},
235 | "outputs": [],
236 | "source": [
237 | "# Defining model\n",
238 | "def model(n_components=None, solver='svd', shrinkage=None, priors=None,\n",
239 | " store_covariance=False, tol=0.0001, covariance_estimator=None):\n",
240 | " '''\n",
241 | " n_components : int, default=None\n",
242 | " Number of components (<= min(n_classes - 1, n_features)) for dimensionality reduction. \n",
243 | " If None, will be set to min(n_classes - 1, n_features). This parameter only affects the transform method.\n",
244 | " \n",
245 | " solver : {‘svd’, ‘lsqr’, ‘eigen’}, default=’svd’\n",
246 | " Solver to use, possible values:\n",
247 | " ‘svd’: Singular value decomposition (default). Does not compute the covariance matrix, therefore this solver is recommended for data with a large number of features.\n",
248 | " ‘lsqr’: Least squares solution. Can be combined with shrinkage or custom covariance estimator.\n",
249 | " ‘eigen’: Eigenvalue decomposition. Can be combined with shrinkage or custom covariance estimator.\n",
250 | " \n",
251 | " shrinkage : ‘auto’ or float, default=None\n",
252 | " Shrinkage parameter, possible values:\n",
253 | " None: no shrinkage (default).\n",
254 | " ‘auto’: automatic shrinkage using the Ledoit-Wolf lemma.\n",
255 | " float between 0 and 1: fixed shrinkage parameter.\n",
256 | " This should be left to None if covariance_estimator is used. Note that shrinkage works only with ‘lsqr’ and ‘eigen’ solvers.\n",
257 | "\n",
258 | " priors : array-like of shape (n_classes,), default=None\n",
259 | " The class prior probabilities. By default, the class proportions are inferred from the training data.\n",
260 | "\n",
261 | " store_covariance : bool, default=False\n",
262 | " If True, explicitely compute the weighted within-class covariance matrix when solver is ‘svd’. The matrix is always computed\n",
263 | " and stored for the other solvers.\n",
264 | "\n",
265 | " tol : float, default=1.0e-4\n",
266 | " Absolute threshold for a singular value of X to be considered significant, used to estimate the rank of X. Dimensions whose singular \n",
267 | " values are non-significant are discarded. Only used if solver is ‘svd’.\n",
268 | "\n",
269 | " covariance_estimator : covariance estimator, default=None\n",
270 | " If not None, covariance_estimator is used to estimate the covariance matrices instead of relying on the empirical \n",
271 | " covariance estimator (with potential shrinkage). The object should have a fit method and a covariance_ attribute \n",
272 | " like the estimators in sklearn.covariance. if None the shrinkage parameter drives the estimate.\n",
273 | " '''\n",
274 | " lda = LinearDiscriminantAnalysis(solver=solver, shrinkage=shrinkage, \n",
275 | " priors=priors, n_components=n_components, store_covariance=store_covariance, \n",
276 | " tol=tol, covariance_estimator=covariance_estimator)\n",
277 | " return lda"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 7,
283 | "metadata": {},
284 | "outputs": [
285 | {
286 | "data": {
287 | "text/plain": [
288 | "LinearDiscriminantAnalysis(n_components=5)"
289 | ]
290 | },
291 | "execution_count": 7,
292 | "metadata": {},
293 | "output_type": "execute_result"
294 | }
295 | ],
296 | "source": [
297 | "# Fitting the data to the model\n",
298 | "lda = model(5)\n",
299 | "lda.fit(X,y)\n",
300 | "\n"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 8,
306 | "metadata": {},
307 | "outputs": [],
308 | "source": [
309 | "# Transforming data \n",
310 | "data_transformation = lda.transform(X)"
311 | ]
312 | },
313 | {
314 | "cell_type": "code",
315 | "execution_count": 9,
316 | "metadata": {},
317 | "outputs": [
318 | {
319 | "name": "stdout",
320 | "output_type": "stream",
321 | "text": [
322 | "[[-1.34250698 -0.410752 -0.05284109 -2.52177124 -2.32197387]\n",
323 | " [ 0.92569633 -0.92633682 -0.29396574 -0.62144384 1.61682597]\n",
324 | " [-0.36265323 -0.87103112 1.53812275 0.59888243 -1.39423894]\n",
325 | " ...\n",
326 | " [-0.83323633 0.06686996 0.39414469 -0.5877848 0.11590941]\n",
327 | " [ 0.47329133 1.42040541 0.49439799 -0.05149737 -0.53591346]\n",
328 | " [-1.04969306 0.27613461 -0.13712968 -1.21293132 -0.22775809]]\n"
329 | ]
330 | }
331 | ],
332 | "source": [
333 | "print(data_transformation)"
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": 10,
339 | "metadata": {},
340 | "outputs": [
341 | {
342 | "name": "stdout",
343 | "output_type": "stream",
344 | "text": [
345 | "(1000, 5)\n"
346 | ]
347 | }
348 | ],
349 | "source": [
350 | "# Notice the reduction of dimensions\n",
351 | "print(data_transformation.shape)"
352 | ]
353 | },
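{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal added sketch of Quadratic Discriminant Analysis, one of the extensions listed earlier.\n",
"# It is illustrative only and assumes the y labels and the data_transformation array produced above;\n",
"# QDA is fitted here on the LDA-reduced features, and unlike LDA it estimates a separate covariance\n",
"# matrix for each class.\n",
"from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis\n",
"\n",
"qda = QuadraticDiscriminantAnalysis()\n",
"qda.fit(data_transformation, y)\n",
"# Training accuracy, just to show the fitted model is usable\n",
"print(qda.score(data_transformation, y))"
]
},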
354 | {
355 | "cell_type": "code",
356 | "execution_count": null,
357 | "metadata": {},
358 | "outputs": [],
359 | "source": []
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": null,
364 | "metadata": {},
365 | "outputs": [],
366 | "source": []
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": null,
371 | "metadata": {},
372 | "outputs": [],
373 | "source": []
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": null,
378 | "metadata": {},
379 | "outputs": [],
380 | "source": []
381 | },
382 | {
383 | "cell_type": "code",
384 | "execution_count": null,
385 | "metadata": {},
386 | "outputs": [],
387 | "source": []
388 | }
389 | ],
390 | "metadata": {
391 | "kernelspec": {
392 | "display_name": "Python 3",
393 | "language": "python",
394 | "name": "python3"
395 | },
396 | "language_info": {
397 | "codemirror_mode": {
398 | "name": "ipython",
399 | "version": 3
400 | },
401 | "file_extension": ".py",
402 | "mimetype": "text/x-python",
403 | "name": "python",
404 | "nbconvert_exporter": "python",
405 | "pygments_lexer": "ipython3",
406 | "version": "3.7.9"
407 | }
408 | },
409 | "nbformat": 4,
410 | "nbformat_minor": 4
411 | }
412 |
--------------------------------------------------------------------------------
/notebooks/PhiK Correlation/data_description.txt:
--------------------------------------------------------------------------------
1 | MSSubClass: Identifies the type of dwelling involved in the sale.
2 |
3 | 20 1-STORY 1946 & NEWER ALL STYLES
4 | 30 1-STORY 1945 & OLDER
5 | 40 1-STORY W/FINISHED ATTIC ALL AGES
6 | 45 1-1/2 STORY - UNFINISHED ALL AGES
7 | 50 1-1/2 STORY FINISHED ALL AGES
8 | 60 2-STORY 1946 & NEWER
9 | 70 2-STORY 1945 & OLDER
10 | 75 2-1/2 STORY ALL AGES
11 | 80 SPLIT OR MULTI-LEVEL
12 | 85 SPLIT FOYER
13 | 90 DUPLEX - ALL STYLES AND AGES
14 | 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
15 | 150 1-1/2 STORY PUD - ALL AGES
16 | 160 2-STORY PUD - 1946 & NEWER
17 | 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
18 | 190 2 FAMILY CONVERSION - ALL STYLES AND AGES
19 |
20 | MSZoning: Identifies the general zoning classification of the sale.
21 |
22 | A Agriculture
23 | C Commercial
24 | FV Floating Village Residential
25 | I Industrial
26 | RH Residential High Density
27 | RL Residential Low Density
28 | RP Residential Low Density Park
29 | RM Residential Medium Density
30 |
31 | LotFrontage: Linear feet of street connected to property
32 |
33 | LotArea: Lot size in square feet
34 |
35 | Street: Type of road access to property
36 |
37 | Grvl Gravel
38 | Pave Paved
39 |
40 | Alley: Type of alley access to property
41 |
42 | Grvl Gravel
43 | Pave Paved
44 | NA No alley access
45 |
46 | LotShape: General shape of property
47 |
48 | Reg Regular
49 | IR1 Slightly irregular
50 | IR2 Moderately Irregular
51 | IR3 Irregular
52 |
53 | LandContour: Flatness of the property
54 |
55 | Lvl Near Flat/Level
56 | Bnk Banked - Quick and significant rise from street grade to building
57 | HLS Hillside - Significant slope from side to side
58 | Low Depression
59 |
60 | Utilities: Type of utilities available
61 |
62 | AllPub All public Utilities (E,G,W,& S)
63 | NoSewr Electricity, Gas, and Water (Septic Tank)
64 | NoSeWa Electricity and Gas Only
65 | ELO Electricity only
66 |
67 | LotConfig: Lot configuration
68 |
69 | Inside Inside lot
70 | Corner Corner lot
71 | CulDSac Cul-de-sac
72 | FR2 Frontage on 2 sides of property
73 | FR3 Frontage on 3 sides of property
74 |
75 | LandSlope: Slope of property
76 |
77 | Gtl Gentle slope
78 | Mod Moderate Slope
79 | Sev Severe Slope
80 |
81 | Neighborhood: Physical locations within Ames city limits
82 |
83 | Blmngtn Bloomington Heights
84 | Blueste Bluestem
85 | BrDale Briardale
86 | BrkSide Brookside
87 | ClearCr Clear Creek
88 | CollgCr College Creek
89 | Crawfor Crawford
90 | Edwards Edwards
91 | Gilbert Gilbert
92 | IDOTRR Iowa DOT and Rail Road
93 | MeadowV Meadow Village
94 | Mitchel Mitchell
95 | Names North Ames
96 | NoRidge Northridge
97 | NPkVill Northpark Villa
98 | NridgHt Northridge Heights
99 | NWAmes Northwest Ames
100 | OldTown Old Town
101 | SWISU South & West of Iowa State University
102 | Sawyer Sawyer
103 | SawyerW Sawyer West
104 | Somerst Somerset
105 | StoneBr Stone Brook
106 | Timber Timberland
107 | Veenker Veenker
108 |
109 | Condition1: Proximity to various conditions
110 |
111 | Artery Adjacent to arterial street
112 | Feedr Adjacent to feeder street
113 | Norm Normal
114 | RRNn Within 200' of North-South Railroad
115 | RRAn Adjacent to North-South Railroad
116 | PosN Near positive off-site feature--park, greenbelt, etc.
117 | PosA Adjacent to postive off-site feature
118 | RRNe Within 200' of East-West Railroad
119 | RRAe Adjacent to East-West Railroad
120 |
121 | Condition2: Proximity to various conditions (if more than one is present)
122 |
123 | Artery Adjacent to arterial street
124 | Feedr Adjacent to feeder street
125 | Norm Normal
126 | RRNn Within 200' of North-South Railroad
127 | RRAn Adjacent to North-South Railroad
128 | PosN Near positive off-site feature--park, greenbelt, etc.
129 | PosA Adjacent to postive off-site feature
130 | RRNe Within 200' of East-West Railroad
131 | RRAe Adjacent to East-West Railroad
132 |
133 | BldgType: Type of dwelling
134 |
135 | 1Fam Single-family Detached
136 | 2FmCon Two-family Conversion; originally built as one-family dwelling
137 | Duplx Duplex
138 | TwnhsE Townhouse End Unit
139 | TwnhsI Townhouse Inside Unit
140 |
141 | HouseStyle: Style of dwelling
142 |
143 | 1Story One story
144 | 1.5Fin One and one-half story: 2nd level finished
145 | 1.5Unf One and one-half story: 2nd level unfinished
146 | 2Story Two story
147 | 2.5Fin Two and one-half story: 2nd level finished
148 | 2.5Unf Two and one-half story: 2nd level unfinished
149 | SFoyer Split Foyer
150 | SLvl Split Level
151 |
152 | OverallQual: Rates the overall material and finish of the house
153 |
154 | 10 Very Excellent
155 | 9 Excellent
156 | 8 Very Good
157 | 7 Good
158 | 6 Above Average
159 | 5 Average
160 | 4 Below Average
161 | 3 Fair
162 | 2 Poor
163 | 1 Very Poor
164 |
165 | OverallCond: Rates the overall condition of the house
166 |
167 | 10 Very Excellent
168 | 9 Excellent
169 | 8 Very Good
170 | 7 Good
171 | 6 Above Average
172 | 5 Average
173 | 4 Below Average
174 | 3 Fair
175 | 2 Poor
176 | 1 Very Poor
177 |
178 | YearBuilt: Original construction date
179 |
180 | YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
181 |
182 | RoofStyle: Type of roof
183 |
184 | Flat Flat
185 | Gable Gable
186 | Gambrel Gabrel (Barn)
187 | Hip Hip
188 | Mansard Mansard
189 | Shed Shed
190 |
191 | RoofMatl: Roof material
192 |
193 | ClyTile Clay or Tile
194 | CompShg Standard (Composite) Shingle
195 | Membran Membrane
196 | Metal Metal
197 | Roll Roll
198 | Tar&Grv Gravel & Tar
199 | WdShake Wood Shakes
200 | WdShngl Wood Shingles
201 |
202 | Exterior1st: Exterior covering on house
203 |
204 | AsbShng Asbestos Shingles
205 | AsphShn Asphalt Shingles
206 | BrkComm Brick Common
207 | BrkFace Brick Face
208 | CBlock Cinder Block
209 | CemntBd Cement Board
210 | HdBoard Hard Board
211 | ImStucc Imitation Stucco
212 | MetalSd Metal Siding
213 | Other Other
214 | Plywood Plywood
215 | PreCast PreCast
216 | Stone Stone
217 | Stucco Stucco
218 | VinylSd Vinyl Siding
219 | Wd Sdng Wood Siding
220 | WdShing Wood Shingles
221 |
222 | Exterior2nd: Exterior covering on house (if more than one material)
223 |
224 | AsbShng Asbestos Shingles
225 | AsphShn Asphalt Shingles
226 | BrkComm Brick Common
227 | BrkFace Brick Face
228 | CBlock Cinder Block
229 | CemntBd Cement Board
230 | HdBoard Hard Board
231 | ImStucc Imitation Stucco
232 | MetalSd Metal Siding
233 | Other Other
234 | Plywood Plywood
235 | PreCast PreCast
236 | Stone Stone
237 | Stucco Stucco
238 | VinylSd Vinyl Siding
239 | Wd Sdng Wood Siding
240 | WdShing Wood Shingles
241 |
242 | MasVnrType: Masonry veneer type
243 |
244 | BrkCmn Brick Common
245 | BrkFace Brick Face
246 | CBlock Cinder Block
247 | None None
248 | Stone Stone
249 |
250 | MasVnrArea: Masonry veneer area in square feet
251 |
252 | ExterQual: Evaluates the quality of the material on the exterior
253 |
254 | Ex Excellent
255 | Gd Good
256 | TA Average/Typical
257 | Fa Fair
258 | Po Poor
259 |
260 | ExterCond: Evaluates the present condition of the material on the exterior
261 |
262 | Ex Excellent
263 | Gd Good
264 | TA Average/Typical
265 | Fa Fair
266 | Po Poor
267 |
268 | Foundation: Type of foundation
269 |
270 | BrkTil Brick & Tile
271 | CBlock Cinder Block
272 | PConc	Poured Concrete
273 | Slab Slab
274 | Stone Stone
275 | Wood Wood
276 |
277 | BsmtQual: Evaluates the height of the basement
278 |
279 | Ex Excellent (100+ inches)
280 | Gd Good (90-99 inches)
281 | TA Typical (80-89 inches)
282 | Fa Fair (70-79 inches)
283 | Po	Poor (<70 inches)
284 | NA No Basement
285 |
286 | BsmtCond: Evaluates the general condition of the basement
287 |
288 | Ex Excellent
289 | Gd Good
290 | TA Typical - slight dampness allowed
291 | Fa Fair - dampness or some cracking or settling
292 | Po Poor - Severe cracking, settling, or wetness
293 | NA No Basement
294 |
295 | BsmtExposure: Refers to walkout or garden level walls
296 |
297 | Gd Good Exposure
298 | Av Average Exposure (split levels or foyers typically score average or above)
299 | Mn	Minimum Exposure
300 | No No Exposure
301 | NA No Basement
302 |
303 | BsmtFinType1: Rating of basement finished area
304 |
305 | GLQ Good Living Quarters
306 | ALQ Average Living Quarters
307 | BLQ Below Average Living Quarters
308 | Rec Average Rec Room
309 | LwQ Low Quality
310 | Unf	Unfinished
311 | NA No Basement
312 |
313 | BsmtFinSF1: Type 1 finished square feet
314 |
315 | BsmtFinType2: Rating of basement finished area (if multiple types)
316 |
317 | GLQ Good Living Quarters
318 | ALQ Average Living Quarters
319 | BLQ Below Average Living Quarters
320 | Rec Average Rec Room
321 | LwQ Low Quality
322 | Unf	Unfinished
323 | NA No Basement
324 |
325 | BsmtFinSF2: Type 2 finished square feet
326 |
327 | BsmtUnfSF: Unfinished square feet of basement area
328 |
329 | TotalBsmtSF: Total square feet of basement area
330 |
331 | Heating: Type of heating
332 |
333 | Floor Floor Furnace
334 | GasA Gas forced warm air furnace
335 | GasW Gas hot water or steam heat
336 | Grav Gravity furnace
337 | OthW Hot water or steam heat other than gas
338 | Wall Wall furnace
339 |
340 | HeatingQC: Heating quality and condition
341 |
342 | Ex Excellent
343 | Gd Good
344 | TA Average/Typical
345 | Fa Fair
346 | Po Poor
347 |
348 | CentralAir: Central air conditioning
349 |
350 | N No
351 | Y Yes
352 |
353 | Electrical: Electrical system
354 |
355 | SBrkr Standard Circuit Breakers & Romex
356 | FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
357 | FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
358 | FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
359 | Mix Mixed
360 |
361 | 1stFlrSF: First Floor square feet
362 |
363 | 2ndFlrSF: Second floor square feet
364 |
365 | LowQualFinSF: Low quality finished square feet (all floors)
366 |
367 | GrLivArea: Above grade (ground) living area square feet
368 |
369 | BsmtFullBath: Basement full bathrooms
370 |
371 | BsmtHalfBath: Basement half bathrooms
372 |
373 | FullBath: Full bathrooms above grade
374 |
375 | HalfBath: Half baths above grade
376 |
377 | BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
378 |
379 | KitchenAbvGr: Kitchens above grade
380 |
381 | KitchenQual: Kitchen quality
382 |
383 | Ex Excellent
384 | Gd Good
385 | TA Typical/Average
386 | Fa Fair
387 | Po Poor
388 |
389 | TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
390 |
391 | Functional: Home functionality (Assume typical unless deductions are warranted)
392 |
393 | Typ Typical Functionality
394 | Min1 Minor Deductions 1
395 | Min2 Minor Deductions 2
396 | Mod Moderate Deductions
397 | Maj1 Major Deductions 1
398 | Maj2 Major Deductions 2
399 | Sev Severely Damaged
400 | Sal Salvage only
401 |
402 | Fireplaces: Number of fireplaces
403 |
404 | FireplaceQu: Fireplace quality
405 |
406 | Ex Excellent - Exceptional Masonry Fireplace
407 | Gd Good - Masonry Fireplace in main level
408 | TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
409 | Fa Fair - Prefabricated Fireplace in basement
410 | Po Poor - Ben Franklin Stove
411 | NA No Fireplace
412 |
413 | GarageType: Garage location
414 |
415 | 2Types More than one type of garage
416 | Attchd Attached to home
417 | Basment Basement Garage
418 | BuiltIn Built-In (Garage part of house - typically has room above garage)
419 | CarPort Car Port
420 | Detchd Detached from home
421 | NA No Garage
422 |
423 | GarageYrBlt: Year garage was built
424 |
425 | GarageFinish: Interior finish of the garage
426 |
427 | Fin Finished
428 | RFn Rough Finished
429 | Unf Unfinished
430 | NA No Garage
431 |
432 | GarageCars: Size of garage in car capacity
433 |
434 | GarageArea: Size of garage in square feet
435 |
436 | GarageQual: Garage quality
437 |
438 | Ex Excellent
439 | Gd Good
440 | TA Typical/Average
441 | Fa Fair
442 | Po Poor
443 | NA No Garage
444 |
445 | GarageCond: Garage condition
446 |
447 | Ex Excellent
448 | Gd Good
449 | TA Typical/Average
450 | Fa Fair
451 | Po Poor
452 | NA No Garage
453 |
454 | PavedDrive: Paved driveway
455 |
456 | Y Paved
457 | P Partial Pavement
458 | N Dirt/Gravel
459 |
460 | WoodDeckSF: Wood deck area in square feet
461 |
462 | OpenPorchSF: Open porch area in square feet
463 |
464 | EnclosedPorch: Enclosed porch area in square feet
465 |
466 | 3SsnPorch: Three season porch area in square feet
467 |
468 | ScreenPorch: Screen porch area in square feet
469 |
470 | PoolArea: Pool area in square feet
471 |
472 | PoolQC: Pool quality
473 |
474 | Ex Excellent
475 | Gd Good
476 | TA Average/Typical
477 | Fa Fair
478 | NA No Pool
479 |
480 | Fence: Fence quality
481 |
482 | GdPrv Good Privacy
483 | MnPrv Minimum Privacy
484 | GdWo Good Wood
485 | MnWw Minimum Wood/Wire
486 | NA No Fence
487 |
488 | MiscFeature: Miscellaneous feature not covered in other categories
489 |
490 | Elev Elevator
491 | Gar2 2nd Garage (if not described in garage section)
492 | Othr Other
493 | Shed Shed (over 100 SF)
494 | TenC Tennis Court
495 | NA None
496 |
497 | MiscVal: $Value of miscellaneous feature
498 |
499 | MoSold: Month Sold (MM)
500 |
501 | YrSold: Year Sold (YYYY)
502 |
503 | SaleType: Type of sale
504 |
505 | WD Warranty Deed - Conventional
506 | CWD Warranty Deed - Cash
507 | VWD Warranty Deed - VA Loan
508 | New Home just constructed and sold
509 | COD Court Officer Deed/Estate
510 | Con Contract 15% Down payment regular terms
511 | ConLw Contract Low Down payment and low interest
512 | ConLI Contract Low Interest
513 | ConLD Contract Low Down
514 | Oth Other
515 |
516 | SaleCondition: Condition of sale
517 |
518 | Normal Normal Sale
519 | Abnorml Abnormal Sale - trade, foreclosure, short sale
520 | AdjLand Adjoining Land Purchase
521 | Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
522 | Family Sale between family members
523 | Partial Home was not completed when last assessed (associated with New Homes)
524 |
525 |
526 |
527 |
528 |
529 |
530 |
531 |
532 |
533 |
534 |
535 |
536 |
537 |
538 |
539 |
540 |
541 |
542 |
543 |
544 |
545 | Based on the data description, the following continuous variables were identified (a short pandas selection sketch follows the list):
546 | - LotFrontage: Linear feet of street connected to property
547 | - LotArea: Lot size in square feet
548 | - YearBuilt: Original construction date
549 | - YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
550 | - MasVnrArea: Masonry veneer area in square feet
551 | - BsmtFinSF1: Type 1 finished square feet
552 | - BsmtFinSF2: Type 2 finished square feet
553 | - BsmtUnfSF: Unfinished square feet of basement area
554 | - TotalBsmtSF: Total square feet of basement area
555 | - 1stFlrSF: First Floor square feet
556 | - 2ndFlrSF: Second floor square feet
557 | - LowQualFinSF: Low quality finished square feet (all floors)
558 | - GrLivArea: Above grade (ground) living area square feet
559 | - BsmtFullBath: Basement full bathrooms
560 | - BsmtHalfBath: Basement half bathrooms
561 | - FullBath: Full bathrooms above grade
562 | - HalfBath: Half baths above grade
563 | - BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
564 | - KitchenAbvGr: Kitchens above grade
565 | - TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
566 | - GarageYrBlt: Year garage was built
567 | - GarageCars: Size of garage in car capacity
568 | - GarageArea: Size of garage in square feet
569 | - WoodDeckSF: Wood deck area in square feet
570 | - OpenPorchSF: Open porch area in square feet
571 | - EnclosedPorch: Enclosed porch area in square feet
572 | - 3SsnPorch: Three season porch area in square feet
573 | - ScreenPorch: Screen porch area in square feet
574 | - PoolArea: Pool area in square feet
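575 | 
576 | A minimal pandas sketch for pulling out these continuous variables and summarising them (the CSV file name dataset.csv and the DataFrame variable df are assumptions; the column names are copied from the list above):
577 | 
578 | import pandas as pd
579 | 
580 | df = pd.read_csv("dataset.csv")  # assumed file name
581 | continuous_cols = [
582 |     "LotFrontage", "LotArea", "YearBuilt", "YearRemodAdd", "MasVnrArea",
583 |     "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF",
584 |     "2ndFlrSF", "LowQualFinSF", "GrLivArea", "BsmtFullBath", "BsmtHalfBath",
585 |     "FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd",
586 |     "GarageYrBlt", "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF",
587 |     "EnclosedPorch", "3SsnPorch", "ScreenPorch", "PoolArea",
588 | ]
589 | # Keep only the columns actually present in the file, then summarise them
590 | present = [c for c in continuous_cols if c in df.columns]
591 | print(df[present].describe())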
--------------------------------------------------------------------------------
/notebooks/agriculture_yield_rice:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | 
3 | # Load the rice yield dataset
4 | df = pd.read_csv("Dataset.csv")
5 | 
6 | # One-hot encode the categorical 'Irrigation' column
7 | irrigation_dummies = pd.get_dummies(df['Irrigation'])
8 | 
9 | # Prepend the indicator columns to the original frame and inspect the result
10 | df_encoded = pd.concat([irrigation_dummies, df], axis=1)
11 | print(df_encoded.head())
--------------------------------------------------------------------------------
/notebooks/biden_speech.txt:
--------------------------------------------------------------------------------
1 | Thank you. Thank you, thank you, thank you. It’s good to be back. As Mitch and Chuck will understand, it’s good to be almost home, down the hall. Anyway, thank you all.
2 |
3 | Madam Speaker, Madam Vice President. No president has ever said those words from this podium. No president has ever said those words. And it’s about time. The first lady, I’m her husband. Second gentleman. Chief justice. Members of the United States Congress and the cabinet, distinguished guests. My fellow Americans.
4 |
5 | While the setting tonight is familiar, this gathering is just a little bit different. A reminder of the extraordinary times we’re in. Throughout our history, presidents have come to this chamber to speak to Congress, to the nation and to the world. To declare war, to celebrate peace, to announce new plans and possibilities.
6 |
7 |
8 |
9 |
10 |
11 | Tonight, I come to talk about crisis and opportunity. About rebuilding the nation, revitalizing our democracy, and winning the future for America. I stand here tonight one day shy of the 100th day of my administration. A hundred days since I took the oath of office, lifted my hand off our family Bible and inherited a nation — we all did — that was in crisis. The worst pandemic in a century. The worst economic crisis since the Great Depression. The worst attack on our democracy since the Civil War. Now, after just 100 days, I can report to the nation, America is on the move again. Turning peril into possibility, crisis into opportunity, setbacks to strength.
12 |
13 | We all know life can knock us down. But in America, we never, ever, ever stay down. Americans always get up. Today, that’s what we’re doing. America is rising anew. Choosing hope over fear, truth over lies and light over darkness. After 100 days of rescue and renewal, America is ready for a takeoff, in my view. We’re working again, dreaming again, discovering again and leading the world again. We have shown each other and the world that there’s no quit in America. None.
14 |
15 |
16 |
17 | And more than half of all the adults in America have gotten at least one shot. The mass vaccination center in Glendale, Ariz., I asked the nurse, I said, “What’s it like?” She looked at me, she said, “It’s like every shot is giving a dose of hope” was her phrase, a dose of hope.
18 |
19 | A dose of hope for an educator in Florida, who has a child suffering from an autoimmune disease, wrote to me, said she’s worried — that she was worried about bringing the virus home. She said she then got vaccinated at a large site, in her car. She said she sat in her car when she got vaccinated and just cried, cried out of joy, and cried out of relief.
20 |
21 | Parents seeing the smiles on the kids’ faces, for those who are able to go back to school because the teachers and the school bus drivers and the cafeteria workers have been vaccinated. Grandparents, hugging their children and grandchildren, instead of pressing hands against the window to say goodbye. It means everything. Those things mean everything.
22 |
23 | You know, there’s still — you all know it, you know it better than any group of Americans — there’s still more work to do to beat this virus. We can’t let our guard down. But tonight, I can say, because of you, the American people, our progress these past 100 days against one of the worst pandemics in history has been one of the greatest logistical achievements, logistical achievements this country has ever seen. What else have we done in those first 100 days?
24 |
25 | We kept our commitment, Democrats and Republicans, of sending $1,400 rescue checks to 85 percent of American households. We’ve already sent more than 160 million checks out the door. It’s making a difference. You all know it when you go home. For many people, it’s making all the difference in the world.
26 |
27 | A single mom in Texas who wrote me, she said she couldn’t work. She said the relief check put food on the table and saved her and her son from eviction from their apartment. A grandmother in Virginia who told me she immediately took her granddaughter to the eye doctor, something she said she put off for months because she didn’t have the money. One of the defining images, at least from my perspective, in this crisis has been cars lined up, cars lined up for miles. And not people just barely able to start those cars. Nice cars, lined up for miles, waiting for a box of food to be put in their trunk.
28 |
29 | I don’t know about you, but I didn’t ever think I would see that in America. And all of this is through no fault of their own. No fault of their own, these people are in this position. That’s why the rescue plan is delivering food and nutrition assistance to millions of Americans facing hunger. And hunger is down sharply already.
30 |
31 |
32 |
33 |
34 |
35 |
36 | Folks — as I’ve told every world leader I’ve met with over the years — it’s never, ever, ever been a good bet to bet against America and it still isn’t. We are the United States of America. There is not a single thing — nothing, nothing beyond our capacity. We can do whatever we set our mind to if we do it together. So let’s begin to get together.
37 |
38 | God bless you all, and may God protect our troops. Thank you for your patience.
--------------------------------------------------------------------------------