├── .DS_Store
├── Deep Learning.md
├── ML Project Checklist.md
├── Machine Learning.md
├── More Resources.md
├── README.md
├── The ML Landscape.md
├── img
│   ├── model1.png
│   ├── model2.png
│   ├── model3.png
│   ├── precision-recall.png
│   └── reinforcement-learning.png
├── numpy-pandas
│   ├── .DS_Store
│   ├── 01-numpy.ipynb
│   ├── 02-example.ipynb
│   ├── 02-pandas.ipynb
│   ├── 03-plt.ipynb
│   ├── README.md
│   ├── data
│   │   ├── state-abbrevs.csv
│   │   ├── state-areas.csv
│   │   └── state-population.csv
│   ├── img
│   │   ├── .DS_Store
│   │   ├── axis=1.jpg
│   │   └── groupby.png
│   ├── plt1.py
│   ├── tools_matplotlib.ipynb
│   ├── tools_numpy.ipynb
│   ├── tools_pandas.ipynb
│   └── very-basics
│       ├── Readme.md
│       ├── img
│       │   └── plt.png
│       └── nb
│           └── 01_plt.ipynb
├── scikit-learn
│   ├── Readme.md
│   ├── car_evaluation.csv
│   ├── img
│   │   ├── process.png
│   │   ├── process1.png
│   │   └── scikit-learn.png
│   ├── k-means-clustering.ipynb
│   ├── knn.ipynb
│   ├── linear_regression.ipynb
│   ├── logistic_regression.ipynb
│   ├── svm.ipynb
│   └── train_test_split.ipynb
└── tensorflow-in-practice
    ├── .DS_Store
    ├── Exercises
    │   ├── Course_3_Week_1_Exercise_(Tokenizer_BBC_Text).ipynb
    │   ├── Course_3_Week_2_Exercise_(BBC_Text_Model_Building).ipynb
    │   ├── Course_3_Week_3_Exercise_Twitter_Fake_News.ipynb
    │   ├── Exercise_1_House_Prices.ipynb
    │   ├── Exercise_2_Handwriting_Recognition_DNN.ipynb
    │   ├── Exercise_3_CNN.ipynb
    │   ├── Exercise_4_Complex_Images_flow_from_directory.ipynb
    │   ├── Exercise_5_Cat_vs_Dog_Kaggle.ipynb
    │   ├── Exercise_6_Cats_vs_Dogs_with_Augmentation.ipynb
    │   └── Exercise_7_Transfer_learning.ipynb
    ├── MNIST
    │   ├── my_model.h5
    │   ├── test.py
    │   └── train.py
    ├── README.md
    ├── convolutional-neural-networks-tensorflow.md
    ├── img
    │   ├── 0.jpg
    │   ├── 1.png
    │   ├── 2.jpg
    │   ├── 3.jpg
    │   ├── fibonacci.png
    │   ├── fp.png
    │   ├── fp2.png
    │   ├── lstm.png
    │   ├── lstm2.png
    │   ├── metrics.png
    │   ├── ml_architecture.png
    │   ├── rfp.png
    │   ├── rnn.png
    │   ├── rnn2.png
    │   ├── seasonality.png
    │   ├── tf_datasets.png
    │   ├── trend.png
    │   ├── ts.png
    │   ├── tsn.png
    │   └── word_embeddings.png
    ├── introduction-to-tensorflow-for-ai.md
    ├── natural-language-processing-tensorflow.md
    ├── notebooks
    │   ├── .DS_Store
    │   ├── Course_1_Part_2_Lesson_2_Notebook.ipynb
    │   ├── Course_1_Part_4_Lesson_2_Notebook.ipynb
    │   ├── Course_1_Part_6_Lesson_2_Notebook.ipynb
    │   ├── Course_1_Part_6_Lesson_3_Notebook.ipynb
    │   ├── Course_1_Part_8_Lesson_2_Notebook.ipynb
    │   ├── Course_2_Part_2_Lesson_2_Notebook.ipynb
    │   ├── Course_2_Part_4_Lesson_2_Notebook_(Cats_v_Dogs_Augmentation).ipynb
    │   ├── Course_2_Part_6_Lesson_3_Notebook_(Transfer_Learning).ipynb
    │   ├── Course_2_Part_8_Lesson_2_Notebook_(RockPaperScissors).ipynb
    │   ├── Course_3_Week_1(Tokenizer-Sarcasm-Dataset).ipynb
    │   ├── Course_3_Week_2(Model_Training_IMDB_Reviews).ipynb
    │   ├── Course_3_Week_2(Sarcasm-Classifier).ipynb
    │   ├── Course_3_Week_2(Subwords).ipynb
    │   ├── Course_3_Week_3(IMDB).ipynb
    │   ├── Course_3_Week_4_Lesson_1_(Sheckspire_Text_Generation).ipynb
    │   ├── Course_3_Week_4_Lesson_2_Notebook.ipynb
    │   ├── README.md
    │   ├── irish-lyrics-eof.txt
    │   ├── meta.tsv
    │   ├── sarcasm.json
    │   └── vecs.tsv
    └── sequences-time-series-and-prediction.md
/Deep Learning.md:
--------------------------------------------------------------------------------
1 | # Neural Networks and Deep Learning
2 |
3 | - Building and training with TensorFlow and Keras
4 | - Architectures: feedforward for tabular data, CNN for computer vision, RNN and LSTM for sequence processing
5 | - Encoder / decoder and Transformers for NLP
6 | - Autoencoders and Generative Adversarial Networks (GANs) for generative learning
7 | - Techniques for training DNNs
8 | - Reinforcement learning - building an agent to play a game
9 | - Loading and preprocessing large amounts of data
10 | - Training and deploying at scale
11 |
12 | ## Contents
13 | - Introduction to ANN with Keras
14 | - [Sequential API](#Sequential-API), classification & regression
15 | - [Functional API](#Functional-API)
16 | - [Subclassing API](#Subclassing-API) for dynamic models
17 | - [Using Callbacks](#Using-Callbacks), EarlyStopping, ModelCheckpoints
18 | - [TensorBoard](#TensorBoard)
19 | - [Fine-Tuning Neural Network Hyperparameters](#Fine-Tuning-Neural-Network-Hyperparameters)
20 |
21 | ### Sequential API
22 | ```py
23 | """Classification MLP"""
24 | # "sparse_categorical_crossentropy" 0 to 9
25 | #"categorical_crossentropy" [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]
26 | # binary classification "sigmoid" (i.e., logistic) activation function in the output layer instead of the "softmax" activation function, and we would use the "binary_crossentropy" loss.
27 |
28 | model.compile(loss="sparse_categorical_crossentropy",
29 | optimizer="sgd",
30 | metrics=["accuracy"])
31 | history = model.fit(X_train, y_train, epochs=30,
32 | validation_data=(X_valid, y_valid))
33 | model.evaluate(X_test, y_test)
34 | y_proba = model.predict(X_new)
35 | y_pred = model.predict_classes(X_new)  # removed in newer Keras versions; use np.argmax(y_proba, axis=1) instead
36 | # History
37 | import pandas as pd
38 | import matplotlib.pyplot as plt
39 | pd.DataFrame(history.history).plot(figsize=(8, 5))
40 | plt.grid(True)
41 | plt.gca().set_ylim(0, 1) # set the vertical range to [0-1]
42 | plt.show()
43 | ```
44 | - If you want to convert sparse labels (i.e., class indices) to one-hot vector labels, use the `keras.utils.to_categorical()` function. To go the other way round, use the `np.argmax()` function with `axis=1`.
45 |
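A minimal round-trip sketch of that conversion (assuming `tensorflow.keras` and NumPy are available):
```py
from tensorflow import keras
import numpy as np

y_sparse = np.array([3, 0, 7])                                    # class indices
y_onehot = keras.utils.to_categorical(y_sparse, num_classes=10)   # shape (3, 10), one-hot rows
y_back = np.argmax(y_onehot, axis=1)                              # back to array([3, 0, 7])
```
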
46 | - You must **compile** the model, **train** it, **evaluate** it, and use it to **make predictions**.
47 |
48 | - `.fit()` also accepts `validation_split=0.1`, `class_weight`, and `sample_weight` (see the sketch below).
49 |
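For instance, a minimal sketch of `class_weight` on an imbalanced binary problem (the weights here are illustrative):
```py
history = model.fit(X_train, y_train, epochs=30,
                    validation_split=0.1,           # hold out 10% of the training data for validation
                    class_weight={0: 1.0, 1: 5.0})  # count each positive example 5x in the loss
```
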
50 | ```py
51 | """Regression MLP"""
52 | from sklearn.datasets import fetch_california_housing
53 | from sklearn.model_selection import train_test_split
54 | from sklearn.preprocessing import StandardScaler
55 |
56 | housing = fetch_california_housing()
57 | X_train_full, X_test, y_train_full, y_test = train_test_split(
58 | housing.data, housing.target)
59 | X_train, X_valid, y_train, y_valid = train_test_split(
60 | X_train_full, y_train_full)
61 |
62 | scaler = StandardScaler()
63 | X_train = scaler.fit_transform(X_train)
64 | X_valid = scaler.transform(X_valid)
65 | X_test = scaler.transform(X_test)
66 |
67 | model = keras.models.Sequential([
68 | keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
69 | keras.layers.Dense(1)
70 | ])
71 |
72 | model.compile(loss="mean_squared_error", optimizer="sgd")
73 | history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
74 | mse_test = model.evaluate(X_test, y_test)
75 | X_new = X_test[:3] # pretend these are new instances
76 | y_pred = model.predict(X_new)
77 | ```
78 |
79 | ### Functional API
80 | - Wide & Deep model (the input is fed both through the hidden layers and directly to the output):
81 | ```py
82 | input_ = keras.layers.Input(shape=X_train.shape[1:])
83 | hidden1 = keras.layers.Dense(30, activation="relu")(input_)
84 | hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
85 | concat = keras.layers.Concatenate()([input_, hidden2])
86 | output = keras.layers.Dense(1)(concat)
87 | model = keras.Model(inputs=[input_], outputs=[output])
88 | ```
89 | - Handling multiple inputs (a wide path and a deep path):
90 | ```py
91 | input_A = keras.layers.Input(shape=[5], name="wide_input")
92 | input_B = keras.layers.Input(shape=[6], name="deep_input")
93 | hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
94 | hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
95 | concat = keras.layers.concatenate([input_A, hidden2])
96 | output = keras.layers.Dense(1, name="output")(concat)
97 | model = keras.Model(inputs=[input_A, input_B], outputs=[output])
98 |
99 | # With two inputs, we must pass a pair of input arrays to .fit(), .evaluate(), and .predict()
100 | model.compile(loss="mse", optimizer=keras.optimizers.SGD(lr=1e-3))
101 | X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]
102 | X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]
103 | X_test_A, X_test_B = X_test[:, :5], X_test[:, 2:]
104 | X_new_A, X_new_B = X_test_A[:3], X_test_B[:3]
105 |
106 | history = model.fit((X_train_A, X_train_B), y_train, epochs=20, validation_data=((X_valid_A, X_valid_B), y_valid))
107 | mse_test = model.evaluate((X_test_A, X_test_B), y_test)
108 | y_pred = model.predict((X_new_A, X_new_B))
109 | ```
110 | - Handling multiple outputs (an auxiliary output for regularization):
111 | ```py
112 | [...] # Same as above, up to the main output layer
113 | output = keras.layers.Dense(1, name="main_output")(concat)
114 | aux_output = keras.layers.Dense(1, name="aux_output")(hidden2)
115 | model = keras.Model(inputs=[input_A, input_B], outputs=[output, aux_output])
116 |
117 | model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer="sgd")
118 | history = model.fit(
119 | [X_train_A, X_train_B], [y_train, y_train], epochs=20,
120 | validation_data=([X_valid_A, X_valid_B], [y_valid, y_valid]))
121 |
122 | total_loss, main_loss, aux_loss = model.evaluate(
123 | [X_test_A, X_test_B], [y_test, y_test])
124 | y_pred_main, y_pred_aux = model.predict([X_new_A, X_new_B])
125 | ```
126 |
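### Subclassing API
A minimal sketch of the Wide & Deep model above written with the Subclassing API, which is useful for dynamic models (layer sizes here are illustrative):
```py
class WideAndDeepModel(keras.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = keras.layers.Dense(units, activation=activation)
        self.hidden2 = keras.layers.Dense(units, activation=activation)
        self.main_output = keras.layers.Dense(1)
        self.aux_output = keras.layers.Dense(1)

    def call(self, inputs):
        input_A, input_B = inputs                      # wide input, deep input
        hidden1 = self.hidden1(input_B)
        hidden2 = self.hidden2(hidden1)
        concat = keras.layers.concatenate([input_A, hidden2])
        return self.main_output(concat), self.aux_output(hidden2)

model = WideAndDeepModel()
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer="sgd")
```
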
127 | ### Using Callbacks
128 | ```py
129 | """It will only save your model when its performance on the validation set is the best so far"""
130 | checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5", save_best_only=True)
131 |
132 | history = model.fit(X_train, y_train, epochs=10,
133 | validation_data=(X_valid, y_valid),
134 | callbacks=[checkpoint_cb])
135 | model = keras.models.load_model("my_keras_model.h5") # roll back to best model
136 | ```
137 | ```py
138 | """It will interrupt training when it measures no progress on the validation set for a number of epochs (defined by the patience argument)"""
139 | early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
140 | restore_best_weights=True)
141 | history = model.fit(X_train, y_train, epochs=100,
142 | validation_data=(X_valid, y_valid),
143 | callbacks=[checkpoint_cb, early_stopping_cb])
144 |
145 | ```
146 |
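You can also write your own callback; a minimal sketch that prints the ratio of validation loss to training loss after each epoch:
```py
class PrintValTrainRatioCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        print("\nval/train loss ratio: {:.2f}".format(logs["val_loss"] / logs["loss"]))
```
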
147 | ### TensorBoard
148 | ```py
149 | import os
150 |
151 | root_logdir = os.path.join(os.curdir, "my_logs")
152 |
153 | def get_run_logdir():
154 | import time
155 | run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
156 | return os.path.join(root_logdir, run_id)
157 |
158 | run_logdir = get_run_logdir() # e.g., './my_logs/run_2019_06_07-15_15_22'
159 |
160 | [...] # Build and compile your model
161 | tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
162 | history = model.fit(X_train, y_train, epochs=30,
163 | validation_data=(X_valid, y_valid),
164 | callbacks=[tensorboard_cb])
165 |
166 | ```
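- After (or during) training, start the TensorBoard server and point it at the log directory, e.g. `tensorboard --logdir=./my_logs --port=6006`, then open `http://localhost:6006` in a browser.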
167 |
168 | ### Fine-Tuning Neural Network Hyperparameters
169 | ```py
170 | def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[8]):
171 | model = keras.models.Sequential()
172 | model.add(keras.layers.InputLayer(input_shape=input_shape))
173 | for layer in range(n_hidden):
174 | model.add(keras.layers.Dense(n_neurons, activation="relu"))
175 | model.add(keras.layers.Dense(1))
176 | optimizer = keras.optimizers.SGD(lr=learning_rate)
177 | model.compile(loss="mse", optimizer=optimizer)
178 | return model
179 |
180 | # Wrap the Keras model so it can be used like a Scikit-Learn regressor
181 | keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)
182 |
183 | keras_reg.fit(X_train, y_train, epochs=100,
184 | validation_data=(X_valid, y_valid),
185 | callbacks=[keras.callbacks.EarlyStopping(patience=10)])
186 |
187 | mse_test = keras_reg.score(X_test, y_test)
188 | y_pred = keras_reg.predict(X_new)
189 |
190 | from scipy.stats import reciprocal
191 | from sklearn.model_selection import RandomizedSearchCV
192 |
193 | param_distribs = {
194 | "n_hidden": [0, 1, 2, 3],
195 | "n_neurons": np.arange(1, 100),
196 | "learning_rate": reciprocal(3e-4, 3e-2),
197 | }
198 | rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3)
199 | rnd_search_cv.fit(X_train, y_train, epochs=100,
200 | validation_data=(X_valid, y_valid),
201 | callbacks=[keras.callbacks.EarlyStopping(patience=10)])
202 |
203 | rnd_search_cv.best_params_
204 | rnd_search_cv.best_score_
205 |
206 | model = rnd_search_cv.best_estimator_.model
207 | ```
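- Note: recent TensorFlow versions removed `keras.wrappers.scikit_learn`; the separate SciKeras package provides equivalent `KerasRegressor` / `KerasClassifier` wrappers.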
208 |
--------------------------------------------------------------------------------
/ML Project Checklist.md:
--------------------------------------------------------------------------------
1 | This checklist can guide you through your Machine Learning projects. There are eight main steps:
2 |
3 | 1. Frame the problem and look at the big picture.
4 | 2. Get the data.
5 | 3. Explore the data to gain insights.
6 | 4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
7 | 5. Explore many different models and short-list the best ones.
8 | 6. Fine-tune your models and combine them into a great solution.
9 | 7. Present your solution.
10 | 8. Launch, monitor, and maintain your system.
11 |
12 | Obviously, you should feel free to adapt this checklist to your needs.
13 |
14 | # Frame the problem and look at the big picture
15 | 1. Define the objective in business terms.
16 | 2. How will your solution be used?
17 | 3. What are the current solutions/workarounds (if any)?
18 | 4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
19 | 5. How should performance be measured?
20 | 6. Is the performance measure aligned with the business objective?
21 | 7. What would be the minimum performance needed to reach the business objective?
22 | 8. What are comparable problems? Can you reuse experience or tools?
23 | 9. Is human expertise available?
24 | 10. How would you solve the problem manually?
25 | 11. List the assumptions you or others have made so far.
26 | 12. Verify assumptions if possible.
27 |
28 | # Get the data
29 | Note: automate as much as possible so you can easily get fresh data.
30 |
31 | 1. List the data you need and how much you need.
32 | 2. Find and document where you can get that data.
33 | 3. Check how much space it will take.
34 | 4. Check legal obligations, and get the authorization if necessary.
35 | 5. Get access authorizations.
36 | 6. Create a workspace (with enough storage space).
37 | 7. Get the data.
38 | 8. Convert the data to a format you can easily manipulate (without changing the data itself).
39 | 9. Ensure sensitive information is deleted or protected (e.g., anonymized).
40 | 10. Check the size and type of data (time series, sample, geographical, etc.).
41 | 11. Sample a test set, put it aside, and never look at it (no data snooping!).
42 |
43 | # Explore the data
44 | Note: try to get insights from a field expert for these steps.
45 |
46 | 1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
47 | 2. Create a Jupyter notebook to keep record of your data exploration.
48 | 3. Study each attribute and its characteristics:
49 | - Name
50 | - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
51 | - % of missing values
52 | - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
53 | - Possibly useful for the task?
54 | - Type of distribution (Gaussian, uniform, logarithmic, etc.)
55 | 4. For supervised learning tasks, identify the target attribute(s).
56 | 5. Visualize the data.
57 | 6. Study the correlations between attributes.
58 | 7. Study how you would solve the problem manually.
59 | 8. Identify the promising transformations you may want to apply.
60 | 9. Identify extra data that would be useful (go back to "Get the data" above).
61 | 10. Document what you have learned.
62 |
63 | # Prepare the data
64 | Notes:
65 | - Work on copies of the data (keep the original dataset intact).
66 | - Write functions for all data transformations you apply, for five reasons:
67 | - So you can easily prepare the data the next time you get a fresh dataset
68 | - So you can apply these transformations in future projects
69 | - To clean and prepare the test set
70 | - To clean and prepare new data instances
71 | - To make it easy to treat your preparation choices as hyperparameters
72 |
73 | 1. Data cleaning:
74 | - Fix or remove outliers (optional).
75 | - Fill in missing values (e.g., with zero, mean, median...) or drop their rows (or columns).
76 | 2. Feature selection (optional):
77 | - Drop the attributes that provide no useful information for the task.
78 | 3. Feature engineering, where appropriate:
79 | - Discretize continuous features.
80 | - Decompose features (e.g., categorical, date/time, etc.).
81 | - Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
82 | - Aggregate features into promising new features.
83 | 4. Feature scaling: standardize or normalize features.
84 |
85 | # Short-list promising models
86 | Notes:
87 | - If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests).
88 | - Once again, try to automate these steps as much as possible.
89 |
90 | 1. Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
91 | 2. Measure and compare their performance.
92 | - For each model, use N-fold cross-validation and compute the mean and standard deviation of their performance.
93 | 3. Analyze the most significant variables for each algorithm.
94 | 4. Analyze the types of errors the models make.
95 | - What data would a human have used to avoid these errors?
96 | 5. Have a quick round of feature selection and engineering.
97 | 6. Have one or two more quick iterations of the five previous steps.
98 | 7. Short-list the top three to five most promising models, preferring models that make different types of errors.
99 |
100 | # Fine-Tune the System
101 | Notes:
102 | - You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning.
103 | - As always automate what you can.
104 |
105 | 1. Fine-tune the hyperparameters using cross-validation.
106 | - Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or the median value? Or just drop the rows?).
107 | - Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams ([https://goo.gl/PEFfGr](https://goo.gl/PEFfGr)))
108 | 2. Try Ensemble methods. Combining your best models will often perform better than running them individually.
109 | 3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.
110 |
111 | > Don't tweak your model after measuring the generalization error: you would just start overfitting the test set.
112 |
113 | # Present your solution
114 | 1. Document what you have done.
115 | 2. Create a nice presentation.
116 | - Make sure you highlight the big picture first.
117 | 3. Explain why your solution achieves the business objective.
118 | 4. Don't forget to present interesting points you noticed along the way.
119 | - Describe what worked and what did not.
120 | - List your assumptions and your system's limitations.
121 | 5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., "the median income is the number-one predictor of housing prices").
122 |
123 | # Launch!
124 | 1. Get your solution ready for production (plug into production data inputs, write unit tests, etc.).
125 | 2. Write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops.
126 | - Beware of slow degradation too: models tend to "rot" as data evolves.
127 | - Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
128 | - Also monitor your inputs' quality (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale). This is particularly important for online learning systems.
129 | 3. Retrain your models on a regular basis on fresh data (automate as much as possible).
130 |
--------------------------------------------------------------------------------
/Machine Learning.md:
--------------------------------------------------------------------------------
1 | # Machine Learning
2 |
3 | - Handling, cleaning, and preparing data.
4 | - Selecting and engineering features.
5 | - Learning by fitting a model to data.
6 | - Optimizing a cost function.
7 | - Selecting a model and tuning hyperparameters using cross-validation.
8 | - Underfitting and overfitting (the bias/variance tradeoff).
9 | - Unsupervised learning techniques: clustering, density estimation and anomaly detection.
10 | - Algorithms: Linear and Polynomial Regression, Logistic Regression, k-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forests, and Ensemble methods.
11 |
20 |
21 | ## End-to-End Machine Learning Project
22 | - Frame the problem and look at the big picture
23 | - Goal and Performance measure
24 | - Get the data
25 | - [Create test set](#Create-test-set)
26 | - Explore the data to gain insights (EDA)
27 | - [Looking for correlations](#Looking-for-Correlations)
28 | - Experimenting with attribute combinations
29 | - Prepare data for ML algorithms
30 | - [Data cleaning](#Data-cleaning)
31 | - [Handling text and categorical attributes](#Handling-text-and-categorical-attributes)
32 | - [Feature Scaling](#Feature-Scaling)
33 | - [Transformation Pipelines](#Transformation-Pipelines)
34 | - [Explore many different models and short-list the best ones](#Select-and-Train-a-Model)
35 | - [Cross-Validation](#Cross-Validation)
36 | - Fine-tune models and combine them into a great solution
37 | - [Grid Search](#Grid-Search)
38 | - [Randomized Search](#Randomized-Search)
39 | - Ensemble models
40 | - Evaluate on test set
41 | - Launch and monitor
42 |
43 | ## Get the data
44 | ### Create test set
45 | ```py
46 | '''Create test set'''
47 | from sklearn.model_selection import train_test_split
48 | train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
49 | ```
50 | ```py
51 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
52 | ```
53 | ```py
54 | from sklearn.model_selection import StratifiedShuffleSplit
55 | import numpy as np
56 | X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
57 | y = np.array([0, 0, 0, 1, 1, 1])
58 | sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
59 | sss.get_n_splits(X, y)
60 |
61 | for train_index, test_index in sss.split(X, y):
62 | print("TRAIN:", train_index, "TEST:", test_index)
63 | X_train, X_test = X[train_index], X[test_index]
64 | y_train, y_test = y[train_index], y[test_index]
65 | ```
66 |
67 | ## Explore the data to gain insights
68 | ```py
69 | '''Visualizing Geographical Data'''
70 | data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
71 | ```
72 | ```py
73 | housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
74 | s=housing["population"]/100, label="population", figsize=(10,7),
75 | c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
76 | )
77 | plt.legend()
78 | ```
79 | ### Looking for Correlations
80 | ```py
81 | '''Looking for Correlations'''
82 | corr_matrix = data.corr()
83 | corr_matrix["any_column"].sort_values(ascending=False)
84 |
85 | from pandas.plotting import scatter_matrix
86 | attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
87 | scatter_matrix(housing[attributes], figsize=(12, 8))
88 | ```
89 | ```py
90 | # Correlations between features
91 | all_data_corr = all_data.corr().abs().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
92 | all_data_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
93 | all_data_corr.drop(all_data_corr.iloc[1::2].index, inplace=True)
94 | all_data_corr_nd = all_data_corr.drop(all_data_corr[all_data_corr['Correlation Coefficient'] == 1.0].index)
95 |
96 | corr = all_data_corr_nd['Correlation Coefficient'] > 0.1
97 | all_data_corr_nd[corr]
98 | ```
99 | ```py
100 | # pivot_table() vs groupby(): the two lines below compute the same aggregation (pivot_table returns it in wide form; add .unstack('b') to the groupby result to match)
101 | pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum)
102 | df.groupby(['a','b'])['c'].sum()
103 | ```
104 | ```py
105 | # Aggregate using one or more operations over the specified axis
106 | # agg()-can be applied to multiple groups together
107 | df.agg(['sum', 'min'])
108 | df_all.groupby(['Sex', 'Pclass']).agg(lambda x:x.value_counts().index[0])['Embarked']
109 |
110 | # Apply a function along an axis of the DataFrame
111 | # apply()-cannot be applied to multiple groups together
112 | df.apply(np.sqrt)
113 | df_all['Deck'] = df_all['Cabin'].apply(lambda s: s[0] if pd.notnull(s) else 'M')
114 | ```
115 |
116 | ## Prepare data for ML algorithms
117 | - https://stackoverflow.com/questions/48673402/how-can-i-standardize-only-numeric-variables-in-an-sklearn-pipeline
118 | - https://scikit-learn.org/stable/modules/preprocessing.html
119 |
120 | ### Data Cleaning
121 | ```py
122 | housing.dropna(subset=["total_bedrooms"]) # Get rid of the corresponding districts
123 | housing.drop("total_bedrooms", axis=1) # Get rid of the whole attribute
124 | median = housing["total_bedrooms"].median() # Set the values to some value (zero, mean, median)
125 | housing["total_bedrooms"].fillna(median, inplace=True)
126 | ```
127 | ```py
128 | '''SimpleImputer: fill missing values in the numerical attributes with the median'''
129 | from sklearn.impute import SimpleImputer
130 | imputer = SimpleImputer(strategy="median")
131 | housing_num = housing.select_dtypes(include=[np.number]) # just numerical attributes
132 | imputer.fit(housing_num) # "trained" imputer, now ready to transform the training set by replacing missing values with the learned medians
133 | imputer.statistics_ # same as "housing_num.median().values"
134 | X = imputer.transform(housing_num)
135 | housing_tr = pd.DataFrame(X, columns=housing_num.columns,
136 | index=housing.index) # new dataframe
137 | ```
138 |
139 | ### Handling Text and Categorical Attributes
140 | - [select_dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html)
141 | ```py
142 | '''Transforming continuous numerical attributes to categorical'''
143 | housing["income_cat"] = pd.cut(housing["median_income"],
144 | bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
145 | labels=[1, 2, 3, 4, 5])
146 | ```
147 | ```py
148 | '''Categorical Attributes'''
149 | from sklearn.preprocessing import OrdinalEncoder
150 | from sklearn.preprocessing import OneHotEncoder
151 |
152 | housing_cat = housing[["ocean_proximity"]]
153 |
154 | ordinal_encoder = OrdinalEncoder()
155 | housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
156 |
157 | housing_cat_encoded[:10]
158 | # array([[0.],
159 | # [0.],
160 | # [4.],
161 | # [1.],
162 | # [0.],
163 | # [1.],
164 | # [0.],
165 | # [1.],
166 | # [0.],
167 | # [0.]])
168 |
169 | ordinal_encoder.categories_ # [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)]
170 |
171 | cat_encoder = OneHotEncoder(sparse=False)
172 | housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
173 | housing_cat_1hot
174 | # array([[1., 0., 0., 0., 0.],
175 | # [1., 0., 0., 0., 0.],
176 | # [0., 0., 0., 0., 1.],
177 | # ...,
178 | # [0., 1., 0., 0., 0.],
179 | # [1., 0., 0., 0., 0.],
180 | # [0., 0., 0., 1., 0.]])
181 | ```
182 | ### Feature Scaling
183 | ```py
184 | '''StandardScaler'''
185 | from sklearn.preprocessing import StandardScaler
186 | import numpy as np
187 |
188 | X_train = np.array([[ 1., -1., 2.],
189 | [ 2., 0., 0.],
190 | [ 0., 1., -1.]])
191 | scaler = StandardScaler().fit(X_train)
192 |
193 | scaler.mean_
194 | scaler.scale_
195 |
196 | X_scaled = scaler.transform(X_train)
197 | X_scaled
198 | ```
199 | ```py
200 | from sklearn.preprocessing import MinMaxScaler
201 |
202 | X_train = np.array([[ 1., -1., 2.],
203 | [ 2., 0., 0.],
204 | [ 0., 1., -1.]])
205 |
206 | min_max_scaler = MinMaxScaler()
207 | X_train_minmax = min_max_scaler.fit_transform(X_train)
208 | X_train_minmax
209 | # array([[0.5 , 0. , 1. ],
210 | # [1. , 0.5 , 0.33333333],
211 | # [0. , 1. , 0. ]])
212 |
213 | # For the test data, we just need to use .transform()
214 | X_test = np.array([[-3., -1., 4.]])
215 | X_test_minmax = min_max_scaler.transform(X_test)
216 | X_test_minmax
217 | # array([[-1.5 , 0. , 1.66666667]])
218 | ```
219 |
220 | ### Custom Transformer
221 | ```py
222 | from sklearn.base import BaseEstimator, TransformerMixin
223 |
224 | # column index
225 | rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6
226 |
227 | class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
228 | def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
229 | self.add_bedrooms_per_room = add_bedrooms_per_room
230 | def fit(self, X, y=None):
231 | return self # nothing else to do
232 | def transform(self, X):
233 | rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
234 | population_per_household = X[:, population_ix] / X[:, households_ix]
235 | if self.add_bedrooms_per_room:
236 | bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
237 | return np.c_[X, rooms_per_household, population_per_household,
238 | bedrooms_per_room]
239 | else:
240 | return np.c_[X, rooms_per_household, population_per_household]
241 |
242 | attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
243 | housing_extra_attribs = attr_adder.transform(housing.values)
244 | ```
245 |
246 | ### Transformation Pipelines
247 | ```py
248 | from sklearn.pipeline import Pipeline
249 | from sklearn.preprocessing import StandardScaler
250 | from sklearn.compose import ColumnTransformer
251 |
252 | num_pipeline = Pipeline([
253 | ('imputer', SimpleImputer(strategy="median")),
254 | ('attribs_adder', CombinedAttributesAdder()),
255 | ('std_scaler', StandardScaler()),
256 | ])
257 |
258 | # housing_num_tr = num_pipeline.fit_transform(housing_num)
259 |
260 | num_attribs = list(housing_num)
261 | cat_attribs = ["ocean_proximity"]
262 |
263 | full_pipeline = ColumnTransformer([
264 | ("num", num_pipeline, num_attribs),
265 | ("cat", OneHotEncoder(), cat_attribs),
266 | ])
267 |
268 | housing_prepared = full_pipeline.fit_transform(housing)
269 | housing_prepared # to get access to the new dataset
270 | ```
271 | ## Select and Train a Model
272 | - Before calling `.predict()` on new data, transform it with `full_pipeline.transform(some_data)` (see the sketch below).
273 |
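A minimal sketch of that flow (assuming `housing_prepared` / `housing_labels` produced by the pipeline above; Linear Regression is just an example model):
```py
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

some_data = housing.iloc[:5]                              # pretend these are new districts
some_data_prepared = full_pipeline.transform(some_data)   # transform only, do NOT fit_transform
lin_reg.predict(some_data_prepared)
```
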
274 | ### Cross-Validation
275 | ```py
276 | from sklearn.model_selection import cross_val_score
277 |
278 | scores = cross_val_score(model, data, labels, scoring="neg_mean_squared_error", cv=10)
279 | rmse_scores = np.sqrt(-scores)
280 |
281 | def display_scores(scores):
282 | print("Scores:", scores)
283 | print("Mean:", scores.mean())
284 | print("Standard deviation:", scores.std())
285 |
286 | display_scores(rmse_scores)
287 | ```
288 | ```py
289 | '''Save the model'''
290 | import joblib
291 | joblib.dump(my_model, "my_model.pkl") # to save model
292 | my_model_loaded = joblib.load("my_model.pkl") # to load model
293 | ```
294 |
295 |
296 | ## Fine-tune Models
297 | ### Grid Search
298 | ```py
299 | from sklearn.model_selection import GridSearchCV
300 |
301 | param_grid = [
302 | # try 12 (3×4) combinations of hyperparameters
303 | {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
304 | # then try 6 (2×3) combinations with bootstrap set as False
305 | {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
306 | ]
307 |
308 | forest_reg = RandomForestRegressor(random_state=42)
309 | # train across 5 folds, that's a total of (12+6)*5=90 rounds of training
310 | grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
311 | scoring='neg_mean_squared_error',
312 | return_train_score=True)
313 | grid_search.fit(housing_prepared, housing_labels)
314 |
315 | grid_search.best_params_ # the best hyperparameters
316 | grid_search.best_estimator_
317 |
318 | # look at the score of each hyperparameter combination tested during the grid search:
319 | cvres = grid_search.cv_results_
320 | for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
321 | print(np.sqrt(-mean_score), params)
322 | ```
323 |
324 | ### Randomized Search
325 | ```py
326 | from sklearn.model_selection import RandomizedSearchCV
327 | from scipy.stats import randint
328 |
329 | param_distribs = {
330 | 'n_estimators': randint(low=1, high=200),
331 | 'max_features': randint(low=1, high=8),
332 | }
333 |
334 | forest_reg = RandomForestRegressor(random_state=42)
335 | rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
336 | n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
337 | rnd_search.fit(housing_prepared, housing_labels)
338 |
339 | # looking at the scores during training
340 | cvres = rnd_search.cv_results_
341 | for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
342 | print(np.sqrt(-mean_score), params)
343 |
344 | feature_importances = grid_search.best_estimator_.feature_importances_
345 | ```
--------------------------------------------------------------------------------
/More Resources.md:
--------------------------------------------------------------------------------
1 | ## What is Data Science?
2 | - [What really is Data Science? ](https://youtu.be/xC-c7E5PK0Y)
3 | - https://telegra.ph/What-REALLY-is-Data-Science-09-21
4 |
5 | ### Just leaving it here
6 | - [Data Science Interview at Facebook](https://tproger.ru/translations/preparing-for-data-science-interview/)
7 |
8 | ### Advice
9 | - [12 Things I Learned During My First Year as a Machine Learning Engineer](https://proglib.io/w/464d1326)
10 | - [How to Learn Machine Learning, The Self-Starter Way](https://elitedatascience.com/learn-machine-learning)
11 | - [Andrew Ng: Advice on Getting Started in Deep Learning](https://youtu.be/1k37OcjH7BM)
12 | - [Andrew Ng Machine Learning Career](https://youtu.be/hkagmGAu74Y)
13 |
14 | ### Technical Articles / Videos
15 | - [Cheat sheet, Stanford CS-229](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks)
16 | - [Loss function for neural networks CNN](https://towardsdatascience.com/understanding-different-loss-functions-for-neural-networks-dd1ed0274718)
17 | - [Backprop in CNN](https://medium.com/@pavisj/convolutions-and-backpropagations-46026a8f5d2c)
18 | - [Backprop in NN](https://youtu.be/0e0z28wAWfg)
19 | - [Introduction to Backpropagation and Optimization](https://ai.plainenglish.io/approach-complex-functions-with-backpropagation-how-i-was-applying-to-yandex-c5f68d50f2da)
20 |
21 | ### Courses
22 | - http://introtodeeplearning.com
23 | - https://openlearninglibrary.mit.edu/courses/course-v1:MITx+6.036+1T2019/about
24 | - [Fast AI DL](https://www.fast.ai)
25 | - [Coursera Data Science](https://www.coursera.org/specializations/data-science-python?ranMID=40328&ranEAID=EBOQAYvGY4A&ranSiteID=EBOQAYvGY4A-xBZ6HIoQD.6tLROsD7db4g&siteID=EBOQAYvGY4A-xBZ6HIoQD.6tLROsD7db4g&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=EBOQAYvGY4A)
26 | - [Fast AI ML course](https://course18.fast.ai/ml)
27 |
28 | ### Bookshelf
29 | - [Data Science books](https://proglib.io/w/0232ad78)
30 |
31 | ### YouTube
32 | - [DeepLizard](https://www.youtube.com/c/deeplizard/playlists)
33 |
34 | ### QA
35 | - Validation accuracy is higher than training accuracy.
36 | - https://www.quora.com/Can-validation-accuracy-be-higher-than-training-accuracy
37 |
38 | ### Resources where you can find the latest publications from leading laboratories
39 | - https://openai.com/blog/tags/research/
40 | - https://deepmind.com/research
41 | - https://www.microsoft.com/en-us/research/research-area/artificial-intelligence
42 | - https://www.research.ibm.com/artificial-intelligence/#publications
43 | - https://ai.stanford.edu
44 | - https://www.csail.mit.edu
45 | - https://ai.google/research/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning Area
2 |
3 | [Rustam-Z🚀](https://t.me/rz_zokirov) • [Find more here](https://t.me/rz_zokirov_ml)
4 |
5 | > 1% better every day = 3700% better at the end of the year
6 |
7 | > The goal is to solve problems and help society with the help of AI.
8 |
9 | ## Why should you learn Machine Learning?
10 |
11 | [First of all, understand the difference between AI / Data Science / Machine Learning](https://telegra.ph/AI--Data-Science--Machine-Learning--Deep-Learning--Data-Analysis--Data-Engineering--Big-Data-09-09)
12 |
13 | I found two good answers on why you should care. Firstly, **Machine Learning (ML)** is making computers do things that we’ve never made computers do before. If you want to do something new, not just new to you, but to the world, you can do it with ML.
14 |
15 | Secondly, if you don’t influence the world, the world will influence you.
16 |
17 | If you focus on results, you will never change.
18 | If you focus on change, you will get results.
19 |
20 | ## How to study?
21 | - **First, learn to learn.**
22 | - [Thinking of Self-Studying Machine Learning? Remind yourself of these 6 things](https://towardsdatascience.com/thinking-of-self-studying-machine-learning-remind-yourself-of-these-6-things-b55a5f2b6c7d)
23 | - [How to Learn Machine Learning](https://elitedatascience.com/learn-machine-learning)
24 |
25 | ## Roadmap
26 | - **Math (Calculus, Linear Algebra, Probability & Statistics)**
27 | - [Calculus](https://www.youtube.com/playlist?list=PLmdFyQYShrjd4Qn42rcBeFvF6Qs-b6e-L), *Don't Memorize*
28 | - [Calculus](https://youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr), *3Blue1Brown*
29 | - [Linear Algebra](https://youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab), *3Blue1Brown*
30 | - [Statistics & Probability](https://www.khanacademy.org/math/statistics-probability)
31 | - **Python**
32 | - [My Python learning roadmap](https://github.com/Rustam-Z/learning-area#1-start-learning-python)
33 | - [NumPy](https://www.w3schools.com/python/numpy/default.asp), [Pandas](https://www.w3schools.com/python/pandas/default.asp), [Matplotlib](https://www.w3schools.com/python/matplotlib_intro.asp)
34 | - [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
35 | - **Machine Learning**
36 | - "Deep learning with Python", book, 1st part
37 | - Machine Learning Course, Andrew Ng, coursera.org
38 | - **Scikit-Learn**
39 | - [freeCodeCamp.org](https://youtu.be/0B5eIE_1vpU)
40 | - https://inria.github.io/scikit-learn-mooc/
41 | - https://scikit-learn.org/stable/tutorial/index.html
42 | - **Deep Learning** - Start solving [Kaggle](https://github.com/Rustam-Z/kaggle-problem-solving)
43 | - TensorFlow Developer Specialization, deeplearning.ai, coursera.org
44 | - OR "AI and Machine Learning for Coders", book
45 | - "Deep learning with Python", book, 2nd part
46 | - ["Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow"](https://github.com/ageron/handson-ml2), book
47 | - **fast.ai**
48 | - "Deep learning", MIT press, book
49 | - Deep Learning Specialization, Andrew Ng, coursera.org
50 | - TensorFlow Advanced Techniques, deeplearning.ai, coursera.org
51 | - **Data Science**
52 | - "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython", book
53 | - "Python Data Science Handbook", book
54 | - **More**
55 | - Applied Machine Learning: https://machinelearningmastery.com/start-here
56 |
57 | ## ML Cheatsheets
58 | * [Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data](https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463) `numpy`, `pandas`, `sklearn`, `ml`, `dl`
59 | * [Machine Learning](https://stanford.edu/~shervine/teaching/cs-229/)
60 |
61 | *Please, consider this repository for contributing too!*
62 |
63 |
83 |
--------------------------------------------------------------------------------
/The ML Landscape.md:
--------------------------------------------------------------------------------
1 | ## The Machine Learning Landscape
2 | ### What is Machine Learning?
3 | - Machine learning (ML) is the field of study that gives computers the ability to learn without being explicitly programmed.
4 | - A computer program is said to learn from *experience E* with respect to some *task T* and some *performance measure P*, if its performance on T, as measured by P, improves with experience E.
5 | - **Example:** T = flag spam for new emails, E = the training data, P = accuracy, the ratio of correctly classified emails.
6 |
7 | ### Why use ML?
8 | - Problems for which existing solutions require a lot of hand-tuning or long lists of
9 | rules: one Machine Learning algorithm can often simplify code and perform
10 | better (spam classifier).
11 | - Complex problems for which there is no good solution at all using a traditional
12 | approach: the best Machine Learning techniques can find a solution. (speech recognition)
13 | - Fluctuating environments: a Machine Learning system can adapt to new data.
14 | - Getting insights about complex problems and large amounts of data. (data mining)
15 |
16 | ### Types of ML Systems
17 | - Whether or not they are trained with human supervision `supervised, unsupervised, semisupervised, and Reinforcement Learning`
18 | - Whether or not they can learn incrementally on the fly `online vs batch learning`
19 | - Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do `instance-based vs model-based learning`
20 |
21 | - **Supervised learning** - training data with labels (expected outputs).
22 | - Tasks: classification, regression (univariate / multivariate).
23 | - Class / sample / label / feature (predictors: age, brand, ...) / attribute
24 | - **Algorithms**
25 | - k-Nearest Neighbors
26 | - Linear Regression
27 | - Logistic Regression
28 | - Support Vector Machines (SVMs)
29 | - Decision Trees and Random Forests
30 | - Neural networks
31 |
32 | - **Unsupervised learning** - training data is unlabeled.
33 | - Tasks: clustering, anomaly detection, visualization & dimensionality reduction.
34 | - Clustering (find similar visitors)
35 | - K-Means
36 | - DBSCAN
37 | - Hierarchical Cluster Analysis (HCA)
38 | - Anomaly detection & novelty detection (detect unusual things)
39 | - One-class SVM
40 | - Isolation Forest
41 | - Visualization and dimensionality reduction (a kind of feature extraction)
42 | - Principal Component Analysis (PCA)
43 | - Kernel PCA
44 | - Locally-Linear Embedding (LLE)
45 | - t-distributed Stochastic Neighbor Embedding (t-SNE)
46 | - Association rule learning
47 | - Apriori
48 | - Eclat
49 |
50 | - `TIP!` Use a dimensionality reduction algorithm before feeding the data to a supervised learning algorithm.
51 | - `TIP!` Automatically removing outliers from a dataset before feeding it to another learning algorithm.
52 |
53 | - **Semisupervised learning** - a lot of unlabeled data and a little bit of labeled data.
54 | - Example: Google Photos recognizes the same person across many pictures (clustering). The supervised part is needed to label the clusters, e.g. to separate similar-looking people.
55 |
56 | - **Reinforcement Learning** - *agent* can observe environment, and perform some actions, and get *rewards* and *penalties*. Then it must teach itself the best strategy (*policy*) to get max reward. A policy defines what action the agent should choose when it is in a given situation.
57 |
58 |
59 | - **Batch learning** - or *offline learning*: whenever new data arrives, you need to retrain the system on the whole dataset.
60 | - **Online learning** - you train the system incrementally on new data instances or mini-batches of data.
61 | - You must set the *learning rate* parameter: with a high rate the system adapts rapidly to new data, but it will also tend to forget the old data.
62 | - A big challenge: if bad data is fed to the system, the system's performance will gradually decline.
63 | - `TIP!` Monitor your latest input data using an anomaly detection algorithm.
64 |
65 | - **Instance-based learning** - the system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples using a *similarity measure*.
66 | - **Model-based learning** - build the model, then use it to make *predictions*.
67 |
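A minimal sketch contrasting the two on toy data (`KNeighborsRegressor` is instance-based, `LinearRegression` is model-based; the numbers are illustrative):
```py
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])    # single illustrative feature
y = np.array([1.2, 1.9, 3.2, 3.9])            # illustrative target

model_based = LinearRegression().fit(X, y)                      # learns parameters (slope, intercept)
instance_based = KNeighborsRegressor(n_neighbors=2).fit(X, y)   # memorizes the training examples

model_based.predict([[2.5]])      # prediction from the fitted line
instance_based.predict([[2.5]])   # average of the 2 closest memorized examples
```
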
68 | ### Main Challenges of ML
69 | - “Bad algorithm” and “bad data”
70 | - **Bad data**
71 | - If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it, and so on.
72 | - **Feature engineering**, involves:
73 | - *Feature selection*: selecting the most useful features to train on among existing features.
74 | - *Feature extraction*: combining existing features to produce a more useful one (dimensionality reduction algorithms can help).
75 | - Creating new features by gathering new data.
76 |
77 | - **Bad algorithm**
78 | - **Overfitting** means that the model performs well on the training data, but it does not generalize well. How to overcome?
79 | - To simplify the model by selecting one with fewer parameters (a linear model rather than a high-degree polynomial model), by reducing the number of features in the training data, or by constraining the model (with regularization).
80 | - To gather more training data.
81 | - To reduce the noise in the training data (fix data errors and remove outliers).
82 | - **Underfitting** occurs when your model is too simple to learn the underlying structure of the data. The options to fix:
83 | - Selecting a more powerful model, with more parameters.
84 | - Feeding better features to the learning algorithm (feature engineering)
85 | - Reducing the constraints on the model (reducing the regularization hyperparameter)
86 |
87 | - The system will not perform well if your training set is too small, or if the data is not representative of production-level data, noisy, or polluted with irrelevant features (garbage in, garbage out). Lastly, your model needs to be neither too simple (it will underfit) nor too complex (it will overfit).
88 |
89 | ### Testing and Validating
90 | - A common split is 80% training and 20% testing. With 10 million samples, 1% for testing is enough.
91 | - **Hyperparameter Tuning and Model Selection** `page 32`
92 | - Example: you are hesitating between two models, linear and polynomial. You must try both and see which one generalizes better on the test set. You also want to apply regularization to reduce overfitting, but you don't know how to choose the regularization hyperparameter. Try 100 different values and pick the one that produces the smallest error.
93 | - However, after you deploy your model you see 15% error. It is probably because you tuned the hyperparameters for this particular test set. Then you should use **holdout validation with a validation / dev set**: train multiple models with various hyperparameters on the reduced training set (the full training set minus the validation set), select the model performing best on the validation set, and then retrain it on the full training set.
94 | - [**Cross validation**](https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/)
95 | - **Data Mismatch** `page 33`
96 | - Example: you want to develop a flower species classifier. You downloaded pictures from the web, and you have 10K pictures taken with the app. **TIP! Remember, your validation and test sets must be as representative as possible of the data you expect to use in production.** In this case split the app pictures 50 / 50 into dev & test sets (make sure no pictures, not even near-duplicates, end up in both).
97 | - After training, you see that the model performs poorly on the validation set. Is it overfitting, or a mismatch between web and phone pics?
98 | - One solution is to hold out part of the training set (web pics) as a **train-dev set**. If, after training, the model performs well on the train-dev set, then the problem is data mismatch: use preprocessing to make the web pics look like the phone pics.
99 | - But if the model performs poorly on the train-dev set, then you have overfitting. Try to simplify or regularize the model, get more training data, and clean up the training data.
100 |
101 | ### Extra
102 | - **Hyperparameters** are those we supply to the model, for example the number of hidden nodes and layers, input features, learning rate, and activation function in a neural network, while **parameters** are those learned during training, such as weights and biases.
103 |
104 |
--------------------------------------------------------------------------------
/img/model1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/img/model1.png
--------------------------------------------------------------------------------
/img/model2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/img/model2.png
--------------------------------------------------------------------------------
/img/model3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/img/model3.png
--------------------------------------------------------------------------------
/img/precision-recall.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/img/precision-recall.png
--------------------------------------------------------------------------------
/img/reinforcement-learning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/img/reinforcement-learning.png
--------------------------------------------------------------------------------
/numpy-pandas/02-example.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "language_info": {
4 | "codemirror_mode": {
5 | "name": "ipython",
6 | "version": 3
7 | },
8 | "file_extension": ".py",
9 | "mimetype": "text/x-python",
10 | "name": "python",
11 | "nbconvert_exporter": "python",
12 | "pygments_lexer": "ipython3",
13 | "version": "3.8.6"
14 | },
15 | "orig_nbformat": 2,
16 | "kernelspec": {
17 | "name": "python386jvsc74a57bd04ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a",
18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)"
19 | },
20 | "metadata": {
21 | "interpreter": {
22 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a"
23 | }
24 | }
25 | },
26 | "nbformat": 4,
27 | "nbformat_minor": 2,
28 | "cells": [
29 | {
30 | "cell_type": "code",
31 | "execution_count": 1,
32 | "metadata": {},
33 | "outputs": [],
34 | "source": [
35 | "import pandas as pd"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 3,
41 | "metadata": {},
42 | "outputs": [
43 | {
44 | "output_type": "stream",
45 | "name": "stdout",
46 | "text": [
47 | " state/region ages year population\n0 AL under18 2012 1117489.0\n1 AL total 2012 4817528.0\n2 AL under18 2010 1130966.0\n3 AL total 2010 4785570.0\n4 AL under18 2011 1125763.0\n state area (sq. mi)\n0 Alabama 52423\n1 Alaska 656425\n2 Arizona 114006\n3 Arkansas 53182\n4 California 163707\n state abbreviation\n0 Alabama AL\n1 Alaska AK\n2 Arizona AZ\n3 Arkansas AR\n4 California CA\n"
48 | ]
49 | }
50 | ],
51 | "source": [
52 | "pop = pd.read_csv('data/state-population.csv')\n",
53 | "areas = pd.read_csv('data/state-areas.csv')\n",
54 | "abbrevs = pd.read_csv('data/state-abbrevs.csv')\n",
55 | "\n",
56 | "print(pop.head()); print(areas.head()); print(abbrevs.head())"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 22,
62 | "metadata": {},
63 | "outputs": [
64 | {
65 | "output_type": "execute_result",
66 | "data": {
67 | "text/plain": [
68 | " state/region ages year population state\n",
69 | "0 AL under18 2012 1117489.0 Alabama\n",
70 | "1 AL total 2012 4817528.0 Alabama\n",
71 | "2 AL under18 2010 1130966.0 Alabama\n",
72 | "3 AL total 2010 4785570.0 Alabama\n",
73 | "4 AL under18 2011 1125763.0 Alabama"
74 | ]
76 | },
77 | "metadata": {},
78 | "execution_count": 22
79 | }
80 | ],
81 | "source": [
82 | "merged = pd.merge(pop, abbrevs, how='outer', left_on='state/region', right_on='abbreviation') # if you do not specify left_on/right_on then no common coumns error\n",
83 | "merged = merged.drop('abbreviation', 1) # drop duplicate info \n",
84 | "merged.head()"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 23,
90 | "metadata": {},
91 | "outputs": [
92 | {
93 | "output_type": "execute_result",
94 | "data": {
95 | "text/plain": [
96 | "state/region False\n",
97 | "ages False\n",
98 | "year False\n",
99 | "population True\n",
100 | "state True\n",
101 | "dtype: bool"
102 | ]
103 | },
104 | "metadata": {},
105 | "execution_count": 23
106 | }
107 | ],
108 | "source": [
109 | "merged.isnull().any()"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 30,
115 | "metadata": {},
116 | "outputs": [
117 | {
118 | "output_type": "execute_result",
119 | "data": {
120 | "text/plain": [
121 | " state/region ages year population state\n",
122 | "2448 PR under18 1990 NaN NaN\n",
123 | "2449 PR total 1990 NaN NaN\n",
124 | "2450 PR total 1991 NaN NaN\n",
125 | "2451 PR under18 1991 NaN NaN\n",
126 | "2452 PR total 1993 NaN NaN"
127 | ]
129 | },
130 | "metadata": {},
131 | "execution_count": 30
132 | }
133 | ],
134 | "source": [
135 | "merged[merged['population'].isnull()].head()"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 31,
141 | "metadata": {},
142 | "outputs": [
143 | {
144 | "output_type": "execute_result",
145 | "data": {
146 | "text/plain": [
147 | "state/region False\n",
148 | "ages False\n",
149 | "year False\n",
150 | "population True\n",
151 | "state False\n",
152 | "dtype: bool"
153 | ]
154 | },
155 | "metadata": {},
156 | "execution_count": 31
157 | }
158 | ],
159 | "source": [
160 | "merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'\n",
161 | "merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'\n",
162 | "merged.isnull().any()"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 36,
168 | "metadata": {},
169 | "outputs": [
170 | {
171 | "output_type": "execute_result",
172 | "data": {
173 | "text/plain": [
174 | " state/region ages year population state area (sq. mi)\n",
175 | "0 AL under18 2012 1117489.0 Alabama 52423.0\n",
176 | "1 AL total 2012 4817528.0 Alabama 52423.0\n",
177 | "2 AL under18 2010 1130966.0 Alabama 52423.0\n",
178 | "3 AL total 2010 4785570.0 Alabama 52423.0\n",
179 | "4 AL under18 2011 1125763.0 Alabama 52423.0"
180 | ]
182 | },
183 | "metadata": {},
184 | "execution_count": 36
185 | }
186 | ],
187 | "source": [
188 | "final = pd.merge(merged, areas, on='state', how='left')\n",
189 | "final.head()"
190 | ]
191 | },
192 | {
193 | "cell_type": "code",
194 | "execution_count": 37,
195 | "metadata": {},
196 | "outputs": [
197 | {
198 | "output_type": "execute_result",
199 | "data": {
200 | "text/plain": [
201 | "(2544, 6)"
202 | ]
203 | },
204 | "metadata": {},
205 | "execution_count": 37
206 | }
207 | ],
208 | "source": [
209 | "final.shape"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": 38,
215 | "metadata": {},
216 | "outputs": [
217 | {
218 | "output_type": "execute_result",
219 | "data": {
220 | "text/plain": [
221 | "state/region False\n",
222 | "ages False\n",
223 | "year False\n",
224 | "population True\n",
225 | "state False\n",
226 | "area (sq. mi) True\n",
227 | "dtype: bool"
228 | ]
229 | },
230 | "metadata": {},
231 | "execution_count": 38
232 | }
233 | ],
234 | "source": [
235 | "final.isnull().any()"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 39,
241 | "metadata": {},
242 | "outputs": [
243 | {
244 | "output_type": "execute_result",
245 | "data": {
246 | "text/plain": [
247 | "array(['United States'], dtype=object)"
248 | ]
249 | },
250 | "metadata": {},
251 | "execution_count": 39
252 | }
253 | ],
254 | "source": [
255 | "final['state'][final['area (sq. mi)'].isnull()].unique()"
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": 40,
261 | "metadata": {},
262 | "outputs": [
263 | {
264 | "output_type": "execute_result",
265 | "data": {
266 | "text/plain": [
267 | " state/region ages year population state area (sq. mi)\n",
268 | "0 AL under18 2012 1117489.0 Alabama 52423.0\n",
269 | "1 AL total 2012 4817528.0 Alabama 52423.0\n",
270 | "2 AL under18 2010 1130966.0 Alabama 52423.0\n",
271 | "3 AL total 2010 4785570.0 Alabama 52423.0\n",
272 | "4 AL under18 2011 1125763.0 Alabama 52423.0"
273 | ],
274 | "text/html": "\n\n
\n \n \n | \n state/region | \n ages | \n year | \n population | \n state | \n area (sq. mi) | \n
\n \n \n \n 0 | \n AL | \n under18 | \n 2012 | \n 1117489.0 | \n Alabama | \n 52423.0 | \n
\n \n 1 | \n AL | \n total | \n 2012 | \n 4817528.0 | \n Alabama | \n 52423.0 | \n
\n \n 2 | \n AL | \n under18 | \n 2010 | \n 1130966.0 | \n Alabama | \n 52423.0 | \n
\n \n 3 | \n AL | \n total | \n 2010 | \n 4785570.0 | \n Alabama | \n 52423.0 | \n
\n \n 4 | \n AL | \n under18 | \n 2011 | \n 1125763.0 | \n Alabama | \n 52423.0 | \n
\n \n
\n
"
275 | },
276 | "metadata": {},
277 | "execution_count": 40
278 | }
279 | ],
280 | "source": [
281 | "final.dropna(inplace=True)\n",
282 | "final.head()"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": 42,
288 | "metadata": {},
289 | "outputs": [
290 | {
291 | "output_type": "execute_result",
292 | "data": {
293 | "text/plain": [
294 | "(2476, 6)"
295 | ]
296 | },
297 | "metadata": {},
298 | "execution_count": 42
299 | }
300 | ],
301 | "source": [
302 | "final.shape"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": 43,
308 | "metadata": {},
309 | "outputs": [
310 | {
311 | "output_type": "execute_result",
312 | "data": {
313 | "text/plain": [
314 | " state/region ages year population state area (sq. mi)\n",
315 | "3 AL total 2010 4785570.0 Alabama 52423.0\n",
316 | "91 AK total 2010 713868.0 Alaska 656425.0\n",
317 | "101 AZ total 2010 6408790.0 Arizona 114006.0\n",
318 | "189 AR total 2010 2922280.0 Arkansas 53182.0\n",
319 | "197 CA total 2010 37333601.0 California 163707.0"
320 | ],
321 | "text/html": "\n\n
\n \n \n | \n state/region | \n ages | \n year | \n population | \n state | \n area (sq. mi) | \n
\n \n \n \n 3 | \n AL | \n total | \n 2010 | \n 4785570.0 | \n Alabama | \n 52423.0 | \n
\n \n 91 | \n AK | \n total | \n 2010 | \n 713868.0 | \n Alaska | \n 656425.0 | \n
\n \n 101 | \n AZ | \n total | \n 2010 | \n 6408790.0 | \n Arizona | \n 114006.0 | \n
\n \n 189 | \n AR | \n total | \n 2010 | \n 2922280.0 | \n Arkansas | \n 53182.0 | \n
\n \n 197 | \n CA | \n total | \n 2010 | \n 37333601.0 | \n California | \n 163707.0 | \n
\n \n
\n
"
322 | },
323 | "metadata": {},
324 | "execution_count": 43
325 | }
326 | ],
327 | "source": [
328 | "data2010 = final.query(\"year == 2010 & ages == 'total'\")\n",
329 | "data2010.head()"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": 44,
335 | "metadata": {},
336 | "outputs": [],
337 | "source": [
338 | "data2010.set_index('state', inplace=True)\n",
339 | "density = data2010['population'] / data2010['area (sq. mi)']"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": 45,
345 | "metadata": {},
346 | "outputs": [
347 | {
348 | "output_type": "execute_result",
349 | "data": {
350 | "text/plain": [
351 | "state\n",
352 | "District of Columbia 8898.897059\n",
353 | "Puerto Rico 1058.665149\n",
354 | "New Jersey 1009.253268\n",
355 | "Rhode Island 681.339159\n",
356 | "Connecticut 645.600649\n",
357 | "dtype: float64"
358 | ]
359 | },
360 | "metadata": {},
361 | "execution_count": 45
362 | }
363 | ],
364 | "source": [
365 | "density.sort_values(ascending=False, inplace=True)\n",
366 | "density.head()"
367 | ]
368 | },
369 | {
370 | "cell_type": "code",
371 | "execution_count": 46,
372 | "metadata": {},
373 | "outputs": [
374 | {
375 | "output_type": "execute_result",
376 | "data": {
377 | "text/plain": [
378 | "state\n",
379 | "South Dakota 10.583512\n",
380 | "North Dakota 9.537565\n",
381 | "Montana 6.736171\n",
382 | "Wyoming 5.768079\n",
383 | "Alaska 1.087509\n",
384 | "dtype: float64"
385 | ]
386 | },
387 | "metadata": {},
388 | "execution_count": 46
389 | }
390 | ],
391 | "source": [
392 | "density.tail()"
393 | ]
394 | },
395 | {
396 | "cell_type": "code",
397 | "execution_count": 48,
398 | "metadata": {},
399 | "outputs": [
400 | {
401 | "output_type": "execute_result",
402 | "data": {
403 | "text/plain": [
404 | "state\n",
405 | "District of Columbia 8898.897059\n",
406 | "Puerto Rico 1058.665149\n",
407 | "New Jersey 1009.253268\n",
408 | "Rhode Island 681.339159\n",
409 | "Connecticut 645.600649\n",
410 | "Massachusetts 621.815538\n",
411 | "Maryland 466.445797\n",
412 | "Delaware 460.445752\n",
413 | "New York 356.094135\n",
414 | "Florida 286.597129\n",
415 | "Pennsylvania 275.966651\n",
416 | "Ohio 257.549634\n",
417 | "California 228.051342\n",
418 | "Illinois 221.687472\n",
419 | "Virginia 187.622273\n",
420 | "Indiana 178.197831\n",
421 | "North Carolina 177.617157\n",
422 | "Georgia 163.409902\n",
423 | "Tennessee 150.825298\n",
424 | "South Carolina 144.854594\n",
425 | "New Hampshire 140.799273\n",
426 | "Hawaii 124.746707\n",
427 | "Kentucky 107.586994\n",
428 | "Michigan 102.015794\n",
429 | "Washington 94.557817\n",
430 | "Texas 93.987655\n",
431 | "Alabama 91.287603\n",
432 | "Louisiana 87.676099\n",
433 | "Wisconsin 86.851900\n",
434 | "Missouri 86.015622\n",
435 | "West Virginia 76.519582\n",
436 | "Vermont 65.085075\n",
437 | "Mississippi 61.321530\n",
438 | "Minnesota 61.078373\n",
439 | "Arizona 56.214497\n",
440 | "Arkansas 54.948667\n",
441 | "Iowa 54.202751\n",
442 | "Oklahoma 53.778278\n",
443 | "Colorado 48.493718\n",
444 | "Oregon 39.001565\n",
445 | "Maine 37.509990\n",
446 | "Kansas 34.745266\n",
447 | "Utah 32.677188\n",
448 | "Nevada 24.448796\n",
449 | "Nebraska 23.654153\n",
450 | "Idaho 18.794338\n",
451 | "New Mexico 16.982737\n",
452 | "South Dakota 10.583512\n",
453 | "North Dakota 9.537565\n",
454 | "Montana 6.736171\n",
455 | "Wyoming 5.768079\n",
456 | "Alaska 1.087509\n",
457 | "dtype: float64"
458 | ]
459 | },
460 | "metadata": {},
461 | "execution_count": 48
462 | }
463 | ],
464 | "source": [
465 | "density"
466 | ]
467 | },
468 | {
469 | "cell_type": "code",
470 | "execution_count": null,
471 | "metadata": {},
472 | "outputs": [],
473 | "source": []
474 | }
475 | ]
476 | }
--------------------------------------------------------------------------------
/numpy-pandas/README.md:
--------------------------------------------------------------------------------
1 | # Python Data Science Handbook
2 |
3 | Rustam-Z🚀 • 1 June 2021
4 |
5 | My notes on **NumPy: ndarray**, **Pandas: DataFrame**, **Matplotlib**, and **Scikit-Learn**
6 |
7 | ## Contents
8 | 1. IPython: Beyond Normal Python - *All features of Jupyter Notebook*
9 | 2. [Introduction to NumPy: Math operations with NumPy](#CHAPTER-2:-Introduction-to-NumPy)
10 | - Creating Arrays
11 | - The Basics of NumPy Arrays
12 | - Computation on NumPy Arrays
13 | - Fancy indexing
14 | - Structured Arrays
15 | 3. [Data Manipulation with Pandas](#CHAPTER-3:-Data-Manipulation-with-Pandas)
16 | - The Pandas Series / DataFrame / Index Objects
17 | - Data Selection in Series / DataFrame
18 | - Missing Data in Pandas / Operating on NULL values
19 | - Combining Datasets: Concat and Append
20 | - [GroupBy: Split, Apply, Combine](#GroupBy:-Split,-Apply,-Combine)
21 | 4. [Visualization with Matplotlib](#CHAPTER-4:-Visualization-with-Matplotlib)
22 | 5. [Machine Learning](#Machine-Learning)
23 |
24 | ## CHAPTER 2: Introduction to NumPy
25 | - `axis=0` acts down the rows (one result per column), `axis=1` acts across the columns (one result per row); see the sketch below
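A tiny sketch of that behaviour with a made-up 2x3 array:

```python
import numpy as np

M = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

M.sum(axis=0)   # array([3, 5, 7]) -> collapses the rows, one result per column
M.sum(axis=1)   # array([ 3, 12])  -> collapses the columns, one result per row
```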
26 |
27 | ### Creating Arrays
28 | ```python
29 | np.zeros(10, dtype=int) # Create a length-10 integer array filled with zeros
30 | np.ones((3, 5), dtype=float) # Create a 3x5 floating-point array filled with 1s
31 | np.full((3, 5), 3.14) # Create a 3x5 array filled with 3.14
32 | np.arange(0, 20, 2) # As python's range()
33 | np.linspace(0, 1, 5) # Create an array of five values evenly spaced between 0 and 1
34 | np.random.random((3, 3)) # 3x3 array, random values between 0 and 1
35 | np.random.normal(0, 1, (3, 3)) # normal distribution, with mean 0 and standard deviation 1
36 | np.random.randint(0, 10, (3, 3)) # random integers between 0 and 10
37 | np.eye(3) # Create a 3x3 identity matrix
38 | np.empty(3) # Create an uninitialized array of three floats (contents are whatever happens to be in memory)
39 |
40 | np.zeros(10, dtype='int16') # same as
41 | np.zeros(10, dtype=np.int16)
42 | ```
43 |
44 | ### The Basics of NumPy Arrays
45 | - *Attributes of arrays*
46 | - Determining the size, shape, memory consumption, and data types of arrays
47 | - *Indexing of arrays*
48 | - Getting and setting the value of individual array elements
49 | - *Slicing of arrays*
50 | - Getting and setting smaller subarrays within a larger array
51 | - *Reshaping of arrays*
52 | - Changing the shape of a given array
53 | - *Array Concatenation and Splitting*
54 | - Combining multiple arrays into one, and splitting one array into many
55 |
56 | - indices `(e.g., arr[0])`, slices `(e.g., arr[:5])`, and boolean masks `(e.g., arr[arr > 0])`
57 | - [np.newaxis()](https://stackoverflow.com/questions/46334014/np-reshapex-1-1-vs-x-np-newaxis) - inserts a new axis of length 1 (see the sketch below)
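A short sketch of `np.newaxis` (the same idea as the `reshape(-1, 1)` trick from the linked answer):

```python
import numpy as np

x = np.array([1, 2, 3])        # shape (3,)
col = x[:, np.newaxis]         # shape (3, 1) -> column vector
row = x[np.newaxis, :]         # shape (1, 3) -> row vector
col2 = x.reshape(-1, 1)        # equivalent to x[:, np.newaxis]
```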
58 |
59 | ```python
60 | """ Attributes of arrays """
61 | x = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array
62 | x.ndim # 3
63 | x.shape # (3, 4, 5)
64 | x.size # 60 = 3*4*5
65 | x.dtype # dtype: int64
66 | x.nbytes # total size of array in bytes
67 | ```
68 | ```python
69 | """ Indexing of arrays """
70 | # Same as in Python lists, but beware: assigning a float into an int array silently truncates it to int
71 | x[0][0][1] or x[0, 0, 1]
72 | ```
73 | ```python
74 | """ Slicing of arrays """
75 | # Same as python lists
76 | # NOTE! Multidimensional slices work in the same way, with multiple slices separated by commas.
77 | x[start:stop:step]
78 | ```
79 | - NOTE! NumPy slicing returns a *view* of the original array, so modifying the sliced array also modifies the original (demonstrated below). Use the **copy()** method when you don't want that: `x_copy = x[:2, :2].copy()`
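A tiny demonstration of the view behaviour (arbitrary values):

```python
import numpy as np

x = np.arange(9).reshape(3, 3)
view = x[:2, :2]        # a view, not a copy
view[0, 0] = 99
print(x[0, 0])          # 99 -> the original changed too

safe = x[:2, :2].copy()
safe[0, 0] = -1
print(x[0, 0])          # still 99 -> the copy is independent
```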
80 |
81 | ```python
82 | """ Reshaping of Arrays """
83 | # reshape() method
84 | np.arange(1, 10).reshape((3, 3))
85 | ```
86 | ```python
87 | """ Array Concatenation """
88 | x = np.array([1, 2, 3])
89 | y = np.array([3, 2, 1])
90 |
91 | grid = np.array([[9, 8, 7],[6, 5, 4]])
92 |
93 | np.concatenate([x, y]) # joins along axis=0 by default; for 2D arrays, axis=1 concatenates horizontally
94 |
95 | # If working with different dimensions
96 | np.vstack([x, grid])
97 | np.hstack([grid, y])
98 | # np.dstack will stack arrays along the third axis
99 |
100 | """ Splitting of arrays """
101 | # np.split, np.hsplit, np.vsplit
102 | x = [1, 2, 3, 99, 99, 3, 2, 1]
103 | x1, x2, x3 = np.split(x, [3, 5]) # we give splitting points
104 | print(x1, x2, x3) # [1 2 3] [99 99] [3 2 1] # N split points --> N+1 subarrays
105 | ```
106 |
107 | ### Computation on NumPy Arrays
108 | - *unary ufuncs* operate on a single input; *binary ufuncs* operate on two inputs
109 | ```
110 | + np.add Addition (e.g., 1 + 1 = 2)
111 | - np.subtract Subtraction (e.g., 3 - 2 = 1)
112 | - np.negative Unary negation (e.g., -2)
113 | * np.multiply Multiplication (e.g., 2 * 3 = 6)
114 | / np.divide Division (e.g., 3 / 2 = 1.5)
115 | // np.floor_divide Floor division (e.g., 3 // 2 = 1)
116 | ** np.power Exponentiation (e.g., 2 ** 3 = 8)
117 | % np.mod Modulus/remainder (e.g., 9 % 4 = 1)
118 |
119 | np.abs(x)
120 | np.sin(x), np.cos(x), np.tan(x)
121 | np.log(x), np.log2(x), np.log10(x)
122 | np.exp(x) e^x
123 | np.exp2(x) 2^x
124 | np.power(3, x) 3^x
125 | np.expm1(x) exp(x) - 1
126 | np.log1p(x) log(1 + x)
127 | ```
128 | ```python
129 | x = np.arange(1, 6)
130 | np.add.reduce(x) # 15, sum of all elements
131 | np.multiply.reduce(x) # 120, multiplication of all elements
132 |
133 | np.add.accumulate(x) # array([ 1, 3, 6, 10, 15]), intermediate result
134 | np.multiply.accumulate(x) # array([ 1, 2, 6, 24, 120])
135 |
136 | np.multiply.outer(x, x) # N+1 dimension multiplication
137 |
138 | np.sum Compute sum of elements
139 | np.prod Compute product of elements
140 | np.mean Compute mean of elements
141 | np.std Compute standard deviation
142 | np.var Compute variance
143 | np.min Find minimum value
144 | np.max Find maximum value
145 | np.argmin Find index of minimum value
146 | np.argmax Find index of maximum value
147 | np.median Compute median of elements
148 | np.percentile Compute rank-based statistics of elements, e.g. np.percentile(arr, 25)
149 | np.any Evaluate whether any elements are true
150 | np.all Evaluate whether all elements are true
151 | ```
152 | ```python
153 | """Comparison Operators"""
154 | == np.equal
155 | != np.not_equal
156 | < np.less np.less(x, 3) is x < 3
157 | <= np.less_equal
158 | > np.greater
159 | >= np.greater_equal
160 |
161 | # Example
162 | x = np.array([1, 2, 3, 4, 5])
163 | x < 3 # array([ True, True, False, False, False], dtype=bool)
164 | (2 * x) == (x ** 2) # array([False, True, False, False, False], dtype=bool)
165 | ```
166 | ```python
167 | """Working with Boolean Arrays"""
168 | print(x) # [[5 0 3 3][7 9 3 5][2 4 7 6]]
169 |
170 | # Counting entries
171 | np.count_nonzero(x < 6) # 8, how many values less than 6?
172 | np.sum(x < 6) # 8, counts elements less than 6
173 | np.sum(x < 6, axis=1) # how many values less than 6 in each row?
174 | np.any(x > 8) # are there any values greater than 8?
175 | np.all(x < 10) # are all values less than 10?
176 | np.all(x < 8, axis=1) # are all values in each row less than 8?
177 |
178 | # Boolean operators
179 | & np.bitwise_and
180 | | np.bitwise_or
181 | ^ np.bitwise_xor
182 | ~ np.bitwise_not
183 | np.sum((inches > 0.5) & (inches < 1)) # counts the elements that satisfy both conditions
184 | np.sum(~( (inches <= 0.5) | (inches >= 1) ))
185 |
186 | x[x < 5] # [0 3 3 3 2 4]
187 |
188 | # Fancy indexing
189 | x = rand.randint(100, size=10)
190 | y = np.array([1, 2])
191 | x[y] # array([92, 14])
192 | ```
193 | - `np.sort(x)`, `np.argsort(x)` , `np.sort(X, axis=0)` = sort each column of X
194 | - Partial Sorts: `np.partition(x, 3)` - puts the 3 smallest elements to the left of the partition point (see the sketch below)
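A quick sketch of these sorting helpers with made-up values:

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])

np.sort(x)          # array([1, 2, 3, 4, 5, 6, 7])
np.argsort(x)       # indices that would sort x -> array([3, 1, 2, 6, 5, 4, 0])
np.partition(x, 3)  # the 3 smallest values end up left of index 3, in arbitrary order
```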
195 |
196 | ```python
197 | """NumPy’s Structured Arrays: Compound data types"""
198 | name = ['Alice', 'Bob', 'Cathy', 'Doug']
199 | age = [25, 45, 37, 19]
200 | weight = [55.0, 85.5, 68.0, 61.5]
201 |
202 | # We need to combine them
203 | x = np.zeros(4, dtype=int)
204 | data = np.zeros(4, dtype={'names':('name', 'age', 'weight'), 'formats':('U10', 'i4', 'f8')})
205 | data['name'] = name
206 | data['age'] = age
207 | data['weight'] = weight
208 |
209 | print(data) # [('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0) ('Doug', 19, 61.5)]
210 |
211 | # Get all names
212 | data['name'] # array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')
267 | indA & indB # intersection => Int64Index([3, 5, 7], dtype='int64')
268 | indA | indB # union => Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
269 | indA ^ indB # symmetric difference => Int64Index([1, 2, 9, 11], dtype='int64')
270 | ```
271 | ```python
272 | """Data Selection in Series"""
273 | data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
274 |
275 | data['b'] # 0.5
276 | 'a' in data # True
277 | data.keys()
278 | data.items() # key: value
279 | data['e'] = 1.25 # We can add new item
280 |
281 | # slicing explicit, 'c' will be included
282 | data['a':'c']
283 |
284 | # slicing implicit
285 | data[0:2]
286 |
287 | # masking
288 | data[(data > 0.3) & (data < 0.8)]
289 |
290 | # fancy indexing
291 | data[['a', 'e']]
292 |
293 | """Indexers: loc, iloc, and ix
294 | loc = allows indexing and slicing that always references the explicit index (own indexing)
295 | iloc = allows indexing and slicing that always references the implicit Python-style index (from 0)
296 |
297 | `TIP!` "Explicit is better than implicit" (see the example after this block)
298 | """
299 | ```
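A minimal loc/iloc example, assuming a small Series with a non-default integer index:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

data.loc[1]      # 'a' -> explicit index label 1
data.iloc[1]     # 'b' -> implicit position 1 (second element)

data.loc[1:3]    # slices by label, inclusive -> 'a', 'b'
data.iloc[1:3]   # slices by position -> 'b', 'c'
```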
300 | ```python
301 | """Data Selection in DataFrame"""
302 | # DataFrame as a dictionary
303 | data = pd.DataFrame({'area':area, 'pop':pop})
304 | data['area']
305 | data.area # attribute-style access; doesn't work if the column name clashes with a DataFrame method (e.g. 'pop') or isn't a valid identifier
306 | # Add new column
307 | data['density'] = data['pop'] / data['area']
308 | # Access samples
309 | data.loc['Texas']
310 |
311 | # DataFrame as two-dimensional array
312 | data.values
313 | data.T # Transpose
314 | data.iloc[:3, :2] # Chooses both row and column respectively
315 | data.loc[:'New York', :'pop'] # same as previous
316 | data.loc[data.density > 100, ['pop', 'density']] # fancy indexing
317 | # Change like this
318 | data.iloc[0, 2] = 90
319 | data[data.density > 100]
320 | ```
321 | - Until page 114
322 | - We can perform NumPy operations over Pandas Series and Dataframe (adding, division)
323 | ```py
324 | A = pd.Series([2, 4, 6], index=[0, 1, 2])
325 | B = pd.Series([1, 3, 5], index=[1, 2, 3])
326 | print(A + B)
327 | print(A.add(B, fill_value=0)) # indices present in only one Series are filled with 0 instead of producing NaN
328 |
329 | ## A.add(B)
330 | + add()
331 | - sub(), subtract()
332 | * mul(), multiply()
333 | / truediv(), div(), divide()
334 | // floordiv()
335 | % mod()
336 | ** pow()
337 | ```
338 | ```py
339 | """Missing Data in Pandas"""
340 | vals2 = np.array([1, np.nan, 3, 4])
341 | np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
342 |
343 | # NaN and None in Pandas
344 | x = pd.Series(range(2), dtype=int)
345 | x[0] = None # Then it will be represented as NaN in DataFrame
346 |
347 | """Operating on Null Values"""
348 | isnull() # True / False for each element
349 | notnull() # opposite of isnull()
350 | dropna() # Return a filtered version of the data
351 | fillna()
352 |
353 | # Detecting null values
354 | df.isnull()
355 | data[data.notnull()]
356 |
357 | # Dropping null values
358 | data.dropna()
359 | df.dropna(axis='columns', how='all') # df.dropna(axis=1) | how='all', by default how='any' | thresh=3
360 |
361 | # Filling null values
362 | data.fillna(0)
363 | data.fillna(method='ffill') # propagate the previous value forward
364 | data.fillna(method='bfill')
365 | df.fillna(method='ffill', axis=1) # we can specify an axis along which the fills take place
366 |
367 | # NOTE
368 | df.isnull().any()
369 | df[df['SMTH'].isnull()].head()
370 | ```
371 | ```py
372 | """Combining Datasets: Concat and Append"""
373 | np.concatenate([x, y]) # with numpy
374 | pd.concat([x, y]) # with pandas
375 | pd.concat([x, y], ignore_index=True) # ignoring the index
376 | df1.append(df2) # same as pd.concat([df1, df2]), NOT good practice
377 |
378 | """Combining Datasets: Merge and Join"""
379 | df3 = pd.merge(df1, df2) # can use when df1 and df2 have common columns PK = primary key
380 | # check 02-pandas.ipynb
381 | ```
382 | ### GroupBy: Split, Apply, Combine
383 | - Split, apply, combine
384 | - **Functions: aggregate, filter, transform, and apply.**
385 | - The **split** step involves breaking up and grouping a DataFrame depending on the value of the specified key.
386 | - The **apply** step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
387 | - The **combine** step merges the results of these operations into an output array.
388 |
389 |
390 | - We need to apply any *Aggregation* funcs from Pandas and NumPy, like `df.groupby('key').sum()`
391 | - `n_by_state = df.groupby("state")["last_name"].count()` You call `.groupby()` and pass the name of the column you want to group on, which is ``"state"``. Then, you use `["last_name"]` to specify the columns on which you want to perform the actual aggregation.
392 |
393 | ```py
394 | # Column indexing
395 | # https://realpython.com/pandas-groupby/
396 | n_by_state = df.groupby("state")["last_name"].count()
397 | df.groupby(["state", "gender"])["last_name"].count() # for multiple, as_index=False
398 |
399 | # Dispatch methods
400 | planets.groupby('method')['year'].describe().unstack()
401 |
402 | # Aggregation
403 | df.groupby('key').aggregate(['min', np.median, max])
404 | df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'}) # even we can specify
405 |
406 | # Filtering
407 | def filter_func(x):
408 | return x['data2'].std() > 4
409 |
410 | print(df)
411 | print(df.groupby('key').std())
412 | print(df.groupby('key').filter(filter_func))
413 |
414 | # Transformation
415 | df.groupby('key').transform(lambda x: x - x.mean())
416 |
417 | # The apply() method - we can apply an arbitrary function
418 | def norm_by_data2(x):
419 | # x is a DataFrame of group values
420 | x['data1'] /= x['data2'].sum()
421 | return x
422 | print(df); print(df.groupby('key').apply(norm_by_data2))
423 | ```
424 | ```py
425 | """High-Performance Pandas: eval() and query()"""
426 | """eval()"""
427 | # Operators
428 | result1 = -df1 * df2 / (df3 + df4) - df5
429 | result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
430 |
431 | # With dataframe
432 | result1 = (df['A'] + df['B']) / (df['C'] - 1)
433 | result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
434 | df.eval('D = (A + B) / C', inplace=True) # We can even perform on DF object
435 |
436 | column_mean = df.mean(1)
437 | result1 = df['A'] + column_mean
438 | result2 = df.eval('A + @column_mean')
439 |
440 | """query()"""
441 | result1 = df[(df.A < 0.5) & (df.B < 0.5)]
442 | result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
443 | result3 = df.eval('A < 0.5 and B < 0.5') # returns a boolean mask rather than the filtered rows, so we need query()
444 | result4 = df.query('A < 0.5 and B < 0.5')
445 | ```
446 |
447 | ## CHAPTER 4: Visualization with Matplotlib
448 | ```py
449 | """Line"""
450 | plt.plot(x, np.sin(x), linestyle='-g') # -, --, -., :, -g = solid green
451 | plt.axis([-1, 11, -1.5, 1.5]) # [xmin, xmax, ymin, ymax]
452 | plt.title("A Sine Curve")
453 | plt.xlabel("x")
454 | plt.ylabel("sin(x)")
455 |
456 | # When multiple lines
457 | plt.plot(x, np.sin(x), '-g', label='sin(x)')
458 | plt.plot(x, np.cos(x), ':b', label='cos(x)')
459 | plt.axis('equal')
460 | plt.legend()
461 |
462 | """Scatter"""
463 | plt.scatter(x, y) # marker='o'
464 |
465 | """Histogram"""
466 | data = np.random.randn(1000)
467 | plt.hist(data)
468 | ```
469 |
470 | ## Machine Learning
471 | - **Classification: Predicting discrete labels**
472 | - Some important classification algorithms
473 | - Naive Bayes
474 | - Support Vector Machines
475 | - Decision Trees and Random Forests
476 | - **Regression: Predicting continuous labels**
477 | - Some important regression algorithms
478 | - Linear Regression
479 | - Support Vector Machines
480 | - Decision Trees and Random Forests
481 | - **Clustering: Inferring labels on unlabeled data**
482 | - k-Means Clustering
483 | - Gaussian Mixture Models
484 | - **Dimensionality reduction: Inferring structure of unlabeled data**
485 | - Principal Component Analysis (PCA)
486 | - Manifold Learning
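A minimal scikit-learn sketch touching each of the four task families above (iris data, default hyperparameters; an illustration rather than a tuned pipeline):

```py
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB         # classification
from sklearn.linear_model import LinearRegression  # regression
from sklearn.cluster import KMeans                 # clustering
from sklearn.decomposition import PCA              # dimensionality reduction

X, y = load_iris(return_X_y=True)

clf = GaussianNB().fit(X, y)                       # predict discrete labels
reg = LinearRegression().fit(X[:, :3], X[:, 3])    # predict a continuous value (petal width)
labels = KMeans(n_clusters=3).fit_predict(X)       # infer cluster labels without y
X_2d = PCA(n_components=2).fit_transform(X)        # project onto 2 components
```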
--------------------------------------------------------------------------------
/numpy-pandas/data/state-abbrevs.csv:
--------------------------------------------------------------------------------
1 | "state","abbreviation"
2 | "Alabama","AL"
3 | "Alaska","AK"
4 | "Arizona","AZ"
5 | "Arkansas","AR"
6 | "California","CA"
7 | "Colorado","CO"
8 | "Connecticut","CT"
9 | "Delaware","DE"
10 | "District of Columbia","DC"
11 | "Florida","FL"
12 | "Georgia","GA"
13 | "Hawaii","HI"
14 | "Idaho","ID"
15 | "Illinois","IL"
16 | "Indiana","IN"
17 | "Iowa","IA"
18 | "Kansas","KS"
19 | "Kentucky","KY"
20 | "Louisiana","LA"
21 | "Maine","ME"
22 | "Montana","MT"
23 | "Nebraska","NE"
24 | "Nevada","NV"
25 | "New Hampshire","NH"
26 | "New Jersey","NJ"
27 | "New Mexico","NM"
28 | "New York","NY"
29 | "North Carolina","NC"
30 | "North Dakota","ND"
31 | "Ohio","OH"
32 | "Oklahoma","OK"
33 | "Oregon","OR"
34 | "Maryland","MD"
35 | "Massachusetts","MA"
36 | "Michigan","MI"
37 | "Minnesota","MN"
38 | "Mississippi","MS"
39 | "Missouri","MO"
40 | "Pennsylvania","PA"
41 | "Rhode Island","RI"
42 | "South Carolina","SC"
43 | "South Dakota","SD"
44 | "Tennessee","TN"
45 | "Texas","TX"
46 | "Utah","UT"
47 | "Vermont","VT"
48 | "Virginia","VA"
49 | "Washington","WA"
50 | "West Virginia","WV"
51 | "Wisconsin","WI"
52 | "Wyoming","WY"
--------------------------------------------------------------------------------
/numpy-pandas/data/state-areas.csv:
--------------------------------------------------------------------------------
1 | state,area (sq. mi)
2 | Alabama,52423
3 | Alaska,656425
4 | Arizona,114006
5 | Arkansas,53182
6 | California,163707
7 | Colorado,104100
8 | Connecticut,5544
9 | Delaware,1954
10 | Florida,65758
11 | Georgia,59441
12 | Hawaii,10932
13 | Idaho,83574
14 | Illinois,57918
15 | Indiana,36420
16 | Iowa,56276
17 | Kansas,82282
18 | Kentucky,40411
19 | Louisiana,51843
20 | Maine,35387
21 | Maryland,12407
22 | Massachusetts,10555
23 | Michigan,96810
24 | Minnesota,86943
25 | Mississippi,48434
26 | Missouri,69709
27 | Montana,147046
28 | Nebraska,77358
29 | Nevada,110567
30 | New Hampshire,9351
31 | New Jersey,8722
32 | New Mexico,121593
33 | New York,54475
34 | North Carolina,53821
35 | North Dakota,70704
36 | Ohio,44828
37 | Oklahoma,69903
38 | Oregon,98386
39 | Pennsylvania,46058
40 | Rhode Island,1545
41 | South Carolina,32007
42 | South Dakota,77121
43 | Tennessee,42146
44 | Texas,268601
45 | Utah,84904
46 | Vermont,9615
47 | Virginia,42769
48 | Washington,71303
49 | West Virginia,24231
50 | Wisconsin,65503
51 | Wyoming,97818
52 | District of Columbia,68
53 | Puerto Rico,3515
54 |
--------------------------------------------------------------------------------
/numpy-pandas/img/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/numpy-pandas/img/.DS_Store
--------------------------------------------------------------------------------
/numpy-pandas/img/axis=1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/numpy-pandas/img/axis=1.jpg
--------------------------------------------------------------------------------
/numpy-pandas/img/groupby.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/numpy-pandas/img/groupby.png
--------------------------------------------------------------------------------
/numpy-pandas/plt1.py:
--------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | import numpy as np
3 |
4 | x = np.linspace(0, 10, 100)
5 |
6 | plt.plot(x, np.sin(x))
7 | plt.plot(x, np.cos(x))
8 |
9 | plt.show()
--------------------------------------------------------------------------------
/numpy-pandas/very-basics/Readme.md:
--------------------------------------------------------------------------------
1 | # [Python for Data Science Very Basics](https://www.sololearn.com/learning/1161)
2 |
3 | > Math Operations with NumPy
4 | > Data Manipulation with Pandas
5 | > Visualization with Matplotlib
6 |
7 | ## Statistics
8 | - **mean:** the average of the values.
9 | - **median:** the middle value.
10 | - **standard deviation:** the measure of spread, the square root of **variance**.
11 | - **variance:** average of the squared differences from the mean.
12 | - "One standard deviation from the mean" covers the values `from (mean-std) to (mean+std)`
13 |
14 | ## Math Operations with NumPy
15 | ```python
16 | # We can use Python Lists to create NumPy arrays
17 | x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
18 |
19 | # Size, dimensionality, shape of array
20 | print(x[1][2]) # 6
21 | print(x.ndim) # 2
22 | print(x.size) # 9
23 | print(x.shape) # (3, 3)
24 |
25 | x = np.array([2, 1, 3])
26 | x = np.append(x, 4) # [2, 1, 3, 4]
27 | x = np.delete(x, 0) # Takes index
28 | x = np.sort(x)
29 |
30 | # Similar to python range()
31 | x = np.arange(2, 10, 3) # [2, 5, 8]
32 |
33 | # Reshaping the array
34 | x = x.reshape(3, 1) # [[2], [5], [8]]
35 |
36 | # Indexing and slicing
37 | # Same as python lists [-1], [0:4]
38 |
39 | # Conditions
40 | y = x[x<4] # Select element that are less than 4
41 | y = x[(x>5) & (x%2==0)] # & (and), | (or)
42 |
43 | # Operations
44 | y = x.sum()
45 | y = x.min()
46 | y = x.max()
47 | y = x*2 # Broadcasting used
48 |
49 | # Statistics
50 | np.mean(x)
51 | np.median(x)
52 | np.var(x)
53 | np.std(x)
54 | ```
55 | ```python
56 | # https://www.sololearn.com/learning/eom-project/1161/1156
57 | # One standard deviation from the mean
58 | import numpy as np
59 |
60 | data = np.array([150000, 125000, 320000, 540000, 200000, 120000, 160000, 230000, 280000, 290000, 300000, 500000, 420000, 100000, 150000, 280000])
61 |
62 | mean_h = np.mean(data)
63 | std_h = np.std(data)
64 |
65 | low, high = mean_h - std_h, mean_h + std_h
66 |
67 | count = len([v for v in data if low < v < high])
68 | res = count * 100 / len(data)
69 | print(res)
70 | ```
71 |
72 | ## Data Manipulation with Pandas
73 | - Built on top of **NumPy** = "numerical python", **Pandas** = "panel data"
74 | - Used to read and extract data from files, transform and analyze it, calculate statistics and correlations.
75 | - **Series** and **DataFrame**. A **Series** is essentially a column, and a **DataFrame** is a multi-dimensional table made up of a collection of Series.
76 | - `loc` explicit indexing (own indexing), `iloc` implicit indexing (0, 1, 2, 3)
77 | ```python
78 | # Dictionary used to create DataFrame (DF)
79 | data = {
80 | 'ages': [14, 18, 24, 42],
81 | 'heights': [165, 180, 176, 184]
82 | }
83 |
84 | df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave']) # You can specify `index` if you want
85 |
86 | # How to access row?
87 | y = df.loc["Bob"] # df.loc[1]
88 |
89 | # Indexing
90 | z = df["ages"] # Series
91 | z = df[["ages", "heights"]] # DataFrame, pay attention to brackets
92 |
93 | # Slicing
94 | # iloc[], same as in python lists
95 | print(df.iloc[2]) # third row
96 | print(df.iloc[:3]) # first 3 rows
97 | print(df.iloc[1:3]) # rows 2 to 3
98 | print(df.iloc[-3:]) # accessing last three rows
99 |
100 | # Conditions
101 | z = df[(df['ages']>18) & (df['heights']>180)]
102 | ```
103 | ```python
104 | # Reading data
105 | df = pd.read_csv("test.csv")
106 |
107 | df.head() # First five rows
108 | df.tail() # Last five rows
109 |
110 | df.info()
111 | df.describe() # Statistics: mean, min, max, percentiles. We can get for a single column too df['cases'].describe()
112 |
113 | df.set_index("date", inplace=True) # Set as the index the "data" column
114 | # inplace=True used to change the currect dataframe without assigning to new
115 | ```
116 | ```python
117 | # Creating a column
118 | df['area'] = df['height'] * df['width']
119 | df['month'] = pd.to_datetime(df['date'], format="%d.%m.%y").dt.month_name()
120 |
121 | # Dropping a column
122 | df.drop("state", axis=1, inplace=True)
123 | # axis=1 specifies that we want to drop a column.
124 | # axis=0 will drop a row.
125 | ```
126 | ```python
127 | # Grouping
128 | z = df['month'].value_counts()
129 |
130 | z = df.groupby('month')['cases'].sum()
131 |
132 | z = df['cases'].sum() # max(), min(), mean()
133 | ```
134 | ```python
135 | """COVID Data Analysis"""
136 | import pandas as pd
137 |
138 | df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")
139 |
140 | df.drop('state', axis=1, inplace=True)
141 | df.set_index('date', inplace=True)
142 |
143 | df['ratio'] = df['deaths'] / df['cases']
144 |
145 | largest = df.loc[df['ratio'] == df['ratio'].max()] # note: df.loc[df['ratio'].max()] would not work here
146 | print(largest)
147 | ```
148 |
149 | ## Visualization with Matplotlib
150 | - https://www.w3schools.com/python/matplotlib_intro.asp
151 | - **Matplotlib** is a library used to create graphs, charts, and figures. It also provides functions to customize your figures by changing the colors, labels, etc.
152 | - **Matplotlib** works really well with **Pandas**! **Pandas** works well with **NumPy**.
153 | ```py
154 | import matplotlib.pyplot as plt
155 | import pandas as pd
156 |
157 | s = pd.Series([18, 42, 9, 32, 81, 64, 3])
158 | s.plot(kind='bar')
159 | plt.savefig('plot.png')
160 | ```
161 | - Data = Y axis, index = X axis.
162 | ```py
163 | """Line Plot"""
164 | import pandas as pd
165 | import matplotlib.pyplot as plt
166 |
167 | df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")
168 | df.drop('state', axis=1, inplace=True)
169 | df['date'] = pd.to_datetime(df['date'], format="%d.%m.%y")
170 | df['month'] = df['date'].dt.month
171 | df.set_index('date', inplace=True)
172 |
173 | df[df['month']==12]['cases'].plot()
174 | # Multiple lines
175 | # (df[df['month']==12])[['cases', 'deaths']].plot()
176 | ```
177 | ```py
178 | """Bar Plot"""
179 | (df.groupby('month')['cases'].sum()).plot(kind="bar") # barh = horizontal bar
180 | # OR
181 | # df = df.groupby('month')
182 | # df['cases'].sum().plot(kind="bar")
183 | ```
184 | ```py
185 | """Box Plot"""
186 | df[df["month"]==6]["cases"].plot(kind="box")
187 | ```
188 | ```py
189 | """Histogram"""
190 | df[df["month"]==6]["cases"].plot(kind="hist")
191 | ```
192 | - A **histogram** is a graph showing *frequency* distributions. Similar to box plots, **histograms** show the distribution of data.
193 | Visually histograms are similar to bar charts, however, histograms display frequencies for a group of data rather than an individual data point; therefore, no spaces are present between the bars.
194 | ```py
195 | """Area Plot"""
196 | df[df["month"]==6][["cases", "deaths"]].plot(kind="area", stacked=False)
197 | ```
198 | ```py
199 | """Scatter Plot"""
200 | df[df["month"]==6][["cases", "deaths"]].plot(kind="scatter", x='cases', y='deaths')
201 | ```
202 | ```py
203 | """Pie Chart"""
204 | df.groupby('month')['cases'].sum().plot(kind="pie")
205 | ```
206 | ```py
207 | """Plot formatting"""
208 | df[['cases', 'deaths']].plot(kind="area", legend=True, stacked=False, color=['#1970E7', '#E73E19'])
209 | plt.xlabel('Days in June')
210 | plt.ylabel('Number')
211 | plt.suptitle("COVID-19 in June")
212 | ```
--------------------------------------------------------------------------------
/numpy-pandas/very-basics/img/plt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/numpy-pandas/very-basics/img/plt.png
--------------------------------------------------------------------------------
/scikit-learn/Readme.md:
--------------------------------------------------------------------------------
1 | # Scikit-Learn
2 |
3 | - [freeCodeCamp.org](https://youtu.be/0B5eIE_1vpU)
4 | - https://inria.github.io/scikit-learn-mooc/
5 | - https://scikit-learn.org/stable/tutorial/index.html
6 | - https://machinelearningmastery.com/start-here/
7 |
8 |
9 |
10 |
11 |
12 | ### How to save / upload model
13 | ```py
14 | import joblib
15 |
16 | model = joblib.load('model.sav') # Load the model
17 | joblib.dump(model, 'model.sav') # Save the model
18 | ```
19 |
20 | ### K-Nearest Neighbors (KNN)
21 | > [Notebook](knn.ipynb)
22 | - Measured with Euclidean or Manhattan [distance](https://www.analyticsvidhya.com/blog/2020/02/4-types-of-distance-metrics-in-machine-learning/)
23 | - For a **KNN regressor** you take the average of the `n_neighbors=23` nearest neighbours
24 | - For a **KNN classifier** you take the mode (majority vote) of the `n_neighbors=23` nearest neighbours (see the sketch below)
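A small sketch of the two estimators on made-up 1-D data (arbitrary `n_neighbors=3`): the classifier votes, the regressor averages.

```py
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[0], [1], [2], [3], [4], [5]])
y_class = np.array([0, 0, 0, 1, 1, 1])             # discrete labels -> mode of the neighbours
y_reg = np.array([0.0, 0.9, 2.1, 3.0, 3.9, 5.1])   # continuous target -> mean of the neighbours

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)

print(clf.predict([[2.5]]))  # majority class among the 3 nearest points
print(reg.predict([[2.5]]))  # average target of the 3 nearest points
```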
25 |
26 | ### SVM
27 | > [Notebook](svm.ipynb)
28 | - `support vectors`, `hyperplane`, `margin`, `linearly separable`, `non-linearly separable`
29 | - Our goal is to **maximize** the **margin** (distance between marginal hyperplanes)
30 | - **SVM kernels** - transform the data from a low-dimensional space to a higher-dimensional one where it becomes separable (see the sketch below)
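A short sketch of swapping kernels in scikit-learn (iris data, default `C` and `gamma`):

```py
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

linear_svc = svm.SVC(kernel='linear').fit(X_train, y_train)  # separating hyperplane in the original space
rbf_svc = svm.SVC(kernel='rbf').fit(X_train, y_train)        # kernel trick: implicit higher-dimensional mapping

print(linear_svc.score(X_test, y_test), rbf_svc.score(X_test, y_test))
```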
31 |
32 | ### K-Means Clustering
33 | 1. Select the **K** value - the number of centroids
34 | 2. Initialize the centroids randomly
35 | 3. Calculate the **Euclidean distance** from each point to each centroid and assign the point to the nearest one
36 | 4. For each group, find the **mean** of its points
37 | 5. Move the centroid to that mean, and repeat steps 3-5 until the assignments stop changing
38 |
39 | - How to select **K**?
40 | - Elbow method: plot the inertia for several K values and pick the point where the curve bends (see the sketch below)
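A minimal sketch of K-Means plus the elbow method on synthetic blobs (the `make_blobs` parameters are arbitrary); the "elbow" is read off the inertia curve by eye:

```py
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to the nearest centroid

plt.plot(range(1, 10), inertias, marker='o')  # pick K where the curve bends (the "elbow")
plt.xlabel('K')
plt.ylabel('inertia')
plt.show()
```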
--------------------------------------------------------------------------------
/scikit-learn/img/process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/scikit-learn/img/process.png
--------------------------------------------------------------------------------
/scikit-learn/img/process1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/scikit-learn/img/process1.png
--------------------------------------------------------------------------------
/scikit-learn/img/scikit-learn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/scikit-learn/img/scikit-learn.png
--------------------------------------------------------------------------------
/scikit-learn/k-means-clustering.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "language_info": {
4 | "codemirror_mode": {
5 | "name": "ipython",
6 | "version": 3
7 | },
8 | "file_extension": ".py",
9 | "mimetype": "text/x-python",
10 | "name": "python",
11 | "nbconvert_exporter": "python",
12 | "pygments_lexer": "ipython3",
13 | "version": "3.8.6"
14 | },
15 | "orig_nbformat": 4,
16 | "kernelspec": {
17 | "name": "python3",
18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)"
19 | },
20 | "interpreter": {
21 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a"
22 | }
23 | },
24 | "nbformat": 4,
25 | "nbformat_minor": 2,
26 | "cells": [
27 | {
28 | "cell_type": "code",
29 | "execution_count": 1,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "from sklearn.datasets import load_breast_cancer\n",
34 | "from sklearn.cluster import KMeans\n",
35 | "from sklearn.model_selection import train_test_split\n",
36 | "from sklearn.metrics import accuracy_score\n",
37 | "from sklearn.preprocessing import scale\n",
38 | "\n",
39 | "import numpy as np \n",
40 | "import pandas as pd \n"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 3,
46 | "metadata": {},
47 | "outputs": [
48 | {
49 | "output_type": "execute_result",
50 | "data": {
51 | "text/plain": [
52 | "{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,\n",
53 | " 1.189e-01],\n",
54 | " [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,\n",
55 | " 8.902e-02],\n",
56 | " [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,\n",
57 | " 8.758e-02],\n",
58 | " ...,\n",
59 | " [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,\n",
60 | " 7.820e-02],\n",
61 | " [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,\n",
62 | " 1.240e-01],\n",
63 | " [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,\n",
64 | " 7.039e-02]]),\n",
65 | " 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,\n",
66 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,\n",
67 | " 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,\n",
68 | " 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,\n",
69 | " 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,\n",
70 | " 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,\n",
71 | " 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,\n",
72 | " 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,\n",
73 | " 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,\n",
74 | " 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,\n",
75 | " 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,\n",
76 | " 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
77 | " 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,\n",
78 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1,\n",
79 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0,\n",
80 | " 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,\n",
81 | " 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,\n",
82 | " 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1,\n",
83 | " 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0,\n",
84 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1,\n",
85 | " 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,\n",
86 | " 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,\n",
87 | " 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1,\n",
88 | " 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,\n",
89 | " 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
90 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]),\n",
91 | " 'frame': None,\n",
92 | " 'target_names': array(['malignant', 'benign'], dtype='\n\n\n \n \n col_0 | \n 0 | \n 1 | \n
\n \n row_0 | \n | \n | \n
\n \n \n \n 0 | \n 146 | \n 30 | \n
\n \n 1 | \n 13 | \n 266 | \n
\n \n
\n"
322 | },
323 | "metadata": {},
324 | "execution_count": 21
325 | }
326 | ],
327 | "source": [
328 | "# SOMETIMES IT MAY FLIP THE CLUSTERS, THEN WE MUST USE\n",
329 | "\n",
330 | "pd.crosstab(y_train, labels)"
331 | ]
332 | },
333 | {
334 | "cell_type": "code",
335 | "execution_count": null,
336 | "metadata": {},
337 | "outputs": [],
338 | "source": []
339 | }
340 | ]
341 | }
--------------------------------------------------------------------------------
/scikit-learn/knn.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "language_info": {
4 | "codemirror_mode": {
5 | "name": "ipython",
6 | "version": 3
7 | },
8 | "file_extension": ".py",
9 | "mimetype": "text/x-python",
10 | "name": "python",
11 | "nbconvert_exporter": "python",
12 | "pygments_lexer": "ipython3",
13 | "version": "3.8.6"
14 | },
15 | "orig_nbformat": 4,
16 | "kernelspec": {
17 | "name": "python3",
18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)"
19 | },
20 | "interpreter": {
21 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a"
22 | }
23 | },
24 | "nbformat": 4,
25 | "nbformat_minor": 2,
26 | "cells": [
27 | {
28 | "cell_type": "code",
29 | "execution_count": 14,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "import numpy as np \n",
34 | "import pandas as pd \n",
35 | "from sklearn import neighbors, metrics\n",
36 | "from sklearn.model_selection import train_test_split\n",
37 | "from sklearn.preprocessing import LabelEncoder"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 15,
43 | "metadata": {},
44 | "outputs": [
45 | {
46 | "output_type": "execute_result",
47 | "data": {
48 | "text/plain": [
49 | " buying maint doors persons lug_boot safety class\n",
50 | "0 vhigh vhigh 2 2 small low unacc\n",
51 | "1 vhigh vhigh 2 2 small med unacc\n",
52 | "2 vhigh vhigh 2 2 small high unacc\n",
53 | "3 vhigh vhigh 2 2 med low unacc\n",
54 | "4 vhigh vhigh 2 2 med med unacc"
55 | ],
56 | "text/html": "\n\n
\n \n \n | \n buying | \n maint | \n doors | \n persons | \n lug_boot | \n safety | \n class | \n
\n \n \n \n 0 | \n vhigh | \n vhigh | \n 2 | \n 2 | \n small | \n low | \n unacc | \n
\n \n 1 | \n vhigh | \n vhigh | \n 2 | \n 2 | \n small | \n med | \n unacc | \n
\n \n 2 | \n vhigh | \n vhigh | \n 2 | \n 2 | \n small | \n high | \n unacc | \n
\n \n 3 | \n vhigh | \n vhigh | \n 2 | \n 2 | \n med | \n low | \n unacc | \n
\n \n 4 | \n vhigh | \n vhigh | \n 2 | \n 2 | \n med | \n med | \n unacc | \n
\n \n
\n
"
57 | },
58 | "metadata": {},
59 | "execution_count": 15
60 | }
61 | ],
62 | "source": [
63 | "data = pd.read_csv(\"car_evaluation.csv\")\n",
64 | "data.head()"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": 16,
70 | "metadata": {},
71 | "outputs": [
72 | {
73 | "output_type": "execute_result",
74 | "data": {
75 | "text/plain": [
76 | " buying maint safety\n",
77 | "0 vhigh vhigh low\n",
78 | "1 vhigh vhigh med\n",
79 | "2 vhigh vhigh high\n",
80 | "3 vhigh vhigh low\n",
81 | "4 vhigh vhigh med"
82 | ],
83 | "text/html": "\n\n
\n \n \n | \n buying | \n maint | \n safety | \n
\n \n \n \n 0 | \n vhigh | \n vhigh | \n low | \n
\n \n 1 | \n vhigh | \n vhigh | \n med | \n
\n \n 2 | \n vhigh | \n vhigh | \n high | \n
\n \n 3 | \n vhigh | \n vhigh | \n low | \n
\n \n 4 | \n vhigh | \n vhigh | \n med | \n
\n \n
\n
"
84 | },
85 | "metadata": {},
86 | "execution_count": 16
87 | }
88 | ],
89 | "source": [
90 | "# Select features\n",
91 | "X = data[['buying', 'maint', 'safety']]\n",
92 | "X.head()"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 17,
98 | "metadata": {},
99 | "outputs": [
100 | {
101 | "output_type": "execute_result",
102 | "data": {
103 | "text/plain": [
104 | "0 unacc\n",
105 | "1 unacc\n",
106 | "2 unacc\n",
107 | "3 unacc\n",
108 | "4 unacc\n",
109 | "Name: class, dtype: object"
110 | ]
111 | },
112 | "metadata": {},
113 | "execution_count": 17
114 | }
115 | ],
116 | "source": [
117 | "# Select the label\n",
118 | "y = data['class']\n",
119 | "y.head()"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 18,
125 | "metadata": {},
126 | "outputs": [
127 | {
128 | "output_type": "execute_result",
129 | "data": {
130 | "text/plain": [
131 | "array([['vhigh', 'vhigh', 'low'],\n",
132 | " ['vhigh', 'vhigh', 'med'],\n",
133 | " ['vhigh', 'vhigh', 'high'],\n",
134 | " ...,\n",
135 | " ['low', 'low', 'low'],\n",
136 | " ['low', 'low', 'med'],\n",
137 | " ['low', 'low', 'high']], dtype=object)"
138 | ]
139 | },
140 | "metadata": {},
141 | "execution_count": 18
142 | }
143 | ],
144 | "source": [
145 | "X = X.values # NumPy array\n",
146 | "X"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 19,
152 | "metadata": {},
153 | "outputs": [
154 | {
155 | "output_type": "stream",
156 | "name": "stdout",
157 | "text": [
158 | "(1728, 3)\n['vhigh' 'vhigh' 'vhigh' ... 'low' 'low' 'low']\n['vhigh' 'vhigh' 'vhigh' ... 'low' 'low' 'low']\n['low' 'med' 'high' ... 'low' 'med' 'high']\n"
159 | ]
160 | },
161 | {
162 | "output_type": "execute_result",
163 | "data": {
164 | "text/plain": [
165 | "array([[3, 3, 1],\n",
166 | " [3, 3, 2],\n",
167 | " [3, 3, 0],\n",
168 | " [3, 3, 1],\n",
169 | " [3, 3, 2]], dtype=object)"
170 | ]
171 | },
172 | "metadata": {},
173 | "execution_count": 19
174 | }
175 | ],
176 | "source": [
177 | "\"\"\" \n",
178 | "Now we have the problem: our data consists of strings, we need to convert into nums with LabelEncoder\n",
179 | "\"\"\"\n",
180 | "# X conversion\n",
181 | "print(X.shape)\n",
182 | "\n",
183 | "for i in range(X.shape[1]): # 3\n",
184 | " print(X[:, i]) # Selects the first element for 3 columns\n",
185 | "\n",
186 | "LE = LabelEncoder()\n",
187 | "for i in range(len(X[0])):\n",
188 | " X[:, i] = LE.fit_transform(X[:, i])\n",
189 | "\n",
190 | "X[:5] # vhigh=3, med=2, low=1, high=0"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 21,
196 | "metadata": {},
197 | "outputs": [
198 | {
199 | "output_type": "execute_result",
200 | "data": {
201 | "text/plain": [
202 | "array([0, 0, 0, ..., 0, 2, 3])"
203 | ]
204 | },
205 | "metadata": {},
206 | "execution_count": 21
207 | }
208 | ],
209 | "source": [
210 | "# y conversion\n",
211 | "label_mapping = {\n",
212 | " 'unacc':0,\n",
213 | " 'acc':1,\n",
214 | " 'good':2,\n",
215 | " 'vgood':3,\n",
216 | "}\n",
217 | "\n",
218 | "y = y.map(label_mapping)\n",
219 | "y = np.array(y)\n",
220 | "y"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 29,
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "output_type": "execute_result",
230 | "data": {
231 | "text/plain": [
232 | "KNeighborsClassifier(n_neighbors=23)"
233 | ]
234 | },
235 | "metadata": {},
236 | "execution_count": 29
237 | }
238 | ],
239 | "source": [
240 | "# KNN Model\n",
241 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 20% to test set\n",
242 | "\n",
243 | "knn = neighbors.KNeighborsClassifier(n_neighbors=23, weights='uniform')\n",
244 | "knn.fit(X_train, y_train)"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": 30,
250 | "metadata": {},
251 | "outputs": [
252 | {
253 | "output_type": "execute_result",
254 | "data": {
255 | "text/plain": [
256 | "0.7485549132947977"
257 | ]
258 | },
259 | "metadata": {},
260 | "execution_count": 30
261 | }
262 | ],
263 | "source": [
264 | "predictions = knn.predict(X_test)\n",
265 | "accuracy = metrics.accuracy_score(y_test, predictions)\n",
266 | "accuracy"
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": 31,
272 | "metadata": {},
273 | "outputs": [
274 | {
275 | "output_type": "execute_result",
276 | "data": {
277 | "text/plain": [
278 | "array([0, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,\n",
279 | " 0, 0, 0, 1, 0, 0, 0, 2, 1, 0, 0, 1, 2, 0, 1, 2, 2, 0, 1, 0, 0, 0,\n",
280 | " 0, 0, 1, 0, 0, 2, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 2, 1,\n",
281 | " 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 1,\n",
282 | " 1, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,\n",
283 | " 0, 2, 0, 0, 0, 0, 2, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,\n",
284 | " 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 1, 1, 0, 1, 0, 0, 2,\n",
285 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,\n",
286 | " 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,\n",
287 | " 2, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,\n",
288 | " 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,\n",
289 | " 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0,\n",
290 | " 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
291 | " 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,\n",
292 | " 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 2, 2, 0, 0, 0, 1, 1, 1, 1,\n",
293 | " 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1])"
294 | ]
295 | },
296 | "metadata": {},
297 | "execution_count": 31
298 | }
299 | ],
300 | "source": [
301 | "predictions"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": null,
307 | "metadata": {},
308 | "outputs": [],
309 | "source": [
310 | "# For KNN regressor you take the average of n_neighbors = 23 nearest neighbours\n",
311 | "# For KNN classifier you take the mood of n_neighbors = 23 nearest neighbours"
312 | ]
313 | }
314 | ]
315 | }
--------------------------------------------------------------------------------
/scikit-learn/logistic_regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "language_info": {
4 | "codemirror_mode": {
5 | "name": "ipython",
6 | "version": 3
7 | },
8 | "file_extension": ".py",
9 | "mimetype": "text/x-python",
10 | "name": "python",
11 | "nbconvert_exporter": "python",
12 | "pygments_lexer": "ipython3",
13 | "version": 3
14 | },
15 | "orig_nbformat": 4
16 | },
17 | "nbformat": 4,
18 | "nbformat_minor": 2,
19 | "cells": [
20 | {
21 | "cell_type": "code",
22 | "execution_count": null,
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "\"\"\"Logistic regression\"\"\"\n",
27 | "\n"
28 | ]
29 | }
30 | ]
31 | }
--------------------------------------------------------------------------------
/scikit-learn/svm.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "language_info": {
4 | "codemirror_mode": {
5 | "name": "ipython",
6 | "version": 3
7 | },
8 | "file_extension": ".py",
9 | "mimetype": "text/x-python",
10 | "name": "python",
11 | "nbconvert_exporter": "python",
12 | "pygments_lexer": "ipython3",
13 | "version": "3.8.6"
14 | },
15 | "orig_nbformat": 4,
16 | "kernelspec": {
17 | "name": "python3",
18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)"
19 | },
20 | "interpreter": {
21 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a"
22 | }
23 | },
24 | "nbformat": 4,
25 | "nbformat_minor": 2,
26 | "cells": [
27 | {
28 | "cell_type": "code",
29 | "execution_count": 15,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "from sklearn import datasets\n",
34 | "from sklearn.model_selection import train_test_split\n",
35 | "from sklearn.metrics import accuracy_score\n",
36 | "from sklearn import svm\n",
37 | "\n",
38 | "import numpy as np"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 9,
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "iris = datasets.load_iris()\n",
48 | "classes = ['Iris Setosa', 'Iris Versicolour', 'Iris Virginica']\n",
49 | "\n",
50 | "# Split into features and labels\n",
51 | "X = iris.data\n",
52 | "y = iris.target"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": 10,
58 | "metadata": {},
59 | "outputs": [
60 | {
61 | "output_type": "stream",
62 | "name": "stdout",
63 | "text": [
64 | "[[5.1 3.5 1.4 0.2]\n [4.9 3. 1.4 0.2]\n [4.7 3.2 1.3 0.2]\n [4.6 3.1 1.5 0.2]\n [5. 3.6 1.4 0.2]]\n[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n 2 2]\n"
65 | ]
66 | }
67 | ],
68 | "source": [
69 | "print(X[:5]) # NumPy array\n",
70 | "print(y) # So we have 3 labels"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 11,
76 | "metadata": {},
77 | "outputs": [
78 | {
79 | "output_type": "stream",
80 | "name": "stdout",
81 | "text": [
82 | "(150, 4)\n150\n"
83 | ]
84 | }
85 | ],
86 | "source": [
87 | "print(X.shape) \n",
88 | "print(len(y))"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": 12,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": [
97 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 20% to test set"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 13,
103 | "metadata": {},
104 | "outputs": [
105 | {
106 | "output_type": "execute_result",
107 | "data": {
108 | "text/plain": [
109 | "SVC()"
110 | ]
111 | },
112 | "metadata": {},
113 | "execution_count": 13
114 | }
115 | ],
116 | "source": [
117 | "model = svm.SVC() # Classifier\n",
118 | "model.fit(X_train, y_train)"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 18,
124 | "metadata": {},
125 | "outputs": [
126 | {
127 | "output_type": "execute_result",
128 | "data": {
129 | "text/plain": [
130 | "0.9333333333333333"
131 | ]
132 | },
133 | "metadata": {},
134 | "execution_count": 18
135 | }
136 | ],
137 | "source": [
138 | "predictions = model.predict(X_test)\n",
139 | "acc = accuracy_score(predictions, y_test)\n",
140 | "acc"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 19,
146 | "metadata": {},
147 | "outputs": [
148 | {
149 | "output_type": "execute_result",
150 | "data": {
151 | "text/plain": [
152 | "array([0, 2, 0, 1, 1, 2, 2, 1, 2, 1, 0, 2, 0, 0, 1, 1, 2, 2, 2, 2, 1, 0,\n",
153 | " 1, 0, 0, 2, 1, 2, 1, 1])"
154 | ]
155 | },
156 | "metadata": {},
157 | "execution_count": 19
158 | }
159 | ],
160 | "source": [
161 | "predictions"
162 | ]
163 | }
164 | ]
165 | }
--------------------------------------------------------------------------------
/scikit-learn/train_test_split.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "language_info": {
4 | "codemirror_mode": {
5 | "name": "ipython",
6 | "version": 3
7 | },
8 | "file_extension": ".py",
9 | "mimetype": "text/x-python",
10 | "name": "python",
11 | "nbconvert_exporter": "python",
12 | "pygments_lexer": "ipython3",
13 | "version": "3.8.6"
14 | },
15 | "orig_nbformat": 4,
16 | "kernelspec": {
17 | "name": "python3",
18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)"
19 | },
20 | "interpreter": {
21 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a"
22 | }
23 | },
24 | "nbformat": 4,
25 | "nbformat_minor": 2,
26 | "cells": [
27 | {
28 | "cell_type": "code",
29 | "execution_count": 15,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "from sklearn import datasets\n",
34 | "from sklearn.model_selection import train_test_split\n",
35 | "import numpy as np"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 2,
41 | "metadata": {},
42 | "outputs": [],
43 | "source": [
44 | "iris = datasets.load_iris()\n",
45 | "\n",
46 | "# Split into features and labels\n",
47 | "X = iris.data\n",
48 | "y = iris.target"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 13,
54 | "metadata": {},
55 | "outputs": [
56 | {
57 | "output_type": "stream",
58 | "name": "stdout",
59 | "text": [
60 | "[[5.1 3.5 1.4 0.2]\n [4.9 3. 1.4 0.2]\n [4.7 3.2 1.3 0.2]\n [4.6 3.1 1.5 0.2]\n [5. 3.6 1.4 0.2]]\n[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n 2 2]\n"
61 | ]
62 | }
63 | ],
64 | "source": [
65 | "print(X[:5]) # NumPy array\n",
66 | "print(y) # So we have 3 labels"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": 12,
72 | "metadata": {},
73 | "outputs": [
74 | {
75 | "output_type": "stream",
76 | "name": "stdout",
77 | "text": [
78 | "(150, 4)\n150\n"
79 | ]
80 | }
81 | ],
82 | "source": [
83 | "print(X.shape) \n",
84 | "print(len(y))"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 19,
90 | "metadata": {},
91 | "outputs": [
92 | {
93 | "output_type": "stream",
94 | "name": "stdout",
95 | "text": [
96 | "30.0\n"
97 | ]
98 | }
99 | ],
100 | "source": [
101 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 20% to test set\n",
102 | "print(150 * 0.2) # 120 / 30"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 18,
108 | "metadata": {},
109 | "outputs": [
110 | {
111 | "output_type": "execute_result",
112 | "data": {
113 | "text/plain": [
114 | "(120, 4)"
115 | ]
116 | },
117 | "metadata": {},
118 | "execution_count": 18
119 | }
120 | ],
121 | "source": [
122 | "X_train.shape"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 20,
128 | "metadata": {},
129 | "outputs": [
130 | {
131 | "output_type": "execute_result",
132 | "data": {
133 | "text/plain": [
134 | "120"
135 | ]
136 | },
137 | "metadata": {},
138 | "execution_count": 20
139 | }
140 | ],
141 | "source": [
142 | "len(y_train)"
143 | ]
144 | }
145 | ]
146 | }
--------------------------------------------------------------------------------
/tensorflow-in-practice/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/.DS_Store
--------------------------------------------------------------------------------
/tensorflow-in-practice/Exercises/Exercise_2_Handwriting_Recognition_DNN.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Exercise2-Question.ipynb",
7 | "provenance": [],
8 | "collapsed_sections": [],
9 | "toc_visible": true
10 | },
11 | "kernelspec": {
12 | "name": "python386jvsc74a57bd04ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a",
13 | "display_name": "Python 3.8.6 64-bit ('tf': conda)"
14 | },
15 | "metadata": {
16 | "interpreter": {
17 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a"
18 | }
19 | }
20 | },
21 | "cells": [
22 | {
23 | "cell_type": "code",
24 | "execution_count": null,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "# Rustam-Z"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "metadata": {
34 | "colab": {
35 | "base_uri": "https://localhost:8080/"
36 | },
37 | "id": "9rvXQGAA0ssC",
38 | "outputId": "60861935-7551-475e-e8c4-507b43cc6de7"
39 | },
40 | "source": [
41 | "import tensorflow as tf\n",
42 | "\n",
43 | "class myCallback(tf.keras.callbacks.Callback):\n",
44 | " def on_epoch_end(self, epoch, logs={}):\n",
45 | " if(logs.get('accuracy')>0.99):\n",
46 | " print( \"Reached 99% accuracy so cancelling training!\")\n",
47 | " self.model.stop_training = True\n",
48 | "\n",
49 | "\n",
50 | "mnist = tf.keras.datasets.mnist\n",
51 | "(x_train, y_train),(x_test, y_test) = mnist.load_data()\n",
52 | "x_train, x_test = x_train / 255.0, x_test / 255.0\n",
53 | "\n",
54 | "callbacks = myCallback()\n",
55 | "\n",
56 | "model = tf.keras.models.Sequential([\n",
57 | " tf.keras.layers.Flatten(input_shape=(28, 28)),\n",
58 | " tf.keras.layers.Dense(512, activation=\"relu\"),\n",
59 | " tf.keras.layers.Dense(10, activation=\"softmax\")\n",
60 | "])\n",
61 | "\n",
62 | "model.compile(optimizer='adam',\n",
63 | " loss='sparse_categorical_crossentropy',\n",
64 | " metrics=['accuracy'])\n",
65 | "\n",
66 | "model.fit(x_train, y_train, epochs=5, callbacks=[callbacks])"
67 | ],
68 | "execution_count": 1,
69 | "outputs": [
70 | {
71 | "output_type": "stream",
72 | "name": "stdout",
73 | "text": [
74 | "Epoch 1/5\n",
75 | "1875/1875 [==============================] - 1s 656us/step - loss: 0.3419 - accuracy: 0.9011\n",
76 | "Epoch 2/5\n",
77 | "1875/1875 [==============================] - 1s 649us/step - loss: 0.0835 - accuracy: 0.9749\n",
78 | "Epoch 3/5\n",
79 | "1875/1875 [==============================] - 1s 653us/step - loss: 0.0527 - accuracy: 0.9835\n",
80 | "Epoch 4/5\n",
81 | "1875/1875 [==============================] - 1s 655us/step - loss: 0.0366 - accuracy: 0.9877\n",
82 | "Epoch 5/5\n",
83 | "1875/1875 [==============================] - 1s 653us/step - loss: 0.0248 - accuracy: 0.9925\n",
84 | "Reached 99% accuracy so cancelling training!\n"
85 | ]
86 | },
87 | {
88 | "output_type": "execute_result",
89 | "data": {
90 | "text/plain": [
91 | ""
92 | ]
93 | },
94 | "metadata": {},
95 | "execution_count": 1
96 | }
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "metadata": {
102 | "id": "qErwFEW0mz0H",
103 | "outputId": "3d8ba790-8c5e-4a55-c824-ecd92a00352c",
104 | "colab": {
105 | "base_uri": "https://localhost:8080/"
106 | }
107 | },
108 | "source": [
109 | "import tensorflow as tf\n",
110 | "\n",
111 | "print(tf.nn.relu)"
112 | ],
113 | "execution_count": 2,
114 | "outputs": [
115 | {
116 | "output_type": "stream",
117 | "name": "stdout",
118 | "text": [
119 | "\n"
120 | ]
121 | }
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "metadata": {
127 | "id": "18I-y7X-q84V",
128 | "outputId": "e9478496-2bc1-4ce9-ee62-82af2df4a8df",
129 | "colab": {
130 | "base_uri": "https://localhost:8080/"
131 | }
132 | },
133 | "source": [
134 | "model.evaluate(x_test, y_test)"
135 | ],
136 | "execution_count": 3,
137 | "outputs": [
138 | {
139 | "output_type": "stream",
140 | "name": "stdout",
141 | "text": [
142 | "313/313 [==============================] - 0s 293us/step - loss: 0.0637 - accuracy: 0.9809\n"
143 | ]
144 | },
145 | {
146 | "output_type": "execute_result",
147 | "data": {
148 | "text/plain": [
149 | "[0.06370978057384491, 0.98089998960495]"
150 | ]
151 | },
152 | "metadata": {},
153 | "execution_count": 3
154 | }
155 | ]
156 | }
157 | ]
158 | }
--------------------------------------------------------------------------------
/tensorflow-in-practice/Exercises/Exercise_3_CNN.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Exercise 3 - Question.ipynb",
7 | "provenance": [],
8 | "collapsed_sections": []
9 | },
10 | "kernelspec": {
11 | "name": "python386jvsc74a57bd04ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a",
12 | "display_name": "Python 3.8.6 64-bit ('tf': conda)"
13 | },
14 | "metadata": {
15 | "interpreter": {
16 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a"
17 | }
18 | }
19 | },
20 | "cells": [
21 | {
22 | "cell_type": "code",
23 | "metadata": {
24 | "id": "yl3yB8J_PCZM"
25 | },
26 | "source": [
27 | "# Rustam-Z"
28 | ],
29 | "execution_count": null,
30 | "outputs": []
31 | },
32 | {
33 | "cell_type": "code",
34 | "metadata": {
35 | "colab": {
36 | "base_uri": "https://localhost:8080/"
37 | },
38 | "id": "KtixUwmvSD0A",
39 | "outputId": "34a18be4-67c7-4147-a4e7-62efa5fe3124"
40 | },
41 | "source": [
42 | "import tensorflow as tf\n",
43 | "\n",
44 | "mnist = tf.keras.datasets.mnist\n",
45 | "(training_images, training_labels), (test_images, test_labels) = mnist.load_data()"
46 | ],
47 | "execution_count": 1,
48 | "outputs": []
49 | },
50 | {
51 | "cell_type": "code",
52 | "metadata": {
53 | "id": "EiLuNPb-TnyF"
54 | },
55 | "source": [
56 | "training_images=training_images.reshape(60000, 28, 28, 1)\n",
57 | "training_images=training_images / 255.0\n",
58 | "test_images = test_images.reshape(10000, 28, 28, 1)\n",
59 | "test_images=test_images/255.0"
60 | ],
61 | "execution_count": 2,
62 | "outputs": []
63 | },
64 | {
65 | "cell_type": "code",
66 | "metadata": {
67 | "id": "I-3_hM1mSImZ"
68 | },
69 | "source": [
70 | "class myCallback(tf.keras.callbacks.Callback):\n",
71 | " def on_epoch_end(self, epoch, logs={}):\n",
72 | " if(logs.get('accuracy')>0.998):\n",
73 | " print(\"\\nReached 99.8% accuracy so cancelling training!\")\n",
74 | " self.model.stop_training = True"
75 | ],
76 | "execution_count": 3,
77 | "outputs": []
78 | },
79 | {
80 | "cell_type": "code",
81 | "metadata": {
82 | "id": "sfQRyaJWAIdg"
83 | },
84 | "source": [
85 | "callbacks = myCallback()\n",
86 | "\n",
87 | "model = tf.keras.models.Sequential([\n",
88 | " tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),\n",
89 | " tf.keras.layers.MaxPooling2D(2,2),\n",
90 | " tf.keras.layers.Flatten(),\n",
91 | " tf.keras.layers.Dense(128, activation='relu'),\n",
92 | " tf.keras.layers.Dense(10, activation='softmax')\n",
93 | "])\n",
94 | "\n",
95 | "model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])"
96 | ],
97 | "execution_count": 4,
98 | "outputs": []
99 | },
100 | {
101 | "cell_type": "code",
102 | "metadata": {
103 | "colab": {
104 | "base_uri": "https://localhost:8080/"
105 | },
106 | "id": "i10RG-0ySDGY",
107 | "outputId": "104e7f06-b70e-4fbb-c283-c1b4a55e8b67"
108 | },
109 | "source": [
110 | "model.fit(training_images, training_labels, epochs=20, callbacks=[callbacks])"
111 | ],
112 | "execution_count": 5,
113 | "outputs": [
114 | {
115 | "output_type": "stream",
116 | "name": "stdout",
117 | "text": [
118 | "Epoch 1/20\n",
119 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.2992 - accuracy: 0.9104\n",
120 | "Epoch 2/20\n",
121 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0521 - accuracy: 0.9840\n",
122 | "Epoch 3/20\n",
123 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0298 - accuracy: 0.9906\n",
124 | "Epoch 4/20\n",
125 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0208 - accuracy: 0.9932\n",
126 | "Epoch 5/20\n",
127 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0133 - accuracy: 0.9960\n",
128 | "Epoch 6/20\n",
129 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0091 - accuracy: 0.9972\n",
130 | "Epoch 7/20\n",
131 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0065 - accuracy: 0.9978\n",
132 | "Epoch 8/20\n",
133 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0050 - accuracy: 0.9985\n",
134 | "\n",
135 | "Reached 99.8% accuracy so cancelling training!\n"
136 | ]
137 | },
138 | {
139 | "output_type": "execute_result",
140 | "data": {
141 | "text/plain": [
142 | ""
143 | ]
144 | },
145 | "metadata": {},
146 | "execution_count": 5
147 | }
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 9,
153 | "metadata": {},
154 | "outputs": [
155 | {
156 | "output_type": "stream",
157 | "name": "stdout",
158 | "text": [
159 | "\n[]\n"
160 | ]
161 | }
162 | ],
163 | "source": [
164 | "import tensorflow as tf\n",
165 | "print(tf.test.gpu_device_name())\n",
166 | "print(tf.config.list_physical_devices('GPU'))"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": 6,
172 | "metadata": {},
173 | "outputs": [
174 | {
175 | "output_type": "execute_result",
176 | "data": {
177 | "text/plain": [
178 | "[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]"
179 | ]
180 | },
181 | "metadata": {},
182 | "execution_count": 6
183 | }
184 | ],
185 | "source": [
186 | "tf.config.list_physical_devices()"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 9,
192 | "metadata": {},
193 | "outputs": [
194 | {
195 | "output_type": "execute_result",
196 | "data": {
197 | "text/plain": [
198 | "True"
199 | ]
200 | },
201 | "metadata": {},
202 | "execution_count": 9
203 | }
204 | ],
205 | "source": [
206 | "from tensorflow.python.compiler.mlcompute import mlcompute\n",
207 | "mlcompute.is_apple_mlc_enabled()"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 10,
213 | "metadata": {},
214 | "outputs": [
215 | {
216 | "output_type": "execute_result",
217 | "data": {
218 | "text/plain": [
219 | "True"
220 | ]
221 | },
222 | "metadata": {},
223 | "execution_count": 10
224 | }
225 | ],
226 | "source": [
227 | "mlcompute.is_tf_compiled_with_apple_mlc()"
228 | ]
229 | }
230 | ]
231 | }
--------------------------------------------------------------------------------
/tensorflow-in-practice/Exercises/Exercise_4_Complex_Images_flow_from_directory.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "Exercise 4-Question.ipynb",
7 | "provenance": []
8 | },
9 | "kernelspec": {
10 | "display_name": "Python 3",
11 | "name": "python3"
12 | }
13 | },
14 | "cells": [
15 | {
16 | "cell_type": "code",
17 | "metadata": {
18 | "colab": {
19 | "base_uri": "https://localhost:8080/"
20 | },
21 | "id": "7Vti6p3PxmpS",
22 | "outputId": "99f9f945-5bd1-41e0-c274-77966a56d7aa"
23 | },
24 | "source": [
25 | "import tensorflow as tf\n",
26 | "import os\n",
27 | "import zipfile\n",
28 | "\n",
29 | "DESIRED_ACCURACY = 0.999\n",
30 | "\n",
31 | "!wget --no-check-certificate \\\n",
32 | " \"https://storage.googleapis.com/laurencemoroney-blog.appspot.com/happy-or-sad.zip\" \\\n",
33 | " -O \"/tmp/happy-or-sad.zip\"\n",
34 | "\n",
35 | "zip_ref = zipfile.ZipFile(\"/tmp/happy-or-sad.zip\", 'r')\n",
36 | "zip_ref.extractall(\"/tmp/h-or-s\")\n",
37 | "zip_ref.close()\n",
38 | "\n",
39 | "class myCallback(tf.keras.callbacks.Callback):\n",
40 | " def on_epoch_end(self, epoch, logs={}):\n",
41 | " if(logs.get('accuracy')>DESIRED_ACCURACY):\n",
42 | " print(\"\\nReached 99.9% accuracy so cancelling training!\")\n",
43 | " self.model.stop_training = True\n",
44 | "\n",
45 | "callbacks = myCallback()"
46 | ],
47 | "execution_count": 17,
48 | "outputs": [
49 | {
50 | "output_type": "stream",
51 | "text": [
52 | "--2021-04-08 03:16:56-- https://storage.googleapis.com/laurencemoroney-blog.appspot.com/happy-or-sad.zip\n",
53 | "Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.215.128, 173.194.216.128, 173.194.217.128, ...\n",
54 | "Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.215.128|:443... connected.\n",
55 | "HTTP request sent, awaiting response... 200 OK\n",
56 | "Length: 2670333 (2.5M) [application/zip]\n",
57 | "Saving to: ‘/tmp/happy-or-sad.zip’\n",
58 | "\n",
59 | "\r/tmp/happy-or-sad.z 0%[ ] 0 --.-KB/s \r/tmp/happy-or-sad.z 100%[===================>] 2.55M --.-KB/s in 0.01s \n",
60 | "\n",
61 | "2021-04-08 03:16:56 (217 MB/s) - ‘/tmp/happy-or-sad.zip’ saved [2670333/2670333]\n",
62 | "\n"
63 | ],
64 | "name": "stdout"
65 | }
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "metadata": {
71 | "id": "6DLGbXXI1j_V"
72 | },
73 | "source": [
74 | "# This Code Block should Define and Compile the Model\n",
75 | "model = tf.keras.models.Sequential([\n",
76 | " tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(300, 300, 3)),\n",
77 | " tf.keras.layers.MaxPooling2D(2, 2),\n",
78 | " tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),\n",
79 | " tf.keras.layers.MaxPooling2D(2, 2),\n",
80 | " tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),\n",
81 | " tf.keras.layers.MaxPooling2D(2, 2),\n",
82 | " tf.keras.layers.Flatten(), # Flatten the results to feed into a DNN\n",
83 | " tf.keras.layers.Dense(512, activation='relu'), # 512 neuron hidden layer\n",
84 | " tf.keras.layers.Dense(1, activation='sigmoid'),\n",
85 | "])\n",
86 | "\n",
87 | "\n",
88 | "from tensorflow.keras.optimizers import RMSprop\n",
89 | "\n",
90 | "model.compile(loss=\"binary_crossentropy\",\n",
91 | " optimizer=RMSprop(lr=0.001),\n",
92 | " metrics=['accuracy'])"
93 | ],
94 | "execution_count": 18,
95 | "outputs": []
96 | },
97 | {
98 | "cell_type": "code",
99 | "metadata": {
100 | "colab": {
101 | "base_uri": "https://localhost:8080/"
102 | },
103 | "id": "4Ap9fUJE1vVu",
104 | "outputId": "cc7b7369-a630-478d-b57e-62c015cf127a"
105 | },
106 | "source": [
107 | "# This code block should create an instance of an ImageDataGenerator called train_datagen \n",
108 | "# And a train_generator by calling train_datagen.flow_from_directory\n",
109 | "\n",
110 | "from tensorflow.keras.preprocessing.image import ImageDataGenerator\n",
111 | "\n",
112 | "train_datagen = ImageDataGenerator(rescale=1./255)\n",
113 | "\n",
114 | "train_generator = train_datagen.flow_from_directory(\n",
115 | " '/tmp/h-or-s/',\n",
116 | " target_size=(300, 300),\n",
117 | " batch_size=8,\n",
118 | " class_mode='binary')\n",
119 | "\n",
120 | "# Expected output: 'Found 80 images belonging to 2 classes'"
121 | ],
122 | "execution_count": 13,
123 | "outputs": [
124 | {
125 | "output_type": "stream",
126 | "text": [
127 | "Found 80 images belonging to 2 classes.\n"
128 | ],
129 | "name": "stdout"
130 | }
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "metadata": {
136 | "colab": {
137 | "base_uri": "https://localhost:8080/"
138 | },
139 | "id": "48dLm13U1-Le",
140 | "outputId": "8c82e79e-fed0-4b0a-be86-089d17f1cd66"
141 | },
142 | "source": [
143 | "# This code block should call model.fit and train for\n",
144 | "# a number of epochs. \n",
145 | "history = model.fit(\n",
146 | " train_generator,\n",
147 | " steps_per_epoch=10,\n",
148 | " epochs=20,\n",
149 | " callbacks=[callbacks])\n",
150 | " \n",
151 | "# Expected output: \"Reached 99.9% accuracy so cancelling training!\"\""
152 | ],
153 | "execution_count": 19,
154 | "outputs": [
155 | {
156 | "output_type": "stream",
157 | "text": [
158 | "Epoch 1/20\n",
159 | "10/10 [==============================] - 11s 1s/step - loss: 4.0546 - accuracy: 0.5853\n",
160 | "Epoch 2/20\n",
161 | "10/10 [==============================] - 9s 936ms/step - loss: 0.8477 - accuracy: 0.6622\n",
162 | "Epoch 3/20\n",
163 | "10/10 [==============================] - 10s 1s/step - loss: 0.2766 - accuracy: 0.9474\n",
164 | "Epoch 4/20\n",
165 | "10/10 [==============================] - 10s 1s/step - loss: 0.1236 - accuracy: 0.9822\n",
166 | "Epoch 5/20\n",
167 | "10/10 [==============================] - 9s 921ms/step - loss: 0.0683 - accuracy: 0.9704\n",
168 | "Epoch 6/20\n",
169 | "10/10 [==============================] - 10s 977ms/step - loss: 0.0221 - accuracy: 1.0000\n",
170 | "\n",
171 | "Reached 99.9% accuracy so cancelling training!\n"
172 | ],
173 | "name": "stdout"
174 | }
175 | ]
176 | }
177 | ]
178 | }
--------------------------------------------------------------------------------
/tensorflow-in-practice/MNIST/my_model.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/MNIST/my_model.h5
--------------------------------------------------------------------------------
/tensorflow-in-practice/MNIST/test.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 | from PIL import Image
4 | import cv2
5 | import matplotlib.pyplot as plt
6 |
7 | model = tf.keras.models.load_model('tensorflow-in-practice/MNIST/my_model.h5')  # path relative to the repository root
8 |
9 | image = cv2.imread('tensorflow-in-practice/img/0.jpg')
10 | image = cv2.resize(image,(28,28))
11 | gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
12 | data = np.vstack([gray])
13 | data=data/255.0
14 |
15 | plt.imshow(gray, cmap='gray')
16 | plt.show()
17 |
18 | indices_one = data == 1
19 | data[indices_one] = 0 # replacing 1s with 0s
20 | print(data)
21 |
22 | predictions = model.predict(np.expand_dims(data, 0))
23 | print("\nAnswer:")
24 | print(predictions)
25 |
--------------------------------------------------------------------------------
/tensorflow-in-practice/MNIST/train.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 | from PIL import Image
4 | import cv2
5 | import matplotlib.pyplot as plt
6 |
7 | class myCallback(tf.keras.callbacks.Callback):
8 | def on_epoch_end(self, epoch, logs={}):
9 |         if(logs.get('accuracy')>0.90):
10 |             print("\nReached 90% accuracy so cancelling training!")
11 | self.model.stop_training = True
12 |
13 | mnist = tf.keras.datasets.mnist
14 |
15 | (x_train, y_train),(x_test, y_test) = mnist.load_data()
16 | x_train, x_test = x_train / 255.0, x_test / 255.0
17 |
18 | callbacks = myCallback()
19 |
20 | model = tf.keras.models.Sequential([
21 | tf.keras.layers.Flatten(input_shape=(28, 28)),
22 | tf.keras.layers.Dense(512, activation=tf.nn.relu),
23 | tf.keras.layers.Dense(256, activation=tf.nn.relu),
24 | tf.keras.layers.Dense(128, activation=tf.nn.relu),
25 | tf.keras.layers.Dense(10, activation=tf.nn.softmax)
26 | ])
27 | model.compile(optimizer=tf.optimizers.Adam(),
28 | loss='sparse_categorical_crossentropy',
29 | metrics=['accuracy'])
30 |
31 | model.fit(x_train, y_train, epochs=10, callbacks=[callbacks])
32 |
33 | image = cv2.imread('3.png')
34 | image = cv2.resize(image,(28,28))
35 | gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
36 | data = np.vstack([gray])
37 | data=data/255.0
38 |
39 | plt.imshow(gray, cmap='gray')
40 | plt.show()
41 |
42 | indices_one = data == 1
43 | data[indices_one] = 0 # replacing 1s with 0s
44 | print(data)
45 |
46 | predictions = model.predict(np.expand_dims(data, 0))
47 | print("\nAnswer:")
48 | print(predictions)
49 |
50 | model.save('my_model.h5')
--------------------------------------------------------------------------------
/tensorflow-in-practice/README.md:
--------------------------------------------------------------------------------
1 | # [TensorFlow in Practice](https://www.coursera.org/professional-certificates/tensorflow-in-practice) by DeepLearning.AI
2 |
3 | Rustam-Z🚀, 16 April 2021
4 |
5 | Hi there👋, this is the next level.
6 | 
7 | In this specialization you will learn TensorFlow and Keras.
8 | 
9 | We will cover the basics of building Keras models, computer vision with CNNs, and more.
10 |
11 | ## How to study?
12 | Go to the [specialization website](https://www.coursera.org/professional-certificates/tensorflow-in-practice), and enroll in the courses (you can audit them).
13 | - Course notebooks: https://github.com/lmoroney/dlaicourse
14 |
15 | ## What's next?
16 | - Start Kaggle competitions
17 | - Start reading the **Hands-On Machine Learning** book
18 | - **TensorFlow Advanced Techniques**: https://www.coursera.org/specializations/tensorflow-advanced-techniques
--------------------------------------------------------------------------------
/tensorflow-in-practice/convolutional-neural-networks-tensorflow.md:
--------------------------------------------------------------------------------
1 | # [Convolutional Neural Networks in TensorFlow](https://www.coursera.org/learn/convolutional-neural-networks-tensorflow)
2 |
3 | - How to work with real-world images in different shapes and sizes.
4 | - Visualize the journey of an image through convolutions to understand how a computer “sees” information
5 | - Plot loss and accuracy, and explore strategies to prevent overfitting, including augmentation and dropout.
6 | - Finally, Course 2 will introduce you to transfer learning and how learned features can be extracted from models.
7 |
8 | ## Contents:
9 | - Week 1 - [Exploring a Larger Dataset](#Exploring-a-Larger-Dataset)
10 | - Week 2 - [Augmentation](#Augmentation)
11 | - Week 3 - [Transfer Learning](#Transfer-Learning)
12 | - Week 4 - [Multiclass Classifications](#Multiclass-Classifications)
13 |
14 | ## Exploring a Larger Dataset
15 | > [Notebook](notebooks/Course_2_Part_2_Lesson_2_Notebook.ipynb)
16 |
17 | > https://www.kaggle.com/c/dogs-vs-cats 25K pictures of cats and dogs
18 |
19 | ```python
20 | # Download ZIP file and extract it with python
21 | !wget --no-check-certificate \
22 | https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip \
23 | -O /tmp/cats_and_dogs_filtered.zip
24 | # ---------------------------------------------
25 | import os
26 | import zipfile
27 |
28 | local_zip = '/tmp/cats_and_dogs_filtered.zip'
29 |
30 | zip_ref = zipfile.ZipFile(local_zip, 'r')
31 |
32 | zip_ref.extractall('/tmp')
33 | zip_ref.close()
34 | ```
35 | ```py
36 | from tensorflow.keras.preprocessing.image import ImageDataGenerator
37 |
38 | # All images will be rescaled by 1./255.
39 | train_datagen = ImageDataGenerator(rescale = 1.0/255.)
40 |
41 | train_generator = train_datagen.flow_from_directory(train_dir,
42 | batch_size=20,
43 | class_mode='binary',
44 | target_size=(150, 150))
45 | ```
46 |
47 | ## Augmentation
48 | > [Notebook](notebooks/Course_2_Part_4_Lesson_2_Notebook_(Cats_v_Dogs_Augmentation).ipynb)
49 |
50 | > `image-augmentation` • `data-augmentation` • `ImageDataGenerator`
51 |
52 | - All processing happens in main memory: flow_from_directory() generates the augmented images on the fly. It doesn't require you to edit your raw images, nor does it amend them on disk; everything is done in memory while training, so you can experiment without touching your dataset.
53 | - `ImageDataGenerator()` -> `flow_from_directory()` -> `fit_generator()`
54 | - **ImageDataGenerator** will NOT add **new images** to your data set in the sense that it will not make your epochs bigger. Instead, in each epoch it provides slightly altered versions of the existing images (depending on your configuration), and it keeps generating new variations no matter how many epochs you run.
55 |
56 | ```python
57 | train_datagen = ImageDataGenerator(
58 | rescale=1./255,
59 | rotation_range=40, # Randomly rotate image between 0 and 40°
60 | width_shift_range=0.2, # Move picture inside its frame
61 |         height_shift_range=0.2,
62 | shear_range=0.2, # Shear up to 20%
63 | zoom_range=0.2,
64 | horizontal_flip=True,
65 | fill_mode='nearest') # It attempts to recreate lost information after a transformation like a shear
66 |
67 | train_generator = train_datagen.flow_from_directory(
68 | train_dir, # This is the source directory for training images
69 | target_size=(150, 150), # All images will be resized to 150x150
70 |         batch_size=20, # Number of samples per batch, i.e. per gradient update
71 | class_mode='binary')
72 |
73 | history = model.fit_generator(
74 | train_generator,
75 | steps_per_epoch=100, # 2000 images = batch_size * steps, total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch
76 | epochs=100,
77 | # validation_data=validation_generator,
78 | # validation_steps=50, # 1000 images = batch_size * steps
79 | verbose=2)
80 | ```
81 | - https://keras.io/api/preprocessing/image/
82 | - https://fairyonice.github.io/Learn-about-ImageDataGenerator.html
83 | - https://keras.io/api/models/model_training_apis/#fit-method
84 | - https://stackoverflow.com/questions/38340311/what-is-the-difference-between-steps-and-epochs-in-tensorflow
85 | - https://stackoverflow.com/questions/51748514/does-imagedatagenerator-add-more-images-to-my-dataset
86 |
87 | ## Transfer Learning
88 | > `inception`
89 |
90 | > https://www.tensorflow.org/tutorials/images/transfer_learning
91 |
92 | ```python
93 | import os
94 | from tensorflow.keras import layers
95 | from tensorflow.keras import Model
96 | from tensorflow.keras.applications.inception_v3 import InceptionV3
97 | from tensorflow.keras.optimizers import RMSprop
98 |
99 | # Download InceptionV3 weights
100 | !wget --no-check-certificate \
101 | https://storage.googleapis.com/mledu-datasets/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5 \
102 | -O /tmp/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5
103 |
104 | local_weights_file = '/tmp/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5'
105 | pre_trained_model = InceptionV3(input_shape=(150, 150, 3),
106 |                                 include_top=False, # Do not include the top FC (fully connected) layer
107 |                                 weights=None) # Don't auto-download weights; we load them from the local file below
108 | pre_trained_model.load_weights(local_weights_file) # Load the downloaded pretrained weights
109 |
110 | # Do not retrain layers, i.e freeze them
111 | for layer in pre_trained_model.layers:
112 | layer.trainable = False
113 |
114 | # pre_trained_model.summary()
115 |
116 | # Grab the mixed7 layer from inception, and take its output
117 | last_layer = pre_trained_model.get_layer('mixed7')
118 | print('last layer output shape: ', last_layer.output_shape)
119 | last_output = last_layer.output
120 |
121 | # Now, you'll need to add your own DNN at the bottom of these, which you can retrain to your data
122 | x = layers.Flatten()(last_output)
123 | x = layers.Dense(1024, activation='relu')(x)
124 | x = layers.Dropout(0.2)(x) # Drop out 20% of neurons
125 | x = layers.Dense(1, activation='sigmoid')(x)
126 |
127 | # Create model using 'Model' abstract class
128 | model = Model(pre_trained_model.input, x)
129 | model.compile(optimizer=RMSprop(lr=.0001),
130 | loss='binary_crossentropy',
131 | metrics=['accuracy'])
132 |
133 | train_datagen = ImageDataGenerator(...)
134 | train_generator = train_datagen.flow_from_directory(...)
135 | history = model.fit_generator(...)
136 | ```
137 | > The idea behind **Dropouts** is that they **remove a random number of neurons** in your neural network. This works very well for two reasons: The first is that neighboring neurons often end up with similar weights, which can lead to overfitting, so dropping some out at random can remove this. The second is that often a neuron can over-weigh the input from a neuron in the previous layer, and can over specialize as a result. Thus, dropping out can break the neural network out of this potential bad habit!
138 |
139 | ## Multiclass Classifications
140 | - Computer-generated imagery (CGI) can help you build a dataset. Imagine you are building a project that detects rock, paper, scissors (💎, 📄, ✂️) during a game: you need lots of images of hands of all sizes and skin tones, from both men and women.
141 | - http://www.laurencemoroney.com/rock-paper-scissors-dataset/
142 |
146 | - Change to `class_mode='categorical'` in flow_from_directory(), the output Dense layer to `activation='softmax'`, and the loss function in model.compile to `loss='categorical_crossentropy'` (see the sketch below)
147 | - flow_from_directory() assigns labels in alphabetical order of the class folders. For example, with the classes [paper, rock, scissors], a rock image should produce the one-hot output [0, 1, 0].
148 |
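A minimal sketch of those multiclass changes (the `/tmp/rps/` directory path and the layer sizes are illustrative, not from the course notebook):

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)

# class_mode='categorical' yields one-hot labels, ordered alphabetically by folder name,
# e.g. [paper, rock, scissors]
train_generator = train_datagen.flow_from_directory(
    '/tmp/rps/',  # hypothetical directory with one sub-folder per class
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical')

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')  # one output neuron per class
])

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

history = model.fit(train_generator, epochs=15)
```
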
149 | ## Notes
150 | - Can you use image augmentation with transfer learning?
151 | > Yes. Only the pre-trained layers are frozen, so you can still augment your images as you train the DNN layers that you added on top of them.
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/0.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/0.jpg
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/1.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/2.jpg
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/3.jpg
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/fibonacci.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/fibonacci.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/fp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/fp.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/fp2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/fp2.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/lstm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/lstm.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/lstm2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/lstm2.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/metrics.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/metrics.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/ml_architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/ml_architecture.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/rfp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/rfp.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/rnn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/rnn.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/rnn2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/rnn2.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/seasonality.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/seasonality.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/tf_datasets.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/tf_datasets.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/trend.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/trend.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/ts.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/ts.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/tsn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/tsn.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/img/word_embeddings.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/word_embeddings.png
--------------------------------------------------------------------------------
/tensorflow-in-practice/introduction-to-tensorflow-for-ai.md:
--------------------------------------------------------------------------------
1 | # [Introduction to TensorFlow for Artificial Intelligence, Machine Learning, and Deep Learning](https://www.coursera.org/learn/introduction-tensorflow/home/welcome)
2 |
3 | ## Contents:
4 | - Week 1 - [A new programming paradigm](#A-new-programming-paradigm)
5 | - Week 2 - [Introduction to Computer Vision](#Introduction-to-Computer-Vision)
6 | - Week 3 - [Convolutional Neural Networks](#Convolutional-Neural-Networks)
7 | - Week 4 - [Using real-world images](#Using-Real-world-Images)
8 |
9 | > `!pip install tensorflow==2.0.0-alpha0` run it to use TensorFlow 2.x in Google Colab
10 |
11 | > The notebooks you can work with: https://drive.google.com/drive/folders/1R4bIjns1qRcTNkltbO9NOi7jgnrM-VLg?usp=sharing
12 |
13 | ## A new programming paradigm
14 | > [Notebook](notebooks/Course_1_Part_2_Lesson_2_Notebook.ipynb)
15 |
16 | ### A primer in machine learning
17 |
18 |
19 | ### The ‘Hello World’ of neural networks
20 | ```python
21 | from tensorflow import keras
22 | from tensorflow.keras import layers
23 | import numpy as np
24 | 
25 | model = keras.Sequential([layers.Dense(units=1, input_shape=[1])])
26 | model.compile(optimizer='sgd', loss='mean_squared_error') # The optimizer makes the guesses, the loss function measures how good or bad each guess is
27 |
28 | # Imagine you have lots of Xs and Ys, and the computer doesn't know the relationship between them. Your algorithm tries to connect Xs to Ys (it makes guesses). The loss function looks at the predicted outputs and the actual outputs and measures how good or bad the guess was. It then passes that value to the optimizer, which figures out the next guess (updates the parameters). So the optimizer decides how to improve based on the data coming from the loss function.
29 |
30 | xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
31 | ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)
32 |
33 | model.fit(xs, ys, epochs=500) # Training
34 |
35 | print(model.predict([10.0])) # You might expect exactly 19 because y = 2x - 1, but the prediction will only be very close to 19
36 | ```
37 |
38 | ## Introduction to Computer Vision
39 | > [Notebook](notebooks/Course_1_Part_4_Lesson_2_Notebook.ipynb)
40 |
41 | > https://github.com/zalandoresearch/fashion-mnist 70K images
42 |
43 | ```python
44 | import tensorflow as tf
45 | import numpy as np
46 | import matplotlib.pyplot as plt # plt.imshow(training_images[0])
47 | print(tf.__version__)
48 |
49 | # Loading the dataset
50 | mnist = tf.keras.datasets.fashion_mnist
51 | (training_images, training_labels), (test_images, test_labels) = mnist.load_data()
52 | print(training_images.shape)
53 | print(test_images.shape)
54 |
55 | # Normalizing
56 | training_images = training_images / 255.0
57 | test_images = test_images / 255.0
58 |
59 | # Building the model
60 | model = tf.keras.models.Sequential([tf.keras.layers.Flatten(),
61 | tf.keras.layers.Dense(1024, activation=tf.nn.relu),
62 | tf.keras.layers.Dense(10, activation=tf.nn.softmax)])
63 |
64 | # Defining the model, optimizer=tf.optimizers.Adam()
65 | model.compile(optimizer='adam',
66 | loss='sparse_categorical_crossentropy',
67 | metrics=['accuracy'])
68 |
69 | model.fit(training_images, training_labels, epochs=5) # Training the model, i.e. fitting training data to training labels
70 |
71 | model.evaluate(test_images, test_labels)
72 |
73 | classifications = model.predict(test_images) # Predict for new values
74 |
75 | print(">> Predicted label:", classifications[0])
76 | print(">> Actual label:", test_labels[0])
77 |
78 | ```
79 | - Notes:
80 |     - **Sequential**: That defines a SEQUENCE of layers in the neural network
81 |     - **Flatten**: Flatten just takes the input and turns it into a 1-dimensional array, row by row
82 |     - **Dense**: Adds a layer of neurons. Each layer of neurons needs an 'activation function' to tell them what to do. There are lots of options, but just use these for now.
83 |     - **Relu** effectively means "If X>0 return X, else return 0" -- so it only passes values 0 or greater to the next layer in the network.
84 |     - **Softmax** takes a set of values and effectively picks the biggest one, so, for example, if the output of the last layer looks like [0.1, 0.1, 0.05, 0.1, 9.5, 0.1, 0.05, 0.05, 0.05], it saves you from fishing through it looking for the biggest value and turns it into [0,0,0,0,1,0,0,0,0] -- the goal is to save a lot of coding! (see the numeric sketch below)
85 | - https://stackoverflow.com/questions/44176982/how-does-the-flatten-layer-work-in-keras
86 |
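A tiny numeric sketch (not from the course notebook) of what those two activations do to a vector of scores:

```python
import tensorflow as tf

scores = tf.constant([-2.0, 0.5, 3.0])

print(tf.nn.relu(scores).numpy())     # [0.  0.5 3. ]  -> negatives are clipped to 0
print(tf.nn.softmax(scores).numpy())  # ~[0.006 0.075 0.918] -> probabilities that sum to 1
print(tf.argmax(scores).numpy())      # 2 -> index of the biggest value, i.e. the predicted class
```
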
87 | ```python
88 | # What if you want to stop training when you reached the accuracy needed
89 | import tensorflow as tf
90 |
91 | class myCallback(tf.keras.callbacks.Callback):
92 | def on_epoch_end(self, epoch, logs={}):
93 | if(logs.get('accuracy')>0.6):
94 | print("\nReached 60% accuracy so cancelling training!")
95 | self.model.stop_training = True
96 |
97 | mnist = tf.keras.datasets.fashion_mnist
98 |
99 | (x_train, y_train),(x_test, y_test) = mnist.load_data()
100 | x_train, x_test = x_train / 255.0, x_test / 255.0
101 |
102 | callbacks = myCallback() # Creating the callback
103 |
104 | model = tf.keras.models.Sequential([
105 | tf.keras.layers.Flatten(input_shape=(28, 28)),
106 | tf.keras.layers.Dense(512, activation=tf.nn.relu),
107 | tf.keras.layers.Dense(10, activation=tf.nn.softmax)
108 | ])
109 | model.compile(optimizer=tf.optimizers.Adam(),
110 | loss='sparse_categorical_crossentropy',
111 | metrics=['accuracy'])
112 |
113 | model.fit(x_train, y_train, epochs=10, callbacks=[callbacks]) # You need to add callbacks argument
114 | ```
115 |
116 | ## Convolutional Neural Networks
117 | > [Notebook](notebooks/Course_1_Part_6_Lesson_2_Notebook.ipynb)
118 |
119 | > https://github.com/Rustam-Z/deep-learning-notes/tree/main/Course%204%20Convolutional%20Neural%20Networks
120 |
121 | **Types of layers in a convolutional network:**
122 | - Convolution (CONV) - A technique to isolate features in images
123 | - We need to know the filter size, padding (borders - valid, same), striding (jumps)
124 | - Pooling (POOL) - A technique to reduce the information in an image while maintaining features
125 | - Max pooling, average pooling
126 | - Fully connected (FC)
127 |
128 | - Formula for the output size of a convolution: [(n + 2p - f) / s] + 1
129 | - Formula for the number of parameters in a convolution layer: (f * f * PREVIOUS_CHANNELS + 1) * FILTERS (see the worked example below)
130 |
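A quick numeric check of both formulas for the first Conv2D(64, (3, 3)) layer of the model below, applied to a 28x28x1 input with stride 1 and no padding:

```python
# Output size: [(n + 2p - f) / s] + 1
n, p, f, s = 28, 0, 3, 1
print((n + 2 * p - f) // s + 1)   # 26 -> the first feature map is 26x26

# Parameter count: (f * f * previous_channels + 1) * filters
prev_channels, filters = 1, 64
print((f * f * prev_channels + 1) * filters)  # 640, the figure model.summary() reports for that layer
```
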
131 | - https://lodev.org/cgtutor/filtering.html • https://colab.research.google.com/drive/1EiNdAW4gtrObrBSAuuxIt_AqO_Eft491#scrollTo=kDHjf-ehaBqm
132 |
133 | ```python
134 | # Model architecture
135 | model = tf.keras.models.Sequential([
136 | tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
137 | tf.keras.layers.MaxPooling2D(2, 2),
138 | tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
139 | tf.keras.layers.MaxPooling2D(2, 2),
140 | tf.keras.layers.Flatten(),
141 | tf.keras.layers.Dense(128, activation='relu'),
142 | tf.keras.layers.Dense(10, activation='softmax')
143 | ])
144 |
145 | model.summary() # To have a look to the architecture of model
146 | ```
147 |
148 | ```python
149 | import tensorflow as tf
150 | print(tf.__version__)
151 |
152 | mnist = tf.keras.datasets.fashion_mnist
153 | (training_images, training_labels), (test_images, test_labels) = mnist.load_data()
154 |
155 | training_images=training_images.reshape(60000, 28, 28, 1)
156 | training_images=training_images / 255.0
157 | test_images = test_images.reshape(10000, 28, 28, 1)
158 | test_images=test_images/255.0
159 |
160 | model = tf.keras.models.Sequential([
161 | tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
162 | tf.keras.layers.MaxPooling2D(2, 2),
163 | tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
164 | tf.keras.layers.MaxPooling2D(2,2),
165 |     tf.keras.layers.Flatten(), tf.keras.layers.Dense(128, activation='relu'), tf.keras.layers.Dense(10, activation='softmax')
166 | ])
167 | model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
168 | model.summary()
169 | model.fit(training_images, training_labels, epochs=10)
170 | test_loss = model.evaluate(test_images, test_labels)
171 | ```
172 |
173 | ```python
174 | # This code will show us the convolutions graphically
175 |
176 | import matplotlib.pyplot as plt
177 | from tensorflow.keras import models
178 |
179 | f, axarr = plt.subplots(3,4)
180 | FIRST_IMAGE=0
181 | SECOND_IMAGE=23
182 | THIRD_IMAGE=28
183 | CONVOLUTION_NUMBER = 3
184 |
185 | layer_outputs = [layer.output for layer in model.layers]
186 | activation_model = tf.keras.models.Model(inputs = model.input, outputs = layer_outputs)
187 |
188 | for x in range(0,4):
189 | f1 = activation_model.predict(test_images[FIRST_IMAGE].reshape(1, 28, 28, 1))[x]
190 | axarr[0,x].imshow(f1[0, : , :, CONVOLUTION_NUMBER], cmap='inferno')
191 | axarr[0,x].grid(False)
192 | f2 = activation_model.predict(test_images[SECOND_IMAGE].reshape(1, 28, 28, 1))[x]
193 | axarr[1,x].imshow(f2[0, : , :, CONVOLUTION_NUMBER], cmap='inferno')
194 | axarr[1,x].grid(False)
195 | f3 = activation_model.predict(test_images[THIRD_IMAGE].reshape(1, 28, 28, 1))[x]
196 | axarr[2,x].imshow(f3[0, : , :, CONVOLUTION_NUMBER], cmap='inferno')
197 | axarr[2,x].grid(False)
198 | ```
199 |
200 | ## Using Real-world Images
201 | > [Nobebook](notebooks/Course_1_Part_8_Lesson_2_Notebook.ipynb)
202 |
203 | ```python
204 | # An ImageGenerator can flow images from a directory and perform operations such as resizing them on the fly
205 | import tensorflow as tf
206 | from tensorflow.keras.preprocessing.image import ImageDataGenerator
207 | from tensorflow.keras.optimizers import RMSprop
208 |
209 | # All images will be rescaled by 1./255
210 | train_datagen = ImageDataGenerator(rescale=1/255)
211 |
212 | # Flow training images in batches of 128 using train_datagen generator
213 | train_generator = train_datagen.flow_from_directory(
214 | '/tmp/horse-or-human/', # This is the source directory for training images
215 |         target_size=(300, 300), # All images will be resized to 300x300
216 | batch_size=128,
217 | # Since we use binary_crossentropy loss, we need binary labels
218 | class_mode='binary')
219 |
220 | validation_generator = train_datagen.flow_from_directory(
221 | validation_dir,
222 | target_size=(300, 300),
223 | batch_size=32,
224 | class_mode='binary',
225 | )
226 |
227 | model.compile(loss='binary_crossentropy',
228 | optimizer=RMSprop(lr=0.001),
229 | metrics=['accuracy'])
230 |
231 | history = model.fit_generator(
232 |       train_generator, # streams images from the directory
233 | steps_per_epoch=8, # 1024 images overall, so 128*8=1024, 128 is the batch size of train_generator
234 | epochs=15,
235 | validation_data=validation_generator,
236 | validation_steps=8, # 256 images, so 32*8=256, 32 is the batch size of validation_generator
237 | verbose=2 # for info
238 | )
239 | ```
240 | ```python
241 | import numpy as np
242 | from google.colab import files
243 | from keras.preprocessing import image
244 |
245 | uploaded = files.upload()
246 |
247 | for fn in uploaded.keys():
248 | # Predicting images
249 | path = "/content/" + fn
250 | img = image.load_img(path, target_size=(300, 300))
251 | x = image.img_to_array(img)
252 | x = np.expand_dims(x, axis=0)
253 |
254 | images = np.vstack([x])
255 | classes = model.predict(images, batch_size=10)
256 | print(classes[0])
257 |
258 | if classes[0] > 0.5:
259 |     print(fn + " is a human")
260 | else:
261 |     print(fn + " is a horse")
262 | ```
--------------------------------------------------------------------------------
/tensorflow-in-practice/natural-language-processing-tensorflow.md:
--------------------------------------------------------------------------------
1 | # [Natural Language Processing in TensorFlow](https://www.coursera.org/learn/natural-language-processing-tensorflow/home/welcome)
2 |
3 | - Week 1: How to convert the text into number representation, Tokenizer, fit_on_texts, texts_to_sequences, pad_sequences
4 | - Week 2: Word Embeddings - Classification problems
5 | - Week 3: Sequence models - RNN, LSTM, classification problems
6 | - Week 4: Sequence models and literature - text generation
7 |
8 | - Week 1 - [Sentiment in text](#Sentiment-in-text)
9 | - Week 2 - [Word Embeddings](#Word-Embeddings)
10 | - Week 3 - [Sequence models](#Sequence-models)
11 | - Week 4 - [Sequence models and literature](#Sequence-models-and-literature)
12 |
13 | ## Sentiment in text
14 | > [Week 1 Notebook](notebooks/Course_3_Week_1(Tokenizer-Sarcasm-Dataset).ipynb)
15 |
16 | - How to load in the text, pre-process it, and set up your data so it can be fed to a neural network.
17 | - https://rishabhmisra.github.io/publications/
18 | - `Tokenizer` is used to tokenize the sentences; `oov_token="<OOV>"` can be used to encode unknown (out-of-vocabulary) words
19 | - `fit_on_texts(sentences)` is used to tokenize the list of sentences
20 | - Output: `{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}`
21 | - `texts_to_sequences(sentences)` - the method to encode a list of sentences to use those tokens
22 | - Output: `[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]`
23 |
24 | ```py
25 | tokenizer = Tokenizer(oov_token="<OOV>")
26 | tokenizer.fit_on_texts(sentences)
27 | word_index = tokenizer.word_index
28 | sequences = tokenizer.texts_to_sequences(sentences)
29 | padded = pad_sequences(sequences, padding='post')
30 | ```
31 |
32 | ## Word Embeddings
33 | > [Week 2 Model Training IMDB Reviews](notebooks/Course_3_Week_2(Model_Training_IMDB_Reviews).ipynb)
34 |
35 | > [Week 2, beautiful code, Sarcasm Classifier](notebooks/Course_3_Week_2(Sarcasm-Classifier).ipynb)
36 |
37 | > [Week 2, subwords](notebooks/Course_3_Week_2(Subwords).ipynb) - shows that embeddings alone do not capture the order of the (sub)words
38 |
39 | 
40 |
41 | - In the second week, we learn to prepare the data with Tokenizer API, and then teach our model
42 | - TensorFlow Datasets: https://www.tensorflow.org/datasets
43 | 
44 | - https://github.com/tensorflow/datasets/tree/master/docs/catalog
45 | - https://projector.tensorflow.org - to visualize the data
46 |
47 | - **What is the purpose of the embedding dimension?**
48 | > It is the number of dimensions for the **vector representing** the word encoding
49 |
50 | - When tokenizing a corpus, what does the num_words=n parameter do?
51 | > It specifies the maximum number of words to be tokenized, and picks the most common ‘n’ words
52 |
53 | - NOTE: Sequence becomes much more important when dealing with subwords, but we’re ignoring word positions.
54 |
55 | - The Embedding layer must specify 3 arguments ([reference](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/); see the sketch after this list):
56 |
57 | - **input_dim**: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-999, then the size of the vocabulary would be 1000 words. (all words)
58 | - **output_dim**: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
59 | - **input_length**: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 100 words, this would be 100. (words in a sentence)
60 |
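A minimal sketch of an Embedding layer using those three arguments (the concrete numbers are illustrative, not taken from the course notebook):

```python
import tensorflow as tf

vocab_size = 1000    # input_dim: words are integer-encoded in the range 0-999
embedding_dim = 16   # output_dim: each word becomes a 16-dimensional vector
max_length = 100     # input_length: every padded sentence is 100 tokens long

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),  # average the 100 word vectors into a single vector
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # the Embedding layer alone holds vocab_size * embedding_dim = 16,000 weights
```
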
61 | ```py
62 | def plot_graphs(history, string):
63 | plt.plot(history.history[string])
64 | plt.plot(history.history['val_'+string])
65 |
66 | plt.xlabel("Epochs")
67 | plt.ylabel(string)
68 | plt.legend([string, 'val_'+string])
69 | plt.show()
70 |
71 | plot_graphs(history, "accuracy")
72 | plot_graphs(history, "loss")
73 | ```
74 |
75 | ## Sequence models
76 | > [Week 3 IMDB](notebooks/Course_3_Week_3(IMDB).ipynb) - RNN, Embedding, Conv 1D experimenting
77 |
78 | > We looked first at Tokenizing words to get numeric values from them, and then using Embeddings to group words of similar meaning depending on how they were labelled. This gave you a good, but rough, sentiment analysis -- words such as 'fun' and 'entertaining' might show up in a positive movie review, and 'boring' and 'dull' might show up in a negative one. But sentiment can also be determined by the sequence in which words appear. For example, you could have 'not fun', which of course is the opposite of 'fun'. This week you'll start digging into a variety of model formats that are used in training models to understand context in sequence!
79 |
80 | - We used **word embeddings** to capture the sentiment of individual words. But we can also use RNNs and LSTMs to take word order into account, analysing the relative order in which words appear.
81 |
82 | - 

That's classical ML: it doesn't take the sequence into account. With a **Fibonacci series**, for example, we must feed the previous result back in as part of the next input.
83 |
84 | - 

That's the idea behind an RNN (recurrent neural network): the output of the previous step is fed in as part of the input to the next step.
85 |
86 | -

**LSTMs** carry an additional pipeline of context called the cell state. They can also be made bidirectional.
87 |
88 | - RNN, LSTM [video](https://www.youtube.com/watch?v=WCUNPb-5EYI)
89 | - GRU - Gated Recurrent Unit: `tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64))`
90 |
91 | ```python
92 | """LSTM in code"""
93 | model = tf.keras.Sequential([
94 | tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
95 | tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)), # `return_sequences=True` goes on the inner LSTM when stacking two LSTMs
96 | # tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64)),  # GRU alternative
97 | tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
98 | tf.keras.layers.Dense(64, activation='relu'),
99 | tf.keras.layers.Dense(1, activation='sigmoid'),
100 | ])
101 | ```
102 | ```python
103 | """Using a convolutional network 1D"""
104 | model = tf.keras.Sequential([
105 | tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
106 | tf.keras.layers.Conv1D(128, 5, activation='relu'),
107 | tf.keras.layers.GlobalAveragePooling1D(),
108 | tf.keras.layers.Dense(64, activation='relu'),
109 | tf.keras.layers.Dense(1, activation='sigmoid')
110 | ])
111 | ```
112 | ```python
113 | model = tf.keras.Sequential([
114 | tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length), # weights=[embeddings_matrix], trainable=False
115 | tf.keras.layers.Dropout(0.2),
116 | tf.keras.layers.Conv1D(64, 5, activation='relu'),
117 | tf.keras.layers.MaxPooling1D(pool_size=4),
118 | tf.keras.layers.LSTM(64),
119 | tf.keras.layers.Dense(1, activation='sigmoid')
120 | ])
121 | model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
122 | model.summary()
123 | 
124 | num_epochs = 50
history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels))  # variable names are placeholders for the padded data and labels from the exercise
125 | ```
126 |
127 | ## Sequence models and literature
128 | > **Text generation**
129 |
130 | > [Week 4 Shakespeare Text Generation](notebooks/Course_3_Week_4_Lesson_1_(Sheckspire_Text_Generation).ipynb)
131 |
132 | > Wrap up from course: You’ve been experimenting with NLP for text classification over the last few weeks. Next week you’ll switch gears -- and take a look at using the tools that you’ve learned to predict text, which ultimately means you can create text. By learning sequences of words you can predict the most common word that comes next in the sequence, and thus, when starting from a new sequence of words you can create a model that builds on them. You’ll take different training sets -- like traditional Irish songs, or Shakespeare poetry, and learn how to create new sets of words using their embeddings!
133 |
134 | - **Finding what the next word should be**
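
A small sketch of the data preparation used for next-word prediction, assuming a toy corpus (the actual notebooks use the Irish-lyrics and Shakespeare texts): every line is turned into n-gram prefixes, padded to a common length, and the last token of each prefix becomes the label to predict.

```py
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

corpus = ["in the town of athy one jeremy lanigan",
          "battered away til he hadnt a pound"]      # toy stand-in for the lyrics file

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i + 1])   # n-gram prefix of length i+1

max_sequence_len = max(len(seq) for seq in input_sequences)
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

xs = input_sequences[:, :-1]                         # everything but the last token
labels = input_sequences[:, -1]                      # the next word to predict
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
```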
135 |
136 |
--------------------------------------------------------------------------------
/tensorflow-in-practice/notebooks/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/notebooks/.DS_Store
--------------------------------------------------------------------------------
/tensorflow-in-practice/notebooks/Course_3_Week_2(Model_Training_IMDB_Reviews).ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "language_info": {
4 | "codemirror_mode": {
5 | "name": "ipython",
6 | "version": 3
7 | },
8 | "file_extension": ".py",
9 | "mimetype": "text/x-python",
10 | "name": "python",
11 | "nbconvert_exporter": "python",
12 | "pygments_lexer": "ipython3",
13 | "version": "3.8.6"
14 | },
15 | "orig_nbformat": 2,
16 | "kernelspec": {
17 | "name": "python3",
18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)"
19 | },
20 | "metadata": {
21 | "interpreter": {
22 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a"
23 | }
24 | },
25 | "interpreter": {
26 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a"
27 | }
28 | },
29 | "nbformat": 4,
30 | "nbformat_minor": 2,
31 | "cells": [
32 | {
33 | "cell_type": "code",
34 | "execution_count": 1,
35 | "metadata": {},
36 | "outputs": [
37 | {
38 | "output_type": "stream",
39 | "name": "stdout",
40 | "text": [
41 | "2.4.0-rc0\n"
42 | ]
43 | }
44 | ],
45 | "source": [
46 | "import tensorflow as tf \n",
47 | "import tensorflow_datasets as tfds\n",
48 | "from tensorflow.keras.preprocessing.text import Tokenizer\n",
49 | "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
50 | "import numpy as np \n",
51 | "import io\n",
52 | "\n",
53 | "print(tf.__version__)"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 2,
59 | "metadata": {},
60 | "outputs": [
61 | {
62 | "output_type": "execute_result",
63 | "data": {
64 | "text/plain": [
65 | "True"
66 | ]
67 | },
68 | "metadata": {},
69 | "execution_count": 2
70 | }
71 | ],
72 | "source": [
73 | "tf.executing_eagerly() # if 1.x use `tf.enable_eager_execution()`"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 3,
79 | "metadata": {
80 | "tags": []
81 | },
82 | "outputs": [],
83 | "source": [
84 | "imdb, info = tfds.load(\"imdb_reviews\", with_info=True, as_supervised=True) # loading the data"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 5,
90 | "metadata": {},
91 | "outputs": [
92 | {
93 | "output_type": "execute_result",
94 | "data": {
95 | "text/plain": [
96 | "['abstract_reasoning', 'accentdb', 'aeslc', 'aflw2k3d', 'ag_news_subset']"
97 | ]
98 | },
99 | "metadata": {},
100 | "execution_count": 5
101 | }
102 | ],
103 | "source": [
104 | "tfds.list_builders()[:5] # the list of all datasets"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": 7,
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "train_data, test_data = imdb['train'], imdb['test'] # 25k train and 25k testing\n",
114 | "\n",
115 | "training_sentences = []\n",
116 | "training_labels = []\n",
117 | "testing_sentences = []\n",
118 | "testing_labels = []\n",
119 | "\n",
120 | "for sample, label in train_data:\n",
121 | " training_sentences.append(sample.numpy().decode('utf8'))\n",
122 | " training_labels.append(label.numpy())\n",
123 | "\n",
124 | "for sample, label in test_data:\n",
125 | " testing_sentences.append(sample.numpy().decode('utf8'))\n",
126 | " testing_labels.append(label.numpy())"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": 8,
132 | "metadata": {},
133 | "outputs": [
134 | {
135 | "output_type": "stream",
136 | "name": "stdout",
137 | "text": [
138 | "I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.\n>> label 0\n"
139 | ]
140 | }
141 | ],
142 | "source": [
143 | "print(training_sentences[1]) \n",
144 | "print(\">> label\", training_labels[1]) # 0 negative, 1 pos"
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": 9,
150 | "metadata": {},
151 | "outputs": [
152 | {
153 | "output_type": "stream",
154 | "name": "stdout",
155 | "text": [
156 | "25000\n25000\n25000\n25000\n"
157 | ]
158 | }
159 | ],
160 | "source": [
161 | "print(len(training_sentences))\n",
162 | "print(len(training_labels))\n",
163 | "print(len(testing_sentences))\n",
164 | "print(len(testing_labels))"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 10,
170 | "metadata": {},
171 | "outputs": [],
172 | "source": [
173 | "# converting to numpy arrays\n",
174 | "training_labels_final = np.array(training_labels) \n",
175 | "testing_labels_final = np.array(testing_labels)"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 11,
181 | "metadata": {},
182 | "outputs": [
183 | {
184 | "output_type": "execute_result",
185 | "data": {
186 | "text/plain": [
187 | "(25000,)"
188 | ]
189 | },
190 | "metadata": {},
191 | "execution_count": 11
192 | }
193 | ],
194 | "source": [
195 | "training_labels_final.shape"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 12,
201 | "metadata": {},
202 | "outputs": [
203 | {
204 | "output_type": "execute_result",
205 | "data": {
206 | "text/plain": [
207 | "(25000, 120)"
208 | ]
209 | },
210 | "metadata": {},
211 | "execution_count": 12
212 | }
213 | ],
214 | "source": [
215 | "# Preparing data for training by tokenizing\n",
216 | "\n",
217 | "vocab_size = 10000\n",
218 | "embedding_dim = 16\n",
219 | "max_length = 120\n",
220 | "trunc_type='post' # [4, 4, 5, 6, ..... 0, 0, 0] - zeros at the end \n",
221 | "oov_tok = \"\" # out of vocabulary\n",
222 | "\n",
223 | "tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)\n",
224 | "tokenizer.fit_on_texts(training_sentences)\n",
225 | "word_index = tokenizer.word_index # all 10000 words with tokens in a dictionary \n",
226 | "sequences = tokenizer.texts_to_sequences(training_sentences) # all sentences represented only with tokens\n",
227 | "padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type) # make all sentences the same size\n",
228 | "\n",
229 | "# the same for testing set\n",
230 | "testing_sequences = tokenizer.texts_to_sequences(testing_sentences)\n",
231 | "testing_padded = pad_sequences(testing_sequences, maxlen=max_length)\n",
232 | "\n",
233 | "padded.shape"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": null,
239 | "metadata": {},
240 | "outputs": [],
241 | "source": []
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": 89,
246 | "metadata": {},
247 | "outputs": [
248 | {
249 | "output_type": "stream",
250 | "name": "stdout",
251 | "text": [
252 | "? ? ? ? ? ? ? ? i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly slow and boring things seemed to happen but with no explanation of what was causing them or why i admit i may have missed part of the film but i watched the majority of it and everything just seemed to happen of its own without any real concern for anything else i cant recommend this film at all\n\nI have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.\n"
253 | ]
254 | }
255 | ],
256 | "source": [
257 | "reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])\n",
258 | "\n",
259 | "def decode_review(text):\n",
260 | " return ' '.join([reverse_word_index.get(i, '?') for i in text])\n",
261 | "\n",
262 | "print(decode_review(padded[1]))\n",
263 | "print()\n",
264 | "print(training_sentences[1])"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": 64,
270 | "metadata": {},
271 | "outputs": [
272 | {
273 | "output_type": "stream",
274 | "name": "stdout",
275 | "text": [
276 | "I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.\n>> original length 617\n>> label 0\n\n[11, 26, 75, 571, 6, 805, 2354, 313, 106, 19, 12, 7, 629, 686, 6, 4, 2219, 5, 181, 584, 64, 1454, 110, 2263, 3, 3951, 21, 2, 1, 3, 258, 41, 4677, 4, 174, 188, 21, 12, 4078, 11, 1578, 2354, 86, 2, 20, 14, 1907, 2, 112, 940, 14, 1811, 1340, 548, 3, 355, 181, 466, 6, 591, 19, 17, 55, 1817, 5, 49, 14, 4044, 96, 40, 136, 11, 972, 11, 201, 26, 1046, 171, 5, 2, 20, 19, 11, 294, 2, 2155, 5, 10, 3, 283, 41, 466, 6, 591, 5, 92, 203, 1, 207, 99, 145, 4382, 16, 230, 332, 11, 2486, 384, 12, 20, 31, 30]\n>> sequence lenght 112\n\n[ 0 0 0 0 0 0 0 0 11 26 75 571 6 805\n 2354 313 106 19 12 7 629 686 6 4 2219 5 181 584\n 64 1454 110 2263 3 3951 21 2 1 3 258 41 4677 4\n 174 188 21 12 4078 11 1578 2354 86 2 20 14 1907 2\n 112 940 14 1811 1340 548 3 355 181 466 6 591 19 17\n 55 1817 5 49 14 4044 96 40 136 11 972 11 201 26\n 1046 171 5 2 20 19 11 294 2 2155 5 10 3 283\n 41 466 6 591 5 92 203 1 207 99 145 4382 16 230\n 332 11 2486 384 12 20 31 30]\n"
277 | ]
278 | },
279 | {
280 | "output_type": "execute_result",
281 | "data": {
282 | "text/plain": [
283 | "(120,)"
284 | ]
285 | },
286 | "metadata": {},
287 | "execution_count": 64
288 | }
289 | ],
290 | "source": [
291 | "print(training_sentences[1]) \n",
292 | "print(\">> original length\", len(training_sentences[1]))\n",
293 | "print(\">> label\", training_labels[1])\n",
294 | "\n",
295 | "print()\n",
296 | "print(sequences[1])\n",
297 | "print(\">> sequence lenght\", len(sequences[1]))\n",
298 | "print()\n",
299 | "print(padded[1])\n",
300 | "padded[1].shape"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 56,
306 | "metadata": {},
307 | "outputs": [
308 | {
309 | "output_type": "execute_result",
310 | "data": {
311 | "text/plain": [
312 | "'bintang'"
313 | ]
314 | },
315 | "metadata": {},
316 | "execution_count": 56
317 | }
318 | ],
319 | "source": [
320 | "# len(list(word_index)) # 90000 appr\n",
321 | "list(word_index)[57565] # even we defined vocab_size = 10000, tensorflow tokenizes all words, but in backed end it will work with 10000 words, \n",
322 | "# num_words=n parameter specifies the maximum number of words to be tokenized, and picks the most common ‘n’ words"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": 71,
328 | "metadata": {},
329 | "outputs": [
330 | {
331 | "output_type": "stream",
332 | "name": "stdout",
333 | "text": [
334 | "Model: \"sequential_2\"\n_________________________________________________________________\nLayer (type) Output Shape Param # \n=================================================================\nembedding_2 (Embedding) (None, 120, 16) 160000 \n_________________________________________________________________\nflatten_2 (Flatten) (None, 1920) 0 \n_________________________________________________________________\ndense_4 (Dense) (None, 6) 11526 \n_________________________________________________________________\ndense_5 (Dense) (None, 1) 7 \n=================================================================\nTotal params: 171,533\nTrainable params: 171,533\nNon-trainable params: 0\n_________________________________________________________________\n"
335 | ]
336 | }
337 | ],
338 | "source": [
339 | "model = tf.keras.Sequential([\n",
340 | " tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n",
341 | " tf.keras.layers.Flatten(), # GlobalAveragePooling1D()\n",
342 | " tf.keras.layers.Dense(6, activation='relu'),\n",
343 | " tf.keras.layers.Dense(1, activation='sigmoid')\n",
344 | "])\n",
345 | "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
346 | "model.summary()"
347 | ]
348 | },
349 | {
350 | "cell_type": "code",
351 | "execution_count": 106,
352 | "metadata": {},
353 | "outputs": [
354 | {
355 | "output_type": "stream",
356 | "name": "stdout",
357 | "text": [
358 | "Epoch 1/10\n",
359 | "782/782 [==============================] - 1s 1ms/step - loss: 9.7313e-05 - accuracy: 1.0000 - val_loss: 0.8354 - val_accuracy: 0.8314\n",
360 | "Epoch 2/10\n",
361 | "782/782 [==============================] - 1s 1ms/step - loss: 6.0164e-05 - accuracy: 1.0000 - val_loss: 0.8735 - val_accuracy: 0.8308\n",
362 | "Epoch 3/10\n",
363 | "782/782 [==============================] - 1s 1ms/step - loss: 3.7304e-05 - accuracy: 1.0000 - val_loss: 0.9050 - val_accuracy: 0.8318\n",
364 | "Epoch 4/10\n",
365 | "782/782 [==============================] - 1s 1ms/step - loss: 2.3330e-05 - accuracy: 1.0000 - val_loss: 0.9406 - val_accuracy: 0.8309\n",
366 | "Epoch 5/10\n",
367 | "782/782 [==============================] - 1s 1ms/step - loss: 1.5115e-05 - accuracy: 1.0000 - val_loss: 0.9730 - val_accuracy: 0.8313\n",
368 | "Epoch 6/10\n",
369 | "782/782 [==============================] - 1s 1ms/step - loss: 9.3207e-06 - accuracy: 1.0000 - val_loss: 1.0077 - val_accuracy: 0.8312\n",
370 | "Epoch 7/10\n",
371 | "782/782 [==============================] - 1s 1ms/step - loss: 6.1326e-06 - accuracy: 1.0000 - val_loss: 1.0429 - val_accuracy: 0.8307\n",
372 | "Epoch 8/10\n",
373 | "782/782 [==============================] - 1s 1ms/step - loss: 3.8306e-06 - accuracy: 1.0000 - val_loss: 1.0734 - val_accuracy: 0.8310\n",
374 | "Epoch 9/10\n",
375 | "782/782 [==============================] - 1s 1ms/step - loss: 2.4845e-06 - accuracy: 1.0000 - val_loss: 1.1086 - val_accuracy: 0.8311\n",
376 | "Epoch 10/10\n",
377 | "782/782 [==============================] - 1s 1ms/step - loss: 1.6163e-06 - accuracy: 1.0000 - val_loss: 1.1410 - val_accuracy: 0.8310\n"
378 | ]
379 | },
380 | {
381 | "output_type": "execute_result",
382 | "data": {
383 | "text/plain": [
384 | ""
385 | ]
386 | },
387 | "metadata": {},
388 | "execution_count": 106
389 | }
390 | ],
391 | "source": [
392 | "# Training own modelg\n",
393 | "\n",
394 | "num_epochs = 10\n",
395 | "model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))"
396 | ]
397 | },
398 | {
399 | "cell_type": "code",
400 | "execution_count": 76,
401 | "metadata": {},
402 | "outputs": [
403 | {
404 | "output_type": "execute_result",
405 | "data": {
406 | "text/plain": [
407 | "[,\n",
408 | " ,\n",
409 | " ,\n",
410 | " ]"
411 | ]
412 | },
413 | "metadata": {},
414 | "execution_count": 76
415 | }
416 | ],
417 | "source": [
418 | "e = model.layers\n",
419 | "e"
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": 79,
425 | "metadata": {},
426 | "outputs": [
427 | {
428 | "output_type": "stream",
429 | "name": "stdout",
430 | "text": [
431 | "(10000, 16)\n"
432 | ]
433 | }
434 | ],
435 | "source": [
436 | "e = model.layers[0]\n",
437 | "weights = e.get_weights()[0]\n",
438 | "print(weights.shape) # shape: (vocab_size, embedding_dim)"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": 87,
444 | "metadata": {},
445 | "outputs": [
446 | {
447 | "output_type": "execute_result",
448 | "data": {
449 | "text/plain": [
450 | "array([-0.08942658, 0.00486923, -0.05935808, -0.06226563, -0.04867279,\n",
451 | " 0.04237117, 0.04769849, 0.03356505, -0.03730453, 0.00785854,\n",
452 | " 0.03105144, 0.0776749 , 0.05284716, 0.025134 , -0.03554538,\n",
453 | " -0.04298926], dtype=float32)"
454 | ]
455 | },
456 | "metadata": {},
457 | "execution_count": 87
458 | }
459 | ],
460 | "source": [
461 | "weights[1] # each word has its own weight"
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": 86,
467 | "metadata": {
468 | "tags": []
469 | },
470 | "outputs": [
471 | {
472 | "output_type": "stream",
473 | "name": "stdout",
474 | "text": [
475 | ">> word 1 \n>> embeddings [-0.08942658 0.00486923 -0.05935808 -0.06226563 -0.04867279 0.04237117\n 0.04769849 0.03356505 -0.03730453 0.00785854 0.03105144 0.0776749\n 0.05284716 0.025134 -0.03554538 -0.04298926]\n>> word 2 the\n>> embeddings [-0.08670148 0.01641071 -0.02393427 -0.07146466 0.01603186 0.06126428\n 0.06148115 0.00766911 0.04187395 0.05556076 0.01930173 0.0744463\n 0.01907398 0.01339489 0.00941497 -0.0138381 ]\n>> word 3 and\n>> embeddings [ 0.01113727 -0.03538265 -0.05725451 -0.01636735 -0.00596739 -0.00635358\n 0.03053617 0.05559737 0.0871934 0.04494542 0.02274616 0.07229666\n 0.01994341 0.01223046 -0.05789011 -0.04256919]\n>> word 4 a\n>> embeddings [-0.05104827 -0.01813413 -0.04630557 -0.02343593 -0.03323779 0.06510878\n -0.00737528 0.02424134 0.0825871 0.00570629 -0.01472468 0.12047923\n 0.01702527 -0.04734353 -0.05681538 -0.06954415]\n"
476 | ]
477 | }
478 | ],
479 | "source": [
480 | "out_v = io.open('vecs.tsv', 'w', encoding='utf-8')\n",
481 | "out_m = io.open('meta.tsv', 'w', encoding='utf-8')\n",
482 | "\n",
483 | "for word_num in range(1, vocab_size):\n",
484 | " word = reverse_word_index[word_num]\n",
485 | " embeddings = weights[word_num]\n",
486 | " \n",
487 | " if word_num < 5:\n",
488 | " print(f\">> word {word_num}\", word)\n",
489 | " print(\">> embeddings\", embeddings)\n",
490 | "\n",
491 | " out_m.write(word + \"\\n\")\n",
492 | " out_v.write('\\t'.join([str(x) for x in embeddings]) + \"\\n\")\n",
493 | "out_v.close()\n",
494 | "out_m.close()"
495 | ]
496 | },
497 | {
498 | "cell_type": "code",
499 | "execution_count": 99,
500 | "metadata": {},
501 | "outputs": [
502 | {
503 | "output_type": "stream",
504 | "name": "stdout",
505 | "text": [
506 | "Please install GPU version of TF\n"
507 | ]
508 | }
509 | ],
510 | "source": [
511 | "if tf.test.gpu_device_name(): \n",
512 | " print('Default GPU Device:'.format(tf.test.gpu_device_name()))\n",
513 | "else:\n",
514 | " print(\"Please install GPU version of TF\")"
515 | ]
516 | },
517 | {
518 | "cell_type": "code",
519 | "execution_count": 102,
520 | "metadata": {},
521 | "outputs": [
522 | {
523 | "output_type": "stream",
524 | "name": "stdout",
525 | "text": [
526 | "[]\n"
527 | ]
528 | },
529 | {
530 | "output_type": "execute_result",
531 | "data": {
532 | "text/plain": [
533 | "[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]"
534 | ]
535 | },
536 | "metadata": {},
537 | "execution_count": 102
538 | }
539 | ],
540 | "source": [
541 | "print(tf.config.list_physical_devices('GPU'))\n",
542 | "tf.config.list_physical_devices()"
543 | ]
544 | }
545 | ]
546 | }
--------------------------------------------------------------------------------
/tensorflow-in-practice/notebooks/README.md:
--------------------------------------------------------------------------------
1 | # Highlighted Notebooks
2 |
3 | ### Course 1
4 | > [Fashion MNIST with CNN](Course_1_Part_6_Lesson_2_Notebook.ipynb)
5 |
6 | > [Human vs Horse, flow_from_directory()](Course_1_Part_8_Lesson_2_Notebook.ipynb)
7 |
8 | ### Course 2
9 | > [** Cat vs Dog, flow_from_directory(), drawings of loss and accuracy, predict on new image](Course_2_Part_2_Lesson_2_Notebook.ipynb)
10 |
11 | > [** With augmentation, good code collected in one cell, plots of accuracy and loss](Course_2_Part_4_Lesson_2_Notebook_(Cats_v_Dogs_Augmentation).ipynb)
12 |
13 | > [** Transfer Learning, Dropout](Course_2_Part_6_Lesson_3_Notebook_(Transfer_Learning).ipynb)
14 |
15 | ### Course 3
16 | > [** Word Embeddings with Tokenizer](Course_3_Week_2(Model_Training_IMDB_Reviews).ipynb) - classifying the reviews in IMDB
17 | > [** Beautiful code, classifying sarcastic news](Course_3_Week_2(Sarcasm-Classifier).ipynb)
18 |
--------------------------------------------------------------------------------
/tensorflow-in-practice/sequences-time-series-and-prediction.md:
--------------------------------------------------------------------------------
1 | # [Sequences, Time Series and Prediction](https://www.coursera.org/learn/tensorflow-sequences-time-series-and-prediction)
2 |
3 | - Sequences and Prediction
4 | - Deep Neural Networks for Time Series
5 | - Recurrent Neural Networks for Time Series
6 | - Real-world time series data
7 |
8 |
9 | ## Sequences and Prediction
10 | > Handling sequential time series data -- where values change over time, like the temperature on a particular day, stock prices, or the number of visitors to your web site.
11 |
12 | > Predicting future values in these time series. We need to find the pattern in the data in order to predict new values.
13 |
14 | - Time series analysis is also used in speech recognition
15 |
16 | - Types:
17 | - **Trend** - an overall upward (or downward) direction over time
18 | - **Seasonality**
19 | - Autocorrelation
20 | - Noise
21 | - Non-stationary time series
22 |
23 | - **Train, validation and test sets**
24 | - **Trend + Seasonality + Noise**
25 | - **Naive forecasting** - take the last value, and assume that the next will be the same
26 | - **Fixed partitioning** - if the data is seasonal, each period should contain a whole number of seasons (e.g. 1, 2 or 3 years). Train on the training period and evaluate on the validation period while tuning hyperparameters; then retrain on training + validation and evaluate on the test period; finally retrain on everything, including the test data.

27 | - **Roll-forward partitioning** - we start with a short training period and gradually increase it, say by one day or one week at a time. At each iteration, we train the model on the training period and use it to forecast the following day or week in the validation period.
28 |
29 | - **Metrics for evaluating performance** - MSE, RMSE, MAE and MAPE (see the sketch below)
30 | - ![Metrics](img/metrics.png)
31 |
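A small self-contained sketch tying these ideas together: generate a synthetic series as trend + seasonality + noise, make a naive forecast over the validation period, and score it with the usual error metrics (all parameter values are illustrative).

```py
import numpy as np

# synthetic series = trend + seasonality + noise (values are illustrative)
time = np.arange(4 * 365)
trend = 0.1 * time
seasonality = 10 * np.sin(2 * np.pi * time / 365)
noise = np.random.normal(scale=2.0, size=len(time))
series = 50 + trend + seasonality + noise

# fixed partitioning: everything before `split_time` is the training period
split_time = 3 * 365
x_valid = series[split_time:]

# naive forecast: predict that the next value equals the previous one
naive_forecast = series[split_time - 1:-1]

# metrics for evaluating performance
errors = naive_forecast - x_valid
mse = np.square(errors).mean()           # mean squared error
rmse = np.sqrt(mse)                      # root mean squared error (same units as the series)
mae = np.abs(errors).mean()              # mean absolute error
mape = np.abs(errors / x_valid).mean()   # mean absolute percentage error
print(f"mse={mse:.2f} rmse={rmse:.2f} mae={mae:.2f} mape={mape:.3f}")
```
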
32 | ## Deep Neural Networks for Time Series
33 |
34 | ## Recurrent Neural Networks for Time Series
35 |
36 | ## Real-world time series data
37 |
--------------------------------------------------------------------------------