├── .DS_Store ├── Deep Learning.md ├── ML Project Checklist.md ├── Machine Learning.md ├── More Resources.md ├── README.md ├── The ML Landscape.md ├── img ├── model1.png ├── model2.png ├── model3.png ├── precision-recall.png └── reinforcement-learning.png ├── numpy-pandas ├── .DS_Store ├── 01-numpy.ipynb ├── 02-example.ipynb ├── 02-pandas.ipynb ├── 03-plt.ipynb ├── README.md ├── data │ ├── state-abbrevs.csv │ ├── state-areas.csv │ └── state-population.csv ├── img │ ├── .DS_Store │ ├── axis=1.jpg │ └── groupby.png ├── plt1.py ├── tools_matplotlib.ipynb ├── tools_numpy.ipynb ├── tools_pandas.ipynb └── very-basics │ ├── Readme.md │ ├── img │ └── plt.png │ └── nb │ └── 01_plt.ipynb ├── scikit-learn ├── Readme.md ├── car_evaluation.csv ├── img │ ├── process.png │ ├── process1.png │ └── scikit-learn.png ├── k-means-clustering.ipynb ├── knn.ipynb ├── linear_regression.ipynb ├── logistic_regression.ipynb ├── svm.ipynb └── train_test_split.ipynb └── tensorflow-in-practice ├── .DS_Store ├── Exercises ├── Course_3_Week_1_Exercise_(Tokenizer_BBC_Text).ipynb ├── Course_3_Week_2_Exercise_(BBC_Text_Model_Building).ipynb ├── Course_3_Week_3_Exercise_Twitter_Fake_News.ipynb ├── Exercise_1_House_Prices.ipynb ├── Exercise_2_Handwriting_Recognition_DNN.ipynb ├── Exercise_3_CNN.ipynb ├── Exercise_4_Complex_Images_flow_from_directory.ipynb ├── Exercise_5_Cat_vs_Dog_Kaggle.ipynb ├── Exercise_6_Cats_vs_Dogs_with_Augmentation.ipynb └── Exercise_7_Transfer_learning.ipynb ├── MNIST ├── my_model.h5 ├── test.py └── train.py ├── README.md ├── convolutional-neural-networks-tensorflow.md ├── img ├── 0.jpg ├── 1.png ├── 2.jpg ├── 3.jpg ├── fibonacci.png ├── fp.png ├── fp2.png ├── lstm.png ├── lstm2.png ├── metrics.png ├── ml_architecture.png ├── rfp.png ├── rnn.png ├── rnn2.png ├── seasonality.png ├── tf_datasets.png ├── trend.png ├── ts.png ├── tsn.png └── word_embeddings.png ├── introduction-to-tensorflow-for-ai.md ├── natural-language-processing-tensorflow.md ├── notebooks ├── .DS_Store ├── Course_1_Part_2_Lesson_2_Notebook.ipynb ├── Course_1_Part_4_Lesson_2_Notebook.ipynb ├── Course_1_Part_6_Lesson_2_Notebook.ipynb ├── Course_1_Part_6_Lesson_3_Notebook.ipynb ├── Course_1_Part_8_Lesson_2_Notebook.ipynb ├── Course_2_Part_2_Lesson_2_Notebook.ipynb ├── Course_2_Part_4_Lesson_2_Notebook_(Cats_v_Dogs_Augmentation).ipynb ├── Course_2_Part_6_Lesson_3_Notebook_(Transfer_Learning).ipynb ├── Course_2_Part_8_Lesson_2_Notebook_(RockPaperScissors).ipynb ├── Course_3_Week_1(Tokenizer-Sarcasm-Dataset).ipynb ├── Course_3_Week_2(Model_Training_IMDB_Reviews).ipynb ├── Course_3_Week_2(Sarcasm-Classifier).ipynb ├── Course_3_Week_2(Subwords).ipynb ├── Course_3_Week_3(IMDB).ipynb ├── Course_3_Week_4_Lesson_1_(Sheckspire_Text_Generation).ipynb ├── Course_3_Week_4_Lesson_2_Notebook.ipynb ├── README.md ├── irish-lyrics-eof.txt ├── meta.tsv ├── sarcasm.json └── vecs.tsv └── sequences-time-series-and-prediction.md /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/.DS_Store -------------------------------------------------------------------------------- /Deep Learning.md: -------------------------------------------------------------------------------- 1 | # Neural Networks and Deep Learning 2 | 3 | - Building and training with TensorFlow and Keras 4 | - Architectures: feedforward for tabular data, CNN for computer vision, RNN and LSTM for sequence processing 5 | - Encoder / decoder and Transformers for 
NLP
- Autoencoders and Generative Adversarial Networks (GANs) for generative learning
- Techniques for training deep neural networks (DNNs)
- Reinforcement learning - building an agent to play a game
- Loading and preprocessing large amounts of data
- Training and deploying at scale

## Contents
- Introduction to ANN with Keras
- [Sequential API](#Sequential-API), classification & regression
- [Functional API](#Functional-API)
- Subclassing API for dynamic models
- [Using Callbacks](#Using-Callbacks), EarlyStopping, ModelCheckpoint
- [TensorBoard](#TensorBoard)
- [Fine-Tuning Neural Network Hyperparameters](#Fine-Tuning-Neural-Network-Hyperparameters)

### Sequential API
```py
"""Classification MLP"""
# "sparse_categorical_crossentropy": integer class labels (0 to 9)
# "categorical_crossentropy": one-hot labels, e.g. [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]
# For binary classification, use a "sigmoid" (i.e., logistic) activation function in the output layer instead of "softmax", and the "binary_crossentropy" loss.

# assumes `model` has already been defined (e.g., a Sequential MLP ending in a 10-unit softmax layer)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))
model.evaluate(X_test, y_test)
y_proba = model.predict(X_new)
y_pred = model.predict_classes(X_new)  # deprecated in recent Keras; use np.argmax(model.predict(X_new), axis=1)

# History
import pandas as pd
import matplotlib.pyplot as plt
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1) # set the vertical range to [0-1]
plt.show()
```
- If you want to convert sparse labels (i.e., class indices) to one-hot vector labels, use the `keras.utils.to_categorical()` function. To go the other way round, use the `np.argmax()` function with `axis=1`.

- You must **compile** the model, **train** it, **evaluate** it, and use it to **make predictions**.

- `.fit()` also accepts `validation_split=0.1`, `class_weight`, and `sample_weight`.

```py
"""Regression MLP"""
from tensorflow import keras  # assumed import, not shown in the original snippet
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1)
])

model.compile(loss="mean_squared_error", optimizer="sgd")
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3] # pretend these are new instances
y_pred = model.predict(X_new)
```
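To illustrate the label-conversion note above, a tiny sketch (the label values here are made up):

```py
import numpy as np
from tensorflow import keras

y_sparse = np.array([3, 0, 9])                   # sparse labels (class indices)
y_onehot = keras.utils.to_categorical(y_sparse)  # one-hot vector labels
y_back = np.argmax(y_onehot, axis=1)             # back to class indices: array([3, 0, 9])
```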
### Functional API
- A simple model where the input is concatenated with the output of the hidden layers (a "wide & deep" network with a single input):
```py
input_ = keras.layers.Input(shape=X_train.shape[1:])
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_, hidden2])
output = keras.layers.Dense(1)(concat)
model = keras.Model(inputs=[input_], outputs=[output])
```
- Handling multiple inputs (a wide path and a deep path):
```py
input_A = keras.layers.Input(shape=[5], name="wide_input")
input_B = keras.layers.Input(shape=[6], name="deep_input")
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])
output = keras.layers.Dense(1, name="output")(concat)
model = keras.Model(inputs=[input_A, input_B], outputs=[output])

# As we have two inputs, we must pass a pair of input matrices to .fit(), .evaluate(), .predict(), and so on
model.compile(loss="mse", optimizer=keras.optimizers.SGD(lr=1e-3))
X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]
X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]
X_test_A, X_test_B = X_test[:, :5], X_test[:, 2:]
X_new_A, X_new_B = X_test_A[:3], X_test_B[:3]

history = model.fit((X_train_A, X_train_B), y_train, epochs=20, validation_data=((X_valid_A, X_valid_B), y_valid))
mse_test = model.evaluate((X_test_A, X_test_B), y_test)
y_pred = model.predict((X_new_A, X_new_B))
```
- Handling multiple outputs, e.g. adding an auxiliary output for regularization:
```py
[...] # Same as above, up to the main output layer
output = keras.layers.Dense(1, name="main_output")(concat)
aux_output = keras.layers.Dense(1, name="aux_output")(hidden2)
model = keras.Model(inputs=[input_A, input_B], outputs=[output, aux_output])

model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer="sgd")
history = model.fit(
    [X_train_A, X_train_B], [y_train, y_train], epochs=20,
    validation_data=([X_valid_A, X_valid_B], [y_valid, y_valid]))

total_loss, main_loss, aux_loss = model.evaluate(
    [X_test_A, X_test_B], [y_test, y_test])
y_pred_main, y_pred_aux = model.predict([X_new_A, X_new_B])
```
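The Contents list also mentions the Subclassing API for dynamic models, which is not covered by the snippets above. A minimal sketch (an assumed example in the spirit of the wide & deep models above, not from the original notes):

```py
class WideAndDeepModel(keras.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
        super().__init__(**kwargs)  # handles standard args (e.g., name)
        self.hidden1 = keras.layers.Dense(units, activation=activation)
        self.hidden2 = keras.layers.Dense(units, activation=activation)
        self.main_output = keras.layers.Dense(1)

    def call(self, inputs):
        # the forward pass is ordinary Python, so it can contain loops, conditionals, etc.
        hidden1 = self.hidden1(inputs)
        hidden2 = self.hidden2(hidden1)
        concat = keras.layers.concatenate([inputs, hidden2])
        return self.main_output(concat)

model = WideAndDeepModel()  # compile / fit / evaluate exactly as with the other APIs
```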
### Using Callbacks
```py
"""ModelCheckpoint will only save your model when its performance on the validation set is the best so far"""
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5", save_best_only=True)

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb])
model = keras.models.load_model("my_keras_model.h5") # roll back to best model
```
```py
"""EarlyStopping will interrupt training when it measures no progress on the validation set for a number of epochs (defined by the patience argument)"""
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
                                                  restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb, early_stopping_cb])
```
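You can also define your own callback by subclassing `keras.callbacks.Callback`. A minimal sketch (assuming the same model and data as above) that prints the validation/training loss ratio at the end of each epoch:

```py
class PrintValTrainRatioCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # logs holds the metrics for this epoch, e.g. "loss" and "val_loss"
        print("\nval/train loss ratio: {:.2f}".format(logs["val_loss"] / logs["loss"]))

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid),
                    callbacks=[PrintValTrainRatioCallback()])
```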
### TensorBoard
```py
import os

root_logdir = os.path.join(os.curdir, "my_logs")

def get_run_logdir():
    import time
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir() # e.g., './my_logs/run_2019_06_07-15_15_22'

[...] # Build and compile your model
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid),
                    callbacks=[tensorboard_cb])
```

### Fine-Tuning Neural Network Hyperparameters
```py
def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[8]):
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=input_shape))
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(1))
    optimizer = keras.optimizers.SGD(lr=learning_rate)
    model.compile(loss="mse", optimizer=optimizer)
    return model

# Wrap the Keras model in a scikit-learn regressor object
keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)

keras_reg.fit(X_train, y_train, epochs=100,
              validation_data=(X_valid, y_valid),
              callbacks=[keras.callbacks.EarlyStopping(patience=10)])

mse_test = keras_reg.score(X_test, y_test)
y_pred = keras_reg.predict(X_new)

import numpy as np  # assumed import for np.arange below
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_hidden": [0, 1, 2, 3],
    "n_neurons": np.arange(1, 100),
    "learning_rate": reciprocal(3e-4, 3e-2),
}
rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3)
rnd_search_cv.fit(X_train, y_train, epochs=100,
                  validation_data=(X_valid, y_valid),
                  callbacks=[keras.callbacks.EarlyStopping(patience=10)])

rnd_search_cv.best_params_
rnd_search_cv.best_score_

model = rnd_search_cv.best_estimator_.model
```

--------------------------------------------------------------------------------
/ML Project Checklist.md:
--------------------------------------------------------------------------------
This checklist can guide you through your Machine Learning projects. There are eight main steps:

1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

Obviously, you should feel free to adapt this checklist to your needs.

# Frame the problem and look at the big picture
1. Define the objective in business terms.
2. How will your solution be used?
3. What are the current solutions/workarounds (if any)?
4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
5. How should performance be measured?
6. Is the performance measure aligned with the business objective?
7. What would be the minimum performance needed to reach the business objective?
8. What are comparable problems? Can you reuse experience or tools?
9. Is human expertise available?
10. How would you solve the problem manually?
11. List the assumptions you or others have made so far.
12. Verify assumptions if possible.

# Get the data
Note: automate as much as possible so you can easily get fresh data.

1. List the data you need and how much you need.
2. Find and document where you can get that data.
3. Check how much space it will take.
4. Check legal obligations, and get authorization if necessary.
5. Get access authorizations.
6. Create a workspace (with enough storage space).
7. Get the data.
8. Convert the data to a format you can easily manipulate (without changing the data itself).
9. Ensure sensitive information is deleted or protected (e.g., anonymized).
10. Check the size and type of data (time series, sample, geographical, etc.).
11. Sample a test set, put it aside, and never look at it (no data snooping!).

# Explore the data
Note: try to get insights from a field expert for these steps.

1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
2. Create a Jupyter notebook to keep a record of your data exploration.
3. Study each attribute and its characteristics:
    - Name
    - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
    - % of missing values
    - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
    - Possibly useful for the task?
    - Type of distribution (Gaussian, uniform, logarithmic, etc.)
4. For supervised learning tasks, identify the target attribute(s).
5. Visualize the data.
6. Study the correlations between attributes.
7. Study how you would solve the problem manually.
8. Identify the promising transformations you may want to apply.
9. Identify extra data that would be useful (go back to "Get the Data" on page 502).
10. Document what you have learned.

# Prepare the data
Notes:
- Work on copies of the data (keep the original dataset intact).
- Write functions for all data transformations you apply, for five reasons:
    - So you can easily prepare the data the next time you get a fresh dataset
    - So you can apply these transformations in future projects
    - To clean and prepare the test set
    - To clean and prepare new data instances
    - To make it easy to treat your preparation choices as hyperparameters

1. Data cleaning:
    - Fix or remove outliers (optional).
    - Fill in missing values (e.g., with zero, mean, median...) or drop their rows (or columns).
2. Feature selection (optional):
    - Drop the attributes that provide no useful information for the task.
3. Feature engineering, where appropriate:
    - Discretize continuous features.
    - Decompose features (e.g., categorical, date/time, etc.).
    - Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
    - Aggregate features into promising new features.
4. Feature scaling: standardize or normalize features.

# Short-list promising models
Notes:
- If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests).
- Once again, try to automate these steps as much as possible.

1. Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
2. Measure and compare their performance.
    - For each model, use N-fold cross-validation and compute the mean and standard deviation of their performance.
3. Analyze the most significant variables for each algorithm.
4. Analyze the types of errors the models make.
    - What data would a human have used to avoid these errors?
5. Have a quick round of feature selection and engineering.
6. Have one or two more quick iterations of the five previous steps.
7. Short-list the top three to five most promising models, preferring models that make different types of errors.

# Fine-Tune the System
Notes:
- You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning.
- As always, automate what you can.

1. Fine-tune the hyperparameters using cross-validation.
    - Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or the median value? Or just drop the rows?).
    - Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams ([https://goo.gl/PEFfGr](https://goo.gl/PEFfGr)))
2. Try Ensemble methods. Combining your best models will often perform better than running them individually.
3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.

> Don't tweak your model after measuring the generalization error: you would just start overfitting the test set.

# Present your solution
1. Document what you have done.
2. Create a nice presentation.
    - Make sure you highlight the big picture first.
3. Explain why your solution achieves the business objective.
4. Don't forget to present interesting points you noticed along the way.
    - Describe what worked and what did not.
    - List your assumptions and your system's limitations.
5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., "the median income is the number-one predictor of housing prices").

# Launch!
1. Get your solution ready for production (plug into production data inputs, write unit tests, etc.).
2. Write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops.
    - Beware of slow degradation too: models tend to "rot" as data evolves.
    - Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
    - Also monitor your inputs' quality (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale). This is particularly important for online learning systems.
3. Retrain your models on a regular basis on fresh data (automate as much as possible).

--------------------------------------------------------------------------------
/Machine Learning.md:
--------------------------------------------------------------------------------
# Machine Learning

- Handling, cleaning, and preparing data.
- Selecting and engineering features.
- Learning by fitting a model to data.
- Optimizing a cost function.
- Selecting a model and tuning hyperparameters using cross-validation.
- Underfitting and overfitting (the bias/variance tradeoff).
- Unsupervised learning techniques: clustering, density estimation and anomaly detection.
10 | - Algorithms: Linear and Polynomial Regression, Logistic Regression, k-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forests, and Ensemble methods. 11 | 20 | 21 | ## End-to-End Machine Learning Project 22 | - Frame the problem and look at the big picture 23 | - Goal and Performance measure 24 | - Get the data 25 | - [Create test set](#Create-test-set) 26 | - Explore the data to gain insights (EDA) 27 | - [Looking for correlations](#Looking-for-Correlations) 28 | - Experimenting with attribute combinations 29 | - Prepare data for ML algorithms 30 | - [Data cleaning](#Data-cleaning) 31 | - [Handling text and categorical attributes](#Handling-text-and-categorical-attributes) 32 | - [Feature Scaling](#Feature-Scaling) 33 | - [Transformation Pipelines](#Transformation-Pipelines) 34 | - [Explore many different models and short-list the best ones](#Select-and-Train-a-Model) 35 | - [Cross-Validation](#Cross-Validation) 36 | - Fine-tune models and combine them into a great solution 37 | - [Grid Search](#Grid-Search) 38 | - [Randomized Search](#Randomized-Search) 39 | - Ensemble models 40 | - Evaluate on test set 41 | - Launch and monitor 42 | 43 | ## Get the data 44 | ### Create test set 45 | ```py 46 | '''Create test set''' 47 | from sklearn.model_selection import train_test_split 48 | train_set, test_set = train_test_split(data, test_size=0.2, random_state=42) 49 | ``` 50 | ```py 51 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) 52 | ``` 53 | ```py 54 | from sklearn.model_selection import StratifiedShuffleSplit 55 | import numpy as np 56 | X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]]) 57 | y = np.array([0, 0, 0, 1, 1, 1]) 58 | sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0) 59 | sss.get_n_splits(X, y) 60 | 61 | for train_index, test_index in sss.split(X, y): 62 | print("TRAIN:", train_index, "TEST:", test_index) 63 | X_train, X_test = X[train_index], X[test_index] 64 | y_train, y_test = y[train_index], y[test_index] 65 | ``` 66 | 67 | ## Explore the data to gain insights 68 | ```py 69 | '''Visualizing Geographical Data''' 70 | data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1) 71 | ``` 72 | ```py 73 | housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, 74 | s=housing["population"]/100, label="population", figsize=(10,7), 75 | c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True, 76 | ) 77 | plt.legend() 78 | ``` 79 | ### Looking for Correlations 80 | ```py 81 | '''Looking for Correlations''' 82 | corr_matrix = data.corr() 83 | corr_matrix["any_column"].sort_values(ascending=False) 84 | 85 | from pandas.plotting import scatter_matrix 86 | attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"] 87 | scatter_matrix(housing[attributes], figsize=(12, 8)) 88 | ``` 89 | ```py 90 | # Correlations between features 91 | all_data_corr = all_data.corr().abs().unstack().sort_values(kind="quicksort", ascending=False).reset_index() 92 | all_data_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True) 93 | all_data_corr.drop(all_data_corr.iloc[1::2].index, inplace=True) 94 | all_data_corr_nd = all_data_corr.drop(all_data_corr[all_data_corr['Correlation Coefficient'] == 1.0].index) 95 | 96 | corr = all_data_corr_nd['Correlation Coefficient'] > 0.1 97 | all_data_corr_nd[corr] 98 | ``` 99 | ```py 100 | # pivot_table() vs groupby(), the below lines are the same 101 | 
pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum) 102 | df.groupby(['a','b'])['c'].sum() 103 | ``` 104 | ```py 105 | # Aggregate using one or more operations over the specified axis 106 | # agg()-can be applied to multiple groups together 107 | df.agg(['sum', 'min']) 108 | df_all.groupby(['Sex', 'Pclass']).agg(lambda x:x.value_counts().index[0])['Embarked'] 109 | 110 | # Apply a function along an axis of the DataFrame 111 | # apply()-cannot be applied to multiple groups together 112 | df.apply(np.sqrt) 113 | df_all['Deck'] = df_all['Cabin'].apply(lambda s: s[0] if pd.notnull(s) else 'M') 114 | ``` 115 | 116 | ## Prepare data for ML algorithms 117 | - https://stackoverflow.com/questions/48673402/how-can-i-standardize-only-numeric-variables-in-an-sklearn-pipeline 118 | - https://scikit-learn.org/stable/modules/preprocessing.html 119 | 120 | ### Data Cleaning 121 | ```py 122 | housing.dropna(subset=["total_bedrooms"]) # Get rid of the corresponding districts 123 | housing.drop("total_bedrooms", axis=1) # Get rid of the whole attribute 124 | median = housing["total_bedrooms"].median() # Set the values to some value (zero, mean, median) 125 | housing["total_bedrooms"].fillna(median, inplace=True) 126 | ``` 127 | ```py 128 | '''SimpleImputer, filling with the missing numerical attributes with the "median"''' 129 | from sklearn.impute import SimpleImputer 130 | imputer = SimpleImputer(strategy="median") 131 | housing_num = housing.select_dtypes(include=[np.number]) # just numerical attributes 132 | imputer.fit(housing_num) # "trained" inputer, now it is ready to transform the training set by replacing missing values with the learned medians 133 | imputer.statistics_ # same as "housing_num.median().values" 134 | X = imputer.transform(housing_num) 135 | housing_tr = pd.DataFrame(X, columns=housing_num.columns, 136 | index=housing.index) # new dataframe 137 | ``` 138 | 139 | ### Handling Text and Categorical Attributes 140 | - [select_dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html) 141 | ```py 142 | '''Transforming continuous numerical attributes to categorical''' 143 | housing["income_cat"] = pd.cut(housing["median_income"], 144 | bins=[0., 1.5, 3.0, 4.5, 6., np.inf], 145 | labels=[1, 2, 3, 4, 5]) 146 | ``` 147 | ```py 148 | '''Categorical Attributes''' 149 | from sklearn.preprocessing import OrdinalEncoder 150 | from sklearn.preprocessing import OneHotEncoder 151 | 152 | housing_cat = housing[["ocean_proximity"]] 153 | 154 | ordinal_encoder = OrdinalEncoder() 155 | housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat) 156 | 157 | housing_cat_encoded[:10] 158 | # array([[0.], 159 | # [0.], 160 | # [4.], 161 | # [1.], 162 | # [0.], 163 | # [1.], 164 | # [0.], 165 | # [1.], 166 | # [0.], 167 | # [0.]]) 168 | 169 | ordinal_encoder.categories_ # [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)] 170 | 171 | cat_encoder = OneHotEncoder(sparse=False) 172 | housing_cat_1hot = cat_encoder.fit_transform(housing_cat) 173 | housing_cat_1hot 174 | # array([[1., 0., 0., 0., 0.], 175 | # [1., 0., 0., 0., 0.], 176 | # [0., 0., 0., 0., 1.], 177 | # ..., 178 | # [0., 1., 0., 0., 0.], 179 | # [1., 0., 0., 0., 0.], 180 | # [0., 0., 0., 1., 0.]]) 181 | ``` 182 | ### Feature Scaling 183 | ```py 184 | '''StandardScaler''' 185 | from sklearn.preprocessing import StandardScaler 186 | import numpy as np 187 | 188 | X_train = np.array([[ 1., -1., 2.], 189 | [ 2., 0., 0.], 190 | [ 0., 1., -1.]]) 191 | scaler = 
StandardScaler().fit(X_train)

scaler.mean_
scaler.scale_

X_scaled = scaler.transform(X_train)
X_scaled
```
```py
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[ 1., -1., 2.],
                    [ 2., 0., 0.],
                    [ 0., 1., -1.]])

min_max_scaler = MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax
# array([[0.5       , 0.        , 1.        ],
#        [1.        , 0.5       , 0.33333333],
#        [0.        , 1.        , 0.        ]])

# For the test data, we just need to use .transform()
X_test = np.array([[-3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax
# array([[-1.5       ,  0.        ,  1.66666667]])
```

### Custom Transformer
```py
from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
```

### Transformation Pipelines
```py
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

# housing_num_tr = num_pipeline.fit_transform(housing_num)

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared # to get access to the new dataset
```
## Select and Train a Model
- Before using `.predict()` you have to use `full_pipeline.transform(some_data)`

### Cross-Validation
```py
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, data, labels, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(rmse_scores)
```
```py
'''Save the model'''
import joblib
joblib.dump(my_model, "my_model.pkl") # to save model
my_model_loaded = joblib.load("my_model.pkl") # to load model
```


## Fine-tune Models
### Grid Search
```py
from sklearn.model_selection import
GridSearchCV 300 | 301 | param_grid = [ 302 | # try 12 (3×4) combinations of hyperparameters 303 | {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}, 304 | # then try 6 (2×3) combinations with bootstrap set as False 305 | {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}, 306 | ] 307 | 308 | forest_reg = RandomForestRegressor(random_state=42) 309 | # train across 5 folds, that's a total of (12+6)*5=90 rounds of training 310 | grid_search = GridSearchCV(forest_reg, param_grid, cv=5, 311 | scoring='neg_mean_squared_error', 312 | return_train_score=True) 313 | grid_search.fit(housing_prepared, housing_labels) 314 | 315 | grid_search.best_params_ # the best hyperparameters 316 | grid_search.best_estimator_ 317 | 318 | # look at the score of each hyperparameter combination tested during the grid search: 319 | cvres = grid_search.cv_results_ 320 | for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]): 321 | print(np.sqrt(-mean_score), params) 322 | ``` 323 | 324 | ### Randomized Search 325 | ```py 326 | from sklearn.model_selection import RandomizedSearchCV 327 | from scipy.stats import randint 328 | 329 | param_distribs = { 330 | 'n_estimators': randint(low=1, high=200), 331 | 'max_features': randint(low=1, high=8), 332 | } 333 | 334 | forest_reg = RandomForestRegressor(random_state=42) 335 | rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs, 336 | n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42) 337 | rnd_search.fit(housing_prepared, housing_labels) 338 | 339 | # looking at the scores during training 340 | cvres = rnd_search.cv_results_ 341 | for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]): 342 | print(np.sqrt(-mean_score), params) 343 | 344 | feature_importances = grid_search.best_estimator_.feature_importances_ 345 | ``` -------------------------------------------------------------------------------- /More Resources.md: -------------------------------------------------------------------------------- 1 | ## What is Data Science? 2 | - [What really is Data Science? 
](https://youtu.be/xC-c7E5PK0Y) 3 | - https://telegra.ph/What-REALLY-is-Data-Science-09-21 4 | 5 | ### Just leaving it here 6 | - [Data Science Interview at Facebook](https://tproger.ru/translations/preparing-for-data-science-interview/) 7 | 8 | ### Advice 9 | - [12 Things I Learned During My First Year as a Machine Learning Engineer](https://proglib.io/w/464d1326) 10 | - [How to Learn Machine Learning, The Self-Starter Way](https://elitedatascience.com/learn-machine-learning) 11 | - [Andrew Ng: Advice on Getting Started in Deep Learning](https://youtu.be/1k37OcjH7BM) 12 | - [Andrew Ng Machine Learning Career](https://youtu.be/hkagmGAu74Y) 13 | 14 | ### Technical Articles / Videos 15 | - [Cheat sheat Stanford](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-machine-learning-tips-and-tricks) 16 | - [Loss function for neural networks CNN](https://towardsdatascience.com/understanding-different-loss-functions-for-neural-networks-dd1ed0274718) 17 | - [Backprop in CNN](https://medium.com/@pavisj/convolutions-and-backpropagations-46026a8f5d2c) 18 | - [Backprop in NN](https://youtu.be/0e0z28wAWfg) 19 | - [Introduction to Backpropagation and Optimization](https://ai.plainenglish.io/approach-complex-functions-with-backpropagation-how-i-was-applying-to-yandex-c5f68d50f2da) 20 | 21 | ### Courses 22 | - http://introtodeeplearning.com 23 | - https://openlearninglibrary.mit.edu/courses/course-v1:MITx+6.036+1T2019/about 24 | - [Fast AI DL](https://www.fast.ai) 25 | - [Coursera Data Science](https://www.coursera.org/specializations/data-science-python?ranMID=40328&ranEAID=EBOQAYvGY4A&ranSiteID=EBOQAYvGY4A-xBZ6HIoQD.6tLROsD7db4g&siteID=EBOQAYvGY4A-xBZ6HIoQD.6tLROsD7db4g&utm_content=2&utm_medium=partners&utm_source=linkshare&utm_campaign=EBOQAYvGY4A) 26 | - [Fast AI ML course](https://course18.fast.ai/ml) 27 | 28 | ### Bookshelf 29 | - [Data Science books](https://proglib.io/w/0232ad78) 30 | 31 | ### YouTube 32 | - [DeepLizard](https://www.youtube.com/c/deeplizard/playlists) 33 | 34 | ### QA 35 | - Validation accuracy is higher than training accuracy. 36 | - https://www.quora.com/Can-validation-accuracy-be-higher-than-training-accuracy 37 | 38 | ### Resources where you can find the latest publications from leading laboratories 39 | - https://openai.com/blog/tags/research/ 40 | - https://deepmind.com/research 41 | - https://www.microsoft.com/en-us/research/research-area/artificial-intelligence 42 | - https://www.research.ibm.com/artificial-intelligence/#publications 43 | - https://ai.stanford.edu 44 | - https://www.csail.mit.edu 45 | - https://ai.google/research/ -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning Area 2 | 3 | [Rustam-Z🚀](https://t.me/rz_zokirov) • [Find more here](https://t.me/rz_zokirov_ml) 4 | 5 | > 1% better every day = 3700% better at the end of the year 6 | 7 | > The goal is to solve problems and help society with the help of AI. 8 | 9 | ## Why should you learn Machine Learning? 10 | 11 | [First of all, understand the difference between AI / Data Science / Machine Learning](https://telegra.ph/AI--Data-Science--Machine-Learning--Deep-Learning--Data-Analysis--Data-Engineering--Big-Data-09-09) 12 | 13 | I found two good answers on why you should care. Firstly, **Machine Learning (ML)** is making computers do things that we’ve never made computers do before. 
If you want to do something new, not just new to you, but to the world, you can do it with ML. 14 | 15 | Secondly, if you don’t influence the world, the world will influence you. 16 | 17 | If you focus on results, you will never change. 18 | If you focus on change, you will get results. 19 | 20 | ## How to study? 21 | - **First, learn to learn.** 22 | - [Thinking of Self-Studying Machine Learning? Remind yourself of these 6 things](https://towardsdatascience.com/thinking-of-self-studying-machine-learning-remind-yourself-of-these-6-things-b55a5f2b6c7d) 23 | - [How to Learn Machine Learning](https://elitedatascience.com/learn-machine-learning) 24 | 25 | ## Roadmap 26 | - **Math (Calculus, Linear Algebra, Propability & Statistics)** 27 | - [Calculus](https://www.youtube.com/playlist?list=PLmdFyQYShrjd4Qn42rcBeFvF6Qs-b6e-L), *Don't Memorize* 28 | - [Caclulus](https://youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr), *3Blue1Brown* 29 | - [Linear Algebra](https://youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab), *3Blue1Brown* 30 | - [Statistics & Probability](https://www.khanacademy.org/math/statistics-probability) 31 | - **Python** 32 | - [My Python learning roadmap](https://github.com/Rustam-Z/learning-area#1-start-learning-python) 33 | - [NumPy](https://www.w3schools.com/python/numpy/default.asp), [Pandas](https://www.w3schools.com/python/pandas/default.asp), [Matplotlib](https://www.w3schools.com/python/matplotlib_intro.asp) 34 | - [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) 35 | - **Machine Learning** 36 | - "Deep learning with Python", book, 1st part 37 | - Machine Learning Course, Andrew Ng, coursera.org 38 | - **Scikit-Learn** 39 | - [freeCodeCamp.org](https://youtu.be/0B5eIE_1vpU) 40 | - https://inria.github.io/scikit-learn-mooc/ 41 | - https://scikit-learn.org/stable/tutorial/index.html 42 | - **Deep Learning** - Start solving [Kaggle](https://github.com/Rustam-Z/kaggle-problem-solving) 43 | - TensorFlow Developer Specialization, deeplearning.ai, coursera.org 44 | - OR "AI and Machine Learning for Coders", book 45 | - "Deep learning with Python", book, 2nd part 46 | - ["Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow"](https://github.com/ageron/handson-ml2), book 47 | - **fast.ai** 48 | - "Deep learning", MIT press, book 49 | - Deep Learning Specialization, Andrew Ng, coursera.org 50 | - TensorFlow Advanced Techniques, deeplearning.ai, coursera.org 51 | - **Data Science** 52 | - "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython", book 53 | - "Python Data Science Handbook", book 54 | - **More** 55 | - Applied Machine Learning: https://machinelearningmastery.com/start-here 56 | 57 | ## ML Cheatsheets 58 | * [Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data](https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463) `numpy`, `pandas`, `sklearn`, `ml`, `dl` 59 | * [Machine Learning](https://stanford.edu/~shervine/teaching/cs-229/) 60 | 61 | *Please, consider this repository for contributing too!* 62 | 63 | 83 | -------------------------------------------------------------------------------- /The ML Landscape.md: -------------------------------------------------------------------------------- 1 | ## The Machine Learning Landscape 2 | ### What is Machine Learning? 3 | - Machine learning (ML) is field of study that gives computers the ability to learn without being explicitly programmed. 
- A computer program is said to learn from *experience E* with respect to some *task T* and some *performance measure P*, if its performance on T, as measured by P, improves with experience E.
- **Example:** T = flag spam for new emails, E = the training data, P = accuracy, the ratio of correctly classified emails.

### Why use ML?
- Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better. (spam classifier)
- Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution. (speech recognition)
- Fluctuating environments: a Machine Learning system can adapt to new data.
- Getting insights about complex problems and large amounts of data. (data mining)

### Types of ML Systems
- Whether or not they are trained with human supervision `supervised, unsupervised, semisupervised, and Reinforcement Learning`
- Whether or not they can learn incrementally on the fly `online vs batch learning`
- Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do `instance-based vs model-based learning`

- **Supervised learning** - training data with labels (expected outputs).
    - Tasks: classification, regression (univariate / multivariate).
    - Class / sample / label / feature (predictors: age, brand, ...) / attribute
    - **Algorithms**
        - k-Nearest Neighbors
        - Linear Regression
        - Logistic Regression
        - Support Vector Machines (SVMs)
        - Decision Trees and Random Forests
        - Neural networks

- **Unsupervised learning** - training data is unlabeled.
    - Tasks: clustering, anomaly detection, visualization & dimensionality reduction.
    - Clustering (find similar visitors)
        - K-Means
        - DBSCAN
        - Hierarchical Cluster Analysis (HCA)
    - Anomaly detection & novelty detection (detect unusual things)
        - One-class SVM
        - Isolation Forest
    - Visualization and dimensionality reduction (a kind of feature extraction)
        - Principal Component Analysis (PCA)
        - Kernel PCA
        - Locally-Linear Embedding (LLE)
        - t-distributed Stochastic Neighbor Embedding (t-SNE)
    - Association rule learning
        - Apriori
        - Eclat

- `TIP!` Use a dimensionality reduction algorithm before feeding the data to a supervised learning algorithm.
- `TIP!` Automatically removing outliers from a dataset before feeding it to another learning algorithm.

- **Semisupervised learning** - a lot of unlabeled data and a little bit of labeled data.
    - Example: like in Google Photos, it recognizes the same person in many pictures. We need the supervised part because we need to separate similar clusters (like similar people).

- **Reinforcement Learning** - an *agent* can observe the environment, perform some actions, and get *rewards* and *penalties*. Then it must teach itself the best strategy (*policy*) to get the maximum reward. A policy defines what action the agent should choose when it is in a given situation.
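To make the agent / reward / policy loop concrete, here is a tiny, self-contained sketch (the toy "corridor" environment and all constants are made-up assumptions, not from the original notes):

```py
import numpy as np

# Toy environment: 5 states in a corridor; reaching the right end gives a reward.
n_states, n_actions = 5, 2           # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))  # action-value table; the greedy policy reads from it
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(500):
    state = 0
    for step in range(20):
        # epsilon-greedy policy: mostly exploit the current Q-values, sometimes explore
        action = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[state]))
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: nudge the estimate toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

policy = Q.argmax(axis=1)  # best action per state (here it learns "go right" everywhere)
```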
- **Batch learning** - or *offline learning*: when you have a new type of data, you need to retrain on the whole dataset every time.
- **Online learning** - you train the system incrementally on new data or mini-batches of data.
    - You must set the *learning rate* parameter: if you set a high rate, your system rapidly adapts to new data, but it will also tend to forget the old data.
    - A big challenge: if bad data is fed to the system, the system's performance will gradually decline.
    - `TIP!` Monitor your latest input data using an anomaly detection algorithm.

- **Instance-based learning** - the system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples using a *similarity measure*.
- **Model-based learning** - build the model, then use it to make *predictions*.

### Main Challenges of ML
- "Bad algorithm" and "bad data"
- **Bad data**
    - If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it, and so on.
    - **Feature engineering**, involves:
        - *Feature selection*: selecting the most useful features to train on among existing features.
        - *Feature extraction*: combining existing features to produce a more useful one (dimensionality reduction algorithms can help).
        - Creating new features by gathering new data.

- **Bad algorithm**
    - **Overfitting** means that the model performs well on the training data, but it does not generalize well. How to overcome it?
        - Simplify the model by selecting one with fewer parameters (a linear model rather than a high-degree polynomial model), by reducing the number of features in the training data, or by constraining the model (with regularization).
        - Gather more training data.
        - Reduce the noise in the training data (fix data errors and remove outliers).
    - **Underfitting** occurs when your model is too simple to learn the underlying structure of the data. The options to fix it:
        - Select a more powerful model, with more parameters.
        - Feed better features to the learning algorithm (feature engineering).
        - Reduce the constraints on the model (reduce the regularization hyperparameter).

- The system will not perform well if your training set is too small, or if the data is not representative of production-level data, noisy, or polluted with irrelevant features (garbage in, garbage out). Lastly, your model needs to be neither too simple nor too complex.

### Testing and Validating
- 80% training and 20% testing. If you have 10 million samples, 1% for testing is enough.
- **Hyperparameter Tuning and Model Selection** `page 32`
    - Example: you are hesitating between two models, linear and polynomial. You must try both and see which one generalizes better on the test set. You also want to apply regularization to decrease overfitting, but you don't know how to choose the regularization hyperparameter. Try 100 different values, and pick the one that produces the smallest error.
    - However, after you deploy your model you see 15% error. This is probably because you chose the hyperparameter for this particular test set. Then you should use **holdout validation "with a validation / dev set"**.
You train multiple models with various hyperparameters on the reduced training set (training - validation set). Select model performing best on val-on set. And train again on full dataset. 94 | - [**Cross validation**](https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/) 95 | - **Data Mismatch** `page 33` 96 | - Example: You want to developer flowers species classifier. You downloaded pictures from web. And you have 10K pictures taken with the app. **TIP! Remember, your validation and test set must be as representitive as possible you expect to use in production.** In this case divide 50 / 50 to dev & test sets (pics must not be duplicated in both, even near-duplicate). 97 | - After training you see that model on validation set is very poor. Is it overfitting or mismatch between web and phone pics? 98 | - One solution, is to take the part of training (web pics) into **train-dev set**. After training a model, you see that model on train-dev set is good. Then the problem is data mismatch. Use preprocessing, and make web pics look like phone pics. 99 | - But if model is bad on train-dev set, then you have overfitting. You should try to simplify or regularize the model, get more training data and clean up the training data. 100 | 101 | ### Extra 102 | - **Hyper-parameters** are those which we supply to the model, for example: number of hidden Nodes and Layers, input features, Learning Rate, Activation Function etc in Neural Network, while **Parameters** are those which would be learnt during training by the machine like Weights and Biases. 103 | 104 | -------------------------------------------------------------------------------- /img/model1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/img/model1.png -------------------------------------------------------------------------------- /img/model2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/img/model2.png -------------------------------------------------------------------------------- /img/model3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/img/model3.png -------------------------------------------------------------------------------- /img/precision-recall.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/img/precision-recall.png -------------------------------------------------------------------------------- /img/reinforcement-learning.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/img/reinforcement-learning.png -------------------------------------------------------------------------------- /numpy-pandas/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/numpy-pandas/.DS_Store -------------------------------------------------------------------------------- /numpy-pandas/02-example.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language_info": { 4 | "codemirror_mode": { 5 | "name": "ipython", 6 | "version": 3 7 | }, 8 | "file_extension": ".py", 9 | "mimetype": "text/x-python", 10 | "name": "python", 11 | "nbconvert_exporter": "python", 12 | "pygments_lexer": "ipython3", 13 | "version": "3.8.6" 14 | }, 15 | "orig_nbformat": 2, 16 | "kernelspec": { 17 | "name": "python386jvsc74a57bd04ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a", 18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)" 19 | }, 20 | "metadata": { 21 | "interpreter": { 22 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a" 23 | } 24 | } 25 | }, 26 | "nbformat": 4, 27 | "nbformat_minor": 2, 28 | "cells": [ 29 | { 30 | "cell_type": "code", 31 | "execution_count": 1, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "import pandas as pd" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 3, 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "output_type": "stream", 45 | "name": "stdout", 46 | "text": [ 47 | " state/region ages year population\n0 AL under18 2012 1117489.0\n1 AL total 2012 4817528.0\n2 AL under18 2010 1130966.0\n3 AL total 2010 4785570.0\n4 AL under18 2011 1125763.0\n state area (sq. mi)\n0 Alabama 52423\n1 Alaska 656425\n2 Arizona 114006\n3 Arkansas 53182\n4 California 163707\n state abbreviation\n0 Alabama AL\n1 Alaska AK\n2 Arizona AZ\n3 Arkansas AR\n4 California CA\n" 48 | ] 49 | } 50 | ], 51 | "source": [ 52 | "pop = pd.read_csv('data/state-population.csv')\n", 53 | "areas = pd.read_csv('data/state-areas.csv')\n", 54 | "abbrevs = pd.read_csv('data/state-abbrevs.csv')\n", 55 | "\n", 56 | "print(pop.head()); print(areas.head()); print(abbrevs.head())" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 22, 62 | "metadata": {}, 63 | "outputs": [ 64 | { 65 | "output_type": "execute_result", 66 | "data": { 67 | "text/plain": [ 68 | " state/region ages year population state\n", 69 | "0 AL under18 2012 1117489.0 Alabama\n", 70 | "1 AL total 2012 4817528.0 Alabama\n", 71 | "2 AL under18 2010 1130966.0 Alabama\n", 72 | "3 AL total 2010 4785570.0 Alabama\n", 73 | "4 AL under18 2011 1125763.0 Alabama" 74 | ], 75 | "text/html": "
" 76 | }, 77 | "metadata": {}, 78 | "execution_count": 22 79 | } 80 | ], 81 | "source": [ 82 | "merged = pd.merge(pop, abbrevs, how='outer', left_on='state/region', right_on='abbreviation') # if you do not specify left_on/right_on then no common coumns error\n", 83 | "merged = merged.drop('abbreviation', 1) # drop duplicate info \n", 84 | "merged.head()" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 23, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "output_type": "execute_result", 94 | "data": { 95 | "text/plain": [ 96 | "state/region False\n", 97 | "ages False\n", 98 | "year False\n", 99 | "population True\n", 100 | "state True\n", 101 | "dtype: bool" 102 | ] 103 | }, 104 | "metadata": {}, 105 | "execution_count": 23 106 | } 107 | ], 108 | "source": [ 109 | "merged.isnull().any()" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 30, 115 | "metadata": {}, 116 | "outputs": [ 117 | { 118 | "output_type": "execute_result", 119 | "data": { 120 | "text/plain": [ 121 | " state/region ages year population state\n", 122 | "2448 PR under18 1990 NaN NaN\n", 123 | "2449 PR total 1990 NaN NaN\n", 124 | "2450 PR total 1991 NaN NaN\n", 125 | "2451 PR under18 1991 NaN NaN\n", 126 | "2452 PR total 1993 NaN NaN" 127 | ], 128 | "text/html": "
" 129 | }, 130 | "metadata": {}, 131 | "execution_count": 30 132 | } 133 | ], 134 | "source": [ 135 | "merged[merged['population'].isnull()].head()" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 31, 141 | "metadata": {}, 142 | "outputs": [ 143 | { 144 | "output_type": "execute_result", 145 | "data": { 146 | "text/plain": [ 147 | "state/region False\n", 148 | "ages False\n", 149 | "year False\n", 150 | "population True\n", 151 | "state False\n", 152 | "dtype: bool" 153 | ] 154 | }, 155 | "metadata": {}, 156 | "execution_count": 31 157 | } 158 | ], 159 | "source": [ 160 | "merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'\n", 161 | "merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'\n", 162 | "merged.isnull().any()" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 36, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "output_type": "execute_result", 172 | "data": { 173 | "text/plain": [ 174 | " state/region ages year population state area (sq. mi)\n", 175 | "0 AL under18 2012 1117489.0 Alabama 52423.0\n", 176 | "1 AL total 2012 4817528.0 Alabama 52423.0\n", 177 | "2 AL under18 2010 1130966.0 Alabama 52423.0\n", 178 | "3 AL total 2010 4785570.0 Alabama 52423.0\n", 179 | "4 AL under18 2011 1125763.0 Alabama 52423.0" 180 | ], 181 | "text/html": "
" 182 | }, 183 | "metadata": {}, 184 | "execution_count": 36 185 | } 186 | ], 187 | "source": [ 188 | "final = pd.merge(merged, areas, on='state', how='left')\n", 189 | "final.head()" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": 37, 195 | "metadata": {}, 196 | "outputs": [ 197 | { 198 | "output_type": "execute_result", 199 | "data": { 200 | "text/plain": [ 201 | "(2544, 6)" 202 | ] 203 | }, 204 | "metadata": {}, 205 | "execution_count": 37 206 | } 207 | ], 208 | "source": [ 209 | "final.shape" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 38, 215 | "metadata": {}, 216 | "outputs": [ 217 | { 218 | "output_type": "execute_result", 219 | "data": { 220 | "text/plain": [ 221 | "state/region False\n", 222 | "ages False\n", 223 | "year False\n", 224 | "population True\n", 225 | "state False\n", 226 | "area (sq. mi) True\n", 227 | "dtype: bool" 228 | ] 229 | }, 230 | "metadata": {}, 231 | "execution_count": 38 232 | } 233 | ], 234 | "source": [ 235 | "final.isnull().any()" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 39, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "output_type": "execute_result", 245 | "data": { 246 | "text/plain": [ 247 | "array(['United States'], dtype=object)" 248 | ] 249 | }, 250 | "metadata": {}, 251 | "execution_count": 39 252 | } 253 | ], 254 | "source": [ 255 | "final['state'][final['area (sq. mi)'].isnull()].unique()" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 40, 261 | "metadata": {}, 262 | "outputs": [ 263 | { 264 | "output_type": "execute_result", 265 | "data": { 266 | "text/plain": [ 267 | " state/region ages year population state area (sq. mi)\n", 268 | "0 AL under18 2012 1117489.0 Alabama 52423.0\n", 269 | "1 AL total 2012 4817528.0 Alabama 52423.0\n", 270 | "2 AL under18 2010 1130966.0 Alabama 52423.0\n", 271 | "3 AL total 2010 4785570.0 Alabama 52423.0\n", 272 | "4 AL under18 2011 1125763.0 Alabama 52423.0" 273 | ], 274 | "text/html": "
" 275 | }, 276 | "metadata": {}, 277 | "execution_count": 40 278 | } 279 | ], 280 | "source": [ 281 | "final.dropna(inplace=True)\n", 282 | "final.head()" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 42, 288 | "metadata": {}, 289 | "outputs": [ 290 | { 291 | "output_type": "execute_result", 292 | "data": { 293 | "text/plain": [ 294 | "(2476, 6)" 295 | ] 296 | }, 297 | "metadata": {}, 298 | "execution_count": 42 299 | } 300 | ], 301 | "source": [ 302 | "final.shape" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 43, 308 | "metadata": {}, 309 | "outputs": [ 310 | { 311 | "output_type": "execute_result", 312 | "data": { 313 | "text/plain": [ 314 | " state/region ages year population state area (sq. mi)\n", 315 | "3 AL total 2010 4785570.0 Alabama 52423.0\n", 316 | "91 AK total 2010 713868.0 Alaska 656425.0\n", 317 | "101 AZ total 2010 6408790.0 Arizona 114006.0\n", 318 | "189 AR total 2010 2922280.0 Arkansas 53182.0\n", 319 | "197 CA total 2010 37333601.0 California 163707.0" 320 | ], 321 | "text/html": "
" 322 | }, 323 | "metadata": {}, 324 | "execution_count": 43 325 | } 326 | ], 327 | "source": [ 328 | "data2010 = final.query(\"year == 2010 & ages == 'total'\")\n", 329 | "data2010.head()" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 44, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "data2010.set_index('state', inplace=True)\n", 339 | "density = data2010['population'] / data2010['area (sq. mi)']" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 45, 345 | "metadata": {}, 346 | "outputs": [ 347 | { 348 | "output_type": "execute_result", 349 | "data": { 350 | "text/plain": [ 351 | "state\n", 352 | "District of Columbia 8898.897059\n", 353 | "Puerto Rico 1058.665149\n", 354 | "New Jersey 1009.253268\n", 355 | "Rhode Island 681.339159\n", 356 | "Connecticut 645.600649\n", 357 | "dtype: float64" 358 | ] 359 | }, 360 | "metadata": {}, 361 | "execution_count": 45 362 | } 363 | ], 364 | "source": [ 365 | "density.sort_values(ascending=False, inplace=True)\n", 366 | "density.head()" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 46, 372 | "metadata": {}, 373 | "outputs": [ 374 | { 375 | "output_type": "execute_result", 376 | "data": { 377 | "text/plain": [ 378 | "state\n", 379 | "South Dakota 10.583512\n", 380 | "North Dakota 9.537565\n", 381 | "Montana 6.736171\n", 382 | "Wyoming 5.768079\n", 383 | "Alaska 1.087509\n", 384 | "dtype: float64" 385 | ] 386 | }, 387 | "metadata": {}, 388 | "execution_count": 46 389 | } 390 | ], 391 | "source": [ 392 | "density.tail()" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": 48, 398 | "metadata": {}, 399 | "outputs": [ 400 | { 401 | "output_type": "execute_result", 402 | "data": { 403 | "text/plain": [ 404 | "state\n", 405 | "District of Columbia 8898.897059\n", 406 | "Puerto Rico 1058.665149\n", 407 | "New Jersey 1009.253268\n", 408 | "Rhode Island 681.339159\n", 409 | "Connecticut 645.600649\n", 410 | "Massachusetts 621.815538\n", 411 | "Maryland 466.445797\n", 412 | "Delaware 460.445752\n", 413 | "New York 356.094135\n", 414 | "Florida 286.597129\n", 415 | "Pennsylvania 275.966651\n", 416 | "Ohio 257.549634\n", 417 | "California 228.051342\n", 418 | "Illinois 221.687472\n", 419 | "Virginia 187.622273\n", 420 | "Indiana 178.197831\n", 421 | "North Carolina 177.617157\n", 422 | "Georgia 163.409902\n", 423 | "Tennessee 150.825298\n", 424 | "South Carolina 144.854594\n", 425 | "New Hampshire 140.799273\n", 426 | "Hawaii 124.746707\n", 427 | "Kentucky 107.586994\n", 428 | "Michigan 102.015794\n", 429 | "Washington 94.557817\n", 430 | "Texas 93.987655\n", 431 | "Alabama 91.287603\n", 432 | "Louisiana 87.676099\n", 433 | "Wisconsin 86.851900\n", 434 | "Missouri 86.015622\n", 435 | "West Virginia 76.519582\n", 436 | "Vermont 65.085075\n", 437 | "Mississippi 61.321530\n", 438 | "Minnesota 61.078373\n", 439 | "Arizona 56.214497\n", 440 | "Arkansas 54.948667\n", 441 | "Iowa 54.202751\n", 442 | "Oklahoma 53.778278\n", 443 | "Colorado 48.493718\n", 444 | "Oregon 39.001565\n", 445 | "Maine 37.509990\n", 446 | "Kansas 34.745266\n", 447 | "Utah 32.677188\n", 448 | "Nevada 24.448796\n", 449 | "Nebraska 23.654153\n", 450 | "Idaho 18.794338\n", 451 | "New Mexico 16.982737\n", 452 | "South Dakota 10.583512\n", 453 | "North Dakota 9.537565\n", 454 | "Montana 6.736171\n", 455 | "Wyoming 5.768079\n", 456 | "Alaska 1.087509\n", 457 | "dtype: float64" 458 | ] 459 | }, 460 | "metadata": {}, 461 | "execution_count": 48 462 | } 463 | ], 464 | 
"source": [ 465 | "density" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [] 474 | } 475 | ] 476 | } -------------------------------------------------------------------------------- /numpy-pandas/README.md: -------------------------------------------------------------------------------- 1 | # Python Data Science Handbook 2 | 3 | Rustam-Z🚀 • 1 June 2021 4 | 5 | My notes on **NumPy: ndarray**, **Pandas: DataFrame**, **Matplotlib**, and **Scikit-Learn** 6 | 7 | ## Contents 8 | 1. IPython: Beyond Normal Python - *All features of Jupyter Notebook* 9 | 2. [Introduction to NumPy: Math operations with NumPy](#CHAPTER-2:-Introduction-to-NumPy) 10 | - Creating Arrays 11 | - The Basics of NumPy Arrays 12 | - Computation on NumPy Arrays 13 | - Fancy indexing 14 | - Structured Arrays 15 | 3. [Data Manipulation with Pandas](#CHAPTER-3:-Data-Manipulation-with-Pandas) 16 | - The Pandas Series / DataFrame / Index Objects 17 | - Data Selection in Series / DataFrame 18 | - Missing Data in Pandas / Operating on NULL values 19 | - Combining Datasets: Concat and Append 20 | - [GroupBy: Split, Apply, Combine](#GroupBy:-Split,-Apply,-Combine) 21 | 4. [Visualization with Matplotlib](#CHAPTER-4:-Visualization-with-Matplotlib) 22 | 5. [Machine Learning](#Machine-Learning) 23 | 24 | ## CHAPTER 2: Introduction to NumPy 25 | - `axis=0 is column`, `axis=1 is row` 26 | 27 | ### Creating Arrays 28 | ```python 29 | np.zeros(10, dtype=int) # Create a length-10 integer array filled with zeros 30 | np.ones((3, 5), dtype=float) # Create a 3x5 floating-point array filled with 1s 31 | np.full((3, 5), 3.14) # Create a 3x5 array filled with 3.14 32 | np.arange(0, 20, 2) # As python's range() 33 | np.linspace(0, 1, 5) # Create an array of five values evenly spaced between 0 and 1 34 | np.random.random((3, 3)) # 3x3 array, random values between 0 and 1 35 | np.random.normal(0, 1, (3, 3)) # normal distribution, with mean 0 and standard deviation 1 36 | np.random.randint(0, 10, (3, 3)) # random integers between 0 and 10 37 | np.eye(3) # Create a 3x3 identity matrix 38 | np.empty(3) # Create an uninitialized array of three integers 39 | 40 | np.zeros(10, dtype='int16') # same as 41 | np.zeros(10, dtype=np.int16) 42 | ``` 43 | 44 | ### The Basics of NumPy Arrays 45 | - *Attributes of arrays* 46 | - Determining the size, shape, memory consumption, and data types of arrays 47 | - *Indexing of arrays* 48 | - Getting and setting the value of individual array elements 49 | - *Slicing of arrays* 50 | - Getting and setting smaller subarrays within a larger array 51 | - *Reshaping of arrays* 52 | - Changing the shape of a given array 53 | - *Array Concatenation and Splitting* 54 | - Combining multiple arrays into one, and splitting one array into many 55 | 56 | - indices `(e.g., arr[0])`, slices `(e.g., arr[:5])`, and boolean masks `(e.g., arr[arr > 0])` 57 | - [np.newaxis()](https://stackoverflow.com/questions/46334014/np-reshapex-1-1-vs-x-np-newaxis) 58 | 59 | ```python 60 | """ Attributes of arrays """ 61 | x = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array 62 | x.ndim # 3 63 | x.shape # (3, 4, 5) 64 | x.size # 60 = 3*4*5 65 | x.dtype # dtype: int64 66 | x.nbytes # total size of array in bytes 67 | ``` 68 | ```python 69 | """ Indexing of arrays """ 70 | # Same as in python lists, but beware if you insert float into int, the result will be int 71 | x[0][0][1] or x[0, 0, 1] 72 | ``` 73 | ```python 74 | """ Slicing of arrays """ 75 | 
# Same as python lists 76 | # NOTE! Multidimensional slices work in the same way, with multiple slices separated by commas. 77 | x[start:stop:step] 78 | ``` 79 | - NOTE! NumPy arrays return the *view* of original array after slicing. So, when we modify our sliced array it will affect to original array. Use **copy()** method when you don't want it. `x_copy = x[:2, :2].copy()` 80 | 81 | ```python 82 | """ Reshaping of Arrays """ 83 | # reshape() method 84 | np.arange(1, 10).reshape((3, 3)) 85 | ``` 86 | ```python 87 | """ Array Concatenation """ 88 | x = np.array([1, 2, 3]) 89 | y = np.array([3, 2, 1]) 90 | 91 | grid = np.array([[9, 8, 7],[6, 5, 4]]) 92 | 93 | np.concatenate([x, y]) # axis=1 same as x axis, then it will concatenated horizontally 94 | 95 | # If working with different dimensions 96 | np.vstack([x, grid]) 97 | np.hstack([grid, y]) 98 | # np.dstack will stack arrays along the third axis 99 | 100 | """ Splitting of arrays """ 101 | # np.split, np.hsplit, np.vsplit 102 | x = [1, 2, 3, 99, 99, 3, 2, 1] 103 | x1, x2, x3 = np.split(x, [3, 5]) # we give splitting points 104 | print(x1, x2, x3) # [1 2 3] [99 99] [3 2 1] # N --> N+1 subarray 105 | ``` 106 | 107 | ### Computation on NumPy Arrays 108 | - *unary ufuncs*, operate on a single input, and *binary ufuncs*, operate on two inputs 109 | ``` 110 | + np.add Addition (e.g., 1 + 1 = 2) 111 | - np.subtract Subtraction (e.g., 3 - 2 = 1) 112 | - np.negative Unary negation (e.g., -2) 113 | * np.multiply Multiplication (e.g., 2 * 3 = 6) 114 | / np.divide Division (e.g., 3 / 2 = 1.5) 115 | // np.floor_divide Floor division (e.g., 3 // 2 = 1) 116 | ** np.power Exponentiation (e.g., 2 ** 3 = 8) 117 | % np.mod Modulus/remainder (e.g., 9 % 4 = 1) 118 | 119 | np.abs(x) 120 | np.sin(x), np.cos(x), np.tan(x) 121 | np.log(x), np.log2(x), np.log10(x) 122 | np.exp(x) e^x 123 | np.exp2(x) 2^x 124 | np.power(3, x) 3^x 125 | np.expm1(x) exp(x) - 1 126 | np.log1p(x) log(1 + x) 127 | ``` 128 | ```python 129 | x = np.arange(1, 6) 130 | np.add.reduce(x) # 15, sum of all elements 131 | np.multiply.reduce(x) # 120, mulitplication of all elements 132 | 133 | np.add.accumulate(x) # array([ 1, 3, 6, 10, 15]), intermediate result 134 | np.multiply.accumulate(x) # array([ 1, 2, 6, 24, 120]) 135 | 136 | np.multiply.outer(x, x) # N+1 dimension multiplication 137 | 138 | np.sum Compute sum of elements 139 | np.prod Compute product of elements 140 | np.mean Compute median of elements 141 | np.std Compute standard deviation 142 | np.var Compute variance 143 | np.min Find minimum value 144 | np.max Find maximum value 145 | np.argmin Find index of minimum value 146 | np.argmax Find index of maximum value 147 | np.median Compute median of elements 148 | np.percentile Compute rank-based statistics of elements np.percentile(arr, 25)) 149 | np.any Evaluate whether any elements are true 150 | np.all Evaluate whether all elements are true 151 | ``` 152 | ```python 153 | """Comparison Operators""" 154 | == np.equal 155 | != np.not_equal 156 | < np.less np.less(x, 3) is x < 3 157 | <= np.less_equal 158 | > np.greater 159 | >= np.greater_equal 160 | 161 | # Example 162 | x = np.array([1, 2, 3, 4, 5]) 163 | x < 3 # array([ True, True, False, False, False], dtype=bool) 164 | (2 * x) == (x ** 2) # array([False, True, False, False, False], dtype=bool) 165 | ``` 166 | ```python 167 | """Working with Boolean Arrays""" 168 | print(x) # [[5 0 3 3][7 9 3 5][2 4 7 6]] 169 | 170 | # Counting entries 171 | np.count_nonzero(x < 6) # 8, how many values less than 6? 
172 | np.sum(x < 6) # 8, counts elements less than 6 173 | np.sum(x < 6, axis=1) # how many values less than 6 in each row? 174 | np.any(x > 8) # are there any values greater than 8? 175 | np.all(x < 10) # are all values less than 10? 176 | np.all(x < 8, axis=1) # are all values in each row less than 8? 177 | 178 | # Boolean operators 179 | & np.bitwise_and 180 | | np.bitwise_or 181 | ^ np.bitwise_xor 182 | ~ np.bitwise_not 183 | np.sum((inches > 0.5) & (inches < 1)) # that's counts the number of elements 184 | np.sum(~( (inches <= 0.5) | (inches >= 1) )) 185 | 186 | x[x < 5] # [0 3 3 3 2 4] 187 | 188 | # Fancy indexing 189 | x = rand.randint(100, size=10) 190 | y = np.array([1, 2]) 191 | x[y] # array([92, 14]) 192 | ``` 193 | - `np.sort(x)`, `np.argsort(x)` , `np.sort(X, axis=0)` = sort each column of X 194 | - Partial Sorts: `np.partition(x, 3)` - returns 2 smallest elements to the left 195 | 196 | ```python 197 | """NumPy’s Structured Arrays: Compound data types""" 198 | name = ['Alice', 'Bob', 'Cathy', 'Doug'] 199 | age = [25, 45, 37, 19] 200 | weight = [55.0, 85.5, 68.0, 61.5] 201 | 202 | # We need to combine them 203 | x = np.zeros(4, dtype=int) 204 | data = np.zeros(4, dtype={'names':('name', 'age', 'weight'), 'formats':('U10', 'i4', 'f8')}) 205 | data['name'] = name 206 | data['age'] = age 207 | data['weight'] = weight 208 | 209 | print(data) # [('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0) ('Doug', 19, 61.5)] 210 | 211 | # Get all names 212 | data['name'] # array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype=' Int64Index([3, 5, 7], dtype='int64') 268 | indA | indB # union => Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64') 269 | indA ^ indB # symmetric difference => Int64Index([1, 2, 9, 11], dtype='int64') 270 | ``` 271 | ```python 272 | """Data Selection in Series""" 273 | data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd']) 274 | 275 | data['b'] # 0.5 276 | 'a' in data # True 277 | data.keys() 278 | data.items() # key: value 279 | data['e'] = 1.25 # We can add new item 280 | 281 | # slicing explicit, 'c' will be included 282 | data['a':'c'] 283 | 284 | # slicing implicit 285 | data[0:2] 286 | 287 | # masking 288 | data[(data > 0.3) & (data < 0.8)] 289 | 290 | # fancy indexing 291 | data[['a', 'e']] 292 | 293 | """Indexers: loc, iloc, and ix 294 | loc = allows indexing and slicing that always references the explicit index (own indexing) 295 | iloc = allows indexing and slicing that always references the implicit Python-style index (from 0) 296 | 297 | `TIP!` “Explicit is better than implicit" 298 | """ 299 | ``` 300 | ```python 301 | """Data Selection in DataFrame""" 302 | # DataFrame as a dictionary 303 | data = pd.DataFrame({'area':area, 'pop':pop}) 304 | data['area'] 305 | data.area # if name == str method then not working 306 | # Add new column 307 | data['density'] = data['pop'] / data['area'] 308 | # Access samples 309 | data.loc['Texas'] 310 | 311 | # DataFrame as two-dimensional array 312 | data.values 313 | data.T # Transpose 314 | data.iloc[:3, :2] # Chooses both row and column respectively 315 | data.loc[:'New York', :'pop'] # same as previous 316 | data.loc[data.density > 100, ['pop', 'density']] # fancy indexing 317 | # Change like this 318 | data.iloc[0, 2] = 90 319 | data[data.density > 100] 320 | ``` 321 | - Until page 114 322 | - We can perform NumPy operations over Pandas Series and Dataframe (adding, division) 323 | ```py 324 | A = pd.Series([2, 4, 6], index=[0, 1, 2]) 325 | B = pd.Series([1, 3, 5], index=[1, 2, 3]) 326 | print(A + B) 
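# Added note (not in the original): with the indexes above, A + B aligns on the
# union of index labels, and labels present in only one Series come out as NaN:
# 0    NaN
# 1    5.0
# 2    9.0
# 3    NaN
# dtype: float64
# The next line uses fill_value=0, so the missing side is treated as 0 instead,
# giving 2.0, 5.0, 9.0, 5.0.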
327 | print(A.add(B, fill_value=0)) # the set which doesn't include that index will be replaces with 0 328 | 329 | ## A.add(B) 330 | + add() 331 | - sub(), subtract() 332 | * xmul(), multiply() 333 | / truediv(), div(), divide() 334 | // floordiv() 335 | % mod() 336 | ** pow() 337 | ``` 338 | ```py 339 | """Missing Data in Pandas""" 340 | vals2 = np.array([1, np.nan, 3, 4]) 341 | np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2) 342 | 343 | # NaN and None in Pandas 344 | x = pd.Series(range(2), dtype=int) 345 | x[0] = None # Then it will be represented as NaN in DataFrame 346 | 347 | """Operating on Null Values""" 348 | isnull() # True / False for each element 349 | notnull() # opposite of isnull() 350 | dropna() # Return a filtered version of the data 351 | fillna() 352 | 353 | # Detecting null values 354 | df.isnull() 355 | data[data.notnull()] 356 | 357 | # Dropping null values 358 | data.dropna() 359 | df.dropna(axis='columns', how='all') # df.dropna(axis=1) | how='all', by default how='any' | thresh=3 360 | 361 | # Filling null values 362 | data.fillna(0) 363 | data.fillna(method='ffill') # propagate the previous value forward 364 | data.fillna(method='bfill') 365 | df.fillna(method='ffill', axis=1) # we can specify an axis along which the fills take place 366 | 367 | # NOTE 368 | df.isnull().any() 369 | df[df['SMTH'].isnull()].head() 370 | ``` 371 | ```py 372 | """Combining Datasets: Concat and Append""" 373 | np.concatenate([x, y]) # with numpy 374 | pd.concat([x, y]) # with pandas 375 | pd.concat([x, y], ignore_index=True) # ignoring the index 376 | df1.append(df2) # same as pd.concat([df1, df2]), NOT good practice 377 | 378 | """Combining Datasets: Merge and Join""" 379 | df3 = pd.merge(df1, df2) # can use when df1 and df2 have common columns PK = primary key 380 | # check 02-pandas.ipynb 381 | ``` 382 | ### GroupBy: Split, Apply, Combine 383 | - Split, apply, combine 384 | - **Functions: aggregate, filter, transform, and apply.** 385 | - The **split** step involves breaking up and grouping a DataFrame depending on the value of the specified key. 386 | - The **apply** step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups. 387 | - The **combine** step merges the results of these operations into an output array. 388 |
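To make the split, apply, combine idea concrete, here is a minimal sketch (added for illustration; the `key`/`data` column names are made up, not taken from the notes above):

```py
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'data': [1, 2, 3, 4]})

df.groupby('key').sum()   # split on 'key', apply sum() per group, combine the results
#      data
# key
# A       4
# B       6
```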
389 | 390 | - We need to apply any *Aggregation* funcs from Pandas and NumPy, like `df.groupby('key').sum()` 391 | - `n_by_state = df.groupby("state")["last_name"].count()` You call `.groupby()` and pass the name of the column you want to group on, which is ``"state"``. Then, you use `["last_name"]` to specify the columns on which you want to perform the actual aggregation. 392 | 393 | ```py 394 | # Column indexing 395 | # https://realpython.com/pandas-groupby/ 396 | n_by_state = df.groupby("state")["last_name"].count() 397 | df.groupby(["state", "gender"])["last_name"].count() # for multiple, as_index=False 398 | 399 | # Dispatch methods 400 | planets.groupby('method')['year'].describe().unstack() 401 | 402 | # Aggregation 403 | df.groupby('key').aggregate(['min', np.median, max]) 404 | df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'}) # even we can specify 405 | 406 | # Filtering 407 | def filter_func(x): 408 | return x['data2'].std() > 4 409 | 410 | print(df) 411 | print(df.groupby('key').std()) 412 | print(df.groupby('key').filter(filter_func)) 413 | 414 | # Transformation 415 | df.groupby('key').transform(lambda x: x - x.mean()) 416 | 417 | # The apply() method - we can app;y arbitary function 418 | def norm_by_data2(x): 419 | # x is a DataFrame of group values 420 | x['data1'] /= x['data2'].sum() 421 | return x 422 | print(df); print(df.groupby('key').apply(norm_by_data2)) 423 | ``` 424 | ```py 425 | """High-Performance Pandas: eval() and query()""" 426 | """eval()""" 427 | # Operators 428 | result1 = -df1 * df2 / (df3 + df4) - df5 429 | result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5') 430 | 431 | # With dataframe 432 | result1 = (df['A'] + df['B']) / (df['C'] - 1) 433 | result2 = pd.eval("(df.A + df.B) / (df.C - 1)") 434 | df.eval('D = (A + B) / C', inplace=True) # We can even perform on DF object 435 | 436 | column_mean = df.mean(1) 437 | result1 = df['A'] + column_mean 438 | result2 = df.eval('A + @column_mean') 439 | 440 | """query()""" 441 | result1 = df[(df.A < 0.5) & (df.B < 0.5)] 442 | result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]') 443 | result3 = df.eval('A < 0.5 and B < 0.5') # do not work with DF, so we need query 444 | result4 = df.query('A < 0.5 and B < 0.5') 445 | ``` 446 | 447 | ## CHAPTER 4: Visualization with Matplotlib 448 | ```py 449 | """Line""" 450 | plt.plot(x, np.sin(x), linestyle='-g') # -, --, -., :, -g = solid green 451 | plt.axis([-1, 11, -1.5, 1.5]) # [xmin, xmax, ymin, ymax] 452 | plt.title("A Sine Curve") 453 | plt.xlabel("x") 454 | plt.ylabel("sin(x)") 455 | 456 | # When multiple lines 457 | plt.plot(x, np.sin(x), '-g', label='sin(x)') 458 | plt.plot(x, np.cos(x), ':b', label='cos(x)') 459 | plt.axis('equal') 460 | plt.legend() 461 | 462 | """Scatter""" 463 | plt.scatter(x, y) # marker='o' 464 | 465 | """Histogram""" 466 | data = np.random.randn(1000) 467 | plt.hist(data) 468 | ``` 469 | 470 | ## Machine Learning 471 | - **Classification: Predicting discrete labels** 472 | - Some important classification algorithms 473 | - Naive Bayes 474 | - Support Vector Machines 475 | - Decision Trees and Random Forests 476 | - **Regression: Predicting continuous labels** 477 | - Some important regression algorithms 478 | - Linear Regression 479 | - Support Vector Machines 480 | - Decision Trees and Random Forests 481 | - **Clustering: Inferring labels on unlabeled data** 482 | - k-Means Clustering 483 | - Gaussian Mixture Models 484 | - **Dimensionality reduction: Inferring structure of unlabeled data** 485 | - Principal Component Analysis 
(PCA) 486 | - Manifold Learning -------------------------------------------------------------------------------- /numpy-pandas/data/state-abbrevs.csv: -------------------------------------------------------------------------------- 1 | "state","abbreviation" 2 | "Alabama","AL" 3 | "Alaska","AK" 4 | "Arizona","AZ" 5 | "Arkansas","AR" 6 | "California","CA" 7 | "Colorado","CO" 8 | "Connecticut","CT" 9 | "Delaware","DE" 10 | "District of Columbia","DC" 11 | "Florida","FL" 12 | "Georgia","GA" 13 | "Hawaii","HI" 14 | "Idaho","ID" 15 | "Illinois","IL" 16 | "Indiana","IN" 17 | "Iowa","IA" 18 | "Kansas","KS" 19 | "Kentucky","KY" 20 | "Louisiana","LA" 21 | "Maine","ME" 22 | "Montana","MT" 23 | "Nebraska","NE" 24 | "Nevada","NV" 25 | "New Hampshire","NH" 26 | "New Jersey","NJ" 27 | "New Mexico","NM" 28 | "New York","NY" 29 | "North Carolina","NC" 30 | "North Dakota","ND" 31 | "Ohio","OH" 32 | "Oklahoma","OK" 33 | "Oregon","OR" 34 | "Maryland","MD" 35 | "Massachusetts","MA" 36 | "Michigan","MI" 37 | "Minnesota","MN" 38 | "Mississippi","MS" 39 | "Missouri","MO" 40 | "Pennsylvania","PA" 41 | "Rhode Island","RI" 42 | "South Carolina","SC" 43 | "South Dakota","SD" 44 | "Tennessee","TN" 45 | "Texas","TX" 46 | "Utah","UT" 47 | "Vermont","VT" 48 | "Virginia","VA" 49 | "Washington","WA" 50 | "West Virginia","WV" 51 | "Wisconsin","WI" 52 | "Wyoming","WY" -------------------------------------------------------------------------------- /numpy-pandas/data/state-areas.csv: -------------------------------------------------------------------------------- 1 | state,area (sq. mi) 2 | Alabama,52423 3 | Alaska,656425 4 | Arizona,114006 5 | Arkansas,53182 6 | California,163707 7 | Colorado,104100 8 | Connecticut,5544 9 | Delaware,1954 10 | Florida,65758 11 | Georgia,59441 12 | Hawaii,10932 13 | Idaho,83574 14 | Illinois,57918 15 | Indiana,36420 16 | Iowa,56276 17 | Kansas,82282 18 | Kentucky,40411 19 | Louisiana,51843 20 | Maine,35387 21 | Maryland,12407 22 | Massachusetts,10555 23 | Michigan,96810 24 | Minnesota,86943 25 | Mississippi,48434 26 | Missouri,69709 27 | Montana,147046 28 | Nebraska,77358 29 | Nevada,110567 30 | New Hampshire,9351 31 | New Jersey,8722 32 | New Mexico,121593 33 | New York,54475 34 | North Carolina,53821 35 | North Dakota,70704 36 | Ohio,44828 37 | Oklahoma,69903 38 | Oregon,98386 39 | Pennsylvania,46058 40 | Rhode Island,1545 41 | South Carolina,32007 42 | South Dakota,77121 43 | Tennessee,42146 44 | Texas,268601 45 | Utah,84904 46 | Vermont,9615 47 | Virginia,42769 48 | Washington,71303 49 | West Virginia,24231 50 | Wisconsin,65503 51 | Wyoming,97818 52 | District of Columbia,68 53 | Puerto Rico,3515 54 | -------------------------------------------------------------------------------- /numpy-pandas/img/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/numpy-pandas/img/.DS_Store -------------------------------------------------------------------------------- /numpy-pandas/img/axis=1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/numpy-pandas/img/axis=1.jpg -------------------------------------------------------------------------------- /numpy-pandas/img/groupby.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/numpy-pandas/img/groupby.png -------------------------------------------------------------------------------- /numpy-pandas/plt1.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | import numpy as np 3 | 4 | x = np.linspace(0, 10, 100) 5 | 6 | plt.plot(x, np.sin(x)) 7 | plt.plot(x, np.cos(x)) 8 | 9 | plt.show() -------------------------------------------------------------------------------- /numpy-pandas/very-basics/Readme.md: -------------------------------------------------------------------------------- 1 | # [Python for Data Science Very Basics](https://www.sololearn.com/learning/1161) 2 | 3 | > Math Operations with NumPy 4 | > Data Manipulation with Pandas 5 | > Visualization with Matplotlib 6 | 7 | ## Statistics 8 | - **mean:** the average of the values. 9 | - **median:** the middle value. 10 | - **standard deviation:** the measure of spread, the square root of **variance**. 11 | - **variance:** average of the squared differences from the mean. 12 | - One standard deviation from the mean - is the values `from (mean-std) to (mean+std)` 13 | 14 | ## Math Operations with NumPy 15 | ```python 16 | # We can use Python Lists to create NumPy arrays 17 | x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) 18 | 19 | # Size, dimentionality, shape of array 20 | print(x[1][2]) # 6 21 | print(x.ndim) # 2 22 | print(x.size) # 9 23 | print(x.shape) # (3, 3) 24 | 25 | x = np.array([2, 1, 3]) 26 | x = np.append(x, 4) # [2, 1, 3, 4] 27 | x = np.delete(x, 0) # Takes index 28 | x = np.sort(x) 29 | 30 | # Similar to python range() 31 | x = np.arange(2, 10, 3) # [2, 5, 8] 32 | 33 | # Reshaping the array 34 | x = np.reshape(3, 1) # [[2], [5], [8]] 35 | 36 | # Indexing and slicing 37 | # Same as python lists [-1], [0:4] 38 | 39 | # Conditions 40 | y = x[x<4] # Select element that are less than 4 41 | y = x[(x>5) & (x%2==0)] # & (and), | (or) 42 | 43 | # Operations 44 | y = x.sum() 45 | y = x.min() 46 | y = x.max() 47 | y = x*2 # Broadcasting used 48 | 49 | # Statistics 50 | np.mean(x) 51 | np.median(x) 52 | np.var(x) 53 | np.std(x) 54 | ``` 55 | ```python 56 | # https://www.sololearn.com/learning/eom-project/1161/1156 57 | # One standart devisation from the mean 58 | import numpy as np 59 | 60 | data = np.array([150000, 125000, 320000, 540000, 200000, 120000, 160000, 230000, 280000, 290000, 300000, 500000, 420000, 100000, 150000, 280000]) 61 | 62 | mean_h = np.mean(data) 63 | std_h = np.std(data) 64 | 65 | low, high = mean_h - std_h, mean_h + std_h 66 | 67 | count = len([v for v in data if low < v < high]) 68 | res = count * 100 / len(data) 69 | print(res) 70 | ``` 71 | 72 | ## Data Manipulation with Pandas 73 | - Built on top of **NumPy** = "numerical python", **Pandas** = "panel data" 74 | - Used to read and extract data from files, transform and analyze it, calculate statistics and correlations. 75 | - **Series** and **DataFrame**. A **Series** is essentially a column, and a **DataFrame** is a multi-dimensional table made up of a collection of Series. 76 | - `loc` explicit indexing (own indexing), `iloc` implicit indexing (0, 1, 2, 3) 77 | ```python 78 | # Dictionary used to create DataFrame (DF) 79 | data = { 80 | 'ages': [14, 18, 24, 42], 81 | 'heights': [165, 180, 176, 184] 82 | } 83 | 84 | df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave']) # You can specify `index` if you want 85 | 86 | # How to access row? 
87 | y = df.loc["Bob"] # df.loc[1] 88 | 89 | # Indexing 90 | z = df["ages"] # Series 91 | z = df[["ages", "heights"]] # DataFrame, pay attention to brackets 92 | 93 | # Slicing 94 | # iloc[], same as in python lists 95 | print(df.iloc[2]) # third row 96 | print(df.iloc[:3]) # first 3 rows 97 | print(df.iloc[1:3]) # rows 2 to 3 98 | print(df.iloc[-3:]) # accessing last three rows 99 | 100 | # Conditons 101 | z = df[(df['ages']>18) & (df['heights']>180)] 102 | ``` 103 | ```python 104 | # Reading data 105 | df = pd.read_csv("test.csv") 106 | 107 | df.head() # First five rows 108 | df.tail() # Last five rows 109 | 110 | df.info() 111 | df.describe() # Statistics: mean, min, max, percentiles. We can get for a single column too df['cases'].describe() 112 | 113 | df.set_index("date", inplace=True) # Set as the index the "data" column 114 | # inplace=True used to change the currect dataframe without assigning to new 115 | ``` 116 | ```python 117 | # Creating a column 118 | df['area'] = df['height'] * df['width'] 119 | df['month'] = pd.to_datetime(df['date'], format="%d.%m.%y").dt.month_name() 120 | 121 | # Droping a column 122 | df.drop("state", axis=1, inplace=True) 123 | # axis=1 specifies that we want to drop a column. 124 | # axis=0 will drop a row. 125 | ``` 126 | ```python 127 | # Grouping 128 | z = df['month'].value_counts() 129 | 130 | z = df.groupby('month')['cases'].sum() 131 | 132 | z = df['cases'].sum() # max(), min(), mean() 133 | ``` 134 | ```python 135 | """COVID Data Analysis""" 136 | import pandas as pd 137 | 138 | df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv") 139 | 140 | df.drop('state', axis=1, inplace=True) 141 | df.set_index('date', inplace=True) 142 | 143 | df['ratio'] = df['deaths'] / df['cases'] 144 | 145 | largest = df.loc[df['ratio'] == df['ratio'].max()] # df.loc[df['ratio'].max()] we cannot do that 146 | print(largest) 147 | ``` 148 | 149 | ## Visualization with Matplotlib 150 | - https://www.w3schools.com/python/matplotlib_intro.asp 151 | - **Matplotlib** is a library used to create graphs, charts, and figures. It also provides functions to customize your figures by changing the colors, labels, etc. 152 | - **Matplotlib** works really well with **Pandas**! **Pandas** works well with **NumPy**. 153 | ```py 154 | import matplotlib.pyplot as plt 155 | import pandas as pd 156 | 157 | s = pd.Series([18, 42, 9, 32, 81, 64, 3]) 158 | s.plot(kind='bar') 159 | plt.savefig('plot.png') 160 | ``` 161 | - Data = Y axis, index = X axis. 162 | ```py 163 | """Line Plot""" 164 | import pandas as pd 165 | import matplotlib.pyplot as plt 166 | 167 | df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv") 168 | df.rdop('state', axis=1, inplace=True) 169 | df['date'] = pd.to_datetime(df['date'], format="%d.%m.%y") 170 | df['month'] = df['date'].dt.month 171 | df.set_index('date', inplace=True) 172 | 173 | df[df['month']==12]['cases'].plot() 174 | # Multiple lines 175 | # (df[df['month']==12])[['cases', 'deaths']].plot() 176 | ``` 177 | ```py 178 | """Bar Plot""" 179 | (df.groupby('month')['cases'].sum()).plot(kind="bar") # barh = horizontal bar 180 | # OR 181 | # df = df.groupby('month') 182 | # df['cases'].sum().plot(kind="bar") 183 | ``` 184 | ```py 185 | """Box Plot""" 186 | df[df["month"]==6]["cases"].plot(kind="box") 187 | ``` 188 | ```py 189 | """Histogram""" 190 | df[df["month"]==6]["cases"].plot(kind="hist") 191 | ``` 192 | - A **histogram** is a graph showing *frequency* distributions. 
Similar to box plots, **histograms** show the distribution of data. 193 | Visually histograms are similar to bar charts, however, histograms display frequencies for a group of data rather than an individual data point; therefore, no spaces are present between the bars. 194 | ```py 195 | """Area Plot""" 196 | df[df["month"]==6][["cases", "deaths"]].plot(kind="area", stacked=False) 197 | ``` 198 | ```py 199 | """Scatter Plot""" 200 | df[df["month"]==6][["cases", "deaths"]].plot(kind="scatter", x='cases', y='deaths') 201 | ``` 202 | ```py 203 | """Pie Chart""" 204 | df.groupby('month')['cases'].sum().plot(kind="pie") 205 | ``` 206 | ```py 207 | """Plot formatting""" 208 | df[['cases', 'deaths']].plot(kind="area", legend=True, stacked=False, color=['#1970E7', '#E73E19']) 209 | plt.xlabel('Days in June') 210 | plt.ylabel('Number') 211 | plt.suptitle("COVID-19 in June") 212 | ``` -------------------------------------------------------------------------------- /numpy-pandas/very-basics/img/plt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/numpy-pandas/very-basics/img/plt.png -------------------------------------------------------------------------------- /scikit-learn/Readme.md: -------------------------------------------------------------------------------- 1 | # Scikit-Learn 2 | 3 | - [freeCodeCamp.org](https://youtu.be/0B5eIE_1vpU) 4 | - https://inria.github.io/scikit-learn-mooc/ 5 | - https://scikit-learn.org/stable/tutorial/index.html 6 | - https://machinelearningmastery.com/start-here/ 7 | 8 | 9 | 10 | 11 | 12 | ### How to save / upload model 13 | ```py 14 | import joblib 15 | 16 | model = joblib.load('model.sav') # Load the model 17 | joblib.dump(model, 'model.sav') # Save the model 18 | ``` 19 | 20 | ### K-Nearest Neighbors (KNN) 21 | > [Notebook](knn.ipynb) 22 | - Measured with Euclidean or Manhattan [distance](https://www.analyticsvidhya.com/blog/2020/02/4-types-of-distance-metrics-in-machine-learning/) 23 | - For **KNN regressor** you take the average of `n_neighbors=23` nearest neighbours 24 | - For **KNN classifier** you take the mood of `n_neighbors=23` nearest neighbours 25 | 26 | ### SVM 27 | > [Notebook](svm.ipynb) 28 | - `support vectors`, `hyperplane`, `margin`, `linear seperable`, `non-linear seperable` 29 | - Our goal is to **maximize** the **margin** (distance between marginal hyperplanes) 30 | - **SVM kernels** - transforms from low-dimension to high-dimension 31 | 32 | ### K-Means Clustering 33 | 1. Select **K** value - centroid 34 | 2. Initialize centroids randomly 35 | 3. Calculate **Euclidean distance** between two points 36 | 4. Select the group and find the **mean** 37 | 5. Move controid to that mean 38 | 39 | - How to select **K**? 
40 | - Elbow method -------------------------------------------------------------------------------- /scikit-learn/img/process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/scikit-learn/img/process.png -------------------------------------------------------------------------------- /scikit-learn/img/process1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/scikit-learn/img/process1.png -------------------------------------------------------------------------------- /scikit-learn/img/scikit-learn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/scikit-learn/img/scikit-learn.png -------------------------------------------------------------------------------- /scikit-learn/k-means-clustering.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language_info": { 4 | "codemirror_mode": { 5 | "name": "ipython", 6 | "version": 3 7 | }, 8 | "file_extension": ".py", 9 | "mimetype": "text/x-python", 10 | "name": "python", 11 | "nbconvert_exporter": "python", 12 | "pygments_lexer": "ipython3", 13 | "version": "3.8.6" 14 | }, 15 | "orig_nbformat": 4, 16 | "kernelspec": { 17 | "name": "python3", 18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)" 19 | }, 20 | "interpreter": { 21 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a" 22 | } 23 | }, 24 | "nbformat": 4, 25 | "nbformat_minor": 2, 26 | "cells": [ 27 | { 28 | "cell_type": "code", 29 | "execution_count": 1, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "from sklearn.datasets import load_breast_cancer\n", 34 | "from sklearn.cluster import KMeans\n", 35 | "from sklearn.model_selection import train_test_split\n", 36 | "from sklearn.metrics import accuracy_score\n", 37 | "from sklearn.preprocessing import scale\n", 38 | "\n", 39 | "import numpy as np \n", 40 | "import pandas as pd \n" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 3, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "output_type": "execute_result", 50 | "data": { 51 | "text/plain": [ 52 | "{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,\n", 53 | " 1.189e-01],\n", 54 | " [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,\n", 55 | " 8.902e-02],\n", 56 | " [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,\n", 57 | " 8.758e-02],\n", 58 | " ...,\n", 59 | " [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,\n", 60 | " 7.820e-02],\n", 61 | " [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,\n", 62 | " 1.240e-01],\n", 63 | " [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,\n", 64 | " 7.039e-02]]),\n", 65 | " 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,\n", 66 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,\n", 67 | " 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,\n", 68 | " 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,\n", 69 | " 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,\n", 70 | " 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 
0,\n", 71 | " 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,\n", 72 | " 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,\n", 73 | " 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,\n", 74 | " 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,\n", 75 | " 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,\n", 76 | " 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 77 | " 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,\n", 78 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1,\n", 79 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0,\n", 80 | " 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,\n", 81 | " 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,\n", 82 | " 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1,\n", 83 | " 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0,\n", 84 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1,\n", 85 | " 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,\n", 86 | " 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,\n", 87 | " 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1,\n", 88 | " 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,\n", 89 | " 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", 90 | " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]),\n", 91 | " 'frame': None,\n", 92 | " 'target_names': array(['malignant', 'benign'], dtype='\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
\n" 322 | }, 323 | "metadata": {}, 324 | "execution_count": 21 325 | } 326 | ], 327 | "source": [ 328 | "# SOMETIMES IT MAY FLIP THE CLUSTERS, THEN WE MUST USE\n", 329 | "\n", 330 | "pd.crosstab(y_train, labels)" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": null, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [] 339 | } 340 | ] 341 | } -------------------------------------------------------------------------------- /scikit-learn/knn.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language_info": { 4 | "codemirror_mode": { 5 | "name": "ipython", 6 | "version": 3 7 | }, 8 | "file_extension": ".py", 9 | "mimetype": "text/x-python", 10 | "name": "python", 11 | "nbconvert_exporter": "python", 12 | "pygments_lexer": "ipython3", 13 | "version": "3.8.6" 14 | }, 15 | "orig_nbformat": 4, 16 | "kernelspec": { 17 | "name": "python3", 18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)" 19 | }, 20 | "interpreter": { 21 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a" 22 | } 23 | }, 24 | "nbformat": 4, 25 | "nbformat_minor": 2, 26 | "cells": [ 27 | { 28 | "cell_type": "code", 29 | "execution_count": 14, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "import numpy as np \n", 34 | "import pandas as pd \n", 35 | "from sklearn import neighbors, metrics\n", 36 | "from sklearn.model_selection import train_test_split\n", 37 | "from sklearn.preprocessing import LabelEncoder" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 15, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "output_type": "execute_result", 47 | "data": { 48 | "text/plain": [ 49 | " buying maint doors persons lug_boot safety class\n", 50 | "0 vhigh vhigh 2 2 small low unacc\n", 51 | "1 vhigh vhigh 2 2 small med unacc\n", 52 | "2 vhigh vhigh 2 2 small high unacc\n", 53 | "3 vhigh vhigh 2 2 med low unacc\n", 54 | "4 vhigh vhigh 2 2 med med unacc" 55 | ], 56 | "text/html": "
\n
" 57 | }, 58 | "metadata": {}, 59 | "execution_count": 15 60 | } 61 | ], 62 | "source": [ 63 | "data = pd.read_csv(\"car_evaluation.csv\")\n", 64 | "data.head()" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 16, 70 | "metadata": {}, 71 | "outputs": [ 72 | { 73 | "output_type": "execute_result", 74 | "data": { 75 | "text/plain": [ 76 | " buying maint safety\n", 77 | "0 vhigh vhigh low\n", 78 | "1 vhigh vhigh med\n", 79 | "2 vhigh vhigh high\n", 80 | "3 vhigh vhigh low\n", 81 | "4 vhigh vhigh med" 82 | ], 83 | "text/html": "
\n
" 84 | }, 85 | "metadata": {}, 86 | "execution_count": 16 87 | } 88 | ], 89 | "source": [ 90 | "# Select features\n", 91 | "X = data[['buying', 'maint', 'safety']]\n", 92 | "X.head()" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 17, 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "output_type": "execute_result", 102 | "data": { 103 | "text/plain": [ 104 | "0 unacc\n", 105 | "1 unacc\n", 106 | "2 unacc\n", 107 | "3 unacc\n", 108 | "4 unacc\n", 109 | "Name: class, dtype: object" 110 | ] 111 | }, 112 | "metadata": {}, 113 | "execution_count": 17 114 | } 115 | ], 116 | "source": [ 117 | "# Select the label\n", 118 | "y = data['class']\n", 119 | "y.head()" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 18, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "output_type": "execute_result", 129 | "data": { 130 | "text/plain": [ 131 | "array([['vhigh', 'vhigh', 'low'],\n", 132 | " ['vhigh', 'vhigh', 'med'],\n", 133 | " ['vhigh', 'vhigh', 'high'],\n", 134 | " ...,\n", 135 | " ['low', 'low', 'low'],\n", 136 | " ['low', 'low', 'med'],\n", 137 | " ['low', 'low', 'high']], dtype=object)" 138 | ] 139 | }, 140 | "metadata": {}, 141 | "execution_count": 18 142 | } 143 | ], 144 | "source": [ 145 | "X = X.values # NumPy array\n", 146 | "X" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 19, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "output_type": "stream", 156 | "name": "stdout", 157 | "text": [ 158 | "(1728, 3)\n['vhigh' 'vhigh' 'vhigh' ... 'low' 'low' 'low']\n['vhigh' 'vhigh' 'vhigh' ... 'low' 'low' 'low']\n['low' 'med' 'high' ... 'low' 'med' 'high']\n" 159 | ] 160 | }, 161 | { 162 | "output_type": "execute_result", 163 | "data": { 164 | "text/plain": [ 165 | "array([[3, 3, 1],\n", 166 | " [3, 3, 2],\n", 167 | " [3, 3, 0],\n", 168 | " [3, 3, 1],\n", 169 | " [3, 3, 2]], dtype=object)" 170 | ] 171 | }, 172 | "metadata": {}, 173 | "execution_count": 19 174 | } 175 | ], 176 | "source": [ 177 | "\"\"\" \n", 178 | "Now we have the problem: our data consists of strings, we need to convert into nums with LabelEncoder\n", 179 | "\"\"\"\n", 180 | "# X conversion\n", 181 | "print(X.shape)\n", 182 | "\n", 183 | "for i in range(X.shape[1]): # 3\n", 184 | " print(X[:, i]) # Selects the first element for 3 columns\n", 185 | "\n", 186 | "LE = LabelEncoder()\n", 187 | "for i in range(len(X[0])):\n", 188 | " X[:, i] = LE.fit_transform(X[:, i])\n", 189 | "\n", 190 | "X[:5] # vhigh=3, med=2, low=1, high=0" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 21, 196 | "metadata": {}, 197 | "outputs": [ 198 | { 199 | "output_type": "execute_result", 200 | "data": { 201 | "text/plain": [ 202 | "array([0, 0, 0, ..., 0, 2, 3])" 203 | ] 204 | }, 205 | "metadata": {}, 206 | "execution_count": 21 207 | } 208 | ], 209 | "source": [ 210 | "# y conversion\n", 211 | "label_mapping = {\n", 212 | " 'unacc':0,\n", 213 | " 'acc':1,\n", 214 | " 'good':2,\n", 215 | " 'vgood':3,\n", 216 | "}\n", 217 | "\n", 218 | "y = y.map(label_mapping)\n", 219 | "y = np.array(y)\n", 220 | "y" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 29, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "output_type": "execute_result", 230 | "data": { 231 | "text/plain": [ 232 | "KNeighborsClassifier(n_neighbors=23)" 233 | ] 234 | }, 235 | "metadata": {}, 236 | "execution_count": 29 237 | } 238 | ], 239 | "source": [ 240 | "# KNN Model\n", 241 | "X_train, X_test, y_train, y_test = train_test_split(X, y, 
test_size=0.2) # 20% to test set\n", 242 | "\n", 243 | "knn = neighbors.KNeighborsClassifier(n_neighbors=23, weights='uniform')\n", 244 | "knn.fit(X_train, y_train)" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 30, 250 | "metadata": {}, 251 | "outputs": [ 252 | { 253 | "output_type": "execute_result", 254 | "data": { 255 | "text/plain": [ 256 | "0.7485549132947977" 257 | ] 258 | }, 259 | "metadata": {}, 260 | "execution_count": 30 261 | } 262 | ], 263 | "source": [ 264 | "predictions = knn.predict(X_test)\n", 265 | "accuracy = metrics.accuracy_score(y_test, predictions)\n", 266 | "accuracy" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 31, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "output_type": "execute_result", 276 | "data": { 277 | "text/plain": [ 278 | "array([0, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,\n", 279 | " 0, 0, 0, 1, 0, 0, 0, 2, 1, 0, 0, 1, 2, 0, 1, 2, 2, 0, 1, 0, 0, 0,\n", 280 | " 0, 0, 1, 0, 0, 2, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 2, 1,\n", 281 | " 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 1,\n", 282 | " 1, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,\n", 283 | " 0, 2, 0, 0, 0, 0, 2, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,\n", 284 | " 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 1, 1, 0, 1, 0, 0, 2,\n", 285 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,\n", 286 | " 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,\n", 287 | " 2, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,\n", 288 | " 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,\n", 289 | " 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0,\n", 290 | " 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", 291 | " 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,\n", 292 | " 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 2, 2, 0, 0, 0, 1, 1, 1, 1,\n", 293 | " 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1])" 294 | ] 295 | }, 296 | "metadata": {}, 297 | "execution_count": 31 298 | } 299 | ], 300 | "source": [ 301 | "predictions" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "# For KNN regressor you take the average of n_neighbors = 23 nearest neighbours\n", 311 | "# For KNN classifier you take the mood of n_neighbors = 23 nearest neighbours" 312 | ] 313 | } 314 | ] 315 | } -------------------------------------------------------------------------------- /scikit-learn/logistic_regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language_info": { 4 | "codemirror_mode": { 5 | "name": "ipython", 6 | "version": 3 7 | }, 8 | "file_extension": ".py", 9 | "mimetype": "text/x-python", 10 | "name": "python", 11 | "nbconvert_exporter": "python", 12 | "pygments_lexer": "ipython3", 13 | "version": 3 14 | }, 15 | "orig_nbformat": 4 16 | }, 17 | "nbformat": 4, 18 | "nbformat_minor": 2, 19 | "cells": [ 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "\"\"\"Logistic regression\"\"\"\n", 27 | "\n" 28 | ] 29 | } 30 | ] 31 | } -------------------------------------------------------------------------------- /scikit-learn/svm.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": 
{ 3 | "language_info": { 4 | "codemirror_mode": { 5 | "name": "ipython", 6 | "version": 3 7 | }, 8 | "file_extension": ".py", 9 | "mimetype": "text/x-python", 10 | "name": "python", 11 | "nbconvert_exporter": "python", 12 | "pygments_lexer": "ipython3", 13 | "version": "3.8.6" 14 | }, 15 | "orig_nbformat": 4, 16 | "kernelspec": { 17 | "name": "python3", 18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)" 19 | }, 20 | "interpreter": { 21 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a" 22 | } 23 | }, 24 | "nbformat": 4, 25 | "nbformat_minor": 2, 26 | "cells": [ 27 | { 28 | "cell_type": "code", 29 | "execution_count": 15, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "from sklearn import datasets\n", 34 | "from sklearn.model_selection import train_test_split\n", 35 | "from sklearn.metrics import accuracy_score\n", 36 | "from sklearn import svm\n", 37 | "\n", 38 | "import numpy as np" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 9, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "iris = datasets.load_iris()\n", 48 | "classes = ['Iris Setosa', 'Iris Versicolour', 'Iris Virginica']\n", 49 | "\n", 50 | "# Split into features and labels\n", 51 | "X = iris.data\n", 52 | "y = iris.target" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 10, 58 | "metadata": {}, 59 | "outputs": [ 60 | { 61 | "output_type": "stream", 62 | "name": "stdout", 63 | "text": [ 64 | "[[5.1 3.5 1.4 0.2]\n [4.9 3. 1.4 0.2]\n [4.7 3.2 1.3 0.2]\n [4.6 3.1 1.5 0.2]\n [5. 3.6 1.4 0.2]]\n[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n 2 2]\n" 65 | ] 66 | } 67 | ], 68 | "source": [ 69 | "print(X[:5]) # NumPy array\n", 70 | "print(y) # So we have 3 labels" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 11, 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "output_type": "stream", 80 | "name": "stdout", 81 | "text": [ 82 | "(150, 4)\n150\n" 83 | ] 84 | } 85 | ], 86 | "source": [ 87 | "print(X.shape) \n", 88 | "print(len(y))" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 12, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 20% to test set" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 13, 103 | "metadata": {}, 104 | "outputs": [ 105 | { 106 | "output_type": "execute_result", 107 | "data": { 108 | "text/plain": [ 109 | "SVC()" 110 | ] 111 | }, 112 | "metadata": {}, 113 | "execution_count": 13 114 | } 115 | ], 116 | "source": [ 117 | "model = svm.SVC() # Classifier\n", 118 | "model.fit(X_train, y_train)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 18, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "output_type": "execute_result", 128 | "data": { 129 | "text/plain": [ 130 | "0.9333333333333333" 131 | ] 132 | }, 133 | "metadata": {}, 134 | "execution_count": 18 135 | } 136 | ], 137 | "source": [ 138 | "predictions = model.predict(X_test)\n", 139 | "acc = accuracy_score(predictions, y_test)\n", 140 | "acc" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 19, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "output_type": 
"execute_result", 150 | "data": { 151 | "text/plain": [ 152 | "array([0, 2, 0, 1, 1, 2, 2, 1, 2, 1, 0, 2, 0, 0, 1, 1, 2, 2, 2, 2, 1, 0,\n", 153 | " 1, 0, 0, 2, 1, 2, 1, 1])" 154 | ] 155 | }, 156 | "metadata": {}, 157 | "execution_count": 19 158 | } 159 | ], 160 | "source": [ 161 | "predictions" 162 | ] 163 | } 164 | ] 165 | } -------------------------------------------------------------------------------- /scikit-learn/train_test_split.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language_info": { 4 | "codemirror_mode": { 5 | "name": "ipython", 6 | "version": 3 7 | }, 8 | "file_extension": ".py", 9 | "mimetype": "text/x-python", 10 | "name": "python", 11 | "nbconvert_exporter": "python", 12 | "pygments_lexer": "ipython3", 13 | "version": "3.8.6" 14 | }, 15 | "orig_nbformat": 4, 16 | "kernelspec": { 17 | "name": "python3", 18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)" 19 | }, 20 | "interpreter": { 21 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a" 22 | } 23 | }, 24 | "nbformat": 4, 25 | "nbformat_minor": 2, 26 | "cells": [ 27 | { 28 | "cell_type": "code", 29 | "execution_count": 15, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "from sklearn import datasets\n", 34 | "from sklearn.model_selection import train_test_split\n", 35 | "import numpy as np" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 2, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "iris = datasets.load_iris()\n", 45 | "\n", 46 | "# Split into features and labels\n", 47 | "X = iris.data\n", 48 | "y = iris.target" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 13, 54 | "metadata": {}, 55 | "outputs": [ 56 | { 57 | "output_type": "stream", 58 | "name": "stdout", 59 | "text": [ 60 | "[[5.1 3.5 1.4 0.2]\n [4.9 3. 1.4 0.2]\n [4.7 3.2 1.3 0.2]\n [4.6 3.1 1.5 0.2]\n [5. 
3.6 1.4 0.2]]\n[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n 2 2]\n" 61 | ] 62 | } 63 | ], 64 | "source": [ 65 | "print(X[:5]) # NumPy array\n", 66 | "print(y) # So we have 3 labels" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 12, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "output_type": "stream", 76 | "name": "stdout", 77 | "text": [ 78 | "(150, 4)\n150\n" 79 | ] 80 | } 81 | ], 82 | "source": [ 83 | "print(X.shape) \n", 84 | "print(len(y))" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 19, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "output_type": "stream", 94 | "name": "stdout", 95 | "text": [ 96 | "30.0\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 20% to test set\n", 102 | "print(150 * 0.2) # 120 / 30" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 18, 108 | "metadata": {}, 109 | "outputs": [ 110 | { 111 | "output_type": "execute_result", 112 | "data": { 113 | "text/plain": [ 114 | "(120, 4)" 115 | ] 116 | }, 117 | "metadata": {}, 118 | "execution_count": 18 119 | } 120 | ], 121 | "source": [ 122 | "X_train.shape" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 20, 128 | "metadata": {}, 129 | "outputs": [ 130 | { 131 | "output_type": "execute_result", 132 | "data": { 133 | "text/plain": [ 134 | "120" 135 | ] 136 | }, 137 | "metadata": {}, 138 | "execution_count": 20 139 | } 140 | ], 141 | "source": [ 142 | "len(y_train)" 143 | ] 144 | } 145 | ] 146 | } -------------------------------------------------------------------------------- /tensorflow-in-practice/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/.DS_Store -------------------------------------------------------------------------------- /tensorflow-in-practice/Exercises/Exercise_2_Handwriting_Recognition_DNN.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Exercise2-Question.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "toc_visible": true 10 | }, 11 | "kernelspec": { 12 | "name": "python386jvsc74a57bd04ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a", 13 | "display_name": "Python 3.8.6 64-bit ('tf': conda)" 14 | }, 15 | "metadata": { 16 | "interpreter": { 17 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a" 18 | } 19 | } 20 | }, 21 | "cells": [ 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "# Rustam-Z" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "metadata": { 34 | "colab": { 35 | "base_uri": "https://localhost:8080/" 36 | }, 37 | "id": "9rvXQGAA0ssC", 38 | "outputId": "60861935-7551-475e-e8c4-507b43cc6de7" 39 | }, 40 | "source": [ 41 | "import tensorflow as tf\n", 42 | "\n", 43 | "class myCallback(tf.keras.callbacks.Callback):\n", 44 | " def on_epoch_end(self, epoch, logs={}):\n", 45 | " if(logs.get('accuracy')>0.99):\n", 46 | " 
print( \"Reached 99% accuracy so cancelling training!\")\n", 47 | " self.model.stop_training = True\n", 48 | "\n", 49 | "\n", 50 | "mnist = tf.keras.datasets.mnist\n", 51 | "(x_train, y_train),(x_test, y_test) = mnist.load_data()\n", 52 | "x_train, x_test = x_train / 255.0, x_test / 255.0\n", 53 | "\n", 54 | "callbacks = myCallback()\n", 55 | "\n", 56 | "model = tf.keras.models.Sequential([\n", 57 | " tf.keras.layers.Flatten(input_shape=(28, 28)),\n", 58 | " tf.keras.layers.Dense(512, activation=\"relu\"),\n", 59 | " tf.keras.layers.Dense(10, activation=\"softmax\")\n", 60 | "])\n", 61 | "\n", 62 | "model.compile(optimizer='adam',\n", 63 | " loss='sparse_categorical_crossentropy',\n", 64 | " metrics=['accuracy'])\n", 65 | "\n", 66 | "model.fit(x_train, y_train, epochs=5, callbacks=[callbacks])" 67 | ], 68 | "execution_count": 1, 69 | "outputs": [ 70 | { 71 | "output_type": "stream", 72 | "name": "stdout", 73 | "text": [ 74 | "Epoch 1/5\n", 75 | "1875/1875 [==============================] - 1s 656us/step - loss: 0.3419 - accuracy: 0.9011\n", 76 | "Epoch 2/5\n", 77 | "1875/1875 [==============================] - 1s 649us/step - loss: 0.0835 - accuracy: 0.9749\n", 78 | "Epoch 3/5\n", 79 | "1875/1875 [==============================] - 1s 653us/step - loss: 0.0527 - accuracy: 0.9835\n", 80 | "Epoch 4/5\n", 81 | "1875/1875 [==============================] - 1s 655us/step - loss: 0.0366 - accuracy: 0.9877\n", 82 | "Epoch 5/5\n", 83 | "1875/1875 [==============================] - 1s 653us/step - loss: 0.0248 - accuracy: 0.9925\n", 84 | "Reached 99% accuracy so cancelling training!\n" 85 | ] 86 | }, 87 | { 88 | "output_type": "execute_result", 89 | "data": { 90 | "text/plain": [ 91 | "" 92 | ] 93 | }, 94 | "metadata": {}, 95 | "execution_count": 1 96 | } 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "metadata": { 102 | "id": "qErwFEW0mz0H", 103 | "outputId": "3d8ba790-8c5e-4a55-c824-ecd92a00352c", 104 | "colab": { 105 | "base_uri": "https://localhost:8080/" 106 | } 107 | }, 108 | "source": [ 109 | "import tensorflow as tf\n", 110 | "\n", 111 | "print(tf.nn.relu)" 112 | ], 113 | "execution_count": 2, 114 | "outputs": [ 115 | { 116 | "output_type": "stream", 117 | "name": "stdout", 118 | "text": [ 119 | "\n" 120 | ] 121 | } 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "metadata": { 127 | "id": "18I-y7X-q84V", 128 | "outputId": "e9478496-2bc1-4ce9-ee62-82af2df4a8df", 129 | "colab": { 130 | "base_uri": "https://localhost:8080/" 131 | } 132 | }, 133 | "source": [ 134 | "model.evaluate(x_test, y_test)" 135 | ], 136 | "execution_count": 3, 137 | "outputs": [ 138 | { 139 | "output_type": "stream", 140 | "name": "stdout", 141 | "text": [ 142 | "313/313 [==============================] - 0s 293us/step - loss: 0.0637 - accuracy: 0.9809\n" 143 | ] 144 | }, 145 | { 146 | "output_type": "execute_result", 147 | "data": { 148 | "text/plain": [ 149 | "[0.06370978057384491, 0.98089998960495]" 150 | ] 151 | }, 152 | "metadata": {}, 153 | "execution_count": 3 154 | } 155 | ] 156 | } 157 | ] 158 | } -------------------------------------------------------------------------------- /tensorflow-in-practice/Exercises/Exercise_3_CNN.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Exercise 3 - Question.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [] 9 | }, 10 | "kernelspec": { 11 | "name": 
"python386jvsc74a57bd04ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a", 12 | "display_name": "Python 3.8.6 64-bit ('tf': conda)" 13 | }, 14 | "metadata": { 15 | "interpreter": { 16 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a" 17 | } 18 | } 19 | }, 20 | "cells": [ 21 | { 22 | "cell_type": "code", 23 | "metadata": { 24 | "id": "yl3yB8J_PCZM" 25 | }, 26 | "source": [ 27 | "# Rustam-Z" 28 | ], 29 | "execution_count": null, 30 | "outputs": [] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "metadata": { 35 | "colab": { 36 | "base_uri": "https://localhost:8080/" 37 | }, 38 | "id": "KtixUwmvSD0A", 39 | "outputId": "34a18be4-67c7-4147-a4e7-62efa5fe3124" 40 | }, 41 | "source": [ 42 | "import tensorflow as tf\n", 43 | "\n", 44 | "mnist = tf.keras.datasets.mnist\n", 45 | "(training_images, training_labels), (test_images, test_labels) = mnist.load_data()" 46 | ], 47 | "execution_count": 1, 48 | "outputs": [] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "metadata": { 53 | "id": "EiLuNPb-TnyF" 54 | }, 55 | "source": [ 56 | "training_images=training_images.reshape(60000, 28, 28, 1)\n", 57 | "training_images=training_images / 255.0\n", 58 | "test_images = test_images.reshape(10000, 28, 28, 1)\n", 59 | "test_images=test_images/255.0" 60 | ], 61 | "execution_count": 2, 62 | "outputs": [] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "metadata": { 67 | "id": "I-3_hM1mSImZ" 68 | }, 69 | "source": [ 70 | "class myCallback(tf.keras.callbacks.Callback):\n", 71 | " def on_epoch_end(self, epoch, logs={}):\n", 72 | " if(logs.get('accuracy')>0.998):\n", 73 | " print(\"\\nReached 99.8% accuracy so cancelling training!\")\n", 74 | " self.model.stop_training = True" 75 | ], 76 | "execution_count": 3, 77 | "outputs": [] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "metadata": { 82 | "id": "sfQRyaJWAIdg" 83 | }, 84 | "source": [ 85 | "callbacks = myCallback()\n", 86 | "\n", 87 | "model = tf.keras.models.Sequential([\n", 88 | " tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),\n", 89 | " tf.keras.layers.MaxPooling2D(2,2),\n", 90 | " tf.keras.layers.Flatten(),\n", 91 | " tf.keras.layers.Dense(128, activation='relu'),\n", 92 | " tf.keras.layers.Dense(10, activation='softmax')\n", 93 | "])\n", 94 | "\n", 95 | "model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])" 96 | ], 97 | "execution_count": 4, 98 | "outputs": [] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "metadata": { 103 | "colab": { 104 | "base_uri": "https://localhost:8080/" 105 | }, 106 | "id": "i10RG-0ySDGY", 107 | "outputId": "104e7f06-b70e-4fbb-c283-c1b4a55e8b67" 108 | }, 109 | "source": [ 110 | "model.fit(training_images, training_labels, epochs=20, callbacks=[callbacks])" 111 | ], 112 | "execution_count": 5, 113 | "outputs": [ 114 | { 115 | "output_type": "stream", 116 | "name": "stdout", 117 | "text": [ 118 | "Epoch 1/20\n", 119 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.2992 - accuracy: 0.9104\n", 120 | "Epoch 2/20\n", 121 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0521 - accuracy: 0.9840\n", 122 | "Epoch 3/20\n", 123 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0298 - accuracy: 0.9906\n", 124 | "Epoch 4/20\n", 125 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0208 - accuracy: 0.9932\n", 126 | "Epoch 5/20\n", 127 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0133 - accuracy: 0.9960\n", 128 | "Epoch 6/20\n", 129 
| "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0091 - accuracy: 0.9972\n", 130 | "Epoch 7/20\n", 131 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0065 - accuracy: 0.9978\n", 132 | "Epoch 8/20\n", 133 | "1875/1875 [==============================] - 7s 4ms/step - loss: 0.0050 - accuracy: 0.9985\n", 134 | "\n", 135 | "Reached 99.8% accuracy so cancelling training!\n" 136 | ] 137 | }, 138 | { 139 | "output_type": "execute_result", 140 | "data": { 141 | "text/plain": [ 142 | "" 143 | ] 144 | }, 145 | "metadata": {}, 146 | "execution_count": 5 147 | } 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 9, 153 | "metadata": {}, 154 | "outputs": [ 155 | { 156 | "output_type": "stream", 157 | "name": "stdout", 158 | "text": [ 159 | "\n[]\n" 160 | ] 161 | } 162 | ], 163 | "source": [ 164 | "import tensorflow as tf\n", 165 | "print(tf.test.gpu_device_name())\n", 166 | "print(tf.config.list_physical_devices('GPU'))" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 6, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "output_type": "execute_result", 176 | "data": { 177 | "text/plain": [ 178 | "[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]" 179 | ] 180 | }, 181 | "metadata": {}, 182 | "execution_count": 6 183 | } 184 | ], 185 | "source": [ 186 | "tf.config.list_physical_devices()" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 9, 192 | "metadata": {}, 193 | "outputs": [ 194 | { 195 | "output_type": "execute_result", 196 | "data": { 197 | "text/plain": [ 198 | "True" 199 | ] 200 | }, 201 | "metadata": {}, 202 | "execution_count": 9 203 | } 204 | ], 205 | "source": [ 206 | "from tensorflow.python.compiler.mlcompute import mlcompute\n", 207 | "mlcompute.is_apple_mlc_enabled()" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 10, 213 | "metadata": {}, 214 | "outputs": [ 215 | { 216 | "output_type": "execute_result", 217 | "data": { 218 | "text/plain": [ 219 | "True" 220 | ] 221 | }, 222 | "metadata": {}, 223 | "execution_count": 10 224 | } 225 | ], 226 | "source": [ 227 | "mlcompute.is_tf_compiled_with_apple_mlc()" 228 | ] 229 | } 230 | ] 231 | } -------------------------------------------------------------------------------- /tensorflow-in-practice/Exercises/Exercise_4_Complex_Images_flow_from_directory.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Exercise 4-Question.ipynb", 7 | "provenance": [] 8 | }, 9 | "kernelspec": { 10 | "display_name": "Python 3", 11 | "name": "python3" 12 | } 13 | }, 14 | "cells": [ 15 | { 16 | "cell_type": "code", 17 | "metadata": { 18 | "colab": { 19 | "base_uri": "https://localhost:8080/" 20 | }, 21 | "id": "7Vti6p3PxmpS", 22 | "outputId": "99f9f945-5bd1-41e0-c274-77966a56d7aa" 23 | }, 24 | "source": [ 25 | "import tensorflow as tf\n", 26 | "import os\n", 27 | "import zipfile\n", 28 | "\n", 29 | "DESIRED_ACCURACY = 0.999\n", 30 | "\n", 31 | "!wget --no-check-certificate \\\n", 32 | " \"https://storage.googleapis.com/laurencemoroney-blog.appspot.com/happy-or-sad.zip\" \\\n", 33 | " -O \"/tmp/happy-or-sad.zip\"\n", 34 | "\n", 35 | "zip_ref = zipfile.ZipFile(\"/tmp/happy-or-sad.zip\", 'r')\n", 36 | "zip_ref.extractall(\"/tmp/h-or-s\")\n", 37 | "zip_ref.close()\n", 38 | "\n", 39 | "class myCallback(tf.keras.callbacks.Callback):\n", 40 | " def on_epoch_end(self, 
epoch, logs={}):\n", 41 | " if(logs.get('accuracy')>DESIRED_ACCURACY):\n", 42 | " print(\"\\nReached 99.9% accuracy so cancelling training!\")\n", 43 | " self.model.stop_training = True\n", 44 | "\n", 45 | "callbacks = myCallback()" 46 | ], 47 | "execution_count": 17, 48 | "outputs": [ 49 | { 50 | "output_type": "stream", 51 | "text": [ 52 | "--2021-04-08 03:16:56-- https://storage.googleapis.com/laurencemoroney-blog.appspot.com/happy-or-sad.zip\n", 53 | "Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.215.128, 173.194.216.128, 173.194.217.128, ...\n", 54 | "Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.215.128|:443... connected.\n", 55 | "HTTP request sent, awaiting response... 200 OK\n", 56 | "Length: 2670333 (2.5M) [application/zip]\n", 57 | "Saving to: ‘/tmp/happy-or-sad.zip’\n", 58 | "\n", 59 | "\r/tmp/happy-or-sad.z 0%[ ] 0 --.-KB/s \r/tmp/happy-or-sad.z 100%[===================>] 2.55M --.-KB/s in 0.01s \n", 60 | "\n", 61 | "2021-04-08 03:16:56 (217 MB/s) - ‘/tmp/happy-or-sad.zip’ saved [2670333/2670333]\n", 62 | "\n" 63 | ], 64 | "name": "stdout" 65 | } 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "metadata": { 71 | "id": "6DLGbXXI1j_V" 72 | }, 73 | "source": [ 74 | "# This Code Block should Define and Compile the Model\n", 75 | "model = tf.keras.models.Sequential([\n", 76 | " tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(300, 300, 3)),\n", 77 | " tf.keras.layers.MaxPooling2D(2, 2),\n", 78 | " tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),\n", 79 | " tf.keras.layers.MaxPooling2D(2, 2),\n", 80 | " tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),\n", 81 | " tf.keras.layers.MaxPooling2D(2, 2),\n", 82 | " tf.keras.layers.Flatten(), # Flatten the results to feed into a DNN\n", 83 | " tf.keras.layers.Dense(512, activation='relu'), # 512 neuron hidden layer\n", 84 | " tf.keras.layers.Dense(1, activation='sigmoid'),\n", 85 | "])\n", 86 | "\n", 87 | "\n", 88 | "from tensorflow.keras.optimizers import RMSprop\n", 89 | "\n", 90 | "model.compile(loss=\"binary_crossentropy\",\n", 91 | " optimizer=RMSprop(lr=0.001),\n", 92 | " metrics=['accuracy'])" 93 | ], 94 | "execution_count": 18, 95 | "outputs": [] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "metadata": { 100 | "colab": { 101 | "base_uri": "https://localhost:8080/" 102 | }, 103 | "id": "4Ap9fUJE1vVu", 104 | "outputId": "cc7b7369-a630-478d-b57e-62c015cf127a" 105 | }, 106 | "source": [ 107 | "# This code block should create an instance of an ImageDataGenerator called train_datagen \n", 108 | "# And a train_generator by calling train_datagen.flow_from_directory\n", 109 | "\n", 110 | "from tensorflow.keras.preprocessing.image import ImageDataGenerator\n", 111 | "\n", 112 | "train_datagen = ImageDataGenerator(rescale=1./255)\n", 113 | "\n", 114 | "train_generator = train_datagen.flow_from_directory(\n", 115 | " '/tmp/h-or-s/',\n", 116 | " target_size=(300, 300),\n", 117 | " batch_size=8,\n", 118 | " class_mode='binary')\n", 119 | "\n", 120 | "# Expected output: 'Found 80 images belonging to 2 classes'" 121 | ], 122 | "execution_count": 13, 123 | "outputs": [ 124 | { 125 | "output_type": "stream", 126 | "text": [ 127 | "Found 80 images belonging to 2 classes.\n" 128 | ], 129 | "name": "stdout" 130 | } 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "metadata": { 136 | "colab": { 137 | "base_uri": "https://localhost:8080/" 138 | }, 139 | "id": "48dLm13U1-Le", 140 | "outputId": "8c82e79e-fed0-4b0a-be86-089d17f1cd66" 141 | }, 142 | "source": [ 143 | 
"# This code block should call model.fit and train for\n", 144 | "# a number of epochs. \n", 145 | "history = model.fit(\n", 146 | " train_generator,\n", 147 | " steps_per_epoch=10,\n", 148 | " epochs=20,\n", 149 | " callbacks=[callbacks])\n", 150 | " \n", 151 | "# Expected output: \"Reached 99.9% accuracy so cancelling training!\"\"" 152 | ], 153 | "execution_count": 19, 154 | "outputs": [ 155 | { 156 | "output_type": "stream", 157 | "text": [ 158 | "Epoch 1/20\n", 159 | "10/10 [==============================] - 11s 1s/step - loss: 4.0546 - accuracy: 0.5853\n", 160 | "Epoch 2/20\n", 161 | "10/10 [==============================] - 9s 936ms/step - loss: 0.8477 - accuracy: 0.6622\n", 162 | "Epoch 3/20\n", 163 | "10/10 [==============================] - 10s 1s/step - loss: 0.2766 - accuracy: 0.9474\n", 164 | "Epoch 4/20\n", 165 | "10/10 [==============================] - 10s 1s/step - loss: 0.1236 - accuracy: 0.9822\n", 166 | "Epoch 5/20\n", 167 | "10/10 [==============================] - 9s 921ms/step - loss: 0.0683 - accuracy: 0.9704\n", 168 | "Epoch 6/20\n", 169 | "10/10 [==============================] - 10s 977ms/step - loss: 0.0221 - accuracy: 1.0000\n", 170 | "\n", 171 | "Reached 99.9% accuracy so cancelling training!\n" 172 | ], 173 | "name": "stdout" 174 | } 175 | ] 176 | } 177 | ] 178 | } -------------------------------------------------------------------------------- /tensorflow-in-practice/MNIST/my_model.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/MNIST/my_model.h5 -------------------------------------------------------------------------------- /tensorflow-in-practice/MNIST/test.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | from PIL import Image 4 | import cv2 5 | import matplotlib.pyplot as plt 6 | 7 | model = tf.keras.models.load_model('tensorflow-in-practice/notebooks/MNIST/my_model.h5') 8 | 9 | image = cv2.imread('tensorflow-in-practice/img/0.jpg') 10 | image = cv2.resize(image,(28,28)) 11 | gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) 12 | data = np.vstack([gray]) 13 | data=data/255.0 14 | 15 | plt.imshow(gray, cmap='gray') 16 | plt.show() 17 | 18 | indices_one = data == 1 19 | data[indices_one] = 0 # replacing 1s with 0s 20 | print(data) 21 | 22 | predictions = model.predict(np.expand_dims(data, 0)) 23 | print("\nAnswer:") 24 | print(predictions) 25 | -------------------------------------------------------------------------------- /tensorflow-in-practice/MNIST/train.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | from PIL import Image 4 | import cv2 5 | import matplotlib.pyplot as plt 6 | 7 | class myCallback(tf.keras.callbacks.Callback): 8 | def on_epoch_end(self, epoch, logs={}): 9 | if(logs.get('accuracy')>0.90): 10 | print("\nReached 99% accuracy so cancelling training!") 11 | self.model.stop_training = True 12 | 13 | mnist = tf.keras.datasets.mnist 14 | 15 | (x_train, y_train),(x_test, y_test) = mnist.load_data() 16 | x_train, x_test = x_train / 255.0, x_test / 255.0 17 | 18 | callbacks = myCallback() 19 | 20 | model = tf.keras.models.Sequential([ 21 | tf.keras.layers.Flatten(input_shape=(28, 28)), 22 | tf.keras.layers.Dense(512, activation=tf.nn.relu), 23 | tf.keras.layers.Dense(256, activation=tf.nn.relu), 24 | 
tf.keras.layers.Dense(128, activation=tf.nn.relu), 25 | tf.keras.layers.Dense(10, activation=tf.nn.softmax) 26 | ]) 27 | model.compile(optimizer=tf.optimizers.Adam(), 28 | loss='sparse_categorical_crossentropy', 29 | metrics=['accuracy']) 30 | 31 | model.fit(x_train, y_train, epochs=10, callbacks=[callbacks]) 32 | 33 | image = cv2.imread('3.png') 34 | image = cv2.resize(image,(28,28)) 35 | gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) 36 | data = np.vstack([gray]) 37 | data=data/255.0 38 | 39 | plt.imshow(gray, cmap='gray') 40 | plt.show() 41 | 42 | indices_one = data == 1 43 | data[indices_one] = 0 # replacing 1s with 0s 44 | print(data) 45 | 46 | predictions = model.predict(np.expand_dims(data, 0)) 47 | print("\nAnswer:") 48 | print(predictions) 49 | 50 | model.save('my_model.h5') -------------------------------------------------------------------------------- /tensorflow-in-practice/README.md: -------------------------------------------------------------------------------- 1 | # [TensorFlow in Practice](https://www.coursera.org/professional-certificates/tensorflow-in-practice) by DeepLearning.AI 2 | 3 | Rustam-Z🚀, 16 April 2021 4 | 5 | Hi there👋, it is the next level. 6 | 7 | Here in this specialization you will learn TensorFlow and Keras. 8 | 9 | We will cover the basics of Keras model building structure, Computer Vision with CNN, etc. 10 | 11 | ## How to study? 12 | Go to the [specialization website](https://www.coursera.org/professional-certificates/tensorflow-in-practice), and enroll the courses (you can audit). 13 | - Course notebooks: https://github.com/lmoroney/dlaicourse 14 | 15 | ## What's next? 16 | - Start Kaggle competitions 17 | - Start reading **Hands-on Machine learning** book 18 | - **TensorFlow Advanced Techniques**: https://www.coursera.org/specializations/tensorflow-advanced-techniques -------------------------------------------------------------------------------- /tensorflow-in-practice/convolutional-neural-networks-tensorflow.md: -------------------------------------------------------------------------------- 1 | # [Convolutional Neural Networks in TensorFlow](https://www.coursera.org/learn/convolutional-neural-networks-tensorflow) 2 | 3 | - How to work with real-world images in different shapes and sizes. 4 | - Visualize the journey of an image through convolutions to understand how a computer “sees” information 5 | - Plot loss and accuracy, and explore strategies to prevent overfitting, including augmentation and dropout. 6 | - Finally, Course 2 will introduce you to transfer learning and how learned features can be extracted from models. 
7 | 8 | ## Contents: 9 | - Week 1 - [Exploring a Larger Dataset](#Exploring-a-Larger-Dataset) 10 | - Week 2 - [Augmentation](#Augmentation) 11 | - Week 3 - [Transfer Learning](#Transfer-Learning) 12 | - Week 4 - [Multiclass Classifications](#Multiclass-Classifications) 13 | 14 | ## Exploring a Larger Dataset 15 | > [Notebook](notebooks/Course_2_Part_2_Lesson_2_Notebook.ipynb) 16 | 17 | > https://www.kaggle.com/c/dogs-vs-cats 25K pictures of cats and dogs 18 | 19 | ```python 20 | # Download ZIP file and extract it with python 21 | !wget --no-check-certificate \ 22 | https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip \ 23 | -O /tmp/cats_and_dogs_filtered.zip 24 | _____________________________________________ 25 | import os 26 | import zipfile 27 | 28 | local_zip = '/tmp/cats_and_dogs_filtered.zip' 29 | 30 | zip_ref = zipfile.ZipFile(local_zip, 'r') 31 | 32 | zip_ref.extractall('/tmp') 33 | zip_ref.close() 34 | ``` 35 | ```py 36 | from tensorflow.keras.preprocessing.image import ImageDataGenerator 37 | 38 | # All images will be rescaled by 1./255. 39 | train_datagen = ImageDataGenerator(rescale = 1.0/255.) 40 | 41 | train_generator = train_datagen.flow_from_directory(train_dir, 42 | batch_size=20, 43 | class_mode='binary', 44 | target_size=(150, 150)) 45 | ``` 46 | 47 | ## Augmentation 48 | > [Notebook](notebooks/Course_2_Part_4_Lesson_2_Notebook_(Cats_v_Dogs_Augmentation).ipynb) 49 | 50 | > `image-augmentation` • `data-augmentation` • `ImageDataGenerator` 51 | 52 | - All processes will happen in the main memory, from_from_directory() will generate the images on the fly. It doesn't require you to edit your raw images, nor does it amend them for you on-disk. It does it in-memory as it's performing the training, allowing you to experiment without impacting your dataset. 53 | - `ImageDataGenerator()` -> `flow_from_directory()` -> `fit_generator()` 54 | - **ImageDataGenerator** will NOT add **new images** to your data set in a sense that it will not make your epochs bigger. Instead, in each epoch it will provide slightly altered images (depending on your configuration). It will always generate new images, no matter how many epochs you have. 55 | 56 | ```python 57 | train_datagen = ImageDataGenerator( 58 | rescale=1./255, 59 | rotation_range=40, # Randomly rotate image between 0 and 40° 60 | width_shift_range=0.2, # Move picture inside its frame 61 | height_shitt_range=0.2, 62 | shear_range=0.2, # Shear up to 20% 63 | zoom_range=0.2, 64 | horizontal_flip=True, 65 | fill_mode='nearest') # It attempts to recreate lost information after a transformation like a shear 66 | 67 | train_generator = train_datagen.flow_from_directory( 68 | train_dir, # This is the source directory for training images 69 | target_size=(150, 150), # All images will be resized to 150x150 70 | batch_size=20, # Size of the batches of data, (? 
a number of samples per gradient update) 71 | class_mode='binary') 72 | 73 | history = model.fit_generator( 74 | train_generator, 75 | steps_per_epoch=100, # 2000 images = batch_size * steps, total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch 76 | epochs=100, 77 | # validation_data=validation_generator, 78 | # validation_steps=50, # 1000 images = batch_size * steps 79 | verbose=2) 80 | ``` 81 | - https://keras.io/api/preprocessing/image/ 82 | - https://fairyonice.github.io/Learn-about-ImageDataGenerator.html 83 | - https://keras.io/api/models/model_training_apis/#fit-method 84 | - https://stackoverflow.com/questions/38340311/what-is-the-difference-between-steps-and-epochs-in-tensorflow 85 | - https://stackoverflow.com/questions/51748514/does-imagedatagenerator-add-more-images-to-my-dataset 86 | 87 | ## Transfer Learning 88 | > `inception` 89 | 90 | > https://www.tensorflow.org/tutorials/images/transfer_learning 91 | 92 | ```python 93 | import os 94 | from tensorflow.keras import layers 95 | from tensorflow.keras import Model 96 | from tensorflow.keras.applications.inception_v3 import InceptionV3 97 | from tensorflow.keras.optimizers import RMSprop 98 | 99 | # Donwload InceptionV3 weights 100 | !wget --no-check-certificate \ 101 | https://storage.googleapis.com/mledu-datasets/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5 \ 102 | -O /tmp/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5 103 | 104 | local_weights_file = '/tmp/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5' 105 | pre_trained_model = Inceptionv3(input_shape=(150, 150, 3), 106 | include_top=False, # Do not include top FC (fully connected) layer 107 | weights=None) 108 | pre_trained_model.load_weights(local_weights_file) # Use own weights 109 | 110 | # Do not retrain layers, i.e freeze them 111 | for layer in pre_trained_model.layers: 112 | layer.trainable = False 113 | 114 | # pre_trained_model.summary() 115 | 116 | # Grab the mixed7 layer from inception, and take its output 117 | last_layer = pre_trained_model.get_layer('mixed7') 118 | print('last layer output shape: ', last_layer.output_shape) 119 | last_output = last_layer.output 120 | 121 | # Now, you'll need to add your own DNN at the bottom of these, which you can retrain to your data 122 | x = layers.Flatten()(last_output) 123 | x = layers.Dense(1024, activation='relu')(x) 124 | x = layers.Dropout(0.2)(x) # Drop out 20% of neurons 125 | x = layers.Dense(1, activation='sigmoid')(x) 126 | 127 | # Create model using 'Model' abstract class 128 | model = Model(pre_trained_model.input, x) 129 | model.compile(optimizer=RMSprop(lr=.0001), 130 | loss='binary_crossentropy', 131 | metrics=['accuracy']) 132 | 133 | train_datagen = ImageDataGenerator(...) 134 | train_generator = train_datagen.flow_from_directory(...) 135 | history = model.fit_generator(...) 136 | ``` 137 | > The idea behind **Dropouts** is that they **remove a random number of neurons** in your neural network. This works very well for two reasons: The first is that neighboring neurons often end up with similar weights, which can lead to overfitting, so dropping some out at random can remove this. The second is that often a neuron can over-weigh the input from a neuron in the previous layer, and can over specialize as a result. Thus, dropping out can break the neural network out of this potential bad habit! 138 | 139 | ## Multiclass Classifications 140 | - Computer generated images (CGI) will help you to create a dataset. 
Imagine you are creating a project for detecting rock, paper, scissors (💎, 📄, ✂️) during the game. So, you need lots of images of different races for both male and female, big and little hands. 141 | - http://www.laurencemoroney.com/rock-paper-scissors-dataset/ 142 | 146 | - Change to `class_mode='categorical'` in flow_from_firectory(), and output Dense layer `activation='softmax'`, and loss function in model.compile `loss='categorical_crossentropy'` 147 | - flow_from_directory() uses the alphabetical order. For example, is we test for rock the output should be [1, 0, 0] because of [rock, paper, scissors]. 148 | 149 | ## Notes 150 | - Can you use Image augmentation with Transfer Learning? 151 | > Yes. It's pre-trained layers that are frozen. So you can augment your images as you train the bottom layers of the DNN with them -------------------------------------------------------------------------------- /tensorflow-in-practice/img/0.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/0.jpg -------------------------------------------------------------------------------- /tensorflow-in-practice/img/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/1.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/2.jpg -------------------------------------------------------------------------------- /tensorflow-in-practice/img/3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/3.jpg -------------------------------------------------------------------------------- /tensorflow-in-practice/img/fibonacci.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/fibonacci.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/fp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/fp.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/fp2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/fp2.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/lstm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/lstm.png 
-------------------------------------------------------------------------------- /tensorflow-in-practice/img/lstm2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/lstm2.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/metrics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/metrics.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/ml_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/ml_architecture.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/rfp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/rfp.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/rnn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/rnn.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/rnn2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/rnn2.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/seasonality.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/seasonality.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/tf_datasets.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/tf_datasets.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/trend.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/trend.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/ts.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/ts.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/tsn.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/tsn.png -------------------------------------------------------------------------------- /tensorflow-in-practice/img/word_embeddings.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/img/word_embeddings.png -------------------------------------------------------------------------------- /tensorflow-in-practice/introduction-to-tensorflow-for-ai.md: -------------------------------------------------------------------------------- 1 | # [Introduction to TensorFlow for Artificial Intelligence, Machine Learning, and Deep Learning](https://www.coursera.org/learn/introduction-tensorflow/home/welcome) 2 | 3 | ## Contents: 4 | - Week 1 - [A new programming paradigm](#A-new-programming-paradigm) 5 | - Week 2 - [Introduction to Computer Vision](#Introduction-to-Computer-Vision) 6 | - Week 3 - [Convolutional Neural Networks](#Convolutional-Neural-Networks) 7 | - Week 4 - [Using real-world images](#Using-Real-world-Images) 8 | 9 | > `!pip install tensorflow==2.0.0-alpha0` run it to use TensorFlow 2.x in Google Colab 10 | 11 | > The notebooks you can work with: https://drive.google.com/drive/folders/1R4bIjns1qRcTNkltbO9NOi7jgnrM-VLg?usp=sharing 12 | 13 | ## A new programming paradigm 14 | > [Notebook](notebooks/Course_1_Part_2_Lesson_2_Notebook.ipynb) 15 | 16 | ### A primer in machine learning 17 | 18 | 19 | ### The ‘Hello World’ of neural networks 20 | ```python 21 | from keras import models 22 | from keras import layers 23 | import numpy as np 24 | 25 | model = keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])]) 26 | model.compile(optimizer='sgd', loss='mean_squared_error') # Guess the pattern and measure how badly or good the algorithm works 27 | 28 | # Just imagine you have lots of Xs and Ys, the computer doesn't know the correlation between them. Your algorithm tries to connect Xs to Ys (makes guesses). The loss functions looks at the predicted outputs and actial outputs and *measures how good or badly the guess was. Then it gives its value to optimizer which figures out the next guess (update its parameters). So the optimizer thinks about how good or how badly the guess was done using the data from the loss function. 
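# With this training data the network should converge towards weight ≈ 2.0 and bias ≈ -1.0 (the underlying rule is y = 2x - 1);
# after fitting you can inspect the learned values with model.get_weights() to see how close the guesses got.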
29 | 30 | xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float) 31 | ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float) 32 | 33 | model.fit(xs, ys, epochs=500) # Training 34 | 35 | print(model.predict([10.0])) # You can expect 19 because y = 2x - 1, but it will be very close to ≈19 36 | ``` 37 | 38 | ## Introduction to Computer Vision 39 | > [Notebook](notebooks/Course_1_Part_4_Lesson_2_Notebook.ipynb) 40 | 41 | > https://github.com/zalandoresearch/fashion-mnist 70K images 42 | 43 | ```python 44 | import tensorflow as tf 45 | import numpy as np 46 | import matplotlib.pyplot as plt # plt.imshow(training_images[0]) 47 | print(tf.__version__) 48 | 49 | # Loading the dataset 50 | mnist = tf.keras.datasets.fashion_mnist 51 | (training_images, training_labels), (test_images, test_labels) = mnist.load_data() 52 | print(training_images.shape) 53 | print(test_images.shape) 54 | 55 | # Normalizing 56 | training_images = training_images / 255.0 57 | test_images = test_images / 255.0 58 | 59 | # Building the model 60 | model = tf.keras.models.Sequential([tf.keras.layers.Flatten(), 61 | tf.keras.layers.Dense(1024, activation=tf.nn.relu), 62 | tf.keras.layers.Dense(10, activation=tf.nn.softmax)]) 63 | 64 | # Defining the model, optimizer=tf.optimizers.Adam() 65 | model.compile(optimizer='adam', 66 | loss='sparse_categorical_crossentropy', 67 | metrics=['accuracy']) 68 | 69 | model.fit(training_images, training_labels, epochs=5) # Training the model, i.e. fitting training data to training labels 70 | 71 | model.evaluate(test_images, test_labels) 72 | 73 | classifications = model.predict(test_images) # Predict for new values 74 | 75 | print(">> Predicted label:", classifications[0]) 76 | print(">> Actual label:", test_labels[0]) 77 | 78 | ``` 79 | - Notes: 80 | - **Sequential**: That defines a SEQUENCE of layers in the neural network 81 | - **Flatten**: Flatten just takes the input and turns it into a 1 dimensional set. Via ROWS 82 | - **Dense**: Adds a layer of neuron. Each layer of neurons need an 'activation function' to tell them what to do. There's lots of options, but just use these for now. 83 | - **Relu** effectively means "If X>0 return X, else return 0" -- so what it does it it only passes values 0 or greater to the next layer in the network. 84 | - **Softmax** takes a set of values, and effectively picks the biggest one, so, for example, if the output of the last layer looks like [0.1, 0.1, 0.05, 0.1, 9.5, 0.1, 0.05, 0.05, 0.05], it saves you from fishing through it looking for the biggest value, and turns it into [0,0,0,0,1,0,0,0,0] -- The goal is to save a lot of coding! 
85 | - https://stackoverflow.com/questions/44176982/how-does-the-flatten-layer-work-in-keras 86 | 87 | ```python 88 | # What if you want to stop training when you reached the accuracy needed 89 | import tensorflow as tf 90 | 91 | class myCallback(tf.keras.callbacks.Callback): 92 | def on_epoch_end(self, epoch, logs={}): 93 | if(logs.get('accuracy')>0.6): 94 | print("\nReached 60% accuracy so cancelling training!") 95 | self.model.stop_training = True 96 | 97 | mnist = tf.keras.datasets.fashion_mnist 98 | 99 | (x_train, y_train),(x_test, y_test) = mnist.load_data() 100 | x_train, x_test = x_train / 255.0, x_test / 255.0 101 | 102 | callbacks = myCallback() # Creating the callback 103 | 104 | model = tf.keras.models.Sequential([ 105 | tf.keras.layers.Flatten(input_shape=(28, 28)), 106 | tf.keras.layers.Dense(512, activation=tf.nn.relu), 107 | tf.keras.layers.Dense(10, activation=tf.nn.softmax) 108 | ]) 109 | model.compile(optimizer=tf.optimizers.Adam(), 110 | loss='sparse_categorical_crossentropy', 111 | metrics=['accuracy']) 112 | 113 | model.fit(x_train, y_train, epochs=10, callbacks=[callbacks]) # You need to add callbacks argument 114 | ``` 115 | 116 | ## Convolutional Neural Networks 117 | > [Notebook](notebooks/Course_1_Part_6_Lesson_2_Notebook.ipynb) 118 | 119 | > https://github.com/Rustam-Z/deep-learning-notes/tree/main/Course%204%20Convolutional%20Neural%20Networks 120 | 121 | **Types of layers in a convolutional network:** 122 | - Convolution (CONV) - A technique to isolate features in images 123 | - We need to know the filter size, padding (borders - valid, same), striding (jumps) 124 | - Pooling (POOL) - A technique to reduce the information in an image while maintaining features 125 | - Max pooling, average pooling 126 | - Fully connected (FC) 127 | 128 | - Formula to calculate the shape of convolution: [(n + 2p - f) / s] + 1 129 | - Formula to calculate the number of parameters in convolution: (f * f * PREVIOUS_ACTIVATION_SHAPE + 1) * ACTIVATION_SHAPE 130 | 131 | - https://lodev.org/cgtutor/filtering.html • https://colab.research.google.com/drive/1EiNdAW4gtrObrBSAuuxIt_AqO_Eft491#scrollTo=kDHjf-ehaBqm 132 | 133 | ```python 134 | # Model architecture 135 | model = tf.keras.models.Sequential([ 136 | tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)), 137 | tf.keras.layers.MaxPooling2D(2, 2), 138 | tf.keras.layers.Conv2D(64, (3, 3), activation='relu'), 139 | tf.keras.layers.MaxPooling2D(2, 2), 140 | tf.keras.layers.Flatten(), 141 | tf.keras.layers.Dense(128, activation='relu'), 142 | tf.keras.layers.Dense(10, activation='softmax') 143 | ]) 144 | 145 | model.summary() # To have a look to the architecture of model 146 | ``` 147 | 148 | ```python 149 | import tensorflow as tf 150 | print(tf.__version__) 151 | 152 | mnist = tf.keras.datasets.fashion_mnist 153 | (training_images, training_labels), (test_images, test_labels) = mnist.load_data() 154 | 155 | training_images=training_images.reshape(60000, 28, 28, 1) 156 | training_images=training_images / 255.0 157 | test_images = test_images.reshape(10000, 28, 28, 1) 158 | test_images=test_images/255.0 159 | 160 | model = tf.keras.models.Sequential([ 161 | tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)), 162 | tf.keras.layers.MaxPooling2D(2, 2), 163 | tf.keras.layers.Conv2D(64, (3,3), activation='relu'), 164 | tf.keras.layers.MaxPooling2D(2,2), 165 | l 166 | ]) 167 | model.compile(optimizer='adam', loss='ms', metrics=['accuracy']) 168 | model.summary() 169 | 
model.fit(training_images, training_labels, epochs=10) 170 | test_loss = model.evaluate(test_images, test_labels) 171 | ``` 172 | 173 | ```python 174 | # This code will show us the convolutions graphically 175 | 176 | import matplotlib.pyplot as plt 177 | from tensorflow.keras import models 178 | 179 | f, axarr = plt.subplots(3,4) 180 | FIRST_IMAGE=0 181 | SECOND_IMAGE=23 182 | THIRD_IMAGE=28 183 | CONVOLUTION_NUMBER = 3 184 | 185 | layer_outputs = [layer.output for layer in model.layers] 186 | activation_model = tf.keras.models.Model(inputs = model.input, outputs = layer_outputs) 187 | 188 | for x in range(0,4): 189 | f1 = activation_model.predict(test_images[FIRST_IMAGE].reshape(1, 28, 28, 1))[x] 190 | axarr[0,x].imshow(f1[0, : , :, CONVOLUTION_NUMBER], cmap='inferno') 191 | axarr[0,x].grid(False) 192 | f2 = activation_model.predict(test_images[SECOND_IMAGE].reshape(1, 28, 28, 1))[x] 193 | axarr[1,x].imshow(f2[0, : , :, CONVOLUTION_NUMBER], cmap='inferno') 194 | axarr[1,x].grid(False) 195 | f3 = activation_model.predict(test_images[THIRD_IMAGE].reshape(1, 28, 28, 1))[x] 196 | axarr[2,x].imshow(f3[0, : , :, CONVOLUTION_NUMBER], cmap='inferno') 197 | axarr[2,x].grid(False) 198 | ``` 199 | 200 | ## Using Real-world Images 201 | > [Nobebook](notebooks/Course_1_Part_8_Lesson_2_Notebook.ipynb) 202 | 203 | ```python 204 | # An ImageGenerator can flow images from a directory and perform operations such as resizing them on the fly 205 | import tensorflow as tf 206 | from tensorflow.keras.preprocessing.image import ImageDataGenerator 207 | from tensorflow.keras.optimizers import RMSprop 208 | 209 | # All images will be rescaled by 1./255 210 | train_datagen = ImageDataGenerator(rescale=1/255) 211 | 212 | # Flow training images in batches of 128 using train_datagen generator 213 | train_generator = train_datagen.flow_from_directory( 214 | '/tmp/horse-or-human/', # This is the source directory for training images 215 | target_size=(300, 300), # All images will be resized to 150x150 216 | batch_size=128, 217 | # Since we use binary_crossentropy loss, we need binary labels 218 | class_mode='binary') 219 | 220 | validation_generator = train_datagen.flow_from_directory( 221 | validation_dir, 222 | target_size=(300, 300), 223 | batch_size=32, 224 | class_mode='binary', 225 | ) 226 | 227 | model.compile(loss='binary_crossentropy', 228 | optimizer=RMSprop(lr=0.001), 229 | metrics=['accuracy']) 230 | 231 | history = model.fit_generator( 232 | train_generator, # streames images from directory 233 | steps_per_epoch=8, # 1024 images overall, so 128*8=1024, 128 is the batch size of train_generator 234 | epochs=15, 235 | validation_data=validation_generator, 236 | validation_steps=8, # 256 images, so 32*8=256, 32 is the batch size of validation_generator 237 | verbose=2 # for info 238 | ) 239 | ``` 240 | ```python 241 | import numpy as np 242 | from google.colab import files 243 | from keras.preprocessing import image 244 | 245 | uploaded = files.upload() 246 | 247 | for fn in uploaded.keys(): 248 | # Predicting images 249 | path = "/content/" + fn 250 | img = image.load_img(path, target_size=(300, 300)) 251 | x = image.img_to_array(img) 252 | x = np.expand_dims(x, axis=0) 253 | 254 | images = np.vstack([x]) 255 | classes = model.predict(images, batch_size=10) 256 | print(classes[0]) 257 | 258 | if classes[0] > 0.5: 259 | print(fn + "is a human") 260 | else: 261 | print(fn + "is a horse") 262 | ``` -------------------------------------------------------------------------------- 
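A quick way to sanity-check the convolution shape and parameter-count formulas given in the CNN section above is to build the first few layers of the same model and compare `model.summary()` against the hand calculation. This is a minimal sketch that only assumes TensorFlow 2.x is installed:

```python
import tensorflow as tf

# Shape formula: (n + 2p - f) / s + 1, parameter formula: (f*f*prev_channels + 1) * filters
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # (28 - 3)/1 + 1 = 26 -> (26, 26, 64), params = (3*3*1 + 1)*64 = 640
    tf.keras.layers.MaxPooling2D(2, 2),                                              # 2x2 max pooling halves the spatial dims -> (13, 13, 64)
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),                           # (13 - 3)/1 + 1 = 11 -> (11, 11, 64), params = (3*3*64 + 1)*64 = 36,928
])

model.summary()  # the printed output shapes and parameter counts should match the comments above
```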
/tensorflow-in-practice/natural-language-processing-tensorflow.md: -------------------------------------------------------------------------------- 1 | # [Natural Language Processing in TensorFlow](https://www.coursera.org/learn/natural-language-processing-tensorflow/home/welcome) 2 | 3 | - Week 1: How to convert the text into number representation, Tokenizer, fit_on_texts, texts_to_sequences, pad_sequences 4 | - Week 2: Word Embeddings - Classification problems 5 | - Week 3: Sequence models - RNN, LSTM, classification problems 6 | - Week 4: Sequence models and literature - text generation 7 | 8 | - Week 1 - [Sentiment in text](#Sentiment-in-text) 9 | - Week 2 - [Word Embeddings](#Word-Embeddings) 10 | - Week 3 - [Sequence models](#Sequence-models) 11 | - Week 4 - [Sequence models and literature](#Sequence-models-and-literature) 12 | 13 | ## Sentiment in text 14 | > [Week 1 Notebook](notebooks/Course_3_Week_1(Tokenizer-Sarcasm-Dataset).ipynb) 15 | 16 | - How to load in the texts, pre-process it and set up your data so it can be fed to a neural network. 17 | - https://rishabhmisra.github.io/publications/ 18 | - `Tokenizer` is used to tokenize the sentences, `oov_token=`can be used to encode unknown words 19 | - `fit_on_texts(sentences)` is used to tokenize the list of sentences 20 | - Output: `{'': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}` 21 | - `texts_to_sequences(sentences)` - the method to encode a list of sentences to use those tokens 22 | - Output: `[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]` 23 | 24 | ```py 25 | tokenizer = Tokenizer(oov_token="") 26 | tokenizer.fit_on_texts(sentences) 27 | word_index = tokenizer.word_index 28 | sequences = tokenizer.texts_to_sequences(sentences) 29 | padded = pad_sequences(sequences, padding='post') 30 | ``` 31 | 32 | ## Word Embeddings 33 | > [Week 2 Model Training IMDB Reviews](notebooks/Course_3_Week_2(Model_Training_IMDB_Reviews).ipynb) 34 | 35 | > [Week 2, beautiful code, Sarcasm Classifier](notebooks/Course_3_Week_2(Sarcasm-Classifier).ipynb) 36 | 37 | > [Week 2, subwords](notebooks/Course_3_Week_2(Subwords).ipynb) - shows that Embeddings do not work with sequence of words 38 | 39 |
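Putting the Week 1 Tokenizer pipeline together with an `Embedding` layer gives a minimal end-to-end classifier sketch like the one below. The sentences and labels are made-up toy data, the `vocab_size` / `embedding_dim` / `max_length` values are only illustrative, and it assumes TensorFlow 2.x:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ['I loved this movie', 'what a great film', 'utterly boring', 'I hated every minute']
labels = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

vocab_size, embedding_dim, max_length = 100, 16, 8

tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),  # input_dim, output_dim, input_length
    tf.keras.layers.GlobalAveragePooling1D(),  # average the word vectors of each sentence
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(padded, labels, epochs=10, verbose=0)

print(model.predict(pad_sequences(tokenizer.texts_to_sequences(['a great movie']), maxlen=max_length, padding='post')))
```

Averaging the embeddings keeps the classifier small and ignores word order entirely, which is exactly the limitation the Sequence models week addresses.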
40 | 41 | - In the second week, we learn to prepare the data with Tokenizer API, and then teach our model 42 | - TensorFlow Datasets: https://www.tensorflow.org/datasets 43 |
44 | - https://github.com/tensorflow/datasets/tree/master/docs/catalog 45 | - https://projector.tensorflow.org - to visualize the data 46 | 47 | - **What is the purpose of the embedding dimension?** 48 | > It is the number of dimensions for the **vector representing** the word encoding 49 | 50 | - When tokenizing a corpus, what does the num_words=n parameter do? 51 | > It specifies the maximum number of words to be tokenized, and picks the most common ‘n’ words 52 | 53 | - NOTE: Sequence becomes much more important when dealing with subwords, but we’re ignoring word positions. 54 | 55 | - It must specify 3 arguments, [reference](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/): 56 | 57 | - **input_dim**: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-999, then the size of the vocabulary would be 1000 words. (all words) 58 | - **output_dim**: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem. 59 | - **input_length**: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 100 words, this would be 100. (words in a sentence) 60 | 61 | ```py 62 | def plot_graphs(history, string): 63 | plt.plot(history.history[string]) 64 | plt.plot(history.history['val_'+string]) 65 | 66 | plt.xlabel("Epochs") 67 | plt.ylabel(string) 68 | plt.legend([string, 'val_'+string]) 69 | plt.show() 70 | 71 | plot_graphs(history, "accuracy") 72 | plot_graphs(history, "loss") 73 | ``` 74 | 75 | ## Sequence models 76 | > [Week 3 IMDB](notebooks/Course_3_Week_3(IMDB).ipynb) - RNN, Embedding, Conv 1D experimenting 77 | 78 | > We looked first at Tokenizing words to get numeric values from them, and then using Embeddings to group words of similar meaning depending on how they were labelled. This gave you a good, but rough, sentiment analysis -- words such as 'fun' and 'entertaining' might show up in a positive movie review, and 'boring' and 'dull' might show up in a negative one. But sentiment can also be determined by the sequence in which words appear. For example, you could have 'not fun', which of course is the opposite of 'fun'. This week you'll start digging into a variety of model formats that are used in training models to understand context in sequence! 79 | 80 | - We used **word embeddings** to sentiment words. But what if we can use RNN and LSTM to predict the group of words. We can analyse in which relative ordering the words are coming. 81 | 82 | -
That's the classical ML, it doesn't take into account the sequences. For example, like **Fibonacci series**, we must know previous result to fit it into next input. 83 | 84 | -
So, that's the idea behind RNN (recurrent neural network). The output of previous is the input to the next. 85 | 86 | -
**LSTMs** have an additional pipeline of contexts called cell state. They can be bidirectional too. 87 | 88 | - RNN, LSTM [video](https://www.youtube.com/watch?v=WCUNPb-5EYI) 89 | - GRU - Gated recurrent union `tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64)` 90 | 91 | ```python 92 | """LSTM in code""" 93 | model = tf.keras.Sequential([ 94 | tf.keras.layers.Embedding(tokenizer.vocab_size, 64), 95 | tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64), return_sequences=True), # You need to define `return_sequences=True` when stacking two LSTMs 96 | # tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64) 97 | tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)), 98 | tf.keras.layers.Dense(64, activation='relu'), 99 | tf.keras.layers.Dense(1, activation='sigmoid'), 100 | ]) 101 | ``` 102 | ```python 103 | """Using a convolutional network 1D""" 104 | model = tf.keras.Sequential([ 105 | tf.keras.layers.Embedding(tokenizer.vocab_size, 64), 106 | tf.keras.layers.Conv1D(128, 5, activation='relu'), 107 | tf.keras.layers.GlobalAveragePooling1D(), 108 | tf.keras.layers.Dense(64, activation='relu'), 109 | tf.keras.layers.Dense(1, activation='sigmoid') 110 | ]) 111 | ``` 112 | ```python 113 | model = tf.keras.Sequential([ 114 | tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length), # weights=[embeddings_matrix], trainable=False 115 | tf.keras.layers.Dropout(0.2), 116 | tf.keras.layers.Conv1D(64, 5, activation='relu'), 117 | tf.keras.layers.MaxPooling1D(pool_size=4), 118 | tf.keras.layers.LSTM(64), 119 | tf.keras.layers.Dense(1, activation='sigmoid') 120 | ]) 121 | model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy']) 122 | model.summary() 123 | 124 | num_epochs = 50 125 | ``` 126 | 127 | ## Sequence models and literature 128 | > **Text generation** 129 | 130 | > [Week 4 Sheckspire Text Generation](notebooks/Course_3_Week_4_Lesson_1_(Sheckspire_Text_Generation).ipynb) 131 | 132 | > Wrap up from course: You’ve been experimenting with NLP for text classification over the last few weeks. Next week you’ll switch gears -- and take a look at using the tools that you’ve learned to predict text, which ultimately means you can create text. By learning sequences of words you can predict the most common word that comes next in the sequence, and thus, when starting from a new sequence of words you can create a model that builds on them. You’ll take different training sets -- like traditional Irish songs, or Shakespeare poetry, and learn how to create new sets of words using their embeddings! 
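The notebook linked above builds the full version of this; the sketch below only shows, on a toy two-line corpus, the shape the next-word training data usually takes (every line becomes a set of n-gram prefixes and the last token of each prefix is the label) plus a small model that learns to predict it. Layer sizes and the epoch count are arbitrary.

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

corpus = ['in the town of athy one jeremy lanigan',
          'battered away til he hadnt a pound']  # toy stand-in for a real lyrics/poetry corpus

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# Turn each line into n-gram prefixes: [w1 w2], [w1 w2 w3], ...
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i + 1])

max_sequence_len = max(len(seq) for seq in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

xs = input_sequences[:, :-1]     # everything except the last token is the input
labels = input_sequences[:, -1]  # the last token is the word to predict
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_words, 64, input_length=max_sequence_len - 1),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20)),
    tf.keras.layers.Dense(total_words, activation='softmax')  # one probability per word in the vocabulary
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(xs, ys, epochs=100, verbose=0)
```

To generate text, repeatedly tokenize and pad the seed phrase, take the argmax of `model.predict(...)`, append the predicted word, and feed the longer phrase back in.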
133 | 134 | - **Finding what the next word should be** 135 | 136 | -------------------------------------------------------------------------------- /tensorflow-in-practice/notebooks/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Rustam-Z/machine-learning/5001d7d103642a61f82492df3a968aa6f4836601/tensorflow-in-practice/notebooks/.DS_Store -------------------------------------------------------------------------------- /tensorflow-in-practice/notebooks/Course_3_Week_2(Model_Training_IMDB_Reviews).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language_info": { 4 | "codemirror_mode": { 5 | "name": "ipython", 6 | "version": 3 7 | }, 8 | "file_extension": ".py", 9 | "mimetype": "text/x-python", 10 | "name": "python", 11 | "nbconvert_exporter": "python", 12 | "pygments_lexer": "ipython3", 13 | "version": "3.8.6" 14 | }, 15 | "orig_nbformat": 2, 16 | "kernelspec": { 17 | "name": "python3", 18 | "display_name": "Python 3.8.6 64-bit ('tf': conda)" 19 | }, 20 | "metadata": { 21 | "interpreter": { 22 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a" 23 | } 24 | }, 25 | "interpreter": { 26 | "hash": "4ea0e157563bacde0b7fd8dc93db6051c9678d5eadbd4117abf1a4cecbc8cd1a" 27 | } 28 | }, 29 | "nbformat": 4, 30 | "nbformat_minor": 2, 31 | "cells": [ 32 | { 33 | "cell_type": "code", 34 | "execution_count": 1, 35 | "metadata": {}, 36 | "outputs": [ 37 | { 38 | "output_type": "stream", 39 | "name": "stdout", 40 | "text": [ 41 | "2.4.0-rc0\n" 42 | ] 43 | } 44 | ], 45 | "source": [ 46 | "import tensorflow as tf \n", 47 | "import tensorflow_datasets as tfds\n", 48 | "from tensorflow.keras.preprocessing.text import Tokenizer\n", 49 | "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", 50 | "import numpy as np \n", 51 | "import io\n", 52 | "\n", 53 | "print(tf.__version__)" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "output_type": "execute_result", 63 | "data": { 64 | "text/plain": [ 65 | "True" 66 | ] 67 | }, 68 | "metadata": {}, 69 | "execution_count": 2 70 | } 71 | ], 72 | "source": [ 73 | "tf.executing_eagerly() # if 1.x use `tf.enable_eager_execution()`" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": { 80 | "tags": [] 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "imdb, info = tfds.load(\"imdb_reviews\", with_info=True, as_supervised=True) # loading the data" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 5, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "output_type": "execute_result", 94 | "data": { 95 | "text/plain": [ 96 | "['abstract_reasoning', 'accentdb', 'aeslc', 'aflw2k3d', 'ag_news_subset']" 97 | ] 98 | }, 99 | "metadata": {}, 100 | "execution_count": 5 101 | } 102 | ], 103 | "source": [ 104 | "tfds.list_builders()[:5] # the list of all datasets" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 7, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "train_data, test_data = imdb['train'], imdb['test'] # 25k train and 25k testing\n", 114 | "\n", 115 | "training_sentences = []\n", 116 | "training_labels = []\n", 117 | "testing_sentences = []\n", 118 | "testing_labels = []\n", 119 | "\n", 120 | "for sample, label in train_data:\n", 121 | " training_sentences.append(sample.numpy().decode('utf8'))\n", 122 | " 
training_labels.append(label.numpy())\n", 123 | "\n", 124 | "for sample, label in test_data:\n", 125 | " testing_sentences.append(sample.numpy().decode('utf8'))\n", 126 | " testing_labels.append(label.numpy())" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 8, 132 | "metadata": {}, 133 | "outputs": [ 134 | { 135 | "output_type": "stream", 136 | "name": "stdout", 137 | "text": [ 138 | "I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.\n>> label 0\n" 139 | ] 140 | } 141 | ], 142 | "source": [ 143 | "print(training_sentences[1]) \n", 144 | "print(\">> label\", training_labels[1]) # 0 negative, 1 pos" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 9, 150 | "metadata": {}, 151 | "outputs": [ 152 | { 153 | "output_type": "stream", 154 | "name": "stdout", 155 | "text": [ 156 | "25000\n25000\n25000\n25000\n" 157 | ] 158 | } 159 | ], 160 | "source": [ 161 | "print(len(training_sentences))\n", 162 | "print(len(training_labels))\n", 163 | "print(len(testing_sentences))\n", 164 | "print(len(testing_labels))" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 10, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "# converting to numpy arrays\n", 174 | "training_labels_final = np.array(training_labels) \n", 175 | "testing_labels_final = np.array(testing_labels)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 11, 181 | "metadata": {}, 182 | "outputs": [ 183 | { 184 | "output_type": "execute_result", 185 | "data": { 186 | "text/plain": [ 187 | "(25000,)" 188 | ] 189 | }, 190 | "metadata": {}, 191 | "execution_count": 11 192 | } 193 | ], 194 | "source": [ 195 | "training_labels_final.shape" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 12, 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "output_type": "execute_result", 205 | "data": { 206 | "text/plain": [ 207 | "(25000, 120)" 208 | ] 209 | }, 210 | "metadata": {}, 211 | "execution_count": 12 212 | } 213 | ], 214 | "source": [ 215 | "# Preparing data for training by tokenizing\n", 216 | "\n", 217 | "vocab_size = 10000\n", 218 | "embedding_dim = 16\n", 219 | "max_length = 120\n", 220 | "trunc_type='post' # [4, 4, 5, 6, ..... 
0, 0, 0] - zeros at the end \n", 221 | "oov_tok = \"\" # out of vocabulary\n", 222 | "\n", 223 | "tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)\n", 224 | "tokenizer.fit_on_texts(training_sentences)\n", 225 | "word_index = tokenizer.word_index # all 10000 words with tokens in a dictionary \n", 226 | "sequences = tokenizer.texts_to_sequences(training_sentences) # all sentences represented only with tokens\n", 227 | "padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type) # make all sentences the same size\n", 228 | "\n", 229 | "# the same for testing set\n", 230 | "testing_sequences = tokenizer.texts_to_sequences(testing_sentences)\n", 231 | "testing_padded = pad_sequences(testing_sequences, maxlen=max_length)\n", 232 | "\n", 233 | "padded.shape" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 89, 246 | "metadata": {}, 247 | "outputs": [ 248 | { 249 | "output_type": "stream", 250 | "name": "stdout", 251 | "text": [ 252 | "? ? ? ? ? ? ? ? i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly slow and boring things seemed to happen but with no explanation of what was causing them or why i admit i may have missed part of the film but i watched the majority of it and everything just seemed to happen of its own without any real concern for anything else i cant recommend this film at all\n\nI have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.\n" 253 | ] 254 | } 255 | ], 256 | "source": [ 257 | "reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])\n", 258 | "\n", 259 | "def decode_review(text):\n", 260 | " return ' '.join([reverse_word_index.get(i, '?') for i in text])\n", 261 | "\n", 262 | "print(decode_review(padded[1]))\n", 263 | "print()\n", 264 | "print(training_sentences[1])" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 64, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "output_type": "stream", 274 | "name": "stdout", 275 | "text": [ 276 | "I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. 
I cant recommend this film at all.\n>> original length 617\n>> label 0\n\n[11, 26, 75, 571, 6, 805, 2354, 313, 106, 19, 12, 7, 629, 686, 6, 4, 2219, 5, 181, 584, 64, 1454, 110, 2263, 3, 3951, 21, 2, 1, 3, 258, 41, 4677, 4, 174, 188, 21, 12, 4078, 11, 1578, 2354, 86, 2, 20, 14, 1907, 2, 112, 940, 14, 1811, 1340, 548, 3, 355, 181, 466, 6, 591, 19, 17, 55, 1817, 5, 49, 14, 4044, 96, 40, 136, 11, 972, 11, 201, 26, 1046, 171, 5, 2, 20, 19, 11, 294, 2, 2155, 5, 10, 3, 283, 41, 466, 6, 591, 5, 92, 203, 1, 207, 99, 145, 4382, 16, 230, 332, 11, 2486, 384, 12, 20, 31, 30]\n>> sequence lenght 112\n\n[ 0 0 0 0 0 0 0 0 11 26 75 571 6 805\n 2354 313 106 19 12 7 629 686 6 4 2219 5 181 584\n 64 1454 110 2263 3 3951 21 2 1 3 258 41 4677 4\n 174 188 21 12 4078 11 1578 2354 86 2 20 14 1907 2\n 112 940 14 1811 1340 548 3 355 181 466 6 591 19 17\n 55 1817 5 49 14 4044 96 40 136 11 972 11 201 26\n 1046 171 5 2 20 19 11 294 2 2155 5 10 3 283\n 41 466 6 591 5 92 203 1 207 99 145 4382 16 230\n 332 11 2486 384 12 20 31 30]\n" 277 | ] 278 | }, 279 | { 280 | "output_type": "execute_result", 281 | "data": { 282 | "text/plain": [ 283 | "(120,)" 284 | ] 285 | }, 286 | "metadata": {}, 287 | "execution_count": 64 288 | } 289 | ], 290 | "source": [ 291 | "print(training_sentences[1]) \n", 292 | "print(\">> original length\", len(training_sentences[1]))\n", 293 | "print(\">> label\", training_labels[1])\n", 294 | "\n", 295 | "print()\n", 296 | "print(sequences[1])\n", 297 | "print(\">> sequence lenght\", len(sequences[1]))\n", 298 | "print()\n", 299 | "print(padded[1])\n", 300 | "padded[1].shape" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 56, 306 | "metadata": {}, 307 | "outputs": [ 308 | { 309 | "output_type": "execute_result", 310 | "data": { 311 | "text/plain": [ 312 | "'bintang'" 313 | ] 314 | }, 315 | "metadata": {}, 316 | "execution_count": 56 317 | } 318 | ], 319 | "source": [ 320 | "# len(list(word_index)) # 90000 appr\n", 321 | "list(word_index)[57565] # even we defined vocab_size = 10000, tensorflow tokenizes all words, but in backed end it will work with 10000 words, \n", 322 | "# num_words=n parameter specifies the maximum number of words to be tokenized, and picks the most common ‘n’ words" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 71, 328 | "metadata": {}, 329 | "outputs": [ 330 | { 331 | "output_type": "stream", 332 | "name": "stdout", 333 | "text": [ 334 | "Model: \"sequential_2\"\n_________________________________________________________________\nLayer (type) Output Shape Param # \n=================================================================\nembedding_2 (Embedding) (None, 120, 16) 160000 \n_________________________________________________________________\nflatten_2 (Flatten) (None, 1920) 0 \n_________________________________________________________________\ndense_4 (Dense) (None, 6) 11526 \n_________________________________________________________________\ndense_5 (Dense) (None, 1) 7 \n=================================================================\nTotal params: 171,533\nTrainable params: 171,533\nNon-trainable params: 0\n_________________________________________________________________\n" 335 | ] 336 | } 337 | ], 338 | "source": [ 339 | "model = tf.keras.Sequential([\n", 340 | " tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n", 341 | " tf.keras.layers.Flatten(), # GlobalAveragePooling1D()\n", 342 | " tf.keras.layers.Dense(6, activation='relu'),\n", 343 | " tf.keras.layers.Dense(1, 
activation='sigmoid')\n", 344 | "])\n", 345 | "model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n", 346 | "model.summary()" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 106, 352 | "metadata": {}, 353 | "outputs": [ 354 | { 355 | "output_type": "stream", 356 | "name": "stdout", 357 | "text": [ 358 | "Epoch 1/10\n", 359 | "782/782 [==============================] - 1s 1ms/step - loss: 9.7313e-05 - accuracy: 1.0000 - val_loss: 0.8354 - val_accuracy: 0.8314\n", 360 | "Epoch 2/10\n", 361 | "782/782 [==============================] - 1s 1ms/step - loss: 6.0164e-05 - accuracy: 1.0000 - val_loss: 0.8735 - val_accuracy: 0.8308\n", 362 | "Epoch 3/10\n", 363 | "782/782 [==============================] - 1s 1ms/step - loss: 3.7304e-05 - accuracy: 1.0000 - val_loss: 0.9050 - val_accuracy: 0.8318\n", 364 | "Epoch 4/10\n", 365 | "782/782 [==============================] - 1s 1ms/step - loss: 2.3330e-05 - accuracy: 1.0000 - val_loss: 0.9406 - val_accuracy: 0.8309\n", 366 | "Epoch 5/10\n", 367 | "782/782 [==============================] - 1s 1ms/step - loss: 1.5115e-05 - accuracy: 1.0000 - val_loss: 0.9730 - val_accuracy: 0.8313\n", 368 | "Epoch 6/10\n", 369 | "782/782 [==============================] - 1s 1ms/step - loss: 9.3207e-06 - accuracy: 1.0000 - val_loss: 1.0077 - val_accuracy: 0.8312\n", 370 | "Epoch 7/10\n", 371 | "782/782 [==============================] - 1s 1ms/step - loss: 6.1326e-06 - accuracy: 1.0000 - val_loss: 1.0429 - val_accuracy: 0.8307\n", 372 | "Epoch 8/10\n", 373 | "782/782 [==============================] - 1s 1ms/step - loss: 3.8306e-06 - accuracy: 1.0000 - val_loss: 1.0734 - val_accuracy: 0.8310\n", 374 | "Epoch 9/10\n", 375 | "782/782 [==============================] - 1s 1ms/step - loss: 2.4845e-06 - accuracy: 1.0000 - val_loss: 1.1086 - val_accuracy: 0.8311\n", 376 | "Epoch 10/10\n", 377 | "782/782 [==============================] - 1s 1ms/step - loss: 1.6163e-06 - accuracy: 1.0000 - val_loss: 1.1410 - val_accuracy: 0.8310\n" 378 | ] 379 | }, 380 | { 381 | "output_type": "execute_result", 382 | "data": { 383 | "text/plain": [ 384 | "" 385 | ] 386 | }, 387 | "metadata": {}, 388 | "execution_count": 106 389 | } 390 | ], 391 | "source": [ 392 | "# Training own modelg\n", 393 | "\n", 394 | "num_epochs = 10\n", 395 | "model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": 76, 401 | "metadata": {}, 402 | "outputs": [ 403 | { 404 | "output_type": "execute_result", 405 | "data": { 406 | "text/plain": [ 407 | "[,\n", 408 | " ,\n", 409 | " ,\n", 410 | " ]" 411 | ] 412 | }, 413 | "metadata": {}, 414 | "execution_count": 76 415 | } 416 | ], 417 | "source": [ 418 | "e = model.layers\n", 419 | "e" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 79, 425 | "metadata": {}, 426 | "outputs": [ 427 | { 428 | "output_type": "stream", 429 | "name": "stdout", 430 | "text": [ 431 | "(10000, 16)\n" 432 | ] 433 | } 434 | ], 435 | "source": [ 436 | "e = model.layers[0]\n", 437 | "weights = e.get_weights()[0]\n", 438 | "print(weights.shape) # shape: (vocab_size, embedding_dim)" 439 | ] 440 | }, 441 | { 442 | "cell_type": "code", 443 | "execution_count": 87, 444 | "metadata": {}, 445 | "outputs": [ 446 | { 447 | "output_type": "execute_result", 448 | "data": { 449 | "text/plain": [ 450 | "array([-0.08942658, 0.00486923, -0.05935808, -0.06226563, -0.04867279,\n", 451 | " 
0.04237117, 0.04769849, 0.03356505, -0.03730453, 0.00785854,\n", 452 | " 0.03105144, 0.0776749 , 0.05284716, 0.025134 , -0.03554538,\n", 453 | " -0.04298926], dtype=float32)" 454 | ] 455 | }, 456 | "metadata": {}, 457 | "execution_count": 87 458 | } 459 | ], 460 | "source": [ 461 | "weights[1] # each word has its own weight" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 86, 467 | "metadata": { 468 | "tags": [] 469 | }, 470 | "outputs": [ 471 | { 472 | "output_type": "stream", 473 | "name": "stdout", 474 | "text": [ 475 | ">> word 1 \n>> embeddings [-0.08942658 0.00486923 -0.05935808 -0.06226563 -0.04867279 0.04237117\n 0.04769849 0.03356505 -0.03730453 0.00785854 0.03105144 0.0776749\n 0.05284716 0.025134 -0.03554538 -0.04298926]\n>> word 2 the\n>> embeddings [-0.08670148 0.01641071 -0.02393427 -0.07146466 0.01603186 0.06126428\n 0.06148115 0.00766911 0.04187395 0.05556076 0.01930173 0.0744463\n 0.01907398 0.01339489 0.00941497 -0.0138381 ]\n>> word 3 and\n>> embeddings [ 0.01113727 -0.03538265 -0.05725451 -0.01636735 -0.00596739 -0.00635358\n 0.03053617 0.05559737 0.0871934 0.04494542 0.02274616 0.07229666\n 0.01994341 0.01223046 -0.05789011 -0.04256919]\n>> word 4 a\n>> embeddings [-0.05104827 -0.01813413 -0.04630557 -0.02343593 -0.03323779 0.06510878\n -0.00737528 0.02424134 0.0825871 0.00570629 -0.01472468 0.12047923\n 0.01702527 -0.04734353 -0.05681538 -0.06954415]\n" 476 | ] 477 | } 478 | ], 479 | "source": [ 480 | "out_v = io.open('vecs.tsv', 'w', encoding='utf-8')\n", 481 | "out_m = io.open('meta.tsv', 'w', encoding='utf-8')\n", 482 | "\n", 483 | "for word_num in range(1, vocab_size):\n", 484 | " word = reverse_word_index[word_num]\n", 485 | " embeddings = weights[word_num]\n", 486 | " \n", 487 | " if word_num < 5:\n", 488 | " print(f\">> word {word_num}\", word)\n", 489 | " print(\">> embeddings\", embeddings)\n", 490 | "\n", 491 | " out_m.write(word + \"\\n\")\n", 492 | " out_v.write('\\t'.join([str(x) for x in embeddings]) + \"\\n\")\n", 493 | "out_v.close()\n", 494 | "out_m.close()" 495 | ] 496 | }, 497 | { 498 | "cell_type": "code", 499 | "execution_count": 99, 500 | "metadata": {}, 501 | "outputs": [ 502 | { 503 | "output_type": "stream", 504 | "name": "stdout", 505 | "text": [ 506 | "Please install GPU version of TF\n" 507 | ] 508 | } 509 | ], 510 | "source": [ 511 | "if tf.test.gpu_device_name(): \n", 512 | " print('Default GPU Device:'.format(tf.test.gpu_device_name()))\n", 513 | "else:\n", 514 | " print(\"Please install GPU version of TF\")" 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": 102, 520 | "metadata": {}, 521 | "outputs": [ 522 | { 523 | "output_type": "stream", 524 | "name": "stdout", 525 | "text": [ 526 | "[]\n" 527 | ] 528 | }, 529 | { 530 | "output_type": "execute_result", 531 | "data": { 532 | "text/plain": [ 533 | "[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]" 534 | ] 535 | }, 536 | "metadata": {}, 537 | "execution_count": 102 538 | } 539 | ], 540 | "source": [ 541 | "print(tf.config.list_physical_devices('GPU'))\n", 542 | "tf.config.list_physical_devices()" 543 | ] 544 | } 545 | ] 546 | } -------------------------------------------------------------------------------- /tensorflow-in-practice/notebooks/README.md: -------------------------------------------------------------------------------- 1 | # Highlighted Notebooks 2 | 3 | ### Course 1 4 | > [Fashion MNIST with CNN](Course_1_Part_6_Lesson_2_Notebook.ipynb) 5 | 6 | > [Human vs Horse, 
flow_from_directory()](Course_1_Part_8_Lesson_2_Notebook.ipynb) 7 | 8 | ### Course 2 9 | > [** Cat vs Dog, flow_from_directory(), drawings of loss and accuracy, predict on new image](Course_2_Part_2_Lesson_2_Notebook.ipynb) 10 | 11 | > [** With augmentation, good code collected in one cell, plots of accuracy and loss](Course_2_Part_4_Lesson_2_Notebook_(Cats_v_Dogs_Augmentation).ipynb) 12 | 13 | > [** Transfer Learning, Dropout](Course_2_Part_6_Lesson_3_Notebook_(Transfer_Learning).ipynb) 14 | 15 | ### Course 3 16 | > [** Word Embeddings with Tokenizer](Course_3_Week_2(Model_Training_IMDB_Reviews).ipynb) - classifying the reviews in IMDB 17 | > [** Beautiful code, classifying sarcastic news](Course_3_Week_2(Sarcasm-Classifier).ipynb) 18 | -------------------------------------------------------------------------------- /tensorflow-in-practice/sequences-time-series-and-prediction.md: -------------------------------------------------------------------------------- 1 | # [Sequences, Time Series and Prediction](https://www.coursera.org/learn/tensorflow-sequences-time-series-and-prediction) 2 | 3 | - Sequences and Prediction 4 | - Deep Neural Networks for Time Series 5 | - Recurrent Neural Networks for Time Series 6 | - Real-world time series data 7 | 8 | 9 | ## Sequences and Prediction 10 | > Handling sequential time series data -- where values change over time, like the temperature on a particular day, stock prices, or the number of visitors to your web site. 11 | 12 | > Predicting future values in these time series. We need to find the pattern in order to predict new values. 13 | 14 | - Time series can be used in speech recognition 15 | 16 | - Types: 17 | - **Trend** - e.g. an upward-facing slope
18 | - **Seasonality**
19 | - Autocorrelation 20 | - Noise 21 | - Non-stationary time series
22 | 23 | - **Train, validation and test sets** 24 | - **Trend + Seasonality + Noise**
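- As a quick illustration of those ingredients, here is a small sketch (my own toy constants, not the course notebook) that builds a synthetic series by adding a trend, a seasonal pattern and noise:

```python
# Synthetic series = baseline + trend + seasonality + noise (all constants are illustrative)
import numpy as np
import matplotlib.pyplot as plt

time = np.arange(4 * 365)                                   # four "years" of daily steps

trend = 0.05 * time                                         # slow upward drift
seasonality = 20 * np.sin(2 * np.pi * (time % 365) / 365)   # repeats every 365 steps
noise = np.random.default_rng(42).normal(scale=2.0, size=len(time))

series = 10 + trend + seasonality + noise

plt.plot(time, series)
plt.xlabel("Time step")
plt.ylabel("Value")
plt.show()
```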
25 | - **Naive forecasting** - take the last value and assume that the next value will be the same 26 | - **Fixed partitioning (fixed forecasting)** - split the series into fixed training, validation and test periods; if the data is seasonal, each period should contain a whole number of seasons (e.g. 1, 2 or 3 years). Train on the training period and tune hyperparameters on the validation period, then retrain on training + validation and evaluate on the test period; finally, retrain once more including the test data before forecasting the future (see the sketch below).
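- A minimal sketch of a fixed split plus a naive forecast (the toy series and the split point are assumptions):

```python
# Fixed partitioning + naive forecast on a toy series
import numpy as np

time = np.arange(4 * 365)
series = 10 + 0.05 * time + 20 * np.sin(2 * np.pi * (time % 365) / 365)   # toy trend + seasonality

split_time = 3 * 365                                   # first three "years" -> training period
x_train, x_valid = series[:split_time], series[split_time:]

# Naive forecast: each prediction is simply the previous observed value
naive_forecast = series[split_time - 1:-1]
print(x_valid[:3])
print(naive_forecast[:3])
```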
27 | - **Roll-forward partitioning** - we start with a short training period and gradually increase it, say by one day or one week at a time. At each iteration we train the model on the training period and use it to forecast the following day, or the following week, in the validation period (sketched below).
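- Roll-forward partitioning can be sketched as a loop that keeps growing the training window; here a naive forecast stands in for whatever model you would actually retrain at each step (window sizes are assumptions):

```python
# Roll-forward partitioning: extend the training period step by step and forecast the next week
import numpy as np

time = np.arange(4 * 365)
series = 10 + 0.05 * time + 20 * np.sin(2 * np.pi * (time % 365) / 365)

start, horizon = 2 * 365, 7                      # initial training window, forecast one week at a time
window_errors = []
for end in range(start, len(series) - horizon, horizon):
    train = series[:end]                         # training period grows every iteration
    forecast = np.repeat(train[-1], horizon)     # stand-in "model": naive forecast
    actual = series[end:end + horizon]
    window_errors.append(np.abs(forecast - actual).mean())

print("mean MAE across roll-forward windows:", np.mean(window_errors))
```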
28 | 29 | - **Metrics for evaluating performance** 30 | - Common choices are MSE / RMSE, MAE and MAPE, all computed from the errors (forecasts minus the actual values); see the sketch below.
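- A sketch of how these metrics are computed with NumPy (the toy numbers are made up; `tf.keras.metrics.mean_absolute_error` gives the same MAE if you prefer Keras):

```python
# Forecast-quality metrics: MSE, RMSE, MAE, MAPE (toy numbers)
import numpy as np

x_valid = np.array([10.0, 12.0, 14.0, 16.0])      # actual values
forecast = np.array([11.0, 11.5, 14.5, 15.0])     # model predictions

errors = forecast - x_valid
mse = np.square(errors).mean()          # penalises large errors more heavily
rmse = np.sqrt(mse)                     # back in the same units as the series
mae = np.abs(errors).mean()             # treats all error sizes proportionally
mape = np.abs(errors / x_valid).mean()  # error as a fraction of the actual values

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  MAPE={mape:.3%}")
```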
31 | 32 | ## Deep Neural Networks for Time Series 33 | 34 | ## Recurrent Neural Networks for Time Series 35 | 36 | ## Real-world time series data 37 | --------------------------------------------------------------------------------