├── FaceRecognition
│   ├── README.md
│   ├── Training.ipynb
│   ├── detect_recognize.py
│   └── result.png
├── HackerRank_MaximumSubArray.py
├── HackerRank_PalindromeIndex.py
├── HackerRank_TwoStrings.py
├── KaggleAmazon
│   ├── README.md
│   ├── Understanding the Amazon from Space.ipynb
│   ├── predict_keras.py
│   ├── predict_tf.py
│   └── train_keras.py
├── KaggleBikeSharing
│   ├── Kaggle_BikeSharing.py
│   ├── Kaggle_BikeSharing_Explanations_French.pdf
│   └── README.md
├── KaggleInstaCart
│   ├── Data Exploration.ipynb
│   └── Predictions.ipynb
├── KaggleMovieRating
│   ├── Exploration.ipynb
│   ├── README.md
│   ├── genre.json
│   ├── imdb_movie_content.py
│   ├── movie_budget.json
│   └── parser.py
├── KaggleSoccer
│   ├── Parsing Soccer Ways.ipynb
│   ├── Predicting Ratings.ipynb
│   ├── README.md
│   ├── df.pkl
│   ├── dumper.py
│   └── parser.py
├── KaggleTaxiTrip
│   ├── Exploring the dataset.ipynb
│   ├── README.md
│   └── pic
│       ├── download.png
│       ├── feat_importance.png
│       ├── nyc_clusters.png
│       ├── outliners.png
│       └── rush_hour.png
├── KaggleTitanic
│   ├── Notebook.ipynb
│   └── README.md
├── KaggleWiki
│   ├── README.md
│   └── wikipedia_parser.py
├── Kaggle_KobeShots.ipynb
├── ObjectDetection
│   ├── color_extract.py
│   └── recognition.py
├── ProjectMovieRating
│   ├── .ipynb_checkpoints
│   │   └── Exploration-checkpoint.ipynb
│   ├── Exploration.ipynb
│   ├── README.md
│   ├── __pycache__
│   │   ├── imdb_movie_content.cpython-35.pyc
│   │   └── parser.cpython-35.pyc
│   ├── genre.json
│   ├── imdb_movie_content.py
│   ├── movie_budget.json
│   ├── movie_contents.json
│   ├── movie_contents9.json
│   └── parser.py
├── ProjectSoccer
│   ├── Predicting Ratings.ipynb
│   ├── README.md
│   └── df.pkl
├── README.md
├── TwitterParsing
│   ├── README.md
│   ├── com.alexattia.machinelearning.downloadflashcards.plist
│   ├── config.py
│   ├── config_downl.py
│   ├── download_pics.py
│   ├── pic.png
│   ├── pic2.png
│   ├── run.sh
│   └── send_pictures.py
└── pics
    ├── corr_matrix.png
    ├── features.png
    ├── psg_stats.png
    └── psg_ste.png

/FaceRecognition/README.md:
--------------------------------------------------------------------------------
## Face Recognition

Modern face recognition with deep learning and the HOG algorithm.

1. Find faces in the image (HOG algorithm)
2. Affine transformations (face alignment using an ensemble of regression trees)
3. Encoding faces (FaceNet)
4. Make a prediction (Linear SVM)

We use the [Histogram of Oriented Gradients](http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf) (HOG) method. Instead of computing gradients for every pixel of the image (far too much detail), we compute the weighted vote of gradient orientations over 16x16 pixel squares. Afterward, we have a simple representation (a HOG image) that captures the basic structure of a face.
All we have to do is find the part of our image that looks the most similar to a known trained HOG pattern.
For this technique, we use the dlib Python library to generate and view HOG representations of images.
```
face_detector = dlib.get_frontal_face_detector()
detected_faces = face_detector(image, 1)
```

After isolating the faces in our image, we need to warp (pose and project) the picture so the face is always in the same place. To do this, we use the [face landmark estimation algorithm](http://www.csc.kth.se/~vahidk/papers/KazemiCVPR14.pdf). Following this method, there are 68 specific points (landmarks) on every face, and we train a machine learning algorithm to find these 68 landmarks on any face.
```
face_pose_predictor = dlib.shape_predictor(predictor_model)
pose_landmarks = face_pose_predictor(img, f)
```
After finding those landmarks, we apply affine transformations (rotation, scaling and shearing) to the image so that the eyes and mouth are centered as well as possible.
```
face_aligner = openface.AlignDlib(predictor_model)
alignedFace = face_aligner.align(534, image, face_rect, landmarkIndices=openface.AlignDlib.OUTER_EYES_AND_NOSE)
```

The next step is encoding the detected face. For this, we use deep learning. We train a neural net to generate [128 measurements (a face embedding)](http://www.cv-foundation.org/openaccess/content_cvpr_2015/app/1A_089.pdf) for each face.
The training process works by looking at 3 face images at a time:
- Load two training face images of the same known person and generate the 128 measurements for each of them
- Load a picture of a different person and generate its 128 measurements
Then we tweak the neural network slightly so that the measurements for the same person get slightly closer, while the measurements for the two different people get slightly further apart.
Once the network has been trained, it can generate measurements for any face, even ones it has never seen before!
```
face_encoder = dlib.face_recognition_model_v1(face_recognition_model)
face_encoding = np.array(face_encoder.compute_face_descriptor(image, pose_landmarks, 1))
```

Finally, we need a classifier (a linear SVM or any other classifier) to find the person in our database of known people whose measurements are closest to those of our test image. We train the classifier with the measurements as input.

Thanks to Adam Geitgey, who wrote a great [post](https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78) about this; I followed his pipeline.
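Putting the four steps together, a minimal end-to-end sketch looks like this (the model file paths are the ones used in `detect_recognize.py`; the `known_encodings`/`known_names` arrays and the nearest-neighbour matching with a 0.6 distance threshold are illustrative assumptions; the script in this repository trains a linear SVM instead):
```
import dlib
import numpy as np

# Pre-trained dlib models (same files as in detect_recognize.py)
face_detector = dlib.get_frontal_face_detector()
pose_predictor = dlib.shape_predictor('./model/shape_predictor_68_face_landmarks.dat')
face_encoder = dlib.face_recognition_model_v1('./model/dlib_face_recognition_resnet_model_v1.dat')

def encode_faces(image):
    """Detect every face in an RGB image and return one 128-d encoding per face."""
    encodings = []
    for face_rect in face_detector(image, 1):                  # 1. HOG face detection
        landmarks = pose_predictor(image, face_rect)           # 2. 68 facial landmarks
        descriptor = face_encoder.compute_face_descriptor(image, landmarks, 1)  # 3. 128-d embedding
        encodings.append(np.array(descriptor))
    return encodings

def identify(encoding, known_encodings, known_names, threshold=0.6):
    """4. Match an encoding against known people (nearest neighbour; assumed arrays)."""
    distances = np.linalg.norm(np.asarray(known_encodings) - encoding, axis=1)
    best = int(np.argmin(distances))
    return known_names[best] if distances[best] < threshold else 'unknown'
```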
44 | 45 | ![Result](https://github.com/alexattia/Data-Science-Projects/blob/master/FaceRecognition/result.png) -------------------------------------------------------------------------------- /FaceRecognition/detect_recognize.py: -------------------------------------------------------------------------------- 1 | import dlib 2 | from skimage import io, transform 3 | import cv2 4 | import matplotlib.pyplot as plt 5 | import numpy as np 6 | import pandas as pd 7 | import glob 8 | import openface 9 | import pickle 10 | import os 11 | import sys 12 | import argparse 13 | import time 14 | 15 | from sklearn.svm import SVC 16 | from sklearn.preprocessing import LabelEncoder 17 | 18 | face_detector = dlib.get_frontal_face_detector() 19 | face_encoder = dlib.face_recognition_model_v1('./model/dlib_face_recognition_resnet_model_v1.dat') 20 | face_pose_predictor = dlib.shape_predictor('./model/shape_predictor_68_face_landmarks.dat') 21 | 22 | def get_detected_faces(filename): 23 | """ 24 | Detect faces in a picture using HOG 25 | :param filename: picture filename 26 | :return: picture numpy array, face detector object with detected faces 27 | """ 28 | image = cv2.imread(filename) 29 | image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) 30 | return image, face_detector(image, 1) 31 | 32 | def get_face_encoding(image, detected_face): 33 | """ 34 | Encode face into 128 measurements using a neural net 35 | :param image: picture numpy array 36 | :param detected_face: face detector object with one detected face 37 | :return: measurement (128,) numpy array 38 | """ 39 | pose_landmarks = face_pose_predictor(image, detected_face) 40 | face_encoding = face_encoder.compute_face_descriptor(image, pose_landmarks, 1) 41 | return np.array(face_encoding) 42 | 43 | def training(people): 44 | """ 45 | Training our classifier (Linear SVC). Saving model using pickle. 46 | We need to have only one person/face per picture. 47 | :param people: people to classify and recognize 48 | """ 49 | # parsing labels and reprensations 50 | df = pd.DataFrame() 51 | for p in people: 52 | l = [] 53 | for filename in glob.glob('./data/%s/*' % p): 54 | image, face_detect = get_detected_faces(filename) 55 | face_encoding = get_face_encoding(image, face_detect[0]) 56 | l.append(np.append(face_encoding, [p])) 57 | temp = pd.DataFrame(np.array(l)) 58 | df = pd.concat([df, temp]) 59 | df.reset_index(drop=True, inplace=True) 60 | 61 | # converting labels into int 62 | le = LabelEncoder() 63 | y = le.fit_transform(df[128]) 64 | print("Training for {} classes.".format(len(le.classes_))) 65 | X = df.drop(128, axis=1) 66 | print("Training with {} pictures.".format(len(X))) 67 | 68 | # training 69 | clf = SVC(C=1, kernel='linear', probability=True) 70 | clf.fit(X, y) 71 | 72 | # dumping model 73 | fName = "./classifier.pkl" 74 | print("Saving classifier to '{}'".format(fName)) 75 | with open(fName, 'wb') as f: 76 | pickle.dump((le, clf), f) 77 | 78 | def predict(filename, le=None, clf=None, verbose=False): 79 | """ 80 | Detect and recognize a face using a trained classifier. 
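    If le/clf are not provided, the pickled (LabelEncoder, SVC) pair is loaded from ./classifier.pkl.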
81 | :param filename: picture filename 82 | :param le: 83 | :paral clf: 84 | :param verbose: 85 | :return: picture with bounding boxes and prediction 86 | """ 87 | if not le and not clf: 88 | with open("./classifier.pkl", 'rb') as f: 89 | (le, clf) = pickle.load(f) 90 | image, detected_faces = get_detected_faces(filename) 91 | prediction = [] 92 | # Verbose for debugging 93 | if verbose: 94 | print('{} faces detected.'.format(len(detected_faces))) 95 | img = np.copy(image) 96 | font = cv2.FONT_HERSHEY_SIMPLEX 97 | for face_detect in detected_faces: 98 | # draw bounding boxes 99 | cv2.rectangle(img, (face_detect.left(), face_detect.top()), 100 | (face_detect.right(), face_detect.bottom()), (255, 0, 0), 2) 101 | start_time = time.time() 102 | # predict each face 103 | p = clf.predict_proba(get_face_encoding(image, face_detect).reshape(1, 128)) 104 | # throwing away prediction with low confidence 105 | a = np.sort(p[0])[::-1] 106 | if a[0]-a[1] > 0.5: 107 | y_pred = le.inverse_transform(np.argmax(p)) 108 | prediction.append([y_pred, (face_detect.left(), face_detect.top()), 109 | (face_detect.right(), face_detect.bottom())]) 110 | else: 111 | y_pred = 'unknown' 112 | # Verbose for debugging 113 | if verbose: 114 | print('\n'.join(['%s : %.3f' % (k[0], k[1]) for k in list(zip(map(le.inverse_transform, 115 | np.argsort(p[0])), 116 | np.sort(p[0])))[::-1]])) 117 | print('Prediction took {:.2f}s'.format(time.time()-start_time)) 118 | 119 | cv2.putText(img, y_pred, (face_detect.left(), face_detect.top()-5), font, np.max(img.shape[:2])/1800, (255, 0, 0)) 120 | return img, prediction 121 | 122 | if __name__ == '__main__': 123 | parser = argparse.ArgumentParser() 124 | parser.add_argument('mode', type=str, help='train or predict') 125 | parser.add_argument('--training_data', 126 | type=str, 127 | help="Path to training data folder.", 128 | default='./data/') 129 | parser.add_argument('--testing_data', 130 | type=str, 131 | help="Path to test data folder.", 132 | default='./test/') 133 | 134 | args = parser.parse_args() 135 | people = os.listdir(args.training_data) 136 | print('{} people will be classified.'.format(len(people))) 137 | if args.mode == 'train': 138 | training(people) 139 | elif args.mode == 'test': 140 | with open("./classifier.pkl", 'rb') as f: 141 | (le, clf) = pickle.load(f) 142 | for i, f in enumerate(glob.glob(args.testing_data)): 143 | img, _ = predict(f, le, clf) 144 | cv2.imwrite(args.testing_data + 'test_{}.jpg'.format(i), img) 145 | 146 | -------------------------------------------------------------------------------- /FaceRecognition/result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/FaceRecognition/result.png -------------------------------------------------------------------------------- /HackerRank_MaximumSubArray.py: -------------------------------------------------------------------------------- 1 | ## Aim: 2 | ## Given an array of k elements, find the maximum possible sum of a 3 | ## 1. Contiguous subarray 4 | ## 2. Non-contiguous (not necessarily contiguous) subarray. 5 | 6 | ## Input : 7 | ## First line of the input has an integer t. t cases follow. 8 | ## Each test case begins with an integer k . In the next line, k integers follow representing the elements of array arr. 9 | 10 | ## Output : 11 | ## Two, space separated, integers denoting the maximum contiguous and non-contiguous subarray. 
12 | ## At least one integer should be selected and put into the subarrays 13 | ## (this may be required in cases where all elements are negative). 14 | 15 | t = int(input().strip()) 16 | def max_subarray(A): 17 | max_ending_here = max_so_far = 0 18 | for x in A: 19 | max_ending_here = max(0, max_ending_here + x) 20 | max_so_far = max(max_so_far, max_ending_here) 21 | return max_so_far 22 | 23 | for i in range(2*t): 24 | k=int(input()) 25 | arr = [int(arr_temp) for arr_temp in input().strip().split(' ')] 26 | if(all(item>0 for item in arr)): 27 | print(sum(arr),sum(arr)) 28 | elif(all(item<0 for item in arr)): 29 | print(max(arr),max(arr)) 30 | else: 31 | c=0 32 | for i in range(len(arr)): 33 | if(c+arr[i]>c): 34 | c+=arr[i] 35 | print(max_subarray(arr),c) 36 | -------------------------------------------------------------------------------- /HackerRank_PalindromeIndex.py: -------------------------------------------------------------------------------- 1 | ## Given a string of lowercase letters, determine the index of the character 2 | ## whose removal will make the string a palindrome. 3 | ## If the string is already a palindrome, then print -1 . There will always be a valid solution. 4 | 5 | ## Input: 6 | ## The first line contains n (the number of test cases). 7 | ## The n subsequent lines of test cases each contain a single string to be checked. 8 | 9 | ## Output : 10 | ## int the zero-indexed position (integer) of a character whose deletion will result in a palindrome; 11 | ## if there is no such character (i.e.: the string is already a palindrome), print -1. 12 | 13 | n = int(input().strip()) 14 | a = [] 15 | for a_i in range(n): 16 | a_t = [a_temp for a_temp in input().strip().split(' ')][0] 17 | a.append(a_t) 18 | 19 | def isPal(s): 20 | for i in range(int(len(s)/2)): 21 | if s[i] != s[len(s)-i-1]: 22 | return False 23 | return True 24 | 25 | def solve(s): 26 | for i in range(int((len(s)+1)/2)): 27 | if s[i] != s[len(s)-i-1]: 28 | if isPal(s[:i]+s[i+1:]): 29 | return i 30 | else: 31 | return len(s)-i-1 32 | return -1 33 | for i in range(n): 34 | x=a[i] 35 | print(solve(x)) -------------------------------------------------------------------------------- /HackerRank_TwoStrings.py: -------------------------------------------------------------------------------- 1 | ## Aim : 2 | ## You are given two strings, A and B. 3 | ## Find if there is a substring that appears in both A and B. 4 | 5 | # Input: 6 | # The first line of the input will contain a single integer , tthe number of test cases. 7 | # Then there will be t descriptions of the test cases. Each description contains two lines. 8 | # The first line contains the string A and the second line contains the string B. 9 | 10 | # Output: 11 | # For each test case, display YES (in a newline), if there is a common substring. 12 | # Otherwise, display NO 13 | 14 | n = int(input().strip()) 15 | a = [] 16 | for a_i in range(2*n): 17 | a_t = [a_temp for a_temp in input().strip().split(' ')][0] 18 | a.append(a_t) 19 | 20 | for i in range(2*n): 21 | if (i%2==0): 22 | s1=set(a[i]) 23 | s2=set(a[i+1]) 24 | S=s1.intersection(s2) #intersection of subset -> return the same characters !! 
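        # Any common substring must contain at least one shared character, so a non-empty
        # intersection of the two character sets is equivalent to having a common substring.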
25 | if (len(S)==0): 26 | print("NO") 27 | else: 28 | print("YES") 29 | -------------------------------------------------------------------------------- /KaggleAmazon/README.md: -------------------------------------------------------------------------------- 1 | # Planet: Understanding the Amazon from Space 2 | 3 | [Kaggle Link](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space) 4 | 5 | Use satellite data to track the human footprint in the Amazon rainforest. 6 | 7 | The goal of this challenge is to label satellite image chips with atmospheric conditions and various classes of land cover/land use. Resulting algorithms will help the global community better understand where, how, and why deforestation happens all over the world - and ultimately how to respond. 8 | 9 | This problem was tackled with Deep Learning models (using TensorFlow and Keras). 10 | Submissions are evaluated based on the F-beta score (F2 score), it measures acccuracy using precision and recall. 11 | 12 | 13 | -------------------------------------------------------------------------------- /KaggleAmazon/predict_keras.py: -------------------------------------------------------------------------------- 1 | import train_keras 2 | from keras.models import load_model 3 | import os 4 | import numpy as np 5 | import pandas as pd 6 | from tqdm import tqdm 7 | from keras.callbacks import ModelCheckpoint 8 | import sys 9 | 10 | TF_CPP_MIN_LOG_LEVEL=2 11 | TEST_BATCH = 128 12 | 13 | def load_params(): 14 | X_test = os.listdir('./test-jpg') 15 | X_test = [fn.replace('.jpg', '') for fn in X_test] 16 | model = load_model('model_amazon6.h5', custom_objects={'fbeta': train_keras.fbeta}) 17 | with open('tag_columns.txt', 'r') as f: 18 | tag_columns = f.read().split('\n') 19 | return X_test, model, tag_columns 20 | 21 | def prediction(X_test, model, tag_columns, test_folder): 22 | result = [] 23 | for i in tqdm(range(0, len(X_test), TEST_BATCH)): 24 | X_batch = X_test[i:i+TEST_BATCH] 25 | X_batch = np.array([train_keras.preprocess(train_keras.load_image(fn, folder=test_folder)) for fn in X_batch]) 26 | p = model.predict(X_batch) 27 | result.append(p) 28 | 29 | r = np.concatenate(result) 30 | r = r > 0.5 31 | table = [] 32 | for row in r: 33 | t = [] 34 | for b, v in zip(row, tag_columns): 35 | if b: 36 | t.append(v.replace('tag_', '')) 37 | table.append(' '.join(t)) 38 | print('Prediction done !') 39 | return table 40 | 41 | def launch(test_folder): 42 | X_test, model, tag_columns = load_params() 43 | table = prediction(X_test, model, tag_columns, test_folder) 44 | try: 45 | df_pred = pd.DataFrame.from_dict({'image_name': X_test, 'tags': table}) 46 | df_pred.to_csv('submission9.csv', index=False) 47 | except: 48 | np.save('image_name', X_test) 49 | np.save('table', table) 50 | 51 | if __name__ == '__main__': 52 | if len(sys.argv) > 1: 53 | test_folder = sys.argv[1] 54 | else: 55 | test_folder='test-jpg' 56 | launch(test_folder) -------------------------------------------------------------------------------- /KaggleAmazon/predict_tf.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import cv2 4 | import pandas as pd 5 | import pickle 6 | import glob 7 | 8 | picture_size = 32 9 | batch_size = 128 10 | num_steps = 30001 11 | 12 | def load_set(): 13 | global labels 14 | 15 | df = pd.DataFrame.from_csv('./train.csv') 16 | df.tags = df.tags.apply(lambda x:x.split(' ')) 17 | df = pd.concat([df, df.tags.apply(pd.Series)], axis=1) 18 | labels 
= list(set(np.concatenate(df.tags))) 19 | mapping = dict(enumerate(labels)) 20 | reverse_mapping = {v:k for k,v in mapping.items()} 21 | 22 | pictures = [] 23 | tags = [] 24 | for pic in np.random.choice(df.index, len(df)): 25 | img = cv2.imread('./train-jpg/{}.jpg'.format(pic)) / 255. 26 | img = cv2.resize(img, (picture_size, picture_size)) 27 | tag = np.zeros(len(labels)) 28 | for t in df.loc[pic].tags: 29 | tag[reverse_mapping[t]] = 1 30 | pictures.append(img) 31 | tags.append(tag) 32 | 33 | return pictures, tags, mapping 34 | 35 | def separate_set(pictures, labels): 36 | pictures_train = np.array(pictures[:int(len(pictures)*0.8)]) 37 | labels_train = np.array(tags[:int(len(tags)*0.8)]) 38 | print('Train shape', pictures_train.shape, labels_train.shape) 39 | 40 | pictures_valid = np.array(pictures[int(len(pictures)*0.8) : int(len(pictures)*0.9)]) 41 | labels_valid = np.array(tags[int(len(tags)*0.8) : int(len(tags)*0.9)]) 42 | print('Valid shape', pictures_valid.shape, labels_valid.shape) 43 | 44 | pictures_test = np.array(pictures[int(len(pictures)*0.9):]) 45 | labels_test = np.array(tags[int(len(tags)*0.9):]) 46 | print('Test shape', pictures_test.shape, labels_test.shape) 47 | return pictures_train, labels_train, pictures_valid, labels_valid, pictures_test, labels_test 48 | 49 | 50 | def accuracy(predictions, labels): 51 | return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) 52 | / predictions.shape[0]) 53 | 54 | def conv2d(x, W, stride=(1, 1), padding='SAME'): 55 | return tf.nn.conv2d(x, W, strides=[1, stride[0], stride[1], 1], 56 | padding=padding) 57 | 58 | def max_pool(x, ksize=(2, 2), stride=(2, 2)): 59 | return tf.nn.max_pool(x, ksize=[1, ksize[0], ksize[1], 1], 60 | strides=[1, stride[0], stride[1], 1], padding='SAME') 61 | 62 | def weight_variable(shape): 63 | initial = tf.truncated_normal(shape, stddev=0.1) 64 | return tf.Variable(initial) 65 | 66 | 67 | def bias_variable(shape): 68 | initial = tf.constant(0.1, shape=shape) 69 | return tf.Variable(initial) 70 | 71 | def model(data): 72 | # First convolution 73 | h_conv1 = tf.nn.relu(conv2d(data, W_1) + b_1) 74 | h_pool1 = max_pool(h_conv1, ksize=(2, 2), stride=(2, 2)) 75 | 76 | # Second convolution 77 | h_conv2 = tf.nn.relu(conv2d(h_pool1, W_2) + b_2) 78 | h_pool2 = max_pool(h_conv2, ksize=(2, 2), stride=(2, 2)) 79 | # reshape tensor into a batch of vectors 80 | pool_flat = tf.reshape(h_pool2, [-1, int(picture_size / 4) * int(picture_size / 4) * 64]) 81 | 82 | # Full connected layer with 1024 neurons. 
83 | fc = tf.nn.relu(tf.matmul(pool_flat, W_fc) + b_fc) 84 | 85 | # Compute logits 86 | return tf.matmul(fc, W_logits) + b_logits 87 | 88 | if __name__ == "__main__": 89 | pictures, tags, mapping = load_set() 90 | pictures_train, labels_train, pictures_valid, labels_valid, pictures_test, labels_test = separate_set(pictures, tags) 91 | 92 | graph = tf.Graph() 93 | with graph.as_default(): 94 | x = tf.placeholder(tf.float32, shape=(batch_size, picture_size, picture_size, 3)) 95 | y_ = tf.placeholder(tf.float32, shape=(batch_size, len(labels))) 96 | x_valid = tf.constant(pictures_valid, dtype=tf.float32) 97 | x_test = tf.constant(pictures_test, dtype=tf.float32) 98 | 99 | # Weights and biases 100 | W_1 = weight_variable([3, 3, 3, 64]) 101 | b_1 = bias_variable([32]) 102 | W_2 = weight_variable([5, 5, 32, 128]) 103 | b_2 = bias_variable([64]) 104 | W_fc = weight_variable([int(picture_size / 4) * int(picture_size / 4) * 64, 1024]) 105 | b_fc = bias_variable([1024]) 106 | W_logits = weight_variable([1024, len(labels)]) 107 | b_logits = bias_variable([len(labels)]) 108 | 109 | logits = model(x) 110 | loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y_)) 111 | optimizer = tf.train.AdamOptimizer(1e-4).minimize(loss) 112 | y = tf.nn.softmax(logits) 113 | y_valid = tf.nn.softmax(model(x_valid)) 114 | y_test = tf.nn.softmax(model(x_test)) 115 | 116 | with tf.Session(graph = graph) as session: 117 | tf.global_variables_initializer().run() 118 | for step in range(num_steps): 119 | offset = (step * batch_size) % (labels_train.shape[0] - batch_size) 120 | batch_data = pictures_train[offset:(offset + batch_size), :, :, :] 121 | batch_labels = labels_train[offset:(offset + batch_size), :] 122 | feed_dict = {x : batch_data, y_ : batch_labels} 123 | _, l, predictions = session.run([optimizer, loss, y], feed_dict=feed_dict) 124 | if (step % 1000 == 0): 125 | print('Minibatch loss at step %d: %.3f / Valid Accuracy %.2f' % (step, l, 126 | accuracy(y_valid.eval(), labels_valid))) 127 | print(accuracy(y_test.eval(), test_labels)) 128 | -------------------------------------------------------------------------------- /KaggleAmazon/train_keras.py: -------------------------------------------------------------------------------- 1 | import numpy as np # linear algebra 2 | import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) 3 | import os 4 | import glob 5 | import matplotlib.image as mpimg 6 | import gc 7 | 8 | from keras import backend as K 9 | from keras.models import Sequential 10 | from keras.layers import Dense, Dropout, Flatten, Lambda 11 | from keras.layers import Conv2D, MaxPooling2D, AveragePooling2D 12 | from keras.layers.normalization import BatchNormalization 13 | from keras.preprocessing.image import ImageDataGenerator 14 | from keras.regularizers import l2 15 | from keras.callbacks import ModelCheckpoint 16 | 17 | from sklearn.utils import shuffle 18 | from sklearn.model_selection import train_test_split 19 | 20 | import cv2 21 | from tqdm import tqdm 22 | 23 | picture_size = 128 24 | INPUT_SHAPE = (picture_size, picture_size, 4) 25 | EPOCHS = 20 26 | BATCH = 64 27 | PER_EPOCH = 256 28 | 29 | def fbeta(y_true, y_pred, threshold_shift=0): 30 | """ 31 | Submissions are evaluated based on the F-beta score, it measures acccuracy using precision 32 | and recall. Beta = 2 here. 
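    F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)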
33 | """ 34 | beta = 2 35 | 36 | # just in case of hipster activation at the final layer 37 | y_pred = K.clip(y_pred, 0, 1) 38 | 39 | # shifting the prediction threshold from .5 if needed 40 | y_pred_bin = K.round(y_pred + threshold_shift) 41 | 42 | tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon() 43 | fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1))) 44 | fn = K.sum(K.round(K.clip(y_true - y_pred, 0, 1))) 45 | 46 | precision = tp / (tp + fp) 47 | recall = tp / (tp + fn) 48 | 49 | beta_squared = beta ** 2 50 | return (beta_squared + 1) * (precision * recall) / (beta_squared * precision + recall + K.epsilon()) 51 | 52 | def load_image(filename, resize=True, folder='train-jpg'): 53 | img = mpimg.imread('./{}/{}.jpg'.format(folder, filename)) 54 | if resize: 55 | img = cv2.resize(img, (picture_size, picture_size)) 56 | return np.array(img) 57 | 58 | # def mean_normalize(img): 59 | # return (img - img.mean()) / (img.max() - img.min()) 60 | 61 | def preprocess(img): 62 | img = img / 127.5 - 1 63 | return img 64 | 65 | def generator(X, y, batch_size=32): 66 | X_copy, y_copy = X, y 67 | while True: 68 | for i in range(0, len(X_copy), batch_size): 69 | X_result, y_result = [], [] 70 | for x, y in zip(X_copy[i:i+batch_size], y_copy[i:i+batch_size]): 71 | rx, ry = [load_image(x)], [y] 72 | rx = np.array([preprocess(x) for x in rx]) 73 | ry = np.array(ry) 74 | X_result.append(rx) 75 | y_result.append(ry) 76 | X_result, y_result = np.concatenate(X_result), np.concatenate(y_result) 77 | yield shuffle(X_result, y_result) 78 | X_copy, y_copy = shuffle(X_copy, y_copy) 79 | 80 | def create_model(): 81 | model = Sequential() 82 | 83 | model.add(Conv2D(48, (8, 8), strides=(2, 2), input_shape=INPUT_SHAPE, activation='elu')) 84 | model.add(BatchNormalization()) 85 | 86 | model.add(Conv2D(64, (8, 8), strides=(2, 2), activation='elu')) 87 | model.add(BatchNormalization()) 88 | 89 | model.add(Conv2D(96, (5, 5), strides=(2, 2), activation='elu')) 90 | model.add(BatchNormalization()) 91 | 92 | model.add(Conv2D(128, (3, 3), activation='elu')) 93 | model.add(BatchNormalization()) 94 | 95 | model.add(Conv2D(128, (2, 2), activation='elu')) 96 | model.add(BatchNormalization()) 97 | 98 | model.add(Flatten()) 99 | model.add(Dropout(0.3)) 100 | 101 | model.add(Dense(1024, activation='elu')) 102 | model.add(BatchNormalization()) 103 | 104 | model.add(Dense(64, activation='elu')) 105 | model.add(BatchNormalization()) 106 | 107 | model.add(Dense(n_classes, activation='sigmoid')) 108 | 109 | 110 | model.compile( 111 | optimizer='adam', 112 | loss='binary_crossentropy', 113 | metrics=[fbeta, 'accuracy'] 114 | ) 115 | 116 | return model 117 | 118 | def load_set(): 119 | global n_features, n_classes, tag_columns 120 | df = pd.DataFrame.from_csv('./train.csv') 121 | df.tags = df.tags.apply(lambda x:x.split(' ')) 122 | df.reset_index(inplace=True) 123 | labels = list(df.tags) 124 | 125 | tags = set(np.concatenate(labels)) 126 | tag_list = list(tags) 127 | tag_list.sort() 128 | tag_columns = ['tag_' + t for t in tag_list] 129 | for t in tag_list: 130 | df['tag_' + t] = df['tags'].map(lambda x: 1 if t in x else 0) 131 | 132 | X = df['image_name'].values 133 | y = df[tag_columns].values 134 | 135 | n_features = 1 136 | n_classes = y.shape[1] 137 | X, y = shuffle(X, y) 138 | 139 | X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1) 140 | 141 | print('Train', X_train.shape, y_train.shape) 142 | print('Valid', X_valid.shape, y_valid.shape) 143 | return X_train, X_valid, y_train, y_valid 144 | 145 | def 
launch(): 146 | X_train, X_valid, y_train, y_valid = load_set() 147 | model = create_model() 148 | filepath="./weights-improvement-{epoch:02d}-{val_fbeta:.3f}.hdf5" 149 | checkpoint = ModelCheckpoint(filepath, monitor='val_fbeta', verbose=1, save_best_only=True, mode='max') 150 | callbacks_list = [checkpoint] 151 | 152 | history = model.fit_generator( 153 | generator(X_train, y_train, batch_size=BATCH), 154 | steps_per_epoch=PER_EPOCH, 155 | epochs=EPOCHS, 156 | validation_data=generator(X_valid, y_valid, batch_size=BATCH), 157 | validation_steps=len(y_valid)//(4*BATCH), 158 | callbacks=callbacks_list 159 | ) 160 | model.save('model_amazon2.h5') 161 | 162 | if __name__ == '__main__': 163 | launch() 164 | 165 | 166 | -------------------------------------------------------------------------------- /KaggleBikeSharing/Kaggle_BikeSharing.py: -------------------------------------------------------------------------------- 1 | ## coding: utf8 2 | ## Kaggle Challenge 3 | ## Utilisation de Python 3.5 4 | 5 | ## Chargement des bibliothèques 6 | import numpy as np 7 | import pandas as pd 8 | import matplotlib.pyplot as plt 9 | import collections 10 | import sklearn 11 | from sklearn import linear_model 12 | from sklearn import svm 13 | from sklearn import learning_curve 14 | from sklearn import ensemble 15 | #%matplotlib inline 16 | 17 | ## Chargement de la base de données 18 | df= pd.read_csv('data.csv') 19 | 20 | ## Transformation de la date sous 21 | ## format string au format de date 22 | df['date']=pd.to_datetime(df['datetime']) 23 | 24 | ## Extraction du mois, de l'heure et 25 | ## du jour de la semaine 26 | df.set_index('date', inplace=True) 27 | df['month']=df.index.month 28 | df['hours']=df.index.hour 29 | df['dayOfWeek']=df.index.weekday 30 | 31 | ################### 32 | ################### Influence du temps 33 | ################### 34 | 35 | ## Création d'une classe pour tracer 36 | ## la moyenne du nombre de locations 37 | ## de velos heure par heure pour les 38 | ## jours ouvrés ou non 39 | 40 | class mean_30(): 41 | def __init__(self, df): 42 | self.df=df 43 | def mean_hours_min(self,h): 44 | a = self.df["hours"] == h 45 | return self.df[a]["count"].mean() 46 | def transf(self, t): 47 | return self.mean_hours_min(t) 48 | def transfc(self, t): 49 | return self.err_hours_min(t) 50 | def vector_day(self): 51 | k = [] 52 | for i in range(0,24): 53 | k.append(i) 54 | hour_day = pd.DataFrame() 55 | hour_day["A"] = k 56 | return hour_day["A"] 57 | def view(self): 58 | plt.plot(self.vector_day().apply(self.transf)) 59 | 60 | ## Tracer des graphes 61 | 62 | fig=plt.figure() 63 | fig.suptitle('Location de velo selon l\'heure, jours ouvrables ou non', fontsize=13) 64 | plt.ylabel('nombre de locations de vélos') 65 | plt.xlabel('heure de la journée') 66 | moy0=mean_30(df[df['workingday']==0]) 67 | moy0.view() 68 | moy1=mean_30(df[df['workingday']==1]) 69 | moy1.view() 70 | plt.legend(['0','1']) 71 | plt.show() 72 | 73 | ## Création d'une classe pour tracer 74 | ## la deviation standard du nombre de 75 | ## locations de velos heure par heure 76 | ## pour un jour 77 | 78 | ## Changer cette valeur entre 0 et 6 pour 79 | ## pour obtenir un autre jour de la 80 | ## semaine 81 | 82 | j=0 83 | 84 | class std_30(): 85 | def __init__(self, df): 86 | self.df=df 87 | def mean_hours_std(self,j,h): 88 | y = self.df[self.df["dayOfWeek"]==j]["hours"] == h 89 | return self.df[self.df["dayOfWeek"]==j][y]["count"].mean() 90 | def err_hours(self,j,h): 91 | y = self.df[self.df["dayOfWeek"]==j]["hours"] == h 92 | 
return self.df[self.df["dayOfWeek"]==j][y]["count"].std() 93 | def transf_err(self,t): 94 | return self.mean_hours_std(j,t) 95 | def transf_err2(self,t): 96 | return self.err_hours(j,t) 97 | def vector_day(self): 98 | k = [] 99 | for i in range(0,24): 100 | k.append(i) 101 | hour_std = pd.DataFrame() 102 | hour_std["A"] = k 103 | return hour_std["A"] 104 | def view(self): 105 | errors=self.vector_day().apply(self.transf_err2) 106 | fig, ax = plt.subplots() 107 | self.vector_day().apply(self.transf_err).plot(yerr=errors, ax=ax,label=str(j)) 108 | plt.legend('0',loc=2,prop={'size':9}) 109 | 110 | fig.suptitle('Deviation standard des locations de velo selon l\'heure', fontsize=13) 111 | std0=std_30(df) 112 | std0.view() 113 | plt.ylabel('nombre de locations de vélos') 114 | plt.xlabel('heure de la journée') 115 | plt.show() 116 | 117 | ################### 118 | ################### Influence du mois 119 | ################### 120 | 121 | ## Création d'une classe pour tracer 122 | ## la moyenne du nombre de locations 123 | ## de velos par mois 124 | 125 | class month_30(): 126 | def __init__(self, df): 127 | self.df=df 128 | def mean_hours_min(self,m): 129 | a = self.df["month"] == m 130 | return self.df[a]["count"].mean() 131 | def transf(self, t): 132 | return self.mean_hours_min(t) 133 | def transfc(self, t): 134 | return self.err_hours_min(t) 135 | def vector_day(self): 136 | k = [] 137 | for i in range(0,13): 138 | k.append(i) 139 | hour_day = pd.DataFrame() 140 | hour_day["A"] = k 141 | return hour_day["A"] 142 | def view(self): 143 | plt.plot(self.vector_day().apply(self.transf)) 144 | 145 | ## Tracer des graphes 146 | 147 | fig=plt.figure() 148 | fig.suptitle('Location de velo selon le mois', fontsize=13) 149 | moy0=month_30(df) 150 | moy0.view() 151 | plt.ylabel('nombre de locations de vélos') 152 | plt.xlabel('mois de l\' année') 153 | plt.show() 154 | 155 | ################### 156 | ################### Influence de la météo 157 | ################### 158 | plt.figure() 159 | 160 | ## Moyenne de la demande en vélos 161 | ## Pour les 4 conditions grâce 162 | ## à un dictionnaire Python 163 | 164 | a={u'Degage/nuageux':df[df['weather']==1]['count'].mean(), 165 | u'Brouillard': df[df['weather']==2]['count'].mean(), 166 | u'Legere pluie':df[df['weather']==3]['count'].mean() 167 | } 168 | 169 | width = 1/1.6 170 | plt.bar(range(len(a)), a.values(),width,color="blue",align='center') 171 | plt.xticks(range(len(a)), a.keys()) 172 | plt.ylabel('nombre de locations de vélos') 173 | plt.title('Moyenne des locations de velos pour differentes conditions meteorologiques') 174 | plt.show() 175 | 176 | ################### 177 | ################### Influence du vent, de la température et de l'humidité 178 | ################### 179 | 180 | ## Moyenne de la demande en vélos 181 | ## Pour certains paramètres grâce 182 | ## à un dictionnaire Python 183 | 184 | D = {u'V>13k/h':df[df['windspeed']>13]['count'].mean(), 185 | u'V<13k/h': df[df['windspeed']<13]['count'].mean(), 186 | u'T<24°C':df[df['atemp']<24]['count'].mean(), 187 | u'T>24°C':df[df['atemp']>24]['count'].mean(), 188 | u'H>62%': df[df['humidity']>62]['count'].mean(), 189 | u'H<62%':df[df['humidity']<62]['count'].mean() 190 | } 191 | od = collections.OrderedDict(sorted(D.items())) 192 | width = 1/1.6 193 | plt.figure() 194 | plt.bar(range(len(od)), od.values(),width,color="blue",align='center') 195 | plt.xticks(range(len(od)), od.keys()) 196 | plt.title('Variation de la demande en fonction de 3 variables') 197 | plt.show() 198 | 199 | 
################### 200 | ################### Conclusion et choix des paramètres influents 201 | ################### 202 | 203 | ##On calcule matrice de corrélation pour 204 | ##retirer les variables corrélées 205 | 206 | df.corr() 207 | plt.matshow(df.corr()) 208 | plt.yticks(range(len(df.corr().columns)), df.corr().columns); 209 | plt.colorbar() 210 | plt.show() 211 | 212 | df1=df.drop(['workingday','datetime','season','atemp','holiday','registered','casual'],axis=1) 213 | 214 | target=df1['count'].values #ensemble des outputs à prédire (yi) 215 | 216 | train=df1.drop('count',axis=1) #ensemble des données (xi) 217 | 218 | ## Split aléatoire entre données d'entrainement 219 | ## et donnes test 220 | X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split( 221 | train, target, test_size=0.33, random_state=42) 222 | 223 | ## Creation d'une classe pour tracer 224 | ## les courbes d'apprentissages 225 | 226 | def plot_learning_curve(estimator, title, X, y, cv=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)): 227 | plt.figure() 228 | plt.title(title) 229 | plt.xlabel("Training examples") 230 | plt.ylabel("Score") 231 | train_sizes, train_scores, test_scores = sklearn.learning_curve.learning_curve( 232 | estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes) 233 | train_scores_mean = np.mean(train_scores, axis=1) 234 | train_scores_std = np.std(train_scores, axis=1) 235 | test_scores_mean = np.mean(test_scores, axis=1) 236 | test_scores_std = np.std(test_scores, axis=1) 237 | plt.grid() 238 | plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score") 239 | plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score") 240 | plt.legend(loc="best") 241 | return plt 242 | 243 | ################### 244 | ################### Regression Linéaire 245 | ################### 246 | 247 | linreg=linear_model.LinearRegression() 248 | linreg.fit(X_train,Y_train) 249 | tableau=[['Paramètre', 'Coefficient']] #liste pour voir les valeurs des coefficients 250 | col=list(train.columns.values) 251 | for i in range(7): 252 | tableau.append([col[i],linreg.coef_[i]]) 253 | 254 | print ("Training Score Regression Linéare : ", str(linreg.score(X_train,Y_train))) 255 | print ("Test Score Regression Linéare : " , str(linreg.score(X_test,Y_test))) 256 | print ("Coefficients Regression Linéare :") 257 | print (tableau) 258 | 259 | ################### 260 | ################### Gradient Boosting Regression 261 | ################### 262 | 263 | gbr = ensemble.GradientBoostingRegressor(n_estimators=2000) 264 | gbr.fit(X_train,Y_train) 265 | print ("Training Score GradientBoosting: ", str(gbr.score(X_train,Y_train))) 266 | print ("Test Score GradientBoosting: " , str(gbr.score(X_test,Y_test))) 267 | 268 | title = "Learning Curves (GradientBoosting)" 269 | estimator = ensemble.GradientBoostingRegressor(n_estimators=2000) 270 | plot_learning_curve(estimator, title, X_train, Y_train) 271 | plt.show() 272 | 273 | ################### 274 | ################### Support Vector Regression 275 | ################### 276 | svr=svm.SVR(kernel='linear') 277 | svr.fit(X_train,Y_train) 278 | print ("Training Score SVR: ", str(svr.score(X_train,Y_train))) 279 | print ("Test Score SVR : " , str(svr.score(X_test,Y_test))) 280 | 281 | ################### 282 | ################### Random Forest Regression 283 | ################### 284 | 285 | rf=ensemble.RandomForestRegressor(n_estimators=30,oob_score=True) #30 arbres et OOB Estimation 286 | rf.fit(train,target) 287 
| print ("Training Score RandomForest: ", str(rf.score(train,target))) 288 | print ("OOB Score RandomForest: " , str(rf.oob_score_)) 289 | 290 | ################### 291 | ################### Améliorations 292 | ################### 293 | 294 | ## On cherche le paramètre le plus influent de notre modèle 295 | ## Pour cela on utilise deux algorithmes qu'on a utilisé : 296 | ## la régréssio linéaire et Random Forest 297 | def param_import(): 298 | col=list(train.columns.values) 299 | #on trouve d'abord les coefficients de la régréssion linéaire 300 | index1=linreg.coef_.argsort()[-2:][-1] #renvoie la liste triée des coef et on prend le premier élement 301 | index2=linreg.coef_.argsort()[-2:][0] #renvoie la liste triée des coef et on prend le deuxieme élement 302 | print('Pour les améliorations, calculons les paramètres les plus influents : ') 303 | print('...') 304 | print('Pour la regréssion linéaire, les paramètres les plus influents sont :', col[index1],' et ',col[index2]) 305 | #on trouve ensuite les coefficients de la RF 306 | index3=rf.feature_importances_.argsort()[-2:][-1] 307 | index4=rf.feature_importances_.argsort()[-2:][0] 308 | print('Pour l\'algorithme de RF, les paramètres les plus influents sont :', col[index3],' et ',col[index4]) 309 | #on trouve ensuite les coefficients de Gradient Boosting 310 | index5=gbr.feature_importances_.argsort()[-2:][-1] 311 | index6=gbr.feature_importances_.argsort()[-2:][0] 312 | print('Pour l\'algorithme de Gradient Boosting, les paramètres les plus influents sont :', col[index5],' et ',col[index6]) 313 | if index3==index5: 314 | plus_import=index3 315 | elif index5==index4: 316 | plus_import=index4 317 | return plus_import 318 | 319 | print('Le paramètre le plus important est donc : ', col[param_import()]) 320 | 321 | ## On cherche à prouver que certains creneaux sont 322 | ##plus importants pour la demande en vélo 323 | 324 | soir = df[df['hours'].isin([17,18,19])] 325 | peak_soir=soir[soir['workingday']==1] 326 | matin = df[df['hours'].isin([7,8,9])] 327 | peak_matin=matin[matin['workingday']==1] 328 | we = df[df['hours'].isin([12,13,14,15,16])] 329 | peak_we=we[we['workingday']==0] 330 | 331 | print('Calculons ensuite la moyenne du nombre de vélos pour plusieurs créneaux horaires : ') 332 | print('...') 333 | print('La moyenne totale de la demande est : ', df['count'].mean()) 334 | print('En semaine, entre 17 et 19h : ', peak_soir['count'].mean()) 335 | print('En semaine, entre 7 et 9h : ', peak_matin['count'].mean()) 336 | print('En week-end, entre 12 et 16h : ', peak_we['count'].mean()) 337 | 338 | -------------------------------------------------------------------------------- /KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf -------------------------------------------------------------------------------- /KaggleBikeSharing/README.md: -------------------------------------------------------------------------------- 1 | # Kaggle Bike Sharing 2 | 3 | [Kaggle Link](https://www.kaggle.com/c/bike-sharing-demand) 4 | 5 | Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. 
6 | In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C. 7 | 8 | The goal of this challenge is to build a model that predicts the count of bike shared, exclusively based on contextual features. The first part of this challenge was aimed to understand, to analyse and to process those dataset. I wanted to produce meaningful information with plots. The second part was to build a model and use a Machine Learning library in order to predict the count. 9 | 10 | The more importants parameters were the time, the month, the temperature and the weather. 11 | Multiple models were tested during this challenge (Linear Regression, Gradient Boosting, SVR and Random Forest). Finally, the chosen model was Random Forest. The accuracy was measured with [Out-of-Bag Error](https://www.stat.berkeley.edu/~breiman/OOBestimation.pdf) and the OOB score was 0.85. 12 | 13 | More detailed explanations (with a couple of plots) has been written in French. 14 | -->[French Explanations PDF](https://github.com/alexattia/Online-Challenge/blob/master/KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf) 15 | 16 | -------------------------------------------------------------------------------- /KaggleInstaCart/Predictions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 6, 6 | "metadata": { 7 | "ExecuteTime": { 8 | "end_time": "2017-06-21T21:37:22.901368Z", 9 | "start_time": "2017-06-21T21:37:21.787528Z" 10 | }, 11 | "collapsed": true 12 | }, 13 | "outputs": [], 14 | "source": [ 15 | "import numpy as np\n", 16 | "import pandas as pd\n", 17 | "import matplotlib.pyplot as plt\n", 18 | "import seaborn as sns\n", 19 | "import calendar\n", 20 | "%matplotlib inline" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 7, 26 | "metadata": { 27 | "ExecuteTime": { 28 | "end_time": "2017-06-21T21:37:36.669912Z", 29 | "start_time": "2017-06-21T21:37:22.902860Z" 30 | }, 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "# departements and aisle id\n", 36 | "aisles = pd.read_csv('./data/aisles.csv')\n", 37 | "departements = pd.read_csv('./data/departments.csv')\n", 38 | "products = pd.read_csv('./data/products.csv')\n", 39 | "orderp = pd.read_csv('./data/order_products__prior.csv')\n", 40 | "ordert = pd.read_csv('./data/order_products__train.csv')\n", 41 | "orders = pd.read_csv('./data/orders.csv')" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 8, 47 | "metadata": { 48 | "ExecuteTime": { 49 | "end_time": "2017-06-21T21:37:43.774516Z", 50 | "start_time": "2017-06-21T21:37:36.671328Z" 51 | } 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "order_products = pd.merge(orderp, orders, on = 'order_id')" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "ExecuteTime": { 63 | "start_time": "2017-06-21T21:37:22.391Z" 64 | } 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "data = order_products.groupby(['product_id', 'user_id']).agg({'order_id':'count',\n", 69 | " 'order_number':[np.min, np.max],\n", 70 | " 'add_to_cart_order':'mean'})\n", 71 | "data.columns = data.columns.droplevel(0)\n", 72 | "data = data.rename(columns={'mean':\"up_average_cart_position\", \n", 73 | " 'amin':\"up_first_order\", \n", 74 | " 'amax':\"up_last_order\", \n", 75 | " 
'count':\"up_orders\"})\n",
    "data = data.reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2017-06-21T21:37:23.516Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "d = pd.merge(data, order_products, on='product_id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "workenv",
   "language": "python",
   "name": "workenv"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  },
  "nav_menu": {},
  "toc": {
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 6,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

--------------------------------------------------------------------------------
/KaggleMovieRating/README.md:
--------------------------------------------------------------------------------
# Predict IMDB movie rating

Project inspired by Chuan Sun's [work](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset), without Scrapy

Main question: how can we tell the greatness of a movie before it is released in cinemas?

## First Part - Parsing data

The first part aims to parse data from the IMDb and The Numbers websites: casting information, directors, production companies, awards, genres, budget, gross, description, imdb_rating, etc.
To create the movie_contents.json file:
``python3 parser.py nb_elements``

## Second Part - Data Analysis

The second part is to analyze the dataframe and observe correlations between variables. For example, are movie awards correlated with the worldwide gross? Is the cast of a well-liked movie also well liked?
See the Jupyter notebook file.

![Correlation Matrix](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/corr_matrix.png)

As we can see in the picture above, the IMDB score is correlated with the number of awards and the gross, but not really with the production budget or the number of Facebook likes of the cast.
Obviously, domestic and worldwide gross are highly correlated. However, the larger the production budget, the larger the gross.
As shown in the notebook, the budget is not really correlated with the number of awards.
What's funny is that the popularity of the third most famous actor matters more for the IMDB score than the popularity of the most famous actor (correlation 0.2 vs 0.08).
(Many other charts are in the Jupyter notebook.)

## Third Part - Predict the IMDB score

Machine learning is used to predict the IMDB score from the meaningful variables.
We use a Random Forest algorithm (500 estimators).
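A minimal sketch of this step, assuming a flat dataframe with one row per movie assembled from `movie_budget.json` and the parsed IMDb content (the file name `movies_flat.json`, the exact feature list and the train/test split below are illustrative assumptions; `idmb_score` is the field name produced by `imdb_movie_content.py`):
```
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical pre-merged file with one flat row per movie
df = pd.read_json('movies_flat.json')
features = ['production_budget', 'worldwide_gross', 'num_voted_users',
            'num_user_for_reviews', 'total_cast_fb_likes']   # assumed subset of the meaningful variables
X = df[features].fillna(0)
y = df['idmb_score'].astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=500)   # 500 estimators, as mentioned above
rf.fit(X_train, y_train)
print('R^2 on held-out movies:', rf.score(X_test, y_test))
print(sorted(zip(rf.feature_importances_, features), reverse=True))
```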
30 | ![Most important features](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/features.png) 31 | 32 | -------------------------------------------------------------------------------- /KaggleMovieRating/genre.json: -------------------------------------------------------------------------------- 1 | {"Action": "0", 2 | "Adventure": "17", 3 | "Animation": "14", 4 | "Biography": "20", 5 | "Comedy": "21", 6 | "Crime": "10", 7 | "Documentary": "3", 8 | "Drama": "18", 9 | "Family": "19", 10 | "Fantasy": "6", 11 | "History": "8", 12 | "Horror": "4", 13 | "Music": "12", 14 | "Musical": "2", 15 | "Mystery": "16", 16 | "Romance": "9", 17 | "Sci-Fi": "13", 18 | "Short": "5", 19 | "Sport": "11", 20 | "Thriller": "7", 21 | "War": "15", 22 | "Western": "1"} -------------------------------------------------------------------------------- /KaggleMovieRating/imdb_movie_content.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | from bs4 import BeautifulSoup 4 | import time 5 | import random 6 | import re 7 | import datetime 8 | 9 | def parse_facebook_likes_number(num_likes_string): 10 | if num_likes_string[-1] == 'K': 11 | thousand = 1000 12 | else: 13 | thousand = 1 14 | return float(num_likes_string.replace('K','')) * thousand 15 | 16 | def parse_duration(time_string): 17 | time_string = time_string.split('min')[0] 18 | x = time.strptime(time_string,'%Hh%M') 19 | return datetime.timedelta(hours=x.tm_hour,minutes=x.tm_min,seconds=x.tm_sec).total_seconds() 20 | 21 | class ImdbMovieContent(): 22 | def __init__(self, movies): 23 | self.movies = movies 24 | self.base_url = "http://www.imdb.com" 25 | 26 | def get_facebook_likes(self, entity_id): 27 | if entity_id.startswith('nm'): 28 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Fname%2F{}%2F&colorscheme=light".format(entity_id) 29 | elif entity_id.startswith('tt'): 30 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Ftitle%2F{}%2F&colorscheme=light".format(entity_id) 31 | else: 32 | url = None 33 | time.sleep(random.uniform(0, 0.25)) # randomly snooze a time within [0, 0.4] second 34 | try: 35 | response = requests.get(url) 36 | soup = BeautifulSoup(response.text, "lxml") 37 | sentence = soup.find_all(id="u_0_2")[0].span.string # get sentence like: "43K people like this" 38 | num_likes = sentence.split(" ")[0] 39 | except Exception as e: 40 | num_likes = None 41 | return parse_facebook_likes_number(num_likes) 42 | 43 | def get_id_from_url(self, url): 44 | if url is None: 45 | return None 46 | return url.split('/')[4] 47 | 48 | def parse(self): 49 | movies_content = [] 50 | for movie in self.movies: 51 | imdb_url = movie['imdb_url'] 52 | response = requests.get(imdb_url) 53 | bs = BeautifulSoup(response.text, 'lxml') 54 | movies_content.append(self.get_content(bs)) 55 | return movies_content 56 | 57 | def get_awards(self, movie_link): 58 | awards_url = movie_link + 'awards' 59 | response = requests.get(awards_url) 60 | bs = BeautifulSoup(response.text, 'lxml') 61 | awards = bs.find_all('tr') 62 | award_dict = [] 63 | for award in awards: 64 | if 'rowspan' in award.find('td').attrs: 65 | rowspan = int(award.find('td')['rowspan']) 66 | if rowspan == 1: 67 | award_dict.append({'category' : award.find('span', class_='award_category').text, 68 | 'type' : award.find('b').text, 69 | 'award': award.find('td', 
class_='award_description').text.split('\n')[1].replace(' ', '')}) 70 | else: 71 | index = awards.index(award) 72 | dictt = {'category':award.find('span', class_='award_category').text, 73 | 'type' : award.find('b').text} 74 | awards_ = [] 75 | for elem in awards[index:index+rowspan]: 76 | award_dict.append({'category':award.find('span', class_='award_category').text, 77 | 'type' : award.find('b').text, 78 | 'award': elem.find('td', class_='award_description').text.split('\n')[1].replace(' ', '')}) 79 | award_nominated = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Nominated'] 80 | award_won = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Won'] 81 | return award_nominated, award_won 82 | 83 | def get_content(self, bs): 84 | movie = {} 85 | try: 86 | title_year = bs.find('div', class_ = 'title_wrapper').find('h1').text.encode('latin').decode('utf-8', 'ignore').replace(') ','').split('(') 87 | movie['movie_title'] = title_year[0] 88 | except: 89 | movie['movie_title'] = None 90 | try: 91 | movie['title_year'] = title_year[1] 92 | except: 93 | movie['title_year'] = None 94 | # try: 95 | # movie['genres'] = '|'.join([genre.text.replace(' ','') for genre in bs.find('div', {'itemprop': 'genre'}).find_all('a')]) 96 | # except: 97 | # movie['genres'] = None 98 | try: 99 | title_details = bs.find('div', {'id':'titleDetails'}) 100 | movie['language'] = '|'.join([language.text for language in title_details.find_all('a', href=re.compile('language'))]) 101 | movie['country'] = '|'.join([country.text for country in title_details.find_all('a', href=re.compile('country'))]) 102 | except: 103 | movie['language'] = None 104 | try: 105 | keywords = bs.find_all('span', {'itemprop':'keywords'}) 106 | movie['keywords'] = '|'.join([key.text for key in keywords]) 107 | except: 108 | movie['keywords'] = None 109 | try: 110 | movie['storyline'] = bs.find('div', {'id':'titleStoryLine'}).find('div', {'itemprop':'description'}).text.replace('\n', '') 111 | except: 112 | movie['storyline'] = None 113 | try: 114 | movie['contentRating'] = bs.find('span', {'itemprop':'contentRating'}).text.split(' ')[1] 115 | except: 116 | movie['contentRating'] = None 117 | try: 118 | movie['color'] = bs.find('a', href=re.compile('colors')).text 119 | except: 120 | movie['color'] = None 121 | try: 122 | movie['idmb_score'] = bs.find('span', {'itemprop':'ratingValue'}).text 123 | except: 124 | movie['idmb_score'] = None 125 | try: 126 | movie['num_voted_users'] = bs.find('span', {'itemprop':'ratingCount'}).text.replace(',',) 127 | except: 128 | movie['num_voted_users'] = None 129 | try: 130 | movie['duration_sec'] = parse_duration(bs.find('time', {'itemprop':'duration'}).text.replace(' ','').replace('\n', '')) 131 | except: 132 | movie['duration_sec'] = None 133 | try: 134 | review_counts = [int(count.text.split(' ')[0].replace(',', '')) for count in bs.find_all('span', {'itemprop':'reviewCount'})] 135 | movie['num_user_for_reviews'] = review_counts[0] 136 | except: 137 | movie['num_user_for_reviews'] = None 138 | try: 139 | prod_co = bs.find_all('span', {'itemtype': re.compile('Organization')}) 140 | movie['production_co'] = [elem.text.replace('\n','') for elem in prod_co] 141 | except: 142 | movie['production_co'] = None 143 | try: 144 | movie['num_critic_for_reviews'] = review_counts[1] 145 | except: 146 | movie['num_critic_for_reviews'] = None 147 | try: 148 | actors = bs.find('table', {'class':'cast_list'}).find_all('a', {'itemprop':'url'}) 
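            # Each cast-list link points to the actor's IMDb profile; keep (profile URL, display name)
            # pairs so the per-actor Facebook like count can be fetched below.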
149 | pairs_for_rows = [(self.base_url + actor['href'], actor.text.replace('\n', '')[1:]) for actor in actors] 150 | movie['cast_info'] = [{'actor_name' : actor[1], 151 | 'actor_link' : actor[0], 152 | 'actor_fb_likes': self.get_facebook_likes(self.get_id_from_url(actor[0]))} for actor in pairs_for_rows] 153 | except: 154 | movie['cast_info'] = None 155 | try: 156 | director = bs.find('span', {'itemprop':'director'}).find('a') 157 | director = (self.base_url + director['href'].replace('?','/?'), director.text) 158 | movie['director_info'] = {'director_name':director[1], 159 | 'director_link':director[0], 160 | 'director_fb_links': self.get_facebook_likes(self.get_id_from_url(director[0]))} 161 | except: 162 | movie['director_info'] = None 163 | try: 164 | movie['movie_imdb_link'] = bs.find('link')['href'] 165 | movie['num_facebook_like'] = self.get_facebook_likes(self.get_id_from_url(movie['movie_imdb_link'])) 166 | except: 167 | movie['num_facebook_like'] = None 168 | try: 169 | poster_image_url = bs.find('div', {'class':'poster'}).find('img')['src'] 170 | movie['image_urls'] = poster_image_url.split("_V1_")[0] + "_V1_.jpg" 171 | except: 172 | movie['image_urls'] = None 173 | try: 174 | award_nominated, award_won = self.get_awards(movie['movie_imdb_link']) 175 | movie['awards'] = {"won": award_won, 'nominated': award_nominated} 176 | except Exception as e: 177 | print(e) 178 | movie['awards'] = None 179 | return movie 180 | 181 | 182 | -------------------------------------------------------------------------------- /KaggleMovieRating/parser.py: -------------------------------------------------------------------------------- 1 | from bs4 import BeautifulSoup 2 | import json 3 | import requests 4 | import urllib 5 | from tqdm import tqdm 6 | import locale 7 | import pandas as pd 8 | import re 9 | import time 10 | import random 11 | import sys 12 | 13 | from imdb_movie_content import ImdbMovieContent 14 | 15 | def parse_price(price): 16 | """ 17 | Convert string price to numbers 18 | """ 19 | if not price: 20 | return 0 21 | price = price.replace(',', '') 22 | return locale.atoi(re.sub('[^0-9,]', "", price)) 23 | 24 | def get_movie_budget(): 25 | """ 26 | Parsing the numbers website to get the budget data. 27 | :return: list of dictionnaries with budget and gross 28 | """ 29 | movie_budget_url = 'http://www.the-numbers.com/movie/budgets/all' 30 | response = requests.get(movie_budget_url) 31 | bs = BeautifulSoup(response.text, 'lxml') 32 | table = bs.find('table') 33 | rows = [elem for elem in table.find_all('tr') if elem.get_text() != '\n'] 34 | 35 | movie_budget = [] 36 | for row in rows[1:]: 37 | specs = [elem.get_text() for elem in row.find_all('td')] 38 | movie_name = specs[2].encode('latin1').decode('utf8', 'ignore') 39 | movie_budget.append({'release_date': specs[1], 40 | 'movie_name': movie_name, 41 | 'production_budget': parse_price(specs[3]), 42 | 'domestic_gross': parse_price(specs[4]), 43 | 'worldwide_gross': parse_price(specs[5])}) 44 | 45 | return movie_budget 46 | 47 | def get_imdb_urls(movie_budget, nb_elements=None): 48 | """ 49 | Parsing imdb website to get imdb movies links. 
50 | Dumping a json file with budget, gross and imdb urls 51 | :param movie_budget: list of dictionnaries with budget and gross 52 | :param nb_elements: number of movies to parse 53 | """ 54 | for movie in tqdm(movie_budget[1000:1000+nb_elements]): 55 | movie_name = movie['movie_name'] 56 | title_url = urllib.parse.quote(movie_name.encode('utf-8')) 57 | imdb_search_link = "http://www.imdb.com/find?ref_=nv_sr_fn&q={}&s=tt".format(title_url) 58 | response = requests.get(imdb_search_link) 59 | bs = BeautifulSoup(response.text, 'lxml') 60 | results = bs.find("table", class_= "findList" ) 61 | try: 62 | movie['imdb_url'] = "http://www.imdb.com" + results.find('td', class_='result_text').find('a')['href'] 63 | except: 64 | movie['imdb_url'] = None 65 | 66 | with open('movie_budget.json', 'w') as fp: 67 | json.dump(movie_budget, fp) 68 | 69 | def get_imdb_content(movie_budget_path, nb_elements=None): 70 | """ 71 | Parsing imdb website to get imdb content : awards, casting, description, etc. 72 | Dumping a json file with imdb content 73 | :param movie_budget_path: path of the movie_budget.json file 74 | :param nb_elements: number of movies to parse 75 | """ 76 | with open(movie_budget_path, 'r') as fp: 77 | movies = json.load(fp) 78 | content_provider = ImdbMovieContent(movies) 79 | contents = [] 80 | threshold = 1300 81 | for i, movie in enumerate(movies[threshold:threshold+nb_elements]): 82 | time.sleep(random.uniform(0, 0.25)) 83 | print("\r%i / %i" % (i, len(movies[threshold:threshold+nb_elements])), end="") 84 | try: 85 | imdb_url = movie['imdb_url'] 86 | response = requests.get(imdb_url) 87 | bs = BeautifulSoup(response.text, 'lxml') 88 | movies_content = content_provider.get_content(bs) 89 | contents.append(movies_content) 90 | except Exception as e: 91 | print(e) 92 | pass 93 | with open('movie_contents7.json', 'w') as fp: 94 | json.dump(contents, fp) 95 | 96 | def parse_awards(movie): 97 | """ 98 | Convert awards information to a dictionnary for dataframe. 99 | Keeping only Oscar, BAFTA, Golden Globe and Palme d'Or awards. 100 | :param movie: movie dictionnary 101 | :return: well-formated dictionnary with awards information 102 | """ 103 | awards_kept = ['Oscar', 'BAFTA Film Award', 'Golden Globe', 'Palme d\'Or'] 104 | awards_category = ['won', 'nominated'] 105 | parsed_awards = {} 106 | for category in awards_category: 107 | for awards_type in awards_kept: 108 | awards_cat = [award for award in movie['awards'][category] if award['category'] == awards_type] 109 | for k, award in enumerate(awards_cat): 110 | parsed_awards['{}_{}_{}'.format(awards_type, category, k+1)] = award["award"] 111 | return parsed_awards 112 | 113 | def parse_actors(movie): 114 | """ 115 | Convert casting information to a dictionnary for dataframe. 116 | Keeping only 3 actors with most facebook likes. 
117 | :param movie: movie dictionnary 118 | :return: well-formated dictionnary with casting information 119 | """ 120 | sorted_actors = sorted(movie['cast_info'], key=lambda x:x['actor_fb_likes'], reverse=True) 121 | top_k = 3 122 | parsed_actors = {} 123 | parsed_actors['total_cast_fb_likes'] = sum([actor['actor_fb_likes'] for actor in movie['cast_info']]) + movie['director_info']['director_fb_links'] 124 | for k, actor in enumerate(sorted_actors[:top_k]): 125 | if k < len(sorted_actors): 126 | parsed_actors['actor_{}_name'.format(k+1)] = actor['actor_name'] 127 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = actor['actor_fb_likes'] 128 | else: 129 | parsed_actors['actor_{}_name'.format(k+1)] = None 130 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = None 131 | return parsed_actors 132 | 133 | def parse_production_company(movie): 134 | """ 135 | Convert production companies to a dictionnary for dataframe. 136 | Keeping only 3 production companies. 137 | :param movie: movie dictionnary 138 | :return: well-formated dictionnary with production companies 139 | """ 140 | parsed_production_co = {} 141 | top_k = 3 142 | production_companies = movie['production_co'][:top_k] 143 | for k, company in enumerate(production_companies): 144 | if k < len(movie['production_co']): 145 | parsed_production_co['production_co_{}'.format(k+1)] = company 146 | else: 147 | parsed_production_co['production_co_{}'.format(k+1)] = None 148 | return parsed_production_co 149 | 150 | def parse_genres(movie): 151 | """ 152 | Convert genres to a dictionnary for dataframe. 153 | :param movie: movie dictionnary 154 | :return: well-formated dictionnary with genres 155 | """ 156 | parse_genres = {} 157 | g = movie['genres'] 158 | with open('genre.json', 'r') as f: 159 | genres = json.load(f) 160 | for k, genre in enumerate(g): 161 | if genre in genres: 162 | parse_genres['genre_{}'.format(k+1)] = genres[genre] 163 | return parse_genres 164 | 165 | def create_dataframe(movies_content_path, movie_budget_path): 166 | """ 167 | Create dataframe from movie_budget.json and movie_content.json files. 
168 | :param movies_content_path: path of the movies_content.json file 169 | :param movie_budget_path: path of the movie_budget.json file 170 | :return: well formated dataframe 171 | """ 172 | with open(movies_content_path, 'r') as fp: 173 | movies = json.load(fp) 174 | with open(movie_budget_path, 'r') as fp: 175 | movies_budget = json.load(fp) 176 | movies_list = [] 177 | for movie in movies: 178 | content = {k:v for k,v in movie.items() if k not in ['awards', 'cast_info', 'director_info', 'production_co']} 179 | name = movie['movie_title'] 180 | try: 181 | budget = [film for film in movies_budget if film['movie_name']==name][0] 182 | budget = {k:v for k,v in budget.items() if k not in ['imdb_url', 'movie_name']} 183 | content.update(budget) 184 | except: 185 | pass 186 | try: 187 | content.update(parse_awards(movie)) 188 | except: 189 | pass 190 | try: 191 | content.update(parse_genres(movie)) 192 | except: 193 | pass 194 | try: 195 | content.update({k:v for k,v in movie['director_info'].items() if k!= 'director_link'}) 196 | except: 197 | pass 198 | try: 199 | content.update(parse_production_company(movie)) 200 | except: 201 | pass 202 | try: 203 | content.update(parse_actors(movie)) 204 | except: 205 | pass 206 | movies_list.append(content) 207 | df = pd.DataFrame(movies_list) 208 | df = df[pd.notnull(df.idmb_score)] 209 | df.idmb_score = df.idmb_score.apply(float) 210 | return df 211 | 212 | if __name__ == '__main__': 213 | if len(sys.argv) > 1: 214 | nb_elements = int(sys.argv[1]) 215 | else: 216 | nb_elements = None 217 | # movie_budget = get_movie_budget() 218 | # movies = get_imdb_urls(movie_budget, nb_elements=nb_elements) 219 | get_imdb_content("movie_budget.json", nb_elements=nb_elements) 220 | 221 | 222 | -------------------------------------------------------------------------------- /KaggleSoccer/Predicting Ratings.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 25, 6 | "metadata": { 7 | "ExecuteTime": { 8 | "end_time": "2017-05-21T05:03:03.866314Z", 9 | "start_time": "2017-05-21T05:03:03.851669Z" 10 | }, 11 | "collapsed": true 12 | }, 13 | "outputs": [], 14 | "source": [ 15 | "import pandas as pd\n", 16 | "import numpy as np\n", 17 | "import seaborn as sns\n", 18 | "import matplotlib.pyplot as plt\n", 19 | "import imp\n", 20 | "import json\n", 21 | "from sklearn.preprocessing import LabelEncoder\n", 22 | "from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingRegressor\n", 23 | "from sklearn.linear_model import Ridge, LinearRegression, Lasso\n", 24 | "from sklearn.kernel_ridge import KernelRidge\n", 25 | "from sklearn.svm import SVR, LinearSVC\n", 26 | "from sklearn.model_selection import train_test_split\n", 27 | "import whoscored\n", 28 | "from unidecode import unidecode\n", 29 | "import pickle\n", 30 | "import difflib\n", 31 | "import glob\n", 32 | "%matplotlib inline" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## Data Processing" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 17, 45 | "metadata": { 46 | "ExecuteTime": { 47 | "end_time": "2017-05-21T05:00:15.202603Z", 48 | "start_time": "2017-05-21T05:00:15.081729Z" 49 | }, 50 | "collapsed": true 51 | }, 52 | "outputs": [], 53 | "source": [ 54 | "teams = {}\n", 55 | "for file in glob.glob('./games/*.json'):\n", 56 | " with open(file,'r') as f:\n", 57 | " teams[file.split('/')[2].split('_')[0]] 
= json.load(f)" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 21, 63 | "metadata": { 64 | "ExecuteTime": { 65 | "end_time": "2017-05-21T05:01:41.986757Z", 66 | "start_time": "2017-05-21T05:01:41.952494Z" 67 | }, 68 | "collapsed": true 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "def get_position_sub(row):\n", 73 | " if row['Position'] == 'Sub':\n", 74 | " name = row['Player']\n", 75 | " try:\n", 76 | " row['Position'] = [k for k in df[df.Player == name].Position.value_counts().index if k !='Sub'][0]\n", 77 | " except:\n", 78 | " row['Position'] = 'substitute'\n", 79 | " return row\n", 80 | "\n", 81 | "stop_words = ['<','Bl.', 'Exc.', '>']\n", 82 | "def mean_rating(row):\n", 83 | " if pd.isnull(row['Rating_LEquipe']) or row['Rating_LEquipe'] in stop_words:\n", 84 | " name = row['Player']\n", 85 | " temp = df[(df.Player == name) & (~df.Rating_LEquipe.isin(stop_words)) & (pd.notnull(df.Rating_LEquipe))]\n", 86 | " if len(temp) > 4:\n", 87 | " row['Rating_LEquipe'] = np.mean(temp.Rating_LEquipe.apply(int))\n", 88 | " return row\n", 89 | "\n", 90 | "position_mapping = {'attackingmidfieldcenter' : 'attackingmidfield',\n", 91 | " 'attackingmidfieldleft' : 'attackingmidfield',\n", 92 | " 'attackingmidfieldright' : 'attackingmidfield',\n", 93 | " 'defenderleft' : 'defenderlateral',\n", 94 | " 'defendermidfieldcenter' : \"defendermidfield\",\n", 95 | " 'defendermidfieldleft' : 'defendermidfield',\n", 96 | " 'defendermidfieldright': \"defendermidfield\",\n", 97 | " 'defenderright': 'defenderlateral',\n", 98 | " 'forwardleft' : 'forwardlateral',\n", 99 | " 'forwardright' : 'forwardlateral',\n", 100 | " 'midfieldcenter' :'midfield' ,\n", 101 | " 'midfieldleft' :'midfield' ,\n", 102 | " 'midfieldright' :'midfield' }\n", 103 | "\n", 104 | "mapping_team_name = {'ASM': 'Monaco',\n", 105 | " 'ASNL': 'Nancy',\n", 106 | " 'ASSE': 'Saint-Etienne',\n", 107 | " 'DFCO': 'Dijon',\n", 108 | " 'EAG': 'Guingamp',\n", 109 | " 'FCGB': 'Bordeaux',\n", 110 | " 'FCL': 'Lorient',\n", 111 | " 'FCM': 'Metz',\n", 112 | " 'FCN': 'Nantes',\n", 113 | " 'Losc': 'Lille',\n", 114 | " 'MHSC': 'Montpellier',\n", 115 | " 'Man. City': 'Manchester City',\n", 116 | " 'Man. 
United': 'Manchester United',\n", 117 | " 'OGCN': 'Nice',\n", 118 | " 'OL': 'Lyon',\n", 119 | " 'OM': 'Marseille',\n", 120 | " 'PSG': 'Paris Saint Germain',\n", 121 | " 'Palace': 'Crystal Palace',\n", 122 | " 'SCB': 'SC Bastia',\n", 123 | " 'SCO': 'Angers',\n", 124 | " 'SMC': 'Caen',\n", 125 | " 'SRFC': 'Rennes',\n", 126 | " 'Stoke City': 'Stoke',\n", 127 | " 'TFC': 'Toulouse',\n", 128 | " 'WBA': 'West Bromwich Albion'}" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 138, 134 | "metadata": { 135 | "ExecuteTime": { 136 | "end_time": "2017-05-21T05:41:57.976844Z", 137 | "start_time": "2017-05-21T05:41:31.980530Z" 138 | }, 139 | "collapsed": true 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "df = pd.DataFrame()\n", 144 | "for team, file in teams.items():\n", 145 | " for i, game in enumerate(file):\n", 146 | " temp = pd.DataFrame(game['stats'])\n", 147 | " temp['Opponent'] = game['opponent']\n", 148 | " temp['Place'] = game['place'].title()\n", 149 | " result = game['result'].split(' : ')\n", 150 | " temp['Goal Team'] = int(result[int(game['place'] == 'away')])\n", 151 | " temp['Goal Opponent'] = int(result[int(game['place'] == 'home')])\n", 152 | " temp['Team'] = {'psg':'Paris Saint Germain'}.get(team, team).title()\n", 153 | " temp['Team'] = {'Bastia': 'SC Bastia'}.get(temp['Team'][0], temp['Team'][0])\n", 154 | " temp['Day'] = i+1\n", 155 | " df = pd.concat([df, temp])\n", 156 | "df = df.apply(whoscored.get_name, axis=1).reset_index(drop=True)\n", 157 | "df['LineUp'] = 1\n", 158 | "df.loc[df.Position == 'Sub', 'LineUp'] = 0\n", 159 | "df = df.apply(get_position_sub, axis=1)\n", 160 | "df.Goal.fillna(0, inplace=True)\n", 161 | "df.Assist.fillna(0, inplace=True)\n", 162 | "df.Yellowcard.fillna(0, inplace=True)\n", 163 | "df.Redcard.fillna(0, inplace=True)\n", 164 | "df.Penaltymissed.fillna(0, inplace=True)\n", 165 | "df.Shotonpost.fillna(0, inplace=True)\n", 166 | "# df.Position = df.Position.apply(lambda x:position_mapping.get(x,x))" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 139, 172 | "metadata": { 173 | "ExecuteTime": { 174 | "end_time": "2017-05-21T05:41:57.990536Z", 175 | "start_time": "2017-05-21T05:41:57.978189Z" 176 | } 177 | }, 178 | "outputs": [], 179 | "source": [ 180 | "with open('./ratings/notes_ligue1_lequipe.json','r') as f:\n", 181 | " rating_lequipe = json.load(f)\n", 182 | "rating_lequipe = {mapping_team_name.get(k,k):[p for p in v if list(p.keys())[0] != 'Nom' and len(list(p.values())[0]) > 0] \n", 183 | " for k,v in rating_lequipe.items()}" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 140, 189 | "metadata": { 190 | "ExecuteTime": { 191 | "end_time": "2017-05-21T05:42:40.226407Z", 192 | "start_time": "2017-05-21T05:41:57.992420Z" 193 | }, 194 | "scrolled": true 195 | }, 196 | "outputs": [], 197 | "source": [ 198 | "for team_name, rating_team in rating_lequipe.items():\n", 199 | " if team_name in set(df.Team):\n", 200 | " players_lequipe = [list(k.keys())[0] for k in rating_team]\n", 201 | " players_df = list(set(df[df.Team == team_name].Player))\n", 202 | " for player in rating_team:\n", 203 | " [(player_name, player_ratings)] = player.items()\n", 204 | " try:\n", 205 | " player_name_df = difflib.get_close_matches(player_name, players_df)[0]\n", 206 | " except:\n", 207 | " if len(unidecode(player_name).split('-')) > 1 :\n", 208 | " player_name_df = [k for k in players_df if unidecode(player_name).split('-')[0].replace(\"'\",\"\").lower() in 
unidecode(k).replace(\"'\",\"\").lower()\n", 209 | " or unidecode(player_name).split('-')[1].replace(\"'\",\"\").lower() in unidecode(k).replace(\"'\",\"\").lower()][0]\n", 210 | " else:\n", 211 | " player_name_df = [k for k in players_df if unidecode(player_name).replace(\"'\",\"\").lower() in unidecode(k).replace(\"'\",\"\").lower()][0]\n", 212 | " for day, rating in player_ratings.items():\n", 213 | " df.loc[(df.Player == player_name_df) & (df.Team == team_name) & (df.Day == int(day.split('Day ')[1])), 'Rating_LEquipe'] = rating\n", 214 | "df = df.apply(mean_rating, axis=1)\n", 215 | "df.drop('null', axis=1, inplace=True)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 141, 221 | "metadata": { 222 | "ExecuteTime": { 223 | "end_time": "2017-05-21T05:42:40.279122Z", 224 | "start_time": "2017-05-21T05:42:40.227785Z" 225 | }, 226 | "scrolled": true 227 | }, 228 | "outputs": [ 229 | { 230 | "name": "stdout", 231 | "output_type": "stream", 232 | "text": [ 233 | "9552/10187 données avec une note l'Équipe\n" 234 | ] 235 | }, 236 | { 237 | "data": { 238 | "text/html": [ 239 | "
\n", 240 | "\n", 253 | "\n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | "
TeamGoal TeamGoal OpponentPlayerGoalOpponentDayPositionAgeLineUpRating_LEquipe
415Paris Saint Germain21Thiago Motta0.0Lyon30midfieldcenter3405
9580Toulouse00Mauro Goicoechea0.0Rennes30goalkeeper2917
5736Metz01Thomas Didillon0.0Rennes11goalkeeper2115
6440Nantes11Rémy Riou0.0Metz25goalkeeper2915
3446Lille10Eder0.0SC Bastia31forward2914
9100Saint-Etienne11Kevin Malcuit0.0Rennes33defenderright2515.42105
5640Metz30Yann Jouffre0.0Nantes4attackingmidfieldright3204.88889
2401Angers30Abdoulaye Bamba0.0SC Bastia27defenderright2716
\n", 385 | "
" 386 | ], 387 | "text/plain": [ 388 | " Team Goal Team Goal Opponent Player Goal \\\n", 389 | "415 Paris Saint Germain 2 1 Thiago Motta 0.0 \n", 390 | "9580 Toulouse 0 0 Mauro Goicoechea 0.0 \n", 391 | "5736 Metz 0 1 Thomas Didillon 0.0 \n", 392 | "6440 Nantes 1 1 Rémy Riou 0.0 \n", 393 | "3446 Lille 1 0 Eder 0.0 \n", 394 | "9100 Saint-Etienne 1 1 Kevin Malcuit 0.0 \n", 395 | "5640 Metz 3 0 Yann Jouffre 0.0 \n", 396 | "2401 Angers 3 0 Abdoulaye Bamba 0.0 \n", 397 | "\n", 398 | " Opponent Day Position Age LineUp Rating_LEquipe \n", 399 | "415 Lyon 30 midfieldcenter 34 0 5 \n", 400 | "9580 Rennes 30 goalkeeper 29 1 7 \n", 401 | "5736 Rennes 11 goalkeeper 21 1 5 \n", 402 | "6440 Metz 25 goalkeeper 29 1 5 \n", 403 | "3446 SC Bastia 31 forward 29 1 4 \n", 404 | "9100 Rennes 33 defenderright 25 1 5.42105 \n", 405 | "5640 Nantes 4 attackingmidfieldright 32 0 4.88889 \n", 406 | "2401 SC Bastia 27 defenderright 27 1 6 " 407 | ] 408 | }, 409 | "execution_count": 141, 410 | "metadata": {}, 411 | "output_type": "execute_result" 412 | } 413 | ], 414 | "source": [ 415 | "print('%d/%d données avec une note l\\'Équipe' % (len(df[(~df.Rating_LEquipe.isin(stop_words)) & (pd.notnull(df.Rating_LEquipe))]),\n", 416 | " len(df)))\n", 417 | "df[['Team','Goal Team', 'Goal Opponent', 'Player', 'Goal',\n", 418 | " 'Opponent', 'Day', 'Position', 'Age', 'LineUp', 'Rating_LEquipe']].sort_values('Player').sample(8)" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 143, 424 | "metadata": { 425 | "ExecuteTime": { 426 | "end_time": "2017-05-21T05:43:00.252474Z", 427 | "start_time": "2017-05-21T05:43:00.221024Z" 428 | }, 429 | "collapsed": true 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "with open('./df.pkl', 'rb') as f:\n", 434 | " df = pickle.load(f)" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 176, 440 | "metadata": { 441 | "ExecuteTime": { 442 | "end_time": "2017-05-21T07:45:29.115217Z", 443 | "start_time": "2017-05-21T07:45:28.990990Z" 444 | }, 445 | "collapsed": true 446 | }, 447 | "outputs": [], 448 | "source": [ 449 | "other_cols = ['Day', \"Opponent\", \"Place\", \"Player\", \"Position\", \n", 450 | " \"Team\", 'Key Events', \"Rating_LEquipe\"]\n", 451 | "col_delete = ['Key Events', \"Rating_LEquipe\", 'Rating', 'Day']\n", 452 | "cols_to_transf = [col for col in df.columns if col not in other_cols]\n", 453 | "p = pd.concat([df[cols_to_transf].applymap(float), df[other_cols]], axis=1)\n", 454 | "new_col_to_remov = [e for e in p.columns if e.startswith('Acc')] + ['Touches']" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": { 460 | "ExecuteTime": { 461 | "end_time": "2017-05-20T18:57:44.123776Z", 462 | "start_time": "2017-05-20T18:57:44.118723Z" 463 | } 464 | }, 465 | "source": [ 466 | "## Learning" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 177, 472 | "metadata": { 473 | "ExecuteTime": { 474 | "end_time": "2017-05-21T07:45:30.269693Z", 475 | "start_time": "2017-05-21T07:45:30.254692Z" 476 | } 477 | }, 478 | "outputs": [], 479 | "source": [ 480 | "p = p[(~p.Rating_LEquipe.isin(stop_words)) & (pd.notnull(p.Rating_LEquipe))].set_index('Player')\n", 481 | "X = p[[col for col in p.columns if col not in col_delete + new_col_to_remov]]\n", 482 | "y = p['Rating_LEquipe'].apply(float)" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": 178, 488 | "metadata": { 489 | "ExecuteTime": { 490 | "end_time": "2017-05-21T07:45:31.239689Z", 491 | "start_time": 
"2017-05-21T07:45:30.949497Z" 492 | } 493 | }, 494 | "outputs": [ 495 | { 496 | "name": "stderr", 497 | "output_type": "stream", 498 | "text": [ 499 | "/Users/alexandreattia/Desktop/Work/workenv/lib/python3.5/site-packages/ipykernel/__main__.py:5: SettingWithCopyWarning: \n", 500 | "A value is trying to be set on a copy of a slice from a DataFrame.\n", 501 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 502 | "\n", 503 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 504 | ] 505 | } 506 | ], 507 | "source": [ 508 | "label_encoders = {}\n", 509 | "col_encode = [\"Opponent\", \"Place\", \"Position\", \"Team\"]\n", 510 | "for col in col_encode:\n", 511 | " label_encoders[col.lower()] = LabelEncoder()\n", 512 | " X[col] = label_encoders[col.lower()].fit_transform(X[col])" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 123, 518 | "metadata": { 519 | "ExecuteTime": { 520 | "end_time": "2017-05-21T05:28:51.343565Z", 521 | "start_time": "2017-05-21T05:28:51.299523Z" 522 | } 523 | }, 524 | "outputs": [ 525 | { 526 | "name": "stdout", 527 | "output_type": "stream", 528 | "text": [ 529 | "5-fold cross validation mean square error : 0.86\n" 530 | ] 531 | } 532 | ], 533 | "source": [ 534 | "d = []\n", 535 | "k_fold = 5\n", 536 | "for k in range(k_fold): \n", 537 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/k_fold)\n", 538 | " ridge = Ridge(alpha=0.001, max_iter=10000, tol=0.0001)\n", 539 | " ridge.fit(X_train, y_train)\n", 540 | " d.append(np.mean((ridge.predict(X_test) - y_test) ** 2))\n", 541 | "print('%d-fold cross validation mean square error : %.2f' % (k_fold, np.mean(d)))\n", 542 | "ridge = Ridge(alpha=0.001, max_iter=10000, tol=0.0001)\n", 543 | "_ = ridge.fit(X, y)" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": 137, 549 | "metadata": { 550 | "ExecuteTime": { 551 | "end_time": "2017-05-21T05:37:59.969171Z", 552 | "start_time": "2017-05-21T05:37:59.963504Z" 553 | } 554 | }, 555 | "outputs": [ 556 | { 557 | "name": "stdout", 558 | "output_type": "stream", 559 | "text": [ 560 | "The 10 most important parameters are:\n", 561 | "Penaltymissed : -0.59\n", 562 | "Goal : 0.43\n", 563 | "Assist : 0.36\n", 564 | "Redcard : -0.31\n", 565 | "Goal Opponent : -0.22\n", 566 | "Shotonpost : 0.19\n", 567 | "ShotsOT : 0.16\n", 568 | "Goal Team : 0.15\n", 569 | "Yellowcard : -0.09\n", 570 | "Place : -0.09\n" 571 | ] 572 | } 573 | ], 574 | "source": [ 575 | "top_k = 10\n", 576 | "l = np.argsort(list(map(np.abs, ridge.coef_)))[::-1][:top_k]\n", 577 | "print(\"The %d most important parameters are:\\n%s\" % (top_k,'\\n'.join(['%s : %.2f' % (a,b) \n", 578 | " for a,b in zip(X_train.columns[l], ridge.coef_[l])])))" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": 180, 584 | "metadata": { 585 | "ExecuteTime": { 586 | "end_time": "2017-05-21T07:46:19.524179Z", 587 | "start_time": "2017-05-21T07:45:50.166761Z" 588 | } 589 | }, 590 | "outputs": [ 591 | { 592 | "name": "stdout", 593 | "output_type": "stream", 594 | "text": [ 595 | "RF : oob score : 1.00\n", 596 | "RF : 5-fold cross validation mean square error : 0.74\n" 597 | ] 598 | } 599 | ], 600 | "source": [ 601 | "rf = RandomForestRegressor(oob_score=True, n_estimators=100)\n", 602 | "rf.fit(X, y)\n", 603 | "print('RF : oob score : %.2f' % rf.oob_score)\n", 604 | "d = []\n", 605 | "k_fold = 5\n", 606 | "for k in range(k_fold): \n", 607 | " X_train, X_test, 
y_train, y_test = train_test_split(X, y, test_size=1/k_fold)\n", 608 | " rf = RandomForestRegressor(oob_score=True, n_estimators=100)\n", 609 | " rf.fit(X_train, y_train)\n", 610 | " d.append(np.mean((rf.predict(X_test) - y_test) ** 2))\n", 611 | "print('RF : %d-fold cross validation mean square error : %.2f' % (k_fold, np.mean(d)))" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": { 618 | "collapsed": true 619 | }, 620 | "outputs": [], 621 | "source": [] 622 | } 623 | ], 624 | "metadata": { 625 | "kernelspec": { 626 | "display_name": "workenv", 627 | "language": "python", 628 | "name": "workenv" 629 | }, 630 | "language_info": { 631 | "codemirror_mode": { 632 | "name": "ipython", 633 | "version": 3 634 | }, 635 | "file_extension": ".py", 636 | "mimetype": "text/x-python", 637 | "name": "python", 638 | "nbconvert_exporter": "python", 639 | "pygments_lexer": "ipython3", 640 | "version": "3.5.2" 641 | }, 642 | "nav_menu": {}, 643 | "toc": { 644 | "navigate_menu": true, 645 | "number_sections": true, 646 | "sideBar": true, 647 | "threshold": 6, 648 | "toc_cell": false, 649 | "toc_section_display": "block", 650 | "toc_window_display": false 651 | } 652 | }, 653 | "nbformat": 4, 654 | "nbformat_minor": 2 655 | } 656 | -------------------------------------------------------------------------------- /KaggleSoccer/README.md: -------------------------------------------------------------------------------- 1 | # Parsing and using Soccer data 2 | 3 | ## First Part - Parsing data 4 | 5 | The first part aims to parse data from multiple websites : games, teams, players, etc. 6 | To collect data about a team and dump a json file (must have created a ``./teams`` folder) : 7 | ``python3 dumper.py team_name`` 8 | 9 | ## Second Part - Data Analysis 10 | 11 | The second part is to analyze the dataset to understand what I can do with it. 
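For example, a team file dumped by `dumper.py` can be flattened into a pandas dataframe with `convert_df_games` from `parser.py`. A minimal sketch, assuming a `./teams/monaco.json` file has already been dumped (the team name and path are only illustrative):

```
import json
import parser  # KaggleSoccer/parser.py

# Team file produced by `python3 dumper.py monaco` (illustrative path)
with open('./teams/monaco.json', 'r') as f:
    team = json.load(f)

# One row per played game, with goals, line-ups and match statistics
df = parser.convert_df_games(team)
print(df[['date', 'opponent', 'place', 'result']].head())
```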
12 | 13 | ![Correlation Matrix](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/psg_stats.png) 14 | 15 | ![PSG vs Saint-Etienne](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/psg_ste.png) -------------------------------------------------------------------------------- /KaggleSoccer/df.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleSoccer/df.pkl -------------------------------------------------------------------------------- /KaggleSoccer/dumper.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | import time 4 | from datetime import datetime 5 | import json 6 | from tqdm import tqdm 7 | import numpy as np 8 | import seaborn as sns 9 | from matplotlib import patches 10 | import pandas as pd 11 | import parser 12 | import sys 13 | import random 14 | from collections import Counter 15 | from bs4 import BeautifulSoup 16 | from sklearn import preprocessing 17 | import urllib 18 | 19 | def convert_int(string): 20 | try: 21 | string = int(string) 22 | return string 23 | except: 24 | return string 25 | 26 | def get_team(request, link=None): 27 | team = {} 28 | if not link: 29 | team_searched = request 30 | team_searched = urllib.parse.quote(team_searched.encode('utf-8')) 31 | search_link = "http://us.soccerway.com/search/teams/?q={}".format(team_searched) 32 | response = requests.get(search_link) 33 | bs = BeautifulSoup(response.text, 'lxml') 34 | results = bs.find("ul", class_='search-results') 35 | # Take the first results 36 | try: 37 | link = "http://us.soccerway.com" + results.find_all('a')[0]['href'] 38 | print('Please check team link:', link) 39 | team['id_'] = results.find_all('a')[0]["href"].split('/')[4] 40 | team['name'] = results.find_all('a')[0].text 41 | team['country'] = results.find_all('a')[0]["href"].split('/')[2] 42 | except: 43 | print('No team found !') 44 | else: 45 | team['id_'] = link.split('/')[6] 46 | team['name'] = link.split('/')[5] 47 | team['country'] = link.split('/')[4] 48 | return team 49 | 50 | def get_games(team, nb_pages=12): 51 | games = [] 52 | for page_number in range(nb_pages): 53 | link_base = 'http://us.soccerway.com/a/block_team_matches?block_id=page_team_1_block_team_matches_3&callback_params=' 54 | link_ = urllib.parse.quote('{"page":0,"bookmaker_urls":[],"block_service_id":"team_matches_block_teammatches","team_id":%s,\ 55 | "competition_id":0,"filter":"all","new_design":false}' % team['id_']) + '&action=changePage¶ms=' + urllib.parse.quote('{"page":-%s}' % (page_number)) 56 | link = link_base + link_ 57 | response = requests.get(link) 58 | 59 | test = json.loads(response.text)['commands'][0]['parameters']['content'] 60 | bs = BeautifulSoup(test, 'lxml') 61 | 62 | for kind in ['even', 'odd']: 63 | for elem in bs.find_all('tr', class_ = kind): 64 | game = {} 65 | game["date"] = elem.find('td', {'class': ["full-date"]}).text 66 | game["competition"] = elem.find('td', {'class': ["competition"]}).text 67 | game["team_a"] = elem.find('td', class_='team-a').text 68 | game["team_b"] = elem.find('td', class_='team-b').text 69 | game['link'] = "http://us.soccerway.com" + elem.find('td', class_='score-time').find('a')['href'] 70 | game["score"] = elem.find('td', class_='score-time').text.replace(' ','') 71 | if 'E' in game["score"]: 72 | game["score"] = game['score'].replace('E','') 73 | 
game['extra_time'] = True 74 | if 'P' in game["score"]: 75 | game["score"] = game['score'].replace('P','') 76 | game['penalties'] = True 77 | if datetime.strptime(game["date"], '%d/%m/%y') < datetime.now(): 78 | game = parser.get_score_details(game, team) 79 | time.sleep(random.uniform(0, 0.25)) 80 | game.update(parser.get_goals(game['link'])) 81 | else: 82 | del game['score'] 83 | games.append(game) 84 | games = sorted(games, key=lambda x:datetime.strptime(x['date'], '%d/%m/%y')) 85 | team['games'] = games 86 | return team 87 | 88 | def get_squad(team, season_path='./seasons_codes.json'): 89 | with open(season_path, 'r') as f: 90 | seasons = json.load(f)[team["country"]] 91 | 92 | team['squad'] = {} 93 | for k,v in seasons.items(): 94 | link_base = 'http://us.soccerway.com/a/block_team_squad?block_id=page_team_1_block_team_squad_3&callback_params=' 95 | link_ = urllib.parse.quote('{"team_id":%s}' % team['id_']) + '&action=changeSquadSeason¶ms=' + urllib.parse.quote('{"season_id":%s}' % v) 96 | link = link_base + link_ 97 | response = requests.get(link) 98 | test = json.loads(response.text)['commands'][0]['parameters']['content'] 99 | bs = BeautifulSoup(test, 'lxml') 100 | 101 | players = bs.find('tbody').find_all('tr') 102 | squad = [{ 103 | k: convert_int(player.find('td', class_=k).text) 104 | for k in [k for k,v in Counter(np.concatenate([elem.attrs['class'] for elem in player.find_all('td')])).items() 105 | if v < 2 and k not in ['photo', '', 'flag']] 106 | } for player in players] 107 | team['squad'][k] = squad 108 | try: 109 | coach = {'position': 'Coach', 'name':bs.find_all('tbody')[1].text} 110 | team['coach'][k] = coach 111 | except: pass 112 | return team 113 | 114 | if __name__ == '__main__': 115 | if len(sys.argv) > 1: 116 | request = ' '.join(sys.argv[1:]) 117 | else: 118 | raise ValueError('You must enter a requested team!') 119 | team = get_team(request) 120 | f = input('Satisfied ? (Y/n) ') 121 | count = 0 122 | while f not in ['Y', 'y','yes','']: 123 | if count < 3: 124 | request = input('Enter a new request : ') 125 | link = None 126 | else: 127 | link = input('Paste team link : ') 128 | team = get_team(request, link) 129 | f = input('Satisfied ? 
(Y/n)') 130 | count += 1 131 | team = get_games(team) 132 | team = get_squad(team) 133 | with open('./teams/%s.json' % team["name"].lower(), 'w') as f: 134 | json.dump(team, f) 135 | -------------------------------------------------------------------------------- /KaggleSoccer/parser.py: -------------------------------------------------------------------------------- 1 | import urllib 2 | import requests 3 | import time 4 | import pandas as pd 5 | import numpy as np 6 | import datetime 7 | import json 8 | from tqdm import tqdm 9 | import random 10 | from bs4 import BeautifulSoup 11 | from collections import Counter 12 | 13 | def get_score_details(game, team): 14 | if game["team_a"] == team['name']: 15 | game['place'] = 'home' 16 | try: 17 | game['nb_goals_{}'.format(team['name'])] = int(game['score'].split('-')[0]) 18 | game['nb_goals_adv'] = int(game['score'].split('-')[1]) 19 | except:pass 20 | else: 21 | game['place'] = 'away' 22 | try: 23 | game['nb_goals_{}'.format(team['name'])] = int(game['score'].split('-')[1]) 24 | game['nb_goals_adv'] = int(game['score'].split('-')[0]) 25 | except:pass 26 | try: 27 | if game['nb_goals_{}'.format(team['name'])] > game['nb_goals_adv']: 28 | game['result'] = 'WIN' 29 | elif game['nb_goals_{}'.format(team['name'])] == game['nb_goals_adv']: 30 | game['result'] = 'TIE' 31 | else: 32 | game['result'] = 'LOST' 33 | except:pass 34 | return game 35 | 36 | def shot_team(row, columns, team_name): 37 | try: 38 | for col in columns: 39 | if row['team_a'] == team_name: 40 | row['{}_{}'.format(col.lower().replace(' ','_'), team_name.lower())] = row[col]['team_a'] 41 | row['{}_adv'.format(col.lower().replace(' ','_'))] = row[col]['team_b'] 42 | else: 43 | row['{}_{}'.format(col.lower().replace(' ','_'), team_name.lower())] = row[col]['team_b'] 44 | row['{}_adv'.format(col.lower().replace(' ','_'))] = row[col]['team_a'] 45 | except: pass 46 | return row 47 | 48 | def convert_team_name(row, team_name): 49 | for col in ['goals', 'players_team', "subs_in"]: 50 | try: 51 | if row['team_a'] == team_name: 52 | row['{}_{}'.format(col, team_name.lower())] = row['{}_a'.format(col)] 53 | row['{}_adv'.format(col)] = row['{}_b'.format(col)] 54 | else: 55 | row['{}_{}'.format(col, team_name.lower())] = row['{}_b'.format(col)] 56 | row['{}_adv'.format(col)] = row['{}_a'.format(col)] 57 | del row['{}_a'.format(col)], row['{}_b'.format(col)] 58 | except: pass 59 | return row 60 | 61 | def player_per_opponent(df, player, team_name): 62 | stats_player = {} 63 | for opponent in set(df.opponent): 64 | try: 65 | game_played = len([e for e in list(df[df.opponent == opponent]["players_team_%s" % team_name.lower()]) + 66 | list(df[df.opponent == opponent]["subs_in_%s" % team_name.lower()]) 67 | if player in e]) 68 | if game_played > 0: 69 | goals = len([e for e in np.concatenate([elem for elem in df[df.opponent == opponent]["goals_%s" % team_name.lower()] if type(elem) == list]) 70 | if e['player'] == player]) 71 | assist = len([e for e in np.concatenate([elem for elem in df[df.opponent == opponent]["goals_%s" % team_name.lower()] if type(elem) == list]) 72 | if 'assist' in e and e['assist'] == player]) 73 | de = goals + assist 74 | 75 | stats_player[opponent] = {'goals':goals, 76 | 'goals_game':goals/game_played, 77 | 'decisive_game':de/game_played, 78 | 'assists':assist, 79 | 'decisive':de, 80 | 'games':game_played} 81 | except: pass 82 | return stats_player 83 | 84 | def ratio_one_opponent(df, opponent, team_name, top_k=10): 85 | df_oppo = df[df.opponent == opponent] 86 | game_player 
= dict(Counter(np.concatenate(list(df_oppo["players_team_%s" % team_name.lower()]) + 87 | list(df_oppo["subs_in_%s" % team_name.lower()]))).items()) 88 | goal_counter = Counter([elem['player'] for elem in np.concatenate(list(df_oppo["goals_%s" % team_name.lower()]))]) 89 | assist_counter = Counter([elem['assist'] for elem in np.concatenate(list(df_oppo["goals_%s" % team_name.lower()])) if "assist" in elem]) 90 | goal_ratio = sorted([(k, v/game_player[k]) for k,v in dict(goal_counter).items() if k in game_player and game_player[k] >= 3 ], 91 | key=lambda x:x[1], reverse=True)[:top_k] 92 | assist_ratio = sorted([(k, v/game_player[k]) for k,v in dict(assist_counter).items() if k in game_player and game_player[k] >= 3 ], 93 | key=lambda x:x[1], reverse=True)[:top_k] 94 | decisive_ratio = sorted([(k, v/game_player[k]) for k,v in dict(goal_counter+assist_counter).items() if k in game_player and game_player[k] >= 3 ], 95 | key=lambda x:x[1], reverse=True)[:top_k] 96 | return goal_ratio, assist_ratio, decisive_ratio 97 | 98 | def get_goals_team(bs, team): 99 | if bs.find('span', class_='bidi').text != ' 0 - 0': 100 | goals = [{'player':elem.find('a').text, 101 | 'time':elem.find('span', class_='minute').text.replace("'",''), 102 | 'assist': elem.find('span', class_='assist')} for elem in bs.find('div', class_='block_match_goals').find_all('td', class_='player-{}'.format(team)) if elem.text !='\n\n'] 103 | for goal in goals: 104 | if goal['assist'] != None: 105 | goal['assist'] = goal['assist'].text.replace('(assist by ','').replace(')','') 106 | else: 107 | del goal['assist'] 108 | else: 109 | goals=[] 110 | return goals 111 | 112 | def get_lineups(bs): 113 | lineups = {} 114 | players = [elem.text.replace('\n','') for elem in bs.find_all('td', class_='large-link')] 115 | lineups["players_team_a"] = players[:11] 116 | lineups["players_team_b"] = players[11:22] 117 | lineups['subs_in_a'] = [elem[:elem.index('for')] for elem in players[22:29] if 'for' in elem] 118 | lineups['subs_in_b'] = [elem[:elem.index('for')] for elem in players[29:36] if 'for' in elem] 119 | return lineups 120 | 121 | def get_stats(bs): 122 | l = bs.find('div', class_='block_match_stats_plus_chart').find('iframe')['src'] 123 | response = requests.get('http://us.soccerway.com/' + l) 124 | bs2 = BeautifulSoup(response.text,'lxml') 125 | stats_list = [elem.text[1:-1].split('\n') for elem in bs2.find('table').find('tr').find_all('tr')[1:]][0::2] 126 | stats = {elem[1]:{'team_a': int(elem[0]), 'team_b':int(elem[2])} for elem in stats_list} 127 | return stats 128 | 129 | def get_goals(link): 130 | game = {} 131 | response = requests.get(link) 132 | bs = BeautifulSoup(response.text, 'lxml') 133 | try: 134 | game['goals_a'], game['goals_b'] = [get_goals_team(bs, 'a'), get_goals_team(bs, 'b')] 135 | except: pass 136 | try: 137 | game.update(get_lineups(bs)) 138 | except: pass 139 | try: 140 | game.update(get_stats(bs)) 141 | except: pass 142 | return game 143 | 144 | def convert_df_games(dict_file): 145 | df = pd.DataFrame(dict_file["games"]) 146 | df['date'] = pd.to_datetime(df.date.apply(lambda x:'/'.join([x.split('/')[1],x.split('/')[0], x.split('/')[2]]))) 147 | df['month'] = df.date.apply(lambda x:x.month) 148 | df['year'] = df.date.apply(lambda x:x.year) 149 | df = df[df.date < datetime.datetime.now()] 150 | df = df.sort_values('date', ascending=False) 151 | team_name = df.team_a.value_counts().index[0] 152 | df['opponent'] = df.apply(lambda x:(x['team_a']+x['team_b']).replace(team_name, ''), axis=1) 153 | cols = ['Corners', 
                              'Fouls', 'Offsides', 'Shots on target', 'Shots wide']
154 |     df = df.apply(lambda x:shot_team(x, cols, team_name),axis=1)
155 |     df = df.apply(lambda x:convert_team_name(x, team_name),axis=1)
156 |     df = df.drop(cols, axis=1)
157 |     return df
158 | 
--------------------------------------------------------------------------------
/KaggleTaxiTrip/README.md:
--------------------------------------------------------------------------------
1 | ## New York City Taxi Trip Duration
2 | 
3 | Kaggle playground to predict the total ride duration of taxi trips in New York City.
4 | 
5 | ### First part - Data exploration
6 | The first part is to analyze the dataframe and observe correlations between variables.
7 | ![Distributions](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleTaxiTrip/pic/download.png)
8 | ![Rush Hour](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleTaxiTrip/pic/rush_hour.png)
9 | 
10 | ### Second part - Clustering
11 | The goal of this playground is to predict the trip duration of the test set. We know that some neighborhoods are more congested, so I used K-Means to compute geo-clusters for pickup and drop-off locations.
12 | ![Cluster](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleTaxiTrip/pic/nyc_clusters.png)
13 | 
14 | ### Third part - Cleaning and feature selection
15 | I found some oddly long trips : one-day trips with a mean speed < 1 km/h.
16 | ![Outliers](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleTaxiTrip/pic/outliners.png)
17 | I removed these outliers.
18 | 
19 | I also added features derived from the available data : Haversine distance, Manhattan distance, per-cluster means and PCA-rotated coordinates.
20 | 
21 | ### Fourth part - Prediction
22 | I compared Random Forest and XGBoost.
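The playground is scored with the Root Mean Squared Logarithmic Error. A minimal NumPy sketch of the metric, assuming `y_true` and `y_pred` are arrays of trip durations in seconds:

```
import numpy as np

def rmsle(y_true, y_pred):
    # Square root of the mean squared difference between log(1 + duration) values
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```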
23 | Current Root Mean Squared Logarithmic Error : 0.391
24 | 
25 | Feature importance for RF & XGBoost
26 | ![Feature importance](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleTaxiTrip/pic/feat_importance.png)
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/download.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/download.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/feat_importance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/feat_importance.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/nyc_clusters.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/nyc_clusters.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/outliners.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/outliners.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/rush_hour.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/rush_hour.png
--------------------------------------------------------------------------------
/KaggleTitanic/README.md:
--------------------------------------------------------------------------------
1 | ## Titanic: Machine Learning from Disaster
2 | 
3 | The Titanic challenge on Kaggle is a competition whose goal is to predict the survival or death of a given passenger from a set of variables describing them, such as their age, sex, or passenger class on the boat.
--------------------------------------------------------------------------------
/KaggleWiki/README.md:
--------------------------------------------------------------------------------
1 | # Parsing and using Wikipedia Data
2 | 
3 | ## First Part - French governments
4 | 
5 | The first part aims to parse data from French Wikipedia pages about French governments. I play with the data to get insights about governments : number of ministers, ministers' studies, etc.
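A minimal usage sketch of `wikipedia_parser.py` (it relies on Selenium with ChromeDriver; the government page URL below is only an illustrative example):

```
from collections import Counter
import wikipedia_parser as wp

# Illustrative French Wikipedia page of a government
ministres = wp.get_ministres('https://fr.wikipedia.org/wiki/Gouvernement_Bernard_Cazeneuve')
print(len(ministres), 'ministers found')
# Count the schools attended by the ministers (ENA, IEP, X, etc.)
print(Counter(school for m in ministres for school in (m['Formation'] or [])))
```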
6 | -------------------------------------------------------------------------------- /KaggleWiki/wikipedia_parser.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from bs4 import BeautifulSoup 4 | from selenium import webdriver 5 | import time 6 | 7 | def convert_formation(s): 8 | mapping = {'IEP' : ['Institut d\'études politiques', 'IEP'], 9 | 'ENA' : ['ENA', 'École nationale d\'administration'], 10 | 'ENS': ['École normale supérieure', 'ENS'], 11 | 'ENPC':['École nationale des ponts et chaussées', 'ENPC'], 12 | 'X': ['École polytechnique'], 13 | 'HEC': ['HEC', 'École des hautes études commerciales'], 14 | 'ESCP': ['ESCP', 'École supérieure de commerce de Paris'], 15 | 'ESSEC' : ['ESSEC', "École Supérieure des Sciences Economiques et Commerciales "] 16 | } 17 | etudes = [] 18 | for key, value in mapping.items(): 19 | for v in value: 20 | if v in s: 21 | etudes.append(key) 22 | break 23 | return etudes 24 | 25 | def get_ministres(link): 26 | browser = webdriver.Chrome() 27 | try: 28 | browser.get(link) 29 | time.sleep(0.2) 30 | except: 31 | time.sleep(1) 32 | text = browser.find_element_by_xpath('//div[@id="mw-content-text"]').get_attribute('innerHTML').strip() 33 | bs = BeautifulSoup(text, 'lxml') 34 | ministres_l = [[(k.text.replace('\u200d ','') , k.find('a')['href']) for k in e.find_parents()[1].find_all('td') if k.text !='' and k.find('a')] 35 | for e in bs.find_all('a', class_="image")] 36 | ministres = [{'Poste':k[0][0], 'Nom':k[::-1][1][0], 'Lien':'https://fr.wikipedia.org' + k[::-1][1][1]} for k in ministres_l if len(k) > 1][:-1] 37 | for m in ministres: 38 | m['Formation'] = get_formation(m['Lien']) 39 | browser.quit() 40 | return ministres 41 | 42 | def get_formation(link_bio): 43 | browser = webdriver.Chrome() 44 | browser.set_page_load_timeout(15) 45 | try: 46 | browser.get(link_bio) 47 | except: 48 | time.sleep(1.5) 49 | bs = BeautifulSoup(browser.find_element_by_xpath('//div[@id="mw-content-text"]').get_attribute('innerHTML').strip(), 'lxml') 50 | try: 51 | h3_elem = [e.find('span').text for e in bs.find_all('h3')] 52 | h3_elem_formation = [k for k in h3_elem if 'formation' in k.lower() or 'études' in k.lower() or 'etudes' in k.lower() or 'parcours' in k.lower() 53 | or "cursus" in k.lower()][0] 54 | f, s = h3_elem[h3_elem.index(h3_elem_formation):h3_elem.index(h3_elem_formation)+2] 55 | except: 56 | h2_elem = [e.find('span').text for e in bs.find_all('h2')[1:]] 57 | h2_elem_biographie = [k for k in h2_elem if 'biographie' in k.lower() or 'carrière' in k.lower() or 'formation' in k.lower() or 'parcours' in k.lower() 58 | or "cursus" in k.lower()][0] 59 | f, s = h2_elem[h2_elem.index(h2_elem_biographie):h2_elem.index(h2_elem_biographie)+2] 60 | try: 61 | first, second = [m.start() for m in re.finditer(f, bs.text)][1], [m.start() for m in re.finditer(s, bs.text)][1] 62 | browser.quit() 63 | s = bs.text[first:second].replace('\n', '') 64 | return convert_formation(re.sub(r'\[*\d*\]', '', s)) 65 | except: 66 | print('Error for %s' % link_bio) 67 | 68 | def get_previous_government_link(browser, link): 69 | try: 70 | browser.get(link) 71 | time.sleep(0.5) 72 | except: 73 | time.sleep(1.5) 74 | box = browser.find_element_by_xpath('//table[@class="infobox_v2"]') 75 | browser.execute_script("return arguments[0].scrollIntoView();", box) 76 | bs = BeautifulSoup(box.get_attribute('innerHTML').strip(), 'lxml') 77 | d = dict(set([([k.replace('|','').replace('\n','') for k in 
e.text.replace('\n\n','|').split('||') if k != ''][0], 78 | "https://fr.wikipedia.org" + e.find('a')['href']) 79 | for e in bs.find_all('tr') if e.find('img', {'alt':"Précédent"})])) 80 | duration = [e.find('td').text.encode().decode('ascii', 'ignore') for e in bs.find_all('tr') if e.find('th', text='Durée')][0] 81 | return d, duration 82 | 83 | 84 | def convert_duration(dur): 85 | splits = [''.join([s for s in elem if s.isdigit()]) for elem in dur.split('an')] 86 | if len(splits) == 2: 87 | return 365 * int(splits[0]) + int(splits[1]) 88 | else: 89 | return int(splits[0]) -------------------------------------------------------------------------------- /ObjectDetection/color_extract.py: -------------------------------------------------------------------------------- 1 | import cv2 2 | import numpy as np 3 | import sys 4 | import os 5 | import imp 6 | import glob 7 | from sklearn.cluster import KMeans 8 | import skimage.filters as skf 9 | import skimage.color as skc 10 | import skimage.morphology as skm 11 | from skimage.measure import label 12 | 13 | class Back(): 14 | """ 15 | Two algorithms are used together to separate background and foreground. 16 | One consider as background all pixel whose color is close to the pixels 17 | in the corners. This part is impacted by the `max_distance' and 18 | `use_lab' settings. 19 | The second one computes the edges of the image and uses a flood fill 20 | starting from all corners. 21 | """ 22 | def __init__(self, max_distance=5, use_lab=True): 23 | """ 24 | The possible settings are: 25 | - max_distance: The maximum distance for two colors to be 26 | considered closed. A higher value will yield to a more aggressive 27 | background removal. 28 | (default: 5) 29 | - use_lab: Whether to use the LAB color space to perform 30 | background removal. More expensive but closer to eye perception. 31 | (default: True) 32 | """ 33 | self._settings= { 34 | 'max_distance': max_distance, 35 | 'use_lab': use_lab, 36 | } 37 | 38 | def get(self, img): 39 | f = self._floodfill(img) 40 | g = self._global(img) 41 | m = f | g 42 | 43 | if np.count_nonzero(m) < 0.90 * m.size: 44 | return m 45 | 46 | ng, nf = np.count_nonzero(g), np.count_nonzero(f) 47 | 48 | if ng < 0.90 * g.size and nf < 0.90 * f.size: 49 | return g if ng > nf else f 50 | if ng < 0.90 * g.size: 51 | return g 52 | if nf < 0.90 * f.size: 53 | return f 54 | 55 | return np.zeros_like(m) 56 | 57 | def _global(self, img): 58 | h, w = img.shape[:2] 59 | mask = np.zeros((h, w), dtype=np.bool) 60 | max_distance = self._settings['max_distance'] 61 | 62 | if self._settings['use_lab']: 63 | img = skc.rgb2lab(img) 64 | 65 | # Compute euclidean distance of each corner against all other pixels. 66 | corners = [(0, 0), (-1, 0), (0, -1), (-1, -1)] 67 | for color in (img[i, j] for i, j in corners): 68 | norm = np.sqrt(np.sum(np.square(img - color), 2)) 69 | # Add to the mask pixels close to one of the corners. 70 | mask |= norm < max_distance 71 | 72 | return mask 73 | 74 | def _floodfill(self, img): 75 | back = self._scharr(img) 76 | # Binary thresholding. 77 | back = back > 0.05 78 | 79 | # Thin all edges to be 1-pixel wide. 80 | 81 | back = skm.skeletonize(back) 82 | # Edges are not detected on the borders, make artificial ones. 83 | back[0, :] = back[-1, :] = True 84 | back[:, 0] = back[:, -1] = True 85 | 86 | # Label adjacent pixels of the same color. 87 | labels = label(back, background=-1, connectivity=1) 88 | 89 | # Count as background all pixels labeled like one of the corners. 
90 | corners = [(1, 1), (-2, 1), (1, -2), (-2, -2)] 91 | for l in (labels[i, j] for i, j in corners): 92 | back[labels == l] = True 93 | 94 | # Remove remaining inner edges. 95 | return skm.opening(back) 96 | 97 | def _scharr(self, img): 98 | # Invert the image to ease edge detection. 99 | img = 1. - img 100 | grey = skc.rgb2grey(img) 101 | return skf.scharr(grey) 102 | 103 | class ColorExtractor(): 104 | def __init__(self): 105 | self._back = Back(max_distance=10, n_clusters=3) 106 | 107 | def get_bar_hist(self, filepath, full_return=False): 108 | image = cv2.imread(filepath) 109 | image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) 110 | back_mask = self._back.get(image) 111 | image_noback = np.zeros(image.shape) 112 | image_noback[~back_mask] = image[~back_mask] /255 113 | image_array = image[~back_mask] 114 | clt = KMeans(n_clusters = n_clusters) 115 | clt.fit(image_array) 116 | hist = self.centroid_histogram(clt) 117 | if full_return: 118 | bar = self.plot_colors(hist, clt.cluster_centers_) 119 | return image, image_noback, bar, hist 120 | else: 121 | return image, hist 122 | 123 | 124 | def centroid_histogram(self, clt): 125 | # grab the number of different clusters and create a histogram 126 | # based on the number of pixels assigned to each cluster 127 | numLabels = np.arange(0, len(np.unique(clt.labels_)) + 1) 128 | (hist, _) = np.histogram(clt.labels_, bins = numLabels) 129 | 130 | # normalize the histogram, such that it sums to one 131 | hist /= hist.astype("float").sum() 132 | 133 | # return the histogram 134 | return hist 135 | 136 | def plot_colors(self, hist, centroids): 137 | # initialize the bar chart representing the relative frequency 138 | # of each of the colors 139 | bar = np.zeros((50, 300, 3), dtype = "uint8") 140 | startX = 0 141 | 142 | zipped = zip(hist, centroids) 143 | zipped = sorted(zipped, reverse=True, key=lambda x : x[0]) 144 | 145 | # loop over the percentage of each cluster and the color of 146 | # each cluster 147 | for (percent, color) in zipped: 148 | # plot the relative percentage of each cluster 149 | endX = startX + (percent * 300) 150 | cv2.rectangle(bar, (int(startX), 0), (int(endX), 50), 151 | color.astype("uint8").tolist(), -1) 152 | startX = endX 153 | 154 | # return the bar chart 155 | return bar -------------------------------------------------------------------------------- /ObjectDetection/recognition.py: -------------------------------------------------------------------------------- 1 | import cv2 2 | from tensorflow.python.client import device_lib 3 | from darkflow.net.build import TFNet 4 | import numpy as np 5 | import glob 6 | import os 7 | 8 | def get_available_gpus(): 9 | """ 10 | Return the number of GPUs availableipytho 11 | """ 12 | local_device_protos = device_lib.list_local_devices() 13 | return [x.name for x in local_device_protos if x.device_type == 'GPU'] 14 | 15 | def define_options(config_path): 16 | """ 17 | Define the network configuration. Load the model and the weights. 18 | Threshold hard coded. 
19 | :return: option for tfnet object 20 | """ 21 | options = {} 22 | if len(get_available_gpus()) > 0 : 23 | options["gpu"] = 1 24 | else: 25 | options["gpu"] = 0 26 | 27 | model = config_path + 'cfg/yolo.cfg' 28 | weights = config_path + 'yolo.weights' 29 | if os.path.exists(model): 30 | options['model'] = model 31 | else: 32 | print('No cfg model') 33 | if os.path.exists(weights): 34 | options['load'] = weights 35 | else: 36 | print('No yolo weights, wget https://pjreddie.com/media/files/yolo.weights') 37 | options['config'] = config_path + 'cfg/' 38 | options["threshold"] = 0.1 39 | return options 40 | 41 | def non_max_suppression_fast(boxes, probs, overlap_thresh=0.1, max_boxes=30): 42 | """ 43 | Eliminating redundant object detection windows with a faster non maximum suppression method 44 | Greedily select high-scoring detections and skip detections that are significantly covered by 45 | a previously selected detection. 46 | :param boxes: list of boxes 47 | :param probs: list of probabilities relatives to the boxes 48 | """ 49 | 50 | # if there are no boxes, return an empty list 51 | if len(boxes) == 0: 52 | return [] 53 | 54 | # grab the coordinates of the bounding boxes 55 | x1 = boxes[:, 0] 56 | y1 = boxes[:, 1] 57 | x2 = boxes[:, 2] 58 | y2 = boxes[:, 3] 59 | 60 | np.testing.assert_array_less(x1, x2) 61 | np.testing.assert_array_less(y1, y2) 62 | 63 | # if the bounding boxes integers, convert them to floats -- 64 | # this is important since we'll be doing a bunch of divisions 65 | if boxes.dtype.kind == "i": 66 | boxes = boxes.astype("float") 67 | 68 | # initialize the list of picked indexes 69 | pick = [] 70 | 71 | # calculate the areas 72 | area = (x2 - x1 + 1) * (y2 - y1 + 1) 73 | 74 | # sort the bounding boxes 75 | idxs = np.argsort(probs) 76 | 77 | # keep looping while some indexes still remain in the indexes 78 | # list 79 | while len(idxs) > 0: 80 | # grab the last index in the indexes list and add the 81 | # index value to the list of picked indexes 82 | last = len(idxs) - 1 83 | i = idxs[last] 84 | pick.append(i) 85 | 86 | # find the intersection 87 | xx1_int = np.maximum(x1[i], x1[idxs[:last]]) 88 | yy1_int = np.maximum(y1[i], y1[idxs[:last]]) 89 | xx2_int = np.minimum(x2[i], x2[idxs[:last]]) 90 | yy2_int = np.minimum(y2[i], y2[idxs[:last]]) 91 | ww_int = np.maximum(0, xx2_int - xx1_int + 0.5) 92 | hh_int = np.maximum(0, yy2_int - yy1_int + 0.5) 93 | area_int = ww_int * hh_int 94 | 95 | # find the union 96 | area_union = area[i] + area[idxs[:last]] - area_int 97 | 98 | # compute the ratio of overlap 99 | overlap = area_int / (area_union + 1e-6) 100 | 101 | # delete all indexes from the index list that have 102 | idxs = np.delete(idxs, np.concatenate(([last], 103 | np.where(overlap > overlap_thresh)[0]))) 104 | 105 | if len(pick) >= max_boxes: 106 | break 107 | 108 | # return only the bounding boxes that were picked using the integer data type 109 | boxes = boxes[pick].astype("int") 110 | probs = probs[pick] 111 | return boxes, probs 112 | 113 | def predict_one(image, tfnet): 114 | """ 115 | Object detection in a picture 116 | :param image: image numpy array 117 | :param tfnet: net object 118 | :return: picture with object bounding boxes, predictions, confidence scores 119 | """ 120 | font = cv2.FONT_HERSHEY_SIMPLEX 121 | # prediction 122 | result = tfnet.return_predict(image) 123 | # Separate each label to do non max suppresion 124 | for label in set([k['label'] for k in result]): 125 | boxes = np.array([[k['topleft']['x'], k['topleft']['y'], k['bottomright']['x'], 
k['bottomright']['y']] for k in result 126 | if k['label'] == label]) 127 | probs = np.array([k['confidence'] for k in result if k['label'] == label]) 128 | # eliminate redundant overlapping detections 129 | b, p = non_max_suppression_fast(boxes, probs) 130 | for k in b: 131 | # skip boxes that are too large 132 | if np.abs(np.mean(np.array((k[0],k[1]))-np.array((k[2], k[3])))) < np.min(image.shape[:2])/5: 133 | cv2.rectangle(image, (k[0],k[1]), (k[2], k[3]), (255,0,0), 2) 134 | cv2.putText(image, label, (k[0],k[1]), font, 0.3, (255, 0, 0)) 135 | return image, b, p 136 | 137 | def predict(folder, nb_items=None, config_path='../darkflow/'): 138 | """ 139 | Multiple predictions 140 | :param folder: folder with pictures 141 | :param nb_items: number of pictures to predict 142 | :param config_path: weights and model files path 143 | :return: list of pictures with object bounding boxes, predictions, confidence scores 144 | """ 145 | images = [cv2.imread(f) for f in glob.glob(folder+'*')][:nb_items] 146 | results = [] 147 | tfnet = TFNet(define_options(config_path)) 148 | for image in images: 149 | results.append(predict_one(image, tfnet)) 150 | return results 151 | 152 | -------------------------------------------------------------------------------- /ProjectMovieRating/README.md: -------------------------------------------------------------------------------- 1 | # Predict IMDB movie rating 2 | 3 | Project inspired by Chuan Sun's [work](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset), but without Scrapy 4 | 5 | Main question: how can we tell the greatness of a movie before it is released in cinemas? 6 | 7 | ## First Part - Parsing data 8 | 9 | The first part aims to parse data from the IMDb and The Numbers websites: casting information, directors, production companies, awards, genres, budget, gross, description, imdb_rating, etc. 10 | To create the movie_contents.json file: 11 | ``python3 parser.py nb_elements`` 12 | 13 | ## Second Part - Data Analysis 14 | 15 | The second part is to analyze the dataframe and observe correlations between variables. For example, are the movie awards correlated with the worldwide gross? Is a movie liked more when its cast is liked more? 16 | See the Jupyter notebook file. 17 | 18 | ![Correlation Matrix](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/corr_matrix.png) 19 | 20 | As we can see in the picture above, the IMDB score is correlated with the number of awards and the gross, but not really with the production budget or the number of Facebook likes of the cast. 21 | Obviously, domestic and worldwide gross are highly correlated. However, the bigger the production budget, the bigger the gross. 22 | As shown in the notebook, the budget is not really correlated with the number of awards. 23 | What's funny is that the popularity of the third most famous actor matters more for the IMDB score than the popularity of the most famous actor (correlation 0.2 vs 0.08). 24 | (Many other charts in the Jupyter notebook) 25 | 26 | ## Third Part - Predict the IMDB score 27 | 28 | Machine Learning to predict the IMDB score with the meaningful variables. 29 | Using a Random Forest algorithm (500 estimators); a minimal training sketch is shown below.
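A minimal training sketch (not the exact notebook code): it assumes the dataframe returned by `create_dataframe` in `parser.py`, and the feature list is only illustrative; the notebook keeps a larger set of meaningful variables.

```python
# Minimal sketch: fit a Random Forest on the parsed movie dataframe.
# `df` is assumed to come from parser.create_dataframe(); the feature list is illustrative.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

features = ["production_budget", "worldwide_gross", "total_cast_fb_likes", "duration_sec"]
X = df[features].fillna(0)
y = df["idmb_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=500, random_state=42)
model.fit(X_train, y_train)
print("R^2 on held-out movies:", model.score(X_test, y_test))

# Feature importances feed the chart below.
for name, importance in sorted(zip(features, model.feature_importances_), key=lambda t: -t[1]):
    print(name, round(importance, 3))
```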
30 | ![Most important features](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/features.png) 31 | 32 | -------------------------------------------------------------------------------- /ProjectMovieRating/__pycache__/imdb_movie_content.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/ProjectMovieRating/__pycache__/imdb_movie_content.cpython-35.pyc -------------------------------------------------------------------------------- /ProjectMovieRating/__pycache__/parser.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/ProjectMovieRating/__pycache__/parser.cpython-35.pyc -------------------------------------------------------------------------------- /ProjectMovieRating/genre.json: -------------------------------------------------------------------------------- 1 | {"Action": "0", 2 | "Adventure": "17", 3 | "Animation": "14", 4 | "Biography": "20", 5 | "Comedy": "21", 6 | "Crime": "10", 7 | "Documentary": "3", 8 | "Drama": "18", 9 | "Family": "19", 10 | "Fantasy": "6", 11 | "History": "8", 12 | "Horror": "4", 13 | "Music": "12", 14 | "Musical": "2", 15 | "Mystery": "16", 16 | "Romance": "9", 17 | "Sci-Fi": "13", 18 | "Short": "5", 19 | "Sport": "11", 20 | "Thriller": "7", 21 | "War": "15", 22 | "Western": "1"} -------------------------------------------------------------------------------- /ProjectMovieRating/imdb_movie_content.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | from bs4 import BeautifulSoup 4 | import time 5 | import random 6 | import re 7 | import datetime 8 | 9 | def parse_facebook_likes_number(num_likes_string): 10 | if num_likes_string[-1] == 'K': 11 | thousand = 1000 12 | else: 13 | thousand = 1 14 | return float(num_likes_string.replace('K','')) * thousand 15 | 16 | def parse_duration(time_string): 17 | time_string = time_string.split('min')[0] 18 | x = time.strptime(time_string,'%Hh%M') 19 | return datetime.timedelta(hours=x.tm_hour,minutes=x.tm_min,seconds=x.tm_sec).total_seconds() 20 | 21 | class ImdbMovieContent(): 22 | def __init__(self, movies): 23 | self.movies = movies 24 | self.base_url = "http://www.imdb.com" 25 | 26 | def get_facebook_likes(self, entity_id): 27 | if entity_id.startswith('nm'): 28 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Fname%2F{}%2F&colorscheme=light".format(entity_id) 29 | elif entity_id.startswith('tt'): 30 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Ftitle%2F{}%2F&colorscheme=light".format(entity_id) 31 | else: 32 | url = None 33 | time.sleep(random.uniform(0, 0.25)) # randomly snooze a time within [0, 0.25] second 34 | try: 35 | response = requests.get(url) 36 | soup = BeautifulSoup(response.text, "lxml") 37 | sentence = soup.find_all(id="u_0_2")[0].span.string # get sentence like: "43K people like this" 38 | num_likes = sentence.split(" ")[0] 39 | except Exception as e: 40 | return None # widget parsing failed, no like count available 41 | return parse_facebook_likes_number(num_likes) 42 | 43 | def get_id_from_url(self, url): 44 | if url is None: 45 | return None 46 | return url.split('/')[4] 47 | 48 | def parse(self): 49 |
movies_content = [] 50 | for movie in self.movies: 51 | imdb_url = movie['imdb_url'] 52 | response = requests.get(imdb_url) 53 | bs = BeautifulSoup(response.text, 'lxml') 54 | movies_content.append(self.get_content(bs)) 55 | return movies_content 56 | 57 | def get_awards(self, movie_link): 58 | awards_url = movie_link + 'awards' 59 | response = requests.get(awards_url) 60 | bs = BeautifulSoup(response.text, 'lxml') 61 | awards = bs.find_all('tr') 62 | award_dict = [] 63 | for award in awards: 64 | if 'rowspan' in award.find('td').attrs: 65 | rowspan = int(award.find('td')['rowspan']) 66 | if rowspan == 1: 67 | award_dict.append({'category' : award.find('span', class_='award_category').text, 68 | 'type' : award.find('b').text, 69 | 'award': award.find('td', class_='award_description').text.split('\n')[1].replace(' ', '')}) 70 | else: 71 | index = awards.index(award) 72 | dictt = {'category':award.find('span', class_='award_category').text, 73 | 'type' : award.find('b').text} 74 | awards_ = [] 75 | for elem in awards[index:index+rowspan]: 76 | award_dict.append({'category':award.find('span', class_='award_category').text, 77 | 'type' : award.find('b').text, 78 | 'award': elem.find('td', class_='award_description').text.split('\n')[1].replace(' ', '')}) 79 | award_nominated = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Nominated'] 80 | award_won = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Won'] 81 | return award_nominated, award_won 82 | 83 | def get_content(self, bs): 84 | movie = {} 85 | try: 86 | title_year = bs.find('div', class_ = 'title_wrapper').find('h1').text.encode('latin').decode('utf-8', 'ignore').replace(') ','').split('(') 87 | movie['movie_title'] = title_year[0] 88 | except: 89 | movie['movie_title'] = None 90 | try: 91 | movie['title_year'] = title_year[1] 92 | except: 93 | movie['title_year'] = None 94 | # try: 95 | # movie['genres'] = '|'.join([genre.text.replace(' ','') for genre in bs.find('div', {'itemprop': 'genre'}).find_all('a')]) 96 | # except: 97 | # movie['genres'] = None 98 | try: 99 | title_details = bs.find('div', {'id':'titleDetails'}) 100 | movie['language'] = '|'.join([language.text for language in title_details.find_all('a', href=re.compile('language'))]) 101 | movie['country'] = '|'.join([country.text for country in title_details.find_all('a', href=re.compile('country'))]) 102 | except: 103 | movie['language'] = None 104 | try: 105 | keywords = bs.find_all('span', {'itemprop':'keywords'}) 106 | movie['keywords'] = '|'.join([key.text for key in keywords]) 107 | except: 108 | movie['keywords'] = None 109 | try: 110 | movie['storyline'] = bs.find('div', {'id':'titleStoryLine'}).find('div', {'itemprop':'description'}).text.replace('\n', '') 111 | except: 112 | movie['storyline'] = None 113 | try: 114 | movie['contentRating'] = bs.find('span', {'itemprop':'contentRating'}).text.split(' ')[1] 115 | except: 116 | movie['contentRating'] = None 117 | try: 118 | movie['color'] = bs.find('a', href=re.compile('colors')).text 119 | except: 120 | movie['color'] = None 121 | try: 122 | movie['idmb_score'] = bs.find('span', {'itemprop':'ratingValue'}).text 123 | except: 124 | movie['idmb_score'] = None 125 | try: 126 | movie['num_voted_users'] = bs.find('span', {'itemprop':'ratingCount'}).text.replace(',',) 127 | except: 128 | movie['num_voted_users'] = None 129 | try: 130 | movie['duration_sec'] = parse_duration(bs.find('time', {'itemprop':'duration'}).text.replace(' 
','').replace('\n', '')) 131 | except: 132 | movie['duration_sec'] = None 133 | try: 134 | review_counts = [int(count.text.split(' ')[0].replace(',', '')) for count in bs.find_all('span', {'itemprop':'reviewCount'})] 135 | movie['num_user_for_reviews'] = review_counts[0] 136 | except: 137 | movie['num_user_for_reviews'] = None 138 | try: 139 | prod_co = bs.find_all('span', {'itemtype': re.compile('Organization')}) 140 | movie['production_co'] = [elem.text.replace('\n','') for elem in prod_co] 141 | except: 142 | movie['production_co'] = None 143 | try: 144 | movie['num_critic_for_reviews'] = review_counts[1] 145 | except: 146 | movie['num_critic_for_reviews'] = None 147 | try: 148 | actors = bs.find('table', {'class':'cast_list'}).find_all('a', {'itemprop':'url'}) 149 | pairs_for_rows = [(self.base_url + actor['href'], actor.text.replace('\n', '')[1:]) for actor in actors] 150 | movie['cast_info'] = [{'actor_name' : actor[1], 151 | 'actor_link' : actor[0], 152 | 'actor_fb_likes': self.get_facebook_likes(self.get_id_from_url(actor[0]))} for actor in pairs_for_rows] 153 | except: 154 | movie['cast_info'] = None 155 | try: 156 | director = bs.find('span', {'itemprop':'director'}).find('a') 157 | director = (self.base_url + director['href'].replace('?','/?'), director.text) 158 | movie['director_info'] = {'director_name':director[1], 159 | 'director_link':director[0], 160 | 'director_fb_links': self.get_facebook_likes(self.get_id_from_url(director[0]))} 161 | except: 162 | movie['director_info'] = None 163 | try: 164 | movie['movie_imdb_link'] = bs.find('link')['href'] 165 | movie['num_facebook_like'] = self.get_facebook_likes(self.get_id_from_url(movie['movie_imdb_link'])) 166 | except: 167 | movie['num_facebook_like'] = None 168 | try: 169 | poster_image_url = bs.find('div', {'class':'poster'}).find('img')['src'] 170 | movie['image_urls'] = poster_image_url.split("_V1_")[0] + "_V1_.jpg" 171 | except: 172 | movie['image_urls'] = None 173 | try: 174 | award_nominated, award_won = self.get_awards(movie['movie_imdb_link']) 175 | movie['awards'] = {"won": award_won, 'nominated': award_nominated} 176 | except Exception as e: 177 | print(e) 178 | movie['awards'] = None 179 | return movie 180 | 181 | 182 | -------------------------------------------------------------------------------- /ProjectMovieRating/parser.py: -------------------------------------------------------------------------------- 1 | from bs4 import BeautifulSoup 2 | import json 3 | import requests 4 | import urllib 5 | from tqdm import tqdm 6 | import locale 7 | import pandas as pd 8 | import re 9 | import time 10 | import random 11 | import sys 12 | 13 | from imdb_movie_content import ImdbMovieContent 14 | 15 | def parse_price(price): 16 | """ 17 | Convert string price to numbers 18 | """ 19 | if not price: 20 | return 0 21 | price = price.replace(',', '') 22 | return locale.atoi(re.sub('[^0-9,]', "", price)) 23 | 24 | def get_movie_budget(): 25 | """ 26 | Parsing the numbers website to get the budget data. 
27 | :return: list of dictionnaries with budget and gross 28 | """ 29 | movie_budget_url = 'http://www.the-numbers.com/movie/budgets/all' 30 | response = requests.get(movie_budget_url) 31 | bs = BeautifulSoup(response.text, 'lxml') 32 | table = bs.find('table') 33 | rows = [elem for elem in table.find_all('tr') if elem.get_text() != '\n'] 34 | 35 | movie_budget = [] 36 | for row in rows[1:]: 37 | specs = [elem.get_text() for elem in row.find_all('td')] 38 | movie_name = specs[2].encode('latin1').decode('utf8', 'ignore') 39 | movie_budget.append({'release_date': specs[1], 40 | 'movie_name': movie_name, 41 | 'production_budget': parse_price(specs[3]), 42 | 'domestic_gross': parse_price(specs[4]), 43 | 'worldwide_gross': parse_price(specs[5])}) 44 | 45 | return movie_budget 46 | 47 | def get_imdb_urls(movie_budget, nb_elements=None): 48 | """ 49 | Parsing imdb website to get imdb movies links. 50 | Dumping a json file with budget, gross and imdb urls 51 | :param movie_budget: list of dictionnaries with budget and gross 52 | :param nb_elements: number of movies to parse 53 | """ 54 | for movie in tqdm(movie_budget[1000:1000+nb_elements]): 55 | movie_name = movie['movie_name'] 56 | title_url = urllib.parse.quote(movie_name.encode('utf-8')) 57 | imdb_search_link = "http://www.imdb.com/find?ref_=nv_sr_fn&q={}&s=tt".format(title_url) 58 | response = requests.get(imdb_search_link) 59 | bs = BeautifulSoup(response.text, 'lxml') 60 | results = bs.find("table", class_= "findList" ) 61 | try: 62 | movie['imdb_url'] = "http://www.imdb.com" + results.find('td', class_='result_text').find('a')['href'] 63 | except: 64 | movie['imdb_url'] = None 65 | 66 | with open('movie_budget.json', 'w') as fp: 67 | json.dump(movie_budget, fp) 68 | 69 | def get_imdb_content(movie_budget_path, nb_elements=None): 70 | """ 71 | Parsing imdb website to get imdb content : awards, casting, description, etc. 72 | Dumping a json file with imdb content 73 | :param movie_budget_path: path of the movie_budget.json file 74 | :param nb_elements: number of movies to parse 75 | """ 76 | with open(movie_budget_path, 'r') as fp: 77 | movies = json.load(fp) 78 | content_provider = ImdbMovieContent(movies) 79 | contents = [] 80 | threshold = 1300 81 | for i, movie in enumerate(movies[threshold:threshold+nb_elements]): 82 | time.sleep(random.uniform(0, 0.25)) 83 | print("\r%i / %i" % (i, len(movies[threshold:threshold+nb_elements])), end="") 84 | try: 85 | imdb_url = movie['imdb_url'] 86 | response = requests.get(imdb_url) 87 | bs = BeautifulSoup(response.text, 'lxml') 88 | movies_content = content_provider.get_content(bs) 89 | contents.append(movies_content) 90 | except Exception as e: 91 | print(e) 92 | pass 93 | with open('movie_contents7.json', 'w') as fp: 94 | json.dump(contents, fp) 95 | 96 | def parse_awards(movie): 97 | """ 98 | Convert awards information to a dictionnary for dataframe. 99 | Keeping only Oscar, BAFTA, Golden Globe and Palme d'Or awards. 
100 | :param movie: movie dictionnary 101 | :return: well-formated dictionnary with awards information 102 | """ 103 | awards_kept = ['Oscar', 'BAFTA Film Award', 'Golden Globe', 'Palme d\'Or'] 104 | awards_category = ['won', 'nominated'] 105 | parsed_awards = {} 106 | for category in awards_category: 107 | for awards_type in awards_kept: 108 | awards_cat = [award for award in movie['awards'][category] if award['category'] == awards_type] 109 | for k, award in enumerate(awards_cat): 110 | parsed_awards['{}_{}_{}'.format(awards_type, category, k+1)] = award["award"] 111 | return parsed_awards 112 | 113 | def parse_actors(movie): 114 | """ 115 | Convert casting information to a dictionnary for dataframe. 116 | Keeping only 3 actors with most facebook likes. 117 | :param movie: movie dictionnary 118 | :return: well-formated dictionnary with casting information 119 | """ 120 | sorted_actors = sorted(movie['cast_info'], key=lambda x:x['actor_fb_likes'], reverse=True) 121 | top_k = 3 122 | parsed_actors = {} 123 | parsed_actors['total_cast_fb_likes'] = sum([actor['actor_fb_likes'] for actor in movie['cast_info']]) + movie['director_info']['director_fb_links'] 124 | for k, actor in enumerate(sorted_actors[:top_k]): 125 | if k < len(sorted_actors): 126 | parsed_actors['actor_{}_name'.format(k+1)] = actor['actor_name'] 127 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = actor['actor_fb_likes'] 128 | else: 129 | parsed_actors['actor_{}_name'.format(k+1)] = None 130 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = None 131 | return parsed_actors 132 | 133 | def parse_production_company(movie): 134 | """ 135 | Convert production companies to a dictionnary for dataframe. 136 | Keeping only 3 production companies. 137 | :param movie: movie dictionnary 138 | :return: well-formated dictionnary with production companies 139 | """ 140 | parsed_production_co = {} 141 | top_k = 3 142 | production_companies = movie['production_co'][:top_k] 143 | for k, company in enumerate(production_companies): 144 | if k < len(movie['production_co']): 145 | parsed_production_co['production_co_{}'.format(k+1)] = company 146 | else: 147 | parsed_production_co['production_co_{}'.format(k+1)] = None 148 | return parsed_production_co 149 | 150 | def parse_genres(movie): 151 | """ 152 | Convert genres to a dictionnary for dataframe. 153 | :param movie: movie dictionnary 154 | :return: well-formated dictionnary with genres 155 | """ 156 | parse_genres = {} 157 | g = movie['genres'] 158 | with open('genre.json', 'r') as f: 159 | genres = json.load(f) 160 | for k, genre in enumerate(g): 161 | if genre in genres: 162 | parse_genres['genre_{}'.format(k+1)] = genres[genre] 163 | return parse_genres 164 | 165 | def create_dataframe(movies_content_path, movie_budget_path): 166 | """ 167 | Create dataframe from movie_budget.json and movie_content.json files. 
168 | :param movies_content_path: path of the movies_content.json file 169 | :param movie_budget_path: path of the movie_budget.json file 170 | :return: well formated dataframe 171 | """ 172 | with open(movies_content_path, 'r') as fp: 173 | movies = json.load(fp) 174 | with open(movie_budget_path, 'r') as fp: 175 | movies_budget = json.load(fp) 176 | movies_list = [] 177 | for movie in movies: 178 | content = {k:v for k,v in movie.items() if k not in ['awards', 'cast_info', 'director_info', 'production_co']} 179 | name = movie['movie_title'] 180 | try: 181 | budget = [film for film in movies_budget if film['movie_name']==name][0] 182 | budget = {k:v for k,v in budget.items() if k not in ['imdb_url', 'movie_name']} 183 | content.update(budget) 184 | except: 185 | pass 186 | try: 187 | content.update(parse_awards(movie)) 188 | except: 189 | pass 190 | try: 191 | content.update(parse_genres(movie)) 192 | except: 193 | pass 194 | try: 195 | content.update({k:v for k,v in movie['director_info'].items() if k!= 'director_link'}) 196 | except: 197 | pass 198 | try: 199 | content.update(parse_production_company(movie)) 200 | except: 201 | pass 202 | try: 203 | content.update(parse_actors(movie)) 204 | except: 205 | pass 206 | movies_list.append(content) 207 | df = pd.DataFrame(movies_list) 208 | df = df[pd.notnull(df.idmb_score)] 209 | df.idmb_score = df.idmb_score.apply(float) 210 | return df 211 | 212 | if __name__ == '__main__': 213 | if len(sys.argv) > 1: 214 | nb_elements = int(sys.argv[1]) 215 | else: 216 | nb_elements = None 217 | # movie_budget = get_movie_budget() 218 | # movies = get_imdb_urls(movie_budget, nb_elements=nb_elements) 219 | get_imdb_content("movie_budget.json", nb_elements=nb_elements) 220 | 221 | 222 | -------------------------------------------------------------------------------- /ProjectSoccer/README.md: -------------------------------------------------------------------------------- 1 | # Parsing and using Soccer data 2 | 3 | ## First Part - Parsing data 4 | 5 | The first part aims to parse data from multiple websites : games, teams, players, etc. 6 | To collect data about a team and dump a json file (must have created a ``./teams`` folder) : 7 | ``python3 dumper.py team_name`` 8 | 9 | ## Second Part - Data Analysis 10 | 11 | The second part is to analyze the dataset to understand what I can do with it. 12 | 13 | ![Correlation Matrix](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/psg_stats.png) 14 | 15 | ![PSG vs Saint-Etienne](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/psg_ste.png) -------------------------------------------------------------------------------- /ProjectSoccer/df.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/ProjectSoccer/df.pkl -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Online-Challenge 2 | Challenge submitted on HackerRank and Kaggle. 3 | 4 | Algorithm challenges are made on HackerRank using Python. 5 | 6 | Data Science and Machine Learning challenges are made on Kaggle using Python too. 
7 | 8 | ## [Kaggle Bike Sharing](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleBikeSharing) 9 | The goal of this challenge is to build a model that predicts the count of bikes shared, based exclusively on contextual features. The first part of this challenge was aimed at understanding, analysing and processing the dataset. I wanted to produce meaningful information with plots. The second part was to build a model and use a Machine Learning library in order to predict the count. 10 | 11 | -->[French Explanations PDF](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf) 12 | 13 | ## [Twitter Parsing](https://github.com/alexattia/Data-Science-Projects/tree/master/TwitterParsing) 14 | 15 | I've recently discovered the Chris Albon Machine Learning flash cards and I want to download those flash cards, but the official Twitter API only returns tweets up to 2 weeks old, so I had to find a way to bypass this limitation: using Selenium and PhantomJS. 16 | Purpose of this project: check every 2 hours whether he has posted new flash cards; if so, download them and send me a summary email. 17 | 18 | ## [Face Recognition](https://github.com/alexattia/Data-Science-Projects/tree/master/FaceRecognition) 19 | 20 | Modern face recognition with deep learning and the HOG algorithm. Using the dlib C++ library, I have a quick face recognition tool that needs only a few pictures (20 per person). 21 | 22 | ## [Playing with Soccer data](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleSoccer) 23 | 24 | As a soccer fan and a data enthusiast, I wanted to play with and analyze soccer data. 25 | I don't currently know the exact aim of this project, but I will parse data from diverse websites, for different teams and different players. 26 | 27 | ## [NYC Taxi Trips](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleTaxiTrip) 28 | 29 | Kaggle playground to predict the total ride duration of taxi trips in New York City. 30 | 31 | ## [Kaggle Understanding the Amazon from Space](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleAmazon) 32 | Use satellite data to track the human footprint in the Amazon rainforest. 33 | Deep Learning model (using Keras) to label satellite images. 34 | 35 | ## [Predicting IMDB movie rating](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleMovieRating) 36 | Project inspired by Chuan Sun's [work](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset) 37 | How can we tell the greatness of a movie? 38 | Scraping and Machine Learning -------------------------------------------------------------------------------- /TwitterParsing/README.md: -------------------------------------------------------------------------------- 1 | # Twitter Parsing 2 | 3 | ## Parsing Machine Learning Flashcards 4 | After spending some time on Twitter following Machine Learning, Deep Learning and Data Science "influencers", I have found some interesting stuff. 5 | I decided to download some of those tweets. For example, @chrisalbon writes (draws would be more relevant) interesting Machine Learning flash cards. 6 | I am using Selenium to parse and download those flash cards, and I created a daemon that checks for new flash cards and emails them to me every day (a minimal sketch of the scrolling logic is shown below; the full implementation is in `download_pics.py`).
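The sketch below is illustrative only; the PhantomJS path and the search query mirror `download_pics.py`, and the waiting strategy is simplified.

```python
# Illustrative sketch: scroll a Twitter search page until no new results load,
# then collect the tweet elements. The real logic lives in download_pics.py.
import time
import urllib.parse
from selenium import webdriver

driver = webdriver.PhantomJS("/usr/local/bin/phantomjs")
query = urllib.parse.quote("#machinelearningflashcards")
driver.get("https://twitter.com/search?f=tweets&vertical=default&q={}".format(query))

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # let the next batch of tweets load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height <= last_height:  # nothing new loaded, we reached the end
        break
    last_height = new_height

tweets = driver.find_elements_by_class_name("tweet")
print("%d tweets found" % len(tweets))
driver.quit()
```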
7 | 8 | `com.alexattia.machinelearning.downloadflashcards.plist` is the plist file to copy into /Library/LaunchDaemons/ 9 | `run.sh` is the script to launch every day 10 | `download_pics.py`is the Python script 11 | 12 | ## Sending one picture a day 13 | Recently, @chrisalbon published his [machine learning flashcards](machinelearningflashcards.com). There are more than 300 pictures. In order to learn those flashcards, I want to learn one a day. So, I am using a daemon to send me an email every day with an embedded flashcard. -------------------------------------------------------------------------------- /TwitterParsing/com.alexattia.machinelearning.downloadflashcards.plist: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Label 6 | com.alexattia.machinelearning.downloadflashcards 7 | ProgramArguments 8 | 9 | /Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/run.sh 10 | 11 | StandardOutPath 12 | /Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/output.txt 13 | StandardErrorPath 14 | /Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/errors.txt 15 | 17 | StartCalendarInterval 18 | 19 | Hour 20 | 16 21 | Minute 22 | 0 23 | 24 | 25 | -------------------------------------------------------------------------------- /TwitterParsing/config.py: -------------------------------------------------------------------------------- 1 | class Config: 2 | def __init__(self): 3 | self.my_email_address = 'XXX' 4 | self.password = 'XXX' 5 | self.dest = 'XXX' 6 | self.message_init = """ 7 | 8 | 9 | 10 | 11 | 12 |

13 | Hi Alexandre,
14 | {number_flash_cards} machine learning flashcards has been downloaded. 15 |

16 |
17 | Tweeted by Chris Albon 18 |
19 | 20 | 21 | 22 | 23 | 24 | 25 | 27 | 28 |
  26 |
29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | """ 37 | 38 | self.intro = """\ 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 238 | 239 | 240 | 241 | 242 | 243 | 254 | 255 | 256 | 342 | 343 |
244 | 245 | 246 | 251 | 252 |
247 | 248 | 249 | 250 |
253 |
257 | 258 | 259 | 339 | 340 |
260 | 261 | 262 | 289 | 319 | 320 | 321 | 322 | 323 | 331 | 332 | """ 333 | 334 | self.end = """ 335 | 336 | 337 |
263 | 284 | """ 285 | self.flashcard = """ 286 | 287 | 288 |
290 | 291 | 292 | 309 | 310 |
293 |

294 | {tweet_text} 295 |

296 | 297 | 298 | 301 | 306 | 307 |
299 | date 300 | 302 |
303 | {date} 304 |
305 |
308 |
311 | 312 | 313 | 316 | 317 |
314 | map 315 |
318 |
324 | 325 | 326 | 328 | 329 |
  327 |
330 |
338 |
341 |
344 | 345 | 346 | """ -------------------------------------------------------------------------------- /TwitterParsing/config_downl.py: -------------------------------------------------------------------------------- 1 | class Config: 2 | def __init__(self): 3 | self.my_email_address = 'XXX' 4 | self.password = 'XXX' 5 | self.dest = 'XXX' 6 | self.message_init = """ 7 | 8 | 9 | 10 | 11 | 12 |

13 | Hi Alexandre,
14 | Today the machine learning flashcard is about {flashcard_title}. 15 |

16 | 17 | 18 | 19 | 20 | 21 | 22 | 24 | 25 |
  23 |
26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | """ 34 | 35 | self.intro = """\ 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 235 | 236 | 237 | 238 | 239 | 240 | 251 | 252 | 253 | 306 | 307 |
241 | 242 | 243 | 248 | 249 |
244 | 245 | 246 | 247 |
250 |
254 | 255 | 256 | 303 | 304 |
257 | 258 | 259 | 286 | 295 | 296 | """ 297 | 298 | self.end = """ 299 | 300 | 301 |
260 | 281 | """ 282 | self.flashcard = """ 283 | 284 | 285 |
287 | 288 | 289 | 292 | 293 |
290 | 291 |
294 |
302 |
305 |
308 | 309 | 310 | """ -------------------------------------------------------------------------------- /TwitterParsing/download_pics.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from selenium import webdriver 3 | import urllib 4 | import time 5 | import sys 6 | import numpy as np 7 | import glob 8 | import os 9 | import datetime 10 | from bs4 import BeautifulSoup 11 | import smtplib 12 | from email.mime.text import MIMEText 13 | from email.header import Header 14 | from email.utils import formataddr 15 | import config 16 | 17 | def internet_on(): 18 | """ 19 | Check if the network connection is on 20 | :return: boolean 21 | """ 22 | try: 23 | urllib.request.urlopen('http://216.58.192.142', timeout=1) 24 | return True 25 | except urllib.error.URLError: 26 | return False 27 | 28 | def get_all_tweets_chrome(query): 29 | """ 30 | Get all tweets from the webpage using the Chrome driver 31 | :param query: Tweet search query 32 | :return: all tweets on page as selenium objects 33 | """ 34 | driver = webdriver.Chrome("/usr/local/bin/chromedriver") 35 | driver.set_page_load_timeout(20) 36 | driver.get("https://twitter.com/search?f=tweets&vertical=default&q={}".format(urllib.parse.quote(query))) 37 | 38 | length = [] 39 | try: 40 | while True: 41 | time.sleep(np.random.randint(50,100)*0.01) 42 | tweets_found = driver.find_elements_by_class_name('tweet') 43 | driver.execute_script("return arguments[0].scrollIntoView();", tweets_found[::-1][0]) 44 | 45 | # Stop the loop while no more found tweets 46 | length.append(len(tweets_found)) 47 | if len(tweets_found) > 200 and (len(length) - len(set(length))) > 2: 48 | print('%s tweets found at %s' % (len(tweets_found), datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'))) 49 | break 50 | except IndexError: 51 | driver.save_screenshot('/Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/screenshot_%s.png' % datetime.datetime.now().strftime('%d_%m_%Y_%H/%M')) 52 | time.sleep(np.random.randint(50,100)*0.01) 53 | tweets_found = driver.find_elements_by_class_name('tweet') 54 | print('%s tweets found at %s' % (len(tweets_found), datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'))) 55 | except Exception as e: 56 | print(e) 57 | time.sleep(np.random.randint(50,100)*0.01) 58 | tweets_found = driver.find_elements_by_class_name('tweet') 59 | driver.execute_script("return arguments[0].scrollIntoView();", tweets_found[::-1][0]) 60 | return driver 61 | 62 | def get_all_tweets_phantom(query): 63 | """ 64 | Get all tweets from the webpage using the PhantomJS driver 65 | :param query: Tweet search query 66 | :return: all tweets on page as selenium objects 67 | """ 68 | driver = webdriver.PhantomJS("/usr/local/bin/phantomjs") 69 | driver.get("https://twitter.com/search?f=tweets&vertical=default&q={}".format(urllib.parse.quote(query))) 70 | lastHeight = driver.execute_script("return document.body.scrollHeight") 71 | i = 0 72 | while True: 73 | driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 74 | time.sleep(np.random.randint(50,100)*0.01) 75 | newHeight = driver.execute_script("return document.body.scrollHeight") 76 | if newHeight <= lastHeight: 77 | break 78 | lastHeight = newHeight 79 | i += 1 80 | return driver 81 | 82 | def format_tweets(driver): 83 | """ 84 | Convert selenium objects into dictionnaries (images, date, text, username as keys) 85 | :param driver: Selenium driver (with tweets loaded) 86 | :return: well-formated tweets as a list of 
dictionnaries 87 | """ 88 | tweets_found = driver.find_elements_by_class_name('tweet') 89 | tweets = [] 90 | for tweet in tweets_found: 91 | tweet_dict = {} 92 | tweet = tweet.get_attribute('innerHTML') 93 | bs = BeautifulSoup(tweet.strip(), "lxml") 94 | tweet_dict['username'] = bs.find('span', class_='username').text 95 | timestamp = float(bs.find('span', class_='_timestamp')['data-time']) 96 | tweet_dict['date'] = datetime.datetime.fromtimestamp(timestamp) 97 | tweet_dict['tweet_link'] = 'https://twitter.com' + bs.find('a', class_='js-permalink')['href'] 98 | tweet_dict['text'] = bs.find('p', class_='tweet-text').text 99 | try: 100 | tweet_dict['images'] = [k['src'] for k in bs.find('div', class_="AdaptiveMedia-container").find_all('img')] 101 | except: 102 | tweet_dict['images'] = [] 103 | if len(tweet_dict['images']) > 0: 104 | tweet_dict['text'] = tweet_dict['text'][:tweet_dict['text'].index('pic.twitter')-1] 105 | tweets.append(tweet_dict) 106 | driver.close() 107 | return tweets 108 | 109 | def filter_tweets(tweets): 110 | """ 111 | Filter tweets not made by Chris Albon, without images and already downloaded 112 | :param tweets: list of dictionnaries 113 | :return: tweets as a list of dictionnaries 114 | """ 115 | # We keep only tweets by chrisalbon with pictures 116 | search_tweets = [tw for tw in tweets if tw['username'] == '@chrisalbon' and len(tw['images']) > 0] 117 | # He made multiple tweets on the same topic, we keep only the most recent tweets 118 | # We use the indexes of the reversed tweet list and dictionnaries to keep only key 119 | unique_search_index = sorted(list({t['text'].lower():i for i,t in list(enumerate(search_tweets))[::-1]}.values())) 120 | unique_search_tweets = [search_tweets[i] for i in unique_search_index] 121 | 122 | # Keep non-downloaded tweets 123 | most_recent_file = sorted([datetime.datetime.fromtimestamp(os.path.getmtime(path)) 124 | for path in glob.glob("./downloaded_pics/*.jpg")], reverse=True)[0] 125 | recent_seach_tweets = [tw for tw in unique_search_tweets if tw['date'] > most_recent_file] 126 | 127 | # Uncomment for testing new tweets 128 | # recent_seach_tweets = [tw for tw in unique_search_tweets if tw['date'] > datetime.datetime(2017, 7, 6, 13, 41, 48)] 129 | return recent_seach_tweets 130 | 131 | def download_pictures(recent_seach_tweets): 132 | """ 133 | Download pictures from tweets 134 | :param recent_seach_tweets: list of dictionnaries 135 | """ 136 | # Downloading pictures 137 | print('%s - Downloading %d tweets' % (datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'), len(recent_seach_tweets))) 138 | for tw in recent_seach_tweets: 139 | img_url = tw['images'][0] 140 | filename = tw['text'][:tw['text'].index("#")-1].lower().replace(' ','_') 141 | filename = "./downloaded_pics/%s.jpg" % filename 142 | urllib.request.urlretrieve(img_url, filename) 143 | 144 | def send_email(recent_seach_tweets): 145 | """ 146 | Send a summary email of new tweets 147 | :param recent_seach_tweets: list of dictionnaries 148 | """ 149 | # Create e-mail message 150 | msg = C.intro + C.message_init.format(number_flash_cards=len(recent_seach_tweets)) 151 | # Add a special text for the first 3 new tweets 152 | for tw in recent_seach_tweets[:3]: 153 | date = tw['date'].strftime('Le %d/%m/%Y à %H:%M') 154 | link_tweet = tw['tweet_link'] 155 | link_picture = tw['images'][0] 156 | tweet_text = tw["text"][:tw["text"].index('#')-1] 157 | msg += C.flashcard.format(date=date, 158 | link_picture=link_picture, 159 | tweet_text=tweet_text, 160 | tweet_link=link_tweet) 161 
| 162 | # mapping for the subject 163 | numbers = { 0 : 'zero', 1 : 'one', 2 : 'two', 3 : 'three', 4 : 'four', 5 : 'five', 164 | 6 : 'six', 7 : 'seven', 8 : 'eight', 9 : 'nine', 10 : 'ten'} 165 | 166 | message = MIMEText(msg, 'html') 167 | message['From'] = formataddr((str(Header('Twitter Parser', 'utf-8')), C.my_email_address)) 168 | message['To'] = C.dest 169 | message['Subject'] = '%s new Machine Learning Flash Cards' % numbers[len(recent_seach_tweets)].title() 170 | msg_full = message.as_string() 171 | 172 | server = smtplib.SMTP('smtp.gmail.com:587') 173 | server.starttls() 174 | server.login(C.my_email_address, C.password) 175 | server.sendmail(C.my_email_address, C.dest, msg_full) 176 | server.quit() 177 | print('%s - E-mail sent!' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')) 178 | 179 | if __name__ == '__main__': 180 | # config file with mail content, email addresses 181 | C = config.Config() 182 | # Twitter search query 183 | query = '#machinelearningflashcards' 184 | # Loop until there is a network connection 185 | time_slept = 0 186 | while not internet_on(): 187 | time.sleep(1) 188 | time_slept += 1 189 | if time_slept > 15 : 190 | print('%s - No network connection' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')) 191 | sys.exit() 192 | # driver = get_all_tweets_chrome(query) 193 | driver = get_all_tweets_phantom(query) 194 | tweets = format_tweets(driver) 195 | recent_seach_tweets = filter_tweets(tweets) 196 | if len(recent_seach_tweets) > 0: 197 | download_pictures(recent_seach_tweets) 198 | send_email(recent_seach_tweets) 199 | else: 200 | print('%s - No picture to download' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')) 201 | 202 | 203 | -------------------------------------------------------------------------------- /TwitterParsing/pic.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/TwitterParsing/pic.png -------------------------------------------------------------------------------- /TwitterParsing/pic2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/TwitterParsing/pic2.png -------------------------------------------------------------------------------- /TwitterParsing/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # change directory 4 | APPDIR=/Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing 5 | export APPDIR 6 | cd $APPDIR 7 | 8 | # activate virtual env 9 | source /Users/alexandreattia/Desktop/Work/workenv/bin/activate 10 | 11 | # launch python script 12 | ./send_pictures.py -------------------------------------------------------------------------------- /TwitterParsing/send_pictures.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import numpy as np 3 | from email.mime.text import MIMEText 4 | from email.mime.image import MIMEImage 5 | from email.mime.multipart import MIMEMultipart 6 | from email.header import Header 7 | from email.utils import formataddr 8 | import smtplib 9 | import config_downl 10 | from download_pics import internet_on 11 | import datetime 12 | import sys 13 | 14 | def get_picture(): 15 | pic_folder = '/Users/alexandreattia/Desktop/Work/machine_learning_flashcards_v1.4/png_web/' 16 | pics 
= glob.glob(pic_folder + '*') 17 | pic = np.random.choice(pics) 18 | flashcard_title = pic.split('/')[7].replace('_web.png', '').replace('_', ' ') 19 | return pic, flashcard_title 20 | 21 | def send_email(flashcard_title, pic): 22 | """ 23 | Send an email of with the picture of the day 24 | """ 25 | # Create e-mail Multi Part 26 | msgRoot = MIMEMultipart('related') 27 | msgRoot['From'] = formataddr((str(Header('ML Flashcards', 'utf-8')), C.my_email_address)) 28 | msgRoot['To'] = C.dest 29 | msgRoot['Subject'] = 'Today ML Flashcard : %s' % flashcard_title.title() 30 | msgAlternative = MIMEMultipart('alternative') 31 | msgRoot.attach(msgAlternative) 32 | 33 | # Create e-mail message 34 | msg = C.intro + C.message_init.format(flashcard_title=flashcard_title.title()) 35 | msg += C.flashcard 36 | message = MIMEText(msg, 'html') 37 | msgAlternative.attach(message) 38 | 39 | # Add picture to e-mail 40 | fp = open(pic, 'rb') 41 | msgImage = MIMEImage(fp.read()) 42 | fp.close() 43 | 44 | # Define the image's ID as referenced above 45 | msgImage.add_header('Content-ID', '') 46 | msgRoot.attach(msgImage) 47 | 48 | msg_full = msgRoot.as_string() 49 | 50 | server = smtplib.SMTP('smtp.gmail.com:587') 51 | server.starttls() 52 | server.login(C.my_email_address, C.password) 53 | server.sendmail(C.my_email_address, C.dest, msg_full) 54 | server.quit() 55 | print('%s - E-mail sent!' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')) 56 | 57 | if __name__ == '__main__': 58 | # config file with mail content, email addresses 59 | C = config_downl.Config() 60 | # Loop until there is a network connection 61 | time_slept = 0 62 | while not internet_on(): 63 | time.sleep(1) 64 | time_slept += 1 65 | if time_slept > 15 : 66 | print('%s - No network connection' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')) 67 | sys.exit() 68 | p, fc = get_picture() 69 | send_email(fc, p) 70 | -------------------------------------------------------------------------------- /pics/corr_matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/corr_matrix.png -------------------------------------------------------------------------------- /pics/features.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/features.png -------------------------------------------------------------------------------- /pics/psg_stats.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/psg_stats.png -------------------------------------------------------------------------------- /pics/psg_ste.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/psg_ste.png --------------------------------------------------------------------------------