├── FaceRecognition
│   ├── README.md
│   ├── Training.ipynb
│   ├── detect_recognize.py
│   └── result.png
├── HackerRank_MaximumSubArray.py
├── HackerRank_PalindromeIndex.py
├── HackerRank_TwoStrings.py
├── KaggleAmazon
│   ├── README.md
│   ├── Understanding the Amazon from Space.ipynb
│   ├── predict_keras.py
│   ├── predict_tf.py
│   └── train_keras.py
├── KaggleBikeSharing
│   ├── Kaggle_BikeSharing.py
│   ├── Kaggle_BikeSharing_Explanations_French.pdf
│   └── README.md
├── KaggleInstaCart
│   ├── Data Exploration.ipynb
│   └── Predictions.ipynb
├── KaggleMovieRating
│   ├── Exploration.ipynb
│   ├── README.md
│   ├── genre.json
│   ├── imdb_movie_content.py
│   ├── movie_budget.json
│   └── parser.py
├── KaggleSoccer
│   ├── Parsing Soccer Ways.ipynb
│   ├── Predicting Ratings.ipynb
│   ├── README.md
│   ├── df.pkl
│   ├── dumper.py
│   └── parser.py
├── KaggleTaxiTrip
│   ├── Exploring the dataset.ipynb
│   ├── README.md
│   └── pic
│       ├── download.png
│       ├── feat_importance.png
│       ├── nyc_clusters.png
│       ├── outliners.png
│       └── rush_hour.png
├── KaggleTitanic
│   ├── Notebook.ipynb
│   └── README.md
├── KaggleWiki
│   ├── README.md
│   └── wikipedia_parser.py
├── Kaggle_KobeShots.ipynb
├── ObjectDetection
│   ├── color_extract.py
│   └── recognition.py
├── ProjectMovieRating
│   ├── .ipynb_checkpoints
│   │   └── Exploration-checkpoint.ipynb
│   ├── Exploration.ipynb
│   ├── README.md
│   ├── __pycache__
│   │   ├── imdb_movie_content.cpython-35.pyc
│   │   └── parser.cpython-35.pyc
│   ├── genre.json
│   ├── imdb_movie_content.py
│   ├── movie_budget.json
│   ├── movie_contents.json
│   ├── movie_contents9.json
│   └── parser.py
├── ProjectSoccer
│   ├── Predicting Ratings.ipynb
│   ├── README.md
│   └── df.pkl
├── README.md
├── TwitterParsing
│   ├── README.md
│   ├── com.alexattia.machinelearning.downloadflashcards.plist
│   ├── config.py
│   ├── config_downl.py
│   ├── download_pics.py
│   ├── pic.png
│   ├── pic2.png
│   ├── run.sh
│   └── send_pictures.py
└── pics
    ├── corr_matrix.png
    ├── features.png
    ├── psg_stats.png
    └── psg_ste.png

/FaceRecognition/README.md:
--------------------------------------------------------------------------------
## Face Recognition

Modern face recognition with deep learning and the HOG algorithm.

1. Find faces in the image (HOG algorithm)
2. Affine transformations (face alignment using an ensemble of regression trees)
3. Encoding faces (FaceNet)
4. Make a prediction (Linear SVM)

We use the [Histogram of Oriented Gradients](http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf) (HOG) method. Instead of computing gradients for every pixel of the image (far too much detail), we compute the weighted vote of gradient orientations over 16x16 pixel squares. Afterward, we have a simple representation (a HOG image) that captures the basic structure of a face.
All we have to do is find the part of our image that looks the most similar to a known trained HOG pattern.
For this technique, we use the dlib Python library to generate and view HOG representations of images.
```
face_detector = dlib.get_frontal_face_detector()
detected_faces = face_detector(image, 1)
```

After isolating the faces in our image, we need to warp (pose and project) the picture so the face is always in the same place. To do this, we use the [face landmark estimation algorithm](http://www.csc.kth.se/~vahidk/papers/KazemiCVPR14.pdf). Following this method, there are 68 specific points (landmarks) on every face, and we train a machine learning algorithm to find these 68 landmarks on any face.
```
face_pose_predictor = dlib.shape_predictor(predictor_model)
pose_landmarks = face_pose_predictor(img, f)
```
After finding those landmarks, we apply affine transformations (rotation, scaling and shearing) to the image so that the eyes and mouth are centered as well as possible.
```
face_aligner = openface.AlignDlib(predictor_model)
alignedFace = face_aligner.align(534, image, face_rect, landmarkIndices=openface.AlignDlib.OUTER_EYES_AND_NOSE)
```

The next step is encoding the detected face. For this, we use deep learning. We train a neural net to generate [128 measurements (a face embedding)](http://www.cv-foundation.org/openaccess/content_cvpr_2015/app/1A_089.pdf) for each face.
The training process works by looking at 3 face images at a time:
- Load two training face images of the same known person and generate the 128 measurements for each of them
- Load a picture of a different person and generate its 128 measurements
Then we tweak the neural network slightly so that the measurements for the same person get slightly closer, while the measurements for the two different people get slightly further apart.
Once the network has been trained, it can generate measurements for any face, even ones it has never seen before!
```
face_encoder = dlib.face_recognition_model_v1(face_recognition_model)
face_encoding = np.array(face_encoder.compute_face_descriptor(image, pose_landmarks, 1))
```

Finally, we need a classifier (a linear SVM or any other classifier) to find the person in our database of known people whose measurements are closest to those of our test image. We train the classifier with the measurements as input.

Thanks to Adam Geitgey, who wrote a great [post](https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78) about this; I followed his pipeline.
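Putting the four steps together, a minimal end-to-end sketch looks like this (the model file paths are the ones used in `detect_recognize.py`; the `known_encodings`/`known_names` arrays and the nearest-neighbour matching with a 0.6 distance threshold are illustrative assumptions; the script in this repository trains a linear SVM instead):
```
import dlib
import numpy as np

# Pre-trained dlib models (same files as in detect_recognize.py)
face_detector = dlib.get_frontal_face_detector()
pose_predictor = dlib.shape_predictor('./model/shape_predictor_68_face_landmarks.dat')
face_encoder = dlib.face_recognition_model_v1('./model/dlib_face_recognition_resnet_model_v1.dat')

def encode_faces(image):
    """Detect every face in an RGB image and return one 128-d encoding per face."""
    encodings = []
    for face_rect in face_detector(image, 1):                  # 1. HOG face detection
        landmarks = pose_predictor(image, face_rect)           # 2. 68 facial landmarks
        descriptor = face_encoder.compute_face_descriptor(image, landmarks, 1)  # 3. 128-d embedding
        encodings.append(np.array(descriptor))
    return encodings

def identify(encoding, known_encodings, known_names, threshold=0.6):
    """4. Match an encoding against known people (nearest neighbour; assumed arrays)."""
    distances = np.linalg.norm(np.asarray(known_encodings) - encoding, axis=1)
    best = int(np.argmin(distances))
    return known_names[best] if distances[best] < threshold else 'unknown'
```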
44 | 45 | ![Result](https://github.com/alexattia/Data-Science-Projects/blob/master/FaceRecognition/result.png) -------------------------------------------------------------------------------- /FaceRecognition/detect_recognize.py: -------------------------------------------------------------------------------- 1 | import dlib 2 | from skimage import io, transform 3 | import cv2 4 | import matplotlib.pyplot as plt 5 | import numpy as np 6 | import pandas as pd 7 | import glob 8 | import openface 9 | import pickle 10 | import os 11 | import sys 12 | import argparse 13 | import time 14 | 15 | from sklearn.svm import SVC 16 | from sklearn.preprocessing import LabelEncoder 17 | 18 | face_detector = dlib.get_frontal_face_detector() 19 | face_encoder = dlib.face_recognition_model_v1('./model/dlib_face_recognition_resnet_model_v1.dat') 20 | face_pose_predictor = dlib.shape_predictor('./model/shape_predictor_68_face_landmarks.dat') 21 | 22 | def get_detected_faces(filename): 23 | """ 24 | Detect faces in a picture using HOG 25 | :param filename: picture filename 26 | :return: picture numpy array, face detector object with detected faces 27 | """ 28 | image = cv2.imread(filename) 29 | image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) 30 | return image, face_detector(image, 1) 31 | 32 | def get_face_encoding(image, detected_face): 33 | """ 34 | Encode face into 128 measurements using a neural net 35 | :param image: picture numpy array 36 | :param detected_face: face detector object with one detected face 37 | :return: measurement (128,) numpy array 38 | """ 39 | pose_landmarks = face_pose_predictor(image, detected_face) 40 | face_encoding = face_encoder.compute_face_descriptor(image, pose_landmarks, 1) 41 | return np.array(face_encoding) 42 | 43 | def training(people): 44 | """ 45 | Training our classifier (Linear SVC). Saving model using pickle. 46 | We need to have only one person/face per picture. 47 | :param people: people to classify and recognize 48 | """ 49 | # parsing labels and reprensations 50 | df = pd.DataFrame() 51 | for p in people: 52 | l = [] 53 | for filename in glob.glob('./data/%s/*' % p): 54 | image, face_detect = get_detected_faces(filename) 55 | face_encoding = get_face_encoding(image, face_detect[0]) 56 | l.append(np.append(face_encoding, [p])) 57 | temp = pd.DataFrame(np.array(l)) 58 | df = pd.concat([df, temp]) 59 | df.reset_index(drop=True, inplace=True) 60 | 61 | # converting labels into int 62 | le = LabelEncoder() 63 | y = le.fit_transform(df[128]) 64 | print("Training for {} classes.".format(len(le.classes_))) 65 | X = df.drop(128, axis=1) 66 | print("Training with {} pictures.".format(len(X))) 67 | 68 | # training 69 | clf = SVC(C=1, kernel='linear', probability=True) 70 | clf.fit(X, y) 71 | 72 | # dumping model 73 | fName = "./classifier.pkl" 74 | print("Saving classifier to '{}'".format(fName)) 75 | with open(fName, 'wb') as f: 76 | pickle.dump((le, clf), f) 77 | 78 | def predict(filename, le=None, clf=None, verbose=False): 79 | """ 80 | Detect and recognize a face using a trained classifier. 
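    If le/clf are not provided, the pickled (LabelEncoder, SVC) pair is loaded from ./classifier.pkl.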
81 | :param filename: picture filename 82 | :param le: 83 | :paral clf: 84 | :param verbose: 85 | :return: picture with bounding boxes and prediction 86 | """ 87 | if not le and not clf: 88 | with open("./classifier.pkl", 'rb') as f: 89 | (le, clf) = pickle.load(f) 90 | image, detected_faces = get_detected_faces(filename) 91 | prediction = [] 92 | # Verbose for debugging 93 | if verbose: 94 | print('{} faces detected.'.format(len(detected_faces))) 95 | img = np.copy(image) 96 | font = cv2.FONT_HERSHEY_SIMPLEX 97 | for face_detect in detected_faces: 98 | # draw bounding boxes 99 | cv2.rectangle(img, (face_detect.left(), face_detect.top()), 100 | (face_detect.right(), face_detect.bottom()), (255, 0, 0), 2) 101 | start_time = time.time() 102 | # predict each face 103 | p = clf.predict_proba(get_face_encoding(image, face_detect).reshape(1, 128)) 104 | # throwing away prediction with low confidence 105 | a = np.sort(p[0])[::-1] 106 | if a[0]-a[1] > 0.5: 107 | y_pred = le.inverse_transform(np.argmax(p)) 108 | prediction.append([y_pred, (face_detect.left(), face_detect.top()), 109 | (face_detect.right(), face_detect.bottom())]) 110 | else: 111 | y_pred = 'unknown' 112 | # Verbose for debugging 113 | if verbose: 114 | print('\n'.join(['%s : %.3f' % (k[0], k[1]) for k in list(zip(map(le.inverse_transform, 115 | np.argsort(p[0])), 116 | np.sort(p[0])))[::-1]])) 117 | print('Prediction took {:.2f}s'.format(time.time()-start_time)) 118 | 119 | cv2.putText(img, y_pred, (face_detect.left(), face_detect.top()-5), font, np.max(img.shape[:2])/1800, (255, 0, 0)) 120 | return img, prediction 121 | 122 | if __name__ == '__main__': 123 | parser = argparse.ArgumentParser() 124 | parser.add_argument('mode', type=str, help='train or predict') 125 | parser.add_argument('--training_data', 126 | type=str, 127 | help="Path to training data folder.", 128 | default='./data/') 129 | parser.add_argument('--testing_data', 130 | type=str, 131 | help="Path to test data folder.", 132 | default='./test/') 133 | 134 | args = parser.parse_args() 135 | people = os.listdir(args.training_data) 136 | print('{} people will be classified.'.format(len(people))) 137 | if args.mode == 'train': 138 | training(people) 139 | elif args.mode == 'test': 140 | with open("./classifier.pkl", 'rb') as f: 141 | (le, clf) = pickle.load(f) 142 | for i, f in enumerate(glob.glob(args.testing_data)): 143 | img, _ = predict(f, le, clf) 144 | cv2.imwrite(args.testing_data + 'test_{}.jpg'.format(i), img) 145 | 146 | -------------------------------------------------------------------------------- /FaceRecognition/result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/FaceRecognition/result.png -------------------------------------------------------------------------------- /HackerRank_MaximumSubArray.py: -------------------------------------------------------------------------------- 1 | ## Aim: 2 | ## Given an array of k elements, find the maximum possible sum of a 3 | ## 1. Contiguous subarray 4 | ## 2. Non-contiguous (not necessarily contiguous) subarray. 5 | 6 | ## Input : 7 | ## First line of the input has an integer t. t cases follow. 8 | ## Each test case begins with an integer k . In the next line, k integers follow representing the elements of array arr. 9 | 10 | ## Output : 11 | ## Two, space separated, integers denoting the maximum contiguous and non-contiguous subarray. 
12 | ## At least one integer should be selected and put into the subarrays 13 | ## (this may be required in cases where all elements are negative). 14 | 15 | t = int(input().strip()) 16 | def max_subarray(A): 17 | max_ending_here = max_so_far = 0 18 | for x in A: 19 | max_ending_here = max(0, max_ending_here + x) 20 | max_so_far = max(max_so_far, max_ending_here) 21 | return max_so_far 22 | 23 | for i in range(2*t): 24 | k=int(input()) 25 | arr = [int(arr_temp) for arr_temp in input().strip().split(' ')] 26 | if(all(item>0 for item in arr)): 27 | print(sum(arr),sum(arr)) 28 | elif(all(item<0 for item in arr)): 29 | print(max(arr),max(arr)) 30 | else: 31 | c=0 32 | for i in range(len(arr)): 33 | if(c+arr[i]>c): 34 | c+=arr[i] 35 | print(max_subarray(arr),c) 36 | -------------------------------------------------------------------------------- /HackerRank_PalindromeIndex.py: -------------------------------------------------------------------------------- 1 | ## Given a string of lowercase letters, determine the index of the character 2 | ## whose removal will make the string a palindrome. 3 | ## If the string is already a palindrome, then print -1 . There will always be a valid solution. 4 | 5 | ## Input: 6 | ## The first line contains n (the number of test cases). 7 | ## The n subsequent lines of test cases each contain a single string to be checked. 8 | 9 | ## Output : 10 | ## int the zero-indexed position (integer) of a character whose deletion will result in a palindrome; 11 | ## if there is no such character (i.e.: the string is already a palindrome), print -1. 12 | 13 | n = int(input().strip()) 14 | a = [] 15 | for a_i in range(n): 16 | a_t = [a_temp for a_temp in input().strip().split(' ')][0] 17 | a.append(a_t) 18 | 19 | def isPal(s): 20 | for i in range(int(len(s)/2)): 21 | if s[i] != s[len(s)-i-1]: 22 | return False 23 | return True 24 | 25 | def solve(s): 26 | for i in range(int((len(s)+1)/2)): 27 | if s[i] != s[len(s)-i-1]: 28 | if isPal(s[:i]+s[i+1:]): 29 | return i 30 | else: 31 | return len(s)-i-1 32 | return -1 33 | for i in range(n): 34 | x=a[i] 35 | print(solve(x)) -------------------------------------------------------------------------------- /HackerRank_TwoStrings.py: -------------------------------------------------------------------------------- 1 | ## Aim : 2 | ## You are given two strings, A and B. 3 | ## Find if there is a substring that appears in both A and B. 4 | 5 | # Input: 6 | # The first line of the input will contain a single integer , tthe number of test cases. 7 | # Then there will be t descriptions of the test cases. Each description contains two lines. 8 | # The first line contains the string A and the second line contains the string B. 9 | 10 | # Output: 11 | # For each test case, display YES (in a newline), if there is a common substring. 12 | # Otherwise, display NO 13 | 14 | n = int(input().strip()) 15 | a = [] 16 | for a_i in range(2*n): 17 | a_t = [a_temp for a_temp in input().strip().split(' ')][0] 18 | a.append(a_t) 19 | 20 | for i in range(2*n): 21 | if (i%2==0): 22 | s1=set(a[i]) 23 | s2=set(a[i+1]) 24 | S=s1.intersection(s2) #intersection of subset -> return the same characters !! 
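        # Any common substring must contain at least one shared character, so a non-empty
        # intersection of the two character sets is equivalent to having a common substring.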
25 | if (len(S)==0): 26 | print("NO") 27 | else: 28 | print("YES") 29 | -------------------------------------------------------------------------------- /KaggleAmazon/README.md: -------------------------------------------------------------------------------- 1 | # Planet: Understanding the Amazon from Space 2 | 3 | [Kaggle Link](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space) 4 | 5 | Use satellite data to track the human footprint in the Amazon rainforest. 6 | 7 | The goal of this challenge is to label satellite image chips with atmospheric conditions and various classes of land cover/land use. Resulting algorithms will help the global community better understand where, how, and why deforestation happens all over the world - and ultimately how to respond. 8 | 9 | This problem was tackled with Deep Learning models (using TensorFlow and Keras). 10 | Submissions are evaluated based on the F-beta score (F2 score), it measures acccuracy using precision and recall. 11 | 12 | 13 | -------------------------------------------------------------------------------- /KaggleAmazon/predict_keras.py: -------------------------------------------------------------------------------- 1 | import train_keras 2 | from keras.models import load_model 3 | import os 4 | import numpy as np 5 | import pandas as pd 6 | from tqdm import tqdm 7 | from keras.callbacks import ModelCheckpoint 8 | import sys 9 | 10 | TF_CPP_MIN_LOG_LEVEL=2 11 | TEST_BATCH = 128 12 | 13 | def load_params(): 14 | X_test = os.listdir('./test-jpg') 15 | X_test = [fn.replace('.jpg', '') for fn in X_test] 16 | model = load_model('model_amazon6.h5', custom_objects={'fbeta': train_keras.fbeta}) 17 | with open('tag_columns.txt', 'r') as f: 18 | tag_columns = f.read().split('\n') 19 | return X_test, model, tag_columns 20 | 21 | def prediction(X_test, model, tag_columns, test_folder): 22 | result = [] 23 | for i in tqdm(range(0, len(X_test), TEST_BATCH)): 24 | X_batch = X_test[i:i+TEST_BATCH] 25 | X_batch = np.array([train_keras.preprocess(train_keras.load_image(fn, folder=test_folder)) for fn in X_batch]) 26 | p = model.predict(X_batch) 27 | result.append(p) 28 | 29 | r = np.concatenate(result) 30 | r = r > 0.5 31 | table = [] 32 | for row in r: 33 | t = [] 34 | for b, v in zip(row, tag_columns): 35 | if b: 36 | t.append(v.replace('tag_', '')) 37 | table.append(' '.join(t)) 38 | print('Prediction done !') 39 | return table 40 | 41 | def launch(test_folder): 42 | X_test, model, tag_columns = load_params() 43 | table = prediction(X_test, model, tag_columns, test_folder) 44 | try: 45 | df_pred = pd.DataFrame.from_dict({'image_name': X_test, 'tags': table}) 46 | df_pred.to_csv('submission9.csv', index=False) 47 | except: 48 | np.save('image_name', X_test) 49 | np.save('table', table) 50 | 51 | if __name__ == '__main__': 52 | if len(sys.argv) > 1: 53 | test_folder = sys.argv[1] 54 | else: 55 | test_folder='test-jpg' 56 | launch(test_folder) -------------------------------------------------------------------------------- /KaggleAmazon/predict_tf.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import cv2 4 | import pandas as pd 5 | import pickle 6 | import glob 7 | 8 | picture_size = 32 9 | batch_size = 128 10 | num_steps = 30001 11 | 12 | def load_set(): 13 | global labels 14 | 15 | df = pd.DataFrame.from_csv('./train.csv') 16 | df.tags = df.tags.apply(lambda x:x.split(' ')) 17 | df = pd.concat([df, df.tags.apply(pd.Series)], axis=1) 18 | labels 
= list(set(np.concatenate(df.tags))) 19 | mapping = dict(enumerate(labels)) 20 | reverse_mapping = {v:k for k,v in mapping.items()} 21 | 22 | pictures = [] 23 | tags = [] 24 | for pic in np.random.choice(df.index, len(df)): 25 | img = cv2.imread('./train-jpg/{}.jpg'.format(pic)) / 255. 26 | img = cv2.resize(img, (picture_size, picture_size)) 27 | tag = np.zeros(len(labels)) 28 | for t in df.loc[pic].tags: 29 | tag[reverse_mapping[t]] = 1 30 | pictures.append(img) 31 | tags.append(tag) 32 | 33 | return pictures, tags, mapping 34 | 35 | def separate_set(pictures, labels): 36 | pictures_train = np.array(pictures[:int(len(pictures)*0.8)]) 37 | labels_train = np.array(tags[:int(len(tags)*0.8)]) 38 | print('Train shape', pictures_train.shape, labels_train.shape) 39 | 40 | pictures_valid = np.array(pictures[int(len(pictures)*0.8) : int(len(pictures)*0.9)]) 41 | labels_valid = np.array(tags[int(len(tags)*0.8) : int(len(tags)*0.9)]) 42 | print('Valid shape', pictures_valid.shape, labels_valid.shape) 43 | 44 | pictures_test = np.array(pictures[int(len(pictures)*0.9):]) 45 | labels_test = np.array(tags[int(len(tags)*0.9):]) 46 | print('Test shape', pictures_test.shape, labels_test.shape) 47 | return pictures_train, labels_train, pictures_valid, labels_valid, pictures_test, labels_test 48 | 49 | 50 | def accuracy(predictions, labels): 51 | return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) 52 | / predictions.shape[0]) 53 | 54 | def conv2d(x, W, stride=(1, 1), padding='SAME'): 55 | return tf.nn.conv2d(x, W, strides=[1, stride[0], stride[1], 1], 56 | padding=padding) 57 | 58 | def max_pool(x, ksize=(2, 2), stride=(2, 2)): 59 | return tf.nn.max_pool(x, ksize=[1, ksize[0], ksize[1], 1], 60 | strides=[1, stride[0], stride[1], 1], padding='SAME') 61 | 62 | def weight_variable(shape): 63 | initial = tf.truncated_normal(shape, stddev=0.1) 64 | return tf.Variable(initial) 65 | 66 | 67 | def bias_variable(shape): 68 | initial = tf.constant(0.1, shape=shape) 69 | return tf.Variable(initial) 70 | 71 | def model(data): 72 | # First convolution 73 | h_conv1 = tf.nn.relu(conv2d(data, W_1) + b_1) 74 | h_pool1 = max_pool(h_conv1, ksize=(2, 2), stride=(2, 2)) 75 | 76 | # Second convolution 77 | h_conv2 = tf.nn.relu(conv2d(h_pool1, W_2) + b_2) 78 | h_pool2 = max_pool(h_conv2, ksize=(2, 2), stride=(2, 2)) 79 | # reshape tensor into a batch of vectors 80 | pool_flat = tf.reshape(h_pool2, [-1, int(picture_size / 4) * int(picture_size / 4) * 64]) 81 | 82 | # Full connected layer with 1024 neurons. 
83 | fc = tf.nn.relu(tf.matmul(pool_flat, W_fc) + b_fc) 84 | 85 | # Compute logits 86 | return tf.matmul(fc, W_logits) + b_logits 87 | 88 | if __name__ == "__main__": 89 | pictures, tags, mapping = load_set() 90 | pictures_train, labels_train, pictures_valid, labels_valid, pictures_test, labels_test = separate_set(pictures, tags) 91 | 92 | graph = tf.Graph() 93 | with graph.as_default(): 94 | x = tf.placeholder(tf.float32, shape=(batch_size, picture_size, picture_size, 3)) 95 | y_ = tf.placeholder(tf.float32, shape=(batch_size, len(labels))) 96 | x_valid = tf.constant(pictures_valid, dtype=tf.float32) 97 | x_test = tf.constant(pictures_test, dtype=tf.float32) 98 | 99 | # Weights and biases 100 | W_1 = weight_variable([3, 3, 3, 64]) 101 | b_1 = bias_variable([32]) 102 | W_2 = weight_variable([5, 5, 32, 128]) 103 | b_2 = bias_variable([64]) 104 | W_fc = weight_variable([int(picture_size / 4) * int(picture_size / 4) * 64, 1024]) 105 | b_fc = bias_variable([1024]) 106 | W_logits = weight_variable([1024, len(labels)]) 107 | b_logits = bias_variable([len(labels)]) 108 | 109 | logits = model(x) 110 | loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y_)) 111 | optimizer = tf.train.AdamOptimizer(1e-4).minimize(loss) 112 | y = tf.nn.softmax(logits) 113 | y_valid = tf.nn.softmax(model(x_valid)) 114 | y_test = tf.nn.softmax(model(x_test)) 115 | 116 | with tf.Session(graph = graph) as session: 117 | tf.global_variables_initializer().run() 118 | for step in range(num_steps): 119 | offset = (step * batch_size) % (labels_train.shape[0] - batch_size) 120 | batch_data = pictures_train[offset:(offset + batch_size), :, :, :] 121 | batch_labels = labels_train[offset:(offset + batch_size), :] 122 | feed_dict = {x : batch_data, y_ : batch_labels} 123 | _, l, predictions = session.run([optimizer, loss, y], feed_dict=feed_dict) 124 | if (step % 1000 == 0): 125 | print('Minibatch loss at step %d: %.3f / Valid Accuracy %.2f' % (step, l, 126 | accuracy(y_valid.eval(), labels_valid))) 127 | print(accuracy(y_test.eval(), test_labels)) 128 | -------------------------------------------------------------------------------- /KaggleAmazon/train_keras.py: -------------------------------------------------------------------------------- 1 | import numpy as np # linear algebra 2 | import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) 3 | import os 4 | import glob 5 | import matplotlib.image as mpimg 6 | import gc 7 | 8 | from keras import backend as K 9 | from keras.models import Sequential 10 | from keras.layers import Dense, Dropout, Flatten, Lambda 11 | from keras.layers import Conv2D, MaxPooling2D, AveragePooling2D 12 | from keras.layers.normalization import BatchNormalization 13 | from keras.preprocessing.image import ImageDataGenerator 14 | from keras.regularizers import l2 15 | from keras.callbacks import ModelCheckpoint 16 | 17 | from sklearn.utils import shuffle 18 | from sklearn.model_selection import train_test_split 19 | 20 | import cv2 21 | from tqdm import tqdm 22 | 23 | picture_size = 128 24 | INPUT_SHAPE = (picture_size, picture_size, 4) 25 | EPOCHS = 20 26 | BATCH = 64 27 | PER_EPOCH = 256 28 | 29 | def fbeta(y_true, y_pred, threshold_shift=0): 30 | """ 31 | Submissions are evaluated based on the F-beta score, it measures acccuracy using precision 32 | and recall. Beta = 2 here. 
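    F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)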
33 | """ 34 | beta = 2 35 | 36 | # just in case of hipster activation at the final layer 37 | y_pred = K.clip(y_pred, 0, 1) 38 | 39 | # shifting the prediction threshold from .5 if needed 40 | y_pred_bin = K.round(y_pred + threshold_shift) 41 | 42 | tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon() 43 | fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1))) 44 | fn = K.sum(K.round(K.clip(y_true - y_pred, 0, 1))) 45 | 46 | precision = tp / (tp + fp) 47 | recall = tp / (tp + fn) 48 | 49 | beta_squared = beta ** 2 50 | return (beta_squared + 1) * (precision * recall) / (beta_squared * precision + recall + K.epsilon()) 51 | 52 | def load_image(filename, resize=True, folder='train-jpg'): 53 | img = mpimg.imread('./{}/{}.jpg'.format(folder, filename)) 54 | if resize: 55 | img = cv2.resize(img, (picture_size, picture_size)) 56 | return np.array(img) 57 | 58 | # def mean_normalize(img): 59 | # return (img - img.mean()) / (img.max() - img.min()) 60 | 61 | def preprocess(img): 62 | img = img / 127.5 - 1 63 | return img 64 | 65 | def generator(X, y, batch_size=32): 66 | X_copy, y_copy = X, y 67 | while True: 68 | for i in range(0, len(X_copy), batch_size): 69 | X_result, y_result = [], [] 70 | for x, y in zip(X_copy[i:i+batch_size], y_copy[i:i+batch_size]): 71 | rx, ry = [load_image(x)], [y] 72 | rx = np.array([preprocess(x) for x in rx]) 73 | ry = np.array(ry) 74 | X_result.append(rx) 75 | y_result.append(ry) 76 | X_result, y_result = np.concatenate(X_result), np.concatenate(y_result) 77 | yield shuffle(X_result, y_result) 78 | X_copy, y_copy = shuffle(X_copy, y_copy) 79 | 80 | def create_model(): 81 | model = Sequential() 82 | 83 | model.add(Conv2D(48, (8, 8), strides=(2, 2), input_shape=INPUT_SHAPE, activation='elu')) 84 | model.add(BatchNormalization()) 85 | 86 | model.add(Conv2D(64, (8, 8), strides=(2, 2), activation='elu')) 87 | model.add(BatchNormalization()) 88 | 89 | model.add(Conv2D(96, (5, 5), strides=(2, 2), activation='elu')) 90 | model.add(BatchNormalization()) 91 | 92 | model.add(Conv2D(128, (3, 3), activation='elu')) 93 | model.add(BatchNormalization()) 94 | 95 | model.add(Conv2D(128, (2, 2), activation='elu')) 96 | model.add(BatchNormalization()) 97 | 98 | model.add(Flatten()) 99 | model.add(Dropout(0.3)) 100 | 101 | model.add(Dense(1024, activation='elu')) 102 | model.add(BatchNormalization()) 103 | 104 | model.add(Dense(64, activation='elu')) 105 | model.add(BatchNormalization()) 106 | 107 | model.add(Dense(n_classes, activation='sigmoid')) 108 | 109 | 110 | model.compile( 111 | optimizer='adam', 112 | loss='binary_crossentropy', 113 | metrics=[fbeta, 'accuracy'] 114 | ) 115 | 116 | return model 117 | 118 | def load_set(): 119 | global n_features, n_classes, tag_columns 120 | df = pd.DataFrame.from_csv('./train.csv') 121 | df.tags = df.tags.apply(lambda x:x.split(' ')) 122 | df.reset_index(inplace=True) 123 | labels = list(df.tags) 124 | 125 | tags = set(np.concatenate(labels)) 126 | tag_list = list(tags) 127 | tag_list.sort() 128 | tag_columns = ['tag_' + t for t in tag_list] 129 | for t in tag_list: 130 | df['tag_' + t] = df['tags'].map(lambda x: 1 if t in x else 0) 131 | 132 | X = df['image_name'].values 133 | y = df[tag_columns].values 134 | 135 | n_features = 1 136 | n_classes = y.shape[1] 137 | X, y = shuffle(X, y) 138 | 139 | X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1) 140 | 141 | print('Train', X_train.shape, y_train.shape) 142 | print('Valid', X_valid.shape, y_valid.shape) 143 | return X_train, X_valid, y_train, y_valid 144 | 145 | def 
launch(): 146 | X_train, X_valid, y_train, y_valid = load_set() 147 | model = create_model() 148 | filepath="./weights-improvement-{epoch:02d}-{val_fbeta:.3f}.hdf5" 149 | checkpoint = ModelCheckpoint(filepath, monitor='val_fbeta', verbose=1, save_best_only=True, mode='max') 150 | callbacks_list = [checkpoint] 151 | 152 | history = model.fit_generator( 153 | generator(X_train, y_train, batch_size=BATCH), 154 | steps_per_epoch=PER_EPOCH, 155 | epochs=EPOCHS, 156 | validation_data=generator(X_valid, y_valid, batch_size=BATCH), 157 | validation_steps=len(y_valid)//(4*BATCH), 158 | callbacks=callbacks_list 159 | ) 160 | model.save('model_amazon2.h5') 161 | 162 | if __name__ == '__main__': 163 | launch() 164 | 165 | 166 | -------------------------------------------------------------------------------- /KaggleBikeSharing/Kaggle_BikeSharing.py: -------------------------------------------------------------------------------- 1 | ## coding: utf8 2 | ## Kaggle Challenge 3 | ## Utilisation de Python 3.5 4 | 5 | ## Chargement des bibliothèques 6 | import numpy as np 7 | import pandas as pd 8 | import matplotlib.pyplot as plt 9 | import collections 10 | import sklearn 11 | from sklearn import linear_model 12 | from sklearn import svm 13 | from sklearn import learning_curve 14 | from sklearn import ensemble 15 | #%matplotlib inline 16 | 17 | ## Chargement de la base de données 18 | df= pd.read_csv('data.csv') 19 | 20 | ## Transformation de la date sous 21 | ## format string au format de date 22 | df['date']=pd.to_datetime(df['datetime']) 23 | 24 | ## Extraction du mois, de l'heure et 25 | ## du jour de la semaine 26 | df.set_index('date', inplace=True) 27 | df['month']=df.index.month 28 | df['hours']=df.index.hour 29 | df['dayOfWeek']=df.index.weekday 30 | 31 | ################### 32 | ################### Influence du temps 33 | ################### 34 | 35 | ## Création d'une classe pour tracer 36 | ## la moyenne du nombre de locations 37 | ## de velos heure par heure pour les 38 | ## jours ouvrés ou non 39 | 40 | class mean_30(): 41 | def __init__(self, df): 42 | self.df=df 43 | def mean_hours_min(self,h): 44 | a = self.df["hours"] == h 45 | return self.df[a]["count"].mean() 46 | def transf(self, t): 47 | return self.mean_hours_min(t) 48 | def transfc(self, t): 49 | return self.err_hours_min(t) 50 | def vector_day(self): 51 | k = [] 52 | for i in range(0,24): 53 | k.append(i) 54 | hour_day = pd.DataFrame() 55 | hour_day["A"] = k 56 | return hour_day["A"] 57 | def view(self): 58 | plt.plot(self.vector_day().apply(self.transf)) 59 | 60 | ## Tracer des graphes 61 | 62 | fig=plt.figure() 63 | fig.suptitle('Location de velo selon l\'heure, jours ouvrables ou non', fontsize=13) 64 | plt.ylabel('nombre de locations de vélos') 65 | plt.xlabel('heure de la journée') 66 | moy0=mean_30(df[df['workingday']==0]) 67 | moy0.view() 68 | moy1=mean_30(df[df['workingday']==1]) 69 | moy1.view() 70 | plt.legend(['0','1']) 71 | plt.show() 72 | 73 | ## Création d'une classe pour tracer 74 | ## la deviation standard du nombre de 75 | ## locations de velos heure par heure 76 | ## pour un jour 77 | 78 | ## Changer cette valeur entre 0 et 6 pour 79 | ## pour obtenir un autre jour de la 80 | ## semaine 81 | 82 | j=0 83 | 84 | class std_30(): 85 | def __init__(self, df): 86 | self.df=df 87 | def mean_hours_std(self,j,h): 88 | y = self.df[self.df["dayOfWeek"]==j]["hours"] == h 89 | return self.df[self.df["dayOfWeek"]==j][y]["count"].mean() 90 | def err_hours(self,j,h): 91 | y = self.df[self.df["dayOfWeek"]==j]["hours"] == h 92 | 
return self.df[self.df["dayOfWeek"]==j][y]["count"].std() 93 | def transf_err(self,t): 94 | return self.mean_hours_std(j,t) 95 | def transf_err2(self,t): 96 | return self.err_hours(j,t) 97 | def vector_day(self): 98 | k = [] 99 | for i in range(0,24): 100 | k.append(i) 101 | hour_std = pd.DataFrame() 102 | hour_std["A"] = k 103 | return hour_std["A"] 104 | def view(self): 105 | errors=self.vector_day().apply(self.transf_err2) 106 | fig, ax = plt.subplots() 107 | self.vector_day().apply(self.transf_err).plot(yerr=errors, ax=ax,label=str(j)) 108 | plt.legend('0',loc=2,prop={'size':9}) 109 | 110 | fig.suptitle('Deviation standard des locations de velo selon l\'heure', fontsize=13) 111 | std0=std_30(df) 112 | std0.view() 113 | plt.ylabel('nombre de locations de vélos') 114 | plt.xlabel('heure de la journée') 115 | plt.show() 116 | 117 | ################### 118 | ################### Influence du mois 119 | ################### 120 | 121 | ## Création d'une classe pour tracer 122 | ## la moyenne du nombre de locations 123 | ## de velos par mois 124 | 125 | class month_30(): 126 | def __init__(self, df): 127 | self.df=df 128 | def mean_hours_min(self,m): 129 | a = self.df["month"] == m 130 | return self.df[a]["count"].mean() 131 | def transf(self, t): 132 | return self.mean_hours_min(t) 133 | def transfc(self, t): 134 | return self.err_hours_min(t) 135 | def vector_day(self): 136 | k = [] 137 | for i in range(0,13): 138 | k.append(i) 139 | hour_day = pd.DataFrame() 140 | hour_day["A"] = k 141 | return hour_day["A"] 142 | def view(self): 143 | plt.plot(self.vector_day().apply(self.transf)) 144 | 145 | ## Tracer des graphes 146 | 147 | fig=plt.figure() 148 | fig.suptitle('Location de velo selon le mois', fontsize=13) 149 | moy0=month_30(df) 150 | moy0.view() 151 | plt.ylabel('nombre de locations de vélos') 152 | plt.xlabel('mois de l\' année') 153 | plt.show() 154 | 155 | ################### 156 | ################### Influence de la météo 157 | ################### 158 | plt.figure() 159 | 160 | ## Moyenne de la demande en vélos 161 | ## Pour les 4 conditions grâce 162 | ## à un dictionnaire Python 163 | 164 | a={u'Degage/nuageux':df[df['weather']==1]['count'].mean(), 165 | u'Brouillard': df[df['weather']==2]['count'].mean(), 166 | u'Legere pluie':df[df['weather']==3]['count'].mean() 167 | } 168 | 169 | width = 1/1.6 170 | plt.bar(range(len(a)), a.values(),width,color="blue",align='center') 171 | plt.xticks(range(len(a)), a.keys()) 172 | plt.ylabel('nombre de locations de vélos') 173 | plt.title('Moyenne des locations de velos pour differentes conditions meteorologiques') 174 | plt.show() 175 | 176 | ################### 177 | ################### Influence du vent, de la température et de l'humidité 178 | ################### 179 | 180 | ## Moyenne de la demande en vélos 181 | ## Pour certains paramètres grâce 182 | ## à un dictionnaire Python 183 | 184 | D = {u'V>13k/h':df[df['windspeed']>13]['count'].mean(), 185 | u'V<13k/h': df[df['windspeed']<13]['count'].mean(), 186 | u'T<24°C':df[df['atemp']<24]['count'].mean(), 187 | u'T>24°C':df[df['atemp']>24]['count'].mean(), 188 | u'H>62%': df[df['humidity']>62]['count'].mean(), 189 | u'H<62%':df[df['humidity']<62]['count'].mean() 190 | } 191 | od = collections.OrderedDict(sorted(D.items())) 192 | width = 1/1.6 193 | plt.figure() 194 | plt.bar(range(len(od)), od.values(),width,color="blue",align='center') 195 | plt.xticks(range(len(od)), od.keys()) 196 | plt.title('Variation de la demande en fonction de 3 variables') 197 | plt.show() 198 | 199 | 
################### 200 | ################### Conclusion et choix des paramètres influents 201 | ################### 202 | 203 | ##On calcule matrice de corrélation pour 204 | ##retirer les variables corrélées 205 | 206 | df.corr() 207 | plt.matshow(df.corr()) 208 | plt.yticks(range(len(df.corr().columns)), df.corr().columns); 209 | plt.colorbar() 210 | plt.show() 211 | 212 | df1=df.drop(['workingday','datetime','season','atemp','holiday','registered','casual'],axis=1) 213 | 214 | target=df1['count'].values #ensemble des outputs à prédire (yi) 215 | 216 | train=df1.drop('count',axis=1) #ensemble des données (xi) 217 | 218 | ## Split aléatoire entre données d'entrainement 219 | ## et donnes test 220 | X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split( 221 | train, target, test_size=0.33, random_state=42) 222 | 223 | ## Creation d'une classe pour tracer 224 | ## les courbes d'apprentissages 225 | 226 | def plot_learning_curve(estimator, title, X, y, cv=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)): 227 | plt.figure() 228 | plt.title(title) 229 | plt.xlabel("Training examples") 230 | plt.ylabel("Score") 231 | train_sizes, train_scores, test_scores = sklearn.learning_curve.learning_curve( 232 | estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes) 233 | train_scores_mean = np.mean(train_scores, axis=1) 234 | train_scores_std = np.std(train_scores, axis=1) 235 | test_scores_mean = np.mean(test_scores, axis=1) 236 | test_scores_std = np.std(test_scores, axis=1) 237 | plt.grid() 238 | plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score") 239 | plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score") 240 | plt.legend(loc="best") 241 | return plt 242 | 243 | ################### 244 | ################### Regression Linéaire 245 | ################### 246 | 247 | linreg=linear_model.LinearRegression() 248 | linreg.fit(X_train,Y_train) 249 | tableau=[['Paramètre', 'Coefficient']] #liste pour voir les valeurs des coefficients 250 | col=list(train.columns.values) 251 | for i in range(7): 252 | tableau.append([col[i],linreg.coef_[i]]) 253 | 254 | print ("Training Score Regression Linéare : ", str(linreg.score(X_train,Y_train))) 255 | print ("Test Score Regression Linéare : " , str(linreg.score(X_test,Y_test))) 256 | print ("Coefficients Regression Linéare :") 257 | print (tableau) 258 | 259 | ################### 260 | ################### Gradient Boosting Regression 261 | ################### 262 | 263 | gbr = ensemble.GradientBoostingRegressor(n_estimators=2000) 264 | gbr.fit(X_train,Y_train) 265 | print ("Training Score GradientBoosting: ", str(gbr.score(X_train,Y_train))) 266 | print ("Test Score GradientBoosting: " , str(gbr.score(X_test,Y_test))) 267 | 268 | title = "Learning Curves (GradientBoosting)" 269 | estimator = ensemble.GradientBoostingRegressor(n_estimators=2000) 270 | plot_learning_curve(estimator, title, X_train, Y_train) 271 | plt.show() 272 | 273 | ################### 274 | ################### Support Vector Regression 275 | ################### 276 | svr=svm.SVR(kernel='linear') 277 | svr.fit(X_train,Y_train) 278 | print ("Training Score SVR: ", str(svr.score(X_train,Y_train))) 279 | print ("Test Score SVR : " , str(svr.score(X_test,Y_test))) 280 | 281 | ################### 282 | ################### Random Forest Regression 283 | ################### 284 | 285 | rf=ensemble.RandomForestRegressor(n_estimators=30,oob_score=True) #30 arbres et OOB Estimation 286 | rf.fit(train,target) 287 
| print ("Training Score RandomForest: ", str(rf.score(train,target))) 288 | print ("OOB Score RandomForest: " , str(rf.oob_score_)) 289 | 290 | ################### 291 | ################### Améliorations 292 | ################### 293 | 294 | ## On cherche le paramètre le plus influent de notre modèle 295 | ## Pour cela on utilise deux algorithmes qu'on a utilisé : 296 | ## la régréssio linéaire et Random Forest 297 | def param_import(): 298 | col=list(train.columns.values) 299 | #on trouve d'abord les coefficients de la régréssion linéaire 300 | index1=linreg.coef_.argsort()[-2:][-1] #renvoie la liste triée des coef et on prend le premier élement 301 | index2=linreg.coef_.argsort()[-2:][0] #renvoie la liste triée des coef et on prend le deuxieme élement 302 | print('Pour les améliorations, calculons les paramètres les plus influents : ') 303 | print('...') 304 | print('Pour la regréssion linéaire, les paramètres les plus influents sont :', col[index1],' et ',col[index2]) 305 | #on trouve ensuite les coefficients de la RF 306 | index3=rf.feature_importances_.argsort()[-2:][-1] 307 | index4=rf.feature_importances_.argsort()[-2:][0] 308 | print('Pour l\'algorithme de RF, les paramètres les plus influents sont :', col[index3],' et ',col[index4]) 309 | #on trouve ensuite les coefficients de Gradient Boosting 310 | index5=gbr.feature_importances_.argsort()[-2:][-1] 311 | index6=gbr.feature_importances_.argsort()[-2:][0] 312 | print('Pour l\'algorithme de Gradient Boosting, les paramètres les plus influents sont :', col[index5],' et ',col[index6]) 313 | if index3==index5: 314 | plus_import=index3 315 | elif index5==index4: 316 | plus_import=index4 317 | return plus_import 318 | 319 | print('Le paramètre le plus important est donc : ', col[param_import()]) 320 | 321 | ## On cherche à prouver que certains creneaux sont 322 | ##plus importants pour la demande en vélo 323 | 324 | soir = df[df['hours'].isin([17,18,19])] 325 | peak_soir=soir[soir['workingday']==1] 326 | matin = df[df['hours'].isin([7,8,9])] 327 | peak_matin=matin[matin['workingday']==1] 328 | we = df[df['hours'].isin([12,13,14,15,16])] 329 | peak_we=we[we['workingday']==0] 330 | 331 | print('Calculons ensuite la moyenne du nombre de vélos pour plusieurs créneaux horaires : ') 332 | print('...') 333 | print('La moyenne totale de la demande est : ', df['count'].mean()) 334 | print('En semaine, entre 17 et 19h : ', peak_soir['count'].mean()) 335 | print('En semaine, entre 7 et 9h : ', peak_matin['count'].mean()) 336 | print('En week-end, entre 12 et 16h : ', peak_we['count'].mean()) 337 | 338 | -------------------------------------------------------------------------------- /KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf -------------------------------------------------------------------------------- /KaggleBikeSharing/README.md: -------------------------------------------------------------------------------- 1 | # Kaggle Bike Sharing 2 | 3 | [Kaggle Link](https://www.kaggle.com/c/bike-sharing-demand) 4 | 5 | Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. 
6 | In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C. 7 | 8 | The goal of this challenge is to build a model that predicts the count of bike shared, exclusively based on contextual features. The first part of this challenge was aimed to understand, to analyse and to process those dataset. I wanted to produce meaningful information with plots. The second part was to build a model and use a Machine Learning library in order to predict the count. 9 | 10 | The more importants parameters were the time, the month, the temperature and the weather. 11 | Multiple models were tested during this challenge (Linear Regression, Gradient Boosting, SVR and Random Forest). Finally, the chosen model was Random Forest. The accuracy was measured with [Out-of-Bag Error](https://www.stat.berkeley.edu/~breiman/OOBestimation.pdf) and the OOB score was 0.85. 12 | 13 | More detailed explanations (with a couple of plots) has been written in French. 14 | -->[French Explanations PDF](https://github.com/alexattia/Online-Challenge/blob/master/KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf) 15 | 16 | -------------------------------------------------------------------------------- /KaggleInstaCart/Predictions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 6, 6 | "metadata": { 7 | "ExecuteTime": { 8 | "end_time": "2017-06-21T21:37:22.901368Z", 9 | "start_time": "2017-06-21T21:37:21.787528Z" 10 | }, 11 | "collapsed": true 12 | }, 13 | "outputs": [], 14 | "source": [ 15 | "import numpy as np\n", 16 | "import pandas as pd\n", 17 | "import matplotlib.pyplot as plt\n", 18 | "import seaborn as sns\n", 19 | "import calendar\n", 20 | "%matplotlib inline" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 7, 26 | "metadata": { 27 | "ExecuteTime": { 28 | "end_time": "2017-06-21T21:37:36.669912Z", 29 | "start_time": "2017-06-21T21:37:22.902860Z" 30 | }, 31 | "collapsed": true 32 | }, 33 | "outputs": [], 34 | "source": [ 35 | "# departements and aisle id\n", 36 | "aisles = pd.read_csv('./data/aisles.csv')\n", 37 | "departements = pd.read_csv('./data/departments.csv')\n", 38 | "products = pd.read_csv('./data/products.csv')\n", 39 | "orderp = pd.read_csv('./data/order_products__prior.csv')\n", 40 | "ordert = pd.read_csv('./data/order_products__train.csv')\n", 41 | "orders = pd.read_csv('./data/orders.csv')" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 8, 47 | "metadata": { 48 | "ExecuteTime": { 49 | "end_time": "2017-06-21T21:37:43.774516Z", 50 | "start_time": "2017-06-21T21:37:36.671328Z" 51 | } 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "order_products = pd.merge(orderp, orders, on = 'order_id')" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "ExecuteTime": { 63 | "start_time": "2017-06-21T21:37:22.391Z" 64 | } 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "data = order_products.groupby(['product_id', 'user_id']).agg({'order_id':'count',\n", 69 | " 'order_number':[np.min, np.max],\n", 70 | " 'add_to_cart_order':'mean'})\n", 71 | "data.columns = data.columns.droplevel(0)\n", 72 | "data = data.rename(columns={'mean':\"up_average_cart_position\", \n", 73 | " 'amin':\"up_first_order\", \n", 74 | " 'amax':\"up_last_order\", \n", 75 | " 
'count':\"up_orders\"})\n",
    "data = data.reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2017-06-21T21:37:23.516Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "d = pd.merge(data, order_products, on='product_id')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "workenv",
   "language": "python",
   "name": "workenv"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  },
  "nav_menu": {},
  "toc": {
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 6,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

--------------------------------------------------------------------------------
/KaggleMovieRating/README.md:
--------------------------------------------------------------------------------
# Predict IMDB movie rating

Project inspired by Chuan Sun's [work](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset), without Scrapy

Main question: how can we tell the greatness of a movie before it is released in cinemas?

## First Part - Parsing data

The first part aims to parse data from the IMDb and The Numbers websites: casting information, directors, production companies, awards, genres, budget, gross, description, imdb_rating, etc.
To create the movie_contents.json file:
``python3 parser.py nb_elements``

## Second Part - Data Analysis

The second part is to analyze the dataframe and observe correlations between variables. For example, are movie awards correlated with the worldwide gross? Is the cast of a well-liked movie also well liked?
See the Jupyter notebook file.

![Correlation Matrix](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/corr_matrix.png)

As we can see in the picture above, the IMDB score is correlated with the number of awards and the gross, but not really with the production budget or the number of Facebook likes of the cast.
Obviously, domestic and worldwide gross are highly correlated. However, the larger the production budget, the larger the gross.
As shown in the notebook, the budget is not really correlated with the number of awards.
What's funny is that the popularity of the third most famous actor matters more for the IMDB score than the popularity of the most famous actor (correlation 0.2 vs 0.08).
(Many other charts are in the Jupyter notebook.)

## Third Part - Predict the IMDB score

Machine learning is used to predict the IMDB score from the meaningful variables.
We use a Random Forest algorithm (500 estimators).
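A minimal sketch of this step, assuming a flat dataframe with one row per movie assembled from `movie_budget.json` and the parsed IMDb content (the file name `movies_flat.json`, the exact feature list and the train/test split below are illustrative assumptions; `idmb_score` is the field name produced by `imdb_movie_content.py`):
```
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical pre-merged file with one flat row per movie
df = pd.read_json('movies_flat.json')
features = ['production_budget', 'worldwide_gross', 'num_voted_users',
            'num_user_for_reviews', 'total_cast_fb_likes']   # assumed subset of the meaningful variables
X = df[features].fillna(0)
y = df['idmb_score'].astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=500)   # 500 estimators, as mentioned above
rf.fit(X_train, y_train)
print('R^2 on held-out movies:', rf.score(X_test, y_test))
print(sorted(zip(rf.feature_importances_, features), reverse=True))
```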
30 | ![Most important features](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/features.png) 31 | 32 | -------------------------------------------------------------------------------- /KaggleMovieRating/genre.json: -------------------------------------------------------------------------------- 1 | {"Action": "0", 2 | "Adventure": "17", 3 | "Animation": "14", 4 | "Biography": "20", 5 | "Comedy": "21", 6 | "Crime": "10", 7 | "Documentary": "3", 8 | "Drama": "18", 9 | "Family": "19", 10 | "Fantasy": "6", 11 | "History": "8", 12 | "Horror": "4", 13 | "Music": "12", 14 | "Musical": "2", 15 | "Mystery": "16", 16 | "Romance": "9", 17 | "Sci-Fi": "13", 18 | "Short": "5", 19 | "Sport": "11", 20 | "Thriller": "7", 21 | "War": "15", 22 | "Western": "1"} -------------------------------------------------------------------------------- /KaggleMovieRating/imdb_movie_content.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | from bs4 import BeautifulSoup 4 | import time 5 | import random 6 | import re 7 | import datetime 8 | 9 | def parse_facebook_likes_number(num_likes_string): 10 | if num_likes_string[-1] == 'K': 11 | thousand = 1000 12 | else: 13 | thousand = 1 14 | return float(num_likes_string.replace('K','')) * thousand 15 | 16 | def parse_duration(time_string): 17 | time_string = time_string.split('min')[0] 18 | x = time.strptime(time_string,'%Hh%M') 19 | return datetime.timedelta(hours=x.tm_hour,minutes=x.tm_min,seconds=x.tm_sec).total_seconds() 20 | 21 | class ImdbMovieContent(): 22 | def __init__(self, movies): 23 | self.movies = movies 24 | self.base_url = "http://www.imdb.com" 25 | 26 | def get_facebook_likes(self, entity_id): 27 | if entity_id.startswith('nm'): 28 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Fname%2F{}%2F&colorscheme=light".format(entity_id) 29 | elif entity_id.startswith('tt'): 30 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Ftitle%2F{}%2F&colorscheme=light".format(entity_id) 31 | else: 32 | url = None 33 | time.sleep(random.uniform(0, 0.25)) # randomly snooze a time within [0, 0.4] second 34 | try: 35 | response = requests.get(url) 36 | soup = BeautifulSoup(response.text, "lxml") 37 | sentence = soup.find_all(id="u_0_2")[0].span.string # get sentence like: "43K people like this" 38 | num_likes = sentence.split(" ")[0] 39 | except Exception as e: 40 | num_likes = None 41 | return parse_facebook_likes_number(num_likes) 42 | 43 | def get_id_from_url(self, url): 44 | if url is None: 45 | return None 46 | return url.split('/')[4] 47 | 48 | def parse(self): 49 | movies_content = [] 50 | for movie in self.movies: 51 | imdb_url = movie['imdb_url'] 52 | response = requests.get(imdb_url) 53 | bs = BeautifulSoup(response.text, 'lxml') 54 | movies_content.append(self.get_content(bs)) 55 | return movies_content 56 | 57 | def get_awards(self, movie_link): 58 | awards_url = movie_link + 'awards' 59 | response = requests.get(awards_url) 60 | bs = BeautifulSoup(response.text, 'lxml') 61 | awards = bs.find_all('tr') 62 | award_dict = [] 63 | for award in awards: 64 | if 'rowspan' in award.find('td').attrs: 65 | rowspan = int(award.find('td')['rowspan']) 66 | if rowspan == 1: 67 | award_dict.append({'category' : award.find('span', class_='award_category').text, 68 | 'type' : award.find('b').text, 69 | 'award': award.find('td', 
class_='award_description').text.split('\n')[1].replace(' ', '')}) 70 | else: 71 | index = awards.index(award) 72 | dictt = {'category':award.find('span', class_='award_category').text, 73 | 'type' : award.find('b').text} 74 | awards_ = [] 75 | for elem in awards[index:index+rowspan]: 76 | award_dict.append({'category':award.find('span', class_='award_category').text, 77 | 'type' : award.find('b').text, 78 | 'award': elem.find('td', class_='award_description').text.split('\n')[1].replace(' ', '')}) 79 | award_nominated = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Nominated'] 80 | award_won = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Won'] 81 | return award_nominated, award_won 82 | 83 | def get_content(self, bs): 84 | movie = {} 85 | try: 86 | title_year = bs.find('div', class_ = 'title_wrapper').find('h1').text.encode('latin').decode('utf-8', 'ignore').replace(') ','').split('(') 87 | movie['movie_title'] = title_year[0] 88 | except: 89 | movie['movie_title'] = None 90 | try: 91 | movie['title_year'] = title_year[1] 92 | except: 93 | movie['title_year'] = None 94 | # try: 95 | # movie['genres'] = '|'.join([genre.text.replace(' ','') for genre in bs.find('div', {'itemprop': 'genre'}).find_all('a')]) 96 | # except: 97 | # movie['genres'] = None 98 | try: 99 | title_details = bs.find('div', {'id':'titleDetails'}) 100 | movie['language'] = '|'.join([language.text for language in title_details.find_all('a', href=re.compile('language'))]) 101 | movie['country'] = '|'.join([country.text for country in title_details.find_all('a', href=re.compile('country'))]) 102 | except: 103 | movie['language'] = None 104 | try: 105 | keywords = bs.find_all('span', {'itemprop':'keywords'}) 106 | movie['keywords'] = '|'.join([key.text for key in keywords]) 107 | except: 108 | movie['keywords'] = None 109 | try: 110 | movie['storyline'] = bs.find('div', {'id':'titleStoryLine'}).find('div', {'itemprop':'description'}).text.replace('\n', '') 111 | except: 112 | movie['storyline'] = None 113 | try: 114 | movie['contentRating'] = bs.find('span', {'itemprop':'contentRating'}).text.split(' ')[1] 115 | except: 116 | movie['contentRating'] = None 117 | try: 118 | movie['color'] = bs.find('a', href=re.compile('colors')).text 119 | except: 120 | movie['color'] = None 121 | try: 122 | movie['idmb_score'] = bs.find('span', {'itemprop':'ratingValue'}).text 123 | except: 124 | movie['idmb_score'] = None 125 | try: 126 | movie['num_voted_users'] = bs.find('span', {'itemprop':'ratingCount'}).text.replace(',',) 127 | except: 128 | movie['num_voted_users'] = None 129 | try: 130 | movie['duration_sec'] = parse_duration(bs.find('time', {'itemprop':'duration'}).text.replace(' ','').replace('\n', '')) 131 | except: 132 | movie['duration_sec'] = None 133 | try: 134 | review_counts = [int(count.text.split(' ')[0].replace(',', '')) for count in bs.find_all('span', {'itemprop':'reviewCount'})] 135 | movie['num_user_for_reviews'] = review_counts[0] 136 | except: 137 | movie['num_user_for_reviews'] = None 138 | try: 139 | prod_co = bs.find_all('span', {'itemtype': re.compile('Organization')}) 140 | movie['production_co'] = [elem.text.replace('\n','') for elem in prod_co] 141 | except: 142 | movie['production_co'] = None 143 | try: 144 | movie['num_critic_for_reviews'] = review_counts[1] 145 | except: 146 | movie['num_critic_for_reviews'] = None 147 | try: 148 | actors = bs.find('table', {'class':'cast_list'}).find_all('a', {'itemprop':'url'}) 
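            # Each cast-list link points to the actor's IMDb profile; keep (profile URL, display name)
            # pairs so the per-actor Facebook like count can be fetched below.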
149 | pairs_for_rows = [(self.base_url + actor['href'], actor.text.replace('\n', '')[1:]) for actor in actors] 150 | movie['cast_info'] = [{'actor_name' : actor[1], 151 | 'actor_link' : actor[0], 152 | 'actor_fb_likes': self.get_facebook_likes(self.get_id_from_url(actor[0]))} for actor in pairs_for_rows] 153 | except: 154 | movie['cast_info'] = None 155 | try: 156 | director = bs.find('span', {'itemprop':'director'}).find('a') 157 | director = (self.base_url + director['href'].replace('?','/?'), director.text) 158 | movie['director_info'] = {'director_name':director[1], 159 | 'director_link':director[0], 160 | 'director_fb_links': self.get_facebook_likes(self.get_id_from_url(director[0]))} 161 | except: 162 | movie['director_info'] = None 163 | try: 164 | movie['movie_imdb_link'] = bs.find('link')['href'] 165 | movie['num_facebook_like'] = self.get_facebook_likes(self.get_id_from_url(movie['movie_imdb_link'])) 166 | except: 167 | movie['num_facebook_like'] = None 168 | try: 169 | poster_image_url = bs.find('div', {'class':'poster'}).find('img')['src'] 170 | movie['image_urls'] = poster_image_url.split("_V1_")[0] + "_V1_.jpg" 171 | except: 172 | movie['image_urls'] = None 173 | try: 174 | award_nominated, award_won = self.get_awards(movie['movie_imdb_link']) 175 | movie['awards'] = {"won": award_won, 'nominated': award_nominated} 176 | except Exception as e: 177 | print(e) 178 | movie['awards'] = None 179 | return movie 180 | 181 | 182 | -------------------------------------------------------------------------------- /KaggleMovieRating/parser.py: -------------------------------------------------------------------------------- 1 | from bs4 import BeautifulSoup 2 | import json 3 | import requests 4 | import urllib 5 | from tqdm import tqdm 6 | import locale 7 | import pandas as pd 8 | import re 9 | import time 10 | import random 11 | import sys 12 | 13 | from imdb_movie_content import ImdbMovieContent 14 | 15 | def parse_price(price): 16 | """ 17 | Convert string price to numbers 18 | """ 19 | if not price: 20 | return 0 21 | price = price.replace(',', '') 22 | return locale.atoi(re.sub('[^0-9,]', "", price)) 23 | 24 | def get_movie_budget(): 25 | """ 26 | Parsing the numbers website to get the budget data. 27 | :return: list of dictionnaries with budget and gross 28 | """ 29 | movie_budget_url = 'http://www.the-numbers.com/movie/budgets/all' 30 | response = requests.get(movie_budget_url) 31 | bs = BeautifulSoup(response.text, 'lxml') 32 | table = bs.find('table') 33 | rows = [elem for elem in table.find_all('tr') if elem.get_text() != '\n'] 34 | 35 | movie_budget = [] 36 | for row in rows[1:]: 37 | specs = [elem.get_text() for elem in row.find_all('td')] 38 | movie_name = specs[2].encode('latin1').decode('utf8', 'ignore') 39 | movie_budget.append({'release_date': specs[1], 40 | 'movie_name': movie_name, 41 | 'production_budget': parse_price(specs[3]), 42 | 'domestic_gross': parse_price(specs[4]), 43 | 'worldwide_gross': parse_price(specs[5])}) 44 | 45 | return movie_budget 46 | 47 | def get_imdb_urls(movie_budget, nb_elements=None): 48 | """ 49 | Parsing imdb website to get imdb movies links. 
50 | Dumping a json file with budget, gross and imdb urls 51 | :param movie_budget: list of dictionnaries with budget and gross 52 | :param nb_elements: number of movies to parse 53 | """ 54 | for movie in tqdm(movie_budget[1000:1000+nb_elements]): 55 | movie_name = movie['movie_name'] 56 | title_url = urllib.parse.quote(movie_name.encode('utf-8')) 57 | imdb_search_link = "http://www.imdb.com/find?ref_=nv_sr_fn&q={}&s=tt".format(title_url) 58 | response = requests.get(imdb_search_link) 59 | bs = BeautifulSoup(response.text, 'lxml') 60 | results = bs.find("table", class_= "findList" ) 61 | try: 62 | movie['imdb_url'] = "http://www.imdb.com" + results.find('td', class_='result_text').find('a')['href'] 63 | except: 64 | movie['imdb_url'] = None 65 | 66 | with open('movie_budget.json', 'w') as fp: 67 | json.dump(movie_budget, fp) 68 | 69 | def get_imdb_content(movie_budget_path, nb_elements=None): 70 | """ 71 | Parsing imdb website to get imdb content : awards, casting, description, etc. 72 | Dumping a json file with imdb content 73 | :param movie_budget_path: path of the movie_budget.json file 74 | :param nb_elements: number of movies to parse 75 | """ 76 | with open(movie_budget_path, 'r') as fp: 77 | movies = json.load(fp) 78 | content_provider = ImdbMovieContent(movies) 79 | contents = [] 80 | threshold = 1300 81 | for i, movie in enumerate(movies[threshold:threshold+nb_elements]): 82 | time.sleep(random.uniform(0, 0.25)) 83 | print("\r%i / %i" % (i, len(movies[threshold:threshold+nb_elements])), end="") 84 | try: 85 | imdb_url = movie['imdb_url'] 86 | response = requests.get(imdb_url) 87 | bs = BeautifulSoup(response.text, 'lxml') 88 | movies_content = content_provider.get_content(bs) 89 | contents.append(movies_content) 90 | except Exception as e: 91 | print(e) 92 | pass 93 | with open('movie_contents7.json', 'w') as fp: 94 | json.dump(contents, fp) 95 | 96 | def parse_awards(movie): 97 | """ 98 | Convert awards information to a dictionnary for dataframe. 99 | Keeping only Oscar, BAFTA, Golden Globe and Palme d'Or awards. 100 | :param movie: movie dictionnary 101 | :return: well-formated dictionnary with awards information 102 | """ 103 | awards_kept = ['Oscar', 'BAFTA Film Award', 'Golden Globe', 'Palme d\'Or'] 104 | awards_category = ['won', 'nominated'] 105 | parsed_awards = {} 106 | for category in awards_category: 107 | for awards_type in awards_kept: 108 | awards_cat = [award for award in movie['awards'][category] if award['category'] == awards_type] 109 | for k, award in enumerate(awards_cat): 110 | parsed_awards['{}_{}_{}'.format(awards_type, category, k+1)] = award["award"] 111 | return parsed_awards 112 | 113 | def parse_actors(movie): 114 | """ 115 | Convert casting information to a dictionnary for dataframe. 116 | Keeping only 3 actors with most facebook likes. 
117 | :param movie: movie dictionnary 118 | :return: well-formated dictionnary with casting information 119 | """ 120 | sorted_actors = sorted(movie['cast_info'], key=lambda x:x['actor_fb_likes'], reverse=True) 121 | top_k = 3 122 | parsed_actors = {} 123 | parsed_actors['total_cast_fb_likes'] = sum([actor['actor_fb_likes'] for actor in movie['cast_info']]) + movie['director_info']['director_fb_links'] 124 | for k, actor in enumerate(sorted_actors[:top_k]): 125 | if k < len(sorted_actors): 126 | parsed_actors['actor_{}_name'.format(k+1)] = actor['actor_name'] 127 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = actor['actor_fb_likes'] 128 | else: 129 | parsed_actors['actor_{}_name'.format(k+1)] = None 130 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = None 131 | return parsed_actors 132 | 133 | def parse_production_company(movie): 134 | """ 135 | Convert production companies to a dictionnary for dataframe. 136 | Keeping only 3 production companies. 137 | :param movie: movie dictionnary 138 | :return: well-formated dictionnary with production companies 139 | """ 140 | parsed_production_co = {} 141 | top_k = 3 142 | production_companies = movie['production_co'][:top_k] 143 | for k, company in enumerate(production_companies): 144 | if k < len(movie['production_co']): 145 | parsed_production_co['production_co_{}'.format(k+1)] = company 146 | else: 147 | parsed_production_co['production_co_{}'.format(k+1)] = None 148 | return parsed_production_co 149 | 150 | def parse_genres(movie): 151 | """ 152 | Convert genres to a dictionnary for dataframe. 153 | :param movie: movie dictionnary 154 | :return: well-formated dictionnary with genres 155 | """ 156 | parse_genres = {} 157 | g = movie['genres'] 158 | with open('genre.json', 'r') as f: 159 | genres = json.load(f) 160 | for k, genre in enumerate(g): 161 | if genre in genres: 162 | parse_genres['genre_{}'.format(k+1)] = genres[genre] 163 | return parse_genres 164 | 165 | def create_dataframe(movies_content_path, movie_budget_path): 166 | """ 167 | Create dataframe from movie_budget.json and movie_content.json files. 
168 | :param movies_content_path: path of the movies_content.json file 169 | :param movie_budget_path: path of the movie_budget.json file 170 | :return: well formated dataframe 171 | """ 172 | with open(movies_content_path, 'r') as fp: 173 | movies = json.load(fp) 174 | with open(movie_budget_path, 'r') as fp: 175 | movies_budget = json.load(fp) 176 | movies_list = [] 177 | for movie in movies: 178 | content = {k:v for k,v in movie.items() if k not in ['awards', 'cast_info', 'director_info', 'production_co']} 179 | name = movie['movie_title'] 180 | try: 181 | budget = [film for film in movies_budget if film['movie_name']==name][0] 182 | budget = {k:v for k,v in budget.items() if k not in ['imdb_url', 'movie_name']} 183 | content.update(budget) 184 | except: 185 | pass 186 | try: 187 | content.update(parse_awards(movie)) 188 | except: 189 | pass 190 | try: 191 | content.update(parse_genres(movie)) 192 | except: 193 | pass 194 | try: 195 | content.update({k:v for k,v in movie['director_info'].items() if k!= 'director_link'}) 196 | except: 197 | pass 198 | try: 199 | content.update(parse_production_company(movie)) 200 | except: 201 | pass 202 | try: 203 | content.update(parse_actors(movie)) 204 | except: 205 | pass 206 | movies_list.append(content) 207 | df = pd.DataFrame(movies_list) 208 | df = df[pd.notnull(df.idmb_score)] 209 | df.idmb_score = df.idmb_score.apply(float) 210 | return df 211 | 212 | if __name__ == '__main__': 213 | if len(sys.argv) > 1: 214 | nb_elements = int(sys.argv[1]) 215 | else: 216 | nb_elements = None 217 | # movie_budget = get_movie_budget() 218 | # movies = get_imdb_urls(movie_budget, nb_elements=nb_elements) 219 | get_imdb_content("movie_budget.json", nb_elements=nb_elements) 220 | 221 | 222 | -------------------------------------------------------------------------------- /KaggleSoccer/Predicting Ratings.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 25, 6 | "metadata": { 7 | "ExecuteTime": { 8 | "end_time": "2017-05-21T05:03:03.866314Z", 9 | "start_time": "2017-05-21T05:03:03.851669Z" 10 | }, 11 | "collapsed": true 12 | }, 13 | "outputs": [], 14 | "source": [ 15 | "import pandas as pd\n", 16 | "import numpy as np\n", 17 | "import seaborn as sns\n", 18 | "import matplotlib.pyplot as plt\n", 19 | "import imp\n", 20 | "import json\n", 21 | "from sklearn.preprocessing import LabelEncoder\n", 22 | "from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingRegressor\n", 23 | "from sklearn.linear_model import Ridge, LinearRegression, Lasso\n", 24 | "from sklearn.kernel_ridge import KernelRidge\n", 25 | "from sklearn.svm import SVR, LinearSVC\n", 26 | "from sklearn.model_selection import train_test_split\n", 27 | "import whoscored\n", 28 | "from unidecode import unidecode\n", 29 | "import pickle\n", 30 | "import difflib\n", 31 | "import glob\n", 32 | "%matplotlib inline" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## Data Processing" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 17, 45 | "metadata": { 46 | "ExecuteTime": { 47 | "end_time": "2017-05-21T05:00:15.202603Z", 48 | "start_time": "2017-05-21T05:00:15.081729Z" 49 | }, 50 | "collapsed": true 51 | }, 52 | "outputs": [], 53 | "source": [ 54 | "teams = {}\n", 55 | "for file in glob.glob('./games/*.json'):\n", 56 | " with open(file,'r') as f:\n", 57 | " teams[file.split('/')[2].split('_')[0]] 
= json.load(f)" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 21, 63 | "metadata": { 64 | "ExecuteTime": { 65 | "end_time": "2017-05-21T05:01:41.986757Z", 66 | "start_time": "2017-05-21T05:01:41.952494Z" 67 | }, 68 | "collapsed": true 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "def get_position_sub(row):\n", 73 | " if row['Position'] == 'Sub':\n", 74 | " name = row['Player']\n", 75 | " try:\n", 76 | " row['Position'] = [k for k in df[df.Player == name].Position.value_counts().index if k !='Sub'][0]\n", 77 | " except:\n", 78 | " row['Position'] = 'substitute'\n", 79 | " return row\n", 80 | "\n", 81 | "stop_words = ['<','Bl.', 'Exc.', '>']\n", 82 | "def mean_rating(row):\n", 83 | " if pd.isnull(row['Rating_LEquipe']) or row['Rating_LEquipe'] in stop_words:\n", 84 | " name = row['Player']\n", 85 | " temp = df[(df.Player == name) & (~df.Rating_LEquipe.isin(stop_words)) & (pd.notnull(df.Rating_LEquipe))]\n", 86 | " if len(temp) > 4:\n", 87 | " row['Rating_LEquipe'] = np.mean(temp.Rating_LEquipe.apply(int))\n", 88 | " return row\n", 89 | "\n", 90 | "position_mapping = {'attackingmidfieldcenter' : 'attackingmidfield',\n", 91 | " 'attackingmidfieldleft' : 'attackingmidfield',\n", 92 | " 'attackingmidfieldright' : 'attackingmidfield',\n", 93 | " 'defenderleft' : 'defenderlateral',\n", 94 | " 'defendermidfieldcenter' : \"defendermidfield\",\n", 95 | " 'defendermidfieldleft' : 'defendermidfield',\n", 96 | " 'defendermidfieldright': \"defendermidfield\",\n", 97 | " 'defenderright': 'defenderlateral',\n", 98 | " 'forwardleft' : 'forwardlateral',\n", 99 | " 'forwardright' : 'forwardlateral',\n", 100 | " 'midfieldcenter' :'midfield' ,\n", 101 | " 'midfieldleft' :'midfield' ,\n", 102 | " 'midfieldright' :'midfield' }\n", 103 | "\n", 104 | "mapping_team_name = {'ASM': 'Monaco',\n", 105 | " 'ASNL': 'Nancy',\n", 106 | " 'ASSE': 'Saint-Etienne',\n", 107 | " 'DFCO': 'Dijon',\n", 108 | " 'EAG': 'Guingamp',\n", 109 | " 'FCGB': 'Bordeaux',\n", 110 | " 'FCL': 'Lorient',\n", 111 | " 'FCM': 'Metz',\n", 112 | " 'FCN': 'Nantes',\n", 113 | " 'Losc': 'Lille',\n", 114 | " 'MHSC': 'Montpellier',\n", 115 | " 'Man. City': 'Manchester City',\n", 116 | " 'Man. 
United': 'Manchester United',\n", 117 | " 'OGCN': 'Nice',\n", 118 | " 'OL': 'Lyon',\n", 119 | " 'OM': 'Marseille',\n", 120 | " 'PSG': 'Paris Saint Germain',\n", 121 | " 'Palace': 'Crystal Palace',\n", 122 | " 'SCB': 'SC Bastia',\n", 123 | " 'SCO': 'Angers',\n", 124 | " 'SMC': 'Caen',\n", 125 | " 'SRFC': 'Rennes',\n", 126 | " 'Stoke City': 'Stoke',\n", 127 | " 'TFC': 'Toulouse',\n", 128 | " 'WBA': 'West Bromwich Albion'}" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 138, 134 | "metadata": { 135 | "ExecuteTime": { 136 | "end_time": "2017-05-21T05:41:57.976844Z", 137 | "start_time": "2017-05-21T05:41:31.980530Z" 138 | }, 139 | "collapsed": true 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "df = pd.DataFrame()\n", 144 | "for team, file in teams.items():\n", 145 | " for i, game in enumerate(file):\n", 146 | " temp = pd.DataFrame(game['stats'])\n", 147 | " temp['Opponent'] = game['opponent']\n", 148 | " temp['Place'] = game['place'].title()\n", 149 | " result = game['result'].split(' : ')\n", 150 | " temp['Goal Team'] = int(result[int(game['place'] == 'away')])\n", 151 | " temp['Goal Opponent'] = int(result[int(game['place'] == 'home')])\n", 152 | " temp['Team'] = {'psg':'Paris Saint Germain'}.get(team, team).title()\n", 153 | " temp['Team'] = {'Bastia': 'SC Bastia'}.get(temp['Team'][0], temp['Team'][0])\n", 154 | " temp['Day'] = i+1\n", 155 | " df = pd.concat([df, temp])\n", 156 | "df = df.apply(whoscored.get_name, axis=1).reset_index(drop=True)\n", 157 | "df['LineUp'] = 1\n", 158 | "df.loc[df.Position == 'Sub', 'LineUp'] = 0\n", 159 | "df = df.apply(get_position_sub, axis=1)\n", 160 | "df.Goal.fillna(0, inplace=True)\n", 161 | "df.Assist.fillna(0, inplace=True)\n", 162 | "df.Yellowcard.fillna(0, inplace=True)\n", 163 | "df.Redcard.fillna(0, inplace=True)\n", 164 | "df.Penaltymissed.fillna(0, inplace=True)\n", 165 | "df.Shotonpost.fillna(0, inplace=True)\n", 166 | "# df.Position = df.Position.apply(lambda x:position_mapping.get(x,x))" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 139, 172 | "metadata": { 173 | "ExecuteTime": { 174 | "end_time": "2017-05-21T05:41:57.990536Z", 175 | "start_time": "2017-05-21T05:41:57.978189Z" 176 | } 177 | }, 178 | "outputs": [], 179 | "source": [ 180 | "with open('./ratings/notes_ligue1_lequipe.json','r') as f:\n", 181 | " rating_lequipe = json.load(f)\n", 182 | "rating_lequipe = {mapping_team_name.get(k,k):[p for p in v if list(p.keys())[0] != 'Nom' and len(list(p.values())[0]) > 0] \n", 183 | " for k,v in rating_lequipe.items()}" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 140, 189 | "metadata": { 190 | "ExecuteTime": { 191 | "end_time": "2017-05-21T05:42:40.226407Z", 192 | "start_time": "2017-05-21T05:41:57.992420Z" 193 | }, 194 | "scrolled": true 195 | }, 196 | "outputs": [], 197 | "source": [ 198 | "for team_name, rating_team in rating_lequipe.items():\n", 199 | " if team_name in set(df.Team):\n", 200 | " players_lequipe = [list(k.keys())[0] for k in rating_team]\n", 201 | " players_df = list(set(df[df.Team == team_name].Player))\n", 202 | " for player in rating_team:\n", 203 | " [(player_name, player_ratings)] = player.items()\n", 204 | " try:\n", 205 | " player_name_df = difflib.get_close_matches(player_name, players_df)[0]\n", 206 | " except:\n", 207 | " if len(unidecode(player_name).split('-')) > 1 :\n", 208 | " player_name_df = [k for k in players_df if unidecode(player_name).split('-')[0].replace(\"'\",\"\").lower() in 
unidecode(k).replace(\"'\",\"\").lower()\n", 209 | " or unidecode(player_name).split('-')[1].replace(\"'\",\"\").lower() in unidecode(k).replace(\"'\",\"\").lower()][0]\n", 210 | " else:\n", 211 | " player_name_df = [k for k in players_df if unidecode(player_name).replace(\"'\",\"\").lower() in unidecode(k).replace(\"'\",\"\").lower()][0]\n", 212 | " for day, rating in player_ratings.items():\n", 213 | " df.loc[(df.Player == player_name_df) & (df.Team == team_name) & (df.Day == int(day.split('Day ')[1])), 'Rating_LEquipe'] = rating\n", 214 | "df = df.apply(mean_rating, axis=1)\n", 215 | "df.drop('null', axis=1, inplace=True)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 141, 221 | "metadata": { 222 | "ExecuteTime": { 223 | "end_time": "2017-05-21T05:42:40.279122Z", 224 | "start_time": "2017-05-21T05:42:40.227785Z" 225 | }, 226 | "scrolled": true 227 | }, 228 | "outputs": [ 229 | { 230 | "name": "stdout", 231 | "output_type": "stream", 232 | "text": [ 233 | "9552/10187 données avec une note l'Équipe\n" 234 | ] 235 | }, 236 | { 237 | "data": { 238 | "text/html": [ 239 | "
\n", 240 | "\n", 253 | "\n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | "
TeamGoal TeamGoal OpponentPlayerGoalOpponentDayPositionAgeLineUpRating_LEquipe
415Paris Saint Germain21Thiago Motta0.0Lyon30midfieldcenter3405
9580Toulouse00Mauro Goicoechea0.0Rennes30goalkeeper2917
5736Metz01Thomas Didillon0.0Rennes11goalkeeper2115
6440Nantes11Rémy Riou0.0Metz25goalkeeper2915
3446Lille10Eder0.0SC Bastia31forward2914
9100Saint-Etienne11Kevin Malcuit0.0Rennes33defenderright2515.42105
5640Metz30Yann Jouffre0.0Nantes4attackingmidfieldright3204.88889
2401Angers30Abdoulaye Bamba0.0SC Bastia27defenderright2716
\n", 385 | "
" 386 | ], 387 | "text/plain": [ 388 | " Team Goal Team Goal Opponent Player Goal \\\n", 389 | "415 Paris Saint Germain 2 1 Thiago Motta 0.0 \n", 390 | "9580 Toulouse 0 0 Mauro Goicoechea 0.0 \n", 391 | "5736 Metz 0 1 Thomas Didillon 0.0 \n", 392 | "6440 Nantes 1 1 Rémy Riou 0.0 \n", 393 | "3446 Lille 1 0 Eder 0.0 \n", 394 | "9100 Saint-Etienne 1 1 Kevin Malcuit 0.0 \n", 395 | "5640 Metz 3 0 Yann Jouffre 0.0 \n", 396 | "2401 Angers 3 0 Abdoulaye Bamba 0.0 \n", 397 | "\n", 398 | " Opponent Day Position Age LineUp Rating_LEquipe \n", 399 | "415 Lyon 30 midfieldcenter 34 0 5 \n", 400 | "9580 Rennes 30 goalkeeper 29 1 7 \n", 401 | "5736 Rennes 11 goalkeeper 21 1 5 \n", 402 | "6440 Metz 25 goalkeeper 29 1 5 \n", 403 | "3446 SC Bastia 31 forward 29 1 4 \n", 404 | "9100 Rennes 33 defenderright 25 1 5.42105 \n", 405 | "5640 Nantes 4 attackingmidfieldright 32 0 4.88889 \n", 406 | "2401 SC Bastia 27 defenderright 27 1 6 " 407 | ] 408 | }, 409 | "execution_count": 141, 410 | "metadata": {}, 411 | "output_type": "execute_result" 412 | } 413 | ], 414 | "source": [ 415 | "print('%d/%d données avec une note l\\'Équipe' % (len(df[(~df.Rating_LEquipe.isin(stop_words)) & (pd.notnull(df.Rating_LEquipe))]),\n", 416 | " len(df)))\n", 417 | "df[['Team','Goal Team', 'Goal Opponent', 'Player', 'Goal',\n", 418 | " 'Opponent', 'Day', 'Position', 'Age', 'LineUp', 'Rating_LEquipe']].sort_values('Player').sample(8)" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 143, 424 | "metadata": { 425 | "ExecuteTime": { 426 | "end_time": "2017-05-21T05:43:00.252474Z", 427 | "start_time": "2017-05-21T05:43:00.221024Z" 428 | }, 429 | "collapsed": true 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "with open('./df.pkl', 'rb') as f:\n", 434 | " df = pickle.load(f)" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 176, 440 | "metadata": { 441 | "ExecuteTime": { 442 | "end_time": "2017-05-21T07:45:29.115217Z", 443 | "start_time": "2017-05-21T07:45:28.990990Z" 444 | }, 445 | "collapsed": true 446 | }, 447 | "outputs": [], 448 | "source": [ 449 | "other_cols = ['Day', \"Opponent\", \"Place\", \"Player\", \"Position\", \n", 450 | " \"Team\", 'Key Events', \"Rating_LEquipe\"]\n", 451 | "col_delete = ['Key Events', \"Rating_LEquipe\", 'Rating', 'Day']\n", 452 | "cols_to_transf = [col for col in df.columns if col not in other_cols]\n", 453 | "p = pd.concat([df[cols_to_transf].applymap(float), df[other_cols]], axis=1)\n", 454 | "new_col_to_remov = [e for e in p.columns if e.startswith('Acc')] + ['Touches']" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": { 460 | "ExecuteTime": { 461 | "end_time": "2017-05-20T18:57:44.123776Z", 462 | "start_time": "2017-05-20T18:57:44.118723Z" 463 | } 464 | }, 465 | "source": [ 466 | "## Learning" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 177, 472 | "metadata": { 473 | "ExecuteTime": { 474 | "end_time": "2017-05-21T07:45:30.269693Z", 475 | "start_time": "2017-05-21T07:45:30.254692Z" 476 | } 477 | }, 478 | "outputs": [], 479 | "source": [ 480 | "p = p[(~p.Rating_LEquipe.isin(stop_words)) & (pd.notnull(p.Rating_LEquipe))].set_index('Player')\n", 481 | "X = p[[col for col in p.columns if col not in col_delete + new_col_to_remov]]\n", 482 | "y = p['Rating_LEquipe'].apply(float)" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": 178, 488 | "metadata": { 489 | "ExecuteTime": { 490 | "end_time": "2017-05-21T07:45:31.239689Z", 491 | "start_time": 
"2017-05-21T07:45:30.949497Z" 492 | } 493 | }, 494 | "outputs": [ 495 | { 496 | "name": "stderr", 497 | "output_type": "stream", 498 | "text": [ 499 | "/Users/alexandreattia/Desktop/Work/workenv/lib/python3.5/site-packages/ipykernel/__main__.py:5: SettingWithCopyWarning: \n", 500 | "A value is trying to be set on a copy of a slice from a DataFrame.\n", 501 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 502 | "\n", 503 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 504 | ] 505 | } 506 | ], 507 | "source": [ 508 | "label_encoders = {}\n", 509 | "col_encode = [\"Opponent\", \"Place\", \"Position\", \"Team\"]\n", 510 | "for col in col_encode:\n", 511 | " label_encoders[col.lower()] = LabelEncoder()\n", 512 | " X[col] = label_encoders[col.lower()].fit_transform(X[col])" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 123, 518 | "metadata": { 519 | "ExecuteTime": { 520 | "end_time": "2017-05-21T05:28:51.343565Z", 521 | "start_time": "2017-05-21T05:28:51.299523Z" 522 | } 523 | }, 524 | "outputs": [ 525 | { 526 | "name": "stdout", 527 | "output_type": "stream", 528 | "text": [ 529 | "5-fold cross validation mean square error : 0.86\n" 530 | ] 531 | } 532 | ], 533 | "source": [ 534 | "d = []\n", 535 | "k_fold = 5\n", 536 | "for k in range(k_fold): \n", 537 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/k_fold)\n", 538 | " ridge = Ridge(alpha=0.001, max_iter=10000, tol=0.0001)\n", 539 | " ridge.fit(X_train, y_train)\n", 540 | " d.append(np.mean((ridge.predict(X_test) - y_test) ** 2))\n", 541 | "print('%d-fold cross validation mean square error : %.2f' % (k_fold, np.mean(d)))\n", 542 | "ridge = Ridge(alpha=0.001, max_iter=10000, tol=0.0001)\n", 543 | "_ = ridge.fit(X, y)" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": 137, 549 | "metadata": { 550 | "ExecuteTime": { 551 | "end_time": "2017-05-21T05:37:59.969171Z", 552 | "start_time": "2017-05-21T05:37:59.963504Z" 553 | } 554 | }, 555 | "outputs": [ 556 | { 557 | "name": "stdout", 558 | "output_type": "stream", 559 | "text": [ 560 | "The 10 most important parameters are:\n", 561 | "Penaltymissed : -0.59\n", 562 | "Goal : 0.43\n", 563 | "Assist : 0.36\n", 564 | "Redcard : -0.31\n", 565 | "Goal Opponent : -0.22\n", 566 | "Shotonpost : 0.19\n", 567 | "ShotsOT : 0.16\n", 568 | "Goal Team : 0.15\n", 569 | "Yellowcard : -0.09\n", 570 | "Place : -0.09\n" 571 | ] 572 | } 573 | ], 574 | "source": [ 575 | "top_k = 10\n", 576 | "l = np.argsort(list(map(np.abs, ridge.coef_)))[::-1][:top_k]\n", 577 | "print(\"The %d most important parameters are:\\n%s\" % (top_k,'\\n'.join(['%s : %.2f' % (a,b) \n", 578 | " for a,b in zip(X_train.columns[l], ridge.coef_[l])])))" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": 180, 584 | "metadata": { 585 | "ExecuteTime": { 586 | "end_time": "2017-05-21T07:46:19.524179Z", 587 | "start_time": "2017-05-21T07:45:50.166761Z" 588 | } 589 | }, 590 | "outputs": [ 591 | { 592 | "name": "stdout", 593 | "output_type": "stream", 594 | "text": [ 595 | "RF : oob score : 1.00\n", 596 | "RF : 5-fold cross validation mean square error : 0.74\n" 597 | ] 598 | } 599 | ], 600 | "source": [ 601 | "rf = RandomForestRegressor(oob_score=True, n_estimators=100)\n", 602 | "rf.fit(X, y)\n", 603 | "print('RF : oob score : %.2f' % rf.oob_score)\n", 604 | "d = []\n", 605 | "k_fold = 5\n", 606 | "for k in range(k_fold): \n", 607 | " X_train, X_test, 
y_train, y_test = train_test_split(X, y, test_size=1/k_fold)\n", 608 | " rf = RandomForestRegressor(oob_score=True, n_estimators=100)\n", 609 | " rf.fit(X_train, y_train)\n", 610 | " d.append(np.mean((rf.predict(X_test) - y_test) ** 2))\n", 611 | "print('RF : %d-fold cross validation mean square error : %.2f' % (k_fold, np.mean(d)))" 612 | ] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "metadata": { 618 | "collapsed": true 619 | }, 620 | "outputs": [], 621 | "source": [] 622 | } 623 | ], 624 | "metadata": { 625 | "kernelspec": { 626 | "display_name": "workenv", 627 | "language": "python", 628 | "name": "workenv" 629 | }, 630 | "language_info": { 631 | "codemirror_mode": { 632 | "name": "ipython", 633 | "version": 3 634 | }, 635 | "file_extension": ".py", 636 | "mimetype": "text/x-python", 637 | "name": "python", 638 | "nbconvert_exporter": "python", 639 | "pygments_lexer": "ipython3", 640 | "version": "3.5.2" 641 | }, 642 | "nav_menu": {}, 643 | "toc": { 644 | "navigate_menu": true, 645 | "number_sections": true, 646 | "sideBar": true, 647 | "threshold": 6, 648 | "toc_cell": false, 649 | "toc_section_display": "block", 650 | "toc_window_display": false 651 | } 652 | }, 653 | "nbformat": 4, 654 | "nbformat_minor": 2 655 | } 656 | -------------------------------------------------------------------------------- /KaggleSoccer/README.md: -------------------------------------------------------------------------------- 1 | # Parsing and using Soccer data 2 | 3 | ## First Part - Parsing data 4 | 5 | The first part aims to parse data from multiple websites : games, teams, players, etc. 6 | To collect data about a team and dump a json file (must have created a ``./teams`` folder) : 7 | ``python3 dumper.py team_name`` 8 | 9 | ## Second Part - Data Analysis 10 | 11 | The second part is to analyze the dataset to understand what I can do with it. 
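For example, a team file dumped by `dumper.py` can be flattened into a pandas dataframe with `convert_df_games` from `parser.py`. A minimal sketch, assuming a `./teams/monaco.json` file has already been dumped (the team name and path are only illustrative):

```
import json
import parser  # KaggleSoccer/parser.py

# Team file produced by `python3 dumper.py monaco` (illustrative path)
with open('./teams/monaco.json', 'r') as f:
    team = json.load(f)

# One row per played game, with goals, line-ups and match statistics
df = parser.convert_df_games(team)
print(df[['date', 'opponent', 'place', 'result']].head())
```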
12 | 13 | ![Correlation Matrix](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/psg_stats.png) 14 | 15 | ![PSG vs Saint-Etienne](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/psg_ste.png) -------------------------------------------------------------------------------- /KaggleSoccer/df.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleSoccer/df.pkl -------------------------------------------------------------------------------- /KaggleSoccer/dumper.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | import time 4 | from datetime import datetime 5 | import json 6 | from tqdm import tqdm 7 | import numpy as np 8 | import seaborn as sns 9 | from matplotlib import patches 10 | import pandas as pd 11 | import parser 12 | import sys 13 | import random 14 | from collections import Counter 15 | from bs4 import BeautifulSoup 16 | from sklearn import preprocessing 17 | import urllib 18 | 19 | def convert_int(string): 20 | try: 21 | string = int(string) 22 | return string 23 | except: 24 | return string 25 | 26 | def get_team(request, link=None): 27 | team = {} 28 | if not link: 29 | team_searched = request 30 | team_searched = urllib.parse.quote(team_searched.encode('utf-8')) 31 | search_link = "http://us.soccerway.com/search/teams/?q={}".format(team_searched) 32 | response = requests.get(search_link) 33 | bs = BeautifulSoup(response.text, 'lxml') 34 | results = bs.find("ul", class_='search-results') 35 | # Take the first results 36 | try: 37 | link = "http://us.soccerway.com" + results.find_all('a')[0]['href'] 38 | print('Please check team link:', link) 39 | team['id_'] = results.find_all('a')[0]["href"].split('/')[4] 40 | team['name'] = results.find_all('a')[0].text 41 | team['country'] = results.find_all('a')[0]["href"].split('/')[2] 42 | except: 43 | print('No team found !') 44 | else: 45 | team['id_'] = link.split('/')[6] 46 | team['name'] = link.split('/')[5] 47 | team['country'] = link.split('/')[4] 48 | return team 49 | 50 | def get_games(team, nb_pages=12): 51 | games = [] 52 | for page_number in range(nb_pages): 53 | link_base = 'http://us.soccerway.com/a/block_team_matches?block_id=page_team_1_block_team_matches_3&callback_params=' 54 | link_ = urllib.parse.quote('{"page":0,"bookmaker_urls":[],"block_service_id":"team_matches_block_teammatches","team_id":%s,\ 55 | "competition_id":0,"filter":"all","new_design":false}' % team['id_']) + '&action=changePage¶ms=' + urllib.parse.quote('{"page":-%s}' % (page_number)) 56 | link = link_base + link_ 57 | response = requests.get(link) 58 | 59 | test = json.loads(response.text)['commands'][0]['parameters']['content'] 60 | bs = BeautifulSoup(test, 'lxml') 61 | 62 | for kind in ['even', 'odd']: 63 | for elem in bs.find_all('tr', class_ = kind): 64 | game = {} 65 | game["date"] = elem.find('td', {'class': ["full-date"]}).text 66 | game["competition"] = elem.find('td', {'class': ["competition"]}).text 67 | game["team_a"] = elem.find('td', class_='team-a').text 68 | game["team_b"] = elem.find('td', class_='team-b').text 69 | game['link'] = "http://us.soccerway.com" + elem.find('td', class_='score-time').find('a')['href'] 70 | game["score"] = elem.find('td', class_='score-time').text.replace(' ','') 71 | if 'E' in game["score"]: 72 | game["score"] = game['score'].replace('E','') 73 | 
game['extra_time'] = True 74 | if 'P' in game["score"]: 75 | game["score"] = game['score'].replace('P','') 76 | game['penalties'] = True 77 | if datetime.strptime(game["date"], '%d/%m/%y') < datetime.now(): 78 | game = parser.get_score_details(game, team) 79 | time.sleep(random.uniform(0, 0.25)) 80 | game.update(parser.get_goals(game['link'])) 81 | else: 82 | del game['score'] 83 | games.append(game) 84 | games = sorted(games, key=lambda x:datetime.strptime(x['date'], '%d/%m/%y')) 85 | team['games'] = games 86 | return team 87 | 88 | def get_squad(team, season_path='./seasons_codes.json'): 89 | with open(season_path, 'r') as f: 90 | seasons = json.load(f)[team["country"]] 91 | 92 | team['squad'] = {} 93 | for k,v in seasons.items(): 94 | link_base = 'http://us.soccerway.com/a/block_team_squad?block_id=page_team_1_block_team_squad_3&callback_params=' 95 | link_ = urllib.parse.quote('{"team_id":%s}' % team['id_']) + '&action=changeSquadSeason¶ms=' + urllib.parse.quote('{"season_id":%s}' % v) 96 | link = link_base + link_ 97 | response = requests.get(link) 98 | test = json.loads(response.text)['commands'][0]['parameters']['content'] 99 | bs = BeautifulSoup(test, 'lxml') 100 | 101 | players = bs.find('tbody').find_all('tr') 102 | squad = [{ 103 | k: convert_int(player.find('td', class_=k).text) 104 | for k in [k for k,v in Counter(np.concatenate([elem.attrs['class'] for elem in player.find_all('td')])).items() 105 | if v < 2 and k not in ['photo', '', 'flag']] 106 | } for player in players] 107 | team['squad'][k] = squad 108 | try: 109 | coach = {'position': 'Coach', 'name':bs.find_all('tbody')[1].text} 110 | team['coach'][k] = coach 111 | except: pass 112 | return team 113 | 114 | if __name__ == '__main__': 115 | if len(sys.argv) > 1: 116 | request = ' '.join(sys.argv[1:]) 117 | else: 118 | raise ValueError('You must enter a requested team!') 119 | team = get_team(request) 120 | f = input('Satisfied ? (Y/n) ') 121 | count = 0 122 | while f not in ['Y', 'y','yes','']: 123 | if count < 3: 124 | request = input('Enter a new request : ') 125 | link = None 126 | else: 127 | link = input('Paste team link : ') 128 | team = get_team(request, link) 129 | f = input('Satisfied ? 
(Y/n)') 130 | count += 1 131 | team = get_games(team) 132 | team = get_squad(team) 133 | with open('./teams/%s.json' % team["name"].lower(), 'w') as f: 134 | json.dump(team, f) 135 | -------------------------------------------------------------------------------- /KaggleSoccer/parser.py: -------------------------------------------------------------------------------- 1 | import urllib 2 | import requests 3 | import time 4 | import pandas as pd 5 | import numpy as np 6 | import datetime 7 | import json 8 | from tqdm import tqdm 9 | import random 10 | from bs4 import BeautifulSoup 11 | from collections import Counter 12 | 13 | def get_score_details(game, team): 14 | if game["team_a"] == team['name']: 15 | game['place'] = 'home' 16 | try: 17 | game['nb_goals_{}'.format(team['name'])] = int(game['score'].split('-')[0]) 18 | game['nb_goals_adv'] = int(game['score'].split('-')[1]) 19 | except:pass 20 | else: 21 | game['place'] = 'away' 22 | try: 23 | game['nb_goals_{}'.format(team['name'])] = int(game['score'].split('-')[1]) 24 | game['nb_goals_adv'] = int(game['score'].split('-')[0]) 25 | except:pass 26 | try: 27 | if game['nb_goals_{}'.format(team['name'])] > game['nb_goals_adv']: 28 | game['result'] = 'WIN' 29 | elif game['nb_goals_{}'.format(team['name'])] == game['nb_goals_adv']: 30 | game['result'] = 'TIE' 31 | else: 32 | game['result'] = 'LOST' 33 | except:pass 34 | return game 35 | 36 | def shot_team(row, columns, team_name): 37 | try: 38 | for col in columns: 39 | if row['team_a'] == team_name: 40 | row['{}_{}'.format(col.lower().replace(' ','_'), team_name.lower())] = row[col]['team_a'] 41 | row['{}_adv'.format(col.lower().replace(' ','_'))] = row[col]['team_b'] 42 | else: 43 | row['{}_{}'.format(col.lower().replace(' ','_'), team_name.lower())] = row[col]['team_b'] 44 | row['{}_adv'.format(col.lower().replace(' ','_'))] = row[col]['team_a'] 45 | except: pass 46 | return row 47 | 48 | def convert_team_name(row, team_name): 49 | for col in ['goals', 'players_team', "subs_in"]: 50 | try: 51 | if row['team_a'] == team_name: 52 | row['{}_{}'.format(col, team_name.lower())] = row['{}_a'.format(col)] 53 | row['{}_adv'.format(col)] = row['{}_b'.format(col)] 54 | else: 55 | row['{}_{}'.format(col, team_name.lower())] = row['{}_b'.format(col)] 56 | row['{}_adv'.format(col)] = row['{}_a'.format(col)] 57 | del row['{}_a'.format(col)], row['{}_b'.format(col)] 58 | except: pass 59 | return row 60 | 61 | def player_per_opponent(df, player, team_name): 62 | stats_player = {} 63 | for opponent in set(df.opponent): 64 | try: 65 | game_played = len([e for e in list(df[df.opponent == opponent]["players_team_%s" % team_name.lower()]) + 66 | list(df[df.opponent == opponent]["subs_in_%s" % team_name.lower()]) 67 | if player in e]) 68 | if game_played > 0: 69 | goals = len([e for e in np.concatenate([elem for elem in df[df.opponent == opponent]["goals_%s" % team_name.lower()] if type(elem) == list]) 70 | if e['player'] == player]) 71 | assist = len([e for e in np.concatenate([elem for elem in df[df.opponent == opponent]["goals_%s" % team_name.lower()] if type(elem) == list]) 72 | if 'assist' in e and e['assist'] == player]) 73 | de = goals + assist 74 | 75 | stats_player[opponent] = {'goals':goals, 76 | 'goals_game':goals/game_played, 77 | 'decisive_game':de/game_played, 78 | 'assists':assist, 79 | 'decisive':de, 80 | 'games':game_played} 81 | except: pass 82 | return stats_player 83 | 84 | def ratio_one_opponent(df, opponent, team_name, top_k=10): 85 | df_oppo = df[df.opponent == opponent] 86 | game_player 
= dict(Counter(np.concatenate(list(df_oppo["players_team_%s" % team_name.lower()]) + 87 | list(df_oppo["subs_in_%s" % team_name.lower()]))).items()) 88 | goal_counter = Counter([elem['player'] for elem in np.concatenate(list(df_oppo["goals_%s" % team_name.lower()]))]) 89 | assist_counter = Counter([elem['assist'] for elem in np.concatenate(list(df_oppo["goals_%s" % team_name.lower()])) if "assist" in elem]) 90 | goal_ratio = sorted([(k, v/game_player[k]) for k,v in dict(goal_counter).items() if k in game_player and game_player[k] >= 3 ], 91 | key=lambda x:x[1], reverse=True)[:top_k] 92 | assist_ratio = sorted([(k, v/game_player[k]) for k,v in dict(assist_counter).items() if k in game_player and game_player[k] >= 3 ], 93 | key=lambda x:x[1], reverse=True)[:top_k] 94 | decisive_ratio = sorted([(k, v/game_player[k]) for k,v in dict(goal_counter+assist_counter).items() if k in game_player and game_player[k] >= 3 ], 95 | key=lambda x:x[1], reverse=True)[:top_k] 96 | return goal_ratio, assist_ratio, decisive_ratio 97 | 98 | def get_goals_team(bs, team): 99 | if bs.find('span', class_='bidi').text != ' 0 - 0': 100 | goals = [{'player':elem.find('a').text, 101 | 'time':elem.find('span', class_='minute').text.replace("'",''), 102 | 'assist': elem.find('span', class_='assist')} for elem in bs.find('div', class_='block_match_goals').find_all('td', class_='player-{}'.format(team)) if elem.text !='\n\n'] 103 | for goal in goals: 104 | if goal['assist'] != None: 105 | goal['assist'] = goal['assist'].text.replace('(assist by ','').replace(')','') 106 | else: 107 | del goal['assist'] 108 | else: 109 | goals=[] 110 | return goals 111 | 112 | def get_lineups(bs): 113 | lineups = {} 114 | players = [elem.text.replace('\n','') for elem in bs.find_all('td', class_='large-link')] 115 | lineups["players_team_a"] = players[:11] 116 | lineups["players_team_b"] = players[11:22] 117 | lineups['subs_in_a'] = [elem[:elem.index('for')] for elem in players[22:29] if 'for' in elem] 118 | lineups['subs_in_b'] = [elem[:elem.index('for')] for elem in players[29:36] if 'for' in elem] 119 | return lineups 120 | 121 | def get_stats(bs): 122 | l = bs.find('div', class_='block_match_stats_plus_chart').find('iframe')['src'] 123 | response = requests.get('http://us.soccerway.com/' + l) 124 | bs2 = BeautifulSoup(response.text,'lxml') 125 | stats_list = [elem.text[1:-1].split('\n') for elem in bs2.find('table').find('tr').find_all('tr')[1:]][0::2] 126 | stats = {elem[1]:{'team_a': int(elem[0]), 'team_b':int(elem[2])} for elem in stats_list} 127 | return stats 128 | 129 | def get_goals(link): 130 | game = {} 131 | response = requests.get(link) 132 | bs = BeautifulSoup(response.text, 'lxml') 133 | try: 134 | game['goals_a'], game['goals_b'] = [get_goals_team(bs, 'a'), get_goals_team(bs, 'b')] 135 | except: pass 136 | try: 137 | game.update(get_lineups(bs)) 138 | except: pass 139 | try: 140 | game.update(get_stats(bs)) 141 | except: pass 142 | return game 143 | 144 | def convert_df_games(dict_file): 145 | df = pd.DataFrame(dict_file["games"]) 146 | df['date'] = pd.to_datetime(df.date.apply(lambda x:'/'.join([x.split('/')[1],x.split('/')[0], x.split('/')[2]]))) 147 | df['month'] = df.date.apply(lambda x:x.month) 148 | df['year'] = df.date.apply(lambda x:x.year) 149 | df = df[df.date < datetime.datetime.now()] 150 | df = df.sort_values('date', ascending=False) 151 | team_name = df.team_a.value_counts().index[0] 152 | df['opponent'] = df.apply(lambda x:(x['team_a']+x['team_b']).replace(team_name, ''), axis=1) 153 | cols = ['Corners', 
                              'Fouls', 'Offsides', 'Shots on target', 'Shots wide']
154 |     df = df.apply(lambda x:shot_team(x, cols, team_name),axis=1)
155 |     df = df.apply(lambda x:convert_team_name(x, team_name),axis=1)
156 |     df = df.drop(cols, axis=1)
157 |     return df
158 | 
--------------------------------------------------------------------------------
/KaggleTaxiTrip/README.md:
--------------------------------------------------------------------------------
1 | ## New York City Taxi Trip Duration
2 | 
3 | Kaggle playground to predict the total ride duration of taxi trips in New York City.
4 | 
5 | ### First part - Data exploration
6 | The first part is to analyze the dataframe and observe correlations between variables.
7 | ![Distributions](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleTaxiTrip/pic/download.png)
8 | ![Rush Hour](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleTaxiTrip/pic/rush_hour.png)
9 | 
10 | ### Second part - Clustering
11 | The goal of this playground is to predict the trip duration of the test set. We know that some neighborhoods are more congested, so I used K-Means to compute geo-clusters for pickup and drop-off locations.
12 | ![Cluster](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleTaxiTrip/pic/nyc_clusters.png)
13 | 
14 | ### Third part - Cleaning and feature selection
15 | I found some oddly long trips : one-day trips with a mean speed < 1 km/h.
16 | ![Outliers](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleTaxiTrip/pic/outliners.png)
17 | I removed these outliers.
18 | 
19 | I also added features derived from the available data : Haversine distance, Manhattan distance, per-cluster means and PCA-rotated coordinates.
20 | 
21 | ### Fourth part - Prediction
22 | I compared Random Forest and XGBoost.
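The playground is scored with the Root Mean Squared Logarithmic Error. A minimal NumPy sketch of the metric, assuming `y_true` and `y_pred` are arrays of trip durations in seconds:

```
import numpy as np

def rmsle(y_true, y_pred):
    # Square root of the mean squared difference between log(1 + duration) values
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```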
23 | Current Root Mean Squared Logarithmic Error : 0.391
24 | 
25 | Feature importance for RF & XGBoost
26 | ![Feature importance](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleTaxiTrip/pic/feat_importance.png)
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/download.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/download.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/feat_importance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/feat_importance.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/nyc_clusters.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/nyc_clusters.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/outliners.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/outliners.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/rush_hour.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/rush_hour.png
--------------------------------------------------------------------------------
/KaggleTitanic/README.md:
--------------------------------------------------------------------------------
1 | ## Titanic: Machine Learning from Disaster
2 | 
3 | The Titanic challenge on Kaggle is a competition whose goal is to predict the survival or death of a given passenger from a set of variables describing them, such as their age, sex, or passenger class on the boat.
--------------------------------------------------------------------------------
/KaggleWiki/README.md:
--------------------------------------------------------------------------------
1 | # Parsing and using Wikipedia Data
2 | 
3 | ## First Part - French governments
4 | 
5 | The first part aims to parse data from French Wikipedia pages about French governments. I play with the data to get insights about governments : number of ministers, ministers' studies, etc.
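A minimal usage sketch of `wikipedia_parser.py` (it relies on Selenium with ChromeDriver; the government page URL below is only an illustrative example):

```
from collections import Counter
import wikipedia_parser as wp

# Illustrative French Wikipedia page of a government
ministres = wp.get_ministres('https://fr.wikipedia.org/wiki/Gouvernement_Bernard_Cazeneuve')
print(len(ministres), 'ministers found')
# Count the schools attended by the ministers (ENA, IEP, X, etc.)
print(Counter(school for m in ministres for school in (m['Formation'] or [])))
```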
6 | -------------------------------------------------------------------------------- /KaggleWiki/wikipedia_parser.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from bs4 import BeautifulSoup 4 | from selenium import webdriver 5 | import time 6 | 7 | def convert_formation(s): 8 | mapping = {'IEP' : ['Institut d\'études politiques', 'IEP'], 9 | 'ENA' : ['ENA', 'École nationale d\'administration'], 10 | 'ENS': ['École normale supérieure', 'ENS'], 11 | 'ENPC':['École nationale des ponts et chaussées', 'ENPC'], 12 | 'X': ['École polytechnique'], 13 | 'HEC': ['HEC', 'École des hautes études commerciales'], 14 | 'ESCP': ['ESCP', 'École supérieure de commerce de Paris'], 15 | 'ESSEC' : ['ESSEC', "École Supérieure des Sciences Economiques et Commerciales "] 16 | } 17 | etudes = [] 18 | for key, value in mapping.items(): 19 | for v in value: 20 | if v in s: 21 | etudes.append(key) 22 | break 23 | return etudes 24 | 25 | def get_ministres(link): 26 | browser = webdriver.Chrome() 27 | try: 28 | browser.get(link) 29 | time.sleep(0.2) 30 | except: 31 | time.sleep(1) 32 | text = browser.find_element_by_xpath('//div[@id="mw-content-text"]').get_attribute('innerHTML').strip() 33 | bs = BeautifulSoup(text, 'lxml') 34 | ministres_l = [[(k.text.replace('\u200d ','') , k.find('a')['href']) for k in e.find_parents()[1].find_all('td') if k.text !='' and k.find('a')] 35 | for e in bs.find_all('a', class_="image")] 36 | ministres = [{'Poste':k[0][0], 'Nom':k[::-1][1][0], 'Lien':'https://fr.wikipedia.org' + k[::-1][1][1]} for k in ministres_l if len(k) > 1][:-1] 37 | for m in ministres: 38 | m['Formation'] = get_formation(m['Lien']) 39 | browser.quit() 40 | return ministres 41 | 42 | def get_formation(link_bio): 43 | browser = webdriver.Chrome() 44 | browser.set_page_load_timeout(15) 45 | try: 46 | browser.get(link_bio) 47 | except: 48 | time.sleep(1.5) 49 | bs = BeautifulSoup(browser.find_element_by_xpath('//div[@id="mw-content-text"]').get_attribute('innerHTML').strip(), 'lxml') 50 | try: 51 | h3_elem = [e.find('span').text for e in bs.find_all('h3')] 52 | h3_elem_formation = [k for k in h3_elem if 'formation' in k.lower() or 'études' in k.lower() or 'etudes' in k.lower() or 'parcours' in k.lower() 53 | or "cursus" in k.lower()][0] 54 | f, s = h3_elem[h3_elem.index(h3_elem_formation):h3_elem.index(h3_elem_formation)+2] 55 | except: 56 | h2_elem = [e.find('span').text for e in bs.find_all('h2')[1:]] 57 | h2_elem_biographie = [k for k in h2_elem if 'biographie' in k.lower() or 'carrière' in k.lower() or 'formation' in k.lower() or 'parcours' in k.lower() 58 | or "cursus" in k.lower()][0] 59 | f, s = h2_elem[h2_elem.index(h2_elem_biographie):h2_elem.index(h2_elem_biographie)+2] 60 | try: 61 | first, second = [m.start() for m in re.finditer(f, bs.text)][1], [m.start() for m in re.finditer(s, bs.text)][1] 62 | browser.quit() 63 | s = bs.text[first:second].replace('\n', '') 64 | return convert_formation(re.sub(r'\[*\d*\]', '', s)) 65 | except: 66 | print('Error for %s' % link_bio) 67 | 68 | def get_previous_government_link(browser, link): 69 | try: 70 | browser.get(link) 71 | time.sleep(0.5) 72 | except: 73 | time.sleep(1.5) 74 | box = browser.find_element_by_xpath('//table[@class="infobox_v2"]') 75 | browser.execute_script("return arguments[0].scrollIntoView();", box) 76 | bs = BeautifulSoup(box.get_attribute('innerHTML').strip(), 'lxml') 77 | d = dict(set([([k.replace('|','').replace('\n','') for k in 
e.text.replace('\n\n','|').split('||') if k != ''][0], 78 | "https://fr.wikipedia.org" + e.find('a')['href']) 79 | for e in bs.find_all('tr') if e.find('img', {'alt':"Précédent"})])) 80 | duration = [e.find('td').text.encode().decode('ascii', 'ignore') for e in bs.find_all('tr') if e.find('th', text='Durée')][0] 81 | return d, duration 82 | 83 | 84 | def convert_duration(dur): 85 | splits = [''.join([s for s in elem if s.isdigit()]) for elem in dur.split('an')] 86 | if len(splits) == 2: 87 | return 365 * int(splits[0]) + int(splits[1]) 88 | else: 89 | return int(splits[0]) -------------------------------------------------------------------------------- /ObjectDetection/color_extract.py: -------------------------------------------------------------------------------- 1 | import cv2 2 | import numpy as np 3 | import sys 4 | import os 5 | import imp 6 | import glob 7 | from sklearn.cluster import KMeans 8 | import skimage.filters as skf 9 | import skimage.color as skc 10 | import skimage.morphology as skm 11 | from skimage.measure import label 12 | 13 | class Back(): 14 | """ 15 | Two algorithms are used together to separate background and foreground. 16 | One consider as background all pixel whose color is close to the pixels 17 | in the corners. This part is impacted by the `max_distance' and 18 | `use_lab' settings. 19 | The second one computes the edges of the image and uses a flood fill 20 | starting from all corners. 21 | """ 22 | def __init__(self, max_distance=5, use_lab=True): 23 | """ 24 | The possible settings are: 25 | - max_distance: The maximum distance for two colors to be 26 | considered closed. A higher value will yield to a more aggressive 27 | background removal. 28 | (default: 5) 29 | - use_lab: Whether to use the LAB color space to perform 30 | background removal. More expensive but closer to eye perception. 31 | (default: True) 32 | """ 33 | self._settings= { 34 | 'max_distance': max_distance, 35 | 'use_lab': use_lab, 36 | } 37 | 38 | def get(self, img): 39 | f = self._floodfill(img) 40 | g = self._global(img) 41 | m = f | g 42 | 43 | if np.count_nonzero(m) < 0.90 * m.size: 44 | return m 45 | 46 | ng, nf = np.count_nonzero(g), np.count_nonzero(f) 47 | 48 | if ng < 0.90 * g.size and nf < 0.90 * f.size: 49 | return g if ng > nf else f 50 | if ng < 0.90 * g.size: 51 | return g 52 | if nf < 0.90 * f.size: 53 | return f 54 | 55 | return np.zeros_like(m) 56 | 57 | def _global(self, img): 58 | h, w = img.shape[:2] 59 | mask = np.zeros((h, w), dtype=np.bool) 60 | max_distance = self._settings['max_distance'] 61 | 62 | if self._settings['use_lab']: 63 | img = skc.rgb2lab(img) 64 | 65 | # Compute euclidean distance of each corner against all other pixels. 66 | corners = [(0, 0), (-1, 0), (0, -1), (-1, -1)] 67 | for color in (img[i, j] for i, j in corners): 68 | norm = np.sqrt(np.sum(np.square(img - color), 2)) 69 | # Add to the mask pixels close to one of the corners. 70 | mask |= norm < max_distance 71 | 72 | return mask 73 | 74 | def _floodfill(self, img): 75 | back = self._scharr(img) 76 | # Binary thresholding. 77 | back = back > 0.05 78 | 79 | # Thin all edges to be 1-pixel wide. 80 | 81 | back = skm.skeletonize(back) 82 | # Edges are not detected on the borders, make artificial ones. 83 | back[0, :] = back[-1, :] = True 84 | back[:, 0] = back[:, -1] = True 85 | 86 | # Label adjacent pixels of the same color. 87 | labels = label(back, background=-1, connectivity=1) 88 | 89 | # Count as background all pixels labeled like one of the corners. 
90 | corners = [(1, 1), (-2, 1), (1, -2), (-2, -2)] 91 | for l in (labels[i, j] for i, j in corners): 92 | back[labels == l] = True 93 | 94 | # Remove remaining inner edges. 95 | return skm.opening(back) 96 | 97 | def _scharr(self, img): 98 | # Invert the image to ease edge detection. 99 | img = 1. - img 100 | grey = skc.rgb2grey(img) 101 | return skf.scharr(grey) 102 | 103 | class ColorExtractor(): 104 | def __init__(self): 105 | self._back = Back(max_distance=10, n_clusters=3) 106 | 107 | def get_bar_hist(self, filepath, full_return=False): 108 | image = cv2.imread(filepath) 109 | image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) 110 | back_mask = self._back.get(image) 111 | image_noback = np.zeros(image.shape) 112 | image_noback[~back_mask] = image[~back_mask] /255 113 | image_array = image[~back_mask] 114 | clt = KMeans(n_clusters = n_clusters) 115 | clt.fit(image_array) 116 | hist = self.centroid_histogram(clt) 117 | if full_return: 118 | bar = self.plot_colors(hist, clt.cluster_centers_) 119 | return image, image_noback, bar, hist 120 | else: 121 | return image, hist 122 | 123 | 124 | def centroid_histogram(self, clt): 125 | # grab the number of different clusters and create a histogram 126 | # based on the number of pixels assigned to each cluster 127 | numLabels = np.arange(0, len(np.unique(clt.labels_)) + 1) 128 | (hist, _) = np.histogram(clt.labels_, bins = numLabels) 129 | 130 | # normalize the histogram, such that it sums to one 131 | hist /= hist.astype("float").sum() 132 | 133 | # return the histogram 134 | return hist 135 | 136 | def plot_colors(self, hist, centroids): 137 | # initialize the bar chart representing the relative frequency 138 | # of each of the colors 139 | bar = np.zeros((50, 300, 3), dtype = "uint8") 140 | startX = 0 141 | 142 | zipped = zip(hist, centroids) 143 | zipped = sorted(zipped, reverse=True, key=lambda x : x[0]) 144 | 145 | # loop over the percentage of each cluster and the color of 146 | # each cluster 147 | for (percent, color) in zipped: 148 | # plot the relative percentage of each cluster 149 | endX = startX + (percent * 300) 150 | cv2.rectangle(bar, (int(startX), 0), (int(endX), 50), 151 | color.astype("uint8").tolist(), -1) 152 | startX = endX 153 | 154 | # return the bar chart 155 | return bar -------------------------------------------------------------------------------- /ObjectDetection/recognition.py: -------------------------------------------------------------------------------- 1 | import cv2 2 | from tensorflow.python.client import device_lib 3 | from darkflow.net.build import TFNet 4 | import numpy as np 5 | import glob 6 | import os 7 | 8 | def get_available_gpus(): 9 | """ 10 | Return the number of GPUs availableipytho 11 | """ 12 | local_device_protos = device_lib.list_local_devices() 13 | return [x.name for x in local_device_protos if x.device_type == 'GPU'] 14 | 15 | def define_options(config_path): 16 | """ 17 | Define the network configuration. Load the model and the weights. 18 | Threshold hard coded. 
19 | :return: option for tfnet object 20 | """ 21 | options = {} 22 | if len(get_available_gpus()) > 0 : 23 | options["gpu"] = 1 24 | else: 25 | options["gpu"] = 0 26 | 27 | model = config_path + 'cfg/yolo.cfg' 28 | weights = config_path + 'yolo.weights' 29 | if os.path.exists(model): 30 | options['model'] = model 31 | else: 32 | print('No cfg model') 33 | if os.path.exists(weights): 34 | options['load'] = weights 35 | else: 36 | print('No yolo weights, wget https://pjreddie.com/media/files/yolo.weights') 37 | options['config'] = config_path + 'cfg/' 38 | options["threshold"] = 0.1 39 | return options 40 | 41 | def non_max_suppression_fast(boxes, probs, overlap_thresh=0.1, max_boxes=30): 42 | """ 43 | Eliminating redundant object detection windows with a faster non maximum suppression method 44 | Greedily select high-scoring detections and skip detections that are significantly covered by 45 | a previously selected detection. 46 | :param boxes: list of boxes 47 | :param probs: list of probabilities relatives to the boxes 48 | """ 49 | 50 | # if there are no boxes, return an empty list 51 | if len(boxes) == 0: 52 | return [] 53 | 54 | # grab the coordinates of the bounding boxes 55 | x1 = boxes[:, 0] 56 | y1 = boxes[:, 1] 57 | x2 = boxes[:, 2] 58 | y2 = boxes[:, 3] 59 | 60 | np.testing.assert_array_less(x1, x2) 61 | np.testing.assert_array_less(y1, y2) 62 | 63 | # if the bounding boxes integers, convert them to floats -- 64 | # this is important since we'll be doing a bunch of divisions 65 | if boxes.dtype.kind == "i": 66 | boxes = boxes.astype("float") 67 | 68 | # initialize the list of picked indexes 69 | pick = [] 70 | 71 | # calculate the areas 72 | area = (x2 - x1 + 1) * (y2 - y1 + 1) 73 | 74 | # sort the bounding boxes 75 | idxs = np.argsort(probs) 76 | 77 | # keep looping while some indexes still remain in the indexes 78 | # list 79 | while len(idxs) > 0: 80 | # grab the last index in the indexes list and add the 81 | # index value to the list of picked indexes 82 | last = len(idxs) - 1 83 | i = idxs[last] 84 | pick.append(i) 85 | 86 | # find the intersection 87 | xx1_int = np.maximum(x1[i], x1[idxs[:last]]) 88 | yy1_int = np.maximum(y1[i], y1[idxs[:last]]) 89 | xx2_int = np.minimum(x2[i], x2[idxs[:last]]) 90 | yy2_int = np.minimum(y2[i], y2[idxs[:last]]) 91 | ww_int = np.maximum(0, xx2_int - xx1_int + 0.5) 92 | hh_int = np.maximum(0, yy2_int - yy1_int + 0.5) 93 | area_int = ww_int * hh_int 94 | 95 | # find the union 96 | area_union = area[i] + area[idxs[:last]] - area_int 97 | 98 | # compute the ratio of overlap 99 | overlap = area_int / (area_union + 1e-6) 100 | 101 | # delete all indexes from the index list that have 102 | idxs = np.delete(idxs, np.concatenate(([last], 103 | np.where(overlap > overlap_thresh)[0]))) 104 | 105 | if len(pick) >= max_boxes: 106 | break 107 | 108 | # return only the bounding boxes that were picked using the integer data type 109 | boxes = boxes[pick].astype("int") 110 | probs = probs[pick] 111 | return boxes, probs 112 | 113 | def predict_one(image, tfnet): 114 | """ 115 | Object detection in a picture 116 | :param image: image numpy array 117 | :param tfnet: net object 118 | :return: picture with object bounding boxes, predictions, confidence scores 119 | """ 120 | font = cv2.FONT_HERSHEY_SIMPLEX 121 | # prediction 122 | result = tfnet.return_predict(image) 123 | # Separate each label to do non max suppresion 124 | for label in set([k['label'] for k in result]): 125 | boxes = np.array([[k['topleft']['x'], k['topleft']['y'], k['bottomright']['x'], 
k['bottomright']['y']] for k in result 126 | if k['label'] == label]) 127 | probs = np.array([k['confidence'] for k in result if k['label'] == label]) 128 | # eliminate redundant overlapping detections 129 | b, p = non_max_suppression_fast(boxes, probs) 130 | for k in b: 131 | # skip boxes that are too large 132 | if np.abs(np.mean(np.array((k[0],k[1]))-np.array((k[2], k[3])))) < np.min(image.shape[:2])/5: 133 | cv2.rectangle(image, (k[0],k[1]), (k[2], k[3]), (255,0,0), 2) 134 | cv2.putText(image, label, (k[0],k[1]), font, 0.3, (255, 0, 0)) 135 | return image, b, p 136 | 137 | def predict(folder, nb_items=None, config_path='../darkflow/'): 138 | """ 139 | Multiple predictions 140 | :param folder: folder with pictures 141 | :param nb_items: number of pictures to predict 142 | :param config_path: weights and model files path 143 | :return: list of pictures with object bounding boxes, predictions, confidence scores 144 | """ 145 | images = [cv2.imread(f) for f in glob.glob(folder+'*')][:nb_items] 146 | results = [] 147 | tfnet = TFNet(define_options(config_path)) 148 | for image in images: 149 | results.append(predict_one(image, tfnet)) 150 | return results 151 | 152 | -------------------------------------------------------------------------------- /ProjectMovieRating/README.md: -------------------------------------------------------------------------------- 1 | # Predict IMDB movie rating 2 | 3 | Project inspired by Chuan Sun's [work](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset), but without Scrapy 4 | 5 | Main question: how can we tell the greatness of a movie before it is released in cinemas? 6 | 7 | ## First Part - Parsing data 8 | 9 | The first part aims to parse data from the IMDb and The Numbers websites: casting information, directors, production companies, awards, genres, budget, gross, description, imdb_rating, etc. 10 | To create the movie_contents.json file: 11 | ``python3 parser.py nb_elements`` 12 | 13 | ## Second Part - Data Analysis 14 | 15 | The second part is to analyze the dataframe and observe correlations between variables. For example, are the movie awards correlated with the worldwide gross? Is a movie liked more when its cast is liked more? 16 | See the Jupyter notebook file. 17 | 18 | ![Correlation Matrix](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/corr_matrix.png) 19 | 20 | As we can see in the picture above, the IMDB score is correlated with the number of awards and the gross, but not really with the production budget or the number of Facebook likes of the cast. 21 | Obviously, domestic and worldwide gross are highly correlated. However, the bigger the production budget, the bigger the gross. 22 | As shown in the notebook, the budget is not really correlated with the number of awards. 23 | What's funny is that the popularity of the third most famous actor matters more for the IMDB score than the popularity of the most famous actor (correlation 0.2 vs 0.08). 24 | (Many other charts in the Jupyter notebook) 25 | 26 | ## Third Part - Predict the IMDB score 27 | 28 | Machine Learning to predict the IMDB score with the meaningful variables. 29 | Using a Random Forest algorithm (500 estimators); a minimal training sketch is shown below.
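A minimal training sketch (not the exact notebook code): it assumes the dataframe returned by `create_dataframe` in `parser.py`, and the feature list is only illustrative; the notebook keeps a larger set of meaningful variables.

```python
# Minimal sketch: fit a Random Forest on the parsed movie dataframe.
# `df` is assumed to come from parser.create_dataframe(); the feature list is illustrative.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

features = ["production_budget", "worldwide_gross", "total_cast_fb_likes", "duration_sec"]
X = df[features].fillna(0)
y = df["idmb_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=500, random_state=42)
model.fit(X_train, y_train)
print("R^2 on held-out movies:", model.score(X_test, y_test))

# Feature importances feed the chart below.
for name, importance in sorted(zip(features, model.feature_importances_), key=lambda t: -t[1]):
    print(name, round(importance, 3))
```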
30 | ![Most important features](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/features.png) 31 | 32 | -------------------------------------------------------------------------------- /ProjectMovieRating/__pycache__/imdb_movie_content.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/ProjectMovieRating/__pycache__/imdb_movie_content.cpython-35.pyc -------------------------------------------------------------------------------- /ProjectMovieRating/__pycache__/parser.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/ProjectMovieRating/__pycache__/parser.cpython-35.pyc -------------------------------------------------------------------------------- /ProjectMovieRating/genre.json: -------------------------------------------------------------------------------- 1 | {"Action": "0", 2 | "Adventure": "17", 3 | "Animation": "14", 4 | "Biography": "20", 5 | "Comedy": "21", 6 | "Crime": "10", 7 | "Documentary": "3", 8 | "Drama": "18", 9 | "Family": "19", 10 | "Fantasy": "6", 11 | "History": "8", 12 | "Horror": "4", 13 | "Music": "12", 14 | "Musical": "2", 15 | "Mystery": "16", 16 | "Romance": "9", 17 | "Sci-Fi": "13", 18 | "Short": "5", 19 | "Sport": "11", 20 | "Thriller": "7", 21 | "War": "15", 22 | "Western": "1"} -------------------------------------------------------------------------------- /ProjectMovieRating/imdb_movie_content.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | from bs4 import BeautifulSoup 4 | import time 5 | import random 6 | import re 7 | import datetime 8 | 9 | def parse_facebook_likes_number(num_likes_string): 10 | if num_likes_string[-1] == 'K': 11 | thousand = 1000 12 | else: 13 | thousand = 1 14 | return float(num_likes_string.replace('K','')) * thousand 15 | 16 | def parse_duration(time_string): 17 | time_string = time_string.split('min')[0] 18 | x = time.strptime(time_string,'%Hh%M') 19 | return datetime.timedelta(hours=x.tm_hour,minutes=x.tm_min,seconds=x.tm_sec).total_seconds() 20 | 21 | class ImdbMovieContent(): 22 | def __init__(self, movies): 23 | self.movies = movies 24 | self.base_url = "http://www.imdb.com" 25 | 26 | def get_facebook_likes(self, entity_id): 27 | if entity_id.startswith('nm'): 28 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Fname%2F{}%2F&colorscheme=light".format(entity_id) 29 | elif entity_id.startswith('tt'): 30 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Ftitle%2F{}%2F&colorscheme=light".format(entity_id) 31 | else: 32 | url = None 33 | time.sleep(random.uniform(0, 0.25)) # randomly snooze a time within [0, 0.25] second 34 | try: 35 | response = requests.get(url) 36 | soup = BeautifulSoup(response.text, "lxml") 37 | sentence = soup.find_all(id="u_0_2")[0].span.string # get sentence like: "43K people like this" 38 | num_likes = sentence.split(" ")[0] 39 | except Exception as e: 40 | return None # widget parsing failed, no like count available 41 | return parse_facebook_likes_number(num_likes) 42 | 43 | def get_id_from_url(self, url): 44 | if url is None: 45 | return None 46 | return url.split('/')[4] 47 | 48 | def parse(self): 49 |
movies_content = [] 50 | for movie in self.movies: 51 | imdb_url = movie['imdb_url'] 52 | response = requests.get(imdb_url) 53 | bs = BeautifulSoup(response.text, 'lxml') 54 | movies_content.append(self.get_content(bs)) 55 | return movies_content 56 | 57 | def get_awards(self, movie_link): 58 | awards_url = movie_link + 'awards' 59 | response = requests.get(awards_url) 60 | bs = BeautifulSoup(response.text, 'lxml') 61 | awards = bs.find_all('tr') 62 | award_dict = [] 63 | for award in awards: 64 | if 'rowspan' in award.find('td').attrs: 65 | rowspan = int(award.find('td')['rowspan']) 66 | if rowspan == 1: 67 | award_dict.append({'category' : award.find('span', class_='award_category').text, 68 | 'type' : award.find('b').text, 69 | 'award': award.find('td', class_='award_description').text.split('\n')[1].replace(' ', '')}) 70 | else: 71 | index = awards.index(award) 72 | dictt = {'category':award.find('span', class_='award_category').text, 73 | 'type' : award.find('b').text} 74 | awards_ = [] 75 | for elem in awards[index:index+rowspan]: 76 | award_dict.append({'category':award.find('span', class_='award_category').text, 77 | 'type' : award.find('b').text, 78 | 'award': elem.find('td', class_='award_description').text.split('\n')[1].replace(' ', '')}) 79 | award_nominated = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Nominated'] 80 | award_won = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Won'] 81 | return award_nominated, award_won 82 | 83 | def get_content(self, bs): 84 | movie = {} 85 | try: 86 | title_year = bs.find('div', class_ = 'title_wrapper').find('h1').text.encode('latin').decode('utf-8', 'ignore').replace(') ','').split('(') 87 | movie['movie_title'] = title_year[0] 88 | except: 89 | movie['movie_title'] = None 90 | try: 91 | movie['title_year'] = title_year[1] 92 | except: 93 | movie['title_year'] = None 94 | # try: 95 | # movie['genres'] = '|'.join([genre.text.replace(' ','') for genre in bs.find('div', {'itemprop': 'genre'}).find_all('a')]) 96 | # except: 97 | # movie['genres'] = None 98 | try: 99 | title_details = bs.find('div', {'id':'titleDetails'}) 100 | movie['language'] = '|'.join([language.text for language in title_details.find_all('a', href=re.compile('language'))]) 101 | movie['country'] = '|'.join([country.text for country in title_details.find_all('a', href=re.compile('country'))]) 102 | except: 103 | movie['language'] = None 104 | try: 105 | keywords = bs.find_all('span', {'itemprop':'keywords'}) 106 | movie['keywords'] = '|'.join([key.text for key in keywords]) 107 | except: 108 | movie['keywords'] = None 109 | try: 110 | movie['storyline'] = bs.find('div', {'id':'titleStoryLine'}).find('div', {'itemprop':'description'}).text.replace('\n', '') 111 | except: 112 | movie['storyline'] = None 113 | try: 114 | movie['contentRating'] = bs.find('span', {'itemprop':'contentRating'}).text.split(' ')[1] 115 | except: 116 | movie['contentRating'] = None 117 | try: 118 | movie['color'] = bs.find('a', href=re.compile('colors')).text 119 | except: 120 | movie['color'] = None 121 | try: 122 | movie['idmb_score'] = bs.find('span', {'itemprop':'ratingValue'}).text 123 | except: 124 | movie['idmb_score'] = None 125 | try: 126 | movie['num_voted_users'] = bs.find('span', {'itemprop':'ratingCount'}).text.replace(',',) 127 | except: 128 | movie['num_voted_users'] = None 129 | try: 130 | movie['duration_sec'] = parse_duration(bs.find('time', {'itemprop':'duration'}).text.replace(' 
','').replace('\n', '')) 131 | except: 132 | movie['duration_sec'] = None 133 | try: 134 | review_counts = [int(count.text.split(' ')[0].replace(',', '')) for count in bs.find_all('span', {'itemprop':'reviewCount'})] 135 | movie['num_user_for_reviews'] = review_counts[0] 136 | except: 137 | movie['num_user_for_reviews'] = None 138 | try: 139 | prod_co = bs.find_all('span', {'itemtype': re.compile('Organization')}) 140 | movie['production_co'] = [elem.text.replace('\n','') for elem in prod_co] 141 | except: 142 | movie['production_co'] = None 143 | try: 144 | movie['num_critic_for_reviews'] = review_counts[1] 145 | except: 146 | movie['num_critic_for_reviews'] = None 147 | try: 148 | actors = bs.find('table', {'class':'cast_list'}).find_all('a', {'itemprop':'url'}) 149 | pairs_for_rows = [(self.base_url + actor['href'], actor.text.replace('\n', '')[1:]) for actor in actors] 150 | movie['cast_info'] = [{'actor_name' : actor[1], 151 | 'actor_link' : actor[0], 152 | 'actor_fb_likes': self.get_facebook_likes(self.get_id_from_url(actor[0]))} for actor in pairs_for_rows] 153 | except: 154 | movie['cast_info'] = None 155 | try: 156 | director = bs.find('span', {'itemprop':'director'}).find('a') 157 | director = (self.base_url + director['href'].replace('?','/?'), director.text) 158 | movie['director_info'] = {'director_name':director[1], 159 | 'director_link':director[0], 160 | 'director_fb_links': self.get_facebook_likes(self.get_id_from_url(director[0]))} 161 | except: 162 | movie['director_info'] = None 163 | try: 164 | movie['movie_imdb_link'] = bs.find('link')['href'] 165 | movie['num_facebook_like'] = self.get_facebook_likes(self.get_id_from_url(movie['movie_imdb_link'])) 166 | except: 167 | movie['num_facebook_like'] = None 168 | try: 169 | poster_image_url = bs.find('div', {'class':'poster'}).find('img')['src'] 170 | movie['image_urls'] = poster_image_url.split("_V1_")[0] + "_V1_.jpg" 171 | except: 172 | movie['image_urls'] = None 173 | try: 174 | award_nominated, award_won = self.get_awards(movie['movie_imdb_link']) 175 | movie['awards'] = {"won": award_won, 'nominated': award_nominated} 176 | except Exception as e: 177 | print(e) 178 | movie['awards'] = None 179 | return movie 180 | 181 | 182 | -------------------------------------------------------------------------------- /ProjectMovieRating/parser.py: -------------------------------------------------------------------------------- 1 | from bs4 import BeautifulSoup 2 | import json 3 | import requests 4 | import urllib 5 | from tqdm import tqdm 6 | import locale 7 | import pandas as pd 8 | import re 9 | import time 10 | import random 11 | import sys 12 | 13 | from imdb_movie_content import ImdbMovieContent 14 | 15 | def parse_price(price): 16 | """ 17 | Convert string price to numbers 18 | """ 19 | if not price: 20 | return 0 21 | price = price.replace(',', '') 22 | return locale.atoi(re.sub('[^0-9,]', "", price)) 23 | 24 | def get_movie_budget(): 25 | """ 26 | Parsing the numbers website to get the budget data. 
27 | :return: list of dictionnaries with budget and gross 28 | """ 29 | movie_budget_url = 'http://www.the-numbers.com/movie/budgets/all' 30 | response = requests.get(movie_budget_url) 31 | bs = BeautifulSoup(response.text, 'lxml') 32 | table = bs.find('table') 33 | rows = [elem for elem in table.find_all('tr') if elem.get_text() != '\n'] 34 | 35 | movie_budget = [] 36 | for row in rows[1:]: 37 | specs = [elem.get_text() for elem in row.find_all('td')] 38 | movie_name = specs[2].encode('latin1').decode('utf8', 'ignore') 39 | movie_budget.append({'release_date': specs[1], 40 | 'movie_name': movie_name, 41 | 'production_budget': parse_price(specs[3]), 42 | 'domestic_gross': parse_price(specs[4]), 43 | 'worldwide_gross': parse_price(specs[5])}) 44 | 45 | return movie_budget 46 | 47 | def get_imdb_urls(movie_budget, nb_elements=None): 48 | """ 49 | Parsing imdb website to get imdb movies links. 50 | Dumping a json file with budget, gross and imdb urls 51 | :param movie_budget: list of dictionnaries with budget and gross 52 | :param nb_elements: number of movies to parse 53 | """ 54 | for movie in tqdm(movie_budget[1000:1000+nb_elements]): 55 | movie_name = movie['movie_name'] 56 | title_url = urllib.parse.quote(movie_name.encode('utf-8')) 57 | imdb_search_link = "http://www.imdb.com/find?ref_=nv_sr_fn&q={}&s=tt".format(title_url) 58 | response = requests.get(imdb_search_link) 59 | bs = BeautifulSoup(response.text, 'lxml') 60 | results = bs.find("table", class_= "findList" ) 61 | try: 62 | movie['imdb_url'] = "http://www.imdb.com" + results.find('td', class_='result_text').find('a')['href'] 63 | except: 64 | movie['imdb_url'] = None 65 | 66 | with open('movie_budget.json', 'w') as fp: 67 | json.dump(movie_budget, fp) 68 | 69 | def get_imdb_content(movie_budget_path, nb_elements=None): 70 | """ 71 | Parsing imdb website to get imdb content : awards, casting, description, etc. 72 | Dumping a json file with imdb content 73 | :param movie_budget_path: path of the movie_budget.json file 74 | :param nb_elements: number of movies to parse 75 | """ 76 | with open(movie_budget_path, 'r') as fp: 77 | movies = json.load(fp) 78 | content_provider = ImdbMovieContent(movies) 79 | contents = [] 80 | threshold = 1300 81 | for i, movie in enumerate(movies[threshold:threshold+nb_elements]): 82 | time.sleep(random.uniform(0, 0.25)) 83 | print("\r%i / %i" % (i, len(movies[threshold:threshold+nb_elements])), end="") 84 | try: 85 | imdb_url = movie['imdb_url'] 86 | response = requests.get(imdb_url) 87 | bs = BeautifulSoup(response.text, 'lxml') 88 | movies_content = content_provider.get_content(bs) 89 | contents.append(movies_content) 90 | except Exception as e: 91 | print(e) 92 | pass 93 | with open('movie_contents7.json', 'w') as fp: 94 | json.dump(contents, fp) 95 | 96 | def parse_awards(movie): 97 | """ 98 | Convert awards information to a dictionnary for dataframe. 99 | Keeping only Oscar, BAFTA, Golden Globe and Palme d'Or awards. 
100 | :param movie: movie dictionnary 101 | :return: well-formated dictionnary with awards information 102 | """ 103 | awards_kept = ['Oscar', 'BAFTA Film Award', 'Golden Globe', 'Palme d\'Or'] 104 | awards_category = ['won', 'nominated'] 105 | parsed_awards = {} 106 | for category in awards_category: 107 | for awards_type in awards_kept: 108 | awards_cat = [award for award in movie['awards'][category] if award['category'] == awards_type] 109 | for k, award in enumerate(awards_cat): 110 | parsed_awards['{}_{}_{}'.format(awards_type, category, k+1)] = award["award"] 111 | return parsed_awards 112 | 113 | def parse_actors(movie): 114 | """ 115 | Convert casting information to a dictionnary for dataframe. 116 | Keeping only 3 actors with most facebook likes. 117 | :param movie: movie dictionnary 118 | :return: well-formated dictionnary with casting information 119 | """ 120 | sorted_actors = sorted(movie['cast_info'], key=lambda x:x['actor_fb_likes'], reverse=True) 121 | top_k = 3 122 | parsed_actors = {} 123 | parsed_actors['total_cast_fb_likes'] = sum([actor['actor_fb_likes'] for actor in movie['cast_info']]) + movie['director_info']['director_fb_links'] 124 | for k, actor in enumerate(sorted_actors[:top_k]): 125 | if k < len(sorted_actors): 126 | parsed_actors['actor_{}_name'.format(k+1)] = actor['actor_name'] 127 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = actor['actor_fb_likes'] 128 | else: 129 | parsed_actors['actor_{}_name'.format(k+1)] = None 130 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = None 131 | return parsed_actors 132 | 133 | def parse_production_company(movie): 134 | """ 135 | Convert production companies to a dictionnary for dataframe. 136 | Keeping only 3 production companies. 137 | :param movie: movie dictionnary 138 | :return: well-formated dictionnary with production companies 139 | """ 140 | parsed_production_co = {} 141 | top_k = 3 142 | production_companies = movie['production_co'][:top_k] 143 | for k, company in enumerate(production_companies): 144 | if k < len(movie['production_co']): 145 | parsed_production_co['production_co_{}'.format(k+1)] = company 146 | else: 147 | parsed_production_co['production_co_{}'.format(k+1)] = None 148 | return parsed_production_co 149 | 150 | def parse_genres(movie): 151 | """ 152 | Convert genres to a dictionnary for dataframe. 153 | :param movie: movie dictionnary 154 | :return: well-formated dictionnary with genres 155 | """ 156 | parse_genres = {} 157 | g = movie['genres'] 158 | with open('genre.json', 'r') as f: 159 | genres = json.load(f) 160 | for k, genre in enumerate(g): 161 | if genre in genres: 162 | parse_genres['genre_{}'.format(k+1)] = genres[genre] 163 | return parse_genres 164 | 165 | def create_dataframe(movies_content_path, movie_budget_path): 166 | """ 167 | Create dataframe from movie_budget.json and movie_content.json files. 
168 | :param movies_content_path: path of the movies_content.json file 169 | :param movie_budget_path: path of the movie_budget.json file 170 | :return: well formated dataframe 171 | """ 172 | with open(movies_content_path, 'r') as fp: 173 | movies = json.load(fp) 174 | with open(movie_budget_path, 'r') as fp: 175 | movies_budget = json.load(fp) 176 | movies_list = [] 177 | for movie in movies: 178 | content = {k:v for k,v in movie.items() if k not in ['awards', 'cast_info', 'director_info', 'production_co']} 179 | name = movie['movie_title'] 180 | try: 181 | budget = [film for film in movies_budget if film['movie_name']==name][0] 182 | budget = {k:v for k,v in budget.items() if k not in ['imdb_url', 'movie_name']} 183 | content.update(budget) 184 | except: 185 | pass 186 | try: 187 | content.update(parse_awards(movie)) 188 | except: 189 | pass 190 | try: 191 | content.update(parse_genres(movie)) 192 | except: 193 | pass 194 | try: 195 | content.update({k:v for k,v in movie['director_info'].items() if k!= 'director_link'}) 196 | except: 197 | pass 198 | try: 199 | content.update(parse_production_company(movie)) 200 | except: 201 | pass 202 | try: 203 | content.update(parse_actors(movie)) 204 | except: 205 | pass 206 | movies_list.append(content) 207 | df = pd.DataFrame(movies_list) 208 | df = df[pd.notnull(df.idmb_score)] 209 | df.idmb_score = df.idmb_score.apply(float) 210 | return df 211 | 212 | if __name__ == '__main__': 213 | if len(sys.argv) > 1: 214 | nb_elements = int(sys.argv[1]) 215 | else: 216 | nb_elements = None 217 | # movie_budget = get_movie_budget() 218 | # movies = get_imdb_urls(movie_budget, nb_elements=nb_elements) 219 | get_imdb_content("movie_budget.json", nb_elements=nb_elements) 220 | 221 | 222 | -------------------------------------------------------------------------------- /ProjectSoccer/README.md: -------------------------------------------------------------------------------- 1 | # Parsing and using Soccer data 2 | 3 | ## First Part - Parsing data 4 | 5 | The first part aims to parse data from multiple websites : games, teams, players, etc. 6 | To collect data about a team and dump a json file (must have created a ``./teams`` folder) : 7 | ``python3 dumper.py team_name`` 8 | 9 | ## Second Part - Data Analysis 10 | 11 | The second part is to analyze the dataset to understand what I can do with it. 12 | 13 | ![Correlation Matrix](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/psg_stats.png) 14 | 15 | ![PSG vs Saint-Etienne](https://github.com/alexattia/Data-Science-Projects/blob/master/pics/psg_ste.png) -------------------------------------------------------------------------------- /ProjectSoccer/df.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/ProjectSoccer/df.pkl -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Online-Challenge 2 | Challenge submitted on HackerRank and Kaggle. 3 | 4 | Algorithm challenges are made on HackerRank using Python. 5 | 6 | Data Science and Machine Learning challenges are made on Kaggle using Python too. 
7 | 8 | ## [Kaggle Bike Sharing](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleBikeSharing) 9 | The goal of this challenge is to build a model that predicts the count of bikes shared, based exclusively on contextual features. The first part of this challenge was aimed at understanding, analysing and processing the dataset. I wanted to produce meaningful information with plots. The second part was to build a model and use a Machine Learning library in order to predict the count. 10 | 11 | -->[French Explanations PDF](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf) 12 | 13 | ## [Twitter Parsing](https://github.com/alexattia/Data-Science-Projects/tree/master/TwitterParsing) 14 | 15 | I've recently discovered the Chris Albon Machine Learning flash cards and I want to download those flash cards, but the official Twitter API only returns tweets up to 2 weeks old, so I had to find a way to bypass this limitation: using Selenium and PhantomJS. 16 | Purpose of this project: check every 2 hours whether he has posted new flash cards; if so, download them and send me a summary email. 17 | 18 | ## [Face Recognition](https://github.com/alexattia/Data-Science-Projects/tree/master/FaceRecognition) 19 | 20 | Modern face recognition with deep learning and the HOG algorithm. Using the dlib C++ library, I have a quick face recognition tool that needs only a few pictures (20 per person). 21 | 22 | ## [Playing with Soccer data](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleSoccer) 23 | 24 | As a soccer fan and a data enthusiast, I wanted to play with and analyze soccer data. 25 | I don't currently know the exact aim of this project, but I will parse data from diverse websites, for different teams and different players. 26 | 27 | ## [NYC Taxi Trips](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleTaxiTrip) 28 | 29 | Kaggle playground to predict the total ride duration of taxi trips in New York City. 30 | 31 | ## [Kaggle Understanding the Amazon from Space](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleAmazon) 32 | Use satellite data to track the human footprint in the Amazon rainforest. 33 | Deep Learning model (using Keras) to label satellite images. 34 | 35 | ## [Predicting IMDB movie rating](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleMovieRating) 36 | Project inspired by Chuan Sun's [work](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset) 37 | How can we tell the greatness of a movie? 38 | Scraping and Machine Learning -------------------------------------------------------------------------------- /TwitterParsing/README.md: -------------------------------------------------------------------------------- 1 | # Twitter Parsing 2 | 3 | ## Parsing Machine Learning Flashcards 4 | After spending some time on Twitter following Machine Learning, Deep Learning and Data Science "influencers", I have found some interesting stuff. 5 | I decided to download some of those tweets. For example, @chrisalbon writes (draws would be more relevant) interesting Machine Learning flash cards. 6 | I am using Selenium to parse and download those flash cards, and I created a daemon that checks for new flash cards and emails them to me every day (a minimal sketch of the scrolling logic is shown below; the full implementation is in `download_pics.py`).
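The sketch below is illustrative only; the PhantomJS path and the search query mirror `download_pics.py`, and the waiting strategy is simplified.

```python
# Illustrative sketch: scroll a Twitter search page until no new results load,
# then collect the tweet elements. The real logic lives in download_pics.py.
import time
import urllib.parse
from selenium import webdriver

driver = webdriver.PhantomJS("/usr/local/bin/phantomjs")
query = urllib.parse.quote("#machinelearningflashcards")
driver.get("https://twitter.com/search?f=tweets&vertical=default&q={}".format(query))

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # let the next batch of tweets load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height <= last_height:  # nothing new loaded, we reached the end
        break
    last_height = new_height

tweets = driver.find_elements_by_class_name("tweet")
print("%d tweets found" % len(tweets))
driver.quit()
```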
7 | 8 | `com.alexattia.machinelearning.downloadflashcards.plist` is the plist file to copy into /Library/LaunchDaemons/ 9 | `run.sh` is the script to launch every day 10 | `download_pics.py`is the Python script 11 | 12 | ## Sending one picture a day 13 | Recently, @chrisalbon published his [machine learning flashcards](machinelearningflashcards.com). There are more than 300 pictures. In order to learn those flashcards, I want to learn one a day. So, I am using a daemon to send me an email every day with an embedded flashcard. -------------------------------------------------------------------------------- /TwitterParsing/com.alexattia.machinelearning.downloadflashcards.plist: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Label 6 | com.alexattia.machinelearning.downloadflashcards 7 | ProgramArguments 8 | 9 | /Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/run.sh 10 | 11 | StandardOutPath 12 | /Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/output.txt 13 | StandardErrorPath 14 | /Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/errors.txt 15 | 17 | StartCalendarInterval 18 | 19 | Hour 20 | 16 21 | Minute 22 | 0 23 | 24 | 25 | -------------------------------------------------------------------------------- /TwitterParsing/config.py: -------------------------------------------------------------------------------- 1 | class Config: 2 | def __init__(self): 3 | self.my_email_address = 'XXX' 4 | self.password = 'XXX' 5 | self.dest = 'XXX' 6 | self.message_init = """ 7 | 8 | 9 | 10 | 11 | 12 |

13 | Hi Alexandre,
14 | {number_flash_cards} machine learning flashcards has been downloaded. 15 |

16 |
17 | Tweeted by Chris Albon 18 |
19 | 20 | 21 | 22 | 23 | 24 | 25 | 27 | 28 |
  26 |
29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | """ 37 | 38 | self.intro = """\ 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 238 | 239 | 240 | 241 | 242 | 243 | 254 | 255 | 256 | 342 | 343 |
244 | 245 | 246 | 251 | 252 |
247 | 248 | 249 | 250 |
253 |
257 | 258 | 259 | 339 | 340 |
260 | 261 | 262 | 289 | 319 | 320 | 321 | 322 | 323 | 331 | 332 | """ 333 | 334 | self.end = """ 335 | 336 | 337 |
263 | 284 | """ 285 | self.flashcard = """ 286 | 287 | 288 |
290 | 291 | 292 | 309 | 310 |
293 |

294 | {tweet_text} 295 |

296 | 297 | 298 | 301 | 306 | 307 |
299 | date 300 | 302 |
303 | {date} 304 |
305 |
308 |
311 | 312 | 313 | 316 | 317 |
314 | map 315 |
318 |
324 | 325 | 326 | 328 | 329 |
  327 |
330 |
338 |
341 |
344 | 345 | 346 | """ -------------------------------------------------------------------------------- /TwitterParsing/config_downl.py: -------------------------------------------------------------------------------- 1 | class Config: 2 | def __init__(self): 3 | self.my_email_address = 'XXX' 4 | self.password = 'XXX' 5 | self.dest = 'XXX' 6 | self.message_init = """ 7 | 8 | 9 | 10 | 11 | 12 |

13 | Hi Alexandre,
14 | Today the machine learning flashcard is about {flashcard_title}. 15 |

16 | 17 | 18 | 19 | 20 | 21 | 22 | 24 | 25 |
  23 |
26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | """ 34 | 35 | self.intro = """\ 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 235 | 236 | 237 | 238 | 239 | 240 | 251 | 252 | 253 | 306 | 307 |
241 | 242 | 243 | 248 | 249 |
244 | 245 | 246 | 247 |
250 |
254 | 255 | 256 | 303 | 304 |
257 | 258 | 259 | 286 | 295 | 296 | """ 297 | 298 | self.end = """ 299 | 300 | 301 |
260 | 281 | """ 282 | self.flashcard = """ 283 | 284 | 285 |
287 | 288 | 289 | 292 | 293 |
290 | 291 |
294 |
302 |
305 |
308 | 309 | 310 | """ -------------------------------------------------------------------------------- /TwitterParsing/download_pics.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from selenium import webdriver 3 | import urllib 4 | import time 5 | import sys 6 | import numpy as np 7 | import glob 8 | import os 9 | import datetime 10 | from bs4 import BeautifulSoup 11 | import smtplib 12 | from email.mime.text import MIMEText 13 | from email.header import Header 14 | from email.utils import formataddr 15 | import config 16 | 17 | def internet_on(): 18 | """ 19 | Check if the network connection is on 20 | :return: boolean 21 | """ 22 | try: 23 | urllib.request.urlopen('http://216.58.192.142', timeout=1) 24 | return True 25 | except urllib.error.URLError: 26 | return False 27 | 28 | def get_all_tweets_chrome(query): 29 | """ 30 | Get all tweets from the webpage using the Chrome driver 31 | :param query: Tweet search query 32 | :return: all tweets on page as selenium objects 33 | """ 34 | driver = webdriver.Chrome("/usr/local/bin/chromedriver") 35 | driver.set_page_load_timeout(20) 36 | driver.get("https://twitter.com/search?f=tweets&vertical=default&q={}".format(urllib.parse.quote(query))) 37 | 38 | length = [] 39 | try: 40 | while True: 41 | time.sleep(np.random.randint(50,100)*0.01) 42 | tweets_found = driver.find_elements_by_class_name('tweet') 43 | driver.execute_script("return arguments[0].scrollIntoView();", tweets_found[::-1][0]) 44 | 45 | # Stop the loop while no more found tweets 46 | length.append(len(tweets_found)) 47 | if len(tweets_found) > 200 and (len(length) - len(set(length))) > 2: 48 | print('%s tweets found at %s' % (len(tweets_found), datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'))) 49 | break 50 | except IndexError: 51 | driver.save_screenshot('/Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/screenshot_%s.png' % datetime.datetime.now().strftime('%d_%m_%Y_%H/%M')) 52 | time.sleep(np.random.randint(50,100)*0.01) 53 | tweets_found = driver.find_elements_by_class_name('tweet') 54 | print('%s tweets found at %s' % (len(tweets_found), datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'))) 55 | except Exception as e: 56 | print(e) 57 | time.sleep(np.random.randint(50,100)*0.01) 58 | tweets_found = driver.find_elements_by_class_name('tweet') 59 | driver.execute_script("return arguments[0].scrollIntoView();", tweets_found[::-1][0]) 60 | return driver 61 | 62 | def get_all_tweets_phantom(query): 63 | """ 64 | Get all tweets from the webpage using the PhantomJS driver 65 | :param query: Tweet search query 66 | :return: all tweets on page as selenium objects 67 | """ 68 | driver = webdriver.PhantomJS("/usr/local/bin/phantomjs") 69 | driver.get("https://twitter.com/search?f=tweets&vertical=default&q={}".format(urllib.parse.quote(query))) 70 | lastHeight = driver.execute_script("return document.body.scrollHeight") 71 | i = 0 72 | while True: 73 | driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 74 | time.sleep(np.random.randint(50,100)*0.01) 75 | newHeight = driver.execute_script("return document.body.scrollHeight") 76 | if newHeight <= lastHeight: 77 | break 78 | lastHeight = newHeight 79 | i += 1 80 | return driver 81 | 82 | def format_tweets(driver): 83 | """ 84 | Convert selenium objects into dictionnaries (images, date, text, username as keys) 85 | :param driver: Selenium driver (with tweets loaded) 86 | :return: well-formated tweets as a list of 
dictionnaries 87 | """ 88 | tweets_found = driver.find_elements_by_class_name('tweet') 89 | tweets = [] 90 | for tweet in tweets_found: 91 | tweet_dict = {} 92 | tweet = tweet.get_attribute('innerHTML') 93 | bs = BeautifulSoup(tweet.strip(), "lxml") 94 | tweet_dict['username'] = bs.find('span', class_='username').text 95 | timestamp = float(bs.find('span', class_='_timestamp')['data-time']) 96 | tweet_dict['date'] = datetime.datetime.fromtimestamp(timestamp) 97 | tweet_dict['tweet_link'] = 'https://twitter.com' + bs.find('a', class_='js-permalink')['href'] 98 | tweet_dict['text'] = bs.find('p', class_='tweet-text').text 99 | try: 100 | tweet_dict['images'] = [k['src'] for k in bs.find('div', class_="AdaptiveMedia-container").find_all('img')] 101 | except: 102 | tweet_dict['images'] = [] 103 | if len(tweet_dict['images']) > 0: 104 | tweet_dict['text'] = tweet_dict['text'][:tweet_dict['text'].index('pic.twitter')-1] 105 | tweets.append(tweet_dict) 106 | driver.close() 107 | return tweets 108 | 109 | def filter_tweets(tweets): 110 | """ 111 | Filter tweets not made by Chris Albon, without images and already downloaded 112 | :param tweets: list of dictionnaries 113 | :return: tweets as a list of dictionnaries 114 | """ 115 | # We keep only tweets by chrisalbon with pictures 116 | search_tweets = [tw for tw in tweets if tw['username'] == '@chrisalbon' and len(tw['images']) > 0] 117 | # He made multiple tweets on the same topic, we keep only the most recent tweets 118 | # We use the indexes of the reversed tweet list and dictionnaries to keep only key 119 | unique_search_index = sorted(list({t['text'].lower():i for i,t in list(enumerate(search_tweets))[::-1]}.values())) 120 | unique_search_tweets = [search_tweets[i] for i in unique_search_index] 121 | 122 | # Keep non-downloaded tweets 123 | most_recent_file = sorted([datetime.datetime.fromtimestamp(os.path.getmtime(path)) 124 | for path in glob.glob("./downloaded_pics/*.jpg")], reverse=True)[0] 125 | recent_seach_tweets = [tw for tw in unique_search_tweets if tw['date'] > most_recent_file] 126 | 127 | # Uncomment for testing new tweets 128 | # recent_seach_tweets = [tw for tw in unique_search_tweets if tw['date'] > datetime.datetime(2017, 7, 6, 13, 41, 48)] 129 | return recent_seach_tweets 130 | 131 | def download_pictures(recent_seach_tweets): 132 | """ 133 | Download pictures from tweets 134 | :param recent_seach_tweets: list of dictionnaries 135 | """ 136 | # Downloading pictures 137 | print('%s - Downloading %d tweets' % (datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'), len(recent_seach_tweets))) 138 | for tw in recent_seach_tweets: 139 | img_url = tw['images'][0] 140 | filename = tw['text'][:tw['text'].index("#")-1].lower().replace(' ','_') 141 | filename = "./downloaded_pics/%s.jpg" % filename 142 | urllib.request.urlretrieve(img_url, filename) 143 | 144 | def send_email(recent_seach_tweets): 145 | """ 146 | Send a summary email of new tweets 147 | :param recent_seach_tweets: list of dictionnaries 148 | """ 149 | # Create e-mail message 150 | msg = C.intro + C.message_init.format(number_flash_cards=len(recent_seach_tweets)) 151 | # Add a special text for the first 3 new tweets 152 | for tw in recent_seach_tweets[:3]: 153 | date = tw['date'].strftime('Le %d/%m/%Y à %H:%M') 154 | link_tweet = tw['tweet_link'] 155 | link_picture = tw['images'][0] 156 | tweet_text = tw["text"][:tw["text"].index('#')-1] 157 | msg += C.flashcard.format(date=date, 158 | link_picture=link_picture, 159 | tweet_text=tweet_text, 160 | tweet_link=link_tweet) 161 
| 162 | # mapping for the subject 163 | numbers = { 0 : 'zero', 1 : 'one', 2 : 'two', 3 : 'three', 4 : 'four', 5 : 'five', 164 | 6 : 'six', 7 : 'seven', 8 : 'eight', 9 : 'nine', 10 : 'ten'} 165 | 166 | message = MIMEText(msg, 'html') 167 | message['From'] = formataddr((str(Header('Twitter Parser', 'utf-8')), C.my_email_address)) 168 | message['To'] = C.dest 169 | message['Subject'] = '%s new Machine Learning Flash Cards' % numbers[len(recent_seach_tweets)].title() 170 | msg_full = message.as_string() 171 | 172 | server = smtplib.SMTP('smtp.gmail.com:587') 173 | server.starttls() 174 | server.login(C.my_email_address, C.password) 175 | server.sendmail(C.my_email_address, C.dest, msg_full) 176 | server.quit() 177 | print('%s - E-mail sent!' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')) 178 | 179 | if __name__ == '__main__': 180 | # config file with mail content, email addresses 181 | C = config.Config() 182 | # Twitter search query 183 | query = '#machinelearningflashcards' 184 | # Loop until there is a network connection 185 | time_slept = 0 186 | while not internet_on(): 187 | time.sleep(1) 188 | time_slept += 1 189 | if time_slept > 15 : 190 | print('%s - No network connection' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')) 191 | sys.exit() 192 | # driver = get_all_tweets_chrome(query) 193 | driver = get_all_tweets_phantom(query) 194 | tweets = format_tweets(driver) 195 | recent_seach_tweets = filter_tweets(tweets) 196 | if len(recent_seach_tweets) > 0: 197 | download_pictures(recent_seach_tweets) 198 | send_email(recent_seach_tweets) 199 | else: 200 | print('%s - No picture to download' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')) 201 | 202 | 203 | -------------------------------------------------------------------------------- /TwitterParsing/pic.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/TwitterParsing/pic.png -------------------------------------------------------------------------------- /TwitterParsing/pic2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/TwitterParsing/pic2.png -------------------------------------------------------------------------------- /TwitterParsing/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # change directory 4 | APPDIR=/Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing 5 | export APPDIR 6 | cd $APPDIR 7 | 8 | # activate virtual env 9 | source /Users/alexandreattia/Desktop/Work/workenv/bin/activate 10 | 11 | # launch python script 12 | ./send_pictures.py -------------------------------------------------------------------------------- /TwitterParsing/send_pictures.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import numpy as np 3 | from email.mime.text import MIMEText 4 | from email.mime.image import MIMEImage 5 | from email.mime.multipart import MIMEMultipart 6 | from email.header import Header 7 | from email.utils import formataddr 8 | import smtplib 9 | import config_downl 10 | from download_pics import internet_on 11 | import datetime 12 | import sys 13 | 14 | def get_picture(): 15 | pic_folder = '/Users/alexandreattia/Desktop/Work/machine_learning_flashcards_v1.4/png_web/' 16 | pics 
= glob.glob(pic_folder + '*') 17 | pic = np.random.choice(pics) 18 | flashcard_title = pic.split('/')[7].replace('_web.png', '').replace('_', ' ') 19 | return pic, flashcard_title 20 | 21 | def send_email(flashcard_title, pic): 22 | """ 23 | Send an email of with the picture of the day 24 | """ 25 | # Create e-mail Multi Part 26 | msgRoot = MIMEMultipart('related') 27 | msgRoot['From'] = formataddr((str(Header('ML Flashcards', 'utf-8')), C.my_email_address)) 28 | msgRoot['To'] = C.dest 29 | msgRoot['Subject'] = 'Today ML Flashcard : %s' % flashcard_title.title() 30 | msgAlternative = MIMEMultipart('alternative') 31 | msgRoot.attach(msgAlternative) 32 | 33 | # Create e-mail message 34 | msg = C.intro + C.message_init.format(flashcard_title=flashcard_title.title()) 35 | msg += C.flashcard 36 | message = MIMEText(msg, 'html') 37 | msgAlternative.attach(message) 38 | 39 | # Add picture to e-mail 40 | fp = open(pic, 'rb') 41 | msgImage = MIMEImage(fp.read()) 42 | fp.close() 43 | 44 | # Define the image's ID as referenced above 45 | msgImage.add_header('Content-ID', '') 46 | msgRoot.attach(msgImage) 47 | 48 | msg_full = msgRoot.as_string() 49 | 50 | server = smtplib.SMTP('smtp.gmail.com:587') 51 | server.starttls() 52 | server.login(C.my_email_address, C.password) 53 | server.sendmail(C.my_email_address, C.dest, msg_full) 54 | server.quit() 55 | print('%s - E-mail sent!' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')) 56 | 57 | if __name__ == '__main__': 58 | # config file with mail content, email addresses 59 | C = config_downl.Config() 60 | # Loop until there is a network connection 61 | time_slept = 0 62 | while not internet_on(): 63 | time.sleep(1) 64 | time_slept += 1 65 | if time_slept > 15 : 66 | print('%s - No network connection' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')) 67 | sys.exit() 68 | p, fc = get_picture() 69 | send_email(fc, p) 70 | -------------------------------------------------------------------------------- /pics/corr_matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/corr_matrix.png -------------------------------------------------------------------------------- /pics/features.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/features.png -------------------------------------------------------------------------------- /pics/psg_stats.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/psg_stats.png -------------------------------------------------------------------------------- /pics/psg_ste.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/psg_ste.png --------------------------------------------------------------------------------