├── FaceRecognition
├── README.md
├── Training.ipynb
├── detect_recognize.py
└── result.png
├── HackerRank_MaximumSubArray.py
├── HackerRank_PalindromeIndex.py
├── HackerRank_TwoStrings.py
├── KaggleAmazon
├── README.md
├── Understanding the Amazon from Space.ipynb
├── predict_keras.py
├── predict_tf.py
└── train_keras.py
├── KaggleBikeSharing
├── Kaggle_BikeSharing.py
├── Kaggle_BikeSharing_Explanations_French.pdf
└── README.md
├── KaggleInstaCart
├── Data Exploration.ipynb
└── Predictions.ipynb
├── KaggleMovieRating
├── Exploration.ipynb
├── README.md
├── genre.json
├── imdb_movie_content.py
├── movie_budget.json
└── parser.py
├── KaggleSoccer
├── Parsing Soccer Ways.ipynb
├── Predicting Ratings.ipynb
├── README.md
├── df.pkl
├── dumper.py
└── parser.py
├── KaggleTaxiTrip
├── Exploring the dataset.ipynb
├── README.md
└── pic
│ ├── download.png
│ ├── feat_importance.png
│ ├── nyc_clusters.png
│ ├── outliners.png
│ └── rush_hour.png
├── KaggleTitanic
├── Notebook.ipynb
└── README.md
├── KaggleWiki
├── README.md
└── wikipedia_parser.py
├── Kaggle_KobeShots.ipynb
├── ObjectDetection
├── color_extract.py
└── recognition.py
├── ProjectMovieRating
├── .ipynb_checkpoints
│ └── Exploration-checkpoint.ipynb
├── Exploration.ipynb
├── README.md
├── __pycache__
│ ├── imdb_movie_content.cpython-35.pyc
│ └── parser.cpython-35.pyc
├── genre.json
├── imdb_movie_content.py
├── movie_budget.json
├── movie_contents.json
├── movie_contents9.json
└── parser.py
├── ProjectSoccer
├── Predicting Ratings.ipynb
├── README.md
└── df.pkl
├── README.md
├── TwitterParsing
├── README.md
├── com.alexattia.machinelearning.downloadflashcards.plist
├── config.py
├── config_downl.py
├── download_pics.py
├── pic.png
├── pic2.png
├── run.sh
└── send_pictures.py
└── pics
├── corr_matrix.png
├── features.png
├── psg_stats.png
└── psg_ste.png
/FaceRecognition/README.md:
--------------------------------------------------------------------------------
1 | ## Face Recognition
2 |
3 | Modern face recognition with deep learning and the HOG algorithm.
4 |
5 | 1. Find faces in image (HOG Algorithm)
6 | 2. Affine Transformations (Face alignment using an ensemble of regression
7 | trees)
8 | 3. Encoding Faces (FaceNet)
9 | 4. Make a prediction (Linear SVM)
10 |
11 | We are using the [Histogram of Oriented Gradients](http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf) (HOG) method. Instead of computing gradients for every pixel of the image (way too much detail), we compute the weighted vote of gradient orientations over 16x16 pixel squares. Afterwards, we have a simple representation (HOG image) that captures the basic structure of a face.
12 | All we have to do is find the part of our image that looks the most similar to a known trained HOG pattern.
13 | For this technique, we use the dlib Python library to generate and view HOG representations of images.
14 | ```
15 | face_detector = dlib.get_frontal_face_detector()
16 | detected_faces = face_detector(image, 1)
17 | ```
18 |
19 | After isolating the faces in our image, we need to warp (pose and project) the picture so the face is always in the same place. To do this, we are going to use the [face landmark estimation algorithm](http://www.csc.kth.se/~vahidk/papers/KazemiCVPR14.pdf). Following this method, there are 68 specific points (landmarks) on every face, and we train a machine learning algorithm to find these 68 specific points on any face.
20 | ```
21 | face_pose_predictor = dlib.shape_predictor(predictor_model)
22 | pose_landmarks = face_pose_predictor(img, f)
23 | ```
24 | After finding those landmarks, we apply affine transformations (rotation, scaling and shearing) to the image so that the eyes and mouth are centered as well as possible.
25 | ```
26 | face_aligner = openface.AlignDlib(predictor_model)
27 | alignedFace = face_aligner.align(534, image, face_rect, landmarkIndices=openface.AlignDlib.OUTER_EYES_AND_NOSE)
28 | ```
29 |
30 | The next step is encoding the detected face. For this, we use Deep Learning. We train a neural net to generate [128 measurements (face embedding)](http://www.cv-foundation.org/openaccess/content_cvpr_2015/app/1A_089.pdf) for each face.
31 | The training process works by looking at 3 face images at a time:
32 | - Load two training face images of the same known person and generate the 128 measurements for each picture
33 | - Load a picture of a different person and generate its 128 measurements
34 | Then we tweak the neural network slightly so that the measurements of the two pictures of the same person get slightly closer, while the measurements of the different person get slightly further apart.
35 | Once the network has been trained, it can generate measurements for any face, even ones it has never seen before!
36 | ```
37 | face_encoder = dlib.face_recognition_model_v1(face_recognition_model)
38 | face_encoding = np.array(face_encoder.compute_face_descriptor(image, pose_landmarks, 1))
39 | ```
40 |
41 | Finally, we need a classifier (a Linear SVM or any other classifier) to find the person in our database of known people whose measurements are closest to those of our test image. We train the classifier with the measurements as input.
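A minimal training sketch, assuming `X_train` holds the 128-dimensional encodings and `y_train` the matching person labels (as built in `detect_recognize.py`):
```
from sklearn.svm import SVC

# Linear SVM over the 128-dimensional face embeddings
clf = SVC(C=1, kernel='linear', probability=True)
clf.fit(X_train, y_train)
```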
42 |
43 | Thanks to Adam Geitgey, who wrote a great [post](https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78) about this; I followed his pipeline.
44 |
45 | 
--------------------------------------------------------------------------------
/FaceRecognition/detect_recognize.py:
--------------------------------------------------------------------------------
1 | import dlib
2 | from skimage import io, transform
3 | import cv2
4 | import matplotlib.pyplot as plt
5 | import numpy as np
6 | import pandas as pd
7 | import glob
8 | import openface
9 | import pickle
10 | import os
11 | import sys
12 | import argparse
13 | import time
14 |
15 | from sklearn.svm import SVC
16 | from sklearn.preprocessing import LabelEncoder
17 |
18 | face_detector = dlib.get_frontal_face_detector()
19 | face_encoder = dlib.face_recognition_model_v1('./model/dlib_face_recognition_resnet_model_v1.dat')
20 | face_pose_predictor = dlib.shape_predictor('./model/shape_predictor_68_face_landmarks.dat')
21 |
22 | def get_detected_faces(filename):
23 | """
24 | Detect faces in a picture using HOG
25 | :param filename: picture filename
26 | :return: picture numpy array, face detector object with detected faces
27 | """
28 | image = cv2.imread(filename)
29 | image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
30 | return image, face_detector(image, 1)
31 |
32 | def get_face_encoding(image, detected_face):
33 | """
34 | Encode face into 128 measurements using a neural net
35 | :param image: picture numpy array
36 | :param detected_face: face detector object with one detected face
37 | :return: measurement (128,) numpy array
38 | """
39 | pose_landmarks = face_pose_predictor(image, detected_face)
40 | face_encoding = face_encoder.compute_face_descriptor(image, pose_landmarks, 1)
41 | return np.array(face_encoding)
42 |
43 | def training(people):
44 | """
45 | Training our classifier (Linear SVC). Saving model using pickle.
46 | We need to have only one person/face per picture.
47 | :param people: people to classify and recognize
48 | """
49 |     # parsing labels and representations
50 | df = pd.DataFrame()
51 | for p in people:
52 | l = []
53 | for filename in glob.glob('./data/%s/*' % p):
54 | image, face_detect = get_detected_faces(filename)
55 | face_encoding = get_face_encoding(image, face_detect[0])
56 | l.append(np.append(face_encoding, [p]))
57 | temp = pd.DataFrame(np.array(l))
58 | df = pd.concat([df, temp])
59 | df.reset_index(drop=True, inplace=True)
60 |
61 | # converting labels into int
62 | le = LabelEncoder()
63 | y = le.fit_transform(df[128])
64 | print("Training for {} classes.".format(len(le.classes_)))
65 | X = df.drop(128, axis=1)
66 | print("Training with {} pictures.".format(len(X)))
67 |
68 | # training
69 | clf = SVC(C=1, kernel='linear', probability=True)
70 | clf.fit(X, y)
71 |
72 | # dumping model
73 | fName = "./classifier.pkl"
74 | print("Saving classifier to '{}'".format(fName))
75 | with open(fName, 'wb') as f:
76 | pickle.dump((le, clf), f)
77 |
78 | def predict(filename, le=None, clf=None, verbose=False):
79 | """
80 | Detect and recognize a face using a trained classifier.
81 | :param filename: picture filename
82 |     :param le: label encoder (loaded from disk if not provided)
83 |     :param clf: trained classifier (loaded from disk if not provided)
84 |     :param verbose: print debugging information
85 | :return: picture with bounding boxes and prediction
86 | """
87 | if not le and not clf:
88 | with open("./classifier.pkl", 'rb') as f:
89 | (le, clf) = pickle.load(f)
90 | image, detected_faces = get_detected_faces(filename)
91 | prediction = []
92 | # Verbose for debugging
93 | if verbose:
94 | print('{} faces detected.'.format(len(detected_faces)))
95 | img = np.copy(image)
96 | font = cv2.FONT_HERSHEY_SIMPLEX
97 | for face_detect in detected_faces:
98 | # draw bounding boxes
99 | cv2.rectangle(img, (face_detect.left(), face_detect.top()),
100 | (face_detect.right(), face_detect.bottom()), (255, 0, 0), 2)
101 | start_time = time.time()
102 | # predict each face
103 | p = clf.predict_proba(get_face_encoding(image, face_detect).reshape(1, 128))
104 | # throwing away prediction with low confidence
105 | a = np.sort(p[0])[::-1]
106 | if a[0]-a[1] > 0.5:
107 | y_pred = le.inverse_transform(np.argmax(p))
108 | prediction.append([y_pred, (face_detect.left(), face_detect.top()),
109 | (face_detect.right(), face_detect.bottom())])
110 | else:
111 | y_pred = 'unknown'
112 | # Verbose for debugging
113 | if verbose:
114 | print('\n'.join(['%s : %.3f' % (k[0], k[1]) for k in list(zip(map(le.inverse_transform,
115 | np.argsort(p[0])),
116 | np.sort(p[0])))[::-1]]))
117 | print('Prediction took {:.2f}s'.format(time.time()-start_time))
118 |
119 | cv2.putText(img, y_pred, (face_detect.left(), face_detect.top()-5), font, np.max(img.shape[:2])/1800, (255, 0, 0))
120 | return img, prediction
121 |
122 | if __name__ == '__main__':
123 | parser = argparse.ArgumentParser()
124 |     parser.add_argument('mode', type=str, help='train or test')
125 | parser.add_argument('--training_data',
126 | type=str,
127 | help="Path to training data folder.",
128 | default='./data/')
129 | parser.add_argument('--testing_data',
130 | type=str,
131 | help="Path to test data folder.",
132 | default='./test/')
133 |
134 | args = parser.parse_args()
135 | people = os.listdir(args.training_data)
136 | print('{} people will be classified.'.format(len(people)))
137 | if args.mode == 'train':
138 | training(people)
139 | elif args.mode == 'test':
140 | with open("./classifier.pkl", 'rb') as f:
141 | (le, clf) = pickle.load(f)
142 |         for i, f in enumerate(glob.glob(os.path.join(args.testing_data, '*'))):
143 | img, _ = predict(f, le, clf)
144 | cv2.imwrite(args.testing_data + 'test_{}.jpg'.format(i), img)
145 |
146 |
--------------------------------------------------------------------------------
/FaceRecognition/result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/FaceRecognition/result.png
--------------------------------------------------------------------------------
/HackerRank_MaximumSubArray.py:
--------------------------------------------------------------------------------
1 | ## Aim:
2 | ## Given an array of k elements, find the maximum possible sum of a
3 | ## 1. Contiguous subarray
4 | ## 2. Non-contiguous (not necessarily contiguous) subarray.
5 |
6 | ## Input :
7 | ## First line of the input has an integer t. t cases follow.
8 | ## Each test case begins with an integer k . In the next line, k integers follow representing the elements of array arr.
9 |
10 | ## Output :
11 | ## Two, space separated, integers denoting the maximum contiguous and non-contiguous subarray sums.
12 | ## At least one integer should be selected and put into the subarrays
13 | ## (this may be required in cases where all elements are negative).
14 |
15 | t = int(input().strip())
16 | def max_subarray(A):
17 | max_ending_here = max_so_far = 0
18 | for x in A:
19 | max_ending_here = max(0, max_ending_here + x)
20 | max_so_far = max(max_so_far, max_ending_here)
21 | return max_so_far
22 |
23 | for i in range(t):  # each test case consumes two input lines (k and the array)
24 | k=int(input())
25 | arr = [int(arr_temp) for arr_temp in input().strip().split(' ')]
26 | if(all(item>0 for item in arr)):
27 | print(sum(arr),sum(arr))
28 | elif(all(item<0 for item in arr)):
29 | print(max(arr),max(arr))
30 | else:
31 | c=0
32 | for i in range(len(arr)):
33 | if(c+arr[i]>c):
34 | c+=arr[i]
35 | print(max_subarray(arr),c)
36 |
--------------------------------------------------------------------------------
/HackerRank_PalindromeIndex.py:
--------------------------------------------------------------------------------
1 | ## Given a string of lowercase letters, determine the index of the character
2 | ## whose removal will make the string a palindrome.
3 | ## If the string is already a palindrome, then print -1 . There will always be a valid solution.
4 |
5 | ## Input:
6 | ## The first line contains n (the number of test cases).
7 | ## The n subsequent lines of test cases each contain a single string to be checked.
8 |
9 | ## Output :
10 | ## Print the zero-indexed position (integer) of a character whose deletion will result in a palindrome;
11 | ## if there is no such character (i.e.: the string is already a palindrome), print -1.
12 |
13 | n = int(input().strip())
14 | a = []
15 | for a_i in range(n):
16 | a_t = [a_temp for a_temp in input().strip().split(' ')][0]
17 | a.append(a_t)
18 |
19 | def isPal(s):
20 | for i in range(int(len(s)/2)):
21 | if s[i] != s[len(s)-i-1]:
22 | return False
23 | return True
24 |
25 | def solve(s):
26 | for i in range(int((len(s)+1)/2)):
27 | if s[i] != s[len(s)-i-1]:
28 | if isPal(s[:i]+s[i+1:]):
29 | return i
30 | else:
31 | return len(s)-i-1
32 | return -1
33 | for i in range(n):
34 | x=a[i]
35 | print(solve(x))
--------------------------------------------------------------------------------
/HackerRank_TwoStrings.py:
--------------------------------------------------------------------------------
1 | ## Aim :
2 | ## You are given two strings, A and B.
3 | ## Find if there is a substring that appears in both A and B.
4 |
5 | # Input:
6 | # The first line of the input will contain a single integer t, the number of test cases.
7 | # Then there will be t descriptions of the test cases. Each description contains two lines.
8 | # The first line contains the string A and the second line contains the string B.
9 |
10 | # Output:
11 | # For each test case, display YES (in a newline), if there is a common substring.
12 | # Otherwise, display NO
13 |
14 | n = int(input().strip())
15 | a = []
16 | for a_i in range(2*n):
17 | a_t = [a_temp for a_temp in input().strip().split(' ')][0]
18 | a.append(a_t)
19 |
20 | for i in range(2*n):
21 | if (i%2==0):
22 | s1=set(a[i])
23 | s2=set(a[i+1])
24 |         S=s1.intersection(s2) # intersection of the character sets -> characters common to both strings
25 | if (len(S)==0):
26 | print("NO")
27 | else:
28 | print("YES")
29 |
--------------------------------------------------------------------------------
/KaggleAmazon/README.md:
--------------------------------------------------------------------------------
1 | # Planet: Understanding the Amazon from Space
2 |
3 | [Kaggle Link](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space)
4 |
5 | Use satellite data to track the human footprint in the Amazon rainforest.
6 |
7 | The goal of this challenge is to label satellite image chips with atmospheric conditions and various classes of land cover/land use. Resulting algorithms will help the global community better understand where, how, and why deforestation happens all over the world - and ultimately how to respond.
8 |
9 | This problem was tackled with Deep Learning models (using TensorFlow and Keras).
10 | Submissions are evaluated based on the F-beta score (F2 score), which measures accuracy using precision and recall.
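With beta = 2, recall weighs more than precision. A minimal sketch of how a prediction could be scored locally, assuming `y_true` and `y_pred` are binary label-indicator arrays of shape (n_samples, n_labels):
```
from sklearn.metrics import fbeta_score

# F2 score averaged over samples for the multi-label predictions
score = fbeta_score(y_true, y_pred, beta=2, average='samples')
```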
11 |
12 |
13 |
--------------------------------------------------------------------------------
/KaggleAmazon/predict_keras.py:
--------------------------------------------------------------------------------
1 | import train_keras
2 | from keras.models import load_model
3 | import os
4 | import numpy as np
5 | import pandas as pd
6 | from tqdm import tqdm
7 | from keras.callbacks import ModelCheckpoint
8 | import sys
9 |
10 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # silence TensorFlow info logs
11 | TEST_BATCH = 128
12 |
13 | def load_params():
14 | X_test = os.listdir('./test-jpg')
15 | X_test = [fn.replace('.jpg', '') for fn in X_test]
16 | model = load_model('model_amazon6.h5', custom_objects={'fbeta': train_keras.fbeta})
17 | with open('tag_columns.txt', 'r') as f:
18 | tag_columns = f.read().split('\n')
19 | return X_test, model, tag_columns
20 |
21 | def prediction(X_test, model, tag_columns, test_folder):
22 | result = []
23 | for i in tqdm(range(0, len(X_test), TEST_BATCH)):
24 | X_batch = X_test[i:i+TEST_BATCH]
25 | X_batch = np.array([train_keras.preprocess(train_keras.load_image(fn, folder=test_folder)) for fn in X_batch])
26 | p = model.predict(X_batch)
27 | result.append(p)
28 |
29 | r = np.concatenate(result)
30 | r = r > 0.5
31 | table = []
32 | for row in r:
33 | t = []
34 | for b, v in zip(row, tag_columns):
35 | if b:
36 | t.append(v.replace('tag_', ''))
37 | table.append(' '.join(t))
38 | print('Prediction done !')
39 | return table
40 |
41 | def launch(test_folder):
42 | X_test, model, tag_columns = load_params()
43 | table = prediction(X_test, model, tag_columns, test_folder)
44 | try:
45 | df_pred = pd.DataFrame.from_dict({'image_name': X_test, 'tags': table})
46 | df_pred.to_csv('submission9.csv', index=False)
47 | except:
48 | np.save('image_name', X_test)
49 | np.save('table', table)
50 |
51 | if __name__ == '__main__':
52 | if len(sys.argv) > 1:
53 | test_folder = sys.argv[1]
54 | else:
55 | test_folder='test-jpg'
56 | launch(test_folder)
--------------------------------------------------------------------------------
/KaggleAmazon/predict_tf.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 | import cv2
4 | import pandas as pd
5 | import pickle
6 | import glob
7 |
8 | picture_size = 32
9 | batch_size = 128
10 | num_steps = 30001
11 |
12 | def load_set():
13 | global labels
14 |
15 |     df = pd.read_csv('./train.csv', index_col=0)
16 | df.tags = df.tags.apply(lambda x:x.split(' '))
17 | df = pd.concat([df, df.tags.apply(pd.Series)], axis=1)
18 | labels = list(set(np.concatenate(df.tags)))
19 | mapping = dict(enumerate(labels))
20 | reverse_mapping = {v:k for k,v in mapping.items()}
21 |
22 | pictures = []
23 | tags = []
24 | for pic in np.random.choice(df.index, len(df)):
25 | img = cv2.imread('./train-jpg/{}.jpg'.format(pic)) / 255.
26 | img = cv2.resize(img, (picture_size, picture_size))
27 | tag = np.zeros(len(labels))
28 | for t in df.loc[pic].tags:
29 | tag[reverse_mapping[t]] = 1
30 | pictures.append(img)
31 | tags.append(tag)
32 |
33 | return pictures, tags, mapping
34 |
35 | def separate_set(pictures, tags):
36 | pictures_train = np.array(pictures[:int(len(pictures)*0.8)])
37 | labels_train = np.array(tags[:int(len(tags)*0.8)])
38 | print('Train shape', pictures_train.shape, labels_train.shape)
39 |
40 | pictures_valid = np.array(pictures[int(len(pictures)*0.8) : int(len(pictures)*0.9)])
41 | labels_valid = np.array(tags[int(len(tags)*0.8) : int(len(tags)*0.9)])
42 | print('Valid shape', pictures_valid.shape, labels_valid.shape)
43 |
44 | pictures_test = np.array(pictures[int(len(pictures)*0.9):])
45 | labels_test = np.array(tags[int(len(tags)*0.9):])
46 | print('Test shape', pictures_test.shape, labels_test.shape)
47 | return pictures_train, labels_train, pictures_valid, labels_valid, pictures_test, labels_test
48 |
49 |
50 | def accuracy(predictions, labels):
51 | return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
52 | / predictions.shape[0])
53 |
54 | def conv2d(x, W, stride=(1, 1), padding='SAME'):
55 | return tf.nn.conv2d(x, W, strides=[1, stride[0], stride[1], 1],
56 | padding=padding)
57 |
58 | def max_pool(x, ksize=(2, 2), stride=(2, 2)):
59 | return tf.nn.max_pool(x, ksize=[1, ksize[0], ksize[1], 1],
60 | strides=[1, stride[0], stride[1], 1], padding='SAME')
61 |
62 | def weight_variable(shape):
63 | initial = tf.truncated_normal(shape, stddev=0.1)
64 | return tf.Variable(initial)
65 |
66 |
67 | def bias_variable(shape):
68 | initial = tf.constant(0.1, shape=shape)
69 | return tf.Variable(initial)
70 |
71 | def model(data):
72 | # First convolution
73 | h_conv1 = tf.nn.relu(conv2d(data, W_1) + b_1)
74 | h_pool1 = max_pool(h_conv1, ksize=(2, 2), stride=(2, 2))
75 |
76 | # Second convolution
77 | h_conv2 = tf.nn.relu(conv2d(h_pool1, W_2) + b_2)
78 | h_pool2 = max_pool(h_conv2, ksize=(2, 2), stride=(2, 2))
79 | # reshape tensor into a batch of vectors
80 | pool_flat = tf.reshape(h_pool2, [-1, int(picture_size / 4) * int(picture_size / 4) * 64])
81 |
82 | # Full connected layer with 1024 neurons.
83 | fc = tf.nn.relu(tf.matmul(pool_flat, W_fc) + b_fc)
84 |
85 | # Compute logits
86 | return tf.matmul(fc, W_logits) + b_logits
87 |
88 | if __name__ == "__main__":
89 | pictures, tags, mapping = load_set()
90 | pictures_train, labels_train, pictures_valid, labels_valid, pictures_test, labels_test = separate_set(pictures, tags)
91 |
92 | graph = tf.Graph()
93 | with graph.as_default():
94 | x = tf.placeholder(tf.float32, shape=(batch_size, picture_size, picture_size, 3))
95 | y_ = tf.placeholder(tf.float32, shape=(batch_size, len(labels)))
96 | x_valid = tf.constant(pictures_valid, dtype=tf.float32)
97 | x_test = tf.constant(pictures_test, dtype=tf.float32)
98 |
99 | # Weights and biases
100 |         W_1 = weight_variable([3, 3, 3, 32])  # 32 filters over the 3 input channels
101 |         b_1 = bias_variable([32])
102 |         W_2 = weight_variable([5, 5, 32, 64])  # 64 filters, matching the flattened size in model()
103 |         b_2 = bias_variable([64])
104 | W_fc = weight_variable([int(picture_size / 4) * int(picture_size / 4) * 64, 1024])
105 | b_fc = bias_variable([1024])
106 | W_logits = weight_variable([1024, len(labels)])
107 | b_logits = bias_variable([len(labels)])
108 |
109 | logits = model(x)
110 |         loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_))  # labels are one-hot style vectors
111 | optimizer = tf.train.AdamOptimizer(1e-4).minimize(loss)
112 | y = tf.nn.softmax(logits)
113 | y_valid = tf.nn.softmax(model(x_valid))
114 | y_test = tf.nn.softmax(model(x_test))
115 |
116 | with tf.Session(graph = graph) as session:
117 | tf.global_variables_initializer().run()
118 | for step in range(num_steps):
119 | offset = (step * batch_size) % (labels_train.shape[0] - batch_size)
120 | batch_data = pictures_train[offset:(offset + batch_size), :, :, :]
121 | batch_labels = labels_train[offset:(offset + batch_size), :]
122 | feed_dict = {x : batch_data, y_ : batch_labels}
123 | _, l, predictions = session.run([optimizer, loss, y], feed_dict=feed_dict)
124 | if (step % 1000 == 0):
125 | print('Minibatch loss at step %d: %.3f / Valid Accuracy %.2f' % (step, l,
126 | accuracy(y_valid.eval(), labels_valid)))
127 |         print(accuracy(y_test.eval(), labels_test))
128 |
--------------------------------------------------------------------------------
/KaggleAmazon/train_keras.py:
--------------------------------------------------------------------------------
1 | import numpy as np # linear algebra
2 | import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
3 | import os
4 | import glob
5 | import matplotlib.image as mpimg
6 | import gc
7 |
8 | from keras import backend as K
9 | from keras.models import Sequential
10 | from keras.layers import Dense, Dropout, Flatten, Lambda
11 | from keras.layers import Conv2D, MaxPooling2D, AveragePooling2D
12 | from keras.layers.normalization import BatchNormalization
13 | from keras.preprocessing.image import ImageDataGenerator
14 | from keras.regularizers import l2
15 | from keras.callbacks import ModelCheckpoint
16 |
17 | from sklearn.utils import shuffle
18 | from sklearn.model_selection import train_test_split
19 |
20 | import cv2
21 | from tqdm import tqdm
22 |
23 | picture_size = 128
24 | INPUT_SHAPE = (picture_size, picture_size, 3)  # JPG images have 3 channels
25 | EPOCHS = 20
26 | BATCH = 64
27 | PER_EPOCH = 256
28 |
29 | def fbeta(y_true, y_pred, threshold_shift=0):
30 | """
31 | Submissions are evaluated based on the F-beta score, it measures acccuracy using precision
32 | and recall. Beta = 2 here.
33 | """
34 | beta = 2
35 |
36 | # just in case of hipster activation at the final layer
37 | y_pred = K.clip(y_pred, 0, 1)
38 |
39 | # shifting the prediction threshold from .5 if needed
40 | y_pred_bin = K.round(y_pred + threshold_shift)
41 |
42 | tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
43 | fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
44 |     fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))
45 |
46 | precision = tp / (tp + fp)
47 | recall = tp / (tp + fn)
48 |
49 | beta_squared = beta ** 2
50 | return (beta_squared + 1) * (precision * recall) / (beta_squared * precision + recall + K.epsilon())
51 |
52 | def load_image(filename, resize=True, folder='train-jpg'):
53 | img = mpimg.imread('./{}/{}.jpg'.format(folder, filename))
54 | if resize:
55 | img = cv2.resize(img, (picture_size, picture_size))
56 | return np.array(img)
57 |
58 | # def mean_normalize(img):
59 | # return (img - img.mean()) / (img.max() - img.min())
60 |
61 | def preprocess(img):
62 | img = img / 127.5 - 1
63 | return img
64 |
65 | def generator(X, y, batch_size=32):
66 | X_copy, y_copy = X, y
67 | while True:
68 | for i in range(0, len(X_copy), batch_size):
69 | X_result, y_result = [], []
70 | for x, y in zip(X_copy[i:i+batch_size], y_copy[i:i+batch_size]):
71 | rx, ry = [load_image(x)], [y]
72 | rx = np.array([preprocess(x) for x in rx])
73 | ry = np.array(ry)
74 | X_result.append(rx)
75 | y_result.append(ry)
76 | X_result, y_result = np.concatenate(X_result), np.concatenate(y_result)
77 | yield shuffle(X_result, y_result)
78 | X_copy, y_copy = shuffle(X_copy, y_copy)
79 |
80 | def create_model():
81 | model = Sequential()
82 |
83 | model.add(Conv2D(48, (8, 8), strides=(2, 2), input_shape=INPUT_SHAPE, activation='elu'))
84 | model.add(BatchNormalization())
85 |
86 | model.add(Conv2D(64, (8, 8), strides=(2, 2), activation='elu'))
87 | model.add(BatchNormalization())
88 |
89 | model.add(Conv2D(96, (5, 5), strides=(2, 2), activation='elu'))
90 | model.add(BatchNormalization())
91 |
92 | model.add(Conv2D(128, (3, 3), activation='elu'))
93 | model.add(BatchNormalization())
94 |
95 | model.add(Conv2D(128, (2, 2), activation='elu'))
96 | model.add(BatchNormalization())
97 |
98 | model.add(Flatten())
99 | model.add(Dropout(0.3))
100 |
101 | model.add(Dense(1024, activation='elu'))
102 | model.add(BatchNormalization())
103 |
104 | model.add(Dense(64, activation='elu'))
105 | model.add(BatchNormalization())
106 |
107 | model.add(Dense(n_classes, activation='sigmoid'))
108 |
109 |
110 | model.compile(
111 | optimizer='adam',
112 | loss='binary_crossentropy',
113 | metrics=[fbeta, 'accuracy']
114 | )
115 |
116 | return model
117 |
118 | def load_set():
119 | global n_features, n_classes, tag_columns
120 |     df = pd.read_csv('./train.csv', index_col=0)
121 | df.tags = df.tags.apply(lambda x:x.split(' '))
122 | df.reset_index(inplace=True)
123 | labels = list(df.tags)
124 |
125 | tags = set(np.concatenate(labels))
126 | tag_list = list(tags)
127 | tag_list.sort()
128 | tag_columns = ['tag_' + t for t in tag_list]
129 | for t in tag_list:
130 | df['tag_' + t] = df['tags'].map(lambda x: 1 if t in x else 0)
131 |
132 | X = df['image_name'].values
133 | y = df[tag_columns].values
134 |
135 | n_features = 1
136 | n_classes = y.shape[1]
137 | X, y = shuffle(X, y)
138 |
139 | X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1)
140 |
141 | print('Train', X_train.shape, y_train.shape)
142 | print('Valid', X_valid.shape, y_valid.shape)
143 | return X_train, X_valid, y_train, y_valid
144 |
145 | def launch():
146 | X_train, X_valid, y_train, y_valid = load_set()
147 | model = create_model()
148 | filepath="./weights-improvement-{epoch:02d}-{val_fbeta:.3f}.hdf5"
149 | checkpoint = ModelCheckpoint(filepath, monitor='val_fbeta', verbose=1, save_best_only=True, mode='max')
150 | callbacks_list = [checkpoint]
151 |
152 | history = model.fit_generator(
153 | generator(X_train, y_train, batch_size=BATCH),
154 | steps_per_epoch=PER_EPOCH,
155 | epochs=EPOCHS,
156 | validation_data=generator(X_valid, y_valid, batch_size=BATCH),
157 | validation_steps=len(y_valid)//(4*BATCH),
158 | callbacks=callbacks_list
159 | )
160 | model.save('model_amazon2.h5')
161 |
162 | if __name__ == '__main__':
163 | launch()
164 |
165 |
166 |
--------------------------------------------------------------------------------
/KaggleBikeSharing/Kaggle_BikeSharing.py:
--------------------------------------------------------------------------------
1 | ## coding: utf8
2 | ## Kaggle Challenge
3 | ## Using Python 3.5
4 |
5 | ## Loading the libraries
6 | import numpy as np
7 | import pandas as pd
8 | import matplotlib.pyplot as plt
9 | import collections
10 | import sklearn
11 | from sklearn import linear_model
12 | from sklearn import svm
13 | from sklearn import learning_curve, cross_validation  # cross_validation is needed for train_test_split below
14 | from sklearn import ensemble
15 | #%matplotlib inline
16 |
17 | ## Loading the dataset
18 | df= pd.read_csv('data.csv')
19 |
20 | ## Converting the date from string
21 | ## format to datetime format
22 | df['date']=pd.to_datetime(df['datetime'])
23 |
24 | ## Extracting the month, the hour and
25 | ## the day of the week
26 | df.set_index('date', inplace=True)
27 | df['month']=df.index.month
28 | df['hours']=df.index.hour
29 | df['dayOfWeek']=df.index.weekday
30 |
31 | ###################
32 | ################### Influence of the time of day
33 | ###################
34 |
35 | ## Class used to plot the average
36 | ## number of bike rentals hour by hour
37 | ## for working days and
38 | ## non-working days
39 |
40 | class mean_30():
41 | def __init__(self, df):
42 | self.df=df
43 | def mean_hours_min(self,h):
44 | a = self.df["hours"] == h
45 | return self.df[a]["count"].mean()
46 | def transf(self, t):
47 | return self.mean_hours_min(t)
48 | def transfc(self, t):
49 | return self.err_hours_min(t)
50 | def vector_day(self):
51 | k = []
52 | for i in range(0,24):
53 | k.append(i)
54 | hour_day = pd.DataFrame()
55 | hour_day["A"] = k
56 | return hour_day["A"]
57 | def view(self):
58 | plt.plot(self.vector_day().apply(self.transf))
59 |
60 | ## Plotting the graphs
61 |
62 | fig=plt.figure()
63 | fig.suptitle('Location de velo selon l\'heure, jours ouvrables ou non', fontsize=13)
64 | plt.ylabel('nombre de locations de vélos')
65 | plt.xlabel('heure de la journée')
66 | moy0=mean_30(df[df['workingday']==0])
67 | moy0.view()
68 | moy1=mean_30(df[df['workingday']==1])
69 | moy1.view()
70 | plt.legend(['0','1'])
71 | plt.show()
72 |
73 | ## Class used to plot the standard
74 | ## deviation of the number of bike
75 | ## rentals hour by hour
76 | ## for a given day
77 |
78 | ## Change this value between 0 and 6
79 | ## to get another day of the
80 | ## week
81 |
82 | j=0
83 |
84 | class std_30():
85 | def __init__(self, df):
86 | self.df=df
87 | def mean_hours_std(self,j,h):
88 | y = self.df[self.df["dayOfWeek"]==j]["hours"] == h
89 | return self.df[self.df["dayOfWeek"]==j][y]["count"].mean()
90 | def err_hours(self,j,h):
91 | y = self.df[self.df["dayOfWeek"]==j]["hours"] == h
92 | return self.df[self.df["dayOfWeek"]==j][y]["count"].std()
93 | def transf_err(self,t):
94 | return self.mean_hours_std(j,t)
95 | def transf_err2(self,t):
96 | return self.err_hours(j,t)
97 | def vector_day(self):
98 | k = []
99 | for i in range(0,24):
100 | k.append(i)
101 | hour_std = pd.DataFrame()
102 | hour_std["A"] = k
103 | return hour_std["A"]
104 | def view(self):
105 | errors=self.vector_day().apply(self.transf_err2)
106 | fig, ax = plt.subplots()
107 | self.vector_day().apply(self.transf_err).plot(yerr=errors, ax=ax,label=str(j))
108 | plt.legend('0',loc=2,prop={'size':9})
109 |
110 | fig.suptitle('Deviation standard des locations de velo selon l\'heure', fontsize=13)
111 | std0=std_30(df)
112 | std0.view()
113 | plt.ylabel('nombre de locations de vélos')
114 | plt.xlabel('heure de la journée')
115 | plt.show()
116 |
117 | ###################
118 | ################### Influence of the month
119 | ###################
120 |
121 | ## Class used to plot the average
122 | ## number of bike rentals
123 | ## per month
124 |
125 | class month_30():
126 | def __init__(self, df):
127 | self.df=df
128 | def mean_hours_min(self,m):
129 | a = self.df["month"] == m
130 | return self.df[a]["count"].mean()
131 | def transf(self, t):
132 | return self.mean_hours_min(t)
133 | def transfc(self, t):
134 | return self.err_hours_min(t)
135 | def vector_day(self):
136 | k = []
137 | for i in range(0,13):
138 | k.append(i)
139 | hour_day = pd.DataFrame()
140 | hour_day["A"] = k
141 | return hour_day["A"]
142 | def view(self):
143 | plt.plot(self.vector_day().apply(self.transf))
144 |
145 | ## Plotting the graphs
146 |
147 | fig=plt.figure()
148 | fig.suptitle('Location de velo selon le mois', fontsize=13)
149 | moy0=month_30(df)
150 | moy0.view()
151 | plt.ylabel('nombre de locations de vélos')
152 | plt.xlabel('mois de l\' année')
153 | plt.show()
154 |
155 | ###################
156 | ################### Influence of the weather
157 | ###################
158 | plt.figure()
159 |
160 | ## Average bike demand
161 | ## for the weather conditions
162 | ## using a Python dictionary
163 |
164 | a={u'Degage/nuageux':df[df['weather']==1]['count'].mean(),
165 | u'Brouillard': df[df['weather']==2]['count'].mean(),
166 | u'Legere pluie':df[df['weather']==3]['count'].mean()
167 | }
168 |
169 | width = 1/1.6
170 | plt.bar(range(len(a)), a.values(),width,color="blue",align='center')
171 | plt.xticks(range(len(a)), a.keys())
172 | plt.ylabel('nombre de locations de vélos')
173 | plt.title('Moyenne des locations de velos pour differentes conditions meteorologiques')
174 | plt.show()
175 |
176 | ###################
177 | ################### Influence of wind, temperature and humidity
178 | ###################
179 |
180 | ## Average bike demand
181 | ## for several parameters
182 | ## using a Python dictionary
183 |
184 | D = {u'V>13k/h':df[df['windspeed']>13]['count'].mean(),
185 | u'V<13k/h': df[df['windspeed']<13]['count'].mean(),
186 | u'T<24°C':df[df['atemp']<24]['count'].mean(),
187 | u'T>24°C':df[df['atemp']>24]['count'].mean(),
188 | u'H>62%': df[df['humidity']>62]['count'].mean(),
189 | u'H<62%':df[df['humidity']<62]['count'].mean()
190 | }
191 | od = collections.OrderedDict(sorted(D.items()))
192 | width = 1/1.6
193 | plt.figure()
194 | plt.bar(range(len(od)), od.values(),width,color="blue",align='center')
195 | plt.xticks(range(len(od)), od.keys())
196 | plt.title('Variation de la demande en fonction de 3 variables')
197 | plt.show()
198 |
199 | ###################
200 | ################### Conclusion and choice of the influential parameters
201 | ###################
202 |
203 | ## We compute the correlation matrix to
204 | ## remove correlated variables
205 |
206 | df.corr()
207 | plt.matshow(df.corr())
208 | plt.yticks(range(len(df.corr().columns)), df.corr().columns);
209 | plt.colorbar()
210 | plt.show()
211 |
212 | df1=df.drop(['workingday','datetime','season','atemp','holiday','registered','casual'],axis=1)
213 |
214 | target=df1['count'].values # set of outputs to predict (yi)
215 |
216 | train=df1.drop('count',axis=1) # set of input features (xi)
217 |
218 | ## Random split between training data
219 | ## and test data
220 | X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
221 | train, target, test_size=0.33, random_state=42)
222 |
223 | ## Function used to plot
224 | ## the learning curves
225 |
226 | def plot_learning_curve(estimator, title, X, y, cv=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
227 | plt.figure()
228 | plt.title(title)
229 | plt.xlabel("Training examples")
230 | plt.ylabel("Score")
231 | train_sizes, train_scores, test_scores = sklearn.learning_curve.learning_curve(
232 | estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
233 | train_scores_mean = np.mean(train_scores, axis=1)
234 | train_scores_std = np.std(train_scores, axis=1)
235 | test_scores_mean = np.mean(test_scores, axis=1)
236 | test_scores_std = np.std(test_scores, axis=1)
237 | plt.grid()
238 | plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
239 | plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
240 | plt.legend(loc="best")
241 | return plt
242 |
243 | ###################
244 | ################### Linear Regression
245 | ###################
246 |
247 | linreg=linear_model.LinearRegression()
248 | linreg.fit(X_train,Y_train)
249 | tableau=[['Paramètre', 'Coefficient']] # list used to inspect the coefficient values
250 | col=list(train.columns.values)
251 | for i in range(7):
252 | tableau.append([col[i],linreg.coef_[i]])
253 |
254 | print ("Training Score Regression Linéare : ", str(linreg.score(X_train,Y_train)))
255 | print ("Test Score Regression Linéare : " , str(linreg.score(X_test,Y_test)))
256 | print ("Coefficients Regression Linéare :")
257 | print (tableau)
258 |
259 | ###################
260 | ################### Gradient Boosting Regression
261 | ###################
262 |
263 | gbr = ensemble.GradientBoostingRegressor(n_estimators=2000)
264 | gbr.fit(X_train,Y_train)
265 | print ("Training Score GradientBoosting: ", str(gbr.score(X_train,Y_train)))
266 | print ("Test Score GradientBoosting: " , str(gbr.score(X_test,Y_test)))
267 |
268 | title = "Learning Curves (GradientBoosting)"
269 | estimator = ensemble.GradientBoostingRegressor(n_estimators=2000)
270 | plot_learning_curve(estimator, title, X_train, Y_train)
271 | plt.show()
272 |
273 | ###################
274 | ################### Support Vector Regression
275 | ###################
276 | svr=svm.SVR(kernel='linear')
277 | svr.fit(X_train,Y_train)
278 | print ("Training Score SVR: ", str(svr.score(X_train,Y_train)))
279 | print ("Test Score SVR : " , str(svr.score(X_test,Y_test)))
280 |
281 | ###################
282 | ################### Random Forest Regression
283 | ###################
284 |
285 | rf=ensemble.RandomForestRegressor(n_estimators=30,oob_score=True) # 30 trees and OOB estimation
286 | rf.fit(train,target)
287 | print ("Training Score RandomForest: ", str(rf.score(train,target)))
288 | print ("OOB Score RandomForest: " , str(rf.oob_score_))
289 |
290 | ###################
291 | ################### Improvements
292 | ###################
293 |
294 | ## We look for the most influential parameter of our model
295 | ## To do so, we reuse algorithms trained above:
296 | ## linear regression and Random Forest
297 | def param_import():
298 | col=list(train.columns.values)
299 |     # first, the linear regression coefficients
300 |     index1=linreg.coef_.argsort()[-2:][-1] # sorted indices of the coefficients, take the largest one
301 |     index2=linreg.coef_.argsort()[-2:][0] # sorted indices of the coefficients, take the second largest one
302 | print('Pour les améliorations, calculons les paramètres les plus influents : ')
303 | print('...')
304 | print('Pour la regréssion linéaire, les paramètres les plus influents sont :', col[index1],' et ',col[index2])
305 |     # then the Random Forest feature importances
306 | index3=rf.feature_importances_.argsort()[-2:][-1]
307 | index4=rf.feature_importances_.argsort()[-2:][0]
308 | print('Pour l\'algorithme de RF, les paramètres les plus influents sont :', col[index3],' et ',col[index4])
309 |     # then the Gradient Boosting feature importances
310 | index5=gbr.feature_importances_.argsort()[-2:][-1]
311 | index6=gbr.feature_importances_.argsort()[-2:][0]
312 | print('Pour l\'algorithme de Gradient Boosting, les paramètres les plus influents sont :', col[index5],' et ',col[index6])
313 | if index3==index5:
314 | plus_import=index3
315 | elif index5==index4:
316 | plus_import=index4
317 | return plus_import
318 |
319 | print('Le paramètre le plus important est donc : ', col[param_import()])
320 |
321 | ## We want to show that some time slots are
322 | ## more important for bike demand
323 |
324 | soir = df[df['hours'].isin([17,18,19])]
325 | peak_soir=soir[soir['workingday']==1]
326 | matin = df[df['hours'].isin([7,8,9])]
327 | peak_matin=matin[matin['workingday']==1]
328 | we = df[df['hours'].isin([12,13,14,15,16])]
329 | peak_we=we[we['workingday']==0]
330 |
331 | print('Calculons ensuite la moyenne du nombre de vélos pour plusieurs créneaux horaires : ')
332 | print('...')
333 | print('La moyenne totale de la demande est : ', df['count'].mean())
334 | print('En semaine, entre 17 et 19h : ', peak_soir['count'].mean())
335 | print('En semaine, entre 7 et 9h : ', peak_matin['count'].mean())
336 | print('En week-end, entre 12 et 16h : ', peak_we['count'].mean())
337 |
338 |
--------------------------------------------------------------------------------
/KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf
--------------------------------------------------------------------------------
/KaggleBikeSharing/README.md:
--------------------------------------------------------------------------------
1 | # Kaggle Bike Sharing
2 |
3 | [Kaggle Link](https://www.kaggle.com/c/bike-sharing-demand)
4 |
5 | Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city.
6 | In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
7 |
8 | The goal of this challenge is to build a model that predicts the count of bikes shared, based exclusively on contextual features. The first part of the challenge was to understand, analyse and process the dataset; I wanted to produce meaningful information with plots. The second part was to build a model with a Machine Learning library in order to predict the count.
9 |
10 | The most important parameters were the hour, the month, the temperature and the weather.
11 | Multiple models were tested during this challenge (Linear Regression, Gradient Boosting, SVR and Random Forest). Finally, the chosen model was Random Forest. The accuracy was measured with [Out-of-Bag Error](https://www.stat.berkeley.edu/~breiman/OOBestimation.pdf) and the OOB score was 0.85.
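A minimal sketch of this final model, assuming `train` and `target` are the processed feature matrix and rental counts from the accompanying script:
```
from sklearn.ensemble import RandomForestRegressor

# Random Forest with the Out-of-Bag estimate used as the accuracy measure
rf = RandomForestRegressor(n_estimators=30, oob_score=True)
rf.fit(train, target)
print(rf.oob_score_)
```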
12 |
13 | More detailed explanations (with a couple of plots) have been written in French.
14 | -->[French Explanations PDF](https://github.com/alexattia/Online-Challenge/blob/master/KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf)
15 |
16 |
--------------------------------------------------------------------------------
/KaggleInstaCart/Predictions.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 6,
6 | "metadata": {
7 | "ExecuteTime": {
8 | "end_time": "2017-06-21T21:37:22.901368Z",
9 | "start_time": "2017-06-21T21:37:21.787528Z"
10 | },
11 | "collapsed": true
12 | },
13 | "outputs": [],
14 | "source": [
15 | "import numpy as np\n",
16 | "import pandas as pd\n",
17 | "import matplotlib.pyplot as plt\n",
18 | "import seaborn as sns\n",
19 | "import calendar\n",
20 | "%matplotlib inline"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 7,
26 | "metadata": {
27 | "ExecuteTime": {
28 | "end_time": "2017-06-21T21:37:36.669912Z",
29 | "start_time": "2017-06-21T21:37:22.902860Z"
30 | },
31 | "collapsed": true
32 | },
33 | "outputs": [],
34 | "source": [
35 | "# departements and aisle id\n",
36 | "aisles = pd.read_csv('./data/aisles.csv')\n",
37 | "departements = pd.read_csv('./data/departments.csv')\n",
38 | "products = pd.read_csv('./data/products.csv')\n",
39 | "orderp = pd.read_csv('./data/order_products__prior.csv')\n",
40 | "ordert = pd.read_csv('./data/order_products__train.csv')\n",
41 | "orders = pd.read_csv('./data/orders.csv')"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 8,
47 | "metadata": {
48 | "ExecuteTime": {
49 | "end_time": "2017-06-21T21:37:43.774516Z",
50 | "start_time": "2017-06-21T21:37:36.671328Z"
51 | }
52 | },
53 | "outputs": [],
54 | "source": [
55 | "order_products = pd.merge(orderp, orders, on = 'order_id')"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": null,
61 | "metadata": {
62 | "ExecuteTime": {
63 | "start_time": "2017-06-21T21:37:22.391Z"
64 | }
65 | },
66 | "outputs": [],
67 | "source": [
68 | "data = order_products.groupby(['product_id', 'user_id']).agg({'order_id':'count',\n",
69 | " 'order_number':[np.min, np.max],\n",
70 | " 'add_to_cart_order':'mean'})\n",
71 | "data.columns = data.columns.droplevel(0)\n",
72 | "data = data.rename(columns={'mean':\"up_average_cart_position\", \n",
73 | " 'amin':\"up_first_order\", \n",
74 | " 'amax':\"up_last_order\", \n",
75 | " 'count':\"up_orders\"})\n",
76 | "data = data.reset_index()"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": null,
82 | "metadata": {
83 | "ExecuteTime": {
84 | "start_time": "2017-06-21T21:37:23.516Z"
85 | },
86 | "scrolled": true
87 | },
88 | "outputs": [],
89 | "source": [
90 | "d = pd.merge(data, order_products, on='product_id')"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": null,
96 | "metadata": {
97 | "collapsed": true
98 | },
99 | "outputs": [],
100 | "source": []
101 | }
102 | ],
103 | "metadata": {
104 | "kernelspec": {
105 | "display_name": "workenv",
106 | "language": "python",
107 | "name": "workenv"
108 | },
109 | "language_info": {
110 | "codemirror_mode": {
111 | "name": "ipython",
112 | "version": 3
113 | },
114 | "file_extension": ".py",
115 | "mimetype": "text/x-python",
116 | "name": "python",
117 | "nbconvert_exporter": "python",
118 | "pygments_lexer": "ipython3",
119 | "version": "3.5.3"
120 | },
121 | "nav_menu": {},
122 | "toc": {
123 | "navigate_menu": true,
124 | "number_sections": true,
125 | "sideBar": true,
126 | "threshold": 6,
127 | "toc_cell": false,
128 | "toc_section_display": "block",
129 | "toc_window_display": false
130 | }
131 | },
132 | "nbformat": 4,
133 | "nbformat_minor": 2
134 | }
135 |
--------------------------------------------------------------------------------
/KaggleMovieRating/README.md:
--------------------------------------------------------------------------------
1 | # Predict IMDB movie rating
2 |
3 | Project inspired by Chuan Sun's [work](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset), without Scrapy
4 |
5 | Main question: how can we tell the greatness of a movie before it is released in cinemas?
6 |
7 | ## First Part - Parsing data
8 |
9 | The first part aims to parse data from the IMDb and The Numbers websites: casting information, directors, production companies, awards, genres, budget, gross, description, imdb_rating, etc.
10 | To create the movie_contents.json file:
11 | ``python3 parser.py nb_elements``
12 |
13 | ## Second Part - Data Analysis
14 |
15 | The second part is to analyze the dataframe and observe correlations between variables. For example, are movie awards correlated with the worldwide gross? Is the cast of a well-liked movie also well liked?
16 | See the jupyter notebook file.
17 |
18 | 
19 |
20 | As we can see in the pictures above, the IMDB score is correlated with the number of awards and the gross, but not really with the production budget or the number of Facebook likes of the cast.
21 | Obviously, domestic and worldwide gross are highly correlated. Moreover, the larger the production budget, the larger the gross.
22 | As shown in the notebook, the budget is not really correlated with the number of awards.
23 | What's funny is that the popularity of the third most famous actor matters more for the IMDB score than the popularity of the most famous actor (correlation 0.2 vs 0.08).
24 | (Many other charts in the Jupyter notebook)
25 |
26 | ## Third Part - Predict the IMDB score
27 |
28 | We use Machine Learning to predict the IMDB score from the meaningful variables,
29 | using a Random Forest algorithm (500 estimators). A minimal sketch is shown below the chart.
30 | 
31 |
32 |
--------------------------------------------------------------------------------
/KaggleMovieRating/genre.json:
--------------------------------------------------------------------------------
1 | {"Action": "0",
2 | "Adventure": "17",
3 | "Animation": "14",
4 | "Biography": "20",
5 | "Comedy": "21",
6 | "Crime": "10",
7 | "Documentary": "3",
8 | "Drama": "18",
9 | "Family": "19",
10 | "Fantasy": "6",
11 | "History": "8",
12 | "Horror": "4",
13 | "Music": "12",
14 | "Musical": "2",
15 | "Mystery": "16",
16 | "Romance": "9",
17 | "Sci-Fi": "13",
18 | "Short": "5",
19 | "Sport": "11",
20 | "Thriller": "7",
21 | "War": "15",
22 | "Western": "1"}
--------------------------------------------------------------------------------
/KaggleMovieRating/imdb_movie_content.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 | from bs4 import BeautifulSoup
4 | import time
5 | import random
6 | import re
7 | import datetime
8 |
9 | def parse_facebook_likes_number(num_likes_string):
10 | if num_likes_string[-1] == 'K':
11 | thousand = 1000
12 | else:
13 | thousand = 1
14 | return float(num_likes_string.replace('K','')) * thousand
15 |
16 | def parse_duration(time_string):
17 | time_string = time_string.split('min')[0]
18 | x = time.strptime(time_string,'%Hh%M')
19 | return datetime.timedelta(hours=x.tm_hour,minutes=x.tm_min,seconds=x.tm_sec).total_seconds()
20 |
21 | class ImdbMovieContent():
22 | def __init__(self, movies):
23 | self.movies = movies
24 | self.base_url = "http://www.imdb.com"
25 |
26 | def get_facebook_likes(self, entity_id):
27 | if entity_id.startswith('nm'):
28 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Fname%2F{}%2F&colorscheme=light".format(entity_id)
29 | elif entity_id.startswith('tt'):
30 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Ftitle%2F{}%2F&colorscheme=light".format(entity_id)
31 | else:
32 | url = None
33 |         time.sleep(random.uniform(0, 0.25)) # randomly snooze for up to 0.25 second
34 | try:
35 | response = requests.get(url)
36 | soup = BeautifulSoup(response.text, "lxml")
37 | sentence = soup.find_all(id="u_0_2")[0].span.string # get sentence like: "43K people like this"
38 | num_likes = sentence.split(" ")[0]
39 |         except Exception as e:
40 |             return None  # parse_facebook_likes_number would fail on a None value
41 |         return parse_facebook_likes_number(num_likes)
42 |
43 | def get_id_from_url(self, url):
44 | if url is None:
45 | return None
46 | return url.split('/')[4]
47 |
48 | def parse(self):
49 | movies_content = []
50 | for movie in self.movies:
51 | imdb_url = movie['imdb_url']
52 | response = requests.get(imdb_url)
53 | bs = BeautifulSoup(response.text, 'lxml')
54 | movies_content.append(self.get_content(bs))
55 | return movies_content
56 |
57 | def get_awards(self, movie_link):
58 | awards_url = movie_link + 'awards'
59 | response = requests.get(awards_url)
60 | bs = BeautifulSoup(response.text, 'lxml')
61 | awards = bs.find_all('tr')
62 | award_dict = []
63 | for award in awards:
64 | if 'rowspan' in award.find('td').attrs:
65 | rowspan = int(award.find('td')['rowspan'])
66 | if rowspan == 1:
67 | award_dict.append({'category' : award.find('span', class_='award_category').text,
68 | 'type' : award.find('b').text,
69 | 'award': award.find('td', class_='award_description').text.split('\n')[1].replace(' ', '')})
70 | else:
71 | index = awards.index(award)
72 | dictt = {'category':award.find('span', class_='award_category').text,
73 | 'type' : award.find('b').text}
74 | awards_ = []
75 | for elem in awards[index:index+rowspan]:
76 | award_dict.append({'category':award.find('span', class_='award_category').text,
77 | 'type' : award.find('b').text,
78 | 'award': elem.find('td', class_='award_description').text.split('\n')[1].replace(' ', '')})
79 | award_nominated = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Nominated']
80 | award_won = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Won']
81 | return award_nominated, award_won
82 |
83 | def get_content(self, bs):
84 | movie = {}
85 | try:
86 | title_year = bs.find('div', class_ = 'title_wrapper').find('h1').text.encode('latin').decode('utf-8', 'ignore').replace(') ','').split('(')
87 | movie['movie_title'] = title_year[0]
88 | except:
89 | movie['movie_title'] = None
90 | try:
91 | movie['title_year'] = title_year[1]
92 | except:
93 | movie['title_year'] = None
94 | # try:
95 | # movie['genres'] = '|'.join([genre.text.replace(' ','') for genre in bs.find('div', {'itemprop': 'genre'}).find_all('a')])
96 | # except:
97 | # movie['genres'] = None
98 | try:
99 | title_details = bs.find('div', {'id':'titleDetails'})
100 | movie['language'] = '|'.join([language.text for language in title_details.find_all('a', href=re.compile('language'))])
101 | movie['country'] = '|'.join([country.text for country in title_details.find_all('a', href=re.compile('country'))])
102 | except:
103 | movie['language'] = None
104 | try:
105 | keywords = bs.find_all('span', {'itemprop':'keywords'})
106 | movie['keywords'] = '|'.join([key.text for key in keywords])
107 | except:
108 | movie['keywords'] = None
109 | try:
110 | movie['storyline'] = bs.find('div', {'id':'titleStoryLine'}).find('div', {'itemprop':'description'}).text.replace('\n', '')
111 | except:
112 | movie['storyline'] = None
113 | try:
114 | movie['contentRating'] = bs.find('span', {'itemprop':'contentRating'}).text.split(' ')[1]
115 | except:
116 | movie['contentRating'] = None
117 | try:
118 | movie['color'] = bs.find('a', href=re.compile('colors')).text
119 | except:
120 | movie['color'] = None
121 | try:
122 | movie['idmb_score'] = bs.find('span', {'itemprop':'ratingValue'}).text
123 | except:
124 | movie['idmb_score'] = None
125 | try:
126 |             movie['num_voted_users'] = bs.find('span', {'itemprop':'ratingCount'}).text.replace(',', '')
127 | except:
128 | movie['num_voted_users'] = None
129 | try:
130 | movie['duration_sec'] = parse_duration(bs.find('time', {'itemprop':'duration'}).text.replace(' ','').replace('\n', ''))
131 | except:
132 | movie['duration_sec'] = None
133 | try:
134 | review_counts = [int(count.text.split(' ')[0].replace(',', '')) for count in bs.find_all('span', {'itemprop':'reviewCount'})]
135 | movie['num_user_for_reviews'] = review_counts[0]
136 | except:
137 | movie['num_user_for_reviews'] = None
138 | try:
139 | prod_co = bs.find_all('span', {'itemtype': re.compile('Organization')})
140 | movie['production_co'] = [elem.text.replace('\n','') for elem in prod_co]
141 | except:
142 | movie['production_co'] = None
143 | try:
144 | movie['num_critic_for_reviews'] = review_counts[1]
145 | except:
146 | movie['num_critic_for_reviews'] = None
147 | try:
148 | actors = bs.find('table', {'class':'cast_list'}).find_all('a', {'itemprop':'url'})
149 | pairs_for_rows = [(self.base_url + actor['href'], actor.text.replace('\n', '')[1:]) for actor in actors]
150 | movie['cast_info'] = [{'actor_name' : actor[1],
151 | 'actor_link' : actor[0],
152 | 'actor_fb_likes': self.get_facebook_likes(self.get_id_from_url(actor[0]))} for actor in pairs_for_rows]
153 | except:
154 | movie['cast_info'] = None
155 | try:
156 | director = bs.find('span', {'itemprop':'director'}).find('a')
157 | director = (self.base_url + director['href'].replace('?','/?'), director.text)
158 | movie['director_info'] = {'director_name':director[1],
159 | 'director_link':director[0],
160 | 'director_fb_links': self.get_facebook_likes(self.get_id_from_url(director[0]))}
161 | except:
162 | movie['director_info'] = None
163 | try:
164 | movie['movie_imdb_link'] = bs.find('link')['href']
165 | movie['num_facebook_like'] = self.get_facebook_likes(self.get_id_from_url(movie['movie_imdb_link']))
166 | except:
167 | movie['num_facebook_like'] = None
168 | try:
169 | poster_image_url = bs.find('div', {'class':'poster'}).find('img')['src']
170 | movie['image_urls'] = poster_image_url.split("_V1_")[0] + "_V1_.jpg"
171 | except:
172 | movie['image_urls'] = None
173 | try:
174 | award_nominated, award_won = self.get_awards(movie['movie_imdb_link'])
175 | movie['awards'] = {"won": award_won, 'nominated': award_nominated}
176 | except Exception as e:
177 | print(e)
178 | movie['awards'] = None
179 | return movie
180 |
181 |
182 |
--------------------------------------------------------------------------------
/KaggleMovieRating/parser.py:
--------------------------------------------------------------------------------
1 | from bs4 import BeautifulSoup
2 | import json
3 | import requests
4 | import urllib
5 | from tqdm import tqdm
6 | import locale
7 | import pandas as pd
8 | import re
9 | import time
10 | import random
11 | import sys
12 |
13 | from imdb_movie_content import ImdbMovieContent
14 |
15 | def parse_price(price):
16 | """
17 | Convert string price to numbers
18 | """
19 | if not price:
20 | return 0
21 | price = price.replace(',', '')
22 | return locale.atoi(re.sub('[^0-9,]', "", price))
23 |
24 | def get_movie_budget():
25 | """
26 | Parse the-numbers.com to get the budget data.
27 | :return: list of dictionaries with budget and gross
28 | """
29 | movie_budget_url = 'http://www.the-numbers.com/movie/budgets/all'
30 | response = requests.get(movie_budget_url)
31 | bs = BeautifulSoup(response.text, 'lxml')
32 | table = bs.find('table')
33 | rows = [elem for elem in table.find_all('tr') if elem.get_text() != '\n']
34 |
35 | movie_budget = []
36 | for row in rows[1:]:
37 | specs = [elem.get_text() for elem in row.find_all('td')]
38 | movie_name = specs[2].encode('latin1').decode('utf8', 'ignore')
39 | movie_budget.append({'release_date': specs[1],
40 | 'movie_name': movie_name,
41 | 'production_budget': parse_price(specs[3]),
42 | 'domestic_gross': parse_price(specs[4]),
43 | 'worldwide_gross': parse_price(specs[5])})
44 |
45 | return movie_budget
46 |
47 | def get_imdb_urls(movie_budget, nb_elements=None):
48 | """
49 | Parse the IMDb website to get IMDb movie links.
50 | Dump a json file with budget, gross and IMDb urls.
51 | :param movie_budget: list of dictionaries with budget and gross
52 | :param nb_elements: number of movies to parse
53 | """
54 | for movie in tqdm(movie_budget[1000:1000+nb_elements] if nb_elements else movie_budget[1000:]):
55 | movie_name = movie['movie_name']
56 | title_url = urllib.parse.quote(movie_name.encode('utf-8'))
57 | imdb_search_link = "http://www.imdb.com/find?ref_=nv_sr_fn&q={}&s=tt".format(title_url)
58 | response = requests.get(imdb_search_link)
59 | bs = BeautifulSoup(response.text, 'lxml')
60 | results = bs.find("table", class_= "findList" )
61 | try:
62 | movie['imdb_url'] = "http://www.imdb.com" + results.find('td', class_='result_text').find('a')['href']
63 | except:
64 | movie['imdb_url'] = None
65 |
66 | with open('movie_budget.json', 'w') as fp:
67 | json.dump(movie_budget, fp)
68 |
69 | def get_imdb_content(movie_budget_path, nb_elements=None):
70 | """
71 | Parse the IMDb website to get IMDb content: awards, casting, description, etc.
72 | Dump a json file with the IMDb content.
73 | :param movie_budget_path: path of the movie_budget.json file
74 | :param nb_elements: number of movies to parse
75 | """
76 | with open(movie_budget_path, 'r') as fp:
77 | movies = json.load(fp)
78 | content_provider = ImdbMovieContent(movies)
79 | contents = []
80 | threshold = 1300
81 | for i, movie in enumerate(movies[threshold:threshold+nb_elements] if nb_elements else movies[threshold:]):
82 | time.sleep(random.uniform(0, 0.25))
83 | print("\r%i / %i" % (i, len(movies[threshold:threshold+nb_elements] if nb_elements else movies[threshold:])), end="")
84 | try:
85 | imdb_url = movie['imdb_url']
86 | response = requests.get(imdb_url)
87 | bs = BeautifulSoup(response.text, 'lxml')
88 | movies_content = content_provider.get_content(bs)
89 | contents.append(movies_content)
90 | except Exception as e:
91 | print(e)
92 | pass
93 | with open('movie_contents7.json', 'w') as fp:
94 | json.dump(contents, fp)
95 |
96 | def parse_awards(movie):
97 | """
98 | Convert awards information to a dictionary for the dataframe.
99 | Keeping only Oscar, BAFTA, Golden Globe and Palme d'Or awards.
100 | :param movie: movie dictionary
101 | :return: well-formatted dictionary with awards information
102 | """
103 | awards_kept = ['Oscar', 'BAFTA Film Award', 'Golden Globe', 'Palme d\'Or']
104 | awards_category = ['won', 'nominated']
105 | parsed_awards = {}
106 | for category in awards_category:
107 | for awards_type in awards_kept:
108 | awards_cat = [award for award in movie['awards'][category] if award['category'] == awards_type]
109 | for k, award in enumerate(awards_cat):
110 | parsed_awards['{}_{}_{}'.format(awards_type, category, k+1)] = award["award"]
111 | return parsed_awards
112 |
113 | def parse_actors(movie):
114 | """
115 | Convert casting information to a dictionary for the dataframe.
116 | Keeping only the 3 actors with the most Facebook likes.
117 | :param movie: movie dictionary
118 | :return: well-formatted dictionary with casting information
119 | """
120 | sorted_actors = sorted(movie['cast_info'], key=lambda x:x['actor_fb_likes'], reverse=True)
121 | top_k = 3
122 | parsed_actors = {}
123 | parsed_actors['total_cast_fb_likes'] = sum([actor['actor_fb_likes'] for actor in movie['cast_info']]) + movie['director_info']['director_fb_links']
124 | for k, actor in enumerate(sorted_actors[:top_k]):
125 | if k < len(sorted_actors):
126 | parsed_actors['actor_{}_name'.format(k+1)] = actor['actor_name']
127 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = actor['actor_fb_likes']
128 | else:
129 | parsed_actors['actor_{}_name'.format(k+1)] = None
130 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = None
131 | return parsed_actors
132 |
133 | def parse_production_company(movie):
134 | """
135 | Convert production companies to a dictionary for the dataframe.
136 | Keeping only 3 production companies.
137 | :param movie: movie dictionary
138 | :return: well-formatted dictionary with production companies
139 | """
140 | parsed_production_co = {}
141 | top_k = 3
142 | production_companies = movie['production_co'][:top_k]
143 | for k, company in enumerate(production_companies):
144 | if k < len(movie['production_co']):
145 | parsed_production_co['production_co_{}'.format(k+1)] = company
146 | else:
147 | parsed_production_co['production_co_{}'.format(k+1)] = None
148 | return parsed_production_co
149 |
150 | def parse_genres(movie):
151 | """
152 | Convert genres to a dictionary for the dataframe.
153 | :param movie: movie dictionary
154 | :return: well-formatted dictionary with genres
155 | """
156 | parsed_genres = {}
157 | g = movie['genres']
158 | with open('genre.json', 'r') as f:
159 | genres = json.load(f)
160 | for k, genre in enumerate(g):
161 | if genre in genres:
162 | parsed_genres['genre_{}'.format(k+1)] = genres[genre]
163 | return parsed_genres
164 |
165 | def create_dataframe(movies_content_path, movie_budget_path):
166 | """
167 | Create a dataframe from the movie_budget.json and movie_contents.json files.
168 | :param movies_content_path: path of the movies_content.json file
169 | :param movie_budget_path: path of the movie_budget.json file
170 | :return: well-formatted dataframe
171 | """
172 | with open(movies_content_path, 'r') as fp:
173 | movies = json.load(fp)
174 | with open(movie_budget_path, 'r') as fp:
175 | movies_budget = json.load(fp)
176 | movies_list = []
177 | for movie in movies:
178 | content = {k:v for k,v in movie.items() if k not in ['awards', 'cast_info', 'director_info', 'production_co']}
179 | name = movie['movie_title']
180 | try:
181 | budget = [film for film in movies_budget if film['movie_name']==name][0]
182 | budget = {k:v for k,v in budget.items() if k not in ['imdb_url', 'movie_name']}
183 | content.update(budget)
184 | except:
185 | pass
186 | try:
187 | content.update(parse_awards(movie))
188 | except:
189 | pass
190 | try:
191 | content.update(parse_genres(movie))
192 | except:
193 | pass
194 | try:
195 | content.update({k:v for k,v in movie['director_info'].items() if k!= 'director_link'})
196 | except:
197 | pass
198 | try:
199 | content.update(parse_production_company(movie))
200 | except:
201 | pass
202 | try:
203 | content.update(parse_actors(movie))
204 | except:
205 | pass
206 | movies_list.append(content)
207 | df = pd.DataFrame(movies_list)
208 | df = df[pd.notnull(df.idmb_score)]
209 | df.idmb_score = df.idmb_score.apply(float)
210 | return df
211 |
212 | if __name__ == '__main__':
213 | if len(sys.argv) > 1:
214 | nb_elements = int(sys.argv[1])
215 | else:
216 | nb_elements = None
217 | # movie_budget = get_movie_budget()
218 | # movies = get_imdb_urls(movie_budget, nb_elements=nb_elements)
219 | get_imdb_content("movie_budget.json", nb_elements=nb_elements)
220 |
221 |
222 |
--------------------------------------------------------------------------------
/KaggleSoccer/Predicting Ratings.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 25,
6 | "metadata": {
7 | "ExecuteTime": {
8 | "end_time": "2017-05-21T05:03:03.866314Z",
9 | "start_time": "2017-05-21T05:03:03.851669Z"
10 | },
11 | "collapsed": true
12 | },
13 | "outputs": [],
14 | "source": [
15 | "import pandas as pd\n",
16 | "import numpy as np\n",
17 | "import seaborn as sns\n",
18 | "import matplotlib.pyplot as plt\n",
19 | "import imp\n",
20 | "import json\n",
21 | "from sklearn.preprocessing import LabelEncoder\n",
22 | "from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingRegressor\n",
23 | "from sklearn.linear_model import Ridge, LinearRegression, Lasso\n",
24 | "from sklearn.kernel_ridge import KernelRidge\n",
25 | "from sklearn.svm import SVR, LinearSVC\n",
26 | "from sklearn.model_selection import train_test_split\n",
27 | "import whoscored\n",
28 | "from unidecode import unidecode\n",
29 | "import pickle\n",
30 | "import difflib\n",
31 | "import glob\n",
32 | "%matplotlib inline"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "## Data Processing"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 17,
45 | "metadata": {
46 | "ExecuteTime": {
47 | "end_time": "2017-05-21T05:00:15.202603Z",
48 | "start_time": "2017-05-21T05:00:15.081729Z"
49 | },
50 | "collapsed": true
51 | },
52 | "outputs": [],
53 | "source": [
54 | "teams = {}\n",
55 | "for file in glob.glob('./games/*.json'):\n",
56 | " with open(file,'r') as f:\n",
57 | " teams[file.split('/')[2].split('_')[0]] = json.load(f)"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 21,
63 | "metadata": {
64 | "ExecuteTime": {
65 | "end_time": "2017-05-21T05:01:41.986757Z",
66 | "start_time": "2017-05-21T05:01:41.952494Z"
67 | },
68 | "collapsed": true
69 | },
70 | "outputs": [],
71 | "source": [
72 | "def get_position_sub(row):\n",
73 | " if row['Position'] == 'Sub':\n",
74 | " name = row['Player']\n",
75 | " try:\n",
76 | " row['Position'] = [k for k in df[df.Player == name].Position.value_counts().index if k !='Sub'][0]\n",
77 | " except:\n",
78 | " row['Position'] = 'substitute'\n",
79 | " return row\n",
80 | "\n",
81 | "stop_words = ['<','Bl.', 'Exc.', '>']\n",
82 | "def mean_rating(row):\n",
83 | " if pd.isnull(row['Rating_LEquipe']) or row['Rating_LEquipe'] in stop_words:\n",
84 | " name = row['Player']\n",
85 | " temp = df[(df.Player == name) & (~df.Rating_LEquipe.isin(stop_words)) & (pd.notnull(df.Rating_LEquipe))]\n",
86 | " if len(temp) > 4:\n",
87 | " row['Rating_LEquipe'] = np.mean(temp.Rating_LEquipe.apply(int))\n",
88 | " return row\n",
89 | "\n",
90 | "position_mapping = {'attackingmidfieldcenter' : 'attackingmidfield',\n",
91 | " 'attackingmidfieldleft' : 'attackingmidfield',\n",
92 | " 'attackingmidfieldright' : 'attackingmidfield',\n",
93 | " 'defenderleft' : 'defenderlateral',\n",
94 | " 'defendermidfieldcenter' : \"defendermidfield\",\n",
95 | " 'defendermidfieldleft' : 'defendermidfield',\n",
96 | " 'defendermidfieldright': \"defendermidfield\",\n",
97 | " 'defenderright': 'defenderlateral',\n",
98 | " 'forwardleft' : 'forwardlateral',\n",
99 | " 'forwardright' : 'forwardlateral',\n",
100 | " 'midfieldcenter' :'midfield' ,\n",
101 | " 'midfieldleft' :'midfield' ,\n",
102 | " 'midfieldright' :'midfield' }\n",
103 | "\n",
104 | "mapping_team_name = {'ASM': 'Monaco',\n",
105 | " 'ASNL': 'Nancy',\n",
106 | " 'ASSE': 'Saint-Etienne',\n",
107 | " 'DFCO': 'Dijon',\n",
108 | " 'EAG': 'Guingamp',\n",
109 | " 'FCGB': 'Bordeaux',\n",
110 | " 'FCL': 'Lorient',\n",
111 | " 'FCM': 'Metz',\n",
112 | " 'FCN': 'Nantes',\n",
113 | " 'Losc': 'Lille',\n",
114 | " 'MHSC': 'Montpellier',\n",
115 | " 'Man. City': 'Manchester City',\n",
116 | " 'Man. United': 'Manchester United',\n",
117 | " 'OGCN': 'Nice',\n",
118 | " 'OL': 'Lyon',\n",
119 | " 'OM': 'Marseille',\n",
120 | " 'PSG': 'Paris Saint Germain',\n",
121 | " 'Palace': 'Crystal Palace',\n",
122 | " 'SCB': 'SC Bastia',\n",
123 | " 'SCO': 'Angers',\n",
124 | " 'SMC': 'Caen',\n",
125 | " 'SRFC': 'Rennes',\n",
126 | " 'Stoke City': 'Stoke',\n",
127 | " 'TFC': 'Toulouse',\n",
128 | " 'WBA': 'West Bromwich Albion'}"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": 138,
134 | "metadata": {
135 | "ExecuteTime": {
136 | "end_time": "2017-05-21T05:41:57.976844Z",
137 | "start_time": "2017-05-21T05:41:31.980530Z"
138 | },
139 | "collapsed": true
140 | },
141 | "outputs": [],
142 | "source": [
143 | "df = pd.DataFrame()\n",
144 | "for team, file in teams.items():\n",
145 | " for i, game in enumerate(file):\n",
146 | " temp = pd.DataFrame(game['stats'])\n",
147 | " temp['Opponent'] = game['opponent']\n",
148 | " temp['Place'] = game['place'].title()\n",
149 | " result = game['result'].split(' : ')\n",
150 | " temp['Goal Team'] = int(result[int(game['place'] == 'away')])\n",
151 | " temp['Goal Opponent'] = int(result[int(game['place'] == 'home')])\n",
152 | " temp['Team'] = {'psg':'Paris Saint Germain'}.get(team, team).title()\n",
153 | " temp['Team'] = {'Bastia': 'SC Bastia'}.get(temp['Team'][0], temp['Team'][0])\n",
154 | " temp['Day'] = i+1\n",
155 | " df = pd.concat([df, temp])\n",
156 | "df = df.apply(whoscored.get_name, axis=1).reset_index(drop=True)\n",
157 | "df['LineUp'] = 1\n",
158 | "df.loc[df.Position == 'Sub', 'LineUp'] = 0\n",
159 | "df = df.apply(get_position_sub, axis=1)\n",
160 | "df.Goal.fillna(0, inplace=True)\n",
161 | "df.Assist.fillna(0, inplace=True)\n",
162 | "df.Yellowcard.fillna(0, inplace=True)\n",
163 | "df.Redcard.fillna(0, inplace=True)\n",
164 | "df.Penaltymissed.fillna(0, inplace=True)\n",
165 | "df.Shotonpost.fillna(0, inplace=True)\n",
166 | "# df.Position = df.Position.apply(lambda x:position_mapping.get(x,x))"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": 139,
172 | "metadata": {
173 | "ExecuteTime": {
174 | "end_time": "2017-05-21T05:41:57.990536Z",
175 | "start_time": "2017-05-21T05:41:57.978189Z"
176 | }
177 | },
178 | "outputs": [],
179 | "source": [
180 | "with open('./ratings/notes_ligue1_lequipe.json','r') as f:\n",
181 | " rating_lequipe = json.load(f)\n",
182 | "rating_lequipe = {mapping_team_name.get(k,k):[p for p in v if list(p.keys())[0] != 'Nom' and len(list(p.values())[0]) > 0] \n",
183 | " for k,v in rating_lequipe.items()}"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 140,
189 | "metadata": {
190 | "ExecuteTime": {
191 | "end_time": "2017-05-21T05:42:40.226407Z",
192 | "start_time": "2017-05-21T05:41:57.992420Z"
193 | },
194 | "scrolled": true
195 | },
196 | "outputs": [],
197 | "source": [
198 | "for team_name, rating_team in rating_lequipe.items():\n",
199 | " if team_name in set(df.Team):\n",
200 | " players_lequipe = [list(k.keys())[0] for k in rating_team]\n",
201 | " players_df = list(set(df[df.Team == team_name].Player))\n",
202 | " for player in rating_team:\n",
203 | " [(player_name, player_ratings)] = player.items()\n",
204 | " try:\n",
205 | " player_name_df = difflib.get_close_matches(player_name, players_df)[0]\n",
206 | " except:\n",
207 | " if len(unidecode(player_name).split('-')) > 1 :\n",
208 | " player_name_df = [k for k in players_df if unidecode(player_name).split('-')[0].replace(\"'\",\"\").lower() in unidecode(k).replace(\"'\",\"\").lower()\n",
209 | " or unidecode(player_name).split('-')[1].replace(\"'\",\"\").lower() in unidecode(k).replace(\"'\",\"\").lower()][0]\n",
210 | " else:\n",
211 | " player_name_df = [k for k in players_df if unidecode(player_name).replace(\"'\",\"\").lower() in unidecode(k).replace(\"'\",\"\").lower()][0]\n",
212 | " for day, rating in player_ratings.items():\n",
213 | " df.loc[(df.Player == player_name_df) & (df.Team == team_name) & (df.Day == int(day.split('Day ')[1])), 'Rating_LEquipe'] = rating\n",
214 | "df = df.apply(mean_rating, axis=1)\n",
215 | "df.drop('null', axis=1, inplace=True)"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": 141,
221 | "metadata": {
222 | "ExecuteTime": {
223 | "end_time": "2017-05-21T05:42:40.279122Z",
224 | "start_time": "2017-05-21T05:42:40.227785Z"
225 | },
226 | "scrolled": true
227 | },
228 | "outputs": [
229 | {
230 | "name": "stdout",
231 | "output_type": "stream",
232 | "text": [
233 | "9552/10187 données avec une note l'Équipe\n"
234 | ]
235 | },
236 | {
237 | "data": {
238 | "text/html": [
239 | "
\n",
240 | "\n",
253 | "
\n",
254 | " \n",
255 | " \n",
256 | " | \n",
257 | " Team | \n",
258 | " Goal Team | \n",
259 | " Goal Opponent | \n",
260 | " Player | \n",
261 | " Goal | \n",
262 | " Opponent | \n",
263 | " Day | \n",
264 | " Position | \n",
265 | " Age | \n",
266 | " LineUp | \n",
267 | " Rating_LEquipe | \n",
268 | "
\n",
269 | " \n",
270 | " \n",
271 | " \n",
272 | " 415 | \n",
273 | " Paris Saint Germain | \n",
274 | " 2 | \n",
275 | " 1 | \n",
276 | " Thiago Motta | \n",
277 | " 0.0 | \n",
278 | " Lyon | \n",
279 | " 30 | \n",
280 | " midfieldcenter | \n",
281 | " 34 | \n",
282 | " 0 | \n",
283 | " 5 | \n",
284 | "
\n",
285 | " \n",
286 | " 9580 | \n",
287 | " Toulouse | \n",
288 | " 0 | \n",
289 | " 0 | \n",
290 | " Mauro Goicoechea | \n",
291 | " 0.0 | \n",
292 | " Rennes | \n",
293 | " 30 | \n",
294 | " goalkeeper | \n",
295 | " 29 | \n",
296 | " 1 | \n",
297 | " 7 | \n",
298 | "
\n",
299 | " \n",
300 | " 5736 | \n",
301 | " Metz | \n",
302 | " 0 | \n",
303 | " 1 | \n",
304 | " Thomas Didillon | \n",
305 | " 0.0 | \n",
306 | " Rennes | \n",
307 | " 11 | \n",
308 | " goalkeeper | \n",
309 | " 21 | \n",
310 | " 1 | \n",
311 | " 5 | \n",
312 | "
\n",
313 | " \n",
314 | " 6440 | \n",
315 | " Nantes | \n",
316 | " 1 | \n",
317 | " 1 | \n",
318 | " Rémy Riou | \n",
319 | " 0.0 | \n",
320 | " Metz | \n",
321 | " 25 | \n",
322 | " goalkeeper | \n",
323 | " 29 | \n",
324 | " 1 | \n",
325 | " 5 | \n",
326 | "
\n",
327 | " \n",
328 | " 3446 | \n",
329 | " Lille | \n",
330 | " 1 | \n",
331 | " 0 | \n",
332 | " Eder | \n",
333 | " 0.0 | \n",
334 | " SC Bastia | \n",
335 | " 31 | \n",
336 | " forward | \n",
337 | " 29 | \n",
338 | " 1 | \n",
339 | " 4 | \n",
340 | "
\n",
341 | " \n",
342 | " 9100 | \n",
343 | " Saint-Etienne | \n",
344 | " 1 | \n",
345 | " 1 | \n",
346 | " Kevin Malcuit | \n",
347 | " 0.0 | \n",
348 | " Rennes | \n",
349 | " 33 | \n",
350 | " defenderright | \n",
351 | " 25 | \n",
352 | " 1 | \n",
353 | " 5.42105 | \n",
354 | "
\n",
355 | " \n",
356 | " 5640 | \n",
357 | " Metz | \n",
358 | " 3 | \n",
359 | " 0 | \n",
360 | " Yann Jouffre | \n",
361 | " 0.0 | \n",
362 | " Nantes | \n",
363 | " 4 | \n",
364 | " attackingmidfieldright | \n",
365 | " 32 | \n",
366 | " 0 | \n",
367 | " 4.88889 | \n",
368 | "
\n",
369 | " \n",
370 | " 2401 | \n",
371 | " Angers | \n",
372 | " 3 | \n",
373 | " 0 | \n",
374 | " Abdoulaye Bamba | \n",
375 | " 0.0 | \n",
376 | " SC Bastia | \n",
377 | " 27 | \n",
378 | " defenderright | \n",
379 | " 27 | \n",
380 | " 1 | \n",
381 | " 6 | \n",
382 | "
\n",
383 | " \n",
384 | "
\n",
385 | "
"
386 | ],
387 | "text/plain": [
388 | " Team Goal Team Goal Opponent Player Goal \\\n",
389 | "415 Paris Saint Germain 2 1 Thiago Motta 0.0 \n",
390 | "9580 Toulouse 0 0 Mauro Goicoechea 0.0 \n",
391 | "5736 Metz 0 1 Thomas Didillon 0.0 \n",
392 | "6440 Nantes 1 1 Rémy Riou 0.0 \n",
393 | "3446 Lille 1 0 Eder 0.0 \n",
394 | "9100 Saint-Etienne 1 1 Kevin Malcuit 0.0 \n",
395 | "5640 Metz 3 0 Yann Jouffre 0.0 \n",
396 | "2401 Angers 3 0 Abdoulaye Bamba 0.0 \n",
397 | "\n",
398 | " Opponent Day Position Age LineUp Rating_LEquipe \n",
399 | "415 Lyon 30 midfieldcenter 34 0 5 \n",
400 | "9580 Rennes 30 goalkeeper 29 1 7 \n",
401 | "5736 Rennes 11 goalkeeper 21 1 5 \n",
402 | "6440 Metz 25 goalkeeper 29 1 5 \n",
403 | "3446 SC Bastia 31 forward 29 1 4 \n",
404 | "9100 Rennes 33 defenderright 25 1 5.42105 \n",
405 | "5640 Nantes 4 attackingmidfieldright 32 0 4.88889 \n",
406 | "2401 SC Bastia 27 defenderright 27 1 6 "
407 | ]
408 | },
409 | "execution_count": 141,
410 | "metadata": {},
411 | "output_type": "execute_result"
412 | }
413 | ],
414 | "source": [
415 | "print('%d/%d données avec une note l\\'Équipe' % (len(df[(~df.Rating_LEquipe.isin(stop_words)) & (pd.notnull(df.Rating_LEquipe))]),\n",
416 | " len(df)))\n",
417 | "df[['Team','Goal Team', 'Goal Opponent', 'Player', 'Goal',\n",
418 | " 'Opponent', 'Day', 'Position', 'Age', 'LineUp', 'Rating_LEquipe']].sort_values('Player').sample(8)"
419 | ]
420 | },
421 | {
422 | "cell_type": "code",
423 | "execution_count": 143,
424 | "metadata": {
425 | "ExecuteTime": {
426 | "end_time": "2017-05-21T05:43:00.252474Z",
427 | "start_time": "2017-05-21T05:43:00.221024Z"
428 | },
429 | "collapsed": true
430 | },
431 | "outputs": [],
432 | "source": [
433 | "with open('./df.pkl', 'rb') as f:\n",
434 | " df = pickle.load(f)"
435 | ]
436 | },
437 | {
438 | "cell_type": "code",
439 | "execution_count": 176,
440 | "metadata": {
441 | "ExecuteTime": {
442 | "end_time": "2017-05-21T07:45:29.115217Z",
443 | "start_time": "2017-05-21T07:45:28.990990Z"
444 | },
445 | "collapsed": true
446 | },
447 | "outputs": [],
448 | "source": [
449 | "other_cols = ['Day', \"Opponent\", \"Place\", \"Player\", \"Position\", \n",
450 | " \"Team\", 'Key Events', \"Rating_LEquipe\"]\n",
451 | "col_delete = ['Key Events', \"Rating_LEquipe\", 'Rating', 'Day']\n",
452 | "cols_to_transf = [col for col in df.columns if col not in other_cols]\n",
453 | "p = pd.concat([df[cols_to_transf].applymap(float), df[other_cols]], axis=1)\n",
454 | "new_col_to_remov = [e for e in p.columns if e.startswith('Acc')] + ['Touches']"
455 | ]
456 | },
457 | {
458 | "cell_type": "markdown",
459 | "metadata": {
460 | "ExecuteTime": {
461 | "end_time": "2017-05-20T18:57:44.123776Z",
462 | "start_time": "2017-05-20T18:57:44.118723Z"
463 | }
464 | },
465 | "source": [
466 | "## Learning"
467 | ]
468 | },
469 | {
470 | "cell_type": "code",
471 | "execution_count": 177,
472 | "metadata": {
473 | "ExecuteTime": {
474 | "end_time": "2017-05-21T07:45:30.269693Z",
475 | "start_time": "2017-05-21T07:45:30.254692Z"
476 | }
477 | },
478 | "outputs": [],
479 | "source": [
480 | "p = p[(~p.Rating_LEquipe.isin(stop_words)) & (pd.notnull(p.Rating_LEquipe))].set_index('Player')\n",
481 | "X = p[[col for col in p.columns if col not in col_delete + new_col_to_remov]]\n",
482 | "y = p['Rating_LEquipe'].apply(float)"
483 | ]
484 | },
485 | {
486 | "cell_type": "code",
487 | "execution_count": 178,
488 | "metadata": {
489 | "ExecuteTime": {
490 | "end_time": "2017-05-21T07:45:31.239689Z",
491 | "start_time": "2017-05-21T07:45:30.949497Z"
492 | }
493 | },
494 | "outputs": [
495 | {
496 | "name": "stderr",
497 | "output_type": "stream",
498 | "text": [
499 | "/Users/alexandreattia/Desktop/Work/workenv/lib/python3.5/site-packages/ipykernel/__main__.py:5: SettingWithCopyWarning: \n",
500 | "A value is trying to be set on a copy of a slice from a DataFrame.\n",
501 | "Try using .loc[row_indexer,col_indexer] = value instead\n",
502 | "\n",
503 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
504 | ]
505 | }
506 | ],
507 | "source": [
508 | "label_encoders = {}\n",
509 | "col_encode = [\"Opponent\", \"Place\", \"Position\", \"Team\"]\n",
510 | "for col in col_encode:\n",
511 | " label_encoders[col.lower()] = LabelEncoder()\n",
512 | " X[col] = label_encoders[col.lower()].fit_transform(X[col])"
513 | ]
514 | },
515 | {
516 | "cell_type": "code",
517 | "execution_count": 123,
518 | "metadata": {
519 | "ExecuteTime": {
520 | "end_time": "2017-05-21T05:28:51.343565Z",
521 | "start_time": "2017-05-21T05:28:51.299523Z"
522 | }
523 | },
524 | "outputs": [
525 | {
526 | "name": "stdout",
527 | "output_type": "stream",
528 | "text": [
529 | "5-fold cross validation mean square error : 0.86\n"
530 | ]
531 | }
532 | ],
533 | "source": [
534 | "d = []\n",
535 | "k_fold = 5\n",
536 | "for k in range(k_fold): \n",
537 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/k_fold)\n",
538 | " ridge = Ridge(alpha=0.001, max_iter=10000, tol=0.0001)\n",
539 | " ridge.fit(X_train, y_train)\n",
540 | " d.append(np.mean((ridge.predict(X_test) - y_test) ** 2))\n",
541 | "print('%d-fold cross validation mean square error : %.2f' % (k_fold, np.mean(d)))\n",
542 | "ridge = Ridge(alpha=0.001, max_iter=10000, tol=0.0001)\n",
543 | "_ = ridge.fit(X, y)"
544 | ]
545 | },
546 | {
547 | "cell_type": "code",
548 | "execution_count": 137,
549 | "metadata": {
550 | "ExecuteTime": {
551 | "end_time": "2017-05-21T05:37:59.969171Z",
552 | "start_time": "2017-05-21T05:37:59.963504Z"
553 | }
554 | },
555 | "outputs": [
556 | {
557 | "name": "stdout",
558 | "output_type": "stream",
559 | "text": [
560 | "The 10 most important parameters are:\n",
561 | "Penaltymissed : -0.59\n",
562 | "Goal : 0.43\n",
563 | "Assist : 0.36\n",
564 | "Redcard : -0.31\n",
565 | "Goal Opponent : -0.22\n",
566 | "Shotonpost : 0.19\n",
567 | "ShotsOT : 0.16\n",
568 | "Goal Team : 0.15\n",
569 | "Yellowcard : -0.09\n",
570 | "Place : -0.09\n"
571 | ]
572 | }
573 | ],
574 | "source": [
575 | "top_k = 10\n",
576 | "l = np.argsort(list(map(np.abs, ridge.coef_)))[::-1][:top_k]\n",
577 | "print(\"The %d most important parameters are:\\n%s\" % (top_k,'\\n'.join(['%s : %.2f' % (a,b) \n",
578 | " for a,b in zip(X_train.columns[l], ridge.coef_[l])])))"
579 | ]
580 | },
581 | {
582 | "cell_type": "code",
583 | "execution_count": 180,
584 | "metadata": {
585 | "ExecuteTime": {
586 | "end_time": "2017-05-21T07:46:19.524179Z",
587 | "start_time": "2017-05-21T07:45:50.166761Z"
588 | }
589 | },
590 | "outputs": [
591 | {
592 | "name": "stdout",
593 | "output_type": "stream",
594 | "text": [
595 | "RF : oob score : 1.00\n",
596 | "RF : 5-fold cross validation mean square error : 0.74\n"
597 | ]
598 | }
599 | ],
600 | "source": [
601 | "rf = RandomForestRegressor(oob_score=True, n_estimators=100)\n",
602 | "rf.fit(X, y)\n",
603 | "print('RF : oob score : %.2f' % rf.oob_score)\n",
604 | "d = []\n",
605 | "k_fold = 5\n",
606 | "for k in range(k_fold): \n",
607 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/k_fold)\n",
608 | " rf = RandomForestRegressor(oob_score=True, n_estimators=100)\n",
609 | " rf.fit(X_train, y_train)\n",
610 | " d.append(np.mean((rf.predict(X_test) - y_test) ** 2))\n",
611 | "print('RF : %d-fold cross validation mean square error : %.2f' % (k_fold, np.mean(d)))"
612 | ]
613 | },
614 | {
615 | "cell_type": "code",
616 | "execution_count": null,
617 | "metadata": {
618 | "collapsed": true
619 | },
620 | "outputs": [],
621 | "source": []
622 | }
623 | ],
624 | "metadata": {
625 | "kernelspec": {
626 | "display_name": "workenv",
627 | "language": "python",
628 | "name": "workenv"
629 | },
630 | "language_info": {
631 | "codemirror_mode": {
632 | "name": "ipython",
633 | "version": 3
634 | },
635 | "file_extension": ".py",
636 | "mimetype": "text/x-python",
637 | "name": "python",
638 | "nbconvert_exporter": "python",
639 | "pygments_lexer": "ipython3",
640 | "version": "3.5.2"
641 | },
642 | "nav_menu": {},
643 | "toc": {
644 | "navigate_menu": true,
645 | "number_sections": true,
646 | "sideBar": true,
647 | "threshold": 6,
648 | "toc_cell": false,
649 | "toc_section_display": "block",
650 | "toc_window_display": false
651 | }
652 | },
653 | "nbformat": 4,
654 | "nbformat_minor": 2
655 | }
656 |
--------------------------------------------------------------------------------
/KaggleSoccer/README.md:
--------------------------------------------------------------------------------
1 | # Parsing and using Soccer data
2 |
3 | ## First Part - Parsing data
4 |
5 | The first part aims to parse data from multiple websites: games, teams, players, etc.
6 | To collect data about a team and dump a JSON file (a ``./teams`` folder must exist first):
7 | ``python3 dumper.py team_name``
8 |
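Once a team has been dumped, the JSON file can be loaded and turned into a game-level dataframe with the helpers of ``parser.py``. A minimal sketch (``monaco.json`` is only an example name of a previously dumped team):
```
import json
import parser  # the local parser.py of this folder

with open('./teams/monaco.json', 'r') as f:
    team = json.load(f)

# One row per played game, with shots/corners/goals columns split
# between the team and its opponent
df = parser.convert_df_games(team)
```
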
9 | ## Second Part - Data Analysis
10 |
11 | The second part analyzes the dataset to understand what can be done with it.
12 |
13 | 
14 |
15 | 
--------------------------------------------------------------------------------
/KaggleSoccer/df.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleSoccer/df.pkl
--------------------------------------------------------------------------------
/KaggleSoccer/dumper.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import re
3 | import time
4 | from datetime import datetime
5 | import json
6 | from tqdm import tqdm
7 | import numpy as np
8 | import seaborn as sns
9 | from matplotlib import patches
10 | import pandas as pd
11 | import parser
12 | import sys
13 | import random
14 | from collections import Counter
15 | from bs4 import BeautifulSoup
16 | from sklearn import preprocessing
17 | import urllib
18 |
19 | def convert_int(string):
20 | try:
21 | string = int(string)
22 | return string
23 | except:
24 | return string
25 |
26 | def get_team(request, link=None):
27 | team = {}
28 | if not link:
29 | team_searched = request
30 | team_searched = urllib.parse.quote(team_searched.encode('utf-8'))
31 | search_link = "http://us.soccerway.com/search/teams/?q={}".format(team_searched)
32 | response = requests.get(search_link)
33 | bs = BeautifulSoup(response.text, 'lxml')
34 | results = bs.find("ul", class_='search-results')
35 | # Take the first results
36 | try:
37 | link = "http://us.soccerway.com" + results.find_all('a')[0]['href']
38 | print('Please check team link:', link)
39 | team['id_'] = results.find_all('a')[0]["href"].split('/')[4]
40 | team['name'] = results.find_all('a')[0].text
41 | team['country'] = results.find_all('a')[0]["href"].split('/')[2]
42 | except:
43 | print('No team found !')
44 | else:
45 | team['id_'] = link.split('/')[6]
46 | team['name'] = link.split('/')[5]
47 | team['country'] = link.split('/')[4]
48 | return team
49 |
50 | def get_games(team, nb_pages=12):
51 | games = []
52 | for page_number in range(nb_pages):
53 | link_base = 'http://us.soccerway.com/a/block_team_matches?block_id=page_team_1_block_team_matches_3&callback_params='
54 | link_ = urllib.parse.quote('{"page":0,"bookmaker_urls":[],"block_service_id":"team_matches_block_teammatches","team_id":%s,\
55 | "competition_id":0,"filter":"all","new_design":false}' % team['id_']) + '&action=changePage¶ms=' + urllib.parse.quote('{"page":-%s}' % (page_number))
56 | link = link_base + link_
57 | response = requests.get(link)
58 |
59 | test = json.loads(response.text)['commands'][0]['parameters']['content']
60 | bs = BeautifulSoup(test, 'lxml')
61 |
62 | for kind in ['even', 'odd']:
63 | for elem in bs.find_all('tr', class_ = kind):
64 | game = {}
65 | game["date"] = elem.find('td', {'class': ["full-date"]}).text
66 | game["competition"] = elem.find('td', {'class': ["competition"]}).text
67 | game["team_a"] = elem.find('td', class_='team-a').text
68 | game["team_b"] = elem.find('td', class_='team-b').text
69 | game['link'] = "http://us.soccerway.com" + elem.find('td', class_='score-time').find('a')['href']
70 | game["score"] = elem.find('td', class_='score-time').text.replace(' ','')
71 | if 'E' in game["score"]:
72 | game["score"] = game['score'].replace('E','')
73 | game['extra_time'] = True
74 | if 'P' in game["score"]:
75 | game["score"] = game['score'].replace('P','')
76 | game['penalties'] = True
77 | if datetime.strptime(game["date"], '%d/%m/%y') < datetime.now():
78 | game = parser.get_score_details(game, team)
79 | time.sleep(random.uniform(0, 0.25))
80 | game.update(parser.get_goals(game['link']))
81 | else:
82 | del game['score']
83 | games.append(game)
84 | games = sorted(games, key=lambda x:datetime.strptime(x['date'], '%d/%m/%y'))
85 | team['games'] = games
86 | return team
87 |
88 | def get_squad(team, season_path='./seasons_codes.json'):
89 | with open(season_path, 'r') as f:
90 | seasons = json.load(f)[team["country"]]
91 |
92 | team['squad'] = {}
93 | for k,v in seasons.items():
94 | link_base = 'http://us.soccerway.com/a/block_team_squad?block_id=page_team_1_block_team_squad_3&callback_params='
95 | link_ = urllib.parse.quote('{"team_id":%s}' % team['id_']) + '&action=changeSquadSeason¶ms=' + urllib.parse.quote('{"season_id":%s}' % v)
96 | link = link_base + link_
97 | response = requests.get(link)
98 | test = json.loads(response.text)['commands'][0]['parameters']['content']
99 | bs = BeautifulSoup(test, 'lxml')
100 |
101 | players = bs.find('tbody').find_all('tr')
102 | squad = [{
103 | k: convert_int(player.find('td', class_=k).text)
104 | for k in [k for k,v in Counter(np.concatenate([elem.attrs['class'] for elem in player.find_all('td')])).items()
105 | if v < 2 and k not in ['photo', '', 'flag']]
106 | } for player in players]
107 | team['squad'][k] = squad
108 | try:
109 | coach = {'position': 'Coach', 'name':bs.find_all('tbody')[1].text}
110 | team.setdefault('coach', {})[k] = coach
111 | except: pass
112 | return team
113 |
114 | if __name__ == '__main__':
115 | if len(sys.argv) > 1:
116 | request = ' '.join(sys.argv[1:])
117 | else:
118 | raise ValueError('You must enter a requested team!')
119 | team = get_team(request)
120 | f = input('Satisfied ? (Y/n) ')
121 | count = 0
122 | while f not in ['Y', 'y','yes','']:
123 | if count < 3:
124 | request = input('Enter a new request : ')
125 | link = None
126 | else:
127 | link = input('Paste team link : ')
128 | team = get_team(request, link)
129 | f = input('Satisfied ? (Y/n)')
130 | count += 1
131 | team = get_games(team)
132 | team = get_squad(team)
133 | with open('./teams/%s.json' % team["name"].lower(), 'w') as f:
134 | json.dump(team, f)
135 |
--------------------------------------------------------------------------------
/KaggleSoccer/parser.py:
--------------------------------------------------------------------------------
1 | import urllib
2 | import requests
3 | import time
4 | import pandas as pd
5 | import numpy as np
6 | import datetime
7 | import json
8 | from tqdm import tqdm
9 | import random
10 | from bs4 import BeautifulSoup
11 | from collections import Counter
12 |
13 | def get_score_details(game, team):
14 | if game["team_a"] == team['name']:
15 | game['place'] = 'home'
16 | try:
17 | game['nb_goals_{}'.format(team['name'])] = int(game['score'].split('-')[0])
18 | game['nb_goals_adv'] = int(game['score'].split('-')[1])
19 | except:pass
20 | else:
21 | game['place'] = 'away'
22 | try:
23 | game['nb_goals_{}'.format(team['name'])] = int(game['score'].split('-')[1])
24 | game['nb_goals_adv'] = int(game['score'].split('-')[0])
25 | except:pass
26 | try:
27 | if game['nb_goals_{}'.format(team['name'])] > game['nb_goals_adv']:
28 | game['result'] = 'WIN'
29 | elif game['nb_goals_{}'.format(team['name'])] == game['nb_goals_adv']:
30 | game['result'] = 'TIE'
31 | else:
32 | game['result'] = 'LOST'
33 | except:pass
34 | return game
35 |
36 | def shot_team(row, columns, team_name):
37 | try:
38 | for col in columns:
39 | if row['team_a'] == team_name:
40 | row['{}_{}'.format(col.lower().replace(' ','_'), team_name.lower())] = row[col]['team_a']
41 | row['{}_adv'.format(col.lower().replace(' ','_'))] = row[col]['team_b']
42 | else:
43 | row['{}_{}'.format(col.lower().replace(' ','_'), team_name.lower())] = row[col]['team_b']
44 | row['{}_adv'.format(col.lower().replace(' ','_'))] = row[col]['team_a']
45 | except: pass
46 | return row
47 |
48 | def convert_team_name(row, team_name):
49 | for col in ['goals', 'players_team', "subs_in"]:
50 | try:
51 | if row['team_a'] == team_name:
52 | row['{}_{}'.format(col, team_name.lower())] = row['{}_a'.format(col)]
53 | row['{}_adv'.format(col)] = row['{}_b'.format(col)]
54 | else:
55 | row['{}_{}'.format(col, team_name.lower())] = row['{}_b'.format(col)]
56 | row['{}_adv'.format(col)] = row['{}_a'.format(col)]
57 | del row['{}_a'.format(col)], row['{}_b'.format(col)]
58 | except: pass
59 | return row
60 |
61 | def player_per_opponent(df, player, team_name):
62 | stats_player = {}
63 | for opponent in set(df.opponent):
64 | try:
65 | game_played = len([e for e in list(df[df.opponent == opponent]["players_team_%s" % team_name.lower()]) +
66 | list(df[df.opponent == opponent]["subs_in_%s" % team_name.lower()])
67 | if player in e])
68 | if game_played > 0:
69 | goals = len([e for e in np.concatenate([elem for elem in df[df.opponent == opponent]["goals_%s" % team_name.lower()] if type(elem) == list])
70 | if e['player'] == player])
71 | assist = len([e for e in np.concatenate([elem for elem in df[df.opponent == opponent]["goals_%s" % team_name.lower()] if type(elem) == list])
72 | if 'assist' in e and e['assist'] == player])
73 | de = goals + assist
74 |
75 | stats_player[opponent] = {'goals':goals,
76 | 'goals_game':goals/game_played,
77 | 'decisive_game':de/game_played,
78 | 'assists':assist,
79 | 'decisive':de,
80 | 'games':game_played}
81 | except: pass
82 | return stats_player
83 |
84 | def ratio_one_opponent(df, opponent, team_name, top_k=10):
85 | df_oppo = df[df.opponent == opponent]
86 | game_player = dict(Counter(np.concatenate(list(df_oppo["players_team_%s" % team_name.lower()]) +
87 | list(df_oppo["subs_in_%s" % team_name.lower()]))).items())
88 | goal_counter = Counter([elem['player'] for elem in np.concatenate(list(df_oppo["goals_%s" % team_name.lower()]))])
89 | assist_counter = Counter([elem['assist'] for elem in np.concatenate(list(df_oppo["goals_%s" % team_name.lower()])) if "assist" in elem])
90 | goal_ratio = sorted([(k, v/game_player[k]) for k,v in dict(goal_counter).items() if k in game_player and game_player[k] >= 3 ],
91 | key=lambda x:x[1], reverse=True)[:top_k]
92 | assist_ratio = sorted([(k, v/game_player[k]) for k,v in dict(assist_counter).items() if k in game_player and game_player[k] >= 3 ],
93 | key=lambda x:x[1], reverse=True)[:top_k]
94 | decisive_ratio = sorted([(k, v/game_player[k]) for k,v in dict(goal_counter+assist_counter).items() if k in game_player and game_player[k] >= 3 ],
95 | key=lambda x:x[1], reverse=True)[:top_k]
96 | return goal_ratio, assist_ratio, decisive_ratio
97 |
98 | def get_goals_team(bs, team):
99 | if bs.find('span', class_='bidi').text != ' 0 - 0':
100 | goals = [{'player':elem.find('a').text,
101 | 'time':elem.find('span', class_='minute').text.replace("'",''),
102 | 'assist': elem.find('span', class_='assist')} for elem in bs.find('div', class_='block_match_goals').find_all('td', class_='player-{}'.format(team)) if elem.text !='\n\n']
103 | for goal in goals:
104 | if goal['assist'] != None:
105 | goal['assist'] = goal['assist'].text.replace('(assist by ','').replace(')','')
106 | else:
107 | del goal['assist']
108 | else:
109 | goals=[]
110 | return goals
111 |
112 | def get_lineups(bs):
113 | lineups = {}
114 | players = [elem.text.replace('\n','') for elem in bs.find_all('td', class_='large-link')]
115 | lineups["players_team_a"] = players[:11]
116 | lineups["players_team_b"] = players[11:22]
117 | lineups['subs_in_a'] = [elem[:elem.index('for')] for elem in players[22:29] if 'for' in elem]
118 | lineups['subs_in_b'] = [elem[:elem.index('for')] for elem in players[29:36] if 'for' in elem]
119 | return lineups
120 |
121 | def get_stats(bs):
122 | l = bs.find('div', class_='block_match_stats_plus_chart').find('iframe')['src']
123 | response = requests.get('http://us.soccerway.com/' + l)
124 | bs2 = BeautifulSoup(response.text,'lxml')
125 | stats_list = [elem.text[1:-1].split('\n') for elem in bs2.find('table').find('tr').find_all('tr')[1:]][0::2]
126 | stats = {elem[1]:{'team_a': int(elem[0]), 'team_b':int(elem[2])} for elem in stats_list}
127 | return stats
128 |
129 | def get_goals(link):
130 | game = {}
131 | response = requests.get(link)
132 | bs = BeautifulSoup(response.text, 'lxml')
133 | try:
134 | game['goals_a'], game['goals_b'] = [get_goals_team(bs, 'a'), get_goals_team(bs, 'b')]
135 | except: pass
136 | try:
137 | game.update(get_lineups(bs))
138 | except: pass
139 | try:
140 | game.update(get_stats(bs))
141 | except: pass
142 | return game
143 |
144 | def convert_df_games(dict_file):
145 | df = pd.DataFrame(dict_file["games"])
146 | df['date'] = pd.to_datetime(df.date.apply(lambda x:'/'.join([x.split('/')[1],x.split('/')[0], x.split('/')[2]])))
147 | df['month'] = df.date.apply(lambda x:x.month)
148 | df['year'] = df.date.apply(lambda x:x.year)
149 | df = df[df.date < datetime.datetime.now()]
150 | df = df.sort_values('date', ascending=False)
151 | team_name = df.team_a.value_counts().index[0]
152 | df['opponent'] = df.apply(lambda x:(x['team_a']+x['team_b']).replace(team_name, ''), axis=1)
153 | cols = ['Corners', 'Fouls', 'Offsides', 'Shots on target', 'Shots wide']
154 | df = df.apply(lambda x:shot_team(x, cols, team_name),axis=1)
155 | df = df.apply(lambda x:convert_team_name(x, team_name),axis=1)
156 | df = df.drop(cols, axis=1)
157 | return df
158 |
--------------------------------------------------------------------------------
/KaggleTaxiTrip/README.md:
--------------------------------------------------------------------------------
1 | ## New York City Taxi Trip Duration
2 |
3 | Kaggle playground to predict the total ride duration of taxi trips in New York City.
4 |
5 | ### First part - Data exploration
6 | The first part analyzes the dataframe and looks at the correlations between variables.
7 | 
8 | 
9 |
10 | ### Second part - Clustering
11 | The goal of this playground is to predict the trip duration of the test set. Some neighborhoods are more congested than others, so I used K-Means to compute geo-clusters for pickup and drop-off locations.
12 | 
13 |
14 | ### Third part - Cleaning and feature selection
15 | I found some oddly long trips: one-day trips with a mean speed < 1 km/h.
16 | 
17 | I removed these outliers.
18 |
19 | I also added features derived from the available data: Haversine distance, Manhattan distance, per-cluster means, and a PCA rotation of the coordinates.
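
For instance, a sketch of the Haversine distance feature (6371 km is the mean Earth radius; the coordinate column names are assumed to be the competition's):
```
import numpy as np

def haversine(lat1, lng1, lat2, lng2):
    # Great-circle distance in km between two (lat, lng) points, vectorized
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    d_lat, d_lng = lat2 - lat1, lng2 - lng1
    a = np.sin(d_lat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(d_lng / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

df['distance_haversine'] = haversine(df.pickup_latitude, df.pickup_longitude,
                                     df.dropoff_latitude, df.dropoff_longitude)
```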
20 |
21 | ### Fourth part - Prediction
22 | I compared Random Forest and XGBoost.
23 | Current Root Mean Squared Logarithmic Error (RMSLE): 0.391
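
A minimal sketch of how this score can be computed on a validation split (``y_valid`` and ``preds`` are assumed to be trip durations in seconds):
```
import numpy as np

def rmsle(y_true, y_pred):
    # Root Mean Squared Logarithmic Error
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# e.g. rmsle(y_valid, preds)
```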
24 |
25 | Feature importance for RF & XGBoost
26 | 
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/download.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/download.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/feat_importance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/feat_importance.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/nyc_clusters.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/nyc_clusters.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/outliners.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/outliners.png
--------------------------------------------------------------------------------
/KaggleTaxiTrip/pic/rush_hour.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/KaggleTaxiTrip/pic/rush_hour.png
--------------------------------------------------------------------------------
/KaggleTitanic/README.md:
--------------------------------------------------------------------------------
1 | ## Titanic: Machine Learning from Disaster
2 |
3 | The Titanic challenge on Kaggle is a competition whose goal is to predict whether a given passenger survived, based on a set of variables describing them, such as age, sex, or passenger class on the boat.
--------------------------------------------------------------------------------
/KaggleWiki/README.md:
--------------------------------------------------------------------------------
1 | # Parsing and using Wikipedia Data
2 |
3 | ## First Part - French governments
4 |
5 | The first part aims to parse data from the French Wikipedia pages about French governments. I play with the data to get insights about governments: number of ministers, ministers' studies, etc.
6 |
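A minimal usage sketch of the parser (it requires Chrome and selenium's chromedriver; the page below is only an example of a French government article):
```
from wikipedia_parser import get_ministres

# Each entry holds the minister's role ('Poste'), name ('Nom'),
# Wikipedia link ('Lien') and parsed education ('Formation')
url = 'https://fr.wikipedia.org/wiki/Gouvernement_%C3%89douard_Philippe_(2)'
ministres = get_ministres(url)
print(len(ministres), 'ministers parsed')
```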
--------------------------------------------------------------------------------
/KaggleWiki/wikipedia_parser.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | from bs4 import BeautifulSoup
4 | from selenium import webdriver
5 | import re, time
6 |
7 | def convert_formation(s):
8 | mapping = {'IEP' : ['Institut d\'études politiques', 'IEP'],
9 | 'ENA' : ['ENA', 'École nationale d\'administration'],
10 | 'ENS': ['École normale supérieure', 'ENS'],
11 | 'ENPC':['École nationale des ponts et chaussées', 'ENPC'],
12 | 'X': ['École polytechnique'],
13 | 'HEC': ['HEC', 'École des hautes études commerciales'],
14 | 'ESCP': ['ESCP', 'École supérieure de commerce de Paris'],
15 | 'ESSEC' : ['ESSEC', "École Supérieure des Sciences Economiques et Commerciales "]
16 | }
17 | etudes = []
18 | for key, value in mapping.items():
19 | for v in value:
20 | if v in s:
21 | etudes.append(key)
22 | break
23 | return etudes
24 |
25 | def get_ministres(link):
26 | browser = webdriver.Chrome()
27 | try:
28 | browser.get(link)
29 | time.sleep(0.2)
30 | except:
31 | time.sleep(1)
32 | text = browser.find_element_by_xpath('//div[@id="mw-content-text"]').get_attribute('innerHTML').strip()
33 | bs = BeautifulSoup(text, 'lxml')
34 | ministres_l = [[(k.text.replace('\u200d ','') , k.find('a')['href']) for k in e.find_parents()[1].find_all('td') if k.text !='' and k.find('a')]
35 | for e in bs.find_all('a', class_="image")]
36 | ministres = [{'Poste':k[0][0], 'Nom':k[::-1][1][0], 'Lien':'https://fr.wikipedia.org' + k[::-1][1][1]} for k in ministres_l if len(k) > 1][:-1]
37 | for m in ministres:
38 | m['Formation'] = get_formation(m['Lien'])
39 | browser.quit()
40 | return ministres
41 |
42 | def get_formation(link_bio):
43 | browser = webdriver.Chrome()
44 | browser.set_page_load_timeout(15)
45 | try:
46 | browser.get(link_bio)
47 | except:
48 | time.sleep(1.5)
49 | bs = BeautifulSoup(browser.find_element_by_xpath('//div[@id="mw-content-text"]').get_attribute('innerHTML').strip(), 'lxml')
50 | try:
51 | h3_elem = [e.find('span').text for e in bs.find_all('h3')]
52 | h3_elem_formation = [k for k in h3_elem if 'formation' in k.lower() or 'études' in k.lower() or 'etudes' in k.lower() or 'parcours' in k.lower()
53 | or "cursus" in k.lower()][0]
54 | f, s = h3_elem[h3_elem.index(h3_elem_formation):h3_elem.index(h3_elem_formation)+2]
55 | except:
56 | h2_elem = [e.find('span').text for e in bs.find_all('h2')[1:]]
57 | h2_elem_biographie = [k for k in h2_elem if 'biographie' in k.lower() or 'carrière' in k.lower() or 'formation' in k.lower() or 'parcours' in k.lower()
58 | or "cursus" in k.lower()][0]
59 | f, s = h2_elem[h2_elem.index(h2_elem_biographie):h2_elem.index(h2_elem_biographie)+2]
60 | try:
61 | first, second = [m.start() for m in re.finditer(f, bs.text)][1], [m.start() for m in re.finditer(s, bs.text)][1]
62 | browser.quit()
63 | s = bs.text[first:second].replace('\n', '')
64 | return convert_formation(re.sub(r'\[*\d*\]', '', s))
65 | except:
66 | print('Error for %s' % link_bio)
67 |
68 | def get_previous_government_link(browser, link):
69 | try:
70 | browser.get(link)
71 | time.sleep(0.5)
72 | except:
73 | time.sleep(1.5)
74 | box = browser.find_element_by_xpath('//table[@class="infobox_v2"]')
75 | browser.execute_script("return arguments[0].scrollIntoView();", box)
76 | bs = BeautifulSoup(box.get_attribute('innerHTML').strip(), 'lxml')
77 | d = dict(set([([k.replace('|','').replace('\n','') for k in e.text.replace('\n\n','|').split('||') if k != ''][0],
78 | "https://fr.wikipedia.org" + e.find('a')['href'])
79 | for e in bs.find_all('tr') if e.find('img', {'alt':"Précédent"})]))
80 | duration = [e.find('td').text.encode().decode('ascii', 'ignore') for e in bs.find_all('tr') if e.find('th', text='Durée')][0]
81 | return d, duration
82 |
83 |
84 | def convert_duration(dur):
85 | splits = [''.join([s for s in elem if s.isdigit()]) for elem in dur.split('an')]
86 | if len(splits) == 2:
87 | return 365 * int(splits[0]) + int(splits[1])
88 | else:
89 | return int(splits[0])
--------------------------------------------------------------------------------
/ObjectDetection/color_extract.py:
--------------------------------------------------------------------------------
1 | import cv2
2 | import numpy as np
3 | import sys
4 | import os
5 | import imp
6 | import glob
7 | from sklearn.cluster import KMeans
8 | import skimage.filters as skf
9 | import skimage.color as skc
10 | import skimage.morphology as skm
11 | from skimage.measure import label
12 |
13 | class Back():
14 | """
15 | Two algorithms are used together to separate background and foreground.
16 | One considers as background all pixels whose color is close to the pixels
17 | in the corners. This part is impacted by the `max_distance' and
18 | `use_lab' settings.
19 | The second one computes the edges of the image and uses a flood fill
20 | starting from all corners.
21 | """
22 | def __init__(self, max_distance=5, use_lab=True):
23 | """
24 | The possible settings are:
25 | - max_distance: The maximum distance for two colors to be
26 | considered close. A higher value yields a more aggressive
27 | background removal.
28 | (default: 5)
29 | - use_lab: Whether to use the LAB color space to perform
30 | background removal. More expensive but closer to eye perception.
31 | (default: True)
32 | """
33 | self._settings = {
34 | 'max_distance': max_distance,
35 | 'use_lab': use_lab,
36 | }
37 |
38 | def get(self, img):
39 | f = self._floodfill(img)
40 | g = self._global(img)
41 | m = f | g
42 |
43 | if np.count_nonzero(m) < 0.90 * m.size:
44 | return m
45 |
46 | ng, nf = np.count_nonzero(g), np.count_nonzero(f)
47 |
48 | if ng < 0.90 * g.size and nf < 0.90 * f.size:
49 | return g if ng > nf else f
50 | if ng < 0.90 * g.size:
51 | return g
52 | if nf < 0.90 * f.size:
53 | return f
54 |
55 | return np.zeros_like(m)
56 |
57 | def _global(self, img):
58 | h, w = img.shape[:2]
59 | mask = np.zeros((h, w), dtype=np.bool)
60 | max_distance = self._settings['max_distance']
61 |
62 | if self._settings['use_lab']:
63 | img = skc.rgb2lab(img)
64 |
65 | # Compute euclidean distance of each corner against all other pixels.
66 | corners = [(0, 0), (-1, 0), (0, -1), (-1, -1)]
67 | for color in (img[i, j] for i, j in corners):
68 | norm = np.sqrt(np.sum(np.square(img - color), 2))
69 | # Add to the mask pixels close to one of the corners.
70 | mask |= norm < max_distance
71 |
72 | return mask
73 |
74 | def _floodfill(self, img):
75 | back = self._scharr(img)
76 | # Binary thresholding.
77 | back = back > 0.05
78 |
79 | # Thin all edges to be 1-pixel wide.
80 |
81 | back = skm.skeletonize(back)
82 | # Edges are not detected on the borders, make artificial ones.
83 | back[0, :] = back[-1, :] = True
84 | back[:, 0] = back[:, -1] = True
85 |
86 | # Label adjacent pixels of the same color.
87 | labels = label(back, background=-1, connectivity=1)
88 |
89 | # Count as background all pixels labeled like one of the corners.
90 | corners = [(1, 1), (-2, 1), (1, -2), (-2, -2)]
91 | for l in (labels[i, j] for i, j in corners):
92 | back[labels == l] = True
93 |
94 | # Remove remaining inner edges.
95 | return skm.opening(back)
96 |
97 | def _scharr(self, img):
98 | # Invert the image to ease edge detection.
99 | img = 1. - img
100 | grey = skc.rgb2grey(img)
101 | return skf.scharr(grey)
102 |
103 | class ColorExtractor():
104 | def __init__(self, n_clusters=3):
105 | self._back = Back(max_distance=10)
106 | self._n_clusters = n_clusters
107 | def get_bar_hist(self, filepath, full_return=False):
108 | image = cv2.imread(filepath)
109 | image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
110 | back_mask = self._back.get(image)
111 | image_noback = np.zeros(image.shape)
112 | image_noback[~back_mask] = image[~back_mask] /255
113 | image_array = image[~back_mask]
114 | clt = KMeans(n_clusters=self._n_clusters)
115 | clt.fit(image_array)
116 | hist = self.centroid_histogram(clt)
117 | if full_return:
118 | bar = self.plot_colors(hist, clt.cluster_centers_)
119 | return image, image_noback, bar, hist
120 | else:
121 | return image, hist
122 |
123 |
124 | def centroid_histogram(self, clt):
125 | # grab the number of different clusters and create a histogram
126 | # based on the number of pixels assigned to each cluster
127 | numLabels = np.arange(0, len(np.unique(clt.labels_)) + 1)
128 | (hist, _) = np.histogram(clt.labels_, bins = numLabels)
129 |
130 | # normalize the histogram, such that it sums to one
131 | hist = hist.astype("float") / hist.sum()
132 |
133 | # return the histogram
134 | return hist
135 |
136 | def plot_colors(self, hist, centroids):
137 | # initialize the bar chart representing the relative frequency
138 | # of each of the colors
139 | bar = np.zeros((50, 300, 3), dtype = "uint8")
140 | startX = 0
141 |
142 | zipped = zip(hist, centroids)
143 | zipped = sorted(zipped, reverse=True, key=lambda x : x[0])
144 |
145 | # loop over the percentage of each cluster and the color of
146 | # each cluster
147 | for (percent, color) in zipped:
148 | # plot the relative percentage of each cluster
149 | endX = startX + (percent * 300)
150 | cv2.rectangle(bar, (int(startX), 0), (int(endX), 50),
151 | color.astype("uint8").tolist(), -1)
152 | startX = endX
153 |
154 | # return the bar chart
155 | return bar
--------------------------------------------------------------------------------
/ObjectDetection/recognition.py:
--------------------------------------------------------------------------------
1 | import cv2
2 | from tensorflow.python.client import device_lib
3 | from darkflow.net.build import TFNet
4 | import numpy as np
5 | import glob
6 | import os
7 |
8 | def get_available_gpus():
9 | """
10 | Return the list of available GPU device names
11 | """
12 | local_device_protos = device_lib.list_local_devices()
13 | return [x.name for x in local_device_protos if x.device_type == 'GPU']
14 |
15 | def define_options(config_path):
16 | """
17 | Define the network configuration. Load the model and the weights.
18 | Threshold hard coded.
19 | :return: option for tfnet object
20 | """
21 | options = {}
22 | if len(get_available_gpus()) > 0 :
23 | options["gpu"] = 1
24 | else:
25 | options["gpu"] = 0
26 |
27 | model = config_path + 'cfg/yolo.cfg'
28 | weights = config_path + 'yolo.weights'
29 | if os.path.exists(model):
30 | options['model'] = model
31 | else:
32 | print('No cfg model')
33 | if os.path.exists(weights):
34 | options['load'] = weights
35 | else:
36 | print('No yolo weights, wget https://pjreddie.com/media/files/yolo.weights')
37 | options['config'] = config_path + 'cfg/'
38 | options["threshold"] = 0.1
39 | return options
40 |
41 | def non_max_suppression_fast(boxes, probs, overlap_thresh=0.1, max_boxes=30):
42 | """
43 | Eliminating redundant object detection windows with a faster non maximum suppression method
44 | Greedily select high-scoring detections and skip detections that are significantly covered by
45 | a previously selected detection.
46 | :param boxes: list of boxes
47 | :param probs: list of probabilities relatives to the boxes
48 | """
49 |
50 | # if there are no boxes, return an empty list
51 | if len(boxes) == 0:
52 | return []
53 |
54 | # grab the coordinates of the bounding boxes
55 | x1 = boxes[:, 0]
56 | y1 = boxes[:, 1]
57 | x2 = boxes[:, 2]
58 | y2 = boxes[:, 3]
59 |
60 | np.testing.assert_array_less(x1, x2)
61 | np.testing.assert_array_less(y1, y2)
62 |
63 |     # if the bounding boxes are integers, convert them to floats --
64 | # this is important since we'll be doing a bunch of divisions
65 | if boxes.dtype.kind == "i":
66 | boxes = boxes.astype("float")
67 |
68 | # initialize the list of picked indexes
69 | pick = []
70 |
71 | # calculate the areas
72 | area = (x2 - x1 + 1) * (y2 - y1 + 1)
73 |
74 | # sort the bounding boxes
75 | idxs = np.argsort(probs)
76 |
77 | # keep looping while some indexes still remain in the indexes
78 | # list
79 | while len(idxs) > 0:
80 | # grab the last index in the indexes list and add the
81 | # index value to the list of picked indexes
82 | last = len(idxs) - 1
83 | i = idxs[last]
84 | pick.append(i)
85 |
86 | # find the intersection
87 | xx1_int = np.maximum(x1[i], x1[idxs[:last]])
88 | yy1_int = np.maximum(y1[i], y1[idxs[:last]])
89 | xx2_int = np.minimum(x2[i], x2[idxs[:last]])
90 | yy2_int = np.minimum(y2[i], y2[idxs[:last]])
91 | ww_int = np.maximum(0, xx2_int - xx1_int + 0.5)
92 | hh_int = np.maximum(0, yy2_int - yy1_int + 0.5)
93 | area_int = ww_int * hh_int
94 |
95 | # find the union
96 | area_union = area[i] + area[idxs[:last]] - area_int
97 |
98 | # compute the ratio of overlap
99 | overlap = area_int / (area_union + 1e-6)
100 |
101 |         # delete all indexes from the index list whose overlap exceeds the threshold
102 | idxs = np.delete(idxs, np.concatenate(([last],
103 | np.where(overlap > overlap_thresh)[0])))
104 |
105 | if len(pick) >= max_boxes:
106 | break
107 |
108 | # return only the bounding boxes that were picked using the integer data type
109 | boxes = boxes[pick].astype("int")
110 | probs = probs[pick]
111 | return boxes, probs
112 |
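# Example (commented, not part of the original script) of what non_max_suppression_fast
# does on toy data: the two heavily overlapping boxes collapse to the highest-scoring
# one, while the third box is kept because it overlaps nothing.
#
#   boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]])
#   probs = np.array([0.9, 0.8, 0.7])
#   kept_boxes, kept_probs = non_max_suppression_fast(boxes, probs, overlap_thresh=0.5)
#   # kept_boxes -> [[10, 10, 60, 60], [100, 100, 150, 150]], kept_probs -> [0.9, 0.7]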
113 | def predict_one(image, tfnet):
114 | """
115 | Object detection in a picture
116 | :param image: image numpy array
117 | :param tfnet: net object
118 | :return: picture with object bounding boxes, predictions, confidence scores
119 | """
120 | font = cv2.FONT_HERSHEY_SIMPLEX
121 | # prediction
122 | result = tfnet.return_predict(image)
123 |     # Separate each label to apply non-max suppression per class
124 | for label in set([k['label'] for k in result]):
125 | boxes = np.array([[k['topleft']['x'], k['topleft']['y'], k['bottomright']['x'], k['bottomright']['y']] for k in result
126 | if k['label'] == label])
127 | probs = np.array([k['confidence'] for k in result if k['label'] == label])
128 |         # eliminate redundant overlapping boxes for this label
129 | b, p = non_max_suppression_fast(boxes, probs)
130 | for k in b:
131 |             # keep only boxes whose average side is under a fifth of the image's smaller dimension
132 | if np.abs(np.mean(np.array((k[0],k[1]))-np.array((k[2], k[3])))) < np.min(image.shape[:2])/5:
133 | cv2.rectangle(image, (k[0],k[1]), (k[2], k[3]), (255,0,0), 2)
134 | cv2.putText(image, label, (k[0],k[1]), font, 0.3, (255, 0, 0))
135 | return image, b, p
136 |
137 | def predict(folder, nb_items=None, config_path='../darkflow/'):
138 | """
139 | Multiple predictions
140 | :param folder: folder with pictures
141 | :param nb_items: number of pictures to predict
142 |     :param config_path: weights and model files path
143 |     :return: list of pictures with object bounding boxes, predictions and confidence scores
144 | """
145 | images = [cv2.imread(f) for f in glob.glob(folder+'*')][:nb_items]
146 | results = []
147 | tfnet = TFNet(define_options(config_path))
148 | for image in images:
149 | results.append(predict_one(image, tfnet))
150 | return results
151 |
152 |
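# Usage sketch (commented, not part of the original script). Assuming darkflow is
# installed and the YOLO cfg/weights live under ../darkflow/ as expected by
# define_options, a folder of images (path is hypothetical) could be processed with:
#
#   results = predict('./images/', nb_items=5, config_path='../darkflow/')
#   annotated_image, boxes, probs = results[0]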
--------------------------------------------------------------------------------
/ProjectMovieRating/README.md:
--------------------------------------------------------------------------------
1 | # Predict IMDB movie rating
2 |
3 | Project inspired by Chuan Sun's [work](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset), without using Scrapy
4 |
5 | Main question: how can we tell the greatness of a movie before it is released in cinemas?
6 |
7 | ## First Part - Parsing data
8 |
9 | The first part parses data from the IMDb and The Numbers websites: casting information, directors, production companies, awards, genres, budget, gross, description, imdb_rating, etc.
10 | To create the movie_contents.json file:
11 | ``python3 parser.py nb_elements``
12 |
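A minimal sketch of how the dumped file can be inspected afterwards (assuming the parser has already produced `movie_contents.json` in the current directory):
```
import json

with open('movie_contents.json', 'r') as fp:
    movies = json.load(fp)

# each entry is a dictionary of parsed fields, e.g. title and score
# (the 'idmb_score' spelling comes from the parser itself)
print(movies[0]['movie_title'], movies[0]['idmb_score'])
```
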
13 | ## Second Part - Data Analysis
14 |
15 | The second part analyzes the dataframe and looks at correlations between variables. For example, are movie awards correlated with the worldwide gross? Is a well-liked movie also one with a well-liked cast?
16 | See the jupyter notebook file.
17 |
18 | 
19 |
20 | As we can see in the picture above, the IMDB score is correlated with the number of awards and the gross, but not really with the production budget or the number of Facebook likes of the cast.
21 | Unsurprisingly, domestic and worldwide gross are highly correlated. The higher the production budget, the higher the gross tends to be.
22 | As shown in the notebook, the budget is not really correlated with the number of awards.
23 | Interestingly, the popularity of the third most famous actor matters more for the IMDB score than the popularity of the most famous actor (correlation 0.2 vs 0.08).
24 | (Many other charts in the Jupyter notebook)
25 |
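A minimal sketch of this kind of correlation analysis (assuming seaborn is installed and that the dataframe is built with `create_dataframe` from `parser.py`; the column subset is illustrative):
```
import matplotlib.pyplot as plt
import seaborn as sns
from parser import create_dataframe

# build the dataframe from the two dumped json files
df = create_dataframe('movie_contents.json', 'movie_budget.json')

# correlation matrix between a few numeric variables, shown as a heatmap
cols = ['idmb_score', 'production_budget', 'domestic_gross', 'worldwide_gross']
sns.heatmap(df[cols].corr(), annot=True, cmap='coolwarm')
plt.show()
```
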
26 | ## Third Part - Predict the IMDB score
27 |
28 | Machine Learning to predict the IMDB score with the meaningful variables.
29 | Using a Random Forest algorithm (500 estimators).
30 | 
31 |
32 |
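A minimal sketch of this modelling step, under the assumption that the feature columns have already been made numeric (the preprocessing here is simplified):
```
from parser import create_dataframe
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = create_dataframe('movie_contents.json', 'movie_budget.json')
# keep only numeric columns as features, the IMDB score is the target
features = df.drop(columns=['idmb_score']).select_dtypes('number').fillna(0)
target = df['idmb_score']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X_train, y_train)
print('R^2 on held-out movies:', model.score(X_test, y_test))
```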
--------------------------------------------------------------------------------
/ProjectMovieRating/__pycache__/imdb_movie_content.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/ProjectMovieRating/__pycache__/imdb_movie_content.cpython-35.pyc
--------------------------------------------------------------------------------
/ProjectMovieRating/__pycache__/parser.cpython-35.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/ProjectMovieRating/__pycache__/parser.cpython-35.pyc
--------------------------------------------------------------------------------
/ProjectMovieRating/genre.json:
--------------------------------------------------------------------------------
1 | {"Action": "0",
2 | "Adventure": "17",
3 | "Animation": "14",
4 | "Biography": "20",
5 | "Comedy": "21",
6 | "Crime": "10",
7 | "Documentary": "3",
8 | "Drama": "18",
9 | "Family": "19",
10 | "Fantasy": "6",
11 | "History": "8",
12 | "Horror": "4",
13 | "Music": "12",
14 | "Musical": "2",
15 | "Mystery": "16",
16 | "Romance": "9",
17 | "Sci-Fi": "13",
18 | "Short": "5",
19 | "Sport": "11",
20 | "Thriller": "7",
21 | "War": "15",
22 | "Western": "1"}
--------------------------------------------------------------------------------
/ProjectMovieRating/imdb_movie_content.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 | from bs4 import BeautifulSoup
4 | import time
5 | import random
6 | import re
7 | import datetime
8 |
9 | def parse_facebook_likes_number(num_likes_string):
10 | if num_likes_string[-1] == 'K':
11 | thousand = 1000
12 | else:
13 | thousand = 1
14 | return float(num_likes_string.replace('K','')) * thousand
15 |
16 | def parse_duration(time_string):
17 | time_string = time_string.split('min')[0]
18 | x = time.strptime(time_string,'%Hh%M')
19 | return datetime.timedelta(hours=x.tm_hour,minutes=x.tm_min,seconds=x.tm_sec).total_seconds()
20 |
21 | class ImdbMovieContent():
22 | def __init__(self, movies):
23 | self.movies = movies
24 | self.base_url = "http://www.imdb.com"
25 |
26 | def get_facebook_likes(self, entity_id):
27 | if entity_id.startswith('nm'):
28 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Fname%2F{}%2F&colorscheme=light".format(entity_id)
29 | elif entity_id.startswith('tt'):
30 | url = "https://www.facebook.com/widgets/like.php?width=280&show_faces=1&layout=standard&href=http%3A%2F%2Fwww.imdb.com%2Ftitle%2F{}%2F&colorscheme=light".format(entity_id)
31 | else:
32 | url = None
33 |         time.sleep(random.uniform(0, 0.25)) # randomly snooze for up to 0.25 second
34 | try:
35 | response = requests.get(url)
36 | soup = BeautifulSoup(response.text, "lxml")
37 | sentence = soup.find_all(id="u_0_2")[0].span.string # get sentence like: "43K people like this"
38 | num_likes = sentence.split(" ")[0]
39 | except Exception as e:
40 | num_likes = None
41 |         return parse_facebook_likes_number(num_likes) if num_likes else None
42 |
43 | def get_id_from_url(self, url):
44 | if url is None:
45 | return None
46 | return url.split('/')[4]
47 |
48 | def parse(self):
49 | movies_content = []
50 | for movie in self.movies:
51 | imdb_url = movie['imdb_url']
52 | response = requests.get(imdb_url)
53 | bs = BeautifulSoup(response.text, 'lxml')
54 | movies_content.append(self.get_content(bs))
55 | return movies_content
56 |
57 | def get_awards(self, movie_link):
58 | awards_url = movie_link + 'awards'
59 | response = requests.get(awards_url)
60 | bs = BeautifulSoup(response.text, 'lxml')
61 | awards = bs.find_all('tr')
62 | award_dict = []
63 | for award in awards:
64 | if 'rowspan' in award.find('td').attrs:
65 | rowspan = int(award.find('td')['rowspan'])
66 | if rowspan == 1:
67 | award_dict.append({'category' : award.find('span', class_='award_category').text,
68 | 'type' : award.find('b').text,
69 | 'award': award.find('td', class_='award_description').text.split('\n')[1].replace(' ', '')})
70 | else:
71 | index = awards.index(award)
72 | dictt = {'category':award.find('span', class_='award_category').text,
73 | 'type' : award.find('b').text}
74 | awards_ = []
75 | for elem in awards[index:index+rowspan]:
76 | award_dict.append({'category':award.find('span', class_='award_category').text,
77 | 'type' : award.find('b').text,
78 | 'award': elem.find('td', class_='award_description').text.split('\n')[1].replace(' ', '')})
79 | award_nominated = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Nominated']
80 | award_won = [{k:v for k,v in award.items() if k != 'type'} for award in award_dict if award['type'] == 'Won']
81 | return award_nominated, award_won
82 |
83 | def get_content(self, bs):
84 | movie = {}
85 | try:
86 | title_year = bs.find('div', class_ = 'title_wrapper').find('h1').text.encode('latin').decode('utf-8', 'ignore').replace(') ','').split('(')
87 | movie['movie_title'] = title_year[0]
88 | except:
89 | movie['movie_title'] = None
90 | try:
91 | movie['title_year'] = title_year[1]
92 | except:
93 | movie['title_year'] = None
94 | # try:
95 | # movie['genres'] = '|'.join([genre.text.replace(' ','') for genre in bs.find('div', {'itemprop': 'genre'}).find_all('a')])
96 | # except:
97 | # movie['genres'] = None
98 | try:
99 | title_details = bs.find('div', {'id':'titleDetails'})
100 | movie['language'] = '|'.join([language.text for language in title_details.find_all('a', href=re.compile('language'))])
101 | movie['country'] = '|'.join([country.text for country in title_details.find_all('a', href=re.compile('country'))])
102 | except:
103 | movie['language'] = None
104 | try:
105 | keywords = bs.find_all('span', {'itemprop':'keywords'})
106 | movie['keywords'] = '|'.join([key.text for key in keywords])
107 | except:
108 | movie['keywords'] = None
109 | try:
110 | movie['storyline'] = bs.find('div', {'id':'titleStoryLine'}).find('div', {'itemprop':'description'}).text.replace('\n', '')
111 | except:
112 | movie['storyline'] = None
113 | try:
114 | movie['contentRating'] = bs.find('span', {'itemprop':'contentRating'}).text.split(' ')[1]
115 | except:
116 | movie['contentRating'] = None
117 | try:
118 | movie['color'] = bs.find('a', href=re.compile('colors')).text
119 | except:
120 | movie['color'] = None
121 | try:
122 | movie['idmb_score'] = bs.find('span', {'itemprop':'ratingValue'}).text
123 | except:
124 | movie['idmb_score'] = None
125 | try:
126 |             movie['num_voted_users'] = bs.find('span', {'itemprop':'ratingCount'}).text.replace(',', '')
127 | except:
128 | movie['num_voted_users'] = None
129 | try:
130 | movie['duration_sec'] = parse_duration(bs.find('time', {'itemprop':'duration'}).text.replace(' ','').replace('\n', ''))
131 | except:
132 | movie['duration_sec'] = None
133 | try:
134 | review_counts = [int(count.text.split(' ')[0].replace(',', '')) for count in bs.find_all('span', {'itemprop':'reviewCount'})]
135 | movie['num_user_for_reviews'] = review_counts[0]
136 | except:
137 | movie['num_user_for_reviews'] = None
138 | try:
139 | prod_co = bs.find_all('span', {'itemtype': re.compile('Organization')})
140 | movie['production_co'] = [elem.text.replace('\n','') for elem in prod_co]
141 | except:
142 | movie['production_co'] = None
143 | try:
144 | movie['num_critic_for_reviews'] = review_counts[1]
145 | except:
146 | movie['num_critic_for_reviews'] = None
147 | try:
148 | actors = bs.find('table', {'class':'cast_list'}).find_all('a', {'itemprop':'url'})
149 | pairs_for_rows = [(self.base_url + actor['href'], actor.text.replace('\n', '')[1:]) for actor in actors]
150 | movie['cast_info'] = [{'actor_name' : actor[1],
151 | 'actor_link' : actor[0],
152 | 'actor_fb_likes': self.get_facebook_likes(self.get_id_from_url(actor[0]))} for actor in pairs_for_rows]
153 | except:
154 | movie['cast_info'] = None
155 | try:
156 | director = bs.find('span', {'itemprop':'director'}).find('a')
157 | director = (self.base_url + director['href'].replace('?','/?'), director.text)
158 | movie['director_info'] = {'director_name':director[1],
159 | 'director_link':director[0],
160 | 'director_fb_links': self.get_facebook_likes(self.get_id_from_url(director[0]))}
161 | except:
162 | movie['director_info'] = None
163 | try:
164 | movie['movie_imdb_link'] = bs.find('link')['href']
165 | movie['num_facebook_like'] = self.get_facebook_likes(self.get_id_from_url(movie['movie_imdb_link']))
166 | except:
167 | movie['num_facebook_like'] = None
168 | try:
169 | poster_image_url = bs.find('div', {'class':'poster'}).find('img')['src']
170 | movie['image_urls'] = poster_image_url.split("_V1_")[0] + "_V1_.jpg"
171 | except:
172 | movie['image_urls'] = None
173 | try:
174 | award_nominated, award_won = self.get_awards(movie['movie_imdb_link'])
175 | movie['awards'] = {"won": award_won, 'nominated': award_nominated}
176 | except Exception as e:
177 | print(e)
178 | movie['awards'] = None
179 | return movie
180 |
181 |
182 |
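# Usage sketch (commented, not part of the original module). Given a list of movie
# dictionaries that each carry an 'imdb_url' key (as produced by parser.get_imdb_urls):
#
#   provider = ImdbMovieContent(movies)
#   contents = provider.parse()   # one dictionary of parsed fields per movie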
--------------------------------------------------------------------------------
/ProjectMovieRating/parser.py:
--------------------------------------------------------------------------------
1 | from bs4 import BeautifulSoup
2 | import json
3 | import requests
4 | import urllib
5 | from tqdm import tqdm
6 | import locale
7 | import pandas as pd
8 | import re
9 | import time
10 | import random
11 | import sys
12 |
13 | from imdb_movie_content import ImdbMovieContent
14 |
15 | def parse_price(price):
16 | """
17 | Convert string price to numbers
18 | """
19 | if not price:
20 | return 0
21 | price = price.replace(',', '')
22 | return locale.atoi(re.sub('[^0-9,]', "", price))
23 |
24 | def get_movie_budget():
25 | """
26 |     Parse The Numbers website to get the budget data.
27 |     :return: list of dictionaries with budget and gross
28 | """
29 | movie_budget_url = 'http://www.the-numbers.com/movie/budgets/all'
30 | response = requests.get(movie_budget_url)
31 | bs = BeautifulSoup(response.text, 'lxml')
32 | table = bs.find('table')
33 | rows = [elem for elem in table.find_all('tr') if elem.get_text() != '\n']
34 |
35 | movie_budget = []
36 | for row in rows[1:]:
37 | specs = [elem.get_text() for elem in row.find_all('td')]
38 | movie_name = specs[2].encode('latin1').decode('utf8', 'ignore')
39 | movie_budget.append({'release_date': specs[1],
40 | 'movie_name': movie_name,
41 | 'production_budget': parse_price(specs[3]),
42 | 'domestic_gross': parse_price(specs[4]),
43 | 'worldwide_gross': parse_price(specs[5])})
44 |
45 | return movie_budget
46 |
47 | def get_imdb_urls(movie_budget, nb_elements=None):
48 | """
49 |     Parse the IMDb website to get IMDb movie links.
50 |     Dump a json file with budget, gross and IMDb urls.
51 |     :param movie_budget: list of dictionaries with budget and gross
52 | :param nb_elements: number of movies to parse
53 | """
54 | for movie in tqdm(movie_budget[1000:1000+nb_elements]):
55 | movie_name = movie['movie_name']
56 | title_url = urllib.parse.quote(movie_name.encode('utf-8'))
57 | imdb_search_link = "http://www.imdb.com/find?ref_=nv_sr_fn&q={}&s=tt".format(title_url)
58 | response = requests.get(imdb_search_link)
59 | bs = BeautifulSoup(response.text, 'lxml')
60 | results = bs.find("table", class_= "findList" )
61 | try:
62 | movie['imdb_url'] = "http://www.imdb.com" + results.find('td', class_='result_text').find('a')['href']
63 | except:
64 | movie['imdb_url'] = None
65 |
66 | with open('movie_budget.json', 'w') as fp:
67 | json.dump(movie_budget, fp)
68 |
69 | def get_imdb_content(movie_budget_path, nb_elements=None):
70 | """
71 |     Parse the IMDb website to get IMDb content: awards, casting, description, etc.
72 |     Dump a json file with the IMDb content.
73 | :param movie_budget_path: path of the movie_budget.json file
74 | :param nb_elements: number of movies to parse
75 | """
76 | with open(movie_budget_path, 'r') as fp:
77 | movies = json.load(fp)
78 | content_provider = ImdbMovieContent(movies)
79 | contents = []
80 | threshold = 1300
81 | for i, movie in enumerate(movies[threshold:threshold+nb_elements]):
82 | time.sleep(random.uniform(0, 0.25))
83 | print("\r%i / %i" % (i, len(movies[threshold:threshold+nb_elements])), end="")
84 | try:
85 | imdb_url = movie['imdb_url']
86 | response = requests.get(imdb_url)
87 | bs = BeautifulSoup(response.text, 'lxml')
88 | movies_content = content_provider.get_content(bs)
89 | contents.append(movies_content)
90 | except Exception as e:
91 | print(e)
92 | pass
93 | with open('movie_contents7.json', 'w') as fp:
94 | json.dump(contents, fp)
95 |
96 | def parse_awards(movie):
97 | """
98 |     Convert awards information to a dictionary for the dataframe.
99 |     Keep only Oscar, BAFTA, Golden Globe and Palme d'Or awards.
100 |     :param movie: movie dictionary
101 |     :return: well-formatted dictionary with awards information
102 | """
103 | awards_kept = ['Oscar', 'BAFTA Film Award', 'Golden Globe', 'Palme d\'Or']
104 | awards_category = ['won', 'nominated']
105 | parsed_awards = {}
106 | for category in awards_category:
107 | for awards_type in awards_kept:
108 | awards_cat = [award for award in movie['awards'][category] if award['category'] == awards_type]
109 | for k, award in enumerate(awards_cat):
110 | parsed_awards['{}_{}_{}'.format(awards_type, category, k+1)] = award["award"]
111 | return parsed_awards
112 |
113 | def parse_actors(movie):
114 | """
115 |     Convert casting information to a dictionary for the dataframe.
116 |     Keep only the 3 actors with the most Facebook likes.
117 |     :param movie: movie dictionary
118 |     :return: well-formatted dictionary with casting information
119 | """
120 | sorted_actors = sorted(movie['cast_info'], key=lambda x:x['actor_fb_likes'], reverse=True)
121 | top_k = 3
122 | parsed_actors = {}
123 | parsed_actors['total_cast_fb_likes'] = sum([actor['actor_fb_likes'] for actor in movie['cast_info']]) + movie['director_info']['director_fb_links']
124 | for k, actor in enumerate(sorted_actors[:top_k]):
125 | if k < len(sorted_actors):
126 | parsed_actors['actor_{}_name'.format(k+1)] = actor['actor_name']
127 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = actor['actor_fb_likes']
128 | else:
129 | parsed_actors['actor_{}_name'.format(k+1)] = None
130 | parsed_actors['actor_{}_fb_likes'.format(k+1)] = None
131 | return parsed_actors
132 |
133 | def parse_production_company(movie):
134 | """
135 |     Convert production companies to a dictionary for the dataframe.
136 |     Keep only the first 3 production companies.
137 |     :param movie: movie dictionary
138 |     :return: well-formatted dictionary with production companies
139 | """
140 | parsed_production_co = {}
141 | top_k = 3
142 | production_companies = movie['production_co'][:top_k]
143 | for k, company in enumerate(production_companies):
144 | if k < len(movie['production_co']):
145 | parsed_production_co['production_co_{}'.format(k+1)] = company
146 | else:
147 | parsed_production_co['production_co_{}'.format(k+1)] = None
148 | return parsed_production_co
149 |
150 | def parse_genres(movie):
151 | """
152 |     Convert genres to a dictionary for the dataframe.
153 |     :param movie: movie dictionary
154 |     :return: well-formatted dictionary with genres
155 | """
156 | parse_genres = {}
157 | g = movie['genres']
158 | with open('genre.json', 'r') as f:
159 | genres = json.load(f)
160 | for k, genre in enumerate(g):
161 | if genre in genres:
162 | parse_genres['genre_{}'.format(k+1)] = genres[genre]
163 | return parse_genres
164 |
165 | def create_dataframe(movies_content_path, movie_budget_path):
166 | """
167 | Create dataframe from movie_budget.json and movie_content.json files.
168 | :param movies_content_path: path of the movies_content.json file
169 | :param movie_budget_path: path of the movie_budget.json file
170 |     :return: well-formatted dataframe
171 | """
172 | with open(movies_content_path, 'r') as fp:
173 | movies = json.load(fp)
174 | with open(movie_budget_path, 'r') as fp:
175 | movies_budget = json.load(fp)
176 | movies_list = []
177 | for movie in movies:
178 | content = {k:v for k,v in movie.items() if k not in ['awards', 'cast_info', 'director_info', 'production_co']}
179 | name = movie['movie_title']
180 | try:
181 | budget = [film for film in movies_budget if film['movie_name']==name][0]
182 | budget = {k:v for k,v in budget.items() if k not in ['imdb_url', 'movie_name']}
183 | content.update(budget)
184 | except:
185 | pass
186 | try:
187 | content.update(parse_awards(movie))
188 | except:
189 | pass
190 | try:
191 | content.update(parse_genres(movie))
192 | except:
193 | pass
194 | try:
195 | content.update({k:v for k,v in movie['director_info'].items() if k!= 'director_link'})
196 | except:
197 | pass
198 | try:
199 | content.update(parse_production_company(movie))
200 | except:
201 | pass
202 | try:
203 | content.update(parse_actors(movie))
204 | except:
205 | pass
206 | movies_list.append(content)
207 | df = pd.DataFrame(movies_list)
208 | df = df[pd.notnull(df.idmb_score)]
209 | df.idmb_score = df.idmb_score.apply(float)
210 | return df
211 |
212 | if __name__ == '__main__':
213 | if len(sys.argv) > 1:
214 | nb_elements = int(sys.argv[1])
215 | else:
216 | nb_elements = None
217 | # movie_budget = get_movie_budget()
218 | # movies = get_imdb_urls(movie_budget, nb_elements=nb_elements)
219 | get_imdb_content("movie_budget.json", nb_elements=nb_elements)
220 |
221 |
222 |
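# Usage sketch (commented, not part of the original script). Once movie_budget.json
# and a movie_contents*.json dump exist, the final dataframe can be built with:
#
#   df = create_dataframe('movie_contents.json', 'movie_budget.json')
#   print(df[['movie_title', 'idmb_score', 'worldwide_gross']].head())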
--------------------------------------------------------------------------------
/ProjectSoccer/README.md:
--------------------------------------------------------------------------------
1 | # Parsing and using Soccer data
2 |
3 | ## First Part - Parsing data
4 |
5 | The first part parses data from multiple websites: games, teams, players, etc.
6 | To collect data about a team and dump a json file (a ``./teams`` folder must exist first):
7 | ``python3 dumper.py team_name``
8 |
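A minimal sketch of loading the pickled dataframe shipped in this folder (assuming `df.pkl` is in the current directory, as produced by the parsing step):
```
import pandas as pd

df = pd.read_pickle('df.pkl')
print(df.head())
```
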
9 | ## Second Part - Data Analysis
10 |
11 | The second part is to analyze the dataset to understand what I can do with it.
12 |
13 | 
14 |
15 | 
--------------------------------------------------------------------------------
/ProjectSoccer/df.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/ProjectSoccer/df.pkl
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Online-Challenge
2 | Challenges submitted on HackerRank and Kaggle.
3 |
4 | Algorithm challenges are made on HackerRank using Python.
5 |
6 | Data Science and Machine Learning challenges are made on Kaggle using Python too.
7 |
8 | ## [Kaggle Bike Sharing](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleBikeSharing)
9 | The goal of this challenge is to build a model that predicts the count of bikes shared, based exclusively on contextual features. The first part of this challenge was about understanding, analysing and processing the dataset, producing meaningful information with plots. The second part was about building a model with a Machine Learning library in order to predict the count.
10 |
11 | -->[French Explanations PDF](https://github.com/alexattia/Data-Science-Projects/blob/master/KaggleBikeSharing/Kaggle_BikeSharing_Explanations_French.pdf)
12 |
13 | ## [Twitter Parsing](https://github.com/alexattia/Data-Science-Projects/tree/master/TwitterParsing)
14 |
15 | I've recently discovered Chris Albon's Machine Learning flash cards and wanted to download them, but the official Twitter API only returns tweets up to 2 weeks old, so I had to find a way to bypass this limitation: Selenium and PhantomJS.
16 | Purpose of this project: check every 2 hours whether he has posted new flash cards; if so, download them and send me a summary email.
17 |
18 | ## [Face Recognition](https://github.com/alexattia/Data-Science-Projects/tree/master/FaceRecognition)
19 |
20 | Modern face recognition with deep learning and the HOG algorithm. Using the dlib C++ library, I built a quick face recognition tool that needs only a few pictures (20 per person).
21 |
22 | ## [Playing with Soccer data](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleSoccer)
23 |
24 | As a soccer fan passionate about data, I wanted to play with and analyze soccer data.
25 | I don't currently know what the final aim of this project is, but I will parse data from various websites, for different teams and different players.
26 |
27 | ## [NYC Taxi Trips](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleTaxiTrip)
28 |
29 | Kaggle playground to predict the total ride duration of taxi trips in New York City.
30 |
31 | ## [Kaggle Understanding the Amazon from Space](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleAmazon)
32 | Use satellite data to track the human footprint in the Amazon rainforest.
33 | Deep Learning model (using Keras) to label satellite images.
34 |
35 | ## [Predicting IMDB movie rating](https://github.com/alexattia/Data-Science-Projects/tree/master/KaggleMovieRating)
36 | Project inspired by Chuan Sun's [work](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset)
37 | How can we tell the greatness of a movie?
38 | Scraping and Machine Learning
--------------------------------------------------------------------------------
/TwitterParsing/README.md:
--------------------------------------------------------------------------------
1 | # Twitter Parsing
2 |
3 | ## Parsing Machine Learning Flashcards
4 | After spending some time on Twitter following Machine Learning, Deep Learning and Data Science "influencers", I have found some interesting stuff.
5 | I decided to download some of those tweets. For example, @chrisalbon writes (draws would be more accurate) interesting Machine Learning flash cards.
6 | I am using Selenium to parse and download those flash cards, and I created a daemon to check for new flash cards and send them to me every day.
7 |
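A minimal sketch of the scraping idea, mirroring what `download_pics.py` does (assumes Selenium and PhantomJS are installed at the same paths as in that script):
```
from selenium import webdriver
import urllib.parse

# open the Twitter search page for the flash-cards hashtag
driver = webdriver.PhantomJS("/usr/local/bin/phantomjs")
query = urllib.parse.quote('#machinelearningflashcards')
driver.get("https://twitter.com/search?f=tweets&vertical=default&q=" + query)

# scroll down so Twitter loads tweets older than the official API would return
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
tweets = driver.find_elements_by_class_name('tweet')
print(len(tweets), 'tweets loaded')
```
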
8 | `com.alexattia.machinelearning.downloadflashcards.plist` is the plist file to copy into /Library/LaunchDaemons/
9 | `run.sh` is the script to launch every day
10 | `download_pics.py` is the Python script
11 |
12 | ## Sending one picture a day
13 | Recently, @chrisalbon published his [machine learning flashcards](machinelearningflashcards.com). There are more than 300 pictures. In order to learn them, I want to study one a day, so I am using a daemon that sends me an email every day with an embedded flashcard.
--------------------------------------------------------------------------------
/TwitterParsing/com.alexattia.machinelearning.downloadflashcards.plist:
--------------------------------------------------------------------------------
1 | <?xml version="1.0" encoding="UTF-8"?>
2 | <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
3 | <plist version="1.0">
4 | <dict>
5 |     <key>Label</key>
6 |     <string>com.alexattia.machinelearning.downloadflashcards</string>
7 |     <key>ProgramArguments</key>
8 |     <array>
9 |         <string>/Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/run.sh</string>
10 |     </array>
11 |     <key>StandardOutPath</key>
12 |     <string>/Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/output.txt</string>
13 |     <key>StandardErrorPath</key>
14 |     <string>/Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/errors.txt</string>
15 |     <key>StartCalendarInterval</key>
16 |     <dict>
17 |         <key>Hour</key>
18 |         <integer>16</integer>
19 |         <key>Minute</key>
20 |         <integer>0</integer>
21 |     </dict>
22 | </dict>
23 | </plist>
--------------------------------------------------------------------------------
/TwitterParsing/config.py:
--------------------------------------------------------------------------------
1 | class Config:
2 | def __init__(self):
3 | self.my_email_address = 'XXX'
4 | self.password = 'XXX'
5 | self.dest = 'XXX'
6 | self.message_init = """
7 |
8 |
9 |
10 |
11 |
12 |
13 | Hi Alexandre,
14 | {number_flash_cards} machine learning flashcards have been downloaded.
15 |
16 |
19 | |
20 |
21 |
22 |
23 |
29 | |
30 |
31 |
32 |
33 |
34 | |
35 |
36 | """
37 |
38 | self.intro = """\
39 |
40 |
41 |
42 |
43 |
44 |
45 |
46 |
47 |
238 |
239 |
240 |
241 |
344 |
345 |
346 | """
--------------------------------------------------------------------------------
/TwitterParsing/config_downl.py:
--------------------------------------------------------------------------------
1 | class Config:
2 | def __init__(self):
3 | self.my_email_address = 'XXX'
4 | self.password = 'XXX'
5 | self.dest = 'XXX'
6 | self.message_init = """
7 |
8 |
9 |
10 |
11 |
12 |
13 | Hi Alexandre,
14 | Today the machine learning flashcard is about {flashcard_title}.
15 |
16 | |
17 |
18 |
19 |
20 |
26 | |
27 |
28 |
29 |
30 |
31 | |
32 |
33 | """
34 |
35 | self.intro = """\
36 |
37 |
38 |
39 |
40 |
41 |
42 |
43 |
44 |
235 |
236 |
237 |
238 |
239 |
240 |
241 |
242 |
243 |
244 |
245 |
246 |
247 | |
248 |
249 |
250 | |
251 |
252 |
253 |
254 |
255 |
256 |
257 |
258 |
259 |
260 |
281 | """
282 | self.flashcard = """
283 |
284 |
285 | |
286 |
287 |
288 |
289 |
290 |
291 | |
292 |
293 |
294 | |
295 |
296 | """
297 |
298 | self.end = """
299 |
300 |
301 |
302 | |
303 |
304 |
305 | |
306 |
307 |
308 |
309 |
310 | """
--------------------------------------------------------------------------------
/TwitterParsing/download_pics.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | from selenium import webdriver
3 | import urllib
4 | import time
5 | import sys
6 | import numpy as np
7 | import glob
8 | import os
9 | import datetime
10 | from bs4 import BeautifulSoup
11 | import smtplib
12 | from email.mime.text import MIMEText
13 | from email.header import Header
14 | from email.utils import formataddr
15 | import config
16 |
17 | def internet_on():
18 | """
19 | Check if the network connection is on
20 | :return: boolean
21 | """
22 | try:
23 | urllib.request.urlopen('http://216.58.192.142', timeout=1)
24 | return True
25 | except urllib.error.URLError:
26 | return False
27 |
28 | def get_all_tweets_chrome(query):
29 | """
30 | Get all tweets from the webpage using the Chrome driver
31 | :param query: Tweet search query
32 | :return: all tweets on page as selenium objects
33 | """
34 | driver = webdriver.Chrome("/usr/local/bin/chromedriver")
35 | driver.set_page_load_timeout(20)
36 | driver.get("https://twitter.com/search?f=tweets&vertical=default&q={}".format(urllib.parse.quote(query)))
37 |
38 | length = []
39 | try:
40 | while True:
41 | time.sleep(np.random.randint(50,100)*0.01)
42 | tweets_found = driver.find_elements_by_class_name('tweet')
43 | driver.execute_script("return arguments[0].scrollIntoView();", tweets_found[::-1][0])
44 |
45 | # Stop the loop while no more found tweets
46 | length.append(len(tweets_found))
47 | if len(tweets_found) > 200 and (len(length) - len(set(length))) > 2:
48 | print('%s tweets found at %s' % (len(tweets_found), datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')))
49 | break
50 | except IndexError:
51 |         driver.save_screenshot('/Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing/screenshot_%s.png' % datetime.datetime.now().strftime('%d_%m_%Y_%H_%M'))
52 | time.sleep(np.random.randint(50,100)*0.01)
53 | tweets_found = driver.find_elements_by_class_name('tweet')
54 | print('%s tweets found at %s' % (len(tweets_found), datetime.datetime.now().strftime('%d/%m/%Y - %H:%M')))
55 | except Exception as e:
56 | print(e)
57 | time.sleep(np.random.randint(50,100)*0.01)
58 | tweets_found = driver.find_elements_by_class_name('tweet')
59 | driver.execute_script("return arguments[0].scrollIntoView();", tweets_found[::-1][0])
60 | return driver
61 |
62 | def get_all_tweets_phantom(query):
63 | """
64 | Get all tweets from the webpage using the PhantomJS driver
65 | :param query: Tweet search query
66 | :return: all tweets on page as selenium objects
67 | """
68 | driver = webdriver.PhantomJS("/usr/local/bin/phantomjs")
69 | driver.get("https://twitter.com/search?f=tweets&vertical=default&q={}".format(urllib.parse.quote(query)))
70 | lastHeight = driver.execute_script("return document.body.scrollHeight")
71 | i = 0
72 | while True:
73 | driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
74 | time.sleep(np.random.randint(50,100)*0.01)
75 | newHeight = driver.execute_script("return document.body.scrollHeight")
76 | if newHeight <= lastHeight:
77 | break
78 | lastHeight = newHeight
79 | i += 1
80 | return driver
81 |
82 | def format_tweets(driver):
83 | """
84 |     Convert selenium objects into dictionaries (images, date, text, username as keys)
85 |     :param driver: Selenium driver (with tweets loaded)
86 |     :return: well-formatted tweets as a list of dictionaries
87 | """
88 | tweets_found = driver.find_elements_by_class_name('tweet')
89 | tweets = []
90 | for tweet in tweets_found:
91 | tweet_dict = {}
92 | tweet = tweet.get_attribute('innerHTML')
93 | bs = BeautifulSoup(tweet.strip(), "lxml")
94 | tweet_dict['username'] = bs.find('span', class_='username').text
95 | timestamp = float(bs.find('span', class_='_timestamp')['data-time'])
96 | tweet_dict['date'] = datetime.datetime.fromtimestamp(timestamp)
97 | tweet_dict['tweet_link'] = 'https://twitter.com' + bs.find('a', class_='js-permalink')['href']
98 | tweet_dict['text'] = bs.find('p', class_='tweet-text').text
99 | try:
100 | tweet_dict['images'] = [k['src'] for k in bs.find('div', class_="AdaptiveMedia-container").find_all('img')]
101 | except:
102 | tweet_dict['images'] = []
103 | if len(tweet_dict['images']) > 0:
104 | tweet_dict['text'] = tweet_dict['text'][:tweet_dict['text'].index('pic.twitter')-1]
105 | tweets.append(tweet_dict)
106 | driver.close()
107 | return tweets
108 |
109 | def filter_tweets(tweets):
110 | """
111 |     Filter out tweets not made by Chris Albon, tweets without images and tweets already downloaded
112 |     :param tweets: list of dictionaries
113 |     :return: tweets as a list of dictionaries
114 | """
115 | # We keep only tweets by chrisalbon with pictures
116 | search_tweets = [tw for tw in tweets if tw['username'] == '@chrisalbon' and len(tw['images']) > 0]
117 |     # He made multiple tweets on the same topic, so we keep only the most recent one
118 |     # We use the indexes of the reversed tweet list and a dict to keep one tweet per text
119 | unique_search_index = sorted(list({t['text'].lower():i for i,t in list(enumerate(search_tweets))[::-1]}.values()))
120 | unique_search_tweets = [search_tweets[i] for i in unique_search_index]
121 |
122 | # Keep non-downloaded tweets
123 | most_recent_file = sorted([datetime.datetime.fromtimestamp(os.path.getmtime(path))
124 | for path in glob.glob("./downloaded_pics/*.jpg")], reverse=True)[0]
125 | recent_seach_tweets = [tw for tw in unique_search_tweets if tw['date'] > most_recent_file]
126 |
127 | # Uncomment for testing new tweets
128 | # recent_seach_tweets = [tw for tw in unique_search_tweets if tw['date'] > datetime.datetime(2017, 7, 6, 13, 41, 48)]
129 | return recent_seach_tweets
130 |
131 | def download_pictures(recent_seach_tweets):
132 | """
133 | Download pictures from tweets
134 |     :param recent_seach_tweets: list of dictionaries
135 | """
136 | # Downloading pictures
137 | print('%s - Downloading %d tweets' % (datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'), len(recent_seach_tweets)))
138 | for tw in recent_seach_tweets:
139 | img_url = tw['images'][0]
140 | filename = tw['text'][:tw['text'].index("#")-1].lower().replace(' ','_')
141 | filename = "./downloaded_pics/%s.jpg" % filename
142 | urllib.request.urlretrieve(img_url, filename)
143 |
144 | def send_email(recent_seach_tweets):
145 | """
146 | Send a summary email of new tweets
147 |     :param recent_seach_tweets: list of dictionaries
148 | """
149 | # Create e-mail message
150 | msg = C.intro + C.message_init.format(number_flash_cards=len(recent_seach_tweets))
151 | # Add a special text for the first 3 new tweets
152 | for tw in recent_seach_tweets[:3]:
153 | date = tw['date'].strftime('Le %d/%m/%Y à %H:%M')
154 | link_tweet = tw['tweet_link']
155 | link_picture = tw['images'][0]
156 | tweet_text = tw["text"][:tw["text"].index('#')-1]
157 | msg += C.flashcard.format(date=date,
158 | link_picture=link_picture,
159 | tweet_text=tweet_text,
160 | tweet_link=link_tweet)
161 |
162 | # mapping for the subject
163 | numbers = { 0 : 'zero', 1 : 'one', 2 : 'two', 3 : 'three', 4 : 'four', 5 : 'five',
164 | 6 : 'six', 7 : 'seven', 8 : 'eight', 9 : 'nine', 10 : 'ten'}
165 |
166 | message = MIMEText(msg, 'html')
167 | message['From'] = formataddr((str(Header('Twitter Parser', 'utf-8')), C.my_email_address))
168 | message['To'] = C.dest
169 |     message['Subject'] = '%s new Machine Learning Flash Cards' % numbers.get(len(recent_seach_tweets), str(len(recent_seach_tweets))).title()
170 | msg_full = message.as_string()
171 |
172 | server = smtplib.SMTP('smtp.gmail.com:587')
173 | server.starttls()
174 | server.login(C.my_email_address, C.password)
175 | server.sendmail(C.my_email_address, C.dest, msg_full)
176 | server.quit()
177 | print('%s - E-mail sent!' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'))
178 |
179 | if __name__ == '__main__':
180 | # config file with mail content, email addresses
181 | C = config.Config()
182 | # Twitter search query
183 | query = '#machinelearningflashcards'
184 | # Loop until there is a network connection
185 | time_slept = 0
186 | while not internet_on():
187 | time.sleep(1)
188 | time_slept += 1
189 | if time_slept > 15 :
190 | print('%s - No network connection' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'))
191 | sys.exit()
192 | # driver = get_all_tweets_chrome(query)
193 | driver = get_all_tweets_phantom(query)
194 | tweets = format_tweets(driver)
195 | recent_seach_tweets = filter_tweets(tweets)
196 | if len(recent_seach_tweets) > 0:
197 | download_pictures(recent_seach_tweets)
198 | send_email(recent_seach_tweets)
199 | else:
200 | print('%s - No picture to download' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'))
201 |
202 |
203 |
--------------------------------------------------------------------------------
/TwitterParsing/pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/TwitterParsing/pic.png
--------------------------------------------------------------------------------
/TwitterParsing/pic2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/TwitterParsing/pic2.png
--------------------------------------------------------------------------------
/TwitterParsing/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 |
3 | # change directory
4 | APPDIR=/Users/alexandreattia/Desktop/Work/Practice/HackerRankChallenge/TwitterParsing
5 | export APPDIR
6 | cd $APPDIR
7 |
8 | # activate virtual env
9 | source /Users/alexandreattia/Desktop/Work/workenv/bin/activate
10 |
11 | # launch python script
12 | ./send_pictures.py
--------------------------------------------------------------------------------
/TwitterParsing/send_pictures.py:
--------------------------------------------------------------------------------
1 | import glob
2 | import numpy as np
3 | from email.mime.text import MIMEText
4 | from email.mime.image import MIMEImage
5 | from email.mime.multipart import MIMEMultipart
6 | from email.header import Header
7 | from email.utils import formataddr
8 | import smtplib
9 | import config_downl
10 | from download_pics import internet_on
11 | import datetime
12 | import sys
13 |
14 | def get_picture():
15 | pic_folder = '/Users/alexandreattia/Desktop/Work/machine_learning_flashcards_v1.4/png_web/'
16 | pics = glob.glob(pic_folder + '*')
17 | pic = np.random.choice(pics)
18 | flashcard_title = pic.split('/')[7].replace('_web.png', '').replace('_', ' ')
19 | return pic, flashcard_title
20 |
21 | def send_email(flashcard_title, pic):
22 | """
23 |     Send an email with the picture of the day
24 | """
25 | # Create e-mail Multi Part
26 | msgRoot = MIMEMultipart('related')
27 | msgRoot['From'] = formataddr((str(Header('ML Flashcards', 'utf-8')), C.my_email_address))
28 | msgRoot['To'] = C.dest
29 |     msgRoot['Subject'] = "Today's ML Flashcard: %s" % flashcard_title.title()
30 | msgAlternative = MIMEMultipart('alternative')
31 | msgRoot.attach(msgAlternative)
32 |
33 | # Create e-mail message
34 | msg = C.intro + C.message_init.format(flashcard_title=flashcard_title.title())
35 | msg += C.flashcard
36 | message = MIMEText(msg, 'html')
37 | msgAlternative.attach(message)
38 |
39 | # Add picture to e-mail
40 | fp = open(pic, 'rb')
41 | msgImage = MIMEImage(fp.read())
42 | fp.close()
43 |
44 | # Define the image's ID as referenced above
45 | msgImage.add_header('Content-ID', '')
46 | msgRoot.attach(msgImage)
47 |
48 | msg_full = msgRoot.as_string()
49 |
50 | server = smtplib.SMTP('smtp.gmail.com:587')
51 | server.starttls()
52 | server.login(C.my_email_address, C.password)
53 | server.sendmail(C.my_email_address, C.dest, msg_full)
54 | server.quit()
55 | print('%s - E-mail sent!' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'))
56 |
57 | if __name__ == '__main__':
58 | # config file with mail content, email addresses
59 | C = config_downl.Config()
60 | # Loop until there is a network connection
61 | time_slept = 0
62 | while not internet_on():
63 | time.sleep(1)
64 | time_slept += 1
65 | if time_slept > 15 :
66 | print('%s - No network connection' % datetime.datetime.now().strftime('%d/%m/%Y - %H:%M'))
67 | sys.exit()
68 | p, fc = get_picture()
69 | send_email(fc, p)
70 |
--------------------------------------------------------------------------------
/pics/corr_matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/corr_matrix.png
--------------------------------------------------------------------------------
/pics/features.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/features.png
--------------------------------------------------------------------------------
/pics/psg_stats.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/psg_stats.png
--------------------------------------------------------------------------------
/pics/psg_ste.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexattia/Data-Science-Projects/34fa543a101668c4b646460c4454f1a6600bdc60/pics/psg_ste.png
--------------------------------------------------------------------------------