├── README.md
├── data_visualization.py
└── output_tSNE_visualization.jpeg


/README.md:
--------------------------------------------------------------------------------
 1 | # How to Best Visualize a Dataset Easily
 2 | 
 3 | # Overview
 4 | 
 5 | This is the code for [this](https://youtu.be/yQsOFWqpjkE) video by Siraj Raval on YouTube. The human activities dataset contains 5 classes (sitting-down, standing-up, standing, walking, and sitting) collected over 8 hours of activity from 4 healthy subjects. The dataset is downloaded from [here](http://groupware.les.inf.puc-rio.br/har#ixzz4Mt0Teae2). This code downloads the dataset, cleans it, creates feature vectors, then uses [t-SNE](https://lvdmaaten.github.io/tsne/) to reduce the dimensionality of the feature vectors to just 2. Finally, it uses matplotlib to visualize the data.
 6 | 
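The clean, scale, and embed steps described above can be tried on a small synthetic table before downloading the real data. The following is only a minimal sketch, not the script in this repo: `make_toy_frame`, `embed_2d`, the column names, the three toy class names, and the `perplexity=20` setting are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import LabelEncoder, StandardScaler


def make_toy_frame(n_per_class=30, n_features=6, seed=0):
    """Build a small synthetic table (a hypothetical stand-in for the real dataset)."""
    rng = np.random.RandomState(seed)
    frames = []
    for offset, name in enumerate(["sitting", "standing", "walking"]):
        # Each class is a Gaussian blob shifted by a different offset.
        block = rng.normal(loc=5.0 * offset, scale=1.0, size=(n_per_class, n_features))
        frame = pd.DataFrame(block, columns=["f%d" % i for i in range(n_features)])
        frame["label"] = name
        frames.append(frame)
    df = pd.concat(frames, ignore_index=True)
    # Add one all-NaN column so the cleaning step has something to remove.
    df.insert(0, "mostly_missing", np.nan)
    return df


def embed_2d(df, label_column="label", perplexity=20, seed=0):
    """Drop incomplete columns, scale the features, and reduce them to 2-D with t-SNE."""
    clean = df.dropna(axis=1)                      # keep only columns without NaN
    x = clean.drop(columns=[label_column]).values  # feature matrix
    y = LabelEncoder().fit_transform(clean[label_column].values)
    x_std = StandardScaler().fit_transform(x)      # zero mean, unit variance per feature
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(x_std), y


toy = make_toy_frame()
points, labels = embed_2d(toy)
print(points.shape, np.unique(labels))
```

The returned `points` array has one 2-D row per sample, so it can be passed to `plt.scatter` with one color per value in `labels`. Note that t-SNE requires the perplexity to be smaller than the number of samples, which is why the toy table is not made any smaller.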
 7 | ## Dependencies
 8 | 
 9 | * pandas (http://pandas.pydata.org/)
10 | * numpy (http://www.numpy.org/) 
11 | * scikit-learn (http://scikit-learn.org/) 
12 | * matplotlib (http://matplotlib.org/) 
13 | 
14 | Install dependencies via '[pip](https://pypi.python.org/pypi/pip) install' (e.g. `pip install pandas`).
15 | 
16 | **Note:** an updated dataset is available here if the link above is broken:
17 | http://rstudio-pubs-static.s3.amazonaws.com/19668_2a08e88c36ab4b47876a589bb1d61c37.html
18 | 
19 | ## Usage
20 | 
21 | To run this code, run the following in a terminal:
22 | 
23 | `python data_visualization.py`
24 | 
25 | ## Challenge
26 | 
27 | The challenge for this video is to visualize [this](https://www.kaggle.com/mylesoneill/game-of-thrones) Game of Thrones dataset. Use t-SNE to lower the dimensionality of the data and plot it using matplotlib. In your README, write 2-3 sentences about something you discovered in the data after visualizing it. This will be great practice in understanding why dimensionality reduction is so important and in analyzing data visually.
28 | 
29 | ## Due Date is December 29th, 2016
30 | 
31 | ## Credits
32 | 
33 | The credits for this code go to [Yifeng-He](https://github.com/Yifeng-He). I've merely created a wrapper around the code to make it easy for people to get started.
34 | 


--------------------------------------------------------------------------------
/data_visualization.py:
--------------------------------------------------------------------------------
 1 | 
 2 | import pandas as pd
 3 | import numpy as np
 4 | from sklearn.preprocessing import LabelEncoder
 5 | from sklearn.preprocessing import StandardScaler
 6 | from sklearn.model_selection import train_test_split
 7 | from sklearn.metrics import accuracy_score
 8 | 
 9 | # visualize the important characteristics of the dataset
10 | import matplotlib.pyplot as plt
11 | 
12 | # step 1: download the data
13 | dataframe_all = pd.read_csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
14 | num_rows = dataframe_all.shape[0]
15 | 
16 | # step 2: remove useless data
17 | # count the number of missing elements (NaN) in each column
18 | counter_nan = dataframe_all.isnull().sum()
19 | counter_without_nan = counter_nan[counter_nan==0]
20 | # remove the columns with missing elements
21 | dataframe_all = dataframe_all[counter_without_nan.keys()]
22 | # remove the first 7 columns which contain no discriminative information
23 | dataframe_all = dataframe_all.iloc[:,7:]
24 | # the list of columns (the last column is the class label)
25 | columns = dataframe_all.columns
26 | print(columns)
27 | 
28 | # step 3: get features (x) and scale the features
29 | # get x and convert it to numpy array
30 | x = dataframe_all.iloc[:,:-1].values
31 | standard_scaler = StandardScaler()
32 | x_std = standard_scaler.fit_transform(x)
33 | 
34 | # step 4: get the class labels (y) and encode them as integers
35 | # get class label data
36 | y = dataframe_all.iloc[:,-1].values
37 | # encode the class label
38 | class_labels = np.unique(y)
39 | label_encoder = LabelEncoder()
40 | y = label_encoder.fit_transform(y)
41 | 
42 | # step 5: split the data into training set and test set
43 | test_percentage = 0.1
44 | x_train, x_test, y_train, y_test = train_test_split(x_std, y, test_size = test_percentage, random_state = 0)
45 | 
46 | # t-distributed Stochastic Neighbor Embedding (t-SNE) visualization
47 | from sklearn.manifold import TSNE
48 | tsne = TSNE(n_components=2, random_state=0)
49 | x_test_2d = tsne.fit_transform(x_test)
50 | 
51 | # scatter plot the sample points among 5 classes
52 | markers=('s', 'd', 'o', '^', 'v')
53 | color_map = {0:'red', 1:'blue', 2:'lightgreen', 3:'purple', 4:'cyan'}
54 | plt.figure()
55 | for idx, cl in enumerate(np.unique(y_test)):
56 |     plt.scatter(x=x_test_2d[y_test==cl,0], y=x_test_2d[y_test==cl,1], c=color_map[idx], marker=markers[idx], label=class_labels[cl])
57 | plt.xlabel('X in t-SNE')
58 | plt.ylabel('Y in t-SNE')
59 | plt.legend(loc='upper left')
60 | plt.title('t-SNE visualization of test data')
61 | plt.show()
62 | 
63 | 
64 | 
65 | 
66 | 
67 | 


--------------------------------------------------------------------------------
/output_tSNE_visualization.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/llSourcell/visualize_dataset_demo/7f5a1d1200e554d6b6ed34ccd166d362df3adb6f/output_tSNE_visualization.jpeg


--------------------------------------------------------------------------------