├── README.md
├── data_visualization.py
└── output_tSNE_visualization.jpeg

/README.md:
--------------------------------------------------------------------------------
# How to Best Visualize a Dataset Easily

# Overview

This is the code for [this](https://youtu.be/yQsOFWqpjkE) video by Siraj Raval on YouTube. The human activities dataset contains 5 classes (sitting-down, standing-up, standing, walking, and sitting) collected over 8 hours of activities of 4 healthy subjects. The dataset is downloaded from [here](http://groupware.les.inf.puc-rio.br/har#ixzz4Mt0Teae2). This code downloads the dataset, cleans it, creates feature vectors, then uses [t-SNE](https://lvdmaaten.github.io/tsne/) to reduce the dimensionality of the feature vectors to just 2. Then, we use matplotlib to visualize the data.

## Dependencies

* pandas (http://pandas.pydata.org/)
* numpy (http://www.numpy.org/)
* scikit-learn (http://scikit-learn.org/)
* matplotlib (http://matplotlib.org/)

Install dependencies via '[pip](https://pypi.python.org/pypi/pip) install' (e.g. `pip install pandas`).

**Note:** an updated dataset is here if the other link is broken:
http://rstudio-pubs-static.s3.amazonaws.com/19668_2a08e88c36ab4b47876a589bb1d61c37.html

## Usage

To run this code, just run the following in a terminal:

`python data_visualization.py`

## Challenge

The challenge for this video is to visualize [this](https://www.kaggle.com/mylesoneill/game-of-thrones) Game of Thrones dataset. Use t-SNE to lower the dimensionality of the data and plot it using matplotlib. In your README, write 2-3 sentences about something you discovered in the data after visualizing it. This will be great practice in understanding why dimensionality reduction is so important and in analyzing data visually.

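
A minimal sketch of one way to start on the challenge, under stated assumptions: the table mixes numeric and string columns, string columns are one-hot encoded, and columns with missing values are dropped (as the script in this repo does). `embed_2d` is a hypothetical helper, not part of this repo, and the toy table is synthetic stand-in data, not the Game of Thrones dataset.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def embed_2d(frame, perplexity=10, seed=0):
    """Hypothetical helper: clean, one-hot encode, scale, and project to 2-D."""
    cleaned = frame.dropna(axis=1)        # drop columns that contain missing values
    features = pd.get_dummies(cleaned)    # one-hot encode string columns
    scaled = StandardScaler().fit_transform(features.astype(float))
    # perplexity must be smaller than the number of rows
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(scaled)


# Synthetic stand-in data (assumption: replace with the real CSV columns).
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "age": rng.randint(10, 80, size=60),
    "house": rng.choice(["A", "B", "C"], size=60),
    "notes": [np.nan] * 60,               # all-missing column, dropped by embed_2d
})
points = embed_2d(toy)
print(points.shape)
```

The two columns of `points` can then be passed to `plt.scatter` and colored by whichever column you want to inspect, in the same way `data_visualization.py` colors points by class label.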

## Due Date is December 29th 2016

## Credits

The credits for this code go to [Yifeng-He](https://github.com/Yifeng-He). I've merely created a wrapper around the code to make it easy for people to get started.

--------------------------------------------------------------------------------
/data_visualization.py:
--------------------------------------------------------------------------------

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# visualize the important characteristics of the dataset
import matplotlib.pyplot as plt

# step 1: download the data
dataframe_all = pd.read_csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
num_rows = dataframe_all.shape[0]

# step 2: remove useless data
# count the number of missing elements (NaN) in each column
counter_nan = dataframe_all.isnull().sum()
counter_without_nan = counter_nan[counter_nan == 0]
# keep only the columns without missing elements
dataframe_all = dataframe_all[counter_without_nan.keys()]
# remove the first 7 columns, which contain no discriminative information
# (.ix was removed in pandas 1.0; .iloc is the positional indexer)
dataframe_all = dataframe_all.iloc[:, 7:]
# the list of columns (the last column is the class label)
columns = dataframe_all.columns
print(columns)

# step 3: get features (x) and scale the features
# get x and convert it to a numpy array
x = dataframe_all.iloc[:, :-1].values
standard_scaler = StandardScaler()
x_std = standard_scaler.fit_transform(x)

# step 4: get class labels y and then encode them as integers
# get class label data
y = dataframe_all.iloc[:, -1].values
# encode the class label
class_labels = np.unique(y)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# step 5: split the data into training set and test set
test_percentage = 0.1
x_train, x_test, y_train, y_test = train_test_split(x_std, y, test_size=test_percentage, random_state=0)

# t-distributed Stochastic Neighbor Embedding (t-SNE) visualization
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
x_test_2d = tsne.fit_transform(x_test)

# scatter plot the sample points among 5 classes
markers = ('s', 'd', 'o', '^', 'v')
color_map = {0: 'red', 1: 'blue', 2: 'lightgreen', 3: 'purple', 4: 'cyan'}
plt.figure()
for idx, cl in enumerate(np.unique(y_test)):
    plt.scatter(x=x_test_2d[y_test == cl, 0], y=x_test_2d[y_test == cl, 1], c=color_map[idx], marker=markers[idx], label=cl)
plt.xlabel('X in t-SNE')
plt.ylabel('Y in t-SNE')
plt.legend(loc='upper left')
plt.title('t-SNE visualization of test data')
plt.show()

--------------------------------------------------------------------------------
/output_tSNE_visualization.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/llSourcell/visualize_dataset_demo/7f5a1d1200e554d6b6ed34ccd166d362df3adb6f/output_tSNE_visualization.jpeg
--------------------------------------------------------------------------------