├── 1-Analyze-Bay-Area-Bike-Share-Project ├── Bay_Area_Bike_Share_Analysis.ipynb └── Data.zip ├── 3-Intro-to-Data-Analysis ├── Data │ ├── Exports (p of GDP).xlsx │ ├── Indicator_HDI.xlsx │ ├── Oil Production.xlsx │ ├── README.md │ ├── indicator CDIAC carbon_dioxide_emissions_per_capita.xlsx │ ├── indicator gapminder gdp_per_capita_ppp.xlsx │ ├── indicator life_expectancy_at_birth.xlsx │ ├── indicator ti cpi 2009.xlsx │ ├── indicatorGNIpercapitaATLAS.xlsx │ ├── indicator_per capita government expenditure on health (ppp int. $).xlsx │ ├── indicator_per capita total expenditure on health (ppp int. $).xlsx │ ├── indicator_total population female.xlsx │ └── indicator_total population male.xlsx ├── Investigate-a-Dataset.ipynb └── L1_Starter_Code.ipynb ├── 4-Data-Wrangling ├── Audit_Zipcode.py ├── Auditing_Street_Names.py ├── Convert_to_CSV_files.py ├── Convert_to_SQL_Database.py ├── Data_Wrangling.ipynb ├── Improving_Street_Names.py ├── Number_of_Tags.py ├── README.md ├── Sample.py ├── Schema.py ├── Tags_Types.py ├── Update_Zipcode.py └── sample.osm ├── 5-Exploratory-Data-Analysis ├── Exploratory Data Analysis.html ├── Exploratory Data Analysis.rmd ├── README.md └── Resources.txt ├── 6-Inferential-Statistics ├── Stroop-Effect.ipynb └── stroopdata.csv ├── 7-Intro-to-Machine-Learning ├── Identify Fraud from Enron Email.ipynb ├── README.md ├── my_classifier.pkl ├── my_dataset.pkl ├── my_feature_list.pkl ├── poi_id.py ├── resources.MD └── tester.py ├── 8-Data-Visualization-in-Tableau ├── Create-a-Tableau-Story.ipynb ├── Flight Summary.zip └── README.md └── README.md /1-Analyze-Bay-Area-Bike-Share-Project/Data.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/1-Analyze-Bay-Area-Bike-Share-Project/Data.zip -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/Exports (p of GDP).xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/Exports (p of GDP).xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/Indicator_HDI.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/Indicator_HDI.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/Oil Production.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/Oil Production.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator CDIAC carbon_dioxide_emissions_per_capita.xlsx: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator CDIAC carbon_dioxide_emissions_per_capita.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator gapminder gdp_per_capita_ppp.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator gapminder gdp_per_capita_ppp.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator life_expectancy_at_birth.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator life_expectancy_at_birth.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator ti cpi 2009.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator ti cpi 2009.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicatorGNIpercapitaATLAS.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicatorGNIpercapitaATLAS.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator_per capita government expenditure on health (ppp int. $).xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator_per capita government expenditure on health (ppp int. $).xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator_per capita total expenditure on health (ppp int. $).xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator_per capita total expenditure on health (ppp int. 
$).xlsx
--------------------------------------------------------------------------------
/3-Intro-to-Data-Analysis/Data/indicator_total population female.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator_total population female.xlsx
--------------------------------------------------------------------------------
/3-Intro-to-Data-Analysis/Data/indicator_total population male.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator_total population male.xlsx
--------------------------------------------------------------------------------
/4-Data-Wrangling/Audit_Zipcode.py:
--------------------------------------------------------------------------------
1 | import xml.etree.cElementTree as ET
2 | import pprint
3 | from collections import defaultdict
4 | import re
5 | 
6 | '''
7 | This code checks whether zipcodes begin with '94', '95' or something else
8 | '''
9 | OSMFILE = "san-jose_california.osm"
10 | zip_type_re = re.compile(r'\d{5}$')
11 | 
12 | def audit_ziptype(zip_types, zipcode):
13 |     # Group zipcodes by their first two characters so that prefixes other
14 |     # than '94' or '95' stand out in the audit output
15 |     zip_types[zipcode[0:2]].add(zipcode)
16 | 
17 | def is_zipcode(elem):
18 |     return (elem.attrib['k'] == "addr:postcode")
19 | 
20 | def audit_zip(osmfile):
21 |     osm_file = open(osmfile, "r")
22 |     zip_types = defaultdict(set)
23 |     for event, elem in ET.iterparse(osm_file, events=("start",)):
24 |         if elem.tag == "node" or elem.tag == "way":
25 |             for tag in elem.iter("tag"):
26 |                 if is_zipcode(tag):
27 |                     audit_ziptype(zip_types, tag.attrib['v'])
28 |     osm_file.close()
29 |     return zip_types
30 | 
31 | zip_print = audit_zip(OSMFILE)
32 | 
33 | def test():
34 |     pprint.pprint(dict(zip_print))
35 | 
36 | if __name__ == '__main__':
37 |     test()
38 | 
--------------------------------------------------------------------------------
/4-Data-Wrangling/Auditing_Street_Names.py:
--------------------------------------------------------------------------------
1 | import xml.etree.cElementTree as ET
2 | import pprint
3 | from collections import defaultdict
4 | import re
5 | 
6 | '''
7 | The code below lists out all the street types not in the expected list.
8 | ''' 9 | OSMFILE = "san-jose_california.osm" 10 | street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE) 11 | 12 | expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 13 | "Trail", "Parkway", "Commons", "Circle", "Terrace", "Way"] 14 | 15 | 16 | def audit_street_type(street_types, street_name): 17 | m = street_type_re.search(street_name) 18 | if m: 19 | street_type = m.group() 20 | if street_type not in expected: 21 | street_types[street_type].add(street_name) 22 | 23 | 24 | def is_street_name(elem): 25 | return (elem.attrib['k'] == "addr:street") 26 | 27 | 28 | def audit(osmfile): 29 | osm_file = open(osmfile, "r") 30 | street_types = defaultdict(set) 31 | for event, elem in ET.iterparse(osm_file, events=("start",)): 32 | 33 | if elem.tag == "node" or elem.tag == "way": 34 | for tag in elem.iter("tag"): 35 | if is_street_name(tag): 36 | audit_street_type(street_types, tag.attrib['v']) 37 | osm_file.close() 38 | return street_types 39 | 40 | st_types = audit(OSMFILE) 41 | 42 | def test(): 43 | pprint.pprint(dict(st_types)) 44 | 45 | if __name__ == '__main__': 46 | test() -------------------------------------------------------------------------------- /4-Data-Wrangling/Convert_to_CSV_files.py: -------------------------------------------------------------------------------- 1 | ''' 2 | The code below is mostly derived from Udacity Lession 13: Case study: OpenStreetMap Data [SQL] 3 | https://classroom.udacity.com/nanodegrees/nd002/parts/860b269a-d0b0-4f0c-8f3d-ab08865d43bf/modules/316820862075461/lessons/5436095827/concepts/54908788190923 4 | ''' 5 | OSM_PATH = "san-jose_california.osm" 6 | OSMFILE = "san-jose_california.osm" 7 | street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE) 8 | 9 | expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 10 | "Trail", "Parkway", "Commons", "Circle", "Terrace", "Way"] 11 | 12 | mapping = { "St": "Street", 13 | "St.": "Street", 14 | "Rd.": "Road", 15 | "Ave": "Avenue", 16 | "Blvd": "Boulevard", 17 | "Dr": "Drive", 18 | "Rd": "Road" 19 | } 20 | 21 | NODES_PATH = "nodes.csv" 22 | NODE_TAGS_PATH = "nodes_tags.csv" 23 | WAYS_PATH = "ways.csv" 24 | WAY_NODES_PATH = "ways_nodes.csv" 25 | WAY_TAGS_PATH = "ways_tags.csv" 26 | 27 | LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+') 28 | PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. 
\t\r\n]') 29 | 30 | SCHEMA = schema 31 | 32 | # Make sure the fields order in the csvs matches the column order in the sql table schema 33 | NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp'] 34 | NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type'] 35 | WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp'] 36 | WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type'] 37 | WAY_NODES_FIELDS = ['id', 'node_id', 'position'] 38 | 39 | def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS, 40 | problem_chars=PROBLEMCHARS, default_tag_type='regular'): 41 | """Clean and shape node or way XML element to Python dict""" 42 | node_attribs = {} 43 | way_attribs = {} 44 | way_nodes = [] 45 | tags = [] # Handle secondary tags the same way for both node and way elements 46 | p=0 47 | 48 | if element.tag == 'node': 49 | for i in NODE_FIELDS: 50 | node_attribs[i] = element.attrib[i] 51 | for tag in element.iter("tag"): 52 | node_tags_attribs = {} 53 | temp = LOWER_COLON.search(tag.attrib['k']) 54 | is_p = PROBLEMCHARS.search(tag.attrib['k']) 55 | if is_p: 56 | continue 57 | elif temp: 58 | split_char = temp.group(1) 59 | split_index = tag.attrib['k'].index(split_char) 60 | type1 = temp.group(1) 61 | node_tags_attribs['id'] = element.attrib['id'] 62 | node_tags_attribs['key'] = tag.attrib['k'][split_index+2:] 63 | node_tags_attribs['value'] = tag.attrib['v'] 64 | node_tags_attribs['type'] = tag.attrib['k'][:split_index+1] 65 | if node_tags_attribs['type'] == "addr" and node_tags_attribs['key'] == "street": 66 | # update street name 67 | node_tags_attribs['value'] = update_name(tag.attrib['v'], mapping) 68 | #elif node_tags_attribs['type'] == "addr" and node_tags_attribs['key'] == "postcode": 69 | # # update post code 70 | # node_tags_attribs['value'] = update_zipcode(tag.attrib['v']) 71 | else: 72 | node_tags_attribs['id'] = element.attrib['id'] 73 | node_tags_attribs['key'] = tag.attrib['k'] 74 | node_tags_attribs['value'] = tag.attrib['v'] 75 | node_tags_attribs['type'] = 'regular' 76 | if node_tags_attribs['type'] == "addr" and node_tags_attribs['key'] == "street": 77 | # update street name 78 | node_tags_attribs['value'] = update_name(tag.attrib['v'], mapping) 79 | #elif node_tags_attribs['type'] == "addr" and node_tags_attribs['key'] == "postcode": 80 | # # update post code 81 | # node_tags_attribs['value'] = update_zipcode(tag.attrib['v']) 82 | tags.append(node_tags_attribs) 83 | return {'node': node_attribs, 'node_tags': tags} 84 | elif element.tag == 'way': 85 | id = element.attrib['id'] 86 | for i in WAY_FIELDS: 87 | way_attribs[i] = element.attrib[i] 88 | for i in element.iter('nd'): 89 | d = {} 90 | d['id'] = id 91 | d['node_id'] = i.attrib['ref'] 92 | d['position'] = p 93 | p+=1 94 | way_nodes.append(d) 95 | for c in element.iter('tag'): 96 | temp = LOWER_COLON.search(c.attrib['k']) 97 | is_p = PROBLEMCHARS.search(c.attrib['k']) 98 | e = {} 99 | if is_p: 100 | continue 101 | elif temp: 102 | split_char = temp.group(1) 103 | split_index = c.attrib['k'].index(split_char) 104 | e['id'] = id 105 | e['key'] = c.attrib['k'][split_index+2:] 106 | e['type'] = c.attrib['k'][:split_index+1] 107 | e['value'] = c.attrib['v'] 108 | if e['type'] == "addr" and e['key'] == "street": 109 | e['value'] = update_name(c.attrib['v'], mapping) 110 | #elif e['type'] == "addr" and e['key'] == "postcode": 111 | # e['value'] = update_zipcode(c.attrib['v']) 112 | else: 113 | e['id'] = id 114 | e['key'] = c.attrib['k'] 115 | e['type'] = 'regular' 116 | 
e['value'] = c.attrib['v'] 117 | if e['type'] == "addr" and e['key'] == "street": 118 | e['value'] = update_name(c.attrib['v'], mapping) 119 | #elif e['type'] == "addr" and e['key'] == "postcode": 120 | # e['value'] = update_zipcode(c.attrib['v']) 121 | tags.append(e) 122 | 123 | return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags} 124 | 125 | if element.tag == 'node': 126 | return {'node': node_attribs, 'node_tags': tags} 127 | elif element.tag == 'way': 128 | return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags} 129 | 130 | 131 | # ================================================== # 132 | # Helper Functions # 133 | # ================================================== # 134 | def get_element(osm_file, tags=('node', 'way', 'relation')): 135 | """Yield element if it is the right type of tag""" 136 | context = ET.iterparse(osm_file, events=('start', 'end')) 137 | _, root = next(context) 138 | for event, elem in context: 139 | if event == 'end' and elem.tag in tags: 140 | yield elem 141 | root.clear() 142 | 143 | 144 | def validate_element(element, validator, schema=SCHEMA): 145 | """Raise ValidationError if element does not match schema""" 146 | if validator.validate(element, schema) is not True: 147 | field, errors = next(validator.errors.iteritems()) 148 | message_string = "\nElement of type '{0}' has the following errors:\n{1}" 149 | error_string = pprint.pformat(errors) 150 | 151 | raise Exception(message_string.format(field, error_string)) 152 | 153 | 154 | class UnicodeDictWriter(csv.DictWriter, object): 155 | """Extend csv.DictWriter to handle Unicode input""" 156 | def writerow(self, row): 157 | super(UnicodeDictWriter, self).writerow({ 158 | k: (v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in row.iteritems() 159 | }) 160 | 161 | def writerows(self, rows): 162 | for row in rows: 163 | self.writerow(row) 164 | 165 | 166 | # ================================================== # 167 | # Main Function # 168 | # ================================================== # 169 | def process_map(file_in, validate): 170 | """Iteratively process each XML element and write to csv(s)""" 171 | with codecs.open(NODES_PATH, 'wb') as nodes_file, \ 172 | codecs.open(NODE_TAGS_PATH, 'wb') as nodes_tags_file, \ 173 | codecs.open(WAYS_PATH, 'wb') as ways_file, \ 174 | codecs.open(WAY_NODES_PATH, 'wb') as way_nodes_file, \ 175 | codecs.open(WAY_TAGS_PATH, 'wb') as way_tags_file: 176 | 177 | nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS) 178 | node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS) 179 | ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS) 180 | way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS) 181 | way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS) 182 | 183 | nodes_writer.writeheader() 184 | node_tags_writer.writeheader() 185 | ways_writer.writeheader() 186 | way_nodes_writer.writeheader() 187 | way_tags_writer.writeheader() 188 | 189 | validator = cerberus.Validator() 190 | 191 | for element in get_element(file_in, tags=('node', 'way')): 192 | el = shape_element(element) 193 | if el: 194 | if validate is True: 195 | validate_element(el, validator) 196 | 197 | if element.tag == 'node': 198 | nodes_writer.writerow(el['node']) 199 | node_tags_writer.writerows(el['node_tags']) 200 | elif element.tag == 'way': 201 | ways_writer.writerow(el['way']) 202 | way_nodes_writer.writerows(el['way_nodes']) 203 | way_tags_writer.writerows(el['way_tags']) 204 | 205 | 206 | if __name__ == 
'__main__': 207 | # Note: Validation is ~ 10X slower. For the project consider using a small 208 | # sample of the map when validating. 209 | process_map(OSM_PATH, validate=True) 210 | -------------------------------------------------------------------------------- /4-Data-Wrangling/Convert_to_SQL_Database.py: -------------------------------------------------------------------------------- 1 | # Creating the database and tables 2 | import sqlite3 3 | conn = sqlite3.connect('data_wrangling.sqlite') 4 | 5 | conn.text_factory = str 6 | cur = conn.cursor() 7 | 8 | #Make some fresh tables using executescript() 9 | cur.execute('''DROP TABLE IF EXISTS nodes''') 10 | cur.execute('''DROP TABLE IF EXISTS nodes_tags''') 11 | cur.execute('''DROP TABLE IF EXISTS ways''') 12 | cur.execute('''DROP TABLE IF EXISTS ways_tags''') 13 | cur.execute('''DROP TABLE IF EXISTS ways_nodes''') 14 | 15 | 16 | cur.execute('''CREATE TABLE nodes ( 17 | id INTEGER PRIMARY KEY NOT NULL, 18 | lat REAL, 19 | lon REAL, 20 | user TEXT, 21 | uid INTEGER, 22 | version INTEGER, 23 | changeset INTEGER, 24 | timestamp TEXT) 25 | ''') 26 | 27 | with open('nodes.csv','r') as nodes_table: # `with` statement available in 2.5+ 28 | # csv.DictReader uses first line in file for column headings by default 29 | dr = csv.DictReader(nodes_table) # comma is default delimiter 30 | to_db = [(i['id'], i['lat'], i['lon'], i['user'], i['uid'], i['version'], i['changeset'], i['timestamp']) for i in dr] 31 | 32 | cur.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?, ?, ?);", to_db) 33 | 34 | 35 | cur.execute('''CREATE TABLE nodes_tags ( 36 | id INTEGER, 37 | key TEXT, 38 | value TEXT, 39 | type TEXT, 40 | FOREIGN KEY (id) REFERENCES nodes(id)) 41 | ''') 42 | 43 | with open('nodes_tags.csv','r') as nodes_tags_table: # `with` statement available in 2.5+ 44 | # csv.DictReader uses first line in file for column headings by default 45 | dr = csv.DictReader(nodes_tags_table) # comma is default delimiter 46 | to_db = [(i['id'], i['key'], i['value'], i['type']) for i in dr] 47 | 48 | cur.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?, ?);", to_db) 49 | 50 | cur.execute('''CREATE TABLE ways ( 51 | id INTEGER PRIMARY KEY NOT NULL, 52 | user TEXT, 53 | uid INTEGER, 54 | version TEXT, 55 | changeset INTEGER, 56 | timestamp TEXT) 57 | ''') 58 | 59 | with open('ways.csv','r') as ways_table: # `with` statement available in 2.5+ 60 | # csv.DictReader uses first line in file for column headings by default 61 | dr = csv.DictReader(ways_table) # comma is default delimiter 62 | to_db = [(i['id'], i['user'], i['uid'], i['version'], i['changeset'], i['timestamp']) for i in dr] 63 | 64 | cur.executemany("INSERT INTO ways VALUES (?, ?, ?, ?, ?, ?);", to_db) 65 | 66 | cur.execute('''CREATE TABLE ways_tags ( 67 | id INTEGER NOT NULL, 68 | key TEXT NOT NULL, 69 | value TEXT NOT NULL, 70 | type TEXT, 71 | FOREIGN KEY (id) REFERENCES ways(id)) 72 | ''') 73 | 74 | with open('ways_tags.csv','r') as ways_tags_table: # `with` statement available in 2.5+ 75 | # csv.DictReader uses first line in file for column headings by default 76 | dr = csv.DictReader(ways_tags_table) # comma is default delimiter 77 | to_db = [(i['id'], i['key'], i['value'], i['type']) for i in dr] 78 | 79 | cur.executemany("INSERT INTO ways_tags VALUES (?, ?, ?, ?);", to_db) 80 | 81 | cur.execute('''CREATE TABLE ways_nodes ( 82 | id INTEGER NOT NULL, 83 | node_id INTEGER NOT NULL, 84 | position INTEGER NOT NULL, 85 | FOREIGN KEY (id) REFERENCES ways(id), 86 | FOREIGN KEY (node_id) REFERENCES 
nodes(id)) 87 | ''') 88 | 89 | with open('ways_nodes.csv','r') as ways_nodes_table: # `with` statement available in 2.5+ 90 | # csv.DictReader uses first line in file for column headings by default 91 | dr = csv.DictReader(ways_nodes_table) # comma is default delimiter 92 | to_db = [(i['id'], i['node_id'], i['position']) for i in dr] 93 | 94 | cur.executemany("INSERT INTO ways_nodes VALUES (?, ?, ?);", to_db) 95 | 96 | #Save changes 97 | conn.commit() 98 | -------------------------------------------------------------------------------- /4-Data-Wrangling/Improving_Street_Names.py: -------------------------------------------------------------------------------- 1 | import xml.etree.cElementTree as ET 2 | import pprint 3 | from collections import defaultdict 4 | import re 5 | 6 | ''' 7 | The code below updates the unexpected street types listed in the mapping list 8 | while keeping others unchanged. 9 | ''' 10 | mapping = { "St": "Street", 11 | "St.": "Street", 12 | "Rd.": "Road", 13 | "Ave": "Avenue", 14 | "Blvd": "Boulevard", 15 | "Dr": "Drive", 16 | "Rd": "Road" 17 | } 18 | 19 | def update_name(name, mapping): 20 | m = street_type_re.search(name) 21 | if m.group() not in expected: 22 | if m.group() in mapping.keys(): 23 | name = re.sub(m.group(), mapping[m.group()], name) 24 | return name 25 | 26 | def test(): 27 | for st_type, ways in st_types.iteritems(): 28 | for name in ways: 29 | better_name = update_name(name, mapping) 30 | print name, "=>", better_name 31 | 32 | if __name__ == '__main__': 33 | test() -------------------------------------------------------------------------------- /4-Data-Wrangling/Number_of_Tags.py: -------------------------------------------------------------------------------- 1 | import xml.etree.cElementTree as ET 2 | import pprint 3 | from collections import defaultdict 4 | import re 5 | 6 | ''' 7 | The code below is to find out how many types of tags are there and the number of each tag. 8 | ''' 9 | 10 | def count_tags(filename): 11 | tags = {} 12 | for event, element in ET.iterparse(filename): 13 | if element.tag not in tags.keys(): 14 | tags[element.tag] = 1 15 | else: 16 | tags[element.tag] += 1 17 | return tags 18 | 19 | def test(): 20 | 21 | tags = count_tags('san-jose_california.osm') 22 | pprint.pprint(tags) 23 | 24 | if __name__ == "__main__": 25 | test() -------------------------------------------------------------------------------- /4-Data-Wrangling/README.md: -------------------------------------------------------------------------------- 1 | 1. [Data Wrangling.ipynb](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Data_Wrangling.ipynb) - Jupyter Notebook containing the projects 2 | 3 | 2. 
[Number of Tags](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Number_of_Tags.py) - Identify the types of tags present and the occurrence of each tag
4 | 
5 | [Tags Type](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Tags_Types.py) - Categorize tags into lowercase, lowercase with colon, problematic characters and others
6 | 
7 | [Auditing Street Names](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Auditing_Street_Names.py) - Audit the street types not in the list of expected street types
8 | 
9 | [Improving Street Names](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Improving_Street_Names.py) - Improve the dataset by replacing over-abbreviated street types
10 | 
11 | [SQL Schema](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Schema.py) - Schema for the 5 csv files which will be used to create the SQL database
12 | 
13 | [Convert to CSV files](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Convert_to_CSV_files.py) - Code to convert the OSM data into 5 csv files
14 | 
15 | [Audit Zipcode](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Audit_Zipcode.py) - Audit the zipcodes to check whether they start with '94', '95' or something else
16 | 
17 | [Update Zipcode](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Update_Zipcode.py) - Code to update zipcodes (if a zipcode has 8/9 digits or includes the state code, only the first 5 digits are kept; otherwise it is left unchanged)
18 | 
19 | [Convert to SQL Database](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Convert_to_SQL_Database.py) - Code to convert the 5 csv files into a SQLite database
20 | 
21 | 3. [Mapzen](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Mapzen_SanJose.txt) - Mapzen provided the metro extract for the city of San Jose, but the website has been shut down since Feb 1st, 2018
22 | 
23 | 4. [Sample](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Sample.py) - Code to generate a sample of the dataset
24 | 
25 | [sample.osm](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/sample.osm) - Contains a sample of the osm file (~8MB)
26 | 
27 | 5.
[Resources](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Resources.txt) - Contains a list of websites and Udacity lessons used to complete the project 28 | -------------------------------------------------------------------------------- /4-Data-Wrangling/Sample.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import xml.etree.ElementTree as ET # Use cElementTree or lxml if too slow 5 | 6 | OSM_FILE = "san-jose_california.osm" # Replace this with your osm file 7 | SAMPLE_FILE = "sample.osm" 8 | 9 | k = 50 # Parameter: take every k-th top level element 10 | 11 | def get_element(osm_file, tags=('node', 'way', 'relation')): 12 | """Yield element if it is the right type of tag 13 | 14 | Reference: 15 | http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python 16 | """ 17 | context = iter(ET.iterparse(osm_file, events=('start', 'end'))) 18 | _, root = next(context) 19 | for event, elem in context: 20 | if event == 'end' and elem.tag in tags: 21 | yield elem 22 | root.clear() 23 | 24 | 25 | with open(SAMPLE_FILE, 'wb') as output: 26 | output.write('\n') 27 | output.write('\n ') 28 | 29 | # Write every kth top level element 30 | for i, element in enumerate(get_element(OSM_FILE)): 31 | if i % k == 0: 32 | output.write(ET.tostring(element, encoding='utf-8')) 33 | 34 | output.write('') 35 | -------------------------------------------------------------------------------- /4-Data-Wrangling/Schema.py: -------------------------------------------------------------------------------- 1 | import xml.etree.cElementTree as ET 2 | import pprint 3 | from collections import defaultdict 4 | import re 5 | 6 | import csv 7 | import codecs 8 | import cerberus 9 | import schema 10 | 11 | ''' 12 | The schema below for the 5 csv files which will be used to construct a SQL databases 13 | ''' 14 | 15 | schema = { 16 | 'node': { 17 | 'type': 'dict', 18 | 'schema': { 19 | 'id': {'required': True, 'type': 'integer', 'coerce': int}, 20 | 'lat': {'required': True, 'type': 'float', 'coerce': float}, 21 | 'lon': {'required': True, 'type': 'float', 'coerce': float}, 22 | 'user': {'required': True, 'type': 'string'}, 23 | 'uid': {'required': True, 'type': 'integer', 'coerce': int}, 24 | 'version': {'required': True, 'type': 'string'}, 25 | 'changeset': {'required': True, 'type': 'integer', 'coerce': int}, 26 | 'timestamp': {'required': True, 'type': 'string'} 27 | } 28 | }, 29 | 'node_tags': { 30 | 'type': 'list', 31 | 'schema': { 32 | 'type': 'dict', 33 | 'schema': { 34 | 'id': {'required': True, 'type': 'integer', 'coerce': int}, 35 | 'key': {'required': True, 'type': 'string'}, 36 | 'value': {'required': True, 'type': 'string'}, 37 | 'type': {'required': True, 'type': 'string'} 38 | } 39 | } 40 | }, 41 | 'way': { 42 | 'type': 'dict', 43 | 'schema': { 44 | 'id': {'required': True, 'type': 'integer', 'coerce': int}, 45 | 'user': {'required': True, 'type': 'string'}, 46 | 'uid': {'required': True, 'type': 'integer', 'coerce': int}, 47 | 'version': {'required': True, 'type': 'string'}, 48 | 'changeset': {'required': True, 'type': 'integer', 'coerce': int}, 49 | 'timestamp': {'required': True, 'type': 'string'} 50 | } 51 | }, 52 | 'way_nodes': { 53 | 'type': 'list', 54 | 'schema': { 55 | 'type': 'dict', 56 | 'schema': { 57 | 'id': {'required': True, 'type': 'integer', 'coerce': int}, 58 | 'node_id': {'required': True, 'type': 'integer', 
'coerce': int}, 59 | 'position': {'required': True, 'type': 'integer', 'coerce': int} 60 | } 61 | } 62 | }, 63 | 'way_tags': { 64 | 'type': 'list', 65 | 'schema': { 66 | 'type': 'dict', 67 | 'schema': { 68 | 'id': {'required': True, 'type': 'integer', 'coerce': int}, 69 | 'key': {'required': True, 'type': 'string'}, 70 | 'value': {'required': True, 'type': 'string'}, 71 | 'type': {'required': True, 'type': 'string'} 72 | } 73 | } 74 | } 75 | } 76 | -------------------------------------------------------------------------------- /4-Data-Wrangling/Tags_Types.py: -------------------------------------------------------------------------------- 1 | import xml.etree.cElementTree as ET 2 | import pprint 3 | from collections import defaultdict 4 | import re 5 | 6 | ''' 7 | The code below allows you to check the k value for each tag. 8 | By classifying the tagss into few categories: 9 | 1. "lower": valid tags containing only lowercase letters 10 | 2. "lower_colon": valid tags with a colon in the names 11 | 3. "problemchars": tags with problematic characters 12 | 4. "other": other tags that don't fall into the 3 categories above 13 | ''' 14 | lower = re.compile(r'^([a-z]|_)*$') 15 | lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$') 16 | problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]') 17 | 18 | def key_type(element, keys): 19 | if element.tag == "tag": 20 | k = element.attrib['k'] 21 | if re.search(lower,k): 22 | keys["lower"] += 1 23 | elif re.search(lower_colon,k): 24 | keys["lower_colon"] += 1 25 | elif re.search(problemchars,k): 26 | keys["problemchars"] += 1 27 | else: 28 | keys["other"] += 1 29 | return keys 30 | 31 | def process_map(filename): 32 | keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0} 33 | for _, element in ET.iterparse(filename): 34 | keys = key_type(element, keys) 35 | return keys 36 | 37 | def test(): 38 | keys = process_map('san-jose_california.osm') 39 | pprint.pprint(keys) 40 | 41 | if __name__ == "__main__": 42 | test() -------------------------------------------------------------------------------- /4-Data-Wrangling/Update_Zipcode.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code will update non 5-digit zipcode. 3 | If it is 8/9-digit, only the first 5 digits are kept. 4 | If it has the state name in front, only the 5 digits are kept. 5 | If it is something else, will not change anything as it might result in error when validating the csv file. 
6 | ''' 7 | def update_zipcode(zipcode): 8 | """Clean postcode to a uniform format of 5 digit; Return updated postcode""" 9 | if re.findall(r'^\d{5}$', zipcode): # 5 digits 02118 10 | valid_zipcode = zipcode 11 | return valid_zipcode 12 | elif re.findall(r'(^\d{5})-\d{3}$', zipcode): # 8 digits 02118-029 13 | valid_zipcode = re.findall(r'(^\d{5})-\d{3}$', zipcode)[0] 14 | return valid_zipcode 15 | elif re.findall(r'(^\d{5})-\d{4}$', zipcode): # 9 digits 02118-0239 16 | valid_zipcode = re.findall(r'(^\d{5})-\d{4}$', zipcode)[0] 17 | return valid_zipcode 18 | elif re.findall(r'CA\s*\d{5}', zipcode): # with state code CA 02118 19 | valid_zipcode =re.findall(r'\d{5}', zipcode)[0] 20 | return valid_zipcode 21 | else: #return default zipcode to avoid overwriting 22 | return zipcode 23 | 24 | def test_zip(): 25 | for zips, ways in zip_print.iteritems(): 26 | for name in ways: 27 | better_name = update_zipcode(name) 28 | print name, "=>", better_name 29 | 30 | if __name__ == '__main__': 31 | test_zip() 32 | -------------------------------------------------------------------------------- /5-Exploratory-Data-Analysis/Exploratory Data Analysis.rmd: -------------------------------------------------------------------------------- 1 | Loan Data from Prosper by Kai Sheng TEH 2 | ======================================================== 3 | 4 | ```{r echo=FALSE, message=FALSE, warning=FALSE, packages} 5 | # Packages that were used are loaded 6 | library(ggplot2) 7 | library(plyr) 8 | library(dplyr) 9 | library(reshape) 10 | library(reshape2) 11 | library(knitr) 12 | library(ggthemes) 13 | library(gridExtra) 14 | library(tidyr) 15 | library(GGally) 16 | library(progress) 17 | library(scales) 18 | ``` 19 | 20 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Load_the_Data} 21 | # Load the Data 22 | prosper <- read.csv('prosperLoanData.csv', na.strings=c("", "NA")) 23 | ``` 24 | 25 | Prosper is a San Francisco-based peer-to-peer lending company established in 2005. Prosper generates revenue by charging borrowers an origination fees of 1% to 5% to verify identities and assess credibility of borrowers. It also charges investors a 1% annual servicing fee. The dataset provided to Udacity was last updated in 11th March, 2014. 26 | 27 | # Univariate Plots Section 28 | 29 | To kickstart the analysis, analyses below are carried out: 30 | 31 | ```{r echo=FALSE, message=FALSE, warning=FALSE, List_of_Variables} 32 | # List all the variable 33 | names(prosper) 34 | ``` 35 | 36 | There are a total of 81 columns in the dataset. Excluding listing-related identifiers, there should be around 70 variables. 37 | 38 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Summaries} 39 | # Summaries of each variable 40 | summary(prosper) 41 | ``` 42 | 43 | Analyses included in the summaries of variables above: 44 | 45 | a) range for continuous variables 46 | 47 | b) top 5 items in discrete variables 48 | 49 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Internal_Structure} 50 | # Internal Structure of each variable 51 | str(prosper) 52 | ``` 53 | 54 | Internal structures of the 81 variables are as above. 
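As a quick supplementary check on completeness (a minimal sketch added here; it only assumes the `prosper` data frame loaded above, and the chunk label is new), the share of missing values per variable can be tabulated before plotting:

```{r echo=FALSE, message=FALSE, warning=FALSE, Missing_Value_Share}
# Proportion of missing values per variable; the ten least complete columns shown
na_share <- sort(colMeans(is.na(prosper)), decreasing = TRUE)
head(round(na_share, 3), 10)
```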
55 | 
56 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanbyYearQuarter}
57 | # New factor levels created for Loan Origination Quarter
58 | LoanOriginationQuarter2 = unique(prosper$LoanOriginationQuarter)
59 | LoanOriginationQuarter2 = 
60 |   LoanOriginationQuarter2[order(substring(LoanOriginationQuarter2,4,7),
61 |                                 substring(LoanOriginationQuarter2,1,2))]
62 | 
63 | # Histogram and summary of Loan Origination Quarter
64 | ggplot(data = prosper, aes(x = LoanOriginationQuarter))+
65 |   geom_histogram(stat ='count')+
66 |   scale_x_discrete(limits=LoanOriginationQuarter2)+
67 |   theme(axis.text.x = element_text(angle = 60, vjust = 0.6))
68 | summary(prosper$LoanOriginationQuarter)
69 | ```
70 | 
71 | There is an increasing trend from end-2005 till 2014, except for the period from end-2008 till early 2009. The drop in loans being approved could be due to the Global Financial Crisis. There is also a dip at the end of 2012, which could be caused by the European sovereign debt crisis.
72 | 
73 | ```{r echo=FALSE, message=FALSE, warning=FALSE, ProsperRating(Alpha)}
74 | # Prosper Rating (Alpha) levels are rearranged
75 | prosper$ProsperRating..Alpha. <- factor(prosper$ProsperRating..Alpha.,
76 |                                         levels = c('AA','A','B','C',
77 |                                                    'D','E','HR','NA'))
78 | 
79 | # Histogram and summary of Prosper Rating (Alpha)
80 | ggplot(data = prosper, aes(x = ProsperRating..Alpha.))+
81 |   geom_histogram(stat = 'count')
82 | summary(prosper$ProsperRating..Alpha.)
83 | ```
84 | 
85 | The majority of borrowers are not classified. Among those rated, 'C' is the most common rating. 'AA' is the highest rating and relatively few borrowers qualify for it. Excluding the non-classified, the plot shows a roughly normal distribution.
86 | 
87 | ```{r echo=FALSE, message=FALSE, warning=FALSE, ProsperScore}
88 | # Prosper Score levels are rearranged
89 | ProsperScore2 <- factor(prosper$ProsperScore,
90 |                         levels = c('1','2','3','4','5','6','7',
91 |                                    '8','9','10','11','NA'))
92 | 
93 | # Histogram and summary of Prosper Score
94 | ggplot(data = prosper, aes(x = ProsperScore2))+
95 |   geom_histogram(stat = 'count')
96 | summary(prosper$ProsperScore)
97 | ```
98 | 
99 | The majority of loan applicants are not rated. Among those rated, most have a score between 4 and 8.
100 | 
101 | ```{r echo=FALSE, message=FALSE, warning=FALSE, IncomeRange}
102 | # Income Range levels are rearranged
103 | prosper$IncomeRange = factor(prosper$IncomeRange,
104 |                              levels = c("Not employed", "$0", "$1-24,999",
105 |                                         "$25,000-49,999", "$50,000-74,999",
106 |                                         "$75,000-99,999", "$100,000+",
107 |                                         "Not displayed"))
108 | 
109 | # Histogram and summary of Income Range
110 | ggplot(data = prosper, aes(x = IncomeRange))+
111 |   geom_histogram(stat = 'count')+
112 |   theme(axis.text.x = element_text(angle = 30, vjust = 0.6))
113 | summary(prosper$IncomeRange)
114 | ```
115 | 
116 | The median household income in the USA was $53,657 in 2014 (U.S. Census Bureau), and most of the borrowers are from the middle or lower-middle class.
117 | 
118 | There are fewer borrowers earning more than $75,000, as they usually have savings to cover their needs. It is worth noting that comparatively few loans are approved for those earning less than $25,000, as they are deemed too risky to lend money to.
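To put rough numbers behind these observations (a small sketch reusing the reordered IncomeRange factor from the chunk above; the chunk label is new), the share of listings per income bracket can be tabulated:

```{r echo=FALSE, message=FALSE, warning=FALSE, Income_Range_Share}
# Percentage of listings in each income bracket
round(100 * prop.table(table(prosper$IncomeRange)), 1)
```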
119 | 120 | ```{r echo=FALSE, message=FALSE, warning=FALSE, DebtToIncomeRatio} 121 | # Histogram of Debt to Income Ratio 122 | p1 <- ggplot(data = prosper, aes(x = DebtToIncomeRatio))+ 123 | geom_histogram(binwidth = 0.02) 124 | 125 | # Histogram of Debt to Income Ratio (top 1% data removed) 126 | p2 <- ggplot(data = prosper, aes(x = DebtToIncomeRatio))+ 127 | geom_histogram(binwidth = 0.02)+ 128 | xlim(0, quantile(prosper$DebtToIncomeRatio, prob = 0.99, na.rm = TRUE)) 129 | 130 | # 2 histogram charts arranged side by side and summary 131 | grid.arrange(p1, p2, ncol = 2) 132 | summary(prosper$DebtToIncomeRatio) 133 | ``` 134 | 135 | The debt-to-income ratio histogram on the left has a long tail where there are few people with a ratio of 10, which indicates them as risky borrowers as their income is too low to service their debt. By removing the top 1% outliers, we can see that most borrowers have a ratio of around 0.2. 136 | 137 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanOriginalAmount} 138 | # Histogram of Loan Original Amount 139 | p3 <- ggplot(data = prosper, aes(x = LoanOriginalAmount))+ 140 | geom_histogram(binwidth = 1000)+ 141 | scale_x_continuous(breaks = seq(0, 40000, 5000))+ 142 | theme(axis.text.x = element_text(angle = 30)) 143 | 144 | # Histogram of Loan Original Amount (Top 1% outliers removed) 145 | p4 <- ggplot(data = prosper, aes(x = LoanOriginalAmount))+ 146 | geom_histogram(binwidth = 1000)+ 147 | scale_x_continuous(breaks = seq(0, 25000, 5000))+ 148 | xlim(0, quantile(prosper$LoanOriginalAmount, prob = 0.99, na.rm = TRUE))+ 149 | theme(axis.text.x = element_text(angle = 30)) 150 | 151 | # 2 histogram charts arranged side by side and summary 152 | grid.arrange(p3, p4, ncol = 2) 153 | summary(prosper$LoanOriginalAmount) 154 | ``` 155 | 156 | We can see that most of the loan amount are around $5,000. 157 | 158 | There are occasional spikes in $5k, $10k, $15k, $20k and even up $35k which are explainable by the fact that they are multiples of 5,000 where most people tend to use when deciding the amount to borrow. 159 | 160 | ```{r echo=FALSE, message=FALSE, warning=FALSE, EmploymentStatus} 161 | # Employment Status are rearranged 162 | prosper$EmploymentStatus = factor(prosper$EmploymentStatus, 163 | levels = c("Employed","Full-time", 164 | "Part-time","Self-employed", 165 | "Retied","Not employed", 166 | "Other","Not available","NA")) 167 | 168 | # Histogram and summary of Employment Status 169 | ggplot(data = prosper, aes(x = EmploymentStatus))+ 170 | geom_histogram(stat = 'count')+ 171 | theme(axis.text.x = element_text(angle = 30, vjust = 0.6)) 172 | summary(prosper$EmploymentStatus) 173 | ``` 174 | 175 | Most of the borrowers are employed, be it full-time, part-time, self-employed or non-specified. This makes sense as loan applicants need to demonstrate that they have stable income to pay back the loan. 176 | 177 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanTerms(Month)} 178 | # Histogram and summary of Loan Terms in months 179 | Term2 = factor(prosper$Term, levels = c("12", "36", "60")) 180 | ggplot(data = prosper, aes(x = Term2))+ 181 | geom_histogram(stat = 'count') 182 | summary(prosper$Term) 183 | ``` 184 | 185 | Majority of the borrowers have a loan period of 36 months or 3 years. 
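As a quick check on this (a small sketch reusing the Term2 factor defined in the chunk above), the share of loans per term can be computed directly:

```{r echo=FALSE, message=FALSE, warning=FALSE, Term_Share}
# Percentage of loans by term length in months
round(100 * prop.table(table(Term2)), 1)
```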
186 | 187 | ```{r echo=FALSE, message=FALSE, warning=FALSE, MonthlyLoanPayment} 188 | # Histogram of Monthly Loan Payment 189 | p5 <- ggplot(data = prosper, aes(x = MonthlyLoanPayment))+ 190 | geom_histogram(binwidth = 50) 191 | 192 | # Histogram of Monthly Loan Payment (top 1% data removed) 193 | p6 <- ggplot(data = prosper, aes(x = MonthlyLoanPayment))+ 194 | geom_histogram(binwidth = 50)+ 195 | xlim(0, quantile(prosper$MonthlyLoanPayment, prob = 0.99, na.rm = TRUE))+ 196 | theme(axis.text.x = element_text(angle = 30)) 197 | 198 | # 2 histogram charts arranged side by side and summary 199 | grid.arrange(p5, p6, ncol = 2) 200 | summary(prosper$MonthlyLoanPayment) 201 | 202 | # Mode of Monthly Loan Payment 203 | names(sort(-table(prosper$MonthlyLoanPayment)))[1] 204 | ``` 205 | 206 | Majority of the monthly loan payment are less than $250. 207 | 208 | $174 is the most common amount of monthly installment and only few borrowers have an installment of exceeding $1,000. 209 | 210 | ```{r echo=FALSE, message=FALSE, warning=FALSE, CreditScoreRange} 211 | # Trend charts of lower (orange) and upper (blue) range of credit score 212 | ggplot(data = prosper)+ 213 | geom_line(stat = 'count', 214 | aes(x = CreditScoreRangeLower, color = "CreditScoreRangeLower"))+ 215 | geom_line(stat = 'count', 216 | aes(x = CreditScoreRangeUpper, color = "CreditScoreRangeUpper"))+ 217 | scale_color_manual("Credit Score Range", 218 | values = c('CreditScoreRangeLower' = 'Orange', 219 | 'CreditScoreRangeUpper' = 'Blue'))+ 220 | scale_x_continuous(breaks = seq(0, 900, 100)) 221 | 222 | # Summary of Credit Score Range (lower) and (upper) 223 | summary(prosper$CreditScoreRangeLower) 224 | summary(prosper$CreditScoreRangeUpper) 225 | ``` 226 | 227 | Line charts instead of bar charts are chosen to better reflect the range of score overlaid on top of each other. 228 | 229 | The credit score range for most borrowers are between 650 to 750 and the gap between upper and lower range is around 20 points for most borrowers. 230 | 231 | ```{r echo=FALSE, message=FALSE, warning=FALSE, IsBorrowerHomeowner} 232 | # Histogram and summary on whether borrower is a homeowner 233 | ggplot(data = prosper, aes(x = IsBorrowerHomeowner))+ 234 | geom_histogram(stat = 'count') 235 | summary(prosper$IsBorrowerHomeowner) 236 | ``` 237 | 238 | Homeownership is roughly equally split between True and False for borrowers. 239 | 240 | From this, it can be deduced that homeownership might not be the top factors in deciding whether to extend the loans to borrowers. 241 | 242 | ```{r echo=FALSE, message=FALSE, warning=FALSE, BorrowerRate_LenderYield} 243 | # Histogram of Borrower Rate 244 | p7 <- ggplot(data = prosper, aes(x = BorrowerRate))+ 245 | geom_histogram(binwidth = 0.01)+ 246 | scale_x_continuous(breaks = seq(-0.1, 0.5, 0.05))+ 247 | theme(axis.text.x = element_text(angle = 90)) 248 | 249 | # Histogram of Lender Yield 250 | p8 <- ggplot(data = prosper, aes(x = LenderYield))+ 251 | geom_histogram(binwidth = 0.01)+ 252 | scale_x_continuous(breaks = seq(-0.05, 0.5, 0.05))+ 253 | theme(axis.text.x = element_text(angle = 90)) 254 | 255 | # 2 histogram charts arranged side by side and summaries 256 | grid.arrange(p7, p8, ncol = 2) 257 | summary(prosper$BorrowerRate) 258 | summary(prosper$LenderYield) 259 | ``` 260 | 261 | The histograms show a bimodal distribution. Majority of the borrower rates and lender yield are between 0.1 and 0.2. 
The peak at above 0.3 could be possibly explained by the more common rate given to borrowers with less stellar creditworthiness. 262 | 263 | When compared to the borrower rate, lender yield shows a similar trend with the x-axis shifted slightly to the left by 0.01. This could be explained by the fact that Prosper probably charges a 1% fees as its revenue. 264 | 265 | ```{r echo=FALSE, message=FALSE, warning=FALSE, BorrowerState} 266 | # Histogram and summary of State origin of Borrower 267 | ggplot(data = prosper, aes(x = BorrowerState))+ 268 | geom_histogram(stat = 'count')+ 269 | theme(axis.text.y = element_text(size = 5))+ 270 | scale_y_continuous(breaks = seq(0, 15000, 2500))+ 271 | coord_flip() 272 | summary(prosper$BorrowerState) 273 | ``` 274 | 275 | California by far has the most borrowers at slightly less than 15,000, followed by Georgia, Florida, Illinois, New York and Texas which has between 5,000 and 7,000 borrowers each. 276 | 277 | The high number of borrowers from these states doesn't come as surprise as they are among the states with the most population. However, the much higher number of borrowers from California is not proportional to its population when compared to Texas. One hypothesis is that it enjoys higher awareness among Californians as an alternative to bank loans could be the reasons due to its location in California. 278 | 279 | # Univariate Analysis 280 | 281 | ### What is the structure of your dataset? 282 | The dataset contains 81 variables with 113937 observations from year 2005 to 2014. 283 | 284 | ### What is/are the main feature(s) of interest in your dataset? 285 | The typical characteristics of the borrowers are of interest for this dataset. Various plots are created to observe and identify the trend of each variable. 286 | 287 | ### What other features in the dataset do you think will help support your investigation into your feature(s) of interest? 288 | Income range, Debt-to-income ratio are few of the variables that will help to explain why the loans were approved and what are the yield/rate for the loans. 289 | 290 | ### Did you create any new variables from existing variables in the dataset? 291 | No, but I rearranged the factors such Prosper's rating (Alpha), Prosper's score, income range, employmenet status and loan term (months) so that the charts can be understood more easily. I also created new factors value for loan origination quarter to facilitate the ordering by year and quarter later on. 292 | 293 | ### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this? 294 | Most features do not have any unusual distributions and if they do, they are explainable by some other factors. The only one that I am interested in is spike at 0.3 in the borrower rate and lender yield. My expectation was that the graph is skewed towards lower rate to favor borrower with better credit history for risk management. 295 | 296 | 297 | # Bivariate Plots Section 298 | 299 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanQuarter_Term_Count} 300 | # Relationship between Loan Origination Quarter and Loan Term (month) 301 | ggplot(data = prosper, aes(x = LoanOriginationQuarter, fill = Term2))+ 302 | geom_histogram(stat ='count')+ 303 | scale_x_discrete(limits=LoanOriginationQuarter2)+ 304 | theme(axis.text.x = element_text(angle = 60, vjust = 0.6)) 305 | ``` 306 | 307 | Initially, only 36-month term loan were given. 
12-month and 60-month term loan were introduced in Q4 2010 but only 60-month term loan took off. 12-month term loan is believed to be discontinued in end 2012. 308 | 309 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Score_RateYield} 310 | # Relationship between Prosper Rating and Borrower Rate in boxplot 311 | p9 <- ggplot(data = prosper, aes(x = ProsperScore2, y = BorrowerRate))+ 312 | geom_boxplot() 313 | 314 | # Relationship between Prosper Rating and Lender Yield in boxplot 315 | p10 <- ggplot(data = prosper, aes(x = ProsperScore2, y = LenderYield))+ 316 | geom_boxplot() 317 | 318 | # Arrange both charts side by side 319 | grid.arrange(p9, p10, ncol = 2) 320 | 321 | # Correlation of Prosper Score with Borrower Rate and Lender Yield 322 | with(prosper, cor.test(ProsperScore, BorrowerRate, method = 'pearson')) 323 | with(prosper, cor.test(ProsperScore, LenderYield, method = 'pearson')) 324 | ``` 325 | 326 | The boxplots above shows that Borrower Rate and Lender Yield decrease with improved Prosper's score. Applicants with better rating pose less risk and thus have lower chance of defaulting. Therefore, lenders are willing to charge less interest rate. 327 | 328 | ```{r echo=FALSE, message=FALSE, warning=FALSE, ProsperScore_IncomeLoanAmount} 329 | # Relationship between Prosper's Score and Stated Monthly Income in boxplot 330 | p11 <- ggplot(data = prosper, aes(x = ProsperScore2, y = StatedMonthlyIncome))+ 331 | geom_boxplot()+ 332 | ylim(0, quantile(prosper$StatedMonthlyIncome, prob=0.99, na.rm=TRUE)) 333 | 334 | # Relationship between Prosper's Score and Borrower Rate in boxplot 335 | p12 <- ggplot(data = prosper, aes(x = ProsperScore2, y = LoanOriginalAmount))+ 336 | geom_boxplot() 337 | 338 | # Arrange both charts side by side 339 | grid.arrange(p11, p12, ncol = 2) 340 | 341 | # Correlation of Prosper's Score with Stated Monthly Income and 342 | # Loan Original Amount 343 | with(prosper, cor.test(ProsperScore, StatedMonthlyIncome, method = 'pearson')) 344 | with(prosper, cor.test(ProsperScore, LoanOriginalAmount, method = 'pearson')) 345 | ``` 346 | 347 | Delving into the monthly income and loan amount, both boxplot charts didn't present any surprises. Applicants with higher rating tend to have higher monthly income and larger loan amount. 348 | 349 | ```{r echo=FALSE, message=FALSE, warning=FALSE, EmploymentStatus_LoanAmount} 350 | # Relationship between Employment Status and Loan Amount in boxplot 351 | ggplot(data = prosper, aes(x= EmploymentStatus, y = LoanOriginalAmount))+ 352 | geom_boxplot()+ 353 | theme(axis.text.x = element_text(angle = 30, vjust = 0.6))+ 354 | ylim(0, quantile(prosper$StatedMonthlyIncome, prob = 0.99, na.rm = TRUE)) 355 | ``` 356 | 357 | Looking at the relationship between employment status and loan amount, employed, self-employed and full-time borrowers are usually afforded higher loan amount as opposed to part-timers, not employed or not available. 
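To quantify the boxplot comparison (a supplementary sketch; the dplyr summary below is an addition and not part of the original analysis), the median loan amount per employment status can be tabulated:

```{r echo=FALSE, message=FALSE, warning=FALSE, Loan_By_Employment}
# Median loan amount and number of listings for each employment status
prosper %>%
  group_by(EmploymentStatus) %>%
  summarise(median_loan = median(LoanOriginalAmount, na.rm = TRUE),
            listings = n()) %>%
  arrange(desc(median_loan))
```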
358 | 359 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Term_LoanAmountRateYield} 360 | # Relationship between Loan Term (month) and Loan Amount in boxplot 361 | p13 <- ggplot(data = prosper, aes(x = Term2, y = LoanOriginalAmount))+ 362 | geom_boxplot() 363 | 364 | # Relationship between Loan Term (month) and Borrower Rate in boxplot 365 | p14 <- ggplot(data = prosper, aes(x = Term2, y = BorrowerRate))+ 366 | geom_boxplot() 367 | 368 | # Relationship between Loan Term (month) and Lender Yield in boxplot 369 | p15 <- ggplot(data = prosper, aes(x = Term2, y = LenderYield))+ 370 | geom_boxplot() 371 | 372 | # Arrange both charts side by side 373 | grid.arrange(p13, p14, p15, ncol = 3) 374 | 375 | # Correlation of Loan Term (months) with Loan Original Amount, 376 | # Borrower Rate and Lender Yield 377 | with(prosper, cor.test(Term, LoanOriginalAmount, method = 'pearson')) 378 | with(prosper, cor.test(Term, BorrowerRate, method = 'pearson')) 379 | with(prosper, cor.test(Term, LenderYield, method = 'pearson')) 380 | ``` 381 | 382 | When investigating the effect of loan term (months), it can be said that loans with longer terms usually come with larger amount. As such, a higher interest rate is levied due to higher risk exposure. This is the same for lender yield as higher interest rate is needed to attract investor to lend to riskier borrowers. 383 | 384 | ```{r echo=FALSE, message=FALSE, warning=FALSE, IncomeRange_LoanAmountRatio} 385 | # Relationship between Income Range and Loan Amount in boxplot 386 | p16 <- ggplot(data = prosper, aes(x = IncomeRange, y = LoanOriginalAmount))+ 387 | geom_boxplot()+ 388 | theme(axis.text.x = element_text(angle = 30, vjust = 0.6))+ 389 | ylim(0, quantile(prosper$LoanOriginalAmount, prob=0.99, na.rm=TRUE)) 390 | 391 | # Relationship between Income Range and Debt to Income Ratio in boxplot 392 | p17 <- ggplot(data = prosper, aes(x = IncomeRange, y = DebtToIncomeRatio))+ 393 | geom_boxplot()+ 394 | theme(axis.text.x = element_text(angle = 30, vjust = 0.6))+ 395 | ylim(0, quantile(prosper$DebtToIncomeRatio, prob=0.99, na.rm=TRUE)) 396 | 397 | # Arrange both charts side by side 398 | grid.arrange(p16, p17, ncol=2) 399 | 400 | # Correlation of Stated Monthly Income with Loan Original Amount, 401 | # and Debt to Income Ratio 402 | with(prosper, cor.test(StatedMonthlyIncome, LoanOriginalAmount, 403 | method = 'pearson')) 404 | with(prosper, cor.test(StatedMonthlyIncome, DebtToIncomeRatio, 405 | method = 'pearson')) 406 | ``` 407 | 408 | When comparing the Income Range, those with higher income are able to borrow more as they also tend to have a lower debt-to-income ratio which indicates lower risk. 
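The same pattern can be summarised numerically (a small supplementary sketch with a new chunk label) by computing the median loan amount and median debt-to-income ratio within each income bracket:

```{r echo=FALSE, message=FALSE, warning=FALSE, Income_Range_Medians}
# Median loan amount and debt-to-income ratio within each income bracket
prosper %>%
  group_by(IncomeRange) %>%
  summarise(median_loan = median(LoanOriginalAmount, na.rm = TRUE),
            median_dti = median(DebtToIncomeRatio, na.rm = TRUE))
```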
409 | 410 | ```{r echo=FALSE, message=FALSE, warning=FALSE, DebttoIncomeRatio_RateYield} 411 | # Debt to Income Ratio are cut into intervals 412 | DebtToIncomeRatio2 <- cut(prosper$DebtToIncomeRatio, 413 | breaks = c(0, 0.2, 0.4, 0.6, 414 | 0.8, 1, 1.5, 10.1)) 415 | 416 | # Relationship between Debt to Income Ratio and Borrower Rate in boxplot 417 | p18 <- ggplot(data = prosper, aes(x = DebtToIncomeRatio2, y = BorrowerRate))+ 418 | geom_boxplot()+ 419 | theme(axis.text.x = element_text(angle = 90, vjust = 0.6)) 420 | 421 | # Relationship between Debt to Income Ratio and Lender Yield in boxplot 422 | p19 <- ggplot(data = prosper, aes(x = DebtToIncomeRatio2, y = LenderYield))+ 423 | geom_boxplot()+ 424 | theme(axis.text.x = element_text(angle = 90, vjust = 0.6)) 425 | 426 | # Arrange both charts side by side 427 | grid.arrange(p18, p19, ncol = 2) 428 | 429 | # Correlation of Debt to Income Ratio and Borrower Rate / Lender Yield 430 | with(prosper, cor.test(DebtToIncomeRatio, BorrowerRate, method = 'pearson')) 431 | with(prosper, cor.test(DebtToIncomeRatio, LenderYield, method = 'pearson')) 432 | ``` 433 | 434 | Lower debt-to-income ratio does lead to lower borrower rate or lender yield. That is because those with lower debt to income ratio indicates that they have better ability to service their loan installment and therefore have lower probability of defaulting on their loan. 435 | 436 | # Bivariate Analysis 437 | 438 | ### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset? 439 | In general, most of the relationship observed in the charts are aligned with my expectation. Applicants with higher Prosper's score and lower debt-to-income ratio are able to enjoy lower borrower rate. 440 | 441 | ### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)? 442 | Even though borrower rate tends to increase as debt-to-income ratio increases, that seems to not be the case for those with a debt-to-income ratio of more than 1.5. There are probably other factors that lead to lower borrower rate. Further investigation is needed to understand this anomaly. 443 | 444 | ### What was the strongest relationship you found? 445 | The strongest relatioship found is between Prosper's score and borrower rate where higher Prosper's score leads to lower borrower rate. Its correlation coefficient is -0.66. 446 | 447 | # Multivariate Plots Section 448 | 449 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Ratio_Rate_Income_LoanAmt} 450 | # Relationship between Debt to Income Ratio, Borrower Rate, Income Range 451 | # and Loan Original Amount 452 | ggplot(data = prosper, aes(x = DebtToIncomeRatio, y = BorrowerRate, 453 | color = IncomeRange, size = LoanOriginalAmount))+ 454 | geom_point()+ 455 | scale_x_continuous(breaks = seq(0, 10, 1))+ 456 | xlim(1.5, 10) 457 | ``` 458 | 459 | Continuing my investigtion of the relationship between debt-to-income ratio and borrower rate, I removed all debt-to-income ratios of less than 1.5. From the scatterplotplot, it is rather a surprise that lots of borrowers with low or unidentified income are able to borrow a large sum of more than $10,000 with low borrower rate (less than 0.25). I suspect that there are other variables behind it. 
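Before bringing other variables in, a quick count shows how common these surprising cases are (a sketch using the same 1.5 cut-off as the plot; the $10,000 and 0.25 thresholds simply restate the observation above):

```{r echo=FALSE, message=FALSE, warning=FALSE, High_Ratio_Low_Rate_Count}
# Listings with a debt-to-income ratio above 1.5 that still obtained a loan
# larger than $10,000 at a borrower rate below 0.25
with(subset(prosper, DebtToIncomeRatio > 1.5),
     sum(LoanOriginalAmount > 10000 & BorrowerRate < 0.25, na.rm = TRUE))
```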
460 | 461 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Ratio_Rate_Score} 462 | # Relationship between Debt to Income Ratio, Borrower Rate, Prosper's Score 463 | ggplot(data = prosper, aes(x = DebtToIncomeRatio, y = BorrowerRate, 464 | color = ProsperScore2))+ 465 | geom_point()+ 466 | scale_x_continuous(breaks = seq(0, 10, 1))+ 467 | xlim(1.5, 10) 468 | ``` 469 | 470 | By further extending my investigation, it shows that these borrowers with low income but yet able to borrow with low rate have rather good Prosper's score (exclude those with 'NA' score). That explains the anomalies in the debt-to-income ratio vs borrower rate boxplot chart. 471 | 472 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Rate_Ratio_RatingAlpha} 473 | # Relationship between Debt to Debt to Income Ratio, Borrower Rate 474 | # and Prosper's Rating (Alpha) 475 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 476 | aes(x = DebtToIncomeRatio, y = BorrowerRate, 477 | color = ProsperRating..Alpha.))+ 478 | geom_point()+ 479 | xlim(0, quantile(prosper$DebtToIncomeRatio, prob = 0.9965, na.rm = TRUE)) 480 | ``` 481 | 482 | One interesting finding is borrower rate of respective Prosper's rating tends to have a narrow range of borrower rate irregardless with the debt-to-income ratio. Debt-to-income ratio seems not to matter much as long as borrowers establish a good credit rating score. 483 | 484 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Yield_Ratio_RatingAlpha} 485 | # Relationship between Debt to Debt to Income Ratio, Lender Yield 486 | # and Prosper's Rating (Alpha) 487 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 488 | aes(x = DebtToIncomeRatio, y = LenderYield, 489 | color = ProsperRating..Alpha.))+ 490 | geom_point()+ 491 | xlim(0, quantile(prosper$DebtToIncomeRatio, prob = 0.9965, na.rm = TRUE)) 492 | ``` 493 | 494 | Similarly, lender yield is more likely to be determined by the Prosper's rating of the borrowers than the debt-to-income ratio. 495 | 496 | ```{r echo=FALSE, message=FALSE, warning=FALSE, MonthlyIncome_LoanAmt_Rating} 497 | # Relationship between Stated Monthly Income, Loan Amount, 498 | # and Prosper's Rating (Alpha) 499 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 500 | aes(x = StatedMonthlyIncome, y = LoanOriginalAmount, 501 | color = ProsperRating..Alpha.))+ 502 | geom_point(alpha = 0.2)+ 503 | xlim(0, quantile(prosper$StatedMonthlyIncome, prob = 0.99, na.rm = TRUE)) 504 | ``` 505 | 506 | Borrowers with no credit history are allowed to borrow less than $5,000 in general while borrowers with good ratings are allowed to borrow more up to $35,000. 507 | 508 | ```{r echo=FALSE, message=FALSE, warning=FALSE, MonthlyIncome_Ratio_Rating} 509 | # Relationship between Stated Monthly Income, Debt-to-Income Ratio 510 | # and Prosper's Rating (Alpha) 511 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 512 | aes(x = StatedMonthlyIncome, y = DebtToIncomeRatio, 513 | color = ProsperRating..Alpha.))+ 514 | geom_point()+ 515 | xlim(0, quantile(prosper$StatedMonthlyIncome, prob = 0.99, na.rm = TRUE)) 516 | ``` 517 | 518 | It seems that most of the applicants that fulfil undesirable features of 'bad borrowers' are those with no prior Prosper's rating, low monthly income and high debt-to-income ratio. However, this is normal especially for young graduates who just started out. 
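As a rough cross-check of the apparent loan ceilings described above, the loan amounts can be summarised per rating, keeping the missing ratings (no prior credit history with Prosper) as their own group. This is only a sketch with `dplyr`, not an additional plot.

```{r echo=FALSE, message=FALSE, warning=FALSE, LoanAmount_By_Rating_Sketch}
# Illustrative summary: median and maximum loan amount for each Prosper rating,
# with NA ratings kept as a separate group
prosper %>%
  group_by(ProsperRating..Alpha.) %>%
  summarise(median_loan = median(LoanOriginalAmount, na.rm = TRUE),
            max_loan = max(LoanOriginalAmount, na.rm = TRUE),
            n_loans = n())
```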
519 | 520 | ```{r echo=FALSE, message=FALSE, warning=FALSE, MonthlyIncome_LoanAmt_Term} 521 | # Relationship between Stated Monthly Income, Loan Original Amount 522 | # and Loan Term (month) 523 | ggplot(data = prosper, 524 | aes(x = StatedMonthlyIncome, y = LoanOriginalAmount, color = Term2))+ 525 | geom_point(alpha = 0.8)+ 526 | xlim(0, quantile(prosper$StatedMonthlyIncome, prob = 0.99, na.rm = TRUE))+ 527 | scale_color_manual(values=c("#150303", "#E69F00", "#56B4E9")) 528 | ``` 529 | 530 | It can be seen that the length of the loan term somewhat corresponds to the amount of the loan. Larger loan typically requires longer monthly installment period irregardless of the monthly income. 531 | 532 | It is also necessary to take note that 12-month term loan were discontinued by 2012 possibly due to lack of interest as there might be an upper limit placed on the amount of loan figure. 533 | 534 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanOQtr_LoanAmt_Score} 535 | # Relationship between Loan Origination Quarter, Loan Original Amount, 536 | # Prosper's Score and Prosper's Rating (Alpha) 537 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 538 | aes(x = LoanOriginationQuarter, y = LoanOriginalAmount, 539 | color = ProsperScore))+ 540 | geom_point(alpha = 0.8)+ 541 | scale_x_discrete(limit = LoanOriginationQuarter2)+ 542 | facet_wrap( ~ ProsperRating..Alpha., ncol = 2)+ 543 | theme(axis.text.x = element_text(angle = 90, vjust = 0.6, size = 7)) 544 | ``` 545 | 546 | Using facet wrap, we can observe clearly that borrowers with lower Prosper's score are allowed to borrow smaller amount of loan due to higher perceived risk of defaulting. 547 | 548 | As Prosper expands over the years, the expansion mostly comes from borrowers with good rating while those with no credit history have decreased. It shows that Propser is pursuing more sustainable business model. Another probably explanation is that potential new customer have acquired some credit history over the years. 549 | 550 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanQuarter_Count_LoanStatus} 551 | # Mutate the Loan Status to create a temporary variable with only 552 | # 3 categories, i.e. 'Performing Loan', 'Past Due' and 'Defaulted' 553 | prosper <- prosper %>% mutate(LoanStatus2 = ifelse(LoanStatus %in% 554 | c("Cancelled", "Chargedoff", "Defaulted"), 0, 555 | ifelse(LoanStatus %in% 556 | c("Completed", "Current", "FinalPaymentInProgress"), 2, 557 | 1))) 558 | 559 | # Rearrange the factors of Loan Status 560 | prosper$LoanStatus2 <- factor(prosper$LoanStatus2, levels = 2:0, 561 | labels = c("Performing Loan","Past Due","Defaulted")) 562 | 563 | # Relationship between Loan Origination Quarter, Loan Count and Loan Status 564 | ggplot(prosper, aes(x = LoanOriginationQuarter, fill = LoanStatus2))+ 565 | geom_bar(stat = "count")+ 566 | scale_x_discrete(limits=LoanOriginationQuarter2)+ 567 | theme(axis.text.x = element_text(angle = 90, vjust = 0.6))+ 568 | scale_fill_manual(values = c("Defaulted" = "red","Past Due" = "yellow", 569 | "Performing Loan" = "green")) 570 | ``` 571 | 572 | The barchart above confirms my hypothesis that Prosper expansion comes mostly from performing loans. Over the years, the number of defaulted loans has dropped. 573 | 574 | # Multivariate Analysis 575 | 576 | ### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest? 
577 | Monthly income is a strong factor in determining the borrower rate. At the same time, borrowers with no credit history tend to be those with lower salaries, which suggests that they might be young people who have just graduated or are still in school. 578 | 579 | ### Were there any interesting or surprising interactions between features? 580 | It is interesting to note that the debt-to-income ratio has minimal effect on the borrower rate, holding the Prosper's rating variable constant. The earlier chart showing that the debt-to-income ratio drops as salary increases suggests that the ratio is largely a dependent variable of monthly income. 581 | 582 | ------ 583 | 584 | # Final Plots and Summary 585 | 586 | ### Plot One 587 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_One} 588 | # Relationship between Prosper's Score and Borrower Rate in boxplot 589 | ggplot(data = prosper, aes(x = ProsperScore2, y = BorrowerRate, 590 | fill = ProsperScore2))+ 591 | geom_boxplot()+ 592 | ggtitle("Borrower Rate vs Prosper's Score")+ 593 | xlab("Prosper's Score")+ 594 | ylab("Borrower Rate")+ 595 | theme(plot.title = element_text(hjust = 0.5, size = 20))+ 596 | scale_fill_discrete(name = "Prosper's Score") 597 | ``` 598 | 599 | ### Description One 600 | The boxplots above show that borrowers with a better Prosper's score tend to enjoy a lower borrower rate. The range of the rate doesn't fluctuate much for most Prosper's scores, with the exception of the moderate scores between 4 and 7. 601 | 602 | However, there are outliers running against the trend for those with the best and worst scores. That could be due to other factors such as the amount of the loan taken, a new monthly income or a change in employment status. 603 | 604 | ### Plot Two 605 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_Two} 606 | # Relationship between Debt-to-Income Ratio, Borrower Rate 607 | # and Prosper's Rating (Alpha) 608 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 609 | aes(x = DebtToIncomeRatio, y = BorrowerRate, 610 | color = ProsperRating..Alpha.))+ 611 | geom_point(alpha = 0.5)+ 612 | xlim(0, quantile(prosper$DebtToIncomeRatio, prob = 0.9965, na.rm = TRUE))+ 613 | xlab("Debt-to-Income Ratio")+ 614 | ylab("Borrower Rate")+ 615 | ggtitle("Effect of Prosper's Rating (Alpha) on 616 | Borrower Rate vs Debt-to-Income Ratio")+ 617 | theme(plot.title = element_text(hjust = 0.5, size = 15))+ 618 | scale_color_discrete(name = "Prosper's Rating (Alpha)") 619 | ``` 620 | 621 | ### Description Two 622 | This graph is chosen as it shows that Prosper's rating (alpha) is crucial in determining the borrower rate. Holding the Prosper's rating (alpha) constant, an increase in the debt-to-income ratio has an insignificant impact on the borrower rate. 623 | 624 | Those with an 'HR' rating are more likely to have a debt-to-income ratio larger than 1.0. On the other hand, borrowers rated 'AA' tend to have a ratio of less than 0.5 and thus a lower borrower rate, usually below 0.1.
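To attach approximate numbers to this description, the medians per rating can be tabulated. The chunk below is an illustrative sketch with `dplyr` rather than another final plot.

```{r echo=FALSE, message=FALSE, warning=FALSE, Rating_Medians_Sketch}
# Illustrative summary: median borrower rate and median debt-to-income ratio
# for each Prosper rating, to quantify the narrow bands seen in the scatterplot
prosper %>%
  filter(!is.na(ProsperRating..Alpha.)) %>%
  group_by(ProsperRating..Alpha.) %>%
  summarise(median_rate = median(BorrowerRate, na.rm = TRUE),
            median_ratio = median(DebtToIncomeRatio, na.rm = TRUE))
```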
625 | 626 | ### Plot Three 627 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_Three} 628 | # Relationship between Loan Origination Quarter, Loan Count and Loan Status 629 | ggplot(prosper, aes(x = LoanOriginationQuarter, fill = LoanStatus2))+ 630 | geom_bar(stat = "count")+ 631 | scale_x_discrete(limits=LoanOriginationQuarter2)+ 632 | theme(axis.text.x = element_text(angle = 90, vjust = 0.6))+ 633 | scale_fill_manual("Loan Status", values = c("Defaulted" = "#DA0404", 634 | "Past Due" = "#E8C203", 635 | "Performing Loan" = "#05B94C"))+ 636 | xlab("Loan Quarter")+ 637 | ylab("Loan Count")+ 638 | ggtitle("Loan Status over 2005 - 2014 period")+ 639 | theme(plot.title = element_text(hjust = 0.5, size = 15)) 640 | ``` 641 | 642 | ### Description Three 643 | The bar chart above shows that Prosper has been expanding its lending operation, with the exception of late 2008 and late 2012, dips possibly caused by the Global Financial Crisis and the European Sovereign Debt Crisis. 644 | 645 | Over the years, non-performing loans have decreased markedly. That suggests that Prosper's ability to assess its applicants' creditworthiness has been improving. Another reason could be that Prosper decided to pursue a more sustainable expansion instead of lending to risky borrowers for higher yield, which could end in insolvency if non-performing loans were to outnumber performing loans. When a crisis hit, Prosper seemed to tighten its lending policy, which is in line with most banks' practices as well. 646 | 647 | ------ 648 | 649 | # Reflection 650 | When I started exploring this dataset, I was overwhelmed by the number of variables available. It was very tedious to study the relationships between all of them, so I only chose about 20 variables that I am more familiar with. It would be great if Prosper could clarify how the rating, the score and the borrower rate are determined; however, I also understand that this information is Prosper's confidential, proprietary material. That being said, once I had spent a few days working on the data I had a much better grasp of the dataset. I only included plots that contribute to the storytelling and excluded other variables that don't tell much about the characteristics of the demographics. 651 | 652 | The other challenge that I faced was unfamiliarity with R. Since this was my first time coding in R, I took a lot of notes on everything and did a lot of Googling through forums and documentation to find out how to plot certain graphs or customize the charts. I am glad that the effort paid off, as I was able to complete the chapter and this project in less than two weeks. 653 | 654 | Overall, I was able to come up with a coherent storyline for this report. The variables don't seem too intimidating after a while since most are quite self-explanatory. Without specific questions or directions, I was free to explore and determine the focus of the storyline. That is when I decided to look more into how the borrower rate is determined and how Prosper's rating affects other variables. 655 | 656 | I was rather surprised that the debt-to-income ratio doesn't seem to play an important role in determining the borrower rate after taking Prosper's rating into account. However, I can't totally exclude the importance of the debt-to-income ratio without knowing how Prosper's rating is determined, as the ratio might be one of its main input components.
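One rough way to probe that question without knowing Prosper's internal formula is to compare nested linear models. The sketch below is purely illustrative: it uses only variables already in the dataset, and `lm()` silently drops rows with missing values, so the comparison of the two R-squared values is approximate at best.

```{r echo=FALSE, message=FALSE, warning=FALSE, Rating_vs_Ratio_Model_Sketch}
# Does the debt-to-income ratio add explanatory power once the rating is known?
# Compare the R-squared of a rating-only model against a model that also
# includes the ratio (rows with missing values are dropped by lm()).
m_rating <- lm(BorrowerRate ~ ProsperRating..Alpha., data = prosper)
m_both <- lm(BorrowerRate ~ ProsperRating..Alpha. + DebtToIncomeRatio,
             data = prosper)
summary(m_rating)$r.squared
summary(m_both)$r.squared
```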
657 | 658 | To move on from here, it would be great to be able to build an equation or predictive model to simulate real world scenario. Prosper can also collect other related info that might aid in making the prediction more accurate such as age, education level or city of the applicants/borrowers. Prosper can also help to explain how the rating were determined without revealing too much corporate info as it will help in building the predictive model. -------------------------------------------------------------------------------- /5-Exploratory-Data-Analysis/README.md: -------------------------------------------------------------------------------- 1 | Link to [project report](https://cdn.rawgit.com/kaishengteh/Data-Analyst-Nanodegree/e94db549/5-Exploratory-Data-Analysis/Exploratory%20Data%20Analysis.html). 2 | -------------------------------------------------------------------------------- /5-Exploratory-Data-Analysis/Resources.txt: -------------------------------------------------------------------------------- 1 | Some of the resources used while completing the projects: 2 | 1. Sample project 3 | https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html 4 | 5 | 2. R documentation 6 | https://www.rdocumentation.org 7 | 8 | 3. Forums 9 | http://www.stackoverflow.com 10 | http://www.cookbook-r.com/ -------------------------------------------------------------------------------- /6-Inferential-Statistics/Stroop-Effect.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Statistics: The Science of Decisions Project Instructions" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Background Information\n", 15 | "\n", 16 | "In a Stroop task, participants are presented with a list of words, with each word displayed in a color of ink. The participant’s task is to say out loud the color of the ink in which the word is printed. The task has two conditions: a congruent words condition, and an incongruent words condition. In the congruent words condition, the words being displayed are color words whose names match the colors in which they are printed: for example RED, BLUE. In the incongruent words condition, the words displayed are color words whose names do not match the colors in which they are printed: for example PURPLE, ORANGE. In each case, we measure the time it takes to name the ink colors in equally-sized lists. Each participant will go through and record a time from each condition." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## Questions For Investigation" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "#### 1. What is our independent variable? What is our dependent variable?" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "Independent variable: Word congruence (whether color of the word matches the definition of the word)\n", 38 | "\n", 39 | "Dependent variable: Time taken to name the ink colors of the words" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "#### 2. What is an appropriate set of hypotheses for, this task? What kind of statistical test do you expect to perform? Justify your choices." 
47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "Null Hypothesis, H0 - Congruence of the word has no effect on the mean population time taken to name the ink colors of the words\n", 54 | "\n", 55 | "Alternative Hypothesis, H1 - Words where their definition not congruent with its ink color will increase the mean population time taken to name of the ink colors of the words\n", 56 | "\n", 57 | "H0: μ = μ0 (same mean population time)\n", 58 | "H1: μ > μ0 (a upper-tailed test) (mean populataion time for incongruent words are longer than congruent words)\n", 59 | "\n", 60 | "A one-tail paired t-test will be carried out to study whether there are significant difference in time taken to identify the ink colors of the words due to congruence. A t-test is used instead of a Z-test as the population parameters are unknown.\n", 61 | "\n", 62 | "An alpha, α = 0.05 is used as a threshold whether to reject or accept the null hypothesis." 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "#### 3. Report some descriptive statistics regarding this dataset. Include at least one measure of central tendency and at least one measure of variability.\n", 70 | "\n" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 1, 76 | "metadata": { 77 | "collapsed": true 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "'''Importing all required libraries'''\n", 82 | "import pandas as pd\n", 83 | "import numpy as np\n", 84 | "import matplotlib.pyplot as plt\n", 85 | "%matplotlib inline" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 2, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | " Congruent Incongruent Mean Difference\n", 98 | "0 12.079 19.278 7.199\n", 99 | "1 16.791 18.741 1.950\n", 100 | "2 9.564 21.214 11.650\n", 101 | "3 8.630 15.687 7.057\n", 102 | "4 14.669 22.803 8.134\n", 103 | "5 12.238 20.878 8.640\n", 104 | "6 14.692 24.572 9.880\n", 105 | "7 8.987 17.394 8.407\n", 106 | "8 9.401 20.762 11.361\n", 107 | "9 14.480 26.282 11.802\n", 108 | "10 22.328 24.524 2.196\n", 109 | "11 15.298 18.644 3.346\n", 110 | "12 15.073 17.510 2.437\n", 111 | "13 16.929 20.330 3.401\n", 112 | "14 18.200 35.255 17.055\n", 113 | "15 12.130 22.158 10.028\n", 114 | "16 18.495 25.139 6.644\n", 115 | "17 10.639 20.429 9.790\n", 116 | "18 11.344 17.425 6.081\n", 117 | "19 12.369 34.288 21.919\n", 118 | "20 12.944 23.894 10.950\n", 119 | "21 14.233 17.960 3.727\n", 120 | "22 19.710 22.058 2.348\n", 121 | "23 16.004 21.157 5.153\n" 122 | ] 123 | } 124 | ], 125 | "source": [ 126 | "stroop = pd.read_csv(\"stroopdata.csv\")\n", 127 | "stroop['Mean Difference'] = stroop['Incongruent'] - stroop['Congruent'] #A new column for population is created by adding up female and male pop.\n", 128 | "print(stroop)" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 3, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "data": { 138 | "text/html": [ 139 | "
\n", 140 | "\n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | "
CongruentIncongruentMean Difference
count24.00000024.00000024.000000
mean14.05112522.0159177.964792
std3.5593584.7970574.864827
min8.63000015.6870001.950000
25%11.89525018.7167503.645500
50%14.35650021.0175007.666500
75%16.20075024.05150010.258500
max22.32800035.25500021.919000
\n", 200 | "
" 201 | ], 202 | "text/plain": [ 203 | " Congruent Incongruent Mean Difference\n", 204 | "count 24.000000 24.000000 24.000000\n", 205 | "mean 14.051125 22.015917 7.964792\n", 206 | "std 3.559358 4.797057 4.864827\n", 207 | "min 8.630000 15.687000 1.950000\n", 208 | "25% 11.895250 18.716750 3.645500\n", 209 | "50% 14.356500 21.017500 7.666500\n", 210 | "75% 16.200750 24.051500 10.258500\n", 211 | "max 22.328000 35.255000 21.919000" 212 | ] 213 | }, 214 | "execution_count": 3, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "stroop.describe()" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 4, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "data": { 230 | "text/plain": [ 231 | "Congruent 14.3565\n", 232 | "Incongruent 21.0175\n", 233 | "Mean Difference 7.6665\n", 234 | "dtype: float64" 235 | ] 236 | }, 237 | "execution_count": 4, 238 | "metadata": {}, 239 | "output_type": "execute_result" 240 | } 241 | ], 242 | "source": [ 243 | "stroop.median()" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "Median is chosen as the measure of central tendency.\n", 251 | "\n", 252 | "Standard deviation is chosen as the measure of variability." 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "#### 4. Provide one or two visualizations that show the distribution of the sample data. Write one or two sentences noting what you observe about the plot or plots." 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 5, 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "data": { 269 | "text/plain": [ 270 | "array([[,\n", 271 | " ],\n", 272 | " [,\n", 273 | " ]], dtype=object)" 274 | ] 275 | }, 276 | "execution_count": 5, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | }, 280 | { 281 | "data": { 282 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAfwAAAFyCAYAAAAQ6Gi7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAAPYQAAD2EBqD+naQAAIABJREFUeJzt3Xu8HHV9//HXJ4AJCSqWaEBNvKLGS9U9FhORgGhDm7oH\nbWsQRRuQogaQxiYIVZKgxV+CFe1JgEITuYUTsGoC2kiigjbBC7oH8XZiy8WTCAKJCJIEEMjn98fM\nSXY3u+fs7M7sfnf3/Xw85pGc787M5zPfme/3uzs7O2PujoiIiHS2Ma1OQERERLKnAV9ERKQLaMAX\nERHpAhrwRUREuoAGfBERkS6gAV9ERKQLaMAXERHpAhrwRUREuoAGfBERkS6gAV9ERKQLaMDPiJm9\n1MwuM7O7zOwxM3vEzDaZ2cfMbFyr8wuVmZ1oZme1Og+RepjZP5jZbjPLtTqXbqE+o3b7tzqBTmRm\nfwN8GXgcuBr4BfAM4K3AhcCrgY+0LMGwvQ94DfDvrU5EpE56QElzqc+okQb8lJnZi4HVwD3Ase7+\nYNHLl5rZecDftCC1UZnZOHd/vNV5iEh7UJ/RXnRKP32fACYAHyob7AFw97vdfRmAme1nZueZ2Z1m\n9riZ3WNmF5jZM4qXMbPfmNmNZnakmf0o/orgLjP7QPn6zezPzex7ZrbLzLaa2SfN7OT4NOOUCuuc\naWY/NrPHgNPM7EXxvB+ssO7dZrawrOz5ZvYlM7s/3oZfmNnJZfMcHS/7njifrfE2fNvMXlY03y1E\nb4aGc9htZnfXWO8iwTGzK83s0bidrI3//6CZfc7MrGxeM7OzzOxncft40My+Wfz1gPoM9RmN0Cf8\n9L0TuNvdf1TDvCuBDxKd/v834M3AucCrgL8rms+Bw4H/ipe5EjgFuMLMfuLugxA1JOAW4GngAmAX\ncCrwJ/Y9zehxnH7gMuBy4NdJNtTMngf8KI7XB2wH/hpYaWbPdPe+skXOief9HPBsojdHq4Dp8ev/\nGpe/APgnwIAdSXISCYwTfbBaD/wQ+GfgHcDHgTuJ2t6wLwH/APw38J9E/fNRwDRgIJ5HfYb6jPq5\nu6aUJuCZwG7gazXM++fxvP9RVn4h0QF+dFHZPXHZW4rKJgKPARcWlfUBTwGvKyo7mKhRPQ1MqbDO\nd5TFf1Gc1wcr5LwbWFj09wrgt8DBZfP1Aw8BY+O/j46X/QWwX9F8Z8Y5vLqo7OtEb5havj81aUo6\nEQ3YTwO5+O8r4r//pWy+AnBb0d9vi9vIRSOsW32G+oyGJp3ST9ez4n8frWHeWUTvmL9QVv55onep\n5d/z/8rdvz/8h7tvJ3p3/dKieY4DfuDuPy+a72Hg2io53OPu364h12r+lqix7WdmhwxPwAaid93l\nVyp/yd2fLvp7I9G2vhSRznZZ2d8bKT3u/45ogPv0COtQn6E+oyE6pZ+uP8b/PrOGeYffFd9ZXOju\nD5jZw/HrxbZUWMcfgOeUrfP7Fea7s0IZRO/Y62JmzyX6JHAa8OEKszjwvLKyrWV//yH+9zmIdK7H\n3f33ZWXlbfelwH3xYFuN+gz1GQ3RgJ8id3/UzO4DXptksRrne7pKuVUpr8VjFcoq5mNm5WeDhv9e\nBVxVZf0/K/s7i20QCV21475e6jPUZ9RFA376vgH8o5m92Ue+cG+IqAEcTtGFL/FFLQfHryc1BLy8\nQvnhCdYx/A764LLy8k8P24i+utjP3W9OsP7R6DfM0o3uAmaa2cEjfMpXn1GZ+owa6Tv89F1IdKXr\nirghljCzl5nZx4B1RO9S/6lsln8mOoD/u47Y64HpZvbnRfH+jOjGFDVx90eJLtiZUfbS6RQ1LHff\nDXwV+Dsze035esxsYrLU99hJ9F2eSDf5KlF/vGiEedRnVKY+o0b6hJ8yd7/bzN4HXAcMmlnxnfaO\nBP6e6EKUPjO7iuh3rM8Bvkf0E5sPEl3l/706wl8InAR828yWETWEU4nexT+H2t8JrwDOMbP/BH5C\n1JAPZ9/TaOcAxwA/iuf9FfBnQA9wLNFVwUkVgNlm9nngx8AOd/9GHesRaZXEp5vd/btmdg3wMTN7\nBXAT0RuAo4Cb3f0Sd/+Z+oyK1GfUSAN+Btz96/E75gVAL9FtdP9ENPDPJ/r9KsCHiE7lzQHeBdxP\n9FvY8it1neoNr/gd9G/N7Biin9qcS/Su+1Ki36V+kehWv7Ws89NEDe/vgfcQfbL4a+DBsngPmtkR\nwELg3cBHgd8DvwTOrpbnKOWXAK8nqpN/Iup41HilnVT6/Xot880B7iDqFy4EHiEaPIsvqlOfoT6j\nbhb/jlE6mJl9EfhH4CDXDheRUajP6EyJv8OPb4t4jZltj2/FeIfpyVDBsLIn8cW/cT0J2KiGK1lS\n39Ce1Gd0j0Sn9M3sYOBW4DtEN2zYTvQ9zR9GWk6a6gdm9l1gEDiU6HaazwQ+08qkpLOpb2hr6jO6\nRKJT+ma2BJju7kdnl5I0wsz+leh7tBcSfddVAM5391tamph0NPUN7Ut9RvdIOuD/kujq0clE9zq+\nF7jE3Vdkk56ItAP1DSLhSzrgP0b0DvDzwFeAI4B/Bz7s7tdUmP8QotN7v6H0ak8RKTUOeDGwvsJt\nWIOnvkEkE6n2C0kH/CeInvB0VFHZvwNvcvcjK8z/Pqo/hEFE9vV+d+9vdRJJqW8QyVQq/ULS3+H/\njujCjmKDRE9AquQ3AKtWrWLq1KkJQyXzl3/5l3zrW9/qiBh9fX2cdNJJRNfMvKTONd0DnFe17jup\nvjohxuDgYLzPozbThoLtG2q1dx8Ut7u5RD/zrsXIbS4LzTg2GxV6jiHnl3a/kHTAvxV4ZVnZK6l+\nD+fHAaZOnUoul+2vc3K5XMfE2NtZzGLfp0XWagA4r2rdd1J9dUKMIu16ejvYviG54nZ3JfD+Gpcb\nuc1locnHZl1CzzH0/GKp9AtJf4f/BWCamZ0b3xP+fUS3YVyeRjKNOPDAAxUjsDiK0VWC7RsaE/a+\nb4djM/QcQ88vTYkGfHf/CdHtEE8Efg58EjjL3a/LIDcRaRPqG0TCl/he+u6+jug+ySIie6hvEAlb\nxzwed8KECYoRWBzFkPYX9r5vh2Mz9BxDzy9NHTPgv+IVr1CMwOIohrS/sPd9OxyboecYen5pyvRp\nefGDMwqFQqEdroIMxsDAAD09PUR3uGzkKv0eVPftYe8+p8fdB1qdT9ZC7Bsab3dqc5KutPuFjvmE\nLyIiItVpwBcREekCHTPgb9++XTECi6MY0v7C3vftcGyGnmPo+aWpYwb8U045RTECi6MY0v7C3vft\ncGyGnmPo+aWpYwb8xYsXK0ZgcRRD2t/iVicwonY4NkPPMfT80tQxA34zrortlBjNiqMY0v7C3vft\ncGyGnmPo+aWpYwZ8ERERqU4DvoiISBdINOCb2SIz2102/Sqr5JJYuXKlYgQWRzG6R8h9Q2PC3vft\ncGyGnmPo+aWpnk/4vwAmAYfG01tTzahOAwPZ35ysU2I0
K45idJ0g+4bGhL3v2+HYDD3H0PNLUz0D\n/lPuvs3dH4ynh1LPqg4XX3yxYgQWRzG6TpB9Q2PC3vftcGyGnmPo+aWpngH/cDO718zuMrNVZjY5\n9axEpB2pbxAJWNIB/4fAHOA44CPAS4D/MbPueb6giFSivkEkcPsnmdnd1xf9+Qszuw0YAmYDV6SZ\nmIi0D/UNIuFr6Gd57v4I8L/Ay0eab9asWfT29pZM06dPZ+3atSXzbdiwgd7e3n2WP/300/e5knJg\nYIDe3t4990EeXm7RokUsXbq0ZN4tW7bQ29vL5s2bS8qXLVvGggULSsp27dpFb28vmzZtKilfvXo1\nU6ZM2Se3E044IZPtiFxZtoYtQC+wuax8GbCgrOwxAG6//fZ9tuPkk0/eJ7+0t2N4W7LcH+26HatX\nr97TDnp6epgyZQrz5s3bJ792FlLfMCzJPtzbpopj7or/3lQ272rg5H1yy+JYLN+O4XWl2abS3o4Z\nM2Y0vD+y3I5DDz20pu1I47gaaTsWL15c0g56enqYNWvWPrk1xN3rnoCDgIeAM6q8ngO8UCh41tav\nX98xMQqFggMOBQevc4rWUa3uO6m+OiHG3n1Ozhtok6FMIfUNtarc7tan1uay0Ixjs1Gh5xhyfmn3\nC0l/h/85M5thZi8ys7cAa4Anid7ittTMmTMVI7A4itE9Qu4bGhP2vm+HYzP0HEPPL02JvsMHXgj0\nA4cA24jObU1z99+nnZiItBX1DSKBS3rR3olZJSIi7Ut9g0j4OuZe+uUXYyhG6+MohrS/sPd9Oxyb\noecYen5p6pgBf/Xq7L8q7JQYzYqjGNL+wt737XBshp5j6PmlqWMG/Ouvv14xAoujGNL+wt737XBs\nhp5j6PmlqWMGfBEREalOA76IiEgX0IAvIiLSBTpmwK90K0XFaG0cxZD2F/a+b4djM/QcQ88vTR0z\n4HfKHdd0p73ujCGhCnvft8OxGXqOoeeXpo4Z8E88Mfv7fnRKjGbFUQxpf2Hv+3Y4NkPPMfT80tQx\nA76IiIhUpwFfRESkC3TMgF/+TGTFaH0cxZD2F/a+b4djM/QcQ88vTQ0N+GZ2jpntNrOL0kqoXhde\neKFiBBZHMbpXSH1DY8Le9+1wbIaeY+j5panuAd/M/gI4DbgjvXTqd9111ylGYHEUozuF1jc0Jux9\n3w7HZug5hp5fmuoa8M3sIGAVcCrwcKoZ1Wn8+PGKEVgcxeg+IfYNjQl737fDsRl6jqHnl6Z6P+Ff\nDHzd3W9OMxkRaXvqG0QCtX/SBczsvcAbgDelkcAtt9zCQw891NA6JkyYwHHHHYeZNbSeLVu2sH37\n9obWMXHiRKZMmdLQOkTaURp9w5YtW/jxj39cdw4HH3wwhx9+eF3teHBwsO64aWm0D1L/IyNy95on\n4IXA/cBri8puAS6qMn8O8EmTJnk+ny+Zpk2b5hdccIEDqUz5fN7d3RcuXOhLlizxYkNDQ57P531w\ncLCkvK+vz+fPn79nnnHjxjecx7hx4/2aa67Zk0+xuXPn+ooVK0rKCoWC5/N537Ztm7u7z58/3wuF\nQry+Mx28aBpyyDsMlpX3OcwvK9vkwD7x+vv7fc6cOXu2e9js2bN9zZo1JWXr16+vezuGt6Xe/TFs\n586dns/nfePGjR2zHf39/XvaQS6X88mTJ/uMGTOGj6GcJ2iToUxp9Q0HHfTMhtvg/vs/o8F1FIra\nVHG72hm3v41lba3fYU68HF4oFNw9+bG4dOnShvugZzxjnA8NDdV8LBYbblPlGm1TH/jAB/ZpU+71\n99Vpb8fLXvaymrajUt+Q5nYsWrSopC3kcjmfNGlSqv1C0kZ9PPA08CfgyXjaXVRmZfPnihtAuRtu\nuCHemF87PNTAhL/3ve+tGKNWewfZVXHDrTQtGOG1Qrxs9e2tRV9fX1EuhbKOJclUGDGXvr6+unNM\nsi2KUZu9+7xtB/xU+oZDD50cD7L19AP3FA1+I7XjatNnKrS7vtTaXO3HQJLci/ukxvufLDSj/TQi\n5PzS7heSntL/NvC6srIrgUFgiXvUkpM7GHhOfYvGjj322IaW32sqUV9USbXy9Jx55pkMDAw0JY5i\nhBOjA6TYNxxIff1B8SVJI7Xjaiqd0m/Fvk+Se/Z9UqNCbz+h55emRAO+u+8EflVcZmY7gd+7e+u/\nABORllDfIBK+NO60V+enehHpcOobRALS8IDv7se6+8fTSKYRv/vd75oQZXP2ETZnH6NZcRSju4XS\nNzQm9H0fen7ht5/Q80tTx9xL/6tf/WoTopydfYSzs4/RrDiKIe0v9H0fen7ht5/Q80tTxwz4zXmm\n8fLsIyzPPkaz4iiGtL/Q933o+YXffkLPL00dM+AfcsghTYiS/Q0tmnXTjGbEUQxpf6Hv+9DzC7/9\nhJ5fmjpmwBcREZHqNOCLiIh0gY4Z8G+66aYmRFmafYSl2cdoVhzFkPYX+r4PPb/w20/o+aWpYwb8\nP/3pT02Isiv7CLuyj9GsOIoh7S/0fR96fuG3n9DzS1PHDPi9vb1NiHJ+9hHOzz5Gs+IohrS/0Pd9\n6PmF335Czy9NHTPgi4iISHUa8EVERLpAxwz4jz76aBOibM8+wvbsYzQrjmJI+wt934eeX/jtJ/T8\n0pRowDezj5jZHWb2SDx938z+KqvkkrjqqquaEOWU7COckn2MZsVRjO4Rct/QmND3fej5hd9+Qs8v\nTUk/4W8FPkH0EOYe4GbgBjObmnZiSeXz+SZEWZx9hMXZx2hWHMXoKsH2DY1Z3OoERrG41QmMKvT2\nE3p+aUo04Lv7f7v7Te5+l7vf6e6fAnYA07JJr3YvetGLmhAll32EXPYxmhVHMbpHyH1DY0Lf96Hn\nF377CT2/NO1f74JmNgaYDYwHfpBaRiLS1tQ3iIQp8YBvZq8lasTjgEeBd7t79zxQWEQqUt8gErZ6\nrtLfDLweOAK4FLjazF6ValZ12LRpUxOirMw+wsrsYzQrjmJ0nSD7hsYk3/eDg4MMDAwkngYHB5uS\nX7OF3n5Czy9V7t7QBHwLuLTKaznAJ02a5Pl8vmSaNm2an3vuuQ44PODgDusd8vH/i6e5DivKygrx\nvNsc8GOOOcbd3RcuXOhLlizxYkNDQ57P531wcLCkvK+vz+fPn+/u7oVCIc5lU7zejWXx+h1eVSG3\n2Q5rinLCly9f7vl83svNnTvXV6xYUVJWKBQ8n8/7tm3b9syzN5czy2INxbkNlpX3OcwvK9vkwD7x\n+vv7fc6cOT537tyS8tmzZ/uaNWtKytavX1/3dgzPV+/+GLZz507P5/O+cePGjtmO/v7+Pe0gl8v5\n5MmTfcaMGfE+J+cNtslQpnr6hrFjxzm8oKhNDU+19A0PD9ehw4y4byied6HDkhHa1Kp42UJRm5pb\nNO/OEfqGOQ7fcBhTlEO9U6FCH1dtO4bzG4q3GS8UCjUfi8WG21S5RtvUe97znn3alHv9fXXa2/Hi\nF7+4pu2o1De
kuR2LFi0qaQu5XM4nTZqUar+QRqP+DvClKq/lyg/AYjfccIOXDvj1Tvjll19eMUat\n9g6yhQbyKOzT4No9F2mOvfu8owb8xH3DoYdOdjivzmO+eMCvp+2samDZ4uVXxetIOn2mwfhq850m\n7X4h0Xf4ZvZZ4JvAFuCZwPuBo4GZSdYjIp1FfUOxqdR39Xw9p/RFapf0or3nAVcBhwGPAD8DZrr7\nzWknJiJtRX2DSOASDfjufmpWiYhI+1LfIBK+jrmX/vLly5sQJftH8DbnMb/NiaMY0v5C3/eh5xd+\n+wk9vzR1zID/tre9rQlRzsg+whnZx2hWHMWQ9hf6vg89v/DbT+j5paljBvzXvOY1TYiS/fVHM2c2\n5xqnZsRRDGl/oe/70PMLv/2Enl+aOmbAFxERkeo04IuIiHSBjhnwb7/99iZEWZt9hLXZx2hWHMWQ\n9hf6vg89v/DbT+j5paljBvwf//jHTYiyOvsIq7OP0aw4iiHtL/R9H3p+4bef0PNLU8cM+KeddloT\nolyffYTrs4/RrDiKIe0v9H0fen7ht5/Q80tTxwz4IiIiUp0GfBERkS6gAV9ERKQLdMyAf+WVVzYh\nysnZRzg5+xjNiqMY0v5C3/eh5xd++wk9vzQlGvDN7Fwzu83M/mhmD5jZGjN7RVbJJfHqV7+6CVE6\n565unXKHuk6J0e5C7hsaE/q+Dz2/8NtP6PmlKekn/KOAZcCbgXcABwAbzOzAtBNL6ogjjmhClBOz\nj3Bi9jGaFUcxukqwfUNjQt/3oecXfvsJPb80JX087qziv81sDvAg0ANsSi8tEWkn6htEwtfod/gH\nAw48lEIuItI51DeIBCbRJ/xiZmbAF4FN7v6r9FKqT6FQoKenp+7lBwcHa5hrE/DWumPUYtOmTYwf\nPz619VXbrttvv503vvGNoy7/xBNPMHbs2LpiF8eYOHEiU6ZMqWs9I9m0aRNvfWv2+yTrGJ0ktL6h\nMdm3+caEnl/z2s+WLVvYvn174uWG+6ms+qiguHtdE3ApcDdw2Ajz5ACfNGmS5/P5kmnatGl+7rnn\nOuDwgIM7rHfIx/8vnuY6rCgrK8TzbnPAzcbE62p02hSvd2NZvH6HyRVym+2wpignfPny5Z7P573c\n3LlzfcWKFSVlhULB8/m8b9u2zd3d8/m8FwqFOJczy2INxbkNlpX3OcwvK/tKSvWRzjRu3HgfGhpy\nd/ehoSHP5/M+ODhYUhd9fX0+f/78krKdO3d6Pp/3jRs3lpT39/f7nDlz9qnn2bNn+5o1a0rK1q9f\nX/f+GN4nCxcu9CVLlpTMW+929Pf372kHuVzOJ0+e7DNmzBiuq5zX2SZDmRrpG8aOHefwgqI2NTzV\n0jc8XHTMzYj7huJ5FzosGaFNrYqXLRS1qeKYO0foG+aULV/eN9SyHaeWLV/cx1XbjuF1DcXbjBcK\nhZqPxWLDbapco23qqKOO2qdNuXuqbWpoaMjHjRufWh9Va9+Q5nYsWrSopC3kcjmfNGlSqv2CedT4\nEjGz5UAeOMrdt4wwXw4oFAoFcrncPq/feOONHH/88cADwPMS51EUKf53FTC1znWsA84DCkR9USW7\ngJE+fQ8APVTb3lrs2rWLzZs3x2crRsplNNcCJ1G9Th4DRruearhO6q3X4RiDwEkN1Us1u3btSvWM\nSKtiDAwMDJ+h6nH3gUyDZajRvuGww6Zw//1zgE/XEf0Rom8SoL62M9xmipcdrc2Ptnyj8UdTnF/j\n/U8Wmtt+6umrHgN+Q1Z9VCPS7hcSn9KPG/TxwNEjNejWmEr9A2Qtp/SzPWiBDBpGGnXSyDqylXVH\n0qwYnSDsvqFeoe/70PNrdvupt69q8x+T1CjRgG9mlxD9DqQX2Glmk+KXHnH3x9NOTkTag/oGkfAl\nvUr/I8CzgO8C9xVNs9NNS0TajPoGkcAlGvDdfYy771dhujqrBMOyIPsIC7KPEUfqiBjNqK/m7ZP2\n1bl9Q+j7PvT82qH9hJ5fejrmXvrNkf1PNpr3s5BmxOmM+ur4n+rICELf96Hn1w7tJ/T80qMBP5Ez\ns49wZvYx4kgdEaMZ9dW8fSLhCX3fh55fO7Sf0PNLjwZ8ERGRLqABX0REpAtowE9kc/YRNmcfI47U\nETGaUV/N2ycSntD3fej5tUP7CT2/9GjAT+Ts7COcnX2MOFJHxGhGfTVvn0h4Qt/3oefXDu0n9PzS\nowE/keXZR1iefYw4UkfEaEZ9NW+fSHhC3/eh59cO7Sf0/NKjAT+RTvoJmH6WF1IMCVXo+z70/Nqh\n/YSeX3o04IuIiHQBDfgiIiJdQAN+Ikuzj7A0+xhxpI6I0Yz6at4+kfCEvu9Dz68d2k/o+aUn8YBv\nZkeZ2Y1mdq+Z7Taz3iwSC9Ou7CPsyj5GHKkjYjSjvpq3T9pX5/YLoe/70PNrh/YTen7pqecT/gTg\np8BcwNNNJ3TnZx/h/OxjxJE6IkYz6qt5+6StdWi/EPq+Dz2/dmg/oeeXnv2TLuDuNwE3AZiZpZ6R\niLQd9Qsi4dN3+CIiIl1AA34i27OPsD37GHGkjojRjPpq3j6R8IS+70PPrx3aT+j5pSfxKf3udgpw\n46hzDQ4O1h1h3rx5nHbaaXUvX7vatiXNGI3UC8ATTzzB2LFjS8rmzZvHF77whYbWMZryGBMnTmyD\nm4lIOprRThqxb36NtLMsju1TTjmFG28MvQ4XtzqJ5nD3uidgN9A7wus5wCdNmuT5fL5kmjZtmp97\n7rkOODzg4A7rHfLx/4unuQ4rysoK8bzb4nUQly10WFI271A872BZeZ/D/Pj/q+J1bIrn3Vg2b3+V\n3GY7rIn//w2HMUX5NDqdWcd2DE8r43WU11u/w5y4rqptx/D0iaJ6TbI/issWxvHSqJf9gljHuHHj\nfWhoyIeGhjyfz/vg4KAX6+vr8/nz55eU7dy50/P5vG/cuNH7+/v3tINcLueTJ0/2GTNmDK8/10ib\nDGEarV8YqW8YO3acwwsqHIu19A0PF+2nGWXHovvofcNwH1DwvW2q+Njf6dX7hjlly1drUyNtx6ll\ny1dqU+XbMTzvkMMRDtbQsb3ffvv50NBQybE7e/ZsX7NmTUnZ+vXrPZ/Pe7m5c+f6ihUrSspWrVrl\n+Xzet23bVlK+cOFCX7JkSUlZvW2qUCgU1d3w/hipry7eHzPi5fBCoVB1OwqFQqbbsWjRopK2kMvl\nfNKkSan2C+ZR46uLme0G3uXuFd++mVkOKBQKBXK53D6v33jjjRx//PHAA8Dz6s4Dhq8RKhD1I/W4\nFjgppXWsAqbWuQ6AdcB5KeXS6nUUr6eRehmuk1avYxA4iWrHdL0GBgbo6ekB6HH3gdRW3AKj9Qvx\nPBX7hsMOm8L9988BPl1H5EeAg+P/13PMNnq8h7J8vcd3Nsd2M+xtP/XW3QDQE9y2p90vJD6lb2YT\ngJezd5R9qZm9HnjI3bc2mlBnmEpjA2Rjp77D1Ui9DNdJq9chlahfCImO
b6msnu/w3wTcwt7TQJ+P\ny68i+jJERLqP+gWRwCW+St/dv+fuY9x9v7KpCxr1yg6J0aw4itEtOrdfCH3fh54frFwZeo6h55ce\n/SwvkWZ8tdqsr287ZVs6JYaEKfR9H3p+0ffQYQs9v/RowE/k4g6J0aw4iiHtLvR9H3p+cPHFoecY\nen7p0YAvIiLSBTTgi4iIdAEN+CIiIl1AA34izXjEd7MeI94p29IpMSRMoe/70POD3t7Qcww9v/Ro\nwE/kjA6J0aw4iiHtLvR9H3p+cMYZoecYen7p0YCfyMwOidGsOIoh7S70fR96fjBzZug5hp5fejTg\ni4iIdAEN+CIiIl1AA34iazskRrPiKIa0u9D3fej5wdq1oecYen7p0YCfyNIOidGsOIoh7S70fR96\nfrB0aeg5hp5feuoa8M3sdDO7x8weM7MfmtlfpJ1YmJ7bITGaFUcxuk3n9Q2h7/vQ84PnPjf0HEPP\nLz2JB3wzO4Ho0ZeLgDcCdwDrzWxiyrmJSBtR3yAStno+4c8DLnP3q919M/ARYBd65rVIt1PfIBKw\nRAO+mR0A9ADfGS5zdwe+DUxPNzURaRfqG0TCt3/C+ScC+wEPlJU/ALyywvzjAAYHByuu7K677or/\ndy4wPmETlEfpAAAgAElEQVQqlawDKsca3a01rONW4NoG11FLHi9MaT0jrWO0ballHbXkcG0K6xkp\nl1q2Y7R11LLccIx7gOrHdL2K1jcu1RU3Typ9w5NP/gn4JvCHOlL4U9H/6znWKh0fzTi+Glm+OL9G\n40fH9rp16+o+vseMGcPu3btLM7z1Vq69dvQ6rLRsre655574f/Xu93VA+u26Uan3C+5e8wQcBuwG\n3lxWvhT4QYX53we4Jk2aap7el6RNhjKhvkGTpiynVPqFpJ/wtwNPA5PKyicB91eYfz3wfuA3wOMJ\nY4l0k3HAi4naTDtS3yCSvlT7BYvfbde+gNkPgR+5+1nx3wZsAfrc/XNpJCUi7Ud9g0jYkn7CB7gI\nuNLMCsBtRFfmjgeuTDEvEWk/6htEApZ4wHf3L8e/q/000em6nwLHufu2tJMTkfahvkEkbIlP6YuI\niEj70b30RUREukAmA76ZjTGzz5jZ3Wa2y8zuNLNPNbjOo8zsRjO718x2m1lvhXk+bWb3xTG/ZWYv\nTzOOme1vZkvN7GdmtiOe5yozOyztbSma9z/ieT6Wdgwzm2pmN5jZw/H2/MjMXlhpffXEMLMJZrbc\nzLbG++SXZvbhhNtxrpndZmZ/NLMHzGyNmb2iwnx17/vRYqS432valqL569r3IanhGLkiLi+e1jUx\nv8yPr6zzC6AOP2Jmd5jZI/H0fTP7q7J5WlJ/teTX6vqrkO85cQ4XlZU3XIdZfcI/B/gwMBd4FXA2\ncLaZndHAOicQfSc4l+h3iSXM7BPAGcBpwBHATqL7eD8jxTjjgTcA5xPdK/zdRDcVuSHFGHuY2buB\nNwP3Jlz/qDHM7GXARuBXwAzgdcBnSPYTqdG24wvATKLfXL8q/nu5mb0zQYyjgGVE9fAO4ABgg5kd\nWLQtje770WKktd9H3ZaibWpk34eklmP9m0Tf+R8aTyc2JzWgOcdXpvnFWlmHW4FPADmiuy3eDNxg\nZlOh5fU3an6xVtbfHhY9bOo0oudQFJenU4cZ3YTj68B/lpV9Bbg6pfXvBnrLyu4D5hX9/SzgMWB2\nmnEqzPMmot8fvzDNGMALiH7SNJXoFlgfS7m+VgNXpbjPK8X4OfDJsrKfAJ9uIM7EONZbs9r3lWKk\nvd9HipPmvg9pqnKMXAF8rdW5NfP4yiC/oOowzun3wMmh1V+V/IKoP+Ag4NfAscAtwEVFr6VSh1l9\nwv8+8HYzOxzAzF4PHMnw/QtTZmYvIXpXVnwf7z8CPyL7+3gfTPTJ5eG0VmhmBlwNXOjuqd/rMV7/\n3wD/Z2Y3xacKf2hmx6cc6vtAr5k9P477NuBwGruJxHB9PxSvM4t9XxJjlHka2e/7xMl63wfqmPgY\n3Gxml5jZn7Uwl2YcX6nlVySIOrTo69z3Ep0V+35o9VeeX9FLIdTfxcDX3f3m4sI067Ce3+HXYgnR\nO5DNZvY00VcHn3T36zKKdyhRI6h0H+9DM4qJmY0l2tZ+d9+R4qrPAf7k7stTXGex5xG9m/wE8Emi\nr1z+GviamR3j7htTinMmcDnwWzN7iugT8T+6+60jL1ZZPBh+Edjk7r+Ki1Pd91VilM/T8H4fIU7W\n+z403wS+SnQm42XA/wPWmdl0jz/KNEszjq9GjHDMtLwOzey1wA+I7gz3KPBud/+1mU0ngPqrll/8\ncgj1916irw3fVOHl1I7BrAb8E4i+t30v0XfEbwD+3czuc/drMorZVGa2P/BfRDtiborr7QE+RvRd\ncVaGz+ysdfe++P8/M7O3ED3SNK0B/2NE3z2+k+gU9Qzgkvg4uHnEJSu7BHg10dmirIwYI8X9vk+c\nJu37oLj7l4v+/KWZ/Ry4CziG6LRmMzXj+GpExfwCqcPNwOuBZwN/D1xtZjOaFLsWFfNz982trj+L\nLpT+IvAOd38yy1hZndK/EFji7v/l7r9092uJLtg6N6N49wNG7ffxbkhRpz8ZmJnyp/u3As8FtprZ\nk2b2JPAi4CIzuzulGNuBp9j3sVKDwJQ0ApjZOOAC4OPuvs7df+HulwDXA/PrWN9yYBZwjLv/ruil\n1Pb9CDGGX09lv48Qpxn7Pmjufg/R8dm0q7ihOcdXI0Y7Nou1og7d/Sl3v9vdb3f3TxJddHYWgdTf\nCPlVmrfZ9ddD1O4Hitr90cBZZvYnok/yqdRhVgP+eKLTt8V2ZxUv3kH3A28fLjOzZxF9uvx+teXq\nUdTpvxR4u7vX8xzPkVwN/DnRu9Hh6T6iN1HHpREgfhf5Y/Z9bOkrgKE0YhBdTXwA+x4Hw1/x1Czu\n7I4H3ubuW4pfS2vfjxQjfj2V/T5KnMz3fejiTzuHACMOainHzPz4yiq/KvM3vQ4rGAOMDaH+qhgD\njK30Qgvq79tEv5J6A3vb/U+AVcDr3f1u0qrDjK42vILoFO4sok8o7wYeBD7bwDonxBXxBqI3D/8U\n/z05fv1soisv83HlrQX+D3hGWnGIvgK5gWhQfB3RO6zh6YC0tqXC/Imv1K6hvt5F9BO8U4m+tzqD\n6IHi01OMcQvwM6J3qy8G5gC7gNMSxLiE6OHoR5XV97iieRra96PFSHG/j7otaez7kKZR2tMEojcz\nbybqJ95O1NENJqnXBvPL/PjKMr9A6vCzcX4vAl5L9B34U8Cxra6/0fILof6q5Fx+lX4641tGyU4g\nepDGPUS/F/w/ot8w79/AOo+OO4yny6YvFc2zmOgT0S6iK8Ffnmac+IAof2347xlpbkvZ/HeTfMCv\npb7mAP8b76MB4J1pxiC6OHAl0e9gdxJdz3FWwhiV1v808MGy+ere96PFiPd7+Wv17PeatqXRfR/S\nNEp7GgfcRPTp5fF4Wy8Fntv
E/DI/vrLML5A6XBHHfSzOYwPxYN/q+hstvxDqr0rON1M04KdVh7qX\nvoiISBfQvfRFRES6gAZ8ERGRLqABv0tY9DCGhWVlbzKzWy16IMzTZvbncflfmdntZvZYXP6s1mQt\nIiJp6ZoB38z+wfY+CektVebZGr9+Y7PzS8LMflO0LU+b2R8sepLbZWZ2RJXFnKKHl8Q/M/sK8Byi\nK6c/AAzFt5S8nujCkLlx+c4st0dERLKX1Z32QvYY0V0AS36/aGZHEz20JMnT4lrFgduBfyO6IcMz\niR608h7gH83sIncvv7nNgUQ/RRn2MqKb7HzI3a8YLjSz44huu/spd2/2nc5ERCQj3TjgrwPeY2Yf\nc/fdReXvI/r95cTWpJXYve6+urjAokco9gMfN7P/c/fLhl9z9z+VLT9816ZHaiyvm5mNd/ddaa1P\nRESS65pT+jEneizsIcBfDhea2QFE91fuJ/rEXMIi/2Rmv4i/177fzP7DzA4um6/XzL5hZvea2eNm\ndqeZfcrMxpTN9934FPxUM7vFzHaa2W/NbEFDG+f+BPBBoidpfbIs5p7v8M3sCuC7cX18JX7tFjO7\nBbgyXuQncfmXitbxZouervdwnPN3y78eMbPF8XJTzazfzB6i6N78ZvZKM/uKmf0+rssfm1m+bB3D\nX7+8xcwuMrMH4+sMvmZmh5Rvt5n9tZl9z8z+aGaPmNltZnZi2Tyj5i4i0sm6bcAH+A3wQ6B4QJhF\n9HS/ak/zuxxYSjRwfYzopiHvB24ys/2K5ptD9CSmz8fz/QT4NNGdnYo58GdET2m6Hfg40Z2dlsSn\n1Ovm7juBNcALzGxqldn+g+g+9wb8O3AS8K/xdHk8z6fi8ssAzOxY4HtEp/sXEz0X4dnAzWZW/ISn\n4esE/ovophbnAv8Zr+M1RHX/SqI6+TiwA1hrlR/Nu4zorlKLie44lgdKniJnZnOAbxA9NvSzRE8A\nvJ2iW9EmyF1EpHO18m5CTb5z0T8Q3aEqR3Qx2sNE93qG6CK1b8f/vwe4sWi5txLd7eqEsvX9ZVz+\n3qKysRXiXkr0JuCAorJb4lzeV1R2ANFdlL5cw7aU5Fjh9bPi9b+zqGw3sLDo7+E7oP1ttXoqK/81\n8N9lZWOJnip1U1HZoni911TI69tEg/H+ZeWbgM1lOewuXm9c/nmi2/8+M/77WURfPdzKCLeYrDV3\nTZo0aerkqRs/4QN8megBP+80s4OIHt96bZV5/57ozcF3zOyQ4Ylo4NoBvG14Ro9OqQNgZgfF822K\nY72qbL073L2/aNkngduIHs7SqOGnuD0zhXVhZm8ADgdWl9XBM4HvED32tpgTnxkoWsdziOrqv4Bn\nl61nA3C4mR1Wto7LKbUR2I/oVrcQvek6iOjJjOXXKNSbu4hIR+rGi/Zw9+1m9m2iC/UmEH218ZUq\nsx9OdLr4wUqrIrpfPABm9mqiU+VvI/r0WTzfs8uW/W2F9f2B6BR2ow6K/300hXVBVAcQPc2tkt1m\n9mx3L77Q756yeV5O9BXCZ4i+Oig3XJfFT6jaWjbP8BPqnhP/+7L4319WyQvqy11EpON05YAf6yf6\nbvkw4JvuXm1wHEP0POL3UeGCPmAbgJk9G/gforMBnyJ6CMPjRM86XsK+10uUPzZ2WKUYSQ2/abgz\nhXXB3tz/meg50pWUPxv+sSrr+DeiBz9UUp5vpToyktVRPbmLiHScbh7w1xCddn4zcMII891F9MjE\n7xefsq/gGKJPnse7+63DhWb2sqpLZMDMJhA9+naLu29OabV3xf8+6u4317mOu+N/n2xgHVB08yCi\nvIzokZd3V549ldxFRNpet36Hj0dXs3+E6Krtr48w65eJ3hgtLH/BzPaLP9lD9GnUKKpTM3sG0QWC\nTWFm44BVRG88Lkhx1QWigXN+/IaiPO6o9y5w921EPwX8sJkdWs86KthA9LXFuWY2tso8DecuItIJ\nuu0TfsmpYHe/ZrQF3P1/zOwy4Jz4ArANwJPAK4gu6PsY8DWiO/f9AbjazPrixU+i9BNpml5gZu+P\n/38Q8GqiO+1NAv7N3Vc0sO7yenIzO5XopkW/jH/Hfy/RnQnfRnSlfKWf1ZU7nejCu5+b2X8SfSqf\nBEyP1/XGajlUKnf3R81sHtFXMz82s36iffB64EB3PznF3EVE2lq3Dfi1DL4l95wHcPePmtlPgA8T\nfXJ+iuj3/FcT/SQMd3/IzP6G6KdjnyEaeK4Bbqbyd9bVcqn1DcIb4vhO9Cl3K3ADsNLdf1LLdiXJ\nwd2/Z2bTgfOIBu6DgPuBH1F2RX417j4Y/+59EdFP7w4huhjydqL7FSTOzd2/ZGYPAOcQXTvxJLAZ\n+EKauYuItDtzz+oDqIiIiIQi0Xf4ZnaP7X1KW/G0LKsERUREpHFJT+m/iejGJ8NeR/Sd9pdTy0hE\nRERSl2jAd/ffF/8dP/TkLnffWGURERERCUDdP8uz6Alz7wdWppeOiIiIZKGR3+G/m+h2sVellIuI\niIhkpO6r9M3sJuAJd6/6G+b4ISXHEf2E7fG6Aol0h3HAi4H15V+diYikoa7f4ZvZFOAdRLdwHclx\nVH8KnYjs6/1Ez3kQEUlVvTfeOYXogTLrRpnvNwCrVq1i6tSpiYNcdtllfOlLX+Opp6o9a2U064F/\nYdOmTRx44IFV5/roRz/KpZdeWmeM7IWeH4SfY+j5DQ4OctJJJ0HcZkRE0pZ4wDczA+YAV7r77lFm\nfxxg6tSp5HK5xMk9//nPJ7odffJlI/8LwBve8AYmTNjnNup7TJo0qa78miX0/CD8HEPPr4i++hKR\nTNRz0d47gMnAFSnnIiIiIhlJ/Anf3b9F6c13REREJHBd+3jcYnfeeWerUxhR6PlB+DmGnp+ISNY0\n4ENdFxQ2U+j5Qfg5hp6fiEjWNOADX/3qV1udwohCzw/CzzH0/EREsqYBX0REpAtowBcREekCGvCB\n7du3tzqFEYWeH4SfY+j5iYhkTQM+cMopp7Q6hRGFnh+En2Po+YmIZE0DPrB48eJWpzCi0POD8HMM\nPT8RkawlHvDN7Plmdo2ZbTezXWZ2h5m1xT1Lqwn9lquh5wfh5xh6fiIiWUt0pz0zOxi4FfgO0ZPw\ntgOHA39IPzURERFJS9Jb654DbHH3U4vKhlLMR0RERDKQ9JR+HviJmX3ZzB4wswEzO3XUpQK3cuXK\nVqcwotDzg/BzDD0/EZGsJR3wXwp8FPg1MBO4FOgzsw+knVgzDQwMtDqFEYWeH4SfY+j5iYhkLekp\n/THAbe5+Xvz3HWb2WuAjwDWpZtZEF198catTGFHo+UH4OYaen4hI1pJ+wv8dMFhWNghMGWmhWbNm\n0dvbWzJNnz6dtWvXlsy3YcMGent7K6zhdKD8lOwA0Et03WCxRcDSkpKtW7fS29vL5s2bS8qXLVvG\nggULSsp27dpFb28vmzZtKilfvXo1J5988j6ZnXDCCTVvx+mnn77PqeWBgQF6e3v3uTHM
okWLWLq0\ndDu2bNmi7eiA7Vi9evWedtDT08OUKVOYN2/ePvmJiKTJ3L32mc2uBV7o7kcXlX0B+At3f2uF+XNA\noVAo1PWzqPPPP58LLricJ5+8N/GykeuAE9mxYwcTJkyocx0i2RsYGKCnpwegx931/YOIpC7pJ/wv\nANPM7Fwze5mZvQ84FViefmoiIiKSlkQDvrv/BHg3cCLwc+CTwFnufl0GuTVN5a8RwhF6fhB+jqHn\nJyKStaQX7eHu64B1GeTSMmeccUarUxhR6PlB+DmGnp+ISNZ0L31g5syZrU5hRKHnB+HnGHp+IiJZ\n04AvIiLSBTTgi4iIdAEN+LDP77ZDE3p+EH6OoecnIpI1DfhEN3EJWej5Qfg5hp6fiEjWNOAD119/\nfatTGFHo+UH4OYaen4hI1jTgi4iIdAEN+CIiIl1AA76IiEgXSDTgm9kiM9tdNv0qq+SapdJT10IS\nen4Qfo6h5ycikrXEt9YFfgG8HbD476fSS6c1Qr8LW+j5Qfg5hp6fiEjW6hnwn3L3baln0kInnnhi\nq1MYUej5Qfg5hp6fiEjW6vkO/3Azu9fM7jKzVWY2OfWsREREJFVJP+H/EJgD/Bo4DFgM/I+Zvdbd\nd6abWufYsmUL27dvr3v5J554grFjx9a9/MSJE5kyZUrdy4uISAdw97on4NnAw8DJVV7PAT5p0iTP\n5/Ml07Rp03zNmjVebP369Z7P5/f8vXjxYj/ggOc7zHVY4eBFU8Eh77CtrHyhw5L4/6sd8MHBQc/n\n8z44OFgSr6+vz+fPn+8bN27cU7Zz507P5/MlZe7u/f39PmfOHC83e/bsEbdjaGjIx40b70AD05iG\nlh83brwPDQ01tB3F5s6d6ytWrCgpW7Fihefzed+2bVtJ+cKFC33JkiUlZUNDQyPuj2Jp7Y8jjzyy\npu0oFAqZb0d/f/+edpDL5Xzy5Mk+Y8aM4f2V8wbapCZNmjRVm8zdG3rDYGa3Ad9y909WeC0HFAqF\nArlcLvG6zz//fC644HKefPLeOrO7DjiRHTt2MGHChKpz9fb2cuONN9YZY2QDAwP09PQAq4Cpdaxh\nHXBeA8sPAidR7z6oVZZ1mIbQ89t7nNDj7gOtzkdEOk89F+3tYWYHAS8Hrk4nnda47rrrmhBlKtEJ\nj6QGG1y+OZpTh/ULPT8Rkawl/R3+58xshpm9yMzeAqwBngTa+skk48ePb3UKbS/0Ogw9PxGRrCX9\nhP9CoB84BNgGbAKmufvv005MRERE0pNowHd3/ZhZRESkDele+sCCBQtanULbC70OQ89PRCRrGvBB\nv1FPQeh1GHp+IiJZ04APnHnmma1Ooe2FXoeh5ycikjUN+CIiIl1AA76IiEgX0IAPbN68udUptL3Q\n6zD0/EREsqYBHzj77LNbnULbC70OQ89PRCRrGvCB5cuXtzqFthd6HYaen4hI1hoa8M3sHDPbbWYX\npZVQK+gnW40LvQ5Dz09EJGt1D/hm9hfAacAd6aUjIiIiWahrwI+fkrcKOBV4ONWMREREJHX1fsK/\nGPi6u9+cZjKtsnTp0lan0PZCr8PQ8xMRyVrSp+VhZu8F3gC8Kf10WmPXrl1VX9uyZQvbt2+ve92D\ng4Ojz9QBRqrDEISen4hI5ty95ono8bj3A68tKrsFuKjK/DnAJ02a5Pl8vmSaNm2ar1mzxoutX7/e\n8/n8nr8XL17sBxzwfIe5DiscvGgqOOQdtpWVL3RYEv9/tQM+ODjo+XzeBwcHS+L19fX5/PnzS8p2\n7tzp+XzeN27c6ENDQz5u3HgHUpgKdW7HqqLlh+J5B8vm7XOYX1a2M553hQNeKBTc3b2/v9/nzJnj\n5WbPnj3q/hg2d+5cX7FiRUlZoVDwfD7v27ZtKylfuHChL1mypKRsaGiorv1RrJ23o7+/f087yOVy\nPnnyZJ8xY8bwsZLzBG1SkyZNmmqdzN1rfnNgZscDXwOeBiwu3i/uqJ4GxnrRCs0sBxQKhQK5XK7m\nOMPOP/98Lrjgcp588t7Ey0auA05kx44dTJgwIfHSAwMD9PT0EF2uMLXOHNYB5wEFovc/SV0LnNTA\n8gNAD/XuA2mOvccaPe4+0Op8RKTzJD2l/23gdWVlVwKDwBJP8u6hrUylvsEWoqoRERFprUQX7bn7\nTnf/VfEE7AR+7+5tO7I18h29REKvw9DzExHJWhp32mv7T/WnnHJKq1Noe6HXYej5iYhkLfFV+uXc\n/dg0EmmlxYsXtzqFthd6HYaen4hI1nQvfdDFbCkIvQ5Dz09EJGsa8EVERLqABnwREZEuoAEfWLly\nZatTaHuh12Ho+YmIZE0DPtFNT6Qxoddh6PmJiGRNAz5w8cUXtzqFthd6HYaen4hI1jTgi4iIdAEN\n+CIiIl0g0YBvZh8xszvM7JF4+r6Z/VVWyYmIiEg6kn7C3wp8guhJMj3AzcANZlbvo+SC0Nvb2+oU\n2l7odRh6fiIiWUt0a113/++yok+Z2UeBabTxY+HOOOOMVqfQ9kKvw9DzExHJWt330jezMcBsYDzw\ng9QyaoGZM2e2OoW2F3odhp6fiEjWEg/4ZvZaogF+HPAo8G5335x2YiIiIpKeeq7S3wy8HjgCuBS4\n2sxelWpWIiIikqrEA767P+Xud7v77e7+SeAO4KyRlpk1axa9vb0l0/Tp01m7dm3JfBs2bKhycdXp\nQPmtUQeAXmB7WfkiYGlJydatW+nt7WXz5tITEcuWLWPBggUleezatYve3l42bdpUtt7VwMkVcjsB\nWFtWtiHOLd3tgC3xvOUnVJYBC8rKdsXz3l5Sunr1ak4+ed/tOOGEE2reH6effvo+t6r9/Oc/T29v\nL9u3l27HokWLWLq0dDu2bNky4v4o2Yoq+yPpdhxxxBE1bcfAwEDm27F69eo97aCnp4cpU6Ywb968\nffITEUmVuzc0Ad8BvlTltRzghULB67F48WI/4IDnO3id02oHfMeOHSPGmT17dsXyQqHggEOhgRxW\nNbiORpePtqHefVCranUYitDz23uskfMG26QmTZo0VZoSfYdvZp8Fvkn0UfOZwPuBo4G2viLq+uuv\nb3UKbS/0Ogw9PxGRrCW9aO95wFXAYcAjwM+Ame5+c9qJiYiISHqS/g7/1KwSERERkezoXvoiIiJd\nQAM+VLzaW5IJvQ5Dz09EJGsa8NFd2NIQeh2Gnp+ISNY04AMnnnhiq1Noe6HXYej5iYhkTQO+iIhI\nF9CALyIi0gU04EOF2+hKUqHXYej5iYhkTQM+cOGFF7Y6hbYXeh2Gnp+ISNYSDfhmdq6Z3WZmfzSz\nB8xsjZm9IqvkmuW6665rdQptL/Q6DD0/EZGsJf2EfxTRo9neDLwDOADYYGYHpp1YM40fP77VKbS9\n0Osw9PxERLKW9Na6s4r/NrM5wINAD6AvSUVERALV6Hf4BxM90vOhFHIRERGRjNQ94JuZAV8ENrn7\nr9JLqfkWLFjQ6hTaXuh1GHp+IiJZS/p43GK
XAK8Gjkwpl5aZMmVKq1Noe6HXYej5iYhkra5P+Ga2\nHJgFHOPuvxtt/lmzZtHb21syTZ8+nbVr15bMt2HDBnp7eyus4XRgZVnZANALbC8rXwQsLSnZunUr\nvb29bN68uaR82bJlLFiwgDPPPHNP2a5du+jt7a3wu+3VQKUHsJwArC0r2xDnlu52wJZ43s1l5cuA\n8k+wu+J5by8pXb16dcUHyZxwwgk174/TTz+dlStLt+PII4+kt7eX7dtLt2PRokUsXVq6HVu2bBlx\nf5RsRZX9kXQ7vvWtb9W0HQMDA5lvx+rVq/e0g56eHqZMmcK8efP2yU9EJFXunmgClgNbgZfWMG8O\n8EKh4PVYvHixH3DA8x28zmm1A75jx4664hcKBQccCg3ksKrBdTS6fLQN9e4DaY69xxo5T9gmNWnS\npKmWKdEpfTO7BDiR6KPjTjObFL/0iLs/3vjbDxEREclC0lP6HwGeBXwXuK9omp1uWs1VfkpWkgu9\nDkPPT0Qka4kGfHcf4+77VZiuzirBZjj77LNbnULbC70OQ89PRCRrupc+sHz58lan0PZCr8PQ8xMR\nyZoGfPSTrTSEXoeh5ycikjUN+CIiIl1AA76IiEgX0IAP+9xQRZILvQ5Dz09EJGsa8InugiaNCb0O\nQ89PRCRrGvCB888/v9UptL3Q6zD0/EREsqYBX0REpAtowBcREekCiQd8MzvKzG40s3vNbLeZVXos\nXFspfzKaJBd6HYaen4hI1ur5hD8B+Ckwl+jpXm3vlFNOaXUKbS/0Ogw9PxGRrCV6Wh6Au98E3ARg\nZpZ6Ri2wePHiVqfQ9kKvw9DzExHJmr7DB3K5XKtTaHuh12Ho+YmIZC3xJ/x29NOf/pQDDzww8XKD\ng4MZZNN9tmzZ0tB36BMnTmz4Xvgh5CAi0lLuXvcE7AZ6R3g9B/ikSZM8n8+XTNOmTfM1a9Z4sfXr\n13s+n9/z9+LFi/2AA57vMNdhhYMXTQWHvMO2svKFDkvi/y9zGONE1xo0MBUc+h3mlMVyh9kOa8rK\n1se5ucOqonXUsx3Fyw/F8w6WzdvnML+sbGc87woHvFAouLt7f3+/z5kzx8vNnj171P0xbO7cub5i\nxYqSskKh4Pl83rdt21ZSftZZZ/n++x/QUP2PGbOff+UrXylZb5LtuOaaa3zMmP0aymG//fb3c845\np/engQoAAAaRSURBVGS9Q0NDns/nfXBwsKS8r6/P58+fX1K2c+dOz+fzvnHjRu/v79/TDnK5nE+e\nPNlnzJgxHCvnDbRJTZo0aao2mXv9192Z2W7gXe5+Y5XXc0ChUCjUdUr1/PPP54ILLufJJ++tM8P/\nB/wLsAqYOsJ8a4F3VShfB5wHFIjeu9TjWuCkBtbR6PIDQA/17oNarVy5kg996EP7Rh8YoKenh9H3\nQTWDwEkN5R9CDqPZmyM97j6QSRAR6WpdcUo/6uRH6qhXVnldp/RrNTAwUHHA32u0fdAMIeQgItIa\niQd8M5sAvBwYvkL/pWb2euAhd9+aZnLNc3GrE2h7F1+sOhQRCVk9n/DfBNzC3u83Px+XXwXox84i\nIiIBqud3+N9DP+cTERFpKxq4RUREuoAGfADa/nEALdfbqzoUEQmZBnwAzmh1Am3vjDNUhyIiIdOA\nD8DMVifQ9mbOVB2KiIRMA76IiEgX0IAvIiLSBTTgA9GtdaURa9eqDkVEQqYBH4ClrU6g7S1dqjoU\nEQlZXQO+mZ1uZveY2WNm9kMz+4u0E2uu57Y6gbb33OeqDkVEQpZ4wDezE4hup7sIeCNwB7DezCam\nnJuIiIikpJ5P+POAy9z9anffDHwE2IXuoy8iIhKsRAO+mR0A9ADfGS5zdwe+DUxPNzURERFJS9KH\n50wE9gMeKCt/AHhlhfnHAQwO1vdc+fvuu4+nn94JXF7X8nBb/O86Rn62/a3AtVXKa1l+JI2uo9Hl\n74mWXreu7v0wZswYdu/ePeI8t956K9deu28d3nPPPfH/Wpd/WjnUG78WResel1kQEelqFn1Ar3Fm\ns8OAe4Hp7v6jovKlwAx3n142//uoPJKKSGXvd/f+VichIp0n6Sf87cDTwKSy8knA/RXmXw+8H/gN\n8HjS5ES6yDjgxURtRkQkdYk+4QOY2Q+BH7n7WfHfBmwB+tz9c+mnKCIiIo1K+gkf4CLgSjMrEH1J\nPg8YD1yZYl4iIiKSosQDvrt/Of7N/aeJTuX/FDjO3belnZyIiIikI/EpfREREWk/upe+iIhIF9CA\nLyIi0gUyHfBDfciOmS0ys91l069anNNRZnajmd0b59NbYZ5Pm9l9ZrbLzL5lZi8PJT8zu6JCna5r\nYn7nmtltZvZHM3vAzNaY2SsqzNfKOhw1x1bXo4h0rswG/DZ4yM4viC46PDSe3tradJhAdAHkXGCf\nCyvM7BPAGcBpwBHATqL6fEYI+cW+SWmdntic1AA4ClgGvBl4B3AAsMHMDhyeIYA6HDXHWCvrUUQ6\nVGYX7VX5vf5Wot/rX5hJ0NpzWwQc7+65VuZRjZntBt7l7jcWld0HfM7dvxD//SyiWxr/g7t/OYD8\nrgCe7e5/28xcqonfWD5IdAfITXFZMHU4Qo5B1aOIdI5MPuG3yUN2Do9PT99lZqvMbHKrE6rGzF5C\n9EmvuD7/CPyIcOoT4Jj4VPVmM7vEzP6shbkcTHQm4iEItg5LciwSUj2KSIfI6pT+SA/ZOTSjmEn8\nEJgDHEf0eN+XAP9jZhNamdQIDiUaGEKtT4hOQ38QOBY4GzgaWBef2WmqOOYXgU3uPnxtRlB1WCVH\nCKgeRaSz1HOnvbbn7sX3K/+Fmd0GDAGzgStak1V7Kzsl/ksz+zlwF3AMcEuT07kEeDVwZJPjJlEx\nx8DqUUQ6SFaf8JM+ZKel3P0R4H+Bpl2xndD9gNEm9Qng7vcQHQdNrVMzWw7MAo5x998VvRRMHY6Q\n4z5aVY8i0nkyGfDd/UmgALx9uCw+Jfl24PtZxGyEmR1E1KGO2Pm2Stzp309pfT6L6Grv4OoTwMxe\nCBxCE+s0HkiPB97m7luKXwulDkfKscr8Ta9HEelMWZ7SD/YhO2b2OeDrRKfxXwCcDzwJrG5hThOI\n3nQMf1f7UjN7PfCQu28l+r73U2Z2J9Hjhj8D/Ba4odX5xdMi4KtEg+rLgaVEZ02a8rhXM7uE6Odr\nvcBOMxv+JP+Iuw8/mrnVdThijnEdt7QeRaSDuXtmE9Fvtn8DPAb8AHhTlvES5LWaqKN/jOjRvv3A\nS1qc09HAbqKvQoqnLxXNsxi4D9hFNAC8PIT8iJ7lfhPRIPU4cDdwKfDcJuZXKbengQ+WzdfKOhwx\nxxDqUZMmTZ076eE5IiIiXUD30hcREekCGvBFRES6gAZ8ERGRLqABX0REpAtowBcREekCGvBFRES6\ngAZ8ERGRLqABX0REpAtowBcREekCGvBFRES6gAZ8ERGRLvD/AWsdJeqxsbIcAAAAAElFTkSuQmCC\n",
 283 | "text/plain": [ 284 | "" 285 | ] 286 | }, 287 | "metadata": {}, 288 | "output_type": "display_data" 289 | } 290 | ], 291 | "source": [ 292 | "stroop.hist()" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "It can be observed that time taken for incongruent test are in general longer than congruent test evident by the range of the data.\n", 300 | "\n", 301 | "In congruent test, the time taken ranges from 8 seconds to 22 seconds and majority of the time taken is less than 14 seconds.\n", 302 | "\n", 303 | "In incongruent test, the range start from 15 seconds to 35 seconds and majority of the time taken us around 15 seconds to 25 seconds." 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "metadata": {}, 309 | "source": [ 310 | "#### 5. Now, perform the statistical test and report your results. What is your confidence level and your critical statistic value? Do you reject the null hypothesis or fail to reject it? Come to a conclusion in terms of the experiment task. Did the results match up with your expectations?" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "α = **0.05**\n", 318 | "\n", 319 | "df = **23**\n", 320 | "\n", 321 | "t-critical = **1.714**\n", 322 | "\n", 323 | "SE = stdev_(b-a) / √n\n", 324 | " = 4.86 / √24 = **0.99**\n", 325 | "\n", 326 | "t-statistic = mean difference / (stdev_(b-a) / sqrt(n)) = 7.96 / (4.86 / √24 ) = **8.02**\n", 327 | "\n", 328 | "From the t-table, p value is **<0.0005**.\n", 329 | "\n", 330 | "As a result, the **null hypothesis should be rejected**. Congruent words do take shorter response time to recognize its ink color than incongruent words.\n", 331 | "\n", 332 | "The result matches up with my expectations that congruent words are easier to identify its ink color." 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "#### 6. Optional: What do you think is responsible for the effects observed? Can you think of an alternative or similar task that would result in a similar effect? Some research about the problem will be helpful for thinking about these two questions!" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "Our brain processes word faster than recognizing color. Thus, we tend to be able to identify the ink color of congruent words before we even manage to identify the ink color.\n", 347 | "\n", 348 | "One similar examplpe would reading the musical scores. Playing violin as my first instrument, I am used to using treble clef. However, I have to switch to bass clef when playing cello. Notes on the same location point to different tones oon different clefs. Coupled with different string combinations (G-D-A-E in violin vs C-G-D-A in cello), it can be quite confusing when I switch between instrument immediately. It takes a while to familiarize with different clefs and string combination." 
349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "https://s3.amazonaws.com/udacity-hosted-downloads/t-table.jpg\n", 356 | "https://link.springer.com/article/10.3758%2FMC.38.7.893\n", 357 | "https://makingmusicmag.com/explanation-clefs-treble-bass-alto-tenor\n" 358 | ] 359 | } 360 | ], 361 | "metadata": { 362 | "kernelspec": { 363 | "display_name": "Python [conda env:DAND]", 364 | "language": "python", 365 | "name": "conda-env-DAND-py" 366 | }, 367 | "language_info": { 368 | "codemirror_mode": { 369 | "name": "ipython", 370 | "version": 2 371 | }, 372 | "file_extension": ".py", 373 | "mimetype": "text/x-python", 374 | "name": "python", 375 | "nbconvert_exporter": "python", 376 | "pygments_lexer": "ipython2", 377 | "version": "2.7.12" 378 | } 379 | }, 380 | "nbformat": 4, 381 | "nbformat_minor": 2 382 | } 383 | -------------------------------------------------------------------------------- /6-Inferential-Statistics/stroopdata.csv: -------------------------------------------------------------------------------- 1 | Congruent,Incongruent 2 | 12.079,19.278 3 | 16.791,18.741 4 | 9.564,21.214 5 | 8.630,15.687 6 | 14.669,22.803 7 | 12.238,20.878 8 | 14.692,24.572 9 | 8.987,17.394 10 | 9.401,20.762 11 | 14.480,26.282 12 | 22.328,24.524 13 | 15.298,18.644 14 | 15.073,17.510 15 | 16.929,20.330 16 | 18.200,35.255 17 | 12.130,22.158 18 | 18.495,25.139 19 | 10.639,20.429 20 | 11.344,17.425 21 | 12.369,34.288 22 | 12.944,23.894 23 | 14.233,17.960 24 | 19.710,22.058 25 | 16.004,21.157 26 | -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/Identify Fraud from Enron Email.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Identify Fraud from Enron Email\n", 8 | "### Enron Submission Free-Response Questions\n", 9 | "##### by Kai Sheng TEH" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "**1. Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those? [relevant rubric items: “data exploration”, “outlier investigation”]**" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "The goal of the project is to use Machine Learning to identify *person of interest (POI)* in the Enron fraud case through the use of financial information and email text files available to the public.\n", 24 | "\n", 25 | "There are 146 entries and 20 features in the dataset. Of those features, 14 are financial features denoted in U.S. dollar while 5 are text features in terms of number of emails sent or received. \n", 26 | "\n", 27 | "There is also a feature named 'poi' which identify whether the person is idnetified as POI in the real world situation. 18 of them are POIs while 128 are non-POIs. 
The hope is that the financial and email features of POI individuals differ from those of everyone else, so that they can serve as red flags.\n", 28 | "\n", 29 | "The features were iterated over to count how many 'NaN' values each one contains.\n", 30 | "\n", 31 | "| Features | NaN Count |\n", 32 | "| ------------------------- |:--------- |\n", 33 | "| loan_advances | 142 |\n", 34 | "| director_fees | 129 |\n", 35 | "| restricted_stock_deferred | 128 |\n", 36 | "| deferral_payments | 107 |\n", 37 | "| deferred_income | 97 |\n", 38 | "| long_term_incentive | 80 |\n", 39 | "| bonus | 64 |\n", 40 | "| shared_receipt_with_poi | 60 |\n", 41 | "| from_messages | 60 |\n", 42 | "| from_poi_to_this_person | 60 |\n", 43 | "| from_this_person_to_poi | 60 |\n", 44 | "| to_messages | 60 |\n", 45 | "| other | 53 |\n", 46 | "| salary | 51 |\n", 47 | "| expenses | 51 |\n", 48 | "| exercised_stock_options | 44 |\n", 49 | "| restricted_stock | 36 |\n", 50 | "| total_payments | 21 |\n", 51 | "| total_stock_value | 20 |\n", 52 | "| poi | 0 |\n", 53 | "\n", 54 | "Through exploratory data analysis, an outlier was identified when plotting a scatterplot of bonus vs salary. Going through the pdf file, the extremely high salary and bonus belong to the 'TOTAL' entry. It was removed together with 'The Travel Agency in the Park' using the *data_dict.pop* function. As such, only 144 entries are left in the dataset." 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "**2. What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values. [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]**" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "3 new features were created. They are:\n", 69 | "1. bonus-salary ratio (bonus / salary)\n", 70 | "2. from_this_person_to_poi percentage (from_this_person_to_poi / from_messages)\n", 71 | "3. from_poi_to_this_person percentage (from_poi_to_this_person / to_messages)\n", 72 | "\n", 73 | "If a person is a POI, the bonus-to-salary ratio might be very high. 
Similarly, POIs might also interact through emails more frequently among themselves.\n", 74 | "\n", 75 | "Scaling was done using MinMaxScaler to ensure the impact of each variable is not dominated by any single variables with high values such as bonus or salary.\n", 76 | "\n", 77 | "Select K Best and Decision Tree methods were used to determine the feature scores as below:\n", 78 | "\n", 79 | "| SelectKBest | Score | Decision Tree | Score |\n", 80 | "|:----------------------------- | :------------ |:----------------------------- | :------------ |\n", 81 | "| exercised_stock_options\t | 24.81507973 | exercised_stock_options\t | 0.216485631 |\n", 82 | "| total_stock_value\t | 24.18289868\t| expenses\t | 0.180751044 |\n", 83 | "| bonus\t | 20.79225205\t| shared_receipt_with_poi\t | 0.178337845 |\n", 84 | "| salary\t | 18.28968404\t| from_this_person_to_poi_ratio | 0.135602729 |\n", 85 | "| from_this_person_to_poi_ratio | 16.40971255\t| total_payments\t | 0.115005291 |\n", 86 | "| deferred_income\t |11.45847658\t| total_stock_value\t | 0.055633364 |\n", 87 | "| long_term_incentive\t |9.922186013\t| from_this_person_to_poi\t | 0.047666667 |\n", 88 | "| restricted_stock\t |9.212810622\t| long_term_incentive\t | 0.04237037 |\n", 89 | "| total_payments\t |8.77277773\t\t| restricted_stock\t | 0.028147059 |\n", 90 | "| shared_receipt_with_poi\t |8.589420732\t| salary\t | 0 |\n", 91 | "| loan_advances\t |7.184055658\t| bonus\t | 0 |\n", 92 | "| expenses\t |6.094173311\t| deferral_payments\t | 0 |\n", 93 | "| from_poi_to_this_person\t |5.243449713\t| deferred_income\t | 0 |\n", 94 | "| from_poi_to_this_person_ratio\t| 3.128091748\t| restricted_stock_deferred\t | 0 |\n", 95 | "| from_this_person_to_poi\t |2.382612108\t| loan_advances\t | 0 |\n", 96 | "| director_fees\t | 2.126327802\t| from_messages\t | 0 |\n", 97 | "| to_messages\t | 1.646341129\t| director_fees\t | 0 |\n", 98 | "| deferral_payments\t | 0.224611275\t| from_poi_to_this_person\t | 0 |\n", 99 | "| from_messages\t | 0.169700948\t| from_poi_to_this_person_ratio | 0 |\n", 100 | "| restricted_stock_deferred\t | 0.065499653 | salary_bonus_ratio\t | 0 |\n", 101 | "| salary_bonus_ratio\t | 0.000368768\t| to_messages\t | 0 |\n", 102 | "\n", 103 | "SelectKBest scoring is chosen over DecisionTree.\n", 104 | "The top 10 features with the highest scores are kept for use in POI identifier." 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "**3. What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms? [relevant rubric item: “pick an algorithm”]**" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "The algorithms that I tried are Naive-Bayes Gaussian, AdaBoost, Decision Tree, K Nearest Neighbors and Random Forest classifiers.\n", 119 | "\n", 120 | "At this point of time, only Gaussian and Decision Tree algorithms fulfil the >0.3 requirement for bothprecision and recall. 
Thus, the other algorithms need tuning in order to meet the >0.3 requirement.\n", 121 | "\n", 122 | "The performance of each algorithm before tuning is shown below:\n", 123 | "\n", 124 | "| Algorithm | Accuracy | Precision | Recall | F1 Score |\n", 125 | "| :------------------- | :------- | :-------- | :------- | :------- |\n", 126 | "| Naive-Bayes Gaussian | 0.83613 | 0.36639 | 0.31400 | 0.33818 |\n", 127 | "| Decision Tree | 0.82207 | 0.32459 | 0.30950 | 0.31687 |\n", 128 | "| AdaBoost | 0.83853 | 0.36405 | 0.28250 | 0.31813 |\n", 129 | "| K Nearest Neighbors | 0.87640 | 0.63878 | 0.16800 | 0.26603 |\n", 130 | "| Random Forest | 0.86140 | 0.44537 | 0.16100 | 0.23650 |" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "**4. What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier). [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]**" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "Tuning the parameters of an algorithm means changing its input parameters and comparing the performance of each parameter combination in order to determine the optimal parameters to use. If the parameters are not optimized, the algorithm will deliver sub-optimal performance or lower accuracy.\n", 145 | "\n", 146 | "GridSearchCV was used to tune and find the best combination of parameters for each algorithm.\n", 147 | "\n", 148 | "The Gaussian classifier doesn't have any parameters to tune, so tuning is not applicable in its case.\n", 149 | "\n", 150 | "The performance after tuning is compared below:\n", 151 | "\n", 152 | "| Algorithm | Accuracy | Precision | Recall | F1 Score | Notes |\n", 153 | "| :------------------- | :------- | :-------- | :------- | :------- | :---------- |\n", 154 | "| Naive-Bayes Gaussian | 0.83613 | 0.36639 | 0.31400 | 0.33818 | Tuning N/A |\n", 155 | "| Decision Tree | 0.82533 | 0.33351 | 0.31050 | 0.32160 | - |\n", 156 | "| AdaBoost | 0.83820 | 0.36305 | 0.28300 | 0.31807 | - |\n", 157 | "| K Nearest Neighbors | 0.87640 | 0.63878 | 0.16800 | 0.26603 | - |\n", 158 | "| Random Forest | 0.86087 | 0.44299 | 0.16900 | 0.24466 | - |\n", 159 | "\n", 160 | "Gaussian has the best performance despite not needing any tuning, so it was chosen as the POI identifier algorithm." 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "**5. What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis? [relevant rubric items: “discuss validation”, “validation strategy”]**" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "Validation is the process of testing how well a trained model performs on data that was not used to train it.\n", 175 | "\n", 176 | "Classic mistakes made in validation include overfitting and using the same data for both training and testing. Overfitting occurs when the algorithm is fit too closely to the training data. 
It performs well on training data but poorly on testing data. As such, there must be a balance between fitting the training data too closely and too loosely, so that the algorithm generalizes reasonably well when tested on unseen data.\n", 177 | "\n", 178 | "At the same time, training and testing should not be done on the same data. Instead, 10%-30% of the data should be set aside for testing. This also helps prevent overfitting.\n", 179 | "\n", 180 | "For this project, *StratifiedShuffleSplit* is used for validation in *tester.py*. The data are split into 1000 different sets, also known as folds. This allows us to maximize the amount of data available for training and testing. The performance is then averaged out." 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "**6. Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]**" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "The three evaluation metrics used extensively in this project are precision, recall and F1 score.\n", 195 | "The average precision, recall and F1 score for the Gaussian algorithm are 0.36639, 0.31400 and 0.33818 respectively.\n", 196 | "\n", 197 | "The definitions of the evaluation metrics are as below:\n", 198 | "1. Precision = true positives / (true positives + false positives) (high precision -> most people flagged as POIs really are POIs, i.e. few false alarms)\n", 199 | "2. Recall = true positives / (true positives + false negatives) (high recall -> most of the real POIs are flagged, i.e. few guilty persons escape)\n", 200 | "3. F1 score = 2 * (precision * recall) / (precision + recall), the harmonic mean of precision and recall\n", 201 | "\n", 202 | "In this case, we would want to strive for a higher precision so as not to accuse the wrong person. Machine learning is only one of the investigation methods, and police or prosecutors should not rely wholly on it. Any guilty person not picked up by the algorithm should still be subject to a thorough investigation." 203 | ] 204 | } 205 | ], 206 | "metadata": { 207 | "anaconda-cloud": {}, 208 | "kernelspec": { 209 | "display_name": "Python 2", 210 | "language": "python", 211 | "name": "python2" 212 | }, 213 | "language_info": { 214 | "codemirror_mode": { 215 | "name": "ipython", 216 | "version": 2 217 | }, 218 | "file_extension": ".py", 219 | "mimetype": "text/x-python", 220 | "name": "python", 221 | "nbconvert_exporter": "python", 222 | "pygments_lexer": "ipython2", 223 | "version": "2.7.12" 224 | } 225 | }, 226 | "nbformat": 4, 227 | "nbformat_minor": 2 228 | } 229 | -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/README.md: -------------------------------------------------------------------------------- 1 | 1. [Identify Fraud from Enron Email.ipynb](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/7-Intro-to-Machine-Learning/Identify%20Fraud%20from%20Enron%20Email.ipynb) - Jupyter Notebook containing the project 2 | 3 | 2. [resources.MD](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/7-Intro-to-Machine-Learning/resources.MD) - Resources used while completing the project 4 | 5 | 3. 
[poi_id.py](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/7-Intro-to-Machine-Learning/poi_id.py) - Code for the project 6 | 7 | 4. [tester.py](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/7-Intro-to-Machine-Learning/tester.py) - Tester code to evaluate the performance of the algorithms 8 | -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/my_classifier.pkl: -------------------------------------------------------------------------------- 1 | ccopy_reg 2 | _reconstructor 3 | p0 4 | (csklearn.ensemble.forest 5 | RandomForestClassifier 6 | p1 7 | c__builtin__ 8 | object 9 | p2 10 | Ntp3 11 | Rp4 12 | (dp5 13 | S'warm_start' 14 | p6 15 | I00 16 | sS'base_estimator' 17 | p7 18 | g0 19 | (csklearn.tree.tree 20 | DecisionTreeClassifier 21 | p8 22 | g2 23 | Ntp9 24 | Rp10 25 | (dp11 26 | S'presort' 27 | p12 28 | I00 29 | sS'splitter' 30 | p13 31 | S'best' 32 | p14 33 | sS'min_impurity_decrease' 34 | p15 35 | F0.0 36 | sS'min_impurity_split' 37 | p16 38 | NsS'min_samples_leaf' 39 | p17 40 | I1 41 | sS'max_features' 42 | p18 43 | NsS'random_state' 44 | p19 45 | NsS'criterion' 46 | p20 47 | S'gini' 48 | p21 49 | sS'min_weight_fraction_leaf' 50 | p22 51 | F0.0 52 | sS'max_leaf_nodes' 53 | p23 54 | NsS'min_samples_split' 55 | p24 56 | I2 57 | sS'_sklearn_version' 58 | p25 59 | S'0.19.1' 60 | p26 61 | sS'max_depth' 62 | p27 63 | NsS'class_weight' 64 | p28 65 | NsbsS'n_jobs' 66 | p29 67 | I1 68 | sg15 69 | F0.0 70 | sS'verbose' 71 | p30 72 | I0 73 | sg23 74 | NsS'bootstrap' 75 | p31 76 | I01 77 | sS'oob_score' 78 | p32 79 | I00 80 | sg17 81 | I1 82 | sS'n_estimators' 83 | p33 84 | I10 85 | sg24 86 | I2 87 | sg22 88 | F0.0 89 | sg20 90 | g21 91 | sS'estimator_params' 92 | p34 93 | (g20 94 | g27 95 | g24 96 | g17 97 | g22 98 | g18 99 | g23 100 | g15 101 | g16 102 | g19 103 | tp35 104 | sg19 105 | Nsg16 106 | Nsg18 107 | S'auto' 108 | p36 109 | sg25 110 | g26 111 | sg27 112 | Nsg28 113 | Nsb. -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/my_feature_list.pkl: -------------------------------------------------------------------------------- 1 | (lp0 2 | S'poi' 3 | p1 4 | aS'salary' 5 | p2 6 | aS'bonus' 7 | p3 8 | aS'deferred_income' 9 | p4 10 | aS'long_term_incentive' 11 | p5 12 | aS'shared_receipt_with_poi' 13 | p6 14 | aS'total_stock_value' 15 | p7 16 | aS'total_payments' 17 | p8 18 | aS'exercised_stock_options' 19 | p9 20 | aS'from_this_person_to_poi_ratio' 21 | p10 22 | aS'restricted_stock' 23 | p11 24 | a. 
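For reference, here is a minimal, self-contained sketch of the scaling and SelectKBest selection step described in the notebook answer above. The feature names and numbers below are toy values, not the project data; in the project itself the features/labels arrays are built by featureFormat and targetFeatureSplit in poi_id.py, k is set to 10, and the ten highest-scoring features are the ones stored in my_feature_list.pkl above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Toy stand-ins for the real features/labels produced in poi_id.py (values are made up).
feature_names = ['salary', 'bonus', 'exercised_stock_options']
features = np.array([[200000., 1000000., 5000000.],
                     [ 90000.,  200000.,  100000.],
                     [150000.,  800000., 3000000.],
                     [ 60000.,  100000.,       0.]])
labels = np.array([1, 0, 1, 0])  # 1 = POI

# Scale every feature to [0, 1] so high-magnitude fields such as bonus do not dominate.
features = MinMaxScaler().fit_transform(features)

# Score each feature against the POI label and keep the k best (k=10 in the project).
selector = SelectKBest(f_classif, k=2).fit(features, labels)
print(sorted(zip(feature_names, selector.scores_), key=lambda x: x[1], reverse=True))
print([name for name, keep in zip(feature_names, selector.get_support()) if keep])
```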
-------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/poi_id.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | import sys 4 | import pickle 5 | sys.path.append("../tools/") 6 | import matplotlib.pyplot as plt 7 | import numpy as np 8 | import random 9 | 10 | from feature_format import featureFormat, targetFeatureSplit 11 | from tester import dump_classifier_and_data 12 | 13 | from sklearn.metrics import accuracy_score 14 | from sklearn.naive_bayes import GaussianNB 15 | from sklearn.svm import SVC 16 | from sklearn import tree 17 | from sklearn.tree import DecisionTreeClassifier 18 | from sklearn.metrics import accuracy_score 19 | from sklearn.preprocessing import MinMaxScaler 20 | from sklearn.feature_selection import SelectKBest 21 | from sklearn.model_selection import cross_val_score 22 | from sklearn.ensemble import AdaBoostClassifier 23 | from sklearn.neighbors import KNeighborsClassifier 24 | from sklearn.ensemble import RandomForestClassifier 25 | from sklearn.pipeline import Pipeline 26 | from sklearn.grid_search import GridSearchCV 27 | from sklearn.feature_selection import SelectKBest, f_classif 28 | 29 | 30 | ### Task 1: Select what features you'll use. 31 | ### features_list is a list of strings, each of which is a feature name. 32 | ### The first feature must be "poi". 33 | features_list = ['poi', 34 | 'salary', 35 | 'bonus', 36 | 'deferral_payments', 37 | 'expenses', 38 | 'deferred_income', 39 | 'long_term_incentive', 40 | 'restricted_stock_deferred', 41 | 'shared_receipt_with_poi', 42 | 'loan_advances', 43 | 'from_messages', 44 | 'director_fees', 45 | 'total_stock_value', 46 | 'from_poi_to_this_person', 47 | 'from_this_person_to_poi', 48 | 'total_payments', 49 | 'exercised_stock_options', 50 | 'to_messages', 51 | 'restricted_stock', 52 | 'other'] 53 | # You will need to use more features 54 | 55 | ### Load the dictionary containing the dataset 56 | with open("final_project_dataset.pkl", "r") as data_file: 57 | data_dict = pickle.load(data_file) 58 | 59 | ### Summarize the dataset 60 | ''' 61 | print 'Total number of ppl:', len(data_dict) #How many people in the dataset 62 | print 'Number of features:', len(features_list) #How mant features in the list 63 | 64 | poi_num = 0 65 | for i in data_dict: 66 | if data_dict[i]['poi'] == True: 67 | poi_num += 1 68 | print 'Number of poi:', poi_num #Number of poi 69 | ''' 70 | 71 | ### Checking the incompleteness of the dataset (% of NaN in every feature) 72 | ### Need to remove the new features created when running the code below 73 | ''' 74 | nan = [0 for i in range(len(features_list))] 75 | for k, v in data_dict.iteritems(): 76 | for j, feature in enumerate(features_list): 77 | if v[feature] == 'NaN': 78 | nan[j] += 1 79 | 80 | for i, feature in enumerate(features_list): 81 | print 'NaN count for', feature, ':', nan[i] #Number of NaN in each feature 82 | ''' 83 | 84 | ### Task 2: Remove outliers 85 | ### After plotting and checking the pdf, "TOTAL" and "THE AGENCY IN THE PARK" 86 | ### are removed as both don't seem to help with predicting 87 | 88 | data_dict.pop('TOTAL', 0) 89 | data_dict.pop('THE TRAVEL AGENCY IN THE PARK', 0) 90 | 91 | ### Scatterplot of Bonus vs Salary 92 | ''' 93 | data = featureFormat(data_dict, ['salary', 'bonus']) 94 | for point in data: 95 | x = point[0] 96 | y = point[1] 97 | plt.scatter(x, y) 98 | plt.xlabel('salary') 99 | plt.ylabel('bonus') 100 | ''' 101 | 102 | ### Task 3: Create new 
feature(s) 103 | 104 | ### Create new salary-bonus-ratio, from_this_person_to_poi % & from_poi_to_this_person % features 105 | 106 | def ratio_calc(numerator, denominator): 107 |     fraction = 0 108 |     if numerator == 'NaN' or denominator == 'NaN': 109 |         fraction = 'NaN' 110 |     else: 111 |         fraction = float(numerator) / float(denominator) 112 |     return fraction 113 | 114 | for name in data_dict: 115 |     salary_bonus_ratio_temp = ratio_calc(data_dict[name]['salary'], data_dict[name]['bonus']) 116 |     data_dict[name]['salary_bonus_ratio'] = salary_bonus_ratio_temp 117 | 118 |     from_this_person_to_poi_ratio_temp = ratio_calc(data_dict[name]['from_this_person_to_poi'], data_dict[name]['from_messages']) 119 |     data_dict[name]['from_this_person_to_poi_ratio'] = from_this_person_to_poi_ratio_temp 120 | 121 |     from_poi_to_this_person_ratio_temp = ratio_calc(data_dict[name]['from_poi_to_this_person'], data_dict[name]['to_messages']) 122 |     data_dict[name]['from_poi_to_this_person_ratio'] = from_poi_to_this_person_ratio_temp 123 | 124 | 125 | ### Store to my_dataset for easy export below. 126 | my_dataset = data_dict 127 | 128 | ### Extract features and labels from dataset for local testing 129 | features_list = ['poi', 130 | 'salary', 131 | 'bonus', 132 | #'deferral_payments', 133 | #'expenses', 134 | 'deferred_income', 135 | 'long_term_incentive', 136 | #'restricted_stock_deferred', 137 | 'shared_receipt_with_poi', 138 | #'loan_advances', 139 | #'from_messages', 140 | #'director_fees', 141 | 'total_stock_value', 142 | #'from_poi_to_this_person', 143 | #'from_this_person_to_poi', 144 | 'total_payments', 145 | 'exercised_stock_options', 146 | #'from_poi_to_this_person_ratio', 147 | 'from_this_person_to_poi_ratio', 148 | #'salary_bonus_ratio', 149 | #'to_messages', 150 | 'restricted_stock' 151 | #'other' 152 | ] 153 | ### Other is removed from the dataset as it doesn't tell much and including it 154 | ### might skew the predictions 155 | 156 | data = featureFormat(my_dataset, features_list, sort_keys = True) 157 | labels, features = targetFeatureSplit(data) 158 | 159 | 160 | ### Applying feature scaling using MinMaxScaler 161 | 162 | scaler = MinMaxScaler() 163 | features = scaler.fit_transform(features) 164 | 165 | 166 | ### Using SelectKBest to determine which features to use 167 | ''' 168 | selector = SelectKBest(f_classif, k=20) 169 | selector.fit(features, labels) 170 | features = selector.transform(features) 171 | feature_scores = zip(features_list[1:],selector.scores_) 172 | 173 | sorted_scores = sorted(feature_scores, key=lambda feature: feature[1], reverse = True) 174 | for item in sorted_scores: 175 |     print item[0], item[1] 176 | ''' 177 | 178 | ### Using Decision Tree 179 | ''' 180 | clf = DecisionTreeClassifier() 181 | clf.fit(features, labels) 182 | dt_scores = zip(features_list[1:],clf.feature_importances_) 183 | sorted_dtscores = sorted(dt_scores, key=lambda feature: feature[1], reverse = True) 184 | for item in sorted_dtscores: 185 |     print item[0], item[1] 186 | ''' 187 | 188 | ### Task 4: Try a variety of classifiers 189 | ### Please name your classifier clf for easy export below. 190 | ### Note that if you want to do PCA or other multi-stage operations, 191 | ### you'll need to use Pipelines. For more info: 192 | ### http://scikit-learn.org/stable/modules/pipeline.html 193 | 194 | # Provided to give you a starting point. Try a variety of classifiers. 
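### Illustrative sketch only: one quick way to compare the candidate classifiers
### listed below, using the cross_val_score helper already imported above and the
### features/labels built earlier in this file. The numbers it prints are rough
### cross-validated F1 guides; the precision/recall figures quoted in the notebook
### come from tester.py's StratifiedShuffleSplit evaluation, not from this loop.
'''
candidates = [('Naive-Bayes Gaussian', GaussianNB()),
              ('Decision Tree', tree.DecisionTreeClassifier()),
              ('AdaBoost', AdaBoostClassifier()),
              ('K Nearest Neighbors', KNeighborsClassifier()),
              ('Random Forest', RandomForestClassifier())]
for candidate_name, candidate in candidates:
    scores = cross_val_score(candidate, features, labels, scoring='f1', cv=5)
    print candidate_name, round(scores.mean(), 3)
'''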
195 | ### Naive-Bayes Gaussian 196 | ''' 197 | clf = GaussianNB() 198 | ''' 199 | 200 | ### Decision Tree 201 | #clf = tree.DecisionTreeClassifier() 202 | 203 | ### AdaBoost Classifier 204 | #clf = AdaBoostClassifier() 205 | 206 | ### K Nearest Neighbors Classifier 207 | #clf = KNeighborsClassifier() 208 | 209 | ### Random Forest Classifier 210 | #clf = RandomForestClassifier() 211 | 212 | ### Task 5: Tune your classifier to achieve better than .3 precision and recall 213 | ### using our testing script. Check the tester.py script in the final project 214 | ### folder for details on the evaluation method, especially the test_classifier 215 | ### function. Because of the small size of the dataset, the script uses 216 | ### stratified shuffle split cross validation. For more info: 217 | ### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html 218 | 219 | # Example starting point. Try investigating other evaluation techniques! 220 | 221 | from sklearn.cross_validation import train_test_split 222 | 223 | features_train, features_test, labels_train, labels_test = \ 224 | train_test_split(features, labels, test_size=0.3, random_state=42) 225 | 226 | 227 | ### Decision Tree, need to change the clf to dt when tring to figure the optimal parameter values 228 | ''' 229 | clf = tree.DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, 230 | max_features=None, max_leaf_nodes=None, 231 | min_impurity_decrease=0.0, min_impurity_split=None, 232 | min_samples_leaf=1, min_samples_split=2, 233 | min_weight_fraction_leaf=0.0, presort=False, random_state=None, 234 | splitter='best') 235 | ''' 236 | #param_grid = {} 237 | #clf = GridSearchCV(dt, param_grid) 238 | #clf = clf.fit(features, labels) 239 | #print clf.best_estimator_ 240 | 241 | ### AdaBoost Classifier 242 | #clf = AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, 243 | # learning_rate=1.0, n_estimators=50, random_state=None) 244 | #param_grid = {} 245 | #clf = GridSearchCV(ada, param_grid) 246 | #clf = clf.fit(features, labels) 247 | #print clf.best_estimator_ 248 | 249 | ### K Nearest Neighbors Classifier 250 | #clf = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', 251 | # metric_params=None, n_jobs=1, n_neighbors=5, p=2, 252 | # weights='uniform') 253 | #param_grid = {} 254 | #clf = GridSearchCV(knn, param_grid) 255 | #clf = clf.fit(features, labels) 256 | #print clf.best_estimator_ 257 | 258 | ### Random Forest Classifier 259 | #clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', 260 | # max_depth=None, max_features='auto', max_leaf_nodes=None, 261 | # min_impurity_decrease=0.0, min_impurity_split=None, 262 | # min_samples_leaf=1, min_samples_split=2, 263 | # min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, 264 | # oob_score=False, random_state=None, verbose=0, 265 | # warm_start=False) 266 | #param_grid = {} 267 | #clf = GridSearchCV(rf, param_grid) 268 | #clf = clf.fit(features, labels) 269 | #print clf.best_estimator_ 270 | 271 | ### Task 6: Dump your classifier, dataset, and features_list so anyone can 272 | ### check your results. You do not need to change anything below, but make sure 273 | ### that the version of poi_id.py that you submit can be run on its own and 274 | ### generates the necessary .pkl files for validating your results. 
275 | 276 | 277 | dump_classifier_and_data(clf, my_dataset, features_list) 278 | -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/resources.MD: -------------------------------------------------------------------------------- 1 | Some of the resources used while completing the projects: 2 | 1. scikit-learn documentation 3 | http://scikit-learn.org/stable/documentation.html 4 | 5 | 2. R documentation 6 | https://www.rdocumentation.org 7 | 8 | 3. Forums 9 | http://www.stackoverflow.com 10 | -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/tester.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/pickle 2 | 3 | """ a basic script for importing student's POI identifier, 4 | and checking the results that they get from it 5 | 6 | requires that the algorithm, dataset, and features list 7 | be written to my_classifier.pkl, my_dataset.pkl, and 8 | my_feature_list.pkl, respectively 9 | 10 | that process should happen at the end of poi_id.py 11 | """ 12 | 13 | import pickle 14 | import sys 15 | from sklearn.cross_validation import StratifiedShuffleSplit 16 | sys.path.append("../tools/") 17 | from feature_format import featureFormat, targetFeatureSplit 18 | 19 | PERF_FORMAT_STRING = "\ 20 | \tAccuracy: {:>0.{display_precision}f}\tPrecision: {:>0.{display_precision}f}\t\ 21 | Recall: {:>0.{display_precision}f}\tF1: {:>0.{display_precision}f}\tF2: {:>0.{display_precision}f}" 22 | RESULTS_FORMAT_STRING = "\tTotal predictions: {:4d}\tTrue positives: {:4d}\tFalse positives: {:4d}\ 23 | \tFalse negatives: {:4d}\tTrue negatives: {:4d}" 24 | 25 | def test_classifier(clf, dataset, feature_list, folds = 1000): 26 | data = featureFormat(dataset, feature_list, sort_keys = True) 27 | labels, features = targetFeatureSplit(data) 28 | cv = StratifiedShuffleSplit(labels, folds, random_state = 42) 29 | true_negatives = 0 30 | false_negatives = 0 31 | true_positives = 0 32 | false_positives = 0 33 | for train_idx, test_idx in cv: 34 | features_train = [] 35 | features_test = [] 36 | labels_train = [] 37 | labels_test = [] 38 | for ii in train_idx: 39 | features_train.append( features[ii] ) 40 | labels_train.append( labels[ii] ) 41 | for jj in test_idx: 42 | features_test.append( features[jj] ) 43 | labels_test.append( labels[jj] ) 44 | 45 | ### fit the classifier using training set, and test on test set 46 | clf.fit(features_train, labels_train) 47 | predictions = clf.predict(features_test) 48 | for prediction, truth in zip(predictions, labels_test): 49 | if prediction == 0 and truth == 0: 50 | true_negatives += 1 51 | elif prediction == 0 and truth == 1: 52 | false_negatives += 1 53 | elif prediction == 1 and truth == 0: 54 | false_positives += 1 55 | elif prediction == 1 and truth == 1: 56 | true_positives += 1 57 | else: 58 | print "Warning: Found a predicted label not == 0 or 1." 59 | print "All predictions should take value 0 or 1." 
60 | print "Evaluating performance for processed predictions:" 61 | break 62 | try: 63 | total_predictions = true_negatives + false_negatives + false_positives + true_positives 64 | accuracy = 1.0*(true_positives + true_negatives)/total_predictions 65 | precision = 1.0*true_positives/(true_positives+false_positives) 66 | recall = 1.0*true_positives/(true_positives+false_negatives) 67 | f1 = 2.0 * true_positives/(2*true_positives + false_positives+false_negatives) 68 | f2 = (1+2.0*2.0) * precision*recall/(4*precision + recall) 69 | print clf 70 | print PERF_FORMAT_STRING.format(accuracy, precision, recall, f1, f2, display_precision = 5) 71 | print RESULTS_FORMAT_STRING.format(total_predictions, true_positives, false_positives, false_negatives, true_negatives) 72 | print "" 73 | except: 74 | print "Got a divide by zero when trying out:", clf 75 | print "Precision or recall may be undefined due to a lack of true positive predicitons." 76 | 77 | CLF_PICKLE_FILENAME = "my_classifier.pkl" 78 | DATASET_PICKLE_FILENAME = "my_dataset.pkl" 79 | FEATURE_LIST_FILENAME = "my_feature_list.pkl" 80 | 81 | def dump_classifier_and_data(clf, dataset, feature_list): 82 | with open(CLF_PICKLE_FILENAME, "w") as clf_outfile: 83 | pickle.dump(clf, clf_outfile) 84 | with open(DATASET_PICKLE_FILENAME, "w") as dataset_outfile: 85 | pickle.dump(dataset, dataset_outfile) 86 | with open(FEATURE_LIST_FILENAME, "w") as featurelist_outfile: 87 | pickle.dump(feature_list, featurelist_outfile) 88 | 89 | def load_classifier_and_data(): 90 | with open(CLF_PICKLE_FILENAME, "r") as clf_infile: 91 | clf = pickle.load(clf_infile) 92 | with open(DATASET_PICKLE_FILENAME, "r") as dataset_infile: 93 | dataset = pickle.load(dataset_infile) 94 | with open(FEATURE_LIST_FILENAME, "r") as featurelist_infile: 95 | feature_list = pickle.load(featurelist_infile) 96 | return clf, dataset, feature_list 97 | 98 | def main(): 99 | ### load up student's classifier, dataset, and feature_list 100 | clf, dataset, feature_list = load_classifier_and_data() 101 | ### Run testing script 102 | test_classifier(clf, dataset, feature_list) 103 | 104 | if __name__ == '__main__': 105 | main() 106 | -------------------------------------------------------------------------------- /8-Data-Visualization-in-Tableau/Create-a-Tableau-Story.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Summary" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Airports in the US have been registering consistent growth yearly with the exception of year 2001 due to 9/11 incident and the late 2000s recession. The visualzations below are to provide an insight of the aviation industry in the US and some more detailed segmentated views for the west coast and east coast." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## Design" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "### Dataset\n", 29 | "\n", 30 | "The follwing steps were taken in order to create a comprehensive csv file.\n", 31 | "1. Multiple datasets were downloaded from the website of U.S. Burueat of Transportation Statistics from year 2000 to 2016. \n", 32 | "\n", 33 | "2. Only scheduled flights related to passengers are kept while those of freight and unschedule flights are removed.\n", 34 | "\n", 35 | "3. 
Rows with 0 total passengers are deleted as they do not add anything to the final dataset.\n", 36 | "\n", 37 | "4. Each row of data is duplicated except that its origin and destination are swapped. This is done because we also need to count the passengers for the destination city without changing the structure of the data.\n", 38 | "\n", 39 | "5. The data is then pivoted in Excel to calculate the annual number of passengers by airline and by city pair.\n", 40 | "\n", 41 | "6. Once the data is loaded into Tableau, airport pair and city pair fields are created by combining the airport code and city name.\n", 42 | "\n", 43 | "7. Another calculated field is also created to filter out city pairs with fewer than 500 passengers annually when calculating the connectivity of an airport (the number of airports with a direct flight), since there are lots of one-off flights which would otherwise skew the data, such as Singapore-Raleigh, which is never flown by any commercial airline." 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### Visualization\n", 51 | "\n", 52 | "Trend (line) charts are chosen as they allow data over the period from 2000 to 2016 to be shown.\n", 53 | "Instead of listing the airports or airlines in tables, trend charts allow the values to be visualized easily and the ranking to be observed readily.\n", 54 | "\n", 55 | "For example, in the \"Foreign Airports (Annual Int'l Pax) - West Coast\" chart, it can be seen that Narita Airport (NRT) is losing its lead in terms of passenger movement and is slated to be replaced by Vancouver (YVR). This trend couldn't be seen if an ordinary table were used to represent the data.\n", 56 | "\n", 57 | "The storyline is carried over the following slides of trend charts:\n", 58 | "1. US Airport (Annual Pax) and US Airlines (Annual Pax)\n", 59 | "2. US Airport (Annual Domestic Pax) and US Airlines (Annual Domestic Pax)\n", 60 | "3. US Airports (Annual Int'l Pax) and US Airlines (Annual Int'l Pax)\n", 61 | "4. Foreign Airports (Annual Int'l Pax) and Foreign Airlines (Annual Int'l Pax)\n", 62 | "5. Foreign Airports (Annual Int'l Pax) - West Coast and East Coast\n", 63 | "6. Foreign Market (Annual Int'l Pax) - Overall, West Coast, East Coast\n", 64 | "7. US Airports Domestic Connectivity and Int'l Connectivity (# Airports)\n", 65 | "8. Foreign Airports Int'l Connectivity (# Airports)\n", 66 | "9. City Pairs (Domestic & Int'l)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "### Improvements\n", 74 | "\n", 75 | "#### Feedback #1\n", 76 | "My initial graphs were all filled with data and lines. \n", 77 | "![Too much info!!](https://user-images.githubusercontent.com/14093302/34650996-f9f27c14-f404-11e7-9292-a4d8b17e733a.PNG \"Too much info!!\")\n", 78 | "\n", 79 | "After being told that the graphs were too crammed, I removed most of the data and kept only the top airports in terms of passengers, which form the main storyline for this project.\n", 80 | "\n", 81 | "#### Feedback #2\n", 82 | "\n", 83 | "Each city pair is duplicated in the opposite direction. For example, the line for LHR-JFK is overlapped by JFK-LHR.\n", 84 | "![Duplicated city pairs](https://user-images.githubusercontent.com/14093302/34650997-fc1a584a-f404-11e7-9a77-db0eebd2623c.PNG \"Duplicated City Pairs!\")\n", 85 | "\n", 86 | "The duplication arises because the data is organized with the origin and destination in one row and was then duplicated to account for passenger counts in both directions. 
After filtering out city pairs with low passenger volume, I carefully removed the duplicated city pairs for both graphs.\n", 87 | "\n", 88 | "#### Feedback #3\n", 89 | "The labels for some of my graphs are entangled such that they overlap and obstruct each other.\n", 90 | "![Labels too messy!!](https://user-images.githubusercontent.com/14093302/34650999-feaa0ede-f404-11e7-9900-c928e165a83f.PNG \"Labels too messy!!\")\n", 91 | "\n", 92 | "Wherever possible, I try to label each line. However, when the labels overlap I remove them and only label the top few, since most of the storyline is centered on them. \n", 93 | "\n", 94 | "I learned that it is better to convey a more limited range of data effectively than to squeeze all the information into a graph to the extent that it overwhelms the audience." 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "## Feedback" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### Feedback 1\n", 109 | "\n", 110 | "There is just too much info on each graph and it is hard to focus on the data. Perhaps you should only pick the top 10 highest or highlight a specific airport or airline that you wish to talk about?" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### Feedback 2\n", 118 | "\n", 119 | "Do you notice that in the city pair graphs, there seem to be duplicated lines trailing below, as if there are pairs of data? For example, the city pair with the highest amount of passengers is New York, NY to London, UK. Just pick the first one of each pair and hide the others." 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "### Feedback 3" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "I feel overwhelmed by your graphs, as there are just too many lines without a clear focus. If you just want to show the ranking of the airports in terms of passengers, why not use horizontal bar charts? Unless it is to show the trend over time? But again, there isn't a clear focus on your graph and it seems you just show whatever you have. At the same time, not all lines need to be labelled." 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "### Feedback 4" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "I highly recommend using color-blind friendly colors. The following chart shows some of the indistinguishable colors for people with this condition.\n", 148 | "\n", 149 | "You can find the colorblind palette in the color legends by going to edit colors, then in the opened window clicking on the dropdown; you can see a colorblind option." 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "## Project\n", 157 | "\n", 158 | "\n", 159 | "https://public.tableau.com/profile/kaishengteh#!/vizhome/USFlights/USAviation2000-2016" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "%%HTML\n", 167 | "\n", 168 | "
\n" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "## Resources" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=293" 183 | ] 184 | } 185 | ], 186 | "metadata": { 187 | "kernelspec": { 188 | "display_name": "Python [conda root]", 189 | "language": "python", 190 | "name": "conda-root-py" 191 | }, 192 | "language_info": { 193 | "codemirror_mode": { 194 | "name": "ipython", 195 | "version": 3 196 | }, 197 | "file_extension": ".py", 198 | "mimetype": "text/x-python", 199 | "name": "python", 200 | "nbconvert_exporter": "python", 201 | "pygments_lexer": "ipython3", 202 | "version": "3.6.2" 203 | } 204 | }, 205 | "nbformat": 4, 206 | "nbformat_minor": 2 207 | } 208 | -------------------------------------------------------------------------------- /8-Data-Visualization-in-Tableau/Flight Summary.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/8-Data-Visualization-in-Tableau/Flight Summary.zip -------------------------------------------------------------------------------- /8-Data-Visualization-in-Tableau/README.md: -------------------------------------------------------------------------------- 1 | The Tableau storyline can be accessed [here](https://public.tableau.com/profile/kaishengteh#!/vizhome/USFlights/USAviation2000-2016). 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data-Analyst-Nanodegree 2 | 3 | ### Kai Sheng Teh 4 | 5 | This repository contains projects for Udacity's [Data Analyst Nanodegree](https://www.udacity.com/course/data-analyst-nanodegree--nd002). 6 | 7 | ### Part 1: Analyze Bay Area Bike Share Project 8 | Complete your first project analyzing bike rental data. It’s a great project to tackle during the first week in your Nanodegree to see if the program is a good fit for you! 9 | 10 | - Project: [Analyze Bay Area Bike Share Data](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/1-Analyze-Bay-Area-Bike-Share-Project/Bay_Area_Bike_Share_Analysis.ipynb) 11 | 12 | ### Part 2: [Descriptive Statistics](https://www.udacity.com/course/intro-to-descriptive-statistics--ud827) 13 | Learn to use descriptive statistics to describe properties of datasets and learn about how samples and populations are related. 14 | 15 | ### Part 3: [Intro to Data Analysis](https://www.udacity.com/course/intro-to-data-analysis--ud170) 16 | Choose one of Udacity's curated datasets and investigate it using NumPy and Pandas. Go through the entire data analysis process, starting by posing a question and finishing by sharing your findings. 17 | 18 | - Project: [Investigate a Dataset](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/3-Intro-to-Data-Analysis/Investigate-a-Dataset.ipynb) 19 | 20 | ### Part 4: [Data Wrangling](https://www.udacity.com/course/data-wrangling-with-mongodb--ud032) 21 | Choose a region of the world from www.openstreetmap.org and then use data wrangling techniques to audit and clean the data. You'll then use a database to query the cleaned data. 
22 | 23 | - Project: [Wrangle OpenStreetMap Data](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Data_Wrangling.ipynb) 24 | 25 | ### Part 5: [Exploratory Data Analysis](https://www.udacity.com/course/data-analysis-with-r--ud651) 26 | Use R and apply exploratory data analysis techniques to explore a selected data set for distributions, outliers, and relationships. 27 | 28 | - Project: [Explore and Summarize Data](https://cdn.rawgit.com/kaishengteh/Data-Analyst-Nanodegree/e94db549/5-Exploratory-Data-Analysis/Exploratory%20Data%20Analysis.html) 29 | 30 | ### Part 6: [Inferential Statistics](https://www.udacity.com/course/intro-to-inferential-statistics--ud201) 31 | Use descriptive statistics and a statistical test to analyze the Stroop effect, a classic result of experimental psychology. 32 | 33 | - Project: [Test a Perceptual Phenomenon](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/6-Inferential-Statistics/Stroop-Effect.ipynb) 34 | 35 | ### Part 7: [Intro to Machine Learning](https://www.udacity.com/course/intro-to-machine-learning--ud120) 36 | Play detective and put your machine learning skills to use by building an algorithm to identify Enron Employees who may have committed fraud based on the public Enron financial and email dataset. 37 | 38 | - Project: [Identify Fraud from Enron Email](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/7-Intro-to-Machine-Learning/Identify%20Fraud%20from%20Enron%20Email.ipynb) 39 | 40 | ### Part 8: [Data Visualization in Tableau](https://www.udacity.com/course/data-visualization-in-tableau--ud1006) 41 | Understand the importance of data visualization. Know how different data types are encoded in visualizations. Select the most effective chart or graph based on the data being displayed. 42 | 43 | - Project: [Create a Tableau Story](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/8-Data-Visualization-in-Tableau/Create-a-Tableau-Story.ipynb) 44 | 45 | ![udacity Data Analyst Nanodegree](https://user-images.githubusercontent.com/14093302/37262598-0227d436-25df-11e8-9613-a6a03c8edc08.jpg) 46 | --------------------------------------------------------------------------------