├── 1-Analyze-Bay-Area-Bike-Share-Project ├── Bay_Area_Bike_Share_Analysis.ipynb └── Data.zip ├── 3-Intro-to-Data-Analysis ├── Data │ ├── Exports (p of GDP).xlsx │ ├── Indicator_HDI.xlsx │ ├── Oil Production.xlsx │ ├── README.md │ ├── indicator CDIAC carbon_dioxide_emissions_per_capita.xlsx │ ├── indicator gapminder gdp_per_capita_ppp.xlsx │ ├── indicator life_expectancy_at_birth.xlsx │ ├── indicator ti cpi 2009.xlsx │ ├── indicatorGNIpercapitaATLAS.xlsx │ ├── indicator_per capita government expenditure on health (ppp int. $).xlsx │ ├── indicator_per capita total expenditure on health (ppp int. $).xlsx │ ├── indicator_total population female.xlsx │ └── indicator_total population male.xlsx ├── Investigate-a-Dataset.ipynb └── L1_Starter_Code.ipynb ├── 4-Data-Wrangling ├── Audit_Zipcode.py ├── Auditing_Street_Names.py ├── Convert_to_CSV_files.py ├── Convert_to_SQL_Database.py ├── Data_Wrangling.ipynb ├── Improving_Street_Names.py ├── Number_of_Tags.py ├── README.md ├── Sample.py ├── Schema.py ├── Tags_Types.py ├── Update_Zipcode.py └── sample.osm ├── 5-Exploratory-Data-Analysis ├── Exploratory Data Analysis.html ├── Exploratory Data Analysis.rmd ├── README.md └── Resources.txt ├── 6-Inferential-Statistics ├── Stroop-Effect.ipynb └── stroopdata.csv ├── 7-Intro-to-Machine-Learning ├── Identify Fraud from Enron Email.ipynb ├── README.md ├── my_classifier.pkl ├── my_dataset.pkl ├── my_feature_list.pkl ├── poi_id.py ├── resources.MD └── tester.py ├── 8-Data-Visualization-in-Tableau ├── Create-a-Tableau-Story.ipynb ├── Flight Summary.zip └── README.md └── README.md /1-Analyze-Bay-Area-Bike-Share-Project/Data.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/1-Analyze-Bay-Area-Bike-Share-Project/Data.zip -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/Exports (p of GDP).xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/Exports (p of GDP).xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/Indicator_HDI.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/Indicator_HDI.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/Oil Production.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/Oil Production.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator CDIAC carbon_dioxide_emissions_per_capita.xlsx: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator CDIAC carbon_dioxide_emissions_per_capita.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator gapminder gdp_per_capita_ppp.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator gapminder gdp_per_capita_ppp.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator life_expectancy_at_birth.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator life_expectancy_at_birth.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator ti cpi 2009.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator ti cpi 2009.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicatorGNIpercapitaATLAS.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicatorGNIpercapitaATLAS.xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator_per capita government expenditure on health (ppp int. $).xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator_per capita government expenditure on health (ppp int. $).xlsx -------------------------------------------------------------------------------- /3-Intro-to-Data-Analysis/Data/indicator_per capita total expenditure on health (ppp int. $).xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator_per capita total expenditure on health (ppp int. 
$).xlsx
--------------------------------------------------------------------------------
/3-Intro-to-Data-Analysis/Data/indicator_total population female.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator_total population female.xlsx
--------------------------------------------------------------------------------
/3-Intro-to-Data-Analysis/Data/indicator_total population male.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/3-Intro-to-Data-Analysis/Data/indicator_total population male.xlsx
--------------------------------------------------------------------------------
/4-Data-Wrangling/Audit_Zipcode.py:
--------------------------------------------------------------------------------
1 | import xml.etree.cElementTree as ET
2 | import pprint
3 | from collections import defaultdict
4 | import re
5 | 
6 | '''
7 | This code checks whether zipcodes begin with '94', '95' or something else
8 | '''
9 | OSMFILE = "san-jose_california.osm"
10 | zip_type_re = re.compile(r'\d{5}$')
11 | 
12 | def audit_ziptype(zip_types, zipcode):
13 |     # Group zipcodes by their first two characters so that prefixes other
14 |     # than '94' or '95' stand out in the audit output
15 |     zip_types[zipcode[0:2]].add(zipcode)
16 | 
17 | def is_zipcode(elem):
18 |     return (elem.attrib['k'] == "addr:postcode")
19 | 
20 | def audit_zip(osmfile):
21 |     osm_file = open(osmfile, "r")
22 |     zip_types = defaultdict(set)
23 |     for event, elem in ET.iterparse(osm_file, events=("start",)):
24 |         if elem.tag == "node" or elem.tag == "way":
25 |             for tag in elem.iter("tag"):
26 |                 if is_zipcode(tag):
27 |                     audit_ziptype(zip_types, tag.attrib['v'])
28 |     osm_file.close()
29 |     return zip_types
30 | 
31 | zip_print = audit_zip(OSMFILE)
32 | 
33 | def test():
34 |     pprint.pprint(dict(zip_print))
35 | 
36 | if __name__ == '__main__':
37 |     test()
38 | 
--------------------------------------------------------------------------------
/4-Data-Wrangling/Auditing_Street_Names.py:
--------------------------------------------------------------------------------
1 | import xml.etree.cElementTree as ET
2 | import pprint
3 | from collections import defaultdict
4 | import re
5 | 
6 | '''
7 | The code below lists out all the street types not in the expected list.
8 | ''' 9 | OSMFILE = "san-jose_california.osm" 10 | street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE) 11 | 12 | expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 13 | "Trail", "Parkway", "Commons", "Circle", "Terrace", "Way"] 14 | 15 | 16 | def audit_street_type(street_types, street_name): 17 | m = street_type_re.search(street_name) 18 | if m: 19 | street_type = m.group() 20 | if street_type not in expected: 21 | street_types[street_type].add(street_name) 22 | 23 | 24 | def is_street_name(elem): 25 | return (elem.attrib['k'] == "addr:street") 26 | 27 | 28 | def audit(osmfile): 29 | osm_file = open(osmfile, "r") 30 | street_types = defaultdict(set) 31 | for event, elem in ET.iterparse(osm_file, events=("start",)): 32 | 33 | if elem.tag == "node" or elem.tag == "way": 34 | for tag in elem.iter("tag"): 35 | if is_street_name(tag): 36 | audit_street_type(street_types, tag.attrib['v']) 37 | osm_file.close() 38 | return street_types 39 | 40 | st_types = audit(OSMFILE) 41 | 42 | def test(): 43 | pprint.pprint(dict(st_types)) 44 | 45 | if __name__ == '__main__': 46 | test() -------------------------------------------------------------------------------- /4-Data-Wrangling/Convert_to_CSV_files.py: -------------------------------------------------------------------------------- 1 | ''' 2 | The code below is mostly derived from Udacity Lession 13: Case study: OpenStreetMap Data [SQL] 3 | https://classroom.udacity.com/nanodegrees/nd002/parts/860b269a-d0b0-4f0c-8f3d-ab08865d43bf/modules/316820862075461/lessons/5436095827/concepts/54908788190923 4 | ''' 5 | OSM_PATH = "san-jose_california.osm" 6 | OSMFILE = "san-jose_california.osm" 7 | street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE) 8 | 9 | expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 10 | "Trail", "Parkway", "Commons", "Circle", "Terrace", "Way"] 11 | 12 | mapping = { "St": "Street", 13 | "St.": "Street", 14 | "Rd.": "Road", 15 | "Ave": "Avenue", 16 | "Blvd": "Boulevard", 17 | "Dr": "Drive", 18 | "Rd": "Road" 19 | } 20 | 21 | NODES_PATH = "nodes.csv" 22 | NODE_TAGS_PATH = "nodes_tags.csv" 23 | WAYS_PATH = "ways.csv" 24 | WAY_NODES_PATH = "ways_nodes.csv" 25 | WAY_TAGS_PATH = "ways_tags.csv" 26 | 27 | LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+') 28 | PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. 
\t\r\n]') 29 | 30 | SCHEMA = schema 31 | 32 | # Make sure the fields order in the csvs matches the column order in the sql table schema 33 | NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp'] 34 | NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type'] 35 | WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp'] 36 | WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type'] 37 | WAY_NODES_FIELDS = ['id', 'node_id', 'position'] 38 | 39 | def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS, 40 | problem_chars=PROBLEMCHARS, default_tag_type='regular'): 41 | """Clean and shape node or way XML element to Python dict""" 42 | node_attribs = {} 43 | way_attribs = {} 44 | way_nodes = [] 45 | tags = [] # Handle secondary tags the same way for both node and way elements 46 | p=0 47 | 48 | if element.tag == 'node': 49 | for i in NODE_FIELDS: 50 | node_attribs[i] = element.attrib[i] 51 | for tag in element.iter("tag"): 52 | node_tags_attribs = {} 53 | temp = LOWER_COLON.search(tag.attrib['k']) 54 | is_p = PROBLEMCHARS.search(tag.attrib['k']) 55 | if is_p: 56 | continue 57 | elif temp: 58 | split_char = temp.group(1) 59 | split_index = tag.attrib['k'].index(split_char) 60 | type1 = temp.group(1) 61 | node_tags_attribs['id'] = element.attrib['id'] 62 | node_tags_attribs['key'] = tag.attrib['k'][split_index+2:] 63 | node_tags_attribs['value'] = tag.attrib['v'] 64 | node_tags_attribs['type'] = tag.attrib['k'][:split_index+1] 65 | if node_tags_attribs['type'] == "addr" and node_tags_attribs['key'] == "street": 66 | # update street name 67 | node_tags_attribs['value'] = update_name(tag.attrib['v'], mapping) 68 | #elif node_tags_attribs['type'] == "addr" and node_tags_attribs['key'] == "postcode": 69 | # # update post code 70 | # node_tags_attribs['value'] = update_zipcode(tag.attrib['v']) 71 | else: 72 | node_tags_attribs['id'] = element.attrib['id'] 73 | node_tags_attribs['key'] = tag.attrib['k'] 74 | node_tags_attribs['value'] = tag.attrib['v'] 75 | node_tags_attribs['type'] = 'regular' 76 | if node_tags_attribs['type'] == "addr" and node_tags_attribs['key'] == "street": 77 | # update street name 78 | node_tags_attribs['value'] = update_name(tag.attrib['v'], mapping) 79 | #elif node_tags_attribs['type'] == "addr" and node_tags_attribs['key'] == "postcode": 80 | # # update post code 81 | # node_tags_attribs['value'] = update_zipcode(tag.attrib['v']) 82 | tags.append(node_tags_attribs) 83 | return {'node': node_attribs, 'node_tags': tags} 84 | elif element.tag == 'way': 85 | id = element.attrib['id'] 86 | for i in WAY_FIELDS: 87 | way_attribs[i] = element.attrib[i] 88 | for i in element.iter('nd'): 89 | d = {} 90 | d['id'] = id 91 | d['node_id'] = i.attrib['ref'] 92 | d['position'] = p 93 | p+=1 94 | way_nodes.append(d) 95 | for c in element.iter('tag'): 96 | temp = LOWER_COLON.search(c.attrib['k']) 97 | is_p = PROBLEMCHARS.search(c.attrib['k']) 98 | e = {} 99 | if is_p: 100 | continue 101 | elif temp: 102 | split_char = temp.group(1) 103 | split_index = c.attrib['k'].index(split_char) 104 | e['id'] = id 105 | e['key'] = c.attrib['k'][split_index+2:] 106 | e['type'] = c.attrib['k'][:split_index+1] 107 | e['value'] = c.attrib['v'] 108 | if e['type'] == "addr" and e['key'] == "street": 109 | e['value'] = update_name(c.attrib['v'], mapping) 110 | #elif e['type'] == "addr" and e['key'] == "postcode": 111 | # e['value'] = update_zipcode(c.attrib['v']) 112 | else: 113 | e['id'] = id 114 | e['key'] = c.attrib['k'] 115 | e['type'] = 'regular' 116 | 
e['value'] = c.attrib['v'] 117 | if e['type'] == "addr" and e['key'] == "street": 118 | e['value'] = update_name(c.attrib['v'], mapping) 119 | #elif e['type'] == "addr" and e['key'] == "postcode": 120 | # e['value'] = update_zipcode(c.attrib['v']) 121 | tags.append(e) 122 | 123 | return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags} 124 | 125 | if element.tag == 'node': 126 | return {'node': node_attribs, 'node_tags': tags} 127 | elif element.tag == 'way': 128 | return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags} 129 | 130 | 131 | # ================================================== # 132 | # Helper Functions # 133 | # ================================================== # 134 | def get_element(osm_file, tags=('node', 'way', 'relation')): 135 | """Yield element if it is the right type of tag""" 136 | context = ET.iterparse(osm_file, events=('start', 'end')) 137 | _, root = next(context) 138 | for event, elem in context: 139 | if event == 'end' and elem.tag in tags: 140 | yield elem 141 | root.clear() 142 | 143 | 144 | def validate_element(element, validator, schema=SCHEMA): 145 | """Raise ValidationError if element does not match schema""" 146 | if validator.validate(element, schema) is not True: 147 | field, errors = next(validator.errors.iteritems()) 148 | message_string = "\nElement of type '{0}' has the following errors:\n{1}" 149 | error_string = pprint.pformat(errors) 150 | 151 | raise Exception(message_string.format(field, error_string)) 152 | 153 | 154 | class UnicodeDictWriter(csv.DictWriter, object): 155 | """Extend csv.DictWriter to handle Unicode input""" 156 | def writerow(self, row): 157 | super(UnicodeDictWriter, self).writerow({ 158 | k: (v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in row.iteritems() 159 | }) 160 | 161 | def writerows(self, rows): 162 | for row in rows: 163 | self.writerow(row) 164 | 165 | 166 | # ================================================== # 167 | # Main Function # 168 | # ================================================== # 169 | def process_map(file_in, validate): 170 | """Iteratively process each XML element and write to csv(s)""" 171 | with codecs.open(NODES_PATH, 'wb') as nodes_file, \ 172 | codecs.open(NODE_TAGS_PATH, 'wb') as nodes_tags_file, \ 173 | codecs.open(WAYS_PATH, 'wb') as ways_file, \ 174 | codecs.open(WAY_NODES_PATH, 'wb') as way_nodes_file, \ 175 | codecs.open(WAY_TAGS_PATH, 'wb') as way_tags_file: 176 | 177 | nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS) 178 | node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS) 179 | ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS) 180 | way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS) 181 | way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS) 182 | 183 | nodes_writer.writeheader() 184 | node_tags_writer.writeheader() 185 | ways_writer.writeheader() 186 | way_nodes_writer.writeheader() 187 | way_tags_writer.writeheader() 188 | 189 | validator = cerberus.Validator() 190 | 191 | for element in get_element(file_in, tags=('node', 'way')): 192 | el = shape_element(element) 193 | if el: 194 | if validate is True: 195 | validate_element(el, validator) 196 | 197 | if element.tag == 'node': 198 | nodes_writer.writerow(el['node']) 199 | node_tags_writer.writerows(el['node_tags']) 200 | elif element.tag == 'way': 201 | ways_writer.writerow(el['way']) 202 | way_nodes_writer.writerows(el['way_nodes']) 203 | way_tags_writer.writerows(el['way_tags']) 204 | 205 | 206 | if __name__ == 
'__main__': 207 | # Note: Validation is ~ 10X slower. For the project consider using a small 208 | # sample of the map when validating. 209 | process_map(OSM_PATH, validate=True) 210 | -------------------------------------------------------------------------------- /4-Data-Wrangling/Convert_to_SQL_Database.py: -------------------------------------------------------------------------------- 1 | # Creating the database and tables 2 | import sqlite3 3 | conn = sqlite3.connect('data_wrangling.sqlite') 4 | 5 | conn.text_factory = str 6 | cur = conn.cursor() 7 | 8 | #Make some fresh tables using executescript() 9 | cur.execute('''DROP TABLE IF EXISTS nodes''') 10 | cur.execute('''DROP TABLE IF EXISTS nodes_tags''') 11 | cur.execute('''DROP TABLE IF EXISTS ways''') 12 | cur.execute('''DROP TABLE IF EXISTS ways_tags''') 13 | cur.execute('''DROP TABLE IF EXISTS ways_nodes''') 14 | 15 | 16 | cur.execute('''CREATE TABLE nodes ( 17 | id INTEGER PRIMARY KEY NOT NULL, 18 | lat REAL, 19 | lon REAL, 20 | user TEXT, 21 | uid INTEGER, 22 | version INTEGER, 23 | changeset INTEGER, 24 | timestamp TEXT) 25 | ''') 26 | 27 | with open('nodes.csv','r') as nodes_table: # `with` statement available in 2.5+ 28 | # csv.DictReader uses first line in file for column headings by default 29 | dr = csv.DictReader(nodes_table) # comma is default delimiter 30 | to_db = [(i['id'], i['lat'], i['lon'], i['user'], i['uid'], i['version'], i['changeset'], i['timestamp']) for i in dr] 31 | 32 | cur.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?, ?, ?);", to_db) 33 | 34 | 35 | cur.execute('''CREATE TABLE nodes_tags ( 36 | id INTEGER, 37 | key TEXT, 38 | value TEXT, 39 | type TEXT, 40 | FOREIGN KEY (id) REFERENCES nodes(id)) 41 | ''') 42 | 43 | with open('nodes_tags.csv','r') as nodes_tags_table: # `with` statement available in 2.5+ 44 | # csv.DictReader uses first line in file for column headings by default 45 | dr = csv.DictReader(nodes_tags_table) # comma is default delimiter 46 | to_db = [(i['id'], i['key'], i['value'], i['type']) for i in dr] 47 | 48 | cur.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?, ?);", to_db) 49 | 50 | cur.execute('''CREATE TABLE ways ( 51 | id INTEGER PRIMARY KEY NOT NULL, 52 | user TEXT, 53 | uid INTEGER, 54 | version TEXT, 55 | changeset INTEGER, 56 | timestamp TEXT) 57 | ''') 58 | 59 | with open('ways.csv','r') as ways_table: # `with` statement available in 2.5+ 60 | # csv.DictReader uses first line in file for column headings by default 61 | dr = csv.DictReader(ways_table) # comma is default delimiter 62 | to_db = [(i['id'], i['user'], i['uid'], i['version'], i['changeset'], i['timestamp']) for i in dr] 63 | 64 | cur.executemany("INSERT INTO ways VALUES (?, ?, ?, ?, ?, ?);", to_db) 65 | 66 | cur.execute('''CREATE TABLE ways_tags ( 67 | id INTEGER NOT NULL, 68 | key TEXT NOT NULL, 69 | value TEXT NOT NULL, 70 | type TEXT, 71 | FOREIGN KEY (id) REFERENCES ways(id)) 72 | ''') 73 | 74 | with open('ways_tags.csv','r') as ways_tags_table: # `with` statement available in 2.5+ 75 | # csv.DictReader uses first line in file for column headings by default 76 | dr = csv.DictReader(ways_tags_table) # comma is default delimiter 77 | to_db = [(i['id'], i['key'], i['value'], i['type']) for i in dr] 78 | 79 | cur.executemany("INSERT INTO ways_tags VALUES (?, ?, ?, ?);", to_db) 80 | 81 | cur.execute('''CREATE TABLE ways_nodes ( 82 | id INTEGER NOT NULL, 83 | node_id INTEGER NOT NULL, 84 | position INTEGER NOT NULL, 85 | FOREIGN KEY (id) REFERENCES ways(id), 86 | FOREIGN KEY (node_id) REFERENCES 
nodes(id)) 87 | ''') 88 | 89 | with open('ways_nodes.csv','r') as ways_nodes_table: # `with` statement available in 2.5+ 90 | # csv.DictReader uses first line in file for column headings by default 91 | dr = csv.DictReader(ways_nodes_table) # comma is default delimiter 92 | to_db = [(i['id'], i['node_id'], i['position']) for i in dr] 93 | 94 | cur.executemany("INSERT INTO ways_nodes VALUES (?, ?, ?);", to_db) 95 | 96 | #Save changes 97 | conn.commit() 98 | -------------------------------------------------------------------------------- /4-Data-Wrangling/Improving_Street_Names.py: -------------------------------------------------------------------------------- 1 | import xml.etree.cElementTree as ET 2 | import pprint 3 | from collections import defaultdict 4 | import re 5 | 6 | ''' 7 | The code below updates the unexpected street types listed in the mapping list 8 | while keeping others unchanged. 9 | ''' 10 | mapping = { "St": "Street", 11 | "St.": "Street", 12 | "Rd.": "Road", 13 | "Ave": "Avenue", 14 | "Blvd": "Boulevard", 15 | "Dr": "Drive", 16 | "Rd": "Road" 17 | } 18 | 19 | def update_name(name, mapping): 20 | m = street_type_re.search(name) 21 | if m.group() not in expected: 22 | if m.group() in mapping.keys(): 23 | name = re.sub(m.group(), mapping[m.group()], name) 24 | return name 25 | 26 | def test(): 27 | for st_type, ways in st_types.iteritems(): 28 | for name in ways: 29 | better_name = update_name(name, mapping) 30 | print name, "=>", better_name 31 | 32 | if __name__ == '__main__': 33 | test() -------------------------------------------------------------------------------- /4-Data-Wrangling/Number_of_Tags.py: -------------------------------------------------------------------------------- 1 | import xml.etree.cElementTree as ET 2 | import pprint 3 | from collections import defaultdict 4 | import re 5 | 6 | ''' 7 | The code below is to find out how many types of tags are there and the number of each tag. 8 | ''' 9 | 10 | def count_tags(filename): 11 | tags = {} 12 | for event, element in ET.iterparse(filename): 13 | if element.tag not in tags.keys(): 14 | tags[element.tag] = 1 15 | else: 16 | tags[element.tag] += 1 17 | return tags 18 | 19 | def test(): 20 | 21 | tags = count_tags('san-jose_california.osm') 22 | pprint.pprint(tags) 23 | 24 | if __name__ == "__main__": 25 | test() -------------------------------------------------------------------------------- /4-Data-Wrangling/README.md: -------------------------------------------------------------------------------- 1 | 1. [Data Wrangling.ipynb](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Data_Wrangling.ipynb) - Jupyter Notebook containing the projects 2 | 3 | 2. 
[Number of Tags](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Number_of_Tags.py) - Identify the types of tags present and the occurrence of each tag
4 | 
5 | [Tags Type](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Tags_Types.py) - Categorize tags into lowercase, lowercase with colon, problematic characters and others
6 | 
7 | [Auditing Street Names](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Auditing_Street_Names.py) - Audit the street types not in the list of expected street types
8 | 
9 | [Improving Street Names](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Improving_Street_Names.py) - Improve the dataset by replacing over-abbreviated street types
10 | 
11 | [SQL Schema](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Schema.py) - Schema for the 5 csv files which will be used to create the SQL database
12 | 
13 | [Convert to CSV files](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Convert_to_CSV_files.py) - Code to convert the OSM data into 5 csv files
14 | 
15 | [Audit Zipcode](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Audit_Zipcode.py) - Audit the zipcodes to check whether they start with '94', '95' or something else
16 | 
17 | [Update Zipcode](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Update_Zipcode.py) - Code to update zipcodes (if a zipcode has 8/9 digits or includes the state code, only the first 5 digits are kept; otherwise it is left unchanged)
18 | 
19 | [Convert to SQL Database](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Convert_to_SQL_Database.py) - Code to convert the 5 csv files into a SQLite database
20 | 
21 | 3. [Mapzen](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Mapzen_SanJose.txt) - Mapzen provided the metro extract for the city of San Jose, but the website has been shut down since Feb 1st, 2018
22 | 
23 | 4. [Sample](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Sample.py) - Code to generate a sample of the dataset
24 | 
25 | [sample.osm](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/sample.osm) - Contains a sample of the osm file (~8MB)
26 | 
27 | 5.
[Resources](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Resources.txt) - Contains a list of websites and Udacity lessons used to complete the project 28 | -------------------------------------------------------------------------------- /4-Data-Wrangling/Sample.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import xml.etree.ElementTree as ET # Use cElementTree or lxml if too slow 5 | 6 | OSM_FILE = "san-jose_california.osm" # Replace this with your osm file 7 | SAMPLE_FILE = "sample.osm" 8 | 9 | k = 50 # Parameter: take every k-th top level element 10 | 11 | def get_element(osm_file, tags=('node', 'way', 'relation')): 12 | """Yield element if it is the right type of tag 13 | 14 | Reference: 15 | http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python 16 | """ 17 | context = iter(ET.iterparse(osm_file, events=('start', 'end'))) 18 | _, root = next(context) 19 | for event, elem in context: 20 | if event == 'end' and elem.tag in tags: 21 | yield elem 22 | root.clear() 23 | 24 | 25 | with open(SAMPLE_FILE, 'wb') as output: 26 | output.write('\n') 27 | output.write('\n ') 28 | 29 | # Write every kth top level element 30 | for i, element in enumerate(get_element(OSM_FILE)): 31 | if i % k == 0: 32 | output.write(ET.tostring(element, encoding='utf-8')) 33 | 34 | output.write('') 35 | -------------------------------------------------------------------------------- /4-Data-Wrangling/Schema.py: -------------------------------------------------------------------------------- 1 | import xml.etree.cElementTree as ET 2 | import pprint 3 | from collections import defaultdict 4 | import re 5 | 6 | import csv 7 | import codecs 8 | import cerberus 9 | import schema 10 | 11 | ''' 12 | The schema below for the 5 csv files which will be used to construct a SQL databases 13 | ''' 14 | 15 | schema = { 16 | 'node': { 17 | 'type': 'dict', 18 | 'schema': { 19 | 'id': {'required': True, 'type': 'integer', 'coerce': int}, 20 | 'lat': {'required': True, 'type': 'float', 'coerce': float}, 21 | 'lon': {'required': True, 'type': 'float', 'coerce': float}, 22 | 'user': {'required': True, 'type': 'string'}, 23 | 'uid': {'required': True, 'type': 'integer', 'coerce': int}, 24 | 'version': {'required': True, 'type': 'string'}, 25 | 'changeset': {'required': True, 'type': 'integer', 'coerce': int}, 26 | 'timestamp': {'required': True, 'type': 'string'} 27 | } 28 | }, 29 | 'node_tags': { 30 | 'type': 'list', 31 | 'schema': { 32 | 'type': 'dict', 33 | 'schema': { 34 | 'id': {'required': True, 'type': 'integer', 'coerce': int}, 35 | 'key': {'required': True, 'type': 'string'}, 36 | 'value': {'required': True, 'type': 'string'}, 37 | 'type': {'required': True, 'type': 'string'} 38 | } 39 | } 40 | }, 41 | 'way': { 42 | 'type': 'dict', 43 | 'schema': { 44 | 'id': {'required': True, 'type': 'integer', 'coerce': int}, 45 | 'user': {'required': True, 'type': 'string'}, 46 | 'uid': {'required': True, 'type': 'integer', 'coerce': int}, 47 | 'version': {'required': True, 'type': 'string'}, 48 | 'changeset': {'required': True, 'type': 'integer', 'coerce': int}, 49 | 'timestamp': {'required': True, 'type': 'string'} 50 | } 51 | }, 52 | 'way_nodes': { 53 | 'type': 'list', 54 | 'schema': { 55 | 'type': 'dict', 56 | 'schema': { 57 | 'id': {'required': True, 'type': 'integer', 'coerce': int}, 58 | 'node_id': {'required': True, 'type': 'integer', 
'coerce': int}, 59 | 'position': {'required': True, 'type': 'integer', 'coerce': int} 60 | } 61 | } 62 | }, 63 | 'way_tags': { 64 | 'type': 'list', 65 | 'schema': { 66 | 'type': 'dict', 67 | 'schema': { 68 | 'id': {'required': True, 'type': 'integer', 'coerce': int}, 69 | 'key': {'required': True, 'type': 'string'}, 70 | 'value': {'required': True, 'type': 'string'}, 71 | 'type': {'required': True, 'type': 'string'} 72 | } 73 | } 74 | } 75 | } 76 | -------------------------------------------------------------------------------- /4-Data-Wrangling/Tags_Types.py: -------------------------------------------------------------------------------- 1 | import xml.etree.cElementTree as ET 2 | import pprint 3 | from collections import defaultdict 4 | import re 5 | 6 | ''' 7 | The code below allows you to check the k value for each tag. 8 | By classifying the tagss into few categories: 9 | 1. "lower": valid tags containing only lowercase letters 10 | 2. "lower_colon": valid tags with a colon in the names 11 | 3. "problemchars": tags with problematic characters 12 | 4. "other": other tags that don't fall into the 3 categories above 13 | ''' 14 | lower = re.compile(r'^([a-z]|_)*$') 15 | lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$') 16 | problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]') 17 | 18 | def key_type(element, keys): 19 | if element.tag == "tag": 20 | k = element.attrib['k'] 21 | if re.search(lower,k): 22 | keys["lower"] += 1 23 | elif re.search(lower_colon,k): 24 | keys["lower_colon"] += 1 25 | elif re.search(problemchars,k): 26 | keys["problemchars"] += 1 27 | else: 28 | keys["other"] += 1 29 | return keys 30 | 31 | def process_map(filename): 32 | keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0} 33 | for _, element in ET.iterparse(filename): 34 | keys = key_type(element, keys) 35 | return keys 36 | 37 | def test(): 38 | keys = process_map('san-jose_california.osm') 39 | pprint.pprint(keys) 40 | 41 | if __name__ == "__main__": 42 | test() -------------------------------------------------------------------------------- /4-Data-Wrangling/Update_Zipcode.py: -------------------------------------------------------------------------------- 1 | ''' 2 | This code will update non 5-digit zipcode. 3 | If it is 8/9-digit, only the first 5 digits are kept. 4 | If it has the state name in front, only the 5 digits are kept. 5 | If it is something else, will not change anything as it might result in error when validating the csv file. 
6 | ''' 7 | def update_zipcode(zipcode): 8 | """Clean postcode to a uniform format of 5 digit; Return updated postcode""" 9 | if re.findall(r'^\d{5}$', zipcode): # 5 digits 02118 10 | valid_zipcode = zipcode 11 | return valid_zipcode 12 | elif re.findall(r'(^\d{5})-\d{3}$', zipcode): # 8 digits 02118-029 13 | valid_zipcode = re.findall(r'(^\d{5})-\d{3}$', zipcode)[0] 14 | return valid_zipcode 15 | elif re.findall(r'(^\d{5})-\d{4}$', zipcode): # 9 digits 02118-0239 16 | valid_zipcode = re.findall(r'(^\d{5})-\d{4}$', zipcode)[0] 17 | return valid_zipcode 18 | elif re.findall(r'CA\s*\d{5}', zipcode): # with state code CA 02118 19 | valid_zipcode =re.findall(r'\d{5}', zipcode)[0] 20 | return valid_zipcode 21 | else: #return default zipcode to avoid overwriting 22 | return zipcode 23 | 24 | def test_zip(): 25 | for zips, ways in zip_print.iteritems(): 26 | for name in ways: 27 | better_name = update_zipcode(name) 28 | print name, "=>", better_name 29 | 30 | if __name__ == '__main__': 31 | test_zip() 32 | -------------------------------------------------------------------------------- /5-Exploratory-Data-Analysis/Exploratory Data Analysis.rmd: -------------------------------------------------------------------------------- 1 | Loan Data from Prosper by Kai Sheng TEH 2 | ======================================================== 3 | 4 | ```{r echo=FALSE, message=FALSE, warning=FALSE, packages} 5 | # Packages that were used are loaded 6 | library(ggplot2) 7 | library(plyr) 8 | library(dplyr) 9 | library(reshape) 10 | library(reshape2) 11 | library(knitr) 12 | library(ggthemes) 13 | library(gridExtra) 14 | library(tidyr) 15 | library(GGally) 16 | library(progress) 17 | library(scales) 18 | ``` 19 | 20 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Load_the_Data} 21 | # Load the Data 22 | prosper <- read.csv('prosperLoanData.csv', na.strings=c("", "NA")) 23 | ``` 24 | 25 | Prosper is a San Francisco-based peer-to-peer lending company established in 2005. Prosper generates revenue by charging borrowers an origination fees of 1% to 5% to verify identities and assess credibility of borrowers. It also charges investors a 1% annual servicing fee. The dataset provided to Udacity was last updated in 11th March, 2014. 26 | 27 | # Univariate Plots Section 28 | 29 | To kickstart the analysis, analyses below are carried out: 30 | 31 | ```{r echo=FALSE, message=FALSE, warning=FALSE, List_of_Variables} 32 | # List all the variable 33 | names(prosper) 34 | ``` 35 | 36 | There are a total of 81 columns in the dataset. Excluding listing-related identifiers, there should be around 70 variables. 37 | 38 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Summaries} 39 | # Summaries of each variable 40 | summary(prosper) 41 | ``` 42 | 43 | Analyses included in the summaries of variables above: 44 | 45 | a) range for continuous variables 46 | 47 | b) top 5 items in discrete variables 48 | 49 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Internal_Structure} 50 | # Internal Structure of each variable 51 | str(prosper) 52 | ``` 53 | 54 | Internal structures of the 81 variables are as above. 
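As a quick supplementary check on completeness (a minimal sketch added here; it only assumes the `prosper` data frame loaded above, and the chunk label is new), the share of missing values per variable can be tabulated before plotting:

```{r echo=FALSE, message=FALSE, warning=FALSE, Missing_Value_Share}
# Proportion of missing values per variable; the ten least complete columns shown
na_share <- sort(colMeans(is.na(prosper)), decreasing = TRUE)
head(round(na_share, 3), 10)
```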
55 | 
56 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanbyYearQuarter}
57 | # New factor levels created for Loan Origination Quarter
58 | LoanOriginationQuarter2 = unique(prosper$LoanOriginationQuarter)
59 | LoanOriginationQuarter2 = 
60 |   LoanOriginationQuarter2[order(substring(LoanOriginationQuarter2,4,7),
61 |                                 substring(LoanOriginationQuarter2,1,2))]
62 | 
63 | # Histogram and summary of Loan Origination Quarter
64 | ggplot(data = prosper, aes(x = LoanOriginationQuarter))+
65 |   geom_histogram(stat ='count')+
66 |   scale_x_discrete(limits=LoanOriginationQuarter2)+
67 |   theme(axis.text.x = element_text(angle = 60, vjust = 0.6))
68 | summary(prosper$LoanOriginationQuarter)
69 | ```
70 | 
71 | There is an increasing trend from end-2005 till 2014, except for the period from end-2008 till early 2009. The drop in loans being approved could be due to the Global Financial Crisis. There is also a dip at the end of 2012, which could be caused by the European sovereign debt crisis.
72 | 
73 | ```{r echo=FALSE, message=FALSE, warning=FALSE, ProsperRating(Alpha)}
74 | # Prosper Rating (Alpha) levels are rearranged
75 | prosper$ProsperRating..Alpha. <- factor(prosper$ProsperRating..Alpha.,
76 |                                         levels = c('AA','A','B','C',
77 |                                                    'D','E','HR','NA'))
78 | 
79 | # Histogram and summary of Prosper Rating (Alpha)
80 | ggplot(data = prosper, aes(x = ProsperRating..Alpha.))+
81 |   geom_histogram(stat = 'count')
82 | summary(prosper$ProsperRating..Alpha.)
83 | ```
84 | 
85 | The majority of borrowers are not classified. Among those rated, 'C' is the most common rating. 'AA' is the highest rating and relatively few borrowers qualify for it. Excluding the non-classified, the plot shows a roughly normal distribution.
86 | 
87 | ```{r echo=FALSE, message=FALSE, warning=FALSE, ProsperScore}
88 | # Prosper Score levels are rearranged
89 | ProsperScore2 <- factor(prosper$ProsperScore,
90 |                         levels = c('1','2','3','4','5','6','7',
91 |                                    '8','9','10','11','NA'))
92 | 
93 | # Histogram and summary of Prosper Score
94 | ggplot(data = prosper, aes(x = ProsperScore2))+
95 |   geom_histogram(stat = 'count')
96 | summary(prosper$ProsperScore)
97 | ```
98 | 
99 | The majority of loan applicants are not rated. Among those rated, most have a score between 4 and 8.
100 | 
101 | ```{r echo=FALSE, message=FALSE, warning=FALSE, IncomeRange}
102 | # Income Range levels are rearranged
103 | prosper$IncomeRange = factor(prosper$IncomeRange,
104 |                              levels = c("Not employed", "$0", "$1-24,999",
105 |                                         "$25,000-49,999", "$50,000-74,999",
106 |                                         "$75,000-99,999", "$100,000+",
107 |                                         "Not displayed"))
108 | 
109 | # Histogram and summary of Income Range
110 | ggplot(data = prosper, aes(x = IncomeRange))+
111 |   geom_histogram(stat = 'count')+
112 |   theme(axis.text.x = element_text(angle = 30, vjust = 0.6))
113 | summary(prosper$IncomeRange)
114 | ```
115 | 
116 | The median household income in the USA was $53,657 in 2014 (U.S. Census Bureau), and most of the borrowers are from the middle or lower-middle class.
117 | 
118 | There are fewer borrowers earning more than $75,000, as they usually have savings to cover their needs. It is worth noting that comparatively few loans are approved for those earning less than $25,000, as they are deemed too risky to lend money to.
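To put rough numbers behind these observations (a small sketch reusing the reordered IncomeRange factor from the chunk above; the chunk label is new), the share of listings per income bracket can be tabulated:

```{r echo=FALSE, message=FALSE, warning=FALSE, Income_Range_Share}
# Percentage of listings in each income bracket
round(100 * prop.table(table(prosper$IncomeRange)), 1)
```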
119 | 120 | ```{r echo=FALSE, message=FALSE, warning=FALSE, DebtToIncomeRatio} 121 | # Histogram of Debt to Income Ratio 122 | p1 <- ggplot(data = prosper, aes(x = DebtToIncomeRatio))+ 123 | geom_histogram(binwidth = 0.02) 124 | 125 | # Histogram of Debt to Income Ratio (top 1% data removed) 126 | p2 <- ggplot(data = prosper, aes(x = DebtToIncomeRatio))+ 127 | geom_histogram(binwidth = 0.02)+ 128 | xlim(0, quantile(prosper$DebtToIncomeRatio, prob = 0.99, na.rm = TRUE)) 129 | 130 | # 2 histogram charts arranged side by side and summary 131 | grid.arrange(p1, p2, ncol = 2) 132 | summary(prosper$DebtToIncomeRatio) 133 | ``` 134 | 135 | The debt-to-income ratio histogram on the left has a long tail where there are few people with a ratio of 10, which indicates them as risky borrowers as their income is too low to service their debt. By removing the top 1% outliers, we can see that most borrowers have a ratio of around 0.2. 136 | 137 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanOriginalAmount} 138 | # Histogram of Loan Original Amount 139 | p3 <- ggplot(data = prosper, aes(x = LoanOriginalAmount))+ 140 | geom_histogram(binwidth = 1000)+ 141 | scale_x_continuous(breaks = seq(0, 40000, 5000))+ 142 | theme(axis.text.x = element_text(angle = 30)) 143 | 144 | # Histogram of Loan Original Amount (Top 1% outliers removed) 145 | p4 <- ggplot(data = prosper, aes(x = LoanOriginalAmount))+ 146 | geom_histogram(binwidth = 1000)+ 147 | scale_x_continuous(breaks = seq(0, 25000, 5000))+ 148 | xlim(0, quantile(prosper$LoanOriginalAmount, prob = 0.99, na.rm = TRUE))+ 149 | theme(axis.text.x = element_text(angle = 30)) 150 | 151 | # 2 histogram charts arranged side by side and summary 152 | grid.arrange(p3, p4, ncol = 2) 153 | summary(prosper$LoanOriginalAmount) 154 | ``` 155 | 156 | We can see that most of the loan amount are around $5,000. 157 | 158 | There are occasional spikes in $5k, $10k, $15k, $20k and even up $35k which are explainable by the fact that they are multiples of 5,000 where most people tend to use when deciding the amount to borrow. 159 | 160 | ```{r echo=FALSE, message=FALSE, warning=FALSE, EmploymentStatus} 161 | # Employment Status are rearranged 162 | prosper$EmploymentStatus = factor(prosper$EmploymentStatus, 163 | levels = c("Employed","Full-time", 164 | "Part-time","Self-employed", 165 | "Retied","Not employed", 166 | "Other","Not available","NA")) 167 | 168 | # Histogram and summary of Employment Status 169 | ggplot(data = prosper, aes(x = EmploymentStatus))+ 170 | geom_histogram(stat = 'count')+ 171 | theme(axis.text.x = element_text(angle = 30, vjust = 0.6)) 172 | summary(prosper$EmploymentStatus) 173 | ``` 174 | 175 | Most of the borrowers are employed, be it full-time, part-time, self-employed or non-specified. This makes sense as loan applicants need to demonstrate that they have stable income to pay back the loan. 176 | 177 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanTerms(Month)} 178 | # Histogram and summary of Loan Terms in months 179 | Term2 = factor(prosper$Term, levels = c("12", "36", "60")) 180 | ggplot(data = prosper, aes(x = Term2))+ 181 | geom_histogram(stat = 'count') 182 | summary(prosper$Term) 183 | ``` 184 | 185 | Majority of the borrowers have a loan period of 36 months or 3 years. 
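As a quick check on this (a small sketch reusing the Term2 factor defined in the chunk above), the share of loans per term can be computed directly:

```{r echo=FALSE, message=FALSE, warning=FALSE, Term_Share}
# Percentage of loans by term length in months
round(100 * prop.table(table(Term2)), 1)
```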
186 | 187 | ```{r echo=FALSE, message=FALSE, warning=FALSE, MonthlyLoanPayment} 188 | # Histogram of Monthly Loan Payment 189 | p5 <- ggplot(data = prosper, aes(x = MonthlyLoanPayment))+ 190 | geom_histogram(binwidth = 50) 191 | 192 | # Histogram of Monthly Loan Payment (top 1% data removed) 193 | p6 <- ggplot(data = prosper, aes(x = MonthlyLoanPayment))+ 194 | geom_histogram(binwidth = 50)+ 195 | xlim(0, quantile(prosper$MonthlyLoanPayment, prob = 0.99, na.rm = TRUE))+ 196 | theme(axis.text.x = element_text(angle = 30)) 197 | 198 | # 2 histogram charts arranged side by side and summary 199 | grid.arrange(p5, p6, ncol = 2) 200 | summary(prosper$MonthlyLoanPayment) 201 | 202 | # Mode of Monthly Loan Payment 203 | names(sort(-table(prosper$MonthlyLoanPayment)))[1] 204 | ``` 205 | 206 | Majority of the monthly loan payment are less than $250. 207 | 208 | $174 is the most common amount of monthly installment and only few borrowers have an installment of exceeding $1,000. 209 | 210 | ```{r echo=FALSE, message=FALSE, warning=FALSE, CreditScoreRange} 211 | # Trend charts of lower (orange) and upper (blue) range of credit score 212 | ggplot(data = prosper)+ 213 | geom_line(stat = 'count', 214 | aes(x = CreditScoreRangeLower, color = "CreditScoreRangeLower"))+ 215 | geom_line(stat = 'count', 216 | aes(x = CreditScoreRangeUpper, color = "CreditScoreRangeUpper"))+ 217 | scale_color_manual("Credit Score Range", 218 | values = c('CreditScoreRangeLower' = 'Orange', 219 | 'CreditScoreRangeUpper' = 'Blue'))+ 220 | scale_x_continuous(breaks = seq(0, 900, 100)) 221 | 222 | # Summary of Credit Score Range (lower) and (upper) 223 | summary(prosper$CreditScoreRangeLower) 224 | summary(prosper$CreditScoreRangeUpper) 225 | ``` 226 | 227 | Line charts instead of bar charts are chosen to better reflect the range of score overlaid on top of each other. 228 | 229 | The credit score range for most borrowers are between 650 to 750 and the gap between upper and lower range is around 20 points for most borrowers. 230 | 231 | ```{r echo=FALSE, message=FALSE, warning=FALSE, IsBorrowerHomeowner} 232 | # Histogram and summary on whether borrower is a homeowner 233 | ggplot(data = prosper, aes(x = IsBorrowerHomeowner))+ 234 | geom_histogram(stat = 'count') 235 | summary(prosper$IsBorrowerHomeowner) 236 | ``` 237 | 238 | Homeownership is roughly equally split between True and False for borrowers. 239 | 240 | From this, it can be deduced that homeownership might not be the top factors in deciding whether to extend the loans to borrowers. 241 | 242 | ```{r echo=FALSE, message=FALSE, warning=FALSE, BorrowerRate_LenderYield} 243 | # Histogram of Borrower Rate 244 | p7 <- ggplot(data = prosper, aes(x = BorrowerRate))+ 245 | geom_histogram(binwidth = 0.01)+ 246 | scale_x_continuous(breaks = seq(-0.1, 0.5, 0.05))+ 247 | theme(axis.text.x = element_text(angle = 90)) 248 | 249 | # Histogram of Lender Yield 250 | p8 <- ggplot(data = prosper, aes(x = LenderYield))+ 251 | geom_histogram(binwidth = 0.01)+ 252 | scale_x_continuous(breaks = seq(-0.05, 0.5, 0.05))+ 253 | theme(axis.text.x = element_text(angle = 90)) 254 | 255 | # 2 histogram charts arranged side by side and summaries 256 | grid.arrange(p7, p8, ncol = 2) 257 | summary(prosper$BorrowerRate) 258 | summary(prosper$LenderYield) 259 | ``` 260 | 261 | The histograms show a bimodal distribution. Majority of the borrower rates and lender yield are between 0.1 and 0.2. 
The peak at above 0.3 could be possibly explained by the more common rate given to borrowers with less stellar creditworthiness. 262 | 263 | When compared to the borrower rate, lender yield shows a similar trend with the x-axis shifted slightly to the left by 0.01. This could be explained by the fact that Prosper probably charges a 1% fees as its revenue. 264 | 265 | ```{r echo=FALSE, message=FALSE, warning=FALSE, BorrowerState} 266 | # Histogram and summary of State origin of Borrower 267 | ggplot(data = prosper, aes(x = BorrowerState))+ 268 | geom_histogram(stat = 'count')+ 269 | theme(axis.text.y = element_text(size = 5))+ 270 | scale_y_continuous(breaks = seq(0, 15000, 2500))+ 271 | coord_flip() 272 | summary(prosper$BorrowerState) 273 | ``` 274 | 275 | California by far has the most borrowers at slightly less than 15,000, followed by Georgia, Florida, Illinois, New York and Texas which has between 5,000 and 7,000 borrowers each. 276 | 277 | The high number of borrowers from these states doesn't come as surprise as they are among the states with the most population. However, the much higher number of borrowers from California is not proportional to its population when compared to Texas. One hypothesis is that it enjoys higher awareness among Californians as an alternative to bank loans could be the reasons due to its location in California. 278 | 279 | # Univariate Analysis 280 | 281 | ### What is the structure of your dataset? 282 | The dataset contains 81 variables with 113937 observations from year 2005 to 2014. 283 | 284 | ### What is/are the main feature(s) of interest in your dataset? 285 | The typical characteristics of the borrowers are of interest for this dataset. Various plots are created to observe and identify the trend of each variable. 286 | 287 | ### What other features in the dataset do you think will help support your investigation into your feature(s) of interest? 288 | Income range, Debt-to-income ratio are few of the variables that will help to explain why the loans were approved and what are the yield/rate for the loans. 289 | 290 | ### Did you create any new variables from existing variables in the dataset? 291 | No, but I rearranged the factors such Prosper's rating (Alpha), Prosper's score, income range, employmenet status and loan term (months) so that the charts can be understood more easily. I also created new factors value for loan origination quarter to facilitate the ordering by year and quarter later on. 292 | 293 | ### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this? 294 | Most features do not have any unusual distributions and if they do, they are explainable by some other factors. The only one that I am interested in is spike at 0.3 in the borrower rate and lender yield. My expectation was that the graph is skewed towards lower rate to favor borrower with better credit history for risk management. 295 | 296 | 297 | # Bivariate Plots Section 298 | 299 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanQuarter_Term_Count} 300 | # Relationship between Loan Origination Quarter and Loan Term (month) 301 | ggplot(data = prosper, aes(x = LoanOriginationQuarter, fill = Term2))+ 302 | geom_histogram(stat ='count')+ 303 | scale_x_discrete(limits=LoanOriginationQuarter2)+ 304 | theme(axis.text.x = element_text(angle = 60, vjust = 0.6)) 305 | ``` 306 | 307 | Initially, only 36-month term loan were given. 
12-month and 60-month term loan were introduced in Q4 2010 but only 60-month term loan took off. 12-month term loan is believed to be discontinued in end 2012. 308 | 309 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Score_RateYield} 310 | # Relationship between Prosper Rating and Borrower Rate in boxplot 311 | p9 <- ggplot(data = prosper, aes(x = ProsperScore2, y = BorrowerRate))+ 312 | geom_boxplot() 313 | 314 | # Relationship between Prosper Rating and Lender Yield in boxplot 315 | p10 <- ggplot(data = prosper, aes(x = ProsperScore2, y = LenderYield))+ 316 | geom_boxplot() 317 | 318 | # Arrange both charts side by side 319 | grid.arrange(p9, p10, ncol = 2) 320 | 321 | # Correlation of Prosper Score with Borrower Rate and Lender Yield 322 | with(prosper, cor.test(ProsperScore, BorrowerRate, method = 'pearson')) 323 | with(prosper, cor.test(ProsperScore, LenderYield, method = 'pearson')) 324 | ``` 325 | 326 | The boxplots above shows that Borrower Rate and Lender Yield decrease with improved Prosper's score. Applicants with better rating pose less risk and thus have lower chance of defaulting. Therefore, lenders are willing to charge less interest rate. 327 | 328 | ```{r echo=FALSE, message=FALSE, warning=FALSE, ProsperScore_IncomeLoanAmount} 329 | # Relationship between Prosper's Score and Stated Monthly Income in boxplot 330 | p11 <- ggplot(data = prosper, aes(x = ProsperScore2, y = StatedMonthlyIncome))+ 331 | geom_boxplot()+ 332 | ylim(0, quantile(prosper$StatedMonthlyIncome, prob=0.99, na.rm=TRUE)) 333 | 334 | # Relationship between Prosper's Score and Borrower Rate in boxplot 335 | p12 <- ggplot(data = prosper, aes(x = ProsperScore2, y = LoanOriginalAmount))+ 336 | geom_boxplot() 337 | 338 | # Arrange both charts side by side 339 | grid.arrange(p11, p12, ncol = 2) 340 | 341 | # Correlation of Prosper's Score with Stated Monthly Income and 342 | # Loan Original Amount 343 | with(prosper, cor.test(ProsperScore, StatedMonthlyIncome, method = 'pearson')) 344 | with(prosper, cor.test(ProsperScore, LoanOriginalAmount, method = 'pearson')) 345 | ``` 346 | 347 | Delving into the monthly income and loan amount, both boxplot charts didn't present any surprises. Applicants with higher rating tend to have higher monthly income and larger loan amount. 348 | 349 | ```{r echo=FALSE, message=FALSE, warning=FALSE, EmploymentStatus_LoanAmount} 350 | # Relationship between Employment Status and Loan Amount in boxplot 351 | ggplot(data = prosper, aes(x= EmploymentStatus, y = LoanOriginalAmount))+ 352 | geom_boxplot()+ 353 | theme(axis.text.x = element_text(angle = 30, vjust = 0.6))+ 354 | ylim(0, quantile(prosper$StatedMonthlyIncome, prob = 0.99, na.rm = TRUE)) 355 | ``` 356 | 357 | Looking at the relationship between employment status and loan amount, employed, self-employed and full-time borrowers are usually afforded higher loan amount as opposed to part-timers, not employed or not available. 
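To quantify the boxplot comparison (a supplementary sketch; the dplyr summary below is an addition and not part of the original analysis), the median loan amount per employment status can be tabulated:

```{r echo=FALSE, message=FALSE, warning=FALSE, Loan_By_Employment}
# Median loan amount and number of listings for each employment status
prosper %>%
  group_by(EmploymentStatus) %>%
  summarise(median_loan = median(LoanOriginalAmount, na.rm = TRUE),
            listings = n()) %>%
  arrange(desc(median_loan))
```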
358 | 359 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Term_LoanAmountRateYield} 360 | # Relationship between Loan Term (month) and Loan Amount in boxplot 361 | p13 <- ggplot(data = prosper, aes(x = Term2, y = LoanOriginalAmount))+ 362 | geom_boxplot() 363 | 364 | # Relationship between Loan Term (month) and Borrower Rate in boxplot 365 | p14 <- ggplot(data = prosper, aes(x = Term2, y = BorrowerRate))+ 366 | geom_boxplot() 367 | 368 | # Relationship between Loan Term (month) and Lender Yield in boxplot 369 | p15 <- ggplot(data = prosper, aes(x = Term2, y = LenderYield))+ 370 | geom_boxplot() 371 | 372 | # Arrange both charts side by side 373 | grid.arrange(p13, p14, p15, ncol = 3) 374 | 375 | # Correlation of Loan Term (months) with Loan Original Amount, 376 | # Borrower Rate and Lender Yield 377 | with(prosper, cor.test(Term, LoanOriginalAmount, method = 'pearson')) 378 | with(prosper, cor.test(Term, BorrowerRate, method = 'pearson')) 379 | with(prosper, cor.test(Term, LenderYield, method = 'pearson')) 380 | ``` 381 | 382 | When investigating the effect of loan term (months), it can be said that loans with longer terms usually come with larger amount. As such, a higher interest rate is levied due to higher risk exposure. This is the same for lender yield as higher interest rate is needed to attract investor to lend to riskier borrowers. 383 | 384 | ```{r echo=FALSE, message=FALSE, warning=FALSE, IncomeRange_LoanAmountRatio} 385 | # Relationship between Income Range and Loan Amount in boxplot 386 | p16 <- ggplot(data = prosper, aes(x = IncomeRange, y = LoanOriginalAmount))+ 387 | geom_boxplot()+ 388 | theme(axis.text.x = element_text(angle = 30, vjust = 0.6))+ 389 | ylim(0, quantile(prosper$LoanOriginalAmount, prob=0.99, na.rm=TRUE)) 390 | 391 | # Relationship between Income Range and Debt to Income Ratio in boxplot 392 | p17 <- ggplot(data = prosper, aes(x = IncomeRange, y = DebtToIncomeRatio))+ 393 | geom_boxplot()+ 394 | theme(axis.text.x = element_text(angle = 30, vjust = 0.6))+ 395 | ylim(0, quantile(prosper$DebtToIncomeRatio, prob=0.99, na.rm=TRUE)) 396 | 397 | # Arrange both charts side by side 398 | grid.arrange(p16, p17, ncol=2) 399 | 400 | # Correlation of Stated Monthly Income with Loan Original Amount, 401 | # and Debt to Income Ratio 402 | with(prosper, cor.test(StatedMonthlyIncome, LoanOriginalAmount, 403 | method = 'pearson')) 404 | with(prosper, cor.test(StatedMonthlyIncome, DebtToIncomeRatio, 405 | method = 'pearson')) 406 | ``` 407 | 408 | When comparing the Income Range, those with higher income are able to borrow more as they also tend to have a lower debt-to-income ratio which indicates lower risk. 
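The same pattern can be summarised numerically (a small supplementary sketch with a new chunk label) by computing the median loan amount and median debt-to-income ratio within each income bracket:

```{r echo=FALSE, message=FALSE, warning=FALSE, Income_Range_Medians}
# Median loan amount and debt-to-income ratio within each income bracket
prosper %>%
  group_by(IncomeRange) %>%
  summarise(median_loan = median(LoanOriginalAmount, na.rm = TRUE),
            median_dti = median(DebtToIncomeRatio, na.rm = TRUE))
```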
409 | 410 | ```{r echo=FALSE, message=FALSE, warning=FALSE, DebttoIncomeRatio_RateYield} 411 | # Debt to Income Ratio are cut into intervals 412 | DebtToIncomeRatio2 <- cut(prosper$DebtToIncomeRatio, 413 | breaks = c(0, 0.2, 0.4, 0.6, 414 | 0.8, 1, 1.5, 10.1)) 415 | 416 | # Relationship between Debt to Income Ratio and Borrower Rate in boxplot 417 | p18 <- ggplot(data = prosper, aes(x = DebtToIncomeRatio2, y = BorrowerRate))+ 418 | geom_boxplot()+ 419 | theme(axis.text.x = element_text(angle = 90, vjust = 0.6)) 420 | 421 | # Relationship between Debt to Income Ratio and Lender Yield in boxplot 422 | p19 <- ggplot(data = prosper, aes(x = DebtToIncomeRatio2, y = LenderYield))+ 423 | geom_boxplot()+ 424 | theme(axis.text.x = element_text(angle = 90, vjust = 0.6)) 425 | 426 | # Arrange both charts side by side 427 | grid.arrange(p18, p19, ncol = 2) 428 | 429 | # Correlation of Debt to Income Ratio and Borrower Rate / Lender Yield 430 | with(prosper, cor.test(DebtToIncomeRatio, BorrowerRate, method = 'pearson')) 431 | with(prosper, cor.test(DebtToIncomeRatio, LenderYield, method = 'pearson')) 432 | ``` 433 | 434 | Lower debt-to-income ratio does lead to lower borrower rate or lender yield. That is because those with lower debt to income ratio indicates that they have better ability to service their loan installment and therefore have lower probability of defaulting on their loan. 435 | 436 | # Bivariate Analysis 437 | 438 | ### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset? 439 | In general, most of the relationship observed in the charts are aligned with my expectation. Applicants with higher Prosper's score and lower debt-to-income ratio are able to enjoy lower borrower rate. 440 | 441 | ### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)? 442 | Even though borrower rate tends to increase as debt-to-income ratio increases, that seems to not be the case for those with a debt-to-income ratio of more than 1.5. There are probably other factors that lead to lower borrower rate. Further investigation is needed to understand this anomaly. 443 | 444 | ### What was the strongest relationship you found? 445 | The strongest relatioship found is between Prosper's score and borrower rate where higher Prosper's score leads to lower borrower rate. Its correlation coefficient is -0.66. 446 | 447 | # Multivariate Plots Section 448 | 449 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Ratio_Rate_Income_LoanAmt} 450 | # Relationship between Debt to Income Ratio, Borrower Rate, Income Range 451 | # and Loan Original Amount 452 | ggplot(data = prosper, aes(x = DebtToIncomeRatio, y = BorrowerRate, 453 | color = IncomeRange, size = LoanOriginalAmount))+ 454 | geom_point()+ 455 | scale_x_continuous(breaks = seq(0, 10, 1))+ 456 | xlim(1.5, 10) 457 | ``` 458 | 459 | Continuing my investigtion of the relationship between debt-to-income ratio and borrower rate, I removed all debt-to-income ratios of less than 1.5. From the scatterplotplot, it is rather a surprise that lots of borrowers with low or unidentified income are able to borrow a large sum of more than $10,000 with low borrower rate (less than 0.25). I suspect that there are other variables behind it. 
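Before bringing other variables in, a quick count shows how common these surprising cases are (a sketch using the same 1.5 cut-off as the plot; the $10,000 and 0.25 thresholds simply restate the observation above):

```{r echo=FALSE, message=FALSE, warning=FALSE, High_Ratio_Low_Rate_Count}
# Listings with a debt-to-income ratio above 1.5 that still obtained a loan
# larger than $10,000 at a borrower rate below 0.25
with(subset(prosper, DebtToIncomeRatio > 1.5),
     sum(LoanOriginalAmount > 10000 & BorrowerRate < 0.25, na.rm = TRUE))
```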
460 | 461 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Ratio_Rate_Score} 462 | # Relationship between Debt to Income Ratio, Borrower Rate, Prosper's Score 463 | ggplot(data = prosper, aes(x = DebtToIncomeRatio, y = BorrowerRate, 464 | color = ProsperScore2))+ 465 | geom_point()+ 466 | scale_x_continuous(breaks = seq(0, 10, 1))+ 467 | xlim(1.5, 10) 468 | ``` 469 | 470 | By further extending my investigation, it shows that these borrowers with low income but yet able to borrow with low rate have rather good Prosper's score (exclude those with 'NA' score). That explains the anomalies in the debt-to-income ratio vs borrower rate boxplot chart. 471 | 472 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Rate_Ratio_RatingAlpha} 473 | # Relationship between Debt to Debt to Income Ratio, Borrower Rate 474 | # and Prosper's Rating (Alpha) 475 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 476 | aes(x = DebtToIncomeRatio, y = BorrowerRate, 477 | color = ProsperRating..Alpha.))+ 478 | geom_point()+ 479 | xlim(0, quantile(prosper$DebtToIncomeRatio, prob = 0.9965, na.rm = TRUE)) 480 | ``` 481 | 482 | One interesting finding is borrower rate of respective Prosper's rating tends to have a narrow range of borrower rate irregardless with the debt-to-income ratio. Debt-to-income ratio seems not to matter much as long as borrowers establish a good credit rating score. 483 | 484 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Yield_Ratio_RatingAlpha} 485 | # Relationship between Debt to Debt to Income Ratio, Lender Yield 486 | # and Prosper's Rating (Alpha) 487 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 488 | aes(x = DebtToIncomeRatio, y = LenderYield, 489 | color = ProsperRating..Alpha.))+ 490 | geom_point()+ 491 | xlim(0, quantile(prosper$DebtToIncomeRatio, prob = 0.9965, na.rm = TRUE)) 492 | ``` 493 | 494 | Similarly, lender yield is more likely to be determined by the Prosper's rating of the borrowers than the debt-to-income ratio. 495 | 496 | ```{r echo=FALSE, message=FALSE, warning=FALSE, MonthlyIncome_LoanAmt_Rating} 497 | # Relationship between Stated Monthly Income, Loan Amount, 498 | # and Prosper's Rating (Alpha) 499 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 500 | aes(x = StatedMonthlyIncome, y = LoanOriginalAmount, 501 | color = ProsperRating..Alpha.))+ 502 | geom_point(alpha = 0.2)+ 503 | xlim(0, quantile(prosper$StatedMonthlyIncome, prob = 0.99, na.rm = TRUE)) 504 | ``` 505 | 506 | Borrowers with no credit history are allowed to borrow less than $5,000 in general while borrowers with good ratings are allowed to borrow more up to $35,000. 507 | 508 | ```{r echo=FALSE, message=FALSE, warning=FALSE, MonthlyIncome_Ratio_Rating} 509 | # Relationship between Stated Monthly Income, Debt-to-Income Ratio 510 | # and Prosper's Rating (Alpha) 511 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 512 | aes(x = StatedMonthlyIncome, y = DebtToIncomeRatio, 513 | color = ProsperRating..Alpha.))+ 514 | geom_point()+ 515 | xlim(0, quantile(prosper$StatedMonthlyIncome, prob = 0.99, na.rm = TRUE)) 516 | ``` 517 | 518 | It seems that most of the applicants that fulfil undesirable features of 'bad borrowers' are those with no prior Prosper's rating, low monthly income and high debt-to-income ratio. However, this is normal especially for young graduates who just started out. 
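As a rough cross-check of the apparent loan ceilings described above, the loan amounts can be summarised per rating, keeping the missing ratings (no prior credit history with Prosper) as their own group. This is only a sketch with `dplyr`, not an additional plot.

```{r echo=FALSE, message=FALSE, warning=FALSE, LoanAmount_By_Rating_Sketch}
# Illustrative summary: median and maximum loan amount for each Prosper rating,
# with NA ratings kept as a separate group
prosper %>%
  group_by(ProsperRating..Alpha.) %>%
  summarise(median_loan = median(LoanOriginalAmount, na.rm = TRUE),
            max_loan = max(LoanOriginalAmount, na.rm = TRUE),
            n_loans = n())
```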
519 | 520 | ```{r echo=FALSE, message=FALSE, warning=FALSE, MonthlyIncome_LoanAmt_Term} 521 | # Relationship between Stated Monthly Income, Loan Original Amount 522 | # and Loan Term (month) 523 | ggplot(data = prosper, 524 | aes(x = StatedMonthlyIncome, y = LoanOriginalAmount, color = Term2))+ 525 | geom_point(alpha = 0.8)+ 526 | xlim(0, quantile(prosper$StatedMonthlyIncome, prob = 0.99, na.rm = TRUE))+ 527 | scale_color_manual(values=c("#150303", "#E69F00", "#56B4E9")) 528 | ``` 529 | 530 | It can be seen that the length of the loan term somewhat corresponds to the amount of the loan. Larger loan typically requires longer monthly installment period irregardless of the monthly income. 531 | 532 | It is also necessary to take note that 12-month term loan were discontinued by 2012 possibly due to lack of interest as there might be an upper limit placed on the amount of loan figure. 533 | 534 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanOQtr_LoanAmt_Score} 535 | # Relationship between Loan Origination Quarter, Loan Original Amount, 536 | # Prosper's Score and Prosper's Rating (Alpha) 537 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 538 | aes(x = LoanOriginationQuarter, y = LoanOriginalAmount, 539 | color = ProsperScore))+ 540 | geom_point(alpha = 0.8)+ 541 | scale_x_discrete(limit = LoanOriginationQuarter2)+ 542 | facet_wrap( ~ ProsperRating..Alpha., ncol = 2)+ 543 | theme(axis.text.x = element_text(angle = 90, vjust = 0.6, size = 7)) 544 | ``` 545 | 546 | Using facet wrap, we can observe clearly that borrowers with lower Prosper's score are allowed to borrow smaller amount of loan due to higher perceived risk of defaulting. 547 | 548 | As Prosper expands over the years, the expansion mostly comes from borrowers with good rating while those with no credit history have decreased. It shows that Propser is pursuing more sustainable business model. Another probably explanation is that potential new customer have acquired some credit history over the years. 549 | 550 | ```{r echo=FALSE, message=FALSE, warning=FALSE, LoanQuarter_Count_LoanStatus} 551 | # Mutate the Loan Status to create a temporary variable with only 552 | # 3 categories, i.e. 'Performing Loan', 'Past Due' and 'Defaulted' 553 | prosper <- prosper %>% mutate(LoanStatus2 = ifelse(LoanStatus %in% 554 | c("Cancelled", "Chargedoff", "Defaulted"), 0, 555 | ifelse(LoanStatus %in% 556 | c("Completed", "Current", "FinalPaymentInProgress"), 2, 557 | 1))) 558 | 559 | # Rearrange the factors of Loan Status 560 | prosper$LoanStatus2 <- factor(prosper$LoanStatus2, levels = 2:0, 561 | labels = c("Performing Loan","Past Due","Defaulted")) 562 | 563 | # Relationship between Loan Origination Quarter, Loan Count and Loan Status 564 | ggplot(prosper, aes(x = LoanOriginationQuarter, fill = LoanStatus2))+ 565 | geom_bar(stat = "count")+ 566 | scale_x_discrete(limits=LoanOriginationQuarter2)+ 567 | theme(axis.text.x = element_text(angle = 90, vjust = 0.6))+ 568 | scale_fill_manual(values = c("Defaulted" = "red","Past Due" = "yellow", 569 | "Performing Loan" = "green")) 570 | ``` 571 | 572 | The barchart above confirms my hypothesis that Prosper expansion comes mostly from performing loans. Over the years, the number of defaulted loans has dropped. 573 | 574 | # Multivariate Analysis 575 | 576 | ### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest? 
577 | Monthly income is a strong factor in determining the borrower rate. At the same time, borrowers with no credit history tend to be those with lower salaries, which suggests that they might be young people who have just graduated or are still in school. 578 | 579 | ### Were there any interesting or surprising interactions between features? 580 | It is interesting to note that the debt-to-income ratio has minimal effect on the borrower rate, holding the Prosper's rating variable constant. The earlier chart showing that the debt-to-income ratio drops as salary increases suggests that the ratio is largely a dependent variable of monthly income. 581 | 582 | ------ 583 | 584 | # Final Plots and Summary 585 | 586 | ### Plot One 587 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_One} 588 | # Relationship between Prosper's Score and Borrower Rate in boxplot 589 | ggplot(data = prosper, aes(x = ProsperScore2, y = BorrowerRate, 590 | fill = ProsperScore2))+ 591 | geom_boxplot()+ 592 | ggtitle("Borrower Rate vs Prosper's Score")+ 593 | xlab("Prosper's Score")+ 594 | ylab("Borrower Rate")+ 595 | theme(plot.title = element_text(hjust = 0.5, size = 20))+ 596 | scale_fill_discrete(name = "Prosper's Score") 597 | ``` 598 | 599 | ### Description One 600 | The boxplots above show that borrowers with a better Prosper's score tend to enjoy a lower borrower rate. The range of the rate doesn't fluctuate much for most Prosper's scores, with the exception of the moderate scores between 4 and 7. 601 | 602 | However, there are outliers running against the trend for those with the best and worst scores. That could be due to other factors such as the amount of the loan taken, a new monthly income or a change in employment status. 603 | 604 | ### Plot Two 605 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_Two} 606 | # Relationship between Debt-to-Income Ratio, Borrower Rate 607 | # and Prosper's Rating (Alpha) 608 | ggplot(data = subset(prosper, !is.na(ProsperRating..Alpha.)), 609 | aes(x = DebtToIncomeRatio, y = BorrowerRate, 610 | color = ProsperRating..Alpha.))+ 611 | geom_point(alpha = 0.5)+ 612 | xlim(0, quantile(prosper$DebtToIncomeRatio, prob = 0.9965, na.rm = TRUE))+ 613 | xlab("Debt-to-Income Ratio")+ 614 | ylab("Borrower Rate")+ 615 | ggtitle("Effect of Prosper's Rating (Alpha) on 616 | Borrower Rate vs Debt-to-Income Ratio")+ 617 | theme(plot.title = element_text(hjust = 0.5, size = 15))+ 618 | scale_color_discrete(name = "Prosper's Rating (Alpha)") 619 | ``` 620 | 621 | ### Description Two 622 | This graph is chosen as it shows that Prosper's rating (alpha) is crucial in determining the borrower rate. Holding the Prosper's rating (alpha) constant, an increase in the debt-to-income ratio has an insignificant impact on the borrower rate. 623 | 624 | Those with an 'HR' rating are more likely to have a debt-to-income ratio larger than 1.0. On the other hand, borrowers rated 'AA' tend to have a ratio of less than 0.5 and thus a lower borrower rate, usually below 0.1.
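To attach approximate numbers to this description, the medians per rating can be tabulated. The chunk below is an illustrative sketch with `dplyr` rather than another final plot.

```{r echo=FALSE, message=FALSE, warning=FALSE, Rating_Medians_Sketch}
# Illustrative summary: median borrower rate and median debt-to-income ratio
# for each Prosper rating, to quantify the narrow bands seen in the scatterplot
prosper %>%
  filter(!is.na(ProsperRating..Alpha.)) %>%
  group_by(ProsperRating..Alpha.) %>%
  summarise(median_rate = median(BorrowerRate, na.rm = TRUE),
            median_ratio = median(DebtToIncomeRatio, na.rm = TRUE))
```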
625 | 626 | ### Plot Three 627 | ```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_Three} 628 | # Relationship between Loan Origination Quarter, Loan Count and Loan Status 629 | ggplot(prosper, aes(x = LoanOriginationQuarter, fill = LoanStatus2))+ 630 | geom_bar(stat = "count")+ 631 | scale_x_discrete(limits=LoanOriginationQuarter2)+ 632 | theme(axis.text.x = element_text(angle = 90, vjust = 0.6))+ 633 | scale_fill_manual("Loan Status", values = c("Defaulted" = "#DA0404", 634 | "Past Due" = "#E8C203", 635 | "Performing Loan" = "#05B94C"))+ 636 | xlab("Loan Quarter")+ 637 | ylab("Loan Count")+ 638 | ggtitle("Loan Status over 2005 - 2014 period")+ 639 | theme(plot.title = element_text(hjust = 0.5, size = 15)) 640 | ``` 641 | 642 | ### Description Three 643 | The bar chart above shows that Prosper has been expanding its lending operation, with the exception of late 2008 and late 2012, dips possibly caused by the Global Financial Crisis and the European Sovereign Debt Crisis. 644 | 645 | Over the years, non-performing loans have decreased markedly. That suggests that Prosper's ability to assess its applicants' creditworthiness has been improving. Another reason could be that Prosper decided to pursue a more sustainable expansion instead of lending to risky borrowers for higher yield, which could end in insolvency if non-performing loans were to outnumber performing loans. When a crisis hit, Prosper seemed to tighten its lending policy, which is in line with most banks' practices as well. 646 | 647 | ------ 648 | 649 | # Reflection 650 | When I started exploring this dataset, I was overwhelmed by the number of variables available. It was very tedious to study the relationships between all of them, so I only chose about 20 variables that I am more familiar with. It would be great if Prosper could clarify how the rating, the score and the borrower rate are determined; however, I also understand that this information is Prosper's confidential, proprietary material. That being said, once I had spent a few days working on the data I had a much better grasp of the dataset. I only included plots that contribute to the storytelling and excluded other variables that don't tell much about the characteristics of the demographics. 651 | 652 | The other challenge that I faced was unfamiliarity with R. Since this was my first time coding in R, I took a lot of notes on everything and did a lot of Googling through forums and documentation to find out how to plot certain graphs or customize the charts. I am glad that the effort paid off, as I was able to complete the chapter and this project in less than two weeks. 653 | 654 | Overall, I was able to come up with a coherent storyline for this report. The variables don't seem too intimidating after a while since most are quite self-explanatory. Without specific questions or directions, I was free to explore and determine the focus of the storyline. That is when I decided to look more into how the borrower rate is determined and how Prosper's rating affects other variables. 655 | 656 | I was rather surprised that the debt-to-income ratio doesn't seem to play an important role in determining the borrower rate after taking Prosper's rating into account. However, I can't totally exclude the importance of the debt-to-income ratio without knowing how Prosper's rating is determined, as the ratio might be one of its main input components.
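One rough way to probe that question without knowing Prosper's internal formula is to compare nested linear models. The sketch below is purely illustrative: it uses only variables already in the dataset, and `lm()` silently drops rows with missing values, so the comparison of the two R-squared values is approximate at best.

```{r echo=FALSE, message=FALSE, warning=FALSE, Rating_vs_Ratio_Model_Sketch}
# Does the debt-to-income ratio add explanatory power once the rating is known?
# Compare the R-squared of a rating-only model against a model that also
# includes the ratio (rows with missing values are dropped by lm()).
m_rating <- lm(BorrowerRate ~ ProsperRating..Alpha., data = prosper)
m_both <- lm(BorrowerRate ~ ProsperRating..Alpha. + DebtToIncomeRatio,
             data = prosper)
summary(m_rating)$r.squared
summary(m_both)$r.squared
```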
657 | 658 | To move on from here, it would be great to be able to build an equation or predictive model to simulate real world scenario. Prosper can also collect other related info that might aid in making the prediction more accurate such as age, education level or city of the applicants/borrowers. Prosper can also help to explain how the rating were determined without revealing too much corporate info as it will help in building the predictive model. -------------------------------------------------------------------------------- /5-Exploratory-Data-Analysis/README.md: -------------------------------------------------------------------------------- 1 | Link to [project report](https://cdn.rawgit.com/kaishengteh/Data-Analyst-Nanodegree/e94db549/5-Exploratory-Data-Analysis/Exploratory%20Data%20Analysis.html). 2 | -------------------------------------------------------------------------------- /5-Exploratory-Data-Analysis/Resources.txt: -------------------------------------------------------------------------------- 1 | Some of the resources used while completing the projects: 2 | 1. Sample project 3 | https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html 4 | 5 | 2. R documentation 6 | https://www.rdocumentation.org 7 | 8 | 3. Forums 9 | http://www.stackoverflow.com 10 | http://www.cookbook-r.com/ -------------------------------------------------------------------------------- /6-Inferential-Statistics/Stroop-Effect.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Statistics: The Science of Decisions Project Instructions" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Background Information\n", 15 | "\n", 16 | "In a Stroop task, participants are presented with a list of words, with each word displayed in a color of ink. The participant’s task is to say out loud the color of the ink in which the word is printed. The task has two conditions: a congruent words condition, and an incongruent words condition. In the congruent words condition, the words being displayed are color words whose names match the colors in which they are printed: for example RED, BLUE. In the incongruent words condition, the words displayed are color words whose names do not match the colors in which they are printed: for example PURPLE, ORANGE. In each case, we measure the time it takes to name the ink colors in equally-sized lists. Each participant will go through and record a time from each condition." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## Questions For Investigation" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "#### 1. What is our independent variable? What is our dependent variable?" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "Independent variable: Word congruence (whether color of the word matches the definition of the word)\n", 38 | "\n", 39 | "Dependent variable: Time taken to name the ink colors of the words" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "#### 2. What is an appropriate set of hypotheses for, this task? What kind of statistical test do you expect to perform? Justify your choices." 
47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "Null Hypothesis, H0 - Congruence of the word has no effect on the mean population time taken to name the ink colors of the words\n", 54 | "\n", 55 | "Alternative Hypothesis, H1 - Words where their definition not congruent with its ink color will increase the mean population time taken to name of the ink colors of the words\n", 56 | "\n", 57 | "H0: μ = μ0 (same mean population time)\n", 58 | "H1: μ > μ0 (a upper-tailed test) (mean populataion time for incongruent words are longer than congruent words)\n", 59 | "\n", 60 | "A one-tail paired t-test will be carried out to study whether there are significant difference in time taken to identify the ink colors of the words due to congruence. A t-test is used instead of a Z-test as the population parameters are unknown.\n", 61 | "\n", 62 | "An alpha, α = 0.05 is used as a threshold whether to reject or accept the null hypothesis." 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "#### 3. Report some descriptive statistics regarding this dataset. Include at least one measure of central tendency and at least one measure of variability.\n", 70 | "\n" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 1, 76 | "metadata": { 77 | "collapsed": true 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "'''Importing all required libraries'''\n", 82 | "import pandas as pd\n", 83 | "import numpy as np\n", 84 | "import matplotlib.pyplot as plt\n", 85 | "%matplotlib inline" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 2, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | " Congruent Incongruent Mean Difference\n", 98 | "0 12.079 19.278 7.199\n", 99 | "1 16.791 18.741 1.950\n", 100 | "2 9.564 21.214 11.650\n", 101 | "3 8.630 15.687 7.057\n", 102 | "4 14.669 22.803 8.134\n", 103 | "5 12.238 20.878 8.640\n", 104 | "6 14.692 24.572 9.880\n", 105 | "7 8.987 17.394 8.407\n", 106 | "8 9.401 20.762 11.361\n", 107 | "9 14.480 26.282 11.802\n", 108 | "10 22.328 24.524 2.196\n", 109 | "11 15.298 18.644 3.346\n", 110 | "12 15.073 17.510 2.437\n", 111 | "13 16.929 20.330 3.401\n", 112 | "14 18.200 35.255 17.055\n", 113 | "15 12.130 22.158 10.028\n", 114 | "16 18.495 25.139 6.644\n", 115 | "17 10.639 20.429 9.790\n", 116 | "18 11.344 17.425 6.081\n", 117 | "19 12.369 34.288 21.919\n", 118 | "20 12.944 23.894 10.950\n", 119 | "21 14.233 17.960 3.727\n", 120 | "22 19.710 22.058 2.348\n", 121 | "23 16.004 21.157 5.153\n" 122 | ] 123 | } 124 | ], 125 | "source": [ 126 | "stroop = pd.read_csv(\"stroopdata.csv\")\n", 127 | "stroop['Mean Difference'] = stroop['Incongruent'] - stroop['Congruent'] #A new column for population is created by adding up female and male pop.\n", 128 | "print(stroop)" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 3, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "data": { 138 | "text/html": [ 139 | "
\n", 140 | "\n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | "
CongruentIncongruentMean Difference
count24.00000024.00000024.000000
mean14.05112522.0159177.964792
std3.5593584.7970574.864827
min8.63000015.6870001.950000
25%11.89525018.7167503.645500
50%14.35650021.0175007.666500
75%16.20075024.05150010.258500
max22.32800035.25500021.919000
\n", 200 | "
" 201 | ], 202 | "text/plain": [ 203 | " Congruent Incongruent Mean Difference\n", 204 | "count 24.000000 24.000000 24.000000\n", 205 | "mean 14.051125 22.015917 7.964792\n", 206 | "std 3.559358 4.797057 4.864827\n", 207 | "min 8.630000 15.687000 1.950000\n", 208 | "25% 11.895250 18.716750 3.645500\n", 209 | "50% 14.356500 21.017500 7.666500\n", 210 | "75% 16.200750 24.051500 10.258500\n", 211 | "max 22.328000 35.255000 21.919000" 212 | ] 213 | }, 214 | "execution_count": 3, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "stroop.describe()" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 4, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "data": { 230 | "text/plain": [ 231 | "Congruent 14.3565\n", 232 | "Incongruent 21.0175\n", 233 | "Mean Difference 7.6665\n", 234 | "dtype: float64" 235 | ] 236 | }, 237 | "execution_count": 4, 238 | "metadata": {}, 239 | "output_type": "execute_result" 240 | } 241 | ], 242 | "source": [ 243 | "stroop.median()" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "Median is chosen as the measure of central tendency.\n", 251 | "\n", 252 | "Standard deviation is chosen as the measure of variability." 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "#### 4. Provide one or two visualizations that show the distribution of the sample data. Write one or two sentences noting what you observe about the plot or plots." 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 5, 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "data": { 269 | "text/plain": [ 270 | "array([[,\n", 271 | " ],\n", 272 | " [,\n", 273 | " ]], dtype=object)" 274 | ] 275 | }, 276 | "execution_count": 5, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | }, 280 | { 281 | "data": { 282 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAfwAAAFyCAYAAAAQ6Gi7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAAPYQAAD2EBqD+naQAAIABJREFUeJzt3Xu8HHV9//HXJ4AJCSqWaEBNvKLGS9U9FhORgGhDm7oH\nbWsQRRuQogaQxiYIVZKgxV+CFe1JgEITuYUTsGoC2kiigjbBC7oH8XZiy8WTCAKJCJIEEMjn98fM\nSXY3u+fs7M7sfnf3/Xw85pGc787M5zPfme/3uzs7O2PujoiIiHS2Ma1OQERERLKnAV9ERKQLaMAX\nERHpAhrwRUREuoAGfBERkS6gAV9ERKQLaMAXERHpAhrwRUREuoAGfBERkS6gAV9ERKQLaMDPiJm9\n1MwuM7O7zOwxM3vEzDaZ2cfMbFyr8wuVmZ1oZme1Og+RepjZP5jZbjPLtTqXbqE+o3b7tzqBTmRm\nfwN8GXgcuBr4BfAM4K3AhcCrgY+0LMGwvQ94DfDvrU5EpE56QElzqc+okQb8lJnZi4HVwD3Ase7+\nYNHLl5rZecDftCC1UZnZOHd/vNV5iEh7UJ/RXnRKP32fACYAHyob7AFw97vdfRmAme1nZueZ2Z1m\n9riZ3WNmF5jZM4qXMbPfmNmNZnakmf0o/orgLjP7QPn6zezPzex7ZrbLzLaa2SfN7OT4NOOUCuuc\naWY/NrPHgNPM7EXxvB+ssO7dZrawrOz5ZvYlM7s/3oZfmNnJZfMcHS/7njifrfE2fNvMXlY03y1E\nb4aGc9htZnfXWO8iwTGzK83s0bidrI3//6CZfc7MrGxeM7OzzOxncft40My+Wfz1gPoM9RmN0Cf8\n9L0TuNvdf1TDvCuBDxKd/v834M3AucCrgL8rms+Bw4H/ipe5EjgFuMLMfuLugxA1JOAW4GngAmAX\ncCrwJ/Y9zehxnH7gMuBy4NdJNtTMngf8KI7XB2wH/hpYaWbPdPe+skXOief9HPBsojdHq4Dp8ev/\nGpe/APgnwIAdSXISCYwTfbBaD/wQ+GfgHcDHgTuJ2t6wLwH/APw38J9E/fNRwDRgIJ5HfYb6jPq5\nu6aUJuCZwG7gazXM++fxvP9RVn4h0QF+dFHZPXHZW4rKJgKPARcWlfUBTwGvKyo7mKhRPQ1MqbDO\nd5TFf1Gc1wcr5LwbWFj09wrgt8DBZfP1Aw8BY+O/j46X/QWwX9F8Z8Y5vLqo7OtEb5havj81aUo6\nEQ3YTwO5+O8r4r//pWy+AnBb0d9vi9vIRSOsW32G+oyGJp3ST9ez4n8frWHeWUTvmL9QVv55onep\n5d/z/8rdvz/8h7tvJ3p3/dKieY4DfuDuPy+a72Hg2io53OPu364h12r+lqix7WdmhwxPwAaid93l\nVyp/yd2fLvp7I9G2vhSRznZZ2d8bKT3u/45ogPv0COtQn6E+oyE6pZ+uP8b/PrOGeYffFd9ZXOju\nD5jZw/HrxbZUWMcfgOeUrfP7Fea7s0IZRO/Y62JmzyX6JHAa8OEKszjwvLKyrWV//yH+9zmIdK7H\n3f33ZWXlbfelwH3xYFuN+gz1GQ3RgJ8id3/UzO4DXptksRrne7pKuVUpr8VjFcoq5mNm5WeDhv9e\nBVxVZf0/K/s7i20QCV21475e6jPUZ9RFA376vgH8o5m92Ue+cG+IqAEcTtGFL/FFLQfHryc1BLy8\nQvnhCdYx/A764LLy8k8P24i+utjP3W9OsP7R6DfM0o3uAmaa2cEjfMpXn1GZ+owa6Tv89F1IdKXr\nirghljCzl5nZx4B1RO9S/6lsln8mOoD/u47Y64HpZvbnRfH+jOjGFDVx90eJLtiZUfbS6RQ1LHff\nDXwV+Dsze035esxsYrLU99hJ9F2eSDf5KlF/vGiEedRnVKY+o0b6hJ8yd7/bzN4HXAcMmlnxnfaO\nBP6e6EKUPjO7iuh3rM8Bvkf0E5sPEl3l/706wl8InAR828yWETWEU4nexT+H2t8JrwDOMbP/BH5C\n1JAPZ9/TaOcAxwA/iuf9FfBnQA9wLNFVwUkVgNlm9nngx8AOd/9GHesRaZXEp5vd/btmdg3wMTN7\nBXAT0RuAo4Cb3f0Sd/+Z+oyK1GfUSAN+Btz96/E75gVAL9FtdP9ENPDPJ/r9KsCHiE7lzQHeBdxP\n9FvY8it1neoNr/gd9G/N7Biin9qcS/Su+1Ki36V+kehWv7Ws89NEDe/vgfcQfbL4a+DBsngPmtkR\nwELg3cBHgd8DvwTOrpbnKOWXAK8nqpN/Iup41HilnVT6/Xot880B7iDqFy4EHiEaPIsvqlOfoT6j\nbhb/jlE6mJl9EfhH4CDXDheRUajP6EyJv8OPb4t4jZltj2/FeIfpyVDBsLIn8cW/cT0J2KiGK1lS\n39Ce1Gd0j0Sn9M3sYOBW4DtEN2zYTvQ9zR9GWk6a6gdm9l1gEDiU6HaazwQ+08qkpLOpb2hr6jO6\nRKJT+ma2BJju7kdnl5I0wsz+leh7tBcSfddVAM5391tamph0NPUN7Ut9RvdIOuD/kujq0clE9zq+\nF7jE3Vdkk56ItAP1DSLhSzrgP0b0DvDzwFeAI4B/Bz7s7tdUmP8QotN7v6H0ak8RKTUOeDGwvsJt\nWIOnvkEkE6n2C0kH/CeInvB0VFHZvwNvcvcjK8z/Pqo/hEFE9vV+d+9vdRJJqW8QyVQq/ULS3+H/\njujCjmKDRE9AquQ3AKtWrWLq1KkJQyXzl3/5l3zrW9/qiBh9fX2cdNJJRNfMvKTONd0DnFe17jup\nvjohxuDgYLzPozbThoLtG2q1dx8Ut7u5RD/zrsXIbS4LzTg2GxV6jiHnl3a/kHTAvxV4ZVnZK6l+\nD+fHAaZOnUoul+2vc3K5XMfE2NtZzGLfp0XWagA4r2rdd1J9dUKMIu16ejvYviG54nZ3JfD+Gpcb\nuc1locnHZl1CzzH0/GKp9AtJf4f/BWCamZ0b3xP+fUS3YVyeRjKNOPDAAxUjsDiK0VWC7RsaE/a+\nb4djM/QcQ88vTYkGfHf/CdHtEE8Efg58EjjL3a/LIDcRaRPqG0TCl/he+u6+jug+ySIie6hvEAlb\nxzwed8KECYoRWBzFkPYX9r5vh2Mz9BxDzy9NHTPgv+IVr1CMwOIohrS/sPd9OxyboecYen5pyvRp\nefGDMwqFQqEdroIMxsDAAD09PUR3uGzkKv0eVPftYe8+p8fdB1qdT9ZC7Bsab3dqc5KutPuFjvmE\nLyIiItVpwBcREekCHTPgb9++XTECi6MY0v7C3vftcGyGnmPo+aWpYwb8U045RTECi6MY0v7C3vft\ncGyGnmPo+aWpYwb8xYsXK0ZgcRRD2t/iVicwonY4NkPPMfT80tQxA34zrortlBjNiqMY0v7C3vft\ncGyGnmPo+aWpYwZ8ERERqU4DvoiISBdINOCb2SIz2102/Sqr5JJYuXKlYgQWRzG6R8h9Q2PC3vft\ncGyGnmPo+aWpnk/4vwAmAYfG01tTzahOAwPZ35ysU2I0
K45idJ0g+4bGhL3v2+HYDD3H0PNLUz0D\n/lPuvs3dH4ynh1LPqg4XX3yxYgQWRzG6TpB9Q2PC3vftcGyGnmPo+aWpngH/cDO718zuMrNVZjY5\n9axEpB2pbxAJWNIB/4fAHOA44CPAS4D/MbPueb6giFSivkEkcPsnmdnd1xf9+Qszuw0YAmYDV6SZ\nmIi0D/UNIuFr6Gd57v4I8L/Ay0eab9asWfT29pZM06dPZ+3atSXzbdiwgd7e3n2WP/300/e5knJg\nYIDe3t4990EeXm7RokUsXbq0ZN4tW7bQ29vL5s2bS8qXLVvGggULSsp27dpFb28vmzZtKilfvXo1\nU6ZM2Se3E044IZPtiFxZtoYtQC+wuax8GbCgrOwxAG6//fZ9tuPkk0/eJ7+0t2N4W7LcH+26HatX\nr97TDnp6epgyZQrz5s3bJ792FlLfMCzJPtzbpopj7or/3lQ272rg5H1yy+JYLN+O4XWl2abS3o4Z\nM2Y0vD+y3I5DDz20pu1I47gaaTsWL15c0g56enqYNWvWPrk1xN3rnoCDgIeAM6q8ngO8UCh41tav\nX98xMQqFggMOBQevc4rWUa3uO6m+OiHG3n1Ozhtok6FMIfUNtarc7tan1uay0Ixjs1Gh5xhyfmn3\nC0l/h/85M5thZi8ys7cAa4Anid7ittTMmTMVI7A4itE9Qu4bGhP2vm+HYzP0HEPPL02JvsMHXgj0\nA4cA24jObU1z99+nnZiItBX1DSKBS3rR3olZJSIi7Ut9g0j4OuZe+uUXYyhG6+MohrS/sPd9Oxyb\noecYen5p6pgBf/Xq7L8q7JQYzYqjGNL+wt737XBshp5j6PmlqWMG/Ouvv14xAoujGNL+wt737XBs\nhp5j6PmlqWMGfBEREalOA76IiEgX0IAvIiLSBTpmwK90K0XFaG0cxZD2F/a+b4djM/QcQ88vTR0z\n4HfKHdd0p73ujCGhCnvft8OxGXqOoeeXpo4Z8E88Mfv7fnRKjGbFUQxpf2Hv+3Y4NkPPMfT80tQx\nA76IiIhUpwFfRESkC3TMgF/+TGTFaH0cxZD2F/a+b4djM/QcQ88vTQ0N+GZ2jpntNrOL0kqoXhde\neKFiBBZHMbpXSH1DY8Le9+1wbIaeY+j5panuAd/M/gI4DbgjvXTqd9111ylGYHEUozuF1jc0Jux9\n3w7HZug5hp5fmuoa8M3sIGAVcCrwcKoZ1Wn8+PGKEVgcxeg+IfYNjQl737fDsRl6jqHnl6Z6P+Ff\nDHzd3W9OMxkRaXvqG0QCtX/SBczsvcAbgDelkcAtt9zCQw891NA6JkyYwHHHHYeZNbSeLVu2sH37\n9obWMXHiRKZMmdLQOkTaURp9w5YtW/jxj39cdw4HH3wwhx9+eF3teHBwsO64aWm0D1L/IyNy95on\n4IXA/cBri8puAS6qMn8O8EmTJnk+ny+Zpk2b5hdccIEDqUz5fN7d3RcuXOhLlizxYkNDQ57P531w\ncLCkvK+vz+fPn79nnnHjxjecx7hx4/2aa67Zk0+xuXPn+ooVK0rKCoWC5/N537Ztm7u7z58/3wuF\nQry+Mx28aBpyyDsMlpX3OcwvK9vkwD7x+vv7fc6cOXu2e9js2bN9zZo1JWXr16+vezuGt6Xe/TFs\n586dns/nfePGjR2zHf39/XvaQS6X88mTJ/uMGTOGj6GcJ2iToUxp9Q0HHfTMhtvg/vs/o8F1FIra\nVHG72hm3v41lba3fYU68HF4oFNw9+bG4dOnShvugZzxjnA8NDdV8LBYbblPlGm1TH/jAB/ZpU+71\n99Vpb8fLXvaymrajUt+Q5nYsWrSopC3kcjmfNGlSqv1C0kZ9PPA08CfgyXjaXVRmZfPnihtAuRtu\nuCHemF87PNTAhL/3ve+tGKNWewfZVXHDrTQtGOG1Qrxs9e2tRV9fX1EuhbKOJclUGDGXvr6+unNM\nsi2KUZu9+7xtB/xU+oZDD50cD7L19AP3FA1+I7XjatNnKrS7vtTaXO3HQJLci/ukxvufLDSj/TQi\n5PzS7heSntL/NvC6srIrgUFgiXvUkpM7GHhOfYvGjj322IaW32sqUV9USbXy9Jx55pkMDAw0JY5i\nhBOjA6TYNxxIff1B8SVJI7Xjaiqd0m/Fvk+Se/Z9UqNCbz+h55emRAO+u+8EflVcZmY7gd+7e+u/\nABORllDfIBK+NO60V+enehHpcOobRALS8IDv7se6+8fTSKYRv/vd75oQZXP2ETZnH6NZcRSju4XS\nNzQm9H0fen7ht5/Q80tTx9xL/6tf/WoTopydfYSzs4/RrDiKIe0v9H0fen7ht5/Q80tTxwz4zXmm\n8fLsIyzPPkaz4iiGtL/Q933o+YXffkLPL00dM+AfcsghTYiS/Q0tmnXTjGbEUQxpf6Hv+9DzC7/9\nhJ5fmjpmwBcREZHqNOCLiIh0gY4Z8G+66aYmRFmafYSl2cdoVhzFkPYX+r4PPb/w20/o+aWpYwb8\nP/3pT02Isiv7CLuyj9GsOIoh7S/0fR96fuG3n9DzS1PHDPi9vb1NiHJ+9hHOzz5Gs+IohrS/0Pd9\n6PmF335Czy9NHTPgi4iISHUa8EVERLpAxwz4jz76aBOibM8+wvbsYzQrjmJI+wt934eeX/jtJ/T8\n0pRowDezj5jZHWb2SDx938z+KqvkkrjqqquaEOWU7COckn2MZsVRjO4Rct/QmND3fej5hd9+Qs8v\nTUk/4W8FPkH0EOYe4GbgBjObmnZiSeXz+SZEWZx9hMXZx2hWHMXoKsH2DY1Z3OoERrG41QmMKvT2\nE3p+aUo04Lv7f7v7Te5+l7vf6e6fAnYA07JJr3YvetGLmhAll32EXPYxmhVHMbpHyH1DY0Lf96Hn\nF377CT2/NO1f74JmNgaYDYwHfpBaRiLS1tQ3iIQp8YBvZq8lasTjgEeBd7t79zxQWEQqUt8gErZ6\nrtLfDLweOAK4FLjazF6ValZ12LRpUxOirMw+wsrsYzQrjmJ0nSD7hsYk3/eDg4MMDAwkngYHB5uS\nX7OF3n5Czy9V7t7QBHwLuLTKaznAJ02a5Pl8vmSaNm2an3vuuQ44PODgDusd8vH/i6e5DivKygrx\nvNsc8GOOOcbd3RcuXOhLlizxYkNDQ57P531wcLCkvK+vz+fPn+/u7oVCIc5lU7zejWXx+h1eVSG3\n2Q5rinLCly9f7vl83svNnTvXV6xYUVJWKBQ8n8/7tm3b9syzN5czy2INxbkNlpX3OcwvK9vkwD7x\n+vv7fc6cOT537tyS8tmzZ/uaNWtKytavX1/3dgzPV+/+GLZz507P5/O+cePGjtmO/v7+Pe0gl8v5\n5MmTfcaMGfE+J+cNtslQpnr6hrFjxzm8oKhNDU+19A0PD9ehw4y4byied6HDkhHa1Kp42UJRm5pb\nNO/OEfqGOQ7fcBhTlEO9U6FCH1dtO4bzG4q3GS8UCjUfi8WG21S5RtvUe97znn3alHv9fXXa2/Hi\nF7+4pu2o1De
kuR2LFi0qaQu5XM4nTZqUar+QRqP+DvClKq/lyg/AYjfccIOXDvj1Tvjll19eMUat\n9g6yhQbyKOzT4No9F2mOvfu8owb8xH3DoYdOdjivzmO+eMCvp+2samDZ4uVXxetIOn2mwfhq850m\n7X4h0Xf4ZvZZ4JvAFuCZwPuBo4GZSdYjIp1FfUOxqdR39Xw9p/RFapf0or3nAVcBhwGPAD8DZrr7\nzWknJiJtRX2DSOASDfjufmpWiYhI+1LfIBK+jrmX/vLly5sQJftH8DbnMb/NiaMY0v5C3/eh5xd+\n+wk9vzR1zID/tre9rQlRzsg+whnZx2hWHMWQ9hf6vg89v/DbT+j5paljBvzXvOY1TYiS/fVHM2c2\n5xqnZsRRDGl/oe/70PMLv/2Enl+aOmbAFxERkeo04IuIiHSBjhnwb7/99iZEWZt9hLXZx2hWHMWQ\n9hf6vg89v/DbT+j5paljBvwf//jHTYiyOvsIq7OP0aw4iiHtL/R9H3p+4bef0PNLU8cM+KeddloT\nolyffYTrs4/RrDiKIe0v9H0fen7ht5/Q80tTxwz4IiIiUp0GfBERkS6gAV9ERKQLdMyAf+WVVzYh\nysnZRzg5+xjNiqMY0v5C3/eh5xd++wk9vzQlGvDN7Fwzu83M/mhmD5jZGjN7RVbJJfHqV7+6CVE6\n565unXKHuk6J0e5C7hsaE/q+Dz2/8NtP6PmlKekn/KOAZcCbgXcABwAbzOzAtBNL6ogjjmhClBOz\nj3Bi9jGaFUcxukqwfUNjQt/3oecXfvsJPb80JX087qziv81sDvAg0ANsSi8tEWkn6htEwtfod/gH\nAw48lEIuItI51DeIBCbRJ/xiZmbAF4FN7v6r9FKqT6FQoKenp+7lBwcHa5hrE/DWumPUYtOmTYwf\nPz619VXbrttvv503vvGNoy7/xBNPMHbs2LpiF8eYOHEiU6ZMqWs9I9m0aRNvfWv2+yTrGJ0ktL6h\nMdm3+caEnl/z2s+WLVvYvn174uWG+6ms+qiguHtdE3ApcDdw2Ajz5ACfNGmS5/P5kmnatGl+7rnn\nOuDwgIM7rHfIx/8vnuY6rCgrK8TzbnPAzcbE62p02hSvd2NZvH6HyRVym+2wpignfPny5Z7P573c\n3LlzfcWKFSVlhULB8/m8b9u2zd3d8/m8FwqFOJczy2INxbkNlpX3OcwvK/tKSvWRzjRu3HgfGhpy\nd/ehoSHP5/M+ODhYUhd9fX0+f/78krKdO3d6Pp/3jRs3lpT39/f7nDlz9qnn2bNn+5o1a0rK1q9f\nX/f+GN4nCxcu9CVLlpTMW+929Pf372kHuVzOJ0+e7DNmzBiuq5zX2SZDmRrpG8aOHefwgqI2NTzV\n0jc8XHTMzYj7huJ5FzosGaFNrYqXLRS1qeKYO0foG+aULV/eN9SyHaeWLV/cx1XbjuF1DcXbjBcK\nhZqPxWLDbapco23qqKOO2qdNuXuqbWpoaMjHjRufWh9Va9+Q5nYsWrSopC3kcjmfNGlSqv2CedT4\nEjGz5UAeOMrdt4wwXw4oFAoFcrncPq/feOONHH/88cADwPMS51EUKf53FTC1znWsA84DCkR9USW7\ngJE+fQ8APVTb3lrs2rWLzZs3x2crRsplNNcCJ1G9Th4DRruearhO6q3X4RiDwEkN1Us1u3btSvWM\nSKtiDAwMDJ+h6nH3gUyDZajRvuGww6Zw//1zgE/XEf0Rom8SoL62M9xmipcdrc2Ptnyj8UdTnF/j\n/U8Wmtt+6umrHgN+Q1Z9VCPS7hcSn9KPG/TxwNEjNejWmEr9A2Qtp/SzPWiBDBpGGnXSyDqylXVH\n0qwYnSDsvqFeoe/70PNrdvupt69q8x+T1CjRgG9mlxD9DqQX2Glmk+KXHnH3x9NOTkTag/oGkfAl\nvUr/I8CzgO8C9xVNs9NNS0TajPoGkcAlGvDdfYy771dhujqrBMOyIPsIC7KPEUfqiBjNqK/m7ZP2\n1bl9Q+j7PvT82qH9hJ5fejrmXvrNkf1PNpr3s5BmxOmM+ur4n+rICELf96Hn1w7tJ/T80qMBP5Ez\ns49wZvYx4kgdEaMZ9dW8fSLhCX3fh55fO7Sf0PNLjwZ8ERGRLqABX0REpAtowE9kc/YRNmcfI47U\nETGaUV/N2ycSntD3fej5tUP7CT2/9GjAT+Ts7COcnX2MOFJHxGhGfTVvn0h4Qt/3oefXDu0n9PzS\nowE/keXZR1iefYw4UkfEaEZ9NW+fSHhC3/eh59cO7Sf0/NKjAT+RTvoJmH6WF1IMCVXo+z70/Nqh\n/YSeX3o04IuIiHQBDfgiIiJdQAN+Ikuzj7A0+xhxpI6I0Yz6at4+kfCEvu9Dz68d2k/o+aUn8YBv\nZkeZ2Y1mdq+Z7Taz3iwSC9Ou7CPsyj5GHKkjYjSjvpq3T9pX5/YLoe/70PNrh/YTen7pqecT/gTg\np8BcwNNNJ3TnZx/h/OxjxJE6IkYz6qt5+6StdWi/EPq+Dz2/dmg/oeeXnv2TLuDuNwE3AZiZpZ6R\niLQd9Qsi4dN3+CIiIl1AA34i27OPsD37GHGkjojRjPpq3j6R8IS+70PPrx3aT+j5pSfxKf3udgpw\n46hzDQ4O1h1h3rx5nHbaaXUvX7vatiXNGI3UC8ATTzzB2LFjS8rmzZvHF77whYbWMZryGBMnTmyD\nm4lIOprRThqxb36NtLMsju1TTjmFG28MvQ4XtzqJ5nD3uidgN9A7wus5wCdNmuT5fL5kmjZtmp97\n7rkOODzg4A7rHfLx/4unuQ4rysoK8bzb4nUQly10WFI271A872BZeZ/D/Pj/q+J1bIrn3Vg2b3+V\n3GY7rIn//w2HMUX5NDqdWcd2DE8r43WU11u/w5y4rqptx/D0iaJ6TbI/issWxvHSqJf9gljHuHHj\nfWhoyIeGhjyfz/vg4KAX6+vr8/nz55eU7dy50/P5vG/cuNH7+/v3tINcLueTJ0/2GTNmDK8/10ib\nDGEarV8YqW8YO3acwwsqHIu19A0PF+2nGWXHovvofcNwH1DwvW2q+Njf6dX7hjlly1drUyNtx6ll\ny1dqU+XbMTzvkMMRDtbQsb3ffvv50NBQybE7e/ZsX7NmTUnZ+vXrPZ/Pe7m5c+f6ihUrSspWrVrl\n+Xzet23bVlK+cOFCX7JkSUlZvW2qUCgU1d3w/hipry7eHzPi5fBCoVB1OwqFQqbbsWjRopK2kMvl\nfNKkSan2C+ZR46uLme0G3uXuFd++mVkOKBQKBXK53D6v33jjjRx//PHAA8Dz6s4Dhq8RKhD1I/W4\nFjgppXWsAqbWuQ6AdcB5KeXS6nUUr6eRehmuk1avYxA4iWrHdL0GBgbo6ekB6HH3gdRW3AKj9Qvx\nPBX7hsMOm8L9988BPl1H5EeAg+P/13PMNnq8h7J8vcd3Nsd2M+xtP/XW3QDQE9y2p90vJD6lb2YT\ngJezd5R9qZm9HnjI3bc2mlBnmEpjA2Rjp77D1Ui9DNdJq9chlahfCImO
b6msnu/w3wTcwt7TQJ+P\ny68i+jJERLqP+gWRwCW+St/dv+fuY9x9v7KpCxr1yg6J0aw4itEtOrdfCH3fh54frFwZeo6h55ce\n/SwvkWZ8tdqsr287ZVs6JYaEKfR9H3p+0ffQYQs9v/RowE/k4g6J0aw4iiHtLvR9H3p+cPHFoecY\nen7p0YAvIiLSBTTgi4iIdAEN+CIiIl1AA34izXjEd7MeI94p29IpMSRMoe/70POD3t7Qcww9v/Ro\nwE/kjA6J0aw4iiHtLvR9H3p+cMYZoecYen7p0YCfyMwOidGsOIoh7S70fR96fjBzZug5hp5fejTg\ni4iIdAEN+CIiIl1AA34iazskRrPiKIa0u9D3fej5wdq1oecYen7p0YCfyNIOidGsOIoh7S70fR96\nfrB0aeg5hp5feuoa8M3sdDO7x8weM7MfmtlfpJ1YmJ7bITGaFUcxuk3n9Q2h7/vQ84PnPjf0HEPP\nLz2JB3wzO4Ho0ZeLgDcCdwDrzWxiyrmJSBtR3yAStno+4c8DLnP3q919M/ARYBd65rVIt1PfIBKw\nRAO+mR0A9ADfGS5zdwe+DUxPNzURaRfqG0TCt3/C+ScC+wEPlJU/ALyywvzjAAYHByuu7K677or/\ndy4wPmETlEfpAAAgAElEQVQqlawDKsca3a01rONW4NoG11FLHi9MaT0jrWO0ballHbXkcG0K6xkp\nl1q2Y7R11LLccIx7gOrHdL2K1jcu1RU3Typ9w5NP/gn4JvCHOlL4U9H/6znWKh0fzTi+Glm+OL9G\n40fH9rp16+o+vseMGcPu3btLM7z1Vq69dvQ6rLRsre655574f/Xu93VA+u26Uan3C+5e8wQcBuwG\n3lxWvhT4QYX53we4Jk2aap7el6RNhjKhvkGTpiynVPqFpJ/wtwNPA5PKyicB91eYfz3wfuA3wOMJ\nY4l0k3HAi4naTDtS3yCSvlT7BYvfbde+gNkPgR+5+1nx3wZsAfrc/XNpJCUi7Ud9g0jYkn7CB7gI\nuNLMCsBtRFfmjgeuTDEvEWk/6htEApZ4wHf3L8e/q/000em6nwLHufu2tJMTkfahvkEkbIlP6YuI\niEj70b30RUREukAmA76ZjTGzz5jZ3Wa2y8zuNLNPNbjOo8zsRjO718x2m1lvhXk+bWb3xTG/ZWYv\nTzOOme1vZkvN7GdmtiOe5yozOyztbSma9z/ieT6Wdgwzm2pmN5jZw/H2/MjMXlhpffXEMLMJZrbc\nzLbG++SXZvbhhNtxrpndZmZ/NLMHzGyNmb2iwnx17/vRYqS432valqL569r3IanhGLkiLi+e1jUx\nv8yPr6zzC6AOP2Jmd5jZI/H0fTP7q7J5WlJ/teTX6vqrkO85cQ4XlZU3XIdZfcI/B/gwMBd4FXA2\ncLaZndHAOicQfSc4l+h3iSXM7BPAGcBpwBHATqL7eD8jxTjjgTcA5xPdK/zdRDcVuSHFGHuY2buB\nNwP3Jlz/qDHM7GXARuBXwAzgdcBnSPYTqdG24wvATKLfXL8q/nu5mb0zQYyjgGVE9fAO4ABgg5kd\nWLQtje770WKktd9H3ZaibWpk34eklmP9m0Tf+R8aTyc2JzWgOcdXpvnFWlmHW4FPADmiuy3eDNxg\nZlOh5fU3an6xVtbfHhY9bOo0oudQFJenU4cZ3YTj68B/lpV9Bbg6pfXvBnrLyu4D5hX9/SzgMWB2\nmnEqzPMmot8fvzDNGMALiH7SNJXoFlgfS7m+VgNXpbjPK8X4OfDJsrKfAJ9uIM7EONZbs9r3lWKk\nvd9HipPmvg9pqnKMXAF8rdW5NfP4yiC/oOowzun3wMmh1V+V/IKoP+Ag4NfAscAtwEVFr6VSh1l9\nwv8+8HYzOxzAzF4PHMnw/QtTZmYvIXpXVnwf7z8CPyL7+3gfTPTJ5eG0VmhmBlwNXOjuqd/rMV7/\n3wD/Z2Y3xacKf2hmx6cc6vtAr5k9P477NuBwGruJxHB9PxSvM4t9XxJjlHka2e/7xMl63wfqmPgY\n3Gxml5jZn7Uwl2YcX6nlVySIOrTo69z3Ep0V+35o9VeeX9FLIdTfxcDX3f3m4sI067Ce3+HXYgnR\nO5DNZvY00VcHn3T36zKKdyhRI6h0H+9DM4qJmY0l2tZ+d9+R4qrPAf7k7stTXGex5xG9m/wE8Emi\nr1z+GviamR3j7htTinMmcDnwWzN7iugT8T+6+60jL1ZZPBh+Edjk7r+Ki1Pd91VilM/T8H4fIU7W\n+z403wS+SnQm42XA/wPWmdl0jz/KNEszjq9GjHDMtLwOzey1wA+I7gz3KPBud/+1mU0ngPqrll/8\ncgj1916irw3fVOHl1I7BrAb8E4i+t30v0XfEbwD+3czuc/drMorZVGa2P/BfRDtiborr7QE+RvRd\ncVaGz+ysdfe++P8/M7O3ED3SNK0B/2NE3z2+k+gU9Qzgkvg4uHnEJSu7BHg10dmirIwYI8X9vk+c\nJu37oLj7l4v+/KWZ/Ry4CziG6LRmMzXj+GpExfwCqcPNwOuBZwN/D1xtZjOaFLsWFfNz982trj+L\nLpT+IvAOd38yy1hZndK/EFji7v/l7r9092uJLtg6N6N49wNG7ffxbkhRpz8ZmJnyp/u3As8FtprZ\nk2b2JPAi4CIzuzulGNuBp9j3sVKDwJQ0ApjZOOAC4OPuvs7df+HulwDXA/PrWN9yYBZwjLv/ruil\n1Pb9CDGGX09lv48Qpxn7Pmjufg/R8dm0q7ihOcdXI0Y7Nou1og7d/Sl3v9vdb3f3TxJddHYWgdTf\nCPlVmrfZ9ddD1O4Hitr90cBZZvYnok/yqdRhVgP+eKLTt8V2ZxUv3kH3A28fLjOzZxF9uvx+teXq\nUdTpvxR4u7vX8xzPkVwN/DnRu9Hh6T6iN1HHpREgfhf5Y/Z9bOkrgKE0YhBdTXwA+x4Hw1/x1Czu\n7I4H3ubuW4pfS2vfjxQjfj2V/T5KnMz3fejiTzuHACMOainHzPz4yiq/KvM3vQ4rGAOMDaH+qhgD\njK30Qgvq79tEv5J6A3vb/U+AVcDr3f1u0qrDjK42vILoFO4sok8o7wYeBD7bwDonxBXxBqI3D/8U\n/z05fv1soisv83HlrQX+D3hGWnGIvgK5gWhQfB3RO6zh6YC0tqXC/Imv1K6hvt5F9BO8U4m+tzqD\n6IHi01OMcQvwM6J3qy8G5gC7gNMSxLiE6OHoR5XV97iieRra96PFSHG/j7otaez7kKZR2tMEojcz\nbybqJ95O1NENJqnXBvPL/PjKMr9A6vCzcX4vAl5L9B34U8Cxra6/0fILof6q5Fx+lX4641tGyU4g\nepDGPUS/F/w/ot8w79/AOo+OO4yny6YvFc2zmOgT0S6iK8Ffnmac+IAof2347xlpbkvZ/HeTfMCv\npb7mAP8b76MB4J1pxiC6OHAl0e9gdxJdz3FWwhiV1v808MGy+ere96PFiPd7+Wv17PeatqXRfR/S\nNEp7GgfcRPTp5fF4Wy8Fntv
E/DI/vrLML5A6XBHHfSzOYwPxYN/q+hstvxDqr0rON1M04KdVh7qX\nvoiISBfQvfRFRES6gAZ8ERGRLqABv0tY9DCGhWVlbzKzWy16IMzTZvbncflfmdntZvZYXP6s1mQt\nIiJp6ZoB38z+wfY+CektVebZGr9+Y7PzS8LMflO0LU+b2R8sepLbZWZ2RJXFnKKHl8Q/M/sK8Byi\nK6c/AAzFt5S8nujCkLlx+c4st0dERLKX1Z32QvYY0V0AS36/aGZHEz20JMnT4lrFgduBfyO6IcMz\niR608h7gH83sIncvv7nNgUQ/RRn2MqKb7HzI3a8YLjSz44huu/spd2/2nc5ERCQj3TjgrwPeY2Yf\nc/fdReXvI/r95cTWpJXYve6+urjAokco9gMfN7P/c/fLhl9z9z+VLT9816ZHaiyvm5mNd/ddaa1P\nRESS65pT+jEneizsIcBfDhea2QFE91fuJ/rEXMIi/2Rmv4i/177fzP7DzA4um6/XzL5hZvea2eNm\ndqeZfcrMxpTN9934FPxUM7vFzHaa2W/NbEFDG+f+BPBBoidpfbIs5p7v8M3sCuC7cX18JX7tFjO7\nBbgyXuQncfmXitbxZouervdwnPN3y78eMbPF8XJTzazfzB6i6N78ZvZKM/uKmf0+rssfm1m+bB3D\nX7+8xcwuMrMH4+sMvmZmh5Rvt5n9tZl9z8z+aGaPmNltZnZi2Tyj5i4i0sm6bcAH+A3wQ6B4QJhF\n9HS/ak/zuxxYSjRwfYzopiHvB24ys/2K5ptD9CSmz8fz/QT4NNGdnYo58GdET2m6Hfg40Z2dlsSn\n1Ovm7juBNcALzGxqldn+g+g+9wb8O3AS8K/xdHk8z6fi8ssAzOxY4HtEp/sXEz0X4dnAzWZW/ISn\n4esE/ovophbnAv8Zr+M1RHX/SqI6+TiwA1hrlR/Nu4zorlKLie44lgdKniJnZnOAbxA9NvSzRE8A\nvJ2iW9EmyF1EpHO18m5CTb5z0T8Q3aEqR3Qx2sNE93qG6CK1b8f/vwe4sWi5txLd7eqEsvX9ZVz+\n3qKysRXiXkr0JuCAorJb4lzeV1R2ANFdlL5cw7aU5Fjh9bPi9b+zqGw3sLDo7+E7oP1ttXoqK/81\n8N9lZWOJnip1U1HZoni911TI69tEg/H+ZeWbgM1lOewuXm9c/nmi2/8+M/77WURfPdzKCLeYrDV3\nTZo0aerkqRs/4QN8megBP+80s4OIHt96bZV5/57ozcF3zOyQ4Ylo4NoBvG14Ro9OqQNgZgfF822K\nY72qbL073L2/aNkngduIHs7SqOGnuD0zhXVhZm8ADgdWl9XBM4HvED32tpgTnxkoWsdziOrqv4Bn\nl61nA3C4mR1Wto7LKbUR2I/oVrcQvek6iOjJjOXXKNSbu4hIR+rGi/Zw9+1m9m2iC/UmEH218ZUq\nsx9OdLr4wUqrIrpfPABm9mqiU+VvI/r0WTzfs8uW/W2F9f2B6BR2ow6K/300hXVBVAcQPc2tkt1m\n9mx3L77Q756yeV5O9BXCZ4i+Oig3XJfFT6jaWjbP8BPqnhP/+7L4319WyQvqy11EpON05YAf6yf6\nbvkw4JvuXm1wHEP0POL3UeGCPmAbgJk9G/gforMBnyJ6CMPjRM86XsK+10uUPzZ2WKUYSQ2/abgz\nhXXB3tz/meg50pWUPxv+sSrr+DeiBz9UUp5vpToyktVRPbmLiHScbh7w1xCddn4zcMII891F9MjE\n7xefsq/gGKJPnse7+63DhWb2sqpLZMDMJhA9+naLu29OabV3xf8+6u4317mOu+N/n2xgHVB08yCi\nvIzokZd3V549ldxFRNpet36Hj0dXs3+E6Krtr48w65eJ3hgtLH/BzPaLP9lD9GnUKKpTM3sG0QWC\nTWFm44BVRG88Lkhx1QWigXN+/IaiPO6o9y5w921EPwX8sJkdWs86KthA9LXFuWY2tso8DecuItIJ\nuu0TfsmpYHe/ZrQF3P1/zOwy4Jz4ArANwJPAK4gu6PsY8DWiO/f9AbjazPrixU+i9BNpml5gZu+P\n/38Q8GqiO+1NAv7N3Vc0sO7yenIzO5XopkW/jH/Hfy/RnQnfRnSlfKWf1ZU7nejCu5+b2X8SfSqf\nBEyP1/XGajlUKnf3R81sHtFXMz82s36iffB64EB3PznF3EVE2lq3Dfi1DL4l95wHcPePmtlPgA8T\nfXJ+iuj3/FcT/SQMd3/IzP6G6KdjnyEaeK4Bbqbyd9bVcqn1DcIb4vhO9Cl3K3ADsNLdf1LLdiXJ\nwd2/Z2bTgfOIBu6DgPuBH1F2RX417j4Y/+59EdFP7w4huhjydqL7FSTOzd2/ZGYPAOcQXTvxJLAZ\n+EKauYuItDtzz+oDqIiIiIQi0Xf4ZnaP7X1KW/G0LKsERUREpHFJT+m/iejGJ8NeR/Sd9pdTy0hE\nRERSl2jAd/ffF/8dP/TkLnffWGURERERCUDdP8uz6Alz7wdWppeOiIiIZKGR3+G/m+h2sVellIuI\niIhkpO6r9M3sJuAJd6/6G+b4ISXHEf2E7fG6Aol0h3HAi4H15V+diYikoa7f4ZvZFOAdRLdwHclx\nVH8KnYjs6/1Ez3kQEUlVvTfeOYXogTLrRpnvNwCrVq1i6tSpiYNcdtllfOlLX+Opp6o9a2U064F/\nYdOmTRx44IFV5/roRz/KpZdeWmeM7IWeH4SfY+j5DQ4OctJJJ0HcZkRE0pZ4wDczA+YAV7r77lFm\nfxxg6tSp5HK5xMk9//nPJ7odffJlI/8LwBve8AYmTNjnNup7TJo0qa78miX0/CD8HEPPr4i++hKR\nTNRz0d47gMnAFSnnIiIiIhlJ/Anf3b9F6c13REREJHBd+3jcYnfeeWerUxhR6PlB+DmGnp+ISNY0\n4ENdFxQ2U+j5Qfg5hp6fiEjWNOADX/3qV1udwohCzw/CzzH0/EREsqYBX0REpAtowBcREekCGvCB\n7du3tzqFEYWeH4SfY+j5iYhkTQM+cMopp7Q6hRGFnh+En2Po+YmIZE0DPrB48eJWpzCi0POD8HMM\nPT8RkawlHvDN7Plmdo2ZbTezXWZ2h5m1xT1Lqwn9lquh5wfh5xh6fiIiWUt0pz0zOxi4FfgO0ZPw\ntgOHA39IPzURERFJS9Jb654DbHH3U4vKhlLMR0RERDKQ9JR+HviJmX3ZzB4wswEzO3XUpQK3cuXK\nVqcwotDzg/BzDD0/EZGsJR3wXwp8FPg1MBO4FOgzsw+knVgzDQwMtDqFEYWeH4SfY+j5iYhkLekp\n/THAbe5+Xvz3HWb2WuAjwDWpZtZEF198catTGFHo+UH4OYaen4hI1pJ+wv8dMFhWNghMGWmhWbNm\n0dvbWzJNnz6dtWvXlsy3YcMGent7K6zhdKD8lOwA0Et03WCxRcDSkpKtW7fS29vL5s2bS8qXLVvG\nggULSsp27dpFb28vmzZtKilfvXo1J5988j6ZnXDCCTVvx+mnn77PqeWBgQF6e3v3uTHM
okWLWLq0\ndDu2bNmi7eiA7Vi9evWedtDT08OUKVOYN2/ePvmJiKTJ3L32mc2uBV7o7kcXlX0B+At3f2uF+XNA\noVAo1PWzqPPPP58LLricJ5+8N/GykeuAE9mxYwcTJkyocx0i2RsYGKCnpwegx931/YOIpC7pJ/wv\nANPM7Fwze5mZvQ84FViefmoiIiKSlkQDvrv/BHg3cCLwc+CTwFnufl0GuTVN5a8RwhF6fhB+jqHn\nJyKStaQX7eHu64B1GeTSMmeccUarUxhR6PlB+DmGnp+ISNZ0L31g5syZrU5hRKHnB+HnGHp+IiJZ\n04AvIiLSBTTgi4iIdAEN+LDP77ZDE3p+EH6OoecnIpI1DfhEN3EJWej5Qfg5hp6fiEjWNOAD119/\nfatTGFHo+UH4OYaen4hI1jTgi4iIdAEN+CIiIl1AA76IiEgXSDTgm9kiM9tdNv0qq+SapdJT10IS\nen4Qfo6h5ycikrXEt9YFfgG8HbD476fSS6c1Qr8LW+j5Qfg5hp6fiEjW6hnwn3L3baln0kInnnhi\nq1MYUej5Qfg5hp6fiEjW6vkO/3Azu9fM7jKzVWY2OfWsREREJFVJP+H/EJgD/Bo4DFgM/I+Zvdbd\nd6abWufYsmUL27dvr3v5J554grFjx9a9/MSJE5kyZUrdy4uISAdw97on4NnAw8DJVV7PAT5p0iTP\n5/Ml07Rp03zNmjVebP369Z7P5/f8vXjxYj/ggOc7zHVY4eBFU8Eh77CtrHyhw5L4/6sd8MHBQc/n\n8z44OFgSr6+vz+fPn+8bN27cU7Zz507P5/MlZe7u/f39PmfOHC83e/bsEbdjaGjIx40b70AD05iG\nlh83brwPDQ01tB3F5s6d6ytWrCgpW7Fihefzed+2bVtJ+cKFC33JkiUlZUNDQyPuj2Jp7Y8jjzyy\npu0oFAqZb0d/f/+edpDL5Xzy5Mk+Y8aM4f2V8wbapCZNmjRVm8zdG3rDYGa3Ad9y909WeC0HFAqF\nArlcLvG6zz//fC644HKefPLeOrO7DjiRHTt2MGHChKpz9fb2cuONN9YZY2QDAwP09PQAq4Cpdaxh\nHXBeA8sPAidR7z6oVZZ1mIbQ89t7nNDj7gOtzkdEOk89F+3tYWYHAS8Hrk4nnda47rrrmhBlKtEJ\nj6QGG1y+OZpTh/ULPT8Rkawl/R3+58xshpm9yMzeAqwBngTa+skk48ePb3UKbS/0Ogw9PxGRrCX9\nhP9CoB84BNgGbAKmufvv005MRERE0pNowHd3/ZhZRESkDele+sCCBQtanULbC70OQ89PRCRrGvBB\nv1FPQeh1GHp+IiJZ04APnHnmma1Ooe2FXoeh5ycikjUN+CIiIl1AA76IiEgX0IAPbN68udUptL3Q\n6zD0/EREsqYBHzj77LNbnULbC70OQ89PRCRrGvCB5cuXtzqFthd6HYaen4hI1hoa8M3sHDPbbWYX\npZVQK+gnW40LvQ5Dz09EJGt1D/hm9hfAacAd6aUjIiIiWahrwI+fkrcKOBV4ONWMREREJHX1fsK/\nGPi6u9+cZjKtsnTp0lan0PZCr8PQ8xMRyVrSp+VhZu8F3gC8Kf10WmPXrl1VX9uyZQvbt2+ve92D\ng4Ojz9QBRqrDEISen4hI5ty95ono8bj3A68tKrsFuKjK/DnAJ02a5Pl8vmSaNm2ar1mzxoutX7/e\n8/n8nr8XL17sBxzwfIe5DiscvGgqOOQdtpWVL3RYEv9/tQM+ODjo+XzeBwcHS+L19fX5/PnzS8p2\n7tzp+XzeN27c6ENDQz5u3HgHUpgKdW7HqqLlh+J5B8vm7XOYX1a2M553hQNeKBTc3b2/v9/nzJnj\n5WbPnj3q/hg2d+5cX7FiRUlZoVDwfD7v27ZtKylfuHChL1mypKRsaGiorv1RrJ23o7+/f087yOVy\nPnnyZJ8xY8bwsZLzBG1SkyZNmmqdzN1rfnNgZscDXwOeBiwu3i/uqJ4GxnrRCs0sBxQKhQK5XK7m\nOMPOP/98Lrjgcp588t7Ey0auA05kx44dTJgwIfHSAwMD9PT0EF2uMLXOHNYB5wEFovc/SV0LnNTA\n8gNAD/XuA2mOvccaPe4+0Op8RKTzJD2l/23gdWVlVwKDwBJP8u6hrUylvsEWoqoRERFprUQX7bn7\nTnf/VfEE7AR+7+5tO7I18h29REKvw9DzExHJWhp32mv7T/WnnHJKq1Noe6HXYej5iYhkLfFV+uXc\n/dg0EmmlxYsXtzqFthd6HYaen4hI1nQvfdDFbCkIvQ5Dz09EJGsa8EVERLqABnwREZEuoAEfWLly\nZatTaHuh12Ho+YmIZE0DPtFNT6Qxoddh6PmJiGRNAz5w8cUXtzqFthd6HYaen4hI1jTgi4iIdAEN\n+CIiIl0g0YBvZh8xszvM7JF4+r6Z/VVWyYmIiEg6kn7C3wp8guhJMj3AzcANZlbvo+SC0Nvb2+oU\n2l7odRh6fiIiWUt0a113/++yok+Z2UeBabTxY+HOOOOMVqfQ9kKvw9DzExHJWt330jezMcBsYDzw\ng9QyaoGZM2e2OoW2F3odhp6fiEjWEg/4ZvZaogF+HPAo8G5335x2YiIiIpKeeq7S3wy8HjgCuBS4\n2sxelWpWIiIikqrEA767P+Xud7v77e7+SeAO4KyRlpk1axa9vb0l0/Tp01m7dm3JfBs2bKhycdXp\nQPmtUQeAXmB7WfkiYGlJydatW+nt7WXz5tITEcuWLWPBggUleezatYve3l42bdpUtt7VwMkVcjsB\nWFtWtiHOLd3tgC3xvOUnVJYBC8rKdsXz3l5Sunr1ak4+ed/tOOGEE2reH6effvo+t6r9/Oc/T29v\nL9u3l27HokWLWLq0dDu2bNky4v4o2Yoq+yPpdhxxxBE1bcfAwEDm27F69eo97aCnp4cpU6Ywb968\nffITEUmVuzc0Ad8BvlTltRzghULB67F48WI/4IDnO3id02oHfMeOHSPGmT17dsXyQqHggEOhgRxW\nNbiORpePtqHefVCranUYitDz23uskfMG26QmTZo0VZoSfYdvZp8Fvkn0UfOZwPuBo4G2viLq+uuv\nb3UKbS/0Ogw9PxGRrCW9aO95wFXAYcAjwM+Ame5+c9qJiYiISHqS/g7/1KwSERERkezoXvoiIiJd\nQAM+VLzaW5IJvQ5Dz09EJGsa8NFd2NIQeh2Gnp+ISNY04AMnnnhiq1Noe6HXYej5iYhkTQO+iIhI\nF9CALyIi0gU04EOF2+hKUqHXYej5iYhkTQM+cOGFF7Y6hbYXeh2Gnp+ISNYSDfhmdq6Z3WZmfzSz\nB8xsjZm9IqvkmuW6665rdQptL/Q6DD0/EZGsJf2EfxTRo9neDLwDOADYYGYHpp1YM40fP77VKbS9\n0Osw9PxERLKW9Na6s4r/NrM5wINAD6AvSUVERALV6Hf4BxM90vOhFHIRERGRjNQ94JuZAV8ENrn7\nr9JLqfkWLFjQ6hTaXuh1GHp+IiJZS/p43GK
XAK8Gjkwpl5aZMmVKq1Noe6HXYej5iYhkra5P+Ga2\nHJgFHOPuvxtt/lmzZtHb21syTZ8+nbVr15bMt2HDBnp7eyus4XRgZVnZANALbC8rXwQsLSnZunUr\nvb29bN68uaR82bJlLFiwgDPPPHNP2a5du+jt7a3wu+3VQKUHsJwArC0r2xDnlu52wJZ43s1l5cuA\n8k+wu+J5by8pXb16dcUHyZxwwgk174/TTz+dlStLt+PII4+kt7eX7dtLt2PRokUsXVq6HVu2bBlx\nf5RsRZX9kXQ7vvWtb9W0HQMDA5lvx+rVq/e0g56eHqZMmcK8efP2yU9EJFXunmgClgNbgZfWMG8O\n8EKh4PVYvHixH3DA8x28zmm1A75jx4664hcKBQccCg3ksKrBdTS6fLQN9e4DaY69xxo5T9gmNWnS\npKmWKdEpfTO7BDiR6KPjTjObFL/0iLs/3vjbDxEREclC0lP6HwGeBXwXuK9omp1uWs1VfkpWkgu9\nDkPPT0Qka4kGfHcf4+77VZiuzirBZjj77LNbnULbC70OQ89PRCRrupc+sHz58lan0PZCr8PQ8xMR\nyZoGfPSTrTSEXoeh5ycikjUN+CIiIl1AA76IiEgX0IAP+9xQRZILvQ5Dz09EJGsa8InugiaNCb0O\nQ89PRCRrGvCB888/v9UptL3Q6zD0/EREsqYBX0REpAtowBcREekCiQd8MzvKzG40s3vNbLeZVXos\nXFspfzKaJBd6HYaen4hI1ur5hD8B+Ckwl+jpXm3vlFNOaXUKbS/0Ogw9PxGRrCV6Wh6Au98E3ARg\nZpZ6Ri2wePHiVqfQ9kKvw9DzExHJmr7DB3K5XKtTaHuh12Ho+YmIZC3xJ/x29NOf/pQDDzww8XKD\ng4MZZNN9tmzZ0tB36BMnTmz4Xvgh5CAi0lLuXvcE7AZ6R3g9B/ikSZM8n8+XTNOmTfM1a9Z4sfXr\n13s+n9/z9+LFi/2AA57vMNdhhYMXTQWHvMO2svKFDkvi/y9zGONE1xo0MBUc+h3mlMVyh9kOa8rK\n1se5ucOqonXUsx3Fyw/F8w6WzdvnML+sbGc87woHvFAouLt7f3+/z5kzx8vNnj171P0xbO7cub5i\nxYqSskKh4Pl83rdt21ZSftZZZ/n++x/QUP2PGbOff+UrXylZb5LtuOaaa3zMmP0aymG//fb3c845\np/engQoAAAaRSURBVGS9Q0NDns/nfXBwsKS8r6/P58+fX1K2c+dOz+fzvnHjRu/v79/TDnK5nE+e\nPNlnzJgxHCvnDbRJTZo0aao2mXv9192Z2W7gXe5+Y5XXc0ChUCjUdUr1/PPP54ILLufJJ++tM8P/\nB/wLsAqYOsJ8a4F3VShfB5wHFIjeu9TjWuCkBtbR6PIDQA/17oNarVy5kg996EP7Rh8YoKenh9H3\nQTWDwEkN5R9CDqPZmyM97j6QSRAR6WpdcUo/6uRH6qhXVnldp/RrNTAwUHHA32u0fdAMIeQgItIa\niQd8M5sAvBwYvkL/pWb2euAhd9+aZnLNc3GrE2h7F1+sOhQRCVk9n/DfBNzC3u83Px+XXwXox84i\nIiIBqud3+N9DP+cTERFpKxq4RUREuoAGfADa/nEALdfbqzoUEQmZBnwAzmh1Am3vjDNUhyIiIdOA\nD8DMVifQ9mbOVB2KiIRMA76IiEgX0IAvIiLSBTTgA9GtdaURa9eqDkVEQqYBH4ClrU6g7S1dqjoU\nEQlZXQO+mZ1uZveY2WNm9kMz+4u0E2uu57Y6gbb33OeqDkVEQpZ4wDezE4hup7sIeCNwB7DezCam\nnJuIiIikpJ5P+POAy9z9anffDHwE2IXuoy8iIhKsRAO+mR0A9ADfGS5zdwe+DUxPNzURERFJS9KH\n50wE9gMeKCt/AHhlhfnHAQwO1vdc+fvuu4+nn94JXF7X8nBb/O86Rn62/a3AtVXKa1l+JI2uo9Hl\n74mWXreu7v0wZswYdu/ePeI8t956K9deu28d3nPPPfH/Wpd/WjnUG78WResel1kQEelqFn1Ar3Fm\ns8OAe4Hp7v6jovKlwAx3n142//uoPJKKSGXvd/f+VichIp0n6Sf87cDTwKSy8knA/RXmXw+8H/gN\n8HjS5ES6yDjgxURtRkQkdYk+4QOY2Q+BH7n7WfHfBmwB+tz9c+mnKCIiIo1K+gkf4CLgSjMrEH1J\nPg8YD1yZYl4iIiKSosQDvrt/Of7N/aeJTuX/FDjO3belnZyIiIikI/EpfREREWk/upe+iIhIF9CA\nLyIi0gUyHfBDfciOmS0ys91l069anNNRZnajmd0b59NbYZ5Pm9l9ZrbLzL5lZi8PJT8zu6JCna5r\nYn7nmtltZvZHM3vAzNaY2SsqzNfKOhw1x1bXo4h0rswG/DZ4yM4viC46PDSe3tradJhAdAHkXGCf\nCyvM7BPAGcBpwBHATqL6fEYI+cW+SWmdntic1AA4ClgGvBl4B3AAsMHMDhyeIYA6HDXHWCvrUUQ6\nVGYX7VX5vf5Wot/rX5hJ0NpzWwQc7+65VuZRjZntBt7l7jcWld0HfM7dvxD//SyiWxr/g7t/OYD8\nrgCe7e5/28xcqonfWD5IdAfITXFZMHU4Qo5B1aOIdI5MPuG3yUN2Do9PT99lZqvMbHKrE6rGzF5C\n9EmvuD7/CPyIcOoT4Jj4VPVmM7vEzP6shbkcTHQm4iEItg5LciwSUj2KSIfI6pT+SA/ZOTSjmEn8\nEJgDHEf0eN+XAP9jZhNamdQIDiUaGEKtT4hOQ38QOBY4GzgaWBef2WmqOOYXgU3uPnxtRlB1WCVH\nCKgeRaSz1HOnvbbn7sX3K/+Fmd0GDAGzgStak1V7Kzsl/ksz+zlwF3AMcEuT07kEeDVwZJPjJlEx\nx8DqUUQ6SFaf8JM+ZKel3P0R4H+Bpl2xndD9gNEm9Qng7vcQHQdNrVMzWw7MAo5x998VvRRMHY6Q\n4z5aVY8i0nkyGfDd/UmgALx9uCw+Jfl24PtZxGyEmR1E1KGO2Pm2Stzp309pfT6L6Grv4OoTwMxe\nCBxCE+s0HkiPB97m7luKXwulDkfKscr8Ta9HEelMWZ7SD/YhO2b2OeDrRKfxXwCcDzwJrG5hThOI\n3nQMf1f7UjN7PfCQu28l+r73U2Z2J9Hjhj8D/Ba4odX5xdMi4KtEg+rLgaVEZ02a8rhXM7uE6Odr\nvcBOMxv+JP+Iuw8/mrnVdThijnEdt7QeRaSDuXtmE9Fvtn8DPAb8AHhTlvES5LWaqKN/jOjRvv3A\nS1qc09HAbqKvQoqnLxXNsxi4D9hFNAC8PIT8iJ7lfhPRIPU4cDdwKfDcJuZXKbengQ+WzdfKOhwx\nxxDqUZMmTZ076eE5IiIiXUD30hcREekCGvBFRES6gAZ8ERGRLqABX0REpAtowBcREekCGvBFRES6\ngAZ8ERGRLqABX0REpAtowBcREekCGvBFRES6gAZ8ERGRLvD/AWsdJeqxsbIcAAAAAElFTkSuQmCC\n",
 283 | "text/plain": [ 284 | "" 285 | ] 286 | }, 287 | "metadata": {}, 288 | "output_type": "display_data" 289 | } 290 | ], 291 | "source": [ 292 | "stroop.hist()" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "It can be observed that time taken for incongruent test are in general longer than congruent test evident by the range of the data.\n", 300 | "\n", 301 | "In congruent test, the time taken ranges from 8 seconds to 22 seconds and majority of the time taken is less than 14 seconds.\n", 302 | "\n", 303 | "In incongruent test, the range start from 15 seconds to 35 seconds and majority of the time taken us around 15 seconds to 25 seconds." 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "metadata": {}, 309 | "source": [ 310 | "#### 5. Now, perform the statistical test and report your results. What is your confidence level and your critical statistic value? Do you reject the null hypothesis or fail to reject it? Come to a conclusion in terms of the experiment task. Did the results match up with your expectations?" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "α = **0.05**\n", 318 | "\n", 319 | "df = **23**\n", 320 | "\n", 321 | "t-critical = **1.714**\n", 322 | "\n", 323 | "SE = stdev_(b-a) / √n\n", 324 | " = 4.86 / √24 = **0.99**\n", 325 | "\n", 326 | "t-statistic = mean difference / (stdev_(b-a) / sqrt(n)) = 7.96 / (4.86 / √24 ) = **8.02**\n", 327 | "\n", 328 | "From the t-table, p value is **<0.0005**.\n", 329 | "\n", 330 | "As a result, the **null hypothesis should be rejected**. Congruent words do take shorter response time to recognize its ink color than incongruent words.\n", 331 | "\n", 332 | "The result matches up with my expectations that congruent words are easier to identify its ink color." 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "#### 6. Optional: What do you think is responsible for the effects observed? Can you think of an alternative or similar task that would result in a similar effect? Some research about the problem will be helpful for thinking about these two questions!" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "Our brain processes word faster than recognizing color. Thus, we tend to be able to identify the ink color of congruent words before we even manage to identify the ink color.\n", 347 | "\n", 348 | "One similar examplpe would reading the musical scores. Playing violin as my first instrument, I am used to using treble clef. However, I have to switch to bass clef when playing cello. Notes on the same location point to different tones oon different clefs. Coupled with different string combinations (G-D-A-E in violin vs C-G-D-A in cello), it can be quite confusing when I switch between instrument immediately. It takes a while to familiarize with different clefs and string combination." 
349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "https://s3.amazonaws.com/udacity-hosted-downloads/t-table.jpg\n", 356 | "https://link.springer.com/article/10.3758%2FMC.38.7.893\n", 357 | "https://makingmusicmag.com/explanation-clefs-treble-bass-alto-tenor\n" 358 | ] 359 | } 360 | ], 361 | "metadata": { 362 | "kernelspec": { 363 | "display_name": "Python [conda env:DAND]", 364 | "language": "python", 365 | "name": "conda-env-DAND-py" 366 | }, 367 | "language_info": { 368 | "codemirror_mode": { 369 | "name": "ipython", 370 | "version": 2 371 | }, 372 | "file_extension": ".py", 373 | "mimetype": "text/x-python", 374 | "name": "python", 375 | "nbconvert_exporter": "python", 376 | "pygments_lexer": "ipython2", 377 | "version": "2.7.12" 378 | } 379 | }, 380 | "nbformat": 4, 381 | "nbformat_minor": 2 382 | } 383 | -------------------------------------------------------------------------------- /6-Inferential-Statistics/stroopdata.csv: -------------------------------------------------------------------------------- 1 | Congruent,Incongruent 2 | 12.079,19.278 3 | 16.791,18.741 4 | 9.564,21.214 5 | 8.630,15.687 6 | 14.669,22.803 7 | 12.238,20.878 8 | 14.692,24.572 9 | 8.987,17.394 10 | 9.401,20.762 11 | 14.480,26.282 12 | 22.328,24.524 13 | 15.298,18.644 14 | 15.073,17.510 15 | 16.929,20.330 16 | 18.200,35.255 17 | 12.130,22.158 18 | 18.495,25.139 19 | 10.639,20.429 20 | 11.344,17.425 21 | 12.369,34.288 22 | 12.944,23.894 23 | 14.233,17.960 24 | 19.710,22.058 25 | 16.004,21.157 26 | -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/Identify Fraud from Enron Email.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Identify Fraud from Enron Email\n", 8 | "### Enron Submission Free-Response Questions\n", 9 | "##### by Kai Sheng TEH" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "**1. Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those? [relevant rubric items: “data exploration”, “outlier investigation”]**" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "The goal of the project is to use Machine Learning to identify *person of interest (POI)* in the Enron fraud case through the use of financial information and email text files available to the public.\n", 24 | "\n", 25 | "There are 146 entries and 20 features in the dataset. Of those features, 14 are financial features denoted in U.S. dollar while 5 are text features in terms of number of emails sent or received. \n", 26 | "\n", 27 | "There is also a feature named 'poi' which identify whether the person is idnetified as POI in the real world situation. 18 of them are POIs while 128 are non-POIs. 
The hope is that the financial and email features of POI individuals differ from those of everyone else, so that they can serve as red flags.\n", 28 | "\n", 29 | "The features were iterated over to count how many 'NaN' values each one contains.\n", 30 | "\n", 31 | "| Features | NaN Count |\n", 32 | "| ------------------------- |:--------- |\n", 33 | "| loan_advances | 142 |\n", 34 | "| director_fees | 129 |\n", 35 | "| restricted_stock_deferred | 128 |\n", 36 | "| deferral_payments | 107 |\n", 37 | "| deferred_income | 97 |\n", 38 | "| long_term_incentive | 80 |\n", 39 | "| bonus | 64 |\n", 40 | "| shared_receipt_with_poi | 60 |\n", 41 | "| from_messages | 60 |\n", 42 | "| from_poi_to_this_person | 60 |\n", 43 | "| from_this_person_to_poi | 60 |\n", 44 | "| to_messages | 60 |\n", 45 | "| other | 53 |\n", 46 | "| salary | 51 |\n", 47 | "| expenses | 51 |\n", 48 | "| exercised_stock_options | 44 |\n", 49 | "| restricted_stock | 36 |\n", 50 | "| total_payments | 21 |\n", 51 | "| total_stock_value | 20 |\n", 52 | "| poi | 0 |\n", 53 | "\n", 54 | "Through exploratory data analysis, an outlier was identified when plotting a scatterplot of bonus vs salary. Going through the pdf file, the extremely high salary and bonus belong to the 'TOTAL' entry. It was removed together with 'The Travel Agency in the Park' using the *data_dict.pop* function. As such, only 144 entries are left in the dataset." 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "**2. What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values. [relevant rubric items: “create new features”, “intelligently select features”, “properly scale features”]**" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "3 new features were created. They are:\n", 69 | "1. bonus-salary ratio (bonus / salary)\n", 70 | "2. from_this_person_to_poi percentage (from_this_person_to_poi / from_messages)\n", 71 | "3. from_poi_to_this_person percentage (from_poi_to_this_person / to_messages)\n", 72 | "\n", 73 | "If a person is a POI, the bonus-to-salary ratio might be very high. 
Similarly, POIs might also interact through emails more frequently among themselves.\n", 74 | "\n", 75 | "Scaling was done using MinMaxScaler to ensure the impact of each variable is not dominated by any single variables with high values such as bonus or salary.\n", 76 | "\n", 77 | "Select K Best and Decision Tree methods were used to determine the feature scores as below:\n", 78 | "\n", 79 | "| SelectKBest | Score | Decision Tree | Score |\n", 80 | "|:----------------------------- | :------------ |:----------------------------- | :------------ |\n", 81 | "| exercised_stock_options\t | 24.81507973 | exercised_stock_options\t | 0.216485631 |\n", 82 | "| total_stock_value\t | 24.18289868\t| expenses\t | 0.180751044 |\n", 83 | "| bonus\t | 20.79225205\t| shared_receipt_with_poi\t | 0.178337845 |\n", 84 | "| salary\t | 18.28968404\t| from_this_person_to_poi_ratio | 0.135602729 |\n", 85 | "| from_this_person_to_poi_ratio | 16.40971255\t| total_payments\t | 0.115005291 |\n", 86 | "| deferred_income\t |11.45847658\t| total_stock_value\t | 0.055633364 |\n", 87 | "| long_term_incentive\t |9.922186013\t| from_this_person_to_poi\t | 0.047666667 |\n", 88 | "| restricted_stock\t |9.212810622\t| long_term_incentive\t | 0.04237037 |\n", 89 | "| total_payments\t |8.77277773\t\t| restricted_stock\t | 0.028147059 |\n", 90 | "| shared_receipt_with_poi\t |8.589420732\t| salary\t | 0 |\n", 91 | "| loan_advances\t |7.184055658\t| bonus\t | 0 |\n", 92 | "| expenses\t |6.094173311\t| deferral_payments\t | 0 |\n", 93 | "| from_poi_to_this_person\t |5.243449713\t| deferred_income\t | 0 |\n", 94 | "| from_poi_to_this_person_ratio\t| 3.128091748\t| restricted_stock_deferred\t | 0 |\n", 95 | "| from_this_person_to_poi\t |2.382612108\t| loan_advances\t | 0 |\n", 96 | "| director_fees\t | 2.126327802\t| from_messages\t | 0 |\n", 97 | "| to_messages\t | 1.646341129\t| director_fees\t | 0 |\n", 98 | "| deferral_payments\t | 0.224611275\t| from_poi_to_this_person\t | 0 |\n", 99 | "| from_messages\t | 0.169700948\t| from_poi_to_this_person_ratio | 0 |\n", 100 | "| restricted_stock_deferred\t | 0.065499653 | salary_bonus_ratio\t | 0 |\n", 101 | "| salary_bonus_ratio\t | 0.000368768\t| to_messages\t | 0 |\n", 102 | "\n", 103 | "SelectKBest scoring is chosen over DecisionTree.\n", 104 | "The top 10 features with the highest scores are kept for use in POI identifier." 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "**3. What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms? [relevant rubric item: “pick an algorithm”]**" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "The algorithms that I tried are Naive-Bayes Gaussian, AdaBoost, Decision Tree, K Nearest Neighbors and Random Forest classifiers.\n", 119 | "\n", 120 | "At this point of time, only Gaussian and Decision Tree algorithms fulfil the >0.3 requirement for bothprecision and recall. 
Thus, the other algorithms need tuning in order to meet the >0.3 requirement.\n", 121 | "\n", 122 | "The performance of each algorithm before tuning is shown below:\n", 123 | "\n", 124 | "| Algorithm | Accuracy | Precision | Recall | F1 Score |\n", 125 | "| :------------------- | :------- | :-------- | :------- | :------- |\n", 126 | "| Naive-Bayes Gaussian | 0.83613 | 0.36639 | 0.31400 | 0.33818 |\n", 127 | "| Decision Tree | 0.82207 | 0.32459 | 0.30950 | 0.31687 |\n", 128 | "| AdaBoost | 0.83853 | 0.36405 | 0.28250 | 0.31813 |\n", 129 | "| K Nearest Neighbors | 0.87640 | 0.63878 | 0.16800 | 0.26603 |\n", 130 | "| Random Forest | 0.86140 | 0.44537 | 0.16100 | 0.23650 |" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "**4. What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier). [relevant rubric items: “discuss parameter tuning”, “tune the algorithm”]**" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "Tuning the parameters of an algorithm means changing its input parameters and comparing the performance of each parameter combination in order to determine the optimal parameters to use. If the parameters are not optimized, the algorithm will deliver sub-optimal performance or lower accuracy.\n", 145 | "\n", 146 | "GridSearchCV was used to tune and find the best combination of parameters for each algorithm.\n", 147 | "\n", 148 | "The Gaussian classifier doesn't have any parameters to tune, so tuning is not applicable in its case.\n", 149 | "\n", 150 | "The performance after tuning is compared below:\n", 151 | "\n", 152 | "| Algorithm | Accuracy | Precision | Recall | F1 Score | Notes |\n", 153 | "| :------------------- | :------- | :-------- | :------- | :------- | :---------- |\n", 154 | "| Naive-Bayes Gaussian | 0.83613 | 0.36639 | 0.31400 | 0.33818 | Tuning N/A |\n", 155 | "| Decision Tree | 0.82533 | 0.33351 | 0.31050 | 0.32160 | - |\n", 156 | "| AdaBoost | 0.83820 | 0.36305 | 0.28300 | 0.31807 | - |\n", 157 | "| K Nearest Neighbors | 0.87640 | 0.63878 | 0.16800 | 0.26603 | - |\n", 158 | "| Random Forest | 0.86087 | 0.44299 | 0.16900 | 0.24466 | - |\n", 159 | "\n", 160 | "Gaussian has the best performance despite not needing any tuning, so it was chosen as the POI identifier algorithm." 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "**5. What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis? [relevant rubric items: “discuss validation”, “validation strategy”]**" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "Validation is the process of testing how well a trained model performs on data that was not used to train it.\n", 175 | "\n", 176 | "Classic mistakes made in validation include overfitting and using the same data for both training and testing. Overfitting occurs when the algorithm is fit too closely to the training data. 
It performs well on training data but poorly on testing data. As such, there must be a balance between fitting the training data too closely and too loosely, so that the algorithm generalizes reasonably well when tested on unseen data.\n", 177 | "\n", 178 | "At the same time, training and testing should not be done on the same data. Instead, 10%-30% of the data should be set aside for testing. This also helps prevent overfitting.\n", 179 | "\n", 180 | "For this project, *StratifiedShuffleSplit* is used for validation in *tester.py*. The data are split into 1000 different sets, also known as folds. This allows us to maximize the amount of data available for training and testing. The performance is then averaged out." 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "**6. Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance. [relevant rubric item: “usage of evaluation metrics”]**" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "The three evaluation metrics used extensively in this project are precision, recall and F1 score.\n", 195 | "The average precision, recall and F1 score for the Gaussian algorithm are 0.36639, 0.31400 and 0.33818 respectively.\n", 196 | "\n", 197 | "The definitions of the evaluation metrics are as below:\n", 198 | "1. Precision = true positives / (true positives + false positives) (high precision -> most people flagged as POIs really are POIs, i.e. few false alarms)\n", 199 | "2. Recall = true positives / (true positives + false negatives) (high recall -> most of the real POIs are flagged, i.e. few guilty persons escape)\n", 200 | "3. F1 score = 2 * (precision * recall) / (precision + recall), the harmonic mean of precision and recall\n", 201 | "\n", 202 | "In this case, we would want to strive for a higher precision so as not to accuse the wrong person. Machine learning is only one of the investigation methods, and police or prosecutors should not rely wholly on it. Any guilty person not picked up by the algorithm should still be subject to a thorough investigation." 203 | ] 204 | } 205 | ], 206 | "metadata": { 207 | "anaconda-cloud": {}, 208 | "kernelspec": { 209 | "display_name": "Python 2", 210 | "language": "python", 211 | "name": "python2" 212 | }, 213 | "language_info": { 214 | "codemirror_mode": { 215 | "name": "ipython", 216 | "version": 2 217 | }, 218 | "file_extension": ".py", 219 | "mimetype": "text/x-python", 220 | "name": "python", 221 | "nbconvert_exporter": "python", 222 | "pygments_lexer": "ipython2", 223 | "version": "2.7.12" 224 | } 225 | }, 226 | "nbformat": 4, 227 | "nbformat_minor": 2 228 | } 229 | -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/README.md: -------------------------------------------------------------------------------- 1 | 1. [Identify Fraud from Enron Email.ipynb](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/7-Intro-to-Machine-Learning/Identify%20Fraud%20from%20Enron%20Email.ipynb) - Jupyter Notebook containing the project 2 | 3 | 2. [resources.MD](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/7-Intro-to-Machine-Learning/resources.MD) - Resources used while completing the project 4 | 5 | 3. 
[poi_id.py](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/7-Intro-to-Machine-Learning/poi_id.py) - Code for the project 6 | 7 | 4. [tester.py](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/7-Intro-to-Machine-Learning/tester.py) - Tester code to evaluate the performance of the algorithms 8 | -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/my_classifier.pkl: -------------------------------------------------------------------------------- 1 | ccopy_reg 2 | _reconstructor 3 | p0 4 | (csklearn.ensemble.forest 5 | RandomForestClassifier 6 | p1 7 | c__builtin__ 8 | object 9 | p2 10 | Ntp3 11 | Rp4 12 | (dp5 13 | S'warm_start' 14 | p6 15 | I00 16 | sS'base_estimator' 17 | p7 18 | g0 19 | (csklearn.tree.tree 20 | DecisionTreeClassifier 21 | p8 22 | g2 23 | Ntp9 24 | Rp10 25 | (dp11 26 | S'presort' 27 | p12 28 | I00 29 | sS'splitter' 30 | p13 31 | S'best' 32 | p14 33 | sS'min_impurity_decrease' 34 | p15 35 | F0.0 36 | sS'min_impurity_split' 37 | p16 38 | NsS'min_samples_leaf' 39 | p17 40 | I1 41 | sS'max_features' 42 | p18 43 | NsS'random_state' 44 | p19 45 | NsS'criterion' 46 | p20 47 | S'gini' 48 | p21 49 | sS'min_weight_fraction_leaf' 50 | p22 51 | F0.0 52 | sS'max_leaf_nodes' 53 | p23 54 | NsS'min_samples_split' 55 | p24 56 | I2 57 | sS'_sklearn_version' 58 | p25 59 | S'0.19.1' 60 | p26 61 | sS'max_depth' 62 | p27 63 | NsS'class_weight' 64 | p28 65 | NsbsS'n_jobs' 66 | p29 67 | I1 68 | sg15 69 | F0.0 70 | sS'verbose' 71 | p30 72 | I0 73 | sg23 74 | NsS'bootstrap' 75 | p31 76 | I01 77 | sS'oob_score' 78 | p32 79 | I00 80 | sg17 81 | I1 82 | sS'n_estimators' 83 | p33 84 | I10 85 | sg24 86 | I2 87 | sg22 88 | F0.0 89 | sg20 90 | g21 91 | sS'estimator_params' 92 | p34 93 | (g20 94 | g27 95 | g24 96 | g17 97 | g22 98 | g18 99 | g23 100 | g15 101 | g16 102 | g19 103 | tp35 104 | sg19 105 | Nsg16 106 | Nsg18 107 | S'auto' 108 | p36 109 | sg25 110 | g26 111 | sg27 112 | Nsg28 113 | Nsb. -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/my_feature_list.pkl: -------------------------------------------------------------------------------- 1 | (lp0 2 | S'poi' 3 | p1 4 | aS'salary' 5 | p2 6 | aS'bonus' 7 | p3 8 | aS'deferred_income' 9 | p4 10 | aS'long_term_incentive' 11 | p5 12 | aS'shared_receipt_with_poi' 13 | p6 14 | aS'total_stock_value' 15 | p7 16 | aS'total_payments' 17 | p8 18 | aS'exercised_stock_options' 19 | p9 20 | aS'from_this_person_to_poi_ratio' 21 | p10 22 | aS'restricted_stock' 23 | p11 24 | a. 
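For reference, here is a minimal, self-contained sketch of the scaling and SelectKBest selection step described in the notebook answer above. The feature names and numbers below are toy values, not the project data; in the project itself the features/labels arrays are built by featureFormat and targetFeatureSplit in poi_id.py, k is set to 10, and the ten highest-scoring features are the ones stored in my_feature_list.pkl above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Toy stand-ins for the real features/labels produced in poi_id.py (values are made up).
feature_names = ['salary', 'bonus', 'exercised_stock_options']
features = np.array([[200000., 1000000., 5000000.],
                     [ 90000.,  200000.,  100000.],
                     [150000.,  800000., 3000000.],
                     [ 60000.,  100000.,       0.]])
labels = np.array([1, 0, 1, 0])  # 1 = POI

# Scale every feature to [0, 1] so high-magnitude fields such as bonus do not dominate.
features = MinMaxScaler().fit_transform(features)

# Score each feature against the POI label and keep the k best (k=10 in the project).
selector = SelectKBest(f_classif, k=2).fit(features, labels)
print(sorted(zip(feature_names, selector.scores_), key=lambda x: x[1], reverse=True))
print([name for name, keep in zip(feature_names, selector.get_support()) if keep])
```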
-------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/poi_id.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | import sys 4 | import pickle 5 | sys.path.append("../tools/") 6 | import matplotlib.pyplot as plt 7 | import numpy as np 8 | import random 9 | 10 | from feature_format import featureFormat, targetFeatureSplit 11 | from tester import dump_classifier_and_data 12 | 13 | from sklearn.metrics import accuracy_score 14 | from sklearn.naive_bayes import GaussianNB 15 | from sklearn.svm import SVC 16 | from sklearn import tree 17 | from sklearn.tree import DecisionTreeClassifier 18 | from sklearn.metrics import accuracy_score 19 | from sklearn.preprocessing import MinMaxScaler 20 | from sklearn.feature_selection import SelectKBest 21 | from sklearn.model_selection import cross_val_score 22 | from sklearn.ensemble import AdaBoostClassifier 23 | from sklearn.neighbors import KNeighborsClassifier 24 | from sklearn.ensemble import RandomForestClassifier 25 | from sklearn.pipeline import Pipeline 26 | from sklearn.grid_search import GridSearchCV 27 | from sklearn.feature_selection import SelectKBest, f_classif 28 | 29 | 30 | ### Task 1: Select what features you'll use. 31 | ### features_list is a list of strings, each of which is a feature name. 32 | ### The first feature must be "poi". 33 | features_list = ['poi', 34 | 'salary', 35 | 'bonus', 36 | 'deferral_payments', 37 | 'expenses', 38 | 'deferred_income', 39 | 'long_term_incentive', 40 | 'restricted_stock_deferred', 41 | 'shared_receipt_with_poi', 42 | 'loan_advances', 43 | 'from_messages', 44 | 'director_fees', 45 | 'total_stock_value', 46 | 'from_poi_to_this_person', 47 | 'from_this_person_to_poi', 48 | 'total_payments', 49 | 'exercised_stock_options', 50 | 'to_messages', 51 | 'restricted_stock', 52 | 'other'] 53 | # You will need to use more features 54 | 55 | ### Load the dictionary containing the dataset 56 | with open("final_project_dataset.pkl", "r") as data_file: 57 | data_dict = pickle.load(data_file) 58 | 59 | ### Summarize the dataset 60 | ''' 61 | print 'Total number of ppl:', len(data_dict) #How many people in the dataset 62 | print 'Number of features:', len(features_list) #How mant features in the list 63 | 64 | poi_num = 0 65 | for i in data_dict: 66 | if data_dict[i]['poi'] == True: 67 | poi_num += 1 68 | print 'Number of poi:', poi_num #Number of poi 69 | ''' 70 | 71 | ### Checking the incompleteness of the dataset (% of NaN in every feature) 72 | ### Need to remove the new features created when running the code below 73 | ''' 74 | nan = [0 for i in range(len(features_list))] 75 | for k, v in data_dict.iteritems(): 76 | for j, feature in enumerate(features_list): 77 | if v[feature] == 'NaN': 78 | nan[j] += 1 79 | 80 | for i, feature in enumerate(features_list): 81 | print 'NaN count for', feature, ':', nan[i] #Number of NaN in each feature 82 | ''' 83 | 84 | ### Task 2: Remove outliers 85 | ### After plotting and checking the pdf, "TOTAL" and "THE AGENCY IN THE PARK" 86 | ### are removed as both don't seem to help with predicting 87 | 88 | data_dict.pop('TOTAL', 0) 89 | data_dict.pop('THE TRAVEL AGENCY IN THE PARK', 0) 90 | 91 | ### Scatterplot of Bonus vs Salary 92 | ''' 93 | data = featureFormat(data_dict, ['salary', 'bonus']) 94 | for point in data: 95 | x = point[0] 96 | y = point[1] 97 | plt.scatter(x, y) 98 | plt.xlabel('salary') 99 | plt.ylabel('bonus') 100 | ''' 101 | 102 | ### Task 3: Create new 
feature(s) 103 | 104 | ### Create new salary-bonus-ratio, from_this_person_to_poi % & from_poi_to_this_person % features 105 | 106 | def ratio_calc(numerator, denominator): 107 |     fraction = 0 108 |     if numerator == 'NaN' or denominator == 'NaN': 109 |         fraction = 'NaN' 110 |     else: 111 |         fraction = float(numerator) / float(denominator) 112 |     return fraction 113 | 114 | for name in data_dict: 115 |     salary_bonus_ratio_temp = ratio_calc(data_dict[name]['salary'], data_dict[name]['bonus']) 116 |     data_dict[name]['salary_bonus_ratio'] = salary_bonus_ratio_temp 117 | 118 |     from_this_person_to_poi_ratio_temp = ratio_calc(data_dict[name]['from_this_person_to_poi'], data_dict[name]['from_messages']) 119 |     data_dict[name]['from_this_person_to_poi_ratio'] = from_this_person_to_poi_ratio_temp 120 | 121 |     from_poi_to_this_person_ratio_temp = ratio_calc(data_dict[name]['from_poi_to_this_person'], data_dict[name]['to_messages']) 122 |     data_dict[name]['from_poi_to_this_person_ratio'] = from_poi_to_this_person_ratio_temp 123 | 124 | 125 | ### Store to my_dataset for easy export below. 126 | my_dataset = data_dict 127 | 128 | ### Extract features and labels from dataset for local testing 129 | features_list = ['poi', 130 | 'salary', 131 | 'bonus', 132 | #'deferral_payments', 133 | #'expenses', 134 | 'deferred_income', 135 | 'long_term_incentive', 136 | #'restricted_stock_deferred', 137 | 'shared_receipt_with_poi', 138 | #'loan_advances', 139 | #'from_messages', 140 | #'director_fees', 141 | 'total_stock_value', 142 | #'from_poi_to_this_person', 143 | #'from_this_person_to_poi', 144 | 'total_payments', 145 | 'exercised_stock_options', 146 | #'from_poi_to_this_person_ratio', 147 | 'from_this_person_to_poi_ratio', 148 | #'salary_bonus_ratio', 149 | #'to_messages', 150 | 'restricted_stock' 151 | #'other' 152 | ] 153 | ### Other is removed from the dataset as it doesn't tell much and including it 154 | ### might skew the predictions 155 | 156 | data = featureFormat(my_dataset, features_list, sort_keys = True) 157 | labels, features = targetFeatureSplit(data) 158 | 159 | 160 | ### Applying feature scaling using MinMaxScaler 161 | 162 | scaler = MinMaxScaler() 163 | features = scaler.fit_transform(features) 164 | 165 | 166 | ### Using SelectKBest to determine which features to use 167 | ''' 168 | selector = SelectKBest(f_classif, k=20) 169 | selector.fit(features, labels) 170 | features = selector.transform(features) 171 | feature_scores = zip(features_list[1:],selector.scores_) 172 | 173 | sorted_scores = sorted(feature_scores, key=lambda feature: feature[1], reverse = True) 174 | for item in sorted_scores: 175 |     print item[0], item[1] 176 | ''' 177 | 178 | ### Using Decision Tree 179 | ''' 180 | clf = DecisionTreeClassifier() 181 | clf.fit(features, labels) 182 | dt_scores = zip(features_list[1:],clf.feature_importances_) 183 | sorted_dtscores = sorted(dt_scores, key=lambda feature: feature[1], reverse = True) 184 | for item in sorted_dtscores: 185 |     print item[0], item[1] 186 | ''' 187 | 188 | ### Task 4: Try a variety of classifiers 189 | ### Please name your classifier clf for easy export below. 190 | ### Note that if you want to do PCA or other multi-stage operations, 191 | ### you'll need to use Pipelines. For more info: 192 | ### http://scikit-learn.org/stable/modules/pipeline.html 193 | 194 | # Provided to give you a starting point. Try a variety of classifiers. 
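### Illustrative sketch only: one quick way to compare the candidate classifiers
### listed below, using the cross_val_score helper already imported above and the
### features/labels built earlier in this file. The numbers it prints are rough
### cross-validated F1 guides; the precision/recall figures quoted in the notebook
### come from tester.py's StratifiedShuffleSplit evaluation, not from this loop.
'''
candidates = [('Naive-Bayes Gaussian', GaussianNB()),
              ('Decision Tree', tree.DecisionTreeClassifier()),
              ('AdaBoost', AdaBoostClassifier()),
              ('K Nearest Neighbors', KNeighborsClassifier()),
              ('Random Forest', RandomForestClassifier())]
for candidate_name, candidate in candidates:
    scores = cross_val_score(candidate, features, labels, scoring='f1', cv=5)
    print candidate_name, round(scores.mean(), 3)
'''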
195 | ### Naive-Bayes Gaussian 196 | ''' 197 | clf = GaussianNB() 198 | ''' 199 | 200 | ### Decision Tree 201 | #clf = tree.DecisionTreeClassifier() 202 | 203 | ### AdaBoost Classifier 204 | #clf = AdaBoostClassifier() 205 | 206 | ### K Nearest Neighbors Classifier 207 | #clf = KNeighborsClassifier() 208 | 209 | ### Random Forest Classifier 210 | #clf = RandomForestClassifier() 211 | 212 | ### Task 5: Tune your classifier to achieve better than .3 precision and recall 213 | ### using our testing script. Check the tester.py script in the final project 214 | ### folder for details on the evaluation method, especially the test_classifier 215 | ### function. Because of the small size of the dataset, the script uses 216 | ### stratified shuffle split cross validation. For more info: 217 | ### http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html 218 | 219 | # Example starting point. Try investigating other evaluation techniques! 220 | 221 | from sklearn.cross_validation import train_test_split 222 | 223 | features_train, features_test, labels_train, labels_test = \ 224 | train_test_split(features, labels, test_size=0.3, random_state=42) 225 | 226 | 227 | ### Decision Tree, need to change the clf to dt when tring to figure the optimal parameter values 228 | ''' 229 | clf = tree.DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, 230 | max_features=None, max_leaf_nodes=None, 231 | min_impurity_decrease=0.0, min_impurity_split=None, 232 | min_samples_leaf=1, min_samples_split=2, 233 | min_weight_fraction_leaf=0.0, presort=False, random_state=None, 234 | splitter='best') 235 | ''' 236 | #param_grid = {} 237 | #clf = GridSearchCV(dt, param_grid) 238 | #clf = clf.fit(features, labels) 239 | #print clf.best_estimator_ 240 | 241 | ### AdaBoost Classifier 242 | #clf = AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, 243 | # learning_rate=1.0, n_estimators=50, random_state=None) 244 | #param_grid = {} 245 | #clf = GridSearchCV(ada, param_grid) 246 | #clf = clf.fit(features, labels) 247 | #print clf.best_estimator_ 248 | 249 | ### K Nearest Neighbors Classifier 250 | #clf = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', 251 | # metric_params=None, n_jobs=1, n_neighbors=5, p=2, 252 | # weights='uniform') 253 | #param_grid = {} 254 | #clf = GridSearchCV(knn, param_grid) 255 | #clf = clf.fit(features, labels) 256 | #print clf.best_estimator_ 257 | 258 | ### Random Forest Classifier 259 | #clf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', 260 | # max_depth=None, max_features='auto', max_leaf_nodes=None, 261 | # min_impurity_decrease=0.0, min_impurity_split=None, 262 | # min_samples_leaf=1, min_samples_split=2, 263 | # min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, 264 | # oob_score=False, random_state=None, verbose=0, 265 | # warm_start=False) 266 | #param_grid = {} 267 | #clf = GridSearchCV(rf, param_grid) 268 | #clf = clf.fit(features, labels) 269 | #print clf.best_estimator_ 270 | 271 | ### Task 6: Dump your classifier, dataset, and features_list so anyone can 272 | ### check your results. You do not need to change anything below, but make sure 273 | ### that the version of poi_id.py that you submit can be run on its own and 274 | ### generates the necessary .pkl files for validating your results. 
275 | 276 | 277 | dump_classifier_and_data(clf, my_dataset, features_list) 278 | -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/resources.MD: -------------------------------------------------------------------------------- 1 | Some of the resources used while completing the projects: 2 | 1. scikit-learn documentation 3 | http://scikit-learn.org/stable/documentation.html 4 | 5 | 2. R documentation 6 | https://www.rdocumentation.org 7 | 8 | 3. Forums 9 | http://www.stackoverflow.com 10 | -------------------------------------------------------------------------------- /7-Intro-to-Machine-Learning/tester.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/pickle 2 | 3 | """ a basic script for importing student's POI identifier, 4 | and checking the results that they get from it 5 | 6 | requires that the algorithm, dataset, and features list 7 | be written to my_classifier.pkl, my_dataset.pkl, and 8 | my_feature_list.pkl, respectively 9 | 10 | that process should happen at the end of poi_id.py 11 | """ 12 | 13 | import pickle 14 | import sys 15 | from sklearn.cross_validation import StratifiedShuffleSplit 16 | sys.path.append("../tools/") 17 | from feature_format import featureFormat, targetFeatureSplit 18 | 19 | PERF_FORMAT_STRING = "\ 20 | \tAccuracy: {:>0.{display_precision}f}\tPrecision: {:>0.{display_precision}f}\t\ 21 | Recall: {:>0.{display_precision}f}\tF1: {:>0.{display_precision}f}\tF2: {:>0.{display_precision}f}" 22 | RESULTS_FORMAT_STRING = "\tTotal predictions: {:4d}\tTrue positives: {:4d}\tFalse positives: {:4d}\ 23 | \tFalse negatives: {:4d}\tTrue negatives: {:4d}" 24 | 25 | def test_classifier(clf, dataset, feature_list, folds = 1000): 26 | data = featureFormat(dataset, feature_list, sort_keys = True) 27 | labels, features = targetFeatureSplit(data) 28 | cv = StratifiedShuffleSplit(labels, folds, random_state = 42) 29 | true_negatives = 0 30 | false_negatives = 0 31 | true_positives = 0 32 | false_positives = 0 33 | for train_idx, test_idx in cv: 34 | features_train = [] 35 | features_test = [] 36 | labels_train = [] 37 | labels_test = [] 38 | for ii in train_idx: 39 | features_train.append( features[ii] ) 40 | labels_train.append( labels[ii] ) 41 | for jj in test_idx: 42 | features_test.append( features[jj] ) 43 | labels_test.append( labels[jj] ) 44 | 45 | ### fit the classifier using training set, and test on test set 46 | clf.fit(features_train, labels_train) 47 | predictions = clf.predict(features_test) 48 | for prediction, truth in zip(predictions, labels_test): 49 | if prediction == 0 and truth == 0: 50 | true_negatives += 1 51 | elif prediction == 0 and truth == 1: 52 | false_negatives += 1 53 | elif prediction == 1 and truth == 0: 54 | false_positives += 1 55 | elif prediction == 1 and truth == 1: 56 | true_positives += 1 57 | else: 58 | print "Warning: Found a predicted label not == 0 or 1." 59 | print "All predictions should take value 0 or 1." 
60 | print "Evaluating performance for processed predictions:" 61 | break 62 | try: 63 | total_predictions = true_negatives + false_negatives + false_positives + true_positives 64 | accuracy = 1.0*(true_positives + true_negatives)/total_predictions 65 | precision = 1.0*true_positives/(true_positives+false_positives) 66 | recall = 1.0*true_positives/(true_positives+false_negatives) 67 | f1 = 2.0 * true_positives/(2*true_positives + false_positives+false_negatives) 68 | f2 = (1+2.0*2.0) * precision*recall/(4*precision + recall) 69 | print clf 70 | print PERF_FORMAT_STRING.format(accuracy, precision, recall, f1, f2, display_precision = 5) 71 | print RESULTS_FORMAT_STRING.format(total_predictions, true_positives, false_positives, false_negatives, true_negatives) 72 | print "" 73 | except: 74 | print "Got a divide by zero when trying out:", clf 75 | print "Precision or recall may be undefined due to a lack of true positive predicitons." 76 | 77 | CLF_PICKLE_FILENAME = "my_classifier.pkl" 78 | DATASET_PICKLE_FILENAME = "my_dataset.pkl" 79 | FEATURE_LIST_FILENAME = "my_feature_list.pkl" 80 | 81 | def dump_classifier_and_data(clf, dataset, feature_list): 82 | with open(CLF_PICKLE_FILENAME, "w") as clf_outfile: 83 | pickle.dump(clf, clf_outfile) 84 | with open(DATASET_PICKLE_FILENAME, "w") as dataset_outfile: 85 | pickle.dump(dataset, dataset_outfile) 86 | with open(FEATURE_LIST_FILENAME, "w") as featurelist_outfile: 87 | pickle.dump(feature_list, featurelist_outfile) 88 | 89 | def load_classifier_and_data(): 90 | with open(CLF_PICKLE_FILENAME, "r") as clf_infile: 91 | clf = pickle.load(clf_infile) 92 | with open(DATASET_PICKLE_FILENAME, "r") as dataset_infile: 93 | dataset = pickle.load(dataset_infile) 94 | with open(FEATURE_LIST_FILENAME, "r") as featurelist_infile: 95 | feature_list = pickle.load(featurelist_infile) 96 | return clf, dataset, feature_list 97 | 98 | def main(): 99 | ### load up student's classifier, dataset, and feature_list 100 | clf, dataset, feature_list = load_classifier_and_data() 101 | ### Run testing script 102 | test_classifier(clf, dataset, feature_list) 103 | 104 | if __name__ == '__main__': 105 | main() 106 | -------------------------------------------------------------------------------- /8-Data-Visualization-in-Tableau/Create-a-Tableau-Story.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Summary" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Airports in the US have been registering consistent growth yearly with the exception of year 2001 due to 9/11 incident and the late 2000s recession. The visualzations below are to provide an insight of the aviation industry in the US and some more detailed segmentated views for the west coast and east coast." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## Design" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "### Dataset\n", 29 | "\n", 30 | "The follwing steps were taken in order to create a comprehensive csv file.\n", 31 | "1. Multiple datasets were downloaded from the website of U.S. Burueat of Transportation Statistics from year 2000 to 2016. \n", 32 | "\n", 33 | "2. Only scheduled flights related to passengers are kept while those of freight and unschedule flights are removed.\n", 34 | "\n", 35 | "3. 
Rows with 0 total passengers are deleted as they do not add anything to the final dataset.\n", 36 | "\n", 37 | "4. Each row of data is duplicated except that its origin and destination are swapped. This is done because we also need to count the passengers for the destination city without changing the structure of the data.\n", 38 | "\n", 39 | "5. The data is then pivoted in Excel to calculate the annual number of passengers by airline and by city pair.\n", 40 | "\n", 41 | "6. Once the data is loaded into Tableau, airport pair and city pair fields are created by combining the airport code and city name.\n", 42 | "\n", 43 | "7. Another calculated field is also created to filter out city pairs with fewer than 500 passengers annually when calculating the connectivity of an airport (the number of airports with a direct flight), since there are lots of one-off flights which would otherwise skew the data, such as Singapore-Raleigh, which is never flown by any commercial airline." 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### Visualization\n", 51 | "\n", 52 | "Trend (line) charts are chosen as they allow data over the period from 2000 to 2016 to be shown.\n", 53 | "Instead of listing the airports or airlines in tables, trend charts allow the values to be visualized easily and the ranking to be observed readily.\n", 54 | "\n", 55 | "For example, in the \"Foreign Airports (Annual Int'l Pax) - West Coast\" chart, it can be seen that Narita Airport (NRT) is losing its lead in terms of passenger movement and is slated to be replaced by Vancouver (YVR). This trend couldn't be seen if an ordinary table were used to represent the data.\n", 56 | "\n", 57 | "The storyline is carried over the following slides of trend charts:\n", 58 | "1. US Airport (Annual Pax) and US Airlines (Annual Pax)\n", 59 | "2. US Airport (Annual Domestic Pax) and US Airlines (Annual Domestic Pax)\n", 60 | "3. US Airports (Annual Int'l Pax) and US Airlines (Annual Int'l Pax)\n", 61 | "4. Foreign Airports (Annual Int'l Pax) and Foreign Airlines (Annual Int'l Pax)\n", 62 | "5. Foreign Airports (Annual Int'l Pax) - West Coast and East Coast\n", 63 | "6. Foreign Market (Annual Int'l Pax) - Overall, West Coast, East Coast\n", 64 | "7. US Airports Domestic Connectivity and Int'l Connectivity (# Airports)\n", 65 | "8. Foreign Airports Int'l Connectivity (# Airports)\n", 66 | "9. City Pairs (Domestic & Int'l)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "### Improvements\n", 74 | "\n", 75 | "#### Feedback #1\n", 76 | "My initial graphs were all filled with data and lines. \n", 77 | "![Too much info!!](https://user-images.githubusercontent.com/14093302/34650996-f9f27c14-f404-11e7-9292-a4d8b17e733a.PNG \"Too much info!!\")\n", 78 | "\n", 79 | "After being told that the graphs were too crammed, I removed most of the data and kept only the top airports in terms of passengers, which form the main storyline for this project.\n", 80 | "\n", 81 | "#### Feedback #2\n", 82 | "\n", 83 | "Each city pair is duplicated in the opposite direction. For example, the line for LHR-JFK is overlapped by JFK-LHR.\n", 84 | "![Duplicated city pairs](https://user-images.githubusercontent.com/14093302/34650997-fc1a584a-f404-11e7-9a77-db0eebd2623c.PNG \"Duplicated City Pairs!\")\n", 85 | "\n", 86 | "The duplication arises because the data is organized with the origin and destination in one row and was then duplicated to account for passenger counts in both directions. 
After filtering out city pairs with low passenger volume, I carefully removed the duplicated city pairs for both graphs.\n", 87 | "\n", 88 | "#### Feedback #3\n", 89 | "The labels for some of my graphs are entangled such that they overlap and obstruct each other.\n", 90 | "![Labels too messy!!](https://user-images.githubusercontent.com/14093302/34650999-feaa0ede-f404-11e7-9900-c928e165a83f.PNG \"Labels too messy!!\")\n", 91 | "\n", 92 | "Wherever possible, I try to label each line. However, when the labels overlap I remove them and only label the top few, since most of the storyline is centered on them. \n", 93 | "\n", 94 | "I learned that it is better to convey a more limited range of data effectively than to squeeze all the information into a graph to the extent that it overwhelms the audience." 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "## Feedback" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### Feedback 1\n", 109 | "\n", 110 | "There is just too much info on each graph and it is hard to focus on the data. Perhaps you should only pick the top 10 highest or highlight a specific airport or airline that you wish to talk about?" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### Feedback 2\n", 118 | "\n", 119 | "Do you notice that in the city pair graphs, there seem to be duplicated lines trailing below, as if there are pairs of data? For example, the city pair with the highest amount of passengers is New York, NY to London, UK. Just pick the first one of each pair and hide the others." 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "### Feedback 3" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "I feel overwhelmed by your graphs, as there are just too many lines without a clear focus. If you just want to show the ranking of the airports in terms of passengers, why not use horizontal bar charts? Unless it is to show the trend over time? But again, there isn't a clear focus on your graph and it seems you just show whatever you have. At the same time, not all lines need to be labelled." 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "### Feedback 4" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "I highly recommend using color-blind friendly colors. The following chart shows some of the indistinguishable colors for people with this condition.\n", 148 | "\n", 149 | "You can find the colorblind palette in the color legends by going to edit colors, then in the opened window clicking on the dropdown; you can see a colorblind option." 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "## Project\n", 157 | "\n", 158 | "\n", 159 | "https://public.tableau.com/profile/kaishengteh#!/vizhome/USFlights/USAviation2000-2016" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "%%HTML\n", 167 | "\n", 168 | "
\n" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "## Resources" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=293" 183 | ] 184 | } 185 | ], 186 | "metadata": { 187 | "kernelspec": { 188 | "display_name": "Python [conda root]", 189 | "language": "python", 190 | "name": "conda-root-py" 191 | }, 192 | "language_info": { 193 | "codemirror_mode": { 194 | "name": "ipython", 195 | "version": 3 196 | }, 197 | "file_extension": ".py", 198 | "mimetype": "text/x-python", 199 | "name": "python", 200 | "nbconvert_exporter": "python", 201 | "pygments_lexer": "ipython3", 202 | "version": "3.6.2" 203 | } 204 | }, 205 | "nbformat": 4, 206 | "nbformat_minor": 2 207 | } 208 | -------------------------------------------------------------------------------- /8-Data-Visualization-in-Tableau/Flight Summary.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kaishengteh/Data-Analyst-Nanodegree/b29c745f5abdab54058fd2ae7f50664759523658/8-Data-Visualization-in-Tableau/Flight Summary.zip -------------------------------------------------------------------------------- /8-Data-Visualization-in-Tableau/README.md: -------------------------------------------------------------------------------- 1 | The Tableau storyline can be accessed [here](https://public.tableau.com/profile/kaishengteh#!/vizhome/USFlights/USAviation2000-2016). 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data-Analyst-Nanodegree 2 | 3 | ### Kai Sheng Teh 4 | 5 | This repository contains projects for Udacity's [Data Analyst Nanodegree](https://www.udacity.com/course/data-analyst-nanodegree--nd002). 6 | 7 | ### Part 1: Analyze Bay Area Bike Share Project 8 | Complete your first project analyzing bike rental data. It’s a great project to tackle during the first week in your Nanodegree to see if the program is a good fit for you! 9 | 10 | - Project: [Analyze Bay Area Bike Share Data](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/1-Analyze-Bay-Area-Bike-Share-Project/Bay_Area_Bike_Share_Analysis.ipynb) 11 | 12 | ### Part 2: [Descriptive Statistics](https://www.udacity.com/course/intro-to-descriptive-statistics--ud827) 13 | Learn to use descriptive statistics to describe properties of datasets and learn about how samples and populations are related. 14 | 15 | ### Part 3: [Intro to Data Analysis](https://www.udacity.com/course/intro-to-data-analysis--ud170) 16 | Choose one of Udacity's curated datasets and investigate it using NumPy and Pandas. Go through the entire data analysis process, starting by posing a question and finishing by sharing your findings. 17 | 18 | - Project: [Investigate a Dataset](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/3-Intro-to-Data-Analysis/Investigate-a-Dataset.ipynb) 19 | 20 | ### Part 4: [Data Wrangling](https://www.udacity.com/course/data-wrangling-with-mongodb--ud032) 21 | Choose a region of the world from www.openstreetmap.org and then use data wrangling techniques to audit and clean the data. You'll then use a database to query the cleaned data. 
22 | 23 | - Project: [Wrangle OpenStreetMap Data](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/4-Data-Wrangling/Data_Wrangling.ipynb) 24 | 25 | ### Part 5: [Exploratory Data Analysis](https://www.udacity.com/course/data-analysis-with-r--ud651) 26 | Use R and apply exploratory data analysis techniques to explore a selected data set for distributions, outliers, and relationships. 27 | 28 | - Project: [Explore and Summarize Data](https://cdn.rawgit.com/kaishengteh/Data-Analyst-Nanodegree/e94db549/5-Exploratory-Data-Analysis/Exploratory%20Data%20Analysis.html) 29 | 30 | ### Part 6: [Inferential Statistics](https://www.udacity.com/course/intro-to-inferential-statistics--ud201) 31 | Use descriptive statistics and a statistical test to analyze the Stroop effect, a classic result of experimental psychology. 32 | 33 | - Project: [Test a Perceptual Phenomenon](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/6-Inferential-Statistics/Stroop-Effect.ipynb) 34 | 35 | ### Part 7: [Intro to Machine Learning](https://www.udacity.com/course/intro-to-machine-learning--ud120) 36 | Play detective and put your machine learning skills to use by building an algorithm to identify Enron Employees who may have committed fraud based on the public Enron financial and email dataset. 37 | 38 | - Project: [Identify Fraud from Enron Email](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/7-Intro-to-Machine-Learning/Identify%20Fraud%20from%20Enron%20Email.ipynb) 39 | 40 | ### Part 8: [Data Visualization in Tableau](https://www.udacity.com/course/data-visualization-in-tableau--ud1006) 41 | Understand the importance of data visualization. Know how different data types are encoded in visualizations. Select the most effective chart or graph based on the data being displayed. 42 | 43 | - Project: [Create a Tableau Story](https://github.com/kaishengteh/Data-Analyst-Nanodegree/blob/master/8-Data-Visualization-in-Tableau/Create-a-Tableau-Story.ipynb) 44 | 45 | ![udacity Data Analyst Nanodegree](https://user-images.githubusercontent.com/14093302/37262598-0227d436-25df-11e8-9613-a6a03c8edc08.jpg) 46 | --------------------------------------------------------------------------------