├── LICENSE
├── README.md
└── first_unique.py


/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Alexandre Campino

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Interview Practice - Data Analyst - Udacity
This readme file answers the Project Submission for the
Interview Practice of Udacity's Data Analyst Nanodegree.


### Question 1 - Describe a data project you worked on recently.

I have recently been working on merging and analyzing my company’s
sales/user data. This is a two-part project. At the moment, the data is
spread across 7 different web tools, most of which are outdated.
So I have been writing Python routines to scrape and mine this data.
Unfortunately most of these tools do not offer an easy export, so
scraping the HTML pages has been the solution, for the most part.
After all this data is mined and gathered, using CSV files at first,
it is cleaned using Pandas. The data is then stored in
a SQL database, so it can readily be used in the future.

The second part of the project is the data analytics and presentation.
I am using Tableau and the R language to read the SQL database and build models
to predict the outcome of our events.
Also with Tableau, I make weekly visualizations of the past week’s numbers,
organized in a neat way that allows the company to draw conclusions and plan
events better. All this work has already borne fruit, since we reduced our
wasted-materials bill by 30%. We have also steadily been increasing our
revenue by 5% each month, due to the models that are now in place.

With this project I have learned in depth what it is to be a
data analyst. I have gone through all the phases of the project successfully.
This project has been truly amazing and inspiring because it is very challenging.
I have learned to use Python daily, along with the routines to programmatically
solve issues that would take countless hours manually. I have gained professional
knowledge of how to gather all the information in a database and produce a meaningful
outcome from it that can be used to fulfill the company's goals.

All these skills can be used in my new position at the company. I can use my
knowledge of building Python applications to develop tools that will solve problems
related to scalability and accommodating millions of users. The database skills I acquired
will be used to help network tools handle future loads and improve performance.

### Question 2 - You are given a ten piece box of chocolate truffles.

You know based on the label that six of the pieces have an orange cream
filling and four of the pieces have a coconut filling. If you were to eat
four pieces in a row, what is the probability that the first two pieces
you eat have an orange cream filling and the last two have a coconut
filling?

A good way to answer this question is with a tree diagram.

Let's call P(A) the probability of the event in question.
P(O) is the probability of eating an orange filling and P(C) a coconut filling.

Only one order of events matters, which is:
```
P(O) -> P(O) -> P(C) -> P(C)
```

On the first chocolate we have P(O) = 6/10, and on the second 5/9. Then P(C) is 4/8 = 1/2
and then 3/7. The following table summarizes the information:


| Ate    | Probability | O left     | C left     |
|--------|-------------|------------|------------|
| -      | -           | 6          | 4          |
| O      | 6/10        | 5          | 4          |
| O      | 5/9         | 4          | 4          |
| C      | 1/2         | 4          | 3          |
| C      | 3/7         | 4          | 2          |

The pieces are eaten without replacement, so each probability above is
conditional on the pieces eaten before it. By the chain rule, the total
is their product:

```
P(A) = 6/10 x 5/9 x 1/2 x 3/7 = 1/14 = 0.0714 or about 7%
```

### Follow up question:

If you were given an identical box of chocolates and again eat four
pieces in a row, what is the probability that exactly two contain
coconut filling?

There are 6 different possible orders of events that allow us to eat exactly 2
coconut chocolates. These are:

```
C C O O
O C C O
O O C C
C O O C
C O C O
O C O C
```

It is possible to calculate the probability of each individual order,
and then add them up in the end.
The following Python code does
this:

```python
import operator
from functools import reduce


def eat_one_piece(chocolate_box, choco):
    """
    Determines the probability of eating a certain
    type of chocolate and removes that piece from the box.
    :param chocolate_box: dict of {type: count}.
    :param choco: type of chocolate to eat ('O' or 'C').
    :return (float, chocolate_box): probability and updated box.
    """
    total_chocos = sum(chocolate_box.values())
    prob = chocolate_box[choco] / total_chocos
    if chocolate_box[choco]:
        chocolate_box[choco] -= 1
    return prob, chocolate_box


def order_event(chocolate_box, seq):
    """Determines probability of order of chocos.
    :param chocolate_box: dict of {type: count}.
    :param seq: string
    :return float
    """
    prob = []
    for choco in seq:
        p, chocolate_box = eat_one_piece(chocolate_box, choco)
        prob.append(p)
    return reduce(operator.mul, prob, 1)


if __name__ == '__main__':
    total_prob = 0
    for seq in ['CCOO', 'OCCO', 'OOCC', 'COOC', 'COCO', 'OCOC']:
        # start every ordering from a full box
        chocolate_box = {'O': 6, 'C': 4}
        ps = order_event(chocolate_box, seq)
        # .12g prints 12 significant digits
        print('- {}: {:.12g}'.format(seq, ps))
        total_prob += ps
    print('Result {:.12g}'.format(total_prob))
```

Here is the output:

```
- CCOO: 0.0714285714286
- OCCO: 0.0714285714286
- OOCC: 0.0714285714286
- COOC: 0.0714285714286
- COCO: 0.0714285714286
- OCOC: 0.0714285714286
Result 0.428571428571
```
This output is expected: each ordering has the same probability as the one computed above, 1/14. The draws are not independent, since each piece eaten changes what is left in the box, but every ordering multiplies the same numerators (6, 5, 4, 3) over the same denominators (10, 9, 8, 7). So the order in which you eat the chocolates does not matter, as long as exactly 2 of them have the coconut filling, and the total is 6 x 1/14 = 3/7, or about 43%.
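
The total can also be cross-checked by counting combinations instead of enumerating orderings. The sketch below is not part of the original submission: it assumes Python 3.8+ for `math.comb`, and the helper name `prob_exactly_k` is made up for illustration. It counts the ways to pick exactly 2 of the 4 coconut pieces and 2 of the 6 orange pieces, out of all the ways to pick 4 of the 10 pieces.

```python
from fractions import Fraction
from math import comb


def prob_exactly_k(box, kind, k, draws):
    """Probability that exactly k of the `draws` pieces eaten are of `kind`.

    box is a dict of {type: count}; pieces are eaten without replacement.
    (Hypothetical helper, named here only for illustration.)
    """
    total = sum(box.values())
    others = total - box[kind]
    # ways to choose k pieces of `kind` times ways to choose the remaining pieces
    favourable = comb(box[kind], k) * comb(others, draws - k)
    return Fraction(favourable, comb(total, draws))


p = prob_exactly_k({'O': 6, 'C': 4}, 'C', 2, 4)
print(p, float(p))
```

The exact result is 3/7, which agrees with the total of 0.428571428571 obtained by summing the six orderings.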

### Question 3 - Given the table users:

Table "users"

| Column      | Type      |
|-------------|-----------|
| id          | integer   |
| username    | character |
| email       | character |
| city        | character |
| state       | character |
| zip         | integer   |
| active      | boolean   |

construct a query to find the top 5 states with the highest number of
active users. Include the number for each state in the query result.
Example result:

| state      | num_active_users |
|------------|------------------|
| New Mexico | 502              |
| Alabama    | 495              |
| California | 300              |
| Maine      | 201              |
| Texas      | 189              |

The following SQL command does this:

```sql
select state, count(id) as num_active_users from users
where active = true
group by state
order by num_active_users desc
limit 5
```

### Question 4 - Define a function first_unique

that takes a string as input and returns the first non-repeated (unique)
character in the input string. If there are no unique characters return
None. Note: Your code should be in Python.

The solution below works as intended. It is somewhat verbose but it accounts
for all the cases presented. Since I am using dictionaries, it has
a constant look-up time. I only need two passes: one over the string
when I am building the dict and a second over the dict, choosing which
element to return, if any. This means the algorithm takes at most about
2N steps, which is O(N). In practice the second pass is shorter, because
the dict can only have as many keys as there are distinct symbols. If we
only consider the lowercase alphabet this would be 26 keys max. So the second
iteration, through the keys, is cheap in both time and space.
A very
long string would add some cost for storing the large indexes and counts.
I could have done this with nested for loops, but that would
mean a complexity of O(N^2), which is much worse.

```python
def first_unique(string):
    # Map each character to [count, index of its last occurrence].
    letter = {}
    for index, char in enumerate(string):
        if char not in letter:
            letter[char] = [1, index]
        else:
            letter[char] = [letter[char][0] + 1, index]

    # Dicts keep insertion order (Python 3.7+), so the first key with
    # a count of 1 is the first unique character in the string.
    for key, value in letter.items():
        if value[0] == 1:
            return key

    # No unique character (this also covers the empty string).
    return None


print(first_unique('aabbcdd123'))
print(first_unique('a'))
print(first_unique('112233'))
```
The output will be:

```
> python first_unique.py
c
a
None
```

### Question 5 - What are underfitting and overfitting

in the context of Machine Learning? How might you balance them?

Overfitting happens when a model learns the detail and noise in the
training data to the extent that it negatively impacts the performance
of the model on new data. This means that the noise or random fluctuations
in the training data are picked up and learned as concepts by the model.
Underfitting refers to a model that can neither model the training data
nor generalize to new data. An underfit machine learning model is not a
suitable model, and this will be obvious because it performs poorly
even on the training data.
Ideally we want a sweet spot between underfitting and overfitting.
So you want the machine to learn the training set rather well, but not so
well that it cannot adapt to new data sets.
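
The trade-off can be seen on a toy problem. The sketch below is an illustration added here, not part of the original answer: it uses k-nearest-neighbour regression as a stand-in model, with k as the knob that controls complexity, and an alternating +/-0.3 offset as a deterministic stand-in for noise. The function names `knn_predict` and `mse` are made up for this example.

```python
import math


def knn_predict(train, x, k):
    """Average the y values of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN model on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


# Training set: sin(x) plus an alternating +/-0.3 offset standing in for noise.
train = [(i / 10, math.sin(i / 10) + (0.3 if i % 2 == 0 else -0.3))
         for i in range(20)]
# Unseen data: noise-free points halfway between the training points.
test = [((i + 0.5) / 10, math.sin((i + 0.5) / 10)) for i in range(19)]

results = {}
for label, k in [('overfit', 1), ('balanced', 4), ('underfit', len(train))]:
    # store (training error, error on unseen data) for each model
    results[label] = (mse(train, train, k), mse(train, test, k))
    print('{:<9} k={:<2} train MSE={:.4f}  test MSE={:.4f}'.format(
        label, k, *results[label]))
```

With this data, k = 1 reproduces the training set exactly (training error 0) yet does worse on the unseen points than k = 4, which averages the noise away. With k = 20 the model predicts the same average everywhere, so it cannot follow the curve at all.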

Possible causes of underfitting:

* model is too simple
* not enough features
* bad choice of parameters

Possible causes of overfitting:

* too few data points
* too many features
* data has noise

Before answering the final question, insert a job description for a
data analyst position of your choice!

Your answer for Question 6 should be targeted to the company/job-description you chose.

### Question 6 - If you were to start your data analyst position today,

what would be your goals a year from now?



```
Junior Python Developer

Are you an up and coming Python or Golang Developer?

Do you enjoy building complex algorithms and storing huge amounts of data? If so, please read on!

Based in the heart of Orange County, Irvine, just a few steps from all the action, next to great shopping and dining, We are a high growth Cloud Technology / Content Streaming and Distribution company that enables users to access content, data, and services in real-time!

We are currently seeking a talented junior - mid Software Engineer with strong Python skills who can help develop solutions for our cloud services to support our growth as we sign up new customers and add millions of users.
Top Reasons to Work with Us
1. Custom work set - build your dream computer, tools or hardware setup
2. Exceptional benefits and flexible paid time off
3. 
Opportunity to see your ideals turn into code
What You Will Be Doing
- Creating tools and solving challenges related to scaling and accommodating millions of users
- Developing network side tools to accommodate and handle load, improve performance
- Building cloud technologies and improving performance
What You Need for this Position
- Understanding of Python and good knowledge of various Python Libraries, API's and toolkits
- Must be able to work with adult content. You will have limited exposure
- 1+ years of professional experience
- Know how to scale.
- Working knowledge of SQL
What's In It for You
- Strong Base Salary ($80,000 - $110,000 DOE)
- 401k Matching and Bonus!
- Projected company growth of over 10 times in the next 2-3 Years
- A new product that is revolutionizing distribution of high-demand content
- Solve challenging problems for a platform that is already serving a growing user base.

```
A year from now I would like to be a mid-senior Python developer who builds his own tools
to perform big data analysis for large companies. I would like to be designing algorithms
and prediction models, eventually moving into the Machine Learning and Deep Learning domains
and using that knowledge to develop better methods to extract and analyse data. I want to be able
to build any sort of web app with a Python back-end. I would also like to be a project manager
for some project within the company, where I would have a team to lead.

The work I am currently doing will serve as a basis for the position I am applying for.
I have been working with Python for the last 2 years, creating tools to analyse millions of
data points and using these tools to build prediction models for future events
and increase the company's profits.
The tools I have created are ready to handle heavy loads and
future growth in users. They store data in the cloud rather than locally, so that in case of a security
breach or other accident, information is not suddenly lost.

I am definitely looking forward to turning my ideas into code and producing results with it. Further
enriching my knowledge of Python and data analysis is a big plus of this position.


--------------------------------------------------------------------------------
/first_unique.py:
--------------------------------------------------------------------------------
# def first_unique1(string):
#     if len(string) == 1:
#         return string
#     if string.upper().isupper():
#         for index,char in enumerate(string):
#             if index != 0 and index != len(string)-1:
#                 if char != string[index+1] and char != string[index-1]:
#                     return char
#                     break
#             elif index == 0 and char != string[index+1]:
#                 return char
#                 break
#             elif index == len(string)-1 and char != string[index-1]:
#                 return char
#                 break
#             elif index == len(string)-1:
#                 return -1
#     else:
#         return None

def first_unique(string):
    # Map each character to [count, index of its last occurrence].
    letter = {}
    for index, char in enumerate(string):
        if char not in letter:
            letter[char] = [1, index]
        else:
            letter[char] = [letter[char][0] + 1, index]

    # Dicts keep insertion order (Python 3.7+), so the first key with
    # a count of 1 is the first unique character in the string.
    for key, value in letter.items():
        if value[0] == 1:
            return key

    # No unique character (this also covers the empty string).
    return None


print(first_unique('aabbcdd123'))
print(first_unique('a'))
print(first_unique('112233'))
--------------------------------------------------------------------------------