├── 2020-12-13 DataCamp Python Programmer #189,875.pdf
├── 2020-12-18 DataCamp Data Scientist with Python #190,835.pdf
├── 2020-12-29 DataCamp Data Analyst with Python #192,564.pdf
├── 2021-03-16 AI4I-Foundation-in-AI-Certificate-AI-MakerSpace.pdf
├── 2021-04-25 DataCamp Data Engineer with Python #214,688.pdf
├── 2021-05-09 DataCamp Machine Learning Scientist with Python #218,190.pdf
├── AI4I-1-Q-Introduction_to_Python_Quiz.txt
├── AI4I-2-Q-Libraries_and_Data_Manipulation_Quiz.txt
├── AI4I-3-Q-Exploratory_Data_Analysis_Quiz.txt
├── AI4I-4-Q-Statistical_Thinking_Quiz.txt
├── AI4I-5-Q-Supervised_Learning_Quiz.txt
├── AI4I-6-Q-Unsupervised_Learning_Quiz.txt
├── AI4I-7-Q-Deep_Learning_Quiz.txt
├── AI4I-8-Q-Other_Languages_and_Tools_to_Learn_Quiz.txt
├── AI4I-9-Q-Data_Science_Project_Lifecycle_Quiz.txt
├── AI4I_Data_Science_Project_Lifecycle.pdf
└── README.md

/2020-12-13 DataCamp Python Programmer #189,875.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JNYH/AI4I_Data_Science_Project_Lifecycle/ab4b752d63f71d63f2fdec0b0fcfcc2190ed92ec/2020-12-13 DataCamp Python Programmer #189,875.pdf
--------------------------------------------------------------------------------
/2020-12-18 DataCamp Data Scientist with Python #190,835.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JNYH/AI4I_Data_Science_Project_Lifecycle/ab4b752d63f71d63f2fdec0b0fcfcc2190ed92ec/2020-12-18 DataCamp Data Scientist with Python #190,835.pdf
--------------------------------------------------------------------------------
/2020-12-29 DataCamp Data Analyst with Python #192,564.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JNYH/AI4I_Data_Science_Project_Lifecycle/ab4b752d63f71d63f2fdec0b0fcfcc2190ed92ec/2020-12-29 DataCamp Data Analyst with Python #192,564.pdf
--------------------------------------------------------------------------------
/2021-03-16 AI4I-Foundation-in-AI-Certificate-AI-MakerSpace.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JNYH/AI4I_Data_Science_Project_Lifecycle/ab4b752d63f71d63f2fdec0b0fcfcc2190ed92ec/2021-03-16 AI4I-Foundation-in-AI-Certificate-AI-MakerSpace.pdf
--------------------------------------------------------------------------------
/2021-04-25 DataCamp Data Engineer with Python #214,688.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JNYH/AI4I_Data_Science_Project_Lifecycle/ab4b752d63f71d63f2fdec0b0fcfcc2190ed92ec/2021-04-25 DataCamp Data Engineer with Python #214,688.pdf
--------------------------------------------------------------------------------
/2021-05-09 DataCamp Machine Learning Scientist with Python #218,190.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JNYH/AI4I_Data_Science_Project_Lifecycle/ab4b752d63f71d63f2fdec0b0fcfcc2190ed92ec/2021-05-09 DataCamp Machine Learning Scientist with Python #218,190.pdf
--------------------------------------------------------------------------------
/AI4I-1-Q-Introduction_to_Python_Quiz.txt:
--------------------------------------------------------------------------------
1 | AI4I-1-Q Introduction to Python Quiz
2 | 
3 | 1. What is the output of this code?
4 | import numpy as np
5 | m = np.array([2,4,5])
6 | n = 3
7 | print(m*n)
8 | 
9 | [6 15 4]
10 | [6 12 15] <-answer
11 | [5 2 7]
12 | 
13 | 
14 | 
15 | 2. Select the code to return the following output:
16 | 42
17 | 
18 | p=14
19 | q=3
20 | print(p*q)
21 | (^answer)
22 | 
23 | p=14
24 | q=3
25 | print(p-q)
26 | 
27 | p=14
28 | q=3
29 | print(p+q)
30 | 
31 | 
32 | 3. 
Complete the following code snippet:
34 | import
35 | numpy
36 | as np
37 | np.array([19, 21, 15, 26])
38 | 
39 | 
40 | 
41 | 4. The variables x and y are defined as follows:
42 | x = np.array([[1, 3, 8], [3, 2, 6]])
43 | y = np.array([[6, 4, 2], [0, 5, 1]])
44 | What is the output of this code?
45 | print(x-y)
46 | 
47 | [[7, 7, 12], [7, 13, 2]]
48 | [[6, 10, 35], [0, 42, 1]]
49 | [[-5, -1, 6], [3, -3, 5]] <-answer
50 | 
51 | 
52 | 
53 | 5. Select the code that returns the following output:
54 | nv
55 | 
56 | x = ['n', 'z', 'v', 's', 't', 'h', 'o']
57 | print(x[-7] + x[-5])
58 | (^answer)
59 | 
60 | x = ['n', 'z', 'v', 's', 't', 'h', 'o']
61 | print(x[-4] + x[-6])
62 | 
63 | x = ['n', 'z', 'v', 's', 't', 'h', 'o']
64 | print(x[-2] + x[-5])
65 | 
--------------------------------------------------------------------------------
/AI4I-2-Q-Libraries_and_Data_Manipulation_Quiz.txt:
--------------------------------------------------------------------------------
1 | AI4I-2-Q Libraries and Data Manipulation Quiz
2 | 
3 | All of the questions in this section refer to the scenario below.
4 | 
5 | You are working as a Data Scientist in an organization focusing on global health. Your manager has asked you to analyze a dataset to study diabetes in a city in a developing country.
6 | 
7 | The first dataset you receive is a text file. When you open the file in a text editor, you see the following:
8 | 
9 | patient_id#Pregnancies#Glucose#BloodPressure#SkinThickness#Insulin#DiabetesPedigreeFunction#Outcome
10 | 163#0#114#80#34#285#0.167#0
11 | 348#3#116#0#0#0#0.187#0
12 | 395#4#158#78#0#0#0.803#1
13 | 187#8#181#68#36#495#0.615#1
14 | 
15 | 
16 | 
17 | Assume the following library imports:
18 | import numpy as np
19 | import pandas as pd
20 | import matplotlib.pyplot as plt
21 | 
22 | 
23 | 1. 
Complete the code to read in the data using Pandas such that the DataFrame can be displayed as:
24 | 
25 |             Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  DiabetesPedigreeFunction  Outcome
26 | patient_id
27 | 163                   0      114             80             34      285                     0.167        0
28 | 348                   3      116              0              0        0                     0.187        0
29 | 395                   4      158             78              0        0                     0.803        1
30 | 187                   8      181             68             36      495                     0.615        1
31 | 342                   1       95             74             21       73                     0.673        0
32 | 
33 | Complete the following code (2 answers expected)
34 | df_main = pd.read_csv(filename,
35 | sep
36 | ="#",
37 | index_col
38 | =0)
39 | 
40 | 
41 | 
42 | 2. Once you have loaded the data, you proceed to do EDA (Exploratory Data Analysis) to gain a better understanding of the data.
43 | As part of this EDA, you decide to create a box-and-whisker plot to look at the Glucose levels of the patients by their diabetic condition.
44 | Assume that the DataFrame is saved as the variable df. Fill in the blank below to display the plot.
45 | 
46 | ________________________
47 | plt.xlabel('Diabetic Condition')
48 | plt.ylabel('Glucose Level')
49 | plt.show()
50 | 
51 | df.boxplot(column = ['Glucose'], by=['Outcome']) <-answer
52 | plt.boxplot(df, column = ['Glucose'], by=['Outcome'])
53 | plt.plot(df, column = ['Glucose'], type='boxplot')
54 | df.boxplot(column = ['Outcome'], by=['Glucose'])
55 | 
56 | 
57 | 
58 | 3. As you continue to explore the data, you focus your attention on the diastolic Blood Pressure readings. You realise that the raw values are not useful and decide to classify (or bin) the values according to well-known Blood Pressure categories.
59 | Fill in the blank with the appropriate function to count the number of patients based on their Blood Pressure Status.
60 | 
61 | bins = [0, 80, 90, 120, 200]
62 | bin_labels = ['Normal', 'High Blood Pressure 1', 'High Blood Pressure 2', 'Hypertensive']
63 | bp_status = __________(df_main.BloodPressure, bins=bins).value_counts()
64 | bp_status.index = bin_labels
65 | 
66 | Normal                   568
67 | High Blood Pressure 1    127
68 | High Blood Pressure 2     37
69 | Hypertensive               1
70 | Name: BloodPressure, dtype: int64
71 | 
72 | pd.bin
73 | pd.cut <-answer
74 | np.cut
75 | 
76 | 
77 | 
78 | 4. Your manager comes to you and says that some additional features have been made available about the patients. You read the file into a DataFrame with the variable name df_new_features. The first few values are shown below.
79 | 
80 |              BMI  Age
81 | patient_id
82 | 30          34.1   38
83 | 430         35.0   43
84 | 259         25.9   24
85 | 103         22.5   21
86 | 525         31.6   24
87 | 
88 | Combine the two DataFrames, taking care to match the information based on the patient_id. You also decide to sort the DataFrame by ascending Age and BMI.
89 | The resultant DataFrame should look like the following:
90 | 
91 |             Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  DiabetesPedigreeFunction  Outcome   BMI  Age
92 | patient_id
93 | 372                   0      118             64             23       89                     1.731        0   0.0   21
94 | 146                   0      102             75             23        0                     0.572        0   0.0   21
95 | 61                    2       84              0              0        0                     0.304        0   0.0   21
96 | 439                   1       97             70             15        0                     0.147        0  18.2   21
97 | 527                   1       97             64             19       82                     0.299        0  18.2   21
98 | 
99 | Complete the following code (2 answers expected)
100 | combined = df.
101 | merge
102 | (df_new_features, on='patient_id').
103 | sort_values
104 | (by=['Age','BMI'])
105 | 
106 | 
107 | 
108 | 5. After a discussion with your manager, you decide to focus the study on patients who are of working age. Filter the dataset to keep only patients aged 65 years (the country's official retirement age) or younger. Sort the DataFrame by the patient_id.
109 | The output DataFrame should look like the following:
110 | 
111 |             Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  DiabetesPedigreeFunction  Outcome   BMI  Age
112 | patient_id
113 | 372                   0      118             64             23       89                     1.731        0   0.0   21
114 | 146                   0      102             75             23        0                     0.572        0   0.0   21
115 | 61                    2       84              0              0        0                     0.304        0   0.0   21
116 | 439                   1       97             70             15        0                     0.147        0  18.2   21
117 | 527                   1       97             64             19       82                     0.299        0  18.2   21
118 | 
119 | Which statement creates this result?
120 | working_age = combined.Age <= 65; combined_filtered = combined[working_age]
121 | combined_filtered = combined[combined['Age'] <= 65].sort_index()
122 | combined_filtered = combined.loc[combined.Age <= 65].sort_index()
123 | Any of the above <-answer
124 | 
--------------------------------------------------------------------------------
/AI4I-3-Q-Exploratory_Data_Analysis_Quiz.txt:
--------------------------------------------------------------------------------
1 | AI4I-3-Q Exploratory Data Analysis Quiz
2 | 
3 | All of the questions in this section refer to the same scenario.
4 | 
5 | You are a marketing analyst for a candy company. You have been given a dataset of survey data about candy and have been asked to analyze it.
6 | 
7 | Dataset
8 | AI4I-3 Exploratory Data Analysis Quiz_DatasetDownload
9 | 
10 | Imports
11 | Assume the following imports.
12 | import numpy as np
13 | import pandas as pd
14 | import matplotlib.pyplot as plt
15 | import seaborn as sns
16 | import scipy
17 | from sklearn.feature_extraction.text import CountVectorizer
18 | 
19 | When you import the csv, use pd.read_csv('candyhierarchy2017 (1).csv', encoding="ISO-8859-1").
20 | 
21 | df = pd.read_csv('candyhierarchy2017 (1).csv', encoding="ISO-8859-1")
22 | df.info()
23 | 
24 | RangeIndex: 2460 entries, 0 to 2459
25 | Columns: 120 entries, Internal ID to Click Coordinates (x, y)
26 | dtypes: float64(4), int64(1), object(115)
27 | memory usage: 2.3+ MB
28 | 
29 | 
30 | 
31 | 1. How many rows are there in the dataset?
32 | There are
33 | 2460
34 | rows in the dataset.
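As a minimal sketch for question 1, the row count that `df.info()` reports can also be read directly from the DataFrame. The small DataFrame below is a made-up stand-in for the candy survey file, so its values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the candy survey data (values are made up).
df = pd.DataFrame({
    "Internal ID": [1, 2, 3],
    "Q3: AGE": [25.0, None, 40.0],
})

# df.shape is (rows, columns); len(df) gives the same row count.
n_rows = df.shape[0]
print(n_rows, len(df))
```

On the real survey file the same expression should agree with the RangeIndex line printed by `df.info()`.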
35 | 
36 | 
37 | 
38 | 
39 | print(df.isnull().sum())
40 | Internal ID                    0
41 | Q1: GOING OUT?               110
42 | Q2: GENDER                    41
43 | Q3: AGE                       84
44 | Q4: COUNTRY                   64
45 |                             ...
46 | Q12: MEDIA [Daily Dish]     2375
47 | Q12: MEDIA [Science]        1098
48 | Q12: MEDIA [ESPN]           2361
49 | Q12: MEDIA [Yahoo]          2393
50 | Click Coordinates (x, y)     855
51 | Length: 120, dtype: int64
52 | 
53 | 
54 | 2. You have decided to do some analysis around the age of the respondents. After inspecting the data, you notice that there are some missing values. How many missing values does the column ‘Q3: Age’ have in this dataset?
55 | There are
56 | 84
57 | missing values in the column.
58 | 
--------------------------------------------------------------------------------
/AI4I-4-Q-Statistical_Thinking_Quiz.txt:
--------------------------------------------------------------------------------
1 | AI4I-4-Q Statistical Thinking Quiz
2 | 
3 | All of the questions in this section refer to the scenario below.
4 | 
5 | Your manager has asked you to analyze some data from the HR department.
6 | The table below shows values from some of the features available.
7 | 
8 |     MarriedID  GenderID  DeptID  PayRate  EngagementSurvey  EmpSatisfaction  SpecialProjectsCount  DaysLateLast30  Termd
9 |  0        1.0       0.0     1.0    28.50              2.04              2.0                   6.0             0.0    0.0
10 | 1        0.0       1.0     1.0    23.00              5.00              4.0                   4.0             0.0    0.0
11 | 2        0.0       1.0     1.0    29.00              3.90              5.0                   5.0             0.0    0.0
12 | 3        1.0       0.0     1.0    21.50              3.24              3.0                   4.0             NaN    1.0
13 | 4        0.0       0.0     1.0    16.56              5.00              3.0                   5.0             0.0    0.0
14 | 
15 | Imports
16 | Assume the following library imports.
17 | import numpy as np
18 | import pandas as pd
19 | import matplotlib.pyplot as plt
20 | 
21 | 
22 | 1. One of the first things you want to look at is the terminated employees (Termd column) and how engaged they are compared to those still under employment. You are curious about the overall data spread (exact data points are not necessary). Which plot would you use to show this?
23 | 
24 | Bee-swarm plot
25 | Box plot <-answer
26 | Histogram
27 | None of these are suitable
28 | 
29 | 
30 | 
31 | 2. You then want to look at what the employees are earning. To do that, you plot the ECDF (Empirical Cumulative Distribution Function) graph for the hourly pay rate. From the chart below, what proportion of the employees are earning $40 or less per hour?
32 | Is it:
33 | 
34 | 22%
35 | 30%
36 | 70% <-answer
37 | I cannot tell from the chart
38 | 
39 | 
40 | 
41 | 3. You want to study if there exists a correlation between the pay rate and the employee engagement. To improve your confidence in the calculation, you decide to apply bootstrapping before calculating the replicate. Complete the function below:
42 | 
43 | def draw_bs_pairs(x, y, func, size=1):
44 |     # Set up array of indices
45 |     inds = np.arange(len(x))
46 |     # Initialize the array of replicates
47 |     bs_rep = np.empty(size)
48 |     for i in range(size):
49 |         bs_inds = np.random.
50 | choice
51 | (inds, size = len(inds))
52 |         bs_x = x[bs_inds]
53 |         bs_y = y[bs_inds]
54 |         bs_rep[i] = func(bs_x, bs_y)
55 |     return bs_rep
56 | 
57 | 
58 | 
59 | 4. The full dataset has 103 features in total. You decide to try a few dimension reduction techniques. Why do you want to do that?
60 | 
61 | The data will be less complex.
62 | The data will require less disk space.
63 | It will take less computation time to process the data.
64 | During modeling, there is a lower chance of overfitting.
65 | All of the above <-answer
66 | 
--------------------------------------------------------------------------------
/AI4I-5-Q-Supervised_Learning_Quiz.txt:
--------------------------------------------------------------------------------
1 | AI4I-5-Q: Supervised Learning Quiz
2 | 
3 | Q1. An overfitted model is typically characterized by …
4 | High Bias, High Variance
5 | High Bias, Low Variance
6 | Low Bias, High Variance <-answer
7 | Low Bias, Low Variance
8 | 
9 | 
10 | Q2. 
What kind of error is caused by randomness or natural variation in the data generated by a system?
11 | Bias Error
12 | Variance Error <-answer
13 | Irreducible Error
14 | 
15 | 
16 | Q3. Which two metrics are commonly used to evaluate Classification models?
17 | Accuracy
18 | Precision <-answer
19 | Coefficient of Determination
20 | Recall <-answer
21 | 
22 | 
23 | Q4. Which parameter can we add to VotingClassifier to use soft voting to predict the class labels?
24 | voting='soft' <-answer
25 | predict_using='soft'
26 | soft_voting=True
27 | None. VotingClassifier only allows for hard voting.
28 | 
--------------------------------------------------------------------------------
/AI4I-6-Q-Unsupervised_Learning_Quiz.txt:
--------------------------------------------------------------------------------
1 | AI4I-6-Q: Unsupervised Learning Quiz
2 | 
3 | 1. What are the main drawbacks when using dimension reduction techniques on your data?
4 | Some information is lost, possibly degrading the performance of the subsequent ML algorithms
5 | Transformed features are hard to interpret
6 | It can be computationally expensive
7 | It adds complexity to your ML pipelines
8 | All of the above <-answer
9 | 
10 | 
11 | 2. Imagine performing PCA on a 1000-dimension dataset and you set the explained variance to 95%. How many dimensions will the resulting dataset have?
12 | Either 2 or 3 dimensions
13 | The dataset cannot be reduced
14 | Trick Question! Depends on the dataset <-answer
15 | 
16 | 
17 | 3. How should you NOT select the optimum number of clusters k in a K-Means Clustering technique?
18 | a) Plot inertia vs number of clusters and try to identify the ‘elbow joint’
19 | b) Look at the loss value for each of the k values and select the one with the lowest loss
20 | c) Plot the silhouette score against the number of clusters. The optimum k value should be near the peak
21 | B only <-answer
22 | 
23 | 
24 | 4. 
Complete the code snippet below to perform hierarchical clustering on the dataset. Assume that scipy.cluster has been imported for you.
25 | my_cluster = [Blank](dataset, method='complete')
26 | hierarchy.linkage <-answer
27 | linkage
28 | agglomerative
29 | hierarchy
30 | 
31 | 
32 | 5. Which statement about Non-negative Matrix Factorization (NMF) is not true?
33 | It is a dimension reduction technique
34 | NMF models are easy to interpret
35 | NMF can handle real numbers as input features <-answer
36 | None of the above
37 | 
--------------------------------------------------------------------------------
/AI4I-7-Q-Deep_Learning_Quiz.txt:
--------------------------------------------------------------------------------
1 | AI4I-7-Q: Deep Learning Quiz
2 | 
3 | You are trying to build a deep learning classifier that can distinguish between the classes in the MNIST_fashion dataset.
4 | 
5 | Access the answer key with this link https://colab.research.google.com/drive/1_hGdZRxwlEQzal8fw-2HY-bPa2I9jmg2?usp=sharing
6 | 
7 | Create a notebook in Google Colaboratory. Install and import TensorFlow 2.0 and the required libraries. Load the MNIST_fashion dataset. What is the width of each picture in pixels?
8 | The width of each picture is
9 | 28
10 | px.
11 | 
12 | 
13 | 
14 | Build a CNN to classify the different classes in the dataset. Use what you learnt about convolution layers, activation functions, and dropout.
15 | What activation should you use in the final layer?
16 | softmax
17 | should be used in the final activation layer.
18 | 
19 | 
20 | 
21 | What is the next step after the forward pass when you are training a neural network? (Hint: You only need to answer with one word)
22 | The next step is
23 | backpropagation
24 | .
25 | 
26 | 
27 | 
28 | If you have a high training accuracy but a low validation accuracy, what is likely to be happening? (Hint: You only need to answer with one word)
29 | This is
30 | overfitting
31 | .
32 | 
33 | 
34 | 
35 | Fill in the blank.
36 | When using gradient descent, we try to get as close to the
37 | global
38 | minimum as possible (Hint: You only need to answer with one word)
39 | 
--------------------------------------------------------------------------------
/AI4I-8-Q-Other_Languages_and_Tools_to_Learn_Quiz.txt:
--------------------------------------------------------------------------------
1 | AI4I-8-Q Other Languages and Tools to Learn Quiz
2 | 
3 | This is a series of quiz questions to test your basic understanding of SQL, Shell and Environments.
4 | 
5 | 1. With SQL, how do you select all the columns from a table named SALES?
6 | SELECT ALL FROM SALES
7 | SELECT * FROM SALES <-answer
8 | SELECT SALES
9 | SELECT * SALES
10 | 
11 | 
12 | 2. Which SQL keywords specify the sorting direction of the result set retrieved with the ORDER BY clause?
13 | ASC <-answer
14 | REVERSE
15 | SORT
16 | DESC <-answer
17 | 
18 | 
19 | 3. You are in the shell. We have a file called ‘sample’. We want to highlight only the lines that do not contain the character ‘a’, but the result should be in reverse order. We then want to write the resulting output to a file called ‘myoutput’.
20 | 
21 | What commands do you issue to the shell? (Use the cat, grep and sort commands to help you)
22 | grep a sample -v | sort -r >> myoutput <-answer
23 | grep a sample | sort >> myoutput
24 | grep a sample -v | sort -r | myoutput
25 | grep a sample | sort -r >> myoutput
26 | 
--------------------------------------------------------------------------------
/AI4I-9-Q-Data_Science_Project_Lifecycle_Quiz.txt:
--------------------------------------------------------------------------------
1 | AI4I-9-Q Data Science Project Lifecycle Quiz
2 | 
3 | 
4 | 1. Which metric is not appropriate to measure a classification model?
5 | RMSE <-answer
6 | F1
7 | Accuracy
8 | Silhouette Score
9 | 
10 | 
11 | 2. Which metric is appropriate to measure a classification model built with an imbalanced dataset?
12 | F1 <-answer
13 | RMSE
14 | Accuracy
15 | Silhouette Score
16 | 
17 | 
18 | 3. Which is not a main principle of SCRUM methodology?
19 | Frequent sprints to get fast feedback from stakeholders
20 | Defining project goal upfront and not deviating from it. <-answer
21 | Product owners provide the direction on what product features to build
22 | 
--------------------------------------------------------------------------------
/AI4I_Data_Science_Project_Lifecycle.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JNYH/AI4I_Data_Science_Project_Lifecycle/ab4b752d63f71d63f2fdec0b0fcfcc2190ed92ec/AI4I_Data_Science_Project_Lifecycle.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # AI4I_Data_Science_Project_Lifecycle
2 | This is a memo to share what I have learnt in Data Science Project Lifecycle, capturing the learning objectives as well as my personal notes. The course is by Infocomm Media Development Authority (IMDA) AI For Industry (AI4I) with 24 slides compiled by LIM Tern Poh.
3 | 
4 | The mission of AI Singapore is to anchor deep national capabilities in Artificial Intelligence, thereby creating social and economic impacts, grow local talent, build an AI ecosystem and put Singapore on the world map.
5 | 
6 | AI for Industry (AI4I) is a fully online programme to help learners PLUS-skill themselves and learn data science, machine learning, artificial intelligence and visualization in Python. The programme is hosted on the AI Makerspace online platform. DataCamp is used as a resource to support the learning required to complete the programme.
7 | 
8 | The total estimated course learning time is at least 140 hours, covering at least 35 lessons and 9 quizzes, with content ranging from basic Python programming to Machine Learning toolkits.
9 | 
10 | For more information: https://www.aisingapore.org/talentdevelopment/ai4i/
11 | 
--------------------------------------------------------------------------------
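The pairs bootstrap from question 3 of the Statistical Thinking quiz can be sketched in runnable form as follows. The `pearson_r` helper and the synthetic pay-rate and engagement arrays are assumptions added for illustration; they are not part of the quiz or the HR dataset.

```python
import numpy as np


def draw_bs_pairs(x, y, func, size=1):
    """Return `size` pairs-bootstrap replicates of func(x, y)."""
    x = np.asarray(x)
    y = np.asarray(y)
    # Resample row indices so each (x, y) pair stays together.
    inds = np.arange(len(x))
    bs_rep = np.empty(size)
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_rep[i] = func(x[bs_inds], y[bs_inds])
    return bs_rep


def pearson_r(x, y):
    """Hypothetical helper: Pearson correlation coefficient of x and y."""
    return np.corrcoef(x, y)[0, 1]


# Synthetic stand-in data (made up, not the HR dataset from the quiz).
np.random.seed(42)
pay_rate = np.random.normal(30, 8, size=50)
engagement = 3.5 + 0.02 * pay_rate + np.random.normal(0, 0.8, size=50)

reps = draw_bs_pairs(pay_rate, engagement, pearson_r, size=1000)
# A 95% percentile interval for the correlation coefficient.
print(np.percentile(reps, [2.5, 97.5]))
```

Resampling indices rather than the two arrays separately is the design choice that matters here: it preserves the pairing between each employee's pay rate and engagement score, which a correlation estimate depends on.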