├── 1-Problem-Solving-with-Advanced-Analytics
│   ├── 1.1-Predicting-Diamond-Price.ipynb
│   └── 1.2-Predicting-Catalog-Demand.ipynb
├── 2-Creating-an-Analytical-Dataset
│   ├── 2.1-Data-Cleanup.ipynb
│   └── 2.2-Create-Report-from-Database.ipynb
├── 3-Data-Visualization-in-Tableau
│   └── 3.1-Visualize-Movie-Data.ipynb
├── 4-Classification-Models
│   └── 4.1-Predicting-Default-Risk.ipynb
├── 5-AB-Testing
│   └── 5.1-AB-Test-a-New-Menu-Launch.ipynb
├── 6-Time-Series-Forecasting
│   └── 6.1-Forecast-Video-Game-Sales.ipynb
├── 7-Segmentation-and-Clustering
│   └── 7.1-Combining-Predictive-Techniques.ipynb
└── README.md

/1-Problem-Solving-with-Advanced-Analytics/1.1-Predicting-Diamond-Price.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "collapsed": true
7 | },
8 | "source": [
9 | "# Diamond Prices"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "## Understanding the Model\n",
17 | "\n",
18 | "#### 1. According to the linear model provided, if a diamond is 1 carat heavier than another with the same cut and clarity, how much more should we expect to pay? Why?\n",
19 | "\n",
20 | "A diamond 1 carat heavier than another costs **$8,413 more**, with the cut quality and the clarity being the same, because the model multiplies carat weight by a coefficient of 8,413.\n",
21 | "\n",
22 | "#### 2. If you were interested in a 1.5 carat diamond with a Very Good cut (represented by a 3 in the model) and a VS2 clarity rating (represented by a 5 in the model), how much would the model predict you should pay for it?\n",
23 | "\n",
24 | "Price = _- 5,269 + 8,413 x Carat + 158.1 x Cut + 454 x Clarity_ \n",
25 | "         _= - 5269 + 8413 x (1.5) + 158.1 x (3) + 454 x (5)_ \n",
26 | "         _= $10094.8_\n",
27 | " \n",
28 | "Using the linear regression model, it would cost **$10,094.80**."
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "## Visualize the Data\n",
36 | "\n",
37 | "#### 1. Plot 1 - Plot the data for the diamonds in the database, with carat on the x-axis and price on the y-axis. \n",
38 | "\n",
39 | "![Image of Plot 1](https://user-images.githubusercontent.com/14093302/29494458-396a130c-85dd-11e7-9fc8-86c8ec0cad46.png) \n",
40 | "\n",
41 | "#### 2. Plot 2 - Plot the data for the diamonds for which you are predicting prices with carat on the x-axis and predicted price on the y-axis. \n",
42 | "\n",
43 | "![Image of Plot 2](https://user-images.githubusercontent.com/14093302/29494459-399221e4-85dd-11e7-8216-b4f4fe2c837e.png)\n",
44 | " \n",
45 | "#### 3. What strikes you about this comparison? After seeing this plot, do you feel confident in the model’s ability to predict prices? \n",
46 | "\n",
47 | "* The relationship between price and carat is less obvious when the diamond is below 0.5 carat, because the predicted price can often fall below $0, which is not possible\n",
48 | "\n",
49 | "* It also predicts a higher price for diamonds larger than 3 carats\n",
50 | " \n",
51 | "* There should be more factors in determining the price of a diamond\n",
52 | "\n",
53 | "* The model shows a strong correlation between carat and price when the carat is between 0.5 and 2"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "## Make a Recommendation\n",
61 | "\n",
62 | "#### 1. What price do you recommend the jewelry company to bid? Please explain how you arrived at that number. &#13;
HINT: The number should be 7 digits.\n", 63 | "\n", 64 | "* Each diamond price is predicted using linear regression model\n", 65 | "\n", 66 | "* The predicted prices are summed up\n", 67 | "\n", 68 | "* Apply a 70% on the summation for all predicted prices for bidding price\n", 69 | "\n", 70 | "The bid price is recommended to be **$8,213,466**." 71 | ] 72 | } 73 | ], 74 | "metadata": { 75 | "kernelspec": { 76 | "display_name": "Python 3", 77 | "language": "python", 78 | "name": "python3" 79 | }, 80 | "language_info": { 81 | "codemirror_mode": { 82 | "name": "ipython", 83 | "version": 3 84 | }, 85 | "file_extension": ".py", 86 | "mimetype": "text/x-python", 87 | "name": "python", 88 | "nbconvert_exporter": "python", 89 | "pygments_lexer": "ipython3", 90 | "version": "3.6.2" 91 | } 92 | }, 93 | "nbformat": 4, 94 | "nbformat_minor": 2 95 | } 96 | -------------------------------------------------------------------------------- /1-Problem-Solving-with-Advanced-Analytics/1.2-Predicting-Catalog-Demand.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Predicting Catalog Demand" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Business and Data Understanding\n", 15 | "\n", 16 | "#### 1. What decisions needs to be made?\n", 17 | "\n", 18 | "The decision to be made is whether to send the catalog to these 250 new customers based on expected profit calculated.\t\n", 19 | "\n", 20 | "#### 2. What data is needed to inform those decisions?\n", 21 | "\n", 22 | "Some of the data needed to predict sales and calculate expected profit are _Customer Segment_, _Average Number of Product Purchased_, _Score_Yes_, _Margin_ and _Cost of Catalog_. " 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## Analysis, Modeling, and Validation\n", 30 | "\n", 31 | "#### 1. How and why did you select the predictor variables (see supplementary text) in your model? \n", 32 | "\n", 33 | "A linear regression study is performed on all variables against Average Sale Amount. As shown below, only Average Number of Product and Customer Segment have a p-value of less 0.05 which implies statistical significance. Scatterplots of Average Number of Product and Customer Segment versus Average Sale Amount are also plotted to study the linearity.\n", 34 | " \n", 35 | "\"Figure\n", 36 | " \n", 37 | "

Figure 1: Report for Linear Model Predictor Variables&#13;

\n", 38 | "\n", 39 | "\"Figure\n", 40 | " \n", 41 | "

Figure 2: Scatterplots of Avg Number of Products Purchased vs Avg Sale Amount&#13;

\n", 42 | "\n", 43 | "\"Figure\n", 44 | "\n", 45 | "

Figure 3: Scatterplots of Customer Segment vs Avg Sale Amount

\n", 46 | "\n", 47 | "#### 2. Explain why you believe your linear model is a good model.\n", 48 | "\n", 49 | "The Alteryx linear regression function is used to determine the strength of the linear and the statistical result shows an adjusted R-squared value of 0.8366 which is a high value. Customer Segment and Average Number of Products also have a p-value lower than 0.05, implying their statistical significance. Thus, the model is considered a good one.\n", 50 | "\n", 51 | "\"Figure\n", 52 | "\n", 53 | "

Figure 4: Report for Statistical Result

\n", 54 | "\n", 55 | "#### 3. What is the best linear regression equation based on the available data? Each coefficient should have no more than 2 digits after the decimal (ex: 1.28)\n", 56 | "\n", 57 | "Avg_Sale_Amount = \t303.46 – 149.36 x (If Type: Loyalty Club Only) + 281.84 x (If Type: Loyalty Club and Credit Card) – 245.42 x (If Type: Store Mailing List) \n", 58 | "                               + 0 x (If Type: Credit Card Only) + 66.98 x (Avg_Num_Products_Purchased)" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "## Presentation/Visualization\n", 66 | "\n", 67 | "#### 1. What is your recommendation? Should the company send the catalog to these 250 customers?\n", 68 | " \n", 69 | "The company should send the catalog to these 250 new customers.\n", 70 | "\n", 71 | "#### 2. How did you come up with your recommendation? (Please explain your process so reviewers can give you feedback on your process)\n", 72 | " \n", 73 | "Using linear regression model, the expected revenue from each customer is determined by multiplying expected sale amount with Score_Yes value. \n", 74 | "With a gross margin of 50%, 50% is deducted from the sum of expected revenue before the cost of catalog ($6.50) is subtracted to obtain net profit.\n", 75 | "\n", 76 | "#### 3. What is the expected profit from the new catalog (assuming the catalog is sent to these 250 customers)?\n", 77 | "\n", 78 | "Expected Profit = _(Sum of expected revenue x Gross Margin) – (Cost of Catalog x 250)_\n", 79 | "\n", 80 | "                          = _(47,225.87 x 0.5) – (6.50 x 250)_\n", 81 | "\n", 82 | "                          = _23,612.44 – 1,625_\n", 83 | "\n", 84 | "                          = _$21,987.44_" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "## Variables Distribution\n", 92 | "\n", 93 | "Variables such as _Address_, _Name_, _State_, _Customer ID_, _Store Number_ and _ZIP_ are not important predictor variables as they are either unique to each value or irrelevant in predicting the sale using common sense.\n", 94 | "\n", 95 | "_City_, _Responded to Last Catalog_ and _# Years as Customer_ might seem to be a good predictor as they are not unique ID but linear regression model showed that they are statistically insignificant.\n", 96 | "\n", 97 | "More data from category of items purchased, items turnover duration will be helpful to understand customer’s buying behavior where we can exploit it to segment our customers and customize the catalog.\n", 98 | "\n", 99 | "\"Figure\n", 100 | "\n", 101 | "

Figure 5: Distribution of each variable in the Customer List dataset

" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "## Alteryx Workflow\n", 109 | "\n", 110 | "\"Figure" 111 | ] 112 | } 113 | ], 114 | "metadata": { 115 | "kernelspec": { 116 | "display_name": "Python 3", 117 | "language": "python", 118 | "name": "python3" 119 | }, 120 | "language_info": { 121 | "codemirror_mode": { 122 | "name": "ipython", 123 | "version": 3 124 | }, 125 | "file_extension": ".py", 126 | "mimetype": "text/x-python", 127 | "name": "python", 128 | "nbconvert_exporter": "python", 129 | "pygments_lexer": "ipython3", 130 | "version": "3.6.2" 131 | } 132 | }, 133 | "nbformat": 4, 134 | "nbformat_minor": 2 135 | } 136 | -------------------------------------------------------------------------------- /2-Creating-an-Analytical-Dataset/2.1-Data-Cleanup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Cleanup" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Business and Data Understanding\n", 15 | "\n", 16 | "Pawdacity, a leading pet store chain in Wyoming, needs recommendation on where to open its 14th store. \n", 17 | "Some of the data required in order to inform this decision are _city_, _2010 census population_, _Pawdacity sales in other stores_, _competitor sales_, _household with under 18_, _land area_, _population density_ and _total families_.\n" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "## Building the Training Set\n", 25 | "\n", 26 | "By performing the select, formula, data cleansing and filter functions on 4 datasets, the averages for the variables below were obtained. Also attached below is the workflow to obtain the averages.\n", 27 | "\n", 28 | "| Column | Sum | Average |\n", 29 | "| :----------------------- | ---------:| ----------:|\n", 30 | "| Census Population | 213,862 | 19,442 |\n", 31 | "| Total Pawdacity Sales\t | 3,773,304 | 343,027.64 |\n", 32 | "| Households with Under 18 | 34,064 | 3,096.73 |\n", 33 | "| Land Area\t | 33,071 | 3,006.49 |\n", 34 | "| Population Density | 63 | 5.71 |\n", 35 | "| Total Families | 62,653 | 5,695.71 |\n", 36 | "\n", 37 | "

Table 1: Sums and Averages of Variables

" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "## Dealing with Outliers\n", 45 | "\n", 46 | "\"Figure\n", 47 | "\n", 48 | "

Figure 1: Scatterplots of Population-related variables versus Pawdacity Total Sales

\n", 49 | "\n", 50 | "Based on the 5 scatterplots above, the city of **Gillette** and **Cheyenne*** seem to be the outliers as their sales data are higher than expected. \n", 51 | "\n", 52 | "When the scatterplots are extrapolated, Cheyenne’s sales data falls within the expected range when extrapolated. \n", 53 | "Thus, Gillette would be the outlier in this case when compared against all other cities due to its greatest distance from the linear trend. \n", 54 | "Since the relationships between Gillette’s population related variables and total sales are still correlated, Gillette should be kept for analysis. " 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "## Alteryx Workflow\n", 62 | "\n", 63 | "\"Figure\n", 64 | "\n", 65 | "

Figure 2: Workflow to obtain sums and averages of variables

\n", 66 | "\n", 67 | "\n" 68 | ] 69 | } 70 | ], 71 | "metadata": { 72 | "kernelspec": { 73 | "display_name": "Python 3", 74 | "language": "python", 75 | "name": "python3" 76 | }, 77 | "language_info": { 78 | "codemirror_mode": { 79 | "name": "ipython", 80 | "version": 3 81 | }, 82 | "file_extension": ".py", 83 | "mimetype": "text/x-python", 84 | "name": "python", 85 | "nbconvert_exporter": "python", 86 | "pygments_lexer": "ipython3", 87 | "version": "3.6.2" 88 | } 89 | }, 90 | "nbformat": 4, 91 | "nbformat_minor": 2 92 | } 93 | -------------------------------------------------------------------------------- /2-Creating-an-Analytical-Dataset/2.2-Create-Report-from-Database.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Create Report from Dataset" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Total Revenue by Country\n", 15 | "\n", 16 | "![figure 1](https://user-images.githubusercontent.com/14093302/29740858-4c89b4aa-8a93-11e7-8bc8-1c900ce21da4.png)\n", 17 | "\n", 18 | "\n", 19 | "The chart above shows the total revenue generated by customers in each country over a 3 years period (2014-2016). \n", 20 | "The Pareto chart is shown through the right vertical axis. \n", 21 | "USA and Germany are the top 2 countries in terms of revenue and comprises of approximately 40% of the total revenue.\n", 22 | "\n", 23 | "**SQL query**\n", 24 | "``` mysql\n", 25 | "SELECT Customers.country, SUM(unitprice*quantity) as Totalrev FROM orderdetails, orders, customers WHERE orderdetails.orderid=orders.orderid AND orders.customerid=customers.customerid\n", 26 | "GROUP BY customers.country ORDER BY Totalrev DESC;\n", 27 | "```" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": { 33 | "collapsed": true 34 | }, 35 | "source": [ 36 | "## Total Cost of Goods by Supplier\n", 37 | "\n", 38 | "![figure 2](https://user-images.githubusercontent.com/14093302/29740857-4c860986-8a93-11e7-8d9b-e7745e887258.png)\n", 39 | "\n", 40 | "The chart above summarizes the total cost of goods supplied by each company with Plutzer, Aux joyeux and Svensk being the top 3 suppliers which supply more than $5000 worth of goods each.\n", 41 | "\n", 42 | "**SQL query**\n", 43 | "```mysql\n", 44 | "SELECT companyName, SUM((unitsinstock+unitsonorder)*products.unitprice) as Totalcost, SUM(unitsinstock+unitsonorder) as Totalunit FROM products, suppliers WHERE products.supplierid=suppliers.supplierid\n", 45 | "GROUP BY companyName ORDER BY Totalcost DESC;\n", 46 | "```" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Total Sales by Employee\n", 54 | "\n", 55 | "![figure 3](https://user-images.githubusercontent.com/14093302/29740860-4c8b9a18-8a93-11e7-9321-3138a502db35.png)\n", 56 | "\n", 57 | "The total sales generated by each employee is illutrasted above with Peacock being the employee with most sales of roughly $250,000.\n", 58 | "\n", 59 | "**SQL Query**\n", 60 | "\n", 61 | "```mysql\n", 62 | "SELECT employees.employeeID, LastName, FirstName, Title, City, SUM(unitPrice*Quantity) as Totalsales FROM orderdetails, orders, employees WHERE orderdetails.orderid=orders.orderid and orders.employeeid=employees.employeeid\n", 63 | "GROUP BY LastName ORDER BY Totalsales DESC;\n", 64 | "```" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "## 2016 Product 
Category Sales and Sales Percentage\n", 72 | "\n", 73 | "![figure4](https://user-images.githubusercontent.com/14093302/29740859-4c8a068a-8a93-11e7-8560-24405cdb1c84.png)\n", 74 | "\n", 75 | "The graph above shows 2016’s total sales and the sales percentage by product category. \n", 76 | "Beverages registered the highest sales while Grains/Cereals has the lowest sales in 2016.\n", 77 | "\n", 78 | "**SQL Query**\n", 79 | "```mysql\n", 80 | "SELECT categoryname, SUM(orderdetails.unitprice*orderdetails.quantity) as Totalsales, strftime('%Y', OrderDate) as Year, annual.Annualsales, round((SUM(orderdetails.unitprice*orderdetails.quantity)/annual.annualsales),2) as TotalsalesPct FROM categories, products, orders, orderdetails,\n", 81 | "(SELECT SUM(orderdetails.unitprice*orderdetails.quantity) as Annualsales, strftime('%Y', OrderDate) as yr FROM orders, orderdetails\n", 82 | "WHERE orderdetails.orderid=orders.orderid and yr='2016'\n", 83 | "GROUP BY yr) annual\n", 84 | "WHERE categories.categoryid=products.categoryid and products.productid=orderdetails.productid and orderdetails.orderid=orders.orderid and year='2016'\n", 85 | "GROUP BY categoryname, year \n", 86 | "ORDER BY categoryname, year;\n", 87 | "```\n" 88 | ] 89 | } 90 | ], 91 | "metadata": { 92 | "kernelspec": { 93 | "display_name": "Python 3", 94 | "language": "python", 95 | "name": "python3" 96 | }, 97 | "language_info": { 98 | "codemirror_mode": { 99 | "name": "ipython", 100 | "version": 3 101 | }, 102 | "file_extension": ".py", 103 | "mimetype": "text/x-python", 104 | "name": "python", 105 | "nbconvert_exporter": "python", 106 | "pygments_lexer": "ipython3", 107 | "version": "3.6.2" 108 | } 109 | }, 110 | "nbformat": 4, 111 | "nbformat_minor": 2 112 | } 113 | -------------------------------------------------------------------------------- /3-Data-Visualization-in-Tableau/3.1-Visualize-Movie-Data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Visualizing Movie Data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Data Cleanup and Attribute Selection\n", 15 | "\n", 16 | "* Attributes to be explored further in this project are _popularity_, _release year_, _genre_, _production company_, _budget adj_, _revenue adj_ and _gross margin_ (calculated field)\n", 17 | "* List of movies with 0 budget or revenue are removed. That leaves us with 3855 rows of data out of 10866 rows. Even though a large number of data are being excluded, the exclusion will help with the data investigation as budget and revenue are the main variables to be studied\n", 18 | "* When sorted by _popularity_, majority of movies with 0 budget or revenue are poorly ranked and should not skew the data, thus decision to remove them are justified\n", 19 | "\n", 20 | "#### Tableau Visualizations \n", 21 | "Public Profile: \n", 22 | "https://public.tableau.com/profile/kaishengteh#!/vizhome/P3_VisulaizeMovieData/Q4\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## Question 1 Story: How have movie genres changed over time?\n", 30 | "\n", 31 | "#### Page 1: \n", 32 | "![p3 q1-1](https://user-images.githubusercontent.com/14093302/29742890-e4d71b6e-8ab9-11e7-91a6-27da56aa70e2.png)\n", 33 | "Movies of top 4 genres, i.e. 
Adventure, Action, Comedy and Drama are filtered in order to better study the relationship between budget, revenue and number of movie releases. Post-1990, there is strong growth in the number of releases for these 4 genres, with Drama registering the strongest growth and Adventure the 4th strongest.\n",
34 | "\n",
35 | "#### Page 2: \n",
36 | "![p3 q1-2](https://user-images.githubusercontent.com/14093302/29742891-e4d70aca-8ab9-11e7-8136-f32344741964.png) \n",
37 | "On the budget side, Adventure movies have been allocated higher budgets over the years, followed by Action, despite having fewer releases. This implies an inverse relationship between the average budget allocated and the number of movies released by genre.\n",
38 | "\n",
39 | "#### Page 3:\n",
40 | "![p3 q1-3](https://user-images.githubusercontent.com/14093302/29742889-e4cf6022-8ab9-11e7-8c85-3aed4628ae01.png)\n",
41 | "Movie revenue is heavily influenced by budget: genres with higher budgets typically enjoy higher revenue. This can be explained by the lower number of releases, which means less direct competition and the ability to capture a larger share of the audience market."
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "## Question 2: How do the attributes differ between Universal Pictures and Paramount Pictures?\n",
49 | "\n",
50 | "![p3 q2](https://user-images.githubusercontent.com/14093302/29742887-e4bca36a-8ab9-11e7-816b-b6507451c400.png) \n",
51 | "\n",
52 | "Universal Pictures (UP) releases more movies every year when compared to Paramount Pictures (PP). \n",
53 | "\n",
54 | "UP also enjoys higher adjusted revenue per movie on average. Even though PP occasionally overtook UP, UP has always managed to improve its revenue. In fact, UP’s movies last year registered the highest revenue in history, over 3 times PP’s. \n",
55 | "\n",
56 | "Looking at the list of most popular movies, UP’s movies are consistently ranked higher than PP’s movies.\n",
57 | "While it is not known whether UP’s popularity and revenue advantages are due to its higher number of releases or the other way round, UP is definitely performing better than PP."
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "## Question 3: How have movies based on novels performed relative to movies not based on novels?\n",
65 | "\n",
66 | "![p3 q3](https://user-images.githubusercontent.com/14093302/29742888-e4bf5d44-8ab9-11e7-8cfc-0c808d5b0ba6.png)\n",
67 | "Even though the number of novel-based movies released yearly is small, their average revenue has been higher than that of non-novel-based movies after 1990. Their popularity has also improved and is consistently higher."
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "## Question 4: What is the relationship between gross margin and budget of a movie?\n",
75 | "\n",
76 | "![p3 q4](https://user-images.githubusercontent.com/14093302/29742886-e4b933c4-8ab9-11e7-99f5-65b1a32ef7aa.png)\n",
77 | "In order to calculate gross margin, a calculated field is created where the adjusted budget is subtracted from the adjusted revenue.\n",
78 | "\n",
79 | "When aggregating the data by genre and year, there is a clear trend that higher-budget movies lead to higher gross margins. That might be the reason why budgets have ballooned over the years in order to chase after profit. &#13;
Adventure and Action movies’ ability to guarantee hefty returns despite its high price tag might be one of the reasons they are among the top 4 genres in number of movie release (from Q1).\n", 80 | "\n", 81 | "When zooming down to each movie title level, there seems to be 3 trend lines. The trend line with the steepest positive slope indicated successful movies which register much higher gross margin despite having low budget. They are among the outliers. Majority of the movies follow the other 2 trend lines with gradual slopes. As observed, increase in budget also increases the probability of raking in higher gross margin/loss. \n", 82 | "\n", 83 | "This is in line with the expectation that investment with higher return comes with higher risk. However, majority of the movies registered positive gross margin and that is why most production companies are still willing to invest heavily to produce more movies over the years.\n", 84 | "\n", 85 | " \n", 86 | "\n" 87 | ] 88 | } 89 | ], 90 | "metadata": { 91 | "kernelspec": { 92 | "display_name": "Python 3", 93 | "language": "python", 94 | "name": "python3" 95 | }, 96 | "language_info": { 97 | "codemirror_mode": { 98 | "name": "ipython", 99 | "version": 3 100 | }, 101 | "file_extension": ".py", 102 | "mimetype": "text/x-python", 103 | "name": "python", 104 | "nbconvert_exporter": "python", 105 | "pygments_lexer": "ipython3", 106 | "version": "3.6.2" 107 | } 108 | }, 109 | "nbformat": 4, 110 | "nbformat_minor": 2 111 | } 112 | -------------------------------------------------------------------------------- /4-Classification-Models/4.1-Predicting-Default-Risk.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Creditworthiness" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Business and Data Understanding\n", 15 | "\n", 16 | "#### 1. What decisions needs to be made?\n", 17 | "\n", 18 | "The objective is to identify whether customers who applied for loan are creditworthy to be extended one.\n", 19 | "\n", 20 | "#### 2. What data is needed to inform those decisions?\n", 21 | "\n", 22 | "Data on past applications such as Account Balance and Credit Amount and list of customers to be processed are required in order to inform those decisions\n", 23 | "\n", 24 | "#### 3. What kind of model (Continuous, Binary, Non-Binary, Time-Series) do we need to use to help make these decisions?\n", 25 | "\n", 26 | "Binary classification models such as logistics regression, decision tree, forest model and boosted tree will be used to analyze and determine creditworthy customers" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "## Building the Training Set\n", 34 | "\n", 35 | "An association analysis is performed on the numerical variables and there are no variables which are highly correlated with each other, i.e. a correlation of higher than 0.7. \n", 36 | "\n", 37 | "\"Figure\n", 38 | "\n", 39 | "

Figure 1: Correlation Matrix of variables

\n", 40 | "\n", 41 | "When summarizing all data fields, _Duration in Current Address_ has 69% missing data and should be removed. While _Age Years_ has 2% missing data, it is appropriate to impute the missing data with the median age. Median age is used instead of mean as the data is skewed to the left as shown below.\n", 42 | "\n", 43 | "In addition, _Concurrent Credits_ and _Occupation_ has one value while _Guarantors_, _Foreign Worker_ and _No of Dependents_ show low variability where more than 80% of the data skewed towards one data. These data should be removed in order not to skew our analysis results.\n", 44 | "\n", 45 | "_Telephone_ field should also be removed due to its irrelevancy to the customer creditworthy.\n", 46 | "\n", 47 | "\"Figure\n", 48 | "\"Figure\n", 49 | "

Figure 2: Field Summary of all variables

" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "## Train your Classification Models\n", 57 | "\n", 58 | "#### a.\tLogistic Regression (Stepwise)\n", 59 | "\n", 60 | "Using _Credit Application Result_ as the target variables, _Account Balance_, _Purpose_ and _Credit Amount_ are the top 3 most significant variables with p-value of less than 0.05.\n", 61 | "\n", 62 | "\"Figure\n", 63 | "

Figure 3: Summary Report for Stepwise Logistic Regression Model

\n", 64 | "\n", 65 | "Overall accuracy is around 76.0% while accuracy for creditworthy is higher than non-creditworthy at 80.0% and 62.9% respectively. The model is biased towards predicting customers as non-creditworthy.\n", 66 | "\n", 67 | "![Figure 4](https://user-images.githubusercontent.com/14093302/29741239-5a600f4a-8a9b-11e7-8d1f-64aa50509ddf.png)\n", 68 | "

Figure 4: Model Comparison Report for Stepwise Logistic Regression Model

\n", 69 | "\n", 70 | "#### b. Decision Tree\n", 71 | "\n", 72 | "Using Credit Application Result as the target variables, Account Balance, Value Savings Stocks and Duration of Credit Month are the top 3 most important variables. The overall accuracy is 74.7%. \n", 73 | "\n", 74 | "Accuracy for creditworthy is 79.1% while accuracy for non-creditworthy is 60.0%. The model seems to be biased towards predicting customers as non-creditworthy.\n", 75 | "\n", 76 | "\"Figure\n", 77 | "

Figure 5: Decision Tree, Variable Importance and Confusion Matrix

\n", 78 | "\n", 79 | "![Figure 6](https://user-images.githubusercontent.com/14093302/29741238-5a5c89ce-8a9b-11e7-96ed-25af0a3ccad0.png) \n", 80 | "

Figure 6: Model Comparison Report for Decision Tree

\n", 81 | "\n", 82 | "#### c. Forest Model\n", 83 | "\n", 84 | "Using Credit Application Result as the target variables, Credit Amount, Age Years and Duration of Credit Month are the 3 most important variables.\n", 85 | "\n", 86 | "Overall accuracy is 80.0%. The model isn’t biased as the accuracies for creditworthy and non-creditworthy are 79.1% and 85.7% respectively, which are comparable.\n", 87 | " \n", 88 | "\"Figure\n", 89 | "

Figure 7: Percentage Error for Different Number of Trees and Variable Importance Plot

\n", 90 | "\n", 91 | "![Figure 8](https://user-images.githubusercontent.com/14093302/29741235-5a38827c-8a9b-11e7-9404-77cfcf83665c.png)\n", 92 | "

Figure 8: Model Comparison Report for Forest Model

\n", 93 | "\n", 94 | "#### d. Boosted Model\n", 95 | "\n", 96 | "Account Balance and Credit Amount are the most significant variables from figure 10. \n", 97 | "Overall accuracy for is 76.7%. Accuracies for creditworthy and non-creditworthy are 76.7% and 78.3% respectively which indicates a lack of bias in predicting credit-worthiness of customers.\n", 98 | "\n", 99 | "\"Figure\n", 100 | "

Figure 9: Variable Importance Plot for Boosted Model

\n", 101 | "\n", 102 | "![Figure 10](https://user-images.githubusercontent.com/14093302/29741233-5a348c58-8a9b-11e7-9451-e95b1ae83688.png)\n", 103 | "

Figure 10: Model Comparison Report for Boosted Model

" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "## Write-Up\n", 111 | "Forest model is chosen as it offers the highest accuracy at 80% against validation set. \n", 112 | "Its accuracies for creditworthy and non-creditworthy are among the highest of all.\n", 113 | "\n", 114 | "Forest model reaches the true positive rate at the fastes rate. The accuracy difference between creditworthy and non-creditworthy are also comparable which makes it least bias towards any decisions. This is crucial in avoiding lending money to customers with high probability of defaulting while ensuring opportunities are not overlooked by not loaning to creditworthy customers.\n", 115 | "\n", 116 | "Tthere are **408 creditworthy cutomers** using forest models to score new customers.\n", 117 | "\n", 118 | "![Figure 11](https://user-images.githubusercontent.com/14093302/29741234-5a3599d6-8a9b-11e7-8764-dc4d81fc1e21.png)\n", 119 | "

Figure 11: Model Comparison Report for all 4 classification models

\n", 120 | "\n", 121 | "![Figure 12](https://user-images.githubusercontent.com/14093302/29741232-5a31739c-8a9b-11e7-9ae5-d1e73caf64c7.png) \n", 122 | "

Figure 12: ROC curve for all 4 classification models

\n" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "## Alteryx Flow\n", 130 | "\n", 131 | "![Figure 13](https://user-images.githubusercontent.com/14093302/29741245-5a8c4074-8a9b-11e7-9ddb-17bdba28ca9f.png)" 132 | ] 133 | } 134 | ], 135 | "metadata": { 136 | "kernelspec": { 137 | "display_name": "Python 3", 138 | "language": "python", 139 | "name": "python3" 140 | }, 141 | "language_info": { 142 | "codemirror_mode": { 143 | "name": "ipython", 144 | "version": 3 145 | }, 146 | "file_extension": ".py", 147 | "mimetype": "text/x-python", 148 | "name": "python", 149 | "nbconvert_exporter": "python", 150 | "pygments_lexer": "ipython3", 151 | "version": "3.6.2" 152 | } 153 | }, 154 | "nbformat": 4, 155 | "nbformat_minor": 2 156 | } 157 | -------------------------------------------------------------------------------- /5-AB-Testing/5.1-AB-Test-a-New-Menu-Launch.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Analyzing a Market Test" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Plan Your Analysis\n", 15 | "\n", 16 | "#### 1. What is the performance metric you’ll use to evaluate the results of your test?\n", 17 | "The sum of gross margin will be used as performance metrics to evaluate whether to introduce gourmet sandwiches and limited wine offerings to spur sales growth in Round Roasters\n", 18 | "\n", 19 | "#### 2. What is the test period?\n", 20 | "A period of 12 weeks (29-Apr-16 to 21-Jul-16) is used as test period\n", 21 | "\n", 22 | "#### 3. At what level (day, week, month, etc.) should the data be aggregated?\n", 23 | "The data should be aggregated at weekly level\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## Clean Up Your Data \n", 31 | "\n", 32 | "**RoundRoasterTransaction** and **Round-Roaster-Store** datasets are first combined. 76 weeks data (6-Feb-15 to 21-Jul-16) is used as A/B test requires 52 weeks of data in addition to a minimum of 12 weeks needed to calculate seasonality and for the period of testing each. \n", 33 | "12 weeks is used instead of 6 weeks in this case as the test period lasted for 12 weeks.\n", 34 | "\n", 35 | "_The week_, _week_begin_, _week_end_ and _NewProduct_Flag_ are added to calculate the weekly traffic and sales for each store. **Treatment_Store** dataset is then introduced to create a list of control and treatment stores.\n", 36 | "\n", 37 | "\"Figure\n", 38 | "

Figure 1: Workflow to clean up data

\n" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "## Match Treatment and Control Units\n", 46 | "\n", 47 | "#### 1. What control variables should be considered? Note: Only consider variables in the RoundRoastersStore file.\n", 48 | "_AvgMonthSales_ should be considered as constant variables while _Square Feet_ should ignored.\n", 49 | "\n", 50 | "#### 2. What is the correlation between your each potential control variable and your performance metric?\n", 51 | "From the Pearson Correlation Analysis, _AvgMonthSales_ has high correlation of 0.99 with the performance metric, i.e. Sum of Gross Margin. On the other hand, _Square Feet_ has a poor correlation of -0.05.\n", 52 | "\n", 53 | "\"Figure\n", 54 | "

Figure 2: Pearson Correlation Analysis

\n", 55 | "\n", 56 | "#### 3. What control variables will you use to match treatment and control stores?\n", 57 | "_AvgMonthSales_ will be used together with Trend and Seasonality when matching treatment and control stores.\n", 58 | "\n", 59 | "#### 4. Please fill out the table below with your treatment and control stores pairs:\n", 60 | "\n", 61 | "| Treatment Store | Control Store 1 | Control Store 2 |\n", 62 | "| :-------------: | :-------------: | :-------------: |\n", 63 | "| 1664 | 1964 | 8562 |\n", 64 | "| 1675 | 1807 | 7584 |\n", 65 | "| 1696 | 1863 | 7334 |\n", 66 | "| 1700 | 7037 | 1508 |\n", 67 | "| 1712 | 8162 | 7434 |\n", 68 | "| 2288 | 2568 | 9081 |\n", 69 | "| 2293 | 12219 | 9639 |\n", 70 | "| 2301 | 11668 | 12019 |\n", 71 | "| 2322 | 9238 | 9388 |\n", 72 | "| 2241 | 2572 | 3102 |\n", 73 | "\n", 74 | "

Table 1: Treatment and Control Stores

\n" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "## Analysis and Write-up\n", 82 | "\n", 83 | "#### 1. What is your recommendation - Should the company roll out the updated menu to all stores? \n", 84 | "The company should roll out the updated menu to all stores as the sum of profit margin increased by more than 18%, from $17,978.67 per store to $26,687.45 per store during test period.\n", 85 | "\n", 86 | "#### 2. What is the lift from the new menu for West and Central regions (include statistical significance)? \n", 87 | "The lift for West region is 36.6% while the lift for Central region is 43.2% and both have a statistical significance of 99.5% and 100% respectively.\n", 88 | "\n", 89 | "#### 3. What is the lift from the new menu overall?\n", 90 | "The lift for the new menu overall is 43.2% with a statistical significance of 99.6%.\n", 91 | "\n", 92 | "### West Region\n", 93 | "\n", 94 | "\"Figure\n", 95 | "\"Figure \n", 96 | "

Figure 3: A/B Analysis for West Region

\n", 97 | "\n", 98 | "### Central Region\n", 99 | "\n", 100 | "\"Figure\n", 101 | "

Figure 4: A/B Analysis for Central Region

\n", 102 | "\n", 103 | "### Overall\n", 104 | "\n", 105 | "\"Figure\n", 106 | "

Figure 5: A/B Analysis for Overall

\n" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "## Alteryx Workflow\n", 114 | "\n", 115 | "\"Figure\n", 116 | "

Figure 6: Determining treatment and control store pairings&#13;

\n", 117 | "\n", 118 | "![Figure 7](https://user-images.githubusercontent.com/14093302/29741445-c9162baa-8a9f-11e7-8de9-90098f750fd8.png)\n", 119 | "

Figure 7: A/B analysis&#13;

" 120 | ] 121 | } 122 | ], 123 | "metadata": { 124 | "kernelspec": { 125 | "display_name": "Python 3", 126 | "language": "python", 127 | "name": "python3" 128 | }, 129 | "language_info": { 130 | "codemirror_mode": { 131 | "name": "ipython", 132 | "version": 3 133 | }, 134 | "file_extension": ".py", 135 | "mimetype": "text/x-python", 136 | "name": "python", 137 | "nbconvert_exporter": "python", 138 | "pygments_lexer": "ipython3", 139 | "version": "3.6.2" 140 | } 141 | }, 142 | "nbformat": 4, 143 | "nbformat_minor": 2 144 | } 145 | -------------------------------------------------------------------------------- /6-Time-Series-Forecasting/6.1-Forecast-Video-Game-Sales.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Forecasting Sales" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Plan Your Analysis\n", 15 | "\n", 16 | "#### 1. Does the dataset meet the criteria of a time series dataset? Make sure to explore all four key characteristics of a time series data.\n", 17 | "To meet the criteria of a time series dataset, each measurement of data taken across a continuous time interval is sequential and of equal intervals, each time unit having at most one data point, ordering matters in the list of observations and dependency of time.\n", 18 | "\n", 19 | "#### 2. Which records should be used as the holdout sample?\n", 20 | "Holdout sample size depends on how far the prediction is. Since we need to predict the sales for the next 4 months, a 4-month long holdout sample from Jun-13 till Sept-13 should be used. \n" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## Determine Trend, Seasonal, and Error components \n", 28 | "\n", 29 | "#### 1. What are the trend, seasonality, and error of the time series? Show how you were able to determine the components using time series plots. Include the graphs.\n", 30 | "\n", 31 | "\"Figure\n", 32 | "\"Figure\n", 33 | "\n", 34 | "The time series and decomposition plots are generated using TS plot function. \n", 35 | "The seasonality and trend show increasing trends, thus multiplication and addition should be applied respectively. \n", 36 | "For error plot, there isn’t a trend but rather fluctuations and thus should be applied multiplicatively as well. \n" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Build your Models\n", 44 | "\n", 45 | "### ETS Model\n", 46 | "\n", 47 | "#### 1. What are the model terms for ETS? Explain why you chose those terms.\n", 48 | "ETS(M,A,M) is chosen based on the decomposition plot above. 
Dampened and non-dampened ETS models are run with a holdout sample of 4 months.\n",
49 | "\n",
50 | "**Non-dampened ETS model:**\n",
51 | "\"Figure\n",
52 | "\"Figure\n",
53 | "\n",
54 | "The AIC value is **1639.74**, the RMSE (Root Mean Square Error) is **32992.73** and the MASE (Mean Absolute Scaled Error) is **0.3727**.\n",
55 | "\n",
56 | "**Dampened ETS Model:**\n",
57 | "\"Figure\n",
58 | "\"Figure\n",
59 | " \n",
60 | "The AIC value is **1639.47**, the RMSE is **33153.53** and the MASE is **0.3675**.\n",
61 | "\n",
62 | "**Non-Dampened:**\n",
63 | "\"Figure\n",
64 | "\n",
65 | "**Dampened:**\n",
66 | "\"Figure\n",
67 | "\n",
68 | "By comparing the forecast and actual results, the **dampened model** is chosen due to its higher accuracy.\n",
69 | "The dampened model’s MASE and AIC are lower, which offsets its marginally higher RMSE.\n",
70 | "\n",
71 | "### ARIMA Model\n",
72 | "\n",
73 | "#### 2. What are the model terms for ARIMA? Explain why you chose those terms. \n",
74 | "\"Figure\n",
75 | "\n",
76 | "Without differencing, the Auto-Correlation Function (ACF) of the time series and its seasonal component shows high correlation, and the Partial Autocorrelation Function (PACF) shows a significant lag at period 13, which is due to the seasonal effect.\n",
77 | "\n",
78 | "\"Figure\n",
79 | "A seasonal difference is then taken. However, the ACF still shows high correlation, while the PACF no longer shows strong correlation once the seasonal difference is applied.\n",
80 | "\n",
81 | "\"Figure\n",
82 | "A seasonal first difference is then performed, and the ACF plot no longer shows strong correlation. \n",
83 | "\n",
84 | "\"Figure\n",
85 | "ARIMA(0,1,1)(0,1,0)12 is used, as lag-1 is negative and the seasonal period is 12 months.\n",
86 | "\n",
87 | "\"Figure\n",
88 | "As shown above, the AIC is **1256.60**, the RMSE is **36761.53** and the **MASE is 0.3646**.\n",
89 | "\n",
90 | "\"Figure\n",
91 | "Neither the ACF nor the PACF shows significant correlation, so no additional AR or MA terms are needed.\n"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "## Forecast\n",
99 | "\n",
100 | "#### 1. Which model did you choose? Justify your answer by showing: in-sample error measurements and forecast error measurements against the holdout sample.\n",
101 | "\n",
102 | "\"Figure\n",
103 | "\n",
104 | "The ARIMA model is better at forecasting sales when the holdout sample is used as validation data, as its MAPE and ME values are lower than the ETS model’s.\n",
105 | "\n",
106 | "The RMSE for ARIMA is **33999.79** compared to ETS’ RMSE of **60176.47**. ARIMA’s MASE value of **0.4532** is also lower than ETS’ MASE value of **0.8116**. It is clear that the ARIMA model is better, since its in-sample and forecast error measurements are smaller.\n",
107 | "\n",
108 | "#### 2. What is the forecast for the next four periods? Graph the results using 95% and 80% confidence intervals.\n",
109 | "\n",
110 | "\"Figure\n",
111 | "\n",
112 | "The forecasts for the next 4 periods (Oct-13 till Jan-14) are **754,854**, **785,854**, **684,654** and **687,854**.\n"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "## Alteryx Workflow\n",
120 | "\n",
121 | "\"Figure\n",
122 | "

Workflow 1: ARIMA Workflow

\n", 123 | "\n", 124 | "\"Figure\n", 125 | "

Workflow 2: ETS Workflow

" 126 | ] 127 | } 128 | ], 129 | "metadata": { 130 | "kernelspec": { 131 | "display_name": "Python 3", 132 | "language": "python", 133 | "name": "python3" 134 | }, 135 | "language_info": { 136 | "codemirror_mode": { 137 | "name": "ipython", 138 | "version": 3 139 | }, 140 | "file_extension": ".py", 141 | "mimetype": "text/x-python", 142 | "name": "python", 143 | "nbconvert_exporter": "python", 144 | "pygments_lexer": "ipython3", 145 | "version": "3.6.2" 146 | } 147 | }, 148 | "nbformat": 4, 149 | "nbformat_minor": 2 150 | } 151 | -------------------------------------------------------------------------------- /7-Segmentation-and-Clustering/7.1-Combining-Predictive-Techniques.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Predictive Analytics Capstone" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Task 1: Determine Store Formats for Existing Stores\n", 15 | "\n", 16 | "#### 1. What is the optimal number of store formats? How did you arrive at that number?\n", 17 | "\n", 18 | "\"Figure\n", 19 | "

Figure 1: K-Means Cluster Assessment Report

\n", 20 | "\n", 21 | "\"Figure\n", 22 | "

Figure 2: Adjusted Rand Indices and Calinski-Harabasz Indices

\n", 23 | "\n", 24 | "Based on the K-means report, Adjusted Rand and Calinski-Harabasz indices below, the optimal number of store formats is **3** when both the indices registered the highest median value.\n", 25 | "\n", 26 | "#### 2. How many stores fall into each store format?\n", 27 | "\n", 28 | "Cluster 1 has 23 stores, cluster 2 has 29 stores while cluster 3 has 33 stores.\n", 29 | "\n", 30 | "\"Figure\n", 31 | "

Figure 3: Cluster Information

\n", 32 | "\n", 33 | "#### 3. Based on the results of the clustering model, what is one way that the clusters differ from one another?\n", 34 | "\n", 35 | "Cluster 1 stores sold more General Merchandise in terms of percentage while Cluster 2 stores sold more Produce.\n", 36 | "\n", 37 | "Cluster 1 stores have highest medial total sales when compared to the other 2. Its range of total sales and most of other categorical sales are also the largest. Cluster 3 stores are the most similar in terms of sales due to more compact range.\n", 38 | "\n", 39 | "![Figure 4](https://user-images.githubusercontent.com/14093302/29742323-a7e3c37e-8aaf-11e7-89f4-bf3aeb4ea1b7.png)\n", 40 | "

Figure 4: Tableau Visualization

\n", 41 | "\n", 42 | "#### 4. Please provide a Tableau visualization (saved as a Tableau Public file) that shows the location of the stores, uses color to show cluster, and size to show total sales.\n", 43 | "\n", 44 | "\"Figure\n", 45 | "

Figure 5: Location of the stores&#13;

\n", 46 | "\n", 47 | "**Tableau Profile** \n", 48 | "https://public.tableau.com/profile/r221609#!/vizhome/Task1_39/Task1\n" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## Task 2: Formats for New Stores \n", 56 | "\n", 57 | "#### 1. What methodology did you use to predict the best store format for the new stores? Why did you choose that methodology? (Remember to Use a 20% validation sample with Random Seed = 3 to test differences in models.)\n", 58 | "\n", 59 | "The model comparison report below shows comparison matrix of Decision Tree, Forest Model and Boosted Model. \n", 60 | "**Boosted Model** is chosen despite having same accuracy as Forest Model due to higher F1 value.\n", 61 | " \n", 62 | "![Figure 6](https://user-images.githubusercontent.com/14093302/29742378-d0f2cbce-8ab0-11e7-8c3d-ec18ffa1425d.png)\n", 63 | "

Figure 6: Model Comparison Report

\n", 64 | "\n", 65 | "#### 2. What are the three most important variables that help explain the relationship between demographic indicators and store formats? Please include a visualization.\n", 66 | "\n", 67 | "_Ave0to9_, _HVal750KPlus_ and _EdHSGrad_ are the three most important variables.\n", 68 | " \n", 69 | "\"Figure\n", 70 | "

Figure 7: Variable Importance Plot&#13;

\n", 71 | "\n", 72 | "#### 3. What format do each of the 10 new stores fall into? Please fill in the table below.\n", 73 | "\n", 74 | "| Store Number | Segment |\n", 75 | "| :----------: | :-----: |\n", 76 | "| S0086 | 1 | \n", 77 | "| S0087 | 2 | \n", 78 | "| S0088 | 3 | \n", 79 | "| S0089 | 2 | \n", 80 | "| S0090 | 2 | \n", 81 | "| S0091 | 1 | \n", 82 | "| S0092 | 2 | \n", 83 | "| S0093 | 1 | \n", 84 | "| S0094 | 2 | \n", 85 | "| S0095 | 2 | \n", 86 | "

Table 1: Store Number and Segment

" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "## Task 3: Predicting Produce Sales\n", 94 | "\n", 95 | "#### 1. What type of ETS or ARIMA model did you use for each forecast? Use ETS(a,m,n) or ARIMA(ar, i, ma) notation. How did you come to that decision?\n", 96 | "\n", 97 | "**ETS(M,N,M) with no dampening** is used for ETS model. \n", 98 | "\n", 99 | "The seasonality shows increasing trend and should be applied multiplicatively. The trend is not clear and nothing should be applied. Its error is irregular and should be applied multiplicatively.\n", 100 | " \n", 101 | "![p9 f8](https://user-images.githubusercontent.com/14093302/29742639-7758a020-8ab5-11e7-8437-d4eb95191404.png)\n", 102 | "\n", 103 | "**ARIMA(0,1,2)(0,1,0)** is used as seasonal difference and seasonal first difference were performed. There is a lag-2.\n", 104 | " \n", 105 | "![p9 f9](https://user-images.githubusercontent.com/14093302/29742636-773891fe-8ab5-11e7-85e9-21eb855612ac.png) \n", 106 | "![p9 f10](https://user-images.githubusercontent.com/14093302/29742635-7735a372-8ab5-11e7-86dc-302f96d967ba.png) \n", 107 | "![p9 f11](https://user-images.githubusercontent.com/14093302/29742638-7748231c-8ab5-11e7-9452-0a718c71a6fc.png)\n", 108 | "\n", 109 | "**ETS model’s accuracy is higher** when compared to ARIMA model. A holdout sample of 6 months data is used. Its RMSE of **1,020,597** is lower than ARIMA’s **1,429,296** while its MASE is **0.45** compared to ARIMA’s **0.53**. ETS also has a higher AIC at **1,283** while ARIMA’s AIC is **859**.\n", 110 | " \n", 111 | "\"Figure\n", 112 | "\"Figure\n", 113 | "\n", 114 | "The graph and table below shows actual and forecast value with 80% & 95% confidence level interval.\n", 115 | " \n", 116 | "\"Figure\n", 117 | "\"Figure\n", 118 | "\n", 119 | "#### 2. Please provide a Tableau Dashboard (saved as a Tableau Public file) that includes a table and a plot of the three monthly forecasts; one for existing, one for new, and one for all stores. Please name the tab in the Tableau file \"Task 3\".\n", 120 | "\n", 121 | "Table below shows the forecast sales for existing stores and new stores. New store sales is obtained by using **ETS(M,N,M)** analysis with all the 3 individual cluster to obtain the average sales per store. The average sales value (x3 cluster 1, x6 cluster 2, x1 cluster 3) are added up produce New Store Sales.\n", 122 | "\n", 123 | "| Year | Month | New Store Sales | Existing Store Sales |\n", 124 | "| :--: | :---: | :-------------: | :------------------: |\n", 125 | "| 2016 | 1 | 2,626,198 | 21,539,936 |\n", 126 | "| 2016 | 2 | 2,529,186 | 20,413,771 |\n", 127 | "| 2016 | 3 | 2,940,264 | 24,325,953 |\n", 128 | "| 2016 | 4 | 2,774,135 | 22,993,466 |\n", 129 | "| 2016 | 5 | 3,165,320 | 26,691,951 |\n", 130 | "| 2016 | 6 | 3,203,286 | 26,989,964 |\n", 131 | "| 2016 | 7 | 3,244,464 | 26,948,631 |\n", 132 | "| 2016 | 8 | 2,871,488 | 24,091,579 |\n", 133 | "| 2016 | 9 | 2,552,418 | 20,523,492 |\n", 134 | "| 2016 | 10 | 2,482,837 | 20,011,749 |\n", 135 | "| 2016 | 11 | 2,597,780 | 21,177,435 |\n", 136 | "| 2016 | 12 | 2,591,815 | 20,855,799 |\n", 137 | "

Table 2: Sales for Existing and New Stores

\n", 138 | " \n", 139 | "\"Figure\n", 140 | "\n", 141 | "The chart above shows the historical and forecast sales for existing stores and new stores over the period from Mar-12 to Dec-16.\n", 142 | "\n", 143 | "**Tableau Profile** \n", 144 | "https://public.tableau.com/profile/r221609#!/vizhome/Task3_53/Task3" 145 | ] 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "metadata": {}, 150 | "source": [ 151 | "# Alteryx Workflow\n", 152 | "\n", 153 | "\"Workflow\n", 154 | "

Workflow 1: Workflow for Task 1

\n", 155 | "\n", 156 | "\"Workflow\n", 157 | "

Workflow 2: Workflow for Task 2

\n", 158 | "\n", 159 | "\"Workflow\n", 160 | "

Workflow 3: Workflow for Task 3

" 161 | ] 162 | } 163 | ], 164 | "metadata": { 165 | "kernelspec": { 166 | "display_name": "Python 3", 167 | "language": "python", 168 | "name": "python3" 169 | }, 170 | "language_info": { 171 | "codemirror_mode": { 172 | "name": "ipython", 173 | "version": 3 174 | }, 175 | "file_extension": ".py", 176 | "mimetype": "text/x-python", 177 | "name": "python", 178 | "nbconvert_exporter": "python", 179 | "pygments_lexer": "ipython3", 180 | "version": "3.6.2" 181 | } 182 | }, 183 | "nbformat": 4, 184 | "nbformat_minor": 2 185 | } 186 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Predictive Analytics for Business Nanodegree 2 | 3 | ### Kai Sheng Teh 4 | 5 | This repository contains projects for Udacity's [Predictive Analytics for Business Nanodegree](https://www.udacity.com/course/business-analyst-nanodegree--nd008). (It was known as Business Analyst Nanodegree back in 2017) 6 | 7 | ### Part 1: [Problem Solving with Advanced Analytics by Alteryx](https://www.udacity.com/course/problem-solving-with-advanced-analytics--ud976) 8 | Learn a structured framework for solving problems with advanced analytics. Learn to select the most appropriate analytical methodology. Learn linear regression. 9 | 10 | - Project: [Predicting Diamond Prices](https://github.com/kaishengteh/Business-Analyst-Nanodegree/blob/master/1-Problem-Solving-with-Advanced-Analytics/1.1-Predicting-Diamond-Price.ipynb) 11 | - Project: [Predicting Catalog Demand](https://github.com/kaishengteh/Business-Analyst-Nanodegree/blob/master/1-Problem-Solving-with-Advanced-Analytics/1.2-Predicting-Catalog-Demand.ipynb) 12 | 13 | ### Part 2: [Creating an Analytical Dataset by Alteryx](https://www.udacity.com/course/creating-an-analytical-dataset--ud977) 14 | Understand the most common data types. Understand the various sources of data. Make adjustments to dirty data to prepare a dataset. Identify and adjust for outliers. Learn to write queries to extract and analyze data from a relational database. 15 | 16 | - Project: [Data Cleanup](https://github.com/kaishengteh/Business-Analyst-Nanodegree/blob/master/2-Creating-an-Analytical-Dataset/2.1-Data-Cleanup.ipynb) 17 | - Project: [Create Report from Database (SQL)](https://github.com/kaishengteh/Business-Analyst-Nanodegree/blob/master/2-Creating-an-Analytical-Dataset/2.2-Create-Report-from-Database.ipynb) 18 | 19 | ### Part 3: [Data Visualization in Tableau](https://www.udacity.com/course/data-visualization-in-tableau--ud1006) 20 | Understand the importance of data visualization. Know how different data types are encoded in visualizations. Select the most effective chart or graph based on the data being displayed. 21 | 22 | - Project: [Visualizing Movie Data](https://github.com/kaishengteh/Business-Analyst-Nanodegree/blob/master/3-Data-Visualization-in-Tableau/3.1-Visualize-Movie-Data.ipynb) 23 | 24 | ### Part 4: [Classification Models by Alteryx](https://www.udacity.com/course/classification-models--ud978) 25 | You will use classification models, such as logistic regression, decision tree, forest, and boosted, to make predictions of binary and non-binary outcomes. 
26 | 27 | - Project: [Predicting Default Risk](https://github.com/kaishengteh/Business-Analyst-Nanodegree/blob/master/4-Classification-Models/4.1-Predicting-Default-Risk.ipynb) 28 | 29 | ### Part 5: [A/B Testing](https://www.udacity.com/course/ab-testing--ud979) 30 | Understand the fundamentals of A/B testing, including experimental design, variable selection, and analyzing and interpreting results. 31 | 32 | - Project: [A/B Test a New Menu Launch](https://github.com/kaishengteh/Business-Analyst-Nanodegree/blob/master/5-AB-Testing/5.1-AB-Test-a-New-Menu-Launch.ipynb) 33 | 34 | ### Part 6: [Time Series Forecasting](https://www.udacity.com/course/time-series-forecasting--ud980) 35 | Understand trend, seasonal, and cyclical behavior of time series data. Use time series decomposition plots. Build ETS and ARIMA models. 36 | 37 | - Project: [Forecast Video Game Sales](https://github.com/kaishengteh/Business-Analyst-Nanodegree/blob/master/6-Time-Series-Forecasting/6.1-Forecast-Video-Game-Sales.ipynb) 38 | 39 | ### Part 7: [Segmentation and Clustering](https://www.udacity.com/course/segmentation-and-clustering--ud981) 40 | Understand the difference between localization, standardization, and segmentation. Scale data to prepare a dataset for cluster modeling. Use principal components analysis (PCA) to reduce the number of variables for cluster model. Build and apply a k-centroid cluster model. Visualize and communicate the results of a cluster model. 41 | Then complete a capstone project combining techniques learned throughout the program. 42 | 43 | - Project: [Predictive Analytics Capstone](https://github.com/kaishengteh/Business-Analyst-Nanodegree/blob/master/7-Segmentation-and-Clustering/7.1-Combining-Predictive-Techniques.ipynb) 44 | 45 | ![Udacity Business Analyst Nanodegree](https://user-images.githubusercontent.com/14093302/34906846-7368e8d0-f8b0-11e7-9b8a-44c468d7a61b.jpg) 46 | --------------------------------------------------------------------------------