├── piggy_bank.jpg
├── ReadME.md
├── .ipynb_checkpoints
│   ├── notebook-checkpoint.ipynb
│   └── Designing a Bank Marketing Database-checkpoint.ipynb
└── Designing a Bank Marketing Database.ipynb

/piggy_bank.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Chisomnwa/Designing-a-Bank-Marketing-Database/main/piggy_bank.jpg
--------------------------------------------------------------------------------
/ReadME.md:
--------------------------------------------------------------------------------
1 | # Designing a Bank Marketing Database 2 | 3 | ![piggy_bank](piggy_bank.jpg) 4 | 5 |
6 | 7 | --- 8 | *This **Data Engineering** project involves creating a comprehensive database to store and manage customer information for a bank's marketing campaigns.* 9 | 10 |
11 | 12 | ## Project Description 13 | 14 | Personal loans are a lucrative revenue stream for banks. The typical interest rate of a two-year loan in the United Kingdom is [around 10%](https://www.experian.com/blogs/ask-experian/whats-a-good-interest-rate-for-a-personal-loan/). This might not sound like a lot, but in September 2022 alone UK consumers borrowed [around £1.5 billion](https://www.ukfinance.org.uk/system/files/2022-12/Household%20Finance%20Review%202022%20Q3-%20Final.pdf), which at 10% a year would mean approximately £300 million in interest generated by banks over two years! 15 | 16 | You have been asked to work with a bank to clean and store the data they collected as part of a recent marketing campaign, which aimed to get customers to take out a personal loan. They plan to conduct more marketing campaigns going forward, so they would like you to set up a PostgreSQL database to store this campaign's data, designing the schema in a way that would allow data from future campaigns to be easily imported. 17 | 18 | They have supplied you with a CSV file called `"bank_marketing.csv"`, which you will need to clean, reformat, and split in order to save separate files based on the tables you will create. It is recommended to use `pandas` for these tasks. 19 | 20 | Lastly, you will write the SQL code that the bank can execute to create the tables and populate them with the data from the CSV files. As the bank is quite strict about security, you'll save the SQL statements as multiline string variables that they can then use to create the database on their end.
21 | 22 | You have been asked to design a database that will have three tables: 23 | 24 | ## client 25 | 26 | | column | data type | description | original column in dataset | 27 | |--------|-----------|-------------|----------------------------| 28 | | `id` | `serial` | Client ID - primary key | `client_id` | 29 | | `age` | `integer` | Client's age in years | `age` | 30 | | `job` | `text` | Client's type of job | `job` | 31 | | `marital` | `text` | Client's marital status | `marital` | 32 | | `education` | `text` | Client's level of education | `education` | 33 | | `credit_default` | `boolean` | Whether the client's credit is in default | `credit_default` | 34 | | `housing` | `boolean` | Whether the client has an existing housing loan (mortgage) | `housing` | 35 | | `loan` | `boolean` | Whether the client has an existing personal loan | `loan` | 36 | 37 |
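As a rough illustration of how the `client` table above could be derived with `pandas`, the sketch below selects the client columns from a tiny in-memory stand-in for `bank_marketing.csv`, renames `client_id` to `id`, and maps `yes`/`no` strings to booleans. The sample rows, the `yes_no` helper name, and the choice to turn `unknown` into a missing value are assumptions for illustration, not something the brief specifies.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of bank_marketing.csv.
marketing = pd.DataFrame({
    "client_id": [0, 1, 2],
    "age": [56, 57, 37],
    "job": ["housemaid", "services", "admin."],
    "marital": ["married", "married", "married"],
    "education": ["basic.4y", "high.school", "unknown"],
    "credit_default": ["no", "unknown", "no"],
    "housing": ["no", "no", "yes"],
    "loan": ["no", "no", "no"],
})

client_columns = ["client_id", "age", "job", "marital", "education",
                  "credit_default", "housing", "loan"]

# .copy() gives an independent frame, so later edits do not touch `marketing`.
client = marketing[client_columns].copy().rename(columns={"client_id": "id"})

# Map yes/no to booleans; any other value (e.g. "unknown") becomes missing.
# Treating "unknown" as missing is an assumption made for this sketch.
yes_no = {"yes": True, "no": False}
for col in ["credit_default", "housing", "loan"]:
    client[col] = client[col].map(yes_no)

print(client[["id", "credit_default", "housing", "loan"]])
```

Mapping through a dictionary keeps the conversion explicit: any value that is not listed ends up as a missing value rather than being silently coerced.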
38 | 39 | ## campaign 40 | 41 | | column | data type | description | original column in dataset | 42 | |--------|-----------|-------------|----------------------------| 43 | | `campaign_id` | `serial` | Campaign ID - primary key | N/A - new column | 44 | | `client_id` | `serial` | Client ID - references `id` in the `client` table | `client_id` | 45 | | `number_contacts` | `integer` | Number of contact attempts to the client in the current campaign | `campaign` | 46 | | `contact_duration` | `integer` | Last contact duration in seconds | `duration` | 47 | | `pdays` | `integer` | Number of days since contact in previous campaign (`999` = not previously contacted) | `pdays` | 48 | | `previous_campaign_contacts` | `integer` | Number of contact attempts to the client in the previous campaign | `previous` | 49 | | `previous_outcome` | `boolean` | Outcome of the previous campaign | `poutcome` | 50 | | `campaign_outcome` | `boolean` | Outcome of the current campaign | `y` | 51 | | `last_contact_date` | `date` | Last date the client was contacted | A combination of `day`, `month`, and the newly created `year` | 52 | 53 |
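The `last_contact_date` and outcome columns above need a little assembly. The sketch below shows one possible way to do it with `pandas`, under stated assumptions: the sample rows are made up, the year is taken to be 2022 (the dataset itself has no year column), and `nonexistent` previous outcomes are treated as missing.

```python
import pandas as pd

# Hypothetical rows; the real values come from bank_marketing.csv.
campaign = pd.DataFrame({
    "client_id": [0, 1, 2],
    "month": ["may", "jun", "nov"],
    "day": [13, 19, 3],
    "poutcome": ["nonexistent", "success", "failure"],
    "y": ["no", "yes", "no"],
})

# The year is not in the dataset; 2022 is an assumption for this sketch.
campaign["year"] = "2022"

# Build strings such as "2022-May-13" and parse them with an explicit format.
date_text = (campaign["year"] + "-"
             + campaign["month"].str.capitalize() + "-"
             + campaign["day"].astype(str))
campaign["last_contact_date"] = pd.to_datetime(date_text, format="%Y-%b-%d")

# yes/no and success/failure become booleans; "nonexistent" becomes missing.
campaign["campaign_outcome"] = campaign["y"].map({"yes": True, "no": False})
campaign["previous_outcome"] = campaign["poutcome"].map(
    {"success": True, "failure": False})

# Drop the helper columns once the new ones exist.
campaign = campaign.drop(columns=["month", "day", "year", "y", "poutcome"])
print(campaign)
```

Passing an explicit `format` avoids relying on date inference, which matters here because the month arrives as a three-letter abbreviation.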
54 | 55 | ## economics 56 | 57 | | column | data type | description | original column in dataset | 58 | |--------|-----------|-------------|----------------------------| 59 | | `client_id` | `serial` | Client ID - references `id` in the `client` table | `client_id` | 60 | | `emp_var_rate` | `float` | Employment variation rate (quarterly indicator) | `emp_var_rate` | 61 | | `cons_price_idx` | `float` | Consumer price index (monthly indicator) | `cons_price_idx` | 62 | | `euribor_three_months` | `float` | Euro Interbank Offered Rate (euribor) three month rate (daily indicator) | `euribor3m` | 63 | | `number_employed` | `float` | Number of employees (quarterly indicator)| `nr_employed` | 64 | 65 |
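The table definitions above translate fairly directly into SQL. Here is a minimal sketch for `economics`, stored as a Python multi-line string as the brief asks. The variable name `economics_table`, the file name `economics.csv`, and the use of the psql `\copy` meta-command to load it are assumptions about how the bank would run it.

```python
# DDL for the economics table, kept as a plain Python string so it can be
# handed over and executed on the bank's side.
# The backslash is doubled so the string contains a literal "\copy".
economics_table = """CREATE TABLE economics
(
    client_id SERIAL REFERENCES client (id),
    emp_var_rate FLOAT,
    cons_price_idx FLOAT,
    euribor_three_months FLOAT,
    number_employed FLOAT
);
\\copy economics FROM 'economics.csv' DELIMITER ',' CSV HEADER
"""

print(economics_table)
```

Because `client_id` references `client (id)`, the `client` table would need to be created and loaded before this statement is run.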
66 | 67 | ## Project Tasks 68 | 69 | View the [notebook](https://github.com/Chisomnwa/Designing-a-Bank-Marketing-Database/blob/main/Designing%20a%20Bank%20Marketing%20Database.ipynb) to see the project tasks, the expected output for each task, and how to load the cleaned data into a PostgreSQL database. 70 | 71 | If GitHub doesn't render the notebook within about 5 seconds of clicking the link, you can view it [here](https://nbviewer.org/github/Chisomnwa/Designing-a-Bank-Marketing-Database/blob/main/Designing%20a%20Bank%20Marketing%20Database.ipynb). 72 | 73 |
74 | 75 | --- 76 | 77 | And you'd love to read my blog post on this project [here](https://medium.com/@chisompromise/designing-a-bank-marketing-database-a033f3eee479). 78 | -------------------------------------------------------------------------------- /Designing a Bank Marketing Database.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "02077ee3-e1e4-4fc5-8de1-16e987afa5fb", 6 | "metadata": {}, 7 | "source": [ 8 | "![piggy_bank](piggy_bank.jpg)\n", 9 | "\n", 10 | "
\n", 11 | "\n", 12 | "Personal loans are a lucrative revenue stream for banks. The typical interest rate of a two-year loan in the United Kingdom is [around 10%](https://www.experian.com/blogs/ask-experian/whats-a-good-interest-rate-for-a-personal-loan/). This might not sound like a lot, but in September 2022 alone UK consumers borrowed [around £1.5 billion](https://www.ukfinance.org.uk/system/files/2022-12/Household%20Finance%20Review%202022%20Q3-%20Final.pdf), which at 10% a year would mean approximately £300 million in interest generated by banks over two years!\n", 13 | "\n", 14 | "You have been asked to work with a bank to clean and store the data they collected as part of a recent marketing campaign, which aimed to get customers to take out a personal loan. They plan to conduct more marketing campaigns going forward, so they would like you to set up a PostgreSQL database to store this campaign's data, designing the schema in a way that would allow data from future campaigns to be easily imported.\n", 15 | "\n", 16 | "They have supplied you with a CSV file called `\"bank_marketing.csv\"`, which you will need to clean, reformat, and split in order to save separate files based on the tables you will create. It is recommended to use `pandas` for these tasks.\n", 17 | "\n", 18 | "Lastly, you will write the SQL code that the bank can execute to create the tables and populate them with the data from the CSV files. As the bank is quite strict about security, you'll save the SQL statements as multiline string variables that they can then use to create the database on their end.
\n", 19 | "\n", 20 | "You have been asked to design a database that will have three tables:\n", 21 | "\n", 22 | "## client\n", 23 | "\n", 24 | "| column | data type | description | original column in dataset |\n", 25 | "|--------|-----------|-------------|----------------------------|\n", 26 | "| `id` | `serial` | Client ID - primary key | `client_id` |\n", 27 | "| `age` | `integer` | Client's age in years | `age` |\n", 28 | "| `job` | `text` | Client's type of job | `job` |\n", 29 | "| `marital` | `text` | Client's marital status | `marital` | \n", 30 | "| `education` | `text` | Client's level of education | `education` |\n", 31 | "| `credit_default` | `boolean` | Whether the client's credit is in default | `credit_default` |\n", 32 | "| `housing` | `boolean` | Whether the client has an existing housing loan (mortgage) | `housing` | \n", 33 | "| `loan` | `boolean` | Whether the client has an existing personal loan | `loan` |\n", 34 | "\n", 35 | "
\n", 36 | "\n", 37 | "## campaign\n", 38 | "\n", 39 | "| column | data type | description | original column in dataset |\n", 40 | "|--------|-----------|-------------|----------------------------|\n", 41 | "| `campaign_id` | `serial` | Campaign ID - primary key | N/A - new column |\n", 42 | "| `client_id` | `serial` | Client ID - references `id` in the `client` table | `client_id` |\n", 43 | "| `number_contacts` | `integer` | Number of contact attempts to the client in the current campaign | `campaign` |\n", 44 | "| `contact_duration` | `integer` | Last contact duration in seconds | `duration` |\n", 45 | "| `pdays` | `integer` | Number of days since contact in previous campaign (`999` = not previously contacted) | `pdays` |\n", 46 | "| `previous_campaign_contacts` | `integer` | Number of contact attempts to the client in the previous campaign | `previous` |\n", 47 | "| `previous_outcome` | `boolean` | Outcome of the previous campaign | `poutcome` |\n", 48 | "| `campaign_outcome` | `boolean` | Outcome of the current campaign | `y` |\n", 49 | "| `last_contact_date` | `date` | Last date the client was contacted | A combination of `day`, `month`, and the newly created `year` |\n", 50 | "\n", 51 | "
\n", 52 | "\n", 53 | "## economics\n", 54 | "\n", 55 | "| column | data type | description | original column in dataset |\n", 56 | "|--------|-----------|-------------|----------------------------|\n", 57 | "| `client_id` | `serial` | Client ID - references `id` in the `client` table | `client_id` |\n", 58 | "| `emp_var_rate` | `float` | Employment variation rate (quarterly indicator) | `emp_var_rate` |\n", 59 | "| `cons_price_idx` | `float` | Consumer price index (monthly indicator) | `cons_price_idx` |\n", 60 | "| `euribor_three_months` | `float` | Euro Interbank Offered Rate (euribor) three month rate (daily indicator) | `euribor3m` |\n", 61 | "| `number_employed` | `float` | Number of employees (quarterly indicator)| `nr_employed` |" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 1, 67 | "id": "e2edad3c-8286-4983-b5b7-35d94fd78023", 68 | "metadata": { 69 | "executionCancelledAt": null, 70 | "executionTime": 1057, 71 | "lastExecutedAt": 1686069923599, 72 | "lastScheduledRunId": null, 73 | "lastSuccessfullyExecutedCode": "# Start coding...\nimport pandas as pd\nimport numpy as np\n\n# Read in csv\nmarketing = pd.read_csv(\"bank_marketing.csv\")\n\n# Split into the three tables\nclient = marketing[[\"client_id\", \"age\", \"job\", \"marital\", \"education\", \n \"credit_default\", \"housing\", \"loan\"]]\ncampaign = marketing[[\"client_id\", \"campaign\", \"month\", \"day\", \n \"duration\", \"pdays\", \"previous\", \"poutcome\", \"y\"]]\neconomics = marketing[[\"client_id\", \"emp_var_rate\", \"cons_price_idx\", \n \"euribor3m\", \"nr_employed\"]]\n\n# Rename client_id in the client table\nclient.rename(columns={\"client_id\": \"id\"}, inplace=True)\n\n# Rename duration, y, and campaign columns\ncampaign.rename(columns={\"duration\": \"contact_duration\", \n \"y\": \"campaign_outcome\", \n \"campaign\": \"number_contacts\",\n \"previous\": \"previous_campaign_contacts\",\n \"poutcome\": \"previous_outcome\"}, \n inplace=True)\n\n# Rename 
euribor3m and nr_employed\neconomics.rename(columns={\"euribor3m\": \"euribor_three_months\", \n \"nr_employed\": \"number_employed\"}, \n inplace=True)\n\n# Clean education column\nclient[\"education\"] = client[\"education\"].str.replace(\".\", \"_\")\nclient[\"education\"] = client[\"education\"].replace(\"unknown\", np.NaN)\n\n# Clean job column\nclient[\"job\"] = client[\"job\"].str.replace(\".\", \"\")\n\n# Change campaign_outcome to binary values\ncampaign[\"campaign_outcome\"] = campaign[\"campaign_outcome\"].map({\"yes\": 1, \n \"no\": 0})\n\n# Convert poutcome to binary values\ncampaign[\"previous_outcome\"] = campaign[\"previous_outcome\"].replace(\"nonexistent\", \n np.NaN)\ncampaign[\"previous_outcome\"] = campaign[\"previous_outcome\"].map({\"success\": 1, \n \"failure\": 0})\n\n# Add campaign_id column\ncampaign[\"campaign_id\"] = 1\n\n# Capitalize month and day columns\ncampaign[\"month\"] = campaign[\"month\"].str.capitalize()\n\n# Add year column\ncampaign[\"year\"] = \"2022\"\n\n# Convert day to string\ncampaign[\"day\"] = campaign[\"day\"].astype(str)\n\n# Add last_contact_date column\ncampaign[\"last_contact_date\"] = campaign[\"year\"] + \"-\" + campaign[\"month\"] + \"-\" + campaign[\"day\"]\n\n# Convert to datetime\ncampaign[\"last_contact_date\"] = pd.to_datetime(campaign[\"last_contact_date\"], \n format=\"%Y-%b-%d\")\n\n# Drop unneccessary columns\ncampaign.drop(columns=[\"month\", \"day\", \"year\"], inplace=True)\n\n# Save tables to individual csv files\nclient.to_csv(\"client.csv\", index=False)\ncampaign.to_csv(\"campaign.csv\", index=False)\neconomics.to_csv(\"economics.csv\", index=False)\n\n# Store and print database_design\nclient_table = \"\"\"CREATE TABLE client\n(\n id SERIAL PRIMARY KEY,\n age INTEGER,\n job TEXT,\n marital TEXT,\n education TEXT,\n credit_default BOOLEAN,\n housing BOOLEAN,\n loan BOOLEAN\n);\n\\copy client from 'client.csv' DELIMITER ',' CSV HEADER\n\"\"\"\n\ncampaign_table = \"\"\"CREATE TABLE campaign\n(\n 
campaign_id SERIAL PRIMARY KEY,\n client_id SERIAL references client (id),\n number_contacts INTEGER,\n contact_duration INTEGER,\n pdays INTEGER,\n previous_campaign_contacts INTEGER,\n previous_outcome BOOLEAN,\n campaign_outcome BOOLEAN,\n last_contact_date DATE \n);\n\\copy campaign from 'campaign.csv' DELIMITER ',' CSV HEADER\n\"\"\"\n\neconomics_table = \"\"\"CREATE TABLE economics\n(\n client_id SERIAL references client (id),\n emp_var_rate FLOAT,\n cons_price_idx FLOAT,\n euribor_three_months FLOAT,\n number_employed FLOAT\n);\n\\copy economics from 'economics.csv' DELIMITER ',' CSV HEADER\n\"\"\"" 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "# Import libraries\n", 78 | "import pandas as pd\n", 79 | "import numpy as np\n", 80 | "\n", 81 | "# suppress warnings from final output\n", 82 | "import warnings\n", 83 | "warnings.simplefilter(\"ignore\")\n", 84 | "\n", 85 | "# Start coding here..." 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "id": "9d94086e", 91 | "metadata": {}, 92 | "source": [ 93 | "#### Instruction 1: Read in bank_marketing.csv as a pandas DataFrame." 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 2, 99 | "id": "ac57dc04", 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/html": [ 105 | "

\n", 106 | "

	client_id	age	job	marital	education	credit_default	housing	loan	contact	month	...	campaign	pdays	poutcome	emp_var_rate	cons_price_idx	cons_conf_idx	euribor3m	nr_employed	y
0	0	56	housemaid	married	basic.4y	no	no	no	telephone	may	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
1	1	57	services	married	high.school	unknown	no	no	telephone	may	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
2	2	37	services	married	high.school	no	yes	no	telephone	may	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
3	3	40	admin.	married	basic.6y	no	no	no	telephone	may	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
4	4	56	services	married	high.school	no	no	yes	telephone	may	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no

\n", 269 | "

5 rows × 22 columns

\n", 270 | "

" 271 | ], 272 | "text/plain": [ 273 | " client_id age job marital education credit_default housing \\\n", 274 | "0 0 56 housemaid married basic.4y no no \n", 275 | "1 1 57 services married high.school unknown no \n", 276 | "2 2 37 services married high.school no yes \n", 277 | "3 3 40 admin. married basic.6y no no \n", 278 | "4 4 56 services married high.school no no \n", 279 | "\n", 280 | " loan contact month ... campaign pdays previous poutcome \\\n", 281 | "0 no telephone may ... 1 999 0 nonexistent \n", 282 | "1 no telephone may ... 1 999 0 nonexistent \n", 283 | "2 no telephone may ... 1 999 0 nonexistent \n", 284 | "3 no telephone may ... 1 999 0 nonexistent \n", 285 | "4 yes telephone may ... 1 999 0 nonexistent \n", 286 | "\n", 287 | " emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed y \n", 288 | "0 1.1 93.994 -36.4 4.857 5191.0 no \n", 289 | "1 1.1 93.994 -36.4 4.857 5191.0 no \n", 290 | "2 1.1 93.994 -36.4 4.857 5191.0 no \n", 291 | "3 1.1 93.994 -36.4 4.857 5191.0 no \n", 292 | "4 1.1 93.994 -36.4 4.857 5191.0 no \n", 293 | "\n", 294 | "[5 rows x 22 columns]" 295 | ] 296 | }, 297 | "execution_count": 2, 298 | "metadata": {}, 299 | "output_type": "execute_result" 300 | } 301 | ], 302 | "source": [ 303 | "# Read in bank_marketing.csv as a pandas DataFrame.\n", 304 | "bank_marketing_data = pd.read_csv(\"bank_marketing.csv\")\n", 305 | "bank_marketing_data.head()" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "id": "11195a84", 311 | "metadata": {}, 312 | "source": [ 313 | "#### Instruction 2:\n", 314 | "\n", 315 | "Split the data into three DataFrames using information provided about the desired tables as your guide: \n", 316 | "\n", 317 | "* one with information about the client, \n", 318 | "* another containing campaign data, and \n", 319 | "* a third to store information about economics at the time of the campaign." 
320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 3, 325 | "id": "6707773f", 326 | "metadata": {}, 327 | "outputs": [ 328 | { 329 | "name": "stdout", 330 | "output_type": "stream", 331 | "text": [ 332 | "Client Data:\n", 333 | " client_id age job marital education credit_default housing loan\n", 334 | "0 0 56 housemaid married basic.4y no no no\n", 335 | "1 1 57 services married high.school unknown no no\n", 336 | "2 2 37 services married high.school no yes no\n", 337 | "3 3 40 admin. married basic.6y no no no\n", 338 | "4 4 56 services married high.school no no yes\n", 339 | "\n", 340 | "Campaign Data:\n", 341 | " client_id campaign duration pdays previous poutcome y month day\n", 342 | "0 0 1 261 999 0 nonexistent no may 13\n", 343 | "1 1 1 149 999 0 nonexistent no may 19\n", 344 | "2 2 1 226 999 0 nonexistent no may 23\n", 345 | "3 3 1 151 999 0 nonexistent no may 27\n", 346 | "4 4 1 307 999 0 nonexistent no may 3\n", 347 | "\n", 348 | "Economic Data:\n", 349 | " client_id emp_var_rate cons_price_idx euribor3m nr_employed\n", 350 | "0 0 1.1 93.994 4.857 5191.0\n", 351 | "1 1 1.1 93.994 4.857 5191.0\n", 352 | "2 2 1.1 93.994 4.857 5191.0\n", 353 | "3 3 1.1 93.994 4.857 5191.0\n", 354 | "4 4 1.1 93.994 4.857 5191.0\n" 355 | ] 356 | } 357 | ], 358 | "source": [ 359 | "# Define the columns for each table\n", 360 | "client_columns = ['client_id', 'age', 'job', 'marital', 'education', 'credit_default', 'housing', 'loan']\n", 361 | "campaign_columns = ['client_id', 'campaign', 'duration', 'pdays', 'previous', 'poutcome', 'y', 'month', 'day']\n", 362 | "economic_columns = ['client_id', 'emp_var_rate', 'cons_price_idx', 'euribor3m', 'nr_employed']\n", 363 | "\n", 364 | "# Create DataFrames for each table (copies, so later edits leave the original untouched)\n", 365 | "client = bank_marketing_data[client_columns].copy()\n", 366 | "campaign = bank_marketing_data[campaign_columns].copy()\n", 367 | "economics = bank_marketing_data[economic_columns].copy()\n", 368 | "\n", 369 | "# Print the first few rows of each DataFrame 
to verify the split\n", 370 | "print(\"Client Data:\")\n", 371 | "print(client.head())\n", 372 | "\n", 373 | "print(\"\\nCampaign Data:\")\n", 374 | "print(campaign.head())\n", 375 | "\n", 376 | "print(\"\\nEconomic Data:\")\n", 377 | "print(economics.head())" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "id": "a77f1836", 383 | "metadata": {}, 384 | "source": [ 385 | "#### Instruction 3:\n", 386 | "\n", 387 | "Rename the column \"client_id\" to \"id\" in client (leave as-is in the other subsets); \"duration\" to \"contact_duration\",\n", 388 | "\"previous\" to \"previous_campaign_contacts\", \"y\" to \"campaign_outcome\", \"poutcome\" to \"previous_outcome\", and \n", 389 | "\"campaign\" to \"number_contacts\" in campaign; and \"euribor3m\" to \"euribor_three_months\" and \"nr_employed\" to\n", 390 | "\"number_employed\" in economics." 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": 4, 396 | "id": "6f49b4d5", 397 | "metadata": {}, 398 | "outputs": [ 399 | { 400 | "name": "stdout", 401 | "output_type": "stream", 402 | "text": [ 403 | "Client Data:\n", 404 | " id age job marital education credit_default housing loan\n", 405 | "0 0 56 housemaid married basic.4y no no no\n", 406 | "1 1 57 services married high.school unknown no no\n", 407 | "2 2 37 services married high.school no yes no\n", 408 | "3 3 40 admin. 
married basic.6y no no no\n", 409 | "4 4 56 services married high.school no no yes\n", 410 | "\n", 411 | "Campaign Data:\n", 412 | " client_id number_contacts contact_duration pdays \\\n", 413 | "0 0 1 261 999 \n", 414 | "1 1 1 149 999 \n", 415 | "2 2 1 226 999 \n", 416 | "3 3 1 151 999 \n", 417 | "4 4 1 307 999 \n", 418 | "\n", 419 | " previous_campaign_contacts previous_outcome campaign_outcome month day \n", 420 | "0 0 nonexistent no may 13 \n", 421 | "1 0 nonexistent no may 19 \n", 422 | "2 0 nonexistent no may 23 \n", 423 | "3 0 nonexistent no may 27 \n", 424 | "4 0 nonexistent no may 3 \n", 425 | "\n", 426 | "Economic Data:\n", 427 | " client_id emp_var_rate cons_price_idx euribor_three_months \\\n", 428 | "0 0 1.1 93.994 4.857 \n", 429 | "1 1 1.1 93.994 4.857 \n", 430 | "2 2 1.1 93.994 4.857 \n", 431 | "3 3 1.1 93.994 4.857 \n", 432 | "4 4 1.1 93.994 4.857 \n", 433 | "\n", 434 | " number_employed \n", 435 | "0 5191.0 \n", 436 | "1 5191.0 \n", 437 | "2 5191.0 \n", 438 | "3 5191.0 \n", 439 | "4 5191.0 \n" 440 | ] 441 | } 442 | ], 443 | "source": [ 444 | "# Rename columns in the client DataFrame\n", 445 | "client.rename(columns={'client_id': 'id'}, inplace=True)\n", 446 | "\n", 447 | "# Rename columns in the campaign DataFrame\n", 448 | "campaign.rename(columns={'duration': 'contact_duration',\n", 449 | " 'previous': 'previous_campaign_contacts',\n", 450 | " 'y': 'campaign_outcome',\n", 451 | " 'poutcome': 'previous_outcome',\n", 452 | " 'campaign': 'number_contacts'}, inplace=True)\n", 453 | "\n", 454 | "# Rename columns in the economic DataFrame\n", 455 | "economics.rename(columns={'euribor3m': 'euribor_three_months',\n", 456 | " 'nr_employed': 'number_employed'}, inplace=True)\n", 457 | "\n", 458 | "# Print the first few rows of each DataFrame to verify the renaming of columns\n", 459 | "print(\"Client Data:\")\n", 460 | "print(client.head())\n", 461 | "\n", 462 | "print(\"\\nCampaign Data:\")\n", 463 | "print(campaign.head())\n", 464 | "\n", 465 | 
"print(\"\\nEconomic Data:\")\n", 466 | "print(economics.head())" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "id": "08f13310", 472 | "metadata": {}, 473 | "source": [ 474 | "#### Instruction 4:\n", 475 | "\n", 476 | "Clean the \"education\" column, changing \".\" to \"_\" and \"unknown\" to NumPy's null values." 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 5, 482 | "id": "f9a9ddd1", 483 | "metadata": {}, 484 | "outputs": [], 485 | "source": [ 486 | "# Replace \".\" with \"_\" (regex=False treats \".\" as a literal character, not a wildcard)\n", 487 | "client['education'] = client['education'].str.replace('.', '_', regex=False)\n", 488 | "\n", 489 | "# Replace \"unknown\" with NumPy's NaN (assign the result instead of a chained in-place call)\n", 490 | "client['education'] = client['education'].replace('unknown', np.nan)" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 6, 496 | "id": "61f72247", 497 | "metadata": {}, 498 | "outputs": [ 499 | { 500 | "name": "stdout", 501 | "output_type": "stream", 502 | "text": [ 503 | "The 'education' column has been successfully cleaned.\n" 504 | ] 505 | } 506 | ], 507 | "source": [ 508 | "# Count values that still contain a literal \".\" in the \"education\" column (NaN counts as no match)\n", 509 | "dot_count = client['education'].str.contains('.', regex=False, na=False).sum()\n", 510 | "\n", 511 | "# Count remaining \"unknown\" values in the \"education\" column\n", 512 | "unknown_count = (client['education'] == 'unknown').sum()\n", 513 | "\n", 514 | "if dot_count == 0 and unknown_count == 0:\n", 515 | " print(\"The 'education' column has been successfully cleaned.\")\n", 516 | "else:\n", 517 | " print(f\"The 'education' column still contains {dot_count} '.' and {unknown_count} 'unknown' values.\")" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "id": "7606b988", 523 | "metadata": {}, 524 | "source": [ 525 | "#### Instruction 5:\n", 526 | "\n", 527 | "Remove periods from the \"job\" column." 
528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": 7, 533 | "id": "196ea3e0", 534 | "metadata": {}, 535 | "outputs": [ 536 | { 537 | "name": "stdout", 538 | "output_type": "stream", 539 | "text": [ 540 | "Periods have been successfully removed from the 'job' column.\n" 541 | ] 542 | } 543 | ], 544 | "source": [ 545 | "# Remove \".\" from the \"job\" column (regex=False treats \".\" as a literal character)\n", 546 | "client['job'] = client['job'].str.replace('.', '', regex=False)\n", 547 | "\n", 548 | "# Check that no value in the column still contains a period\n", 549 | "if not client['job'].str.contains('.', regex=False).any():\n", 550 | " print(\"Periods have been successfully removed from the 'job' column.\")\n", 551 | "else:\n", 552 | " print(\"Periods still exist in the job column.\")" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "id": "9a0eb339", 558 | "metadata": {}, 559 | "source": [ 560 | "#### Instruction 6:\n", 561 | "\n", 562 | "Convert \"success\" and \"failure\" in the \"previous_outcome\" and \"campaign_outcome\" columns to binary (1 or 0), \n", 563 | "along with changing \"nonexistent\" to NumPy's null values in \"previous_outcome\"." 
564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 8, 569 | "id": "fe0ace15", 570 | "metadata": {}, 571 | "outputs": [ 572 | { 573 | "name": "stdout", 574 | "output_type": "stream", 575 | "text": [ 576 | "Conversions and changes in 'previous_outcome' and 'campaign_outcome' columns were successfully applied.\n" 577 | ] 578 | } 579 | ], 580 | "source": [ 581 | "# Change \"nonexistent\" to NumPy's NaN in \"previous_outcome\"\n", 582 | "campaign['previous_outcome'] = campaign['previous_outcome'].replace('nonexistent', np.nan)\n", 583 | "\n", 584 | "# Convert to binary (1 or 0): \"previous_outcome\" holds \"success\"/\"failure\", \"campaign_outcome\" (originally \"y\") holds \"yes\"/\"no\"\n", 585 | "campaign['previous_outcome'] = campaign['previous_outcome'].map({'success': 1, 'failure': 0})\n", 586 | "campaign['campaign_outcome'] = campaign['campaign_outcome'].map({'yes': 1, 'no': 0})\n", 587 | "\n", 588 | "# Check that none of the original string labels remain\n", 589 | "if ('success' not in campaign['previous_outcome'].values and\n", 590 | " 'failure' not in campaign['previous_outcome'].values and\n", 591 | " 'nonexistent' not in campaign['previous_outcome'].values and\n", 592 | " 'yes' not in campaign['campaign_outcome'].values and\n", 593 | " 'no' not in campaign['campaign_outcome'].values):\n", 594 | " print(\"Conversions and changes in 'previous_outcome' and 'campaign_outcome' columns were successfully applied.\")\n", 595 | "else:\n", 596 | " print(\"Conversions and changes were not fully applied in 'previous_outcome' and 'campaign_outcome' columns.\")" 597 | ] 598 | }, 599 | { 600 | "cell_type": "markdown", 601 | "id": "fa0e6239", 602 | "metadata": {}, 603 | "source": [ 604 | "#### Instruction 7:\n", 605 | "\n", 606 | "Add a column called campaign_id in campaign, where all rows have a value of 1." 
607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": 9, 612 | "id": "6b4ba272", 613 | "metadata": { 614 | "scrolled": true 615 | }, 616 | "outputs": [ 617 | { 618 | "data": { 619 | "text/html": [ 620 | "

\n", 621 | "\n", 634 | "\n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | "

	client_id	number_contacts	contact_duration	pdays	previous_outcome	campaign_outcome	month	day	campaign_id
0	0	1	261	999	NaN	NaN	may	13	1
1	1	1	149	999	NaN	NaN	may	19	1
2	2	1	226	999	NaN	NaN	may	23	1
3	3	1	151	999	NaN	NaN	may	27	1
4	4	1	307	999	NaN	NaN	may	3	1

\n", 718 | "

" 719 | ], 720 | "text/plain": [ 721 | " client_id number_contacts contact_duration pdays \\\n", 722 | "0 0 1 261 999 \n", 723 | "1 1 1 149 999 \n", 724 | "2 2 1 226 999 \n", 725 | "3 3 1 151 999 \n", 726 | "4 4 1 307 999 \n", 727 | "\n", 728 | " previous_campaign_contacts previous_outcome campaign_outcome month day \\\n", 729 | "0 0 NaN NaN may 13 \n", 730 | "1 0 NaN NaN may 19 \n", 731 | "2 0 NaN NaN may 23 \n", 732 | "3 0 NaN NaN may 27 \n", 733 | "4 0 NaN NaN may 3 \n", 734 | "\n", 735 | " campaign_id \n", 736 | "0 1 \n", 737 | "1 1 \n", 738 | "2 1 \n", 739 | "3 1 \n", 740 | "4 1 " 741 | ] 742 | }, 743 | "execution_count": 9, 744 | "metadata": {}, 745 | "output_type": "execute_result" 746 | } 747 | ], 748 | "source": [ 749 | "# Add a new column 'campaign_id' with all values set to 1\n", 750 | "campaign['campaign_id'] = 1\n", 751 | "\n", 752 | "# Check if the column was successfully created\n", 753 | "campaign.head()" 754 | ] 755 | }, 756 | { 757 | "cell_type": "code", 758 | "execution_count": null, 759 | "id": "a15240cc", 760 | "metadata": {}, 761 | "outputs": [], 762 | "source": [] 763 | }, 764 | { 765 | "cell_type": "markdown", 766 | "id": "c91d964e", 767 | "metadata": {}, 768 | "source": [ 769 | "#### Instruction 8:\n", 770 | "\n", 771 | "Create a datetime column called last_contact_date, in the format of \"year-month-day\", where the year is 2022, and the month and day values are taken from the \"month\" and \"day\" columns." 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 10, 777 | "id": "f473a21f", 778 | "metadata": {}, 779 | "outputs": [ 780 | { 781 | "name": "stdout", 782 | "output_type": "stream", 783 | "text": [ 784 | "The 'last_contact_date' column was successfully created.\n" 785 | ] 786 | }, 787 | { 788 | "data": { 789 | "text/html": [ 790 | "

\n", 791 | "\n", 804 | "\n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | "

	client_id	number_contacts	contact_duration	pdays	previous_outcome	campaign_outcome	month	day	campaign_id	year	last_contact_date
0	0	1	261	999	NaN	NaN	may	13	1	2022	2022-05-13
1	1	1	149	999	NaN	NaN	may	19	1	2022	2022-05-19
2	2	1	226	999	NaN	NaN	may	23	1	2022	2022-05-23
3	3	1	151	999	NaN	NaN	may	27	1	2022	2022-05-27
4	4	1	307	999	NaN	NaN	may	3	1	2022	2022-05-03

\n", 900 | "

" 901 | ], 902 | "text/plain": [ 903 | " client_id number_contacts contact_duration pdays \\\n", 904 | "0 0 1 261 999 \n", 905 | "1 1 1 149 999 \n", 906 | "2 2 1 226 999 \n", 907 | "3 3 1 151 999 \n", 908 | "4 4 1 307 999 \n", 909 | "\n", 910 | " previous_campaign_contacts previous_outcome campaign_outcome month day \\\n", 911 | "0 0 NaN NaN may 13 \n", 912 | "1 0 NaN NaN may 19 \n", 913 | "2 0 NaN NaN may 23 \n", 914 | "3 0 NaN NaN may 27 \n", 915 | "4 0 NaN NaN may 3 \n", 916 | "\n", 917 | " campaign_id year last_contact_date \n", 918 | "0 1 2022 2022-05-13 \n", 919 | "1 1 2022 2022-05-19 \n", 920 | "2 1 2022 2022-05-23 \n", 921 | "3 1 2022 2022-05-27 \n", 922 | "4 1 2022 2022-05-03 " 923 | ] 924 | }, 925 | "execution_count": 10, 926 | "metadata": {}, 927 | "output_type": "execute_result" 928 | } 929 | ], 930 | "source": [ 931 | "# Add the \"year\" column with the value 2022 to the 'campaign_data' DataFrame\n", 932 | "campaign['year'] = 2022\n", 933 | "\n", 934 | "# Create a datetime column \"last_contact_date\"\n", 935 | "campaign['last_contact_date'] = pd.to_datetime(\n", 936 | " campaign['year'].astype(str) + '-' +\n", 937 | " campaign['month'].astype(str) + '-' +\n", 938 | " campaign['day'].astype(str),\n", 939 | " errors='coerce'\n", 940 | ")\n", 941 | "\n", 942 | "# Check if the \"last_contact_date\" column was successfully created.\n", 943 | "if 'last_contact_date' in campaign.columns:\n", 944 | " print(\"The 'last_contact_date' column was successfully created.\")\n", 945 | "else:\n", 946 | " print(\"The 'last_contact_date' column was not created.\")\n", 947 | "\n", 948 | "# Print the first few rows of 'campaign_data' DataFrame to verify the creation of the date column\n", 949 | "campaign.head()" 950 | ] 951 | }, 952 | { 953 | "cell_type": "code", 954 | "execution_count": 11, 955 | "id": "d7ef5f26", 956 | "metadata": {}, 957 | "outputs": [ 958 | { 959 | "data": { 960 | "text/plain": [ 961 | "dtype('\n", 1307 | "\n", 1320 | "\n", 1321 | " \n", 1322 | " \n", 
1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | " \n", 1377 | " \n", 1378 | " \n", 1379 | " \n", 1380 | " \n", 1381 | " \n", 1382 | " \n", 1383 | " \n", 1384 | " \n", 1385 | " \n", 1386 | " \n", 1387 | " \n", 1388 | " \n", 1389 | " \n", 1390 | " \n", 1391 | " \n", 1392 | " \n", 1393 | " \n", 1394 | " \n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | " \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | "

	id	age	job	marital	education	credit_default	housing	loan
0	0	56	housemaid	married	basic_4y	no	no	no
1	1	57	services	married	high_school	unknown	no	no
2	2	37	services	married	high_school	no	yes	no
3	3	40	admin	married	basic_6y	no	no	no
4	4	56	services	married	high_school	no	no	yes
...	...	...	...	...	...	...	...	...
41183	41183	73	retired	married	professional_course	no	yes	no
41184	41184	46	blue-collar	married	professional_course	no	no	no
41185	41185	56	retired	married	university_degree	no	yes	no
41186	41186	44	technician	married	professional_course	no	no	no
41187	41187	74	retired	married	professional_course	no	yes	no

\n", 1458 | "

41188 rows × 8 columns

\n", 1459 | "" 1460 | ], 1461 | "text/plain": [ 1462 | " id age job marital education credit_default \\\n", 1463 | "0 0 56 housemaid married basic_4y no \n", 1464 | "1 1 57 services married high_school unknown \n", 1465 | "2 2 37 services married high_school no \n", 1466 | "3 3 40 admin married basic_6y no \n", 1467 | "4 4 56 services married high_school no \n", 1468 | "... ... ... ... ... ... ... \n", 1469 | "41183 41183 73 retired married professional_course no \n", 1470 | "41184 41184 46 blue-collar married professional_course no \n", 1471 | "41185 41185 56 retired married university_degree no \n", 1472 | "41186 41186 44 technician married professional_course no \n", 1473 | "41187 41187 74 retired married professional_course no \n", 1474 | "\n", 1475 | " housing loan \n", 1476 | "0 no no \n", 1477 | "1 no no \n", 1478 | "2 yes no \n", 1479 | "3 no no \n", 1480 | "4 no yes \n", 1481 | "... ... ... \n", 1482 | "41183 yes no \n", 1483 | "41184 no no \n", 1484 | "41185 yes no \n", 1485 | "41186 no no \n", 1486 | "41187 yes no \n", 1487 | "\n", 1488 | "[41188 rows x 8 columns]" 1489 | ] 1490 | }, 1491 | "execution_count": 21, 1492 | "metadata": {}, 1493 | "output_type": "execute_result" 1494 | } 1495 | ], 1496 | "source": [ 1497 | "# Confirm if all three tables were successfully created \n", 1498 | "# and if all the data was successfully inserted into the three different tables\n", 1499 | "\n", 1500 | "# Read client_data from database to pandas dataframe\n", 1501 | "client_df = pd.read_sql_query('SELECT * FROM client_data', engine)\n", 1502 | "client_df" 1503 | ] 1504 | }, 1505 | { 1506 | "cell_type": "code", 1507 | "execution_count": 22, 1508 | "id": "3f83cdb5", 1509 | "metadata": {}, 1510 | "outputs": [ 1511 | { 1512 | "data": { 1513 | "text/html": [ 1514 | "

\n", 1515 | "\n", 1528 | "\n", 1529 | " \n", 1530 | " \n", 1531 | " \n", 1532 | " \n", 1533 | " \n", 1534 | " \n", 1535 | " \n", 1536 | " \n", 1537 | " \n", 1538 | " \n", 1539 | " \n", 1540 | " \n", 1541 | " \n", 1542 | " \n", 1543 | " \n", 1544 | " \n", 1545 | " \n", 1546 | " \n", 1547 | " \n", 1548 | " \n", 1549 | " \n", 1550 | " \n", 1551 | " \n", 1552 | " \n", 1553 | " \n", 1554 | " \n", 1555 | " \n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | " \n", 1563 | " \n", 1564 | " \n", 1565 | " \n", 1566 | " \n", 1567 | " \n", 1568 | " \n", 1569 | " \n", 1570 | " \n", 1571 | " \n", 1572 | " \n", 1573 | " \n", 1574 | " \n", 1575 | " \n", 1576 | " \n", 1577 | " \n", 1578 | " \n", 1579 | " \n", 1580 | " \n", 1581 | " \n", 1582 | " \n", 1583 | " \n", 1584 | " \n", 1585 | " \n", 1586 | " \n", 1587 | " \n", 1588 | " \n", 1589 | " \n", 1590 | " \n", 1591 | " \n", 1592 | " \n", 1593 | " \n", 1594 | " \n", 1595 | " \n", 1596 | " \n", 1597 | " \n", 1598 | " \n", 1599 | " \n", 1600 | " \n", 1601 | " \n", 1602 | " \n", 1603 | " \n", 1604 | " \n", 1605 | " \n", 1606 | " \n", 1607 | " \n", 1608 | " \n", 1609 | " \n", 1610 | " \n", 1611 | " \n", 1612 | " \n", 1613 | " \n", 1614 | " \n", 1615 | " \n", 1616 | " \n", 1617 | " \n", 1618 | " \n", 1619 | " \n", 1620 | " \n", 1621 | " \n", 1622 | " \n", 1623 | " \n", 1624 | " \n", 1625 | " \n", 1626 | " \n", 1627 | " \n", 1628 | " \n", 1629 | " \n", 1630 | " \n", 1631 | " \n", 1632 | " \n", 1633 | " \n", 1634 | " \n", 1635 | " \n", 1636 | " \n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | " \n", 1656 | " \n", 1657 | " \n", 1658 | " \n", 1659 | " \n", 1660 | " \n", 1661 | " \n", 1662 | " \n", 1663 | " \n", 1664 | " \n", 1665 | " \n", 1666 | " \n", 1667 | " \n", 1668 | " \n", 1669 | " 
\n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | " \n", 1674 | " \n", 1675 | " \n", 1676 | " \n", 1677 | "

	client_id	number_contacts	contact_duration	pdays	previous_campaign_contacts	previous_outcome	campaign_outcome	campaign_id	last_contact_date
0	0	1	261	999	0	NaN	None	1	2022-05-13
1	1	1	149	999	0	NaN	None	1	2022-05-19
2	2	1	226	999	0	NaN	None	1	2022-05-23
3	3	1	151	999	0	NaN	None	1	2022-05-27
4	4	1	307	999	0	NaN	None	1	2022-05-03
...	...	...	...	...	...	...	...	...	...
41183	41183	1	334	999	0	NaN	None	1	2022-11-30
41184	41184	1	383	999	0	NaN	None	1	2022-11-06
41185	41185	2	189	999	0	NaN	None	1	2022-11-24
41186	41186	1	442	999	0	NaN	None	1	2022-11-17
41187	41187	3	239	999	1	0.0	None	1	2022-11-23

\n", 1678 | "

41188 rows × 9 columns

\n", 1679 | "

" 1680 | ], 1681 | "text/plain": [ 1682 | " client_id number_contacts contact_duration pdays \\\n", 1683 | "0 0 1 261 999 \n", 1684 | "1 1 1 149 999 \n", 1685 | "2 2 1 226 999 \n", 1686 | "3 3 1 151 999 \n", 1687 | "4 4 1 307 999 \n", 1688 | "... ... ... ... ... \n", 1689 | "41183 41183 1 334 999 \n", 1690 | "41184 41184 1 383 999 \n", 1691 | "41185 41185 2 189 999 \n", 1692 | "41186 41186 1 442 999 \n", 1693 | "41187 41187 3 239 999 \n", 1694 | "\n", 1695 | " previous_campaign_contacts previous_outcome campaign_outcome \\\n", 1696 | "0 0 NaN None \n", 1697 | "1 0 NaN None \n", 1698 | "2 0 NaN None \n", 1699 | "3 0 NaN None \n", 1700 | "4 0 NaN None \n", 1701 | "... ... ... ... \n", 1702 | "41183 0 NaN None \n", 1703 | "41184 0 NaN None \n", 1704 | "41185 0 NaN None \n", 1705 | "41186 0 NaN None \n", 1706 | "41187 1 0.0 None \n", 1707 | "\n", 1708 | " campaign_id last_contact_date \n", 1709 | "0 1 2022-05-13 \n", 1710 | "1 1 2022-05-19 \n", 1711 | "2 1 2022-05-23 \n", 1712 | "3 1 2022-05-27 \n", 1713 | "4 1 2022-05-03 \n", 1714 | "... ... ... \n", 1715 | "41183 1 2022-11-30 \n", 1716 | "41184 1 2022-11-06 \n", 1717 | "41185 1 2022-11-24 \n", 1718 | "41186 1 2022-11-17 \n", 1719 | "41187 1 2022-11-23 \n", 1720 | "\n", 1721 | "[41188 rows x 9 columns]" 1722 | ] 1723 | }, 1724 | "execution_count": 22, 1725 | "metadata": {}, 1726 | "output_type": "execute_result" 1727 | } 1728 | ], 1729 | "source": [ 1730 | "# Read campaign_data from database to pandas dataframe\n", 1731 | "campaign_df = pd.read_sql_query('SELECT * FROM campaign_data', engine)\n", 1732 | "campaign_df" 1733 | ] 1734 | }, 1735 | { 1736 | "cell_type": "code", 1737 | "execution_count": 23, 1738 | "id": "25ebcd41", 1739 | "metadata": {}, 1740 | "outputs": [ 1741 | { 1742 | "data": { 1743 | "text/html": [ 1744 | "

\n", 1745 | "\n", 1758 | "\n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " \n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " \n", 1773 | " \n", 1774 | " \n", 1775 | " \n", 1776 | " \n", 1777 | " \n", 1778 | " \n", 1779 | " \n", 1780 | " \n", 1781 | " \n", 1782 | " \n", 1783 | " \n", 1784 | " \n", 1785 | " \n", 1786 | " \n", 1787 | " \n", 1788 | " \n", 1789 | " \n", 1790 | " \n", 1791 | " \n", 1792 | " \n", 1793 | " \n", 1794 | " \n", 1795 | " \n", 1796 | " \n", 1797 | " \n", 1798 | " \n", 1799 | " \n", 1800 | " \n", 1801 | " \n", 1802 | " \n", 1803 | " \n", 1804 | " \n", 1805 | " \n", 1806 | " \n", 1807 | " \n", 1808 | " \n", 1809 | " \n", 1810 | " \n", 1811 | " \n", 1812 | " \n", 1813 | " \n", 1814 | " \n", 1815 | " \n", 1816 | " \n", 1817 | " \n", 1818 | " \n", 1819 | " \n", 1820 | " \n", 1821 | " \n", 1822 | " \n", 1823 | " \n", 1824 | " \n", 1825 | " \n", 1826 | " \n", 1827 | " \n", 1828 | " \n", 1829 | " \n", 1830 | " \n", 1831 | " \n", 1832 | " \n", 1833 | " \n", 1834 | " \n", 1835 | " \n", 1836 | " \n", 1837 | " \n", 1838 | " \n", 1839 | " \n", 1840 | " \n", 1841 | " \n", 1842 | " \n", 1843 | " \n", 1844 | " \n", 1845 | " \n", 1846 | " \n", 1847 | " \n", 1848 | " \n", 1849 | " \n", 1850 | " \n", 1851 | " \n", 1852 | " \n", 1853 | " \n", 1854 | " \n", 1855 | " \n", 1856 | " \n", 1857 | " \n", 1858 | " \n", 1859 | "

	client_id	emp_var_rate	cons_price_idx	euribor_three_months	number_employed
0	0	1.1	93.994	4.857	5191.0
1	1	1.1	93.994	4.857	5191.0
2	2	1.1	93.994	4.857	5191.0
3	3	1.1	93.994	4.857	5191.0
4	4	1.1	93.994	4.857	5191.0
...	...	...	...	...	...
41183	41183	-1.1	94.767	1.028	4963.6
41184	41184	-1.1	94.767	1.028	4963.6
41185	41185	-1.1	94.767	1.028	4963.6
41186	41186	-1.1	94.767	1.028	4963.6
41187	41187	-1.1	94.767	1.028	4963.6

\n", 1860 | "

41188 rows × 5 columns

\n", 1861 | "

" 1862 | ], 1863 | "text/plain": [ 1864 | " client_id emp_var_rate cons_price_idx euribor_three_months \\\n", 1865 | "0 0 1.1 93.994 4.857 \n", 1866 | "1 1 1.1 93.994 4.857 \n", 1867 | "2 2 1.1 93.994 4.857 \n", 1868 | "3 3 1.1 93.994 4.857 \n", 1869 | "4 4 1.1 93.994 4.857 \n", 1870 | "... ... ... ... ... \n", 1871 | "41183 41183 -1.1 94.767 1.028 \n", 1872 | "41184 41184 -1.1 94.767 1.028 \n", 1873 | "41185 41185 -1.1 94.767 1.028 \n", 1874 | "41186 41186 -1.1 94.767 1.028 \n", 1875 | "41187 41187 -1.1 94.767 1.028 \n", 1876 | "\n", 1877 | " number_employed \n", 1878 | "0 5191.0 \n", 1879 | "1 5191.0 \n", 1880 | "2 5191.0 \n", 1881 | "3 5191.0 \n", 1882 | "4 5191.0 \n", 1883 | "... ... \n", 1884 | "41183 4963.6 \n", 1885 | "41184 4963.6 \n", 1886 | "41185 4963.6 \n", 1887 | "41186 4963.6 \n", 1888 | "41187 4963.6 \n", 1889 | "\n", 1890 | "[41188 rows x 5 columns]" 1891 | ] 1892 | }, 1893 | "execution_count": 23, 1894 | "metadata": {}, 1895 | "output_type": "execute_result" 1896 | } 1897 | ], 1898 | "source": [ 1899 | "# Read economics_data from database to pandas dataframe\n", 1900 | "economics_df = pd.read_sql_query('SELECT * FROM economics_data', engine)\n", 1901 | "economics_df" 1902 | ] 1903 | }, 1904 | { 1905 | "cell_type": "markdown", 1906 | "id": "c2aa7e6a", 1907 | "metadata": {}, 1908 | "source": [ 1909 | "***The end!***" 1910 | ] 1911 | }, 1912 | { 1913 | "cell_type": "code", 1914 | "execution_count": null, 1915 | "id": "37c76c77", 1916 | "metadata": {}, 1917 | "outputs": [], 1918 | "source": [] 1919 | }, 1920 | { 1921 | "cell_type": "code", 1922 | "execution_count": null, 1923 | "id": "554ab333", 1924 | "metadata": {}, 1925 | "outputs": [], 1926 | "source": [] 1927 | } 1928 | ], 1929 | "metadata": { 1930 | "editor": "DataCamp Workspace", 1931 | "kernelspec": { 1932 | "display_name": "Python 3 (ipykernel)", 1933 | "language": "python", 1934 | "name": "python3" 1935 | }, 1936 | "language_info": { 1937 | "codemirror_mode": { 1938 | "name": "ipython", 1939 
| "version": 3 1940 | }, 1941 | "file_extension": ".py", 1942 | "mimetype": "text/x-python", 1943 | "name": "python", 1944 | "nbconvert_exporter": "python", 1945 | "pygments_lexer": "ipython3", 1946 | "version": "3.11.5" 1947 | } 1948 | }, 1949 | "nbformat": 4, 1950 | "nbformat_minor": 5 1951 | } 1952 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/Designing a Bank Marketing Database-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "02077ee3-e1e4-4fc5-8de1-16e987afa5fb", 6 | "metadata": {}, 7 | "source": [ 8 | "![piggy_bank](piggy_bank.jpg)\n", 9 | "\n", 10 | "
\n", 11 | "\n", 12 | "Personal loans are a lucrative revenue stream for banks. The typical interest rate of a two year loan in the United Kingdom is [around 10%](https://www.experian.com/blogs/ask-experian/whats-a-good-interest-rate-for-a-personal-loan/). This might not sound like a lot, but in September 2022 alone UK consumers borrowed [around £1.5 billion](https://www.ukfinance.org.uk/system/files/2022-12/Household%20Finance%20Review%202022%20Q3-%20Final.pdf), which would mean approximately £300 million in interest generated by banks over two years!\n", 13 | "\n", 14 | "You have been asked to work with a bank to clean and store the data they collected as part of a recent marketing campaign, which aimed to get customers to take out a personal loan. They plan to conduct more marketing campaigns going forward so would like you to set up a PostgreSQL database to store this campaign's data, designing the schema in a way that would allow data from future campaigns to be easily imported. \n", 15 | "\n", 16 | "They have supplied you with a csv file called `\"bank_marketing.csv\"`, which you will need to clean, reformat, and split, in order to save separate files based on the tables you will create. It is recommended to use `pandas` for these tasks.\n", 17 | "\n", 18 | "Lastly, you will write the SQL code that the bank can execute to create the tables and populate with the data from the csv files. As the bank are quite strict about their security, you'll save SQL files as multiline string variables that they can then use to create the database on their end. 
\n", 19 | "\n", 20 | "You have been asked to design a database that will have three tables:\n", 21 | "\n", 22 | "## client\n", 23 | "\n", 24 | "| column | data type | description | original column in dataset |\n", 25 | "|--------|-----------|-------------|----------------------------|\n", 26 | "| `id` | `serial` | Client ID - primary key | `client_id` |\n", 27 | "| `age` | `integer` | Client's age in years | `age` |\n", 28 | "| `job` | `text` | Client's type of job | `job` |\n", 29 | "| `marital` | `text` | Client's marital status | `marital` | \n", 30 | "| `education` | `text` | Client's level of education | `education` |\n", 31 | "| `credit_default` | `boolean` | Whether the client's credit is in default | `credit_default` |\n", 32 | "| `housing` | `boolean` | Whether the client has an existing housing loan (mortgage) | `housing` | \n", 33 | "| `loan` | `boolean` | Whether the client has an existing personal loan | `loan` |\n", 34 | "\n", 35 | "
\n", 36 | "\n", 37 | "## campaign\n", 38 | "\n", 39 | "| column | data type | description | original column in dataset |\n", 40 | "|--------|-----------|-------------|----------------------------|\n", 41 | "| `campaign_id` | `serial` | Campaign ID - primary key | N/A - new column |\n", 42 | "| `client_id` | `serial` | Client ID - references `id` in the `client` table | `client_id` |\n", 43 | "| `number_contacts` | `integer` | Number of contact attempts to the client in the current campaign | `campaign` |\n", 44 | "| `contact_duration` | `integer` | Last contact duration in seconds | `duration` |\n", 45 | "| `pdays` | `integer` | Number of days since contact in previous campaign (`999` = not previously contacted) | `pdays` |\n", 46 | "| `previous_campaign_contacts` | `integer` | Number of contact attempts to the client in the previous campaign | `previous` |\n", 47 | "| `previous_outcome` | `boolean` | Outcome of the previous campaign | `poutcome` |\n", 48 | "| `campaign_outcome` | `boolean` | Outcome of the current campaign | `y` |\n", 49 | "| `last_contact_date` | `date` | Last date the client was contacted | A combination of `day`, `month`, and the newly created `year` |\n", 50 | "\n", 51 | "
\n", 52 | "\n", 53 | "## economics\n", 54 | "\n", 55 | "| column | data type | description | original column in dataset |\n", 56 | "|--------|-----------|-------------|----------------------------|\n", 57 | "| `client_id` | `serial` | Client ID - references `id` in the `client` table | `client_id` |\n", 58 | "| `emp_var_rate` | `float` | Employment variation rate (quarterly indicator) | `emp_var_rate` |\n", 59 | "| `cons_price_idx` | `float` | Consumer price index (monthly indicator) | `cons_price_idx` |\n", 60 | "| `euribor_three_months` | `float` | Euro Interbank Offered Rate (euribor) three month rate (daily indicator) | `euribor3m` |\n", 61 | "| `number_employed` | `float` | Number of employees (quarterly indicator)| `nr_employed` |" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 1, 67 | "id": "e2edad3c-8286-4983-b5b7-35d94fd78023", 68 | "metadata": { 69 | "executionCancelledAt": null, 70 | "executionTime": 1057, 71 | "lastExecutedAt": 1686069923599, 72 | "lastScheduledRunId": null, 73 | "lastSuccessfullyExecutedCode": "# Start coding...\nimport pandas as pd\nimport numpy as np\n\n# Read in csv\nmarketing = pd.read_csv(\"bank_marketing.csv\")\n\n# Split into the three tables\nclient = marketing[[\"client_id\", \"age\", \"job\", \"marital\", \"education\", \n \"credit_default\", \"housing\", \"loan\"]]\ncampaign = marketing[[\"client_id\", \"campaign\", \"month\", \"day\", \n \"duration\", \"pdays\", \"previous\", \"poutcome\", \"y\"]]\neconomics = marketing[[\"client_id\", \"emp_var_rate\", \"cons_price_idx\", \n \"euribor3m\", \"nr_employed\"]]\n\n# Rename client_id in the client table\nclient.rename(columns={\"client_id\": \"id\"}, inplace=True)\n\n# Rename duration, y, and campaign columns\ncampaign.rename(columns={\"duration\": \"contact_duration\", \n \"y\": \"campaign_outcome\", \n \"campaign\": \"number_contacts\",\n \"previous\": \"previous_campaign_contacts\",\n \"poutcome\": \"previous_outcome\"}, \n inplace=True)\n\n# Rename 
euribor3m and nr_employed\neconomics.rename(columns={\"euribor3m\": \"euribor_three_months\", \n \"nr_employed\": \"number_employed\"}, \n inplace=True)\n\n# Clean education column\nclient[\"education\"] = client[\"education\"].str.replace(\".\", \"_\")\nclient[\"education\"] = client[\"education\"].replace(\"unknown\", np.NaN)\n\n# Clean job column\nclient[\"job\"] = client[\"job\"].str.replace(\".\", \"\")\n\n# Change campaign_outcome to binary values\ncampaign[\"campaign_outcome\"] = campaign[\"campaign_outcome\"].map({\"yes\": 1, \n \"no\": 0})\n\n# Convert previous_outcome to binary values\ncampaign[\"previous_outcome\"] = campaign[\"previous_outcome\"].replace(\"nonexistent\", \n np.NaN)\ncampaign[\"previous_outcome\"] = campaign[\"previous_outcome\"].map({\"success\": 1, \n \"failure\": 0})\n\n# Add campaign_id column\ncampaign[\"campaign_id\"] = 1\n\n# Capitalize month column\ncampaign[\"month\"] = campaign[\"month\"].str.capitalize()\n\n# Add year column\ncampaign[\"year\"] = \"2022\"\n\n# Convert day to string\ncampaign[\"day\"] = campaign[\"day\"].astype(str)\n\n# Add last_contact_date column\ncampaign[\"last_contact_date\"] = campaign[\"year\"] + \"-\" + campaign[\"month\"] + \"-\" + campaign[\"day\"]\n\n# Convert to datetime\ncampaign[\"last_contact_date\"] = pd.to_datetime(campaign[\"last_contact_date\"], \n format=\"%Y-%b-%d\")\n\n# Drop unnecessary columns\ncampaign.drop(columns=[\"month\", \"day\", \"year\"], inplace=True)\n\n# Save tables to individual csv files\nclient.to_csv(\"client.csv\", index=False)\ncampaign.to_csv(\"campaign.csv\", index=False)\neconomics.to_csv(\"economics.csv\", index=False)\n\n# Store and print database_design\nclient_table = \"\"\"CREATE TABLE client\n(\n id SERIAL PRIMARY KEY,\n age INTEGER,\n job TEXT,\n marital TEXT,\n education TEXT,\n credit_default BOOLEAN,\n housing BOOLEAN,\n loan BOOLEAN\n);\n\\copy client from 'client.csv' DELIMITER ',' CSV HEADER\n\"\"\"\n\ncampaign_table = \"\"\"CREATE TABLE campaign\n(\n 
campaign_id SERIAL PRIMARY KEY,\n client_id SERIAL references client (id),\n number_contacts INTEGER,\n contact_duration INTEGER,\n pdays INTEGER,\n previous_campaign_contacts INTEGER,\n previous_outcome BOOLEAN,\n campaign_outcome BOOLEAN,\n last_contact_date DATE \n);\n\\copy campaign from 'campaign.csv' DELIMITER ',' CSV HEADER\n\"\"\"\n\neconomics_table = \"\"\"CREATE TABLE economics\n(\n client_id SERIAL references client (id),\n emp_var_rate FLOAT,\n cons_price_idx FLOAT,\n euribor_three_months FLOAT,\n number_employed FLOAT\n);\n\\copy economics from 'economics.csv' DELIMITER ',' CSV HEADER\n\"\"\"" 74 | }, 75 | "outputs": [], 76 | "source": [ 77 | "# Import libraries\n", 78 | "import pandas as pd\n", 79 | "import numpy as np\n", 80 | "\n", 81 | "# suppress warnings from final output\n", 82 | "import warnings\n", 83 | "warnings.simplefilter(\"ignore\")\n", 84 | "\n", 85 | "# Start coding here..." 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "id": "9d94086e", 91 | "metadata": {}, 92 | "source": [ 93 | "#### Instruction 1: Read in bank_marketing.csv as a pandas DataFrame." 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 2, 99 | "id": "ac57dc04", 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/html": [ 105 | "

| | client_id | age | job | marital | education | credit_default | housing | loan | contact | month | ... | campaign | pdays | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | ... | 1 | 999 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | ... | 1 | 999 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 2 | 37 | services | married | high.school | no | yes | no | telephone | may | ... | 1 | 999 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | ... | 1 | 999 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 4 | 56 | services | married | high.school | no | no | yes | telephone | may | ... | 1 | 999 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |