├── SQL
│   ├── SQL Project
│   │   ├── Readme.md
│   │   └── DANNYSDINER CASE-STUDY.sql
│   ├── README.md
│   └── Most Powerful women
│       ├── Most powerful women SQL data wrangling.sql
│       └── The Worlds 100 Most Powerful Women.csv
├── POWER BI
│   └── README.md
├── EXCEL
│   └── README.md
├── R
│   ├── README.md
│   ├── Cyclistic bike share.R
│   └── Quantium Chips.R
├── Python Projects
│   ├── README.md
│   └── WEB SCRAPPING E- COMMERCE SITE (EBAY).ipynb
└── README.md
/SQL/SQL Project/Readme.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/POWER BI/README.md:
--------------------------------------------------------------------------------
1 | #### 📍 I AM SO GLAD THAT YOU HAVE TAKEN AN INTEREST IN MY WORK.
2 | * THANK YOU FOR YOUR TIME.
3 | 🔴 PLEASE FOLLOW THE LINK BELOW TO INTERACT WITH MY POWER BI DASHBOARDS ON NOVYPRO.
4 |
5 |
6 |
7 | https://www.novypro.com/profile_projects/mary-anene
8 |
--------------------------------------------------------------------------------
/EXCEL/README.md:
--------------------------------------------------------------------------------
1 | #### I APPRECIATE YOUR INTEREST IN MY EXCEL PORTFOLIO.
2 | * PLEASE FIND MY EXCEL WORK VIA THE LINK BELOW:
3 |
4 |
5 | https://www.canva.com/design/DAFdrRT376E/z1Nv_fbSAF9VXERRIO6QCQ/view?utm_content=DAFdrRT376E&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton
6 |
--------------------------------------------------------------------------------
/SQL/README.md:
--------------------------------------------------------------------------------
1 | ## DATA ANALYTICS WITH SQL
2 |
3 | Welcome to the world of data analytics with SQL! In this folder, you will find a collection of projects and resources that showcase the power and versatility of SQL in data analysis.
4 |
5 | SQL, or Structured Query Language, is a powerful tool for managing and analyzing data. It allows you to perform complex queries and extract valuable insights from large and complex data sets. In this repository, you will find examples of SQL queries and scripts that demonstrate my proficiency in data analysis using SQL.
6 |
7 | You'll find a diverse set of projects covering a wide range of data analysis techniques, such as database creation, data cleaning, database querying, and critical thinking. These projects showcase my ability to extract insights from data and present them in a clear and visually appealing way. I have also included examples of reports that I have created using SQL, demonstrating my ability to analyze and present data in a way that is easy for decision makers to understand.
8 |
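For example, an aggregation along the lines of the sketch below (over a hypothetical sales table, not one of the datasets in this repository) is typical of the kind of query you will find in these scripts:

```sql
-- Hypothetical example: total revenue per product category, highest first
SELECT category,
       SUM(quantity * unit_price) AS total_revenue
FROM sales
GROUP BY category
ORDER BY total_revenue DESC;
```
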
9 | I have also included resources to help you improve your data analysis skills with SQL.
10 |
11 | I hope you find this repository to be informative and engaging, and I welcome any opportunity to discuss my qualifications and experience further. Thank you for visiting my repository and happy data analyzing!
12 |
--------------------------------------------------------------------------------
/R/README.md:
--------------------------------------------------------------------------------
1 | ### R PROGRAMMING DATA ANALYSIS
2 |
3 | Welcome to the world of data analytics with R! In this repository, you will find a collection of projects and resources that showcase the power and versatility of R in data analysis.
4 |
5 | R is a popular programming language and software environment for data analysis. It is widely used by statisticians, data scientists, and researchers because of its powerful tools for data manipulation, visualization, and statistical modeling.
6 |
7 | In this repository, you will find examples of R code that demonstrate my proficiency in data analysis using R. You'll find a diverse set of projects that covers a wide range of data analysis techniques, such as data cleaning, data exploration, data visualization, statistical modeling, and machine learning. These projects showcase my ability to extract insights from data and present them in a clear and visually appealing way.
8 |
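As a small illustration of the kind of tidyverse workflow used in these scripts (a sketch only, assuming a hypothetical sales.csv with date and amount columns):

```r
# Minimal sketch: total sales per month from a hypothetical sales.csv
library(tidyverse)
library(lubridate)

sales <- read_csv("sales.csv")                  # assumed columns: date, amount
sales %>%
  mutate(month = floor_date(date, "month")) %>% # bucket each record into its month
  group_by(month) %>%
  summarise(total = sum(amount, na.rm = TRUE)) %>%
  ggplot(aes(x = month, y = total)) +           # plot monthly totals as a bar chart
  geom_col()
```
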
9 | I have also included resources to help you improve your data analysis skills with R. These resources include tutorials, exercises, and examples that will help you learn the basics of R and how to use it for data analysis. Whether you are new to data analysis or a seasoned data analyst looking to improve your skills, you will find something of value in this repository.
10 |
11 | I hope you find this repository to be informative and engaging. I welcome any opportunity to discuss my qualifications and experience further. Thank you for visiting my repository and happy data analyzing with R!
12 |
--------------------------------------------------------------------------------
/Python Projects/README.md:
--------------------------------------------------------------------------------
1 | ### DATA ANALYTICS WITH PYTHON
2 | Welcome to the world of data analytics with Python! In this file, you will find a collection of projects and resources that showcase the power and versatility of Python in data analysis.
3 |
4 | Python is a powerful and versatile programming language that is widely used in data analysis. It has a large ecosystem of libraries and frameworks that make it easy to perform complex data analysis tasks, such as data mining, data visualization, statistical modeling, and machine learning.
5 |
6 | In this file, you will find examples of Python code that demonstrate my proficiency in data analysis using Python. You'll find a diverse set of projects that covers a wide range of data analysis techniques, such as data cleaning, data exploration, data visualization, and statistical modeling. These projects showcase my ability to extract insights from data and present them in a clear and visually appealing way.
7 |
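As a small illustration of the kind of pandas workflow used in these projects (a sketch only, assuming a hypothetical sales.csv with date and amount columns):

```python
# Minimal sketch: total sales per month from a hypothetical sales.csv
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["date"])  # assumed columns: date, amount
monthly = (
    sales.groupby(sales["date"].dt.to_period("M"))["amount"]  # bucket rows by month
    .sum()
    .reset_index(name="total_sales")
)
print(monthly)
```
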
8 | I have also included resources to help you improve your data analysis skills with Python. These resources include tutorials, exercises, and examples that will help you learn the basics of Python and how to use it for data analysis. Whether you are new to data analysis or a seasoned data analyst looking to improve your skills, you will find something of value in this folder.
9 |
10 | I hope you find this repository to be informative and engaging. I welcome any opportunity to discuss my qualifications and experience further. Thank you for visiting my repository and happy data analyzing with Python!
11 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DATA ANALYTICS PORTFOLIO
2 | Welcome to my Data Analyst Portfolio!
3 |
4 | I am a data analyst with experience in using various tools and technologies to collect, organize, and analyze data to inform business decisions. I am proficient in Python, R, SQL, Excel, Power BI, and Tableau, and have a solid understanding of data analytics techniques such as dashboard building, report writing, data mining, data merging, statistics, and critical thinking.
5 |
6 | In this portfolio, I have included a variety of projects that showcase my data analytics skills. You will find links to the dashboards and reports I have created using various tools such as Power BI, Tableau, and Excel. These projects demonstrate my ability to analyze and present data in a clear and visually appealing way, making it easy for decision makers to understand the insights and take action.
7 |
8 | I have also included my work in programming languages such as Python, SQL and R. These projects showcase my ability to extract valuable insights from large and complex data sets, and to combine data from multiple sources to create a comprehensive view of the data.
9 |
10 | In addition to my technical skills, I also pride myself on my critical thinking and problem-solving abilities. I am able to approach data analysis with a strategic mindset, and to identify key issues and opportunities in the data.
11 |
12 | I am confident that my data analytics skills and experience make me an ideal candidate for any data analyst role. I hope you find my portfolio to be informative and engaging, and I welcome any opportunity to discuss my qualifications further with you.
13 |
14 | Thank you for your time. I look forward to hearing from you soon.
15 |
--------------------------------------------------------------------------------
/SQL/Most Powerful women/Most powerful women SQL data wrangling.sql:
--------------------------------------------------------------------------------
1 | # For this SQL project I will be answering a series of questions about this data set
2 | # Let's start by introducing the table
3 |
4 | SELECT *
5 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`;
6 |
7 | # 1. What is the average age of the women in the dataset?
8 |
9 | SELECT AVG(AGE)
10 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`;
11 |
12 | # 2. How many women in the dataset are from each country?
13 |
14 | SELECT COUNT(NAME) AS NUM_OF_WOMEN,
15 | LOCATION
16 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`
17 | GROUP BY LOCATION
18 | ORDER BY COUNT(NAME) DESC;
19 |
20 | #3. What are the top 5 most common occupations among the women in the dataset?
21 |
22 | SELECT CATEGORY,
23 | COUNT(CATEGORY) AS NUM_OF_WOMEN
24 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`
25 | GROUP BY 1
26 | ORDER BY NUM_OF_WOMEN DESC;
27 |
28 | #4. How many women in the dataset are over the age of 40?
29 |
30 | SELECT COUNT(AGE) AS OVER40WOMEN
31 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`
32 | WHERE AGE > 40
33 | ORDER BY OVER40WOMEN DESC;
34 |
35 | #-----OR---
36 |
37 | SELECT `the worlds 100 most powerful women`.NAME,
38 | AGE AS OVER40WOMEN
39 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`
40 | WHERE AGE > 40
41 | ORDER BY OVER40WOMEN DESC;
42 |
43 | #5. What is the youngest and oldest age of the women in the dataset?
44 | #YOUNGEST
45 | SELECT `the worlds 100 most powerful women`.RANK,
46 | `the worlds 100 most powerful women`.NAME,
47 | AGE AS YOUNGEST_WOMAN, LOCATION, CATEGORY
48 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`
49 | ORDER BY YOUNGEST_WOMAN
50 | LIMIT 1;
51 |
52 | #OLDEST
53 |
54 | SELECT `the worlds 100 most powerful women`.RANK,
55 | `the worlds 100 most powerful women`.NAME,
56 | AGE AS OLDEST_WOMAN, LOCATION, CATEGORY
57 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`
58 | ORDER BY OLDEST_WOMAN DESC
59 | LIMIT 1;
60 |
61 | #6. How many women in the dataset are from North America?
62 |
63 | SELECT LOCATION,
64 | COUNT(LOCATION) AS NORTH_AMERICA
65 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`
66 | GROUP BY LOCATION
67 | HAVING LOCATION IN ('United States', 'Honduras', 'Barbados');
68 |
69 |
70 |
71 | #7. How many women in the dataset are in the Technology category?
72 |
73 | SELECT CATEGORY,
74 | COUNT(CATEGORY) AS WOMEN_IN_TECH
75 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`
76 | GROUP BY CATEGORY
77 | HAVING CATEGORY = 'Technology';
78 |
79 | #8. How many women in the dataset are from Asia?
80 |
81 | SELECT LOCATION,
82 | COUNT(LOCATION) AS ASIAN_GIRLPOWER
83 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`
84 | GROUP BY LOCATION
85 | HAVING LOCATION IN ('China', 'India', 'Singapore', 'Indonesia', 'Bangladesh', 'Japan', 'South Korea', 'Taiwan')
86 | ORDER BY ASIAN_GIRLPOWER DESC;
87 |
88 | #---or----
89 |
90 | SELECT LOCATION ,
91 | COUNT(
92 | CASE
93 | WHEN LOCATION IN ('China', 'India', 'Singapore', 'Indonesia', 'Bangladesh', 'Japan', 'South Korea', 'Taiwan') THEN 1 END) AS GIRLS
94 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`
95 | GROUP BY 1
96 | ORDER BY GIRLS DESC;
97 |
98 | # 9. What is the standard deviation of the ages of the women in the dataset?
99 |
100 | SELECT STDDEV(AGE) AS STDDEV_AGE
101 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`;
102 |
103 |
104 | #10. ASSIGNING LOCATIONS TO A TEMPORARY CONTINENT COLUMN USING CASE
105 | SELECT *,
106 | CASE
107 | WHEN LOCATION IN ('United States', 'Honduras', 'Barbados') THEN 'NORTH AMERICA'
108 | WHEN LOCATION IN ('Germany', 'Belgium', 'Italy', 'United Kingdom', 'Spain', 'France', 'Denmark', 'Turkey', 'Finland', 'Slovakia') THEN 'EUROPE'
109 | WHEN LOCATION IN ('China', 'India', 'Taiwan', 'Singapore', 'Indonesia', 'Bangladesh', 'Japan', 'South Korea') THEN 'ASIA'
110 | WHEN LOCATION IN('Australia','New Zealand') THEN 'OCEANIA'
111 | WHEN LOCATION IN ('Nigeria', 'Tanzania') THEN 'AFRICA'
112 | ELSE 'WHAT ARE YOU!' END AS CONTINENT
113 | FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`
114 | ORDER BY CONTINENT;
115 |
116 |
117 | #11. What is the median age of the women in the dataset?
118 |
119 | # Median via window functions (MySQL 8+): average of the middle value(s)
120 | SELECT AVG(AGE) AS MEDIAN_AGE
121 | FROM (SELECT AGE, ROW_NUMBER() OVER (ORDER BY AGE) AS RN, COUNT(*) OVER () AS CNT
122 |       FROM `data_analytics_ with_sql`.`the worlds 100 most powerful women`) RANKED
123 | WHERE RN IN (FLOOR((CNT + 1) / 2), CEIL((CNT + 1) / 2));
124 |
125 |
--------------------------------------------------------------------------------
/SQL/Most Powerful women/The Worlds 100 Most Powerful Women.csv:
--------------------------------------------------------------------------------
1 | RANK,NAME,AGE,LOCATION,CATEGORY
2 | 1.,Ursula von der Leyen,64,Belgium,Politics & Policy
3 | 2.,Christine Lagarde,66,Germany,Politics & Policy
4 | 3.,Kamala Harris,58,United States,Politics & Policy
5 | 4.,Mary Barra,60,United States,Business
6 | 5.,Abigail Johnson,60,United States,Money
7 | 6.,Melinda French Gates,58,United States,Philanthropy
8 | 7.,Giorgia Meloni,45,Italy,Politics & Policy
9 | 8.,Karen Lynch,58,United States,Business
10 | 9.,Julie Sweet,55,United States,Business
11 | 10.,Jane Fraser,55,United States,Finance
12 | 11.,MacKenzie Scott,52,United States,Impact
13 | 12.,Kristalina Georgieva,69,United States,Politics & Policy
14 | 13.,Rosalind Brewer,59,United States,Business
15 | 14.,Emma Walmsley,53,United Kingdom,Business
16 | 15.,Ana Patricia Botín,62,Spain,Finance
17 | 16.,Gail Boudreaux,62,United States,Business
18 | 17.,Tsai Ing-wen,66,Taiwan,Politics & Policy
19 | 18.,Ruth Porat,65,United States,Technology
20 | 19.,Safra Catz,60,United States,Technology
21 | 20.,Martina Merz,59,Germany,Business
22 | 21.,Carol Tomé,65,United States,Business
23 | 22.,Judith McKenna,56,United States,Business
24 | 23.,Susan Wojcicki,54,United States,Technology
25 | 24.,Oprah Winfrey,68,United States,Media & Entertainment
26 | 25.,Nancy Pelosi,82,United States,Politics & Policy
27 | 26.,Laurene Powell Jobs & family,59,United States,Philanthropy
28 | 27.,Amanda Blanc,55,United Kingdom,Business
29 | 28.,Amy Hood,50,United States,Technology
30 | 29.,Catherine MacGregor,50,France,Business
31 | 30.,Phebe Novakovic,65,United States,Entrepreneurs
32 | 31.,Gwynne Shotwell,59,United States,Technology
33 | 32.,Shari Redstone,68,United States,Lifestyle
34 | 33.,Janet Yellen,76,United States,Politics & Policy
35 | 34.,Jessica Tan,45,China,Finance
36 | 35.,Ho Ching,69,Singapore,Finance
37 | 36.,Nirmala Sitharaman,63,India,Politics & Policy
38 | 37.,Thasunda Brown Duckett,49,United States,Finance
39 | 38.,Kathy Warden,51,United States,Business
40 | 39.,"Marianne Lake, Jennifer Piepszak",-,United States,Finance
41 | 40.,Jacinda Ardern,42,New Zealand,Politics & Policy
42 | 41.,Dana Walden,58,United States,Media & Entertainment
43 | 42.,Sheikh Hasina Wajed,75,Bangladesh,Politics & Policy
44 | 43.,Mary Callahan Erdoes,55,United States,Money
45 | 44.,Adena Friedman,53,United States,Finance
46 | 45.,Gina Rinehart,68,Australia,Business
47 | 46.,Lynn Martin,46,United States,Finance
48 | 47.,Sri Mulyani Indrawati,60,Indonesia,Politics & Policy
49 | 48.,Vicki Hollub,63,United States,Entrepreneurs
50 | 49.,Nicke Widyawati,54,Indonesia,Business
51 | 50.,Lisa Su,53,United States,Entrepreneurs
52 | 51.,Tricia Griffith,58,United States,Business
53 | 52.,Shemara Wikramanayake,60,Australia,Business
54 | 53.,Roshni Nadar Malhotra,41,India,Technology
55 | 54.,Madhabi Puri Buch,56,India,Politics & Policy
56 | 55.,Sinead Gorman,45,United Kingdom,Business
57 | 56.,Tokiko Shimizu,57,Japan,Finance
58 | 57.,Yuriko Koike,70,Japan,Politics & Policy
59 | 58.,Jennifer Salke,58,United States,Media & Entertainment
60 | 59.,Jenny Johnson,58,United States,Money
61 | 60.,Hana Al Rostamani,-,United Arab Emirates,Finance
62 | 61.,Donna Langley,54,United States,Media & Entertainment
63 | 62.,Dong Mingzhu,68,China,Business
64 | 63.,Judy Faulkner,79,United States,Entrepreneurs
65 | 64.,Robyn Denholm,59,Australia,Business
66 | 65.,Suzanne Scott,56,United States,Media & Entertainment
67 | 66.,Lynn Good,63,United States,Business
68 | 67.,Soma Mondal,59,India,Business
69 | 68.,Belén Garijo,62,Germany,Business
70 | 69.,Melanie Kreis,51,Germany,Business
71 | 70.,Bela Bajaria,51,United States,Media & Entertainment
72 | 71.,Paula Santilli,-,Mexico,Business
73 | 72.,Laura Cha,-,China,Finance
74 | 73.,Rihanna,34,United States,Media & Entertainment
75 | 74.,Mette Frederiksen,45,Denmark,Politics & Policy
76 | 75.,Mary Meeker,63,United States,Venture Capital
77 | 76.,Joey Wat,51,China,Business
78 | 77.,Kiran Mazumdar-Shaw,69,India,Business
79 | 78.,Jenny Lee,50,Singapore,Venture Capital
80 | 79.,Taylor Swift,32,United States,Media & Entertainment
81 | 80.,Beyoncé Knowles,41,United States,Media & Entertainment
82 | 81.,Güler Sabanci,67,Turkey,Business
83 | 82.,Linda Thomas-Greenfield,70,United States,Impact
84 | 83.,Sanna Marin,37,Finland,Politics & Policy
85 | 84.,Solina Chau,60,China,Philanthropy
86 | 85.,Lee Boo-jin,52,South Korea,Business
87 | 86.,Reese Witherspoon,46,United States,Media & Entertainment
88 | 87.,Zuzana Caputova,49,Slovakia,Politics & Policy
89 | 88.,Dominique Senequier,69,France,Finance
90 | 89.,Falguni Nayar,59,India,Business
91 | 90.,Julia Gillard,61,Australia,Philanthropy
92 | 91.,Ngozi Okonjo-Iweala,68,Nigeria,Politics & Policy
93 | 92.,Raja Easa Al Gurg,-,United Arab Emirates,Business
94 | 93.,Shonda Rhimes,52,United States,Entertainment
95 | 94.,Xiomara Castro,63,Honduras,Politics & Policy
96 | 95.,Samia Suluhu Hassan,62,Tanzania,Politics & Policy
97 | 96.,Dolly Parton,76,United States,Lifestyle
98 | 97.,Kirsten Green,51,United States,Money
99 | 98.,Mia Mottley,57,Barbados,Politics & Policy
100 | 99.,Mo Abudu,58,Nigeria,Media & Entertainment
101 | 100.,Mahsa Amini (posthumous),-,Iran,Politics & Policy
102 |
--------------------------------------------------------------------------------
/SQL/SQL Project/DANNYSDINER CASE-STUDY.sql:
--------------------------------------------------------------------------------
1 | /* --------------------
2 | Case Study Questions
3 | --------------------*/
4 |
5 | -- 1. What is the total amount each customer spent at the restaurant?
6 | -- 2. How many days has each customer visited the restaurant?
7 | -- 3. What was the first item from the menu purchased by each customer?
8 | -- 4. What is the most purchased item on the menu and how many times was it purchased by all customers?
9 | -- 5. Which item was the most popular for each customer?
10 | -- 6. Which item was purchased first by the customer after they became a member?
11 | -- 7. Which item was purchased just before the customer became a member?
12 | -- 8. What is the total items and amount spent for each member before they became a member?
13 | -- 9. If each $1 spent equates to 10 points and sushi has a 2x points multiplier - how many points would each customer have?
14 | -- 10. In the first week after a customer joins the program (including their join date) they earn 2x points on all items, not just sushi - how many points do customer A and B have at the end of January?
15 |
16 | -- Select table Query:
17 |
18 | --Members Table
19 | SELECT
20 | *
21 | FROM dannys_diner.dbo.members;
22 |
23 |
24 | -- Menu table
25 | SELECT
26 | *
27 | FROM dannys_diner.dbo.menu;
28 |
29 |
30 | --Sales table
31 | SELECT
32 | *
33 | FROM dannys_diner.dbo.sales;
34 |
35 |
36 | -- 1. What is the total amount each customer spent at the restaurant?
37 |
38 | SELECT
39 | s.customer_id,
40 | SUM(m.price) AS total_amount_spent
41 | FROM dannys_diner.dbo.sales s
42 | JOIN dannys_diner.dbo.menu m ON s.product_id = m.product_id
43 | GROUP BY s.customer_id
44 | ORDER BY s.customer_id;
45 |
46 |
47 |
48 | -- 2. How many days has each customer visited the restaurant?
49 |
50 | SELECT
51 | customer_id,
52 | COUNT(DISTINCT order_date) AS total_days_visited
53 | FROM dannys_diner.dbo.sales
54 | GROUP BY customer_id
55 | ORDER BY customer_id;
56 |
57 |
58 | -- 3. What was the first item from the menu purchased by each customer?
59 |
60 | WITH customer_first_purchase AS (
61 | SELECT
62 | s.customer_id, m.product_name,
63 | MIN(s.order_date) AS first_purchase_date
64 | FROM dannys_diner.dbo.sales s
65 | JOIN dannys_diner.dbo.menu m
66 | ON s.product_id = m.product_id
67 | GROUP BY s.customer_id, m.product_name
69 | )
70 | SELECT
71 | c.customer_id,
72 | c.product_name
73 | FROM customer_first_purchase c
74 | WHERE c.first_purchase_date = (
75 | SELECT MIN(first_purchase_date)
76 | FROM customer_first_purchase
77 | WHERE customer_id = c.customer_id
78 | )
79 | ORDER BY c.customer_id;
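
-- Alternative approach using a derived table of each customer's first order date: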
80 | SELECT c.customer_id, m.product_name
81 | FROM (
82 | SELECT customer_id, MIN(order_date) AS first_order_date
83 | FROM dannys_diner.dbo.sales
84 | GROUP BY customer_id
85 | ) AS c
86 | JOIN dannys_diner.dbo.sales AS s ON c.customer_id = s.customer_id AND c.first_order_date = s.order_date
87 | JOIN dannys_diner.dbo.menu AS m ON s.product_id = m.product_id
88 | ORDER BY c.customer_id;
89 |
90 |
91 |
92 | -- 4. What is the most purchased item on the menu and how many times was it purchased by all customers?
93 |
94 | SELECT
95 | M.product_id,
96 | product_name,
97 | price,
98 | COUNT(S.product_id) AS total_purchases
99 | FROM dannys_diner.dbo.menu AS M
100 | INNER JOIN dannys_diner.dbo.sales AS S
101 | ON M.product_id= S.product_id
102 | GROUP BY M.product_id, product_name, price
103 | ORDER BY total_purchases DESC
104 | ;
105 |
106 |
107 |
108 | -- 5. Which item was the most popular for each customer?
109 |
110 | WITH popular_items AS (
111 | SELECT
112 | customer_id,
113 | product_id,
114 | COUNT(*) AS order_count,
115 | ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY COUNT(*) DESC) AS rn
116 | FROM dannys_diner.dbo.sales
117 | GROUP BY customer_id, product_id
118 | )
119 | SELECT
120 | p.customer_id,
121 | m.product_name AS most_popular_item
122 | FROM popular_items p
123 | JOIN dannys_diner.dbo.menu m
124 | ON p.product_id = m.product_id
125 | WHERE p.rn = 1
126 | ORDER BY p.customer_id;
127 |
128 |
129 |
130 | -- 6. Which item was purchased first by the customer after they became a member?
131 |
132 | SELECT TOP 1 WITH TIES
133 | m.customer_id,
134 | m.join_date,
135 | s.order_date AS first_purchase_date,
136 | u.product_name AS first_purchase_item
137 | FROM dannys_diner.dbo.members m
138 | JOIN dannys_diner.dbo.sales s ON m.customer_id = s.customer_id
139 | JOIN dannys_diner.dbo.menu u ON s.product_id = u.product_id
140 | WHERE s.order_date > m.join_date
141 | ORDER BY ROW_NUMBER() OVER (PARTITION BY m.customer_id
142 |                             ORDER BY s.order_date);
143 |
144 |
145 |
146 | -- 7. Which item was purchased just before the customer became a member?
147 |
148 | SELECT TOP 1 WITH TIES
149 | m.customer_id,
150 | m.join_date,
151 | s.order_date AS last_purchase_date,
152 | u.product_name AS last_purchase_item
153 | FROM dannys_diner.dbo.members m
154 | JOIN dannys_diner.dbo.sales s
155 | ON m.customer_id = s.customer_id
156 | JOIN dannys_diner.dbo.menu u
157 | ON s.product_id = u.product_id
158 | WHERE s.order_date < m.join_date
159 | ORDER BY ROW_NUMBER() OVER (PARTITION BY m.customer_id
160 |                             ORDER BY s.order_date DESC);
161 |
162 |
163 | -- 8. What is the total items and amount spent for each member before they became a member?
164 |
165 | SELECT
166 | m.customer_id,
167 | m.join_date,
168 | COUNT(s.product_id) AS total_items,
169 | SUM(u.price) AS total_amount_spent
170 | FROM dannys_diner.dbo.members m
171 | JOIN dannys_diner.dbo.sales s
172 | ON m.customer_id = s.customer_id
173 | JOIN dannys_diner.dbo.menu u
174 | ON s.product_id = u.product_id
175 | WHERE s.order_date < m.join_date
176 | GROUP BY m.customer_id, m.join_date
177 | ORDER BY m.customer_id;
178 |
179 |
180 | -- 9. If each $1 spent equates to 10 points and sushi has a 2x points multiplier - how many points would each customer have?
181 |
182 | SELECT
183 | s.customer_id,
184 | SUM(
185 | CASE
186 | WHEN m.product_name = 'sushi' THEN 20 * m.price
187 | ELSE 10 * m.price
188 | END
189 | ) AS total_points
190 | FROM dannys_diner.dbo.sales s
191 | JOIN dannys_diner.dbo.menu m
192 | ON s.product_id = m.product_id
193 | GROUP BY s.customer_id
194 | ORDER BY s.customer_id;
195 |
196 |
197 | -- 10. In the first week after a customer joins the program (including their join date) they earn 2x points on all items, not just sushi - how many points do customer A and B have at the end of January?
198 |
199 | WITH customer_points AS (
200 | SELECT s.customer_id, s.order_date, u.product_name,
201 | CASE
202 |   -- 2x points on all items during the first week of membership
203 |   -- (from the join date through join date + 6 days)
204 |   WHEN s.order_date BETWEEN m.join_date
205 |                         AND DATEADD(DAY, 6, m.join_date)
206 |     THEN 20 * u.price
207 |   -- afterwards, sushi still earns 2x points
208 |   WHEN u.product_name = 'sushi'
209 |     THEN 20 * u.price
210 |   -- everything else earns the standard 1x (10 points per $1)
211 |   ELSE 10 * u.price
212 | END AS points
213 | FROM dannys_diner.dbo.sales s
214 | JOIN dannys_diner.dbo.menu u ON s.product_id = u.product_id
215 | JOIN dannys_diner.dbo.members m ON s.customer_id = m.customer_id
216 | WHERE s.order_date <= '2021-01-31' -- End of January
217 | AND s.order_date >= m.join_date -- points accrue from the join date onwards
218 | )
219 | SELECT customer_id, SUM(points) AS total_points
220 | FROM customer_points
221 | WHERE customer_id IN ('A', 'B')
222 | GROUP BY customer_id
223 | ORDER BY customer_id;
224 |
225 |
--------------------------------------------------------------------------------
/R/Cyclistic bike share.R:
--------------------------------------------------------------------------------
1 | library(tidyverse) #helps wrangle data
2 | library(lubridate) #helps wrangle date attributes
3 | library(ggplot2) #helps visualize data
4 | getwd() #displays your working directory
5 | setwd("/Users/Oluchukwu Anene/Documents/Cyclistic bike share")
6 |
7 |
8 |
9 |
10 |
11 | # COLLECT DATA
12 | #====================================================
13 | # Upload Divvy datasets (csv files) here
14 |
15 | Jan_2022 <- read_csv("202201-divvy-tripdata.csv")
16 | Feb_2022 <- read_csv("202202-divvy-tripdata.csv")
17 | Mar_2022 <- read_csv("202203-divvy-tripdata.csv")
18 | Apr_2022 <- read_csv("202204-divvy-tripdata.csv")
19 | May_2022 <- read_csv("202205-divvy-tripdata.csv")
20 | Jun_2022 <- read_csv("202206-divvy-tripdata.csv")
21 | Jul_2022 <- read_csv("202207-divvy-tripdata.csv")
22 | Aug_2022 <- read_csv("202208-divvy-tripdata.csv")
23 | Sep_2022 <- read_csv("202209-divvy-publictripdata.csv")
24 | Oct_2022 <- read_csv("202210-divvy-tripdata.csv")
25 | Nov_2022 <- read_csv("202211-divvy-tripdata.csv")
26 | Dec_2022 <- read_csv("202212-divvy-tripdata.csv")
27 |
28 |
29 |
30 |
31 |
32 | # WRANGLE DATA AND COMBINE INTO A SINGLE FILE
33 | #====================================================
34 |
35 | # Compare column names in each of the files
36 |
37 | colnames(Jan_2022)
38 | colnames(Feb_2022)
39 | colnames(Mar_2022)
40 | colnames(Apr_2022)
41 | colnames(May_2022)
42 | colnames(Jun_2022)
43 | colnames(Jul_2022)
44 | colnames(Aug_2022)
45 | colnames(Sep_2022)
46 | colnames(Oct_2022)
47 | colnames(Nov_2022)
48 | colnames(Dec_2022)
49 |
50 |
51 |
52 | # Inspect the dataframes and look for incongruencies
53 | str(Jan_2022)
54 | str(Feb_2022)
55 | str(Mar_2022)
56 | str(Apr_2022)
57 | str(May_2022)
58 | str(Jun_2022)
59 | str(Jul_2022)
60 | str(Aug_2022)
61 | str(Sep_2022)
62 | str(Oct_2022)
63 | str(Nov_2022)
64 | str(Dec_2022)
65 |
66 |
67 | # Stack individual month's data frames into one big data frame
68 | Bike_Share <- bind_rows(Jan_2022, Feb_2022, Mar_2022,Apr_2022, May_2022, Jun_2022, Jul_2022, Aug_2022, Sep_2022, Oct_2022, Nov_2022, Dec_2022)
69 |
70 |
71 |
72 | # Remove the start/end station name and station id columns, as these columns contain inconsistent data as well as null values, making them of little use
73 | Bike_Share <- Bike_Share %>%
74 | select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
75 |
76 |
77 |
78 |
79 | # Rename columns to give them more relatable names
80 |
81 | (Bike_Share <- rename(Bike_Share
82 | ,Ride_id = ride_id
83 | ,Ride_types = rideable_type
84 | ,Start_time = started_at
85 | ,End_time = ended_at
86 | ,Start_lat = start_lat
87 | ,Start_lng = start_lng
88 | ,End_lat = end_lat
89 | ,End_lng= end_lng
90 | ,User_types = member_casual ))
91 |
92 |
93 |
94 |
95 |
96 | #CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS
97 | #======================================================
98 | # Inspect the new table that has been created
99 | colnames(Bike_Share) #List of column names
100 | nrow(Bike_Share) #How many rows are in data frame?
101 | dim(Bike_Share) #Dimensions of the data frame?
102 | head(Bike_Share) #See the first 6 rows of data frame.
103 | tail(Bike_Share) #see the last 6 rows of data frame
104 | str(Bike_Share) #See list of columns and data types (numeric, character, etc)
105 | summary(Bike_Share) #Statistical summary of data. Mainly for numerics
106 |
107 |
108 |
109 |
110 | # Add columns that list the date, month, day, and year of each ride
111 | Bike_Share$Date <- as.Date(Bike_Share$Start_time) #The default format is yyyy-mm-dd
112 | Bike_Share$Month <- format(as.Date(Bike_Share$Date), "%m")
113 | Bike_Share$Day <- format(as.Date(Bike_Share$Date), "%d")
114 | Bike_Share$Year <- format(as.Date(Bike_Share$Date), "%Y")
115 | Bike_Share$Day_of_week <- format(as.Date(Bike_Share$Date), "%A")
116 |
117 |
118 |
119 |
120 |
121 | # Add a "ride_length" calculation to Bike_Share(in seconds)
122 | Bike_Share$Ride_length <- difftime(Bike_Share$End_time,Bike_Share$Start_time)
123 |
124 | # Inspect the structure of the columns
125 | str(Bike_Share)
126 |
127 |
128 | # Convert "ride_length" from Factor to numeric so we can run calculations on the data
129 | is.factor(Bike_Share$Ride_length)
130 | Bike_Share$Ride_length <- as.numeric(as.character(Bike_Share$Ride_length))
131 | is.numeric(Bike_Share$Ride_length)
132 |
133 | # Removing "bad" data
134 | # The data frame includes a few hundred entries where Ride_length was negative
135 | # Creating a new data frame (Bike_Share_p2) to store the cleaned data set
136 | Bike_Share_p2 <- Bike_Share[!(Bike_Share$Ride_length<0),]
137 |
138 |
139 |
140 |
141 |
142 |
143 | # DESCRIPTIVE ANALYSIS
144 | #=====================================
145 |
146 | # Descriptive analysis on Ride_length (all figures in seconds)
147 |
148 | summary(Bike_Share_p2$Ride_length)
149 |
150 | # Compare members and casual users
151 | aggregate(Bike_Share_p2$Ride_length ~ Bike_Share_p2$User_types, FUN = mean)
152 | aggregate(Bike_Share_p2$Ride_length ~ Bike_Share_p2$User_types, FUN = median)
153 | aggregate(Bike_Share_p2$Ride_length ~ Bike_Share_p2$User_types, FUN = max)
154 | aggregate(Bike_Share_p2$Ride_length ~ Bike_Share_p2$User_types, FUN = min)
155 |
156 | # Average ride time by each day for members vs casual users
157 | aggregate(Bike_Share_p2$Ride_length ~ Bike_Share_p2$User_types + Bike_Share_p2$Day_of_week, FUN = mean)
158 |
159 | # Put the days of the week in order.
160 | Bike_Share_p2$Day_of_week <- ordered(Bike_Share_p2$Day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
161 |
162 | # Run the average ride time by each day for members vs casual users
163 | aggregate(Bike_Share_p2$Ride_length ~ Bike_Share_p2$User_types + Bike_Share_p2$Day_of_week, FUN = mean)
164 |
165 | # analyze ridership data by type and weekday
166 | Bike_Share_p2 %>%
167 | mutate(weekday = wday(Start_time, label = TRUE)) %>% #creates weekday field using wday()
168 | group_by(User_types, weekday) %>% #groups by user_types and weekday
169 | summarise(number_of_rides = n() #calculates the number of rides and average duration
170 | ,average_duration = mean(Ride_length)) %>% # calculates the average duration
171 | arrange(User_types, weekday) # sorts
172 |
173 | # Plot the number of rides by rider type
174 | Bike_Share_p2 %>%
175 | mutate(weekday = wday(Start_time, label = TRUE)) %>%
176 | group_by(User_types, weekday) %>%
177 | summarise(Number_of_rides = n()
178 | ,average_duration = mean(Ride_length)) %>%
179 | arrange(User_types, weekday) %>%
180 | ggplot(aes(x = weekday, y = Number_of_rides, fill = User_types)) +
181 | geom_col(position = "dodge")
182 |
183 | # Plot for average duration
184 | Bike_Share_p2 %>%
185 | mutate(weekday = wday(Start_time, label = TRUE)) %>%
186 | group_by(User_types, weekday) %>%
187 | summarise(Number_of_rides = n()
188 | ,average_duration = mean(Ride_length)) %>%
189 | arrange(User_types, weekday) %>%
190 | ggplot(aes(x = weekday, y = average_duration, fill = User_types)) +
191 | geom_col(position = "dodge")
192 |
193 |
194 |
195 | #EXPORT SUMMARY FILE FOR FURTHER ANALYSIS
196 | #=================================================
197 |
198 | # Create a csv file to export
199 |
200 | counts <- aggregate(Bike_Share_p2$Ride_length ~ Bike_Share_p2$User_types + Bike_Share_p2$Day_of_week, FUN = mean)
201 | write.csv(counts, file = '~/Cyclistic bike share/avg_ridecyc_length.csv')
202 | write.csv(Bike_Share_p2, file = '~/Cyclistic bike share/Cyclistics_sharing.csv')
--------------------------------------------------------------------------------
/R/Quantium Chips.R:
--------------------------------------------------------------------------------
1 | ## Load required libraries
2 | library(data.table) #provides an enhanced version of data frames
3 | library(tidyverse) #helps wrangle data
4 | library(lubridate) #helps wrangle date attributes
5 | library(ggplot2) #helps visualize data
6 | library(ggmosaic)#helps visualize data
7 | library(stringr) #helps provide functions for working with strings
8 | library(readr)
9 | library(dplyr) #helps manipulate data
10 |
11 | #=======================================================================
12 | # Set the file path to load the data sets
13 | #======================================================================
14 |
15 | getwd() #displays your working directory
16 | setwd("/Users/Oluchukwu Anene/Documents/Quantium Chips data")
17 |
18 |
19 | #========================================================
20 | #Upload CSV Files
21 | #=======================================================
22 | transactionData <- read_csv("QVI_transaction_data.csv")
23 | customerData <- read_csv("QVI_purchase_behaviour.csv")
24 |
25 | #========================================================
26 | #Exploratory data analysis for transactionData
27 | #========================================================
28 | #View columns
29 | colnames(transactionData)
30 |
31 |
32 | #### Examine transaction data
33 | str(transactionData)
34 |
35 |
36 | #### Examine PROD_NAME
37 | table(transactionData[ "PROD_NAME"]) #to find the unique values and their number of occurrences
38 |
39 | #or You can use
40 |
41 | transactionData %>%
42 | count(PROD_NAME)
43 |
44 | #From the output, the product names are mostly chips, so we will be working with various chip products.
45 |
46 | 'There are salsa products in the dataset but we are only interested in the chips category, so let’s remove
47 | these.'
48 | #--------------------------------------------------------------------------------------------------------
49 |
50 | #### Remove salsa products
51 | # Create a new column indicating whether each row contains the word "salsa" in the PROD_NAME column
52 | transactionData <- transactionData %>%
53 | mutate(SALSA = grepl("salsa", tolower(PROD_NAME)))
54 |
55 | # Remove rows with the word "salsa" in the PROD_NAME column
56 | transactionData <- transactionData %>%
57 | filter(SALSA == FALSE) %>%
58 | select(-SALSA)
59 | #----------------------------------------------------------------------------------------------------------
60 |
61 | #### Summarise the data to check for nulls and possible outliers
62 | summary(transactionData)
63 |
64 | #======================================================================================================
65 | #PROD_QTY has an outlier of 200
66 | #Filter the dataset to find the outlier
67 | #=======================================================================================================
68 |
69 | # Select rows with PROD_QTY equal to 200
70 | transactionData[transactionData$PROD_QTY == 200, ]
71 |
72 | # Let's see if the customer has had other transactions
73 | transactionData[transactionData$LYLTY_CARD_NBR == 226000, ]
74 |
75 | ' It looks like this customer has only had the two transactions over the year and is not an ordinary retail
76 | customer. The customer might be buying chips for commercial purposes instead. We’ll remove this loyalty
77 | card number from further analysis.'
78 | #-----------------------------------------------------------------------------------------------------------
79 |
80 | # Use base R's subsetting method to keep only rows where LYLTY_CARD_NBR does not equal 226000
81 | transactionData <- transactionData[transactionData$LYLTY_CARD_NBR != 226000, ]
82 |
83 | # Re‐examine transaction data
84 | summary(transactionData)
85 |
86 | # That’s better. Now, let’s look at the number of transaction lines over time to see if there are any obvious data issues such as missing data.
87 | #------------------------------------------------------------------------------------------------------------------------------------------------
88 | ## Count the number of transactions by date
89 | transactionData %>%
90 | group_by(DATE) %>%
91 | summarize(n = n())
92 |
93 | # Show the first 5 transactions when sorted by date
94 | head(arrange(transactionData, DATE),5)
95 |
96 | # Show the last 5 transactions when sorted by date
97 | tail(arrange(transactionData, DATE),5)
98 |
99 | 'There are only 364 rows, meaning only 364 dates, which indicates a missing date. Let’s create a sequence of
100 | dates from 1 Jul 2018 to 30 Jun 2019 and use this to create a chart of number of transactions over time to
101 | find the missing date.'
102 | #----------------------------------------------------------------------------------------------------------------------
103 | # Create a sequence of dates from 2018-07-01 to 2019-06-30
104 | allDates <- data.frame(DATE = seq(as.Date("2018-07-01"), as.Date("2019-06-30"), by = "day"))
105 |
106 | # Count the number of transactions by date and join it with the allDates dataframe
107 | transactions_by_day <- left_join(allDates,
108 | transactionData %>% group_by(DATE) %>% summarize(transactions = n()),
109 | by = c("DATE"))
110 |
111 | # Plot transactions over time
112 | ggplot(transactions_by_day, aes(x = DATE, y = transactions)) +
113 | geom_line() +
114 | scale_x_date(date_breaks = "1 month", date_labels = "%b-%Y") +
115 | labs(title = "Transactions over time", x = "Date", y = "Number of transactions") +
116 | theme_bw() +
117 | theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
118 | #This code creates a sequence of dates between 2018-07-01 and 2019-06-30, and a data frame allDates which contains all the dates and labels them as "DATE". Then it uses group_by(), summarize() and left_join() functions to count the number of transactions by date, and join this dataframe with allDates dataframe.
119 |
120 | #The scale_x_date() function formats the x-axis with breaks set to 1 month and labels set to month-year format, and other cosmetics are set using ggplot2's theme options.
121 |
122 |
123 | 'We can see that there is an increase in purchases in December and a break in late December. Let’s zoom in
124 | on this.'
125 | #--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
126 | #Filter to December and look at individual days
127 | ggplot(transactions_by_day[month(transactions_by_day$DATE) == 12, ], aes(x = DATE, y = transactions)) +
128 | geom_line() +
129 | scale_x_date(date_breaks = "1 day", date_labels = "%d-%b") +
130 | labs(title = "Transactions over time in December", x = "Date", y = "Number of transactions") +
131 | theme_bw() +
132 | theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
133 | #This code uses the month() function to keep only December's dates, and then uses the ggplot() function to create a line plot. The scale_x_date() function formats the x-axis with breaks set to 1 day and labels set to day-month format; cosmetics are set with ggplot2's theme options.
134 |
135 |
136 | ' We can see that the increase in sales occurs in the lead-up to Christmas and that there are zero sales on
137 | Christmas day itself. This is due to shops being closed on Christmas day.
138 | Now that we are satisfied that the data no longer has outliers, we can move on to creating other features
139 | such as brand of chips or pack size from PROD_NAME. We will start with pack size.'
140 |
141 | #========================================================================================================================
142 | # Extract the pack size from PROD_NAME column
143 | transactionData$PACK_SIZE <- as.numeric(str_extract(transactionData$PROD_NAME, "\\d+"))
144 |
145 | # Check the result
146 | transactionData %>%
147 | group_by(PACK_SIZE) %>%
148 | summarize(count = n()) %>%
149 | arrange(PACK_SIZE)
150 | #===========================================================================================================================
151 |
152 | #View data set
153 | transactionData
154 | #----------------------------------------------------------------
155 | #Since we now have a pack size column, I will be removing the trailing pack size (e.g. 134g) from each chip product name, since we are analysing product types, not sizes
156 |
157 | # Remove last 4 characters from the PROD_NAME column
158 | transactionData$PROD_NAME<- substr(transactionData$PROD_NAME, 1, nchar(transactionData$PROD_NAME) - 4)
159 |
160 | #View data set
161 | transactionData
162 | #==================================================================================================================
163 |
164 | # Plot a histogram of PACK_SIZE
165 | ggplot(transactionData, aes(x = PACK_SIZE)) +
166 | geom_histogram(binwidth = 10, color = "black", fill = "white") +
167 | labs(title = "Histogram of Pack Size", x = "Pack Size", y = "Frequency") +
168 | theme_bw()
169 |
170 | 'Pack sizes created look reasonable and now to create brands, we can use the first word in PROD_NAME to
171 | work out the brand name'
172 |
173 | #====================================================================================================================
174 | # Extract the brand from PROD_NAME column
175 | transactionData$BRAND <- substr(transactionData$PROD_NAME,1,regexpr(" ",transactionData$PROD_NAME)-1)
176 | transactionData$BRAND<-toupper(transactionData$BRAND)
177 |
178 | # Check the result
179 | transactionData %>%
180 | group_by(BRAND) %>%
181 | summarize(count = n()) %>%
182 | arrange(-count)
183 |
184 | # Clean brand names
185 | transactionData$BRAND <-
186 | ifelse(transactionData$BRAND == "RED", "RRD",
187 | ifelse(transactionData$BRAND == "SNBTS", "SUNBITES",
188 | ifelse(transactionData$BRAND == "INFZNS", "INFUZIONS",
189 | ifelse(transactionData$BRAND == "WW", "WOOLWORTHS",
190 | ifelse(transactionData$BRAND == "SMITH", "SMITHS",
191 | ifelse(transactionData$BRAND == "NCC", "NATURAL",
192 | ifelse(transactionData$BRAND == "DORITO", "DORITOS",
193 | ifelse(transactionData$BRAND == "GRAIN", "GRNWVES",
194 | transactionData$BRAND))))))))
195 |
196 | # Check the result
197 | transactionData %>%
198 | group_by(BRAND) %>%
199 | summarize(count = n()) %>%
200 | arrange(BRAND)
201 |
202 |
203 |
204 | #=======================================================================================================================
205 | #Examining customer data
206 | #Now that the transaction dataset has been wrangled and cleaned, I can look at the customer dataset.
207 |
208 | #=======================================================================================================================
209 | #### Examining customer data
210 | str(customerData)
211 |
212 | summary(customerData)
213 | #-----------------------------------------------------------------------------------------------------------------------
214 | #Checking the LIFESTAGE and PREMIUM_CUSTOMER columns.
215 | #Examine the values of LIFESTAGE
216 | customerData %>%
217 | group_by(LIFESTAGE) %>%
218 | summarize(count = n()) %>%
219 | arrange(-count)
220 |
221 | #--------------------------------------------------------------------------------------------------------------------------
222 | #Examine the values of PREMIUM_CUSTOMER
223 | customerData %>%
224 | group_by(PREMIUM_CUSTOMER) %>%
225 | summarize(count = n()) %>%
226 | arrange(-count)
227 |
228 | 'As there do not seem to be any issues with the customer data, we can now go ahead and join the transaction
229 | and customer data sets together'
230 | #============================================================================================================
231 | #### Merge transaction data to customer data
232 | Chipsdata <- merge(transactionData, customerData, all.x = TRUE)
233 |
234 | #View merged dataset
235 | Chipsdata
236 | #=============================================================================================================
237 | #check if some customers were not matched on by checking for nulls.
238 |
239 | #count the number of rows with missing LIFESTAGE
240 | Chipsdata %>%
241 | filter(is.na(LIFESTAGE)) %>%
242 | nrow()
243 |
244 |
245 | #count the number of rows with missing PREMIUM_CUSTOMER
246 | Chipsdata %>%
247 | filter(is.na(PREMIUM_CUSTOMER)) %>%
248 | nrow()
249 | #There are no missing values
250 |
251 |
252 |
253 | ###Data exploration is now complete!
254 | #===================================================================
255 |
256 |
257 | #### Data analysis on customer segments
258 | #=======================================================================================================
259 | 'Now that the data is ready for analysis, we can define some metrics of interest to the client:
260 | • Who spends the most on chips (total sales), describing customers by lifestage and how premium their
261 | general purchasing behaviour is
262 | • How many customers are in each segment
263 | • How many chips are bought per customer by segment
264 | • What’s the average chip price by customer segment
265 | We could also ask our data team for more information. Examples are:
266 | • The customer’s total spend over the period and total spend for each transaction to understand what
267 | proportion of their grocery spend is on chips
268 | • Proportion of customers in each customer segment overall to compare against the mix of customers
269 | who purchase chips
270 | Let’s start with calculating total sales by LIFESTAGE and PREMIUM_CUSTOMER and plotting the split by
271 | these segments to describe which customer segment contribute most to chip sales.'
272 | #===================================================================================================================
273 | #==================================================================================================================
274 | # Sum total sales by LIFESTAGE and PREMIUM_CUSTOMER
275 | sales <- Chipsdata %>%
276 | group_by(LIFESTAGE, PREMIUM_CUSTOMER) %>%
277 | summarize(SALES = sum(TOT_SALES))
278 |
279 | #### Create plot
280 | p <- ggplot(data = sales) +
281 | geom_mosaic(aes(weight = SALES, x = product(PREMIUM_CUSTOMER, LIFESTAGE),
282 | fill = PREMIUM_CUSTOMER)) +
283 | labs(x = "Lifestage", y = "Premium customer flag", title = "Proportion of
284 | sales") +
285 | theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
286 | #### Plot and label with proportion of sales
287 | p + geom_text(data = ggplot_build(p)$data[[1]], aes(x = (xmin + xmax)/2 , y =
288 | (ymin + ymax)/2, label = as.character(paste(round(.wt/sum(.wt),3)*100,
289 | '%'))))
290 |
291 |
292 |
293 |
294 | 'Sales are coming mainly from Budget - older families, Mainstream - young singles/couples, and Mainstream
295 | - retirees'
296 | #==========================================================================================================
297 | #Let’s see if the higher sales are due to there being more customers who buy chips.
298 | #==========================================================================================================
299 | # Number of customers by LIFESTAGE and PREMIUM_CUSTOMER
300 | customers <- Chipsdata %>%
301 | group_by(LIFESTAGE, PREMIUM_CUSTOMER) %>%
302 | summarize(CUSTOMERS = n_distinct(LYLTY_CARD_NBR)) %>%
303 | arrange(-CUSTOMERS)
304 |
305 |
306 | #### Create Plot
307 | p <- ggplot(data = customers) +
308 | geom_mosaic(aes(weight = CUSTOMERS, x = product(PREMIUM_CUSTOMER,
309 | LIFESTAGE), fill = PREMIUM_CUSTOMER)) +
310 | labs(x = "Lifestage", y = "Premium customer flag", title = "Proportion of
311 | customers") +
312 | theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
313 | #### Plot and label with Proportion of customers
314 | p + geom_text(data = ggplot_build(p)$data[[1]], aes(x = (xmin + xmax)/2 , y =
315 | (ymin + ymax)/2, label = as.character(paste(round(.wt/sum(.wt),3)*100,
316 | '%'))))
317 |
318 | 'There are more Mainstream - young singles/couples and Mainstream - retirees who buy chips. This
319 | contributes to there being more sales to these customer segments but this is not a major driver for the Budget
320 | - Older families segment.'
321 |
322 | #======================================================================================================================================
323 | #Higher sales may also be driven by more units of chips being bought per customer.
324 | #======================================================================================================================================
325 |
326 | # Average number of units per customer by LIFESTAGE and PREMIUM_CUSTOMER
327 | avg_units <- Chipsdata %>%
328 | group_by(LIFESTAGE, PREMIUM_CUSTOMER) %>%
329 | summarize(AVG = sum(PROD_QTY) / n_distinct(LYLTY_CARD_NBR)) %>%
330 | arrange(-AVG)
331 |
332 | ## Create Plot
333 | ggplot(data = avg_units, aes(weight = AVG, x = LIFESTAGE, fill =
334 | PREMIUM_CUSTOMER)) +
335 | geom_bar(position = position_dodge()) +
336 | labs(x = "Lifestage", y = "Avg units per transaction", title = "Units per
337 | customer") +
338 | theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
339 |
340 | 'Older families and young families in general buy more chips per customer'
341 |
342 | #=========================================================================================================
343 | #NOW for the average price per unit chips bought for each customer segment as this is also a driver of total sales.
344 | #========================================================================================================
345 | # Average price per unit by LIFESTAGE and PREMIUM_CUSTOMER
346 | avg_price <- Chipsdata %>%
347 | group_by(LIFESTAGE, PREMIUM_CUSTOMER) %>%
348 | summarize(AVG = sum(TOT_SALES) / sum(PROD_QTY)) %>%
349 | arrange(-AVG)
350 |
351 | #### Create Plot
352 | ggplot(data = avg_price, aes(weight = AVG, x = LIFESTAGE, fill =
353 | PREMIUM_CUSTOMER)) +
354 | geom_bar(position = position_dodge()) +
355 | labs(x = "Lifestage", y = "Avg price per unit", title = "Price per unit") +
356 | theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
357 |
358 | 'Mainstream midage and young singles and couples are more willing to pay more per packet of chips
359 | compared to their budget and premium counterparts. This may be due to premium shoppers being more likely to
360 | buy healthy snacks and when they buy chips, this is mainly for entertainment purposes rather than their own
361 | consumption. This is also supported by there being fewer premium midage and young singles and couples
362 | buying chips compared to their mainstream counterparts.'
363 |
364 | #===================================================================================================================
365 | #As the difference in average price per unit isn’t large, we can check if this difference is statistically different.
366 | #===================================================================================================================
367 |
368 | # Perform independent t-test between Mainstream vs Premium
369 | pricePerUnit <- Chipsdata %>%
370 | mutate(price = TOT_SALES/PROD_QTY)
371 |
372 | result <- pricePerUnit %>%
373 | filter(LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES")) %>%
374 | group_by(PREMIUM_CUSTOMER) %>%
375 | summarize(mean = mean(price))
376 |
377 | t.test(pricePerUnit[pricePerUnit$PREMIUM_CUSTOMER == "Mainstream" & pricePerUnit$LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES"),'price'],
378 | pricePerUnit[pricePerUnit$PREMIUM_CUSTOMER != "Mainstream" & pricePerUnit$LIFESTAGE %in% c("YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES"),'price'],
379 | alternative = "greater")
380 |
381 |
382 | 'The t-test results in a p-value < 2.2e-16, i.e. the unit price for mainstream, young and mid-age singles and
383 | couples is significantly higher than that of budget or premium, young and midage singles and couples.'
384 |
385 |
386 | #===============================================================================================================
387 | 'Checking customer segments that contribute the most to sales to retain them or further
388 | increase sales. Let’s look at Mainstream - young singles/couples. For instance, let’s find out if they tend to
389 | buy a particular brand of chips.'
390 | #=================================================================================================================
391 |
392 | # Deep dive into mainstream, young singles/couples
393 | segment1 <- Chipsdata %>%
394 | filter(LIFESTAGE == "YOUNG SINGLES/COUPLES", PREMIUM_CUSTOMER == "Mainstream")
395 | other <- Chipsdata %>%
396 | filter(!(LIFESTAGE == "YOUNG SINGLES/COUPLES" & PREMIUM_CUSTOMER == "Mainstream"))
397 |
398 | # Brand affinity compared to the rest of the population
399 | quantity_segment1 <- segment1 %>%
400 | summarize(total_qty = sum(PROD_QTY)) %>%
401 | pull(total_qty)
402 | quantity_other <- other %>%
403 | summarize(total_qty = sum(PROD_QTY)) %>%
404 | pull(total_qty)
405 |
406 | quantity_segment1_by_brand <- segment1 %>%
407 | group_by(BRAND) %>%
408 | summarize(targetSegment = sum(PROD_QTY)/quantity_segment1)
409 | quantity_other_by_brand <- other %>%
410 | group_by(BRAND) %>%
411 | summarize(other = sum(PROD_QTY)/quantity_other)
412 |
413 | brand_proportions <- quantity_segment1_by_brand %>%
414 | left_join(quantity_other_by_brand) %>%
415 | mutate(affinityToBrand = targetSegment/other) %>%
416 | arrange(affinityToBrand)
417 |
418 |
419 | #Create plot
420 | ggplot(brand_proportions, aes(x = BRAND, y = affinityToBrand)) +
421 | geom_col() +
422 | labs(x = "Brand", y = "Affinity to Brand", title = "Brand Affinity of Young Singles/Couples in Mainstream Segment")
423 |
424 | 'We can see that :
425 | • Mainstream young singles/couples are 23% more likely to purchase Tyrrells chips compared to the
426 | rest of the population
427 | • Mainstream young singles/couples are 56% less likely to purchase Burger Rings compared to the rest
428 | of the population'
429 |
430 |
431 | #======================================================================================================================
432 | #Let’s also find out if our target segment tends to buy larger packs of chips.
433 | #======================================================================================================================
434 | # Join the dataset by pack_size column and calculate the proportion
435 |
436 | #### Preferred pack size compared to the rest of the population
437 | # Pack_size affinity compared to the rest of the population
438 | # convert dataframes to data.tables
439 | segment1 <- as.data.table(segment1)
440 | other <- as.data.table(other)
441 |
442 | #convert PACK_SIZE to numeric
443 | segment1[, PACK_SIZE := as.numeric(PACK_SIZE)]
444 | other[, PACK_SIZE := as.numeric(PACK_SIZE)]
445 |
446 | quantity_segment1 <- sum(segment1[, PROD_QTY])
447 | quantity_other <- sum(other[, PROD_QTY])
448 |
449 | quantity_segment1_by_pack <- segment1[, .(targetSegment = sum(PROD_QTY) / quantity_segment1), by = PACK_SIZE]
450 |
451 | quantity_other_by_pack <- other[, .(other = sum(PROD_QTY) / quantity_other), by = PACK_SIZE]
452 |
453 | quantity_segment1_by_pack
454 |
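# To mirror the brand analysis above, the two pack-size proportion tables can be joined on
# PACK_SIZE and turned into an affinity ratio. This is a minimal sketch of that step, assuming
# the data.table objects created above; the 270g size should appear near the top of the ranking.
pack_proportions <- merge(quantity_segment1_by_pack, quantity_other_by_pack, by = "PACK_SIZE")
pack_proportions[, affinityToPack := targetSegment / other]
pack_proportions[order(-affinityToPack)]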
455 |
456 | 'It looks like Mainstream young singles/couples are more likely to purchase a 270g pack of chips compared to the rest of the population'
457 |
458 | #==========================================================================================================================
459 | # Let’s dive into which brands sell this pack size.
460 | #==========================================================================================================================
461 |
462 | # convert dataframe to data.table
463 | Chipsdata <- as.data.table(Chipsdata)
464 |
465 | # filter for rows where PACK_SIZE is equal to 270
466 | Chipsdata[PACK_SIZE == 270, PROD_NAME]
467 |
468 |
469 | # filter for rows where PACK_SIZE is equal to 270 and select BRAND column
470 | Chipsdata[PACK_SIZE == 270, unique(BRAND)]
471 |
472 | #
473 | 'Twisties are the only brand offering 270g packs and so this may instead be reflecting a higher likelihood of
474 | purchasing Twisties.'
475 | #===============================================
476 |
477 | # Let's save the new dataset for Task 2
478 | write.csv(Chipsdata, file = '~/Quantium Chips data/Chipsdata.csv')
479 |
480 | #================================================================================================
481 | #Conclusion
482 | #================================================================================================
483 |
484 | '* Sales have mainly been driven by Budget - older families, Mainstream - young singles/couples, and Mainstream - retirees shoppers.
485 |
486 | * We found that the high spend on chips by mainstream young singles/couples and retirees is largely because there are more of them than other buyers. Mainstream mid-age and young singles and
487 | couples are also more likely to pay more per packet of chips, which is indicative of impulse buying behaviour.
488 |
489 | * We’ve also found that Mainstream young singles and couples are 23% more likely to purchase Tyrrells chips
490 | compared to the rest of the population.
491 |
492 | * The Category Manager may want to increase the category’s performance by off-locating some Tyrrells and smaller packs of chips in discretionary space
493 | that young singles and couples frequent more often, to increase visibility and impulse buying.
494 |
495 | * Quantium’s data analysts can help the Category Manager with recommendations on where these segments shop and further
496 | help them measure the impact of the changed placement.'
--------------------------------------------------------------------------------
/Python Projects/WEB SCRAPPING E- COMMERCE SITE (EBAY).ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "3a887880",
6 | "metadata": {},
7 | "source": [
8 | "\n",
9 | "# "
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "id": "e79c20d7",
15 | "metadata": {},
16 | "source": [
17 | "# WEBSCRAPE EBAY PRODUCT DATA \n"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "id": "0517bf9e",
23 | "metadata": {},
24 | "source": [
25 | "#### This data below show a filter category of books that contain Business Intelligence in its title "
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "id": "5a84cec0",
31 | "metadata": {},
32 | "source": [
33 | "# "
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 1,
39 | "id": "ddfd62f2",
40 | "metadata": {},
41 | "outputs": [],
42 | "source": [
43 | "# Import all libraries required\n",
44 | "\n",
45 | "import requests\n",
46 | "import pandas as pd\n",
47 | "from bs4 import BeautifulSoup\n",
48 | "import matplotlib.pyplot as plt\n",
49 | "import seaborn as sns\n",
50 | "from prettytable import PrettyTable"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": 26,
56 | "id": "05c1dc68",
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "# Specify the URL of the Ebay item\n",
61 | "\n",
62 | "url_pattern = 'https://www.ebay.com/sch/267/i.html?_from=R40&_nkw=Business+Intelligence+&rt=nc'"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 27,
68 | "id": "36d1f2c1",
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "# create an empty list to store the scraped data\n",
73 | "product_data = []\n",
74 | "\n",
75 | "# iterate over the page numbers\n",
76 | "for page_num in range(1, 11): # scrape data from page 1 to 10\n",
77 | " # create the url for the current page\n",
78 | " url = url_pattern.format(page_num=page_num)"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": 28,
84 | "id": "5bc231a9",
85 | "metadata": {},
86 | "outputs": [],
87 | "source": [
88 | "#send a GET request to the URL and extract the HTML content\n",
89 | "\n",
90 | "response = requests.get(url)\n",
91 | "content = response.content"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 29,
97 | "id": "d890fbe9",
98 | "metadata": {},
99 | "outputs": [],
100 | "source": [
101 | "#Use Beautiful Soup to parse the HTML content\n",
102 | "\n",
103 | "soup = BeautifulSoup(content, 'html.parser')"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 30,
109 | "id": "540bc991",
110 | "metadata": {},
111 | "outputs": [
112 | {
113 | "name": "stdout",
114 | "output_type": "stream",
115 | "text": [
116 | " Title Price_sold \\\n",
117 | "0 Shop on eBay 20.00 \n",
118 | "1 Business Intelligence : Practices, Technologie... 54.99 \n",
119 | "2 The Artificial Intelligence Imperative: A Prac... 30.17 \n",
120 | "3 Business Intelligence With Cold Fusion By John... 13.78 \n",
121 | "4 Better Business Intelligence By Robert Collins 14.63 \n",
122 | "\n",
123 | " Shipping_cost Item_location Item_seller \\\n",
124 | "0 0.0 \n",
125 | "1 57.45 shipping United States vbbc2015 (11,848) 100% \n",
126 | "2 17.48 shipping United Kingdom webuybooks (1,740,389) 99.7% \n",
127 | "3 6.99 shipping United States awesomebooksusa (387,183) 98.2% \n",
128 | "4 5.35 shipping United Kingdom book_fountain (166,979) 99.2% \n",
129 | "\n",
130 | " Link \n",
131 | "0 https://ebay.com/itm/123456?hash=item28caef0a3... \n",
132 | "1 https://www.ebay.com/itm/134303280275?epid=738... \n",
133 | "2 https://www.ebay.com/itm/134435858075?epid=230... \n",
134 | "3 https://www.ebay.com/itm/394419887418?hash=ite... \n",
135 | "4 https://www.ebay.com/itm/175596253189?hash=ite... \n"
136 | ]
137 | }
138 | ],
139 | "source": [
140 | "#Extract the product information needed\n",
141 | "items = soup.find_all('div', {'class': 's-item__wrapper clearfix'})\n",
142 | " \n",
143 | "\n",
144 | "for item in items:\n",
145 | " title = item.find('div', {'class': 's-item__title'}).text.strip()\n",
146 | " \n",
147 | " price_sold = float(item.find('span', {'class': 's-item__price'}).text.replace('$','').replace(',','').strip())\n",
148 | " shipping_cost = item.find('span', {'class': 's-item__shipping s-item__logisticsCost'})\n",
149 | " if shipping_cost:\n",
150 | " shipping_cost = shipping_cost.text.replace('+','').replace('$','').replace(',','').strip()\n",
151 | " else:\n",
152 | " shipping_cost = 0.0\n",
153 | " item_location = item.find('span', {'class': 's-item__location s-item__itemLocation'})\n",
154 | " if item_location:\n",
155 | " item_location = item_location.text.replace('from','').strip()\n",
156 | " else:\n",
157 | " item_location = ''\n",
158 | " item_seller = item.find('span', {'class':'s-item__seller-info'})\n",
159 | " if item_seller:\n",
160 | " item_seller = item_seller.text.strip()\n",
161 | " else:\n",
162 | " item_seller = ''\n",
163 | " link = item.find('a', {'class': 's-item__link'})['href']\n",
164 | " product_data.append([title, price_sold, shipping_cost, item_location, item_seller, link])\n",
165 | " \n",
166 | "BIbooks = pd.DataFrame(product_data, columns=['Title', 'Price_sold', 'Shipping_cost', 'Item_location','Item_seller', 'Link'])\n",
167 | "print(BIbooks.head())\n"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 31,
173 | "id": "d243d1c3",
174 | "metadata": {},
175 | "outputs": [
176 | {
177 | "data": {
178 | "text/html": [
179 | "