├── Chapter 1- Cohort Analysis.ipynb
├── Chapter1-Cohort_Analysis.ipynb
├── Chapter2-Recency_Frequency_Monetary_Value_analysis.ipynb
├── Chapter3-Data-pre-processing-for-clustering.ipynb
├── Chapter4-Customer-Segmentation-with-K-means.ipynb
├── Ej1.ipynb
├── Online Retail.csv
├── README.md
├── RFM_values.PNG
└── pdfs
├── chapter1.pdf
├── chapter2.pdf
├── chapter3.pdf
└── chapter4.pdf
/Chapter 1- Cohort Analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Chapter 1: Cohort Analysis\n",
8 | "\n",
9 | "Understand customers based on their unique behavioral attributes.\n",
10 | "\n",
11 | "It is a powerful analytics technique to group customers and enable the business to customize their product offering and marketing strategy.\n",
12 | "For example, we can group the customers by the month of their first purchase, segment them by their recency, frequency and monetary values, or run k-means clustering to identify similar groups of customers based on their purchasing behavior. You will dig deeper into customer purchasing habits and uncover actionable insights.\n",
13 | "\n",
14 | "Cohort analysis is a descriptive analytics tool.\n",
15 | "It groups the customers into mutually exclusive cohorts, which are then measured over time. Cohort analysis provides deeper insights than so-called vanity metrics; it helps us understand high-level trends better by providing insights on metrics across both the product and the customer lifecycle.\n",
16 | "\n",
17 | "**There are three major types of cohorts**:\n",
18 | "\n",
19 | "- *Time cohorts* are customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows the customers' behavior depending on the time they started using the company's products or services. The time granularity may be monthly, quarterly, or even daily.\n",
20 | "\n",
21 | "- *Behavior cohorts* are customers who purchased a product or subscribed to a service in the past; they are grouped by the type of product or service they signed up for. Customers who signed up for basic-level services might have different needs than those who signed up for advanced services. Understanding the needs of the various cohorts can help a company design custom-made services or products for particular segments.\n",
22 | "\n",
23 | "- *Size cohorts* refer to the various sizes of customers who purchase the company's products or services. This categorization can be based on the amount of spending in some period of time after acquisition, or on the product type on which the customer spent most of their order amount in some period of time.\n",
24 | "\n",
25 | "**The main elements of the cohort analysis**:\n",
26 | "\n",
27 | "- The cohort analysis data is typically formatted as a pivot table.\n",
28 | "- The row values represent the cohort. In this case it is the month of the first purchase, and customers are pooled into these groups based on their first ever purchase.\n",
29 | "- The column values represent months since acquisition. The offset can also be measured in other time periods such as days, or even hours or minutes; that depends on the scope of the analysis.\n",
30 | "- Finally, the metrics are in the table. Here, we have the count of active customers. The first column with cohort index 'one' represents the total number of customers in that cohort. This is the month of their first transaction. We will use this data in the next lessons to calculate the retention rate and other metrics.\n",
31 | "\n",
32 | "\n",
33 | "\n",
34 | "#### Summary\n",
35 | "\n",
36 | "* What is Cohort Analysis?\n",
37 | "\n",
38 | " - Mutually exclusive segments - cohorts\n",
39 | " - Compare metrics across **product** lifecycle\n",
40 | " - Compare metrics across **customer** lifecycle\n",
41 | "\n",
42 | "* Types of cohorts\n",
43 | "\n",
44 | " - Time cohorts\n",
45 | " - Behavior cohorts\n",
46 | " - Size cohorts\n",
47 | " \n",
48 | "* Elements of the cohort analysis\n",
49 | "\n",
50 | " - Pivot table\n",
51 | " - Assigned cohort in rows\n",
52 | " - Cohort index in columns\n",
53 | " - Metrics in the table"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 1,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "# import the required libraries\n",
63 | "import numpy as np\n",
64 | "import pandas as pd\n",
65 | "import datetime as dt\n"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 2,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "online = pd.read_csv('Online Retail.csv', sep=';')"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 3,
80 | "metadata": {},
81 | "outputs": [
82 | {
83 | "data": {
173 | "text/plain": [
174 | " InvoiceNo StockCode Description Quantity \\\n",
175 | "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n",
176 | "1 536365 71053 WHITE METAL LANTERN 6 \n",
177 | "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n",
178 | "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n",
179 | "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n",
180 | "\n",
181 | " InvoiceDate UnitPrice CustomerID Country \n",
182 | "0 1/12/2010 8:26 2,55 17850.0 United Kingdom \n",
183 | "1 1/12/2010 8:26 3,39 17850.0 United Kingdom \n",
184 | "2 1/12/2010 8:26 2,75 17850.0 United Kingdom \n",
185 | "3 1/12/2010 8:26 3,39 17850.0 United Kingdom \n",
186 | "4 1/12/2010 8:26 3,39 17850.0 United Kingdom "
187 | ]
188 | },
189 | "execution_count": 3,
190 | "metadata": {},
191 | "output_type": "execute_result"
192 | }
193 | ],
194 | "source": [
195 | "online.head()"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 4,
201 | "metadata": {},
202 | "outputs": [
203 | {
204 | "name": "stdout",
205 | "output_type": "stream",
206 | "text": [
207 | "<class 'pandas.core.frame.DataFrame'>\n",
208 | "RangeIndex: 541909 entries, 0 to 541908\n",
209 | "Data columns (total 8 columns):\n",
210 | "InvoiceNo 541909 non-null object\n",
211 | "StockCode 541909 non-null object\n",
212 | "Description 540455 non-null object\n",
213 | "Quantity 541909 non-null int64\n",
214 | "InvoiceDate 541909 non-null object\n",
215 | "UnitPrice 541909 non-null object\n",
216 | "CustomerID 406829 non-null float64\n",
217 | "Country 541909 non-null object\n",
218 | "dtypes: float64(1), int64(1), object(6)\n",
219 | "memory usage: 33.1+ MB\n"
220 | ]
221 | }
222 | ],
223 | "source": [
224 | "online.info()"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": 5,
230 | "metadata": {},
231 | "outputs": [],
232 | "source": [
233 | "# convert object to datetime\n",
234 | "online['InvoiceDate'] = pd.to_datetime(online['InvoiceDate'])"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": 6,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "# convert object to float\n",
244 | "online['UnitPrice'] = online['UnitPrice'].apply(lambda x: x.replace(',', '.'))"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": 7,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "online['UnitPrice'] = online['UnitPrice'].apply(lambda col:pd.to_numeric(col, errors='coerce'))"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": 8,
259 | "metadata": {},
260 | "outputs": [
261 | {
262 | "data": {
263 | "text/plain": [
264 | "InvoiceNo object\n",
265 | "StockCode object\n",
266 | "Description object\n",
267 | "Quantity int64\n",
268 | "InvoiceDate datetime64[ns]\n",
269 | "UnitPrice float64\n",
270 | "CustomerID float64\n",
271 | "Country object\n",
272 | "dtype: object"
273 | ]
274 | },
275 | "execution_count": 8,
276 | "metadata": {},
277 | "output_type": "execute_result"
278 | }
279 | ],
280 | "source": [
281 | "online.dtypes"
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": 9,
287 | "metadata": {},
288 | "outputs": [
289 | {
290 | "data": {
380 | "text/plain": [
381 | " InvoiceNo StockCode Description Quantity \\\n",
382 | "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n",
383 | "1 536365 71053 WHITE METAL LANTERN 6 \n",
384 | "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n",
385 | "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n",
386 | "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n",
387 | "\n",
388 | " InvoiceDate UnitPrice CustomerID Country \n",
389 | "0 2010-01-12 08:26:00 2.55 17850.0 United Kingdom \n",
390 | "1 2010-01-12 08:26:00 3.39 17850.0 United Kingdom \n",
391 | "2 2010-01-12 08:26:00 2.75 17850.0 United Kingdom \n",
392 | "3 2010-01-12 08:26:00 3.39 17850.0 United Kingdom \n",
393 | "4 2010-01-12 08:26:00 3.39 17850.0 United Kingdom "
394 | ]
395 | },
396 | "execution_count": 9,
397 | "metadata": {},
398 | "output_type": "execute_result"
399 | }
400 | ],
401 | "source": [
402 | "online.head()"
403 | ]
404 | },
405 | {
406 | "cell_type": "markdown",
407 | "metadata": {},
408 | "source": [
409 | "### Time cohorts\n",
410 | "\n",
411 | "We will segment customers into acquisition cohorts based on the month of their first purchase, and then assign a cohort index to each purchase of the customer.\n",
412 | "\n",
413 | "The index will represent the number of months since the first transaction. Time-based cohorts group customers by the time they completed their first activity.\n",
414 | "We will mark each transaction based on its relative time period since the first purchase.\n",
415 | "In the next step we will calculate metrics like retention or average spend value, and build a heatmap."
416 | ]
417 | },
418 | {
419 | "cell_type": "code",
420 | "execution_count": 10,
421 | "metadata": {},
422 | "outputs": [
423 | {
424 | "name": "stdout",
425 | "output_type": "stream",
426 | "text": [
427 | " InvoiceNo StockCode Description Quantity \\\n",
428 | "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n",
429 | "1 536365 71053 WHITE METAL LANTERN 6 \n",
430 | "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n",
431 | "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n",
432 | "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n",
433 | "\n",
434 | " InvoiceDate UnitPrice CustomerID Country InvoiceDay \\\n",
435 | "0 2010-01-12 08:26:00 2.55 17850.0 United Kingdom 2010-01-12 \n",
436 | "1 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
437 | "2 2010-01-12 08:26:00 2.75 17850.0 United Kingdom 2010-01-12 \n",
438 | "3 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
439 | "4 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
440 | "\n",
441 | " CohortDay \n",
442 | "0 2010-01-12 \n",
443 | "1 2010-01-12 \n",
444 | "2 2010-01-12 \n",
445 | "3 2010-01-12 \n",
446 | "4 2010-01-12 \n"
447 | ]
448 | }
449 | ],
450 | "source": [
451 | "# Define a function that will parse the date\n",
452 | "def get_day(x): return dt.datetime(x.year, x.month, x.day) \n",
453 | "\n",
454 | "# Create InvoiceDay column\n",
455 | "online['InvoiceDay'] = online['InvoiceDate'].apply(get_day) \n",
456 | "\n",
457 | "# Group by CustomerID and select the InvoiceDay value\n",
458 | "grouping = online.groupby('CustomerID')['InvoiceDay'] \n",
459 | "\n",
460 | "# Assign a minimum InvoiceDay value to the dataset\n",
461 | "online['CohortDay'] = grouping.transform('min')\n",
462 | "\n",
463 | "# View the top 5 rows\n",
464 | "print(online.head())"
465 | ]
466 | },
467 | {
468 | "cell_type": "markdown",
469 | "metadata": {},
470 | "source": [
471 | "### Calculate time offset in days - part 1\n",
472 | "\n",
473 | "Calculating the time offset for each transaction allows you to report the metrics for each cohort in a comparable fashion.\n",
474 | "\n",
475 | "First, we will create six variables that capture the integer values of the year, month and day for the Invoice and Cohort dates, using the get_date_int() function defined below."
476 | ]
477 | },
478 | {
479 | "cell_type": "code",
480 | "execution_count": 11,
481 | "metadata": {},
482 | "outputs": [],
483 | "source": [
484 | "def get_date_int(df, column):\n",
485 | " year = df[column].dt.year\n",
486 | " month = df[column].dt.month\n",
487 | " day = df[column].dt.day\n",
488 | " return year, month, day"
489 | ]
490 | },
491 | {
492 | "cell_type": "code",
493 | "execution_count": 12,
494 | "metadata": {},
495 | "outputs": [],
496 | "source": [
497 | "# Get the integers for date parts from the `InvoiceDay` column\n",
498 | "invoice_year, invoice_month, invoice_day = get_date_int(online, 'InvoiceDay')\n",
499 | "\n",
500 | "# Get the integers for date parts from the `CohortDay` column\n",
501 | "cohort_year, cohort_month, cohort_day = get_date_int(online, 'CohortDay')"
502 | ]
503 | },
504 | {
505 | "cell_type": "markdown",
506 | "metadata": {},
507 | "source": [
508 | "**Calculate time offset in days - part 2**\n",
509 | "\n",
510 | "Now, we have six different data sets with year, month and day values for Invoice and Cohort dates - invoice_year, cohort_year, invoice_month, cohort_month, invoice_day, and cohort_day.\n",
511 | "\n",
512 | "We will calculate the difference between the Invoice and Cohort dates in years, months and days separately, and then combine them into the total difference in days between the two. This will be the days offset, which we will use to visualize the customer count."
513 | ]
514 | },
515 | {
516 | "cell_type": "code",
517 | "execution_count": 13,
518 | "metadata": {
519 | "scrolled": true
520 | },
521 | "outputs": [
522 | {
523 | "name": "stdout",
524 | "output_type": "stream",
525 | "text": [
526 | " InvoiceNo StockCode Description Quantity \\\n",
527 | "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n",
528 | "1 536365 71053 WHITE METAL LANTERN 6 \n",
529 | "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n",
530 | "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n",
531 | "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n",
532 | "\n",
533 | " InvoiceDate UnitPrice CustomerID Country InvoiceDay \\\n",
534 | "0 2010-01-12 08:26:00 2.55 17850.0 United Kingdom 2010-01-12 \n",
535 | "1 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
536 | "2 2010-01-12 08:26:00 2.75 17850.0 United Kingdom 2010-01-12 \n",
537 | "3 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
538 | "4 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
539 | "\n",
540 | " CohortDay CohortIndex \n",
541 | "0 2010-01-12 1.0 \n",
542 | "1 2010-01-12 1.0 \n",
543 | "2 2010-01-12 1.0 \n",
544 | "3 2010-01-12 1.0 \n",
545 | "4 2010-01-12 1.0 \n"
546 | ]
547 | }
548 | ],
549 | "source": [
550 | "# Calculate difference in years\n",
551 | "years_diff = invoice_year - cohort_year\n",
552 | "\n",
553 | "# Calculate difference in months\n",
554 | "months_diff = invoice_month - cohort_month\n",
555 | "\n",
556 | "# Calculate difference in days\n",
557 | "days_diff = invoice_day - cohort_day\n",
558 | "\n",
559 | "# Combine the year, month and day differences into a total day offset (approximating 365 days per year and 30 days per month) and add 1\n",
560 | "online['CohortIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1\n",
561 | "print(online.head())"
562 | ]
563 | },
564 | {
565 | "cell_type": "markdown",
566 | "metadata": {},
567 | "source": [
568 | "Monthly data (image)"
569 | ]
570 | },
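{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells above index each purchase by day. Below is a minimal sketch of how a monthly cohort counts table (referred to as `cohort_counts` in the next section) could be built. This sketch is not part of the original exercises; the `InvoiceMonth`, `CohortMonth` and monthly `CohortIndex` names are illustrative assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: build monthly acquisition cohorts and count active customers per cohort and month offset\n",
"def get_month(x):\n",
"    return dt.datetime(x.year, x.month, 1)\n",
"\n",
"online['InvoiceMonth'] = online['InvoiceDate'].apply(get_month)\n",
"online['CohortMonth'] = online.groupby('CustomerID')['InvoiceMonth'].transform('min')\n",
"\n",
"invoice_year, invoice_month, _ = get_date_int(online, 'InvoiceMonth')\n",
"cohort_year, cohort_month, _ = get_date_int(online, 'CohortMonth')\n",
"\n",
"# Month offset since the customer's first purchase, starting at 1 (redefines CohortIndex on a monthly basis)\n",
"online['CohortIndex'] = (invoice_year - cohort_year) * 12 + (invoice_month - cohort_month) + 1\n",
"\n",
"# Count unique active customers for each cohort and cohort index, then pivot into the cohort_counts table\n",
"grouping = online.groupby(['CohortMonth', 'CohortIndex'])\n",
"cohort_data = grouping['CustomerID'].apply(pd.Series.nunique).reset_index()\n",
"cohort_counts = cohort_data.pivot(index='CohortMonth', columns='CohortIndex', values='CustomerID')\n",
"cohort_counts.head()"
]
},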
571 | {
572 | "cell_type": "markdown",
573 | "metadata": {},
574 | "source": [
575 | "#### Calculate cohort metrics\n",
576 | "\n",
577 | "- How many customers were originally in each cohort (the cohort_counts table)?\n",
578 | "- How many of them were active in the following months?\n",
580 | "\n",
581 | "\n",
582 | "We will start by using the cohort counts table from our previous lesson to calculate customer retention.\n",
583 | "Retention measures how many customers from each cohort have returned in the subsequent months.\n",
584 | "\n",
585 | "1. Select the first column, which is the total number of customers in the cohort.\n",
586 | "2. Calculate the ratio of how many of these customers came back in each of the subsequent months; this is the retention rate.\n",
587 | "\n",
588 | "Note: you will see that the first month's retention will, by definition, be 100% for all cohorts, because the number of active customers in the first month is exactly the size of the cohort.\n",
589 | "\n",
590 | "**Customer retention**\n",
591 | "\n",
592 | "Customer retention is a very useful metric for understanding how many of all the customers are still active. Which of the following best describes customer retention?\n",
593 | "\n",
594 | "- [X] Percentage of active customers out of total customers\n",
595 | " - **Correct!** Retention gives you the percentage of active customers compared to the total number of customers.\n",
596 | "- [ ] Percentage of active customers compared to a previous month\n",
597 | " - **Incorrect submission:** This metric sounds more like a monthly change in active customers.\n",
598 | "- [ ] Number of average active customers each month\n",
599 | " - **Incorrect submission:** Retention is a percentage metric while this is an absolute number.\n",
600 | "- [ ] Active customers on the first month. \n",
601 | " - **Incorrect submission:** Retention is a percentage metric while this is an absolute number.\n",
602 | "\n",
603 | "\n",
604 | "**Calculate retention rate from scratch**\n",
605 | "\n",
606 | "You have seen how to create the retention and average quantity metrics tables for the monthly acquisition cohorts. Now it's your turn to calculate the average price metric and see if there are any differences in shopping patterns across time and across cohorts."
607 | ]
608 | },
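{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the retention calculation described above, assuming the `cohort_counts` table from the earlier sketch (cohort sizes are in the first column):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: retention = active customers in each period divided by the cohort size (the first column)\n",
"cohort_sizes = cohort_counts.iloc[:, 0]\n",
"retention = cohort_counts.divide(cohort_sizes, axis=0)\n",
"\n",
"# The first column is 100% for every cohort by definition\n",
"retention.round(3) * 100"
]
},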
609 | {
610 | "cell_type": "markdown",
611 | "metadata": {},
612 | "source": [
613 | "**Calculate average price**\n",
614 | "\n",
615 | "You will now calculate the average price metric and analyze if there are any differences in shopping patterns across time and across cohorts."
616 | ]
617 | },
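{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the average price metric described above, assuming the `CohortMonth` and `CohortIndex` columns from the earlier sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: average unit price for each cohort and cohort index\n",
"grouping = online.groupby(['CohortMonth', 'CohortIndex'])\n",
"average_price = grouping['UnitPrice'].mean().reset_index()\n",
"average_price = average_price.pivot(index='CohortMonth', columns='CohortIndex', values='UnitPrice')\n",
"average_price.round(1).head()"
]
},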
618 | {
619 | "cell_type": "markdown",
620 | "metadata": {},
621 | "source": [
622 | "### Visualize average quantity metric\n",
623 | "\n",
624 | "**Heatmap**\n",
625 | "- Easiest way to visualize cohort analysis\n",
626 | "- Includes both data and visuals\n",
627 | "- Only a few lines of code with seaborn (see the sketch below)"
628 | ]
629 | },
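{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the 'few lines of code with seaborn' mentioned above, assuming the `retention` table from the earlier sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: visualize the retention table as a heatmap\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"plt.figure(figsize=(10, 8))\n",
"plt.title('Retention rates')\n",
"sns.heatmap(data=retention, annot=True, fmt='.0%', vmin=0.0, vmax=0.5, cmap='BuGn')\n",
"plt.show()"
]
},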
630 | {
631 | "cell_type": "markdown",
632 | "metadata": {},
633 | "source": [
634 | "\n",
635 | "Customer retention (image)"
637 | ]
638 | },
639 | {
640 | "cell_type": "markdown",
641 | "metadata": {},
642 | "source": [
643 | "To see the resulting retention table, open: [Monthly data](Ej1.ipynb)"
644 | ]
645 | }
646 | ],
647 | "metadata": {
648 | "kernelspec": {
649 | "display_name": "Python 3",
650 | "language": "python",
651 | "name": "python3"
652 | },
653 | "language_info": {
654 | "codemirror_mode": {
655 | "name": "ipython",
656 | "version": 3
657 | },
658 | "file_extension": ".py",
659 | "mimetype": "text/x-python",
660 | "name": "python",
661 | "nbconvert_exporter": "python",
662 | "pygments_lexer": "ipython3",
663 | "version": "3.7.0"
664 | }
665 | },
666 | "nbformat": 4,
667 | "nbformat_minor": 2
668 | }
669 |
--------------------------------------------------------------------------------
/Chapter1-Cohort_Analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Chapter 1: Cohort Analysis\n",
8 | "\n",
9 | "Understand customers based on their unique behavioral attributes.\n",
10 | "\n",
11 | "It is a powerful analytics technique to group customers and enable the business to customize their product offering and marketing strategy.\n",
12 | "For example, we can group the customers by the month of their first purchase, segment them by their recency, frequency and monetary values, or run k-means clustering to identify similar groups of customers based on their purchasing behavior. You will dig deeper into customer purchasing habits and uncover actionable insights.\n",
13 | "\n",
14 | "Cohort analysis is a descriptive analytics tool.\n",
15 | "It groups the customers into mutually exclusive cohorts, which are then measured over time. Cohort analysis provides deeper insights than so-called vanity metrics; it helps us understand high-level trends better by providing insights on metrics across both the product and the customer lifecycle.\n",
16 | "\n",
17 | "**There are three major types of cohorts**:\n",
18 | "\n",
19 | "- *Time cohorts* are customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows the customers' behavior depending on the time they started using the company's products or services. The time granularity may be monthly, quarterly, or even daily.\n",
20 | "\n",
21 | "- *Behavior cohorts* are customers who purchased a product or subscribed to a service in the past; they are grouped by the type of product or service they signed up for. Customers who signed up for basic-level services might have different needs than those who signed up for advanced services. Understanding the needs of the various cohorts can help a company design custom-made services or products for particular segments.\n",
22 | "\n",
23 | "- *Size cohorts* refer to the various sizes of customers who purchase the company's products or services. This categorization can be based on the amount of spending in some period of time after acquisition, or on the product type on which the customer spent most of their order amount in some period of time.\n",
24 | "\n",
25 | "**The main elements of the cohort analysis**:\n",
26 | "\n",
27 | "- The cohort analysis data is typically formatted as a pivot table.\n",
28 | "- The row values represent the cohort. In this case it is the month of the first purchase, and customers are pooled into these groups based on their first ever purchase.\n",
29 | "- The column values represent months since acquisition. The offset can also be measured in other time periods such as days, or even hours or minutes; that depends on the scope of the analysis.\n",
30 | "- Finally, the metrics are in the table. Here, we have the count of active customers. The first column with cohort index 'one' represents the total number of customers in that cohort. This is the month of their first transaction. We will use this data in the next lessons to calculate the retention rate and other metrics.\n",
31 | "\n",
32 | "\n",
33 | "\n",
34 | "#### Summary\n",
35 | "\n",
36 | "* What is Cohort Analysis?\n",
37 | "\n",
38 | " - Mutually exclusive segments - cohorts\n",
39 | " - Compare metrics across **product** lifecycle\n",
40 | " - Compare metrics across **customer** lifecycle\n",
41 | "\n",
42 | "* Types of cohorts\n",
43 | "\n",
44 | " - Time cohorts\n",
45 | " - Behavior cohorts\n",
46 | " - Size cohorts\n",
47 | " \n",
48 | "* Elements of the cohort analysis\n",
49 | "\n",
50 | " - Pivot table\n",
51 | " - Assigned cohort in rows\n",
52 | " - Cohort index in columns\n",
53 | " - Metrics in the table"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 1,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "# import the required libraries\n",
63 | "import numpy as np\n",
64 | "import pandas as pd\n",
65 | "import datetime as dt\n"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 2,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "online = pd.read_csv('Online Retail.csv', sep=';')"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 3,
80 | "metadata": {},
81 | "outputs": [
82 | {
83 | "data": {
173 | "text/plain": [
174 | " InvoiceNo StockCode Description Quantity \\\n",
175 | "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n",
176 | "1 536365 71053 WHITE METAL LANTERN 6 \n",
177 | "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n",
178 | "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n",
179 | "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n",
180 | "\n",
181 | " InvoiceDate UnitPrice CustomerID Country \n",
182 | "0 1/12/2010 8:26 2,55 17850.0 United Kingdom \n",
183 | "1 1/12/2010 8:26 3,39 17850.0 United Kingdom \n",
184 | "2 1/12/2010 8:26 2,75 17850.0 United Kingdom \n",
185 | "3 1/12/2010 8:26 3,39 17850.0 United Kingdom \n",
186 | "4 1/12/2010 8:26 3,39 17850.0 United Kingdom "
187 | ]
188 | },
189 | "execution_count": 3,
190 | "metadata": {},
191 | "output_type": "execute_result"
192 | }
193 | ],
194 | "source": [
195 | "online.head()"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 4,
201 | "metadata": {},
202 | "outputs": [
203 | {
204 | "name": "stdout",
205 | "output_type": "stream",
206 | "text": [
207 | "<class 'pandas.core.frame.DataFrame'>\n",
208 | "RangeIndex: 541909 entries, 0 to 541908\n",
209 | "Data columns (total 8 columns):\n",
210 | "InvoiceNo 541909 non-null object\n",
211 | "StockCode 541909 non-null object\n",
212 | "Description 540455 non-null object\n",
213 | "Quantity 541909 non-null int64\n",
214 | "InvoiceDate 541909 non-null object\n",
215 | "UnitPrice 541909 non-null object\n",
216 | "CustomerID 406829 non-null float64\n",
217 | "Country 541909 non-null object\n",
218 | "dtypes: float64(1), int64(1), object(6)\n",
219 | "memory usage: 33.1+ MB\n"
220 | ]
221 | }
222 | ],
223 | "source": [
224 | "online.info()"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": 5,
230 | "metadata": {},
231 | "outputs": [],
232 | "source": [
233 | "# convert object to datetime\n",
234 | "online['InvoiceDate'] = pd.to_datetime(online['InvoiceDate'])"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": 6,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "# convert object to float\n",
244 | "online['UnitPrice'] = online['UnitPrice'].apply(lambda x: x.replace(',', '.'))"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": 7,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "online['UnitPrice'] = online['UnitPrice'].apply(lambda col:pd.to_numeric(col, errors='coerce'))"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": 8,
259 | "metadata": {},
260 | "outputs": [
261 | {
262 | "data": {
263 | "text/plain": [
264 | "InvoiceNo object\n",
265 | "StockCode object\n",
266 | "Description object\n",
267 | "Quantity int64\n",
268 | "InvoiceDate datetime64[ns]\n",
269 | "UnitPrice float64\n",
270 | "CustomerID float64\n",
271 | "Country object\n",
272 | "dtype: object"
273 | ]
274 | },
275 | "execution_count": 8,
276 | "metadata": {},
277 | "output_type": "execute_result"
278 | }
279 | ],
280 | "source": [
281 | "online.dtypes"
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": 9,
287 | "metadata": {},
288 | "outputs": [
289 | {
290 | "data": {
380 | "text/plain": [
381 | " InvoiceNo StockCode Description Quantity \\\n",
382 | "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n",
383 | "1 536365 71053 WHITE METAL LANTERN 6 \n",
384 | "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n",
385 | "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n",
386 | "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n",
387 | "\n",
388 | " InvoiceDate UnitPrice CustomerID Country \n",
389 | "0 2010-01-12 08:26:00 2.55 17850.0 United Kingdom \n",
390 | "1 2010-01-12 08:26:00 3.39 17850.0 United Kingdom \n",
391 | "2 2010-01-12 08:26:00 2.75 17850.0 United Kingdom \n",
392 | "3 2010-01-12 08:26:00 3.39 17850.0 United Kingdom \n",
393 | "4 2010-01-12 08:26:00 3.39 17850.0 United Kingdom "
394 | ]
395 | },
396 | "execution_count": 9,
397 | "metadata": {},
398 | "output_type": "execute_result"
399 | }
400 | ],
401 | "source": [
402 | "online.head()"
403 | ]
404 | },
405 | {
406 | "cell_type": "markdown",
407 | "metadata": {},
408 | "source": [
409 | "### Time cohorts\n",
410 | "\n",
411 | "We will segment customers into acquisition cohorts based on the month of their first purchase, and then assign a cohort index to each purchase of the customer.\n",
412 | "\n",
413 | "The index will represent the number of months since the first transaction. Time-based cohorts group customers by the time they completed their first activity.\n",
414 | "We will mark each transaction based on its relative time period since the first purchase.\n",
415 | "In the next step we will calculate metrics like retention or average spend value, and build a heatmap."
416 | ]
417 | },
418 | {
419 | "cell_type": "code",
420 | "execution_count": 10,
421 | "metadata": {},
422 | "outputs": [
423 | {
424 | "name": "stdout",
425 | "output_type": "stream",
426 | "text": [
427 | " InvoiceNo StockCode Description Quantity \\\n",
428 | "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n",
429 | "1 536365 71053 WHITE METAL LANTERN 6 \n",
430 | "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n",
431 | "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n",
432 | "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n",
433 | "\n",
434 | " InvoiceDate UnitPrice CustomerID Country InvoiceDay \\\n",
435 | "0 2010-01-12 08:26:00 2.55 17850.0 United Kingdom 2010-01-12 \n",
436 | "1 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
437 | "2 2010-01-12 08:26:00 2.75 17850.0 United Kingdom 2010-01-12 \n",
438 | "3 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
439 | "4 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
440 | "\n",
441 | " CohortDay \n",
442 | "0 2010-01-12 \n",
443 | "1 2010-01-12 \n",
444 | "2 2010-01-12 \n",
445 | "3 2010-01-12 \n",
446 | "4 2010-01-12 \n"
447 | ]
448 | }
449 | ],
450 | "source": [
451 | "# Define a function that will parse the date\n",
452 | "def get_day(x): return dt.datetime(x.year, x.month, x.day) \n",
453 | "\n",
454 | "# Create InvoiceDay column\n",
455 | "online['InvoiceDay'] = online['InvoiceDate'].apply(get_day) \n",
456 | "\n",
457 | "# Group by CustomerID and select the InvoiceDay value\n",
458 | "grouping = online.groupby('CustomerID')['InvoiceDay'] \n",
459 | "\n",
460 | "# Assign a minimum InvoiceDay value to the dataset\n",
461 | "online['CohortDay'] = grouping.transform('min')\n",
462 | "\n",
463 | "# View the top 5 rows\n",
464 | "print(online.head())"
465 | ]
466 | },
467 | {
468 | "cell_type": "markdown",
469 | "metadata": {},
470 | "source": [
471 | "### Calculate time offset in days - part 1\n",
472 | "\n",
473 | "Calculating the time offset for each transaction allows you to report the metrics for each cohort in a comparable fashion.\n",
474 | "\n",
475 | "First, we will create six variables that capture the integer values of the year, month and day for the Invoice and Cohort dates, using the get_date_int() function defined below."
476 | ]
477 | },
478 | {
479 | "cell_type": "code",
480 | "execution_count": 11,
481 | "metadata": {},
482 | "outputs": [],
483 | "source": [
484 | "def get_date_int(df, column):\n",
485 | " year = df[column].dt.year\n",
486 | " month = df[column].dt.month\n",
487 | " day = df[column].dt.day\n",
488 | " return year, month, day"
489 | ]
490 | },
491 | {
492 | "cell_type": "code",
493 | "execution_count": 12,
494 | "metadata": {},
495 | "outputs": [],
496 | "source": [
497 | "# Get the integers for date parts from the `InvoiceDay` column\n",
498 | "invoice_year, invoice_month, invoice_day = get_date_int(online, 'InvoiceDay')\n",
499 | "\n",
500 | "# Get the integers for date parts from the `CohortDay` column\n",
501 | "cohort_year, cohort_month, cohort_day = get_date_int(online, 'CohortDay')"
502 | ]
503 | },
504 | {
505 | "cell_type": "markdown",
506 | "metadata": {},
507 | "source": [
508 | "**Calculate time offset in days - part 2**\n",
509 | "\n",
510 | "Now, we have six different data sets with year, month and day values for Invoice and Cohort dates - invoice_year, cohort_year, invoice_month, cohort_month, invoice_day, and cohort_day.\n",
511 | "\n",
512 | "We will calculate the difference between the Invoice and Cohort dates in years, months and days separately, and then combine them into the total difference in days between the two. This will be the days offset, which we will use to visualize the customer count."
513 | ]
514 | },
515 | {
516 | "cell_type": "code",
517 | "execution_count": 13,
518 | "metadata": {
519 | "scrolled": true
520 | },
521 | "outputs": [
522 | {
523 | "name": "stdout",
524 | "output_type": "stream",
525 | "text": [
526 | " InvoiceNo StockCode Description Quantity \\\n",
527 | "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n",
528 | "1 536365 71053 WHITE METAL LANTERN 6 \n",
529 | "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n",
530 | "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n",
531 | "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n",
532 | "\n",
533 | " InvoiceDate UnitPrice CustomerID Country InvoiceDay \\\n",
534 | "0 2010-01-12 08:26:00 2.55 17850.0 United Kingdom 2010-01-12 \n",
535 | "1 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
536 | "2 2010-01-12 08:26:00 2.75 17850.0 United Kingdom 2010-01-12 \n",
537 | "3 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
538 | "4 2010-01-12 08:26:00 3.39 17850.0 United Kingdom 2010-01-12 \n",
539 | "\n",
540 | " CohortDay CohortIndex \n",
541 | "0 2010-01-12 1.0 \n",
542 | "1 2010-01-12 1.0 \n",
543 | "2 2010-01-12 1.0 \n",
544 | "3 2010-01-12 1.0 \n",
545 | "4 2010-01-12 1.0 \n"
546 | ]
547 | }
548 | ],
549 | "source": [
550 | "# Calculate difference in years\n",
551 | "years_diff = invoice_year - cohort_year\n",
552 | "\n",
553 | "# Calculate difference in months\n",
554 | "months_diff = invoice_month - cohort_month\n",
555 | "\n",
556 | "# Calculate difference in days\n",
557 | "days_diff = invoice_day - cohort_day\n",
558 | "\n",
559 | "# Combine the year, month and day differences into a total day offset (approximating 365 days per year and 30 days per month) and add 1\n",
560 | "online['CohortIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1\n",
561 | "print(online.head())"
562 | ]
563 | },
564 | {
565 | "cell_type": "markdown",
566 | "metadata": {},
567 | "source": [
568 | "Monthly data (image)"
569 | ]
570 | },
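{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells above index each purchase by day. Below is a minimal sketch of how a monthly cohort counts table (referred to as `cohort_counts` in the next section) could be built. This sketch is not part of the original exercises; the `InvoiceMonth`, `CohortMonth` and monthly `CohortIndex` names are illustrative assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: build monthly acquisition cohorts and count active customers per cohort and month offset\n",
"def get_month(x):\n",
"    return dt.datetime(x.year, x.month, 1)\n",
"\n",
"online['InvoiceMonth'] = online['InvoiceDate'].apply(get_month)\n",
"online['CohortMonth'] = online.groupby('CustomerID')['InvoiceMonth'].transform('min')\n",
"\n",
"invoice_year, invoice_month, _ = get_date_int(online, 'InvoiceMonth')\n",
"cohort_year, cohort_month, _ = get_date_int(online, 'CohortMonth')\n",
"\n",
"# Month offset since the customer's first purchase, starting at 1 (redefines CohortIndex on a monthly basis)\n",
"online['CohortIndex'] = (invoice_year - cohort_year) * 12 + (invoice_month - cohort_month) + 1\n",
"\n",
"# Count unique active customers for each cohort and cohort index, then pivot into the cohort_counts table\n",
"grouping = online.groupby(['CohortMonth', 'CohortIndex'])\n",
"cohort_data = grouping['CustomerID'].apply(pd.Series.nunique).reset_index()\n",
"cohort_counts = cohort_data.pivot(index='CohortMonth', columns='CohortIndex', values='CustomerID')\n",
"cohort_counts.head()"
]
},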
571 | {
572 | "cell_type": "markdown",
573 | "metadata": {},
574 | "source": [
575 | "#### Calculate cohort metrics\n",
576 | "\n",
577 | "- How many customers were originally in each cohort (the cohort_counts table)?\n",
578 | "- How many of them were active in the following months?\n",
580 | "\n",
581 | "\n",
582 | "We will start by using the cohort counts table from our previous lesson to calculate customer retention.\n",
583 | "Retention measures how many customers from each cohort have returned in the subsequent months.\n",
584 | "\n",
585 | "1. Select the first column, which is the total number of customers in the cohort.\n",
586 | "2. Calculate the ratio of how many of these customers came back in each of the subsequent months; this is the retention rate.\n",
587 | "\n",
588 | "Note: you will see that the first month's retention will, by definition, be 100% for all cohorts, because the number of active customers in the first month is exactly the size of the cohort.\n",
589 | "\n",
590 | "**Customer retention**\n",
591 | "\n",
592 | "Customer retention is a very useful metric for understanding how many of all the customers are still active. Which of the following best describes customer retention?\n",
593 | "\n",
594 | "- [X] Percentage of active customers out of total customers\n",
595 | " - **Correct!** Retention gives you the percentage of active customers compared to the total number of customers.\n",
596 | "- [ ] Percentage of active customers compared to a previous month\n",
597 | " - **Incorrect submission:** This metric sounds more like a monthly change in active customers.\n",
598 | "- [ ] Number of average active customers each month\n",
599 | " - **Incorrect submission:** Retention is a percentage metric while this is an absolute number.\n",
600 | "- [ ] Active customers on the first month. \n",
601 | " - **Incorrect submission:** Retention is a percentage metric while this is an absolute number.\n",
602 | "\n",
603 | "\n",
604 | "**Calculate retention rate from scratch**\n",
605 | "\n",
606 | "You have seen how to create the retention and average quantity metrics tables for the monthly acquisition cohorts. Now it's your turn to calculate the average price metric and see if there are any differences in shopping patterns across time and across cohorts."
607 | ]
608 | },
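{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the retention calculation described above, assuming the `cohort_counts` table from the earlier sketch (cohort sizes are in the first column):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: retention = active customers in each period divided by the cohort size (the first column)\n",
"cohort_sizes = cohort_counts.iloc[:, 0]\n",
"retention = cohort_counts.divide(cohort_sizes, axis=0)\n",
"\n",
"# The first column is 100% for every cohort by definition\n",
"retention.round(3) * 100"
]
},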
609 | {
610 | "cell_type": "markdown",
611 | "metadata": {},
612 | "source": [
613 | "**Calculate average price**\n",
614 | "\n",
615 | "You will now calculate the average price metric and analyze if there are any differences in shopping patterns across time and across cohorts."
616 | ]
617 | },
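{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the average price metric described above, assuming the `CohortMonth` and `CohortIndex` columns from the earlier sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: average unit price for each cohort and cohort index\n",
"grouping = online.groupby(['CohortMonth', 'CohortIndex'])\n",
"average_price = grouping['UnitPrice'].mean().reset_index()\n",
"average_price = average_price.pivot(index='CohortMonth', columns='CohortIndex', values='UnitPrice')\n",
"average_price.round(1).head()"
]
},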
618 | {
619 | "cell_type": "markdown",
620 | "metadata": {},
621 | "source": [
622 | "### Visualize average quantity metric\n",
623 | "\n",
624 | "**Heatmap**\n",
625 | "- Easiest way to visualize cohort analysis\n",
626 | "- Includes both data and visuals\n",
627 | "- Only a few lines of code with seaborn (see the sketch below)"
628 | ]
629 | },
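{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the 'few lines of code with seaborn' mentioned above, assuming the `retention` table from the earlier sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: visualize the retention table as a heatmap\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"plt.figure(figsize=(10, 8))\n",
"plt.title('Retention rates')\n",
"sns.heatmap(data=retention, annot=True, fmt='.0%', vmin=0.0, vmax=0.5, cmap='BuGn')\n",
"plt.show()"
]
},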
630 | {
631 | "cell_type": "markdown",
632 | "metadata": {},
633 | "source": [
634 | "\n",
635 | "Customer retention (image)"
637 | ]
638 | },
639 | {
640 | "cell_type": "markdown",
641 | "metadata": {},
642 | "source": [
643 | "To see the resulting retention table, open: [Monthly data](Ej1.ipynb)"
644 | ]
645 | }
646 | ],
647 | "metadata": {
648 | "kernelspec": {
649 | "display_name": "Python 3",
650 | "language": "python",
651 | "name": "python3"
652 | },
653 | "language_info": {
654 | "codemirror_mode": {
655 | "name": "ipython",
656 | "version": 3
657 | },
658 | "file_extension": ".py",
659 | "mimetype": "text/x-python",
660 | "name": "python",
661 | "nbconvert_exporter": "python",
662 | "pygments_lexer": "ipython3",
663 | "version": "3.7.0"
664 | }
665 | },
666 | "nbformat": 4,
667 | "nbformat_minor": 2
668 | }
669 |
--------------------------------------------------------------------------------
/Chapter2-Recency_Frequency_Monetary_Value_analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Recency, Frequency, Monetary Value analysis\n",
8 | "\n",
9 | "Understand customers based on their unique behavioral attributes.\n",
10 | "\n",
11 | "### Group customers into cohorts and analyze their behavior over time.\n",
12 | "In this chapter, we will dive into a very popular technique called RFM segmentation, which stands for Recency, Frequency and Monetary value segmentation.\n",
13 | "\n",
14 | "### What is RFM segmentation?\n",
15 | "Behavioral customer segmentation based on three metrics:\n",
16 | "- Recency (R): measures how recently each customer made their last purchase\n",
17 | "- Frequency (F): measures how many purchases the customer has made in the last 12 months\n",
18 | "- Monetary Value (M): measures how much the customer has spent in the last 12 months\n",
19 | "\n",
20 | "We will use these values to assign customers to RFM segments. Once we have calculated these numbers, the next step is to group them into some sort of categorization such as high, medium and low. (There are many ways to do this.)\n",
21 | "\n",
22 | "### Grouping RFM values\n",
23 | "\n",
24 | "The RFM values can be grouped in several ways:\n",
25 | "- Percentiles, e.g. quantiles: we can break customers into groups of equal size based on percentile values of each metric\n",
26 | "- Pareto 80/20 cut: we can assign either a high or a low value to each metric based on an 80/20 Pareto split\n",
27 | "- Custom, based on business knowledge: we can use existing knowledge from previous business insights about certain threshold values for each metric\n",
28 | "\n",
29 | "We are going to implement percentile-based grouping.\n",
30 | "\n",
31 | "### Short review of percentiles\n",
32 | "\n",
33 | "Process of calculating percentiles:\n",
34 | "1. Sort customers based on that metric\n",
35 | "2. Break customers into a pre-defined number of groups of equal size\n",
36 | "3. Assign a label to each group\n",
37 | "\n",
38 | "Pandas function for calculating percentiles: qcut()\n",
39 | "\n",
40 | "### Calculate percentiles with Python\n",
41 | "Data with eight CustomerIDs and randomly generated Spend values.\n",
42 | "We are going to implement percentile-based grouping."
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 1,
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "data_example = {'CustomerID': [0,1,2,3,4,5,6,7], 'Spend': [137,335,172,355,303,233,244,229]}"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 2,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "import pandas as pd\n",
61 | "\n",
62 | "data = pd.DataFrame(data_example)"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 3,
68 | "metadata": {},
69 | "outputs": [
70 | {
71 | "data": {
140 | "text/plain": [
141 | " CustomerID Spend\n",
142 | "0 0 137\n",
143 | "1 1 335\n",
144 | "2 2 172\n",
145 | "3 3 355\n",
146 | "4 4 303\n",
147 | "5 5 233\n",
148 | "6 6 244\n",
149 | "7 7 229"
150 | ]
151 | },
152 | "execution_count": 3,
153 | "metadata": {},
154 | "output_type": "execute_result"
155 | }
156 | ],
157 | "source": [
158 | "data"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": 4,
164 | "metadata": {},
165 | "outputs": [
166 | {
167 | "name": "stdout",
168 | "output_type": "stream",
169 | "text": [
170 | " CustomerID Spend Spend_Quartile\n",
171 | "0 0 137 1\n",
172 | "2 2 172 1\n",
173 | "7 7 229 2\n",
174 | "5 5 233 2\n",
175 | "6 6 244 3\n",
176 | "4 4 303 3\n",
177 | "1 1 335 4\n",
178 | "3 3 355 4\n"
179 | ]
180 | }
181 | ],
182 | "source": [
183 | "# Create a spend quartile with 4 groups and labels ranging from 1 through 4 \n",
184 | "spend_quartile = pd.qcut(data['Spend'], q=4, labels=range(1,5))\n",
185 | "\n",
186 | "# Assign the quartile values to the Spend_Quartile column in data\n",
187 | "data['Spend_Quartile'] = spend_quartile\n",
188 | "\n",
189 | "# Print data with sorted Spend values\n",
190 | "print(data.sort_values('Spend'))"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "**Calculate Recency quartiles (q=4)**\n",
198 | "\n",
199 | "We have created a dataset for you with random CustomerID and Recency_Days values as data. You will now use this dataset to group customers into quartiles based on Recency_Days values and assign labels to each of them.\n",
200 | "\n",
201 | "Be cautious about the labels for this exercise. You will see that the labels are inverted, and they will require one additional step of creating them separately. If you need to refresh your memory on the process of creating the labels, check out the slides!"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": 5,
207 | "metadata": {},
208 | "outputs": [],
209 | "source": [
210 | "data = pd.DataFrame({'CustomerID': [0,1,2,3,4,5,6,7], 'Recency_Days': [37,235,396,72,255,393,203,133]})"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": 18,
216 | "metadata": {},
217 | "outputs": [],
218 | "source": [
219 | "# Store labels from 4 to 1 in a decreasing order\n",
220 | "r_labels = list(range(4, 0, -1))\n",
221 | "\n",
222 | "# Create a spend quartile with 4 groups and pass the previously created labels \n",
223 | "recency_quartiles = pd.qcut(data['Recency_Days'], q=4, labels=r_labels)\n",
224 | "\n",
225 | "# Assign the quartile values to the Recency_Quartile column in `data`\n",
226 | "data['Recency_Quartile'] = recency_quartiles "
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": 17,
232 | "metadata": {},
233 | "outputs": [
234 | {
235 | "data": {
313 | "text/plain": [
314 | " CustomerID Recency_Days Recency_Quartile\n",
315 | "0 0 37 4\n",
316 | "3 3 72 4\n",
317 | "7 7 133 3\n",
318 | "6 6 203 3\n",
319 | "1 1 235 2\n",
320 | "4 4 255 2\n",
321 | "5 5 393 1\n",
322 | "2 2 396 1"
323 | ]
324 | },
325 | "execution_count": 17,
326 | "metadata": {},
327 | "output_type": "execute_result"
328 | }
329 | ],
330 | "source": [
331 | "data.sort_values('Recency_Days')"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": 19,
337 | "metadata": {},
338 | "outputs": [
339 | {
340 | "data": {
418 | "text/plain": [
419 | " CustomerID Recency_Days Recency_Quartile\n",
420 | "0 0 37 Active\n",
421 | "3 3 72 Active\n",
422 | "7 7 133 Lapsed\n",
423 | "6 6 203 Lapsed\n",
424 | "1 1 235 Inactive\n",
425 | "4 4 255 Inactive\n",
426 | "5 5 393 Churned\n",
427 | "2 2 396 Churned"
428 | ]
429 | },
430 | "execution_count": 19,
431 | "metadata": {},
432 | "output_type": "execute_result"
433 | }
434 | ],
435 | "source": [
436 | "# Create string labels\n",
437 | "r_labels = ['Active', 'Lapsed', 'Inactive', 'Churned']\n",
438 | "\n",
439 | "# Divide into groups based on quartiles\n",
440 | "recency_quartiles = pd.qcut(data['Recency_Days'], q=4, labels=r_labels)\n",
441 | "\n",
442 | "# Create new column\n",
443 | "data['Recency_Quartile'] = recency_quartiles\n",
444 | "\n",
445 | "# Sort values from lowest to highest\n",
446 | "data.sort_values('Recency_Days')"
447 | ]
448 | },
449 | {
450 | "cell_type": "markdown",
451 | "metadata": {},
452 | "source": [
453 | "### Recency, Frequency, Monetary Value calculation\n",
454 | "\n",
455 | "(Calculate metrics for each customer).\n",
456 | "\n",
457 | "**Definitions**\n",
458 | "- Recency - the number of days since the last customer transaction; the lower it is, the better, since every company wants its customers to be recent and active.\n",
459 | "- Frequency - the number of transactions in the last 12 months, although there are variations such as average monthly transactions which capture the essence of this metric as well\n",
460 | "- Monetary Value - total spend in the last 12 months\n",
461 | "\n",
462 | "One comment though - a 12-month window is a standard choice, but the window can be chosen differently depending on the business model and the lifecycle of the products and customers.\n",
463 | "\n",
464 | "\n",
465 | "**Dataset and preparations**\n",
466 | "- Same online dataset as in the previous lessons\n",
467 | "- We need to do some data preparation before calculating the RFM values (a sketch of the calculation follows the preparation cells below)"
468 | ]
469 | },
470 | {
471 | "cell_type": "code",
472 | "execution_count": 7,
473 | "metadata": {},
474 | "outputs": [],
475 | "source": [
476 | "import numpy as np\n",
477 | "import pandas as pd\n",
478 | "import datetime as dt\n",
479 | "\n",
480 | "online = pd.read_csv('Online Retail.csv', sep=';')\n",
481 | "\n",
482 | "# convert object to datetime\n",
483 | "online['InvoiceDate'] = pd.to_datetime(online['InvoiceDate'])\n",
484 | "\n",
485 | "# convert object to float\n",
486 | "online['UnitPrice'] = online['UnitPrice'].apply(lambda x: x.replace(',', '.'))\n",
487 | "online['UnitPrice'] = online['UnitPrice'].apply(lambda col:pd.to_numeric(col, errors='coerce'))"
488 | ]
489 | },
490 | {
491 | "cell_type": "code",
492 | "execution_count": 8,
493 | "metadata": {},
494 | "outputs": [
495 | {
496 | "data": {
586 | "text/plain": [
587 | " InvoiceNo StockCode Description Quantity \\\n",
588 | "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n",
589 | "1 536365 71053 WHITE METAL LANTERN 6 \n",
590 | "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n",
591 | "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n",
592 | "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n",
593 | "\n",
594 | " InvoiceDate UnitPrice CustomerID Country \n",
595 | "0 2010-01-12 08:26:00 2.55 17850.0 United Kingdom \n",
596 | "1 2010-01-12 08:26:00 3.39 17850.0 United Kingdom \n",
597 | "2 2010-01-12 08:26:00 2.75 17850.0 United Kingdom \n",
598 | "3 2010-01-12 08:26:00 3.39 17850.0 United Kingdom \n",
599 | "4 2010-01-12 08:26:00 3.39 17850.0 United Kingdom "
600 | ]
601 | },
602 | "execution_count": 8,
603 | "metadata": {},
604 | "output_type": "execute_result"
605 | }
606 | ],
607 | "source": [
608 | "online.head()"
609 | ]
610 | },
611 | {
612 | "cell_type": "code",
613 | "execution_count": 9,
614 | "metadata": {},
615 | "outputs": [],
616 | "source": [
617 | "online = online.loc[(online['InvoiceDate'] >= '2010-12-10') & (online['InvoiceDate'] < '2011-12-13')]"
618 | ]
619 | },
620 | {
621 | "cell_type": "code",
622 | "execution_count": 10,
623 | "metadata": {},
624 | "outputs": [
625 | {
626 | "data": {
627 | "text/html": [
628 | "\n",
629 | "\n",
642 | "
\n",
643 | " \n",
644 | " \n",
645 | " | \n",
646 | " InvoiceNo | \n",
647 | " StockCode | \n",
648 | " Description | \n",
649 | " Quantity | \n",
650 | " InvoiceDate | \n",
651 | " UnitPrice | \n",
652 | " CustomerID | \n",
653 | " Country | \n",
654 | "
\n",
655 | " \n",
656 | " \n",
657 | " \n",
658 | " 25281 | \n",
659 | " 538365 | \n",
660 | " 22469 | \n",
661 | " HEART OF WICKER SMALL | \n",
662 | " 8 | \n",
663 | " 2010-12-12 10:11:00 | \n",
664 | " 1.65 | \n",
665 | " 17243.0 | \n",
666 | " United Kingdom | \n",
667 | "
\n",
668 | " \n",
669 | " 25282 | \n",
670 | " 538365 | \n",
671 | " 84030E | \n",
672 | " ENGLISH ROSE HOT WATER BOTTLE | \n",
673 | " 1 | \n",
674 | " 2010-12-12 10:11:00 | \n",
675 | " 4.25 | \n",
676 | " 17243.0 | \n",
677 | " United Kingdom | \n",
678 | "
\n",
679 | " \n",
680 | " 25283 | \n",
681 | " 538365 | \n",
682 | " 22112 | \n",
683 | " CHOCOLATE HOT WATER BOTTLE | \n",
684 | " 3 | \n",
685 | " 2010-12-12 10:11:00 | \n",
686 | " 4.95 | \n",
687 | " 17243.0 | \n",
688 | " United Kingdom | \n",
689 | "
\n",
690 | " \n",
691 | " 25284 | \n",
692 | " 538365 | \n",
693 | " 22835 | \n",
694 | " HOT WATER BOTTLE I AM SO POORLY | \n",
695 | " 5 | \n",
696 | " 2010-12-12 10:11:00 | \n",
697 | " 4.65 | \n",
698 | " 17243.0 | \n",
699 | " United Kingdom | \n",
700 | "
\n",
701 | " \n",
702 | " 25285 | \n",
703 | " 538365 | \n",
704 | " 84029E | \n",
705 | " RED WOOLLY HOTTIE WHITE HEART. | \n",
706 | " 4 | \n",
707 | " 2010-12-12 10:11:00 | \n",
708 | " 3.75 | \n",
709 | " 17243.0 | \n",
710 | " United Kingdom | \n",
711 | "
\n",
712 | " \n",
713 | "
\n",
714 | "
"
715 | ],
716 | "text/plain": [
717 | " InvoiceNo StockCode Description Quantity \\\n",
718 | "25281 538365 22469 HEART OF WICKER SMALL 8 \n",
719 | "25282 538365 84030E ENGLISH ROSE HOT WATER BOTTLE 1 \n",
720 | "25283 538365 22112 CHOCOLATE HOT WATER BOTTLE 3 \n",
721 | "25284 538365 22835 HOT WATER BOTTLE I AM SO POORLY 5 \n",
722 | "25285 538365 84029E RED WOOLLY HOTTIE WHITE HEART. 4 \n",
723 | "\n",
724 | " InvoiceDate UnitPrice CustomerID Country \n",
725 | "25281 2010-12-12 10:11:00 1.65 17243.0 United Kingdom \n",
726 | "25282 2010-12-12 10:11:00 4.25 17243.0 United Kingdom \n",
727 | "25283 2010-12-12 10:11:00 4.95 17243.0 United Kingdom \n",
728 | "25284 2010-12-12 10:11:00 4.65 17243.0 United Kingdom \n",
729 | "25285 2010-12-12 10:11:00 3.75 17243.0 United Kingdom "
730 | ]
731 | },
732 | "execution_count": 10,
733 | "metadata": {},
734 | "output_type": "execute_result"
735 | }
736 | ],
737 | "source": [
738 | "online.head()"
739 | ]
740 | },
741 | {
742 | "cell_type": "code",
743 | "execution_count": 11,
744 | "metadata": {},
745 | "outputs": [
746 | {
747 | "name": "stdout",
748 | "output_type": "stream",
749 | "text": [
750 | "Min:2010-12-12 10:11:00; Max:2011-12-10 17:19:00\n"
751 | ]
752 | }
753 | ],
754 | "source": [
755 | "print('Min:{}; Max:{}'.format(min(online.InvoiceDate),\n",
756 | " max(online.InvoiceDate)))"
757 | ]
758 | },
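    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "The definitions above mention *average monthly transactions* as a variation of the frequency metric. The next cell is a minimal, hedged sketch of that variant - it is not the calculation used later in this chapter - and it assumes the filtered `online` DataFrame created above, with its `CustomerID`, `InvoiceNo` and `InvoiceDate` columns."
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "# Hedged sketch: frequency as average monthly transactions per customer.\n",
    | "# Assumes the filtered `online` DataFrame from the cells above.\n",
    | "monthly_tx = (online\n",
    | "              .assign(InvoiceMonth=online['InvoiceDate'].dt.to_period('M'))\n",
    | "              .groupby(['CustomerID', 'InvoiceMonth'])['InvoiceNo']\n",
    | "              .nunique())\n",
    | "\n",
    | "# Average number of distinct invoices per active month, for each customer\n",
    | "avg_monthly_frequency = monthly_tx.groupby('CustomerID').mean()\n",
    | "print(avg_monthly_frequency.head())"
    | ]
    | },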
759 | {
760 | "cell_type": "code",
761 | "execution_count": 12,
762 | "metadata": {},
763 | "outputs": [],
764 | "source": [
765 | "# Let's create a hypothetical snapshot_day data as if we're doing analysis recently\n",
766 | "snapshot_date = max(online.InvoiceDate) + dt.timedelta(days=1)"
767 | ]
768 | },
769 | {
770 | "cell_type": "code",
771 | "execution_count": 13,
772 | "metadata": {},
773 | "outputs": [
774 | {
775 | "data": {
776 | "text/plain": [
777 | "Timestamp('2011-12-11 17:19:00')"
778 | ]
779 | },
780 | "execution_count": 13,
781 | "metadata": {},
782 | "output_type": "execute_result"
783 | }
784 | ],
785 | "source": [
786 | "snapshot_date"
787 | ]
788 | },
789 | {
790 | "cell_type": "markdown",
791 | "metadata": {},
792 | "source": [
793 | "**Building RFM segments**\n",
794 | "\n",
795 | "- Cuartiles: 4 segmentos del mismo tamaño\n",
796 | "- Recientes: ordenar los clientes en orden descentente, porque se considera mejor cuando el valor es menor\n",
797 | "- Frecuency and monetary quartiles: ordenar en forma ascentender porque se considera mejor cuando el valor es más alto."
798 | ]
799 | },
800 | {
801 | "cell_type": "markdown",
802 | "metadata": {},
803 | "source": [
804 | "The online dataset has already been pre -processed and only includes the recent 12 months of data. In the real world, we would be working with the most recent snapshot of the data of today or yesterday, but in this case the data comes from 2010 and 2011, so we have to create hypothetical snapshot date that we'll use as a starting point to calculate metrics as if we're doing the analysis on the most recent data.\n",
805 | "\n",
806 | "With days equal=1 argument we create a period of 1 day which we can then add to our date.\n",
807 | "\n",
808 | "We aggregate the data on a Customer level, and calculate three metrics: we used the InvoceDate and pass it to the lambda function, and then take a difference between our snapshot date - which would be today in the real world - and the most recent or max() invoice date, this fives us the number of days between hypothetical today and the last transaction."
809 | ]
810 | },
811 | {
812 | "cell_type": "markdown",
813 | "metadata": {},
814 | "source": [
815 | "### Nota:\n",
816 | "Arreglar el siguiente agrupamiento como funciones separadas para poder crear las funciones nuevas.\n",
817 | "\n",
818 | "https://code.i-harness.com/es/q/1231cb5\n",
819 | "\n",
820 | "https://www.e-learn.cn/content/wangluowenzhang/79068\n",
821 | "\n",
822 | "https://stackoverrun.com/es/q/12261730"
823 | ]
824 | },
825 | {
826 | "cell_type": "code",
827 | "execution_count": null,
828 | "metadata": {},
829 | "outputs": [],
830 | "source": [
831 | "# Aggregate data on a customer level\n",
832 | "datamart = online.groupby(['CustomerID']).agg({\n",
833 | " 'InvoiceDate': lambda x: (snapshot_date - x.max()).days,\n",
834 | " 'InvoiceNo': 'count',\n",
835 | " 'TotalSum': 'sum'})\n",
836 | "\n",
837 | "# Rename columns for easier interpretation\n",
838 | "datamart.rename(columns = {'InvoiceDate': 'Recency',\n",
839 | " 'InvoiceNo': 'Frequency',\n",
840 | " 'TotalSum': 'MonetaryValue'}, inplace=True)\n",
841 | "\n",
842 | "# Check the first rows\n",
843 | "datamart.head()"
844 | ]
845 | },
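    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "Following the note above, one hedged way to refactor the aggregation is to express each RFM metric as a small named function and pass the functions through pandas named aggregation (available in pandas 0.25+). This is only a sketch under those assumptions: it reuses the `online` DataFrame, the `snapshot_date` and the `TotalSum` column from the cell above, and produces the same `Recency`, `Frequency` and `MonetaryValue` columns."
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "# Hedged sketch: the same RFM aggregation, refactored into separate functions\n",
    | "# so new metrics can be added without touching the groupby call.\n",
    | "def recency(invoice_dates):\n",
    | "    # Days between the snapshot date and the most recent purchase\n",
    | "    return (snapshot_date - invoice_dates.max()).days\n",
    | "\n",
    | "def frequency(invoice_numbers):\n",
    | "    # Number of transactions (invoice line items, as in the cell above)\n",
    | "    return invoice_numbers.count()\n",
    | "\n",
    | "def monetary_value(total_sums):\n",
    | "    # Total spend\n",
    | "    return total_sums.sum()\n",
    | "\n",
    | "datamart_alt = online.groupby('CustomerID').agg(\n",
    | "    Recency=('InvoiceDate', recency),\n",
    | "    Frequency=('InvoiceNo', frequency),\n",
    | "    MonetaryValue=('TotalSum', monetary_value))\n",
    | "\n",
    | "datamart_alt.head()"
    | ]
    | },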
846 | {
847 | "cell_type": "code",
848 | "execution_count": null,
849 | "metadata": {},
850 | "outputs": [],
851 | "source": [
852 | "# Create labels for Recency and Frequency\n",
853 | "r_labels = range(3, 0, -1); f_labels = range(1, 4)\n",
854 | "\n",
855 | "# Assign these labels to three equal percentile groups \n",
856 | "r_groups = pd.qcut(datamart['Recency'], q=3, labels=r_labels)\n",
857 | "\n",
858 | "# Assign these labels to three equal percentile groups \n",
859 | "f_groups = pd.qcut(datamart['Frequency'], q=3, labels=f_labels)\n",
860 | "\n",
861 | "# Create new columns R and F \n",
862 | "datamart = datamart.assign(R=r_groups.values, F=f_groups.values)"
863 | ]
864 | },
865 | {
866 | "cell_type": "markdown",
867 | "metadata": {},
868 | "source": [
869 | "Calcular puntuación RFM\n",
870 | "\n",
871 | "Asignar a los clientes tres grupos según los percentiles de Valor Monetario y luego calculará un RFM_Score, que es una suma de los valores R, F y M."
872 | ]
873 | },
874 | {
875 | "cell_type": "code",
876 | "execution_count": null,
877 | "metadata": {},
878 | "outputs": [],
879 | "source": [
880 | "# Create labels for MonetaryValue\n",
881 | "m_labels = range(1, 4)\n",
882 | "\n",
883 | "# Assign these labels to three equal percentile groups \n",
884 | "m_groups = pd.qcut(datamart['MonetaryValue'], q=3, labels=m_labels)\n",
885 | "\n",
886 | "# Create new column M\n",
887 | "datamart = datamart.assign(M=m_groups.values)\n",
888 | "\n",
889 | "# Calculate RFM_Score\n",
890 | "datamart['RFM_Score'] = datamart[['R','F','M']].sum(axis=1)\n",
891 | "print(datamart['RFM_Score'].head())"
892 | ]
893 | },
894 | {
895 | "cell_type": "markdown",
896 | "metadata": {},
897 | "source": [
898 | "Create segments named Top, Middle, Low. If the RFM score is greater than or equal to 10, the level should be \"Top\". If it's between 6 and 10 it should be \"Middle\", and otherwise it should be \"Low\""
899 | ]
900 | },
901 | {
902 | "cell_type": "code",
903 | "execution_count": null,
904 | "metadata": {},
905 | "outputs": [],
906 | "source": [
907 | "# Define rfm_level function\n",
908 | "def rfm_level(df):\n",
909 | " if df['RFM_Score'] >= 10:\n",
910 | " return 'Top'\n",
911 | " elif (df['RFM_Score'] >= 6) and (df['RFM_Score'] < 10):\n",
912 | " return 'Middle'\n",
913 | " else:\n",
914 | " return 'Low'\n",
915 | "\n",
916 | "# Create a new variable RFM_Level\n",
917 | "datamart['RFM_Level'] = datamart.apply(rfm_level, axis=1)\n",
918 | "\n",
919 | "# Print the header with top 5 rows to the console\n",
920 | "print(datamart.head())"
921 | ]
922 | },
923 | {
924 | "cell_type": "code",
925 | "execution_count": null,
926 | "metadata": {},
927 | "outputs": [],
928 | "source": [
929 | "# Calculate average values for each RFM_Level, and return a size of each segment \n",
930 | "rfm_level_agg = datamart.groupby('RFM_Level').agg({\n",
931 | " 'Recency': 'mean',\n",
932 | " 'Frequency': 'mean',\n",
933 | " \n",
934 | " # Return the size of each segment\n",
935 | " 'MonetaryValue': ['mean', 'count']\n",
936 | "}).round(1)\n",
937 | "\n",
938 | "# Print the aggregated dataset\n",
939 | "print(rfm_level_agg)"
940 | ]
941 | },
942 | {
943 | "cell_type": "code",
944 | "execution_count": null,
945 | "metadata": {},
946 | "outputs": [],
947 | "source": [
948 | "Frequency MonetaryValue Recency\n",
949 | " mean mean count mean\n",
950 | "RFM_Level \n",
951 | "Low 3.2 52.7 1075 180.8\n",
952 | "Middle 10.7 202.9 1547 73.9\n",
953 | "Top 47.1 959.7 1021 20.3\n"
954 | ]
955 | },
956 | {
957 | "cell_type": "code",
958 | "execution_count": null,
959 | "metadata": {},
960 | "outputs": [],
961 | "source": []
962 | }
963 | ],
964 | "metadata": {
965 | "kernelspec": {
966 | "display_name": "Python 3",
967 | "language": "python",
968 | "name": "python3"
969 | },
970 | "language_info": {
971 | "codemirror_mode": {
972 | "name": "ipython",
973 | "version": 3
974 | },
975 | "file_extension": ".py",
976 | "mimetype": "text/x-python",
977 | "name": "python",
978 | "nbconvert_exporter": "python",
979 | "pygments_lexer": "ipython3",
980 | "version": "3.7.0"
981 | }
982 | },
983 | "nbformat": 4,
984 | "nbformat_minor": 2
985 | }
986 |
--------------------------------------------------------------------------------
/Chapter3-Data-pre-processing-for-clustering.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Data pre-processing for clustering\n",
8 | "\n",
9 | "Learn practical data preparation methods to ensure the k-means clustering algorithm uncovers well-separated segments\n",
10 | "\n",
11 | "### Advantages of k-means clustering\n",
12 | "- One of the most popular unsupervised learning method\n",
13 | "- Simple and fast\n",
14 | "- Works well*\n",
15 | "* _with certain assumptions about the data_\n",
16 | "\n",
17 | "### Key k-means assumptions\n",
18 | "- Symmetric distribution of variables (not skewed): Las variables tienen distribución simetrica\n",
19 | "- Variables with same average values: El segundo supuesto es que todas las variables tienen los mismos valores promedio. Esto es clave para garantizar que cada métrica tenga un peso igual en el cálculo de k-means. \n",
20 | "- Variables with same variance\n",
21 | "\n",
22 | "k-means assumes equal mean and equal variance, no es el caso de RFM.\n",
23 | "\n",
24 | "### Sequence\n",
25 | "1. Unskew the data - log transformation\n",
26 | "2. Standardize to the same average values\n",
27 | "3. Scale to the same standard deviation\n",
28 | "4. Store as a separate array to be used for clustering\n"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": null,
34 | "metadata": {},
35 | "outputs": [],
36 | "source": [
37 | "# Center the data by subtracting average values from each entry\n",
38 | "data_centered = data - data.mean()\n",
39 | "\n",
40 | "# Scale the data by dividing each entry by standard deviation\n",
41 | "data_scaled = data / data.std()\n",
42 | "\n",
43 | "# Normalize the data by applying both centering and scaling\n",
44 | "data_normalized = (data - data.mean()) / data.std()\n",
45 | "\n",
46 | "# Print summary statistics to make sure average is zero and standard deviation is one\n",
47 | "print(data_normalized.describe().round(2))"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | " var1 var2 var3\n",
57 | " count 100.00 100.00 100.00\n",
58 | " mean 0.00 -0.00 0.00\n",
59 | " std 1.00 1.00 1.00\n",
60 | " min -1.66 -0.73 -0.36\n",
61 | " 25% -0.88 -0.51 -0.36\n",
62 | " 50% -0.02 -0.29 -0.33\n",
63 | " 75% 0.96 0.11 -0.20\n",
64 | " max 1.60 5.18 6.26"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": null,
70 | "metadata": {},
71 | "outputs": [],
72 | "source": [
73 | "# Initialize a scaler\n",
74 | "scaler = StandardScaler()\n",
75 | "\n",
76 | "# Fit the scaler\n",
77 | "scaler.fit(data)\n",
78 | "\n",
79 | "# Scale and center the data\n",
80 | "data_normalized = scaler.transform(data)\n",
81 | "\n",
82 | "# Create a pandas DataFrame\n",
83 | "data_normalized = pd.DataFrame(data_normalized, index=data.index, columns=data.columns)\n",
84 | "\n",
85 | "# Print summary statistics\n",
86 | "print(data_normalized.describe().round(2))"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": null,
92 | "metadata": {},
93 | "outputs": [],
94 | "source": [
95 | "# Unskew the data\n",
96 | "datamart_log = np.log(datamart_rfm)\n",
97 | "\n",
98 | "# Initialize a standard scaler and fit it\n",
99 | "scaler = StandardScaler()\n",
100 | "scaler.fit(datamart_log)\n",
101 | "\n",
102 | "# Scale and center the data\n",
103 | "datamart_normalized = scaler.transform(datamart_log)\n",
104 | "\n",
105 | "# Create a pandas DataFrame\n",
106 | "datamart_normalized = pd.DataFrame(data=datamart_normalized, index=datamart_rfm.index, columns=datamart_rfm.columns)"
107 | ]
108 | }
109 | ],
110 | "metadata": {
111 | "kernelspec": {
112 | "display_name": "Python 3",
113 | "language": "python",
114 | "name": "python3"
115 | },
116 | "language_info": {
117 | "codemirror_mode": {
118 | "name": "ipython",
119 | "version": 3
120 | },
121 | "file_extension": ".py",
122 | "mimetype": "text/x-python",
123 | "name": "python",
124 | "nbconvert_exporter": "python",
125 | "pygments_lexer": "ipython3",
126 | "version": "3.7.0"
127 | }
128 | },
129 | "nbformat": 4,
130 | "nbformat_minor": 2
131 | }
132 |
--------------------------------------------------------------------------------
/Chapter4-Customer-Segmentation-with-K-means.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Customer Segmentation with K-means\n",
8 | "\n",
9 | "Use data from previous chapter to build customer segments based on their recency, frequency, and monetary value.\n",
10 | "\n",
11 | "\n",
12 | "#### Key steps\n",
13 | "- Data pre-processing\n",
14 | "- Choosing a number of clusters\n",
15 | "- Running k-means clustering on pre-processed data\n",
16 | "- Analyzing average RFM values of each cluster\n",
17 | "\n",
18 | "#### Methods to define the number of clusters\n",
19 | "- Visual methods - elbow criterion\n",
20 | "- Mathematical methods - silhouette coefficient\n",
21 | "- Experimentation and interpretation\n",
22 | "\n"
23 | ]
24 | },
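    | {
    | "cell_type": "markdown",
    | "metadata": {},
    | "source": [
    | "The cell below is a minimal, hedged sketch of the key steps listed above. It assumes `datamart_rfm` (the customer-level RFM table) and `datamart_normalized` (its log-transformed, scaled version from Chapter 3) are already available, and the choice of 3 clusters is only an example to be validated with the elbow and silhouette checks."
    | ]
    | },
    | {
    | "cell_type": "code",
    | "execution_count": null,
    | "metadata": {},
    | "outputs": [],
    | "source": [
    | "# Hedged sketch of the workflow above; assumes `datamart_normalized` and\n",
    | "# `datamart_rfm` from Chapter 3 are available in memory.\n",
    | "from sklearn.cluster import KMeans\n",
    | "from sklearn.metrics import silhouette_score\n",
    | "\n",
    | "# Elbow criterion: sum of squared errors (inertia) for a range of k,\n",
    | "# with the silhouette coefficient as a mathematical cross-check\n",
    | "sse = {}\n",
    | "for k in range(2, 11):\n",
    | "    kmeans = KMeans(n_clusters=k, random_state=1).fit(datamart_normalized)\n",
    | "    sse[k] = kmeans.inertia_\n",
    | "    print(k, round(silhouette_score(datamart_normalized, kmeans.labels_), 3))\n",
    | "\n",
    | "# Run k-means with an example choice of 3 clusters and profile the segments\n",
    | "kmeans = KMeans(n_clusters=3, random_state=1).fit(datamart_normalized)\n",
    | "cluster_profile = datamart_rfm.assign(Cluster=kmeans.labels_).groupby('Cluster').agg({\n",
    | "    'Recency': 'mean',\n",
    | "    'Frequency': 'mean',\n",
    | "    'MonetaryValue': ['mean', 'count']}).round(1)\n",
    | "print(cluster_profile)"
    | ]
    | },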
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "Links revisar en orden:\n",
30 | "- https://www.ritchieng.com/machine-learning-project-customer-segments/\n",
31 | "- https://www.learndatasci.com/tutorials/k-means-clustering-algorithms-python-intro/\n",
32 | "\n",
33 | "\n",
34 | "- https://www.toptal.com/machine-learning/clustering-algorithms\n",
35 | "\n",
36 | "\n",
37 | "- https://towardsdatascience.com/an-introduction-to-clustering-algorithms-in-python-123438574097 \n",
38 | "\n",
39 | "- http://www.aprendemachinelearning.com/k-means-en-python-paso-a-paso/\n",
40 | "\n",
41 | "- https://www.kaggle.com/dhanyajothimani/basic-visualization-and-clustering-in-python\n",
42 | "\n",
43 | "- https://www.kaggle.com/fabiendaniel/customer-segmentation\n",
44 | "\n",
45 | "- http://blog.yhat.com/posts/customer-segmentation-using-python.html \n",
46 | "\n",
47 | "- https://www.datacamp.com/community/tutorials/introduction-customer-segmentation-python\n",
48 | "\n",
49 | "-https://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions45/ClusterAnalysisReading.html\n",
50 | "\n",
51 | "- https://sajalsharma.com/portfolio/customer_segments\n",
52 | "\n",
53 | "- https://towardsdatascience.com/clustering-algorithms-for-customer-segmentation-af637c6830ac\n",
54 | "\n",
55 | "- https://towardsdatascience.com/find-your-best-customers-with-customer-segmentation-in-python-61d602f9eee6\n",
56 | "\n",
57 | "- https://www.ritchieng.com/machine-learning-project-customer-segments/\n",
58 | "\n",
59 | "- https://github.com/piyush2896/CustomerSegments/blob/master/customer_segments.ipynb\n",
60 | "\n",
61 | "- https://www.cyborgus.com/blog/16655/2017-04-03-top-mistakes-data-scientists-make\n",
62 | "\n",
63 | "- https://www.cyborgus.com/blog/16653/2017-03-19-what-makes-a-great-data-scientist\n",
64 | "\n",
65 | "- https://www.cyborgus.com/blog/16652/2017-03-13-think-like-data-scientist\n"
66 | ]
67 | }
68 | ],
69 | "metadata": {
70 | "kernelspec": {
71 | "display_name": "Python 3",
72 | "language": "python",
73 | "name": "python3"
74 | },
75 | "language_info": {
76 | "codemirror_mode": {
77 | "name": "ipython",
78 | "version": 3
79 | },
80 | "file_extension": ".py",
81 | "mimetype": "text/x-python",
82 | "name": "python",
83 | "nbconvert_exporter": "python",
84 | "pygments_lexer": "ipython3",
85 | "version": "3.7.0"
86 | }
87 | },
88 | "nbformat": 4,
89 | "nbformat_minor": 2
90 | }
91 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Customer Segmentation in Python
2 |
3 | ## Course Description
4 |
5 | The most successful companies today are the ones that know their customers so well that they can anticipate their needs. Data analysts play a key role in unlocking these in-depth insights, and segmenting the customers to better serve them. In this course, you will learn real-world techniques on customer segmentation and behavioral analytics, using a real dataset containing customer transactions from an online retailer. You will first identify which products are frequently bought together. Then, you will run cohort analysis to understand customer trends. On top of that, you will learn how to build easy-to-interpret customer segments. Finally, you will make your segments more powerful with k-means clustering, in just a few lines of code! By the end of this course, you will be able to apply practical customer behavioral analytics and segmentation techniques.
6 |
7 | ### Traducción:
8 |
9 | Las empresas más exitosas de hoy son las que conocen tan bien a sus clientes que pueden anticipar sus necesidades. Los analistas de datos desempeñan un papel clave en el desbloqueo de estos conocimientos en profundidad y en la segmentación de los clientes para atenderlos mejor.
10 | En este curso, aprenderá técnicas del mundo real sobre segmentación de clientes y análisis de comportamiento, utilizando un conjunto de datos real que contiene transacciones de clientes de un minorista en línea. Primero identificarás qué productos se compran frecuentemente juntos. Luego, realizará un análisis de cohorte para comprender las tendencias de los clientes. Además de eso, aprenderá cómo crear segmentos de clientes fáciles de interpretar. Finalmente, hará que sus segmentos sean más potentes con el agrupamiento de k-means, ¡en solo unas pocas líneas de código! Al final de este curso, podrá aplicar técnicas prácticas de análisis de comportamiento y segmentación del cliente.
11 |
12 | Link: [customer-segmentation-in-python](https://www.datacamp.com/courses/customer-segmentation-in-python)
13 |
14 | ### Previous definition
15 |
16 | In statistics, marketing and demography, a cohort is a group of subjects who share a defining characteristic (typically subjects who experienced a common event in a selected time period, such as birth or graduation). Cohort data can oftentimes be more advantageous to demographers than period data. Because cohort data is honed to a specific time period, it is usually more accurate. It is more accurate because it can be tuned to retrieve custom data for a specific study.
17 |
18 | Source: [wikipedia](https://en.wikipedia.org/wiki/Cohort_(statistics))
19 |
20 | ## Dataset
21 |
22 | ### Source:
23 |
24 | Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
25 |
26 | ### Data Set Information:
27 |
28 | [Online Retail Data Set](https://archive.ics.uci.edu/ml/datasets/Online%20Retail)
29 |
30 | This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
31 |
32 | ### Attribute Information:
33 |
34 | - InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
35 | - StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
36 | - Description: Product (item) name. Nominal.
37 | - Quantity: The quantities of each product (item) per transaction. Numeric.
38 | - InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
39 | - UnitPrice: Unit price. Numeric, Product price per unit in sterling.
40 | - CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
41 | - Country: Country name. Nominal, the name of the country where each customer resides.
42 |
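    | A minimal, hedged example of loading this dataset with pandas (the bundled `Online Retail.csv` used by the notebooks is semicolon-separated and uses comma decimal marks):
    |
    | ```python
    | import pandas as pd
    |
    | # Load the transactions; the bundled CSV is semicolon-separated
    | online = pd.read_csv('Online Retail.csv', sep=';')
    |
    | # Parse dates and convert comma-decimal prices to floats
    | online['InvoiceDate'] = pd.to_datetime(online['InvoiceDate'])
    | online['UnitPrice'] = pd.to_numeric(
    |     online['UnitPrice'].str.replace(',', '.', regex=False), errors='coerce')
    |
    | print(online.head())
    | ```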
43 |
44 | ## Chapter 1: Cohort Analysis
45 |
46 | Understand customers based on their unique behavioral attributes. Cohort analysis provides deeper insights than the so-called vanity metrics. It helps with understanding the high level trends better by providing insights on metrics across both the product and the customer lifecycle.
47 |
48 | ## Chapter 2: Recency, Frequency, Monetary Value analysis
49 |
50 | Understand customers based on their unique behavioral attributes
51 |
52 | ## Chapter 3: Data pre-processing for clustering
53 |
54 | Learn practical data preparation methods to ensure the k-means clustering algorithm uncovers well-separated segments
55 |
56 | ## Chapter 4: Customer Segmentation with K-means
57 |
58 | Use data from the previous chapter to build customer segments based on their recency, frequency, and monetary value
59 |
60 | ## Prerequisites
61 |
62 | - pandas library
63 | - datetime objects
64 | - basic plotting with matplotlib or seaborn
65 | - basic knowledge of k-means clustering
--------------------------------------------------------------------------------
/RFM_values.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iris9112/Customer-Segmentation/a67d2e3da7234e3d27da62e5c29bc7ad20fff5c0/RFM_values.PNG
--------------------------------------------------------------------------------
/pdfs/chapter1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iris9112/Customer-Segmentation/a67d2e3da7234e3d27da62e5c29bc7ad20fff5c0/pdfs/chapter1.pdf
--------------------------------------------------------------------------------
/pdfs/chapter2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iris9112/Customer-Segmentation/a67d2e3da7234e3d27da62e5c29bc7ad20fff5c0/pdfs/chapter2.pdf
--------------------------------------------------------------------------------
/pdfs/chapter3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iris9112/Customer-Segmentation/a67d2e3da7234e3d27da62e5c29bc7ad20fff5c0/pdfs/chapter3.pdf
--------------------------------------------------------------------------------
/pdfs/chapter4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iris9112/Customer-Segmentation/a67d2e3da7234e3d27da62e5c29bc7ad20fff5c0/pdfs/chapter4.pdf
--------------------------------------------------------------------------------