├── .gitignore
├── an-embarrassment-of-pandas.pdf
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | README_PANDOC.md
2 | 


--------------------------------------------------------------------------------
/an-embarrassment-of-pandas.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Gaarv/an-embarrassment-of-pandas/master/an-embarrassment-of-pandas.pdf


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # An Embarrassment of Pandas
  2 | 
  3 | ![group-of-pandas](https://i.imgur.com/wPnWB5e.jpg)
  4 | 
  5 | Why an embarrassment? Because it's the name for a [group of pandas!](https://www.reference.com/pets-animals/group-pandas-called-71cd65ea758ca2e2)
  6 | 
  7 | * [DataFrames](#dataframes)
  8 | * [Series](#series)
  9 | * [Missing Values](#missing-values)
 10 | * [Method Chaining](#method-chaining)
 11 | * [Aggregation](#aggregation)
 12 | * [New Columns](#new-columns)
 13 | * [Feature Engineering](#feature-engineering)
 14 | * [Random](#random)
 15 | 
 16 | ## DataFrames
 17 | 
 18 | * Options - [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html)
 19 | ```python
 20 | # More columns
 21 | pd.set_option("display.max_columns", 500)
 22 | 
 23 | # More rows
 24 | pd.set_option("display.max_rows", 500)
 25 | 
 26 | # Floating point precision
 27 | pd.set_option("display.precision", 3)
 28 | 
 29 | # Increase column width
 30 | pd.set_option("max_colwidth", 50)
 31 | 
 32 | # Change default plotting backend - Pandas >= 0.25
 33 | # https://github.com/PatrikHlobil/Pandas-Bokeh
 34 | pd.set_option("plotting.backend", 'pandas_bokeh')
 35 | ```
 36 | 
 37 | * Useful `read_csv()` options - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
 38 | ```python
 39 | pd.read_csv(
 40 |     "data.csv.gz",
 41 |     delimiter = "^",
 42 |     # line numbers to skip (i.e. headers in an excel report)
 43 |     skiprows = 2,
 44 |     # used to denote the start and end of a quoted item
 45 |     quotechar = "|",
 46 |     # return a subset of columns
 47 |     usecols = ["return_date", "company", "sales"],
 48 |     # data type for data or columns
 49 |     dtype = { "sales": np.float64 },
 50 |     # additional strings to recognize as NA/NaN
 51 |     na_values = [".", "?"],
 52 |     # convert to datetime, instead of object
 53 |     parse_dates = ["return_date"],
 54 |     # for on-the-fly decompression of on-disk data
 55 |     # options - gzip, bz2, zip, xz
 56 |     compression = "gzip",
 57 |     # encoding to use for reading
 58 |     encoding = "latin1",
 59 |     # read in a subset of data
 60 |     nrows = 100
 61 | )
 62 | ```
 63 | 
 64 | * Read csv from URL or S3 - [s3fs](https://github.com/dask/s3fs/)
 65 | ```python
 66 | pd.read_csv("https://bit.ly/2KyxTFn")
 67 | 
 68 | # Requires s3fs library
 69 | pd.read_csv("s3://pandas-test/tips.csv")
 70 | ```
 71 | 
 72 | * Read an Excel file - [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#excel-files)
 73 | ```python
 74 | pd.read_excel("numbers.xlsx", sheet_name="Sheet1")
 75 | 
 76 | # Multiple sheets with varying parameters
 77 | with pd.ExcelFile("numbers.xlsx") as xlsx:
 78 |     df1 = pd.read_excel(xlsx, "Sheet1", na_values=["?"])
 79 |     df2 = pd.read_excel(xlsx, "Sheet2", na_values=[".", "Missing"])
 80 | ```
 81 | 
 82 | * Read multiple files at once - [glob](https://docs.python.org/3/library/glob.html)
 83 | ```python
 84 | import glob
 85 | 
 86 | # ignore_index = True to avoid duplicate index values
 87 | df = pd.concat([pd.read_csv(f) for f in glob.glob("*.csv")], ignore_index = True)
 88 | 
 89 | # More options
 90 | df = pd.concat([pd.read_csv(f, encoding = "latin1") for f in glob.glob("*.csv")])
 91 | ```
 92 | 
 93 | * Recursively grab all files in a directory
 94 | ```python
 95 | import os
 96 | import glob
 97 | 
 98 | files = [os.path.join(root, file)
 99 |         for root, dir, files in os.walk("./directory")
100 |         for file in glob.glob("*.csv")]
101 | ```
102 | 
103 | * Read in data from SQLite3
104 | ```python
105 | import sqlite3
106 | 
107 | conn = sqlite3.connect("flights.db")
108 | df = pd.read_sql_query("select * from airlines", conn)
109 | conn.close()
110 | ```
111 | 
112 | * Read in data from Postgres - [bigquery](https://googleapis.github.io/google-cloud-python/latest/bigquery/index.html), [snowflake](https://docs.snowflake.net/manuals/user-guide/sqlalchemy.html#snowflake-connector-for-python)
113 | ```python
114 | from sqlalchemy import create_engine
115 | 
116 | # Port 5439 for Redshift
117 | engine = create_engine("postgresql://user@localhost:5432/mydb")
118 | df = pd.read_sql_query("select * from airlines", engine)
119 | 
120 | # Get results in chunks
121 | for chunk in pd.read_sql_query("select * from airlines", engine, chunksize=5):
122 |     print(chunk)
123 | 
124 | # Writing back
125 | df.to_sql(
126 |     "table"
127 |     schema="schema"
128 |     # fail, replace or append
129 |     if_exists="append",
130 |     # write back in chunks
131 |     chunksize = 10000
132 | )
133 | ```
134 | 
135 | * Normalizing nested JSON - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html)
136 | ```python
137 | from pandas.io.json import json_normalize
138 | 
139 | json_normalize(data, "counties", ["state", "shortname", ["info", "governor"]])
140 | 
141 | # How deep to normalize - Pandas >= 0.25
142 | json_normalize(data, max_level=1)
143 | ```
144 | 
145 | * Column headers
146 | ```python
147 | # Lower all values
148 | df.columns = [x.lower() for x in df.columns]
149 | 
150 | # Strip out punctuation, replace spaces and lower
151 | df.columns = df.columns.str.replace("[^\w\s]", "").str.replace(" ", "_").str.lower()
152 | 
153 | # Condense multiindex columns
154 | df.columns = ["_".join(col).lower() for col in df.columns]
155 | 
156 | # Double transpose to remove bottom row for multiindex columns
157 | df.T.reset_index(1, drop=True).T
158 | ```
159 | 
160 | * Filtering DataFrame - using `pd.Series.isin()`
161 | ```python
162 | df[df["dimension"].isin(["A", "B", "C"])]
163 | 
164 | # not in
165 | df[~df["dimension"].isin(["A", "B", "C"])]
166 | ```
167 | 
168 | * Filtering DataFrame - using `pd.Series.str.contains()`
169 | ```python
170 | df[df["dimension"].str.contains("word")]
171 | 
172 | # not in
173 | df[~df["dimension"].str.contains("word")]
174 | ```
175 | 
176 | * Filtering DataFrame & more - using `df.query()` - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html)
177 | ```python
178 | df.query("salary > 100000")
179 | 
180 | df.query("name == 'john'")
181 | 
182 | df.query("name == 'john' | name == 'jack'")
183 | 
184 | df.query("name == 'john' and salary > 100000")
185 | 
186 | df.query("name.str.contains('a')")
187 | 
188 | # Grab top 1% of earners
189 | df.query("salary > salary.quantile(.99)")
190 | 
191 | # Make more than the mean
192 | df.query("salary > salary.mean()")
193 | 
194 | # Subset by top 3 most frequent products purchased
195 | df.query("item in item.value_counts().nlargest(3).index")
196 | 
197 | # Query for null values
198 | df.query("column.isnull()")
199 | 
200 | # Query for non-nulls
201 | df.query("column.notnull()")
202 | 
203 | # @ - allows you to refer to variables in the environment
204 | names = ["john", "fred", "jack"]
205 | df.query("name in @names")
206 | 
207 | # Reference columns with spaces using backticks - Pandas >= 0.25
208 | df.query("`Total Salary` > 100000")
209 | ```
210 | 
211 | * Joining - [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
212 | ```python
213 | # Inner join
214 | pd.merge(df1, df2, on = "key")
215 | 
216 | # Left join on different key names
217 | pd.merge(df1, df2, right_on = ["right_key"], left_on = ["left_key"], how = "left")
218 | ```
219 | 
220 | * Anti-Join
221 | ```python
222 | def anti_join(x, y, on):
223 |     """Return rows in x which are not present in y"""
224 |     ans = pd.merge(left=x, right=y, how='left', indicator=True, on=on)
225 |     ans = ans.loc[ans._merge == 'left_only', :].drop(columns='_merge')
226 |     return ans
227 | ```
228 | 
229 | * Select columns based on data type
230 | ```python
231 | df.select_dtypes(include = "number")
232 | df.select_dtypes(exclude = "object")
233 | ```
234 | 
235 | * Apply function to multiple columns of the same data type
236 | ```python
237 | # Specify columns, so DataFrame isn't overwritten
238 | df[["first_name", "last_name", "email"]] = df.select_dtypes(
239 |     include = "object").apply(lambda x: x.str.lower()
240 | )
241 | ```
242 | 
243 | * Reverse column order
244 | ```python
245 | df.loc[:, ::-1]
246 | ```
247 | 
248 | * Correlation matrix
249 | ```python
250 | df.corr()
251 | 
252 | # With another DataFrame
253 | df.corrwith(df_2)
254 | ```
255 | 
256 | * Descriptive statistics
257 | ```python
258 | df.describe(include=[np.number]).T
259 | 
260 | dims = df.describe(include=[pd.Categorical]).T
261 | 
262 | # Add percent frequency for top dimension
263 | dims["frequency"] = dims["freq"].div(dims["count"])
264 | ```
265 | 
266 | * Styling numeric columns - [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html)
267 | ```python
268 | styling_options = {"sales": "${0:,.0f}", "percent_of_sales": "{:.2%f}"}
269 | 
270 | df.style.format(styling_options)
271 | ```
272 | 
273 | * Add highlighting for max and min values
274 | ```python
275 | df.style.highlight_max(color = "lightgreen").highlight_min(color = "red")
276 | ```
277 | 
278 | * Conditional formatting for one column
279 | ```python
280 | df.style.background(subset = ["measure"], cmap = "viridis")
281 | ```
282 | 
283 | ## Series
284 | 
285 | * Value counts as percentages
286 | ```python
287 | # See NaNs as well
288 | df["meaure"].value_counts(normalize = True, dropna = False)
289 | ```
290 | 
291 | * Replacing errant characters
292 | ```python
293 | df["sales"].str.replace("$", "")
294 | ```
295 | 
296 | * Replacing false conditions - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html)
297 | ```python
298 | df["steps_walked"].where(df["steps_walked"] > 0, 0)
299 | ```
300 | 
301 | ## Missing Values
302 | 
303 | * Percent nulls by column
304 | ```python
305 | (df.isnull().sum() / df.isnull().count()).sort_values(ascending=False)
306 | ```
307 | 
308 | * Dropping columns - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
309 | ```python
310 | df.drop(["column_a", "column_b"], axis = 1)
311 | ```
312 | 
313 | * Dropping duplicate rows - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html#pandas.DataFrame.drop_duplicates)
314 | ```python
315 | df.drop_duplicates(subset=["order_date", "product"], keep="first")
316 | ```
317 | 
318 | * Dropping columns based on NaN threshold - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)
319 | ```python
320 | # Any column with 90% missing values will be dropped
321 | df.dropna(thresh = len(df) * .9, axis = 1)
322 | ```
323 | 
324 | * Replacing using `fillna()` - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
325 | ```python
326 | # Impute DataFrame with all zeroes
327 | df.fillna(0)
328 | 
329 | # Impute column with all zeroes
330 | df["measure"].fillna(0)
331 | 
332 | # Impute measure with mean of column
333 | df["measure"].fillna(df["measure"].mean())
334 | 
335 | # Impute dimension with mode of column
336 | df["dimension"].fillna(df["dimension"].mode())
337 | 
338 | # Impute by another dimension's mean
339 | df["age"].fillna(df.groupby("sex")["age"].transform("mean"))
340 | ```
341 | 
342 | * Replace values across entire DataFrame
343 | ```python
344 | df.replace(".", np.nan)
345 | 
346 | df.replace(0, np.nan)
347 | ```
348 | 
349 | * Replace numeric values containing a letter with NaN
350 | ```python
351 | df["zipcode"].replace(".*[a-zA-Z].*", np.nan, regex=True)
352 | ```
353 | 
354 | * Drop rows where **any** value is 0
355 | ```python
356 | df[(df != 0).all(1)]
357 | ```
358 | 
359 | * Drop rows where **all** values are 0
360 | ```python
361 | df = df[(df.T != 0).any()]
362 | ```
363 | 
364 | ## Method Chaining
365 | 
366 | * Chaining multiple operations
367 | ```python
368 | (pd.read_csv("employee_salaries.csv")
369 |     .query("salary > 0")
370 |     .assign(sex = lambda df: df["sex"].replace({"female": 1, "male: 0}),
371 |             age = lambda df: pd.cut(df["age"].fillna(df["age"].median()),
372 |                                     bins = [df["age"].min(), 18, 40, df["age"].max()],
373 |                                     labels = ["underage", "young", "experienced"]))
374 |     .rename({"name_1": "first_name", "name_2": "last_name"})
375 | )
376 | ```
377 | 
378 | * Pipelines for data processing
379 | ```python
380 | def fix_headers(df):
381 |     df.columns = df.columns.str.replace("[^\w\s]", "").str.replace(" ", "_").str.lower()
382 |     return df
383 |     
384 | def drop_columns_missing(df, percent):
385 |     df = df.dropna(thresh = len(df) * percent, axis = 1)
386 |     return df
387 | 
388 | def fill_missing(df, value):
389 |     df = df.fillna(value)
390 |     return df
391 | 
392 | def replace_and_convert(df, col, orig, new, dtype):
393 |     df[col] = df[col].str.replace(orig, new).astype(dtype)
394 |     return df
395 | 
396 | 
397 | (df.pipe(fix_headers)
398 |     .pipe(drop_columns_missing, percent=0.3)
399 |     .pipe(fill_missing, value=0)
400 |     .pipe(replace_and_convert, col="sales", orig="$", new="", dtype=float)
401 | )
402 | ```
403 | 
404 | [Recommended Read - Effective Pandas](https://leanpub.com/effective-pandas)
405 | 
406 | ## Aggregation
407 | 
408 | * Use `as_index = False` to avoid setting index
409 | ```python
410 | # this
411 | df.groupby("dimension", as_index = False)["measure"].sum()
412 | 
413 | # versus this
414 | df.groupby("dimension")["measure"].sum().reset_index()
415 | ```
416 | 
417 | * By date offset - [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects)
418 | ```python
419 | # H for hours
420 | # D for days
421 | # W for weeks
422 | # WOM for week of month
423 | # Q for quarter end
424 | # A for year end
425 | df.groupby(pd.Grouper(key = "date", freq = "M"))["measure"].agg(["sum", "mean"])
426 | ```
427 | 
428 | * Measure by dimension - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html)
429 | ```python
430 | # count - number of non-null observations
431 | # sum - sum of values
432 | # mean - mean of values
433 | # mad - mean absolute deviation
434 | # median - arithmetic median of values
435 | # min - minimum
436 | # max - maxmimum
437 | # mode - mode
438 | # std - unbiased standard deviation
439 | # first - first value
440 | # last - last value
441 | # nunique - unique values
442 | df.groupby("dimension")["measure"].sum()
443 | 
444 | # Specific aggregations for columns
445 | df.groupby("dimension").agg({"sales": ["mean", "sum"], "sale_date": "first", "customer": "nunique"})
446 | ```
447 | 
448 | * Pivot table - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html#pandas.pivot_table)
449 | ```python
450 | pd.pivot_table(
451 |     df,
452 |     values=["sales", "orders"],
453 |     index=["customer_id"],
454 |     aggfunc={
455 |         "sales": ["sum", "mean"],
456 |         "orders": "nunique"
457 |     }
458 | )
459 | ```
460 | 
461 | * Named aggregations - `Pandas >= 0.25` - [documentation](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling)
462 | ```python
463 | # DataFrame - Version 1
464 | df.groupby("country").agg(
465 |     min_height = pd.NamedAgg(column = "height", aggfunc = "min"),
466 |     max_height = pd.NamedAgg(column = "height", aggfunc = "max"),
467 |     average_weight = pd.NamedAgg(column = "weight", aggfunc = np.mean)
468 | )
469 | 
470 | # DataFrame - Version 2
471 | df.groupby("country").agg(
472 |     min_height=("height", "min"),
473 |     max_heights=("height", "max"),
474 |     average_weight=("weight", np.mean)
475 | )
476 | 
477 | # Series
478 | df.groupby("gender").height.agg(
479 |     min_height="min",
480 |     max_height="max"
481 | )
482 | ```
483 | 
484 | ## New Columns
485 | 
486 | * Using `df.eval()`
487 | ```python
488 | df["sales"] = df.eval("price * quantity")
489 | 
490 | # Assign to different DataFrame
491 | pd.eval("sales = df.price * df.quantity", target=df_2)
492 | 
493 | # Multiline assignment
494 | df.eval("""
495 | aov = price / quantity
496 | aov_gt_50 = (price / quantity) > 50
497 | top_3_customers = customer_id in customer_id.value_counts().nlargest(3).index
498 | bottom_3_customers = customer_id in customer_id.value_counts().nsmallest(3).index
499 | """)
500 | ```
501 | 
502 | * Based on one condition - using `np.where()`
503 | ```python
504 | np.where(df["gender"] == "Male", 1, 0)
505 | ```
506 | 
507 | * Based on multiple conditions - using `np.where()`
508 | ```python
509 | np.where(df["measure"] < 5, "Low", np.where(df["measure"] < 10, "Medium", "High"))
510 | ```
511 | 
512 | * Based on multiple conditions - using `np.select()`
513 | ```python
514 | conditions = [
515 |     df["country"].str.contains("spain"),
516 |     df["country"].str.contains("italy"),
517 |     df["country"].str.contains("chile"),
518 |     df["country"].str.contains("brazil")
519 | ]
520 | 
521 | choices = ["europe", "europe", "south america", "south america"]
522 | 
523 | data["continent"] = np.select(conditions, choices, default = "other")
524 | ```
525 | 
526 | * Based on manual mapping - using `pd.Series.map()`
527 | ```python
528 | values = {"Low": 1, "Medium": 2, "High": 3}
529 | 
530 | df["dimension"].map(values)
531 | ```
532 | 
533 | * Automatically generate mappings from dimension
534 | ```python
535 | dimension_mappings = {v: k for k, v in enumerate(df["dimension"].unique())}
536 | 
537 | df["dimension"].map(dimension_mappings)
538 | ```
539 | 
540 | * Splitting a string column
541 | ```python
542 | df["email"].str.split("@", expand = True)[0]
543 | ```
544 | 
545 | * Using list comprehensions
546 | ```python
547 | df["domain"] = [x.split("@")[1] for x in df["email"]]
548 | ```
549 | 
550 | * Using regular expressions
551 | ```python
552 | import re
553 | 
554 | pattern = "([A-Z0-9._%+-]+)@([A-Z0-9.-]+)"
555 | 
556 | # Inserting colum headers, applied after extract
557 | pattern = "(?P<email>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)"
558 | 
559 | # Generates two columns
560 | email_components = df["email"].str.extract(pattern, flags=re.IGNORECASE)
561 | ```
562 | 
563 | * Widening a column - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html)
564 | ```python
565 | df.pivot(index = "date", columns = "companies", values = "sales")
566 | ```
567 | 
568 | ## Feature Engineering
569 | 
570 | * Instead of split-apply-combine, `transform()`
571 | ```python
572 | # this
573 | df["mean_company_salary"] = df.groupby("company")["salary"].transform("mean")
574 | 
575 | # versus this
576 | mean_salary = df.groupby("company")["salary"].agg("mean").rename("mean_salary").reset_index()
577 | df_new = df.merge(mean_salary)
578 | ```
579 | 
580 | * Extracting various date components - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetime-properties)
581 | ```python
582 | df["date"].dt.year
583 | df["date"].dt.quarter
584 | df["date"].dt.month
585 | df["date"].dt.week
586 | df["date"].dt.day
587 | df["date"].dt.weekday
588 | df["date"].dt.weekday_name
589 | df["date"].dt.hour
590 | ```
591 | 
592 | * Time between two dates
593 | ```python
594 | # Days between
595 | df["first_date"].sub(df["second_date"]).div(np.timedelta64(1, "D"))
596 | 
597 | # Months between
598 | df["first_date"].sub(df["second_date"]).div(np.timedelta64(1, "M"))
599 | 
600 | # Equivalent to above
601 | (df["first_date] - df["second_date"]) / np.timedelta64(1, "M")
602 | ```
603 | 
604 | * Weekend column
605 | ```python
606 | df["is_weekend"] = np.where(df["date"].dt.dayofweek.isin([5, 6]), 1, 0)
607 | ```
608 | 
609 | * Get prior date
610 | ```python
611 | df.sort_values(by=["customer_id, "order_date"])\
612 |     .groupby("customer_id")["order_date"].shift(periods=1)
613 | ```
614 | 
615 | * Days since prior date
616 | ```python
617 | df.sort_values(by = ["customer_id", "order_date"])\
618 |     .groupby("customer_id")["order_date"]\
619 |     .diff()\
620 |     .div(np.timedelta64(1, "D"))
621 | ```
622 | 
623 | * Percent change since prior date
624 | ```python
625 | df.sort_values(by = ["customer_id", "order_date"])\
626 |     .groupby("customer_id")["order_date"]\
627 |     .pct_change()
628 | ```
629 | 
630 | * Percentile rank for measure
631 | ```python
632 | df["salary"].rank(pct=True)
633 | ```
634 | 
635 | * Occurrences of word in row
636 | ```python
637 | import re
638 | 
639 | df["review"].str.count("great", flags=re.IGNORECASE)
640 | ```
641 | 
642 | * Distinct list aggregation
643 | ```python
644 | df["unique_products"] = df.groupby("customer_id").agg({"products": "unique"})
645 | 
646 | # Transform each element -> row - Pandas >= 0.25
647 | df["unique_products"].explode()
648 | ```
649 | 
650 | * User-item matrix
651 | ```python
652 | df.groupby("customer_id")["products"].value_counts().unstack().fillna(0)
653 | ```
654 | 
655 | * Binning
656 | ```python
657 | pd.qcut(data["measure"], q = 4, labels = False)
658 | 
659 | # Numeric
660 | pd.cut(df["measure"], bins = 4, labels = False)
661 | 
662 | # Dimension
663 | pd.cut(df["age"], bins = [0, 18, 25, 99], labels = ["child", "young adult", "adult"])
664 | ```
665 | 
666 | * Dummy variables
667 | ```python
668 | # Use drop_first = True to avoid collinearity
669 | pd.get_dummies(df, drop_first = True)
670 | ```
671 | 
672 | * Sort and take first value by dimension
673 | ```python
674 | df.sort_values(by = "variable").groupby("dimension").first()
675 | ```
676 | 
677 | * MinMax normalization
678 | ```python
679 | df["salary_minmax"] = (
680 |     df["salary"] - df["salary"].min()) / (df["salary"].max() - df["salary"].min()
681 | )
682 | ```
683 | 
684 | * Z-score normalization
685 | ```python
686 | df["salary_zscore"] = (df["salary"] - df["salary"].mean()) / df["salary"].std()
687 | ```
688 | 
689 | * Log transformation
690 | ```python
691 | # For positive data with no zeroes
692 | np.log(df["sales"])
693 | 
694 | # For positive data with zeroes
695 | np.log1p(df["sales"])
696 | 
697 | # Convert back - get predictions if target is log transformed
698 | np.expm1(df["sales"])
699 | ```
700 | 
701 | * Boxcox transformation
702 | ```python
703 | from scipy import stats
704 | 
705 | # Must be positive
706 | stats.boxcox(df["sales"])[0]
707 | ```
708 | 
709 | * Reciprocal transformation
710 | ```python
711 | df["age_reciprocal"] = 1.0 / df["age"]
712 | ```
713 | 
714 | * Square root transformation
715 | ```python
716 | df["age_sqrt"] = np.sqrt(df["age"])
717 | ```
718 | 
719 | * Winsorization
720 | ```python
721 | upper_limit = np.percentile(df["salary"].values, 99)
722 | lower_limit = np.percentile(df["salary"].values, 1)
723 | 
724 | df["salary"].clip(lower = lower_limit, upper = upper_limit)
725 | ```
726 | 
727 | * Mean encoding
728 | ```python
729 | df.groupby("dimension")["target"].transform("mean")
730 | ```
731 | 
732 | * Z-scores for outliers
733 | ```python
734 | from scipy import stats
735 | import numpy as np
736 | 
737 | z = np.abs(stats.zscores(df))
738 | df = df[(z < 3).all(axis = 1)]
739 | ```
740 | 
741 | * Interquartile range (IQR)
742 | ```python
743 | q1 = df["salary"].quantile(0.25)
744 | q3 = df["salary"].quantile(0.75)
745 | iqr = q3 - q1
746 | 
747 | df.query("(@q1 - 1.5 * @iqr) <= salary <= (@q3 + 1.5 * @iqr)")
748 | ```
749 | 
750 | * Geocoder - [github](https://github.com/DenisCarriere/geocoder)
751 | * Geopy - [github](https://github.com/geopy/geopy)
752 | ```python
753 | import geocoder
754 | 
755 | df["lat_long"] = df["ip"].apply(lambda x: geocoder.ip(x).latlng)
756 | ```
757 | 
758 | * RFM - Recency, Frequency and Monetary
759 | ```python
760 | rfm = (
761 |     df.groupby("customer_id")
762 |     .agg(
763 |         {
764 |             "order_date": lambda x: (x.max() - x.min()).days,
765 |             "order_id": "nunique",
766 |             "price": "mean",
767 |         }
768 |     )
769 |     .rename(
770 |         columns={"order_date": "recency", "order_id": "frequency", "price": "monetary"}
771 |     )
772 | )
773 | 
774 | rfm_quantiles = rfm.quantile(q=[0.2, 0.4, 0.6, 0.8])
775 | 
776 | recency_conditions = [
777 |     rfm.recency >= rfm_quantiles.recency.iloc[3],
778 |     rfm.recency >= rfm_quantiles.recency.iloc[2],
779 |     rfm.recency >= rfm_quantiles.recency.iloc[1],
780 |     rfm.recency >= rfm_quantiles.recency.iloc[0],
781 |     rfm.recency <= rfm_quantiles.recency.iloc[0],
782 | ]
783 | 
784 | frequency_conditions = [
785 |     rfm.frequency <= rfm_quantiles.frequency.iloc[0],
786 |     rfm.frequency <= rfm_quantiles.frequency.iloc[1],
787 |     rfm.frequency <= rfm_quantiles.frequency.iloc[2],
788 |     rfm.frequency <= rfm_quantiles.frequency.iloc[3],
789 |     rfm.frequency >= rfm_quantiles.frequency.iloc[3],
790 | ]
791 | 
792 | monetary_conditions = [
793 |     rfm.monetary <= rfm_quantiles.monetary.iloc[0],
794 |     rfm.monetary <= rfm_quantiles.monetary.iloc[1],
795 |     rfm.monetary <= rfm_quantiles.monetary.iloc[2],
796 |     rfm.monetary <= rfm_quantiles.monetary.iloc[3],
797 |     rfm.monetary >= rfm_quantiles.monetary.iloc[3],
798 | ]
799 | 
800 | ranks = [1, 2, 3, 4, 5]
801 | 
802 | rfm["r"] = np.select(recency_conditions, ranks, "other")
803 | rfm["f"] = np.select(frequency_conditions, ranks, "other")
804 | rfm["m"] = np.select(monetary_conditions, ranks, "other")
805 | 
806 | rfm["segment"] = rfm["r"].astype(str).add(rfm["f"].astype(str))
807 | 
808 | segment_map = {
809 |     r"[1-2][1-2]": "hibernating",
810 |     r"[1-2][3-4]": "at risk",
811 |     r"[1-2]5": "cannot lose",
812 |     r"3[1-2]": "about to sleep",
813 |     r"33": "need attention",
814 |     r"[3-4][4-5]": "loyal customers",
815 |     r"41": "promising",
816 |     r"51": "new customers",
817 |     r"[4-5][2-3]": "potential loyalists",
818 |     r"5[4-5]": "champions",
819 | }
820 | 
821 | rfm["segment"] = rfm.segment.replace(segment_map, regex=True)
822 | ```
823 | 
824 | * Haversine
825 | ```python
826 | import numpy as np
827 | from numpy import pi, deg2rad, cos, sin, arcsin, sqrt
828 | 
829 | def haversine(s_lat, s_lng, e_lat, e_lng):
830 |     """
831 |     determines the great-circle distance between two point
832 |     on a sphere given their longitudes and latitudes
833 |     """
834 | 
835 |     # approximate radius of earth in miles
836 |     R = 3959.87433
837 | 
838 |     s_lat = s_lat * np.pi / 180.0
839 |     s_lng = np.deg2rad(s_lng)
840 |     e_lat = np.deg2rad(e_lat)
841 |     e_lng = np.deg2rad(e_lng)
842 | 
843 |     d = (
844 |         np.sin((e_lat - s_lat) / 2) ** 2
845 |         + np.cos(s_lat) * np.cos(e_lat) * np.sin((e_lng - s_lng) / 2) ** 2
846 |     )
847 | 
848 |     return 2 * R * np.arcsin(np.sqrt(d))
849 | 
850 | 
851 | df['distance'] = haversine(
852 |     df["start_lat"].values,
853 |     df["start_long"].values,
854 |     df["end_lat"].values,
855 |     df["end_long"].values
856 | )
857 | ```
858 | 
859 | * Manhattan
860 | ```python
861 | def manhattan(s_lat, s_lng, e_lat, e_lng):
862 |     """
863 |     sum of horizontal and vertical distance between
864 |     two points
865 |     """
866 |     a = haversine(s_lat, s_lng, s_lat, e_lng)
867 |     b = haversine(s_lat, s_lng, e_lat, s_lng)
868 |     return a + b
869 | ```
870 | 
871 | ## Random
872 | 
873 | * Union two categorical columns - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.types.union_categoricals.html#pandas.api.types.union_categoricals)
874 | ```python
875 | from pandas.api.types import union_categoricals
876 | 
877 | food = pd.Categorical(["burger king", "wendys"])
878 | food_2 = pd.Categorical(["burger king", "chipotle"])
879 | 
880 | union_categoricals([food, food_2])
881 | ```
882 | 
883 | * Testing - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#testing-functions)
884 | ```python
885 | from pandas.util.testing import assert_frame_equal
886 | 
887 | # Methods for Series and Index as well
888 | assert_frame_equal(df_1, df_2)
889 | ```
890 | 
891 | * Dtype checking - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#dtype-introspection)
892 | ```python
893 | from pandas.api.types import is_numeric_dtype
894 | 
895 | is_numeric_dtype("hello world")
896 | # False
897 | ```
898 | 
899 | * Infer column dtype, useful to remap column dtypes [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.types.infer_dtype.html#pandas.api.types.infer_dtype)
900 | ```python
901 | from pandas.api.types import infer_dtype
902 | 
903 | infer_dtype(["john", np.nan, "jack"], skipna=True)
904 | # string
905 | 
906 | infer_dtype(["john", np.nan, "jack"], skipna=False)
907 | # mixed
908 | ```
909 | 


--------------------------------------------------------------------------------