├── .gitignore ├── an-embarrassment-of-pandas.pdf └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | README_PANDOC.md 2 | -------------------------------------------------------------------------------- /an-embarrassment-of-pandas.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Gaarv/an-embarrassment-of-pandas/master/an-embarrassment-of-pandas.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # An Embarrassment of Pandas 2 | 3 | ![group-of-pandas](https://i.imgur.com/wPnWB5e.jpg) 4 | 5 | Why an embarrassment? Because it's the name for a [group of pandas!](https://www.reference.com/pets-animals/group-pandas-called-71cd65ea758ca2e2) 6 | 7 | * [DataFrames](#dataframes) 8 | * [Series](#series) 9 | * [Missing Values](#missing-values) 10 | * [Method Chaining](#method-chaining) 11 | * [Aggregation](#aggregation) 12 | * [New Columns](#new-columns) 13 | * [Feature Engineering](#feature-engineering) 14 | * [Random](#random) 15 | 16 | ## DataFrames 17 | 18 | * Options - [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) 19 | ```python 20 | # More columns 21 | pd.set_option("display.max_columns", 500) 22 | 23 | # More rows 24 | pd.set_option("display.max_rows", 500) 25 | 26 | # Floating point precision 27 | pd.set_option("display.precision", 3) 28 | 29 | # Increase column width 30 | pd.set_option("max_colwidth", 50) 31 | 32 | # Change default plotting backend - Pandas >= 0.25 33 | # https://github.com/PatrikHlobil/Pandas-Bokeh 34 | pd.set_option("plotting.backend", 'pandas_bokeh') 35 | ``` 36 | 37 | * Useful `read_csv()` options - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) 38 | ```python 39 | pd.read_csv( 40 | "data.csv.gz", 41 | delimiter = "^", 42 | # line numbers to skip (i.e. headers in an excel report) 43 | skiprows = 2, 44 | # used to denote the start and end of a quoted item 45 | quotechar = "|", 46 | # return a subset of columns 47 | usecols = ["return_date", "company", "sales"], 48 | # data type for data or columns 49 | dtype = { "sales": np.float64 }, 50 | # additional strings to recognize as NA/NaN 51 | na_values = [".", "?"], 52 | # convert to datetime, instead of object 53 | parse_dates = ["return_date"], 54 | # for on-the-fly decompression of on-disk data 55 | # options - gzip, bz2, zip, xz 56 | compression = "gzip", 57 | # encoding to use for reading 58 | encoding = "latin1", 59 | # read in a subset of data 60 | nrows = 100 61 | ) 62 | ``` 63 | 64 | * Read csv from URL or S3 - [s3fs](https://github.com/dask/s3fs/) 65 | ```python 66 | pd.read_csv("https://bit.ly/2KyxTFn") 67 | 68 | # Requires s3fs library 69 | pd.read_csv("s3://pandas-test/tips.csv") 70 | ``` 71 | 72 | * Read an Excel file - [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#excel-files) 73 | ```python 74 | pd.read_excel("numbers.xlsx", sheet_name="Sheet1") 75 | 76 | # Multiple sheets with varying parameters 77 | with pd.ExcelFile("numbers.xlsx") as xlsx: 78 | df1 = pd.read_excel(xlsx, "Sheet1", na_values=["?"]) 79 | df2 = pd.read_excel(xlsx, "Sheet2", na_values=[".", "Missing"]) 80 | ``` 81 | 82 | * Read multiple files at once - [glob](https://docs.python.org/3/library/glob.html) 83 | ```python 84 | import glob 85 | 86 | # ignore_index = True to avoid duplicate index values 87 | df = pd.concat([pd.read_csv(f) for f in glob.glob("*.csv")], ignore_index = True) 88 | 89 | # More options 90 | df = pd.concat([pd.read_csv(f, encoding = "latin1") for f in glob.glob("*.csv")]) 91 | ``` 92 | 93 | * Recursively grab all files in a directory 94 | ```python 95 | import os 96 | import glob 97 | 98 | files = [os.path.join(root, file) 99 | for root, dir, files in os.walk("./directory") 100 | for file in glob.glob("*.csv")] 101 | ``` 102 | 103 | * Read in data from SQLite3 104 | ```python 105 | import sqlite3 106 | 107 | conn = sqlite3.connect("flights.db") 108 | df = pd.read_sql_query("select * from airlines", conn) 109 | conn.close() 110 | ``` 111 | 112 | * Read in data from Postgres - [bigquery](https://googleapis.github.io/google-cloud-python/latest/bigquery/index.html), [snowflake](https://docs.snowflake.net/manuals/user-guide/sqlalchemy.html#snowflake-connector-for-python) 113 | ```python 114 | from sqlalchemy import create_engine 115 | 116 | # Port 5439 for Redshift 117 | engine = create_engine("postgresql://user@localhost:5432/mydb") 118 | df = pd.read_sql_query("select * from airlines", engine) 119 | 120 | # Get results in chunks 121 | for chunk in pd.read_sql_query("select * from airlines", engine, chunksize=5): 122 | print(chunk) 123 | 124 | # Writing back 125 | df.to_sql( 126 | "table" 127 | schema="schema" 128 | # fail, replace or append 129 | if_exists="append", 130 | # write back in chunks 131 | chunksize = 10000 132 | ) 133 | ``` 134 | 135 | * Normalizing nested JSON - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html) 136 | ```python 137 | from pandas.io.json import json_normalize 138 | 139 | json_normalize(data, "counties", ["state", "shortname", ["info", "governor"]]) 140 | 141 | # How deep to normalize - Pandas >= 0.25 142 | json_normalize(data, max_level=1) 143 | ``` 144 | 145 | * Column headers 146 | ```python 147 | # Lower all values 148 | df.columns = [x.lower() for x in df.columns] 149 | 150 | # Strip out punctuation, replace spaces and lower 151 | df.columns = df.columns.str.replace("[^\w\s]", "").str.replace(" ", "_").str.lower() 152 | 153 | # Condense multiindex columns 154 | df.columns = ["_".join(col).lower() for col in df.columns] 155 | 156 | # Double transpose to remove bottom row for multiindex columns 157 | df.T.reset_index(1, drop=True).T 158 | ``` 159 | 160 | * Filtering DataFrame - using `pd.Series.isin()` 161 | ```python 162 | df[df["dimension"].isin(["A", "B", "C"])] 163 | 164 | # not in 165 | df[~df["dimension"].isin(["A", "B", "C"])] 166 | ``` 167 | 168 | * Filtering DataFrame - using `pd.Series.str.contains()` 169 | ```python 170 | df[df["dimension"].str.contains("word")] 171 | 172 | # not in 173 | df[~df["dimension"].str.contains("word")] 174 | ``` 175 | 176 | * Filtering DataFrame & more - using `df.query()` - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) 177 | ```python 178 | df.query("salary > 100000") 179 | 180 | df.query("name == 'john'") 181 | 182 | df.query("name == 'john' | name == 'jack'") 183 | 184 | df.query("name == 'john' and salary > 100000") 185 | 186 | df.query("name.str.contains('a')") 187 | 188 | # Grab top 1% of earners 189 | df.query("salary > salary.quantile(.99)") 190 | 191 | # Make more than the mean 192 | df.query("salary > salary.mean()") 193 | 194 | # Subset by top 3 most frequent products purchased 195 | df.query("item in item.value_counts().nlargest(3).index") 196 | 197 | # Query for null values 198 | df.query("column.isnull()") 199 | 200 | # Query for non-nulls 201 | df.query("column.notnull()") 202 | 203 | # @ - allows you to refer to variables in the environment 204 | names = ["john", "fred", "jack"] 205 | df.query("name in @names") 206 | 207 | # Reference columns with spaces using backticks - Pandas >= 0.25 208 | df.query("`Total Salary` > 100000") 209 | ``` 210 | 211 | * Joining - [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) 212 | ```python 213 | # Inner join 214 | pd.merge(df1, df2, on = "key") 215 | 216 | # Left join on different key names 217 | pd.merge(df1, df2, right_on = ["right_key"], left_on = ["left_key"], how = "left") 218 | ``` 219 | 220 | * Anti-Join 221 | ```python 222 | def anti_join(x, y, on): 223 | """Return rows in x which are not present in y""" 224 | ans = pd.merge(left=x, right=y, how='left', indicator=True, on=on) 225 | ans = ans.loc[ans._merge == 'left_only', :].drop(columns='_merge') 226 | return ans 227 | ``` 228 | 229 | * Select columns based on data type 230 | ```python 231 | df.select_dtypes(include = "number") 232 | df.select_dtypes(exclude = "object") 233 | ``` 234 | 235 | * Apply function to multiple columns of the same data type 236 | ```python 237 | # Specify columns, so DataFrame isn't overwritten 238 | df[["first_name", "last_name", "email"]] = df.select_dtypes( 239 | include = "object").apply(lambda x: x.str.lower() 240 | ) 241 | ``` 242 | 243 | * Reverse column order 244 | ```python 245 | df.loc[:, ::-1] 246 | ``` 247 | 248 | * Correlation matrix 249 | ```python 250 | df.corr() 251 | 252 | # With another DataFrame 253 | df.corrwith(df_2) 254 | ``` 255 | 256 | * Descriptive statistics 257 | ```python 258 | df.describe(include=[np.number]).T 259 | 260 | dims = df.describe(include=[pd.Categorical]).T 261 | 262 | # Add percent frequency for top dimension 263 | dims["frequency"] = dims["freq"].div(dims["count"]) 264 | ``` 265 | 266 | * Styling numeric columns - [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) 267 | ```python 268 | styling_options = {"sales": "${0:,.0f}", "percent_of_sales": "{:.2%f}"} 269 | 270 | df.style.format(styling_options) 271 | ``` 272 | 273 | * Add highlighting for max and min values 274 | ```python 275 | df.style.highlight_max(color = "lightgreen").highlight_min(color = "red") 276 | ``` 277 | 278 | * Conditional formatting for one column 279 | ```python 280 | df.style.background(subset = ["measure"], cmap = "viridis") 281 | ``` 282 | 283 | ## Series 284 | 285 | * Value counts as percentages 286 | ```python 287 | # See NaNs as well 288 | df["meaure"].value_counts(normalize = True, dropna = False) 289 | ``` 290 | 291 | * Replacing errant characters 292 | ```python 293 | df["sales"].str.replace("$", "") 294 | ``` 295 | 296 | * Replacing false conditions - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html) 297 | ```python 298 | df["steps_walked"].where(df["steps_walked"] > 0, 0) 299 | ``` 300 | 301 | ## Missing Values 302 | 303 | * Percent nulls by column 304 | ```python 305 | (df.isnull().sum() / df.isnull().count()).sort_values(ascending=False) 306 | ``` 307 | 308 | * Dropping columns - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) 309 | ```python 310 | df.drop(["column_a", "column_b"], axis = 1) 311 | ``` 312 | 313 | * Dropping duplicate rows - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html#pandas.DataFrame.drop_duplicates) 314 | ```python 315 | df.drop_duplicates(subset=["order_date", "product"], keep="first") 316 | ``` 317 | 318 | * Dropping columns based on NaN threshold - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) 319 | ```python 320 | # Any column with 90% missing values will be dropped 321 | df.dropna(thresh = len(df) * .9, axis = 1) 322 | ``` 323 | 324 | * Replacing using `fillna()` - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) 325 | ```python 326 | # Impute DataFrame with all zeroes 327 | df.fillna(0) 328 | 329 | # Impute column with all zeroes 330 | df["measure"].fillna(0) 331 | 332 | # Impute measure with mean of column 333 | df["measure"].fillna(df["measure"].mean()) 334 | 335 | # Impute dimension with mode of column 336 | df["dimension"].fillna(df["dimension"].mode()) 337 | 338 | # Impute by another dimension's mean 339 | df["age"].fillna(df.groupby("sex")["age"].transform("mean")) 340 | ``` 341 | 342 | * Replace values across entire DataFrame 343 | ```python 344 | df.replace(".", np.nan) 345 | 346 | df.replace(0, np.nan) 347 | ``` 348 | 349 | * Replace numeric values containing a letter with NaN 350 | ```python 351 | df["zipcode"].replace(".*[a-zA-Z].*", np.nan, regex=True) 352 | ``` 353 | 354 | * Drop rows where **any** value is 0 355 | ```python 356 | df[(df != 0).all(1)] 357 | ``` 358 | 359 | * Drop rows where **all** values are 0 360 | ```python 361 | df = df[(df.T != 0).any()] 362 | ``` 363 | 364 | ## Method Chaining 365 | 366 | * Chaining multiple operations 367 | ```python 368 | (pd.read_csv("employee_salaries.csv") 369 | .query("salary > 0") 370 | .assign(sex = lambda df: df["sex"].replace({"female": 1, "male: 0}), 371 | age = lambda df: pd.cut(df["age"].fillna(df["age"].median()), 372 | bins = [df["age"].min(), 18, 40, df["age"].max()], 373 | labels = ["underage", "young", "experienced"])) 374 | .rename({"name_1": "first_name", "name_2": "last_name"}) 375 | ) 376 | ``` 377 | 378 | * Pipelines for data processing 379 | ```python 380 | def fix_headers(df): 381 | df.columns = df.columns.str.replace("[^\w\s]", "").str.replace(" ", "_").str.lower() 382 | return df 383 | 384 | def drop_columns_missing(df, percent): 385 | df = df.dropna(thresh = len(df) * percent, axis = 1) 386 | return df 387 | 388 | def fill_missing(df, value): 389 | df = df.fillna(value) 390 | return df 391 | 392 | def replace_and_convert(df, col, orig, new, dtype): 393 | df[col] = df[col].str.replace(orig, new).astype(dtype) 394 | return df 395 | 396 | 397 | (df.pipe(fix_headers) 398 | .pipe(drop_columns_missing, percent=0.3) 399 | .pipe(fill_missing, value=0) 400 | .pipe(replace_and_convert, col="sales", orig="$", new="", dtype=float) 401 | ) 402 | ``` 403 | 404 | [Recommended Read - Effective Pandas](https://leanpub.com/effective-pandas) 405 | 406 | ## Aggregation 407 | 408 | * Use `as_index = False` to avoid setting index 409 | ```python 410 | # this 411 | df.groupby("dimension", as_index = False)["measure"].sum() 412 | 413 | # versus this 414 | df.groupby("dimension")["measure"].sum().reset_index() 415 | ``` 416 | 417 | * By date offset - [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects) 418 | ```python 419 | # H for hours 420 | # D for days 421 | # W for weeks 422 | # WOM for week of month 423 | # Q for quarter end 424 | # A for year end 425 | df.groupby(pd.Grouper(key = "date", freq = "M"))["measure"].agg(["sum", "mean"]) 426 | ``` 427 | 428 | * Measure by dimension - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) 429 | ```python 430 | # count - number of non-null observations 431 | # sum - sum of values 432 | # mean - mean of values 433 | # mad - mean absolute deviation 434 | # median - arithmetic median of values 435 | # min - minimum 436 | # max - maxmimum 437 | # mode - mode 438 | # std - unbiased standard deviation 439 | # first - first value 440 | # last - last value 441 | # nunique - unique values 442 | df.groupby("dimension")["measure"].sum() 443 | 444 | # Specific aggregations for columns 445 | df.groupby("dimension").agg({"sales": ["mean", "sum"], "sale_date": "first", "customer": "nunique"}) 446 | ``` 447 | 448 | * Pivot table - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html#pandas.pivot_table) 449 | ```python 450 | pd.pivot_table( 451 | df, 452 | values=["sales", "orders"], 453 | index=["customer_id"], 454 | aggfunc={ 455 | "sales": ["sum", "mean"], 456 | "orders": "nunique" 457 | } 458 | ) 459 | ``` 460 | 461 | * Named aggregations - `Pandas >= 0.25` - [documentation](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling) 462 | ```python 463 | # DataFrame - Version 1 464 | df.groupby("country").agg( 465 | min_height = pd.NamedAgg(column = "height", aggfunc = "min"), 466 | max_height = pd.NamedAgg(column = "height", aggfunc = "max"), 467 | average_weight = pd.NamedAgg(column = "weight", aggfunc = np.mean) 468 | ) 469 | 470 | # DataFrame - Version 2 471 | df.groupby("country").agg( 472 | min_height=("height", "min"), 473 | max_heights=("height", "max"), 474 | average_weight=("weight", np.mean) 475 | ) 476 | 477 | # Series 478 | df.groupby("gender").height.agg( 479 | min_height="min", 480 | max_height="max" 481 | ) 482 | ``` 483 | 484 | ## New Columns 485 | 486 | * Using `df.eval()` 487 | ```python 488 | df["sales"] = df.eval("price * quantity") 489 | 490 | # Assign to different DataFrame 491 | pd.eval("sales = df.price * df.quantity", target=df_2) 492 | 493 | # Multiline assignment 494 | df.eval(""" 495 | aov = price / quantity 496 | aov_gt_50 = (price / quantity) > 50 497 | top_3_customers = customer_id in customer_id.value_counts().nlargest(3).index 498 | bottom_3_customers = customer_id in customer_id.value_counts().nsmallest(3).index 499 | """) 500 | ``` 501 | 502 | * Based on one condition - using `np.where()` 503 | ```python 504 | np.where(df["gender"] == "Male", 1, 0) 505 | ``` 506 | 507 | * Based on multiple conditions - using `np.where()` 508 | ```python 509 | np.where(df["measure"] < 5, "Low", np.where(df["measure"] < 10, "Medium", "High")) 510 | ``` 511 | 512 | * Based on multiple conditions - using `np.select()` 513 | ```python 514 | conditions = [ 515 | df["country"].str.contains("spain"), 516 | df["country"].str.contains("italy"), 517 | df["country"].str.contains("chile"), 518 | df["country"].str.contains("brazil") 519 | ] 520 | 521 | choices = ["europe", "europe", "south america", "south america"] 522 | 523 | data["continent"] = np.select(conditions, choices, default = "other") 524 | ``` 525 | 526 | * Based on manual mapping - using `pd.Series.map()` 527 | ```python 528 | values = {"Low": 1, "Medium": 2, "High": 3} 529 | 530 | df["dimension"].map(values) 531 | ``` 532 | 533 | * Automatically generate mappings from dimension 534 | ```python 535 | dimension_mappings = {v: k for k, v in enumerate(df["dimension"].unique())} 536 | 537 | df["dimension"].map(dimension_mappings) 538 | ``` 539 | 540 | * Splitting a string column 541 | ```python 542 | df["email"].str.split("@", expand = True)[0] 543 | ``` 544 | 545 | * Using list comprehensions 546 | ```python 547 | df["domain"] = [x.split("@")[1] for x in df["email"]] 548 | ``` 549 | 550 | * Using regular expressions 551 | ```python 552 | import re 553 | 554 | pattern = "([A-Z0-9._%+-]+)@([A-Z0-9.-]+)" 555 | 556 | # Inserting colum headers, applied after extract 557 | pattern = "(?P[A-Z0-9._%+-]+)@(?P[A-Z0-9.-]+)" 558 | 559 | # Generates two columns 560 | email_components = df["email"].str.extract(pattern, flags=re.IGNORECASE) 561 | ``` 562 | 563 | * Widening a column - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html) 564 | ```python 565 | df.pivot(index = "date", columns = "companies", values = "sales") 566 | ``` 567 | 568 | ## Feature Engineering 569 | 570 | * Instead of split-apply-combine, `transform()` 571 | ```python 572 | # this 573 | df["mean_company_salary"] = df.groupby("company")["salary"].transform("mean") 574 | 575 | # versus this 576 | mean_salary = df.groupby("company")["salary"].agg("mean").rename("mean_salary").reset_index() 577 | df_new = df.merge(mean_salary) 578 | ``` 579 | 580 | * Extracting various date components - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetime-properties) 581 | ```python 582 | df["date"].dt.year 583 | df["date"].dt.quarter 584 | df["date"].dt.month 585 | df["date"].dt.week 586 | df["date"].dt.day 587 | df["date"].dt.weekday 588 | df["date"].dt.weekday_name 589 | df["date"].dt.hour 590 | ``` 591 | 592 | * Time between two dates 593 | ```python 594 | # Days between 595 | df["first_date"].sub(df["second_date"]).div(np.timedelta64(1, "D")) 596 | 597 | # Months between 598 | df["first_date"].sub(df["second_date"]).div(np.timedelta64(1, "M")) 599 | 600 | # Equivalent to above 601 | (df["first_date] - df["second_date"]) / np.timedelta64(1, "M") 602 | ``` 603 | 604 | * Weekend column 605 | ```python 606 | df["is_weekend"] = np.where(df["date"].dt.dayofweek.isin([5, 6]), 1, 0) 607 | ``` 608 | 609 | * Get prior date 610 | ```python 611 | df.sort_values(by=["customer_id, "order_date"])\ 612 | .groupby("customer_id")["order_date"].shift(periods=1) 613 | ``` 614 | 615 | * Days since prior date 616 | ```python 617 | df.sort_values(by = ["customer_id", "order_date"])\ 618 | .groupby("customer_id")["order_date"]\ 619 | .diff()\ 620 | .div(np.timedelta64(1, "D")) 621 | ``` 622 | 623 | * Percent change since prior date 624 | ```python 625 | df.sort_values(by = ["customer_id", "order_date"])\ 626 | .groupby("customer_id")["order_date"]\ 627 | .pct_change() 628 | ``` 629 | 630 | * Percentile rank for measure 631 | ```python 632 | df["salary"].rank(pct=True) 633 | ``` 634 | 635 | * Occurrences of word in row 636 | ```python 637 | import re 638 | 639 | df["review"].str.count("great", flags=re.IGNORECASE) 640 | ``` 641 | 642 | * Distinct list aggregation 643 | ```python 644 | df["unique_products"] = df.groupby("customer_id").agg({"products": "unique"}) 645 | 646 | # Transform each element -> row - Pandas >= 0.25 647 | df["unique_products"].explode() 648 | ``` 649 | 650 | * User-item matrix 651 | ```python 652 | df.groupby("customer_id")["products"].value_counts().unstack().fillna(0) 653 | ``` 654 | 655 | * Binning 656 | ```python 657 | pd.qcut(data["measure"], q = 4, labels = False) 658 | 659 | # Numeric 660 | pd.cut(df["measure"], bins = 4, labels = False) 661 | 662 | # Dimension 663 | pd.cut(df["age"], bins = [0, 18, 25, 99], labels = ["child", "young adult", "adult"]) 664 | ``` 665 | 666 | * Dummy variables 667 | ```python 668 | # Use drop_first = True to avoid collinearity 669 | pd.get_dummies(df, drop_first = True) 670 | ``` 671 | 672 | * Sort and take first value by dimension 673 | ```python 674 | df.sort_values(by = "variable").groupby("dimension").first() 675 | ``` 676 | 677 | * MinMax normalization 678 | ```python 679 | df["salary_minmax"] = ( 680 | df["salary"] - df["salary"].min()) / (df["salary"].max() - df["salary"].min() 681 | ) 682 | ``` 683 | 684 | * Z-score normalization 685 | ```python 686 | df["salary_zscore"] = (df["salary"] - df["salary"].mean()) / df["salary"].std() 687 | ``` 688 | 689 | * Log transformation 690 | ```python 691 | # For positive data with no zeroes 692 | np.log(df["sales"]) 693 | 694 | # For positive data with zeroes 695 | np.log1p(df["sales"]) 696 | 697 | # Convert back - get predictions if target is log transformed 698 | np.expm1(df["sales"]) 699 | ``` 700 | 701 | * Boxcox transformation 702 | ```python 703 | from scipy import stats 704 | 705 | # Must be positive 706 | stats.boxcox(df["sales"])[0] 707 | ``` 708 | 709 | * Reciprocal transformation 710 | ```python 711 | df["age_reciprocal"] = 1.0 / df["age"] 712 | ``` 713 | 714 | * Square root transformation 715 | ```python 716 | df["age_sqrt"] = np.sqrt(df["age"]) 717 | ``` 718 | 719 | * Winsorization 720 | ```python 721 | upper_limit = np.percentile(df["salary"].values, 99) 722 | lower_limit = np.percentile(df["salary"].values, 1) 723 | 724 | df["salary"].clip(lower = lower_limit, upper = upper_limit) 725 | ``` 726 | 727 | * Mean encoding 728 | ```python 729 | df.groupby("dimension")["target"].transform("mean") 730 | ``` 731 | 732 | * Z-scores for outliers 733 | ```python 734 | from scipy import stats 735 | import numpy as np 736 | 737 | z = np.abs(stats.zscores(df)) 738 | df = df[(z < 3).all(axis = 1)] 739 | ``` 740 | 741 | * Interquartile range (IQR) 742 | ```python 743 | q1 = df["salary"].quantile(0.25) 744 | q3 = df["salary"].quantile(0.75) 745 | iqr = q3 - q1 746 | 747 | df.query("(@q1 - 1.5 * @iqr) <= salary <= (@q3 + 1.5 * @iqr)") 748 | ``` 749 | 750 | * Geocoder - [github](https://github.com/DenisCarriere/geocoder) 751 | * Geopy - [github](https://github.com/geopy/geopy) 752 | ```python 753 | import geocoder 754 | 755 | df["lat_long"] = df["ip"].apply(lambda x: geocoder.ip(x).latlng) 756 | ``` 757 | 758 | * RFM - Recency, Frequency and Monetary 759 | ```python 760 | rfm = ( 761 | df.groupby("customer_id") 762 | .agg( 763 | { 764 | "order_date": lambda x: (x.max() - x.min()).days, 765 | "order_id": "nunique", 766 | "price": "mean", 767 | } 768 | ) 769 | .rename( 770 | columns={"order_date": "recency", "order_id": "frequency", "price": "monetary"} 771 | ) 772 | ) 773 | 774 | rfm_quantiles = rfm.quantile(q=[0.2, 0.4, 0.6, 0.8]) 775 | 776 | recency_conditions = [ 777 | rfm.recency >= rfm_quantiles.recency.iloc[3], 778 | rfm.recency >= rfm_quantiles.recency.iloc[2], 779 | rfm.recency >= rfm_quantiles.recency.iloc[1], 780 | rfm.recency >= rfm_quantiles.recency.iloc[0], 781 | rfm.recency <= rfm_quantiles.recency.iloc[0], 782 | ] 783 | 784 | frequency_conditions = [ 785 | rfm.frequency <= rfm_quantiles.frequency.iloc[0], 786 | rfm.frequency <= rfm_quantiles.frequency.iloc[1], 787 | rfm.frequency <= rfm_quantiles.frequency.iloc[2], 788 | rfm.frequency <= rfm_quantiles.frequency.iloc[3], 789 | rfm.frequency >= rfm_quantiles.frequency.iloc[3], 790 | ] 791 | 792 | monetary_conditions = [ 793 | rfm.monetary <= rfm_quantiles.monetary.iloc[0], 794 | rfm.monetary <= rfm_quantiles.monetary.iloc[1], 795 | rfm.monetary <= rfm_quantiles.monetary.iloc[2], 796 | rfm.monetary <= rfm_quantiles.monetary.iloc[3], 797 | rfm.monetary >= rfm_quantiles.monetary.iloc[3], 798 | ] 799 | 800 | ranks = [1, 2, 3, 4, 5] 801 | 802 | rfm["r"] = np.select(recency_conditions, ranks, "other") 803 | rfm["f"] = np.select(frequency_conditions, ranks, "other") 804 | rfm["m"] = np.select(monetary_conditions, ranks, "other") 805 | 806 | rfm["segment"] = rfm["r"].astype(str).add(rfm["f"].astype(str)) 807 | 808 | segment_map = { 809 | r"[1-2][1-2]": "hibernating", 810 | r"[1-2][3-4]": "at risk", 811 | r"[1-2]5": "cannot lose", 812 | r"3[1-2]": "about to sleep", 813 | r"33": "need attention", 814 | r"[3-4][4-5]": "loyal customers", 815 | r"41": "promising", 816 | r"51": "new customers", 817 | r"[4-5][2-3]": "potential loyalists", 818 | r"5[4-5]": "champions", 819 | } 820 | 821 | rfm["segment"] = rfm.segment.replace(segment_map, regex=True) 822 | ``` 823 | 824 | * Haversine 825 | ```python 826 | import numpy as np 827 | from numpy import pi, deg2rad, cos, sin, arcsin, sqrt 828 | 829 | def haversine(s_lat, s_lng, e_lat, e_lng): 830 | """ 831 | determines the great-circle distance between two point 832 | on a sphere given their longitudes and latitudes 833 | """ 834 | 835 | # approximate radius of earth in miles 836 | R = 3959.87433 837 | 838 | s_lat = s_lat * np.pi / 180.0 839 | s_lng = np.deg2rad(s_lng) 840 | e_lat = np.deg2rad(e_lat) 841 | e_lng = np.deg2rad(e_lng) 842 | 843 | d = ( 844 | np.sin((e_lat - s_lat) / 2) ** 2 845 | + np.cos(s_lat) * np.cos(e_lat) * np.sin((e_lng - s_lng) / 2) ** 2 846 | ) 847 | 848 | return 2 * R * np.arcsin(np.sqrt(d)) 849 | 850 | 851 | df['distance'] = haversine( 852 | df["start_lat"].values, 853 | df["start_long"].values, 854 | df["end_lat"].values, 855 | df["end_long"].values 856 | ) 857 | ``` 858 | 859 | * Manhattan 860 | ```python 861 | def manhattan(s_lat, s_lng, e_lat, e_lng): 862 | """ 863 | sum of horizontal and vertical distance between 864 | two points 865 | """ 866 | a = haversine(s_lat, s_lng, s_lat, e_lng) 867 | b = haversine(s_lat, s_lng, e_lat, s_lng) 868 | return a + b 869 | ``` 870 | 871 | ## Random 872 | 873 | * Union two categorical columns - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.types.union_categoricals.html#pandas.api.types.union_categoricals) 874 | ```python 875 | from pandas.api.types import union_categoricals 876 | 877 | food = pd.Categorical(["burger king", "wendys"]) 878 | food_2 = pd.Categorical(["burger king", "chipotle"]) 879 | 880 | union_categoricals([food, food_2]) 881 | ``` 882 | 883 | * Testing - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#testing-functions) 884 | ```python 885 | from pandas.util.testing import assert_frame_equal 886 | 887 | # Methods for Series and Index as well 888 | assert_frame_equal(df_1, df_2) 889 | ``` 890 | 891 | * Dtype checking - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#dtype-introspection) 892 | ```python 893 | from pandas.api.types import is_numeric_dtype 894 | 895 | is_numeric_dtype("hello world") 896 | # False 897 | ``` 898 | 899 | * Infer column dtype, useful to remap column dtypes [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.types.infer_dtype.html#pandas.api.types.infer_dtype) 900 | ```python 901 | from pandas.api.types import infer_dtype 902 | 903 | infer_dtype(["john", np.nan, "jack"], skipna=True) 904 | # string 905 | 906 | infer_dtype(["john", np.nan, "jack"], skipna=False) 907 | # mixed 908 | ``` 909 | --------------------------------------------------------------------------------