├── .github ├── ISSUE_TEMPLATE │ ├── bug_report.md │ └── snippet-request.md └── pull_request_template.md ├── .gitignore ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── bigquery ├── 100-stacked-bar.md ├── array-to-columns.md ├── bar-chart.md ├── binomial-distribution.md ├── cdf.md ├── cohort-analysis.md ├── compound-growth-rates.md ├── concat-nulls.md ├── convert-datetimes-string.md ├── convert-string-datetimes.md ├── count-nulls.md ├── filtering-arrays.md ├── first-row-of-group.md ├── generating-arrays.md ├── geographical-distance.md ├── get-last-array-element.md ├── heatmap.md ├── histogram-bins.md ├── horizontal-bar.md ├── json-strings.md ├── last-day.md ├── latest-row.md ├── least-array.md ├── least-udf.md ├── localtz-to-utc.md ├── median-udf.md ├── median.md ├── moving-average.md ├── outliers-mad.md ├── outliers-stdev.md ├── outliers-z.md ├── pivot.md ├── rand_between.md ├── random-sampling.md ├── rank.md ├── regex-email.md ├── regex-parse-url.md ├── repeating-words.md ├── replace-empty-strings-null.md ├── replace-multiples.md ├── replace-null.md ├── running-total.md ├── split-uppercase.md ├── stacked-bar-line.md ├── tf-idf.md ├── timeseries.md ├── transforming-arrays.md ├── uniform-distribution.md ├── unpivot-melt.md └── yoy.md ├── mssql ├── Cumulative distribution functions.md ├── check-column-is-accessible.md ├── compound-growth-rates.md ├── concatenate-strings.md ├── first-row-of-group.md └── search-stored-procedures.md ├── mysql ├── cdf.md └── initcap-udf.md ├── postgres ├── array-reverse.md ├── convert-epoch-to-timestamp.md ├── count-nulls.md ├── cume_dist.md ├── dt-formatting.md ├── generate-timeseries.md ├── get-last-array-element.md ├── histogram-bins.md ├── json-strings.md ├── last-x-days.md ├── median.md ├── moving-average.md ├── ntile.md ├── rank.md ├── regex-email.md ├── regex-parse-url.md ├── replace-empty-strings-null.md ├── replace-null.md └── split-column-to-rows.md ├── snowflake ├── 100-stacked-bar.md ├── bar-chart.md ├── count-nulls.md ├── filtering-arrays.md ├── flatten-array.md ├── generate-arrays.md ├── geographical-distance.md ├── heatmap.md ├── histogram-bins.md ├── horizontal-bar.md ├── last-array-element.md ├── moving-average.md ├── parse-json.md ├── parse-url.md ├── pivot.md ├── random-sampling.md ├── rank.md ├── replace-empty-strings-null.md ├── replace-null.md ├── running-total.md ├── stacked-bar-line.md ├── timeseries.md ├── transforming-arrays.md ├── udf-sort-array.md ├── unpivot-melt.md └── yoy.md └── template.md /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: "[BUG] " 5 | labels: bug 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **Expected behaviour** 14 | A clear and concise description of what you expected to happen. 15 | 16 | **Screenshots** 17 | If applicable, add screenshots to help explain your problem. 18 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/snippet-request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Snippet request 3 | about: Suggest an idea for a snippet 4 | title: "[SNIPPET REQUEST] " 5 | labels: help wanted 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe what you're looking for** 11 | A clear and concise description of what you want a snippet to achieve. 
12 | 13 | **What database are you using?** 14 | If it understands SQL, it's welcome here! 15 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | ## Summary of changes 2 | 3 | 4 | 5 | ## Checklist before submitting 6 | 7 | - [ ] I have added this snippet to the correct folder 8 | - [ ] I have added a link to this snippet in the [readme](https://github.com/count/sql-snippets/blob/main/README.md) 9 | - [ ] I haven't included any sensitive information in my example(s) 10 | - [ ] I have checked my query / function works 11 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | Thanks for your interest in helping out! Feel free to comment, open issues, and create pull requests if there's something you want to share. 3 | 4 | ## Submitting pull requests 5 | To make a change to this repository, the best way is to use pull requests: 6 | 7 | 1. Fork the repository and create your branch from `main`. 8 | 2. Make your changes. 9 | 6. Create a pull request. 10 | 11 | (If the change is particularly small, these steps are easily accomplished directly in the GitHub UI.) 12 | 13 | ## Report bugs using Github's issues 14 | We use GitHub issues to track bugs. Report a bug or start a conversation by opening a new issue. 15 | 16 | ## Use a consistent template 17 | Take a look at the existing snippets and try to format your contribution similarly. You may want to start from our [template](./template.md). 18 | 19 | 20 | ## License 21 | By contributing to this repository, you agree that your contributions will be licensed under its MIT License. 22 | 23 | ## References 24 | This document was adapted from the GitHub Gist of [briandk](https://gist.github.com/briandk/3d2e8b3ec8daf5a27a62). -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Count 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SQL Snippets 2 | 3 | A curated collection of helpful SQL queries and functions, maintained by [Count](https://count.co). 4 | 5 | All [contributions](./CONTRIBUTING.md) are very welcome - let's get useful SQL snippets out of Slack channels and Notion pages, and collect them as a community resource. 6 | 7 | ### [🙋‍♀️ Request a snippet](https://github.com/count/sql-snippets/issues/new?assignees=&labels=help+wanted&template=snippet-request.md&title=%5BSNIPPET+REQUEST%5D+) 8 | ### [⁉️ Report a bug](https://github.com/count/sql-snippets/issues/new?assignees=&labels=bug&template=bug_report.md&title=%5BBUG%5D+) 9 | ### [↣ Submit a pull request](https://github.com/count/sql-snippets/compare) 10 | 11 | 22 | 23 | | Snippet | Databases | 24 | | ------- | --------- | 25 | |

**Analytics**
| | 26 | | Cohort analysis | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/cohort-analysis.md) | 27 | | Cumulative distribution functions | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/cdf.md) [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/cume_dist.md) [![MySQL](https://img.shields.io/badge/MySQL-F29111)](./mysql/cdf.md) ![SQL Server](https://img.shields.io/badge/SQL%20Server-A91D22)| 28 | | Compound growth rates | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/compound-growth-rates.md) [![SQL Server](https://img.shields.io/badge/SQL%20Server-A91D22)](./mssql/compound-growth-rates.md) | 29 | | Generate a binomial distribution | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/binomial-distribution.md) | 30 | | Generate a uniform distribution | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/uniform-distribution.md) | 31 | | Histogram bins | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/histogram-bins.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/histogram-bins.md) [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/histogram-bins.md) | 32 | | Median | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/median.md) | 33 | | Median (UDF) | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/median-udf.md) | 34 | | Outlier detection: MAD method | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/outliers-mad.md) | 35 | | Outlier detection: Standard Deviation method | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/outliers-stdev.md) | 36 | | Outlier detection: Z-score method | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/outliers-z.md) | 37 | | Random Between | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/rand_between.md) | 38 | | Random sampling | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/random-sampling.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/random-sampling.md) | 39 | | Ranking | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/rank.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/rank.md) [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/rank.md) | 40 | | TF-IDF | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/tf-idf.md) | 41 | |

**Arrays**
| | 42 | | Arrays to columns | [![BigQuery](https://img.shields.io/badge/BigQuery-4387FB)](./bigquery/array-to-columns.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/flatten-array.md) | | 43 | | Filtering arrays | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/filtering-arrays.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/filtering-arrays.md) | 44 | | Generating arrays | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/generating-arrays.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/generate-arrays.md) | 45 | | Get last array element | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/get-last-array-element.md) [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/get-last-array-element.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/last-array-element.md) | 46 | | Reverse array | [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/array-reverse.md) | 47 | | Transforming arrays | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/transforming-arrays.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/transforming-arrays.md) | 48 | | Least non-null value in array (UDF) | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/least-udf.md) | 49 | | Greatest/Least value in array | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/least-array.md) | 50 | | Sort Array (UDF) | [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/udf-sort-array.md) | 51 | |

**Charts**
| | 52 | | Bar chart | [![BigQuery](https://img.shields.io/badge/BigQuery-4387FB)](./bigquery/bar-chart.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/bar-chart.md) | 53 | | Horizontal bar chart | [![BigQuery](https://img.shields.io/badge/BigQuery-4387FB)](./bigquery/horizontal-bar.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/horizontal-bar.md) | 54 | | Time series | [![BigQuery](https://img.shields.io/badge/BigQuery-4387FB)](./bigquery/timeseries.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/timeseries.md) | 55 | | 100% stacked bar chart | [![BigQuery](https://img.shields.io/badge/BigQuery-4387FB)](./bigquery/100-stacked-bar.md ) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/100-stacked-bar.md) | 56 | | Heat map | [![BigQuery](https://img.shields.io/badge/BigQuery-4387FB)](./bigquery/heatmap.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/heatmap.md) | 57 | | Stacked bar and line chart | [![BigQuery](https://img.shields.io/badge/BigQuery-4387FB)](./bigquery/stacked-bar-line.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/stacked-bar-line.md) | 58 | |

**Database metadata**
| | 59 | | Check if column exists in table | [![SQL Server](https://img.shields.io/badge/SQL%20Server-A91D22)](./mssql/check-column-is-accessible.md) | 60 | | Search for text in Stored Procedures | [![SQL Server](https://img.shields.io/badge/SQL%20Server-A91D22)](./mssql/search-stored-procedures.md) | 61 | |

**Dates and times**
| | 62 | | Date/Time Formatting One Pager| [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/dt-formatting.md) | 63 | | Converting from strings | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/convert-string-datetimes.md) | 64 | | Converting to strings | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/convert-datetimes-string.md) | 65 | | Local Timezone to UTC | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/localtz-to-utc.md) | 66 | | Converting epoch/unix to timestamp | [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/convert-epoch-to-timestamp.md) | 67 | | Generate timeseries | [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/generate-timeseries.md) | 68 | | Last day of the month | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/last-day.md) | 69 | | Last X days | [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/last-x-days.md) | 70 | |

**Geography**
| | 71 | | Calculating the distance between two points | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/geographical-distance.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/geographical-distance.md) | 72 | |

**JSON**
| | 73 | | Parsing JSON strings | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/json-strings.md) [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/json-strings.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/parse-json.md)| 74 | |

**Regular & String expressions**
| | 75 | | Parsing URLs | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/regex-parse-url.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/parse-url.md) [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/regex-parse-url.md) | 76 | | Validating emails | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/regex-email.md) [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/regex-email.md) | 77 | | Extract repeating words | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/repeating-words.md) | 78 | | Split on Uppercase characters | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/split-uppercase.md) | 79 | | Replace multple conditions | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/replace-multiples.md) | 80 | | CONCAT with NULLs | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/concat-nulls.md) | 81 | | Capitalize initial letters (UDF) | [![MySQL](https://img.shields.io/badge/MySQL-F29111)](./mysql/initcap-udf.md) | 82 | | Concatenating strings | [![SQL Server](https://img.shields.io/badge/SQL%20Server-A91D22)](./mssql/concatenate-strings.md) | 83 | |

**Transformation**
| | 84 | | Pivot | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/pivot.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/pivot.md) | 85 | | Unpivoting / melting | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/unpivot-melt.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/unpivot-melt.md) | 86 | | Replace NULLs | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/replace-null.md) [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/replace-null.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/replace-null.md) | 87 | | Replace empty string with NULL | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/replace-empty-strings-null.md) [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/replace-empty-strings-null.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/replace-empty-strings-null.md) | 88 | | Counting NULLs | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/count-nulls.md) [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/count-nulls.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/count-nulls.md) | 89 | | Splitting single columns into rows | [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/split-column-to-rows.md) | 90 | | Latest Row (deduplicate) | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/latest-row.md) | 91 | |

**Window functions**
| | 92 | | First row of each group | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/first-row-of-group.md) [![SQL Server](https://img.shields.io/badge/SQL%20Server-A91D22)](./mssql/first-row-of-group.md) | 93 | | Moving averages | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/moving-average.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/moving-average.md) [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/moving-average.md) | 94 | | Running totals | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/running-total.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/running-total.md) | 95 | | Year-on-year changes | [![BigQuery](https://img.shields.io/badge/BigQuery-4387fb)](./bigquery/yoy.md) [![Snowflake](https://img.shields.io/badge/Snowflake-29B5E8)](./snowflake/yoy.md) | 96 | | Creating equal-sized buckets | [![Postgres](https://img.shields.io/badge/Postgres-336791)](./postgres/ntile.md) | 97 | 98 | ## Other resources 99 | - [bigquery-utils](https://github.com/GoogleCloudPlatform/bigquery-utils) - a BigQuery-specific collection of scripts 100 | -------------------------------------------------------------------------------- /bigquery/100-stacked-bar.md: -------------------------------------------------------------------------------- 1 | # 100% bar chart 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/qMn2uMsl0rz?vm=e). 4 | 5 | # Description 6 | 7 | 100% bar charts compare the proportions of a metric across different groups. 8 | To build one you'll need: 9 | - A numerical column as the metric you want to compare. This should be represented as a percentage. 10 | - 2 categorical columns. One for the x-axis and one for the coloring of the bars. 11 | 12 | ```sql 13 | SELECT 14 | AGG_FN() as metric, 15 | as x, 16 | as color, 17 | FROM 18 | 19 | GROUP BY 20 | x, color 21 | ``` 22 | where: 23 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 24 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column. Must add up to 1 for each bar. 25 | - `XAXIS_COLUMN` is the group you want to show on the x-axis. Make sure this is a date, datetime, timestamp, or time column (not a number) 26 | - `COLOR_COLUMN` is the group you want to show as different colors of your bars. 27 | 28 | # Usage 29 | 30 | In this example with some tennis data we'll see what percentage of matches won for the world's top tennis players come on different surfaces. 31 | 32 | ```sql 33 | -- a 34 | select 35 | count(matches_partitioned.match_num) matches_won, 36 | winner_name player, 37 | surface 38 | from tennis.matches_partitioned 39 | where winner_name in ('Rafael Nadal','Serena Williams', 'Roger Federer','Novak Djokovic') 40 | group by player, surface 41 | ``` 42 | 43 | |matches_won| player| surface| 44 | |-----------|-------|--------| 45 | |95| Novak Djokovic | Grass| 46 | |621| Novak Djokovic | Hard | 47 | |230 | Novak Djokovic | Clay| 48 | |9 | Novak Djokovic | Carpet| 49 | | ... | ... | ... 
| 50 | 51 | 52 | ```sql 53 | -- b 54 | select 55 | count(matches_partitioned.match_num) matches_played, 56 | winner_name player 57 | from tennis.matches_partitioned 58 | where winner_name in ('Rafael Nadal','Serena Williams', 'Roger Federer','Novak Djokovic') 59 | group by player 60 | ``` 61 | | matches_played | player | 62 | | -------------- | --------------- | 63 | | 955 | Novak Djokovic | 64 | | 1250 | Roger Federer | 65 | | 1016 | Rafael Nadal | 66 | | 846 | Serena Williams | 67 | 68 | ```sql 69 | select 70 | a.player, 71 | a.surface, 72 | a.matches_won / b.matches_played per_matches_won 73 | from 74 | a left join b on a.player = b.player 75 | ``` 76 | 100-stacked-bar 77 | -------------------------------------------------------------------------------- /bigquery/array-to-columns.md: -------------------------------------------------------------------------------- 1 | # Array to Column 2 | Explore this snippet with some demo data [here](https://count.co/n/pQ914qtFlKo?vm=e). 3 | 4 | # Description 5 | Taking a BigQuery ARRAY and creating a column for each value is commonly used when working with 'wide' data. The following snippet will let you quickly take an array column and create a row for each value. 6 | 7 | ```sql 8 | SELECT 9 | select , array_value 10 | FROM 11 | `project.schema.table` 12 | CROSS JOIN UNNEST() as array_value 13 | ``` 14 | where: 15 | - `` is a list of the columns from the original table you want to keep in your query results 16 | - `` is the name of the ARRAY column 17 | 18 | 19 | # Usage 20 | The following example shows how to 'explode' 2 rows with 3 array elements each into a separate row for each array entry: 21 | 22 | ```sql 23 | WITH demo_data as ( 24 | select 1 row_index,[1,3,4] array_col union all 25 | select 2 row_index, [3,6,9] array_col 26 | ) 27 | select demo_data.row_index, array_value 28 | FROM demo_data 29 | cross join UNNEST(array_col) as array_value 30 | ``` 31 | | row_index | array_col | 32 | | --- | ----------- | 33 | | 1 | 1 | 34 | | 1 | 3 | 35 | | 1 | 4 | 36 | | 2 | 3 | 37 | | 2 | 6 | 38 | | 2 | 9 | 39 | 40 | # References & Helpful Links: 41 | - [UNNEST in BigQuery](https://count.co/sql-resources/bigquery-standard-sql/unnest) 42 | - [Arrays in Standard SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays) 43 | -------------------------------------------------------------------------------- /bigquery/bar-chart.md: -------------------------------------------------------------------------------- 1 | # Bar chart 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/KCUlCzlJDHK?vm=e). 4 | 5 | # Description 6 | 7 | Bar charts are perhaps the most common charts used in data analysis. They have one categorical or temporal axis, and one numerical axis. 8 | You just need a simple GROUP BY to get all the elements you need for a bar chart: 9 | 10 | ```sql 11 | SELECT 12 | AGG_FN() as metric, 13 | as cat 14 | FROM 15 | 16 | GROUP BY 17 | cat 18 | ``` 19 | 20 | where: 21 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 22 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column. 23 | - `CATEGORICAL_COLUMN` is the group you want to show on the x-axis. 
Make sure this is a date, datetime, timestamp, or time column (not a number) 24 | 25 | # Usage 26 | 27 | In this example with some Spotify data, we'll look at the average daily streams by month of year: 28 | 29 | ```sql 30 | select 31 | avg(streams) avg_daily_streams, 32 | format_date('%m - %b', day) month 33 | from ( 34 | select sum(streams) streams, 35 | day 36 | from spotify.spotify_daily_tracks 37 | group by day 38 | ) 39 | group by month 40 | ``` 41 | bar 42 | -------------------------------------------------------------------------------- /bigquery/binomial-distribution.md: -------------------------------------------------------------------------------- 1 | # Generating a binomial distribution 2 | 3 | Explore this snippet [here](https://count.co/n/F6GnYbjyIct?vm=e) 4 | 5 | # Description 6 | 7 | Once it is possible to generate a uniform distribution of numbers, it is possible to generate any other distribution. This snippet shows how to generate a sample of numbers that are binomially distributed using rejection sampling. 8 | 9 | > Note - rejection sampling is simple to implement, but potentially very inefficient 10 | 11 | The steps here are: 12 | - Generate a random list of `k` = 'number of successes' from n trials, with each trial having an expected `p` = 'probability of success'. 13 | - For each `k`, calculate the binomial probability of `k` successes using the [binomial formula](https://en.wikipedia.org/wiki/Binomial_distribution), and another random number `u` between zero and one. 14 | - If `u` is smaller than the expected probability, accept the value of `k`, otherwise reject it 15 | 16 | ```sql 17 | with params as ( 18 | select 19 | 0.3 as p, -- Probability of success per trial 20 | 20 as n, -- Number of trials 21 | 100000 as num_samples -- Number of samples to test (the actual samples will be fewer) 22 | ), 23 | samples as ( 24 | select 25 | -- Generate a random integer k = 'number of successes' in [0, n] 26 | mod(abs(farm_fingerprint(cast(indices as string))), n + 1) as k, 27 | -- Generate a random number u uniformly distributed in [0, 1] 28 | mod(abs(farm_fingerprint(cast(indices as string))), 1000000) / 1000000 as u 29 | from 30 | params, 31 | unnest(generate_array(1, num_samples)) as indices 32 | ), 33 | probs as ( 34 | select 35 | -- Hack in a factorial function for n! / (k! * (n - k)!) 36 | (select exp(sum(ln(i))) from unnest(if(n = 0, [1], generate_array(1, n))) as i) / 37 | (select exp(sum(ln(i))) from unnest(if(k = 0, [1], generate_array(1, k))) as i) / 38 | (select exp(sum(ln(i))) from unnest(if(n = k, [1], generate_array(1, n - k))) as i) * 39 | pow(p, k) * pow(1 - p, n - k) as threshold, 40 | k, 41 | u 42 | from params, samples 43 | ), 44 | after_rejection as ( 45 | -- If the random number for this value of k is too high, reject it 46 | -- Note - the binomial PDF is normalised to a max value of 1 to improve the sampling efficiency 47 | select k from probs where u <= threshold / (select max(threshold) from probs) 48 | ) 49 | 50 | select * from after_rejection 51 | ``` 52 | 53 | | k | 54 | | --- | 55 | | 7 | 56 | | 1 | 57 | | 6 | 58 | | ... | -------------------------------------------------------------------------------- /bigquery/cdf.md: -------------------------------------------------------------------------------- 1 | # Cumulative distribution functions 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/sL7bEFcg11Z?vm=e). 
4 | 5 | # Description 6 | Cumulative distribution functions (CDF) are a method for analysing the distribution of a quantity, similar to histograms. They show, for each value of a quantity, what fraction of rows are smaller or greater. 7 | One method for calculating a CDF is as follows: 8 | 9 | ```sql 10 | select 11 | -- Use a row_number window function to get the position of this row 12 | (row_number() over (order by <quantity> asc)) / (select count(*) from <table>) cdf, 13 | <quantity>, 14 | from <table>
15 | ``` 16 | where 17 | - `quantity` - the column containing the metric of interest 18 | - `table` - the table name 19 | # Example 20 | 21 | Using total Spotify streams as an example data source, let's identify: 22 | - `table` - this is called `raw` 23 | - `quantity` - this is the column `streams` 24 | 25 | then the query becomes: 26 | 27 | ```sql 28 | select 29 | (row_number() over (order by streams asc)) / (select count(*) from raw) frac, 30 | streams, 31 | from raw 32 | ``` 33 | | frac | streams | 34 | | ----------- | ----------- | 35 | | 0 | 374359 | 36 | | 0 | 375308 | 37 | | ... | ... | 38 | | 1 | 7778950 | 39 | -------------------------------------------------------------------------------- /bigquery/cohort-analysis.md: -------------------------------------------------------------------------------- 1 | # Cohort Analysis 2 | View interactive notebook [here](https://count.co/n/INCbY1QUomH?vm=e). 3 | 4 | 5 | # Description 6 | A cohort analysis groups users based on certain traits and compares their behavior across groups and over time. 7 | The most common way of grouping users is by when they first joined. This definition will change for each company and even use case but it's generally taken to be the day or week that a user joined. All users who joined at the same time are grouped together. Again, this can be made more complex if needed. 8 | The simplest way to start comparing cohorts is seeing if they returned to your site/product in the weeks after they joined. You can continue to refine what patterns of behavior you're looking for. 9 | But for this example, we'll do the following: 10 | - Assign users a cohort based on the week they created their account 11 | - For each cohort determine the % of users that returned for the 10 weeks after joining 12 | ## The Data 13 | - The way your data is structured will greatly determine how you do your analysis. Most commonly, there's a transaction table that has timestamped events such as 'create account' or 'purchase' for each user. This is what we'll use for our example. 14 | - The data comes from [BigQuery's Example Google Analytics dataset](https://console.cloud.google.com/bigquery). 
15 | ## The Code 16 | - In this example I've gone step by step, but to do it all in one query you could try: 17 | 18 | ```sql 19 | WITH cohorts as 20 | ( 21 | SELECT 22 | fullVisitorId, 23 | date_trunc(first_value(date) over (partition by fullVisitorId order by date asc),week) cohort 24 | FROM data 25 | ), 26 | activity as 27 | ( 28 | SELECT 29 | cohorts.cohort, 30 | date_diff(date,cohort,week) weeks_since_joining, 31 | count(distinct data.fullVisitorId) visitors 32 | FROM 33 | cohorts 34 | LEFT JOIN data ON cohorts.fullVisitorId = data.fullVisitorId 35 | WHERE 36 | date_diff(date,cohort,week) between 1 and 10 37 | GROUP BY cohort, weeks_since_joining 38 | ), 39 | cohorts_with_weeks as 40 | ( 41 | SELECT distinct 42 | cohort, 43 | count(distinct fullVisitorId) cohort_size, 44 | weeks_since_joining 45 | FROM cohorts 46 | CROSS JOIN 47 | UNNEST(generate_array(1,10)) weeks_since_joining 48 | GROUP BY cohort, weeks_since_joining 49 | ), 50 | SELECT 51 | cohorts_with_weeks.cohort, 52 | cohorts_with_weeks.weeks_since_joining, 53 | cohorts_with_weeks.cohort_size, 54 | coalesce(activity.visitors,0)/cohorts_with_weeks.cohort_size per_active, 55 | FROM 56 | cohorts_with_weeks 57 | LEFT JOIN activity 58 | ON cohorts_with_weeks.cohort = activity.cohort 59 | AND cohorts_with_weeks.weeks_since_joining = activity.weeks_since_joining 60 | ``` 61 | 62 | #### Data Preview 63 | | fullVisitorId | visitNumber | visitId | date | ... | channelGrouping | 64 | | ---- | ----- |----- |---- | ----- | ----- | 65 | |6453870986532490896 | 1 | 1473801936 | 2016-09-13 | ... | Display | 66 | |2287637838474850444 | 5 | 1473815616 | 2016-09-13 | ... | Direct | 67 | 68 | ### 1. Assign each user a cohort based on the week they first appeared 69 | ```sql 70 | --cohorts 71 | select 72 | fullVisitorId, 73 | date_trunc(first_value(date) over (partition by fullVisitorId order by date asc),week) cohort 74 | from data 75 | ``` 76 | |fullVisitorId | cohort | 77 | | ---- | ---- | 78 | | 0002871498069867123 | 21/08/2016 | 79 | | 0020502892485275044 | 11/09/2016 | 80 | |.... | .... | 81 | 82 | 83 | ### 2. For each cohort find the number of users that have returned for weeks 1-10 84 | ```sql 85 | --activity 86 | select 87 | cohorts.cohort, 88 | date_diff(date,cohort,week) weeks_since_joining, 89 | count(distinct data.fullVisitorId) visitors 90 | from 91 | cohorts 92 | left join data on cohorts.fullVisitorId = data.fullVisitorId 93 | where date_diff(date,cohort,week) between 1 and 10 94 | group by 95 | cohort, weeks_since_joining 96 | ``` 97 | 98 | |cohort | weeks_since_joining | visitors| 99 | |----| --- | ---- | 100 | |2016-07-31| 2 | 7 | 101 | |2016-07-31| 1 | 6 | 102 | |2016-07-31| 9 | 1 | 103 | |2016-07-31| 5 | 1 | 104 | |.... | .... | .... | 105 | 106 | 107 | ### 3. 
Turn that into a percentage of users each week (being sure to include weeks where no users returned) 108 | ```sql 109 | --cohorts_with_weeks 110 | select distinct 111 | cohort, 112 | count(distinct fullVisitorId) cohort_size, 113 | weeks_since_joining 114 | from cohorts 115 | cross join 116 | unnest(generate_array(1,10)) weeks_since_joining 117 | group by cohort, weeks_since_joining 118 | ``` 119 | |cohort| cohort_size | weeks_since_joining | 120 | |----- |----- |----- | 121 | | 2016-08-21| 263 | 1 | 122 | |2016-08-21| 263 | 2 | 123 | |2016-08-21| 263 | 3 | 124 | |2016-08-21| 263 | 4 | 125 | |2016-08-21| 263 | 5 | 126 | |2016-08-21| 263 | 6 | 127 | |2016-08-21| 263 | 7 | 128 | |2016-08-21| 263 | 8 | 129 | |2016-08-21| 263 | 9 | 130 | |2016-08-21| 263 | 10 | 131 | |...| ... | ... | 132 | 133 | ```sql 134 | --cohort_table 135 | Select 136 | cohorts_with_weeks.cohort, 137 | cohorts_with_weeks.cohort_size, 138 | cohorts_with_weeks.weeks_since_joining, 139 | coalesce(activity.visitors,0)/cohorts_with_weeks.cohort_size per_active, 140 | from 141 | cohorts_with_weeks 142 | left join activity on cohorts_with_weeks.cohort = activity.cohort and cohorts_with_weeks.weeks_since_joining = activity.weeks_since_joining 143 | ``` 144 | |cohort| cohort_size| weeks_since_joining | per_active | 145 | | ---- |---- | ---- | ---- | 146 | | 2016-08-21| 263 | 1 | 0.015 | 147 | |2016-08-21| 263 | 2 | 0.019 | 148 | |2016-08-21| 263 | 3 | 0.023 | 149 | | ... | ... | ... | .... | 150 | 151 | Now we can see our cohort table. 152 | 153 | Screenshot 2021-07-30 at 5 53 02 pm 154 | -------------------------------------------------------------------------------- /bigquery/compound-growth-rates.md: -------------------------------------------------------------------------------- 1 | # Compound growth rates 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/Ff9O9L7JR53?vm=e). 4 | 5 | # Description 6 | When a quantity is increasing in time by the same fraction every period, it is said to exhibit **compound growth** - every time period it increases by a larger amount. 
7 | It is possible to show that the behaviour of the quantity over time follows the law 8 | 9 | ```sql 10 | x(t) = x(0) * exp(r * t) 11 | ``` 12 | where 13 | - `x(t)` is the value of the quantity now 14 | - `x(0)` is the value of the quantity at time t = 0 15 | - `t` is time in units of [time_period] 16 | - `r` is the growth rate in units of 1 / [time_period] 17 | 18 | 19 | # Calculating growth rates 20 | From the equation above, to calculate a growth rate you need to know: 21 | - The value at the start of a period 22 | - The value at the end of a period 23 | - The duration of the period 24 | Then, following some algebra, the growth rate is given by 25 | 26 | ```sql 27 | r = ln(x(t) / x(0)) / t 28 | ``` 29 | 30 | For a concrete example, assume a table with columns: 31 | - `num_this_month` - this is `x(t)` 32 | - `num_last_month` - this is `x(0)` 33 | - `this_month` & `last_month` - `t` (in years) = `datetime_diff(this_month, last_month, month) / 12` 34 | 35 | 36 | #### Inferred growth rates 37 | 38 | In the following then, `growth_rate` is the equivalent yearly growth rate for that month: 39 | 40 | ```sql 41 | -- with_growth_rate 42 | select 43 | *, 44 | ln(num_this_month / num_last_month) * 12 / datetime_diff(this_month, last_month, month) growth_rate 45 | from with_previous_month 46 | ``` 47 | 48 | 49 | # Using growth rates 50 | To better depict the effects of a given growth rate, it can be converted to a year-on-year growth factor by inserting into the exponential formula above with t = 1 (one year duration). 51 | Or mathematically, 52 | 53 | ```sql 54 | yoy_growth = x(1) / x(0) = exp(r) 55 | ``` 56 | It is always sensible to spot-check what a growth rate actually represents to decide whether or not it is a useful metric. 57 | 58 | #### Extrapolated year-on-year growth 59 | 60 | ```sql 61 | -- with_yoy_growth 62 | select 63 | *, 64 | exp(growth_rate) as yoy_growth 65 | from with_growth_rate 66 | ``` 67 | | num_this_month | num_last_year | growth_rate | yoy_growth | 68 | | ----------- | ----------- | ----------- | ----------- | 69 | | 343386934 | 361476484 | -0.6160690738834281 | 0.5400632190921264 | 70 | | 1151727105 | 343386934 | 14.521920316267764 | 2026701.8316519475 | 71 | -------------------------------------------------------------------------------- /bigquery/concat-nulls.md: -------------------------------------------------------------------------------- 1 | # Concatinate text handling nulls 2 | 3 | ## Description 4 | 5 | If you concatenate values together and one is null then the while result is set to null 6 | 7 | ```sql 8 | SELECT CONCAT('word', ',', 'word2', ',', 'word3'), CONCAT('word', ',', NULL, ',', 'word3'); 9 | ``` 10 | ``` 11 | +----------------+----+ 12 | |f0_ |f1_ | 13 | +----------------+----+ 14 | |word,word2,word3|NULL| 15 | +----------------+----+ 16 | ``` 17 | 18 | If you want to ignore nulls you can use arrays and ARRAY_TO_STRING 19 | 20 | ``` 21 | SELECT ARRAY_TO_STRING(['word', 'word2', 'word3'], ','), ARRAY_TO_STRING(['word', NULL, 'word3'], ','); 22 | ``` 23 | ``` 24 | +----------------+----------+ 25 | |f0_ |f1_ | 26 | +----------------+----------+ 27 | |word,word2,word3|word,word3| 28 | +----------------+----------+ 29 | ``` 30 | -------------------------------------------------------------------------------- /bigquery/convert-datetimes-string.md: -------------------------------------------------------------------------------- 1 | # Converting dates/times to strings 2 | 3 | Explore this snippet [here](https://count.co/n/p72zrP81Vye?vm=e). 
4 | 5 | # Description 6 | It is possible to convert from a `DATE` / `DATETIME` / `TIMESTAMP` / `TIME` data type into a `STRING` using the `CAST` function, and BigQuery will choose a default string format. 7 | For more control over the output format use one of the `FORMAT_*` functions: 8 | 9 | ```sql 10 | with dt as ( 11 | select 12 | current_date() as date, 13 | current_time() as time, 14 | current_datetime() as datetime, 15 | current_timestamp() as timestamp 16 | ) 17 | 18 | select 19 | cast(timestamp as string) default_format, 20 | format_date('%Y/%d/%m', datetime) day_first, 21 | format_datetime('%A', date) weekday, 22 | format_datetime('%r', datetime) time_of_day, 23 | format_time('%H', time) hour, 24 | format_timestamp('%Z', timestamp) timezone 25 | from dt 26 | ``` -------------------------------------------------------------------------------- /bigquery/convert-string-datetimes.md: -------------------------------------------------------------------------------- 1 | # Converting strings to dates/times 2 | 3 | Explore this snippet [here](https://count.co/n/wvs19LPsUrQ?vm=e). 4 | 5 | # Description 6 | BigQuery supports the `DATE` / `DATETIME` / `TIMESTAMP` / `TIME` data types, all of which have different `STRING` representations. 7 | 8 | ## CAST 9 | If a string is formatted correctly, the `CAST` function will convert it into the appropriate type: 10 | 11 | ```sql 12 | select 13 | cast('2017-06-04 14:44:00' AS datetime) AS datetime, 14 | cast('2017-06-04 14:44:00 Europe/Berlin' AS timestamp) AS timestamp, 15 | cast('2017-06-04' AS date) AS date, 16 | cast( '14:44:00' as time) time 17 | ``` 18 | 19 | 20 | ## PARSE_* 21 | Alternatively, if the format of the string is non-standard but consistent, it is possible to define a custom string format using one of the `PARSE_DATE` / `PARSE_DATETIME` etc. functions: 22 | 23 | ```sql 24 | select 25 | parse_datetime( 26 | '%A %B %d, %Y %H:%M:%S', 27 | 'Thursday January 7, 2021 12:04:33' 28 | ) as parsed_datetime 29 | ``` -------------------------------------------------------------------------------- /bigquery/count-nulls.md: -------------------------------------------------------------------------------- 1 | # Count NULLs 2 | 3 | Explore this snippet [here](https://count.co/n/vGNUC7SOJei?vm=e). 4 | 5 | # Description 6 | 7 | Part of the data cleaning process involves understanding the quality of your data. NULL values are usually best avoided, so counting their occurrences is a common operation. 8 | There are several methods that can be used here: 9 | - `sum(if( is null, 1, 0)` - use the [`IF`](https://cloud.google.com/bigquery/docs/reference/standard-sql/conditional_expressions#if) function to return 1 or 0 if a value is NULL or not respectively, then aggregate. 10 | - `count(*) - count()` - use the different forms of the [`count()`](https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#count) aggregation which include and exclude NULLs. 11 | - `sum(case when x is null then 1 else 0 end)` - similar to the IF method, but using a [`CASE`](https://cloud.google.com/bigquery/docs/reference/standard-sql/conditional_expressions#case_expr) statement instead. 
12 | 13 | ```sql 14 | with data as ( 15 | select * from unnest([1, 2, null, null, 5]) as x 16 | ) 17 | 18 | select 19 | sum(if(x is null, 1, 0)) with_if, 20 | count(*) - count(x) with_count, 21 | sum(case when x is null then 1 else 0 end) with_case 22 | from data 23 | ``` 24 | 25 | | with_if | with_count | with_case | 26 | | ------- | ---------- | --------- | 27 | | 2 | 2 | 2 | -------------------------------------------------------------------------------- /bigquery/filtering-arrays.md: -------------------------------------------------------------------------------- 1 | # Filtering arrays 2 | 3 | Explore this snippet [here](https://count.co/n/YvwEswZfHtP?vm=e). 4 | 5 | # Description 6 | The ARRAY function in BigQuery will accept an entire subquery, which can be used to filter an array: 7 | 8 | ```sql 9 | select array( 10 | select x 11 | from unnest((select [1, 2, 3, 4])) as x 12 | where x <= 3 -- Filter predicate 13 | ) as filtered 14 | ``` -------------------------------------------------------------------------------- /bigquery/first-row-of-group.md: -------------------------------------------------------------------------------- 1 | # Find the first row of each group 2 | 3 | Explore this snippet [here](https://count.co/n/1KZI8Y4aAX5?vm=e). 4 | 5 | # Description 6 | 7 | Finding the first value in a column partitioned by some other column is simple in SQL, using the GROUP BY clause. Returning the rest of the columns corresponding to that value is a little trickier. Here's one method, which relies on the RANK window function: 8 | 9 | ```sql 10 | with and_ranking as ( 11 | select 12 | *, 13 | rank() over (partition by order by ) ranking 14 | from
15 | ) 16 | select * from and_ranking where ranking = 1 17 | ``` 18 | where 19 | - `partition` - the column(s) to partition by 20 | - `ordering` - the column(s) which determine the ordering defining 'first' 21 | - `table` - the source table 22 | 23 | The CTE `and_ranking` adds a column to the original table called `ranking`, which is the order of that row within the groups defined by `partition`. Filtering for `ranking = 1` picks out those 'first' rows. 24 | 25 | # Example 26 | 27 | Using total Spotify streams as an example data source, we can pick the rows of the table corresponding to the most-streamed song per day. Filling in the template above: 28 | - `partition` - this is just `day` 29 | - `ordering` - this is `streams desc` 30 | - `table` - this is `spotify.spotify_daily_tracks` 31 | 32 | ```sql 33 | with and_ranking as ( 34 | select 35 | *, 36 | rank() over (partition by day order by streams desc) ranking 37 | from spotify.spotify_daily_tracks 38 | ) 39 | select day, title, artist, streams from and_ranking where ranking = 1 40 | ``` -------------------------------------------------------------------------------- /bigquery/generating-arrays.md: -------------------------------------------------------------------------------- 1 | # Generating arrays 2 | 3 | Explore this snippet [here](https://count.co/n/2avxdwBFCGz?vm=e). 4 | 5 | # Description 6 | BigQuery supports the ARRAY column type - here are a few methods for generating arrays. 7 | ## Literals 8 | Arrays can contain most types, though all elements must be derived from the same supertype. 9 | 10 | ```sql 11 | select 12 | [1, 2, 3] as num_array, 13 | ['a', 'b', 'c'] as str_array, 14 | [current_timestamp(), current_timestamp()] as timestamp_array, 15 | ``` 16 | 17 | 18 | ## GENERATE_* functions 19 | Analogous to `numpy.linspace()`. 20 | 21 | ```sql 22 | select 23 | generate_array(0, 10, 2) AS even_numbers, 24 | generate_date_array(CAST('2020-01-01' AS DATE), CAST('2020-03-01' AS DATE), INTERVAL 1 MONTH) AS month_starts 25 | ``` 26 | 27 | 28 | ## ARRAY_AGG 29 | Aggregate a column into a single row element. 30 | 31 | ```sql 32 | SELECT ARRAY_AGG(num) 33 | FROM (SELECT 1 num UNION ALL SELECT 2 num UNION ALL SELECT 3 num) 34 | ``` -------------------------------------------------------------------------------- /bigquery/geographical-distance.md: -------------------------------------------------------------------------------- 1 | # Distance between longitude/latitude pairs 2 | 3 | Explore this snippet [here](https://count.co/n/BcHM84o8uYb?vm=e). 4 | 5 | # Description 6 | 7 | It is a surprisingly difficult problem to calculate an accurate distance between two points on the Earth, both because of the spherical geometry and the uneven shape of the ground. 
8 | Fortunately BigQuery has robust support for geographical primitives - the [`ST_DISTANCE`](https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_distance) function can calculate the distance in metres between two [`ST_GEOGPOINT`](https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_geogpoint)s: 9 | 10 | ```sql 11 | with points as ( 12 | -- Longitudes and latitudes are in degrees 13 | select 10 as from_long, 12 as from_lat, 15 as to_long, 17 as to_lat 14 | ) 15 | 16 | select 17 | st_distance( 18 | st_geogpoint(from_long, from_lat), 19 | st_geogpoint(to_long, to_lat) 20 | ) as dist_in_metres 21 | from points 22 | ``` 23 | 24 | | dist_in_metres | 25 | | -------------- | 26 | | 773697.017019 | -------------------------------------------------------------------------------- /bigquery/get-last-array-element.md: -------------------------------------------------------------------------------- 1 | # Get the last element of an array 2 | 3 | Explore this snippet [here](https://count.co/n/f4moFjZCYNV?vm=e). 4 | 5 | # Description 6 | There are two methods to retrieve the last element of an array - calculate the array length, or reverse the array and select the first element: 7 | 8 | ```sql 9 | select 10 | nums[ordinal(array_length(nums))] as method_1, 11 | array_reverse(nums)[ordinal(1)] as method_2, 12 | from (select [1, 2, 3, 4] as nums) 13 | ``` -------------------------------------------------------------------------------- /bigquery/heatmap.md: -------------------------------------------------------------------------------- 1 | # Heat maps 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/ee5fRPdUrIQ?vm=e). 4 | 5 | # Description 6 | 7 | Heatmaps are a great way to visualize the relationship between two variables enumerated across all possible combinations of each variable. To create one, your data must contain 8 | - 2 categorical columns (e.g. not numeric columns) 9 | - 1 numerical column 10 | 11 | ```sql 12 | SELECT 13 | AGG_FN() as metric, 14 | as cat1, 15 | as cag2, 16 | FROM 17 | 18 | GROUP BY 19 | cat1, cat2 20 | ``` 21 | where: 22 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 23 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column. Must add up to 1 for each bar. 24 | - `CAT_COLUMN1` and `CAT_COLUMN2` are the groups you want to compare. One will go on the x axis and the other on the y-axis. Make sure these are categorical or temporal columns and not numerical fields. 25 | 26 | # Usage 27 | 28 | In this example with some Spotify data, we'll see the average daily streams across different days of the week and months of the year. 
29 | 30 | ```sql 31 | -- a 32 | select sum(streams) daily_streams, day from spotify.spotify_daily_tracks group by day 33 | ``` 34 | | daily_streams | day | 35 | | ------------- | ---------- | 36 | | 234109876 | 2020-10-25 | 37 | | 243004361 | 2020-10-26 | 38 | | 248233045 | 2020-10-27 | 39 | 40 | ```sql 41 | select 42 | avg(a.daily_streams) avg_daily_streams, 43 | format_date('%m - %b',day) month, 44 | format_date('%w %A',day) day_of_week 45 | from a 46 | group by day_of_week, month 47 | ``` 48 | heatmap 49 | -------------------------------------------------------------------------------- /bigquery/histogram-bins.md: -------------------------------------------------------------------------------- 1 | # Histogram Bins 2 | Test out this snippet with some demo data [here](https://count.co/n/J7s46HSTsrm?vm=e). 3 | 4 | # Description 5 | Creating histograms using SQL can be tricky. This snippet will let you quickly create the bins needed to create a histogram: 6 | 7 | ```sql 8 | SELECT 9 | COUNT(1) / (SELECT COUNT(1) FROM '' ) percentage_of_results 10 | FLOOR(/ ) * as bin 11 | FROM 12 | '' 13 | GROUP BY 14 | bin 15 | ``` 16 | where: 17 | - `` is your column you want to turn into a histogram 18 | - `` is the width of each bin. You can adjust this to get the desired level of detail for the histogram. 19 | 20 | 21 | # Usage 22 | To use, adjust the bin-size until you can see the 'shape' of the data. The following example finds the percentage of whiskies with by 5-point review groupings. The reviews are on a 0-100 scale: 23 | 24 | ```sql 25 | SELECT 26 | count(*) / (select count(1) from whisky.scotch_reviews) percentage_of_results, 27 | floor(scotch_reviews.review_point / 5) * 5 bin 28 | from whisky.scotch_reviews 29 | group by bin 30 | ``` 31 | | percentage_of_results | bin | 32 | | --- | ----------- | 33 | | 0.005 | 70 | 34 | | 0.03 | 75 | 35 | | ... | ... | 36 | | 0.016 | 96 | 37 | -------------------------------------------------------------------------------- /bigquery/horizontal-bar.md: -------------------------------------------------------------------------------- 1 | # Horizontal bar chart 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/w4I0DgqsPT2?vm=e). 4 | 5 | # Description 6 | 7 | Horizontal Bar charts are ideal for comparing a particular metric across a group of values. They have bars along the x-axis and group names along the y-axis. 8 | You just need a simple GROUP BY to get all the elements you need for a horizontal bar chart: 9 | 10 | ```sql 11 | SELECT 12 | AGG_FN() as metric, 13 | as group 14 | FROM 15 | 16 | GROUP BY 17 | group 18 | ``` 19 | 20 | where: 21 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 22 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column. 23 | - `GROUP_COLUMN` is the group you want to show on the y-axis. Make sure this is a categorical column (not a number) 24 | 25 | # Usage 26 | 27 | In this example with some Spotify data, we'll compare the total streams by artist: 28 | 29 | ```sql 30 | select 31 | sum(streams) streams, 32 | artist 33 | from spotify.spotify_daily_tracks 34 | group by artist 35 | ``` 36 | horizontal-bar 37 | -------------------------------------------------------------------------------- /bigquery/json-strings.md: -------------------------------------------------------------------------------- 1 | # Extracting values from JSON strings 2 | 3 | Explore this snippet [here](https://count.co/n/P88ejfuK7iv?vm=e). 
4 | 5 | # Description 6 | BigQuery has a rich set of functionality for working with JSON - see any function with [JSON in the name](https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions). 7 | Most of the BigQuery JSON functions require the use of a [JSONPath](https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#JSONPath_format) parameter, which defines how to access the part of the JSON object that is required. 8 | ## JSON_QUERY 9 | This function simply returns the JSON-string representation of the part of the object that is requested: 10 | 11 | ```sql 12 | with json as (select '''{ 13 | "a": 1, 14 | "b": "bee", 15 | "c": [ 16 | 4, 17 | 5, 18 | { "d": [6, 7] } 19 | ] 20 | }''' as text) 21 | 22 | select 23 | -- Note - all of these result columns are strings, and are formatted as JSON 24 | json_query(text, '$') as root, 25 | json_query(text, '$.a') as a, 26 | json_query(text, '$.b') as b, 27 | json_query(text, '$.c') as c, 28 | json_query(text, '$.c[0]') as c_first, 29 | json_query(text, '$.c[2]') as c_third, 30 | json_query(text, '$.c[2].d') as d, 31 | from json 32 | ``` 33 | 34 | 35 | ## JSON_VALUE 36 | This function is similar to `JSON_QUERY`, but tries to convert some values to native data types - namely strings, integers and booleans. If the accessed part of the JSON object is not a simple data type, `JSON_VALUE` returns null: 37 | 38 | ```sql 39 | with json as (select '''{ 40 | "a": 1, 41 | "b": "bee", 42 | "c": [ 43 | 4, 44 | 5, 45 | { "d": [6, 7] } 46 | ] 47 | }''' as text) 48 | 49 | select 50 | json_value(text, '$') as root, -- = null 51 | json_value(text, '$.a') as a, 52 | json_value(text, '$.b') as b, 53 | json_value(text, '$.c') as c, -- = null 54 | json_value(text, '$.c[0]') as c_first, 55 | json_value(text, '$.c[2]') as c_third, -- = null 56 | json_value(text, '$.c[2].d') as d, -- = null 57 | from json 58 | ``` -------------------------------------------------------------------------------- /bigquery/last-day.md: -------------------------------------------------------------------------------- 1 | # Last day of the month 2 | 3 | ## Description 4 | 5 | In some SQL dialects there is a LAST_DAY function, but there isn’t one in BigQuery. 6 | We can achieve the same by taking the day before the first day of the following month. 7 | 8 | 1. `DATE_ADD(, INTERVAL 1 MONTH)` sets the reference date to the following month. 9 | 2. `DATE_TRUNC(, MONTH)` gets the first day of that month. 10 | 3. `DATE_SUB(, INTERVAL 1 DAY)` gets the day before. `-1` can be used as well. 11 | 12 | ## Example: Last day of the current month (local time) 13 | 14 | ``` 15 | SELECT 16 | DATE_TRUNC(DATE_ADD(CURRENT_DATE('Europe/Madrid'), INTERVAL 1 MONTH), MONTH) - 1 17 | ``` 18 | If no time zone is specified as an argument to CURRENT_DATE, the default time zone, UTC, is used. 19 | -------------------------------------------------------------------------------- /bigquery/latest-row.md: -------------------------------------------------------------------------------- 1 | 2 | # Latest Row (deduplicate) 3 | View interactive snippet [here](https://count.co/n/zFoUJuIYOrC?vm=e). 4 | 5 | ## Query to return only the latest row based on a column or columns 6 | 7 | With BigQuery it is common to have duplicates in your tables especially datalake ones so we need to remove duplicates and often want the latest version of each row. 
We can make use of the ROW_NUMBER window/analytical function to get a count of each record with the same key and then using QUALIFY to only return the 1st row order by the latest first. 8 | 9 | Here we have some rows with duplicates. I've added the ROW_NUMBER so we can see the values it returns. 10 | 11 | ```sql 12 | WITH 13 | sample_data AS ( 14 | SELECT 1 AS product_id,'widgit' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date 15 | UNION ALL 16 | SELECT 2 AS product_id,'borble' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date 17 | UNION ALL 18 | SELECT 3 AS product_id,'connector' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date 19 | UNION ALL 20 | SELECT 1 AS product_id,'widgit' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-02-01' AS modified_date 21 | UNION ALL 22 | SELECT 2 AS product_id,'borble' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-03-01' AS modified_date ) 23 | SELECT 24 | *, ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY modified_date DESC, created_date DESC) as row_no 25 | FROM 26 | sample_data 27 | ``` 28 | 29 | We can use QUALIFY to only return a single row. Also if we use a named subquery/cte we can work with the deduped as if it was just a table. 30 | The product_id can be replaced with a list of columns to specify the unique rows. 31 | ```sql 32 | WITH 33 | sample_data AS ( 34 | SELECT 1 AS product_id,'widgit' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date 35 | UNION ALL 36 | SELECT 2 AS product_id,'borble' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date 37 | UNION ALL 38 | SELECT 3 AS product_id,'connector' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date 39 | UNION ALL 40 | SELECT 1 AS product_id,'widgit' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-02-01' AS modified_date 41 | UNION ALL 42 | SELECT 2 AS product_id,'borble' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-03-01' AS modified_date 43 | ), 44 | data_deduped as ( 45 | SELECT 46 | * 47 | FROM 48 | sample_data 49 | WHERE 50 | 1=1 QUALIFY ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY modified_date DESC, created_date DESC)=1 51 | ) 52 | SELECT 53 | * 54 | FROM 55 | data_deduped 56 | ``` 57 | 58 | 59 | https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts 60 | https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#with_clause 61 | -------------------------------------------------------------------------------- /bigquery/least-array.md: -------------------------------------------------------------------------------- 1 | # Greatest/Least value in an array 2 | Interactive version [here](https://count.co/n/tgXkwcSxl2W?b=L2LtcTSMIQS). 3 | 4 | # Description 5 | GREATEST and LEAST are helpful way to find the greatest value in a row but to do that for an array is trickier. 6 | This snippet lets you return the max/min value from an array while maintaining your data shape. 7 | **NOTE: This will exclude NULLS.** 8 | 9 | ```sql 10 | SELECT 11 | ..., 12 | MIN(value) 13 | FROM 14 |
15 | CROSS JOIN UNNEST() as value, 16 | group by ... 17 | ``` 18 | where: 19 | - `...` is a list of all the other rows you wish to select, 20 | 21 | - `` is the column that contains the array from which you want to select the greatest/least value 22 | 23 | # Example: 24 | ```sql 25 | with demo_data as ( 26 | select 1 rownumber, [0,1,NULL,3] arr 27 | union all select 2 rownumber, [1,2,3,4] arr 28 | ) 29 | select 30 | rownumber, 31 | min(value) least, 32 | max(value) greatest 33 | from demo_data 34 | cross join unnest(arr) as value 35 | group by rownumber 36 | ``` 37 | |rownumber|least | greatest | 38 | |---- |---- |----| 39 | |1 | 0 |3| 40 | |2 | 1 |4| 41 | -------------------------------------------------------------------------------- /bigquery/least-udf.md: -------------------------------------------------------------------------------- 1 | # UDF to find the least non-null value in an array 2 | View interactive version [here](https://count.co/n/5KMhS4l4xgR?vm=e). 3 | 4 | # Description 5 | LEAST and GREATEST will include NULLS in BigQuery, so to find the LEAST excluding nulls across a row of values can be done using this UDF: 6 | 7 | ```sql 8 | CREATE or replace FUNCTION .myLeast(x ARRAY) AS 9 | ((SELECT MIN(y) FROM UNNEST(x) AS y)); 10 | ``` 11 | Credit to this [Stackoverflow post](https://stackoverflow.com/questions/43796213/correctly-migrate-postgres-least-behavior-to-bigquery). 12 | 13 | # Example: 14 | ```sql 15 | with demo_data as ( 16 | select 1 rownumber, 0 a, null b, -1 c, 1 d union all 17 | select 2 rownumber, 5 a, 0 b, null c, 8 d 18 | ) 19 | select rownumber, spotify.myLeast([a,b,c,d]) least_non_null, least(a,b,c,d) least_old_way 20 | from demo_data 21 | ``` 22 | |rownumber|least_not_null|least_old_way| 23 | |--- | ---- | ----- | 24 | | 1 | -1 | NULL | 25 | |2 | 0 | NULL | 26 | -------------------------------------------------------------------------------- /bigquery/localtz-to-utc.md: -------------------------------------------------------------------------------- 1 | # Local Timezone to UTC 2 | Explore this snippet with some demo data [here](https://count.co/n/448Ip23n0Ef?vm=e). 3 | 4 | 5 | # Description 6 | [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) (Coordinated Universal Time) is a universal standard time and can be a great way to compare data across many timezones. 7 | Often we're given time in a local timezone and need to convert it to UTC. To do that in BigQuery we can just use the [TIMESTAMP](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#timestamp) function: 8 | 9 | ```sql 10 | SELECT 11 | TIMESTAMP(, ) as utc_timestamp 12 | ``` 13 | where: 14 | - The local datetime can be in the form of a string expression, datetime object or date object. 15 | - The timezone should be one of BigQuery's [time zones](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#timezone_definitions). 
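For instance, a minimal stand-alone check (the datetime literal and time zone below are illustrative values only, not from a real table):

```sql
SELECT
  -- 12:00 Pacific Daylight Time should come back as 19:00 UTC
  TIMESTAMP('2021-06-11 12:00:00', 'America/Los_Angeles') AS utc_timestamp
```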
16 | # Usage 17 | We're given Euro match data in Pacific time and want to convert it to UTC: 18 | 19 | ```sql 20 | -- match_data 21 | select '2021-06-11 12:00:00' start_time, 'Turkey' Home, 'Italy' Away union all 22 | select '2021-06-12 06:00:00' start_time, 'Wales' Home, 'Switzerland' Away union all 23 | select'2021-06-12 09:00:00' start_time, 'Denmark' Home, 'Finland' Away union all 24 | ``` 25 | |start_date| Home | Away | 26 | |-------| ----- | ------ | 27 | |2021-06-11 12:00:00| Turkey | Italy | 28 | |2021-06-12 06:00:00| Wales | Switzerland | 29 | |2021-06-12 09:00:00| Denmark | Finland | 30 | 31 | ```sql 32 | -- tidied 33 | Select 34 | timestamp(match_data.start_time,'America/Los_Angeles') utc_start, 35 | Home, 36 | Away 37 | from match_data 38 | ``` 39 | |utc_start| Home | Away | 40 | |-------| ----- | ------ | 41 | |2021-06-11T19:00:00.000Z| Turkey | Italy | 42 | |2021-06-12T13:00:00.000Z| Wales | Switzerland | 43 | |2021-06-12T16:00:00.000Z| Denmark | Finland | 44 | -------------------------------------------------------------------------------- /bigquery/median-udf.md: -------------------------------------------------------------------------------- 1 | # MEDIAN UDF 2 | Explore the function with some demo data [here](https://count.co/n/RHmVhHzZpIp?vm=e). 3 | 4 | # Description 5 | BigQuery does not have an explicit `MEDIAN` function. The following snippet will show you how to create a custom function for `MEDIAN` in your BigQuery instance. 6 | 7 | 8 | # UDF (User Defined Function) 9 | Use the following snippet to create your median function in the BigQuery console: 10 | 11 | ```sql 12 | CREATE OR REPLACE FUNCTION .median(arr ANY TYPE) AS (( 13 | SELECT IF ( 14 | MOD(ARRAY_LENGTH(arr), 2) = 0, 15 | (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2, 16 | arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))] 17 | ) 18 | FROM 19 | (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x) 20 | )); 21 | ``` 22 | 23 | 24 | # Usage 25 | To call the `MEDIAN` function defined above, use the following: 26 | 27 | ```sql 28 | SELECT .median(ARRAY_AGG()) as median 29 | FROM 30 | ``` 31 | ### Example 32 | Find the median number of followers for an artist on Spotify: 33 | 34 | ```sql 35 | SELECT 36 | spotify.median(array_agg(followers)) as median_followers 37 | FROM 38 | spotify.spotify_artists 39 | ``` 40 | | median_followers | 41 | | --- | 42 | | 1358146.5 | 43 | -------------------------------------------------------------------------------- /bigquery/median.md: -------------------------------------------------------------------------------- 1 | # MEDIAN 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/7XpUfYUH9p6?vm=e). 4 | 5 | # Description 6 | BigQuery does not have an explicit `MEDIAN` function. The following snippet will show you how to calculate the median of any column. 
7 | 8 | 9 | # Snippet ✂️ 10 | Use the following snippet to calculate the median of a column in the BigQuery console: 11 | 12 | ```sql 13 | PERCENTILE_CONT(<column>, 0.5) OVER() as median 14 | ``` 15 | 16 | 17 | # Usage 18 | Use the median calculation in a query like: 19 | 20 | ```sql 21 | SELECT 22 | PERCENTILE_CONT(<column>, 0.5) OVER() as median 23 | FROM 24 | <table> 25 | ``` 26 | | median | 27 | | ----------- | 28 | | 1358146.5 | 29 | -------------------------------------------------------------------------------- /bigquery/moving-average.md: -------------------------------------------------------------------------------- 1 | # Moving average 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/X2EmYjUGwwO?vm=e). 4 | 5 | # Description 6 | Moving averages are relatively simple to calculate in BigQuery, using the `avg` window function. The template for the query is 7 | 8 | ```sql 9 | select 10 | avg(<value>) over (partition by <fields> order by <ordering> asc rows between <x> preceding and current row) mov_av, 11 | <fields>, 12 | <value> 13 | from <table>
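-- note: the frame '<x> preceding and current row' spans <x> + 1 rows in total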
14 | ``` 15 | where 16 | - `value` - this is the numeric quantity to calculate a moving average for 17 | - `fields` - these are zero or more columns to split the moving averages by 18 | - `ordering` - this is the column which determines the order of the moving average, most commonly temporal 19 | - `x` - how many rows to include in the moving average 20 | - `table` - where to pull these columns from 21 | # Example 22 | Using total Spotify streams as an example data source, let's identify: 23 | - `value` - this is `streams` 24 | - `fields` - this is just `artist` 25 | - `ordering` - this is the `day` column 26 | - `x` - choose 7 for a weekly average 27 | - `table` - this is called `raw` 28 | 29 | then the query is: 30 | 31 | ```sql 32 | select 33 | avg(streams) over (partition by artist order by day asc rows between 7 preceding and current row) mov_av, 34 | artist, 35 | day, streams 36 | from raw 37 | ``` 38 | | moving_avg | artist | day | streams | 39 | | --- | ----------- |------ |-------- | 40 | | 535752.5 | 2 Chainz | 2017-06-03 | 562412 | 41 | | 517285 | 2 Chainz | 2017-06-04 | 480350 | 42 | | 522389.75 | 2 Chainz | 2017-06-05 | 537704 | 43 | | 529767.6 | 2 Chainz | 2017-06-06 | 559279 | 44 | | 535539.5 | 2 Chainz | 2017-06-07 | 564399 | 45 | | 535155.1428571428 | 2 Chainz | 2017-06-08 | 532849 | 46 | | 539254 | 2 Chainz | 2017-06-09 | 567946 | 47 | | 541911.5 | 2 Chainz | 2017-06-10 | 530353 | 48 | -------------------------------------------------------------------------------- /bigquery/outliers-mad.md: -------------------------------------------------------------------------------- 1 | # Identify Outliers 2 | View interactive notebook [here](https://count.co/n/lgTo572j3We?vm=e). 3 | 4 | # Description 5 | Outlier detection is a key step to any analysis. It allows you to quickly spot errors in data collection, or which points you may need to remove before you do any statistical modeling. 6 | There are a number of ways to find outliers, this snippet focuses on the MAD method. 7 | # The Data 8 | For this example, we'll be looking at some data that shows the total number of daily streams for the Top 200 Tracks on Spotify every day since Jan 1 2017: 9 | 10 | Data: 11 | |streams | day | 12 | | ------ | --- | 13 | |234109876 | 25/10/2020 | 14 | |... | .... | 15 | 16 | 17 | # Method 1: MAD (Median Absolute Deviation) 18 | [MAD](https://en.wikipedia.org/wiki/Median_absolute_deviation) describes how far away a point is from the median. This method can be preferable to other methods, like the z-score method because you don't have to assume the data is normally distributed. 19 | 20 | ![image](https://user-images.githubusercontent.com/42146708/127674882-abeb638b-5b58-4274-9112-34b52a84c1fd.png) 21 | 22 | > "A critical value for the Median Absolute Deviation of 5 has been suggested by workers in Nonparametric statistics, as a Median Absolute Deviation value of just under 1.5 is equivalent to one standard deviation, so MAD = 5 is approximately equal to 3 standard deviations." 23 | > - [Using SQL to detect outliers](https://towardsdatascience.com/using-sql-to-detect-outliers-aff676bb2c1a) by Robert de Graaf 24 | 25 | ```sql 26 | SELECT 27 | diff_to_med.*, 28 | percentile_cont(diff_to_med.difference, 0.5) over() median_of_diff, 29 | abs(difference/percentile_cont(difference, 0.5) over()) mad, 30 | case 31 | when abs(difference/percentile_cont(difference, 0.5) over()) > then 'outlier' 32 | else 'not outlier' 33 | end label 34 | FROM ( 35 | select 36 | 37 | , 38 | med.median , 39 | abs(data. 
<column> - med.median) difference 40 | from 41 | <table>
 as data 42 | cross join 43 | ( 44 | select distinct 45 | percentile_cont(<column>,.5) over() median 46 | from <table>
as data) as med 47 | ) as diff_to_med 48 | order by mad desc 49 | ``` 50 | where: 51 | - ``: Columns selected other than the outlier column 52 | - ``: Column name for outlier 53 | - ``: The outlier threshold. 5 = 3 standard deviations 54 | 55 | #### MAD Method Outliers 56 | ```sql 57 | select diff_to_med.*, 58 | percentile_cont(diff_to_med.difference, 0.5) over() median_of_diff, 59 | abs(difference/percentile_cont(difference, 0.5) over()) mad, 60 | case 61 | when abs(difference/percentile_cont(difference, 0.5) over()) > 5 then 'outlier' 62 | else 'not outlier' 63 | end label 64 | from ( 65 | select data.day, data.streams, med.median ,abs(data.streams - med.median) difference 66 | from data 67 | cross join (select distinct percentile_cont(streams,.5) over() median from data) as med 68 | ) as diff_to_med 69 | order by mad desc 70 | ``` 71 | | day | streams | median | difference | median_to_diff | mad | label | 72 | | --- | ------- | ------ | ---------- | -------------- | ---- | ----- | 73 | | 24/12/2020 | 555245855 | 235341012.5 | 319904842.5 | 15798261 |20.0249 | outlier | 74 | | 08/01/2017 | 162831491 | 235341012.5 | 72509521.5 | 15798261 |4.59 | not outlier | 75 | -------------------------------------------------------------------------------- /bigquery/outliers-stdev.md: -------------------------------------------------------------------------------- 1 | # Identify Outliers 2 | View interactive notebook [here](https://count.co/n/lgTo572j3We?vm=e). 3 | 4 | # Description 5 | Outlier detection is a key step to any analysis. It allows you to quickly spot errors in data collection, or which points you may need to remove before you do any statistical modeling. 6 | There are a number of ways to find outliers, this snippet focuses on the Standard Deviation method. 7 | # The Data 8 | For this example, we'll be looking at some data that shows the total number of daily streams for the Top 200 Tracks on Spotify every day since Jan 1 2017: 9 | 10 | Data: 11 | |streams | day | 12 | | ------ | --- | 13 | |234109876 | 25/10/2020 | 14 | |... | .... | 15 | 16 | 17 | # Standard Deviation 18 | The most common way outliers are detected is by finding points outside a certain number of sigmas (or standard deviations) from the mean. 19 | If the data is normally distributed, then the following rules apply: 20 | 21 | ![image](https://user-images.githubusercontent.com/42146708/127679527-731d73ba-2294-4e0e-b7d3-30386d3a1d6f.png) 22 | 23 | > Image from anomoly.io 24 | So, identifying points 2 or 3 sigmas away from the mean should lead us to our outliers. 25 | 26 | ## Example 27 | 28 | ```sql 29 | SELECT 30 | data.*, 31 | case 32 | when between mean - 3 * stdev and mean + 3 * stdev 33 | then 'not outlier' 34 | else 'outlier' 35 | end label, 36 | mean - * stdev lower_bound, 37 | mean + * stdev upper_bound 38 | FROM data 39 | CROSS JOIN 40 | ( 41 | SELECT 42 | avg() mean, 43 | stddev_samp(data.streams 44 | FROM
AS data ) mean_sd 45 | ORDER BY label DESC 46 | ``` 47 | where: 48 | - ``: Columns selected other than the outlier column 49 | - ``: Column name for outlier 50 | - ``: Number of standard deviations to identify the outliers (suggested: 3) 51 | 52 | #### Standard Deviation Method Outliers 53 | ```sql 54 | select 55 | data.*, 56 | case 57 | when streams between mean - 3 * stdev and mean + 3 * stdev then 'not outlier' 58 | else 'outlier' 59 | end label, 60 | mean - 3 * stdev lower_bound, 61 | mean + 3 * stdev upper_bound 62 | from data 63 | cross join (select 64 | avg(data.streams) mean, 65 | stddev_samp(data.streams) stdev 66 | from data ) mean_sd 67 | order by label desc 68 | ``` 69 | | day | streams | label | lower_bound | upper_bound | 70 | | --- | ------- | ------ | ---------- | -------------- | 71 | | 29/06/2018 | 350167113 | outlier |146828000.339 | 326135386.804 | 72 | | 08/01/2017 | 162831491 | not outlier | 146828000.339 | 326135386.804 | 73 | -------------------------------------------------------------------------------- /bigquery/outliers-z.md: -------------------------------------------------------------------------------- 1 | # Identify Outliers 2 | View interactive notebook [here](https://count.co/n/lgTo572j3We?vm=e). 3 | 4 | # Description 5 | Outlier detection is a key step to any analysis. It allows you to quickly spot errors in data collection, or which points you may need to remove before you do any statistical modeling. 6 | There are a number of ways to find outliers, this snippet focuses on the Z score method. 7 | # The Data 8 | For this example, we'll be looking at some data that shows the total number of daily streams for the Top 200 Tracks on Spotify every day since Jan 1 2017: 9 | 10 | Data: 11 | |streams | day | 12 | | ------ | --- | 13 | |234109876 | 25/10/2020 | 14 | |... | .... | 15 | 16 | 17 | # Z-Score 18 | Similar to Method 2, the z-score method tells us the probability of a point occurring in a normal distribution. 19 | 20 | ![image](https://user-images.githubusercontent.com/42146708/127680954-ed19dc55-3df8-43a5-9db3-beabd7f37fd0.png) 21 | 22 | > Image from simplypsychology.org 23 | To use z-scores to identify outliers, we can set an alpha (significance level) that will determine how 'likely' each point will be in order for it to be classified as an outlier. 24 | This method also assumes the data is normally distributed. 25 | 26 | ```sql 27 | SELECT 28 | find_z.*, 29 | case 30 | when abs(z_score) >=abs() then 'outlier' else 'not outlier' 31 | end label 32 | FROM ( 33 | SELECT 34 | , 35 | , 36 | ( - avg() over ()) 37 | / (stddev() over ()) z_score 38 | FROM
AS data) find_z 39 | ORDER BY abs(z_score) DESC 40 | ``` 41 | where: 42 | - ``: Columns selected other than the outlier column 43 | - ``: Column name for outlier 44 | - ``: z-score threshold to identify outliers (suggested: 1.96) 45 | 46 | 47 | ## Example 48 | 49 | 50 | ```sql 51 | select find_z.*, 52 | case when abs(z_score) >=abs(1.96) then 'outlier' else 'not outlier' end label 53 | from (select 54 | day, 55 | streams, 56 | (streams - avg(streams) over ()) 57 | / (stddev(streams) over ()) z_score 58 | from data) find_z 59 | order by abs(z_score) desc 60 | ``` 61 | | day | streams | z_score | label | 62 | | --- | ------- | ------ | ---------- | 63 | | 07/01/2017 | 177567024 | -1.971 |outlier | 64 | | 02/12/2020 | 294978502 | 1.957 | not outlier | 65 | -------------------------------------------------------------------------------- /bigquery/pivot.md: -------------------------------------------------------------------------------- 1 | # Pivot in BigQuery 2 | View interactive notebook [here](https://count.co/n/MqDQcqT18qS?vm=e) 3 | 4 | # Description 5 | Pivoting data is a helpful process of any analysis. Recently BigQuery announced a new [PIVOT](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#pivot_operator) operator that will make this easier than ever. 6 | 7 | ```sql 8 | SELECT 9 | 10 | FROM 11 |
<table_name> 12 | PIVOT(<aggregate_function>(<value_column>) FOR <pivot_column> IN (<value1> [as alias], <value2> [as alias], ... <valueN> [as alias])) [as alias] 13 | ``` 14 | where: 15 | - anything in brackets is optional 16 | - `<column_list>` is the list of columns returned in the query 17 | - `<table_name>`
 is the table you are pivoting 18 | - may not produce a value table or be a subquery using `SELECT AS STRUCT` 19 | - `<aggregate_function>` is an aggregate function that aggregates all input rows across the values of the `<pivot_column>` 20 | - may reference columns in `<table_name>`
` and correlated columns, but none defined by the `PIVOT` clause itself 21 | - must have only one argument 22 | - Except for `COUNT`, you can only use aggregate functions that ignore `NULL` inputs 23 | - if using `COUNT`, you can use `*` as an argument 24 | - `` is the column that will be aggregated across the values of `` 25 | - `` is the column that will have its values pivoted 26 | - values must be valid column names (so no spaces or special characters) 27 | - use the aliases to ensure they are the right format 28 | # Example 29 | In this case we'll see how many matches some popular tennis players have won in the Grand Slams. 30 | 31 | ```sql 32 | --tennis_data 33 | select 34 | winner_name, 35 | count(1) wins, 36 | initcap(tourney_name) tournament 37 | from 38 | tennis.matches_partitioned 39 | where winner_name in ('Novak Djokovic','Roger Federer','Rafael Nadal','Serena Williams') 40 | and tourney_level = 'G' 41 | group by winner_name, tournament 42 | ``` 43 | | winner_name | wins | tournament| 44 | | ----------- | ---- | ---------| 45 | | Roger Federer | 91 | Us Open | 46 | | Roger Federer | 103 | Australian Open | 47 | | Roger Federer | 102 | Wimbledon | 48 | | Roger Federer | 70 | Roland Garros | 49 | 50 | ```sql 51 | select * 52 | from 53 | tennis_data 54 | pivot(sum(wins) for tournament in ('Wimbledon','Us Open' as US_Open,'Roland Garros' as French_Open,'Australian Open' as Aussie_Open )) 55 | ``` 56 | 57 | | winner_name | Wimbledon | US_Open | French_Open | Aussie_Open | 58 | | ----------- | ---- | ---------| ---------| ---------| 59 | | Roger Federer | 102 | 91 | 70 | 103 | 60 | 61 | # Other links 62 | - https://towardsdatascience.com/pivot-in-bigquery-4eefde28b3be 63 | -------------------------------------------------------------------------------- /bigquery/rand_between.md: -------------------------------------------------------------------------------- 1 | # Rand Between in BigQuery 2 | Interactive version [here](https://count.co/n/lUHSRbAZqLK?vm=e). 3 | 4 | 5 | # Description 6 | Generating random numbers is a good way to test out our code or to randomize results. 7 | In BigQuery there is a [`RAND`](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#rand) function which will generate a number between 0 and 1. 8 | But sometimes it's helpful to get a random integer within a specified range. This snippet will do that: 9 | 10 | ```sql 11 | SELECT 12 | ROUND( + RAND() * ( - )) rand_between 13 | FROM 14 |
15 | ``` 16 | where: 17 | - ``, `` is the range between which you want to generate a random number 18 | # Example: 19 | 20 | ```sql 21 | with random_number as (select rand() as random) 22 | select 23 | random, -- random number between 0 and 1 24 | round(random*100) rand_1_100, --random integer between 1 and 100 25 | round(1 + random * (10 - 1)) rand_btw_1_10 -- random integer between 1 and 10 26 | from random_number 27 | ``` 28 | |random | rand_1_100 | rand_btw_1_10 | 29 | | ----- | ---------- | ------------- | 30 | |0.597 | 60 | 6 | 31 | -------------------------------------------------------------------------------- /bigquery/random-sampling.md: -------------------------------------------------------------------------------- 1 | # Random sampling 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/cSN4gyht6Vd?vm=e). 4 | 5 | # Description 6 | Taking a random sample of a table is often useful when the table is very large, and you are still working in an exploratory or investigative mode. A query against a small fraction of a table runs much more quickly, and consumes fewer resources. 7 | Here are some example queries to select a random or pseudo-random subset of rows. 8 | 9 | 10 | # Using rand() 11 | The [RAND](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#rand) function returns a value in [0, 1) (i.e. including zero but not 1). You can use it to sample tables, e.g.: 12 | 13 | #### Return 1% of rows 14 | 15 | ```sql 16 | -- one_percent 17 | select * from spotify.spotify_daily_tracks where rand() < 0.01 18 | ``` 19 | | day | track_id | rank | title | artist | streams | 20 | | -------- | -------- |----- | ----- | ------ | ------- | 21 | | 2018-01-09 | 6mrKP2jyIQmM0rw6fQryjr | 9 | Let You Down | NF | 2610265 | 22 | |... | ... | ... | ... | ... | ... | 23 | 24 | #### Return 10 rows 25 | 26 | ```sql 27 | -- ten_rows 28 | select * from spotify.spotify_daily_tracks order by rand() limit 10 29 | ``` 30 | | day | track_id | rank | title | artist | streams | 31 | | -------- | -------- |----- | ----- | ------ | ------- | 32 | | 2018-01-09 | 6mrKP2jyIQmM0rw6fQryjr | 9 | Let You Down | NF | 2610265 | 33 | |... | ... | ... | ... | ... | ... | 34 | (only 10 rows returned) 35 | 36 | #### Return approximately 10 rows (discouraged) 37 | 38 | ```sql 39 | -- ten_rows_approx 40 | select 41 | * 42 | from spotify.spotify_daily_tracks 43 | where rand() < 10 / (select count(*) from spotify.spotify_daily_tracks) 44 | ``` 45 | | day | track_id | rank | title | artist | streams | 46 | | -------- | -------- |----- | ----- | ------ | ------- | 47 | | 2018-01-09 | 6mrKP2jyIQmM0rw6fQryjr | 9 | Let You Down | NF | 2610265 | 48 | |... | ... | ... | ... | ... | ... | 49 | (only 10 rows returned) 50 | 51 | # Using hashing 52 | A hash is a deterministic mapping from one value to another, and so is not random, but can appear random 'enough' to produce a convincing sample. 
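One related pattern worth noting (a sketch, not taken from the queries below): filtering on a hash modulus keeps a deterministic fraction of rows, so repeated runs return the same sample. It assumes hashing the `track_id` column, but any stable unique string column works:

```sql
-- Keep roughly 10% of rows, deterministically
select *
from spotify.spotify_daily_tracks
where mod(abs(farm_fingerprint(track_id)), 100) < 10
```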
Use a [supported hash function](https://cloud.google.com/bigquery/docs/reference/standard-sql/hash_functions) in BigQuery to produce a pseudo-random ordering of your table: 53 | 54 | #### Farm fingerprint 55 | 56 | ```sql 57 | -- farm_fingerprint 58 | select * from spotify.spotify_artists order by farm_fingerprint(spotify_artists.artist_id) limit 10 59 | ``` 60 | | artist_id | name | popularity | followers | updated_at | url | 61 | | -------- | -------- |----- | ----- | ------ | ------- | 62 | | 5wPoxI5si3eJsYYwyXV4Wi | N.E.R.D | 64 | 647718 | 2021-03-14 | https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi | 63 | |... | ... | ... | ... | ... | ... | 64 | (only 10 rows returned) 65 | 66 | #### MD5 67 | 68 | ```sql 69 | -- md5 70 | select * from spotify.spotify_artists order by md5(spotify_artists.artist_id) limit 10 71 | ``` 72 | | artist_id | name | popularity | followers | updated_at | url | 73 | | -------- | -------- |----- | ----- | ------ | ------- | 74 | | 5wPoxI5si3eJsYYwyXV4Wi | N.E.R.D | 64 | 647718 | 2021-03-14 | https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi | 75 | |... | ... | ... | ... | ... | ... | 76 | (only 10 rows returned) 77 | 78 | #### SHA1 79 | 80 | ```sql 81 | -- SHA1 82 | select * from spotify.spotify_artists order by sha1(spotify_artists.artist_id) limit 10 83 | ``` 84 | | artist_id | name | popularity | followers | updated_at | url | 85 | | -------- | -------- |----- | ----- | ------ | ------- | 86 | | 5wPoxI5si3eJsYYwyXV4Wi | N.E.R.D | 64 | 647718 | 2021-03-14 | https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi | 87 | |... | ... | ... | ... | ... | ... | 88 | (only 10 rows returned) 89 | 90 | #### SHA256 91 | 92 | ```sql 93 | -- SHA256 94 | select * from spotify.spotify_artists order by sha256(spotify_artists.artist_id) limit 10 95 | ``` 96 | | artist_id | name | popularity | followers | updated_at | url | 97 | | -------- | -------- |----- | ----- | ------ | ------- | 98 | | 5wPoxI5si3eJsYYwyXV4Wi | N.E.R.D | 64 | 647718 | 2021-03-14 | https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi | 99 | |... | ... | ... | ... | ... | ... | 100 | (only 10 rows returned) 101 | #### SHA512 102 | 103 | ```sql 104 | -- SHA512 105 | select * from spotify.spotify_artists order by sha512(spotify_artists.artist_id) limit 10 106 | ``` 107 | | artist_id | name | popularity | followers | updated_at | url | 108 | | -------- | -------- |----- | ----- | ------ | ------- | 109 | | 5wPoxI5si3eJsYYwyXV4Wi | N.E.R.D | 64 | 647718 | 2021-03-14 | https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi | 110 | |... | ... | ... | ... | ... | ... | 111 | (only 10 rows returned) 112 | 113 | # Using TABLESAMPLE 114 | The [TABLESAMPLE](https://cloud.google.com/bigquery/docs/table-sampling) clause is currently in preview status, but it has a major benefit - it doesn't require a full table scan, and therefore can be much cheaper and quicker than the methods above. The downside is that the percentage value must be a literal, so this query is less flexible than the ones above. 115 | 116 | #### Tablesample 117 | 118 | ```sql 119 | -- tablesample 120 | select * from spotify.spotify_daily_tracks tablesample system (10 percent) 121 | ``` 122 | | artist_id | name | popularity | followers | updated_at | url | 123 | | -------- | -------- |----- | ----- | ------ | ------- | 124 | | 5wPoxI5si3eJsYYwyXV4Wi | N.E.R.D | 64 | 647718 | 2021-03-14 | https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi | 125 | |... | ... | ... | ... | ... | ... 
| 126 | -------------------------------------------------------------------------------- /bigquery/rank.md: -------------------------------------------------------------------------------- 1 | # Rank 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/eVT4mi7X88o?vm=e). 4 | 5 | # Description 6 | A common requirement is the need to rank a particular row in a group by some quantity - for example, 'rank stores by total sales in each region'. Fortunately BigQuery supports the `rank` window function to help with this. 7 | The query template is 8 | 9 | ```sql 10 | select 11 | rank() over (partition by <fields> order by <rank_by> desc) rank 12 | from <table>
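-- note: rank() gives tied rows the same rank and skips the next rank(s); use dense_rank() or row_number() for different tie handling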
13 | ``` 14 | where 15 | - `fields` - zero or more columns to split the results by - a rank will be calculated for each unique combination of values in these columns 16 | - `rank_by` - a column to decide the ranking 17 | - `table` - where to pull these columns from 18 | 19 | # Example 20 | Using total Spotify streams by day and artist as an example data source, let's calculate, by artist, the rank for total streams by day (e.g. rank = 1 indicates that day was their most-streamed day). 21 | 22 | - `fields` - this is just `artist` 23 | - `rank_by` - this is `streams` 24 | - `table` - this is called `raw` 25 | 26 | then the query is: 27 | 28 | ```sql 29 | select 30 | *, 31 | rank() over (partition by artist order by streams desc) rank 32 | from raw 33 | ``` 34 | | streams | artist | day | rank | 35 | | --- | --------- | --- | ----| 36 | | 685306 | Dalex, Lenny Tavárez, Chencho Corleone, Juhn, Dímelo Flow | 2020-08-01 | 1 | 37 | | 680634 | Dalex, Lenny Tavárez, Chencho Corleone, Juhn, Dímelo Flow | 2020-08-08 | 2 | 38 | | ... | ... | ... | ... | 39 | -------------------------------------------------------------------------------- /bigquery/regex-email.md: -------------------------------------------------------------------------------- 1 | # Regex: Validate email address 2 | 3 | Explore this snippet [here](https://count.co/n/aAHVEQ74TYd?vm=e). 4 | 5 | # Description 6 | The only real way to validate an email address is to send an email to it, but for the purposes of analytics it's often useful to strip out rows including malformed email address. 7 | Using the `REGEXP_CONTAINS` function, this example comes courtesy of Reddit user [chrisgoddard](https://www.reddit.com/r/bigquery/comments/dshge0/udf_for_email_validation/f6r7rpt?utm_source=share&utm_medium=web2x&context=3): 8 | 9 | ```sql 10 | with emails as ( 11 | select * from unnest([ 12 | 'a', 13 | 'hello@count.c', 14 | 'hello@count.co', 15 | '@count.co', 16 | 'a@a.a' 17 | ]) as email 18 | ) 19 | 20 | select 21 | email, 22 | regexp_contains( 23 | lower(email), 24 | "^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])$" 25 | ) valid 26 | from emails 27 | ``` -------------------------------------------------------------------------------- /bigquery/regex-parse-url.md: -------------------------------------------------------------------------------- 1 | # Regex: Parse URL String 2 | Explore the query with some demo data [here](https://count.co/n/U2RGnRt91pB?vm=e). 3 | 4 | # Description 5 | URL strings are a very valuable source of information, but trying to get your regex exactly right is a pain. 6 | This re-usable snippet can be applied to any URL to extract the exact part you're after. 7 | 8 | ```sql 9 | SELECT 10 | REGEXP_EXTRACT(, r'(?:[a-zA-Z]+://)?([a-zA-Z0-9-.]+)/?') as host, 11 | REGEXP_EXTRACT(, r'(?:[a-zA-Z]+://)?(?:[a-zA-Z0-9-.]+)/{1}([a-zA-Z0-9-./]+)') as path, 12 | REGEXP_EXTRACT(, r'\?(.*)') as query, 13 | REGEXP_EXTRACT(, r'#(.*)') as ref, 14 | REGEXP_EXTRACT(, r'^([a-zA-Z]+)://') as protocol 15 | FROM 16 | '' 17 | ``` 18 | where: 19 | - `` is your URL string (e.g. `"https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#Ref1"` 20 | - `host` is the domain name (e.g. `yoursite.com`) 21 | - `path` is the page path on the website (e.g. `pricing/details`) 22 | - `query` is the string of query parameters in the url (e.g. `myparam1=123`) 23 | - `ref` is the reference or where the user came from before visiting the url (e.g. 
`#newsfeed`) 24 | - `protocol` is how the data is transferred between the host and server (e.g. `https`) 25 | 26 | 27 | # Usage 28 | To use, just copy out the regex for whatever part of the URL you need. Or capture them all as in the example below: 29 | 30 | ```sql 31 | WITH my_url as (select 'https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#newsfeed' as url) 32 | SELECT 33 | REGEXP_EXTRACT(url, r'(?:[a-zA-Z]+://)?([a-zA-Z0-9-.]+)/?') as host, 34 | REGEXP_EXTRACT(url, r'(?:[a-zA-Z]+://)?(?:[a-zA-Z0-9-.]+)/{1}([a-zA-Z0-9-./]+)') as path, 35 | REGEXP_EXTRACT(url, r'\?(.*)') as query, 36 | REGEXP_EXTRACT(url, r'#(.*)') as ref, 37 | REGEXP_EXTRACT(url, r'^([a-zA-Z]+)://') as protocol 38 | FROM my_url 39 | ``` 40 | | host | path | query | ref | protocol| 41 | | --- | ------ | ----- | --- | ------| 42 | | www.yoursite.com | pricing/details | myparam1=123&myparam2=abc#newsfeed | newsfeed | https | 43 | 44 | 45 | # References Helpful Links 46 | - [The Anatomy of a Full Path URL](https://zvelo.com/anatomy-of-full-path-url-hostname-protocol-path-more/) 47 | - [GoogleCloudPlatform/bigquery-utils](https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/url_parse.sql) 48 | -------------------------------------------------------------------------------- /bigquery/repeating-words.md: -------------------------------------------------------------------------------- 1 | # Extract Repeating Words from String 2 | 3 | 4 | # Description 5 | If I wanted to extract all Workorder numbers from the string: 6 | 7 | ```sql 8 | F1 - wo-12, some other text WOM-1235674,WOM-2345,WOM-3456 9 | ``` 10 | Where the workorder numbers always follow `WOM-`, 11 | and I wanted to make sure each workorder number got it's own row so I could join to another table, I could do: 12 | 13 | ```sql 14 | SELECT 15 | product_id, 16 | wo_number 17 | FROM 18 |
19 | CROSS JOIN 20 | UNNEST(regexp_extract_all(wo_text,r'WOM-([1-9]+)')) as wo_number 21 | ``` 22 | where `wo_text` is the column of text that contains the workorder info. 23 | > Inspired by: https://www.reddit.com/r/bigquery/comments/oizgm6/extracting_word_from_string/ 24 | # Example: 25 | ```sql 26 | with data as ( 27 | select 123 product_id, 'F1 - wo-12, some other text WOM-1235674,WOM-2345,WOM-3456' as wo_text 28 | ) 29 | select product_id, wo_number 30 | from data 31 | cross join unnest(regexp_extract_all(wo_text,r'WOM-([1-9]+)')) as wo_number 32 | --This will intentionally not extract 'wo-12' as it doesn't match WOM-[numbers] pattern 33 | ``` 34 | |product_id | wo_number| 35 | |---- | ----- | 36 | | 123 | 1235674 | 37 | |123 | 2345 | 38 | |123 | 3456 | 39 | -------------------------------------------------------------------------------- /bigquery/replace-empty-strings-null.md: -------------------------------------------------------------------------------- 1 | # Replace empty strings with NULLs 2 | 3 | Explore this snippet [here](https://count.co/n/g0Q8V8oSnB8?vm=e). 4 | 5 | # Description 6 | 7 | An essential part of cleaning a new data source is deciding how to treat missing values. The BigQuery function [IF](https://cloud.google.com/bigquery/docs/reference/standard-sql/conditional_expressions#if) can help with replacing empty values with something else: 8 | 9 | ```sql 10 | with data as ( 11 | select * from unnest(['a', 'b', '', 'd']) as str 12 | ) 13 | 14 | select 15 | if(length(str) = 0, null, str) str 16 | from data 17 | ``` 18 | 19 | | str | 20 | | ---- | 21 | | a | 22 | | b | 23 | | NULL | 24 | | d | -------------------------------------------------------------------------------- /bigquery/replace-multiples.md: -------------------------------------------------------------------------------- 1 | # Replace or Strip out multiple different string matches from some text. 
2 | 3 | ## Description 4 | 5 | If you need to remove to replace multiple difference string conditions it can be quite code heavy unless you make use REGEXP_REPLACE 6 | Use | to alternative matches 7 | (?i) allows case of the matches to be ignored 8 | 9 | ``` 10 | select REGEXP_REPLACE('Some text1 or Text2 etc', r'(?i)text1|text2|text3|text4','NewValue') 11 | ``` 12 | 13 | ## Example 14 | 15 | ```sql 16 | with pet as 17 | (select ['Marley (hcp)','Marley (HCP)','Marley(hcp)','Marley (hcP)','Marley (CC)','Marley (cc)','Marley(cc)','Marley (rehomed)','Marley (re-homed)','Marley (stray)','Marley**','Marley **'] as pets 18 | ), pet_list as 19 | ( 20 | select pet_name 21 | from pet 22 | left join unnest(pets) pet_name) 23 | select 24 | pet_name, 25 | TRIM(REGEXP_REPLACE(pet_name, r'(?i)\(hcp\)|\(cc\)|\(re-homed\)|\(foster\)|\(stray\)|\(rehomed\)|\*\*','')) as clean_pet_name 26 | from pet_list 27 | ``` 28 | 29 | ``` 30 | +-----------------+--------------+ 31 | |pet_name |clean_pet_name| 32 | +-----------------+--------------+ 33 | |Marley (hcp) |Marley | 34 | |Marley (HCP) |Marley | 35 | |Marley(hcp) |Marley | 36 | |Marley (hcP) |Marley | 37 | |Marley (CC) |Marley | 38 | |Marley (cc) |Marley | 39 | |Marley(cc) |Marley | 40 | |Marley (rehomed) |Marley | 41 | |Marley (re-homed)|Marley | 42 | |Marley (stray) |Marley | 43 | |Marley** |Marley | 44 | |Marley ** |Marley | 45 | +-----------------+--------------+ 46 | ``` 47 | -------------------------------------------------------------------------------- /bigquery/replace-null.md: -------------------------------------------------------------------------------- 1 | # Replace NULLs 2 | 3 | Explore this snippet [here](https://count.co/n/HrbhKQX4JfT?vm=e). 4 | 5 | # Description 6 | 7 | An essential part of cleaning a new data source is deciding how to treat NULL values. The BigQuery function [IFNULL](https://cloud.google.com/bigquery/docs/reference/standard-sql/conditional_expressions#ifnull) can help with replacing NULL values with something else: 8 | 9 | ```sql 10 | with data as ( 11 | select * from unnest([ 12 | struct(1 as num, 'a' as str), 13 | struct(null as num, 'b' as str), 14 | struct(3 as num, null as str) 15 | ]) 16 | ) 17 | 18 | select 19 | ifnull(num, -1) num, 20 | ifnull(str, 'I AM NULL') str, 21 | from data 22 | ``` 23 | 24 | | num | str | 25 | | --- | --------- | 26 | | 1 | a | 27 | | -1 | b | 28 | | 3 | I AM NULL | -------------------------------------------------------------------------------- /bigquery/running-total.md: -------------------------------------------------------------------------------- 1 | # Running total 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/xFHD89xBRAU?vm=e). 4 | 5 | # Description 6 | Running totals are relatively simple to calculate in BigQuery, using the `sum` window function. 7 | The template for the query is 8 | 9 | ```sql 10 | select 11 | sum() over (partition by order by asc) running_total, 12 | , 13 | 14 | from
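-- with an order by and no explicit frame, sum() accumulates from the start of the partition up to the current row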
15 | ``` 16 | where 17 | - `value` - this is the numeric quantity to calculate a running total for 18 | - `fields` - these are zero or more columns to split the running totals by 19 | - `ordering` - this is the column which determines the order of the running total, most commonly temporal 20 | - `table` - where to pull these columns from 21 | # Example 22 | Using total Spotify streams as an example data source, let's identify: 23 | - `value` - this is `streams` 24 | - `fields` - this is just `artist` 25 | - `ordering` - this is the `day` column 26 | - `table` - this is called `raw` 27 | 28 | then the query is: 29 | 30 | ```sql 31 | select 32 | sum(streams) over (partition by artist order by day asc) running_total, 33 | artist, 34 | day 35 | from raw 36 | ``` 37 | 38 | | running_total | artist | day | streams | 39 | | ------------- | ------- | --- | ------ | 40 | | 705819 | *NSYNC | 2017-12-23 | 705819 | 41 | | 2042673 | *NSYNC | 2017-12-24 | 1336854 | 42 | | ... | ... | ... |... | 43 | -------------------------------------------------------------------------------- /bigquery/split-uppercase.md: -------------------------------------------------------------------------------- 1 | # Split a string on Uppercase Characters 2 | View interactive version [here](https://count.co/n/Exa1qpLjE8y?vm=e). 3 | 4 | # Description 5 | Split a string like 'HelloWorld' into ['Hello','World'] using: 6 | 7 | ```sql 8 | regexp_extract_all('stringToReplace',r'([A-Z][a-z]*)') 9 | ``` 10 | 11 | # Example: 12 | ```sql 13 | select 14 | regexp_extract_all('HelloWorld',r'([A-Z][a-z]*)') as_array, 15 | array_to_string(regexp_extract_all('HelloWorld',r'([A-Z][a-z]*)'),' ') a_string_again 16 | ``` 17 | |as_array | a_string_again| 18 | |------- | --------- | 19 | |["Hello","World"] | Hello World | 20 | -------------------------------------------------------------------------------- /bigquery/stacked-bar-line.md: -------------------------------------------------------------------------------- 1 | # Stacked bar and line chart 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/xnf2JGWOpEh?vm=e). 4 | 5 | # Description 6 | 7 | Bar and line charts can communicate a lot of info on a single chart. They require all of the elements of a bar chart: 8 | - 2 categorical or temporal columns - 1 for the x-axis and 1 for stacked bars 9 | - 2 numeric columns - 1 for the bar and 1 for the line 10 | 11 | ```sql 12 | SELECT 13 | AGG_FN() OVER (PARTITION BY , ) as bar_metric, 14 | as x_axis, 15 | as cat, 16 | AGG_FN() OVER (PARTITION BY ) as line 17 | FROM 18 | 19 | ``` 20 | 21 | where: 22 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 23 | - `COLUMN` is the column you want to aggregate to get your bar chart metric. Make sure this is a numeric column. 24 | - `COLUMN2` is the column you want to aggregate to get your line chart metric. Make sure this is a numeric column. 25 | - `CAT_COLUMN1` and `CAT_COLUMN2` are the groups you want to compare. One will go on the x axis and the other on the colors of the bar chart. Make sure these are categorical or temporal columns and not numerical fields. 26 | 27 | # Usage 28 | 29 | In this example with some example data we'll look at some sample data of users that were successful and unsuccessful at a certain task over a few days. We'll show the count of users by their outcome, then show the overall success rate for each day. 
30 | 31 | ```sql 32 | with sample_data as ( 33 | select date('2021-01-01') registered_at, 'successful' category, round(rand()*10) user_count union all 34 | select date('2021-01-01') registered_at, 'unsuccessful' category, round(rand()*10) user_count union all 35 | select date('2021-01-02') registered_at, 'successful' category, round(rand()*10) user_count union all 36 | select date('2021-01-02') registered_at, 'unsuccessful' category, round(rand()*10) user_count union all 37 | select date('2021-01-03') registered_at, 'successful' category, round(rand()*10) user_count union all 38 | select date('2021-01-03') registered_at, 'unsuccessful' category, round(rand()*10) user_count 39 | ) 40 | 41 | 42 | select 43 | registered_at, 44 | category, 45 | user_count, 46 | sum(case when category = 'successful' then user_count end) over (partition by registered_at) / sum(user_count) over (partition by registered_at) success_rate 47 | from sample_data 48 | ``` 49 | stackedbar-line 50 | -------------------------------------------------------------------------------- /bigquery/tf-idf.md: -------------------------------------------------------------------------------- 1 | # TF-IDF 2 | To explore an interactive version of this snippet, go [here](https://count.co/n/0NnhvqUjFuZ?vm=e). 3 | 4 | # Description 5 | TF-IDF (Term Frequency - Inverse Document Frequency) is a way of extracting keywords that are used more in certain documents than in the corpus of documents as a whole. 6 | It's a powerful NLP tool and is surprisingly easy to do in SQL. 7 | This snippet comes from Felipe Hoffa on [stackoverflow](https://stackoverflow.com/questions/47028576/how-can-i-compute-tf-idf-with-sql-bigquery): 8 | 9 | ```sql 10 | #standardSQL 11 | WITH words_by_post AS ( 12 | SELECT CONCAT(link_id, '/', id) id, REGEXP_EXTRACT_ALL( 13 | REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&', '&'), r'&[a-z]{2,4};', '*') 14 | , r'[a-z]{2,20}\'?[a-z]+') words 15 | , COUNT(*) OVER() docs_n 16 | FROM `fh-bigquery.reddit_comments.2017_07` 17 | WHERE body NOT IN ('[deleted]', '[removed]') 18 | AND subreddit = 'movies' 19 | AND score > 100 20 | ), words_tf AS ( 21 | SELECT id, word, COUNT(*) / ARRAY_LENGTH(ANY_VALUE(words)) tf, ARRAY_LENGTH(ANY_VALUE(words)) words_in_doc 22 | , ANY_VALUE(docs_n) docs_n 23 | FROM words_by_post, UNNEST(words) word 24 | GROUP BY id, word 25 | HAVING words_in_doc>30 26 | ), docs_idf AS ( 27 | SELECT tf.id, word, tf.tf, ARRAY_LENGTH(tfs) docs_with_word, LOG(docs_n/ARRAY_LENGTH(tfs)) idf 28 | FROM ( 29 | SELECT word, ARRAY_AGG(STRUCT(tf, id, words_in_doc)) tfs, ANY_VALUE(docs_n) docs_n 30 | FROM words_tf 31 | GROUP BY 1 32 | ), UNNEST(tfs) tf 33 | ) 34 | 35 | 36 | SELECT *, tf*idf tfidf 37 | FROM docs_idf 38 | WHERE docs_with_word > 1 39 | ORDER BY tfidf DESC 40 | LIMIT 1000 41 | ``` 42 | 43 | 44 | # Usage 45 | 46 | ```sql 47 | -- ex_data 48 | select 1 id, 'England will win it' body union all 49 | select 2 id, 'France will win it' body union all 50 | select 3 id, 'Its coming home' body 51 | ``` 52 | |id | body | 53 | |--| ---- | 54 | |1 | 'England will win it' | 55 | |2 | 'France will win it' | 56 | |3 | 'Its coming home'| 57 | ```sql 58 | -- words_by_post 59 | SELECT 60 | id, REGEXP_EXTRACT_ALL( 61 | REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&', '&'), r'&[a-z]{2,4};', '*') 62 | , r'[a-z]{2,20}\'?[a-z]+') words 63 | , COUNT(*) OVER() docs_n 64 | FROM ex_data 65 | ``` 66 | |id | body | 67 | |--| ---- | 68 | |1 | ['england','will','win'] | 69 | |2 | ['france','will','win'] | 70 | |3 | ['its','coming','home']| 71 | ```sql 
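-- tf = how often each word appears in a post, divided by the total number of words in that post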
72 | -- words_tf 73 | SELECT id, word, COUNT(*) / ARRAY_LENGTH(ANY_VALUE(words)) tf, ARRAY_LENGTH(ANY_VALUE(words)) words_in_doc 74 | , ANY_VALUE(docs_n) docs_n 75 | FROM words_by_post, UNNEST(words) word 76 | GROUP BY id, word 77 | ``` 78 | |id | word | tf | words_in_doc| docs_n| 79 | |--| ---- | ---- | ---- | ---- | 80 | |1 | 'england'|0.33|3 |3| 81 | |1 | 'will' |0.33|3 |3| 82 | |1| 'win'|0.33|3 |3| 83 | |...| ...|...|... |...| 84 | 85 | ```sql 86 | -- docs_idf 87 | SELECT tf.id, word, tf.tf, ARRAY_LENGTH(tfs) docs_with_word, LOG(docs_n/ARRAY_LENGTH(tfs)) idf 88 | FROM ( 89 | SELECT word, ARRAY_AGG(STRUCT(tf, id, words_in_doc)) tfs, ANY_VALUE(docs_n) docs_n 90 | FROM words_tf 91 | GROUP BY 1 92 | ), UNNEST(tfs) tf 93 | ``` 94 | |id | word | tf | docs_with_word| idf| 95 | |--| ---- | ---- | ---- | ---- | 96 | |1 | 'england'|0.33|1 |1.099| 97 | |1 | 'will' |0.33|2 |.405| 98 | |1| 'win'|0.33|2 |.405| 99 | |...| ...|...|... |...| 100 | ```sql 101 | -- tf_idf 102 | SELECT *, tf*idf tfidf 103 | FROM docs_idf 104 | WHERE docs_with_word > 1 105 | ORDER BY tfidf DESC 106 | ``` 107 | |id | word | tf | docs_with_word| idf|tfidf| 108 | |--| ---- | ---- | ---- | ---- |---- | 109 | |1 | 'england'|0.33|1 |1.099|.135| 110 | |1 | 'will' |0.33|2 |.405|.135| 111 | |1| 'win'|0.33|2 |.405|.135| 112 | |...| ...|...|... |...|...| 113 | -------------------------------------------------------------------------------- /bigquery/timeseries.md: -------------------------------------------------------------------------------- 1 | # Time series line chart 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/B3XjM5dxt4f?vm=e). 4 | 5 | # Description 6 | 7 | Timeseries charts show how individual metric(s) change over time. They have lines along the y-axis and dates along the x-axis. 8 | You just need a simple GROUP BY to get all the elements you need for a timeseries chart: 9 | 10 | ```sql 11 | SELECT 12 | AGG_FN() as metric, 13 | as datetime 14 | FROM 15 | 16 | GROUP BY 17 | datetime 18 | ``` 19 | where: 20 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 21 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column. 22 | - `DATETIME_COLUMN` is the group you want to show on the x-axis. Make sure this is a date, datetime, timestamp, or time column (not a number) 23 | 24 | # Usage 25 | 26 | In this example with some Spotify data, we'll look at the total daily streams over time: 27 | 28 | ```sql 29 | select 30 | sum(streams) streams, 31 | day 32 | from spotify.spotify_daily_tracks 33 | group by day 34 | ``` 35 | timeseries 36 | -------------------------------------------------------------------------------- /bigquery/transforming-arrays.md: -------------------------------------------------------------------------------- 1 | # Transforming arrays 2 | 3 | Explore this snippet [here](https://count.co/n/CLAQ94HaRvO?vm=e). 
4 | 5 | # Description 6 | To transform elements of an array (often known as mapping over an array), it is possible to gain access to array elements using the `UNNEST` function: 7 | 8 | ```sql 9 | with data as (select [1,2,3] nums, ['a', 'b', 'c'] strs) 10 | 11 | select 12 | 13 | array( 14 | select 15 | x * 2 + 1 -- Transformation function 16 | from unnest(nums) as x 17 | ) as nums_transformed, 18 | 19 | array( 20 | select 21 | x || '_2' -- Transformation function 22 | from unnest(strs) as x 23 | ) as strs_transformed 24 | 25 | from data 26 | ``` -------------------------------------------------------------------------------- /bigquery/uniform-distribution.md: -------------------------------------------------------------------------------- 1 | # Generate a uniform distribution 2 | 3 | Explore this snippet [here](https://count.co/n/s7XKhGzLxX6?vm=e) 4 | 5 | # Description 6 | 7 | Generating a uniform distribution of (pseudo-) random numbers is often useful when performing statistical tests on data. 8 | This snippet demonstrates a method for generating a deterministic sample from a uniform distribution: 9 | 1. Create a column of row indices - here the [`GENERATE_ARRAY`](https://cloud.google.com/bigquery/docs/reference/standard-sql/array_functions#generate_array) function is used, though the [`ROW_NUMBER`](https://cloud.google.com/bigquery/docs/reference/standard-sql/numbering_functions#row_number) window function may also be useful. 10 | 2. Convert the row indices into pseudo-random numbers using the [`FARM_FINGERPRINT`](https://cloud.google.com/bigquery/docs/reference/standard-sql/hash_functions#farm_fingerprint) hashing function. 11 | 3. Map the hashes into the correct range 12 | 13 | ```sql 14 | with params as ( 15 | select 16 | 10000 as num_samples, -- How many samples to generate 17 | 1000000 as precision, -- How much precision to maintain (e.g. 6 decimal places) 18 | 2 as min, -- Lower limit of distribution 19 | 5 as max, -- Upper limit of distribution 20 | ), 21 | samples as ( 22 | select 23 | -- Create hashes (between 0 and precision) 24 | -- Map [0, precision] -> [min, max] 25 | mod(abs(farm_fingerprint(cast(indices as string))), precision) / precision * (max - min) + min as samples 26 | from 27 | params, 28 | unnest(generate_array(1, num_samples)) as indices 29 | ) 30 | 31 | select * from samples 32 | ``` 33 | 34 | | samples | 35 | | ------- | 36 | | 3.55 | 37 | | 3.316 | 38 | | 2.146 | 39 | | ... | -------------------------------------------------------------------------------- /bigquery/unpivot-melt.md: -------------------------------------------------------------------------------- 1 | # Unpivot / melt 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/xwGc7kFuAm0?vm=e). 4 | 5 | # Description 6 | The act of unpivoting (or melting, if you're a `pandas` user) is to convert columns to rows. BigQuery has recently added the [UNPIVOT](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unpivot_operator) operator as a preview feature. 7 | The query template is 8 | 9 | ```sql 10 | SELECT 11 | , 12 | metric, 13 | value 14 | FROM
UNPIVOT(value FOR metric IN ()) 15 | order by metric, value 16 | ``` 17 | where 18 | - `columns_to_keep` - all the columns to be preserved 19 | - `columns_to_unpivot` - all the columns to be converted to rows 20 | - `table` - the table to pull the columns from 21 | 22 | (The output column names 'metric' and 'value' can also be changed.) 23 | # Examples 24 | In the examples below, we use a table with columns `A`, `B`, `C`. We take the columns `B` and `C`, and turn their values into columns called `metric` (containing either the string 'B' or 'C') and `value` (the value from either the column `B` or `C`). The column `A` is preserved. 25 | 26 | ```sql 27 | -- Define some dummy data 28 | with a as ( 29 | select * from unnest([ 30 | struct( 'a' as A, 1 as B, 2 as C ), 31 | struct( 'b' as A, 3 as B, 4 as C ), 32 | struct( 'c' as A, 5 as B, 6 as C ) 33 | ]) 34 | ) 35 | SELECT 36 | A, 37 | metric, 38 | value 39 | FROM a UNPIVOT(value FOR metric IN (B, C)) 40 | order by metric, value 41 | ``` 42 | | A | metric | value | 43 | | --- | ----- | ------ | 44 | | a | B | 1 | 45 | | b | B | 3 | 46 | | c | B | 5 | 47 | | a | C | 2 | 48 | | b | C | 4 | 49 | | c | C | 6 | 50 | 51 | If you would rather not use preview features in BigQuery, there is another generic solution (from [StackOverflow](https://stackoverflow.com/a/47657770)), though it may be much less performant than the built-in `UNPIVOT` implementation. 52 | The query template here is 53 | 54 | ```sql 55 | select 56 | , 57 | metric, 58 | value 59 | from ( 60 | SELECT 61 | *, 62 | REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 63 | REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 64 | FROM 65 |
<table>, 66 | UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(<table>
), r'{|}', ''))) pair 67 | ) 68 | where metric not in () 69 | order by metric, value 70 | ``` 71 | where 72 | - `column_names_to_keep` - this is a list of the JSON-escaped names of the columns to keep 73 | 74 | An annotated example is: 75 | 76 | ```sql 77 | -- Define some dummy data 78 | with a as ( 79 | select * from unnest([ 80 | struct( 'a' as A, 1 as B, 2 as C ), 81 | struct( 'b' as A, 3 as B, 4 as C ), 82 | struct( 'c' as A, 5 as B, 6 as C ) 83 | ]) 84 | ) 85 | 86 | -- See https://stackoverflow.com/a/47657770 87 | 88 | select 89 | A, -- Keep A column 90 | metric, 91 | value 92 | from ( 93 | SELECT 94 | *, -- Original table 95 | 96 | -- Do some manual JSON parsing of a string which looks like '"B": 3' 97 | REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric, 98 | REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value 99 | FROM 100 | a, 101 | -- 1. Create a row which contains a string like '{ "A": "a", "B": 1, "C": 2 }' 102 | -- 2. Remove the outer {} characters 103 | -- 3. Split the string on , characters 104 | -- 4. Join the new array onto the original table 105 | UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(a), r'{|}', ''))) pair 106 | ) 107 | where metric not in ('A') -- Filter extra rows that we don't need 108 | order by metric, value 109 | ``` 110 | | A | metric | value | 111 | | --- | ----- | ------ | 112 | | a | B | 1 | 113 | | b | B | 3 | 114 | | c | B | 5 | 115 | | a | C | 2 | 116 | | b | C | 4 | 117 | | c | C | 6 | 118 | -------------------------------------------------------------------------------- /bigquery/yoy.md: -------------------------------------------------------------------------------- 1 | # Year-on-year percentage difference 2 | To explore this example with some demo data, head [here](https://count.co/n/7I6eHGEwxX3?vm=e). 3 | 4 | # Description 5 | Calculating the year-on-year change of a quantity typically requires a few steps: 6 | 1. Aggregate the quantity by year (and by any other fields) 7 | 2. Add a column for last years quantity (this will be null for the first year in the dataset) 8 | 3. Calculate the change from last year to this year 9 | 10 | A generic query template may then look like 11 | 12 | ```sql 13 | select 14 | *, 15 | -- Calculate percentage change 16 | ( - last_year_total) / last_year_total * 100 as perc_diff 17 | from ( 18 | select 19 | *, 20 | -- Use a window function to get last year's total 21 | lag() over (partition by order by asc) last_year_total 22 | from
23 | order by , desc 24 | ) 25 | ``` 26 | where 27 | - `table` - your source data table 28 | - `fields` - one or more columns to split the year-on-year differences by 29 | - `year_column` - the column containing the year of the aggregation 30 | - `year_total` - the column containing the aggregated data 31 | # Example 32 | Using total Spotify streams as an example data source, let's identify: 33 | - `table` - this is called `raw` 34 | - `fields` - this is just the single column `artist` 35 | - `year_column` - this is called `year` 36 | - `year_total` - this is `sum_streams` 37 | 38 | Then the query looks like: 39 | 40 | ```sql 41 | -- raw 42 | select 43 | date_trunc(day, year) year, 44 | sum(streams) sum_streams, 45 | artist 46 | from spotify.spotify_daily_tracks 47 | group by 1,3 48 | ``` 49 | | year | sum_streams | artist | 50 | | --- | ----------- | ---- | 51 | | 2020-01-01 | 115372043 | CJ | 52 | | 2021-01-01 | 179284925 | CJ | 53 | | ... | ... | ... | 54 | ```sql 55 | -- with_perc_diff 56 | select 57 | *, 58 | (sum_streams - last_year_total) / last_year_total * 100 as perc_diff 59 | from ( 60 | select 61 | *, 62 | lag(sum_streams) over (partition by artist order by year asc) last_year_total 63 | from raw 64 | order by artist, year desc 65 | ) 66 | ``` 67 | | year | sum_streams | artist | last_year_total | perc_diff | 68 | | --- | ----------- | ---- | ---- | ----- | 69 | | 2020-01-01 | 115372043 | CJ | NULL | NULL | 70 | | 2021-01-01 | 179284925 | CJ | 115372043 | 55.397200515899684 | 71 | | ... | ... | ... | ... | .... | 72 | -------------------------------------------------------------------------------- /mssql/Cumulative distribution functions.md: -------------------------------------------------------------------------------- 1 | # Cumulative distribution functions 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/ANpwFTvhnWT?vm=e). 4 | 5 | # Description 6 | 7 | Cumulative distribution functions (CDF) are a method for analysing the distribution of a quantity, similar to histograms. They show, for each value of a quantity, what fraction of rows are smaller or greater. 8 | One method for calculating a CDF is as follows: 9 | 10 | ``` 11 | select 12 | -- Use a row_number window function to get the position of this row and CAST to convert row_number and count from integers to decimal 13 | CAST(row_number() over (order by asc) AS DECIMAL)/CAST((select count(*) from
<table>) AS DECIMAL) AS cdf 14 | from <table>
15 | ``` 16 | 17 | where 18 | 19 | * `quantity` - the column containing the metric of interest 20 | * `table` - the table name 21 | 22 | **Note:** CAST is used to convert the numerator and denominator of the fraction into decimals before they are divided. 23 | 24 | # Example 25 | 26 | Using some student test scores as an example data source, let’s identify: 27 | 28 | * `table` - this is called `Example_data` 29 | * `quantity` - this is the `column score` 30 | 31 | then the query becomes: 32 | 33 | ``` 34 | select 35 | student_id, 36 | score, 37 | CAST(row_number() over (order by score asc) AS DECIMAL)/CAST((select count(*) from Example_data) AS DECIMAL) AS frac 38 | from Example_data 39 | ``` 40 | 41 | |student_id|score|frac| 42 | |---|---|---| 43 | |3|76|0.2| 44 | |5|77|0.2| 45 | |4|82|0.6| 46 | |...|...|...| 47 | |1|97|1| 48 | -------------------------------------------------------------------------------- /mssql/check-column-is-accessible.md: -------------------------------------------------------------------------------- 1 | # Check if a column is accessible 2 | 3 | Explore this snippet [here](https://count.co/n/xpgYlM1uDAT?vm=e). 4 | 5 | # Description 6 | One method to check if the current session has access to a column is to use the [`COL_LENGTH`](https://docs.microsoft.com/en-us/sql/t-sql/functions/col-length-transact-sql) function, which returns the length of a column in bytes. 7 | If the column doesn't exist (or cannot be accessed), COL_LENGTH returns null. 8 | 9 | ```sql 10 | select 11 | col_length('Orders', 'Sales') as does_exist, 12 | col_length('Orders', 'Shmales') as does_not_exist 13 | ``` 14 | 15 | | does_exist | does_not_exist | 16 | | ---------- | -------------- | 17 | | 8 | NULL | -------------------------------------------------------------------------------- /mssql/compound-growth-rates.md: -------------------------------------------------------------------------------- 1 | # Compound growth rates 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/HFVMSAR7nmb?vm=e). 4 | 5 | # Description 6 | When a quantity is increasing in time by the same fraction every period, it is said to exhibit **compound growth** - every time period it increases by a larger amount. 
7 | It is possible to show that the behaviour of the quantity over time follows the law 8 | 9 | ```sql 10 | x(t) = x(0) * exp(r * t) 11 | ``` 12 | where 13 | - `x(t)` is the value of the quantity now 14 | - `x(0)` is the value of the quantity at time t = 0 15 | - `t` is time in units of [time_period] 16 | - `r` is the growth rate in units of 1 / [time_period] 17 | 18 | # Calculating growth rates 19 | 20 | From the equation above, to calculate a growth rate you need to know: 21 | - The value at the start of a period 22 | - The value at the end of a period 23 | - The duration of the period 24 | 25 | Then, following some algebra, the growth rate is given by 26 | 27 | ```sql 28 | r = ln(x(t) / x(0)) / t 29 | ``` 30 | 31 | For a concrete example, assume a table with columns: 32 | - `num_this_month` - this is `x(t)` 33 | - `num_last_month` - this is `x(0)` 34 | 35 | #### Inferred growth rates 36 | 37 | In the following then, `growth_rate` is the equivalent yearly growth rate for that month: 38 | 39 | ```sql 40 | -- with_growth_rate 41 | select 42 | *, 43 | log(num_this_month / num_last_month) * 12 growth_rate 44 | from with_previous_month 45 | ``` 46 | 47 | # Using growth rates 48 | 49 | To better depict the effects of a given growth rate, it can be converted to a year-on-year growth factor by inserting into the exponential formula above with t = 1 (one year duration). 50 | Or mathematically, 51 | 52 | ```sql 53 | yoy_growth = x(1) / x(0) = exp(r) 54 | ``` 55 | It is always sensible to spot-check what a growth rate actually represents to decide whether or not it is a useful metric. 56 | 57 | #### Extrapolated year-on-year growth 58 | 59 | ```sql 60 | -- with_yoy_growth 61 | select 62 | *, 63 | exp(growth_rate) as yoy_growth 64 | from with_growth_rate 65 | ``` -------------------------------------------------------------------------------- /mssql/concatenate-strings.md: -------------------------------------------------------------------------------- 1 | # Concatenating strings 2 | 3 | Explore this snippet [here](https://count.co/n/NoziHzLCWJA?vm=e). 4 | 5 | # Description 6 | Use the [`STRING_AGG`](https://docs.microsoft.com/en-us/sql/t-sql/functions/string-agg-transact-sql) function to concatenate strings within a group. To specify the ordering, make sure to use the WITHIN GROUP clause too. 7 | 8 | ```sql 9 | with data as ( 10 | select * from (values 11 | ('A', '1'), 12 | ('A', '2'), 13 | ('B', '3'), 14 | ('B', '4'), 15 | ('B', '5') 16 | ) as data(str, num) 17 | ) 18 | 19 | select 20 | str, 21 | string_agg(num, ' < ') within group (order by num asc) as aggregated 22 | from data 23 | group by str 24 | ``` 25 | 26 | | str | aggregated | 27 | | --- | ---------- | 28 | | A | 1 < 2 | 29 | | B | 3 < 4 < 5 | -------------------------------------------------------------------------------- /mssql/first-row-of-group.md: -------------------------------------------------------------------------------- 1 | # Find the first row of each group 2 | 3 | Explore this snippet [here](https://count.co/n/3eWfeZ0n2a8?vm=e). 4 | 5 | # Description 6 | 7 | Finding the first value in a column partitioned by some other column is simple in SQL, using the GROUP BY clause. Returning the rest of the columns corresponding to that value is a little trickier. Here's one method, which relies on the RANK window function: 8 | 9 | ```sql 10 | with and_ranking as ( 11 | select 12 | *, 13 | rank() over (partition by order by ) ranking 14 | from
15 | ) 16 | select * from and_ranking where ranking = 1 17 | ``` 18 | where 19 | - `partition` - the column(s) to partition by 20 | - `ordering` - the column(s) which determine the ordering defining 'first' 21 | - `table` - the source table 22 | 23 | The CTE `and_ranking` adds a column to the original table called `ranking`, which is the order of that row within the groups defined by `partition`. Filtering for `ranking = 1` picks out those 'first' rows. 24 | 25 | # Example 26 | 27 | Using product sales as an example data source, we can pick the rows of the table corresponding to the most-recent order per city. Filling in the template above: 28 | - `partition` - this is just `City` 29 | - `ordering` - this is `Order_Date desc` 30 | - `table` - this is `Sales` 31 | 32 | ```sql 33 | with and_ranking as ( 34 | select 35 | *, 36 | rank() over (partition by City order by Order_Date desc) as ranking 37 | from Sales 38 | ) 39 | 40 | select * from and_ranking where ranking = 1 41 | ``` -------------------------------------------------------------------------------- /mssql/search-stored-procedures.md: -------------------------------------------------------------------------------- 1 | # SEARCH FOR TEXT WITHIN STORED PROCEDURES 2 | 3 | # Description 4 | Often you'll want to know which stored procedure is performing an action or contains specific text. Query will match on comments. 5 | 6 | # Snippet ✂️ 7 | ```sql 8 | SELECT name 9 | FROM sys.procedures 10 | WHERE Object_definition(object_id) LIKE '%strHell%' 11 | ``` 12 | 13 | # Usage 14 | 15 | ```sql 16 | SELECT name 17 | FROM sys.procedures 18 | WHERE Object_definition(object_id) LIKE '%blitz%' 19 | ``` 20 | 21 | | name | 22 | | -------- | 23 | | sp_BlitzFirst| 24 | | sp_BlitzCache| 25 | | sp_BlitzWho| 26 | | sp_Blitz| 27 | | sp_BlitzBackups| 28 | | sp_BlitzLock| 29 | | sp_BlitzIndex| 30 | 31 | # Referenced 32 | [Stackoverflow](https://stackoverflow.com/questions/14704105/search-text-in-stored-procedure-in-sql-server) -------------------------------------------------------------------------------- /mysql/cdf.md: -------------------------------------------------------------------------------- 1 | # Cumulative distribution functions 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/sL7bEFcg11Z?vm=e). 4 | 5 | # Description 6 | Cumulative distribution functions (CDF) are a method for analysing the distribution of a quantity, similar to histograms. They show, for each value of a quantity, what fraction of rows are smaller or greater. 7 | One method for calculating a CDF is as follows: 8 | 9 | ```sql 10 | select 11 | -- Use a row_number window function to get the position of this row 12 | (row_number() over (order by asc)) / (select count(*) from
<table>) cdf, 13 | <quantity>, 14 | from <table>
15 | ``` 16 | where 17 | - `quantity` - the column containing the metric of interest 18 | - `table` - the table name 19 | # Example 20 | 21 | 22 | ```sql 23 | with data as ( 24 | select 1 student_id, 97 score union all 25 | select 2 student_id, 93 score union all 26 | select 3 student_id, 76 score union all 27 | select 4 student_id, 82 score union all 28 | select 5 student_id, 77 score 29 | ) 30 | select student_id, score, (row_number() over (order by score asc)) / (select count(*) from data) frac 31 | from data 32 | ``` 33 | | student_id | score | frac | 34 | | ----------- | ----------- | ---- | 35 | |3 | 76 | 0.2| 36 | | 5| 77 | 0.4 | 37 | |4 | 82 | 0.6| 38 | |2 | 93 | 0.8| 39 | |1|97 | 1| 40 | -------------------------------------------------------------------------------- /mysql/initcap-udf.md: -------------------------------------------------------------------------------- 1 | # UDF for capitalizing initial letters 2 | 3 | # Description 4 | 5 | Most SQL dialects have a way to convert a string from e.g. "tHis STRING" to "This String" where each word starts with a capitalized letter. This is useful for transforming names of people or places (e.g. ‘North Carolina’ or ‘Juan Martin Del Potro’). 6 | 7 | Per this [Stackoverflow post](https://stackoverflow.com/questions/12364086/how-can-i-achieve-initcap-functionality-in-mysql), this UDF will give us INITCAP functionality. 8 | 9 | ```sql 10 | DELIMITER $$ 11 | 12 | DROP FUNCTION IF EXISTS `test`.`initcap`$$ 13 | 14 | CREATE FUNCTION `initcap`(x char(30)) RETURNS char(30) CHARSET utf8 15 | READS SQL DATA 16 | DETERMINISTIC 17 | BEGIN 18 | SET @str=''; 19 | SET @l_str=''; 20 | WHILE x REGEXP ' ' DO 21 | SELECT SUBSTRING_INDEX(x, ' ', 1) INTO @l_str; 22 | SELECT SUBSTRING(x, LOCATE(' ', x)+1) INTO x; 23 | SELECT CONCAT(@str, ' ', CONCAT(UPPER(SUBSTRING(@l_str,1,1)),LOWER(SUBSTRING(@l_str,2)))) INTO @str; 24 | END WHILE; 25 | RETURN LTRIM(CONCAT(@str, ' ', CONCAT(UPPER(SUBSTRING(x,1,1)),LOWER(SUBSTRING(x,2))))); 26 | END$$ 27 | 28 | DELIMITER ; 29 | ``` 30 | 31 | # Usage 32 | 33 | Use like: 34 | 35 | ```sql 36 | select initcap('This is a test string'); 37 | ``` 38 | 39 | # References 40 | 41 | - [Stackoverflow](https://stackoverflow.com/questions/12364086/how-can-i-achieve-initcap-functionality-in-mysql) 42 | - [SQL snippets](https://sql-snippets.count.co/t/udf-initcap-in-mysql/207) -------------------------------------------------------------------------------- /postgres/array-reverse.md: -------------------------------------------------------------------------------- 1 | # Reverse an array 2 | 3 | Explore this snippet [here](https://count.co/n/aDl6lXQKQdx?vm=e). 
4 | 5 | # Description 6 | There is no built-in function for reversing an array, but it is possible to construct one using the `generate_subscripts` function: 7 | 8 | ```sql 9 | with data as (select array[1, 2, 3, 4] as nums) 10 | 11 | select 12 | array( 13 | select nums[i] 14 | from generate_subscripts(nums, 1) as indices(i) 15 | order by i desc 16 | ) as reversed 17 | from data 18 | ``` 19 | 20 | 21 | ## References 22 | PostgreSQL wiki [https://wiki.postgresql.org/wiki/Array_reverse](https://wiki.postgresql.org/wiki/Array_reverse) -------------------------------------------------------------------------------- /postgres/convert-epoch-to-timestamp.md: -------------------------------------------------------------------------------- 1 | # Replace epoch/unix formatted timestamps with human-readable ones 2 | View an interactive version of this snippet [here](https://count.co/n/UdQXtD16DGx?vm=e). 3 | 4 | # Description 5 | 6 | Often data sources will include epoch/unix-style timestamps or dates, which are not human-readable. 7 | Therefore, we want to convert these to a different format, for example to be used in charts or aggregation queries. 8 | The [`TO_TIMESTAMP`](https://www.postgresql.org/docs/13/functions-formatting.html) function will allow us to do this with little effort: 9 | 10 | ```sql 11 | WITH data AS ( 12 | SELECT * 13 | FROM (VALUES (1611176312), (1611176759), (1611176817), (1611176854)) AS data (str) 14 | ) 15 | 16 | SELECT 17 | TO_TIMESTAMP(str) AS converted_timestamp 18 | FROM data; 19 | ``` 20 | 21 | | converted_timestamp | 22 | | ---- | 23 | | 2021-01-20 20:58:32.000000 | 24 | | 2021-01-20 21:05:59.000000 | 25 | | 2021-01-20 21:06:57.000000 | 26 | | 2021-01-20 21:07:34.000000 | 27 | -------------------------------------------------------------------------------- /postgres/count-nulls.md: -------------------------------------------------------------------------------- 1 | # Count NULLs 2 | 3 | Explore this snippet [here](https://count.co/n/YJCjwADi4VY?vm=e). 4 | 5 | # Description 6 | 7 | Part of the data cleaning process involves understanding the quality of your data. NULL values are usually best avoided, so counting their occurrences is a common operation. 8 | There are several methods that can be used here: 9 | - `count(*) - count()` - use the different forms of the [`count()`](https://www.postgresql.org/docs/current/functions-aggregate.html) aggregation which include and exclude NULLs. 10 | - `sum(case when x is null then 1 else 0 end)` - create a new column which contains 1 or 0 depending on whether the value is NULL, then aggregate. 11 | 12 | ```sql 13 | with data as ( 14 | select * from (values (1), (2), (null), (null), (5)) as data (x) 15 | ) 16 | 17 | select 18 | count(*) - count(x) with_count, 19 | sum(case when x is null then 1 else 0 end) with_case 20 | from data 21 | ``` 22 | 23 | | with_count | with_case | 24 | | ---------- | --------- | 25 | | 2 | 2 | 26 | -------------------------------------------------------------------------------- /postgres/cume_dist.md: -------------------------------------------------------------------------------- 1 | # Cumulative Distribution 2 | View an interactive snippet [here](https://count.co/n/1fymVJoCPVM?vm=e). 3 | 4 | # Description 5 | Cumulative Distribution is a method for analysing the distribution of a quantity, similar to histograms. They show, for each value of a quantity, what fraction of rows are smaller or greater. 
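Postgres also ships a built-in [`CUME_DIST`](https://www.postgresql.org/docs/13/functions-window.html) window function that returns this fraction directly; a minimal sketch, assuming a table `<schema>.<table>` with a numeric column `<quantity>` (placeholder names, not part of the demo data below):

```sql
SELECT
    <quantity>,
    -- fraction of rows with a value less than or equal to this row's value
    CUME_DIST() OVER (ORDER BY <quantity>) AS cume_dist
FROM <schema>.<table>;
```

Note that `CUME_DIST` treats tied values as peers, so its output can differ slightly from the manual `ROW_NUMBER`-based approach.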
6 | One method for calculating it is as follows: 7 | 8 | ```sql 9 | SELECT 10 | -- If necessary, use a row_number window function to get the position of this row in the dataset 11 | (ROW_NUMBER() OVER (ORDER BY )) / (SELECT COUNT(*) FROM ) AS cume_dist, 12 | , 13 | FROM 14 | ``` 15 | where 16 | - `quantity` - the column containing the metric of interest 17 | - `schema.table` - the table with your data 18 | 19 | # Example 20 | 21 | ```sql 22 | WITH data AS ( 23 | SELECT md5(RANDOM()::TEXT) AS username_hash, 24 | RANDOM() AS random_number, 25 | row_number 26 | FROM GENERATE_SERIES(1, 10) AS row_number 27 | ) 28 | 29 | SELECT 30 | --we do not need to use a row_number window function as we have a generate_series in our test data set 31 | username_hash, 32 | (row_number::FLOAT / (SELECT COUNT(*) FROM data)) AS frac, 33 | random_number 34 | FROM data; 35 | ``` 36 | | username_hash | frac | random_number | 37 | | ----- | ----- | ----- | 38 | | f9689609b388f96ac0846c69d4b916dc | 0.1 | 0.48311887726926983 | 39 | | 68551a63b1eb24c750408dc20c6e4510 | 0.2 | 0.7580949364508456 | 40 | | 86b915387b7842c33c52e71ada5488d2 | 0.3 | 0.9461365697398278 | 41 | | 2f9b8d13471f5f8896c073edc4d712a8 | 0.4 | 0.2723711331889973 | 42 | | 27cd443183f886105001f337d987bdaa | 0.5 | 0.2698584751712403 | 43 | | 370ce4710f9900fa38f99494d50b9275 | 0.6 | 0.011248462980887552 | 44 | | 98d2b23c493e44a395a330ca2d349a47 | 0.7 | 0.7387236527452572 | 45 | | 328ac83d33498f81a6b1e70fb8e0dcbb | 0.8 | 0.5815637247802528 | 46 | | 9426739e865d970e7c932c37714b32f0 | 0.9 | 0.5957734211030683 | 47 | | 452a9d8f21c9773fbd5d832138d0d4d1 | 1 | 0.9090960753850368 | 48 | -------------------------------------------------------------------------------- /postgres/dt-formatting.md: -------------------------------------------------------------------------------- 1 | # Formatting Dates and Timestamps 2 | 3 | # Description 4 | Formatting dates in Postgres requries memorizing a long list of [patterns](https://www.postgresql.org/docs/9.1/functions-formatting.html). These snippets contain the most common formatts for quick access. 
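As a quick illustration of the `TO_CHAR(value, pattern)` call that all of the snippets below use (the pattern here is just an arbitrary example):

```sql
-- returns something like '05 Mar 2021'
SELECT to_char(current_date, 'DD Mon YYYY') AS formatted_date;
```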
5 | 6 | # For Dates: 7 | ```sql 8 | with dates as (SELECT day::DATE 9 | FROM generate_series('2021-01-01'::DATE, '2021-01-04'::DATE, '1 day') AS day) 10 | 11 | select 12 | day, 13 | to_char(day,'D/ID/day/dy') day_of_week, --[D]Sunday = 1 / [ID]Monday = 1; can change caps for day to be DAY or Day 14 | to_char(day, 'Month/MM/mon') all_month_formats, -- mon can be MON, or Mon depending on how you want the results capitalized 15 | to_char(day, 'YYYY/YY') year_formats, -- formatting years 16 | to_char(day,'DDD/IDDD') day_of_year, -- day of year / IDDD: day of ISO 8601 week-numbering year (001-371; day 1 of the year is Monday of the first ISO week) 17 | to_char(day,'J') as julian_day, -- days since November 24, 4714 BC at midnight 18 | to_char(day, 'DD') day_of_month, -- day of the month 19 | to_char(day,'WW/IW') week_of_year, -- week of year; IW: week number of ISO 8601 week-numbering year (01-53; the first Thursday of the year is in week 1) 20 | to_char(day,'W') week_of_month, -- week of month (1-4), 21 | to_char(day,'Q') quarter, 22 | to_char(day,'Mon DD YYYY') as american_date, -- Month/Day/Year format 23 | to_char(day,'DD Mon YYYY') as international_date -- Day/Month/Year 24 | from dates 25 | ``` 26 | |day|day_of_week|all_month_formats|year_formats|day_of_year|julian_day|day_of_month| week_of_year| week_of_month| quarter| american_date| international_date| 27 | |---|-----------|-----------------|------------|-----------|----------|------------|-------------|--------------|--------|--------------|-------------------| 28 | |01/01/01|6/5/friday /fri|January /01/jan|2021/21|001/369|2459216|01|01/53|1|1|Jan 01 2021|01 Jan 2021| 29 | |02/01/01|7/6/saturday /sat|January /01/jan|2021/21|002/370|2459217|02|01/53|1|1|Jan 02 2021|02 Jan 2021| 30 | |03/01/01|1/7/sunday /sun|January /01/jan|2021/21|003/371|2459218|03|01/53|1|1|Jan 03 2021|03 Jan 2021| 31 | |04/01/01|2/1/monday /mon|January /01/jan|2021/21|004/001|2459219|04|01/01|1|1|Jan 04 2021|04 Jan 2021| 32 | 33 | 34 | # For Timestamps: 35 | ```sql 36 | with timestamps as ( 37 | SELECT timeseries_hour 38 | FROM generate_series( 39 | timestamp with time zone '2021-01-01 00:00:00+02', 40 | timestamp with time zone '2021-01-01 15:00:00+02', 41 | '90 minute') AS timeseries_hour 42 | ) 43 | 44 | select 45 | timeseries_hour, 46 | to_char(timeseries_hour,'HH/HH24') as hour, --HH: 01-12; H24:0-24 47 | to_char(timeseries_hour,'MI') as minute, 48 | to_char(timeseries_hour,'SS') as seconds, 49 | to_char(timeseries_hour,'MS') as milliseconds, 50 | to_char(timeseries_hour,'US') as microseconds, 51 | to_char(timeseries_hour,'SSSS') as seconds_after_midnight, 52 | to_char(timeseries_hour,'AM') as am_pm, 53 | to_char(timeseries_hour,'TZ') as timezone, 54 | to_char(timeseries_hour,'HH:MI AM') as local_time 55 | from timestamps 56 | ``` 57 | |timeseries_hour|hour|minute|seconds|milliseconds|microseconds|seconds_after_midnight|am_pm|timezone|local_time| 58 | |---------------|----|------|-------|------------|------------|----------------------|-----|--------|----------| 59 | |2020-12-31T22:00:00.000Z|10/22|00|00|000|000000|79200|PM|UTC|10:00 PM| 60 | |2020-12-31T23:30:00.000Z|11/23|30|00|000|000000|84600|PM|UTC|11:30 PM| 61 | |2021-01-01T01:00:00.000Z|01/01|00|00|000|000000|3600|AM|UTC|01:00 AM| 62 | |2021-01-01T02:30:00.000Z|02/02|30|00|000|000000|9000|AM|UTC|02:30 AM| 63 | |2021-01-01T04:00:00.000Z|04/04|00|00|000|000000|14400|AM|UTC|04:00 AM| 64 | |2021-01-01T05:30:00.000Z|05/05|30|00|000|000000|19800|AM|UTC|05:30 AM| 65 | 
|2021-01-01T07:00:00.000Z|07/07|00|00|000|000000|25200|AM|UTC|07:00 AM| 66 | |2021-01-01T08:30:00.000Z|08/08|30|00|000|000000|30600|AM|UTC|08:30 AM| 67 | |2021-01-01T10:00:00.000Z|10/10|00|00|000|000000|36000|AM|UTC|10:00 AM| 68 | |2021-01-01T11:30:00.000Z|11/11|30|00|000|000000|41400|AM|UTC|11:30 AM| 69 | |2021-01-01T13:00:00.000Z|01/13|00|00|000|000000|46800|PM|UTC|01:00 PM| 70 | -------------------------------------------------------------------------------- /postgres/generate-timeseries.md: -------------------------------------------------------------------------------- 1 | # Generate a timeseries of dates and/or times 2 | View an interactive version of this snippet [here](https://count.co/n/Du9nUudW9MH?vm=e). 3 | 4 | # Description 5 | 6 | In order to conduct timeseries analysis, we need to have a complete timeseries for the data in question. Often, we cannot be certain that every day/hour etc. occurs naturally in our dataset. 7 | Therefore, we to be able to generate a series of dates and/or times to base our query on. 8 | The [`GENERATE_SERIES`](https://www.postgresql.org/docs/13/functions-srf.html) function will allow us to do this: 9 | 10 | ```sql 11 | --generate timeseries in hourly increments 12 | SELECT timeseries_hour 13 | FROM generate_series('2021-01-01 00:00:00'::TIMESTAMP, '2021-01-01 03:00:00'::TIMESTAMP, '1 hour') AS timeseries_hour; 14 | ``` 15 | | timeseries_hour | 16 | |---| 17 | | 2021-01-01 00:00:00.000000 | 18 | | 2021-01-01 01:00:00.000000 | 19 | | 2021-01-01 02:00:00.000000 | 20 | | 2021-01-01 03:00:00.000000 | 21 | 22 | ```sql 23 | --generate timeseries in daily increments 24 | SELECT timeseries_day::DATE 25 | FROM generate_series('2021-01-01'::DATE, '2021-01-04'::DATE, '1 day') AS timeseries_day; 26 | ``` 27 | 28 | | timeseries_day | 29 | |---| 30 | | 2021-01-01 | 31 | | 2021-01-02 | 32 | | 2021-01-03 | 33 | | 2021-01-04 | 34 | -------------------------------------------------------------------------------- /postgres/get-last-array-element.md: -------------------------------------------------------------------------------- 1 | # Get the last element of an array 2 | 3 | Explore this snippet [here](https://count.co/n/0loHJW60YO8?vm=e). 4 | 5 | # Description 6 | Use the `array_length` function to access the final element of an array. This function takes two arguments, an array and a dimension index. For common one-dimensional arrays, the dimension index is always 1. 7 | 8 | ```sql 9 | with data as (select array[1,2,3,4] as nums) 10 | 11 | select nums[array_length(nums, 1)] from data 12 | ``` -------------------------------------------------------------------------------- /postgres/histogram-bins.md: -------------------------------------------------------------------------------- 1 | # Histogram Bins 2 | View an interactive version of this snippet [here](https://count.co/n/27wc3pvc5R0?vm=e). 3 | 4 | # Description 5 | Creating histograms using SQL can be tricky. This template will let you quickly create the bins needed to create a histogram: 6 | 7 | ```sql 8 | SELECT 9 | COUNT(1) / (SELECT COUNT(1) FROM ) AS percentage_of_results 10 | FLOOR(/ ) * AS bin 11 | FROM 12 | 13 | GROUP BY 14 | bin 15 | ``` 16 | where: 17 | - `` is your column you want to turn into a histogram 18 | - `` is the width of each bin. You can adjust this to get the desired level of detail for the histogram. 19 | 20 | # Example 21 | Adjust the bin-size until you can see the 'shape' of the data. The following example finds the percentage of fruits with 5-point review groupings. 
The reviews are on a 0-100 scale: 22 | 23 | ```sql 24 | WITH data AS ( 25 | SELECT CASE 26 | WHEN RANDOM() <= 0.25 THEN 'tennessee' 27 | WHEN RANDOM() <= 0.5 THEN 'colonel' 28 | WHEN RANDOM() <= 0.75 THEN 'pikesville' 29 | ELSE 'michters' 30 | END AS whiskey, 31 | CASE 32 | WHEN RANDOM() <= 0.20 THEN 1 33 | WHEN RANDOM() <= 0.40 THEN 2 34 | WHEN RANDOM() <= 0.60 THEN 3 35 | WHEN RANDOM() <= 0.80 THEN 4 36 | ELSE 5 37 | END AS star_rating, 38 | review_date 39 | FROM generate_series('2021-01-01'::TIMESTAMP,'2021-01-31'::TIMESTAMP, '3 DAY') AS review_date 40 | ) 41 | 42 | SELECT 43 | COUNT(*) / (SELECT COUNT(1)::FLOAT FROM data) AS percentage_of_results, 44 | FLOOR(data.star_rating / 1) * 1 bin 45 | FROM data 46 | GROUP BY bin 47 | ``` 48 | | percentage_of_results | bin | 49 | | ----- | ----- | 50 | | 0.2727272727272727 | 3 | 51 | | 0.09090909090909091 | 5 | 52 | | 0.18181818181818182 | 1 | 53 | | 0.45454545454545453 | 2 | 54 | 55 | -------------------------------------------------------------------------------- /postgres/json-strings.md: -------------------------------------------------------------------------------- 1 | # Extracting values from JSON strings 2 | 3 | Explore this snippet [here](https://count.co/n/woUuvi105YA?vm=e). 4 | 5 | # Description 6 | Postgres has supported lots of JSON functionality for a while now - see some examples in the [official documentation](https://www.postgresql.org/docs/current/functions-json.html). 7 | ## Operators 8 | There are several operators that act as accessors on JSON data types 9 | 10 | ```sql 11 | with json as (select '{ 12 | "a": 1, 13 | "b": "bee", 14 | "c": [ 15 | 4, 16 | 5, 17 | { "d": [6, 7] } 18 | ] 19 | }' as text) 20 | 21 | select 22 | text::json as root, -- Cast text to JSON 23 | text::json->'a' as a, -- Extract number as string 24 | text::json->'b' as b, -- Extract string 25 | text::json->'c' as c, -- Extract object 26 | text::json->'c'->0 as c_first_str, -- Extract array element as string 27 | text::json->'c'->>0 as c_first_num, -- Extract array element as number 28 | text::json#>'{c, 2}' as c_third, -- Extract deeply-nested object 29 | text::json#>'{c, 2}'->'d' as d -- Extract field from deeply-nested object 30 | from json 31 | ``` 32 | 33 | | root | a | b | c | c_first_str | c_first_num | c_third | d | 34 | | --------------------------------------------------- | - | ----- | ----------------------- | ----------- | ----------- | --------------- | ------ | 35 | | { "a": 1, "b": "bee", "c": [4, 5, { "d": [6, 7] }]} | 1 | "bee" | [4, 5, { "d": [6, 7] }] | 4 | 4 | { "d": [6, 7] } | [6, 7] | 36 | 37 | 38 | ## JSON_EXTRACT_PATH 39 | As an alternative to the operator approach, the JSON_EXTRACT_PATH function works by specifying the keys of the JSON fields to be extracted: 40 | 41 | ```sql 42 | select 43 | json_extract_path('{ 44 | "a": 1, 45 | "b": "bee", 46 | "c": [ 47 | 4, 48 | 5, 49 | { "d": [6, 7] } 50 | ] 51 | }', 'c', '2', 'd', '0') 52 | ``` 53 | 54 | | json_extract_path | 55 | | ----------------- | 56 | | 6 | -------------------------------------------------------------------------------- /postgres/last-x-days.md: -------------------------------------------------------------------------------- 1 | # Filter for last X days 2 | 3 | ## Description 4 | Often you only want to return the last X days of a query (e.g. last 30 days) and you want that to always be up to date (dynamic). 5 | 6 | 7 | This snippet can be added to your WHERE clause to filter for the last X days: 8 | 9 | ``` sql 10 | SELECT 11 | ,..., 12 | 13 | FROM 14 |
15 | WHERE 16 | > current_date - interval 'X' day 17 | ``` 18 | where: 19 | 20 | 21 | - ` ,...` are all the columns you want to return in your query 22 | - `` is your date column you want to filter on 23 | - `X` is the number of days you want to filter on 24 | 25 | 26 | For reference see Postgres's [date/time Functions and Operators](https://www.postgresql.org/docs/8.2/functions-datetime.html). 27 | 28 | ```sql 29 | select 30 | day, 31 | rank, 32 | title, 33 | artist, 34 | streams 35 | from 36 | "public".spotify_daily 37 | where 38 | day > current_date - interval '180' day 39 | and artist = 'Drake' 40 | ``` 41 | |day|rank|title|artist|streams| 42 | |---|----|-----|------|-------| 43 | |06/04/2021|46|What's Next|Drake|1,524,267| 44 | |...|...|...|...|...| 45 | -------------------------------------------------------------------------------- /postgres/median.md: -------------------------------------------------------------------------------- 1 | # Calculating the Median 2 | View an interactive version of this snippet [here](https://count.co/n/QH2mMBK2RJu?vm=e). 3 | 4 | # Description 5 | Postgres does not have an explicit `MEDIAN` function. The following snippet will show you how to calculate the median of any column. 6 | 7 | # Snippet ✂️ 8 | Here is a median function with a sample dataset: 9 | 10 | ```sql 11 | WITH data AS ( 12 | SELECT CASE 13 | WHEN RANDOM() <= 0.25 THEN 'apple' 14 | WHEN RANDOM() <= 0.5 THEN 'banana' 15 | WHEN RANDOM() <= 0.75 THEN 'pear' 16 | ELSE 'orange' 17 | END AS fruit, 18 | RANDOM()::DOUBLE PRECISION AS weight 19 | FROM generate_series(1,10) 20 | ) 21 | 22 | SELECT 23 | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY weight) AS median 24 | FROM 25 | data 26 | ``` 27 | | median | 28 | | ----------- | 29 | | 0.1358146 | 30 | -------------------------------------------------------------------------------- /postgres/moving-average.md: -------------------------------------------------------------------------------- 1 | # Moving Average 2 | View an interactive version of this snippet [here](https://count.co/n/u5jbJD3MDX6?vm=e). 3 | 4 | # Description 5 | Moving averages are quite simple to calculate in Postgres, using the `AVG` window function. 
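The general shape of the query is the following sketch, using placeholder names in the same style as the other snippets:

```sql
SELECT
    <date_column>,
    <value_column>,
    AVG(<value_column>) OVER (
        -- optionally add PARTITION BY <dimension_column> here
        ORDER BY <date_column>
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -- current row plus 6 preceding rows = 7-row window
    ) AS mov_avg
FROM <table>;
```

Adjust the `ROWS BETWEEN` clause to change the window length.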
Here is an example query with a 7-day moving average over a total of 31 days, calculated for 1 dimension (colour): 6 | 7 | ```sql 8 | WITH data AS ( 9 | SELECT *, 10 | CASE WHEN num <= 0.5 THEN 'blue' ELSE 'red' END AS str 11 | FROM (SELECT generate_series('2021-01-01'::TIMESTAMP, '2021-01-31'::TIMESTAMP, '1 day') AS timeseries, 12 | RANDOM() AS num 13 | ) t 14 | ) 15 | 16 | SELECT timeseries, 17 | str, 18 | num, 19 | AVG(num) OVER (PARTITION BY str ORDER BY timeseries ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS mov_avg 20 | FROM data 21 | ORDER BY timeseries, str; 22 | ``` 23 | 24 | timeseries|str|num|mov_avg 25 | | --- | ----------- | ------ |-------- | 26 | 2021-01-01 00:00:00.000000|red|0.6724503329790004|0.6724503329790004 27 | 2021-01-02 00:00:00.000000|blue|0.02015057538457654|0.02015057538457654 28 | 2021-01-03 00:00:00.000000|blue|0.19436280964817954|0.10725669251637804 29 | 2021-01-04 00:00:00.000000|blue|0.3170075569498252|0.17717364732752708 30 | 2021-01-05 00:00:00.000000|blue|0.45274767058829823|0.24606715314271987 31 | 2021-01-06 00:00:00.000000|red|0.5295068450998173|0.6009785890394088 32 | 2021-01-07 00:00:00.000000|red|0.5856947203861544|0.5958839661549907 33 | 2021-01-08 00:00:00.000000|blue|0.23106543649677036|0.24306680981352996 34 | 2021-01-09 00:00:00.000000|red|0.8057297084321|0.648345401724268 35 | 2021-01-10 00:00:00.000000|blue|0.18782650473713147|0.23386009230079688 36 | 2021-01-11 00:00:00.000000|blue|0.012715414005509018|0.20226799540147006 37 | 2021-01-12 00:00:00.000000|blue|0.41378600069740656|0.22870774606346211 38 | 2021-01-13 00:00:00.000000|red|0.7848015045167251|0.6756366222827594 39 | 2021-01-14 00:00:00.000000|blue|0.3062660702635327|0.26447218292333163 40 | 2021-01-15 00:00:00.000000|red|0.7321478245155468|0.685055155988224 41 | 2021-01-16 00:00:00.000000|blue|0.20009039295143083|0.26518813083623805 42 | 2021-01-17 00:00:00.000000|red|0.9851023962729109|0.7279190474574649 43 | 2021-01-18 00:00:00.000000|blue|0.42790603927394955|0.2790504411267536 44 | 2021-01-19 00:00:00.000000|red|0.7966662316864124|0.7365124454860834 45 | 2021-01-20 00:00:00.000000|red|0.8913488809896606|0.7638747639874159 46 | 2021-01-21 00:00:00.000000|red|0.6797595409599531|0.7826563509699329 47 | 2021-01-22 00:00:00.000000|red|0.9692422752366809|0.8305997953262487 48 | 2021-01-23 00:00:00.000000|blue|0.3873783130847208|0.2708792714388064 49 | 2021-01-24 00:00:00.000000|red|0.960779246012315|0.8499809875237756 50 | 2021-01-25 00:00:00.000000|red|0.6482284656189883|0.8329093576615585 51 | 2021-01-26 00:00:00.000000|red|0.8471274360131815|0.8472818090987628 52 | 2021-01-27 00:00:00.000000|red|0.5269376445697667|0.7900112151358698 53 | 2021-01-28 00:00:00.000000|blue|0.07696316291711724|0.2516164872413498 54 | 2021-01-29 00:00:00.000000|blue|0.1035287644690932|0.241079269707845 55 | 2021-01-30 00:00:00.000000|red|0.7339086567504438|0.7821665182688737 56 | 2021-01-31 00:00:00.000000|red|0.8169741358988709|0.772869675132525 57 | 58 | -------------------------------------------------------------------------------- /postgres/ntile.md: -------------------------------------------------------------------------------- 1 | # Creating equal-sized buckets using ntile 2 | View an interactive version of this snippet [here](https://count.co/n/IgJwlWSujF0?vm=e). 3 | 4 | # Description 5 | When you want to create equal sized groups (i.e. buckets) of rows, there is a SQL function for that too. 
6 | The function [NTILE](https://www.postgresql.org/docs/13/functions-window.html) lets you do exactly that: 7 | 8 | ```sql 9 | WITH users AS ( 10 | SELECT md5(RANDOM()::TEXT) AS name_hash, 11 | CASE WHEN RANDOM() <= 0.5 THEN 'male' ELSE 'female' END AS gender, 12 | CASE WHEN RANDOM() <= 0.33 THEN 'student' WHEN RANDOM() <= 0.66 THEN 'professor' ELSE 'alumni' END AS status 13 | FROM generate_series(1, 10) 14 | ) 15 | 16 | SELECT 17 | name_hash, 18 | gender, 19 | status, 20 | NTILE(3) OVER(PARTITION BY status ORDER BY status) AS ntile_bucket 21 | FROM 22 | users; 23 | ``` 24 | 25 | |name_hash|gender|status|ntile_bucket| 26 | |------|------|------|------| 27 | |8c9882e8cc6f748c06b3a20ee7af9787|female|alumni|1| 28 | |032dffdb7cb67a5f1a6b77b713043508|female|alumni|2| 29 | |72559a7e25935004d172329a47cb3db5|male|alumni|3| 30 | |284d1337a3d30433fe17ad2110d27655|male|professor|1| 31 | |3797bb418424d7ced117e8817f398fea|male|professor|2| 32 | |e56d1e6e53f12c6a133c4f47a962754c|female|professor|3| 33 | |8713e6b5a1c359f42e960c05460c8e14|male|student|1| 34 | |af8ea7ea643fbbfefe7c845e77c9e4d3|male|student|1| 35 | |033161ae441c347c4aa43acd46347836|female|student|2| 36 | |a476b8c2cfa8df1c9494c9851cbd4e66|female|student|3| 37 | 38 | -------------------------------------------------------------------------------- /postgres/rank.md: -------------------------------------------------------------------------------- 1 | # Ranking Data 2 | 3 | # Description 4 | A common requirement is the need to rank a particular row in a group by some quantity - for example, 'rank stores by total sales in each region'. Fortunately, Postgres natively has a `rank` window function to do this. 5 | The syntax looks like this: 6 | 7 | ```sql 8 | SELECT 9 | RANK() OVER(PARTITION BY ORDER BY ) AS RANK 10 | FROM 11 | ``` 12 | where 13 | - `fields` - zero or more columns to split the results by - a rank will be calculated for each unique combination of values in these columns 14 | - `rank_by` - a column to decide the ranking 15 | - `schema.table` - where to pull these columns from 16 | 17 | # Example 18 | 19 | We want to find out, for our random data sample of fruits, which fruit is most popular. 20 | 21 | ```sql 22 | WITH data AS ( 23 | SELECT CASE 24 | WHEN RANDOM() <= 0.25 THEN 'apple' 25 | WHEN RANDOM() <= 0.5 THEN 'banana' 26 | WHEN RANDOM() <= 0.75 THEN 'pear' 27 | ELSE 'orange' 28 | END AS fruit, 29 | RANDOM() + (series^RANDOM()) AS popularity 30 | FROM generate_series(1, 10) AS series 31 | ) 32 | 33 | SELECT 34 | fruit, 35 | popularity, 36 | RANK() OVER(ORDER BY popularity DESC) AS popularity_rank 37 | FROM data 38 | ``` 39 | 40 | | fruit | popularity | popularity_rank | 41 | | ----- | ----- | ----- | 42 | | banana | 3.994866211350613 | 1 | 43 | | banana | 3.2802621307398208 | 2 | 44 | | banana | 2.626913466142052 | 3 | 45 | | pear | 2.59071184223806 | 4 | 46 | | banana | 2.137745293125355 | 5 | 47 | | apple | 2.0741654350518637 | 6 | 48 | | apple | 2.0401770856253933 | 7 | 49 | | pear | 1.771223675755631 | 8 | 50 | | banana | 1.743055659082957 | 9 | 51 | | pear | 1.7158950640518675 | 10 | 52 | 53 | -------------------------------------------------------------------------------- /postgres/regex-email.md: -------------------------------------------------------------------------------- 1 | # RegEx: Validate Email Addresses 2 | View an interactive version of this snippet [here](https://count.co/n/QLRHXD9dVRG?vm=e). 
3 | 4 | 5 | # Description 6 | The only real way to validate an email address is to send a request to it and observe the response, but for the purposes of analytics it's often useful to strip out rows including malformed email addresses. 7 | Using the `SUBSTRING` function, the email validation RegEx comes courtesy of [Evan Carroll](https://dba.stackexchange.com/questions/68266/what-is-the-best-way-to-store-an-email-address-in-postgresql/165923#165923): 8 | 9 | # Snippet ✂️ 10 | 11 | ```sql 12 | WITH data AS ( 13 | SELECT UNNEST(ARRAY[ 14 | 'a', 15 | 'hello@count.c', 16 | 'hello@count.co', 17 | '@count.co', 18 | 'a@a.a' 19 | ]) AS emails 20 | ) 21 | 22 | SELECT 23 | emails, 24 | CASE WHEN 25 | SUBSTRING( 26 | LOWER(emails), 27 | '^[a-zA-Z0-9.!#$%&''''*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$' 28 | ) IS NOT NULL 29 | THEN TRUE 30 | ELSE FALSE 31 | END AS is_valid 32 | FROM data 33 | ``` 34 | 35 | | emails | is_valid | 36 | | ----- | ----- | 37 | | a | false | 38 | | hello@count.c | true | 39 | | hello@count.co | true | 40 | | @count.co | false | 41 | | a@a.a | true | 42 | -------------------------------------------------------------------------------- /postgres/regex-parse-url.md: -------------------------------------------------------------------------------- 1 | # RegEx: Parse URL String 2 | View an interactive version of this snippet [here](https://count.co/n/rpoUsuw0s3r?vm=e). 3 | 4 | 5 | # Description 6 | URL strings are a very valuable source of information, but trying to get your RegEx exactly right is a pain. 7 | This re-usable snippet can be applied to any URL to extract the exact part you're after. 8 | 9 | # Snippet ✂️ 10 | 11 | ```sql 12 | SELECT 13 | SUBSTRING(my_url, '^([a-zA-Z]+)://') AS protocol, 14 | SUBSTRING(my_url, '(?:[a-zA-Z]+:\/\/)?([a-zA-Z0-9.]+)\/?') AS host, 15 | SUBSTRING(my_url, '(?:[a-zA-Z]+://)?(?:[a-zA-Z0-9.]+)/{1}([a-zA-Z0-9./]+)') AS path, 16 | SUBSTRING(my_url, '\?(.*)') AS query, 17 | SUBSTRING(my_url, '#(.*)') AS ref 18 | FROM 19 | ``` 20 | where: 21 | - `` is your URL string (e.g. `"https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#Ref1"` 22 | - `protocol` is how the data is transferred between the host and server (e.g. `https`) 23 | - `host` is the domain name (e.g. `yoursite.com`) 24 | - `path` is the page path on the website (e.g. `pricing/details`) 25 | - `query` is the string of query parameters in the url (e.g. `myparam1=123`) 26 | - `ref` is the reference or where the user came from before visiting the url (e.g. 
`#newsfeed`) 27 | 28 | # Usage 29 | To use the snippet, just copy out the RegEx for whatever part of the URL you need - or capture them all as in the example below: 30 | 31 | ```sql 32 | WITH data as 33 | (SELECT 'https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#newsfeed' AS my_url) 34 | 35 | SELECT 36 | SUBSTRING(my_url, '^([a-zA-Z]+)://') AS protocol, 37 | SUBSTRING(my_url, '(?:[a-zA-Z]+:\/\/)?([a-zA-Z0-9.]+)\/?') AS host, 38 | SUBSTRING(my_url, '(?:[a-zA-Z]+://)?(?:[a-zA-Z0-9.]+)/{1}([a-zA-Z0-9./]+)') AS path, 39 | SUBSTRING(my_url, '\?(.*)') AS query, 40 | SUBSTRING(my_url, '#(.*)') AS ref 41 | FROM data 42 | ``` 43 | | protocol | host | path | query | ref | 44 | | --- | ------ | ----- | --- | ------| 45 | | https | www.yoursite.com | pricing/details | myparam1=123&myparam2=abc#newsfeed | newsfeed | 46 | 47 | # References Helpful Links 48 | - [The Anatomy of a Full Path URL](https://zvelo.com/anatomy-of-full-path-url-hostname-protocol-path-more/) 49 | -------------------------------------------------------------------------------- /postgres/replace-empty-strings-null.md: -------------------------------------------------------------------------------- 1 | # Replace empty strings with NULLs 2 | 3 | Explore this snippet [here](https://count.co/n/Fzyv9i4SP2B?vm=e). 4 | 5 | # Description 6 | 7 | An essential part of cleaning a new data source is deciding how to treat missing values. The [`CASE`](https://www.postgresql.org/docs/current/functions-conditional.html#FUNCTIONS-CASE) statement can help with replacing empty values with something else: 8 | 9 | ```sql 10 | WITH data AS ( 11 | SELECT * 12 | FROM (VALUES ('a'), ('b'), (''), ('d')) AS data (str) 13 | ) 14 | 15 | SELECT 16 | CASE WHEN LENGTH(str) != 0 THEN str END AS str 17 | FROM data; 18 | ``` 19 | 20 | | str | 21 | | ---- | 22 | | a | 23 | | b | 24 | | NULL | 25 | | d | -------------------------------------------------------------------------------- /postgres/replace-null.md: -------------------------------------------------------------------------------- 1 | # Replace NULLs 2 | 3 | Explore this snippet [here](https://count.co/n/hpKKsv0U2Rv?vm=e). 4 | 5 | # Description 6 | 7 | An essential part of cleaning a new data source is deciding how to treat NULL values. The Postgres function [COALESCE](https://www.postgresql.org/docs/current/functions-conditional.html) can help with replacing NULL values with something else: 8 | 9 | ```sql 10 | with data as ( 11 | select * from (values 12 | (1, 'one'), 13 | (null, 'two'), 14 | (3, null) 15 | ) as data (num, str) 16 | ) 17 | 18 | select 19 | coalesce(num, -1) num, 20 | coalesce(str, 'I AM NULL') str 21 | from data 22 | ``` 23 | 24 | | num | str | 25 | | --- | --------- | 26 | | 1 | a | 27 | | -1 | b | 28 | | 3 | I AM NULL | -------------------------------------------------------------------------------- /postgres/split-column-to-rows.md: -------------------------------------------------------------------------------- 1 | # Split a single column into separate rows 2 | View an interactive version of this snippet [here](https://count.co/n/lF1dUz66TuF?vm=e). 3 | 4 | 5 | # Description 6 | 7 | You will often see multiple values, separated by a single character, appear in a single column. 
8 | If you want to split them out but instead of having separate columns, generate rows for each value, you can use the function [`REGEXP_SPLIT_TO_TABLE`](https://www.postgresql.org/docs/13/functions-string.html): 9 | 10 | ```sql 11 | WITH data AS ( 12 | SELECT * 13 | FROM (VALUES ('yellow;green;blue'), ('orange;red;grey;black')) AS data (str) 14 | ) 15 | 16 | SELECT 17 | REGEXP_SPLIT_TO_TABLE(str,';') str_split 18 | FROM data; 19 | ``` 20 | _The separator can be any single character (i.e. ',' or /) or something more complex like a string (to&char123), as the function uses Regular Expressions._ 21 | 22 | | str_split | 23 | | ---- | 24 | | yellow | 25 | | green | 26 | | blue | 27 | | orange | 28 | | red | 29 | | grey | 30 | | black | 31 | -------------------------------------------------------------------------------- /snowflake/100-stacked-bar.md: -------------------------------------------------------------------------------- 1 | # 100% bar chart 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/C6BCLP94YLs?vm=e). 4 | 5 | # Description 6 | 7 | 100% bar charts compare the proportions of a metric across different groups. 8 | To build one you'll need: 9 | - A numerical column as the metric you want to compare. This should be represented as a percentage. 10 | - 2 categorical columns. One for the x-axis and one for the coloring of the bars. 11 | 12 | ```sql 13 | SELECT 14 | AGG_FN() as metric, 15 | as x, 16 | as color, 17 | FROM 18 |
19 | GROUP BY 20 | x, color 21 | ``` 22 | 23 | where: 24 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 25 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column. Must add up to 1 for each bar. 26 | - `XAXIS_COLUMN` is the group you want to show on the x-axis. Make sure this is a date, datetime, timestamp, or time column (not a number) 27 | - `COLOR_COLUMN` is the group you want to show as different colors of your bars. 28 | 29 | # Usage 30 | 31 | In this example with some London weather data, we calculate the percentage of days in each month that are of each weather type (e.g. rain, sun). 32 | 33 | ```sql 34 | -- a 35 | select 36 | count(DATE) days, 37 | month(date) month, 38 | WEATHER 39 | from PUBLIC.LONDON_WEATHER 40 | group by month, WEATHER 41 | order by month 42 | ``` 43 | 44 | | DAYS | MONTH | WEATHER | 45 | | ---- | ----- | ------- | 46 | | 33 | 1 | sunny | 47 | | 16 | 1 | fog | 48 | | 74 | 1 | rain | 49 | | ... | ... | ... | 50 | 51 | ```sql 52 | select 53 | count(DATE) days, 54 | month(date) month 55 | from PUBLIC.LONDON_WEATHER 56 | group by month 57 | order by month 58 | ``` 59 | | DAYS | MONTH | 60 | | ---- | ----- | 61 | | 123 | 1 | 62 | | 113 | 2 | 63 | | 123 | 3 | 64 | | 120 | 4 | 65 | | ... | ... | 66 | 67 | ```sql 68 | -- c 69 | select 70 | "a".WEATHER, 71 | "a".MONTH, 72 | "a".DAYS / "b".DAYS per_days 73 | from 74 | "a" left join "b" on "a".MONTH = "b".MONTH 75 | ``` 76 | 77 | stacked-bar-2 78 | -------------------------------------------------------------------------------- /snowflake/bar-chart.md: -------------------------------------------------------------------------------- 1 | # Bar chart 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/CmSsFFvSJfz?vm=e). 4 | 5 | # Description 6 | 7 | Bar charts are perhaps the most common charts used in data analysis. They have one categorical or temporal axis, and one numerical axis. 8 | You just need a simple GROUP BY to get all the elements you need for a bar chart: 9 | 10 | ```sql 11 | SELECT 12 | AGG_FN() as metric, 13 | as cat 14 | FROM 15 |
16 | GROUP BY 17 | cat 18 | ``` 19 | 20 | where: 21 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 22 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column. 23 | - `CATEGORICAL_COLUMN` is the group you want to show on the x-axis. Make sure this is a date, datetime, timestamp, or time column (not a number) 24 | 25 | # Usage 26 | 27 | In this example with some Spotify data, we'll look at the average daily streams by month of year: 28 | 29 | ```sql 30 | select 31 | avg(streams) avg_daily_streams, 32 | MONTH(day) month 33 | from ( 34 | select sum(streams) streams, 35 | day 36 | from PUBLIC.SPOTIFY_DAILY_TRACKS 37 | group by day 38 | ) 39 | group by month 40 | ``` 41 | bar 42 | -------------------------------------------------------------------------------- /snowflake/count-nulls.md: -------------------------------------------------------------------------------- 1 | # Count NULLs 2 | 3 | Explore this snippet [here](https://count.co/n/E0yQhZaKdmD?vm=e). 4 | 5 | # Description 6 | Part of the data cleaning process involves understanding the quality of your data. NULL values are usually best avoided, so counting their occurrences is a common operation. 7 | There are several methods that can be used here: 8 | - `sum(if( is null, 1, 0)` - use the [`IFF`](https://docs.snowflake.com/en/sql-reference/functions/iff.html) function to return 1 or 0 if a value is NULL or not respectively, then aggregate. 9 | - `count(*) - count()` - use the different forms of the [`count()`](https://docs.snowflake.com/en/sql-reference/functions/count.html) aggregation which include and exclude NULLs. 10 | - `sum(case when x is null then 1 else 0 end)` - similar to the IFF method, but using a [`CASE`](https://docs.snowflake.com/en/sql-reference/functions/case.html) statement instead. 11 | 12 | ```sql 13 | with data as ( 14 | select * from (values (1), (2), (null), (null), (5)) as data (x) 15 | ) 16 | 17 | select 18 | sum(iff(x is null, 1, 0)) with_iff, 19 | count(*) - count(x) with_count, 20 | sum(case when x is null then 1 else 0 end) with_case 21 | from data 22 | ``` 23 | 24 | | WITH_IFF | WITH_COUNT | WITH_CASE | 25 | | -------- | ---------- | --------- | 26 | | 2 | 2 | 2 | -------------------------------------------------------------------------------- /snowflake/filtering-arrays.md: -------------------------------------------------------------------------------- 1 | # Filtering arrays 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/2rWE6XxrMAm?vm=e). 4 | 5 | # Description 6 | 7 | To filter arrays in Snowflake you can use a combination of ARRAY_AGG and FLATTEN: 8 | 9 | ```sql 10 | select 11 | array_agg(value) as filtered 12 | from (select array_construct(1, 2, 3, 4) as x), 13 | lateral flatten(input => x) 14 | where value <= 3 15 | ``` 16 | | FILTERED | 17 | | -------- | 18 | | [1,2,3] | 19 | -------------------------------------------------------------------------------- /snowflake/flatten-array.md: -------------------------------------------------------------------------------- 1 | # Flatten an Array 2 | 3 | 4 | # Description 5 | To flatten an ARRAY in other dialects you could use UNNEST. In Snowflake, that functionality is called [FLATTEN](https://docs.snowflake.com/en/sql-reference/functions/flatten.html). To use it you can do: 6 | 7 | ```sql 8 | SELECT 9 | VALUE 10 | FROM 11 |
<table>, 12 | LATERAL FLATTEN(INPUT=> <array_column>) 13 | ``` 14 | where: 15 | - `<table>
` is your table w/ the array column 16 | - `` is the name of the column with the arrays 17 | > Note: You can select more than VALUE; there are a lot of things returned. See the example below to see. 18 | 19 | ```sql 20 | with demo_arrays as ( 21 | select array_construct(1,4,'a') arr 22 | ) 23 | 24 | select 25 | * --SHOW ALL THE COLUMNS RETURNED W/ FLATTEN. Common to use VALUE. 26 | from demo_arrays, 27 | lateral flatten(input => arr) 28 | ``` 29 | |ARR|SEQ|KEY|PATH|INDEX|VALUE|THIS| 30 | |---|---|---|----|-----|-----|----| 31 | |[1,4,"a"]|1|NULL|[0]|0|1|[1,4,"a"]| 32 | |[1,4,"a"]|1|NULL|[1]|1|4|[1,4,"a"]| 33 | |[1,4,"a"]|1|NULL|[2]|2|"a"|[1,4,"a"]| 34 | 35 | 36 | 37 | # Example: 38 | Here's an array with JSON objects within. We can flatten these objects twice to get to the JSON values. 39 | 40 | ```sql 41 | With demo_data as ( 42 | select array_construct( 43 | parse_json('{ 44 | "AveragePosition": 1, 45 | "Competitor": "test.com.au", 46 | "EstimatedImpressions": 4583, 47 | "SearchTerms": 3, 48 | "ShareOfClicks": "14.25%", 49 | "ShareOfSpend": "1.09%" 50 | }'), 51 | parse_json('{ 52 | "AveragePosition": 1.9, 53 | "Competitor": "businessname.com.au", 54 | "EstimatedImpressions": 10118, 55 | "SearchTerms": 34, 56 | "ShareOfClicks": "6.77%", 57 | "ShareOfSpend": "9.16%" 58 | }')) 59 | as demo_array 60 | ) 61 | 62 | select 63 | value:AveragePosition, 64 | value:Competitor, 65 | value:EstimatedImpressions, 66 | value:SearchTerms, 67 | value:ShareofClicks, 68 | value:ShareofSpend 69 | from ( 70 | select value from demo_data, 71 | lateral flatten(input => demo_array)) 72 | ``` 73 | |VALUE:AVERAGEPOSITION|VALUE:COMPETITOR|VALUE:ESTIMATEDIMPRESSIONS|VALUE:SEARCHTERMS|VALUE:SHAREOFCLICKS|VALUE:SHAREOFSPEND| 74 | |-----|-----|-----|-----|------|------| 75 | |1|"test.com.au"|4583|3|"14.25"|"1.09%"| 76 | |1.9|"businessname.com.au"|10118|34|"6.77%"|"9.16"| 77 | -------------------------------------------------------------------------------- /snowflake/generate-arrays.md: -------------------------------------------------------------------------------- 1 | # Generating arrays 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/PeGMQjXNBBq?vm=e). 4 | 5 | # Description 6 | 7 | Snowflake supports the ARRAY column type - here are a few methods for generating arrays. 8 | 9 | ## Literals 10 | 11 | Arrays can contain most types, and they don't have to have the same supertype. 12 | 13 | ```sql 14 | select 15 | array_construct(1, 2, 3) as num_array, 16 | array_construct('a', 'b', 'c') as str_array, 17 | array_construct(null, 'hello', 3::double, 4, 5) mixed_array, 18 | array_construct(current_timestamp(), current_timestamp()) as timestamp_array 19 | ``` 20 | | NUM_ARRAY | STR_ARRAY | MIXED_ARRAY | TIMESTAMP_ARRAY | 21 | | --------- | ------------- | -------------------- | --------------- | 22 | | [1,2,3] | ["a","b","c"] | [null,"hello",3,4,5] | ["2021-07-02 11:37:36.205 -0400","2021-07-02 11:37:36.205 -0400"] | 23 | 24 | ## GENERATE_* functions 25 | Analogous to `numpy.linspace()`. 26 | See this Snowflake [article](https://community.snowflake.com/s/article/Generate-gap-free-sequences-of-numbers-and-dates) for more details. 
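If you need the generated sequence as a single ARRAY value rather than as rows, the row-generating patterns below can be wrapped in `ARRAY_AGG`; a sketch assuming you want the numbers 1 to 5 as one array (see also the `ARRAY_AGG` section at the end of this page):

```sql
select array_agg(n) within group (order by n) as num_array
from (
  select row_number() over (order by seq4()) as n
  from table(generator(rowcount => 5))
);
```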
27 | 28 | 29 | For generating a sequence of dates without gaps, you can use [GENERATOR](https://docs.snowflake.com/en/sql-reference/functions/generator.html) in combination with [SEQ4](https://docs.snowflake.com/en/sql-reference/functions/seq1.html): 30 | 31 | ```sql 32 | select row_number() over (order by seq4()) consecutive_nums 33 | from table(generator(rowcount => 10)); 34 | ``` 35 | | CONSECUTIVE_NUMS | 36 | | ---------------- | 37 | | 1 | 38 | | 2 | 39 | | 3 | 40 | | 4 | 41 | | 5 | 42 | | 6 | 43 | | 7 | 44 | | 8 | 45 | | 9 | 46 | | 10 | 47 | 48 | Or to generate a sequence of consecutive dates, you can add [TIMEADD](https://docs.snowflake.com/en/sql-reference/functions/timeadd.html): 49 | 50 | ```sql 51 | SELECT TIMEADD('day', row_number() over (order by 1), '2020-08-30')::date date 52 | FROM TABLE(GENERATOR(ROWCOUNT => 10)); 53 | ``` 54 | | DATE | 55 | | ---------- | 56 | | 2020-08-30 | 57 | | 2020-09-01 | 58 | | 2020-09-02 | 59 | | 2020-09-03 | 60 | | 2020-09-04 | 61 | | 2020-09-05 | 62 | | 2020-09-06 | 63 | | 2020-09-07 | 64 | | 2020-09-08 | 65 | | 2020-09-09 | 66 | 67 | ## ARRAY_AGG 68 | Aggregate a column into a single row element. 69 | 70 | ```sql 71 | SELECT ARRAY_AGG(num) 72 | FROM (SELECT 1 num UNION ALL SELECT 2 num UNION ALL SELECT 3 num) 73 | ``` 74 | | ARRAY_AGG(NUM) | 75 | | -------------- | 76 | | [1,2,3] | 77 | -------------------------------------------------------------------------------- /snowflake/geographical-distance.md: -------------------------------------------------------------------------------- 1 | # Distance between longitude/latitude pairs 2 | 3 | Explore this snippet [here](https://count.co/n/nqZbPem1fKi?vm=e). 4 | 5 | # Description 6 | 7 | It is a surprisingly difficult problem to calculate an accurate distance between two points on the Earth, both because of the spherical geometry and the uneven shape of the ground. 8 | Fortunately Snowflake has robust support for geographical primitives - the[`ST_DISTANCE`](https://docs.snowflake.com/en/sql-reference/functions/st_distance.html) function can calculate the distance in metres between two [`ST_POINT`](https://docs.snowflake.com/en/sql-reference/functions/st_makepoint.html)s 9 | 10 | 11 | ```sql 12 | with points as ( 13 | -- Longitudes and latitudes are in degrees 14 | select 10 as from_long, 12 as from_lat, 15 as to_long, 17 as to_lat 15 | ) 16 | 17 | select 18 | st_distance( 19 | st_makepoint(from_long, from_lat), 20 | st_makepoint(to_long, to_lat) 21 | ) as dist_in_metres 22 | from points 23 | ``` 24 | 25 | | DIST_IN_METRES | 26 | | -------------- | 27 | | 773697.017019 | -------------------------------------------------------------------------------- /snowflake/heatmap.md: -------------------------------------------------------------------------------- 1 | # Heat maps 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/Gw3T9qTRRe7?vm=e). 4 | 5 | # Description 6 | 7 | Heatmaps are a great way to visualize the relationship between two variables enumerated across all possible combinations of each variable. To create one, your data must contain 8 | - 2 categorical columns (e.g. not numeric columns) 9 | - 1 numerical column 10 | 11 | ```sql 12 | SELECT 13 | AGG_FN() as metric, 14 | as cat1, 15 | as cat2, 16 | FROM 17 |
18 | GROUP BY 19 | cat1, cat2 20 | ``` 21 | where: 22 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 23 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column. Must add up to 1 for each bar. 24 | - `CAT_COLUMN1` and `CAT_COLUMN2` are the groups you want to compare. One will go on the x axis and the other on the y-axis. Make sure these are categorical or temporal columns and not numerical fields. 25 | 26 | # Usage 27 | 28 | In this example with some Spotify data, we'll see the average daily streams across different days of the week and months of the year. 29 | 30 | ```sql 31 | -- a 32 | select sum(streams) daily_streams, day from PUBLIC.SPOTIFY_DAILY_TRACKS group by day 33 | ``` 34 | 35 | | daily_streams | day | 36 | | ------------- | ---------- | 37 | | 234109876 | 2020-10-25 | 38 | | 243004361 | 2020-10-26 | 39 | | 248233045 | 2020-10-27 | 40 | 41 | ```sql 42 | select 43 | avg(daily_streams) avg_daily_streams, 44 | month(day) month, 45 | dayofweek(day) day_of_week 46 | from "a" 47 | group by 48 | day_of_week, 49 | month 50 | ``` 51 | heatmap2 52 | -------------------------------------------------------------------------------- /snowflake/histogram-bins.md: -------------------------------------------------------------------------------- 1 | # Histogram Bins 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/S6nEvljt64H?vm=e). 4 | 5 | # Description 6 | 7 | Creating histograms using SQL can be tricky. This snippet will let you quickly create the bins needed to create a histogram: 8 | 9 | ```sql 10 | SELECT 11 | COUNT(1) / (SELECT COUNT(1) FROM '
<table>' ) percentage_of_results, 12 | FLOOR(<column>/<bin_size>) * <bin_size> as bin 13 | FROM 14 | '<table>
' 15 | GROUP BY 16 | bin 17 | ``` 18 | where: 19 | - `` is the column to turn into a histogram 20 | - `` is the width of each bin. You can adjust this to get the desired level of detail for the histogram. 21 | 22 | # Usage 23 | 24 | To use, adjust the bin-size until you can see the 'shape' of the data. The following example finds the percentage of days in London by 5-degree temperature groupings. 25 | 26 | ```sql 27 | SELECT 28 | count(*) / (select count(1) from PUBLIC.LONDON_WEATHER) percentage_of_results, 29 | floor(LONDON_WEATHER.TEMP / 5) * 5 bin 30 | from PUBLIC.LONDON_WEATHER 31 | group by bin 32 | order by bin 33 | ``` 34 | | PERCENTAGE_OF_RESULTS | BIN | 35 | | --------------------- | --- | 36 | | 0.001 | 20 | 37 | | 0.001 | 25 | 38 | | ... | ... | 39 | | 0.014 | 75 | 40 | | 0.001 | 80 | 41 | -------------------------------------------------------------------------------- /snowflake/horizontal-bar.md: -------------------------------------------------------------------------------- 1 | # Horizontal bar chart 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/y9YkufdWrwf?vm=e). 4 | 5 | # Description 6 | 7 | Horizontal Bar charts are ideal for comparing a particular metric across a group of values. They have bars along the x-axis and group names along the y-axis. 8 | You just need a simple GROUP BY to get all the elements you need for a horizontal bar chart: 9 | 10 | ```sql 11 | SELECT 12 | AGG_FN() as metric, 13 | as group 14 | FROM 15 |
16 | GROUP BY 17 | group 18 | ``` 19 | where: 20 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 21 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column. 22 | - `GROUP_COLUMN` is the group you want to show on the y-axis. Make sure this is a categorical column (not a number) 23 | 24 | # Usage 25 | 26 | In this example with some Spotify data, we'll compare the total streams by artist: 27 | 28 | ```sql 29 | select 30 | sum(streams) streams, 31 | artist 32 | from PUBLIC.SPOTIFY_DAILY_STREAMS 33 | group by artist 34 | ``` 35 | horizontal-bar 36 | -------------------------------------------------------------------------------- /snowflake/last-array-element.md: -------------------------------------------------------------------------------- 1 | # Get the last element of an array 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/rybQwRV3JJh?vm=e). 4 | 5 | # Description 6 | 7 | You can use [ARRAY_SLICE](https://docs.snowflake.com/en/sql-reference/functions/array_slice.html) and [ARRAY_SIZE](https://docs.snowflake.com/en/sql-reference/functions/array_size.html) to find the last element in an Array in Snowflake: 8 | 9 | ```sql 10 | select 11 | array_slice(nums,-1,array_size(nums)) last_element 12 | from 13 | ( 14 | select array_construct(1, 2, 3, 4) as nums 15 | ) 16 | ``` 17 | | LAST_ELEMENT | 18 | | ------------ | 19 | | [4] | 20 | -------------------------------------------------------------------------------- /snowflake/moving-average.md: -------------------------------------------------------------------------------- 1 | # Moving average 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/vwTzk68QCZ7?vm=e). 4 | 5 | # Description 6 | Moving averages are relatively simple to calculate in Snowflake, using the `avg` window function. The template for the query is 7 | 8 | ```sql 9 | select 10 | avg() over (partition by order by asc rows between preceding and current row) mov_av, 11 | , 12 | 13 | from
14 | ``` 15 | where 16 | - `value` - this is the numeric quantity to calculate a moving average for 17 | - `fields` - these are zero or more columns to split the moving averages by 18 | - `ordering` - this is the column which determines the order of the moving average, most commonly temporal 19 | - `x` - how many rows to include in the moving average 20 | - `table` - where to pull these columns from 21 | 22 | # Example 23 | 24 | Using total Spotify streams as an example data source, let's identify: 25 | - `value` - this is `streams` 26 | - `fields` - this is just `artist` 27 | - `ordering` - this is the `day` column 28 | - `x` - choose 7 for a weekly average 29 | - `table` - this is called `raw` 30 | 31 | then the query is: 32 | 33 | ```sql 34 | -- RAW 35 | select day, sum(streams) streams, artist from PUBLIC.SPOTIFY_DAILY_TRACKS group by 1, 3 36 | ``` 37 | | DAY | STREAMS | ARTIST | 38 | |----------|---------|---------| 39 | |2017-06-01| 509093 | 2 Chainz| 40 | |2017-06-03| 562412 | 2 Chainz| 41 | |2017-06-04| 480350 | 2 Chainz| 42 | 43 | ```sql 44 | select 45 | avg(streams) over (partition by artist order by day asc rows between 7 preceding and current row) mov_av, 46 | artist, 47 | day, 48 | streams 49 | from RAW 50 | ``` 51 | | MOV_AVG | DAY | STREAMS | ARTIST | 52 | |----------|------------|---------|----------| 53 | | 509093 | 2017-06-01 | 509093 | 2 Chainz | 54 | | 535752.5 | 2017-06-03 | 562412 | 2 Chainz | 55 | | 517285 | 2017-06-04 | 480350 | 2 Chainz | 56 | -------------------------------------------------------------------------------- /snowflake/parse-json.md: -------------------------------------------------------------------------------- 1 | # Parsing JSON Strings 2 | 3 | 4 | # Description 5 | Snowflake has a few helpful ways to deal w/ JSON objects. This snippet goes through a few ways to access elements in a JSON object: 6 | 7 | ```sql 8 | SELECT 9 | :.. -- select json element, 10 | :""."", -- select json element if element name has special characters or spaces 11 | :::, -- cast return object to explicit type 12 | :[array_index], -- access one instance of array 13 | get(:,array_size(:)-1) -- get last element in array 14 | FROM 15 |
<table> 16 | ``` 17 | where: 18 | - `<column>` is the column in `<table>` with the JSON object 19 | - `<level_1>..<level_n>` is the level of nesting in the JSON object 20 | - `<table>
` is the table w/ the JSON data. 21 | # Example: 22 | 23 | ```sql 24 | with car_sales 25 | as( 26 | select parse_json(column1) as src 27 | from values 28 | ('{ 29 | "date" : "2017-04-28", 30 | "dealership" : "Valley View Auto Sales", 31 | "salesperson" : { 32 | "id": "55", 33 | "name": "Frank Beasley" 34 | }, 35 | "customer" : [ 36 | {"name": "Joyce Ridgely", "phone": "16504378889", "address": "San Francisco, CA"} 37 | ], 38 | "vehicle" : [ 39 | {"make": "Honda", "model": "Civic", "year": "2017", "price": "20275", "extras":["ext warranty", "paint protection"]} 40 | ], 41 | "values":[1,2,3] 42 | }'), 43 | ('{ 44 | "date" : "2017-04-28", 45 | "dealership" : "Tindel Toyota", 46 | "salesperson" : { 47 | "id": "274", 48 | "name": "Greg Northrup" 49 | }, 50 | "customer" : [ 51 | {"name": "Bradley Greenbloom", "phone": "12127593751", "address": "New York, NY"} 52 | ], 53 | "vehicle" : [ 54 | {"make": "Toyota", "model": "Camry", "year": "2017", "price": "23500", "extras":["ext warranty", "rust proofing", "fabric protection"]} 55 | ], 56 | "values":[3,2,1] 57 | }') v) 58 | 59 | select 60 | src:date::date as date , -- select json object 61 | src:"customer", -- for objects w/ incompatible names (e.g. spaces and special characters) escape w/ quotes 62 | src:vehicle[0].make::string as make, -- select first item in list 63 | get(src:values,array_size(src:values)-1) last_element -- last value in list 64 | 65 | from car_sales; 66 | ``` 67 | |DATE|SRC:"CUSTOMER"|MAKE|LAST_ELEMENT| 68 | |----|------------|----|------| 69 | |28/04/2017|[{"address":"San Francisco, CA","name":"Joyce Ridgely","phone":"16504378889"}]|HONDA|3| 70 | |28/04/2017"|[{"address":"New York, NY","name":"Bradley Greenbloom","phone":"12127593751"}]|TOYOTA|1| 71 | -------------------------------------------------------------------------------- /snowflake/parse-url.md: -------------------------------------------------------------------------------- 1 | # Parse URL String 2 | 3 | Explore the snippet with some demo data [here](https://count.co/n/LM876dhSHtU?vm=e). 4 | 5 | # Description 6 | 7 | URL strings are a very valuable source of information, but trying to get your regex exactly right is a pain. 8 | This re-usable snippet can be applied to any URL to extract the exact part you're after. 9 | Helpfully, Snowflake has a [PARSE_URL](https://docs.snowflake.com/en/sql-reference/functions/parse_url.html) function that returns a JSON object of URL components. 10 | 11 | ```sql 12 | SELECT KEY,VALUE 13 | FROM 14 | ( 15 | SELECT parsed from 16 | ( 17 | SELECT 18 | parse_url(url) as parsed 19 | FROM my_url 20 | ) 21 | ), 22 | LATERAL FLATTEN(input=> parsed) 23 | ``` 24 | where: 25 | - `` is your URL string (e.g. `"https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#Ref1"` 26 | 27 | 28 | # Usage 29 | 30 | To use, just copy out the regex for whatever part of the URL you need. 
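With `PARSE_URL` you don't actually need a regex: the function returns an object, so you can index into it with a path and cast the result to pull out a single component. A minimal sketch (the URL string here is just an illustration):

```sql
WITH my_url as (select 'https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#newsfeed' as url)

select
  parse_url(url):host::string as host,                   -- www.yoursite.com
  parse_url(url):parameters.myparam1::string as myparam1 -- 123
from my_url
```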
Or capture them all as in the example below: 31 | 32 | ```sql 33 | WITH my_url as (select 'https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#newsfeed' as url) 34 | 35 | select KEY,VALUE FROM 36 | ( 37 | SELECT parsed from 38 | ( 39 | select 40 | parse_url(url) as parsed 41 | FROM my_url 42 | ) 43 | ), 44 | LATERAL FLATTEN(input => parsed) 45 | ``` 46 | | KEY | VALUE | 47 | |------------| ----------------------------------- | 48 | | fragment | "newsfeed" | 49 | | host | "www.yoursite.com" | 50 | | parameters | {"myparam1":"123","myparam2":"abc"} | 51 | | port | (Empty) | 52 | | query | "myparam1=123&myparam2=abc" | 53 | | scheme | "https" | 54 | 55 | # References 56 | - [The Anatomy of a Full Path URL](https://zvelo.com/anatomy-of-full-path-url-hostname-protocol-path-more/) 57 | -------------------------------------------------------------------------------- /snowflake/pivot.md: -------------------------------------------------------------------------------- 1 | # Pivot Tables in Snowflake 2 | 3 | 4 | # Description 5 | We're all familiar with the power of Pivot tables, but building them in SQL tends to be tricky. 6 | Snowflake has some bespoke features that make this a little easier though. In this context: 7 | > Pivoting a table means turning rows into columns 8 | 9 | To pivot in Snowflake, you can use the [PIVOT](https://docs.snowflake.com/en/sql-reference/constructs/pivot.html) keyword: 10 | 11 | ```sql 12 | SELECT 13 | 14 | FROM
<table> 15 | PIVOT ( <aggregate_function> ( <pivot_column> ) 16 | FOR <value_column> IN ( <pivot_value_1> [ , <pivot_value_2> ... ] ) ) 17 | ``` 18 | where: 19 | - `<select_columns>` are all the columns you're selecting 20 | - `<table>
` is the table you're pivoting 21 | - `` is any of AVG, COUNT, MAX, MIN, or SUM 22 | - `` is the column that will be aggregated 23 | - `` is the column that contains the values from which the column names will be generated 24 | - `` is a list of what will become the pivot columns 25 | # Example: 26 | 27 | ```sql 28 | with data as ( 29 | select 192.048 HOURS_WATCHED, 'Friends'TITLE, 2018 YEAR union all 30 | select 1.869 HOURS_WATCHED, 'Arrested Development' TITLE, 2018 YEAR union all 31 | select 78.596 HOURS_WATCHED, 'Friends' TITLE, 2019 YEAR union all 32 | select 2.521 HOURS_WATCHED, 'Arrested Development' TITLE, 2019 YEAR union all 33 | select 58.134 HOURS_WATCHED, 'Friends' TITLE, 2020 YEAR union all 34 | select 0.993 HOURS_WATCHED, 'Arrested Development' TITLE, 2020 YEAR 35 | ) 36 | SELECT * FROM data 37 | PIVOT(sum(hours_watched) for year in (2018, 2019, 2020)) 38 | ``` 39 | |TITLE|2018|2019|2020| 40 | |----|----|-----|----| 41 | |Friends|192.048|78.596|59.134| 42 | |Arrested Development|1.869|2.521|0.993| 43 | 44 | # References: 45 | - [Pivot Tables in Snowflake ](https://count.co/sql-resources/snowflake/pivot-tables) 46 | -------------------------------------------------------------------------------- /snowflake/random-sampling.md: -------------------------------------------------------------------------------- 1 | # Random sampling 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/UekRUzNjL4V?vm=e). 4 | 5 | # Description 6 | 7 | Taking a random sample of a table is often useful when the table is very large, and you are still working in an exploratory or investigative mode. A query against a small fraction of a table runs much more quickly, and consumes fewer resources. 8 | Here are some example queries to select a random or pseudo-random subset of rows. 9 | 10 | # Using RANDOM 11 | The [RANDOM](https://docs.snowflake.com/en/sql-reference/functions/random.html) function returns a value in [0, 1) (i.e. including zero but not 1). You can use it to sample tables, e.g.: 12 | 13 | #### Return 1% of rows 14 | 15 | ```sql 16 | select * from PUBLIC.SPOTIFY_DAILY_TRACKS QUALIFY PERCENT_RANK() OVER (ORDER BY RANDOM()) <= 0.01 17 | ``` 18 | 19 | #### Return 10 rows 20 | 21 | ```sql 22 | select * from PUBLIC.SPOTIFY_DAILY_TRACKS order by random() limit 10 23 | ``` 24 | 25 | # Using hashing 26 | 27 | A hash is a deterministic mapping from one value to another, and so is not random, but can appear random 'enough' to produce a convincing sample. Use a [supported hash function ](https://docs.snowflake.com/en/sql-reference/functions-hash-scalar.html)in Snowflake to produce a pseudo-random ordering of your table: 28 | 29 | #### MD5 30 | 31 | ```sql 32 | select * from PUBLIC.SPOTIFY_DAILY_TRACKS order by md5(SPOTIFY_DAILY_TRACKS.TRACK_ID) limit 10 33 | ``` 34 | 35 | #### SHA1 36 | 37 | ```sql 38 | select * from PUBLIC.SPOTIFY_DAILY_TRACKS order by sha1(SPOTIFY_DAILY_TRACKS.TRACK_ID) limit 10 39 | ``` 40 | 41 | # Using TABLESAMPLE 42 | The [TABLESAMPLE](https://docs.snowflake.com/en/sql-reference/constructs/sample.html) has a major benefit - it doesn't require a full table scan, and therefore can be much cheaper and quicker than the methods above. The downside is that the percentage value must be a literal, so this query is less flexible than the ones above. 
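The construct also accepts an optional sampling method: `BERNOULLI` (the row-based default, used in the basic example below) or `SYSTEM` / `BLOCK`, which decides inclusion block-by-block and tends to be faster on very large tables at the cost of a less uniform sample. A hedged sketch, assuming `SPOTIFY_DAILY_TRACKS` is a plain table (block sampling isn't supported on views):

```sql
select * from PUBLIC.SPOTIFY_DAILY_TRACKS tablesample system (1) -- each block kept with ~1% probability
```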
43 | 44 | #### Tablesample 45 | 46 | ```sql 47 | select * from PUBLIC.SPOTIFY_DAILY_TRACKS tablesample (1) -- 1% of the data 48 | ``` 49 | -------------------------------------------------------------------------------- /snowflake/rank.md: -------------------------------------------------------------------------------- 1 | # Rank 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/9MW21RBKUus). 4 | 5 | # Description 6 | A common requirement is the need to rank a particular row in a group by some quantity - for example, 'rank stores by total sales in each region'. Fortunately, Snowflake supports the `rank` window function to help with this. 7 | The query template is 8 | 9 | ```sql 10 | select 11 | rank() over (partition by <fields> order by <rank_by> desc) rank 12 | from <table>
13 | ``` 14 | where 15 | - `fields` - zero or more columns to split the results by - a rank will be calculated for each unique combination of values in these columns 16 | - `rank_by` - a column to decide the ranking 17 | - `table` - where to pull these columns from 18 | 19 | # Example 20 | 21 | Using total Spotify streams by day and artist as an example data source, let's calculate, by artist, the rank for total streams by day (e.g. rank = 1 indicates that day was their most-streamed day). 22 | - `fields` - this is just `artist` 23 | - `rank_by` - this is `streams` 24 | - `table` - this is called `raw` 25 | 26 | then the query is: 27 | 28 | ```sql 29 | -- RAW 30 | select streams, artist, day from PUBLIC.SPOTIFY_DAILY_TRACKS 31 | ``` 32 | | DAY | STREAMS | ARTIST | 33 | |-----------|---------|----------| 34 | | 2017-06-01| 509093 | 2 Chainz | 35 | | 2017-06-03| 562412 | 2 Chainz | 36 | | 2017-06-04| 480350 | 2 Chainz | 37 | 38 | ```sql 39 | select 40 | *, 41 | rank() over (partition by artist order by streams desc) rank 42 | from RAW 43 | ``` 44 | | RANK | DAY | STREAMS | ARTIST | 45 | |------|------------|---------|----------| 46 | | 247 | 2017-06-01 | 509093 | 2 Chainz | 47 | | 205 | 2017-06-03 | 562412 | 2 Chainz | 48 | | 262 | 2017-06-04 | 480350 | 2 Chainz | 49 | -------------------------------------------------------------------------------- /snowflake/replace-empty-strings-null.md: -------------------------------------------------------------------------------- 1 | # Replace empty strings with NULLs 2 | 3 | Explore this snippet [here](https://count.co/n/aamIOgywUQp?vm=e). 4 | 5 | # Description 6 | 7 | An essential part of cleaning a new data source is deciding how to treat missing values. The Snowflake function [IFF](https://docs.snowflake.com/en/sql-reference/functions/iff.html) can help with replacing empty values with something else: 8 | 9 | ```sql 10 | with data as ( 11 | select * from (values ('a'), ('b'), (''), ('d')) as data (str) 12 | ) 13 | 14 | select 15 | iff(length(str) = 0, null, str) str 16 | from data 17 | ``` 18 | 19 | | STR | 20 | | ---- | 21 | | a | 22 | | b | 23 | | NULL | 24 | | d | -------------------------------------------------------------------------------- /snowflake/replace-null.md: -------------------------------------------------------------------------------- 1 | # Replace NULLs 2 | 3 | Explore this snippet [here](https://count.co/n/yYCaXXdOESx?vm=e). 4 | 5 | # Description 6 | 7 | An essential part of cleaning a new data source is deciding how to treat NULL values. The Snowflake function [IFNULL](https://docs.snowflake.com/en/sql-reference/functions/ifnull.html) can help with replacing NULL values with something else: 8 | 9 | ```sql 10 | with data as ( 11 | select * from (values 12 | (1, 'one'), 13 | (null, 'two'), 14 | (3, null) 15 | ) as data (num, str) 16 | ) 17 | 18 | select 19 | ifnull(num, -1) num, 20 | ifnull(str, 'I AM NULL') str 21 | from data 22 | ``` 23 | 24 | | NUM | STR | 25 | | --- | --------- | 26 | | 1 | a | 27 | | -1 | b | 28 | | 3 | I AM NULL | -------------------------------------------------------------------------------- /snowflake/running-total.md: -------------------------------------------------------------------------------- 1 | # Running total 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/9oKm47UpLz2?vm=e). 4 | 5 | # Description 6 | 7 | Running totals are relatively simple to calculate in Snowflake, using the `sum` window function. 
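The cumulative behaviour comes from the window frame: once the window has an `order by`, Snowflake defaults to a frame running from the start of the partition up to the current row. If you prefer to spell that out, the sketch below is one explicit form, using the same placeholders as the template that follows (note the implicit default is a `range` frame, so rows that tie on `<ordering>` are grouped together, whereas `rows` counts them one at a time):

```sql
select
  sum(<value>) over (
    partition by <fields>
    order by <ordering> asc
    rows between unbounded preceding and current row
  ) running_total
from <table>
```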
8 | The template for the query is 9 | 10 | ```sql 11 | select 12 | sum(<value>) over (partition by <fields> order by <ordering> asc) running_total, 13 | <fields>, 14 | <ordering> 15 | from <table>
16 | ``` 17 | where 18 | - `value` - this is the numeric quantity to calculate a running total for 19 | - `fields` - these are zero or more columns to split the running totals by 20 | - `ordering` - this is the column which determines the order of the running total, most commonly temporal 21 | - `table` - where to pull these columns from 22 | 23 | # Example 24 | 25 | Using total Spotify streams as an example data source, let's identify: 26 | - `value` - this is `streams` 27 | - `fields` - this is just `artist` 28 | - `ordering` - this is the `day` column 29 | - `table` - this is called `raw` 30 | 31 | then the query is: 32 | 33 | ```sql 34 | -- RAW 35 | select streams, day, artist from PUBLIC.SPOTIFY_DAILY_TRACKS 36 | ``` 37 | | STREAMS | DAY | ARTIST | 38 | |---------|------------|--------| 39 | | 628338 | 2020-10-25 | CJ | 40 | | 720233 | 2020-10-26 | CJ | 41 | | 777880 | 2020-10-27 | CJ | 42 | | 811130 | 2020-10-28 | CJ | 43 | 44 | ```sql 45 | select 46 | sum(streams) over (partition by artist order by day asc) running_total, 47 | artist, 48 | day 49 | from RAW 50 | order by 2,3 51 | ``` 52 | | RUNNING_TOTAL | DAY | ARTIST | 53 | |---------------|------------|--------| 54 | | 628338 | 2020-10-25 | CJ | 55 | | 1348571 | 2020-10-26 | CJ | 56 | | 2126451 | 2020-10-27 | CJ | 57 | | 2937581 | 2020-10-28 | CJ | 58 | -------------------------------------------------------------------------------- /snowflake/stacked-bar-line.md: -------------------------------------------------------------------------------- 1 | # Stacked bar and line chart 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/ze9Y5H2r4mu?vm=e). 4 | 5 | # Description 6 | 7 | Bar and line charts can communicate a lot of info on a single chart. They require all of the elements of a bar chart: 8 | - 2 categorical or temporal columns - 1 for the x-axis and 1 for stacked bars 9 | - 2 numeric columns - 1 for the bar and 1 for the line 10 | 11 | ```sql 12 | SELECT 13 | AGG_FN() OVER (PARTITION BY , ) as bar_metric, 14 | as x_axis, 15 | as cat, 16 | AGG_FN() OVER (PARTITION BY ) as line 17 | FROM 18 |
19 | ``` 20 | 21 | where: 22 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 23 | - `COLUMN` is the column you want to aggregate to get your bar chart metric. Make sure this is a numeric column. 24 | - `COLUMN2` is the column you want to aggregate to get your line chart metric. Make sure this is a numeric column. 25 | - `CAT_COLUMN1` and `CAT_COLUMN2` are the groups you want to compare. One will go on the x axis and the other on the colors of the bar chart. Make sure these are categorical or temporal columns and not numerical fields. 26 | 27 | # Usage 28 | 29 | In this example with some example data we'll look at some sample data of users that were successful and unsuccessful at a certain task over a few days. We'll show the count of users by their outcome, then show the overall success rate for each day. 30 | 31 | ```sql 32 | with sample_data as ( 33 | select date('2021-01-01') registered_at, 'successful' category, abs(random()/1000000) user_count union all 34 | select date('2021-01-01') registered_at, 'unsuccessful' category, abs(random()/1000000) user_count union all 35 | select date('2021-01-02') registered_at, 'successful' category, abs(random()/1000000) user_count union all 36 | select date('2021-01-02') registered_at, 'unsuccessful' category, abs(random()/1000000) user_count union all 37 | select date('2021-01-03') registered_at, 'successful' category, abs(random()/1000000) user_count union all 38 | select date('2021-01-03') registered_at, 'unsuccessful' category, abs(random()/1000000) user_count 39 | ) 40 | 41 | 42 | select 43 | registered_at, 44 | category, 45 | user_count, 46 | sum(case when category = 'successful' then user_count end) over (partition by registered_at) / sum(user_count) over (partition by registered_at) success_rate 47 | from sample_data 48 | ``` 49 | stackedbar-line 50 | -------------------------------------------------------------------------------- /snowflake/timeseries.md: -------------------------------------------------------------------------------- 1 | # Time series line chart 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/ScDndvwcc9x?vm=e). 4 | 5 | # Description 6 | 7 | Timeseries charts show how individual metric(s) change over time. They have lines along the y-axis and dates along the x-axis. 8 | You just need a simple GROUP BY to get all the elements you need for a timeseries chart: 9 | 10 | ```sql 11 | SELECT 12 | AGG_FN() as metric, 13 | as datetime 14 | FROM 15 |
16 | GROUP BY 17 | datetime 18 | ``` 19 | where: 20 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc. 21 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column. 22 | - `DATETIME_COLUMN` is the group you want to show on the x-axis. Make sure this is a date, datetime, timestamp, or time column (not a number) 23 | 24 | # Usage 25 | 26 | In this example with some spotify data, we'll look at the total daily streams over time: 27 | 28 | ```sql 29 | select 30 | sum(streams) streams, 31 | day 32 | from PUBLIC.SPOTIFY_DAILY_TRACKS 33 | group by day 34 | ``` 35 | timeseries 36 | -------------------------------------------------------------------------------- /snowflake/transforming-arrays.md: -------------------------------------------------------------------------------- 1 | # Transforming arrays 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/UPNSCO974GQ?vm=e). 4 | 5 | # Description 6 | 7 | To transform elements of an array (often known as mapping over an array), it is possible to gain access to array elements using the `FLATTEN` function 8 | 9 | ```sql 10 | with data as (select array_construct(1,2,3) nums, array_construct('a', 'b', 'c') strs) 11 | 12 | select 13 | array_agg(x.value * 2 + 1) nums_transformed, -- Transformation function 14 | array_agg(y.value ||'_2') strs_transformed -- Another transformation function 15 | from DATA 16 | cross join table(flatten(input => nums)) as x 17 | left join table(flatten(input => strs)) as y on x.index = y.index 18 | ``` 19 | 20 | | NUMS_TRANSFORMED | STRS_TRANSFORMED | 21 | |------------------|----------------------| 22 | | [3,5,7] | ["a_2","b_2","c_2" ] | 23 | -------------------------------------------------------------------------------- /snowflake/udf-sort-array.md: -------------------------------------------------------------------------------- 1 | # UDF: Sort Array 2 | 3 | ## Description 4 | 5 | Javascript UDF to sort an array in ascending order. 6 | 7 | ```sql 8 | -- Create the UDF. 9 | CREATE OR REPLACE FUNCTION array_sort(a array) 10 | RETURNS array 11 | LANGUAGE JAVASCRIPT 12 | AS 13 | $$ 14 | return A.sort(); 15 | $$ 16 | ; 17 | 18 | -- Call the UDF with a small array. 19 | SELECT ARRAY_SORT(PARSE_JSON('[2,4,5,3,1]')); 20 | ``` 21 | 22 | ## Example 23 | 24 | 25 | ```sql 26 | SELECT ARRAY_SORT(PARSE_JSON('[2,4,5,3,1]')) as arr_sorted; 27 | ``` 28 | 29 | | ARR_SORTED | 30 | | ------------- | 31 | | [1,2,3,4,5] | 32 | 33 | # References: 34 | - https://docs.snowflake.com/en/sql-reference/user-defined-functions.html 35 | - https://docs.snowflake.com/en/developer-guide/udf/javascript/udf-javascript-introduction.html#introductory-example 36 | -------------------------------------------------------------------------------- /snowflake/unpivot-melt.md: -------------------------------------------------------------------------------- 1 | # Unpivot / melt 2 | 3 | Explore this snippet with some demo data [here](https://count.co/n/l1c3J7nME8a?vm=e). 4 | 5 | # Description 6 | 7 | The act of unpivoting (or melting, if you're a `pandas` user) is to convert columns to rows. Snowflake has added the [UNPIVOT](https://docs.snowflake.com/en/sql-reference/constructs/unpivot.html) query construct to help. 8 | The query template is 9 | 10 | ```sql 11 | SELECT 12 | , 13 | metric, 14 | value 15 | FROM
UNPIVOT(value FOR metric IN ()) 16 | order by metric, value 17 | ``` 18 | where 19 | - `columns_to_keep` - all the columns to be preserved 20 | - `columns_to_unpivot` - all the columns to be converted to rows 21 | - `table` - the table to pull the columns from 22 | 23 | (The output column names metric and value can also be changed.) 24 | 25 | # Examples 26 | 27 | In the examples below, we use a table with columns `A`, `B`, `C`. We take the columns `B` and `C`, and turn their values into columns called `metric` (containing either the string 'B' or 'C') and `value` (the value from either the column `B` or `C`). The column `A` is preserved. 28 | 29 | ```sql 30 | -- Define some dummy data 31 | with a as ( 32 | select * from( 33 | select 'a' as A, 1 as B, 2 as C union all 34 | select 'b' as A, 3 as B, 4 as C union all 35 | select 'c' as A, 5 as B, 6 as C 36 | ) 37 | ) 38 | 39 | SELECT 40 | A, 41 | metric, 42 | value 43 | FROM a UNPIVOT(value FOR metric IN (B, C)) 44 | order by metric, value 45 | ``` 46 | | A | metric | value | 47 | |---| ------ | ----- | 48 | | a | B | 1 | 49 | | b | B | 3 | 50 | | c | B | 5 | 51 | | a | C | 2 | 52 | | b | C | 4 | 53 | | c | C | 6 | 54 | -------------------------------------------------------------------------------- /snowflake/yoy.md: -------------------------------------------------------------------------------- 1 | # Year-on-year percentage difference 2 | 3 | To explore this example with some demo data, head [here](https://count.co/n/WDGC9JhS6dE?vm=e). 4 | 5 | # Description 6 | 7 | Calculating the year-on-year change of a quantity typically requires a few steps: 8 | 1. Aggregate the quantity by year (and by any other fields) 9 | 2. Add a column for last years quantity (this will be null for the first year in the dataset) 10 | 3. Calculate the change from last year to this year 11 | A generic query template may then look like 12 | 13 | ```sql 14 | select 15 | *, 16 | -- Calculate percentage change 17 | ( - last_year_total) / last_year_total * 100 as perc_diff 18 | from ( 19 | select 20 | *, 21 | -- Use a window function to get last year's total 22 | lag() over (partition by order by asc) last_year_total 23 | from
24 | order by , desc 25 | ) 26 | ``` 27 | where 28 | - `table` - your source data table 29 | - `fields` - one or more columns to split the year-on-year differences by 30 | - `year_column` - the column containing the year of the aggregation 31 | - `year_total` - the column containing the aggregated data 32 | 33 | # Example 34 | 35 | Using total Spotify streams as an example data source, let's identify: 36 | - `table` - this is called `RAW` 37 | - `fields` - this is just the single column `ARTIST` 38 | - `year_column` - this is called `YEAR` 39 | - `year_total` - this is `SUM_STREAMS` 40 | 41 | Then the query looks like: 42 | 43 | ```sql 44 | -- RAW 45 | select 46 | date_trunc(year,day) year, 47 | sum(streams) sum_streams, 48 | artist 49 | from PUBLIC.SPOTIFY_DAILY_TRACKS 50 | group by 1,3 51 | ``` 52 | | year | sum_streams | artist | 53 | | ---------- | ----------- | ------ | 54 | | 2020-01-01 | 115372043 | CJ | 55 | | 2021-01-01 | 179284925 | CJ | 56 | | ... | ... | ... | 57 | 58 | ```sql 59 | select 60 | *, 61 | (sum_streams - last_year_total) / last_year_total * 100 as perc_diff 62 | from ( 63 | select 64 | *, 65 | lag(sum_streams) over (partition by artist order by year asc) last_year_total 66 | from RAW 67 | 68 | order by artist, year desc 69 | ) 70 | ``` 71 | | year | sum_streams | artist | last_year_total | perc_diff | 72 | | ---------- | ----------- | ------ | --------------- | --------- | 73 | | 2020-01-01 | 115372043 | CJ | NULL | NULL | 74 | | 2021-01-01 | 179284925 | CJ | 115372043 | 55.397200 | 75 | | ... | ... | ... | ... | ... | 76 | -------------------------------------------------------------------------------- /template.md: -------------------------------------------------------------------------------- 1 | # Snippet title 2 | 3 | # Description 4 | This is a few-line description outlining the usefulness and functionality of this snippet. 5 | 6 | This is a template for the query: 7 | 8 | ```sql 9 | select from 10 | ``` 11 | where 12 | - `a_column` - a description of a placeholder 13 | - `a_table` - another description of a placeholder 14 | 15 | # Example 16 | 17 | Here's an actual SQL query with the placeholders filled in 18 | - `a_column` - this has been filled in with `my_column` 19 | - `a_table` - this has been filled in with `my_table` 20 | 21 | then the query becomes: 22 | 23 | ```sql 24 | select my_column from my_table 25 | ``` 26 | 27 | You may also want to include some example query output: 28 | 29 | | my_column | maybe_another_column | 30 | | --------- | -------------------- | 31 | | a | 1 | 32 | | b | 2 | 33 | | ... | ... | 34 | | z | 26 | 35 | --------------------------------------------------------------------------------