├── .github
│   ├── ISSUE_TEMPLATE
│   │   ├── bug_report.md
│   │   └── snippet-request.md
│   └── pull_request_template.md
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── bigquery
│   ├── 100-stacked-bar.md
│   ├── array-to-columns.md
│   ├── bar-chart.md
│   ├── binomial-distribution.md
│   ├── cdf.md
│   ├── cohort-analysis.md
│   ├── compound-growth-rates.md
│   ├── concat-nulls.md
│   ├── convert-datetimes-string.md
│   ├── convert-string-datetimes.md
│   ├── count-nulls.md
│   ├── filtering-arrays.md
│   ├── first-row-of-group.md
│   ├── generating-arrays.md
│   ├── geographical-distance.md
│   ├── get-last-array-element.md
│   ├── heatmap.md
│   ├── histogram-bins.md
│   ├── horizontal-bar.md
│   ├── json-strings.md
│   ├── last-day.md
│   ├── latest-row.md
│   ├── least-array.md
│   ├── least-udf.md
│   ├── localtz-to-utc.md
│   ├── median-udf.md
│   ├── median.md
│   ├── moving-average.md
│   ├── outliers-mad.md
│   ├── outliers-stdev.md
│   ├── outliers-z.md
│   ├── pivot.md
│   ├── rand_between.md
│   ├── random-sampling.md
│   ├── rank.md
│   ├── regex-email.md
│   ├── regex-parse-url.md
│   ├── repeating-words.md
│   ├── replace-empty-strings-null.md
│   ├── replace-multiples.md
│   ├── replace-null.md
│   ├── running-total.md
│   ├── split-uppercase.md
│   ├── stacked-bar-line.md
│   ├── tf-idf.md
│   ├── timeseries.md
│   ├── transforming-arrays.md
│   ├── uniform-distribution.md
│   ├── unpivot-melt.md
│   └── yoy.md
├── mssql
│   ├── Cumulative distribution functions.md
│   ├── check-column-is-accessible.md
│   ├── compound-growth-rates.md
│   ├── concatenate-strings.md
│   ├── first-row-of-group.md
│   └── search-stored-procedures.md
├── mysql
│   ├── cdf.md
│   └── initcap-udf.md
├── postgres
│   ├── array-reverse.md
│   ├── convert-epoch-to-timestamp.md
│   ├── count-nulls.md
│   ├── cume_dist.md
│   ├── dt-formatting.md
│   ├── generate-timeseries.md
│   ├── get-last-array-element.md
│   ├── histogram-bins.md
│   ├── json-strings.md
│   ├── last-x-days.md
│   ├── median.md
│   ├── moving-average.md
│   ├── ntile.md
│   ├── rank.md
│   ├── regex-email.md
│   ├── regex-parse-url.md
│   ├── replace-empty-strings-null.md
│   ├── replace-null.md
│   └── split-column-to-rows.md
├── snowflake
│   ├── 100-stacked-bar.md
│   ├── bar-chart.md
│   ├── count-nulls.md
│   ├── filtering-arrays.md
│   ├── flatten-array.md
│   ├── generate-arrays.md
│   ├── geographical-distance.md
│   ├── heatmap.md
│   ├── histogram-bins.md
│   ├── horizontal-bar.md
│   ├── last-array-element.md
│   ├── moving-average.md
│   ├── parse-json.md
│   ├── parse-url.md
│   ├── pivot.md
│   ├── random-sampling.md
│   ├── rank.md
│   ├── replace-empty-strings-null.md
│   ├── replace-null.md
│   ├── running-total.md
│   ├── stacked-bar-line.md
│   ├── timeseries.md
│   ├── transforming-arrays.md
│   ├── udf-sort-array.md
│   ├── unpivot-melt.md
│   └── yoy.md
└── template.md
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report
3 | about: Create a report to help us improve
4 | title: "[BUG] "
5 | labels: bug
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Describe the bug**
11 | A clear and concise description of what the bug is.
12 |
13 | **Expected behaviour**
14 | A clear and concise description of what you expected to happen.
15 |
16 | **Screenshots**
17 | If applicable, add screenshots to help explain your problem.
18 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/snippet-request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Snippet request
3 | about: Suggest an idea for a snippet
4 | title: "[SNIPPET REQUEST] "
5 | labels: help wanted
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Describe what you're looking for**
11 | A clear and concise description of what you want a snippet to achieve.
12 |
13 | **What database are you using?**
14 | If it understands SQL, it's welcome here!
15 |
--------------------------------------------------------------------------------
/.github/pull_request_template.md:
--------------------------------------------------------------------------------
1 | ## Summary of changes
2 |
3 |
4 |
5 | ## Checklist before submitting
6 |
7 | - [ ] I have added this snippet to the correct folder
8 | - [ ] I have added a link to this snippet in the [readme](https://github.com/count/sql-snippets/blob/main/README.md)
9 | - [ ] I haven't included any sensitive information in my example(s)
10 | - [ ] I have checked my query / function works
11 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing
2 | Thanks for your interest in helping out! Feel free to comment, open issues, and create pull requests if there's something you want to share.
3 |
4 | ## Submitting pull requests
5 | To make a change to this repository, the best way is to use pull requests:
6 |
7 | 1. Fork the repository and create your branch from `main`.
8 | 2. Make your changes.
9 | 3. Create a pull request.
10 |
11 | (If the change is particularly small, these steps are easily accomplished directly in the GitHub UI.)
12 |
13 | ## Report bugs using Github's issues
14 | We use GitHub issues to track bugs. Report a bug or start a conversation by opening a new issue.
15 |
16 | ## Use a consistent template
17 | Take a look at the existing snippets and try to format your contribution similarly. You may want to start from our [template](./template.md).
18 |
19 |
20 | ## License
21 | By contributing to this repository, you agree that your contributions will be licensed under its MIT License.
22 |
23 | ## References
24 | This document was adapted from the GitHub Gist of [briandk](https://gist.github.com/briandk/3d2e8b3ec8daf5a27a62).
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Count
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # SQL Snippets
2 |
3 | A curated collection of helpful SQL queries and functions, maintained by [Count](https://count.co).
4 |
5 | All [contributions](./CONTRIBUTING.md) are very welcome - let's get useful SQL snippets out of Slack channels and Notion pages, and collect them as a community resource.
6 |
7 | ### [🙋♀️ Request a snippet](https://github.com/count/sql-snippets/issues/new?assignees=&labels=help+wanted&template=snippet-request.md&title=%5BSNIPPET+REQUEST%5D+)
8 | ### [⁉️ Report a bug](https://github.com/count/sql-snippets/issues/new?assignees=&labels=bug&template=bug_report.md&title=%5BBUG%5D+)
9 | ### [↣ Submit a pull request](https://github.com/count/sql-snippets/compare)
10 |
11 |
22 |
23 | | Snippet | Databases |
24 | | ------- | --------- |
25 | | **Analytics** | |
26 | | Cohort analysis | [](./bigquery/cohort-analysis.md) |
27 | | Cumulative distribution functions | [](./bigquery/cdf.md) [](./postgres/cume_dist.md) [](./mysql/cdf.md) |
28 | | Compound growth rates | [](./bigquery/compound-growth-rates.md) [](./mssql/compound-growth-rates.md) |
29 | | Generate a binomial distribution | [](./bigquery/binomial-distribution.md) |
30 | | Generate a uniform distribution | [](./bigquery/uniform-distribution.md) |
31 | | Histogram bins | [](./bigquery/histogram-bins.md) [](./snowflake/histogram-bins.md) [](./postgres/histogram-bins.md) |
32 | | Median | [](./bigquery/median.md) |
33 | | Median (UDF) | [](./bigquery/median-udf.md) |
34 | | Outlier detection: MAD method | [](./bigquery/outliers-mad.md) |
35 | | Outlier detection: Standard Deviation method | [](./bigquery/outliers-stdev.md) |
36 | | Outlier detection: Z-score method | [](./bigquery/outliers-z.md) |
37 | | Random Between | [](./bigquery/rand_between.md) |
38 | | Random sampling | [](./bigquery/random-sampling.md) [](./snowflake/random-sampling.md) |
39 | | Ranking | [](./bigquery/rank.md) [](./snowflake/rank.md) [](./postgres/rank.md) |
40 | | TF-IDF | [](./bigquery/tf-idf.md) |
41 | | **Arrays** | |
42 | | Arrays to columns | [](./bigquery/array-to-columns.md) [](./snowflake/flatten-array.md) | |
43 | | Filtering arrays | [](./bigquery/filtering-arrays.md) [](./snowflake/filtering-arrays.md) |
44 | | Generating arrays | [](./bigquery/generating-arrays.md) [](./snowflake/generate-arrays.md) |
45 | | Get last array element | [](./bigquery/get-last-array-element.md) [](./postgres/get-last-array-element.md) [](./snowflake/last-array-element.md) |
46 | | Reverse array | [](./postgres/array-reverse.md) |
47 | | Transforming arrays | [](./bigquery/transforming-arrays.md) [](./snowflake/transforming-arrays.md) |
48 | | Least non-null value in array (UDF) | [](./bigquery/least-udf.md) |
49 | | Greatest/Least value in array | [](./bigquery/least-array.md) |
50 | | Sort Array (UDF) | [](./snowflake/udf-sort-array.md) |
51 | | **Charts** | |
52 | | Bar chart | [](./bigquery/bar-chart.md) [](./snowflake/bar-chart.md) |
53 | | Horizontal bar chart | [](./bigquery/horizontal-bar.md) [](./snowflake/horizontal-bar.md) |
54 | | Time series | [](./bigquery/timeseries.md) [](./snowflake/timeseries.md) |
55 | | 100% stacked bar chart | [](./bigquery/100-stacked-bar.md ) [](./snowflake/100-stacked-bar.md) |
56 | | Heat map | [](./bigquery/heatmap.md) [](./snowflake/heatmap.md) |
57 | | Stacked bar and line chart | [](./bigquery/stacked-bar-line.md) [](./snowflake/stacked-bar-line.md) |
58 | | **Database metadata** | |
59 | | Check if column exists in table | [](./mssql/check-column-is-accessible.md) |
60 | | Search for text in Stored Procedures | [](./mssql/search-stored-procedures.md) |
61 | | **Dates and times** | |
62 | | Date/Time Formatting One Pager| [](./postgres/dt-formatting.md) |
63 | | Converting from strings | [](./bigquery/convert-string-datetimes.md) |
64 | | Converting to strings | [](./bigquery/convert-datetimes-string.md) |
65 | | Local Timezone to UTC | [](./bigquery/localtz-to-utc.md) |
66 | | Converting epoch/unix to timestamp | [](./postgres/convert-epoch-to-timestamp.md) |
67 | | Generate timeseries | [](./postgres/generate-timeseries.md) |
68 | | Last day of the month | [](./bigquery/last-day.md) |
69 | | Last X days | [](./postgres/last-x-days.md) |
70 | | **Geography** | |
71 | | Calculating the distance between two points | [](./bigquery/geographical-distance.md) [](./snowflake/geographical-distance.md) |
72 | | **JSON** | |
73 | | Parsing JSON strings | [](./bigquery/json-strings.md) [](./postgres/json-strings.md) [](./snowflake/parse-json.md)|
74 | | **Regular & String expressions** | |
75 | | Parsing URLs | [](./bigquery/regex-parse-url.md) [](./snowflake/parse-url.md) [](./postgres/regex-parse-url.md) |
76 | | Validating emails | [](./bigquery/regex-email.md) [](./postgres/regex-email.md) |
77 | | Extract repeating words | [](./bigquery/repeating-words.md) |
78 | | Split on Uppercase characters | [](./bigquery/split-uppercase.md) |
79 | | Replace multiple conditions | [](./bigquery/replace-multiples.md) |
80 | | CONCAT with NULLs | [](./bigquery/concat-nulls.md) |
81 | | Capitalize initial letters (UDF) | [](./mysql/initcap-udf.md) |
82 | | Concatenating strings | [](./mssql/concatenate-strings.md) |
83 | | **Transformation** | |
84 | | Pivot | [](./bigquery/pivot.md) [](./snowflake/pivot.md) |
85 | | Unpivoting / melting | [](./bigquery/unpivot-melt.md) [](./snowflake/unpivot-melt.md) |
86 | | Replace NULLs | [](./bigquery/replace-null.md) [](./postgres/replace-null.md) [](./snowflake/replace-null.md) |
87 | | Replace empty string with NULL | [](./bigquery/replace-empty-strings-null.md) [](./postgres/replace-empty-strings-null.md) [](./snowflake/replace-empty-strings-null.md) |
88 | | Counting NULLs | [](./bigquery/count-nulls.md) [](./postgres/count-nulls.md) [](./snowflake/count-nulls.md) |
89 | | Splitting single columns into rows | [](./postgres/split-column-to-rows.md) |
90 | | Latest Row (deduplicate) | [](./bigquery/latest-row.md) |
91 | | **Window functions** | |
92 | | First row of each group | [](./bigquery/first-row-of-group.md) [](./mssql/first-row-of-group.md) |
93 | | Moving averages | [](./bigquery/moving-average.md) [](./snowflake/moving-average.md) [](./postgres/moving-average.md) |
94 | | Running totals | [](./bigquery/running-total.md) [](./snowflake/running-total.md) |
95 | | Year-on-year changes | [](./bigquery/yoy.md) [](./snowflake/yoy.md) |
96 | | Creating equal-sized buckets | [](./postgres/ntile.md) |
97 |
98 | ## Other resources
99 | - [bigquery-utils](https://github.com/GoogleCloudPlatform/bigquery-utils) - a BigQuery-specific collection of scripts
100 |
--------------------------------------------------------------------------------
/bigquery/100-stacked-bar.md:
--------------------------------------------------------------------------------
1 | # 100% bar chart
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/qMn2uMsl0rz?vm=e).
4 |
5 | # Description
6 |
7 | 100% bar charts compare the proportions of a metric across different groups.
8 | To build one you'll need:
9 | - A numerical column as the metric you want to compare. This should be represented as a percentage.
10 | - 2 categorical columns. One for the x-axis and one for the coloring of the bars.
11 |
12 | ```sql
13 | SELECT
14 |     AGG_FN(<COLUMN>) as metric,
15 |     <XAXIS_COLUMN> as x,
16 |     <COLOR_COLUMN> as color,
17 | FROM
18 |     <TABLE>
19 | GROUP BY
20 | x, color
21 | ```
22 | where:
23 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc.
24 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column. Must add up to 1 for each bar.
25 | - `XAXIS_COLUMN` is the group you want to show on the x-axis. Make sure this is a categorical or temporal column (not a number)
26 | - `COLOR_COLUMN` is the group you want to show as different colors of your bars.
27 |
28 | # Usage
29 |
30 | In this example with some tennis data, we'll see what percentage of the matches won by the world's top tennis players came on each surface.
31 |
32 | ```sql
33 | -- a
34 | select
35 | count(matches_partitioned.match_num) matches_won,
36 | winner_name player,
37 | surface
38 | from tennis.matches_partitioned
39 | where winner_name in ('Rafael Nadal','Serena Williams', 'Roger Federer','Novak Djokovic')
40 | group by player, surface
41 | ```
42 |
43 | |matches_won| player| surface|
44 | |-----------|-------|--------|
45 | |95| Novak Djokovic | Grass|
46 | |621| Novak Djokovic | Hard |
47 | |230 | Novak Djokovic | Clay|
48 | |9 | Novak Djokovic | Carpet|
49 | | ... | ... | ... |
50 |
51 |
52 | ```sql
53 | -- b
54 | select
55 | count(matches_partitioned.match_num) matches_played,
56 | winner_name player
57 | from tennis.matches_partitioned
58 | where winner_name in ('Rafael Nadal','Serena Williams', 'Roger Federer','Novak Djokovic')
59 | group by player
60 | ```
61 | | matches_played | player |
62 | | -------------- | --------------- |
63 | | 955 | Novak Djokovic |
64 | | 1250 | Roger Federer |
65 | | 1016 | Rafael Nadal |
66 | | 846 | Serena Williams |
67 |
68 | ```sql
69 | select
70 | a.player,
71 | a.surface,
72 | a.matches_won / b.matches_played per_matches_won
73 | from
74 | a left join b on a.player = b.player
75 | ```
76 |
77 |
--------------------------------------------------------------------------------
/bigquery/array-to-columns.md:
--------------------------------------------------------------------------------
1 | # Array to Column
2 | Explore this snippet with some demo data [here](https://count.co/n/pQ914qtFlKo?vm=e).
3 |
4 | # Description
5 | Taking a BigQuery ARRAY and creating a column for each value is commonly used when working with 'wide' data. The following snippet will let you quickly take an array column and create a row for each value.
6 |
7 | ```sql
8 | SELECT
9 |   <columns>, array_value
10 | FROM
11 |   `project.schema.table`
12 | CROSS JOIN UNNEST(<array_column>) as array_value
13 | ```
14 | where:
15 | - `<columns>` is a list of the columns from the original table you want to keep in your query results
16 | - `<array_column>` is the name of the ARRAY column
17 |
18 |
19 | # Usage
20 | The following example shows how to 'explode' 2 rows with 3 array elements each into a separate row for each array entry:
21 |
22 | ```sql
23 | WITH demo_data as (
24 | select 1 row_index,[1,3,4] array_col union all
25 | select 2 row_index, [3,6,9] array_col
26 | )
27 | select demo_data.row_index, array_value
28 | FROM demo_data
29 | cross join UNNEST(array_col) as array_value
30 | ```
31 | | row_index | array_value |
32 | | --- | ----------- |
33 | | 1 | 1 |
34 | | 1 | 3 |
35 | | 1 | 4 |
36 | | 2 | 3 |
37 | | 2 | 6 |
38 | | 2 | 9 |
39 |
40 | # References & Helpful Links:
41 | - [UNNEST in BigQuery](https://count.co/sql-resources/bigquery-standard-sql/unnest)
42 | - [Arrays in Standard SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays)
43 |
--------------------------------------------------------------------------------
/bigquery/bar-chart.md:
--------------------------------------------------------------------------------
1 | # Bar chart
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/KCUlCzlJDHK?vm=e).
4 |
5 | # Description
6 |
7 | Bar charts are perhaps the most common charts used in data analysis. They have one categorical or temporal axis, and one numerical axis.
8 | You just need a simple GROUP BY to get all the elements you need for a bar chart:
9 |
10 | ```sql
11 | SELECT
12 |   AGG_FN(<COLUMN>) as metric,
13 |   <CATEGORICAL_COLUMN> as cat
14 | FROM
15 |   <TABLE>
16 | GROUP BY
17 | cat
18 | ```
19 |
20 | where:
21 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc.
22 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column.
23 | - `CATEGORICAL_COLUMN` is the group you want to show on the x-axis. Make sure this is a categorical or temporal column (not a number)
24 |
25 | # Usage
26 |
27 | In this example with some Spotify data, we'll look at the average daily streams by month of year:
28 |
29 | ```sql
30 | select
31 | avg(streams) avg_daily_streams,
32 | format_date('%m - %b', day) month
33 | from (
34 | select sum(streams) streams,
35 | day
36 | from spotify.spotify_daily_tracks
37 | group by day
38 | )
39 | group by month
40 | ```
41 |
42 |
--------------------------------------------------------------------------------
/bigquery/binomial-distribution.md:
--------------------------------------------------------------------------------
1 | # Generating a binomial distribution
2 |
3 | Explore this snippet [here](https://count.co/n/F6GnYbjyIct?vm=e)
4 |
5 | # Description
6 |
7 | Once it is possible to generate a uniform distribution of numbers, it is possible to generate any other distribution. This snippet shows how to generate a sample of numbers that are binomially distributed using rejection sampling.
8 |
9 | > Note - rejection sampling is simple to implement, but potentially very inefficient
10 |
11 | The steps here are:
12 | - Generate a random list of `k` = 'number of successes' from n trials, with each trial having an expected `p` = 'probability of success'.
13 | - For each `k`, calculate the binomial probability of `k` successes using the [binomial formula](https://en.wikipedia.org/wiki/Binomial_distribution), and another random number `u` between zero and one.
14 | - If `u` is smaller than the expected probability, accept the value of `k`, otherwise reject it
15 |
16 | ```sql
17 | with params as (
18 | select
19 | 0.3 as p, -- Probability of success per trial
20 | 20 as n, -- Number of trials
21 | 100000 as num_samples -- Number of samples to test (the actual samples will be fewer)
22 | ),
23 | samples as (
24 | select
25 | -- Generate a random integer k = 'number of successes' in [0, n]
26 | mod(abs(farm_fingerprint(cast(indices as string))), n + 1) as k,
27 | -- Generate a random number u uniformly distributed in [0, 1]
28 | mod(abs(farm_fingerprint(cast(indices as string))), 1000000) / 1000000 as u
29 | from
30 | params,
31 | unnest(generate_array(1, num_samples)) as indices
32 | ),
33 | probs as (
34 | select
35 | -- Hack in a factorial function for n! / (k! * (n - k)!)
36 | (select exp(sum(ln(i))) from unnest(if(n = 0, [1], generate_array(1, n))) as i) /
37 | (select exp(sum(ln(i))) from unnest(if(k = 0, [1], generate_array(1, k))) as i) /
38 | (select exp(sum(ln(i))) from unnest(if(n = k, [1], generate_array(1, n - k))) as i) *
39 | pow(p, k) * pow(1 - p, n - k) as threshold,
40 | k,
41 | u
42 | from params, samples
43 | ),
44 | after_rejection as (
45 | -- If the random number for this value of k is too high, reject it
46 | -- Note - the binomial PDF is normalised to a max value of 1 to improve the sampling efficiency
47 | select k from probs where u <= threshold / (select max(threshold) from probs)
48 | )
49 |
50 | select * from after_rejection
51 | ```
52 |
53 | | k |
54 | | --- |
55 | | 7 |
56 | | 1 |
57 | | 6 |
58 | | ... |
--------------------------------------------------------------------------------
/bigquery/cdf.md:
--------------------------------------------------------------------------------
1 | # Cumulative distribution functions
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/sL7bEFcg11Z?vm=e).
4 |
5 | # Description
6 | Cumulative distribution functions (CDF) are a method for analysing the distribution of a quantity, similar to histograms. They show, for each value of a quantity, what fraction of rows are smaller or greater.
7 | One method for calculating a CDF is as follows:
8 |
9 | ```sql
10 | select
11 | -- Use a row_number window function to get the position of this row
12 |   (row_number() over (order by <quantity> asc)) / (select count(*) from <table>) cdf,
13 |   <quantity>,
14 | from <table>
15 | ```
16 | where
17 | - `quantity` - the column containing the metric of interest
18 | - `table` - the table name
19 | # Example
20 |
21 | Using total Spotify streams as an example data source, let's identify:
22 | - `table` - this is called `raw`
23 | - `quantity` - this is the column `streams`
24 |
25 | then the query becomes:
26 |
27 | ```sql
28 | select
29 | (row_number() over (order by streams asc)) / (select count(*) from raw) frac,
30 | streams,
31 | from raw
32 | ```
33 | | frac | streams |
34 | | ----------- | ----------- |
35 | | 0 | 374359 |
36 | | 0 | 375308 |
37 | | ... | ... |
38 | | 1 | 7778950 |
39 |
--------------------------------------------------------------------------------
/bigquery/cohort-analysis.md:
--------------------------------------------------------------------------------
1 | # Cohort Analysis
2 | View interactive notebook [here](https://count.co/n/INCbY1QUomH?vm=e).
3 |
4 |
5 | # Description
6 | A cohort analysis groups users based on certain traits and compares their behavior across groups and over time.
7 | The most common way of grouping users is by when they first joined. This definition will change for each company and even use case but it's generally taken to be the day or week that a user joined. All users who joined at the same time are grouped together. Again, this can be made more complex if needed.
8 | The simplest way to start comparing cohorts is seeing if they returned to your site/product in the weeks after they joined. You can continue to refine what patterns of behavior you're looking for.
9 | But for this example, we'll do the following:
10 | - Assign users a cohort based on the week they created their account
11 | - For each cohort determine the % of users that returned for the 10 weeks after joining
12 | ## The Data
13 | - The way your data is structured will greatly determine how you do your analysis. Most commonly, there's a transaction table that has timestamped events such as 'create account' or 'purchase' for each user. This is what we'll use for our example.
14 | - The data comes from [BigQuery's Example Google Analytics dataset](https://console.cloud.google.com/bigquery).
15 | ## The Code
16 | - In this example I've gone step by step, but to do it all in one query you could try:
17 |
18 | ```sql
19 | WITH cohorts as
20 | (
21 | SELECT
22 | fullVisitorId,
23 | date_trunc(first_value(date) over (partition by fullVisitorId order by date asc),week) cohort
24 | FROM data
25 | ),
26 | activity as
27 | (
28 | SELECT
29 | cohorts.cohort,
30 | date_diff(date,cohort,week) weeks_since_joining,
31 | count(distinct data.fullVisitorId) visitors
32 | FROM
33 | cohorts
34 | LEFT JOIN data ON cohorts.fullVisitorId = data.fullVisitorId
35 | WHERE
36 | date_diff(date,cohort,week) between 1 and 10
37 | GROUP BY cohort, weeks_since_joining
38 | ),
39 | cohorts_with_weeks as
40 | (
41 | SELECT distinct
42 | cohort,
43 | count(distinct fullVisitorId) cohort_size,
44 | weeks_since_joining
45 | FROM cohorts
46 | CROSS JOIN
47 | UNNEST(generate_array(1,10)) weeks_since_joining
48 | GROUP BY cohort, weeks_since_joining
49 | )
50 | SELECT
51 | cohorts_with_weeks.cohort,
52 | cohorts_with_weeks.weeks_since_joining,
53 | cohorts_with_weeks.cohort_size,
54 | coalesce(activity.visitors,0)/cohorts_with_weeks.cohort_size per_active,
55 | FROM
56 | cohorts_with_weeks
57 | LEFT JOIN activity
58 | ON cohorts_with_weeks.cohort = activity.cohort
59 | AND cohorts_with_weeks.weeks_since_joining = activity.weeks_since_joining
60 | ```
61 |
62 | #### Data Preview
63 | | fullVisitorId | visitNumber | visitId | date | ... | channelGrouping |
64 | | ---- | ----- |----- |---- | ----- | ----- |
65 | |6453870986532490896 | 1 | 1473801936 | 2016-09-13 | ... | Display |
66 | |2287637838474850444 | 5 | 1473815616 | 2016-09-13 | ... | Direct |
67 |
68 | ### 1. Assign each user a cohort based on the week they first appeared
69 | ```sql
70 | --cohorts
71 | select
72 | fullVisitorId,
73 | date_trunc(first_value(date) over (partition by fullVisitorId order by date asc),week) cohort
74 | from data
75 | ```
76 | |fullVisitorId | cohort |
77 | | ---- | ---- |
78 | | 0002871498069867123 | 21/08/2016 |
79 | | 0020502892485275044 | 11/09/2016 |
80 | |.... | .... |
81 |
82 |
83 | ### 2. For each cohort find the number of users that have returned for weeks 1-10
84 | ```sql
85 | --activity
86 | select
87 | cohorts.cohort,
88 | date_diff(date,cohort,week) weeks_since_joining,
89 | count(distinct data.fullVisitorId) visitors
90 | from
91 | cohorts
92 | left join data on cohorts.fullVisitorId = data.fullVisitorId
93 | where date_diff(date,cohort,week) between 1 and 10
94 | group by
95 | cohort, weeks_since_joining
96 | ```
97 |
98 | |cohort | weeks_since_joining | visitors|
99 | |----| --- | ---- |
100 | |2016-07-31| 2 | 7 |
101 | |2016-07-31| 1 | 6 |
102 | |2016-07-31| 9 | 1 |
103 | |2016-07-31| 5 | 1 |
104 | |.... | .... | .... |
105 |
106 |
107 | ### 3. Turn that into a percentage of users each week (being sure to include weeks where no users returned)
108 | ```sql
109 | --cohorts_with_weeks
110 | select distinct
111 | cohort,
112 | count(distinct fullVisitorId) cohort_size,
113 | weeks_since_joining
114 | from cohorts
115 | cross join
116 | unnest(generate_array(1,10)) weeks_since_joining
117 | group by cohort, weeks_since_joining
118 | ```
119 | |cohort| cohort_size | weeks_since_joining |
120 | |----- |----- |----- |
121 | | 2016-08-21| 263 | 1 |
122 | |2016-08-21| 263 | 2 |
123 | |2016-08-21| 263 | 3 |
124 | |2016-08-21| 263 | 4 |
125 | |2016-08-21| 263 | 5 |
126 | |2016-08-21| 263 | 6 |
127 | |2016-08-21| 263 | 7 |
128 | |2016-08-21| 263 | 8 |
129 | |2016-08-21| 263 | 9 |
130 | |2016-08-21| 263 | 10 |
131 | |...| ... | ... |
132 |
133 | ```sql
134 | --cohort_table
135 | Select
136 | cohorts_with_weeks.cohort,
137 | cohorts_with_weeks.cohort_size,
138 | cohorts_with_weeks.weeks_since_joining,
139 | coalesce(activity.visitors,0)/cohorts_with_weeks.cohort_size per_active,
140 | from
141 | cohorts_with_weeks
142 | left join activity on cohorts_with_weeks.cohort = activity.cohort and cohorts_with_weeks.weeks_since_joining = activity.weeks_since_joining
143 | ```
144 | |cohort| cohort_size| weeks_since_joining | per_active |
145 | | ---- |---- | ---- | ---- |
146 | | 2016-08-21| 263 | 1 | 0.015 |
147 | |2016-08-21| 263 | 2 | 0.019 |
148 | |2016-08-21| 263 | 3 | 0.023 |
149 | | ... | ... | ... | .... |
150 |
151 | Now we can see our cohort table.
152 |
153 |
154 |
--------------------------------------------------------------------------------
/bigquery/compound-growth-rates.md:
--------------------------------------------------------------------------------
1 | # Compound growth rates
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/Ff9O9L7JR53?vm=e).
4 |
5 | # Description
6 | When a quantity is increasing in time by the same fraction every period, it is said to exhibit **compound growth** - every time period it increases by a larger amount.
7 | It is possible to show that the behaviour of the quantity over time follows the law
8 |
9 | ```sql
10 | x(t) = x(0) * exp(r * t)
11 | ```
12 | where
13 | - `x(t)` is the value of the quantity now
14 | - `x(0)` is the value of the quantity at time t = 0
15 | - `t` is time in units of [time_period]
16 | - `r` is the growth rate in units of 1 / [time_period]
17 |
18 |
19 | # Calculating growth rates
20 | From the equation above, to calculate a growth rate you need to know:
21 | - The value at the start of a period
22 | - The value at the end of a period
23 | - The duration of the period
24 | Then, dividing both sides by x(0) and taking the natural logarithm, the growth rate is given by
25 |
26 | ```sql
27 | r = ln(x(t) / x(0)) / t
28 | ```
29 |
30 | For a concrete example, assume a table with columns:
31 | - `num_this_month` - this is `x(t)`
32 | - `num_last_month` - this is `x(0)`
33 | - `this_month` & `last_month` - `t` (in years) = `datetime_diff(this_month, last_month, month) / 12`
34 |
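One way such a `with_previous_month` table could be constructed is with the `LAG` window function. This is a minimal sketch only, assuming a hypothetical `monthly_totals` source table with `month` and `num` columns; substitute your own table and column names:

```sql
-- with_previous_month (illustrative sketch; monthly_totals is a hypothetical source table)
select
  month as this_month,
  num as num_this_month,
  lag(month) over (order by month) as last_month,
  lag(num) over (order by month) as num_last_month
from monthly_totals
```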
35 |
36 | #### Inferred growth rates
37 |
38 | In the following then, `growth_rate` is the equivalent yearly growth rate for that month:
39 |
40 | ```sql
41 | -- with_growth_rate
42 | select
43 | *,
44 | ln(num_this_month / num_last_month) * 12 / datetime_diff(this_month, last_month, month) growth_rate
45 | from with_previous_month
46 | ```
47 |
48 |
49 | # Using growth rates
50 | To better depict the effects of a given growth rate, it can be converted to a year-on-year growth factor by inserting into the exponential formula above with t = 1 (one year duration).
51 | Or mathematically,
52 |
53 | ```sql
54 | yoy_growth = x(1) / x(0) = exp(r)
55 | ```
56 | It is always sensible to spot-check what a growth rate actually represents to decide whether or not it is a useful metric.
57 |
58 | #### Extrapolated year-on-year growth
59 |
60 | ```sql
61 | -- with_yoy_growth
62 | select
63 | *,
64 | exp(growth_rate) as yoy_growth
65 | from with_growth_rate
66 | ```
67 | | num_this_month | num_last_month | growth_rate | yoy_growth |
68 | | ----------- | ----------- | ----------- | ----------- |
69 | | 343386934 | 361476484 | -0.6160690738834281 | 0.5400632190921264 |
70 | | 1151727105 | 343386934 | 14.521920316267764 | 2026701.8316519475 |
71 |
--------------------------------------------------------------------------------
/bigquery/concat-nulls.md:
--------------------------------------------------------------------------------
1 | # Concatenate text handling NULLs
2 |
3 | ## Description
4 |
5 | If you concatenate values together and one of them is NULL, then the whole result is set to NULL:
6 |
7 | ```sql
8 | SELECT CONCAT('word', ',', 'word2', ',', 'word3'), CONCAT('word', ',', NULL, ',', 'word3');
9 | ```
10 | ```
11 | +----------------+----+
12 | |f0_ |f1_ |
13 | +----------------+----+
14 | |word,word2,word3|NULL|
15 | +----------------+----+
16 | ```
17 |
18 | If you want to ignore NULLs, you can use arrays and ARRAY_TO_STRING:
19 | 
20 | ```sql
21 | SELECT ARRAY_TO_STRING(['word', 'word2', 'word3'], ','), ARRAY_TO_STRING(['word', NULL, 'word3'], ',');
22 | ```
23 | ```
24 | +----------------+----------+
25 | |f0_ |f1_ |
26 | +----------------+----------+
27 | |word,word2,word3|word,word3|
28 | +----------------+----------+
29 | ```
30 |
--------------------------------------------------------------------------------
/bigquery/convert-datetimes-string.md:
--------------------------------------------------------------------------------
1 | # Converting dates/times to strings
2 |
3 | Explore this snippet [here](https://count.co/n/p72zrP81Vye?vm=e).
4 |
5 | # Description
6 | It is possible to convert from a `DATE` / `DATETIME` / `TIMESTAMP` / `TIME` data type into a `STRING` using the `CAST` function, and BigQuery will choose a default string format.
7 | For more control over the output format use one of the `FORMAT_*` functions:
8 |
9 | ```sql
10 | with dt as (
11 | select
12 | current_date() as date,
13 | current_time() as time,
14 | current_datetime() as datetime,
15 | current_timestamp() as timestamp
16 | )
17 |
18 | select
19 | cast(timestamp as string) default_format,
20 | format_date('%Y/%d/%m', datetime) day_first,
21 | format_datetime('%A', date) weekday,
22 | format_datetime('%r', datetime) time_of_day,
23 | format_time('%H', time) hour,
24 | format_timestamp('%Z', timestamp) timezone
25 | from dt
26 | ```
--------------------------------------------------------------------------------
/bigquery/convert-string-datetimes.md:
--------------------------------------------------------------------------------
1 | # Converting strings to dates/times
2 |
3 | Explore this snippet [here](https://count.co/n/wvs19LPsUrQ?vm=e).
4 |
5 | # Description
6 | BigQuery supports the `DATE` / `DATETIME` / `TIMESTAMP` / `TIME` data types, all of which have different `STRING` representations.
7 |
8 | ## CAST
9 | If a string is formatted correctly, the `CAST` function will convert it into the appropriate type:
10 |
11 | ```sql
12 | select
13 | cast('2017-06-04 14:44:00' AS datetime) AS datetime,
14 | cast('2017-06-04 14:44:00 Europe/Berlin' AS timestamp) AS timestamp,
15 | cast('2017-06-04' AS date) AS date,
16 | cast( '14:44:00' as time) time
17 | ```
18 |
19 |
20 | ## PARSE_*
21 | Alternatively, if the format of the string is non-standard but consistent, it is possible to define a custom string format using one of the `PARSE_DATE` / `PARSE_DATETIME` etc. functions:
22 |
23 | ```sql
24 | select
25 | parse_datetime(
26 | '%A %B %d, %Y %H:%M:%S',
27 | 'Thursday January 7, 2021 12:04:33'
28 | ) as parsed_datetime
29 | ```
--------------------------------------------------------------------------------
/bigquery/count-nulls.md:
--------------------------------------------------------------------------------
1 | # Count NULLs
2 |
3 | Explore this snippet [here](https://count.co/n/vGNUC7SOJei?vm=e).
4 |
5 | # Description
6 |
7 | Part of the data cleaning process involves understanding the quality of your data. NULL values are usually best avoided, so counting their occurrences is a common operation.
8 | There are several methods that can be used here:
9 | - `sum(if(<column> is null, 1, 0))` - use the [`IF`](https://cloud.google.com/bigquery/docs/reference/standard-sql/conditional_expressions#if) function to return 1 or 0 if a value is NULL or not respectively, then aggregate.
10 | - `count(*) - count(<column>)` - use the different forms of the [`count()`](https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions#count) aggregation which include and exclude NULLs.
11 | - `sum(case when x is null then 1 else 0 end)` - similar to the IF method, but using a [`CASE`](https://cloud.google.com/bigquery/docs/reference/standard-sql/conditional_expressions#case_expr) statement instead.
12 |
13 | ```sql
14 | with data as (
15 | select * from unnest([1, 2, null, null, 5]) as x
16 | )
17 |
18 | select
19 | sum(if(x is null, 1, 0)) with_if,
20 | count(*) - count(x) with_count,
21 | sum(case when x is null then 1 else 0 end) with_case
22 | from data
23 | ```
24 |
25 | | with_if | with_count | with_case |
26 | | ------- | ---------- | --------- |
27 | | 2 | 2 | 2 |
--------------------------------------------------------------------------------
/bigquery/filtering-arrays.md:
--------------------------------------------------------------------------------
1 | # Filtering arrays
2 |
3 | Explore this snippet [here](https://count.co/n/YvwEswZfHtP?vm=e).
4 |
5 | # Description
6 | The ARRAY function in BigQuery will accept an entire subquery, which can be used to filter an array:
7 |
8 | ```sql
9 | select array(
10 | select x
11 | from unnest((select [1, 2, 3, 4])) as x
12 | where x <= 3 -- Filter predicate
13 | ) as filtered
14 | ```
--------------------------------------------------------------------------------
/bigquery/first-row-of-group.md:
--------------------------------------------------------------------------------
1 | # Find the first row of each group
2 |
3 | Explore this snippet [here](https://count.co/n/1KZI8Y4aAX5?vm=e).
4 |
5 | # Description
6 |
7 | Finding the first value in a column partitioned by some other column is simple in SQL, using the GROUP BY clause. Returning the rest of the columns corresponding to that value is a little trickier. Here's one method, which relies on the RANK window function:
8 |
9 | ```sql
10 | with and_ranking as (
11 | select
12 | *,
13 |     rank() over (partition by <partition> order by <ordering>) ranking
14 |   from <table>
15 | )
16 | select * from and_ranking where ranking = 1
17 | ```
18 | where
19 | - `partition` - the column(s) to partition by
20 | - `ordering` - the column(s) which determine the ordering defining 'first'
21 | - `table` - the source table
22 |
23 | The CTE `and_ranking` adds a column to the original table called `ranking`, which is the order of that row within the groups defined by `partition`. Filtering for `ranking = 1` picks out those 'first' rows.
24 |
25 | # Example
26 |
27 | Using total Spotify streams as an example data source, we can pick the rows of the table corresponding to the most-streamed song per day. Filling in the template above:
28 | - `partition` - this is just `day`
29 | - `ordering` - this is `streams desc`
30 | - `table` - this is `spotify.spotify_daily_tracks`
31 |
32 | ```sql
33 | with and_ranking as (
34 | select
35 | *,
36 | rank() over (partition by day order by streams desc) ranking
37 | from spotify.spotify_daily_tracks
38 | )
39 | select day, title, artist, streams from and_ranking where ranking = 1
40 | ```
--------------------------------------------------------------------------------
/bigquery/generating-arrays.md:
--------------------------------------------------------------------------------
1 | # Generating arrays
2 |
3 | Explore this snippet [here](https://count.co/n/2avxdwBFCGz?vm=e).
4 |
5 | # Description
6 | BigQuery supports the ARRAY column type - here are a few methods for generating arrays.
7 | ## Literals
8 | Arrays can contain most types, though all elements must be derived from the same supertype.
9 |
10 | ```sql
11 | select
12 | [1, 2, 3] as num_array,
13 | ['a', 'b', 'c'] as str_array,
14 | [current_timestamp(), current_timestamp()] as timestamp_array,
15 | ```
16 |
17 |
18 | ## GENERATE_* functions
19 | Analogous to `numpy.linspace()`.
20 |
21 | ```sql
22 | select
23 | generate_array(0, 10, 2) AS even_numbers,
24 | generate_date_array(CAST('2020-01-01' AS DATE), CAST('2020-03-01' AS DATE), INTERVAL 1 MONTH) AS month_starts
25 | ```
26 |
27 |
28 | ## ARRAY_AGG
29 | Aggregate a column into a single row element.
30 |
31 | ```sql
32 | SELECT ARRAY_AGG(num)
33 | FROM (SELECT 1 num UNION ALL SELECT 2 num UNION ALL SELECT 3 num)
34 | ```
--------------------------------------------------------------------------------
/bigquery/geographical-distance.md:
--------------------------------------------------------------------------------
1 | # Distance between longitude/latitude pairs
2 |
3 | Explore this snippet [here](https://count.co/n/BcHM84o8uYb?vm=e).
4 |
5 | # Description
6 |
7 | It is a surprisingly difficult problem to calculate an accurate distance between two points on the Earth, both because of the spherical geometry and the uneven shape of the ground.
8 | Fortunately BigQuery has robust support for geographical primitives - the [`ST_DISTANCE`](https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_distance) function can calculate the distance in metres between two [`ST_GEOGPOINT`](https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_geogpoint)s:
9 |
10 | ```sql
11 | with points as (
12 | -- Longitudes and latitudes are in degrees
13 | select 10 as from_long, 12 as from_lat, 15 as to_long, 17 as to_lat
14 | )
15 |
16 | select
17 | st_distance(
18 | st_geogpoint(from_long, from_lat),
19 | st_geogpoint(to_long, to_lat)
20 | ) as dist_in_metres
21 | from points
22 | ```
23 |
24 | | dist_in_metres |
25 | | -------------- |
26 | | 773697.017019 |
--------------------------------------------------------------------------------
/bigquery/get-last-array-element.md:
--------------------------------------------------------------------------------
1 | # Get the last element of an array
2 |
3 | Explore this snippet [here](https://count.co/n/f4moFjZCYNV?vm=e).
4 |
5 | # Description
6 | There are two methods to retrieve the last element of an array - calculate the array length, or reverse the array and select the first element:
7 |
8 | ```sql
9 | select
10 | nums[ordinal(array_length(nums))] as method_1,
11 | array_reverse(nums)[ordinal(1)] as method_2,
12 | from (select [1, 2, 3, 4] as nums)
13 | ```
--------------------------------------------------------------------------------
/bigquery/heatmap.md:
--------------------------------------------------------------------------------
1 | # Heat maps
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/ee5fRPdUrIQ?vm=e).
4 |
5 | # Description
6 |
7 | Heatmaps are a great way to visualize the relationship between two variables enumerated across all possible combinations of each variable. To create one, your data must contain
8 | - 2 categorical columns (i.e. not numeric columns)
9 | - 1 numerical column
10 |
11 | ```sql
12 | SELECT
13 |   AGG_FN(<COLUMN>) as metric,
14 |   <CAT_COLUMN1> as cat1,
15 |   <CAT_COLUMN2> as cat2,
16 | FROM
17 |   <TABLE>
18 | GROUP BY
19 | cat1, cat2
20 | ```
21 | where:
22 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc.
23 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column.
24 | - `CAT_COLUMN1` and `CAT_COLUMN2` are the groups you want to compare. One will go on the x axis and the other on the y-axis. Make sure these are categorical or temporal columns and not numerical fields.
25 |
26 | # Usage
27 |
28 | In this example with some Spotify data, we'll see the average daily streams across different days of the week and months of the year.
29 |
30 | ```sql
31 | -- a
32 | select sum(streams) daily_streams, day from spotify.spotify_daily_tracks group by day
33 | ```
34 | | daily_streams | day |
35 | | ------------- | ---------- |
36 | | 234109876 | 2020-10-25 |
37 | | 243004361 | 2020-10-26 |
38 | | 248233045 | 2020-10-27 |
39 |
40 | ```sql
41 | select
42 | avg(a.daily_streams) avg_daily_streams,
43 | format_date('%m - %b',day) month,
44 | format_date('%w %A',day) day_of_week
45 | from a
46 | group by day_of_week, month
47 | ```
48 |
49 |
--------------------------------------------------------------------------------
/bigquery/histogram-bins.md:
--------------------------------------------------------------------------------
1 | # Histogram Bins
2 | Test out this snippet with some demo data [here](https://count.co/n/J7s46HSTsrm?vm=e).
3 |
4 | # Description
5 | Creating histograms using SQL can be tricky. This snippet will let you quickly create the bins needed to create a histogram:
6 |
7 | ```sql
8 | SELECT
9 |     COUNT(1) / (SELECT COUNT(1) FROM <table>) percentage_of_results,
10 |     FLOOR(<column> / <bin_size>) * <bin_size> as bin
11 | FROM
12 |     <table>
13 | GROUP BY
14 | bin
15 | ```
16 | where:
17 | - `<column>` is the column you want to turn into a histogram
18 | - `<bin_size>` is the width of each bin. You can adjust this to get the desired level of detail for the histogram.
19 |
20 |
21 | # Usage
22 | To use, adjust the bin-size until you can see the 'shape' of the data. The following example finds the percentage of whiskies by 5-point review groupings. The reviews are on a 0-100 scale:
23 |
24 | ```sql
25 | SELECT
26 | count(*) / (select count(1) from whisky.scotch_reviews) percentage_of_results,
27 | floor(scotch_reviews.review_point / 5) * 5 bin
28 | from whisky.scotch_reviews
29 | group by bin
30 | ```
31 | | percentage_of_results | bin |
32 | | --- | ----------- |
33 | | 0.005 | 70 |
34 | | 0.03 | 75 |
35 | | ... | ... |
36 | | 0.016 | 96 |
37 |
--------------------------------------------------------------------------------
/bigquery/horizontal-bar.md:
--------------------------------------------------------------------------------
1 | # Horizontal bar chart
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/w4I0DgqsPT2?vm=e).
4 |
5 | # Description
6 |
7 | Horizontal Bar charts are ideal for comparing a particular metric across a group of values. They have bars along the x-axis and group names along the y-axis.
8 | You just need a simple GROUP BY to get all the elements you need for a horizontal bar chart:
9 |
10 | ```sql
11 | SELECT
12 |   AGG_FN(<COLUMN>) as metric,
13 |   <GROUP_COLUMN> as group
14 | FROM
15 |   <TABLE>
16 | GROUP BY
17 | group
18 | ```
19 |
20 | where:
21 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc.
22 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column.
23 | - `GROUP_COLUMN` is the group you want to show on the y-axis. Make sure this is a categorical column (not a number)
24 |
25 | # Usage
26 |
27 | In this example with some Spotify data, we'll compare the total streams by artist:
28 |
29 | ```sql
30 | select
31 | sum(streams) streams,
32 | artist
33 | from spotify.spotify_daily_tracks
34 | group by artist
35 | ```
36 |
37 |
--------------------------------------------------------------------------------
/bigquery/json-strings.md:
--------------------------------------------------------------------------------
1 | # Extracting values from JSON strings
2 |
3 | Explore this snippet [here](https://count.co/n/P88ejfuK7iv?vm=e).
4 |
5 | # Description
6 | BigQuery has a rich set of functionality for working with JSON - see any function with [JSON in the name](https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions).
7 | Most of the BigQuery JSON functions require the use of a [JSONPath](https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#JSONPath_format) parameter, which defines how to access the part of the JSON object that is required.
8 | ## JSON_QUERY
9 | This function simply returns the JSON-string representation of the part of the object that is requested:
10 |
11 | ```sql
12 | with json as (select '''{
13 | "a": 1,
14 | "b": "bee",
15 | "c": [
16 | 4,
17 | 5,
18 | { "d": [6, 7] }
19 | ]
20 | }''' as text)
21 |
22 | select
23 | -- Note - all of these result columns are strings, and are formatted as JSON
24 | json_query(text, '$') as root,
25 | json_query(text, '$.a') as a,
26 | json_query(text, '$.b') as b,
27 | json_query(text, '$.c') as c,
28 | json_query(text, '$.c[0]') as c_first,
29 | json_query(text, '$.c[2]') as c_third,
30 | json_query(text, '$.c[2].d') as d,
31 | from json
32 | ```
33 |
34 |
35 | ## JSON_VALUE
36 | This function is similar to `JSON_QUERY`, but tries to convert some values to native data types - namely strings, integers and booleans. If the accessed part of the JSON object is not a simple data type, `JSON_VALUE` returns null:
37 |
38 | ```sql
39 | with json as (select '''{
40 | "a": 1,
41 | "b": "bee",
42 | "c": [
43 | 4,
44 | 5,
45 | { "d": [6, 7] }
46 | ]
47 | }''' as text)
48 |
49 | select
50 | json_value(text, '$') as root, -- = null
51 | json_value(text, '$.a') as a,
52 | json_value(text, '$.b') as b,
53 | json_value(text, '$.c') as c, -- = null
54 | json_value(text, '$.c[0]') as c_first,
55 | json_value(text, '$.c[2]') as c_third, -- = null
56 | json_value(text, '$.c[2].d') as d, -- = null
57 | from json
58 | ```
--------------------------------------------------------------------------------
/bigquery/last-day.md:
--------------------------------------------------------------------------------
1 | # Last day of the month
2 |
3 | ## Description
4 |
5 | In some SQL dialects there is a LAST_DAY function, but there isn’t one in BigQuery.
6 | We can achieve the same by taking the day before the first day of the following month.
7 |
8 | 1. `DATE_ADD(<date>, INTERVAL 1 MONTH)` sets the reference date to the following month.
9 | 2. `DATE_TRUNC(<date>, MONTH)` gets the first day of that month.
10 | 3. `DATE_SUB(<date>, INTERVAL 1 DAY)` gets the day before. `-1` can be used as well.
11 |
12 | ## Example: Last day of the current month (local time)
13 |
14 | ```sql
15 | SELECT
16 | DATE_TRUNC(DATE_ADD(CURRENT_DATE('Europe/Madrid'), INTERVAL 1 MONTH), MONTH) - 1
17 | ```
18 | If no time zone is specified as an argument to CURRENT_DATE, the default time zone, UTC, is used.
19 |
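The same approach works for an arbitrary date column. A minimal sketch, assuming a hypothetical `orders` table with an `order_date` DATE column:

```sql
SELECT
  order_date,
  -- Following the three steps above: add a month, truncate to the month, subtract a day
  DATE_SUB(DATE_TRUNC(DATE_ADD(order_date, INTERVAL 1 MONTH), MONTH), INTERVAL 1 DAY) AS last_day_of_month
FROM orders
```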
--------------------------------------------------------------------------------
/bigquery/latest-row.md:
--------------------------------------------------------------------------------
1 |
2 | # Latest Row (deduplicate)
3 | View interactive snippet [here](https://count.co/n/zFoUJuIYOrC?vm=e).
4 |
5 | ## Query to return only the latest row based on a column or columns
6 |
7 | With BigQuery it is common to have duplicates in your tables, especially datalake ones, so we often need to remove duplicates and keep only the latest version of each row. We can use the ROW_NUMBER window (analytic) function to number the records sharing the same key, then use QUALIFY to return only the first row when ordered with the latest first.
8 |
9 | Here we have some rows with duplicates. I've added the ROW_NUMBER so we can see the values it returns.
10 |
11 | ```sql
12 | WITH
13 | sample_data AS (
14 | SELECT 1 AS product_id,'widgit' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
15 | UNION ALL
16 | SELECT 2 AS product_id,'borble' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
17 | UNION ALL
18 | SELECT 3 AS product_id,'connector' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
19 | UNION ALL
20 | SELECT 1 AS product_id,'widgit' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-02-01' AS modified_date
21 | UNION ALL
22 | SELECT 2 AS product_id,'borble' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-03-01' AS modified_date )
23 | SELECT
24 | *, ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY modified_date DESC, created_date DESC) as row_no
25 | FROM
26 | sample_data
27 | ```
28 |
29 | We can use QUALIFY to return only a single row per key. If we wrap the deduplication in a named subquery/CTE, we can work with the deduplicated data as if it were just a table.
30 | The `product_id` can be replaced with a list of columns to specify the unique rows (see the sketch after the query below).
31 | ```sql
32 | WITH
33 | sample_data AS (
34 | SELECT 1 AS product_id,'widgit' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
35 | UNION ALL
36 | SELECT 2 AS product_id,'borble' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
37 | UNION ALL
38 | SELECT 3 AS product_id,'connector' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-01-01' AS modified_date
39 | UNION ALL
40 | SELECT 1 AS product_id,'widgit' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-02-01' AS modified_date
41 | UNION ALL
42 | SELECT 2 AS product_id,'borble' AS description,'Active' AS status,'2019-01-01' AS created_date,'2021-03-01' AS modified_date
43 | ),
44 | data_deduped as (
45 | SELECT
46 | *
47 | FROM
48 | sample_data
49 | WHERE
50 | 1=1 QUALIFY ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY modified_date DESC, created_date DESC)=1
51 | )
52 | SELECT
53 | *
54 | FROM
55 | data_deduped
56 | ```
57 |
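As noted above, `product_id` can be swapped for a composite key. A minimal self-contained sketch (the sample rows and the extra `status` grouping here are purely illustrative):

```sql
WITH sample_data AS (
  SELECT 1 AS product_id, 'Active' AS status, '2019-01-01' AS created_date, '2021-01-01' AS modified_date
  UNION ALL
  SELECT 1 AS product_id, 'Active' AS status, '2019-01-01' AS created_date, '2021-02-01' AS modified_date
  UNION ALL
  SELECT 1 AS product_id, 'Inactive' AS status, '2019-01-01' AS created_date, '2021-03-01' AS modified_date
)
SELECT
  *
FROM
  sample_data
WHERE
  -- Deduplicate on the composite key (product_id, status), keeping the latest row for each
  1=1 QUALIFY ROW_NUMBER() OVER (PARTITION BY product_id, status ORDER BY modified_date DESC, created_date DESC)=1
```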
58 |
59 | https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
60 | https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#with_clause
61 |
--------------------------------------------------------------------------------
/bigquery/least-array.md:
--------------------------------------------------------------------------------
1 | # Greatest/Least value in an array
2 | Interactive version [here](https://count.co/n/tgXkwcSxl2W?b=L2LtcTSMIQS).
3 |
4 | # Description
5 | GREATEST and LEAST are a helpful way to find the greatest or least value in a row, but doing the same for an array is trickier.
6 | This snippet lets you return the max/min value from an array while maintaining your data shape.
7 | **NOTE: This will exclude NULLS.**
8 |
9 | ```sql
10 | SELECT
11 |   ...,
12 |   MIN(value)
13 | FROM
14 |   <table>
15 | CROSS JOIN UNNEST(<array_column>) as value
16 | GROUP BY ...
17 | ```
18 | where:
19 | - `...` is a list of all the other columns you wish to select
20 | 
21 | - `<array_column>` is the column that contains the array from which you want to select the greatest/least value
22 |
23 | # Example:
24 | ```sql
25 | with demo_data as (
26 | select 1 rownumber, [0,1,NULL,3] arr
27 | union all select 2 rownumber, [1,2,3,4] arr
28 | )
29 | select
30 | rownumber,
31 | min(value) least,
32 | max(value) greatest
33 | from demo_data
34 | cross join unnest(arr) as value
35 | group by rownumber
36 | ```
37 | |rownumber|least | greatest |
38 | |---- |---- |----|
39 | |1 | 0 |3|
40 | |2 | 1 |4|
41 |
--------------------------------------------------------------------------------
/bigquery/least-udf.md:
--------------------------------------------------------------------------------
1 | # UDF to find the least non-null value in an array
2 | View interactive version [here](https://count.co/n/5KMhS4l4xgR?vm=e).
3 |
4 | # Description
5 | LEAST and GREATEST return NULL if any argument is NULL in BigQuery, so finding the least value across a row while excluding NULLs can be done using this UDF:
6 |
7 | ```sql
8 | CREATE OR REPLACE FUNCTION <dataset>.myLeast(x ARRAY<INT64>) AS
9 | ((SELECT MIN(y) FROM UNNEST(x) AS y));
10 | ```
11 | Credit to this [Stackoverflow post](https://stackoverflow.com/questions/43796213/correctly-migrate-postgres-least-behavior-to-bigquery).
12 |
13 | # Example:
14 | ```sql
15 | with demo_data as (
16 | select 1 rownumber, 0 a, null b, -1 c, 1 d union all
17 | select 2 rownumber, 5 a, 0 b, null c, 8 d
18 | )
19 | select rownumber, spotify.myLeast([a,b,c,d]) least_non_null, least(a,b,c,d) least_old_way
20 | from demo_data
21 | ```
22 | |rownumber|least_non_null|least_old_way|
23 | |--- | ---- | ----- |
24 | | 1 | -1 | NULL |
25 | |2 | 0 | NULL |
26 |
--------------------------------------------------------------------------------
/bigquery/localtz-to-utc.md:
--------------------------------------------------------------------------------
1 | # Local Timezone to UTC
2 | Explore this snippet with some demo data [here](https://count.co/n/448Ip23n0Ef?vm=e).
3 |
4 |
5 | # Description
6 | [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) (Coordinated Universal Time) is a universal standard time and can be a great way to compare data across many timezones.
7 | Often we're given time in a local timezone and need to convert it to UTC. To do that in BigQuery we can just use the [TIMESTAMP](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#timestamp) function:
8 |
9 | ```sql
10 | SELECT
11 |   TIMESTAMP(<local_datetime>, <timezone>) as utc_timestamp
12 | ```
13 | where:
14 | - `<local_datetime>` can be a string expression, datetime object or date object.
15 | - `<timezone>` should be one of BigQuery's [time zones](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#timezone_definitions).
16 | # Usage
17 | We're given Euro match data in Pacific time and want to convert it to UTC:
18 |
19 | ```sql
20 | -- match_data
21 | select '2021-06-11 12:00:00' start_time, 'Turkey' Home, 'Italy' Away union all
22 | select '2021-06-12 06:00:00' start_time, 'Wales' Home, 'Switzerland' Away union all
23 | select '2021-06-12 09:00:00' start_time, 'Denmark' Home, 'Finland' Away
24 | ```
25 | |start_time| Home | Away |
26 | |-------| ----- | ------ |
27 | |2021-06-11 12:00:00| Turkey | Italy |
28 | |2021-06-12 06:00:00| Wales | Switzerland |
29 | |2021-06-12 09:00:00| Denmark | Finland |
30 |
31 | ```sql
32 | -- tidied
33 | Select
34 | timestamp(match_data.start_time,'America/Los_Angeles') utc_start,
35 | Home,
36 | Away
37 | from match_data
38 | ```
39 | |utc_start| Home | Away |
40 | |-------| ----- | ------ |
41 | |2021-06-11T19:00:00.000Z| Turkey | Italy |
42 | |2021-06-12T13:00:00.000Z| Wales | Switzerland |
43 | |2021-06-12T16:00:00.000Z| Denmark | Finland |
44 |
--------------------------------------------------------------------------------
/bigquery/median-udf.md:
--------------------------------------------------------------------------------
1 | # MEDIAN UDF
2 | Explore the function with some demo data [here](https://count.co/n/RHmVhHzZpIp?vm=e).
3 |
4 | # Description
5 | BigQuery does not have an explicit `MEDIAN` function. The following snippet will show you how to create a custom function for `MEDIAN` in your BigQuery instance.
6 |
7 |
8 | # UDF (User Defined Function)
9 | Use the following snippet to create your median function in the BigQuery console:
10 |
11 | ```sql
12 | CREATE OR REPLACE FUNCTION <dataset>.median(arr ANY TYPE) AS ((
13 | SELECT IF (
14 | MOD(ARRAY_LENGTH(arr), 2) = 0,
15 | (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
16 | arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
17 | )
18 | FROM
19 | (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
20 | ));
21 | ```
22 |
23 |
24 | # Usage
25 | To call the `MEDIAN` function defined above, use the following:
26 |
27 | ```sql
28 | SELECT <dataset>.median(ARRAY_AGG(<column>)) as median
29 | FROM
30 |   <table>
31 | ### Example
32 | Find the median number of followers for an artist on Spotify:
33 |
34 | ```sql
35 | SELECT
36 | spotify.median(array_agg(followers)) as median_followers
37 | FROM
38 | spotify.spotify_artists
39 | ```
40 | | median_followers |
41 | | --- |
42 | | 1358146.5 |
43 |
--------------------------------------------------------------------------------
/bigquery/median.md:
--------------------------------------------------------------------------------
1 | # MEDIAN
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/7XpUfYUH9p6?vm=e).
4 |
5 | # Description
6 | BigQuery does not have an explicit `MEDIAN` function. The following snippet will show you how to calculate the median of any column.
7 |
8 |
9 | # Snippet ✂️
10 | Use the following snippet to create your median function in the BigQuery console:
11 |
12 | ```sql
13 | PERCENTILE_CONT(<column>, 0.5) OVER() as median
14 | ```
15 |
16 |
17 | # Usage
18 | Use the `MEDIAN` syntax in a query like:
19 |
20 | ```sql
21 | SELECT
22 |   PERCENTILE_CONT(<column>, 0.5) OVER() as median
23 | FROM
24 |   <table>
25 | ```
26 | | median |
27 | | ----------- |
28 | | 1358146.5 |
29 |
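Note that `PERCENTILE_CONT` is an analytic (window) function, so the query above repeats the same median on every row. A minimal single-row sketch, reusing the `spotify.spotify_artists` table and `followers` column from the MEDIAN UDF snippet (those names are assumptions borrowed from that snippet):

```sql
SELECT DISTINCT
  PERCENTILE_CONT(followers, 0.5) OVER() as median_followers
FROM spotify.spotify_artists
```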
--------------------------------------------------------------------------------
/bigquery/moving-average.md:
--------------------------------------------------------------------------------
1 | # Moving average
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/X2EmYjUGwwO?vm=e).
4 |
5 | # Description
6 | Moving averages are relatively simple to calculate in BigQuery, using the `avg` window function. The template for the query is
7 |
8 | ```sql
9 | select
10 |  avg(<value>) over (partition by <fields> order by <ordering> asc rows between <x> preceding and current row) mov_av,
11 |  <fields>,
12 |  <ordering>
13 | from <table>
14 | ```
15 | where
16 | - `value` - this is the numeric quantity to calculate a moving average for
17 | - `fields` - these are zero or more columns to split the moving averages by
18 | - `ordering` - this is the column which determines the order of the moving average, most commonly temporal
19 | - `x` - how many preceding rows to include in the moving average (the window covers `x` preceding rows plus the current row)
20 | - `table` - where to pull these columns from
21 | # Example
22 | Using total Spotify streams as an example data source, let's identify:
23 | - `value` - this is `streams`
24 | - `fields` - this is just `artist`
25 | - `ordering` - this is the `day` column
26 | - `x` - choose 7 for a weekly average
27 | - `table` - this is called `raw`
28 |
29 | then the query is:
30 |
31 | ```sql
32 | select
33 | avg(streams) over (partition by artist order by day asc rows between 7 preceding and current row) mov_av,
34 | artist,
35 | day, streams
36 | from raw
37 | ```
38 | | mov_av | artist | day | streams |
39 | | --- | ----------- |------ |-------- |
40 | | 535752.5 | 2 Chainz | 2017-06-03 | 562412 |
41 | | 517285 | 2 Chainz | 2017-06-04 | 480350 |
42 | | 522389.75 | 2 Chainz | 2017-06-05 | 537704 |
43 | | 529767.6 | 2 Chainz | 2017-06-06 | 559279 |
44 | | 535539.5 | 2 Chainz | 2017-06-07 | 564399 |
45 | | 535155.1428571428 | 2 Chainz | 2017-06-08 | 532849 |
46 | | 539254 | 2 Chainz | 2017-06-09 | 567946 |
47 | | 541911.5 | 2 Chainz | 2017-06-10 | 530353 |
48 |
--------------------------------------------------------------------------------
/bigquery/outliers-mad.md:
--------------------------------------------------------------------------------
1 | # Identify Outliers
2 | View interactive notebook [here](https://count.co/n/lgTo572j3We?vm=e).
3 |
4 | # Description
5 | Outlier detection is a key step to any analysis. It allows you to quickly spot errors in data collection, or which points you may need to remove before you do any statistical modeling.
6 | There are a number of ways to find outliers, this snippet focuses on the MAD method.
7 | # The Data
8 | For this example, we'll be looking at some data that shows the total number of daily streams for the Top 200 Tracks on Spotify every day since Jan 1 2017:
9 |
10 | Data:
11 | |streams | day |
12 | | ------ | --- |
13 | |234109876 | 25/10/2020 |
14 | |... | .... |
15 |
16 |
17 | # Method 1: MAD (Median Absolute Deviation)
18 | [MAD](https://en.wikipedia.org/wiki/Median_absolute_deviation) describes how far away a point is from the median. This method can be preferable to other methods, like the z-score method, because you don't have to assume the data is normally distributed.
19 |
20 | 
21 |
22 | > "A critical value for the Median Absolute Deviation of 5 has been suggested by workers in Nonparametric statistics, as a Median Absolute Deviation value of just under 1.5 is equivalent to one standard deviation, so MAD = 5 is approximately equal to 3 standard deviations."
23 | > - [Using SQL to detect outliers](https://towardsdatascience.com/using-sql-to-detect-outliers-aff676bb2c1a) by Robert de Graaf
24 |
25 | ```sql
26 | SELECT
27 | diff_to_med.*,
28 | percentile_cont(diff_to_med.difference, 0.5) over() median_of_diff,
29 | abs(difference/percentile_cont(difference, 0.5) over()) mad,
30 | case
31 |   when abs(difference/percentile_cont(difference, 0.5) over()) > <threshold> then 'outlier'
32 | else 'not outlier'
33 | end label
34 | FROM (
35 | select
36 |   <columns>,
37 |   data.<column>,
38 |   med.median,
39 |   abs(data.<column> - med.median) difference
40 |  from
41 |   <table> as data
42 |  cross join
43 |  (
44 |   select distinct
45 |    percentile_cont(<column>, .5) over() median
46 |   from <table> as data) as med
47 | ) as diff_to_med
48 | order by mad desc
49 | ```
50 | where:
51 | - `<columns>`: Columns selected other than the outlier column
52 | - `<column>`: Column name for the outlier quantity (`<table>` is the table that holds it)
53 | - `<threshold>`: The outlier threshold. 5 is approximately equal to 3 standard deviations
54 |
55 | #### MAD Method Outliers
56 | ```sql
57 | select diff_to_med.*,
58 | percentile_cont(diff_to_med.difference, 0.5) over() median_of_diff,
59 | abs(difference/percentile_cont(difference, 0.5) over()) mad,
60 | case
61 | when abs(difference/percentile_cont(difference, 0.5) over()) > 5 then 'outlier'
62 | else 'not outlier'
63 | end label
64 | from (
65 | select data.day, data.streams, med.median ,abs(data.streams - med.median) difference
66 | from data
67 | cross join (select distinct percentile_cont(streams,.5) over() median from data) as med
68 | ) as diff_to_med
69 | order by mad desc
70 | ```
71 | | day | streams | median | difference | median_of_diff | mad | label |
72 | | --- | ------- | ------ | ---------- | -------------- | ---- | ----- |
73 | | 24/12/2020 | 555245855 | 235341012.5 | 319904842.5 | 15798261 |20.0249 | outlier |
74 | | 08/01/2017 | 162831491 | 235341012.5 | 72509521.5 | 15798261 |4.59 | not outlier |
75 |
--------------------------------------------------------------------------------
/bigquery/outliers-stdev.md:
--------------------------------------------------------------------------------
1 | # Identify Outliers
2 | View interactive notebook [here](https://count.co/n/lgTo572j3We?vm=e).
3 |
4 | # Description
5 | Outlier detection is a key step to any analysis. It allows you to quickly spot errors in data collection, or which points you may need to remove before you do any statistical modeling.
6 | There are a number of ways to find outliers, this snippet focuses on the Standard Deviation method.
7 | # The Data
8 | For this example, we'll be looking at some data that shows the total number of daily streams for the Top 200 Tracks on Spotify every day since Jan 1 2017:
9 |
10 | Data:
11 | |streams | day |
12 | | ------ | --- |
13 | |234109876 | 25/10/2020 |
14 | |... | .... |
15 |
16 |
17 | # Standard Deviation
18 | The most common way outliers are detected is by finding points outside a certain number of sigmas (or standard deviations) from the mean.
19 | If the data is normally distributed, then the following rules apply:
20 |
21 | 
22 |
23 | > Image from anomaly.io
24 | So, identifying points 2 or 3 sigmas away from the mean should lead us to our outliers.
25 |
26 | ## Example
27 |
28 | ```sql
29 | SELECT
30 | data.*,
31 | case
32 |   when <column> between mean - <sigmas> * stdev and mean + <sigmas> * stdev
33 |    then 'not outlier'
34 |   else 'outlier'
35 |  end label,
36 |  mean - <sigmas> * stdev lower_bound,
37 |  mean + <sigmas> * stdev upper_bound
38 | FROM <table> AS data
39 | CROSS JOIN
40 | (
41 | SELECT
42 |  avg(data.<column>) mean,
43 |  stddev_samp(data.<column>) stdev
44 | FROM <table> AS data ) mean_sd
45 | ORDER BY label DESC
46 | ```
47 | where:
48 | - `<columns>`: Columns selected other than the outlier column
49 | - `<column>`: Column name for the outlier quantity (`<table>` is the table that holds it)
50 | - `<sigmas>`: Number of standard deviations used to identify the outliers (suggested: 3)
51 |
52 | #### Standard Deviation Method Outliers
53 | ```sql
54 | select
55 | data.*,
56 | case
57 | when streams between mean - 3 * stdev and mean + 3 * stdev then 'not outlier'
58 | else 'outlier'
59 | end label,
60 | mean - 3 * stdev lower_bound,
61 | mean + 3 * stdev upper_bound
62 | from data
63 | cross join (select
64 | avg(data.streams) mean,
65 | stddev_samp(data.streams) stdev
66 | from data ) mean_sd
67 | order by label desc
68 | ```
69 | | day | streams | label | lower_bound | upper_bound |
70 | | --- | ------- | ------ | ---------- | -------------- |
71 | | 29/06/2018 | 350167113 | outlier |146828000.339 | 326135386.804 |
72 | | 08/01/2017 | 162831491 | not outlier | 146828000.339 | 326135386.804 |
73 |
--------------------------------------------------------------------------------
/bigquery/outliers-z.md:
--------------------------------------------------------------------------------
1 | # Identify Outliers
2 | View interactive notebook [here](https://count.co/n/lgTo572j3We?vm=e).
3 |
4 | # Description
5 | Outlier detection is a key step to any analysis. It allows you to quickly spot errors in data collection, or which points you may need to remove before you do any statistical modeling.
6 | There are a number of ways to find outliers, this snippet focuses on the Z score method.
7 | # The Data
8 | For this example, we'll be looking at some data that shows the total number of daily streams for the Top 200 Tracks on Spotify every day since Jan 1 2017:
9 |
10 | Data:
11 | |streams | day |
12 | | ------ | --- |
13 | |234109876 | 25/10/2020 |
14 | |... | .... |
15 |
16 |
17 | # Z-Score
18 | Similar to the standard deviation method, the z-score method tells us the probability of a point occurring in a normal distribution.
19 |
20 | 
21 |
22 | > Image from simplypsychology.org
23 | To use z-scores to identify outliers, we can set an alpha (significance level) that determines how 'unlikely' a point has to be before it is classified as an outlier.
24 | This method also assumes the data is normally distributed.
25 |
26 | ```sql
27 | SELECT
28 | find_z.*,
29 | case
30 |   when abs(z_score) >= abs(<threshold>) then 'outlier' else 'not outlier'
31 | end label
32 | FROM (
33 | SELECT
34 |   <columns>,
35 |   <column>,
36 |   (<column> - avg(<column>) over ())
37 |    / (stddev(<column>) over ()) z_score
38 |  FROM <table> AS data) find_z
39 | ORDER BY abs(z_score) DESC
40 | ```
41 | where:
42 | - `<columns>`: Columns selected other than the outlier column
43 | - `<column>`: Column name for the outlier quantity (`<table>` is the table that holds it)
44 | - `<threshold>`: z-score threshold to identify outliers (suggested: 1.96)
45 |
46 |
47 | ## Example
48 |
49 |
50 | ```sql
51 | select find_z.*,
52 | case when abs(z_score) >=abs(1.96) then 'outlier' else 'not outlier' end label
53 | from (select
54 | day,
55 | streams,
56 | (streams - avg(streams) over ())
57 | / (stddev(streams) over ()) z_score
58 | from data) find_z
59 | order by abs(z_score) desc
60 | ```
61 | | day | streams | z_score | label |
62 | | --- | ------- | ------ | ---------- |
63 | | 07/01/2017 | 177567024 | -1.971 |outlier |
64 | | 02/12/2020 | 294978502 | 1.957 | not outlier |
65 |
--------------------------------------------------------------------------------
/bigquery/pivot.md:
--------------------------------------------------------------------------------
1 | # Pivot in BigQuery
2 | View interactive notebook [here](https://count.co/n/MqDQcqT18qS?vm=e)
3 |
4 | # Description
5 | Pivoting data is a helpful process of any analysis. Recently BigQuery announced a new [PIVOT](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#pivot_operator) operator that will make this easier than ever.
6 |
7 | ```sql
8 | SELECT
9 |   <column_list>
10 | FROM
11 |   <from_item>
12 | PIVOT(<aggregate_fn>(<value_column>) FOR <pivot_column> IN (<value1> [as alias], <value2> [as alias], ... <valueN> [as alias])) [as alias]
13 | ```
14 | where:
15 | - anything in square brackets is optional
16 | - `<column_list>` is the list of columns returned in the query
17 | - `<from_item>` is the table you are pivoting
18 |  - may not produce a value table or be a subquery using `SELECT AS STRUCT`
19 | - `<aggregate_fn>` is an aggregate function that aggregates all input rows across the values of the `<pivot_column>`
20 |  - may reference columns in `<from_item>` and correlated columns, but none defined by the `PIVOT` clause itself
21 |  - must have only one argument
22 |  - Except for `COUNT`, you can only use aggregate functions that ignore `NULL` inputs
23 |  - if using `COUNT`, you can use `*` as an argument
24 | - `<value_column>` is the column that will be aggregated across the values of `<pivot_column>`
25 | - `<pivot_column>` is the column that will have its values pivoted
26 | - values must be valid column names (so no spaces or special characters)
27 | - use the aliases to ensure they are the right format
28 | # Example
29 | In this case we'll see how many matches some popular tennis players have won in the Grand Slams.
30 |
31 | ```sql
32 | --tennis_data
33 | select
34 | winner_name,
35 | count(1) wins,
36 | initcap(tourney_name) tournament
37 | from
38 | tennis.matches_partitioned
39 | where winner_name in ('Novak Djokovic','Roger Federer','Rafael Nadal','Serena Williams')
40 | and tourney_level = 'G'
41 | group by winner_name, tournament
42 | ```
43 | | winner_name | wins | tournament|
44 | | ----------- | ---- | ---------|
45 | | Roger Federer | 91 | Us Open |
46 | | Roger Federer | 103 | Australian Open |
47 | | Roger Federer | 102 | Wimbledon |
48 | | Roger Federer | 70 | Roland Garros |
49 |
50 | ```sql
51 | select *
52 | from
53 | tennis_data
54 | pivot(sum(wins) for tournament in ('Wimbledon','Us Open' as US_Open,'Roland Garros' as French_Open,'Australian Open' as Aussie_Open ))
55 | ```
56 |
57 | | winner_name | Wimbledon | US_Open | French_Open | Aussie_Open |
58 | | ----------- | ---- | ---------| ---------| ---------|
59 | | Roger Federer | 102 | 91 | 70 | 103 |
60 |
61 | # Other links
62 | - https://towardsdatascience.com/pivot-in-bigquery-4eefde28b3be
63 |
--------------------------------------------------------------------------------
/bigquery/rand_between.md:
--------------------------------------------------------------------------------
1 | # Rand Between in BigQuery
2 | Interactive version [here](https://count.co/n/lUHSRbAZqLK?vm=e).
3 |
4 |
5 | # Description
6 | Generating random numbers is a good way to test out our code or to randomize results.
7 | In BigQuery there is a [`RAND`](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#rand) function which will generate a number between 0 and 1.
8 | But sometimes it's helpful to get a random integer within a specified range. This snippet will do that:
9 |
10 | ```sql
11 | SELECT
12 |  ROUND(<min> + RAND() * (<max> - <min>)) rand_between
13 | FROM
14 |  <table>
15 | ```
16 | where:
17 | - `<min>`, `<max>` is the range between which you want to generate a random number
18 | # Example:
19 |
20 | ```sql
21 | with random_number as (select rand() as random)
22 | select
23 | random, -- random number between 0 and 1
24 | round(random*100) rand_1_100, --random integer between 1 and 100
25 | round(1 + random * (10 - 1)) rand_btw_1_10 -- random integer between 1 and 10
26 | from random_number
27 | ```
28 | |random | rand_1_100 | rand_btw_1_10 |
29 | | ----- | ---------- | ------------- |
30 | |0.597 | 60 | 6 |
31 |
--------------------------------------------------------------------------------
/bigquery/random-sampling.md:
--------------------------------------------------------------------------------
1 | # Random sampling
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/cSN4gyht6Vd?vm=e).
4 |
5 | # Description
6 | Taking a random sample of a table is often useful when the table is very large, and you are still working in an exploratory or investigative mode. A query against a small fraction of a table runs much more quickly, and consumes fewer resources.
7 | Here are some example queries to select a random or pseudo-random subset of rows.
8 |
9 |
10 | # Using rand()
11 | The [RAND](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#rand) function returns a value in [0, 1) (i.e. including zero but not 1). You can use it to sample tables, e.g.:
12 |
13 | #### Return 1% of rows
14 |
15 | ```sql
16 | -- one_percent
17 | select * from spotify.spotify_daily_tracks where rand() < 0.01
18 | ```
19 | | day | track_id | rank | title | artist | streams |
20 | | -------- | -------- |----- | ----- | ------ | ------- |
21 | | 2018-01-09 | 6mrKP2jyIQmM0rw6fQryjr | 9 | Let You Down | NF | 2610265 |
22 | |... | ... | ... | ... | ... | ... |
23 |
24 | #### Return 10 rows
25 |
26 | ```sql
27 | -- ten_rows
28 | select * from spotify.spotify_daily_tracks order by rand() limit 10
29 | ```
30 | | day | track_id | rank | title | artist | streams |
31 | | -------- | -------- |----- | ----- | ------ | ------- |
32 | | 2018-01-09 | 6mrKP2jyIQmM0rw6fQryjr | 9 | Let You Down | NF | 2610265 |
33 | |... | ... | ... | ... | ... | ... |
34 | (only 10 rows returned)
35 |
36 | #### Return approximately 10 rows (discouraged)
37 |
38 | ```sql
39 | -- ten_rows_approx
40 | select
41 | *
42 | from spotify.spotify_daily_tracks
43 | where rand() < 10 / (select count(*) from spotify.spotify_daily_tracks)
44 | ```
45 | | day | track_id | rank | title | artist | streams |
46 | | -------- | -------- |----- | ----- | ------ | ------- |
47 | | 2018-01-09 | 6mrKP2jyIQmM0rw6fQryjr | 9 | Let You Down | NF | 2610265 |
48 | |... | ... | ... | ... | ... | ... |
49 | (only 10 rows returned)
50 |
51 | # Using hashing
52 | A hash is a deterministic mapping from one value to another, and so is not random, but can appear random 'enough' to produce a convincing sample. Use a [supported hash function](https://cloud.google.com/bigquery/docs/reference/standard-sql/hash_functions) in BigQuery to produce a pseudo-random ordering of your table:
53 |
54 | #### Farm fingerprint
55 |
56 | ```sql
57 | -- farm_fingerprint
58 | select * from spotify.spotify_artists order by farm_fingerprint(spotify_artists.artist_id) limit 10
59 | ```
60 | | artist_id | name | popularity | followers | updated_at | url |
61 | | -------- | -------- |----- | ----- | ------ | ------- |
62 | | 5wPoxI5si3eJsYYwyXV4Wi | N.E.R.D | 64 | 647718 | 2021-03-14 | https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi |
63 | |... | ... | ... | ... | ... | ... |
64 | (only 10 rows returned)
65 |
66 | #### MD5
67 |
68 | ```sql
69 | -- md5
70 | select * from spotify.spotify_artists order by md5(spotify_artists.artist_id) limit 10
71 | ```
72 | | artist_id | name | popularity | followers | updated_at | url |
73 | | -------- | -------- |----- | ----- | ------ | ------- |
74 | | 5wPoxI5si3eJsYYwyXV4Wi | N.E.R.D | 64 | 647718 | 2021-03-14 | https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi |
75 | |... | ... | ... | ... | ... | ... |
76 | (only 10 rows returned)
77 |
78 | #### SHA1
79 |
80 | ```sql
81 | -- SHA1
82 | select * from spotify.spotify_artists order by sha1(spotify_artists.artist_id) limit 10
83 | ```
84 | | artist_id | name | popularity | followers | updated_at | url |
85 | | -------- | -------- |----- | ----- | ------ | ------- |
86 | | 5wPoxI5si3eJsYYwyXV4Wi | N.E.R.D | 64 | 647718 | 2021-03-14 | https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi |
87 | |... | ... | ... | ... | ... | ... |
88 | (only 10 rows returned)
89 |
90 | #### SHA256
91 |
92 | ```sql
93 | -- SHA256
94 | select * from spotify.spotify_artists order by sha256(spotify_artists.artist_id) limit 10
95 | ```
96 | | artist_id | name | popularity | followers | updated_at | url |
97 | | -------- | -------- |----- | ----- | ------ | ------- |
98 | | 5wPoxI5si3eJsYYwyXV4Wi | N.E.R.D | 64 | 647718 | 2021-03-14 | https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi |
99 | |... | ... | ... | ... | ... | ... |
100 | (only 10 rows returned)
101 | #### SHA512
102 |
103 | ```sql
104 | -- SHA512
105 | select * from spotify.spotify_artists order by sha512(spotify_artists.artist_id) limit 10
106 | ```
107 | | artist_id | name | popularity | followers | updated_at | url |
108 | | -------- | -------- |----- | ----- | ------ | ------- |
109 | | 5wPoxI5si3eJsYYwyXV4Wi | N.E.R.D | 64 | 647718 | 2021-03-14 | https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi |
110 | |... | ... | ... | ... | ... | ... |
111 | (only 10 rows returned)
112 |
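Hashing can also give a deterministic *fraction* of rows rather than a fixed number, by bucketing the hash and keeping only some buckets. A sketch, reusing the `spotify.spotify_artists` table from the examples above (the 10% threshold is an arbitrary choice):

```sql
-- stable_ten_percent
select *
from spotify.spotify_artists
-- bucket each artist_id into 0-99 and keep buckets 0-9 (roughly 10% of rows, stable across runs)
where mod(abs(farm_fingerprint(spotify_artists.artist_id)), 100) < 10
```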
113 | # Using TABLESAMPLE
114 | The [TABLESAMPLE](https://cloud.google.com/bigquery/docs/table-sampling) clause is currently in preview status, but it has a major benefit - it doesn't require a full table scan, and therefore can be much cheaper and quicker than the methods above. The downside is that the percentage value must be a literal, so this query is less flexible than the ones above.
115 |
116 | #### Tablesample
117 |
118 | ```sql
119 | -- tablesample
120 | select * from spotify.spotify_daily_tracks tablesample system (10 percent)
121 | ```
122 | | day | track_id | rank | title | artist | streams |
123 | | -------- | -------- |----- | ----- | ------ | ------- |
124 | | 2018-01-09 | 6mrKP2jyIQmM0rw6fQryjr | 9 | Let You Down | NF | 2610265 |
125 | |... | ... | ... | ... | ... | ... |
126 |
--------------------------------------------------------------------------------
/bigquery/rank.md:
--------------------------------------------------------------------------------
1 | # Rank
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/eVT4mi7X88o?vm=e).
4 |
5 | # Description
6 | A common requirement is the need to rank a particular row in a group by some quantity - for example, 'rank stores by total sales in each region'. Fortunately BigQuery supports the `rank` window function to help with this.
7 | The query template is
8 |
9 | ```sql
10 | select
11 |  rank() over (partition by <fields> order by <rank_by> desc) rank
12 | from <table>
13 | ```
14 | where
15 | - `fields` - zero or more columns to split the results by - a rank will be calculated for each unique combination of values in these columns
16 | - `rank_by` - a column to decide the ranking
17 | - `table` - where to pull these columns from
18 |
19 | # Example
20 | Using total Spotify streams by day and artist as an example data source, let's calculate, by artist, the rank for total streams by day (e.g. rank = 1 indicates that day was their most-streamed day).
21 |
22 | - `fields` - this is just `artist`
23 | - `rank_by` - this is `streams`
24 | - `table` - this is called `raw`
25 |
26 | then the query is:
27 |
28 | ```sql
29 | select
30 | *,
31 | rank() over (partition by artist order by streams desc) rank
32 | from raw
33 | ```
34 | | streams | artist | day | rank |
35 | | --- | --------- | --- | ----|
36 | | 685306 | Dalex, Lenny Tavárez, Chencho Corleone, Juhn, Dímelo Flow | 2020-08-01 | 1 |
37 | | 680634 | Dalex, Lenny Tavárez, Chencho Corleone, Juhn, Dímelo Flow | 2020-08-08 | 2 |
38 | | ... | ... | ... | ... |
39 |
--------------------------------------------------------------------------------
/bigquery/regex-email.md:
--------------------------------------------------------------------------------
1 | # Regex: Validate email address
2 |
3 | Explore this snippet [here](https://count.co/n/aAHVEQ74TYd?vm=e).
4 |
5 | # Description
6 | The only real way to validate an email address is to send an email to it, but for the purposes of analytics it's often useful to strip out rows containing malformed email addresses.
7 | Using the `REGEXP_CONTAINS` function, this example comes courtesy of Reddit user [chrisgoddard](https://www.reddit.com/r/bigquery/comments/dshge0/udf_for_email_validation/f6r7rpt?utm_source=share&utm_medium=web2x&context=3):
8 |
9 | ```sql
10 | with emails as (
11 | select * from unnest([
12 | 'a',
13 | 'hello@count.c',
14 | 'hello@count.co',
15 | '@count.co',
16 | 'a@a.a'
17 | ]) as email
18 | )
19 |
20 | select
21 | email,
22 | regexp_contains(
23 | lower(email),
24 | "^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])$"
25 | ) valid
26 | from emails
27 | ```
--------------------------------------------------------------------------------
/bigquery/regex-parse-url.md:
--------------------------------------------------------------------------------
1 | # Regex: Parse URL String
2 | Explore the query with some demo data [here](https://count.co/n/U2RGnRt91pB?vm=e).
3 |
4 | # Description
5 | URL strings are a very valuable source of information, but trying to get your regex exactly right is a pain.
6 | This re-usable snippet can be applied to any URL to extract the exact part you're after.
7 |
8 | ```sql
9 | SELECT
10 |  REGEXP_EXTRACT(<url>, r'(?:[a-zA-Z]+://)?([a-zA-Z0-9-.]+)/?') as host,
11 |  REGEXP_EXTRACT(<url>, r'(?:[a-zA-Z]+://)?(?:[a-zA-Z0-9-.]+)/{1}([a-zA-Z0-9-./]+)') as path,
12 |  REGEXP_EXTRACT(<url>, r'\?(.*)') as query,
13 |  REGEXP_EXTRACT(<url>, r'#(.*)') as ref,
14 |  REGEXP_EXTRACT(<url>, r'^([a-zA-Z]+)://') as protocol
15 | FROM
16 |  <table>
17 | ```
18 | where:
19 | - `<url>` is your URL string or column (e.g. `"https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#Ref1"`)
20 | - `host` is the domain name (e.g. `yoursite.com`)
21 | - `path` is the page path on the website (e.g. `pricing/details`)
22 | - `query` is the string of query parameters in the url (e.g. `myparam1=123`)
23 | - `ref` is the reference or where the user came from before visiting the url (e.g. `#newsfeed`)
24 | - `protocol` is how the data is transferred between the host and server (e.g. `https`)
25 |
26 |
27 | # Usage
28 | To use, just copy out the regex for whatever part of the URL you need. Or capture them all as in the example below:
29 |
30 | ```sql
31 | WITH my_url as (select 'https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#newsfeed' as url)
32 | SELECT
33 | REGEXP_EXTRACT(url, r'(?:[a-zA-Z]+://)?([a-zA-Z0-9-.]+)/?') as host,
34 | REGEXP_EXTRACT(url, r'(?:[a-zA-Z]+://)?(?:[a-zA-Z0-9-.]+)/{1}([a-zA-Z0-9-./]+)') as path,
35 | REGEXP_EXTRACT(url, r'\?(.*)') as query,
36 | REGEXP_EXTRACT(url, r'#(.*)') as ref,
37 | REGEXP_EXTRACT(url, r'^([a-zA-Z]+)://') as protocol
38 | FROM my_url
39 | ```
40 | | host | path | query | ref | protocol|
41 | | --- | ------ | ----- | --- | ------|
42 | | www.yoursite.com | pricing/details | myparam1=123&myparam2=abc#newsfeed | newsfeed | https |
43 |
44 |
45 | # References & Helpful Links
46 | - [The Anatomy of a Full Path URL](https://zvelo.com/anatomy-of-full-path-url-hostname-protocol-path-more/)
47 | - [GoogleCloudPlatform/bigquery-utils](https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/url_parse.sql)
48 |
--------------------------------------------------------------------------------
/bigquery/repeating-words.md:
--------------------------------------------------------------------------------
1 | # Extract Repeating Words from String
2 |
3 |
4 | # Description
5 | If I wanted to extract all Workorder numbers from the string:
6 |
7 | ```sql
8 | F1 - wo-12, some other text WOM-1235674,WOM-2345,WOM-3456
9 | ```
10 | Where the workorder numbers always follow `WOM-`,
11 | and I wanted to make sure each workorder number got its own row so I could join to another table, I could do:
12 |
13 | ```sql
14 | SELECT
15 | product_id,
16 | wo_number
17 | FROM
18 |  <table>
19 | CROSS JOIN
20 |  UNNEST(regexp_extract_all(wo_text, r'WOM-([0-9]+)')) as wo_number
21 | ```
22 | where `wo_text` is the column of text that contains the workorder info.
23 | > Inspired by: https://www.reddit.com/r/bigquery/comments/oizgm6/extracting_word_from_string/
24 | # Example:
25 | ```sql
26 | with data as (
27 | select 123 product_id, 'F1 - wo-12, some other text WOM-1235674,WOM-2345,WOM-3456' as wo_text
28 | )
29 | select product_id, wo_number
30 | from data
31 | cross join unnest(regexp_extract_all(wo_text, r'WOM-([0-9]+)')) as wo_number
32 | --This will intentionally not extract 'wo-12' as it doesn't match WOM-[numbers] pattern
33 | ```
34 | |product_id | wo_number|
35 | |---- | ----- |
36 | | 123 | 1235674 |
37 | |123 | 2345 |
38 | |123 | 3456 |
39 |
--------------------------------------------------------------------------------
/bigquery/replace-empty-strings-null.md:
--------------------------------------------------------------------------------
1 | # Replace empty strings with NULLs
2 |
3 | Explore this snippet [here](https://count.co/n/g0Q8V8oSnB8?vm=e).
4 |
5 | # Description
6 |
7 | An essential part of cleaning a new data source is deciding how to treat missing values. The BigQuery function [IF](https://cloud.google.com/bigquery/docs/reference/standard-sql/conditional_expressions#if) can help with replacing empty values with something else:
8 |
9 | ```sql
10 | with data as (
11 | select * from unnest(['a', 'b', '', 'd']) as str
12 | )
13 |
14 | select
15 | if(length(str) = 0, null, str) str
16 | from data
17 | ```
18 |
19 | | str |
20 | | ---- |
21 | | a |
22 | | b |
23 | | NULL |
24 | | d |
--------------------------------------------------------------------------------
/bigquery/replace-multiples.md:
--------------------------------------------------------------------------------
1 | # Replace or Strip out multiple different string matches from some text.
2 |
3 | ## Description
4 |
5 | If you need to remove or replace multiple different string matches, it can be quite code heavy unless you make use of REGEXP_REPLACE.
6 | Use `|` to alternate between matches.
7 | `(?i)` allows the case of the matches to be ignored.
8 |
9 | ```sql
10 | select REGEXP_REPLACE('Some text1 or Text2 etc', r'(?i)text1|text2|text3|text4','NewValue')
11 | ```
12 |
13 | ## Example
14 |
15 | ```sql
16 | with pet as
17 | (select ['Marley (hcp)','Marley (HCP)','Marley(hcp)','Marley (hcP)','Marley (CC)','Marley (cc)','Marley(cc)','Marley (rehomed)','Marley (re-homed)','Marley (stray)','Marley**','Marley **'] as pets
18 | ), pet_list as
19 | (
20 | select pet_name
21 | from pet
22 | left join unnest(pets) pet_name)
23 | select
24 | pet_name,
25 | TRIM(REGEXP_REPLACE(pet_name, r'(?i)\(hcp\)|\(cc\)|\(re-homed\)|\(foster\)|\(stray\)|\(rehomed\)|\*\*','')) as clean_pet_name
26 | from pet_list
27 | ```
28 |
29 | ```
30 | +-----------------+--------------+
31 | |pet_name |clean_pet_name|
32 | +-----------------+--------------+
33 | |Marley (hcp) |Marley |
34 | |Marley (HCP) |Marley |
35 | |Marley(hcp) |Marley |
36 | |Marley (hcP) |Marley |
37 | |Marley (CC) |Marley |
38 | |Marley (cc) |Marley |
39 | |Marley(cc) |Marley |
40 | |Marley (rehomed) |Marley |
41 | |Marley (re-homed)|Marley |
42 | |Marley (stray) |Marley |
43 | |Marley** |Marley |
44 | |Marley ** |Marley |
45 | +-----------------+--------------+
46 | ```
47 |
--------------------------------------------------------------------------------
/bigquery/replace-null.md:
--------------------------------------------------------------------------------
1 | # Replace NULLs
2 |
3 | Explore this snippet [here](https://count.co/n/HrbhKQX4JfT?vm=e).
4 |
5 | # Description
6 |
7 | An essential part of cleaning a new data source is deciding how to treat NULL values. The BigQuery function [IFNULL](https://cloud.google.com/bigquery/docs/reference/standard-sql/conditional_expressions#ifnull) can help with replacing NULL values with something else:
8 |
9 | ```sql
10 | with data as (
11 | select * from unnest([
12 | struct(1 as num, 'a' as str),
13 | struct(null as num, 'b' as str),
14 | struct(3 as num, null as str)
15 | ])
16 | )
17 |
18 | select
19 | ifnull(num, -1) num,
20 |  ifnull(str, 'I AM NULL') str
21 | from data
22 | ```
23 |
24 | | num | str |
25 | | --- | --------- |
26 | | 1 | a |
27 | | -1 | b |
28 | | 3 | I AM NULL |
--------------------------------------------------------------------------------
/bigquery/running-total.md:
--------------------------------------------------------------------------------
1 | # Running total
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/xFHD89xBRAU?vm=e).
4 |
5 | # Description
6 | Running totals are relatively simple to calculate in BigQuery, using the `sum` window function.
7 | The template for the query is
8 |
9 | ```sql
10 | select
11 |  sum(<value>) over (partition by <fields> order by <ordering> asc) running_total,
12 |  <fields>,
13 |  <ordering>
14 | from <table>
15 | ```
16 | where
17 | - `value` - this is the numeric quantity to calculate a running total for
18 | - `fields` - these are zero or more columns to split the running totals by
19 | - `ordering` - this is the column which determines the order of the running total, most commonly temporal
20 | - `table` - where to pull these columns from
21 | # Example
22 | Using total Spotify streams as an example data source, let's identify:
23 | - `value` - this is `streams`
24 | - `fields` - this is just `artist`
25 | - `ordering` - this is the `day` column
26 | - `table` - this is called `raw`
27 |
28 | then the query is:
29 |
30 | ```sql
31 | select
32 | sum(streams) over (partition by artist order by day asc) running_total,
33 | artist,
34 |  day, streams
35 | from raw
36 | ```
37 |
38 | | running_total | artist | day | streams |
39 | | ------------- | ------- | --- | ------ |
40 | | 705819 | *NSYNC | 2017-12-23 | 705819 |
41 | | 2042673 | *NSYNC | 2017-12-24 | 1336854 |
42 | | ... | ... | ... |... |
43 |
--------------------------------------------------------------------------------
/bigquery/split-uppercase.md:
--------------------------------------------------------------------------------
1 | # Split a string on Uppercase Characters
2 | View interactive version [here](https://count.co/n/Exa1qpLjE8y?vm=e).
3 |
4 | # Description
5 | Split a string like 'HelloWorld' into ['Hello','World'] using:
6 |
7 | ```sql
8 | regexp_extract_all('stringToReplace',r'([A-Z][a-z]*)')
9 | ```
10 |
11 | # Example:
12 | ```sql
13 | select
14 | regexp_extract_all('HelloWorld',r'([A-Z][a-z]*)') as_array,
15 | array_to_string(regexp_extract_all('HelloWorld',r'([A-Z][a-z]*)'),' ') a_string_again
16 | ```
17 | |as_array | a_string_again|
18 | |------- | --------- |
19 | |["Hello","World"] | Hello World |
20 |
--------------------------------------------------------------------------------
/bigquery/stacked-bar-line.md:
--------------------------------------------------------------------------------
1 | # Stacked bar and line chart
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/xnf2JGWOpEh?vm=e).
4 |
5 | # Description
6 |
7 | Bar and line charts can communicate a lot of info on a single chart. They require all of the elements of a bar chart:
8 | - 2 categorical or temporal columns - 1 for the x-axis and 1 for stacked bars
9 | - 2 numeric columns - 1 for the bar and 1 for the line
10 |
11 | ```sql
12 | SELECT
13 |  AGG_FN(<COLUMN>) OVER (PARTITION BY <CAT_COLUMN1>, <CAT_COLUMN2>) as bar_metric,
14 |  <CAT_COLUMN1> as x_axis,
15 |  <CAT_COLUMN2> as cat,
16 |  AGG_FN(<COLUMN2>) OVER (PARTITION BY <CAT_COLUMN1>) as line
17 | FROM
18 |  <TABLE>
19 | ```
20 |
21 | where:
22 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc.
23 | - `COLUMN` is the column you want to aggregate to get your bar chart metric. Make sure this is a numeric column.
24 | - `COLUMN2` is the column you want to aggregate to get your line chart metric. Make sure this is a numeric column.
25 | - `CAT_COLUMN1` and `CAT_COLUMN2` are the groups you want to compare. One will go on the x axis and the other on the colors of the bar chart. Make sure these are categorical or temporal columns and not numerical fields.
26 |
27 | # Usage
28 |
29 | In this example with some example data we'll look at some sample data of users that were successful and unsuccessful at a certain task over a few days. We'll show the count of users by their outcome, then show the overall success rate for each day.
30 |
31 | ```sql
32 | with sample_data as (
33 | select date('2021-01-01') registered_at, 'successful' category, round(rand()*10) user_count union all
34 | select date('2021-01-01') registered_at, 'unsuccessful' category, round(rand()*10) user_count union all
35 | select date('2021-01-02') registered_at, 'successful' category, round(rand()*10) user_count union all
36 | select date('2021-01-02') registered_at, 'unsuccessful' category, round(rand()*10) user_count union all
37 | select date('2021-01-03') registered_at, 'successful' category, round(rand()*10) user_count union all
38 | select date('2021-01-03') registered_at, 'unsuccessful' category, round(rand()*10) user_count
39 | )
40 |
41 |
42 | select
43 | registered_at,
44 | category,
45 | user_count,
46 | sum(case when category = 'successful' then user_count end) over (partition by registered_at) / sum(user_count) over (partition by registered_at) success_rate
47 | from sample_data
48 | ```
49 |
50 |
--------------------------------------------------------------------------------
/bigquery/tf-idf.md:
--------------------------------------------------------------------------------
1 | # TF-IDF
2 | To explore an interactive version of this snippet, go [here](https://count.co/n/0NnhvqUjFuZ?vm=e).
3 |
4 | # Description
5 | TF-IDF (Term Frequency - Inverse Document Frequency) is a way of extracting keywords that are used more in certain documents than in the corpus of documents as a whole.
6 | It's a powerful NLP tool and is surprisingly easy to do in SQL.
7 | This snippet comes from Felipe Hoffa on [stackoverflow](https://stackoverflow.com/questions/47028576/how-can-i-compute-tf-idf-with-sql-bigquery):
8 |
9 | ```sql
10 | #standardSQL
11 | WITH words_by_post AS (
12 | SELECT CONCAT(link_id, '/', id) id, REGEXP_EXTRACT_ALL(
13 | REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&', '&'), r'&[a-z]{2,4};', '*')
14 | , r'[a-z]{2,20}\'?[a-z]+') words
15 | , COUNT(*) OVER() docs_n
16 | FROM `fh-bigquery.reddit_comments.2017_07`
17 | WHERE body NOT IN ('[deleted]', '[removed]')
18 | AND subreddit = 'movies'
19 | AND score > 100
20 | ), words_tf AS (
21 | SELECT id, word, COUNT(*) / ARRAY_LENGTH(ANY_VALUE(words)) tf, ARRAY_LENGTH(ANY_VALUE(words)) words_in_doc
22 | , ANY_VALUE(docs_n) docs_n
23 | FROM words_by_post, UNNEST(words) word
24 | GROUP BY id, word
25 | HAVING words_in_doc>30
26 | ), docs_idf AS (
27 | SELECT tf.id, word, tf.tf, ARRAY_LENGTH(tfs) docs_with_word, LOG(docs_n/ARRAY_LENGTH(tfs)) idf
28 | FROM (
29 | SELECT word, ARRAY_AGG(STRUCT(tf, id, words_in_doc)) tfs, ANY_VALUE(docs_n) docs_n
30 | FROM words_tf
31 | GROUP BY 1
32 | ), UNNEST(tfs) tf
33 | )
34 |
35 |
36 | SELECT *, tf*idf tfidf
37 | FROM docs_idf
38 | WHERE docs_with_word > 1
39 | ORDER BY tfidf DESC
40 | LIMIT 1000
41 | ```
42 |
43 |
44 | # Usage
45 |
46 | ```sql
47 | -- ex_data
48 | select 1 id, 'England will win it' body union all
49 | select 2 id, 'France will win it' body union all
50 | select 3 id, 'Its coming home' body
51 | ```
52 | |id | body |
53 | |--| ---- |
54 | |1 | 'England will win it' |
55 | |2 | 'France will win it' |
56 | |3 | 'Its coming home'|
57 | ```sql
58 | -- words_by_post
59 | SELECT
60 | id, REGEXP_EXTRACT_ALL(
61 | REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&', '&'), r'&[a-z]{2,4};', '*')
62 | , r'[a-z]{2,20}\'?[a-z]+') words
63 | , COUNT(*) OVER() docs_n
64 | FROM ex_data
65 | ```
66 | |id | words |
67 | |--| ---- |
68 | |1 | ['england','will','win'] |
69 | |2 | ['france','will','win'] |
70 | |3 | ['its','coming','home']|
71 | ```sql
72 | -- words_tf
73 | SELECT id, word, COUNT(*) / ARRAY_LENGTH(ANY_VALUE(words)) tf, ARRAY_LENGTH(ANY_VALUE(words)) words_in_doc
74 | , ANY_VALUE(docs_n) docs_n
75 | FROM words_by_post, UNNEST(words) word
76 | GROUP BY id, word
77 | ```
78 | |id | word | tf | words_in_doc| docs_n|
79 | |--| ---- | ---- | ---- | ---- |
80 | |1 | 'england'|0.33|3 |3|
81 | |1 | 'will' |0.33|3 |3|
82 | |1| 'win'|0.33|3 |3|
83 | |...| ...|...|... |...|
84 |
85 | ```sql
86 | -- docs_idf
87 | SELECT tf.id, word, tf.tf, ARRAY_LENGTH(tfs) docs_with_word, LOG(docs_n/ARRAY_LENGTH(tfs)) idf
88 | FROM (
89 | SELECT word, ARRAY_AGG(STRUCT(tf, id, words_in_doc)) tfs, ANY_VALUE(docs_n) docs_n
90 | FROM words_tf
91 | GROUP BY 1
92 | ), UNNEST(tfs) tf
93 | ```
94 | |id | word | tf | docs_with_word| idf|
95 | |--| ---- | ---- | ---- | ---- |
96 | |1 | 'england'|0.33|1 |1.099|
97 | |1 | 'will' |0.33|2 |.405|
98 | |1| 'win'|0.33|2 |.405|
99 | |...| ...|...|... |...|
100 | ```sql
101 | -- tf_idf
102 | SELECT *, tf*idf tfidf
103 | FROM docs_idf
104 | WHERE docs_with_word > 1
105 | ORDER BY tfidf DESC
106 | ```
107 | |id | word | tf | docs_with_word| idf|tfidf|
108 | |--| ---- | ---- | ---- | ---- |---- |
109 | |1 | 'england'|0.33|1 |1.099|.135|
110 | |1 | 'will' |0.33|2 |.405|.135|
111 | |1| 'win'|0.33|2 |.405|.135|
112 | |...| ...|...|... |...|...|
113 |
--------------------------------------------------------------------------------
/bigquery/timeseries.md:
--------------------------------------------------------------------------------
1 | # Time series line chart
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/B3XjM5dxt4f?vm=e).
4 |
5 | # Description
6 |
7 | Timeseries charts show how individual metric(s) change over time. They have lines along the y-axis and dates along the x-axis.
8 | You just need a simple GROUP BY to get all the elements you need for a timeseries chart:
9 |
10 | ```sql
11 | SELECT
12 |  AGG_FN(<COLUMN>) as metric,
13 |  <DATETIME_COLUMN> as datetime
14 | FROM
15 |  <TABLE>
16 | GROUP BY
17 | datetime
18 | ```
19 | where:
20 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc.
21 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column.
22 | - `DATETIME_COLUMN` is the group you want to show on the x-axis. Make sure this is a date, datetime, timestamp, or time column (not a number)
23 |
24 | # Usage
25 |
26 | In this example with some Spotify data, we'll look at the total daily streams over time:
27 |
28 | ```sql
29 | select
30 | sum(streams) streams,
31 | day
32 | from spotify.spotify_daily_tracks
33 | group by day
34 | ```
35 |
36 |
--------------------------------------------------------------------------------
/bigquery/transforming-arrays.md:
--------------------------------------------------------------------------------
1 | # Transforming arrays
2 |
3 | Explore this snippet [here](https://count.co/n/CLAQ94HaRvO?vm=e).
4 |
5 | # Description
6 | To transform elements of an array (often known as mapping over an array), it is possible to gain access to array elements using the `UNNEST` function:
7 |
8 | ```sql
9 | with data as (select [1,2,3] nums, ['a', 'b', 'c'] strs)
10 |
11 | select
12 |
13 | array(
14 | select
15 | x * 2 + 1 -- Transformation function
16 | from unnest(nums) as x
17 | ) as nums_transformed,
18 |
19 | array(
20 | select
21 | x || '_2' -- Transformation function
22 | from unnest(strs) as x
23 | ) as strs_transformed
24 |
25 | from data
26 | ```
--------------------------------------------------------------------------------
/bigquery/uniform-distribution.md:
--------------------------------------------------------------------------------
1 | # Generate a uniform distribution
2 |
3 | Explore this snippet [here](https://count.co/n/s7XKhGzLxX6?vm=e)
4 |
5 | # Description
6 |
7 | Generating a uniform distribution of (pseudo-) random numbers is often useful when performing statistical tests on data.
8 | This snippet demonstrates a method for generating a deterministic sample from a uniform distribution:
9 | 1. Create a column of row indices - here the [`GENERATE_ARRAY`](https://cloud.google.com/bigquery/docs/reference/standard-sql/array_functions#generate_array) function is used, though the [`ROW_NUMBER`](https://cloud.google.com/bigquery/docs/reference/standard-sql/numbering_functions#row_number) window function may also be useful.
10 | 2. Convert the row indices into pseudo-random numbers using the [`FARM_FINGERPRINT`](https://cloud.google.com/bigquery/docs/reference/standard-sql/hash_functions#farm_fingerprint) hashing function.
11 | 3. Map the hashes into the correct range
12 |
13 | ```sql
14 | with params as (
15 | select
16 | 10000 as num_samples, -- How many samples to generate
17 | 1000000 as precision, -- How much precision to maintain (e.g. 6 decimal places)
18 | 2 as min, -- Lower limit of distribution
19 |  5 as max -- Upper limit of distribution
20 | ),
21 | samples as (
22 | select
23 | -- Create hashes (between 0 and precision)
24 | -- Map [0, precision] -> [min, max]
25 | mod(abs(farm_fingerprint(cast(indices as string))), precision) / precision * (max - min) + min as samples
26 | from
27 | params,
28 | unnest(generate_array(1, num_samples)) as indices
29 | )
30 |
31 | select * from samples
32 | ```
33 |
34 | | samples |
35 | | ------- |
36 | | 3.55 |
37 | | 3.316 |
38 | | 2.146 |
39 | | ... |
--------------------------------------------------------------------------------
/bigquery/unpivot-melt.md:
--------------------------------------------------------------------------------
1 | # Unpivot / melt
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/xwGc7kFuAm0?vm=e).
4 |
5 | # Description
6 | The act of unpivoting (or melting, if you're a `pandas` user) is to convert columns to rows. BigQuery has recently added the [UNPIVOT](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unpivot_operator) operator as a preview feature.
7 | The query template is
8 |
9 | ```sql
10 | SELECT
11 |  <columns_to_keep>,
12 | metric,
13 | value
14 | FROM <table> UNPIVOT(value FOR metric IN (<columns_to_unpivot>))
15 | order by metric, value
16 | ```
17 | where
18 | - `columns_to_keep` - all the columns to be preserved
19 | - `columns_to_unpivot` - all the columns to be converted to rows
20 | - `table` - the table to pull the columns from
21 |
22 | (The output column names 'metric' and 'value' can also be changed.)
23 | # Examples
24 | In the examples below, we use a table with columns `A`, `B`, `C`. We take the columns `B` and `C`, and turn their values into columns called `metric` (containing either the string 'B' or 'C') and `value` (the value from either the column `B` or `C`). The column `A` is preserved.
25 |
26 | ```sql
27 | -- Define some dummy data
28 | with a as (
29 | select * from unnest([
30 | struct( 'a' as A, 1 as B, 2 as C ),
31 | struct( 'b' as A, 3 as B, 4 as C ),
32 | struct( 'c' as A, 5 as B, 6 as C )
33 | ])
34 | )
35 | SELECT
36 | A,
37 | metric,
38 | value
39 | FROM a UNPIVOT(value FOR metric IN (B, C))
40 | order by metric, value
41 | ```
42 | | A | metric | value |
43 | | --- | ----- | ------ |
44 | | a | B | 1 |
45 | | b | B | 3 |
46 | | c | B | 5 |
47 | | a | C | 2 |
48 | | b | C | 4 |
49 | | c | C | 6 |
50 |
51 | If you would rather not use preview features in BigQuery, there is another generic solution (from [StackOverflow](https://stackoverflow.com/a/47657770)), though it may be much less performant than the built-in `UNPIVOT` implementation.
52 | The query template here is
53 |
54 | ```sql
55 | select
56 |  <columns_to_keep>,
57 | metric,
58 | value
59 | from (
60 | SELECT
61 | *,
62 | REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric,
63 | REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
64 | FROM
65 |  <table>,
66 |  UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(<table>), r'{|}', ''))) pair
67 | )
68 | where metric not in (<column_names_to_keep>)
69 | order by metric, value
70 | ```
71 | where
72 | - `column_names_to_keep` - this is a list of the JSON-escaped names of the columns to keep
73 |
74 | An annotated example is:
75 |
76 | ```sql
77 | -- Define some dummy data
78 | with a as (
79 | select * from unnest([
80 | struct( 'a' as A, 1 as B, 2 as C ),
81 | struct( 'b' as A, 3 as B, 4 as C ),
82 | struct( 'c' as A, 5 as B, 6 as C )
83 | ])
84 | )
85 |
86 | -- See https://stackoverflow.com/a/47657770
87 |
88 | select
89 | A, -- Keep A column
90 | metric,
91 | value
92 | from (
93 | SELECT
94 | *, -- Original table
95 |
96 | -- Do some manual JSON parsing of a string which looks like '"B": 3'
97 | REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(0)], r'^"|"$', '') metric,
98 | REGEXP_REPLACE(SPLIT(pair, ':')[OFFSET(1)], r'^"|"$', '') value
99 | FROM
100 | a,
101 | -- 1. Create a row which contains a string like '{ "A": "a", "B": 1, "C": 2 }'
102 | -- 2. Remove the outer {} characters
103 | -- 3. Split the string on , characters
104 | -- 4. Join the new array onto the original table
105 | UNNEST(SPLIT(REGEXP_REPLACE(to_json_string(a), r'{|}', ''))) pair
106 | )
107 | where metric not in ('A') -- Filter extra rows that we don't need
108 | order by metric, value
109 | ```
110 | | A | metric | value |
111 | | --- | ----- | ------ |
112 | | a | B | 1 |
113 | | b | B | 3 |
114 | | c | B | 5 |
115 | | a | C | 2 |
116 | | b | C | 4 |
117 | | c | C | 6 |
118 |
--------------------------------------------------------------------------------
/bigquery/yoy.md:
--------------------------------------------------------------------------------
1 | # Year-on-year percentage difference
2 | To explore this example with some demo data, head [here](https://count.co/n/7I6eHGEwxX3?vm=e).
3 |
4 | # Description
5 | Calculating the year-on-year change of a quantity typically requires a few steps:
6 | 1. Aggregate the quantity by year (and by any other fields)
7 | 2. Add a column for last year's quantity (this will be null for the first year in the dataset)
8 | 3. Calculate the change from last year to this year
9 |
10 | A generic query template may then look like
11 |
12 | ```sql
13 | select
14 | *,
15 | -- Calculate percentage change
16 |  (<year_total> - last_year_total) / last_year_total * 100 as perc_diff
17 | from (
18 | select
19 | *,
20 | -- Use a window function to get last year's total
21 |   lag(<year_total>) over (partition by <fields> order by <year_column> asc) last_year_total
22 |  from <table>
23 |  order by <fields>, <year_column> desc
24 | )
25 | ```
26 | where
27 | - `table` - your source data table
28 | - `fields` - one or more columns to split the year-on-year differences by
29 | - `year_column` - the column containing the year of the aggregation
30 | - `year_total` - the column containing the aggregated data
31 | # Example
32 | Using total Spotify streams as an example data source, let's identify:
33 | - `table` - this is called `raw`
34 | - `fields` - this is just the single column `artist`
35 | - `year_column` - this is called `year`
36 | - `year_total` - this is `sum_streams`
37 |
38 | Then the query looks like:
39 |
40 | ```sql
41 | -- raw
42 | select
43 | date_trunc(day, year) year,
44 | sum(streams) sum_streams,
45 | artist
46 | from spotify.spotify_daily_tracks
47 | group by 1,3
48 | ```
49 | | year | sum_streams | artist |
50 | | --- | ----------- | ---- |
51 | | 2020-01-01 | 115372043 | CJ |
52 | | 2021-01-01 | 179284925 | CJ |
53 | | ... | ... | ... |
54 | ```sql
55 | -- with_perc_diff
56 | select
57 | *,
58 | (sum_streams - last_year_total) / last_year_total * 100 as perc_diff
59 | from (
60 | select
61 | *,
62 | lag(sum_streams) over (partition by artist order by year asc) last_year_total
63 | from raw
64 | order by artist, year desc
65 | )
66 | ```
67 | | year | sum_streams | artist | last_year_total | perc_diff |
68 | | --- | ----------- | ---- | ---- | ----- |
69 | | 2020-01-01 | 115372043 | CJ | NULL | NULL |
70 | | 2021-01-01 | 179284925 | CJ | 115372043 | 55.397200515899684 |
71 | | ... | ... | ... | ... | .... |
72 |
--------------------------------------------------------------------------------
/mssql/Cumulative distribution functions.md:
--------------------------------------------------------------------------------
1 | # Cumulative distribution functions
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/ANpwFTvhnWT?vm=e).
4 |
5 | # Description
6 |
7 | Cumulative distribution functions (CDF) are a method for analysing the distribution of a quantity, similar to histograms. They show, for each value of a quantity, what fraction of rows are smaller or greater.
8 | One method for calculating a CDF is as follows:
9 |
10 | ```sql
11 | select
12 | -- Use a row_number window function to get the position of this row and CAST to convert row_number and count from integers to decimal
13 | CAST(row_number() over (order by asc) AS DECIMAL)/CAST((select count(*) from ) AS DECIMAL) AS cdf
14 | from
15 | ```
16 |
17 | where
18 |
19 | * `quantity` - the column containing the metric of interest
20 | * `table` - the table name
21 |
22 | **Note:** CAST is used to convert the numerator and denominator of the fraction into decimals before they are divided.
23 |
24 | # Example
25 |
26 | Using some student test scores as an example data source, let’s identify:
27 |
28 | * `table` - this is called `Example_data`
29 | * `quantity` - this is the column `score`
30 |
31 | then the query becomes:
32 |
33 | ```sql
34 | select
35 | student_id,
36 | score,
37 | CAST(row_number() over (order by score asc) AS DECIMAL)/CAST((select count(*) from Example_data) AS DECIMAL) AS frac
38 | from Example_data
39 | ```
40 |
41 | |student_id|score|frac|
42 | |---|---|---|
43 | |3|76|0.2|
44 | |5|77|0.4|
45 | |4|82|0.6|
46 | |...|...|...|
47 | |1|97|1|
48 |
--------------------------------------------------------------------------------
/mssql/check-column-is-accessible.md:
--------------------------------------------------------------------------------
1 | # Check if a column is accessible
2 |
3 | Explore this snippet [here](https://count.co/n/xpgYlM1uDAT?vm=e).
4 |
5 | # Description
6 | One method to check if the current session has access to a column is to use the [`COL_LENGTH`](https://docs.microsoft.com/en-us/sql/t-sql/functions/col-length-transact-sql) function, which returns the length of a column in bytes.
7 | If the column doesn't exist (or cannot be accessed), COL_LENGTH returns null.
8 |
9 | ```sql
10 | select
11 | col_length('Orders', 'Sales') as does_exist,
12 | col_length('Orders', 'Shmales') as does_not_exist
13 | ```
14 |
15 | | does_exist | does_not_exist |
16 | | ---------- | -------------- |
17 | | 8 | NULL |
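
Because the function returns NULL when the column is missing or inaccessible, it can act as a guard before touching that column. A minimal sketch under that assumption, reusing the `Orders`/`Sales` names from the example above (the query is executed dynamically so the batch still compiles even when `Sales` does not exist):

```sql
if col_length('Orders', 'Sales') is not null
begin
    -- safe to reference the column; run the query dynamically
    exec sp_executesql N'select sum(Sales) as total_sales from Orders';
end
else
begin
    print 'Column Orders.Sales does not exist or is not accessible';
end
```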
--------------------------------------------------------------------------------
/mssql/compound-growth-rates.md:
--------------------------------------------------------------------------------
1 | # Compound growth rates
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/HFVMSAR7nmb?vm=e).
4 |
5 | # Description
6 | When a quantity is increasing in time by the same fraction every period, it is said to exhibit **compound growth** - every time period it increases by a larger amount.
7 | It is possible to show that the behaviour of the quantity over time follows the law
8 |
9 | ```sql
10 | x(t) = x(0) * exp(r * t)
11 | ```
12 | where
13 | - `x(t)` is the value of the quantity now
14 | - `x(0)` is the value of the quantity at time t = 0
15 | - `t` is time in units of [time_period]
16 | - `r` is the growth rate in units of 1 / [time_period]
17 |
18 | # Calculating growth rates
19 |
20 | From the equation above, to calculate a growth rate you need to know:
21 | - The value at the start of a period
22 | - The value at the end of a period
23 | - The duration of the period
24 |
25 | Then, following some algebra, the growth rate is given by
26 |
27 | ```sql
28 | r = ln(x(t) / x(0)) / t
29 | ```
30 |
31 | For a concrete example, assume a table with columns:
32 | - `num_this_month` - this is `x(t)`
33 | - `num_last_month` - this is `x(0)`
34 |
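The queries below reference a `with_previous_month` relation that pairs each month with the previous month's value. As a rough sketch of how it might be built (assuming a hypothetical `monthly_totals` table with `month_start` and `num` columns; not part of the original snippet):

```sql
-- with_previous_month (sketch): pair each month's value with the prior month's using LAG
select
  month_start,
  num as num_this_month,
  lag(num) over (order by month_start) as num_last_month
from monthly_totals
```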
35 | #### Inferred growth rates
36 |
37 | In the following then, `growth_rate` is the equivalent yearly growth rate for that month:
38 |
39 | ```sql
40 | -- with_growth_rate
41 | select
42 | *,
43 | log(num_this_month / num_last_month) * 12 growth_rate
44 | from with_previous_month
45 | ```
46 |
47 | # Using growth rates
48 |
49 | To better depict the effects of a given growth rate, it can be converted to a year-on-year growth factor by inserting into the exponential formula above with t = 1 (one year duration).
50 | Or mathematically,
51 |
52 | ```sql
53 | yoy_growth = x(1) / x(0) = exp(r)
54 | ```
55 | It is always sensible to spot-check what a growth rate actually represents to decide whether or not it is a useful metric.
56 |
57 | #### Extrapolated year-on-year growth
58 |
59 | ```sql
60 | -- with_yoy_growth
61 | select
62 | *,
63 | exp(growth_rate) as yoy_growth
64 | from with_growth_rate
65 | ```
--------------------------------------------------------------------------------
/mssql/concatenate-strings.md:
--------------------------------------------------------------------------------
1 | # Concatenating strings
2 |
3 | Explore this snippet [here](https://count.co/n/NoziHzLCWJA?vm=e).
4 |
5 | # Description
6 | Use the [`STRING_AGG`](https://docs.microsoft.com/en-us/sql/t-sql/functions/string-agg-transact-sql) function to concatenate strings within a group. To specify the ordering, make sure to use the WITHIN GROUP clause too.
7 |
8 | ```sql
9 | with data as (
10 | select * from (values
11 | ('A', '1'),
12 | ('A', '2'),
13 | ('B', '3'),
14 | ('B', '4'),
15 | ('B', '5')
16 | ) as data(str, num)
17 | )
18 |
19 | select
20 | str,
21 | string_agg(num, ' < ') within group (order by num asc) as aggregated
22 | from data
23 | group by str
24 | ```
25 |
26 | | str | aggregated |
27 | | --- | ---------- |
28 | | A | 1 < 2 |
29 | | B | 3 < 4 < 5 |
--------------------------------------------------------------------------------
/mssql/first-row-of-group.md:
--------------------------------------------------------------------------------
1 | # Find the first row of each group
2 |
3 | Explore this snippet [here](https://count.co/n/3eWfeZ0n2a8?vm=e).
4 |
5 | # Description
6 |
7 | Finding the first value in a column partitioned by some other column is simple in SQL, using the GROUP BY clause. Returning the rest of the columns corresponding to that value is a little trickier. Here's one method, which relies on the RANK window function:
8 |
9 | ```sql
10 | with and_ranking as (
11 | select
12 | *,
13 |     rank() over (partition by <partition> order by <ordering>) as ranking
14 |   from <table>
15 | )
16 | select * from and_ranking where ranking = 1
17 | ```
18 | where
19 | - `partition` - the column(s) to partition by
20 | - `ordering` - the column(s) which determine the ordering defining 'first'
21 | - `table` - the source table
22 |
23 | The CTE `and_ranking` adds a column to the original table called `ranking`, which is the order of that row within the groups defined by `partition`. Filtering for `ranking = 1` picks out those 'first' rows.
24 |
25 | # Example
26 |
27 | Using product sales as an example data source, we can pick the rows of the table corresponding to the most-recent order per city. Filling in the template above:
28 | - `partition` - this is just `City`
29 | - `ordering` - this is `Order_Date desc`
30 | - `table` - this is `Sales`
31 |
32 | ```sql
33 | with and_ranking as (
34 | select
35 | *,
36 | rank() over (partition by City order by Order_Date desc) as ranking
37 | from Sales
38 | )
39 |
40 | select * from and_ranking where ranking = 1
41 | ```
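Note that `RANK` assigns the same rank to ties, so if two orders in a city share the most recent `Order_Date`, both rows are returned. If exactly one row per group is required, `ROW_NUMBER` can be swapped in (a variation, not part of the original snippet):

```sql
with and_ranking as (
  select
    *,
    -- ROW_NUMBER breaks ties arbitrarily, guaranteeing a single row per city
    row_number() over (partition by City order by Order_Date desc) as ranking
  from Sales
)

select * from and_ranking where ranking = 1
```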
--------------------------------------------------------------------------------
/mssql/search-stored-procedures.md:
--------------------------------------------------------------------------------
1 | # Search for text within stored procedures
2 |
3 | # Description
4 | Often you'll want to know which stored procedures perform an action or contain specific text. Note that the query will also match text inside comments.
5 |
6 | # Snippet ✂️
7 | ```sql
8 | SELECT name
9 | FROM sys.procedures
10 | WHERE Object_definition(object_id) LIKE '%strHell%'
11 | ```
12 |
13 | # Usage
14 |
15 | ```sql
16 | SELECT name
17 | FROM sys.procedures
18 | WHERE Object_definition(object_id) LIKE '%blitz%'
19 | ```
20 |
21 | | name |
22 | | -------- |
23 | | sp_BlitzFirst|
24 | | sp_BlitzCache|
25 | | sp_BlitzWho|
26 | | sp_Blitz|
27 | | sp_BlitzBackups|
28 | | sp_BlitzLock|
29 | | sp_BlitzIndex|
30 |
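An alternative (not part of the original snippet) is to search every SQL-defined module via `sys.sql_modules`, which also covers views, triggers and functions:

```sql
-- Search the definition of every SQL module, not just stored procedures
SELECT OBJECT_NAME(object_id) AS name
FROM sys.sql_modules
WHERE definition LIKE '%blitz%'
```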
31 | # References
32 | [Stackoverflow](https://stackoverflow.com/questions/14704105/search-text-in-stored-procedure-in-sql-server)
--------------------------------------------------------------------------------
/mysql/cdf.md:
--------------------------------------------------------------------------------
1 | # Cumulative distribution functions
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/sL7bEFcg11Z?vm=e).
4 |
5 | # Description
6 | Cumulative distribution functions (CDF) are a method for analysing the distribution of a quantity, similar to histograms. They show, for each value of a quantity, what fraction of rows are smaller or greater.
7 | One method for calculating a CDF is as follows:
8 |
9 | ```sql
10 | select
11 | -- Use a row_number window function to get the position of this row
12 |   (row_number() over (order by <quantity> asc)) / (select count(*) from <table>) cdf,
13 |   <quantity>
14 | from <table>
15 | ```
16 | where
17 | - `quantity` - the column containing the metric of interest
18 | - `table` - the table name
19 | # Example
20 |
21 |
22 | ```sql
23 | with data as (
24 | select 1 student_id, 97 score union all
25 | select 2 student_id, 93 score union all
26 | select 3 student_id, 76 score union all
27 | select 4 student_id, 82 score union all
28 | select 5 student_id, 77 score
29 | )
30 | select student_id, score, (row_number() over (order by score asc)) / (select count(*) from data) frac
31 | from data
32 | ```
33 | | student_id | score | frac |
34 | | ----------- | ----------- | ---- |
35 | |3 | 76 | 0.2|
36 | | 5| 77 | 0.4 |
37 | |4 | 82 | 0.6|
38 | |2 | 93 | 0.8|
39 | |1|97 | 1|
40 |
--------------------------------------------------------------------------------
/mysql/initcap-udf.md:
--------------------------------------------------------------------------------
1 | # UDF for capitalizing initial letters
2 |
3 | # Description
4 |
5 | Most SQL dialects have a way to convert a string from e.g. "tHis STRING" to "This String" where each word starts with a capitalized letter. This is useful for transforming names of people or places (e.g. ‘North Carolina’ or ‘Juan Martin Del Potro’).
6 |
7 | Per this [Stackoverflow post](https://stackoverflow.com/questions/12364086/how-can-i-achieve-initcap-functionality-in-mysql), this UDF will give us INITCAP functionality.
8 |
9 | ```sql
10 | DELIMITER $$
11 |
12 | DROP FUNCTION IF EXISTS `test`.`initcap`$$
13 |
14 | CREATE FUNCTION `initcap`(x char(30)) RETURNS char(30) CHARSET utf8
15 | READS SQL DATA
16 | DETERMINISTIC
17 | BEGIN
18 | SET @str='';
19 | SET @l_str='';
20 | WHILE x REGEXP ' ' DO
21 | SELECT SUBSTRING_INDEX(x, ' ', 1) INTO @l_str;
22 | SELECT SUBSTRING(x, LOCATE(' ', x)+1) INTO x;
23 | SELECT CONCAT(@str, ' ', CONCAT(UPPER(SUBSTRING(@l_str,1,1)),LOWER(SUBSTRING(@l_str,2)))) INTO @str;
24 | END WHILE;
25 | RETURN LTRIM(CONCAT(@str, ' ', CONCAT(UPPER(SUBSTRING(x,1,1)),LOWER(SUBSTRING(x,2)))));
26 | END$$
27 |
28 | DELIMITER ;
29 | ```
30 |
31 | # Usage
32 |
33 | Use like:
34 |
35 | ```sql
36 | select initcap('This is a test string');
37 | ```
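which should return `This Is A Test String`. In practice you would typically apply it to a column; a minimal sketch, assuming a hypothetical `customers` table with a `full_name` column:

```sql
-- Title-case a text column (table and column names are hypothetical)
select initcap(full_name) as formatted_name
from customers;
```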
38 |
39 | # References
40 |
41 | - [Stackoverflow](https://stackoverflow.com/questions/12364086/how-can-i-achieve-initcap-functionality-in-mysql)
42 | - [SQL snippets](https://sql-snippets.count.co/t/udf-initcap-in-mysql/207)
--------------------------------------------------------------------------------
/postgres/array-reverse.md:
--------------------------------------------------------------------------------
1 | # Reverse an array
2 |
3 | Explore this snippet [here](https://count.co/n/aDl6lXQKQdx?vm=e).
4 |
5 | # Description
6 | There is no built-in function for reversing an array, but it is possible to construct one using the `generate_subscripts` function:
7 |
8 | ```sql
9 | with data as (select array[1, 2, 3, 4] as nums)
10 |
11 | select
12 | array(
13 | select nums[i]
14 | from generate_subscripts(nums, 1) as indices(i)
15 | order by i desc
16 | ) as reversed
17 | from data
18 | ```
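which returns `{4,3,2,1}`. If you need this in several places, the same logic can be wrapped in a reusable function, along the lines of the PostgreSQL wiki page referenced below (a sketch, not part of the original snippet):

```sql
create or replace function array_reverse(anyarray) returns anyarray as $$
  select array(
    select $1[i]
    from generate_subscripts($1, 1) as indices(i)
    order by i desc
  );
$$ language sql strict immutable;
```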
19 |
20 |
21 | ## References
22 | PostgreSQL wiki [https://wiki.postgresql.org/wiki/Array_reverse](https://wiki.postgresql.org/wiki/Array_reverse)
--------------------------------------------------------------------------------
/postgres/convert-epoch-to-timestamp.md:
--------------------------------------------------------------------------------
1 | # Replace epoch/unix formatted timestamps with human-readable ones
2 | View an interactive version of this snippet [here](https://count.co/n/UdQXtD16DGx?vm=e).
3 |
4 | # Description
5 |
6 | Often data sources will include epoch/unix-style timestamps or dates, which are not human-readable.
7 | Therefore, we want to convert these to a different format, for example to be used in charts or aggregation queries.
8 | The [`TO_TIMESTAMP`](https://www.postgresql.org/docs/13/functions-formatting.html) function will allow us to do this with little effort:
9 |
10 | ```sql
11 | WITH data AS (
12 | SELECT *
13 |     FROM (VALUES (1611176312), (1611176759), (1611176817), (1611176854)) AS data (epoch_seconds)
14 | )
15 |
16 | SELECT
17 |     TO_TIMESTAMP(epoch_seconds) AS converted_timestamp
18 | FROM data;
19 | ```
20 |
21 | | converted_timestamp |
22 | | ---- |
23 | | 2021-01-20 20:58:32.000000 |
24 | | 2021-01-20 21:05:59.000000 |
25 | | 2021-01-20 21:06:57.000000 |
26 | | 2021-01-20 21:07:34.000000 |
27 |
--------------------------------------------------------------------------------
/postgres/count-nulls.md:
--------------------------------------------------------------------------------
1 | # Count NULLs
2 |
3 | Explore this snippet [here](https://count.co/n/YJCjwADi4VY?vm=e).
4 |
5 | # Description
6 |
7 | Part of the data cleaning process involves understanding the quality of your data. NULL values are usually best avoided, so counting their occurrences is a common operation.
8 | There are several methods that can be used here:
9 | - `count(*) - count()` - use the different forms of the [`count()`](https://www.postgresql.org/docs/current/functions-aggregate.html) aggregation which include and exclude NULLs.
10 | - `sum(case when x is null then 1 else 0 end)` - create a new column which contains 1 or 0 depending on whether the value is NULL, then aggregate.
11 |
12 | ```sql
13 | with data as (
14 | select * from (values (1), (2), (null), (null), (5)) as data (x)
15 | )
16 |
17 | select
18 | count(*) - count(x) with_count,
19 | sum(case when x is null then 1 else 0 end) with_case
20 | from data
21 | ```
22 |
23 | | with_count | with_case |
24 | | ---------- | --------- |
25 | | 2 | 2 |
26 |
--------------------------------------------------------------------------------
/postgres/cume_dist.md:
--------------------------------------------------------------------------------
1 | # Cumulative Distribution
2 | View an interactive snippet [here](https://count.co/n/1fymVJoCPVM?vm=e).
3 |
4 | # Description
5 | Cumulative Distribution is a method for analysing the distribution of a quantity, similar to histograms. They show, for each value of a quantity, what fraction of rows are smaller or greater.
6 | One method for calculating it is as follows:
7 |
8 | ```sql
9 | SELECT
10 | -- If necessary, use a row_number window function to get the position of this row in the dataset
11 |     (ROW_NUMBER() OVER (ORDER BY <quantity>))::FLOAT / (SELECT COUNT(*) FROM <schema.table>) AS cume_dist,
12 |     <quantity>
13 | FROM <schema.table>
14 | ```
15 | where
16 | - `quantity` - the column containing the metric of interest
17 | - `schema.table` - the table with your data
18 |
19 | # Example
20 |
21 | ```sql
22 | WITH data AS (
23 | SELECT md5(RANDOM()::TEXT) AS username_hash,
24 | RANDOM() AS random_number,
25 | row_number
26 | FROM GENERATE_SERIES(1, 10) AS row_number
27 | )
28 |
29 | SELECT
30 | --we do not need to use a row_number window function as we have a generate_series in our test data set
31 | username_hash,
32 | (row_number::FLOAT / (SELECT COUNT(*) FROM data)) AS frac,
33 | random_number
34 | FROM data;
35 | ```
36 | | username_hash | frac | random_number |
37 | | ----- | ----- | ----- |
38 | | f9689609b388f96ac0846c69d4b916dc | 0.1 | 0.48311887726926983 |
39 | | 68551a63b1eb24c750408dc20c6e4510 | 0.2 | 0.7580949364508456 |
40 | | 86b915387b7842c33c52e71ada5488d2 | 0.3 | 0.9461365697398278 |
41 | | 2f9b8d13471f5f8896c073edc4d712a8 | 0.4 | 0.2723711331889973 |
42 | | 27cd443183f886105001f337d987bdaa | 0.5 | 0.2698584751712403 |
43 | | 370ce4710f9900fa38f99494d50b9275 | 0.6 | 0.011248462980887552 |
44 | | 98d2b23c493e44a395a330ca2d349a47 | 0.7 | 0.7387236527452572 |
45 | | 328ac83d33498f81a6b1e70fb8e0dcbb | 0.8 | 0.5815637247802528 |
46 | | 9426739e865d970e7c932c37714b32f0 | 0.9 | 0.5957734211030683 |
47 | | 452a9d8f21c9773fbd5d832138d0d4d1 | 1 | 0.9090960753850368 |
48 |
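Postgres also has a built-in [`CUME_DIST`](https://www.postgresql.org/docs/13/functions-window.html) window function that computes this fraction directly. A sketch against the same `data` CTE as above (not part of the original snippet):

```sql
SELECT
  username_hash,
  -- fraction of rows with a random_number <= the current row's value
  CUME_DIST() OVER (ORDER BY random_number) AS frac,
  random_number
FROM data;
```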
--------------------------------------------------------------------------------
/postgres/dt-formatting.md:
--------------------------------------------------------------------------------
1 | # Formatting Dates and Timestamps
2 |
3 | # Description
4 | Formatting dates in Postgres requires memorizing a long list of [patterns](https://www.postgresql.org/docs/9.1/functions-formatting.html). These snippets contain the most common formats for quick access.
5 |
6 | # For Dates:
7 | ```sql
8 | with dates as (SELECT day::DATE
9 | FROM generate_series('2021-01-01'::DATE, '2021-01-04'::DATE, '1 day') AS day)
10 |
11 | select
12 | day,
13 | to_char(day,'D/ID/day/dy') day_of_week, --[D]Sunday = 1 / [ID]Monday = 1; can change caps for day to be DAY or Day
14 | to_char(day, 'Month/MM/mon') all_month_formats, -- mon can be MON, or Mon depending on how you want the results capitalized
15 | to_char(day, 'YYYY/YY') year_formats, -- formatting years
16 | to_char(day,'DDD/IDDD') day_of_year, -- day of year / IDDD: day of ISO 8601 week-numbering year (001-371; day 1 of the year is Monday of the first ISO week)
17 | to_char(day,'J') as julian_day, -- days since November 24, 4714 BC at midnight
18 | to_char(day, 'DD') day_of_month, -- day of the month
19 | to_char(day,'WW/IW') week_of_year, -- week of year; IW: week number of ISO 8601 week-numbering year (01-53; the first Thursday of the year is in week 1)
20 | to_char(day,'W') week_of_month, -- week of month (1-4),
21 | to_char(day,'Q') quarter,
22 | to_char(day,'Mon DD YYYY') as american_date, -- Month/Day/Year format
23 | to_char(day,'DD Mon YYYY') as international_date -- Day/Month/Year
24 | from dates
25 | ```
26 | |day|day_of_week|all_month_formats|year_formats|day_of_year|julian_day|day_of_month| week_of_year| week_of_month| quarter| american_date| international_date|
27 | |---|-----------|-----------------|------------|-----------|----------|------------|-------------|--------------|--------|--------------|-------------------|
28 | |01/01/01|6/5/friday /fri|January /01/jan|2021/21|001/369|2459216|01|01/53|1|1|Jan 01 2021|01 Jan 2021|
29 | |02/01/01|7/6/saturday /sat|January /01/jan|2021/21|002/370|2459217|02|01/53|1|1|Jan 02 2021|02 Jan 2021|
30 | |03/01/01|1/7/sunday /sun|January /01/jan|2021/21|003/371|2459218|03|01/53|1|1|Jan 03 2021|03 Jan 2021|
31 | |04/01/01|2/1/monday /mon|January /01/jan|2021/21|004/001|2459219|04|01/01|1|1|Jan 04 2021|04 Jan 2021|
32 |
33 |
34 | # For Timestamps:
35 | ```sql
36 | with timestamps as (
37 | SELECT timeseries_hour
38 | FROM generate_series(
39 | timestamp with time zone '2021-01-01 00:00:00+02',
40 | timestamp with time zone '2021-01-01 15:00:00+02',
41 | '90 minute') AS timeseries_hour
42 | )
43 |
44 | select
45 | timeseries_hour,
46 | to_char(timeseries_hour,'HH/HH24') as hour, --HH: 01-12; H24:0-24
47 | to_char(timeseries_hour,'MI') as minute,
48 | to_char(timeseries_hour,'SS') as seconds,
49 | to_char(timeseries_hour,'MS') as milliseconds,
50 | to_char(timeseries_hour,'US') as microseconds,
51 | to_char(timeseries_hour,'SSSS') as seconds_after_midnight,
52 | to_char(timeseries_hour,'AM') as am_pm,
53 | to_char(timeseries_hour,'TZ') as timezone,
54 | to_char(timeseries_hour,'HH:MI AM') as local_time
55 | from timestamps
56 | ```
57 | |timeseries_hour|hour|minute|seconds|milliseconds|microseconds|seconds_after_midnight|am_pm|timezone|local_time|
58 | |---------------|----|------|-------|------------|------------|----------------------|-----|--------|----------|
59 | |2020-12-31T22:00:00.000Z|10/22|00|00|000|000000|79200|PM|UTC|10:00 PM|
60 | |2020-12-31T23:30:00.000Z|11/23|30|00|000|000000|84600|PM|UTC|11:30 PM|
61 | |2021-01-01T01:00:00.000Z|01/01|00|00|000|000000|3600|AM|UTC|01:00 AM|
62 | |2021-01-01T02:30:00.000Z|02/02|30|00|000|000000|9000|AM|UTC|02:30 AM|
63 | |2021-01-01T04:00:00.000Z|04/04|00|00|000|000000|14400|AM|UTC|04:00 AM|
64 | |2021-01-01T05:30:00.000Z|05/05|30|00|000|000000|19800|AM|UTC|05:30 AM|
65 | |2021-01-01T07:00:00.000Z|07/07|00|00|000|000000|25200|AM|UTC|07:00 AM|
66 | |2021-01-01T08:30:00.000Z|08/08|30|00|000|000000|30600|AM|UTC|08:30 AM|
67 | |2021-01-01T10:00:00.000Z|10/10|00|00|000|000000|36000|AM|UTC|10:00 AM|
68 | |2021-01-01T11:30:00.000Z|11/11|30|00|000|000000|41400|AM|UTC|11:30 AM|
69 | |2021-01-01T13:00:00.000Z|01/13|00|00|000|000000|46800|PM|UTC|01:00 PM|
70 |
--------------------------------------------------------------------------------
/postgres/generate-timeseries.md:
--------------------------------------------------------------------------------
1 | # Generate a timeseries of dates and/or times
2 | View an interactive version of this snippet [here](https://count.co/n/Du9nUudW9MH?vm=e).
3 |
4 | # Description
5 |
6 | In order to conduct timeseries analysis, we need to have a complete timeseries for the data in question. Often, we cannot be certain that every day/hour etc. occurs naturally in our dataset.
7 | Therefore, we need to be able to generate a series of dates and/or times to base our query on.
8 | The [`GENERATE_SERIES`](https://www.postgresql.org/docs/13/functions-srf.html) function will allow us to do this:
9 |
10 | ```sql
11 | --generate timeseries in hourly increments
12 | SELECT timeseries_hour
13 | FROM generate_series('2021-01-01 00:00:00'::TIMESTAMP, '2021-01-01 03:00:00'::TIMESTAMP, '1 hour') AS timeseries_hour;
14 | ```
15 | | timeseries_hour |
16 | |---|
17 | | 2021-01-01 00:00:00.000000 |
18 | | 2021-01-01 01:00:00.000000 |
19 | | 2021-01-01 02:00:00.000000 |
20 | | 2021-01-01 03:00:00.000000 |
21 |
22 | ```sql
23 | --generate timeseries in daily increments
24 | SELECT timeseries_day::DATE
25 | FROM generate_series('2021-01-01'::DATE, '2021-01-04'::DATE, '1 day') AS timeseries_day;
26 | ```
27 |
28 | | timeseries_day |
29 | |---|
30 | | 2021-01-01 |
31 | | 2021-01-02 |
32 | | 2021-01-03 |
33 | | 2021-01-04 |
34 |
--------------------------------------------------------------------------------
/postgres/get-last-array-element.md:
--------------------------------------------------------------------------------
1 | # Get the last element of an array
2 |
3 | Explore this snippet [here](https://count.co/n/0loHJW60YO8?vm=e).
4 |
5 | # Description
6 | Use the `array_length` function to access the final element of an array. This function takes two arguments, an array and a dimension index. For common one-dimensional arrays, the dimension index is always 1.
7 |
8 | ```sql
9 | with data as (select array[1,2,3,4] as nums)
10 |
11 | select nums[array_length(nums, 1)] from data
12 | ```
--------------------------------------------------------------------------------
/postgres/histogram-bins.md:
--------------------------------------------------------------------------------
1 | # Histogram Bins
2 | View an interactive version of this snippet [here](https://count.co/n/27wc3pvc5R0?vm=e).
3 |
4 | # Description
5 | Creating histograms using SQL can be tricky. This template will let you quickly create the bins needed to create a histogram:
6 |
7 | ```sql
8 | SELECT
9 |     COUNT(1)::FLOAT / (SELECT COUNT(1) FROM <table>) AS percentage_of_results,
10 |     FLOOR(<column> / <bin_size>) * <bin_size> AS bin
11 | FROM
12 |     <table>
13 | GROUP BY
14 | bin
15 | ```
16 | where:
17 | - `<column>` is the column you want to turn into a histogram
18 | - `<bin_size>` is the width of each bin. You can adjust this to get the desired level of detail for the histogram.
19 |
20 | # Example
21 | Adjust the bin size until you can see the 'shape' of the data. The following example finds the fraction of whiskey reviews in each star-rating bin (ratings run from 1 to 5, with a bin size of 1):
22 |
23 | ```sql
24 | WITH data AS (
25 | SELECT CASE
26 | WHEN RANDOM() <= 0.25 THEN 'tennessee'
27 | WHEN RANDOM() <= 0.5 THEN 'colonel'
28 | WHEN RANDOM() <= 0.75 THEN 'pikesville'
29 | ELSE 'michters'
30 | END AS whiskey,
31 | CASE
32 | WHEN RANDOM() <= 0.20 THEN 1
33 | WHEN RANDOM() <= 0.40 THEN 2
34 | WHEN RANDOM() <= 0.60 THEN 3
35 | WHEN RANDOM() <= 0.80 THEN 4
36 | ELSE 5
37 | END AS star_rating,
38 | review_date
39 | FROM generate_series('2021-01-01'::TIMESTAMP,'2021-01-31'::TIMESTAMP, '3 DAY') AS review_date
40 | )
41 |
42 | SELECT
43 | COUNT(*) / (SELECT COUNT(1)::FLOAT FROM data) AS percentage_of_results,
44 | FLOOR(data.star_rating / 1) * 1 bin
45 | FROM data
46 | GROUP BY bin
47 | ```
48 | | percentage_of_results | bin |
49 | | ----- | ----- |
50 | | 0.2727272727272727 | 3 |
51 | | 0.09090909090909091 | 5 |
52 | | 0.18181818181818182 | 1 |
53 | | 0.45454545454545453 | 2 |
54 |
55 |
--------------------------------------------------------------------------------
/postgres/json-strings.md:
--------------------------------------------------------------------------------
1 | # Extracting values from JSON strings
2 |
3 | Explore this snippet [here](https://count.co/n/woUuvi105YA?vm=e).
4 |
5 | # Description
6 | Postgres has supported lots of JSON functionality for a while now - see some examples in the [official documentation](https://www.postgresql.org/docs/current/functions-json.html).
7 | ## Operators
8 | There are several operators that act as accessors on JSON data types
9 |
10 | ```sql
11 | with json as (select '{
12 | "a": 1,
13 | "b": "bee",
14 | "c": [
15 | 4,
16 | 5,
17 | { "d": [6, 7] }
18 | ]
19 | }' as text)
20 |
21 | select
22 | text::json as root, -- Cast text to JSON
23 | text::json->'a' as a, -- Extract number as string
24 | text::json->'b' as b, -- Extract string
25 | text::json->'c' as c, -- Extract object
26 |   text::json->'c'->0 as c_first_json,    -- Extract array element as JSON
27 |   text::json->'c'->>0 as c_first_text,   -- Extract array element as text
28 | text::json#>'{c, 2}' as c_third, -- Extract deeply-nested object
29 | text::json#>'{c, 2}'->'d' as d -- Extract field from deeply-nested object
30 | from json
31 | ```
32 |
33 | | root | a | b | c | c_first_json | c_first_text | c_third | d |
34 | | --------------------------------------------------- | - | ----- | ----------------------- | ----------- | ----------- | --------------- | ------ |
35 | | { "a": 1, "b": "bee", "c": [4, 5, { "d": [6, 7] }]} | 1 | "bee" | [4, 5, { "d": [6, 7] }] | 4 | 4 | { "d": [6, 7] } | [6, 7] |
36 |
37 |
38 | ## JSON_EXTRACT_PATH
39 | As an alternative to the operator approach, the JSON_EXTRACT_PATH function works by specifying the keys of the JSON fields to be extracted:
40 |
41 | ```sql
42 | select
43 | json_extract_path('{
44 | "a": 1,
45 | "b": "bee",
46 | "c": [
47 | 4,
48 | 5,
49 | { "d": [6, 7] }
50 | ]
51 | }', 'c', '2', 'd', '0')
52 | ```
53 |
54 | | json_extract_path |
55 | | ----------------- |
56 | | 6 |
--------------------------------------------------------------------------------
/postgres/last-x-days.md:
--------------------------------------------------------------------------------
1 | # Filter for last X days
2 |
3 | ## Description
4 | Often you only want to return the last X days of a query (e.g. last 30 days) and you want that to always be up to date (dynamic).
5 |
6 |
7 | This snippet can be added to your WHERE clause to filter for the last X days:
8 |
9 | ``` sql
10 | SELECT
11 |   <column_1>,...,
12 |   <column_n>
13 | FROM
14 |   <table>
15 | WHERE
16 |   <date_column> > current_date - interval 'X' day
17 | ```
18 | where:
19 |
20 |
21 | - `<column_1>,..., <column_n>` are all the columns you want to return in your query
22 | - `<date_column>` is the date column you want to filter on
23 | - `X` is the number of days you want to filter on
24 |
25 |
26 | For reference see Postgres's [date/time Functions and Operators](https://www.postgresql.org/docs/8.2/functions-datetime.html).
27 |
28 | ```sql
29 | select
30 | day,
31 | rank,
32 | title,
33 | artist,
34 | streams
35 | from
36 | "public".spotify_daily
37 | where
38 | day > current_date - interval '180' day
39 | and artist = 'Drake'
40 | ```
41 | |day|rank|title|artist|streams|
42 | |---|----|-----|------|-------|
43 | |06/04/2021|46|What's Next|Drake|1,524,267|
44 | |...|...|...|...|...|
45 |
--------------------------------------------------------------------------------
/postgres/median.md:
--------------------------------------------------------------------------------
1 | # Calculating the Median
2 | View an interactive version of this snippet [here](https://count.co/n/QH2mMBK2RJu?vm=e).
3 |
4 | # Description
5 | Postgres does not have an explicit `MEDIAN` function. The following snippet will show you how to calculate the median of any column.
6 |
7 | # Snippet ✂️
8 | Here is a median function with a sample dataset:
9 |
10 | ```sql
11 | WITH data AS (
12 | SELECT CASE
13 | WHEN RANDOM() <= 0.25 THEN 'apple'
14 | WHEN RANDOM() <= 0.5 THEN 'banana'
15 | WHEN RANDOM() <= 0.75 THEN 'pear'
16 | ELSE 'orange'
17 | END AS fruit,
18 | RANDOM()::DOUBLE PRECISION AS weight
19 | FROM generate_series(1,10)
20 | )
21 |
22 | SELECT
23 | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY weight) AS median
24 | FROM
25 | data
26 | ```
27 | | median |
28 | | ----------- |
29 | | 0.1358146 |
30 |
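Because `PERCENTILE_CONT` is an ordered-set aggregate, it also works per group with `GROUP BY`. A sketch reusing the `data` CTE above:

```sql
SELECT
  fruit,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY weight) AS median_weight
FROM data
GROUP BY fruit
```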
--------------------------------------------------------------------------------
/postgres/moving-average.md:
--------------------------------------------------------------------------------
1 | # Moving Average
2 | View an interactive version of this snippet [here](https://count.co/n/u5jbJD3MDX6?vm=e).
3 |
4 | # Description
5 | Moving averages are quite simple to calculate in Postgres, using the `AVG` window function. Here is an example query over 31 days of data, averaging the current row and the 7 preceding rows, calculated separately for 1 dimension (colour):
6 |
7 | ```sql
8 | WITH data AS (
9 | SELECT *,
10 | CASE WHEN num <= 0.5 THEN 'blue' ELSE 'red' END AS str
11 | FROM (SELECT generate_series('2021-01-01'::TIMESTAMP, '2021-01-31'::TIMESTAMP, '1 day') AS timeseries,
12 | RANDOM() AS num
13 | ) t
14 | )
15 |
16 | SELECT timeseries,
17 | str,
18 | num,
19 | AVG(num) OVER (PARTITION BY str ORDER BY timeseries ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS mov_avg
20 | FROM data
21 | ORDER BY timeseries, str;
22 | ```
23 |
24 | timeseries|str|num|mov_avg
25 | | --- | ----------- | ------ |-------- |
26 | 2021-01-01 00:00:00.000000|red|0.6724503329790004|0.6724503329790004
27 | 2021-01-02 00:00:00.000000|blue|0.02015057538457654|0.02015057538457654
28 | 2021-01-03 00:00:00.000000|blue|0.19436280964817954|0.10725669251637804
29 | 2021-01-04 00:00:00.000000|blue|0.3170075569498252|0.17717364732752708
30 | 2021-01-05 00:00:00.000000|blue|0.45274767058829823|0.24606715314271987
31 | 2021-01-06 00:00:00.000000|red|0.5295068450998173|0.6009785890394088
32 | 2021-01-07 00:00:00.000000|red|0.5856947203861544|0.5958839661549907
33 | 2021-01-08 00:00:00.000000|blue|0.23106543649677036|0.24306680981352996
34 | 2021-01-09 00:00:00.000000|red|0.8057297084321|0.648345401724268
35 | 2021-01-10 00:00:00.000000|blue|0.18782650473713147|0.23386009230079688
36 | 2021-01-11 00:00:00.000000|blue|0.012715414005509018|0.20226799540147006
37 | 2021-01-12 00:00:00.000000|blue|0.41378600069740656|0.22870774606346211
38 | 2021-01-13 00:00:00.000000|red|0.7848015045167251|0.6756366222827594
39 | 2021-01-14 00:00:00.000000|blue|0.3062660702635327|0.26447218292333163
40 | 2021-01-15 00:00:00.000000|red|0.7321478245155468|0.685055155988224
41 | 2021-01-16 00:00:00.000000|blue|0.20009039295143083|0.26518813083623805
42 | 2021-01-17 00:00:00.000000|red|0.9851023962729109|0.7279190474574649
43 | 2021-01-18 00:00:00.000000|blue|0.42790603927394955|0.2790504411267536
44 | 2021-01-19 00:00:00.000000|red|0.7966662316864124|0.7365124454860834
45 | 2021-01-20 00:00:00.000000|red|0.8913488809896606|0.7638747639874159
46 | 2021-01-21 00:00:00.000000|red|0.6797595409599531|0.7826563509699329
47 | 2021-01-22 00:00:00.000000|red|0.9692422752366809|0.8305997953262487
48 | 2021-01-23 00:00:00.000000|blue|0.3873783130847208|0.2708792714388064
49 | 2021-01-24 00:00:00.000000|red|0.960779246012315|0.8499809875237756
50 | 2021-01-25 00:00:00.000000|red|0.6482284656189883|0.8329093576615585
51 | 2021-01-26 00:00:00.000000|red|0.8471274360131815|0.8472818090987628
52 | 2021-01-27 00:00:00.000000|red|0.5269376445697667|0.7900112151358698
53 | 2021-01-28 00:00:00.000000|blue|0.07696316291711724|0.2516164872413498
54 | 2021-01-29 00:00:00.000000|blue|0.1035287644690932|0.241079269707845
55 | 2021-01-30 00:00:00.000000|red|0.7339086567504438|0.7821665182688737
56 | 2021-01-31 00:00:00.000000|red|0.8169741358988709|0.772869675132525
57 |
58 |
--------------------------------------------------------------------------------
/postgres/ntile.md:
--------------------------------------------------------------------------------
1 | # Creating equal-sized buckets using ntile
2 | View an interactive version of this snippet [here](https://count.co/n/IgJwlWSujF0?vm=e).
3 |
4 | # Description
5 | When you want to create equal sized groups (i.e. buckets) of rows, there is a SQL function for that too.
6 | The function [NTILE](https://www.postgresql.org/docs/13/functions-window.html) lets you do exactly that:
7 |
8 | ```sql
9 | WITH users AS (
10 | SELECT md5(RANDOM()::TEXT) AS name_hash,
11 | CASE WHEN RANDOM() <= 0.5 THEN 'male' ELSE 'female' END AS gender,
12 | CASE WHEN RANDOM() <= 0.33 THEN 'student' WHEN RANDOM() <= 0.66 THEN 'professor' ELSE 'alumni' END AS status
13 | FROM generate_series(1, 10)
14 | )
15 |
16 | SELECT
17 | name_hash,
18 | gender,
19 | status,
20 | NTILE(3) OVER(PARTITION BY status ORDER BY status) AS ntile_bucket
21 | FROM
22 | users;
23 | ```
24 |
25 | |name_hash|gender|status|ntile_bucket|
26 | |------|------|------|------|
27 | |8c9882e8cc6f748c06b3a20ee7af9787|female|alumni|1|
28 | |032dffdb7cb67a5f1a6b77b713043508|female|alumni|2|
29 | |72559a7e25935004d172329a47cb3db5|male|alumni|3|
30 | |284d1337a3d30433fe17ad2110d27655|male|professor|1|
31 | |3797bb418424d7ced117e8817f398fea|male|professor|2|
32 | |e56d1e6e53f12c6a133c4f47a962754c|female|professor|3|
33 | |8713e6b5a1c359f42e960c05460c8e14|male|student|1|
34 | |af8ea7ea643fbbfefe7c845e77c9e4d3|male|student|1|
35 | |033161ae441c347c4aa43acd46347836|female|student|2|
36 | |a476b8c2cfa8df1c9494c9851cbd4e66|female|student|3|
37 |
38 |
--------------------------------------------------------------------------------
/postgres/rank.md:
--------------------------------------------------------------------------------
1 | # Ranking Data
2 |
3 | # Description
4 | A common requirement is the need to rank a particular row in a group by some quantity - for example, 'rank stores by total sales in each region'. Fortunately, Postgres natively has a `rank` window function to do this.
5 | The syntax looks like this:
6 |
7 | ```sql
8 | SELECT
9 |   RANK() OVER(PARTITION BY <fields> ORDER BY <rank_by>) AS RANK
10 | FROM <schema.table>
11 | ```
12 | where
13 | - `fields` - zero or more columns to split the results by - a rank will be calculated for each unique combination of values in these columns
14 | - `rank_by` - a column to decide the ranking
15 | - `schema.table` - where to pull these columns from
16 |
17 | # Example
18 |
19 | We want to find out, for our random data sample of fruits, which fruit is most popular.
20 |
21 | ```sql
22 | WITH data AS (
23 | SELECT CASE
24 | WHEN RANDOM() <= 0.25 THEN 'apple'
25 | WHEN RANDOM() <= 0.5 THEN 'banana'
26 | WHEN RANDOM() <= 0.75 THEN 'pear'
27 | ELSE 'orange'
28 | END AS fruit,
29 | RANDOM() + (series^RANDOM()) AS popularity
30 | FROM generate_series(1, 10) AS series
31 | )
32 |
33 | SELECT
34 | fruit,
35 | popularity,
36 | RANK() OVER(ORDER BY popularity DESC) AS popularity_rank
37 | FROM data
38 | ```
39 |
40 | | fruit | popularity | popularity_rank |
41 | | ----- | ----- | ----- |
42 | | banana | 3.994866211350613 | 1 |
43 | | banana | 3.2802621307398208 | 2 |
44 | | banana | 2.626913466142052 | 3 |
45 | | pear | 2.59071184223806 | 4 |
46 | | banana | 2.137745293125355 | 5 |
47 | | apple | 2.0741654350518637 | 6 |
48 | | apple | 2.0401770856253933 | 7 |
49 | | pear | 1.771223675755631 | 8 |
50 | | banana | 1.743055659082957 | 9 |
51 | | pear | 1.7158950640518675 | 10 |
52 |
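To rank within groups rather than across the whole table (as in the 'rank stores by total sales in each region' example), add the grouping column(s) to `PARTITION BY`. A sketch reusing the `data` CTE above:

```sql
SELECT
  fruit,
  popularity,
  -- the ranking restarts for each fruit
  RANK() OVER(PARTITION BY fruit ORDER BY popularity DESC) AS popularity_rank_within_fruit
FROM data
```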
53 |
--------------------------------------------------------------------------------
/postgres/regex-email.md:
--------------------------------------------------------------------------------
1 | # RegEx: Validate Email Addresses
2 | View an interactive version of this snippet [here](https://count.co/n/QLRHXD9dVRG?vm=e).
3 |
4 |
5 | # Description
6 | The only real way to validate an email address is to send an email to it and observe the response, but for the purposes of analytics it's often useful to strip out rows containing malformed email addresses.
7 | Using the `SUBSTRING` function, the email validation RegEx comes courtesy of [Evan Carroll](https://dba.stackexchange.com/questions/68266/what-is-the-best-way-to-store-an-email-address-in-postgresql/165923#165923):
8 |
9 | # Snippet ✂️
10 |
11 | ```sql
12 | WITH data AS (
13 | SELECT UNNEST(ARRAY[
14 | 'a',
15 | 'hello@count.c',
16 | 'hello@count.co',
17 | '@count.co',
18 | 'a@a.a'
19 | ]) AS emails
20 | )
21 |
22 | SELECT
23 | emails,
24 | CASE WHEN
25 | SUBSTRING(
26 | LOWER(emails),
27 | '^[a-zA-Z0-9.!#$%&''''*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$'
28 | ) IS NOT NULL
29 | THEN TRUE
30 | ELSE FALSE
31 | END AS is_valid
32 | FROM data
33 | ```
34 |
35 | | emails | is_valid |
36 | | ----- | ----- |
37 | | a | false |
38 | | hello@count.c | true |
39 | | hello@count.co | true |
40 | | @count.co | false |
41 | | a@a.a | true |
42 |
--------------------------------------------------------------------------------
/postgres/regex-parse-url.md:
--------------------------------------------------------------------------------
1 | # RegEx: Parse URL String
2 | View an interactive version of this snippet [here](https://count.co/n/rpoUsuw0s3r?vm=e).
3 |
4 |
5 | # Description
6 | URL strings are a very valuable source of information, but trying to get your RegEx exactly right is a pain.
7 | This re-usable snippet can be applied to any URL to extract the exact part you're after.
8 |
9 | # Snippet ✂️
10 |
11 | ```sql
12 | SELECT
13 | SUBSTRING(my_url, '^([a-zA-Z]+)://') AS protocol,
14 | SUBSTRING(my_url, '(?:[a-zA-Z]+:\/\/)?([a-zA-Z0-9.]+)\/?') AS host,
15 | SUBSTRING(my_url, '(?:[a-zA-Z]+://)?(?:[a-zA-Z0-9.]+)/{1}([a-zA-Z0-9./]+)') AS path,
16 | SUBSTRING(my_url, '\?(.*)') AS query,
17 | SUBSTRING(my_url, '#(.*)') AS ref
18 | FROM <table>
19 | ```
20 | where:
21 | - `my_url` is your URL string (e.g. `"https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#Ref1"`) and `<table>` is the table containing it
22 | - `protocol` is how the data is transferred between the host and server (e.g. `https`)
23 | - `host` is the domain name (e.g. `yoursite.com`)
24 | - `path` is the page path on the website (e.g. `pricing/details`)
25 | - `query` is the string of query parameters in the url (e.g. `myparam1=123`)
26 | - `ref` is the reference or where the user came from before visiting the url (e.g. `#newsfeed`)
27 |
28 | # Usage
29 | To use the snippet, just copy out the RegEx for whatever part of the URL you need - or capture them all as in the example below:
30 |
31 | ```sql
32 | WITH data as
33 | (SELECT 'https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#newsfeed' AS my_url)
34 |
35 | SELECT
36 | SUBSTRING(my_url, '^([a-zA-Z]+)://') AS protocol,
37 | SUBSTRING(my_url, '(?:[a-zA-Z]+:\/\/)?([a-zA-Z0-9.]+)\/?') AS host,
38 | SUBSTRING(my_url, '(?:[a-zA-Z]+://)?(?:[a-zA-Z0-9.]+)/{1}([a-zA-Z0-9./]+)') AS path,
39 | SUBSTRING(my_url, '\?(.*)') AS query,
40 | SUBSTRING(my_url, '#(.*)') AS ref
41 | FROM data
42 | ```
43 | | protocol | host | path | query | ref |
44 | | --- | ------ | ----- | --- | ------|
45 | | https | www.yoursite.com | pricing/details | myparam1=123&myparam2=abc#newsfeed | newsfeed |
46 |
47 | # References
48 | - [The Anatomy of a Full Path URL](https://zvelo.com/anatomy-of-full-path-url-hostname-protocol-path-more/)
49 |
--------------------------------------------------------------------------------
/postgres/replace-empty-strings-null.md:
--------------------------------------------------------------------------------
1 | # Replace empty strings with NULLs
2 |
3 | Explore this snippet [here](https://count.co/n/Fzyv9i4SP2B?vm=e).
4 |
5 | # Description
6 |
7 | An essential part of cleaning a new data source is deciding how to treat missing values. The [`CASE`](https://www.postgresql.org/docs/current/functions-conditional.html#FUNCTIONS-CASE) statement can help with replacing empty values with something else:
8 |
9 | ```sql
10 | WITH data AS (
11 | SELECT *
12 | FROM (VALUES ('a'), ('b'), (''), ('d')) AS data (str)
13 | )
14 |
15 | SELECT
16 | CASE WHEN LENGTH(str) != 0 THEN str END AS str
17 | FROM data;
18 | ```
19 |
20 | | str |
21 | | ---- |
22 | | a |
23 | | b |
24 | | NULL |
25 | | d |
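A more compact alternative (not used above) is [`NULLIF`](https://www.postgresql.org/docs/current/functions-conditional.html), which returns NULL when its two arguments are equal. A sketch reusing the `data` CTE above:

```sql
SELECT
  NULLIF(str, '') AS str
FROM data;
```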
--------------------------------------------------------------------------------
/postgres/replace-null.md:
--------------------------------------------------------------------------------
1 | # Replace NULLs
2 |
3 | Explore this snippet [here](https://count.co/n/hpKKsv0U2Rv?vm=e).
4 |
5 | # Description
6 |
7 | An essential part of cleaning a new data source is deciding how to treat NULL values. The Postgres function [COALESCE](https://www.postgresql.org/docs/current/functions-conditional.html) can help with replacing NULL values with something else:
8 |
9 | ```sql
10 | with data as (
11 | select * from (values
12 | (1, 'one'),
13 | (null, 'two'),
14 | (3, null)
15 | ) as data (num, str)
16 | )
17 |
18 | select
19 | coalesce(num, -1) num,
20 | coalesce(str, 'I AM NULL') str
21 | from data
22 | ```
23 |
24 | | num | str |
25 | | --- | --------- |
26 | | 1   | one       |
27 | | -1  | two       |
28 | | 3 | I AM NULL |
--------------------------------------------------------------------------------
/postgres/split-column-to-rows.md:
--------------------------------------------------------------------------------
1 | # Split a single column into separate rows
2 | View an interactive version of this snippet [here](https://count.co/n/lF1dUz66TuF?vm=e).
3 |
4 |
5 | # Description
6 |
7 | You will often see multiple values, separated by a single character, appear in a single column.
8 | If you want to split them out but instead of having separate columns, generate rows for each value, you can use the function [`REGEXP_SPLIT_TO_TABLE`](https://www.postgresql.org/docs/13/functions-string.html):
9 |
10 | ```sql
11 | WITH data AS (
12 | SELECT *
13 | FROM (VALUES ('yellow;green;blue'), ('orange;red;grey;black')) AS data (str)
14 | )
15 |
16 | SELECT
17 | REGEXP_SPLIT_TO_TABLE(str,';') str_split
18 | FROM data;
19 | ```
20 | _The separator can be any single character (e.g. ',' or '/') or something more complex like a string (e.g. 'to&char123'), as the function uses regular expressions._
21 |
22 | | str_split |
23 | | ---- |
24 | | yellow |
25 | | green |
26 | | blue |
27 | | orange |
28 | | red |
29 | | grey |
30 | | black |
31 |
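If the separator is a fixed string rather than a regular expression, `STRING_TO_ARRAY` combined with `UNNEST` is a common alternative (a sketch reusing the `data` CTE above; not part of the original snippet):

```sql
SELECT
  UNNEST(STRING_TO_ARRAY(str, ';')) AS str_split
FROM data;
```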
--------------------------------------------------------------------------------
/snowflake/100-stacked-bar.md:
--------------------------------------------------------------------------------
1 | # 100% bar chart
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/C6BCLP94YLs?vm=e).
4 |
5 | # Description
6 |
7 | 100% bar charts compare the proportions of a metric across different groups.
8 | To build one you'll need:
9 | - A numerical column as the metric you want to compare. This should be represented as a percentage.
10 | - 2 categorical columns. One for the x-axis and one for the coloring of the bars.
11 |
12 | ```sql
13 | SELECT
14 |   AGG_FN(<COLUMN>) as metric,
15 |   <XAXIS_COLUMN> as x,
16 |   <COLOR_COLUMN> as color
17 | FROM
18 |   <TABLE>
19 | GROUP BY
20 | x, color
21 | ```
22 |
23 | where:
24 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc.
25 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column whose values add up to 1 (100%) within each bar.
26 | - `XAXIS_COLUMN` is the group you want to show on the x-axis. Make sure this is a categorical or temporal column (not a number)
27 | - `COLOR_COLUMN` is the group you want to show as different colors of your bars.
28 |
29 | # Usage
30 |
31 | In this example with some London weather data, we calculate the percentage of days in each month that are of each weather type (e.g. rain, sun).
32 |
33 | ```sql
34 | -- a
35 | select
36 | count(DATE) days,
37 | month(date) month,
38 | WEATHER
39 | from PUBLIC.LONDON_WEATHER
40 | group by month, WEATHER
41 | order by month
42 | ```
43 |
44 | | DAYS | MONTH | WEATHER |
45 | | ---- | ----- | ------- |
46 | | 33 | 1 | sunny |
47 | | 16 | 1 | fog |
48 | | 74 | 1 | rain |
49 | | ... | ... | ... |
50 |
51 | ```sql
52 | -- b
select
53 | count(DATE) days,
54 | month(date) month
55 | from PUBLIC.LONDON_WEATHER
56 | group by month
57 | order by month
58 | ```
59 | | DAYS | MONTH |
60 | | ---- | ----- |
61 | | 123 | 1 |
62 | | 113 | 2 |
63 | | 123 | 3 |
64 | | 120 | 4 |
65 | | ... | ... |
66 |
67 | ```sql
68 | -- c
69 | select
70 | "a".WEATHER,
71 | "a".MONTH,
72 | "a".DAYS / "b".DAYS per_days
73 | from
74 | "a" left join "b" on "a".MONTH = "b".MONTH
75 | ```
76 |
77 |
78 |
--------------------------------------------------------------------------------
/snowflake/bar-chart.md:
--------------------------------------------------------------------------------
1 | # Bar chart
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/CmSsFFvSJfz?vm=e).
4 |
5 | # Description
6 |
7 | Bar charts are perhaps the most common charts used in data analysis. They have one categorical or temporal axis, and one numerical axis.
8 | You just need a simple GROUP BY to get all the elements you need for a bar chart:
9 |
10 | ```sql
11 | SELECT
12 |   AGG_FN(<COLUMN>) as metric,
13 |   <CATEGORICAL_COLUMN> as cat
14 | FROM
15 |   <TABLE>
16 | GROUP BY
17 | cat
18 | ```
19 |
20 | where:
21 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc.
22 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column.
23 | - `CATEGORICAL_COLUMN` is the group you want to show on the x-axis. Make sure this is a categorical or temporal column (not a number)
24 |
25 | # Usage
26 |
27 | In this example with some Spotify data, we'll look at the average daily streams by month of year:
28 |
29 | ```sql
30 | select
31 | avg(streams) avg_daily_streams,
32 | MONTH(day) month
33 | from (
34 | select sum(streams) streams,
35 | day
36 | from PUBLIC.SPOTIFY_DAILY_TRACKS
37 | group by day
38 | )
39 | group by month
40 | ```
41 |
42 |
--------------------------------------------------------------------------------
/snowflake/count-nulls.md:
--------------------------------------------------------------------------------
1 | # Count NULLs
2 |
3 | Explore this snippet [here](https://count.co/n/E0yQhZaKdmD?vm=e).
4 |
5 | # Description
6 | Part of the data cleaning process involves understanding the quality of your data. NULL values are usually best avoided, so counting their occurrences is a common operation.
7 | There are several methods that can be used here:
8 | - `sum(iff(<column> is null, 1, 0))` - use the [`IFF`](https://docs.snowflake.com/en/sql-reference/functions/iff.html) function to return 1 or 0 depending on whether a value is NULL, then aggregate.
9 | - `count(*) - count()` - use the different forms of the [`count()`](https://docs.snowflake.com/en/sql-reference/functions/count.html) aggregation which include and exclude NULLs.
10 | - `sum(case when x is null then 1 else 0 end)` - similar to the IFF method, but using a [`CASE`](https://docs.snowflake.com/en/sql-reference/functions/case.html) statement instead.
11 |
12 | ```sql
13 | with data as (
14 | select * from (values (1), (2), (null), (null), (5)) as data (x)
15 | )
16 |
17 | select
18 | sum(iff(x is null, 1, 0)) with_iff,
19 | count(*) - count(x) with_count,
20 | sum(case when x is null then 1 else 0 end) with_case
21 | from data
22 | ```
23 |
24 | | WITH_IFF | WITH_COUNT | WITH_CASE |
25 | | -------- | ---------- | --------- |
26 | | 2 | 2 | 2 |
--------------------------------------------------------------------------------
/snowflake/filtering-arrays.md:
--------------------------------------------------------------------------------
1 | # Filtering arrays
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/2rWE6XxrMAm?vm=e).
4 |
5 | # Description
6 |
7 | To filter arrays in Snowflake you can use a combination of ARRAY_AGG and FLATTEN:
8 |
9 | ```sql
10 | select
11 | array_agg(value) as filtered
12 | from (select array_construct(1, 2, 3, 4) as x),
13 | lateral flatten(input => x)
14 | where value <= 3
15 | ```
16 | | FILTERED |
17 | | -------- |
18 | | [1,2,3] |
19 |
--------------------------------------------------------------------------------
/snowflake/flatten-array.md:
--------------------------------------------------------------------------------
1 | # Flatten an Array
2 |
3 |
4 | # Description
5 | To flatten an ARRAY in other dialects you could use UNNEST. In Snowflake, that functionality is called [FLATTEN](https://docs.snowflake.com/en/sql-reference/functions/flatten.html). To use it you can do:
6 |
7 | ```sql
8 | SELECT
9 | VALUE
10 | FROM
11 |     <TABLE>,
12 |     LATERAL FLATTEN(INPUT => <ARRAY_COLUMN>)
13 | ```
14 | where:
15 | - `<TABLE>` is your table with the array column
16 | - `<ARRAY_COLUMN>` is the name of the column with the arrays
17 | > Note: You can select more than VALUE; FLATTEN returns several columns, as the example below shows.
18 |
19 | ```sql
20 | with demo_arrays as (
21 | select array_construct(1,4,'a') arr
22 | )
23 |
24 | select
25 | * --SHOW ALL THE COLUMNS RETURNED W/ FLATTEN. Common to use VALUE.
26 | from demo_arrays,
27 | lateral flatten(input => arr)
28 | ```
29 | |ARR|SEQ|KEY|PATH|INDEX|VALUE|THIS|
30 | |---|---|---|----|-----|-----|----|
31 | |[1,4,"a"]|1|NULL|[0]|0|1|[1,4,"a"]|
32 | |[1,4,"a"]|1|NULL|[1]|1|4|[1,4,"a"]|
33 | |[1,4,"a"]|1|NULL|[2]|2|"a"|[1,4,"a"]|
34 |
35 |
36 |
37 | # Example:
38 | Here's an array containing JSON objects. We can flatten the array and then use the `:` path notation to get to the JSON values.
39 |
40 | ```sql
41 | With demo_data as (
42 | select array_construct(
43 | parse_json('{
44 | "AveragePosition": 1,
45 | "Competitor": "test.com.au",
46 | "EstimatedImpressions": 4583,
47 | "SearchTerms": 3,
48 | "ShareOfClicks": "14.25%",
49 | "ShareOfSpend": "1.09%"
50 | }'),
51 | parse_json('{
52 | "AveragePosition": 1.9,
53 | "Competitor": "businessname.com.au",
54 | "EstimatedImpressions": 10118,
55 | "SearchTerms": 34,
56 | "ShareOfClicks": "6.77%",
57 | "ShareOfSpend": "9.16%"
58 | }'))
59 | as demo_array
60 | )
61 |
62 | select
63 | value:AveragePosition,
64 | value:Competitor,
65 | value:EstimatedImpressions,
66 | value:SearchTerms,
67 |   value:ShareOfClicks,
68 |   value:ShareOfSpend
69 | from (
70 | select value from demo_data,
71 | lateral flatten(input => demo_array))
72 | ```
73 | |VALUE:AVERAGEPOSITION|VALUE:COMPETITOR|VALUE:ESTIMATEDIMPRESSIONS|VALUE:SEARCHTERMS|VALUE:SHAREOFCLICKS|VALUE:SHAREOFSPEND|
74 | |-----|-----|-----|-----|------|------|
75 | |1|"test.com.au"|4583|3|"14.25%"|"1.09%"|
76 | |1.9|"businessname.com.au"|10118|34|"6.77%"|"9.16%"|
77 |
--------------------------------------------------------------------------------
/snowflake/generate-arrays.md:
--------------------------------------------------------------------------------
1 | # Generating arrays
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/PeGMQjXNBBq?vm=e).
4 |
5 | # Description
6 |
7 | Snowflake supports the ARRAY column type - here are a few methods for generating arrays.
8 |
9 | ## Literals
10 |
11 | Arrays can contain most types, and they don't have to have the same supertype.
12 |
13 | ```sql
14 | select
15 | array_construct(1, 2, 3) as num_array,
16 | array_construct('a', 'b', 'c') as str_array,
17 | array_construct(null, 'hello', 3::double, 4, 5) mixed_array,
18 | array_construct(current_timestamp(), current_timestamp()) as timestamp_array
19 | ```
20 | | NUM_ARRAY | STR_ARRAY | MIXED_ARRAY | TIMESTAMP_ARRAY |
21 | | --------- | ------------- | -------------------- | --------------- |
22 | | [1,2,3] | ["a","b","c"] | [null,"hello",3,4,5] | ["2021-07-02 11:37:36.205 -0400","2021-07-02 11:37:36.205 -0400"] |
23 |
24 | ## GENERATOR
25 | Analogous to `numpy.linspace()`.
26 | See this Snowflake [article](https://community.snowflake.com/s/article/Generate-gap-free-sequences-of-numbers-and-dates) for more details.
27 |
28 |
29 | For generating a gap-free sequence of numbers, you can use [GENERATOR](https://docs.snowflake.com/en/sql-reference/functions/generator.html) in combination with [SEQ4](https://docs.snowflake.com/en/sql-reference/functions/seq1.html):
30 |
31 | ```sql
32 | select row_number() over (order by seq4()) consecutive_nums
33 | from table(generator(rowcount => 10));
34 | ```
35 | | CONSECUTIVE_NUMS |
36 | | ---------------- |
37 | | 1 |
38 | | 2 |
39 | | 3 |
40 | | 4 |
41 | | 5 |
42 | | 6 |
43 | | 7 |
44 | | 8 |
45 | | 9 |
46 | | 10 |
47 |
48 | Or to generate a sequence of consecutive dates, you can add [TIMEADD](https://docs.snowflake.com/en/sql-reference/functions/timeadd.html):
49 |
50 | ```sql
51 | SELECT TIMEADD('day', row_number() over (order by 1), '2020-08-30')::date date
52 | FROM TABLE(GENERATOR(ROWCOUNT => 10));
53 | ```
54 | | DATE |
55 | | ---------- |
56 | | 2020-08-31 |
57 | | 2020-09-01 |
58 | | 2020-09-02 |
59 | | 2020-09-03 |
60 | | 2020-09-04 |
61 | | 2020-09-05 |
62 | | 2020-09-06 |
63 | | 2020-09-07 |
64 | | 2020-09-08 |
65 | | 2020-09-09 |
66 |
67 | ## ARRAY_AGG
68 | Aggregate a column into a single row element.
69 |
70 | ```sql
71 | SELECT ARRAY_AGG(num)
72 | FROM (SELECT 1 num UNION ALL SELECT 2 num UNION ALL SELECT 3 num)
73 | ```
74 | | ARRAY_AGG(NUM) |
75 | | -------------- |
76 | | [1,2,3] |
77 |
--------------------------------------------------------------------------------
/snowflake/geographical-distance.md:
--------------------------------------------------------------------------------
1 | # Distance between longitude/latitude pairs
2 |
3 | Explore this snippet [here](https://count.co/n/nqZbPem1fKi?vm=e).
4 |
5 | # Description
6 |
7 | It is a surprisingly difficult problem to calculate an accurate distance between two points on the Earth, both because of the spherical geometry and the uneven shape of the ground.
8 | Fortunately Snowflake has robust support for geographical primitives - the [`ST_DISTANCE`](https://docs.snowflake.com/en/sql-reference/functions/st_distance.html) function can calculate the distance in metres between two [`ST_POINT`](https://docs.snowflake.com/en/sql-reference/functions/st_makepoint.html)s.
9 |
10 |
11 | ```sql
12 | with points as (
13 | -- Longitudes and latitudes are in degrees
14 | select 10 as from_long, 12 as from_lat, 15 as to_long, 17 as to_lat
15 | )
16 |
17 | select
18 | st_distance(
19 | st_makepoint(from_long, from_lat),
20 | st_makepoint(to_long, to_lat)
21 | ) as dist_in_metres
22 | from points
23 | ```
24 |
25 | | DIST_IN_METRES |
26 | | -------------- |
27 | | 773697.017019 |
--------------------------------------------------------------------------------
/snowflake/heatmap.md:
--------------------------------------------------------------------------------
1 | # Heat maps
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/Gw3T9qTRRe7?vm=e).
4 |
5 | # Description
6 |
7 | Heatmaps are a great way to visualize the relationship between two variables enumerated across all possible combinations of each variable. To create one, your data must contain
8 | - 2 categorical columns (e.g. not numeric columns)
9 | - 1 numerical column
10 |
11 | ```sql
12 | SELECT
13 |   AGG_FN(<COLUMN>) as metric,
14 |   <CAT_COLUMN1> as cat1,
15 |   <CAT_COLUMN2> as cat2
16 | FROM
17 |   <TABLE>
18 | GROUP BY
19 | cat1, cat2
20 | ```
21 | where:
22 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc.
23 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column.
24 | - `CAT_COLUMN1` and `CAT_COLUMN2` are the groups you want to compare. One will go on the x axis and the other on the y-axis. Make sure these are categorical or temporal columns and not numerical fields.
25 |
26 | # Usage
27 |
28 | In this example with some Spotify data, we'll see the average daily streams across different days of the week and months of the year.
29 |
30 | ```sql
31 | -- a
32 | select sum(streams) daily_streams, day from PUBLIC.SPOTIFY_DAILY_TRACKS group by day
33 | ```
34 |
35 | | daily_streams | day |
36 | | ------------- | ---------- |
37 | | 234109876 | 2020-10-25 |
38 | | 243004361 | 2020-10-26 |
39 | | 248233045 | 2020-10-27 |
40 |
41 | ```sql
42 | select
43 | avg(daily_streams) avg_daily_streams,
44 | month(day) month,
45 | dayofweek(day) day_of_week
46 | from "a"
47 | group by
48 | day_of_week,
49 | month
50 | ```
51 |
52 |
--------------------------------------------------------------------------------
/snowflake/histogram-bins.md:
--------------------------------------------------------------------------------
1 | # Histogram Bins
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/S6nEvljt64H?vm=e).
4 |
5 | # Description
6 |
7 | Creating histograms using SQL can be tricky. This snippet will let you quickly create the bins needed to create a histogram:
8 |
9 | ```sql
10 | SELECT
11 |     COUNT(1) / (SELECT COUNT(1) FROM <TABLE>) percentage_of_results,
12 |     FLOOR(<COLUMN> / <BIN_SIZE>) * <BIN_SIZE> as bin
13 | FROM
14 |     <TABLE>
15 | GROUP BY
16 | bin
17 | ```
18 | where:
19 | - `<COLUMN>` is the column to turn into a histogram
20 | - `<BIN_SIZE>` is the width of each bin. You can adjust this to get the desired level of detail for the histogram.
21 |
22 | # Usage
23 |
24 | To use, adjust the bin-size until you can see the 'shape' of the data. The following example finds the percentage of days in London by 5-degree temperature groupings.
25 |
26 | ```sql
27 | SELECT
28 | count(*) / (select count(1) from PUBLIC.LONDON_WEATHER) percentage_of_results,
29 | floor(LONDON_WEATHER.TEMP / 5) * 5 bin
30 | from PUBLIC.LONDON_WEATHER
31 | group by bin
32 | order by bin
33 | ```
34 | | PERCENTAGE_OF_RESULTS | BIN |
35 | | --------------------- | --- |
36 | | 0.001 | 20 |
37 | | 0.001 | 25 |
38 | | ... | ... |
39 | | 0.014 | 75 |
40 | | 0.001 | 80 |
41 |
--------------------------------------------------------------------------------
/snowflake/horizontal-bar.md:
--------------------------------------------------------------------------------
1 | # Horizontal bar chart
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/y9YkufdWrwf?vm=e).
4 |
5 | # Description
6 |
7 | Horizontal Bar charts are ideal for comparing a particular metric across a group of values. They have bars along the x-axis and group names along the y-axis.
8 | You just need a simple GROUP BY to get all the elements you need for a horizontal bar chart:
9 |
10 | ```sql
11 | SELECT
12 |   AGG_FN(<COLUMN>) as metric,
13 |   <GROUP_COLUMN> as group
14 | FROM
15 |   <TABLE>
16 | GROUP BY
17 | group
18 | ```
19 | where:
20 | - `AGG_FN` is an aggregation function like `SUM`, `AVG`, `COUNT`, `MAX`, etc.
21 | - `COLUMN` is the column you want to aggregate to get your metric. Make sure this is a numeric column.
22 | - `GROUP_COLUMN` is the group you want to show on the y-axis. Make sure this is a categorical column (not a number)
23 |
24 | # Usage
25 |
26 | In this example with some Spotify data, we'll compare the total streams by artist:
27 |
28 | ```sql
29 | select
30 | sum(streams) streams,
31 | artist
32 | from PUBLIC.SPOTIFY_DAILY_STREAMS
33 | group by artist
34 | ```
35 |
36 |
--------------------------------------------------------------------------------
/snowflake/last-array-element.md:
--------------------------------------------------------------------------------
1 | # Get the last element of an array
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/rybQwRV3JJh?vm=e).
4 |
5 | # Description
6 |
7 | You can use [ARRAY_SLICE](https://docs.snowflake.com/en/sql-reference/functions/array_slice.html) and [ARRAY_SIZE](https://docs.snowflake.com/en/sql-reference/functions/array_size.html) to find the last element in an array in Snowflake:
8 |
9 | ```sql
10 | select
11 | array_slice(nums,-1,array_size(nums)) last_element
12 | from
13 | (
14 | select array_construct(1, 2, 3, 4) as nums
15 | )
16 | ```
17 | | LAST_ELEMENT |
18 | | ------------ |
19 | | [4] |
20 |
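21 | Note that ARRAY_SLICE returns a one-element array (`[4]`) rather than the value itself. If you want the bare element, combining [GET](https://docs.snowflake.com/en/sql-reference/functions/get.html) with ARRAY_SIZE (the same pattern used in the parse-json snippet) should return `4` instead:
22 | 
23 | ```sql
24 | select
25 |   get(nums, array_size(nums) - 1) last_element -- zero-based index of the final element
26 | from
27 |   (
28 |     select array_construct(1, 2, 3, 4) as nums
29 |   )
30 | ```
31 | 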
--------------------------------------------------------------------------------
/snowflake/moving-average.md:
--------------------------------------------------------------------------------
1 | # Moving average
2 |
3 | Explore this snippet with some demo data [here](https://count.co/n/vwTzk68QCZ7?vm=e).
4 |
5 | # Description
6 | Moving averages are relatively simple to calculate in Snowflake, using the `avg` window function. The template for the query is
7 |
8 | ```sql
9 | select
10 |   avg(value) over (partition by fields order by ordering asc rows between x preceding and current row) mov_av,
11 |   fields,
12 |   ordering
13 | from table
14 | ```
15 | where
16 | - `value` - this is the numeric quantity to calculate a moving average for
17 | - `fields` - these are zero or more columns to split the moving averages by
18 | - `ordering` - this is the column which determines the order of the moving average, most commonly temporal
19 | - `x` - how many preceding rows to include in the moving average (the window covers these `x` rows plus the current row)
20 | - `table` - where to pull these columns from
21 |
22 | # Example
23 |
24 | Using total Spotify streams as an example data source, let's identify:
25 | - `value` - this is `streams`
26 | - `fields` - this is just `artist`
27 | - `ordering` - this is the `day` column
28 | - `x` - choose 7 for a roughly weekly average
29 | - `table` - this is called `raw`
30 |
31 | then the query is:
32 |
33 | ```sql
34 | -- RAW
35 | select day, sum(streams) streams, artist from PUBLIC.SPOTIFY_DAILY_TRACKS group by 1, 3
36 | ```
37 | | DAY | STREAMS | ARTIST |
38 | |----------|---------|---------|
39 | |2017-06-01| 509093 | 2 Chainz|
40 | |2017-06-03| 562412 | 2 Chainz|
41 | |2017-06-04| 480350 | 2 Chainz|
42 |
43 | ```sql
44 | select
45 | avg(streams) over (partition by artist order by day asc rows between 7 preceding and current row) mov_av,
46 | artist,
47 | day,
48 | streams
49 | from RAW
50 | ```
51 | | MOV_AV | DAY | STREAMS | ARTIST |
52 | |----------|------------|---------|----------|
53 | | 509093 | 2017-06-01 | 509093 | 2 Chainz |
54 | | 535752.5 | 2017-06-03 | 562412 | 2 Chainz |
55 | | 517285 | 2017-06-04 | 480350 | 2 Chainz |
56 |
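57 | Note that a frame of `7 preceding and current row` spans eight rows in total. For an exact seven-row window, shrink the frame to 6 preceding rows; a sketch using the same RAW cell:
58 | 
59 | ```sql
60 | select
61 |   avg(streams) over (partition by artist order by day asc rows between 6 preceding and current row) mov_av,
62 |   artist,
63 |   day,
64 |   streams
65 | from RAW
66 | ```
67 | 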
--------------------------------------------------------------------------------
/snowflake/parse-json.md:
--------------------------------------------------------------------------------
1 | # Parsing JSON Strings
2 |
3 |
4 | # Description
5 | Snowflake has several helpful features for dealing with JSON objects. This snippet goes through a few ways to access elements in a JSON object:
6 |
7 | ```sql
8 | SELECT
9 |   json_column:level1.level2, -- select json element
10 |   json_column:"level1"."level2", -- select json element if element name has special characters or spaces
11 |   json_column:level1::type, -- cast returned object to explicit type
12 |   json_column:array_name[array_index], -- access one instance of array
13 |   get(json_column:array_name, array_size(json_column:array_name)-1) -- get last element in array
14 | FROM
15 |   table
16 | ```
17 | where:
18 | - `json_column` is the column in `table` with the JSON object
19 | - `level1`, `level2`, etc. are the levels of nesting in the JSON object
20 | - `table` is the table with the JSON data.
21 | # Example
22 |
23 | ```sql
24 | with car_sales
25 | as(
26 | select parse_json(column1) as src
27 | from values
28 | ('{
29 | "date" : "2017-04-28",
30 | "dealership" : "Valley View Auto Sales",
31 | "salesperson" : {
32 | "id": "55",
33 | "name": "Frank Beasley"
34 | },
35 | "customer" : [
36 | {"name": "Joyce Ridgely", "phone": "16504378889", "address": "San Francisco, CA"}
37 | ],
38 | "vehicle" : [
39 | {"make": "Honda", "model": "Civic", "year": "2017", "price": "20275", "extras":["ext warranty", "paint protection"]}
40 | ],
41 | "values":[1,2,3]
42 | }'),
43 | ('{
44 | "date" : "2017-04-28",
45 | "dealership" : "Tindel Toyota",
46 | "salesperson" : {
47 | "id": "274",
48 | "name": "Greg Northrup"
49 | },
50 | "customer" : [
51 | {"name": "Bradley Greenbloom", "phone": "12127593751", "address": "New York, NY"}
52 | ],
53 | "vehicle" : [
54 | {"make": "Toyota", "model": "Camry", "year": "2017", "price": "23500", "extras":["ext warranty", "rust proofing", "fabric protection"]}
55 | ],
56 | "values":[3,2,1]
57 | }') v)
58 |
59 | select
60 |   src:date::date as date, -- select json element and cast to a date
61 |   src:"customer", -- for element names with special characters or spaces, escape with quotes
62 | src:vehicle[0].make::string as make, -- select first item in list
63 | get(src:values,array_size(src:values)-1) last_element -- last value in list
64 |
65 | from car_sales;
66 | ```
67 | |DATE|SRC:"CUSTOMER"|MAKE|LAST_ELEMENT|
68 | |----|------------|----|------|
69 | |28/04/2017|[{"address":"San Francisco, CA","name":"Joyce Ridgely","phone":"16504378889"}]|HONDA|3|
70 | |28/04/2017|[{"address":"New York, NY","name":"Bradley Greenbloom","phone":"12127593751"}]|TOYOTA|1|
71 |
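72 | The same path syntax reaches further into nested objects. For instance, building on the car_sales CTE above, a sketch pulling out the nested salesperson name and the first "extras" entry:
73 | 
74 | ```sql
75 | select
76 |   src:salesperson.name::string as salesperson_name, -- element nested one level down, cast to a string
77 |   src:vehicle[0].extras[0]::string as first_extra   -- array element inside an object inside an array
78 | from car_sales;
79 | ```
80 | 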
--------------------------------------------------------------------------------
/snowflake/parse-url.md:
--------------------------------------------------------------------------------
1 | # Parse URL String
2 |
3 | Explore the snippet with some demo data [here](https://count.co/n/LM876dhSHtU?vm=e).
4 |
5 | # Description
6 |
7 | URL strings are a very valuable source of information, but trying to get your regex exactly right is a pain.
8 | This re-usable snippet can be applied to any URL to extract the exact part you're after.
9 | Helpfully, Snowflake has a [PARSE_URL](https://docs.snowflake.com/en/sql-reference/functions/parse_url.html) function that returns a JSON object of URL components.
10 |
11 | ```sql
12 | SELECT KEY,VALUE
13 | FROM
14 | (
15 | SELECT parsed from
16 | (
17 | SELECT
18 | parse_url(url) as parsed
19 | FROM my_url
20 | )
21 | ),
22 | LATERAL FLATTEN(input=> parsed)
23 | ```
24 | where:
25 | - `url` is your URL string (e.g. `"https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#Ref1"`)
26 |
27 |
28 | # Usage
29 |
30 | To use, pick out whichever component you need from the parsed object, or capture them all as in the example below:
31 |
32 | ```sql
33 | WITH my_url as (select 'https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#newsfeed' as url)
34 |
35 | select KEY,VALUE FROM
36 | (
37 | SELECT parsed from
38 | (
39 | select
40 | parse_url(url) as parsed
41 | FROM my_url
42 | )
43 | ),
44 | LATERAL FLATTEN(input => parsed)
45 | ```
46 | | KEY | VALUE |
47 | |------------| ----------------------------------- |
48 | | fragment | "newsfeed" |
49 | | host | "www.yoursite.com" |
50 | | parameters | {"myparam1":"123","myparam2":"abc"} |
51 | | port | (Empty) |
52 | | query | "myparam1=123&myparam2=abc" |
53 | | scheme | "https" |
54 |
55 | # References
56 | - [The Anatomy of a Full Path URL](https://zvelo.com/anatomy-of-full-path-url-hostname-protocol-path-more/)
57 |
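58 | If you only need a single component rather than the full key/value list, you can skip the FLATTEN and read it straight out of the parsed object, reusing the access pattern from the parse-json snippet. A sketch with the same my_url CTE (`host` is one of the keys shown in the output above):
59 | 
60 | ```sql
61 | WITH my_url as (select 'https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#newsfeed' as url)
62 | 
63 | select parsed:host::string as host
64 | from (select parse_url(url) as parsed from my_url)
65 | ```
66 | 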
--------------------------------------------------------------------------------
/snowflake/pivot.md:
--------------------------------------------------------------------------------
1 | # Pivot Tables in Snowflake
2 |
3 |
4 | # Description
5 | We're all familiar with the power of Pivot tables, but building them in SQL tends to be tricky.
6 | Snowflake has some bespoke features that make this a little easier though. In this context:
7 | > Pivoting a table means turning rows into columns
8 |
9 | To pivot in Snowflake, you can use the [PIVOT](https://docs.snowflake.com/en/sql-reference/constructs/pivot.html) keyword:
10 |
11 | ```sql
12 | SELECT
13 |   COLUMN_LIST
14 | FROM
15 |   TABLE PIVOT ( AGG_FN( VALUE_COLUMN )
16 |     FOR PIVOT_COLUMN