:toc:
:toclevels: 4
:toc-placement!:

= SQL Conventions

image::https://img.shields.io/badge/Slack-Join%20our%20tech%20community-17202A?logo=slack[link=https://join.slack.com/t/fgribreau/shared_invite/zt-edpjwt2t-Zh39mDUMNQ0QOr9qOj~jrg]

==== The only SQL convention you will ever need.

These conventions are used at @cloud_iam_com, @Netwo, @OuestFrance, @hook0, @iAdvize, @Bringr, @Redsmin and @Oxmoto.

toc::[]

=== Data layer

* For SQL use https://www.postgresql.org[PostgreSQL], it's the
https://insights.stackoverflow.com/survey/2018/#technology-most-loved-dreaded-and-wanted-databases[most
loved relational database (StackOverflow survey 2018)] and it's a
multi-model database (K/V store, document store (use `jsonb`), foreign
data wrappers, and much more). Any questions?

=== Application layer

* If your API mainly does data persistence, https://postgrest.com[PostgREST]
is the way to go; implement only the missing parts in another process and
compose both APIs behind a reverse proxy.
* Otherwise, use a data-mapping library
(e.g. https://github.com/tpolecat/doobie[doobie]), not an ORM.

==== Queries

* Don't use BETWEEN
(https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_BETWEEN_.28especially_with_timestamps.29[why])

* Prefer `=` to `LIKE`

____
LIKE compares characters, and can be paired with wildcard operators like %, whereas the = operator compares strings and numbers for exact matches. The = can take advantage of indexed columns. (https://www.metabase.com/learn/building-analytics/sql-templates/sql-best-practices[source])
____

* Prefer `EXISTS` to `IN` (see the sketch below)

____
If you just need to verify the existence of a value in a table, prefer EXISTS to IN, as the EXISTS process exits as soon as it finds the search value, whereas IN will scan the entire table. IN should be used for finding values in lists.
Similarly, prefer NOT EXISTS to NOT IN. (https://www.metabase.com/learn/building-analytics/sql-templates/sql-best-practices[source])
____
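
Below is a minimal sketch of the `EXISTS` rewrite, using hypothetical `app.customer` and `app.purchase` tables (the `app` schema and table names are illustrative only; `customer__id` follows the FK naming convention described later in this document):

[source,sql]
----
-- IN: the sub-select is evaluated, then every outer row is checked against the resulting list
select c.*
from app.customer c
where c.customer__id in (select p.customer__id from app.purchase p);

-- EXISTS: the correlated sub-query can stop at the first matching purchase for each customer
select c.*
from app.customer c
where exists (
  select 1 from app.purchase p where p.customer__id = c.customer__id
);
----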

=== DDL - Data Definition Language

* `SET search_path=pg_catalog` to force every object declaration (besides triggers) to explicitly specify its schema name. This lowers the number of bugs and gives developers a better understanding, because https://getnobullshit.com/[explicit > implicit].

=== Tables/Views

==== Table/View names

* *singular* (e.g. `+team+` not `+teams+`) (https://launchbylunch.com/posts/2014/Feb/16/sql-naming-conventions/#singular-relations[Here is why])
* *snake_case* (e.g. `block_theme_version`)
* *double underscore* for `+n-n+` tables (e.g. `user__organization`)

==== Columns

* *snake_case* (for example `+created_at+`, not `+createdAt+` or `CreatedAt`), because in PostgreSQL https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS[keywords and unquoted identifiers are case-insensitive], which is the source of many mistakes.
* *double underscore* for PK and FK columns (e.g. (PK) `+user__id+`, (FK) `+organization__id+`)
** why?
*** leverage `using(column__id)`
*** easier to spot the table-name part of a PK/FK column (the part before `__`) in snake_case columns
*** quoted column names are case-sensitive in PostgreSQL while unquoted identifiers in SQL queries are not

* *`NOT NULL` by default*, NULL is the exception (think of it as the https://github.com/chrissrogers/maybe#why[Maybe monad])
* *No abbreviations* unless they are both well-known and very long, like `i18n`
* *No reserved keywords* (https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html[Complete list])
* *Use UUID* as PK and FK (https://www.clever-cloud.com/blog/engineering/2015/05/20/why-auto-increment-is-a-terrible-idea/[Here is why]), (https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_serial[do not use `serial`]) and rely on `gen_random_uuid()` (https://shusson.info/post/benchmark-v4-uuid-generation-in-postgres[benchmark])
* Note that when you use the native Postgres UUID v4 type instead of `bigserial`, table size grows by 25% and the insert rate drops to 25%.
* If you choose `bigserial` then distinguish internal and external ids (e.g. GitLab's internal schema design uses an `iid` column for ids that are publicly shared with the end user). Don't forget to add an index: `CREATE UNIQUE INDEX index_issues_on_project_id_and_iid ON public.issues USING btree (project_id, iid);`
* Use `text` or `citext` (variable unlimited length) with a check constraint instead of `varchar(n)` or `char(n)`.
* The `text` type with a CHECK constraint lets you evolve the schema easily compared to `character varying` or `varchar(n)` when you have length checks. (https://shekhargulati.com/2022/07/08/my-notes-on-gitlabs-postgres-schema-design/[source])

===== Date time management

* Use `timestamptz` everywhere you need to store a date (e.g. `+created_at TIMESTAMPTZ DEFAULT now()+` (https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_timestamp_.28without_time_zone.29[Here is why])) and leverage the https://www.postgresql.org/docs/current/brin-intro.html[BRIN] index on it
* `+updated_at TIMESTAMPTZ DEFAULT now()+` unless you plan to leverage event-sourcing (https://www.morling.dev/blog/last-updated-columns-with-postgres/[learn more])
* `+deleted_at TIMESTAMPTZ DEFAULT NULL+`:
** unless you plan to leverage event-sourcing
** don't forget that unique constraints must take `+deleted_at+` into account: use a http://stackoverflow.com/questions/8289100/create-unique-constraint-with-null-columns/8289253#8289253[partial unique index] (see the sketch at the end of this section)
* Comment each column: explain your rationale and your decisions, in plain English
* Boolean columns must start with either `+is+` or `+has+`.
* https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_char.28n.29[Don't use char(n)]
https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_char.28n.29_even_for_fixed-length_identifiers[even for fixed-length identifiers]
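
Here is a minimal sketch of the column and soft-delete conventions above, on a hypothetical `app.account` table (the `app` schema, the table and the 320-character limit are assumptions, not part of this document): the partial unique index keeps `email` unique among live rows only, and a BRIN index covers `created_at`.

[source,sql]
----
create table app.account (
  account__id uuid not null default gen_random_uuid(),
  email       text not null check (length(email) <= 320),
  is_active   boolean not null default true,
  created_at  timestamptz not null default now(),
  updated_at  timestamptz not null default now(),
  deleted_at  timestamptz default null,
  constraint account_account__id_pk primary key (account__id)
);

comment on column app.account.deleted_at is 'Soft-delete marker; NULL means the row is live';

-- unique constraint with a nullable soft-delete column: only live rows must be unique
create unique index account_email_key on app.account (email) where deleted_at is null;

-- BRIN index: very small index, well suited to the naturally ordered created_at column
create index account_created_at_idx on app.account using brin (created_at);
----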

=== Constraints

The general rule is: `+{tablename}_{columnname(s)}_{suffix}+`
(e.g. `+table_name_column_name_a_pk+`) where the suffix is one of the
following:

* Primary key constraint: `+pk+`
* Foreign key: `+fk+`
* Unique constraint: `+key+`
* Check constraint: `+chk+`
* Exclusion constraint: `+exl+`
* Any other kind of index: `+idx+`

==== PK - Primary Key

* `+{table_name}_{column_name}_pk+` in case of a single-column PK
* `+{table_name}_{column_name1}_{column_name2}_{column_name3}_pk+` in case of
multiple columns as primary key (`+column_name1+`, `+column_name2+`,
`+column_name3+`)

==== FK - Foreign key

* `+{from_table_name}_{from_column_name}_{to_table_name}_{to_column_name}__fk+`
* Always specify `ON DELETE` and `ON UPDATE` in order to force *you* to think about the consequences of the reference

==== Unique

* `+{from_table_name}_{from_column_name}_key+` in case of a single-column unique
constraint
* `+{from_table_name}_{from_column_name1}_{from_column_name2}_{from_column_name3}__key+` in case of
multiple columns as unique (`+column_name1+`, `+column_name2+`,
`+column_name3+`)
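
A hedged sketch of these naming rules, using hypothetical `app.team` and `app.user__team` tables (schema and table names are illustrative only):

[source,sql]
----
create table app.team (
  team__id uuid not null default gen_random_uuid(),
  name     text not null,
  constraint team_team__id_pk primary key (team__id),
  constraint team_name_key    unique (name),
  constraint team_name_chk    check (length(name) <= 100)
);

-- n-n table: double underscore in the table name, composite PK
create table app.user__team (
  user__id uuid not null,
  team__id uuid not null,
  constraint user__team_user__id_team__id_pk primary key (user__id, team__id),
  constraint user__team_team__id_team_team__id__fk
    foreign key (team__id) references app.team (team__id)
    on delete cascade on update cascade
  -- the user__id foreign key would follow the same pattern
);
----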

=== Functions

==== Name

There are 3 types of functions: `+notify+` functions, `+private+`
functions and `+public+` functions.

* *notify*, format: notify[_schema_name_][_table_name_][_event_] (e.g. `+notify_authentication_user_created(user_id)+`): should only format the notification message and call pg_notify underneath (a sketch follows at the end of this section). Beware of the
http://stackoverflow.com/a/41059797/745121[8000 characters limit]: only
send metadata (ids), data should be fetched by workers through the API. If
you really wish to send data then
https://github.com/xstevens/pg_kafka[pg_kafka] might be a better
alternative.
* *private*, format: `+_function_name+` (e.g. `+_reset_failed_login+`): must never be exposed through the public
schema. Used mainly for consistency and business rules.
* *public*, format: `+function_name+` (e.g. `+log_in(email, password)+`): must be
exposed through the public schema.

==== Parameters

Every parameter name must end with `$`. This prevents any "Reference to XXX is ambiguous" issue.

===== Example

[source,sql]
----
create function lib_fsm.transition_create(
  from_state__id$ uuid,
  event$ varchar(30),
  to_state__id$ uuid,
  description$ text default null
)
----
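
Below is a hedged sketch of a notify function, assuming a hypothetical `authentication` schema and `account` table (schema, table and channel names are illustrative, not from this document). It follows the `$` parameter convention and only ships ids in the payload:

[source,sql]
----
create function authentication.notify_authentication_account_created(account__id$ uuid)
returns void as $$
begin
  -- only send metadata (ids) to stay far below the 8000-character pg_notify limit;
  -- workers fetch the full record through the API
  perform pg_notify(
    'authentication_account_created',
    json_build_object('account__id', account__id$)::text
  );
end;
$$ language plpgsql;
----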

=== Types

==== Enum types

Don't use enums, you will have issues over time because https://stackoverflow.com/a/25812436/745121[you cannot remove an element from an enum].
If your enums represent various states, leverage https://en.wikipedia.org/wiki/Finite-state_machine[a state machine]. Use a library like https://github.com/netwo-io/lib_fsm[lib_fsm].
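
When a full state machine is overkill, a dedicated reference table (also recommended in the Triggers section below) keeps the allowed values editable with plain INSERT/DELETE. A minimal sketch with hypothetical `app.task_status` and `app.task` tables:

[source,sql]
----
create table app.task_status (
  task_status__id text not null,
  constraint task_status_task_status__id_pk primary key (task_status__id)
);

insert into app.task_status (task_status__id) values ('todo'), ('doing'), ('done');

create table app.task (
  task__id        uuid not null default gen_random_uuid(),
  task_status__id text not null default 'todo',
  constraint task_task__id_pk primary key (task__id),
  constraint task_task_status__id_task_status_task_status__id__fk
    foreign key (task_status__id) references app.task_status (task_status__id)
    on delete restrict on update cascade
);
----

Unlike an enum, adding a status is an `INSERT` and removing one is a `DELETE` (after migrating the rows that still reference it).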

==== Boolean

Always use `true` and `false`, without single quotes.

The PostgreSQL documentation says that `TRUE` and `FALSE` should be preferred because they are more SQL compliant but hey, LET'S STOP YELLING WHEN WE WRITE SQL SHALL WE?


==== String

- Multi-line strings must be represented with dollar-quoting, e.g. `$_$my string$_$`


==== JSONB

- Prefer `+jsonb+` to `json` and SQL arrays: `jsonb` has better query performance and more efficient storage.

- A `metadata` jsonb column is a great way to let the end user attach arbitrary key-value data to your objects (e.g. https://documentation.hook0.com/docs/metadata, https://stripe.com/docs/api/metadata).

Metadata key-value pairs must be https://www.getnobullshit.com/tech-lead/tout-limiter-dans-lespace-et-dans-le-temps[limited in space]; you can use a trigger for that:

[source,sql]
----
CREATE OR REPLACE FUNCTION validate_metadata()
RETURNS TRIGGER AS $$
DECLARE
  key TEXT;
  value TEXT;
  keys INT;
BEGIN
  keys := 0;

  FOR key, value IN (SELECT * FROM jsonb_each_text(NEW.metadata))
  LOOP
    keys := keys + 1;

    IF length(key::text) > 40 OR length(value::text) > 500 THEN
      RAISE 'Key and value must be at most 40 and 500 characters long respectively.';
    END IF;

    IF keys > 50 THEN
      RAISE 'A maximum of 50 keys are allowed in the metadata.';
    END IF;
  END LOOP;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER validate_metadata_trigger
BEFORE INSERT OR UPDATE ON your_table
FOR EACH ROW EXECUTE FUNCTION validate_metadata();
----

=== Triggers

==== Name

(translation in progress)

==== Columns

* use BCNF (stricter than 3NF) (cf. normal forms)
* leverage `+using+`, so instead of:

[source,sql]
----
select from
  table_1
  inner join table_2
    on table_1.table_1_id =
       table_2.table_1_id
----

use:

[source,sql]
----
select from
  table_1
  inner join table_2
    using (table_1_id)
----

* don't use PostgreSQL enums, you will have issues when you need to https://stackoverflow.com/a/25812436/745121[remove some values over time]. Use a dedicated table instead (see the sketch in the Types section above).
* use the right PostgreSQL types:

....
inet (IP address)
timestamp with time zone
point (2D point)
tstzrange (time range)
interval (duration)
....

* constraints should live inside your database as much as possible:

[source,sql]
----
create table reservation(
  reservation_id uuid primary key,
  dates tstzrange not null,
  exclude using gist (dates with &&)
);
----

* use row-level security to control read/update/delete access on each table's rows (a sketch follows below)

(http://stackoverflow.com/questions/4107915/postgresql-default-constraint-names/4108266#4108266[source])
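
A minimal row-level security sketch on a hypothetical `app.document` table. It assumes the application sets a custom `app.current_user__id` setting on each connection or transaction (e.g. `SET LOCAL app.current_user__id = '...'`); none of these names come from this document:

[source,sql]
----
alter table app.document enable row level security;

-- only the owner of a row can read or modify it;
-- note that table owners and superusers bypass RLS unless FORCE ROW LEVEL SECURITY is set
create policy document_owner on app.document
  using (owner__id = current_setting('app.current_user__id', true)::uuid);
----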

=== Policies

==== Name

todo.

=== SQL Formatter

[source,bash]
----
docker run --rm --network=none guriandoro/sqlparse:0.3.1 "SELECT several, columns from a_table as a join another_table as b where a.id = 1;"
----

=== Configuration


==== `statement_timeout`

Since we do want to https://www.getnobullshit.com/[limit everything in space and time], configure `statement_timeout` per role to let your database abort any statement that takes more than the specified amount of time (in milliseconds).

[source,sql]
----
-- Limit SQL queries in time => improves overall reliability
-- https://www.postgresql.org/docs/current/runtime-config-client.html
-- PostgreSQL WILL ABORT any statement that takes more than the specified amount of time (in milliseconds)
-- If you do have an issue with that, please first (from first to last):
-- - .. check that your query is relying on indices (did you use EXPLAIN (ANALYZE, BUFFERS)?)
-- - .. consider materialized views
-- - .. ensure pg cache settings are OK
-- - .. ensure the disk is SSD and fast enough
-- - .. ensure the server has enough CPU & RAM
-- - .. check if it's for analytics purposes, if so then requesting a postgres replica might be a better idea
-- When all these points above have been evaluated *then* we can all talk about increasing the values below :)
alter role APP_ROLE_THAT_DOES_THE_QUERY set statement_timeout to '250ms';
----

== Things to monitor

- https://www.percona.com/blog/2020/05/29/removing-postgresql-bottlenecks-caused-by-high-traffic/[Removing PostgreSQL Bottlenecks Caused by High Traffic]

____
Your cache hit ratio tells you how often your data is served from in
memory vs. having to go to disk. Serving from memory vs. going to disk
will be orders of magnitude faster, thus the more you can keep in memory
the better. Of course you could provision an instance with as much
memory as you have data, but you don't necessarily have to. Instead
watching your cache hit ratio and ensuring it is at 99% is a good metric
for proper performance.
(https://www.citusdata.com/blog/2019/03/29/health-checks-for-your-postgres-database/[Source])
____

[source,sql]
----
SELECT
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit) as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
FROM
  pg_statio_user_tables;
----

____
Under the covers Postgres is essentially a giant append only log. When
you write data it appends to the log, when you update data it marks the
old record as invalid and writes a new one, when you delete data it just
marks it invalid. Later Postgres comes through and vacuums those dead
records (also known as tuples). All those unvacuumed dead tuples are
what is known as bloat. Bloat can slow down other writes and create
other issues. Paying attention to your bloat and when it is getting out
of hand can be key for tuning vacuum on your database.
(https://www.citusdata.com/blog/2019/03/29/health-checks-for-your-postgres-database/[Source])
____

[source,sql]
----
WITH constants AS (
  SELECT current_setting('block_size')::numeric AS bs, 23 AS hdr, 4 AS ma
), bloat_info AS (
  SELECT
    ma,bs,schemaname,tablename,
    (datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
    (maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
  FROM (
    SELECT
      schemaname, tablename, hdr, ma, bs,
      SUM((1-null_frac)*avg_width) AS datawidth,
      MAX(null_frac) AS maxfracsum,
      hdr+(
        SELECT 1+count(*)/8
        FROM pg_stats s2
        WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
      ) AS nullhdr
    FROM pg_stats s, constants
    GROUP BY 1,2,3,4,5
  ) AS foo
), table_bloat AS (
  SELECT
    schemaname, tablename, cc.relpages, bs,
    CEIL((cc.reltuples*((datahdr+ma-
      (CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::float)) AS otta
  FROM bloat_info
  JOIN pg_class cc ON cc.relname = bloat_info.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = bloat_info.schemaname AND nn.nspname <> 'information_schema'
), index_bloat AS (
  SELECT
    schemaname, tablename, bs,
    COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
    COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes all cols
  FROM bloat_info
  JOIN pg_class cc ON cc.relname = bloat_info.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = bloat_info.schemaname AND nn.nspname <> 'information_schema'
  JOIN pg_index i ON indrelid = cc.oid
  JOIN pg_class c2 ON c2.oid = i.indexrelid
)
SELECT
  type, schemaname, object_name, bloat, pg_size_pretty(raw_waste) as waste
FROM
(SELECT
  'table' as type,
  schemaname,
  tablename as object_name,
  ROUND(CASE WHEN otta=0 THEN 0.0 ELSE table_bloat.relpages/otta::numeric END,1) AS bloat,
  CASE WHEN relpages < otta THEN '0' ELSE (bs*(table_bloat.relpages-otta)::bigint)::bigint END AS raw_waste
FROM
  table_bloat
UNION
SELECT
  'index' as type,
  schemaname,
  tablename || '::' || iname as object_name,
  ROUND(CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages/iotta::numeric END,1) AS bloat,
  CASE WHEN ipages < iotta THEN '0' ELSE (bs*(ipages-iotta))::bigint END AS raw_waste
FROM
  index_bloat) bloat_summary
ORDER BY raw_waste DESC, bloat DESC
----

____
Postgres makes it simple to query for unused indexes so you can easily
give yourself back some performance by removing them
(https://www.citusdata.com/blog/2019/03/29/health-checks-for-your-postgres-database/[Source])
____

[source,sql]
----
SELECT
  schemaname || '.' || relname AS table,
  indexrelname AS index,
  pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
  idx_scan as index_scans
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE NOT indisunique AND idx_scan < 50 AND pg_relation_size(relid) > 5 * 8192
ORDER BY pg_relation_size(i.indexrelid) / nullif(idx_scan, 0) DESC NULLS FIRST,
pg_relation_size(i.indexrelid) DESC;
----

____
pg_stat_statements is useful for monitoring your database query
performance. It records a lot of valuable stats about which queries are
run, how fast they return, how many times they're run, etc. Checking in on
this set of queries regularly can tell you where is best to add indexes
or optimize your application so your query calls may not be so
excessive.
(https://www.citusdata.com/blog/2019/03/29/health-checks-for-your-postgres-database/[Source])
____

[source,sql]
----
SELECT query,
       calls,
       total_time,
       total_time / calls as time_per,
       stddev_time,
       rows,
       rows / calls as rows_per,
       100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
WHERE query not similar to '%pg_%'
and calls > 500
--ORDER BY calls
--ORDER BY total_time
order by time_per
--ORDER BY rows_per
DESC LIMIT 20;
----

== Schema design

* https://github.com/FGRibreau/stripe-schema[Stripe's own schema]

== Tools

* https://www.postgresql.org/docs/9.4/pgstatstatements.html[pg_stat_statements]
* https://github.com/darold/pgbadger[A fast PostgreSQL Log Analyzer]
* https://pganalyze.com[PostgreSQL Performance Monitoring]

== Migrations

- https://pythonspeed.com/articles/schema-migrations-server-startup/[How to do Zero-downtime migrations]
- https://medium.com/braintree-product-technology/postgresql-at-scale-database-schema-changes-without-downtime-20d3749ed680[Zero-downtime migrations best practices]

== Good practices

* https://hakibenita.com/sql-dos-and-donts[12 Common Mistakes and Missed Optimization Opportunities in SQL]
* https://pythonspeed.com/articles/schema-migrations-server-startup/[Don't apply migrations on application startup]

== Managed PostgreSQL Databases

* Google Cloud PostgreSQL
** Pros
** Cons
*** No support for plv8
*** Any features that require `superuser` privileges are not supported
*** `postgres` role is not a `superuser`
**** Can create roles
**** Cannot select from tables that are restricted by default like `pg_shadow`
**** Thus cannot edit `pg_catalog.pg_class` (in order to change row-level security activation for example)
**** Can read from all necessary tables other than `pg_authid`
* Scaleway Managed PostgreSQL
** Pros
*** multi-schema support
*** configuration options are editable
*** user/role management is self-service
** Cons
*** /
* OVH Cloud SQL
** Pros
*** /
** Cons
*** no multi-schema support