├── .gitignore ├── 00-workshop-contents.Rmd ├── 01-intro-to-data-validation.Rmd ├── 02-scan-your-data.Rmd ├── 03-expect-test-functions.Rmd ├── 04-scaling-up-data-validation.Rmd ├── 05-intro-to-data-documentation.Rmd ├── 06-getting-deeper-into-documenting-data.Rmd ├── LICENSE ├── README.md ├── game_revenue-validation.R ├── informant-penguins.html ├── pointblank-workshop.Rproj ├── save_multiple_agents_to_disk.R ├── small_table_tests ├── agent-small_table_2022-10-13 ├── agent-small_table_2022-10-14 ├── agent-small_table_2022-10-15 ├── agent-small_table_2022-10-16 └── agent-small_table_2022-10-17 ├── storms-validation.R ├── tbl_scan-storms.html └── test-game_revenue.R /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | .DS_Store 6 | -------------------------------------------------------------------------------- /00-workshop-contents.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Workshop Contents" 3 | output: html_document 4 | --- 5 | 6 | This **pointblank** workshop will teach you *a lot* about what **pointblank** can do, and, it'll give you an opportunity to experiment with the package. The workshop modules will introduce you to a large variety of examples to get you well-acquainted with the basic functionality of the package. 
7 | 8 | Each module of the workshop focuses on a different subset of functions and they are all presented here as **R Markdown** (.Rmd) files, with one file for each workshop module: 9 | 10 | - `"01-intro-to-data-validation.Rmd"` (The `agent`, validation fns, interrogation/reports) 11 | 12 | - `"02-scan-your-data.Rmd"` (Looking at your data with `scan_data()`) 13 | 14 | - `"03-expect-test-functions.Rmd"` (Using the `expect_*()` and `test_*()` functions) 15 | 16 | - `"04-scaling-up-data-validation.Rmd"` (The `multiagent` and its reporting structures) 17 | 18 | - `"05-intro-to-data-documentation.Rmd"` (The `informant` and describing your data) 19 | 20 | - `"06-getting-deeper-into-documenting-data.Rmd"` (Using snippets and text tricks) 21 | 22 | You can navigate to any of these and modify the code within the self-contained **R Markdown** code chunks. Entire **R Markdown** files can be knit to HTML, where a separate window will show the rendered document. 23 | 24 | ### The **pointblank** Installation 25 | 26 | Normally you would install **pointblank** on your system by using `install.packages()`: 27 | 28 | ```{r eval=FALSE} 29 | # install.packages("pointblank") 30 | ``` 31 | 32 | For this workshop, however, we are going to use the development version of **pointblank** and install it from GitHub with `devtools::install_github()`. 
33 | 34 | ```{r eval=FALSE} 35 | # devtools::install_github("rich-iannone/pointblank") 36 | ``` 37 | -------------------------------------------------------------------------------- /01-intro-to-data-validation.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Introduction to Data Validation, pointblank Style" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | library(pointblank) 9 | library(tidyverse) 10 | library(blastula) 11 | library(palmerpenguins) 12 | ``` 13 | 14 | ## Intro 15 | 16 | A common workflow for data validation in **pointblank** involves three basic components: 17 | 18 | - the creation of an 'agent' (this is the main data collection and reporting object) 19 | - the declaration of validation steps using validation functions (as many as you need) 20 | - the interrogation of the data (here the agent finally carries out the validation tasks) 21 | 22 | We always start with `create_agent()` and define how the data can be reached, also providing some basic rules about how an interrogation of that data should eventually be carried out. While we are giving the agent some default behavior, we can override some of this on a step-by-step basis when declaring our validation steps. We always end with `interrogate()` and that function carries out the work of validating the data and generating the all-important reporting. To sum up, this is the construction: 23 | 24 | ```r 25 | agent <- 26 | create_agent(...) %>% 27 | << validation functions >> %>% 28 | interrogate() 29 | ``` 30 | 31 | ### A simple data validation on a small dataset called `small_table` 32 | 33 | The package contains a few datasets. A really small one for experimentation is called `small_table`: 34 | 35 | ```{r paged.print=FALSE} 36 | pointblank::small_table 37 | ``` 38 | 39 | We're going to break the validation process into steps.
First, let's create an `agent`, give it the `small_table`, and look at the report. 40 | 41 | ```{r} 42 | # Create the agent with `create_agent()`; the `tbl` is given to the agent 43 | agent_1 <- 44 | create_agent( 45 | tbl = small_table, 46 | tbl_name = "small_table", 47 | label = "Workshop agent No. 1" 48 | ) 49 | 50 | # Printing the `agent` will print the report with the default options 51 | agent_1 52 | ``` 53 | 54 | Okay. Let's provide a few validation functions. 55 | 56 | ```{r} 57 | agent_1 <- 58 | agent_1 %>% 59 | col_vals_gte(columns = d, value = 0) %>% 60 | col_vals_in_set(columns = f, set = c("low", "mid", "high")) %>% 61 | col_is_logical(columns = e) %>% 62 | col_is_numeric(columns = d) %>% 63 | col_is_character(columns = c(b, f)) %>% 64 | rows_distinct() 65 | 66 | agent_1 67 | ``` 68 | 69 | When looking at the report, we see that it contains the information about the validation steps but many of the table cells (to the right) have no entries. That area is the interrogation data, and, we haven't yet used the `interrogate()` function. Let's use it now: 70 | 71 | ```{r} 72 | agent_1 <- agent_1 %>% interrogate() 73 | 74 | agent_1 75 | ``` 76 | 77 | Now, we see a validation report we can use! Let's go over each of the columns and understand what they mean. 78 | 79 | - `STEP`: the name of the validation function used for the validation step and 80 | the step number. 81 | 82 | - `COLUMNS`: the names of the target columns used in the validation step (if applicable). 83 | 84 | - `VALUES`: the values used in the validation step, where applicable; these could be literal values, column names, an expression, etc. 85 | 86 | - `TBL`: indicates whether there were any changes to the target table just prior to interrogation. A rightward arrow from a small circle indicates that there was no mutation of the table. An arrow from a circle to a purple square indicates that 'preconditions' were used to modify the target table.
An arrow from a circle to a half-filled circle indicates that the target table has been 'segmented'. 87 | 88 | - `EVAL`: a symbol that denotes the success of interrogation evaluation for each step. A checkmark indicates no issues with evaluation. A warning sign indicates that a warning occurred during evaluation. An explosion symbol indicates that evaluation failed due to an error. Hover over the symbol for details on each condition. 89 | 90 | - `UNITS`: the total number of test units for the validation step (these are the atomic units of testing which depend on the type of validation). 91 | 92 | - `PASS`: on top is the absolute number of passing test units and below that is the fraction of passing test units over the total number of test units. 93 | 94 | - `FAIL`: on top is the absolute number of failing test units and below that is the fraction of failing test units over the total number of test units. 95 | 96 | - `W`, `S`, `N`: indicators that show whether the *warn*, *stop*, or *notify* states were entered; unset states appear as dashes, states that are set with thresholds appear as unfilled circles when not entered and filled when thresholds are exceeded (colors for `W`, `S`, and `N` are amber, red, and blue) 97 | 98 | - `EXT`: a column that provides buttons to download data extracts as CSV files for row-based validation steps having failing test units. Buttons only appear when there is data to collect. 99 | 100 | We see nothing in the `W`, `S`, and `N` columns. This is because we have to explicitly set thresholds for those to be active. We'll do that next... 101 | 102 | ### Data validation with threshold levels 103 | 104 | We often should think about what's tolerable in terms of data quality and implement that into our reporting. Let's set proportional failure thresholds to the `warn`, `stop`, and `notify` states using the `action_levels()` function. 105 | 106 | ```{r} 107 | # Create an `action_levels` object with the namesake function. 
108 | al <- 109 | action_levels( 110 | warn_at = 0.15, 111 | stop_at = 0.25, 112 | notify_at = 0.35 113 | ) 114 | 115 | # This can be printed for inspection 116 | al 117 | ``` 118 | 119 | We are using threshold fractions of test units (between `0` and `1`). For `0.15`, this means that if 15% of the test units are found to *fail* (i.e., don't meet the expectation), then the designated failure state is entered. 120 | 121 | Absolute values starting from `1` can be used instead, and this constitutes an absolute failure threshold (e.g., `10` means that if `10` of the test units are found to fail, the failure state is entered). 122 | 123 | What are test units? They make up the individual tests for a validation step. They will vary by the validation function used but, in simple terms, a validation function that validates values in a column will have the number of test units equal to the number of rows. A validation function that validates a column type will have exactly one test unit. This is always given in the `UNITS` column of the reporting table. 124 | 125 | Let’s use the `action_levels` object in a new validation process (based on the same `small_table` dataset). We'll make it so the validation functions used will result in more failing test units. 126 | 127 | ```{r} 128 | agent_2 <- 129 | create_agent( 130 | tbl = small_table, 131 | tbl_name = "small_table", 132 | label = "Workshop agent No.
2", 133 | actions = al 134 | ) %>% 135 | col_is_posix(columns = date_time) %>% 136 | col_vals_lt(columns = a, value = 7) %>% 137 | col_vals_regex(columns = b, regex = "^[0-9]-[a-w]{3}-[2-9]{3}$") %>% 138 | col_vals_between(columns = d, left = 0, right = 4000) %>% 139 | col_is_logical(columns = e) %>% 140 | col_is_character(columns = c(b, f)) %>% 141 | col_vals_lt(columns = d, value = 9600) %>% 142 | col_vals_in_set(columns = f, set = c("low", "mid")) %>% 143 | rows_distinct() %>% 144 | interrogate() 145 | 146 | agent_2 147 | ``` 148 | 149 | Some notes: 150 | 151 | - the thresholds for the `warn`, `stop`, and `notify` states are presented in the table header; these are defaults for every validation step 152 | - we now have some indicators of failure thresholds being met (look at the `W`, `S`, and `N` columns); steps `2`, `3`, `9`, and `10` have at least the `warn` condition 153 | - it's possible to have test unit failures but not enter a `warn` state (look at steps `4` and `8`); they still provide CSVs for failed rows but the `W` indicator circle isn't filled in 154 | 155 | How you set the default thresholds will depend on how strict your data quality requirements are. There might be certain validation steps where we'd like to be more stringent. For the next validation process we will apply the `action_levels()` function to individual steps, overriding the default setting. 156 | 157 | ```{r} 158 | agent_3 <- 159 | create_agent( 160 | tbl = small_table, 161 | tbl_name = "small_table", 162 | label = "Workshop agent No. 3", 163 | actions = al 164 | ) %>% 165 | col_is_posix(columns = date_time) %>% 166 | col_vals_lt(columns = a, value = 7) %>% 167 | col_vals_regex(columns = b, regex = "^[0-9]-[a-w]{3}-[2-9]{3}$") %>% 168 | col_vals_between( 169 | columns = d, 170 | left = 0, 171 | right = 4000, 172 | actions = action_levels( # Setting `actions` at the individual 173 | warn_at = 1, # validation step.
This time, using absolute 174 | stop_at = 3, # threshold values (i.e., a single test unit 175 | notify_at = 5 # failing triggers the `warn` state) 176 | ) 177 | ) %>% 178 | col_is_logical(columns = e) %>% 179 | col_is_character(columns = c(b, f)) %>% 180 | col_vals_lt(columns = d, value = 9600) %>% 181 | col_vals_in_set(columns = f, set = c("low", "mid")) %>% 182 | rows_distinct() %>% 183 | interrogate() 184 | 185 | agent_3 186 | ``` 187 | 188 | ### A look at the available validation functions 189 | 190 | There are 36 validation functions. Here they are: 191 | 192 | - `col_vals_lt()` 193 | - `col_vals_lte()` 194 | - `col_vals_equal()` 195 | - `col_vals_not_equal()` 196 | - `col_vals_gte()` 197 | - `col_vals_gt()` 198 | - `col_vals_between()` 199 | - `col_vals_not_between()` 200 | - `col_vals_in_set()` 201 | - `col_vals_not_in_set()` 202 | - `col_vals_make_set()` 203 | - `col_vals_make_subset()` 204 | - `col_vals_increasing()` 205 | - `col_vals_decreasing()` 206 | - `col_vals_null()` 207 | - `col_vals_not_null()` 208 | - `col_vals_regex()` 209 | - `col_vals_within_spec()` 210 | - `col_vals_expr()` 211 | - `rows_distinct()` 212 | - `rows_complete()` 213 | - `col_is_character()` 214 | - `col_is_numeric()` 215 | - `col_is_integer()` 216 | - `col_is_logical()` 217 | - `col_is_date()` 218 | - `col_is_posix()` 219 | - `col_is_factor()` 220 | - `col_exists()` 221 | - `col_schema_match()` 222 | - `row_count_match()` 223 | - `col_count_match()` 224 | - `tbl_match()` 225 | - `conjointly()` 226 | - `serially()` 227 | - `specially()` 228 | 229 | It's a lot to keep track of but they all try to use a consistent interface. Let's break this down. 230 | 231 | The `col_vals_*()` group will check individual cells within one or more columns. Aside from using an 'agent', we can use the validation functions *directly* on the data. It acts as a sort of validation 'filter': data will pass through unchanged if validation succeeds, or an error will be thrown if it doesn't.
Let's try that with `col_vals_between()`: 232 | 233 | ```{r paged.print=FALSE} 234 | small_table %>% col_vals_between(columns = a, left = 0, right = 10) 235 | ``` 236 | 237 | ```{r error=TRUE} 238 | small_table %>% col_vals_between(columns = a, left = 5, right = 10) 239 | ``` 240 | 241 | The `col_is_*()` group will check whether a column is of a certain type. Let's look at two cases: one passing and the other failing. 242 | 243 | ```{r paged.print=FALSE} 244 | small_table %>% col_is_character(columns = b) 245 | ``` 246 | 247 | ```{r error=TRUE} 248 | small_table %>% col_is_numeric(columns = date) 249 | ``` 250 | 251 | The two `rows_*()` functions (`rows_distinct()` and `rows_complete()`) will check entire rows (this can be narrowed down with the `columns` argument). Here are examples of both, with failing and then passing cases. 252 | 253 | `rows_distinct()`: 254 | 255 | ```{r error=TRUE} 256 | small_table %>% rows_distinct() 257 | ``` 258 | 259 | ```{r paged.print=FALSE} 260 | head(small_table) %>% rows_distinct() 261 | ``` 262 | 263 | `rows_complete()`: 264 | 265 | ```{r error=TRUE} 266 | small_table %>% rows_complete() 267 | ``` 268 | 269 | ```{r paged.print=FALSE} 270 | small_table %>% rows_complete(columns = vars(date_time, date, a, b)) 271 | ``` 272 | 273 | The `*_match()` functions validate whether some aspect of the table as a whole matches an expectation. 274 | 275 | - `col_schema_match()` - column schema matching 276 | - `row_count_match()` - tbl row count matching (with another tbl or fixed value) 277 | - `col_count_match()` - tbl col count matching (with another tbl or fixed value) 278 | - `tbl_match()` - does the target table match a comparison table? 
279 | 280 | Here are two (passing) examples: 281 | 282 | ```{r paged.print=FALSE} 283 | small_table %>% row_count_match(count = 13) 284 | ``` 285 | 286 | ```{r paged.print=FALSE} 287 | small_table %>% col_count_match(count = palmerpenguins::penguins) 288 | ``` 289 | 290 | ### Getting data extracts for failed rows from the 'agent' 291 | 292 | Those CSV buttons in the validation report are useful for sharing the report with others since they don't even need to know R to obtain those extracts. For the person familiar with R and **pointblank**, it is possible to get data frames for the failed rows (per validation step). 293 | 294 | We can use the `get_data_extracts()` function to obtain a list of data frames, or, use the `i` argument to get a data frame available for a specific step. Not all steps will have associated data frames. Also, not all validation functions will produce data frames here (they need to check values in columns). 295 | 296 | Let's use `get_data_extracts()` on `agent_3`. 297 | 298 | ```{r paged.print=FALSE} 299 | get_data_extracts(agent = agent_3) 300 | ``` 301 | 302 | The list components are named for the validation steps that have data extracts (i.e., filtered rows where test unit failures occurred). Let's get an individual data extract from step `9` (the `col_vals_in_set()` step, which looked at column `f`): 303 | 304 | ```{r paged.print=FALSE} 305 | get_data_extracts(agent = agent_3, i = 9) 306 | ``` 307 | 308 | ### Getting 'sundered' data back (either 'good' or 'bad' rows) 309 | 310 | Sometimes, if your methodology allows for it, you want to use the best part of the input data for something else. 
With `get_sundered_data()`, we provide an agent object that has interrogated the data, and what we get back could be: 311 | 312 | - the 'pass' data piece (rows with no failing test units across all row-based validation functions) 313 | - the 'fail' data piece (rows with at least one failing test unit across the same series of validations) 314 | - all the data with a new column that labels each row as passing or failing across validation steps (the labels can be customized). 315 | 316 | Let's make a new agent and validate `small_table` again. 317 | 318 | ```{r} 319 | agent_4 <- 320 | create_agent( 321 | tbl = small_table, 322 | tbl_name = "small_table", 323 | label = "Workshop agent No. 4" 324 | ) %>% 325 | col_vals_gt(columns = d, value = 1000) %>% 326 | col_vals_between( 327 | columns = c, 328 | left = vars(a), right = vars(d), # Using values in columns, not literal vals 329 | na_pass = TRUE 330 | ) %>% 331 | interrogate() 332 | 333 | agent_4 334 | ``` 335 | 336 | Get the sundered data piece that contains only rows that passed both validation steps (this is the default piece). This yields 5 of 13 total rows. 337 | 338 | ```{r paged.print=FALSE} 339 | agent_4 %>% get_sundered_data() 340 | ``` 341 | 342 | Get the complementary data piece: all of those rows that failed either of the two validation steps. This yields 8 of 13 total rows. 343 | 344 | ```{r paged.print=FALSE} 345 | agent_4 %>% get_sundered_data(type = "fail") 346 | ``` 347 | 348 | We can get all of the input data returned with a flag column (called `.pb_combined`). This is done by using `type = "combined"` and that rightmost column will contain `"pass"` and `"fail"` values.
349 | 350 | ```{r paged.print=FALSE} 351 | agent_4 %>% get_sundered_data(type = "combined") 352 | ``` 353 | 354 | The labels can be changed and this is flexible: 355 | 356 | ```{r paged.print=FALSE} 357 | agent_4 %>% get_sundered_data(type = "combined", pass_fail = c(TRUE, FALSE)) 358 | ``` 359 | 360 | ```{r paged.print=FALSE} 361 | agent_4 %>% get_sundered_data(type = "combined", pass_fail = 0:1) 362 | ``` 363 | 364 | ### Accessing the plan/interrogation data with `get_agent_x_list()` 365 | 366 | The agent's x-list is a record of information that the agent possesses at any given time. The x-list will contain the most complete information after an interrogation has taken place (before then, the data largely reflects the validation plan). 367 | 368 | The x-list can be constrained to a particular validation step (by supplying the step number to the `i` argument), or, we can get the information for all validation steps by leaving `i` unspecified. 369 | 370 | Let's obtain such a list from `agent_3`, which had 10 validation steps: 371 | 372 | ```{r paged.print=FALSE} 373 | # Generate the `x_list` object from `agent_3` 374 | x_list_3 <- agent_3 %>% get_agent_x_list() 375 | 376 | # Printing this gives us a console preview 377 | # of which components are available 378 | x_list_3 379 | ``` 380 | 381 | The amount of information contained in here is comprehensive (see `?get_agent_x_list` for a detailed breakdown) but we can provide a few examples. 382 | 383 | The number of test units in each validation step. 384 | 385 | ```{r} 386 | x_list_3$n 387 | ``` 388 | 389 | The number of *passing* test units in each validation step. 390 | 391 | ```{r} 392 | x_list_3$n_passed 393 | ``` 394 | 395 | The *fraction* of passing test units in each validation step. 396 | 397 | ```{r} 398 | x_list_3$f_passed 399 | ``` 400 | 401 | The `warn`, `stop`, and `notify` states. We can arrange that in a tibble and use the step numbers (`i`) as well. 
402 | 403 | ```{r paged.print=FALSE} 404 | dplyr::tibble( 405 | step = x_list_3$i, 406 | warn = x_list_3$warn, 407 | stop = x_list_3$stop, 408 | notify = x_list_3$notify 409 | ) 410 | ``` 411 | 412 | ### Emailing the interrogation report with `email_create()` 413 | 414 | We can choose to email the report if the `notify` state is entered. The message can be created with the agent through the `email_create()` function. Here's a useful bit of code that allows for conditional sending. 415 | 416 | ```{r eval=FALSE} 417 | 418 | if (any(x_list_3$notify)) { 419 | 420 | email_create(agent_3) %>% 421 | blastula::smtp_send( 422 | from = "sender@email.com", 423 | to = "recipient@email.com", 424 | credentials = creds_file(file = "email_secret") 425 | ) 426 | } 427 | ``` 428 | 429 | Such code might be useful during an automated process where data is periodically checked and failures beyond thresholds require notification. 430 | 431 | While `email_create()` will generate the email message body, functions in the **blastula** package are responsible for the sending of that email. For more information on sending HTML email, look at the help article found by using `?blastula::smtp_send`. 432 | 433 | ### Customizing the interrogation report with `get_agent_report()` 434 | 435 | We don't have to fully accept the defaults for a data validation report. Using `get_agent_report()` gives us options. 
436 | 437 | Here's how you can change the title: 438 | 439 | ```{r} 440 | agent_3 %>% get_agent_report(title = "The **3rd** Example") 441 | ``` 442 | 443 | You can bring the steps that had serious failures up to the top: 444 | 445 | ```{r} 446 | agent_3 %>% get_agent_report(arrange_by = "severity") 447 | ``` 448 | 449 | You can remove those steps that had no failures: 450 | 451 | ```{r} 452 | agent_3 %>% get_agent_report(arrange_by = "severity", keep = "fail_states") 453 | ``` 454 | 455 | You can change the language of the report: 456 | 457 | ```{r} 458 | agent_3 %>% get_agent_report(lang = "de") 459 | ``` 460 | 461 | ------ 462 | 463 | ### SUMMARY 464 | 465 | 1. Data validation in **pointblank** requires the creation of an agent, validation functions, and an interrogation. 466 | 2. The agent creates a report that tries to be informative and easily explainable. 467 | 3. We can set data quality thresholds with `action_levels()`; there can be default DQ thresholds and step-specific thresholds (in both cases, supplied to `actions`). 468 | 4. There are 36 validation functions (having a similar interface and many common arguments); they can be used with an agent or directly on the data. 469 | 5. We can get data extracts pertaining to failing test units in rows of the input dataset (with `get_data_extracts()`). 470 | 6. There is the option to obtain 'sundered' data, which is the input data split by whether cells contained failing test units (with `get_sundered_data()`) 471 | 7. A huge amount of validation data can be accessed with the `get_agent_x_list()` function (useful for programming with the validation results). 472 | 8. We can create an email message using a specialized version of the validation report with `email_create()`; this integrates with the **blastula** R package. 473 | 9. The report can be modified with `get_agent_report()`. 
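As a recap (this pipeline is not one of the original workshop chunks), the whole workflow from this module can be condensed into a single sketch that uses only the functions demonstrated above:

```r
library(pointblank)

# Default thresholds for every step
al <- action_levels(warn_at = 0.15, stop_at = 0.25, notify_at = 0.35)

# Plan the validation, then interrogate
agent <-
  create_agent(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Module recap",
    actions = al
  ) %>%
  col_vals_gte(columns = d, value = 0) %>%
  col_vals_in_set(columns = f, set = c("low", "mid", "high")) %>%
  rows_distinct() %>%
  interrogate()

# Post-interrogation access: extracts, sundered data, and the x-list
extracts <- get_data_extracts(agent)
passing  <- get_sundered_data(agent, type = "pass")
x_list   <- get_agent_x_list(agent)
x_list$f_passed  # fraction of passing test units per step
```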
474 | 475 | -------------------------------------------------------------------------------- /02-scan-your-data.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Scan Your Data" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | library(pointblank) 9 | library(tidyverse) 10 | library(palmerpenguins) 11 | library(safetyData) 12 | ``` 13 | 14 | ## Intro 15 | 16 | Sometimes you know nothing about a new dataset. The **pointblank** package is here to help and it has the `scan_data()` function. So simple, and it gives you so much information on a data table. The function generates an HTML report that scours the input table data. 17 | 18 | In the same spirit, generating validation steps can be laborious and difficult at first. There's a function available to kickstart that process: `draft_validation()`. It'll generate a new .R file with a suggested validation plan that's meant to work and is tweakable. 19 | 20 | ### Performing table scans with `scan_data()` 21 | 22 | The `scan_data()` function is available for providing an interactive overview of a tabular dataset. 
The reporting output contains several sections to make everything more digestible, and these are: 23 | 24 | - `Overview`: Shows table dimensions, duplicate row counts, column types, and reproducibility information 25 | - `Variables`: Provides a summary for each table variable and further statistics and summaries depending on the variable type 26 | - `Interactions`: Displays a matrix plot that describes the interactions between variables 27 | - `Correlations`: This is a set of correlation matrix plots for numerical variables 28 | - `Missing Values`: A summary figure that shows the degree of missingness across variables 29 | - `Sample`: A table that provides the head and tail rows of the dataset 30 | 31 | The output HTML report will appear in the RStudio Viewer and can also be integrated in R Markdown or Quarto HTML output. Here’s an example that uses the `penguins_raw` dataset from the **palmerpenguins** package. 32 | 33 | ```{r} 34 | scan_data(tbl = palmerpenguins::penguins_raw, navbar = FALSE) 35 | ``` 36 | 37 | As can be seen, the first two sections have a lot of additional information tucked behind detail views (with the `Toggle details` buttons) and within tab sets. Should this amount of information be a little overwhelming, there is the option to disable one or more sections. With `scan_data()`’s `sections` argument, you can specify just the sections that are needed for a specific scan. 38 | 39 | The default value for `sections` is the string `"OVICMS"` and each letter of that stands for the following sections in their default order: 40 | 41 | `"O"`: `"overview"` 42 | `"V"`: `"variables"` 43 | `"I"`: `"interactions"` 44 | `"C"`: `"correlations"` 45 | `"M"`: `"missing"` 46 | `"S"`: `"sample"` 47 | 48 | This string can contain fewer key characters and the order can be changed to suit the desired layout of the report.
For example, if you just need the Overview, a Sample, and the description of Variables in the target table, the string to use for `sections` would be `"OSV"`. 49 | 50 | The `tbl` supplied could be a data frame, tibble, a `tbl_dbi` object, or a `tbl_spark` object. Here are a few more datasets that could be scanned, this time using `sections = "OSV"`: 51 | 52 | ```{r eval=FALSE} 53 | scan_data(tbl = safetyData::adam_adae, sections = "OSV") 54 | ``` 55 | 56 | ```{r eval=FALSE} 57 | scan_data(tbl = safetyData::adam_advs, sections = "OSV") 58 | ``` 59 | 60 | The reporting generated by `scan_data()` can be presented in one of thirteen spoken languages: English (`"en"`, the default), French (`"fr"`), German (`"de"`), Italian (`"it"`), Spanish (`"es"`), Portuguese (`"pt"`), Turkish (`"tr"`), Chinese (`"zh"`), Russian (`"ru"`), Polish (`"pl"`), Danish (`"da"`), Swedish (`"sv"`), and Dutch (`"nl"`). These two-letter language codes can be supplied to the `lang` argument. 61 | 62 | Here's an example that scans **dplyr**'s `starwars` dataset and creates the report in Danish. 63 | 64 | ```{r} 65 | scan_data(tbl = dplyr::starwars, sections = "OVS", lang = "da") 66 | ``` 67 | 68 | It's possible to export this reporting to a self-contained HTML file. To do so, use the `export_report()` function (this also works for every other type of reporting you'll see in the Viewer). 69 | 70 | ```{r eval=FALSE} 71 | # Use `scan_data()` and assign reporting to `tbl_scan` 72 | tbl_scan <- scan_data(tbl = dplyr::storms, sections = "OVS") 73 | 74 | # Write the `ptblank_tbl_scan` object to an HTML file 75 | export_report( 76 | tbl_scan, 77 | filename = "tbl_scan-storms.html" 78 | ) 79 | ``` 80 | 81 | ### Drafting a nice, new validation plan with `draft_validation()` 82 | 83 | We can generate a draft validation plan in a new `.R` or `.Rmd` file using an input data table (just like with `scan_data()`).
With `draft_validation()` the data table will be scanned to learn about its column data and a set of starter validation steps (constituting a validation plan) will be written. 84 | 85 | Let's draft a validation plan for the `dplyr::storms` dataset. Here's a quick look at that table: 86 | 87 | ```{r paged.print=FALSE} 88 | dplyr::storms 89 | ``` 90 | 91 | Here's how we generate the new `.R` file: 92 | 93 | ```{r eval=FALSE} 94 | 95 | draft_validation( 96 | tbl = ~dplyr::storms, # This `~` makes it an expression for getting the data 97 | tbl_name = "storms", 98 | filename = "storms-validation" 99 | ) 100 | ``` 101 | 102 | Check out the new file called `"storms-validation.R"`! It's ready to run, all the validation steps run without failing test units, and the process (thanks to column inference routines) knows what to do with certain types of columns (like the latitude and longitude ones). 103 | 104 | Once in the file, it's possible to tweak the validation steps to better fit the expectations to the particular domain. It's best to use a data extract that contains a good number of rows and is relatively free of spurious data. 105 | 106 | ------ 107 | 108 | ### SUMMARY 109 | 110 | 1. It's a great idea to examine data you're unfamiliar with using `scan_data()`! 111 | 2. The `draft_validation()` function can give you a super-quickstart for data validation (it scans your data, but in a different way).
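As noted above, `scan_data()` also accepts `tbl_dbi` objects. Here's a hypothetical sketch of that case (the in-memory SQLite connection and table name are placeholders for illustration, not part of the workshop materials, and assume the **DBI** and **RSQLite** packages are installed):

```r
library(pointblank)
library(DBI)

# Put `small_table` into an in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "small_table", pointblank::small_table)

# Scan the database table much like a local data frame; just the
# Overview and Sample sections here
scan_data(tbl = dplyr::tbl(con, "small_table"), sections = "OS")

DBI::dbDisconnect(con)
```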
112 | -------------------------------------------------------------------------------- /03-expect-test-functions.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Using the `expect_*()` and `test_*()` Functions" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | library(pointblank) 9 | library(tidyverse) 10 | ``` 11 | 12 | ## Intro 13 | 14 | Those validation functions used previously with an agent have two sets of variants, taking the forms `expect_*()` and `test_*()`. 15 | 16 | The 'expect' prefix indicates that those functions are to be used in a **testthat** unit testing workflow. The 'test' prefix indicates that those functions produce logical outputs (`TRUE`/`FALSE`), making them suitable for programming. 17 | 18 | ### Using the **pointblank** expectation functions 19 | 20 | The **testthat** package has a collection of functions that begin with `expect_`. The `expect_*()` functions in **pointblank** follow the same convention and can be used in the standard **testthat** workflow (in a `test-*.R` file, inside the `tests/testthat` folder). The big difference here is that instead of testing function outputs, we are testing data tables. 21 | 22 | Say we wanted to test the values in the `c` column of the `small_table` dataset. Let's look at the values first: 23 | 24 | ```{r} 25 | small_table$c 26 | ``` 27 | 28 | Our expectation is that values can be between `0` and `10` and `NA` values are permitted. We can use `expect_col_vals_between()` for that: 29 | 30 | ```{r} 31 | expect_col_vals_between(small_table, columns = c, 0, 10, na_pass = TRUE) 32 | ``` 33 | 34 | When running this, nothing is returned. The default threshold for error is one test unit (can be changed with the `threshold` argument). If there is an error, that is reported in the console.
35 | 36 | ```{r, error=TRUE} 37 | expect_col_vals_between(small_table, columns = c, 0, 7, na_pass = TRUE) 38 | ``` 39 | 40 | There are 36 `expect_*()` functions, which can feel overwhelming at first. If you wanted to test your dataset in the **testthat** framework, a nice beginning approach might be to take the dataset and do two things in sequence: 41 | 42 | - use the `draft_validation()` function to generate a validation plan with the dataset as the primary input 43 | - use the `write_testthat_file()` function to create a **testthat** .R file using the agent from the `draft_validation()` file 44 | 45 | Let's use the `game_revenue` dataset from the **pointblank** package in this two-step workflow. 46 | 47 | ```{r eval=FALSE} 48 | draft_validation( 49 | tbl = ~ pointblank::game_revenue, 50 | filename = "game_revenue-validation" 51 | ) 52 | ``` 53 | 54 | Going into the `"game_revenue-validation.R"` file, the following line was added to the bottom: 55 | 56 | ```{r eval=FALSE} 57 | write_testthat_file(agent = agent, name = "game_revenue", path = ".") 58 | ``` 59 | 60 | Then the entire file was executed, creating the `"test-game_revenue.R"` file. This can be run using the 'Run Tests' button. 61 | 62 | ### Using the **pointblank** test functions 63 | 64 | The collection of `test_*()` functions, 36 of them as well, is used to give us a single `TRUE` or `FALSE`. 65 | 66 | Say we wanted a script to error if there are `NA` values in the `date_time` column of the `small_table` dataset. 
We could write this: 67 | 68 | ```{r} 69 | if (!test_col_vals_not_null(small_table, columns = date_time)) { 70 | stop("There should not be any `NA` values in the `date_time` column.") 71 | } 72 | ``` 73 | 74 | This one does result in an error: 75 | 76 | ```{r error=TRUE} 77 | if ( 78 | !test_col_vals_increasing(small_table, date, allow_stationary = TRUE) || 79 | !test_col_vals_gt(small_table, a, 1) 80 | ) { 81 | stop("There are problems with `small_table`.") 82 | } 83 | ``` 84 | 85 | ------ 86 | 87 | ### SUMMARY 88 | 89 | 1. You can validate tabular data in a **testthat** workflow with the `expect_*()` functions. 90 | 2. The `test_*()` collection of functions can be useful for developing conditional logic in programming contexts. 91 | -------------------------------------------------------------------------------- /04-scaling-up-data-validation.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Scaling up Data Validation with the Multiagent" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | library(pointblank) 9 | library(tidyverse) 10 | ``` 11 | 12 | ## Intro 13 | 14 | If your data quality process involves frequent validation runs and a large variety of tables to validate, you can take advantage of functions available in **pointblank** to make all of that manageable. 15 | 16 | ### Writing agents to disk 17 | 18 | The `x_write_disk()` function lets you write a **pointblank** agent to disk. This is very useful if you want to continually save the interrogation results as part of a larger data quality process. Here's an example where the agent is saved to disk with the date as part of the filename. 
19 | 20 | ```{r eval=FALSE} 21 | 22 | # Create the agent, develop a validation plan, interrogate 23 | agent <- 24 | create_agent( 25 | tbl = ~ small_table, 26 | tbl_name = "small_table", 27 | label = "Daily check of `small_table`.", 28 | actions = action_levels( 29 | warn_at = 0.10, 30 | stop_at = 0.25, 31 | notify_at = 0.35 32 | ) 33 | ) %>% 34 | col_exists(columns = vars(date, date_time)) %>% 35 | col_vals_regex( 36 | columns = b, 37 | regex = "[0-9]-[a-z]{3}-[0-9]{3}" 38 | ) %>% 39 | rows_distinct() %>% 40 | col_vals_gt(columns = d, value = 100) %>% 41 | col_vals_lte(columns = c, value = 5) %>% 42 | interrogate() 43 | 44 | # Save the agent to disk with `x_write_disk()`; append the date 45 | x_write_disk( 46 | agent, 47 | filename = affix_date("agent-small_table"), 48 | path = "small_table_tests" 49 | ) 50 | ``` 51 | 52 | ### Reading agents from disk 53 | 54 | We have this on disk as `"small_table_tests/agent-small_table_2022-10-13"`. We can read this from disk using the `x_read_disk()` function (it recreates the object). 55 | 56 | ```{r} 57 | agent_2022_10_13 <- 58 | x_read_disk( 59 | filename = "agent-small_table_2022-10-13", 60 | path = "small_table_tests" 61 | ) 62 | ``` 63 | 64 | We can get the data validation report from it. 65 | 66 | ```{r} 67 | agent_2022_10_13 68 | ``` 69 | 70 | ### Creating a 'multiagent' to get a combined data validation report 71 | 72 | A common task might be to see how data quality is changing over time. If you have multiple saved agents that check the same table, we can make a combined validation report that shows all of those validations. 73 | 74 | We actually have five saved agents in the `"small_table_tests"` directory: 75 | 76 | - `"agent-small_table_2022-10-13"` 77 | - `"agent-small_table_2022-10-14"` 78 | - `"agent-small_table_2022-10-15"` 79 | - `"agent-small_table_2022-10-16"` 80 | - `"agent-small_table_2022-10-17"` 81 | 82 | Let's get them all into a single report. 
We do this by generating a `multiagent`; that object has its own `get_multiagent_report()` function for customizing the layout and content of the report. 83 | 84 | ```{r} 85 | multiagent <- 86 | create_multiagent( 87 | x_read_disk("small_table_tests/agent-small_table_2022-10-13"), 88 | x_read_disk("small_table_tests/agent-small_table_2022-10-14"), 89 | x_read_disk("small_table_tests/agent-small_table_2022-10-15"), 90 | x_read_disk("small_table_tests/agent-small_table_2022-10-16"), 91 | x_read_disk("small_table_tests/agent-small_table_2022-10-17") 92 | ) 93 | ``` 94 | 95 | We can get a combined data validation report from it. By default, all validation reports are stacked together in the `"long"` display mode. 96 | 97 | ```{r} 98 | multiagent 99 | ``` 100 | 101 | With `get_multiagent_report()` we can customize the reporting. Here, we will choose the `"wide"` display mode and provide a custom title. 102 | 103 | ```{r} 104 | get_multiagent_report( 105 | multiagent, 106 | display_mode = "wide", 107 | title = "Wide report from **Multiple** Table Validations" 108 | ) 109 | ``` 110 | 111 | ------ 112 | 113 | ### SUMMARY 114 | 115 | 1. We can save agents to disk and read them back. This is good for keeping records of data quality, and all data/reporting is preserved. 116 | 2. Multiple agents can be combined, generating specialized reports that can show the validations of multiple tables (long display) or the evolution of data quality for a single table (wide display). 
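As a recap, the whole record-keeping cycle can be sketched in one script (assuming the `small_table_tests` directory used above; gathering the saved agents with `lapply()`/`do.call()` is just one convenient pattern):

```r
library(pointblank)

# Interrogate today's data and save the agent with a date-stamped filename
agent <-
  create_agent(tbl = ~ small_table, tbl_name = "small_table") %>%
  rows_distinct() %>%
  col_vals_gt(columns = d, value = 100) %>%
  interrogate()

x_write_disk(
  agent,
  filename = affix_date("agent-small_table"),
  path = "small_table_tests"
)

# Later: read every saved agent back and combine them into a multiagent
agents <-
  lapply(list.files("small_table_tests", full.names = TRUE), x_read_disk)

multiagent <- do.call(create_multiagent, agents)

# The wide display mode shows data quality evolving over time
get_multiagent_report(multiagent, display_mode = "wide")
```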
117 | -------------------------------------------------------------------------------- /05-intro-to-data-documentation.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Introduction to Data Documentation" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | library(pointblank) 9 | library(tidyverse) 10 | ``` 11 | 12 | ## Intro 13 | 14 | Documenting our datasets is a good thing to do often. We can do this in **pointblank** through the use of several functions that let us define portions of information about a table. This 'info text' can pertain to individual columns, the table as a whole, and whatever additional information makes sense for your organization. 15 | 16 | ### A simple example using `small_table` 17 | 18 | Let's document the `small_table` dataset that's available in **pointblank**. Here's the table once again: 19 | 20 | ```{r paged.print=FALSE} 21 | pointblank::small_table 22 | ``` 23 | 24 | To start the process, the `create_informant()` function is used. This creates an 'informant' object that is quite a bit different from the 'agent' object. 25 | 26 | ```{r} 27 | 28 | # Create the informant 29 | informant <- 30 | create_informant( 31 | tbl = small_table, 32 | tbl_name = "small_table", 33 | label = "Metadata for the `small_table` dataset." 34 | ) 35 | 36 | # Print to get the information report for the table 37 | informant 38 | ``` 39 | 40 | Printing `informant` shows us the automatically generated information on the `small_table` dataset, including the *COLUMNS* section. 41 | 42 | What we get in the initial report is very basic. 
Next, we ought to add information with the following set of `info_*()` functions: 43 | 44 | - `info_tabular()`: Add info pertaining to the data table as a whole 45 | - `info_columns()`: Add info for each table column 46 | - `info_section()`: Add a section that provides ancillary information 47 | 48 | Let's try adding some information with each of these functions and then look at the resulting report. 49 | 50 | ```{r} 51 | informant <- 52 | create_informant( 53 | tbl = small_table, 54 | tbl_name = "small_table", 55 | label = "Example No. 2" 56 | ) %>% 57 | info_tabular( 58 | description = "This table is included in the **pointblank** pkg." 59 | ) %>% 60 | info_columns( 61 | columns = "date_time", 62 | info = "This column is full of timestamps." 63 | ) %>% 64 | info_section( 65 | section_name = "further information", 66 | `examples and documentation` = "Examples for how to use the `info_*()` functions 67 | (and many more) are available at the 68 | [**pointblank** site](https://rich-iannone.github.io/pointblank/)." 69 | ) 70 | 71 | informant 72 | ``` 73 | 74 | As can be seen, the report is a bit more filled out with information. The *TABLE* and *COLUMNS* sections are in their prescribed order and the new section we named *FURTHER INFORMATION* follows those (and it has one subsection called *EXAMPLES AND DOCUMENTATION*). Let's explore how each of the three different `info_*()` functions works. 75 | 76 | ### The *TABLE* section and `info_tabular()` 77 | 78 | The `info_tabular()` function adds information to the *TABLE* section. We use named arguments to define subsection names and their content. In the previous example 79 | 80 | ```r 81 | info_tabular(description = "This table is included in the **pointblank** pkg.") 82 | ``` 83 | 84 | was used to make the *DESCRIPTION* subsection (all section titles are automatically capitalized). We can define as many subsections in the *TABLE* section as we need, either in the same `info_tabular()` call or across multiple calls. 
85 | 86 | ```{r} 87 | informant %>% 88 | info_tabular(Updates = "This table is not regularly updated.") 89 | ``` 90 | 91 | The *TABLE* section is a great place to put all the information about the table that needs to be front and center. Examples of some useful topics for this section might include: 92 | 93 | - a high-level summary of the table, stating its purpose and importance 94 | - what each row of the table represents 95 | - the main users of the table within an organization 96 | - a description of how the table is generated 97 | - information on the frequency of updates 98 | 99 | ### The *COLUMNS* section and `info_columns()` 100 | 101 | The section that follows the *TABLE* section is *COLUMNS*. This section provides an opportunity to describe each table column in as much detail as necessary. Here, individual columns serve as subsections (automatically generated upon using `create_informant()`) and there can be subsections within each column as well. 102 | 103 | The interesting thing about the information provided here via `info_columns()` is that it's additive. We can make multiple calls of `info_columns()`, disperse common pieces of info text across multiple columns, and append the text to any existing info text. 104 | 105 | Let's use the `palmerpenguins::penguins` dataset and fill in information for each column by adapting documentation from the **palmerpenguins** package. 106 | 107 | ```{r} 108 | informant_pp <- 109 | create_informant( 110 | tbl = palmerpenguins::penguins, 111 | tbl_name = "penguins", 112 | label = "The `penguins` dataset from the **palmerpenguins** pkg." 113 | ) %>% 114 | info_columns( 115 | columns = "species", 116 | info = "A factor denoting penguin species (*Adélie*, *Chinstrap*, and *Gentoo*)." 117 | ) %>% 118 | info_columns( 119 | columns = "island", 120 | info = "A factor denoting island in Palmer Archipelago, Antarctica 121 | (*Biscoe*, *Dream*, or *Torgersen*). 
122 | ) %>% 123 | info_columns( 124 | columns = "bill_length_mm", 125 | info = "A number denoting bill length" 126 | ) %>% 127 | info_columns( 128 | columns = "bill_depth_mm", 129 | info = "A number denoting bill depth" 130 | ) %>% 131 | info_columns( 132 | columns = "flipper_length_mm", 133 | info = "An integer denoting flipper length" 134 | ) %>% 135 | info_columns( 136 | columns = ends_with("mm"), 137 | info = "(in units of millimeters)." 138 | ) %>% 139 | info_columns( 140 | columns = "body_mass_g", 141 | info = "An integer denoting body mass (grams)." 142 | ) %>% 143 | info_columns( 144 | columns = "sex", 145 | info = "A factor denoting penguin sex (`\"female\"`, `\"male\"`)." 146 | ) %>% 147 | info_columns( 148 | columns = "year", 149 | info = "The study year (e.g., `2007`, `2008`, `2009`)." 150 | ) 151 | 152 | informant_pp 153 | ``` 154 | 155 | We can use **tidyselect** functions like `ends_with()` to append info text to a common subsection that exists across multiple columns. This was useful for stating the units which were common across three columns: `bill_length_mm`, `bill_depth_mm`, and `flipper_length_mm`. The following **tidyselect** functions are available in pointblank to make this process easier: 156 | 157 | - `starts_with()`: Match columns that start with a prefix. 158 | - `ends_with()`: Match columns that end with a suffix. 159 | - `contains()`: Match columns that contain a literal string. 160 | - `matches()`: Perform matching with a regular expression. 161 | - `everything()`: Select all columns. 162 | 163 | ------ 164 | 165 | ### Creating extra sections with `info_section()` 166 | 167 | Any information that doesn't fit in the *TABLE* or *COLUMNS* sections can be placed in extra sections with `info_section()`. These sections go at the bottom (in the order of creation). Let’s include a *SOURCE* section that provides references and a note on the data. 
168 | 169 | ```{r} 170 | informant_pp <- 171 | informant_pp %>% 172 | info_section( 173 | section_name = "source", 174 | References = c( 175 | 176 | "Adélie penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural 177 | size measurements and isotopic signatures of foraging among adult male and female 178 | Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near 179 | Palmer Station, 2007-2009 ver 5. Environmental Data Initiative 180 | ", 181 | 182 | "Gentoo penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. Structural 183 | size measurements and isotopic signatures of foraging among adult male and female 184 | Gentoo penguin (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer 185 | Station, 2007-2009 ver 5. Environmental Data Initiative 186 | ", 187 | 188 | "Chinstrap penguins: Palmer Station Antarctica LTER and K. Gorman. 2020. 189 | Structural size measurements and isotopic signatures of foraging among adult male 190 | and female Chinstrap penguin (Pygoscelis antarcticus) nesting along the Palmer 191 | Archipelago near Palmer Station, 2007-2009 ver 6. Environmental Data Initiative 192 | " 193 | ), 194 | Note = 195 | "Originally published in: Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual 196 | Dimorphism and Environmental Variability within a Community of Antarctic Penguins 197 | (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081 198 | " 199 | ) 200 | 201 | informant_pp 202 | ``` 203 | 204 | What other types of information go well in these separate sections? Some ideas are: 205 | 206 | - any info related to the source of the data table (e.g., references, background, etc.) 
207 | - definitions/explanations of terms used above 208 | - persons responsible for the data table, perhaps with contact information 209 | - further details on how the table is produced 210 | - any important issues with the table and notes on upcoming changes 211 | - links to other information artifacts that pertain to the table 212 | - report generation metadata, which might include things like the update history, persons responsible, instructions on how to contribute, etc. 213 | 214 | ### Customizing the information report with `get_informant_report()` 215 | 216 | With `get_informant_report()`, it's possible to alter the title of the information report, change the width of the table, and more. Let's make the report slightly narrower (at `600px`) and set the title to the name of the table. 217 | 218 | ```{r} 219 | informant_report <- 220 | informant_pp %>% 221 | get_informant_report(size = "600px", title = ":tbl_name:") 222 | 223 | informant_report 224 | ``` 225 | 226 | Given that this report looks really good, it can be published in a variety of ways (e.g., Connect, Quarto Pub), and you can export it to a standalone HTML file with `export_report()`. 227 | 228 | ```{r eval=FALSE} 229 | export_report(informant_report, filename = "informant-penguins.html") 230 | ``` 231 | 232 | ### SUMMARY 233 | 234 | 1. Begin the process of documenting a dataset with `create_informant()`. 235 | 2. Use `info_tabular()` to describe the table in general terms. 236 | 3. With `info_columns()`, you can document each column in a dataset. 237 | 4. Arbitrary sections of additional information can be added with `info_section()`. 238 | 5. We can control the look and feel of the information report with `get_informant_report()`. 239 | 6. It's possible to export the informant to standalone HTML with `export_report()`. 
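The six steps above can be strung together as a single pipeline (a sketch — the info text and the output filename here are placeholders):

```r
library(pointblank)

# Create the informant and add info text for the table, a column,
# and one extra section
informant <-
  create_informant(
    tbl = small_table,
    tbl_name = "small_table",
    label = "Metadata for the `small_table` dataset."
  ) %>%
  info_tabular(description = "A small demo table from **pointblank**.") %>%
  info_columns(columns = "date_time", info = "A timestamp for each record.") %>%
  info_section(section_name = "notes", origin = "Included with the package.")

# Customize the report, then export it as standalone HTML
report <- get_informant_report(informant, title = ":tbl_name:")
export_report(report, filename = "informant-small_table.html")
```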
240 | -------------------------------------------------------------------------------- /06-getting-deeper-into-documenting-data.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Getting Deeper Into Documenting Data" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | library(pointblank) 9 | library(tidyverse) 10 | ``` 11 | 12 | ## Intro 13 | 14 | We now know how to make a useful data dictionary that can be published and widely shared. We used a **pointblank** `informant` with a set of information functions to generate *info text* and put that text into the appropriate report sections. We're going to take this a few steps further, look into some more functionality that makes *info text* more dynamic, and also include a finalizing step in this workflow that accounts for evolving data. 15 | 16 | ### Creating snippets of useful text with `info_snippet()` 17 | 18 | A great source of information about the table can be the table itself. Suppose you want to show: 19 | 20 | - some categorical values from a particular column 21 | - a range of values in an important numeric column 22 | - KPI values that can be calculated using data in the table 23 | 24 | This can all be done with the `info_snippet()` function. You give the snippet a name and you give it a function call. Let's do this for the `small_table` dataset available in **pointblank**. 
Again, this is what that table looks like: 25 | 26 | ```{r paged.print=FALSE} 27 | pointblank::small_table 28 | ``` 29 | 30 | If you wanted the mean value of data in column `d` rounded to one decimal place, one way we could do it is with this expression: 31 | 32 | ```{r} 33 | small_table %>% .$d %>% mean() %>% round(1) 34 | ``` 35 | 36 | Inside of an `info_snippet()` call, which is used after creating the informant object, the expression would look like this: 37 | 38 | ```{r} 39 | informant <- 40 | create_informant( 41 | tbl = small_table, 42 | tbl_name = "small_table", 43 | label = "Metadata for the `small_table` dataset." 44 | ) %>% 45 | info_snippet( 46 | snippet_name = "mean_d", 47 | fn = ~ . %>% .$d %>% mean() %>% round(1) 48 | ) 49 | ``` 50 | 51 | The `small_table` dataset is associated with the `informant` as the target table, so it's represented as the leading `.` in the functional sequence given to `fn` inside of `info_snippet()`. It's important to note that there's a leading `~`, making this expression a formula (i.e., we don't want to execute anything here, at this time). 52 | 53 | Lastly, the snippet has been given the name `"mean_d"`. We know that this snippet will produce the value `2304.7`, so what can we do with that? We should put that value into some info text and use the `snippet_name` as the key. It works similarly to how the **glue** package does text interpolation, and here's the continuation of the above example: 54 | 55 | ```{r} 56 | informant <- 57 | informant %>% 58 | info_columns( 59 | columns = vars(d), 60 | info = "This column contains fairly large numbers (much larger than 61 | those numbers in column `a`). The mean value is {mean_d}, which is 62 | far greater than any number in that other column." 63 | ) 64 | ``` 65 | 66 | Within the text, there's the use of curly braces and the name of the snippet (`{mean_d}`). That's where the `2304.7` value will be inserted. 
This methodology for inserting the computed values of snippets can be used wherever info text is provided (in any of the `info_tabular()`, `info_columns()`, and `info_section()` functions). 67 | 68 | There's one last step. We have to finalize everything with the `incorporate()` function. Using this instructs **pointblank** to query the data (this is similar to using `interrogate()` when doing data validation). 69 | 70 | Let's write the whole thing again and finish it off with a call to `incorporate()`. 71 | 72 | ```{r} 73 | informant <- 74 | create_informant( 75 | tbl = small_table, 76 | tbl_name = "small_table", 77 | label = "Metadata for the `small_table` dataset." 78 | ) %>% 79 | info_snippet( 80 | snippet_name = "mean_d", 81 | fn = ~ . %>% .$d %>% mean() %>% round(1) 82 | ) %>% 83 | info_columns( 84 | columns = vars(d), 85 | info = "This column contains fairly large numbers (much larger than 86 | those numbers in column `a`). The mean value is {mean_d}, which is 87 | far greater than any number in that other column." 88 | ) %>% 89 | incorporate() 90 | ``` 91 | 92 | Now let's print the `informant` to get the information report for the table. 93 | 94 | ```{r} 95 | informant 96 | ``` 97 | 98 | ### Using `snip_*()` functions with `info_snippet()` 99 | 100 | There are a few functions available in **pointblank** that make it much easier to get commonly used text snippets. All of them begin with the `snip_` prefix and they are: 101 | 102 | - `snip_list()`: Get a list of column categories 103 | - `snip_lowest()`: Get the lowest value from a column 104 | - `snip_highest()`: Get the highest value from a column 105 | - `snip_stats()`: Get an inline statistical summary 106 | 107 | Each of these functions can be used directly as a `fn` value in `info_snippet()` and we don't have to specify the table since it's assumed that the target table is where we'll be snipping data from. Let's have a look at each of these in action. 
108 | 109 | #### `snip_list()` 110 | 111 | When describing some aspect of the target table, we may want to extract some values from a column and include them as a piece of info text. This can be efficiently done with `snip_list()`. 112 | 113 | ```{r} 114 | informant_pp <- 115 | create_informant( 116 | tbl = ~ palmerpenguins::penguins, 117 | tbl_name = "penguins", 118 | label = "The `penguins` dataset from the **palmerpenguins** pkg." 119 | ) %>% 120 | info_snippet( 121 | snippet_name = "species_snippet", 122 | fn = snip_list(column = "species") 123 | ) %>% 124 | info_snippet( 125 | snippet_name = "island_snippet", 126 | fn = snip_list(column = "island") 127 | ) %>% 128 | info_columns( 129 | columns = "species", 130 | info = "A factor denoting penguin species ({species_snippet})." 131 | ) %>% 132 | info_columns( 133 | columns = "island", 134 | info = "A factor denoting island in Palmer Archipelago, Antarctica 135 | ({island_snippet})." 136 | ) %>% 137 | incorporate() 138 | ``` 139 | 140 | ```{r} 141 | informant_pp 142 | ``` 143 | 144 | This also works for numeric values. Let’s use `snip_list()` to provide a text snippet based on values in the `year` column (which is an `integer` column): 145 | 146 | ```{r} 147 | informant_pp <- 148 | informant_pp %>% 149 | info_columns( 150 | columns = "year", 151 | info = "The study year ({year_snippet})." 152 | ) %>% 153 | info_snippet( 154 | snippet_name = "year_snippet", 155 | fn = snip_list(column = "year") 156 | ) %>% 157 | incorporate() 158 | ``` 159 | 160 | ```{r} 161 | informant_pp 162 | ``` 163 | 164 | #### `snip_lowest()` and `snip_highest()` 165 | 166 | We can get the lowest and highest values from a column and inject those formatted values into some info_text. Let’s do that for some of the measured values in the penguins dataset with `snip_lowest()` and `snip_highest()`. 
167 | 168 | ```{r} 169 | informant_pp <- 170 | informant_pp %>% 171 | info_columns( 172 | columns = "bill_length_mm", 173 | info = "A number denoting bill length" 174 | ) %>% 175 | info_columns( 176 | columns = "bill_depth_mm", 177 | info = "A number denoting bill depth (in the range of 178 | {min_depth} to {max_depth} millimeters)." 179 | ) %>% 180 | info_columns( 181 | columns = "flipper_length_mm", 182 | info = "An integer denoting flipper length" 183 | ) %>% 184 | info_columns( 185 | columns = matches("length"), 186 | info = "(in units of millimeters)." 187 | ) %>% 188 | info_columns( 189 | columns = "flipper_length_mm", 190 | info = "Largest observed is {largest_flipper_length} mm." 191 | ) %>% 192 | info_snippet( 193 | snippet_name = "min_depth", 194 | fn = snip_lowest(column = "bill_depth_mm") 195 | ) %>% 196 | info_snippet( 197 | snippet_name = "max_depth", 198 | fn = snip_highest(column = "bill_depth_mm") 199 | ) %>% 200 | info_snippet( 201 | snippet_name = "largest_flipper_length", 202 | fn = snip_highest(column = "flipper_length_mm") 203 | ) %>% 204 | incorporate() 205 | ``` 206 | 207 | ```{r} 208 | informant_pp 209 | ``` 210 | 211 | We can see from the report output that we can creatively use the lowest and highest values obtained by `snip_lowest()` and `snip_highest()` to specify a range or simply show some maximum value. 212 | 213 | Note that while the ordering of the `info_columns()` calls in the example affects the overall layout of the text (through the text appending behavior), the placement of `info_snippet()` calls does not matter. And, again, we must use `incorporate()` to update all of the text snippets and render them in their appropriate locations. 214 | 215 | ### Enhancements to text: *Text Tricks* 216 | 217 | You can use Markdown but there are a few extra tricks that can make the resulting text even better; we call them *Text Tricks*. Once you know about these text tricks you’ll be able to express information in many more interesting ways. 
218 | 219 | #### Links and Dates 220 | 221 | If you have links in your text, **pointblank** will try to identify them and style them nicely. This amounts to using a pleasing, light-blue color and underlines that appear on hover. It doesn't take much to style links but it does require *something*. So, Markdown links written as `< link url >` or `[ link text ]( link url )` will both get the transformation treatment. 222 | 223 | Sometimes you want dates to stand out from text. Try enclosing a date expressed in the ISO-8601 standard with parentheses, like this: `(2004-12-01)`. 224 | 225 | Here's how we might use these features while otherwise adding more information to the **palmerpenguins** reporting: 226 | 227 | ```{r} 228 | informant_pp <- 229 | informant_pp %>% 230 | info_tabular( 231 | `R dataset` = "The goal of `palmerpenguins` is to provide a great dataset 232 | for data exploration & visualization, as an alternative to `iris`. The 233 | latest CRAN release was published on (2020-07-25).", 234 | `data collection` = "Data were collected and made available by Dr. Kristen 235 | Gorman and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu), 236 | a member of the [Long Term Ecological Research Network](https://lternet.edu).", 237 | citation = "Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer 238 | Archipelago (Antarctica) penguin data. R package version 0.1.0. 239 | 240 | doi: 10.5281/zenodo.3960218." 241 | ) %>% 242 | incorporate() 243 | ``` 244 | 245 | ```{r} 246 | informant_pp 247 | ``` 248 | 249 | #### Labels 250 | 251 | We can take portions of text and present them as labels. These will help you call out important attributes in short form and may eliminate the need for oft-repeated statements. You might apply labels to signify priority, category, or any other information you find useful. To do this we have two options: 252 | 253 | 1. Use double parentheses for a rectangular label: `((label text))` 254 | 2. 
Use triple parens for a rounded-rectangular label: `(((label text)))` 255 | 256 | ```{r} 257 | informant_pp <- 258 | informant_pp %>% 259 | info_columns( 260 | columns = vars(body_mass_g), 261 | info = "An integer denoting body mass." 262 | ) %>% 263 | info_columns( 264 | columns = c(ends_with("mm"), ends_with("g")), 265 | info = "((measured))" 266 | ) %>% 267 | info_section( 268 | section_name = "additional notes", 269 | `data types` = "(((factor))) (((numeric))) (((integer)))" 270 | ) %>% 271 | incorporate() 272 | ``` 273 | 274 | ```{r} 275 | informant_pp 276 | ``` 277 | 278 | #### Styled text 279 | 280 | If you want to use CSS styles on spans of info text, it's possible with the following construction: 281 | 282 | `[[ info text ]]<< CSS style rules >>` 283 | 284 | It's important to ensure that each CSS rule is concluded with a `;` character in this syntax. Styling the word `factor` inside a piece of *info text* might look like this: 285 | 286 | `This is a [[factor]]<<color: red;>> value.` 287 | 288 | There are many CSS style rules that can be used. Here's a sample of a few useful ones: 289 | 290 | - `color: <a color value>;` (text color) 291 | - `background-color: <a color value>;` (the text's background color) 292 | - `text-decoration: (overline | line-through | underline);` 293 | - `text-transform: (uppercase | lowercase | capitalize);` 294 | - `letter-spacing: <a length value>;` 295 | - `word-spacing: <a length value>;` 296 | - `font-style: (normal | italic | oblique);` 297 | - `font-weight: (normal | bold | 100-900);` 298 | - `font-variant: (normal | small-caps);` 299 | - `border: (solid | dashed | dotted);` 300 | 301 | Continuing with our palmerpenguins reporting, we'll add some more info text and take the opportunity to add CSS style rules using the `[[ ]]<< >>` syntax. 302 | 303 | ```{r} 304 | informant_pp <- 305 | informant_pp %>% 306 | info_columns( 307 | columns = vars(sex), 308 | info = "A [[factor]]<<text-decoration: underline;>> 309 | denoting penguin sex (female or male)." 
310 | ) %>% 311 | info_section( 312 | section_name = "additional notes", 313 | keywords = " 314 | [[((penguins))]]<<border-color: steelblue; background-color: aliceblue;>> 315 | [[((Antarctica))]]<<border-color: seagreen; background-color: honeydew;>> 316 | [[((measurements))]]<<border-color: firebrick; background-color: mistyrose;>> 317 | " 318 | ) %>% 319 | incorporate() 320 | ``` 321 | 322 | ```{r} 323 | informant_pp 324 | ``` 325 | 326 | With the above `info_columns()` and `info_section()` function calls, we are able to style a single word (with an underline) and even style labels (changing the border and background colors). The syntax here is somewhat forgiving, allowing you to put line breaks between `]]` and `<<` and between style rules so that lines of markup don't have to be overly long. 327 | 328 | ### SUMMARY 329 | 330 | 1. We can query the table being documented with an expression inside `info_snippet()`. This allows us to inject the expression output into *info text*. 331 | 2. There are several `snip_*()` functions included in **pointblank** that handle common use cases. They are used like this: `info_snippet(fn = snip_*(...))`. 332 | 3. We can create label-like text with `(( ))` or `((( )))`. 333 | 4. We can style text with `[[ ]]<< >>`. 334 | 5. Links will be styled automatically if you use Markdown links; dates in ISO 8601 notation can be autostyled if enclosed in parentheses. 335 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 
13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work"). 20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. 
moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). 
Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. 
Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the present or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 
122 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## The **pointblank** Workshop 2 | 3 | This **pointblank** workshop will teach you *a lot* about what **pointblank** can do, and it'll give you an opportunity to experiment with the package. All materials are also available as a Posit Cloud project, making it easy to get up and running. 4 | 5 | https://posit.cloud/content/4726872 6 | 7 | The goal of the workshop is to introduce you to a lot of examples and provide some time to use the functions of **pointblank** with some sample datasets, learning bit-by-bit as we go. 8 | 9 | Each module of the workshop focuses on a different subset of functions, and they are all presented here as **R Markdown** (.Rmd) files, with one file for each workshop module: 10 | 11 | - `"01-intro-to-data-validation.Rmd"` (The `agent`, validation fns, interrogation/reports) 12 | - `"02-scan-your-data.Rmd"` (Looking at your data with `scan_data()`) 13 | - `"03-expect-test-functions.Rmd"` (Using the `expect_*()` and `test_*()` functions) 14 | - `"04-scaling-up-data-validation.Rmd"` (The `multiagent` and its reporting structures) 15 | - `"05-intro-to-data-documentation.Rmd"` (The `informant` and describing your data) 16 | - `"06-getting-deeper-into-documenting-data.Rmd"` (Using snippets and text tricks) 17 | 18 | You can navigate to any of these and modify the code within the self-contained **R Markdown** code chunks. Entire **R Markdown** files can be knit to HTML, where a separate window will show the rendered document. 19 | 20 | ### Installation 21 | 22 | Install **pointblank** on your system by using `install.packages()`: 23 | 24 | ```{r eval=FALSE} 25 | # install.packages("pointblank") 26 | ``` 27 | 28 | You can optionally use the development version of **pointblank**, installing it from GitHub with `devtools::install_github()`.
29 | 30 | ```{r eval=FALSE} 31 | # devtools::install_github("rich-iannone/pointblank") 32 | ``` 33 | -------------------------------------------------------------------------------- /game_revenue-validation.R: -------------------------------------------------------------------------------- 1 | library(pointblank) 2 | 3 | agent <- 4 | create_agent( 5 | tbl = ~pointblank::game_revenue, 6 | actions = action_levels( 7 | warn_at = 0.05, 8 | stop_at = 0.10 9 | ), 10 | tbl_name = "~pointblank::game_revenue", 11 | label = "Validation plan generated by `draft_validation()`." 12 | ) %>% 13 | # Expect that column `player_id` is of type: character 14 | col_is_character( 15 | columns = vars(player_id) 16 | ) %>% 17 | # Expect that column `session_id` is of type: character 18 | col_is_character( 19 | columns = vars(session_id) 20 | ) %>% 21 | # Expect that column `item_type` is of type: character 22 | col_is_character( 23 | columns = vars(item_type) 24 | ) %>% 25 | # Expect that column `item_name` is of type: character 26 | col_is_character( 27 | columns = vars(item_name) 28 | ) %>% 29 | # Expect that column `item_revenue` is of type: numeric 30 | col_is_numeric( 31 | columns = vars(item_revenue) 32 | ) %>% 33 | # Expect that values in `item_revenue` should be between `0.004` and `142.989` 34 | col_vals_between( 35 | columns = vars(item_revenue), 36 | left = 0.004, 37 | right = 142.989 38 | ) %>% 39 | # Expect that column `session_duration` is of type: numeric 40 | col_is_numeric( 41 | columns = vars(session_duration) 42 | ) %>% 43 | # Expect that values in `session_duration` should be between `3.2` and `41` 44 | col_vals_between( 45 | columns = vars(session_duration), 46 | left = 3.2, 47 | right = 41 48 | ) %>% 49 | # Expect that column `acquisition` is of type: character 50 | col_is_character( 51 | columns = vars(acquisition) 52 | ) %>% 53 | # Expect that column `country` is of type: character 54 | col_is_character( 55 | columns = vars(country) 56 | ) %>% 57 | # Expect that 
values in `country` should be in the set of `Germany`, `Canada`, `South Korea` (and 20 more) 58 | col_vals_in_set( 59 | columns = vars(country), 60 | set = c("Germany", "Canada", "South Korea", "Sweden", "Austria", "Hong Kong", "United States", "Mexico", "Egypt", "Denmark", "Norway", "Japan", "Australia", "South Africa", "Spain", "France", "Portugal", "Russia", "India", "Switzerland", "China", "Philippines", "United Kingdom") 61 | ) %>% 62 | # Expect entirely distinct rows across all columns 63 | rows_distinct() %>% 64 | # Expect that column schemas match 65 | col_schema_match( 66 | schema = col_schema( 67 | player_id = "character", 68 | session_id = "character", 69 | session_start = c("POSIXct", "POSIXt"), 70 | time = c("POSIXct", "POSIXt"), 71 | item_type = "character", 72 | item_name = "character", 73 | item_revenue = "numeric", 74 | session_duration = "numeric", 75 | start_day = "Date", 76 | acquisition = "character", 77 | country = "character" 78 | ) 79 | ) %>% 80 | interrogate() 81 | 82 | agent 83 | 84 | write_testthat_file(agent = agent, name = "game_revenue", path = ".") 85 | -------------------------------------------------------------------------------- /informant-penguins.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
[Rendered HTML report omitted. The file presents the **pointblank** informant report for the `penguins` dataset from the palmerpenguins package: a tibble of 344 rows and 8 columns. It includes INFO descriptions for the `species`, `island`, `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `body_mass_g`, `sex`, and `year` columns; REFERENCES for the Adélie, Gentoo, and Chinstrap Palmer Station Antarctica LTER structural-size datasets (Environmental Data Initiative); and a NOTE citing the original publication, Gorman KB, Williams TD, Fraser WR (2014), PLoS ONE 9(3): e90081. Report generated 2022-10-13 15:52:30 EDT.]
587 | 588 | 589 | -------------------------------------------------------------------------------- /pointblank-workshop.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | -------------------------------------------------------------------------------- /save_multiple_agents_to_disk.R: -------------------------------------------------------------------------------- 1 | library(pointblank) 2 | 3 | al <- 4 | action_levels( 5 | warn_at = 0.05, 6 | stop_at = 0.25, 7 | notify_at = 0.35 8 | ) 9 | 10 | agent_1 <- 11 | create_agent( 12 | tbl = ~ small_table, 13 | tbl_name = "small_table", 14 | label = "Daily check of `small_table`.", 15 | actions = al 16 | ) %>% 17 | col_vals_gt(vars(date_time), vars(date), na_pass = TRUE) %>% 18 | col_vals_gt(vars(b), vars(g), na_pass = TRUE) %>% 19 | rows_distinct() %>% 20 | col_vals_gt(vars(d), 100) %>% 21 | col_vals_equal(vars(d), vars(d), na_pass = TRUE) %>% 22 | col_vals_between(vars(c), left = vars(a), right = vars(d), na_pass = TRUE) %>% 23 | col_vals_not_between(vars(c), left = 10, right = 20, na_pass = TRUE) %>% 24 | rows_distinct(vars(d, e, f)) %>% 25 | col_is_integer(vars(a)) %>% 26 | interrogate() 27 | 28 | x_write_disk( 29 | agent_1, 30 | filename = "agent-small_table_2022-10-14", 31 | path = "small_table_tests" 32 | ) 33 | 34 | agent_2 <- 35 | create_agent( 36 | tbl = ~ small_table, 37 | tbl_name = "small_table", 38 | label = "Daily check of `small_table`.", 39 | actions = al 40 | ) %>% 41 | col_exists(vars(date, date_time)) %>% 42 | col_vals_regex( 43 | vars(b), "[0-9]-[a-z]{3}-[0-9]{3}", 44 | active = FALSE 45 | ) %>% 46 | rows_distinct() %>% 47 | interrogate() 48 
| 49 | x_write_disk( 50 | agent_2, 51 | filename = "agent-small_table_2022-10-15", 52 | path = "small_table_tests" 53 | ) 54 | 55 | agent_3 <- 56 | create_agent( 57 | tbl = ~ small_table, 58 | tbl_name = "small_table", 59 | label = "Daily check of `small_table`.", 60 | actions = al 61 | ) %>% 62 | rows_distinct() %>% 63 | col_vals_gt(vars(d), 100) %>% 64 | col_vals_lte(vars(c), 5) %>% 65 | col_vals_equal( 66 | vars(d), vars(d), 67 | na_pass = TRUE 68 | ) %>% 69 | col_vals_in_set( 70 | vars(f), 71 | set = c("low", "mid", "high") 72 | ) %>% 73 | col_vals_between( 74 | vars(c), 75 | left = vars(a), right = vars(d), 76 | na_pass = TRUE 77 | ) %>% 78 | interrogate() 79 | 80 | x_write_disk( 81 | agent_3, 82 | filename = "agent-small_table_2022-10-16", 83 | path = "small_table_tests" 84 | ) 85 | 86 | agent_4 <- 87 | create_agent( 88 | tbl = ~ small_table, 89 | tbl_name = "small_table", 90 | label = "Daily check of `small_table`.", 91 | actions = al 92 | ) %>% 93 | col_vals_gt(vars(date_time), vars(date), na_pass = TRUE) %>% 94 | interrogate() 95 | 96 | x_write_disk( 97 | agent_4, 98 | filename = "agent-small_table_2022-10-17", 99 | path = "small_table_tests" 100 | ) 101 | -------------------------------------------------------------------------------- /small_table_tests/agent-small_table_2022-10-13: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rich-iannone/pointblank-workshop/989c61db0e2915c9b1a39de2b2a137b24d9bf27c/small_table_tests/agent-small_table_2022-10-13 -------------------------------------------------------------------------------- /small_table_tests/agent-small_table_2022-10-14: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rich-iannone/pointblank-workshop/989c61db0e2915c9b1a39de2b2a137b24d9bf27c/small_table_tests/agent-small_table_2022-10-14 -------------------------------------------------------------------------------- 
/small_table_tests/agent-small_table_2022-10-15: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rich-iannone/pointblank-workshop/989c61db0e2915c9b1a39de2b2a137b24d9bf27c/small_table_tests/agent-small_table_2022-10-15 -------------------------------------------------------------------------------- /small_table_tests/agent-small_table_2022-10-16: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rich-iannone/pointblank-workshop/989c61db0e2915c9b1a39de2b2a137b24d9bf27c/small_table_tests/agent-small_table_2022-10-16 -------------------------------------------------------------------------------- /small_table_tests/agent-small_table_2022-10-17: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/rich-iannone/pointblank-workshop/989c61db0e2915c9b1a39de2b2a137b24d9bf27c/small_table_tests/agent-small_table_2022-10-17 -------------------------------------------------------------------------------- /storms-validation.R: -------------------------------------------------------------------------------- 1 | library(pointblank) 2 | 3 | agent <- 4 | create_agent( 5 | tbl = ~dplyr::storms, 6 | actions = action_levels( 7 | warn_at = 0.05, 8 | stop_at = 0.10 9 | ), 10 | tbl_name = "storms", 11 | label = "Validation plan generated by `draft_validation()`." 
12 | ) %>% 13 | # Expect that column `name` is of type: character 14 | col_is_character( 15 | columns = vars(name) 16 | ) %>% 17 | # Expect that column `year` is of type: numeric 18 | col_is_numeric( 19 | columns = vars(year) 20 | ) %>% 21 | # Expect that values in `year` should be between `1975` and `2020` 22 | col_vals_between( 23 | columns = vars(year), 24 | left = 1975, 25 | right = 2020 26 | ) %>% 27 | # Expect that column `month` is of type: numeric 28 | col_is_numeric( 29 | columns = vars(month) 30 | ) %>% 31 | # Expect that values in `month` should be between `1` and `12` 32 | col_vals_between( 33 | columns = vars(month), 34 | left = 1, 35 | right = 12 36 | ) %>% 37 | # Expect that column `day` is of type: integer 38 | col_is_integer( 39 | columns = vars(day) 40 | ) %>% 41 | # Expect that values in `day` should be between `1` and `31` 42 | col_vals_between( 43 | columns = vars(day), 44 | left = 1, 45 | right = 31 46 | ) %>% 47 | # Expect that column `hour` is of type: numeric 48 | col_is_numeric( 49 | columns = vars(hour) 50 | ) %>% 51 | # Expect that values in `hour` should be between `0` and `23` 52 | col_vals_between( 53 | columns = vars(hour), 54 | left = 0, 55 | right = 23 56 | ) %>% 57 | # Expect that column `lat` is of type: numeric 58 | col_is_numeric( 59 | columns = vars(lat) 60 | ) %>% 61 | # Expect that values in `lat` should be between `-90` and `90` 62 | col_vals_between( 63 | columns = vars(lat), 64 | left = -90, 65 | right = 90 66 | ) %>% 67 | # Expect that column `long` is of type: numeric 68 | col_is_numeric( 69 | columns = vars(long) 70 | ) %>% 71 | # Expect that values in `long` should be between `-180` and `180` 72 | col_vals_between( 73 | columns = vars(long), 74 | left = -180, 75 | right = 180 76 | ) %>% 77 | # Expect that column `status` is of type: character 78 | col_is_character( 79 | columns = vars(status) 80 | ) %>% 81 | # Expect that column `category` is of type: factor 82 | col_is_factor( 83 | columns = vars(category) 84 | ) %>% 
85 | # Expect that column `wind` is of type: integer 86 | col_is_integer( 87 | columns = vars(wind) 88 | ) %>% 89 | # Expect that values in `wind` should be between `10` and `160` 90 | col_vals_between( 91 | columns = vars(wind), 92 | left = 10, 93 | right = 160 94 | ) %>% 95 | # Expect that column `pressure` is of type: integer 96 | col_is_integer( 97 | columns = vars(pressure) 98 | ) %>% 99 | # Expect that values in `pressure` should be between `882` and `1022` 100 | col_vals_between( 101 | columns = vars(pressure), 102 | left = 882, 103 | right = 1022 104 | ) %>% 105 | # Expect that column `tropicalstorm_force_diameter` is of type: integer 106 | col_is_integer( 107 | columns = vars(tropicalstorm_force_diameter) 108 | ) %>% 109 | # Expect that values in `tropicalstorm_force_diameter` should be between `0` and `870` 110 | col_vals_between( 111 | columns = vars(tropicalstorm_force_diameter), 112 | left = 0, 113 | right = 870, 114 | na_pass = TRUE 115 | ) %>% 116 | # Expect that column `hurricane_force_diameter` is of type: integer 117 | col_is_integer( 118 | columns = vars(hurricane_force_diameter) 119 | ) %>% 120 | # Expect that values in `hurricane_force_diameter` should be between `0` and `300` 121 | col_vals_between( 122 | columns = vars(hurricane_force_diameter), 123 | left = 0, 124 | right = 300, 125 | na_pass = TRUE 126 | ) %>% 127 | # Expect entirely distinct rows across all columns 128 | rows_distinct() %>% 129 | # Expect that column schemas match 130 | col_schema_match( 131 | schema = col_schema( 132 | name = "character", 133 | year = "numeric", 134 | month = "numeric", 135 | day = "integer", 136 | hour = "numeric", 137 | lat = "numeric", 138 | long = "numeric", 139 | status = "character", 140 | category = c("ordered", "factor"), 141 | wind = "integer", 142 | pressure = "integer", 143 | tropicalstorm_force_diameter = "integer", 144 | hurricane_force_diameter = "integer" 145 | ) 146 | ) %>% 147 | interrogate() 148 | 149 | agent 150 | 
-------------------------------------------------------------------------------- /test-game_revenue.R: -------------------------------------------------------------------------------- 1 | # Generated by pointblank 2 | 3 | library(pointblank) 4 | 5 | tbl <- pointblank::game_revenue 6 | 7 | test_that("column `player_id` is of type: character", { 8 | 9 | expect_col_is_character( 10 | tbl, 11 | columns = vars(player_id), 12 | threshold = 1 13 | ) 14 | }) 15 | 16 | test_that("column `session_id` is of type: character", { 17 | 18 | expect_col_is_character( 19 | tbl, 20 | columns = vars(session_id), 21 | threshold = 1 22 | ) 23 | }) 24 | 25 | test_that("column `item_type` is of type: character", { 26 | 27 | expect_col_is_character( 28 | tbl, 29 | columns = vars(item_type), 30 | threshold = 1 31 | ) 32 | }) 33 | 34 | test_that("column `item_name` is of type: character", { 35 | 36 | expect_col_is_character( 37 | tbl, 38 | columns = vars(item_name), 39 | threshold = 1 40 | ) 41 | }) 42 | 43 | test_that("column `item_revenue` is of type: numeric", { 44 | 45 | expect_col_is_numeric( 46 | tbl, 47 | columns = vars(item_revenue), 48 | threshold = 1 49 | ) 50 | }) 51 | 52 | test_that("values in `item_revenue` should be between `0.004` and `142.989`", { 53 | 54 | expect_col_vals_between( 55 | tbl, 56 | columns = vars(item_revenue), 57 | left = 0.004, 58 | right = 142.989, 59 | threshold = 0.1 60 | ) 61 | }) 62 | 63 | test_that("column `session_duration` is of type: numeric", { 64 | 65 | expect_col_is_numeric( 66 | tbl, 67 | columns = vars(session_duration), 68 | threshold = 1 69 | ) 70 | }) 71 | 72 | test_that("values in `session_duration` should be between `3.2` and `41`", { 73 | 74 | expect_col_vals_between( 75 | tbl, 76 | columns = vars(session_duration), 77 | left = 3.2, 78 | right = 41, 79 | threshold = 0.1 80 | ) 81 | }) 82 | 83 | test_that("column `acquisition` is of type: character", { 84 | 85 | expect_col_is_character( 86 | tbl, 87 | columns = vars(acquisition), 88 | 
threshold = 1 89 | ) 90 | }) 91 | 92 | test_that("column `country` is of type: character", { 93 | 94 | expect_col_is_character( 95 | tbl, 96 | columns = vars(country), 97 | threshold = 1 98 | ) 99 | }) 100 | 101 | test_that("values in `country` should be in the set of `Germany`, `Canada`, `South Korea` (and 20 more)", { 102 | 103 | expect_col_vals_in_set( 104 | tbl, 105 | columns = vars(country), 106 | set = c("Germany", "Canada", "South Korea", "Sweden", "Austria", "Hong Kong", "United States", "Mexico", "Egypt", "Denmark", "Norway", "Japan", "Australia", "South Africa", "Spain", "France", "Portugal", "Russia", "India", "Switzerland", "China", "Philippines", "United Kingdom"), 107 | threshold = 0.1 108 | ) 109 | }) 110 | 111 | test_that("entirely distinct rows across all columns", { 112 | 113 | expect_rows_distinct(tbl) 114 | }) 115 | 116 | test_that("column schemas match", { 117 | 118 | expect_col_schema_match( 119 | tbl, 120 | schema = col_schema( 121 | player_id = "character", 122 | session_id = "character", 123 | session_start = c("POSIXct", "POSIXt"), 124 | time = c("POSIXct", "POSIXt"), 125 | item_type = "character", 126 | item_name = "character", 127 | item_revenue = "numeric", 128 | session_duration = "numeric", 129 | start_day = "Date", 130 | acquisition = "character", 131 | country = "character" 132 | ), 133 | threshold = 1 134 | ) 135 | }) 136 | --------------------------------------------------------------------------------
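The `test-game_revenue.R` file above, produced by `write_testthat_file()`, is meant to be run with **testthat**. As a minimal sketch (assuming the file sits in the current working directory and both **pointblank** and **testthat** are installed), it can be executed interactively like this:

```r
# Run the pointblank-generated testthat file interactively.
# Each test_that() block in the file wraps one pointblank expect_*()
# function, so a failing validation step surfaces as an ordinary
# testthat failure in the results.
library(testthat)

test_file("test-game_revenue.R")
```

In a package or project context the same file would typically live under `tests/testthat/` and be picked up automatically by `testthat::test_dir()` or `devtools::test()`.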