├── .gitignore ├── LICENSE ├── README.Rmd ├── README.md ├── rencfaq.Rproj ├── what-every.Rmd └── what-every.md /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work"). 20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 
27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. 
other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. 
In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the present or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. 
Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 122 | -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "The R Encoding FAQ" 3 | output: 4 | github_document: 5 | toc: true 6 | toc_depth: 2 7 | --- 8 | 9 | ```{r setup, include=FALSE} 10 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>", error = TRUE) 11 | ``` 12 | 13 | # Introduction 14 | 15 | The goal of this document is twofold: (1) collect current best practices to solve text encoding related issues and (2) include links to further information about text encoding. 16 | 17 | # Contributing 18 | 19 | Encoding of text is a large topic, and to make this FAQ useful, we need your contributions. There are a variety of ways to contribute: 20 | 21 | - Found a typo, a broken link, or example code that is not working for you? Please open an issue about it. Or even better, consider fixing it in a pull request. 22 | - Something is missing? There is a new (or old) package or function that should be mentioned here? Please open an issue about it. 23 | - Some information is wrong or something is not explained well? Please open an issue about it. Or even better, fix it in a pull request! 24 | - You found an answer useful? Please tweet about it and link to this FAQ. 25 | 26 | Your contributions are most appreciated. 
27 | 28 | # Encoding of text 29 | 30 | Good general introductions to text encodings: 31 | 32 | - [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) 33 | 34 | - [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](https://kunststube.net/encoding/) 35 | 36 | # R's text representation 37 | 38 | See the R Internals manual: 39 | 40 | 41 | 42 | # R Connections and encodings 43 | 44 | Some notes about connections. 45 | 46 | - Connections assume that the input is in the encoding specified by `getOption("encoding")`. They convert text to the native encoding. 47 | - Functions that *create* connections have an `encoding` argument that overrides `getOption("encoding")`. 48 | - If you tell a connection that the input is already in the native encoding, then it will not convert the input. If the input is in fact *not* in the native encoding, then it is up to the user to mark it accordingly, e.g. as UTF-8. 49 | 50 | # Printing text to the console 51 | 52 | R converts strings to the native encoding when printing them to the screen. This is important to know when debugging encoding problems: two strings that print the same way may have a different internal representation: 53 | 54 | ```{r} 55 | s1 <- "\xfc" 56 | Encoding(s1) <- "latin1" 57 | s2 <- iconv(s1, "latin1", "UTF-8") 58 | s1 59 | s2 60 | ``` 61 | 62 | ```{r} 63 | s1 == s2 64 | identical(s1, s2) 65 | testthat::expect_equal(s1, s2) 66 | testthat::expect_identical(s1, s2) 67 | ``` 68 | 69 | ```{r} 70 | Encoding(s1) 71 | Encoding(s2) 72 | ``` 73 | 74 | ```{r} 75 | charToRaw(s1) 76 | charToRaw(s2) 77 | ``` 78 | 79 | ## Display width 80 | 81 | Some Unicode glyphs, e.g. typical emojis, are *wide*. 
They are supposed to take up two characters in a monospace font. The set of wide glyphs (just like the set of glyphs in general) has been changing in different Unicode versions. 82 | 83 | Up to version 4.0.3, R followed Unicode 8 (released in 2015) in terms of character width, so `nchar(..., type = "width")` miscalculated the width of many Asian characters. 84 | 85 | R 4.0.4 follows Unicode 12.1. 86 | 87 | R 4.1.0 - R 4.3.3 follow Unicode 13.0. 88 | 89 | Current R-devel (will be R 4.4.0 in about a month) still uses Unicode 13.0. 90 | 91 | To correctly calculate the display width of Unicode text across various R versions, you can use the utf8 package or `cli::ansi_nchar()`. 92 | 93 | Note that some terminals, and also RStudio, follow an older version of the Unicode standard, so they also calculate the display width incorrectly, typically printing characters on top of each other. 94 | 95 | # Issues when building packages 96 | 97 | See 'Writing R Extensions' about including UTF-8 characters in packages: 98 | 99 | ## Code files 100 | 101 | I suggest that you keep your source files in ASCII, except for comments, which you can write in UTF-8 if you like. (But not necessarily in roxygen2 comments, which will go in the manual; see below!) 102 | 103 | If you need a string containing non-ASCII characters, construct it with `\uxxxx` or `\U{xxxxxx}` escape sequences, which yield a UTF-8-encoded string. 104 | 105 | ```{r} 106 | one_and_a_half <- "\u0031\u00BD" 107 | one_and_a_half 108 | Encoding(one_and_a_half) 109 | charToRaw(one_and_a_half) 110 | ``` 111 | 112 | Here are some handy ways to find the Unicode code points for an existing string: 113 | 114 | * Copy and paste into the [Unicode character inspector](https://apps.timwhitlock.info/unicode/inspect). 115 | * `sprintf("\\u%04X", utf8ToInt(x))` 116 | * *Do we have other suggestions?* 117 | 118 | If you need to include non-UTF-8 non-ASCII characters, e.g. 
for testing, then include them via raw data, or via `\xxx` escape sequences. E.g. to include a latin1 string, you can do either of these: 119 | 120 | ```{r} 121 | t1 <- "t\xfck\xf6rf\xfar\xf3g\xe9p" 122 | Encoding(t1) <- "latin1" 123 | t2 <- rawToChar(as.raw(c( 124 | 0x74, 0xfc, 0x6b, 0xf6, 0x72, 0x66, 0xfa, 0x72, 125 | 0xf3, 0x67, 0xe9, 0x70))) 126 | Encoding(t2) <- "latin1" 127 | t1 == t2 128 | identical(charToRaw(t1), charToRaw(t2)) 129 | ``` 130 | 131 | ## Symbols 132 | 133 | Symbols must be in the native encoding, so best practice is to only use ASCII symbols. 134 | 135 | This is not completely equivalent to using ASCII code files, because names, e.g. in a named list, or the row names in a data frame, are also symbols in R. So do not use non-ASCII names. (Another reason to avoid row names in data frames.) 136 | 137 | A typical mistake is to assign file names as list or row names when working with files. While it might be a convenient representation, it is a bad idea because of the possible encoding errors. 138 | 139 | Similarly, always use ASCII column names. 140 | 141 | ## Unit tests 142 | 143 | Unit test files are code files, so the same rule applies to them: keep them ASCII. See the 'How-to' section below on specific testing-related tasks. 144 | 145 | ## Manual pages 146 | 147 | You can use *some* UTF-8 characters in manual pages, if you include `\encoding{UTF-8}` in the manual page, or declare `Encoding: UTF-8` in the `DESCRIPTION` file of the package. 148 | 149 | Unfortunately, the supported characters do not include those typically used in the output of tibble and cli. The problem is with the PDF manual and LaTeX, so you can conditionally include UTF-8 in the HTML output with the `\if` Rd command. 150 | 151 | If your PDF manual does not build because a non-supported UTF-8 character has sneaked in somewhere, `tools::showNonASCIIfile()` can show where exactly it is. 
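To sketch the conditional-inclusion approach, an Rd fragment might look like the following. (The glyph and the fallback text here are made up for illustration; only `\encoding{}`, `\if{}{}`, and `\out{}` are real Rd markup.)

```
\encoding{UTF-8}
\details{
  On UTF-8 platforms the output contains a check mark:
  \if{html}{\out{&#x2714;}}
  \if{latex}{(a check mark, shown in the HTML manual)}
}
```

The PDF manual then only ever sees the ASCII fallback, so LaTeX never encounters the unsupported character.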
152 | 153 | ## Vignettes 154 | 155 | TODO 156 | 157 | # Graphics 158 | 159 | TODO 160 | 161 | # How-to 162 | 163 | ## How to query the current native encoding? 164 | 165 | R uses the native encoding extensively. Some examples: 166 | 167 | - Symbols (i.e. variable names, names in a named list, column and row names, etc.) are always in the native encoding. 168 | - Strings marked with `"unknown"` encoding are assumed to be in this encoding. 169 | - Output connections assume that the output is in this encoding. 170 | - Input connections convert the input into this encoding. 171 | - `enc2native()` encodes its input into this encoding. 172 | - Printing to the screen (via `cat()`, `print()`, `writeLines()`, etc.) re-encodes strings into this encoding. 173 | 174 | It is surprisingly tricky to query the current native encoding, especially on older R versions. Here are a number of things to try: 175 | 176 | 1. Call `l10n_info()`. If the native encoding is UTF-8 or latin1 you'll see that in the output: 177 | 178 | ``` r 179 | ❯ l10n_info() 180 | $MBCS 181 | [1] TRUE 182 | 183 | $`UTF-8` 184 | [1] TRUE 185 | 186 | $`Latin-1` 187 | [1] FALSE 188 | ``` 189 | 190 | 2. On Windows, you'll also see the code page: 191 | 192 | ``` r 193 | ❯ l10n_info() 194 | $MBCS 195 | [1] FALSE 196 | 197 | $`UTF-8` 198 | [1] FALSE 199 | 200 | $`Latin-1` 201 | [1] TRUE 202 | 203 | $codepage 204 | [1] 1252 205 | 206 | $system.codepage 207 | [1] 1252 208 | ``` 209 | 210 | Append the codepage number to `"CP"` to get a string that you can use in `iconv()`. E.g. in this case it would be `"CP1252"`. 211 | 212 | 3. On Unix, from R 4.1 the name of the encoding is also included in the answer. This is useful on non-UTF-8 systems: 213 | 214 | ``` r 215 | ❯ l10n_info() 216 | $MBCS 217 | [1] FALSE 218 | 219 | $`UTF-8` 220 | [1] FALSE 221 | 222 | $`Latin-1` 223 | [1] FALSE 224 | 225 | $codeset 226 | [1] "ISO-8859-15" 227 | ``` 228 | 229 | 4. On older R versions this field is not included. 
On R 3.5 and above you can use the following trick to find the encoding: 230 | 231 | ``` r 232 | ❯ rawToChar(serialize(NULL, NULL, ascii = TRUE, version = 3)) 233 | [1] "A\n3\n262148\n197888\n5\nUTF-8\n254\n" 234 | ``` 235 | 236 | Line number 6 (i.e. after the fifth `\n`) in the output is the name of the encoding. On R 4.0.x you can also save an RDS file and then use the `infoRDS()` function on it to see the current native encoding. 237 | 238 | 5. Otherwise you can call `Sys.getlocale()`; parsing its output will probably give you an encoding name that works in `iconv()`: 239 | 240 | ``` r 241 | ❯ Sys.getlocale("LC_CTYPE") 242 | [1] "en_US.iso885915" 243 | ``` 244 | 245 | ## How is the `encoding` argument and option used? 246 | 247 | In general, functions use the `encoding` argument in two ways: 248 | 249 | - For some functions it specifies the encoding of the input. 250 | - For others it specifies how the output should be marked. 251 | 252 | For example 253 | 254 | ``` r 255 | file("input.txt", encoding = "UTF-8") 256 | ``` 257 | 258 | tells `file()` that `input.txt` is in `UTF-8`. On the other hand 259 | 260 | ``` r 261 | scan("input2.txt", what = "", encoding = "UTF-8") 262 | ``` 263 | 264 | means that the output of `scan()` will be *marked* as having UTF-8 encoding. (`scan()` also has a `fileEncoding` argument, which specifies the encoding of the input file.) 265 | 266 | TODO: the weirdness of `readLines()`. 267 | 268 | ## How to read lines from UTF-8 files? 269 | 270 | ### With the brio package 271 | 272 | The simplest option is to use the brio package, if you can take on the dependency: 273 | 274 | ``` r 275 | brio::read_lines("file") 276 | ``` 277 | 278 | (brio also has a `read_file()` function if you want to read the whole file into a single string.) 279 | 280 | brio does not currently check if the file is valid in UTF-8. See below for how to do that. 
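Since brio does not validate the input, you can combine it with the base R `iconv()` trick from the how-to about checking UTF-8 validity. A sketch, where `read_utf8_checked()` is a hypothetical helper, not part of brio:

```r
read_utf8_checked <- function(path) {
  lines <- brio::read_lines(path)
  # iconv() returns NA for elements that are invalid in the input encoding
  bad <- is.na(iconv(lines, "UTF-8", "UTF-8"))
  if (any(bad)) {
    stop("Invalid UTF-8 in ", path, ", on line(s) ",
         paste(which(bad), collapse = ", "))
  }
  lines
}
```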
281 | 282 | ### With base R only 283 | 284 | The xfun package has a nice function that reads UTF-8 files with base R only: `xfun::read_utf8()`. After some simplification, it currently looks like this: 285 | 286 | ``` r 287 | read_utf8 <- function(path) { 288 | opts <- options(encoding = "native.enc") 289 | on.exit(options(opts), add = TRUE) 290 | x <- readLines(path, encoding = "UTF-8", warn = FALSE) 291 | x 292 | } 293 | ``` 294 | 295 | (The original function also checks that the returned lines are valid UTF-8.) 296 | 297 | Explanation and notes: 298 | 299 | - `path` must be a file name, an un-opened connection, or a connection that was opened in the native encoding (i.e. with `encoding = "native.enc"`). Otherwise `read_utf8()` might (silently) return lines in the wrong encoding. 300 | - If `readLines()` gets a file name, then it opens a connection to it in `UTF-8`, so the file is not re-encoded, and it also marks the result as `UTF-8`. 301 | - For extra safety, you can add a check that `path` is a file name. 302 | - As far as I can tell, there is no R API to query the encoding of a connection. 303 | 304 | TODO: does this work correctly with non-ASCII file names? 305 | 306 | ## How to write lines to UTF-8 files? 307 | 308 | ### With the brio package 309 | 310 | Call `brio::write_file()` to write a character vector to a file. Notes: 311 | 312 | 1. `brio::write_file()` converts the input character vector to UTF-8. It uses the marked encoding for this, so if that is not correct, then the conversion won't be correct, either. 313 | 2. It handles UTF-8 file names correctly. 314 | 3. `brio::write_file()` cannot write to connections currently, because with already opened connections there is no way to tell their encoding. 
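A quick round trip to illustrate the conversion; a sketch, assuming brio is installed, and using its line-oriented `brio::write_lines()` counterpart:

```r
# latin1-marked input; brio converts it to UTF-8 when writing
x <- "t\xfck\xf6rf\xfar\xf3g\xe9p"
Encoding(x) <- "latin1"
tmp <- tempfile(fileext = ".txt")
brio::write_lines(x, tmp)
# the file now contains UTF-8 bytes, independently of the native encoding,
# and reading it back gives a UTF-8-marked string
charToRaw(brio::read_lines(tmp))
```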
315 | 316 | ### With base R 317 | 318 | Kevin Ushey's excellent [blog post](https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/) has a detailed derivation of this base R function: 319 | 320 | ```{r} 321 | write_utf8 <- function(text, path) { 322 | # step 1: ensure our text is utf8 encoded 323 | utf8 <- enc2utf8(text) 324 | upath <- enc2utf8(path) 325 | 326 | # step 2: create a connection with 'native' encoding 327 | # this signals to R that translation before writing 328 | # to the connection should be skipped 329 | con <- file(upath, open = "w+", encoding = "native.enc") 330 | on.exit(close(con), add = TRUE) 331 | 332 | # step 3: write to the connection with 'useBytes = TRUE', 333 | # telling R to skip translation to the native encoding 334 | writeLines(utf8, con = con, useBytes = TRUE) 335 | } 336 | ``` 337 | 338 | (This is a slightly more robust version than the one in the blog post.) 339 | 340 | TODO: does this work correctly with non-ASCII file names? 341 | 342 | ## How to convert text between encodings? 343 | 344 | The `iconv()` base R function can convert between encodings. It ignores the `Encoding()` markings of the input completely. It uses `""` as a synonym for the native encoding. 345 | 346 | ## How to guess the encoding of text? 347 | 348 | The `stringi::stri_enc_detect()` function tries to detect the encoding of a string. This is mostly guessing, of course. 349 | 350 | ## How to check if a string is UTF-8? 351 | 352 | Not all bytes and byte sequences are valid in UTF-8. There are a few alternatives to check if a string is valid: 353 | 354 | - `stringi::stri_enc_isutf8()` 355 | 356 | - `utf8::utf8_valid()` 357 | 358 | - The following trick using only base R: 359 | 360 | ``` r 361 | ! is.na(iconv(x, "UTF-8", "UTF-8")) 362 | ``` 363 | 364 | `iconv()` returns `NA` for the elements that are invalid in the input encoding. 365 | 366 | ## How to download web pages in the right encoding? 
367 | 368 | TODO 369 | 370 | ## Are RDS files encoding-safe? 371 | 372 | No, in general they are not, but the situation is quite good. If you save an RDS file that contains text in some encoding, it is safe to read that file on a different computer (or on the same computer with different settings), if at least one of these conditions holds: 373 | 374 | - all text is either UTF-8 or latin1 encoded, and marked as such, or 375 | - both computers (or settings) have the same native encoding, or 376 | - the RDS file has version 3, and the loading platform can represent all characters in the RDS file. This usually holds if the loading platform is UTF-8. 377 | 378 | Note that from RDS version 3, strings in the native encoding are re-encoded to the current native encoding when the RDS file is loaded. 379 | 380 | ## How to `parse()` UTF-8 text into UTF-8 code? 381 | 382 | On R 3.6 and before, we need to parse from a connection to create code containing UTF-8 strings from a UTF-8 character vector. It goes like this: 383 | 384 | ``` r 385 | safe_parse <- function(text) { 386 | text <- enc2utf8(text) 387 | Encoding(text) <- "unknown" 388 | con <- textConnection(text) 389 | on.exit(close(con), add = TRUE) 390 | eval(parse(con, encoding = "UTF-8")) 391 | } 392 | ``` 393 | 394 | - By marking the text as `unknown` we make sure that `textConnection()` will not convert it. 395 | - The `encoding` argument of `parse()` marks the output as `UTF-8`. 396 | - Note that when `parse()` parses from a connection, it loses the source references. You can use the `srcfile` argument of `parse()` to add them back. 397 | 398 | ## How to `deparse()` UTF-8 code into UTF-8 text? 399 | 400 | TODO 401 | 402 | ## How to include UTF-8 characters in a package `DESCRIPTION` 403 | 404 | Some fields in `DESCRIPTION` may contain non-ASCII characters. Set the `Encoding` field to `UTF-8` to use UTF-8 characters in these fields: 405 | 406 | Encoding: UTF-8 407 | 408 | ## How to get a package `DESCRIPTION` in UTF-8? 
409 | 410 | The desc package always converts `DESCRIPTION` files to `UTF-8`. 411 | 412 | ## How to use UTF-8 file names on Windows? 413 | 414 | You can use functions or a package that already handles UTF-8 file names, e.g. brio or processx. If you need to open a file from C code, you need to do the following: 415 | 416 | - Convert the file name to UTF-8 just before passing it to C. In generic code `enc2utf8()` will do. 417 | - In the C code, on Windows, convert the UTF-8 path to UTF-16 using the `MultiByteToWideChar()` Windows API function. 418 | - Use the `_wfopen()` Windows API function to open the file. 419 | 420 | Here is an example from the brio package: 421 | 422 | ## How to capture UTF-8 output? 423 | 424 | This is only possible on UTF-8 systems, and there is no portable way that works on Windows or other non-UTF-8 platforms. If the native encoding is UTF-8, then `sink()` and `capture.output()` will create text in UTF-8. 425 | 426 | Capturing UTF-8 output on Windows (or on a Unix system that does not support a UTF-8 locale) is not currently possible. (Well, on Windows there is a very dirty trick, but it involves changing an internal R flag and running the code in a subprocess.) 427 | 428 | ### Are testthat snapshot tests encoding-safe? 429 | 430 | Somewhat. testthat 3 by default turns off non-ASCII output from cli (and thus tibble, etc.). So your snapshots that use cli facilities will be in ASCII, which is safe to use anywhere. 431 | 432 | If you have non-ASCII output from other sources, then hopefully there is a switch to turn it off. As far as I can tell, rlang uses cli to produce non-ASCII output, so it should be fine. 433 | 434 | ## How to test non-ASCII output? 435 | 436 | If you need to test non-ASCII output produced by the cli package, then: 437 | 438 | - use snapshot tests, 439 | - call `testthat::local_reproducible_output(unicode = TRUE)` at the beginning of your test, and 440 | - record your snapshots on a UTF-8 platform. 
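The recipe above can be sketched as a test; the alert message here is illustrative:

```r
test_that("non-ASCII output prints as expected", {
  # opt in to Unicode output for this test only
  testthat::local_reproducible_output(unicode = TRUE)
  testthat::expect_snapshot(
    cli::cli_alert_success("Done!")
  )
})
```

On the first run testthat records the snapshot under `_snaps/`; later runs compare against it.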
441 | 442 | Your tests will not run on non-UTF-8 platforms. It is not currently possible to record non-ASCII snapshot tests on non-UTF-8 platforms. 443 | 444 | ## How to avoid non-ASCII characters in the manual? 445 | 446 | If you accidentally included some non-ASCII characters in the manual, e.g. because you included some UTF-8 output from cli or tibble/pillar, then you can set `options(cli.unicode = FALSE)` at the right place to avoid them. 447 | 448 | The `tools::showNonASCIIfile()` function helps you find non-ASCII characters in a package. 449 | 450 | ## How to include non-ASCII characters in PDF vignettes? 451 | 452 | TODO 453 | 454 | # Encoding related R functions and packages 455 | 456 | - `base::charToRaw()` 457 | 458 | - `base::Encoding()` and `` base::`Encoding<-` `` 459 | 460 | - `base::iconv()` 461 | 462 | - `base::iconvlist()` 463 | 464 | - `base::nchar()` 465 | 466 | - `base::l10n_info()` 467 | 468 | - [stringi package](https://cran.rstudio.com/web/packages/stringi/) 469 | 470 | - [utf8 package](https://cran.rstudio.com/web/packages/utf8) 471 | 472 | - [brio package](https://cran.rstudio.com/web/packages/brio) 473 | 474 | - [cli package](https://cran.rstudio.com/web/packages/cli/) 475 | 476 | - [Unicode package](https://cran.rstudio.com/web/packages/Unicode) 477 | 478 | # Known encoding issues in packages and functions 479 | 480 | - `yaml::write_yaml` crashes on Windows on latin1 encoded strings: 481 | 482 | # Tips for debugging encoding issues 483 | 484 | - `charToRaw()` is your best friend. 485 | 486 | - Don't forget: even if two strings print the same, and even if they are `identical()`, they can still be in different encodings. `charToRaw()` is your best friend. 487 | 488 | - `testthat::CheckReporter` saves a `testthat-problems.rds` file if there were any test failures. You can get this file from win-builder, R-hub, etc. The file is a version 2 RDS file, so no encoding conversion will be done by `readRDS()`. 489 | 490 | - Don't trust any function that processes text. 
Some functions keep the encoding of the input, some convert to the native encoding, some convert to UTF-8. Some convert to UTF-8 without marking the output as UTF-8. Different R versions do different re-encodings. 491 | 492 | - Typing in a string on the console is not the same as `parse()`-ing the same code from a file. The console is always assumed to provide text in the native encoding, while package files are typically assumed to be UTF-8. (This is why it is important to use `\uxxxx` escape sequences for UTF-8 text.) 493 | 494 | # Text transformers 495 | 496 | - `print()` 497 | 498 | - `format()` 499 | 500 | - `normalizePath()` 501 | 502 | - `basename()` 503 | 504 | - `encodeString()` 505 | 506 | # Changes in R 4.1.0 507 | 508 | TODO 509 | 510 | # Changes in R 4.2.0 511 | 512 | TODO 513 | 514 | # Further Reading 515 | 516 | Good general introductions to text encodings: 517 | 518 | - [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) 519 | 520 | - [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](https://kunststube.net/encoding/) 521 | 522 | - [UTF-8 Everywhere](http://utf8everywhere.org) 523 | 524 | R documentation: 525 | 526 | - [Encodings for CHARSXPs](https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Encodings-for-CHARSXPs) section in [R internals](https://cran.r-project.org/doc/manuals/r-devel/R-ints.html) 527 | 528 | - [Writing portable packages / Encoding issues](https://cran.r-project.org/doc/manuals/R-exts.html#Encoding-issues) in [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) 529 | 530 | - [Encodings and R](https://developer.r-project.org/Encodings_and_R.html) -- *Old and slightly outdated, it gives a good summary of what it 
took to change R's internal encoding.* 531 | 532 | Intros to various encodings: 533 | 534 | - [UTF-8 on Wikipedia](https://en.wikipedia.org/wiki/UTF-8) 535 | 536 | - [UTF-16 on Wikipedia](https://en.wikipedia.org/wiki/UTF-16) 537 | 538 | - [latin-1 (ISO/IEC 8859-1) on Wikipedia](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) 539 | 540 | About Unicode: 541 | 542 | - [The main Unicode homepage](https://home.unicode.org/) 543 | 544 | - [Unicode FAQs](https://unicode.org/faq/) 545 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | The R Encoding FAQ 2 | ================ 3 | 4 | - [Introduction](#introduction) 5 | - [Contributing](#contributing) 6 | - [Encoding of text](#encoding-of-text) 7 | - [R’s text representation](#rs-text-representation) 8 | - [R Connections and encodings](#r-connections-and-encodings) 9 | - [Printing text to the console](#printing-text-to-the-console) 10 | - [Display width](#display-width) 11 | - [Issues when building packages](#issues-when-building-packages) 12 | - [Code files](#code-files) 13 | - [Symbols](#symbols) 14 | - [Unit tests](#unit-tests) 15 | - [Manual pages](#manual-pages) 16 | - [Vignettes](#vignettes) 17 | - [Graphics](#graphics) 18 | - [How-to](#how-to) 19 | - [How to query the current native 20 | encoding?](#how-to-query-the-current-native-encoding) 21 | - [How is the `encoding` argument and option 22 | used?](#how-is-the-encoding-argument-and-option-used) 23 | - [How to read lines from UTF-8 24 | files?](#how-to-read-lines-from-utf-8-files) 25 | - [How to write lines to UTF-8 26 | files?](#how-to-write-lines-to-utf-8-files) 27 | - [How to convert text between 28 | encodings?](#how-to-convert-text-between-encodings) 29 | - [How to guess the encoding of 30 | text?](#how-to-guess-the-encoding-of-text) 31 | - [How to check if a string is 32 | UTF-8?](#how-to-check-if-a-string-is-utf-8) 33 | - [How to 
download web pages in the right 34 | encoding?](#how-to-download-web-pages-in-the-right-encoding) 35 | - [Are RDS files encoding-safe?](#are-rds-files-encoding-safe) 36 | - [How to `parse()` UTF-8 text into UTF-8 37 | code?](#how-to-parse-utf-8-text-into-utf-8-code) 38 | - [How to `deparse()` UTF-8 code into UTF-8 39 | text?](#how-to-deparse-utf-8-code-into-utf-8-text) 40 | - [How to include UTF-8 characters in a package 41 | `DESCRIPTION`](#how-to-include-utf-8-characters-in-a-package-description) 42 | - [How to get a package `DESCRIPTION` in 43 | UTF-8?](#how-to-get-a-package-description-in-utf-8) 44 | - [How to use UTF-8 file names on 45 | Windows?](#how-to-use-utf-8-file-names-on-windows) 46 | - [How to capture UTF-8 output?](#how-to-capture-utf-8-output) 47 | - [How to test non-ASCII output?](#how-to-test-non-ascii-output) 48 | - [How to avoid non-ASCII characters in the 49 | manual?](#how-to-avoid-non-ascii-characters-in-the-manual) 50 | - [How to include non-ASCII characters in PDF 51 | vignettes?](#how-to-include-non-ascii-characters-in-pdf-vignettes) 52 | - [Encoding related R functions and 53 | packages](#encoding-related-r-functions-and-packages) 54 | - [Known encoding issues in packages and 55 | functions](#known-encoding-issues-in-packages-and-functions) 56 | - [Tips for debugging encoding 57 | issues](#tips-for-debugging-encoding-issues) 58 | - [Text transformers](#text-transformers) 59 | - [Changes in R 4.1.0](#changes-in-r-410) 60 | - [Changes in R 4.2.0](#changes-in-r-420) 61 | - [Further Reading](#further-reading) 62 | 63 | # Introduction 64 | 65 | The goal of this document is twofold: (1) collect current best practices 66 | to solve text encoding related issues and (2) include links to further 67 | information about text encoding. 68 | 69 | # Contributing 70 | 71 | Encoding of text is a large topic, and to make this FAQ useful, we need 72 | your contributions. 
There are a variety of ways to contribute: 73 | 74 | - Found a typo, a broken link, or some example code that is not working for 75 | you? Please open an issue about it. Or even better, consider fixing it 76 | in a pull request. 77 | - Something is missing? Is there a new (or old) package or function that 78 | should be mentioned here? Please open an issue about it. 79 | - Some information is wrong or something is not explained well? Please 80 | open an issue about it. Or even better, fix it in a pull request! 81 | - You found an answer useful? Please tweet about it and link to this 82 | FAQ. 83 | 84 | Your contributions are most appreciated. 85 | 86 | # Encoding of text 87 | 88 | Good general introductions to text encodings: 89 | 90 | - [The Absolute Minimum Every Software Developer Absolutely, Positively 91 | Must Know About Unicode and Character Sets (No 92 | Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) 93 | 94 | - [What Every Programmer Absolutely, Positively Needs To Know About 95 | Encodings And Character Sets To Work With 96 | Text](https://kunststube.net/encoding/) 97 | 98 | # R’s text representation 99 | 100 | See the R Internals manual: 101 | 102 | 103 | 104 | # R Connections and encodings 105 | 106 | Some notes about connections. 107 | 108 | - Connections assume that the input is in the encoding specified by 109 | `getOption("encoding")`. They convert text to the native encoding. 110 | - Functions that *create* connections have an `encoding` argument that 111 | overrides `getOption("encoding")`. 112 | - If you tell a connection that the input is already in the native 113 | encoding, then it will not convert the input. If the input is in fact 114 | *not* in the native encoding, then it is up to the user to mark it 115 | accordingly, e.g. as UTF-8.
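The notes above can be sketched in a few lines (the file and its bytes are made up for this example). The `encoding` argument describes the *input*, and the text is converted to the native encoding as it is read:

``` r
path <- tempfile()
writeBin(as.raw(c(0xfc, 0x0a)), path)   # "ü" plus a newline, in latin1

con <- file(path, open = "r", encoding = "latin1")
x <- readLines(con)
close(con)

x            # "ü", now in the native encoding
charToRaw(x) # c3 bc on a UTF-8 platform, fc on a latin1 one
```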
116 | 117 | # Printing text to the console 118 | 119 | R converts strings to the native encoding when printing them to the 120 | screen. This is important to know when debugging encoding problems: two 121 | strings that print the same way may have a different internal 122 | representation: 123 | 124 | ``` r 125 | s1 <- "\xfc" 126 | Encoding(s1) <- "latin1" 127 | s2 <- iconv(s1, "latin1", "UTF-8") 128 | s1 129 | #> [1] "ü" 130 | s2 131 | #> [1] "ü" 132 | ``` 133 | 134 | ``` r 135 | s1 == s2 136 | #> [1] TRUE 137 | identical(s1, s2) 138 | #> [1] TRUE 139 | testthat::expect_equal(s1, s2) 140 | testthat::expect_identical(s1, s2) 141 | ``` 142 | 143 | ``` r 144 | Encoding(s1) 145 | #> [1] "latin1" 146 | Encoding(s2) 147 | #> [1] "UTF-8" 148 | ``` 149 | 150 | ``` r 151 | charToRaw(s1) 152 | #> [1] fc 153 | charToRaw(s2) 154 | #> [1] c3 bc 155 | ``` 156 | 157 | ## Display width 158 | 159 | Some Unicode glyphs, e.g. typical emojis, are *wide*. They are supposed 160 | to take up two character cells in a monospace font. The set of wide glyphs 161 | (just like the set of glyphs in general) has been changing in different 162 | Unicode versions. 163 | 164 | Up to version 4.0.3 R followed Unicode 8 (released in 2015) in terms of 165 | character width, so `nchar(..., type = "width")` miscalculated the width 166 | of many Asian characters. 167 | 168 | R 4.0.4 follows Unicode 12.1. 169 | 170 | R 4.1.0 - R 4.3.3 follow Unicode 13.0. 171 | 172 | Current R-devel (will be R 4.4.0 in about a month) still uses Unicode 173 | 13.0. 174 | 175 | To correctly calculate the display width of Unicode text, across various 176 | R versions, you can use the utf8 package, or `cli::ansi_nchar()`. 177 | 178 | Note that some terminals, and also RStudio, follow an older version 179 | of the Unicode standard, so they too calculate the display width 180 | incorrectly, typically printing characters on top of each other.
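A quick sketch of the width calculations mentioned above (assuming the utf8 and cli packages are installed; the exact numbers depend on the R and package versions):

``` r
smiley <- "\U0001F60A"               # a wide emoji

nchar(smiley, type = "width")        # depends on the R version, see above
utf8::utf8_width(smiley)             # typically 2
cli::ansi_nchar(smiley, "width")     # typically 2, also handles ANSI escapes
```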
181 | 182 | # Issues when building packages 183 | 184 | See ‘Writing R Extensions’ about including UTF-8 characters in packages: 185 | 186 | 187 | ## Code files 188 | 189 | I suggest that you keep your source files in ASCII, except for comments, 190 | which you can write in UTF-8 if you like. (But not necessarily in 191 | roxygen2 comments, which will go in the manual, see below!) 192 | 193 | If you need a string containing non-ASCII characters, construct it with 194 | `\uxxxx` or `\U{xxxxxx}` escape sequences, which yield a UTF-8-encoded 195 | string. 196 | 197 | ``` r 198 | one_and_a_half <- "\u0031\u00BD" 199 | one_and_a_half 200 | #> [1] "1½" 201 | Encoding(one_and_a_half) 202 | #> [1] "UTF-8" 203 | charToRaw(one_and_a_half) 204 | #> [1] 31 c2 bd 205 | ``` 206 | 207 | Here are some handy ways to find the Unicode code points for an existing 208 | string: 209 | 210 | - Copy and paste into the [Unicode character 211 | inspector](https://apps.timwhitlock.info/unicode/inspect). 212 | - `sprintf("\\u%04X", utf8ToInt(x))` 213 | - *Do we have other suggestions?* 214 | 215 | If you need to include non-UTF-8 non-ASCII characters, e.g. for testing, 216 | then include them via raw data, or via `\xnn` (hex) escape sequences. E.g. to 217 | include a latin1 string, you can do either of these: 218 | 219 | ``` r 220 | t1 <- "t\xfck\xf6rf\xfar\xf3g\xe9p" 221 | Encoding(t1) <- "latin1" 222 | t2 <- rawToChar(as.raw(c( 223 | 0x74, 0xfc, 0x6b, 0xf6, 0x72, 0x66, 0xfa, 0x72, 224 | 0xf3, 0x67, 0xe9, 0x70))) 225 | Encoding(t2) <- "latin1" 226 | t1 == t2 227 | #> [1] TRUE 228 | identical(charToRaw(t1), charToRaw(t2)) 229 | #> [1] TRUE 230 | ``` 231 | 232 | ## Symbols 233 | 234 | Symbols must be in the native encoding, so best practice is to only use 235 | ASCII symbols. 236 | 237 | This is not completely equivalent to using ASCII code files, because 238 | names, e.g. in a named list, or the row names in a data frame are also 239 | symbols in R. So do not use non-ASCII names.
(Another reason to avoid 240 | row names in data frames.) 241 | 242 | A typical mistake is to assign file names as list or row names when 243 | working with files. While it might be a convenient representation, it is 244 | a bad idea because of the possible encoding errors. 245 | 246 | Similarly, always use ASCII column names. 247 | 248 | ## Unit tests 249 | 250 | Unit test files are code files, so the same rule applies to them: keep 251 | them ASCII. See the ‘How-to’ section below on specific testing-related 252 | tasks. 253 | 254 | ## Manual pages 255 | 256 | You can use *some* UTF-8 characters in manual pages, if you include 257 | `\encoding{UTF-8}` in the manual page, or declare `Encoding: UTF-8` 258 | in the `DESCRIPTION` file of the package. 259 | 260 | Unfortunately these do not include the characters typically used in the 261 | output of tibble and cli. The problem is with the PDF manual and LaTeX, 262 | so you can conditionally include UTF-8 in the HTML output with the `\if` 263 | Rd command. 264 | 265 | If your PDF manual does not build because a non-supported UTF-8 266 | character has sneaked in somewhere, `tools::showNonASCIIfile()` can show 267 | exactly where it is. 268 | 269 | ## Vignettes 270 | 271 | TODO 272 | 273 | ## Graphics 274 | 275 | TODO 276 | 277 | # How-to 278 | 279 | ## How to query the current native encoding? 280 | 281 | R uses the native encoding extensively. Some examples: 282 | 283 | - Symbols (i.e. variable names, names in a named list, column and row 284 | names, etc.) are always in the native encoding. 285 | - Strings marked with `"unknown"` encoding are assumed to be in this 286 | encoding. 287 | - Output connections assume that the output is in this encoding. 288 | - Input connections convert the input into this encoding. 289 | - `enc2native()` encodes its input into this encoding. 290 | - Printing to the screen (via `cat()`, `print()`, `writeLines()`, etc.) 291 | re-encodes strings into this encoding.
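A small illustration of the `enc2native()` case (a sketch; the resulting bytes depend on your platform's native encoding):

``` r
x <- "\u00fc"              # "ü", always UTF-8 thanks to the \u escape
Encoding(x)                # "UTF-8"
charToRaw(enc2native(x))   # c3 bc on a UTF-8 platform, fc on latin1
```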
292 | 293 | It is surprisingly tricky to query the current native encoding, 294 | especially on older R versions. Here are a number of things to try: 295 | 296 | 1. Call `l10n_info()`. If the native encoding is UTF-8 or latin1 you’ll 297 | see that in the output: 298 | 299 | ``` r 300 | ❯ l10n_info() 301 | $MBCS 302 | [1] TRUE 303 | 304 | $`UTF-8` 305 | [1] TRUE 306 | 307 | $`Latin-1` 308 | [1] FALSE 309 | ``` 310 | 311 | 2. On Windows, you’ll also see the code page: 312 | 313 | ``` r 314 | ❯ l10n_info() 315 | $MBCS 316 | [1] FALSE 317 | 318 | $`UTF-8` 319 | [1] FALSE 320 | 321 | $`Latin-1` 322 | [1] TRUE 323 | 324 | $codepage 325 | [1] 1252 326 | 327 | $system.codepage 328 | [1] 1252 329 | ``` 330 | 331 | Append the codepage number to `"CP"` to get a string that you can 332 | use in `iconv()`. E.g. in this case it would be `"CP1252"`. 333 | 334 | 3. On Unix, from R 4.1 the name of the encoding is also included in the 335 | answer. This is useful on non-UTF-8 systems: 336 | 337 | ``` r 338 | ❯ l10n_info() 339 | $MBCS 340 | [1] FALSE 341 | 342 | $`UTF-8` 343 | [1] FALSE 344 | 345 | $`Latin-1` 346 | [1] FALSE 347 | 348 | $codeset 349 | [1] "ISO-8859-15" 350 | ``` 351 | 352 | 4. On older R versions this field is not included. On R 3.5 and above 353 | you can use the following trick to find the encoding: 354 | 355 | ``` r 356 | ❯ rawToChar(serialize(NULL, NULL, ascii = TRUE, version = 3)) 357 | [1] "A\n3\n262148\n197888\n5\nUTF-8\n254\n" 358 | ``` 359 | 360 | Line number 6 (i.e. after the fifth `\n`) in the output is the name 361 | of the encoding. On R 4.0.x you can also save an RDS file and then 362 | use the `infoRDS()` function on it to see the current native 363 | encoding. 364 | 365 | 5. 
Otherwise you can call `Sys.getlocale()`; parsing its output will 366 | probably give you an encoding name that works in `iconv()`: 367 | 368 | ``` r 369 | ❯ Sys.getlocale("LC_CTYPE") 370 | [1] "en_US.iso885915" 371 | ``` 372 | 373 | ## How is the `encoding` argument and option used? 374 | 375 | In general, functions use the `encoding` argument in two ways: 376 | 377 | - For some functions it specifies the encoding of the input. 378 | - For others it specifies how output should be marked. 379 | 380 | For example 381 | 382 | ``` r 383 | file("input.txt", encoding = "UTF-8") 384 | ``` 385 | 386 | tells `file()` that `input.txt` is in `UTF-8`. On the other hand 387 | 388 | ``` r 389 | scan("input2.txt", what = "", encoding = "UTF-8") 390 | ``` 391 | 392 | means that the output of `scan()` will be *marked* as having UTF-8 393 | encoding. (`scan()` also has a `fileEncoding` argument, which specifies 394 | the encoding of the input file.) 395 | 396 | TODO: the weirdness of `readLines()`. 397 | 398 | ## How to read lines from UTF-8 files? 399 | 400 | ### With the brio package 401 | 402 | The simplest option is to use the brio package, if you can depend on it: 403 | 404 | ``` r 405 | brio::read_lines("file") 406 | ``` 407 | 408 | (brio also has a `read_file()` function if you want to read the whole 409 | file into a single string.) 410 | 411 | brio does not currently check if the file is valid UTF-8. See below 412 | for how to do that. 413 | 414 | ### With base R only 415 | 416 | The xfun package has a nice function that reads UTF-8 files with base R 417 | only: `xfun::read_utf8()`. It currently reads like this, after some 418 | simplification: 419 | 420 | ``` r 421 | read_utf8 <- function (path) { 422 | opts <- options(encoding = "native.enc") 423 | on.exit(options(opts), add = TRUE) 424 | x <- readLines(path, encoding = "UTF-8", warn = FALSE) 425 | x 426 | } 427 | ``` 428 | 429 | (The original function also checks that the returned lines are valid 430 | UTF-8.)
431 | 432 | Explanation and notes: 433 | 434 | - `path` must be a file name, or an un-opened connection, or a 435 | connection that was opened in the native encoding (i.e. with 436 | `encoding = "native.enc"`). Otherwise `read_utf8()` might (silently) 437 | return lines in the wrong encoding. 438 | - If `readLines()` gets a file name, then it opens a connection to it in 439 | `native.enc` (because of the `options()` call), so the file is not re-encoded, 440 | and the `encoding = "UTF-8"` argument marks the result as `UTF-8`. 441 | - For extra safety, you can add a check that `path` is a file name. 442 | - As far as I can tell, there is no R API to query the encoding of a 443 | connection. 444 | 445 | TODO: does this work correctly with non-ASCII file names? 446 | 447 | ## How to write lines to UTF-8 files? 448 | 449 | ### With the brio package 450 | 451 | Call `brio::write_lines()` to write a character vector to a file. Notes: 452 | 453 | 1. `brio::write_lines()` converts the input character vector to UTF-8. 454 | It uses the marked encoding for this, so if that is not correct, 455 | then the conversion won’t be correct, either. 456 | 2. It handles UTF-8 file names correctly. 457 | 3. `brio::write_lines()` cannot write to connections currently, because 458 | with already opened connections there is no way to tell their 459 | encoding.
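A minimal round trip might look like this (a sketch, assuming the brio package is installed; `brio::write_lines()` writes a character vector line by line, and `brio::read_lines()` reads it back marked as UTF-8):

``` r
txt <- c("t\u00fck\u00f6r", "second line")
path <- tempfile()

brio::write_lines(txt, path)   # always writes UTF-8 bytes
brio::read_lines(path)         # read back, marked as UTF-8
```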
460 | 461 | ### With base R 462 | 463 | Kevin Ushey’s excellent [blog 464 | post](https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/) 465 | has a detailed derivation of this base R function: 466 | 467 | ``` r 468 | write_utf8 <- function(text, path) { 469 | # step 1: ensure our text is utf8 encoded 470 | utf8 <- enc2utf8(text) 471 | upath <- enc2utf8(path) 472 | 473 | # step 2: create a connection with 'native' encoding 474 | # this signals to R that translation before writing 475 | # to the connection should be skipped 476 | con <- file(upath, open = "w+", encoding = "native.enc") 477 | on.exit(close(con), add = TRUE) 478 | 479 | # step 3: write to the connection with 'useBytes = TRUE', 480 | # telling R to skip translation to the native encoding 481 | writeLines(utf8, con = con, useBytes = TRUE) 482 | } 483 | ``` 484 | 485 | (This is a slightly more robust version than the one in the blog post.) 486 | 487 | TODO: does this work correctly with non-ASCII file names? 488 | 489 | ## How to convert text between encodings? 490 | 491 | The `iconv()` base R function can convert between encodings. It ignores 492 | the `Encoding()` markings of the input completely. It uses `""` as the 493 | synonym for the native encoding. 494 | 495 | ## How to guess the encoding of text? 496 | 497 | The `stringi::stri_enc_detect()` function tries to detect the encoding 498 | of a string. This is mostly guessing, of course. 499 | 500 | ## How to check if a string is UTF-8? 501 | 502 | Not all bytes and byte sequences are valid in UTF-8. There are a few 503 | alternatives to check if a string is valid: 504 | 505 | - `stringi::stri_enc_isutf8()` 506 | 507 | - `utf8::utf8_valid()` 508 | 509 | - The following trick using only base R: 510 | 511 | ``` r 512 | ! is.na(iconv(x, "UTF-8", "UTF-8")) 513 | ``` 514 | 515 | `iconv()` returns `NA` for the elements that are invalid in the input 516 | encoding. 517 | 518 | ## How to download web pages in the right encoding?
519 | 520 | TODO 521 | 522 | ## Are RDS files encoding-safe? 523 | 524 | No, in general they are not, but the situation is quite good. If you 525 | save an RDS file that contains text in some encoding, it is safe to read 526 | that file on a different computer (or the same computer with different 527 | settings), if at least one of these conditions holds: 528 | 529 | - all text is either UTF-8 or latin1 encoded, and it is also marked 530 | as such, or 531 | - both computers (or settings) have the same native encoding, or 532 | - the RDS file has version 3, and the loading platform can represent all 533 | characters in the RDS file. This usually holds if the loading platform 534 | is UTF-8. 535 | 536 | Note that from RDS version 3 the strings in the native encoding are 537 | re-encoded to the current native encoding when the RDS file is loaded. 538 | 539 | ## How to `parse()` UTF-8 text into UTF-8 code? 540 | 541 | On R 3.6 and before, we need to create a connection to create code with 542 | UTF-8 strings from a UTF-8 character vector. It goes like this: 543 | 544 | ``` r 545 | safe_parse <- function(text) { 546 | text <- enc2utf8(text) 547 | Encoding(text) <- "unknown" 548 | con <- textConnection(text) 549 | on.exit(close(con), add = TRUE) 550 | eval(parse(con, encoding = "UTF-8")) 551 | } 552 | ``` 553 | 554 | - By marking the text as `unknown` we make sure that `textConnection()` 555 | will not convert it. 556 | - The `encoding` argument of `parse()` marks the output as `UTF-8`. 557 | - Note that when `parse()` parses from a connection it will lose the 558 | source references. You can use the `srcfile` argument of `parse()` to 559 | add them back. 560 | 561 | ## How to `deparse()` UTF-8 code into UTF-8 text? 562 | 563 | TODO 564 | 565 | ## How to include UTF-8 characters in a package `DESCRIPTION` 566 | 567 | Some fields in `DESCRIPTION` may contain non-ASCII characters.
Set the 568 | `Encoding` field to `UTF-8` to use UTF-8 characters in these: 569 | 570 | Encoding: UTF-8 571 | 572 | ## How to get a package `DESCRIPTION` in UTF-8? 573 | 574 | The desc package always converts `DESCRIPTION` files to `UTF-8`. 575 | 576 | ## How to use UTF-8 file names on Windows? 577 | 578 | You can use functions or a package that already handles UTF-8 file 579 | names, e.g. brio or processx. If you need to open a file from C code, 580 | you need to do the following: 581 | 582 | - Convert the file name to UTF-8 just before passing it to C. In generic 583 | code `enc2utf8()` will do. 584 | - In the C code, on Windows, convert the UTF-8 path to UTF-16 using the 585 | `MultiByteToWideChar()` Windows API function. 586 | - Use the `_wfopen()` Windows API function to open the file. 587 | 588 | Here is an example from the brio package: 589 | 590 | 591 | ## How to capture UTF-8 output? 592 | 593 | This is only possible on UTF-8 systems, and there is no portable way 594 | that works on Windows or other non-UTF-8 platforms. If the native 595 | encoding is UTF-8, then `sink()` and `capture.output()` will create text 596 | in UTF-8. 597 | 598 | Capturing UTF-8 output on Windows (or a Unix system that does not 599 | support a UTF-8 locale) is not currently possible. (Well, on Windows 600 | there is a very dirty trick, but it involves changing an internal R 601 | flag, and running the code in a subprocess.) 602 | 603 | ### Are testthat snapshot tests encoding-safe? 604 | 605 | Somewhat. testthat 3 by default turns off non-ASCII output from cli (and 606 | thus tibble, etc.). So your snapshots that use cli facilities will be in 607 | ASCII, which is safe to use anywhere. 608 | 609 | If you have non-ASCII output from other sources, then hopefully there is a 610 | switch to turn it off. As far as I can tell, rlang uses cli to produce 611 | non-ASCII output, so it should be fine. 612 | 613 | ## How to test non-ASCII output?
614 | 615 | If you need to test non-ASCII output produced by the cli package, then 616 | 617 | - use snapshot tests, 618 | - call `testthat::local_reproducible_output(unicode = TRUE)` at the 619 | beginning of your test, 620 | - record your snapshots on a UTF-8 platform. 621 | 622 | Your tests will not run on non-UTF-8 platforms. It is not currently 623 | possible to record non-ASCII snapshot tests on non-UTF-8 platforms. 624 | 625 | ## How to avoid non-ASCII characters in the manual? 626 | 627 | If you accidentally included some non-ASCII characters in the manual, 628 | e.g. because you included some UTF-8 output from cli or tibble/pillar, 629 | then you can set `options(cli.unicode = FALSE)` at the right place to 630 | avoid them. 631 | 632 | The `tools::showNonASCIIfile()` function helps find non-ASCII 633 | characters in a package. 634 | 635 | ## How to include non-ASCII characters in PDF vignettes? 636 | 637 | TODO 638 | 639 | # Encoding related R functions and packages 640 | 641 | - `base::charToRaw()` 642 | 643 | - `base::Encoding()` and `` base::`Encoding<-` `` 644 | 645 | - `base::iconv()` 646 | 647 | - `base::iconvlist()` 648 | 649 | - `base::nchar()` 650 | 651 | - `base::l10n_info()` 652 | 653 | - [stringi package](https://cran.rstudio.com/web/packages/stringi/) 654 | 655 | - [utf8 package](https://cran.rstudio.com/web/packages/utf8) 656 | 657 | - [brio package](https://cran.rstudio.com/web/packages/brio) 658 | 659 | - [cli package](https://cran.rstudio.com/web/packages/cli/) 660 | 661 | - [Unicode package](https://cran.rstudio.com/web/packages/Unicode) 662 | 663 | # Known encoding issues in packages and functions 664 | 665 | - `yaml::write_yaml` crashes on Windows on latin1-encoded strings: 666 | 667 | 668 | # Tips for debugging encoding issues 669 | 670 | - `charToRaw()` is your best friend. 671 | 672 | - Don’t forget: even if two strings print the same, and even if they are `identical()`, they 673 | can still be in a different encoding.
`charToRaw()` is your best 674 | friend. 675 | 676 | - `testthat::CheckReporter` saves a `testthat-problems.rds` file, if 677 | there were any test failures. You can get this file from win-builder, 678 | R-hub, etc. The file is a version 2 RDS file, so no encoding 679 | conversion will be done by `readRDS()`. 680 | 681 | - Don’t trust any function that processes text. Some functions keep the 682 | encoding of the input, some convert to the native encoding, some 683 | convert to UTF-8. Some convert to UTF-8, without marking the output as 684 | UTF-8. Different R versions do different re-encodings. 685 | 686 | - Typing in a string on the console is not the same as `parse()`-ing the 687 | same code from a file. The console is always assumed to provide text 688 | in the native encoding, package files are typically assumed UTF-8. 689 | (This is why it is important to use `\uxxxx` escape sequences for 690 | UTF-8 text.) 691 | 692 | # Text transformers 693 | 694 | - `print()` 695 | 696 | - `format()` 697 | 698 | - `normalizePath()` 699 | 700 | - `basename()` 701 | 702 | - `encodeString()` 703 | 704 | # Changes in R 4.1.0 705 | 706 | TODO 707 | 708 | # Changes in R 4.2.0 709 | 710 | TODO 711 | 712 | # Further Reading 713 | 714 | Good general introductions to text encodings: 715 | 716 | - [The Absolute Minimum Every Software Developer Absolutely, Positively 717 | Must Know About Unicode and Character Sets (No 718 | Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) 719 | 720 | - [What Every Programmer Absolutely, Positively Needs To Know About 721 | Encodings And Character Sets To Work With 722 | Text](https://kunststube.net/encoding/) 723 | 724 | - [UTF-8 Everywhere](http://utf8everywhere.org) 725 | 726 | R documentation: 727 | 728 | - [Encodings for 729 | CHARSXPs](https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Encodings-for-CHARSXPs) 
730 | section in [R 731 | internals](https://cran.r-project.org/doc/manuals/r-devel/R-ints.html) 732 | 733 | - [Writing portable packages / Encoding 734 | issues](https://cran.r-project.org/doc/manuals/R-exts.html#Encoding-issues) 735 | in [Writing R 736 | Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) 737 | 738 | - [Encodings and 739 | R](https://developer.r-project.org/Encodings_and_R.html) – *Old and 740 | slightly outdated, it gives a good summary of what it took to change 741 | R’s internal encoding.* 742 | 743 | Intros to various encodings: 744 | 745 | - [UTF-8 on Wikipedia](https://en.wikipedia.org/wiki/UTF-8) 746 | 747 | - [UTF-16 on Wikipedia](https://en.wikipedia.org/wiki/UTF-16) 748 | 749 | - [latin-1 (ISO/IEC 8859-1) on 750 | Wikipedia](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) 751 | 752 | About Unicode: 753 | 754 | - [The main Unicode homepage](https://home.unicode.org/) 755 | 756 | - [Unicode FAQs](https://unicode.org/faq/) 757 | -------------------------------------------------------------------------------- /rencfaq.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | -------------------------------------------------------------------------------- /what-every.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "What every R developer must know about encodings" 3 | output: 4 | github_document: 5 | toc: true 6 | toc_depth: 3 7 | --- 8 | 9 | ## Why are encodings so hard (in R)? 10 | 11 | - Impossible to interpret a piece of text without external information. 
12 | - R's internal encoding is different on different platforms, which makes it hard to write (and test!) portable code, and to transfer data. 13 | - R's functions do (seemingly) random encoding conversions. 14 | 15 | ## Important encodings (for the R developer) 16 | 17 | ### UTF-8 18 | 19 | #### What is Unicode? 20 | 21 | Unicode is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. 22 | 23 | #### Numbered characters 24 | 25 | - The first 128 are the same as ASCII, then come the rest of the 143,859 characters (Unicode 13.0). 26 | 27 | - 1,112,064 possible characters. (The original plan was much, much more.) 28 | 29 | - You can create text that encodes Unicode characters with `\u` and `\U` escapes: 30 | 31 | ```{r} 32 | "Just a normal string with \u00fc is \u2713 \U{1F60A}" 33 | ``` 34 | 35 | - R always encodes `\u` and `\U` in UTF-8. 36 | 37 | - Some are non-printing: zero width non-joiner, right-to-left mark, left-to-right mark, etc. 38 | 39 | - Some are combining: 40 | 41 | ```{r} 42 | "person + dark skin tone: \U1F9D1 \U1F3FF" 43 | "person + dark skin tone: \U1F9D1\U1F3FF" 44 | ``` 45 | 46 | #### UTF-8 encoding 47 | 48 | - Encodes all Unicode characters. 49 | 50 | - Variable-length encoding: between 1 and 4 bytes. Think about the implications of this! 51 | 52 | - Includes ASCII, in the exact same encoding. 53 | 54 | - Again, R always encodes `\u` and `\U` in UTF-8; use them for UTF-8 string literals. 55 | 56 | - 57 | 58 | ### ASCII 59 | 60 | - Ancient. 128 characters, stored on one byte, the highest bit is zero. 61 | 62 | ### latin1 (and CP1252) 63 | 64 | - Extension of ASCII, by defining the characters when the high bit is one. 65 | 66 | - CP1252 is a modified latin1; it substitutes some control characters with printable ones. 67 | 68 | - Usually this is the default native encoding of R in the US and Western Europe.
69 | 70 | - Covers most Western European special characters: 71 | 72 | ### UTF-16 73 | 74 | - Encodes all of Unicode. 75 | 76 | - Each Unicode character is two or four bytes. 77 | 78 | - Internal encoding of Windows, Javascript, (older) Python. 79 | 80 | - We only need to deal with it in R when communicating with the Windows API. Most commonly this means passing file paths to Windows. 81 | 82 | - R cannot represent this in a string, because it has embedded zeros. 83 | 84 | - Best is to convert from/to UTF-8 immediately before passing to or getting it from Windows. (This usually happens in C code.) 85 | 86 | ## How R stores text data 87 | 88 | - Zero-terminated bytes. (Hence, no UTF-16 possible.) 89 | 90 | ```{r} 91 | charToRaw("hello world!") 92 | ``` 93 | 94 | - R often assumes that strings are in the *native* encoding. E.g. symbols have to be in the native encoding, connections convert to the native encoding, so does printing, etc. 95 | 96 | - The native encoding can be different on different platforms. *Yes, this is the biggest challenge when writing portable R code that deals with strings.* 97 | 98 | - Use `l10n_info()` to query the native encoding. (But see the FAQ for the real answer.) 99 | 100 | ```{r} 101 | l10n_info() 102 | ``` 103 | 104 | - It is possible to *declare* the encoding of a string (not character vector). (!) 105 | 106 | - But alas... fun with `Encoding()` : 107 | 108 | ```{r} 109 | x <- "\xfb" 110 | x 111 | ``` 112 | 113 | ```{r} 114 | charToRaw(x) 115 | ``` 116 | 117 | ```{r} 118 | Encoding(x) <- "latin1" 119 | x 120 | ``` 121 | 122 | ```{r} 123 | y <- "\xfb" 124 | Encoding(y) <- "latin2" 125 | y 126 | ``` 127 | 128 | ```{r} 129 | Encoding(y) 130 | ``` 131 | 132 | - The encoding information is only stored on two bits. So four values are possible: 133 | 134 | - `UTF-8` 135 | 136 | - `latin1` 137 | 138 | - `bytes` 139 | 140 | - `unknown` 141 | 142 | - If the native encoding is not UTF-8 or latin1, native strings are marked as `unknown`. 
But `unknown` means different things on different platforms. 143 | 144 | ## What encoding should I use? 145 | 146 | UTF-8. Only use something else if you really have to. Convert to UTF-8 as soon as possible. Convert to something else as late as possible. 147 | 148 | ## How-to? 149 | 150 | - How to convert to UTF-8? 151 | 152 | - How to check if a string is UTF-8? 153 | 154 | - How to read a file in UTF-8? 155 | 156 | - How to write a file in UTF-8? 157 | 158 | ## Debugging encoding issues 159 | 160 | ### Common issues 161 | 162 | #### Don't let the printing fool you 163 | 164 | R converts strings to the native encoding when printing them to the screen. This is important to know when debugging encoding problems: two strings that print the same way may have a different internal representation: 165 | 166 | ```{r} 167 | s1 <- "\xfc" 168 | Encoding(s1) <- "latin1" 169 | s2 <- iconv(s1, "latin1", "UTF-8") 170 | s1 171 | s2 172 | ``` 173 | 174 | ```{r} 175 | s1 == s2 176 | identical(s1, s2) 177 | testthat::expect_equal(s1, s2) 178 | testthat::expect_identical(s1, s2) 179 | ``` 180 | 181 | ```{r} 182 | Encoding(s1) 183 | Encoding(s2) 184 | ``` 185 | 186 | ```{r} 187 | charToRaw(s1) 188 | charToRaw(s2) 189 | ``` 190 | 191 | #### Beware the silent conversions 192 | 193 | R functions may change the encoding silently. All functions that transform text are suspicious.
Older R versions are usually worse: 194 | 195 | ``` r 196 | > x <- "ü" 197 | > Encoding(x) 198 | [1] "latin1" 199 | > charToRaw(x) 200 | [1] fc 201 | ``` 202 | 203 | ``` r 204 | > ux <- enc2utf8(x) 205 | > Encoding(ux) 206 | [1] "UTF-8" 207 | > charToRaw(ux) 208 | [1] c3 bc 209 | ``` 210 | 211 | ``` r 212 | > nux <- normalizePath(ux, mustWork = FALSE) 213 | > nux 214 | [1] "C:\\Users\\Gabor\\works\\processx\\ü" 215 | > Encoding(nux) 216 | [1] "unknown" 217 | > charToRaw(nux) 218 | [1] 43 3a 5c 55 73 65 72 73 5c 47 219 | [11] 61 62 6f 72 5c 77 6f 72 6b 73 220 | [21] 5c 70 72 6f 63 65 73 73 78 5c 221 | [31] fc 222 | ``` 223 | 224 | ``` r 225 | > basename(nux) 226 | [1] "ü" 227 | > Encoding(basename(nux)) 228 | [1] "UTF-8" 229 | > charToRaw(basename(nux)) 230 | [1] c3 bc 231 | ``` 232 | 233 | #### Display width is off everywhere 234 | 235 | Aligning text with *wide* Unicode characters is hard. 236 | 237 | ### Tips 238 | 239 | - `charToRaw()` is your best friend. 240 | 241 | - Don't forget: even if two strings print the same, and even if they are `identical()`, they can still be in a different encoding. `charToRaw()` is your best friend. 242 | 243 | - `testthat::CheckReporter` saves a `testthat-problems.rds` file if there were any test failures. You can get this file from win-builder, R-hub, etc. The file is a version 2 RDS file, so no encoding conversion will be done by `readRDS()`. 244 | 245 | - Don't trust any function that processes text. Some functions keep the encoding of the input, some convert to the native encoding, some convert to UTF-8. Some convert to UTF-8, without marking the output as UTF-8. Different R versions do different re-encodings. 246 | 247 | - Typing in a string on the console is not the same as `parse()`-ing the same code from a file. The console is always assumed to provide text in the native encoding, package files are typically assumed UTF-8. (This is why it is important to use `\uxxxx` escape sequences for UTF-8 text.)
248 | -------------------------------------------------------------------------------- /what-every.md: -------------------------------------------------------------------------------- 1 | What every R developer must know about encodings 2 | ================ 3 | 4 | - [Why are encodings so hard (in R)?](#why-are-encodings-so-hard-in-r) 5 | - [Important encodings (for the R 6 | developer)](#important-encodings-for-the-r-developer) 7 | - [UTF-8](#utf-8) 8 | - [ASCII](#ascii) 9 | - [latin1 (and CP1252)](#latin1-and-cp1252) 10 | - [UTF-16](#utf-16) 11 | - [How R stores text data](#how-r-stores-text-data) 12 | - [What encoding should I use?](#what-encoding-should-i-use) 13 | - [How-to?](#how-to) 14 | - [Debugging encoding issues](#debugging-encoding-issues) 15 | - [Common issues](#common-issues) 16 | - [Tips](#tips) 17 | 18 | ## Why are encodings so hard (in R)? 19 | 20 | - It is impossible to interpret a piece of text without external 21 | information. 22 | - R’s internal encoding is different on different platforms, which 23 | makes it hard to write (and test!) portable code, and to transfer 24 | data. 25 | - R’s functions do (seemingly) random encoding conversions. 26 | 27 | ## Important encodings (for the R developer) 28 | 29 | ### UTF-8 30 | 31 | #### What is Unicode? 32 | 33 | Unicode is an information technology standard for the consistent 34 | encoding, representation, and handling of text expressed in most of the 35 | world’s writing systems. 36 | 37 | #### Numbered characters 38 | 39 | - The first 128 characters are the same as ASCII; the rest of the 143,859 40 | characters (Unicode 13.0) follow them. 41 | 42 | - There are 1,112,064 possible characters. (The original plan allowed for far more.) 43 | 44 | - You can create text that encodes Unicode characters with `\u` and 45 | `\U` escapes: 46 | 47 | ``` r 48 | "Just a normal string with \u00fc is \u2713 \U{1F60A}" 49 | ``` 50 | 51 | ## [1] "Just a normal string with ü is ✓ 😊" 52 | 53 | - R always encodes `\u` and `\U` in UTF-8.
54 | 55 | - Some are non-printing: zero width non-joiner, right-to-left mark, 56 | left-to-right mark, etc. 57 | 58 | - Some are combining: 59 | 60 | ``` r 61 | "person + dark skin tone: \U1F9D1 \U1F3FF" 62 | ``` 63 | 64 | ## [1] "person + dark skin tone: 🧑 🏿" 65 | 66 | ``` r 67 | "person + dark skin tone: \U1F9D1\U1F3FF" 68 | ``` 69 | 70 | ## [1] "person + dark skin tone: 🧑🏿" 71 | 72 | #### UTF-8 encoding 73 | 74 | - Encodes all Unicode characters. 75 | 76 | - Variable-length encoding: between 1 and 4 bytes. Think about the 77 | implications of this! 78 | 79 | - Includes ASCII, in exactly the same encoding. 80 | 81 | - Again, R always encodes `\u` and `\U` in UTF-8; use them for UTF-8 82 | string literals. 83 | 84 | 85 | 86 | ### ASCII 87 | 88 | - Ancient. 128 characters, each stored in one byte; the highest bit is 89 | zero. 90 | 91 | ### latin1 (and CP1252) 92 | 93 | - An extension of ASCII that defines the characters with the high bit 94 | set to one. 95 | 96 | - CP1252 is a modified latin1: it substitutes some control 97 | characters with printable ones. 98 | 99 | - Usually this is the default native encoding of R in the US and 100 | Western Europe. 101 | 102 | - Covers most Western European special characters. 103 | 104 | 105 | ### UTF-16 106 | 107 | - Encodes all of Unicode. 108 | 109 | - Each Unicode character is two or four bytes. 110 | 111 | - Internal encoding of Windows, JavaScript, and (older) Python. 112 | 113 | - We only need to deal with it in R when communicating with the 114 | Windows API. Most commonly this means passing file paths to Windows. 115 | 116 | - R cannot represent UTF-16 in a string, because UTF-16 text can contain embedded zero bytes. 117 | 118 | - It is best to convert from/to UTF-8 immediately before passing text to, 119 | or right after getting it from, Windows. (This usually happens in C code.) 120 | 121 | ## How R stores text data 122 | 123 | - Zero-terminated bytes. (Hence, no UTF-16 possible.)
124 | 125 | ``` r 126 | charToRaw("hello world!") 127 | ``` 128 | 129 | ## [1] 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 130 | 131 | - R often assumes that strings are in the *native* encoding. E.g. 132 | symbols have to be in the native encoding, connections convert to 133 | the native encoding, and so does printing, etc. 134 | 135 | - The native encoding can be different on different platforms. *Yes, 136 | this is the biggest challenge when writing portable R code that 137 | deals with strings.* 138 | 139 | - Use `l10n_info()` to query the native encoding. (But see the FAQ for 140 | the real answer.) 141 | 142 | ``` r 143 | l10n_info() 144 | ``` 145 | 146 | ## $MBCS 147 | ## [1] TRUE 148 | ## 149 | ## $`UTF-8` 150 | ## [1] TRUE 151 | ## 152 | ## $`Latin-1` 153 | ## [1] FALSE 154 | ## 155 | ## $codeset 156 | ## [1] "UTF-8" 157 | 158 | - It is possible to *declare* the encoding of a string (but not of a whole 159 | character vector). (!) 160 | 161 | - But alas… fun with `Encoding()`: 162 | 163 | ``` r 164 | x <- "\xfb" 165 | x 166 | ``` 167 | 168 | ## [1] "\xfb" 169 | 170 | ``` r 171 | charToRaw(x) 172 | ``` 173 | 174 | ## [1] fb 175 | 176 | ``` r 177 | Encoding(x) <- "latin1" 178 | x 179 | ``` 180 | 181 | ## [1] "û" 182 | 183 | ``` r 184 | y <- "\xfb" 185 | Encoding(y) <- "latin2" 186 | y 187 | ``` 188 | 189 | ## [1] "\xfb" 190 | 191 | ``` r 192 | Encoding(y) 193 | ``` 194 | 195 | ## [1] "unknown" 196 | 197 | - The encoding information is stored in only two bits, so four values 198 | are possible: 199 | 200 | - `UTF-8` 201 | 202 | - `latin1` 203 | 204 | - `bytes` 205 | 206 | - `unknown` 207 | 208 | - If the native encoding is not UTF-8 or latin1, native strings are 209 | marked as `unknown`. But `unknown` means different things on 210 | different platforms. 211 | 212 | ## What encoding should I use? 213 | 214 | UTF-8. Only use something else if you really have to. Convert to UTF-8 215 | as soon as possible. Convert to something else as late as possible. 216 | 217 | ## How-to?
218 | 219 | - How to convert to UTF-8? 220 | 221 | - How to check if a string is UTF-8? 222 | 223 | - How to read a file in UTF-8? 224 | 225 | - How to write a file in UTF-8? 226 | 227 | ## Debugging encoding issues 228 | 229 | ### Common issues 230 | 231 | #### Don’t let the printing fool you. 232 | 233 | R converts strings to the native encoding when printing them to the 234 | screen. This is important to know when debugging encoding problems: two 235 | strings that print the same way may have different internal 236 | representations: 237 | 238 | ``` r 239 | s1 <- "\xfc" 240 | Encoding(s1) <- "latin1" 241 | s2 <- iconv(s1, "latin1", "UTF-8") 242 | s1 243 | ``` 244 | 245 | ## [1] "ü" 246 | 247 | ``` r 248 | s2 249 | ``` 250 | 251 | ## [1] "ü" 252 | 253 | ``` r 254 | s1 == s2 255 | ``` 256 | 257 | ## [1] TRUE 258 | 259 | ``` r 260 | identical(s1, s2) 261 | ``` 262 | 263 | ## [1] TRUE 264 | 265 | ``` r 266 | testthat::expect_equal(s1, s2) 267 | testthat::expect_identical(s1, s2) 268 | ``` 269 | 270 | ``` r 271 | Encoding(s1) 272 | ``` 273 | 274 | ## [1] "latin1" 275 | 276 | ``` r 277 | Encoding(s2) 278 | ``` 279 | 280 | ## [1] "UTF-8" 281 | 282 | ``` r 283 | charToRaw(s1) 284 | ``` 285 | 286 | ## [1] fc 287 | 288 | ``` r 289 | charToRaw(s2) 290 | ``` 291 | 292 | ## [1] c3 bc 293 | 294 | #### Beware the silent conversions 295 | 296 | R functions may change the encoding silently. All functions that 297 | transform text are suspect.
Older R versions are usually worse: 298 | 299 | ``` r 300 | > x <- "ü" 301 | > Encoding(x) 302 | [1] "latin1" 303 | > charToRaw(x) 304 | [1] fc 305 | ``` 306 | 307 | ``` r 308 | > ux <- enc2utf8(x) 309 | > Encoding(ux) 310 | [1] "UTF-8" 311 | > charToRaw(ux) 312 | [1] c3 bc 313 | ``` 314 | 315 | ``` r 316 | > nux <- normalizePath(ux, mustWork = FALSE) 317 | > nux 318 | [1] "C:\\Users\\Gabor\\works\\processx\\ü" 319 | > Encoding(nux) 320 | [1] "unknown" 321 | > charToRaw(nux) 322 | [1] 43 3a 5c 55 73 65 72 73 5c 47 323 | [11] 61 62 6f 72 5c 77 6f 72 6b 73 324 | [21] 5c 70 72 6f 63 65 73 73 78 5c 325 | [31] fc 326 | ``` 327 | 328 | ``` r 329 | > basename(nux) 330 | [1] "ü" 331 | > Encoding(basename(nux)) 332 | [1] "UTF-8" 333 | > charToRaw(basename(nux)) 334 | [1] c3 bc 335 | ``` 336 | 337 | #### Display width is off everywhere 338 | 339 | Aligning text with *wide* Unicode characters is hard. 340 | 341 | ### Tips 342 | 343 | - `charToRaw()` is your best friend. 344 | 345 | - Even if two strings print the same way, and even if they are `identical()`, 346 | they can still be in different encodings. `charToRaw()` is your 347 | best friend. 348 | 349 | - `testthat::CheckReporter` saves a `testthat-problems.rds` file if 350 | there were any test failures. You can get this file from 351 | win-builder, R-hub, etc. The file is a version 2 RDS file, so no 352 | encoding conversion will be done by `readRDS()`. 353 | 354 | - Don’t trust any function that processes text. Some functions keep 355 | the encoding of the input, some convert to the native encoding, and 356 | some convert to UTF-8. Some convert to UTF-8 without marking the output 357 | as UTF-8. Different R versions do different re-encodings. 358 | 359 | - Typing a string at the console is not the same as `parse()`-ing 360 | the same code from a file. The console is always assumed to provide 361 | text in the native encoding, while package files are typically assumed 362 | to be UTF-8.
(This is why it is important to use `\uxxxx` escape sequences 363 | for UTF-8 text.) 364 | --------------------------------------------------------------------------------
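To make such differences easy to spot, the tips above can be bundled into a small helper. `inspect_string()` is a hypothetical function (not from any package) that pairs each string with its declared encoding and its raw bytes:

``` r
# Hypothetical debugging helper: show each string's declared encoding
# next to its raw bytes, so identical-looking strings can be told apart.
inspect_string <- function(x) {
  data.frame(
    encoding = Encoding(x),
    bytes = vapply(x, function(s) paste(charToRaw(s), collapse = " "), ""),
    row.names = NULL
  )
}

s1 <- "\xfc"
Encoding(s1) <- "latin1"
s2 <- iconv(s1, "latin1", "UTF-8")

inspect_string(c(s1, s2))
##   encoding bytes
## 1   latin1    fc
## 2    UTF-8 c3 bc
```

Because the returned columns contain only ASCII (hex digits and encoding names), the output prints faithfully in any native encoding.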