├── .gitignore ├── LICENSE ├── README.Rmd ├── README.md ├── rencfaq.Rproj ├── what-every.Rmd └── what-every.md /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work"). 20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 
27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. 
other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. 
In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the present or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. 
Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 122 | -------------------------------------------------------------------------------- /README.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "The R Encoding FAQ" 3 | output: 4 | github_document: 5 | toc: true 6 | toc_depth: 2 7 | --- 8 | 9 | ```{r setup, include=FALSE} 10 | knitr::opts_chunk$set(collapse = TRUE, comment = "#>", error = TRUE) 11 | ``` 12 | 13 | # Introduction 14 | 15 | The goal of this document is twofold: (1) collect current best practices to solve text encoding related issues and (2) include links to further information about text encoding. 16 | 17 | # Contributing 18 | 19 | Encoding of text is a large topic, and to make this FAQ useful, we need your contributions. There are a variety of ways to contribute: 20 | 21 | - Found a typo, a broken link, or example code that is not working for you? Please open an issue about it. Or even better, consider fixing it in a pull request. 22 | - Something is missing? There is a new (or old) package or function that should be mentioned here? Please open an issue about it. 23 | - Some information is wrong or something is not explained well? Please open an issue about it. Or even better, fix it in a pull request! 24 | - You found an answer useful? Please tweet about it and link to this FAQ. 25 | 26 | Your contributions are most appreciated. 
27 | 28 | # Encoding of text 29 | 30 | Good general introductions to text encodings: 31 | 32 | - [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) 33 | 34 | - [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](https://kunststube.net/encoding/) 35 | 36 | # R's text representation 37 | 38 | See the R Internals manual: 39 | 40 | 41 | 42 | # R Connections and encodings 43 | 44 | Some notes about connections. 45 | 46 | - Connections assume that the input is in the encoding specified by `getOption("encoding")`. They convert text to the native encoding. 47 | - Functions that *create* connections have an `encoding` argument that overrides `getOption("encoding")`. 48 | - If you tell a connection that the input is already in the native encoding, then it will not convert the input. If the input is in fact *not* in the native encoding, then it is up to the user to mark it accordingly, e.g. as UTF-8. 49 | 50 | # Printing text to the console 51 | 52 | R converts strings to the native encoding when printing them to the screen. This is important to know when debugging encoding problems: two strings that print the same way may have a different internal representation: 53 | 54 | ```{r} 55 | s1 <- "\xfc" 56 | Encoding(s1) <- "latin1" 57 | s2 <- iconv(s1, "latin1", "UTF-8") 58 | s1 59 | s2 60 | ``` 61 | 62 | ```{r} 63 | s1 == s2 64 | identical(s1, s2) 65 | testthat::expect_equal(s1, s2) 66 | testthat::expect_identical(s1, s2) 67 | ``` 68 | 69 | ```{r} 70 | Encoding(s1) 71 | Encoding(s2) 72 | ``` 73 | 74 | ```{r} 75 | charToRaw(s1) 76 | charToRaw(s2) 77 | ``` 78 | 79 | ## Display width 80 | 81 | Some Unicode glyphs, e.g. typical emojis, are *wide*. 
They are supposed to take up two characters in a monospace font. The set of wide glyphs (just like the set of glyphs in general) has been changing in different Unicode versions. 82 | 83 | Up to version 4.0.3, R followed Unicode 8 (released in 2015) in terms of character width, so `nchar(..., type = "width")` miscalculated the width of many Asian characters. 84 | 85 | R 4.0.4 follows Unicode 12.1. 86 | 87 | R 4.1.0 - R 4.3.3 follow Unicode 13.0. 88 | 89 | Current R-devel (will be R 4.4.0 in about a month) still uses Unicode 13.0. 90 | 91 | To correctly calculate the display width of Unicode text across various R versions, you can use the utf8 package or `cli::ansi_nchar()`. 92 | 93 | Note that some terminals, and also RStudio, follow an older version of the Unicode standard, so they also calculate the display width incorrectly, typically printing characters on top of each other. 94 | 95 | # Issues when building packages 96 | 97 | See 'Writing R Extensions' about including UTF-8 characters in packages: 98 | 99 | ## Code files 100 | 101 | I suggest that you keep your source files in ASCII, except for comments, which you can write in UTF-8 if you like. (But not necessarily in roxygen2 comments, which will go in the manual; see below!) 102 | 103 | If you need a string containing non-ASCII characters, construct it with `\uxxxx` or `\U{xxxxxx}` escape sequences, which yield a UTF-8-encoded string. 104 | 105 | ```{r} 106 | one_and_a_half <- "\u0031\u00BD" 107 | one_and_a_half 108 | Encoding(one_and_a_half) 109 | charToRaw(one_and_a_half) 110 | ``` 111 | 112 | Here are some handy ways to find the Unicode code points for an existing string: 113 | 114 | * Copy and paste into the [Unicode character inspector](https://apps.timwhitlock.info/unicode/inspect). 115 | * `sprintf("\\u%04X", utf8ToInt(x))` 116 | * *Do we have other suggestions?* 117 | 118 | If you need to include non-UTF-8 non-ASCII characters, e.g. 
for testing, then include them via raw data, or via `\xxx` escape sequences. E.g. to include a latin1 string, you can do either of these: 119 | 120 | ```{r} 121 | t1 <- "t\xfck\xf6rf\xfar\xf3g\xe9p" 122 | Encoding(t1) <- "latin1" 123 | t2 <- rawToChar(as.raw(c( 124 | 0x74, 0xfc, 0x6b, 0xf6, 0x72, 0x66, 0xfa, 0x72, 125 | 0xf3, 0x67, 0xe9, 0x70))) 126 | Encoding(t2) <- "latin1" 127 | t1 == t2 128 | identical(charToRaw(t1), charToRaw(t2)) 129 | ``` 130 | 131 | ## Symbols 132 | 133 | Symbols must be in the native encoding, so best practice is to only use ASCII symbols. 134 | 135 | This is not completely equivalent to using ASCII code files, because names, e.g. in a named list, or the row names in a data frame, are also symbols in R. So do not use non-ASCII names. (Another reason to avoid row names in data frames.) 136 | 137 | A typical mistake is to assign file names as list or row names when working with files. While it might be a convenient representation, it is a bad idea because of the possible encoding errors. 138 | 139 | Similarly, always use ASCII column names. 140 | 141 | ## Unit tests 142 | 143 | Unit test files are code files, so the same rule applies to them: keep them ASCII. See the 'How-to' section below on specific testing-related tasks. 144 | 145 | ## Manual pages 146 | 147 | You can use *some* UTF-8 characters in manual pages, if you include `\encoding{UTF-8}` in the manual page, or declare `Encoding: UTF-8` in the `DESCRIPTION` file of the package. 148 | 149 | Unfortunately, the supported characters do not include those typically used in the output of tibble and cli. The problem is with the PDF manual and LaTeX, so you can conditionally include UTF-8 in the HTML output with the `\if` Rd command. 150 | 151 | If your PDF manual does not build because a non-supported UTF-8 character has sneaked in somewhere, `tools::showNonASCIIfile()` can show where exactly it is. 
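To sketch the conditional-inclusion approach, an Rd fragment might look like the following. (The glyph and the fallback text here are made up for illustration; only `\encoding{}`, `\if{}{}`, and `\out{}` are real Rd markup.)

```
\encoding{UTF-8}
\details{
  On UTF-8 platforms the output contains a check mark:
  \if{html}{\out{&#x2714;}}
  \if{latex}{(a check mark, shown in the HTML manual)}
}
```

The PDF manual then only ever sees the ASCII fallback, so LaTeX never encounters the unsupported character.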
152 | 153 | ## Vignettes 154 | 155 | TODO 156 | 157 | # Graphics 158 | 159 | TODO 160 | 161 | # How-to 162 | 163 | ## How to query the current native encoding? 164 | 165 | R uses the native encoding extensively. Some examples: 166 | 167 | - Symbols (i.e. variable names, names in a named list, column and row names, etc.) are always in the native encoding. 168 | - Strings marked with `"unknown"` encoding are assumed to be in this encoding. 169 | - Output connections assume that the output is in this encoding. 170 | - Input connections convert the input into this encoding. 171 | - `enc2native()` encodes its input into this encoding. 172 | - Printing to the screen (via `cat()`, `print()`, `writeLines()`, etc.) re-encodes strings into this encoding. 173 | 174 | It is surprisingly tricky to query the current native encoding, especially on older R versions. Here are a number of things to try: 175 | 176 | 1. Call `l10n_info()`. If the native encoding is UTF-8 or latin1 you'll see that in the output: 177 | 178 | ``` r 179 | ❯ l10n_info() 180 | $MBCS 181 | [1] TRUE 182 | 183 | $`UTF-8` 184 | [1] TRUE 185 | 186 | $`Latin-1` 187 | [1] FALSE 188 | ``` 189 | 190 | 2. On Windows, you'll also see the code page: 191 | 192 | ``` r 193 | ❯ l10n_info() 194 | $MBCS 195 | [1] FALSE 196 | 197 | $`UTF-8` 198 | [1] FALSE 199 | 200 | $`Latin-1` 201 | [1] TRUE 202 | 203 | $codepage 204 | [1] 1252 205 | 206 | $system.codepage 207 | [1] 1252 208 | ``` 209 | 210 | Append the codepage number to `"CP"` to get a string that you can use in `iconv()`. E.g. in this case it would be `"CP1252"`. 211 | 212 | 3. On Unix, from R 4.1 the name of the encoding is also included in the answer. This is useful on non-UTF-8 systems: 213 | 214 | ``` r 215 | ❯ l10n_info() 216 | $MBCS 217 | [1] FALSE 218 | 219 | $`UTF-8` 220 | [1] FALSE 221 | 222 | $`Latin-1` 223 | [1] FALSE 224 | 225 | $codeset 226 | [1] "ISO-8859-15" 227 | ``` 228 | 229 | 4. On older R versions this field is not included. 
On R 3.5 and above you can use the following trick to find the encoding: 230 | 231 | ``` r 232 | ❯ rawToChar(serialize(NULL, NULL, ascii = TRUE, version = 3)) 233 | [1] "A\n3\n262148\n197888\n5\nUTF-8\n254\n" 234 | ``` 235 | 236 | Line number 6 (i.e. after the fifth `\n`) in the output is the name of the encoding. On R 4.0.x you can also save an RDS file and then use the `infoRDS()` function on it to see the current native encoding. 237 | 238 | 5. Otherwise you can call `Sys.getlocale()`; parsing its output will probably give you an encoding name that works in `iconv()`: 239 | 240 | ``` r 241 | ❯ Sys.getlocale("LC_CTYPE") 242 | [1] "en_US.iso885915" 243 | ``` 244 | 245 | ## How is the `encoding` argument and option used? 246 | 247 | In general, functions use the `encoding` argument in two ways: 248 | 249 | - For some functions it specifies the encoding of the input. 250 | - For others it specifies how the output should be marked. 251 | 252 | For example 253 | 254 | ``` r 255 | file("input.txt", encoding = "UTF-8") 256 | ``` 257 | 258 | tells `file()` that `input.txt` is in `UTF-8`. On the other hand 259 | 260 | ``` r 261 | scan("input2.txt", what = "", encoding = "UTF-8") 262 | ``` 263 | 264 | means that the output of `scan()` will be *marked* as having UTF-8 encoding. (`scan()` also has a `fileEncoding` argument, which specifies the encoding of the input file.) 265 | 266 | TODO: the weirdness of `readLines()`. 267 | 268 | ## How to read lines from UTF-8 files? 269 | 270 | ### With the brio package 271 | 272 | The simplest option is to use the brio package, if you can take on the dependency: 273 | 274 | ``` r 275 | brio::read_lines("file") 276 | ``` 277 | 278 | (brio also has a `read_file()` function if you want to read the whole file into a single string.) 279 | 280 | brio does not currently check if the file is valid in UTF-8. See below for how to do that. 
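Since brio does not validate the input, you can combine it with the base R `iconv()` trick from the how-to about checking UTF-8 validity. A sketch, where `read_utf8_checked()` is a hypothetical helper, not part of brio:

```r
read_utf8_checked <- function(path) {
  lines <- brio::read_lines(path)
  # iconv() returns NA for elements that are invalid in the input encoding
  bad <- is.na(iconv(lines, "UTF-8", "UTF-8"))
  if (any(bad)) {
    stop("Invalid UTF-8 in ", path, ", on line(s) ",
         paste(which(bad), collapse = ", "))
  }
  lines
}
```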
281 | 282 | ### With base R only 283 | 284 | The xfun package has a nice function that reads UTF-8 files with base R only: `xfun::read_utf8()`. After some simplification, it currently looks like this: 285 | 286 | ``` r 287 | read_utf8 <- function(path) { 288 | opts <- options(encoding = "native.enc") 289 | on.exit(options(opts), add = TRUE) 290 | x <- readLines(path, encoding = "UTF-8", warn = FALSE) 291 | x 292 | } 293 | ``` 294 | 295 | (The original function also checks that the returned lines are valid UTF-8.) 296 | 297 | Explanation and notes: 298 | 299 | - `path` must be a file name, an un-opened connection, or a connection that was opened in the native encoding (i.e. with `encoding = "native.enc"`). Otherwise `read_utf8()` might (silently) return lines in the wrong encoding. 300 | - If `readLines()` gets a file name, then it opens a connection to it in `UTF-8`, so the file is not re-encoded, and it also marks the result as `UTF-8`. 301 | - For extra safety, you can add a check that `path` is a file name. 302 | - As far as I can tell, there is no R API to query the encoding of a connection. 303 | 304 | TODO: does this work correctly with non-ASCII file names? 305 | 306 | ## How to write lines to UTF-8 files? 307 | 308 | ### With the brio package 309 | 310 | Call `brio::write_file()` to write a character vector to a file. Notes: 311 | 312 | 1. `brio::write_file()` converts the input character vector to UTF-8. It uses the marked encoding for this, so if that is not correct, then the conversion won't be correct, either. 313 | 2. It handles UTF-8 file names correctly. 314 | 3. `brio::write_file()` cannot write to connections currently, because with already opened connections there is no way to tell their encoding. 
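A quick round trip to illustrate the conversion; a sketch, assuming brio is installed, and using its line-oriented `brio::write_lines()` counterpart:

```r
# latin1-marked input; brio converts it to UTF-8 when writing
x <- "t\xfck\xf6rf\xfar\xf3g\xe9p"
Encoding(x) <- "latin1"
tmp <- tempfile(fileext = ".txt")
brio::write_lines(x, tmp)
# the file now contains UTF-8 bytes, independently of the native encoding,
# and reading it back gives a UTF-8-marked string
charToRaw(brio::read_lines(tmp))
```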
315 | 316 | ### With base R 317 | 318 | Kevin Ushey's excellent [blog post](https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/) has a detailed derivation of this base R function: 319 | 320 | ```{r} 321 | write_utf8 <- function(text, path) { 322 | # step 1: ensure our text is utf8 encoded 323 | utf8 <- enc2utf8(text) 324 | upath <- enc2utf8(path) 325 | 326 | # step 2: create a connection with 'native' encoding 327 | # this signals to R that translation before writing 328 | # to the connection should be skipped 329 | con <- file(upath, open = "w+", encoding = "native.enc") 330 | on.exit(close(con), add = TRUE) 331 | 332 | # step 3: write to the connection with 'useBytes = TRUE', 333 | # telling R to skip translation to the native encoding 334 | writeLines(utf8, con = con, useBytes = TRUE) 335 | } 336 | ``` 337 | 338 | (This is a slightly more robust version than the one in the blog post.) 339 | 340 | TODO: does this work correctly with non-ASCII file names? 341 | 342 | ## How to convert text between encodings? 343 | 344 | The `iconv()` base R function can convert between encodings. It ignores the `Encoding()` markings of the input completely. It uses `""` as a synonym for the native encoding. 345 | 346 | ## How to guess the encoding of text? 347 | 348 | The `stringi::stri_enc_detect()` function tries to detect the encoding of a string. This is mostly guessing, of course. 349 | 350 | ## How to check if a string is UTF-8? 351 | 352 | Not all bytes and byte sequences are valid in UTF-8. There are a few alternatives to check if a string is valid: 353 | 354 | - `stringi::stri_enc_isutf8()` 355 | 356 | - `utf8::utf8_valid()` 357 | 358 | - The following trick using only base R: 359 | 360 | ``` r 361 | ! is.na(iconv(x, "UTF-8", "UTF-8")) 362 | ``` 363 | 364 | `iconv()` returns `NA` for the elements that are invalid in the input encoding. 365 | 366 | ## How to download web pages in the right encoding? 
367 | 368 | TODO 369 | 370 | ## Are RDS files encoding-safe? 371 | 372 | No, in general they are not, but the situation is quite good. If you save an RDS file that contains text in some encoding, it is safe to read that file on a different computer (or on the same computer with different settings), if at least one of these conditions holds: 373 | 374 | - all text is either UTF-8 or latin1 encoded, and marked as such, or 375 | - both computers (or settings) have the same native encoding, or 376 | - the RDS file has version 3, and the loading platform can represent all characters in the RDS file. This usually holds if the loading platform is UTF-8. 377 | 378 | Note that from RDS version 3, strings in the native encoding are re-encoded to the current native encoding when the RDS file is loaded. 379 | 380 | ## How to `parse()` UTF-8 text into UTF-8 code? 381 | 382 | On R 3.6 and before, we need to parse from a connection to create code containing UTF-8 strings from a UTF-8 character vector. It goes like this: 383 | 384 | ``` r 385 | safe_parse <- function(text) { 386 | text <- enc2utf8(text) 387 | Encoding(text) <- "unknown" 388 | con <- textConnection(text) 389 | on.exit(close(con), add = TRUE) 390 | eval(parse(con, encoding = "UTF-8")) 391 | } 392 | ``` 393 | 394 | - By marking the text as `unknown` we make sure that `textConnection()` will not convert it. 395 | - The `encoding` argument of `parse()` marks the output as `UTF-8`. 396 | - Note that when `parse()` parses from a connection, it loses the source references. You can use the `srcfile` argument of `parse()` to add them back. 397 | 398 | ## How to `deparse()` UTF-8 code into UTF-8 text? 399 | 400 | TODO 401 | 402 | ## How to include UTF-8 characters in a package `DESCRIPTION` 403 | 404 | Some fields in `DESCRIPTION` may contain non-ASCII characters. Set the `Encoding` field to `UTF-8` to use UTF-8 characters in these fields: 405 | 406 | Encoding: UTF-8 407 | 408 | ## How to get a package `DESCRIPTION` in UTF-8? 
409 | 410 | The desc package always converts `DESCRIPTION` files to `UTF-8`. 411 | 412 | ## How to use UTF-8 file names on Windows? 413 | 414 | You can use functions or a package that already handles UTF-8 file names, e.g. brio or processx. If you need to open a file from C code, you need to do the following: 415 | 416 | - Convert the file name to UTF-8 just before passing it to C. In generic code `enc2utf8()` will do. 417 | - In the C code, on Windows, convert the UTF-8 path to UTF-16 using the `MultiByteToWideChar()` Windows API function. 418 | - Use the `_wfopen()` Windows API function to open the file. 419 | 420 | Here is an example from the brio package: 421 | 422 | ## How to capture UTF-8 output? 423 | 424 | This is only possible on UTF-8 systems, and there is no portable way that works on Windows or other non-UTF-8 platforms. If the native encoding is UTF-8, then `sink()` and `capture.output()` will create text in UTF-8. 425 | 426 | Capturing UTF-8 output on Windows (or on a Unix system that does not support a UTF-8 locale) is not currently possible. (Well, on Windows there is a very dirty trick, but it involves changing an internal R flag and running the code in a subprocess.) 427 | 428 | ### Are testthat snapshot tests encoding-safe? 429 | 430 | Somewhat. testthat 3 by default turns off non-ASCII output from cli (and thus tibble, etc.). So your snapshots that use cli facilities will be in ASCII, which is safe to use anywhere. 431 | 432 | If you have non-ASCII output from other sources, then hopefully there is a switch to turn it off. As far as I can tell, rlang uses cli to produce non-ASCII output, so it should be fine. 433 | 434 | ## How to test non-ASCII output? 435 | 436 | If you need to test non-ASCII output produced by the cli package, then: 437 | 438 | - use snapshot tests, 439 | - call `testthat::local_reproducible_output(unicode = TRUE)` at the beginning of your test, and 440 | - record your snapshots on a UTF-8 platform. 
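The recipe above can be sketched as a test; the alert message here is illustrative:

```r
test_that("non-ASCII output prints as expected", {
  # opt in to Unicode output for this test only
  testthat::local_reproducible_output(unicode = TRUE)
  testthat::expect_snapshot(
    cli::cli_alert_success("Done!")
  )
})
```

On the first run testthat records the snapshot under `_snaps/`; later runs compare against it.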
441 | 442 | Your tests will not run on non-UTF-8 platforms. It is not currently possible to record non-ASCII snapshot tests on non-UTF-8 platforms. 443 | 444 | ## How to avoid non-ASCII characters in the manual? 445 | 446 | If you accidentally included some non-ASCII characters in the manual, e.g. because you included some UTF-8 output from cli or tibble/pillar, then you can set `options(cli.unicode = FALSE)` at the right place to avoid them. 447 | 448 | The `tools::showNonASCIIfile()` function helps you find non-ASCII characters in a package. 449 | 450 | ## How to include non-ASCII characters in PDF vignettes? 451 | 452 | TODO 453 | 454 | # Encoding related R functions and packages 455 | 456 | - `base::charToRaw()` 457 | 458 | - `base::Encoding()` and `` base::`Encoding<-` `` 459 | 460 | - `base::iconv()` 461 | 462 | - `base::iconvlist()` 463 | 464 | - `base::nchar()` 465 | 466 | - `base::l10n_info()` 467 | 468 | - [stringi package](https://cran.rstudio.com/web/packages/stringi/) 469 | 470 | - [utf8 package](https://cran.rstudio.com/web/packages/utf8) 471 | 472 | - [brio package](https://cran.rstudio.com/web/packages/brio) 473 | 474 | - [cli package](https://cran.rstudio.com/web/packages/cli/) 475 | 476 | - [Unicode package](https://cran.rstudio.com/web/packages/Unicode) 477 | 478 | # Known encoding issues in packages and functions 479 | 480 | - `yaml::write_yaml` crashes on Windows on latin1 encoded strings: 481 | 482 | # Tips for debugging encoding issues 483 | 484 | - `charToRaw()` is your best friend. 485 | 486 | - Don't forget: even if two strings print the same, and even if they are `identical()`, they can still be in different encodings. `charToRaw()` is your best friend. 487 | 488 | - `testthat::CheckReporter` saves a `testthat-problems.rds` file if there were any test failures. You can get this file from win-builder, R-hub, etc. The file is a version 2 RDS file, so no encoding conversion will be done by `readRDS()`. 489 | 490 | - Don't trust any function that processes text. 
Some functions keep the encoding of the input, some convert to the native encoding, some convert to UTF-8. Some convert to UTF-8 without marking the output as UTF-8. Different R versions do different re-encodings. 491 | 492 | - Typing in a string on the console is not the same as `parse()`-ing the same code from a file. The console is always assumed to provide text in the native encoding, while package files are typically assumed to be UTF-8. (This is why it is important to use `\uxxxx` escape sequences for UTF-8 text.) 493 | 494 | # Text transformers 495 | 496 | - `print()` 497 | 498 | - `format()` 499 | 500 | - `normalizePath()` 501 | 502 | - `basename()` 503 | 504 | - `encodeString()` 505 | 506 | # Changes in R 4.1.0 507 | 508 | TODO 509 | 510 | # Changes in R 4.2.0 511 | 512 | TODO 513 | 514 | # Further Reading 515 | 516 | Good general introductions to text encodings: 517 | 518 | - [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) 519 | 520 | - [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](https://kunststube.net/encoding/) 521 | 522 | - [UTF-8 Everywhere](http://utf8everywhere.org) 523 | 524 | R documentation: 525 | 526 | - [Encodings for CHARSXPs](https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Encodings-for-CHARSXPs) section in [R internals](https://cran.r-project.org/doc/manuals/r-devel/R-ints.html) 527 | 528 | - [Writing portable packages / Encoding issues](https://cran.r-project.org/doc/manuals/R-exts.html#Encoding-issues) in [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) 529 | 530 | - [Encodings and R](https://developer.r-project.org/Encodings_and_R.html) -- *Old and slightly outdated, it gives a good summary of what it 
took to change R's internal encoding.* 531 | 532 | Intros to various encodings: 533 | 534 | - [UTF-8 on Wikipedia](https://en.wikipedia.org/wiki/UTF-8) 535 | 536 | - [UTF-16 on Wikipedia](https://en.wikipedia.org/wiki/UTF-16) 537 | 538 | - [latin-1 (ISO/IEC 8859-1) on Wikipedia](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) 539 | 540 | About Unicode: 541 | 542 | - [The main Unicode homepage](https://home.unicode.org/) 543 | 544 | - [Unicode FAQs](https://unicode.org/faq/) 545 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | The R Encoding FAQ 2 | ================ 3 | 4 | - [Introduction](#introduction) 5 | - [Contributing](#contributing) 6 | - [Encoding of text](#encoding-of-text) 7 | - [R’s text representation](#rs-text-representation) 8 | - [R Connections and encodings](#r-connections-and-encodings) 9 | - [Printing text to the console](#printing-text-to-the-console) 10 | - [Display width](#display-width) 11 | - [Issues when building packages](#issues-when-building-packages) 12 | - [Code files](#code-files) 13 | - [Symbols](#symbols) 14 | - [Unit tests](#unit-tests) 15 | - [Manual pages](#manual-pages) 16 | - [Vignettes](#vignettes) 17 | - [Graphics](#graphics) 18 | - [How-to](#how-to) 19 | - [How to query the current native 20 | encoding?](#how-to-query-the-current-native-encoding) 21 | - [How is the `encoding` argument and option 22 | used?](#how-is-the-encoding-argument-and-option-used) 23 | - [How to read lines from UTF-8 24 | files?](#how-to-read-lines-from-utf-8-files) 25 | - [How to write lines to UTF-8 26 | files?](#how-to-write-lines-to-utf-8-files) 27 | - [How to convert text between 28 | encodings?](#how-to-convert-text-between-encodings) 29 | - [How to guess the encoding of 30 | text?](#how-to-guess-the-encoding-of-text) 31 | - [How to check if a string is 32 | UTF-8?](#how-to-check-if-a-string-is-utf-8) 33 | - [How to 
download web pages in the right 34 | encoding?](#how-to-download-web-pages-in-the-right-encoding) 35 | - [Are RDS files encoding-safe?](#are-rds-files-encoding-safe) 36 | - [How to `parse()` UTF-8 text into UTF-8 37 | code?](#how-to-parse-utf-8-text-into-utf-8-code) 38 | - [How to `deparse()` UTF-8 code into UTF-8 39 | text?](#how-to-deparse-utf-8-code-into-utf-8-text) 40 | - [How to include UTF-8 characters in a package 41 | `DESCRIPTION`](#how-to-include-utf-8-characters-in-a-package-description) 42 | - [How to get a package `DESCRIPTION` in 43 | UTF-8?](#how-to-get-a-package-description-in-utf-8) 44 | - [How to use UTF-8 file names on 45 | Windows?](#how-to-use-utf-8-file-names-on-windows) 46 | - [How to capture UTF-8 output?](#how-to-capture-utf-8-output) 47 | - [How to test non-ASCII output?](#how-to-test-non-ascii-output) 48 | - [How to avoid non-ASCII characters in the 49 | manual?](#how-to-avoid-non-ascii-characters-in-the-manual) 50 | - [How to include non-ASCII characters in PDF 51 | vignettes?](#how-to-include-non-ascii-characters-in-pdf-vignettes) 52 | - [Encoding related R functions and 53 | packages](#encoding-related-r-functions-and-packages) 54 | - [Known encoding issues in packages and 55 | functions](#known-encoding-issues-in-packages-and-functions) 56 | - [Tips for debugging encoding 57 | issues](#tips-for-debugging-encoding-issues) 58 | - [Text transformers](#text-transformers) 59 | - [Changes in R 4.1.0](#changes-in-r-410) 60 | - [Changes in R 4.2.0](#changes-in-r-420) 61 | - [Further Reading](#further-reading) 62 | 63 | # Introduction 64 | 65 | The goal of this document is twofold: (1) collect current best practices 66 | to solve text encoding related issues and (2) include links to further 67 | information about text encoding. 68 | 69 | # Contributing 70 | 71 | Encoding of text is a large topic, and to make this FAQ useful, we need 72 | your contributions. 
There are a variety of ways to contribute: 73 | 74 | - Found a typo, a broken link, or some example code that is not working for 75 | you? Please open an issue about it. Or even better, consider fixing it 76 | in a pull request. 77 | - Something is missing? Is there a new (or old) package or function that 78 | should be mentioned here? Please open an issue about it. 79 | - Some information is wrong or something is not explained well? Please 80 | open an issue about it. Or even better, fix it in a pull request! 81 | - You found an answer useful? Please tweet about it and link to this 82 | FAQ. 83 | 84 | Your contributions are most appreciated. 85 | 86 | # Encoding of text 87 | 88 | Good general introductions to text encodings: 89 | 90 | - [The Absolute Minimum Every Software Developer Absolutely, Positively 91 | Must Know About Unicode and Character Sets (No 92 | Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) 93 | 94 | - [What Every Programmer Absolutely, Positively Needs To Know About 95 | Encodings And Character Sets To Work With 96 | Text](https://kunststube.net/encoding/) 97 | 98 | # R’s text representation 99 | 100 | See the R Internals manual: 101 | 102 | 103 | 104 | # R Connections and encodings 105 | 106 | Some notes about connections. 107 | 108 | - Connections assume that the input is in the encoding specified by 109 | `getOption("encoding")`. They convert text to the native encoding. 110 | - Functions that *create* connections have an `encoding` argument that 111 | overrides `getOption("encoding")`. 112 | - If you tell a connection that the input is already in the native 113 | encoding, then it will not convert the input. If the input is in fact 114 | *not* in the native encoding, then it is up to the user to mark it 115 | accordingly, e.g. as UTF-8.
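The notes above can be sketched in a few lines (the file and its bytes are made up for this example). The `encoding` argument describes the *input*, and the text is converted to the native encoding as it is read:

``` r
path <- tempfile()
writeBin(as.raw(c(0xfc, 0x0a)), path)   # "ü" plus a newline, in latin1

con <- file(path, open = "r", encoding = "latin1")
x <- readLines(con)
close(con)

x            # "ü", now in the native encoding
charToRaw(x) # c3 bc on a UTF-8 platform, fc on a latin1 one
```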
116 | 117 | # Printing text to the console 118 | 119 | R converts strings to the native encoding when printing them to the 120 | screen. This is important to know when debugging encoding problems: two 121 | strings that print the same way may have a different internal 122 | representation: 123 | 124 | ``` r 125 | s1 <- "\xfc" 126 | Encoding(s1) <- "latin1" 127 | s2 <- iconv(s1, "latin1", "UTF-8") 128 | s1 129 | #> [1] "ü" 130 | s2 131 | #> [1] "ü" 132 | ``` 133 | 134 | ``` r 135 | s1 == s2 136 | #> [1] TRUE 137 | identical(s1, s2) 138 | #> [1] TRUE 139 | testthat::expect_equal(s1, s2) 140 | testthat::expect_identical(s1, s2) 141 | ``` 142 | 143 | ``` r 144 | Encoding(s1) 145 | #> [1] "latin1" 146 | Encoding(s2) 147 | #> [1] "UTF-8" 148 | ``` 149 | 150 | ``` r 151 | charToRaw(s1) 152 | #> [1] fc 153 | charToRaw(s2) 154 | #> [1] c3 bc 155 | ``` 156 | 157 | ## Display width 158 | 159 | Some Unicode glyphs, e.g. typical emojis, are *wide*. They are supposed 160 | to take up two character cells in a monospace font. The set of wide glyphs 161 | (just like the set of glyphs in general) has been changing in different 162 | Unicode versions. 163 | 164 | Up to version 4.0.3 R followed Unicode 8 (released in 2015) in terms of 165 | character width, so `nchar(..., type = "width")` miscalculated the width 166 | of many Asian characters. 167 | 168 | R 4.0.4 follows Unicode 12.1. 169 | 170 | R 4.1.0 - R 4.3.3 follow Unicode 13.0. 171 | 172 | Current R-devel (will be R 4.4.0 in about a month) still uses Unicode 173 | 13.0. 174 | 175 | To correctly calculate the display width of Unicode text, across various 176 | R versions, you can use the utf8 package, or `cli::ansi_nchar()`. 177 | 178 | Note that some terminals, and also RStudio, follow an older version 179 | of the Unicode standard, so they too calculate the display width 180 | incorrectly, typically printing characters on top of each other.
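A quick sketch of the width calculations mentioned above (assuming the utf8 and cli packages are installed; the exact numbers depend on the R and package versions):

``` r
smiley <- "\U0001F60A"               # a wide emoji

nchar(smiley, type = "width")        # depends on the R version, see above
utf8::utf8_width(smiley)             # typically 2
cli::ansi_nchar(smiley, "width")     # typically 2, also handles ANSI escapes
```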
181 | 182 | # Issues when building packages 183 | 184 | See ‘Writing R Extensions’ about including UTF-8 characters in packages: 185 | 186 | 187 | ## Code files 188 | 189 | I suggest that you keep your source files in ASCII, except for comments, 190 | which you can write in UTF-8 if you like. (But not necessarily in 191 | roxygen2 comments, which will go in the manual, see below!) 192 | 193 | If you need a string containing non-ASCII characters, construct it with 194 | `\uxxxx` or `\U{xxxxxx}` escape sequences, which yield a UTF-8-encoded 195 | string. 196 | 197 | ``` r 198 | one_and_a_half <- "\u0031\u00BD" 199 | one_and_a_half 200 | #> [1] "1½" 201 | Encoding(one_and_a_half) 202 | #> [1] "UTF-8" 203 | charToRaw(one_and_a_half) 204 | #> [1] 31 c2 bd 205 | ``` 206 | 207 | Here are some handy ways to find the Unicode code points for an existing 208 | string: 209 | 210 | - Copy and paste into the [Unicode character 211 | inspector](https://apps.timwhitlock.info/unicode/inspect). 212 | - `sprintf("\\u%04X", utf8ToInt(x))` 213 | - *Do we have other suggestions?* 214 | 215 | If you need to include non-UTF-8 non-ASCII characters, e.g. for testing, 216 | then include them via raw data, or via `\xnn` (hex) escape sequences. E.g. to 217 | include a latin1 string, you can do either of these: 218 | 219 | ``` r 220 | t1 <- "t\xfck\xf6rf\xfar\xf3g\xe9p" 221 | Encoding(t1) <- "latin1" 222 | t2 <- rawToChar(as.raw(c( 223 | 0x74, 0xfc, 0x6b, 0xf6, 0x72, 0x66, 0xfa, 0x72, 224 | 0xf3, 0x67, 0xe9, 0x70))) 225 | Encoding(t2) <- "latin1" 226 | t1 == t2 227 | #> [1] TRUE 228 | identical(charToRaw(t1), charToRaw(t2)) 229 | #> [1] TRUE 230 | ``` 231 | 232 | ## Symbols 233 | 234 | Symbols must be in the native encoding, so best practice is to only use 235 | ASCII symbols. 236 | 237 | This is not completely equivalent to using ASCII code files, because 238 | names, e.g. in a named list, or the row names in a data frame are also 239 | symbols in R. So do not use non-ASCII names.
(Another reason to avoid 240 | row names in data frames.) 241 | 242 | A typical mistake is to assign file names as list or row names when 243 | working with files. While it might be a convenient representation, it is 244 | a bad idea because of the possible encoding errors. 245 | 246 | Similarly, always use ASCII column names. 247 | 248 | ## Unit tests 249 | 250 | Unit test files are code files, so the same rule applies to them: keep 251 | them ASCII. See the ‘How-to’ section below on specific testing-related 252 | tasks. 253 | 254 | ## Manual pages 255 | 256 | You can use *some* UTF-8 characters in manual pages, if you include 257 | `\encoding{UTF-8}` in the manual page, or declare `Encoding: UTF-8` 258 | in the `DESCRIPTION` file of the package. 259 | 260 | Unfortunately these do not include the characters typically used in the 261 | output of tibble and cli. The problem is with the PDF manual and LaTeX, 262 | so you can conditionally include UTF-8 in the HTML output with the `\if` 263 | Rd command. 264 | 265 | If your PDF manual does not build because a non-supported UTF-8 266 | character has sneaked in somewhere, `tools::showNonASCIIfile()` can show 267 | exactly where it is. 268 | 269 | ## Vignettes 270 | 271 | TODO 272 | 273 | ## Graphics 274 | 275 | TODO 276 | 277 | # How-to 278 | 279 | ## How to query the current native encoding? 280 | 281 | R uses the native encoding extensively. Some examples: 282 | 283 | - Symbols (i.e. variable names, names in a named list, column and row 284 | names, etc.) are always in the native encoding. 285 | - Strings marked with `"unknown"` encoding are assumed to be in this 286 | encoding. 287 | - Output connections assume that the output is in this encoding. 288 | - Input connections convert the input into this encoding. 289 | - `enc2native()` encodes its input into this encoding. 290 | - Printing to the screen (via `cat()`, `print()`, `writeLines()`, etc.) 291 | re-encodes strings into this encoding.
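A small illustration of the `enc2native()` case (a sketch; the resulting bytes depend on your platform's native encoding):

``` r
x <- "\u00fc"              # "ü", always UTF-8 thanks to the \u escape
Encoding(x)                # "UTF-8"
charToRaw(enc2native(x))   # c3 bc on a UTF-8 platform, fc on latin1
```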
292 | 293 | It is surprisingly tricky to query the current native encoding, 294 | especially on older R versions. Here are a number of things to try: 295 | 296 | 1. Call `l10n_info()`. If the native encoding is UTF-8 or latin1 you’ll 297 | see that in the output: 298 | 299 | ``` r 300 | ❯ l10n_info() 301 | $MBCS 302 | [1] TRUE 303 | 304 | $`UTF-8` 305 | [1] TRUE 306 | 307 | $`Latin-1` 308 | [1] FALSE 309 | ``` 310 | 311 | 2. On Windows, you’ll also see the code page: 312 | 313 | ``` r 314 | ❯ l10n_info() 315 | $MBCS 316 | [1] FALSE 317 | 318 | $`UTF-8` 319 | [1] FALSE 320 | 321 | $`Latin-1` 322 | [1] TRUE 323 | 324 | $codepage 325 | [1] 1252 326 | 327 | $system.codepage 328 | [1] 1252 329 | ``` 330 | 331 | Append the codepage number to `"CP"` to get a string that you can 332 | use in `iconv()`. E.g. in this case it would be `"CP1252"`. 333 | 334 | 3. On Unix, from R 4.1 the name of the encoding is also included in the 335 | answer. This is useful on non-UTF-8 systems: 336 | 337 | ``` r 338 | ❯ l10n_info() 339 | $MBCS 340 | [1] FALSE 341 | 342 | $`UTF-8` 343 | [1] FALSE 344 | 345 | $`Latin-1` 346 | [1] FALSE 347 | 348 | $codeset 349 | [1] "ISO-8859-15" 350 | ``` 351 | 352 | 4. On older R versions this field is not included. On R 3.5 and above 353 | you can use the following trick to find the encoding: 354 | 355 | ``` r 356 | ❯ rawToChar(serialize(NULL, NULL, ascii = TRUE, version = 3)) 357 | [1] "A\n3\n262148\n197888\n5\nUTF-8\n254\n" 358 | ``` 359 | 360 | Line number 6 (i.e. after the fifth `\n`) in the output is the name 361 | of the encoding. On R 4.0.x you can also save an RDS file and then 362 | use the `infoRDS()` function on it to see the current native 363 | encoding. 364 | 365 | 5. 
Otherwise you can call `Sys.getlocale()`; parsing its output will 366 | probably give you an encoding name that works in `iconv()`: 367 | 368 | ``` r 369 | ❯ Sys.getlocale("LC_CTYPE") 370 | [1] "en_US.iso885915" 371 | ``` 372 | 373 | ## How is the `encoding` argument and option used? 374 | 375 | In general, functions use the `encoding` argument in two ways: 376 | 377 | - For some functions it specifies the encoding of the input. 378 | - For others it specifies how output should be marked. 379 | 380 | For example 381 | 382 | ``` r 383 | file("input.txt", encoding = "UTF-8") 384 | ``` 385 | 386 | tells `file()` that `input.txt` is in `UTF-8`. On the other hand 387 | 388 | ``` r 389 | scan("input2.txt", what = "", encoding = "UTF-8") 390 | ``` 391 | 392 | means that the output of `scan()` will be *marked* as having UTF-8 393 | encoding. (`scan()` also has a `fileEncoding` argument, which specifies 394 | the encoding of the input file.) 395 | 396 | TODO: the weirdness of `readLines()`. 397 | 398 | ## How to read lines from UTF-8 files? 399 | 400 | ### With the brio package 401 | 402 | The simplest option is to use the brio package, if you can depend on it: 403 | 404 | ``` r 405 | brio::read_lines("file") 406 | ``` 407 | 408 | (brio also has a `read_file()` function if you want to read the whole 409 | file into a single string.) 410 | 411 | brio does not currently check if the file is valid UTF-8. See below 412 | for how to do that. 413 | 414 | ### With base R only 415 | 416 | The xfun package has a nice function that reads UTF-8 files with base R 417 | only: `xfun::read_utf8()`. It currently reads like this, after some 418 | simplification: 419 | 420 | ``` r 421 | read_utf8 <- function (path) { 422 | opts <- options(encoding = "native.enc") 423 | on.exit(options(opts), add = TRUE) 424 | x <- readLines(path, encoding = "UTF-8", warn = FALSE) 425 | x 426 | } 427 | ``` 428 | 429 | (The original function also checks that the returned lines are valid 430 | UTF-8.)
431 | 432 | Explanation and notes: 433 | 434 | - `path` must be a file name, or an un-opened connection, or a 435 | connection that was opened in the native encoding (i.e. with 436 | `encoding = "native.enc"`). Otherwise `read_utf8()` might (silently) 437 | return lines in the wrong encoding. 438 | - If `readLines()` gets a file name, then it opens a connection to it in 439 | `native.enc` (because of the `options()` call), so the file is not re-encoded, 440 | and the `encoding = "UTF-8"` argument marks the result as `UTF-8`. 441 | - For extra safety, you can add a check that `path` is a file name. 442 | - As far as I can tell, there is no R API to query the encoding of a 443 | connection. 444 | 445 | TODO: does this work correctly with non-ASCII file names? 446 | 447 | ## How to write lines to UTF-8 files? 448 | 449 | ### With the brio package 450 | 451 | Call `brio::write_lines()` to write a character vector to a file. Notes: 452 | 453 | 1. `brio::write_lines()` converts the input character vector to UTF-8. 454 | It uses the marked encoding for this, so if that is not correct, 455 | then the conversion won’t be correct, either. 456 | 2. It handles UTF-8 file names correctly. 457 | 3. `brio::write_lines()` cannot write to connections currently, because 458 | with already opened connections there is no way to tell their 459 | encoding.
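A minimal round trip might look like this (a sketch, assuming the brio package is installed; `brio::write_lines()` writes a character vector line by line, and `brio::read_lines()` reads it back marked as UTF-8):

``` r
txt <- c("t\u00fck\u00f6r", "second line")
path <- tempfile()

brio::write_lines(txt, path)   # always writes UTF-8 bytes
brio::read_lines(path)         # read back, marked as UTF-8
```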
460 | 461 | ### With base R 462 | 463 | Kevin Ushey’s excellent [blog 464 | post](https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/) 465 | has a detailed derivation of this base R function: 466 | 467 | ``` r 468 | write_utf8 <- function(text, path) { 469 | # step 1: ensure our text is utf8 encoded 470 | utf8 <- enc2utf8(text) 471 | upath <- enc2utf8(path) 472 | 473 | # step 2: create a connection with 'native' encoding 474 | # this signals to R that translation before writing 475 | # to the connection should be skipped 476 | con <- file(upath, open = "w+", encoding = "native.enc") 477 | on.exit(close(con), add = TRUE) 478 | 479 | # step 3: write to the connection with 'useBytes = TRUE', 480 | # telling R to skip translation to the native encoding 481 | writeLines(utf8, con = con, useBytes = TRUE) 482 | } 483 | ``` 484 | 485 | (This is a slightly more robust version than the one in the blog post.) 486 | 487 | TODO: does this work correctly with non-ASCII file names? 488 | 489 | ## How to convert text between encodings? 490 | 491 | The `iconv()` base R function can convert between encodings. It ignores 492 | the `Encoding()` markings of the input completely. It uses `""` as the 493 | synonym for the native encoding. 494 | 495 | ## How to guess the encoding of text? 496 | 497 | The `stringi::stri_enc_detect()` function tries to detect the encoding 498 | of a string. This is mostly guessing, of course. 499 | 500 | ## How to check if a string is UTF-8? 501 | 502 | Not all bytes and byte sequences are valid in UTF-8. There are a few 503 | alternatives to check if a string is valid: 504 | 505 | - `stringi::stri_enc_isutf8()` 506 | 507 | - `utf8::utf8_valid()` 508 | 509 | - The following trick using only base R: 510 | 511 | ``` r 512 | ! is.na(iconv(x, "UTF-8", "UTF-8")) 513 | ``` 514 | 515 | `iconv()` returns `NA` for the elements that are invalid in the input 516 | encoding. 517 | 518 | ## How to download web pages in the right encoding?
519 | 520 | TODO 521 | 522 | ## Are RDS files encoding-safe? 523 | 524 | No, in general they are not, but the situation is quite good. If you 525 | save an RDS file that contains text in some encoding, it is safe to read 526 | that file on a different computer (or the same computer with different 527 | settings), if at least one of these conditions holds: 528 | 529 | - all text is either UTF-8 or latin1 encoded, and it is also marked 530 | as such, or 531 | - both computers (or settings) have the same native encoding, or 532 | - the RDS file has version 3, and the loading platform can represent all 533 | characters in the RDS file. This usually holds if the loading platform 534 | is UTF-8. 535 | 536 | Note that from RDS version 3 the strings in the native encoding are 537 | re-encoded to the current native encoding when the RDS file is loaded. 538 | 539 | ## How to `parse()` UTF-8 text into UTF-8 code? 540 | 541 | On R 3.6 and before, we need to create a connection to create code with 542 | UTF-8 strings from a UTF-8 character vector. It goes like this: 543 | 544 | ``` r 545 | safe_parse <- function(text) { 546 | text <- enc2utf8(text) 547 | Encoding(text) <- "unknown" 548 | con <- textConnection(text) 549 | on.exit(close(con), add = TRUE) 550 | eval(parse(con, encoding = "UTF-8")) 551 | } 552 | ``` 553 | 554 | - By marking the text as `unknown` we make sure that `textConnection()` 555 | will not convert it. 556 | - The `encoding` argument of `parse()` marks the output as `UTF-8`. 557 | - Note that when `parse()` parses from a connection it will lose the 558 | source references. You can use the `srcfile` argument of `parse()` to 559 | add them back. 560 | 561 | ## How to `deparse()` UTF-8 code into UTF-8 text? 562 | 563 | TODO 564 | 565 | ## How to include UTF-8 characters in a package `DESCRIPTION` 566 | 567 | Some fields in `DESCRIPTION` may contain non-ASCII characters.
Set the 568 | `Encoding` field to `UTF-8` to use UTF-8 characters in these: 569 | 570 | Encoding: UTF-8 571 | 572 | ## How to get a package `DESCRIPTION` in UTF-8? 573 | 574 | The desc package always converts `DESCRIPTION` files to `UTF-8`. 575 | 576 | ## How to use UTF-8 file names on Windows? 577 | 578 | You can use functions or a package that already handles UTF-8 file 579 | names, e.g. brio or processx. If you need to open a file from C code, 580 | you need to do the following: 581 | 582 | - Convert the file name to UTF-8 just before passing it to C. In generic 583 | code `enc2utf8()` will do. 584 | - In the C code, on Windows, convert the UTF-8 path to UTF-16 using the 585 | `MultiByteToWideChar()` Windows API function. 586 | - Use the `_wfopen()` Windows API function to open the file. 587 | 588 | Here is an example from the brio package: 589 | 590 | 591 | ## How to capture UTF-8 output? 592 | 593 | This is only possible on UTF-8 systems, and there is no portable way 594 | that works on Windows or other non-UTF-8 platforms. If the native 595 | encoding is UTF-8, then `sink()` and `capture.output()` will create text 596 | in UTF-8. 597 | 598 | Capturing UTF-8 output on Windows (or a Unix system that does not 599 | support a UTF-8 locale) is not currently possible. (Well, on Windows 600 | there is a very dirty trick, but it involves changing an internal R 601 | flag, and running the code in a subprocess.) 602 | 603 | ### Are testthat snapshot tests encoding-safe? 604 | 605 | Somewhat. testthat 3 by default turns off non-ASCII output from cli (and 606 | thus tibble, etc.). So your snapshots that use cli facilities will be in 607 | ASCII, which is safe to use anywhere. 608 | 609 | If you have non-ASCII output from other sources, then hopefully there is a 610 | switch to turn it off. As far as I can tell, rlang uses cli to produce 611 | non-ASCII output, so it should be fine. 612 | 613 | ## How to test non-ASCII output?
614 | 615 | If you need to test non-ASCII output produced by the cli package, then 616 | 617 | - use snapshot tests, 618 | - call `testthat::local_reproducible_output(unicode = TRUE)` at the 619 | beginning of your test, 620 | - record your snapshots on a UTF-8 platform. 621 | 622 | Your tests will not run on non-UTF-8 platforms. It is not currently 623 | possible to record non-ASCII snapshot tests on non-UTF-8 platforms. 624 | 625 | ## How to avoid non-ASCII characters in the manual? 626 | 627 | If you accidentally included some non-ASCII characters in the manual, 628 | e.g. because you included some UTF-8 output from cli or tibble/pillar, 629 | then you can set `options(cli.unicode = FALSE)` at the right place to 630 | avoid them. 631 | 632 | The `tools::showNonASCIIfile()` function helps find non-ASCII 633 | characters in a package. 634 | 635 | ## How to include non-ASCII characters in PDF vignettes? 636 | 637 | TODO 638 | 639 | # Encoding related R functions and packages 640 | 641 | - `base::charToRaw()` 642 | 643 | - `base::Encoding()` and `` base::`Encoding<-` `` 644 | 645 | - `base::iconv()` 646 | 647 | - `base::iconvlist()` 648 | 649 | - `base::nchar()` 650 | 651 | - `base::l10n_info()` 652 | 653 | - [stringi package](https://cran.rstudio.com/web/packages/stringi/) 654 | 655 | - [utf8 package](https://cran.rstudio.com/web/packages/utf8) 656 | 657 | - [brio package](https://cran.rstudio.com/web/packages/brio) 658 | 659 | - [cli package](https://cran.rstudio.com/web/packages/cli/) 660 | 661 | - [Unicode package](https://cran.rstudio.com/web/packages/Unicode) 662 | 663 | # Known encoding issues in packages and functions 664 | 665 | - `yaml::write_yaml` crashes on Windows on latin1-encoded strings: 666 | 667 | 668 | # Tips for debugging encoding issues 669 | 670 | - `charToRaw()` is your best friend. 671 | 672 | - Don’t forget: even if two strings print the same, and even if they are `identical()`, they 673 | can still be in a different encoding.
`charToRaw()` is your best 674 | friend. 675 | 676 | - `testthat::CheckReporter` saves a `testthat-problems.rds` file, if 677 | there were any test failures. You can get this file from win-builder, 678 | R-hub, etc. The file is a version 2 RDS file, so no encoding 679 | conversion will be done by `readRDS()`. 680 | 681 | - Don’t trust any function that processes text. Some functions keep the 682 | encoding of the input, some convert to the native encoding, some 683 | convert to UTF-8. Some convert to UTF-8, without marking the output as 684 | UTF-8. Different R versions do different re-encodings. 685 | 686 | - Typing in a string on the console is not the same as `parse()`-ing the 687 | same code from a file. The console is always assumed to provide text 688 | in the native encoding, package files are typically assumed UTF-8. 689 | (This is why it is important to use `\uxxxx` escape sequences for 690 | UTF-8 text.) 691 | 692 | # Text transformers 693 | 694 | - `print()` 695 | 696 | - `format()` 697 | 698 | - `normalizePath()` 699 | 700 | - `basename()` 701 | 702 | - `encodeString()` 703 | 704 | # Changes in R 4.1.0 705 | 706 | TODO 707 | 708 | # Changes in R 4.2.0 709 | 710 | TODO 711 | 712 | # Further Reading 713 | 714 | Good general introductions to text encodings: 715 | 716 | - [The Absolute Minimum Every Software Developer Absolutely, Positively 717 | Must Know About Unicode and Character Sets (No 718 | Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) 719 | 720 | - [What Every Programmer Absolutely, Positively Needs To Know About 721 | Encodings And Character Sets To Work With 722 | Text](https://kunststube.net/encoding/) 723 | 724 | - [UTF-8 Everywhere](http://utf8everywhere.org) 725 | 726 | R documentation: 727 | 728 | - [Encodings for 729 | CHARSXPs](https://cran.r-project.org/doc/manuals/r-devel/R-ints.html#Encodings-for-CHARSXPs) 
730 | section in [R 731 | internals](https://cran.r-project.org/doc/manuals/r-devel/R-ints.html) 732 | 733 | - [Writing portable packages / Encoding 734 | issues](https://cran.r-project.org/doc/manuals/R-exts.html#Encoding-issues) 735 | in [Writing R 736 | Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) 737 | 738 | - [Encodings and 739 | R](https://developer.r-project.org/Encodings_and_R.html) – *Old and 740 | slightly outdated, it gives a good summary of what it took to change 741 | R’s internal encoding.* 742 | 743 | Intros to various encodings: 744 | 745 | - [UTF-8 on Wikipedia](https://en.wikipedia.org/wiki/UTF-8) 746 | 747 | - [UTF-16 on Wikipedia](https://en.wikipedia.org/wiki/UTF-16) 748 | 749 | - [latin-1 (ISO/IEC 8859-1) on 750 | Wikipedia](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) 751 | 752 | About Unicode: 753 | 754 | - [The main Unicode homepage](https://home.unicode.org/) 755 | 756 | - [Unicode FAQs](https://unicode.org/faq/) 757 | -------------------------------------------------------------------------------- /rencfaq.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | 15 | AutoAppendNewline: Yes 16 | StripTrailingWhitespace: Yes 17 | -------------------------------------------------------------------------------- /what-every.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "What every R developer must know about encodings" 3 | output: 4 | github_document: 5 | toc: true 6 | toc_depth: 3 7 | --- 8 | 9 | ## Why are encodings so hard (in R)? 10 | 11 | - Impossible to interpret a piece of text without external information. 
12 | - R's internal encoding is different on different platforms, which makes it hard to write (and test!) portable code, and to transfer data. 13 | - R's functions do (seemingly) random encoding conversions. 14 | 15 | ## Important encodings (for the R developer) 16 | 17 | ### UTF-8 18 | 19 | #### What is Unicode? 20 | 21 | Unicode is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. 22 | 23 | #### Numbered characters 24 | 25 | - The first 128 are the same as ASCII, then come the rest of the 143,859 characters (Unicode 13.0). 26 | 27 | - 1,112,064 possible characters. (The original plan was much, much more.) 28 | 29 | - You can create text that encodes Unicode characters with `\u` and `\U` escapes: 30 | 31 | ```{r} 32 | "Just a normal string with \u00fc is \u2713 \U{1F60A}" 33 | ``` 34 | 35 | - R always encodes `\u` and `\U` in UTF-8. 36 | 37 | - Some are non-printing: zero width non-joiner, right-to-left mark, left-to-right mark, etc. 38 | 39 | - Some are combining: 40 | 41 | ```{r} 42 | "person + dark skin tone: \U1F9D1 \U1F3FF" 43 | "person + dark skin tone: \U1F9D1\U1F3FF" 44 | ``` 45 | 46 | #### UTF-8 encoding 47 | 48 | - Encodes all Unicode characters. 49 | 50 | - Variable-length encoding: between 1 and 4 bytes. Think about the implications of this! 51 | 52 | - Includes ASCII, in the exact same encoding. 53 | 54 | - Again, R always encodes `\u` and `\U` in UTF-8; use them for UTF-8 string literals. 55 | 56 | - 57 | 58 | ### ASCII 59 | 60 | - Ancient. 128 characters, stored on one byte, the highest bit is zero. 61 | 62 | ### latin1 (and CP1252) 63 | 64 | - Extension of ASCII, by defining the characters when the high bit is one. 65 | 66 | - CP1252 is a modified latin1; it substitutes some control characters with printable ones. 67 | 68 | - Usually this is the default native encoding of R in the US and Western Europe.
69 | 70 | - Covers most Western European special characters: 71 | 72 | ### UTF-16 73 | 74 | - Encodes all of Unicode. 75 | 76 | - Each Unicode character is two or four bytes. 77 | 78 | - Internal encoding of Windows, Javascript, (older) Python. 79 | 80 | - We only need to deal with it in R when communicating with the Windows API. Most commonly this means passing file paths to Windows. 81 | 82 | - R cannot represent this in a string, because it has embedded zeros. 83 | 84 | - Best is to convert from/to UTF-8 immediately before passing to or getting it from Windows. (This usually happens in C code.) 85 | 86 | ## How R stores text data 87 | 88 | - Zero-terminated bytes. (Hence, no UTF-16 possible.) 89 | 90 | ```{r} 91 | charToRaw("hello world!") 92 | ``` 93 | 94 | - R often assumes that strings are in the *native* encoding. E.g. symbols have to be in the native encoding, connections convert to the native encoding, so does printing, etc. 95 | 96 | - The native encoding can be different on different platforms. *Yes, this is the biggest challenge when writing portable R code that deals with strings.* 97 | 98 | - Use `l10n_info()` to query the native encoding. (But see the FAQ for the real answer.) 99 | 100 | ```{r} 101 | l10n_info() 102 | ``` 103 | 104 | - It is possible to *declare* the encoding of a string (not character vector). (!) 105 | 106 | - But alas... fun with `Encoding()` : 107 | 108 | ```{r} 109 | x <- "\xfb" 110 | x 111 | ``` 112 | 113 | ```{r} 114 | charToRaw(x) 115 | ``` 116 | 117 | ```{r} 118 | Encoding(x) <- "latin1" 119 | x 120 | ``` 121 | 122 | ```{r} 123 | y <- "\xfb" 124 | Encoding(y) <- "latin2" 125 | y 126 | ``` 127 | 128 | ```{r} 129 | Encoding(y) 130 | ``` 131 | 132 | - The encoding information is only stored on two bits. So four values are possible: 133 | 134 | - `UTF-8` 135 | 136 | - `latin1` 137 | 138 | - `bytes` 139 | 140 | - `unknown` 141 | 142 | - If the native encoding is not UTF-8 or latin1, native strings are marked as `unknown`. 
But `unknown` means different things on different platforms. 143 | 144 | ## What encoding should I use? 145 | 146 | UTF-8. Only use something else if you really have to. Convert to UTF-8 as soon as possible. Convert to something else as late as possible. 147 | 148 | ## How-to? 149 | 150 | - How to convert to UTF-8? 151 | 152 | - How to check if a string is UTF-8? 153 | 154 | - How to read a file in UTF-8? 155 | 156 | - How to write a file in UTF-8? 157 | 158 | ## Debugging encoding issues 159 | 160 | ### Common issues 161 | 162 | #### Don't let the printing fool you 163 | 164 | R converts strings to the native encoding when printing them to the screen. This is important to know when debugging encoding problems: two strings that print the same way may have a different internal representation: 165 | 166 | ```{r} 167 | s1 <- "\xfc" 168 | Encoding(s1) <- "latin1" 169 | s2 <- iconv(s1, "latin1", "UTF-8") 170 | s1 171 | s2 172 | ``` 173 | 174 | ```{r} 175 | s1 == s2 176 | identical(s1, s2) 177 | testthat::expect_equal(s1, s2) 178 | testthat::expect_identical(s1, s2) 179 | ``` 180 | 181 | ```{r} 182 | Encoding(s1) 183 | Encoding(s2) 184 | ``` 185 | 186 | ```{r} 187 | charToRaw(s1) 188 | charToRaw(s2) 189 | ``` 190 | 191 | #### Beware the silent conversions 192 | 193 | R functions may change the encoding silently. All functions that transform text are suspicious.
Older R versions are usually worse: 194 | 195 | ``` r 196 | > x <- "ü" 197 | > Encoding(x) 198 | [1] "latin1" 199 | > charToRaw(x) 200 | [1] fc 201 | ``` 202 | 203 | ``` r 204 | > ux <- enc2utf8(x) 205 | > Encoding(ux) 206 | [1] "UTF-8" 207 | > charToRaw(ux) 208 | [1] c3 bc 209 | ``` 210 | 211 | ``` r 212 | > nux <- normalizePath(ux, mustWork = FALSE) 213 | > nux 214 | [1] "C:\\Users\\Gabor\\works\\processx\\ü" 215 | > Encoding(nux) 216 | [1] "unknown" 217 | > charToRaw(nux) 218 | [1] 43 3a 5c 55 73 65 72 73 5c 47 219 | [11] 61 62 6f 72 5c 77 6f 72 6b 73 220 | [21] 5c 70 72 6f 63 65 73 73 78 5c 221 | [31] fc 222 | ``` 223 | 224 | ``` r 225 | > basename(nux) 226 | [1] "ü" 227 | > Encoding(basename(nux)) 228 | [1] "UTF-8" 229 | > charToRaw(basename(nux)) 230 | [1] c3 bc 231 | ``` 232 | 233 | #### Display width is off everywhere 234 | 235 | Aligning text with *wide* Unicode characters is hard. 236 | 237 | ### Tips 238 | 239 | - `charToRaw()` is your best friend. 240 | 241 | - Don't forget: even if two strings print the same, and even if they are `identical()`, they can still be in a different encoding. `charToRaw()` is your best friend. 242 | 243 | - `testthat::CheckReporter` saves a `testthat-problems.rds` file if there were any test failures. You can get this file from win-builder, R-hub, etc. The file is a version 2 RDS file, so no encoding conversion will be done by `readRDS()`. 244 | 245 | - Don't trust any function that processes text. Some functions keep the encoding of the input, some convert to the native encoding, some convert to UTF-8. Some convert to UTF-8, without marking the output as UTF-8. Different R versions do different re-encodings. 246 | 247 | - Typing in a string on the console is not the same as `parse()`-ing the same code from a file. The console is always assumed to provide text in the native encoding, package files are typically assumed UTF-8. (This is why it is important to use `\uxxxx` escape sequences for UTF-8 text.)
248 | -------------------------------------------------------------------------------- /what-every.md: -------------------------------------------------------------------------------- 1 | What every R developer must know about encodings 2 | ================ 3 | 4 | - [Why are encodings so hard (in R)?](#why-are-encodings-so-hard-in-r) 5 | - [Important encodings (for the R 6 | developer)](#important-encodings-for-the-r-developer) 7 | - [UTF-8](#utf-8) 8 | - [ASCII](#ascii) 9 | - [latin1 (and CP1252)](#latin1-and-cp1252) 10 | - [UTF-16](#utf-16) 11 | - [How R stores text data](#how-r-stores-text-data) 12 | - [What encoding should I use?](#what-encoding-should-i-use) 13 | - [How-to?](#how-to) 14 | - [Debugging encoding issues](#debugging-encoding-issues) 15 | - [Common issues](#common-issues) 16 | - [Tips](#tips) 17 | 18 | ## Why are encodings so hard (in R)? 19 | 20 | - It is impossible to interpret a piece of text without external 21 | information. 22 | - R’s internal encoding is different on different platforms, which 23 | makes it hard to write (and test!) portable code, and to transfer 24 | data. 25 | - R’s functions do (seemingly) random encoding conversions. 26 | 27 | ## Important encodings (for the R developer) 28 | 29 | ### UTF-8 30 | 31 | #### What is Unicode? 32 | 33 | Unicode is an information technology standard for the consistent 34 | encoding, representation, and handling of text expressed in most of the 35 | world’s writing systems. 36 | 37 | #### Numbered characters 38 | 39 | - The first 128 characters are the same as ASCII; the rest of the 143,859 40 | characters (Unicode 13.0) follow them. 41 | 42 | - There are 1,112,064 possible characters. (The original plan allowed for far more.) 43 | 44 | - You can create text that encodes Unicode characters with `\u` and 45 | `\U` escapes: 46 | 47 | ``` r 48 | "Just a normal string with \u00fc is \u2713 \U{1F60A}" 49 | ``` 50 | 51 | ## [1] "Just a normal string with ü is ✓ 😊" 52 | 53 | - R always encodes `\u` and `\U` in UTF-8.
54 | 55 | - Some are non-printing: zero width non-joiner, right-to-left mark, 56 | left-to-right mark, etc. 57 | 58 | - Some are combining: 59 | 60 | ``` r 61 | "person + dark skin tone: \U1F9D1 \U1F3FF" 62 | ``` 63 | 64 | ## [1] "person + dark skin tone: 🧑 🏿" 65 | 66 | ``` r 67 | "person + dark skin tone: \U1F9D1\U1F3FF" 68 | ``` 69 | 70 | ## [1] "person + dark skin tone: 🧑🏿" 71 | 72 | #### UTF-8 encoding 73 | 74 | - Encodes all Unicode characters. 75 | 76 | - Variable-length encoding: between 1 and 4 bytes. Think about the 77 | implications of this! 78 | 79 | - Includes ASCII, in exactly the same encoding. 80 | 81 | - Again, R always encodes `\u` and `\U` in UTF-8; use them for UTF-8 82 | string literals. 83 | 84 | 85 | 86 | ### ASCII 87 | 88 | - Ancient. 128 characters, each stored in one byte; the highest bit is 89 | zero. 90 | 91 | ### latin1 (and CP1252) 92 | 93 | - An extension of ASCII that defines the characters with the high bit 94 | set to one. 95 | 96 | - CP1252 is a modified latin1: it substitutes some control 97 | characters with printable ones. 98 | 99 | - Usually this is the default native encoding of R in the US and 100 | Western Europe. 101 | 102 | - Covers most Western European special characters. 103 | 104 | 105 | ### UTF-16 106 | 107 | - Encodes all of Unicode. 108 | 109 | - Each Unicode character is two or four bytes. 110 | 111 | - Internal encoding of Windows, JavaScript, and (older) Python. 112 | 113 | - We only need to deal with it in R when communicating with the 114 | Windows API. Most commonly this means passing file paths to Windows. 115 | 116 | - R cannot represent UTF-16 in a string, because UTF-16 text can contain embedded zero bytes. 117 | 118 | - It is best to convert from/to UTF-8 immediately before passing text to, 119 | or right after getting it from, Windows. (This usually happens in C code.) 120 | 121 | ## How R stores text data 122 | 123 | - Zero-terminated bytes. (Hence, no UTF-16 possible.)
124 | 125 | ``` r 126 | charToRaw("hello world!") 127 | ``` 128 | 129 | ## [1] 68 65 6c 6c 6f 20 77 6f 72 6c 64 21 130 | 131 | - R often assumes that strings are in the *native* encoding. E.g. 132 | symbols have to be in the native encoding, connections convert to 133 | the native encoding, and so does printing, etc. 134 | 135 | - The native encoding can be different on different platforms. *Yes, 136 | this is the biggest challenge when writing portable R code that 137 | deals with strings.* 138 | 139 | - Use `l10n_info()` to query the native encoding. (But see the FAQ for 140 | the real answer.) 141 | 142 | ``` r 143 | l10n_info() 144 | ``` 145 | 146 | ## $MBCS 147 | ## [1] TRUE 148 | ## 149 | ## $`UTF-8` 150 | ## [1] TRUE 151 | ## 152 | ## $`Latin-1` 153 | ## [1] FALSE 154 | ## 155 | ## $codeset 156 | ## [1] "UTF-8" 157 | 158 | - It is possible to *declare* the encoding of a string (but not of a whole 159 | character vector). (!) 160 | 161 | - But alas… fun with `Encoding()`: 162 | 163 | ``` r 164 | x <- "\xfb" 165 | x 166 | ``` 167 | 168 | ## [1] "\xfb" 169 | 170 | ``` r 171 | charToRaw(x) 172 | ``` 173 | 174 | ## [1] fb 175 | 176 | ``` r 177 | Encoding(x) <- "latin1" 178 | x 179 | ``` 180 | 181 | ## [1] "û" 182 | 183 | ``` r 184 | y <- "\xfb" 185 | Encoding(y) <- "latin2" 186 | y 187 | ``` 188 | 189 | ## [1] "\xfb" 190 | 191 | ``` r 192 | Encoding(y) 193 | ``` 194 | 195 | ## [1] "unknown" 196 | 197 | - The encoding information is stored in only two bits, so four values 198 | are possible: 199 | 200 | - `UTF-8` 201 | 202 | - `latin1` 203 | 204 | - `bytes` 205 | 206 | - `unknown` 207 | 208 | - If the native encoding is not UTF-8 or latin1, native strings are 209 | marked as `unknown`. But `unknown` means different things on 210 | different platforms. 211 | 212 | ## What encoding should I use? 213 | 214 | UTF-8. Only use something else if you really have to. Convert to UTF-8 215 | as soon as possible. Convert to something else as late as possible. 216 | 217 | ## How-to?
218 | 219 | - How to convert to UTF-8? 220 | 221 | - How to check if a string is UTF-8? 222 | 223 | - How to read a file in UTF-8? 224 | 225 | - How to write a file in UTF-8? 226 | 227 | ## Debugging encoding issues 228 | 229 | ### Common issues 230 | 231 | #### Don’t let the printing fool you. 232 | 233 | R converts strings to the native encoding when printing them to the 234 | screen. This is important to know when debugging encoding problems: two 235 | strings that print the same way may have different internal 236 | representations: 237 | 238 | ``` r 239 | s1 <- "\xfc" 240 | Encoding(s1) <- "latin1" 241 | s2 <- iconv(s1, "latin1", "UTF-8") 242 | s1 243 | ``` 244 | 245 | ## [1] "ü" 246 | 247 | ``` r 248 | s2 249 | ``` 250 | 251 | ## [1] "ü" 252 | 253 | ``` r 254 | s1 == s2 255 | ``` 256 | 257 | ## [1] TRUE 258 | 259 | ``` r 260 | identical(s1, s2) 261 | ``` 262 | 263 | ## [1] TRUE 264 | 265 | ``` r 266 | testthat::expect_equal(s1, s2) 267 | testthat::expect_identical(s1, s2) 268 | ``` 269 | 270 | ``` r 271 | Encoding(s1) 272 | ``` 273 | 274 | ## [1] "latin1" 275 | 276 | ``` r 277 | Encoding(s2) 278 | ``` 279 | 280 | ## [1] "UTF-8" 281 | 282 | ``` r 283 | charToRaw(s1) 284 | ``` 285 | 286 | ## [1] fc 287 | 288 | ``` r 289 | charToRaw(s2) 290 | ``` 291 | 292 | ## [1] c3 bc 293 | 294 | #### Beware the silent conversions 295 | 296 | R functions may change the encoding silently. All functions that 297 | transform text are suspect.
Older R versions are usually worse: 298 | 299 | ``` r 300 | > x <- "ü" 301 | > Encoding(x) 302 | [1] "latin1" 303 | > charToRaw(x) 304 | [1] fc 305 | ``` 306 | 307 | ``` r 308 | > ux <- enc2utf8(x) 309 | > Encoding(ux) 310 | [1] "UTF-8" 311 | > charToRaw(ux) 312 | [1] c3 bc 313 | ``` 314 | 315 | ``` r 316 | > nux <- normalizePath(ux, mustWork = FALSE) 317 | > nux 318 | [1] "C:\\Users\\Gabor\\works\\processx\\ü" 319 | > Encoding(nux) 320 | [1] "unknown" 321 | > charToRaw(nux) 322 | [1] 43 3a 5c 55 73 65 72 73 5c 47 323 | [11] 61 62 6f 72 5c 77 6f 72 6b 73 324 | [21] 5c 70 72 6f 63 65 73 73 78 5c 325 | [31] fc 326 | ``` 327 | 328 | ``` r 329 | > basename(nux) 330 | [1] "ü" 331 | > Encoding(basename(nux)) 332 | [1] "UTF-8" 333 | > charToRaw(basename(nux)) 334 | [1] c3 bc 335 | ``` 336 | 337 | #### Display width is off everywhere 338 | 339 | Aligning text with *wide* Unicode characters is hard. 340 | 341 | ### Tips 342 | 343 | - `charToRaw()` is your best friend. 344 | 345 | - Even if two strings print the same way, and even if they are `identical()`, 346 | they can still be in different encodings. `charToRaw()` is your 347 | best friend. 348 | 349 | - `testthat::CheckReporter` saves a `testthat-problems.rds` file if 350 | there were any test failures. You can get this file from 351 | win-builder, R-hub, etc. The file is a version 2 RDS file, so no 352 | encoding conversion will be done by `readRDS()`. 353 | 354 | - Don’t trust any function that processes text. Some functions keep 355 | the encoding of the input, some convert to the native encoding, and 356 | some convert to UTF-8. Some convert to UTF-8 without marking the output 357 | as UTF-8. Different R versions do different re-encodings. 358 | 359 | - Typing a string at the console is not the same as `parse()`-ing 360 | the same code from a file. The console is always assumed to provide 361 | text in the native encoding, while package files are typically assumed 362 | to be UTF-8.
(This is why it is important to use `\uxxxx` escape sequences 363 | for UTF-8 text.) 364 | --------------------------------------------------------------------------------
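To make such differences easy to spot, the tips above can be bundled into a small helper. `inspect_string()` is a hypothetical function (not from any package) that pairs each string with its declared encoding and its raw bytes:

``` r
# Hypothetical debugging helper: show each string's declared encoding
# next to its raw bytes, so identical-looking strings can be told apart.
inspect_string <- function(x) {
  data.frame(
    encoding = Encoding(x),
    bytes = vapply(x, function(s) paste(charToRaw(s), collapse = " "), ""),
    row.names = NULL
  )
}

s1 <- "\xfc"
Encoding(s1) <- "latin1"
s2 <- iconv(s1, "latin1", "UTF-8")

inspect_string(c(s1, s2))
##   encoding bytes
## 1   latin1    fc
## 2    UTF-8 c3 bc
```

Because the returned columns contain only ASCII (hex digits and encoding names), the output prints faithfully in any native encoding.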