├── .Rbuildignore
├── .github
│   └── workflows
│       └── publish.yml
├── .gitignore
├── 00-authors.qmd
├── 00-preface.qmd
├── 01-intro.qmd
├── 02-r-basics.qmd
├── 03-r-workflow.qmd
├── 04-r-strings.qmd
├── 05-acquiring-files.qmd
├── 06-acquiring-packages.qmd
├── 07-acquiring-internet.qmd
├── 08-quanteda-overview.qmd
├── 09-quanteda-corpus.qmd
├── 10-quanteda-tokens.qmd
├── 11-quanteda-tokensadvanced.qmd
├── 12-quanteda-dfms.qmd
├── 13-quanteda-fcms.qmd
├── 14-exploring-description.qmd
├── 15-exploring-kwic.qmd
├── 16-exploring-dictionaries.qmd
├── 17-exploring-frequencies.qmd
├── 18-comparing-sophistication.qmd
├── 19-comparing-documents.qmd
├── 20-comparing-features.qmd
├── 21-ml-supervised-scaling.qmd
├── 22-ml-unsupervised-scaling.qmd
├── 23-ml-classifiers.qmd
├── 24-ml-topicmodels.qmd
├── 25-lms-linguistic-parsing.qmd
├── 26-lms-embeddings.qmd
├── 27-lms-transformers.qmd
├── 28-further-tidy.qmd
├── 29-further-hardlanguages.qmd
├── 30-appendix-installation.qmd
├── 31-appendix-encoding.qmd
├── 32-appendix-regex.qmd
├── CODE_OF_CONDUCT.md
├── DESCRIPTION
├── LICENSE
├── README.md
├── TAUR.bib
├── _freeze
│   ├── 00-preface
│   │   └── execute-results
│   │       ├── html.json
│   │       └── tex.json
│   ├── 01-intro
│   │   ├── execute-results
│   │   │   ├── html.json
│   │   │   └── tex.json
│   │   ├── figure-html
│   │   │   ├── keynesspres-1.png
│   │   │   └── unnamed-chunk-14-1.png
│   │   └── figure-pdf
│   │       ├── keynesspres-1.pdf
│   │       └── unnamed-chunk-14-1.pdf
│   ├── 09-quanteda-corpus
│   │   └── execute-results
│   │       ├── html.json
│   │       └── tex.json
│   ├── 10-quanteda-tokens
│   │   └── execute-results
│   │       ├── html.json
│   │       └── tex.json
│   ├── 11-quanteda-tokensadvanced
│   │   └── execute-results
│   │       ├── html.json
│   │       └── tex.json
│   ├── 12-quanteda-dfms
│   │   └── execute-results
│   │       ├── html.json
│   │       └── tex.json
│   ├── 13-quanteda-dfms
│   │   └── execute-results
│   │       ├── html.json
│   │       └── tex.json
│   ├── 29-appendix-installation
│   │   └── execute-results
│   │       ├── html.json
│   │       └── tex.json
│   ├── 30-appendix-installation
│   │   └── execute-results
│   │       ├── html.json
│   │       └── tex.json
│   ├── index
│   │   └── execute-results
│   │       ├── html.json
│   │       └── tex.json
│   ├── preface
│   │   └── execute-results
│   │       ├── html.json
│   │       └── tex.json
│   └── site_libs
│       ├── clipboard
│       │   └── clipboard.min.js
│       ├── kePrint-0.0.1
│       │   └── kePrint.js
│       └── lightable-0.0.1
│           └── lightable.css
├── _quarto.yml
├── contributing.md
├── images
│   ├── TAURbook_cover.jpg
│   ├── TAURbook_cover_large.png
│   ├── TAURbook_cover_small.png
│   ├── basics-help.pdf
│   ├── caution.png
│   ├── important.png
│   ├── note.png
│   ├── tip.png
│   └── warning.png
├── index.qmd
├── latex
│   └── preamble.tex
├── references.qmd
└── welcome-html.qmd
/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^CODE_OF_CONDUCT\.md$
2 |
--------------------------------------------------------------------------------
/.github/workflows/publish.yml:
--------------------------------------------------------------------------------
1 | on:
2 | workflow_dispatch:
3 | push:
4 | branches: main
5 |
6 | name: Quarto Publish
7 |
8 | jobs:
9 | build-deploy:
10 | runs-on: ubuntu-latest
11 | permissions:
12 | contents: write
13 | steps:
14 | - name: Check out repository
15 | uses: actions/checkout@v3
16 |
17 | - name: Set up Quarto
18 | uses: quarto-dev/quarto-actions/setup@v2
19 |
20 | - name: Render Book project
21 | uses: quarto-dev/quarto-actions/render@v2
22 | with:
23 | to: html
24 |
25 | - name: Publish HTML book
26 | uses: quarto-dev/quarto-actions/publish@v2
27 | with:
28 | target: gh-pages
29 | render: false
30 | env:
31 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
32 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 | *.Rproj
6 | /.quarto/
7 | /_site/
8 | /_book/
9 | docs/
10 | /_book/
11 | .DS_Store
12 | Text-Analysis-Using-R-Quarto.Rproj
13 | README.html
14 | README_files/
15 | 01-intro_files
16 |
--------------------------------------------------------------------------------
/00-authors.qmd:
--------------------------------------------------------------------------------
1 | # About the Authors {.unnumbered}
2 |
3 | *Kenneth Benoit* is Professor of Computational Social Science in the Department of Methodology at the London School of Economics and Political Science. He is also (2020--present) Director of the LSE Data Science Institute. He has previously held positions in the Department of Political Science at Trinity College Dublin and at the Central European University (Budapest). He received his Ph.D. (1998) from Harvard University, Department of Government. His current research focuses on computational, quantitative methods for processing large amounts of textual data, mainly political texts and social media.
4 |
5 | *Stefan Müller* is an Assistant Professor and Ad Astra Fellow in the School of Politics and International Relations at University College Dublin. Previously, he was a Senior Researcher at the University of Zurich. He received his Ph.D. (2019) from Trinity College Dublin, Department of Political Science. His current research focuses on political representation, party competition, political communication, public opinion, and quantitative text analysis. His work has been published in journals such as the *American Political Science Review*, *The Journal of Politics*, *Political Communication*, the *European Journal of Political Research*, and *Political Science Research and Methods*, among others.
6 |
7 | The *Quanteda Initiative* is a non-profit company we founded in 2018 in London, devoted to the promotion of open-source text analysis software. It supports active development of these tools, in addition to providing training materials, training events, and sponsoring workshops and conferences. Its main product is the open-source **quanteda** package for R, but the Quanteda Initiative also supports a family of independent but interrelated packages for providing additional functionality for natural language processing and document management.
8 |
9 | Its core objectives, as stated in its charter, are to:
10 |
11 | - Support open-source, text analysis software developed for research and scientific analysis. These efforts focus mainly on the open-source software library quanteda, written for the R programming language, and its related family of extension packages.
12 |
13 | - Promote interoperability between text analytic software libraries, including those written in other languages such as Python, Java, or C++.
14 |
15 | - Provide ongoing user, technical, and development support for open-source text analysis software. Organize training and dissemination activities related to open-source text analysis software.
16 |
--------------------------------------------------------------------------------
/00-preface.qmd:
--------------------------------------------------------------------------------
1 | ::: hidden
2 | $$
3 | \def\quanteda{{\bf quanteda}}
4 | $$
5 | :::
6 |
7 | # Preface {.unnumbered}
8 |
9 | The uses and means for analysing text, especially using quantitative and computational approaches, have exploded in recent years across academic, industry, policy, and other forms of research and analysis. Text mining and text analysis have become the focus of a major methodological wave of innovation in the social sciences, especially political science but also extending to finance, media and communications, and sociology. In non-academic research and industry, furthermore, text mining applications are found in almost every sector, given the ubiquity of text and the need to analyse it.
10 |
11 | "Text analysis" is a broad label for a wide range of tools applied to *textual data*. It encompasses all manner of methods for turning unstructured raw data in the form of natural language documents into structured data that can be analysed systematically and using specific quantitative approaches. Text analytic methods include descriptive analysis, keyword analysis, topic analysis, measurement and scaling, clustering, text and vocabulary comparisons, or sentiment analysis. It may include causal analysis or predictive modelling. In their most advanced current forms, these methods extend to the natural language generation models that power the newest generation of artificial intelligence systems.
12 |
13 | This book provides a practical guide introducing the fundamentals of text analysis methods and how to implement them using the R programming language. It covers a wide range of topics to provide a useful resource for a range of students from complete beginners to experienced users of R wanting to learn more advanced techniques. Our emphasis is on the text analytic workflow as a whole, ranging from an introduction to text manipulation in R, all the way through to advanced machine learning methods for textual data.
14 |
15 | ## Why another book on text mining in R? {.unnumbered}
16 |
17 | Text mining tools for R have existed for many years, led by the venerable **tm** package [@tmpackage]. In 2015, the first version of **quanteda** [@quantedaJOSS] was published on CRAN. Since then, the package has undergone three major versions, with each release improving its consistency, power, and usability. Starting with version 2, **quanteda** split into a series of packages designed around its different functions (such as plotting, statistics, or machine learning), with **quanteda** retaining the core text processing functions. Other popular alternatives exist, such as **tidytext** [@tidytextJOSS], although as we explain in [Chapter -@sec-furthertidy], **quanteda** works perfectly well with the R "tidyverse" and related approaches to text analysis.
18 |
19 | This is hardly the only book covering how to work with text in R. @silge2017text introduced the **tidytext** package and numerous examples for mining text using the tidyverse approach to programming in R, including keyword analysis, mining word associations, and topic modelling. @kwartler2017text provides good coverage of text mining workflows, plus methods for text visualisation, sentiment scoring, clustering, and classification, as well as methods for sourcing and manipulating source documents. Earlier works include @turenne2016analyse and @becue2019textual.
20 |
21 | ::: {.callout-note icon="true"}
22 | ## One of the two hard things in computer science: Naming things
23 |
24 | Where does the package name come from? *quanteda* is a portmanteau name indicating the purpose of the package, which is the **qu**antitative **an**alysis of **te**xtual **da**ta.
25 | :::
26 |
27 | So why another book? First, the R ecosystem for text analysis is rapidly evolving, with the publication in recent years of massively improved, specialist text analysis packages such as **quanteda**. None of the earlier books covers this amazing family of packages. In addition, the field of text analysis methodologies has also advanced rapidly, including machine learning approaches based on artificial neural networks and deep learning. These and the packages that make use of them---for instance **spacyr** [@spacyr] for harnessing the power of the spaCy natural language processing library for Python [@spaCy]---have yet to be presented in a systematic, book-length treatment. Furthermore, as we are the authors and creators of many of the packages we cover in this book, we view this as the authoritative reference and how-to guide for using these packages.
28 |
29 | ## What to Expect from This Book {.unnumbered}
30 |
31 | This book is meant as a practical resource for those confronting the challenges of text analysis for the first time, focusing on how to do this in R. Our main focus is on the **quanteda** package and its extensions, although we also cover more general issues, including a brief overview of the R functions required to get started quickly with practical text analysis.
32 |
33 | @benoit2020text provides a detailed overview of the analysis of textual data, and what distinguishes textual data from other forms of data. It also clearly articulates what is meant by treating text "as data" for analysis. This book and the approaches it presents are firmly geared toward this mindset.
34 |
35 | Each chapter is structured to provide continuity across topics. For each main subject explained in a chapter, we clearly explain the objective of the chapter, then describe the text analytic methods in an applied fashion so that readers are aware of the workings of the method. We then provide practical examples, with detailed working R code showing how to implement the method. Next, we identify any special issues involved in correctly applying the method, including how to handle the more complicated situations that may arise in practice. Finally, we provide further reading for readers wishing to learn more, and exercises for those wishing for hands-on practice, or for assigning when using this book in a teaching environment.
36 |
37 | We have years of experience in teaching this material in many practical short courses, summer schools, and regular university courses. We have drawn extensively on this experience in designing the overall scope of this book and the structure of each chapter. Our goal is to make the book suitable for self-learning or for forming the basis of teaching and learning in a course on applied text analysis. Indeed, we have partly written this book to assign when teaching text analysis in our own curricula.
38 |
39 | ## Who This Book Is For {.unnumbered}
40 |
41 | We don't assume any previous knowledge of text mining---indeed, the goal of this book is to provide that, from a foundation through to some very advanced topics. Using this book does not require pre-existing familiarity with R, although, as the slogan goes, "every little helps". In Part I we cover some of the basics of R and how to make use of it for text analysis specifically. Readers will also learn more R through our extensive examples. However, experience in teaching and presenting this material tells us that a foundation in R will enable readers to advance through the applications far more rapidly than if they were learning R from scratch at the same time that they take the plunge into the possibly strange new world of text analysis.
42 |
43 | We are both academics, although we also have experience working in industry and in applying text analysis for non-academic purposes. The typical reader may be a student of text analysis in the literal sense (of being a student) or in the general sense of someone studying techniques in order to improve their practical and conceptual knowledge. Our orientation is that of social scientists, with a specialization in political text and political communications and media. But this book is for everyone: social scientists, computer scientists, scholars of digital humanities, researchers in marketing and management, and applied researchers working in policy, government, or business fields. This book is written to have the credibility and scholarly rigour (for referencing methods, for instance) needed by academic readers, but in a straightforward, jargon-free (as much as we were able!) manner, to be of maximum practical use to non-academic analysts as well.
44 |
45 | ## How the Book is Structured {.unnumbered}
46 |
47 | ### Sections {.unnumbered}
48 |
49 | The book is divided into seven sections. These group topics that we feel represent common stages of learning in text analysis, or similar groups of topics that different users will be drawn to. By grouping stages of learning, we also make it possible for intermediate or advanced users to jump to the section that interests them most, or to the sections where they feel they need additional learning.
50 |
51 | Our sections are:
52 |
53 | - **Working in R:** This section is designed for beginners, to quickly learn the R required for the techniques we cover in the book, and to guide them in learning a proper R workflow for text analysis.
54 |
55 | - **Acquiring texts:** Often described (by us at least) as the hardest problem in text analysis. We cover how to source documents, including from Internet sources, and how to import these into R as part of the quantitative text analysis pipeline.
56 |
57 | - **Managing textual data using quanteda:** In this section, we introduce the **quanteda** package, and cover each stage of textual data processing, from creating structured corpora, to tokenisation, and building matrix representations of these tokens. We also talk about how to build and manage structured lexical resources such as dictionaries and stop word lists.
58 |
59 | - **Exploring and describing texts:** How to get overviews of texts using summary statistics, exploring texts using keywords-in-context, extracting target words, and identifying key words.
60 |
61 | - **Statistics for comparing texts:** How to characterise documents in terms of their lexical diversity, readability, similarity, or distance.
62 |
63 | - **Machine learning for texts:** How to apply scaling models, predictive models, and classification models to textual matrices.
64 |
65 | - **Further methods for texts:** Advanced methods including the use of natural language models to annotate texts, extract entities, or use word embeddings; integrating **quanteda** with "tidy" data approaches; and how to apply text analysis to "hard" languages such as Chinese (hard because of the high dimensional character set and the lack of whitespace to delimit words).
66 |
67 | Finally, in several **appendices**, we provide more detail about some tricky subjects, such as text encoding formats and working with regular expressions.
68 |
69 | ### Chapter structure {.unnumbered}
70 |
71 | Our approach in each chapter is split into the following components, which we apply in every chapter:
72 |
73 | - Objectives. We explain the purpose of each chapter and what we believe are the most important learning outcomes.
74 |
75 | - Methods. We clearly explain the methodological elements of each chapter, through a combination of high-level explanations, formulas, and references.
76 |
77 | - Examples. We use practical examples, with R code, demonstrating how to apply the methods to realise the objectives.
78 |
79 | - Issues. We identify any special issues, potential problems, or additional approaches that a user might face when applying the methods to their text analysis problem.
80 |
81 | - Further Reading. In part because our scholarly backgrounds compel us to do so, and in part because we know that many readers will want to read more about each method, each chapter contains its own set of references and further readings.
82 |
83 | - Exercises. For those wishing additional practice or to use this text as a teaching resource (which we strongly encourage!), we provide exercises that can be assigned for each chapter.
84 |
85 | Throughout the book, we will demonstrate with examples and build models using a selection of text data sets. A description of these data sets can be found in [Appendix -@sec-appendix-installing].
86 |
87 | ### Conventions {.unnumbered}
88 |
89 | Throughout the book we use several kinds of info boxes to call your attention to information, cautions, and warnings.
90 |
91 | ::: callout-note
92 | The information icon signals a note or a reference to further information.
93 | :::
94 |
95 | ::: callout-tip
96 | Tips provide suggestions for better ways of doing things.
97 | :::
98 |
99 | ::: callout-important
100 | The exclamation mark icon signals an important point to consider.
101 | :::
102 |
103 | ::: callout-warning
104 | Warning icons flag things you definitely want to avoid.
105 | :::
106 |
107 | As you may already have noticed, we put names of R packages in **boldface**.
108 |
109 | Code blocks will be self-evident, and will look like this, with the output produced from executing those commands shown below the highlighted code in a mono-spaced font.
110 |
111 | ```{r}
112 | library("quanteda")
113 | data_corpus_inaugural[1:3]
114 | ```
115 |
116 | We love the pipe operator in R (when used with discipline!) and built **quanteda** with the aim of making all of the main functions easily and logically pipeable. Since version 1, we have re-exported the `%>%` operator from **magrittr** to make it available out of the box with **quanteda**. With the introduction of the `|>` pipe in R 4.1.0, however, we prefer this variant, and will use it in all code in this book.
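   |
   | For example, the two pipelines below are equivalent, and we use the native-pipe form throughout the book. This is a minimal illustration using the built-in inaugural corpus:
   |
   | ```{r, eval = FALSE}
   | # magrittr pipe, re-exported by quanteda
   | data_corpus_inaugural %>%
   |   corpus_subset(Year > 2000) %>%
   |   summary()
   |
   | # native R pipe (R >= 4.1.0), used in this book
   | data_corpus_inaugural |>
   |   corpus_subset(Year > 2000) |>
   |   summary()
   | ```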
117 |
118 | ### Data used in examples {.unnumbered}
119 |
120 | All examples and code are bundled as a companion R package to the book, available from our [public GitHub repository](https://github.com/quanteda/TAUR/).
121 |
122 | ::: callout-tip
123 | We have written a companion package for this book called **TAUR**, which can be installed from GitHub using this command:
124 |
125 | ```{r eval = FALSE}
126 | remotes::install_github("quanteda/TAUR")
127 | ```
128 | :::
129 |
130 | We largely rely on data from three sources:
131 |
132 | - the built-in objects from the **quanteda** package, such as the US Presidential Inaugural speech corpus;
133 | - added corpora from the book's companion package, **TAUR**; and
134 | - some additional **quanteda** corpora or dictionaries from additional sources or packages where indicated.
135 |
136 | Our scheme for naming data objects follows a very consistent pattern, even if such conventions are not commonly used elsewhere. Data object names begin with `data`, have the object class as the second part of the name, such as `corpus`, and the third and final part of the name contains a description. The three elements are separated by the underscore (`_`) character. This means that any object is known by its name to be data, so that it shows up in the index of package objects (from the all-important help page, e.g. from `help(package = "quanteda")`) in one location, under "d". It also means that its object class is known from the name, without further inspection. So `data_dfm_lbgexample` is a dfm, while `data_corpus_inaugural` is clearly a corpus. We use this scheme and others like it with an almost religious fervour, because we think that learning the functionality of a programming framework for NLP and quantitative text analysis is complicated enough without also having to decipher or remember a mishmash of haphazardly and inconsistently named functions and objects. The more you use our software packages, and specifically **quanteda**, the more you will come to appreciate the attention we have paid to implementing a consistent naming scheme for objects, functions, and their arguments, as well as to their consistent functionality.
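   |
   | A quick way to see this scheme in action, assuming **quanteda** is loaded, is to list the data objects that the package exports:
   |
   | ```{r, eval = FALSE}
   | # every bundled data object begins with "data_", followed by its class
   | ls("package:quanteda", pattern = "^data_")
   | ```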
137 |
138 | ## Colophon {.unnumbered}
139 |
140 | This book was written in [RStudio](https://www.rstudio.com/ide/) using [**Quarto**](https://quarto.org). The [website](https://quanteda.github.io/Text-Analysis-Using-R/) is hosted via [GitHub Pages](https://pages.github.com), and the complete source is available on [GitHub](https://github.com/quanteda/Text-Analysis-Using-R).
141 |
142 | This version of the book was built with `r R.version.string` and the following packages:
143 |
144 | ```{r}
145 | #| echo: false
146 | #| results: asis
147 | #| cache: false
148 | deps <- desc::desc_get_deps()
149 | pkgs <- sort(deps$package[deps$type == "Imports"])
150 | pkgs <- sessioninfo::package_info(pkgs, dependencies = FALSE)
151 | df <- tibble::tibble(
152 | Package = pkgs$package,
153 | Version = pkgs$ondiskversion,
154 | Source = gsub("@", "\\\\@", pkgs$source)
155 | ) |>
156 | dplyr::mutate(Source = stringr::str_remove(Source, "\\\\@[a-f0-9]*"))
157 | gt::gt(df)
158 | ```
159 |
160 | ### How to Contact Us {.unnumbered}
161 |
162 | Please address comments and questions concerning this book by filing an issue at our [GitHub repository](https://github.com/quanteda/Text-Analysis-Using-R). There you will also find instructions for installing the companion R package, [**TAUR**](https://github.com/quanteda/TAUR).
163 |
164 | For more information about the authors or the Quanteda Initiative, visit our [website](https://quanteda.org).
165 |
166 | ## Acknowledgements {.unnumbered}
167 |
168 | Developing **quanteda** has been a labour of many years involving many contributors. The most notable contributor to the package and its learning materials is Kohei Watanabe, without whom **quanteda** would not be the incredible package it is today. Kohei has provided a clear and vigorous vision for the package's evolution across its major versions and continues to maintain it today. Others have contributed in major ways both through design and programming, namely Paul Nulty and Akitaka Matsuo, both of whom were present at the creation and through the 1.0 launch, which involved the first of many major redesigns. Adam Obeng made a brief but indelible imprint on several of the packages in the quanteda family. Haiyan Wang wrote versions of some of the core C++ code that makes quanteda so fast, as well as contributing to the R code base. William Lowe and Christian Müller also contributed code and ideas that have enriched the package and the methods it implements.
169 |
170 | No project this large could have existed without institutional and financial benefactors and supporters. Most notable of these is the European Research Council, which funded the original development under its Starting Investigator Grant scheme, awarded to Kenneth Benoit in 2011 for a project entitled QUANTESS: Quantitative Text Analysis for the Social Sciences (ERC-2011-StG 283794-QUANTESS). Our day jobs (as university professors) have also provided invaluable material and intellectual support for the development of the ideas, methodologies, and software documented in this book. This list includes the London School of Economics and Political Science, which employs Ken Benoit, hosted the QUANTESS grant, and employed many of the core quanteda team and/or hosted them while they completed PhDs; University College Dublin, where Stefan Müller is employed; Trinity College Dublin, where Ken Benoit was formerly a Professor of Quantitative Social Sciences and where Stefan completed his PhD in political science; and the Australian National University, which provided a generous affiliation to Ken and where the outline for this book first took shape as a giant tableau of post-it notes on the wall of a visiting office in the former building of the School of Politics and International Relations.
171 |
172 | As we develop this book, we hope that many readers of the work-in-progress will contribute feedback, comments, and even edits, and we plan to acknowledge you all. So don't be shy, readers. You can suggest changes by clicking on "Edit this page" in the top-right corner, [forking](https://gist.github.com/Chaser324/ce0505fbed06b947d962) the [GitHub repository](https://github.com/quanteda/Text-Analysis-Using-R), and making a pull request. More details on contributing and copyright are provided in a separate page on [Contributing to Text Analysis Using R](https://github.com/quanteda/Text-Analysis-Using-R/blob/main/contributing.md).
173 |
174 | We also have to give a much-deserved shout-out to the amazing team at RStudio, many of whom we've been privileged to hear speak at events or meet in person. You made Quarto available at just the right time. That and the innovations in R tools and software that you have driven for the past decade have been rich beyond every expectation.
175 |
176 | Finally, no acknowledgements would be complete without a profound thanks to our partners, who have put up with us during the long incubation of this project. They have had to listen to us talk about this book for years before we finally got around to writing it. They've tolerated us cursing software bugs, students, CRAN, each other, package users, and ourselves. But they've also seen the joy that we've experienced from creating tools and materials that empower users and students, and the excitement of the long intellectual voyage we have taken together and with our ever-growing base of users and students. Bina and Émeline, thank you for all of your support and encouragement. You'll be so happy to know we finally have this book project well underway.
177 |
--------------------------------------------------------------------------------
/01-intro.qmd:
--------------------------------------------------------------------------------
1 | \pagenumbering{arabic}
2 |
3 | # Introduction and Motivation {#sec-intro}
4 |
5 | ## Objectives
6 |
7 | To dive right into text analysis using R, by means of an extended demonstration of how to acquire, process, quantify, and analyze a series of texts. The objective is to demonstrate the techniques covered in the book and the tools required to apply them.
8 |
9 | This overview is intended to serve more as a demonstration of the kinds of natural language processing and text analysis that can be done powerfully and easily in R, rather than as primary instructional material. Starting in [Chapter @sec-rbasics], you will learn the R fundamentals and work your way gradually up to basic and then more advanced text analysis. Until then, enjoy the demonstration, and don't be discouraged if it seems too advanced to follow at this stage. By the time you have finished this book, this sort of analysis will be very familiar.
10 |
11 | ## Application: Analyzing Candidate Debates from the US Presidential Election Campaign of 2020
12 |
13 | The methods demonstrated in this chapter include:
14 |
15 | - acquiring texts directly from the world-wide web;
16 | - creating a corpus from the texts, with associated document-level variables;
17 | - segmenting the texts by sections found in each document;
18 | - cleaning up the texts further;
19 | - tokenizing the texts into words;
20 | - summarizing the texts in terms of who spoke and how much;
21 | - examining keywords-in-context from the texts, and identifying keywords using statistical association measures;
22 | - transforming and selecting features using lower-casing, stemming, and removing punctuation and numbers;
23 | - removing selected features in the form of "stop words";
24 | - creating a document-feature matrix from the tokens;
25 | - performing sentiment analysis on the texts;
26 | - fitting topic models on the texts.
27 |
28 | For demonstration, we will use the corpus of televised debate transcripts from the U.S. Presidential election campaign of 2020. Donald Trump and Joe Biden participated in two televised debates. The first debate took place in Cleveland, Ohio, on 29 September 2020. The two candidates met again in Nashville, Tennessee, on 22 October 2020.[^01-intro-1]
29 |
30 | [^01-intro-1]: The transcripts are available at https://www.presidency.ucsb.edu/documents/presidential-debate-case-western-reserve-university-cleveland-ohio and https://www.presidency.ucsb.edu/documents/presidential-debate-belmont-university-nashville-tennessee-0.
31 |
32 | ### Acquiring text directly from the world wide web
33 |
34 | Full transcripts of both televised debates are available from [The American Presidency Project](https://www.presidency.ucsb.edu/documents/app-categories/elections-and-transitions/debates). We start our demonstration by scraping the debates using the **rvest** package [@rvest]. You will learn more about scraping data from the internet in [Chapter @sec-acquire-internet]. We first need to install and load the packages required for this demonstration.
35 |
36 | ```{r, eval=FALSE}
37 | # install packages
38 | install.packages("quanteda")
39 | install.packages("quanteda.textstats")
40 | install.packages("quanteda.textplots")
41 | install.packages("rvest")
42 | install.packages("stringr")
43 | install.packages("devtools")
44 | devtools::install_github("quanteda/quanteda.tidy")
45 | ```
46 |
47 | ```{r, message=FALSE}
48 | # load packages
49 | library("quanteda")
50 | library("rvest")
51 | library("stringr")
52 | library("quanteda.textstats")
53 | library("quanteda.textplots")
54 | library("quanteda.tidy")
55 | ```
56 |
57 | Next, we identify the presidential debates for the 2020 period, assign the URL of the debates to an object, and load the source page. We retrieve metadata (the location and dates of the debates), store the information as a data frame, and get the debate URLs. Finally, we scrape the search results and save the texts as a character vector, with one element per debate.
58 |
59 | ```{r}
60 | # search presidential debates for the 2020 period
61 | # https://www.presidency.ucsb.edu/advanced-search?field-keywords=&field-keywords2=&field-keywords3=&from%5Bdate%5D=01-01-2020&to%5Bdate%5D=12-31-2020&person2=200301&category2%5B%5D=64&items_per_page=50
62 |
63 | # assign URL of debates to an object called url_debates
64 | url_debates <- "https://www.presidency.ucsb.edu/advanced-search?field-keywords=&field-keywords2=&field-keywords3=&from%5Bdate%5D=01-01-2020&to%5Bdate%5D=12-31-2020&person2=200301&category2%5B%5D=64&items_per_page=50"
65 |
66 | source_page <- read_html(url_debates)
67 |
68 | # get debate meta-data
69 |
70 | nodes_pres <- ".views-field-title a"
71 | text_pres <- ".views-field-field-docs-start-date-time-value.text-nowrap"
72 |
73 | debates_meta <- data.frame(
74 | location = html_text(html_nodes(source_page, nodes_pres)),
75 | date = html_text(html_nodes(source_page, text_pres)),
76 | stringsAsFactors = FALSE
77 | )
78 |
79 | # format the date
80 | debates_meta$date <- as.Date(trimws(debates_meta$date),
81 | format = "%b %d, %Y")
82 |
83 | # get debate URLs
84 | debates_links <- source_page |>
85 | html_nodes(".views-field-title a") |>
86 | html_attr(name = "href")
87 |
88 | # add first part of URL to debate links
89 | debates_links <- paste0("https://www.presidency.ucsb.edu", debates_links)
90 |
91 | # scrape search results
92 | debates_scraped <- lapply(debates_links, read_html)
93 |
94 | # get character vector, one element per debate
95 | debates_text <- sapply(debates_scraped, function(x) {
96 | html_nodes(x, "p") |>
97 | html_text() |>
98 | paste(collapse = "\n\n")
99 | })
100 | ```
101 |
102 | Having retrieved the text of the two debates, we clean up the character vector. More specifically, we tidy the details on the location using regular expressions and the **stringr** package (see [Appendix @sec-appendix-regex]).
103 |
104 | ```{r}
105 | debates_meta$location <- str_replace(debates_meta$location, "^.* in ", "")
106 | ```
107 |
108 | ### Creating a text corpus
109 |
110 | Now we have two objects. The character vector `debates_text` contains the text of both debates, and the data frame `debates_meta` stores the date and location of both debates. These objects allow us to create a **quanteda** corpus, with the metadata. We use the `corpus()` function, specify the location of the document-level variables, and assign the corpus to an object called `data_corpus_debates`.
111 |
112 | ```{r}
113 | data_corpus_debates <- corpus(debates_text,
114 | docvars = debates_meta)
115 |
116 | # check the number of documents included in the text corpus
117 | ndoc(data_corpus_debates)
118 | ```
119 |
120 | The object `data_corpus_debates` contains two documents, one for each debate. While this unit of analysis may be suitable for some analyses, we want to identify all utterances by the moderator and the two candidates. With the function `corpus_segment()`, we can segment the corpus into statements. The unit of analysis changes from a full debate to a statement during a debate. Inspecting the page for one of the debates[^01-intro-2] reveals that a new utterance starts with the speaker's name in ALL CAPS, followed by a colon. We use this consistent pattern to segment the corpus to the level of utterances. The regular expression `"\\s*[[:upper:]]+:\\s+"` matches one or more upper-case letters (`[[:upper:]]+`), optionally preceded by whitespace (`\\s*`), followed by a colon (`:`) and at least one whitespace character (`\\s+`).
121 |
122 | [^01-intro-2]: https://www.presidency.ucsb.edu/documents/presidential-debate-case-western-reserve-university-cleveland-ohio
123 |
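   | Before running the segmentation, we can sanity-check the pattern on a made-up line, using the **stringr** package loaded earlier:
   |
   | ```{r, eval = FALSE}
   | # the match should capture the full speaker tag, including the colon
   | str_extract("WALLACE: Good evening.", "\\s*[[:upper:]]+:\\s+")
   | ```
   |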
124 | ```{r}
125 | data_corpus_debatesseg <- corpus_segment(data_corpus_debates,
126 | pattern = "\\s*[[:upper:]]+:\\s+",
127 | valuetype = "regex",
128 | case_insensitive = FALSE)
129 |
130 | summary(data_corpus_debatesseg, n = 4)
131 | ```
132 |
133 | The segmentation results in a new corpus consisting of 1,207 utterances. The `summary()` function provides a useful overview of each utterance. The first text, for example, contains 251 tokens, 139 types (i.e., unique tokens) and 13 sentences.
134 |
135 | Having segmented the corpus, we improve the document-level variables since such meta information on each document is crucial for subsequent analyses like subsetting the corpus or grouping the corpus to the level of speakers. We create a new document-level variable called "speaker" based on the pattern extracted above.
136 |
137 | ```{r}
138 | # str_trim() removes empty whitespaces,
139 | # str_remove_all() removes the colons,
140 | # and str_to_title() changes speaker names
141 | # from UPPER CASE to Title Case
142 |
143 | data_corpus_debatesseg <- data_corpus_debatesseg |>
144 | rename(speaker = pattern) |>
145 | mutate(speaker = str_trim(speaker),
146 | speaker = str_remove_all(speaker, ":"),
147 | speaker = str_to_title(speaker))
148 |
149 | # cross-table of speaker statements by debate
150 | table(data_corpus_debatesseg$speaker,
151 | data_corpus_debatesseg$location)
152 | ```
153 |
154 | The cross-table reports the number of statements by each speaker in each debate. The first debate in Cleveland seems to be longer: the numbers of Trump's and Biden's statements during the Cleveland debate are three times higher than those in Tennessee. The transcript reports 246 utterances by Chris Wallace during the first debate and 146 by Kristen Welker during the second. We can inspect all cleaned document-level variables in this text corpus.
155 |
156 | ```{r}
157 | data_corpus_debatesseg |>
158 | docvars() |>
159 | glimpse()
160 | ```
161 |
162 | ### Tokenizing a corpus
163 |
164 | Next, we tokenize our text corpus. In its simplest form, tokenization separates a text into tokens by splitting on whitespace. We tokenize the text corpus without any pre-processing using `tokens()`.
165 |
166 | ```{r}
167 | toks_usdebates2020 <- tokens(data_corpus_debatesseg)
168 |
169 | # let's inspect the first six tokens of the first four documents
170 | print(toks_usdebates2020, max_ndoc = 4, max_ntoken = 6)
171 |
174 | # check number of tokens and types
175 | toks_usdebates2020 |>
176 | ntoken() |>
177 | sum()
178 |
179 | toks_usdebates2020 |>
180 | ntype() |>
181 | sum()
182 | ```
183 |
184 | Without any pre-processing, the corpus consists of 43,742 tokens and 28,174 types. We can easily check how these numbers change when transforming all tokens to lowercase and removing punctuation characters.
185 |
186 | ```{r}
187 | toks_usdebates2020_reduced <- toks_usdebates2020 |>
188 | tokens(remove_punct = TRUE) |>
189 | tokens_tolower()
190 |
191 | # check number of tokens and types
192 | toks_usdebates2020_reduced |>
193 | ntoken() |>
194 | sum()
195 |
196 | toks_usdebates2020_reduced |>
197 | ntype() |>
198 | sum()
199 | ```
200 |
201 | After removing punctuation and harmonizing all terms to lowercase, the number of tokens decreases to 36,740, and the number of types falls correspondingly.
202 |
203 | ### Keywords-in-context
204 |
205 | In contrast to a document-feature matrix (covered below), tokens objects still preserve the order of words. We can use tokens objects to identify the occurrence of keywords and their immediate context.
206 |
207 | ```{r}
208 | kw_america <- kwic(toks_usdebates2020,
209 | pattern = c("america"),
210 | window = 2)
211 |
212 | # number of mentions
213 | nrow(kw_america)
214 |
215 | # print first 6 mentions of America and the context of ±2 words
216 | head(kw_america, n = 6)
217 | ```
218 |
219 | ### Text processing
220 |
221 | The keywords-in-context analysis above reveals that the terms retain their original casing, and that very frequent, uninformative words (so-called stopwords) and punctuation are still part of the text. In most applications, we remove very frequent features and transform all words to lowercase. The code below shows how to adjust the object accordingly.
222 |
223 | ```{r}
224 | toks_usdebates2020_processed <- data_corpus_debatesseg |>
225 | tokens(remove_punct = TRUE) |>
226 | tokens_remove(pattern = stopwords("en")) |>
227 | tokens_tolower()
228 | ```
229 |
230 | Let's inspect if the changes have been implemented as we expect by calling `kwic()` on the new tokens object.
231 |
232 | ```{r}
233 | kw_america_processed <- kwic(toks_usdebates2020_processed,
234 | pattern = c("america"),
235 | window = 2)
236 |
237 | # print first 6 mentions of America and the context of ±2 words
238 | head(kw_america_processed, n = 6)
246 | ```
247 |
248 | The processing of the tokens object worked as expected. Let's imagine we want to group the documents by debate and speaker, resulting in two documents for Trump, two for Biden, and one for each moderator. `tokens_group()` allows us to change the unit of analysis. After aggregating the documents, we can use the function `textplot_xray()` to observe the occurrences of specific keywords during the debates.
249 |
250 | ```{r}
251 | # new document-level variable with date and speaker
252 | toks_usdebates2020$speaker_date <- paste(
253 | toks_usdebates2020$speaker,
254 | toks_usdebates2020$date,
255 | sep = ", ")
256 |
257 | # reshape the tokens object to speaker-date level
258 | # and keep only Trump and Biden
259 | toks_usdebates2020_grouped <- toks_usdebates2020 |>
260 | tokens_subset(speaker %in% c("Trump", "Biden")) |>
261 | tokens_group(groups = speaker_date)
262 |
263 | # check number of documents
264 | ndoc(toks_usdebates2020_grouped)
265 |
266 | # use absolute position of the token in the document
267 | textplot_xray(
268 | kwic(toks_usdebates2020_grouped, pattern = "america"),
269 | kwic(toks_usdebates2020_grouped, pattern = "tax*")
270 | )
271 | ```
272 |
273 | The grouped document also allows us to check how often each candidate spoke during the debate.
274 |
275 | ```{r}
276 | ntoken(toks_usdebates2020_grouped)
277 | ```
278 |
279 | The number of tokens spoken by Trump and Biden does not differ substantially in either debate (7,996 vs. 8,093 tokens in the first debate; 8,973 vs. 9,016 in the second).
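   |
   | If we want shares rather than raw counts, `prop.table()` converts these totals into proportions; a small sketch using the grouped tokens object from above:
   |
   | ```{r, eval = FALSE}
   | # each speaker-date group's share of all tokens spoken
   | prop.table(ntoken(toks_usdebates2020_grouped))
   | ```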
280 |
281 | ### Identifying multiword expressions
282 |
283 | Multiword expressions are common in many languages. For instance, "income" and "tax" as separate unigrams have a different meaning than the bigram "income tax". The package **quanteda.textstats** includes the function `textstat_collocations()`, which automatically identifies common multiword expressions.
284 |
285 | ```{r}
286 | tstat_coll <- data_corpus_debatesseg |>
287 | tokens(remove_punct = TRUE) |>
288 | tokens_remove(pattern = stopwords("en"), padding = TRUE) |>
289 | textstat_collocations(size = 2:3, min_count = 5)
290 |
291 | # for illustration purposes select the first 20 collocations
292 | head(tstat_coll, 20)
293 | ```
294 |
295 | We can use `tokens_compound()` to compound certain multiword expressions before creating a document-feature matrix, which does not consider word order. For illustration purposes, we compound `climate change`, `social securit*`, and `health insuranc*`. By default, compounded tokens are concatenated by `_`.
296 |
297 | ```{r}
298 | toks_usdebates2020_comp <- toks_usdebates2020 |>
299 | tokens(remove_punct = TRUE) |>
300 | tokens_compound(pattern = phrase(c("climate change",
301 | "social securit*",
302 | "health insuranc*"))) |>
303 | tokens_remove(pattern = stopwords("en"))
304 | ```
305 |
306 | ### Document-feature matrix
307 |
308 | We have come a long way already. We downloaded debate transcripts, segmented the texts into utterances, added document-level variables, tokenized the corpus, inspected keywords, and compounded multiword expressions. Next, we transform our tokens object into a document-feature matrix (dfm). A dfm counts the occurrences of tokens in each document. We can create a document-feature matrix, print its structure, and get the most frequent words.
309 |
310 | ```{r}
311 | dfmat_presdebates20 <- dfm(toks_usdebates2020_comp)
312 |
313 | # most frequent features
314 | topfeatures(dfmat_presdebates20, n = 10)
315 |
316 | # most frequent features by speaker
317 |
318 | topfeatures(dfmat_presdebates20, groups = speaker, n = 10)
319 | ```
320 |
321 | Many methods build on document-feature matrices and the "bag-of-words" approach. In this section, we introduce `textstat_keyness()`, which identifies features that occur differentially across categories -- in our case, Trump's and Biden's utterances. The function `textplot_keyness()` provides a straightforward way of visualizing the results of the keyness analysis (Figure \@ref(fig:keynesspres)).
322 |
323 | ```{r keynesspres, fig.height = 7}
324 | tstat_key <- dfmat_presdebates20 |>
325 | dfm_subset(speaker %in% c("Trump", "Biden")) |>
326 | dfm_group(groups = speaker) |>
327 | textstat_keyness(target = "Trump")
328 |
329 | textplot_keyness(tstat_key)
330 | ```
331 |
332 | ### Bringing it all together: Sentiment analysis
333 |
334 | Before introducing the R basics in the next chapter, we show how to conduct a dictionary-based sentiment analysis using the Lexicoder Sentiment Dictionary [@young2012]. The dictionary, included in **quanteda** as `data_dictionary_LSD2015`, contains 1,709 positive and 2,858 negative terms (as well as their negations). We discuss advanced sentiment analyses with measures of uncertainty in Chapter ??.
335 |
336 | ```{r}
337 | toks_sent <- toks_usdebates2020 |>
338 | tokens_group(groups = speaker) |>
339 | tokens_lookup(dictionary = data_dictionary_LSD2015,
340 | nested_scope = "dictionary")
341 |
342 | # create a dfm with the count of matches,
343 | # transform object into a data frame,
344 | # and add document-level variables
345 | dat_sent <- toks_sent |>
346 | dfm() |>
347 | convert(to = "data.frame") |>
348 | cbind(docvars(toks_sent))
349 |
350 | # compute sentiment as the logged ratio of positive to negative counts
351 | dat_sent$sentiment = with(
352 | dat_sent,
353 | log((positive + neg_negative + 0.5) /
354 | (negative + neg_positive + 0.5)))
355 |
356 | dat_sent
357 | ```
358 |
359 | ## Issues
360 |
361 | Some issues raised by the example, where we might have done things differently or added additional analyses.
362 |
363 | ## Further Reading
364 |
365 | Some studies of debate transcripts? Or issues involved in analyzing interactive or "dialogical" documents.
366 |
367 | ## Exercises
368 |
369 | Add some here.
370 |
--------------------------------------------------------------------------------
/02-r-basics.qmd:
--------------------------------------------------------------------------------
1 | # R Basics {#sec-rbasics}
2 |
3 | ## Objectives
4 |
5 | To provide a targeted introduction to R, for those needing an introduction or a review.
6 |
7 | We will cover:
8 |
9 | - core object types (atomic, vectors, lists, complex)
10 | - detailed survey of characters and "strings"
11 | - detailed coverage of list types
12 | - detailed coverage of matrix types
13 | - introduction to the data.frame
14 | - output data in the form of a data.frame
15 | - output data in the form of a plot (ggplot2 object)
16 |
17 | The objective is to introduce how these core object types behave, and use them to store basic quantities required in text analysis.
18 |
19 | ## Methods
20 |
21 | The methods are primarily about how to use R.
22 |
23 | We will build up the basic objects needed to understand the core structures used in R to hold texts (character), metadata (data.frame), tokens (lists), dictionaries (lists), and document-feature matrices (matrices).
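   |
   | As a preview, here is a minimal base-R sketch of those containers, using two made-up documents:
   |
   | ```{r, eval = FALSE}
   | # texts as a named character vector
   | txt <- c(doc1 = "Text analysis uses text.",
   |          doc2 = "R makes text analysis easy.")
   |
   | # document metadata as a data.frame
   | meta <- data.frame(doc_id = names(txt), year = c(2020, 2021))
   |
   | # tokens as a list of character vectors, one element per document
   | toks <- strsplit(tolower(txt), "\\s+")
   |
   | # a document-feature matrix as a base matrix of counts
   | feats <- sort(unique(unlist(toks)))
   | mat <- t(sapply(toks, function(x) table(factor(x, levels = feats))))
   | mat
   | ```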
24 |
25 | ## Examples
26 |
27 | Examples working through the construction of each text object container using the types above.
28 |
29 | ## Issues
30 |
31 | Could have used "tidy" approaches.
32 |
33 | Sparsity.
34 |
35 | Indexing.
36 |
37 | Wouldn't it be nice if we could nest some objects inside others?
38 |
39 | ## Further Reading
40 |
41 | Some resources on R.
42 |
43 | ## Exercises
44 |
45 | Add some here.
46 |
--------------------------------------------------------------------------------
/03-r-workflow.qmd:
--------------------------------------------------------------------------------
1 | # R Workflow {#sec-rworkflow}
2 |
3 | ## Objectives
4 |
5 | How R works:
6 | - passing by value, compared to other languages
7 | - using "pipes"
8 | - object orientation: object classes and methods (functions)
9 |
10 | Workflow issues:
11 | - clear code
12 | - reproducible workflow
13 | - R markdown and R notebooks
14 |
15 | Using packages
16 |
17 |
18 | ## Methods
19 |
20 | Applicable methods for the objectives listed above.
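   |
   | As a minimal sketch of two of these ideas, pass-by-value semantics and the native pipe:
   |
   | ```{r, eval = FALSE}
   | x <- c("One text.", "Another text.")
   |
   | # R passes arguments by value: f() modifies a copy, not x itself
   | f <- function(v) {
   |   v[1] <- "Changed."
   |   v
   | }
   | f(x)  # returns the modified copy
   | x     # unchanged
   |
   | # the native pipe passes the left-hand side as the first argument
   | x |> tolower() |> nchar()
   | ```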
21 |
22 | ## Examples
23 |
24 | Objects that work on character and lists.
25 |
26 | Things that work for data.frames. Non-standard evaluation for named columns in a data.frame.
27 |
28 | ggplot2
29 |
30 | ## Issues
31 |
32 | Additional issues.
33 |
34 | ## Further Reading
35 |
36 | Advanced reading on R. Packages.
37 |
38 | ## Exercises
39 |
40 | Add some here.
41 |
--------------------------------------------------------------------------------
/04-r-strings.qmd:
--------------------------------------------------------------------------------
1 | # Working with Text in R {#sec-rstrings}
2 |
3 | ## Objectives
4 |
5 | A deeper dive into how text is represented in R.
6 | - character vectors versus "strings"
7 | - factors
8 | - **base** package string handling
9 | - the **stringi** and **stringr** packages
10 | - regular expressions
11 | - encoding, briefly.
12 |
13 | ## Methods
14 |
15 | Applicable methods for the objectives listed above.
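   |
   | A short sketch of the territory this chapter covers, comparing **base** and **stringr** string handling on a toy vector:
   |
   | ```{r, eval = FALSE}
   | s <- c("Text Analysis", "Using R")
   |
   | # base R string handling
   | nchar(s)
   | toupper(s)
   | substr(s[1], 1, 4)
   |
   | # the same operations with stringr (built on stringi)
   | library("stringr")
   | str_length(s)
   | str_to_upper(s)
   | str_sub(s[1], 1, 4)
   |
   | # a simple regular expression: words beginning with a capital letter
   | str_extract_all(s, "\\b[A-Z]\\w*")
   | ```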
16 |
17 | ## Examples
18 |
19 | Lots of the above.
20 |
21 | ## Issues
22 |
23 | Additional issues.
24 |
25 | ## Further Reading
26 |
27 | String handling in R. More regular expressions. More on encoding.
28 |
29 | ## Exercises
30 |
31 | Add some here.
32 |
--------------------------------------------------------------------------------
/05-acquiring-files.qmd:
--------------------------------------------------------------------------------
1 | # Working with Files {#sec-acquire-files}
2 |
3 | ## Objectives
4 |
5 | Really basic issues for working with files, such as: where files typically live on your computer; paths, the "current working directory", and how to change it; input formats and methods to convert them; saving and naming files; text editors; cleaning texts; and detecting and converting text file encodings.
6 |
7 | ## Methods
8 |
9 | Applicable methods for the objectives listed above.
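   |
   | For instance, the following sketch covers the working directory, reading a file, and converting its encoding; the file names are purely illustrative:
   |
   | ```{r, eval = FALSE}
   | # the current working directory, and the text files it contains
   | getwd()
   | list.files(pattern = "\\.txt$")
   |
   | # read a text file, declaring its encoding explicitly
   | txt <- readLines("example.txt", encoding = "UTF-8", warn = FALSE)
   |
   | # convert text from another encoding to UTF-8
   | txt_utf8 <- iconv(txt, from = "latin1", to = "UTF-8")
   | ```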
10 |
11 | ## Examples
12 |
13 | Examples of each of the above methods.
14 |
15 | ## Issues
16 |
17 | Additional issues.
18 |
19 | ## Further Reading
20 |
21 | Additional resources from libraries or the web.
22 |
23 | ## Exercises
24 |
25 | Add some here.
26 |
--------------------------------------------------------------------------------
/06-acquiring-packages.qmd:
--------------------------------------------------------------------------------
1 | # Using Text Import Tools {#sec-acquire-packages}
2 |
3 | ## Objectives
4 |
5 | - The **readtext** package
6 | - Using API gateway packages: **twitteR** and other social media packages
7 | - Dealing with JSON formats
8 | - Newspaper formats, LexisNexis, etc.
9 | - OCR tools
10 |
11 | ## Methods
12 |
13 | Applicable methods for the objectives listed above.
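   |
   | A sketch of two of these tools, with illustrative file paths:
   |
   | ```{r, eval = FALSE}
   | # read all .txt files in a folder into a data frame of texts
   | library("readtext")
   | rt <- readtext("data/*.txt", encoding = "UTF-8")
   |
   | # parse a JSON file into R structures
   | library("jsonlite")
   | dat <- fromJSON("tweets.json")
   | ```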
14 |
15 | ## Examples
16 |
17 | Examples of each of the above methods.
18 |
19 | ## Issues
20 |
21 | Additional issues.
22 |
23 | ## Further Reading
24 |
25 | Additional resources from libraries or the web.
26 |
27 | ## Exercises
28 |
29 | Add some here.
30 |
--------------------------------------------------------------------------------
/07-acquiring-internet.qmd:
--------------------------------------------------------------------------------
1 | # Obtaining Texts from the Internet {#sec-acquire-internet}
2 |
3 | ## Objectives
4 |
5 | - Web scraping
6 | - Markup formats
7 | - Challenges of removing tags
8 | - APIs, and JSON
9 |
10 | ## Methods
11 |
12 | Applicable methods for the objectives listed above.
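   |
   | A minimal scraping and API sketch using **rvest** and **jsonlite**; the URLs are placeholders:
   |
   | ```{r, eval = FALSE}
   | library("rvest")
   |
   | # scrape all paragraph text from a web page
   | page <- read_html("https://example.com/article")
   | paras <- page |>
   |   html_nodes("p") |>
   |   html_text()
   |
   | # query a JSON API and parse the response
   | library("jsonlite")
   | res <- fromJSON("https://api.example.com/search?q=text+analysis")
   | ```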
13 |
14 | ## Examples
15 |
16 | Examples of each of the above methods.
17 |
18 | ## Issues
19 |
20 | Additional issues.
21 |
22 | ## Further Reading
23 |
24 | Additional resources from libraries or the web.
25 |
26 | ## Exercises
27 |
28 | Add some here.
29 |
--------------------------------------------------------------------------------
/08-quanteda-overview.qmd:
--------------------------------------------------------------------------------
1 | # Introducing the **quanteda** Package {#sec-quanteda-overview}
2 |
3 | ## Objectives
4 |
5 | - Purpose
6 | - Basic object types in **quanteda**
7 | - Inter-related classes
8 | - Workflow
9 | - Key differences between **quanteda** and other packages
10 | - Working with other packages
11 |
12 | ## Methods
13 |
14 | Applicable methods for the objectives listed above.
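   |
   | The canonical **quanteda** workflow, as a compact sketch: texts become a corpus, the corpus is tokenised, and the tokens are tabulated into a document-feature matrix.
   |
   | ```{r, eval = FALSE}
   | library("quanteda")
   |
   | corp <- corpus(c(d1 = "Text analysis using R.",
   |                  d2 = "R for the analysis of text."))
   | toks <- tokens(corp, remove_punct = TRUE)
   | dfmat <- dfm(toks)
   |
   | # each step produces its own object class
   | class(corp); class(toks); class(dfmat)
   | ```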
15 |
16 | ## Examples
17 |
18 | Examples of each of the above methods.
19 |
20 | ## Issues
21 |
22 | Additional issues.
23 |
24 | ## Further Reading
25 |
26 | Additional resources from libraries or the web.
27 |
28 | ## Exercises
29 |
30 | Add some here.
31 |
--------------------------------------------------------------------------------
/13-quanteda-fcms.qmd:
--------------------------------------------------------------------------------
1 | # Building Feature Co-occurrence Matrices {#sec-quanteda-fcms}
2 |
3 | ## Objectives
4 |
5 | In this chapter we explain the intuition behind feature co-occurrence matrices (fcm), when and why to use them, and how to construct an fcm from a **quanteda** tokens object. At the end of the chapter, readers will have a solid understanding of creating fcms and when and how they can use them.
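   |
   | As a preview, a minimal example of constructing an fcm from a toy tokens object:
   |
   | ```{r, eval = FALSE}
   | library("quanteda")
   |
   | toks <- tokens(c("climate change policy change",
   |                  "climate policy change"))
   |
   | # count co-occurrences within a window of +/- 2 tokens
   | fcmat <- fcm(toks, context = "window", window = 2)
   | fcmat
   | ```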
6 |
7 | ## Methods
8 |
9 | * Creating fcms
10 | * Advanced options
11 |
12 | ## Examples
13 |
14 | * Co-occurrence statistics
15 | * Network analysis
16 | * Computing word embeddings
17 |
18 | ## Issues
19 |
20 | Additional issues.
21 |
22 | ## Further Reading
23 |
24 | Additional resources from libraries or the web.
25 |
26 | ## Exercises
27 |
28 | Add some here.
29 |
--------------------------------------------------------------------------------
/14-exploring-description.qmd:
--------------------------------------------------------------------------------
1 | # Describing Texts {#sec-exploring-description}
2 |
3 | ## Objectives
4 |
5 | Describing the data in texts, using the `n*` functions, `summary()`, `docnames()`, `docvars()`, `meta()`, `featnames()`, and `head()`, `tail()`, and `print()`.
6 |
7 | ## Methods
8 |
9 | Applicable methods for the objectives listed above.
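   |
   | Applied to the built-in inaugural corpus, these functions look like the following sketch:
   |
   | ```{r, eval = FALSE}
   | library("quanteda")
   |
   | ndoc(data_corpus_inaugural)            # how many documents?
   | head(docnames(data_corpus_inaugural))  # document names
   | head(docvars(data_corpus_inaugural))   # document-level variables
   | summary(data_corpus_inaugural, n = 3)  # tokens, types, and sentences
   | ```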
10 |
11 | ## Examples
12 |
13 | Examples of each of the above methods.
14 |
15 | ## Issues
16 |
17 | Additional issues.
18 |
19 | ## Further Reading
20 |
21 | Additional resources from libraries or the web.
22 |
23 | ## Exercises
24 |
25 | Add some here.
26 |
--------------------------------------------------------------------------------
/15-exploring-kwic.qmd:
--------------------------------------------------------------------------------
1 | # Keywords-in-Context {#sec-exploring-kwic}
2 |
3 | ## Objectives
4 |
5 | - What are "keywords".
6 | - The notion of a "concordance"
7 | - Using patterns and different input types
8 | - Multi-word patterns
9 | - Using kwic objects for subsequent input - to `corpus()`, or to `textplot_xray()`
10 | - Preview: Statistical detection of key words
11 |
12 | ## Methods
13 |
14 | Applicable methods for the objectives listed above.
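   |
   | For instance, a simple concordance from the built-in inaugural corpus (a sketch):
   |
   | ```{r, eval = FALSE}
   | library("quanteda")
   |
   | toks <- tokens(data_corpus_inaugural)
   |
   | # a single keyword with two words of context on each side
   | kwic(toks, pattern = "liberty", window = 2) |> head()
   |
   | # a multi-word pattern
   | kwic(toks, pattern = phrase("united states"), window = 2) |> head()
   | ```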
15 |
16 | ## Examples
17 |
18 | Using `kwic()` to verify a dictionary usage.
19 |
20 | ## Issues
21 |
22 | Additional issues.
23 |
24 | ## Further Reading
25 |
26 | Additional resources from libraries or the web.
27 |
28 | ## Exercises
29 |
30 | Add some here.
31 |
--------------------------------------------------------------------------------
/16-exploring-dictionaries.qmd:
--------------------------------------------------------------------------------
1 | # Applying Dictionaries {#sec-exploring-dictionaries}
2 |
3 | ## Objectives
4 |
5 | - The basic idea of counting token and feature matches as equivalence classes
6 | - Matching phrases
7 | - Weighting dictionary applications
8 | - Multi-lingual applications
9 | - Scaling dictionary results
10 | - Confidence intervals and bootstrapping
11 | - Dictionary validation using `kwic()`
12 | - Dictionary construction using `textstat_keyness()`
13 |
14 | Creating and revising dictionaries.
15 |
16 | - key-value lists
17 | - more details on pattern types
18 | - the `phrase()` argument
19 | - importing foreign dictionaries
20 | - editing dictionaries
21 | - using dictionaries as `pattern` inputs to other functions
22 | - exporting dictionaries
23 | - using dictionaries: `tokens_lookup()`
24 |
25 | ## Methods
26 |
27 | Applicable methods for the objectives listed above.
28 |
29 | ## Examples
30 |
31 | Sentiment analysis.
32 |
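33 | A minimal sentiment-style sketch; the two-key dictionary is purely illustrative, not a validated lexicon:
34 | 
35 | ```{r, eval = FALSE}
36 | library("quanteda")
37 | 
38 | # illustrative keys and glob patterns only
39 | dict <- dictionary(list(
40 |   economy = c("econom*", "tax*", "job*"),
41 |   security = c("war", "terror*", "defence", "defense")
42 | ))
43 | 
44 | toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)
45 | dfm(tokens_lookup(toks, dictionary = dict)) |> head()
46 | ```
47 | 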
33 | ## Issues
34 |
35 | Copyright.
36 |
37 | Which format?
38 |
39 | Weighted dictionaries.
40 |
41 | Additional issues.
42 |
43 | ## Further Reading
44 |
45 | On-line sources of dictionaries.
46 |
47 | The Quanteda Initiative's **quanteda.dictionaries** package.
48 |
49 | ## Exercises
50 |
51 | Add some here.
52 |
--------------------------------------------------------------------------------
/17-exploring-frequencies.qmd:
--------------------------------------------------------------------------------
1 | # Most Frequent Words {#sec-exploring-freqs}
2 |
3 | ## Objectives
4 |
5 | - Summarizing features via a `dfm()`
6 | - Most frequent features
7 | - `dfm_group()` revisited
8 | - Using dictionaries to count keys
9 | - `topfeatures()`
10 | - `textplot_wordcloud()`
11 | - `textstat_frequency()`
12 | - `textstat_keyness()`, `textplot_keyness()`
13 |
14 | ## Methods
15 |
16 | Applicable methods for the objectives listed above.
17 |
18 | ## Examples
19 |
20 | Examples of the frequency methods listed above.
21 |
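22 | A preliminary sketch using the built-in inaugural corpus:
23 | 
24 | ```{r, eval = FALSE}
25 | library("quanteda")
26 | library("quanteda.textstats")
27 | 
28 | dfmat <- data_corpus_inaugural |>
29 |   tokens(remove_punct = TRUE) |>
30 |   tokens_remove(stopwords("en")) |>
31 |   dfm()
32 | 
33 | topfeatures(dfmat, 10)             # quick top-ten list
34 | textstat_frequency(dfmat, n = 10)  # frequencies with ranks
35 | ```
36 | 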
22 | ## Issues
23 |
24 | Additional issues.
25 |
26 | ## Further Reading
27 |
28 | Additional resources from libraries or the web.
29 |
30 | ## Exercises
31 |
32 | Add some here.
33 |
--------------------------------------------------------------------------------
/18-comparing-sophistication.qmd:
--------------------------------------------------------------------------------
1 | # Profiling Lexical Patterns and Usage {#sec-comparing-lexical}
2 |
3 | ## Objectives
4 |
5 | - Sentence length
6 | - Syllables
7 | - Types versus tokens
8 | - Usage in word lists
9 | - Lexical diversity
10 | - Readability
11 | - Bootstrapping methods and uncertainty accounting
12 |
13 | ## Methods
14 |
15 | Applicable methods for the objectives listed above.
16 |
17 | ## Examples
18 |
19 | Examples of the lexical measures listed above.
20 |
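21 | A preliminary sketch using **quanteda.textstats**; the choice of the Flesch and TTR measures is illustrative:
22 | 
23 | ```{r, eval = FALSE}
24 | library("quanteda")
25 | library("quanteda.textstats")
26 | 
27 | corp <- corpus_subset(data_corpus_inaugural, Year > 1980)
28 | textstat_readability(corp, measure = "Flesch")
29 | textstat_lexdiv(tokens(corp, remove_punct = TRUE), measure = "TTR")
30 | ```
31 | 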
21 | ## Issues
22 |
23 | Non-English languages.
24 |
25 | Sampling.
26 |
27 | Text length and its effect on diversity. Zipf's law and Heaps' law.
28 |
29 | ## Further Reading
30 |
31 | Additional resources from libraries or the web.
32 |
33 | ## Exercises
34 |
35 | Add some here.
36 |
--------------------------------------------------------------------------------
/19-comparing-documents.qmd:
--------------------------------------------------------------------------------
1 | # Document Similarity and Distance {#sec-comparing-documents}
2 |
3 | ## Objectives
4 |
5 | - Distance and similarity measures
6 | - Measuring similarity
7 | - Measuring distance
8 | - Clustering
9 | - Multi-dimensional scaling
10 | - Network analysis of document connections
11 |
12 | ## Methods
13 |
14 | Applicable methods for the objectives listed above.
15 |
16 | ## Examples
17 |
18 | Examples here.
19 |
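20 | A preliminary sketch of cosine similarity between documents, using proportionally weighted features:
21 | 
22 | ```{r, eval = FALSE}
23 | library("quanteda")
24 | library("quanteda.textstats")
25 | 
26 | dfmat <- data_corpus_inaugural |>
27 |   tokens(remove_punct = TRUE) |>
28 |   dfm() |>
29 |   dfm_weight(scheme = "prop")
30 | 
31 | tstat <- textstat_simil(dfmat, method = "cosine", margin = "documents")
32 | as.matrix(tstat)[1:3, 1:3]
33 | ```
34 | 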
20 | ## Issues
21 |
22 | Weighting and feature selection and its effects on similarity and distance.
23 |
24 | Computational issues.
25 |
26 | ## Further Reading
27 |
28 | Additional resources from libraries or the web.
29 |
30 | ## Exercises
31 |
32 | Add some here.
33 |
--------------------------------------------------------------------------------
/20-comparing-features.qmd:
--------------------------------------------------------------------------------
1 | # Feature Similarity and Distance {#sec-comparing-features}
2 |
3 | ## Objectives
4 |
5 | - Distance and similarity measures revisited
6 | - Clustering
7 | - Network analysis of feature connections
8 | - Improving feature comparisons through detection and pre-processing of multi-word expressions
9 |
10 | ## Methods
11 |
12 | Applicable methods for the objectives listed above.
13 |
14 | ## Examples
15 |
16 | Examples here.
17 |
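18 | A preliminary sketch: ranking features by their cosine similarity to a single hand-picked feature ("liberty" is illustrative):
19 | 
20 | ```{r, eval = FALSE}
21 | library("quanteda")
22 | library("quanteda.textstats")
23 | 
24 | dfmat <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE))
25 | 
26 | sims <- textstat_simil(dfmat, dfmat[, "liberty"],
27 |                        margin = "features", method = "cosine")
28 | head(sort(as.matrix(sims)[, 1], decreasing = TRUE), 8)
29 | ```
30 | 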
18 | ## Issues
19 |
20 | Computational issues.
21 |
22 | ## Further Reading
23 |
24 | Additional resources from libraries or the web.
25 |
26 | ## Exercises
27 |
28 | Add some here.
29 |
--------------------------------------------------------------------------------
/21-ml-supervised-scaling.qmd:
--------------------------------------------------------------------------------
1 | # Supervised Document Scaling {#sec-ml-supervised-scaling}
2 |
3 | ## Objectives
4 |
5 | - Introduction to the notion of document scaling. Difference from classification, which we cover in [Chapter -@sec-ml-classifiers].
6 | - Supervised versus unsupervised methods
7 | - Wordscores
8 | - Class affinity model
9 |
10 | ## Methods
11 |
12 | Applicable methods for the objectives listed above.
13 |
14 | ## Examples
15 |
16 | Examples here.
17 |
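18 | A preliminary sketch using the Laver-Benoit-Garry example data shipped with **quanteda.textmodels**: five reference texts with known scores scale a sixth, "virgin" text.
19 | 
20 | ```{r, eval = FALSE}
21 | library("quanteda")
22 | library("quanteda.textmodels")
23 | 
24 | # reference scores for R1-R5; NA marks the text to be scaled
25 | refscores <- c(seq(-1.5, 1.5, 0.75), NA)
26 | tmod <- textmodel_wordscores(data_dfm_lbgexample, y = refscores)
27 | predict(tmod, se.fit = TRUE)
28 | ```
29 | 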
18 | ## Issues
19 |
20 | Training documents and how to select them.
21 |
22 | Uncertainty accounting.
23 |
24 | ## Further Reading
25 |
26 | Additional resources from libraries or the web.
27 |
28 | ## Exercises
29 |
30 | Add some here.
31 |
--------------------------------------------------------------------------------
/22-ml-unsupervised-scaling.qmd:
--------------------------------------------------------------------------------
1 | # Unsupervised Document Scaling {#sec-ml-unsupervised-scaling}
2 |
3 | ## Objectives
4 |
5 | - Notion of unsupervised scaling, based on distance
6 | - LSA
7 | - Correspondence analysis
8 | - The "wordfish" model
9 |
10 | ## Methods
11 |
12 | Applicable methods for the objectives listed above.
13 |
14 | ## Examples
15 |
16 | Examples here.
17 |
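18 | A preliminary sketch fitting the "wordfish" model to the Irish budget debates corpus from **quanteda.textmodels**:
19 | 
20 | ```{r, eval = FALSE}
21 | library("quanteda")
22 | library("quanteda.textmodels")
23 | 
24 | dfmat <- dfm(tokens(data_corpus_irishbudget2010, remove_punct = TRUE))
25 | 
26 | # dir anchors the direction of the latent dimension
27 | tmod <- textmodel_wordfish(dfmat, dir = c(6, 5))
28 | summary(tmod)
29 | ```
30 | 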
18 | ## Issues
19 |
20 | The hazards of *ex post* interpretation of scaling.
21 |
22 | Dimensionality.
23 |
24 | ## Further Reading
25 |
26 | Additional resources from libraries or the web.
27 |
28 | ## Exercises
29 |
30 | Add some here.
31 |
--------------------------------------------------------------------------------
/23-ml-classifiers.qmd:
--------------------------------------------------------------------------------
1 | # Methods for Text Classification {#sec-ml-classifiers}
2 |
3 | ## Objectives
4 |
5 | - Class prediction versus scaling, and the notion of predicting classes
6 | - Naive Bayes
7 | - SVMs
8 | - More advanced methods
9 | - Feature selection for improving prediction
10 | - Assessing performance
11 |
12 | ## Methods
13 |
14 | Applicable methods for the objectives listed above.
15 |
16 | ## Examples
17 |
18 | Examples here.
19 |
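20 | A preliminary sketch: a Naive Bayes classifier on the movie-review corpus from **quanteda.textmodels** (the train/test split is illustrative):
21 | 
22 | ```{r, eval = FALSE}
23 | library("quanteda")
24 | library("quanteda.textmodels")
25 | 
26 | corp <- data_corpus_moviereviews
27 | dfmat <- dfm(tokens(corp, remove_punct = TRUE))
28 | 
29 | set.seed(42)
30 | train <- sample(ndoc(corp), 1500)
31 | 
32 | tmod <- textmodel_nb(dfmat[train, ], y = corp$sentiment[train])
33 | pred <- predict(tmod, newdata = dfmat[-train, ])
34 | table(predicted = pred, actual = corp$sentiment[-train])
35 | ```
36 | 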
20 | ## Issues
21 |
22 | Why kNN is a poor choice.
23 |
24 | Which classifier?
25 |
26 | ngrams or unigrams?
27 |
28 | ## Further Reading
29 |
30 | Additional resources from libraries or the web.
31 |
32 | ## Exercises
33 |
34 | Add some here.
35 |
--------------------------------------------------------------------------------
/24-ml-topicmodels.qmd:
--------------------------------------------------------------------------------
1 | # Topic Modelling {#sec-ml-topicmodels}
2 |
3 | ## Objectives
4 |
5 | - Latent Dirichlet Allocation
6 | - Older methods (e.g. LSA)
7 | - Advanced methods
8 | - Using the **stm** package
9 | - Pre-processing steps and their importance, including compounding multi-word expressions (MWEs)
10 |
11 | ## Methods
12 |
13 | Applicable methods for the objectives listed above.
14 |
15 | ## Examples
16 |
17 | Examples here.
18 |
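19 | The chapter will centre on the **stm** package; as a placeholder, here is a minimal LDA sketch via **topicmodels** (the value of k and the trimming threshold are illustrative):
20 | 
21 | ```{r, eval = FALSE}
22 | library("quanteda")
23 | library("topicmodels")
24 | 
25 | dfmat <- data_corpus_inaugural |>
26 |   tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
27 |   tokens_remove(stopwords("en")) |>
28 |   dfm() |>
29 |   dfm_trim(min_termfreq = 5)
30 | 
31 | lda <- LDA(convert(dfmat, to = "topicmodels"), k = 10,
32 |            control = list(seed = 123))
33 | terms(lda, 5)
34 | ```
35 | 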
19 | ## Issues
20 |
21 | Using other topic model packages
22 |
23 | ## Further Reading
24 |
25 | Additional resources from libraries or the web.
26 |
27 | ## Exercises
28 |
29 | Add some here.
30 |
--------------------------------------------------------------------------------
/25-lms-linguistic-parsing.qmd:
--------------------------------------------------------------------------------
1 | # Linguistic Parsing {#sec-lms-linguistic-parsing}
2 |
3 | ## Objectives
4 |
5 | - What is "NLP" and how does it differ from what we have covered so far?
6 | - Part of speech tagging
7 | - Different schemes for part of speech tagging
8 | - Differentiating homographs based on POS
9 | - POS tagging in other languages
10 | - Named Entity Recognition
11 | - Noun phrase extraction
12 | - Dependency parsing to extract syntactic relations
13 |
14 | ## Methods
15 |
16 | Applicable methods for the objectives listed above.
17 |
18 | ## Examples
19 |
20 | Examples of each of the above methods.
21 |
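22 | A preliminary sketch using **spacyr**, assuming spaCy and its small English model are already installed (e.g. via `spacy_install()`):
23 | 
24 | ```{r, eval = FALSE}
25 | library("spacyr")
26 | spacy_initialize(model = "en_core_web_sm")
27 | 
28 | txt <- c(d1 = "Angela Merkel visited Washington in March 2015.")
29 | parsed <- spacy_parse(txt, pos = TRUE, entity = TRUE, dependency = TRUE)
30 | parsed
31 | entity_extract(parsed)   # just the named entities
32 | 
33 | spacy_finalize()
34 | ```
35 | 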
22 | ## Issues
23 |
24 | Alternative NLP tools. (Here we focus on **spacyr**.)
25 |
26 | Other languages.
27 |
28 | Training new models.
29 |
30 | ## Further Reading
31 |
32 | Additional resources from libraries or the web.
33 |
34 | ## Exercises
35 |
36 | Add some here.
37 |
--------------------------------------------------------------------------------
/26-lms-embeddings.qmd:
--------------------------------------------------------------------------------
1 | # Word Embeddings {#sec-lms-embeddings}
2 |
3 | ## Objectives
4 |
5 | - Understanding what word embeddings are
6 | - Methods: `word2vec`, `doc2vec`, GloVe, SVD/LSA
7 | - Constructing a feature co-occurrence matrix
8 | - Fitting a word embedding model
9 | - Using pre-trained embedding scores with documents
10 | - Prediction based on embeddings
11 | - Further methods
12 |
13 | ## Methods
14 |
15 | Applicable methods for the objectives listed above.
16 |
17 | ## Examples
18 |
19 | Examples here.
20 |
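21 | A preliminary sketch of fitting GloVe embeddings on a quanteda fcm via the **text2vec** package (assuming text2vec >= 0.6, where the constructor argument is `rank`; the rank, window, and iteration settings are illustrative):
22 | 
23 | ```{r, eval = FALSE}
24 | library("quanteda")
25 | library("text2vec")
26 | 
27 | toks <- data_corpus_inaugural |>
28 |   tokens(remove_punct = TRUE) |>
29 |   tokens_tolower()
30 | 
31 | # a full (non-triangular) window-based fcm as input to GloVe
32 | fcmat <- fcm(toks, context = "window", window = 5, tri = FALSE)
33 | 
34 | glove <- GlobalVectors$new(rank = 50, x_max = 10)
35 | wv_main <- glove$fit_transform(fcmat, n_iter = 10)
36 | wv <- wv_main + t(glove$components)  # main + context vectors
37 | dim(wv)
38 | ```
39 | 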
21 | ## Issues
22 |
23 | Issues here.
24 |
25 | ## Further Reading
26 |
27 | Additional resources from libraries or the web.
28 |
29 | ## Exercises
30 |
31 | Add some here.
32 |
--------------------------------------------------------------------------------
/27-lms-transformers.qmd:
--------------------------------------------------------------------------------
1 | # Large Language Models and Transformers {#sec-lms-transformers}
2 |
3 | ## Objectives
4 |
5 | - POS tagging
6 | - Named Entity Recognition
7 | - Sentiment analysis
8 | - Clustering of words or documents
9 | - Classification
10 | - Topic models
11 |
12 | ## Methods
13 |
14 | Applicable methods for the objectives listed above.
15 |
16 | ## Examples
17 |
18 | Examples of each of the above methods.
19 |
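20 | As a placeholder sketch, one route from R to transformer models is **reticulate** plus the Hugging Face `transformers` Python library; this assumes a configured Python environment with `transformers` and a backend such as PyTorch installed:
21 | 
22 | ```{r, eval = FALSE}
23 | library("reticulate")
24 | 
25 | transformers <- import("transformers")
26 | # a default pre-trained sentiment pipeline
27 | classifier <- transformers$pipeline("sentiment-analysis")
28 | classifier("Text analysis in R keeps getting better.")
29 | ```
30 | 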
20 | ## Issues
21 |
22 | Alternative NLP tools.
23 |
24 | Other languages.
25 |
26 | Training new models.
27 |
28 | ## Further Reading
29 |
30 | Additional resources from libraries or the web.
31 |
32 | ## Exercises
33 |
34 | Add some here.
35 |
--------------------------------------------------------------------------------
/28-further-tidy.qmd:
--------------------------------------------------------------------------------
1 | # Integrating "Tidy" Approaches {#sec-furthertidy}
2 |
3 | ## Objectives
4 |
5 | - What is the "tidy" approach and how does it apply to text?
6 | - The **tidytext** package and its advantages
7 | - Switching from **quanteda** to **tidytext** (and back)
8 | - Advantages of non-tidy text objects
9 |
10 | ## Methods
11 |
12 | Applicable methods for the objectives listed above.
13 |
14 | ## Examples
15 |
16 | Examples of each of the above methods.
17 |
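18 | A preliminary sketch of the round trip between a **quanteda** dfm and a tidy data frame:
19 | 
20 | ```{r, eval = FALSE}
21 | library("quanteda")
22 | library("tidytext")
23 | 
24 | dfmat <- dfm(tokens(data_corpus_inaugural))
25 | 
26 | # dfm -> one row per (document, term, count) -> back to a dfm
27 | tidy_df <- tidy(dfmat)
28 | head(tidy_df)
29 | dfmat2 <- cast_dfm(tidy_df, document, term, count)
30 | ```
31 | 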
18 | ## Issues
19 |
20 | Efficiency. Complexity.
21 |
22 | ## Further Reading
23 |
24 | Tidy resources. The **tidytext** book (Silge and Robinson). *R for Data Science* (Wickham and Grolemund).
25 |
26 | ## Exercises
27 |
28 | Add some here.
29 |
--------------------------------------------------------------------------------
/29-further-hardlanguages.qmd:
--------------------------------------------------------------------------------
1 | # Text Analysis in "Hard" Languages {#sec-further-hardlangs}
2 |
3 | ## Objectives
4 |
5 | - Non-English characters, and encoding issues
6 | - Segmentation in languages that do not use whitespace delimiters between words
7 | - Right-to-left languages
8 | - Emoji
9 |
10 | ## Methods
11 |
12 | Applicable methods for the objectives listed above.
13 |
14 | ## Examples
15 |
16 | Chinese, Japanese, Korean.
17 |
18 | Hindi, Georgian.
19 |
20 | Arabic and other RTL languages.
21 |
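22 | A preliminary sketch: **quanteda**'s tokeniser uses ICU word-boundary rules, so no whitespace delimiters are required (the sentences are illustrative):
23 | 
24 | ```{r, eval = FALSE}
25 | library("quanteda")
26 | 
27 | txt <- c(zh = "我们用R分析中文文本。",
28 |          ja = "日本語のテキストも分析できます。")
29 | tokens(txt)
30 | ```
31 | 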
22 | ## Issues
23 |
24 | Stemming, syllable counts, and character counts may be off.
25 |
26 | Font issues.
27 |
28 | ## Further Reading
29 |
30 | Further reading here.
31 |
32 | ## Exercises
33 |
34 | Add some here.
35 |
--------------------------------------------------------------------------------
/30-appendix-installation.qmd:
--------------------------------------------------------------------------------
1 | # Installing the Required Tools {#sec-appendix-installing}
2 |
3 | ## Objectives
4 |
5 | - Installing R
6 | - Installing RStudio
7 | - Installing **quanteda**
8 | - Installing **spacy**
9 | - Installing companion package(s)
10 | - Keeping up to date
11 | - Troubleshooting problems
12 |
13 | ## Installing R
14 |
15 | R is a free software environment for statistical computing that runs on numerous platforms, including Windows, macOS, Linux, and Solaris. You can find details at https://www.r-project.org/, and follow links there to a set of mirror websites for downloading the latest version.
16 |
17 | We recommend that you always use the latest version of R, which is what we used for compiling this book. There are seldom reasons to use older versions of R, and the R Core Team and the maintainers of the largest repository of R packages, CRAN (the Comprehensive R Archive Network), put an enormous amount of attention and energy into ensuring that extension packages work stably and with one another.
18 |
19 | You can verify which version of R you are using either by viewing the messages on startup, e.g.
20 |
21 | R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
22 | Copyright (C) 2022 The R Foundation for Statistical Computing
23 | Platform: x86_64-apple-darwin17.0 (64-bit)
24 |
25 | R is free software and comes with ABSOLUTELY NO WARRANTY.
26 | You are welcome to redistribute it under certain conditions.
27 | Type 'license()' or 'licence()' for distribution details.
28 |
29 | Natural language support but running in an English locale
30 |
31 | R is a collaborative project with many contributors.
32 | Type 'contributors()' for more information and
33 | 'citation()' on how to cite R or R packages in publications.
34 |
35 | Type 'demo()' for some demos, 'help()' for on-line help, or
36 | 'help.start()' for an HTML browser interface to help.
37 | Type 'q()' to quit R.
38 |
39 | or by calling
40 |
41 | ```{r}
42 | R.Version()$version.string
43 | ```
44 |
45 | ## Installing recommended tools and packages
46 |
47 | ### RStudio
48 |
49 | There are numerous ways of running R, including from the "command line" ("Terminal" on Mac or Linux, or "Command Prompt" on Windows) or from the R GUI console that comes with most installations of R. But our very strong recommendation is to use the outstanding [RStudio Desktop](https://www.rstudio.com/products/rstudio/) IDE. "IDE" stands for *integrated development environment*; RStudio combines a file manager, package manager, help viewer, graphics viewer, environment browser, editor, and much more. Before it can be used, however, you must have installed R.
50 |
51 | RStudio can be installed from https://www.rstudio.com/products/rstudio/download/.
52 |
53 | ### Additional packages
54 |
55 | The main package you will need for the examples in this book is **quanteda**, which provides a framework for the quantitative analysis of textual data. When you install this package, by default it will also install any required packages that **quanteda** depends on to function (such as **stringi**, which it uses for many essential string-handling operations). Because "dependencies" are installed into your local library, these additional packages are also available for you to use independently. You only need to install them once (although they might need updating; see below).
56 |
57 | We also suggest you install:
58 |
59 | - **readtext** - for reading in documents that contain text, and converting them automatically to plain text;
60 |
61 | - **spacyr** - an R wrapper to the Python package [spaCy](https://spacy.io) for natural language processing, including part-of-speech tagging and entity extraction. While we have tried hard to make this automatic, installing and configuring **spacyr** actually involves some major work under the hood, such as installing a version of Python in a self-contained virtual environment and then installing spaCy and one of its language models.
62 |
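63 | Both can be installed from CRAN in the usual way; for **spacyr**, a one-time setup call then creates the Python environment. A minimal sketch:
64 | 
65 | ```{r, eval = FALSE}
66 | install.packages(c("readtext", "spacyr"))
67 | 
68 | # one-time setup: installs spaCy and a small English language model
69 | # into a self-contained Python environment
70 | spacyr::spacy_install()
71 | ```
72 | 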
63 | ### Keeping packages up-to-date
64 |
65 | R has an easy method for ensuring you have the latest versions of installed packages:
66 |
67 | ```{r, eval = FALSE}
68 | update.packages()
69 | ```
70 |
71 | ## Additional Issues
72 |
73 | ### Installing development versions of packages
74 |
75 | The R packages that you can install using the methods described above are the pre-compiled binary versions distributed on CRAN. (Linux installations are the exception, as these are always compiled upon installation.) Sometimes, package developers will publish "development" versions of their packages that have yet to be published on CRAN, for instance on the popular [GitHub](https://github.com) platform, which hosts the world's largest collection of open-source software.
76 |
77 | The **quanteda** package, for instance, is hosted on GitHub at https://github.com/quanteda/quanteda, where its development version tends to be slightly ahead of the CRAN version. If you are feeling adventurous, or need a new version in which a specific issue or bug has been fixed, you can install it from the GitHub source using:
78 |
79 | ```{r, eval = FALSE}
80 | devtools::install_github("quanteda/quanteda")
81 | ```
82 |
83 | Because installing from GitHub is the same as installing from the source code, this also involves compiling the C++ and Fortran source code that makes parts of **quanteda** so fast. For this source installation to work, you will need to have installed the appropriate compilers.
84 |
85 | If you are using a Windows platform, this means you will need also to install the [Rtools](https://CRAN.R-project.org/bin/windows/Rtools/) software available from CRAN.
86 |
87 | If you are using macOS, you should install the [macOS tools](https://cran.r-project.org/bin/macosx/tools/), namely the Clang 6.x compiler and the GNU Fortran compiler (as **quanteda** requires gfortran to build).[^29-appendix-installation-1]\
88 | Linux always compiles packages containing C++ code upon installation, so if you are using that platform then you are unlikely to need to install any additional components in order to install a development version.
89 |
90 | [^29-appendix-installation-1]: If you are still getting errors related to gfortran, follow the fixes [here](https://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks--lgfortran-and--lquadmath-error/).
91 |
92 | ### Troubleshooting
93 |
94 | Most problems come from not having the latest versions of packages installed, so make sure you have updated them using the instructions above.
95 |
96 | Other problems include:
97 | 
98 | - Lack of permissions to install packages. This might affect Windows users of work laptops, whose workplace prevents user modification of any software.
99 | - Lack of internet access, or access being restricted by a firewall or proxy server.
97 |
98 | ## Further Reading
99 |
100 | Hadley Wickham's excellent book [R Packages](http://r-pkgs.had.co.nz/) is well worth reading.
101 |
102 | - Wickham, Hadley. (2015). *R packages: organize, test, document, and share your code.* O'Reilly Media, Inc.
103 |
--------------------------------------------------------------------------------
/31-appendix-encoding.qmd:
--------------------------------------------------------------------------------
1 | # Everything You Never Wanted to Know about Encoding {#sec-appendix-encoding}
2 |
3 | (and were afraid to ask)
4 |
5 | ## Objectives
6 |
7 | - The concept of encoding
8 | - Classic 8-bit encodings
9 | - How to detect encodings
10 | - Converting files
11 | - Converting text once loaded
12 | - The encoding declaration (the encoding "bit") versus the actual text encoding
13 | - When you are likely to encounter problems (with what sorts of files)
14 | - Unicode encodings (UTF-8, UTF-16)
15 | - How to avoid encoding headaches forever
16 |
17 | ## Methods
18 |
19 | Applicable methods for the objectives listed above.
20 |
21 | ## Examples
22 |
23 | Examples here.
24 |
25 | ## Issues
26 |
27 | UTF-16 in some operating systems (Windows).
28 |
29 | Editors that are "encoding smart".
30 |
31 | ## Further Reading
32 |
33 | Further reading here.
34 |
35 | ## Exercises
36 |
37 | Add some here.
38 |
--------------------------------------------------------------------------------
/32-appendix-regex.qmd:
--------------------------------------------------------------------------------
1 | # A Survival Guide to Regular Expressions {#sec-appendix-regex}
2 |
3 | ## Objectives
4 |
5 | - Basics of regular expressions
6 | - Fixed matching, "globs", and their relationship to regexes
7 | - Base R regular expressions versus **stringi**
8 | - Character classes
9 | - Unicode character categories
10 | - Lookaheads
11 |
12 | ## Methods
13 |
14 | Applicable methods for the objectives listed above.
15 |
16 | ## Examples
17 |
18 | Examples here.
19 |
20 | ## Issues
21 |
22 | Older regex variants (PCRE, GNU) versus the **stringi** (ICU) implementation.
23 |
24 | Shortcut names versus Unicode categories.
25 |
26 | ## Further Reading
27 |
28 | Further reading here.
29 |
30 | ## Exercises
31 |
32 | Add some here.
33 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.
6 |
7 | We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
8 |
9 | ## Our Standards
10 |
11 | Examples of behavior that contributes to a positive environment for our community include:
12 |
13 | - Demonstrating empathy and kindness toward other people
14 | - Being respectful of differing opinions, viewpoints, and experiences
15 | - Giving and gracefully accepting constructive feedback
16 | - Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
17 | - Focusing on what is best not just for us as individuals, but for the overall community
18 |
19 | Examples of unacceptable behavior include:
20 |
21 | - The use of sexualized language or imagery, and sexual attention or advances of any kind
22 | - Trolling, insulting or derogatory comments, and personal or political attacks
23 | - Public or private harassment
24 | - Publishing others' private information, such as a physical or email address, without their explicit permission
25 | - Other conduct which could reasonably be considered inappropriate in a professional setting
26 |
27 | ## Enforcement Responsibilities
28 |
29 | Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
30 |
31 | Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
32 |
33 | ## Scope
34 |
35 | This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
36 |
37 | ## Enforcement
38 |
39 | Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at K.R.Benoit\@lse.ac.uk. All complaints will be reviewed and investigated promptly and fairly.
40 |
41 | All community leaders are obligated to respect the privacy and security of the reporter of any incident.
42 |
43 | ## Enforcement Guidelines
44 |
45 | Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
46 |
47 | ### 1. Correction
48 |
49 | **Community Impact**: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
50 |
51 | **Consequence**: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
52 |
53 | ### 2. Warning
54 |
55 | **Community Impact**: A violation through a single incident or series of actions.
56 |
57 | **Consequence**: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
58 |
59 | ### 3. Temporary Ban
60 |
61 | **Community Impact**: A serious violation of community standards, including sustained inappropriate behavior.
62 |
63 | **Consequence**: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
64 |
65 | ### 4. Permanent Ban
66 |
67 | **Community Impact**: Demonstrating a pattern of violation of community standards, including sustained inappropriate behaviour, harassment of an individual, or aggression toward or disparagement of classes of individuals.
68 |
69 | **Consequence**: A permanent ban from any sort of public interaction within the community.
70 |
71 | ## Attribution
72 |
73 | This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 2.1, available at https://www.contributor-covenant.org/version/2/1/code_of_conduct.html.
74 |
75 | Community Impact Guidelines were inspired by [Mozilla's code of conduct enforcement ladder](https://github.com/mozilla/inclusion).
76 |
77 | For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.
78 |
--------------------------------------------------------------------------------
/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Type: Book
2 | Package: Text-Analysis-Using-R
3 | Title: Text Analysis Using R
4 | Version: 0.1.0
5 | Depends:
6 | R (>= 4.1.0)
7 | Imports:
8 | quanteda,
9 | readtext,
10 | quanteda.textstats,
11 | quanteda.textplots,
12 | quanteda.textmodels,
13 | stopwords,
14 | tidyverse
15 | Encoding: UTF-8
16 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Text Analysis Using R
2 |
3 |
4 |
5 | [](https://github.com/quanteda/Text-Analysis-Using-R/actions/workflows/publish.yml)
6 |
7 |
8 |
9 | This is the source code of the book in progress, *Text Analysis Using R*, by Kenneth Benoit and Stefan Müller.
10 |
11 | The draft HTML version of this book is published at https://quanteda.github.io/Text-Analysis-Using-R/. It is authored in [Quarto](https://quarto.org) and we will be updating it as we fill in the chapter outline with completed draft chapters.
12 |
13 | ## Companion R package
14 |
15 | We have released a companion R package, TAUR, containing data and functions used in the book. You can install it using:
16 |
17 | ``` r
18 | remotes::install_github("quanteda/TAUR")
19 | ```
20 |
21 | ## Providing feedback or contributing
22 |
23 | This book is a work in progress and we welcome your feedback (sent through filing an [issue](https://github.com/quanteda/Text-Analysis-Using-R/issues)). You are even more welcome to edit the book yourself and submit the changes as a pull request. (Instructions for doing that can be found [here](https://gist.github.com/Chaser324/ce0505fbed06b947d962).) We are using the brilliant new [Quarto](https://quarto.org) framework for authoring this book. You can find full details as to how this works [here](https://quarto.org/docs/books/).
24 |
25 | The *Text Analysis Using R* project is released with a [Contributor Code of Conduct](CODE_OF_CONDUCT.md). By participating in this project (for example, by submitting an [issue](https://github.com/quanteda/Text-Analysis-Using-R/issues) with suggestions or edits) you agree to abide by its terms. Instructions for making contributions can be found in the [`contributing.md`](contributing.md) file.
26 |
--------------------------------------------------------------------------------
/TAUR.bib:
--------------------------------------------------------------------------------
1 | %% This BibTeX bibliography file was created using BibDesk.
2 | %% https://bibdesk.sourceforge.io/
3 |
4 | %% Saved with string encoding Unicode (UTF-8)
5 |
6 |
7 | @article{pangetal2002,
8 | author = {Pang, B. and Lee, L. and Vaithyanathan, S.},
9 | date-added = {2022-11-04 15:45:35 +1100},
10 | date-modified = {2022-11-04 15:45:53 +1100},
11 | journal = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
12 | pages = {79-86},
13 | title = {Thumbs up? Sentiment classification using machine learning techniques},
14 | year = {2002}}
15 |
16 | @article{Most63,
17 | author = {F. Mosteller and D. L. Wallace},
18 | date-added = {2022-11-04 15:27:17 +1100},
19 | date-modified = {2022-11-04 15:27:17 +1100},
20 | 	journal = {Journal of the American Statistical Association},
21 | number = {302},
22 | pages = {275--309},
23 | title = {Inference in an Authorship Problem},
24 | volume = {58},
25 | year = {1963}}
26 |
27 | @article{monroe2008,
28 | author = {Monroe, B.L. and Quinn, K.M. and Colaresi, M.P.},
29 | date-added = {2022-11-04 15:22:21 +1100},
30 | date-modified = {2022-11-04 15:22:58 +1100},
31 | journal = {Political Analysis},
32 | number = {4},
33 | pages = {372--403},
34 | title = {Fightin' Words: {L}exical Feature Selection and Evaluation for Identifying the Content of Political Conflict},
35 | volume = {16},
36 | year = {2008}}
37 |
38 | @manual{spacyr,
39 | author = {Kenneth Benoit and Akitaka Matsuo},
40 | date-added = {2022-08-24 17:18:06 +1000},
41 | date-modified = {2022-08-24 17:18:06 +1000},
42 | note = {R package version 1.2.1},
43 | title = {spacyr: Wrapper to the 'spaCy' 'NLP' Library},
44 | url = {https://CRAN.R-project.org/package=spacyr},
45 | year = {2020},
46 | bdsk-url-1 = {https://CRAN.R-project.org/package=spacyr}}
47 |
48 | @article{spaCy,
49 | author = {Honnibal, Matthew and Montani, Ines and Van Landeghem, Sophie and Boyd, Adriane},
50 | date-added = {2022-08-24 16:27:38 +1000},
51 | date-modified = {2022-08-24 16:27:38 +1000},
52 | doi = {10.5281/zenodo.1212303},
53 | title = {{spaCy}: Industrial-strength Natural Language Processing in Python},
54 | year = {2020},
55 | bdsk-url-1 = {https://doi.org/10.5281/zenodo.1212303}}
56 |
57 | @book{becue2019textual,
58 | author = {B{\'e}cue-Bertaut, Monica},
59 | date-added = {2022-08-24 16:18:37 +1000},
60 | date-modified = {2022-08-24 16:18:37 +1000},
61 | publisher = {CRC Press},
62 | title = {Textual data Science with {R}},
63 | year = {2019}}
64 |
65 | @book{turenne2016analyse,
66 | author = {Turenne, Nicolas},
67 | date-added = {2022-08-24 16:16:50 +1000},
68 | date-modified = {2022-08-24 16:16:50 +1000},
69 | publisher = {ISTE Group},
70 | title = {Analyse de donn{\'e}es textuelles sous R},
71 | year = {2016}}
72 |
73 | @book{kwartler2017text,
74 | author = {Kwartler, Ted},
75 | date-added = {2022-08-24 15:58:45 +1000},
76 | date-modified = {2022-08-24 15:58:45 +1000},
77 | publisher = {John Wiley \& Sons},
78 | address = {Chichester},
79 | title = {Text Mining in Practice with {R}},
80 | year = {2017}}
81 |
82 | @article{tidytextJOSS,
83 | author = {Julia Silge and David Robinson},
84 | date-added = {2022-08-24 15:56:19 +1000},
85 | date-modified = {2022-08-24 15:56:19 +1000},
86 | doi = {10.21105/joss.00037},
87 | journal = {Journal of Open Source Software},
88 | number = {3},
89 | title = {tidytext: Text Mining and Analysis Using Tidy Data Principles in R},
90 | url = {http://dx.doi.org/10.21105/joss.00037},
91 | volume = {1},
92 | year = {2016},
93 | bdsk-url-1 = {http://dx.doi.org/10.21105/joss.00037}}
94 |
95 | @book{silge2017text,
96 | author = {Silge, Julia and Robinson, David},
97 | date-added = {2022-08-24 15:49:42 +1000},
98 | date-modified = {2022-08-24 15:49:42 +1000},
99 | publisher = {O'Reilly Media, Inc.},
100 | title = {Text Mining with R: A Tidy Approach},
101 | year = {2017}}
102 |
103 | @book{xie2015,
104 | address = {Boca Raton, Florida},
105 | author = {Yihui Xie},
106 | edition = {2nd},
107 | note = {ISBN 978-1498716963},
108 | publisher = {Chapman and Hall/CRC},
109 | title = {Dynamic Documents with {R} and knitr},
110 | url = {http://yihui.name/knitr/},
111 | year = {2015},
112 | bdsk-url-1 = {http://yihui.name/knitr/}}
113 |
114 | @article{quantedaJOSS,
115 | author = {Kenneth Benoit and Kohei Watanabe and Haiyan Wang and Paul Nulty and Adam Obeng and Stefan M{\"u}ller and Akitaka Matsuo},
116 | doi = {10.21105/joss.00774},
117 | journal = {Journal of Open Source Software},
118 | number = {30},
119 | pages = {774},
120 | title = {quanteda: An {R} Package for the Quantitative Analysis of Textual Data},
121 | url = {https://quanteda.io},
122 | volume = {3},
123 | year = {2018},
124 | bdsk-url-1 = {https://quanteda.io},
125 | bdsk-url-2 = {https://doi.org/10.21105/joss.00774}}
126 |
127 | @article{young2012,
128 | author = {Young, Lori and Soroka, Stuart N.},
129 | date-modified = {2019-10-13 20:17:38 +0200},
130 | journal = {Political Communication},
131 | number = {2},
132 | pages = {205-231},
133 | title = {Affective News: The Automated Coding of Sentiment in Political Texts},
134 | volume = {29},
135 | year = {2012}}
136 |
137 | @incollection{benoit2020text,
138 | address = {Thousand Oaks},
139 | author = {Benoit, Kenneth},
140 | booktitle = {Handbook of Research Methods in Political Science and International Relations},
141 | date-added = {2020-09-30 08:13:11 +0100},
142 | date-modified = {2020-09-30 08:21:58 +0100},
143 | editor = {Curini, Luigi and Franzese, Robert},
144 | pages = {461-497},
145 | publisher = {Sage},
146 | rating = {5},
147 | subtitle = {An Overview},
148 | title = {Text as Data},
149 | year = {2020}}
150 |
151 | @manual{rvest,
152 | author = {Hadley Wickham},
153 | note = {R package version 1.0.2},
154 | title = {rvest: Easily Harvest (Scrape) Web Pages},
155 | url = {https://CRAN.R-project.org/package=rvest},
156 | year = {2021},
157 | bdsk-url-1 = {https://CRAN.R-project.org/package=rvest}}
158 |
159 | @article{JSSv025i05,
160 | 	abstract = {During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the \textbf{tm} package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.},
161 | author = {Feinerer, Ingo and Hornik, Kurt and Meyer, David},
162 | doi = {10.18637/jss.v025.i05},
163 | journal = {Journal of Statistical Software},
164 | number = {5},
165 | pages = {1--54},
166 | title = {Text Mining Infrastructure in {R}},
167 | url = {https://www.jstatsoft.org/index.php/jss/article/view/v025i05},
168 | volume = {25},
169 | year = {2008},
170 | bdsk-url-1 = {https://www.jstatsoft.org/index.php/jss/article/view/v025i05},
171 | bdsk-url-2 = {https://doi.org/10.18637/jss.v025.i05}}
172 |
173 | @article{tmpackage,
174 | author = {Ingo Feinerer and Kurt Hornik and David Meyer},
175 | journal = {Journal of Statistical Software},
176 | month = {March},
177 | number = {5},
178 | pages = {1--54},
179 | title = {Text Mining Infrastructure in {R}},
180 | url = {https://www.jstatsoft.org/v25/i05/},
181 | volume = {25},
182 | year = {2008},
183 | bdsk-url-1 = {https://www.jstatsoft.org/v25/i05/}}
184 |
185 | @article{benoit19sophistication,
186 | author = {Benoit, Kenneth and Munger, Kevin and Spirling, Arthur},
187 | doi = {10.1111/ajps.12423},
188 | journal = {American Journal of Political Science},
189 | keywords = {seconary, phd, readability},
190 | number = {2},
191 | pages = {491-508},
192 | title = {Measuring and Explaining Political Sophistication Through Textual Complexity},
193 | volume = {63},
194 | year = {2019},
195 | bdsk-url-1 = {https://doi.org/10.1111/ajps.12423}}
196 |
197 | @article{boussalis21reactions,
198 | author = {Boussalis, Constantine and Coan, Travis G. and Holman, Mirya R. and M{\"u}ller, Stefan},
199 | doi = {10.1017/S0003055421000666},
200 | journal = {American Political Science Review},
201 | number = {4},
202 | pages = {1242-1257},
203 | title = {Gender, Candidate Emotional Expression, and Voter Reactions During Televised Debates},
204 | volume = {115},
205 | year = {2021},
206 | bdsk-url-1 = {https://doi.org/10.1017/S0003055421000666}}
207 |
208 | @article{herzog15cuts,
209 | author = {Herzog, Alexander and Benoit, Kenneth},
210 | doi = {10.1086/682670},
211 | journal = {The Journal of Politics},
212 | number = {4},
213 | pages = {1157-1175},
214 | 	title = {The Most Unkindest Cuts: Speaker Selection and Expressed Government Dissent During Economic Crisis},
215 | volume = {77},
216 | year = {2015},
217 | bdsk-url-1 = {https://doi.org/10.1086/682670}}
218 |
219 | @article{castanhosilva22eu,
220 | author = {Castanho Silva, Bruno and Proksch, Sven-Oliver},
221 | doi = {10.1017/psrm.2021.36},
222 | journal = {Political Science Research and Methods},
223 | keywords = {secondary, Twitter},
224 | number = {doi: 10.1017/psrm.2021.36},
225 | title = {Politicians Unleashed? Political Communication on Twitter and in Parliament in Western Europe},
226 | volume = {published ahead of print},
227 | year = {2021},
228 | bdsk-url-1 = {https://doi.org/10.1017/psrm.2021.36}}
229 |
230 | @article{mueller22temporal,
231 | author = {M{\"u}ller, Stefan},
232 | doi = {10.1086/715165},
233 | journal = {The Journal of Politics},
234 | number = {1},
235 | pages = {585-590},
236 | title = {The Temporal Focus of Campaign Communication},
237 | volume = {84},
238 | year = {2022},
239 | bdsk-url-1 = {https://doi.org/10.1086/715165}}
240 |
241 | @article{card22immig,
242 | 	author = {Card, Dallas and Chang, Serina and Becker, Chris and Mendelsohn, Julia and Voigt, Rob and Boustan, Leah and Abramitzky, Ran and Jurafsky, Daniel},
243 | doi = {10.1073/pnas.2120510119},
244 | journal = {Proceedings of the National Academy of Sciences of the United States of America (PNAS)},
245 | number = {31},
246 | pages = {e2120510119},
247 | title = {Computational Analysis of 140 years of US Political Speeches Reveals More Positive but Increasingly Polarized Framing of Immigration},
248 | volume = {119},
249 | year = {2022},
250 | bdsk-url-1 = {https://doi.org/10.1073/pnas.2120510119}}
251 |
252 | @book{grimmer22textasdata,
253 | address = {Princeton},
254 | author = {Grimmer, Justin and Roberts, Margaret E. and Stewart, Brandon M.},
255 | date-modified = {2022-08-24 15:49:02 +1000},
256 | publisher = {Princeton University Press},
257 | subtitle = {A New Framework for Machine Learning and the Social Sciences},
258 | title = {Text as Data: The promise and pitfalls of automatic content analysis methods for political texts},
259 | year = {2022}}
260 |
261 | @book{wickham23r4ds,
262 | address = {Sebastopol},
263 | author = {Wickham, Hadley and Çetinkaya-Rundel, Mine and Grolemund, Garrett},
264 | publisher = {O'Reilly},
265 | subtitle = {Import, Tidy, Transform, Visualize, and Model Data},
266 | title = {R for Data Science},
267 | url = {https://r4ds.hadley.nz},
268 | edition = {2},
269 | year = {2023}}
270 |
271 |
272 | @Manual{stopwords,
273 | title = {stopwords: Multilingual Stopword Lists},
274 | author = {Kenneth Benoit and David Muhr and Kohei Watanabe},
275 | year = {2021},
276 | note = {R package version 2.3},
277 | url = {https://CRAN.R-project.org/package=stopwords},
278 | }
279 |
280 | @article{denny18preprocess,
281 | 	author = {Denny, Matthew J. and Spirling, Arthur},
282 | doi = {10.1017/pan.2017.44},
283 | journal = {Political Analysis},
284 | number = {2},
285 | pages = {168-189},
286 | title = {Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It},
287 | volume = {26},
288 | year = {2018}}
289 |
290 |
291 |
292 | @article{monroe08intro,
293 | author = {Monroe, Burt L. and Schrodt, Philip A.},
294 | doi = {10.1093/pan/mpn017},
295 | journal = {Political Analysis},
296 | number = {4},
297 | pages = {351-355},
298 | subtitle = {The Statistical Analysis of Political Text},
299 | title = {Introduction to the Special Issue},
300 | volume = {16},
301 | year = {2008}}
302 |
303 |
304 | @article{wilbur92stopwords,
305 | author = {Wilbur, W. John and Sirotkin, Karl},
306 | doi = {10.1177/01655515920180010},
307 | journal = {Journal of Information Science},
308 | number = {1},
309 | pages = {45-55},
310 | title = {The Automatic Identification of Stop Words},
311 | volume = {18},
312 | year = {1992}}
313 |
314 |
315 | @article{sarica21stopwords,
316 | author = {Sarica, Serhad and Luo, Jianxi},
317 | doi = {10.1371/journal.pone.0254937},
318 | journal = {PLoS One},
319 | number = {8},
320 | pages = {e0254937},
321 | title = {Stopwords in Technical Language Processing},
322 | volume = {16},
323 | year = {2021}}
324 |
325 |
326 | @inproceedings{porter01snowball,
327 | title={Snowball: A Language for Stemming Algorithms},
328 | author={Martin F. Porter},
329 | year={2001}
330 | }
331 |
332 |
333 | @article{tokenizers,
334 | title = {Fast, Consistent Tokenization of Natural Language Text},
335 | author = {Lincoln A. Mullen and Kenneth Benoit and Os Keyes and Dmitry Selivanov and Jeffrey Arnold},
336 | journal = {Journal of Open Source Software},
337 | year = {2018},
338 | volume = {3},
339 | issue = {23},
340 | pages = {655},
341 | doi = {10.21105/joss.00655}
342 | }
343 |
344 |
345 | @book{hvitfeldt21ml,
346 | address = {Boca Raton},
347 | author = {Hvitfeldt, Emil and Silge, Julia},
348 | publisher = {CRC Press},
349 | title = {Supervised Machine Learning For Text Analysis in R},
350 | url = {https://smltar.com},
351 | year = {2021}
352 | }
353 |
354 |
355 | @article{schofield16stem,
356 | author = {Schofield, Alexandra and Mimno, David},
357 | doi = {10.1162/tacl_a_00099},
358 | journal = {Transactions of the Association for Computational Linguistics},
359 | pages = {287-300},
360 | subtitle = {The Effects of Stemmers on Topic Models},
361 | title = {Comparing Apples to Apple},
362 | volume = {4},
363 | year = {2016}
364 | }
365 |
366 |
367 | @article{lupia20nsf,
368 | author = {Lupia, Arthur and Soroka, Stuart N. and Beatty, Alison},
369 | doi = {10.1126/sciadv.aaz6300},
370 | journal = {Science Advances},
371 | number = {33},
372 | pages = {eaaz6300},
373 | title = {What Does Congress Want from the National Science Foundation? A Content Analysis of Remarks from 1995 to 2018},
374 | volume = {6},
375 | year = {2020},
376 | bdsk-url-1 = {https://doi.org/10.1126/sciadv.aaz6300}}
377 |
378 |
379 | @article{gessler22immig,
380 | author = {Gessler, Theresa and Hunger, Sophia},
381 | doi = {10.1017/psrm.2021.64},
382 | journal = {Political Science Research and Methods},
383 | number = {3},
384 | pages = {524 - 544},
385 | title = {How the Refugee Crisis and Radical Right Parties Shape Party Competition on Immigration},
386 | volume = {10},
387 | year = {2022}
388 | }
389 |
390 |
391 | @article{rauh20eu,
392 | author = {Rauh, Christopher and Bes, Bart Joachim and Schoonvelde, Martijn},
393 | doi = {10.1111/1475-6765.12350},
394 | journal = {European Journal of Political Research},
395 | number = {2},
396 | pages = {397-423},
397 | title = {Undermining, Defusing or Defending European Integration? Assessing Public Communication of European Executives in Times of EU Politicisation},
398 | volume = {59},
399 | year = {2020}}
400 |
401 |
402 | @Manual{lda,
403 | title = {lda: Collapsed Gibbs Sampling Methods for Topic Models},
404 | author = {Jonathan Chang},
405 | year = {2015},
406 | note = {R package version 1.4.2},
407 | url = {https://CRAN.R-project.org/package=lda},
408 | }
409 |
410 |
411 | @Article{stm,
412 | title = {{stm}: An {R} Package for Structural Topic Models},
413 | author = {Margaret E. Roberts and Brandon M. Stewart and Dustin Tingley},
414 | journal = {Journal of Statistical Software},
415 | year = {2019},
416 | volume = {91},
417 | number = {2},
418 | pages = {1--40},
419 | doi = {10.18637/jss.v091.i02},
420 | }
421 |
422 |
423 | @Article{topicmodels,
424 | title = {{topicmodels}: An {R} Package for Fitting Topic Models},
425 | author = {Bettina Gr\"un and Kurt Hornik},
426 | journal = {Journal of Statistical Software},
427 | year = {2011},
428 | volume = {40},
429 | number = {13},
430 | pages = {1--30},
431 | doi = {10.18637/jss.v040.i13},
432 | }
433 |
434 |
435 | @article{keyatm,
436 | 	author = {Eshima, Shusei and Imai, Kosuke and Sasaki, Tomoya},
437 | date-added = {2023-08-29 11:36:57 +0100},
438 | date-modified = {2023-08-29 11:38:25 +0100},
439 | doi = {10.1111/ajps.12779},
440 | journal = {American Journal of Political Science},
441 | keywords = {secondary, keyATM},
442 | title = {Keyword Assisted Topic Models},
443 | volume = {online first},
444 | year = {2023}}
445 |
446 |
447 | @Manual{lsx,
448 | title = {LSX: Semi-Supervised Algorithm for Document Scaling},
449 | author = {Kohei Watanabe},
450 | year = {2023},
451 | note = {R package version 1.3.1},
452 | url = {https://CRAN.R-project.org/package=LSX},
453 | }
454 |
455 | @Manual{context,
456 | title = {conText: 'a la Carte' on Text (ConText) Embedding Regression},
457 | author = {Pedro L. Rodriguez and Arthur Spirling and Brandon Stewart},
458 | year = {2023},
459 | note = {R package version 1.4.3},
460 | url = {https://CRAN.R-project.org/package=conText},
461 | }
462 |
463 |
464 |
465 | @book{manning08iir,
466 | address = {New York},
467 | author = {Manning, Christopher D. and Raghavan, Prabhakar and Sch{\"u}tze, Hinrich},
468 | date-added = {2017-08-09 10:24:52 +0000},
469 | date-modified = {2018-12-05 17:35:47 +0000},
470 | gender = {pm},
471 | keywords = {secondary, data mining, text analysis},
472 | publisher = {Cambridge University Press},
473 | title = {An Introduction to Information Retrieval},
474 | url = {https://nlp.stanford.edu/IR-book/},
475 | year = {2008}}
476 |
477 |
478 | @article{slapin08wordfish,
479 | author = {Slapin, Jonathan B. and Proksch, Sven-Oliver},
480 | doi = {10.1111/j.1540-5907.2008.00338.x},
481 | journal = {American Journal of Political Science},
482 | number = {3},
483 | pages = {705-722},
484 | title = {A Scaling Model for Estimating Time-Series Party Positions from Texts},
485 | volume = {52},
486 | year = {2008}}
487 |
488 |
489 |
490 | @article{crisp21voteseeking,
491 | author = {Crisp, Brian F. and Schneider, Benjamin and Catalinac, Amy and Muraoka, Taishi},
492 | doi = {10.1016/j.electstud.2021.102369},
493 | journal = {Electoral Studies},
494 | pages = {102369},
495 | title = {Capturing Vote-seeking Incentives and the Cultivation of a Personal and Party Vote},
496 | volume = {72},
497 | year = {2021}}
--------------------------------------------------------------------------------
/_freeze/00-preface/execute-results/tex.json:
--------------------------------------------------------------------------------
1 | {
2 | "hash": "9d4242545130ba5f73dd575d197a6061",
3 | "result": {
4 | "markdown": "::: hidden\n\n\n$$\n \\def\\quanteda{{\\bf quanteda}}\n$$\n\n\n:::\n\n# Preface {.unnumbered}\n\nThe uses and means for analysing text, especially using quantitative and computational approaches, have exploded in recent years across the fields of academic, industry, policy, and other forms of research and analysis. Text mining and text analysis have become the focus of a major methodological wave of innovation in the social sciences, especially political science but also extending to finance, media and communications, and sociology. In non-academic research or industry, furthermore, text mining applications are found in almost every sector, given the ubiquity of text and the need to analyse it.\n\n\"Text analysis\" is a broad label for a wide range of tools applied to *textual data*. It encompasses all manner of methods for turning unstructured raw data in the form of natural language documents into structured data that can be analysed systematically and using specific quantitative approaches. Text analytic methods include descriptive analysis, keyword analysis, topic analysis, measurement and scaling, clustering, text and vocabulary comparisons, or sentiment analysis. It may include causal analysis or predictive modelling. In their most advanced current forms, these methods extend to the natural language generation models that power the newest generation of artificial intelligence systems.\n\nThis book provides a practical guide introducing the fundamentals of text analysis methods and how to implement them using the R programming language. It covers a wide range of topics to provide a useful resource for a range of students from complete beginners to experienced users of R wanting to learn more advanced techniques. Our emphasis is on the text analytic workflow as a whole, ranging from an introduction to text manipulation in R, all the way through to advanced machine learning methods for textual data.\n\n## Why another book on text mining in R? {.unnumbered}\n\nText mining tools for R have existed for many years, led by the venerable **tm** package [@tmpackage]. In 2015, the first version of **quanteda** [@quantedaJOSS] was published on CRAN. Since then, the package has undergone three major versions, with each release improving its consistency, power, and usability. Starting with version 2, **quanteda** split into a series of packages designed around its different functions (such as plotting, statistics, or machine learning), with **quanteda** retaining the core text processing functions. Other popular alternatives exist, such as **tidytext** [@tidytextJOSS], although as we explain in [Chapter -@sec-furthertidy], **quanteda** works perfectly well with the R \"tidyverse\" and related approaches to text analysis.\n\nThis is hardly the only book covering how to work with text in R. @silge2017text introduced the **tidytext** package and numerous examples for mining text using the tidyverse approach to programming in R, including keyword analysis, mining word associations, and topic modelling. @kwartler2017text provides a good coverage of text mining workflows, plus methods for text visualisations, sentiment scoring, clustering, and classification, as well as methods for sourcing and manipulating source documents. Earlier works include @turenne2016analyse and @becue2019textual.\n\n::: {.callout-note icon=\"true\"}\n## One of the two hard things in computer science: Naming things\n\nWhere does does the package name come from? 
*quanteda* is a portmanteau name indicating the purpose of the package, which is the **qu**antitative **an**alysis of **te**xtual **da**ta.\n:::\n\nSo why another book? First, the R ecosystem for text analysis is rapidly evolving, with the publication in recent years of massively improved, specialist text analysis packages such as **quanteda**. None of the earlier books covers this amazing family of packages. In addition, the field of text analysis methodologies has also advanced rapidly, including machine learning approaches based on artificial neural networks and deep learning. These and the packages that make use of them---for instance **spacyr** [@spacyr] for harnessing the power of the spaCy natural language processing library for Python [@spaCy]---have yet to be presented in a systematic, book-length treatment. Furthermore, as we are the authors and creators of many of the packages we cover in this book, we view this as the authoritative reference and how-to guide for using these packages.\n\n## What to Expect from This Book {.unnumbered}\n\nThis book is meant to be as a practical resource for those confronting the practical challenges of text analysis for the first time, focusing on how to do this in R. Our main focus is on the **quanteda** package and its extensions, although we also cover more general issues including a brief overview of the R functions required to get started quickly with practical text analysis We cover an introduction to the R language, for\n\n@benoit2020text provides a detailed overview of the analysis of textual data, and what distinguishes textual data from other forms of data. It also clearly articulates what is meant by treating text \"as data\" for analysis. This book and the approaches it presents are firmly geared toward this mindset.\n\nEach chapter is structured so to provide a continuity across each topic. For each main subject explained in a chapter, we clearly explain the objective of the chapter, then describe the text analytic methods in an applied fashion so that readers are aware of the workings of the method. We then provide practical examples, with detailed working code in R as to how to implement the method. Next, we identify any special issues involved in correctly applying the method, including how to hand the more complicated situations that may arise in practice. Finally, we provide further reading for readers wishing to learn more, and exercises for those wishing for hands-on practice, or for assigning these when using them in teaching environment.\n\nWe have years of experience in teaching this material in many practical short courses, summer schools, and regular university courses. We have drawn extensively from this experience in designing the overall scope of this book and the structure of each chapter. Our goal is to make the book is suitable for self-learning or to form the basis for teaching and learning in a course on applied text analysis. Indeed, we have partly written this book to assign when teaching text analysis in our own curricula.\n\n## Who This Book Is For {.unnumbered}\n\nWe don't assume any previous knowledge of text mining---indeed, the goal of this book is to provide that, from a foundation through to some very advanced topics. Getting use from this book does not require a pre-existing familiarity with R, although, as the slogan goes, \"every little helps\". In Part I we cover some of the basics of R and how to make use of it for text analysis specifically. 
Readers will also learn more of R through our extensive examples. However, experience in teaching and presenting this material tells us that a foundation of R will enable readers to advance through the applications far more rapidly than if they were learning R from scratch at the same time that they take the plunge into the possibly strange new world of text analysis.\n\nWe are both academics, although we also have experience working in industry or in applying text analysis for non-academic purposes. The typical reader may be a student of text analysis in the literal sense (of being an student) or in the general sense of someone studying techniques in order to improve their practical and conceptual knowledge. Our orientation as social scientists, with a specialization in political text and political communications and media. But this book is for everyone: social scientists, computer scientists, scholars of digital humanities, researchers in marketing and management, and applied researchers working in policy, government, or business fields. This book is written to have the credibility and scholarly rigour (for referencing methods, for instance) needed by academic readers, but is designed to be written in a straightforward, jargon-free (as much as we were able!) manner to be of maximum practical use to non-academic analysts as well.\n\n## How the Book is Structured {.unnumbered}\n\n### Sections {.unnumbered}\n\nThe book is divided into seven sections. These group topics that we feel represent common stages of learning in text analysis, or similar groups of topics that different users will be drawn too. By grouping stages of learning, we make it possible also for intermediate or advanced users to jump to the section that interests them most, or to the sections where they feel they need additional learning.\n\nOur sections are:\n\n- **(Working in R):** This section is designed for beginners to learn quickly the R required for the techniques we cover in the book, and to guide them in learning a proper R workflow for text analysis.\n\n- **Acquiring texts:** Often described (by us at least) as the hardest problem in text analysis, we cover how to source documents, including from Internet sources, and to import these into R as part of the quantitative text analysis pipeline.\n\n- **Managing textual data using quanteda:** In this section, we introduce the **quanteda** package, and cover each stage of textual data processing, from creating structured corpora, to tokenisation, and building matrix representations of these tokens. We also talk about how to build and manage structured lexical resources such as dictionaries and stop word lists.\n\n- **Exploring and describing texts:** How to get overviews of texts using summary statistics, exploring texts using keywords-in-context, extracting target words, and identifying key words.\n\n- **Statistics for comparing texts:** How to characterise documents in terms of their lexical diversity. 
readability, similarity, or distance.\n\n- **Machine learning for texts:** How to apply scaling models, predictive models, and classification models to textual matrices.\n\n- **Further methods for texts:** Advanced methods including the use of natural language models to annotate texts, extract entities, or use word embeddings; integrating **quanteda** with \"tidy\" data approaches; and how to apply text analysis to \"hard\" languages such as Chinese (hard because of the high-dimensional character set and the lack of whitespace to delimit words).\n\nFinally, in several **appendices**, we provide more detail about some tricky subjects, such as text encoding formats and working with regular expressions.\n\n### Chapter structure {.unnumbered}\n\nOur approach splits each chapter into the following components:\n\n- Objectives. We explain the purpose of each chapter and what we believe are the most important learning outcomes.\n\n- Methods. We clearly explain the methodological elements of each chapter, through a combination of high-level explanations, formulas, and references.\n\n- Examples. We use practical examples, with R code, demonstrating how to apply the methods to realise the objectives.\n\n- Issues. We identify any special issues, potential problems, or additional approaches that a user might face when applying the methods to their text analysis problem.\n\n- Further Reading. In part because our scholarly backgrounds compel us to do so, and in part because we know that many readers will want to read more about each method, each chapter contains its own set of references and further readings.\n\n- Exercises. For those wishing additional practice or to use this text as a teaching resource (which we strongly encourage!), we provide exercises that can be assigned for each chapter.\n\nThroughout the book, we will demonstrate with examples and build models using a selection of text data sets. 
A description of these data sets can be found in [Appendix -@sec-appendix-installing].\n\n### Conventions {.unnumbered}\n\nThroughout the book we use several kinds of info boxes to call your attention to information, cautions, and warnings.\n\n::: callout-note\nThe information icon signals a note or a reference to further information.\n:::\n\n::: callout-tip\nTips provide suggestions for better ways of doing things.\n:::\n\n::: callout-important\nThe exclamation mark icon signals an important point to consider.\n:::\n\n::: callout-warning\nWarning icons flag things you definitely want to avoid.\n:::\n\nAs you may already have noticed, we put names of R packages in **boldface**.\n\nCode blocks will be self-evident, and will look like this, with the output produced from executing those commands shown below the highlighted code in a mono-spaced font.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"quanteda\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nPackage version: 4.0.0\nUnicode version: 14.0\nICU version: 71.1\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nParallel computing: 12 of 12 threads used.\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nSee https://quanteda.io for tutorials and examples.\n```\n:::\n\n```{.r .cell-code}\ndata_corpus_inaugural[1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCorpus consisting of 3 documents and 4 docvars.\n1789-Washington :\n\"Fellow-Citizens of the Senate and of the House of Representa...\"\n\n1793-Washington :\n\"Fellow citizens, I am again called upon by the voice of my c...\"\n\n1797-Adams :\n\"When it was first perceived, in early times, that no middle ...\"\n```\n:::\n:::\n\n\n\nWe love the pipe operator in R (when used with discipline!) and built **quanteda** with the aim of making all of the main functions easily and logically pipeable. Since version 1, we have re-exported the `%>%` operator from **magrittr** to make it available out of the box with **quanteda**. With the introduction of the `|>` pipe in R 4.1.0, however, we prefer to use this variant, and will use it in all code in this book.\n\n### Data used in examples {.unnumbered}\n\nAll examples and code are bundled as a companion R package to the book, available from our [public GitHub repository](https://github.com/quanteda/TAUR/).\n\n::: callout-tip\nWe have written a companion package for this book called **TAUR**, which can be installed from GitHub using this command:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nremotes::install_github(\"quanteda/TAUR\")\n```\n:::\n\n\n:::\n\nWe largely rely on data from three sources:\n\n- the built-in objects from the **quanteda** package, such as the US Presidential Inaugural speech corpus;\\\n- added corpora from the book's companion package, **TAUR**; and\n- some additional **quanteda** corpora or dictionaries from other sources or packages, where indicated.\n\nAlthough such consistency is not common practice in R packages, our scheme for naming data objects follows a very consistent pattern. Data object names begin with `data`, have the object class as the second part of the name, such as `corpus`, and end with a description as the third and final part. The three elements are separated by the underscore (`_`) character. This means that any object is known by its name to be data, so that it shows up in the index of package objects (from the all-important help page, e.g. from `help(package = \"quanteda\")`) in one location, under \"d\". 
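For instance, you can see the whole collection at the console; here is a quick illustration (assuming **quanteda** is attached) that lists every bundled data object by matching the `data_` prefix:\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"quanteda\")\n# list all objects in the attached package whose names begin with data_\nls(\"package:quanteda\", pattern = \"^data_\")\n```\n:::\n\n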
It also means that its object class is known from the name, without further inspection. So `data_dfm_lbgexample` is a dfm, while `data_corpus_inaugural` is clearly a corpus. We use this scheme and others like it with an almost religious fervour, because we think that learning the functionality of a programming framework for NLP and quantitative text analysis is complicated enough without also having to decipher or remember a mishmash of haphazard and inconsistently named functions and objects. The more you use our software packages and specifically **quanteda**, the more you will come to appreciate the attention we have paid to implementing a consistent naming scheme for objects, functions, and their arguments as well as to their consistent functionality.\n\n## Colophon {.unnumbered}\n\nThis book was written in [RStudio](https://www.rstudio.com/ide/) using [**Quarto**](https://quarto.org). The [website](https://quanteda.github.io/Text-Analysis-Using-R/) is hosted via [GitHub Pages](https://pages.github.com), and the complete source is available on [GitHub](https://github.com/quanteda/Text-Analysis-Using-R).\n\nThis version of the book was built with R version 4.3.2 (2023-10-31) and the following packages:\n\n\n\n::: {.cell-output-display}\n\\begin{longtable}{lrl}\n\\toprule\nPackage & Version & Source \\\\ \n\\midrule\\addlinespace[2.5pt]\nquanteda & 4.0.0 & local \\\\ \nquanteda.textmodels & 0.9.6 & CRAN (R 4.3.0) \\\\ \nquanteda.textplots & 0.94.3 & CRAN (R 4.3.0) \\\\ \nquanteda.textstats & 0.96.5 & local \\\\ \nreadtext & 0.90 & CRAN (R 4.3.0) \\\\ \nstopwords & 2.3 & CRAN (R 4.3.0) \\\\ \ntidyverse & 2.0.0 & CRAN (R 4.3.0) \\\\ \n\\bottomrule\n\\end{longtable}\n:::\n\n\n\n### How to Contact Us {.unnumbered}\n\nPlease address comments and questions concerning this book by filing an issue on our GitHub page, https://github.com/quanteda/Text-Analysis-Using-R. At this repository, you will also find instructions for installing the companion R package, [**TAUR**](https://github.com/quanteda/TAUR).\n\nFor more information about the authors or the Quanteda Initiative, visit our [website](https://quanteda.org).\n\n## Acknowledgements {.unnumbered}\n\nDeveloping **quanteda** has been a labour of many years involving many contributors. The most notable contributor to the package and its learning materials is Kohei Watanabe, without whom **quanteda** would not be the incredible package it is today. Kohei has provided a clear and vigorous vision for the package's evolution across its major versions and continues to maintain it today. Others have contributed in major ways both through design and programming, namely Paul Nulty and Akitaka Matsuo, both of whom were present at the creation and through the 1.0 launch, which involved the first of many major redesigns. Adam Obeng made a brief but indelible imprint on several of the packages in the **quanteda** family. Haiyan Wang wrote versions of some of the core C++ code that makes **quanteda** so fast, as well as contributing to the R code base. William Lowe and Christian Müller also contributed code and ideas that have enriched the package and the methods it implements.\n\nNo project this large could have existed without institutional and financial benefactors and supporters. Most notable of these is the European Research Council, which funded the original development under its Starting Investigator Grant scheme, awarded to Kenneth Benoit in 2011 for a project entitled QUANTESS: Quantitative Text Analysis for the Social Sciences (ERC-2011-StG 283794-QUANTESS). 
Our day jobs (as university professors) have also provided invaluable material and intellectual support for the development of the ideas, methodologies, and software documented in this book. This list includes the London School of Economics and Political Science, which employs Ken Benoit, hosted the QUANTESS grant, and employed many of the core **quanteda** team and/or is where they completed PhDs; University College Dublin, where Stefan Müller is employed; Trinity College Dublin, where Ken Benoit was formerly a Professor of Quantitative Social Sciences and where Stefan completed his PhD in political science; and the Australian National University, which provided a generous affiliation to Ken and where the outline for this book first took shape as a giant tableau of post-it notes on the wall of a visiting office in the former building of the School of Politics and International Relations.\n\nAs we develop this book, we hope that many readers of the work-in-progress will contribute feedback, comments, and even edits, and we plan to acknowledge you all. So don't be shy, readers. You can suggest changes by clicking on \"Edit this page\" in the top-right corner, [forking](https://gist.github.com/Chaser324/ce0505fbed06b947d962) the [GitHub repository](https://github.com/quanteda/Text-Analysis-Using-R), and making a pull request. More details on contributing and copyright are provided in a separate page on [Contributing to Text Analysis Using R](https://github.com/quanteda/Text-Analysis-Using-R/blob/main/contributing.md).\n\nWe also have to give a much-deserved shout-out to the amazing team at RStudio, many of whom we've been privileged to hear speak at events or meet in person. You made Quarto available at just the right time. That and the innovations in R tools and software that you have driven for the past decade have been rich beyond every expectation.\n\nFinally, no acknowledgements would be complete without a profound thanks to our partners, who have put up with us during the long incubation of this project. They have had to listen to us talk about this book for years before we finally got around to writing it. They've tolerated us cursing software bugs, students, CRAN, each other, package users, and ourselves. But they've also seen the joy that we've experienced from creating tools and materials that empower users and students, and the excitement of the long intellectual voyage we have taken together and with our ever-growing base of users and students. Bina and Émeline, thank you for all of your support and encouragement. You'll be so happy to know we finally have this book project well underway.\n",
5 | "supporting": [
6 | "00-preface_files"
7 | ],
8 | "filters": [
9 | "rmarkdown/pagebreak.lua"
10 | ],
11 | "includes": {},
12 | "engineDependencies": {
13 | "knitr": [
14 | "{\"type\":\"list\",\"attributes\":{\"knit_meta_id\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"unnamed-chunk-3\",\"unnamed-chunk-3\",\"unnamed-chunk-3\",\"unnamed-chunk-3\",\"unnamed-chunk-3\"]}},\"value\":[{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"booktabs\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"caption\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"longtable\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"colortbl\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]},{\"type\":\"list\",\"attributes\":{\"names\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"name\",\"options\",\"extra_lines\"]},\"class\":{\"type\":\"character\",\"attributes\":{},\"value\":[\"latex_dependency\"]}},\"value\":[{\"type\":\"character\",\"attributes\":{},\"value\":[\"array\"]},{\"type\":\"NULL\"},{\"type\":\"NULL\"}]}]}"
15 | ]
16 | },
17 | "preserve": null,
18 | "postProcess": false
19 | }
20 | }
--------------------------------------------------------------------------------
/_freeze/01-intro/figure-html/keynesspres-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/quanteda/Text-Analysis-Using-R/723c7dfba17d176bfa108d40f2cabe5f2a4c4fdd/_freeze/01-intro/figure-html/keynesspres-1.png
--------------------------------------------------------------------------------
/_freeze/01-intro/figure-html/unnamed-chunk-14-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/quanteda/Text-Analysis-Using-R/723c7dfba17d176bfa108d40f2cabe5f2a4c4fdd/_freeze/01-intro/figure-html/unnamed-chunk-14-1.png
--------------------------------------------------------------------------------
/_freeze/01-intro/figure-pdf/keynesspres-1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/quanteda/Text-Analysis-Using-R/723c7dfba17d176bfa108d40f2cabe5f2a4c4fdd/_freeze/01-intro/figure-pdf/keynesspres-1.pdf
--------------------------------------------------------------------------------
/_freeze/01-intro/figure-pdf/unnamed-chunk-14-1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/quanteda/Text-Analysis-Using-R/723c7dfba17d176bfa108d40f2cabe5f2a4c4fdd/_freeze/01-intro/figure-pdf/unnamed-chunk-14-1.pdf
--------------------------------------------------------------------------------
/_freeze/29-appendix-installation/execute-results/html.json:
--------------------------------------------------------------------------------
1 | {
2 | "hash": "bc8e83895b8e28744964dcda23102211",
3 | "result": {
4 | "markdown": "# Installing the Required Tools {#sec-appendix-installing}\n\n## Objectives\n\n- Installing R\n- Installing RStudio\n- Installing **quanteda**\n- Installing **spacy**\n- Installing companion package(s)\n- Keeping up to date\n- Troubleshooting problems\n\n## Installing R\n\nR is a free software environment for statistical computing that runs on numerous platforms, including Windows, macOS, Linux, and Solaris. You can find details at https://www.r-project.org/, and link there to a set of mirror websites for downloading the latest version.\n\nWe recommend that you always using the latest version of R, which is what we used for compiling this book. There are seldom reasons to use older versions of R, and the R Core Team and the maintainers of the largest repository of R packages, CRAN (for Comprehensive R Archive Network) put an enormous amount of attention and energy into assuring that extension packages work with stably and with one another.\n\nYou verify which version of R you are using by either viewing the messages on startup, e.g.\n\n R version 4.2.1 (2022-06-23) -- \"Funny-Looking Kid\"\n Copyright (C) 2022 The R Foundation for Statistical Computing\n Platform: x86_64-apple-darwin17.0 (64-bit)\n\n R is free software and comes with ABSOLUTELY NO WARRANTY.\n You are welcome to redistribute it under certain conditions.\n Type 'license()' or 'licence()' for distribution details.\n\n Natural language support but running in an English locale\n\n R is a collaborative project with many contributors.\n Type 'contributors()' for more information and\n 'citation()' on how to cite R or R packages in publications.\n\n Type 'demo()' for some demos, 'help()' for on-line help, or\n 'help.start()' for an HTML browser interface to help.\n Type 'q()' to quit R.\n\nor by calling\n\n\n::: {.cell}\n\n```{.r .cell-code}\nR.Version()$version.string\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"R version 4.2.1 (2022-06-23)\"\n```\n:::\n:::\n\n\n## Installing recommended tools and packages\n\n### RStudio\n\nThere are numerous ways of running R, including from the \"command line\" (\"Terminal\" on Mac or Linux, or \"Command Prompt\" on Windows) or from the R Gui console that comes with most installations of R. But our very strong recommendation is to us the outstanding [RStudio Dekstop](https://www.rstudio.com/products/rstudio/). \"IDE\" stands for *integrated development environment*, and provides a combination of file manager, package manager, help viewer, graphics viewer, environment browser, editor, and much more. Before this can be used, however, you must have installed R.\n\nRStudio can be installed from https://www.rstudio.com/products/rstudio/download/.\n\n### Additional packages\n\nThe main package you will need for the examples in this book is **quanteda**, which provides a framework for the quantitative analysis of textual data. When you install this package, by default it will also install any required packages that **quanteda** depends on to function (such as **stringi** that it uses for many essential string handling operations). Because \"dependencies\" are installed into your local library, these additional packages are also available for you to use independently. 
You only need to install them once (although they might need updating, see below).\n\nWe also suggest you install:\n\n- **readtext** - for reading in documents that contain text, and converting them automatically to plain text;\n\n- **spacyr** - an R wrapper to the Python package [spaCy](https://spacy.io) for natural language processing, including part-of-speech tagging and entity extraction. While we have tried hard to make this automatic, installing and configuring **spacyr** actually involves some major work under the hood, such as installing a version of Python in a self-contained virtual environment and then installing spaCy and one of its language models.\n\n### Keeping packages up-to-date\n\nR has an easy method for ensuring you have the latest versions of installed packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nupdate.packages()\n```\n:::\n\n\n## Additional Issues\n\n### Installing development versions of packages\n\nThe R packages that you can install using the methods described above are the pre-compiled binary versions that are distributed on CRAN. (Linux installations are the exception, as these are always compiled upon installation.) Sometimes, package developers will publish \"development\" versions of their packages that have yet to be published on CRAN, for instance on the popular [GitHub](https://github.com) platform hosting the world's largest collection of open-source software.\n\nThe **quanteda** package, for instance, is hosted on GitHub at https://github.com/quanteda/quanteda, where its development version tends to be slightly ahead of the CRAN version. If you are feeling adventurous, or need a new version in which a specific issue or bug has been fixed, you can install it from the GitHub source using:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::install_github(\"quanteda/quanteda\") \n```\n:::\n\n\nBecause installing from GitHub is the same as installing from the source code, this also involves compiling the C++ and Fortran source code that makes parts of **quanteda** so fast. For this source installation to work, you will need to have installed the appropriate compilers.\n\nIf you are using a Windows platform, this means you will also need to install the [Rtools](https://CRAN.R-project.org/bin/windows/Rtools/) software available from CRAN.\n\nIf you are using macOS, you should install the [macOS tools](https://cran.r-project.org/bin/macosx/tools/), namely the Clang 6.x compiler and the GNU Fortran compiler (as **quanteda** requires gfortran to build).[^29-appendix-installation-1]\\\nLinux always compiles packages containing C++ code upon installation, so if you are using that platform then you are unlikely to need to install any additional components in order to install a development version.\n\n[^29-appendix-installation-1]: If you are still getting errors related to gfortran, follow the fixes [here](https://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks--lgfortran-and--lquadmath-error/).\n\n### Troubleshooting\n\nMost problems come from not having the latest versions of packages installed, so make sure you have updated them using the instructions above.\n\nOther problems include:\n\n- Lack of permissions to install packages. This might affect Windows users of work laptops, whose workplace prevents user modification of any software.\n- Lack of internet access, or access being restricted by a firewall or proxy server.\n\n## Further Reading\n\nHadley Wickham's excellent book [R Packages](http://r-pkgs.had.co.nz/) is well worth reading.\n\n- Wickham, Hadley. (2015). *R packages: organize, test, document, and share your code.* O'Reilly Media, Inc.\n",
5 | "supporting": [],
6 | "filters": [
7 | "rmarkdown/pagebreak.lua"
8 | ],
9 | "includes": {},
10 | "engineDependencies": {},
11 | "preserve": {},
12 | "postProcess": true
13 | }
14 | }
--------------------------------------------------------------------------------
/_freeze/29-appendix-installation/execute-results/tex.json:
--------------------------------------------------------------------------------
1 | {
2 | "hash": "bc8e83895b8e28744964dcda23102211",
3 | "result": {
4 | "markdown": "# Installing the Required Tools {#sec-appendix-installing}\n\n## Objectives\n\n- Installing R\n- Installing RStudio\n- Installing **quanteda**\n- Installing **spacy**\n- Installing companion package(s)\n- Keeping up to date\n- Troubleshooting problems\n\n## Installing R\n\nR is a free software environment for statistical computing that runs on numerous platforms, including Windows, macOS, Linux, and Solaris. You can find details at https://www.r-project.org/, and link there to a set of mirror websites for downloading the latest version.\n\nWe recommend that you always using the latest version of R, which is what we used for compiling this book. There are seldom reasons to use older versions of R, and the R Core Team and the maintainers of the largest repository of R packages, CRAN (for Comprehensive R Archive Network) put an enormous amount of attention and energy into assuring that extension packages work with stably and with one another.\n\nYou verify which version of R you are using by either viewing the messages on startup, e.g.\n\n R version 4.2.1 (2022-06-23) -- \"Funny-Looking Kid\"\n Copyright (C) 2022 The R Foundation for Statistical Computing\n Platform: x86_64-apple-darwin17.0 (64-bit)\n\n R is free software and comes with ABSOLUTELY NO WARRANTY.\n You are welcome to redistribute it under certain conditions.\n Type 'license()' or 'licence()' for distribution details.\n\n Natural language support but running in an English locale\n\n R is a collaborative project with many contributors.\n Type 'contributors()' for more information and\n 'citation()' on how to cite R or R packages in publications.\n\n Type 'demo()' for some demos, 'help()' for on-line help, or\n 'help.start()' for an HTML browser interface to help.\n Type 'q()' to quit R.\n\nor by calling\n\n\n::: {.cell}\n\n```{.r .cell-code}\nR.Version()$version.string\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"R version 4.2.1 (2022-06-23)\"\n```\n:::\n:::\n\n\n## Installing recommended tools and packages\n\n### RStudio\n\nThere are numerous ways of running R, including from the \"command line\" (\"Terminal\" on Mac or Linux, or \"Command Prompt\" on Windows) or from the R Gui console that comes with most installations of R. But our very strong recommendation is to us the outstanding [RStudio Dekstop](https://www.rstudio.com/products/rstudio/). \"IDE\" stands for *integrated development environment*, and provides a combination of file manager, package manager, help viewer, graphics viewer, environment browser, editor, and much more. Before this can be used, however, you must have installed R.\n\nRStudio can be installed from https://www.rstudio.com/products/rstudio/download/.\n\n### Additional packages\n\nThe main package you will need for the examples in this book is **quanteda**, which provides a framework for the quantitative analysis of textual data. When you install this package, by default it will also install any required packages that **quanteda** depends on to function (such as **stringi** that it uses for many essential string handling operations). Because \"dependencies\" are installed into your local library, these additional packages are also available for you to use independently. 
You only need to install them once (although they might need updating, see below).\n\nWe also suggest you install:\n\n- **readtext** - for reading in documents that contain text, and converting them automatically to plain text;\n\n- **spacyr** - an R wrapper to the Python package [spaCy](https://spacy.io) for natural language processing, including part-of-speech tagging and entity extraction. While we have tried hard to make this automatic, installing and configuring **spacyr** actually involves some major work under the hood, such as installing a version of Python in a self-contained virtual environment and then installing spaCy and one of its language models.\n\n### Keeping packages up-to-date\n\nR has an easy method for ensuring you have the latest versions of installed packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nupdate.packages()\n```\n:::\n\n\n## Additional Issues\n\n### Installing development versions of packages\n\nThe R packages that you can install using the methods described above are the pre-compiled binary versions that are distributed on CRAN. (Linux installations are the exception, as these are always compiled upon installation.) Sometimes, package developers will publish \"development\" versions of their packages that have yet to be published on CRAN, for instance on the popular [GitHub](https://github.com) platform hosting the world's largest collection of open-source software.\n\nThe **quanteda** package, for instance, is hosted on GitHub at https://github.com/quanteda/quanteda, where its development version tends to be slightly ahead of the CRAN version. If you are feeling adventurous, or need a new version in which a specific issue or bug has been fixed, you can install it from the GitHub source using:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::install_github(\"quanteda/quanteda\") \n```\n:::\n\n\nBecause installing from GitHub is the same as installing from the source code, this also involves compiling the C++ and Fortran source code that makes parts of **quanteda** so fast. For this source installation to work, you will need to have installed the appropriate compilers.\n\nIf you are using a Windows platform, this means you will also need to install the [Rtools](https://CRAN.R-project.org/bin/windows/Rtools/) software available from CRAN.\n\nIf you are using macOS, you should install the [macOS tools](https://cran.r-project.org/bin/macosx/tools/), namely the Clang 6.x compiler and the GNU Fortran compiler (as **quanteda** requires gfortran to build).[^29-appendix-installation-1]\\\nLinux always compiles packages containing C++ code upon installation, so if you are using that platform then you are unlikely to need to install any additional components in order to install a development version.\n\n[^29-appendix-installation-1]: If you are still getting errors related to gfortran, follow the fixes [here](https://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks--lgfortran-and--lquadmath-error/).\n\n### Troubleshooting\n\nMost problems come from not having the latest versions of packages installed, so make sure you have updated them using the instructions above.\n\nOther problems include:\n\n- Lack of permissions to install packages. This might affect Windows users of work laptops, whose workplace prevents user modification of any software.\n- Lack of internet access, or access being restricted by a firewall or proxy server.\n\n## Further Reading\n\nHadley Wickham's excellent book [R Packages](http://r-pkgs.had.co.nz/) is well worth reading.\n\n- Wickham, Hadley. (2015). *R packages: organize, test, document, and share your code.* O'Reilly Media, Inc.\n",
5 | "supporting": [
6 | "29-appendix-installation_files"
7 | ],
8 | "filters": [
9 | "rmarkdown/pagebreak.lua"
10 | ],
11 | "includes": {},
12 | "engineDependencies": {
13 | "knitr": [
14 | "{\"type\":\"list\",\"attributes\":{},\"value\":[]}"
15 | ]
16 | },
17 | "preserve": null,
18 | "postProcess": false
19 | }
20 | }
--------------------------------------------------------------------------------
/_freeze/30-appendix-installation/execute-results/html.json:
--------------------------------------------------------------------------------
1 | {
2 | "hash": "bc8e83895b8e28744964dcda23102211",
3 | "result": {
4 | "markdown": "# Installing the Required Tools {#sec-appendix-installing}\n\n## Objectives\n\n- Installing R\n- Installing RStudio\n- Installing **quanteda**\n- Installing **spacy**\n- Installing companion package(s)\n- Keeping up to date\n- Troubleshooting problems\n\n## Installing R\n\nR is a free software environment for statistical computing that runs on numerous platforms, including Windows, macOS, Linux, and Solaris. You can find details at https://www.r-project.org/, and link there to a set of mirror websites for downloading the latest version.\n\nWe recommend that you always using the latest version of R, which is what we used for compiling this book. There are seldom reasons to use older versions of R, and the R Core Team and the maintainers of the largest repository of R packages, CRAN (for Comprehensive R Archive Network) put an enormous amount of attention and energy into assuring that extension packages work with stably and with one another.\n\nYou verify which version of R you are using by either viewing the messages on startup, e.g.\n\n R version 4.2.1 (2022-06-23) -- \"Funny-Looking Kid\"\n Copyright (C) 2022 The R Foundation for Statistical Computing\n Platform: x86_64-apple-darwin17.0 (64-bit)\n\n R is free software and comes with ABSOLUTELY NO WARRANTY.\n You are welcome to redistribute it under certain conditions.\n Type 'license()' or 'licence()' for distribution details.\n\n Natural language support but running in an English locale\n\n R is a collaborative project with many contributors.\n Type 'contributors()' for more information and\n 'citation()' on how to cite R or R packages in publications.\n\n Type 'demo()' for some demos, 'help()' for on-line help, or\n 'help.start()' for an HTML browser interface to help.\n Type 'q()' to quit R.\n\nor by calling\n\n\n::: {.cell}\n\n```{.r .cell-code}\nR.Version()$version.string\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"R version 4.3.2 (2023-10-31)\"\n```\n:::\n:::\n\n\n## Installing recommended tools and packages\n\n### RStudio\n\nThere are numerous ways of running R, including from the \"command line\" (\"Terminal\" on Mac or Linux, or \"Command Prompt\" on Windows) or from the R Gui console that comes with most installations of R. But our very strong recommendation is to us the outstanding [RStudio Dekstop](https://www.rstudio.com/products/rstudio/). \"IDE\" stands for *integrated development environment*, and provides a combination of file manager, package manager, help viewer, graphics viewer, environment browser, editor, and much more. Before this can be used, however, you must have installed R.\n\nRStudio can be installed from https://www.rstudio.com/products/rstudio/download/.\n\n### Additional packages\n\nThe main package you will need for the examples in this book is **quanteda**, which provides a framework for the quantitative analysis of textual data. When you install this package, by default it will also install any required packages that **quanteda** depends on to function (such as **stringi** that it uses for many essential string handling operations). Because \"dependencies\" are installed into your local library, these additional packages are also available for you to use independently. 
You only need to install them once (although they might need updating, see below).\n\nWe also suggest you install:\n\n- **readtext** - for reading in documents that contain text, and converting them automatically to plain text;\n\n- **spacyr** - an R wrapper to the Python package [spaCy](https://spacy.io) for natural language processing, including part-of-speech tagging and entity extraction. While we have tried hard to make this automatic, installing and configuring **spacyr** actually involves some major work under the hood, such as installing a version of Python in a self-contained virtual environment and then installing spaCy and one of its language models.\n\n### Keeping packages up-to-date\n\nR has an easy method for ensuring you have the latest versions of installed packages:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nupdate.packages()\n```\n:::\n\n\n## Additional Issues\n\n### Installing development versions of packages\n\nThe R packages that you can install using the methods described above are the pre-compiled binary versions that are distributed on CRAN. (Linux installations are the exception, as these are always compiled upon installation.) Sometimes, package developers will publish \"development\" versions of their packages that have yet to be published on CRAN, for instance on the popular [GitHub](https://github.com) platform hosting the world's largest collection of open-source software.\n\nThe **quanteda** package, for instance, is hosted on GitHub at https://github.com/quanteda/quanteda, where its development version tends to be slightly ahead of the CRAN version. If you are feeling adventurous, or need a new version in which a specific issue or bug has been fixed, you can install it from the GitHub source using:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::install_github(\"quanteda/quanteda\") \n```\n:::\n\n\nBecause installing from GitHub is the same as installing from the source code, this also involves compiling the C++ and Fortran source code that makes parts of **quanteda** so fast. For this source installation to work, you will need to have installed the appropriate compilers.\n\nIf you are using a Windows platform, this means you will also need to install the [Rtools](https://CRAN.R-project.org/bin/windows/Rtools/) software available from CRAN.\n\nIf you are using macOS, you should install the [macOS tools](https://cran.r-project.org/bin/macosx/tools/), namely the Clang 6.x compiler and the GNU Fortran compiler (as **quanteda** requires gfortran to build).[^29-appendix-installation-1]\\\nLinux always compiles packages containing C++ code upon installation, so if you are using that platform then you are unlikely to need to install any additional components in order to install a development version.\n\n[^29-appendix-installation-1]: If you are still getting errors related to gfortran, follow the fixes [here](https://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks--lgfortran-and--lquadmath-error/).\n\n### Troubleshooting\n\nMost problems come from not having the latest versions of packages installed, so make sure you have updated them using the instructions above.\n\nOther problems include:\n\n- Lack of permissions to install packages. This might affect Windows users of work laptops, whose workplace prevents user modification of any software.\n- Lack of internet access, or access being restricted by a firewall or proxy server.\n\n## Further Reading\n\nHadley Wickham's excellent book [R Packages](http://r-pkgs.had.co.nz/) is well worth reading.\n\n- Wickham, Hadley. (2015). *R packages: organize, test, document, and share your code.* O'Reilly Media, Inc.\n",
5 | "supporting": [],
6 | "filters": [
7 | "rmarkdown/pagebreak.lua"
8 | ],
9 | "includes": {},
10 | "engineDependencies": {},
11 | "preserve": {},
12 | "postProcess": true
13 | }
14 | }
--------------------------------------------------------------------------------
/_freeze/30-appendix-installation/execute-results/tex.json:
--------------------------------------------------------------------------------
1 | {
2 | "hash": "bc8e83895b8e28744964dcda23102211",
3 | "result": {
4 | "markdown": "# Installing the Required Tools {#sec-appendix-installing}\n\n## Objectives\n\n- Installing R\n- Installing RStudio\n- Installing **quanteda**\n- Installing **spacy**\n- Installing companion package(s)\n- Keeping up to date\n- Troubleshooting problems\n\n## Installing R\n\nR is a free software environment for statistical computing that runs on numerous platforms, including Windows, macOS, Linux, and Solaris. You can find details at https://www.r-project.org/, and link there to a set of mirror websites for downloading the latest version.\n\nWe recommend that you always using the latest version of R, which is what we used for compiling this book. There are seldom reasons to use older versions of R, and the R Core Team and the maintainers of the largest repository of R packages, CRAN (for Comprehensive R Archive Network) put an enormous amount of attention and energy into assuring that extension packages work with stably and with one another.\n\nYou verify which version of R you are using by either viewing the messages on startup, e.g.\n\n R version 4.2.1 (2022-06-23) -- \"Funny-Looking Kid\"\n Copyright (C) 2022 The R Foundation for Statistical Computing\n Platform: x86_64-apple-darwin17.0 (64-bit)\n\n R is free software and comes with ABSOLUTELY NO WARRANTY.\n You are welcome to redistribute it under certain conditions.\n Type 'license()' or 'licence()' for distribution details.\n\n Natural language support but running in an English locale\n\n R is a collaborative project with many contributors.\n Type 'contributors()' for more information and\n 'citation()' on how to cite R or R packages in publications.\n\n Type 'demo()' for some demos, 'help()' for on-line help, or\n 'help.start()' for an HTML browser interface to help.\n Type 'q()' to quit R.\n\nor by calling\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nR.Version()$version.string\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"R version 4.3.2 (2023-10-31)\"\n```\n:::\n:::\n\n\n\n## Installing recommended tools and packages\n\n### RStudio\n\nThere are numerous ways of running R, including from the \"command line\" (\"Terminal\" on Mac or Linux, or \"Command Prompt\" on Windows) or from the R Gui console that comes with most installations of R. But our very strong recommendation is to us the outstanding [RStudio Dekstop](https://www.rstudio.com/products/rstudio/). \"IDE\" stands for *integrated development environment*, and provides a combination of file manager, package manager, help viewer, graphics viewer, environment browser, editor, and much more. Before this can be used, however, you must have installed R.\n\nRStudio can be installed from https://www.rstudio.com/products/rstudio/download/.\n\n### Additional packages\n\nThe main package you will need for the examples in this book is **quanteda**, which provides a framework for the quantitative analysis of textual data. When you install this package, by default it will also install any required packages that **quanteda** depends on to function (such as **stringi** that it uses for many essential string handling operations). Because \"dependencies\" are installed into your local library, these additional packages are also available for you to use independently. 
You only need to install them once (although they might need updating, see below).\n\nWe also suggest you install:\n\n- **readtext** - for reading in documents that contain text, and converting them automatically to plain text;\n\n- **spacyr** - an R wrapper to the Python package [spaCy](https://spacy.io) for natural language processing, including part-of-speech tagging and entity extraction. While we have tried hard to make this automatic, installing and configuring **spacyr** actually involves some major work under the hood, such as installing a version of Python in a self-contained virtual environment and then installing spaCy and one of its language models.\n\n### Keeping packages up-to-date\n\nR has an easy method for ensuring you have the latest versions of installed packages:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nupdate.packages()\n```\n:::\n\n\n\n## Additional Issues\n\n### Installing development versions of packages\n\nThe R packages that you can install using the methods described above are the pre-compiled binary versions that are distributed on CRAN. (Linux installations are the exception, as these are always compiled upon installation.) Sometimes, package developers will publish \"development\" versions of their packages that have yet to be published on CRAN, for instance on the popular [GitHub](https://github.com) platform hosting the world's largest collection of open-source software.\n\nThe **quanteda** package, for instance, is hosted on GitHub at https://github.com/quanteda/quanteda, where its development version tends to be slightly ahead of the CRAN version. If you are feeling adventurous, or need a new version in which a specific issue or bug has been fixed, you can install it from the GitHub source using:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::install_github(\"quanteda/quanteda\") \n```\n:::\n\n\n\nBecause installing from GitHub is the same as installing from the source code, this also involves compiling the C++ and Fortran source code that makes parts of **quanteda** so fast. For this source installation to work, you will need to have installed the appropriate compilers.\n\nIf you are using a Windows platform, this means you will also need to install the [Rtools](https://CRAN.R-project.org/bin/windows/Rtools/) software available from CRAN.\n\nIf you are using macOS, you should install the [macOS tools](https://cran.r-project.org/bin/macosx/tools/), namely the Clang 6.x compiler and the GNU Fortran compiler (as **quanteda** requires gfortran to build).[^29-appendix-installation-1]\\\nLinux always compiles packages containing C++ code upon installation, so if you are using that platform then you are unlikely to need to install any additional components in order to install a development version.\n\n[^29-appendix-installation-1]: If you are still getting errors related to gfortran, follow the fixes [here](https://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks--lgfortran-and--lquadmath-error/).\n\n### Troubleshooting\n\nMost problems come from not having the latest versions of packages installed, so make sure you have updated them using the instructions above.\n\nOther problems include:\n\n- Lack of permissions to install packages. This might affect Windows users of work laptops, whose workplace prevents user modification of any software.\n- Lack of internet access, or access being restricted by a firewall or proxy server.\n\n## Further Reading\n\nHadley Wickham's excellent book [R Packages](http://r-pkgs.had.co.nz/) is well worth reading.\n\n- Wickham, Hadley. (2015). *R packages: organize, test, document, and share your code.* O'Reilly Media, Inc.\n",
5 | "supporting": [
6 | "30-appendix-installation_files"
7 | ],
8 | "filters": [
9 | "rmarkdown/pagebreak.lua"
10 | ],
11 | "includes": {},
12 | "engineDependencies": {
13 | "knitr": [
14 | "{\"type\":\"list\",\"attributes\":{},\"value\":[]}"
15 | ]
16 | },
17 | "preserve": null,
18 | "postProcess": false
19 | }
20 | }
--------------------------------------------------------------------------------
/_freeze/index/execute-results/html.json:
--------------------------------------------------------------------------------
1 | {
2 | "hash": "3f46ef8c4d4a00a65867323f0c68dd41",
3 | "result": {
4 | "markdown": "---\ndescription-meta: |\n A practical guide to quantitative text analysis and natural language processing using R, with a focus on the quanteda family of packages.\nbiblio-style: apalike\nlink-citations: true\nlinks-as-notes: true\nbibliography: [TAUR.bib]\n---\n\n\n## Welcome {.unnumbered}\n\n::: {.content-visible when-format=\"html\"}\n
\n:::\n\n\n\\pagenumbering{roman}\n\nThis is the draft version of *Text Analysis Using R*. This book offers a comprehensive practical guide to text analysis and natural language processing using the R language. We have pitched the book at those already familiar with some R, but we also provide a gentle enough introduction that it is suitable for newcomers to R.\n\nYou'll learn how to prepare your texts for analysis, how to analyse texts for insight using statistical methods and machine learning, and how to present those results using graphical methods. Each chapter covers a distinct topic, first presenting the methodology underlying each topic, and then providing practical examples using R. We also discuss advanced issues facing each method and its application. Finally, for those engaged in self-learning or wishing to use this book for instruction, we provide practical exercises for each chapter.\n\n\n\n\n\nThe book is organised into parts, starting with a review of R and especially the R packages and functions relevant to text analysis. If you are already comfortable with R you can skim or skip that section and proceed straight to part two.\n\n::: callout-note\nThis book is a work in progress. We will publish chapters as we write them, and open up the GitHub source repository for readers to make comments or even suggest corrections. You can view this at this .\n:::\n\n## License\n\nWe will eventually seek a publisher for this book, but want to write it first. In the meantime we are retaining full copyright and licensing it only to be free to read.\n\n© Kenneth Benoit and Stefan Müller all rights reserved.\n",
5 | "supporting": [
6 | "index_files"
7 | ],
8 | "filters": [
9 | "rmarkdown/pagebreak.lua"
10 | ],
11 | "includes": {},
12 | "engineDependencies": {},
13 | "preserve": {},
14 | "postProcess": true
15 | }
16 | }
--------------------------------------------------------------------------------
/_freeze/index/execute-results/tex.json:
--------------------------------------------------------------------------------
1 | {
2 | "hash": "3f46ef8c4d4a00a65867323f0c68dd41",
3 | "result": {
4 | "markdown": "---\ndescription-meta: |\n A practical guide to quantitative text analysis and natural language processing using R, with a focus on the quanteda family of packages.\nbiblio-style: apalike\nlink-citations: true\nlinks-as-notes: true\nbibliography: [TAUR.bib]\n---\n\n\n\n## Welcome {.unnumbered}\n\n::: {.content-visible when-format=\"html\"}\n
\n:::\n\n\n\\pagenumbering{roman}\n\nThis is the draft version of *Text Analysis Using R*. This book offers a comprehensive practical guide to text analysis and natural language processing using the R language. We have pitched the book at those already familiar with some R, but we also provide a gentle enough introduction that it is suitable for newcomers to R.\n\nYou'll learn how to prepare your texts for analysis, how to analyse texts for insight using statistical methods and machine learning, and how to present those results using graphical methods. Each chapter covers a distinct topic, first presenting the methodology underlying each topic, and then providing practical examples using R. We also discuss advanced issues facing each method and its application. Finally, for those engaged in self-learning or wishing to use this book for instruction, we provide practical exercises for each chapter.\n\n\n\n::: {.cell}\n\n:::\n\n\n\nThe book is organised into parts, starting with a review of R and especially the R packages and functions relevant to text analysis. If you are already comfortable with R you can skim or skip that section and proceed straight to part two.\n\n::: callout-note\nThis book is a work in progress. We will publish chapters as we write them, and open up the GitHub source repository for readers to make comments or even suggest corrections. You can view this at this .\n:::\n\n## License\n\nWe will eventually seek a publisher for this book, but want to write it first. In the meantime we are retaining full copyright and licensing it only to be free to read.\n\n© Kenneth Benoit and Stefan Müller all rights reserved.\n",
5 | "supporting": [
6 | "index_files"
7 | ],
8 | "filters": [
9 | "rmarkdown/pagebreak.lua"
10 | ],
11 | "includes": {},
12 | "engineDependencies": {
13 | "knitr": [
14 | "{\"type\":\"list\",\"attributes\":{},\"value\":[]}"
15 | ]
16 | },
17 | "preserve": null,
18 | "postProcess": false
19 | }
20 | }
--------------------------------------------------------------------------------
/_freeze/preface/execute-results/html.json:
--------------------------------------------------------------------------------
1 | {
2 | "hash": "3d9c3331a82d3d5b4dc92f9ebd8481a6",
3 | "result": {
4 | "markdown": "::: hidden\n\n$$\n \\def\\quanteda{{\\bf quanteda}}\n$$\n\n:::\n\n# Preface {.unnumbered}\n\nThe uses and means for analysing text, especially using quantitative and computational approaches, have exploded in recent years across the fields of academic, industry, policy, and other forms of research and analysis. Text mining and text analysis have become the focus of a major methodological wave of innovation in the social sciences, especially political science but also extending to finance, media and communications, and sociology. In non-academic research or industry, furthermore, text mining applications are found in almost every sector, given the ubiquity of text and the need to analyse it.\n\n\"Text analysis\" is a broad label for a wide range of tools applied to *textual data*. It encompasses all manner of methods for turning unstructured raw data in the form of natural language documents into structured data that can be analysed systematically and using specific quantitative approaches. Text analytic methods include descriptive analysis, keyword analysis, topic analysis, measurement and scaling, clustering, text and vocabulary comparisons, or sentiment analysis. It may include causal analysis or predictive modelling. In their most advanced current forms, these methods extend to the natural language generation models that power the newest generation of artificial intelligence systems.\n\nThis book provides a practical guide introducing the fundamentals of text analysis methods and how to implement them using the R programming language. It covers a wide range of topics to provide a useful resource for a range of students from complete beginners to experienced users of R wanting to learn more advanced techniques. Our emphasis is on the text analytic workflow as a whole, ranging from an introduction to text manipulation in R, all the way through to advanced machine learning methods for textual data.\n\n## Why another book on text mining in R? {.unnumbered}\n\nText mining tools for R have existed for many years, led by the venerable **tm** package [@tmpackage]. In 2015, the first version of **quanteda** [@quantedaJOSS] was published on CRAN. Since then, the package has undergone three major versions, with each release improving its consistency, power, and usability. Starting with version 2, **quanteda** split into a series of packages designed around its different functions (such as plotting, statistics, or machine learning), with **quanteda** retaining the core text processing functions. Other popular alternatives exist, such as **tidytext** [@tidytextJOSS], although as we explain in [Chapter -@sec-furthertidy], **quanteda** works perfectly well with the R \"tidyverse\" and related approaches to text analysis.\n\nThis is hardly the only book covering how to work with text in R. @silge2017text introduced the **tidytext** package and numerous examples for mining text using the tidyverse approach to programming in R, including keyword analysis, mining word associations, and topic modelling. @kwartler2017text provides a good coverage of text mining workflows, plus methods for text visualisations, sentiment scoring, clustering, and classification, as well as methods for sourcing and manipulating source documents. Earlier works include @turenne2016analyse and @becue2019textual.\n\n::: {.callout-note icon=\"false\"}\n## One of the two hard things in computer science: Naming things\n\nWhere does does the package name come from? 
*quanteda* is a portmanteau name indicating the purpose of the package, which is the **qu**antitative **an**alysis of **te**xtual **da**ta.\n:::\n\nSo why another book? First, the R ecosystem for text analysis is rapidly evolving, with the publication in recent years of massively improved, specialist text analysis packages such as **quanteda**. None of the earlier books covers this amazing family of packages. In addition, the field of text analysis methodologies has also advanced rapidly, including machine learning approaches based on artificial neural networks and deep learning. These and the packages that make use of them---for instance **spacyr** [@spacyr] for harnessing the power of the spaCy natural language processing library for Python [@spaCy]---have yet to be presented in a systematic, book-length treatment. Furthermore, as we are the authors and creators of many of the packages we cover in this book, we view this as the authoritative reference and how-to guide for using these packages.\n\n## What to Expect from This Book {.unnumbered}\n\nThis book is meant to be a practical resource for those confronting the challenges of text analysis for the first time, focusing on how to do this in R. Our main focus is on the **quanteda** package and its extensions, although we also cover more general issues, including a brief overview of the R functions required to get started quickly with practical text analysis, and an introduction to the R language for those who need it.\n\n@benoit2020text provides a detailed overview of the analysis of textual data, and what distinguishes textual data from other forms of data. It also clearly articulates what is meant by treating text \"as data\" for analysis. This book and the approaches it presents are firmly geared toward this mindset.\n\nEach chapter is structured so as to provide continuity across topics. For each main subject explained in a chapter, we clearly explain the objective of the chapter, then describe the text analytic methods in an applied fashion so that readers are aware of the workings of the method. We then provide practical examples, with detailed working R code showing how to implement the method. Next, we identify any special issues involved in correctly applying the method, including how to handle the more complicated situations that may arise in practice. Finally, we provide further reading for readers wishing to learn more, and exercises for those wishing for hands-on practice, or for assigning when using the book in a teaching environment.\n\nWe have years of experience in teaching this material in many practical short courses, summer schools, and regular university courses. We have drawn extensively from this experience in designing the overall scope of this book and the structure of each chapter. Our goal is to make the book suitable for self-learning or to form the basis for teaching and learning in a course on applied text analysis. Indeed, we have partly written this book to assign when teaching text analysis in our own curricula.\n\n## Who This Book Is For {.unnumbered}\n\nWe don't assume any previous knowledge of text mining---indeed, the goal of this book is to provide that, from a foundation through to some very advanced topics. Getting value from this book does not require a pre-existing familiarity with R, although, as the slogan goes, \"every little helps\". In Part I we cover some of the basics of R and how to make use of it for text analysis specifically. 
Readers will also learn more R through our extensive examples. However, experience in teaching and presenting this material tells us that a foundation in R will enable readers to advance through the applications far more rapidly than if they were learning R from scratch at the same time that they take the plunge into the possibly strange new world of text analysis.\n\nWe are both academics, although we also have experience working in industry or in applying text analysis for non-academic purposes. The typical reader may be a student of text analysis in the literal sense (of being a student) or in the general sense of someone studying techniques in order to improve their practical and conceptual knowledge. Our orientation is that of social scientists, with a specialization in political text and political communications and media. But this book is for everyone: social scientists, computer scientists, scholars of digital humanities, researchers in marketing and management, and applied researchers working in policy, government, or business fields. This book is written to have the credibility and scholarly rigour (for referencing methods, for instance) needed by academic readers, but in a straightforward, jargon-free (as much as we were able!) manner, to be of maximum practical use to non-academic analysts as well.\n\n## How the Book is Structured {.unnumbered}\n\n### Sections {.unnumbered}\n\nThe book is divided into seven sections. These group topics that we feel represent common stages of learning in text analysis, or similar groups of topics that different users will be drawn to. By grouping stages of learning, we also make it possible for intermediate or advanced users to jump to the section that interests them most, or to the sections where they feel they need additional learning.\n\nOur sections are:\n\n- **Working in R:** This section is designed for beginners to quickly learn the R required for the techniques we cover in the book, and to guide them in learning a proper R workflow for text analysis.\n\n- **Acquiring texts:** Often described (by us at least) as the hardest problem in text analysis, we cover how to source documents, including from Internet sources, and how to import these into R as part of the quantitative text analysis pipeline.\n\n- **Managing textual data using quanteda:** In this section, we introduce the **quanteda** package, and cover each stage of textual data processing, from creating structured corpora, to tokenisation, and building matrix representations of these tokens. We also talk about how to build and manage structured lexical resources such as dictionaries and stop word lists.\n\n- **Exploring and describing texts:** How to get overviews of texts using summary statistics, explore texts using keywords-in-context, extract target words, and identify key words.\n\n- **Statistics for comparing texts:** How to characterise documents in terms of their lexical diversity, 
readability, similarity, or distance.\n\n- **Machine learning for texts:** How to apply scaling models, predictive models, and classification models to textual matrices.\n\n- **Further methods for texts:** Advanced methods, including the use of natural language models to annotate texts, extract entities, or use word embeddings; integrating **quanteda** with \"tidy\" data approaches; and how to apply text analysis to \"hard\" languages such as Chinese (hard because of the high-dimensional character set and the lack of whitespace to delimit words).\n\nFinally, in several **appendices**, we provide more detail about some tricky subjects, such as text encoding formats and working with regular expressions.\n\n### Chapter structure {.unnumbered}\n\nOur approach in each chapter is split into the following components:\n\n- Objectives. We explain the purpose of each chapter and what we believe are the most important learning outcomes.\n\n- Methods. We clearly explain the methodological elements of each chapter, through a combination of high-level explanations, formulas, and references.\n\n- Examples. We use practical examples, with R code, demonstrating how to apply the methods to realise the objectives.\n\n- Issues. We identify any special issues, potential problems, or additional approaches that a user might face when applying the methods to their text analysis problem.\n\n- Further Reading. In part because our scholarly backgrounds compel us to do so, and in part because we know that many readers will want to read more about each method, each chapter contains its own set of references and further readings.\n\n- Exercises. For those wanting additional practice or to use this text as a teaching resource (which we strongly encourage!), we provide exercises that can be assigned for each chapter.\n\nThroughout the book, we will demonstrate with examples and build models using a selection of text data sets. 
A description of these data sets can be found in [Appendix -@sec-appendix-installing].\n\n### Conventions {.unnumbered}\n\nThroughout the book we use several kinds of info boxes to call your attention to information, cautions, and warnings.\n\n::: callout-note\nThe information icon signals a note or a reference to further information.\n:::\n\n::: callout-tip\nTips provide suggestions for better ways of doing things.\n:::\n\n::: callout-important\nThe exclamation mark icon signals an important point to consider.\n:::\n\n::: callout-warning\nWarning icons flag things you definitely want to avoid.\n:::\n\nAs you may already have noticed, we put names of R packages in **boldface**.\n\nCode blocks will be self-evident, and will look like this, with the output produced from executing those commands shown below the highlighted code in a mono-spaced font.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"quanteda\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nPackage version: 3.2.2\nUnicode version: 14.0\nICU version: 70.1\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nParallel computing: 10 of 10 threads used.\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nSee https://quanteda.io for tutorials and examples.\n```\n:::\n\n```{.r .cell-code}\ndata_corpus_inaugural[1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCorpus consisting of 3 documents and 4 docvars.\n1789-Washington :\n\"Fellow-Citizens of the Senate and of the House of Representa...\"\n\n1793-Washington :\n\"Fellow citizens, I am again called upon by the voice of my c...\"\n\n1797-Adams :\n\"When it was first perceived, in early times, that no middle ...\"\n```\n:::\n:::\n\n\nWe love the pipe operator in R (when used with discipline!) and built **quanteda** with the aim of making all of the main functions easily and logically pipeable. Since version 1, we have re-exported the `%>%` operator from **magrittr** to make it available out of the box with **quanteda**. With the introduction of the `|>` pipe in R 4.1.0, however, we prefer this variant, and will use it in all code in this book.\n\n### Data used in examples {.unnumbered}\n\nAll examples and code are bundled as a companion R package to the book, available from our [public GitHub repository](https://github.com/quanteda/TAUR/).\n\n::: callout-tip\nWe have written a companion package for this book called **TAUR**, which can be installed from GitHub using this command:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nremotes::install_github(\"quanteda/TAUR\")\n```\n:::\n\n:::\n\nWe largely rely on data from three sources:\n\n- the built-in objects from the **quanteda** package, such as the US Presidential Inaugural speech corpus;\\\n- added corpora from the book's companion package, **TAUR**; and\n- some additional **quanteda** corpora or dictionaries from additional sources or packages where indicated.\n\nOur scheme for naming data objects is not itself common, but it is very consistent. Data object names begin with `data`, have the object class as the second part of the name, such as `corpus`, and the third and final part of the name contains a description. The three elements are separated by the underscore (`_`) character. This means that any object is known by its name to be data, so that it shows up in the index of package objects (from the all-important help page, e.g. from `help(package = \"quanteda\")`) in one location, under \"d\". 
It also means that its object class is known from the name, without further inspection. So `data_dfm_lbgexample` is a dfm, while `data_corpus_inaugural` is clearly a corpus. We use this scheme and others like it with an almost religious fervour, because we think that learning the functionality of a programming framework for NLP and quantitative text analysis is complicated enough without having also to decipher or remember a mishmash of haphazard and inconsistently named functions and objects. The more you use our software packages and specifically **quanteda**, the more you will come to appreciate the attention we have paid to implementing a consistent naming scheme for objects, functions, and their arguments as well as to their consistent functionality.\n\n## Colophon {.unnumbered}\n\nThis book was written in [RStudio](https://www.rstudio.com/ide/) using [**Quarto**](https://quarto.org). The [website](https://quanteda.github.io/Text-Analysis-Using-R/) is hosted via [GitHub Pages](https://pages.github.com), and the complete source is available on [GitHub](https://github.com/quanteda/Text-Analysis-Using-R).\n\nThis version of the book was built with R version 4.2.1 (2022-06-23) and the following packages:\n\n\n::: {.cell-output-display}\n|Package             |Version |Source         |\n|:-------------------|:-------|:--------------|\n|quanteda            |3.2.2   |CRAN (R 4.2.0) |\n|quanteda.textmodels |0.9.4   |CRAN (R 4.2.0) |\n|quanteda.textplots  |0.94.1  |CRAN (R 4.2.0) |\n|quanteda.textstats  |0.95    |CRAN (R 4.2.0) |\n|readtext            |0.81    |CRAN (R 4.2.0) |\n|stopwords           |2.3     |CRAN (R 4.2.0) |\n|tidyverse           |1.3.2   |CRAN (R 4.2.0) |\n:::\n\n\n### How to Contact Us {.unnumbered}\n\nPlease address comments and questions concerning this book by filing an issue on our GitHub page. At this repository, you will also find instructions for installing the companion R package, [**TAUR**](https://github.com/quanteda/TAUR).\n\nFor more information about the authors or the Quanteda Initiative, visit our [website](https://quanteda.org).\n\n## Acknowledgements {.unnumbered}\n\nDeveloping **quanteda** has been a labour of many years involving many contributors. The most notable contributor to the package and its learning materials is Kohei Watanabe, without whom **quanteda** would not be the incredible package it is today. Kohei has provided a clear and vigorous vision for the package's evolution across its major versions and continues to maintain it today. Others have contributed in major ways both through design and programming, namely Paul Nulty and Akitaka Matsuo, both of whom were present at the creation and through the 1.0 launch, which involved the first of many major redesigns. Adam Obeng made a brief but indelible imprint on several of the packages in the quanteda family. Haiyan Wang wrote versions of some of the core C++ code that makes quanteda so fast, as well as contributing to the R code base. William Lowe and Christian Müller also contributed code and ideas that have enriched the package and the methods it implements.\n\nNo project this large could have existed without institutional and financial benefactors and supporters. Most notable of these is the European Research Council, which funded the original development under its Starting Investigator Grant scheme, awarded to Kenneth Benoit in 2011 for a project entitled QUANTESS: Quantitative Text Analysis for the Social Sciences (ERC-2011-StG 283794-QUANTESS). 
Our day jobs (as university professors) have also provided invaluable material and intellectual support for the development of the ideas, methodologies, and software documented in this book. This list includes the London School of Economics and Political Science, which employs Ken Benoit and which hosted the QUANTESS grant and employed many of the core quanteda team and/or where they completed PhDs; University College Dublin, where Stefan Müller is employed; Trinity College Dublin, where Ken Benoit was formerly a Professor of Quantitative Social Sciences and where Stefan completed his PhD in political science; and the Australian National University, which provided a generous affiliation to Ken Benoit and where the outline for this book took first shape as a giant tableau of post-it notes on the wall of a visiting office in the former building of the School of Politics and International Relations.\n\nAs we develop this book, we hope that many readers of the work-in-progress will contribute feedback, comments, and even edits, and we plan to acknowledge you all. So don't be shy, readers.\n\nWe also have to give a much-deserved shout-out to the amazing team at RStudio, many of whom we've been privileged to hear speak at events or meet in person. You made Quarto available at just the right time. That and the innovations in R tools and software that you have driven for the past decade have been rich beyond all expectation.\n\nFinally, no acknowledgements would be complete without a profound thanks to our partners, who have put up with us during the long incubation of this project. They have had to listen to us talk about this book for years before we finally got around to writing it. They've tolerated us cursing software bugs, students, CRAN, each other, package users, and ourselves. But they've also seen the joy that we've experienced from creating tools and materials that empower users and students, and the excitement of the long intellectual voyage we have taken together and with our ever-growing base of users and students. Bina and Émeline, thank you for all of your support and encouragement. You'll be so happy to know we finally have this book project well underway.\n",
5 | "supporting": [],
6 | "filters": [
7 | "rmarkdown/pagebreak.lua"
8 | ],
9 | "includes": {},
10 | "engineDependencies": {},
11 | "preserve": {},
12 | "postProcess": true
13 | }
14 | }
--------------------------------------------------------------------------------
/_freeze/preface/execute-results/tex.json:
--------------------------------------------------------------------------------
1 | {
2 | "hash": "6be04d233edb473f383c20ba814d5e85",
3 | "result": {
4 | "markdown": "::: hidden\n\n$$\n \\def\\quanteda{{\\bf quanteda}}\n$$\n\n:::\n\n# Preface {.unnumbered}\n\nThe uses and means for analysing text, especially using quantitative and computational approaches, have exploded in recent years across the fields of academic, industry, policy, and other forms of research and analysis. Text mining and text analysis have become the focus of a major methodological wave of innovation in the social sciences, especially political science but also extending to finance, media and communications, and sociology. In non-academic research or industry, furthermore, text mining applications are found in almost every sector, given the ubiquity of text and the need to analyse it.\n\n\"Text analysis\" is a broad label for a wide range of tools applied to *textual data*. It encompasses all manner of methods for turning unstructured raw data in the form of natural language documents into structured data that can be analysed systematically and using specific quantitative approaches. Text analytic methods include descriptive analysis, keyword analysis, topic analysis, measurement and scaling, clustering, text and vocabulary comparisons, or sentiment analysis. It may include causal analysis or predictive modelling. In their most advanced current forms, these methods extend to the natural language generation models that power the newest generation of artificial intelligence systems.\n\nThis book provides a practical guide introducing the fundamentals of text analysis methods and how to implement them using the R programming language. It covers a wide range of topics to provide a useful resource for a range of students from complete beginners to experienced users of R wanting to learn more advanced techniques. Our emphasis is on the text analytic workflow as a whole, ranging from an introduction to text manipulation in R, all the way through to advanced machine learning methods for textual data.\n\n## Why another book on text mining in R? {.unnumbered}\n\nText mining tools for R have existed for many years, led by the venerable **tm** package [@tmpackage]. In 2015, the first version of **quanteda** [@quantedaJOSS] was published on CRAN. Since then, the package has undergone three major versions, with each release improving its consistency, power, and usability. Starting with version 2, **quanteda** split into a series of packages designed around its different functions (such as plotting, statistics, or machine learning), with **quanteda** retaining the core text processing functions. Other popular alternatives exist, such as **tidytext** [@tidytextJOSS], although as we explain in [Chapter -@sec-furthertidy], **quanteda** works perfectly well with the R \"tidyverse\" and related approaches to text analysis.\n\nThis is hardly the only book covering how to work with text in R. @silge2017text introduced the **tidytext** package and numerous examples for mining text using the tidyverse approach to programming in R, including keyword analysis, mining word associations, and topic modelling. @kwartler2017text provides a good coverage of text mining workflows, plus methods for text visualisations, sentiment scoring, clustering, and classification, as well as methods for sourcing and manipulating source documents. Earlier works include @turenne2016analyse and @becue2019textual.\n\n::: {.callout-note icon=\"true\"}\n## One of the two hard things in computer science: Naming things\n\nWhere does does the package name come from? 
*quanteda* is a portmanteau name indicating the purpose of the package, which is the **qu**antitative **an**alysis of **te**xtual **da**ta.\n:::\n\nSo why another book? First, the R ecosystem for text analysis is rapidly evolving, with the publication in recent years of massively improved, specialist text analysis packages such as **quanteda**. None of the earlier books covers this amazing family of packages. In addition, the field of text analysis methodologies has also advanced rapidly, including machine learning approaches based on artificial neural networks and deep learning. These and the packages that make use of them---for instance **spacyr** [@spacyr] for harnessing the power of the spaCy natural language processing library for Python [@spaCy]---have yet to be presented in a systematic, book-length treatment. Furthermore, as we are the authors and creators of many of the packages we cover in this book, we view this as the authoritative reference and how-to guide for using these packages.\n\n## What to Expect from This Book {.unnumbered}\n\nThis book is meant to be a practical resource for those confronting the practical challenges of text analysis for the first time, focusing on how to do this in R. Our main focus is on the **quanteda** package and its extensions, although we also cover more general issues, including an introduction to the R language and a brief overview of the R functions required to get started quickly with practical text analysis.\n\n@benoit2020text provides a detailed overview of the analysis of textual data, and what distinguishes textual data from other forms of data. It also clearly articulates what is meant by treating text \"as data\" for analysis. This book and the approaches it presents are firmly geared toward this mindset.\n\nEach chapter is structured to provide continuity across topics. For each main subject explained in a chapter, we clearly explain the objective of the chapter, then describe the text analytic methods in an applied fashion so that readers are aware of the workings of the method. We then provide practical examples, with detailed working R code showing how to implement the method. Next, we identify any special issues involved in correctly applying the method, including how to handle the more complicated situations that may arise in practice. Finally, we provide further reading for readers wishing to learn more, and exercises for those wishing for hands-on practice, or for assigning them in a teaching environment.\n\nWe have years of experience in teaching this material in many practical short courses, summer schools, and regular university courses. We have drawn extensively from this experience in designing the overall scope of this book and the structure of each chapter. Our goal is to make the book suitable for self-learning or to form the basis for teaching and learning in a course on applied text analysis. Indeed, we have partly written this book to assign when teaching text analysis in our own curricula.\n\n## Who This Book Is For {.unnumbered}\n\nWe don't assume any previous knowledge of text mining---indeed, the goal of this book is to provide that, from a foundation through to some very advanced topics. Making use of this book does not require pre-existing familiarity with R, although, as the slogan goes, \"every little helps\". In Part I we cover some of the basics of R and how to make use of it for text analysis specifically. 
Readers will also learn more R through our extensive examples. However, experience in teaching and presenting this material tells us that a foundation in R will enable readers to advance through the applications far more rapidly than if they were learning R from scratch at the same time that they take the plunge into the possibly strange new world of text analysis.\n\nWe are both academics, although we also have experience working in industry or in applying text analysis for non-academic purposes. The typical reader may be a student of text analysis in the literal sense (of being a student) or in the general sense of someone studying techniques in order to improve their practical and conceptual knowledge. Our orientation is that of social scientists, with a specialization in political text and political communications and media. But this book is for everyone: social scientists, computer scientists, scholars of digital humanities, researchers in marketing and management, and applied researchers working in policy, government, or business fields. This book is written to have the credibility and scholarly rigour (for referencing methods, for instance) needed by academic readers, but in a straightforward, jargon-free (as much as we were able!) manner, to be of maximum practical use to non-academic analysts as well.\n\n## How the Book is Structured {.unnumbered}\n\n### Sections {.unnumbered}\n\nThe book is divided into seven sections. These group topics that we feel represent common stages of learning in text analysis, or similar groups of topics that different users will be drawn to. By grouping stages of learning, we also make it possible for intermediate or advanced users to jump to the section that interests them most, or to the sections where they feel they need additional learning.\n\nOur sections are:\n\n- **Working in R:** This section is designed for beginners to quickly learn the R required for the techniques we cover in the book, and to guide them in learning a proper R workflow for text analysis.\n\n- **Acquiring texts:** Often described (by us at least) as the hardest problem in text analysis, we cover how to source documents, including from Internet sources, and how to import these into R as part of the quantitative text analysis pipeline.\n\n- **Managing textual data using quanteda:** In this section, we introduce the **quanteda** package, and cover each stage of textual data processing, from creating structured corpora, to tokenisation, and building matrix representations of these tokens. We also talk about how to build and manage structured lexical resources such as dictionaries and stop word lists.\n\n- **Exploring and describing texts:** How to get overviews of texts using summary statistics, explore texts using keywords-in-context, extract target words, and identify key words.\n\n- **Statistics for comparing texts:** How to characterise documents in terms of their lexical diversity, 
readability, similarity, or distance.\n\n- **Machine learning for texts:** How to apply scaling models, predictive models, and classification models to textual matrices.\n\n- **Further methods for texts:** Advanced methods, including the use of natural language models to annotate texts, extract entities, or use word embeddings; integrating **quanteda** with \"tidy\" data approaches; and how to apply text analysis to \"hard\" languages such as Chinese (hard because of the high-dimensional character set and the lack of whitespace to delimit words).\n\nFinally, in several **appendices**, we provide more detail about some tricky subjects, such as text encoding formats and working with regular expressions.\n\n### Chapter structure {.unnumbered}\n\nOur approach in each chapter is split into the following components:\n\n- Objectives. We explain the purpose of each chapter and what we believe are the most important learning outcomes.\n\n- Methods. We clearly explain the methodological elements of each chapter, through a combination of high-level explanations, formulas, and references.\n\n- Examples. We use practical examples, with R code, demonstrating how to apply the methods to realise the objectives.\n\n- Issues. We identify any special issues, potential problems, or additional approaches that a user might face when applying the methods to their text analysis problem.\n\n- Further Reading. In part because our scholarly backgrounds compel us to do so, and in part because we know that many readers will want to read more about each method, each chapter contains its own set of references and further readings.\n\n- Exercises. For those wanting additional practice or to use this text as a teaching resource (which we strongly encourage!), we provide exercises that can be assigned for each chapter.\n\nThroughout the book, we will demonstrate with examples and build models using a selection of text data sets. 
A description of these data sets can be found in [Appendix -@sec-appendix-installing].\n\n### Conventions {.unnumbered}\n\nThroughout the book we use several kinds of info boxes to call your attention to information, cautions, and warnings.\n\n::: callout-note\nThe information icon signals a note or a reference to further information.\n:::\n\n::: callout-tip\nTips provide suggestions for better ways of doing things.\n:::\n\n::: callout-important\nThe exclamation mark icon signals an important point to consider.\n:::\n\n::: callout-warning\nWarning icons flag things you definitely want to avoid.\n:::\n\nAs you may already have noticed, we put names of R packages in **boldface**.\n\nCode blocks will be self-evident, and will look like this, with the output produced from executing those commands shown below the highlighted code in a mono-spaced font.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"quanteda\")\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nPackage version: 3.2.2\nUnicode version: 14.0\nICU version: 70.1\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nParallel computing: 10 of 10 threads used.\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nSee https://quanteda.io for tutorials and examples.\n```\n:::\n\n```{.r .cell-code}\ndata_corpus_inaugural[1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCorpus consisting of 3 documents and 4 docvars.\n1789-Washington :\n\"Fellow-Citizens of the Senate and of the House of Representa...\"\n\n1793-Washington :\n\"Fellow citizens, I am again called upon by the voice of my c...\"\n\n1797-Adams :\n\"When it was first perceived, in early times, that no middle ...\"\n```\n:::\n:::\n\n\nWe love the pipe operator in R (when used with discipline!) and built **quanteda** with the aim of making all of the main functions easily and logically pipeable. Since version 1, we have re-exported the `%>%` operator from **magrittr** to make it available out of the box with **quanteda**. With the introduction of the `|>` pipe in R 4.1.0, however, we prefer this variant, and will use it in all code in this book.\n\n### Data used in examples {.unnumbered}\n\nAll examples and code are bundled as a companion R package to the book, available from our [public GitHub repository](https://github.com/quanteda/TAUR/).\n\n::: callout-tip\nWe have written a companion package for this book called **TAUR**, which can be installed from GitHub using this command:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nremotes::install_github(\"quanteda/TAUR\")\n```\n:::\n\n:::\n\nWe largely rely on data from three sources:\n\n- the built-in objects from the **quanteda** package, such as the US Presidential Inaugural speech corpus;\\\n- added corpora from the book's companion package, **TAUR**; and\n- some additional **quanteda** corpora or dictionaries from additional sources or packages where indicated.\n\nOur scheme for naming data objects is not itself common, but it is very consistent. Data object names begin with `data`, have the object class as the second part of the name, such as `corpus`, and the third and final part of the name contains a description. The three elements are separated by the underscore (`_`) character. This means that any object is known by its name to be data, so that it shows up in the index of package objects (from the all-important help page, e.g. from `help(package = \"quanteda\")`) in one location, under \"d\". 
It also means that its object class is known from the name, without further inspection. So `data_dfm_lbgexample` is a dfm, while `data_corpus_inaugural` is clearly a corpus. We use this scheme and others like it with an almost religious fervour, because we think that learning the functionality of a programming framework for NLP and quantitative text analysis is complicated enough without having also to decipher or remember a mishmash of haphazard and inconsistently named functions and objects. The more you use our software packages and specifically **quanteda**, the more you will come to appreciate the attention we have paid to implementing a consistent naming scheme for objects, functions, and their arguments as well as to their consistent functionality.\n\n## Colophon {.unnumbered}\n\nThis book was written in [RStudio](https://www.rstudio.com/ide/) using [**Quarto**](https://quarto.org). The [website](https://quanteda.github.io/Text-Analysis-Using-R/) is hosted via [GitHub Pages](https://pages.github.com), and the complete source is available on [GitHub](https://github.com/quanteda/Text-Analysis-Using-R).\n\nThis version of the book was built with R version 4.2.1 (2022-06-23) and the following packages:\n\n\n::: {.cell-output-display}\n|Package             |Version |Source         |\n|:-------------------|:-------|:--------------|\n|quanteda            |3.2.2   |CRAN (R 4.2.0) |\n|quanteda.textmodels |0.9.4   |CRAN (R 4.2.0) |\n|quanteda.textplots  |0.94.1  |CRAN (R 4.2.0) |\n|quanteda.textstats  |0.95    |CRAN (R 4.2.0) |\n|readtext            |0.81    |CRAN (R 4.2.0) |\n|stopwords           |2.3     |CRAN (R 4.2.0) |\n|tidyverse           |1.3.2   |CRAN (R 4.2.0) |\n:::\n\n\n### How to Contact Us {.unnumbered}\n\nPlease address comments and questions concerning this book by filing an issue on our GitHub page. At this repository, you will also find instructions for installing the companion R package, [**TAUR**](https://github.com/quanteda/TAUR).\n\nFor more information about the authors or the Quanteda Initiative, visit our [website](https://quanteda.org).\n\n## Acknowledgements {.unnumbered}\n\nDeveloping **quanteda** has been a labour of many years involving many contributors. The most notable contributor to the package and its learning materials is Kohei Watanabe, without whom **quanteda** would not be the incredible package it is today. Kohei has provided a clear and vigorous vision for the package's evolution across its major versions and continues to maintain it today. Others have contributed in major ways both through design and programming, namely Paul Nulty and Akitaka Matsuo, both of whom were present at the creation and through the 1.0 launch, which involved the first of many major redesigns. Adam Obeng made a brief but indelible imprint on several of the packages in the quanteda family. Haiyan Wang wrote versions of some of the core C++ code that makes quanteda so fast, as well as contributing to the R code base. William Lowe and Christian Müller also contributed code and ideas that have enriched the package and the methods it implements.\n\nNo project this large could have existed without institutional and financial benefactors and supporters. Most notable of these is the European Research Council, which funded the original development under its Starting Investigator Grant scheme, awarded to Kenneth Benoit in 2011 for a project entitled QUANTESS: Quantitative Text Analysis for the Social Sciences (ERC-2011-StG 283794-QUANTESS). 
Our day jobs (as university professors) have also provided invaluable material and intellectual support for the development of the ideas, methodologies, and software documented in this book. This list includes the London School of Economics and Political Science, which employs Ken Benoit and which hosted the QUANTESS grant and employed many of the core quanteda team and/or where they completed PhDs; University College Dublin, where Stefan Müller is employed; Trinity College Dublin, where Ken Benoit was formerly a Professor of Quantitative Social Sciences and where Stefan completed his PhD in political science; and the Australian National University, which provided a generous affiliation to Ken Benoit and where the outline for this book took first shape as a giant tableau of post-it notes on the wall of a visiting office in the former building of the School of Politics and International Relations.\n\nAs we develop this book, we hope that many readers of the work-in-progress will contribute feedback, comments, and even edits, and we plan to acknowledge you all. So don't be shy, readers.\n\nWe also have to give a much-deserved shout-out to the amazing team at RStudio, many of whom we've been privileged to hear speak at events or meet in person. You made Quarto available at just the right time. That and the innovations in R tools and software that you have driven for the past decade have been rich beyond all expectation.\n\nFinally, no acknowledgements would be complete without a profound thanks to our partners, who have put up with us during the long incubation of this project. They have had to listen to us talk about this book for years before we finally got around to writing it. They've tolerated us cursing software bugs, students, CRAN, each other, package users, and ourselves. But they've also seen the joy that we've experienced from creating tools and materials that empower users and students, and the excitement of the long intellectual voyage we have taken together and with our ever-growing base of users and students. Bina and Émeline, thank you for all of your support and encouragement. You'll be so happy to know we finally have this book project well underway.\n",
5 | "supporting": [
6 | "preface_files"
7 | ],
8 | "filters": [
9 | "rmarkdown/pagebreak.lua"
10 | ],
11 | "includes": {},
12 | "engineDependencies": {
13 | "knitr": [
14 | "{\"type\":\"list\",\"attributes\":{},\"value\":[]}"
15 | ]
16 | },
17 | "preserve": null,
18 | "postProcess": false
19 | }
20 | }
--------------------------------------------------------------------------------
/_freeze/site_libs/clipboard/clipboard.min.js:
--------------------------------------------------------------------------------
1 | /*!
2 | * clipboard.js v2.0.10
3 | * https://clipboardjs.com/
4 | *
5 | * Licensed MIT © Zeno Rocha
6 | */
7 | !function(t,e){"object"==typeof exports&&"object"==typeof module?module.exports=e():"function"==typeof define&&define.amd?define([],e):"object"==typeof exports?exports.ClipboardJS=e():t.ClipboardJS=e()}(this,function(){return n={686:function(t,e,n){"use strict";n.d(e,{default:function(){return o}});var e=n(279),i=n.n(e),e=n(370),u=n.n(e),e=n(817),c=n.n(e);function a(t){try{return document.execCommand(t)}catch(t){return}}var f=function(t){t=c()(t);return a("cut"),t};var l=function(t){var e,n,o,r=1
--------------------------------------------------------------------------------
/index.qmd:
--------------------------------------------------------------------------------
14 | :::
15 |
16 |
17 | \pagenumbering{roman}
18 |
19 | This is the draft version of *Text Analysis Using R*. This book offers a comprehensive practical guide to text analysis and natural language processing using the R language. We have pitched the book at those already familiar with some R, but we also provide a gentle enough introduction that it is suitable for newcomers to R.
20 |
21 | You'll learn how to prepare your texts for analysis, how to analyse texts for insight using statistical methods and machine learning, and how to present those results using graphical methods. Each chapter covers a distinct topic, first presenting the methodology underlying each topic, and then providing practical examples using R. We also discuss advanced issues facing each method and its application. Finally, for those engaged in self-learning or wishing to use this book for instruction, we provide practical exercises for each chapter.
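
To give a flavour of this workflow, here is a small sketch (not evaluated in this draft) that prepares the built-in corpus of US presidential inaugural addresses, computes the most frequent words, and plots them:

```{r, eval = FALSE}
library("quanteda")
library("quanteda.textstats")
library("quanteda.textplots")

# prepare: tokenise the corpus and build a document-feature matrix
dfmat <- data_corpus_inaugural |>
  tokens(remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm()

# analyse: the ten most frequent features across all speeches
textstat_frequency(dfmat, n = 10)

# present: a word cloud of the most frequent words
textplot_wordcloud(dfmat, max_words = 100)
```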
22 |
23 | ```{r, eval = FALSE, child = if (knitr::is_html_output()) 'welcome-html.qmd'}
24 | ```
25 |
26 | The book is organised into parts, starting with a review of R and especially the R packages and functions relevant to text analysis. If you are already comfortable with R, you can skim or skip that section and proceed straight to Part II.
27 |
28 | ::: callout-note
29 | This book is a work in progress. We will publish chapters as we write them, and open up the GitHub source repository for readers to make comments or even suggest corrections. You can view this at the book's GitHub repository.
30 | :::
31 |
32 | ## License
33 |
34 | We will eventually seek a publisher for this book, but we want to write it first. In the meantime, we are retaining full copyright and licensing the book only to be read free of charge.
35 |
36 | © Kenneth Benoit and Stefan Müller. All rights reserved.
37 |
--------------------------------------------------------------------------------
/latex/preamble.tex:
--------------------------------------------------------------------------------
1 | \usepackage{booktabs}   % professional rules for tables
2 | \usepackage{longtable}  % tables that can break across pages
3 | \usepackage[bf,singlelinecheck=off]{caption} % bold, left-aligned captions
4 | \usepackage[a4paper,width = 14cm]{geometry}  % A4 paper with a 14cm text width
5 | \usepackage{letltxmacro} % \LetLtxMacro for safely redefining existing macros
6 | %\usepackage{Alegreya}
7 | %\usepackage{AlegreyaSans}
8 |
9 |
10 | \addtokomafont{disposition}{\rmfamily} % headings in the same (roman) font family as the body text, not the default sans-serif
11 |
12 |
13 | \usepackage[colorlinks=true, linkcolor=blue, filecolor=blue, urlcolor=blue, pdfborder={0 0 0},citecolor=blue]{hyperref}
14 |
--------------------------------------------------------------------------------
/references.qmd:
--------------------------------------------------------------------------------
1 | ## References {.unnumbered}
2 |
3 | ::: {#refs}
4 | :::
5 |
--------------------------------------------------------------------------------
/welcome-html.qmd:
--------------------------------------------------------------------------------
1 | You can find the website for this book [here](https://quanteda.github.io/Text-Analysis-Using-R/).
2 |
--------------------------------------------------------------------------------