├── .gitignore
├── 00-intro.Rmd
├── 00-introducing-RStudio.Rmd
├── 01-data-input.Rmd
├── 02-getting-tm.Rmd
├── 03-preparing-dtm.Rmd
├── 04-analysing-dtm.Rmd
├── 05-visualising-dtm.Rmd
├── CONTRIBUTING.md
├── CONTRIBUTORS.md
├── DataCarpentry_overview_slides.pptx
├── LICENSE.md
├── img
│   └── r_starting_how_it_should_like.png
├── survey_data.csv
└── textmining-socialsci.Rproj
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 |
--------------------------------------------------------------------------------
/00-intro.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | layout: lesson
3 | title: Introducing text analysis of survey data
4 | ---
5 |
6 | > ### Objectives
7 | >
8 | > * Explain when a quantitative approach is useful for analysing text
9 | > * Explain how text can be used as quantitative data
10 |
11 | This lesson introduces text analysis using R. The lesson uses the `tm` package, which is currently the most versatile and widely used package for analysing text in R. The lesson focuses on survey responses in a CSV file, but it could easily be adapted to work with text in PDF files, MS Word documents, plain text files, web pages, Twitter streams, and so on.
12 |
13 | A convenient way to collect information from people is to ask them to type responses to a set of questions. This is especially useful when it's tricky to anticipate the range of responses that will be collected and when it's not practical or desirable to offer checkboxes or dropdown menus. Collection of free text allows respondents to be unconstrained by the data collection instrument (although the size limit of the text field is a practical constraint).
14 |
15 | Let's say we want to do a survey to understand people's needs about training in data and programming skills. We don't have much prior knowledge about the people taking the survey, so we might use a combination of Likert scales ('how often do you program?' with six buttons ranging between 'never' and 'daily') and free text responses for questions like 'Briefly list 1-2 specific skills relating to data and software you'd like to acquire'. If 10-20 people take the survey we can easily scan the free text responses and get a sense of what skills people are interested in. But if 100 or more people take the survey, we can't count on a casual glance at the text to get a reliable summary of the responses.
16 |
17 | The challenge that we'll tackle here is how to programmatically quantify the free text responses so we can quickly see what the range of responses is, and rank their frequencies to see what the most popular skills are.
18 |
19 | The next lessons will show:
20 | - how to get survey data like this into R
21 | - how to use the `tm` package in R to convert the text into numbers, which is what R is especially well suited to working with
22 | - how to manipulate and analyse the data in R to summarise the free text data
23 | - how to visualize the results of the analysis with the ggplot2 package
24 |
25 | ### What and Why
26 |
27 | At the simplest level, the main task in using a programming language to analyse free text data is to convert the words to numbers, upon which we can then perform simple transformations and arithmetic. This would be an extremely tedious task to perform manually, and fortunately there are some quite mature free and open source software packages for R that do it very efficiently. Advantages of using this software include the ability to automate and reproduce the analysis and the transparency of the method that allows others to easily see the choices you've made during the analysis.
28 |
29 | The most important data structure that results from this conversion is the document-term matrix. This is a table of numbers where each column represents a word and each row represents a document. In the case of a survey, we might have one document-term matrix per question, and each row would represent one respondent. Using this data structure, one person's response to a question can be represented as a vector of numbers that express the frequency of various words in their response. Obviously this vector makes little sense if we try to read it as a normal sentence, but it's very useful for working at a large scale and identifying high-frequency words and word associations. One of the key advantages of this structure is storage efficiency. By storing a count of a word in a document we don't need to store all its individual occurrences. And thanks to a format called 'simple triplet matrix' we don't need to store zeros at all. This means that a document-term matrix takes much less memory than the original text it represents, so it's faster to operate on and easier to store and transmit.
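   | 
   | To make this concrete, here is a minimal sketch with two toy documents (invented for illustration; it assumes the `tm` package, which we install in a later lesson):
   | 
   | ```{r, eval=FALSE}
   | library(tm)
   | toy_docs <- c("learn data analysis", "learn statistics and data") # two tiny 'documents'
   | toy_corpus <- Corpus(VectorSource(toy_docs))
   | toy_dtm <- DocumentTermMatrix(toy_corpus)
   | inspect(toy_dtm) # each row is a document, each column a term, each cell a count
   | ```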
30 |
31 | These lessons are designed to be worked through interactively at the R console. At several points we'll re-run functions, experimenting with different parameter values. This is a typical process for exploratory data analysis, working interactively and trying several different methods before we get something meaningful.
32 |
33 | ### Key Points
34 |
35 | * Free text responses are a valuable survey method, but can be challenging to analyse
36 | * A quantitative approach to analysing free text is advantageous because it can be automated, reproduced and audited.
37 | * The document-term matrix is an important data structure for quantitative analysis of free text
38 | * An exploratory, iterative approach is valuable when encountering new data for the first time
39 |
40 |
41 |
--------------------------------------------------------------------------------
/00-introducing-RStudio.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | layout: topic
3 | title: Introducing RStudio
4 | minutes: 15
5 | ---
6 |
7 | ```{r, echo=FALSE, purl=FALSE}
8 | knitr::opts_chunk$set(results='hide', fig.path='img/r-lesson-')
9 | ```
10 |
11 | > ## Learning Objectives
12 | >
13 | > * Introduce participants to the RStudio interface
14 | > * Set up participants to work with files, scripts and output figures
15 | > * Introduce R syntax
16 | > * Introduce good R style
17 | > * Point to relevant information on how to get help, and understand how to ask well-formulated questions
18 |
19 | # Working with RStudio
20 |
21 | For this workshop you installed R and RStudio. RStudio is an interface that makes it nicer
22 | to work with the R programming language.
23 |
24 | # Introducing RStudio and setting up the environment
25 |
26 | * Start RStudio (presentation of RStudio -below- should happen here)
27 | * Under the `File` menu, click on `New project`, choose `New directory`, then
28 | `Empty project`
29 | * Enter a name for this new folder, and choose a convenient location for
30 | it. This will be your **working directory** for the rest of the day
31 | (e.g., `~/data-carpentry`)
32 |   The `~` sign is a shortcut for your 'home' directory - the place where you start.
33 |
34 | * Click on "Create project"
35 | * Under the `Files` tab on the right of the screen, click on `New Folder` and
36 | create a folder named `data` within your newly created working directory.
37 | (e.g., `~/data-carpentry/data`)
38 | * Create a new R script (File > New File > R script) and save it in your working
39 | directory (e.g. `data-carpentry-script.R`)
40 |
41 | Your working directory should now look like this:
42 |
43 | 
44 |
45 | ## Organizing your working directory
46 |
47 | You should separate the original data (raw data) from intermediate datasets that
48 | you may create for the need of a particular analysis. For instance, you may want
49 | to create a `data/` directory within your working directory that stores the raw
50 | data, and have a `data_output/` directory for intermediate datasets and a
51 | `figure_output/` directory for the plots you will generate.
52 |
53 | # Presentation of RStudio
54 |
55 | Let's start by learning about our tool.
56 |
57 | * Console, Scripts, Environments, Plots
58 | * Code and workflow are more reproducible if we can document everything that we
59 | do.
60 | * Our end goal is not just to "do stuff" but to do it in a way that anyone, and in
61 |   particular our future selves 6 months from now, can
62 |   easily and exactly replicate our workflow and results.
63 |
64 | # Interacting with R
65 |
66 | There are two main ways of interacting with R: using the console or by using
67 | script files (plain text files that contain your code).
68 |
69 | The console window (in RStudio, the bottom left panel) is the place where R is
70 | waiting for you to tell it what to do, and where it will show the results of a
71 | command. You can type commands directly into the console, but they will be
72 | forgotten when you close the session. It is better to enter the commands in the
73 | script editor, and save the script. This way, you have a complete record of what
74 | you did, you can easily show others how you did it and you can do it again later
75 | on if needed. You can copy-paste into the R console, but the RStudio script
76 | editor allows you to 'send' the current line or the currently selected text to
77 | the R console. You can go to Code -> Run line(s) or use the `Ctrl-Enter` shortcut.
78 |
79 | Let's give it a try and use R as a fancy calculator. In the console (the bottom left panel) type
80 |
81 | ```
82 | 2+2
83 | ```
84 |
85 | We see that it gives us
86 |
87 | ```
88 | [1] 4
89 | ```
90 |
91 | Hooray, it worked like it should!
92 |
93 | Now, let's try using the script editor. In the top left, the editor, type
94 |
95 | ```
96 | 3+3
97 | ```
98 |
99 | Once you do that, nothing happens. That's because R doesn't actually know you typed that yet.
100 | You have to tell it you want to run that command. With your cursor somewhere on that line,
101 | press `Ctrl-Enter`. Now you can see that in the bottom panel, the command was run and you get
102 |
103 | ```
104 | [1] 6
105 | ```
106 |
107 | If R is ready to accept commands, the R console shows a `>` prompt. If it
108 | receives a command (by typing, copy-pasting or sent from the script editor using
109 | `Ctrl-Enter`), R will try to execute it, and when ready, show the results and
110 | come back with a new `>`-prompt to wait for new commands.
111 |
112 | If R is still waiting for you to enter more input because the command isn't complete yet,
113 | the console will show a `+` prompt. It means that you haven't finished entering
114 | a complete command. This is because you have not 'closed' a parenthesis or
115 | quotation. If you're in Rstudio and this happens, click inside the console
116 | window and press `Esc`; this should help you out of trouble.
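   | 
   | For example, typing an unclosed function call in the console produces the `+` prompt:
   | 
   | ```
   | sum(2, 2   # press Enter here and R shows '+', waiting for more input
   | )          # closing the parenthesis completes the command, printing [1] 4
   | ```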
117 |
118 | We'll be working in the script editor, because there you can save your work. If
119 | you just do something in the console, that work goes away once you close RStudio. It's
120 | like your notebook versus a dry erase board.
121 |
122 | ## Commenting
123 |
124 | Something else that's nice about the script editor is that you can comment
125 | your work. So, you can say why you did something or what that section is supposed to do.
126 |
127 | Use `#` signs to comment. Comment liberally in your R scripts. Anything to the
128 | right of a `#` is ignored by R.
129 |
130 |
131 | ```
132 | # anything to the right of a # sign is a comment, like this line
133 | ```
135 |
136 | Let's try that.
137 |
138 | ```
139 | # This is addition of 3 and 3
140 | 3+3
141 | ```
142 |
143 | If you highlight just the 3+3 and then press `Ctrl-Enter`, you get 6. If you highlight
144 | the whole thing and press `Ctrl-Enter`, you still get 6. Anything in the line after
145 | the # is ignored.
146 |
147 | ## Assignment operator
148 |
149 | `<-` is the assignment operator. It assigns values on the right to objects on
150 | the left. So, after executing `x <- 3`, the value of `x` is `3`. The arrow can
151 | be read as 3 **goes into** `x`. You can also use `=` for assignments but not in
152 | all contexts so it is good practice to use `<-` for assignments. `=` should only
153 | be used to specify the values of arguments in functions, see below.
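   | 
   | For example:
   | 
   | ```
   | x <- 3                  # assign the value 3 to x
   | sum(1, 2, na.rm = TRUE) # here '=' sets a function argument, not a variable
   | ```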
154 |
155 | In RStudio, typing `Alt + -` (push `Alt`, the key next to your space bar, at the
156 | same time as the `-` key) will write ` <- ` in a single keystroke.
157 |
158 | So, let's try some algebra. Assign `a <- 4` and `b <- 2`. Add `a + b`; what do you get?
159 |
160 | **Remember that you have to run each line if you're typing it in the script editor**
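   | 
   | Typed out, that looks like this:
   | 
   | ```
   | a <- 4   # assign 4 to a
   | b <- 2   # assign 2 to b
   | a + b    # should print 6
   | ```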
161 |
162 |
163 | ***
164 | ## EXERCISE
165 |
166 | - In the scripting window, add four numbers and comment them to say why you picked
167 | those four numbers, and run the command.
168 | - In the scripting window, assign c, d, e and f to those 4 numbers. Add c, d, e and f. Do you
169 | get the same number as in the exercise above?
170 | - What happens if you just type `a` in the console (bottom) window?
171 | - Change c to 100. Now add c, d, e and f again. What do you think will happen? Is this what happened?
172 | - In the console window, hit the up arrow key. Hit it again. What's happening?
173 |
174 |
175 | ***
176 |
177 | At some point in your analysis you may want to check the content of a variable or
178 | the structure of an object, without necessarily keeping a record of it in your
179 | script. You can type these commands directly in the console. RStudio provides
180 | the `Ctrl-1` and `Ctrl-2` shortcuts, which allow you to jump between the script and the
181 | console windows.
182 |
183 | # Built in functions
184 |
185 | R also has handy built-in functions for common tasks. It also has packages you can
186 | load in for all kinds of extra functionality. The text mining package `tm` will be one we work with today.
187 |
188 | For instance, adding up a bunch of numbers, like we just did, is a common task. Instead
189 | of adding them all up by hand, we can use the function `sum()`. Type
190 |
191 | ```
192 | sum(1,2,3,4)
193 | ```
194 |
195 | Do you get what you expected?
196 |
197 | ***
198 | ## EXERCISE
199 |
200 | - Use the `sum()` function on your numbers.
201 | - Does `sum()` work just for numbers or for the letters/variables you assigned to the numbers too?
202 |
203 | ***
204 |
205 |
206 |
207 |
208 |
209 | # Basics of R (it's more than just a calculator)
210 |
211 |
212 | R is a versatile, open source programming/scripting language that's useful for both
213 | statistics and data science. It was inspired by the programming language S.
214 |
215 | * Open source software under GPL.
216 | * Superior (if not just comparable) to commercial alternatives. R has over 7,000
217 | user contributed packages at this time. It's widely used both in academia and
218 | industry.
219 | * Available on all platforms.
220 | * Not just for statistics, but also general purpose programming.
221 | * For people who have experience in programming: R is both an object-oriented
222 | and a so-called [functional language](http://adv-r.had.co.nz/Functional-programming.html)
223 | * Large and growing community of peers.
224 |
225 |
226 |
227 | ## Seeking help
228 |
229 | ### I know the name of the function I want to use, but I'm not sure how to use it
230 |
231 | If you need help with a specific function, let's say `sum()`, you can type:
232 |
233 | ```{r, eval=FALSE}
234 | ?sum
235 | ```
236 |
237 | If you just need to remind yourself of the names of the arguments, you can use:
238 |
239 | ```{r, eval=FALSE}
240 | args(sum)
241 | ```
242 |
243 | If the function is part of a package that is installed on your computer but you
244 | don't remember which one, you can type:
245 |
246 | ```{r, eval=FALSE}
247 | ??geom_point
248 | ```
249 |
250 | ### I want to use a function that does X, there must be a function for it but I don't know which one...
251 |
252 | If you are looking for a function to do a particular task, you can use
253 | `help.search()` (but it only looks through the installed packages):
254 |
255 | ```{r, eval=FALSE}
256 | help.search("kruskal")
257 | ```
258 |
259 | If you can't find what you are looking for, you can use the
260 | [rdocumentation.org](http://www.rdocumentation.org) website, which searches through
261 | the help files across all available packages.
262 |
263 | ### Style
264 |
265 | [R Style Guide](http://adv-r.had.co.nz/Style.html)
266 |
267 |
268 | ### I am stuck... I get an error message that I don't understand
269 |
270 | Start by googling the error message. However, this doesn't always work very well
271 | because often, package developers rely on the error catching provided by R. You
272 | end up with general error messages that might not be very helpful to diagnose a
273 | problem (e.g. "subscript out of bounds").
274 |
275 | However, you should check Stack Overflow. Search using the `[r]` tag. Most
276 | questions have already been answered, but the challenge is to use the right
277 | words in the search to find the answers:
278 | [http://stackoverflow.com/questions/tagged/r](http://stackoverflow.com/questions/tagged/r)
279 |
280 | The [Introduction to R](http://cran.r-project.org/doc/manuals/R-intro.pdf) can
281 | also be dense for people with little programming experience but it is a good
282 | place to understand the underpinnings of the R language.
283 |
284 | The [R FAQ](http://cran.r-project.org/doc/FAQ/R-FAQ.html) is dense and technical
285 | but it is full of useful information.
286 |
287 | ### Asking for help
288 |
289 | The key to getting help from someone is for them to grasp your problem rapidly. You
290 | should make it as easy as possible to pinpoint where the issue might be.
291 |
292 | Try to use the correct words to describe your problem. For instance, a package
293 | is not the same thing as a library. Most people will understand what you meant,
294 | but others have really strong feelings about the difference in meaning. The key
295 | point is that it can make things confusing for people trying to help you. Be as
296 | precise as possible when describing your problem.
297 |
298 | If possible, try to reduce what doesn't work to a simple reproducible
299 | example. If you can reproduce the problem using a very small `data.frame`
300 | instead of your 50,000 rows and 10,000 columns one, provide the small one with
301 | the description of your problem. When appropriate, try to generalize what you
302 | are doing so even people who are not in your field can understand the question.
303 |
304 | To share an object with someone else, if it's relatively small, you can use the
305 | function `dput()`. It will output R code that can be used to recreate the exact same
306 | object as the one in memory:
307 |
308 | ```{r, results='show'}
309 | dput(head(iris)) # iris is an example data.frame that comes with R
310 | ```
311 |
312 | If the object is larger, provide the raw file (i.e., your CSV file) with
313 | your script up to the point of the error (and after removing everything that is
314 | not relevant to your issue). Alternatively, in particular if your question is
315 | not related to a `data.frame`, you can save any R object to a file:
316 |
317 | ```{r, eval=FALSE}
318 | saveRDS(iris, file="/tmp/iris.rds")
319 | ```
320 |
321 | The content of this file is, however, not human readable and cannot be posted
322 | directly on Stack Overflow. It can, however, be emailed to someone, who can read
323 | it with this command:
324 |
325 | ```{r, eval=FALSE}
326 | some_data <- readRDS(file="~/Downloads/iris.rds")
327 | ```
328 |
329 | Last, but certainly not least, **always include the output of `sessionInfo()`**
330 | as it provides critical information about your platform, the versions of R and
331 | the packages that you are using, and other information that can be very helpful
332 | to understand your problem.
333 |
334 | ```{r, results='show'}
335 | sessionInfo()
336 | ```
337 |
338 | ### Where to ask for help?
339 |
340 | * Your friendly colleagues: if you know someone with more experience than you,
341 | they might be able and willing to help you.
342 | * Stack Overflow: if your question hasn't been answered before and is well
343 |   crafted, chances are you will get an answer in less than 5 minutes.
344 | * The [R-help mailing list](https://stat.ethz.ch/mailman/listinfo/r-help): it is read by a
345 | lot of people (including most of the R core team), a lot of people post to it,
346 | but the tone can be pretty dry, and it is not always very welcoming to new
347 | users. If your question is valid, you are likely to get an answer very fast
348 |   but don't expect that it will come with smiley faces. Also, here more than
349 |   anywhere else, be sure to use correct vocabulary (otherwise you might get an
350 | answer pointing to the misuse of your words rather than answering your
351 | question). You will also have more success if your question is about a base
352 | function rather than a specific package.
353 | * If your question is about a specific package, see if there is a mailing list
354 | for it. Usually it's included in the DESCRIPTION file of the package that can
355 | be accessed using `packageDescription("name-of-package")`. You may also want
356 | to try to email the author of the package directly.
357 | * There are also some topic-specific mailing lists (GIS, phylogenetics, etc...),
358 | the complete list is [here](http://www.r-project.org/mail.html).
359 |
360 | ### More resources
361 |
362 | * The [Posting Guide](http://www.r-project.org/posting-guide.html) for the R
363 | mailing lists.
364 | * [How to ask for R help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html) -
365 |   useful guidelines.
366 |
--------------------------------------------------------------------------------
/01-data-input.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | layout: lesson
3 | title: Getting text data into R
4 | ---
5 |
6 | > ### Objectives
7 | > * Input the survey data into R
8 | > * Inspect the data in R
9 |
10 | Surveys are mostly conducted online using a web form. Most online survey applications provide a simple way to get the data by exporting a CSV file. CSV files are plain text files where the columns are separated by commas, hence 'comma-separated values' or CSV. The advantage of a CSV over an Excel or SPSS file is that we can open and read a CSV file using just about any software, including a simple text editor. We're not tied to a certain version of a certain expensive program when we work with CSV, so it's a good format to work with for maximum portability and endurance. We could also import text files, PDF files or other file formats for this kind of text analysis. For simplicity here we have a CSV file with just one column. This is an excerpt from real survey data that have been anonymised and de-identified.
11 |
12 | Built into R is a convenient function for importing CSV files into the R environment:
13 |
14 |
15 | ```{r}
16 | survey_data <- read.csv("survey_data.csv", stringsAsFactors = FALSE)
17 | ```
18 |
19 | You need to give R the full path to the CSV file on your computer, or you can use the function `setwd` to set a working directory for your R session before the `read.csv` line. Once you specify the appropriate working directory you can just refer to the CSV file by its name rather than the whole path. The argument `stringsAsFactors = FALSE` keeps the text as a character type, rather than converting it to a factor, which is the default setting. Factors are useful in many settings, but not this one.
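   | 
   | For example, a quick sketch (the path here is only an illustration; substitute your own folder):
   | 
   | ```{r, eval=FALSE}
   | setwd("~/data-carpentry") # R now looks for files in this directory
   | survey_data <- read.csv("survey_data.csv", stringsAsFactors = FALSE)
   | ```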
20 |
21 | Now that we have the CSV data in our R environment, we want to inspect it to see that the import process went as expected. There are two handy functions we can use for this: `str` will report on the structure of our data, and `head` will show us the first six rows of the data. We're looking to see that the data in R resembles what we see in the CSV file (which we can inspect in a spreadsheet program).
22 |
23 |
24 | ```{r}
25 | str(survey_data)
26 | head(survey_data)
27 | ```
28 |
29 | And we see that it looks pretty good. The output from `str` shows we have 72 observations (rows) and 1 variable (column), and our data is formatted as 'character' (indicated by 'chr'). The output from `head` shows there are no weird characters, in the first six rows at least. One detail we can change to improve usability is the column name: it's rather long and unwieldy, so let's shorten it. First we'll see exactly what it is, then we'll replace it:
30 |
31 |
32 | ```{r}
33 | names(survey_data) # inspect col names, wow so long and unreadable!
34 | names(survey_data) <- "skills_to_acquire" # replace
35 | names(survey_data) # check that the replacement worked as intended
36 | ```
37 |
38 | Now that we have the data in and we're confident that the import process went well, we can carry on with quantification.
39 |
40 | ### Key Points
41 |
42 | * Survey responses can be collected online and the data can be exported as a CSV file
43 | * CSV files are advantageous because they're not bound to certain programs
44 | * R easily imports CSV files
45 | * R has convenient functions for inspecting and modifying data after import
46 |
47 |
--------------------------------------------------------------------------------
/02-getting-tm.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | layout: lesson
3 | title: Creating a document-term matrix
4 | ---
5 |
6 | > ### Objectives
7 | > * Install and use the tm package for R
8 | > * Quantify the text by converting it to a document-term matrix
9 |
10 |
11 | With our data in R, we are now ready to begin manipulations and analyses. Note that we're not working directly on the CSV file, we're working on a copy of the data in R. This means that if we make a change to the data that we don't like, we can start over by importing the original CSV file again. Keeping the original data intact like this is good practice for ensuring a reproducible workflow.
12 |
13 | The basic install of R does not come with many useful functions for working with text. However, there is a very powerful contributed package called `tm` (for 'text mining') which is a collection of functions for working with text that we will use. We'll need to download this package and then make these functions available to our R session. You only need to do this once per computer, so if you've already downloaded the `tm` package onto your computer in the recent past you can skip that line, but don't skip `library(tm)`, that's necessary each time you open R:
14 |
15 |
16 | ```{r}
17 | # download the tm package from the web; you only need to do this once per
   | # computer. If you're running `install.packages` for the first time, you will
   | # be asked to select a CRAN mirror; I usually choose 0 - rstudio
   | install.packages("tm")
18 | library(tm) # make the functions available to our session
19 | ```
20 |
21 | One of the strengths of working with R is that there is often a lot of documentation, and often this contains examples of how to use the functions. The `tm` package is a reasonably good example of this and we can see the documentation using:
22 |
23 |
24 | ```{r}
25 | help(package = tm)
26 | ```
27 |
28 | From here we can browse the functions and access the vignettes which give a detailed guide to using the most important functions. You'll also find a lot of information on the Stack Overflow Q&A website under the `tm` tag (http://stackoverflow.com/questions/tagged/tm) that can be hard to find in the documentation (or is not there at all). We'll go directly on and continue working with the survey data by converting the free text into a document-term matrix. First we convert to a corpus, a data structure in the `tm` package for a collection of documents, holding the text in the same form that we read in from the CSV file. Then we convert to a document-term matrix, and then have a look to see that the number of documents in the document-term matrix matches the number of rows in our CSV.
29 |
30 |
31 | ```{r}
32 | my_corpus <- Corpus(DataframeSource(survey_data))
33 | my_corpus # inspect
34 | my_dtm <- DocumentTermMatrix(my_corpus)
35 | my_dtm # inspect
36 | ```
37 |
38 | We can also see the number of unique words in the data, referenced as `terms` in the document-term matrix. The value for `Maximal term length` tells us the number of characters in the longest word. In this case it's very long, usually a result of punctuation joining words together like 'document-term'. We'll do something about this in a moment.
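   | 
   | If you're curious, one quick way to see that longest 'word' for yourself (a small sketch using base R):
   | 
   | ```{r, eval=FALSE}
   | terms <- Terms(my_dtm)
   | terms[which.max(nchar(terms))] # show the longest term in the matrix
   | ```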
39 |
40 | We want to see the actual rows and columns of the document-term matrix to verify that the conversion went as expected, so we use the function `inspect`, and we can subset the document-term matrix to inspect certain rows and columns (since it can be unwieldy to scroll through the whole document-term matrix). The `inspect` function can also be used on `my_corpus` if you want to see how that looks.
41 |
42 |
43 | ```{r}
44 | inspect(my_dtm) # see rather too much to make sense of
45 | inspect(my_dtm[1:10, 1:10]) # see just the first 10 rows and columns
46 | ```
47 |
48 | The most striking detail here is that the matrix is mostly zeros; this is quite normal, since not every word will be in every response. We can also see a lot of punctuation stuck on the words that we don't want. In the next lesson we'll tackle those.
49 |
50 | ### Key Points
51 |
52 | * Key text analysis functions are in the contributed package tm
53 | * The tm package can be downloaded and installed using R
54 | * The documentation can be easily accessed using R
55 | * The free text can be easily converted to a document-term matrix
56 | * The document-term matrix needs some work before it's ready for analysis, for example removal of punctuation.
57 |
58 |
--------------------------------------------------------------------------------
/03-preparing-dtm.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | layout: lesson
3 | title: Preparing the document-term matrix
4 | ---
5 |
6 | > ### Objectives
7 | > * Prepare the document-term matrix for analysis
8 | > * Remove unwanted elements from the text
9 |
10 | Now that we've sailed through getting the data into R and into the document-term matrix format, we can see that there are a few problems that we need to deal with before we can get on with the analysis. When we inspected the document-term matrix we saw that the text is polluted with punctuation, and probably other things like sneaky white spaces that are hard to spot, and numbers (meaning digits). We want to remove all of that before we move on.
11 |
12 | But there are also some other things we should also remove. To get to the words that we're interested in, we can remove all the uninteresting words such as 'a', 'the', and so on. These uninteresting words are often called stopwords and we can delete them from our data to simplify and speed up our operations. We can also convert all the words to lower case, since upper and lower case have no semantic difference here (for the most part, an abbreviation is an obvious exception, if we were expecting those to be important we might skip the case conversion).
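   | 
   | The `tm` package ships with a built-in stopword list; a quick way to peek at it:
   | 
   | ```{r, eval=FALSE}
   | head(stopwords("english"), 10) # the first ten English stopwords
   | ```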
13 |
14 | The `tm` package contains convenient functions for removing these things from text data. There are many other related functions of course, such as stemming (which will reduce 'learning' and 'learned' to 'learn'), part-of-speech tagging (for example, to select only the nouns), weighting, and so on, that similarly clean and transform the data, but we'll stick with the simple ones for now. We can easily do this cleaning during the process of converting the corpus to a document-term matrix:
15 |
16 |
17 | ```{r}
18 | my_dtm <- DocumentTermMatrix(my_corpus,
19 | control = list(removePunctuation = TRUE,
20 | stripWhitespace = TRUE,
21 | removeNumbers = TRUE,
22 | stopwords = TRUE,
23 | tolower = TRUE,
24 | wordLengths=c(1,Inf)))
25 | # This last line with 'wordLengths' overrides the default minimum
26 | # word length of 3 characters. We're expecting a few important single
27 | # character words, so we set the function to keep those. Normally words
28 | # with <3 characters are uninteresting in most text mining applications
29 | my_dtm # inspect
30 | inspect(my_dtm[1:10, 1:10]) # inspect again
31 | ```
32 |
33 | And now we see that when we inspect the document-term matrix the punctuation has gone. We can also see that the number of terms has been reduced, and the `Maximal term length` value has also dropped. We can have a look through the words that are remaining after this data cleaning:
34 |
35 |
36 | ```{r}
37 | Terms(my_dtm)
38 | ```
39 |
40 | The majority of words look fine, but there are a few odd ones in there that we're left with after removing punctuation. We can deal with these by removing sparse terms, since these long words probably only occur once in the corpus. We'll use the function `removeSparseTerms`, which takes an argument between zero and one (e.g. 0.9999) for how sparse the resulting document-term matrix should be. Typically we'll need to make several passes at this to find a suitable sparsity value. Too close to zero and we'll have hardly any words left, so it's useful to experiment with a bunch of different values here (one way to do this is sketched after the next chunk).
41 |
42 |
43 | ```{r}
44 | my_dtm_sparse <- removeSparseTerms(my_dtm, 0.98)
45 | my_dtm_sparse # inspect
46 | ```
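   | 
   | One way to speed up that experimentation is a small loop over candidate sparsity values, a sketch:
   | 
   | ```{r, eval=FALSE}
   | for (s in c(0.90, 0.95, 0.98, 0.99)) {
   |   # nTerms reports how many terms survive at each sparsity value
   |   cat("sparsity", s, "->", nTerms(removeSparseTerms(my_dtm, s)), "terms\n")
   | }
   | ```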
47 |
48 | Now we've got a dataset that has been cleaned of most of the unwanted items such as punctuation, numerals and very common words that are of little interest. We're ready to learn something from the data!
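   | 
   | As an aside, we mentioned stemming earlier but didn't use it. If you'd like to try it, here's a minimal sketch (it assumes the SnowballC package is installed, which `tm` uses for stemming):
   | 
   | ```{r, eval=FALSE}
   | # rebuild the matrix with stemming switched on, so that
   | # 'learning' and 'learned' are both counted as 'learn'
   | my_dtm_stemmed <- DocumentTermMatrix(my_corpus,
   |                                      control = list(stemming = TRUE))
   | ```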
49 |
50 | ### Key Points
51 |
52 | * Text needs to be cleaned before analysis to remove uninteresting elements
53 | * The tm package has convenient functions for cleaning the text
54 | * Inspecting the output after each transformation helps to assess the effectiveness of the data cleaning, and some transformations need to be iterated over to suit the data
55 |
56 |
--------------------------------------------------------------------------------
/04-analysing-dtm.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | layout: lesson
3 | title: Analysing the document-term matrix
4 | ---
5 |
6 | > ### Objectives
7 | > * Analyse the document-term matrix to find most frequent words
8 | > * Identify associations between words
9 |
10 |
11 | Now we are at the point where we can start to extract useful new information from our data. The first step is to find the words that appear most frequently in our data and the second step is to find associations between words to help us understand how the words are being used. Since this is an exploratory data analysis we will need to repeat the analysis over and over with slight variations in the parameters until we get output that is interesting.
12 |
13 | To identify the most frequent words we can use the `findFreqTerms` function in `tm`. We could also convert the document-term matrix to a regular matrix and sort the matrix. However, this is problematic for larger document-term matrices, since the resulting matrix can be too big to store in memory (since the regular matrix has to allocate memory for each zero in the matrix, but the document-term matrix does not). Here we'll stick with using `findFreqTerms` since that's more versatile. The `lowfreq` argument specifies the minimum number of times the word should occur in the data:
14 |
15 |
16 | ```{r}
17 | findFreqTerms(my_dtm_sparse, lowfreq=5)
18 | ```
19 |
20 | We'll need to experiment with a bunch of different values for `lowfreq`, and even so, we might find that some of the words we're getting are uninteresting, such as 'able', 'also', 'use' and so on. One way to deal with those is to make a custom stopword list and run the remove stopwords function again. However, we can take a bit of a shortcut with stopwords. Instead of going back to the corpus, removing stopwords and numbers and so on, we can simply subset the columns of the document-term matrix using `[` to exclude the words we don't want any more. This `[` is a fundamental and widely used function in R that can be used to subset many kinds of data, not just text.
21 |
22 |
23 | ```{r}
24 | remove <- c("able", "also", "use", "like", "id", "better", "basic", "will", "something") # words we probably don't want in our data
25 | my_dtm_sparse_stopwords <- my_dtm_sparse[ , !(Terms(my_dtm_sparse) %in% remove) ] # subset to remove those words
26 | # when using [ to subset, the pattern is [ rows , columns ] so if we have a
27 | # function before the comma then we're operating on the rows, and after the
28 | # comma is operating on the columns. In this case we are operating on the columns.
29 | # The exclamation mark at the start of our column operation signifies
30 | # 'exclude', then we call the list of words in our data with
31 | # Terms(my_dtm_sparse), then we use %in% to literally mean 'in',
32 | # and then we provide our list of stopwords. So we're telling R to give us
33 | # all of the words in our data except those in our custom stopword list.
34 | ```
35 |
36 | Let's have a look at our frequent terms now and see what's left, some experimentation might still be needed with the `lowfreq` value:
37 |
38 |
39 | ```{r}
40 | findFreqTerms(my_dtm_sparse_stopwords, lowfreq = 5)
41 | ```
42 |
43 | That gives us a nice manageable set of words that are highly relevant to our main question of what skills people want to learn. We can squeeze a bit more information out of these words by looking to see what other words they are often used with. The function `findAssocs` is ideal for this, and like the other functions in this lesson, requires an iterative approach to get the most informative output. So let's experiment with different values of `corlimit` (the lower limit of the correlation value between our word of interest and the rest of the words in our data):
44 |
45 |
46 | ```{r}
47 | findAssocs(my_dtm_sparse_stopwords, 'stata', corlimit = 0.25)
48 | ```
49 |
50 | In this case we've learned that Stata seems to be what some people are currently using, and it is mentioned with SPSS, a sensible pairing as both are commercial software packages that are widely used in some research areas. But analysing these correlations one word at a time is tedious, so let's speed it up a bit by analysing a vector of words all at once. Let's do all of the high frequency words: first we'll store them in a vector object, then we'll pass that object to the `findAssocs` function (notice how we can drop the word `corlimit` in the `findAssocs` call? Since it only takes one numeric argument it's unambiguous, so we can just pop the number in there without a label):
51 |
52 |
53 | ```{r}
54 | # create a vector of the high-frequency words
55 | my_highfreqs <- findFreqTerms(my_dtm_sparse_stopwords, lowfreq = 5)
56 | # find words associated with each of the high frequency words
57 | my_highfreqs_assocs <- findAssocs(my_dtm_sparse_stopwords, my_highfreqs, 0.25)
58 | # have a look...
59 | my_highfreqs_assocs
60 | my_highfreqs_assocs$python
61 | ```
62 |
63 | This is where we can learn a lot about our respondents. For example 'learn' is associated with 'basics', 'linear' and 'modelling', indicating that our respondents are keen to learn more about linear modelling. 'Modelling' seems particularly frequent across many of the high frequency terms. We see 'visualization' associated with 'spatial' and 'modelling', and we see 'tools' associated with 'plot', suggesting a strong interest in learning about tools for data visualisation. 'R' is correlated with 'statistical', indicating that our respondents are aware of the unique strength of that programming language. And so on; now we have some quite specific insights into our respondents.
64 |
65 | So we've improved our understanding of our respondents nicely and we have a good sense of what's on their minds. The next step is to put these words in order of frequency so we know which are the most important in our survey, and visualise the data so we can easily see what's going on.
66 |
67 | ### Key Points
68 |
69 | * We can subset the document-term matrix using the `[` function to remove additional stopwords
70 | * Once stopwords are removed, the high-frequency words offer insights into the data
71 | * Analysing the words associated with the high-frequency words gives further insights into the context, meaning and use of the high-frequency words
72 | * We can automate the correlation analysis by passing a vector of words to the function, rather than just analysing one word at a time
73 |
74 |
--------------------------------------------------------------------------------
/05-visualising-dtm.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | layout: lesson
3 | title: Visualising the document-term matrix
4 | ---
5 |
6 | > ### Objectives
7 | > * Analyse the rank of frequencies
8 | > * Visualise the results with ggplot2
9 | > * Visualise correlations of words with Rgraphviz
10 |
11 |
12 | We have some useful new insights into our data, but we don't yet know which words are more frequent than others. Which are our respondents more interested in, R or Python? We also want to visualise our results so we can effectively communicate them to others. Let's make a table that includes the exact frequencies of our high-frequency words. This is a three-step operation: starting first in the middle of the function, we use `[` again to subset the document-term matrix to get only the columns for our high-frequency words; second, we use `as.matrix` to convert the document-term matrix to a regular matrix; third, we use `colSums` to compute the column sums to get the total count for each word across all the documents (survey respondents):
13 |
14 |
15 | ```{r}
16 | my_highfreqs_values <- colSums(as.matrix(my_dtm_sparse_stopwords[, my_highfreqs]))
17 | ```
18 |
19 | The output of this line is an object called a 'named number' which is not an ideal format for making a table or plotting, so we'll need to get it into a data frame, which is much more useful. We'll use the function `data.frame` and then separately extract the words and their frequencies into columns of the data frame, assigning column names at the same time (here they are 'word' and 'freq'):
20 |
21 |
22 | ```{r}
23 | my_highfreqs_values_df <- data.frame(word = names(my_highfreqs_values),
24 | freq = my_highfreqs_values)
25 | my_highfreqs_values_df # have a look
26 | ```
27 |
28 | Now we have a table, which is progress. If we have a glance up and down we can see which words occur most frequently. We can go a step further and sort the table to make it quicker to see the rank order of words in our data, once again using `[` with the addition of `order` to organise our data frame:
29 |
30 |
31 | ```{r}
32 | my_highfreqs_values_df <- my_highfreqs_values_df[with(my_highfreqs_values_df, order(freq)), ]
33 | # this may seem like a lot of typing for such a simple operation. If you're feeling brave
34 | # and want to type less, check out the 'arrange' function in the 'dplyr' package, it's
35 | # a lot less typing and quicker for large datasets.
36 | my_highfreqs_values_df # have a look
37 | ```
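   | 
   | As an aside, here's one way to save this sorted table to a CSV file for a report (the file name is just an example):
   | 
   | ```{r, eval=FALSE}
   | write.csv(my_highfreqs_values_df, "high_frequency_words.csv", row.names = FALSE)
   | ```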
38 |
39 | That's much better, now we've got something that we can easily adapt to include in a report, for example by using the `write.csv` function as sketched above to put the table into a spreadsheet file. We can see that 'data', 'analysis' and 'r' are the three most frequent words, indicating that these are the skills and tools our respondents are most interested in. Let's plot this so we can summarise our results in a high-impact visualisation. R comes with versatile built-in plotting functions, but we'll use a contributed package called `ggplot2` which produces publication-quality plots with minimum effort. We'll install the package first, assuming this is the first time you've used `ggplot2`. If you've already used it on your computer you can skip `install.packages`, but don't skip `library(ggplot2)`!
40 |
41 |
42 | ```{r}
43 | # get the package
44 | install.packages("ggplot2")
45 | # make code available to our session
46 | library(ggplot2)
47 | # draw the plot
48 | ggplot(my_highfreqs_values_df, aes(reorder(word, -freq), freq)) +
49 | geom_bar(stat = "identity") +
50 | theme_minimal() +
51 | xlab("high frequency words") +
52 | ylab("frequency") +
53 | theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.2))
54 | ```
55 |
56 | There's quite a bit going on in that `ggplot` function. Briefly, we specify the dataset and the columns to plot, then in the same line we reorder the data from the highest frequency word to the lowest (`ggplot` doesn't care about our previous ordering), then specify a bar plot with `geom_bar`, then adjust the background colour with `theme_minimal`, then customize the axis labels, and finally rotate and center the horizontal axis labels. The result is a clear picture of the data, and we see a strong interest in data analysis using R. Python is one of the lowest frequency words in this set, so if we were planning lessons for our respondents we now know that R is the language they're most familiar with and most interested to learn.
57 |
58 | Our second plot will visualise some of the correlations between the high-frequency words. We're using a package called `Rgraphviz` that is available in an online repository called [Bioconductor](http://www.bioconductor.org/). One of the great things about R is the huge number of user-contributed packages, but one of the downsides is they're scattered across a few different repositories. Most are on CRAN, followed by Bioconductor, and a small number on GitHub. Each repository requires a slightly different method for obtaining the package, as you can see in the code chunk below. Once we've got the Rgraphviz package ready, we can draw a cluster graph of our high frequency words and link them when they have a correlation of at least 0.15 (in this case). You'll need to experiment with different values of `corThreshold` to get a plot that gives meaningful information; too low and you'll get a mess of spaghetti lines, and too high and you'll get nothing. The vignette for Rgraphviz explains how to add colour to the plot and many other customisations, but we'll stick with the basic plot here:
59 |
60 |
61 | ```{r}
62 | source("http://bioconductor.org/biocLite.R")
63 | biocLite("Rgraphviz")
64 | library(Rgraphviz)
65 |
66 | plot(my_dtm_sparse_stopwords,
67 | terms = findFreqTerms(my_dtm_sparse_stopwords,
68 | lowfreq = 5),
69 | corThreshold = 0.15)
70 | ```
71 |
72 | We can quickly see the centrality of 'learn', and how words like 'stata' and 'using' are correlated with many other words. 'Databases' is left unconnected, indicating that there isn't much of a pattern in its appearance in the survey responses - perhaps our respondents were not sure what exactly a database is and what it's used for, or they had such different ideas about databases that there is no strong correlation with other words.
73 |
74 | That wraps up this introductory lesson on text mining survey responses with R. We've gone through some of the key tasks of getting data into R, cleaning and transforming it using specialised packages, doing quantitative analyses and finally creating visualisations. We've also learned a lot about common functions and quirks when using R. The value of this exercise comes from the new insights we've gathered from our data that were not obvious from a casual inspection of the raw data file. From here you have a solid foundation for text mining with R, ready to apply to all kinds of other data and research questions. You shouldn't hesitate to augment what you've learnt here with the help built into R and the packages, online tutorials, and your imaginative experiments at the command line.
75 |
76 |
77 | ### Key Points
78 |
79 | * Data can be rearranged and reordered to make it more readable and easier to extract insights
80 | * Visualising the word frequencies with a simple bar plot gives a clear picture of the most frequent words
81 | * A cluster plot helps us see correlations between words and better understand patterns in word use in our data
82 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing New Material
2 |
3 | Data Carpentry is an open source project, and we welcome contributions of all
4 | kinds: new and improved lessons, bug reports, and small fixes to existing
5 | material are all useful.
6 |
7 | By contributing, you are agreeing that Data Carpentry may redistribute your work
8 | under [these licenses](LICENSE.md).
9 |
10 |
11 | **Table of Contents**
12 |
13 | * [Working With GitHub](#working-with-github)
14 | * [Locations and Formats](#locations-and-formats)
15 | * [FAQ](#faq)
16 |
17 |
18 | ## Working With GitHub
19 |
20 | 1. Fork the `datacarpentry/textmining-socialsci` repository on GitHub.
21 |
22 | 2. Clone that repository to your own machine.
23 |
24 | 3. Create a branch from `master` for your changes.
25 | Give your branch a meaningful name,
26 | such as `fixing-typos-in-novice-shell-lesson`
27 | or `adding-tutorial-on-visualization`.
28 |
29 | 4. Make your changes, commit them, and push them to your repository on GitHub.
30 |
31 | 5. Send a pull request to the `master` branch of the lesson
32 | repository at http://github.com/datacarpentry/textmining-socialsci.
33 |
34 | If it is easier for you to send them to us some other way,
35 | please mail us at [board@datacarpentry.org](mailto:board@datacarpentry.org).
36 | Given a choice between you creating content or wrestling with Git,
37 | we'd rather have you doing the former.
38 |
39 |
40 | ## Locations and Formats
41 |
42 | Lessons may be written in R Markdown, Markdown, as IPython Notebooks, or in other formats.
43 | However, Jekyll (the tool GitHub uses to create websites) only knows how to handle Markdown and HTML. If some other format is used, the author of the lesson must add the generated Markdown
44 | to the repository. This ensures that people who *aren't* familiar with some
45 | format don't have to install the tools needed to work with it (e.g., R
46 | programmers don't have to install the IPython Notebook).
47 |
48 | > If a lesson is in a format we don't already handle, the author must also add
49 | > something to the Makefile to re-create the Markdown from the source. Please
50 | > check with us if you plan to do this.
51 |
52 |
53 | ## Formatting of the material
54 |
55 | To ensure a consistent formatting of the lessons, we recommend the following
56 | guidelines:
57 | * No trailing white space
58 | * Wrap lines at 80 characters (unless it breaks URLs)
59 | * Use unclosed atx style headers (see below)
60 |
61 | ## FAQ
62 |
63 | * *Where can I get help?*
64 |
65 |   Mail us at [board@datacarpentry.org](mailto:board@datacarpentry.org),
66 |   or come chat with us on [our IRC channel](irc://moznet/sciencelab).
67 |
--------------------------------------------------------------------------------
/CONTRIBUTORS.md:
--------------------------------------------------------------------------------
1 | Materials have been developed and adapted by many contributors and were originally adapted from Software Carpentry materials.
2 |
3 | The first workshop was run at NESCent on May 8-9, 2014 with the development and
4 | instruction of lessons by Karen Cranston, Hilmar Lapp, Tracy Teal and Ethan White and contributions from Deb Paul and Mike Smorul.
5 |
6 | ## Data
7 |
8 | ### Biology
9 | Data is from the paper S. K. Morgan Ernest, Thomas J. Valone, and James H. Brown. 2009. Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA. Ecology 90:1708.
10 |
11 | http://esapubs.org/archive/ecol/E090/118/
12 |
13 | Excel data is from the paper Bahlai, C.A., Schaafsma, A.W., Lagos, D., Voegtlin, D., Smith, J.L., Welsman, J.A., Xue, Y., DiFonzo, C., Hallett, R.H., 2014. Factors inducing migratory forms of soybean aphid and an examination of North American spatial dynamics of this species in the context of migratory behavior. Agriculture and Forest Entomology. 16, 240-250.
14 |
15 | http://onlinelibrary.wiley.com/doi/10.1111/afe.12051/full
16 |
17 | Master_suction_trap_data_list_uncleaned.csv is a pre-cleaning version of a publicly available dataset by David Voegtlin, Doris Lagos, Douglas Landis and Christie Bahlai, available at http://lter.kbs.msu.edu/datatables/122
18 |
19 | ## Lessons
20 |
21 | ### shell
22 |
23 | Original materials adapted from Software Carpentry shell lessons by Greg Wilson
24 | with contributions from Ethan White, Jens vdL, Raniere Silva, Meg Stanton, Amy
25 | Brown and Doug Latornell.
26 |
27 | ### SQL
28 |
29 | Original materials adapted from Software Carpentry SQL lessons for ecologists by
30 | Ethan White, which were adapted from Greg Wilson's original SQL lessons.
31 |
32 | ### R materials
33 | Original materials adapted from SWC Python lessons by Sarah Supp.
34 | John Blischak led the continued development of materials with contributions
35 | from Gavin Simpson, Tracy Teal, Greg Wilson, Diego Barneche, Stephen Turner and Karthik Ram.
36 |
37 | ### Spreadsheet Best Practices
38 | Original materials adapted from [Practical Data Management for Bug Counters](http://practicaldatamanagement.wordpress.com/) by Christie Bahlai
39 | Christie Bahlai and Aleksandra Pawlik led the continued development of materials with contributions
40 | from Jennifer Bryan, Alexander Duryee, Jeffrey Hollister, Daisie Huang, Owen Jones, and Ben Marwick
41 |
--------------------------------------------------------------------------------
/DataCarpentry_overview_slides.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/textmining-socialsci-ARCHIVED/457189d43fb9338c238b5b775ae9cb4a2613b189/DataCarpentry_overview_slides.pptx
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | ### Instructional Material
2 |
3 | All Data Carpentry instructional material is made available under
4 | the Creative Commons Attribution license. You are free:
5 |
6 | * to **Share**---to copy, distribute and transmit the work
7 | * to **Remix**---to adapt the work
8 |
9 | Under the following conditions:
10 |
11 | * **Attribution**---You must attribute the work using "Copyright (c)
12 | Data Carpentry" (but not in any way that suggests that we
13 | endorse you or your use of the work). Where practical, you must
14 | also include a hyperlink to http://datacarpentry.org.
15 |
16 | With the understanding that:
17 |
18 | * **Waiver**---Any of the above conditions can be waived if you get
19 | permission from the copyright holder.
20 | * **Other Rights**---In no way are any of the following rights
21 | affected by the license:
22 | * Your fair dealing or fair use rights;
23 | * The author's moral rights;
24 | * Rights other persons may have either in the work itself or in
25 |     how the work is used, such as publicity or privacy rights.
26 | * **Notice**---For any reuse or distribution, you must make clear to
27 | others the license terms of this work. The best way to do this is
28 | with a link to
29 | [http://creativecommons.org/licenses/by/3.0/](http://creativecommons.org/licenses/by/3.0/).
30 |
31 | For the full legal text of this license, please see
32 | [http://creativecommons.org/licenses/by/3.0/legalcode](http://creativecommons.org/licenses/by/3.0/legalcode).
33 |
34 | ### Software
35 |
36 | Except where otherwise noted, the example programs and other software
37 | provided by Data Carpentry are made available under the
38 | [OSI](http://opensource.org)-approved
39 | [MIT license](http://opensource.org/licenses/mit-license.html).
40 |
41 | Permission is hereby granted, free of charge, to any person obtaining
42 | a copy of this software and associated documentation files (the
43 | "Software"), to deal in the Software without restriction, including
44 | without limitation the rights to use, copy, modify, merge, publish,
45 | distribute, sublicense, and/or sell copies of the Software, and to
46 | permit persons to whom the Software is furnished to do so, subject to
47 | the following conditions:
48 |
49 | The above copyright notice and this permission notice shall be
50 | included in all copies or substantial portions of the Software.
51 |
52 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
53 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
54 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
55 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
56 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
57 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
58 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
59 |
--------------------------------------------------------------------------------
/img/r_starting_how_it_should_like.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/textmining-socialsci-ARCHIVED/457189d43fb9338c238b5b775ae9cb4a2613b189/img/r_starting_how_it_should_like.png
--------------------------------------------------------------------------------
/survey_data.csv:
--------------------------------------------------------------------------------
1 | "Briefly list 1-2 specific skills relating to data and
2 | software you'd like to acquire and apply to your
3 | research (eg. types of analyses you'd like to use,
4 | software tools you'd like to work with)"
5 |
6 | "How to apply tests of significance to my findings
7 | General coding best practices for legibility and reproducibility "
8 | Just need to increase comfort with core concepts (why researchers would need to do some programming) in order to assist researchers with data curation. Not an active researcher myself.
9 | ?
10 | "Programming in R and Python (little to no experience at this time)
11 | Advanced commands in Stata, particularly with regards to visualizing data
12 | GIS"
13 | "I'd like to learn more programming languages (currently I know Stata and a teeny bit of R and SQL). Potentially also survey data collection and analysis, and it's always good to go over more statistics, conceptually and pragmatically."
14 | Learn the basics of browsing and manipulating data using Stata.
15 | "A better understanding of database tools and structure, geo-spatial software and data, and R."
16 | Not sure.
17 | "Knowing how to use Stata, SPSS, Drupal, and even Excel more effectively. How to manage large amounts of data and manipulate them to draw out specific patterns or variables. A background in broader concepts around digital humanities and ""big data"" would be useful, too."
18 | content management tools such as Plone
19 | I'd like to learn the basics of the tools that researchers on campus are using most frequently to analyze & manage their data.
20 | "Would like to know how to customize software more to answer my specific questions.
21 |
22 | Would like to know how to scrape data off the web and move from there."
23 | I would be happy to learn any basics of the software listed in question #9 and which ones are most useful for specific data use/manipulation needs.
24 | My researchers use R a lot for ecological analysis and I'd like to understand that better.
25 | Gis
26 | Data organization and data presentation. (Beyond Catalyst and Excel)
27 | making the link between databases and web-based displays of info
28 | "linear regression;
29 | R,
30 | stata"
31 | "I almost exclusively use R, and I'm familiar with the basics of shiny, and quite familiar with knitr, but I'd like to gain familiarity with CSS and incorporating other javascript libraries into my R workflow, and learn about the RCharts package."
32 | "natural language processing, data visualization"
33 | Time series analysis
34 | "Please don't make these fields ""required"""
35 | I would like to know basic statistics to understand what papers are saying.
36 | "Tableau / visualization tools
37 | Use existing spatial maps to plot my data (e.g. use Police Department beat-area maps to analyze my spatial data contained in the license plate reader databases."
38 | "Would like to be able to analyze data in the diary text eg names, places, sites etc. Currently using XML/TEI, XSLT, mediawiki, omeka, wordpress"
39 | "I know a little GIS, but would like to be able to do more spatial analysis. I can use ArcGIS at a very basic level, but would like to be able to take advantage of more advanced capabilities. "
40 | "I'd like to become more competent with statistical software, such as R, and with data management."
41 | "gis, web page design and programming"
42 | I would like to learn R (I am currently using SPSS and have also used SAS and Stata) and explore data visualization software.
43 | All of the above
44 | I don't know enough about different systems of analysis to be able to answer this question. An easier software or tool to group data (other than those provided in excel) would be great!
45 | I do not know enough to answer.
46 | "I'd like to learn more about how to program, how to design and build surveys and how to use R for qualitative and quantitative research."
47 | "Data visualization, textual analysis and network analysis"
48 | I don't know much except for what my research classes have pointed out. I haven't done programming except this year when I used SPSS. It was confusing and I could CERTAINLY use more instruction in order to do it; luckily I had help in class. But it would be better if I learned how to program myself.
49 | "I have some experience with statistical analysis software (SPSS and Stata), but would like to expand that and develop some basic knowledge of other software or programming language that would expand my analysis capacity."
50 | "Data management in R and using R to call specialized programs (e.g., Mplus) and output results as tables and plots."
51 | I would like a basic introduction to other data analysis programs in order to see which one most closely aligns with my needs and skill level. Analysis would be identifying moderators and mediators of primary clinical outcomes within and RCT.
52 | "I think that knowledge of R would be beneficial to allowing more flexibility in my models. I think it will also be helpful in my career, as a growing number of people are using R for running their statistical models. I also want to become more competent at building and managing databases."
53 | "Mostly basic database skills and management -- data normalization, relational table design, Structured Query Language, etc. Other possible interests: topic modeling, basics/philosophy of GIS, other forms of digital data visualization, and I guess at some point I should learn something about TEI."
54 | Revisit Java or object orientated design.
55 | More efficient full-text search and multi-threading for processing larger lets of data. Information on working with spreading jobs across multiple machines in the cluster would also be interesting.
56 | Best practices for accessing and analyzing data. Better facility with analysis and scripting tools.
57 | I'd like to learn more about using Microsoft Excel and SPSS for data analysis. I also need to learn how to use software that allows me to analyze data through hierarchical linear modeling.
58 | " hard to say this early in my research- but am generally interested in making legible databases (like mysql) as well as online environments like ruby on rails, php, etc."
59 | I don't know where to start!
60 | "Anthropac for Mac, visual representations of qualitative data"
61 | I have some knowledge of Atlas.ti ethnographic coding software and would like to improve it.
62 | Great facility with quantitative software
63 | learning which methods to use for different purposes
64 | I would like to become more familiar with Bayesian analysis.
65 | "Long-term: natural language processing, graph theory
66 | Shorter-term: python scripting, building models"
67 | "tools for web interactions--PHP, Python
68 | tools for design interactions--Axure, etc."
69 | Knowing what database mgmt. sys and scripting lang. will be best going forward. Whatever programming tools are needed for this purpose. Including software for appropriate for tabulating frequency/occurances of syntax patterns under varying conditions (space/time/rhyme scheme) & display of same. Mapping software to plot
70 | "??
71 |
72 | Re: 9: I programmed in Fortran for SPSSx as a graduate student; would not object to a refresher and update on how best to handle intersection of qual data with social survey data."
73 | "More SQL. In Python, getting more experience tools like Numpy/Scikit-learn."
74 | I would like to learn to use both SPSS and to do programing for
75 | AWS
76 | Data analysis/spatial analysis.
77 | I'd like to understand something about Flash because many of the literary works I work with have been created using this tool.
78 | "database building and management
79 | language for conducting search queries in said databases"
80 | Data scraping of social media sites and social computing systems such as Wikipedia
81 | I'd like to work with machine learning techniques.
82 | "Interesting in being able to analyze transcripts as well as prose from books and websites, both for theme and linguistic structures. Also, video analysis."
83 | "Visualizations
84 | Qualitative analysis
85 | "
86 | "more statistical analysis with R, multilevel analysis"
87 | Python libraries and tools for ML and Viz
88 | "qualitative software that is more user friendly, more manipulatable to my needs."
89 | "multi-variate regression analyses
90 | R and/or SPSS"
91 | Coding in SAS or R for data analysis
92 | R
93 | "Better understanding of topic modelling, analysis of statistics, analysis of brain data."
94 |
--------------------------------------------------------------------------------
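Note that `survey_data.csv` above stores the question text as a quoted, multi-line header row, and several responses are themselves quoted multi-line fields containing escaped double quotes (`""`). As a minimal sketch (not part of the repository) of how this file might be read into R for the lessons, assuming it sits in the working directory:

```r
# Read the survey file; read.csv() handles the quoted multi-line fields
# and the "" escaping visible in the raw file above
survey <- read.csv("survey_data.csv", stringsAsFactors = FALSE)

# The file holds a single column of free-text responses;
# drop rows that are empty or whitespace-only
responses <- survey[[1]]
responses <- responses[nchar(trimws(responses)) > 0]

# Peek at the first few responses
head(responses)
```

Because the multi-line fields are quoted, `read.csv()` keeps each response as a single row; reading the file line-by-line (e.g. with `readLines()`) would instead split those responses apart.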
/textmining-socialsci.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: Default
4 | SaveWorkspace: Default
5 | AlwaysSaveHistory: Default
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 |
--------------------------------------------------------------------------------